Re Sorting really BIG files the Java source code

home | about | contact | privacy statement

Re: Sorting really BIG files - the Java source code

Posted by Sam on May 14, 2007 at 08:01 AM UTC - 5 hrs

Due to a couple of requests in the comments about Sorting really BIG files, I'm posting the Java source code that does the job.

First, we have the main externalSort algorithm. It expects parameters relation (the name of a table, which in this case is the .csv file to sort without the ".csv" in the file name) and the column to sort on (it checks the first row of the .csv file for the value passed). Now, of course that is easy enough to change to fit your own needs - it worked fine for me and my CSVs.

There is a magic number in there that you might want to pull out: 10000. That is the number of rows to read at a time. Clearly, ten thousand rows is way smaller than we would want to sort on - depending on how many characters are in each row, we could probably quite easily fit 10 million rows into memory. But, I didn't want to spend time generating 100 million row files to test this on - so I just tried it with 100 thousand or so. For a real use scenario, you'd probably want to go with something bigger - if not even calculate the size and average row length.

private static void externalSort(String relation, String attribute)

{

     try

     {

         FileReader intialRelationInput = new FileReader(relation + ".csv"); 

         BufferedReader initRelationReader = new BufferedReader(intialRelationInput);

         String [] header = initRelationReader.readLine().split(",");

         String [] row = header;

         int indexToCompare = getIndexForColumn(header,attribute);

         ArrayList<Integer[]> tenKRows = new ArrayList<Integer[]>();

         int numFiles = 0;

         while (row!=null)

         {

             // get 10k rows

             for(int i=0; i<10000; i++)

             {

                 String line = initRelationReader.readLine();

                 if (line==null) 

                 {

                     row = null;

                     break;

                 }

                 row = line.split(",");

                 tenKRows.add(getIntsFromStringArray(row));

             }

             // sort the rows

             tenKRows = mergeSort(tenKRows, indexToCompare);

             // write to disk

             FileWriter fw = new FileWriter(relation + "_chunk" + numFiles + ".csv");

             BufferedWriter bw = new BufferedWriter(fw);

             bw.write(flattenArray(header,",")+"\n");

             for(int i=0; i<tenKRows.size(); i++)

             {

                 bw.append(flattenArray(tenKRows.get(i),",")+"\n");

             }

             bw.close();

             numFiles++;

             tenKRows.clear();

         }

         mergeFiles(relation, numFiles, indexToCompare);

         initRelationReader.close();

         intialRelationInput.close();

     }

     catch (Exception ex)

     {

         ex.printStackTrace();

         System.exit(-1);

     }

}

Now, there were a couple of functions that the externalSort() method depended on. First we see the getIndexForColumn. This method just wraps code that finds the index where the value of attribute occurs in the array header. That's easy enough to write, so I won't add to the clutter and post it here.

Next, we have getIntsFromStringArray. This is also simple and just uses parseInt on each index in the array and returns a new one for ints. I'm assuming these are all integers, so you may need to change that line to deal with Strings or other data types.

Then we have the mergeSort algorithm, which is shown below. flattenArray is also omitted here because it simply takes an array, puts a comma between each element, and returns a String. Finally, mergeFiles is also shown below.

// sort an arrayList of arrays based on the ith column

private static ArrayList<Integer[]> mergeSort(ArrayList<Integer[]> arr, int index)

{

     ArrayList<Integer[]> left = new ArrayList<Integer[]>();

     ArrayList<Integer[]> right = new ArrayList<Integer[]>();

     if(arr.size()<=1)

         return arr;

     else

     {

         int middle = arr.size()/2;

         for (int i = 0; i<middle; i++)

             left.add(arr.get(i));

         for (int j = middle; j<arr.size(); j++)

             right.add(arr.get(j));

         left = mergeSort(left, index);

         right = mergeSort(right, index);

         return merge(left, right, index);

     }

}

// merge the the results for mergeSort back together 

private static ArrayList<Integer[]> merge(ArrayList<Integer[]> left, ArrayList<Integer[]> right, int index)

{

     ArrayList<Integer[]> result = new ArrayList<Integer[]>();

     while (left.size() > 0 && right.size() > 0)

     {

         if(left.get(0)[index] <= right.get(0)[index])

         {

             result.add(left.get(0));

             left.remove(0);

         }

         else

         {

             result.add(right.get(0));

             right.remove(0);

         }

     }

     if (left.size()>0) 

     {

         for(int i=0; i<left.size(); i++)

             result.add(left.get(i));

     }

     if (right.size()>0) 

     {

         for(int i=0; i<right.size(); i++)

             result.add(right.get(i));

     }

     return result;

}

There is nothing special about the mergeSort and merge methods above. They simply follow the well known merge sort algorithm, accounting for the fact that we are sorting on one element in an array of arrays, rather that an array with only one dimension.

Finally, we have the mergeFiles function, which simply reads all the files, one row at a time, and puts the smallest row at the top of the sorted file.

private static void mergeFiles(String relation, int numFiles, int compareIndex)

{

     try

     {

         ArrayList<FileReader> mergefr = new ArrayList<FileReader>();

         ArrayList<BufferedReader> mergefbr = new ArrayList<BufferedReader>();

         ArrayList<Integer[]> filerows = new ArrayList<Integer[]>(); 

         FileWriter fw = new FileWriter(relation + "_sorted.csv");

         BufferedWriter bw = new BufferedWriter(fw);

         String [] header;

         boolean someFileStillHasRows = false;

         for (int i=0; i<numFiles; i++)

         {

             mergefr.add(new FileReader(relation+"_chunk"+i+".csv"));

             mergefbr.add(new BufferedReader(mergefr.get(i)));

             // get each one past the header

             header = mergefbr.get(i).readLine().split(",");

             if (i==0) bw.write(flattenArray(header,",")+"\n");

             // get the first row

             String line = mergefbr.get(i).readLine();

             if (line != null)

             {

                 filerows.add(getIntsFromStringArray(line.split(",")));

                 someFileStillHasRows = true;

             }

             else 

             {

                 filerows.add(null);

             }

         }

         Integer[] row;

         int cnt = 0;

         while (someFileStillHasRows)

         {

             Integer min;

             int minIndex = 0;

             row = filerows.get(0);

             if (row!=null) {

                 min = row[compareIndex];

                 minIndex = 0;

             }

             else {

                 min = null;

                 minIndex = -1;

             }

             // check which one is min

             for(int i=1; i<filerows.size(); i++)

             {

                 row = filerows.get(i);

                 if (min!=null) {

                     if(row!=null && row[compareIndex] < min)

                     {

                         minIndex = i;

                         min = filerows.get(i)[compareIndex];

                     }

                 }

                 else

                 {

                     if(row!=null)

                     {

                         min = row[compareIndex];

                         minIndex = i;

                     }

                 }

             }

             if (minIndex < 0) {

                 someFileStillHasRows=false;

             }

             else

             {

                 // write to the sorted file

                 bw.append(flattenArray(filerows.get(minIndex),",")+"\n");

                 // get another row from the file that had the min

                 String line = mergefbr.get(minIndex).readLine();

                 if (line != null)

                 {

                     filerows.set(minIndex,getIntsFromStringArray(line.split(",")));

                 }

                 else 

                 {

                     filerows.set(minIndex,null);

                 }

             }                                 

             // check if one still has rows

             for(int i=0; i<filerows.size(); i++)

             {

                 someFileStillHasRows = false;

                 if(filerows.get(i)!=null) 

                 {

                     if (minIndex < 0) 

                     {

                         puts ("mindex lt 0 and found row not null" + flattenArray(filerows.get(i)," "));

                         System.exit(-1);

                     }

                     someFileStillHasRows = true;

                     break;

                 }

             }

             // check the actual files one more time

             if (!someFileStillHasRows)

             {

                 //write the last one not covered above

                 for(int i=0; i<filerows.size(); i++)

                 {

                     if (filerows.get(i) == null)

                     {

                         String line = mergefbr.get(i).readLine();

                         if (line!=null) 

                         {

                             someFileStillHasRows=true;

                             filerows.set(i,getIntsFromStringArray(line.split(",")));

                         }

                     }

                 }

             }

         }

         // close all the files

         bw.close();

         fw.close();

         for(int i=0; i<mergefbr.size(); i++)

             mergefbr.get(i).close();

         for(int i=0; i<mergefr.size(); i++)

             mergefr.get(i).close();

     }

     catch (Exception ex)

     {

         ex.printStackTrace();

         System.exit(-1);

     }

}

Ideally, you'd want to do better abstraction on a lot of this code and put it into different methods. Like Ben Nadel did in his ColdFusion/Java version of this, I'm keeping it inline so you don't need to track over to a function to see exactly what it does (though a good name would certainly tell you without the details - which is what abstraction is about!). In particular, I'd move all the file-specific operations into abstractions which better describe what they're doing in this particular domain. It would really help to better understand what's going on in the mergeFiles method at a glance, rather than reading all the details.

Questions? Comments?

Hey! Why don't you make your life easier and subscribe to the full post or short blurb RSS feed? I'm so confident you'll love my smelly pasta plate wisdom that I'm offering a no-strings-attached, lifetime money back guarantee!

Last modified on May 14, 2007 at 08:05 AM UTC - 5 hrs

Comments

Leave a comment

can you please send me the full code with all the methods?
tomerab@hotmail.com

Posted by Tomer on May 20, 2007 at 03:02 PM UTC - 5 hrs

Are you missing the code for the method "getIndexForColumn" :) ?

A part this, Great Work!

Posted by Andrew on Jun 13, 2007 at 02:08 PM UTC - 5 hrs

Hi Andrew - indeed, I left it out because it didn't add much value for the clutter it provided. That method takes a string array and a string, and just returns the index where it found the value. Basically, something like (I haven't tested this, but I think you'll get the idea):

private int getIndexForColumn(String [] arr, String val)
{
    int result = -1;
    for(int i=0; i<arr.length; i++)
    {
       if (val==arr[i]) { result=i; break; }
    }
    return result;
}

Posted by Sam on Jun 14, 2007 at 12:14 PM UTC - 5 hrs

Hello,
Can you please explain, how is the implementation of "getIntsFromStringArray" method,

"private static Integer[] getIntsFromStringArray(String[] row)"

Posted by Rohidas on Jun 19, 2007 at 05:57 AM UTC - 5 hrs

If I understod correctly this "getIntsFromStringArray()" method only deal with the integer datatype if I am using all the datatypes then how to deal with this method.will it affect the sorting techniques.

Posted by Rohidas on Jun 19, 2007 at 07:33 AM UTC - 5 hrs

Well, I think you can really only sort on one data type per column, or else it has to be a string. You might be able to think of some way around that, but I don't think it follows that dates are less than or greater than strings or numbers or ...

The input is expected to be in string format, so if you want to sort on strings, just leave out the getIntsFromStringArray(). If you want to sort on doubles, floats, dates, or whatever, you just need to write a function that converts the string representation to the correct data type. That's all getIntsFromStringArray does.

Basically, it does this (not tested, so you'll need to make it work if it doesn't):

private int[] getIntsFromStringArray(String[] arr)
{
   int[] result = new int[arr.length];
   for (int i=0; i<arr.length; i++) {
     result[i]=Integer.parseInt(arr[i]);
   }
   return result;
}

Of course, you'll probably want to do some error checking to make sure it doesn't crash if arr[i] is not able to be converted to an int.

Posted by Sam on Jun 19, 2007 at 12:41 PM UTC - 5 hrs

who knows how to sort date using quick sort???? send it to my email add..... blueglenn3@yahoo.com

Posted by glenn on Jul 08, 2007 at 11:05 PM UTC - 5 hrs

Hi Glenn. Assuming the date is originally in string format, you'll need to parse it to extract a Date or (more likely) Calendar object. Then sort as normal, using comparisons on the dates.

Posted by Sam on Jul 08, 2007 at 11:13 PM UTC - 5 hrs

any source code for date pal???? the date sorting?????

Posted by glenn on Jul 10, 2007 at 07:46 AM UTC - 5 hrs

input int and then display string u know that ????

Posted by glenn on Jul 12, 2007 at 12:13 AM UTC - 5 hrs

Sorry, I don't understand what you are asking.

Posted by Sam on Jul 12, 2007 at 06:26 AM UTC - 5 hrs

a prgram that you can input integer then it outputs a string like for example input 1 then the computer outputs "one" a string pal. from int it become a string you know that??????

Posted by glenn on Jul 13, 2007 at 11:09 AM UTC - 5 hrs

How about if(input==1) output("one") ... and so forth?

Posted by Sam on Jul 13, 2007 at 12:36 PM UTC - 5 hrs

that's to long pal im looking for an altenative for that hehe... how about the user inputs 2,000,000 do i need to write a code until 2 million " if(input==2,000,000)output("two million") it so long and hard to write can i use a loop for that??/

Posted by glenn on Jul 13, 2007 at 11:43 PM UTC - 5 hrs

Oh I see. I was thinking it might output "two zero zero..."

I suppose you could do it up to 15, then it becomes x+"teen" for 5<x<20. Then "twenty"+x for x betwwen 1 and 9, and so on for each twenty, thirty, forty, ... ninety. Then hundreds, and so on. Basically how we learn to count - we only need to know 1-15, 20, 30, 40, 50, 60, 70, 80, 90, 100, 1000, 1 million, 1 billion, 1 trillion, and so on.

Of course, you cannot encode all of it, but you could stop at some really high number and just call it one-thousand-trillion instead of the next highest word for it.

So really there is only 28 cases if you did that. That's still a lot of cases, so you might look to replace the conditionals with a table lookup.

Posted by Sam on Jul 14, 2007 at 07:34 AM UTC - 5 hrs

thanks pal somewhat i can see the logic.... Are you a IT graduate?? if not what is your course.....

Posted by glenn on Jul 14, 2007 at 10:19 AM UTC - 5 hrs

Yes, I'm working on my masters in computer science at the moment.

Posted by Sam on Jul 14, 2007 at 10:45 AM UTC - 5 hrs

Hi Sam,
I have problem about this.I have 780mb text file and I must sort it with only 256 mb ram and it must be in a short time.
How can I do this?
Txt file web site:
http://kdd.ics.uci.edu/databases/kddcup99/kddcup99...
main file:
http://kdd.ics.uci.edu/databases/kddcup99/kddcup.d...
test file (about %10 of main file)
http://kdd.ics.uci.edu/databases/kddcup99/kddcup.d...

Best regards,
adios_lepido07@hotmail.com

Posted by Tigris07 on Nov 02, 2007 at 02:46 AM UTC - 5 hrs

Tigris,

The above should work just fine - you might need to tailor some things to get it to work, but it should work fine.

If you'd like an overview of the algorithm in pseudocode, see http://www.codeodor.com/index.cfm/2007/5/10/Sortin...

Posted by Sam on Nov 02, 2007 at 05:40 AM UTC - 5 hrs

Thanks Sam

I started write code, I'm a bit slow.
What is flattenArray's job in the code.
Can you write the whole code?

Special Thanks again.

Posted by Tigris07 on Nov 05, 2007 at 04:06 AM UTC - 5 hrs

Flatten array just takes an array and joins each element, separated by the 2nd parameter, into a string. Something like:

String flattenArray(Array a, String separator)
{

String result = "";
for(i=0; i<a.length; i++)
result+=a[i] + separator;
return result;
}

Posted by Sam on Nov 05, 2007 at 06:05 AM UTC - 5 hrs

Unix sort provides for out-of-memory sort, but it's been lacking in the Java world. Thanks.

Would you be willing to contribute your code to the public domain so lots of people can use it?

Posted by George on Nov 30, 2007 at 09:13 AM UTC - 5 hrs

Bugfix: getIndexForColumn()

Instead of testing for pointer equality with "val == arr[i]" it should test for string equality with "val.equals(arr[i])".

Optimization: mergeFiles()

For each line, it performs a linear search to determine which file has the minimum line. And we're facing files with trillions of lines and probably many files to merge. Instead, it'd be good to use a heap. Lotta code to write in the world.

-George

Posted by George on Nov 30, 2007 at 09:42 AM UTC - 5 hrs

Hi George, thanks for pointing those things out -- I often forget the equality and normally catch it in the test, but for the comment I didn't look at the original code, just wrote what I thought it would be, so I never ever ran it, much less tested it =).

Re: opening it up to public domain: I'm happy to let whoever wants to use it use it, but to make it completely useful I'd need to reconstruct it from this post as the computer it resided on is no longer with me. That's ok though, because it was all just in the main class, so it'd need to be rewritten to make it useful and I'd have to add sorting for other data types as well.

So, consider it public domain, but I wouldn't want to actually start a project because I wouldn't be available to fix it to get it to a releasable state. Now if /you/ want to do all that, I certainly have no qualms about it. =)

Thanks again for the important comments!

Posted by Sammy Larbi on Nov 30, 2007 at 10:28 AM UTC - 5 hrs

Of course, if someone were to do that and base it on the code here, I wouldn't particularly mind if they gave credit to where it came from. =)

Posted by Sammy Larbi on Nov 30, 2007 at 10:30 AM UTC - 5 hrs

how to sort the result that i get from the file. example:

in the file contain data as below:
----------------------------------
uname1,no1
uname2,no2

the result display as in the file, but how to produce the data randomly for example:
-----------------------------------------------------------

uname2,no2
uname1,no1

or

uname1,no2
uname2,no1

help me....

thank you..

Posted by wannie on Dec 27, 2009 at 09:56 PM UTC - 5 hrs

I just copied the code and made it compile. I understand that header line with column names is needed, and sort is done by specifying single column.

1. What if no header line (how could I work by column numbers instead of by column names)? How to sort not only for single specified column, but for all columns?
2. [How to sort for n specified columns?]
3. Could I get the whole code file for it?

Posted by Michael Lev on Feb 02, 2011 at 03:47 PM UTC - 5 hrs

Hi Michael,

1) You can just remove the parts that read the header and change the parts that find the index of the header column by the name to just use the index you want.

2) To sort for n specified columns is way outside the bounds of what I'd have time to get in depth about right now, but you might look at determining if the primary columns are equal in the merge() step, then sort on the next column, and if the same then the next, and so on. I'm not sure how well that would work out, but I'd start thinking about it from there first and see where it leads.

3) I don't have the whole code file any more (that I've been able to find anyway). I do get a lot of requests for it. One of these days I might have to see if I can find that old hard drive, and then see if I can find it. If I do, I might toss it up on github. I'll leave a comment here if and when I find it.

Posted by Sammy Larbi on Feb 28, 2011 at 01:18 PM UTC - 5 hrs

I finally got a hold of the old computer that code was on, and luckily it was still there!

I've posted the working main function along with all the code above and a test.csv to my github miscellany project: https://github.com/codeodor/miscellany/blob/master...

Posted by Sammy Larbi on Mar 18, 2011 at 08:20 AM UTC - 5 hrs

can you send me all code...

Posted by rahul on May 23, 2011 at 02:53 AM UTC - 5 hrs

As I mentioned in the comment above, the full source code is available at https://github.com/codeodor/miscellany/blob/master...

Posted by Sammy Larbi on May 23, 2011 at 12:46 PM UTC - 5 hrs

Can I use this for sorting based on multiple columns at a time ? e.g. B ascending and C descending ?

Posted by Abhay on Aug 04, 2011 at 04:30 AM UTC - 5 hrs

Abhay:

You might look at determining if the primary columns are equal in the merge() step, then sort on the next column, and if the same then the next, and so on. I'm not sure how well that would work out, but I'd start thinking about it from there first and see where it leads.

Posted by Sammy Larbi on Aug 04, 2011 at 05:56 AM UTC - 5 hrs

Guys,

I need to sort based on two different keys(columns). First one is a numeric and the second one is a string.

Data contains duplicates so whenever the two numeric are the same I need to sort based on string.

Cane somebody provide help to me.

Posted by shaklasah on Nov 07, 2011 at 01:27 AM UTC - 5 hrs

Shaklasah,

As I mentioned above: "look at determining if the primary columns are equal in the merge() step, then sort on the next column, and if the same then the next, and so on. I'm not sure how well that would work out, but I'd start thinking about it from there first and see where it leads."

Posted by Sammy Larbi on Nov 07, 2011 at 05:45 AM UTC - 5 hrs

This is a great idea. However, one thing that came to mind as I was browsing the comments. Since Java already has a TreeMap that implements SortedMap why not construct the key and simply use a TreeMap to build the individual sorted files prior to the merge? This would be simple for ascending keys. It might be more challenging if it had mixed keys. I would write both the sort key and the record content and use the sort key when merging the files back together. I have a need for this sort of thing in my current development. Maybe I'll take a look.

Posted by GP on Jan 06, 2012 at 04:41 PM UTC - 5 hrs

Admittedly, it's been a while since I've worked in Java in any capacity, but I don't see a reason you couldn't do it, assuming you could guarantee unique values in the sorted field.

To my knowledge, if you had duplicate values in the field on which you're sorting, you won't end up with the right output -- specifically, you'll only get the last encountered duplicate key's row.

Do you have a plan to deal with that? (Or is it unnecessary for some reason?)

Anyway, give it a shot and let me know how it goes!

Posted by Sammy Larbi on Jan 06, 2012 at 06:27 PM UTC - 5 hrs

Can you send me a whole code to my email?
I want to take a look at it. I am just beginner of programming. Want to learn more. Thank you

green_goblin013@yahoo.com

Posted by Amy on Apr 17, 2012 at 07:16 PM UTC - 5 hrs

The code is here: https://github.com/codeodor/miscellany/blob/master...

Posted by Sammy Larbi on Apr 17, 2012 at 07:59 PM UTC - 5 hrs

Thank you , Sammy

Can I ask you question?

When did you work for this one? and what class is that?

You work on this program for your master degree?

green_goblin013@yahoo.com

Posted by Amy on Apr 24, 2012 at 04:03 PM UTC - 5 hrs

Yeah, I think it was for a course on database theory. I don't remember for sure though.

Posted by Sammy Larbi on Apr 25, 2012 at 07:41 PM UTC - 5 hrs

Thanks for sharing your idea and your code, I'm working on the very same job now and luckily I found this, what you have done is particularly useful for me. BTW I'm on my master degree of computer science at the moment too ;-D

Posted by Sherlock Wesker on Oct 11, 2012 at 02:37 AM UTC - 5 hrs

Cool Sherlock, glad you found it helpful. Best of luck in your MS!

Posted by Sammy Larbi on Oct 11, 2012 at 07:04 AM UTC - 5 hrs

Can any one explain, why we are using "// check if one still has rows" this block?

tell me in which scenario System exist will occur in this block.

Thanks,

Posted by Mahamood on Nov 07, 2014 at 09:28 AM UTC - 5 hrs

@Mahamood I don't recall the reasoning at the time, but looking back through it I think it may be to cover the unexpected case that another process opened the file and added a row while we were sorting.

Posted by Sammy Larbi on Nov 09, 2014 at 11:09 AM UTC - 5 hrs

Leave a comment

Topics
.NET (19)
AI/Machine Learning (14)
Answers To 100 Interview Questions (10)
Bioinformatics (2)
Business (1)
C and Cplusplus (6)
cfrails (22)
ColdFusion (78)
Customer Relations (15)
Databases (3)
DRY (18)
DSLs (11)
Future Tech (5)
Games (5)
Groovy/Grails (8)
Hardware (1)
IDEs (9)
Java (38)
JavaScript (4)
Linux (2)
Lisp (1)
Mac OS (4)
Management (15)
MediaServerX (1)
Miscellany (76)
OOAD (37)
Productivity (11)
Programming (168)
Programming Quotables (9)
Rails (31)
Ruby (67)
Save Your Job (58)
scriptaGulous (4)
Software Development Process (23)
TDD (41)
TDDing xorblog (6)
Tools (5)
Web Development (8)
Windows (1)
With (1)
YAGNI (10)

Resources
Agile Manifesto & Principles
Principles Of OOD
ColdFusion
CFUnit
Ruby
Ruby on Rails
JUnit

RSS 2.0: Full Post | Short Blurb
Subscribe by email: