My Secret Life as a Spaghetti Coder
home | about | contact | privacy statement | getting started with cfrails
Due to a couple of requests in the comments about Sorting really BIG files, I'm posting the Java source code that does the job.

First, we have the main externalSort algorithm. It expects parameters relation (the name of a table, which in this case is the .csv file to sort without the ".csv" in the file name) and the column to sort on (it checks the first row of the .csv file for the value passed). Now, of course that is easy enough to change to fit your own needs - it worked fine for me and my CSVs.

There is a magic number in there that you might want to pull out: 10000. That is the number of rows to read at a time. Clearly, ten thousand rows is way smaller than we would want to sort on - depending on how many characters are in each row, we could probably quite easily fit 10 million rows into memory. But, I didn't want to spend time generating 100 million row files to test this on - so I just tried it with 100 thousand or so. For a real use scenario, you'd probably want to go with something bigger - if not even calculate the size and average row length.

private static void externalSort(String relation, String attribute)
{
     try
     {
         FileReader intialRelationInput = new FileReader(relation + ".csv");
         BufferedReader initRelationReader = new BufferedReader(intialRelationInput);
         String [] header = initRelationReader.readLine().split(",");
         String [] row = header;
         int indexToCompare = getIndexForColumn(header,attribute);
         ArrayList<Integer[]> tenKRows = new ArrayList<Integer[]>();
                    
         int numFiles = 0;
         while (row!=null)
         {
             // get 10k rows
             for(int i=0; i<10000; i++)
             {
                 String line = initRelationReader.readLine();
                 if (line==null)
                 {
                     row = null;
                     break;
                 }
                 row = line.split(",");
                 tenKRows.add(getIntsFromStringArray(row));
             }
             // sort the rows
             tenKRows = mergeSort(tenKRows, indexToCompare);
            
             // write to disk
             FileWriter fw = new FileWriter(relation + "_chunk" + numFiles + ".csv");
             BufferedWriter bw = new BufferedWriter(fw);
             bw.write(flattenArray(header,",")+"\n");
             for(int i=0; i<tenKRows.size(); i++)
             {
                 bw.append(flattenArray(tenKRows.get(i),",")+"\n");
             }
             bw.close();
             numFiles++;
             tenKRows.clear();
         }
    
         mergeFiles(relation, numFiles, indexToCompare);
        
         initRelationReader.close();
         intialRelationInput.close();
        
     }
     catch (Exception ex)
     {
         ex.printStackTrace();
         System.exit(-1);
     }
    
    
}
    

Now, there were a couple of functions that the externalSort() method depended on. First we see the getIndexForColumn. This method just wraps code that finds the index where the value of attribute occurs in the array header. That's easy enough to write, so I won't add to the clutter and post it here.

Next, we have getIntsFromStringArray. This is also simple and just uses parseInt on each index in the array and returns a new one for ints. I'm assuming these are all integers, so you may need to change that line to deal with Strings or other data types.

Then we have the mergeSort algorithm, which is shown below. flattenArray is also omitted here because it simply takes an array, puts a comma between each element, and returns a String. Finally, mergeFiles is also shown below.

// sort an arrayList of arrays based on the ith column
private static ArrayList<Integer[]> mergeSort(ArrayList<Integer[]> arr, int index)
{
     ArrayList<Integer[]> left = new ArrayList<Integer[]>();
     ArrayList<Integer[]> right = new ArrayList<Integer[]>();
     if(arr.size()<=1)
         return arr;
     else
     {
         int middle = arr.size()/2;
         for (int i = 0; i<middle; i++)
             left.add(arr.get(i));
         for (int j = middle; j<arr.size(); j++)
             right.add(arr.get(j));
         left = mergeSort(left, index);
         right = mergeSort(right, index);
         return merge(left, right, index);
        
     }
    
}

// merge the the results for mergeSort back together
private static ArrayList<Integer[]> merge(ArrayList<Integer[]> left, ArrayList<Integer[]> right, int index)
{
     ArrayList<Integer[]> result = new ArrayList<Integer[]>();
     while (left.size() > 0 && right.size() > 0)
     {
         if(left.get(0)[index] <= right.get(0)[index])
         {
             result.add(left.get(0));
             left.remove(0);
         }
         else
         {
             result.add(right.get(0));
             right.remove(0);
         }
     }
     if (left.size()>0)
     {
         for(int i=0; i<left.size(); i++)
             result.add(left.get(i));
     }
     if (right.size()>0)
     {
         for(int i=0; i<right.size(); i++)
             result.add(right.get(i));
     }
     return result;
}


There is nothing special about the mergeSort and merge methods above. They simply follow the well known merge sort algorithm, accounting for the fact that we are sorting on one element in an array of arrays, rather that an array with only one dimension.

Finally, we have the mergeFiles function, which simply reads all the files, one row at a time, and puts the smallest row at the top of the sorted file.

private static void mergeFiles(String relation, int numFiles, int compareIndex)
{
     try
     {
         ArrayList<FileReader> mergefr = new ArrayList<FileReader>();
         ArrayList<BufferedReader> mergefbr = new ArrayList<BufferedReader>();
         ArrayList<Integer[]> filerows = new ArrayList<Integer[]>();
         FileWriter fw = new FileWriter(relation + "_sorted.csv");
         BufferedWriter bw = new BufferedWriter(fw);
         String [] header;
            
         boolean someFileStillHasRows = false;
        
         for (int i=0; i<numFiles; i++)
         {
             mergefr.add(new FileReader(relation+"_chunk"+i+".csv"));
             mergefbr.add(new BufferedReader(mergefr.get(i)));
             // get each one past the header
             header = mergefbr.get(i).readLine().split(",");
                            
             if (i==0) bw.write(flattenArray(header,",")+"\n");
            
             // get the first row
             String line = mergefbr.get(i).readLine();
             if (line != null)
             {
                 filerows.add(getIntsFromStringArray(line.split(",")));
                 someFileStillHasRows = true;
             }
             else
             {
                 filerows.add(null);
             }
                
         }
        
         Integer[] row;
         int cnt = 0;
         while (someFileStillHasRows)
         {
             Integer min;
             int minIndex = 0;
            
             row = filerows.get(0);
             if (row!=null) {
                 min = row[compareIndex];
                 minIndex = 0;
             }
             else {
                 min = null;
                 minIndex = -1;
             }
            
             // check which one is min
             for(int i=1; i<filerows.size(); i++)
             {
                 row = filerows.get(i);
                 if (min!=null) {
                    
                     if(row!=null && row[compareIndex] < min)
                     {
                         minIndex = i;
                         min = filerows.get(i)[compareIndex];
                     }
                 }
                 else
                 {
                     if(row!=null)
                     {
                         min = row[compareIndex];
                         minIndex = i;
                     }
                 }
             }
            
             if (minIndex < 0) {
                 someFileStillHasRows=false;
             }
             else
             {
                 // write to the sorted file
                 bw.append(flattenArray(filerows.get(minIndex),",")+"\n");
                
                 // get another row from the file that had the min
                 String line = mergefbr.get(minIndex).readLine();
                 if (line != null)
                 {
                     filerows.set(minIndex,getIntsFromStringArray(line.split(",")));
                 }
                 else
                 {
                     filerows.set(minIndex,null);
                 }
             }                                
             // check if one still has rows
             for(int i=0; i<filerows.size(); i++)
             {
                
                 someFileStillHasRows = false;
                 if(filerows.get(i)!=null)
                 {
                     if (minIndex < 0)
                     {
                         puts ("mindex lt 0 and found row not null" + flattenArray(filerows.get(i)," "));
                         System.exit(-1);
                     }
                     someFileStillHasRows = true;
                     break;
                 }
             }
            
             // check the actual files one more time
             if (!someFileStillHasRows)
             {
                
                 //write the last one not covered above
                 for(int i=0; i<filerows.size(); i++)
                 {
                     if (filerows.get(i) == null)
                     {
                         String line = mergefbr.get(i).readLine();
                         if (line!=null)
                         {
                            
                             someFileStillHasRows=true;
                             filerows.set(i,getIntsFromStringArray(line.split(",")));
                         }
                     }
                            
                 }
             }
                
         }
        
        
        
         // close all the files
         bw.close();
         fw.close();
         for(int i=0; i<mergefbr.size(); i++)
             mergefbr.get(i).close();
         for(int i=0; i<mergefr.size(); i++)
             mergefr.get(i).close();
        
        
        
     }
     catch (Exception ex)
     {
         ex.printStackTrace();
         System.exit(-1);
     }
}


Ideally, you'd want to do better abstraction on a lot of this code and put it into different methods. Like Ben Nadel did in his ColdFusion/Java version of this, I'm keeping it inline so you don't need to track over to a function to see exactly what it does (though a good name would certainly tell you without the details - which is what abstraction is about!). In particular, I'd move all the file-specific operations into abstractions which better describe what they're doing in this particular domain. It would really help to better understand what's going on in the mergeFiles method at a glance, rather than reading all the details.

Questions? Comments?

Hey! Why don't you make your life easier and subscribe to the full post or short blurb RSS feed? I'm so confident you'll love my smelly pasta plate wisdom that I'm offering a no-strings-attached, lifetime money back guarantee!


Comments
Leave a comment

can you please send me the full code with all the methods?
tomerab@hotmail.com

Posted by Tomer on May 20, 2007 at 03:02 PM UTC - 6 hrs

Are you missing the code for the method "getIndexForColumn" :) ?

A part this, Great Work!

Posted by Andrew on Jun 13, 2007 at 02:08 PM UTC - 6 hrs

Hi Andrew - indeed, I left it out because it didn't add much value for the clutter it provided. That method takes a string array and a string, and just returns the index where it found the value. Basically, something like (I haven't tested this, but I think you'll get the idea):

private int getIndexForColumn(String [] arr, String val)
{
    int result = -1;
    for(int i=0; i<arr.length; i++)
    {
       if (val==arr[i]) { result=i; break; }
    }
    return result;
}

Posted by Sam on Jun 14, 2007 at 12:14 PM UTC - 6 hrs

Hello,
Can you please explain, how is the implementation of "getIntsFromStringArray" method,

"private static Integer[] getIntsFromStringArray(String[] row)"

Posted by Rohidas on Jun 19, 2007 at 05:57 AM UTC - 6 hrs

If I understod correctly this "getIntsFromStringArray()" method only deal with the integer datatype if I am using all the datatypes then how to deal with this method.will it affect the sorting techniques.

Posted by Rohidas on Jun 19, 2007 at 07:33 AM UTC - 6 hrs

Well, I think you can really only sort on one data type per column, or else it has to be a string. You might be able to think of some way around that, but I don't think it follows that dates are less than or greater than strings or numbers or ...

The input is expected to be in string format, so if you want to sort on strings, just leave out the getIntsFromStringArray(). If you want to sort on doubles, floats, dates, or whatever, you just need to write a function that converts the string representation to the correct data type. That's all getIntsFromStringArray does.

Basically, it does this (not tested, so you'll need to make it work if it doesn't):

private int[] getIntsFromStringArray(String[] arr)
{
   int[] result = new int[arr.length];
   for (int i=0; i<arr.length; i++) {
     result[i]=Integer.parseInt(arr[i]);
   }
   return result;
}

Of course, you'll probably want to do some error checking to make sure it doesn't crash if arr[i] is not able to be converted to an int.

Posted by Sam on Jun 19, 2007 at 12:41 PM UTC - 6 hrs

who knows how to sort date using quick sort???? send it to my email add..... blueglenn3@yahoo.com

Posted by glenn on Jul 08, 2007 at 11:05 PM UTC - 6 hrs

Hi Glenn. Assuming the date is originally in string format, you'll need to parse it to extract a Date or (more likely) Calendar object. Then sort as normal, using comparisons on the dates.

Posted by Sam on Jul 08, 2007 at 11:13 PM UTC - 6 hrs

any source code for date pal???? the date sorting?????

Posted by glenn on Jul 10, 2007 at 07:46 AM UTC - 6 hrs

input int and then display string u know that ????

Posted by glenn on Jul 12, 2007 at 12:13 AM UTC - 6 hrs

Sorry, I don't understand what you are asking.

Posted by Sam on Jul 12, 2007 at 06:26 AM UTC - 6 hrs

a prgram that you can input integer then it outputs a string like for example input 1 then the computer outputs "one" a string pal. from int it become a string you know that??????

Posted by glenn on Jul 13, 2007 at 11:09 AM UTC - 6 hrs

How about if(input==1) output("one") ... and so forth?

Posted by Sam on Jul 13, 2007 at 12:36 PM UTC - 6 hrs

that's to long pal im looking for an altenative for that hehe... how about the user inputs 2,000,000 do i need to write a code until 2 million " if(input==2,000,000)output("two million") it so long and hard to write can i use a loop for that??/

Posted by glenn on Jul 13, 2007 at 11:43 PM UTC - 6 hrs

Oh I see. I was thinking it might output "two zero zero..."

I suppose you could do it up to 15, then it becomes x+"teen" for 5<x<20. Then "twenty"+x for x betwwen 1 and 9, and so on for each twenty, thirty, forty, ... ninety. Then hundreds, and so on. Basically how we learn to count - we only need to know 1-15, 20, 30, 40, 50, 60, 70, 80, 90, 100, 1000, 1 million, 1 billion, 1 trillion, and so on.

Of course, you cannot encode all of it, but you could stop at some really high number and just call it one-thousand-trillion instead of the next highest word for it.

So really there is only 28 cases if you did that. That's still a lot of cases, so you might look to replace the conditionals with a table lookup.

Posted by Sam on Jul 14, 2007 at 07:34 AM UTC - 6 hrs

thanks pal somewhat i can see the logic.... Are you a IT graduate?? if not what is your course.....

Posted by glenn on Jul 14, 2007 at 10:19 AM UTC - 6 hrs

Yes, I'm working on my masters in computer science at the moment.

Posted by Sam on Jul 14, 2007 at 10:45 AM UTC - 6 hrs

Hi Sam,
I have problem about this.I have 780mb text file and I must sort it with only 256 mb ram and it must be in a short time.
How can I do this?
Txt file web site:
http://kdd.ics.uci.edu/databases/kddcup99/kddcup99...
main file:
http://kdd.ics.uci.edu/databases/kddcup99/kddcup.d...
test file (about %10 of main file)
http://kdd.ics.uci.edu/databases/kddcup99/kddcup.d...

Best regards,
adios_lepido07@hotmail.com

Posted by Tigris07 on Nov 02, 2007 at 02:46 AM UTC - 6 hrs

Tigris,

The above should work just fine - you might need to tailor some things to get it to work, but it should work fine.

If you'd like an overview of the algorithm in pseudocode, see http://www.codeodor.com/index.cfm/2007/5/10/Sortin...

Posted by Sam on Nov 02, 2007 at 05:40 AM UTC - 6 hrs

Thanks Sam

I started write code, I'm a bit slow.
What is flattenArray's job in the code.
Can you write the whole code?

Special Thanks again.

Posted by Tigris07 on Nov 05, 2007 at 04:06 AM UTC - 6 hrs

Flatten array just takes an array and joins each element, separated by the 2nd parameter, into a string. Something like:

String flattenArray(Array a, String separator)
{

String result = "";
for(i=0; i<a.length; i++)
result+=a[i] + separator;
return result;
}

Posted by Sam on Nov 05, 2007 at 06:05 AM UTC - 6 hrs

Unix sort provides for out-of-memory sort, but it's been lacking in the Java world. Thanks.

Would you be willing to contribute your code to the public domain so lots of people can use it?

Posted by George on Nov 30, 2007 at 09:13 AM UTC - 6 hrs

Bugfix: getIndexForColumn()

Instead of testing for pointer equality with "val == arr[i]" it should test for string equality with "val.equals(arr[i])".


Optimization: mergeFiles()

For each line, it performs a linear search to determine which file has the minimum line. And we're facing files with trillions of lines and probably many files to merge. Instead, it'd be good to use a heap. Lotta code to write in the world.

-George

Posted by George on Nov 30, 2007 at 09:42 AM UTC - 6 hrs

Hi George, thanks for pointing those things out -- I often forget the equality and normally catch it in the test, but for the comment I didn't look at the original code, just wrote what I thought it would be, so I never ever ran it, much less tested it =).

Re: opening it up to public domain: I'm happy to let whoever wants to use it use it, but to make it completely useful I'd need to reconstruct it from this post as the computer it resided on is no longer with me. That's ok though, because it was all just in the main class, so it'd need to be rewritten to make it useful and I'd have to add sorting for other data types as well.

So, consider it public domain, but I wouldn't want to actually start a project because I wouldn't be available to fix it to get it to a releasable state. Now if /you/ want to do all that, I certainly have no qualms about it. =)

Thanks again for the important comments!

Posted by Sammy Larbi on Nov 30, 2007 at 10:28 AM UTC - 6 hrs

Of course, if someone were to do that and base it on the code here, I wouldn't particularly mind if they gave credit to where it came from. =)

Posted by Sammy Larbi on Nov 30, 2007 at 10:30 AM UTC - 6 hrs

good to go thru

Posted by shamli on May 23, 2008 at 09:03 AM UTC - 6 hrs

I need to sort large csv files (several M records) for multiple key columns.
e.g.:
Head1;Head2;Head3;Head4;Head5;Head6
Val11;Val12;Val13;Val14;Val15;Val16
Val21;Val22;Val23;Val24;Val25;Val26
Val31;Val32;Val33;Val34;Val35;Val36
Val41;Val42;Val43;Val44;Val45;Val46
...
should be sorted for Head4;Head2;Head5

I dont want to sort the large file multiple times in opposite order of the sort columns. Have you any suggestion, how i can do this in one external sort?
Thanks

Posted by Dominik on Jul 10, 2008 at 11:10 AM UTC - 6 hrs

Dominik,

Unfortunately, no I cannot think of any way to sort a file multiple times without sorting it multiple times. =)

Perhaps I misunderstood the question?

Posted by Sammy Larbi on Jul 10, 2008 at 12:11 PM UTC - 6 hrs

I think of an Comparator, which takes more than one column for key compare. I think there must be an efficienter way to sort a file with the sort criterion of more than one column, in stead of starting the algorithm multiple times.

Posted by Dominik on Jul 11, 2008 at 02:07 AM UTC - 6 hrs

Are you talking about changing the order of the columns (scenario 1), or about sorting first on one column (scenario 2), and then sorting the duplicates by another column?

For example:
a 2 x
a 2 y
b 3 z
a 1 x

In scenario 1 may be:
2 a x
2 a y
3 b z
1 a x

The 1st and 2nd columns switched.

Or, scenario 2, sorted first by column 1, then by column 2, then by column 3:

a 1 x
a 2 x
a 2 y
b 3 z

In scenario 2, which I thought is what you meant, if we say sorting one column is O(x), then sorting m columns is always going to be O(mx) - that is, there will be a constant increase in time.

You may be able to achieve constant reductions in time with clever code, which in your case of millions of rows will be worthwhile, but I don't think you can get a smaller complexity class.

Even if you move all the operations into 1 loop instead of 3 (for instance), I don't think it would make much difference. It may be worth a try. If the number of duplicates for each value in a column is small enough, you could look for that and perform it in memory instead of on disk, and that would be an outstanding increase in speed - but that will only work if you know for sure a priori that that is the case.

I could be wrong, but those are my thoughts just looking at a fairly shallow level.

Posted by Sammy Larbi on Jul 11, 2008 at 07:53 AM UTC - 6 hrs

I meant scenario 2. Well i can't ensure that the number of duplicates in one of the sort columns is small enough everytime.

Perhaps i can achive the most performance increase by doing some analysis, how to define the optimal number of rows to read at a time.

Thanks, Dominik

Posted by Dominik on Jul 15, 2008 at 12:36 PM UTC - 6 hrs

There's a limited amount of file descriptors, so the method won't work on very large files. If we look at Tigris07's example, 780MB of text, 256MB or RAM, we can estimate that the program will need to open more than 3000 files simultaneously for the merge-part. Depending on the underlying system and on the JVM, this will fail.

The solution is to merge a smaller amount of files at a time, which makes the code a bit more complex and adds another iteration (or a third or fourth, if we want to sort huge files).

Posted by Toke Eskildsen on Jul 30, 2008 at 04:25 AM UTC - 6 hrs

Good point Toke. If you're going to use this in production, that's something you'll need to take into account. I may update it in the future, but as this was meant to illustrate the concept, I'll leave it for now.

Posted by Sammy Larbi on Jul 30, 2008 at 07:48 AM UTC - 6 hrs

The point might be valid enough, but I confused gigabytes with megabytes in the Tigtis07-case where the sorter will work just fine. If he allocates, say, 50MB for the sorter, it will require just 17 file-handles.

Another thing to consider is the number of seeks across the drive. I haven't done any tests on this, but I would guess that reading from 100's of files nearly simultaneously could confuse the disk-read-ahead-cache. It might be faster to merge fewer files at a time and take the hit of another iteration.

Posted by Toke Eskildsen on Jul 31, 2008 at 02:22 AM UTC - 6 hrs

Right now it takes just 10k rows into memory and creates a file for each row. Ideally you'd write it in such a way that it checked how much memory was available and used as much as would be comfortable (that's pretty arbitrary, but I think it gets the point across), and then write the files.

I think that would provide a pretty decent balance in most cases.

However, it wouldn't be hard to change it to just open a few files at a time, as you say.

Posted by Sammy Larbi on Jul 31, 2008 at 07:16 AM UTC - 6 hrs

Hi,
I couldn't do the multiway merging in Java.Can you help me,pls send me a Java code of the p-way merge sort.

Posted by aylin on Dec 17, 2008 at 10:07 AM UTC - 6 hrs

Hi,
first thanks for this job. but what is "attribute"? can you clearly define it?

Posted by jackie on Dec 18, 2008 at 08:48 AM UTC - 6 hrs

attribute is the attribute on which to sort - the column name, if you will.

Posted by Sammy Larbi on Dec 18, 2008 at 09:43 AM UTC - 6 hrs

Sorry but I still cant imagine what you mean? and "column name" doesnt mean anything to me. Can you expalin it by a specific example?

Posted by jackie on Dec 18, 2008 at 12:58 PM UTC - 6 hrs

for example ?n the getIndexForColumn method, you write:

if (val==arr[i])
and as I understand here "val" is actually "attribute".(from the call getIndexForColumn(header, attribute)) And I dont understand what is actually "val" is and what it is equal to??

Posted by jackie on Dec 18, 2008 at 01:04 PM UTC - 6 hrs

Suppose you have a file with a header row and data:

x y z
a 2 3
a 8 1
b 2 9
d e 2


Well, you want to sort by the column named "z". Therefore, you'd use "z" as the attribute.

Yes, attribute would be passed in to find which array index matches "z", in this case it would return 2. But you could also use it in general, without worrying about attribute. It's just a function that returns the first index in an array where the value matches val.

Hope that helps.

Posted by Sammy Larbi on Dec 18, 2008 at 02:00 PM UTC - 6 hrs

yeah, its clear now, thanks!

Posted by jackie on Dec 18, 2008 at 02:41 PM UTC - 6 hrs

Leave a comment

Leave this field empty
Your Name
Email (not displayed, more info?)
Website

Comment:

Subcribe to this comment thread
Remember my details
Google
Web CodeOdor.com

Me
Picture of me

Topics
.NET (27)
AI/Machine Learning (15)
Bioinformatics (3)
C and C++ (9)
cfrails (22)
ColdFusion (85)
Customer Relations (20)
Databases (2)
DRY (19)
DSLs (13)
Electronics (2)
Future Tech (6)
Games (8)
Groovy/Grails (9)
Hardware (2)
IDEs (10)
Java (45)
JavaScript (6)
Lisp (3)
Mac OS (3)
Management (4)
Miscellany (63)
OOAD (40)
Programming (137)
Programming Quotables (10)
Rails (23)
Ruby (60)
Save Your Job (65)
scriptaGulous (4)
Software Development Process (28)
TDD (43)
TDDing xorblog (6)
Tools (6)
Web Development (9)
YAGNI (12)

Resources
Agile Manifesto & Principles
Principles Of OOD
ColdFusion
CFUnit
Ruby
Ruby on Rails
JUnit



RSS 2.0: Full Post | Short Blurb
Subscribe by email:

Delivered by FeedBurner