Due to a couple of requests in the comments about
Sorting really BIG files,
I'm posting the Java source code that does the job.
First, we have the main
externalSort algorithm. It expects parameters
relation
(the name of a table, which in this case is the .csv file to sort without the ".csv" in the file name)
and the column to sort on (it checks the first row of the .csv file for the value passed). Now, of course
that is easy enough to change to fit your own needs - it worked fine for me and my CSVs.
There is a magic number in there that you might want to pull out: 10000. That is the number of rows to
read at a time.
Clearly, ten thousand rows is way smaller than we would want to sort on - depending on how many characters
are in each row, we could probably quite easily fit 10 million rows into memory. But, I didn't want to
spend time generating 100 million row files to test this on - so I just tried it with 100 thousand or so.
For a real use scenario, you'd probably want to go with something bigger - if not even calculate the size
and average row length.
private static void externalSort(String relation, String attribute)
{
try
{
FileReader intialRelationInput = new FileReader(relation + ".csv");
BufferedReader initRelationReader = new BufferedReader(intialRelationInput);
String [] header = initRelationReader.readLine().split(",");
String [] row = header;
int indexToCompare = getIndexForColumn(header,attribute);
ArrayList<Integer[]> tenKRows = new ArrayList<Integer[]>();
int numFiles = 0;
while (row!=null)
{
// get 10k rows
for(int i=0; i<10000; i++)
{
String line = initRelationReader.readLine();
if (line==null)
{
row = null;
break;
}
row = line.split(",");
tenKRows.add(getIntsFromStringArray(row));
}
// sort the rows
tenKRows = mergeSort(tenKRows, indexToCompare);
// write to disk
FileWriter fw = new FileWriter(relation + "_chunk" + numFiles + ".csv");
BufferedWriter bw = new BufferedWriter(fw);
bw.write(flattenArray(header,",")+"\n");
for(int i=0; i<tenKRows.size(); i++)
{
bw.append(flattenArray(tenKRows.get(i),",")+"\n");
}
bw.close();
numFiles++;
tenKRows.clear();
}
mergeFiles(relation, numFiles, indexToCompare);
initRelationReader.close();
intialRelationInput.close();
}
catch (Exception ex)
{
ex.printStackTrace();
System.exit(-1);
}
}
Now, there were a couple of functions that the
externalSort() method depended on.
First we see the
getIndexForColumn. This method just wraps code that finds the index
where the value of
attribute occurs in the array
header. That's
easy enough to write, so I won't add to the clutter and post it here.
Next, we have
getIntsFromStringArray. This is also simple and just uses
parseInt
on each index in the array and returns a new one for
ints. I'm assuming these are all integers,
so you may need to change that line to deal with
Strings or other data types.
Then we have the
mergeSort algorithm, which is shown below.
flattenArray is
also omitted here because it simply takes an array, puts a comma between each element, and returns
a
String. Finally,
mergeFiles is also shown below.
// sort an arrayList of arrays based on the ith column
private static ArrayList<Integer[]> mergeSort(ArrayList<Integer[]> arr, int index)
{
ArrayList<Integer[]> left = new ArrayList<Integer[]>();
ArrayList<Integer[]> right = new ArrayList<Integer[]>();
if(arr.size()<=1)
return arr;
else
{
int middle = arr.size()/2;
for (int i = 0; i<middle; i++)
left.add(arr.get(i));
for (int j = middle; j<arr.size(); j++)
right.add(arr.get(j));
left = mergeSort(left, index);
right = mergeSort(right, index);
return merge(left, right, index);
}
}
// merge the the results for mergeSort back together
private static ArrayList<Integer[]> merge(ArrayList<Integer[]> left, ArrayList<Integer[]> right, int index)
{
ArrayList<Integer[]> result = new ArrayList<Integer[]>();
while (left.size() > 0 && right.size() > 0)
{
if(left.get(0)[index] <= right.get(0)[index])
{
result.add(left.get(0));
left.remove(0);
}
else
{
result.add(right.get(0));
right.remove(0);
}
}
if (left.size()>0)
{
for(int i=0; i<left.size(); i++)
result.add(left.get(i));
}
if (right.size()>0)
{
for(int i=0; i<right.size(); i++)
result.add(right.get(i));
}
return result;
}
There is nothing special about the
mergeSort and
merge methods above. They simply
follow the well known
merge sort algorithm, accounting
for the fact that we are sorting on one element in an array of arrays, rather that an array with only one
dimension.
Finally, we have the
mergeFiles function, which simply reads all the files, one row at a time,
and puts the smallest row at the top of the sorted file.
private static void mergeFiles(String relation, int numFiles, int compareIndex)
{
try
{
ArrayList<FileReader> mergefr = new ArrayList<FileReader>();
ArrayList<BufferedReader> mergefbr = new ArrayList<BufferedReader>();
ArrayList<Integer[]> filerows = new ArrayList<Integer[]>();
FileWriter fw = new FileWriter(relation + "_sorted.csv");
BufferedWriter bw = new BufferedWriter(fw);
String [] header;
boolean someFileStillHasRows = false;
for (int i=0; i<numFiles; i++)
{
mergefr.add(new FileReader(relation+"_chunk"+i+".csv"));
mergefbr.add(new BufferedReader(mergefr.get(i)));
// get each one past the header
header = mergefbr.get(i).readLine().split(",");
if (i==0) bw.write(flattenArray(header,",")+"\n");
// get the first row
String line = mergefbr.get(i).readLine();
if (line != null)
{
filerows.add(getIntsFromStringArray(line.split(",")));
someFileStillHasRows = true;
}
else
{
filerows.add(null);
}
}
Integer[] row;
int cnt = 0;
while (someFileStillHasRows)
{
Integer min;
int minIndex = 0;
row = filerows.get(0);
if (row!=null) {
min = row[compareIndex];
minIndex = 0;
}
else {
min = null;
minIndex = -1;
}
// check which one is min
for(int i=1; i<filerows.size(); i++)
{
row = filerows.get(i);
if (min!=null) {
if(row!=null && row[compareIndex] < min)
{
minIndex = i;
min = filerows.get(i)[compareIndex];
}
}
else
{
if(row!=null)
{
min = row[compareIndex];
minIndex = i;
}
}
}
if (minIndex < 0) {
someFileStillHasRows=false;
}
else
{
// write to the sorted file
bw.append(flattenArray(filerows.get(minIndex),",")+"\n");
// get another row from the file that had the min
String line = mergefbr.get(minIndex).readLine();
if (line != null)
{
filerows.set(minIndex,getIntsFromStringArray(line.split(",")));
}
else
{
filerows.set(minIndex,null);
}
}
// check if one still has rows
for(int i=0; i<filerows.size(); i++)
{
someFileStillHasRows = false;
if(filerows.get(i)!=null)
{
if (minIndex < 0)
{
puts ("mindex lt 0 and found row not null" + flattenArray(filerows.get(i)," "));
System.exit(-1);
}
someFileStillHasRows = true;
break;
}
}
// check the actual files one more time
if (!someFileStillHasRows)
{
//write the last one not covered above
for(int i=0; i<filerows.size(); i++)
{
if (filerows.get(i) == null)
{
String line = mergefbr.get(i).readLine();
if (line!=null)
{
someFileStillHasRows=true;
filerows.set(i,getIntsFromStringArray(line.split(",")));
}
}
}
}
}
// close all the files
bw.close();
fw.close();
for(int i=0; i<mergefbr.size(); i++)
mergefbr.get(i).close();
for(int i=0; i<mergefr.size(); i++)
mergefr.get(i).close();
}
catch (Exception ex)
{
ex.printStackTrace();
System.exit(-1);
}
}
Ideally, you'd want to do better abstraction on a lot of this code and put it into different methods.
Like Ben Nadel did in his
ColdFusion/Java version of this,
I'm keeping it inline so you don't need to track over to a function to see exactly what it does (though
a good name would certainly tell you without the details - which is what abstraction is about!). In
particular, I'd move all the file-specific operations into abstractions which better describe what
they're doing in this particular domain. It would really help to better understand what's going on
in the
mergeFiles method at a glance, rather than reading all the details.
Questions? Comments?
Hey! Why don't you make your life easier and subscribe to the full post
or short blurb RSS feed? I'm so confident you'll love my smelly pasta plate
wisdom that I'm offering a no-strings-attached, lifetime money back guarantee!
Leave a comment
can you please send me the full code with all the methods?
tomerab@hotmail.com
Posted by Tomer
on May 20, 2007 at 03:02 PM UTC - 6 hrs
Are you missing the code for the method "getIndexForColumn" :) ?
A part this, Great Work!
Posted by Andrew
on Jun 13, 2007 at 02:08 PM UTC - 6 hrs
Hi Andrew - indeed, I left it out because it didn't add much value for the clutter it provided. That method takes a string array and a string, and just returns the index where it found the value. Basically, something like (I haven't tested this, but I think you'll get the idea):
private int getIndexForColumn(String [] arr, String val)
{
int result = -1;
for(int i=0; i<arr.length; i++)
{
if (val==arr[i]) { result=i; break; }
}
return result;
}
Posted by
Sam
on Jun 14, 2007 at 12:14 PM UTC - 6 hrs
Hello,
Can you please explain, how is the implementation of "getIntsFromStringArray" method,
"private static Integer[] getIntsFromStringArray(String[] row)"
Posted by Rohidas
on Jun 19, 2007 at 05:57 AM UTC - 6 hrs
If I understod correctly this "getIntsFromStringArray()" method only deal with the integer datatype if I am using all the datatypes then how to deal with this method.will it affect the sorting techniques.
Posted by Rohidas
on Jun 19, 2007 at 07:33 AM UTC - 6 hrs
Well, I think you can really only sort on one data type per column, or else it has to be a string. You might be able to think of some way around that, but I don't think it follows that dates are less than or greater than strings or numbers or ...
The input is expected to be in string format, so if you want to sort on strings, just leave out the getIntsFromStringArray(). If you want to sort on doubles, floats, dates, or whatever, you just need to write a function that converts the string representation to the correct data type. That's all getIntsFromStringArray does.
Basically, it does this (not tested, so you'll need to make it work if it doesn't):
private int[] getIntsFromStringArray(String[] arr)
{
int[] result = new int[arr.length];
for (int i=0; i<arr.length; i++) {
result[i]=Integer.parseInt(arr[i]);
}
return result;
}
Of course, you'll probably want to do some error checking to make sure it doesn't crash if arr[i] is not able to be converted to an int.
Posted by
Sam
on Jun 19, 2007 at 12:41 PM UTC - 6 hrs
who knows how to sort date using quick sort???? send it to my email add..... blueglenn3@yahoo.com
Posted by glenn
on Jul 08, 2007 at 11:05 PM UTC - 6 hrs
Hi Glenn. Assuming the date is originally in string format, you'll need to parse it to extract a Date or (more likely) Calendar object. Then sort as normal, using comparisons on the dates.
Posted by
Sam
on Jul 08, 2007 at 11:13 PM UTC - 6 hrs
any source code for date pal???? the date sorting?????
Posted by glenn
on Jul 10, 2007 at 07:46 AM UTC - 6 hrs
input int and then display string u know that ????
Posted by glenn
on Jul 12, 2007 at 12:13 AM UTC - 6 hrs
Sorry, I don't understand what you are asking.
Posted by
Sam
on Jul 12, 2007 at 06:26 AM UTC - 6 hrs
a prgram that you can input integer then it outputs a string like for example input 1 then the computer outputs "one" a string pal. from int it become a string you know that??????
Posted by glenn
on Jul 13, 2007 at 11:09 AM UTC - 6 hrs
How about if(input==1) output("one") ... and so forth?
Posted by
Sam
on Jul 13, 2007 at 12:36 PM UTC - 6 hrs
that's to long pal im looking for an altenative for that hehe... how about the user inputs 2,000,000 do i need to write a code until 2 million " if(input==2,000,000)output("two million") it so long and hard to write can i use a loop for that??/
Posted by glenn
on Jul 13, 2007 at 11:43 PM UTC - 6 hrs
Oh I see. I was thinking it might output "two zero zero..."
I suppose you could do it up to 15, then it becomes x+"teen" for 5<x<20. Then "twenty"+x for x betwwen 1 and 9, and so on for each twenty, thirty, forty, ... ninety. Then hundreds, and so on. Basically how we learn to count - we only need to know 1-15, 20, 30, 40, 50, 60, 70, 80, 90, 100, 1000, 1 million, 1 billion, 1 trillion, and so on.
Of course, you cannot encode all of it, but you could stop at some really high number and just call it one-thousand-trillion instead of the next highest word for it.
So really there is only 28 cases if you did that. That's still a lot of cases, so you might look to replace the conditionals with a table lookup.
Posted by
Sam
on Jul 14, 2007 at 07:34 AM UTC - 6 hrs
thanks pal somewhat i can see the logic.... Are you a IT graduate?? if not what is your course.....
Posted by glenn
on Jul 14, 2007 at 10:19 AM UTC - 6 hrs
Yes, I'm working on my masters in computer science at the moment.
Posted by
Sam
on Jul 14, 2007 at 10:45 AM UTC - 6 hrs
Tigris,
The above should work just fine - you might need to tailor some things to get it to work, but it should work fine.
If you'd like an overview of the algorithm in pseudocode, see
http://www.codeodor.com/index.cfm/2007/5/10/Sortin...
Posted by
Sam
on Nov 02, 2007 at 05:40 AM UTC - 6 hrs
Thanks Sam
I started write code, I'm a bit slow.
What is flattenArray's job in the code.
Can you write the whole code?
Special Thanks again.
Posted by Tigris07
on Nov 05, 2007 at 04:06 AM UTC - 6 hrs
Flatten array just takes an array and joins each element, separated by the 2nd parameter, into a string. Something like:
String flattenArray(Array a, String separator)
{
String result = "";
for(i=0; i<a.length; i++)
result+=a[i] + separator;
return result;
}
Posted by
Sam
on Nov 05, 2007 at 06:05 AM UTC - 6 hrs
Unix sort provides for out-of-memory sort, but it's been lacking in the Java world. Thanks.
Would you be willing to contribute your code to the public domain so lots of people can use it?
Posted by
George
on Nov 30, 2007 at 09:13 AM UTC - 6 hrs
Bugfix: getIndexForColumn()
Instead of testing for pointer equality with "val == arr[i]" it should test for string equality with "val.equals(arr[i])".
Optimization: mergeFiles()
For each line, it performs a linear search to determine which file has the minimum line. And we're facing files with trillions of lines and probably many files to merge. Instead, it'd be good to use a heap. Lotta code to write in the world.
-George
Posted by
George
on Nov 30, 2007 at 09:42 AM UTC - 6 hrs
Hi George, thanks for pointing those things out -- I often forget the equality and normally catch it in the test, but for the comment I didn't look at the original code, just wrote what I thought it would be, so I never ever ran it, much less tested it =).
Re: opening it up to public domain: I'm happy to let whoever wants to use it use it, but to make it completely useful I'd need to reconstruct it from this post as the computer it resided on is no longer with me. That's ok though, because it was all just in the main class, so it'd need to be rewritten to make it useful and I'd have to add sorting for other data types as well.
So, consider it public domain, but I wouldn't want to actually start a project because I wouldn't be available to fix it to get it to a releasable state. Now if /you/ want to do all that, I certainly have no qualms about it. =)
Thanks again for the important comments!
Posted by
Sammy Larbi
on Nov 30, 2007 at 10:28 AM UTC - 6 hrs
Of course, if someone were to do that and base it on the code here, I wouldn't particularly mind if they gave credit to where it came from. =)
Posted by
Sammy Larbi
on Nov 30, 2007 at 10:30 AM UTC - 6 hrs
good to go thru
Posted by shamli
on May 23, 2008 at 09:03 AM UTC - 6 hrs
I need to sort large csv files (several M records) for multiple key columns.
e.g.:
Head1;Head2;Head3;Head4;Head5;Head6
Val11;Val12;Val13;Val14;Val15;Val16
Val21;Val22;Val23;Val24;Val25;Val26
Val31;Val32;Val33;Val34;Val35;Val36
Val41;Val42;Val43;Val44;Val45;Val46
...
should be sorted for Head4;Head2;Head5
I dont want to sort the large file multiple times in opposite order of the sort columns. Have you any suggestion, how i can do this in one external sort?
Thanks
Posted by Dominik
on Jul 10, 2008 at 11:10 AM UTC - 6 hrs
Dominik,
Unfortunately, no I cannot think of any way to sort a file multiple times without sorting it multiple times. =)
Perhaps I misunderstood the question?
Posted by
Sammy Larbi
on Jul 10, 2008 at 12:11 PM UTC - 6 hrs
I think of an Comparator, which takes more than one column for key compare. I think there must be an efficienter way to sort a file with the sort criterion of more than one column, in stead of starting the algorithm multiple times.
Posted by Dominik
on Jul 11, 2008 at 02:07 AM UTC - 6 hrs
Are you talking about changing the order of the columns (scenario 1), or about sorting first on one column (scenario 2), and then sorting the duplicates by another column?
For example:
a 2 x
a 2 y
b 3 z
a 1 x
In scenario 1 may be:
2 a x
2 a y
3 b z
1 a x
The 1st and 2nd columns switched.
Or, scenario 2, sorted first by column 1, then by column 2, then by column 3:
a 1 x
a 2 x
a 2 y
b 3 z
In scenario 2, which I thought is what you meant, if we say sorting one column is O(x), then sorting m columns is always going to be O(mx) - that is, there will be a constant increase in time.
You may be able to achieve constant reductions in time with clever code, which in your case of millions of rows will be worthwhile, but I don't think you can get a smaller complexity class.
Even if you move all the operations into 1 loop instead of 3 (for instance), I don't think it would make much difference. It may be worth a try. If the number of duplicates for each value in a column is small enough, you could look for that and perform it in memory instead of on disk, and that would be an outstanding increase in speed - but that will only work if you know for sure a priori that that is the case.
I could be wrong, but those are my thoughts just looking at a fairly shallow level.
Posted by
Sammy Larbi
on Jul 11, 2008 at 07:53 AM UTC - 6 hrs
I meant scenario 2. Well i can't ensure that the number of duplicates in one of the sort columns is small enough everytime.
Perhaps i can achive the most performance increase by doing some analysis, how to define the optimal number of rows to read at a time.
Thanks, Dominik
Posted by Dominik
on Jul 15, 2008 at 12:36 PM UTC - 6 hrs
There's a limited amount of file descriptors, so the method won't work on very large files. If we look at Tigris07's example, 780MB of text, 256MB or RAM, we can estimate that the program will need to open more than 3000 files simultaneously for the merge-part. Depending on the underlying system and on the JVM, this will fail.
The solution is to merge a smaller amount of files at a time, which makes the code a bit more complex and adds another iteration (or a third or fourth, if we want to sort huge files).
Posted by
Toke Eskildsen
on Jul 30, 2008 at 04:25 AM UTC - 6 hrs
Good point Toke. If you're going to use this in production, that's something you'll need to take into account. I may update it in the future, but as this was meant to illustrate the concept, I'll leave it for now.
Posted by
Sammy Larbi
on Jul 30, 2008 at 07:48 AM UTC - 6 hrs
The point might be valid enough, but I confused gigabytes with megabytes in the Tigtis07-case where the sorter will work just fine. If he allocates, say, 50MB for the sorter, it will require just 17 file-handles.
Another thing to consider is the number of seeks across the drive. I haven't done any tests on this, but I would guess that reading from 100's of files nearly simultaneously could confuse the disk-read-ahead-cache. It might be faster to merge fewer files at a time and take the hit of another iteration.
Posted by
Toke Eskildsen
on Jul 31, 2008 at 02:22 AM UTC - 6 hrs
Right now it takes just 10k rows into memory and creates a file for each row. Ideally you'd write it in such a way that it checked how much memory was available and used as much as would be comfortable (that's pretty arbitrary, but I think it gets the point across), and then write the files.
I think that would provide a pretty decent balance in most cases.
However, it wouldn't be hard to change it to just open a few files at a time, as you say.
Posted by
Sammy Larbi
on Jul 31, 2008 at 07:16 AM UTC - 6 hrs
Hi,
I couldn't do the multiway merging in Java.Can you help me,pls send me a Java code of the p-way merge sort.
Posted by aylin
on Dec 17, 2008 at 10:07 AM UTC - 6 hrs
Hi,
first thanks for this job. but what is "attribute"? can you clearly define it?
Posted by jackie
on Dec 18, 2008 at 08:48 AM UTC - 6 hrs
attribute is the attribute on which to sort - the column name, if you will.
Posted by
Sammy Larbi
on Dec 18, 2008 at 09:43 AM UTC - 6 hrs
Sorry but I still cant imagine what you mean? and "column name" doesnt mean anything to me. Can you expalin it by a specific example?
Posted by jackie
on Dec 18, 2008 at 12:58 PM UTC - 6 hrs
for example ?n the getIndexForColumn method, you write:
if (val==arr[i])
and as I understand here "val" is actually "attribute".(from the call getIndexForColumn(header, attribute)) And I dont understand what is actually "val" is and what it is equal to??
Posted by jackie
on Dec 18, 2008 at 01:04 PM UTC - 6 hrs
Suppose you have a file with a header row and data:
x y z
a 2 3
a 8 1
b 2 9
d e 2
Well, you want to sort by the column named "z". Therefore, you'd use "z" as the attribute.
Yes, attribute would be passed in to find which array index matches "z", in this case it would return 2. But you could also use it in general, without worrying about attribute. It's just a function that returns the first index in an array where the value matches val.
Hope that helps.
Posted by
Sammy Larbi
on Dec 18, 2008 at 02:00 PM UTC - 6 hrs
yeah, its clear now, thanks!
Posted by jackie
on Dec 18, 2008 at 02:41 PM UTC - 6 hrs
Leave a comment