Due to a couple of requests in the comments about
Sorting really BIG files,
I'm posting the Java source code that does the job.
First, we have the main
externalSort algorithm. It expects parameters
relation
(the name of a table, which in this case is the .csv file to sort without the ".csv" in the file name)
and the column to sort on (it checks the first row of the .csv file for the value passed). Now, of course
that is easy enough to change to fit your own needs - it worked fine for me and my CSVs.
There is a magic number in there that you might want to pull out: 10000. That is the number of rows to
read at a time.
Clearly, ten thousand rows is way smaller than we would want to sort on - depending on how many characters
are in each row, we could probably quite easily fit 10 million rows into memory. But, I didn't want to
spend time generating 100 million row files to test this on - so I just tried it with 100 thousand or so.
For a real use scenario, you'd probably want to go with something bigger - if not even calculate the size
and average row length.
private static void externalSort(String relation, String attribute)
{
try
{
FileReader intialRelationInput = new FileReader(relation + ".csv");
BufferedReader initRelationReader = new BufferedReader(intialRelationInput);
String [] header = initRelationReader.readLine().split(",");
String [] row = header;
int indexToCompare = getIndexForColumn(header,attribute);
ArrayList<Integer[]> tenKRows = new ArrayList<Integer[]>();
int numFiles = 0;
while (row!=null)
{
// get 10k rows
for(int i=0; i<10000; i++)
{
String line = initRelationReader.readLine();
if (line==null)
{
row = null;
break;
}
row = line.split(",");
tenKRows.add(getIntsFromStringArray(row));
}
// sort the rows
tenKRows = mergeSort(tenKRows, indexToCompare);
// write to disk
FileWriter fw = new FileWriter(relation + "_chunk" + numFiles + ".csv");
BufferedWriter bw = new BufferedWriter(fw);
bw.write(flattenArray(header,",")+"\n");
for(int i=0; i<tenKRows.size(); i++)
{
bw.append(flattenArray(tenKRows.get(i),",")+"\n");
}
bw.close();
numFiles++;
tenKRows.clear();
}
mergeFiles(relation, numFiles, indexToCompare);
initRelationReader.close();
intialRelationInput.close();
}
catch (Exception ex)
{
ex.printStackTrace();
System.exit(-1);
}
}
Now, there were a couple of functions that the
externalSort() method depended on.
First we see the
getIndexForColumn. This method just wraps code that finds the index
where the value of
attribute occurs in the array
header. That's
easy enough to write, so I won't add to the clutter and post it here.
Next, we have
getIntsFromStringArray. This is also simple and just uses
parseInt
on each index in the array and returns a new one for
ints. I'm assuming these are all integers,
so you may need to change that line to deal with
Strings or other data types.
Then we have the
mergeSort algorithm, which is shown below.
flattenArray is
also omitted here because it simply takes an array, puts a comma between each element, and returns
a
String. Finally,
mergeFiles is also shown below.
// sort an arrayList of arrays based on the ith column
private static ArrayList<Integer[]> mergeSort(ArrayList<Integer[]> arr, int index)
{
ArrayList<Integer[]> left = new ArrayList<Integer[]>();
ArrayList<Integer[]> right = new ArrayList<Integer[]>();
if(arr.size()<=1)
return arr;
else
{
int middle = arr.size()/2;
for (int i = 0; i<middle; i++)
left.add(arr.get(i));
for (int j = middle; j<arr.size(); j++)
right.add(arr.get(j));
left = mergeSort(left, index);
right = mergeSort(right, index);
return merge(left, right, index);
}
}
// merge the the results for mergeSort back together
private static ArrayList<Integer[]> merge(ArrayList<Integer[]> left, ArrayList<Integer[]> right, int index)
{
ArrayList<Integer[]> result = new ArrayList<Integer[]>();
while (left.size() > 0 && right.size() > 0)
{
if(left.get(0)[index] <= right.get(0)[index])
{
result.add(left.get(0));
left.remove(0);
}
else
{
result.add(right.get(0));
right.remove(0);
}
}
if (left.size()>0)
{
for(int i=0; i<left.size(); i++)
result.add(left.get(i));
}
if (right.size()>0)
{
for(int i=0; i<right.size(); i++)
result.add(right.get(i));
}
return result;
}
There is nothing special about the
mergeSort and
merge methods above. They simply
follow the well known
merge sort algorithm, accounting
for the fact that we are sorting on one element in an array of arrays, rather that an array with only one
dimension.
Finally, we have the
mergeFiles function, which simply reads all the files, one row at a time,
and puts the smallest row at the top of the sorted file.
private static void mergeFiles(String relation, int numFiles, int compareIndex)
{
try
{
ArrayList<FileReader> mergefr = new ArrayList<FileReader>();
ArrayList<BufferedReader> mergefbr = new ArrayList<BufferedReader>();
ArrayList<Integer[]> filerows = new ArrayList<Integer[]>();
FileWriter fw = new FileWriter(relation + "_sorted.csv");
BufferedWriter bw = new BufferedWriter(fw);
String [] header;
boolean someFileStillHasRows = false;
for (int i=0; i<numFiles; i++)
{
mergefr.add(new FileReader(relation+"_chunk"+i+".csv"));
mergefbr.add(new BufferedReader(mergefr.get(i)));
// get each one past the header
header = mergefbr.get(i).readLine().split(",");
if (i==0) bw.write(flattenArray(header,",")+"\n");
// get the first row
String line = mergefbr.get(i).readLine();
if (line != null)
{
filerows.add(getIntsFromStringArray(line.split(",")));
someFileStillHasRows = true;
}
else
{
filerows.add(null);
}
}
Integer[] row;
int cnt = 0;
while (someFileStillHasRows)
{
Integer min;
int minIndex = 0;
row = filerows.get(0);
if (row!=null) {
min = row[compareIndex];
minIndex = 0;
}
else {
min = null;
minIndex = -1;
}
// check which one is min
for(int i=1; i<filerows.size(); i++)
{
row = filerows.get(i);
if (min!=null) {
if(row!=null && row[compareIndex] < min)
{
minIndex = i;
min = filerows.get(i)[compareIndex];
}
}
else
{
if(row!=null)
{
min = row[compareIndex];
minIndex = i;
}
}
}
if (minIndex < 0) {
someFileStillHasRows=false;
}
else
{
// write to the sorted file
bw.append(flattenArray(filerows.get(minIndex),",")+"\n");
// get another row from the file that had the min
String line = mergefbr.get(minIndex).readLine();
if (line != null)
{
filerows.set(minIndex,getIntsFromStringArray(line.split(",")));
}
else
{
filerows.set(minIndex,null);
}
}
// check if one still has rows
for(int i=0; i<filerows.size(); i++)
{
someFileStillHasRows = false;
if(filerows.get(i)!=null)
{
if (minIndex < 0)
{
puts ("mindex lt 0 and found row not null" + flattenArray(filerows.get(i)," "));
System.exit(-1);
}
someFileStillHasRows = true;
break;
}
}
// check the actual files one more time
if (!someFileStillHasRows)
{
//write the last one not covered above
for(int i=0; i<filerows.size(); i++)
{
if (filerows.get(i) == null)
{
String line = mergefbr.get(i).readLine();
if (line!=null)
{
someFileStillHasRows=true;
filerows.set(i,getIntsFromStringArray(line.split(",")));
}
}
}
}
}
// close all the files
bw.close();
fw.close();
for(int i=0; i<mergefbr.size(); i++)
mergefbr.get(i).close();
for(int i=0; i<mergefr.size(); i++)
mergefr.get(i).close();
}
catch (Exception ex)
{
ex.printStackTrace();
System.exit(-1);
}
}
Ideally, you'd want to do better abstraction on a lot of this code and put it into different methods.
Like Ben Nadel did in his
ColdFusion/Java version of this,
I'm keeping it inline so you don't need to track over to a function to see exactly what it does (though
a good name would certainly tell you without the details - which is what abstraction is about!). In
particular, I'd move all the file-specific operations into abstractions which better describe what
they're doing in this particular domain. It would really help to better understand what's going on
in the
mergeFiles method at a glance, rather than reading all the details.
Questions? Comments?
Hey! Why don't you make your life easier and subscribe to the full post
or short blurb RSS feed? I'm so confident you'll love my smelly pasta plate
wisdom that I'm offering a no-strings-attached, lifetime money back guarantee!
Leave a comment
can you please send me the full code with all the methods?
tomerab@hotmail.com
Posted by Tomer
on May 20, 2007 at 03:02 PM UTC - 6 hrs
Are you missing the code for the method "getIndexForColumn" :) ?
A part this, Great Work!
Posted by Andrew
on Jun 13, 2007 at 02:08 PM UTC - 6 hrs
Hi Andrew - indeed, I left it out because it didn't add much value for the clutter it provided. That method takes a string array and a string, and just returns the index where it found the value. Basically, something like (I haven't tested this, but I think you'll get the idea):
private int getIndexForColumn(String [] arr, String val)
{
int result = -1;
for(int i=0; i<arr.length; i++)
{
if (val==arr[i]) { result=i; break; }
}
return result;
}
Posted by
Sam
on Jun 14, 2007 at 12:14 PM UTC - 6 hrs
Hello,
Can you please explain, how is the implementation of "getIntsFromStringArray" method,
"private static Integer[] getIntsFromStringArray(String[] row)"
Posted by Rohidas
on Jun 19, 2007 at 05:57 AM UTC - 6 hrs
If I understod correctly this "getIntsFromStringArray()" method only deal with the integer datatype if I am using all the datatypes then how to deal with this method.will it affect the sorting techniques.
Posted by Rohidas
on Jun 19, 2007 at 07:33 AM UTC - 6 hrs
Well, I think you can really only sort on one data type per column, or else it has to be a string. You might be able to think of some way around that, but I don't think it follows that dates are less than or greater than strings or numbers or ...
The input is expected to be in string format, so if you want to sort on strings, just leave out the getIntsFromStringArray(). If you want to sort on doubles, floats, dates, or whatever, you just need to write a function that converts the string representation to the correct data type. That's all getIntsFromStringArray does.
Basically, it does this (not tested, so you'll need to make it work if it doesn't):
private int[] getIntsFromStringArray(String[] arr)
{
int[] result = new int[arr.length];
for (int i=0; i<arr.length; i++) {
result[i]=Integer.parseInt(arr[i]);
}
return result;
}
Of course, you'll probably want to do some error checking to make sure it doesn't crash if arr[i] is not able to be converted to an int.
Posted by
Sam
on Jun 19, 2007 at 12:41 PM UTC - 6 hrs
who knows how to sort date using quick sort???? send it to my email add..... blueglenn3@yahoo.com
Posted by glenn
on Jul 08, 2007 at 11:05 PM UTC - 6 hrs
Hi Glenn. Assuming the date is originally in string format, you'll need to parse it to extract a Date or (more likely) Calendar object. Then sort as normal, using comparisons on the dates.
Posted by
Sam
on Jul 08, 2007 at 11:13 PM UTC - 6 hrs
any source code for date pal???? the date sorting?????
Posted by glenn
on Jul 10, 2007 at 07:46 AM UTC - 6 hrs
input int and then display string u know that ????
Posted by glenn
on Jul 12, 2007 at 12:13 AM UTC - 6 hrs
Sorry, I don't understand what you are asking.
Posted by
Sam
on Jul 12, 2007 at 06:26 AM UTC - 6 hrs
a prgram that you can input integer then it outputs a string like for example input 1 then the computer outputs "one" a string pal. from int it become a string you know that??????
Posted by glenn
on Jul 13, 2007 at 11:09 AM UTC - 6 hrs
How about if(input==1) output("one") ... and so forth?
Posted by
Sam
on Jul 13, 2007 at 12:36 PM UTC - 6 hrs
that's to long pal im looking for an altenative for that hehe... how about the user inputs 2,000,000 do i need to write a code until 2 million " if(input==2,000,000)output("two million") it so long and hard to write can i use a loop for that??/
Posted by glenn
on Jul 13, 2007 at 11:43 PM UTC - 6 hrs
Oh I see. I was thinking it might output "two zero zero..."
I suppose you could do it up to 15, then it becomes x+"teen" for 5<x<20. Then "twenty"+x for x betwwen 1 and 9, and so on for each twenty, thirty, forty, ... ninety. Then hundreds, and so on. Basically how we learn to count - we only need to know 1-15, 20, 30, 40, 50, 60, 70, 80, 90, 100, 1000, 1 million, 1 billion, 1 trillion, and so on.
Of course, you cannot encode all of it, but you could stop at some really high number and just call it one-thousand-trillion instead of the next highest word for it.
So really there is only 28 cases if you did that. That's still a lot of cases, so you might look to replace the conditionals with a table lookup.
Posted by
Sam
on Jul 14, 2007 at 07:34 AM UTC - 6 hrs
thanks pal somewhat i can see the logic.... Are you a IT graduate?? if not what is your course.....
Posted by glenn
on Jul 14, 2007 at 10:19 AM UTC - 6 hrs
Yes, I'm working on my masters in computer science at the moment.
Posted by
Sam
on Jul 14, 2007 at 10:45 AM UTC - 6 hrs
Tigris,
The above should work just fine - you might need to tailor some things to get it to work, but it should work fine.
If you'd like an overview of the algorithm in pseudocode, see
http://www.codeodor.com/index.cfm/2007/5/10/Sortin...
Posted by
Sam
on Nov 02, 2007 at 05:40 AM UTC - 6 hrs
Thanks Sam
I started write code, I'm a bit slow.
What is flattenArray's job in the code.
Can you write the whole code?
Special Thanks again.
Posted by Tigris07
on Nov 05, 2007 at 04:06 AM UTC - 6 hrs
Flatten array just takes an array and joins each element, separated by the 2nd parameter, into a string. Something like:
String flattenArray(Array a, String separator)
{
String result = "";
for(i=0; i<a.length; i++)
result+=a[i] + separator;
return result;
}
Posted by
Sam
on Nov 05, 2007 at 06:05 AM UTC - 6 hrs
Unix sort provides for out-of-memory sort, but it's been lacking in the Java world. Thanks.
Would you be willing to contribute your code to the public domain so lots of people can use it?
Posted by
George
on Nov 30, 2007 at 09:13 AM UTC - 6 hrs
Bugfix: getIndexForColumn()
Instead of testing for pointer equality with "val == arr[i]" it should test for string equality with "val.equals(arr[i])".
Optimization: mergeFiles()
For each line, it performs a linear search to determine which file has the minimum line. And we're facing files with trillions of lines and probably many files to merge. Instead, it'd be good to use a heap. Lotta code to write in the world.
-George
Posted by
George
on Nov 30, 2007 at 09:42 AM UTC - 6 hrs
Hi George, thanks for pointing those things out -- I often forget the equality and normally catch it in the test, but for the comment I didn't look at the original code, just wrote what I thought it would be, so I never ever ran it, much less tested it =).
Re: opening it up to public domain: I'm happy to let whoever wants to use it use it, but to make it completely useful I'd need to reconstruct it from this post as the computer it resided on is no longer with me. That's ok though, because it was all just in the main class, so it'd need to be rewritten to make it useful and I'd have to add sorting for other data types as well.
So, consider it public domain, but I wouldn't want to actually start a project because I wouldn't be available to fix it to get it to a releasable state. Now if /you/ want to do all that, I certainly have no qualms about it. =)
Thanks again for the important comments!
Posted by
Sammy Larbi
on Nov 30, 2007 at 10:28 AM UTC - 6 hrs
Of course, if someone were to do that and base it on the code here, I wouldn't particularly mind if they gave credit to where it came from. =)
Posted by
Sammy Larbi
on Nov 30, 2007 at 10:30 AM UTC - 6 hrs
how to sort the result that i get from the file. example:
in the file contain data as below:
----------------------------------
uname1,no1
uname2,no2
the result display as in the file, but how to produce the data randomly for example:
-----------------------------------------------------------
uname2,no2
uname1,no1
or
uname1,no2
uname2,no1
help me....
thank you..
Posted by wannie
on Dec 27, 2009 at 09:56 PM UTC - 6 hrs
Leave a comment