Until a few weeks ago, something I've never needed to do was sort a file that was huge - like unable to fit in memory huge. I think the basic algorithm for an external merge sort is easy enough, but it did take some thought and I didn't find much useful in a web search, so I decided it was probably worthy of posting even though it turns out to be rather simple.
Here's the basic algorithm for an external sort in English (I can provide it in Java on request, since that's what I wrote it in, but I'm just posting it in English to keep it generally useful).
-
Until finished reading the large file
-
Read a large chunk of the file into memory (large enough so that you get a lot of records, but small enough such that it will comfortably fit into memory).
-
Sort those records in memory.
-
Write them to a (new) file
-
Open each of the files you created above
-
Read the top record from each file
-
Until no record exists in any of the files (or until you have read the entirety of every file)
-
Write the smallest record to the sorted file
-
Read the next record from the file that had the smallest record
Does that make sense? I kept it in very high level language, but I'm happy to answer any questions regarding smaller details.
Update: I noticed a slight bug in the algorithm. The line "Read one record from each file"
was inside the last loop, but should have
been outside of it. The post was changed to reflect the correct way to do it.
Update 2: Here's the
Java source code for external merge sort.
Hey! Why don't you make your life easier and subscribe to the full post
or short blurb RSS feed? I'm so confident you'll love my smelly pasta plate
wisdom that I'm offering a no-strings-attached, lifetime money back guarantee!
Leave a comment
This sounds really cool. I would love to give it a go in ColdFusion and then maybe we could compare notes (for fun)?
Posted by
Ben Nadel
on May 10, 2007 at 12:36 PM UTC - 5 hrs
Sounds good to me!
Posted by
Sam
on May 10, 2007 at 04:46 PM UTC - 5 hrs
Although, I thought CF read the entire file anyway... ?
Posted by
Sam
on May 10, 2007 at 04:46 PM UTC - 5 hrs
You'll just have to wait and see ;)
Posted by
Ben Nadel
on May 10, 2007 at 04:47 PM UTC - 5 hrs
Ok, for extra credit then, do it on a CSV and sort by an arbitrary column. Then, incorporate it into all that work you were doing some time ago... =)
Posted by
Sam
on May 10, 2007 at 06:00 PM UTC - 5 hrs
Actually, I was in the middle of reading it as you sent this, but I keep getting interrupted =)
Posted by
Sam
on May 10, 2007 at 08:07 PM UTC - 5 hrs
Hi Sam, can you please post the source-code in java please as I am trying to implement the same as well.
Posted by Shankar Vasudevan
on May 13, 2007 at 12:32 PM UTC - 5 hrs
Sure Shankar, I'll dig it up tomorrow and post its location here.
Posted by
Sam
on May 13, 2007 at 12:52 PM UTC - 5 hrs
Hi Sam, could you please post a java source of this sorting algorithm? or could you send it to my e-mail: nofxdenk@gmail.com I'm trying to implement the same thing.
Thank you.
Denk.
Posted by denk
on May 13, 2007 at 04:18 PM UTC - 5 hrs
Hi Sam, could you please post a java source of this sorting algorithm? or could you send it to my e-mail: tomerab@hotmail.com I'm trying to implement the same thing.
Thank you.
Tomer.
Posted by Tomer
on May 20, 2007 at 12:35 PM UTC - 5 hrs
thanks a lot
Posted by Nasim
on Oct 16, 2007 at 07:14 PM UTC - 5 hrs
Sam,
A heap can be used to merge sorted files. First read the first number of each file and build a heap on these numbers. Then extract min from the heap and insert the next number from the same file of the min to the heap. The next element to be output from n files can be chosen in O(logn) time.
Posted by Binh Nguyen
on Aug 28, 2008 at 10:18 PM UTC - 5 hrs
@Binh- Thanks for the comment. There are a lot of ways to sort (files or otherwise), to be sure. This way was for sorting files which are too large to fit in memory.
Imagine you have to work within a small constraint of RAM - say 64MB. Perhaps you are on a mobile device. Or suppose you have a 32 bit processor which limits you to 4GB of RAM and you want to sort a file which is 8GB.
In those cases, using a heap (in memory) wouldn't allow you to sort the file, because it couldn't fit within the constraints, so you'd need to use the disk as well.
Thanks for pointing out the heap though - it's a data structure I rarely use and often forget. In fact, it was just a few months ago that someone had to remind me how one worked. =)
Posted by
Sammy Larbi
on Aug 29, 2008 at 06:05 AM UTC - 5 hrs
can you tell me what is the attriubute ?
Posted by somthing
on Nov 06, 2008 at 04:08 PM UTC - 5 hrs
Problem in algorithm.
If all of the sorted files can be opened and read at once in memory, what stops us from using any other algorithm like merge-sort where two successive blocks can be read in memory at a time, sort independently, merge sorted arrays, and write back. Do this procedure until no more data to sort. Log N iterations through file blocks will be required to sort the whole file. I dont find any thing good in algorithm by Sam except from the fact that it looks deceptively simple.
Posted by Ripunjay
on Nov 19, 2008 at 08:58 AM UTC - 5 hrs
Hi Ripunjay: "open each file" is perhaps a misnaming what really happens.
Generally when you are programming and you open a file, you in fact are opening a file pointer - NOT an entire file.
So in this case you are not opening and reading all of each file. You only read one line of each file into memory at a time.
Thanks for the opportunity to clarify. =)
Posted by
Sammy Larbi
on Nov 19, 2008 at 11:22 AM UTC - 5 hrs
Leave a comment