Until a few weeks ago, I had never needed to sort a file that was huge, as in too big to fit in memory. The basic algorithm for an external merge sort is easy enough, but it did take some thought, and a web search didn't turn up much that was useful, so I decided it was worth posting even though it turns out to be rather simple.
Here's the basic algorithm for an external sort in English. (I wrote it in Java and can provide that on request, but I'm posting it in English to keep it generally useful.)
- Until finished reading the large file
  - Read a large chunk of the file into memory (large enough that you get a lot of records, but small enough that it will comfortably fit into memory).
  - Sort those records in memory.
  - Write them to a (new) file.
- Open each of the files you created above.
- Read the top record from each file.
- Until no record exists in any of the files (or until you have read the entirety of every file)
  - Write the smallest record to the sorted file.
  - Read the next record from the file that had the smallest record.
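For anyone who wants something more concrete, here is a minimal Java sketch of the steps above. It is not the original source, just an illustration under a few assumptions: records are lines of text compared by natural `String` order, a "chunk" is a fixed number of lines (`maxLinesPerChunk`), and the class and method names (`ExternalSortSketch`, `splitIntoSortedChunks`, `mergeChunks`) are made up for this example. It also uses a `PriorityQueue` to pick the smallest top record instead of scanning every file on each pass, which is a convenience swapped in here and not something the steps above require.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

class ExternalSortSketch {

    // One open chunk file plus the record currently at its top.
    private static final class ChunkCursor implements Comparable<ChunkCursor> {
        final BufferedReader reader;
        String current;

        ChunkCursor(BufferedReader reader) throws IOException {
            this.reader = reader;
            this.current = reader.readLine(); // read the top record
        }

        // Moves to the next record; returns false once the chunk is exhausted.
        boolean advance() throws IOException {
            current = reader.readLine();
            return current != null;
        }

        @Override
        public int compareTo(ChunkCursor other) {
            return current.compareTo(other.current);
        }
    }

    // Sorts the lines of input into output, holding at most maxLinesPerChunk lines in memory.
    static void sort(Path input, Path output, int maxLinesPerChunk) throws IOException {
        List<Path> chunks = splitIntoSortedChunks(input, maxLinesPerChunk);
        try {
            mergeChunks(chunks, output);
        } finally {
            for (Path chunk : chunks) Files.deleteIfExists(chunk); // clean up temp files
        }
    }

    // Phase 1: read a chunk, sort it in memory, write it to a new file; repeat until EOF.
    static List<Path> splitIntoSortedChunks(Path input, int maxLinesPerChunk) throws IOException {
        List<Path> chunks = new ArrayList<>();
        try (BufferedReader in = Files.newBufferedReader(input)) {
            List<String> buffer = new ArrayList<>();
            String line;
            while ((line = in.readLine()) != null) {
                buffer.add(line);
                if (buffer.size() >= maxLinesPerChunk) {
                    chunks.add(writeSortedChunk(buffer));
                    buffer.clear();
                }
            }
            if (!buffer.isEmpty()) chunks.add(writeSortedChunk(buffer));
        }
        return chunks;
    }

    private static Path writeSortedChunk(List<String> records) throws IOException {
        Collections.sort(records);
        Path chunk = Files.createTempFile("extsort-", ".chunk");
        Files.write(chunk, records);
        return chunk;
    }

    // Phase 2: open every chunk, read its top record once (outside the loop),
    // then repeatedly write the smallest record and refill from the same chunk.
    static void mergeChunks(List<Path> chunks, Path output) throws IOException {
        PriorityQueue<ChunkCursor> heads = new PriorityQueue<>();
        List<BufferedReader> readers = new ArrayList<>();
        try (BufferedWriter out = Files.newBufferedWriter(output)) {
            for (Path chunk : chunks) {
                BufferedReader reader = Files.newBufferedReader(chunk);
                readers.add(reader);
                ChunkCursor cursor = new ChunkCursor(reader);
                if (cursor.current != null) heads.add(cursor);
            }
            while (!heads.isEmpty()) {
                ChunkCursor smallest = heads.poll();
                out.write(smallest.current);
                out.newLine();
                if (smallest.advance()) heads.add(smallest);
            }
        } finally {
            for (BufferedReader reader : readers) reader.close();
        }
    }
}
```

Calling `ExternalSortSketch.sort(input, output, 100000)` would sort `input` line by line into `output`. Counting lines is only a stand-in for "comfortably fits in memory"; with records of very different sizes you would probably want to cap the chunk by bytes instead.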
Does that make sense? I kept it very high-level, but I'm happy to answer any questions about the smaller details.
Update: I noticed a slight bug in the algorithm. The line "Read one record from each file" was inside the last loop, but should have been outside of it. The post was changed to reflect the correct way to do it.
Update 2: Here's the Java source code for external merge sort.
This sounds really cool. I would love to give it a go in ColdFusion and then maybe we could compare notes (for fun)?
Posted by Ben Nadel
on May 10, 2007 at 12:36 PM UTC - 6 hrs
Sounds good to me!
Posted by Sam
on May 10, 2007 at 04:46 PM UTC - 6 hrs
Although, I thought CF read the entire file anyway... ?
Posted by Sam
on May 10, 2007 at 04:46 PM UTC - 6 hrs
You'll just have to wait and see ;)
Posted by Ben Nadel
on May 10, 2007 at 04:47 PM UTC - 6 hrs
Ok, for extra credit then, do it on a CSV and sort by an arbitrary column. Then, incorporate it into all that work you were doing some time ago... =)
Posted by Sam
on May 10, 2007 at 06:00 PM UTC - 6 hrs
Actually, I was in the middle of reading it as you sent this, but I keep getting interrupted =)
Posted by Sam
on May 10, 2007 at 08:07 PM UTC - 6 hrs
Hi Sam, can you please post the source-code in java please as I am trying to implement the same as well.
Posted by Shankar Vasudevan
on May 13, 2007 at 12:32 PM UTC - 6 hrs
Sure Shankar, I'll dig it up tomorrow and post its location here.
Posted by Sam
on May 13, 2007 at 12:52 PM UTC - 6 hrs
Hi Sam, could you please post a java source of this sorting algorithm? or could you send it to my e-mail: nofxdenk@gmail.com I'm trying to implement the same thing.
Thank you.
Denk.
Posted by denk
on May 13, 2007 at 04:18 PM UTC - 6 hrs
Hi Sam, could you please post a java source of this sorting algorithm? or could you send it to my e-mail: tomerab@hotmail.com I'm trying to implement the same thing.
Thank you.
Tomer.
Posted by Tomer
on May 20, 2007 at 12:35 PM UTC - 6 hrs
thanks a lot
Posted by Nasim
on Oct 16, 2007 at 07:14 PM UTC - 6 hrs
Thx !
Posted by Linage pvp
on Jan 31, 2010 at 11:58 AM UTC - 6 hrs