My Secret Life as a Spaghetti Coder
Until a few weeks ago, I had never needed to sort a file that was huge, as in too huge to fit in memory. The basic algorithm for an external merge sort is easy enough, but it did take some thought, and a web search didn't turn up much that was useful. So I decided it was worth posting, even though it turns out to be rather simple.

Here's the basic algorithm for an external sort in English (I can provide it in Java on request, since that's what I wrote it in, but I'm just posting it in English to keep it generally useful).

  1. Until finished reading the large file
    1. Read a large chunk of the file into memory (large enough that you get a lot of records, but small enough that it fits comfortably in memory).
    2. Sort those records in memory.
    3. Write them to a (new) file
  2. Open each of the files you created above
  3. Read the top record from each file
  4. Until no record exists in any of the files (or until you have read the entirety of every file)
    1. Write the smallest record to the sorted file
    2. Read the next record from the file that had the smallest record
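
For anyone who wants to see the steps above in code, here is a minimal sketch in Java. It is not the source I originally wrote, just an illustration under a few assumptions: each record is one line of text, records sort by plain String order, and the chunk size is given as a number of records rather than bytes. The class and method names (ExternalSortSketch, sort, splitIntoSortedChunks, mergeChunks) are made up for this example. To find the smallest record in step 4.1 it uses a PriorityQueue, which is my choice here; a simple loop over the current records would also work.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical class name; a sketch of the steps above, not a tuned implementation.
class ExternalSortSketch {

    /** Pairs the current record of a chunk file with the reader it came from. */
    private static final class Head implements Comparable<Head> {
        final String record;
        final BufferedReader source;

        Head(String record, BufferedReader source) {
            this.record = record;
            this.source = source;
        }

        @Override
        public int compareTo(Head other) {
            return record.compareTo(other.record);
        }
    }

    /** Sorts the lines of input into output, holding at most maxRecordsInMemory lines per chunk. */
    static void sort(Path input, Path output, int maxRecordsInMemory) throws IOException {
        List<Path> chunks = splitIntoSortedChunks(input, maxRecordsInMemory);
        try {
            mergeChunks(chunks, output);
        } finally {
            for (Path chunk : chunks) {
                Files.deleteIfExists(chunk); // clean up the temporary chunk files
            }
        }
    }

    /** Step 1: read a chunk, sort it in memory, write it to a new file; repeat until done. */
    static List<Path> splitIntoSortedChunks(Path input, int maxRecordsInMemory) throws IOException {
        List<Path> chunks = new ArrayList<>();
        try (BufferedReader in = Files.newBufferedReader(input)) {
            List<String> buffer = new ArrayList<>();
            String line;
            while ((line = in.readLine()) != null) {
                buffer.add(line);
                if (buffer.size() >= maxRecordsInMemory) {
                    chunks.add(writeSortedChunk(buffer));
                    buffer.clear();
                }
            }
            if (!buffer.isEmpty()) {
                chunks.add(writeSortedChunk(buffer)); // last, possibly smaller, chunk
            }
        }
        return chunks;
    }

    private static Path writeSortedChunk(List<String> records) throws IOException {
        Collections.sort(records);
        Path chunk = Files.createTempFile("extsort-chunk-", ".txt");
        Files.write(chunk, records);
        return chunk;
    }

    /** Steps 2 to 4: open every chunk, then repeatedly write out the smallest current record. */
    static void mergeChunks(List<Path> chunks, Path output) throws IOException {
        List<BufferedReader> readers = new ArrayList<>();
        PriorityQueue<Head> heads = new PriorityQueue<>();
        try (BufferedWriter out = Files.newBufferedWriter(output)) {
            // Steps 2 and 3: open each chunk and read its top record (outside the merge loop).
            for (Path chunk : chunks) {
                BufferedReader reader = Files.newBufferedReader(chunk);
                readers.add(reader);
                String first = reader.readLine();
                if (first != null) {
                    heads.add(new Head(first, reader));
                }
            }
            // Step 4: loop until every chunk has been read completely.
            while (!heads.isEmpty()) {
                Head smallest = heads.poll();
                out.write(smallest.record);
                out.newLine();
                // Read the next record from the file that had the smallest record.
                String next = smallest.source.readLine();
                if (next != null) {
                    heads.add(new Head(next, smallest.source));
                }
            }
        } finally {
            for (BufferedReader reader : readers) {
                reader.close();
            }
        }
    }
}
```

A call such as ExternalSortSketch.sort(inputPath, outputPath, 100000) would run both phases. The chunk size is the knob to tune: bigger chunks mean fewer temporary files to merge, but more memory per chunk.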

Does that make sense? I kept it very high-level, but I'm happy to answer any questions about the smaller details.

Update: I noticed a slight bug in the algorithm. The line "Read the top record from each file" was inside the last loop, but should have been outside of it. The post has been changed to reflect the correct order.

Update 2: Here's the Java source code for external merge sort.



Comments

This sounds really cool. I would love to give it a go in ColdFusion and then maybe we could compare notes (for fun)?

Posted by Ben Nadel on May 10, 2007 at 12:36 PM UTC - 6 hrs

Sounds good to me!

Posted by Sam on May 10, 2007 at 04:46 PM UTC - 6 hrs

Although, I thought CF read the entire file anyway... ?

Posted by Sam on May 10, 2007 at 04:46 PM UTC - 6 hrs

You'll just have to wait and see ;)

Posted by Ben Nadel on May 10, 2007 at 04:47 PM UTC - 6 hrs

Ok, for extra credit then, do it on a CSV and sort by an arbitrary column. Then, incorporate it into all that work you were doing some time ago... =)

Posted by Sam on May 10, 2007 at 06:00 PM UTC - 6 hrs

There is absolutely no way I am doing this for a CSV :)

But here it is for simpler data:

http://www.bennadel.com/index.cfm?dax=blog:698.vie... That was a fun little exercise. I am starving now, gotta go grab some dinner and watch some TV :D

Posted by Ben Nadel on May 10, 2007 at 07:47 PM UTC - 6 hrs

Oops, that link didn't come through so well. Let me try again:

Link: ( http://bennadel.com/index.cfm?dax=blog:698.view )

Posted by Ben Nadel on May 10, 2007 at 07:48 PM UTC - 6 hrs

Actually, I was in the middle of reading it as you sent this, but I keep getting interrupted =)

Posted by Sam on May 10, 2007 at 08:07 PM UTC - 6 hrs

Hi Sam, can you please post the source code in Java? I am trying to implement the same thing as well.

Posted by Shankar Vasudevan on May 13, 2007 at 12:32 PM UTC - 6 hrs

Sure Shankar, I'll dig it up tomorrow and post its location here.

Posted by Sam on May 13, 2007 at 12:52 PM UTC - 6 hrs

Hi Sam, could you please post a java source of this sorting algorithm? or could you send it to my e-mail: nofxdenk@gmail.com I'm trying to implement the same thing.

Thank you.

Denk.

Posted by denk on May 13, 2007 at 04:18 PM UTC - 6 hrs

Hi Sam, could you please post a java source of this sorting algorithm? or could you send it to my e-mail: tomerab@hotmail.com I'm trying to implement the same thing.

Thank you.

Tomer.

Posted by Tomer on May 20, 2007 at 12:35 PM UTC - 6 hrs

As per the update, the source can be found at http://www.codeodor.com/index.cfm/2007/5/14/Re-Sor...

Other than what is described there, you'll need to customize it to your requirements.

Posted by Sam on May 20, 2007 at 12:46 PM UTC - 6 hrs

thanks a lot

Posted by Nasim on Oct 16, 2007 at 07:14 PM UTC - 6 hrs

Thx !

Posted by Linage pvp on Jan 31, 2010 at 11:58 AM UTC - 6 hrs
