My Secret Life as a Spaghetti Coder
home | about | contact | privacy statement | getting started with cfrails
Until a few weeks ago, something I've never needed to do was sort a file that was huge - like unable to fit in memory huge. I think the basic algorithm for an external merge sort is easy enough, but it did take some thought and I didn't find much useful in a web search, so I decided it was probably worthy of posting even though it turns out to be rather simple.

Here's the basic algorithm for an external sort in English (I can provide it in Java on request, since that's what I wrote it in, but I'm just posting it in English to keep it generally useful).

  1. Until finished reading the large file
    1. Read a large chunk of the file into memory (large enough so that you get a lot of records, but small enough such that it will comfortably fit into memory).
    2. Sort those records in memory.
    3. Write them to a (new) file
  2. Open each of the files you created above
  3. Read the top record from each file
  4. Until no record exists in any of the files (or until you have read the entirety of every file)
    1. Write the smallest record to the sorted file
    2. Read the next record from the file that had the smallest record

Does that make sense? I kept it in very high level language, but I'm happy to answer any questions regarding smaller details.

Update: I noticed a slight bug in the algorithm. The line "Read one record from each file" was inside the last loop, but should have been outside of it. The post was changed to reflect the correct way to do it.

Update 2: Here's the Java source code for external merge sort.

Hey! Why don't you make your life easier and subscribe to the full post or short blurb RSS feed? I'm so confident you'll love my smelly pasta plate wisdom that I'm offering a no-strings-attached, lifetime money back guarantee!


Comments
Leave a comment

This sounds really cool. I would love to give it a go in ColdFusion and then maybe we could compare notes (for fun)?

Posted by Ben Nadel on May 10, 2007 at 12:36 PM UTC - 5 hrs

Sounds good to me!

Posted by Sam on May 10, 2007 at 04:46 PM UTC - 5 hrs

Although, I thought CF read the entire file anyway... ?

Posted by Sam on May 10, 2007 at 04:46 PM UTC - 5 hrs

You'll just have to wait and see ;)

Posted by Ben Nadel on May 10, 2007 at 04:47 PM UTC - 5 hrs

Ok, for extra credit then, do it on a CSV and sort by an arbitrary column. Then, incorporate it into all that work you were doing some time ago... =)

Posted by Sam on May 10, 2007 at 06:00 PM UTC - 5 hrs

There is absolutely no way I am doing this for a CSV :)

But, here is is for a more simple data:

http://www.bennadel.com/index.cfm?dax=blog:698.vie... That was a fun little exercise. I am starving now, gotta go grab some dinner and watch some TV :D

Posted by Ben Nadel on May 10, 2007 at 07:47 PM UTC - 5 hrs

Oops, that links didn't come through so well. Let me try again:

Link: ( http://bennadel.com/index.cfm?dax=blog:698.view )

Posted by Ben Nadel on May 10, 2007 at 07:48 PM UTC - 5 hrs

Actually, I was in the middle of reading it as you sent this, but I keep getting interrupted =)

Posted by Sam on May 10, 2007 at 08:07 PM UTC - 5 hrs

Hi Sam, can you please post the source-code in java please as I am trying to implement the same as well.

Posted by Shankar Vasudevan on May 13, 2007 at 12:32 PM UTC - 5 hrs

Sure Shankar, I'll dig it up tomorrow and post its location here.

Posted by Sam on May 13, 2007 at 12:52 PM UTC - 5 hrs

Hi Sam, could you please post a java source of this sorting algorithm? or could you send it to my e-mail: nofxdenk@gmail.com I'm trying to implement the same thing.

Thank you.

Denk.

Posted by denk on May 13, 2007 at 04:18 PM UTC - 5 hrs

Hi Sam, could you please post a java source of this sorting algorithm? or could you send it to my e-mail: tomerab@hotmail.com I'm trying to implement the same thing.

Thank you.

Tomer.

Posted by Tomer on May 20, 2007 at 12:35 PM UTC - 5 hrs

As per the update, the source can be found http://www.codeodor.com/index.cfm/2007/5/14/Re-Sor...

Other than what is described there, you'll need to customize it to your requirements.

Posted by Sam on May 20, 2007 at 12:46 PM UTC - 5 hrs

thanks a lot

Posted by Nasim on Oct 16, 2007 at 07:14 PM UTC - 5 hrs

Sam,
A heap can be used to merge sorted files. First read the first number of each file and build a heap on these numbers. Then extract min from the heap and insert the next number from the same file of the min to the heap. The next element to be output from n files can be chosen in O(logn) time.

Posted by Binh Nguyen on Aug 28, 2008 at 10:18 PM UTC - 5 hrs

@Binh- Thanks for the comment. There are a lot of ways to sort (files or otherwise), to be sure. This way was for sorting files which are too large to fit in memory.

Imagine you have to work within a small constraint of RAM - say 64MB. Perhaps you are on a mobile device. Or suppose you have a 32 bit processor which limits you to 4GB of RAM and you want to sort a file which is 8GB.

In those cases, using a heap (in memory) wouldn't allow you to sort the file, because it couldn't fit within the constraints, so you'd need to use the disk as well.

Thanks for pointing out the heap though - it's a data structure I rarely use and often forget. In fact, it was just a few months ago that someone had to remind me how one worked. =)

Posted by Sammy Larbi on Aug 29, 2008 at 06:05 AM UTC - 5 hrs

can you tell me what is the attriubute ?

Posted by somthing on Nov 06, 2008 at 04:08 PM UTC - 5 hrs

Problem in algorithm.

If all of the sorted files can be opened and read at once in memory, what stops us from using any other algorithm like merge-sort where two successive blocks can be read in memory at a time, sort independently, merge sorted arrays, and write back. Do this procedure until no more data to sort. Log N iterations through file blocks will be required to sort the whole file. I dont find any thing good in algorithm by Sam except from the fact that it looks deceptively simple.

Posted by Ripunjay on Nov 19, 2008 at 08:58 AM UTC - 5 hrs

Hi Ripunjay: "open each file" is perhaps a misnaming what really happens.

Generally when you are programming and you open a file, you in fact are opening a file pointer - NOT an entire file.

So in this case you are not opening and reading all of each file. You only read one line of each file into memory at a time.

Thanks for the opportunity to clarify. =)

Posted by Sammy Larbi on Nov 19, 2008 at 11:22 AM UTC - 5 hrs

Leave a comment

Leave this field empty
Your Name
Email (not displayed, more info?)
Website

Comment:

Subcribe to this comment thread
Remember my details
Google
Web CodeOdor.com

Me
Picture of me

Topics
.NET (29)
AI/Machine Learning (15)
Answers To 100 Interview Questions (9)
Bioinformatics (3)
C and C++ (10)
cfrails (22)
ColdFusion (85)
Customer Relations (21)
Databases (2)
DRY (20)
DSLs (13)
Electronics (2)
Future Tech (6)
Games (8)
Groovy/Grails (9)
Hardware (2)
IDEs (10)
Java (45)
JavaScript (6)
Lisp (3)
Mac OS (3)
Management (9)
Miscellany (67)
OOAD (42)
Programming (164)
Programming Quotables (12)
Rails (25)
Ruby (66)
Save Your Job (66)
scriptaGulous (4)
Software Development Process (31)
TDD (46)
TDDing xorblog (6)
Tools (6)
Web Development (10)
With (1)
YAGNI (12)

Resources
Agile Manifesto & Principles
Principles Of OOD
ColdFusion
CFUnit
Ruby
Ruby on Rails
JUnit



RSS 2.0: Full Post | Short Blurb
Subscribe by email:

Delivered by FeedBurner