My Secret Life as a Spaghetti Coder
home | about | contact | privacy statement
Suppose for the purposes of our example we have string the_string of length n, and we're trying to determine if string the_substring of length m is found within the_string.

The straightforward approach in many languages would be to use a find() or indexOf() function on the string. It might look like this:

substring_found_at_index = the_string.index(the_substring)

However, if no such method exists, the straightforward approach would be to just scan all the substrings and compare them against the_substring until you find a match. Even if the aforementioned function exists, it likely uses the same strategy:

def find_substring_pos(the_string, the_substring)
  (0..(the_string.length-1)).each do |i|
    this_sub = the_string[i,the_substring.length]
    return i if this_sub == the_substring
  end
  return nil
end

That is an O(n) function, which is normally fast enough.

Even though I'm one of the guys who camps out in line so I can be one of the first to say "don't prematurely optimize your code," there are situations where the most straightforward way to program something just doesn't work. One of those situations is where you have a long string (or set of data), and you will need to do many comparisons over it looking for substrings. Although you'll find it in many cases, an example of the need for this I've seen relatively recently occurs in bioinformatics, when searching through an organism's genome for specific subsequences. (Can you think of any other examples you've seen?)

In that case, with m much smaller than a very large n, O(m * log n) represents a significant improvement over O(n) (or worst case m*n). We can get there with a suffix array.

Of course building the suffix array takes some time - so much so that if we had to build it for each comparison, we're better off with the straightforward approach. But the idea is that we'll build it once, and reuse it many times, amortizing the cost out to "negligible" over time.

The idea of the suffix array is that you store every suffix of a string (and it's position) in a sorted array. This way, you can do a binary search for the substring in log n time. After that, you just need to compare to see if the_substring is there, and if so, return the associated index.

The Wikipedia page linked above uses the example of "abracadabra." The suffix array would store each of these suffixes, in order:
a
abra
abracadabra
acadabra
adabra
bra
bracadabra
cadabra
dabra
ra
racadabra
Below is an implementation of a suffix array in Ruby. You might want to write a more efficient sort algorithm, as I'm not sure what approach Enumerable#sort takes. Also, you might want to take into account the ability to get all substrings, not just the first one to be found.

class SuffixArray
  def initialize(the_string)
    @the_string = the_string
    @suffix_array = Array.new
    #build the suffixes
    last_index = the_string.length-1
    (0..last_index).each do |i|
      the_suffix = the_string[i..last_index]
      the_position = i
      # << is the append (or push) operator for arrays in Ruby
      @suffix_array << { :suffix=>the_suffix, :position=>the_position }
    end
      
    #sort the suffix array
    @suffix_array.sort! { |a,b| a[:suffix] <=> b[:suffix] }
  end
    
  def find_substring(the_substring)
    #uses typical binary search
    high = @suffix_array.length - 1
    low = 0
    while(low <= high)
      mid = (high + low) / 2
      this_suffix = @suffix_array[mid][:suffix]
      compare_len = the_substring.length-1
      comparison = this_suffix[0..compare_len]
      
      if   comparison > the_substring
        high = mid - 1
      elsif comparison < the_substring
        low = mid + 1
      else
        return @suffix_array[mid][:position]
      end
    end
    return nil
  end
end
  
sa = SuffixArray.new("abracadabra")
puts sa.find_substring("ac") #outputs 3

Thoughts, corrections, and improvements are always appreciated.

Update: Thanks to Walter's comment below, the return statement above has been corrected.

Hey! Why don't you make your life easier and subscribe to the full post or short blurb RSS feed? I'm so confident you'll love my smelly pasta plate wisdom that I'm offering a no-strings-attached, lifetime money back guarantee!


Comments
Leave a comment

See this question
http://stackoverflow.com/questions/5322428/finding...

Posted by biorelated on Mar 16, 2011 at 02:59 AM UTC - 5 hrs

Thanks for pointing it out to me. I answered the question there, and also added some code to have it return an array of indices that match. That code is on github: https://github.com/codeodor/miscellany/blob/master...

Posted by Sammy Larbi on Mar 16, 2011 at 03:04 PM UTC - 5 hrs

Leave a comment

Leave this field empty
Your Name
Email (not displayed, more info?)
Website

Comment:

Subcribe to this comment thread
Remember my details
Google
Web CodeOdor.com

Me
Picture of me

Topics
.NET (19)
AI/Machine Learning (14)
Answers To 100 Interview Questions (10)
Bioinformatics (2)
Business (1)
C and Cplusplus (6)
cfrails (22)
ColdFusion (78)
Customer Relations (15)
Databases (3)
DRY (18)
DSLs (11)
Future Tech (5)
Games (5)
Groovy/Grails (8)
Hardware (1)
IDEs (9)
Java (38)
JavaScript (4)
Linux (2)
Lisp (1)
Mac OS (4)
Management (15)
MediaServerX (1)
Miscellany (76)
OOAD (37)
Productivity (11)
Programming (168)
Programming Quotables (9)
Rails (31)
Ruby (67)
Save Your Job (58)
scriptaGulous (4)
Software Development Process (23)
TDD (41)
TDDing xorblog (6)
Tools (5)
Web Development (8)
Windows (1)
With (1)
YAGNI (10)

Resources
Agile Manifesto & Principles
Principles Of OOD
ColdFusion
CFUnit
Ruby
Ruby on Rails
JUnit



RSS 2.0: Full Post | Short Blurb
Subscribe by email:

Delivered by FeedBurner