A little while ago I was trying to think of ways to have a program compare strings
and discover patterns among them (as opposed to knowing patterns and looking for particular ones).
Over the years, I've learned about a lot of algorithms, but there's no way I could recall
all of them. I knew I'd probably need to look at artificial intelligence, or more specifically,
machine learning. But that's all I had to go on.
At the time, I decided it would be helpful to have a list of algorithms and
data structures with short descriptions to browse and jog my memory.
Most of the problems you'll solve in your programming career don't require a lot
of thought to arrive at a correct solution. But algorithms, data structures, and
approaches to problems aren't just limited to the realm of programming. Reg Braithwaite
reminds us of another reason to have these things at your disposal -
even the problem of determining
who to hire can be reduced to Naïve Bayes Classification
And when you have those problems where
there is no human solution (how can I discover patterns in several strings
which may have hundreds of characters?), or the computer solution takes too long to
find the optimal one where good enough will do, or there just isn't necessarily a
right answer -- those are the hard ones where you aren't likely to stumble upon an
answer -- where do you turn?
It turns out, a lot of problems can be reduced to others we already know how to solve.
In fact, the basis of proving complexity class for an algorithm utilizes that: reduction of one
problem to another
will prove that if you solve the first one, you can solve the second one,
and it will be just as complex. A famous example is
I haven't yet compiled the list I spoke of above, but luckily for all of us,
Wikipedia has a good starting point
It's missing a couple that stand out in my mind
(or that have a different name I didn't look for, or multiple classifications and it
didn't make it to the one I looked at), but that's just
something I can put on my to-do list to improve. The Machine Learning category, for instance,
seems fairly light.
So just browsing a list and short description of algorithms may enlighten you as to how you
can reduce your
problem to one that's already been solved. If you can do that, you've got a solution
from someone who's probably much smarter than you are. It's as if you have
and the rest of computer
science academia on your team, and you don't even have to pay them (except, perhaps by buying
their book, or subscribing to a journal that will disseminate their knowledge).
Hey! Why don't you make your life easier and subscribe to the full post
or short blurb RSS feed? I'm so confident you'll love my smelly pasta plate
wisdom that I'm offering a no-strings-attached, lifetime money back guarantee!
Leave a comment
Good list. The data structures list off that page is good. It mentions suffix trees, a more sophisticated [I don't understand them yet!] version of the suffix arrays described in "Programming Pearls", excellent for finding commun substrings in a string. For odd (less common) forms of similarity look at "Programming Collaborative Intelligence" by Toby Segaran.
Another way to get a similarity metric is a co-occurrence matrix as per Haralick criteria in texture analysis. I've found that one a quick way to detect similar files, comparing 256*256 numbers, rather than N megabytes.
Posted by hgs
on Feb 20, 2008 at 09:33 AM UTC - 5 hrs
Leave a comment
I considered using simple diff (or Longest Common Subsequence) for my problem at hand, but I haven't been in that direction yet because we actually knew more about the pattern than I originally thought, so now we can generate all such patterns that fit a pattern-template (? I don't know the word). You could say out of all possible patterns, we are narrowed down to a finite subset of that.
I haven't seen Haralick's, matrix, but we immediately ruled out other distance matrices because they would just be entirely too slow and a bit overfitting.
Always glad to have you point me at new things though - I will check out the ones you mentioned at some point just to familiarize myself with them. Thanks!
Posted by Sammy Larbi
on Feb 20, 2008 at 02:40 PM UTC - 5 hrs