Andrei Mackenzie is a software engineer at Clypd.

20 May 2011

In my effort to learn more about "Node.js":http://nodejs.org, I implemented a simple price lookup web application using Node and "Express":http://expressjs.com. This application prompts users to enter the name of some product in an input box, and then to hit a 'search' button. The page refreshes with price data for the queried product, if it is found.
A common problem with all search applications of this sort is that users make typos, or might not even know the exact name of the product they seek. This calls for a 'fuzzy' string matching algorithm. Rather than merely deciding if two strings are an exact match, a satisfactory algorithm finds a _degree_ of matching along a sliding scale. The scale ranges from 'exact match' to 'not similar whatsoever. Using such an algorithm, products named similarly to a user's query can be presented as suggestions (_Did you mean ...?_) if an exact match is not found.
"Levenshtein distance":http://en.wikipedia.org/wiki/Levenshtein_distance is probably the most well known algorithm for this task. The algorithm works by constructing an *n* x *m* matrix of edit distances corresponding to the *n* and *m* characters in the two input strings being compared. Matrix entries are determined by character _insertions_, _deletions_, and _substitutions_ based on the edit distance matrix constructed thus far. The final edit distance for the two input strings is in the matrix entry that is last filled by the algorithm. Since each matrix entry is determined based on previously determined entries (i.e. Levenshtein solutions for sub-strings of the inputs), this algorithm is a classic example of dynamic programming.
Check out this "gist":https://gist.github.com/982927 for a JavaScript implementation of Levenshtein distance. The file is formatted as a Node.js module. Here's an example, assuming @levenshtein.js@ is in the same directory as your JavaScript file:
h3. Caveats
Since the algorithm constructs a *n* x *m* matrix, the simplest implementation is _O(mn)_ in terms of both execution time and memory usage. It is therefore advisable to limit the maximum length of a user's query before feeding it into the algorithm.
_O(mn)_ applies to a single comparison. In practice, it is desirable to compare input to every possible searchable entity to find those that are most similar. This adds an additional factor to the execution time. There are likely to be many unpredictable queries, so caching the edit distance of any query/entity pair probably isn't very useful in most applications. These hurdles need to be overcome in order to scale a Levenshtein-based solution up to many queries.
That being said, I'm sure my implementation is far from the most efficient. I welcome all feedback!