In my effort to learn more about Node.js, I implemented a simple price lookup web application using Node and Express. This application prompts users to enter the name of some product in an input box, and then to hit a ‘search’ button. The page refreshes with price data for the queried product, if it is found.
A common problem with all search applications of this sort is that users make typos, or might not even know the exact name of the product they seek. This calls for a ‘fuzzy’ string matching algorithm. Rather than merely deciding if two strings are an exact match, a satisfactory algorithm finds a degree of matching along a sliding scale. The scale ranges from ‘exact match’ to ‘not similar whatsoever. Using such an algorithm, products named similarly to a user’s query can be presented as suggestions (Did you mean …?) if an exact match is not found.
Levenshtein distance is probably the most well known algorithm for this task. The algorithm works by constructing an n x m matrix of edit distances corresponding to the n and m characters in the two input strings being compared. Matrix entries are determined by character insertions, deletions, and substitutions based on the edit distance matrix constructed thus far. The final edit distance for the two input strings is in the matrix entry that is last filled by the algorithm. Since each matrix entry is determined based on previously determined entries (i.e. Levenshtein solutions for sub-strings of the inputs), this algorithm is a classic example of dynamic programming.
var lev = require('levenshtein'); var distance = lev.getEditDistance('kitten', 'sitting'); // 3
Since the algorithm constructs a n x m matrix, the simplest implementation is O(mn) in terms of both execution time and memory usage. It is therefore advisable to limit the maximum length of a user’s query before feeding it into the algorithm.
O(mn) applies to a single comparison. In practice, it is desirable to compare input to every possible searchable entity to find those that are most similar. This adds an additional factor to the execution time. There are likely to be many unpredictable queries, so caching the edit distance of any query/entity pair probably isn’t very useful in most applications. These hurdles need to be overcome in order to scale a Levenshtein-based solution up to many queries.
That being said, I’m sure my implementation is far from the most efficient. I welcome all feedback!