Search-related encoding and edit distance algorithms


The algorithms illustrated by the implementations linked on this page are used for a variety of purposes: The algorithms encode words in such a way as to generalise over classes of similar words, thus providing a way of finding words whose exact spelling is not known.
  1. Encoding algorithms: Three of the most well-known algorithms for this purpose are demonstrated in this application:
    1. "Phonetic" coding:
      1. Soundex: very old, dating back to the beginning of the 20th century, used to code sets of similar proper names in the same way, originally for postal and telephone services. The algorithm is now considered somewhat out-dated. Word similarity is based on the phonetic properties of sounds corresponding to the letters.
        See also the Wikipedia Soundex page, and check internet search engines for "soundex algorithm".
      2. Phonex: an improvement on Soundex, used for the same purposes.
        Check internet search engines for "phonex algorithm".
    2. "Morphological" coding:
      1. Porter Stemmer: a standard algorithm which removes suffixes (to a certain extent) from words, leaving a stem. Stem sharing is a specific feature of word forms which have the same meaning, but different endings.
        Martin Porter in person
        See also the Wikipedia Porter Stemmer page.
  2. String comparison algorithms:
    1. Edit distance:
      1. Levenshtein Edit Distance: an approach to defining the numerical distance between words in an edit space in terms of the number of insert, delete and substitute operations which are required toconvert one word into another.
        Here are several photos including Vladimir Levenshtein, at the ZiF in Bielefeld, March 2003.
        See also the short Wikipedia articles about Wladimir Josipowitsch Lewenstein and Levenshtein Distance.

        Check the internet for many other explanations and implementations. A particularly elegant recursive and short (but slow) version in the Python programming language (yes, it IS named after Monty Python) is given here:

         def levenshtein(a, b):
             if not a: return len(b)
             if not b: return len(a)
             return min(int(a[0] != b[0]) + levenshtein(a[1:], b[1:]), 1 + levenshtein(a[1:], b), 1 + levenshtein(a, b[1:]))
        
        (Source: http://en.wikibooks.org/wiki/Algorithm_implementation/Strings/Levenshtein_distance#Python)

        A classic iterative Python implementation is this longer but faster version:

        def levenshtein_iter02(a,b):
                c = {}
                n = len(a); m = len(b)
                for i in range(0,n+1):
                        c[i,0] = i
                for j in range(0,m+1):
                        c[0,j] = j
                for i in range(1,n+1):
                        for j in range(1,m+1):
                                x = c[i-1,j]+1
                                y = c[i,j-1]+1
                                if a[i-1] == b[j-1]:
                                        z = c[i-1,j-1]
                                else:
                                        z = c[i-1,j-1]+1
                                if traceflag=='on':
                                        print deletion,insertion, substution,"
        " c[i,j] = min(x,y,z) return c[n,m]
        Both Levenshtein Distance Algorithm implementations are included on the demo page.


D. Gibbon, Thu Aug 14 16:48:21 MEST 2008