
Equivalence histories and word classes

  In this section, we will consider structures that allow us to define classes (or categories) for words that exhibit similar behaviour in the context of language modelling. Word classes or parts of speech (POS) can be viewed as an attempt to cope with the problem of sparse data in language modelling [Derouault & Merialdo (1986)]. Typically, these word classes are based on syntactic-semantic concepts and are defined by linguistic experts. Generalising the concept of word similarities, we can also define word classes by using a statistical criterion, which in most cases is, but does not have to be, maximum likelihood [Jelinek et al. (1990), Jelinek (1991), Brown et al. (1992), Ney et al. (1994)].

There are two levels at which we will introduce equivalence concepts: the level of word histories $h$ and the level of single words $w$. In a sequence of training data $w_1 \ldots w_N$, the history of word $w_n$ is given by the $M$ most recent words, $h_n = w_{n-M} \ldots w_{n-1}$. The equivalence classes of the possible word histories $h$ for a word $w$ will be called states and denoted by a so-called state mapping

$$ S: \quad h \longrightarrow s = S(h) \; . $$

For the words $w$, we will use the so-called class (or category) mapping

$$ G: \quad w \longrightarrow g = G(w) \; . $$

For both the states $s$ and the word classes $g$, we define new counts:

$$ N(s,w) := \sum_{n:\, S(h_n)=s,\ w_n=w} 1 \;, \qquad N(s) := \sum_w N(s,w) \;, \qquad N(g) := \sum_{w:\, G(w)=g} N(w) \;, \qquad N(s,g) := \sum_{w:\, G(w)=g} N(s,w) \;. $$

Using only the history mapping $S$, we have the probability model:

$$ p(w|h) = p(w|S(h)) $$

with the log-likelihood function:

$$ F(S) = \sum_{n=1}^{N} \log p(w_n|S(h_n)) = \sum_{s,w} N(s,w) \log p(w|s) \;. $$
Here, as usual, the symbol $p(\cdot)$ denotes a more specific model as opposed to the general distribution $Pr(\cdot)$. Plugging in the relative frequencies $p(w|s) = N(s,w)/N(s)$ as estimates, we obtain:

$$ F(S) = \sum_{s,w} N(s,w) \log \frac{N(s,w)}{N(s)} \;. $$

This criterion has the form of an entropy, which also arises in the context of hierarchical equivalence classes and CART [Bahl et al. (1989), Breiman et al. (1984)], where a tree structure is imposed on the mapping $S$. In contrast, here no hierarchical structure is imposed on the word classes, so that no special structure for the mapping $S$ is assumed. By using equivalence states, we can reduce the number of free parameters: $|S| \cdot W$ in comparison with $W^2$ for a bigram model and $W^3$ for a trigram model, where $|S|$ is the number of equivalence states and $W$ the vocabulary size. However, these numbers are somewhat artificial, because even for large text databases the number of truly independent parameters is much smaller.
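The criterion $F(S)$ can be computed directly from counts. The following sketch assumes $M=1$, so that a history is just the predecessor word, and uses a toy corpus; both the corpus and the two example state mappings are hypothetical illustrations, not taken from the text.

```python
import math
from collections import defaultdict

# Toy corpus; with M = 1 the history of a word is simply its predecessor
# (an assumption for illustration -- any many-to-one mapping of histories
# onto states would do).
corpus = "a b a b a c a b".split()

def log_likelihood(corpus, S):
    """F(S) = sum_{s,w} N(s,w) * log(N(s,w) / N(s))."""
    N_sw = defaultdict(int)
    N_s = defaultdict(int)
    for prev, w in zip(corpus, corpus[1:]):
        s = S(prev)
        N_sw[s, w] += 1
        N_s[s] += 1
    return sum(c * math.log(c / N_s[s]) for (s, w), c in N_sw.items())

identity = lambda h: h        # one state per distinct predecessor word
collapsed = lambda h: "s0"    # a single state for all histories

# Merging states can only lower the maximum-likelihood criterion:
assert log_likelihood(corpus, collapsed) <= log_likelihood(corpus, identity)
```

The final assertion illustrates why the criterion alone favours fine state mappings on training data; the benefit of equivalence states shows up in generalisation, not in training-set likelihood.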

When now adding word classes using the mapping $G$, we have to distinguish two types of probability distribution:

- the class membership probability $p(w|g)$ of observing word $w$ given its class $g = G(w)$;
- the transition probability $p(g|s)$ of observing class $g$ given the equivalence state $s = S(h)$ of the history.

For the probability $Pr(w|h)$, we then have the decomposition:

$$ Pr(w|h) = p(w|G(w)) \cdot p(G(w)|S(h)) \;. $$
The log-likelihood function now depends on both mappings $G$ and $S$ and can be written as [Ney et al. (1994)]:

$$ F(G,S) = \sum_w N(w) \log p(w) \; + \; \sum_{s,g} N(s,g) \log \frac{p(s,g)}{p(s)\,p(g)} \;. $$
Plugging in the relative frequencies for $p(w), p(s,g), p(s), p(g)$ as maximum likelihood estimates, we have:

$$ F(G,S) = \sum_w N(w) \log \frac{N(w)}{N} \; + \; \sum_{s,g} N(s,g) \log \frac{N \cdot N(s,g)}{N(s)\,N(g)} \;. $$
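The decomposition $Pr(w|h) = p(w|G(w)) \cdot p(G(w)|S(h))$ with relative-frequency estimates can be checked on a small example. The corpus, the class map `G` and the choice of state mapping $S(h) = G(v)$ (the class of the predecessor) below are hypothetical illustrations:

```python
from collections import defaultdict

# Hypothetical toy corpus and class map, for illustration only.
corpus = "the cat sat the dog sat the cat ran".split()
G = {"the": "DET", "cat": "N", "dog": "N", "sat": "V", "ran": "V"}
S = lambda prev: G[prev]   # assume the state is the predecessor's class

# Counts over successor positions, so that N(g) = sum_{w: G(w)=g} N(w).
N_w = defaultdict(int); N_g = defaultdict(int)
N_sg = defaultdict(int); N_s = defaultdict(int)
for prev, w in zip(corpus, corpus[1:]):
    s, g = S(prev), G[w]
    N_w[w] += 1; N_g[g] += 1; N_sg[s, g] += 1; N_s[s] += 1

def p(w, prev):
    """Pr(w|h) = p(w|G(w)) * p(G(w)|S(h)) with relative frequencies."""
    s, g = S(prev), G[w]
    return (N_w[w] / N_g[g]) * (N_sg.get((s, g), 0) / N_s[s])

# The decomposition is a proper distribution: it sums to one over w.
assert abs(sum(p(w, "the") for w in G) - 1.0) < 1e-12
```

Because $G$ maps each word to exactly one class, $p(w|g)$ sums to one within each class and $p(g|s)$ sums to one over the classes, so the product is automatically normalised.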
A two-sided symmetric model can be defined by using the word classes both for the current word and its predecessor words. For a bigram $(v,w)$, we use $(g_v, g_w) := (G(v), G(w))$ to denote the corresponding word classes. For a bigram model, we then have:

$$ p(w|v) = p(w|g_w) \cdot p(g_w|g_v) \;. $$

The log-likelihood function for such a symmetric model is easily derived using the equations of the preceding paragraph:

$$ F(G) = \sum_w N(w) \log \frac{N(w)}{N} \; + \; \sum_{g_v,g_w} N(g_v,g_w) \log \frac{N \cdot N(g_v,g_w)}{N(g_v)\,N(g_w)} \;, $$
where $(g_v,g_w)$ denotes the class bigram of the word bigram $(v,w)$, and the counts $N(g_v,g_w)$, $N(g_v)$ and $N(g_w)$ are defined in the usual way. Such a symmetric model leads to a drastic reduction in the number of free parameters: $|G|^2$ probabilities for the table $p(g_w|g_v)$, $(W-|G|)$ probabilities for the table $p(w|g_w)$, and $W$ indices for the mapping $G$. In a similar way, we have for the symmetric trigram model:

$$ p(w|u,v) = p(w|g_w) \cdot p(g_w|g_u,g_v) \;, $$

with the class trigram $(g_u,g_v,g_w) := (G(u),G(v),G(w))$.
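A little arithmetic makes the parameter reduction concrete; the vocabulary and class sizes below are hypothetical:

```python
# Hypothetical sizes: a vocabulary of W = 20000 words, |G| = 200 classes.
W, num_classes = 20_000, 200

word_bigram = W * W                # full word-bigram table p(w|v)
class_bigram = num_classes ** 2    # table p(g_w | g_v)
membership = W - num_classes       # table p(w|g_w): one probability per word,
                                   # minus one normalisation constraint per class
mapping = W                        # one class index per word for G

symmetric_total = class_bigram + membership + mapping
# roughly 8 * 10^4 parameters instead of 4 * 10^8 table entries
assert symmetric_total < word_bigram // 1000
```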
So far, the assumption has been that the mappings for the word classes or the equivalence states are known. Now we describe a procedure by which such mappings can be determined automatically. This automatic procedure will be developed for the two-sided symmetric bigram model of Eq. (7.35). The task is to find a mapping $G: w \rightarrow G(w)$ that assigns each word to one of $|G|$ different word classes. Since these classes are found by a statistical clustering procedure, they are also referred to as word clusters. The perplexity of the training data, i.e. $\exp(-F(G)/N)$, is used as optimisation criterion. In the spirit of decision-directed learning [Duda & Hart (1973), p. 210], the basic concept of the algorithm is to improve the value of the optimisation criterion by local optimisations, i.e. by moving a word from one class to another in order to improve the log-likelihood criterion. Thus we obtain the algorithm shown in Table 7.5.


Start with some initial mapping $G: w \rightarrow G(w)$
Iterate until some convergence criterion is met:
    Loop over all words $w$:
        Loop over all clusters $g'$:
            Compute the likelihood obtained if $w$ is moved from its cluster $g = G(w)$ to $g'$
        Move $w$ to the cluster that yields the highest likelihood

Table 7.5: Algorithm for word clustering

More details on this algorithm and related ones can be found in [Brown et al. (1992)] and [Ney et al. (1994)]. For small amounts of training data, such automatically determined word classes may help to reduce the perplexity of a word-based bigram or trigram language model [Brown et al. (1992), Ney et al. (1994)].
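The exchange procedure of Table 7.5 can be sketched in a few lines. Since the word-unigram term of $F(G)$ does not depend on $G$, it suffices to maximise $\sum_{g,g'} N(g,g') \log N(g,g') - 2 \sum_g N(g) \log N(g)$. The sketch below re-evaluates this criterion from scratch for every candidate move, so it is a readable but slow illustration; practical implementations [Ney et al. (1994)] update the affected sums incrementally. The initialisation and corpus are hypothetical choices:

```python
import math
from collections import defaultdict

def xlogx(c):
    return c * math.log(c) if c > 0 else 0.0

def criterion(bigrams, G):
    """F(G) up to its G-independent word-unigram term:
    sum_{g,g'} N(g,g') log N(g,g') - 2 * sum_g N(g) log N(g).
    Class unigram counts are taken from the predecessor side; the two
    marginals differ only at the corpus boundary."""
    N_gg = defaultdict(int)
    N_g = defaultdict(int)
    for (v, w), c in bigrams.items():
        N_gg[G[v], G[w]] += c
        N_g[G[v]] += c
    return (sum(xlogx(c) for c in N_gg.values())
            - 2.0 * sum(xlogx(c) for c in N_g.values()))

def exchange(corpus, num_classes, iters=20):
    """Table 7.5: move each word in turn to its best cluster until
    no move improves the criterion (or iters is exhausted)."""
    bigrams = defaultdict(int)
    for v, w in zip(corpus, corpus[1:]):
        bigrams[v, w] += 1
    vocab = sorted(set(corpus))
    # Some initial mapping; round-robin assignment is one arbitrary choice.
    G = {w: i % num_classes for i, w in enumerate(vocab)}
    for _ in range(iters):
        moved = False
        for w in vocab:
            g0 = G[w]
            best_g, best_F = g0, criterion(bigrams, G)
            for g in range(num_classes):
                G[w] = g
                F = criterion(bigrams, G)
                if F > best_F + 1e-12:
                    best_g, best_F = g, F
            G[w] = best_g
            moved |= best_g != g0
        if not moved:
            break
    return G
```

By construction each move is accepted only if it increases the criterion, so the training-data log-likelihood is non-decreasing and the procedure converges to a local optimum, as with any decision-directed scheme.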



EAGLES SWLG SoftEdition, May 1997.