
## Practical issues

### Pooling of parameters

In smoothing trigram models by linear discounting or interpolation, the smoothing parameters λ(h) of both schemes have so far been assumed to depend on the full history h, i.e. on the word pair (u,v). The number of such word pairs is itself huge, so reducing the number of smoothing parameters is desirable. Often, the parameters are pooled or tied across different histories by setting [Jelinek & Mercer (1980)]:

λ(h) := λ(N(h))

This means that the parameters are tied across all histories h with the same count N(h), so that only one parameter per count value has to be estimated. It is straightforward to repeat the derivations for this type of tying and obtain the corresponding estimation formulae.
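This kind of tying can be sketched as follows. The histories, counts, and the schedule λ(n) = 1/(1+n) are illustrative assumptions only, not values from the text; the point is merely that all histories sharing a count N(h) share one parameter:

```python
def tie_lambdas_by_count(history_counts, lambda_of_count):
    """Map every history h to a smoothing weight that depends only on N(h)."""
    return {h: lambda_of_count(n) for h, n in history_counts.items()}

# Hypothetical histories (u, v) with their counts N(h).
history_counts = {("the", "cat"): 12, ("a", "dog"): 12, ("on", "mat"): 3}

# One free parameter per count value instead of one per history;
# the schedule 1/(1+n) is a toy choice that trusts frequent histories more.
lam = tie_lambdas_by_count(history_counts, lambda n: 1.0 / (1.0 + n))

# Histories with the same count N(h) now share the same parameter.
assert lam[("the", "cat")] == lam[("a", "dog")]
```

In practice the values λ(n) would be estimated from held-out data rather than fixed by a formula.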

Similarly, the derivations can be repeated under the stronger assumption that the parameters are independent of the history h altogether, yielding a single set of history-independent parameters.

For absolute discounting, similar formulae are obtained in the case of tying. In particular, for absolute discounting the experimental results show no degradation in perplexity when history-independent discounting parameters are used [Ney & Essen (1993)].
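A minimal sketch of absolute discounting with a single history-independent discount b, as discussed above; the toy counts, the value b = 0.5, and the uniform backoff distribution are assumptions for illustration:

```python
def absolute_discount(successor_counts, history_count, b, backoff):
    """Return p(w | h) for one history h: subtract the discount b from every
    seen count N(h, w) and give the freed mass to the backoff distribution."""
    distinct = len(successor_counts)          # number of distinct seen successors
    freed = b * distinct / history_count      # probability mass freed by discounting

    def p(w):
        seen = max(successor_counts.get(w, 0) - b, 0.0) / history_count
        return seen + freed * backoff(w)

    return p

# Toy successor counts N(h, w) for one history h, and a uniform backoff.
counts = {"cat": 3, "dog": 1}
N_h = sum(counts.values())
vocab = ["cat", "dog", "fish", "bird"]
p = absolute_discount(counts, N_h, b=0.5, backoff=lambda w: 1.0 / len(vocab))

# The discounted model is a proper distribution over the vocabulary.
assert abs(sum(p(w) for w in vocab) - 1.0) < 1e-12
```

Because b enters only through the per-history freed mass, tying it across histories costs a single parameter for the whole model.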

### Computational complexity

In a real implementation, whether for off-line purposes or in an operational prototype system, the computational complexity of the trigram or bigram model has to be considered, namely its memory requirements and access time.

• To store the trigram and bigram counts, we cannot use a standard array implementation. Even for a simple model like a bigram model, it is impossible to store the full table of conditional probabilities, which would require 20000 × 20000 = 4·10⁸ entries for a vocabulary of 20000 words. Instead, a so-called list implementation is often used, which works as follows. For each word w, we have a pointer into the actual list, and for each observed trigram (u,v,w), an entry of the list consists of two parts, namely the index pair (u,v) and the trigram count N(u,v,w). For efficiency, this organisation should be combined with the counts for the bigram models. A further reduction in memory cost is achieved by removing the singletons [Katz (1987)] from the lists of seen bigrams and trigrams. In particular, simply removing the trigram singletons reduces the storage cost dramatically without affecting the perplexity much. We give a specific example for the WSJ corpus of 39 million running words, for which experimental results will be described. Here, the storage cost is 126 Mbyte when 4-byte values are used for all pointers and counts. This storage cost can be reduced to 66 Mbyte by omitting the trigram singletons.
• For fast access to the bigram and trigram lists, suitable techniques such as binary search have to be used. Short-term storage of language model probabilities may also be appropriate, depending on the interaction between the language model and the search.
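The list organisation and binary-search access described above can be sketched as follows; the class and its names are hypothetical, and the toy counts are not from the text. Each word w owns a sorted list of ((u,v), N(u,v,w)) entries, and singleton trigrams can optionally be dropped at build time:

```python
import bisect

class TrigramList:
    """Sketch of the list implementation: for each word w, a sorted list of
    ((u, v), N(u, v, w)) entries, looked up by binary search."""

    def __init__(self, trigram_counts, drop_singletons=False):
        by_w = {}
        for (u, v, w), n in trigram_counts.items():
            if drop_singletons and n == 1:
                continue  # omitting trigram singletons saves most of the memory
            by_w.setdefault(w, []).append(((u, v), n))
        # Keep the (u, v) keys of each word sorted so lookup can use bisect.
        self._keys = {w: [k for k, _ in sorted(e)] for w, e in by_w.items()}
        self._counts = {w: [n for _, n in sorted(e)] for w, e in by_w.items()}

    def count(self, u, v, w):
        """Return N(u, v, w), or 0 for an unseen (or pruned) trigram."""
        keys = self._keys.get(w, [])
        i = bisect.bisect_left(keys, (u, v))
        if i < len(keys) and keys[i] == (u, v):
            return self._counts[w][i]
        return 0

# Toy trigram counts; ("a", "cat", "sat") is a singleton.
counts = {
    ("the", "cat", "sat"): 3,
    ("a", "cat", "sat"): 1,
    ("the", "dog", "ran"): 2,
}
full = TrigramList(counts)
pruned = TrigramList(counts, drop_singletons=True)
```

A query then costs O(log m) comparisons, where m is the number of (u,v) pairs stored for the word w; a pruned model simply answers 0 for the removed singletons.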


EAGLES SWLG SoftEdition, May 1997.