In smoothing trigram models by linear discounting or interpolation,
the smoothing parameters, i.e. the discounting and interpolation weights,
have been assumed to depend on the word pair uv.
The number of such word pairs is huge, so it is desirable
to reduce the number of smoothing parameters.
Often, the parameters are pooled or tied across different histories
by making them a function of the history count only, i.e. by setting
λ(h) = λ(N(h)) [Jelinek & Mercer (1980)].
This means that the parameters are tied across histories h with
the same count N(h).
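As an illustration of this kind of count-based tying, the following sketch
(in Python; the names pool_histories_by_count, tied_lambda and lambda_by_count
are illustrative, and the estimation of the parameter values themselves, e.g.
on held-out data, is not shown) groups histories by their count and looks up
one shared parameter per count value:

    from collections import defaultdict

    def pool_histories_by_count(history_counts):
        # Group histories h (here: word pairs (u, v)) by their count N(h).
        # All histories in one group share a single smoothing parameter, so
        # the number of free parameters drops from the number of distinct
        # word pairs to the number of distinct count values.
        groups = defaultdict(list)
        for h, n in history_counts.items():
            groups[n].append(h)
        return groups

    def tied_lambda(h, history_counts, lambda_by_count, default=0.5):
        # Tied parameter lambda(h) = lambda(N(h)); lambda_by_count maps a
        # count value to its estimated parameter, and the default is used
        # for histories that were not observed at all.
        return lambda_by_count.get(history_counts.get(h, 0), default)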
It is straightforward to repeat the derivations for this type of tying
and to obtain the corresponding estimation formulae.
Similar formulae result when the parameters are assumed to be
independent of the histories h altogether.
For absolute discounting, we obtain similar formulae in
the case of tying.
In particular for absolute discounting, the experimental results
show that there is no degradation in perplexity when
history-independent discounting parameters are used [Ney & Essen (1993)].
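For concreteness, one common form of absolute discounting with a single,
history-independent discount b (the notation here is illustrative and the
exact formula of the preceding section may differ in detail) is

    p(w|u,v) = max(N(u,v,w) - b, 0) / N(u,v) + b · W(u,v) / N(u,v) · β(w|v)

where W(u,v) is the number of distinct words observed after the pair (u,v)
and β(·|v) is the bigram distribution used for backing off; the result cited
above says that a single value of b can be shared by all histories without
hurting perplexity.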
In a real implementation, whether for off-line purposes
or for an operational prototype system, the
computational complexity of the trigram or bigram model
has to be considered, namely
the memory requirements and the access time.
- To store the trigram and bigram counts, we cannot use a
standard array implementation. Even for a simple model like
a bigram model, it is impossible to store
the full table of conditional probabilities, which
would require 20000 x 20000 = 400 million entries for
a vocabulary of 20000 words.
Instead, a so-called list implementation is often used,
which works as follows.
For each word w, we have a pointer into the
actual list, and for each observed trigram
(u,v,w), an entry of the list
consists of two parts, namely
the index pair (u,v) and the trigram count N(u,v,w).
For efficiency, this organisation should be combined
with the counts for the bigram models.
A further reduction in memory costs is achieved by
removing the singletons [Katz (1987)] from the list of
seen bigrams and trigrams. In particular, by simply removing
the trigram singletons, the storage costs can be reduced
dramatically without affecting the perplexity much.
We give a specific example for the WSJ corpus
of 39 million running words, for which experimental results
will be described. Here, the storage cost is
126 Mbyte when 4-byte values are used for all pointers
and counts. This storage cost can be reduced
to 66 Mbyte by omitting the trigram singletons.
- For fast access to the bigram and trigram lists,
we have to use suitable techniques, e.g. binary search
(a sketch of such a list organisation with binary-search lookup
follows this list). Short-term storage of language model probabilities
might also be appropriate, depending on the interaction between
the language model and the search.
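The following sketch illustrates one way of realising the list organisation
and its lookup (in Python; the class name TrigramCounts and the
drop_singletons flag are illustrative, and a compact implementation would
pack the entries into 4-byte integer arrays rather than Python tuples):
for each word w it keeps a sorted list of ((u,v), N(u,v,w)) entries and
retrieves a count by binary search.

    import bisect
    from collections import defaultdict

    class TrigramCounts:
        # List-style storage of trigram counts: for each word w there is one
        # sorted list whose entries hold the history pair (u, v) and the
        # count N(u, v, w).
        def __init__(self, trigram_counts, drop_singletons=False):
            # trigram_counts: iterable of ((u, v, w), count) pairs with
            # integer word indices; drop_singletons omits trigrams seen only
            # once, which cuts memory at little cost in perplexity (see above).
            per_word = defaultdict(list)
            for (u, v, w), n in trigram_counts:
                if drop_singletons and n == 1:
                    continue
                per_word[w].append(((u, v), n))
            # Sort each list by the history pair so that binary search applies.
            self._lists = {w: sorted(entries) for w, entries in per_word.items()}

        def count(self, u, v, w):
            # Return N(u, v, w), or 0 if the trigram is not stored.
            entries = self._lists.get(w)
            if not entries:
                return 0
            i = bisect.bisect_left(entries, ((u, v), 0))
            if i < len(entries) and entries[i][0] == (u, v):
                return entries[i][1]
            return 0

The bigram counts N(v,w) can be held in an analogous structure and combined
with this one, as suggested above.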