Cache

The so-called cache model has been used successfully by a number of researchers [Kuhn & De Mori (1990), Jelinek et al. (1991b), Rosenfeld (1994)]. The cache can be viewed as a short-term memory in which the probability of the most recent words is increased. In other words, the cache model takes into account that the words of the vocabulary are not distributed homogeneously over a text, but tend to occur in clusters. The typical mathematical formulation for the cache contribution $p_C(w_n \mid w_{n-M}^{n-1})$, based on the $M$ most recent words, is as follows:

$$p_C(w_n \mid w_{n-M}^{n-1}) = \frac{1}{M} \sum_{m=1}^{M} \delta(w_n, w_{n-m})$$

where $\delta(w, v)$ denotes the Kronecker function, which is 1 if the two arguments are the same and 0 otherwise. The probability of the cache model is typically combined with the trigram model by linear interpolation. Several refinements suggest themselves:

We can introduce weights that depend on the distance, in word positions, between a cached word and the current position; typically these weights should decay smoothly to zero, which introduces a form of forgetting.
In many applications, such as the dictation of documents, the beginning of a new document is usually known, and the cache should be emptied at the document boundaries.
One can argue that the cache is most important for low-frequency words and should therefore be used only for this subset of words.
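
As an illustration, the unigram cache, its linear interpolation with a trigram probability, and the first two refinements above can be sketched in Python as follows. This is a minimal sketch, not the formulation used in any of the cited systems: the names UnigramCache and interpolate, the exponential form of the distance weights, the cache size and the interpolation weight lam are all assumptions made for illustration, and the trigram probability is assumed to be supplied by a separate model.

```python
import math
from collections import deque


class UnigramCache:
    """Short-term memory over the most recent `size` words (hypothetical helper)."""

    def __init__(self, size=200, decay=None):
        # decay=None gives uniform weights; a positive value gives
        # exponentially decaying weights (an assumed form of "forgetting").
        self.decay = decay
        self.history = deque(maxlen=size)  # oldest words drop out automatically

    def add(self, word):
        self.history.append(word)

    def reset(self):
        # Second refinement: empty the cache at a known document boundary.
        self.history.clear()

    def prob(self, word):
        # Weighted fraction of cached positions that hold `word`.
        # We normalise by the weights actually present, so a partially
        # filled cache still yields a proper distribution over cached words.
        if not self.history:
            return 0.0  # nothing cached yet; the caller should rely on the trigram alone
        total = 0.0
        match = 0.0
        for m, past in enumerate(reversed(self.history), start=1):
            # m = 1 is the most recent word; weight shrinks with distance.
            weight = 1.0 if self.decay is None else math.exp(-self.decay * (m - 1))
            total += weight
            if past == word:
                match += weight
        return match / total


def interpolate(p_trigram, p_cache, lam=0.1):
    """Linear interpolation of trigram and cache probabilities (lam is assumed)."""
    return (1.0 - lam) * p_trigram + lam * p_cache
```

With uniform weights and a full cache, prob returns the relative frequency of the word among the cached words, which corresponds to a sum of Kronecker functions divided by the cache size. The third refinement, restricting the cache to low-frequency words, would additionally need word frequencies from the training data and is not shown here.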

The cache concept considered so far is based on unigrams only. As in the case of unigrams, we can argue that word bigrams and trigrams tend to occur in clusters, too. Extensions of the unigram cache to bigrams and/or trigrams have been successfully used in [Jelinek et al. (1991b)] and [Rosenfeld (1994)]. For example, in the case of a bigram cache, the bigram counts based on the most recent history are used to compute the probabilities for the bigram cache. The cache model described here can be interpreted as a special case of so-called adaptive language models that adapt their probabilities to the most recent history, say the last 100 to 1000 predecessor words. In contrast, a non-adaptive language model does not depend on the test data, but remains unchanged as trained on the training data. For other types of adaptive language models see [Essen & Steinbiss (1992)] and [Rosenfeld (1994)].
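
The bigram cache mentioned above can be sketched in the same spirit. The text only states that bigram counts from the most recent history are used; the plain relative-frequency estimate below, the class name BigramCache, and the convention of returning None when the predecessor word has not been seen in the cache are assumptions made for illustration.

```python
from collections import deque


class BigramCache:
    """Keeps the most recent `size` word pairs (hypothetical helper)."""

    def __init__(self, size=200):
        self.pairs = deque(maxlen=size)  # (predecessor, word) pairs, oldest dropped first
        self.prev = None

    def add(self, word):
        # Store the pair formed with the previous word, if there is one.
        if self.prev is not None:
            self.pairs.append((self.prev, word))
        self.prev = word

    def reset(self):
        # Empty the cache, e.g. at a document boundary.
        self.pairs.clear()
        self.prev = None

    def prob(self, prev, word):
        # Relative frequency of `word` after `prev` within the cached pairs.
        n_prev = sum(1 for p, _ in self.pairs if p == prev)
        if n_prev == 0:
            return None  # no cached evidence for this predecessor; caller must back off
        n_pair = sum(1 for p, w in self.pairs if p == prev and w == word)
        return n_pair / n_prev
```

When prob returns None, a caller would typically fall back to the unigram cache or to the static trigram model alone; how the bigram cache is weighted against the other models is left open here.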

EAGLES SWLG SoftEdition, May 1997.