The so-called cache model has been used successfully
by a number of researchers [Kuhn & De Mori (1990), Jelinek et al. (1991b), Rosenfeld (1994)].
The cache can be viewed as a short-term memory
where the probability of the most recent words is increased.
In other words, the cache model takes into account that
the words of the vocabulary are not distributed homogeneously
over a text, but tend to occur in clusters.
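The typical mathematical formulation
for the cache contribution, with a cache holding the M most
recent words, is as follows:

\[
p_C(w_n \mid w_{n-M}^{n-1}) = \frac{1}{M} \sum_{m=1}^{M} \delta(w_n, w_{n-m})
\]

where $\delta(w, v)$ denotes the Kronecker function, which is 1 if the
two arguments are the same and 0 otherwise.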
The probability of the cache model is typically combined with
the trigram model by linear interpolation.
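As a minimal sketch of the idea, the cache probability of a word can be computed as its relative frequency among the M most recent words and then mixed with a base model probability. The class name `UnigramCache`, the function `interpolate`, and the weight `lam` are hypothetical names chosen for illustration; dividing by the current number of cached words while the cache is still filling up is an assumption made here, not something the text prescribes.

```python
from collections import Counter, deque


class UnigramCache:
    """Short-term memory over the most recent words (illustrative sketch)."""

    def __init__(self, size):
        self.size = size          # M: maximum number of words kept in the cache
        self.window = deque()     # the cached words, oldest first
        self.counts = Counter()   # how often each word occurs in the window

    def add(self, word):
        """Append the newest word and drop the oldest one if the cache is full."""
        self.window.append(word)
        self.counts[word] += 1
        if len(self.window) > self.size:
            oldest = self.window.popleft()
            self.counts[oldest] -= 1
            if self.counts[oldest] == 0:
                del self.counts[oldest]

    def prob(self, word):
        """Relative frequency of `word` among the cached words."""
        if not self.window:
            return 0.0
        # Assumption: normalise by the current fill level, not by the maximum size.
        return self.counts[word] / len(self.window)


def interpolate(p_base, p_cache, lam):
    """Linear interpolation of a base (e.g. trigram) probability with the cache."""
    return (1.0 - lam) * p_base + lam * p_cache


cache = UnigramCache(size=4)
for w in "the cat saw the".split():
    cache.add(w)

# The base probability 0.01 and the weight 0.2 are made-up values for illustration.
p = interpolate(p_base=0.01, p_cache=cache.prob("the"), lam=0.2)
```

In practice the interpolation weight would be estimated on held-out data rather than fixed by hand.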
There are refinements that suggest themselves:
The cache concept considered so far is based on unigrams only.
As in the case of unigrams, we can argue that word bigrams
and trigrams tend to occur in clusters, too. Extensions of the
unigram cache to bigrams and/or trigrams have been successfully
used in [Jelinek et al. (1991b)] and [Rosenfeld (1994)]. For example,
in the case of a bigram cache, the bigram counts based on the most
recent history are used to compute the probabilities for the
bigram cache.

The cache model described here can be interpreted as a special
case of so-called adaptive language models that adapt their
probabilities to the most recent history, say the last 100 to 1000
predecessor words. In contrast, a non-adaptive language model
does not depend on the test data, but remains unchanged as
trained on the training data. For other types of adaptive
language models see [Essen & Steinbiss (1992)] and [Rosenfeld (1994)].
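The bigram cache mentioned above could be sketched as follows: bigram counts are collected over the most recent history, and the cache probability of a word given its predecessor is the relative frequency of that bigram among the cached bigrams starting with the same predecessor. The class name `BigramCache` and the choice to return `None` when the predecessor has not been seen in the cache (so that a caller can fall back to another model) are assumptions made for this illustration, not details taken from the cited work.

```python
from collections import Counter, deque


class BigramCache:
    """Cache of the most recent word bigrams (illustrative sketch)."""

    def __init__(self, size):
        self.size = size              # maximum number of bigrams kept
        self.window = deque()         # cached (predecessor, word) pairs, oldest first
        self.pair_counts = Counter()  # counts of each cached bigram
        self.pred_counts = Counter()  # counts of each cached predecessor
        self.prev = None              # last word seen, predecessor of the next one

    def add(self, word):
        """Record the bigram (previous word, word) and evict the oldest if full."""
        if self.prev is not None:
            pair = (self.prev, word)
            self.window.append(pair)
            self.pair_counts[pair] += 1
            self.pred_counts[self.prev] += 1
            if len(self.window) > self.size:
                old = self.window.popleft()
                self.pair_counts[old] -= 1
                self.pred_counts[old[0]] -= 1
        self.prev = word

    def prob(self, word, predecessor):
        """Relative frequency of (predecessor, word) among cached bigrams."""
        seen = self.pred_counts[predecessor]
        if seen == 0:
            # Assumption: signal "no evidence" so the caller can fall back.
            return None
        return self.pair_counts[(predecessor, word)] / seen


cache = BigramCache(size=100)
for w in "new york new york new jersey".split():
    cache.add(w)

p_york = cache.prob("york", predecessor="new")
```

As with the unigram cache, such a probability would typically be interpolated with the static model rather than used on its own.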