In smoothing trigram models by linear discounting or interpolation,
the smoothing parameters, i.e. the discounting and interpolation weights,
have been assumed to depend on the word pair uv.
The number of such word pairs is huge, so it is desirable
to reduce the number of smoothing parameters.
Often, the parameters are pooled or tied across different histories
by making them a function of the history count only, i.e. by setting
λ(h) = λ(N(h)) [Jelinek & Mercer (1980)].
This means that the parameters are tied across histories h with
the same count N(h).
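As an illustration of this kind of count-based tying, the following sketch
(in Python; the names pool_histories_by_count, tied_lambda and lambda_by_count
are illustrative, and the estimation of the parameter values themselves, e.g.
on held-out data, is not shown) groups histories by their count and looks up
one shared parameter per count value:

    from collections import defaultdict

    def pool_histories_by_count(history_counts):
        # Group histories h (here: word pairs (u, v)) by their count N(h).
        # All histories in one group share a single smoothing parameter, so
        # the number of free parameters drops from the number of distinct
        # word pairs to the number of distinct count values.
        groups = defaultdict(list)
        for h, n in history_counts.items():
            groups[n].append(h)
        return groups

    def tied_lambda(h, history_counts, lambda_by_count, default=0.5):
        # Tied parameter lambda(h) = lambda(N(h)); lambda_by_count maps a
        # count value to its estimated parameter, and the default is used
        # for histories that were not observed at all.
        return lambda_by_count.get(history_counts.get(h, 0), default)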
It is straightforward to repeat the derivations for this type of tying
and to obtain the corresponding estimation formulae.
Similar formulae result when the parameters are assumed to be
independent of the histories h altogether.
For absolute discounting, we obtain similar formulae in
the case of tying.
In particular for absolute discounting, the experimental results
show that there is no degradation in perplexity when
history-independent discounting parameters are used [Ney & Essen (1993)].
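For concreteness, one common form of absolute discounting with a single,
history-independent discount b (the notation here is illustrative and the
exact formula of the preceding section may differ in detail) is

    p(w|u,v) = max(N(u,v,w) - b, 0) / N(u,v) + b · W(u,v) / N(u,v) · β(w|v)

where W(u,v) is the number of distinct words observed after the pair (u,v)
and β(·|v) is the bigram distribution used for backing off; the result cited
above says that a single value of b can be shared by all histories without
hurting perplexity.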
In a real implementation, whether for off-line purposes
or for an operational prototype system, the
computational complexity of the trigram or bigram model
has to be considered, namely
the memory requirements and the access time.
- To store the trigram and bigram counts, we cannot use a
standard array implementation. Even for a simple model like
a bigram model, it is impossible to store
the full table of conditional probabilities, which
would require 20000 x 20000 = 400 million entries for
a vocabulary of 20000 words.
Instead, a so-called list implementation is often used,
which works as follows.
For each word w, we have a pointer into the
actual list, and for each observed trigram
(u,v,w), an entry of the list
consists of two parts, namely
the index pair (u,v) and the trigram count N(u,v,w).
For efficiency, this organisation should be combined
with the counts for the bigram models.
A further reduction in memory costs is achieved by
removing the singletons [Katz (1987)] from the list of
seen bigrams and trigrams. In particular, by simply removing
the trigram singletons, the storage costs can be reduced
dramatically without affecting the perplexity much.
We give a specific example for the WSJ corpus
of 39 million running words, for which experimental results
will be described. Here, the storage cost is
126 Mbyte when 4-byte values are used for all pointers
and counts. This storage cost can be reduced
to 66 Mbyte by omitting the trigram singletons.
- For fast access to the bigram and trigram lists,
we have to use suitable techniques, e.g. binary search
(a sketch of such a list organisation with binary-search lookup
follows this list). Short-term storage of language model probabilities
might also be appropriate, depending on the interaction between
the language model and the search.
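The following sketch illustrates one way of realising the list organisation
and its lookup (in Python; the class name TrigramCounts and the
drop_singletons flag are illustrative, and a compact implementation would
pack the entries into 4-byte integer arrays rather than Python tuples):
for each word w it keeps a sorted list of ((u,v), N(u,v,w)) entries and
retrieves a count by binary search.

    import bisect
    from collections import defaultdict

    class TrigramCounts:
        # List-style storage of trigram counts: for each word w there is one
        # sorted list whose entries hold the history pair (u, v) and the
        # count N(u, v, w).
        def __init__(self, trigram_counts, drop_singletons=False):
            # trigram_counts: iterable of ((u, v, w), count) pairs with
            # integer word indices; drop_singletons omits trigrams seen only
            # once, which cuts memory at little cost in perplexity (see above).
            per_word = defaultdict(list)
            for (u, v, w), n in trigram_counts:
                if drop_singletons and n == 1:
                    continue
                per_word[w].append(((u, v), n))
            # Sort each list by the history pair so that binary search applies.
            self._lists = {w: sorted(entries) for w, entries in per_word.items()}

        def count(self, u, v, w):
            # Return N(u, v, w), or 0 if the trigram is not stored.
            entries = self._lists.get(w)
            if not entries:
                return 0
            i = bisect.bisect_left(entries, ((u, v), 0))
            if i < len(entries) and entries[i][0] == (u, v):
                return entries[i][1]
            return 0

The bigram counts N(v,w) can be held in an analogous structure and combined
with this one, as suggested above.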