To illustrate the fundamental problem in language modelling, we consider a specific example, namely a bigram model. Conventional methods like the widely used maximum likelihood method [Lehmann (1983)] produce the relative frequencies as estimates for the bigram probabilities:
$$p(w|v) = \frac{N(v,w)}{N(v)}\;.$$
Here, $vw$ is the word bigram under consideration, and $N(v,w)$ and $N(v)$ are the numbers of observed word bigrams $(vw)$ and words $v$, respectively.
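As a concrete illustration of this estimate, the relative frequencies can be computed in a few lines of Python (a minimal sketch; the function name `bigram_mle` and the dictionary-based interface are our own choices, not part of the original formulation):

```python
from collections import Counter

def bigram_mle(words):
    """Maximum-likelihood (relative-frequency) bigram estimates.

    Returns a dict mapping (v, w) to p(w | v) = N(v, w) / N(v).
    Bigrams absent from the dict implicitly have probability zero,
    which is exactly the problem discussed in the text.
    """
    bigrams = Counter(zip(words, words[1:]))   # N(v, w)
    unigrams = Counter(words[:-1])             # N(v): words in history position
    return {(v, w): n / unigrams[v] for (v, w), n in bigrams.items()}
```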
Now assuming a vocabulary size of $W = 20000$ words, there are $W^2 = 400$ million possible word bigrams, but the training corpus rarely consists of more than 10 million words. As a result, the conventional probability estimate for each unseen event is zero, and at most
$$\frac{10^7}{4 \cdot 10^8} = 2.5\%$$
of all possible bigrams can be observed due to the limited size of the training set.
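The arithmetic behind these figures can be checked directly:

```python
W = 20_000              # assumed vocabulary size
possible = W ** 2       # number of possible word bigrams: 400 million
corpus = 10_000_000     # upper bound on the number of running words

# Even if every running word started a distinct bigram, the fraction
# of observable bigrams is bounded by corpus / possible.
coverage = corpus / possible
```

so at most 2.5% of the possible bigrams can ever be observed.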
For trigram language models, the effect is even more disastrous, since there are $W^3 = 8 \cdot 10^{12}$ possible trigrams.
To overcome these shortcomings of the conventional probability estimates, we have to apply some sort of ``smoothing'' to the relative frequencies and to make sure that each probability estimate is larger than zero. It is important to understand that the need for language model smoothing results from very specific boundary conditions:
In this section, we describe suitable methods for smoothing, namely linear discounting with backing-off, linear interpolation, and absolute discounting with backing-off. The parameters of these models will be estimated by leaving-one-out [Duda & Hart (1973)], which is a special type of cross-validation.
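To make the last of these methods concrete, here is a minimal sketch of absolute discounting with backing-off for a bigram model. It assumes a fixed discount $b$ (rather than one estimated by leaving-one-out) and takes hypothetical inputs: `bigram_counts`, a mapping from $(v,w)$ pairs to counts, and `unigram_probs`, the backing-off distribution $p(w)$.

```python
from collections import Counter

def absolute_discounting(bigram_counts, unigram_probs, b=0.5):
    """Bigram model p(w | v) with absolute discounting and backing-off.

    Seen bigrams get the discounted relative frequency (N(v,w) - b) / N(v);
    the freed probability mass is redistributed over the unseen successors
    of v in proportion to the backing-off distribution p(w).
    Assumes the history v has been seen in training.
    """
    context_total = Counter()   # N(v)
    distinct_succ = Counter()   # number of distinct seen successors of v
    for (v, w), n in bigram_counts.items():
        context_total[v] += n
        distinct_succ[v] += 1

    def p(w, v):
        n_vw = bigram_counts.get((v, w), 0)
        n_v = context_total[v]
        if n_vw > 0:
            return (n_vw - b) / n_v
        # total discounted mass freed in context v
        mass = b * distinct_succ[v] / n_v
        # renormalize the backing-off distribution over unseen successors
        unseen = sum(q for u, q in unigram_probs.items()
                     if (v, u) not in bigram_counts)
        return mass * unigram_probs[w] / unseen

    return p
```

By construction the discounted mass exactly matches what is given back to the unseen successors, so the conditional distribution sums to one.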
Throughout the chapter, we will use the so-called counts to describe the training data. For example, trigram counts $N(u,v,w)$ are obtained by counting how often the particular word trigram $(uvw)$ occurs in the training data:
$$N(u,v,w) := \big|\{\, n : (w_{n-2},\,w_{n-1},\,w_n) = (u,v,w) \,\}\big|\;.$$
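Such counts can be obtained from a word sequence with a standard counter (a sketch; `trigram_counts` is our own helper name):

```python
from collections import Counter

def trigram_counts(words):
    """N(u, v, w): how often the word trigram (u v w) occurs in `words`."""
    return Counter(zip(words, words[1:], words[2:]))
```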
Other counts are defined in the same way. We use the following notation for counts, where the dot symbol can be interpreted as a sum or wildcard; for example:
$$N(u,v,\cdot) := \sum_{w} N(u,v,w)\;.$$
In addition, we have to count how often a certain count $r$ has occurred. These ``count-counts'' are defined as follows:
$$n_r := \big|\{\,(u,v,w) : N(u,v,w) = r \,\}\big|\;.$$
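These count-counts can be read off any counts table (again a sketch with our own helper name):

```python
from collections import Counter

def count_counts(counts):
    """n_r: number of distinct events that occurred exactly r times."""
    return Counter(counts.values())
```

Note that $n_0$ cannot be read off the observed data alone; it follows from the number of possible events minus the number of seen events.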
Events with count $r = 0, 1, 2$ occur often and have special names: unseen events ($r=0$), singletons ($r=1$), and doubletons ($r=2$).
In the following, we formulate the smoothing techniques by considering two types of history. As an example, we consider a trigram language model: