To illustrate the fundamental problem in language modelling, we consider a specific example, namely a bigram model. Conventional methods like the widely used maximum likelihood method [Lehmann (1983)] produce the relative frequencies as estimates for the bigram probabilities:
\[
p(w|v) = \frac{N(v,w)}{N(v)}
\]
Here, (v,w) is the word bigram under consideration, and N(v,w) and N(v) are the numbers of observed word bigrams (v,w) and words v, respectively. Now assuming a vocabulary size of W = 20000 words, there are W^2 = 400 million possible word bigrams, but the training corpus rarely consists of more than 10 million words. As a result, the conventional probability estimate for each unseen event is zero, and at most 2.5% of all bigrams can be observed due to the size of the training set. For trigram language models, the effect is even more disastrous.
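The relative-frequency estimate and the resulting sparsity can be sketched in a few lines. This is a minimal illustration, not the chapter's method: the toy corpus and the helper name `bigram_mle` are assumptions, and N(v) is taken as the number of times v occurs as a history.

```python
from collections import Counter

def bigram_mle(corpus):
    """Relative-frequency (maximum likelihood) bigram estimate
    p(w|v) = N(v,w) / N(v)."""
    # N(v): how often v occurs as a history (every position but the last)
    history = Counter(corpus[:-1])
    # N(v,w): how often the bigram (v,w) occurs
    bigram = Counter(zip(corpus, corpus[1:]))
    return lambda v, w: bigram[(v, w)] / history[v] if history[v] else 0.0

corpus = "the cat sat on the mat".split()   # toy training data (assumption)
p = bigram_mle(corpus)

# seen bigram:   N(the,cat) = 1, N(the) = 2, so p(cat|the) = 0.5
# unseen bigram: N(cat,on) = 0, so the MLE assigns probability zero

# sparsity: with W = 20000 words there are W**2 = 400 million possible
# bigrams; a 10-million-word corpus covers at most 2.5% of them
coverage = 10_000_000 / 20_000**2
```

Any test sentence containing a bigram unseen in training receives probability zero under this estimate, which is exactly the defect that smoothing must repair.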
To overcome these shortcomings of the conventional probability estimates, we have to apply some sort of ``smoothing'' to the relative frequencies so that each probability estimate is larger than zero. It is important to understand that the need for language model smoothing results from very specific boundary conditions.
In this section, we describe suitable methods for smoothing, namely linear discounting with backing-off, linear interpolation, and absolute discounting with backing-off. The parameters of these models will be estimated by leaving-one-out [Duda & Hart (1973)], which is a special type of cross-validation.
Throughout the chapter, we will use the so-called counts to describe the training data. For example, trigram counts N(u,v,w) are obtained by counting how often the particular word trigram (u,v,w) occurs in the training data $w_1 \ldots w_T$:
\[
N(u,v,w) = \big| \{\, t : (w_{t-2}, w_{t-1}, w_t) = (u,v,w) \,\} \big|
\]
Other counts are defined in the same way. We use the following notation for counts, where the dot symbol can be interpreted as a sum or wild card:
\[
N(u,v,\cdot) = \sum_{w} N(u,v,w), \qquad
N(u,\cdot,\cdot) = \sum_{v,w} N(u,v,w)
\]
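The dot-as-sum convention can be made concrete with a small sketch. The toy corpus and the wildcard helper `N` are illustrative assumptions: leaving an argument unset plays the role of the dot, summing the trigram counts over that position.

```python
from collections import Counter

corpus = "a b a b c a b".split()   # toy training data (assumption)

# trigram counts N(u,v,w)
N3 = Counter(zip(corpus, corpus[1:], corpus[2:]))

def N(u=None, v=None, w=None):
    """Wildcard count: every position left as None acts as the dot,
    i.e. it is summed over, so N(u, v) computes N(u,v,.)."""
    return sum(c for (a, b, d), c in N3.items()
               if (u is None or a == u)
               and (v is None or b == v)
               and (w is None or d == w))

# N("a", "b")  -> N(a,b,.) : trigrams starting with the pair (a,b)
# N()          -> N(.,.,.) : total number of trigram positions
```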
In addition, we have to count how often a certain count r has occurred. These ``count-counts'' are defined as follows:
\[
n_r = \big| \{\, (u,v,w) : N(u,v,w) = r \,\} \big|
\]
Events with counts r = 0, 1, 2 occur often and have special names: unseen events (r = 0), singletons (r = 1), and doubletons (r = 2).
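Count-counts are straightforward to compute once the event counts are in hand. The following sketch uses bigrams of an assumed toy corpus; the variable names `n` and `singletons` are illustrative.

```python
from collections import Counter

corpus = "a b a b c a b".split()        # toy training data (assumption)
N2 = Counter(zip(corpus, corpus[1:]))   # bigram counts N(v,w)

# count-counts: n_r = number of distinct bigrams seen exactly r times
n = Counter(N2.values())

singletons = n[1]   # events observed exactly once (r = 1)
doubletons = n[2]   # events observed exactly twice (r = 2)
```

In this toy corpus the bigram (a,b) occurs three times while (b,a), (b,c), and (c,a) are singletons, so n_3 = 1 and n_1 = 3; the singleton count n_1 is the quantity that drives the leaving-one-out estimates introduced below.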
In the following, we formulate the smoothing techniques by considering two types of history. As an example, we consider a trigram language model, i.e. the conditional probability p(w|u,v) of a word w given the two-word history (u,v).