
Problem formulation

To illustrate the fundamental problem in language modelling, we consider a specific example, namely a bigram model. Conventional methods like the widely used maximum likelihood method [Lehmann (1983)] produce the relative frequencies as estimates for the bigram probabilities:

p(w|v) = N(v,w) / N(v).
Here, (v,w) is the word bigram under consideration, and N(v,w) and N(v) are the numbers of observed word bigrams (v,w) and words v, respectively. Now assuming a vocabulary size of W = 20000 words, there are W x W = 400 million possible word bigrams, but the training corpus rarely consists of more than 10 million words. As a result, the conventional probability estimate for each unseen event is zero, and at most 10/400 = 2.5% of all bigrams can be observed due to the size of the training set. For trigram language models, the effect is even more disastrous.
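
The zero-probability problem can be made concrete with a minimal sketch (the toy corpus and function names below are illustrative, not from the text): relative frequencies assign positive probability only to bigrams that actually occur in the training data.

```python
from collections import Counter

# Toy training corpus; in practice this would be millions of running words.
corpus = "the cat sat on the mat the cat ran".split()

# Bigram counts N(v,w) and predecessor counts N(v).
bigram_counts = Counter(zip(corpus, corpus[1:]))
unigram_counts = Counter(corpus[:-1])  # count v only where a successor exists

def p_ml(w, v):
    """Maximum likelihood estimate p(w|v) = N(v,w) / N(v)."""
    if unigram_counts[v] == 0:
        return 0.0
    return bigram_counts[(v, w)] / unigram_counts[v]

print(p_ml("cat", "the"))  # seen bigram: positive probability (2/3)
print(p_ml("dog", "the"))  # unseen bigram: exactly zero
```

Every bigram absent from the corpus, however plausible, receives probability zero, which is exactly what smoothing must repair.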

To overcome these shortcomings of the conventional probability estimates, we have to apply some sort of ``smoothing'' to the relative frequencies to make sure that each probability estimate is larger than zero. It is important to understand that the need for language model smoothing results from very specific boundary conditions:

• All word combinations should be possible, i.e. there is no word sequence with exactly zero probability.
• For such models, the amount of training data is always small, even if it amounts to 100 million running words.
• To take into account unseen events, i.e. events that were not observed in the training data, we make use of cross-validation [Efron & Tibshirani (1993)].

In this section, we describe suitable methods for smoothing, namely linear discounting with backing-off, linear interpolation and absolute discounting with backing-off. The parameters of these models will be estimated by leaving-one-out [Duda & Hart (1973)], which is a special type of cross-validation.

Throughout the chapter, we will use so-called counts to describe the training data. For example, trigram counts N(u,v,w) are obtained by counting how often the particular word trigram (u,v,w) occurs in the training data:

N(u,v,w) = number of positions n in the training corpus with (w_{n-2}, w_{n-1}, w_n) = (u,v,w).
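Such counts can be collected in a single pass over the corpus. A minimal sketch (toy corpus and variable names are illustrative):

```python
from collections import Counter

# Toy training corpus of running words.
words = "a b c a b c a b d".split()

# Trigram counts N(u,v,w): how often (u,v,w) occurs in the training data.
trigram_counts = Counter(zip(words, words[1:], words[2:]))

print(trigram_counts[("a", "b", "c")])  # 2
print(trigram_counts[("a", "b", "d")])  # 1
```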
Other counts are defined in the same way. We use the following notation for counts, where the dot symbol can be interpreted as a sum or wild card:

N(h,w): number of joint events (h,w);
N(.,w) = sum_h N(h,w): number of joint events for a fixed word w;
N(h,.) = sum_w N(h,w): number of joint events for a fixed history h;
N(.,.) = sum_{h,w} N(h,w): total number of joint events (h,w).
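The wildcard sums above are plain marginalisations of the joint counts. A minimal sketch, assuming a toy set of bigram events where the history h is the predecessor word:

```python
from collections import Counter

# Joint counts N(h,w) for a toy bigram model.
pairs = [("the", "cat"), ("the", "cat"), ("the", "mat"), ("a", "cat")]
N = Counter(pairs)

# N(.,w): sum over all histories h for a fixed word w.
def N_dot_w(w):
    return sum(c for (h, x), c in N.items() if x == w)

# N(h,.): sum over all words w for a fixed history h.
def N_h_dot(h):
    return sum(c for (x, w), c in N.items() if x == h)

# N(.,.): total number of joint events.
N_total = sum(N.values())

print(N_dot_w("cat"))  # 3
print(N_h_dot("the"))  # 3
print(N_total)         # 4
```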

In addition, we have to count how often a certain count r has occurred. These ``count-counts'' are defined as follows:

n_r(h): number of events (words) w that occurred exactly r times for a fixed history h, i.e. n_r(h) = |{w : N(h,w) = r}|;
n_r = sum_h n_r(h): total number of joint events (h,w) that occurred exactly r times.
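Count-counts are obtained by a second counting pass over the joint counts themselves. A minimal sketch with hypothetical toy data:

```python
from collections import Counter

# Joint counts N(h,w) for a toy model.
N = Counter([("the", "cat"), ("the", "cat"), ("the", "mat"),
             ("a", "cat"), ("a", "dog")])

# n_r(h): number of words w with N(h,w) == r, for a fixed history h.
def n_r_h(r, h):
    return sum(1 for (x, w), c in N.items() if x == h and c == r)

# n_r: total number of joint events (h,w) with count exactly r.
def n_r(r):
    return sum(1 for c in N.values() if c == r)

print(n_r_h(1, "the"))  # 1: only "mat" was seen exactly once after "the"
print(n_r(1))           # 3 singleton events
print(n_r(2))           # 1 doubleton event
```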

Events with counts r=0, 1, 2 occur often and have special names:

r=0: unseen events, i.e. events never observed in the training data;
r=1: singleton events, i.e. events that were observed exactly once;
r=2: doubleton events, i.e. events that were observed exactly twice.

In the following, we formulate the smoothing techniques by considering two types of history. As an example, we consider a trigram language model:

• a specific history h; e.g. h = (u,v) for a trigram (u,v,w);
• a generalised history h'; e.g. for trigram models with (h,w) = (u,v,w), we typically define h' = v, so that (h',w) is the bigram (v,w).
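The relation between the two history types can be sketched in a few lines (the tuple representation is an assumption for illustration): the generalised history simply drops the oldest word of the specific history.

```python
# Specific trigram history h = (u, v) for predicting the next word w.
def specific_history(u, v):
    return (u, v)

# Generalised history h': drop the oldest word, so (h', w) is the bigram (v, w).
def generalised_history(h):
    u, v = h
    return (v,)

h = specific_history("we", "will")
print(h)                       # ('we', 'will')
print(generalised_history(h))  # ('will',)
```

This truncation is what the backing-off schemes of the following sections rely on when a specific history has too few counts.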


EAGLES SWLG SoftEdition, May 1997.