
Problem formulation

To illustrate the fundamental problem in language modelling, we consider a specific example, namely a bigram  model. Conventional methods like the widely used maximum likelihood method [Lehmann (1983)] produce the relative frequencies as estimates for the bigram probabilities:


\[
  p(w \mid v) \;=\; \frac{N(v,w)}{N(v)}
\]

Here, vw is the word bigram under consideration, and N(v,w) and N(v) are the numbers of observed word bigrams (vw) and words v, respectively. Assuming a vocabulary size of W = 20000 words, there are W^2 = 400 million possible word bigrams, whereas the training corpus rarely consists of more than 10 million words. As a result, the conventional probability estimate for each unseen event is zero, and only a small fraction of all possible bigrams (at most 2.5% for a training set of this size) can ever be observed. For trigram language models, the effect is even more disastrous.
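For concreteness, the relative-frequency estimate and the sparsity it suffers from can be reproduced in a few lines of code. The following Python sketch is purely illustrative (the toy corpus and the function name are ours, not part of the handbook); it computes p(w|v) = N(v,w)/N(v) from a list of running words and implicitly leaves every unseen bigram at probability zero:

\begin{verbatim}
from collections import Counter

def bigram_relative_frequencies(words):
    # N(v, w): counts of adjacent word pairs in the running text
    bigram_counts = Counter(zip(words[:-1], words[1:]))
    # N(v): counts of the predecessor words
    unigram_counts = Counter(words[:-1])
    # Maximum-likelihood estimate p(w|v) = N(v,w) / N(v);
    # any bigram that never occurred is simply absent, i.e. zero.
    return {(v, w): n / unigram_counts[v]
            for (v, w), n in bigram_counts.items()}

corpus = "a b a c a b".split()
print(bigram_relative_frequencies(corpus))
\end{verbatim}

With a vocabulary of W = 20000 words, the dictionary returned above can never contain more than a tiny fraction of the W^2 possible bigrams.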

To overcome these shortcomings of the conventional probability estimates, we have to apply some sort of ``smoothing'' to the relative frequencies and to make sure that each probability estimate is larger than zero. It is important to understand that the need for language model smoothing results from a very specific boundary condition: the number of possible events (word bigrams or trigrams) is huge in comparison with the size of any realistic training corpus, so that most events can never be observed in training.

In this section, we describe suitable methods for smoothing, namely linear discounting with backing-off, linear interpolation, and absolute discounting with backing-off. The parameters of these models will be estimated by leaving-one-out [Duda & Hart (1973)], which is a special type of cross-validation.

Throughout the chapter, we will use the so-called counts to describe the training data $w_1 \ldots w_N$. For example, trigram counts N(u,v,w) are obtained by counting how often the particular word trigram (uvw) occurs in the training data:


\[
  N(u,v,w) \;=\; \sum_{n=3}^{N} \delta(w_{n-2},u)\,\delta(w_{n-1},v)\,\delta(w_n,w),
\]
where $\delta(\cdot,\cdot)$ denotes the Kronecker delta.
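A corresponding counting routine is straightforward; the following Python sketch (again purely illustrative) slides a window of length three over the training data and accumulates N(u,v,w):

\begin{verbatim}
from collections import Counter

def trigram_counts(words):
    # N(u, v, w): how often the word trigram (u v w) occurs in the data
    return Counter(zip(words[:-2], words[1:-1], words[2:]))

corpus = "a b a c a b a".split()
print(trigram_counts(corpus))   # N(a, b, a) = 2 for this toy corpus
\end{verbatim}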

Other counts are defined in the same way. We use the following notation for counts, where the dot symbol can be interpreted as a sum or wild card:

N(h,w): number of joint events (h,w);
N(.,w): number of joint events (.,w), i.e. N(h,w) summed over all histories h, for a fixed word w;
N(h,.): number of joint events (h,.), i.e. N(h,w) summed over all words w, for a fixed history h;
N(.,.): total number of joint events (h,w).
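Assuming the joint counts N(h,w) are stored in a dictionary keyed by (history, word) pairs (a representation of our own choosing), the marginal counts are obtained by summing over the position marked with the dot. A minimal sketch:

\begin{verbatim}
from collections import defaultdict

def marginal_counts(joint_counts):
    # joint_counts: dictionary mapping (h, w) to N(h, w)
    n_dot_w = defaultdict(int)   # N(., w): summed over all histories h
    n_h_dot = defaultdict(int)   # N(h, .): summed over all words w
    n_dot_dot = 0                # N(., .): total number of joint events
    for (h, w), n in joint_counts.items():
        n_dot_w[w] += n
        n_h_dot[h] += n
        n_dot_dot += n
    return n_dot_w, n_h_dot, n_dot_dot
\end{verbatim}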

In addition, we have to count how often a certain count r has occurred. These ``count-counts'' are defined as follows:

n_r(h): number of events (words) w that occurred exactly r times for a fixed history h;
n_r: total number of joint events (h,w) that occurred exactly r times, i.e. n_r(h) summed over all histories h.
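The count-counts can be read off the same dictionary of joint counts; the following sketch (the notation n_r(h) and n_r as above, implementation details ours) tabulates them:

\begin{verbatim}
from collections import Counter

def count_counts(joint_counts):
    # n_r(h): number of words w with N(h, w) = r, for each fixed history h
    # n_r:    total number of joint events (h, w) with N(h, w) = r
    n_r_given_h = Counter()   # keyed by (h, r)
    n_r = Counter()           # keyed by r
    for (h, w), r in joint_counts.items():
        n_r_given_h[(h, r)] += 1
        n_r[r] += 1
    return n_r_given_h, n_r
\end{verbatim}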

Events with counts r = 0, 1, 2 occur often and have special names:

r=0: unseen events, i.e. events never observed in the training data ;
r=1: singleton events, i.e. events that were observed exactly once;
r=2: doubleton events, i.e. events that were observed exactly twice.
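Note that the unseen events (r=0) cannot be read off a table of observed counts alone: for a fixed history h they are obtained from the vocabulary size as n_0(h) = W minus the number of words seen at least once after h. Continuing the sketch above (the helper is again illustrative):

\begin{verbatim}
def unseen_count(n_r_given_h, h, vocabulary_size):
    # n_0(h) = W minus the number of words seen at least once after history h
    seen = sum(c for (hist, r), c in n_r_given_h.items()
               if hist == h and r >= 1)
    return vocabulary_size - seen
\end{verbatim}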

In the following, we formulate the smoothing techniques by considering two types of history; as an example, we consider a trigram language model.


