
## Linear discounting and backing-off

The model of linear discounting in conjunction with backing-off [Katz (1987), Jelinek (1991)] has the advantage that it results in relatively simple formulae. The model is:

$$
p(w|h) \;=\;
\begin{cases}
(1 - \lambda_h)\,\dfrac{N(h,w)}{N(h)} & \text{if } N(h,w) > 0 \\[2ex]
\lambda_h\,\dfrac{\beta(w|\bar h)}{\sum_{w':\,N(h,w')=0} \beta(w'|\bar h)} & \text{if } N(h,w) = 0
\end{cases}
$$

Here we have two types of parameters to be estimated:

• the discounting parameters $\lambda_h$ for each history $h$
• the backing-off distribution $\beta(w|\bar h)$ for a generalised history $\bar h$. Note that for each history $h$ the generalised history $\bar h$ must be well defined in order to have a backing-off distribution $\beta(w|\bar h)$.
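The two-case model above can be sketched in code. The following is a minimal illustration, not an implementation from the original text: it assumes a bigram model whose generalised history is empty, so the backing-off distribution is a unigram distribution over the vocabulary, and all names (`backoff_prob`, `counts`, `beta`, `lam`) are invented for the example.

```python
def backoff_prob(w, h, counts, beta, lam):
    """Probability p(w|h) under linear discounting with backing-off.

    Illustrative sketch: `counts[(h, w)]` holds N(h, w), `lam[h]` holds
    the discounting parameter for history h, and `beta` is a unigram
    backing-off distribution over the full vocabulary.
    """
    N_h = sum(c for (hist, _), c in counts.items() if hist == h)
    if counts.get((h, w), 0) > 0:
        # seen event: discounted relative frequency (1 - lambda_h) N(h,w)/N(h)
        return (1.0 - lam[h]) * counts[(h, w)] / N_h
    # unseen event: lambda_h times beta, renormalised over the unseen words
    unseen_mass = sum(p for v, p in beta.items() if counts.get((h, v), 0) == 0)
    return lam[h] * beta[w] / unseen_mass
```

By construction the probabilities sum to one over the vocabulary for each history: the seen events receive mass $1-\lambda_h$ and the unseen events share the remaining mass $\lambda_h$.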

The unknown parameters are estimated by maximum likelihood in combination with leaving-one-out. We obtain the log-likelihood function:

$$
F(\{\lambda_h\}, \beta) \;=\; \sum_{h,w} N(h,w)\,\log p_{-1}(w|h),
$$

where $p_{-1}(w|h)$ denotes the probability distribution obtained by leaving out the event $(h,w)$ from the training data.

By doing some elementary manipulations, we can decompose the log-likelihood function into two parts, one of which depends only on $\{\lambda_h\}$ and the other depends only on $\beta$:

$$
F(\{\lambda_h\}, \beta) \;=\; F_{\lambda}(\{\lambda_h\}) + F_{\beta}(\beta) + \mathrm{const}.
$$

The $\{\lambda_h\}$-dependent part is:

$$
F_{\lambda}(\{\lambda_h\}) \;=\; \sum_h \Big[\, n_1(h)\,\log \lambda_h \;+\; \big(N(h) - n_1(h)\big)\,\log (1 - \lambda_h) \,\Big],
$$

where $n_1(h)$ denotes the number of words seen exactly once after $h$: under leaving-one-out, a singleton $(h,w)$ becomes an unseen event and contributes a factor $\lambda_h$, whereas every other observation contributes a factor $1 - \lambda_h$. Taking the partial derivatives with respect to $\lambda_h$ and equating them to zero, we obtain the closed-form solution:

$$
\lambda_h \;=\; \frac{n_1(h)}{N(h)}.
$$

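Under leaving-one-out, the $\lambda_h$-dependent part of the log-likelihood has the form $F_{\lambda} = \sum_h [\, n_1(h) \log \lambda_h + (N(h) - n_1(h)) \log(1 - \lambda_h)\,]$, with $n_1(h)$ the number of words seen exactly once after $h$; the maximisation can then be carried out per history in one line:

$$
\frac{\partial F_{\lambda}}{\partial \lambda_h}
\;=\; \frac{n_1(h)}{\lambda_h} \;-\; \frac{N(h) - n_1(h)}{1 - \lambda_h}
\;=\; 0
\quad\Longleftrightarrow\quad
\lambda_h \;=\; \frac{n_1(h)}{N(h)}.
$$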
The same value is obtained when we compute the probability mass of unseen words in the training data for a given history $h$:

$$
\sum_{w:\, N(h,w)=0} p(w|h) \;=\; \frac{n_1(h)}{N(h)} \;=\; \lambda_h.
$$

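The closed-form solution $\lambda_h = n_1(h)/N(h)$ is just a counting exercise over the training data. A small sketch, with invented names (`discount_parameters`, `pairs`), of how the leaving-one-out estimates could be computed:

```python
from collections import Counter

def discount_parameters(pairs):
    """Leaving-one-out estimates lambda_h = n1(h) / N(h).

    `pairs` is the training data as a list of events (h, w); n1(h) is the
    number of words seen exactly once after h, and N(h) is the total
    number of observations of history h. Names are illustrative.
    """
    counts = Counter(pairs)
    N, n1 = Counter(), Counter()
    for (h, w), c in counts.items():
        N[h] += c
        if c == 1:
            n1[h] += 1
    return {h: n1[h] / N[h] for h in N}
```

Note that a history whose successors are all singletons receives $\lambda_h = 1$, i.e. its entire probability mass is backed off.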
To estimate the backing-off distribution $\beta(w|\bar h)$, we rearrange the sums in the $\beta$-dependent part:

$$
F_{\beta}(\beta) \;=\; \sum_h \Big[ \sum_{w:\, N(h,w)=1} \log \beta(w|\bar h) \;-\; n_1(h)\,\log B(h) \Big],
$$

where $n_1(h)$ is the number of singletons $(h,w)$ for a given history $h$, i.e. the number of words following $h$ exactly once, and where $B(h)$ is defined as:

$$
B(h) \;=\; \sum_{w:\, N(h,w)=0} \beta(w|\bar h).
$$
Taking the derivative with respect to $\beta(w|\bar h)$, we have:

$$
\frac{\partial F_{\beta}}{\partial \beta(w|\bar h)}
\;=\; \frac{n_1(\bar h, w)}{\beta(w|\bar h)} \;-\; \sum_{h:\, N(h,w)=0} \frac{n_1(h)}{B(h)},
$$

where $n_1(\bar h, w)$ counts the histories $h$ (with generalised history $\bar h$) for which $(h,w)$ is a singleton, and where we have taken into account that there are only contributions from those histories $h$ which appear in the sum over $w'$, i.e. for which $N(h,w) = 0$. We do not know a closed-form solution for $\beta(w|\bar h)$. By extending the sum over all histories $h$ [Kneser & Ney (1995)], the second term no longer depends on $w$ and we obtain the approximation:

$$
\beta(w|\bar h) \;=\; \frac{n_1(\bar h, w)}{\sum_{w'} n_1(\bar h, w')}.
$$

For convenience, we have chosen the normalisation $\sum_w \beta(w|\bar h) = 1$. This type of backing-off distribution will be referred to as the singleton distribution.
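The singleton distribution is again a pure counting exercise. A hedged sketch for the bigram case (backing off to unigrams, so $\bar h$ is empty); the function name `singleton_distribution` and the data layout are invented for the example:

```python
from collections import Counter

def singleton_distribution(pairs):
    """Singleton backing-off distribution for a bigram model backing
    off to unigrams: beta(w) is proportional to n1(., w), the number of
    distinct histories h after which w occurred exactly once. This is an
    illustrative sketch of the formula, not an official API.
    """
    counts = Counter(pairs)
    n1 = Counter()
    for (h, w), c in counts.items():
        if c == 1:
            n1[w] += 1
    total = sum(n1.values())
    return {w: n1[w] / total for w in n1}
```

Compared with the relative-frequency unigram distribution, this down-weights words that occur often but only after a few histories, which is exactly the behaviour wanted in a backing-off distribution: it is only consulted for histories after which the word was never seen.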


EAGLES SWLG SoftEdition, May 1997.