The model of linear discounting in conjunction with backing-off [Katz (1987), Jelinek (1991)] has the advantage that it results in relatively simple formulae. The model is:

\[
p(w|h) \;=\;
\begin{cases}
(1-\lambda_h)\,\dfrac{N(h,w)}{N(h)} & \text{if } N(h,w) > 0\;,\\[2ex]
\lambda_h\,\dfrac{\beta(w|\bar h)}{\sum_{w':\,N(h,w')=0}\beta(w'|\bar h)} & \text{if } N(h,w) = 0\;,
\end{cases}
\]

where $N(h,w)$ is the training count of word $w$ after history $h$, $N(h) := \sum_w N(h,w)$, and $\bar h$ denotes the generalized history used at the backing-off level.
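To make the case distinction concrete, here is a minimal sketch of the model as a scoring function; the count tables, discounting parameters, and backing-off distribution are assumed to be given, and all names are ours, for illustration only (the history is taken to be a tuple of words, so dropping its oldest word yields the generalized history):

```python
def p(w, h, vocab, N_hw, N_h, lam, beta):
    """Linear discounting with backing-off (sketch).

    N_hw maps (h, w) to the count N(h, w); N_h maps h to N(h);
    lam maps h to the discounting parameter lambda_h; beta maps
    (w, h_bar) to the backing-off probability beta(w | h_bar).
    """
    h_bar = h[1:]  # generalized history: here, drop the oldest word
    if N_hw.get((h, w), 0) > 0:
        # Seen event: discounted relative frequency.
        return (1.0 - lam[h]) * N_hw[(h, w)] / N_h[h]
    # Unseen event: backing-off distribution, renormalised over the
    # words that did not occur after h in the training data.
    unseen_mass = sum(beta[(v, h_bar)] for v in vocab
                      if N_hw.get((h, v), 0) == 0)
    return lam[h] * beta[(w, h_bar)] / unseen_mass
```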
Here we have two types of parameters to be estimated:
- the history-dependent discounting parameters $\lambda_h$;
- the backing-off distribution $\beta(w|\bar h)$.
The unknown parameters are estimated by maximum likelihood in combination with leaving-one-out. We obtain the log-likelihood function:

\[
F \;=\; \sum_{h,w} N(h,w) \cdot \log p_{-(h,w)}(w|h)\;,
\]

where $p_{-(h,w)}$ denotes the probability distribution obtained by leaving out the event $(h,w)$ from the training data.
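To see where the decomposition below comes from, it helps to spell out the leaving-one-out distribution explicitly; this intermediate step is our addition, but it follows directly from the model definition: removing one observation of $(h,w)$ reduces both $N(h,w)$ and $N(h)$ by one, so a singleton turns into an unseen event:

\[
p_{-(h,w)}(w|h) \;=\;
\begin{cases}
(1-\lambda_h)\,\dfrac{N(h,w)-1}{N(h)-1} & \text{if } N(h,w) > 1\;,\\[2ex]
\lambda_h\,\dfrac{\beta(w|\bar h)}{\beta(w|\bar h)+\sum_{w':\,N(h,w')=0}\beta(w'|\bar h)} & \text{if } N(h,w) = 1\;.
\end{cases}
\]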
By some elementary manipulations, we can decompose the log-likelihood function into two parts, one of which depends only on the discounting parameters $\{\lambda_h\}$ and the other only on the backing-off distribution $\beta$:

\[
F \;=\; F_1(\{\lambda_h\}) \;+\; F_2(\beta) \;+\; \text{const}\;.
\]

The $\lambda_h$-dependent part is:

\[
F_1(\{\lambda_h\}) \;=\; \sum_h \Big[\big(N(h)-n_1(h)\big)\,\log(1-\lambda_h) \;+\; n_1(h)\,\log\lambda_h\Big]\;,
\]

where $n_1(h)$ denotes the number of words seen exactly once after the history $h$.
Taking the partial derivatives with respect to $\lambda_h$ and equating them to zero, we obtain the closed-form solution:

\[
\hat\lambda_h \;=\; \frac{n_1(h)}{N(h)}\;.
\]
The same value is obtained when we compute the probability mass of the words unseen in the training data for a given history $h$:

\[
\sum_{w:\,N(h,w)=0} p(w|h) \;=\; \frac{n_1(h)}{N(h)}\;.
\]
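As a concrete illustration, the following sketch computes the leaving-one-out estimate $\hat\lambda_h = n_1(h)/N(h)$ from bigram counts; the toy corpus and all names are invented for illustration:

```python
from collections import Counter

# Toy corpus of (history, word) bigram events; invented for illustration.
events = [("a", "b"), ("a", "b"), ("a", "b"), ("a", "c"),
          ("a", "d"), ("a", "e"), ("b", "a"), ("b", "c")]

N_hw = Counter(events)                # joint counts N(h, w)
N_h = Counter(h for h, _ in events)   # history counts N(h)

def discount(h):
    """Leaving-one-out estimate lambda_h = n_1(h) / N(h)."""
    n1 = sum(1 for (h2, w), n in N_hw.items() if h2 == h and n == 1)
    return n1 / N_h[h]

# For h = "a": N(a) = 6 and the singletons after "a" are c, d, e,
# so lambda_a = n_1(a) / N(a) = 3/6.
print(discount("a"))  # 0.5
```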
To estimate the backing-off distribution $\beta(w|\bar h)$, we rearrange the sums:

\[
F_2(\beta) \;=\; \sum_{h,w:\,N(h,w)=1} \log\beta(w|\bar h) \;-\; \sum_h n_1(h)\,\log G(h)\;,
\]

where $n_1(h)$ is, as before, the number of singletons $(h,w)$ for a given history $h$, i.e. the number of words following $h$ exactly once, and where $G(h)$ is defined as:

\[
G(h) \;:=\; \sum_{w':\,N(h,w')\le 1} \beta(w'|\bar h)\;.
\]

Here the leaving-one-out denominator $\beta(w|\bar h)+\sum_{w':\,N(h,w')=0}\beta(w'|\bar h)$ has been approximated by $G(h)$, which no longer depends on the left-out word $w$.
Taking the derivative with respect to $\beta(w|\bar h)$, we have:

\[
\frac{\partial F_2(\beta)}{\partial \beta(w|\bar h)} \;=\; \frac{n_1(\bar h,w)}{\beta(w|\bar h)} \;-\; \sum_{h:\,N(h,w)\le 1} \frac{n_1(h)}{G(h)} \;=\; 0\;,
\]

where $n_1(\bar h,w)$ denotes the number of singletons $(h,w)$ whose history $h$ has the generalized history $\bar h$, where the second sum runs over the histories $h$ sharing the generalized history $\bar h$, and where we have taken into account that there are contributions only from those histories $h$ for which the word $w$ appears in the sum over $w'$, i.e. for which $N(h,w)\le 1$.
We do not know a closed-form solution for $\beta(w|\bar h)$. By extending the sum over all histories $h$ [Kneser & Ney (1995)], the second term becomes independent of $w$, and we obtain the approximation:

\[
\beta(w|\bar h) \;=\; \frac{n_1(\bar h,w)}{\sum_{w'} n_1(\bar h,w')}\;.
\]

For convenience, we have chosen the normalisation $\sum_w \beta(w|\bar h) = 1$; the overall scale of $\beta(\cdot|\bar h)$ cancels in the model, so the normalisation can be chosen freely.
This type of backing-off distribution will be referred to as the singleton distribution.
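As a concrete illustration of the singleton distribution, the following sketch (with invented data and names) assumes a bigram model backing off to the unigram level, where the generalized history is empty: for each word we count the distinct histories after which it occurs exactly once, and normalise:

```python
from collections import Counter

# Toy corpus of (history, word) bigram events; invented for illustration.
events = [("a", "b"), ("a", "b"), ("a", "b"), ("a", "c"),
          ("c", "b"), ("c", "d"), ("d", "b"), ("d", "c")]

N_hw = Counter(events)  # joint counts N(h, w)

# n_1(w): number of distinct histories h with N(h, w) == 1.
n1 = Counter(w for (h, w), n in N_hw.items() if n == 1)

# Singleton distribution: beta(w) = n_1(w) / sum over w' of n_1(w').
total = sum(n1.values())
beta = {w: c / total for w, c in n1.items()}

# "b" occurs 5 times overall but follows only two histories
# exactly once (c and d), so n_1(b) = 2 rather than 5.
print(beta)  # {'c': 0.4, 'b': 0.4, 'd': 0.2}
```

Note how the singleton distribution differs from the relative frequency: a word that is frequent only after a few specific histories receives little backing-off mass.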