The model of linear discounting in conjunction with backing-off [Katz (1987), Jelinek (1991)] has the advantage that it results in relatively simple formulae. The model is:

$$
p(w|h) \;=\;
\begin{cases}
(1-\lambda_h)\,\dfrac{N(h,w)}{N(h)} & \text{if } N(h,w) > 0,\\[1.5ex]
\lambda_h\,\dfrac{\beta(w|\bar h)}{\sum_{w':\,N(h,w')=0}\beta(w'|\bar h)} & \text{if } N(h,w) = 0,
\end{cases}
$$

where $N(h,w)$ denotes the count of the event $(h,w)$ in the training data and $N(h) = \sum_w N(h,w)$.
Here we have two types of parameters to be estimated:

- the discounting parameters $\lambda_h$ for each history $h$;
- the backing-off distribution $\beta(w|\bar h)$ for a generalised history $\bar h$, e.g. the trigram history $h=(u,v)$ truncated to the bigram history $\bar h = v$.

Note that for each history $h$ the generalised history $\bar h$ must be well defined in order to have a backing-off distribution $\beta(w|\bar h)$.

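As a concrete illustration of the model equation, the probability $p(w|h)$ can be sketched as follows; the function name, data layout, and the toy history/vocabulary are our own, not from the text:

```python
from collections import Counter

def prob(w, h, h_bar, counts, lam, beta, vocab):
    """p(w|h) under linear discounting with backing-off.

    counts: Counter mapping (h, w) -> N(h, w); lam: {h: lambda_h};
    beta: {(w, h_bar): beta(w|h_bar)} -- all illustrative structures."""
    N_h = sum(counts[(h, v)] for v in vocab)
    if counts[(h, w)] > 0:
        # seen event: discounted relative frequency
        return (1.0 - lam[h]) * counts[(h, w)] / N_h
    # unseen event: backing-off distribution renormalised over unseen words
    unseen_mass = sum(beta[(v, h_bar)] for v in vocab if counts[(h, v)] == 0)
    return lam[h] * beta[(w, h_bar)] / unseen_mass
```

Because the backing-off branch is renormalised over the unseen words only, the probabilities $p(w|h)$ sum to one over the vocabulary for any choice of $\lambda_h$ and $\beta$.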
The unknown parameters are estimated by maximum likelihood in combination with leaving-one-out. We obtain the log-likelihood function:

$$
F(\{\lambda_h\},\beta) \;=\; \sum_{h,w} N(h,w)\,\log p_{-(h,w)}(w|h),
$$

where $p_{-(h,w)}(\cdot|\cdot)$ denotes the probability distribution obtained by leaving out the event $(h,w)$ from the training data.

By doing some elementary manipulations, we can decompose the log-likelihood function into two parts, one of which depends only on the discounting parameters $\{\lambda_h\}$ and the other depends only on the backing-off distribution $\beta(\cdot|\cdot)$:

$$
F(\{\lambda_h\},\beta) \;=\; F_\lambda(\{\lambda_h\}) + F_\beta(\beta) + \text{const}.
$$

The $\lambda_h$-dependent part is:

$$
F_\lambda(\{\lambda_h\}) \;=\; \sum_h \Big[\big(N(h)-n_1(h)\big)\log(1-\lambda_h) + n_1(h)\log\lambda_h\Big],
$$

where $n_1(h)$ denotes the number of words $w$ with $N(h,w)=1$; when such a singleton event is left out, it becomes unseen and is therefore covered by the backing-off branch, which contributes the factor $\lambda_h$. Taking the partial derivatives with respect to $\lambda_h$ and equating them to zero, we obtain the closed-form solution:

$$
\lambda_h \;=\; \frac{n_1(h)}{N(h)}.
$$

The same value is obtained when we compute the probability mass of unseen words in the training data for a given history $h$: under leaving-one-out, each singleton becomes an unseen word, so that this mass is

$$
\sum_{w:\,N(h,w)=1}\frac{N(h,w)}{N(h)} \;=\; \frac{n_1(h)}{N(h)} \;=\; \lambda_h.
$$

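The closed-form estimate $\lambda_h = n_1(h)/N(h)$ can be computed in a few lines; the data layout below is our own illustration:

```python
from collections import Counter

def discounting_parameters(counts):
    """Leaving-one-out estimate lambda_h = n_1(h) / N(h).

    counts: Counter mapping (h, w) -> N(h, w) (illustrative layout)."""
    N = Counter()    # N(h): total count of history h
    n1 = Counter()   # n_1(h): number of words seen exactly once after h
    for (h, w), c in counts.items():
        N[h] += c
        if c == 1:
            n1[h] += 1
    return {h: n1[h] / N[h] for h in N}
```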
To estimate the backing-off distribution $\beta(w|\bar h)$, we rearrange the sums:

$$
F_\beta(\beta) \;=\; \sum_{\bar h}\sum_w G(\bar h,w)\,\log\beta(w|\bar h) \;-\; \sum_h n_1(h)\,\log\Big(\sum_{w':\,N(h,w')\le 1}\beta(w'|\bar h)\Big),
$$

where $n_1(h)$ is the number of singletons $(h,w)$ for a given history $h$,
i.e. the number of words following $h$ exactly once, and where $G(\bar h,w)$
is defined as:

$$
G(\bar h,w) \;:=\; \sum_{h:\,\bar h(h)=\bar h}\delta\big(N(h,w),1\big),
$$

i.e., with $\delta(\cdot,\cdot)$ denoting the Kronecker delta, the number of histories $h$ with generalised history $\bar h$ after which the word $w$ occurs exactly once.

Taking the derivative with respect to $\beta(w|\bar h)$ and equating it to zero, we have:

$$
\frac{G(\bar h,w)}{\beta(w|\bar h)} \;=\; \sum_{h:\,\bar h(h)=\bar h,\;N(h,w)\le 1}\frac{n_1(h)}{\sum_{w':\,N(h,w')\le 1}\beta(w'|\bar h)},
$$

where we have taken into account that there are only contributions
from those histories $h$ for which $w$ appears in the sum over $w'$, i.e. with $N(h,w)\le 1$.
We do not know a closed-form solution for $\beta(w|\bar h)$.
By extending the sum over all histories $h$ with generalised history $\bar h$ [Kneser & Ney (1995)],
we obtain the approximation:

$$
\beta(w|\bar h) \;=\; \frac{G(\bar h,w)}{\sum_{w'} G(\bar h,w')}.
$$

For convenience, we have chosen
the normalisation $\sum_w \beta(w|\bar h) = 1$.
This type of backing-off distribution will be referred to as
the singleton distribution.