The model of linear discounting in conjunction with backing-off [Katz (1987), Jelinek (1991)] has the advantage that it results in relatively simple formulae. The model is:

$$p(w|h) \;=\; \begin{cases} (1-\lambda_h)\,\dfrac{N(h,w)}{N(h)} & \text{if } N(h,w) > 0 \\[2ex] \lambda_h\,\dfrac{\beta(w|\bar h)}{\sum_{w':\,N(h,w')=0}\beta(w'|\bar h)} & \text{if } N(h,w) = 0 \end{cases}$$
Here we have two types of parameters to be estimated: the history-dependent discounting parameters $\lambda_h$ and the backing-off distribution $\beta(w|\bar h)$, where $\bar h$ denotes the generalized (shorter) history used for backing off.
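To make the two cases of the model concrete, here is a minimal Python sketch, assuming a bigram model that backs off to a single unigram distribution; the names (`counts` for $N(h,w)$, `N_h` for $N(h)$, `lam` for $\lambda_h$, `beta` for $\beta(w|\bar h)$, `vocab` for the vocabulary) are our own illustrative choices, not notation from the paper:

```python
def p_linear_discount(w, h, counts, N_h, lam, beta, vocab):
    """Sketch of linear discounting with backing-off: seen events get the
    discounted relative frequency; unseen events share the held-out mass
    lambda_h in proportion to the backing-off distribution beta."""
    n = counts.get((h, w), 0)
    if n > 0:
        # Seen event: discounted relative frequency.
        return (1.0 - lam[h]) * n / N_h[h]
    # Unseen event: renormalise beta over the words not seen after h.
    beta_0 = sum(beta[v] for v in vocab if (h, v) not in counts)
    return lam[h] * beta[w] / beta_0
```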
The unknown parameters are estimated by maximum likelihood in combination with leaving-one-out. We obtain the log-likelihood function:

$$F \;=\; \sum_{h,w} N(h,w)\,\log p_{-(h,w)}(w|h),$$

where $p_{-(h,w)}(\cdot|\cdot)$ denotes the probability distribution obtained by leaving out one occurrence of the event $(h,w)$ from the training data.
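The leaving-one-out probability can be made explicit in code: removing one occurrence of $(h,w)$ reduces its count by one, so a singleton becomes an unseen event and is scored by the backing-off part, while an event with $N(h,w)>1$ keeps a positive count. A minimal sketch of $F$ under this reading, reusing the illustrative names from above (again with a single unigram backing-off distribution):

```python
from collections import defaultdict
from math import log

def loo_log_likelihood(counts, lam, beta, vocab):
    """F = sum over (h, w) of N(h, w) * log p_{-(h,w)}(w | h)."""
    N_h = defaultdict(int)
    seen = defaultdict(set)
    for (h, w), n in counts.items():
        N_h[h] += n
        seen[h].add(w)

    F = 0.0
    for (h, w), n in counts.items():
        if n == 1:
            # The left-out singleton becomes unseen: it is scored by the
            # backing-off part, and w itself joins the unseen set of h.
            beta_0 = sum(beta[v] for v in vocab if v not in seen[h]) + beta[w]
            F += log(lam[h] * beta[w] / beta_0)
        else:
            # Count reduced by one out of the remaining N(h) - 1 events.
            F += n * log((1.0 - lam[h]) * (n - 1) / (N_h[h] - 1))
    return F
```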
By doing some elementary manipulations, we can decompose the log-likelihood function into two parts, one of which depends only on the discounting parameters $\lambda_h$ and the other only on the backing-off distribution $\beta(w|\bar h)$:

$$F \;=\; F_\lambda \;+\; F_\beta \;+\; \text{const},$$

where the constant collects all terms that depend on neither parameter set.
The $\lambda_h$-dependent part is:

$$F_\lambda \;=\; \sum_h \Big[\, n_1(h)\,\log \lambda_h \;+\; \big(N(h)-n_1(h)\big)\,\log(1-\lambda_h) \,\Big],$$

where $n_1(h)$ denotes the number of singletons of the history $h$: when a singleton event is left out, it becomes unseen and is scored by the backing-off part, whereas each of the remaining $N(h)-n_1(h)$ observations retains a positive count.
Taking the partial derivative with respect to $\lambda_h$ and equating it to zero, we obtain the closed-form solution:

$$\hat\lambda_h \;=\; \frac{n_1(h)}{N(h)}.$$
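For completeness, the intermediate step, using the form of $F_\lambda$ given above:

$$\frac{\partial F_\lambda}{\partial \lambda_h} \;=\; \frac{n_1(h)}{\lambda_h} \;-\; \frac{N(h)-n_1(h)}{1-\lambda_h} \;=\; 0 \;\Longrightarrow\; n_1(h)\,(1-\lambda_h) \;=\; \big(N(h)-n_1(h)\big)\,\lambda_h \;\Longrightarrow\; \lambda_h \;=\; \frac{n_1(h)}{N(h)}.$$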
The same value is obtained when we compute the probability mass of unseen words in the training data for a given history $h$: in leaving-one-out, each of the $n_1(h)$ singletons turns into an unseen word exactly once, so the estimated unseen mass is:

$$\frac{n_1(h)}{N(h)} \;=\; \hat\lambda_h.$$
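In code, the closed-form estimate is a single pass over the counts; `counts` is again the illustrative $(h,w) \mapsto N(h,w)$ dictionary from above:

```python
from collections import defaultdict

def estimate_lambdas(counts):
    """Closed-form discounting parameters: lambda_h = n_1(h) / N(h)."""
    N_h = defaultdict(int)   # N(h): total number of observations of h
    n1_h = defaultdict(int)  # n_1(h): words following h exactly once
    for (h, w), n in counts.items():
        N_h[h] += n
        if n == 1:
            n1_h[h] += 1
    return {h: n1_h[h] / N_h[h] for h in N_h}
```

For example, with `counts = {('the', 'cat'): 2, ('the', 'dog'): 1}` we get $N(\text{the}) = 3$ and $n_1(\text{the}) = 1$, hence $\hat\lambda_{\text{the}} = 1/3$: one third of the probability mass is reserved for words never seen after "the".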
To estimate the backing-off distribution $\beta(w|\bar h)$, we rearrange the sums (neglecting the contribution of the left-out singleton itself to the unseen mass):

$$F_\beta \;=\; \sum_h \sum_{w:\,N(h,w)=1} \log \beta(w|\bar h) \;-\; \sum_h n_1(h)\,\log \beta_0(h),$$

where $n_1(h)$ is, as before, the number of singletons $(h,w)$ for a given history $h$, i.e. the number of words following $h$ exactly once, and where $\beta_0(h)$ is defined as:

$$\beta_0(h) \;:=\; \sum_{w':\,N(h,w')=0} \beta(w'|\bar h).$$
Taking the derivative, we have:

$$\frac{\partial F_\beta}{\partial \beta(w|\bar h)} \;=\; \sum_{h:\,N(h,w)=1} \frac{1}{\beta(w|\bar h)} \;-\; \sum_{h:\,N(h,w)=0} \frac{n_1(h)}{\beta_0(h)},$$

where both sums run over the histories $h$ that back off to $\bar h$, and where we have taken into account that $\beta(w|\bar h)$ contributes to $\beta_0(h)$ only for those histories $h$ for which $w$ appears in the sum over $w'$, i.e. for which $N(h,w) = 0$.
We do not know a closed-form solution for $\beta(w|\bar h)$. By extending the second sum over all histories $h$ [Kneser & Ney (1995)], the subtracted term becomes independent of $w$; setting the derivative to zero then yields $n_1(\bar h, w)/\beta(w|\bar h) = \text{const}(\bar h)$, and we obtain the approximation:

$$\beta(w|\bar h) \;=\; \frac{n_1(\bar h, w)}{\sum_{w'} n_1(\bar h, w')},$$

where $n_1(\bar h, w)$ is the number of histories $h$ backing off to $\bar h$ that are followed by $w$ exactly once.
For convenience, we have chosen the normalisation $\sum_w \beta(w|\bar h) = 1$. This type of backing-off distribution will be referred to as the singleton distribution.
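A sketch of the singleton distribution, assuming for illustration trigram histories $h = (u,v)$ that back off to $\bar h = v$; the names `counts` and `reduce_history` are again our own:

```python
from collections import defaultdict

def singleton_distribution(counts, reduce_history):
    """beta(w | h_bar) = n_1(h_bar, w) / sum over w' of n_1(h_bar, w'),
    where n_1(h_bar, w) counts the histories h backing off to h_bar
    for which (h, w) is a singleton, i.e. N(h, w) == 1."""
    n1 = defaultdict(lambda: defaultdict(int))  # h_bar -> w -> n_1(h_bar, w)
    for (h, w), n in counts.items():
        if n == 1:
            n1[reduce_history(h)][w] += 1

    beta = {}
    for h_bar, per_word in n1.items():
        total = sum(per_word.values())  # enforces sum_w beta(w | h_bar) = 1
        beta[h_bar] = {w: c / total for w, c in per_word.items()}
    return beta
```

For trigrams one would call `singleton_distribution(counts, lambda h: h[1:])`. Note that a word's backing-off probability grows with the number of distinct contexts in which it occurs as a singleton, not with its raw frequency, which is the point of using the singleton distribution rather than relative frequencies for backing off.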