Formal definition


Strictly speaking, to evaluate the quality of a stochastic language model, we would have to run a whole recognition experiment. However, as a first approximation, we can separate the two types of probability distribution in Bayes' decision rule and confine ourselves to the probability that the language model assigns to a sequence of (test or training) words $w_1 \ldots w_N$. To normalise this prior probability with respect to the number $N$ of words, we take the $N$th root and then the inverse to obtain the so-called corpus (or test set) perplexity [Bahl et al. (1983)]:
\[
PP = \Pr(w_1 \ldots w_N)^{-1/N}
\]
Inserting the decomposition into conditional probabilities of Eq. (7.2) and taking the logarithm, we obtain:
\[
\log PP = -\frac{1}{N} \sum_{n=1}^{N} \log \Pr(w_n \mid w_1 \ldots w_{n-1})
\]
To avoid confusion, we prefer the term ``corpus perplexity'' because it can be used for both training and test data. The above equations show that the corpus perplexity is the geometric average of the reciprocal conditional probabilities over all $N$ words. Apart from the constant factor $(-1/N)$, the logarithm of the corpus perplexity is identical to the average conditional log-probability, i.e. the log-likelihood of the corpus. Therefore minimising the corpus perplexity is the same as maximising the log-likelihood function.
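To make the definition concrete, here is a minimal sketch in Python (ours, not part of the handbook; the function name corpus_perplexity is hypothetical) that computes the corpus perplexity from a list of conditional word probabilities $\Pr(w_n \mid w_1 \ldots w_{n-1})$:

    import math

    def corpus_perplexity(cond_probs):
        # Corpus perplexity from the per-word conditional probabilities
        # Pr(w_n | w_1 ... w_{n-1}); a single zero probability makes the
        # perplexity infinite (see below).
        if any(p == 0.0 for p in cond_probs):
            return float("inf")
        n = len(cond_probs)
        log_pp = -sum(math.log(p) for p in cond_probs) / n  # -(1/N) sum log Pr
        return math.exp(log_pp)

    # Four words whose conditional probabilities have product 10^-4,
    # so PP = (10^-4)^(-1/4) = 10 (up to floating-point rounding).
    print(corpus_perplexity([0.1, 0.2, 0.05, 0.1]))

Any common base may be used for the logarithm and the exponentiation, as long as the two match.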

The perplexity measures the constraints expressed by the language model. From the viewpoint of the recognition task, we can say that the language model reduces the number of word choices during the recognition process. Thus the perplexity can be interpreted as the average number of word choices during the recognition process. As a first approximation, the perplexity measures the difficulty of a recognition task: the smaller the perplexity, the lower the error rate. For example, depending on the application and the language model, a recognition system with a vocabulary of 1000 words can have such strong language constraints that the recognition task is easier than digit recognition. This was true for all of the early speech recognition systems like HARPY and HEARSAY [Lea (1980)].

A special aspect of the definition of corpus perplexity should be noted. If a word in the corpus is assigned a probability of zero by the language model, the perplexity becomes infinitely large. This is one of the real challenges for the language model: the prediction of the next word should be as good as possible without excluding any of the words of the vocabulary.
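To illustrate the zero-probability problem, the following sketch (again ours, reusing the corpus_perplexity function from the previous sketch) contrasts a plain relative-frequency unigram model with an add-one (Laplace) smoothed variant when the test data contains a word never seen in training; add-one smoothing is merely one common remedy, not one prescribed by the handbook:

    from collections import Counter

    train = "a b a c a b".split()
    test  = "a b d".split()        # "d" never occurs in the training data
    vocab = {"a", "b", "c", "d"}   # but "d" is still in the vocabulary

    counts = Counter(train)

    def unigram_prob(w, alpha):
        # alpha = 0: relative frequency; alpha = 1: add-one (Laplace) smoothing
        return (counts[w] + alpha) / (len(train) + alpha * len(vocab))

    print(corpus_perplexity([unigram_prob(w, 0.0) for w in test]))  # inf
    print(corpus_perplexity([unigram_prob(w, 1.0) for w in test]))  # about 4.4

The smoothed model pays a small price on the seen words but never assigns probability zero, so its corpus perplexity stays finite.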

In some articles, the authors relate perplexity to entropy as used in information theory [Bahl et al. (1983)]. There the assumption is that the underlying probability distribution of the language model is exactly known. However, for practical comparisons, the so-called test set perplexity or corpus perplexity is more useful.
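For reference, the relation those articles appeal to can be stated as follows (a standard identity, written here in base 2; when the model distribution matches the true source distribution, the exponent converges to the entropy of the source):

\[
PP = 2^{H}, \qquad H = -\frac{1}{N} \sum_{n=1}^{N} \log_2 \Pr(w_n \mid w_1 \ldots w_{n-1}),
\]

so the perplexity is the exponentiated average number of bits the model needs to encode each word.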



