When comparing the difficulty of two recognition
tasks, the perplexity is only a first
approximation: a number of important details
need to be checked as well.
This is even more true when two language models
are compared directly on the same corpus.
In comparing perplexities,
the following points should be checked:
- What is the exact vocabulary, and above all, what
is the exact size of the vocabulary?
- How are punctuation marks and in particular
sentence boundaries treated?
Often, in text dictation,
punctuation marks are included
in the vocabulary.
- How is the unknown or out-of-vocabulary word
handled?
Is it included in the calculation of the perplexity,
or is the perplexity calculated only by averaging
over the spoken words?
- What are the conventions for representing
numbers and dates?
- It makes a difference whether the probabilities
or their logarithms are averaged. To avoid
potential confusion, the
corpus perplexity should be computed for the
corpus as a whole. If it is computed on a
sentence-by-sentence basis, this should be done
by averaging the log-perplexities rather than the
perplexities themselves (see the sketch after this list).
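To illustrate the last point, here is a minimal Python sketch with made-up word probabilities for two hypothetical sentences. It shows that the length-weighted average of the per-sentence log-perplexities reproduces the whole-corpus perplexity, whereas averaging the per-sentence perplexities themselves gives a different number in general.

    import math

    # Hypothetical per-word probabilities assigned by some language model
    # to the words of two sentences (purely illustrative numbers).
    sentence_probs = [
        [0.10, 0.20, 0.05],        # sentence 1
        [0.30, 0.25, 0.40, 0.20],  # sentence 2
    ]

    # Corpus perplexity computed over the corpus as a whole:
    # PP = exp( -(1/N) * sum_i log p(w_i) ) over all N words.
    all_logs = [math.log(p) for sent in sentence_probs for p in sent]
    n_words = len(all_logs)
    corpus_pp = math.exp(-sum(all_logs) / n_words)

    # Sentence-by-sentence computation using log-perplexities:
    # average the per-sentence log-perplexities, weighted by sentence length.
    total_log_pp = sum(-sum(math.log(p) for p in sent) for sent in sentence_probs)
    pp_from_logs = math.exp(total_log_pp / n_words)

    # Averaging the per-sentence perplexities directly gives a different value.
    pp_direct_avg = sum(
        math.exp(-sum(math.log(p) for p in sent) / len(sent))
        for sent in sentence_probs
    ) / len(sentence_probs)

    print(f"corpus perplexity        : {corpus_pp:.3f}")
    print(f"via averaged log-PP      : {pp_from_logs:.3f}")   # same as corpus value
    print(f"via averaged perplexities: {pp_direct_avg:.3f}")  # different in general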