We consider in more detail the implications of the
formal definition of the perplexity:
-  Perplexity
      refers to written (e.g. transcribed) forms of the language only and
      completely ignores
      the acoustic-phonetic modelling. This may be viewed as 
      a strength and a weakness at the same time.
 -  Perplexity is based on the written form
      of the spoken words or, to be precise,
      the fully inflected word forms;
      in speech recognition, there is a convention to call
      every sequence of characters between blanks a
      word.
 -  Perplexity requires a
      closed vocabulary. If a word occurs that is not
      part of the vocabulary, the perplexity becomes infinitely
      large, because the language model assigns that word
      zero probability. This
      out-of-vocabulary word problem will be considered
      below.
 -  Perplexity is merely a single averaged scalar-valued quantity;
there is no information about local variations across the
corpus. It would be straightforward to define
the variance; an even more informative method would use
a histogram over the local probabilities, i.e.
reciprocal local perplexities (see the sketch after this list).
 -  By definition, perplexity depends on both
a specific corpus and a specific language model.
So it has a dual function: perplexity is a measure for
characterising both the corpus and the specific language
model. In other words, using the same language model,
we can compare the difficulty of two corpora, i.e.
their redundancy from the viewpoint of the language model.
This also works the other way round: using the same corpus,
we can compare the quality of two language models.
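
To make the last two points concrete, the following minimal Python sketch
computes the perplexity from the local word probabilities and collects
those local probabilities into a histogram. It is purely illustrative:
the callable cond_prob, standing in for the language model, and the
function names are assumptions, not part of the definition above.

    import math
    from collections import Counter

    def perplexity_and_local_probs(test_words, cond_prob):
        """Corpus perplexity together with the local (per-word) probabilities.

        test_words -- list of words forming the test corpus
        cond_prob  -- assumed callable: cond_prob(word, history) returns the
                      conditional probability the language model assigns to
                      'word' given the preceding words 'history'
        """
        local_probs = []
        log_sum = 0.0
        for i, word in enumerate(test_words):
            p = cond_prob(word, test_words[:i])   # p(w_i | w_1 ... w_{i-1})
            local_probs.append(p)
            # p == 0 (e.g. an out-of-vocabulary word) would make math.log fail,
            # mirroring the perplexity becoming infinitely large
            log_sum += math.log(p)
        perplexity = math.exp(-log_sum / len(test_words))  # PP = exp(-(1/N) sum log p)
        return perplexity, local_probs

    def local_prob_histogram(local_probs, n_bins=10):
        """Histogram over the local probabilities, a more informative summary
        than the single averaged perplexity value."""
        bins = Counter(min(int(p * n_bins), n_bins - 1) for p in local_probs)
        return [bins.get(b, 0) for b in range(n_bins)]

Applying the same cond_prob to two corpora compares their difficulty;
applying two different models to the same corpus compares the models,
exactly as described in the last item above.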
 
The definition of perplexity involves the issue
of coverage at several levels and in
different aspects:
-  vocabulary coverage: The vocabulary is
      assumed to be closed, i.e. each word spoken in
      the test set must be part of the vocabulary
      of the recogniser specified beforehand.
      In recognition tasks like text dictation, this
      problem is often circumvented by adding the
      out-of-vocabulary words to the conventional
      vocabulary.
 -  bigram and trigram coverage:
  The language
      model should cover those word bigrams and word trigrams
      that are typical of the test sentences.
 -  coverage measure: The perplexity can be
      used as a quantitative measure of the coverage of
      the language model, i.e. the perplexity measures
      how well the language model covers the test sentences
      (see the sketch after this list).
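
The following sketch shows how these coverage notions can be measured
directly on a test set. It is illustrative only: the sets vocabulary,
seen_bigrams and seen_trigrams are assumed to have been collected from
the training material, and the function name is hypothetical.

    def coverage_report(test_words, vocabulary, seen_bigrams, seen_trigrams):
        """Fraction of test-set words, bigrams and trigrams that are covered.

        vocabulary, seen_bigrams, seen_trigrams -- plain Python sets built
        from the training data (an illustrative representation, not a real
        language-model interface). Assumes the test set has at least three
        words.
        """
        in_vocab = sum(1 for w in test_words if w in vocabulary)
        bigrams = list(zip(test_words, test_words[1:]))
        trigrams = list(zip(test_words, test_words[1:], test_words[2:]))
        return {
            "vocabulary coverage": in_vocab / len(test_words),
            "bigram coverage": sum(1 for b in bigrams if b in seen_bigrams) / len(bigrams),
            "trigram coverage": sum(1 for t in trigrams if t in seen_trigrams) / len(trigrams),
        }

The perplexity then summarises the same information quantitatively:
poorly covered words and word sequences receive low probabilities and
drive the perplexity up.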
 
 
In most cases, the definition of the recognition vocabulary is
based on the collection of representative
text corpora. The most frequent words in the corpus
define the recognition vocabulary.
This method seems to be widely used for recognition
systems working in speaker-independent mode.
For speaker-dependent systems, it is not practical
to collect a sufficiently large corpus from a single person.
Therefore, typically, some combination with a speaker-independent corpus is
used. Special techniques have been developed for this purpose
of vocabulary personalisation [Jelinek et al. (1991a)].
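
As an illustration of this frequency-based selection, a minimal sketch
follows. The vocabulary size of 20000 and the constant weighting factor
used to combine the two corpora are arbitrary assumptions for the sake of
the example and do not reproduce the technique of Jelinek et al. (1991a).

    from collections import Counter

    def build_vocabulary(corpus_words, vocab_size=20000):
        """Recognition vocabulary = the vocab_size most frequent corpus words."""
        counts = Counter(corpus_words)
        return {word for word, _ in counts.most_common(vocab_size)}

    def personalised_vocabulary(speaker_words, general_words,
                                vocab_size=20000, speaker_weight=100):
        """Naive combination of a small speaker-dependent corpus with a large
        speaker-independent one: the speaker's own counts are simply
        up-weighted by a constant factor before the most frequent words are
        selected."""
        counts = Counter(general_words)
        for word, count in Counter(speaker_words).items():
            counts[word] += speaker_weight * count
        return {word for word, _ in counts.most_common(vocab_size)}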
 
 
 
 
 
 
 