We consider in more detail the implications of the
formal definition of the perplexity:
- Perplexity
refers to written (e.g. transcribed) forms of the language only and
completely ignores acoustic-phonetic modelling. This may be viewed as
a strength and a weakness at the same time.
- Perplexity is based on the written form
of the spoken words or, to be precise,
the fully inflected word forms;
in speech recognition, it is conventional to call
every sequence of characters between blanks a
word.
- Perplexity requires a
closed vocabulary. If a word occurs that is not
part of the vocabulary, the language model assigns it
probability zero, and the perplexity becomes infinitely
large. This
out-of-vocabulary word problem will be considered
below.
- Perplexity is merely a single averaged scalar-valued quantity;
it carries no information about local variations across the
corpus. It would be straightforward to define
the variance; an even more informative method would use
a histogram over the local probabilities, i.e.
the reciprocal perplexities.
- By definition, perplexity depends on both
a specific corpus and a specific language model.
So it has a dual function: perplexity characterises
both the corpus and the specific language
model. In other words, using the same language model,
we can compare the difficulty of two corpora, i.e.
their redundancy from the viewpoint of the language model.
This also works the other way round: using the same corpus,
we can compare the quality of two language models.
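As a concrete illustration of these implications, the following sketch computes perplexity as the exponential of the average negative log-probability, using a hypothetical toy text and a simple unigram model (both chosen for illustration only). It also shows how a single out-of-vocabulary word drives the perplexity to infinity:

```python
import math

# Toy unigram model estimated from a tiny hypothetical training text.
train = "the cat sat on the mat".split()
counts = {}
for w in train:
    counts[w] = counts.get(w, 0) + 1
total = sum(counts.values())
model = {w: c / total for w, c in counts.items()}

def perplexity(words, model):
    """Exponential of the average negative log-probability of the words."""
    neg_logs = []
    for w in words:
        p = model.get(w, 0.0)
        if p == 0.0:            # out-of-vocabulary word: probability zero
            return math.inf     # perplexity becomes infinitely large
        neg_logs.append(-math.log(p))
    return math.exp(sum(neg_logs) / len(neg_logs))

print(perplexity("the cat sat".split(), model))  # finite
print(perplexity("the dog sat".split(), model))  # inf: 'dog' is out of vocabulary
```

Collecting the per-word negative log-probabilities in a list, rather than only their sum, also makes it easy to compute the variance or a histogram of local probabilities mentioned above.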
The definition of perplexity involves the issue
of coverage
at several levels and in
different aspects:
- vocabulary coverage: The vocabulary is
assumed to be closed, i.e. each word spoken in
the test set must be part of the vocabulary
of the recogniser specified beforehand.
In recognition tasks like text dictation, this
problem is often circumvented by adding the
out-of-vocabulary word to the conventional
vocabulary.
- bigram and trigram coverage:
The language
model should cover those word bigrams and word trigrams
that are typical of the test sentences.
- coverage measure: The perplexity can be
used as a quantitative measure of the coverage of
the language model, i.e. the perplexity measures
how well the language model covers the test sentences.
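A rough check of bigram and trigram coverage can be sketched as follows; the tiny training and test texts here are hypothetical placeholders for a real training corpus and real test sentences:

```python
def ngrams(words, n):
    """All n-grams of a word sequence as tuples."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

# Hypothetical toy texts, for illustration only.
train = "we consider the formal definition of the perplexity".split()
test = "we consider the perplexity of the corpus".split()

coverage = {}
for n in (2, 3):
    seen = set(ngrams(train, n))
    test_grams = ngrams(test, n)
    coverage[n] = sum(g in seen for g in test_grams) / len(test_grams)

print(coverage)  # fraction of test bigrams/trigrams seen in training
```

Such a raw count only measures which n-grams have been seen at all; the perplexity refines this into a quantitative measure by also weighting how probable the covered n-grams are under the model.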
In most cases, the definition of the recognition vocabulary is
based on the collection of representative
text corpora. The most frequent words in the corpus
define the recognition vocabulary.
This method seems to be widely used for recognition
systems working in speaker independent mode.
For speaker dependent systems, it is not practical
to collect a sufficiently large corpus from a single person,
so some combination with a speaker independent corpus is
typically used. Special techniques have been developed for this purpose
of vocabulary personalisation [Jelinek et al. (1991a)].
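The frequency-based vocabulary selection described above can be sketched like this; the corpus and the vocabulary size are hypothetical placeholders, since a real system would use a large representative text corpus and a vocabulary of many thousands of words:

```python
from collections import Counter

# Hypothetical toy corpus standing in for a large representative text corpus.
corpus = "the cat sat on the mat and the dog sat on the log".split()

VOCAB_SIZE = 5  # assumed vocabulary size, for illustration only

# The most frequent words in the corpus define the recognition vocabulary.
vocab = [w for w, _ in Counter(corpus).most_common(VOCAB_SIZE)]
print(vocab)
```

Every corpus word outside this list would then be an out-of-vocabulary word at recognition time, which is why the vocabulary size is a trade-off between coverage and model size.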
EAGLES SWLG SoftEdition, May 1997.