We consider in more detail the implications of the
formal definition of the perplexity:
-  Perplexity
      refers to written (e.g. transcribed) forms of the language only and
      completely ignores
      the acoustic-phonetic modelling. This may be viewed as 
      a strength and a weakness at the same time.
 -  Perplexity is based on the written form
      of the spoken words or, to be precise,
      the fully inflected word forms;
      in speech recognition, there is a convention to call
      every sequence of characters between blanks a
      word.
 -  Perplexity requires a
      closed vocabulary. If a word occurs that is not
      part of the vocabulary, the perplexity becomes infinitely
      large, because the language model assigns that word
      zero probability. This
      out-of-vocabulary word problem will be considered
      below.
 -  Perplexity is merely a single averaged scalar-valued quantity;
there is no information about local variations across the
corpus. It would be straightforward to define
the variance; an even more informative method would use
a histogram over the local probabilities, i.e.
reciprocal local perplexities (see the sketch after this list).
 -  By definition, perplexity depends on both
a specific corpus and a specific language model.
So it has a dual function: perplexity is a measure for
characterising both the corpus and the specific language
model. In other words, using the same language model,
we can compare the difficulty of two corpora, i.e.
their redundancy from the viewpoint of the language model.
This also works the other way round: using the same corpus,
we can compare the quality of two language models.
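
To make the last two points concrete, the following minimal Python sketch
computes the perplexity from the local word probabilities and collects
those local probabilities into a histogram. It is purely illustrative:
the callable cond_prob, standing in for the language model, and the
function names are assumptions, not part of the definition above.

    import math
    from collections import Counter

    def perplexity_and_local_probs(test_words, cond_prob):
        """Corpus perplexity together with the local (per-word) probabilities.

        test_words -- list of words forming the test corpus
        cond_prob  -- assumed callable: cond_prob(word, history) returns the
                      conditional probability the language model assigns to
                      'word' given the preceding words 'history'
        """
        local_probs = []
        log_sum = 0.0
        for i, word in enumerate(test_words):
            p = cond_prob(word, test_words[:i])   # p(w_i | w_1 ... w_{i-1})
            local_probs.append(p)
            # p == 0 (e.g. an out-of-vocabulary word) would make math.log fail,
            # mirroring the perplexity becoming infinitely large
            log_sum += math.log(p)
        perplexity = math.exp(-log_sum / len(test_words))  # PP = exp(-(1/N) sum log p)
        return perplexity, local_probs

    def local_prob_histogram(local_probs, n_bins=10):
        """Histogram over the local probabilities, a more informative summary
        than the single averaged perplexity value."""
        bins = Counter(min(int(p * n_bins), n_bins - 1) for p in local_probs)
        return [bins.get(b, 0) for b in range(n_bins)]

Applying the same cond_prob to two corpora compares their difficulty;
applying two different models to the same corpus compares the models,
exactly as described in the last item above.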
 
The definition of perplexity involves the issue
of coverage at several levels and in
different aspects:
-  vocabulary coverage: The vocabulary is
      assumed to be closed, i.e. each word spoken in
      the test set must be part of the vocabulary
      of the recogniser specified beforehand.
      In recognition tasks like text dictation, this
      problem is often circumvented by adding the
      out-of-vocabulary words to the conventional
      vocabulary.
 -  bigram and trigram coverage:
  The language
      model should cover those word bigrams and word trigrams
      that are typical of the test sentences.
 -  coverage measure: The perplexity can be
      used as a quantitative measure of the coverage of
      the language model, i.e. the perplexity measures
      how well the language model covers the test sentences
      (see the sketch after this list).
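
The following sketch shows how these coverage notions can be measured
directly on a test set. It is illustrative only: the sets vocabulary,
seen_bigrams and seen_trigrams are assumed to have been collected from
the training material, and the function name is hypothetical.

    def coverage_report(test_words, vocabulary, seen_bigrams, seen_trigrams):
        """Fraction of test-set words, bigrams and trigrams that are covered.

        vocabulary, seen_bigrams, seen_trigrams -- plain Python sets built
        from the training data (an illustrative representation, not a real
        language-model interface). Assumes the test set has at least three
        words.
        """
        in_vocab = sum(1 for w in test_words if w in vocabulary)
        bigrams = list(zip(test_words, test_words[1:]))
        trigrams = list(zip(test_words, test_words[1:], test_words[2:]))
        return {
            "vocabulary coverage": in_vocab / len(test_words),
            "bigram coverage": sum(1 for b in bigrams if b in seen_bigrams) / len(bigrams),
            "trigram coverage": sum(1 for t in trigrams if t in seen_trigrams) / len(trigrams),
        }

The perplexity then summarises the same information quantitatively:
poorly covered words and word sequences receive low probabilities and
drive the perplexity up.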
 
 
In most cases, the definition of the recognition vocabulary is
based on the collection of representative
text corpora. The most frequent words in the corpus
define the recognition vocabulary.
This method seems to be widely used for recognition
systems working in speaker-independent mode.
For speaker-dependent systems, it is not practical
to collect a sufficiently large corpus from a single person.
Therefore, typically, some combination with a speaker-independent corpus is
used. Special techniques have been developed for this purpose
of vocabulary personalisation [Jelinek et al. (1991a)].
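
As an illustration of this frequency-based selection, a minimal sketch
follows. The vocabulary size of 20000 and the constant weighting factor
used to combine the two corpora are arbitrary assumptions for the sake of
the example and do not reproduce the technique of Jelinek et al. (1991a).

    from collections import Counter

    def build_vocabulary(corpus_words, vocab_size=20000):
        """Recognition vocabulary = the vocab_size most frequent corpus words."""
        counts = Counter(corpus_words)
        return {word for word, _ in counts.most_common(vocab_size)}

    def personalised_vocabulary(speaker_words, general_words,
                                vocab_size=20000, speaker_weight=100):
        """Naive combination of a small speaker-dependent corpus with a large
        speaker-independent one: the speaker's own counts are simply
        up-weighted by a constant factor before the most frequent words are
        selected."""
        counts = Counter(general_words)
        for word, count in Counter(speaker_words).items():
            counts[word] += speaker_weight * count
        return {word for word, _ in counts.most_common(vocab_size)}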
 
 
 
 
 
 
 