A number of parameters define the capability of a speech recognition
system. In Table 10.1 these parameters are categorised. The
classification made here is based upon the typical design
considerations of a recognition system, which may be closely related
to a specific application or task. In general, these parameters are
fixed in the system in one way or another. For each of the categories,
the extremes of an easy and difficult task, from the recogniser's
point of view, are given.
- Vocabulary size
- The vocabulary size is of importance to
the recogniser and its performance. The vocabulary is
defined as the set of words that the recogniser can select from, i.e. the words it can refer to. Where there are few
choices, recognition is obviously easier than if the vocabulary
is large. The adjectives ``small'', ``medium'' and ``large'' are applied
to vocabulary sizes of the order of 100, 1000 and (over) 5000
words, respectively. A typical small vocabulary recogniser can
recognise only ten digits; a typical large vocabulary
recognition system, 20000 words.
- Speech type
- There is a distinction between
``isolated words'', ``connected words'' and
``continuous speech''. For isolated words, the beginning and the end
of each word can be detected directly from the energy of the signal.
This makes the job of word boundary detection
(segmentation), and often that of recognition, a lot easier
than if the words are connected or even
continuous, as is the case
for natural connected discourse. The difference in classification between
``connected words'' and ``continuous speech'' is somewhat
technical. A connected word recogniser uses words as recognition
units, which can be trained in an isolated word mode. Continuous speech
is generally associated with large vocabulary recognisers that
use subword units such as phones as recognition units, and can be trained
with continuous speech.
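The energy-based boundary detection mentioned above can be sketched as follows for an isolated word; the frame length, threshold and synthetic signal are illustrative choices, not values from the text:

```python
# Minimal sketch of energy-based word boundary detection for isolated words:
# frames whose short-time energy exceeds a threshold are labelled as speech.

def frame_energies(samples, frame_len=160):
    """Short-time energy of each non-overlapping frame."""
    return [sum(s * s for s in samples[i:i + frame_len])
            for i in range(0, len(samples) - frame_len + 1, frame_len)]

def detect_word(energies, threshold):
    """Return (start_frame, end_frame) of the region above threshold, or None."""
    speech = [i for i, e in enumerate(energies) if e > threshold]
    if not speech:
        return None
    return speech[0], speech[-1]

# Toy signal: silence, a loud "word", silence again.
signal = [0.01] * 320 + [0.5] * 480 + [0.01] * 320
energies = frame_energies(signal)
print(detect_word(energies, threshold=1.0))  # → (2, 4)
```

Real endpoint detectors add refinements (adaptive thresholds, zero-crossing rates, hangover frames), but the principle is just this energy comparison.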
- Speaker dependency
- The recognition task can be either speaker dependent or speaker independent.
Speaker independent recognition is more difficult, because the internal
representation of the speech must somehow be global enough to cover
all types of voices and all possible ways of pronouncing words, and
yet specific enough to discriminate between the various words of the
vocabulary.
For a speaker dependent system the training is usually
carried out by the user, but for applications such as large
vocabulary dictation systems this is too time consuming for an
individual user. In such cases an intermediate technique known as
speaker adaptation is used. Here, the system is
bootstrapped with speaker-independent models,
and then gradually
adapts to the specific aspects of the user.
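One common formulation of this gradual adaptation (an illustrative choice here, not necessarily what any particular system uses) is MAP adaptation of a Gaussian mean: the speaker-independent mean acts as a prior that is pulled towards the user's data as adaptation frames accumulate.

```python
# Hedged sketch of MAP speaker adaptation of a single Gaussian mean.
# The relevance factor tau controls how quickly the speaker-independent
# prior gives way to the speaker's own data; tau=10 is an arbitrary choice.

def map_adapt_mean(si_mean, frames, tau=10.0):
    """Interpolate the speaker-independent mean with the speaker's sample mean."""
    n = len(frames)
    if n == 0:
        return si_mean          # no adaptation data: keep the prior
    sample_mean = sum(frames) / n
    return (tau * si_mean + n * sample_mean) / (tau + n)

si_mean = 0.0                    # speaker-independent model mean
print(map_adapt_mean(si_mean, [1.0] * 5))     # few frames: mild shift, ≈ 0.333
print(map_adapt_mean(si_mean, [1.0] * 1000))  # much data: ≈ 0.990, close to the speaker
```

With little data the estimate stays near the bootstrapped model; with much data it converges to a speaker-dependent estimate, which is exactly the intermediate behaviour described above.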
- Grammar
- In order to reduce the effective number
of words to select from, recognition systems are often equipped with some
knowledge of the language. This may vary from very strict
syntax rules, in which the words that may follow one another
are defined by certain rules, to probabilistic language models,
in which the probability of the output sentence is taken into
consideration, based on statistical knowledge of the language. An
objective measure of the ``freedom'' of the grammar is the
perplexity, which measures the average
branching factor of the grammar. The higher the
perplexity, the more words to choose from at each
instant, and hence the more difficult the task. See
Chapter 7 for a detailed discussion on language modelling.
An example of a very simple grammar is a
sentence-generating syntax
that can generate only six different sentences, which vary in the
number of words.
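The original syntax diagram is not reproduced here, but a hypothetical grammar in the same spirit can be written down as rewrite rules and expanded exhaustively; the rules below are an assumption for illustration, not the book's example:

```python
# A toy context-free grammar (hypothetical rules) expanded exhaustively.
# Non-terminals are keys of the dict; anything else is a terminal word.
import itertools

grammar = {
    "S":    [["call", "NAME"], ["please", "call", "NAME"]],
    "NAME": [["john"], ["bob"], ["mary", "jane"]],
}

def expand(symbol):
    """Yield every word sequence the symbol can derive."""
    if symbol not in grammar:          # terminal word
        yield [symbol]
        return
    for rule in grammar[symbol]:
        for parts in itertools.product(*(expand(s) for s in rule)):
            yield [w for part in parts for w in part]

sentences = [" ".join(words) for words in expand("S")]
print(len(sentences))   # → 6
print(sentences)
```

Exactly six sentences come out, ranging from two to four words, mirroring the kind of tightly constrained task described above.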
For an example of statistical knowledge, consider the word million
being recognised. If the domain is financial jargon,
one can make a prediction of the next word, based on the following
excerpt of conditional probabilities:
| Word pair | Conditional probability |
|---|---|
| million acres | 0.00139 |
| million boxes | 0.00023 |
| million canadian | 0.00846 |
| million dollar | 0.0935 |
| million dollars | 0.642 |
| million left | 0.0000081 |
There are almost two out of three chances that the word following
million will be dollars (at least, within the domain
of the Wall Street Journal (WSJ)).
These numbers were calculated from
37 million words of text from a financial newspaper (the WSJ).
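A sketch of how such bigram statistics are used: given the previous word, rank the candidate successors by their conditional probability. The numbers are copied from the excerpt above.

```python
# P(next word | "million"), taken from the WSJ-derived table in the text.
p_next = {
    "acres": 0.00139, "boxes": 0.00023, "canadian": 0.00846,
    "dollar": 0.0935, "dollars": 0.642, "left": 0.0000081,
}

# The language model's best guess for the next word is the argmax.
best = max(p_next, key=p_next.get)
print(best, p_next[best])   # → dollars 0.642
```

In a full recogniser these probabilities are not used alone but combined with the acoustic scores, biasing the search towards likely word sequences.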
- Training
- The way an automatic speech recognition system is trained can vary.
If each word of the vocabulary is trained many times, the system has
an opportunity to build robust models of the words, and
hence a good performance should be expected. Some systems can be
trained with only one example of each word, or even none (if the
models are pre-built). The number of times each word is trained is
called the number of training passes.
Another training issue that defines the capability of a
system is whether or not it can deal with embedded training.
In embedded training the system is trained with strings of words
(utterances) of which the starting and ending points are not
specified explicitly. A typical example is a large vocabulary
continuous speech recognition system that is trained with
whole sentences, of which only the orthographic transcriptions
are available.
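One common way to bootstrap embedded training (an assumed technique for illustration, not a claim about any specific system) is a ``flat start'': the utterance's frames are divided evenly among the words of the transcription as a first, crude alignment, which later re-estimation refines.

```python
# Hedged sketch of flat-start initialization for embedded training: with only
# the orthographic transcription available, assign each word an equal share
# of the utterance's frames as an initial (deliberately crude) segmentation.

def flat_start(n_frames, transcription):
    """Return (word, start_frame, end_frame) spans covering all frames."""
    words = transcription.split()
    per_word = n_frames // len(words)
    spans, start = [], 0
    for i, word in enumerate(words):
        # any leftover frames go to the last word so the spans cover everything
        end = n_frames if i == len(words) - 1 else start + per_word
        spans.append((word, start, end))
        start = end
    return spans

print(flat_start(100, "the deal is worth ten million dollars"))
```

Real systems then iterate: the models trained on this rough segmentation realign the data, and the improved boundaries retrain the models, without word boundaries ever being marked by hand.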