
Classification of recognition systems


A number of parameters define the capability of a speech recognition system. Table 10.1 categorises these parameters. The classification is based on the typical design considerations of a recognition system, which may be closely related to a specific application or task. In general, these parameters are fixed into the system in one way or another. For each category, the extremes of an easy and a difficult task, from the recogniser's point of view, are given.


Parameter            Easy task           Difficult task
Vocabulary size      small               unlimited
Speech type          isolated words      continuous speech
Speaker dependency   speaker dependent   speaker independent
Grammar              strict syntax       natural language
Training method      multiple training   embedded training

Table 10.1: Classification of speech recognition systems

Vocabulary size
The vocabulary size is important to the recogniser and its performance. The vocabulary is defined as the set of words that the recogniser can select from, i.e. the words it can refer to. Recognition is easier when there are few choices than when the vocabulary is large. The adjectives ``small'', ``medium'' and ``large'' are applied to vocabulary sizes of the order of 100, 1000 and (over) 5000 words, respectively. A typical small vocabulary recogniser can recognise only the ten digits; a typical large vocabulary recognition system can recognise 20000 words.

Speech type
There is a distinction between ``isolated words'', ``connected words'' and ``continuous speech''. For isolated words, the beginning and the end of each word can be detected directly from the energy of the signal. This makes word boundary detection (segmentation), and often recognition itself, a lot easier than when the words are connected or even continuous, as is the case for natural connected discourse. The difference in classification between ``connected words'' and ``continuous speech'' is somewhat technical. A connected word recogniser uses words as recognition units, which can be trained in an isolated word mode. Continuous speech is generally associated with large vocabulary recognisers that use subword units such as phones as recognition units, and that can be trained with continuous speech.
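The energy-based detection of isolated words mentioned above can be sketched as follows. This is a minimal illustration, not a method taken from the text: the function names, the non-overlapping framing, the frame length and the fixed threshold are all assumptions chosen for a toy signal.

```python
def frame_energies(samples, frame_len):
    """Mean squared amplitude of each non-overlapping frame."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energies.append(sum(x * x for x in frame) / frame_len)
    return energies


def find_word_boundaries(samples, frame_len=4, threshold=0.01):
    """Return (start, end) sample indices of runs of frames above threshold.

    frame_len and threshold are illustrative values, not recommendations.
    """
    segments = []
    start = None
    energies = frame_energies(samples, frame_len)
    for i, energy in enumerate(energies):
        if energy > threshold and start is None:
            start = i * frame_len                     # word begins
        elif energy <= threshold and start is not None:
            segments.append((start, i * frame_len))   # word ends
            start = None
    if start is not None:                             # signal ends mid-word
        segments.append((start, len(energies) * frame_len))
    return segments


# Toy signal: silence, a "word", silence, another "word", silence.
signal = [0.0] * 8 + [0.5, -0.5] * 4 + [0.0] * 8 + [0.3, -0.3] * 6 + [0.0] * 4
print(find_word_boundaries(signal))  # [(8, 16), (24, 36)]
```

With connected or continuous speech there are no silent gaps between words, so a threshold of this kind finds no boundaries inside an utterance, which is why segmentation becomes harder.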

Speaker dependency
The recognition task can be either speaker dependent or speaker independent. Speaker independent recognition is more difficult, because the internal representation of the speech must somehow be global enough to cover all types of voices and all possible ways of pronouncing words, and yet specific enough to discriminate between the various words of the vocabulary.

For a speaker dependent system the training is usually carried out by the user, but for applications such as large vocabulary dictation systems this is too time consuming for an individual user. In such cases an intermediate technique known as speaker adaptation is used. Here, the system is bootstrapped with speaker-independent models, and then gradually adapts to the specific aspects of the user.

Grammar
In order to reduce the effective number of words to select from, recognition systems are often equipped with some knowledge of the language. This may vary from very strict syntax rules, in which the words that may follow one another are defined by certain rules, to probabilistic language models, in which the probability of the output sentence is taken into consideration, based on statistical knowledge of the language. An objective measure of the ``freedom'' of the grammar is the perplexity, which measures the average branching factor of the grammar. The higher the perplexity, the more words there are to choose from at each instant, and hence the more difficult the task. See Chapter 7 for a detailed discussion of language modelling.
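As a rough illustration of perplexity as a branching factor, the sketch below uses one common formalisation, assumed here because the text gives no formula: the perplexity of a single next-word distribution is 2 raised to its entropy in bits. A full language model evaluation would average over a test text; the function name and the two example distributions are hypothetical.

```python
import math


def perplexity(probabilities):
    """Perplexity 2**H of one next-word distribution (H = entropy in bits)."""
    entropy = -sum(p * math.log2(p) for p in probabilities if p > 0)
    return 2 ** entropy


# Ten equally likely next words: the perplexity is (up to rounding) 10,
# i.e. the number of words to choose from.
uniform_ten = [0.1] * 10

# Ten possible next words, one of which is strongly favoured: the
# effective number of choices is much smaller than 10.
skewed = [0.9] + [0.1 / 9] * 9

print(perplexity(uniform_ten))
print(perplexity(skewed))
```

A strict syntax corresponds to the second case: even when many words are in the vocabulary, only a few are plausible at each point, so the task is easier.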

An example of a very simple grammar is the following sentence-generating syntax:


which can generate only six different sentences, which vary in the number of words.
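To illustrate how a strict syntax limits the recogniser's choices, the sketch below enumerates every sentence of a small finite-state grammar. The grammar is hypothetical (it is not the syntax referred to above) and was chosen only so that it also yields six sentences of differing length; the names `GRAMMAR` and `generate_sentences` are likewise assumptions.

```python
from itertools import product

# Hypothetical syntax: [please] (open | close | lock) the door
# An empty string marks an optional word that is left out.
GRAMMAR = [
    ["", "please"],
    ["open", "close", "lock"],
    ["the"],
    ["door"],
]


def generate_sentences(grammar):
    """Enumerate every sentence by picking one alternative per slot."""
    sentences = []
    for choice in product(*grammar):
        words = [w for w in choice if w]  # drop omitted optional words
        sentences.append(" ".join(words))
    return sentences


sentences = generate_sentences(GRAMMAR)
for s in sentences:
    print(len(s.split()), s)
```

At each position the recogniser has to decide between at most three words, however large the underlying vocabulary might be.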

For an example of statistical knowledge, consider the word million being recognised. If the domain is financial jargon, one can make a prediction of the next word, based on the following excerpt of conditional probabilities:
million acres      0.00139
million boxes      0.00023
million canadian   0.00846
million dollar     0.0935
million dollars    0.642
million left       0.0000081
There is almost a two in three chance that the word following million will be dollars (at least, within the domain of the Wall Street Journal (WSJ)). These numbers were calculated from 37 million words of text from a financial newspaper (the WSJ).
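The way such probabilities can guide recognition is sketched below. The probabilities are copied from the excerpt above; the function `rank_candidates`, the candidate list and the choice to give unseen words a probability of zero are assumptions made for illustration only. A real recogniser would combine these scores with acoustic evidence.

```python
# Conditional probabilities P(next word | "million"), copied from the excerpt.
NEXT_WORD_GIVEN_MILLION = {
    "acres": 0.00139,
    "boxes": 0.00023,
    "canadian": 0.00846,
    "dollar": 0.0935,
    "dollars": 0.642,
    "left": 0.0000081,
}


def rank_candidates(bigram_probs, candidates):
    """Sort candidate words by decreasing probability; unseen words get 0."""
    return sorted(candidates, key=lambda w: bigram_probs.get(w, 0.0), reverse=True)


# Hypothetical case: the acoustics cannot separate these similar-sounding words.
candidates = ["dollar", "dollars", "collars"]
print(rank_candidates(NEXT_WORD_GIVEN_MILLION, candidates))
# ['dollars', 'dollar', 'collars']
```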

Training method
The way an automatic speech recognition system is trained can vary. If each word of the vocabulary is trained many times, the system has an opportunity to build robust models of the words, and hence good performance should be expected. Some systems can be trained with only one example of each word, or even none (if the models are pre-built). The number of times each word is trained is called the number of training passes.

Another training issue that defines the capability of a system is whether or not it can deal with embedded training. In embedded training the system is trained with strings of words (utterances) whose starting and ending points are not specified explicitly. A typical example is a large vocabulary continuous speech recognition system that is trained with whole sentences, for which only the orthographic transcriptions are available.



EAGLES SWLG SoftEdition, May 1997.