Assessment parameters

The following list contains parameters that can be changed according to the type of test one uses. These parameters control the level of diagnostics, the representativeness, the accuracy of the results, etc.

Speech material
This is in fact the choice of database, as described in Section 10.3. There is a long list of possible speech material to be used. Some frequently used databases are:
Isolated words
for an isolated word recogniser (digits, numbers, the (spelling) alphabet, task-related words, etc.)
Connected words or short sentences
for a connected word recogniser (digit strings, connected task words according to a syntax, etc.)
Isolated CVCs
for a diagnostic test of an isolated word recogniser
CVCs in carrier sentences
for diagnostic tests of a connected word recogniser
Newspaper sentences
for benchmark evaluation of continuous speech recognisers

Speaker characterisation
Each speaker has some associated properties, such as sex, age, dialect, profession, etc. Some control over these properties can be obtained by selecting the test speakers or specific material in the database.

Number of speakers
This parameter is relevant for speaker-independent systems. The variability of speech recognition scores is known to be strongly speaker dependent. Apparently, speakers can be classified as ``goats'' (low recognition scores) and ``sheep'' (high recognition scores). Because this classification is often not known a priori, many speakers are necessary for a benchmarking evaluation. For a speaker-independent recognition system, 20 speakers is considered a reasonable number, but this depends very much on the variance of the individual speaker scores. A sufficient number of speakers allows estimation of the variance in score due to speaker variability, and significance can be tested using Student's t-test.
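The use of Student's t-test on per-speaker scores can be sketched as follows. The accuracies below are invented for illustration; pairing the observations per speaker removes the large between-speaker (``sheep vs. goats'') variance from the comparison of two recognisers:

```python
import math
import statistics

# Hypothetical per-speaker word accuracies (%) for two recognisers,
# measured on the same 20 test speakers (paired observations).
scores_a = [92.1, 88.4, 95.0, 79.2, 91.3, 85.7, 90.0, 93.4, 82.1, 94.8,
            87.5, 89.9, 91.0, 76.4, 93.2, 88.8, 90.5, 84.3, 92.7, 86.1]
scores_b = [90.3, 87.1, 94.2, 80.0, 89.9, 84.0, 88.7, 92.8, 80.5, 93.9,
            86.2, 88.4, 90.1, 77.0, 91.8, 87.5, 89.0, 83.1, 91.5, 85.0]

# Per-speaker score differences: each speaker acts as his own control.
diffs = [a - b for a, b in zip(scores_a, scores_b)]
n = len(diffs)
mean_d = statistics.mean(diffs)
sd_d = statistics.stdev(diffs)          # sample standard deviation
t = mean_d / (sd_d / math.sqrt(n))      # Student's t with n-1 degrees of freedom

print(f"mean difference = {mean_d:.2f}, t = {t:.2f} (df = {n - 1})")
```

The resulting t value is compared against the t distribution with n-1 degrees of freedom to decide whether the difference between the two recognisers is significant.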

Training method
The training method is determined by the intended application. Some applications (typically those with a large vocabulary) may demand that the complete vocabulary is trained only once or twice by each user. For a dictation system (with ``unlimited'' vocabulary) one may have to use a pre-trained system. Other applications (e.g. command-and-control) may assume more effort from the user. If the assessment is application oriented, a representative training procedure should be used.

A relevant parameter for isolated or connected word recognisers is the number of training sessions. Predicting performance as a function of the number of training sessions can help optimise their use.

RECOMMENDATION 4
To determine the minimum number of training sessions, carry out a small-scale experiment and estimate the variance in the scores.
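Such a small-scale experiment can be sketched as follows (all scores below are invented for illustration): record the accuracy of a few test speakers after each number of training sessions, then compare the gain from an extra session against the estimated variance across speakers.

```python
import statistics

# Hypothetical small-scale experiment: word accuracy (%) for five test
# speakers after 1, 2 and 3 training sessions per speaker.
scores_by_sessions = {
    1: [78.0, 82.5, 75.1, 80.2, 79.4],
    2: [85.3, 87.0, 83.9, 86.1, 85.7],
    3: [86.0, 87.4, 84.5, 86.3, 86.2],
}

for sessions, scores in scores_by_sessions.items():
    mean = statistics.mean(scores)
    var = statistics.variance(scores)   # sample variance across speakers
    print(f"{sessions} session(s): mean = {mean:.1f}%, variance = {var:.2f}")
```

If the gain from two to three sessions is small relative to the estimated variance, two training sessions may already be sufficient.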

For a large vocabulary continuous speech recogniser, the training effort is characterised by the total training time.

Grammar
Recognition systems are often equipped with some kind of grammar that specifies what word orders the recogniser can accept. Examples are:
Word pair
In a word-pair grammar (a regular grammar), each word in the vocabulary is given a list of the words that may follow it. This information can be specified in V² bits, where V is the vocabulary size.
Syntax with nodes
In a syntax with nodes (a context-free grammar), the words in the vocabulary are divided into groups, each characterised by a node. The syntax defines which nodes may follow which other nodes, without specifying which word within each node actually fits the input.
n-gram
In an n-gram grammar, statistics on the probability of occurrence are given. For n=1 we speak of a unigram grammar, in which the relative frequency of occurrence of each word in the vocabulary is given. Put simply, when the recogniser is in doubt between two possible words, it can use this information to choose the more frequently occurring one. The concept extends to bigrams, where the probability that two words occur in succession is defined, and more generally to sequences of n words (see Chapter 7). The recogniser chooses the word sequence with the highest overall probability, combining both the acoustic match and the n-gram probabilities.
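The way n-gram probabilities break an acoustic tie can be sketched as follows. All probabilities below are invented for illustration; the combination is done in the log domain, as is usual in practice:

```python
import math

# Toy example: the recogniser is acoustically in doubt between two
# candidate words after the word "call"; a bigram model breaks the tie.
acoustic = {"too": 0.48, "two": 0.52}   # hypothetical acoustic likelihoods
bigram = {("call", "too"): 0.01,        # hypothetical P(word | "call")
          ("call", "two"): 0.20}

def score(prev, word):
    # Combine acoustic and language-model evidence in the log domain.
    return math.log(acoustic[word]) + math.log(bigram[(prev, word)])

best = max(acoustic, key=lambda w: score("call", w))
print("chosen word:", best)
```

Here the acoustic scores of the two homophones are nearly equal, so the much higher bigram probability of ``two'' after ``call'' decides the outcome.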

If the automatic speech recognition system is tested with a grammar, the input speech should actually match that grammar. For a strict grammar, such as a word-pair grammar or a syntax with nodes, sentences that do not conform to the grammar should not be used in assessment if the purpose is benchmarking. However, it can be of interest to study the recognition output for ungrammatical speech input, as this tests the rejection capability. For probabilistic grammars, the perplexity of the test sentences should match that of the training set that was used to generate the grammar, if the purpose of assessment is benchmarking.
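Checking whether test material conforms to a strict grammar can be sketched as follows, here for a word-pair grammar over a hypothetical three-word vocabulary. The successor table is the same information that the V² bits encode:

```python
# Hypothetical word-pair grammar: for each word, the set of words that
# may follow it. A V-by-V boolean matrix would encode the same thing.
vocab = {"call", "home", "office"}
follows = {
    "call": {"home", "office"},
    "home": set(),
    "office": set(),
}

def conforms(sentence):
    """True iff all words are in the vocabulary and every adjacent
    word pair is licensed by the word-pair grammar."""
    words = sentence.split()
    return (all(w in vocab for w in words) and
            all(nxt in follows[w] for w, nxt in zip(words, words[1:])))

print(conforms("call home"))   # grammatical input
print(conforms("home call"))   # ungrammatical: candidate for rejection testing
```

For benchmarking, only sentences for which such a check succeeds would be used; the failing sentences are exactly the kind of input that exercises the rejection capability.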


EAGLES SWLG SoftEdition, May 1997.