Assessment parameters

The following parameters can be varied according to the type of test being carried out. They control the level of diagnostic detail, the representativeness of the test, the accuracy of the results, etc.

Speech material

This parameter amounts to the choice of database, as described in Section 10.3. Many kinds of speech material can be used. Some frequently used types are:

Isolated words: for an isolated word recogniser (digits, numbers, the (spelling) alphabet, task-related words, etc.)
Connected words or small sentences: for a connected word recogniser (digit strings, connected task words according to a syntax, etc.)
Isolated CVCs: for a diagnostic test of an isolated word recogniser
CVCs in carrier sentences: for diagnostic tests of a connected word recogniser
Newspaper sentences: for benchmark evaluation of continuous speech recognisers

Speaker characterisation

Each speaker has some associated properties, such as sex, age, dialect, profession, etc. Some control over these properties can be obtained by selecting the test speakers or specific material in the database.

Number of speakers

This parameter is relevant for speaker-independent systems. Speech recognition scores are known to depend strongly on the speaker. Speakers are often classified as "goats" (low recognition scores) and "sheep" (high recognition scores). Because this classification is usually not known a priori, many speakers are needed for a benchmarking evaluation. For a speaker-independent recognition system, 20 speakers is considered a reasonable number, but this depends very much on the variance of the individual speaker scores. A sufficient number of speakers allows the variance in score due to speaker variability to be estimated, and significance can then be tested using Student's t-test.
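As an illustration of the last point, the following minimal sketch estimates the between-speaker variance and computes a t statistic. It assumes one particular use of Student's t-test, namely the paired form, in which two recognisers are scored on the same speakers; the text does not prescribe this form. The function name and all scores are hypothetical.

```python
import math
import statistics


def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees_of_freedom) for per-speaker score differences.

    Assumes both lists hold one score per speaker, in the same speaker order.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both systems must be scored on the same speakers")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    if n < 2:
        raise ValueError("at least two speakers are needed")
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd_diff == 0:
        raise ValueError("differences have zero variance; t is undefined")
    t = mean_diff / (sd_diff / math.sqrt(n))
    return t, n - 1


# Hypothetical word accuracies (%) for five speakers and two recognisers.
system_a = [92.0, 85.0, 97.0, 78.0, 90.0]
system_b = [90.0, 80.0, 96.0, 75.0, 86.0]

# Variance in score due to speaker variability, for one recogniser.
speaker_variance = statistics.variance(system_a)

t_value, dof = paired_t_statistic(system_a, system_b)
```

The resulting t value is compared with the tabulated critical value for the given degrees of freedom. With 20 speakers (19 degrees of freedom) the two-sided 5% critical value is about 2.09.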

Training method

The training method is determined by the intended application. Some applications (typically ones with a large vocabulary) might demand that the complete vocabulary is trained only once or twice for each user. For a dictation system (with "unlimited" vocabulary) one may have to use a pre-trained system. Other applications (e.g. command-and-control) might assume more effort from the user. If the assessment is application oriented, a representative training procedure should be used.

A relevant parameter for isolated or connected word recognisers is the number of training sessions. Predicting performance as a function of the number of training sessions can help to determine how much training is worthwhile.

RECOMMENDATION 4
To determine the minimum number of training sessions, carry out a small-scale experiment and estimate the variance in the scores.
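A small-scale experiment of this kind could be summarised as in the sketch below. The stopping rule is an assumption made for illustration and is not part of the recommendation: training is taken to be sufficient once one further session improves the mean score by no more than the standard error of that score. The function names and the pilot scores are hypothetical.

```python
import math
import statistics


def summarise_pilot(scores_by_sessions):
    """Map each number of training sessions to (mean, variance, standard error)."""
    summary = {}
    for sessions, scores in sorted(scores_by_sessions.items()):
        mean = statistics.mean(scores)
        variance = statistics.variance(scores)  # sample variance (n - 1)
        standard_error = math.sqrt(variance / len(scores))
        summary[sessions] = (mean, variance, standard_error)
    return summary


def minimum_sessions(summary):
    """Smallest session count after which one more step gains no more than
    the standard error of the next mean (assumed stopping rule)."""
    counts = sorted(summary)
    for current, following in zip(counts, counts[1:]):
        gain = summary[following][0] - summary[current][0]
        if gain <= summary[following][2]:
            return current
    return counts[-1]  # no plateau observed within the pilot


# Hypothetical pilot: recognition scores (%) of four speakers
# after 1, 2, 3 and 4 training sessions.
pilot = {
    1: [70, 75, 65, 72],
    2: [82, 85, 78, 83],
    3: [88, 90, 85, 89],
    4: [89, 90, 86, 90],
}
pilot_summary = summarise_pilot(pilot)
recommended = minimum_sessions(pilot_summary)
```

If the pilot shows no plateau, the function returns the largest number of sessions tested, which indicates that the experiment should be extended.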

For a large vocabulary continuous speech recogniser, the training effort is characterised by the total training time.

Grammar

Recognition systems are often equipped with some kind of grammar that specifies which word orders can be recognised. Examples are:

Word pair: In a word-pair grammar (regular grammar), for each word in the vocabulary, a list of words that can possibly follow that word is given. This information can be specified in V^2 bits (one bit for each ordered pair of words), where V is the vocabulary size.
Syntax with nodes: In a syntax with nodes (context-free grammar), words in the vocabulary are divided into groups. Each group is characterised by a node. The syntax defines which nodes may follow other nodes, without specifying which word within each node actually fits the input.
n-gram: In an n-gram grammar, statistics on the probability of occurrence are given. For n=1 we speak of a unigram grammar, in which the relative frequency of occurrence is given for each word of the vocabulary. Put simply, when the recogniser is in doubt between two possible words, it can use this information to choose the more frequently occurring one. The concept extends to bigrams, where the probability that two words occur successively is defined, and further to sequences of n words (see Chapter 7). The recogniser chooses the word sequence with the highest probability, taking into account both the acoustic match and the n-gram probabilities.
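The word-pair, unigram and bigram grammars described above can all be derived from a set of example sentences, as the following minimal sketch shows. It makes two assumptions that are not stated in the text: the bigram probability is taken as the conditional probability of the second word given the first, and no smoothing is applied. The function names and the sentences are hypothetical.

```python
from collections import Counter, defaultdict


def train_grammars(sentences):
    """Derive a word-pair grammar, unigram and bigram statistics from sentences."""
    successors = defaultdict(set)  # word-pair grammar: word -> allowed followers
    unigram_counts = Counter()
    bigram_counts = Counter()
    for sentence in sentences:
        words = sentence.split()
        unigram_counts.update(words)
        for first, second in zip(words, words[1:]):
            successors[first].add(second)
            bigram_counts[(first, second)] += 1

    # Unigram: relative frequency of each word.
    total = sum(unigram_counts.values())
    unigram = {word: count / total for word, count in unigram_counts.items()}

    # Bigram: P(second | first), assumed here to be the intended definition.
    first_counts = Counter()
    for (first, _), count in bigram_counts.items():
        first_counts[first] += count
    bigram = {
        pair: count / first_counts[pair[0]]
        for pair, count in bigram_counts.items()
    }
    return dict(successors), unigram, bigram


def is_grammatical(words, successors):
    """True if every adjacent word pair is allowed by the word-pair grammar."""
    return all(b in successors.get(a, set()) for a, b in zip(words, words[1:]))


# Hypothetical command-and-control sentences.
examples = ["call home now", "call office now", "call home later"]
word_pairs, unigram, bigram = train_grammars(examples)

accepted = is_grammatical("call home now".split(), word_pairs)
rejected = not is_grammatical("call office later".split(), word_pairs)
```

The word-pair grammar only records whether a pair is allowed, whereas the bigram grammar also records how likely it is. This is why the former can reject an input outright and the latter can only rank alternatives.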

If the automatic speech recognition system is tested with a grammar, the input speech should actually match the grammar. For a strict grammar, such as a word-pair grammar or a syntax with nodes, only sentences that conform to the grammar should be used in assessment if the purpose is benchmarking. It is nevertheless of interest to study the recognition output for ungrammatical speech input, since this tests the rejection capability. For probabilistic grammars, the perplexity of the test sentences should match that of the "test set" that was used to generate the grammar, if the purpose of assessment is benchmarking.
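To check whether test sentences match a probabilistic grammar, their perplexity can be computed as 2 raised to the negative average log2 probability per word. The sketch below does this for a bigram grammar. It assumes add-one smoothing and a sentence-start symbol "<s>", neither of which is prescribed by the text, and it assumes that all test words occur in the training vocabulary. The function name and the sentences are hypothetical.

```python
import math
from collections import Counter


def bigram_perplexity(train_sentences, test_sentences):
    """Perplexity of test sentences under an add-one smoothed bigram model."""
    pair_counts = Counter()
    history_counts = Counter()
    vocabulary = set()
    for sentence in train_sentences:
        words = ["<s>"] + sentence.split()  # "<s>" marks the sentence start
        vocabulary.update(words[1:])
        for previous, word in zip(words, words[1:]):
            pair_counts[(previous, word)] += 1
            history_counts[previous] += 1

    vocabulary_size = len(vocabulary)
    log_probability = 0.0
    n_words = 0
    for sentence in test_sentences:
        words = ["<s>"] + sentence.split()
        for previous, word in zip(words, words[1:]):
            # Add-one smoothing (an assumption) keeps unseen pairs above zero.
            p = (pair_counts[(previous, word)] + 1) / (
                history_counts[previous] + vocabulary_size
            )
            log_probability += math.log2(p)
            n_words += 1
    if n_words == 0:
        raise ValueError("test set contains no words")
    return 2 ** (-log_probability / n_words)


# Hypothetical material used to generate the grammar.
training = ["call home now", "call office now", "call home later"]

matched = bigram_perplexity(training, ["call home now"])
mismatched = bigram_perplexity(training, ["later office call"])
```

A test set whose perplexity is much higher than that of the material used to generate the grammar, as for the second sentence here, does not match the grammar and is unsuitable for benchmarking.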


EAGLES SWLG SoftEdition, May 1997.