Recognition score

In principle, the assessment of an automatic speech recognition system is simple: you take some speech material, train the system if that is required, have the system recognise the speech, and compare the results to a written transcription of the utterances. How this is carried out depends on the particular system and on the purpose of the assessment (see sections 10.5 and 10.6 for two typical ways to do this). For instance, consider the assessment of a phone-based recognition system. If the purpose is diagnostic, concentrating on the acoustic part of the system, the scoring algorithm should be based upon phoneme alignment. For benchmarking purposes, however, a simple word alignment is preferred, and scoring can be based upon word error rates and sentence error rates.

Isolated word scoring

The definition of the error rate E of a system is less simple. In words, the error rate is defined as ``the average fraction of items incorrectly recognised''. Here, an item can be a word, a subword unit (e.g. a phone), or an entire utterance. An average is a statistical property, so experimentally we can only measure an estimator of that property, based on observation of a specific sample. The definition of the estimator is simplest for an isolated word recognition system:

\[ \hat{E} = \frac{N_E}{N} \]

Here N is the number of words in the test sample and $N_E$ the number of words incorrectly recognised. The latter can be further subdivided into the contributions:

\[ N_E = N_S + N_D \]

Here, $N_S$ and $N_D$ are the number of words substituted and the number of words incorrectly rejected (deletions). For these classes of errors the fractions can be defined separately,

\[ \hat{S} = \frac{N_S}{N}, \qquad \hat{D} = \frac{N_D}{N} \]

It is customary for isolated word recognition systems to express the error rate in terms of its complementary quantity, the fraction of correct words C=1-E. It is the fraction of words correctly recognised, and its estimate is

\[ \hat{C} = 1 - \hat{E} = \frac{N - N_S - N_D}{N} \]

This measure does not include so-called insertions (see the next section), which are only defined for connected word recognition.
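The isolated word estimators can be sketched in a few lines of Python. This is a minimal illustration, not a reference scoring tool: the function name `isolated_word_scores`, the use of `None` to mark a rejected input word, and the five-word test sample are all assumptions made for the example.

```python
from typing import Optional, Sequence


def isolated_word_scores(reference: Sequence[str],
                         recognised: Sequence[Optional[str]]) -> dict:
    """Estimate S, D, E and C for an isolated word test.

    reference[i] is the word that was spoken. recognised[i] is the
    recogniser output for that word, or None if the word was rejected
    (an assumed convention for this sketch).
    """
    if len(reference) != len(recognised):
        raise ValueError("one recogniser response is needed per input word")
    n = len(reference)
    if n == 0:
        raise ValueError("the test sample is empty")

    n_s = 0  # substitutions: a wrong word was output
    n_d = 0  # deletions: the input word was incorrectly rejected
    for ref, hyp in zip(reference, recognised):
        if hyp is None:
            n_d += 1
        elif hyp != ref:
            n_s += 1

    return {
        "N": n,
        "S": n_s / n,
        "D": n_d / n,
        "E": (n_s + n_d) / n,
        "C": (n - n_s - n_d) / n,
    }


# Made-up sample: "no" is substituted by "go", and "go" is rejected.
scores = isolated_word_scores(
    ["yes", "no", "stop", "go", "left"],
    ["yes", "go", "stop", None, "left"],
)
print(scores)
```

In this sample one word out of five is substituted and one is rejected, so the estimates of S and D are both 1/5 and the estimate of C is 3/5.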

For isolated word recognition systems, there is another important measure besides the fraction of correct words. It describes the capability of rejecting an input word that is not in the vocabulary and the sensitivity to non-speech events. If a recogniser outputs a word when there is no specific input, this is called a false alarm. In conditions where there is no speech input, the number of false alarms will most likely scale with time, and the correct measure would be the false alarm rate f,

\[ \hat{f} = \frac{N_F}{T} \]

expressed in events per second. Here $N_F$ is the number of false alarms observed in a time T. Under the condition that there are many input words not in the vocabulary (as is the case in word spotting systems) the number of false alarms is most likely to scale with the number of input words, and hence an estimator for the false alarm fraction F is

\[ \hat{F} = \frac{N_F}{N_{OOV}} \]

where $N_{OOV}$ is the number of out-of-vocabulary words.
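As a small sketch of the two false alarm measures: the function names and the numbers below are hypothetical and serve only to show which denominator belongs to which condition.

```python
def false_alarm_rate(n_false_alarms: int, duration_s: float) -> float:
    """False alarm rate f in events per second (no speech input)."""
    if duration_s <= 0:
        raise ValueError("the observation time must be positive")
    return n_false_alarms / duration_s


def false_alarm_fraction(n_false_alarms: int, n_oov_words: int) -> float:
    """False alarm fraction F relative to the out-of-vocabulary inputs."""
    if n_oov_words <= 0:
        raise ValueError("at least one out-of-vocabulary word is needed")
    return n_false_alarms / n_oov_words


# Made-up figures: 3 false alarms in 10 minutes of non-speech input,
# and 12 false alarms over 200 out-of-vocabulary input words.
f_hat = false_alarm_rate(3, 600.0)
big_f_hat = false_alarm_fraction(12, 200)
print(f_hat, big_f_hat)
```

The first quantity has the dimension of a rate (events per second), the second is a dimensionless fraction, so the two should not be compared directly.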

As a last measure, there is the response time. It can be defined as the average time it takes to output the recognised word after the input word has been uttered.

In conclusion, the isolated word recogniser has the performance measures S, D, f, F and the response time. One can try to combine these measures into one ``figure of merit'', but the weights given to the different quantities depend on the application. Substitutions and deletions are often combined into the error rate E.

Connected or continuous word scoring

For a connected word or continuous recognition system the measures of performance are more complicated. Because the output words are generally not time-synchronous with the input, the output stream has to be aligned with the reference transcription. This implies that classifications such as substitutions, deletions, words correct and false alarms can no longer be identified with complete certainty.

For these reasons, the term ``false alarm'' is replaced by the term ``inserted word'' or ``insertion'', with the corresponding symbol I and the estimator

\[ \hat{I} = \frac{N_I}{N} \]

where $N_I$ is the number of insertions according to the alignment procedure. Because the absolute identification of errors is lost in the alignment procedure, the insertions are generally included in the error rate, so that for connected word recognition, performance is expressed in the total word error rate

\[ \hat{E} = \frac{N_S + N_D + N_I}{N} \]

with $N_S$ and $N_D$ the numbers of substituted and deleted words and N the number of words in the reference transcription. Note that this error measure can become larger than 1 in cases of extremely bad recognition, because the number of insertions is not bounded by N.

Often, one defines the accuracy of a system

\[ \hat{A} = 1 - \hat{E} = \frac{N - N_S - N_D - N_I}{N} \]

Note that this is not just the fraction C of words correctly recognised, because the latter does not include insertions.
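The connected word measures can be illustrated with a minimal sketch that aligns the recogniser output with the reference by minimum edit distance and counts the error types along one optimal path. This is only one possible alignment procedure, assumed here for illustration; the function names and the example sentences are hypothetical. When several alignments have the same total cost, the split into substitutions, deletions and insertions depends on the tie-breaking order of the traceback, while the total number of errors does not.

```python
from typing import Sequence, Tuple


def align_counts(reference: Sequence[str],
                 hypothesis: Sequence[str]) -> Tuple[int, int, int]:
    """Return (N_S, N_D, N_I) for one minimum-edit-distance alignment."""
    n, m = len(reference), len(hypothesis)
    # cost[i][j]: fewest errors aligning reference[:i] with hypothesis[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i  # only deletions
    for j in range(1, m + 1):
        cost[0][j] = j  # only insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            mismatch = int(reference[i - 1] != hypothesis[j - 1])
            cost[i][j] = min(cost[i - 1][j - 1] + mismatch,
                             cost[i - 1][j] + 1,
                             cost[i][j - 1] + 1)

    # Trace back one optimal path, preferring the diagonal on ties.
    n_s = n_d = n_i = 0
    i, j = n, m
    while i > 0 or j > 0:
        mismatch = int(i > 0 and j > 0
                       and reference[i - 1] != hypothesis[j - 1])
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + mismatch:
            n_s += mismatch
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            n_d += 1
            i -= 1
        else:
            n_i += 1
            j -= 1
    return n_s, n_d, n_i


def connected_word_scores(reference: Sequence[str],
                          hypothesis: Sequence[str]) -> dict:
    """Estimate S, D, I, E, A and C relative to N reference words."""
    n = len(reference)
    if n == 0:
        raise ValueError("the reference transcription is empty")
    n_s, n_d, n_i = align_counts(reference, hypothesis)
    return {
        "S": n_s / n, "D": n_d / n, "I": n_i / n,
        "E": (n_s + n_d + n_i) / n,
        "A": (n - n_s - n_d - n_i) / n,
        "C": (n - n_s - n_d) / n,
    }


typical = connected_word_scores("the cat sat on the mat".split(),
                                "the black cat sat on mat".split())
very_bad = connected_word_scores(["stop"], ["go", "to", "top"])
print(typical)
print(very_bad)
```

In the first example one word is inserted and one is deleted, so the accuracy (4/6) is lower than the fraction of correct words (5/6). In the second example a single reference word yields three errors, so the error rate exceeds 1 and the accuracy is negative.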

The actual measurement of these quantities through alignment is difficult. See section 10.6 and [Hunt (1990)] for a discussion of alignment. In the above formulas the three types of errors (S, I, D) have equal weight. Depending on the application, one can assign different weights to the various kinds of errors. [Hunt (1990)] introduces the concept of a ``figure of merit'' for connected word recognisers and discusses the effects of different weights on the alignment procedure.
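To make the influence of the weights concrete, the following sketch computes the minimum weighted alignment cost for given penalties per error type. It is not the figure of merit of [Hunt (1990)]; the function name, the parameters `w_s`, `w_d`, `w_i` and the example words are assumptions chosen for illustration.

```python
from typing import Sequence


def weighted_alignment_cost(reference: Sequence[str],
                            hypothesis: Sequence[str],
                            w_s: float = 1.0,
                            w_d: float = 1.0,
                            w_i: float = 1.0) -> float:
    """Minimum total penalty over all alignments of the two word strings.

    w_s, w_d and w_i are the (hypothetical) penalties for a
    substitution, a deletion and an insertion respectively.
    """
    m = len(hypothesis)
    # prev[j]: cheapest way to align the reference prefix seen so far
    # with hypothesis[:j]; initially the reference prefix is empty.
    prev = [j * w_i for j in range(m + 1)]
    for i in range(1, len(reference) + 1):
        curr = [i * w_d] + [0.0] * m
        for j in range(1, m + 1):
            sub = 0.0 if reference[i - 1] == hypothesis[j - 1] else w_s
            curr[j] = min(prev[j - 1] + sub,   # match or substitution
                          prev[j] + w_d,       # deletion
                          curr[j - 1] + w_i)   # insertion
        prev = curr
    return prev[m]


ref = ["call", "home"]
hyp = ["call", "hone"]
equal_cost = weighted_alignment_cost(ref, hyp)
heavy_sub_cost = weighted_alignment_cost(ref, hyp, w_s=3.0)
print(equal_cost, heavy_sub_cost)
```

With equal weights the cheapest alignment treats ``hone'' as one substitution (cost 1). If a substitution is penalised more than a deletion plus an insertion, the cheapest alignment instead counts one deletion and one insertion (cost 2), so the same recogniser output is classified differently.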

EAGLES SWLG SoftEdition, May 1997.