
Recognition score


In principle, the assessment of an automatic speech recognition system is simple: you take some speech material, train the system if that is required, have the recognition system recognise the speech, and compare the results to a written transcription of the utterances. How this is carried out depends on the particular system and on the purpose of the assessment (see sections 10.5 and 10.6 for two typical ways to do this). For instance, consider the assessment of a phone-based recognition system. If the purpose is diagnostic, concentrating on the acoustical part of the system, the scoring algorithm should be based upon phoneme alignment. For benchmarking purposes, however, a simple word alignment is preferred, and scoring can be based upon word error rates and sentence error rates.

Isolated word scoring


The definition of the error rate E of a system is not so simple. In words, the error rate is defined as ``the average fraction of items incorrectly recognised''. Here, an item can be a word, a subword unit (e.g. a phone), or an entire utterance. An average is a statistical property, so experimentally we can only measure an estimator for the property, based on observation of a specific sample. The definition of the estimator is simplest for an isolated word recognition system:
$$\hat{E} = \frac{N_E}{N}$$
Here N is the number of words in the test sample and $N_E$ the number of words incorrectly recognised. The latter can be further subdivided into the contributions:
$$N_E = N_S + N_D$$
Here, the subscripts S and D stand for substitutions and deletions: $N_S$ is the number of words substituted and $N_D$ the number of words incorrectly rejected. For these classes of errors the fractions can be defined separately,
$$\hat{S} = \frac{N_S}{N}, \qquad \hat{D} = \frac{N_D}{N}$$
It is customary for isolated word recognition systems to express the error rate in its complementary quantity, the fraction of correct words C=1-E. It is the fraction of words correctly recognised, and its estimate is
$$\hat{C} = \frac{N - N_S - N_D}{N}$$
This measure does not include so-called insertions (see the next section), which are only defined for connected word recognition.
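
The isolated word estimates can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `score_isolated` is hypothetical, each test word is assumed to yield exactly one recogniser response, and `None` is an assumed convention for a rejected input (a deletion).

```python
def score_isolated(references, hypotheses):
    """Estimate S, D, E and C for an isolated word test.

    hypotheses[i] is the recogniser output for references[i].
    None is an assumed marker for "input rejected" (a deletion).
    """
    if len(references) != len(hypotheses):
        raise ValueError("one hypothesis per reference word is required")
    n = len(references)
    if n == 0:
        raise ValueError("the test sample is empty")

    # Deletions: the recogniser rejected a word that was in the sample.
    n_del = sum(1 for hyp in hypotheses if hyp is None)
    # Substitutions: the recogniser output a different word.
    n_sub = sum(
        1
        for ref, hyp in zip(references, hypotheses)
        if hyp is not None and hyp != ref
    )

    return {
        "S": n_sub / n,
        "D": n_del / n,
        "E": (n_sub + n_del) / n,
        "C": (n - n_sub - n_del) / n,
    }


if __name__ == "__main__":
    refs = ["one", "two", "three", "four"]
    hyps = ["one", "too", None, "four"]
    print(score_isolated(refs, hyps))
```

Because the estimates are computed on one specific sample, a different test sample will in general give different values.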

For isolated word recognition systems there is another important measure besides the fraction of correct words. It describes the capability of rejecting an input word that is not in the vocabulary and the sensitivity to non-speech events. If a recogniser outputs a word when there is no specific input, this is called a false alarm. In conditions where there is no speech input, the number of false alarms will most likely scale with time, and the correct measure would be the false alarm rate f,
$$\hat{f} = \frac{N_{FA}}{T}$$
expressed in events per second. Here $N_{FA}$ is the number of false alarms observed in a time T. Under the condition that there are many input words not in the vocabulary (as is the case in word spotting systems) the number of false alarms is most likely to scale with the number of input words, and hence an estimator for the false alarm fraction F is
$$\hat{F} = \frac{N_{FA}}{N_{OOV}}$$
where $N_{OOV}$ is the number of out-of-vocabulary words.
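
The two false alarm measures differ only in what the count is normalised by. A minimal sketch follows; the function and parameter names are illustrative assumptions, not taken from the handbook.

```python
def false_alarm_rate(n_false_alarms, duration_seconds):
    """False alarm rate f in events per second (no-speech conditions)."""
    if duration_seconds <= 0:
        raise ValueError("duration must be positive")
    return n_false_alarms / duration_seconds


def false_alarm_fraction(n_false_alarms, n_oov_words):
    """False alarm fraction F, relative to the out-of-vocabulary input words."""
    if n_oov_words <= 0:
        raise ValueError("at least one out-of-vocabulary word is required")
    return n_false_alarms / n_oov_words


if __name__ == "__main__":
    # Invented counts, for illustration only.
    print(false_alarm_rate(6, 120.0))    # 6 false alarms in 2 minutes of non-speech
    print(false_alarm_fraction(3, 60))   # 3 false alarms on 60 out-of-vocabulary words
```

The rate has the dimension of events per second, whereas the fraction is dimensionless, so the two values are not directly comparable.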

As a last measure, there is the response time $t_r$. It can be defined as the average time it takes to output the recognised word after the input word has been uttered.

In conclusion, the isolated word recogniser has five different performance measures: S, D, f, F and $t_r$. One can try to combine these measures into one ``figure of merit'', but the weights given to the different quantities depend on the application. Substitutions and deletions are often combined into the error rate E.

Connected or continuous word scoring


For a connected word or continuous recognition system the measures of performance are more complicated. Because the output words are generally not time-synchronous with the input, the output stream has to be aligned with the reference transcription. This implies that classifications such as substitutions, deletions, words correct and false alarms can no longer be identified with complete certainty.

For these reasons, the term ``false alarm'' is replaced by the term ``inserted word'' or ``insertion'', with the corresponding symbol I and the estimator
$$\hat{I} = \frac{N_I}{N}$$
where N is the number of words in the reference transcription and $N_I$ is the number of insertions according to the alignment procedure. Because the absolute identification of errors is lost in the alignment procedure, the insertions are generally included in the error rate, so that for connected word recognition, performance is expressed in the total word error rate
$$\hat{E} = \frac{N_S + N_D + N_I}{N}$$
Note that this error measure can become larger than 1 in cases of extremely bad recognition.
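
The following is a minimal sketch of how substitutions, deletions and insertions could be counted. It assumes a plain minimum-edit-distance (Levenshtein) alignment with unit cost for each error type. The function names are illustrative, and the alignment procedures referred to in the text may differ. When several alignments have the same cost, the split between the three error types depends on the tie-breaking order in the backtrace, which is one way to see why errors can no longer be identified with complete certainty.

```python
def align_counts(reference, hypothesis):
    """Return (n_sub, n_del, n_ins) for one minimum-cost alignment."""
    rows, cols = len(reference) + 1, len(hypothesis) + 1
    # cost[i][j]: minimum number of edits turning reference[:i] into hypothesis[:j]
    cost = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        cost[i][0] = i  # only deletions
    for j in range(1, cols):
        cost[0][j] = j  # only insertions
    for i in range(1, rows):
        for j in range(1, cols):
            mismatch = int(reference[i - 1] != hypothesis[j - 1])
            cost[i][j] = min(
                cost[i - 1][j - 1] + mismatch,  # match or substitution
                cost[i - 1][j] + 1,             # deletion
                cost[i][j - 1] + 1,             # insertion
            )

    # Backtrace from the end, preferring the diagonal, then deletion.
    n_sub = n_del = n_ins = 0
    i, j = len(reference), len(hypothesis)
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            mismatch = int(reference[i - 1] != hypothesis[j - 1])
            if cost[i][j] == cost[i - 1][j - 1] + mismatch:
                n_sub += mismatch
                i, j = i - 1, j - 1
                continue
        if i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            n_del += 1
            i -= 1
        else:
            n_ins += 1
            j -= 1
    return n_sub, n_del, n_ins


def word_error_rate(reference, hypothesis):
    """Total word error rate (substitutions + deletions + insertions) / N."""
    if not reference:
        raise ValueError("the reference transcription is empty")
    n_sub, n_del, n_ins = align_counts(reference, hypothesis)
    return (n_sub + n_del + n_ins) / len(reference)


if __name__ == "__main__":
    ref = "the cat sat on the mat".split()
    hyp = "the cat sat on mat".split()
    print(align_counts(ref, hyp), word_error_rate(ref, hyp))
    # A one-word reference with three extra output words gives a rate above 1.
    print(word_error_rate(["yes"], ["oh", "yes", "it", "is"]))
```

The second call illustrates the remark about the error measure exceeding 1: the number of insertions is not bounded by the number of reference words.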

Often, one defines the accuracy A of a system
$$\hat{A} = 1 - \hat{E} = \frac{N - N_S - N_D - N_I}{N}$$
Note that this is not just the fraction C of words correctly recognised, because the latter does not include insertions.
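
As a purely illustrative worked example (the counts are invented, not measured), suppose an alignment of N = 100 reference words yields 6 substitutions, 4 deletions and 5 insertions. Assuming the accuracy is taken as the complement of the total word error rate, the three quantities come out as follows:

```latex
\begin{align*}
C &= \frac{N - N_S - N_D}{N} = \frac{100 - 6 - 4}{100} = 0.90 \\
E &= \frac{N_S + N_D + N_I}{N} = \frac{6 + 4 + 5}{100} = 0.15 \\
A &= 1 - E = 0.85
\end{align*}
```

The five insertions lower the accuracy below the fraction of correct words, and with enough insertions the accuracy becomes negative.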

The actual measurement of the quantities through alignment is difficult. See Section 10.6 and [Hunt (1990)] for a discussion of alignment. In the above formulas the three types of errors (S, I, D) have equal weight. Depending on the application, one can assign different weights to the various kinds of errors. [Hunt (1990)] introduces the concept ``figure of merit'' for connected word recognisers and discusses the effects of different weights on the alignment procedure.
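
To make the idea of unequal weights concrete, here is a minimal sketch of a weighted error rate. The weights are applied to the error counts after alignment; this is a simplifying assumption and does not reproduce the figure of merit of [Hunt (1990)], nor the effect of weights on the alignment itself. The function and parameter names are hypothetical.

```python
def weighted_error_rate(n_sub, n_del, n_ins, n_words,
                        w_sub=1.0, w_del=1.0, w_ins=1.0):
    """Weighted sum of error counts divided by the number of reference words.

    With all weights equal to 1 this reduces to the total word error rate.
    """
    if n_words <= 0:
        raise ValueError("the number of reference words must be positive")
    weighted_errors = w_sub * n_sub + w_del * n_del + w_ins * n_ins
    return weighted_errors / n_words


if __name__ == "__main__":
    # Invented counts: 6 substitutions, 4 deletions, 5 insertions, 100 words.
    print(weighted_error_rate(6, 4, 5, 100))             # equal weights
    print(weighted_error_rate(6, 4, 5, 100, w_ins=0.5))  # insertions count half
```

Which weights are appropriate is application-dependent, as noted above; the values used here carry no recommendation.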



EAGLES SWLG SoftEdition, May 1997.