In principle, the assessment of an automatic speech recognition system is simple: take some speech material, train the system if required, have the system recognise the speech, and compare the results with a written transcription of the utterances. How this is carried out depends on the particular system and on the purpose of the assessment (see sections 10.5 and 10.6 for two typical ways to do this). Consider, for instance, the assessment of a phone-based recognition system. If the purpose is diagnostic, concentrating on the acoustic part of the system, the scoring algorithm should be based on phoneme alignment. For benchmarking purposes, however, a simple word alignment is preferred, and scoring can be based on word error rates and sentence error rates.
The definition of the error rate $E$ of a system is
not so simple. In words, the error rate is defined as ``the average
fraction of items incorrectly recognised''. Here, an item can be a
word, a subword unit (e.g. a phone), or an entire
utterance. An average is a statistical property, so experimentally we
can only measure an estimator for the property, based
on observation of a specific sample. The definition of the
estimator is simplest for an isolated word
recognition system:
\[ \hat{E} = \frac{N_E}{N} \]
Here $N$ is the number of words in the test sample and $N_E$ the
number of words incorrectly recognised. The latter can be further
subdivided into the contributions:
\[ N_E = N_S + N_D \]
Here, the subscripts S and D denote the number of words
substituted (substitutions) and the number of
words incorrectly rejected (deletions). For
these classes of errors the fractions can be defined separately:
\[ \hat{S} = \frac{N_S}{N}, \qquad \hat{D} = \frac{N_D}{N} \]
It is customary for isolated word recognition systems to express the
error rate through its complementary quantity, the fraction of correct
words $C = 1 - E$. It is the fraction of words
correctly recognised; with $N_C$ the number of such words, its estimate is
\[ \hat{C} = \frac{N_C}{N} = \frac{N - N_S - N_D}{N} \]
This measure does not include so-called insertions (see the
next section), which are only defined for connected word recognition.
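The isolated word estimators above can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that the test yields one recogniser output per reference word and that a rejection is represented by `None`; the function name `isolated_word_scores` and this data layout are hypothetical and not part of the handbook.

```python
from typing import Dict, Optional, Sequence


def isolated_word_scores(references: Sequence[str],
                         outputs: Sequence[Optional[str]]) -> Dict[str, float]:
    """Estimate S, D, E and C for an isolated word test.

    outputs[i] is the word recognised for references[i], or None when the
    recogniser rejected the input (assumed here to count as a deletion).
    """
    if len(references) != len(outputs):
        raise ValueError("one output (or None) is needed per reference word")
    n = len(references)
    if n == 0:
        raise ValueError("empty test sample")

    # N_S: a word was output, but it differs from the reference word.
    n_s = sum(1 for ref, out in zip(references, outputs)
              if out is not None and out != ref)
    # N_D: the input word was incorrectly rejected.
    n_d = sum(1 for out in outputs if out is None)

    return {
        "S": n_s / n,
        "D": n_d / n,
        "E": (n_s + n_d) / n,      # N_E = N_S + N_D
        "C": (n - n_s - n_d) / n,  # C = 1 - E
    }


scores = isolated_word_scores(
    ["one", "two", "three", "four"],
    ["one", "too", None, "four"],
)
print(scores)
```

In this toy sample one word is substituted and one is rejected, so both the substitution and the deletion fraction are 1/4 and the fraction of correct words is 1/2.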
For isolated word recognition systems, there is another important measure
besides the fraction of correct words. It
describes the capability of rejecting an input word that is not in the
vocabulary and the sensitivity to non-speech events. If a
recogniser outputs a word when there is no specific input, this is
called a false alarm. In conditions where there is no speech
input, the number of false alarms will most likely scale with time,
and the correct measure would be the false alarm rate $f$,
\[ \hat{f} = \frac{N_F}{T} \]
expressed in events per second. Here $N_F$ is the number of false
alarms observed in a time $T$. Under the condition that there are many
input words not in the vocabulary (as is the case in word spotting
systems) the number of false alarms is most likely to scale with the
number of input words, and hence an estimator for the false
alarm fraction $F$ is
\[ \hat{F} = \frac{N_F}{N_{\mathrm{OOV}}} \]
where $N_{\mathrm{OOV}}$ is the number of out-of-vocabulary words.
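The two false alarm measures differ only in what the count is normalised by. A small Python sketch, with hypothetical function names and made-up counts, might look like this:

```python
def false_alarm_rate(n_false_alarms: int, duration_s: float) -> float:
    """False alarm rate f: false alarms per second of non-speech input."""
    if duration_s <= 0:
        raise ValueError("observation time must be positive")
    return n_false_alarms / duration_s


def false_alarm_fraction(n_false_alarms: int, n_oov_words: int) -> float:
    """False alarm fraction F: false alarms per out-of-vocabulary input word."""
    if n_oov_words <= 0:
        raise ValueError("at least one out-of-vocabulary word is needed")
    return n_false_alarms / n_oov_words


# Illustrative numbers only: 3 false alarms in 10 minutes of silence,
# and 12 false alarms on 200 out-of-vocabulary input words.
f_hat = false_alarm_rate(3, 600.0)
F_hat = false_alarm_fraction(12, 200)
print(f"f = {f_hat:.4f} events/s, F = {F_hat:.2f}")
```

Which of the two is appropriate depends on the test condition, as described above: the rate for input without speech, the fraction for input with many out-of-vocabulary words.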
As a last measure, there is the response time. It can be defined as the average time it takes to output the recognised word after the input word has been uttered.
In conclusion, the isolated word recogniser has four different performance measures: S, D, the false alarm measure (f or F), and the response time. One can try to combine these measures into one ``figure of merit'', but the weights given to the different quantities depend on the application. Substitutions and deletions are often combined into the error rate E.
For a connected word or continuous recognition system, the measures of performance are more complicated. Because the output words are generally not time-synchronous with the input, the output stream has to be aligned with the reference transcription. This implies that classifications such as substitutions, deletions, words correct and false alarms can no longer be identified with complete certainty.
For these reasons, the term ``false alarm'' is replaced by the term
``inserted word'' or ``insertion'', with the corresponding
symbol $I$ and the estimator
\[ \hat{I} = \frac{N_I}{N} \]
where $N_I$ is the number of insertions according to the
alignment procedure and $N$ the number of words in the reference
transcription. Because the absolute identification of errors is
lost in the alignment procedure, the insertions are generally included
in the error rate, so that for connected word recognition, performance
is expressed in the total word error rate
\[ \hat{E} = \frac{N_S + N_D + N_I}{N} \]
Note that this error measure can become larger than 1 in cases of
extremely bad recognition.
Often, one defines the accuracy of a system
\[ \hat{A} = 1 - \hat{E} = \frac{N - N_S - N_D - N_I}{N} \]
Note that this is not just the fraction $C$ of words correctly
recognised, because the latter does not include insertions.
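As an illustration of how these quantities can be obtained, the following Python sketch aligns a hypothesis with a reference by plain minimum edit distance (unit cost for every error type) and counts substitutions, deletions and insertions along one optimal path. This is only one possible alignment procedure, chosen here for brevity and not necessarily the one described in Section 10.6; the function names are hypothetical.

```python
from typing import Dict, Sequence, Tuple


def align_counts(ref: Sequence[str], hyp: Sequence[str]) -> Tuple[int, int, int]:
    """Return (N_S, N_D, N_I) for one minimum-edit-distance alignment."""
    rows, cols = len(ref) + 1, len(hyp) + 1
    # cost[i][j]: minimum number of errors when aligning ref[:i] with hyp[:j]
    cost = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        cost[i][0] = i  # only deletions
    for j in range(1, cols):
        cost[0][j] = j  # only insertions
    for i in range(1, rows):
        for j in range(1, cols):
            diag = cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            cost[i][j] = min(diag, cost[i - 1][j] + 1, cost[i][j - 1] + 1)

    # Trace back along one optimal path; ties are resolved in favour of
    # match/substitution, then deletion, then insertion.
    n_s = n_d = n_i = 0
    i, j = len(ref), len(hyp)
    while i > 0 or j > 0:
        mismatch = i > 0 and j > 0 and ref[i - 1] != hyp[j - 1]
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + mismatch:
            n_s += int(mismatch)
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            n_d += 1
            i -= 1
        else:
            n_i += 1
            j -= 1
    return n_s, n_d, n_i


def word_error_rate(ref: Sequence[str], hyp: Sequence[str]) -> Dict[str, float]:
    """Total word error rate E and accuracy A = 1 - E for one utterance."""
    if not ref:
        raise ValueError("reference must contain at least one word")
    n_s, n_d, n_i = align_counts(ref, hyp)
    e = (n_s + n_d + n_i) / len(ref)
    return {"E": e, "A": 1.0 - e}


ref = "the cat sat on the mat".split()
hyp = "the cat sat on on the hat".split()
print(align_counts(ref, hyp))     # one substitution, one insertion
print(word_error_rate(ref, hyp))

# Many insertions against a short reference push E above 1.
print(word_error_rate(["yes"], ["oh", "yes", "it", "is"]))
```

The last call shows the case mentioned above: three inserted words against a one-word reference give an error rate of 3 and a negative accuracy.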
The actual measurement of these quantities through alignment is difficult. See Section 10.6 and [Hunt (1990)] for a discussion of alignment. In the above formulas the three types of errors (S, I, D) have equal weight. Depending on the application, one can assign different weights to the various kinds of errors. [Hunt (1990)] introduces the concept of a ``figure of merit'' for connected word recognisers and discusses the effects of different weights on the alignment procedure.
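To see why the choice of weights affects the alignment itself, consider the following Python sketch of a weighted minimum-cost alignment. The weights `w_sub`, `w_del` and `w_ins` and the function name are hypothetical, and the sketch is not the figure of merit proposed in [Hunt (1990)]; it only shows that changing the weights can change which alignment is optimal.

```python
from typing import Sequence


def weighted_alignment_cost(ref: Sequence[str], hyp: Sequence[str],
                            w_sub: float = 1.0,
                            w_del: float = 1.0,
                            w_ins: float = 1.0) -> float:
    """Minimum total cost of aligning hyp with ref under the given weights."""
    # prev[j]: cost of aligning the reference words seen so far with hyp[:j]
    prev = [j * w_ins for j in range(len(hyp) + 1)]
    for i in range(1, len(ref) + 1):
        curr = [i * w_del]
        for j in range(1, len(hyp) + 1):
            step = 0.0 if ref[i - 1] == hyp[j - 1] else w_sub
            curr.append(min(prev[j - 1] + step,   # match or substitution
                            prev[j] + w_del,      # deletion
                            curr[j - 1] + w_ins)) # insertion
        prev = curr
    return prev[-1]


ref = ["a", "b"]
hyp = ["a", "c"]
# Equal weights: "b" -> "c" is scored as one substitution.
print(weighted_alignment_cost(ref, hyp))
# Expensive substitutions: a deletion plus an insertion becomes cheaper.
print(weighted_alignment_cost(ref, hyp, w_sub=3.0))
```

With equal weights the mismatch is scored as a single substitution at cost 1; with a substitution weight of 3 the cheapest alignment deletes "b" and inserts "c" at cost 2, so the same recogniser output is classified into different error types.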