The selection of the test sentences requires some attention. If the assessment purpose is purely benchmarking for a very specific application, one could use the ``representative database'' approach and select utterances randomly from a source of possible utterances. For large vocabulary recognition systems, however, the set of possible utterances is virtually unlimited, so a random selection will in practice be drawn from a pre-selected set of utterances. In the ARPA paradigm, this pre-selection is made by selecting paragraphs randomly from specified newspapers over a specified period of time. For the ``20k open vocabulary test'' of the 1993 evaluation, the paragraphs were pre-filtered to contain only words from the 64k most frequent words in the WSJ. For the ``5k closed test'', the words were restricted to the top 5000 words of the frequency-sorted list.
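The pre-filtering step can be sketched as follows. This is a minimal illustration, not the actual WSJ processing pipeline; the word list and sentences are invented placeholders:

```python
# Sketch of vocabulary-based pre-filtering, in the spirit of the ARPA
# 5k closed / 20k open test sets: keep only sentences whose every word
# falls within the top-N entries of a frequency-sorted word list.
# The lexicon and sentences below are toy examples.

def filter_by_vocabulary(sentences, freq_sorted_words, top_n):
    """Return the sentences containing no out-of-vocabulary words."""
    vocab = set(freq_sorted_words[:top_n])
    return [s for s in sentences if all(w in vocab for w in s.split())]

freq_list = ["the", "of", "stock", "market", "rose", "plummeted"]
sentences = ["the stock market rose", "the stock market plummeted"]
kept = filter_by_vocabulary(sentences, freq_list, 5)
# with top_n = 5, "plummeted" is out of vocabulary,
# so only the first sentence survives
```

For the closed-vocabulary test this filter guarantees a zero out-of-vocabulary rate by construction; for the open test it merely bounds the vocabulary the system must cope with.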
If the purpose of the assessment is more diagnostic, the following considerations apply. From previous experiments with continuous speech recognition assessment it is known that very basic parameters, such as gender and sentence perplexity, influence the recognition result. Because the diversity among speakers is responsible for another important part of the variability in recognition score, it is wise to balance the parameters mentioned above across speakers. For gender, this has the logical consequence of having as many male as female speakers, which runs contrary to the representative purpose of assessment, where the gender ratio of the users of the foreseen application should be reflected; for fighter pilots, the ratio could lean towards male speakers. The number of out-of-vocabulary words in the set of sentences should be kept constant across speakers, and the distribution of sentence perplexity should be roughly the same for each speaker. Thus one can obtain diagnostic information on what part of the variability is due to speaker variation and what part is due to other factors, such as gender, perplexity and the fraction of out-of-vocabulary words.
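One simple way to keep the perplexity distribution similar across speakers is to sort the sentences by perplexity and deal them out round-robin, so that every speaker's set spans the whole perplexity range. The sketch below assumes this dealing strategy (the source does not prescribe one) and uses invented perplexity values:

```python
# Minimal sketch of balancing sentence perplexity across speakers:
# sort sentences by perplexity, then deal them round-robin so each
# speaker's set covers low, medium and high perplexities.
# Perplexity values are illustrative only.

def deal_balanced(sentences_with_ppl, n_speakers):
    """sentences_with_ppl: list of (sentence_id, perplexity) pairs.
    Returns one list of pairs per speaker."""
    ordered = sorted(sentences_with_ppl, key=lambda x: x[1])
    per_speaker = [[] for _ in range(n_speakers)]
    for i, item in enumerate(ordered):
        per_speaker[i % n_speakers].append(item)
    return per_speaker

ppls = [90, 250, 110, 240, 100, 260]
sets = deal_balanced(list(enumerate(ppls)), 2)
# each of the 2 speakers receives 3 sentences,
# and each speaker's set spans the full perplexity range
```

The same dealing idea can be applied to the out-of-vocabulary word counts, grouping sentences by their number of out-of-vocabulary words before distributing them.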
With large vocabulary recognition systems one is generally not able to test all words in the vocabulary, as is done in the assessment of small-vocabulary word recognisers. This is not really necessary, because the systems are generally phone-based, and it is more important to cover all phones in representative amounts in the assessment. This allows the size of the test to be restricted to typically 20 speakers uttering 15 sentences of approximately 20 words each, on average. This corresponds to roughly 6000 words in the test, and a multiple of this number of (context-dependent) phones. A large vocabulary system typically uses 500-3000 models, and a test of this size more or less covers the phones that are modelled.
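Phone coverage of a candidate test set can be checked by expanding each word through a pronunciation lexicon and counting the phone types that occur. The sketch below uses a toy two-entry lexicon; a real check would use the system's own lexicon and count context-dependent units as well:

```python
# Rough sketch of checking phone coverage of a test set: expand every
# test word through a (toy) pronunciation lexicon and count how often
# each phone type occurs. Lexicon entries are invented examples.
from collections import Counter

def phone_coverage(test_words, lexicon):
    """Return a Counter mapping phone -> number of occurrences."""
    counts = Counter()
    for w in test_words:
        counts.update(lexicon[w])
    return counts

lexicon = {"test": ["t", "eh", "s", "t"], "set": ["s", "eh", "t"]}
cov = phone_coverage(["test", "set", "test"], lexicon)
# cov["t"] == 5, cov["eh"] == 3, cov["s"] == 3
```

Comparing the set of covered units against the 500-3000 models the recogniser actually uses shows whether the 6000-word test indeed exercises all of them in representative amounts.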
In order to cover as many words (or context-dependent phone models) as possible, in the ARPA style of benchmarking assessment all speakers utter distinct sentences. From a more diagnostic point of view it would be ideal to have all speakers utter the same sentences, in order to distinguish variability caused by the speakers from variability caused by the sentences. In practice this is impossible, however, because it would require both many recordings and a long recognition time. One way to overcome this problem, as is done within SQALE, is to divide the evaluation test into two parts. One part consists of the ``classic'' set-up: 20 speakers, 10 sentences per speaker, all sentences unique. The other part consists of extra sentences for variance estimation. This part has a few sentences (typically 3) uttered by 10 different speakers, where the sentences are the same across the speakers. Additionally, typically 6 speakers utter the same sentence 5 times (each speaker a different sentence). These replicas allow the estimation of the variance within one speaker for the same sentence. Although these utterances are different from the ones used in the first part of the assessment test, they may shed light on the different contributions to the variance.
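The two variance estimates that the extra sentences support can be illustrated with a toy computation: between-speaker variance from the shared sentences read by several speakers, and within-speaker variance from one speaker's repeated readings of one sentence. The error rates below are invented for illustration, not SQALE results:

```python
# Toy sketch of the two variance estimates the extra sentences allow.
# All word error rates (%) are invented numbers.
from statistics import variance

# 10 speakers reading the same shared sentences: the spread here
# reflects speaker variability (sentence content held fixed)
shared = [8.1, 9.4, 7.7, 10.2, 8.8, 9.1, 7.9, 10.5, 8.4, 9.6]
between_speaker_var = variance(shared)

# one speaker reading the same sentence 5 times: the spread here
# reflects within-speaker variability (speaker and sentence fixed)
replicas = [8.9, 9.2, 8.7, 9.0, 9.3]
within_speaker_var = variance(replicas)
```

In such a design one would expect the between-speaker variance to dominate the within-speaker variance, which is exactly the kind of decomposition the second part of the test is meant to make possible.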