
Training material


Unlike a small vocabulary word recogniser, a large vocabulary recognition system generally uses subword units such as phones as its acoustic units for recognition. This limits the number of models to be trained to the number of phones (or phonemes) in the language, typically 35-50, although training context-dependent phones (so-called triphones) increases the number of models again, typically to about 2000.

This means that the training vocabulary need not cover the recogniser's vocabulary completely. Instead, the recognition system uses a dictionary to find the possible pronunciations of each word in terms of the limited set of phones. The words in the dictionary define the recogniser's vocabulary.

Because these systems are designed to recognise continuous speech, they are equipped with algorithms that can segment the input utterance into distinct words. This segmentation process can often be used during training as well, which relieves the training databases of the need to provide labelling information on the word boundaries in the speech files.

When organising a competitive assessment of various systems, it is important to define carefully which training is allowed. This covers both acoustic and language model training. In the ARPA paradigm, part of the evaluation test puts fewer restrictions on the training material, but demands that this material is made available to the other participants.

Acoustic training

The acoustic training material consists of large databases, with many hours of speech recorded from many people. The most famous training database for American English is the ``Wall Street Journal'' database (WSJ), which has two releases: WSJ0, with 84 speakers, and WSJ1, with an additional 200 speakers. The total amount of training speech is approximately 60 hours. The training sentences come from the Wall Street Journal newspaper. All training sentences have been orthographically transcribed.

It is important that the acoustic training material comes with orthographic transcriptions; without these the material is virtually worthless for training. The size of the material is also relevant: large vocabulary systems often work with models for phone sequences of up to 3 phones (triphones). This means that the number of models to be trained is quite large, typically 2000. Each model must be trained on many examples from many different speakers in order to be robust. So far, there is no indication that recognition performance as a function of the amount of training material saturates at the available maximum of 60 hours.


For training the phone models, some automatic conversion from the orthographic texts to phones is necessary. This is generally performed by dictionary lookup. Compiling a dictionary is very laborious, and more often than not dictionaries are considered proprietary information; the dictionary is often viewed as part of the recognition system. Recognisers may fall back on a text-to-speech (TTS) system if words in the transcription do not occur in the dictionary (see Chapter 6).
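The dictionary lookup with a fallback can be sketched as follows. This is a minimal illustration, not the procedure of any particular recogniser: the lexicon entries, the phone symbols, and the names `LEXICON`, `spell_out` and `transcribe` are all hypothetical, and the letter-by-letter fallback merely stands in for a real TTS front end.

```python
from typing import Callable, Dict, List

# Hypothetical toy pronunciation dictionary: word -> list of pronunciations,
# each pronunciation being a list of phone symbols (symbols are illustrative).
LEXICON: Dict[str, List[List[str]]] = {
    "million": [["m", "ih", "l", "y", "ax", "n"]],
    "dollars": [["d", "aa", "l", "er", "z"]],
    "the": [["dh", "ax"], ["dh", "iy"]],
}


def spell_out(word: str) -> List[str]:
    """Placeholder fallback: one pseudo-phone per letter.

    A real system would call a text-to-speech front end here instead.
    """
    return list(word)


def transcribe(
    words: List[str],
    lexicon: Dict[str, List[List[str]]] = LEXICON,
    fallback: Callable[[str], List[str]] = spell_out,
) -> List[List[List[str]]]:
    """Return, for each word, its list of candidate phone sequences."""
    result = []
    for word in words:
        key = word.lower()
        if key in lexicon:
            result.append(lexicon[key])       # dictionary lookup
        else:
            result.append([fallback(key)])    # out-of-dictionary word
    return result


if __name__ == "__main__":
    for word, prons in zip(["The", "million", "xyz"],
                           transcribe(["The", "million", "xyz"])):
        print(word, prons)
```

Keeping several pronunciations per word mirrors the statement that the dictionary lists the possible ways to pronounce each word; choosing among them is left to the training procedure.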

Language model


An essential part of a large vocabulary continuous speech recognition system is the language model. It represents the machine's knowledge of the language it is supposed to recognise. Because recognisers often take a probabilistic approach to acoustic modelling, a probabilistic language model fits in perfectly. There are many ways to implement a probabilistic grammar (see Chapter 7), but the most widely used is the n-gram grammar. An n-gram grammar defines the probability that word $w_n$ follows a sequence of words $w_1 \ldots w_{n-1}$. The number of possible combinations of n consecutive words is $V^n$, where V is the vocabulary size. For n=3 and a vocabulary of the size used in these systems, the number of trigrams that must be known, $V^3$, is astronomic. Apart from storage problems, it would require enormous amounts of text just to see all combinations at least once.

Therefore, techniques have been developed to deal with this problem. One of these is the back-off principle. In this technique, the probability of an untrained n-gram is expressed as the product of a back-off probability and the (n-1)-gram probability of the final n-1 words. The back-off probability depends on the first n-1 words. This process can be continued recursively, down to the unigram probabilities.
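The back-off recursion can be written compactly as below. The notation is an assumption made for illustration: $P$ is a trained n-gram probability, $\beta$ is the back-off probability attached to the first n-1 words, and $\hat{P}$ is the probability the recogniser ends up using. The last line works out the size argument for an assumed vocabulary of 20000 words, a figure chosen only as an example.

```latex
\hat{P}(w_n \mid w_1 \ldots w_{n-1}) =
\begin{cases}
  P(w_n \mid w_1 \ldots w_{n-1})
    & \text{if the $n$-gram } w_1 \ldots w_n \text{ was trained,} \\[4pt]
  \beta(w_1 \ldots w_{n-1}) \, \hat{P}(w_n \mid w_2 \ldots w_{n-1})
    & \text{otherwise.}
\end{cases}

% Size of the full trigram table for an assumed V = 20000:
V^n = 20000^3 = 8 \times 10^{12}
```

The recursion ends at n=1, where $\hat{P}(w_n)$ is simply the unigram probability.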

The n-gram probabilities and back-off probabilities have to be trained with large amounts of text. A common source for benchmark evaluation is newspaper text, but in principle the domain should match that of the application. If the application is dictation of law texts, electronic versions of law books are a good choice of training text. Obtaining these texts electronically might be difficult, and in all cases copyrights have to be respected. To give the reader an idea of the text sizes: in the November 1993 ARPA benchmark evaluation the standard language model was trained with 37 million words of WSJ text; in the December 1994 evaluation the language model training material increased to 237 million words from 5 sources. Generating n-grams from texts has been standardised by Carnegie Mellon University, which has made a toolkit that is freely available for research purposes.
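To make the idea of generating n-grams from text concrete, the sketch below counts n-grams in a tiny invented corpus and forms an unsmoothed relative-frequency estimate. It is not the Carnegie Mellon toolkit and does not imitate its interface; the function name `count_ngrams`, the sentence markers `<s>` and `</s>`, and the two example sentences are assumptions made for illustration.

```python
from collections import Counter
from typing import Iterable, List


def count_ngrams(sentences: Iterable[List[str]], n: int) -> Counter:
    """Count n-grams of order n over tokenised sentences.

    The sentence-boundary markers <s> and </s> are an assumed convention.
    """
    counts: Counter = Counter()
    for tokens in sentences:
        padded = ["<s>"] + list(tokens) + ["</s>"]
        for i in range(len(padded) - n + 1):
            counts[tuple(padded[i:i + n])] += 1
    return counts


# Invented two-sentence corpus; real training text runs to millions of words.
corpus = [
    "the company earned two million dollars".split(),
    "two million dollars were lost".split(),
]

unigrams = count_ngrams(corpus, 1)
bigrams = count_ngrams(corpus, 2)
trigrams = count_ngrams(corpus, 3)

# Relative-frequency estimate of P(dollars | million), before any smoothing
# or back-off weights are computed.
p_dollars_given_million = (
    bigrams[("million", "dollars")] / unigrams[("million",)]
)

if __name__ == "__main__":
    print(bigrams.most_common(3))
    print(p_dollars_given_million)
```

Estimates of this raw kind assign zero probability to every unseen combination, which is why the back-off probabilities have to be trained alongside the n-gram probabilities.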

The language model can be precompiled. In fact, in the ARPA benchmark evaluations a trigram language model in precompiled form is shipped with the training material. This model was built by Doug Paul, then at MIT (Cambridge, MA), and its format has become the de facto standard. It is very simple in structure. The file is a plain text file, starting with some comments. The header ends with the keyword \data\, after which a keyword of the form \n-grams: (for example \2-grams:) starts each n-gram block. Within a block, each line specifies the n-gram probability, the n words, and a back-off probability. The probabilities p are given as $\log_{10} p$. For instance, in the block containing 2-grams, one may find a line of the form

    a  million dollars  b

where a and b are numbers. This should be interpreted as follows: the probability that the word dollars occurs, given that the previous word is million, is $10^{a}$. If a trigram ``million dollars $w$'' is not specified in the file, use the bigram probability of ``dollars $w$'' and correct for backing off with an extra factor $10^{b}$. (For instance, the word $w$ might be ``left'', and the combination ``million dollars left'' might not have occurred in the training texts. If the probability of this combination must be estimated during recognition, this back-off procedure is used.)
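The file layout and the back-off lookup can be tried out with the sketch below. It is an illustration under stated assumptions, not a reference reader for the format: the miniature model `TOY_ARPA` and every number in it are invented, the names `parse_arpa` and `log10_prob` are hypothetical, and a missing back-off field is treated as $\log_{10} 1 = 0$.

```python
from typing import Dict, Tuple

# Hypothetical miniature model in the text format described above.
# All numbers are invented for illustration only.
TOY_ARPA = r"""Toy model, for illustration only.
\data\
ngram 1=4
ngram 2=2
ngram 3=1

\1-grams:
-0.6 million -0.3
-0.7 dollars -0.4
-1.0 left -0.2
-0.9 were -0.2

\2-grams:
-0.2 million dollars -0.7
-0.5 dollars left

\3-grams:
-0.3 million dollars were

\end\
"""

Table = Dict[Tuple[str, ...], float]


def parse_arpa(text: str) -> Tuple[Table, Table]:
    """Read log10 probabilities and log10 back-off weights from the text."""
    logprob: Table = {}
    backoff: Table = {}
    order = 0  # 0 while still in the header
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line == "\\end\\":
            break
        if line.startswith("\\") and line.endswith("-grams:"):
            order = int(line[1:-len("-grams:")])  # e.g. "\2-grams:" -> 2
            continue
        if order == 0:
            continue  # comments, \data\ and the "ngram N=count" lines
        fields = line.split()
        words = tuple(fields[1:1 + order])
        logprob[words] = float(fields[0])
        if len(fields) > 1 + order:  # the back-off field is optional
            backoff[words] = float(fields[1 + order])
    return logprob, backoff


def log10_prob(words: Tuple[str, ...], logprob: Table, backoff: Table) -> float:
    """log10 P(last word | preceding words), backing off recursively."""
    if words in logprob:
        return logprob[words]
    if len(words) == 1:
        raise KeyError(f"unknown word: {words[0]}")
    # The back-off weight belongs to the first n-1 words; the shorter
    # n-gram that is tried next consists of the final n-1 words.
    return backoff.get(words[:-1], 0.0) + log10_prob(words[1:], logprob, backoff)


if __name__ == "__main__":
    lp_table, bo_table = parse_arpa(TOY_ARPA)
    lp = log10_prob(("million", "dollars", "left"), lp_table, bo_table)
    print(lp, 10 ** lp)
```

In this toy model the trigram ``million dollars left'' is absent, so the lookup adds the back-off weight of ``million dollars'' (-0.7) to the log probability of the bigram ``dollars left'' (-0.5). Working in logarithms turns the product of back-off factor and lower-order probability into a sum.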


EAGLES SWLG SoftEdition, May 1997.