In contrast to a small vocabulary word recogniser, a large vocabulary recognition system generally uses subword units, such as phones, as acoustic units for recognition. This keeps the number of models to be trained limited (to the number of phone(eme)s in a language, typically 35-50), although training of context-dependent phones (so-called triphones) again increases the number of models (to typically 2000).
This means that the training vocabulary is not necessarily complete with respect to the recogniser's vocabulary. Instead, the recognition system uses a dictionary to find the possible ways to pronounce each word in terms of the limited set of phones. The words in the dictionary define the recogniser's vocabulary.
The fact that these systems are designed to recognise continuous speech means that they are equipped with algorithms that can segment the input utterance into distinct words. This segmentation process can often also be used during training, which relieves the training databases from having to provide labelling information on the word boundaries in the speech files.
When organising a competitive assessment of various systems, it is important to define carefully what training is allowed. This includes both acoustic and language modelling training. In the ARPA paradigm, part of the evaluation puts fewer restrictions on the training material, but demands that this material be made available to the other participants.
The acoustic training material consists of large databases, with many hours of speech recorded from many people. The most famous training database for American English is the ``Wall Street Journal'' database (WSJ), which has two releases: WSJ0, which contains 84 speakers, and WSJ1, which contains an additional 200 speakers. The total training time is approximately 60 hours. The training sentences come from the Wall Street Journal newspaper. All training sentences have been orthographically transcribed.
It is important that the acoustic training material comes with orthographic transcriptions; without these the material is virtually worthless for training. The size of the material is also relevant: large vocabulary systems often work with models for phone sequences of up to 3 phones (triphones). This means that the number of models to be trained is quite large, typically 2000. All the models must be trained many times, with many different speakers, in order to be robust. Up to now, there is no indication that recognition performance, as a function of the amount of training material, saturates at the available maximum of 60 hours.
For training the phone models, some automatic conversion from the orthographic texts to phones is necessary. This is generally performed by dictionary lookup. Compiling a dictionary is very laborious, and more often than not these dictionaries are considered proprietary information; the dictionary is often viewed as part of the recognition system. Recognisers may fall back on a text-to-speech (TTS) system if words in the transcription do not occur in the dictionary (see Chapter 6).
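As a rough illustration of this lookup-with-fall-back scheme, the sketch below shows how an orthographic transcription might be mapped to a phone sequence. It is not taken from any particular recogniser: the pronunciation entries and the g2p_fallback routine are invented for the example, and a real system would use a full pronunciation dictionary and a proper grapheme-to-phoneme module.

def g2p_fallback(word):
    # Hypothetical stand-in for a grapheme-to-phoneme module such as the one
    # found in a TTS system; here it simply spells the word letter by letter.
    return list(word.upper())

# Illustrative pronunciation dictionary (SAMPA-like symbols, invented entries).
PRONUNCIATIONS = {
    "million": ["m", "I", "l", "j", "@", "n"],
    "dollars": ["d", "A", "l", "@", "z"],
}

def transcription_to_phones(transcription, lexicon=PRONUNCIATIONS):
    """Convert an orthographic transcription to a phone sequence."""
    phones = []
    for word in transcription.lower().split():
        if word in lexicon:
            phones.extend(lexicon[word])        # dictionary lookup
        else:
            phones.extend(g2p_fallback(word))   # TTS-style fall-back
    return phones

print(transcription_to_phones("Million dollars"))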
An essential part of a large vocabulary continuous speech recognition system is the language model. It represents the machine's knowledge of the language it is supposed to recognise. Because the recognisers often take a probabilistic approach to acoustic modelling, a probabilistic language model fits in perfectly. There are many ways to implement a probabilistic grammar (see Chapter 7), but the most widely used is the n-gram grammar. In an n-gram grammar, the probability that a word $w_n$ follows a sequence of words $w_1 \ldots w_{n-1}$ is defined. The number of possible combinations of $n$ consecutive words is $V^n$, where $V$ is the vocabulary size. For $n=3$ and a vocabulary of, say, $V = 20\,000$ words, the number of trigrams that must be known is astronomical: $V^3 = 8 \times 10^{12}$. Apart from storage problems, it would require enormous amounts of text just to see all combinations at least once. Therefore, techniques have been developed to deal with this problem. One of these is the back-off principle. In this technique, an untrained n-gram is expressed as the product of a back-off probability and the (n-1)-gram probability of the final n-1 words. The back-off probability depends on the first n-1 words. This process can be continued recursively, down to the unigram probabilities.
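Written out for trigrams (the notation here is ours, not taken from the evaluation material), the back-off principle reads
\[
P(w_3 \mid w_1, w_2) =
\begin{cases}
\hat{P}(w_3 \mid w_1, w_2) & \text{if the trigram } w_1 w_2 w_3 \text{ is trained,} \\
b(w_1, w_2)\, P(w_3 \mid w_2) & \text{otherwise,}
\end{cases}
\]
where $\hat{P}$ is the trained trigram probability and $b(w_1, w_2)$ is the back-off probability associated with the word pair $w_1 w_2$. If the bigram $w_2 w_3$ is itself untrained, the same rule is applied once more, ending at the unigram probability $P(w_3)$.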
The n-gram probabilities and back-off probabilities have to be trained with large amounts of text. A common source for benchmark evaluation is newspaper text, but in principle the domain should match that of the application. If the application is dictation of law texts, a good choice of training texts is (electronic versions of) law books. Getting these texts in electronic form may be difficult, and in all cases copyrights have to be respected. To give the reader an idea of the text sizes: in the November 1993 ARPA benchmark evaluation the standard language model was trained with 37 million words of WSJ text; in the December 1994 evaluation the language model training material increased to 237 million words from 5 sources. Generating n-grams from texts has been standardised by Carnegie Mellon University, which has made a toolkit that is freely available for research purposes.
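At its core, such n-gram training is a matter of counting word sequences in the training text. The sketch below is only illustrative and is not the Carnegie Mellon toolkit: it produces raw maximum-likelihood trigram estimates and leaves out the discounting and back-off computations that a real toolkit performs.

from collections import Counter

def train_trigrams(sentences):
    """Estimate raw (undiscounted) trigram probabilities from a list of sentences."""
    trigram_counts = Counter()
    bigram_counts = Counter()
    for sentence in sentences:
        # Pad with sentence-boundary markers so every word has two predecessors.
        words = ["<s>", "<s>"] + sentence.lower().split() + ["</s>"]
        for w1, w2, w3 in zip(words, words[1:], words[2:]):
            trigram_counts[(w1, w2, w3)] += 1
            bigram_counts[(w1, w2)] += 1
    # Maximum-likelihood estimate: P(w3 | w1 w2) = N(w1 w2 w3) / N(w1 w2).
    return {tri: count / bigram_counts[tri[:2]]
            for tri, count in trigram_counts.items()}

probs = train_trigrams(["three million dollars were paid",
                        "a million dollars is a lot"])
print(probs[("a", "million", "dollars")])   # 1.0 in this toy corpus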
The language model can be precompiled. In fact, in the ARPA benchmark evaluations a trigram language model in precompiled form is shipped with the training material. This model was built by Doug Paul, then at MIT (Cambridge, MA), and the format of this language model has become the de-facto standard. It is very simple in structure. The file is a plain text file, starting with some comments. The header ends with the keyword \data\, after which a keyword of the form \n-grams: (i.e. \1-grams:, \2-grams:, \3-grams:) starts an n-gram block. In the following lines, each line specifies the n-gram probability, the n words and, for all but the highest-order n-grams, a back-off probability. The probabilities p are given as $\log_{10} p$. For instance, in the block containing 2-grams one may find a line for the word pair ``million dollars''. It should be interpreted as follows: the probability that the word dollars occurs, given the fact that the previous word is million, is $10^{p}$, with $p$ the log probability listed on that line. If a trigram ``million dollars $w$'' is not specified in the file, the bigram probability of ``dollars $w$'' is used instead, corrected for backing-off with an extra factor $10^{b}$, where $b$ is the back-off value on the ``million dollars'' line. (For instance, the word $w$ might be ``left'', and the combination ``million dollars left'' might not have occurred in the training texts. If in the recognition process the probability for this combination must be estimated, this back-off procedure is used.)
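To make this concrete, an excerpt of such a file might look as follows. The layout follows the de-facto standard, but all counts, words and numbers shown here are invented for illustration.

\data\
ngram 1=20000
ngram 2=3500000
ngram 3=6200000

\1-grams:
-4.1234 million   -0.4321
-4.5678 dollars   -0.3987

\2-grams:
-0.9876 million dollars   -0.2345

\3-grams:
-0.5432 three million dollars

\end\

The line in the 2-grams block would thus give $P(\mathrm{dollars} \mid \mathrm{million}) = 10^{-0.9876}$, and the factor $10^{-0.2345}$ would be applied when backing off from an unseen trigram starting with ``million dollars''.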