The task of a language model is to express the restrictions imposed on the way in which words can be combined to form sentences. In other words, the idea is to capture the inherent redundancy that is present in the language, or to be more exact, in the language subset handled by the system. This redundancy results from the syntactic, semantic and pragmatic constraints of the language and may be modelled by probabilistic or non-probabilistic (``yes/no'') methods. Figure 7.2 illustrates the situation for a vocabulary of three words A, B, C. It is always possible to arrange the word sequences in the form of a tree; Figure 7.2 shows the sentence tree for all four-word sentences. Some of these word sequences may be impossible, some may be possible, and others may be very typical, according to the syntactic, semantic and perhaps pragmatic constraints. The task of the language model is then to express these constraints by assigning a probability to each of the sentences. In simple cases like voice command applications, it might be sufficient to just remove the illegal sentences from the diagram and compress it into a finite-state network.
Figure 7.2: Illustration of the decision problem for a three-word vocabulary
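To make the ``yes/no'' case concrete, the following sketch enumerates the sentence tree for a toy three-word vocabulary and prunes it with a hypothetical constraint (here, simply that a word may not immediately repeat itself); the surviving sentences are what would be compressed into a finite-state network. The vocabulary and the constraint are illustrative assumptions, not part of the original figure.

\begin{verbatim}
from itertools import product

# Toy version of Figure 7.2: a three-word vocabulary and all
# four-word sentences it can generate (3^4 = 81 sequences).
VOCAB = ["A", "B", "C"]
all_sentences = list(product(VOCAB, repeat=4))

# Hypothetical "yes/no" constraint: a word may not immediately
# repeat itself.  A real command grammar would encode its own
# syntactic/semantic restrictions here.
def is_legal(sentence):
    return all(w1 != w2 for w1, w2 in zip(sentence, sentence[1:]))

# The legal sentences are what remains after pruning the tree;
# compressed, they form the finite-state network mentioned above.
legal_sentences = [s for s in all_sentences if is_legal(s)]

print(len(all_sentences), len(legal_sentences))   # 81 24
\end{verbatim}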
For large-vocabulary recognition tasks, such methods cannot be used, because arbitrary word sequences have to be allowed, and these are difficult to describe deterministically.
The task of a stochastic language model is to provide estimates of these prior probabilities. Using the definition of conditional probabilities, we obtain the decomposition:

\[ \Pr(w_1 \ldots w_N) \;=\; \prod_{n=1}^{N} \Pr(w_n \mid w_1 \ldots w_{n-1}) \]
Strictly speaking, this equation requires a suitable
interpretation of the variable N, the number of
words. When considering a single sentence,
the number of words is itself a random variable,
and we need an additional distribution over sentence lengths.
In practice, the problem is circumvented by applying the above
equation to a whole set of sentences and
extending the vocabulary by a special symbol (or ``word'') that
marks the end of a sentence (and the beginning of the next sentence).
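As an illustration of how the decomposition and the sentence-end symbol are used, the sketch below scores a word sequence with an arbitrary conditional model; the function name cond_prob and the symbol ``</s>'' are assumptions made for the example, not notation from the text.

\begin{verbatim}
import math

END = "</s>"   # special symbol marking the end of a sentence

def sentence_logprob(words, cond_prob):
    # Chain-rule decomposition: sum of log Pr(w_n | w_1 ... w_{n-1}).
    # `cond_prob(history, word)` stands for whatever language model
    # is in use; appending END makes the sentence length implicit.
    logp = 0.0
    history = []
    for w in list(words) + [END]:
        logp += math.log(cond_prob(tuple(history), w))
        history.append(w)
    return logp

# Toy usage: a uniform model over a three-word vocabulary plus END.
uniform = lambda history, word: 1.0 / 4
print(sentence_logprob(["A", "B", "C"], uniform))   # 4 * log(1/4)
\end{verbatim}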
For large vocabulary speech recognition, these conditional probabilities are typically used in the following way [Bahl et al. (1983)]. The dependence of the conditional probability of observing a word w_n at position n is modelled as being restricted to its immediate (m-1) predecessor words w_{n-m+1} ... w_{n-1}. The resulting model is that of a Markov chain and is referred to as an m-gram model. The following types of model are quite common: the unigram model (m=1), the bigram model (m=2) and the trigram model (m=3).
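A minimal sketch of such a model, here a bigram (m=2) estimated by relative frequencies, is given below; the toy corpus, the boundary symbols and the absence of smoothing (which is essential in practice to avoid zero probabilities) are simplifying assumptions for illustration.

\begin{verbatim}
from collections import defaultdict

# Bigram (m = 2) model: Pr(w_n | w_{n-1}) estimated as relative
# frequencies of word pairs in a toy corpus (no smoothing).
corpus = [
    ["the", "cat", "sat"],
    ["the", "cat", "ran"],
    ["the", "dog", "sat"],
]

BOS, EOS = "<s>", "</s>"                 # sentence-boundary symbols
counts = defaultdict(lambda: defaultdict(int))

for sent in corpus:
    tokens = [BOS] + sent + [EOS]
    for prev, word in zip(tokens, tokens[1:]):
        counts[prev][word] += 1

def bigram_prob(prev, word):
    total = sum(counts[prev].values())
    return counts[prev][word] / total if total else 0.0

print(bigram_prob("the", "cat"))         # 2/3 in this corpus
\end{verbatim}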
It is obvious that, apart from speech recognition, language models are also essential for optical character recognition [Mori et al. (1992)] and language translation [Berger et al. (1994)]. It is interesting to mention that similar m-gram techniques are used in the context of acoustic-phonetic modelling. The main difference is the level at which the statistical data are collected, i.e. at the level of phonemes or the level of phones, which are the acoustic realisations of the phonemes. Phone bigrams and trigrams are referred to as diphones and triphones, respectively. Statistical techniques related to those used in language modelling can also be applied to language understanding [Pieraccini et al. (1993)].