In a suboptimal search, which must be used in any real recognition system, there is no guarantee of finding the global optimum as required by the Bayes decision rule. It is therefore important to organise the search as efficiently as possible to minimise the number of search errors. Here, we describe the so-called time-synchronous beam search, which is a key component of many large vocabulary systems [Alleva et al. (1993), Fissore et al. (1993), Murveit et al. (1993), Soong & Huang (1991), Haeb-Umbach & Ney (1994), Valtchev et al. (1994)].
Acoustic-phonetic modelling
is based on Hidden Markov models.
For an utterance to be recognised,
there is a huge number of possible state sequences, and all combinations
of state and time must be systematically considered.
In the so-called Viterbi or maximum approximation
[Bahl et al. (1983), Levinson et al. (1983)], the sum over all
paths is approximated by the path which has the maximum contribution
to the sum:

  Pr(x_1^T | w_1^N) = sum_{s_1^T} Pr(x_1^T, s_1^T | w_1^N) ≈ max_{s_1^T} Pr(x_1^T, s_1^T | w_1^N),

where the sum is to be taken over all state sequences s_1^T that are
consistent with the word sequence under consideration.
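The effect of the maximum approximation can be illustrated with a toy example. The following sketch (all HMM parameters are hypothetical, chosen only for illustration) compares the exact sum over all state sequences with the score of the single best path:

```python
import itertools

# Toy 2-state HMM; all parameters are hypothetical.
states = [0, 1]
init = {0: 0.5, 1: 0.5}
trans = {(0, 0): 0.6, (0, 1): 0.4, (1, 0): 0.3, (1, 1): 0.7}
# emit[s][t]: probability of the observed acoustic vector x_t in state s
emit = [[0.9, 0.2, 0.1],
        [0.1, 0.8, 0.9]]
T = 3

def path_prob(path):
    """Joint probability of one state sequence and the observations."""
    p = init[path[0]] * emit[path[0]][0]
    for t in range(1, T):
        p *= trans[(path[t - 1], path[t])] * emit[path[t]][t]
    return p

paths = list(itertools.product(states, repeat=T))
total = sum(path_prob(p) for p in paths)  # exact sum over all paths
best = max(path_prob(p) for p in paths)   # Viterbi / maximum approximation

print(f"sum over all paths: {total:.5f}")
print(f"best path only:     {best:.5f}")
```

Even in this toy case the single best path contributes the bulk of the total probability; with the sharply peaked emission distributions of trained acoustic models, the approximation becomes tighter still.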
Figure 7.9: Example of a time alignment path
In the maximum approximation, the
search problem can be specified as follows.
We wish to assign each acoustic vector at time t
to a (state, word) index pair. This mapping
can be viewed as a
time alignment path, which is a sequence of (state, word)
index pairs (stretching notation):

  t → (s_t, w_t),   t = 1, ..., T.
An example of such a time alignment path in connected
word recognition is depicted in Figure 7.9.
The dynamic programming strategy to be presented
will allow us to compute the probabilities
in a left-to-right fashion over time t
and to carry out the optimisation
over the unknown word sequence at the same time.
Within the
framework of the maximum approximation, the dynamic programming
algorithm presents a closed-form solution for handling the
interdependence of non-linear time alignment, word boundary
detection and word identification in continuous speech
recognition.
Taking this strong interdependence into account has been
the basis for many different variants of dynamic programming
solutions to the problem
[Vintsyuk (1971), Sakoe (1979), Bridle et al. (1982), Ney (1984)].
The sequence of acoustic vectors extracted from the input speech
signal is processed strictly from left to right. The search
procedure works with a time-synchronous breadth-first strategy, i.e. all
hypotheses for word sequences are extended in parallel for each
incoming acoustic vector.
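A time-synchronous recursion of this kind can be sketched as follows; the data structures are hypothetical simplifications of a real decoder, and scores are kept in the log domain:

```python
def viterbi_step(active, x_t, successors, log_emit):
    """Extend all active hypotheses in parallel by one acoustic vector x_t.

    active     -- dict: state -> best log score reaching that state
    successors -- dict: state -> list of (next_state, log transition prob)
    log_emit   -- function (state, x_t) -> log emission probability
    """
    new_active = {}
    for s, score in active.items():
        for s_next, log_p in successors.get(s, []):
            cand = score + log_p + log_emit(s_next, x_t)
            # DP maximisation over predecessor states
            if cand > new_active.get(s_next, float("-inf")):
                new_active[s_next] = cand
    return new_active

# Strictly left-to-right, time-synchronous processing:
# for x_t in frames:
#     active = viterbi_step(active, x_t, successors, log_emit)
```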
Originally, the dynamic programming strategy was designed for small vocabulary tasks like digit string recognition. To extend this algorithm towards large vocabulary tasks, two concepts have to be added: the incorporation of the language model into the search, and a tree organisation of the pronunciation lexicon.
Figure 7.10: Bigram language model and search
Figure 7.10 illustrates how a bigram language model is incorporated into the search, where the vocabulary consists of the three words A, B, C. When starting up a new word w, all predecessor words v have to be considered along with their bigram probabilities p(w|v). Note that, since the bigram language model is a special case, the set of word sequences shown in Figure 7.2 can be compressed into a finite state network.
When using a lexicon tree to represent the pronunciation lexicon, we are faced with the problem that the identity of the word is known only when we have reached the end of the tree. The solution is to use a separate lexicon tree in search for each predecessor word [Haeb-Umbach & Ney (1994)] as illustrated in Figure 7.11.
Figure 7.11: Word recombination for a lexical tree
Fortunately, in the framework of beam search, the number of surviving predecessor words, and thus the number of active trees, is small: say, 50, even for a 20,000-word task with a perplexity of 200.
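Beam pruning itself is simple: at every time frame, only those hypotheses whose score lies within a fixed margin of the currently best hypothesis are kept active. A minimal sketch in the log domain (the beam width is a hypothetical tuning parameter):

```python
def beam_prune(active, beam_width):
    """Keep only hypotheses within beam_width (log domain) of the best score.

    active     -- dict: hypothesis -> log score
    beam_width -- pruning margin; a larger value means fewer search errors
                  but more surviving hypotheses per frame
    """
    if not active:
        return active
    best = max(active.values())
    return {h: q for h, q in active.items() if q >= best - beam_width}
```

Because all hypotheses are time-synchronous, their scores are directly comparable, which is what makes this relative threshold meaningful.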
To prepare the ground for dynamic programming, we introduce two auxiliary quantities, namely the score Q_v(t, s; w) and the back pointer B_v(t, s; w):

  Q_v(t, s; w) := score of the best path up to time t that ends in state s of word w, for predecessor word v,
  B_v(t, s; w) := start time of the best path up to time t that ends in state s of word w, for predecessor word v.
Note that for both quantities, the index w is known only at the leaves of
the lexicon trees.
Both quantities are evaluated using
the dynamic programming recursion for Q_v(t, s; w):

  Q_v(t, s; w) = max_{s'} { q(x_t, s | s'; w) · Q_v(t-1, s'; w) },

where q(x_t, s | s'; w) is the product of emission and transition
probability as given by the Hidden Markov model, and s_v^max(t, s; w)
is the optimum predecessor state for
the hypothesis (t, s; w) and predecessor word v.
The back pointers are propagated according
to the dynamic programming (DP) decision:

  B_v(t, s; w) = B_v(t-1, s_v^max(t, s; w); w).
Unlike the predecessor word v,
the index w is well defined
only at the end of the lexical tree.
Using a suitable initialisation
for Q_v(t, s=0; w), this equation includes the optimisation
over the unknown word boundaries.
At word boundaries,
we have to find the best predecessor word
for each word.
To this purpose, we define:

  H(w; t) := max_v { p(w|v) · Q_v(t, S_w; w) },

where S_w denotes the terminal state of word w in the lexical tree.
To start words, we have
to pass on the score and the time index:

  Q_v(t-1, s=0; w) = H(v; t-1),
  B_v(t-1, s=0; w) = t-1.

Note that here the right-hand side does not depend on the
word index w since, as said before, the
identity of the spoken word is known only at the
end of a tree.
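Assuming log-domain scores and one tree copy per predecessor word (all names here are hypothetical), the word-boundary recombination just described could be sketched as:

```python
def recombine(tree_scores, log_bigram, terminal_state):
    """Compute H(w; t) = max over v of [ log p(w|v) + Q_v(t, S_w) ].

    tree_scores    -- dict: predecessor word v -> {state: log score Q_v(t, .)}
    log_bigram     -- dict: (v, w) -> log p(w|v)
    terminal_state -- dict: word w -> terminal state S_w of w in the tree
    """
    H = {}
    for w, s_w in terminal_state.items():
        best = float("-inf")
        for v, scores in tree_scores.items():
            # Maximise over all predecessor words v that end word w at time t.
            if s_w in scores and (v, w) in log_bigram:
                best = max(best, log_bigram[(v, w)] + scores[s_w])
        if best > float("-inf"):
            H[w] = best
    return H

# The values H(v; t) then seed the root state s = 0 of the tree copy
# for predecessor v, so that the next words can start.
```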
To measure the search effort, there are a number of quantities that can be used: