next up previous contents index
Next: Search: Word graphs Up: Language models and search Previous: Sources of recognition errors

Search: Finding the single best sentence


In a suboptimal search, which must be used in a real recognition system , there is no guarantee to find the global optimum as required by the Bayes decision rule.  Therefore it is important to organise the search as efficiently as possible to minimise the number of search errors.  Here, we will describe so-called time synchronous beam search  which is the key component in many large vocabulary  systems  [Alleva et al. (1993), Fissore et al. (1993), Murveit et al. (1993), Soong & Huang (1991), Haeb-Umbach & Ney (1994), Valtech et al. (1994)].

Acoustic-phonetic modelling is based on Hidden Markov models.  For an utterance to be recognised, there is a huge number of possible state sequences, and all combinations of state and time must be systematically considered. In the so-called Viterbi or maximum approximation [Bahl et al. (1983), Levinson et al. (1983)], the sum over all paths is approximated by the path which has the maximum contribution to the sum:
where the sum is to be taken over all paths tex2html_wrap_inline46065 that are consistent with the word sequence tex2html_wrap_inline46067 under consideration.

Figure 7.9: Example of a time alignment path 

In the maximum approximation, the search problem can be specified as follows. We wish to assign each acoustic vector at time t to a (state,word) index pair. This mapping can be viewed as a time alignment  path, which is a sequence of (state,word) index pairs (stretching notation):
An example of such a time alignment  path in connected word  recognition  is depicted in Figure 7.9. The dynamic programming  strategy to be presented will allow us to compute the probabilities
in a left-to-right fashion over time t and to carry out the optimisation over the unknown word sequence at the same time. Within the framework of the maximum approximation, the dynamic programming  algorithm presents a closed-form solution for handling the interdependence of non-linear time alignment , word boundary  detection and word identification in continuous speech  recognition.  Taking this strong interdependence into account has been the basis for many different variants of dynamic programming   solutions to the problem [Vintsyuk (1971), Sakoe (1979), Bridle et al. (1982), Ney (1984)]. The sequence of acoustic vectors extracted from the input speech signal is processed strictly from left to right. The search procedure works with a time-synchronous breadth-first strategy, i.e. all hypotheses for word sequences are extended in parallel for each incoming acoustic vector.

Originally, the dynamic programming strategy  was designed for small vocabulary   tasks like digit string recognition. To extend this algorithm towards large vocabulary   tasks, there are two concepts that have to be added:

Figure 7.10: Bigram language model and search 

Figure 7.10 illustrates how a bigram  language model is incorporated into the search, where the vocabulary  consists of the three words A,B,C. When starting up a new word w, all predecessor words v have to be considered along with their bigram  probabilities p(w|v). Note that, since the bigram language model is a special case, the set of word sequences shown in Figure 7.2 can be compressed into a finite state network. 

When using a lexicon tree to represent the pronunciation lexicon, we are faced with the problem that the identity of the word is known only when we have reached the end of the tree. The solution is to use a separate lexicon tree in search for each predecessor word [Haeb-Umbach & Ney (1994)] as illustrated in Figure 7.11.

Figure 7.11: Word recombination for a lexical tree 

Fortunately, in the framework of beam search , the number of surviving predecessor words and thus the number of active trees is small, say 50 even for a 20000 word task with a perplexity  of 200.

To prepare the ground for dynamic programming , we introduce two auxiliary quantities, namely the score tex2html_wrap_inline46081 and the back pointer tex2html_wrap_inline46083:

Note that for both quantities, the index w is known only at the leaves of the lexicon trees. Both quantities are evaluated using the dynamic programming  recursion for tex2html_wrap_inline46081:
where tex2html_wrap_inline46109 is the product of emission and transition probability as given by the Hidden Markov model , and tex2html_wrap_inline46111 is the optimum predecessor state for the hypothesis (t,s;w) and predecessor word v. The back pointers tex2html_wrap_inline46083 are propagated according to the dynamic programming  (DP) decision. Unlike the predecessor word v, the index w is well defined only at the end of the lexical tree. Using a suitable initialisation for tex2html_wrap_inline46123, this equation includes optimisation over the unknown word boundaries.  At word boundaries, we have to find the best predecessor word for each word. To this purpose, we define:
To start words, we have to pass on the score and the time index:
Note that here the right-hand side does not depend on word index w since as said before the identity of the spoken word is known only at the end of a tree.

To measure the search effort there are a number of quantities that can be used:


next up previous contents index
Next: Search: Word graphs Up: Language models and search Previous: Sources of recognition errors

EAGLES SWLG SoftEdition, May 1997. Get the book...