In a suboptimal search, which must be used in a real recognition system, there is no guarantee of finding the global optimum as required by the Bayes decision rule. It is therefore important to organise the search as efficiently as possible to minimise the number of search errors. Here, we describe the so-called time-synchronous beam search, which is the key component in many large vocabulary systems [Alleva et al. (1993), Fissore et al. (1993), Murveit et al. (1993), Soong & Huang (1991), Haeb-Umbach & Ney (1994), Valtchev et al. (1994)].
The acoustic modelling of words is based on Hidden Markov models. For an utterance to be recognised, there is a huge number of possible state sequences, and all combinations of state and time must be systematically considered.
In the so-called Viterbi or maximum approximation
[Bahl et al. (1983), Levinson et al. (1983)], the sum over all
paths is approximated by the path which has the maximum contribution
to the sum:

$$\Pr(x_1^T \mid w_1^N) \;=\; \sum_{s_1^T} \prod_{t=1}^{T} \Pr(x_t, s_t \mid s_{t-1}, w_1^N) \;\cong\; \max_{s_1^T} \prod_{t=1}^{T} \Pr(x_t, s_t \mid s_{t-1}, w_1^N)$$
where the sum is to be taken over all paths that are consistent with the word sequence under consideration.
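The effect of the maximum approximation can be illustrated on a toy HMM by comparing the full forward sum over all state paths with the probability of the single best path. All model parameters below are illustrative, not taken from the text:

```python
import numpy as np

# Toy 2-state HMM: full forward sum vs. Viterbi (maximum approximation).
A = np.array([[0.7, 0.3],   # transition probabilities A[s', s]
              [0.4, 0.6]])
pi = np.array([0.6, 0.4])   # initial state probabilities
B = np.array([[0.9, 0.1],   # emission probabilities B[s, x], 2-symbol alphabet
              [0.2, 0.8]])
obs = [0, 1, 1, 0]          # observed symbol sequence x_1..x_T

def forward_sum(obs):
    """Exact Pr(x_1^T): sum over all state paths (forward algorithm)."""
    alpha = pi * B[:, obs[0]]
    for x in obs[1:]:
        alpha = (alpha @ A) * B[:, x]
    return alpha.sum()

def viterbi_max(obs):
    """Maximum approximation: probability of the single best state path."""
    q = pi * B[:, obs[0]]
    for x in obs[1:]:
        q = (q[:, None] * A).max(axis=0) * B[:, x]
    return q.max()

print(forward_sum(obs), viterbi_max(obs))
```

The best-path score is necessarily a lower bound on the full sum; in practice, for sharply peaked HMM distributions the two quantities are close, which is what justifies the approximation.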
Figure 7.9: Example of a time alignment path
In the maximum approximation, the
search problem can be specified as follows.
We wish to assign each acoustic vector at time t
to a (state, word) index pair. This mapping
can be viewed as a
time alignment path, which is a sequence of (state, word)
index pairs (stretching notation):

$$t \;\rightarrow\; (s_t, w_t)$$
An example of such a time alignment path in connected word recognition is depicted in Figure 7.9. The dynamic programming strategy to be presented will allow us to compute these probabilities in a left-to-right fashion over time t and to carry out the optimisation over the unknown word sequence at the same time. Within the framework of the maximum approximation, the dynamic programming algorithm presents a closed-form solution for handling the interdependence of non-linear time alignment, word boundary detection and word identification in continuous speech recognition. Taking this strong interdependence into account has been the basis for many different variants of dynamic programming solutions to the problem [Vintsyuk (1971), Sakoe (1979), Bridle et al. (1982), Ney (1984)]. The sequence of acoustic vectors extracted from the input speech signal is processed strictly from left to right. The search procedure works with a time-synchronous breadth-first strategy, i.e. all hypotheses for word sequences are extended in parallel for each incoming acoustic vector.
Originally, the dynamic programming strategy was designed for small vocabulary tasks like digit string recognition. To extend this algorithm towards large vocabulary tasks, two concepts have to be added: the incorporation of a language model, e.g. a bigram model, and the organisation of the pronunciation lexicon as a tree.
Figure 7.10: Bigram language model and search
Figure 7.10 illustrates how a bigram language model is incorporated into the search, where the vocabulary consists of the three words A, B, C. When starting up a new word w, all predecessor words v have to be considered along with their bigram probabilities p(w|v). Note that, since the bigram language model is a special case, the set of word sequences shown in Figure 7.2 can be compressed into a finite state network.
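The recombination over predecessor words at a word boundary can be sketched in negative-log arithmetic, where lower scores are better and the maximisation over v becomes a minimisation. The vocabulary, the word-end scores, and the uniform bigram below are all illustrative:

```python
import math

# Word-boundary recombination with a bigram LM (toy example).
# q_end[v][w]: -log score of word w ending at the current time with
# predecessor v; bigram[v][w]: -log p(w|v). All numbers are illustrative.
vocab = ["A", "B", "C"]
bigram = {v: {w: -math.log(1.0 / 3.0) for w in vocab} for v in vocab}  # uniform
q_end = {"A": {"A": 12.0, "B": 9.5, "C": 11.0},
         "B": {"A": 10.0, "B": 13.0, "C": 8.0},
         "C": {"A": 11.5, "B": 10.5, "C": 9.0}}

def recombine(q_end, bigram):
    """For each word w, keep only the best predecessor v (DP recombination)."""
    h = {}
    for w in vocab:
        v_best = min(vocab, key=lambda v: bigram[v][w] + q_end[v][w])
        h[w] = (bigram[v_best][w] + q_end[v_best][w], v_best)
    return h

h = recombine(q_end, bigram)
print(h)
```

Only one hypothesis per word survives the boundary, which is exactly why the network of Figure 7.10 stays finite: the number of word-start hypotheses is bounded by the vocabulary size, not by the number of word sequences.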
When using a lexicon tree to represent the pronunciation lexicon, we are faced with the problem that the identity of the word is known only when we have reached the end of the tree. The solution is to use a separate lexicon tree in search for each predecessor word [Haeb-Umbach & Ney (1994)] as illustrated in Figure 7.11.
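A lexicon tree can be sketched as a nested prefix tree over phoneme sequences, which makes explicit why the word identity is available only at a leaf. The words and pronunciations below are illustrative:

```python
# Prefix-tree (lexicon tree) sketch: words sharing phoneme prefixes share
# arcs, so the word identity is known only when a leaf is reached.
lexicon = {"speech": "s p iy ch", "speak": "s p iy k", "spin": "s p ih n"}

def build_tree(lexicon):
    root = {}
    for word, pron in lexicon.items():
        node = root
        for ph in pron.split():
            node = node.setdefault(ph, {})
        node["#word"] = word     # word identity attached at the leaf only
    return root

tree = build_tree(lexicon)
# 's p iy' is shared by "speech" and "speak"; the identity is resolved
# only by the final 'ch' vs. 'k' arc.
print(tree["s"]["p"]["iy"]["ch"]["#word"])
```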
Figure 7.11: Word recombination for a lexical tree
Fortunately, in the framework of beam search, the number of surviving predecessor words, and thus the number of active trees, is small: say, 50 even for a 20,000-word task with a perplexity of 200.
To prepare the ground for dynamic programming, we introduce two auxiliary quantities, namely the score $Q_v(t,s)$ and the back pointer $B_v(t,s)$:

$Q_v(t,s)$ := score of the best partial path that ends at time t in state s of the lexical tree for predecessor word v;

$B_v(t,s)$ := start time of the best partial path that ends at time t in state s of the lexical tree for predecessor word v.

Note that for both quantities, the index w is known only at the leaves of
the lexicon trees.
Both quantities are evaluated using the dynamic programming recursion for $Q_v(t,s)$:

$$Q_v(t,s) \;=\; \max_{s'} \left\{ q(x_t, s \mid s'; w) \cdot Q_v(t-1, s') \right\}$$
$$B_v(t,s) \;=\; B_v(t-1, s_{\max}(t,s))$$

where $q(x_t, s \mid s'; w)$ is the product of emission and transition probability as given by the Hidden Markov model, and $s_{\max}(t,s)$ is the optimum predecessor state for the hypothesis (t,s;w) and predecessor word v. The back pointers are propagated according to the dynamic programming (DP) decision. Unlike the predecessor word v, the index w is well defined only at the end of the lexical tree. Using a suitable initialisation for $Q_v(t-1, s=0)$, this equation includes optimisation over the unknown word boundaries. At word boundaries, we have to find the best predecessor word for each word. To this purpose, we define:

$$H(w;t) \;=\; \max_{v} \left\{ p(w \mid v) \cdot Q_v(t, S_w) \right\}$$

where $S_w$ denotes the terminal state of word w in the lexical tree.
To start words, we have to pass on the score and the time index:

$$Q_v(t-1, s=0; w) \;=\; H(v; t-1)$$
$$B_v(t-1, s=0; w) \;=\; t-1$$

Note that here the right-hand sides do not depend on the word index w since, as said before, the identity of the spoken word is known only at the end of a tree.
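The within-word DP recursion with back-pointer propagation can be sketched for a single word hypothesis with a linear (loop/forward) HMM topology. The frame count, state count, and `local` score table are all illustrative, and scores are negative log-probabilities, so the maximisation over predecessor states becomes a minimisation:

```python
# Within-word DP recursion with back pointers (sketch, -log scores).
# local[t][s] stands in for the combined emission/transition cost
# q(x_t, s | s'; w); the back pointer propagates the word start time.
INF = float("inf")
T, S = 4, 3                      # 4 frames, 3 states in a linear HMM
local = [[1.0, 2.0, 5.0],        # illustrative per-frame, per-state costs
         [4.0, 1.0, 3.0],
         [5.0, 2.0, 1.0],
         [6.0, 4.0, 1.0]]

Q = [[INF] * S for _ in range(T)]
Bp = [[-1] * S for _ in range(T)]
Q[0][0], Bp[0][0] = local[0][0], 0   # this word hypothesis starts at time 0

for t in range(1, T):
    for s in range(S):
        # linear topology: predecessor is s itself (loop) or s-1 (forward)
        preds = [p for p in (s, s - 1) if p >= 0 and Q[t - 1][p] < INF]
        if not preds:
            continue
        s_max = min(preds, key=lambda p: Q[t - 1][p])   # DP decision
        Q[t][s] = Q[t - 1][s_max] + local[t][s]
        Bp[t][s] = Bp[t - 1][s_max]   # back pointer follows the decision

print(Q[T - 1][S - 1], Bp[T - 1][S - 1])  # score and start time at the leaf
```

At the final state (a leaf of the lexical tree), the score feeds into the word-boundary recombination described above, and the back pointer delivers the hypothesised word start time.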
To measure the search effort, there are a number of quantities that can be used: