
Speech recognition systems

The expression ``speech recognition system'' is used here to mean the module that takes speech input (acoustic wave forms) and delivers either a label or a corresponding command. Designers usually focus on three major elements: the vocabulary (complexity, syntax, size), the environment (bandwidth, noise level, distortion type), and the speakers (stressed/relaxed, trained/untrained).

The major requirements relate to:

vocabulary, speech and language modelling,
training material (if needed), the data collection platform, pre-processing procedures,
speaker dependency and speaking modes,
environment conditions.

In the following we describe most of the factors that may arise when specifying a speech recognition technology, or that reflect the expectations of users and thus of application developers.

A speech recogniser is based on some form of speech modelling, for which various paradigms exist. The best known are Dynamic Time Warping (DTW), Hidden Markov Modelling (HMM), and Artificial Neural Networks (ANNs). Most approaches distinguish two phases: a training phase and an exploitation phase. The first phase is devoted to learning speech characteristics from data:

acoustic wave forms,
phonetic/linguistic descriptions,
specific features, etc.

The material needed for this phase is important and will be elaborated upon in the following sections. The second phase, exploitation, consists of using the trained system to recognise speech input. The key characteristics of the recogniser are described in the following sections.
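As an illustration of the two phases, the following is a minimal sketch of a template-based recogniser using Dynamic Time Warping, one of the paradigms named above. It is not taken from any particular system: the function names (`dtw_distance`, `train`, `recognise`) are hypothetical, the features are toy one-dimensional values rather than real acoustic features, and training is reduced to simply storing labelled reference utterances.

```python
import math


def dtw_distance(a, b):
    """Accumulated DTW cost between two sequences of feature vectors.

    Each sequence is a list of equal-length tuples of floats.
    The local cost is the Euclidean distance between frames.
    """
    n, m = len(a), len(b)
    # cost[i][j] = minimal accumulated cost of aligning a[:i] with b[:j]
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = math.dist(a[i - 1], b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # frame of a repeated against b[j-1]
                cost[i][j - 1],      # frame of b repeated against a[i-1]
                cost[i - 1][j - 1],  # both sequences advance
            )
    return cost[n][m]


def train(labelled_utterances):
    """Training phase (assumed here to be plain template storage)."""
    templates = {}
    for label, features in labelled_utterances:
        templates.setdefault(label, []).append(features)
    return templates


def recognise(templates, features):
    """Exploitation phase: return the label of the closest stored template."""
    best_label, best_cost = None, math.inf
    for label, references in templates.items():
        for reference in references:
            c = dtw_distance(reference, features)
            if c < best_cost:
                best_label, best_cost = label, c
    return best_label


# Toy data: one reference utterance per vocabulary word.
templates = train([
    ("yes", [(0.0,), (1.0,), (2.0,), (1.0,)]),
    ("no", [(2.0,), (2.0,), (0.0,), (0.0,)]),
])

# A slower rendition of the "yes" pattern: same shape, different timing.
slow_yes = [(0.0,), (0.0,), (1.0,), (2.0,), (2.0,), (1.0,)]
print(recognise(templates, slow_yes))  # prints "yes"
```

The warping step is what lets the slower utterance match its reference despite the difference in duration. HMM and ANN recognisers replace the stored templates with statistical models whose parameters are estimated during training, but the split into a training phase and an exploitation phase is the same.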

EAGLES SWLG SoftEdition, May 1997.