Speech quality and conditions


The conditions under which a recogniser is used greatly influence its performance. Speech quality can be characterised by various properties. A distinction is made between pre-production factors, which influence the way speech is produced, and post-production factors, which influence the way the speech is transmitted from the mouth of the speaker to the recognition system. We have summarised some of the conditions in Table 10.2.


Parameter                         easy task                  difficult task
Pre:  Vocabulary choice           distinct words             similar words
      Talking style               read speech                spontaneous speech
                                  constant energy level      fluctuating level
      Recording conditions        undisturbed speech         deteriorated speech (e.g. stressed, Lombard effect)
Post: Electrical characteristics  wide bandwidth             small bandwidth
                                  good transmission quality  unreliable channel
                                  no noise                   noise
Table 10.2: Conditions of speech 

Vocabulary choice
Within the vocabulary, words can be chosen to be acoustically very distinct or very similar. One would choose the former for an application (e.g. a set of control words), while for diagnostic purposes the latter serves very well (e.g. CVC-words, see Section 10.3.4).

Talking style 
Firstly, a distinction is made between read speech and spontaneous speech. The former is somewhat unnatural, as there are only a few circumstances in which speech approaches this quality, but it has been used in the evaluation of speech recognition systems for a long time because it is relatively easy to define and reproduce. Spontaneous speech comes in a variety of flavours, but it generally follows a much less well-defined grammar, and contains errors, corrections, mispronunciations, and stronger prosody. Secondly, the level of the speech can vary. When the level varies strongly within a short time frame (e.g. because the distance between microphone and mouth is not constant), the speech is said to have a large dynamic range. On a more global scale, the speech itself can be influenced by the speech level, i.e. the speech can range from ``whispering'' to ``shouting''.
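The short-term level variation described above can be made concrete with a small sketch. It assumes samples are floats in the range -1 to 1, uses non-overlapping frames, and takes the spread between the loudest and quietest frame as a rough stand-in for dynamic range. The function names, the 25 ms frame length, and the synthetic tone are illustrative assumptions, not part of any standard measurement procedure.

```python
import math


def frame_levels_db(samples, frame_len):
    """RMS level in dB (relative to amplitude 1.0) of consecutive,
    non-overlapping frames. Incomplete trailing frames are ignored."""
    levels = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        rms = math.sqrt(sum(x * x for x in frame) / frame_len)
        # The floor avoids log10(0) on digital silence.
        levels.append(20.0 * math.log10(max(rms, 1e-10)))
    return levels


def level_range_db(samples, frame_len):
    """Difference between the loudest and the quietest frame, in dB."""
    levels = frame_levels_db(samples, frame_len)
    return max(levels) - min(levels)


rate = 8000   # sampling rate in Hz (assumed)
frame = 200   # 25 ms at 8 kHz (assumed frame length)

# A 200 Hz tone stands in for speech (hypothetical test signal).
tone = [math.sin(2 * math.pi * 200 * t / rate) for t in range(rate)]

# Constant level versus a level that drops by a factor of 10 halfway,
# e.g. because the speaker moves away from the microphone.
steady = tone
fluctuating = tone[:rate // 2] + [0.1 * x for x in tone[rate // 2:]]

print(round(level_range_db(steady, frame), 2))
print(round(level_range_db(fluctuating, frame), 2))
```

For the steady tone the spread is essentially 0 dB, while the amplitude drop by a factor of 10 shows up as a spread of about 20 dB. Real speech contains pauses, so a practical measure would have to exclude silent frames first.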

Recording conditions
The recording conditions may vary. One of the most important quantities in this respect is the signal-to-noise ratio (SNR). Databases are often recorded ``clean'' (high SNR), and adverse conditions, such as environmental noise and crosstalk, are added to the signal at a later stage. However, for some conditions such an approach is not valid (e.g. with the Lombard effect), and the recordings have to be made under realistic conditions.
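Adding noise to a clean recording at a later stage usually comes down to scaling the noise so that the mixture reaches a chosen SNR. The following is a minimal sketch of that step, assuming the SNR is defined as 10 log10 of the ratio of mean speech power to mean noise power over the whole signal. The helper names and the synthetic signals are hypothetical; a real evaluation would use recorded speech and noise, and possibly a level measure that ignores pauses.

```python
import math
import random


def power(signal):
    """Mean power (mean of squared samples) of a sequence of floats."""
    return sum(x * x for x in signal) / len(signal)


def snr_db(speech, noise):
    """Signal-to-noise ratio in dB: 10 * log10(P_speech / P_noise)."""
    return 10.0 * math.log10(power(speech) / power(noise))


def mix_at_snr(speech, noise, target_snr_db):
    """Scale `noise` so that speech + noise has the requested SNR.

    Returns (noisy_signal, scaled_noise). Inputs must have equal length.
    """
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have the same length")
    # Required noise power follows from SNR = 10 * log10(Ps / Pn).
    wanted_noise_power = power(speech) / (10.0 ** (target_snr_db / 10.0))
    gain = math.sqrt(wanted_noise_power / power(noise))
    scaled = [gain * n for n in noise]
    noisy = [s + n for s, n in zip(speech, scaled)]
    return noisy, scaled


# Toy data: a 200 Hz tone stands in for clean speech and white Gaussian
# noise for the environment (both are assumptions for illustration).
rate = 8000
rng = random.Random(0)
clean = [math.sin(2 * math.pi * 200 * t / rate) for t in range(rate)]
environment = [rng.gauss(0.0, 1.0) for _ in range(rate)]

noisy, scaled = mix_at_snr(clean, environment, target_snr_db=10.0)
print(round(snr_db(clean, scaled), 6))
```

The printed value should match the requested 10 dB up to rounding. Note that this only models additive noise: it cannot reproduce the Lombard effect, because the speaker's own production does not change when noise is mixed in afterwards.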

Electrical characteristics 
The bandwidth is of some importance to the recognition performance. In principle, band-limited speech contains less information, and can hence make the recognition task more difficult. However, some recognition systems may limit the bandwidth to that of telephone speech on purpose, even if wide-band speech is available, because band limiting has the advantage of reducing the amount of data while keeping most of the speech information. In this way, some trivial filtering of noise outside the typical speech spectrum is obtained. Another ``electrical characteristic'' is the transmission channel quality. Obviously, non-ideal transformations of the signal, such as non-linearities, ticks, echoes, reverberations and drop-outs, will have a degrading influence on the recognition performance.
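Band limiting to telephone bandwidth can be illustrated with a simple FIR band-pass filter. The sketch below assumes the conventional 300 to 3400 Hz telephone band, a 16 kHz sampling rate, and a Hamming-windowed sinc design with 201 taps; all of these values and all function names are assumptions chosen for illustration, not a description of any particular recogniser's front end.

```python
import math


def lowpass_taps(cutoff_hz, rate_hz, num_taps):
    """Hamming-windowed sinc low-pass filter, normalised to unit gain at 0 Hz."""
    fc = cutoff_hz / rate_hz          # normalised cutoff (cycles per sample)
    mid = (num_taps - 1) / 2
    taps = []
    for n in range(num_taps):
        k = n - mid
        ideal = 2 * fc if k == 0 else math.sin(2 * math.pi * fc * k) / (math.pi * k)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (num_taps - 1))
        taps.append(ideal * window)
    total = sum(taps)
    return [t / total for t in taps]


def bandpass_taps(low_hz, high_hz, rate_hz, num_taps=201):
    """Band-pass filter built as the difference of two low-pass filters."""
    upper = lowpass_taps(high_hz, rate_hz, num_taps)
    lower = lowpass_taps(low_hz, rate_hz, num_taps)
    return [u - l for u, l in zip(upper, lower)]


def fir_filter(samples, taps):
    """Direct-form FIR filtering; output has the same length as the input."""
    out = []
    for i in range(len(samples)):
        acc = 0.0
        for j, tap in enumerate(taps):
            if i - j >= 0:
                acc += tap * samples[i - j]
        out.append(acc)
    return out


def rms(samples):
    """Root-mean-square amplitude."""
    return math.sqrt(sum(x * x for x in samples) / len(samples))


def gain_at(freq_hz, taps, rate_hz, num_samples=1600):
    """Measured amplitude gain of the filter for a pure tone at freq_hz."""
    tone = [math.sin(2 * math.pi * freq_hz * t / rate_hz) for t in range(num_samples)]
    filtered = fir_filter(tone, taps)
    skip = len(taps)                  # ignore the filter's start-up transient
    return rms(filtered[skip:]) / rms(tone[skip:])


rate = 16000                          # wide-band sampling rate (assumed)
telephone = bandpass_taps(300.0, 3400.0, rate)

print(gain_at(1000.0, telephone, rate))   # inside the telephone band
print(gain_at(6000.0, telephone, rate))   # outside the telephone band
```

A tone inside the band passes with a gain close to 1, while the 6 kHz tone is strongly attenuated, which is the "trivial filtering" of out-of-band noise mentioned above. After such filtering the signal could also be resampled to 8 kHz to obtain the reduction in data; that step is omitted here.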


