
Speech corpora for technological applications

Technological applications for which speech corpora are needed can be roughly divided into four major classes: speech synthesis, speech recognition, spoken language systems, and speaker recognition/verification. The speech corpora needed differ considerably from one application to another. For example, speech synthesis usually requires a large amount of speech data from one or two speakers, whereas speech recognition often requires a smaller amount of speech data from many speakers. The following sections discuss these four domains of speech research for technological applications and the speech corpora they need.

Speech synthesis


The seemingly most natural way to synthesise speech is to model human speech production directly by simulating lung pressure, vocal fold vibration, articulatory gestures, etc. However, the human speech production system is not completely understood, which is probably why it turns out to be extremely difficult to determine and control the details of the model parameters in computer simulations. As a result, articulatory synthesisers have only been moderately successful in generating perceptually important acoustic features. Yet modern measurement techniques have allowed the collection of substantial amounts of measurement data, most of which are now being made available to the research community (see the ESPRIT project ACCOR (ESPRIT/BRA 3279 ACCOR) and the special issue of the journal ``Language and Speech'' (1993)).

A relatively simple way to build a speech synthesiser is through concatenation of stored human speech components. In order to achieve natural coarticulation in the synthesised speech, it is necessary to include transition regions in the building blocks. Often-used transition units are diphones, which represent the transition from one phone to another. Since diphone inventories are derived directly from human utterances, diphone synthesis might be expected to be inherently natural sounding. However, this is not completely true, because the diphones have to be concatenated and in practice there will be many diphone junctions that do not fit together properly. In order to smooth these discontinuities, the waveform segments have to be converted to a convenient format, such as some form of LPC parameters, often with some inherent loss of auditory quality. Until recently it was believed that a parametric representation was mandatory to be able to change the pitch and timing of utterances without disturbing the spectral envelope pattern. Since the invention of PSOLA-like techniques, high quality pitch and time changes can be effected directly in the time domain.
For limited applications, such as train information systems, whole words and even phrases may be stored. This method of speech synthesis is being applied more and more, because cheap mass storage has become available. The quality of concatenated-word sentences is often acceptable, especially given the still less than optimal quality of the other types of synthesis.
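
As an illustration of the diphone units described above, the following minimal sketch lists the diphones that would have to be retrieved from an inventory to concatenate a given phone sequence. The function name, the `sil` silence symbol and the SAMPA-like phone labels are assumptions made for the example; they are not taken from any particular synthesiser.

```python
def phones_to_diphones(phones, silence="sil"):
    """Return the diphone units needed to concatenate a phone sequence.

    Each diphone spans the transition from one phone to the next.
    The sequence is padded with a (hypothetical) silence symbol so that
    the utterance-initial and utterance-final transitions are covered too.
    """
    padded = [silence] + list(phones) + [silence]
    return [f"{left}-{right}" for left, right in zip(padded, padded[1:])]


# Hypothetical SAMPA-like transcription of the word "speech"
print(phones_to_diphones(["s", "p", "i:", "tS"]))
# ['sil-s', 's-p', 'p-i:', 'i:-tS', 'tS-sil']
```

A sequence of N phones thus needs N+1 diphones, which is why a diphone corpus must cover (nearly) all phone-to-phone transitions of the language.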

Another important method of generating computerised speech is through synthesis by rule. The usual approach is to input a string of allophones to some form of formant synthesiser. Target formant values for each allophone are derived from human utterances and these values are stored in large tables. With an additional set of rules these target values can be adapted to account for all kinds of phonological and phonetic phenomena and to generate proper prosody.

More detailed accounts of speech synthesis systems can be found in, for instance, [Klatt (1987)] and [Holmes (1988)], and in Chapter 12.

For all types of speech synthesis systems corpora are needed to determine the model parameters. If the user wants many different types of voice, the speech corpus should contain various speakers for the extraction of speaker-specific model parameters. In particular, the user might want to be able to generate both male and female speech. Transformations to convert rule systems between male and female speech have had limited success, so it seems more convenient to include both sexes in the speech corpus. Application-specific corpora are needed to investigate issues related to prosody.

Speech recognition


There are several types of speech recognition systems, which may differ in three important ways:

  1. the recognition strategies they use,
  2. the speakers they have to recognise,
  3. the speech they have to recognise.

These different aspects will be discussed below.

Knowledge-based vs. stochastic systems


With respect to the strategies they use, speech recognition systems can be roughly divided into two classes: knowledge-based systems and stochastic systems. All state-of-the-art systems belong to the second category. In the knowledge-based approach an attempt was made to specify explicit acoustic-phonetic rules that are robust enough to allow recognition of linguistically meaningful units and that ignore irrelevant variation in these units. Stochastic systems, such as those based on Hidden Markov Models (HMMs) or neural networks, do not use explicit rules for speech recognition. Instead, they rely on stochastic models which are estimated or trained with (very) large amounts of speech, using some statistical optimisation procedure (e.g. the Expectation-Maximisation (EM) algorithm, of which the Baum-Welch algorithm for HMMs is a special case).
Higher level linguistic knowledge can be used to constrain the recognition hypotheses generated at the acoustic-phonetic level. Higher level knowledge can be represented by knowledge-based explicit rules, for example syntactic constraints on word order. More often it is represented by stochastic language models, for example bigrams or trigrams that reflect the likelihood of a sequence of two or three words, respectively (see also Chapter 7). Recently, promising work on enhancing HMMs with morphological and phonological structure has been conducted, pointing to the possibility of convergence between knowledge-based and stochastic approaches.
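
To make the notion of a bigram language model concrete, here is a minimal sketch that estimates the probability of a word given the previous word by simple relative frequency. The three-sentence corpus, the `<s>` and `</s>` boundary symbols and the function name are assumptions for illustration; a practical language model would be trained on a far larger corpus and would use smoothing for unseen word pairs.

```python
from collections import Counter


def train_bigram_model(sentences):
    """Estimate P(word | previous word) by relative frequency (no smoothing)."""
    context_counts = Counter()
    bigram_counts = Counter()
    for sentence in sentences:
        # Hypothetical sentence-boundary symbols
        words = ["<s>"] + sentence.split() + ["</s>"]
        for prev, word in zip(words, words[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, word)] += 1
    return {
        pair: count / context_counts[pair[0]]
        for pair, count in bigram_counts.items()
    }


# Toy corpus, invented for the example
corpus = ["show me flights", "show me fares", "list flights"]
model = train_bigram_model(corpus)
print(model[("show", "me")])     # 1.0
print(model[("me", "flights")])  # 0.5
```

During recognition such probabilities can be used to prefer word sequences that are likely in the application domain over acoustically similar but unlikely ones.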


Speaker-independent vs. speaker-dependent systems


Speech recognition systems can be either speaker-dependent or speaker-independent. In the former case the recognition system is designed to recognise the speech of just a single person, and in the latter case the recognition system should be able to recognise the speech of a variety of speakers. All other things being equal, the performance of speaker-independent systems is likely to be worse than that of speaker-dependent systems, because speaker-independent systems have to deal with a considerable amount of inter-speaker variability. It is often sensible to train separate recognition models for specific subgroups of speakers, such as men and women, or speakers with different dialects [Van Compernolle et al. (1991)].

Some systems can to some extent adapt to new speakers by adjusting the parameters of their models. This can be done in a separate training session with a set of predetermined utterances of the new speaker, or it can be done on-line as the recognition of the new speaker's utterances gradually proceeds.

Most recognition systems are very sensitive to the recording environment. In the past, speakers employed to train and develop a system were often recorded under ``laboratory'' conditions, for instance in an anechoic room. It appears that the performance of speech recognisers which are trained with such high quality recordings degrades severely if they are tested with some form of ``noisy'' speech [Gong (1995)]. The use of different microphones during training sessions and test sessions also has a considerable impact on recognition performance.

Isolated words vs. continuous speech


The third main distinction between speech recognition systems is based on the type of speech they have to recognise. The system can be designed for isolated word recognition or for continuous speech recognition. In the latter case word boundaries have to be established, which can be extremely difficult. Nevertheless, continuous speech recognition systems are nowadays reasonably successful, although their performance of course strongly depends on the size of their vocabulary.
Word spotting can be regarded as a special form of isolated word recognition: the recogniser is ``listening'' for a limited number of words. These words may come embedded in background noise, possibly consisting of speech of competing speakers, or may come from the target speaker who is producing the word embedded in extraneous speech.

Corpora for speech recognition research

In general, two similar speech corpora are needed for the development of speech recognition systems: one for the training phase and one for the testing phase. The training material is used to set the model parameters of the recognition system. The testing material is used to determine the performance of the trained system. It is necessary to use different speech data for training and testing in order to get a fair evaluation of the system performance.
For speaker-dependent systems, obviously the same speaker is used for the training and testing phase. For speaker-independent systems, the corpora for training and testing could contain the same speakers (but different speech data), or they could contain different speakers to determine the system's robustness for new speakers.
When a system is designed for isolated word recognition, it should be trained and tested with isolated words. Similarly, when a system is designed for telephone speech, it should be trained and tested with telephone speech. The design of corpora for speech recognition research thus strongly depends on the type of recognition system that one wants to develop. Several large corpora for isolated word recognition (e.g. TIDIGITS) and continuous speech recognition (e.g. Resource Management, ATIS, BREF, EUROM, TIMIT and Wall Street Journal) have been collected and made available (cf. Appendix L).
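
The requirement that training and test material differ, and the option of testing a speaker-independent system on new speakers, can be sketched as follows. The sketch assumes a corpus represented as (speaker identifier, utterance identifier) pairs; this representation, the function name and the 20% test fraction are assumptions for the example, not a prescribed procedure.

```python
import random


def split_by_speaker(utterances, test_fraction=0.2, seed=0):
    """Split (speaker_id, utterance_id) pairs so that no speaker occurs in both sets."""
    speakers = sorted({speaker for speaker, _ in utterances})
    rng = random.Random(seed)  # fixed seed so that the split is reproducible
    rng.shuffle(speakers)
    n_test = max(1, round(len(speakers) * test_fraction))
    test_speakers = set(speakers[:n_test])
    train = [u for u in utterances if u[0] not in test_speakers]
    test = [u for u in utterances if u[0] in test_speakers]
    return train, test


# Hypothetical corpus: 10 speakers with 3 utterances each
utterances = [(f"spk{s:02d}", f"utt{u}") for s in range(10) for u in range(3)]
train, test = split_by_speaker(utterances)
print(len(train), len(test))  # 24 6
```

Splitting by speaker rather than by utterance measures robustness to new speakers; splitting by utterance within each speaker would correspond to the other option mentioned above (same speakers, different speech data).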

Spoken language systems


Speech synthesis and speech recognition systems can be combined with natural language processing and dialogue management systems to form a Spoken Language System (SLS) that allows interactive communication between man and machine. A spoken language system should be able to recognise a person's speech, interpret the sequence of words to obtain a meaning in terms of the application, and provide an appropriate response to the user.
Apart from the speech corpora needed to design the speech synthesis and speech recognition parts of the spoken language system, speech corpora are also needed to model relevant features of spontaneous speech (pauses, hesitations, turn-taking behaviour, etc.) and to model dialogue structures for a proper man-machine interaction.
An excellent overview of spoken language systems and their problems is given in [Cole (1995)]. The ATIS corpora mentioned above exemplify the type of corpus used for the development of SLS.

Speaker recognition/verification


The task of automatic speaker recognition is to determine the identity of a speaker by machine. Speaker recognition (usually called speaker identification) can be divided into two categories: closed-set and open-set problems. The closed-set problem is to identify a speaker from a group of known speakers, whereas the open-set problem is to decide whether a speaker belongs to a group of known speakers. Speaker verification is a special case of the open-set problem and refers to the task of deciding whether a speaker is who he claims to be.
Speaker recognition can be text-dependent or it can be text-independent. In the former case the text in both the training phase and the testing phase is known, i.e. the system employs a sort of password procedure. One popular example of password-like phrases is the so-called ``combination lock'' phrase, consisting of a sequence of numbers (mostly between 0 and 99) or digits (between 0 and 9). LDC provides a corpus for training and testing speaker verification systems based on combination lock phrases consisting of three numbers between 11 and 99 (e.g. 26-81-57) [Campbell (1995)].
Knowledge of the text enables the use of systems which combine speech and speaker recognition. In other words, before granting access to data or premises the speaker verification system can request that the claimant says the combination lock; the system then checks both the correctness of the password and the voice characteristics of the speaker. However, most password systems are susceptible to fraud using recordings of the passwords spoken by a customer. One way of making fraud with recordings much more difficult is the use of text-prompted techniques, whereby the customer is asked to repeat one or more sentences randomly drawn from a very large set of possible sentences. Again, the system checks both the contents of the reply and the voice characteristics of the speaker. Since surreptitious recording of millions of specific utterances is impossible, text-prompted speaker verification systems should offer a very high level of security and immunity to fraud.
In the case of text-independent speaker verification the acceptance procedure should work for any text in both the training and the testing phase. Since this approach is especially susceptible to fraud by playing recordings of the customer, text-independent verification technology is best combined with other fraud prevention techniques.
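
The difference between closed-set identification, open-set identification and verification can be sketched as three decision rules over per-speaker match scores. The sketch assumes that some speaker model has already produced one score per known speaker (higher meaning a better match); the speaker names, score values and thresholds below are invented for the example, and the open-set rule is shown in the common variant that returns the best-matching speaker when the test speaker is judged to belong to the known group.

```python
def identify_closed_set(scores):
    """Closed set: return the known speaker whose model scores highest."""
    return max(scores, key=scores.get)


def identify_open_set(scores, threshold):
    """Open set: reject (return None) if even the best score is below the threshold."""
    best = identify_closed_set(scores)
    return best if scores[best] >= threshold else None


def verify(scores, claimed_identity, threshold):
    """Verification: accept or reject one claimed identity."""
    return scores.get(claimed_identity, float("-inf")) >= threshold


# Hypothetical log-likelihood scores of one test utterance against three speaker models
scores = {"anna": -41.2, "bert": -35.7, "carla": -52.9}
print(identify_closed_set(scores))        # bert
print(identify_open_set(scores, -30.0))   # None
print(verify(scores, "anna", -40.0))      # False
```

In practice the threshold would be tuned on held-out data to balance false acceptances against false rejections.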

There are various application areas for speaker recognition, for instance helping to identify suspects in forensic cases, or controlling access to buildings or bank accounts. As with speech recognition, the corpora needed for speaker recognition or speaker verification are dependent on the specific application. If, for instance, the technology is based on combination lock phrases, a training database should obviously contain a large number of connected number or digit expressions. For the development of text-independent speaker technology there are no strict requirements as to what the training speakers say.
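
As a small illustration of combination lock phrases, the sketch below draws random prompts made of three numbers between 11 and 99 and counts how many distinct phrases exist. It assumes that both bounds are inclusive and that numbers may repeat within a phrase; neither assumption is stated in the text above, and the function names are invented for the example.

```python
import random

# Three numbers between 11 and 99; inclusive bounds are an assumption
LOW, HIGH, LENGTH = 11, 99, 3


def phrase_space_size(low=LOW, high=HIGH, length=LENGTH):
    """Number of distinct combination lock phrases (repetition allowed)."""
    return (high - low + 1) ** length


def random_combination_lock(rng, low=LOW, high=HIGH, length=LENGTH):
    """Draw one prompt in the form '26-81-57'."""
    return "-".join(str(rng.randint(low, high)) for _ in range(length))


rng = random.Random(0)
print(phrase_space_size())           # 704969
print(random_combination_lock(rng))  # one randomly drawn phrase
```

Under these assumptions there are 89 possible numbers and hence 89^3 = 704969 possible phrases, which gives an idea of how many connected number expressions a prompting scheme can draw on.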

Corpora for the development and testing of speaker recognition systems differ in a crucial aspect from corpora collected to support speech recognition. For speaker recognition research it is absolutely essential that the corpus contains multiple recordings of the same speaker, made under different conditions. There is a range of conditions that should ideally be sampled, in order to be able to build a model of the natural variation in a person's speech due to realistic variations in the conditions under which the speech is produced.
Conditions to be sampled and to be represented in a corpus can be divided into two broad groups, viz. channel conditions, and physiological and psychological conditions of the speaker.

Channel conditions

The details of the acoustic speech patterns depend heavily on the acoustic background in which the speech was produced and on the response of the transmission network. A corpus for speaker recognition research should at least include multiple recordings of the speakers made with different microphones or telephone handsets. The transmission differences between carbon button and electret microphones in telephone handsets in particular are known to affect the performance of speaker recognition systems. In this context, attention should also be paid to the different transmission characteristics of the fixed, landline telephone network and the rapidly growing cellular networks.
In actual practice, it is much more difficult to obtain a representative sampling of acoustic backgrounds. Yet one must be aware that loud background noise affects the speech in two ways, one of which is highly non-linear and therefore cannot easily be compensated for: in addition to decreasing the signal-to-noise ratio, background noise is likely to cause the speaker to change his speaking behaviour. The best known of these changes is the Lombard effect: in a high-noise environment speakers tend to raise their voice level, and with it change their phonation style and probably also their articulation style.
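
The signal-to-noise ratio mentioned above is commonly expressed in decibels as ten times the base-10 logarithm of the ratio of speech power to noise power. The following minimal sketch computes it from two sample sequences; the synthetic sinusoids stand in for a real speech segment and a separately recorded noise segment, and the 8000 Hz sampling rate is an assumption for the example.

```python
import math


def snr_db(speech, noise):
    """Signal-to-noise ratio in dB from speech and noise sample sequences."""
    speech_power = sum(x * x for x in speech) / len(speech)
    noise_power = sum(x * x for x in noise) / len(noise)
    return 10.0 * math.log10(speech_power / noise_power)


RATE = 8000  # assumed sampling rate in Hz
# Synthetic stand-ins: a unit-amplitude tone and a tone at one tenth of that amplitude
speech = [math.sin(2 * math.pi * 100 * n / RATE) for n in range(RATE)]
noise = [0.1 * math.sin(2 * math.pi * 50 * n / RATE) for n in range(RATE)]
print(round(snr_db(speech, noise), 1))  # 20.0
```

Such a measure only captures the additive, linear part of the degradation; the change in speaking behaviour caused by the noise (the Lombard effect) is not reflected in it.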

Psychological and physiological conditions

The type of speaker variation addressed under this heading is also very difficult to sample. Given the practical limitations of a corpus collection project it is hardly feasible to require that each speaker be recorded in perfect health as well as when having a cold, the flu, or any other mild disease.
One simple approximation to sampling within-speaker variation that is feasible from a practical point of view is to record speakers at different times of the day (early morning, noon, late night), and on different days of the week. In any case, the recordings should be spread over a period of at least a couple of months. One might also consider recording speakers in completely sober conditions and after the consumption of a reasonable amount of intoxicating drugs.

More detailed accounts of speaker recognition can be found in [O'Shaughnessy (1986)] and [Gish & Schmidt (1994)].

Developing and testing speaker recognition systems with a database containing only a single recording session for the speakers should be avoided, because such databases cannot possibly account for even the slightest degree of within-speaker variation. Results reported on such databases (e.g. TIMIT) grossly overestimate the performance of the system being tested.


EAGLES SWLG SoftEdition, May 1997.