Technological applications for which speech corpora are needed can be roughly divided into four major classes: speech synthesis, speech recognition, spoken language systems, and speaker recognition/verification. Depending on the specific application, the speech corpora which are needed are very diverse. For example, speech synthesis usually requires a large amount of speech data from one or two speakers, whereas speech recognition often requires a smaller amount of speech data from many speakers. In the following sections the four domains of speech research for technological applications and the speech corpora they need are discussed.
The seemingly most natural way to synthesise speech is to model human speech production directly by simulating lung pressure, vocal fold vibration, articulatory gestures, etc. However, the human system is not completely understood, which is probably why it turns out to be extremely difficult to determine and control the details of the model parameters in computer simulations. As a consequence, articulatory synthesisers have only been moderately successful in generating perceptually important acoustic features. Yet, modern measurement techniques have allowed the collection of substantial amounts of measurement data. Most of these data are now being made available to the research community (see the ESPRIT project ACCOR (ESPRIT/BRA 3279 ACCOR) and the special issue of the journal ``Language and Speech'' (1993)).
A relatively simple way to build a speech synthesiser is through concatenation
of stored human speech components. In order to achieve natural coarticulation
in the synthesised speech, it is necessary to include transition regions in
the building blocks. Often-used transition units are diphones,
which
represent the transition from one phone to another. Since diphone inventories
are derived directly from human utterances, diphone synthesis might be
expected to be inherently natural sounding. However, this is not completely
true, because the diphones have to be concatenated and in practice there will
be many diphone junctions that do not fit properly together. In order to be
able to smooth these discontinuities, the waveform segments have to be converted
to a convenient format, such as some form of LPC parameters, often with some
inherent loss of auditory quality. Until recently it was believed that a
parametric representation was mandatory to be able to change the pitch and
timing of utterances without disturbing the spectral envelope pattern. Since
the invention of PSOLA-like techniques, high quality pitch and time changes can
be effected directly in the time domain.
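To make this concrete, the sketch below illustrates the core of a PSOLA-like manipulation: pitch-synchronous, Hann-windowed segments of roughly two pitch periods are extracted around known pitch marks and overlap-added at a rescaled spacing. It is only a minimal illustration, assuming reliable pitch marks are already available and ignoring amplitude normalisation and time-scale modification; the function name and interface are invented for this example.

\begin{verbatim}
import numpy as np

def psola_pitch_shift(signal, pitch_marks, factor):
    """Very simplified PSOLA-style pitch modification (illustration only)."""
    signal = np.asarray(signal, dtype=float)
    marks = np.asarray(pitch_marks)
    periods = np.diff(marks)
    out = np.zeros_like(signal)

    # Synthesis pitch marks: same time span, but local spacing divided by
    # `factor`, so factor > 1 packs the glottal pulses closer (higher pitch).
    synth_marks, t = [], float(marks[0])
    while t < marks[-1]:
        synth_marks.append(t)
        i = min(max(np.searchsorted(marks, t), 1), len(marks) - 1)
        t += (marks[i] - marks[i - 1]) / factor

    for t in synth_marks:
        # Take the analysis segment (two pitch periods, Hann-windowed)
        # centred on the analysis mark closest to this synthesis instant.
        i = int(np.argmin(np.abs(marks - t)))
        T = int(periods[min(i, len(periods) - 1)])
        start, end = marks[i] - T, marks[i] + T
        if start < 0 or end > len(signal):
            continue
        segment = signal[start:end] * np.hanning(end - start)
        pos = int(round(t)) - T
        if 0 <= pos and pos + len(segment) <= len(out):
            out[pos:pos + len(segment)] += segment
    return out
\end{verbatim}

In a real system the pitch marks would be estimated from the speech signal itself, and time-scale modification would be handled in the same overlap-add framework.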
For limited applications, such as train information systems, whole words and even phrases may be stored.
Lately, this method of speech synthesis is being applied more and
more because cheap mass storage has become available.
The quality of concatenated-word sentences is often acceptable, especially given that
the quality of the other types of synthesis is still not optimal.
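As a minimal illustration of this approach, the sketch below concatenates pre-recorded word waveforms into a single utterance, inserting short pauses between words. The `soundfile` package, the directory layout and the file names are assumptions made purely for the example.

\begin{verbatim}
import numpy as np
import soundfile as sf   # third-party WAV reader, assumed to be installed

def announce(words, word_dir="prompts", pause_ms=50, rate=16000):
    """Concatenate stored word recordings into one utterance (sketch)."""
    pause = np.zeros(int(rate * pause_ms / 1000.0))
    pieces = []
    for w in words:
        audio, sr = sf.read(f"{word_dir}/{w}.wav")   # hypothetical layout; mono assumed
        assert sr == rate, "all stored prompts must share one sampling rate"
        pieces.extend([audio, pause])
    return np.concatenate(pieces)

# e.g. announce(["the", "train", "to", "utrecht", "leaves", "at", "ten", "fifteen"])
\end{verbatim}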
Another important method of generating computerised speech is through synthesis by rule. The usual approach is to input a string of allophones to some form of formant synthesiser. Target formant values for each allophone are derived from human utterances and these values are stored in large tables. With an additional set of rules these target values can be adapted to account for all kinds of phonological and phonetic phenomena and to generate proper prosody.
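A toy version of such a target table and the interpolation between successive targets might look as follows; the formant values are rough, textbook-style numbers chosen only for illustration, and real rule systems contain many more allophones, parameters and context-dependent rules.

\begin{verbatim}
# Illustrative F1/F2/F3 targets in Hz; the numbers are rough examples only.
TARGETS = {
    "a": (700, 1200, 2600),
    "i": (300, 2300, 3000),
    "u": (300,  800, 2200),
}

def formant_track(allophones, frames_per_phone=10):
    """Linearly interpolate formant targets between successive allophones."""
    track = []
    for left, right in zip(allophones[:-1], allophones[1:]):
        f_l, f_r = TARGETS[left], TARGETS[right]
        for k in range(frames_per_phone):
            alpha = k / float(frames_per_phone)
            track.append(tuple((1 - alpha) * a + alpha * b
                               for a, b in zip(f_l, f_r)))
    return track

# e.g. formant_track(["a", "i", "u"]) gives a frame-by-frame formant contour
# that a formant synthesiser could turn into a waveform.
\end{verbatim}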
More detailed accounts of speech synthesis systems can be found in, for instance, [Klatt (1987)] and [Holmes (1988)], and in Chapter 12.
For all types of speech synthesis systems corpora are needed to determine the model parameters. If the user wants many different types of voice, the speech corpus should contain various speakers for the extraction of speaker-specific model parameters. In particular, the user might want to be able to generate both male and female speech. Transformations to convert rule systems between male and female speech have had limited success, so it seems more convenient to include both sexes in the speech corpus. Application specific corpora are needed to investigate issues related to prosody.
There are several types of speech recognition systems, which may differ in three important ways: in the recognition strategy they use, in whether they are speaker-dependent or speaker-independent, and in the type of speech they have to recognise.
These different aspects will be discussed below.
With respect to the strategies they use, speech
recognition systems can be roughly divided into two
classes: knowledge-based systems and stochastic systems. All
state-of-the-art systems belong to the second category. In the knowledge-based approach an attempt was made to
specify explicit acoustic-phonetic rules that are robust enough to allow
recognition of linguistically meaningful units and that
ignore irrelevant variation in these units. Stochastic
systems, such as Hidden Markov Models (HMMs)
or neural networks, do not use explicit rules for speech
recognition. Instead, they rely on stochastic models which are
estimated or trained with (very) large amounts of speech, using some
statistical optimisation procedure (e.g. the Expectation-Maximisation or the
Baum-Welch algorithm).
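As a sketch of this training regime, the fragment below fits a small Gaussian HMM on a set of feature sequences with the `hmmlearn` package (an assumption of this example; any HMM toolkit with Baum-Welch re-estimation would serve), using random numbers in place of real cepstral features.

\begin{verbatim}
import numpy as np
from hmmlearn import hmm   # third-party package, assumed to be installed

# Stand-in for real training data: one 13-dimensional feature sequence
# (e.g. MFCC frames) per training utterance of the unit being modelled.
rng = np.random.default_rng(0)
train_seqs = [rng.normal(size=(rng.integers(40, 80), 13)) for _ in range(20)]

X = np.concatenate(train_seqs)            # all frames stacked
lengths = [len(s) for s in train_seqs]    # frames per utterance

# fit() runs Baum-Welch (EM) re-estimation of the transition and emission
# parameters of a 5-state HMM with diagonal-covariance Gaussian emissions.
model = hmm.GaussianHMM(n_components=5, covariance_type="diag", n_iter=20)
model.fit(X, lengths)

# Recognition then amounts to scoring an unknown utterance against every
# trained model and choosing the one with the highest log-likelihood.
print(model.score(rng.normal(size=(60, 13))))
\end{verbatim}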
Higher level linguistic knowledge can be used to constrain the recognition
hypotheses generated at the acoustic-phonetic level. Higher level knowledge
can be represented by knowledge-based explicit rules, for example syntactic
constraints on word order. More often it is represented by stochastic language
models, for example bigrams or trigrams
that reflect the
likelihood of a sequence of two or three words, respectively (see also
Chapter 7). Recently, promising work on enhancing HMMs with
morphological and phonological structure has been conducted, pointing
to the possibility of convergence between knowledge-based and
stochastic approaches.
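The sketch below shows how such a bigram model can be estimated from a (tiny) training corpus with simple add-one smoothing; the sentences and the smoothing scheme are chosen only for illustration.

\begin{verbatim}
from collections import Counter

def train_bigram_lm(sentences):
    """Maximum-likelihood bigram estimates with add-one smoothing (sketch)."""
    unigrams, bigrams, vocab = Counter(), Counter(), set()
    for sent in sentences:
        words = ["<s>"] + sent.split() + ["</s>"]
        vocab.update(words)
        unigrams.update(words[:-1])
        bigrams.update(zip(words[:-1], words[1:]))
    V = len(vocab)
    def prob(w1, w2):
        # likelihood of word w2 following word w1
        return (bigrams[(w1, w2)] + 1) / (unigrams[w1] + V)
    return prob

p = train_bigram_lm(["show me flights to boston",
                     "show me fares to denver"])
print(p("show", "me"), p("me", "flights"))
\end{verbatim}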
Speech recognition systems can be either speaker-dependent or speaker-independent. In the former case the recognition system is designed to recognise the speech of just a single person, and in the latter case the recognition system should be able to recognise the speech of a variety of speakers. All other things being equal, the performance of speaker-independent systems is likely to be worse than that of speaker-dependent systems, because speaker-independent systems have to deal with a considerable amount of inter-speaker variability. It is often sensible to train separate recognition models for specific subgroups of speakers, such as men and women, or speakers with different dialects [Van Compernolle et al. (1991)].
Some systems can to some extent adapt to new speakers by adjusting the parameters of their models. This can be done in a separate training session with a set of predetermined utterances of the new speaker, or it can be done on-line as the recognition of the new speaker's utterances gradually proceeds.
Most recognition systems are very sensitive to the recording environment. In the past, speakers employed to train and develop a system were often recorded under ``laboratory'' conditions, for instance in an anechoic room. It appears that the performance of speech recognisers which are trained with such high quality recordings severely degrades if they are tested with some form of ``noisy'' speech [Gong (1995)]. The use of different microphones during training sessions and test sessions also has a considerable impact on recognition performance.
The third main distinction between speech
recognition systems is based on the type of speech they
have to recognise. The system can be designed for
isolated word recognition or
for continuous speech recognition. In the latter case word boundaries
have to be established, which can be extremely difficult.
Nevertheless, continuous speech recognition systems
are nowadays reasonably successful, although their
performance of course strongly depends on the size of
their vocabulary.
Word spotting can be regarded as a special form
of isolated word recognition: the recogniser is ``listening'' for a limited number of words. These words may come
embedded in background noise, possibly consisting of
speech of competing speakers, or may come from the target speaker who is producing
the word embedded in extraneous speech.
In general, two similar speech corpora are needed for the development of speech
recognition systems: one for the training phase
and one for the testing phase.
The training material is used to set the model parameters of the
recognition system. The testing
material is used to determine the performance
of the trained system. It is necessary to use different speech data for
training and testing in order to get a fair evaluation of the system
performance.
For speaker-dependent systems, obviously the same speaker is used for the
training and testing phase. For speaker-independent systems,
the corpora for training
and testing could
contain the same speakers (but different speech data), or they could contain
different speakers to determine the system's robustness for new speakers.
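A simple way to set up such a speaker-disjoint division is sketched below, assuming the corpus is available as a list of (speaker, utterance) pairs; the data layout and the 20% test fraction are arbitrary choices for the example.

\begin{verbatim}
import random

def split_by_speaker(utterances, test_fraction=0.2, seed=0):
    """Split a corpus so that training and test speakers do not overlap (sketch).

    `utterances` is assumed to be a list of (speaker_id, utterance) pairs.
    """
    speakers = sorted({spk for spk, _ in utterances})
    random.Random(seed).shuffle(speakers)
    n_test = max(1, int(len(speakers) * test_fraction))
    test_speakers = set(speakers[:n_test])
    train = [u for u in utterances if u[0] not in test_speakers]
    test = [u for u in utterances if u[0] in test_speakers]
    return train, test
\end{verbatim}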
When a system is designed for isolated word recognition, it should be trained
and tested with isolated words. Similarly, when a system is designed for
telephone speech, it should be trained and tested with telephone speech. The
design of corpora for speech recognition research thus strongly depends on the
type of recognition system that one wants to develop. Several large corpora
for isolated word (e.g. TIDIGITS) and continuous speech recognition (e.g.
Resource Management, ATIS, BREF, EUROM, TIMIT and Wall Street Journal) have
been collected and made available (cf. Appendix L).
Speech synthesis and speech recognition systems can be combined with natural
language processing and Dialogue Management systems to form a Spoken Language
System (SLS) that allows interactive communication between man and
machine. A spoken language system should be able
to recognise a person's speech, interpret the sequence of words to obtain a
meaning in terms of the application, and provide an appropriate response
to the user.
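Schematically, such a system is a chain of components, as in the sketch below; the four component functions are placeholders that a real system would supply, not parts of any existing library.

\begin{verbatim}
def spoken_language_system(audio, recognise, interpret, respond, synthesise):
    """Minimal SLS control loop (sketch); all components are caller-supplied."""
    words = recognise(audio)       # speech recognition: audio -> word sequence
    meaning = interpret(words)     # language understanding: words -> meaning
    answer = respond(meaning)      # dialogue management / application logic
    return synthesise(answer)      # speech synthesis: text -> audio response
\end{verbatim}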
Apart from speech corpora needed to design the speech synthesis and the
speech recognition part of the spoken language system,
speech corpora are also needed to model relevant features of spontaneous
speech (pauses, hesitations, turn-taking
behaviour, etc.) and to model
dialogue structures for a proper man-machine interaction.
An excellent overview of spoken language systems and their problems is given
in [Cole (1995)]. The ATIS corpora mentioned above exemplify the type of
corpus used for the development of SLS.
The task of automatic speaker recognition is to determine the identity of a
speaker by machine. Speaker recognition (usually called speaker
identification) can be divided into two
categories: closed-set and open-set problems. The closed-set
problem is to identify a speaker from a group of known speakers, whereas the
open-set problem is to decide whether a speaker belongs to a group of known
speakers. Speaker verification is a special case of the open-set problem
and refers to the task of deciding whether a speaker is who he claims
to be.
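The decision logic that separates these tasks can be summarised in a few lines, assuming that per-speaker match scores (for example log-likelihoods under speaker models) are already available; the scores and threshold below are invented numbers.

\begin{verbatim}
def identify_closed_set(scores):
    """Closed-set identification: pick the best-matching enrolled speaker."""
    return max(scores, key=scores.get)

def identify_open_set(scores, threshold):
    """Open-set identification: best match, rejected if no score is high enough."""
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else None

def verify(scores, claimed_id, threshold):
    """Verification: accept or reject a claimed identity."""
    return scores[claimed_id] >= threshold

# `scores` maps enrolled speaker IDs to match scores for one test utterance.
scores = {"spk01": -410.2, "spk02": -388.7, "spk03": -402.5}
print(identify_closed_set(scores), identify_open_set(scores, -380.0),
      verify(scores, "spk02", -395.0))
\end{verbatim}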
Speaker recognition can be text-dependent or it can be
text-independent. In the former case the text in both the
training phase and the testing phase is known, i.e. the system employs a sort of password
procedure. Popular examples of password-like phrases are the so-called ``combination lock'' phrases,
consisting of sequences of numbers (mostly between 0 and 99) or digits (between 0 and 9). LDC provides
a corpus for training and testing speaker verification systems based on combination lock phrases
consisting of three numbers between 11 and 99 (e.g. 26-81-57) [Campbell (1995)].
Knowledge of the text enables the use of systems which combine
speech and speaker recognition. In other words, before granting access to data or premises the speaker
verification system can request that the claimant says the combination lock; the system then checks both
the correctness of the password and the voice characteristics of the speaker. However, most password
systems are susceptible to
fraud using recordings of the passwords spoken by a customer. One way of
making fraud with recordings much more difficult is by the use of text-prompted
techniques, whereby the customer is asked to repeat one or more
sentences randomly drawn from a very large set of possible sentences. Again, the system checks both the
contents of the reply and the voice characteristics of the speaker. Since
surreptitious recording of millions
of specific utterances is impossible, text-prompted speaker
verification systems should offer a very high
level of security and immunity to fraud.
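A sketch of this double check is given below; drawing prompts, recognising the reply and scoring the voice are all represented by placeholders here, and the prompt template, threshold and scores are invented for the example.

\begin{verbatim}
import random

def draw_prompt(rng=random):
    """Draw one prompt from a very large space of candidate sentences."""
    numbers = [rng.randint(11, 99) for _ in range(3)]
    return "please read the numbers " + " ".join(str(n) for n in numbers)

def accept(reply_text, speaker_score, prompt, threshold):
    """Accept only if the prompted numbers were repeated AND the voice matches."""
    prompted = prompt.split()[-3:]
    content_ok = all(n in reply_text.split() for n in prompted)
    return content_ok and speaker_score >= threshold

prompt = draw_prompt()
reply = "uh " + " ".join(prompt.split()[-3:])   # claimant repeats the numbers
print(prompt, accept(reply, speaker_score=-390.0, prompt=prompt, threshold=-395.0))
\end{verbatim}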
In the case of text-independent speaker verification the acceptance
procedure should work for any text in both the training and the testing phase. Since this approach is
especially susceptible to fraud by playing recordings of the customer, text-independent verification
technology is best combined with other fraud prevention techniques.
There are various application areas for speaker recognition, for instance helping to identify suspects in forensic cases, or controlling access to buildings or bank accounts. As with speech recognition, the corpora needed for speaker recognition or speaker verification are dependent on the specific application. If, for instance, the technology is based on combination lock phrases, a training database should obviously contain a large number of connected number or digit expressions. For the development of text-independent speaker technology there are no strict requirements as to what the training speakers say.
Corpora for the development and testing of speaker recognition
systems
differ in a crucial aspect from corpora collected to support speech
recognition. For speaker recognition research it is absolutely essential
that the corpus contains multiple recordings of the same speaker, made under
different conditions. There is a range of conditions that should ideally be
sampled, in order to be able to build a model of the natural variation in a
person's speech due to realistic variations in the conditions under which the
speech is produced.
Conditions to be sampled and to be represented in a corpus can be divided into
two broad groups, viz. channel conditions, and
physiological and psychological conditions of the speaker.
The details of the acoustic speech patterns depend heavily on the acoustic
background in which the speech was produced and on the response of the
transmission network. A corpus for speaker recognition research should at
least include multiple recordings of the speakers made with different
microphones or
telephone handsets. Especially the transmission differences between carbon
button and electret microphones in telephone handsets are known to affect the
performance of speaker recognition systems. In this context, attention should
also be paid to the different transmission characteristics of the fixed,
landline telephone network and the rapidly growing cellular networks.
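For development purposes, the difference between a wideband recording and a landline telephone channel is sometimes approximated by simple band-limiting, as in the sketch below (which uses `scipy` and ignores handset, codec and network effects).

\begin{verbatim}
from scipy.signal import butter, lfilter   # scipy assumed to be available

def telephone_channel(speech, fs=16000, low=300.0, high=3400.0):
    """Crude landline-telephone simulation: band-limit to roughly 300-3400 Hz."""
    b, a = butter(4, [low, high], btype="bandpass", fs=fs)
    return lfilter(b, a, speech)

# e.g. degraded = telephone_channel(clean_waveform, fs=16000)
\end{verbatim}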
In actual practice, it is much more difficult to obtain a representative
sampling of acoustic backgrounds. Yet one must be aware that loud background
noise affects the speech in two ways, one of which is highly non-linear and
therefore cannot easily be compensated for: in addition to decreasing the
signal-to-noise ratio, background noise is likely to cause the speaker to change
his speaking behaviour. The best known of these effects is the
Lombard effect: in a high noise environment speakers tend to raise their voice
level, and with it change their phonation style and probably also their
articulation style.
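The linear part of the problem, i.e. the reduced signal-to-noise ratio, can at least be simulated by mixing recorded noise into clean speech at a chosen SNR, as sketched below; the Lombard-style changes in speaking behaviour cannot be generated this way.

\begin{verbatim}
import numpy as np

def add_noise(speech, noise, snr_db):
    """Mix noise into speech at a given signal-to-noise ratio in dB (sketch)."""
    speech = np.asarray(speech, dtype=float)
    noise = np.resize(np.asarray(noise, dtype=float), speech.shape)
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    target_p_noise = p_speech / (10.0 ** (snr_db / 10.0))
    return speech + noise * np.sqrt(target_p_noise / p_noise)
\end{verbatim}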
Speaker variation due to the physiological and psychological condition of the speaker is also very
difficult to sample. Given the practical limitations of a corpus collection
project it is hardly feasible to require that each speaker be recorded in
perfect health conditions, as well as when having a cold, the flu, or any other
mild disease.
One simple approximation to sampling within speaker variation that is feasible
from a practical point of view is to record speakers at different times of the
day (early morning, noon, late night), and on different days of the week. In
any case, the period over which the recordings are extended should span at least
a couple of months. One might also consider recording speakers in completely
sober conditions and after the consumption of a reasonable amount of
intoxicating drugs.
More detailed accounts of speaker recognition can be found in [O'Shaughnessy (1986)] and [Gish & Schmidt (1994)].
Developing and testing speaker recognition systems with a database containing only a single recording session for the speakers should be avoided, because such databases cannot possibly account for even the slightest degree of within-speaker variation. Results reported on such databases (e.g. TIMIT) grossly overestimate the performance of the system being tested.
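A corpus intended for speaker recognition can be screened for this problem with a simple metadata check like the sketch below; the data layout and the thresholds (at least two sessions, spanning at least two months) are illustrative assumptions.

\begin{verbatim}
from datetime import date

def check_sessions(recordings, min_sessions=2, min_span_days=60):
    """Flag speakers whose recordings cannot sample within-speaker variation.

    `recordings` is assumed to be a list of (speaker_id, recording_date) pairs,
    with dates given as datetime.date objects.
    """
    by_speaker = {}
    for spk, day in recordings:
        by_speaker.setdefault(spk, []).append(day)
    problems = []
    for spk, days in sorted(by_speaker.items()):
        span = (max(days) - min(days)).days
        if len(set(days)) < min_sessions or span < min_span_days:
            problems.append((spk, len(set(days)), span))
    return problems

recs = [("f01", date(1995, 3, 1)), ("f01", date(1995, 6, 12)),
        ("m02", date(1995, 3, 1))]          # m02 has only one session
print(check_sessions(recs))
\end{verbatim}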