Technological applications for which speech corpora are needed can be roughly divided into four major classes: speech synthesis, speech recognition, spoken language systems, and speaker recognition/verification. Depending on the specific application, the speech corpora which are needed are very diverse. For example, speech synthesis usually requires a large amount of speech data from one or two speakers, whereas speech recognition often requires a smaller amount of speech data from many speakers. In the following sections the four domains of speech research for technological applications and the speech corpora they need are discussed.
The seemingly most natural way to synthesise speech is to model human speech production directly by simulating lung pressure, vocal fold vibration, articulatory gestures, etc. However, the human system is not completely understood, which is probably why it turns out to be extremely difficult to determine and control the details of the model parameters in computer simulations. As a consequence, articulatory synthesisers have only been moderately successful in generating perceptually important acoustic features. Yet, modern measurement techniques have allowed the collection of substantial amounts of measurement data. Most of these data are now being made available to the research community (see the ESPRIT project ACCOR (ESPRIT/BRA 3279 ACCOR) and the special issue of the journal ``Language and Speech'' (1993)).
A relatively simple way to build a speech synthesiser is through concatenation
of stored human speech components. In order to achieve natural coarticulation
in the synthesised speech, it is necessary to include transition regions in
the building blocks. Often-used transition units are diphones,
which
represent the transition from one phone to another. Since diphone inventories
are derived directly from human utterances, diphone synthesis might be
expected to be inherently natural sounding. However, this is not completely
true, because the diphones have to be concatenated and in practice there will
be many diphone junctions that do not fit properly together. In order to be
able to smooth these discontinuities, the waveform segments have to be converted
to a convenient format, such as some form of LPC parameters, often with some
inherent loss of auditory quality. Until recently it was believed that a
parametric representation was mandatory to be able to change the pitch and
timing of utterances without disturbing the spectral envelope pattern. Since
the invention of PSOLA-like techniques, high quality pitch and time changes can
be effected directly in the time domain.
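To make this concrete, the sketch below illustrates the core of a PSOLA-like manipulation: pitch-synchronous, Hann-windowed segments of roughly two pitch periods are extracted around known pitch marks and overlap-added at a rescaled spacing. It is only a minimal illustration, assuming reliable pitch marks are already available and ignoring amplitude normalisation and time-scale modification; the function name and interface are invented for this example.

\begin{verbatim}
import numpy as np

def psola_pitch_shift(signal, pitch_marks, factor):
    """Very simplified PSOLA-style pitch modification (illustration only)."""
    signal = np.asarray(signal, dtype=float)
    marks = np.asarray(pitch_marks)
    periods = np.diff(marks)
    out = np.zeros_like(signal)

    # Synthesis pitch marks: same time span, but local spacing divided by
    # `factor`, so factor > 1 packs the glottal pulses closer (higher pitch).
    synth_marks, t = [], float(marks[0])
    while t < marks[-1]:
        synth_marks.append(t)
        i = min(max(np.searchsorted(marks, t), 1), len(marks) - 1)
        t += (marks[i] - marks[i - 1]) / factor

    for t in synth_marks:
        # Take the analysis segment (two pitch periods, Hann-windowed)
        # centred on the analysis mark closest to this synthesis instant.
        i = int(np.argmin(np.abs(marks - t)))
        T = int(periods[min(i, len(periods) - 1)])
        start, end = marks[i] - T, marks[i] + T
        if start < 0 or end > len(signal):
            continue
        segment = signal[start:end] * np.hanning(end - start)
        pos = int(round(t)) - T
        if 0 <= pos and pos + len(segment) <= len(out):
            out[pos:pos + len(segment)] += segment
    return out
\end{verbatim}

In a real system the pitch marks would be estimated from the speech signal itself, and time-scale modification would be handled in the same overlap-add framework.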
For limited applications, such as train information systems, whole words and even phrases may be stored.
Lately, this method of speech synthesis is being applied more and
more because cheap mass storage has become available.
The quality of concatenated-word sentences is often acceptable, especially given that
the quality of the other types of synthesis is still not optimal.
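As a minimal illustration of this approach, the sketch below concatenates pre-recorded word waveforms into a single utterance, inserting short pauses between words. The `soundfile` package, the directory layout and the file names are assumptions made purely for the example.

\begin{verbatim}
import numpy as np
import soundfile as sf   # third-party WAV reader, assumed to be installed

def announce(words, word_dir="prompts", pause_ms=50, rate=16000):
    """Concatenate stored word recordings into one utterance (sketch)."""
    pause = np.zeros(int(rate * pause_ms / 1000.0))
    pieces = []
    for w in words:
        audio, sr = sf.read(f"{word_dir}/{w}.wav")   # hypothetical layout; mono assumed
        assert sr == rate, "all stored prompts must share one sampling rate"
        pieces.extend([audio, pause])
    return np.concatenate(pieces)

# e.g. announce(["the", "train", "to", "utrecht", "leaves", "at", "ten", "fifteen"])
\end{verbatim}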
Another important method of generating computerised speech is through synthesis by rule. The usual approach is to input a string of allophones to some form of formant synthesiser. Target formant values for each allophone are derived from human utterances and these values are stored in large tables. With an additional set of rules these target values can be adapted to account for all kinds of phonological and phonetic phenomena and to generate proper prosody.
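A toy version of such a target table and the interpolation between successive targets might look as follows; the formant values are rough, textbook-style numbers chosen only for illustration, and real rule systems contain many more allophones, parameters and context-dependent rules.

\begin{verbatim}
# Illustrative F1/F2/F3 targets in Hz; the numbers are rough examples only.
TARGETS = {
    "a": (700, 1200, 2600),
    "i": (300, 2300, 3000),
    "u": (300,  800, 2200),
}

def formant_track(allophones, frames_per_phone=10):
    """Linearly interpolate formant targets between successive allophones."""
    track = []
    for left, right in zip(allophones[:-1], allophones[1:]):
        f_l, f_r = TARGETS[left], TARGETS[right]
        for k in range(frames_per_phone):
            alpha = k / float(frames_per_phone)
            track.append(tuple((1 - alpha) * a + alpha * b
                               for a, b in zip(f_l, f_r)))
    return track

# e.g. formant_track(["a", "i", "u"]) gives a frame-by-frame formant contour
# that a formant synthesiser could turn into a waveform.
\end{verbatim}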
More detailed accounts of speech synthesis systems can be found in, for instance, [Klatt (1987)] and [Holmes (1988)], and in Chapter 12.
For all types of speech synthesis systems corpora are needed to determine the model parameters. If the user wants many different types of voice, the speech corpus should contain various speakers for the extraction of speaker-specific model parameters. In particular, the user might want to be able to generate both male and female speech. Transformations to convert rule systems between male and female speech have had limited success, so it seems more convenient to include both sexes in the speech corpus. Application specific corpora are needed to investigate issues related to prosody.
There are several types of speech recognition systems, which may differ in three important ways: in the recognition strategy they use, in whether they are speaker-dependent or speaker-independent, and in the type of speech they have to recognise.
These different aspects will be discussed below.
With respect to the strategies they use, speech
recognition systems can be roughly divided into two
classes: knowledge-based systems and stochastic systems. All
state-of-the-art systems belong to the second category. In the knowledge-based approach an attempt was made to
specify explicit acoustic-phonetic rules that are robust enough to allow
recognition of linguistically meaningful units and that
ignore irrelevant variation in these units. Stochastic
systems, such as Hidden Markov Models (HMMs)
or neural networks, do not use explicit rules for speech
recognition. Instead, they rely on stochastic models which are
estimated or trained with (very) large amounts of speech, using some
statistical optimisation procedure (e.g. the Expectation-Maximisation or the
Baum-Welch algorithm).
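As a sketch of this training regime, the fragment below fits a small Gaussian HMM on a set of feature sequences with the `hmmlearn` package (an assumption of this example; any HMM toolkit with Baum-Welch re-estimation would serve), using random numbers in place of real cepstral features.

\begin{verbatim}
import numpy as np
from hmmlearn import hmm   # third-party package, assumed to be installed

# Stand-in for real training data: one 13-dimensional feature sequence
# (e.g. MFCC frames) per training utterance of the unit being modelled.
rng = np.random.default_rng(0)
train_seqs = [rng.normal(size=(rng.integers(40, 80), 13)) for _ in range(20)]

X = np.concatenate(train_seqs)            # all frames stacked
lengths = [len(s) for s in train_seqs]    # frames per utterance

# fit() runs Baum-Welch (EM) re-estimation of the transition and emission
# parameters of a 5-state HMM with diagonal-covariance Gaussian emissions.
model = hmm.GaussianHMM(n_components=5, covariance_type="diag", n_iter=20)
model.fit(X, lengths)

# Recognition then amounts to scoring an unknown utterance against every
# trained model and choosing the one with the highest log-likelihood.
print(model.score(rng.normal(size=(60, 13))))
\end{verbatim}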
Higher level linguistic knowledge can be used to constrain the recognition
hypotheses generated at the acoustic-phonetic level. Higher level knowledge
can be represented by knowledge-based explicit rules, for example syntactic
constraints on word order. More often it is represented by stochastic language
models, for example bigrams or trigrams
that reflect the
likelihood of a sequence of two or three words, respectively (see also
Chapter 7). Recently, promising work on enhancing HMMs with
morphological and phonological structure has been conducted, pointing
to the possibility of convergence between knowledge-based and
stochastic approaches.
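The sketch below shows how such a bigram model can be estimated from a (tiny) training corpus with simple add-one smoothing; the sentences and the smoothing scheme are chosen only for illustration.

\begin{verbatim}
from collections import Counter

def train_bigram_lm(sentences):
    """Maximum-likelihood bigram estimates with add-one smoothing (sketch)."""
    unigrams, bigrams, vocab = Counter(), Counter(), set()
    for sent in sentences:
        words = ["<s>"] + sent.split() + ["</s>"]
        vocab.update(words)
        unigrams.update(words[:-1])
        bigrams.update(zip(words[:-1], words[1:]))
    V = len(vocab)
    def prob(w1, w2):
        # likelihood of word w2 following word w1
        return (bigrams[(w1, w2)] + 1) / (unigrams[w1] + V)
    return prob

p = train_bigram_lm(["show me flights to boston",
                     "show me fares to denver"])
print(p("show", "me"), p("me", "flights"))
\end{verbatim}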
Speech recognition systems can be either speaker-dependent or speaker-independent. In the former case the recognition system is designed to recognise the speech of just a single person, and in the latter case the recognition system should be able to recognise the speech of a variety of speakers. All other things being equal, the performance of speaker-independent systems is likely to be worse than that of speaker-dependent systems, because speaker-independent systems have to deal with a considerable amount of inter-speaker variability. It is often sensible to train separate recognition models for specific subgroups of speakers, such as men and women, or speakers with different dialects [Van Compernolle et al. (1991)].
Some systems can to some extent adapt to new speakers by adjusting the parameters of their models. This can be done in a separate training session with a set of predetermined utterances of the new speaker, or it can be done on-line as the recognition of the new speaker's utterances gradually proceeds.
Most recognition systems are very sensitive to the recording environment. In the past, speakers employed to train and develop a system were often recorded under ``laboratory'' conditions, for instance in an anechoic room. It appears that the performance of speech recognisers which are trained with such high quality recordings severely degrades if they are tested with some form of ``noisy'' speech [Gong (1995)]. The use of different microphones during training sessions and test sessions also has a considerable impact on recognition performance.
The third main distinction between speech
recognition systems is based on the type of speech they
have to recognise. The system can be designed for
isolated word recognition or
for continuous speech recognition. In the latter case word boundaries
have to be established, which can be extremely difficult.
Nevertheless, continuous speech recognition systems
are nowadays reasonably successful, although their
performance of course strongly depends on the size of
their vocabulary.
Word spotting can be regarded as a special form
of isolated word recognition: the recogniser is ``listening'' for a limited number of words. These words may come
embedded in background noise, possibly consisting of
speech of competing speakers, or may come from the target speaker who is producing
the word embedded in extraneous speech.
In general, two similar speech corpora are needed for the development of speech
recognition systems: one for the training phase
and one for the testing phase.
The training material is used to set the model parameters of the
recognition system. The testing
material is used to determine the performance
of the trained system. It is necessary to use different speech data for
training and testing in order to get a fair evaluation of the system
performance.
For speaker-dependent systems, obviously the same speaker is used for the
training and testing phase. For speaker-independent systems,
the corpora for training
and testing could
contain the same speakers (but different speech data), or they could contain
different speakers to determine the system's robustness for new speakers.
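A simple way to set up such a speaker-disjoint division is sketched below, assuming the corpus is available as a list of (speaker, utterance) pairs; the data layout and the 20% test fraction are arbitrary choices for the example.

\begin{verbatim}
import random

def split_by_speaker(utterances, test_fraction=0.2, seed=0):
    """Split a corpus so that training and test speakers do not overlap (sketch).

    `utterances` is assumed to be a list of (speaker_id, utterance) pairs.
    """
    speakers = sorted({spk for spk, _ in utterances})
    random.Random(seed).shuffle(speakers)
    n_test = max(1, int(len(speakers) * test_fraction))
    test_speakers = set(speakers[:n_test])
    train = [u for u in utterances if u[0] not in test_speakers]
    test = [u for u in utterances if u[0] in test_speakers]
    return train, test
\end{verbatim}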
When a system is designed for isolated word recognition, it should be trained
and tested with isolated words. Similarly, when a system is designed for
telephone speech, it should be trained and tested with telephone speech. The
design of corpora for speech recognition research thus strongly depends on the
type of recognition system that one wants to develop. Several large corpora
for isolated word (e.g. TIDIGITS) and continuous speech recognition (e.g.
Resource Management, ATIS, BREF, EUROM, TIMIT and Wall Street Journal) have
been collected and made available (cf. Appendix L).
Speech synthesis and speech recognition systems can be combined with natural
language processing and Dialogue Management systems to form a Spoken Language
System (SLS) that allows interactive communication between man and
machine. A spoken language system should be able
to recognise a person's speech, interpret the sequence of words to obtain a
meaning in terms of the application, and provide an appropriate response
to the user.
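Schematically, such a system is a chain of components, as in the sketch below; the four component functions are placeholders that a real system would supply, not parts of any existing library.

\begin{verbatim}
def spoken_language_system(audio, recognise, interpret, respond, synthesise):
    """Minimal SLS control loop (sketch); all components are caller-supplied."""
    words = recognise(audio)       # speech recognition: audio -> word sequence
    meaning = interpret(words)     # language understanding: words -> meaning
    answer = respond(meaning)      # dialogue management / application logic
    return synthesise(answer)      # speech synthesis: text -> audio response
\end{verbatim}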
Apart from speech corpora needed to design the speech synthesis and the
speech recognition part of the spoken language system,
speech corpora are also needed to model relevant features of spontaneous
speech (pauses, hesitations, turn-taking
behaviour, etc.) and to model
dialogue structures for a proper man-machine interaction.
An excellent overview of spoken language systems and their problems is given
in [Cole (1995)]. The ATIS corpora mentioned above exemplify the type of
corpus used for the development of SLS.
The task of automatic speaker recognition is to determine the identity of a
speaker by machine. Speaker recognition (usually called speaker
identification) can be divided into two
categories: closed-set and open-set problems. The closed-set
problem is to identify a speaker from a group of known speakers, whereas the
open-set problem is to decide whether a speaker belongs to a group of known
speakers. Speaker verification is a special case of the open-set problem
and refers to the task of deciding whether a speaker is who he claims
to be.
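The decision logic that separates these tasks can be summarised in a few lines, assuming that per-speaker match scores (for example log-likelihoods under speaker models) are already available; the scores and threshold below are invented numbers.

\begin{verbatim}
def identify_closed_set(scores):
    """Closed-set identification: pick the best-matching enrolled speaker."""
    return max(scores, key=scores.get)

def identify_open_set(scores, threshold):
    """Open-set identification: best match, rejected if no score is high enough."""
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else None

def verify(scores, claimed_id, threshold):
    """Verification: accept or reject a claimed identity."""
    return scores[claimed_id] >= threshold

# `scores` maps enrolled speaker IDs to match scores for one test utterance.
scores = {"spk01": -410.2, "spk02": -388.7, "spk03": -402.5}
print(identify_closed_set(scores), identify_open_set(scores, -380.0),
      verify(scores, "spk02", -395.0))
\end{verbatim}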
Speaker recognition can be text-dependent or it can be
text-independent. In the former case the text in both the
training phase and the testing phase is known, i.e. the system employs a sort of password
procedure. Popular examples of password-like phrases are the so-called ``combination lock'' phrases,
consisting of sequences of numbers (mostly between 0 and 99) or digits (between 0 and 9). LDC provides
a corpus for training and testing speaker verification systems based on combination lock phrases
consisting of three numbers between 11 and 99 (e.g. 26-81-57) [Campbell (1995)].
Knowledge of the text enables the use of systems which combine
speech and speaker recognition. In other words, before granting access to data or premises the speaker
verification system can request that the claimant says the combination lock; the system then checks both
the correctness of the password and the voice characteristics of the speaker. However, most password
systems are susceptible to
fraud using recordings of the passwords spoken by a customer. One way of
making fraud with recordings much more difficult is by the use of text-prompted
techniques, whereby the customer is asked to repeat one or more
sentences randomly drawn from a very large set of possible sentences. Again, the system checks both the
contents of the reply and the voice characteristics of the speaker. Since
surreptitious recording of millions
of specific utterances is impossible, text-prompted speaker
verification systems should offer a very high
level of security and immunity to fraud.
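A sketch of this double check is given below; drawing prompts, recognising the reply and scoring the voice are all represented by placeholders here, and the prompt template, threshold and scores are invented for the example.

\begin{verbatim}
import random

def draw_prompt(rng=random):
    """Draw one prompt from a very large space of candidate sentences."""
    numbers = [rng.randint(11, 99) for _ in range(3)]
    return "please read the numbers " + " ".join(str(n) for n in numbers)

def accept(reply_text, speaker_score, prompt, threshold):
    """Accept only if the prompted numbers were repeated AND the voice matches."""
    prompted = prompt.split()[-3:]
    content_ok = all(n in reply_text.split() for n in prompted)
    return content_ok and speaker_score >= threshold

prompt = draw_prompt()
reply = "uh " + " ".join(prompt.split()[-3:])   # claimant repeats the numbers
print(prompt, accept(reply, speaker_score=-390.0, prompt=prompt, threshold=-395.0))
\end{verbatim}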
In the case of text-independent speaker verification the acceptance
procedure should work for any text in both the training and the testing phase. Since this approach is
especially susceptible to fraud by playing recordings of the customer, text-independent verification
technology is best combined with other fraud prevention techniques.
There are various application areas for speaker recognition, for instance helping to identify suspects in forensic cases, or controlling access to buildings or bank accounts. As with speech recognition, the corpora needed for speaker recognition or speaker verification are dependent on the specific application. If, for instance, the technology is based on combination lock phrases, a training database should obviously contain a large number of connected number or digit expressions. For the development of text-independent speaker technology there are no strict requirements as to what the training speakers say.
Corpora for the development and testing of speaker recognition
systems
differ in a crucial aspect from corpora collected to support speech
recognition. For speaker recognition research it is absolutely essential
that the corpus contains multiple recordings of the same speaker, made under
different conditions. There is a range of conditions that should ideally be
sampled, in order to be able to build a model of the natural variation in a
person's speech due to realistic variations in the conditions under which the
speech is produced.
Conditions to be sampled and to be represented in a corpus can be divided into
two broad groups, viz. channel conditions, and
physiological and psychological conditions of the speaker.
The details of the acoustic speech patterns depend heavily on the acoustic
background in which the speech was produced and on the response of the
transmission network. A corpus for speaker recognition research should at
least include multiple recordings of the speakers made with different
microphones or
telephone handsets. Especially the transmission differences between carbon
button and electret microphones in telephone handsets are known to affect the
performance of speaker recognition systems. In this context, attention should
also be paid to the different transmission characteristics of the fixed,
landline telephone network and the rapidly growing cellular networks.
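For development purposes, the difference between a wideband recording and a landline telephone channel is sometimes approximated by simple band-limiting, as in the sketch below (which uses `scipy` and ignores handset, codec and network effects).

\begin{verbatim}
from scipy.signal import butter, lfilter   # scipy assumed to be available

def telephone_channel(speech, fs=16000, low=300.0, high=3400.0):
    """Crude landline-telephone simulation: band-limit to roughly 300-3400 Hz."""
    b, a = butter(4, [low, high], btype="bandpass", fs=fs)
    return lfilter(b, a, speech)

# e.g. degraded = telephone_channel(clean_waveform, fs=16000)
\end{verbatim}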
In actual practice, it is much more difficult to obtain a representative
sampling of acoustic backgrounds. Yet one must be aware that loud background
noise affects the speech in two ways, one of which is highly non-linear and
therefore cannot easily be compensated for: in addition to decreasing the
signal-to-noise ratio, background noise is likely to cause the speaker to change
his speaking behaviour. The best known of these effects is the
Lombard effect: in a high noise environment speakers tend to raise their voice
level, and with it change their phonation style and probably also their
articulation style.
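The linear part of the problem, i.e. the reduced signal-to-noise ratio, can at least be simulated by mixing recorded noise into clean speech at a chosen SNR, as sketched below; the Lombard-style changes in speaking behaviour cannot be generated this way.

\begin{verbatim}
import numpy as np

def add_noise(speech, noise, snr_db):
    """Mix noise into speech at a given signal-to-noise ratio in dB (sketch)."""
    speech = np.asarray(speech, dtype=float)
    noise = np.resize(np.asarray(noise, dtype=float), speech.shape)
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    target_p_noise = p_speech / (10.0 ** (snr_db / 10.0))
    return speech + noise * np.sqrt(target_p_noise / p_noise)
\end{verbatim}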
Speaker variation due to the physiological and psychological condition of the speaker is also very
difficult to sample. Given the practical limitations of a corpus collection
project it is hardly feasible to require that each speaker be recorded in
perfect health conditions, as well as when having a cold, the flu, or any other
mild disease.
One simple approximation to sampling within speaker variation that is feasible
from a practical point of view is to record speakers at different times of the
day (early morning, noon, late night), and on different days of the week. In
any case, the period over which the recordings are extended should span at least
a couple of months. One might also consider recording speakers in completely
sober conditions and after the consumption of a reasonable amount of
intoxicating drugs.
More detailed accounts of speaker recognition can be found in [O'Shaughnessy (1986)] and [Gish & Schmidt (1994)].
Developing and testing speaker recognition systems with a database containing only a single recording session for the speakers should be avoided, because such databases cannot possibly account for even the slightest degree of within-speaker variation. Results reported on such databases (e.g. TIMIT) grossly overestimate the performance of the system being tested.
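A corpus intended for speaker recognition can be screened for this problem with a simple metadata check like the sketch below; the data layout and the thresholds (at least two sessions, spanning at least two months) are illustrative assumptions.

\begin{verbatim}
from datetime import date

def check_sessions(recordings, min_sessions=2, min_span_days=60):
    """Flag speakers whose recordings cannot sample within-speaker variation.

    `recordings` is assumed to be a list of (speaker_id, recording_date) pairs,
    with dates given as datetime.date objects.
    """
    by_speaker = {}
    for spk, day in recordings:
        by_speaker.setdefault(spk, []).append(day)
    problems = []
    for spk, days in sorted(by_speaker.items()):
        span = (max(days) - min(days)).days
        if len(set(days)) < min_sessions or span < min_span_days:
            problems.append((spk, len(set(days)), span))
    return problems

recs = [("f01", date(1995, 3, 1)), ("f01", date(1995, 6, 12)),
        ("m02", date(1995, 3, 1))]          # m02 has only one session
print(check_sessions(recs))
\end{verbatim}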