The speech material in a corpus can vary from isolated sounds to complete conversations. In
general, the extent to which the experimenter has control over the speech material decreases as it
becomes more and more spontaneous and natural.
The term natural refers to a rather intuitive concept that can be interpreted in different ways.
We regard speech as maximally natural when two or more speakers have a
conversation in a familiar environment about a subject
they themselves
choose to talk about, since this is presumably the situation for which speech was ``invented''.
Although read aloud speech is a commonly used speaking style
(and may be
regarded as a natural speaking style from a sociolinguistic point of view),
we regard this style as derived from the most natural style mentioned above.
When reading a text, people have the tendency to speak more
formally and
to articulate more carefully than when they are involved in free conversation.
Thus, in our opinion the naturalness of speech should be
judged on a gradual scale.
It should be noted
that control over the speech material is not always necessary and may
even be counterproductive, especially when one wants to study the variation
of speech as a function of communicative context.
However, strict control over the speech material is required for some applications, such as the
development of speech synthesis systems.
In the following, eight types of speech data will be distinguished.
Vowels pronounced in isolation (or in a ``neutral'' context, such as /hVt/) are often used as the frame of reference for experiments in which vowels from connected speech are investigated. Continuant consonants, e.g. /l, r, w, j, n, m, s, f/, can also be pronounced in isolation. Non-continuants, e.g. /p, t, k, b, d, g/, must be followed or preceded by a vowel, e.g. the ``neutral'' schwa.
Isolated words can be either ``nonsense'' words
or existing words. In the case of nonsense words the
experimenter can create all possible kinds of phonotactically correct sound sequences. This gives the opportunity to
study coarticulation in a systematic way. Nonsense
words are also
used to extract models for a dictionary
of phonetic elements
when a synthesis system is developed.
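As an illustration of how such sound sequences can be enumerated systematically, the following minimal sketch generates all consonant-vowel-consonant strings from small inventories and drops those that happen to be existing words. The inventories, the exclusion list and the function name cvc_nonsense_words are assumptions made for this example only; a real study would use the full phoneme inventory and the phonotactic constraints of the language of interest.

```python
from itertools import product

# Hypothetical toy inventories, chosen only for illustration.
ONSETS = ["p", "t", "k"]
VOWELS = ["a", "i", "u"]
CODAS = ["p", "t", "k"]


def cvc_nonsense_words(onsets, vowels, codas, exclude=()):
    """Return every onset+vowel+coda string that is not listed in `exclude`."""
    excluded = set(exclude)
    words = []
    for onset, vowel, coda in product(onsets, vowels, codas):
        word = onset + vowel + coda
        if word not in excluded:
            words.append(word)
    return words


# "pit" and "tap" stand in for a lexicon of existing words to be skipped.
nonsense_words = cvc_nonsense_words(ONSETS, VOWELS, CODAS, exclude=["pit", "tap"])
print(len(nonsense_words), nonsense_words[:5])
```

Because every combination is generated, each vowel appears in every consonantal context, which is what makes a systematic study of coarticulation possible.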
When existing words are used, the number of possible sound sequences is
restricted to what is phonotactically appropriate in the lexicon of a given
language. It must be realised that
control over the sounds produced by the speakers may not be perfect, because the pronunciation of
polysyllabic
words can be influenced by the stress pattern, which may
be ambiguous (cf. English words like record, which is stressed on the first
syllable as a noun and on the second as a verb).
When speakers have to read aloud a list of isolated words, their pronunciation may be influenced
by the orthographic
representation of the words, a phenomenon known as spelling pronunciation.
Spelling pronunciation is
especially apparent in languages which form nominal compounds: even if
sound sequences occur across morpheme boundaries that
would lead to assimilation or degemination in connected
speech, one should
anticipate that all sounds are realised in reading aloud. This
phenomenon
can be circumvented by
having the speakers name the words through the presentation of pictures, but this
method can only be applied to a limited number of
words. It is, for instance, not suitable for abstract concepts.
The carrier sentence is one type of isolated sentence.
Carrier sentences are often used when
one wants to get a somewhat more natural pronunciation of (nonsense)
words in comparison with
words spoken in isolation,
especially with respect to speech rate. The test words are embedded in the
carrier sentence, as illustrated
by the example
``I will say - a test word - again''. The same carrier
sentence is
used repeatedly for all occurring test words, so that the influence of the acoustic and
linguistic context on the test words is controlled.
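A prompt list of this kind is easy to generate mechanically. The sketch below embeds each test word in the carrier sentence quoted above; the test words and the function name make_prompts are hypothetical and serve only as an example.

```python
# Carrier sentence from the text, with a slot for the test word.
CARRIER = "I will say {word} again"


def make_prompts(test_words, carrier=CARRIER):
    """Embed every test word in the same carrier sentence."""
    return [carrier.format(word=word) for word in test_words]


# Hypothetical nonsense test words.
prompts = make_prompts(["pap", "tik", "kut"])
for prompt in prompts:
    print(prompt)
```

Since only the embedded word changes from prompt to prompt, the context surrounding the test word is identical across all items.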
More natural speech material can be obtained when ``normal'' (linguistically meaningful)
sentences are constructed by the experimenter. Such sentences can be used to train
phoneme based recognisers and to study, for
instance, word stress or coarticulation in a relatively
natural linguistic context. One should note that an isolated
sentence may be interpreted by a speaker in a wider semantic context, which can
influence the pronunciation of the sentence, especially with respect to the position of
sentence accent(s).
Sometimes a semantic relation between subsequent ``isolated'' sentences may arise as a
result of the specific ordering
of the speech material. Since individual speakers may imagine different semantic contexts
for a specific sentence, variability in the
suprasegmental
features of the test sentences can occur. If desired,
this variability can be reduced by using punctuation and other typographical means (for
instance, capitals or boldface characters) to indicate words that should have a sentence
accent.
A more natural way of doing this is to let each sentence be
preceded by a question that evokes sentence accents
at the
desired positions. It should be clear, however, that neither practice can be recommended in the collection
of
large corpora of telephone speech.
For many purposes, such as the development of a phoneme-based recogniser, it is
crucial that all phonemes are represented
in
the speech corpus in sufficiently high numbers. Due to the large differences in
frequency of occurrence of the phonemes in
the
language in general, uniform phoneme frequencies
will not occur
in randomly chosen sentence material: such material will,
instead, reflect the differences in phoneme frequencies. It is
proposed to reserve the term phonetically balanced for
speech material containing phonemes according to their
frequency
of occurrence in the general language. Phonetically
balanced sentences may be used for speech audiometry and for testing the transmission
characteristics of communication
channels or public address systems.
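Whether a set of sentences is phonetically balanced in this sense can be checked by comparing its phoneme distribution with a reference distribution for the language. The following sketch is one possible way to do this; the reference frequencies and transcriptions are invented toy data, and the use of total variation distance as the comparison measure is an assumption of this example, not a measure prescribed here.

```python
from collections import Counter


def phoneme_distribution(transcriptions):
    """Relative frequency of each phoneme over a list of phoneme sequences."""
    counts = Counter(p for sequence in transcriptions for p in sequence)
    total = sum(counts.values())
    return {phoneme: n / total for phoneme, n in counts.items()}


def total_variation(dist_a, dist_b):
    """Half the summed absolute difference; 0 means identical distributions."""
    phonemes = set(dist_a) | set(dist_b)
    return 0.5 * sum(abs(dist_a.get(p, 0.0) - dist_b.get(p, 0.0)) for p in phonemes)


# Hypothetical reference frequencies for a three-phoneme toy language.
reference = {"a": 0.5, "t": 0.3, "s": 0.2}

# Toy transcriptions: one phoneme list per sentence.
balanced_set = [["t", "a", "s"], ["a", "t", "a"], ["s", "a", "t", "a"]]
skewed_set = [["s", "s", "t"]]

balanced_distance = total_variation(phoneme_distribution(balanced_set), reference)
skewed_distance = total_variation(phoneme_distribution(skewed_set), reference)
print(balanced_distance, skewed_distance)
```

A distance near zero indicates that the material follows the reference frequencies closely; larger values indicate that some phonemes are over- or under-represented.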
Approximately uniform phoneme frequency
distributions can be achieved by using
phonetically rich sentences. For that purpose greedy algorithms [Van Santen (1992)] can be used. Suppose you
want to have a set of sentences in which each phoneme of the language of
interest occurs at least once. Of course, you could try to create this set of
sentences yourself, but this would be difficult and time-consuming.
Furthermore, you might end up with sentences that look rather ``constructed''.
An alternative would be to search for an appropriate set of sentences in a
sufficiently large text corpus, for instance, a large amount of newspaper data
on CD-ROM. An advantage of this procedure is that much more variation in the sentences is obtained. A
greedy algorithm can be used to obtain a small (though not necessarily the minimum) number of
sentences containing all phonemes. The following steps have to be taken to get
the desired test set:
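A greedy selection of this kind can be sketched as follows. This is one common formulation (greedy set cover), not necessarily the exact procedure referred to above: it repeatedly picks the sentence that adds the most phonemes not yet covered, which yields a small covering set but does not guarantee the smallest possible one. The sentence identifiers, the toy phoneme sets and the function name greedy_select are assumptions of this example.

```python
def greedy_select(sentences, inventory):
    """Greedily pick sentences until every phoneme in `inventory` is covered.

    `sentences` maps a sentence identifier to the phonemes it contains.
    Returns the selected identifiers and any phonemes that could not be covered.
    """
    uncovered = set(inventory)
    remaining = {sid: set(phonemes) for sid, phonemes in sentences.items()}
    selected = []
    while uncovered:
        # Sentence contributing the most phonemes that are still missing.
        best = max(remaining, key=lambda sid: len(remaining[sid] & uncovered),
                   default=None)
        if best is None or not (remaining[best] & uncovered):
            break  # no remaining sentence adds a new phoneme
        selected.append(best)
        uncovered -= remaining.pop(best)
    return selected, uncovered


# Hypothetical corpus: sentence id -> phonemes occurring in that sentence.
corpus = {
    "s1": ["a", "b", "c"],
    "s2": ["a", "b"],
    "s3": ["d"],
    "s4": ["c", "d", "e"],
}

chosen, missing = greedy_select(corpus, inventory={"a", "b", "c", "d", "e"})
print(chosen, missing)
```

The second return value lists phonemes that occur in no sentence of the corpus, which signals that the text corpus is too small or that sentences for those phonemes must be constructed by hand.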
The naturalness of the produced speech may increase even more when speakers read aloud a series of sentences that are semantically related, provided that the subject is able and used to reading aloud paragraph-length material. The prompting material can consist of a text fragment taken from, for instance, a newspaper or a book (e.g. BREF, Wall Street Journal). But the text fragment can also be created by the experimenter, when it is necessary to impose some specific restrictions on the speech material, for instance with respect to phonemic structure, word structure, or syntactic structure. Reading aloud a text fragment is more difficult than reading aloud a list of isolated sentences. It is very likely that the speech produced by different speakers who are asked to read a text fragment will vary considerably, especially with regard to aspects like vividness, speech rate, omitted speech segments, prosody, etc. The preferred position of sentence accents in a text fragment can be indicated with capitals or boldface characters. This is not recommended if one is interested in more natural speech.
When speech corpora are gathered for commercial applications, a common task of speakers is to read numbers or alpha-numerical expressions, such as ZIP-codes. Speakers have some freedom to pronounce these numbers or alpha-numerical expressions as they like. For example, there appear to be substantial differences between the ways in which subjects express telephone numbers. Some may read the telephone number as a string of digits, whereas others may read it as a string of numbers containing two or more digits. In addition, it may make a difference whether the telephone number is familiar (for instance, a friend's number), or unfamiliar. The POLYPHONE corpora are good examples of corpora that contain such semi-spontaneous speech.
The previous types of speech material were all concerned with the reading
aloud of some piece of text by one speaker at a time (disregarding
the naming of words through the presentation of pictures). In the present
section we will discuss spontaneous speech from one or more
speakers. The major difference between read speech and spontaneous
speech is that the former fixes vocabulary and
syntax, whereas the latter leaves speakers free to choose their
own vocabulary and syntax. The naturalness of
the produced speech increases when speakers are allowed to choose their own
words. In order to keep some control over the speech material, the
experimenter can determine the subject the speaker has to talk about.
The subject of conversation is relatively fixed when speakers are asked to
retell a story that they heard or read shortly before. Since it is likely that
speakers will use at least some of the words that occurred in the story, this
method allows the experimenter to gather ``spontaneously'' spoken versions of
specific words of interest. In a variant of this method, speakers can be asked
to invent a story based on a cartoon (without text balloons), or on some
complex picture that is bound to evoke the words of interest. In all these
designs, monologues are involved, although a session manager
may try to guide the discourse in the desired direction. However, one should
be aware that many naive subjects do not feel at ease in a situation in which
they must maintain a monologue for an extended period of time. Most people
feel much more comfortable in a dialogue situation. Moreover,
interview situations provide some additional control over subjects' speech,
because the interviewer determines the subject of conversation, and
subsequently guides the conversation in the desired direction.
Another kind of guided spontaneous speech is found in
information dialogues, in which people attempt to obtain information about, for
instance, train or plane schedules. Speakers request information from an
information agent or a computer system about time and place of departure,
destination, etc. In this way spontaneous speech can
be obtained, even if it concerns a very restricted subject. This paradigm is
used in the (D)ARPA Air Travel Information System (ATIS)
task. Train timetable
information dialogues are now being recorded in several languages, e.g.
German, Dutch, Italian, French and British English, in the MLAP
projects MAIS and RAILTEL.
Although a speech situation with two or more people is more natural than a
monologue, overlapping acoustic material may result from
several people speaking simultaneously. For some applications, such as
research on basic speech processes, overlapping acoustic material is difficult
or impossible to use. Of course, one can try to extract speech fragments from recorded dialogues in which
only a single speaker is talking. The study
of simultaneous speech from two or more speakers is important for research on
dialogue or discourse analysis, intention analysis, and
spoken language understanding. The gathering of multiple simultaneous speaker
corpora is still in its infancy. Such corpora are indispensable for studying
speech in all its relevant aspects. In addition, speech recognisers, which
so far can deal with only one speaker at a time, will eventually also
have to be able to deal with different speakers talking simultaneously. Speech
corpora containing dialogues could supply the training
and testing data for
such advanced recognisers. To make such corpora useful for research and
development purposes each individual speaker should be recorded on a separate
track, using a microphone array with very high directional sensitivity.
Additional tracks can then be synthesised, simulating less perfect
directional sensitivity. Alternatively, subjects could be recorded in a
teleconference, although such distributed recordings would require extensive
precautions to allow one to synchronise the tracks originating from completely
independent recorders.
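When each speaker has been recorded and segmented on a separate track, the fragments in which only a single speaker is talking can be located automatically. The sketch below assumes that speech activity is available as (speaker, start, end) tuples with times in seconds on a common, synchronised time axis; this data format and the function name single_speaker_regions are assumptions of this example.

```python
def single_speaker_regions(segments):
    """Return (speaker, start, end) regions in which exactly one speaker talks.

    `segments` is a list of (speaker, start, end) speech-activity intervals.
    Adjacent regions of the same speaker are not merged.
    """
    events = []
    for speaker, start, end in segments:
        events.append((start, 1, speaker))   # speaker starts talking
        events.append((end, -1, speaker))    # speaker stops talking
    # At equal times, process ends (-1) before starts (+1).
    events.sort(key=lambda event: (event[0], event[1]))

    active = {}      # speaker -> number of currently open segments
    regions = []
    previous_time = None
    for time, delta, speaker in events:
        # The stretch since the previous event had a constant set of talkers.
        if previous_time is not None and time > previous_time and len(active) == 1:
            (only_speaker,) = active
            regions.append((only_speaker, previous_time, time))
        active[speaker] = active.get(speaker, 0) + delta
        if active[speaker] == 0:
            del active[speaker]
        previous_time = time
    return regions


# Hypothetical dialogue: speaker B starts before speaker A has finished.
dialogue = [("A", 0.0, 5.0), ("B", 4.0, 8.0), ("A", 9.0, 10.0)]
clean_regions = single_speaker_regions(dialogue)
print(clean_regions)
```

In this toy dialogue the overlap between 4.0 and 5.0 seconds and the silence between 8.0 and 9.0 seconds are both excluded from the result.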
A special type of information seeking dialogue, which is becoming increasingly
important, is the one between a human and a computer. In order to gain a clear
insight into the way people behave when they have to interact with computers,
in the absence of computers that can entertain such a conversation,
the Wizard of Oz technique was invented. This
technique will be briefly described in the next section.
In the children's novel The Wizard of Oz [Baum (1900)] a young girl is bullied
by an oracle called the Wizard of Oz. The crux of the story is that the Wizard
of Oz turns out to be nothing more than a device operated by a man. In the
Wizard of Oz technique a human plays the role of
the computer in a simulated human-computer
interaction. Of course, the easiest
way to learn about the way humans behave when they have to interact with
computers would be to actually have them interact with a computer. However, in
order to be able to build a computer system that can participate in a dialogue
with a human, one has to know how a human-computer interaction is likely to
proceed. The Wizard of Oz technique can be seen as an intermediate step in the
design of such a computer system. Because the subjects who participate in a
Wizard of Oz experiment have to be convinced
that they are actually talking to a computer, some precautions must be taken.
For example, the wizard simulating the computer should be talking with a
``computer voice'' (in the case that spoken output is required), and the wizard
should also make deliberate errors similar to the ones that a computer could
be expected to make in the application of interest.
As spoken language systems are rapidly approaching a performance level
that is acceptable for an increasing range of applications, it seems likely that
man-machine dialogue systems will be used more and more in the near future.
For the development of such systems speech data gathered in Wizard of
Oz experiments will be indispensable, as long as
at least one part of the system is not yet good enough for experiments with
large groups of users. A more comprehensive discussion of the Wizard of Oz
technique is given in Chapter 13.
As the performance of spoken language systems (SLSs) improves, the development of new applications will
be increasingly based on pilot experiments with a system in the loop,
i.e. with test versions of the application in which the wizard is replaced by
a computer system which has enough functionality to support the man-machine
interaction.
This type of speech material, in which speakers are allowed to freely choose their own words and their own subject of conversation, is the most natural, especially in a dialogue situation. Most remarks made in the previous sections also apply to the present one. As with all natural processes, the observer's paradox can play a role in the recording of spontaneous speech: in order to obtain speech that is as natural as possible, the researcher has to observe how people speak when they are not being observed [Labov (1972)]. To overcome this methodological paradox, several techniques have been proposed throughout the development of sociolinguistic research [Argente (1991)]: