The speech material in a corpus can vary from isolated sounds to complete conversations. In
general, the extent to which the experimenter has control over the speech material decreases as it
becomes more and more spontaneous and natural.
The term natural refers to a rather intuitive concept that can be interpreted in different ways.
We regard speech as maximally natural when two or more speakers have a
conversation in a familiar environment about a subject they
choose to talk about, since this is presumably the situation for which speech was ``invented''.
Although read aloud speech is a commonly used speaking style
(and may be
regarded as a natural speaking style from a sociolinguistic point of view),
we regard this style as derived from the most natural style mentioned above.
When reading a text, people have the tendency
to articulate more carefully than when they are involved in free conversation.
Thus, in our opinion the naturalness of speech should be
judged on a gradual scale.
It should be noted
that control over the speech material is not always necessary and may
even be counterproductive, especially when one wants to study the variation
of speech as a function of communicative context.
However, strict control over the speech material is required for some applications, such as the
development of speech synthesis systems.
In the following, eight types of speech data will be distinguished.
Vowels pronounced in isolation (or in a ``neutral'' context, such as /hVt/) are often used as the frame of reference for experiments in which vowels from connected speech are investigated. Continuant consonants, e.g. /l, r, w, j, n, m, s, f/, can also be pronounced in isolation. Non-continuants, e.g. /p, t, k, b, d, g/, must be followed or preceded by a vowel, e.g. the ``neutral'' schwa.
Isolated words can be either ``nonsense'' words
or existing words. In the case of nonsense words the
experimenter can create all possible kinds of phonotactically correct sound sequences. This gives the opportunity to
study coarticulation in a systematic way. Nonsense
words are also
used to extract models for a dictionary
of phonetic elements
when a synthesis system is developed.
When existing words are used, the number of possible sound sequences is
restricted to what is phonotactically appropriate in the lexicon of a given
language. It must be realised that
control over the sounds produced by the speakers may not be perfect, because the pronunciation of
words can be influenced by the stress pattern, which may
be ambiguous (cf. words like record).
When speakers have to read aloud a list of isolated words, their pronunciation may be influenced by the orthographic representation of the words, a phenomenon known as spelling pronunciation. Spelling pronunciation is especially apparent in languages which form nominal compounds; if sound sequences occur across the morpheme boundaries that would lead to assimilation and degemination in connected speech, one should still anticipate that in reading aloud all sounds are realised. This phenomenon can be circumvented by having the speakers name the words through the presentation of pictures, but this method can only be applied to a limited number of words. It is, for instance, not suitable for abstract concepts.
The carrier sentence is one type of isolated sentence.
Carrier sentences are often used when
one wants to get a somewhat more natural pronunciation of (nonsense)
words in comparison with
words spoken in isolation,
especially with respect to speech rate. The test words are embedded in the
carrier sentence, as illustrated
by the example
``I will say - a test word - again''. The same carrier is
used repeatedly for all occurring test words, so that the influence of the acoustic and
linguistic context on the test words is controlled.
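As an illustration, the construction of nonsense words and their embedding in a fixed carrier sentence can be sketched as follows. This is a minimal sketch under stated assumptions: the function names `cvc_items` and `embed` are hypothetical, the inventories are toy examples, and every consonant-vowel-consonant combination is assumed to be phonotactically acceptable, which will not hold for a real language without additional filtering.

```python
from itertools import product


def cvc_items(onsets, vowels, codas):
    """Return all onset+vowel+coda strings (toy CVC nonsense words).

    Assumption: every combination is phonotactically acceptable.
    """
    return ["".join(parts) for parts in product(onsets, vowels, codas)]


def embed(items, carrier="I will say {} again"):
    """Embed each test item in the same carrier sentence."""
    return [carrier.format(item) for item in items]


if __name__ == "__main__":
    # Toy inventories, chosen for illustration only.
    words = cvc_items(["p", "t"], ["a", "i"], ["k"])
    for prompt in embed(words):
        print(prompt)
```

Because the carrier is identical for every item, differences between recordings can be attributed to the test words rather than to their context.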
More natural speech material can be obtained when ``normal'' (linguistically meaningful) sentences are constructed by the experimenter. Such sentences can be used to train phoneme-based recognisers and to study, for instance, word stress or coarticulation in a relatively natural linguistic context. One should note that an isolated sentence may be interpreted by a speaker in a wider semantic context, which can influence the pronunciation of the sentence, especially with respect to the position of sentence accent(s). Sometimes a semantic relation between subsequent ``isolated'' sentences may arise as a result of the specific ordering of the speech material. Since individual speakers may imagine different semantic contexts for a specific sentence, variability in the suprasegmental features of the test sentences can occur. If desired, this variability can be reduced by using punctuation and other typographical means (for instance, capitals or boldface characters) to indicate words that should have a sentence accent. A more natural way of doing this is to let each sentence be preceded by a question that evokes sentence accents at the desired positions. It should be clear, however, that neither practice can be recommended in the collection of large corpora of telephone speech.
For many purposes, such as the development of a phoneme-based recogniser, it is crucial that all phonemes are represented in the speech corpus in sufficiently high numbers. Due to the large differences in frequency of occurrence of the phonemes in the language in general, uniform phoneme frequencies will not obtain in randomly chosen sentence material: such material will, instead, reflect the differences in phoneme frequencies. It is proposed to reserve the term phonetically balanced for speech material containing phonemes according to their frequency of occurrence in the general language. Phonetically balanced sentences may be used for speech audiometry and for testing the transmission characteristics of communication channels or public address systems.
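Whether a set of sentences is phonetically balanced in this sense can be checked by comparing its relative phoneme frequencies with those of a reference sample of the general language. The following is a minimal sketch, assuming that phonemic transcriptions are already available as lists of symbols. The function names are hypothetical, and the total variation distance is only one possible choice of measure, not one prescribed here.

```python
from collections import Counter


def relative_frequencies(transcriptions):
    """Relative frequency of each phoneme over a list of transcriptions."""
    counts = Counter(p for sentence in transcriptions for p in sentence)
    total = sum(counts.values())
    return {phoneme: n / total for phoneme, n in counts.items()}


def total_variation(p, q):
    """Half the summed absolute difference between two distributions.

    0.0 means identical distributions; 1.0 means no overlap at all.
    """
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


if __name__ == "__main__":
    # Toy data: symbols stand in for phonemes.
    candidate = relative_frequencies([["a", "b"], ["a", "a"]])
    reference = {"a": 0.5, "b": 0.5}
    print(candidate, total_variation(candidate, reference))
```

A small distance indicates that the material mirrors the reference frequencies. A set aiming at uniform phoneme frequencies would instead be compared with a uniform distribution.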
Approximately uniform phoneme frequency distributions can be achieved by using phonetically rich sentences. For that purpose greedy algorithms [Van Santen (1992)] can be used. Suppose you want to have a set of sentences in which each phoneme of the language of interest occurs at least once. Of course, you could try to create this set of sentences yourself, but this would be difficult and time-consuming. Furthermore, you might end up with sentences that look rather ``constructed''. An alternative would be to search for an appropriate set of sentences in a sufficiently large text corpus, for instance, a large amount of newspaper data on CD-ROM. An advantage of this procedure is that much more variation in the sentences is obtained. A greedy algorithm can be used to obtain a small, though not necessarily minimal, set of sentences containing all phonemes. The following steps have to be taken to get the desired test set:
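A greedy selection of this kind can be sketched as follows. This is an illustrative sketch and not necessarily the procedure of the cited work: it assumes that each candidate sentence has already been transcribed into phoneme symbols (for instance by means of a pronunciation lexicon), the function name `greedy_select` is hypothetical, and ties are broken by sentence identifier only to make the result reproducible. A greedy search yields a small covering set, but a truly minimal set is not guaranteed.

```python
def greedy_select(sentences):
    """Pick sentences until every phoneme in the pool is covered.

    sentences: dict mapping a sentence id to an iterable of phoneme symbols.
    Returns the list of chosen sentence ids, in order of selection.
    """
    candidates = {sid: set(phones) for sid, phones in sentences.items()}
    # Phonemes still to be covered: everything that occurs in the pool.
    remaining = set().union(*candidates.values()) if candidates else set()
    chosen = []
    while remaining:
        # Take the sentence that covers most of the still-missing phonemes;
        # sorting the ids makes tie-breaking deterministic.
        best = max(sorted(candidates),
                   key=lambda sid: len(candidates[sid] & remaining))
        chosen.append(best)
        remaining -= candidates.pop(best)
    return chosen


if __name__ == "__main__":
    # Toy pool: letters stand in for phonemes.
    pool = {
        "s1": ["a", "b", "c"],
        "s2": ["c", "d"],
        "s3": ["d", "e", "f", "a"],
        "s4": ["b"],
    }
    print(greedy_select(pool))
```

The same loop can be adapted to require several occurrences of each phoneme, or to cover diphones instead of phonemes, by changing what is counted as an element to be covered.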
The naturalness of the produced speech may increase even more when speakers read aloud a series of sentences that are semantically related, provided that the subject is able and used to reading aloud paragraph length material. The prompting material can consist of a text fragment taken from, for instance, a newspaper or a book (e.g. BREF, Wall Street Journal). But the text fragment can also be created by the experimenter, when it is necessary to impose some specific restrictions on the speech material, for instance with respect to phonemic structure, word structure, or syntactic structure. Reading aloud a text fragment is more difficult than reading aloud a list of isolated sentences. It is very likely that the speech produced by different speakers who are asked to read a text fragment will vary considerably, especially with regard to aspects like vividness, speech rate, omitted speech segments, prosody, etc. The preferred position of sentence accents in a text fragment can be indicated with capitals or boldface characters. This is not recommended if one is interested in more natural speech.
When speech corpora are gathered for commercial applications, a common task of speakers is to read numbers or alpha-numerical expressions, such as ZIP-codes. Speakers have to some extent the freedom to pronounce these numbers or alpha-numerical expressions as they like. For example, there appear to be substantial differences between the ways in which subjects express telephone numbers. Some may read the telephone number as a string of digits, whereas others may read it as a string of numbers containing two or more digits. In addition, it may make a difference whether the telephone number is familiar (for instance, a friend's number), or unfamiliar. The POLYPHONE corpora are good examples of corpora that contain such semi-spontaneous speech.
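The amount of variation that this freedom allows can be made concrete by enumerating the ways in which a digit string can be split into groups. The sketch below is illustrative only: the function name `groupings` is hypothetical, and the restriction to groups of at most two digits is an assumption, since speakers may also use longer groups or expressions such as ``double five''.

```python
def groupings(digits, max_len=2):
    """List every split of a digit string into groups of 1..max_len digits."""
    if not digits:
        return [[]]
    result = []
    for n in range(1, min(max_len, len(digits)) + 1):
        head = digits[:n]
        for rest in groupings(digits[n:], max_len):
            result.append([head] + rest)
    return result


if __name__ == "__main__":
    for option in groupings("123"):
        print(" ".join(option))
```

Even a short string admits several readings, and the number of possibilities grows quickly with its length, which is worth bearing in mind when the transcription conventions and the lexicon for such a corpus are designed.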
The previous types of speech material were all concerned with the reading
aloud of some piece of text by one speaker at a time (disregarding
the naming of words through the presentation of pictures). In the present
section we will discuss spontaneous speech from one or more
speakers. The major difference between read speech and spontaneous
speech is that the former fixes vocabulary and
syntax, whereas the latter leaves speakers free to choose their
own vocabulary and syntax. The naturalness of
the produced speech increases when speakers are allowed to choose their own
words. In order to keep some control over the speech material, the
experimenter can determine the subject the speaker has to talk about.
The subject of conversation is relatively fixed when speakers are asked to retell a story that they heard or read shortly before. Since it is likely that speakers will use at least some of the words that occurred in the story, this method allows the experimenter to gather ``spontaneously'' spoken versions of specific words of interest. In a variant of this method, speakers can be asked to invent a story based on a cartoon (without text balloons), or on some complex picture that is bound to evoke the words of interest. In all these designs, monologues are involved, although a session manager may try to guide the discourse in the desired direction. However, one should be aware that many naive subjects do not feel at ease in a situation in which they must maintain a monologue for an extended period of time. Most people feel much more comfortable in a dialogue situation. Moreover, interview situations provide some additional control over subjects' speech, because the interviewer determines the subject of conversation, and subsequently guides the conversation in the desired direction.
Another kind of guided spontaneous speech is the information dialogue, in which people attempt to obtain information about, for instance, train or plane schedules. Speakers request information from an information agent or a computer system about time and place of departure, destination, etc. In this way spontaneous speech can be obtained, even if it concerns a very restricted subject. This paradigm is used in the (D)ARPA Air Line Travel Information System (ATIS) task. Train timetable information dialogues are now being recorded in several languages (e.g. German, Dutch, Italian, French and British English) in the MLAP projects MAIS and RAILTEL.
Although a speech situation with two or more people is more natural than a monologue, overlapping acoustic material may result from several people speaking simultaneously. For some applications, such as research on basic speech processes, overlapping acoustic material is difficult or impossible to use. Of course, one can try to extract speech fragments from recorded dialogues in which only a single speaker is talking. The study of simultaneous speech from two or more speakers is important for research on dialogue or discourse analysis, intention analysis, and spoken language understanding. The gathering of multiple simultaneous speaker corpora is still in its infancy. Such corpora are indispensable for studying speech in all its relevant aspects. In addition, speech recognisers, which are up to now only able to deal with one speaker at a time, would eventually also have to be able to deal with different speakers talking simultaneously. Speech corpora containing dialogues could supply the training and testing data for such advanced recognisers. To make such corpora useful for research and development purposes each individual speaker should be recorded on a separate track, using a microphone array with very high directional sensitivity. Additional tracks can then be synthesised, simulating less perfect directional sensitivity. Alternatively, subjects could be recorded in a teleconference, although such distributed recordings would require extensive precautions to allow one to synchronise the tracks originating from completely independent recorders.
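The synthesis of such additional tracks can be sketched in a very simple form: a fraction of every other speaker's signal is added to each speaker's own track. This is a rough sketch under strong assumptions: the tracks are taken to be synchronised and of equal length, the leakage is modelled as a single gain factor without delay, filtering or room acoustics, and the function name `add_crosstalk` and its `leak` parameter are hypothetical.

```python
def add_crosstalk(tracks, leak=0.1):
    """Simulate imperfect channel separation between per-speaker tracks.

    tracks: list of equal-length lists of samples, one list per speaker.
    leak:   gain with which every other speaker bleeds into a track.
    Returns new tracks; the input is left unchanged.
    """
    if not tracks:
        return []
    n = len(tracks[0])
    if any(len(track) != n for track in tracks):
        raise ValueError("all tracks must have the same number of samples")
    mixed = []
    for i, own in enumerate(tracks):
        out = list(own)
        for j, other in enumerate(tracks):
            if j != i:
                for k in range(n):
                    out[k] += leak * other[k]
        mixed.append(out)
    return mixed


if __name__ == "__main__":
    clean = [[1.0, 0.0], [0.0, 2.0]]  # two speakers, two samples each
    print(add_crosstalk(clean, leak=0.5))
```

Increasing `leak` towards 1.0 approaches the situation of a single shared microphone, so that a series of progressively harder test conditions can be derived from one clean multi-track recording.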
A special type of information seeking dialogue, which is becoming increasingly important, is the one between a human and a computer. In order to gain a clear insight into the way people behave when they have to interact with computers, in the absence of computers that can entertain such a conversation, the Wizard of Oz technique was invented. This technique will be briefly described in the next section.
In the children's novel The Wizard of Oz [Baum (1900)] a young girl is bullied
by an oracle called the Wizard of Oz. The crux of the story is that the Wizard
of Oz turns out to be nothing more than a device operated by a man. In the
Wizard of Oz technique a human plays the role of
the computer in a simulated human-computer
interaction. Of course, the easiest
way to learn about the way humans behave when they have to interact with
computers would be to actually have them interact with a computer. However, in
order to be able to build a computer system that can participate in a dialogue
with a human, one has to know how a human-computer interaction is likely to
proceed. The Wizard of Oz technique can be seen as an intermediate step in the
design of such a computer system. Because the subjects who participate in a
Wizard of Oz experiment have to be convinced
that they are actually talking to a computer, some precautions must be taken.
For example, the wizard simulating the computer should be talking with a
``computer voice'' (in the case that spoken output is required), and the wizard
should also make deliberate errors similar to the ones that a computer could
be expected to make in the application of interest.
As spoken language systems are rapidly approaching a performance level that is acceptable for an increasing range of applications, it seems likely that man-machine dialogue systems will be used more and more in the near future. For the development of such systems speech data gathered in Wizard of Oz experiments will be indispensable, as long as at least one part of the system is not yet good enough for experiments with large groups of users. A more comprehensive discussion of the Wizard of Oz technique is given in Chapter 13.
As the performance of spoken language systems (SLSs) improves, the development of new applications will be increasingly based on pilot experiments with a system in the loop, i.e. with test versions of the application in which the wizard is replaced by a computer system which has enough functionality to support the man-machine interaction.
This type of speech material, in which speakers are allowed to freely choose their own words and their own subject of conversation, is the most natural, especially in a dialogue situation. Most remarks that were made in the previous sections also apply to the present one. As with all natural processes, the observer's paradox can play a role in the recording of spontaneous speech: in order to obtain speech that is as natural as possible, the researcher has to observe how people speak when they are not being observed [Labov (1972)]. To overcome this methodological paradox, several techniques have been proposed throughout the development of sociolinguistic research [Argente (1991)]: