
Different types of speech data


The speech material in a corpus can vary from isolated sounds to complete conversations. In general, the experimenter's control over the speech material decreases as the speech becomes more spontaneous and natural. The term natural refers to a rather intuitive concept that can be interpreted in different ways. We regard speech as maximally natural when two or more speakers have a conversation in a familiar environment about a subject they themselves choose to talk about, since this is presumably the situation for which speech was ``invented''. Although read aloud speech is a commonly used speaking style (and may be regarded as a natural speaking style from a sociolinguistic point of view), we regard this style as derived from the most natural style mentioned above. When reading a text, people tend to speak more formally and to articulate more carefully than when they are involved in free conversation. Thus, in our opinion the naturalness of speech should be judged on a gradual scale. It should be noted that control over the speech material is not always necessary and may even be counterproductive, especially when one wants to study the variation of speech as a function of communicative context. However, strict control over the speech material is required for some applications, such as the development of speech synthesis systems.
In the following, eight types of speech data will be distinguished.


Read aloud isolated phonemes  

Vowels pronounced in isolation (or in a ``neutral'' context, such as /hVt/) are often used as the frame of reference for experiments in which vowels from connected speech are investigated. Continuant consonants, e.g. /l, r, w, j, n, m, s, f/, can also be pronounced in isolation. Non-continuants, e.g. /p, t, k, b, d, g/, must be followed or preceded by a vowel, e.g. the ``neutral'' schwa.

Read aloud isolated words


Isolated words can be either ``nonsense'' words or existing words. In the case of nonsense words, the experimenter can create all possible kinds of phonotactically correct sound sequences. This gives the opportunity to study coarticulation in a systematic way. Nonsense words are also used to extract models for a dictionary of phonetic elements when a synthesis system is developed. When existing words are used, the number of possible sound sequences is restricted to what is phonotactically appropriate in the lexicon of a given language. It must be realised that control over the sounds produced by the speakers may not be perfect, because the pronunciation of polysyllabic words can be influenced by the stress pattern, which may be ambiguous (cf. words like record in English).
When speakers have to read aloud a list of isolated words, their pronunciation may be influenced by the orthographic representation of the words, a phenomenon known as spelling pronunciation. Spelling pronunciation is especially apparent in languages which form nominal compounds: if sound sequences occur across the morpheme boundaries that would lead to assimilation and degemination in connected speech, one should still anticipate that in reading aloud all sounds are realised. This phenomenon can be circumvented by having the speakers name the words through the presentation of pictures, but this method can only be applied to a limited number of words. It is, for instance, not suitable for abstract concepts.

Read aloud isolated sentences


The carrier sentence is one type of isolated sentence. Carrier sentences are often used when one wants to get a somewhat more natural pronunciation of (nonsense) words in comparison with words spoken in isolation, especially with respect to speech rate. The test words are embedded in the carrier sentence, as illustrated by the example ``I will say - a test word - again''. The same carrier sentence is used repeatedly for all occurring test words, so that the influence of the acoustic and linguistic context on the test words is controlled.
More natural speech material can be obtained when ``normal'' (linguistically meaningful) sentences are constructed by the experimenter. Such sentences can be used to train phoneme-based recognisers and to study, for instance, word stress or coarticulation in a relatively natural linguistic context. One should note that an isolated sentence may be interpreted by a speaker in a wider semantic context, which can influence the pronunciation of the sentence, especially with respect to the position of sentence accent(s). Sometimes a semantic relation between subsequent ``isolated'' sentences may arise as a result of the specific ordering of the speech material. Since individual speakers may imagine different semantic contexts for a specific sentence, variability in the suprasegmental features of the test sentences can occur. If desired, this variability can be reduced by using punctuation and other typographical means (for instance, capitals or boldface characters) to indicate words that should have a sentence accent. A more natural way of doing this is to let each sentence be preceded by a question that evokes sentence accents at the desired positions. It should be clear, however, that neither practice can be recommended in the collection of large corpora of telephone speech.
For many purposes, such as the development of a phoneme-based recogniser, it is crucial that all phonemes are represented in the speech corpus in sufficiently high numbers. Because phonemes differ greatly in their frequency of occurrence in the language in general, randomly chosen sentence material will not yield uniform phoneme frequencies: such material will, instead, reflect the differences in phoneme frequencies. It is proposed to reserve the term phonetically balanced for speech material containing phonemes in proportion to their frequency of occurrence in the general language. Phonetically balanced sentences may be used for speech audiometry and for testing the transmission characteristics of communication channels or public address systems.
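Whether a given set of sentences is closer to phonetically balanced or to uniform can be checked by computing its relative phoneme frequencies. The following is a minimal sketch only: the function name `phoneme_distribution` and the input format (one list of phoneme symbols per sentence, as produced by some transcription step) are assumptions made for illustration.

```python
from collections import Counter


def phoneme_distribution(transcriptions):
    """Return the relative frequency of each phoneme symbol.

    `transcriptions` is assumed to be a list of sentences, each given
    as a list of phoneme symbols (hypothetical input format).
    """
    # Count every phoneme token across all sentences.
    counts = Counter(p for sentence in transcriptions for p in sentence)
    total = sum(counts.values())
    # An empty input yields an empty distribution (no division occurs).
    return {p: n / total for p, n in counts.items()}


# Toy material: two "sentences" with made-up transcriptions.
toy = [["k", "a", "t"], ["t", "a"]]
distribution = phoneme_distribution(toy)
```

The resulting distribution can then be compared with reference frequencies for the general language (balanced material) or with a flat distribution (rich material).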
Approximately uniform phoneme frequency distributions can be achieved by using phonetically rich sentences. For that purpose greedy algorithms [Van Santen (1992)] can be used. Suppose you want to have a set of sentences in which each phoneme of the language of interest occurs at least once. Of course, you could try to create this set of sentences yourself, but this would be difficult and time-consuming. Furthermore, you might end up with sentences that look rather ``constructed''. An alternative would be to search for an appropriate set of sentences in a sufficiently large text corpus, for instance, a large amount of newspaper data on CD-ROM. An advantage of this procedure is that much more variation in the sentences is obtained. A greedy algorithm can be used to obtain a small set of sentences containing all phonemes; the result approximates the minimum number of sentences, but is not guaranteed to reach it. The following steps have to be taken to get the desired test set:

  1. Use a grapheme-to-phoneme converter in order to be able to search for phonemes instead of characters.
  2. Select the sentence in the corpus with the largest number of phonemes, not counting phonemes that are repeated within the same sentence.
  3. Select each next sentence as the one with the largest number of phonemes that have not yet been covered. Stop this procedure when the entire set of phonemes has been covered.
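The steps above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the procedure used in any particular corpus project: the function name `greedy_select` is hypothetical, the grapheme-to-phoneme conversion of step 1 is assumed to have been done already, and the phoneme inventory is taken to be every phoneme that occurs somewhere in the corpus.

```python
def greedy_select(sentence_phonemes):
    """Select sentences until every phoneme in the corpus is covered.

    `sentence_phonemes` maps each sentence to the set of distinct
    phonemes it contains (assumed output of step 1).
    """
    # Assumption: the target inventory is whatever occurs in the corpus.
    uncovered = set().union(*sentence_phonemes.values())
    remaining = dict(sentence_phonemes)
    selected = []
    while uncovered and remaining:
        # Steps 2 and 3: take the sentence that covers the largest
        # number of phonemes not yet covered (ties: first one found).
        best = max(remaining, key=lambda s: len(remaining[s] & uncovered))
        selected.append(best)
        uncovered -= remaining.pop(best)
    return selected


# Toy corpus with made-up sentence labels and phoneme symbols.
toy_corpus = {
    "s1": {"a", "b", "c"},
    "s2": {"a", "b"},
    "s3": {"c", "d"},
    "s4": {"d"},
}
chosen = greedy_select(toy_corpus)  # ["s1", "s3"]
```

Because each sentence is stored with a set of phonemes, repetitions within a sentence are ignored automatically, as step 2 requires.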
To obtain more occurrences of each phoneme, the procedure described above can be repeated any number of times with the remaining sentences in the text corpus. Of course, the greedy algorithm can also be used for other basic units than phonemes, for instance: characters, diphones, vowels in specific consonantal contexts, subsets of words, or specific discourse units. The greedy algorithm can be amended in various ways. For example, one can maximise coverage of high frequency units by using the sum of the frequencies of the units in a sentence as criterion. This may be important when complete coverage of all units is impossible, in which case one would like to cover the most frequent units first. Furthermore, all kinds of constraints can be imposed on the sentences that are selected, for instance with respect to their length, word material, syntactic structure, etc. Note that you can also choose other contexts for the basic units than sentences, such as clauses or words. For example, you might want to search the text corpus for the minimal set of words in which each phoneme occurs at least once.
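The frequency-based amendment can be sketched in the same style. The names `greedy_select_weighted`, `unit_freq` and `max_sentences` are hypothetical, and the sentence budget is an assumption added here to model the case in which complete coverage is out of reach; the units may be phonemes, diphones, or any other basic unit.

```python
def greedy_select_weighted(sentence_units, unit_freq, max_sentences):
    """Greedy selection that favours high-frequency units.

    `sentence_units` maps each sentence to its set of units,
    `unit_freq` maps each unit to its frequency in the language, and
    `max_sentences` caps the number of sentences (assumed budget).
    """
    uncovered = set().union(*sentence_units.values())
    remaining = dict(sentence_units)
    selected = []

    def gain(sentence):
        # Criterion: summed frequency of the still-uncovered units.
        return sum(unit_freq.get(u, 0) for u in remaining[sentence] & uncovered)

    while uncovered and remaining and len(selected) < max_sentences:
        best = max(remaining, key=gain)
        if gain(best) == 0:
            break  # nothing of value left to cover
        selected.append(best)
        uncovered -= remaining.pop(best)
    return selected


# Toy data with made-up units and frequencies.
toy_units = {"s1": {"x", "y"}, "s2": {"e"}, "s3": {"e", "x"}}
toy_freq = {"e": 100, "x": 5, "y": 1}
first = greedy_select_weighted(toy_units, toy_freq, max_sentences=1)  # ["s3"]
```

With a budget of one sentence, the sketch prefers the sentence containing the most frequent units over the one containing the largest number of units, which is the behaviour described above.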
It should be clear that very large text corpora may be needed to obtain a sufficient number of phonetically rich sentences. In some corpora phonetically rich does not only mean that an attempt has been made to obtain uniform phoneme frequencies, but also uniform diphone or triphone frequencies. When designing phonetically rich sentences intended to be read by members of the general public (e.g. in the POLYPHONE corpora), care must be taken to avoid very long sentences, because these are extremely difficult to read aloud. Moreover, all sentences must be checked for very rare words (which are likely to cause reading problems) and for contents which are potentially insulting. In POLYPHONE, candidate sentences had to contain at least four words and a maximum of 80 characters, including spaces and punctuation marks. The latter condition guarantees that the prompting text will not span more than two lines on the prompting sheet (40 characters per line). Selection of sentences on the basis of length and phonemic contents can be done automatically. However, checking for insulting contents must be done manually.
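The automatic part of this selection, the length check, can be sketched as below. The function name `passes_length_check` and its parameter names are hypothetical; the default thresholds are the POLYPHONE values quoted above, and splitting on whitespace is an assumed definition of a word.

```python
def passes_length_check(sentence, min_words=4, max_chars=80):
    """Return True if a candidate sentence meets the length constraints.

    Words are counted by splitting on whitespace (an assumption);
    characters include spaces and punctuation marks.
    """
    enough_words = len(sentence.split()) >= min_words
    short_enough = len(sentence) <= max_chars
    return enough_words and short_enough


# Made-up candidate sentences.
candidates = [
    "The train leaves at nine.",   # 5 words, short: kept
    "Good morning everyone.",      # only 3 words: rejected
    "word " * 20,                  # 100 characters: rejected
]
kept = [s for s in candidates if passes_length_check(s)]
```

Checks for rare words and for insulting contents are not covered by this sketch.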

Read aloud text fragments


The naturalness of the produced speech may increase even more when speakers read aloud a series of sentences that are semantically related, provided that the subject is able and used to reading aloud paragraph-length material. The prompting material can consist of a text fragment taken from, for instance, a newspaper or a book (e.g. BREF, Wall Street Journal). But the text fragment can also be created by the experimenter, when it is necessary to impose some specific restrictions on the speech material, for instance with respect to phonemic structure, word structure, or syntactic structure. Reading aloud a text fragment is more difficult than reading aloud a list of isolated sentences. It is very likely that the speech produced by different speakers who are asked to read a text fragment will vary considerably, especially with regard to aspects like vividness, speech rate, omitted speech segments, prosody, etc. The preferred position of sentence accents in a text fragment can be indicated with capitals or boldface characters. This is not recommended if one is interested in more natural speech.

Semi-spontaneous speech


When speech corpora are gathered for commercial applications, a common task of speakers is to read numbers or alpha-numerical expressions, such as ZIP-codes. Speakers have some freedom to pronounce these numbers or alpha-numerical expressions as they like. For example, there appear to be substantial differences between the ways in which subjects express telephone numbers. Some may read the telephone number as a string of digits, whereas others may read it as a string of numbers containing two or more digits. In addition, it may make a difference whether the telephone number is familiar (for instance, a friend's number), or unfamiliar. The POLYPHONE corpora are good examples of corpora that contain such semi-spontaneous speech.

Spontaneous speech about a predetermined subject


The previous types of speech material were all concerned with the reading aloud of some piece of text by one speaker at a time (disregarding the naming of words through the presentation of pictures). In the present section we will discuss spontaneous speech from one or more speakers. The major difference between read speech and spontaneous speech is that the former fixes vocabulary and syntax, whereas the latter leaves speakers free to choose their own vocabulary and syntax. The naturalness of the produced speech increases when speakers are allowed to choose their own words. In order to keep some control over the speech material, the experimenter can determine the subject the speaker has to talk about.
The subject of conversation is relatively fixed when speakers are asked to retell a story that they heard or read shortly before. Since it is likely that speakers will use at least some of the words that occurred in the story, this method allows the experimenter to gather ``spontaneously'' spoken versions of specific words of interest. In a variant of this method, speakers can be asked to invent a story based on a cartoon (without text balloons), or on some complex picture that is bound to evoke the words of interest. In all these designs, monologues are involved, although a session manager may try to guide the discourse in the desired direction. However, one should be aware that many naive subjects do not feel at ease in a situation in which they must maintain a monologue for an extended period of time. Most people feel much more comfortable in a dialogue situation. Moreover, interview situations provide some additional control over subjects' speech, because the interviewer determines the subject of conversation, and subsequently guides the conversation in the desired direction.
Another kind of guided spontaneous speech is the information dialogue, in which people attempt to obtain information about, for instance, train or plane schedules. Speakers request information from an information agent or a computer system about time and place of departure, destination, etc. In this way spontaneous speech can be obtained, even if it concerns a very restricted subject. This paradigm is used in the (D)ARPA Air Line Travel Information System (ATIS) task. Train timetable information dialogues are now being recorded in several languages, e.g. German, Dutch, Italian, French, British English etc., in the MLAP projects MAIS and RAILTEL.
Although a speech situation with two or more people is more natural than a monologue, overlapping acoustic material may result from several people speaking simultaneously. For some applications, such as research on basic speech processes, overlapping acoustic material is difficult or impossible to use. Of course, one can try to extract speech fragments from recorded dialogues in which only a single speaker is talking. The study of simultaneous speech from two or more speakers is important for research on dialogue or discourse analysis, intention analysis, and spoken language understanding. The gathering of multiple simultaneous speaker corpora is still in its infancy. Such corpora are indispensable for studying speech in all its relevant aspects. In addition, speech recognisers, which are up to now only able to deal with one speaker at a time, would eventually also have to be able to deal with different speakers talking simultaneously. Speech corpora containing dialogues could supply the training and testing data for such advanced recognisers. To make such corpora useful for research and development purposes each individual speaker should be recorded on a separate track, using a microphone array with very high directional sensitivity. Additional tracks can then be synthesised, simulating less perfect directional sensitivity. Alternatively, subjects could be recorded in a teleconference, although such distributed recordings would require extensive precautions to allow one to synchronise the tracks originating from completely independent recorders.
A special type of information seeking dialogue, which is becoming increasingly important, is the one between a human and a computer. In order to gain a clear insight into the way people behave when they have to interact with computers, in the absence of computers that can entertain such a conversation, the Wizard of Oz technique was invented. This technique will be briefly described in the next section.

The Wizard of Oz technique


In the children's novel The Wizard of Oz [Baum (1900)] a young girl is intimidated by an oracle called the Wizard of Oz. The crux of the story is that the Wizard of Oz turns out to be nothing more than a device operated by a man. In the Wizard of Oz technique a human plays the role of the computer in a simulated human-computer interaction. Of course, the easiest way to learn about the way humans behave when they have to interact with computers would be to actually have them interact with a computer. However, in order to be able to build a computer system that can participate in a dialogue with a human, one has to know how a human-computer interaction is likely to proceed. The Wizard of Oz technique can be seen as an intermediate step in the design of such a computer system. Because the subjects who participate in a Wizard of Oz experiment have to be convinced that they are actually talking to a computer, some precautions must be taken. For example, the wizard simulating the computer should talk with a ``computer voice'' (if spoken output is required), and the wizard should also make deliberate errors similar to the ones that a computer could be expected to make in the application of interest.
As spoken language systems are rapidly approaching a performance level that is acceptable for an increasing range of applications, it seems likely that man-machine dialogue systems will be used more and more in the near future. For the development of such systems speech data gathered in Wizard of Oz experiments will be indispensable, as long as at least one part of the system is not yet good enough for experiments with large groups of users. A more comprehensive discussion of the Wizard of Oz technique is given in Chapter 13.
As the performance of spoken language systems (SLSs) improves, the development of new applications will be increasingly based on pilot experiments with a system in the loop, i.e. with test versions of the application in which the wizard is replaced by a computer system which has enough functionality to support the man-machine interaction.


Spontaneous speech

This type of speech material, in which speakers are allowed to freely choose their own words and their own subject of conversation, is the most natural, especially in a dialogue situation. Most remarks that were made in the previous sections also apply to the present one. As with all natural processes, the observer's paradox can play a role in the recording of spontaneous speech: in order to obtain speech that is as natural as possible, the researcher has to observe how people speak when they are not being observed [Labov (1972)]. To overcome this methodological paradox, several techniques have been proposed throughout the development of sociolinguistic research [Argente (1991)]:


EAGLES SWLG SoftEdition, May 1997.