Speech corpora for research purposes

The speech corpora needed for scientific purposes can be very diverse. Some researchers may need carefully pronounced lists of words to study a specific hypothesis about speech production; others may want to study samples of the vernacular, the way people speak in their everyday life. The following sections mention some of the major scientific fields with an interest in spoken language corpora.

Phonetic research

In phonetic research all aspects of speech are studied. Phonetic experiments often require carefully controlled speech data, especially when basic phenomena, such as coarticulation, have to be studied in a systematic way. In this type of research, more often than not the researcher will have no alternative to collecting new data, specifically designed for the investigation at hand. However, in recent years more and more attention has been paid to uncontrolled (or less controlled) forms of speech as well, because researchers have begun to realise that results obtained for carefully pronounced speech cannot simply be generalised to more casual speech. This type of research, which also requires other experimental designs and other statistical test procedures, will profit considerably from existing corpora. Moreover, since the corpora that can support this type of research must of necessity be very large, it is very unlikely that an individual researcher will have the opportunity to collect new, project-specific corpora.

Sociolinguistic research

In sociolinguistic research variation in language use is studied in heterogeneous communities, especially urban ones. Variables of interest are, among others, age, sex (gender), and social status. Three common methods to gather data in this research field are:

BY MEANS OF WRITTEN QUESTIONNAIRES
Members of the communities of interest might, for instance, have to indicate on a questionnaire how they pronounce specific words, or whether they use certain sociolect variants of words. A large drawback of this method is that many people are not aware of their pronunciation habits, or the sociolect variants they use. Furthermore, people might regard their actual language use as undesirable and vulgar, and pretend they use a more prestigious form of language.
BY OBSERVATIONS OF THE INVESTIGATOR
This strategy was, for instance, used by William Labov to investigate the occurrence of /r/ deletions in New York English [Labov (1972)]. He simply wrote down whether his informants pronounced an /r/ or not in specific words. The major drawback of this method is that the data collection is based on the subjective (and possibly biased) observations of a single person. In addition, the phenomenon of interest is only heard once at a possibly unexpected moment and in a possibly noisy environment (Labov, for instance, did an investigation in department stores).
BY COLLECTING SPEECH CORPORA
The gathering of speech corpora offers sociolinguists the opportunity to make detailed and reliable analyses of various phenomena of interest. Perceptual evaluation of pronunciation phenomena could be supplemented or replaced with acoustic measurements. This is especially useful when the differences between pronunciation variants are very subtle, as for instance in the case of a slightly varying vowel colour [Labov (1994)].

Dialect research is closely related to sociolect research. In dialect studies variation in language use due to differences in geographical background of speakers is investigated. Since the methods of data collection are similar to the ones used in sociolect research, the remarks made above also apply to dialect research.

Psycholinguistic research

Psycholinguistics is a very broad scientific field in which the psychology of language is studied, from language acquisition by children, through the mental processes underlying adult comprehension and production of speech, to language disorders.

Psycholinguistic experiments sometimes involve carefully controlled speech material, for instance in on-line phoneme monitoring or gating experiments. In phoneme monitoring experiments subjects are asked to spot the first occurrence of a specific phoneme in a spoken utterance, and press a button as soon as they have spotted it. The reaction time between the actual occurrence of the phoneme and the subject's response is used to form hypotheses about underlying mental processes. In gating experiments a progressively larger portion of words is presented to listeners, who are asked to predict what the ending will be. Both techniques can be useful to get more insight into the organisation of the mental lexicon [Aitchison (1994)].
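The reaction-time measure used in phoneme monitoring can be illustrated with a minimal sketch. The trial values, the function name `reaction_times`, and the rule of discarding button presses that precede the phoneme are all assumptions made for illustration; they are not taken from any particular experiment.

```python
from statistics import mean


def reaction_times(trials):
    """Return per-trial reaction times in milliseconds.

    Each trial is a (phoneme_onset_ms, button_press_ms) pair, both measured
    from the start of the utterance. Presses that occur before the phoneme
    onset are treated as anticipations and dropped (an assumed convention).
    """
    return [press - onset for onset, press in trials if press >= onset]


# Hypothetical trials: (target phoneme onset, button press), in ms.
trials = [(820, 1265), (1430, 1810), (650, 640), (990, 1420)]

rts = reaction_times(trials)
print(rts)               # [445, 380, 430]
print(round(mean(rts)))  # 418
```

Differences in mean reaction time between experimental conditions, rather than the absolute values, would normally be what is interpreted.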
Another way to obtain information about the mental lexicon and speech production processes is to study the dysfluencies in spontaneous speech. For example, false starts tell us something about the way in which speech is planned and articulated. Repetitions of words or word fragments also give information about the production and representation of speech. For this type of research, spontaneous speech corpora are very useful. For more information about planning processes of speech see [Levelt (1989)].
Yet another way to gather cues about the mental lexicon is to study ``slips of the tongue''. Many tongue-slip collectors carry round a small notebook in which they write down errors whenever they hear them, on a bus, at parties, etc. As noted for investigator observations in the section on sociolinguistic research, data acquired in this way are subjective and unreliable. Speech corpora containing spontaneous speech samples would solve this problem, but investigations in this research area would only benefit from extremely large spontaneous speech corpora, because the number of slips of the tongue produced in any one hour of speech is fairly small.
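The size argument can be made concrete with a rough calculation. The text gives no figure for how often slips occur, so the rates below are purely hypothetical placeholders, as is the function name `hours_needed`; the sketch only shows how quickly the required amount of speech grows when the phenomenon is rare.

```python
import math


def hours_needed(target_slips, slips_per_hour):
    """Hours of spontaneous speech needed to expect `target_slips` slips.

    `slips_per_hour` is an assumed average rate, not a measured value.
    """
    if slips_per_hour <= 0:
        raise ValueError("slips_per_hour must be positive")
    return math.ceil(target_slips / slips_per_hour)


# Hypothetical rates, chosen only to illustrate the scaling.
print(hours_needed(500, 2.0))  # 250
print(hours_needed(500, 0.5))  # 1000
```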

First language acquisition

Language acquisition by children is a subject of investigation in many disciplines, among them linguistics and psychology. For example, the speech of (young) children can be used to investigate (ir)regularities in language (linguistics); it can also be used to learn more about the mental organisation of language (psycholinguistics); it can be studied in relation to the sociolinguistic background of children; or it can be used to gain more insight into basic phonetic processes. All these scientific fields, as well as technologies oriented towards early learning, would benefit from extensive corpora containing speech of children.
Collecting language acquisition corpora is extremely time consuming and expensive, because of the difficulty in transcribing the speech, especially speech of very young children. In (psycho-)linguistics a considerable amount of work has been done to collect and transcribe corpora, and to make them available to the research community. Presently, only transcriptions are readily accessible (e.g. the CHILDES transcriptions of [MacWhinney (1995)]).
In the case of toddlers only ``spontaneous'' speech samples can be obtained. As soon as children get somewhat older, more controlled forms of speech can also be obtained, such as naming pictures or reading texts. Game playing is another way of eliciting quasi-controlled speech.
Speech acquisition corpora must preferably be longitudinal, i.e. the same person must be recorded repeatedly at consecutive stages in the acquisition process.

Second language acquisition

Migration between language areas is as old as history, and no doubt much older. Depending on their practical and social status, migrants may be hindered by their lack of adequate knowledge and command of the majority language in their new home countries. Now that low-education jobs are becoming increasingly rare in First World countries this situation has become economically and politically significant. Because command of the language is a prerequisite to education, the study of how immigrants learn to master the language of the host country (the ``second'' language) has become an important topic in sociolinguistic research. The European Science Foundation, for instance, has sponsored a large scale project on second language learning in several Western European countries. The research was corpus based: large numbers of migrants were recorded each fortnight for over a year. Transcripts and audio tapes comprising this corpus are maintained by the Max Planck Institute for Psycholinguistics in Nijmegen, The Netherlands.
It is especially important to study second language acquisition of immigrant children in order to find out how this might influence their educational progress. In a similar vein, research into the acquisition of the majority language is needed in ``second generation children'' who grow up in families which still use the language of their country of origin.

Since immigrants form a minority group in the country they reside in, their native language can be strongly influenced by the second language. For the investigation of these so-called language attrition processes special purpose corpora must be (and have been) collected. In this context one must not only think of African and Asian migrants who are living in the U.S.A. or Western Europe, but also of non-Anglo Europeans who moved to the U.S.A., Canada or Australia.

From a psycholinguistic point of view, it is interesting to study how the different lexicons are organised in the minds of bilingual (and multilingual) speakers. For example, the occurrence of ``blends'' (combinations of two words, in this case from different languages) shows that words are subconsciously activated in both languages [Green (1986)]. Up to now, much of the research into bilingual lexicons has taken the form of controlled experiments (e.g. cross language priming in lexical decision tasks). It is conceivable, though, that large corpora of spontaneous speech of bilinguals could be used to study lexical and syntactic interferences between the languages.

Large corpora of speech of second language learners may also be very interesting for the development of tools for second language learning. For example, types of grammatical and pronunciation errors can be identified. Knowledge of these errors may be helpful for the development of language instruction materials, which might include spoken examples of actual errors (to be corrected by the learners).

General linguistic research

A substantial part of modern linguistic research since the 1960s has been based on Chomsky's ``generative paradigm''. The goal of this so-called mentalistic research programme is to eventually understand the competence of language users, i.e. their abstract knowledge of the language system. What speakers and hearers actually do, i.e. their performance, is usually of less interest to linguists in Chomsky's tradition. The construction of competence models is generally based on introspection and impressionistic ideas about language use. So, in its strictest form mentalistic linguistic research cannot benefit much from speech corpora that contain samples of the performance of language users.
However, many linguists no longer think that performance can be neglected completely. For one thing, it has been noted that spontaneous speech corpora often contain utterances which would seem implausible (if not impossible) from introspection, but which are perfectly natural and acceptable in context. And conversely, sentences invented to illustrate grammatical points may be implausible as actual utterances, because it is extremely difficult to imagine a situation in which they would not violate discourse constraints, aspectual perspectives taken on events, etc. [Chafe (1992)]. Moreover, only an integrated theory of competence and performance would ultimately be able to account for actual language phenomena. In this respect speech corpora are indispensable to fill the gap between a competence grammar and actual language use.
Presently, more and more linguists are starting to realise the importance of linguistic analysis of constructs of larger size than isolated sentences or utterances. Discourse analysis is the branch of linguistics which is concerned with the analysis of naturally occurring connected spoken or written discourse [Stubbs (1984)]. Obviously, discourse analysis will profit very much from large corpora of meaningful speech, whether it is conversational or more formal, e.g. in information seeking dialogues.
In [Edwards & Lampert (1993)] a comprehensive methodology is presented for the transcription and coding of discourse data from various perspectives. This book also contains a list of language corpora that might be useful in discourse research.

Audiology

Audiology is the scientific study of hearing, often including the treatment of persons with hearing defects. A conventional audiometer can be used to test the intensity and frequency range of pure tones that the human ear can detect. This instrument can give a rough indication of the degree of hearing loss in hearing-impaired persons. Present day evaluation of hearing includes the use of controlled speech samples to assist in the determination of a patient's communicative capabilities.
Interest in the use of speech to measure hearing has been centered around both research orientation and practical clinical orientation. The first orientation has resulted in research areas such as experimental phonetics, the effects of various types of distortion on human speech recognition and speaker identification, etc. The second orientation has led to research in areas such as the effects of hearing loss on the reception of speech, auditory processing, and the effects of modifications in the range of reception of speech. The second area more or less grew out of the research in the first area [O'Neill (1975)].
For speech corpora to be useful in audiology they must be carefully calibrated, establishing the performance (e.g. in terms of recognition scores) of reference subjects without hearing impairment. Audiological test corpora may contain various types of speech stimuli to evaluate normal and disordered hearing acuity. The speech stimuli can consist of isolated phonemes, nonsense words or real words, and also of connected forms of speech (see e.g. [House et al. (1965), Voiers (1977)]).
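As a minimal sketch of what such a calibration involves, the example below computes a recognition score per reference listener and averages the scores into a baseline. The word list, the listener responses, and the exact-match scoring rule are illustrative assumptions; published audiological tests define their own stimuli and scoring procedures.

```python
def recognition_score(responses, targets):
    """Percentage of stimuli that were repeated correctly (exact match)."""
    if not targets or len(responses) != len(targets):
        raise ValueError("responses and targets must be non-empty and aligned")
    correct = sum(r == t for r, t in zip(responses, targets))
    return 100.0 * correct / len(targets)


# Hypothetical test items and responses from two reference listeners.
targets = ["bat", "pin", "sun", "dog"]
reference_listeners = {
    "ref1": ["bat", "pin", "sun", "dog"],
    "ref2": ["bat", "bin", "sun", "dog"],
}

scores = {name: recognition_score(resp, targets)
          for name, resp in reference_listeners.items()}
baseline = sum(scores.values()) / len(scores)

print(scores)    # {'ref1': 100.0, 'ref2': 75.0}
print(baseline)  # 87.5
```

A patient's score on the same material could then be compared against this reference baseline.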

Speech pathology

In this scientific field various types of pathological speech are studied, ranging from mild disorders such as hoarseness to severe disorders such as aphasia. The aim of most studies of pathological speech is to find therapies that can alleviate or cure the speech disorder of interest. However, phenomena like aphasia can also be the subject of psycholinguistic studies, because such language disorders can shed some light on underlying mental processes [Aitchison (1994)]. Corpora of pathological speech are very useful for these purposes. These corpora may also be useful for the development of automatic classification of speech pathologies.

EAGLES SWLG SoftEdition, May 1997.