The speech corpora needed for scientific purposes can be very diverse. Some researchers may need carefully pronounced lists of words to study a specific hypothesis about speech production; others may want to study samples of the vernacular, the way people speak in their everyday life. In the following sections some of the major scientific fields with an interest in spoken language corpora are mentioned.
In phonetic research all aspects of speech are studied. Phonetic experiments often require carefully controlled speech data, especially when basic phenomena, such as coarticulation, have to be studied in a systematic way. In this type of research, more often than not the researcher will have no alternative to collecting new data, specifically designed for the investigation at hand. However, in recent years more and more attention has been paid to uncontrolled (or less controlled) forms of speech as well, because researchers have begun to realise that results obtained for carefully pronounced speech cannot simply be generalised to more casual speech. This type of research, which also requires other experimental designs and other statistical test procedures, will profit considerably from existing corpora. Moreover, since the corpora that can support this type of research must of necessity be very large, it is very unlikely that an individual researcher will have the opportunity to collect new, project-specific corpora.
In sociolinguistic research variation in language use is studied in heterogeneous communities, especially urban ones. Variables of interest are, among others, age, sex (gender), and social status. Three common methods to gather data in this research field are:
Dialect research is closely related to sociolect research. In dialect studies variation in language use due to differences in geographical background of speakers is investigated. Since the methods of data collection are similar to the ones used in sociolect research, the remarks made above also apply to dialect research.
Psycholinguistics is a very broad scientific field in which the psychology of language is studied, ranging from language acquisition by children, through the mental processes underlying adult comprehension and production of speech, to language disorders.
Psycholinguistic experiments sometimes involve carefully controlled speech material, for instance in on-line phoneme monitoring or gating experiments. In phoneme monitoring experiments subjects are asked to spot the first occurrence of a specific phoneme in a spoken utterance, and to press a button as soon as they have spotted it. The reaction time, i.e. the interval between the actual occurrence of the phoneme and the subject's response, is used to form hypotheses about underlying mental processes. In gating experiments progressively larger portions of a word are presented to listeners, who are asked to predict what the ending will be. Both techniques can be useful to get more insight into the organisation of the mental lexicon.
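The two measurements described above can be sketched in a few lines of Python. This is only an illustration, not a description of any particular laboratory set-up: the function names, the use of millisecond time stamps and the 50 ms gate step are all assumptions made for the example.

```python
def reaction_times(trials):
    """Phoneme monitoring: latency of the button press per trial.

    `trials` is a list of (phoneme_onset_ms, button_press_ms) pairs,
    both measured from the start of the utterance (assumed layout).
    """
    return [press_ms - onset_ms for onset_ms, press_ms in trials]


def gate_end_points(word_duration_ms, step_ms=50):
    """Gating: end points (ms from word onset) of progressively longer gates.

    Each gate starts at word onset; the final gate is the whole word.
    The 50 ms default step is an assumption, not a prescribed value.
    """
    if word_duration_ms <= 0 or step_ms <= 0:
        raise ValueError("durations must be positive")
    ends = list(range(step_ms, word_duration_ms, step_ms))
    ends.append(word_duration_ms)  # last gate presents the complete word
    return ends
```

For a word lasting 230 ms, `gate_end_points(230)` gives gates ending at 50, 100, 150, 200 and 230 ms; a press at 1650 ms for a target phoneme at 1200 ms gives a reaction time of 450 ms.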
Another way to obtain information about the mental lexicon and speech production processes is to study the dysfluencies in spontaneous speech. For example, false starts tell us something about the way in which speech is planned and articulated. Repetitions of words or word fragments also give information about the production and representation of speech. For this type of research, spontaneous speech corpora are very useful. For more information about planning processes of speech see [Levelt (1989)].
Yet another way to gather cues about the mental lexicon is to study ``slips of the tongue''. Many tongue-slip collectors carry round a small notebook in which they write down errors whenever they hear them, on a bus, at parties, etc. As mentioned in the previous section, data acquired in this way are subjective and unreliable. Speech corpora containing spontaneous speech samples would be the answer to this problem, but investigations in this research area would only benefit from extremely large spontaneous speech corpora, because the number of slips of the tongue produced in any one hour of speech is fairly small.
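A back-of-the-envelope estimate shows why such corpora have to be so large. The sketch below is plain arithmetic; the function name is hypothetical, and the slip rate and target count in the usage note are made-up figures, not values reported in the literature.

```python
import math


def hours_of_speech_needed(target_slips, slips_per_hour):
    """Hours of recorded speech needed to expect `target_slips` slips.

    Assumes a constant average rate of slips per hour, which is a
    simplification; real rates vary with speaker and situation.
    """
    if target_slips < 0 or slips_per_hour <= 0:
        raise ValueError("target must be >= 0 and rate must be > 0")
    return math.ceil(target_slips / slips_per_hour)
```

With an assumed rate of 2 slips per hour, collecting 500 examples would require `hours_of_speech_needed(500, 2)`, i.e. 250 hours of transcribed spontaneous speech.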
Language acquisition by children is a subject of investigation in many disciplines within, among others, linguistics and psychology. For example, the speech of (young) children can be used to investigate (ir)regularities in language (linguistics); it can also be used to learn more about the mental organisation of language (psycholinguistics); it can be studied in relation to the sociolinguistic background of children; or it can be used to gain more insight into basic phonetic processes. All these scientific fields, as well as technologies oriented towards early learning, would benefit from extensive corpora containing speech of children.
Collecting language acquisition corpora is extremely time-consuming and expensive, because of the difficulty in transcribing the speech, especially speech of very young children. In (psycho-)linguistics a considerable amount of work has been done to collect and transcribe corpora, and to make them available to the research community. Presently, only transcriptions are readily accessible, e.g. the CHILDES transcriptions [MacWhinney (1995)].
In the case of toddlers only ``spontaneous'' speech samples can be obtained. As soon as children get somewhat older, more controlled forms of speech can also be obtained, such as naming pictures or reading texts. Game playing is another way of eliciting quasi-controlled speech.
Speech acquisition corpora should preferably be longitudinal, i.e. the same person should be recorded repeatedly at consecutive stages in the acquisition process.
Migration between language areas is as old as history, and no doubt much older. Depending on their practical and social status, migrants may be hindered by their lack of adequate knowledge and command of the majority language in their new home countries. Now that low-education jobs are becoming increasingly rare in First World countries, this situation has become economically and politically significant. Because command of the language is a prerequisite for education, the study of how immigrants learn to master the language of the host country (the ``second'' language) has become an important topic in sociolinguistic research. The European Science Foundation, for instance, has sponsored a large-scale project on second language learning in several Western European countries. The research was corpus-based: large numbers of migrants were recorded each fortnight for over a year. The transcripts and audio tapes comprising this corpus are maintained by the Max Planck Institute for Psycholinguistics in Nijmegen, The Netherlands.
It is especially important to study second language acquisition by immigrant children in order to find out how it might influence their educational progress. In a similar vein, research into the acquisition of the majority language is needed for ``second generation children'' who grow up in families which still use the language of their country of origin.
Since immigrants form a minority group in the country they reside in, their native language can be strongly influenced by the second language. For the investigation of these so-called language attrition processes, special-purpose corpora must be (and have been) collected. In this context one must not only think of African and Asian migrants who are living in the U.S.A. or Western Europe, but also of non-Anglo Europeans who moved to the U.S.A., Canada or Australia.
From a psycholinguistic point of view, it is interesting to study how the different lexicons are organised in the minds of bilingual (and multilingual) speakers. For example, the occurrence of ``blends'' (combinations of two words, in this case from different languages) shows that words are subconsciously activated in both languages [Green (1986)]. Up to now, much of the research into bilingual lexicons has taken the form of controlled experiments (e.g. cross-language priming in lexical decision tasks). It is conceivable, though, that large corpora of spontaneous speech of bilinguals could be used to study lexical and syntactic interference between the languages.
Large corpora of speech of second language learners may also be very interesting for the development of tools for second language learning. For example, types of grammatical and pronunciation errors can be identified. Knowledge of these errors may be helpful for the development of language instruction materials, which might include spoken examples of actual errors (to be corrected by the learners).
A substantial part of modern linguistic research since the 1960s has been based on Chomsky's
``generative paradigm''. The goal of this so-called mentalistic research
programme is to eventually understand the competence
of language users,
i.e. their abstract knowledge of the language system. What speakers and
hearers actually do, i.e. their performance,
is usually of less interest
to linguists in Chomsky's tradition. The construction of competence models is
generally based on introspection and impressionistic ideas about language use.
So, in its strictest form, mentalistic linguistic research cannot benefit much
from speech corpora that contain samples of the performance of language users.
However, many linguists no longer think that performance can be neglected completely. For one thing, it has been noted that spontaneous speech corpora often contain utterances which would seem implausible (if not impossible) from introspection, but which are perfectly natural and acceptable in context. And conversely, sentences invented to illustrate grammatical points may be implausible as actual utterances, because it is extremely difficult to imagine a situation in which they would not violate discourse constraints, aspectual perspectives taken on events, etc. [Chafe (1992)]. Moreover, only an integrated theory of competence and performance would ultimately be able to account for actual language phenomena. In this respect speech corpora are indispensable to fill the gap between a competence grammar and actual language use.
Presently, more and more linguists are starting to realise the importance of linguistic analysis of units larger than isolated sentences or utterances. Discourse analysis is the branch of linguistics which is concerned with the analysis of naturally occurring connected spoken or written discourse [Stubbs (1984)]. Obviously, discourse analysis will profit very much from large corpora of meaningful speech, whether it is conversational or more formal, e.g. in information-seeking dialogues.
In [Edwards & Lampert (1993)] a comprehensive methodology is presented for the transcription and coding of discourse data from various perspectives. This book also contains a list of language corpora that might be useful in discourse research.
Audiology is the scientific study of hearing, often including the
treatment of persons with hearing defects. A conventional audiometer can
be used to test the intensity and frequency range of pure tones that the human
ear can detect. This instrument can give a rough indication of the degree of
hearing loss in hearing-impaired persons. Present-day evaluation of hearing
includes the use of controlled speech samples to assist in the determination
of a patient's communicative capabilities.
Interest in the use of speech to measure hearing has centred on both a research orientation and a practical clinical orientation. The first orientation has resulted in research areas such as experimental phonetics, the effects of various types of distortion on human speech recognition and speaker identification, etc. The second orientation has led to research in areas such as the effects of hearing loss on the reception of speech, auditory processing, and the effects of modifications in the range of reception of speech. The second area more or less grew out of the research in the first area [O'Neill (1975)].
For speech corpora to be useful in audiology they must be carefully calibrated by establishing the performance (e.g. in terms of recognition scores) of reference subjects without hearing impairment. Audiological test corpora may contain various types of speech stimuli to evaluate normal and disordered hearing acuity. The speech stimuli can consist of isolated phonemes, nonsense words or real words, and also of connected forms of speech (see e.g. [House et al. (1965), Voiers (1977)]).
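As an illustration of what such a calibration involves, the sketch below computes a recognition score as the percentage of stimuli identified correctly and averages it over a group of reference listeners. It is a simplified example under stated assumptions: the function names are hypothetical, responses are scored by exact match against the target word, and no particular clinical test procedure is implied.

```python
from statistics import mean


def recognition_score(targets, responses):
    """Percentage of stimuli whose response matches the target exactly."""
    if not targets or len(targets) != len(responses):
        raise ValueError("need equally long, non-empty target and response lists")
    correct = sum(t == r for t, r in zip(targets, responses))
    return 100.0 * correct / len(targets)


def reference_score(targets, reference_responses):
    """Mean recognition score over a group of reference listeners.

    `reference_responses` holds one response list per listener
    (assumed data layout for this example).
    """
    return mean(recognition_score(targets, r) for r in reference_responses)
```

A patient's score on the same stimuli can then be compared with the reference score; for instance, two reference listeners scoring 100% and 75% on a four-word list give a reference score of 87.5%.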
In this scientific field various types of pathological speech are studied, ranging from mild disorders such as hoarseness to severe disorders such as aphasia. The aim of most studies of pathological speech is to find therapies that can alleviate or cure the speech disorder of interest. However, phenomena like aphasia can also be the subject of psycholinguistic studies, because such language disorders can shed some light on underlying mental processes [Aitchison (1994)]. Corpora of pathological speech are very useful for these purposes. These corpora may also be useful for the development of automatic classification of speech pathologies.