This chapter covers only the design of spoken language corpora and the use of these corpora. It is explicitly not the intention of this chapter to give a comprehensive overview of corpora existing worldwide; we do not even intend to give a comprehensive list of all corpora existing in Europe. For attempts to survey existing speech corpora, the reader is referred to [Fourcin et al. (1989)] and to Appendix L, which contains a list of existing public domain spoken language corpora.
Corpora, tools, and resources in general are not aims in their own right, but means to an
independently specified purpose. Thus, the eventual specification of a corpus
depends in an essential way on the purpose it is intended to serve. Yet, if
that purpose is not too limited, and provided the corpus is properly documented
and annotated, it is quite likely that it will be useful for other, perhaps
unrelated research. At present there are few, if any, official standards
for corpus development. Given the dependence on research goals, this is not surprising.
The present chapter intends to address as large an audience as possible. Specifically, it includes information and recommendations not only for speech technology research, but also for the development of corpora meant to support research in speech science, psycholinguistics and sociolinguistics. The recommendations concern general aspects and factors that should be considered in designing a corpus, and guidelines for making decisions on these issues.
In the development of a speech corpus, three phases can be distinguished. In the pre-recording phase, one has to define the content of the corpus: specifications of experiment design, of linguistic content, of number and type of speakers, and of the physical situation must be established. These topics are covered in this chapter. The recording phase involves speaker instruction and prompting, experiment and recording control, and storage of the recordings. These topics are covered in Chapter 4. In the post-recording phase, transcription (and possibly segmentation and labelling), corpus lexicon construction, and database management take place. These topics are discussed in Chapters 5 and 6.
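As an illustration only, the three phases and their tasks can be written down as a simple checklist that a corpus project keeps up to date. The sketch below is not part of any standard or existing tool: the names `Phase`, `PHASE_TASKS` and `CorpusPlan` are hypothetical, and the task labels are taken directly from the paragraph above.

```python
from dataclasses import dataclass, field
from enum import Enum


class Phase(Enum):
    """The three phases of speech corpus development named in the text."""
    PRE_RECORDING = "pre-recording"
    RECORDING = "recording"
    POST_RECORDING = "post-recording"


# Tasks per phase, as listed in the text (hypothetical labels, not a standard).
PHASE_TASKS = {
    Phase.PRE_RECORDING: (
        "experiment design",
        "linguistic content",
        "number and type of speakers",
        "physical situation",
    ),
    Phase.RECORDING: (
        "speaker instruction and prompting",
        "experiment and recording control",
        "storage of the recordings",
    ),
    Phase.POST_RECORDING: (
        "transcription",
        "corpus lexicon construction",
        "database management",
    ),
}


@dataclass
class CorpusPlan:
    """Hypothetical checklist recording which tasks have been specified."""
    purpose: str
    specified: set = field(default_factory=set)

    def specify(self, task: str) -> None:
        """Mark a task as specified; reject labels not listed in PHASE_TASKS."""
        known = {t for tasks in PHASE_TASKS.values() for t in tasks}
        if task not in known:
            raise ValueError(f"unknown task: {task!r}")
        self.specified.add(task)

    def open_items(self, phase: Phase) -> list:
        """Return the tasks of a phase that still lack a specification."""
        return [t for t in PHASE_TASKS[phase] if t not in self.specified]

    def ready_to_record(self) -> bool:
        """True once every pre-recording task has been specified."""
        return not self.open_items(Phase.PRE_RECORDING)
```

The plan is tied to a stated purpose because, as argued above, the specification of a corpus depends on the purpose it is intended to serve. Calling `open_items(Phase.PRE_RECORDING)` before any recording session gives a quick view of the design decisions that are still outstanding.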
In the remainder of this chapter we focus on the pre-recording phase, that is, on the steps involved in preparing the recording of a speech corpus.
Before we embark on these discussions, however, it is necessary to elaborate on the differences between written language corpora and spoken language corpora.