About this chapter

This chapter covers only the design of spoken language corpora and the use of these corpora. It is explicitly not the intention of this chapter to give a comprehensive overview of corpora existing worldwide; we do not even intend to give a comprehensive list of all corpora existing in Europe. For attempts to survey existing speech corpora, the reader is referred to [Fourcin et al. (1989)] and to Appendix L, which contains a list of existing public domain spoken language corpora.

Corpora, tools, and resources in general are not aims in their own right, but means to an independently specified purpose. Thus, the eventual specification of a corpus depends in an essential way on the purpose it is intended to serve. Yet, if that purpose is not too limited, and provided the corpus is properly documented and annotated, it is quite likely that it will be useful for other, perhaps unrelated research. At present there are few, if any, official standards for corpus development. Given the dependence on research goals, this is not surprising.
The present chapter intends to address as large an audience as possible. Specifically, it includes information and recommendations not only for speech technology research, but also for the development of corpora meant to support research in speech science, psycholinguistics and sociolinguistics. The recommendations concern general aspects and factors that should be considered in designing a corpus, and guidelines for making decisions on these issues.
In the development of a speech corpus, three phases can be distinguished. In the pre-recording phase, one has to define the content of the corpus: specifications of experiment design, of linguistic content, of number and type of speakers, and of the physical situation must be established. These topics will be covered in this chapter. The recording phase involves speaker instruction and prompting, experiment and recording control, and storage of the recordings. These topics will be covered in Chapter 4. In the post-recording phase, transcription (and possibly segmentation and labelling), corpus lexicon construction, and database management take place. These topics will be discussed in Chapters 5 and 6.
In the remainder of this chapter we focus on the pre-recording phase, including the following steps in preparing the recording of a speech corpus:

defining the application of the corpus,
specifying the linguistic content of the corpus,
specifying the number and type of speakers.
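
As a rough illustration, the three pre-recording steps listed above could be captured in a small specification record that is checked for completeness before recording begins. This is only a minimal sketch: the class names, field names, and the `missing_items` helper are hypothetical and are not prescribed by this chapter.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SpeakerSpec:
    # Hypothetical fields: the text only asks for "number and type of speakers".
    count: int
    types: List[str] = field(default_factory=list)


@dataclass
class PreRecordingSpec:
    # One field per pre-recording step named in the list above.
    application: str         # the purpose the corpus is intended to serve
    linguistic_content: str  # what material the speakers will produce
    speakers: SpeakerSpec

    def missing_items(self) -> List[str]:
        """Return the names of the steps that have not been specified yet."""
        missing = []
        if not self.application.strip():
            missing.append("application")
        if not self.linguistic_content.strip():
            missing.append("linguistic_content")
        if self.speakers.count <= 0:
            missing.append("speakers")
        return missing
```

A specification whose `missing_items()` result is empty has at least a placeholder decision for each step; whether those decisions are adequate still depends on the purpose of the corpus.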

Before we embark on these discussions, however, it is necessary to elaborate on the differences between written language corpora and spoken language corpora.

EAGLES SWLG SoftEdition, May 1997.