About this chapter

This chapter covers only the design of spoken language corpora and the use of these corpora. It is explicitly not the intention of this chapter to give a comprehensive overview of corpora existing worldwide; we do not even intend to give a comprehensive list of all corpora existing in Europe. For attempts to survey existing speech corpora the reader is referred to [Fourcin et al. (1989)] and to Appendix L, which contains a list of existing public domain spoken language corpora.

Corpora, tools, and resources in general are not aims in their own right, but means to an independently specified purpose. Thus, the eventual specification of a corpus depends in an essential way on the purpose it is intended to serve. Yet, if that purpose is not too limited, and provided the corpus is properly documented and annotated, it is quite likely that it will be useful for other, perhaps unrelated research. At present there are few, if any, official standards for corpus development. Given the dependence on research goals, this is not surprising.
The present chapter intends to address as large an audience as possible. Specifically, it includes information and recommendations not only for speech technology research, but also for the development of corpora meant to support research in speech science, psycholinguistics and sociolinguistics. The recommendations concern general aspects and factors that should be considered in designing a corpus, and guidelines for making decisions on these issues.
In the development of a speech corpus, three phases can be distinguished. In the pre-recording phase, one has to define the content of the corpus. Specifications of experiment design, of linguistic content, of number and type of speakers, and of the physical situation must be established. These topics will be covered in this chapter. The recording phase involves speaker instruction and prompting, experiment and recording control, and storage of the recordings. These topics will be covered in Chapter 4. In the post-recording phase, transcription (and possibly segmentation and labelling), corpus lexicon construction, and database management take place. These topics will be discussed in Chapters 5 and 6.
In the remainder of this chapter we focus on the pre-recording phase, that is, on the steps involved in preparing the recording of a speech corpus.

Before we embark on these discussions, however, it is necessary to elaborate on the differences between written language corpora and spoken language corpora.



EAGLES SWLG SoftEdition, May 1997.