This chapter covers only the design of spoken language corpora and the use of these corpora. It is explicitly not the intention of this chapter to give a comprehensive overview of corpora existing worldwide; we do not even intend to give a comprehensive list of all corpora existing in Europe. For attempts to survey existing speech corpora, the reader is referred to [Fourcin et al. (1989)] and to Appendix L, which contains a list of existing public domain spoken language corpora.
Corpora, tools, and resources in general are not ends in themselves, but means to an
independently specified purpose. Thus, the eventual specification of a corpus
depends in an essential way on the purpose it is intended to serve. Yet, if
that purpose is not too limited, and provided the corpus is properly documented
and annotated, it is quite likely that it will be useful for other, perhaps
unrelated research. At present there are few, if any, official standards
for corpus development. Given the dependence on research goals, this is not
surprising.
The present chapter intends to address as large an audience as possible. Specifically, it includes information and
recommendations not only for speech technology research, but also for the development of corpora meant
to support research in speech science, psycholinguistics and
sociolinguistics.
The recommendations cover the general aspects and factors that should be considered in designing a
corpus, together with guidelines for making decisions on these issues.
In the development of a speech corpus, three phases can be distinguished.
In the pre-recording phase one has to define the content of the corpus.
Specifications of experiment design, of linguistic content, of
number and type of speakers, and of the physical situation must be
established. These topics will be covered in this chapter. In the
recording phase, speaker instruction and prompting, experiment and
recording control, as well as storage of the recordings are involved.
These topics will be covered in Chapter 4. In the post-recording phase,
transcription (and possibly segmentation and labelling), corpus lexicon
construction, and database management take place. These topics will be
discussed in Chapters 5 and 6.
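The three phases and the tasks assigned to each can be summarised as a simple checklist. The following Python sketch is purely illustrative: the names `Phase`, `PHASE_TASKS`, `PreRecordingSpec`, and `missing` are hypothetical and do not correspond to any standard or tool described in this handbook. It only shows one possible way to record the four pre-recording specifications named above and to check which of them are still open.

```python
from dataclasses import dataclass, fields
from enum import Enum


class Phase(Enum):
    """The three phases of speech corpus development named in the text."""
    PRE_RECORDING = "pre-recording"
    RECORDING = "recording"
    POST_RECORDING = "post-recording"


# Tasks per phase, taken from the text above (wording is illustrative).
PHASE_TASKS = {
    Phase.PRE_RECORDING: [
        "experiment design",
        "linguistic content",
        "number and type of speakers",
        "physical situation",
    ],
    Phase.RECORDING: [
        "speaker instruction and prompting",
        "experiment and recording control",
        "storage of the recordings",
    ],
    Phase.POST_RECORDING: [
        "transcription (possibly segmentation and labelling)",
        "corpus lexicon construction",
        "database management",
    ],
}


@dataclass
class PreRecordingSpec:
    """Hypothetical container for the pre-recording specifications.

    Each field holds a free-text description; an empty string means
    the specification has not been written yet.
    """
    experiment_design: str = ""
    linguistic_content: str = ""
    speakers: str = ""
    physical_situation: str = ""

    def missing(self) -> list[str]:
        """Return the names of specifications that are still empty."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]


# Example: only the experiment design has been specified so far.
spec = PreRecordingSpec(experiment_design="read speech, prompted from a screen")
open_items = spec.missing()
```

In this sketch, `open_items` lists the specifications that must still be established before recording can begin.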
In the remainder of this chapter we focus on the pre-recording phase
and on the steps involved in preparing the recording of a speech corpus.
Before we embark on these discussions, however, it is necessary to elaborate on the differences between written language corpora and spoken language corpora.