Introduction

The previous chapter discussed the design of spoken language corpus collection. This chapter concentrates on the practical aspects of collecting spoken language material. The first part describes the dimensions of data collection, which together provide a framework for classifying and describing spoken language data collections.

The procedures section contains recommendations for the actual collection of speech data. These recommendations should enable anyone interested in speech recordings to establish a suitable recording environment that delivers data through a controlled procedure and at an acceptable technical quality. Note that speech data collections always contain "errors": mispronunciations, ungrammatical sentences, new words, technical errors, and so on. These "errors" must be marked but not removed from the corpus, because they contain valuable information. For the development of applications, such errors are required to test the performance and limitations of an application. In speech science, errors are of interest in their own right. The procedures and recommendations in this chapter do not lead to error-free data collections. Instead, they define standards for many aspects of speech data collection, which may then be used to explain spoken language phenomena, including the errors.
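The principle of marking errors without removing them can be illustrated with a minimal sketch. It assumes that each utterance carries a set of error tags; the class name `Utterance`, the function `select`, and the tag vocabulary are hypothetical and chosen only for illustration, not taken from any standard described in this chapter.

```python
from dataclasses import dataclass, field

# Hypothetical tag vocabulary, mirroring the kinds of "errors" named in the text.
ERROR_TAGS = {"mispronunciation", "ungrammatical", "new_word", "technical"}


@dataclass
class Utterance:
    utterance_id: str
    transcription: str
    error_tags: set = field(default_factory=set)

    def mark(self, tag):
        """Attach an error tag; the utterance itself stays in the corpus."""
        if tag not in ERROR_TAGS:
            raise ValueError(f"unknown error tag: {tag}")
        self.error_tags.add(tag)


def select(corpus, include_errors=True):
    """Return a new list as a view on the corpus; the corpus is never modified."""
    if include_errors:
        return list(corpus)
    return [u for u in corpus if not u.error_tags]


corpus = [
    Utterance("u001", "turn left at the station"),
    Utterance("u002", "turn lef at the the station"),
]
corpus[1].mark("mispronunciation")

# A filtered view for users who want only unmarked material.
unmarked = select(corpus, include_errors=False)
```

With this layout, an application developer can test against the full corpus, including the marked items, while other users can request only the unmarked utterances; in both cases the stored data remain complete.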

Clearly, the main object of interest in any spoken language data collection is the speech signal itself. However, additional information can be gathered apart from the basic acoustic speech signal. Whatever choices of speakers, speech material, and recording conditions are made, it is always of crucial importance that the collection procedure is documented as thoroughly as possible. It is good practice to record as many details as possible about, for instance, the sex and age of the speakers, the type of speech material (isolated words, sentences, discourse, etc.), the place of recording (in a laboratory, on location, etc.), and the type of microphone and recording medium. There are two reasons for this. First, information that does not seem relevant at the time of recording can turn out to be important at a later stage, when it is often difficult or impossible to recover. Second, a well-documented speech corpus may also be used for other research. The following list summarises the most common information sources that may be useful for a speech corpus:

Transduced signals
Examples: acoustic speech signal, laryngograph signal, X-ray data.
Analysis results
Examples: FFT data, LPC data, filter bank data, pitch extraction, formant extraction.
Descriptors
Examples: characteristics of the speakers or of the recording conditions.
Markers
Examples: markers to indicate pitch periods or the beginning of vowels.
Annotations/Labels
Examples: orthographic, phonemic, or phonetic transcriptions.
Assessment parameters
Examples: test material, assessment results.

All these information sources must be stored in such a way that potential users of the speech corpus can get access to the speech and the speech-related data in an efficient and easy-to-use manner.
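One possible way to keep these information sources together is a single record per recorded item, stored in a plain-text format that other users can read. The following is a minimal sketch only: the class names, field names, file names, and the choice of JSON are all assumptions made for illustration and are not prescribed by this chapter.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class SpeakerInfo:          # descriptor: characteristics of the speaker
    speaker_id: str
    sex: str
    age: int


@dataclass
class RecordingInfo:        # descriptor: recording conditions
    place: str
    microphone: str
    medium: str
    material_type: str


@dataclass
class CorpusItem:
    item_id: str
    speaker: SpeakerInfo
    recording: RecordingInfo
    signal_files: dict = field(default_factory=dict)    # transduced signals
    analysis_files: dict = field(default_factory=dict)  # analysis results
    markers: list = field(default_factory=list)         # time-aligned markers
    annotations: dict = field(default_factory=dict)     # transcriptions/labels
    assessment: dict = field(default_factory=dict)      # assessment parameters


def to_json(item):
    """Serialise one item, including nested descriptors, to JSON text."""
    return json.dumps(asdict(item), indent=2, sort_keys=True)


def from_json(text):
    """Rebuild a CorpusItem from the JSON text produced by to_json."""
    data = json.loads(text)
    data["speaker"] = SpeakerInfo(**data["speaker"])
    data["recording"] = RecordingInfo(**data["recording"])
    return CorpusItem(**data)


# All values below are invented example data.
item = CorpusItem(
    item_id="item0001",
    speaker=SpeakerInfo(speaker_id="spk01", sex="female", age=34),
    recording=RecordingInfo(
        place="laboratory",
        microphone="head-mounted condenser",
        medium="digital tape",
        material_type="isolated words",
    ),
    signal_files={"acoustic": "item0001.wav", "laryngograph": "item0001.lx"},
    analysis_files={"pitch": "item0001.f0"},
    markers=[{"time_s": 0.42, "label": "vowel_onset"}],
    annotations={"orthographic": "seven"},
)

restored = from_json(to_json(item))
```

Keeping the descriptors in the same record as the pointers to signals, analysis results, markers, and annotations means that a later user can recover the recording conditions for any item without consulting separate notes.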

EAGLES SWLG SoftEdition, May 1997.