Spoken language corpus

The definition introduced here for a spoken language corpus is "any collection of speech recordings which is accessible in computer readable form and which comes with annotation and documentation sufficient to allow re-use of the data in-house, or by scientists in other organisations." This tentative definition excludes a large number of speech recordings on analogue tapes (sometimes even on disks), as well as recordings that lack the annotation and documentation needed to use them effectively.

For instance, it is well known that virtually all public broadcasting corporations in Europe maintain an archive of recordings of programmes, including newscasts and reports of events ranging from football matches to royal weddings and funerals. However, in most cases these recordings can only be accessed by the date of the original broadcast, and perhaps also by the type of programme. Only in very rare cases are transcripts of the speech material in the recordings available. This makes it extremely difficult and time-consuming to use these data for almost all types of research. Speech coding is the most notable exception to this rule, although even in coding research it may be helpful to know who said what. Of course, the lack of annotation does not diminish the value of these recordings for cultural and scientific purposes, but because of the inordinate amount of pre-processing they require before any type of research can be done, they do not qualify as a spoken language corpus under our definition.

In many other respects our definition is very wide and liberal. For instance, a set of computer files containing speech signals, EMG signals, and sub- and supraglottal pressure signals measured in two subjects who sustained vowels at different pitch and intensity levels would qualify as a spoken language corpus, provided that the files come with suitable annotation and documentation.

Many additional sources of information can be gathered apart from the basic acoustic speech signal. Whatever choices of speakers, speech material, and recording conditions are made, it is always of crucial importance that the collection procedure is documented as thoroughly as possible. It is good practice to record all possible details about, for instance, the sex (gender) and age of the speakers, the type of speech material (isolated words, sentences, discourse, etc.), the place of recording (in a laboratory, on location, etc.), the type of microphone, and the recording medium (see also Chapter 8). Although a specific piece of information may not seem relevant at the time of recording, it can turn out to be important at a later stage, and by then it is often difficult or impossible to recover. Furthermore, a well-documented speech corpus may also be used for other lines of research. The following list summarises the most common information sources that may be present in a speech corpus:

Transduced signals
Examples: The acoustic speech signal, laryngograph signal, X-ray data.
Analysis results
Examples: FFT data, LPC data, filter bank data, pitch extraction, formant extraction.
Descriptors
Examples: Characteristics of the speakers, or the recording conditions.
Markers
Examples: Markers to indicate pitch periods, or the beginning of vowels.
Annotations/Labels
Examples: Orthographic, phonemic, or phonetic transcriptions.
Assessment parameters
Examples: Test material, assessment results.

All these information sources must be stored in such a way that potential users of the speech corpus can get access to the speech and the speech-related data in an efficient and easy-to-use manner.
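
As a rough illustration of how the six information sources listed above might be kept together for a single recording, the following Python sketch groups them in one record and reports which descriptors were never filled in. It is only one possible layout under stated assumptions: the class names, field names, file names, and the helper missing_descriptors are all hypothetical and are not prescribed by this handbook.

```python
from dataclasses import dataclass, field, asdict
from typing import Optional
import json


@dataclass
class Descriptors:
    """Speaker and recording-condition details (field names are illustrative)."""
    speaker_sex: Optional[str] = None
    speaker_age: Optional[int] = None
    material_type: Optional[str] = None    # e.g. "isolated words", "sentences"
    recording_place: Optional[str] = None  # e.g. "laboratory", "on location"
    microphone: Optional[str] = None
    recording_medium: Optional[str] = None


@dataclass
class CorpusItem:
    """One recording plus the six kinds of information listed in the text."""
    item_id: str
    transduced_signals: dict = field(default_factory=dict)  # signal name -> file
    analysis_results: dict = field(default_factory=dict)    # e.g. "pitch" -> file
    descriptors: Descriptors = field(default_factory=Descriptors)
    markers: list = field(default_factory=list)             # (time in s, label)
    annotations: dict = field(default_factory=dict)         # e.g. "orthographic"
    assessment: dict = field(default_factory=dict)          # test material, results


def missing_descriptors(item: CorpusItem) -> list:
    """Return the names of descriptor fields that are still empty."""
    return [name for name, value in asdict(item.descriptors).items()
            if value is None]


# Hypothetical example record; the file names are placeholders.
item = CorpusItem(
    item_id="rec0001",
    transduced_signals={"speech": "rec0001.wav", "laryngograph": "rec0001.lx"},
    descriptors=Descriptors(speaker_sex="female", speaker_age=34,
                            material_type="sentences",
                            recording_place="laboratory"),
    markers=[(0.412, "vowel onset")],
    annotations={"orthographic": "the quick brown fox"},
)

# Descriptors that should be documented before they are forgotten.
print(missing_descriptors(item))

# The whole record can be written out in a computer readable form.
print(json.dumps(asdict(item), indent=2))
```

Keeping the descriptors in the same record as the signals and annotations is a design choice made here for simplicity; a real corpus might equally well store them in separate files linked by the item identifier.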

EAGLES SWLG SoftEdition, May 1997.