Types and specificities of corpora

There are as many types of corpora as there are relevant factors that can be used to define them: speakers, texts, speech type, recording conditions, tasks and so on. Within this wide range, corpora may be characterised according to their intended use:

1. Experimental research:
These corpora are widely used for speech technology development and assessment. Much of the basic material was collected several years ago, and more recent technology requires more advanced materials.
1.1 Basic material:
Numbers, Words, Sentences, Logatoms
  • Number of speakers: medium (100--500)
  • Several repetitions.
1.2 Advanced material:
Continuous speech, passages, situated dialogue.
  • Number of speakers: small to medium (10--200)
  • Recently the trend has been to increase the number of speakers in such corpora.
1.3 Specific databases:
Multi-sensor corpora (Lx), articulatory, acoustic, video databases.
  • Number of speakers: small
  • These corpora tend to be relatively expensive to collect and may require sophisticated recording facilities and sensors, as well as specialised operators.
2. General-purpose telephone corpora:
These are used for speech recognition and coding over the telephone. Corpora of this type are relatively easy to obtain (the speaker only needs to call a specified telephone number) and relatively cost-effective. However, with advances in communication technology, some of the problems currently posed by the limited bandwidth and noisy communication channel of today's telephones can be expected to disappear.
3. Application-oriented corpora:
For specific tasks and/or environments (many of which involve the telephone network). Because they are tailored to a particular application, many of these corpora are not easily reused for other applications.

The two extremes in corpus type are, at one end, very specific corpora for fundamental research, which may require complex recording conditions with multi-channel recordings and a low number of speakers, and, at the other, application-specific corpora, which may be recorded over the telephone with a large number of speakers. In addition to the recorded speech signal, we must stress the importance of, and the effort required in, ensuring that the appropriate associated information is provided. This associated information depends heavily on the type of corpus, but must include at least relevant speaker information, transcriptions (at a minimum an orthographic transliteration), prompt material in the case of read-speech corpora, lexica, noise or channel characteristics and details of the recording configuration.
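The minimum associated information listed above can be treated as a checklist to run before a corpus is distributed. The following is only a minimal sketch in Python: the class name CorpusMetadata, its field names and the helper missing_items are hypothetical, chosen for illustration, and do not correspond to any established corpus format or tool.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class CorpusMetadata:
    """Hypothetical record of the information accompanying a speech corpus."""
    read_speech: bool = False                                     # True if speakers read prompts
    speakers: Dict[str, dict] = field(default_factory=dict)       # speaker id -> attributes
    transcriptions: Dict[str, str] = field(default_factory=dict)  # utterance id -> orthographic text
    prompts: Dict[str, str] = field(default_factory=dict)         # utterance id -> prompt text
    lexicon: Dict[str, str] = field(default_factory=dict)         # word -> pronunciation
    channel_notes: str = ""                                       # noise or channel characteristics
    recording_setup: str = ""                                     # recording configuration


def missing_items(meta: CorpusMetadata) -> List[str]:
    """Return the names of the required items that are still empty."""
    required = ["speakers", "transcriptions", "lexicon",
                "channel_notes", "recording_setup"]
    # Prompt material is only required for read-speech corpora.
    if meta.read_speech:
        required.append("prompts")
    return [name for name in required if not getattr(meta, name)]


# Example: a read-speech corpus with only speakers and transcriptions filled in.
meta = CorpusMetadata(
    read_speech=True,
    speakers={"spk01": {"sex": "f"}},
    transcriptions={"utt001": "one two three"},
)
print(missing_items(meta))
```

Here the check reports the lexicon, channel notes, recording set-up and prompts as still missing. A real corpus would need a much richer description for each item; the sketch only shows that the requirement for prompt material depends on the type of corpus.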

