
Corpus size in terms of speakers

The number of speakers represented is one of the most important characteristics of a spoken language corpus. Based on the number of speakers, speech corpora can be roughly divided into the following three classes:

  1. speech corpora with few speakers,
  2. speech corpora with many (about 5 to 50) speakers,
  3. speech corpora with very many (more than 50) speakers.

Speech corpora with few speakers

Such corpora are often used in the development of speech synthesis systems. In most cases the speech of one or two persons (typically one man and one woman) is recorded. The corpus is used to prepare dictionaries of phonetic elements (allophones, diphones, etc.) and to design prosodic models. The speech material may consist of nonsense words in which sequences of phonetic elements are systematically varied, and of a series of sentences from which prosodic rules are extracted. For developing synthesis systems it is recommended to use experienced speakers. It is especially important that the speakers can keep pitch, loudness, voice quality and tempo constant while recording the material used to build the segment inventory.

Corpora comprising very few speakers are also common in basic speech research, especially where invasive measurements must be made. Corpora in this domain typically contain several additional signals recorded simultaneously with the acoustic speech signal; see e.g. the ESPRIT Basic Research Project on Articulatory Phonetics. The additional signals can range from the Electroglottogram (which was also recorded in part of the EUROM-1 corpus and the Transnational English Corpus) to subglottal pressure recorded via tracheal puncture and EMG activity of intrinsic laryngeal muscles. It should be emphasised that having very few speakers does not necessarily imply a small corpus. For instance, one may need to record one speaker producing all three-consonant clusters in languages like Dutch, English or German in all possible phonetic contexts: within syllables, across syllable boundaries, across word boundaries, in stressed and unstressed syllables, and at several positions in a sentence or in a prosodic contour. The amount of speech required is then formidable, even when greedy search algorithms [Van Santen (1992)] are used to find the smallest possible number of sentences which comprise all contexts.

Similar remarks apply to intonation and prosody in general. If a text-to-speech system is developed that must be employed in many different applications (e.g. reading factual information in a train timetable information system, or reading popular daily newspapers to blind subscribers), enormous amounts of speech are needed to capture all relevant prosodic phenomena.
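
The greedy search mentioned above for the segmental material can be illustrated with a minimal sketch. It is not the procedure of [Van Santen (1992)], only the generic greedy set-cover idea: repeatedly pick the candidate sentence that covers the most contexts not yet covered. The function name, the sentence identifiers and the context labels are hypothetical, and a greedy selection is not guaranteed to be the smallest possible one.

```python
def greedy_select(candidates):
    """Pick sentences until every context in the candidate pool is covered.

    candidates: dict mapping a sentence id to the set of context labels
    (e.g. cluster + position) that the sentence contains. Hypothetical layout.
    """
    uncovered = set()
    for contexts in candidates.values():
        uncovered |= contexts

    chosen = []
    while uncovered:
        # Sentence that adds the most still-uncovered contexts;
        # ties go to the first candidate in insertion order.
        best = max(candidates, key=lambda s: len(candidates[s] & uncovered))
        gain = candidates[best] & uncovered
        if not gain:
            break  # defensive: nothing left that any sentence can cover
        chosen.append(best)
        uncovered -= gain
    return chosen


# Toy pool with abstract context labels A..E (illustrative only).
pool = {
    "s1": {"A", "B"},
    "s2": {"B", "C", "D"},
    "s3": {"A"},
    "s4": {"D", "E"},
}
selection = greedy_select(pool)
print(selection)  # ['s2', 's1', 's4']
```

Even in this toy pool three of the four sentences are needed; with realistic context inventories the selected set remains large, which is the point made above.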

Speech corpora with about 5 to 50 speakers

Speech corpora of this size are often used in experimental factorial research. The speech material can range from isolated nonsense words to a complete discourse, depending on the specific application. Studies on prosody, for instance, require linguistic units that exceed the word level. Speakers can be men, women, or children. The speech can be recorded under high quality laboratory conditions, but also "in the field". In general, if factorial experimental designs are planned, the number of speakers and the number of repetitions of the speech phenomena under investigation should be large enough for meaningful statistical processing. The power of a statistical test depends on the number of independent observations. If a corpus is developed for a factorial experiment, standard procedures are available and should be adhered to for determining the minimum number of speakers and/or the minimum number of utterances per speaker that allows the planned statistical tests to reach a pre-specified power. These standard procedures can be found in most textbooks on statistics, such as [Hayes (1963)] (pp. 269-280), [Ferguson (1976)], or [Marascuilo & Serlin (1988)]. In designing very large vocabulary speech recognition systems, on the other hand, one will strive for a maximally broad coverage of relevant phenomena, probably at the expense of high numbers of exact replications of specific (relatively rare) phenomena. Chapter 9 should be consulted on methodology.
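
As an illustration of how sample size follows from a pre-specified power, the sketch below uses the common normal approximation for a two-sided comparison of two independent groups of speakers. This is a simplification chosen for the example, not the procedure of the textbooks cited above; the function name is hypothetical, and the standardised effect size (difference in means divided by the common standard deviation) is an assumption the experimenter must supply. It also assumes one independent observation per speaker.

```python
from math import ceil
from statistics import NormalDist


def speakers_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate number of speakers needed in each of two groups.

    Normal approximation: n = 2 * ((z_(1-alpha/2) + z_power) / d) ** 2,
    where d is the assumed standardised effect size.
    """
    if effect_size <= 0:
        raise ValueError("effect_size must be positive")
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)  # two-sided significance level
    z_power = std_normal.inv_cdf(power)          # desired power
    return ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)


print(speakers_per_group(0.5))  # 63
print(speakers_per_group(0.8))  # 25
```

Under these assumptions, detecting a medium-sized difference already requires more speakers per group than the upper bound of this corpus class, so designs of this size usually rely on large effects or on repeated measurements per speaker, which call for the more detailed procedures in the statistics textbooks.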

Speech corpora with more than 50 speakers

Speech corpora of this composition are necessary to adequately train and test speaker-independent recognition systems. Speakers can be men, women, or children, depending on the application. The speech material can be limited to a list of isolated words or numbers, but it can also contain sentences and paragraphs read aloud, or even spontaneous speech in the case of interactive dialogues. Speech may be recorded under laboratory conditions or in (quiet) offices, but if a telephone recognition system is involved, the speech corpus should consist of telephone speech for both the training phase and the testing phase.

General remarks

Of course, possible applications of corpora may be quite different from the typical ones listed above; some fundamental research may, for instance, require a very large speech corpus, whereas a simple recognition system may be developed with a rather small speech corpus. Furthermore, the list of applications of speech corpora given above is not meant to be exhaustive, but it should help to illustrate the large differences between speech corpora, depending on their research goal.

Speaker recognition is a branch of speech (technology) research which has received little attention in the past decade. This is reflected in the lack of publicly available corpora to support speaker recognition research. However, it is not completely true to say that there are no corpora which are suitable for speaker recognition research; notable exceptions are the King corpus and the Switchboard corpus, both available through the LDC (cf. also the Proceedings of the ESCA workshop on Speaker Recognition in Martigny, April 1994). For a corpus to be suitable for speaker recognition research it is essential that speakers are recorded more than once, and that recordings are made on different days, in different realistic acoustic environments and with different microphones.
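
The suitability requirements for speaker recognition corpora can be checked mechanically once session metadata is available. The following is a minimal sketch under assumed conventions: the record layout (speaker, date, environment, microphone), the function name and all example values are hypothetical and do not describe any existing corpus.

```python
from collections import defaultdict


def unsuitable_speakers(sessions):
    """Return {speaker: [problems]} for speakers who fail the requirements.

    sessions: iterable of (speaker, date, environment, microphone) tuples,
    one per recording session. Hypothetical metadata layout.
    """
    by_speaker = defaultdict(list)
    for speaker, date, environment, microphone in sessions:
        by_speaker[speaker].append((date, environment, microphone))

    flagged = {}
    for speaker, records in by_speaker.items():
        problems = []
        if len(records) < 2:
            problems.append("recorded only once")
        # Each attribute must take at least two distinct values per speaker.
        for index, label in enumerate(("day", "environment", "microphone")):
            if len({record[index] for record in records}) < 2:
                problems.append(f"only one {label}")
        if problems:
            flagged[speaker] = problems
    return flagged


# Invented example sessions.
example_sessions = [
    ("spk01", "day1", "office", "mic_A"),
    ("spk01", "day2", "street", "mic_B"),
    ("spk02", "day1", "office", "mic_A"),
    ("spk02", "day1", "office", "mic_B"),
    ("spk03", "day1", "office", "mic_A"),
]
report = unsuitable_speakers(example_sessions)
print(report)
```

In this invented example only spk01 meets all requirements; spk02 was recorded twice but on a single day in a single environment, and spk03 was recorded only once.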


EAGLES SWLG SoftEdition, May 1997.