The number of speakers who are represented is one of the most important characteristics of a spoken language corpus. Based on the number of speakers in the corpus, speech corpora can be roughly divided into the following three classes:
Corpora with very few speakers are often used in the development of speech synthesis systems. In most cases the speech of one or two persons (typically one man and one woman) is recorded. The corpus is used to prepare dictionaries of phonetic elements (allophones, diphones, etc.) and to design prosodic models. The speech material may consist of nonsense words in which sequences of phonetic elements are systematically varied, and a series of sentences from which prosodic rules are extracted. For developing synthesis systems it is recommended to use experienced speakers. Especially when recording the material that serves for building the segment inventory, it is extremely important that the speakers can keep pitch, loudness, voice quality and tempo constant.
Corpora comprising very few speakers are also common in basic speech research, especially where invasive measurements must be made. Corpora in this domain typically contain several additional signals recorded simultaneously with the acoustic speech signal; see e.g. the ESPRIT Basic Research Project on Articulatory Phonetics. The additional signals can range from the electroglottogram (which was also recorded in part of the EUROM-1 corpus and the Transnational English Corpus) to subglottal pressure recorded via tracheal puncture and EMG activity of intrinsic laryngeal muscles. It should be emphasised that having very few speakers does not necessarily imply a small corpus. For instance, when one needs to record one speaker producing all three-consonant clusters in languages like Dutch, English or German, in all possible phonetic contexts (within syllables, across syllable boundaries, across word boundaries, in stressed and unstressed syllables, at several positions in a sentence or in a prosodic contour), the amount of speech required is formidable, even when greedy search algorithms [Van Santen (1992)] are used to find the smallest possible number of sentences which comprise all contexts.
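The idea of a greedy search over candidate sentences can be illustrated with a minimal sketch. It assumes each candidate sentence has already been annotated with the set of phonetic contexts it contains; the function name `greedy_sentence_cover` and the data layout are hypothetical, and this is a generic greedy set-cover heuristic, not necessarily the procedure of [Van Santen (1992)].

```python
def greedy_sentence_cover(candidates):
    """Pick sentences until every context in the candidate pool is covered.

    `candidates` maps a sentence identifier to the set of context labels
    found in that sentence (a hypothetical annotation format).
    """
    # Everything that at least one candidate sentence can cover.
    uncovered = set().union(*candidates.values())
    chosen = []
    while uncovered:
        # Take the sentence that adds the most still-uncovered contexts.
        # Ties are resolved in favour of the earliest candidate.
        best = max(candidates, key=lambda s: len(candidates[s] & uncovered))
        chosen.append(best)
        uncovered -= candidates[best]
    return chosen


if __name__ == "__main__":
    pool = {
        "s1": {"str-initial", "spr-medial", "skr-final"},
        "s2": {"skr-final", "spl-initial"},
        "s3": {"spl-initial"},
        "s4": {"str-initial", "spl-initial", "str-unstressed"},
    }
    print(greedy_sentence_cover(pool))
```

A greedy choice of this kind yields a small recording script but does not guarantee the smallest possible one, and the script still grows with the number of distinct contexts that must be covered.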
Similar remarks apply to intonation and prosody in general. If a text-to-speech system is developed that must be employed in many different applications (e.g. reading factual information in a train timetable information system, or reading popular daily newspapers to blind subscribers), enormous amounts of speech are needed to capture all relevant prosodic phenomena.
Speech corpora of intermediate size, with a moderate number of speakers, are often used in experimental factorial research. The speech material can range from isolated nonsense words to a complete discourse, depending on the specific application. Studies on prosody, for instance, would require linguistic units that exceed the word level. Speakers can be men, women, or children. The speech can be recorded under high quality laboratory conditions, but also "in the field". In general, if factorial experimental designs are planned, the number of speakers and the number of repetitions of the speech phenomena under investigation should be large enough for meaningful statistical processing. The power of a statistical test depends on the number of independent observations. If a corpus is developed for a factorial experiment, standard procedures are available and should be adhered to for determining the minimum number of speakers and/or the minimum number of utterances per speaker that allows the planned statistical tests to reach a pre-specified power. These standard procedures can be found in most textbooks on statistics, such as [Hayes (1963)] (pp. 269-280), [Ferguson (1976)], or [Marascuilo & Serlin (1988)]. In designing very large vocabulary speech recognition systems, on the other hand, one will strive for a maximally broad coverage of relevant phenomena, probably at the expense of high numbers of exact replications of specific (relatively rare) phenomena. Chapter 9 should be consulted on methodology.
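As a rough illustration of such a power calculation, the sketch below estimates how many speakers per group are needed to compare two independent groups of speakers (for example men and women) with a two-sided test. It uses the common normal approximation rather than the exact procedures in the textbooks cited above. The function name `speakers_per_group`, the standardised effect size (difference between group means divided by the common standard deviation) and the default significance level and power are assumptions made for the example.

```python
import math
from statistics import NormalDist


def speakers_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate group size for a two-sided, two-group comparison.

    Normal approximation: n = 2 * ((z_{1-alpha/2} + z_{power}) / d) ** 2,
    where d is the standardised effect size. Each speaker is assumed to
    contribute one independent observation.
    """
    if effect_size <= 0:
        raise ValueError("effect_size must be positive")
    standard_normal = NormalDist()
    z_alpha = standard_normal.inv_cdf(1 - alpha / 2)
    z_power = standard_normal.inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)


if __name__ == "__main__":
    for d in (0.5, 0.8, 1.0):
        print(d, speakers_per_group(d))
```

Under these assumptions a medium effect (0.5) calls for about 63 speakers per group, while a large effect (0.8) needs about 25. Exact calculations based on the t distribution give slightly higher numbers, and repeated utterances from one speaker are not independent observations, so they cannot simply be counted as additional speakers.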
Speech corpora with a large number of speakers are necessary to adequately train and test speaker-independent recognition systems. Speakers can be men, women, or children, depending on the application. The speech material can be limited to a list of isolated words or numbers, but it can also contain read-aloud sentences and paragraphs, or even spontaneous speech in the case of interactive dialogues. Speech may be recorded under laboratory conditions or in (quiet) offices, but if a telephone recognition system is involved, the speech corpus should, of course, consist of telephone speech both for the training phase and the testing phase.
Of course, possible applications of corpora may be quite different from the typical ones listed above; some fundamental research may, for instance, require a very large speech corpus, whereas a simple recognition system may be developed with a rather small speech corpus. Furthermore, the list of applications of speech corpora given above is not meant to be exhaustive, but it should help to illustrate the large differences between speech corpora, depending on their research goal.
Speaker recognition is a branch of speech (technology) research which has received little attention in the past decade. This is reflected in the lack of publicly available corpora to support speaker recognition research. However, it is not completely true to say that there are no corpora which are suitable for speaker recognition research; notable exceptions are the King corpus and the Switchboard corpus, both available through the LDC (cf. also the Proceedings of the ESCA workshop on Speaker Recognition in Martigny, April 1994). For a corpus to be suitable for speaker recognition research it is essential that speakers are recorded more than once, and that recordings are made on different days, in different realistic acoustic environments and with different microphones.
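The requirements for a speaker recognition corpus can be checked mechanically against a recording log. The following sketch assumes a hypothetical log in which every recording session is a tuple of speaker identifier, recording date, acoustic environment and microphone; the function name `find_coverage_gaps` and the field names are illustrative only.

```python
from collections import defaultdict


def find_coverage_gaps(sessions):
    """Report speakers whose sessions lack the required variety.

    `sessions` is an iterable of (speaker, date, environment, microphone)
    tuples (a hypothetical log format). A speaker is flagged for every
    attribute that takes fewer than two distinct values, which also
    catches speakers who were recorded only once.
    """
    seen = defaultdict(
        lambda: {"date": set(), "environment": set(), "microphone": set()}
    )
    for speaker, date, environment, microphone in sessions:
        seen[speaker]["date"].add(date)
        seen[speaker]["environment"].add(environment)
        seen[speaker]["microphone"].add(microphone)

    gaps = {}
    for speaker, values in seen.items():
        lacking = [name for name, distinct in values.items() if len(distinct) < 2]
        if lacking:
            gaps[speaker] = lacking
    return gaps


if __name__ == "__main__":
    log = [
        ("spk01", "day 1", "office", "handset"),
        ("spk01", "day 8", "street", "headset"),
        ("spk02", "day 1", "office", "handset"),
        ("spk02", "day 2", "office", "handset"),
    ]
    print(find_coverage_gaps(log))
```

In this example the second speaker would be flagged because all of that speaker's sessions share one environment and one microphone, even though they were made on different days.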