Monitoring is the task of controlling and modifying technical and phonetic characteristics on-line, i.e. during the course of a recording. Validation, by contrast, is the off-line (or post hoc) technical or phonetic evaluation of the recorded material.
Monitoring is best suited to studio recordings of read speech and, to a limited extent, to interviews or dialogues. Technical characteristics such as the recording level and the selection of recording channels can be varied while the recording is in progress, and even a computer-based script can be adapted to the speaker. Monitoring the phonetic characteristics of a recording concerns the quality of the speech: pronunciation, speaking style, etc. Two on-line monitoring paradigms can be distinguished: one in which any deviation or error is signalled to the experimenter only, and another in which the speaker is also informed that a particular error has occurred.
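As an illustration of on-line technical monitoring, the following minimal sketch checks the recording level of one block of samples. It assumes 16-bit PCM input; the function names and the thresholds of -30 dB and -3 dB relative to full scale are hypothetical choices made for this example, not values prescribed here.

```python
import math

def peak_dbfs(samples, full_scale=32768):
    """Peak level of a block of PCM samples in dB relative to full scale.

    Assumes signed 16-bit samples (full_scale=32768); this is an
    assumption of the sketch, not a requirement of the text.
    """
    peak = max((abs(s) for s in samples), default=0)
    if peak == 0:
        return float("-inf")  # digital silence or an empty block
    return 20.0 * math.log10(peak / full_scale)

def check_level(samples, low_db=-30.0, high_db=-3.0):
    """Classify a block as 'too low', 'too high' or 'ok'.

    The thresholds are illustrative assumptions only.
    """
    level = peak_dbfs(samples)
    if level > high_db:
        return "too high"  # risk of clipping
    if level < low_db:
        return "too low"   # recording gain probably needs raising
    return "ok"
```

In the first monitoring paradigm the result would be shown to the experimenter only; in the second it could also be displayed to the speaker.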
Monitoring is the only practical paradigm that guarantees that the corpus will indeed contain exactly the items and the number of repetitions planned for during corpus design. The procedure has one very important limitation that should not be underestimated: it will yield a corpus which is (virtually) completely devoid of dysfluencies, out-of-vocabulary words, coughs, sneezes, etc. Cleaned-up corpora of this type have misled engineers into thinking that speech recognisers had reached performance levels sufficient for actual applications. What they failed to realise, because these phenomena were absent from the training materials, was that dysfluencies etc. abound in real life, and that these phenomena may be more important in determining the real-life performance of speech recognisers than the recognition error rate on a clean corpus. For this reason it is strongly recommended to use post hoc transliteration whenever that is possible. In making this recommendation it is acknowledged that there is no point in recording dysfluencies etc. when speech material is collected for carefully designed perception experiments.
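The guarantee that a corpus contains exactly the planned items and repetitions can be checked mechanically. The sketch below is one possible way to do this, assuming that the corpus design is available as a mapping from item to planned number of repetitions and that each accepted recording is logged under its item label; the function name and data layout are hypothetical.

```python
from collections import Counter

def coverage_report(planned, recorded):
    """Compare recorded items against the corpus design.

    planned  -- dict mapping each item to its planned repetition count
    recorded -- iterable of item labels, one per accepted recording
    Returns (missing, surplus): dicts mapping items to the number of
    repetitions still needed or recorded beyond the plan.
    """
    counts = Counter(recorded)
    missing = {
        item: n - counts[item]
        for item, n in planned.items()
        if counts[item] < n
    }
    surplus = {
        item: c - planned.get(item, 0)
        for item, c in counts.items()
        if c > planned.get(item, 0)
    }
    return missing, surplus
```

During a monitored session, an item would be presented again until it no longer appears among the missing items. An item that was never planned shows up as surplus, which is exactly the kind of material that monitoring removes and post hoc transliteration preserves.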
Some characteristics of recorded speech can only be evaluated after the recording has taken place. In the technical domain, these include the signal-to-noise ratio of the whole material and an analysis of the noises that were recorded together with the speech. The phonetic characteristics include an analysis of the items produced, an orthographic transliteration or phonemic transcription of the speech signal, and segmental and prosodic descriptions.
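A rough post hoc estimate of the signal-to-noise ratio can be obtained by comparing the mean power of a speech stretch with that of a noise-only stretch. The sketch below assumes that both stretches have already been identified (for example, from the pauses before the first item); the function names are hypothetical. Because the speech stretch also contains the background noise, the value is strictly a (signal + noise)-to-noise ratio.

```python
import math

def mean_power(samples):
    """Mean squared amplitude of a non-empty sequence of samples."""
    if not samples:
        raise ValueError("segment must contain at least one sample")
    return sum(s * s for s in samples) / len(samples)

def snr_db(speech, noise):
    """Ratio of speech power to noise power in dB.

    speech -- samples from a stretch containing speech
    noise  -- samples from a stretch containing background noise only
    """
    p_speech = mean_power(speech)
    p_noise = mean_power(noise)
    if p_noise == 0:
        return float("inf")   # no measurable noise floor
    if p_speech == 0:
        return float("-inf")  # the "speech" stretch is digital silence
    return 10.0 * math.log10(p_speech / p_noise)
```

For a figure covering the whole material, the same computation could be applied to all speech and all noise stretches pooled together, or per recording and then summarised.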
Post hoc validation was employed in collecting some (very) large corpora such as Voice Across America and POLYPHONE. It is also used in the German VERBMOBIL corpus collection, where the dialogue recordings are transliterated orthographically and then transcribed relative to a given citation form [Hess et al. (1995)].