Most of the older speech corpora have been recorded in sound studios. Studio recordings have the drawback that most subjects do not feel at home in that environment, which may affect their speech behaviour in many ways. However, as long as the speech to be elicited consists of lists of words, words embedded in carrier phrases, and the like, the ``abnormality'' due to the unusual reading tasks may outweigh the contribution of the unusual situation.
Studio recordings have the advantage of a superior signal-to-noise ratio. This is thanks to the controlled acoustic environment and, at least as importantly, to the possibility of closely monitoring the recording levels and the distance of the speaker to the microphone, and of using superior but volatile condenser microphones.
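As a rough illustration of what monitoring the signal-to-noise ratio can mean in practice, the following minimal sketch estimates it in dB from two stretches of the same recording. The function names `rms` and `snr_db` are hypothetical, and the sketch assumes that a noise-only stretch (for example a pause before the speaker starts) has already been located by hand; it is not a procedure prescribed by this handbook.

```python
import math


def rms(samples):
    """Root-mean-square amplitude of a non-empty sequence of samples."""
    if not samples:
        raise ValueError("need at least one sample")
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def snr_db(speech, noise):
    """Estimate the signal-to-noise ratio in dB.

    speech: samples from a stretch in which the speaker is talking
    noise:  samples from a stretch containing only background noise

    Assumption: the speech stretch also contains the background noise,
    so strictly this is a (signal + noise) to noise ratio. For clean
    recordings the difference is negligible.
    """
    speech_rms = rms(speech)
    noise_rms = rms(noise)
    if noise_rms == 0:
        return math.inf   # no measurable noise at all
    if speech_rms == 0:
        return -math.inf  # silent "speech" stretch
    return 20.0 * math.log10(speech_rms / noise_rms)
```

A speech stretch whose RMS amplitude is 100 times that of the noise stretch corresponds to 40 dB under this definition.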
One must be aware, however, that ``studio'' is not a well-defined concept. Not all rooms called studios have good acoustic properties. It is not at all unusual to find rooms which do have relatively low ambient sound levels, but at the cost of very long or extremely short reverberation times. If studios are used to record large corpora, room acoustics calibration data should be provided with the speech recordings (see Chapter 8 for details on the calibration procedure).
Many (small) corpora designed for basic speech research have been recorded in rooms which were not designed as audio studios. This is the case with most simultaneous recordings of speech and EMG signals, which are typically made in the research labs of hospitals. Only rarely have these rooms been prepared to provide acceptable room acoustics.
For many applications, high-quality speech recordings from one speaker at a time are required. These recordings should be free of background noise, including noises made by the speakers themselves. The following guidelines apply specifically to the common situation in which speakers are recorded one at a time in a sound studio.
Corpora recorded in the field have the advantage that the speaker is acting in an ecologically realistic environment. In most cases the price to be paid for this advantage is a substantial loss in signal-to-noise ratio, because of high ambient noise levels, because of the limited possibility of monitoring recording levels and distance to the microphone, or both. If ecological reality dictates recordings in the field, one should nevertheless plan for conditions which allow an audio engineer to monitor the procedure.
Complete ecological validity may not be feasible. For instance, recording a speech corpus in a running car cannot safely be accomplished if the speaker is in the driver's seat.
Two important classes of recordings on location are recordings made in actual applications and recordings made over the telephone.
Recording speech in actual applications (which are based on speech input) is one obvious way to obtain ``realistic'' speech data. At least in some countries it is not legally required to advise the user of the service that his speech is being recorded, as long as the recordings are only used for research purposes internal to the company which runs the service. This procedure is probably most often used in pilot versions of an application, where the number of users is limited, so that one may realistically hope to be able to process all recorded speech. Such recordings are necessary to systematically evaluate the success or failure of the speech input parts of an application by relating the speech recordings to the log of the use of the application.
In any case, continued recordings in an application are an efficient means of collecting large amounts of speech data relevant to a given task. Whenever possible, such recordings should thus be made.
Recording speech on the telephone (preferably digital, i.e. ISDN) is suitable for the gathering of limited amounts of speech material from a large number of speakers (POLYPHONE, [Damhuis et al. (1994)]).
A possible drawback of telephone recordings is the limited bandwidth of the speech signal, typically between 300 Hz and 3000 Hz, which may pose problems for some kinds of basic speech research. For example, the absence of low-frequency components prevents a proper pitch analysis of the recorded speech with methods which rely on the presence of the fundamental frequency itself (frequency-domain methods that infer the pitch from the higher harmonics, such as the ``harmonic sieve'', may still yield satisfactory results). The absence of high-frequency components prevents, for instance, the proper spectral analysis of consonants, especially fricatives. Apart from the limited bandwidth of the speech signal, telephone channels can also cause a substantial loss in signal-to-noise ratio, especially in the non-western countries where digital telephone systems are not yet commonplace. Even in modern digital telephone networks the signal-to-noise ratio suffers from the limited dynamic range that can be accommodated with 8 bit A-law (or μ-law) coded samples.
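To make the remark about 8 bit coded samples more concrete, the sketch below compares plain uniform 8 bit quantisation with quantisation after μ-law companding. It is only an illustration under stated assumptions: it uses the continuous μ-law formula with μ = 255, whereas real telephone codecs use a piecewise-linear approximation of that curve, and all function names are hypothetical.

```python
import math

MU = 255.0  # companding parameter, assumed here for 8 bit mu-law


def compress(x):
    """Continuous mu-law compression of a sample x in [-1, 1]."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)


def expand(y):
    """Inverse of compress(): map a companded value back to [-1, 1]."""
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)


def quantise(v, bits=8):
    """Round v in [-1, 1] to the nearest of the uniform signed levels."""
    levels = 2 ** (bits - 1) - 1  # 127 steps on each side for 8 bits
    return round(v * levels) / levels


def mu_law_roundtrip(x, bits=8):
    """Compress, quantise to `bits` bits, then expand again."""
    return expand(quantise(compress(x), bits))


def uniform_roundtrip(x, bits=8):
    """Quantise to `bits` bits without any companding."""
    return quantise(x, bits)
```

With these definitions a very quiet sample such as 0.001 is rounded to zero by `uniform_roundtrip`, whereas `mu_law_roundtrip` reproduces it with an error of a few percent. Companding thus spends the 8 bits more evenly across loud and quiet passages, but the quantisation noise on loud samples is correspondingly coarser, which is one way to picture the limit on signal-to-noise ratio mentioned above.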
If these drawbacks are not important for the research goal, recording telephone speech is a simple way to collect a large amount of speech data in a very short time. For some applications, such as the training and testing of a telephone speech recogniser, a speech corpus with telephone recordings is of course indispensable. It should be emphasised that telephone speech is suitable for many linguistic research projects, including research into most aspects of dialects and regional language variants as well as all aspects related to spoken language syntax and vocabulary.