The basic equipment needed for speech recordings consists of a microphone, an amplifier or processor, and a recording device.
The choice of equipment depends on the choices made along the dimensions of visibility, environment, and communication mode, and on the data to be recorded.
The quality of the recording channel itself (microphone and recording medium) is determined by three characteristics: signal-to-noise ratio, bandwidth, and dynamic range.
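As an illustration of the first of these characteristics, the following minimal sketch computes a signal-to-noise ratio in decibels from two excerpts, one containing the signal and one containing only the noise floor. The function names `rms` and `snr_db` and the synthetic test data are assumptions made for this example; in practice the noise excerpt would be taken from a pause in the actual recording.

```python
import math

def rms(samples):
    """Root-mean-square amplitude of a sequence of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def snr_db(signal, noise):
    """Signal-to-noise ratio in dB from a signal excerpt and a noise-only excerpt."""
    return 20.0 * math.log10(rms(signal) / rms(noise))

# Synthetic illustration (assumed data): a 440 Hz tone of amplitude 1.0,
# and a "noise floor" modelled as the same waveform scaled by 0.001.
rate = 16000
signal = [math.sin(2 * math.pi * 440 * n / rate) for n in range(rate)]
noise = [0.001 * s for s in signal]

print(round(snr_db(signal, noise), 1))  # 60.0
```

An amplitude ratio of 1000:1 corresponds to 60 dB, since 20 log10(1000) = 60.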
For every speech recording a log or journal should be kept. It contains the essential administrative information about recording setup, personnel involved, speaker data, and recording time and date.
It is necessary to store at least the following data for a recording session: recording time and date, recording engineer, environment, speakers, and equipment.
Recording time and date and the recording engineer are independent of the number of speakers or channels recorded. The environment, speakers, and equipment may differ for each channel and thus should be written down separately for each channel. A separation of recording-dependent and channel-dependent data is thus advisable, and this separation should be made explicit in the layout of a form or a database structure (see Table 4.1).
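One possible way to make this separation explicit in a data structure is sketched below. It is only an illustration: the class names, field names, and sample values are hypothetical and are not taken from Table 4.1.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ChannelRecord:
    """Channel-dependent data: may differ for each recorded channel."""
    channel_id: int
    speaker_id: str
    environment: str
    microphone: str
    recording_device: str

@dataclass
class SessionRecord:
    """Recording-dependent data: shared by all channels of one session."""
    date: str        # hypothetical convention: ISO 8601 date
    start_time: str
    engineer: str
    channels: List[ChannelRecord] = field(default_factory=list)

# Placeholder values, invented purely for illustration.
session = SessionRecord(date="2000-01-01", start_time="10:30", engineer="engineer A")
session.channels.append(
    ChannelRecord(1, "spk001", "sound-treated booth", "headset", "DAT recorder"))
session.channels.append(
    ChannelRecord(2, "spk002", "office", "table-top, unidirectional", "DAT recorder"))

for ch in session.channels:
    print(session.date, session.engineer, ch.channel_id, ch.speaker_id, ch.environment)
```

The session-level fields are stored once, while each channel carries its own speaker, environment, and equipment entries.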
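Microphones can be classified by their directional characteristics, their electrical transducer principle, and their position relative to the speaker's mouth.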
A well-documented speech corpus should contain data about the microphone, such as make and type (condenser, dynamic, etc.), position of the microphone relative to the speaker or speaker's mouth, possible calibration procedures, etc. (see Chapter 8 for further information).
Unidirectional microphones are sensitive to the direction from which the sound arrives. They are preferred when a single speaker is recorded in a laboratory environment.
Omnidirectional microphones are not sensitive to the direction from which the sound arrives. They are suitable for recordings on location, or when the speaker is moving, e.g. working, walking, or driving a car. Omnidirectional microphones may be used to record several speakers if it is guaranteed that their turns will not overlap.
Microphone arrays are a possible means of sound source location. Computer-controlled microphone arrays can focus on a single sound source, thereby improving the effective signal-to-noise ratio dramatically [Flanagan et al. (1991)].
The electrical transducer principle is a second dimension along which microphones can be distinguished. For most purposes in speech research the differences between the transducer principles are not very important, with the exception of carbon button microphones. Carbon button microphones are used in older telephone handsets. They may distort the frequency response of the signal quite considerably. Moreover, their transmission properties may change significantly over time. Electret microphones are more stable than carbon button microphones. They can have an almost flat frequency response in the telephone bandwidth, at least in principle. The actual frequency response of a microphone depends primarily on the acoustic properties of the case in which it is encapsulated. Badly designed handsets can therefore have bad frequency response characteristics, regardless of the use of electret transducers. For basic research into the characteristics of the glottal sound source, the phase distortion of the microphone is as important as its amplitude response. For virtually all other research and development purposes in speech, phase response is immaterial.
Finally, the microphone position relative to the speaker's mouth can be used to distinguish types of microphones.
Headset microphones usually are attached to headphones via an arm. The position of the microphone relative to the articulatory tract is fixed, and the speaker is free to move the head. However, the microphone has to be positioned very carefully to avoid noise through breathing, and speakers often feel uncomfortable with a headset. If the task to be solved by the speaker is sufficiently complex, unconsciously produced gestures such as lip smacking, scratching one's head, rubbing one's chin, etc. may produce significant noise, especially if the headset is touched.
Close-up microphones are attached to the speaker's clothes, usually on the chest. The microphone does not disturb the speaker and it is quite close to the articulatory tract. However, the distance of the microphone varies greatly with body movements, and new noise sources, e.g. rustling of clothes, are introduced.
Table-top microphones usually are unidirectional microphones placed approximately 50 cm away from a speaker. The microphone does not disturb the speaker, and the distance of the microphone varies only very little with body movements. However, with more than one speaker in a room there is little channel separation, and new noise sources, e.g. interference from room echo, tapping on the table, and movements of prompt sheets, are introduced.
Room microphones are omnidirectional microphones that are placed in specified positions in a room. They are independent of speaker position and can be hidden completely. However, there is little (if any) channel separation, and surrounding noise interferes with the speech signal.
The signal coming from a microphone must be amplified to be recorded. In many cases, some processing is also needed, e.g. analog to digital conversion, transformations for different encoding schemes, filtering to reduce noise, etc.
Some processing steps have to be performed only once, e.g. analog to digital conversion. Others will be performed repeatedly, e.g. the transformations for different encoding schemes.
Basically, there exist two types of recording devices: tape drives, and computers with hard disks. Recordings to tape are either analog (audio tapes, compact cassette, video tapes) or digital (DAT), whereas recordings to hard disk are always digital.
Ongoing development in the field of audio technology has shifted the emphasis away from analogue recording media to digital recording media. The traditional recording medium has been the reel-to-reel magnetic tape. Apart from a relatively poor signal-to-noise ratio of typically 60-70 dB, this medium suffers from mechanical problems such as flutter and wow. Moreover, the quality of an analogue speech recording degrades severely after it has been copied repeatedly. Because of these drawbacks, it is strongly recommended to use digital media for the recording of speech. The most widespread digital medium for recording of speech signals is the DAT (Digital Audio Tape). This medium is strongly recommended. Recordings are made on two channels with a standard sampling frequency of 48 kHz and 16-bit resolution. Another option, which can only be used in a laboratory environment, is to record the speech directly on a high-capacity computer disk. Two other digital audio media, the CD-ROM and WORM (Write Once Read Many), are less suitable for speech recording, because they cannot be erased. That is, data (for instance, speech recordings) can be written to a CD-ROM or WORM only once; afterwards, the stored data can be read as many times as one likes (compare a gramophone disc). The CD-ROM and WORM are especially useful for the permanent storage of selected recordings in a database.
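To put the 60-70 dB figure for analogue tape in perspective, the short sketch below computes the ratio between full scale and one quantisation step for a given word length, expressed in dB. The function name is an assumption made for this example, and the result is a theoretical figure only; real converters and recording chains achieve less.

```python
import math

def quantisation_dynamic_range_db(bits):
    """Ratio of full scale to one quantisation step, in dB, for linear PCM."""
    return 20.0 * math.log10(2 ** bits)

print(round(quantisation_dynamic_range_db(16), 1))  # 96.3
print(round(quantisation_dynamic_range_db(8), 1))   # 48.2
```

Each additional bit adds about 6 dB, so 16-bit resolution offers, in principle, considerably more headroom than the typical analogue tape figures quoted above.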
The recording devices can be characterised according to the following criteria: portability, capacity, and ease of use.
The portability of a recording device is determined primarily by its size and weight, and secondarily by its operating requirements, e.g. power supply, environmental conditions, etc.
Tape drives, analog or digital, come in all sizes, including Walkman-sized DAT recorders. Usually, tape drives are optimised to record or play back signals, i.e. they do not produce very much noise themselves. Portable tape drives usually have only a reduced set of features; they can operate on batteries and are quite immune to adverse environments (some are even water resistant). Non-portable tape drives offer more features (e.g. remote control, manual setting of recording parameters, computer interfaces), require a permanent power supply, and operate in the usual office environments.
Computers too can be divided into portable and desktop computers. In general, they produce significant noise during operation (hard disk spinning, keyboard clicks, system alerts, etc.) and must thus be shielded from the signal to be recorded. Furthermore, sound cards in computers are subject to interference from other devices inside the computer, e.g. noise from the bus, the processor, etc. Portable computers are about the size of an A4 book and weigh approx. 2 kg. At present, only high-end portable computers are equipped with the signal processing facilities (e.g. signal processor, 16-bit quantisation, sample rate > 8 kHz) required for speech recordings.
The capacity of tape drives is almost unlimited because full tapes can be replaced by empty ones quickly and at low cost. Typically, an analog compact cassette holds about 90 minutes of stereo signals, a video cassette up to four hours, a digital DAT tape up to two hours.
The capacity of computers for speech recording is mainly limited by the capacity of the hard disk. A 1-Gigabyte disk can store approx. 8 hours of mono signals (16-bit quantisation, 16 kHz sample rate). Such disks are becoming common on many desktop computers and even in portable computers, so that hard disks are already suitable recording devices for very many speech recordings. The major limitation of recording to hard disks is that the hard disk cannot simply be exchanged for another one. This means that the data on a hard disk has to be saved to some backup medium, e.g. magnetic tape or CD-ROM.
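The capacity figure can be checked with a little arithmetic. The sketch below assumes uncompressed linear PCM and ignores file headers and file-system overhead; the function name and parameters are chosen for this example only.

```python
def recording_hours(disk_bytes, sample_rate_hz, bits_per_sample, channels=1):
    """Hours of uncompressed PCM audio that fit into disk_bytes."""
    bytes_per_second = sample_rate_hz * (bits_per_sample // 8) * channels
    return disk_bytes / bytes_per_second / 3600.0

# 1 Gigabyte, 16 kHz sample rate, 16-bit quantisation, mono
print(round(recording_hours(10**9, 16000, 16), 1))  # 8.7 (decimal gigabyte)
print(round(recording_hours(2**30, 16000, 16), 1))  # 9.3 (binary gigabyte)

# The same disk at DAT parameters: 48 kHz, 16-bit, two channels
print(round(recording_hours(10**9, 48000, 16, channels=2), 1))  # 1.4
```

At 16 kHz and 16 bits, one second of mono speech occupies 32000 bytes, which is consistent with the figure of roughly 8 hours per Gigabyte once some overhead is allowed for.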
Ease of use must be seen under two aspects: first, the ease with which the device can be used to perform the recording; second, the ease with which the recorded data can be accessed for further processing.
Tape drives are easy to set up and speakers are used to them. However, especially for analog recordings, it is quite cumbersome to access recordings for further processing. The appropriate tapes have to be located and the tape drive has to be attached to a computer.
Computers as recording devices are still uncommon. They require the expertise of an engineer to be set up correctly, and speakers are easily distracted by the presence of a computer. However, computers offer significant advantages over tapes: recordings can be fully automated, administrative data is collected together with the recordings, and data is available immediately, either for control purposes or further processing.