Next: Management Up: Procedures Previous: Procedures


The basic equipment needed for speech recordings consists of a microphone, an amplifier and processor, and a recording device.

The choice of equipment depends on the decisions made along the dimensions of visibility, environment, and communication mode, and on the data to be recorded.

The quality of the recording channel itself (microphone and recording medium) is determined by three characteristics: signal-to-noise ratio, bandwidth, and dynamic range.
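
As a rough illustration of two of these characteristics, the following sketch computes a signal-to-noise ratio in decibels from a signal power and a noise power, and the theoretical dynamic range of linear quantisation with a given number of bits. The function names are illustrative only, and the dynamic range figure assumes ideal linear encoding; a real channel is further limited by microphone and amplifier noise.

```python
import math


def snr_db(signal_power, noise_power):
    """Signal-to-noise ratio in dB from two power values in the same units."""
    if signal_power <= 0 or noise_power <= 0:
        raise ValueError("powers must be positive")
    return 10.0 * math.log10(signal_power / noise_power)


def linear_pcm_dynamic_range_db(bits):
    """Ratio of the full amplitude range to one quantisation step, in dB."""
    return 20.0 * math.log10(2 ** bits)
```

Under these assumptions, 16 bit linear quantisation gives a theoretical dynamic range of about 96 dB, i.e. roughly 6 dB per bit.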

For every speech recording a log or journal should be kept. It contains the essential administrative information about recording setup, personnel involved, speaker data, and recording time and date.

It is necessary to store at least the following data for a recording session: recording time and date, recording engineer, environment, speakers, and equipment.

Recording time and date and the recording engineer are independent of the number of speakers or channels recorded. The environment, speakers, and equipment may differ for each channel and thus should be written down separately for each channel. A separation of recording-dependent and channel-dependent data is thus advisable, and this separation should be made explicit in the layout of a form or a database structure (see Table 4.1).


rec_id  date      time   engineer  remarks
M0127D  22.01.95  17:10  CSC       VM scenario "Time Table", April-June calendar

id   channel  type      recorder    microphone  environment  speaker
342  left     audio     sony DAT    HMD416      studio       UAA
343  right    audio     sony DAT    HMD416      office       AJB
344  A        electro-  c:/epg/300  palate UAA  studio       UAA
...  ...      ...       ...         ...         ...          ...

Table 4.1: Possible layout for recording session information
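
The separation of recording-dependent and channel-dependent data could be mirrored in a data structure along the following lines. This is only a sketch: the class names RecordingSession and Channel and the helper speakers_in are hypothetical, the field names are taken from the column headers of Table 4.1, and the example values are the first two audio channels of that table.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Channel:
    """Channel-dependent data: one entry per recorded channel."""
    id: int
    channel: str
    type: str
    recorder: str
    microphone: str
    environment: str
    speaker: str


@dataclass
class RecordingSession:
    """Recording-dependent data, shared by all channels of a session."""
    rec_id: str
    date: str
    time: str
    engineer: str
    remarks: str = ""
    channels: List[Channel] = field(default_factory=list)


def speakers_in(session):
    """Return the sorted set of speaker codes recorded in a session."""
    return sorted({c.speaker for c in session.channels})


session = RecordingSession(
    "M0127D", "22.01.95", "17:10", "CSC",
    'VM scenario "Time Table", April-June calendar',
)
session.channels.append(
    Channel(342, "left", "audio", "sony DAT", "HMD416", "studio", "UAA"))
session.channels.append(
    Channel(343, "right", "audio", "sony DAT", "HMD416", "office", "AJB"))
```

Keeping the channel entries in a list owned by the session means that time, date and engineer are stored once, while environment, speaker and equipment can differ per channel.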


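Microphones can be classified by their directional characteristics, their electrical transducer principle, and their position relative to the speaker.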

A well-documented speech corpus should contain data about the microphone, such as make and type (condenser, dynamic, etc.), position of the microphone relative to the speaker or speaker's mouth, possible calibration procedures, etc. (see Chapter 8 for further information).

Unidirectional microphones are sensitive to the direction from which the sound arrives. They are preferred when a single speaker is recorded in a laboratory environment.

Omnidirectional microphones are not sensitive to the direction from which the sound arrives. They are suitable for recordings on location, or when the speaker is moving, e.g. working, walking, or driving a car. Omnidirectional microphones may be used to record several speakers if it is guaranteed that their turns will not overlap.

Microphone arrays are a possible means of sound source location. Computer-controlled microphone arrays can focus on a single sound source, thereby improving the effective signal-to-noise ratio dramatically [Flanagan et al. (1991)].
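
Why combining several microphones can improve the signal-to-noise ratio may be illustrated with a strongly idealised simulation. The sketch below is not the method of Flanagan et al.; it simply assumes that the speech signal arrives identically and already time-aligned at every microphone, and that each microphone adds independent Gaussian noise of equal power. Under these assumptions averaging the channels leaves the signal unchanged and reduces the noise power by the number of microphones, a gain of 10 log10(N) dB. All function names are illustrative.

```python
import math
import random


def average_channels(channels):
    """Average equally long channels sample by sample."""
    n = len(channels)
    return [sum(samples) / n for samples in zip(*channels)]


def power(samples):
    """Mean squared amplitude of a list of samples."""
    return sum(v * v for v in samples) / len(samples)


def simulated_array_gain_db(n_mics, n_samples=20000, noise_std=1.0, seed=0):
    """Noise reduction in dB obtained by averaging n_mics noisy channels.

    Assumes independent Gaussian noise per microphone and a perfectly
    aligned signal, so the SNR gain equals the noise power reduction.
    """
    rng = random.Random(seed)
    noise = [[rng.gauss(0.0, noise_std) for _ in range(n_samples)]
             for _ in range(n_mics)]
    single = power(noise[0])
    combined = power(average_channels(noise))
    return 10.0 * math.log10(single / combined)
```

With eight microphones the simulated gain should lie close to the theoretical 10 log10(8), about 9 dB. Real arrays must additionally estimate and compensate the delays between microphones, and room noise is rarely fully independent across them.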

The electrical transducer principle is a second dimension along which microphones can be distinguished. For most purposes in speech research the differences between the transducer principles are not very important, with the exception of carbon button microphones. Carbon button microphones are used in older telephone handsets. They may distort the frequency response of the signal quite considerably. Moreover, their transmission properties may change significantly over time. Electret microphones are more stable than carbon button microphones. They can have an almost flat frequency response in the telephone bandwidth, at least in principle. The actual frequency response of a microphone depends primarily on the acoustic properties of the case in which it is encapsulated. Badly designed handsets can therefore have bad frequency response characteristics, regardless of the use of electret transducers. For basic research into the characteristics of the glottal sound source the phase distortion of the microphone is as important as its amplitude response. For virtually all other research and development purposes in speech, phase response is immaterial.

Finally, the microphone position relative to the speaker's mouth can be used to distinguish types of microphones.

Headset microphones are usually attached to headphones via an arm. The position of the microphone relative to the articulatory tract is fixed, and the speaker is free to move the head. However, the microphone has to be positioned very carefully to avoid noise through breathing, and speakers often feel uncomfortable with a headset. If the task to be solved by the speaker is sufficiently complex, unconsciously produced gestures such as lip smacking, scratching one's head, rubbing one's chin, etc. may produce significant noise, especially if the headset is touched.

Close-up microphones are attached to the speaker's clothes, usually on the chest. The microphone does not disturb the speaker and it is quite close to the articulatory tract. However, the distance of the microphone varies greatly with body movements, and new noise sources, e.g. rustling of clothes, are introduced.

Table-top microphones are usually unidirectional microphones placed approximately 50 cm away from a speaker. The microphone does not disturb the speaker, and the distance of the microphone varies only very little with body movements. However, with more than one speaker in a room there is little channel separation, and new noise sources, e.g. interference from room echo, tapping on the table, movements of prompt sheets, are introduced.

Room microphones are omnidirectional microphones that are placed in specified positions in a room. They are independent of speaker position and can be hidden completely. However, there is little (if any) channel separation, and surrounding noise interferes with the speech signal.


  1. If acceptable in the recording environment, and for optimal acoustical quality, use headset microphones.
  2. Place the microphone slightly to the left or the right of the mouth and a bit below the lower lip to avoid breathing noises. Take care that no cables touch the microphone arm, and that the speaker is comfortable with the headset.
  3. With headsets, have the speakers keep their hands occupied, e.g. by pressing a button or holding a computer mouse, so that they do not touch the headset.
  4. Take care that the attached cable does not tap against any hard surface, because the sound is transmitted to the headset.


The signal coming from a microphone must be amplified to be recorded. In many cases, some processing is also needed, e.g. analog to digital conversion, transformations for different encoding schemes, filtering to reduce noise, etc.

Some processing steps have to be performed only once, e.g. analog to digital conversion. Others will be performed repeatedly, e.g. the transformations for different encoding schemes.


  1. Define a standard setup and procedure for all the steps from recording the signal to storing it.
  2. Choose de-facto accepted standards for the storage formats, and use standardised conversion tools. For normal speech recordings the standard quantisation and sample rates should be 16 bit (linear encoding) and 16kHz, and 16 bit 8kHz for analog telephone speech. For ISDN telephone recordings, one should use the ISDN standard 8 bit A-law encoding at 8kHz sample rate.
  3. Use the same equipment wherever possible (and appropriate).
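
To make the recommended format for normal speech recordings concrete, the following sketch writes a short test tone as 16 bit linear samples at 16kHz using the wave module of the Python standard library, and reads the format parameters back. The choice of the WAV container, the mono channel count, the test tone, and the function names are assumptions of this example, not recommendations from the text above.

```python
import io
import math
import struct
import wave

SAMPLE_RATE_HZ = 16000   # recommended rate for normal speech recordings
SAMPLE_WIDTH_BYTES = 2   # 16 bit linear encoding
CHANNELS = 1             # mono; an assumption of this sketch


def write_test_tone(target, seconds=1.0, freq_hz=440.0, amplitude=0.5):
    """Write a sine tone to a file name or file object; return frame count."""
    n_frames = int(seconds * SAMPLE_RATE_HZ)
    frames = bytearray()
    for n in range(n_frames):
        value = amplitude * math.sin(2 * math.pi * freq_hz * n / SAMPLE_RATE_HZ)
        # "<h" packs one little-endian signed 16 bit sample
        frames += struct.pack("<h", int(value * 32767))
    with wave.open(target, "wb") as wav:
        wav.setnchannels(CHANNELS)
        wav.setsampwidth(SAMPLE_WIDTH_BYTES)
        wav.setframerate(SAMPLE_RATE_HZ)
        wav.writeframes(bytes(frames))
    return n_frames


def read_format(source):
    """Return (channels, sample width in bytes, sample rate, frame count)."""
    with wave.open(source, "rb") as wav:
        return (wav.getnchannels(), wav.getsampwidth(),
                wav.getframerate(), wav.getnframes())
```

Checking the stored parameters with a function such as read_format after every recording is one simple way to verify that a standard setup is actually being followed.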

Recording device

Basically, there exist two types of recording devices: tape drives, and computers with hard disks. Recordings to tape are either analog (audio tapes, compact cassette, video tapes) or digital (DAT), whereas recordings to hard disk are always digital.

Ongoing development in the field of audio technology has shifted the emphasis away from analogue recording media to digital recording media. The traditional recording medium has been the reel-to-reel magnetic tape. Apart from a relatively poor signal-to-noise ratio of typically 60-70 dB, this medium suffers from mechanical problems such as flutter and wow. Moreover, the quality of an analogue speech recording degrades severely after it has been copied repeatedly. Because of these drawbacks, it is strongly recommended to use digital media for the recording of speech. The most widespread digital medium for recording of speech signals is the DAT (Digital Audio Tape). This medium is strongly recommended. Recordings are made on two channels with a standard sampling frequency of 48kHz and 16-bit resolution. Another option, which can only be used in a laboratory environment, is to record the speech directly on a high capacity computer disk. Two other digital audio media, the CD-ROM and WORM (Write Once Read Many), are less suitable for speech recording, because they cannot be erased. That is, data (for instance, speech recordings) can be written to a CD-ROM or WORM only once; afterwards, the stored data can be read as many times as one likes (compare a gramophone disc). The CD-ROM and WORM are especially useful for the permanent storage of selected recordings in a database.

The recording devices can be characterised according to the following criteria: portability, capacity, and ease of use.

The portability of a recording device is determined primarily by its size and weight, and secondarily by its operating requirements, e.g. power supply, environmental conditions, etc.

Tape drives, analog or digital, come in all sizes, including Walkman-sized DAT recorders. Usually, tape drives are optimised to record or play back signals, i.e. they do not produce very much noise themselves. Portable tape drives usually have only a reduced set of features; they can operate on batteries and are quite immune to adverse environments (some are even water resistant). Non-portable tape drives offer more features (e.g. remote control, manual setting of recording parameters, computer interfaces), require a permanent power supply and operate in the usual office environments.

Computers too can be divided into portable and desktop computers. In general, they produce significant noise during operation (hard disk spinning, keyboard clicks, system alerts etc.) and must thus be shielded from the signal to be recorded. Furthermore, sound cards in computers are subject to interference from other devices inside the computer, e.g. noise from the bus, the processor etc. Portable computers are about the size of an A4 book and weigh approx. 2kg. At present, only high-end portable computers are equipped with the signal processing facilities (e.g. signal processor, 16 bit quantisation, sample rate > 8kHz) required for speech recordings.

The capacity of tape drives is almost unlimited because full tapes can be replaced by empty ones quickly and at low cost. Typically, an analog compact cassette holds about 90 minutes of stereo signals, a video cassette up to four hours, a digital DAT tape up to two hours.

The capacity of computers for speech recording is mainly limited by the capacity of the hard disk. A 1-Gigabyte disk can store approx. 8 hours of mono signals (16 bit quantisation, 16kHz sample rate). Such disks are becoming common on many desktop computers and even in portable computers, so that hard disks are suitable recording devices for very many speech recordings already. The major limitation of recording to hard disks is that the hard disk cannot simply be exchanged for another one. This means that the data on a hard disk has to be saved to some backup medium, e.g. magnetic tape or CD-ROM.
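
The capacity figure above can be checked with a short calculation. The sketch below assumes uncompressed linear samples and a decimal gigabyte of 10^9 bytes; the function names are illustrative.

```python
def bytes_per_second(sample_rate_hz, bits_per_sample, channels=1):
    """Data rate of uncompressed linear audio in bytes per second."""
    return sample_rate_hz * (bits_per_sample // 8) * channels


def recording_hours(capacity_bytes, sample_rate_hz, bits_per_sample, channels=1):
    """Hours of audio that fit into a given storage capacity."""
    rate = bytes_per_second(sample_rate_hz, bits_per_sample, channels)
    return capacity_bytes / rate / 3600.0


ONE_GIGABYTE = 10 ** 9  # decimal gigabyte; an assumption of this sketch

# Mono speech at 16 bit, 16kHz on a 1-Gigabyte disk
mono_hours = recording_hours(ONE_GIGABYTE, 16000, 16)

# Disk space needed to copy a two-hour DAT tape (two channels, 48kHz, 16 bit)
dat_two_hours_bytes = bytes_per_second(48000, 16, channels=2) * 2 * 3600
```

With these assumptions mono_hours comes to roughly 8.7 hours, in line with the approximate figure of 8 hours, while a full two-hour DAT tape at its native format would need about 1.4 Gigabytes.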

Ease of use must be seen under two aspects: first, the ease with which the device can be used to perform the recording; second, the ease with which the recorded data can be accessed for further processing.

Tape drives are easy to set up and speakers are used to them. However, especially for analog recordings, it is quite cumbersome to access recordings for further processing. The appropriate tapes have to be located and the tape drive has to be attached to a computer.

Computers as recording devices are still uncommon. They require the expertise of an engineer to be set up correctly, and speakers are easily distracted by the presence of a computer. However, computers offer significant advantages over tapes: recordings can be fully automated, administrative data is collected together with the recordings, and data is available immediately, either for control purposes or further processing.


  1. Use digital recording devices.
  2. Use a computer for the recording to automate recording procedures and for easy access to data for further processing.


EAGLES SWLG SoftEdition, May 1997.