Speech production is an extremely complex process. It involves several major organs (lungs, larynx, mouth, nose, brain, etc.) and depends more or less directly on various second-order physiological parameters such as pulse, blood pressure, or sweat production.
But nobody just talks. There are always facial expressions and gestures involved, which are in fact the only source of information for the interpretation of speech by the deaf. Speech production also clearly depends on biographical factors.
In general it is essential not only to record the pure speech, i.e. the time signal representing the air pressure at the microphone versus time, but also to log as much correlated data as possible at the same time.
The range of speech-related parameters to be investigated may be large and often depends on the underlying purpose for which the speech recording is made. Within the framework of this handbook the scope is restricted to the description of the most commonly used parallel (simultaneous) recording techniques.
However, for obvious reasons, parallel recording must not interfere with actual speech production, either physically or by imposing additional psychological stress on the talking subject.
The reasons for determining pitch or fundamental frequency in parallel with the time-signal speech data are numerous. First, since pitch determination by machine is generally precise and reliable [Hess (1983)], it is frequently used for the automatic segmentation and labelling of speech. Second, a laryngogram permits classification of a voice almost at a glance, which is very useful for the classification of talkers according to the recommendations described in Section 8.3. Finally, it might be used to visualise speech for education and rehabilitation purposes.
Although plenty of pitch determination algorithms (PDAs) have been developed in the past decade or so, none has achieved the reliable performance of a pitch determination instrument (PDI). This holds even more for potentially pathological voices.
Although the term pitch is often used as a synonym for fundamental frequency, the following distinction is sometimes made:
The underlying principle of mechanical PDIs is to convert vibrations at the throat directly into an electric signal. These instruments are mainly applied in education and rehabilitation, e.g. in teaching the intonation of foreign languages, or in the education of the deaf. The mechanical PDI yields an excellent signal for pitch detection; for glottal waveform investigation, however, its output signal is less well suited, since the instants of glottal opening and closure are difficult to detect. This is due to the inertia effects from which mechanical PDIs suffer, since they operate on the basis of throat microphones, contact microphones, and accelerometers. The most critical point in the realisation of suitable microphones is to decouple background noise and the talker's own speech from the actual throat vibration. In any case a very tight coupling and extremely good isolation at the throat are needed, and these measures may have an adverse effect on the talking subject.
Electrical PDIs utilise the change of the electric impedance of the larynx due to the opening and closing of the glottis. The technical idea is to let the changing impedance of the moving larynx modulate a high-frequency (usually about 1 MHz) circuit (see Figure 8.7).
Figure 8.7: Technical principle of an electrical PDI
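The modulation principle can be illustrated numerically. The following is only a minimal sketch under stated assumptions, not a model of any real instrument: the carrier is scaled down from about 1 MHz to 10 kHz so that it can be simulated in plain Python, the modulation depth of 0.2 is arbitrary, and the envelope detector (full-wave rectification followed by a moving average over one carrier period) is one simple way of recovering the modulating signal from the HF voltage. All function and variable names are hypothetical.

```python
import math

def am_modulate(envelope, carrier_hz, sample_rate):
    """Multiply a slowly varying envelope onto a sinusoidal carrier."""
    return [a * math.sin(2 * math.pi * carrier_hz * n / sample_rate)
            for n, a in enumerate(envelope)]

def envelope_detect(hf, window):
    """Full-wave rectify, then average over `window` samples.

    The mean of |sin| over one period is 2/pi, so the average is
    rescaled by pi/2 to estimate the envelope amplitude.
    """
    prefix = [0.0]
    for x in hf:
        prefix.append(prefix[-1] + abs(x))
    scale = math.pi / 2
    return [scale * (prefix[i + window] - prefix[i]) / window
            for i in range(len(hf) - window + 1)]

# Illustrative values only (assumptions, scaled down for simulation).
SAMPLE_RATE = 200_000   # samples per second
CARRIER_HZ = 10_000     # stands in for the HF carrier of about 1 MHz
VOICE_HZ = 100          # rate of the simulated impedance change
DEPTH = 0.2             # arbitrary modulation depth
N = 4000                # 20 ms of signal

true_env = [1 + DEPTH * math.sin(2 * math.pi * VOICE_HZ * n / SAMPLE_RATE)
            for n in range(N)]
hf = am_modulate(true_env, CARRIER_HZ, SAMPLE_RATE)

window = SAMPLE_RATE // CARRIER_HZ          # one carrier period, 20 samples
recovered = envelope_detect(hf, window)

# Compare each estimate with the true envelope at the window centre.
max_error = max(abs(recovered[i] - true_env[i + window // 2])
                for i in range(len(recovered)))
print(f"largest envelope error: {max_error:.4f}")
```

The recovered curve follows the simulated impedance change closely; the residual error comes from the coarse sampling of the carrier and from the envelope changing slightly within each averaging window.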
The output signal of an electrical PDI is extracted from the HF-voltage by a simple radio-frequency AM-demodulator. Pitch determination from this signal is straightforward, since large jumps at the instant of glottal closure are observed, which may be detected by a threshold analysis on the first derivative of the recorded curve.
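The threshold analysis on the first derivative can be sketched as follows. This is a minimal illustration rather than a reference implementation: it assumes that the sampled PDI output is available as a list of numbers, that glottal closure appears as a large positive jump (the polarity depends on the instrument), and that the threshold and the minimum spacing between detections are chosen by hand. The function names and the synthetic test signal are hypothetical.

```python
def detect_closures(signal, sample_rate, threshold, min_interval_s=0.002):
    """Return sample indices where the first difference exceeds `threshold`.

    Detections closer together than `min_interval_s` are merged so that
    one steep edge spanning several samples is counted only once.
    """
    min_gap = round(min_interval_s * sample_rate)
    closures = []
    for n in range(1, len(signal)):
        slope = signal[n] - signal[n - 1]      # discrete first derivative
        if slope > threshold and (not closures or n - closures[-1] >= min_gap):
            closures.append(n)
    return closures

def mean_f0(closures, sample_rate):
    """Mean fundamental frequency in Hz from successive closure instants."""
    if len(closures) < 2:
        return None
    periods = [(b - a) / sample_rate for a, b in zip(closures, closures[1:])]
    return len(periods) / sum(periods)

# Synthetic signal (assumption): a sharp rise at each closure followed
# by a slow linear decay, with a period of 100 samples.
SAMPLE_RATE = 10_000
PERIOD = 100
lx = [1.0 - (n % PERIOD) / PERIOD for n in range(1000)]

closures = detect_closures(lx, SAMPLE_RATE, threshold=0.5)
f0 = mean_f0(closures, SAMPLE_RATE)
print(closures[:3], f0)
```

With a period of 100 samples at 10 kHz the estimate should come out at about 100 Hz. On real recordings the threshold would have to be adapted to the signal level of the individual talker.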
Electrical PDIs are commercially available under the names glottograph, laryngograph, or electroglottograph. They are optimal with regard to precision, handling, robustness, and the negligible discomfort they may cause to the talker. In some rare cases the PDI might not work for an individual speaker; when it does work, however, it is fairly foolproof.
The principle of the ultrasonic PDI is based on the fact that the acoustic impedance of air is extremely different from that of flesh, cartilage, and tissue. If a focussed ultrasound beam is transmitted through the vibrating vocal cords, it will only be able to pass if the glottis is closed. If the glottis is open, the ultrasound wave is almost totally reflected due to the impedance mismatch between the tissue and the air in the glottis.
Basically, two different principles of investigating the vocal cords with this method have been developed: the pulse-echo method and the continuous-wave method. To design a PDI, the latter appears most promising. It is based on the idea that the vocal cord vibrations modulate a continuous-wave ultrasound that is transmitted through the larynx at the level of the vocal cords.
Ultrasonic PDIs using continuous-wave ultrasound show an output signal similar to that of electrical PDIs, but unlike the latter have almost 100% amplitude modulation when the beam passes through the vocal cords. On the other hand, the ultrasonic PDI is much more sensitive to vertical positioning of the transducers. That is why we do not recommend this device for parallel recordings, at least not for the untrained operator.
As with the electrical PDI, the photoelectric PDI is commercially available.
It is based on the principle of transillumination of the glottis. A strong light source is placed at the neck below the glottis. Part of the light passes through the skin and the tissue into the trachea. If the glottis is closed, the light is absorbed by the vocal cords, and the pharynx remains dark.
A phototransistor in the pharynx which works as a light transducer picks up the temporal variations of light in the pharynx due to glottal opening and closure. In contrast to the electrical and the ultrasonic PDI, the photoelectric PDI gives a measure of the cross-sectional area of the glottis, not the degree of glottal closure.
The practical application of this technique, i.e. the positioning of the photoelectric transducer in the pharynx, makes it difficult to obtain long-term consistency in records produced with photoelectric PDIs (not to mention the stress and discomfort this little piece of high-tech might impose on the person it is attached to). However, the short-term performance of this instrument is excellent. It exhibits an exact synchronisation with both the point of glottal opening and the point of glottal closure. Unlike the electrical and the ultrasonic PDI types, the photoelectric PDI permits measurements when the glottis does not close completely, e.g. due to voice disease or breathy voice in normal speech.
Accordingly, the use of a photoelectric PDI should be restricted to voice source investigation in basic phonetic and linguistic research in logopedics and phoniatrics; for simultaneous high-quality microphone recordings of speech, this technique is not recommended.
The simultaneous measurement of second-order physiological quantities like pulse, blood pressure, body temperature, skin can be useful for some purposes. One might think of taking an EKG and/or EEG in parallel with the speech recording. However, the range of possible physiological measurements is large, and the choice depends on the specific purpose for which the speech material is recorded. Whatever additional recording is decided on, it must again be stressed that any disturbance to the talking subject should be minimised.
It is known that speech relates closely to facial expression and gesture; for deaf persons, direct communication relies largely on lip reading and sign language. In applications which may be related to this domain, we therefore recommend carrying out additional video recordings simultaneously with the speech recordings.
Since sign language takes place in 3-dimensional space and since hands might overlap for some gestures, there should be at least two cameras, separated from each other by a well-defined angle and distance, mounted in front of the talker. For later reconstruction and spatial evaluation, it is crucial to note the exact data related to the relative positioning of the cameras. Commercial systems for 3-D recordings are widely available, though not inexpensive.
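Why the relative positioning of the cameras must be recorded can be seen from a simple triangulation. The sketch below is a deliberately simplified illustration and not the procedure of any particular 3-D system: it works in the horizontal plane only, places camera 1 at the origin and camera 2 at a known distance along the x axis, and assumes that each camera supplies the bearing angle (measured counter-clockwise from the baseline) under which it sees the same point, e.g. a fingertip. All names are hypothetical.

```python
import math

def triangulate(baseline_m, angle1_rad, angle2_rad):
    """Intersect the two viewing rays and return the point (x, y) in metres.

    Camera 1 sits at (0, 0), camera 2 at (baseline_m, 0). Both angles are
    measured counter-clockwise from the positive x axis (the baseline).
    """
    denom = math.sin(angle2_rad - angle1_rad)
    if abs(denom) < 1e-12:
        raise ValueError("viewing rays are parallel; no unique intersection")
    # Distance from camera 1 to the point along its viewing ray.
    range1 = baseline_m * math.sin(angle2_rad) / denom
    return range1 * math.cos(angle1_rad), range1 * math.sin(angle1_rad)

# Example (assumed geometry): cameras 1 m apart, point 2 m in front of
# the baseline and midway between the cameras.
BASELINE_M = 1.0
true_point = (0.5, 2.0)
a1 = math.atan2(true_point[1], true_point[0])
a2 = math.atan2(true_point[1], true_point[0] - BASELINE_M)

x, y = triangulate(BASELINE_M, a1, a2)
print(round(x, 6), round(y, 6))
```

An error in the recorded baseline scales the reconstructed position directly, and an error in the recorded camera angle shifts the ray intersection, which is why both have to be noted exactly at recording time.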
For lip-reading recordings, a single camera in front of the subject may suffice.
The following recommendations can be given with respect to the use of parallel recording techniques:
For further reading consult [Bartlett (1987), Ballou (1987), Davis & Davis (1975), Tomlinson (1990), Hess (1983)].