Task specific descriptors

We define task-specific descriptors as those that directly describe a subject's ability to perform a specific task. For example, in listening tests it is crucial to check whether the subject is able to hear at all.

Talker descriptors

General talker descriptors

1. Voice / speech related medical records

Since certain diseases like inflammation of the vocal cords are known to potentially harm or at least influence the voice permanently, we recommend asking for any related medical records.

2. Voice / speech relevant habits

We consider voice-relevant habits to include smoking and drinking (cf. Section 3.5.2), as well as whether and to what extent a subject has received voice training or is accustomed to professional public speaking. Moreover, one should check whether, and how much, a talker sings regularly for either private or professional reasons.

Anatomical (voice) descriptors

With anatomical talker descriptors, we distinguish between descriptors derived from the subject's laryngeal behaviour and miscellaneous descriptors. These types of descriptor are not mutually exclusive, i.e. a laryngeal feature might well be explained in terms of jitter, shimmer, or glottal-to-noise excitation, and vice versa.

The difference is that a close look at the laryngeal properties of voice necessitates the use of special pitch determination instruments (cf. Section 8.4.2), whereas the other descriptors rely on the analysis of the microphone time signal.

A perceptual classification based on listening might be sufficient, as long as the classification is performed consistently by the same judge(s) on the entire population.

But there exists no such thing as an absolute and generally accepted scale for the quantisation of voice quality.

1. Laryngeal descriptors

The following explanations on how to qualify a voice are partly based on the signal output by a so-called laryngograph (cf. Section 8.4.2). This signal, which is proportional to the electrical impedance of the larynx (i.e. the opening/closure of the glottis), is referred to as Lx; Fx denotes the fundamental frequency (a direct derivative of Lx), and Cx stands for the scatterplot of [expression missing] over the fundamental frequency (Fx). The latter represents a measure of the variance of the fundamental frequency as a function of the fundamental frequency.

Breathy Voice:

A breathy voice results from slow, sometimes incomplete closure of the vocal folds during the laryngeal cycle. It is more often found in women than in men. The auditory impression is that of a ``gentle'' voice, which in women sometimes reaches the point of sounding ``whispery''.

A more sinusoidal shape of the Lx signal, as well as a lower closed/open phase ratio calculated from it, are indicators of breathy voice. Acoustically, the zero-crossing rate in the 3-4 kHz band during the voiced sections of an utterance is a measure of accompanying glottal friction. A further measure is the relative strength of the first and second harmonics (there being a step-down from the first to the second rather than an equal slope).
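As an illustration of the acoustic measure, the following minimal sketch counts zero crossings per second. It assumes that the signal has already been band-pass filtered to the 3-4 kHz band and restricted to voiced sections; neither step is shown. The function name and the test tone are hypothetical and serve only as a check.

```python
import math

def zero_crossing_rate(samples, sample_rate):
    """Return the number of sign changes per second in `samples`."""
    if len(samples) < 2:
        return 0.0
    crossings = sum(
        1 for prev, cur in zip(samples, samples[1:])
        if (prev < 0) != (cur < 0)
    )
    # Duration covered by the sample-to-sample steps that were inspected.
    duration = (len(samples) - 1) / sample_rate
    return crossings / duration

# Hypothetical check signal: a 3.5 kHz tone sampled at 16 kHz for 0.1 s.
sample_rate = 16000
tone = [math.sin(2 * math.pi * 3500 * n / sample_rate + 0.1) for n in range(1600)]
rate = zero_crossing_rate(tone, sample_rate)
```

A pure tone of frequency f crosses zero about 2f times per second, so the rate for the check signal lies close to 7000 crossings per second.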

Harsh Voice:

This is the converse of a breathy voice, and is more often found in men. It probably results from a very fast closing gesture and a high closed/open phase ratio. It is the sort of voice that ``carries'' well in voice babble.

As might be expected from its converse relation to breathy voice, a more vertical closing phase of the Lx wave and a higher closed/open phase ratio indicate a harsher voice. Acoustically, the absence of the step-down from the first to the second harmonic and an overall flatter spectrum are characteristic of this voice quality.

Creaky Voice:

This is the result of irregular laryngeal vibrations, often with a cycle of ``normal'' duration being followed by a cycle of roughly twice the normal duration. It is found in both men and women. In some speakers it occurs at particular parts of an intonation contour, typically at the end of a phrase, when the voice sinks to the bottom of its range.

Irregular laryngeal vibrations are clearly visible in the Lx, and the Fx distribution reveals a clear secondary mode about one octave below the main mode. There are usually also points on the Cx scatterplot to either side of the main diagonal at the lower end of the speaker's frequency range.
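The alternation of normal and doubled cycles can be made concrete with a small sketch. It assumes that per-cycle durations have already been extracted from Lx, and that the scatterplot pairs the Fx of each cycle with the Fx of the following cycle; this pairing is an assumption made for illustration. All function names, the 10% tolerance, and the cycle durations are hypothetical.

```python
def fx_from_periods(periods_s):
    """Convert laryngeal cycle durations (seconds) to per-cycle Fx values (Hz)."""
    return [1.0 / p for p in periods_s]

def successive_pairs(fx):
    """Pair the Fx of each cycle with the Fx of the following cycle."""
    return list(zip(fx, fx[1:]))

def off_diagonal_fraction(pairs, tolerance=0.1):
    """Fraction of pairs whose values differ by more than `tolerance` (relative)."""
    if not pairs:
        return 0.0
    off = sum(1 for a, b in pairs if abs(b - a) > tolerance * a)
    return off / len(pairs)

# Hypothetical cycle durations: regular 10 ms cycles versus
# alternating 10 ms and 20 ms cycles, as described for creaky voice.
regular = [0.010] * 6
creaky = [0.010, 0.020] * 3

regular_fraction = off_diagonal_fraction(successive_pairs(fx_from_periods(regular)))
creaky_fraction = off_diagonal_fraction(successive_pairs(fx_from_periods(creaky)))
creaky_fx = fx_from_periods(creaky)
```

In this toy example the doubled cycles yield Fx values one octave below the regular ones, and every successive pair of the alternating sequence falls away from the diagonal.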

Hoarse Voice:

This adjective is often given to a mixture of laryngeal irregularity with breathiness. In everyday terms it is the sort of voice that makes you think the speaker has been shouting a lot.

The combination of laryngeal irregularity and glottal friction in this voice quality means that it is open to both laryngographic and acoustic representation. The points to the side of the main diagonal of the Cx plot usually spread along the whole of the speaker's range. Acoustically, a similar zero-crossing measure can be used as for breathy voice. In addition, however, the irregularity is clearly visible on a spectrogram.

2. Miscellaneous descriptors

It is currently impossible to give an all-embracing compendium of voice descriptors. We therefore restrict ourselves to those we consider to be most common:

Vocal tract size:

It is generally agreed that body size correlates with vocal tract size. However, head size relative to body size is a further criterion. We recommend logging such personal data, i.e. height, weight and head circumference, for all subjects.

First and third-formant averages over a given utterance, spoken by persons with the same regional accent, can be used as an indicator of relative vocal-tract length.

Jitter and Shimmer:

Jitter and shimmer are measures of the average perturbation of a speaker's fundamental frequency and of its magnitude, respectively. They are given by the formula [formula missing], where u(n) denotes either the length of the observed excitation period (jitter) or the energy in the period (shimmer). Details on how to extract a value for u(n) may be found in [Kasuya et al. (1993)] and [Michaelis & Strube (1995)].

The two measures are highly correlated with each other, and both can be expected to take high values in creaky as well as in hoarse voices (see above).
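Since the exact formula depends on the cited papers, the following is only a sketch of one widely used perturbation measure: the mean absolute difference between consecutive values of u(n), divided by their mean. This choice is an assumption made for illustration and is not necessarily the formula intended here. The function name and the period values are hypothetical.

```python
def relative_perturbation(u):
    """Mean absolute difference of consecutive values, relative to the mean value.

    `u` holds either period lengths (jitter) or per-period energies (shimmer).
    Illustrative definition only; other perturbation quotients exist.
    """
    if len(u) < 2:
        raise ValueError("need at least two periods")
    mean_abs_diff = sum(abs(b - a) for a, b in zip(u, u[1:])) / (len(u) - 1)
    mean_value = sum(u) / len(u)
    return mean_abs_diff / mean_value

# Hypothetical period lengths in seconds.
steady_periods = [0.010, 0.010, 0.010, 0.010]
perturbed_periods = [0.010, 0.011, 0.010, 0.011]

steady_jitter = relative_perturbation(steady_periods)
perturbed_jitter = relative_perturbation(perturbed_periods)
```

The same function applies to shimmer when per-period energies are passed in place of period lengths.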

Glottal-to-Noise Excitation Parameter:

The glottal-to-noise excitation parameter (GNE parameter) indicates whether vocal excitation is mainly due to glottal vibration (GNE = 1) or rather to turbulent noise (GNE = 0).

Since it is a measure of harshness, it exhibits high values in harsh voices and small values in breathy and hoarse voices. For further details on this parameter, consult [Michaelis & Strube (1995)].

Habitual speech descriptors

1. Average level and dynamics of rate of articulation

This can be quantified by average word length for a number of agreed isolated words. In continuous speech, average duration of (underlying) syllables in a given utterance, excluding pauses, may serve. This allows for a rate measure which excludes consideration of articulatory precision. At the same time the minimum and maximum duration of the syllables can be recorded to establish a measure of the dynamics of the rate of articulation.
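For continuous speech, the measure can be sketched as follows, assuming that syllable durations have already been obtained from a segmentation with pauses excluded. The function name, the dictionary keys, and the example durations are hypothetical.

```python
def articulation_rate_stats(syllable_durations_s):
    """Summarise syllable durations (seconds); pauses must already be excluded."""
    if not syllable_durations_s:
        raise ValueError("no syllable durations supplied")
    mean = sum(syllable_durations_s) / len(syllable_durations_s)
    return {
        "mean_duration_s": mean,                       # average level
        "min_duration_s": min(syllable_durations_s),   # fastest syllable
        "max_duration_s": max(syllable_durations_s),   # slowest syllable
        "syllables_per_second": 1.0 / mean,
    }

# Hypothetical syllable durations for one utterance.
stats = articulation_rate_stats([0.15, 0.20, 0.25])
```

The minimum and maximum give the dynamics of the rate of articulation, while the mean gives its average level.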

2. Precision of articulation (coarticulation)

It is difficult to define this in objective terms, and there will possibly be disagreement in selecting speakers, except for extreme cases. Note that this is not necessarily the same as speaking slowly or quickly, though the two dimensions may covary among the same speakers. Though it has not been investigated experimentally, we can assume in the first instance that the impression of precise articulation has to do with consistently avoiding frication in stops, not producing fricatives as approximants, and not eliding or slurring unstressed syllables very much. These are undoubtedly properties that are of interest with respect to recogniser assessment.

In contrast to the rate of articulation, this measure should be based on the average duration of actually realised syllables. Elided syllables, which contribute to the rate measure, would therefore be ignored.

3. Average level and range (dynamics) of fundamental frequency

Though the fundamental frequency in principle is a function of the subject's anatomical data, it is modulated in both directions by intonation, tone and accentuation. The dynamics are constituted by the maximum and the minimum frequency observed in an agreed set of utterances.
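A minimal sketch of this descriptor, assuming that a fundamental frequency track with unvoiced frames removed is already available for the agreed set of utterances. Expressing the range in semitones is an added convenience and not part of the description above; the function name and the example values are hypothetical.

```python
import math

def f0_level_and_range(f0_hz):
    """Summarise voiced-frame F0 values (Hz): mean, minimum, maximum, span."""
    if not f0_hz:
        raise ValueError("no voiced frames supplied")
    lowest, highest = min(f0_hz), max(f0_hz)
    return {
        "mean_hz": sum(f0_hz) / len(f0_hz),
        "min_hz": lowest,
        "max_hz": highest,
        # 12 semitones correspond to a doubling of frequency (one octave).
        "span_semitones": 12 * math.log2(highest / lowest),
    }

# Hypothetical F0 values in Hz, unvoiced frames already removed.
summary = f0_level_and_range([100.0, 150.0, 200.0])
```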

4. Average level and dynamics of speech intensity

The intensity contour provides speech with what is commonly known as volume and rhythm. It can be derived directly from the energy of the time-signal of the recording. As with other measures of this kind, this should be done on an agreed set of utterances.
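Deriving the contour from the energy of the time signal can be sketched as a short-time energy computation. The sketch assumes samples scaled to the range -1 to 1 and non-overlapping frames; the frame length, the energy floor, and the function name are hypothetical choices.

```python
import math

def intensity_contour_db(samples, frame_length, floor=1e-12):
    """Mean energy per non-overlapping frame, in dB relative to full scale."""
    contour = []
    for start in range(0, len(samples) - frame_length + 1, frame_length):
        frame = samples[start:start + frame_length]
        energy = sum(x * x for x in frame) / frame_length
        # The floor avoids taking the logarithm of zero in silent frames.
        contour.append(10 * math.log10(max(energy, floor)))
    return contour

# Hypothetical signal: one loud frame followed by one frame at a tenth of the amplitude.
samples = [0.5] * 4 + [0.05] * 4
contour = intensity_contour_db(samples, frame_length=4)
average_level_db = sum(contour) / len(contour)
dynamics_db = max(contour) - min(contour)
```

The mean of the contour then serves as the average level and the difference between its maximum and minimum as the dynamics.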

Audiometric descriptors

The manner of speaking and the very ability to speak depend on the ability to hear. For that reason it is recommended to check subjects to be recorded for potential hearing impairment, at least in cases of doubt. Appropriate tests are given by pure-tone audiometry and so-called speech audiometry (cf. Section 8.3.2). Audiometric ``functionality'' of talkers becomes crucial in recordings in which acoustic feedback or stimulation is planned during the recording.

Listener descriptors

General listener descriptors

1. Hearing related medical records

Various diseases, such as inflammation of the middle ear, can significantly degrade hearing, even if they occurred decades earlier. For this reason we recommend asking potential candidates whether they suffer, or have suffered, from any such disease, and asking for their medical history (anamnesis).

2. Hearing relevant habits

Average Noise Consumption:

The kind and amount of noise a subject is frequently exposed to gives a clue to possible hearing losses as well as to the degree to which the subject is accustomed to noisy environments. For example, a person who is used to communicating professionally in noisy environments exhibits significantly better listening performance than an inexperienced listener. The average noise is measured by its level, duration and spectral characteristics. A comprehensive discussion of the judgement of effects and figures of everyday noise loads can be found in [Rose (1971)].

Experience:

Experience in listening experiments clearly enhances performance in such tests. For this reason we recommend always establishing a ``listening test record'' for all members of a test population. Primarily this should include the types of test the subject is experienced in.

A ``normalisation'' of all listeners with respect to experience, however, can generally be achieved by giving dummy examples prior to the actual experiment.

Audiometric descriptors

1. Pure-tone audiometry

Pure-tone audiometry provides a measure of hearing sensitivity as a function of frequency. It is measured by air conduction and by bone conduction. In addition to the absolute sensitivity in dB SPL, a pure-tone audiogram also shows how much a subject deviates from the average listener and whether this deviation is within an admissible range.

The technical setup and procedure are standardised to a high degree (ISO 1964), and appropriate test equipment is widely available.

For further details on pure-tone audiometry consult [Rose (1971)].
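The comparison of a subject with the average listener can be sketched as follows. The reference thresholds, the subject's thresholds, and the admissible limit of 20 dB are placeholder values chosen purely for illustration; they are not taken from any standard, and the function name is hypothetical.

```python
def hearing_deviation(thresholds_db, reference_db, limit_db=20.0):
    """Per-frequency deviation from the reference listener and an admissibility flag.

    Both arguments map frequency in Hz to a threshold in dB SPL.
    `limit_db` is a placeholder for the admissible deviation.
    """
    result = {}
    for freq, threshold in thresholds_db.items():
        deviation = threshold - reference_db[freq]
        result[freq] = (deviation, deviation <= limit_db)
    return result

# Placeholder values, not real reference data.
reference = {500: 10.0, 1000: 5.0, 4000: 10.0}
subject = {500: 15.0, 1000: 10.0, 4000: 45.0}

report = hearing_deviation(subject, reference)
admissible = all(ok for _, ok in report.values())
```

In this made-up case the subject deviates by 35 dB at 4000 Hz and would therefore fall outside the placeholder limit.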

2. Speech audiometry

The goal of speech audiometry is to investigate the listener's response to speech. Despite theoretical and practical difficulties, it provides a method by which such assessment can be made. It is almost too obvious to state that everyday listening does not normally require attending to or discriminating among pure-tone stimuli, whereas identifying speech units is constantly necessary. Speech audiometry is concerned with answering three questions:

What is the lowest intensity level at which a listener can identify simple speech fragments?
How well does a listener understand everyday speech under everyday conditions?
What is the highest intensity at which the listener can tolerate speech?

Though not really standardised, certain standard procedures for speech audiometry have been established during recent decades. For comparison between separately tested populations, however, the exact test configuration and procedure have to be recorded.

For further reading consult [Barry & Fourcin (1990), Kasuya et al. (1993), Michaelis & Strube (1995), Rose (1971)].

EAGLES SWLG SoftEdition, May 1997.