Reference conditions

Besides a widely accepted benchmark, it would appear to us that designers of speech output systems will want to know how well their systems perform relative to some optimum, and what performance could be expected of a system that contains no intelligence at all. In other words, the designer is looking for topline and baseline reference conditions. Reference conditions such as these do not yield diagnostic information in the strict sense of the word. However, they do provide the system developer with an estimate of how much improvement can still be made to a system as a whole (in a black box approach) or to specific modules (in a glass box approach).

Segmental reference conditions

It has not been general practice to include topline and baseline reference conditions in segmental quality testing (Section 12.5.2). Still, it seems to us important to reach consensus on a number of measures. If the output system uses waveform concatenation techniques, the designer will want to know how well the synthesis performs relative to the live human speaker or, for ease of procedure, relative to some electronic operationalisation of live speech (e.g. CD-quality speech recorded at a short distance from the speaker's mouth in a quiet environment). However, if the system's waveforms have been coded at a lower bitrate than CD quality, the designer should determine to what extent degradation of system performance is due to the synthesis itself as opposed to the non-optimal bitrate. An easy way to determine this is to adopt a second reference condition that uses the same (lower) bitrate as the synthesis. This precaution is even more necessary for parametric synthesis. Obviously, no type of parametric synthesis can be better than the maximum quality afforded by the analysis-resynthesis coding scheme adopted for the synthesiser, so the human reference should also be presented after analysis and resynthesis with that scheme. This requirement can generally be fulfilled when LPC synthesis schemes are used. However, for a range of synthesisers (e.g. the Klatt and the JSRU synthesisers) no automatic parameter estimation for straightforward analysis-resynthesis is possible at this time. The optimal parametric representation of human reference materials will then have to be found by trial and error (i.e. by adjusting parameter values while making auditory/spectrographic comparisons between the synthesis and the human original), or else the attempt should be abandoned.

The designer of an output system implicitly claims that the intelligence incorporated into the synthesis system (e.g. through rules) makes it perform better than it would with no intelligence built in at all. In order to establish the extent to which this claim is true, a baseline condition is needed which consists of a type of synthetic speech that embodies no knowledge of speech processes at all.
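How a system score might be located between the two reference conditions can be sketched as follows. This is only an illustration: the function name `headroom`, the linear normalisation, and the example scores are assumptions, not part of the recommendations.

```python
def headroom(system: float, baseline: float, topline: float) -> dict:
    """Locate a system score between baseline and topline reference scores.

    All three scores are assumed to come from the same test, on the same
    scale, with higher values meaning better performance.
    """
    span = topline - baseline
    if span <= 0:
        raise ValueError("topline must score higher than baseline")
    return {
        # improvement attributable to the knowledge built into the system
        "gain_over_baseline": system - baseline,
        # improvement that can still be made before the optimum is reached
        "remaining_to_topline": topline - system,
        # position of the system within the baseline-topline span
        "fraction_of_span": (system - baseline) / span,
    }


# Hypothetical percent-correct scores from a segmental intelligibility test.
result = headroom(system=80.0, baseline=60.0, topline=100.0)
print(result)
```

The same calculation applies in a glass box approach if the three scores are collected for a single module rather than for the system as a whole.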

Recommendations on choice of segmental reference conditions

Absolute segmental topline: In the case of allophone synthesis, use human speech produced by a designated talker, i.e. the same individual on whose speech the table values and synthesis rules were based, or who, in the case of concatenative synthesis, provided the basic synthesis building blocks. The absolute topline reference will then be based on CD-quality digital speech.
Relative segmental topline for parametric synthesis: A second useful topline reference is the human reference speech but analysed and (re-)synthesised using exactly the same coding scheme that is employed in the speech output system to be tested.
Relative segmental topline for waveform concatenation: Use the same (lower) bitrate in the reference condition as in the speech output system.
Segmental baseline for allophone synthesis: Use speech in which all segments retain their table values and are strung together merely by smoothing spectral discontinuities at segment boundaries.
Segmental baseline for concatenative synthesis: Use speech made by stringing together coarticulatorily neutral phones (i.e. stressed vowels spoken between two /s/-es, or stressed consonants preceded by schwa and followed by an unrounded central vowel, cf. the ``neutrone'' condition in [Van Bezooijen & Pols (1993)]). Minimal smoothing should be applied to avoid spectral jumps.
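The relative topline for waveform concatenation can be illustrated with a crude stand-in for a lower-bitrate coder. The sketch below assumes plain linear PCM and reduces the bit depth of 16-bit samples; a real system would pass the reference speech through whatever coder the synthesiser itself uses. The function names are hypothetical.

```python
def pcm_bitrate(sample_rate_hz: int, bits_per_sample: int, channels: int = 1) -> int:
    """Bitrate of uncompressed linear PCM in bits per second."""
    return sample_rate_hz * bits_per_sample * channels


def requantise(samples, bits_in: int = 16, bits_out: int = 8):
    """Discard the least significant bits of integer PCM samples.

    The amplitude scale is kept, so the result can be played back as
    bits_in-bit audio while carrying only bits_out bits of resolution.
    """
    shift = bits_in - bits_out
    if shift < 0:
        raise ValueError("bits_out must not exceed bits_in")
    return [(s >> shift) << shift for s in samples]


# CD quality: 44.1 kHz, 16 bits, two channels.
cd_rate = pcm_bitrate(44100, 16, channels=2)
# Hypothetical lower-bitrate system: 8 kHz, 8 bits, one channel.
system_rate = pcm_bitrate(8000, 8)
print(cd_rate, system_rate)

# Human reference samples reduced to the resolution of the hypothetical system.
reference = [1000, -1000, 0]
print(requantise(reference, bits_in=16, bits_out=8))
```

Any difference between this degraded human reference and the absolute topline is then due to the coding, and any further difference between it and the synthesis is due to the synthesis itself.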

Prosodic reference conditions

The need for suitable topline and baseline reference conditions has clearly been recognised in the testing of prosody (i.e. temporal and melodic structure, cf. Section 12.5.2). The following are recommendations for prosodic topline and baseline conditions. Note that, in contrast to segmental evaluation, listeners often find it very difficult to differentiate between different prosodic versions of an utterance. Testers therefore often need examples of ``very bad'' systems to check whether the listeners are indeed sensitive to prosodic differences.

Recommendations on choice of temporal reference conditions

Temporal and melodic topline: Copy, as accurately as possible within the limitations of the synthesiser, the temporal structures and speech melodies of a single designated professional human speaker onto the synthetic speech output.
Temporal baseline: Use a condition in which the smallest synthesis building blocks (phoneme, diphone, demisyllable) retain their original, unmanipulated durations as they were copied from the human original from which they were extracted (or, in the case of allophone synthesis, the phoneme duration table values [Carlson et al. (1979)]).
This baseline condition, then, contains no intelligence, so that any improvement in the target conditions with duration rules must be due to the added explicit knowledge of duration structure. A reference in which segment durations vary at random (within realistic bounds) can be included for validation purposes, as an example of a ``very bad system''. Listeners should rate this condition as poorer than any other condition.
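A random-duration validation reference of the kind described above might be generated along the following lines. The uniform distribution, the default bound of 50% around the baseline duration, and the function name are assumptions chosen for illustration; what counts as ``realistic bounds'' has to be decided per language and segment type.

```python
import random


def random_duration_reference(durations_ms, max_deviation=0.5, seed=0):
    """Scale each baseline segment duration by a random factor.

    durations_ms:  unmanipulated (baseline) segment durations in milliseconds
    max_deviation: assumed bound; factors are drawn uniformly from
                   [1 - max_deviation, 1 + max_deviation]
    seed:          fixed seed so the same stimuli can be regenerated
    """
    if not 0 <= max_deviation < 1:
        raise ValueError("max_deviation must lie in [0, 1)")
    rng = random.Random(seed)
    return [d * rng.uniform(1 - max_deviation, 1 + max_deviation)
            for d in durations_ms]


# Hypothetical baseline durations for five segments.
baseline = [80.0, 120.0, 60.0, 150.0, 90.0]
print(random_duration_reference(baseline))
```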

Recommendations on choice of melodic reference conditions

Melodic baselines: Synthesise utterances on a monotone, at a pitch level that coincides with the average pitch of the test items. Also, include a random melodic reference for the sake of validation, by introducing random pitch variations (in terms of excursion size, rate of change, and segmental alignment), within physiologically and linguistically reasonable limits and with a mean pitch equal to the average of the test items.

A practical problem is that not every synthesiser allows the generation of monotonous pitch, so that some sort of waveform manipulation (e.g. pitch synchronous overlap and add, PSOLA) may have to be used in order to monotonise the synthetic melody.
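The two melodic baselines can be sketched at the level of the pitch contour, leaving aside how the contour is imposed on the waveform. The sketch assumes a frame-based F0 track in Hz in which unvoiced frames are marked as 0; the excursion limit of 4 semitones, the spacing of the random pitch targets, and both function names are illustrative assumptions.

```python
import random


def monotone(f0_hz):
    """Replace every voiced frame by the mean F0 of the voiced frames."""
    voiced = [f for f in f0_hz if f > 0]
    mean = sum(voiced) / len(voiced)
    return [mean if f > 0 else 0.0 for f in f0_hz]


def random_melody(f0_hz, max_excursion_st=4.0, frames_per_target=10, seed=0):
    """Random contour whose mean over voiced frames equals the original mean.

    Random targets (in semitones around the mean) are placed every
    frames_per_target frames and joined by straight lines, which bounds
    both the excursion size and the rate of change.
    """
    rng = random.Random(seed)
    voiced = [f for f in f0_hz if f > 0]
    mean = sum(voiced) / len(voiced)
    n_targets = len(f0_hz) // frames_per_target + 2
    targets = [rng.uniform(-max_excursion_st, max_excursion_st)
               for _ in range(n_targets)]
    contour = []
    for i in range(len(f0_hz)):
        k, r = divmod(i, frames_per_target)
        st = targets[k] + (targets[k + 1] - targets[k]) * r / frames_per_target
        contour.append(mean * 2 ** (st / 12))
    # Rescale so that the mean of the voiced frames matches the original mean.
    new_voiced = [c for c, f in zip(contour, f0_hz) if f > 0]
    scale = mean / (sum(new_voiced) / len(new_voiced))
    return [c * scale if f > 0 else 0.0 for c, f in zip(contour, f0_hz)]


# Hypothetical F0 track; 0.0 marks unvoiced frames.
track = [0.0, 100.0, 120.0, 140.0, 0.0, 110.0, 130.0]
print(monotone(track))
print(random_melody(track, frames_per_target=3))
```

Note that the final rescaling can push individual frames slightly beyond the nominal excursion limit, and that the sketch does not address segmental alignment.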

Voice characteristics reference conditions

In the area of voice characteristics (voice quality, Section 12.5.2), the problem of reference conditions has not been recognised. Generally, there seems to be little point in laying down a baseline reference for voice quality. The choice of a suitable topline would depend on the application of the speech output system. If the goal is personalised speech output (for the vocally handicapped) or automatic speaker conversion (as in interpreting telephony), the obvious topline is the speaker who is being modelled by the system, using the same coding scheme when applicable. When a general purpose (i.e. non-personalised) speech output system is the goal, one would first need to know the desired voice quality, i.e. ideal voices should be defined for specific applications, and speakers should be located who adequately represent the ideal voices. At this time we will refrain from making any further suggestions on this matter. The definition of ``ideal'' voices and voice qualities, and the implementation of topline references should be a matter of priority in the near future.

Overall quality reference conditions

Given the existence of an overall quality topline reference condition, it would be advantageous to have a set of reference conditions that are poorer than the optimum by a number of calibrated steps until a quality equal to or less than the baseline reference is reached (see also Section 12.4.1). Such a set of reference conditions would yield a grid within which each type of speech, whether produced by humans or by machines, can be located and compared with other types of speech. Recently, attempts have been made at creating such a continuum of reference conditions by taking high-quality human speech and applying some calibrated distortion to it, such as multiplicative white noise at various signal-to-noise ratios (``Modulated Noise Reference Unit'' or MNRU, cf. ITU-T Recommendation P.81), or time-frequency warping (TFW, ITU-T Recommendation P.85 [Burrell (1991)], or T-reference [Cartier et al. (1992)]).
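The multiplicative-noise distortion can be sketched as follows, assuming the commonly quoted form y(i) = x(i) [1 + 10^(-Q/20) N(i)], where N is unit-variance Gaussian noise and Q is the signal-to-noise ratio in dB. The band-limiting filters that a complete MNRU implementation would include are omitted, and the function name is hypothetical.

```python
import random


def modulated_noise(samples, q_db, seed=0):
    """Apply signal-correlated (multiplicative) noise at a ratio of q_db dB.

    Each sample is multiplied by 1 + g * n, where n is unit-variance
    Gaussian noise and g = 10 ** (-q_db / 20). A lower q_db therefore
    gives a more heavily degraded signal.
    """
    rng = random.Random(seed)
    gain = 10 ** (-q_db / 20)
    return [x * (1 + gain * rng.gauss(0.0, 1.0)) for x in samples]


# A grid of reference conditions in calibrated steps (step values assumed).
speech = [0.0, 0.5, -0.25, 0.125]
grid = {q: modulated_noise(speech, q) for q in (30, 20, 10, 0)}
for q, degraded in grid.items():
    print(q, degraded)
```

Because the noise is scaled by the signal, silent stretches stay silent, which distinguishes this distortion from additive noise.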

TFW introduces greater or lesser (random) deviations from the mean rate of a recorded utterance (in steps from 2.5% up to 20%) over successive stretches of 150 ms, so that the speech contains potentially disturbing accelerations and decelerations and associated frequency shifts. [Fellbaum et al. (1994)] showed that the MNRU is not suitable for the evaluation of synthetic speech. TFW of natural speech, however, provided a highly sensitive reference grid within which TTS systems could be clearly differentiated from each other in terms of judged listening effort [Johnston (1993)]. Moreover, Johnston showed that the perceived quality ordering among a range of TTS systems interacts with the sound pressure level at which the speech output is presented.
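The warping described above can be approximated by reading the recording at a playback rate that changes at random every 150 ms. The following is a minimal sketch under several assumptions: the rate for each stretch is drawn uniformly within the stated maximum deviation, resampling uses linear interpolation, and the function name `time_frequency_warp` is hypothetical. It is not the procedure of the cited recommendation.

```python
import random


def time_frequency_warp(samples, sample_rate_hz, max_deviation=0.10,
                        stretch_ms=150, seed=0):
    """Resample successive stretches of the input at randomly varying rates.

    A rate above 1 reads the input faster (acceleration, upward frequency
    shift); a rate below 1 reads it more slowly (deceleration, downward
    shift).
    """
    if not 0 <= max_deviation < 1:
        raise ValueError("max_deviation must lie in [0, 1)")
    rng = random.Random(seed)
    stretch = max(1, int(sample_rate_hz * stretch_ms / 1000))
    last = len(samples) - 1
    out = []
    pos = 0.0  # fractional read position in the input
    while pos < last:
        rate = 1.0 + rng.uniform(-max_deviation, max_deviation)
        end = min(pos + stretch, last)
        while pos < end:
            i = int(pos)
            frac = pos - i
            # linear interpolation between neighbouring input samples
            out.append(samples[i] * (1 - frac) + samples[i + 1] * frac)
            pos += rate
    return out


# Hypothetical input: a one-second ramp at 1 kHz, warped by up to 10%.
ramp = [float(i) for i in range(1000)]
warped = time_frequency_warp(ramp, sample_rate_hz=1000, max_deviation=0.10)
print(len(ramp), len(warped))
```

Running the function once per deviation level yields one reference condition per step of the grid.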

Recommendations on choice of overall quality reference conditions

Use time-frequency warping of optimal human speech to create a grid of overall quality reference conditions.

EAGLES SWLG SoftEdition, May 1997.