In addition to a widely accepted benchmark, designers of speech output systems will want to know how well their systems perform relative to some optimum, and what performance could be expected of a system that contains no intelligence at all. In other words, the designer is looking for topline and baseline reference conditions. Reference conditions such as these do not yield diagnostic information in the strict sense of the word. However, they do provide the systems developer with an estimate of how much improvement can still be made to the system as a whole (in a black box approach) or to specific modules (in a glass box approach).
It has not been general practice to include topline and baseline reference conditions in segmental quality testing (Section 12.5.2). Still, it seems to us important to reach consensus on a number of measures. If the output system uses waveform concatenation techniques, the designer will want to know how well the synthesis performs relative to the live human speaker or, to simplify procedures, to some electronic operationalisation of live speech (e.g. CD-quality speech recorded at a short distance from the speaker's mouth in a quiet environment). However, if the system's waveforms have been coded at a lower bitrate than CD quality, the designer should determine to what extent degradation of system performance is due to the synthesis itself as opposed to the non-optimal bitrate. An easy way to determine this is to adopt a second reference condition using the same (lower) bitrate as the synthesis. This precaution is even more necessary for parametric synthesis. Obviously, no type of parametric synthesis can be better than the maximum quality afforded by the analysis-resynthesis coding scheme adopted for the synthesiser. This requirement can generally be fulfilled when LPC synthesis schemes are used. However, for a range of synthesisers (e.g. the Klatt and the JSRU synthesisers) no automatic parameter estimation for straightforward analysis-resynthesis is possible at this time. The optimal parametric representation of human reference materials will then have to be found by trial and error (i.e. by adjusting parameter values while making auditory/spectrographic comparisons between the synthesis and the human original), or else the attempt should be abandoned.
The designer of an output system will typically claim that the intelligence incorporated into the synthesis system (e.g. through rules) makes the system perform better than one with no intelligence built in at all. In order to establish the extent to which this claim is true, a baseline condition is needed which consists of a type of synthetic speech that incorporates no knowledge of speech processes at all.
The need for suitable topline and baseline reference conditions has clearly been recognised in the field of prosody testing (i.e. testing of temporal and melodic structure, cf. Section 12.5.2). The following are recommendations for prosodic topline and baseline conditions. Note that, in contrast to segmental evaluation, listeners often find it very difficult to differentiate between different prosodic versions of an utterance. Therefore, testers often need examples of ``very bad'' systems to check whether the listeners are indeed sensitive to prosodic differences.
This baseline condition, then, contains no intelligence, so that any improvement in the target conditions with duration rules must be due to the added explicit knowledge of durational structure. A reference in which segment durations vary at random (within realistic bounds) can be included for validation purposes, as an example of a ``very bad system''. Listeners should rate this condition as poorer than any other condition.
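The random-duration validation reference described above can be sketched as follows. This is a hypothetical illustration, not a prescribed procedure: the function name, the choice of scaling factors, and the bounds (here assumed to be half to double the original duration) are all our own assumptions about what ``realistic bounds'' might mean in practice.

```python
import random

def randomise_durations(durations_ms, low=0.5, high=2.0, seed=0):
    """Hypothetical ``very bad system'' reference: scale each segment
    duration by an independent random factor drawn uniformly from
    [low, high]. The bounds are illustrative assumptions."""
    rng = random.Random(seed)
    return [d * rng.uniform(low, high) for d in durations_ms]

# Example: phone durations (ms) for a short utterance.
original = [80, 120, 60, 150, 90]
randomised = randomise_durations(original)
```

A condition generated this way should, if listeners are sensitive to temporal structure at all, be rated below every rule-based condition.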
A practical problem is that not every synthesiser allows the generation of a monotonous pitch, so that some sort of waveform manipulation (e.g. pitch-synchronous overlap-and-add, PSOLA) may have to be used in order to monotonise the synthetic melody.
In the area of voice characteristics (voice quality, Section 12.5.2), the problem of reference conditions has not been recognised. Generally, there seems to be little point in laying down a baseline reference for voice quality. The choice of a suitable topline would depend on the application of the speech output system. If the goal is personalised speech output (for the vocally handicapped) or automatic speaker conversion (as in interpreting telephony), the obvious topline is the speaker who is being modelled by the system, using the same coding scheme when applicable. When a general purpose (i.e. non-personalised) speech output system is the goal, one would first need to know the desired voice quality, i.e. ideal voices should be defined for specific applications, and speakers should be located who adequately represent the ideal voices. At this time we will refrain from making any further suggestions on this matter. The definition of ``ideal'' voices and voice qualities, and the implementation of topline references should be a matter of priority in the near future.
Given the existence of an overall quality topline reference condition, it would be advantageous to have a set of reference conditions that are poorer than the optimum by a number of calibrated steps until a quality equal to or less than the baseline reference is reached (see also Section 12.4.1). Such a set of reference conditions would yield a grid within which each type of speech, whether produced by humans or by machines, can be located and compared with other types of speech. Recently, attempts have been made at creating such a continuum of reference conditions by taking high-quality human speech and applying some calibrated distortion to it, such as multiplicative white noise at various signal-to-noise ratios (``Modulated Noise Reference Unit'' or MNRU, cf. ITU-T Recommendation P.81), or time-frequency warping (TFW, ITU-T Recommendation P.85 [Burrell (1991)], or T-reference [Cartier et al. (1992)]).
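The core of the MNRU scheme is speech-correlated (multiplicative) noise controlled by a single parameter Q, the speech-to-modulated-noise ratio in dB. A minimal sketch of that idea is given below; it implements only the multiplicative-noise stage, y[n] = x[n](1 + g·noise[n]) with g = 10^(-Q/20), and omits the band-limiting filters that the full ITU-T recommendation specifies, so it should not be taken as a compliant MNRU implementation.

```python
import numpy as np

def mnru(speech, q_db, seed=0):
    """Simplified multiplicative-noise distortion in the spirit of the
    MNRU (cf. ITU-T Rec. P.81): y[n] = x[n] * (1 + g * noise[n]),
    with g = 10**(-Q/20) and unit-variance Gaussian noise.
    The band-limiting stages of the full MNRU are omitted here."""
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal(len(speech))
    g = 10.0 ** (-q_db / 20.0)
    return speech * (1.0 + g * noise)
```

Because the noise is multiplied by the speech itself, it is present only when the speech is, which is what makes the degradation perceptually different from simple additive noise; lowering Q in calibrated steps yields the desired continuum of progressively poorer reference conditions.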
TFW introduces greater or lesser (random) deviations from the mean rate of a recorded utterance (2.5%, ..., 20%) over successive stretches of 150 ms, so that the speech contains potentially disturbing accelerations and decelerations and associated frequency shifts. [Fellbaum et al. (1994)] showed that the MNRU is not suitable for the evaluation of synthetic speech. TFW of natural speech, however, provided a highly sensitive reference grid within which TTS systems could be clearly differentiated from each other in terms of judged listening effort [Johnston (1993)]. Moreover, Johnston showed that the perceived quality ordering among a range of TTS systems interacts with the sound pressure level at which the speech output is presented.
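The mechanism behind TFW can be sketched as follows, under our own simplifying assumptions: each successive 150 ms stretch is resampled by a random rate factor (uniform within ±max_dev) and played back at the original sampling frequency, which produces exactly the local accelerations/decelerations and associated frequency shifts described above. The function name and the use of plain linear-interpolation resampling with no smoothing at frame joins are illustrative choices; the published T-reference procedure is more careful.

```python
import numpy as np

def tfw(speech, fs, max_dev=0.10, frame_ms=150, seed=0):
    """Time-frequency warping sketch: resample each successive
    frame_ms stretch by a random rate in [1 - max_dev, 1 + max_dev].
    Played back at the original fs, each stretch is locally sped up
    or slowed down, with a matching frequency shift."""
    rng = np.random.default_rng(seed)
    frame = int(fs * frame_ms / 1000)
    out = []
    for start in range(0, len(speech), frame):
        seg = speech[start:start + frame]
        rate = 1.0 + rng.uniform(-max_dev, max_dev)
        n_out = max(1, int(round(len(seg) / rate)))
        # rate > 1 shortens (accelerates) the stretch, rate < 1 stretches it.
        t_out = np.linspace(0.0, len(seg) - 1, n_out)
        out.append(np.interp(t_out, np.arange(len(seg)), seg))
    return np.concatenate(out)
```

Running this at increasing max_dev values (2.5% up to 20%) yields the calibrated steps of the reference grid: the larger the permitted deviation, the more disturbing the warping.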