With a few provisos (see below) there is general consensus that the procedures for testing the segmental quality of speech output systems are more or less fully developed (cf. Section 12.5.2 under DRT/MRT, CLID and SUS Tests). Under the auspices of the SAM consortium, efficient test generators have been developed that enable the construction of a large variety of tests allowing quick, standardised administration and analysis of consonant and vowel intelligibility scores, both for isolated word intelligibility and for the intelligibility of words in (semantically unpredictable) context. These tools will be very useful in testing even the latest generation of parametric synthesisers. However, the upcoming generation of waveform synthesisers (PSOLA-based) will have segmental quality that is hard to discriminate from human speech. Though it may be possible to refine the discriminatory power of our test procedures further, one may well wonder what purpose such endeavours would serve. A reasonable alternative view would be to consider the quality of waveform concatenation speech output equivalent to the human ideal (if indeed the test shows that no intelligibility difference remains) and leave the matter at that.
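For forced-choice tests of the DRT/MRT type, scoring conventionally corrects percentage correct for guessing. The sketch below illustrates the standard two-alternative correction; the function name and example counts are our own, not part of any particular test generator.

```python
# Sketch: chance-corrected scoring for a two-alternative forced-choice
# intelligibility test such as the DRT. Names and figures are illustrative.

def drt_score(right: int, wrong: int) -> float:
    """Percentage correct adjusted for guessing in a two-choice task:
    P = 100 * (R - W) / (R + W)."""
    total = right + wrong
    if total == 0:
        raise ValueError("no responses to score")
    return 100.0 * (right - wrong) / total

# Example: 180 correct and 20 incorrect responses out of 200 trials.
print(drt_score(180, 20))  # 80.0 (raw percentage correct would be 90.0)
```

With this correction, a listener responding at chance (equal rights and wrongs) scores 0 rather than 50, which makes scores comparable across tests with different numbers of response alternatives.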
A short-term recommendation concerns the quality of segments in unstressed syllables. It has rightly been pointed out, for instance by [Van Santen (1993)], that most segmental quality tests consider monosyllabic words only (or ``minisyllabic'' words for languages without lexical monosyllables). The risk is that insufficient attention is paid to the quality of unstressed syllables in longer words. The same, of course, holds for the quality assessment of (unstressable) function words. Unstressed syllables are generally reduced in human speech, and synthesis-by-rule systems have often neglected to model the reduction processes carefully. In concatenative synthesis, the problem of unstressed syllables can be solved by enlarging the set of normally unreduced acoustic building blocks with a parallel set of reduced building blocks [Drullman & Collier (1993)]. The testing problem that crops up in this connection raises an important perceptual question about the interaction between segmental and prosodic quality: if unstressed syllables are overarticulated, as happens when the reduction processes are not adequately modelled in our synthesis, does the resulting speech output become more intelligible, or does its intelligibility deteriorate? One might predict that, although the identifiability of each individual segment may decrease when reduction is faithfully mimicked, overall intelligibility, in terms of word scores, will increase: the rhythmic structure of words, with its natural gradation of strong and weak syllables, may matter more to word identifiability than the optimal identifiability of each individual phoneme.
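The parallel-inventory idea attributed above to [Drullman & Collier (1993)] can be pictured as a lookup keyed on stress. The inventory contents, file names, and selection scheme below are invented purely for illustration, not taken from that work.

```python
# Sketch of a parallel unit inventory: full (unreduced) units for stressed
# syllables, reduced variants for unstressed ones. All entries are invented.

FULL = {"a": "a_full.wav", "e": "e_full.wav"}
REDUCED = {"a": "a_red.wav", "e": "e_red.wav"}

def select_unit(phone: str, stressed: bool) -> str:
    """Pick the acoustic building block appropriate to the stress context."""
    inventory = FULL if stressed else REDUCED
    return inventory[phone]

print(select_unit("a", stressed=False))  # a_red.wav
```

The point of the sketch is only that the synthesiser's unit selection, not a signal-processing rule, decides whether a syllable surfaces reduced; the perceptual question in the text is whether listeners are better served by the REDUCED or the FULL variants in unstressed position.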
On a more general note, we suggest that serious attention be paid to differences in the contribution that the various constituent segments make to the overall intelligibility of words. It is important that we learn to what extent word intelligibility depends on identifying vowels versus consonants, in stressed versus unstressed syllables, in onset, medial, and final position, in short and longer words. Psycholinguistic studies of auditory word recognition have shown that, indeed, stressed segments, because of their greater inherent loudness and duration, have a better chance of contributing to the recognition process, as do segments early in the word. Ideally, we would like to be able to predict the intelligibility of an arbitrary selection of words from the lexicon of a language just by looking at the identification scores of the constituent vowels and consonants in unpredictable words (i.e. segment strings that are phonotactically legal and may be lexical words or nonsense strings).
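A minimal baseline for such a prediction, assuming (contrary to the positional effects just mentioned) that segments are identified independently, is to multiply the per-segment identification probabilities. The probabilities below are invented for illustration.

```python
# Illustrative baseline: predict the probability of identifying a whole word
# as the product of per-segment identification probabilities, under the
# simplifying assumption that segments are identified independently.

def word_intelligibility(segment_probs):
    p = 1.0
    for prob in segment_probs:
        p *= prob
    return p

# A three-segment word with assumed identification scores for its segments.
print(round(word_intelligibility([0.95, 0.90, 0.92]), 3))  # 0.787
```

Because stressed and word-initial segments demonstrably contribute more to recognition, any deviation of observed word scores from this independence baseline would itself be informative: position- and stress-weighted models should outperform the plain product.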
With the advent of high-quality segmental speech output (Section 12.1.2), a shift from segmental quality testing to prosodic quality testing seems imminent. Clearly, we still have a long way to go before the evaluation of prosody achieves full coverage. What is needed is a careful taxonomy of prosodic functions at all linguistic and pragmatic levels (see also Chapter 5). We suggest, therefore, that the first priority should be for linguists to chart all the prosodic functions relevant to human-machine communication. We need to know not only what functions are fulfilled by prosody, but also what the communicative importance of each specific function is (if any). Once a reasonably complete view of the relevant prosodic functions has been obtained, attempts should be made to define adequate tests to determine to what extent each function is expressed by the speech output system.
It will be difficult to separate the evaluation of prosodic forms from their communicative functions, and perhaps such a dissociation is not even necessary. It seems reasonable to assume that a prosodic feature fulfils its communicative function better the closer its formal properties are to the human model. If this relationship holds, we would not have to test the formal adequacy of speech timing and melody rules in abstraction from their communicative functions. Once we know the communicative function of each formal prosodic distinction, the prosodic quality of speech output systems can be measured by the effectiveness with which each of the communicative functions is signalled to the human listener. For these reasons we suggest that functional testing of prosody be given priority. Whatever audible flaws remain after the communicative functions have been shown to be signalled as effectively as in human speech will have to be addressed at a later stage, using judgment tasks.
We recommend, therefore, that the emphasis be on the functions of prosody rather than on the details of prosodic form. Our point of departure, for the time being, is that the formal aspects of prosody cannot be too far off the mark if the prosodic functions are all adequately fulfilled. This should not be taken to mean that we consider the details of prosodic form (such as exact pitch movements and timing) unimportant. In fact, there is every reason to believe that prosodic functions such as accentuation are only adequately expressed by language-specific pitch movements that are very narrowly defined (in terms of direction, excursion size, and segmental alignment). In this context it seems obvious that adequate prosodic functioning can only be guaranteed if speech output systems are capable of synthesising not only binary accent or boundary distinctions but also more subtle degrees of contrast within such categories. For instance, the adequacy of prosodic boundary marking should be tested at no fewer than four levels of depth: strong and weaker boundaries within the sentence, as well as sentence and paragraph boundaries, which are signalled in parallel by melody, temporal organisation, and (possibly even) intensity.
Generally, we believe that identifying the prosodic functions to be tested (including the expression of emotion) presents a greater problem than devising tests to determine the functional adequacy of prosody once a particular function has been identified. Still, choices will have to be made as to which test methodology to adopt. We propose that a pilot study be initiated to examine the pros and cons of the various tests used in the experimental phonetic and psycholinguistic literature (as outlined in Section 12.5.2) that seem relevant to this matter.
As a consequence of claiming priority for prosodic functions, the development of (multilingual) prosodic form tests (and test generators) should be postponed until some later stage.
It would appear that the evaluation of voice quality is going to be a matter of increasing concern. Developers of personalised-voice speech output will need test procedures to determine how convincingly their systems mimic the quality of the model's voice. Simple same-different testing (Is it Ella? Or is it Memorex?) will not do, since developers will need the evaluation as a diagnostic tool. We suggest that a test tool be developed that enables the efficient drawing up of voice quality profiles (cf. Section 12.5.2).
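To see why a profile is more diagnostic than a same-different verdict, consider representing a voice as ratings on a set of perceptual scales and comparing profiles dimension by dimension. The scale names, values, and distance measure below are our own illustrative assumptions, not a standardised inventory.

```python
# Sketch: a voice quality "profile" as ratings on perceptual scales, and a
# simple distance between a synthetic voice's profile and the model speaker's.
# Scale names and ratings are invented for illustration.

import math

def profile_distance(system: dict, model: dict) -> float:
    """Euclidean distance over the scales shared by both profiles."""
    shared = set(system) & set(model)
    if not shared:
        raise ValueError("profiles share no scales")
    return math.sqrt(sum((system[s] - model[s]) ** 2 for s in shared))

model_voice = {"breathiness": 3.0, "roughness": 2.0, "pitch_range": 4.0}
synth_voice = {"breathiness": 4.0, "roughness": 2.0, "pitch_range": 2.0}
print(round(profile_distance(synth_voice, model_voice), 3))
```

Unlike a same-different judgment, the per-scale differences tell the developer *where* the mimicry fails (here, most of the mismatch lies in the assumed pitch-range scale), which is exactly the diagnostic information the text calls for.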
Apart from the development of personalised voice synthesis, the voice quality of general-purpose speech output systems will receive considerably more attention in the coming decade. As the segmental, and to a lesser extent the prosodic, quality of speech output improves, the need for more natural and pleasant voice quality will be strongly felt. It will be a concern for the evaluation field to develop test procedures to determine the appropriateness of a voice quality for speech output in general and for specific applications (e.g. alert messages).
Now that the quality of speech output systems is approaching that of human speech, assessment should concentrate on aspects of quality other than linguistic functions. Synthetic speech may be virtually equivalent to human speech in all respects and still be lacking in certain subtle qualities. This aspect of speech output testing should be considered in a special study, looking at the effects of listening to synthetic speech in terms of fatigue and the allocation of attention to secondary tasks (cf. Section 12.4.1). The development of efficient multilingual test generators addressing this aspect would be a welcome addition to our repertoire.