In Section 12.4, speech output assessment was approached within a black box context, i.e. with an emphasis on the speech output as a whole. Black box tests are necessarily acoustic in nature. In the present section acoustic tests are discussed as well, but this time within a glass box context, which means that attention is focussed on the quality of separate modules, mainly with a view to diagnostic testing. The structure of this section is based on traditional views in phonetics [e.g. Abercrombie (1967)] according to which three layers are present in speech: a segmental layer (related to short-term fluctuations in the speech signal), a voice dynamics or prosodic layer (medium-term fluctuations), and a voice characteristics (or voice quality) layer (long-term fluctuations). We make the same distinction here, Section 12.5.1 being concerned with testing segments, Section 12.5.2 with prosody, and Section 12.5.3 with voice characteristics.
The primary function of segments, i.e. the consonants and vowels of the language, is simply to enable listeners to recognise words. Generally, when the segments are sufficiently identifiable, words can be recognised regardless of the durations of the segments and the melodic pattern. In the experience of most researchers, good quality (readily identifiable) vowels are afforded by even the simplest speech synthesis systems. One reason is that most coding schemes allow adequate parametrisation of vocalic sounds (narrow-band formants slowly varying with time). The synthesis of good quality consonants is an altogether different matter (multiple excitation signals, a formant notion that does not always apply, abrupt spectral changes), and this is where most (parametric) synthesisers show defects.
Moreover, since speech extends along the time dimension, segments early in the word contribute more in practice to auditory word recognition than later segments. Trailing segments, especially in long (i.e. polysyllabic) words, are often not needed to distinguish the word from its competitors. Also, segments in stressed syllables tend to contribute more to a word's identity than segments in unstressed syllables. For these reasons it makes sense to break down the segmental quality of speech output systems by vowels and consonants in various positions (initial, medial, final), within monosyllabic and polysyllabic words, and in stressed versus unstressed syllables.
Of all aspects of synthetic speech output, the evaluation of the segmental aspect has received most attention so far, because:
Near perfect segmental quality is essential for applications with a strong emphasis on the transmission of low-predictability information to untrained listeners, for example traffic information services and reverse telephone directory assistance (What name and address belong to this telephone number?). Unlike the case of ``normal'' words, the pronunciation of names cannot be deduced from the context. Moreover, for names it is particularly important that each consonant and vowel be clearly enunciated, because there are many near-homophones, i.e. names that differ in just one sound, as well as strange names which listeners may never have heard before. In applications like these, where prosody is of minor importance, the required intelligibility level can be attained, for instance, by making use of canned speech or waveform concatenation. In other applications, where text-to-speech is preferred, it may not be necessary for each sound to be identified correctly. However, since very little is known as yet about the specific contributions of single sounds to overall intelligibility, synthesis designers have usually taken the pragmatic position that in principle all sounds should be identifiable. In that case detailed diagnostic testing of segmental quality using a glass box approach remains necessary.
As stated above, many tests have been developed to evaluate the quality of synthetic segments. There is a basic distinction between segmental tests at the word level, where single words (meaningful, meaningless, or lexically unpredictable) are presented to listeners, and segmental tests at the sentence level, where complete sentences (meaningful, meaningless, or semantically unpredictable) are presented to listeners. Within either category, tests can be further divided into functional and judgment studies.
FUNCTIONAL SEGMENTAL TESTS AT THE WORD LEVEL
The test approach used to evaluate segments at the word level has in general been functional, quality being expressed in terms of the percentage of correct phoneme identification. In this section we will discuss the Diagnostic Rhyme Test (DRT), the Modified Rhyme Test (MRT), the SAM Standard Segmental Test, an anonymous type of test which we shall henceforth call the Diphone Test, the CLuster IDentification (CLID) Test, the Bellcore Test, and the (Modified) Minimal Pairs Intelligibility Test. The reasons why these tests were selected for inclusion are varied: because they are well-known, well-designed, easy and fast to administer, and/or promising. Summary information on most tests is provided in Section 12.7.
DRT (DIAGNOSTIC RHYME TEST) AND MRT (MODIFIED RHYME TEST)
The DRT (see Section 12.7.4) is a closed response test with two response alternatives. Items presented to the subjects are of the form CVC, i.e. an initial Consonant followed by a medial Vowel followed by a final Consonant. Only the identifiability of the initial consonant is tested; the identifiability of the medial vowel and final consonant is not examined. All items are meaningful, which means that only factors (1) through (3) as listed in Section 12.4.1 can influence the results. In order to obtain insight into the precise nature of possibly poor identifiability of initial consonants, the two response alternatives from among which the subjects are forced to select form a minimal phonemic contrast. The subject would be asked, for instance, to indicate whether a synthetic item was intended as dune or tune.
The MRT (see Section 12.7.5) is an (originally) closed response test with six response alternatives. All items are of the form CVC and (in the test's original form) meaningful. The identifiability of both initial and final consonants is tested, but never simultaneously. An example of response alternatives testing the identifiability of a final consonant would be a contrastive syllable coda series such as peas, peak, peal, peace, peach, and peat.
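Scoring either test amounts to tallying forced-choice responses, for diagnostic purposes broken down by phonemic contrast. The Python sketch below illustrates this with invented trial data; the items and contrast labels are ours, not part of the official DRT or MRT materials.

```python
from collections import defaultdict

# Hypothetical closed-response trials: (intended word, chosen word, contrast label).
trials = [
    ("dune", "dune", "initial voicing"),
    ("dune", "tune", "initial voicing"),
    ("tune", "tune", "initial voicing"),
    ("peas", "peace", "final C"),
    ("peak", "peak", "final C"),
]

correct = defaultdict(int)
total = defaultdict(int)
for intended, chosen, contrast in trials:
    total[contrast] += 1
    if chosen == intended:
        correct[contrast] += 1

# Percent correct per contrast reveals where a synthesiser's segments fail.
for contrast in total:
    pct = 100.0 * correct[contrast] / total[contrast]
    print(f"{contrast}: {pct:.1f}% correct ({correct[contrast]}/{total[contrast]})")
```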
The use of meaningful test items has some positive effects:
However, the DRT and MRT have some serious drawbacks and restrictions as well:
Both the DRT and MRT have a long tradition in speech output assessment and have been used in many studies, mainly for comparative purposes. The DRT has been employed, among others, by [Pratt (1987)], who compared a wide range of synthetic voices/systems and a human reference, both in the clear and with noise added to give a speech-to-noise ratio of 0 dB(A). Eight subjects participated. The percentages correct for the human voice and five synthesisers are given in Table 12.4 [Pratt (1987)].
All factors (speech system, speech-to-noise ratio, and type of phonemic contrast in the two response categories) had a significant effect on the percentage of correct identification. More interestingly, all interactions appeared to be significant as well. For example, as can be seen above, the intelligibility of synthetic speech was affected by adding noise to a much higher degree than that of human speech. Moreover, adding noise extended the range of the percentages correct, thus making the test more sensitive. So, if rather similar synthesis systems are compared, it might be advisable to add noise.
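Degrading stimuli to a fixed speech-to-noise ratio is straightforward to reproduce. The sketch below mixes white noise into a signal at a target SNR; it matches levels on plain RMS energy, whereas [Pratt (1987)] specified the ratio in dB(A), and the function and test tone are our own illustration.

```python
import numpy as np

def add_noise(speech, snr_db, rng=None):
    """Mix white noise into a speech signal at a target speech-to-noise ratio.

    Simplified sketch: levels are matched on plain RMS energy, not dB(A).
    """
    if rng is None:
        rng = np.random.default_rng(0)
    noise = rng.standard_normal(len(speech))
    speech_rms = np.sqrt(np.mean(speech ** 2))
    noise_rms = np.sqrt(np.mean(noise ** 2))
    # Scale the noise so that 20*log10(speech_rms / noise_rms) equals snr_db.
    target_noise_rms = speech_rms / 10 ** (snr_db / 20)
    return speech + noise * (target_noise_rms / noise_rms)

# Example: degrade a 1 kHz tone (standing in for a speech signal) to 0 dB SNR.
t = np.linspace(0, 1, 16000, endpoint=False)
degraded = add_noise(np.sin(2 * np.pi * 1000 * t), snr_db=0.0)
```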
The MRT has been employed, among others, by [Logan et al. (1985)] to evaluate eight synthesisers and a human reference. On the basis of the results, the systems were grouped into four categories, namely (1) human voice, (2) high-quality: DECtalk 1.8 Paul, DECtalk 1.8 Betty, MITalk-79, Prose 3.0, (3) moderate-quality: INFOVOX SA 101, Berkeley, TSI-proto 1, and (4) low-quality: Votrax Type'n'Talk and Echo. Percentages correct for the closed response variant are given in Table 12.5 [Logan et al. (1985)].
Speech type | Initial C | Final C | Overall |
Human | 99 | 99 | 99 |
High-quality | 96 | 93 | 95 |
Moderate-quality | 90 | 82 | 85 |
Low-quality | 66 | 71 | 68 |
The categories distinguished could be used as benchmarks (although the data are somewhat dated, the set of synthesisers tested is probably representative of the quality range of more recent synthesisers). Methodological matters were considered as well. A test/retest design showed the MRT to be reliable. Moreover, the closed and open response variants (compared for the five best systems) yielded the same rank order.
For diagnostic purposes the SAM Standard Segmental Test, developed within the Speech Assessment Methods (SAM) project of ESPRIT (see Section 12.7.1), is to be preferred over the DRT and MRT. The test items consist of meaningless and (sometimes by chance) meaningful, i.e. lexically unpredictable, stimuli, which means that factors (1) and (2) as listed in Section 12.4.1 have an effect on the responses. Items are CV, VC, and VCV stimuli, where C stands for all consonants allowed in the given position in a given language and V for one of the three point vowels of the given language, typically open /a/, close front /i/, and close back /u/. So, all permissible consonants are tested in word initial, word medial, and word final position. Vowels are not tested; they provide varying phonetic contexts within which the consonants to be tested are placed (the identifiability of sounds can vary depending upon neighbouring sounds). Examples of test items are pa, ap, apa, ki, ik, and iki. An open response format is used, i.e. listeners choose a response from among all consonants.
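Because the item set is defined combinatorially (every permissible consonant in every position, crossed with the three point vowels), the test materials can be generated mechanically. A minimal sketch, with invented consonant inventories standing in for the positional phonotactics of a real language:

```python
# Sketch of SAM-style item generation.  The consonant inventories below are
# invented for illustration; a real test uses the full set of consonants
# permissible in each position in the language under test.
initial_c = ["p", "t", "k", "b", "d", "s", "m", "n"]
medial_c = ["p", "t", "k", "b", "d", "s", "m", "n", "r"]
final_c = ["p", "t", "k", "s", "m", "n"]       # e.g. no final voiced stops
point_vowels = ["a", "i", "u"]                 # open, close front, close back

cv = [c + v for c in initial_c for v in point_vowels]      # pa, pi, pu, ...
vc = [v + c for c in final_c for v in point_vowels]        # ap, ip, up, ...
vcv = [v + c + v for c in medial_c for v in point_vowels]  # apa, iki, uku, ...

print(f"{len(cv)} CV, {len(vc)} VC, {len(vcv)} VCV items")
```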
The SAM Standard Segmental Test has many positive points:
The main restriction of the SAM Standard Segmental Test is that:
Part of the SAM Standard Segmental Test has been applied to English, German, Swedish, and Dutch synthesisers. Comparative results are available for Swedish medial C produced by a human and two synthesisers, as perceived by listeners with perfect and imperfect hearing [Goldstein & Till (1992)]. The percentages of correct medial C identification are given in Table 12.6 [Goldstein & Till (1992)]. Of the 54 test items, 3 were found to differ significantly (p = 0.02) between human and KTH, 9 between human and INFOVOX, and 3 between KTH and INFOVOX.
Speech type | Perfect hearing (N=24) | Imperfect hearing (N=14) |
Human | 94 | 91 |
KTH-synthesis | 91 | 84 |
INFOVOX-synthesis | 88 | 79 |
A more complete overview of the performance of segments in a wider variety of contexts is provided by a test which assesses the intelligibility of all permissible (pronounceable) CVC, CVVC, VCV, and VCCV sequences of a given language. Such a test will be referred to as a Diphone Test, because the test items can be constructed by combining all the diphones in a diphone inventory. Just as in the SAM Standard Segmental Test, the test items are lexically unpredictable and the response categories are open, so that the test is useful for diagnostic purposes. Extra advantages of the Diphone Test over the SAM Standard Segmental Test are the following:
The main disadvantages/restrictions of the Diphone Test are:
The Diphone Test has been used to evaluate diphone synthesis in French [Pols et al. (1987)], Italian [Van Son et al. (1988)], and Dutch [Van Bezooijen & Pols (1987)]. The Dutch Diphone Test combined all Dutch diphones into a set of 768 test items: 307 CVC, 173 VCV, 267 VCCV, and 21 CVVC. The only thing needed to construct the test material for a particular language is a matrix with the phonotactic constraints operating in that language, i.e. restrictions on the occurrence of all consonants and vowels in various word positions and phonetic contexts. Such matrices have been constructed for a number of European languages within the ESPRIT-SAM project.
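Given such a phonotactic matrix, item construction is purely combinatorial. The sketch below enumerates the four item types from a toy matrix; the inventories, the cluster list, and the vowel-pair condition on CVVC items are invented for illustration, the real ESPRIT-SAM matrices being far larger:

```python
from itertools import product

# Toy phonotactic matrix: which consonants and clusters may occur in each
# position.  Real matrices cover the full phonotactics of the language.
V = ["a", "i", "u"]
C_initial = ["p", "t", "s"]
C_final = ["t", "s", "n"]
CC_medial = [("s", "t"), ("n", "t"), ("r", "k")]  # permissible medial clusters

cvc = [c1 + v + c2 for c1, v, c2 in product(C_initial, V, C_final)]
vcv = [v1 + c + v2 for v1, c, v2 in product(V, C_initial, V)]
vccv = [v1 + c1 + c2 + v2 for v1, (c1, c2), v2 in product(V, CC_medial, V)]
cvvc = [c1 + v1 + v2 + c2
        for c1, v1, v2, c2 in product(C_initial, V, V, C_final) if v1 != v2]

print(f"{len(cvc)} CVC, {len(vcv)} VCV, {len(vccv)} VCCV, {len(cvvc)} CVVC items")
```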
BELLCORE TEST AND CLID (CLUSTER IDENTIFICATION) TEST
As mentioned above, even the Diphone Test is not complete, since no tautosyllabic consonant clusters are included. The importance of this structure should not be underestimated: according to [Spiegel et al. (1990)], about 40% of all one-syllable words in English begin with a consonant cluster, and 60% end with one. The Bellcore Test (see Section 12.7.3) and the CLID Test (see Section 12.7.2) have been developed to fill this gap.
The Bellcore Test has a fixed set of CVC stimuli, comprising both meaningless and meaningful words, e.g. frimp and friend or glurch and parch. Tautosyllabic consonant clusters and single consonants are tested separately in initial and final position. Vowels are not tested. Open response categories are used. Compared to the Diphone Test, the Bellcore Test has some advantages, the main one being that:
However, the Bellcore Test has restrictions as well:
The test has been applied to assess the intelligibility of two synthesisers compared with human speech, presented over the telephone [Spiegel et al. (1990)]. The syllable score was 88% for human telephone speech and around 70% for the synthetic telephone speech. Consonant clusters had lower intelligibility than single consonants. Intelligibility for meaningful words was higher than for meaningless words, a finding which could not be explained.
The CLID Test is a very flexible test architecture which can be used for generating a wide variety of monosyllabic test items in an, in principle, unlimited number of languages. Both meaningful and meaningless items can be generated, as long as matrices are available with the phonotactic constraints to be taken into account. Open response categories are used. Intelligibility can be assessed in whatever way one chooses. The CLID Test has been applied to testing the intelligibility of German synthesisers [e.g. Jekosch (1992), Kraft & Portele (1995)].
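A minimal sketch of this kind of generation is given below; the onset and coda inventories and the toy lexicon used for the meaningful/meaningless split are invented, whereas the real test derives them from language-specific phonotactic matrices:

```python
from itertools import product

# Sketch of CLID-style generation of monosyllables with tautosyllabic clusters.
onsets = ["p", "fr", "gl", "st"]     # single consonants and onset clusters
nuclei = ["i", "e", "a", "u"]
codas = ["n", "mp", "nd", "rch"]     # single consonants and coda clusters
lexicon = {"pin", "pan", "stand"}    # toy stand-in for the language's lexicon

items = [o + n + c for o, n, c in product(onsets, nuclei, codas)]
meaningful = [w for w in items if w in lexicon]
print(f"{len(items)} items, of which {len(meaningful)} meaningful")
```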
The CLID Test has all the advantages of the SAM Standard Segmental Test listed above, while not sharing its restriction. Thus the positive points of the CLID Test are the following:
(MODIFIED) MINIMAL PAIRS INTELLIGIBILITY TEST
The last tests we want to mention in this context are the so-called Minimal Pairs Intelligibility Test (MPI Test), proposed by [Van Santen (1993)] as an alternative to the DRT, and a modification of it introduced by [Syrdal & Sciacca (1994)], the Diagnostic Pairs Sentence Intelligibility Evaluation Test (DPSIE Test). These tests were designed to reduce ceiling effects and expand the coverage of the DRT to include:
The MPI Test consists of a fixed set of 256 sentence pairs, each containing one contrast, e.g. The horrid courts scorch a revolution versus The horrid courts score a revolution. The minimal pair appears on the screen and the correct sentence has to be identified. Differences between the MPI Test and the DPSIE Test include:
The main advantage of the MPI and DPSIE Tests is that:
The main disadvantages of the tests are:
JUDGMENT TESTS AT THE WORD LEVEL
In principle, in addition to functional intelligibility tests, judgment tests, where subjects rate their subjective impression of the stimuli on scales, can be used to evaluate segmental quality at the word level as well. For example, [Van Bezooijen (1988)], in addition to running a consonant cluster identification test, presented 26 Dutch consonant clusters (both initial and final) to be rated on naturalness, intelligibility, and pleasantness. The clusters were embedded in meaningful words. In order to obtain ``pure'' judgments, unaffected by the quality of the rest of the word, subjects were explicitly asked to pay attention to the clusters only. The test thus required analytic listening. However, one can never be sure to what extent listeners in fact stick to the instructions. Perhaps this is one of the reasons why judgment tests of this type have been rare.
In addition to the word level, tests for the assessment of segmental quality have been developed at the sentence level as well. Here the effect of prosody could be minimised by presenting the material on a monotone, but in practice, if only for naturalness' sake, prosody is usually included. Compared with segmental tests at the word level, tests at the sentence level are more similar to speech perception in normal communication but, as a consequence, less suitable for diagnostic purposes, for the following reasons:
Of course, if the test is not intended as a diagnostic tool but has a purely comparative aim, these consequences of using sentences do not necessarily detract from its value. However, it is important to remember that as soon as complete sentences are presented to listeners, the test is no longer limited to evaluating segmental quality alone. This means that the title of this section, ``segmental tests at the sentence level'', is not completely adequate. In fact, depending on the extent to which restrictions are imposed on the construction of the test materials, tests at the sentence level lie between a glass box approach and a black box approach. So, the main difference among the segmental sentence tests described below is their position on the glass box - black box continuum.
In this section only functional tests will be discussed. Judgment tests at the sentence level have also frequently been carried out; these are described under the heading ``black box approach'' in Section 12.4.1, where judgment tests to evaluate overall output quality are discussed. Such tests entail the rating of scales such as acceptability, intelligibility, and naturalness.
HARVARD PSYCHOACOUSTIC SENTENCES
One of the best-known segmental intelligibility tests at the sentence level uses the fixed set of 100 semantically and syntactically ``normal'' Harvard Psychoacoustic Sentences (Add salt before you fry the egg) (see Section 12.7.8). Intelligibility is expressed by means of the percentage of correctly identified keywords (nouns and verbs). In this test no restrictions are placed upon the composition of the test materials, which means that the percentage of correct responses is determined only to a limited extent by the acoustic characteristics of the individual segments. This test would therefore have to be placed towards the black box end of the continuum. In terms of the factors listed in Section 12.4.1, only (6), (7), and (8) are excluded.
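Keyword scoring itself is simple to implement; the sketch below computes the percentage of correctly reported keywords, with an invented data layout mapping sentence ids to keyword sets:

```python
def keyword_score(responses, keys):
    """Percentage of correctly reported keywords across sentences.

    `keys` maps sentence id -> set of scoring keywords (nouns and verbs);
    `responses` maps sentence id -> set of words the listener reported.
    The data layout is a hypothetical illustration, not a prescribed format.
    """
    hit = sum(len(keys[s] & responses.get(s, set())) for s in keys)
    total = sum(len(keys[s]) for s in keys)
    return 100.0 * hit / total

keys = {1: {"add", "salt", "fry", "egg"}}
responses = {1: {"add", "salt", "dry", "egg"}}   # "fry" misheard as "dry"
print(f"{keyword_score(responses, keys):.0f}% keywords correct")  # 75%
```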
The main advantages of the Harvard Psychoacoustic Sentences Test are:
The main disadvantages/restrictions of the test are:
The Harvard Psychoacoustic Sentences were compared with the Haskins sentences by [Pisoni et al. (1985a), Pisoni et al. (1985b)] for four synthesisers and human speech (see below).
HASKINS SYNTACTIC SENTENCES
Another famous test at the sentence level is the fixed set of 100 Haskins Syntactic Sentences (see Section 12.7.6). These sentences are semantically unpredictable, which means that they do not occur in daily life. An example is The old farm cost the blood. In terms of advantages and disadvantages, the Harvard Sentences and Haskins Sentences have much in common. The only difference is that Haskins listeners can rely less on semantic coherence (factor (5) in the list of factors in Section 12.4.1), so that the role of the acoustic characteristics of the segments is more important. The Haskins sentences therefore lie somewhat closer to the glass box end of the continuum than the Harvard sentences. The Haskins sentences were applied to four synthesisers and human speech by [Pisoni et al. (1985a), Pisoni et al. (1985b)] and compared with the Harvard sentences. The percentage of correct keyword identification is given in Table 12.7.
It can be seen that the two tests yield the same rank order. However, as expected, due to the reduced semantic coherence, the Haskins sentences are more sensitive.
SEMANTICALLY UNPREDICTABLE SENTENCES (SUS)
Both the Harvard and the Haskins tests offer a fixed set of sentences, characterised by a single syntactic structure, as test materials. More recently, a more flexible approach was adopted with the Semantically Unpredictable Sentences (SUS), developed by SAM (see Section 12.7.7). The test materials in the SUS Test consist of a fixed set of five syntactic structures which are common in most Western European languages, such as ``Subject-Verb-Adverbial'' (The table walked through the blue truth). The lexical slots in these structures are filled with high-frequency words from language-specific lexicons. The resulting stimulus sentences are semantically unpredictable, just like the Haskins Syntactic Sentences.
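Template filling of this kind is easy to mechanise. The sketch below generates sentences of one SUS-like structure; the template wording and the word lists are invented stand-ins for the five fixed structures and the language-specific high-frequency lexicons:

```python
import random

# Toy word lists; real SUS materials draw on high-frequency lexicons.
nouns = ["table", "truth", "chair", "law"]
verbs = ["walked", "sang", "drank"]
adjectives = ["blue", "strong", "late"]

def sus_subject_verb_adverbial(rng):
    # "Subject-Verb-Adverbial", e.g. "The table walked through the blue truth."
    return (f"The {rng.choice(nouns)} {rng.choice(verbs)} "
            f"through the {rng.choice(adjectives)} {rng.choice(nouns)}.")

rng = random.Random(42)   # fixed seed so the generated set is reproducible
for _ in range(3):
    print(sus_subject_verb_adverbial(rng))
```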
The advantages of the SUS Test are the following:
The main disadvantages/restrictions of the test are:
Pilot studies with the SUS test have been run in French, German, and English [Benoît (1989), Benoît et al. (1989), Hazan & Grice (1989)]. Results showed, among other things, that keywords presented in isolation were identified significantly less well than the same words in a sentence context. This is attributed in part to the fact that the syntactic category of the isolated words is not known. Furthermore, the SUS were found to be sensitive enough to discriminate between two synthesisers differing in prosody.
By prosody we mean the ensemble of properties of speech utterances that cannot be derived in a straightforward fashion from the identity of the vowel and consonant phonemes that are strung together in the linguistic representation underlying the speech utterance. Prosody then comprises the melody of the speech, word and phrase boundaries, (word) stress, (sentence) accent, rhythm, tempo, and changes in speaking rate. We exclude from the realm of prosody the class of voice characteristics (see Section 12.5.3).
Prosodic features may be used to differentiate between otherwise identical words in a language (e.g. trusty - trustee, or export (noun) - export (verb), with initial versus final stress, respectively). Yet, word stress is not so much concerned with making lexical distinctions (this is what vowels and consonants are for) as with providing checks and bounds to the word recognition process. Hearing a stressed syllable in languages with more or less fixed stress informs the listener where a new word may begin; error responses in word recognition strongly tend to agree with the stimulus in terms of stress position. In a minority of the EU languages (Swedish, Norwegian) lexical tone (rather than stress) is exploited for the purpose of differentiating between segmentally identical words (see also Chapter 6).
The more important functions of prosody, however, are located at the linguistic levels above the word:
These functions suggest that prosody affects comprehension (establishing the semantic relationships between words) rather than intelligibility (determining the identity of words), and, indeed, this is what most functional tests of prosody aim to evaluate.
Evaluation of the prosody of speech output systems is focussed on either the formal or the functional aspects. Only a handful of tests are directed at the formal quality of temporal organisation. An exemplary evaluation study on the duration rules of MITalk [Allen et al. (1987)] was done by [Carlson et al. (1979)]. They generated six different versions of a set of sentences by including or excluding effects of consonant duration rules, vowel duration rules, and stressed syllable and preboundary lengthening rules in the synthesis. These versions were compared with a topline reference condition in which (normalised) segment durations copied from human versions of the test sentences, spoken by the designated MITalk talker, were imposed on the synthesis. There were two baseline conditions, one with the neutral (inherent) table values substituted for all segments, and one with random segment duration variation (within realistic bounds). The results showed that the temporal organisation afforded by the complete rule set was judged to be as natural as the human topline control. Moreover, sentences generated with boundary markers at minor and major breaks were judged to be more natural than speech without boundary markers.
More work has been done in the field of melodic structure. Let us first consider judgments of formal aspects of speech melody. The formal properties of, for example, pitch movements or complete speech melodies can be tested by asking groups of listeners (either naive or expert) to state their preference in pairwise comparisons or to rate a melody in a more absolute way along some goodness or naturalness scale. At the level of elementary pitch movements (such as accent-lending or boundary-marking rises, falls, or rise-fall combinations) the SAM Prosodic Form Test (see Section 12.7.9) is a useful tool. The test was applied to two English and two Italian synthesisers, with 3 contours, 4 levels of segmental complexity, 5 items at each level, and 4 repetitions of each token [Grice et al. (1991)]. Significant effects were found for synthesiser and contour, as well as for the interactions between synthesiser and contour and between synthesiser, complexity, and contour. By relating the scores for the contours to those for the monotone reference, the effect of differences in segmental quality on the ratings could be cancelled out.
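One simple realisation of this normalisation is to express each contour's mean rating as a difference from the monotone version produced by the same synthesiser; the subtraction is our reading of ``relating the scores'', and the ratings below are invented:

```python
# (synthesiser, contour) -> mean goodness rating; data are invented.
ratings = {
    ("A", "monotone"): 5.1, ("A", "rise"): 6.8, ("A", "fall"): 6.2,
    ("B", "monotone"): 3.4, ("B", "rise"): 4.9, ("B", "fall"): 4.6,
}

# Difference scores: segmental quality affects the monotone and the contour
# versions alike, so it cancels out of the subtraction.
for (syn, contour), score in ratings.items():
    if contour == "monotone":
        continue
    delta = score - ratings[(syn, "monotone")]
    print(f"{syn} {contour}: {delta:+.1f} relative to monotone")
```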
Using the same methodology, i.e. rating and pairwise comparisons, the quality of synthetic speech melody can be evaluated at the higher linguistic levels. At the level of isolated sentences, pairwise comparison of competing intonation-by-rule modules is feasible when the number of systems (or versions) is limited [e.g. Akers & Lennig (1985)]. When multiple modules are tested using a larger variety of sentences and melodies, scale rating is to be preferred over pairwise comparisons for reasons of efficiency [De Pijper (1983), Willems et al. (1988)].
Evaluation of speech melody generators should not stop at the level of isolated sentences. For Dutch, ratings by expert listeners could not reveal any quality differences between synthetic melodies and a human reference when the sentences were listened to in isolation; however, the same synthetic melodies proved inferior to the human reference when they were presented in the context of a full paragraph [Terken & Collier (1989)]. Along the same lines, [Salza et al. (1993)] evaluated the prosody of the Eloquens TTS for Italian in a train schedule consultation application. Three hundred sentences were tested, realistically distributed over seven melodic modalities: Command sentences, Simple declaratives, List sentences, Wh-questions, Yes/no-questions, Yes/no-echo questions, and Yes/no modal questions. Expert listeners' scores did not differ from those of naive subjects, and scores were better for utterances presented as part of a dialogue than for sentences presented in isolation. Clearly, both studies demonstrate that paragraph position or function within a dialogue induces certain perceptually and communicatively relevant adaptations to sentence prosody.
The form tests discussed so far address prosody globally. An analytic approach to prosodic evaluation using judgment testing was proposed by [Bladon (1990)] and co-workers. They developed an elaborate checklist of formal properties that should be satisfied by any speech output system that claims to generate English melodies. Trained (but phonetically naive) judges listen to synthetic utterances while looking at orthographic transcripts of the utterance with a crucial word or syllable underlined. Their task is to check whether the target syllable does in fact contain the melodic property prescribed by the checklist. Although this idea is attractive from a diagnostic point of view (melodic flaws are immediately identified), the approach has some drawbacks that should be considered before extending its use to other materials and other languages. First, drawing up a valid checklist presupposes a theory of intonation, or at least a detailed and valid description of the test sentences. Workable theories and descriptions may be available for English and some other languages, but not for all (EU) languages. Second, even for English, the criteria for each melodic check point were formulated in rather crude terms, which makes it difficult for the judges to determine whether the utterance does or does not satisfy the criterion. Third, it is impossible to determine the overall quality of the melodies tested, since there is no way of combining the pass/fail scores for the various check points into a weighted overall score. A preliminary experiment revealed that three output systems could be meaningfully rank-ordered along a quality scale, but not at the interval measurement level. Systems that were clearly different as judged by experts were very close to each other in terms of their unweighted overall score, whereas systems that were rated as equally good by experts differed by many points.
For the reasons given above, we do not recommend analytic judgments by naive listeners using a checklist as an approach to evaluating prosody.
There is (at least) one judgment test that assesses how well certain communicative functions are signalled by prosody at a higher level. The SAM Prosodic Function Test (see Section 12.7.9) asks for ratings of the communicative appropriateness of melodies in the context of plausible human-machine dialogue situations. The test was applied to human-machine dialogues designed to simulate a telephone enquiry service giving flight information [Grice et al. (1992b)]. A restricted set of contexts and illocutionary acts was included: asking (seeking information, seeking confirmation), assertive (conclude, put forward, state), expressive (greet), and commissive (offer, propose to). Two intonation versions were compared, one based on an orthographic input with punctuation (target intonation algorithm) and the other based on a text input edited to conform to the type of text generated by an automatic language generator (reference intonation algorithm). The test should be seen as a first attempt to evaluate the paralinguistic appropriateness of intonation in dialogue situations. For general comparative purposes, it would be useful to have an agreed-on, systematic inventory of situations or speech acts one would want to include, taking as a point of departure, for example, the classification of speech acts proposed by [Searle (1979)].
Finally, we are not aware of tests asking subjects to judge the quality of the expression of emotions and attitudes in synthetic speech. It would appear that functional testing of these qualities is preferred in all cases.
Evaluating speech output prosody using functional tests is even more in its infancy. Since prosody is highly redundant given the segmental information (with the exception of the signalling of sentence type and emotion/attitude), it can be functionally tested only if measures are taken to reduce its redundancy. The first course of action, then, has been to concentrate on atypical, rather contrived materials in which prosody is non-redundant. That is, the materials consist of segmental structures that would be ambiguous without the prosody, and listeners are asked to resolve the ambiguity. To the extent that the disambiguation is successful, the speech output system can be said to possess the appropriate prosodic functions. We find examples of such functional tests for the disambiguation of minimal stress pairs [Beckman (1986), for a survey], word boundaries [Quené (1993), for a survey], constituent structure [Lehiste (1976)], sentence type [e.g. Thorsen (1980)], and focus distribution [e.g. Nooteboom & Kruijt (1987)]. However, in these kinds of studies speech output assessment typically was not the primary research goal. Rather, speech synthesis was used by psycholinguists or experimental phoneticians to manipulate the speech parameters in a controlled fashion.
The second route is to make prosody less redundant by degrading the segmental quality, such that without prosody (i.e. in the baseline conditions identified above) the intelligibility of the speech output would be extremely poor. The quality of the prosody is then measured in terms of the gain in intelligibility, i.e. the increase in the percentage of correctly reported linguistic units (phonemes, morphemes, words) due to the addition of prosody. [Carlson et al. (1979)] measured the intelligibility of utterances synthesised by MITalk with and without application of vowel duration, consonant duration, and boundary marking rules (see above). They found that adding duration rules improved word intelligibility; adding within-sentence boundaries, however, did not boost intelligibility (even though the result was judged to be more natural, see above). [Scharpff & Van Heuven (1988)] demonstrated that adding within-sentence boundaries (i.e. changing the temporal organisation) does improve word intelligibility (especially for monosyllabic words) in Dutch diphone synthesis, and that utterances with pauses were judged more pleasant to listen to (but only when listeners were unfamiliar with the contents of the sentence) [Van Heuven & Scharpff (1991)]. Reasoning along the same lines, one would predict that quality differences in speech melody would have an effect on word recognition in segmentally degraded speech. Such effects were, in fact, reported by [Maassen & Povel (1985)], who used (highly abnormal) speech utterances produced by deaf speakers, resynthesised with corrected temporal and/or melodic organisation.
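The gain measure itself is just the difference between two percent-correct scores, as the following sketch (with invented response counts) makes explicit:

```python
# Sketch of the intelligibility-gain measure: the quality of a prosodic rule
# set is expressed as the increase in percent-correct word report over a
# baseline without that rule set.  The counts below are invented.
def pct_correct(n_correct, n_total):
    return 100.0 * n_correct / n_total

baseline = pct_correct(312, 600)    # degraded speech, no duration rules
with_rules = pct_correct(402, 600)  # same speech with duration rules applied
print(f"gain due to prosody: {with_rules - baseline:.1f} percentage points")
```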
There is a substantial literature on the perception of emotion and attitude in human speech [Murray & Arnott (1993), for a survey]. Typically, listeners are asked to indicate which emotion they perceive in the stimulus utterance, in an open or closed response format. Predictably, the larger the set of response alternatives, the poorer the identification of each emotion. It is not clear, in this context, how many different emotions should be distinguished, and to what extent these can be signalled by phonetic means. Still, results tend to show that the most basic emotions can be identified, in lexically neutral utterances, at better than 50% correct in a 10-alternative closed response test. Synthesis of emotion in speech output is being attempted by several research groups. A preliminary evaluation of emotion-by-rule in Dutch diphone synthesis was presented by [Vroomen et al. (1993)], as summarised in Table 12.8 [after Vroomen et al. (1993)].
Note: The human reference condition consisted of neutral utterances with temporal and melodic organisation copied from emotional utterances of the same sentences, using PSOLA.
Whereas the segmental and prosodic features of speech are continuously varying, voice characteristics are taken to refer to aspects of speech which generally remain relatively constant over longer stretches of speech. Voice characteristics, also referred to as voice quality [Laver (1991)], can most easily be viewed as the background against which segmental and prosodic variation is produced and perceived. In our definition, it includes such varied aspects of speech as mean pitch level, mean loudness, mean tempo, harshness, creak, whisper, tongue body orientation, dialect, accent, etc. Voice quality is mainly used by the listener to form a (sometimes incorrect) idea of the speaker's:
In principle, voice quality is not communicative, i.e. not consciously used by the speaker to make the listener aware of something of which he was not previously aware, but informative, which means that, regardless of the intention of the speaker, it is used by the listener to infer information (see also Chapter 11). This information may have practical consequences for the continuation of the communicative interaction, since it may influence the listener's attitudes towards the speaker in a positive or negative sense and may affect his interpretation of the message [Laver (1994)].
Recently, increased attention has been paid to voice quality aspects of synthetic speech. In fact, [Sorin (1994)] regards the successful creation of personalised synthetic voices (``personalised TTS'') as one of the most ambitious challenges of the near future. This aspect of synthesis is relevant, for example, in applications such as Translating (Interpreting) Telephony services, where along with translating the content of the message the original voice of the speaker has to be reconstructed (automatic voice conversion). Moreover, the correct encoding of speaker characteristics such as sex, age, and regional background is also relevant for the synthetic reading of novels for the blind. Finally, a third application is to be found with non-speaking disabled individuals, who have to use a synthetic voice to replace their own.
With a view to the latter application, [Murray & Arnott (1993)] describe a system allowing rapid development of new voice ``personalities'' for the DECtalk synthesiser, with immediate feedback to the user. Voice alteration is done by interpolating between the existing DECtalk voices (five male voices, five female voices, and a unisex child). Thus a voice may be created that sounds ``a bit like Paul with a bit of Harry''. A somewhat different approach, aimed at a somewhat different type of application, is described by [Yarrington & Foulds (1993)], who use original recordings of speakers who know they are going to lose their voice to construct speaker-specific diphone sets.
Apart from specific requirements imposed by concrete applications, a general requirement of the voice quality of synthetic output is that it should not sound unacceptably unpleasant. Voice pleasantness is one of the scales included in the overall quality test proposed by the ITU-T to evaluate synthetic speech transmitted over the telephone (see Section 12.7.12). It has also been used by [Van Bezooijen & Jongenburger (1993)] in a field test to evaluate the functioning of an electronic newspaper for the blind. In this test, 24 visually handicapped subjects rated the pleasantness of voice of two synthesisers on a 10-point scale (1: extremely bad, 10: extremely good). Ratings were collected at three points in time: (1) upon first confrontation with the synthesis output, (2) after one month, and (3) after two months of ``reading'' the newspaper. Interestingly, the pleasantness of voice ratings were found not to change over time, in contrast to the intelligibility ratings, which reflected a strong learning effect. From this it was concluded that voice quality has to be good right from the start; one cannot count on the beneficial effect of habituation. Both synthesis systems were generally considered good enough for the reading of popular scientific books and newspapers. However, partly due to their unpleasant voice quality, they were found unfit for the reading of novels or poetry [Jongenburger & Van Bezooijen (1992)]. So, voice quality mainly seems to play a role when attention is directed to the form of the message, for recreational purposes. Finally, we hypothesise that, perhaps more than for aspects of speech affecting comprehension, motivation and a positive attitude might compensate for poor voice quality.
Of course, judgment studies such as these can only provide global information; if results are negative, no diagnostic information is available as to what voice quality component should be improved. There are no standard tests to diagnostically evaluate the voice quality characteristics of speech output. This type of information could in principle be obtained by means of a modular test, where various acoustic parameters affecting voice quality are systematically varied so that their effect on the evaluation of voice quality can be assessed. This would be the most direct approach.
A more indirect approach would involve asking subjects to listen analytically to various aspects of voice quality and rate them on separate scales. A potentially useful instrument for obtaining a very detailed description is the Vocal Profile Analysis Protocol developed by [Laver (1991)]. This protocol, which comprises more than 30 voice quality features, requires extensive training. If data are available for several synthesis outputs, the descriptive voice quality ratings could be used to predict the overall pleasantness of voice ratings.
It may also be possible to use untrained listeners, although the number of aspects described will necessarily be more limited and less ``phonetic''. Experience with human speech samples representing various voice quality settings [Van Bezooijen (1986)] has shown that naive subjects can reliably describe 1-minute speech samples with respect to the following 14 voice quality scales: warm-sharp, smooth-rough, low-high, soft-loud, nasal-free of nasality, clear-dull, trembling-free of trembles, hoarse-free of hoarseness, full-thin, precise-slurred, fast-slow, accentuated-unaccentuated, expressive-flat, and fluent-halting. Again, if descriptive ratings of this type were available for synthetic speech, they could be correlated with global ratings of synthesised voice quality. Alternatively, this type of scale could also be used more directly for diagnostic purposes, i.e. subjects could be asked to rate each of these voice quality aspects on a 10-point scale, with 1: extremely bad and 10: extremely good.
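Relating such descriptive ratings to a global pleasantness rating could be done, for instance, by ordinary least squares regression. A minimal sketch with invented ratings for four synthesis outputs on two of the fourteen scales (in practice one would need far more outputs than predictor scales):

```python
import numpy as np

# Rows: synthesis outputs; columns: mean ratings on two descriptive scales
# (warm-sharp, smooth-rough).  All numbers are invented for illustration.
X = np.array([[6.1, 4.2], [3.0, 7.5], [5.5, 5.0], [2.2, 6.8]])
y = np.array([7.0, 3.5, 6.2, 3.0])   # global pleasantness-of-voice ratings

X1 = np.column_stack([np.ones(len(X)), X])        # add an intercept column
coef, *_ = np.linalg.lstsq(X1, y, rcond=None)     # least-squares fit
print("intercept and scale weights:", np.round(coef, 2))
```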
However, as mentioned above, experience with detailed perceptual descriptions of voice quality is as yet limited to non-distorted human speech. It remains to be seen whether such descriptions can also be reliably made for synthetic speech. And even if this proved to be the case, the translation of the results obtained into actual system improvement is not unproblematic, since not much is known about the acoustic basis of perceptual ratings. Attempts in this direction have been rather disappointing [e.g. Boves (1984)].
In addition to judgment tests to evaluate the formal aspects of voice quality, functional tests may be used to assess the adequacy of voice quality. Although here, too, no standard tests are available, the procedures are rather straightforward and dictated directly by application requirements. One can think, for example, of tests in which subjects are asked, in an open or closed response format, to identify the speaker. This would be useful in an application where one tries to construct a synthetic voice for a given speaker or reconstruct the natural voice of a given speaker. Or one can ask people to identify the speaker's sex, or estimate his age or other characteristics.
In this context, accent and dialect features are relevant as well. For example, for Dutch a new set of diphones was derived from a western speaker, because some non-speaking users complained that the old diphone set had too much of a southern accent to be acceptable for communication in their living environment. To test whether naive listeners were in fact able to discriminate between the two diphone sets, listeners from different parts of the Netherlands rated CVC, VCV, and VCCV stimuli produced with the two systems on a 10-point bipolar regional accent - standard Dutch scale. The diphone sets were indeed clearly discriminable [Van Bezooijen (1988)].
Summarising, it can be stated that very little experience has as yet been gained with the diagnostic and comparative evaluation of the voice quality of speech output systems, whether by means of judgment tests or functional tests. Moreover, except for specific applications where synthesis is closely connected with the identity of a speaker (in a clinical or automatic voice conversion setting), it is not even clear how much importance is attached to voice quality by naive listeners. How much does it really bother people when voice quality is unpleasant? For example, does an unpleasant voice quality prevent them from using a synthetic information service? It is too early to give concrete recommendations on how to approach the evaluation of voice quality aspects of speech output; this is one of the topics for further investigation in the near future.
Knowledge about the relationships among tests is important for at least two reasons:
What would be needed to assess the relationships among tests is a large-scale study which compares the performance of all ``serious'' tests testing the same aspect (e.g. intelligibility or comprehension) for a wide range of synthesisers. One would then like to know the stable differences among the tests in the quality measured (e.g. percentage correct), as well as the correlations among the rank orderings of the synthesisers. In addition, it would be useful to have information on the reliability of ``identical'' tests developed for and applied to a wide variety of different languages.
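Agreement between the rank orderings produced by two tests can be quantified with a rank correlation. A sketch with invented scores, assuming the scipy library for Spearman's rho:

```python
from scipy.stats import spearmanr

# Percent-correct scores for the same four synthesisers on two tests;
# the systems and numbers are invented for illustration.
test_a = {"syn1": 85, "syn2": 70, "syn3": 64, "syn4": 40}
test_b = {"syn1": 52, "syn2": 41, "syn3": 44, "syn4": 19}

systems = sorted(test_a)
rho, p = spearmanr([test_a[s] for s in systems], [test_b[s] for s in systems])
print(f"Spearman rho = {rho:.2f} (p = {p:.2f})")
```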
Some differences between the results obtained with different tests can be predicted to some extent. For example, when considering intelligibility, we think at least four factors will affect the outcomes: Intelligibility can be expected to increase
These predictions can be tested by looking at actual intelligibility results. [Jekosch & Pols (1994)], for example, assessed the intelligibility of one German synthesiser by means of four different tests (all described in Section 12.7):
The percentage of correct elements (phonemes in the SAM Standard Segmental Test, clusters in the CLID Test, words in the MRT and the SUS Test) differed widely, from 19% to 85%. The lowest percentage was obtained for the SUS Test, followed by the SAM Standard Segmental Test, the CLID Test, and the MRT. The fact that the highest score was obtained with the MRT agrees with our predictions, since this test possesses not a single aspect with a negative effect on intelligibility: the unit of measurement is small (phoneme), the structure is fixed (CVC), the items are meaningful, and the response set is closed (six categories). The results for the other three tests point to complex interactions among the four factors.
[Delogu et al. (1992a)] compared four different test methods for evaluating the overall quality of Semantically Unpredictable Sentences produced by a male speaker (once with and once without noise added), three synthesisers, and three vocoders:
Very high correlations were obtained among categorical estimation, magnitude estimation, and paired comparison (r > 0.90); somewhat lower but still high correlations were found between these three test methods and reaction time (r around 0.80). Reaction time showed the smallest variation in the responses, but also the least discriminatory power. The best discrimination among the systems was obtained with paired comparisons.
[Silverman et al. (1990)] compared the results of the Bellcore intelligibility test (see Section 12.7.3) with a comprehension test in which subjects had to answer questions related to the content of synthesised utterances with yes, no, or can't tell from the information provided. The faster subjects answered questions, the more items they heard. Two synthesisers, A and B, were tested. The intelligibility test yielded higher percentages correct for A than for B (77% versus 70%), whereas the comprehension test yielded higher percentages correct for B than for A (69% versus 63%). A few remarks are in order when attempting to interpret these seemingly contradictory results:
Whatever the exact basis of the opposite rank orders yielded by the two tests, it is clear that caution should be exercised when generalising from a laboratory-type intelligibility test to a field-type, application-oriented comprehension test. Low correlations between intelligibility (MRT) and comprehension are also reported by [Ralston et al. (1991)].
In general, studies comparing different tests include only a limited number of systems, which makes it difficult to determine to what extent the different tests rank the systems in the same way. Moreover, the relationship between the results yielded by glass box and black box tests deserves more systematic attention. We think that the importance of further studies of the relationships among tests cannot be stressed enough if one wants to have a good idea of the meaning and generality of the results obtained.