Laboratory testing

Next: Field testing Up: Black box approach Previous: Black box approach

Black box assessment tests a system's performance as a whole, without considering the performance of modules internal to the system. Ideally, within black box testing, one would want to have at one's disposal a functional test to assess the adequacy of the complete speech output in all respects: does the output function as it should? Such a test does not exist, and is difficult to conceive. In practice, the functional quality of overall speech output has often been equated with comprehensibility: to what extent can synthesised continuous speech be understood by listeners?

Speech comprehension is a complex process involving the interpretation and integration of many sources of information. Important sources of information in complete communication situations, where both auditory and visual information are available to interactants, are:

Speech signal information at different levels (segments, prosody , voice characteristics ),
Segment combinatory probabilities (e.g. /str../ is a permissible consonant sequence at the onset of words in many EU languages, but all other permutations of this sequence are illegal (e.g. /tsr.../,
Knowledge of which segment strings are existing words in the language (e.g. the permissible string /strk/ is not a word in English),
Word combinatory probabilities (e.g. the article ``the'' will tend to be followed by adjectives or nouns rather than verbs),
Semantic coherence (e.g. in the context of ``arrive'' a word like ``train'' is more probable than a word like ``pain''),
Meaning extracted from the preceding linguistic context; due to the repetition of words and the progressive building up of meaning, the last sentence of a text will generally be easier to understand than the first,
World knowledge and expectations of the listener based on previous experience,
Cues provided by the extra-linguistic context in which the message is spoken (e.g.\ facial expressions and gestures of the speaker, relevant things happening in the immediate environment ).

In normal daily life all these different sources, and others, may be combined by listeners to construct the meaning of a spoken message. As a result, in applied contexts the contributions of separate sources are difficult to assess. Laboratory tests typically try to minimise or control for the effects of at least some of the sources in order to focus on the auditory input. Some segmental intelligibility tests at the word level (such as the SAM Standard Segmental Test, see Section 12.7.1) try to minimise the effects of all sources except (1) and (2): only meaningless but permissible consonant-vowel-consonant combinations (e.g. /hos/) or even shorter items (/ze, ok/) are presented to the listener. In comprehensibility tests, factor (8) is excluded completely and (7) as far as possible. The latter is done by selecting texts with supposedly novel information for all subjects.

No completely developed standardised test, with fixed test material and fixed response categories, is available for evaluating comprehension, but one wonders whether this would be very useful in the first place, since it is not clear what the ``average'' text to be used should look like in terms, for example, of the complexity and type of vocabulary, grammatical structures, sentence length, and style. At this level of evaluation it advisable to take the characteristics of the intended application into account.

Testing the comprehensibility of speech output destined to provide traffic information requires a more specific type of test materials (e.g. short sentences, only statements, restricted range of lexical items, formal style) than speech output to be used for reading a digital daily newspaper for the blind, where the test materials should be more varied in all respects. The greatest variation should probably be present in speech material testing text-to-speech systems developed to read novels to the visually handicapped.

As to the type of comprehension test, several general approaches can be outlined. The most obvious one involves the presentation of synthesised texts at the paragraph level, preferably with human produced versions as a topline control, with a series of open or closed (multiple choice) questions. Results are expressed in terms of the percentage of correct responses. An example of a closed response approach is [Pisoni et al. (1985a), Pisoni et al. (1985b)], who used 15 narrative passages selected from standardised adult reading comprehension tests. Performance was compared between listening to synthetic speech, listening to human speech, and silent reading. Each condition was tested with 20 subjects. One of the most important findings was a strong learning effect for synthetic speech within a very short time, and the absence of clear differences among the test conditions.

At first sight, the results of closed response comprehension tests seem counterintuitive: although the human produced texts sound better than the synthetic version, often no difference in comprehension is revealed [Nye et al. (1975), Delogu et al. (1992b)] or, after a short period of familiarisation, even superior performance for synthetic speech [Pisoni et al. (1985b), Pisoni et al. (1985a)] is observed. These results have been tentatively explained by hypothesising that subjects may make more of an effort to understand synthetic speech. This could be expected to lead to:

Slower reaction times in a sentence verification test.
A decrease in performance as a function of fatigue.
Poorer performance for secondary tasks.

Confirmation of the first prediction was found by [Manous et al. (1985)]. The second and third predictions were tested by [Luce et al. (1983)], using a word recall test, and by [Boogaart & Silverman (1992)], using a tracking task. The first study revealed a significant effect, whereas the second did not.

However, the lack of differentiation in comprehensibility between human and synthetic speech in the above studies may also be due to the use of the closed response approach, where subjects have a fair chance of guessing the correct answer. Open response tests are known to be more sensitive, i.e. more apt to bring to light differences among test conditions. An example of an open response study is [Van Bezooijen (1989)], who presented five types of texts typically found in daily Dutch newspapers , pertaining to the weather, nature, disasters, small events, and sports, to 16 visually handicapped subjects. An example of a question testing the comprehensibility of the weather forecasts is: What will the temperature be tomorrow? The questions were sensitive enough to yield significant differences in comprehensibility among two text-to-speech (one automated and one manually corrected) and one human produced version of the texts. Crucially, the results also suggest that the effect of the supposedly greater effort expended in understanding synthetic speech has its limits. If the synthetic speech is bad enough, increased effort cannot compensate for loss of quality.

The tests described ask subjects to answer questions after the texts have been presented, thus measuring the final product of text interpretation. In addition to these off-line tests, more psycholinguistically oriented on-line approaches have been developed which request instantaneous reactions to the auditory material being presented. These tests primarily aim at gaining insight into the cognitive processes underlying comprehension: to what extent is synthetic speech processed differently from human speech? A few of these psycholinguistic tests are:

The word monitoring task , requiring subjects to press a button as soon as they hear a prespecified word.
The sentence-by-sentence listening task, in which subjects push a button whenever they are ready to listen to the next sentence (comprehension is checked afterwards but is not part of the test proper).
The sentence verification test, where subjects have to decide whether short sentences are true statements or not (e.g. Mud is dirty and Rockets move slowly).

All three are on-line measures, the first indexing cognitive workload, the second and third assessing speed of comprehension. On-line tests of this type, which invariably reveal differences between human and synthetic speech, have been hypothesised to be more sensitive than off-line measures [Ralston et al. (1991)]. However, the results of such psycholinguistic tests (``subjects responded significantly faster to system A (740 ms) than to system B (930 ms)'') are less interpretable for non-scientists than those of comprehension tests (``subjects answered 74% of the system A questions correctly versus 93% of the system B questions''). On the other hand, insight into cognitive load may ultimately prove important in dual task applications.

Recommendations on functional testing of overall output quality

Try to avoid the use of functional tests to assess overall output quality: on-line reaction time tests are difficult to interpret and off-line comprehension tests are difficult to develop.
If determined to develop a comprehension test, beware of the fact that reading tests may be too compact to be used as listening tests; adapt the materials or use materials that are meant to be listened to.
Use open comprehension questions rather than closed ones, the former being more sensitive than the latter.
When administering a comprehension test, include a topline reference with a dedicated speaker realising exactly the same texts presented in a synthetic version; use different groups of subjects for the various speech conditions (or better still block conditions over listeners such that no listener hears more than one version of the same text while at the same time each listener gets an equal number of different text versions).
When interpreting comprehension results, look at difference scores (synthetic compared to human) rather than at absolute scores in order to abstract from the intrinsic difficulty of questions.

Judgment laboratory tests

The black box tests described so far are functional in nature. However, instead of evaluating overall quality functionally, subjects can also indicate their subjective impression of global quality aspects of synthetic output by means of rating scales. Taking comprehensibility as an example, a functional task would be one where subjects answer a number of questions related to the content of a text passage as described above. Alternatives from a judgment point of view include:

Paired comparison , where subjects indicate which of two synthesisers sounds more comprehensible,
Magnitude estimation , where subjects assign a value expressing, or draw a line of a length which is equal to the magnitude of, their impression of comprehensibility,
Categorical estimation , where subjects rate synthesisers, for instance, along a 10-point scale which runs from 1: extremely incomprehensible to 10: extremely comprehensible.

Some methodological aspects of the second and third method are described in detail in Section 12.3.2. There it is also indicated that magnitude estimation is relatively laborious and better suited to test external comparison, whereas categorical estimation is relatively fast and easy, and better suited to test internal comparison.

Both the magnitude (continuous scale) and categorical estimation (20-point scale) methods have been included in SOAP in the form of the SAM Overall Quality Test (see Section 12.7.11). Three judgment scales are recommended, related to:

Intelligibility (How identifiable does the message sound?),
Naturalness (To what extent does the message sound like being produced by a human speaker?),
Acceptability (The overall user's satisfaction with the communicative situation).

The intelligibility and naturalness ratings are based on pairs of (unrelated) sentences. Fixed lists of 160 sentences of varying content and length are available for Dutch, English, French, German, Italian, and Swedish. Examples for English are: I realise you're having supply problems but this is rather excessive and I need to arrive by 10.30 a.m. on Saturday. For the acceptability ratings, application specific test materials are recommended. The magnitude and categorical estimation procedures have been applied to speech output in a number of studies [Pavlovic et al. (1990), Delogu et al. (1991), Goldstein et al. (1992), e.g.,]. Methodological aspects, such as the effects of stimulus range and the number of categories, relationships among methods, reliability, and validity, are emphasised.

The importance of application-specific test materials is also stressed by the International Telecommunication Union Telecommunication Standardisation (ITU-T) sector (see Section 12.7.12). They developed a test specifically aimed at evaluating the quality of telephone speech (where synthesis could be the input). It is a categorical estimation judgment test comprising ratings on (a subset of) eight scales:

Acceptance
Overall impression
Listening effort
Comprehension problems
Articulation
Pronunciation
Speaking rate
Voice pleasantness

The first scale is a 2-point scale, the others are 5-point scales. Strictly speaking, only the first four scales can be captured under the heading overall quality ; the other four scales are directed at more specific aspects of the output and require analytic listening . The content of the speech samples presented should be in accordance with the application. Examples of application-specific test items are: Miss Robert, the running shoes Adidas Edberg Pro Club, colour: white, size: 11, reference: 501-97-52, price 319 francs, will be delivered to you in 3 weeks (mail order shopping) and The train number 9783 from Poitiers will arrive at 9:24, platform number 3, track G (railway traffic information). In addition to rating the eight scales, subjects are required to reproduce information contained in the message. A pilot study has been run by [Cartier et al. (1992)]. [Fellbaum et al. (1994)] tested 13 synthesis systems for German using the ITU-T Overall Quality Test as well as open response functional intelligibility tests. Waveform concatenative synthesis systems proved measurably better than formant synthesis systems.

[Van Bezooijen & Jongenburger (1993)] employed a similar series of judgment scales as proposed by the ITU-T in a mixed laboratory/field study which addressed the suitability of synthetic speech within the context of a digital daily newspaper for the blind (see Section 12.4.2). Their battery comprised ten 10-point scales:

Intelligibility
General quality
Naturalness
Precision of articulation
Accuracy of pronunciation
Pleasantness of voice
Adequacy of word stress
Appropriateness of tempo
Liveliness
Fluency

Again a distinction can be made between scales relating to overall quality (the first three scales), and the other scales, relating to specific aspects of the speech output. A factor analysis yielded two factors, the first with high loadings of intelligibility, general quality, and precision of articulation, the second with high loadings of naturalness , pleasantness of voice, and adequacy of word stress . Intelligibility and naturalness were taken by the authors to be the two central dimensions underlying the evaluative judgments.

Recommendations on judgment testing of overall output quality

1.

Since there is no consensus on the most appropriate judgment scales to evaluate overall quality , choose between:

Intelligibility, naturalness , and acceptability (SAM Overall Quality Test),
Acceptance, overall impression, listening effort , and comprehension problems (ITU-T ), or only listening effort (practice in telephony),
Intelligibility, general quality, and naturalness [Van Bezooijen & Jongenburger (1993)].

2.

It is important that the scale positions have a clear meaning for the subjects and that the scale is wide enough to allow differentiation among systems compared. Use at least a 10-point scale.

Next: Field testing Up: Black box approach Previous: Black box approach

EAGLES SWLG SoftEdition, May 1997. Get the book...