Black box assessment tests a system's performance as a whole, without considering the performance of modules internal to the system. Ideally, within black box testing, one would want to have at one's disposal a functional test to assess the adequacy of the complete speech output in all respects: does the output function as it should? Such a test does not exist, and is difficult to conceive. In practice, the functional quality of overall speech output has often been equated with comprehensibility: to what extent can synthesised continuous speech be understood by listeners?
Speech comprehension is a complex process involving the interpretation and integration of many sources of information. Important sources of information in complete communication situations, where both auditory and visual information are available to interactants, are:
In normal daily life all these different sources, and others, may be combined by listeners to construct the meaning of a spoken message. As a result, in applied contexts the contributions of separate sources are difficult to assess. Laboratory tests typically try to minimise or control for the effects of at least some of the sources in order to focus on the auditory input. Some segmental intelligibility tests at the word level (such as the SAM Standard Segmental Test, see Section 12.7.1) try to minimise the effects of all sources except (1) and (2): only meaningless but permissible consonant-vowel-consonant combinations (e.g. /hos/) or even shorter items (/ze, ok/) are presented to the listener. In comprehensibility tests, factor (8) is excluded completely and (7) as far as possible. The latter is done by selecting texts with supposedly novel information for all subjects.
No completely developed standardised test, with fixed test material and fixed response categories, is available for evaluating comprehension, but it is questionable whether such a test would be very useful in the first place, since it is not clear what the ``average'' text to be used should look like in terms, for example, of the complexity and type of vocabulary, grammatical structures, sentence length, and style. At this level of evaluation it is advisable to take the characteristics of the intended application into account.
Testing the comprehensibility of speech output destined to provide traffic information requires a more specific type of test material (e.g. short sentences, only statements, a restricted range of lexical items, formal style) than speech output to be used for reading a digital daily newspaper for the blind, where the test materials should be more varied in all respects. The greatest variation should probably be present in speech materials used to test text-to-speech systems developed to read novels to the visually handicapped.
As to the type of comprehension test, several general approaches can be outlined. The most obvious one involves the presentation of synthesised texts at the paragraph level, preferably with human produced versions as a topline control, followed by a series of open or closed (multiple choice) questions. Results are expressed in terms of the percentage of correct responses. An example of a closed response approach is [Pisoni et al. (1985a), Pisoni et al. (1985b)], who used 15 narrative passages selected from standardised adult reading comprehension tests. Performance was compared between listening to synthetic speech, listening to human speech, and silent reading. Each condition was tested with 20 subjects. The most important findings were a strong learning effect for synthetic speech within a very short time, and the absence of clear differences among the test conditions.
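By way of illustration, the sketch below shows how percentage-correct scores of this kind might be computed from raw response records; the subject identifiers, condition labels, and responses are invented for the purpose of the example and do not come from the studies cited.

    # Illustrative scoring sketch (not from the cited studies): per-condition
    # percentage of correct answers to multiple-choice comprehension questions.
    from collections import defaultdict

    # Hypothetical records: (subject, condition, question_id, correct?)
    responses = [
        ("s01", "synthetic speech", "q1", True),
        ("s01", "synthetic speech", "q2", False),
        ("s02", "human speech",     "q1", True),
        ("s03", "silent reading",   "q1", True),
        ("s03", "silent reading",   "q2", True),
    ]

    totals = defaultdict(lambda: [0, 0])   # condition -> [n_correct, n_items]
    for _, condition, _, correct in responses:
        totals[condition][0] += int(correct)
        totals[condition][1] += 1

    for condition, (n_correct, n_items) in totals.items():
        print(f"{condition}: {100.0 * n_correct / n_items:.1f}% correct")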
At first sight, the results of closed response comprehension tests seem counterintuitive: although the human produced texts sound better than the synthetic version, often no difference in comprehension is revealed [Nye et al. (1975), Delogu et al. (1992b)] or, after a short period of familiarisation, even superior performance for synthetic speech [Pisoni et al. (1985b), Pisoni et al. (1985a)] is observed. These results have been tentatively explained by hypothesising that subjects may make more of an effort to understand synthetic speech. This could be expected to lead to:
Confirmation of the first prediction was found by [Manous et al. (1985)]. The second and third predictions were tested by [Luce et al. (1983)], using a word recall test, and by [Boogaart & Silverman (1992)], using a tracking task. The first study revealed a significant effect, whereas the second did not.
However, the lack of differentiation in comprehensibility between human and synthetic speech in the above studies may also be due to the use of the closed response approach, where subjects have a fair chance of guessing the correct answer. Open response tests are known to be more sensitive, i.e. more apt to bring to light differences among test conditions. An example of an open response study is [Van Bezooijen (1989)], who presented five types of texts typically found in daily Dutch newspapers, pertaining to the weather, nature, disasters, small events, and sports, to 16 visually handicapped subjects. An example of a question testing the comprehensibility of the weather forecasts is: What will the temperature be tomorrow? The questions were sensitive enough to yield significant differences in comprehensibility among two text-to-speech versions (one automated and one manually corrected) and one human produced version of the texts. Crucially, the results also suggest that the effect of the supposedly greater effort expended in understanding synthetic speech has its limits: if the synthetic speech is bad enough, increased effort cannot compensate for the loss of quality.
The tests described ask subjects to answer questions after the texts have been presented, thus measuring the final product of text interpretation. In addition to these off-line tests, more psycholinguistically oriented on-line approaches have been developed which request instantaneous reactions to the auditory material being presented. These tests primarily aim at gaining insight into the cognitive processes underlying comprehension: to what extent is synthetic speech processed differently from human speech? A few of these psycholinguistic tests are:
All three are on-line measures, the first indexing cognitive workload, the second and third assessing speed of comprehension. On-line tests of this type, which invariably reveal differences between human and synthetic speech, have been hypothesised to be more sensitive than off-line measures [Ralston et al. (1991)]. However, the results of such psycholinguistic tests (``subjects responded significantly faster to system A (740 ms) than to system B (930 ms)'') are less readily interpretable by non-scientists than those of comprehension tests (``subjects answered 74% of the system A questions correctly versus 93% of the system B questions''). On the other hand, insight into cognitive load may ultimately prove important in dual task applications.
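By way of illustration, the following sketch shows one way in which per-trial latencies from such an on-line test might be compared between two systems, here with a two-sample t-test; the latency values are invented and the choice of test is an assumption for the example, not a prescribed analysis.

    # Illustrative analysis sketch for an on-line (reaction time) measure:
    # comparing hypothetical per-trial latencies for two systems with a
    # two-sample t-test. All figures below are invented for illustration.
    from statistics import mean
    from scipy import stats

    latencies_a = [712, 748, 735, 760, 729, 751]   # ms, system A (hypothetical)
    latencies_b = [905, 942, 921, 938, 917, 955]   # ms, system B (hypothetical)

    t, p = stats.ttest_ind(latencies_a, latencies_b, equal_var=False)
    print(f"A: {mean(latencies_a):.0f} ms, B: {mean(latencies_b):.0f} ms "
          f"(t = {t:.2f}, p = {p:.4f})")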
The black box tests described so far are functional in nature. However, instead of evaluating overall quality functionally, subjects can also indicate their subjective impression of global quality aspects of synthetic output by means of rating scales. Taking comprehensibility as an example, a functional task would be one where subjects answer a number of questions related to the content of a text passage as described above. Alternatives from a judgment point of view include:
Some methodological aspects of the second and third method are described in detail in Section 12.3.2. There it is also indicated that magnitude estimation is relatively laborious and better suited to test-external comparison, whereas categorical estimation is relatively fast and easy, and better suited to test-internal comparison.
Both the magnitude (continuous scale) and categorical estimation (20-point scale) methods have been included in SOAP in the form of the SAM Overall Quality Test (see Section 12.7.11). Three judgment scales are recommended, related to:
The intelligibility and naturalness ratings are based on pairs of (unrelated) sentences. Fixed lists of 160 sentences of varying content and length are available for Dutch, English, French, German, Italian, and Swedish. Examples for English are: I realise you're having supply problems but this is rather excessive and I need to arrive by 10.30 a.m. on Saturday. For the acceptability ratings, application specific test materials are recommended. The magnitude and categorical estimation procedures have been applied to speech output in a number of studies [e.g., Pavlovic et al. (1990), Delogu et al. (1991), Goldstein et al. (1992)]. These studies emphasise methodological aspects, such as the effects of stimulus range and the number of categories, relationships among methods, reliability, and validity.
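The sketch below illustrates, with invented ratings, one common way of summarising the two kinds of data: categorical estimates on the 20-point scale are simply averaged per system, whereas magnitude estimates, being unbounded, are first rescaled by each subject's geometric mean; neither the data nor the normalisation step is prescribed by the SAM test itself.

    # Illustrative summary sketch for the two rating methods; all ratings,
    # system labels, and the per-subject geometric-mean normalisation are
    # assumptions for the example, not the SAM-prescribed analysis.
    import math

    def geometric_mean(xs):
        return math.exp(sum(math.log(x) for x in xs) / len(xs))

    # Categorical estimation: 20-point scale, averaged directly per system.
    categorical = {                      # subject -> {system: rating}
        "s01": {"A": 14, "B": 9},
        "s02": {"A": 16, "B": 11},
    }
    for system in ("A", "B"):
        scores = [r[system] for r in categorical.values()]
        print(f"categorical, system {system}: "
              f"{sum(scores) / len(scores):.1f} (20-point scale)")

    # Magnitude estimation: unbounded positive numbers; each subject's ratings
    # are rescaled by that subject's geometric mean, to remove differences in
    # the number range individual subjects happen to use.
    magnitude = {                        # subject -> {system: rating}
        "s01": {"A": 60.0, "B": 30.0},
        "s02": {"A": 150.0, "B": 90.0},
    }
    normalised = {subj: {sys: r / geometric_mean(list(ratings.values()))
                         for sys, r in ratings.items()}
                  for subj, ratings in magnitude.items()}
    for system in ("A", "B"):
        scores = [r[system] for r in normalised.values()]
        print(f"magnitude, system {system}: {geometric_mean(scores):.2f} (relative)")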
The importance of application-specific test materials is also stressed by the Telecommunication Standardisation Sector of the International Telecommunication Union (ITU-T) (see Section 12.7.12). It has developed a test specifically aimed at evaluating the quality of telephone speech (where synthesis could be the input). It is a categorical estimation judgment test comprising ratings on (a subset of) eight scales:
The first scale is a 2-point scale, the others are 5-point scales. Strictly speaking, only the first four scales can be captured under the heading overall quality; the other four scales are directed at more specific aspects of the output and require analytic listening. The content of the speech samples presented should be in accordance with the application. Examples of application-specific test items are: Miss Robert, the running shoes Adidas Edberg Pro Club, colour: white, size: 11, reference: 501-97-52, price 319 francs, will be delivered to you in 3 weeks (mail order shopping) and The train number 9783 from Poitiers will arrive at 9:24, platform number 3, track G (railway traffic information). In addition to rating the eight scales, subjects are required to reproduce information contained in the message. A pilot study has been run by [Cartier et al. (1992)]. [Fellbaum et al. (1994)] tested 13 synthesis systems for German using the ITU-T Overall Quality Test as well as open response functional intelligibility tests. Waveform concatenative synthesis systems proved measurably better than formant synthesis systems.
[Van Bezooijen & Jongenburger (1993)] employed a similar series of judgment scales as proposed by the ITU-T in a mixed laboratory/field study which addressed the suitability of synthetic speech within the context of a digital daily newspaper for the blind (see Section 12.4.2). Their battery comprised ten 10-point scales:
Again a distinction can be made between the scales relating to overall quality (the first three scales) and the other scales, which relate to specific aspects of the speech output. A factor analysis yielded two factors, the first with high loadings of intelligibility, general quality, and precision of articulation, the second with high loadings of naturalness, pleasantness of voice, and adequacy of word stress. Intelligibility and naturalness were taken by the authors to be the two central dimensions underlying the evaluative judgments.
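The sketch below illustrates, on simulated data, the kind of analysis involved: a two-factor solution is extracted from a listeners-by-scales matrix of ratings and the loadings are inspected per scale; the scale names, data, and software used (scikit-learn's FactorAnalysis) are assumptions for illustration and do not reproduce the analysis of the original study.

    # Illustrative factor analysis sketch on simulated judgment-scale ratings;
    # nothing here comes from the cited study.
    import numpy as np
    from sklearn.decomposition import FactorAnalysis

    scales = ["intelligibility", "general quality", "articulation",
              "naturalness", "voice pleasantness", "word stress"]

    rng = np.random.default_rng(0)
    # 40 simulated listeners: two latent dimensions driving the six scales.
    latent = rng.normal(size=(40, 2))
    loading_pattern = np.array([[1.0, 0.1], [0.9, 0.2], [0.8, 0.1],
                                [0.1, 1.0], [0.2, 0.9], [0.1, 0.8]])
    ratings = latent @ loading_pattern.T + rng.normal(scale=0.3, size=(40, 6))

    # Extract two factors and print the loading of each scale on each factor.
    fa = FactorAnalysis(n_components=2).fit(ratings)
    for scale, loadings in zip(scales, fa.components_.T):
        print(f"{scale:>18}: factor 1 = {loadings[0]:+.2f}, "
              f"factor 2 = {loadings[1]:+.2f}")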