In the previous section, the black box approach to speech output evaluation was operationalised within a laboratory context. From an experimental point of view, the main advantage of a laboratory study is control over possibly interfering factors. However, ultimately it is the functioning of a speech output system in real life, with all its variability, that counts. If overall quality is extended to include all aspects of the synthesis in the context of an application, testing may be necessary in the field. Because of the variety of applications, it is difficult to summarise the aspects that field tests have in common. To illustrate this diversity, some examples are given below.
A combined laboratory/field functional/judgment test, with equal attention to the speech output itself and the context within which it is used, was carried out by [Van Bezooijen & Jongenburger (1993)]. They used the following suite of four tests to evaluate the functioning of an electronic newspaper for the visually handicapped:
Each of 24 visually handicapped subjects was visited at home, at three points in time. Since the subjects lived scattered all over the Netherlands, administering the suite of tests was very time-consuming.
Comparable studies have been conducted to evaluate a digital daily newspaper in Sweden [Hjelmquist et al. (1987)]. However, the experimental set-up for assessing the quality of various aspects of the Swedish speech output was less strict: most information was obtained through interviews. On the other hand, much emphasis was placed upon the reading habits of the users: all keystrokes were registered over long periods of time, so that the frequency of use of all reading commands (e.g. next sentence, previous sentence, talk faster, talk in letters) could be determined.
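The tallying step behind such a keystroke analysis is straightforward and can be sketched as follows; the log format and command names below are hypothetical, chosen only to illustrate how usage frequencies are derived from registered keystrokes:

```python
from collections import Counter

# Hypothetical keystroke log: (timestamp in seconds, reading command).
# In a real field study these records would accumulate over weeks of use.
log = [
    (0.0, "next_sentence"),
    (4.2, "next_sentence"),
    (9.8, "previous_sentence"),
    (12.1, "talk_faster"),
    (15.0, "next_sentence"),
    (21.3, "talk_in_letters"),
]

# Tally how often each reading command was issued.
counts = Counter(cmd for _, cmd in log)

# Express usage as a proportion of all commands issued.
total = sum(counts.values())
frequencies = {cmd: n / total for cmd, n in counts.items()}

for cmd, freq in sorted(frequencies.items(), key=lambda kv: -kv[1]):
    print(f"{cmd:20s} {freq:.2f}")
```

Aggregating such proportions per user and per period is what allows reading habits to be characterised without interviews.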
A semi-field study combining function and judgment testing within the context of telephone information services was done by [Roelofs (1987)]. In this test resynthesised human speech was used, but the set-up and results can be generalised to synthetic speech output. Two applications were considered, namely directory assistance (the subject puts his request to an operator and the number is then spoken twice by the computer, thus freeing the operator for the next subscriber) and a service for train departure times (a single pre-stored message giving the departure times of a number of trains with different destinations). In the former application a human operator served as a reference; in the latter, high-quality PCM speech was presented. Subjects were sent the instructions in advance and dialled the two services from their homes. The availability of interrupt facilities and speaking rate were examined. Both actual performance (success in writing down the requested data) and subjective reactions (14 five-point scales such as bad-good, impersonal-personal, inefficient-efficient) were registered, and two questions were added, namely: Do you find this way of information presentation acceptable? and Do you think this service could replace the current service? Due to several factors, the results are of limited value. However, the method is a nice example of how different approaches to testing can be combined.
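The combination of a functional measure (task success) and judgment measures (ratings on five-point scales) can be summarised per application along the following lines; the data and scale names below are invented for illustration and merely mirror the general shape of such a study:

```python
from statistics import mean

# Hypothetical per-subject results for one application: whether the
# requested data were written down correctly (functional measure) and
# ratings on a few 5-point judgment scales (1 = negative pole, 5 = positive).
results = [
    {"success": True,  "ratings": {"bad-good": 4, "impersonal-personal": 2, "inefficient-efficient": 4}},
    {"success": True,  "ratings": {"bad-good": 3, "impersonal-personal": 3, "inefficient-efficient": 5}},
    {"success": False, "ratings": {"bad-good": 2, "impersonal-personal": 2, "inefficient-efficient": 3}},
    {"success": True,  "ratings": {"bad-good": 5, "impersonal-personal": 3, "inefficient-efficient": 4}},
]

# Functional measure: proportion of subjects who wrote down the data correctly.
success_rate = mean(1 if r["success"] else 0 for r in results)

# Judgment measure: mean rating per scale across subjects.
scales = results[0]["ratings"].keys()
mean_ratings = {s: mean(r["ratings"][s] for r in results) for s in scales}

print(f"success rate: {success_rate:.2f}")
for scale, m in mean_ratings.items():
    print(f"{scale:22s} {m:.2f}")
```

Reporting the two kinds of measure side by side, per application and per condition (e.g. with and without interrupt facilities), is what makes the comparison with the human-operator reference possible.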
With a view to exploring the possibilities of synthetic speech for a name and address telephone service, [Delogu et al. (1993b)] tested six Italian TTS systems by presenting lexically unpredictable VCV and CV sequences in an open response format. Intelligibility scores dropped from 31% to 21% when the same materials were presented over a telephone line rather than through good-quality headphones. Curiously enough, the best TTS systems suffered most from the telephone bandwidth limitation.
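Scoring such an open-response test reduces to comparing transcribed responses against the target items; a minimal sketch, with invented VCV items, might look like this:

```python
# Hypothetical open-response scoring for VCV items: a response counts as
# correct only when it matches the target item exactly.
targets = ["aba", "ada", "aga", "afa", "asa"]
responses = ["aba", "ada", "ava", "afa", "aza"]  # as transcribed by a listener

correct = sum(t == r for t, r in zip(targets, responses))
intelligibility = 100 * correct / len(targets)

print(f"intelligibility: {intelligibility:.0f}%")
```

In practice responses would first be normalised (spelling conventions, diacritics) before comparison, and scores would be averaged over listeners and systems for each transmission condition.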
Finally, an important area of speech output evaluation concerns applications where people are required to process auditory information while simultaneously performing some other task involving hands and eyes, for instance writing down a telephone number or landing an aircraft. The demands imposed by dual tasks like these have been simulated, for instance, by having subjects answer simple questions about the content of short synthesised messages while at the same time tracking a randomly moving square on a video monitor with a mouse [Boogaart & Silverman (1992)]. This type of laboratory study could and should be extended to more real-life situations. Other important areas are field tests in which the functioning of speech output is tested under various noise conditions, and under combinations of noise and secondary tasks.
Since field tests will often have to meet specific requirements, it is not realistic to think in terms of standard tests and standard recommendations. Each case will have to be examined in its own right. In order to get an overview of the complex test situations that may arise, [Jekosch & Pols (1994)] recommend a "feature" analysis to define a test set-up, where features are all aspects relevant to the choice of the test. Their analysis comprises three steps, naturally leading to a fourth:
Because of the specific nature of some applications, there will often be no ready-made test available, so that it is perhaps better to speak of (suggestions for) test approaches than of tests. Moreover, a single test will generally not suffice; a suite of tests will be needed instead. In this suite both functional and judgment tests can be included. Interviews can be part of the evaluation as well. Moreover, it is possible to administer laboratory-type experiments in a field situation. This can be done, for example, by preparing stimulus tapes beforehand and playing them to subjects in the environment where the synthesis system will be used.