In the previous section, the black box approach to speech output evaluation was operationalised within a laboratory context. From an experimental point of view, the main advantage of a laboratory study is control over possibly interfering factors. However, ultimately it is the functioning of a speech output system in real life, with all its variability, that counts. If overall quality is extended to include all aspects of the synthesis in the context of an application, testing may be necessary in the field. Because of the variety of applications, it is difficult to summarise the aspects that field tests have in common. To illustrate this diversity, some examples are given below.
A combined laboratory/field functional/judgment test, with equal attention to the speech output itself and the context within which it is used, was carried out by [Van Bezooijen & Jongenburger (1993)]. They used the following suite of four tests to evaluate the functioning of an electronic newspaper for the visually handicapped:
Each of 24 visually handicapped subjects was visited at home, at three points in time. Since the subjects lived scattered all over the Netherlands, administering the suite of tests was very time-consuming.
Comparable studies have been conducted to evaluate a digital daily newspaper in Sweden [Hjelmquist et al. (1987)]. However, the experimental set-up for assessing the quality of various aspects of the Swedish speech output was less strict: most information was obtained through interviews. On the other hand, much emphasis was placed upon the reading habits of the users: all keystrokes were registered over long periods of time, so that the frequency of use of all reading commands (e.g. next sentence, previous sentence, talk faster, talk in letters) could be determined.
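The tallying step behind such a keystroke analysis is straightforward and can be sketched as follows; the log format and command names below are hypothetical, chosen only to illustrate how usage frequencies are derived from registered keystrokes:

```python
from collections import Counter

# Hypothetical keystroke log: (timestamp in seconds, reading command).
# In a real field study these records would accumulate over weeks of use.
log = [
    (0.0, "next_sentence"),
    (4.2, "next_sentence"),
    (9.8, "previous_sentence"),
    (12.1, "talk_faster"),
    (15.0, "next_sentence"),
    (21.3, "talk_in_letters"),
]

# Tally how often each reading command was issued.
counts = Counter(cmd for _, cmd in log)

# Express usage as a proportion of all commands issued.
total = sum(counts.values())
frequencies = {cmd: n / total for cmd, n in counts.items()}

for cmd, freq in sorted(frequencies.items(), key=lambda kv: -kv[1]):
    print(f"{cmd:20s} {freq:.2f}")
```

Aggregating such proportions per user and per period is what allows reading habits to be characterised without interviews.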
A semi-field study combining function and judgment testing within the context of telephone information services was done by [Roelofs (1987)]. In this test resynthesised human speech was used, but the set-up and results can be generalised to synthetic speech output. Two applications were considered, namely directory assistance (the subject puts his request to an operator and the number is then spoken twice by the computer, thus freeing the operator for the next subscriber) and a service for train departure times (a single pre-stored message giving the departure times of a number of trains with different destinations). In the former application a human operator served as a reference; in the latter, high-quality PCM speech was presented. Subjects were sent the instructions in advance and dialled the two services from their homes. The availability of interrupt facilities and speaking rate were examined. Both actual performance (success in writing down the requested data) and subjective reactions (14 five-point scales such as bad-good, impersonal-personal, inefficient-efficient) were registered, and two questions were added, namely: Do you find this way of information presentation acceptable? and Do you think this service could replace the current service? Due to several factors, the results are of limited value. However, the method is a nice example of how different approaches to testing can be combined.
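The combination of a functional measure (task success) and judgment measures (ratings on five-point scales) can be summarised per application along the following lines; the data and scale names below are invented for illustration and merely mirror the general shape of such a study:

```python
from statistics import mean

# Hypothetical per-subject results for one application: whether the
# requested data were written down correctly (functional measure) and
# ratings on a few 5-point judgment scales (1 = negative pole, 5 = positive).
results = [
    {"success": True,  "ratings": {"bad-good": 4, "impersonal-personal": 2, "inefficient-efficient": 4}},
    {"success": True,  "ratings": {"bad-good": 3, "impersonal-personal": 3, "inefficient-efficient": 5}},
    {"success": False, "ratings": {"bad-good": 2, "impersonal-personal": 2, "inefficient-efficient": 3}},
    {"success": True,  "ratings": {"bad-good": 5, "impersonal-personal": 3, "inefficient-efficient": 4}},
]

# Functional measure: proportion of subjects who wrote down the data correctly.
success_rate = mean(1 if r["success"] else 0 for r in results)

# Judgment measure: mean rating per scale across subjects.
scales = results[0]["ratings"].keys()
mean_ratings = {s: mean(r["ratings"][s] for r in results) for s in scales}

print(f"success rate: {success_rate:.2f}")
for scale, m in mean_ratings.items():
    print(f"{scale:22s} {m:.2f}")
```

Reporting the two kinds of measure side by side, per application and per condition (e.g. with and without interrupt facilities), is what makes the comparison with the human-operator reference possible.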
With a view to exploring the possibilities of synthetic speech for a name and address telephone service, [Delogu et al. (1993b)] tested six Italian TTS systems by presenting lexically unpredictable VCV and CV sequences in an open response format. Intelligibility scores dropped from 31% to 21% when the same materials were presented over a telephone line rather than through good-quality headphones. Curiously enough, the best TTS systems suffered most from the telephone bandwidth limitation.
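Scoring such an open-response test reduces to comparing transcribed responses against the target items; a minimal sketch, with invented VCV items, might look like this:

```python
# Hypothetical open-response scoring for VCV items: a response counts as
# correct only when it matches the target item exactly.
targets = ["aba", "ada", "aga", "afa", "asa"]
responses = ["aba", "ada", "ava", "afa", "aza"]  # as transcribed by a listener

correct = sum(t == r for t, r in zip(targets, responses))
intelligibility = 100 * correct / len(targets)

print(f"intelligibility: {intelligibility:.0f}%")
```

In practice responses would first be normalised (spelling conventions, diacritics) before comparison, and scores would be averaged over listeners and systems for each transmission condition.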
Finally, an important area of speech output evaluation concerns applications where people are required to process auditory information while simultaneously performing some other task involving hands and eyes, for instance writing down a telephone number or landing an aircraft. The demands imposed by dual tasks like these have been simulated, for instance, by having subjects answer simple questions about the content of short synthesised messages while at the same time tracking a randomly moving square on a video monitor with a mouse [Boogaart & Silverman (1992)]. This type of laboratory study could and should be extended to more real-life situations. Other important areas are field tests in which the functioning of speech output is tested under various noise conditions, and under combinations of noise and secondary tasks.
Since field tests will often have to meet specific requirements, it is not realistic to think in terms of standard tests and standard recommendations. Each case will have to be examined in its own right. In order to get an overview of the complex test situations that may arise, [Jekosch & Pols (1994)] recommend a "feature" analysis to define a test set-up, where features are all aspects relevant to the choice of the test. Their analysis comprises three steps, naturally leading to a fourth:
Because of the specific nature of some applications, there will often be no ready-made test available, so that it is perhaps better to speak of (suggestions for) test approaches than of tests. Moreover, a single test will generally not suffice; a suite of tests will be needed instead. In this suite both functional and judgment tests can be included. Interviews can be part of the evaluation as well. Moreover, it is possible to administer laboratory-type experiments in a field situation. This can be done, for example, by preparing stimulus tapes beforehand and playing them to subjects in the environment where the synthesis system will be used.