In the majority of test procedures, human subjects are called upon to determine the quality of a speech output system. This should come as no surprise, since the end user of a speech output system is a human listener. However, certain drawbacks are inherent in the use of human subjects. Firstly, humans, whether acting as single individuals or collectively as a group, are always somewhat noisy, i.e. inconsistent, in their judgments or task performance; the results of tests involving human responses are never perfectly reliable in the statistical, psychometric sense of the word. Secondly, tests involving human subjects are time-consuming and therefore expensive to run.
Recent developments, which are still very much in the laboratory stage, seek to replace human evaluation by automated assessment of speech output systems or modules thereof. Attempts can be (and in fact have been) made to measure automatically, in acoustical terms, the discrepancy between a system's output and the speech of the human speaker that serves as the model the system is intended to imitate. This is the type of evaluation technique that one would ultimately want to arrive at: the use of human listeners is avoided, so that perfectly reproducible, noise-free results can be obtained in as little time as it takes a computer to execute the program. At the same time, however, it will be clear that adopting such techniques as a substitute for human listeners presupposes that we know exactly how human listeners evaluate differences between two realisations of the same linguistic message. Unfortunately, this type of knowledge is largely lacking at the moment; filling this gap would be a research priority. Nevertheless, preliminary automatic comparisons of synthetic and human speech output have been undertaken in the fields of melody and pause distribution [Barry et al. (1989)], long-term average spectral characteristics [Pavlovic et al. (1991)] and dynamics of speech in the frequency and time domains [Houtgast & Verhave (1991), Houtgast & Verhave (1992)]. Generally, the results obtained with these techniques show sufficient promise to warrant an extension of their scope; a sketch of one such comparison is given below. We will come back to the possibilities of automated testing in Section 12.6.
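By way of illustration, the comparison of long-term average spectral characteristics lends itself to a very compact automated measure. The following Python sketch is merely indicative of the general approach and is not the actual procedure of Pavlovic et al. (1991): it estimates the long-term average spectrum (LTAS) of a natural and a synthetic rendering of the same text and summarises their discrepancy as a single root-mean-square distance in dB. The function names and the choice of RMS as the distance measure are illustrative assumptions.

    import numpy as np
    from scipy.signal import welch

    def ltas(signal, fs, nperseg=1024):
        """Long-term average spectrum in dB, estimated with Welch's method.

        `signal` is a 1-D array of audio samples, `fs` the sampling rate in Hz.
        """
        freqs, psd = welch(signal, fs=fs, nperseg=nperseg)
        return freqs, 10.0 * np.log10(psd + 1e-12)  # small offset avoids log(0)

    def ltas_distance(natural, synthetic, fs):
        """RMS distance (dB) between the LTAS of a natural and a synthetic
        utterance: a single figure of merit; the smaller, the closer the match.
        """
        _, spec_nat = ltas(natural, fs)
        _, spec_syn = ltas(synthetic, fs)
        return float(np.sqrt(np.mean((spec_nat - spec_syn) ** 2)))

    # Hypothetical usage, assuming both recordings are sampled at 16 kHz:
    # d = ltas_distance(natural_samples, synthetic_samples, fs=16000)

Note that such a global spectral distance deliberately discards all temporal detail; comparisons of melody and pause distribution in the spirit of Barry et al. (1989) would instead operate on fundamental frequency contours and silence durations extracted from the two signals.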