In the majority of test procedures, human subjects are called upon to determine the quality of a speech output system. This should come as no surprise, since the end user of a speech output system is a human listener. However, the use of human subjects has certain inherent drawbacks. Firstly, humans, whether acting as single individuals or collectively as a group, are always somewhat noisy, i.e. inconsistent, in their judgments or task performance; the results of tests involving human responses are never perfectly reliable in the statistical, psychometric sense of the word. Another drawback of tests involving human subjects is that they are time-consuming and therefore expensive to run.
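To make the notion of reliability concrete, listener agreement in such a test can be quantified with a standard psychometric coefficient such as Cronbach's alpha. The sketch below, in Python, is one minimal way to compute it; the listeners-by-stimuli score layout and the example panel are hypothetical, and nothing in this chapter prescribes this particular coefficient or implementation.

    import numpy as np

    def cronbach_alpha(scores):
        # scores: rows are stimuli, columns are listeners (hypothetical layout).
        scores = np.asarray(scores, dtype=float)
        k = scores.shape[1]                          # number of listeners
        listener_vars = scores.var(axis=0, ddof=1)   # each listener's variance over stimuli
        total_var = scores.sum(axis=1).var(ddof=1)   # variance of per-stimulus summed scores
        return (k / (k - 1)) * (1.0 - listener_vars.sum() / total_var)

    # Example: 4 stimuli rated by 3 listeners on a 5-point scale.
    panel = [[4, 4, 5], [2, 3, 2], [5, 4, 4], [1, 2, 1]]
    print(cronbach_alpha(panel))   # values near 1 indicate consistent judgments

The further the coefficient falls below 1, the noisier the panel, and the more listeners (or stimuli) are needed to reach a given level of precision.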
Recent developments, which are still very much in the laboratory stage, seek to replace human evaluation by automated assessment of speech output systems or modules thereof. Attempts can be (and in fact have been) made to automatically measure the discrepancy, in acoustical terms, between a system's output and the speech of the human speaker that serves as the model the system is intended to imitate. This is the type of evaluation technique that one would ultimately want: the use of human listeners is avoided, so that perfectly reproducible, noise-free results can be obtained in as little time as it takes a computer to execute the program. At the same time, however, it will be clear that implementing such techniques as a substitute for human listeners presupposes that we know exactly how human listeners evaluate differences between two realisations of the same linguistic message. Unfortunately, this type of knowledge is largely lacking at the moment; filling the gap should be a research priority. Nevertheless, preliminary automatic comparisons of synthetic and human speech output have been undertaken in the fields of melody and pause distribution [Barry et al. (1989)], long-term average spectral characteristics [Pavlovic et al. (1991)] and the dynamics of speech in the frequency and time domains [Houtgast & Verhave (1991), Houtgast & Verhave (1992)]. Generally, the results obtained through these techniques show sufficient promise to warrant extension of their scope. We will come back to the possibilities of automated testing in Section 12.6.
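By way of illustration of such an acoustic comparison, the sketch below computes a crude long-term average spectrum (LTAS) distance between a synthetic utterance and its natural model, again in Python. It is an illustration only, not the measure of Pavlovic et al. (1991): the Welch-spectrum settings, the RMS distance and the file names are all hypothetical choices, and equal sampling rates across the two recordings are simply assumed.

    import numpy as np
    from scipy.io import wavfile
    from scipy.signal import welch

    def ltas_db(path, nperseg=1024):
        # Long-term average spectrum of a WAV file, in dB.
        rate, samples = wavfile.read(path)
        samples = samples.astype(np.float64)
        if samples.ndim > 1:                  # mix a stereo file down to mono
            samples = samples.mean(axis=1)
        freqs, psd = welch(samples, fs=rate, nperseg=nperseg)
        return freqs, 10.0 * np.log10(psd + 1e-12)   # small offset avoids log(0)

    def ltas_distance(synthetic_path, reference_path):
        # RMS difference (dB) between the two LTAS curves; assumes equal sample rates.
        _, ltas_syn = ltas_db(synthetic_path)
        _, ltas_ref = ltas_db(reference_path)
        return float(np.sqrt(np.mean((ltas_syn - ltas_ref) ** 2)))

    # Hypothetical file names; a lower value indicates a closer spectral match.
    # print(ltas_distance("synthetic.wav", "reference.wav"))

A single number of this kind clearly cannot capture everything a listener hears, which is precisely why perceptual validation of such measures remains necessary.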