Judgment vs. functional testing

By judgment testing (also called opinion testing in telecommunication research) we mean a procedure in which a group of listeners is asked to judge the performance of a speech output system along a number of rating scales. The scales are typically defined by bipolar adjectives that allow the listeners to rate the quality of the output system on either a global or a more specific aspect of its performance. Although constructing an appropriate scaling instrument is by no means a trivial task, a scaling test can be administered with little effort and yields a good deal of potentially useful information.
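As an illustration of how such scale ratings might be summarised, the sketch below averages each scale over listeners. The scale names, the 1-to-5 range, and the function name `mean_ratings` are assumptions made for this example only; they are not prescribed by the text.

```python
from statistics import mean


def mean_ratings(ratings):
    """Average each rating scale over all listeners.

    `ratings` maps a listener id to a dict of {scale name: rating}.
    """
    per_scale = {}
    for listener_scores in ratings.values():
        for scale, value in listener_scores.items():
            per_scale.setdefault(scale, []).append(value)
    return {scale: mean(values) for scale, values in per_scale.items()}


# Invented ratings on hypothetical 1-5 scales, for illustration only.
example_ratings = {
    "listener_1": {"overall quality": 4, "naturalness": 2},
    "listener_2": {"overall quality": 3, "naturalness": 3},
    "listener_3": {"overall quality": 5, "naturalness": 1},
}

summary = mean_ratings(example_ratings)
print(summary)
```

In practice one would also report the spread of the ratings per scale, since a mean alone hides disagreement among listeners.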

At the other extreme, the speech output can be assessed in terms of how well it actually fulfils its communicative purpose. This is called functional testing. For instance, if we want to know to what extent the output speech is intelligible, we may prefer to measure its intelligibility not by asking listeners how intelligible they think the speech is, but by determining whether listeners correctly identify the sounds. Consider, as an example at a higher level of communication, the assessment of an information system using speech output. We may ask users to judge the output quality, but we may also determine the system's adequacy functionally by looking at task completion: how often and how efficiently do the users get the information they need from the system?
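The two functional measures mentioned above can be sketched as simple proportions: the share of sounds identified correctly, and the share of tasks completed. The function names and the sample data are hypothetical and serve only to make the idea concrete; an actual test would follow a specific, documented test procedure.

```python
def identification_score(presented, responses):
    """Proportion of stimuli for which the response equals the presented sound."""
    if len(presented) != len(responses):
        raise ValueError("presented and responses must have the same length")
    if not presented:
        raise ValueError("at least one stimulus is required")
    correct = sum(p == r for p, r in zip(presented, responses))
    return correct / len(presented)


def task_completion_rate(outcomes):
    """Proportion of tasks in which the user obtained the needed information.

    `outcomes` is a sequence of booleans, one per attempted task.
    """
    if not outcomes:
        raise ValueError("at least one task outcome is required")
    return sum(bool(o) for o in outcomes) / len(outcomes)


# Invented data: five consonants presented, one misidentified (k heard as t).
presented = ["p", "t", "k", "b", "d"]
responses = ["p", "t", "t", "b", "d"]
print(identification_score(presented, responses))

# Invented data: three of four information-retrieval tasks completed.
print(task_completion_rate([True, True, False, True]))
```

Efficiency, the second aspect of task completion named in the text, would need additional measurements such as time or number of turns per task, which this sketch does not cover.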

One would hope that the results of judgment and functional assessments converge. Obviously, one would like to use the results of functional assessments to gauge the validity of judgments, rather than the other way around. As far as we have been able to ascertain, there has been little research into this matter. Yet there is at least one set of intersubjective and functional data, collected for the same group of listeners and stimuli and testing two different text-to-speech systems at three different points in time, in which the scaling results were highly correlated with the corresponding functional test scores [Pavlovic et al. (1990)].
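One way to check whether judgment and functional results converge is to correlate the two sets of scores across test conditions. The sketch below computes a Pearson correlation coefficient; the choice of Pearson's r is an assumption of this example, and the numbers are invented for illustration. They are not the data reported by Pavlovic et al. (1990).

```python
import math


def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length sequences."""
    n = len(xs)
    if n != len(ys):
        raise ValueError("sequences must have the same length")
    if n < 2:
        raise ValueError("at least two paired observations are required")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


# Invented scores for six conditions (e.g. two systems at three points in time).
judgment_scores = [2.1, 2.8, 3.5, 3.0, 3.9, 4.4]          # mean scale ratings
functional_scores = [0.55, 0.63, 0.74, 0.70, 0.82, 0.90]  # proportion correct

r = pearson_r(judgment_scores, functional_scores)
print(f"r = {r:.3f}")
```

With only a handful of conditions, a high coefficient should be read with caution; a rank correlation may be preferable when the rating scales cannot be assumed to be interval scales.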


EAGLES SWLG SoftEdition, May 1997.