

Long-term strategy: Towards predictive tests


From human to automated testing


The ultimate criterion for deciding on the quality of speech output resides with the human listener. Speech output assessment is therefore basically a matter of human perception research. It is commonly acknowledged that the human listener is a noisy measurement instrument, which makes output assessment a slow and therefore expensive undertaking. There are generally felt to be two ways out of this problem. One is to look for assessment procedures that are optimally efficient, i.e. that use perception tasks least susceptible to observer noise, and that concentrate on a small set of representative materials from which valid generalisations to all other situations can be made. This line of development has been followed for some time, especially by the SAM consortium, and could fruitfully be extended into the next five years.

The second way out is to replace the human observer by a computer-simulated observer, i.e. to use automated assessment methods. Using automated methods presupposes that we know exactly how human listeners react to speech output. The development of objective methods is therefore necessarily subsequent to the development of human test methods. In those areas of auditory perception where sufficient, consolidated knowledge has been assembled, attempts at computer simulation can be launched even today, and, in fact, pilot studies have recently been undertaken that show the feasibility of objective testing in selected areas (see Section 12.2.4). The field will have to reach agreement on which further aspects of human perception, relevant to speech output assessment, have evolved to the point that computer simulation of the human listener can realistically be undertaken. Once such areas have been identified, the next step will be to implement them.

Candidates that present themselves for automated testing will be:

We advocate a two-pronged approach here. The field should concentrate on developing optimally efficient tests involving human listeners, and at the same time begin to work on the development of perceptual distance estimation procedures that can be used later in automated assessment.
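As an illustration of what a perceptual distance estimation procedure might build on, the sketch below computes a frame-wise log-spectral distance between a reference and a test signal. A genuine automated assessment method would weight the spectrum perceptually (e.g. on a Bark or mel scale); this is a minimal stand-in, and all signal parameters and test data are invented for the example.

```python
import numpy as np

def log_spectral_distance(ref: np.ndarray, test: np.ndarray,
                          frame_len: int = 256, hop: int = 128) -> float:
    """Mean RMS log-spectral distance (dB) between two signals.

    A crude stand-in for a perceptual distance estimator: frames the
    signals, compares magnitude spectra, and averages the per-frame
    distances. No perceptual weighting is applied.
    """
    n = min(len(ref), len(test))
    window = np.hanning(frame_len)
    dists = []
    for start in range(0, n - frame_len + 1, hop):
        r = np.abs(np.fft.rfft(ref[start:start + frame_len] * window)) + 1e-10
        t = np.abs(np.fft.rfft(test[start:start + frame_len] * window)) + 1e-10
        diff = 20.0 * np.log10(r / t)          # per-bin difference in dB
        dists.append(np.sqrt(np.mean(diff ** 2)))
    return float(np.mean(dists))

# Identical signals are at distance zero; a degraded copy is not.
rng = np.random.default_rng(0)
speech = rng.standard_normal(4000)
degraded = speech + 0.3 * rng.standard_normal(4000)
```

Such a measure only becomes an assessment method once its values are shown to correlate with human quality judgments; that validation step is exactly what the human test methods above must supply.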


Predicting functional behaviour from judgment testing


There is a paradox involved in the choice between judgment tasks and functional tests. On the one hand, it could well be argued that a speech output system is adequate if a representative user group judges the system to be adequate for its purpose. Why should the field go to more trouble to improve the system's quality if the users profess to be satisfied? On the other hand, we can predict with near certainty that users will not be able to estimate precisely the level of adequacy needed for the output system to function smoothly in a concrete application. The relationship between judgments and functional test scores has been studied in the context of segmental quality, but so far not in the field of prosodic quality testing. It would seem a point of immediate concern, therefore, to initiate research into the interrelationship between judgments and functional test behaviour, with emphasis on prosodic quality. To what extent do orderings among competing speech output systems, as derived from judgment tests, correspond to orderings derived from functional tests? If we were able to predict functional test behaviour from judgment test scores, the latter, as a cheaper alternative to functional testing, could be used in all initial stages of speech output assessment. The use of functional testing would then typically be restricted to diagnostic testing.
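The question about orderings can be made concrete with a rank correlation: if judgment tests and functional tests order competing systems identically, the Spearman coefficient over their scores is 1. The sketch below uses invented scores for five hypothetical systems (pure numpy, no tie handling):

```python
import numpy as np

# Hypothetical scores for five competing output systems, higher = better;
# the numbers are invented purely for illustration.
judgment = np.array([3.1, 4.2, 2.5, 3.8, 4.6])    # mean opinion ratings
functional = np.array([61., 78., 55., 70., 84.])  # e.g. task-completion %

def rank(x: np.ndarray) -> np.ndarray:
    """Ranks starting at 1 (ties not handled in this sketch)."""
    order = np.argsort(x)
    ranks = np.empty_like(order)
    ranks[order] = np.arange(1, len(x) + 1)
    return ranks

def spearman(a: np.ndarray, b: np.ndarray) -> float:
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return float(np.corrcoef(rank(a), rank(b))[0, 1])

rho = spearman(judgment, functional)
```

A coefficient near 1 over many system pairs would justify substituting the cheaper judgment test for functional testing in initial assessment stages, as argued above.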


Predicting global from analytic testing


Generally, one would expect the global quality of a speech output system to be a function of the quality of the various system components. One would like to be able to predict and quantify the overall ratings and global performance measures from the scores on the components through some form of regression analysis. Obviously, if system designers have only limited resources available, they should direct their efforts toward improving the quality of those aspects that contribute most (in terms of regression coefficients) to the overall assessment of their systems. We suggest that research be undertaken to address this type of question.
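Such a regression analysis is straightforward to sketch: given per-stimulus component scores and overall ratings, ordinary least squares yields the coefficients that indicate where improvement effort pays off most. The component names and all data below are invented for illustration:

```python
import numpy as np

# Invented data: component scores (segmental quality, prosodic quality,
# voice pleasantness, each on a 1-5 scale) plus an overall rating that is,
# by construction, mostly driven by the first component.
rng = np.random.default_rng(1)
n = 40
components = rng.uniform(1.0, 5.0, size=(n, 3))
true_weights = np.array([0.5, 0.3, 0.1])
overall = components @ true_weights + 0.8 + 0.05 * rng.standard_normal(n)

# Ordinary least squares with an intercept column.
X = np.column_stack([components, np.ones(n)])
coef, *_ = np.linalg.lstsq(X, overall, rcond=None)
weights, intercept = coef[:3], coef[3]

# The component with the largest coefficient is where limited
# development resources would be directed first.
most_important = int(np.argmax(weights))
```

In practice the interesting questions are whether such a linear model fits at all, and whether the coefficients are stable across listener groups and applications; that is the research suggested above.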


Predicting field performance from laboratory testing


There is general agreement that laboratory tests such as are available today do not allow a useful prediction of how well a speech output system will perform in a concrete application. A short-term recommendation is, therefore, to develop a field-test generator, along the same lines as the successful test generators for laboratory intelligibility tests (such as the CLID and SUS tests developed by the SAM consortium). The field-test generator should enable the fast compilation of test materials and adequate simulation of a range of application conditions. For this purpose, an adequate cross-section of applications for speech output has to be inventoried and parametrised along such dimensions as (1) type of users (non-cooperative, children, elderly people, non-native language users), (2) specific aspects of the situation in terms of, for instance, noise, reverberation, telephone channel, and (3) secondary tasks. An integrated software package, PMT (Parametric Test Manager), has recently been made available (by the Electrical Engineering and Acoustics Department of the University of Bochum, Germany) that contains some of the features proposed here:

  1. a data structure that allows mixing stimuli with external audio and video signals, with audiovisual feature links,
  2. signal editing capabilities,
  3. a query language for post hoc data analysis, and an interface with statistical packages,
  4. cross-comparisons with alternative synthesisers and languages,
  5. rhyme tests with both closed and open formats.
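A field-test generator of the kind proposed could represent the parametrised application conditions as a simple cross-product of the dimensions listed above. The sketch below is hypothetical: the value sets are illustrative placeholders, not an actual inventory of applications.

```python
from dataclasses import dataclass
from itertools import product

# Illustrative value sets for the three dimensions named in the text.
USER_TYPES = ["cooperative", "non-cooperative", "child",
              "elderly", "non-native"]
CHANNELS = ["quiet", "babble-noise", "reverberant", "telephone"]
SECONDARY_TASKS = ["none", "visual-tracking", "manual"]

@dataclass(frozen=True)
class TestCondition:
    """One cell of the field-test plan: who listens, over what
    channel, while doing what else."""
    user_type: str
    channel: str
    secondary_task: str

def generate_conditions(users, channels, tasks):
    """Cross the three dimensions into a full factorial test plan."""
    return [TestCondition(u, c, t)
            for u, c, t in product(users, channels, tasks)]

plan = generate_conditions(USER_TYPES, CHANNELS, SECONDARY_TASKS)
```

A real generator would prune this full factorial to the cells that actually occur in the inventoried applications, and attach test materials and simulation parameters (noise files, channel filters) to each condition.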

On a longer-term basis we advocate a more fundamental solution to the problem of field testing. Ideally, of course, one should not have to go into the field every time a new application presents itself. Rather, one would like to be able to predict accurately, on the basis of available results of standard laboratory tests (e.g. intelligibility scores and prosodic adequacy profiles), how a speech output system would perform in a concrete field situation. For this to be the case, it will be necessary to have a valid analysis of the field tasks that have to be accomplished. A task profile will have to be drawn up that analyses the demands that carrying out the task (including and excluding listening to speech output) makes on the user, such as attentional load of the primary task, environmental noise, negative influence of fatigue and boredom, physical strain, etc. Accomplishing this type of prediction calls for cooperation between experts in speech quality assessment and experts in human factors studies. We recommend exploratory studies along the lines suggested above, based on quantitative task analyses of a few selected applications.
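One way such a task profile might feed a prediction is a simple weighted-penalty model: scale the laboratory score down according to the demands the task places on the user. The profile dimensions follow the text, but the weights and example values below are entirely invented; a validated model would have to come from the recommended exploratory studies.

```python
from dataclasses import dataclass

@dataclass
class TaskProfile:
    """Demand levels (0 = none, 1 = maximal) on the dimensions
    named in the text."""
    attentional_load: float
    environmental_noise: float
    fatigue_boredom: float
    physical_strain: float

# Hypothetical degradation weights: how strongly each demand is assumed
# to depress field performance relative to the laboratory score.
DEGRADATION_WEIGHTS = {
    "attentional_load": 0.20,
    "environmental_noise": 0.35,
    "fatigue_boredom": 0.10,
    "physical_strain": 0.05,
}

def predict_field_score(lab_score: float, profile: TaskProfile) -> float:
    """Linearly scale a laboratory score (0-100) down by the weighted
    task demands; purely an illustrative model form."""
    penalty = sum(DEGRADATION_WEIGHTS[k] * getattr(profile, k)
                  for k in DEGRADATION_WEIGHTS)
    return lab_score * (1.0 - penalty)

# Example: a demanding in-car navigation scenario (invented values).
car_navigation = TaskProfile(attentional_load=0.9, environmental_noise=0.7,
                             fatigue_boredom=0.3, physical_strain=0.1)
predicted = predict_field_score(92.0, car_navigation)
```

Whether field performance is in fact a linear function of such demands, and what the weights are, is exactly what the quantitative task analyses of selected applications would have to establish.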


EAGLES SWLG SoftEdition, May 1997.