The last few years have brought a new awareness of the importance of evaluating speech and language technology according to accepted standards. This allows progress to be monitored within a single project; it also facilitates meaningful comparison across projects. In Europe, much of the enabling work on speech has been carried out as part of the SAM project [Fourcin et al. (1989)]. This has concentrated on developing standards for the storage, labelling, and basic speech processing of acoustic data. Similar work has been carried out as part of the U.S. DARPA program in Spoken Language. Text-based systems have also been considered under the DARPA Written Language program. As well as looking at low level issues, these programmes have gone on to consider higher levels of speech and language processing. For example, in text processing, progress has been achieved in evaluating the coverage of grammars [Black et al. (1991)] and on the evaluation of text understanding systems [Sundheim (1991)]. Spoken language understanding has been evaluated by monitoring the ability of systems to generate appropriate database queries on the basis of spoken questions [Bates et al. (1990), Pallet et al. (1990), Price (1990), for example].
At the root of almost all approaches to evaluation of speech and language technology lies the notion of a reference response or reference answer. The performance of the system is judged using the standard of the reference answer. Thus, a speech recogniser's performance is evaluated against what was actually said. A text understander is judged according to its ability to fill slots in a reference frame constructed on the basis of experts' judgements. A speech understander is judged according to its ability to construct for an utterance the same database query as a panel of experts. In each case, it is possible to prepare - in advance of any trials - a database of paired tasks and reference answers. This greatly simplifies the task of objectively comparing different systems.
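The reference answer scheme can be illustrated with a minimal sketch. The data, the query notation, and the names `REFERENCE_SET`, `score_against_references`, `system_a`, and `system_b` are assumptions made purely for illustration; they are not taken from any of the evaluations cited above, and exact string matching stands in for whatever comparison a real evaluation would use.

```python
from typing import Callable, List, Tuple

# Hypothetical test set: each task is paired with a reference answer
# that was prepared in advance of any trial.
REFERENCE_SET: List[Tuple[str, str]] = [
    ("flights from boston to denver", "origin=BOS;dest=DEN"),
    ("flights from denver to boston", "origin=DEN;dest=BOS"),
]


def score_against_references(system: Callable[[str], str],
                             reference_set: List[Tuple[str, str]]) -> float:
    """Return the fraction of tasks whose response equals the reference answer."""
    if not reference_set:
        raise ValueError("reference set is empty")
    correct = sum(1 for task, reference in reference_set
                  if system(task) == reference)
    return correct / len(reference_set)


def system_a(task: str) -> str:
    """Toy system that answers every task correctly via a lookup table."""
    return dict(REFERENCE_SET)[task]


def system_b(task: str) -> str:
    """Toy system that returns the same query whatever it is asked."""
    return "origin=BOS;dest=DEN"


if __name__ == "__main__":
    # Both systems are scored on the same fixed material.
    print(score_against_references(system_a, REFERENCE_SET))
    print(score_against_references(system_b, REFERENCE_SET))
```

Because the reference set is fixed before any system is run, the two scores are directly comparable.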
The ``reference answer'' approach does not extend straightforwardly to the evaluation of dialogue systems, whether they use spoken or written language. First, dialogues are complex structures: they may accomplish multiple tasks for which multiple metrics are required. Note, however, that a certain amount of misunderstanding is a normal feature of successful human dialogue. The success of parts of a dialogue is subordinate to the success of the dialogue as a whole. Second, dialogues are dynamic structures: the overall structure of the dialogue emerges out of the interaction of system and user, where each utterance is contingent upon those which precede it. This makes it very difficult indeed to construct meaningful reference material.
These problems have led to dialogue systems being evaluated in a relatively simplistic fashion. For example, the final evaluation of the VODIS voice operated database inquiry system looked at the percentage of completed tasks and the mean time for completion, the reasons why dialogues were abandoned, the number of words in subject utterances, the word recognition rates, and an analysis of those instances in the trial when the system recognised nothing [Cookson (1988)]. In addition, a questionnaire was used to elicit subjects' perceptions about the usability of the system. While all of these results are interesting descriptions of aspects of the system, taken together they do not present a clear picture of the system's capabilities. This work is presented as an exemplar of a class of similar evaluations. There is no intention to single it out for special criticism.
One proposal for obtaining a measure of the effectiveness of a system qua dialogue participant is to look at its ability to understand an utterance in a dialogue context. [Hirschman et al. (1990)] have proposed a methodology which involves setting up a database of paired tasks and reference answers, where a task consists of an utterance plus an encoding of a dialogue state. This allows the system to be reinitialised between turns in a dialogue. In this way, the problem of the dynamic nature of dialogue can be managed. Different dialogue systems can be compared, so long as they are presented with the same utterance + canonical context pair as input. A similar approach was developed in the SUNDIAL project for testing and debugging the system. An extension of this approach - the dialogue breadth test - tests a system with a broad range of different utterance types for each canonical context, thus exploring the ability of the system to cope with the lack of constraint on the next user utterance which exists at many points in a dialogue [Bates & Ayuso (1991)]. The canonical context approach represents a significant improvement on previous reference answer approaches. However, it only evaluates the ability of a system to perform context-sensitive interpretation; it focusses on the abilities of a system to interpret local structures in dialogue. But dialogue consists of larger structures, and these are beyond the scope of this metric. A dialogue system could perform reasonably well on this metric but be incapable of completing a single dialogue successfully.
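The canonical context idea can be sketched as follows. This is an illustration under stated assumptions, not the procedure of Hirschman et al. or of SUNDIAL: the slot-based encoding of dialogue state, the class `ContextTask`, and the two toy interpreters are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

DAYS = {"monday": "MON", "tuesday": "TUE"}


@dataclass
class ContextTask:
    """A task pairs an utterance with an encoded dialogue state (assumed format)."""
    context: Dict[str, str]   # canonical context: slots filled by earlier turns
    utterance: str
    reference: str            # reference answer for this utterance in this context


def encode(slots: Dict[str, str]) -> str:
    """Serialise slots in sorted key order so that answers can be compared."""
    return ";".join(f"{key}={value}" for key, value in sorted(slots.items()))


def context_aware(state: Dict[str, str], utterance: str) -> str:
    """Toy interpreter that adds any day it hears to the inherited state."""
    slots = dict(state)
    for word in utterance.lower().split():
        if word in DAYS:
            slots["day"] = DAYS[word]
    return encode(slots)


def context_blind(state: Dict[str, str], utterance: str) -> str:
    """Toy interpreter that ignores the dialogue state altogether."""
    return context_aware({}, utterance)


def evaluate_in_context(interpret: Callable[[Dict[str, str], str], str],
                        tasks: List[ContextTask]) -> float:
    """Score an interpreter, reinitialising it to the canonical context each turn."""
    if not tasks:
        raise ValueError("task list is empty")
    correct = 0
    for task in tasks:
        state = dict(task.context)  # fresh copy: nothing carries over between tasks
        if interpret(state, task.utterance) == task.reference:
            correct += 1
    return correct / len(tasks)


TASKS = [
    ContextTask({"origin": "BOS", "dest": "DEN"}, "what about tuesday",
                "day=TUE;dest=DEN;origin=BOS"),
    ContextTask({}, "flights on monday", "day=MON"),
]
```

Since each task carries its own context, no system's earlier errors can affect the later tasks it is given. A breadth test in the sense described above would correspond to supplying many different utterances for the same `context` value. Note that nothing in this score reflects whether a whole dialogue would succeed.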
One approach to evaluating the abilities of a dialogue system to deal with the larger structures of interaction might be to monitor the abilities of such a system to recreate an entire reference dialogue. For example, a corpus of dialogues could be collected using the Wizard of Oz simulation methodology discussed above. Subjects could then be asked to accomplish the same tasks using a dialogue system, and the results of the two exercises could be compared. Bates and Ayuso have argued convincingly that such an approach is unrealistic; in fact, they go so far as to compare it to ``asking one chess expert to exactly reproduce every move that some other expert made in a past game!'' [Bates & Ayuso (1991), p. 320]. While accepting this general point, [Fraser (1991)] has claimed that there is some merit in comparing the results of simulations and system data collections in a more sophisticated way. This involves analysing the simulation corpus and generating an abstract multilevel description of it. This has the effect of defining at a theoretical level the space within which reasonable system behaviour may be located. A similar analysis of the system corpus is carried out and the results are compared. In the language of Bates and Ayuso's chess analogy, this is like comparing two games of chess and observing that, though they may differ in detail, they both include a version of the Sicilian Defence. However, to the best of our knowledge, this approach has not yet been thoroughly tested.