In this section we summarise the evaluation methodology proposed here
and present some concrete recommendations (most of them recapitulating
points raised in the preceding discussion) for carrying out effective
and reliable evaluations of interactive dialogue systems.
The methodology is expressed as a series of recommendations, which
should be read in sequence.
- The sequence of events to be followed in evaluating an interactive
dialogue system is: characterisation, data collection, analysis and
application of metrics. Try to keep these as discrete, non-overlapping
phases of work, as this helps to ensure that the test is as fair as
possible.
- Provide a characterisation of all relevant aspects of the dialogue
system, the task, the user, the environment, the corpus, and
the overall system. Most of this can be done before the data
collection phase, though certain pieces of information (e.g. relating
to users and corpus characteristics) will necessarily have to wait
until the data collection is under way. (A sketch of one possible
characterisation record is given after this list.)
- Produce a clear statement of the objectives of the evaluation exercise
prior to the start of that exercise.
- Select the minimum set of metrics which will satisfy the evaluation
objectives.
- When budgeting time and personnel for the evaluation task,
be sure to plan adequate resources to complete it. A partial
evaluation can turn out to be of no more use than no evaluation at all.
Remember that most meaningful black box metrics at the dialogue level
cannot be automated, given the current state of the art.
- If the system to be evaluated is intended for use in a real context,
ensure that the test conditions match the end-use conditions as
closely as possible.
- Wherever possible, use evaluation metrics which have already been
described in the literature.
- Where it is necessary to invent some new metric, ensure (i) that it is
well-motivated, (ii) that it is fair (not favouring your system
unreasonably), and (iii) that it is fully described whenever it is
mentioned in public documents.
- Be very cautious when comparing systems. Valid conclusions cannot be
drawn when significant differences exist in (i) the application
domains, (ii) the test conditions, or (iii) the metrics used.
- Whole systems can only be compared meaningfully with black box
metrics, not with glass box metrics. (A sketch of one such black box
metric is given after this list.)
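
To make the characterisation recommendation concrete, the following is a
minimal sketch of a characterisation record, written in Python on the
assumption of a scripted evaluation harness. The class and field names are
illustrative assumptions only; the methodology prescribes which aspects must
be characterised, not how they are to be represented. Fields that can only be
completed during data collection are given default values.

    # Minimal sketch of a characterisation record; all names are illustrative.
    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class Characterisation:
        system: str       # dialogue system name and version
        task: str         # e.g. "train timetable enquiry"
        environment: str  # e.g. "telephone speech, quiet office"
        objectives: str   # statement of the evaluation objectives
        metrics: List[str] = field(default_factory=list)  # minimum set of metrics chosen
        users: List[str] = field(default_factory=list)    # completed during data collection
        corpus_notes: str = ""                            # likewise completed later

    record = Characterisation(
        system="ExampleDialogueSystem v1.0",   # hypothetical system name
        task="flight information enquiry",
        environment="telephone, naive users",
        objectives="assess transaction success under end-use conditions",
        metrics=["transaction success", "turn correction ratio"],
    )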
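
The comparison recommendation refers to black box metrics, of which
transaction success is a common example in the dialogue evaluation
literature. The sketch below, again in Python and again purely illustrative,
computes a transaction success rate from per-dialogue outcomes; in line with
the earlier remark that dialogue-level black box metrics cannot be automated,
the success judgements themselves are assumed to come from human assessors,
and only the bookkeeping is done in code.

    # Minimal sketch of a whole-system (black box) metric: transaction success
    # rate over a set of test dialogues, given hand-annotated outcomes.
    from typing import Iterable

    def transaction_success_rate(outcomes: Iterable[bool]) -> float:
        """Return the fraction of dialogues in which the user's task was completed."""
        outcomes = list(outcomes)
        if not outcomes:
            raise ValueError("no dialogues to evaluate")
        return sum(outcomes) / len(outcomes)

    # Hand-annotated outcomes for ten hypothetical test dialogues.
    annotated = [True, True, False, True, True, True, False, True, True, True]
    print(f"Transaction success rate: {transaction_success_rate(annotated):.0%}")  # 80%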