In this section we summarise the evaluation methodology proposed here
and present some concrete recommendations (most of them recapitulating
points raised in the preceding discussion) for carrying out effective
and reliable evaluations of interactive dialogue systems.
The methodology is expressed as a series of recommendations, which
should be read in sequence.
- The sequence of events to be followed in evaluating an interactive
dialogue system is: characterisation, data collection, analysis and
application of metrics. Try to keep these as discrete, non-overlapping
phases of work, as this helps to ensure that the test is as fair as
possible.
- Provide a characterisation of all relevant aspects of the dialogue
system, the task, the user, the environment, the corpus, and
the overall system. Most of this can be done before the data
collection phase, though certain pieces of information (e.g. relating
to users and corpus characteristics) will necessarily have to wait
until the data collection is under way. (A sketch of one possible
characterisation record is given after this list.)
- Produce a clear statement of the objectives of the evaluation exercise
prior to the start of that exercise.
- Select the minimum set of metrics which will satisfy the evaluation
objectives.
- When budgeting time and personnel for the evaluation task,
be sure to plan adequate resources to complete it. A partial
evaluation can turn out to be of no more use than no evaluation at all.
Remember that most meaningful black box metrics at the dialogue level
cannot be automated, given the current state of the art.
- If the system to be evaluated is intended for use in a real context,
ensure that the test conditions match the end-use conditions as
closely as possible.
- Wherever possible, use evaluation metrics which have already been
described in the literature.
- Where it is necessary to invent some new metric, ensure (i) that it is
well-motivated, (ii) that it is fair (not favouring your system
unreasonably), and (iii) that it is fully described whenever it is
mentioned in public documents.
- Be very cautious when comparing systems. Valid conclusions cannot be
drawn when significant differences exist in (i) the application
domains, (ii) the test conditions, or (iii) the metrics used.
- Whole systems can only be compared meaningfully with black box
metrics, not with glass box metrics. (A sketch of one such black box
metric is given after this list.)
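
To make the characterisation recommendation concrete, the following is a
minimal sketch of a characterisation record, written in Python on the
assumption of a scripted evaluation harness. The class and field names are
illustrative assumptions only; the methodology prescribes which aspects must
be characterised, not how they are to be represented. Fields that can only be
completed during data collection are given default values.

    # Minimal sketch of a characterisation record; all names are illustrative.
    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class Characterisation:
        system: str       # dialogue system name and version
        task: str         # e.g. "train timetable enquiry"
        environment: str  # e.g. "telephone speech, quiet office"
        objectives: str   # statement of the evaluation objectives
        metrics: List[str] = field(default_factory=list)  # minimum set of metrics chosen
        users: List[str] = field(default_factory=list)    # completed during data collection
        corpus_notes: str = ""                            # likewise completed later

    record = Characterisation(
        system="ExampleDialogueSystem v1.0",   # hypothetical system name
        task="flight information enquiry",
        environment="telephone, naive users",
        objectives="assess transaction success under end-use conditions",
        metrics=["transaction success", "turn correction ratio"],
    )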
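
The comparison recommendation refers to black box metrics, of which
transaction success is a common example in the dialogue evaluation
literature. The sketch below, again in Python and again purely illustrative,
computes a transaction success rate from per-dialogue outcomes; in line with
the earlier remark that dialogue-level black box metrics cannot be automated,
the success judgements themselves are assumed to come from human assessors,
and only the bookkeeping is done in code.

    # Minimal sketch of a whole-system (black box) metric: transaction success
    # rate over a set of test dialogues, given hand-annotated outcomes.
    from typing import Iterable

    def transaction_success_rate(outcomes: Iterable[bool]) -> float:
        """Return the fraction of dialogues in which the user's task was completed."""
        outcomes = list(outcomes)
        if not outcomes:
            raise ValueError("no dialogues to evaluate")
        return sum(outcomes) / len(outcomes)

    # Hand-annotated outcomes for ten hypothetical test dialogues.
    annotated = [True, True, False, True, True, True, False, True, True, True]
    print(f"Transaction success rate: {transaction_success_rate(annotated):.0%}")  # 80%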