It is unlikely that a single measure could be used meaningfully to sum up the quality of an interactive language system in the foreseeable future, due to the large variety of dialogue systems and the complexity of their different components. It is more likely that systems will be characterised by vectors of metrics, each one picking out a different aspect of the system's performance. Some of these aspects can readily be assigned a numeric value, whereas others are clearly qualitative.
Different types of evaluation must be identified depending on:
The broad category of environment in which tests take place (laboratory tests versus field trials) is of vital importance, and the selection of this environment will depend on the purpose for which the evaluation results are required. There are competing imperatives here. On the one hand it is valuable to be able to repeat experiments exactly, changing only the desired variables. The way to achieve this is to carry out laboratory tests with pre-recorded databases. (This is easier said than done in interactive systems, since there may be many different routes through a dialogue and even modest changes in experimental variables may cause the pre-recorded utterances to be out of phase with the system's utterances.) On the other hand, since users contribute about half of every dialogue, it is important to trial dialogue systems in the field with real users operating under target usage conditions.
The degree of simulation or system integration (pre-recorded databases, Wizard-of-Oz versus system integration tested with real users). WOZ simulations are frequently used to test dialogue system specifications in advance of implementation. Likewise, simulations in which some components are real and others are simulated are used to test system integration plans. Unless there are good grounds for doubting it, it is reasonable to suppose that the same evaluation standards should be usable when all, part, or none of a system is being simulated.
The objective of a glass box evaluation is to evaluate each component as it serves its function in the whole system. This involves determining the gross performance characteristics of the major subcomponents (such as the recogniser, parser, semantic analyser, dialogue manager, message generator, and speech synthesiser). Further information on the assessment of these core technologies can be found in Chapters 10, 11, and 12.
However, it should, in principle, also be possible to monitor certain more fine-grained internal features of the system's performance which relate more directly to the system's rôle as an interactive dialogue system, rather than simply a spoken language processing system. For example, the following features could profitably be investigated:
This is currently little more than a ``wish list'' since very few results in any of these areas have yet been achieved or published. However, readers working on interactive dialogue systems are encouraged to consider these questions and contribute findings or lessons learned to help extend current levels of knowledge.
In a black box assessment exercise, the interactive dialogue system is treated as an informationally encapsulated module. It is possible to monitor inputs and outputs, but not to look inside the box. Black box metrics are appropriate for characterising whole systems, and system comparisons should take place at the level of black box results.
No standards for black box assessment of interactive dialogue systems have yet emerged. However, the areas mentioned in the following sections should be considered as candidates for this kind of analysis.
Compared to using an existing information service, using the
computer information system is...
[circle the choice which is most appropriate]

Much easier     Easier     The same     Harder     Much harder
     |------------|-----------|-----------|------------|
It should be clear from the somewhat schematic discussion of interactive dialogue evaluation metrics that the field is still at a fairly primitive stage of development. Therefore, in this section we describe a core set of evaluation metrics (all of them black box metrics) which can be used, in the interim, in order to provide a comprehensible and concise characterisation of the system's capabilities. The proposed set is far from complete, and must be regarded as no more than provisional. However, if the metrics are taken up and applied to a number of different systems, it should be possible to learn a reasonable amount about the performance of a given system relative to other systems assessed (under near-identical conditions) using the same set of metrics.
The core metrics to be employed are: dialogue duration (DD), turn duration (TD), contextual appropriateness (CA), correction rate (CR), and transaction success (TS).
These metrics are described in more detail below.
SHORT NAME: DD
DEFINITION
Dialogue duration is a measure of the average
duration in seconds of a dialogue.
METHODOLOGY
Ensure that all dialogues in the evaluation corpus
are timed. A good way to do this is to get the system to keep a record
of its ``connect time'' (time when it is being used). To calculate DD,
divide the total amount of dialogue connect time by the number of
dialogues in the corpus.
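As an illustration, the DD calculation can be sketched in a few lines of Python; the connect times below are invented figures, not data from a real corpus.

    # A minimal sketch of the DD calculation, assuming each dialogue's
    # connect time has been logged in seconds (figures are illustrative).
    connect_times = [212.4, 187.9, 305.1, 254.6]   # one entry per dialogue

    dd = sum(connect_times) / len(connect_times)   # dialogue duration (DD)
    print(f"DD = {dd:.1f} seconds")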
SHORT NAME: TD
DEFINITION
Turn duration is a measure of the average duration of one turn in a corpus of
dialogues.
METHODOLOGY
The methodology proposed here is for a minimal TD measure. To obtain a
TD figure, divide the total amount of dialogue connect time by the
total number of turns in the corpus (where a turn is a contiguous
block of speech contributed to a dialogue by either the system or the
user). This figure averages across system turns (measured from when the
user stops speaking to when the system stops speaking) and user turns
(measured from when the system stops speaking to when the user stops
speaking). Some researchers may wish to distinguish between these.
However, for the baseline set of metrics described here, it is
proposed that the simpler calculation be used.
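The baseline TD calculation can be sketched in the same way; the connect times and turn counts below are again invented figures used only to show the arithmetic.

    # A minimal sketch of the baseline TD calculation, assuming the number
    # of turns in each dialogue has been counted (figures are illustrative).
    connect_times = [212.4, 187.9, 305.1, 254.6]   # seconds per dialogue
    turn_counts   = [13, 11, 18, 15]               # turns per dialogue

    td = sum(connect_times) / sum(turn_counts)     # turn duration (TD)
    print(f"TD = {td:.1f} seconds")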
SHORT NAME: CA
DEFINITION
Contextual appropriateness is a
measure of the appropriateness of a system utterance in its immediate
dialogue context. This is a five-valued measure, with values drawn
from the set: appropriate, inappropriate (IA), incomprehensible (IC), total failure (TF), and appropriate/inappropriate (AI).
Intuitively, an utterance is appropriate in context if it is not unexpectedly conspicuous (marked) in some way. An utterance can be marked in a number of ways, such as by introducing an unnatural or nonsensical progression of dialogue acts, or by being uncooperative, or by being noticeably over- or under-informative. This is a first order metric which groups together a number of phenomena. Analysis of these results should lead to the development of some finer-grained second order metrics in the future as understanding of the key issues builds. Contextual appropriateness values are expressed as percentages of all system utterances.
METHODOLOGY
It is difficult - perhaps too difficult - to specify a
priori the range of all possible system utterances, in part because
it is impossible to anticipate a priori the full range of user
utterances for which a response will be required. Thus, contextual
appropriateness scoring will be carried out by a ``panel of experts''.
Two ``experts'' (e.g. members of a project team, though this may lead to
overrating, cf. Chapter 9) will independently
score each system utterance in a corpus. Where both experts agree,
the scores will stand. Where the experts disagree, they will try in the first
instance to reach agreement by discussion. Where this still fails to
produce agreement, the utterance will be classified as AI
(appropriate/inappropriate), thus flagging the fact that there is
genuine uncertainty which requires further investigation.
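The scoring procedure can be summarised in a short sketch. The assumption that unresolved disagreements are recorded as AI follows the description above; the function names, the way labels are stored, and the use of "AP" as a label for fully appropriate utterances are purely illustrative.

    # Sketch of combining two experts' contextual-appropriateness labels.
    def reconcile(label_a, label_b):
        # Agreement stands; disagreements that survive discussion (not
        # modelled here) are recorded as AI for further investigation.
        return label_a if label_a == label_b else "AI"

    def ca_profile(final_labels):
        # Express each CA value as a percentage of all system utterances.
        return {value: 100.0 * final_labels.count(value) / len(final_labels)
                for value in set(final_labels)}

    # "AP" is used here only as an illustrative label for an appropriate utterance.
    labels = [reconcile(a, b) for a, b in [("AP", "AP"), ("IA", "IA"),
                                           ("AP", "IA"), ("TF", "TF")]]
    print(ca_profile(labels))   # e.g. {'AP': 25.0, 'IA': 25.0, 'AI': 25.0, 'TF': 25.0}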
EXAMPLES
(These illustrative examples are drawn from the travel information
domain.)
EXAMPLE CA.1: TF [total failure]
U: What time does that leave?
[closedown]
EXAMPLE CA.2: IA [inappropriate dialogue act]
U: What time does the next train leave for Milan?
S: Can I help you?
EXAMPLE CA.3: IA [uncooperative answer]
U: Is there a direct flight from Paris to Inverness?
S: No.
EXAMPLE CA.4: IA [under-informative answer]
U: What time does the next train from Rome arrive?
S: In the afternoon.
EXAMPLE CA.5: IA [over-informative answer]
U: What time does the next train from Rome arrive?
S: The next train from Rome leaves at 11 o'clock in the morning.
It stops at Florence at 12.30. It arrives in Turin at 3 o'clock
in the afternoon. The train consists of seven cars, drawn by an
electric engine (serial number B475). Refreshments will be
available, between 11.30 and 14.45. All major credit cards accepted.
EXAMPLE CA.6: IC [unintelligible answer]
U: What time does the next train from Rome arrive?
S: rhubarbrhubarbrhubarb
As pointed out above, CA is a first order metric. Though each of the examples classed as IA is inappropriate in some way, some seem more profoundly bad than others. So, for example, CA.2 is nonsensical, whereas CA.4 is just extremely curt. Two things are worth bearing in mind. First, CA is just one metric amongst several, and we can expect the categories used by other metrics to cut across IA. Second, notwithstanding the general issues relating to Grice's Co-operative Principle, judgements of contextual appropriateness must be earthed in a system specification. An uncooperative answer may not be disastrous for the flow of the dialogue but, given some specification of a cooperative spoken language dialogue system, an uncooperative answer may be judged to be just as inappropriate as a nonsensical one.
SHORT NAME: CR
DEFINITION
The correction rate is the percentage of all turns in a
dialogue which are concerned primarily with rectifying
a ``trouble''. In general, turns
which introduce troubles and those which correct them have the status
of insertion sequences - they interrupt the flow of the
dialogue without contributing new propositional content to it. (They
may, of course, make substitutions in the propositional content.) If these
sequences were removed, the dialogue would retain the same basic
informational content and progression.
System turns which seek to correct a user misunderstanding about the capabilities of the system should not be included in the measure as a correction turn. The rationale is that when users try from the outset of a dialogue to misuse the system, dialogues can be very short with all of the system's utterances devoted to correcting the user's misapprehensions. This would skew the figures badly, though the user might have been acting reasonably in ignorance of the system's capabilities and the system might have coped perfectly.
The correction rate can be divided into two subsidiary measures: the ``system correction rate (SCR)'', the percentage of system turns which are correction turns, and the ``user correction rate (UCR)'', the percentage of user turns which are correction turns.
METHODOLOGY
Once again, this metric can only be applied subjectively. Two experts
independently count the number of system and user turns in a set
of dialogues and keep a record of the system and user correction
turns . Any conflict between experts will either be resolved by
discussion or the offending turn will be removed from the measure and
stored for further consideration. Once all of the uncertain turns have
been removed, normalisation must take place in order to obtain a
reliable CR rate. If n more system turns have been removed than
user turns or vice versa, then the total number of turns must be
reduced by n. This is not relevant in the case of SCR or UCR.
EXAMPLES
(Sc = system correction turn; Uc = user correction turn.)
EXAMPLE CR.1
S: Flight information. Can I help you?
U: What time does flight BA 123 arrive from Paris?
S: BA 923?
Uc: No. BA 123.
S: BA 123?
U: Yes.
S: BA 123 from Paris leaves at 7.15am.
Uc: What time does it arrive?
S: That flight arrives at London Heathrow at 7.20am.
U: 7.28?
Sc: 20 past 7 in the morning.
U: OK. Thanks.
S: Thank you. Goodbye.
No. of turns         = 13     No. of corrections  = 3
No. of system turns  =  7     No. of Sc turns     = 1
No. of user turns    =  6     No. of Uc turns     = 2

CR  = 23%
SCR = 14%
UCR = 33%
Under normal circumstances the correction rate would be calculated for a corpus rather than for a single dialogue.
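For concreteness, the arithmetic for example CR.1, together with the normalisation rule given in the methodology above, can be sketched as follows; the handling of removed turns simply follows the wording of that rule and is otherwise an assumption.

    # Sketch of the CR, SCR and UCR calculations for a single dialogue.
    def correction_rates(system_turns, user_turns,
                         system_corrections, user_corrections,
                         removed_system=0, removed_user=0):
        # Per the normalisation rule: if n more system turns than user turns
        # (or vice versa) were removed as uncertain, the total number of
        # turns used for CR is reduced by n; SCR and UCR are unaffected.
        n = abs(removed_system - removed_user)
        total = system_turns + user_turns - n
        cr = 100.0 * (system_corrections + user_corrections) / total
        scr = 100.0 * system_corrections / system_turns
        ucr = 100.0 * user_corrections / user_turns
        return round(cr), round(scr), round(ucr)

    # Example CR.1: 7 system turns (1 correction), 6 user turns (2 corrections)
    print(correction_rates(7, 6, 1, 2))   # -> (23, 14, 33)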
EXAMPLE CR.2
S: Flight information. Can I help you?
U: I'd like to book a flight to Genoa.
S: I'm sorry, this is just an information service.
For reservations dial 071-340 4000.
U: OK. Thanks.
S: Goodbye.
None of these turns counts as a correction turn for the purposes of this metric. Thus the CR = 0%.
SHORT NAME: TS
DEFINITION
Transaction success is a measure of the success of the system in
providing users with the information they require, if such information
is available in the database. This is a four-valued measure, with values: succeed (S), succeed through constraint relaxation (SC), succeed with the system reporting that no answer is available (SN), and fail (F).
TASKS FOR WHICH ANSWERS EXIST. A ``reference answer'' can be associated with most task scenarios in advance of the dialogues based on the scenario. A reference answer consists of a frame with some or all slots filled in. Some of these are marked as obligatory - the transaction can only be judged to have succeeded (S) if the system conveys to the user the information stored in these slots. Other slots are marked as optional - the transaction will succeed even if the system does not tell the user the information in these slots. A transaction must be judged to have failed (F) if the system provides to the user any information which is inconsistent with that found in the reference answer frame, or if it fails to provide obligatory information to the user.
For example, here is a flight information scenario and associated answer frame. (Slots marked with an asterisk must be filled in a successful answer).
SCENARIO 1
Find out when flight BA 123 from Paris arrives.

REFERENCE ANSWER FRAME 1
TASK:           flight enquiry
FLIGHT ID:      BA 123
FROM CITY:      Paris
FROM AIRPORT:   Charles de Gaulle
TO CITY:        London
TO AIRPORT:     Heathrow
TO TERMINAL:    4
DEPART TIME:    15.35
ARRIVE TIME:    16.00*
Thus, any transaction in which the system tells the user the arrival time and does not contradict any of the other slot-fillers will succeed.
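A minimal sketch of this judgement for tasks with a known answer is given below. The slot names and the conveyed-information dictionaries are illustrative, and the SC and SN outcomes (which depend on the relaxation and no-answer rules described next) are not modelled.

    # Reference answer frame 1 as (value, obligatory?) pairs (illustrative).
    REFERENCE_FRAME_1 = {
        "ARRIVE TIME": ("16.00", True),
        "DEPART TIME": ("15.35", False),
        "TO AIRPORT":  ("Heathrow", False),
    }

    def judge_transaction(conveyed, frame):
        for slot, (value, obligatory) in frame.items():
            if slot in conveyed and conveyed[slot] != value:
                return "F"          # inconsistent information was given
            if obligatory and conveyed.get(slot) != value:
                return "F"          # obligatory information was not conveyed
        return "S"

    # Example TS.1: arrival time conveyed correctly        -> S
    print(judge_transaction({"ARRIVE TIME": "16.00",
                             "DEPART TIME": "15.35"}, REFERENCE_FRAME_1))
    # Example TS.5: wrong arrival time (17.30, not 16.00)  -> F
    print(judge_transaction({"ARRIVE TIME": "17.30"}, REFERENCE_FRAME_1))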
If a user introduces information which was not contained in the
scenario and not anticipated in the reference answer frame, then an expert
must produce a post hoc reference answer frame using this
information and the task success must be judged in the light of this
reference answer frame.
TASKS INVOLVING UNKNOWN OBJECTS WHICH CAN BE FOUND BY RELAXATION.
If the user asks the system to perform a task type within its general
competence, but the user references non-existent objects, then the
transaction will be judged to have succeeded (SC) if the system is
able to relax constraints until the user accepts an answer relating to the
closest known object, and that information is correct according to the
database.
TASKS INVOLVING UNKNOWN OBJECTS WHICH CANNOT BE FOUND BY RELAXATION. If the user asks the system to perform a task type within
its general competence, but the user references non-existent objects
which cannot be found by constraint relaxation, then the transaction
will be judged to have succeeded if the system informs the user of the
non-existence of the object and the user either ends the dialogue (SN)
or replaces the non-existent object with one which exists (S) or can
be relaxed satisfactorily (SC).
TASKS FOR WHICH NO ANSWER IS KNOWN. If the user asks the system to perform a task beyond the planned competence of the system then the reference answer frame will include the entry:
REFERENCE ANSWER FRAME 2
TASK:           unknown
The dialogue will be judged to have succeeded (SN) if the system informs the user that it is unable to perform the task requested.
In all other circumstances, the task will be judged to have failed (F).
Transaction success is a very simple measure of whether or not the
system as a whole has come up with ``the right answer''.
METHODOLOGY
Where possible, an expert will prepare a reference answer frame for each
scenario in advance of data collection. Where this is impossible (as
in the circumstances described above), an expert must construct a
reference answer frame on the basis of the task which the user appears to
be trying to solve and on the basis of information in the database.
The success of the task must then be judged against the new reference
answer frame.
EXAMPLES
The following examples assume Scenario 1 and Reference Answer Frame 1,
above.
EXAMPLE TS.1: S
S: Flight information. Can I help you?
U: Hello. Can you tell me what time BA 123
gets in from Paris this afternoon?
S: BA 923?
U: No. BA 123.
S: BA 123?
U: That's right.
S: That flight leaves Paris at 3.35pm and
arrives at London Heathrow airport at 4pm.
U: Ok. Thanks a lot.
S: Thank you. Goodbye.
EXAMPLE TS.2: SC
S: Flight information. Can I help you?
U: Hello. Can you tell me what time BA 923
gets in from Paris this afternoon?
S: BA 923?
U: That's right.
S: There is no flight BA 923. Perhaps you mean flight BA 123 which leaves
Paris at 3.35pm and arrives at London Heathrow airport at 4pm.
U: Oh. OK. Thanks a lot.
S: Thank you. Goodbye.
EXAMPLE TS.3: SN
S: Flight information. Can I help you?
U: Hello. I'd like to book a flight on this
afternoon's flight BA 123 from Paris to London.
S: I'm sorry. This is just an information service.
For reservations dial 071-340 4000.
U: Oh. OK. Thanks a lot.
S: Thank you. Goodbye.
EXAMPLE TS.4: F
S: Flight information. Can I help you?
U: Hello. Can you tell me what time BA 123
gets in from Paris this afternoon?
S: BA 923?
U: No. BA 123.
[closedown]
EXAMPLE TS.5: F
S: Flight information. Can I help you?
U: Hello. Can you tell me what time BA 123
gets in from Paris this afternoon?
S: BA 123?
U: That's right.
S: That flight arrives in London at 5.30pm this evening.
U: OK. Thanks very much.
S: Goodbye.