
Assessment framework

It is unlikely that a single measure could be used meaningfully to sum up the quality of an interactive language system in the foreseeable future, due to the large variety of dialogue systems and the complexity of their different components. It is more likely that systems will be characterised by vectors of metrics, each one picking out a different aspect of the system's performance. Some of these aspects can readily be assigned a numeric value, whereas others are clearly qualitative.

Assessment metrics

Different types of evaluation must be identified depending on:

The environment in which tests take place

 

The broad category of environment in which tests take place (laboratory tests versus field trials) is of vital importance, and the selection of this environment will depend on the purpose for which the evaluation results are required. There are competing imperatives here. On the one hand, it is valuable to be able to repeat experiments exactly, changing only the desired variables. The way to achieve this is to carry out laboratory tests with pre-recorded databases. (This is easier said than done in interactive systems, since there may be many different routes through a dialogue, and even modest changes in experimental variables may cause the pre-recorded utterances to be out of phase with the system's utterances.) On the other hand, since users contribute about half of every dialogue, it is important to trial dialogue systems in the field with real users operating under target usage conditions.

The degree of simulation or system integration

The degree of simulation or system integration ranges from tests with pre-recorded databases, through Wizard-of-Oz simulation, to a fully integrated system tested with real users. WOZ simulations are frequently used to test dialogue system specifications in advance of implementation. Likewise, simulations in which some components are real and others are simulated are used to test system integration plans. Unless there are good grounds for doubting it, it is reasonable to suppose that the same evaluation standards should be usable when all, part, or none of a system is being simulated.

How much is being evaluated?

Glass  box
(diagnostic) evaluation considers the performance of one or several subcomponents of a dialogue system. The objective is to evaluate subcomponents in the context of a whole dialogue system, and diagnose the contribution of each part to the overall success or failure of the system.

Black  box
(performance) evaluation considers the overall performance of a dialogue system without reference to any internal components or behaviours.

Glass box assessment

 

The objective of a glass box evaluation is to evaluate each component as it serves its function in the whole system. This involves determining the gross performance characteristics of the major subcomponents (such as the recogniser , parser , semantic analyser, dialogue manager , message generator, and speech synthesiser ). Further information on the assessment of these core technologies can be found in Chapters 10, 11, 12.

However, it should, in principle, also be possible to monitor certain more fine-grained internal features of the system's performance which relate more directly to the system's rôle as an interactive dialogue system, rather than simply a spoken language processing system. For example, the following features could profitably be investigated:

This is currently little more than a ``wish list'' since very few results in any of these areas have yet been achieved or published. However, readers working on interactive dialogue systems  are encouraged to consider these questions and contribute findings or lessons learned to help extend current levels of knowledge.  

Black box assessment

 

In a black box assessment exercise, the interactive dialogue system  is treated as an informationally encapsulated module. It is possible to monitor inputs and outputs, but not to look inside the box. Black box metrics are appropriate for characterising whole systems, and system comparisons should take place at the level of black box results.

No standards for black box assessment of interactive dialogue systems  have yet emerged. However, the areas mentioned in the following sections should be considered as candidates for this kind of analysis.

Quantitative measures

Qualitative measures

        Compared to using an existing information service, using the
        computer information system is...
        [circle choice which is most appropriate]

       Much                  the                   much
       easier     easier     same       harder     harder
         |----------|----------|----------|----------|

 

A core set of metrics for system comparison

It should be clear from the somewhat schematic discussion of interactive dialogue evaluation metrics that the field is still at a fairly primitive stage of development. Therefore, in this section we describe a core set of evaluation metrics (all of them black  box metrics) which can be used, in the interim, in order to provide a comprehensible and concise characterisation of the system's capabilities. The proposed set is far from complete, and must be regarded as no more than provisional. However, if the metrics are taken up and applied to a number of different systems, it should be possible to learn a reasonable amount about the performance of a given system relative to other systems assessed (under near-identical conditions) using the same set of metrics.

The core metrics to be employed are these:

Dialogue duration:
  the average duration of a dialogue.
Turn duration:
  the average duration of a turn.
Contextual appropriateness:
  a measure of the appropriateness of the system's turn-by-turn behaviour.
Correction rate:
  a measure of the proportion of all turns which are devoted to rectifying problems.
Transaction success rate:
  a measure of the percentage of all transactions which the system completes appropriately.

These metrics are described in more detail below.

Dialogue duration

   

SHORT NAME: DD

DEFINITION
Dialogue duration is a measure of the average duration in seconds of a dialogue.

METHODOLOGY
Ensure that all dialogues in the evaluation corpus are timed. A good way to do this is to get the system to keep a record of its ``connect time'' (time when it is being used). To calculate DD, divide the total amount of dialogue connect time by the number of dialogues in the corpus.    
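
As a minimal sketch, the DD calculation can be written as follows (the function name and the assumption that one connect time per dialogue has already been logged in seconds are purely illustrative, not part of the methodology itself):

def dialogue_duration(connect_times_s):
    """Average dialogue duration (DD) in seconds.

    connect_times_s: one logged connect time per dialogue in the corpus.
    """
    if not connect_times_s:
        raise ValueError("evaluation corpus contains no dialogues")
    return sum(connect_times_s) / len(connect_times_s)

# Example: three dialogues lasting 95 s, 120 s and 85 s give DD = 100.0 s.
print(dialogue_duration([95.0, 120.0, 85.0]))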

Turn duration

  

SHORT NAME: TD

DEFINITION
Turn duration is a measure of the average duration of one turn in a corpus of dialogues.

METHODOLOGY
The methodology proposed here is for a minimal TD measure. To obtain a TD figure, divide the total amount of dialogue connect time by the total number of turns in the corpus (where a turn is a contiguous block of speech contributed to a dialogue by either the system or the user). This figure averages across system turns (measured from when the user stops speaking to when the system stops speaking) and user turns (measured from when the system stops speaking to when the user stops speaking). Some researchers may wish to distinguish between these. However, for the baseline set of metrics described here, it is proposed that the simpler calculation be used.    
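
The corresponding calculation for the minimal TD measure is sketched below (again with illustrative names; only the connect time and a turn count are needed for each dialogue, and system and user turns are deliberately not distinguished):

def turn_duration(dialogues):
    """Minimal average turn duration (TD) in seconds.

    dialogues: iterable of (connect_time_s, n_turns) pairs, one per dialogue.
    """
    total_time = sum(time for time, _ in dialogues)
    total_turns = sum(turns for _, turns in dialogues)
    if total_turns == 0:
        raise ValueError("evaluation corpus contains no turns")
    return total_time / total_turns

# Example: dialogues of 120 s / 13 turns and 80 s / 7 turns give TD = 10.0 s.
print(turn_duration([(120.0, 13), (80.0, 7)]))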

Contextual appropriateness

 

SHORT NAME: CA

DEFINITIONS
Contextual appropriateness is a measure of the appropriateness of a system utterance in its immediate dialogue context. This is a five-valued measure, with values drawn from the set:

TF:
total failure
AP:
appropriate
IA:
inappropriate
AI:
appropriate/inappropriate
IC:
incomprehensible

METHODOLOGY
It is difficult - perhaps too difficult - to specify a priori the range of all possible system utterances, in part because it is impossible to anticipate a priori the full range of user utterances for which a response will be required. Thus, contextual appropriateness scoring will be carried out by a ``panel of experts''. Two ``experts'' (e.g. members of a project team, though this may lead to overrating, cf. Chapter 9) will independently score each system utterance in a corpus. Where both experts agree, the scores will stand. Where the experts disagree, they will try in the first instance to reach agreement by discussion. Where this still fails to produce agreement, the utterance will be classified as AI (appropriate/inappropriate), thus flagging the fact that there is genuine uncertainty which requires further investigation.
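
The adjudication step can be summarised in the following minimal sketch (the function name and data representation are illustrative; the scoring itself remains a manual judgement made by the panel):

def adjudicate_ca(score_a, score_b, discussed=None):
    """Combine two experts' CA labels for a single system utterance.

    score_a, score_b: labels drawn from {"TF", "AP", "IA", "AI", "IC"}.
    discussed: label agreed after discussion, or None if no agreement was reached.
    """
    if score_a == score_b:
        return score_a      # both experts agree: the score stands
    if discussed is not None:
        return discussed    # disagreement resolved by discussion
    return "AI"             # genuine uncertainty: flag for further investigation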

EXAMPLES
(These illustrative examples are drawn from the travel information domain.)

EXAMPLE CA.1: TF [total failure]

U: What time does that leave?
[closedown]

EXAMPLE CA.2: IA [inappropriate dialogue act]

U: What time does the next train leave for Milan?
S: Can I help you?

EXAMPLE CA.3: IA [uncooperative answer]

U: Is there a direct flight from Paris to Inverness?
S: No.

EXAMPLE CA.4: IA [under-informative answer]

U: What time does the next train from Rome arrive?
S: In the afternoon.

EXAMPLE CA.5: IA [over-informative answer]

U: What time does the next train from Rome arrive?
S: The next train from Rome leaves at 11 o'clock in the morning.
It stops at Florence at 12.30. It arrives in Turin at 3 o'clock
in the afternoon. The train consists of seven cars, drawn by an
electric engine (serial number B475). Refreshments will be
available, between 11.30 and 14.45. All major credit cards accepted.

EXAMPLE CA.6: IC [unintelligible answer]

U: What time does the next train from Rome arrive?
S: rhubarbrhubarbrhubarb

As pointed out above, CA is a first order metric. Though each of the examples classed as IA is inappropriate in some way, some seem more profoundly bad than others. So, for example, CA.2 is nonsensical, whereas CA.4 is just extremely curt. Two things are worth bearing in mind. First, CA is just one metric amongst several, and we can expect the categories used by other metrics to cut across IA. Second, notwithstanding the general issues relating to Grice's Co-operative Principle, judgements of contextual appropriateness must be earthed in a system specification. An unco-operative answer may not be disastrous for the flow of the dialogue but, given some specification of a cooperative spoken language dialogue system, an unco-operative answer may be judged to be just as inappropriate as a nonsensical one.

Correction rate

 

SHORT NAME: CR

DEFINITION
The correction rate is the percentage of all turns in a dialogue which are concerned primarily with rectifying a ``trouble''. In general, turns which introduce troubles and those which correct them have the status of insertion sequences - they interrupt the flow of the dialogue without contributing new propositional content to it. (They may, of course, make substitutions in the propositional content.) If these sequences were removed, the dialogue would retain the same basic informational content and progression.

System turns  which seek to correct a user misunderstanding about the capabilities of the system should not be included in the measure as a correction turn . The rationale is that when users try from the outset of a dialogue to misuse the system, dialogues can be very short with all of the system's utterances devoted to correcting the user's misapprehensions. This would skew the figures badly, though the user might have been acting reasonably in ignorance of the system's capabilities and the system might have coped perfectly.

The correction rate could be divided into two subsidiary measures: the ``system correction rate (SCR)'' and the ``user correction rate (UCR)''. Definitions of these rates are as follows:

CR:
Percentage of all turns  which are correction turns 
SCR:
Percentage of all system turns  which are correction turns 
UCR:
Percentage of all user turns  which are correction turns 

METHODOLOGY
Once again, this metric can only be applied subjectively. Two experts independently count the number of system and user turns  in a set of dialogues and keep a record of the system and user correction turns . Any conflict between experts will either be resolved by discussion or the offending turn  will be removed from the measure and stored for further consideration. Once all of the uncertain turns  have been removed, normalisation must take place in order to obtain a reliable CR rate. If n more system turns  have been removed than user turns  or vice versa, then the total number of turns  must be reduced by n. This is not relevant in the case of SCR or UCR.
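
The arithmetic, including the normalisation step, might be sketched as follows (the names are illustrative, and the treatment of removed turns reflects one reading of the rule given above):

def correction_rates(n_system, n_user, n_sc, n_uc,
                     removed_system=0, removed_user=0):
    """Correction rates (CR, SCR, UCR) as percentages.

    n_system, n_user: numbers of system and user turns counted by the experts.
    n_sc, n_uc: agreed numbers of system and user correction turns.
    removed_*: turns removed because the experts could not agree.
    """
    # Normalisation: if n more system turns were removed than user turns
    # (or vice versa), reduce the total number of turns by n.
    n = abs(removed_system - removed_user)
    total_turns = n_system + n_user - n
    cr = 100.0 * (n_sc + n_uc) / total_turns
    scr = 100.0 * n_sc / n_system    # normalisation is not relevant to SCR
    ucr = 100.0 * n_uc / n_user      # normalisation is not relevant to UCR
    return round(cr), round(scr), round(ucr)

# Example CR.1 below: 7 system turns, 6 user turns, 1 Sc and 2 Uc correction
# turns, nothing removed, giving CR = 23%, SCR = 14%, UCR = 33%.
print(correction_rates(7, 6, 1, 2))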

Examples
(Sc = system correction turn; Uc = user correction turn.)

EXAMPLE CR.1

S: Flight information. Can I help you?
U: What time does flight BA 123 arrive from Paris?
S: BA 923?
Uc: No. BA 123.
S: BA 123?
U: Yes.
S: BA 123 from Paris leaves at 7.15am.
Uc: What time does it arrive?
S: That flight arrives at London Heathrow at 7.20am.
U: 7.28?
Sc: 20 past 7 in the morning.
U: OK. Thanks.
S: Thank you. Goodbye.

No. of turns = 13           No. of correction turns = 3
No. of system turns = 7     No. of Sc turns = 1
No. of user turns = 6       No. of Uc turns = 2

CR = 23%
SCR = 14%
UCR = 33%

Under normal circumstances the correction rate would be calculated for a corpus rather than for a single dialogue.

EXAMPLE CR.2

S: Flight information. Can I help you?
U: I'd like to book a flight to Genoa.
S: I'm sorry, this is just an information service.
For reservations dial 071-340 4000.
U: OK. Thanks.
S: Goodbye.

None of these turns counts as a correction turn  for the purposes of this metric. Thus the CR = 0%.

 

Transaction success

 

SHORT NAME: TS

DEFINITION
Transaction success is a measure of the success of the system in providing users with the information they require, if such information is available in the database. This is a four-valued measure:

S:
succeed
SC:
succeed with constraint relaxation
SN:
succeed in spotting that no answer exists
F:
fail

TASKS FOR WHICH ANSWERS EXIST. A ``reference answer'' can be associated with most task scenarios  in advance of the dialogues based on the scenario . A reference answer consists of a frame with some or all slots filled in. Some of these are marked as obligatory - the transaction can only be judged to have succeeded (S) if the system conveys to the user the information stored in these slots. Other slots are marked as optional - the transaction will succeed even if the system does not tell the user the information in these slots. A transaction must be judged to have failed (F) if the system provides to the user any information which is inconsistent with that found in the reference answer frame, or if it fails to provide obligatory information to the user.

For example, here is a flight information scenario  and associated answer frame. (Slots marked with an asterisk must be filled in a successful answer).

SCENARIO 1
Find out when flight BA 123 from Paris arrives.

REFERENCE ANSWER FRAME 1
TASK: flight enquiry
FLIGHT ID: BA123
FROM CITY: Paris
FROM AIRPORT: Charles de Gaulle
TO CITY: London
TO AIRPORT: Heathrow
TO TERMINAL: 4
DEPART TIME: 15.35
ARRIVE TIME: 16.00*

Thus, any transaction in which the system tells the user the arrival time and does not contradict any of the other slot-fillers will succeed.

If a user introduces information which was not contained in the scenario  and not anticipated in the reference answer frame, then an expert must produce a post hoc reference answer frame using this information and the task success must be judged in the light of this reference answer frame.
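
For tasks of this kind, the judgement might be sketched as follows (the data representation and function name are illustrative; constraint relaxation and no-answer cases, described below, are not covered):

def judge_transaction(reference, obligatory, conveyed):
    """Judge transaction success (S or F) against a reference answer frame.

    reference: dict mapping slot names to reference values.
    obligatory: set of slot names which must be conveyed to the user.
    conveyed: dict of slot values which the system actually gave the user.
    """
    # Fail if any conveyed value contradicts the reference answer frame.
    for slot, value in conveyed.items():
        if slot in reference and reference[slot] != value:
            return "F"
    # Fail if any obligatory slot was not conveyed to the user.
    if not all(slot in conveyed for slot in obligatory):
        return "F"
    return "S"

# Reference Answer Frame 1: only ARRIVE TIME is obligatory.
frame1 = {"FLIGHT ID": "BA123", "FROM CITY": "Paris", "ARRIVE TIME": "16.00"}
print(judge_transaction(frame1, {"ARRIVE TIME"},
                        {"ARRIVE TIME": "16.00", "FROM CITY": "Paris"}))  # S
print(judge_transaction(frame1, {"ARRIVE TIME"},
                        {"ARRIVE TIME": "17.30"}))                        # F (cf. TS.5)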

TASKS INVOLVING UNKNOWN OBJECTS WHICH CAN BE FOUND BY RELAXATION. If the user asks the system to perform a task type within its general competence, but the user references non-existent objects, then the transaction will be judged to have succeeded (SC) if the system is able to relax constraints until the user accepts an answer relating to the closest known object, and that information is correct according to the database.

TASKS INVOLVING UNKNOWN OBJECTS WHICH CANNOT BE FOUND BY RELAXATION. If the user asks the system to perform a task type within its general competence, but the user references non-existent objects which cannot be found by constraint relaxation, then the transaction will be judged to have succeeded if the system informs the user of the non-existence of the object and the user either ends the dialogue (SN) or replaces the non-existent object with one which exists (S) or can be relaxed satisfactorily (SC).

TASKS FOR WHICH NO ANSWER IS KNOWN. If the user asks the system to perform a task beyond the planned competence of the system then the reference answer frame will include the entry:

REFERENCE ANSWER FRAME 2
TASK: unknown

The dialogue will be judged to have succeeded (SN) if the system informs the user that it is unable to perform the task requested.

In all other circumstances, the task will be judged to have failed (F).

Transaction success is a very simple measure of whether or not the system as a whole has come up with ``the right answer''.

METHODOLOGY
Where possible, an expert will prepare a reference answer frame for each scenario  in advance of data collection. Where this is impossible (as in the circumstances described above), an expert must construct a reference answer frame on the basis of the task which the user appears to be trying to solve and on the basis of information in the database. The success of the task must then be judged against the new reference answer frame.

EXAMPLES
The following examples assume Scenario  1 and Reference Answer Frame 1, above.

EXAMPLE TS.1: S

S: Flight information. Can I help you?
U: Hello. Can you tell me what time BA 123 gets in from Paris this afternoon?
S: BA 923?
U: No. BA 123.
S: BA 123?
U: That's right.
S: That flight leaves Paris at 3.35pm and
arrives at London Heathrow airport at 4pm.
U: Ok. Thanks a lot.
S: Thank you. Goodbye.

EXAMPLE TS.2: SC

S: Flight information. Can I help you?
U: Hello. Can you tell me what time BA 923 gets in from Paris this afternoon?
S: BA 923?
U: That's right.
S: There is no flight BA 923. Perhaps you mean flight BA 123 which leaves
Paris at 3.35pm and arrives at London Heathrow airport at 4pm.
U: Oh. OK. Thanks a lot.
S: Thank you. Goodbye.

EXAMPLE TS.3: SN

S: Flight information. Can I help you?
U: Hello. I'd like to book a flight on this afternoon's flight BA 123 from Paris to London.
S: I'm sorry. This is just an information service.
For reservations dial 071-340 4000.
U: Oh. OK. Thanks a lot.
S: Thank you. Goodbye.

EXAMPLE TS.4: F

S: Flight information. Can I help you?
U: Hello. Can you tell me what time BA 123 gets in from Paris this afternoon?
S: BA 923?
U: No. BA 123.
[closedown]

EXAMPLE TS.5: F

S: Flight information. Can I help you?
U: Hello. Can you tell me what time BA 123 gets in from Paris this afternoon?
S: BA 123?
U: That's right.
S: That flight arrives in London at 5.30pm this evening.
U: OK. Thanks very much.
S: Goodbye.
 

 

NAME                               VALUE

DD   Dialogue Duration             average dialogue duration in seconds
TD   Turn Duration                 average turn duration in seconds
CA   Contextual Appropriateness    TF  total failure
                                   AP  appropriate
                                   IA  inappropriate
                                   AI  appropriate/inappropriate
                                   IC  incomprehensible
CR   Correction Rate               percentage of turns which are corrections
SCR  System Correction Rate        percentage of system turns which are corrections
UCR  User Correction Rate          percentage of user turns which are corrections
TS   Transaction Success           S   succeed
                                   SC  succeed with constraint relaxation
                                   SN  succeed in spotting that no answer exists
                                   F   fail

Table 13.1: Summary of core comparative evaluation metrics


