It is unlikely that a single measure could be used meaningfully to sum up the quality of an interactive language system in the foreseeable future, due to the large variety of dialogue systems and the complexity of their different components. It is more likely that systems will be characterised by vectors of metrics, each one picking out a different aspect of the system's performance. Some of these aspects can readily be assigned a numeric value, whereas others are clearly qualitative.
Different types of evaluation must be identified depending on:
The broad category of environment in which tests take place (laboratory tests versus field trials) is of vital importance, and the selection of this environment will depend on the purpose for which the evaluation results are required. There are competing imperatives here. On the one hand it is valuable to be able to repeat experiments exactly, changing only the desired variables. The way to achieve this is to carry out laboratory tests with pre-recorded databases. (This is easier said than done in interactive systems, since there may be many different routes through a dialogue and even modest changes in experimental variables may cause the pre-recorded utterances to be out of phase with the system's utterances.) On the other hand, since users contribute about half of every dialogue, it is important to trial dialogue systems in the field with real users operating under target usage conditions.
The degree of simulation or system integration (pre-recorded databases, Wizard-of-Oz versus system integration tested with real users). WOZ simulations are frequently used to test dialogue system specifications in advance of implementation. Likewise, simulations in which some components are real and others are simulated are used to test system integration plans. Unless there are good grounds for doubting it, it is reasonable to suppose that the same evaluation standards should be usable when all, part, or none of a system is being simulated.
The objective of a glass box evaluation is to evaluate each component as it serves its function in the whole system. This involves determining the gross performance characteristics of the major subcomponents (such as the recogniser, parser, semantic analyser, dialogue manager, message generator, and speech synthesiser). Further information on the assessment of these core technologies can be found in Chapters 10, 11, and 12.
However, it should, in principle, also be possible to monitor certain more fine-grained internal features of the system's performance which relate more directly to the system's rôle as an interactive dialogue system, rather than simply a spoken language processing system. For example, the following features could profitably be investigated:
This is currently little more than a ``wish list'' since very few results in any of these areas have yet been achieved or published. However, readers working on interactive dialogue systems are encouraged to consider these questions and contribute findings or lessons learned to help extend current levels of knowledge.
In a black box assessment exercise, the interactive dialogue system is treated as an informationally encapsulated module. It is possible to monitor inputs and outputs, but not to look inside the box. Black box metrics are appropriate for characterising whole systems, and system comparisons should take place at the level of black box results.
No standards for black box assessment of interactive dialogue systems have yet emerged. However, the areas mentioned in the following sections should be considered as candidates for this kind of analysis.
Compared to using an existing information service, using the
computer information system is...
[circle the choice which is most appropriate]

Much easier     Easier     The same     Harder     Much harder
     |------------|-----------|-----------|------------|
It should be clear from the somewhat schematic discussion of interactive dialogue evaluation metrics that the field is still at a fairly primitive stage of development. Therefore, in this section we describe a core set of evaluation metrics (all of them black box metrics) which can be used, in the interim, in order to provide a comprehensible and concise characterisation of the system's capabilities. The proposed set is far from complete, and must be regarded as no more than provisional. However, if the metrics are taken up and applied to a number of different systems, it should be possible to learn a reasonable amount about the performance of a given system relative to other systems assessed (under near-identical conditions) using the same set of metrics.
The core metrics to be employed are: dialogue duration (DD), turn duration (TD), contextual appropriateness (CA), correction rate (CR), and transaction success (TS).
These metrics are described in more detail below.
SHORT NAME: DD
DEFINITION
Dialogue duration is a measure of the average
duration in seconds of a dialogue.
METHODOLOGY
Ensure that all dialogues in the evaluation corpus
are timed. A good way to do this is to get the system to keep a record
of its ``connect time'' (time when it is being used). To calculate DD,
divide the total amount of dialogue connect time by the number of
dialogues in the corpus.
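As an illustration, the DD calculation can be sketched in a few lines of Python; the connect times below are invented figures, not data from a real corpus.

    # A minimal sketch of the DD calculation, assuming each dialogue's
    # connect time has been logged in seconds (figures are illustrative).
    connect_times = [212.4, 187.9, 305.1, 254.6]   # one entry per dialogue

    dd = sum(connect_times) / len(connect_times)   # dialogue duration (DD)
    print(f"DD = {dd:.1f} seconds")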
SHORT NAME: TD
DEFINITION
Turn duration is a measure of the average duration of one turn in a corpus of
dialogues.
METHODOLOGY
The methodology proposed here is for a minimal TD measure. To obtain a
TD figure, divide the total amount of dialogue connect time by the
total number of turns in the corpus (where a turn is a contiguous
block of speech contributed to a dialogue by either the system or the
user). This figure averages across system turns (measured from when the
user stops speaking to when the system stops speaking) and user turns
(measured from when the system stops speaking to when the user stops
speaking). Some researchers may wish to distinguish between these.
However, for the baseline set of metrics described here, it is
proposed that the simpler calculation be used.
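The baseline TD calculation can be sketched in the same way; the connect times and turn counts below are again invented figures used only to show the arithmetic.

    # A minimal sketch of the baseline TD calculation, assuming the number
    # of turns in each dialogue has been counted (figures are illustrative).
    connect_times = [212.4, 187.9, 305.1, 254.6]   # seconds per dialogue
    turn_counts   = [13, 11, 18, 15]               # turns per dialogue

    td = sum(connect_times) / sum(turn_counts)     # turn duration (TD)
    print(f"TD = {td:.1f} seconds")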
SHORT NAME: CA
DEFINITION
Contextual appropriateness is a
measure of the appropriateness of a system utterance in its immediate
dialogue context. This is a five-valued measure, with values drawn
from the set: appropriate, inappropriate (IA), incomprehensible (IC), total failure (TF), and appropriate/inappropriate (AI).
Intuitively, an utterance is appropriate in context if it is not unexpectedly conspicuous (marked) in some way. An utterance can be marked in a number of ways, such as by introducing an unnatural or nonsensical progression of dialogue acts, or by being uncooperative, or by being noticeably over- or under-informative. This is a first order metric which groups together a number of phenomena. Analysis of these results should lead to the development of some finer-grained second order metrics in the future as understanding of the key issues builds. Contextual appropriateness values are expressed as percentages of all system utterances.
METHODOLOGY
It is difficult - perhaps too difficult - to specify a
priori the range of all possible system utterances, in part because
it is impossible to anticipate a priori the full range of user
utterances for which a response will be required. Thus, contextual
appropriateness scoring will be carried out by a ``panel of experts''.
Two ``experts'' (e.g. members of a project team, though this may lead to
overrating, cf. Chapter 9) will independently
score each system utterance in a corpus. Where both experts agree,
the scores will stand. Where the experts disagree, they will try in the first
instance to reach agreement by discussion. Where this still fails to
produce agreement, the utterance will be classified as AI
(appropriate/inappropriate), thus flagging the fact that there is
genuine uncertainty which requires further investigation.
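The scoring procedure can be summarised in a short sketch. The assumption that unresolved disagreements are recorded as AI follows the description above; the function names, the way labels are stored, and the use of "AP" as a label for fully appropriate utterances are purely illustrative.

    # Sketch of combining two experts' contextual-appropriateness labels.
    def reconcile(label_a, label_b):
        # Agreement stands; disagreements that survive discussion (not
        # modelled here) are recorded as AI for further investigation.
        return label_a if label_a == label_b else "AI"

    def ca_profile(final_labels):
        # Express each CA value as a percentage of all system utterances.
        return {value: 100.0 * final_labels.count(value) / len(final_labels)
                for value in set(final_labels)}

    # "AP" is used here only as an illustrative label for an appropriate utterance.
    labels = [reconcile(a, b) for a, b in [("AP", "AP"), ("IA", "IA"),
                                           ("AP", "IA"), ("TF", "TF")]]
    print(ca_profile(labels))   # e.g. {'AP': 25.0, 'IA': 25.0, 'AI': 25.0, 'TF': 25.0}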
EXAMPLES
(These illustrative examples are drawn from the travel information
domain.)
EXAMPLE CA.1: TF [total failure]
U: What time does that leave?
[closedown]
EXAMPLE CA.2: IA [inappropriate dialogue act]
U: What time does the next train leave for Milan?
S: Can I help you?
EXAMPLE CA.3: IA [uncooperative answer]
U: Is there a direct flight from Paris to Inverness?
S: No.
EXAMPLE CA.4: IA [under-informative answer]
U: What time does the next train from Rome arrive?
S: In the afternoon.
EXAMPLE CA.5: IA [over-informative answer]
U: What time does the next train from Rome arrive?
S: The next train from Rome leaves at 11 o'clock in the morning.
It stops at Florence at 12.30. It arrives in Turin at 3 o'clock
in the afternoon. The train consists of seven cars, drawn by an
electric engine (serial number B475). Refreshments will be
available, between 11.30 and 14.45. All major credit cards accepted.
EXAMPLE CA.6: IC [unintelligible answer]
U: What time does the next train from Rome arrive?
S: rhubarbrhubarbrhubarb
As pointed out above, CA is a first order metric. Though each of the examples classed as IA is inappropriate in some way, some seem more profoundly bad than others. So, for example, CA.2 is nonsensical, whereas CA.4 is just extremely curt. Two things are worth bearing in mind. First, CA is just one metric amongst several, and we can expect the categories used by other metrics to cut across IA. Second, notwithstanding the general issues relating to Grice's Co-operative Principle, judgements of contextual appropriateness must be earthed in a system specification. An uncooperative answer may not be disastrous for the flow of the dialogue but, given some specification of a cooperative spoken language dialogue system, an uncooperative answer may be judged to be just as inappropriate as a nonsensical one.
SHORT NAME: CR
DEFINITION
The correction rate is the percentage of all turns in a
dialogue which are concerned primarily with rectifying
a ``trouble''. In general, turns
which introduce troubles and those which correct them have the status
of insertion sequences - they interrupt the flow of the
dialogue without contributing new propositional content to it. (They
may, of course, make substitutions in the propositional content.) If these
sequences were removed, the dialogue would retain the same basic
informational content and progression.
System turns which seek to correct a user misunderstanding about the capabilities of the system should not be included in the measure as a correction turn. The rationale is that when users try from the outset of a dialogue to misuse the system, dialogues can be very short with all of the system's utterances devoted to correcting the user's misapprehensions. This would skew the figures badly, though the user might have been acting reasonably in ignorance of the system's capabilities and the system might have coped perfectly.
The correction rate can be divided into two subsidiary measures: the ``system correction rate (SCR)'', the percentage of system turns which are correction turns, and the ``user correction rate (UCR)'', the percentage of user turns which are correction turns.
METHODOLOGY
Once again, this metric can only be applied subjectively. Two experts
independently count the number of system and user turns in a set
of dialogues and keep a record of the system and user correction
turns . Any conflict between experts will either be resolved by
discussion or the offending turn will be removed from the measure and
stored for further consideration. Once all of the uncertain turns have
been removed, normalisation must take place in order to obtain a
reliable CR rate. If n more system turns have been removed than
user turns or vice versa, then the total number of turns must be
reduced by n. This is not relevant in the case of SCR or UCR.
EXAMPLES
(Sc = system correction turn; Uc = user correction turn.)
EXAMPLE CR.1
S: Flight information. Can I help you?
U: What time does flight BA 123 arrive from Paris?
S: BA 923?
Uc: No. BA 123.
S: BA 123?
U: Yes.
S: BA 123 from Paris leaves at 7.15am.
Uc: What time does it arrive?
S: That flight arrives at London Heathrow at 7.20am.
U: 7.28?
Sc: 20 past 7 in the morning.
U: OK. Thanks.
S: Thank you. Goodbye.
No. of turns         = 13     No. of corrections  = 3
No. of system turns  =  7     No. of Sc turns     = 1
No. of user turns    =  6     No. of Uc turns     = 2

CR  = 23%
SCR = 14%
UCR = 33%
Under normal circumstances the correction rate would be calculated for a corpus rather than for a single dialogue.
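For concreteness, the arithmetic for example CR.1, together with the normalisation rule given in the methodology above, can be sketched as follows; the handling of removed turns simply follows the wording of that rule and is otherwise an assumption.

    # Sketch of the CR, SCR and UCR calculations for a single dialogue.
    def correction_rates(system_turns, user_turns,
                         system_corrections, user_corrections,
                         removed_system=0, removed_user=0):
        # Per the normalisation rule: if n more system turns than user turns
        # (or vice versa) were removed as uncertain, the total number of
        # turns used for CR is reduced by n; SCR and UCR are unaffected.
        n = abs(removed_system - removed_user)
        total = system_turns + user_turns - n
        cr = 100.0 * (system_corrections + user_corrections) / total
        scr = 100.0 * system_corrections / system_turns
        ucr = 100.0 * user_corrections / user_turns
        return round(cr), round(scr), round(ucr)

    # Example CR.1: 7 system turns (1 correction), 6 user turns (2 corrections)
    print(correction_rates(7, 6, 1, 2))   # -> (23, 14, 33)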
EXAMPLE CR.2
S: Flight information. Can I help you?
U: I'd like to book a flight to Genoa.
S: I'm sorry, this is just an information service.
For reservations dial 071-340 4000.
U: OK. Thanks.
S: Goodbye.
None of these turns counts as a correction turn for the purposes of this metric. Thus the CR = 0%.
SHORT NAME: TS
DEFINITION
Transaction success is a measure of the success of the system in
providing users with the information they require, if such information
is available in the database. This is a four-valued measure, with values: succeed (S), succeed through constraint relaxation (SC), succeed with the system reporting that no answer is available (SN), and fail (F).
TASKS FOR WHICH ANSWERS EXIST. A ``reference answer'' can be associated with most task scenarios in advance of the dialogues based on the scenario. A reference answer consists of a frame with some or all slots filled in. Some of these are marked as obligatory - the transaction can only be judged to have succeeded (S) if the system conveys to the user the information stored in these slots. Other slots are marked as optional - the transaction will succeed even if the system does not tell the user the information in these slots. A transaction must be judged to have failed (F) if the system provides to the user any information which is inconsistent with that found in the reference answer frame, or if it fails to provide obligatory information to the user.
For example, here is a flight information scenario and associated answer frame. (Slots marked with an asterisk must be filled in a successful answer).
SCENARIO 1
Find out when flight BA 123 from Paris arrives.

REFERENCE ANSWER FRAME 1
TASK:           flight enquiry
FLIGHT ID:      BA 123
FROM CITY:      Paris
FROM AIRPORT:   Charles de Gaulle
TO CITY:        London
TO AIRPORT:     Heathrow
TO TERMINAL:    4
DEPART TIME:    15.35
ARRIVE TIME:    16.00*
Thus, any transaction in which the system tells the user the arrival time and does not contradict any of the other slot-fillers will succeed.
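A minimal sketch of this judgement for tasks with a known answer is given below. The slot names and the conveyed-information dictionaries are illustrative, and the SC and SN outcomes (which depend on the relaxation and no-answer rules described next) are not modelled.

    # Reference answer frame 1 as (value, obligatory?) pairs (illustrative).
    REFERENCE_FRAME_1 = {
        "ARRIVE TIME": ("16.00", True),
        "DEPART TIME": ("15.35", False),
        "TO AIRPORT":  ("Heathrow", False),
    }

    def judge_transaction(conveyed, frame):
        for slot, (value, obligatory) in frame.items():
            if slot in conveyed and conveyed[slot] != value:
                return "F"          # inconsistent information was given
            if obligatory and conveyed.get(slot) != value:
                return "F"          # obligatory information was not conveyed
        return "S"

    # Example TS.1: arrival time conveyed correctly        -> S
    print(judge_transaction({"ARRIVE TIME": "16.00",
                             "DEPART TIME": "15.35"}, REFERENCE_FRAME_1))
    # Example TS.5: wrong arrival time (17.30, not 16.00)  -> F
    print(judge_transaction({"ARRIVE TIME": "17.30"}, REFERENCE_FRAME_1))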
If a user introduces information which was not contained in the
scenario and not anticipated in the reference answer frame, then an expert
must produce a post hoc reference answer frame using this
information and the task success must be judged in the light of this
reference answer frame.
TASKS INVOLVING UNKNOWN OBJECTS WHICH CAN BE FOUND BY RELAXATION.
If the user asks the system to perform a task type within its general
competence, but the user references non-existent objects, then the
transaction will be judged to have succeeded (SC) if the system is
able to relax constraints until the user accepts an answer relating to the
closest known object, and that information is correct according to the
database.
TASKS INVOLVING UNKNOWN OBJECTS WHICH CANNOT BE FOUND BY RELAXATION. If the user asks the system to perform a task type within
its general competence, but the user references non-existent objects
which cannot be found by constraint relaxation, then the transaction
will be judged to have succeeded if the system informs the user of the
non-existence of the object and the user either ends the dialogue (SN)
or replaces the non-existent object with one which exists (S) or can
be relaxed satisfactorily (SC).
TASKS FOR WHICH NO ANSWER IS KNOWN. If the user asks the system to perform a task beyond the planned competence of the system then the reference answer frame will include the entry:
REFERENCE ANSWER FRAME 2
TASK:           unknown
The dialogue will be judged to have succeeded (SN) if the system informs the user that it is unable to perform the task requested.
In all other circumstances, the task will be judged to have failed (F).
Transaction success is a very simple measure of whether or not the
system as a whole has come up with ``the right answer''.
METHODOLOGY
Where possible, an expert will prepare a reference answer frame for each
scenario in advance of data collection. Where this is impossible (as
in the circumstances described above), an expert must construct a
reference answer frame on the basis of the task which the user appears to
be trying to solve and on the basis of information in the database.
The success of the task must then be judged against the new reference
answer frame.
EXAMPLES
The following examples assume Scenario 1 and Reference Answer Frame 1,
above.
EXAMPLE TS.1: S
S: Flight information. Can I help you?
U: Hello. Can you tell me what time BA 123
gets in from Paris this afternoon?
S: BA 923?
U: No. BA 123.
S: BA 123?
U: That's right.
S: That flight leaves Paris at 3.35pm and
arrives at London Heathrow airport at 4pm.
U: Ok. Thanks a lot.
S: Thank you. Goodbye.
EXAMPLE TS.2: SC
S: Flight information. Can I help you?
U: Hello. Can you tell me what time BA 923
gets in from Paris this afternoon?
S: BA 923?
U: That's right.
S: There is no flight BA 923. Perhaps you mean flight BA 123 which leaves
Paris at 3.35pm and arrives at London Heathrow airport at 4pm.
U: Oh. OK. Thanks a lot.
S: Thank you. Goodbye.
EXAMPLE TS.3: SN
S: Flight information. Can I help you?
U: Hello. I'd like to book a flight on this
afternoon's flight BA 123 from Paris to London.
S: I'm sorry. This is just an information service.
For reservations dial 071-340 4000.
U: Oh. OK. Thanks a lot.
S: Thank you. Goodbye.
EXAMPLE TS.4: F
S: Flight information. Can I help you?
U: Hello. Can you tell me what time BA 123
gets in from Paris this afternoon?
S: BA 923?
U: No. BA 123.
[closedown]
EXAMPLE TS.5: F
S: Flight information. Can I help you?
U: Hello. Can you tell me what time BA 123
gets in from Paris this afternoon?
S: BA 123?
U: That's right.
S: That flight arrives in London at 5.30pm this evening.
U: OK. Thanks very much.
S: Goodbye.