Although it is certainly better to rely on analyses of human-human interactions than to rely on intuitions alone, for all but highly constrained menu dialogue systems, the fact remains that human-human interactions are not the same as human-computer interactions and it would be surprising if they followed precisely the same rules. The designer is caught in a vicious circle: it is necessary to know the characteristics of a dialogue between a person and an automaton in order to be able to build a system which acts as a dialogue participant, but it is impossible to know what such a dialogue would be like until such a system has been built.
This section examines how one particular simulation technique which has come to be widely used by dialogue system designers, the ``Wizard of Oz'' technique (see also Chapters 4, 9), can help to extend designers' understanding of what human-computer spoken language dialogues would look like if only the systems which are currently at the planning stage were, in fact, implemented and running.
The basic idea behind the Wizard of Oz (WOZ) technique is simple: a human (usually known as the wizard or accomplice) plays the rôle of the computer in a simulated human-computer interaction. It is not known who first coined the term in this context, though its etymology is obvious. In the children's novel The Wizard of Oz [Baum (1900)], the ``great and terrible'' Wizard turns out to be no more than a mechanical tin device operated by a man hiding behind a screen. As a corpus collection method, it is also less widely known as the PNAMBIC (``Pay No Attention to the Man BehInd the Curtain'', from the film version of the Wizard of Oz) technique.
What primarily interests us here is the simulation of a computer system which takes spoken natural language input, processes it in some principled way, and generates spoken natural language responses. Example applications are telephone timetable inquiry services, hotel room booking services, home banking services and ``intelligent'' telephone answering machines. A survey paper published in 1991 [Fraser & Gilbert (1991a)] was able to assert that ``very few WOZ experiments have attempted to simulate all the components of a speech dialogue system''. However, since then there has been a dramatic increase in the number of groups using the technique to help specify interactive spoken language systems.
WOZ simulations are only useful if certain conditions are met. The first condition is that the computer system being simulated is capable of being imitated realistically, given human limitations. For example, if it is known that the future computer system will need to undertake substantial database manipulation as part of its function, there is little point in setting up an unconstrained WOZ simulation, since people are not capable of performing such work within a realistic time period.
A second, less obvious precondition is that before the experiments are begun it should be possible to formulate a detailed specification of how the future system is expected to behave. This is necessary in order to ensure that the wizard is correctly simulating the intended system. This specification often needs to be more precise and more detailed than would normally be necessary just to build the computer system. For example, in a speech simulation, the wizard ideally needs to make recognition errors at the same rate and in the same way as the future system. However, while descriptions of speech understanding systems often specify error rates, they rarely indicate what kinds of errors are made, and still more rarely do so in sufficient detail for the errors to be simulated. Indeed, one of the aims of using the WOZ technique may be to help devise such a specification. The way round this apparent paradox, that the design of the simulation requires a specification but the content of the specification depends on the results of the simulation, will be discussed later when we consider WOZ methodology.
A third condition for the usefulness of the WOZ methodology is that the task must ensure that the illusion that the wizard is a computer can be convincingly maintained. In systems which communicate using text on terminals, only minimal precautions have to be taken, since the only evidence of the ``computer'' the subject sees is the output of characters on a screen (but even here, there may be value in buffering the output so that it appears a line at a time, rather than at the speed of the wizard's typing). In speech output channels, it is necessary to ensure that the wizard's speech is disguised to sound not quite natural, a condition often satisfied by use of a synthesizer. Similar problems arise in controlling the content of the wizard's output, which must use only knowledge likely to be available to a computer. The degree of attention which has to be paid to these issues is related to the likely gullibility (that is, likelihood of believing that the simulated system is real) of the subjects.
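The output buffering mentioned above for text-based simulations can be illustrated with a minimal sketch, assuming that the wizard's keystrokes arrive as a stream of characters. The class name `LineBuffer` and its `feed` method are hypothetical, introduced only for illustration; the point is simply that nothing reaches the subject's screen until the wizard has completed a line.

```python
class LineBuffer:
    """Hold back the wizard's typing and release it one whole line at a time.

    Hypothetical helper: the subject never sees characters appear at the
    speed (or with the hesitations) of a human typist.
    """

    def __init__(self):
        self._pending = []  # characters typed since the last newline

    def feed(self, chars):
        """Accept typed characters; return the list of lines completed by them."""
        completed = []
        for ch in chars:
            if ch == "\n":
                completed.append("".join(self._pending))
                self._pending = []
            else:
                self._pending.append(ch)
        return completed


buf = LineBuffer()
for chunk in ["The 10.", "15 train\nleaves fr", "om platform 2\n"]:
    for line in buf.feed(chunk):
        print(line)  # only complete lines are shown to the subject
```

A fuller version might also delay each line by a fixed interval so that response times do not betray the wizard's typing speed, but the appropriate delay would depend on the system being simulated.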
In this section we consider some variables in spoken WOZ simulations. By ``variables'' we simply mean things which may vary. We make no distinction here between control variables which are set by the experimenter, response variables which are measured by the experimenter, and confounding factors, in which the experimenter has no interest or over which he has no control. The experimenter must decide how to treat variables in each simulation since there is considerable scope for variation between experiments. For example, in simulations of a telephone train timetable enquiry service, the caller's level of familiarity with telephone information services might be a confounding factor, producing significant differences between speakers. However, in an experiment which divides users into ``experienced'' and ``novice'' classes, this would be a control variable rather than a confounding factor. We shall restrict our discussion here to a straightforward listing of some of the variables in spoken WOZ simulations. For the purposes of our presentation, the variables can be divided into those relating to the subject, those relating to the wizard, and those relating to the communication channel.
Variables which concern the subjects in WOZ simulations can be subclassified into subject recognition variables, subject production variables, and subject knowledge variables.
SUBJECT RECOGNITION VARIABLES relate to the subject's ability to recognise the wizard's words.
SUBJECT PRODUCTION VARIABLES relate to the speech and language produced by the subject insofar as they have implications for the ability of the wizard to recognise and understand the subject's words.
SUBJECT KNOWLEDGE VARIABLES are concerned with what the subject knows.
It seems unnecessarily complex to ask the subject to guess whether or not he is talking to a computer; this is to turn a simple WOZ experiment into a Turing test [Turing (1950)]. The experiment would no longer be a simple simulation of future technology if the subject were given this additional discrimination task.
Thus, it seems that for routine simulations the subject should be led to believe that he is actually using the future technology. This can be expected to yield the best guide to how that technology will be used when it becomes available. Potentially there are ethical problems here since a responsible experimenter would not choose to tell an outright lie to the subject. A more appropriate approach is to tell the subject that the research aims to establish how people converse with computers, and to allow her/him to draw her/his own conclusions.
It was pleasing to note that the subjects in the covert [i.e. misinformed] group all expressed surprise on being told that the experiment was based on a simulation. (Indeed, one [male] subject was substantially embarrassed on finding that a female operator had encoded the profanities which he had used when he was having difficulties and which had been faithfully reproduced on the screen!)
It seems that many subjects can be totally misled. Follow-up questioning can be used to determine what subjects believe about simulations. If they are not convinced then the results can be discarded.
It is interesting to note in passing one result of Newell's which appears to demonstrate the opposite of what might be expected [Newell (1989), p. 8]:
Those subjects who were made aware of the operator's existence were more impressed...than those who thought they were talking to a computer.
Wizard variables can also be divided into wizard recognition and production variables but these must be supplemented with extra classes of dialogue model variables and staging variables.
WIZARD RECOGNITION VARIABLES. Corresponding to the subject's production variables are a set of wizard recognition variables defining the ranges of acoustic, lexical, syntactic and pragmatic phenomena which the wizard is allowed to recognise. One of the hardest tasks for the wizard is restricting what is recognised to what is defined by these variables. We shall see below (Section 3) one possible approach to formalising recognition constraints but, for the most part, the constraints will have to be applied directly by a wizard who knows his rôle intimately.
A tolerable error margin (i.e. of successful recognitions which should have been unsuccessful) should be set and any dialogues which, on post-simulation inspection, are found to stray beyond that margin should be discarded.
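The post-simulation inspection described above could be supported by a simple filter. The following is a minimal sketch, assuming that each dialogue has been annotated turn by turn with two flags: whether the wizard in fact recognised the subject's utterance, and whether the specification allowed it to be recognised. The field names `recognised` and `allowed`, the function names, and the 5% margin are all hypothetical choices made for illustration.

```python
def illicit_recognition_rate(turns):
    """Fraction of turns recognised by the wizard although the
    specification said they should not have been."""
    if not turns:
        return 0.0
    illicit = sum(1 for t in turns if t["recognised"] and not t["allowed"])
    return illicit / len(turns)


def keep_dialogues(dialogues, margin=0.05):
    """Return only those dialogues whose illicit recognition rate
    does not exceed the tolerable margin (an assumed value)."""
    return [d for d in dialogues if illicit_recognition_rate(d) <= margin]


# One faithful dialogue (1 lapse in 25 turns) and one unfaithful (2 in 10).
ok_turn = {"recognised": True, "allowed": True}
lapse = {"recognised": True, "allowed": False}
faithful = [ok_turn] * 24 + [lapse]
unfaithful = [ok_turn] * 8 + [lapse] * 2
kept = keep_dialogues([faithful, unfaithful])
print(len(kept))
```

The measure used here counts lapses per turn; an experimenter might equally count them per word or per utterance, depending on how the recognition constraints are formulated.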
A particularly difficult problem is that of trying to mimic a speech recogniser which only manages to recognise the words in its limited vocabulary, and these only with, say, a 95% recognition rate. In order to be faithful to the technology the wizard would have to introduce a random (or partially random) 5% failure rate even with words which the system is supposed to know about. This is an almost impossible task. The best that can be expected is for the wizard to introduce occasional deliberate recognition errors. Of course, if the wizard is able to type the subject's words fast enough, an automatic system can be used to generate the appropriate errors with the target frequency.
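An automatic error generator of the kind just mentioned might look like the following minimal sketch. It assumes that the wizard's typed transcript is available as a list of words and that the simulated recogniser's vocabulary is known. The function name `inject_recognition_errors` is hypothetical, and the error model (rejecting out-of-vocabulary words, and substituting another vocabulary word for an in-vocabulary word at the target rate) is only one arbitrary choice, since the kinds of errors made by a future recogniser are rarely specified.

```python
import random


def inject_recognition_errors(words, vocabulary, error_rate=0.05, rng=None):
    """Return what a simulated recogniser 'hears' for a typed transcript.

    Assumed error model: out-of-vocabulary words are always rejected
    (None); each in-vocabulary word is replaced by a different
    vocabulary word with probability error_rate.
    """
    rng = rng or random.Random()
    vocab = sorted(vocabulary)  # fixed order keeps seeded runs repeatable
    heard = []
    for word in words:
        if word not in vocabulary:
            heard.append(None)  # the simulated system does not know this word
        elif len(vocab) > 1 and rng.random() < error_rate:
            # substitution error: choose some other in-vocabulary word
            heard.append(rng.choice([v for v in vocab if v != word]))
        else:
            heard.append(word)
    return heard


vocab = {"london", "paris", "monday", "tuesday"}
typed = ["london", "please", "monday"]
print(inject_recognition_errors(typed, vocab, rng=random.Random(0)))
```

The wizard would then respond only to the output of this filter, not to what the subject actually said, which relieves him of the task of producing errors deliberately.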
The acoustic front-ends of speech-based information systems designed for use by the general public are likely to include rapid speaker adaptation capabilities. This means that speech recognition rates are likely to improve during the course of individual conversations. It is hard enough for a wizard to generate a fixed percentage of recognition errors; it would be virtually impossible for him to simulate an error rate which varies over time.
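Where errors are generated automatically rather than by the wizard, a time-varying error rate is at least easy to express. The following is a minimal sketch, assuming purely for illustration that the error rate falls exponentially from an initial value towards a floor as the conversation proceeds. The function name, the parameter values and the exponential form are all assumptions; no claim is made that real speaker adaptation follows this curve.

```python
import math


def adaptive_error_rate(utterance_index, initial_rate=0.10,
                        floor_rate=0.05, time_constant=5.0):
    """Assumed error rate for the n-th utterance of a conversation.

    Starts at initial_rate for utterance 0 and decays exponentially
    towards floor_rate; time_constant (in utterances) sets how quickly.
    """
    decay = math.exp(-utterance_index / time_constant)
    return floor_rate + (initial_rate - floor_rate) * decay


for n in (0, 5, 10, 20):
    print(n, round(adaptive_error_rate(n), 4))
```

The value returned for each utterance could be used as the target frequency of an automatic error generator, so that the simulated recogniser appears to improve as it adapts to the speaker.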
WIZARD PRODUCTION VARIABLES. Just like the subject, the wizard has production variables, but with the wizard these are defined by the performance of the existing or projected technologies.
Thus the whole gamut of speech generation variables (voice quality, intonation, syntax, register, etc.) needs to be considered. Again, the wizard may be required to introduce principled errors at any of these levels if the simulation is to be faithful to the technology.
In his listening typewriter WOZ experiments, Newell considered the question of response time to be so important that he trained his wizard to use a palantype keyboard [Newell (1978)] (an electronic stenography system which generates normal text) for rapid speech transcription (180 words/minute or more). The reasons why response times are important are simple (see also Chapters 8, 12 for further production variables):
DIALOGUE MODEL VARIABLES. The model of the dialogue employed by the wizard is central to his interpretation of utterances and selection of responses to them. It is worth flagging the dangers of constructing a prototype dialogue model in advance of running simulations. The two-stage experiment carried out by [Guyomard & Siroux (1987)] indicates the amount of work required to define a minimally acceptable dialogue manager. In spite of their positive reports, it is to be expected that many simulation-analysis-redesign iterations would be necessary to define a truly impressive dialogue manager. Since most research projects run to a tight schedule, a two-stage simulation is probably the best that most experimenters can hope for.
STAGING VARIABLES. In this section we consider some practical matters relating to the preparation of the wizard and the tools available to assist him in his work.
[Kelley (1983a), Kelley (1983b), Kelley (1984)] proposes an iterative development scheme which involves running an initial WOZ simulation and then, in subsequent simulations, incorporating more and more subcomponents of the real system, moving in the direction of a more complex system in the loop. In the development of a speech input/output system this could involve placing a speech recogniser between the subject and the human wizard. Alternatively (or additionally), the wizard could respond with synthesised speech generated from text which could either be typed rapidly (e.g. on a palantype system) or selected from a file of standard responses. In principle, a bionic wizard could include many subcomponents of the system, with the human accomplice merely ``plugging the gaps''. A bionic wizard could expect to encounter a number of problems, not least of which is the lengthening of response time which a mixing of human and computer components might entail. However, if these difficulties can be overcome, iterative development represents a promising technique.
The simplest means of connecting the subject and the wizard is by telephone or similar two-way electronic communication channel. The quality of the channel can be yet another variable.
THE SUBJECT-TO-WIZARD CHANNEL. One way of modelling the performance of a speech recogniser is to degrade the subject's speech signal. This would save the wizard from the (almost impossible) task of consciously introducing recognition errors. However, the drawback of this method is that the wizard, who already has enough to cope with, is faced with the extra workload of interpreting degraded speech. The alternative presented above is to place a real speech recogniser between the subject and the wizard. However, if the object of the exercise is to simulate a future system, the use of existing technology might place unrealistic constraints on the simulation dialogues.
THE WIZARD-TO-SUBJECT CHANNEL. No subject is going to believe that they are talking to a machine if they are unable to distinguish its performance from that of a human speaker. An important part of the simulation is the ``de-humanising'' of the wizard's voice. One way to do this is to pass the signal through a vocoder to strip it of human intonation and make it sound ``mechanical''. A secondary effect might be to make it roughly as difficult for the subject to understand the wizard as it would be to understand a speech synthesiser. This similarity could never be better than approximate.
The alternative to degrading the wizard's voice is to place a speech synthesiser between the wizard and the subject. Once again, the usefulness of this strategy depends, in part, on the extent to which the synthesiser approximates to the synthesiser in the projected future system.
CHANNEL INTERACTION. Can signals pass in opposite directions at the same time? The reason why this is important is that it may be desirable to let either subject or system talk in overlap or interrupt the other. On the other hand, it may be desirable or necessary to prevent them from doing so. In either case, it is important that the capabilities planned for the future system should be designed into the WOZ simulation to ensure that turn-taking phenomena recorded in the experiments are relevant for the future system.
In this section we describe a methodology for using WOZ simulations to specify the functionality of a speech input/output system. The suggestions presented here draw heavily on the work of [Kelley (1983a), Kelley (1983b), Kelley (1984)], [Guyomard & Siroux (1986a), Guyomard & Siroux (1986b), Guyomard & Siroux (1987), Guyomard & Siroux (1988)], the SUNDIAL Project [Peckham (1993)] and the Danish National Project on Spoken Language Dialogue Systems [Dalsgaard & Baekgaard (1994)].
The methodology involves at least three phases: a pre-experimental phase, a first phase, and a subsequent phase or phases. The need for at least three phases in the methodology stems from the difficulty noted earlier, namely that a WOZ simulation is intended to simulate as exactly as possible a future computer system, but the requirements to be satisfied by that system (and thus its precise specification) may be one of the outputs of the simulation work. To get round this circularity, we propose an iterative methodology which over the course of several phases refines both the simulation and the system specification until, ultimately, they converge.
To begin, the simulation incorporates only gross features of the intended system, the wizard in other respects acting ``normally'', that is with full human capabilities. The first phase yields data which can be used to develop an initial specification of requirements and thus some constraints on the wizard's behaviour in the second phase. In principle, the cycle of simulation and specification could be repeated many times, but in practice, two or three phases are likely to be sufficient.
Before the simulation is carried out it is necessary to analyse the application domain in order to define the wizard's domain knowledge. This domain knowledge may be available on-line in the form of a database (e.g. a travel booking database). In this case the wizard must be trained to use the database query language.
A second pre-experimental task is to decide what the subjects are to be told and how they can be made to interact meaningfully with the system, without simply following a script. This problem can be overcome by the use of scenarios: the subject is assigned a rôle and given some background information. The subject is then given a high-level description of a task to be accomplished (e.g. ``you want to meet Aunt Matilda who is flying into London from Hong Kong this evening''). The subject is free to decide what needs to be asked (e.g. when flights are due, which airport she is arriving at, which terminal she is arriving at, etc.), the order in which the questions should be asked, and the exact wording of the questions.
Thus a vital task in the pre-experimental phase is to design realistic scenarios which constrain the subject as tightly as possible to the application domain of the future system, while giving her/him as much motivation and as much freedom of expression as possible within these bounds.
At the pre-experimental phase a number of other practical matters need to be sorted out, such as:
In the first experimental phase very few - perhaps no - constraints should be placed on either subject or wizard. Any constraints which are applied are likely to relate to what the wizard is allowed to say. The wizard's voice should, of course, be distorted, so that the interaction does not have the character of a free human-human conversation. This first phase should be used to gather data which can then be used in the definition of an initial lexicon, grammar, and dialogue model.
The findings of the first phase should be used to define some constraints for the second phase. A clearer definition should now be available of what the wizard is not allowed to understand and what he is not allowed to say. In an ideal world it is conceivable that there could be many subsequent phases in which the insights of the last phase would be used to refine the current phase. It is also conceivable that at each iteration, a new or improved hardware or software component could be added to a bionic wizard, thus bringing the simulation ever closer, in fact as well as in appearance, to the future system.
In summary, we have introduced the WOZ technique as a means of predicting the functional requirements of future spoken language dialogue systems. Though there are significant technical problems in setting up spoken WOZ simulations, with careful design a wizard can simulate a computer sufficiently well to fool almost all subjects almost all of the time. The fact that, with support, people can simulate future speech systems enables designs to be developed iteratively and evaluation to be carried out before significant resources have been invested in system building. This strategy of early evaluation, which has been recommended in other areas of computer system design, has obvious advantages of cost and speed of convergence to a satisfactory design over the only alternative: build, evaluate and re-build.
We have identified a number of subject, wizard, and communication channel variables for spoken WOZ simulations. Taken together, these should provide an initial framework for staging and for comparing WOZ simulations.