Design by simulation


Although it is certainly better to rely on analyses of human-human interactions than to rely on intuitions alone, for all but highly constrained menu dialogue systems, the fact remains that human-human interactions are not the same as human-computer interactions and it would be surprising if they followed precisely the same rules. The designer is caught in a vicious circle: it is necessary to know the characteristics of a dialogue between a person and an automaton in order to be able to build a system which acts as a dialogue participant, but it is impossible to know what such a dialogue would be like until such a system has been built.

This section examines how one particular simulation technique which has come to be widely used by dialogue system designers, the ``Wizard of Oz'' technique (see also Chapters 4, 9), can help to extend designers' understanding of what human-computer spoken language dialogues would look like if only the systems which are currently at the planning stage were, in fact, implemented and running.

The Wizard of Oz technique

The basic idea behind the Wizard of Oz (WOZ) technique is simple: a human (usually known as the wizard or accomplice) plays the rôle of the computer in a simulated human-computer interaction. It is not known who first coined the term in this context, though its etymology is obvious. In the children's novel The Wizard of Oz [Baum (1900)], the ``great and terrible'' Wizard turns out to be no more than a mechanical tin device operated by a man hiding behind a screen. As a corpus collection method, it is also known, less widely, as the PNAMBIC (``Pay No Attention to the Man BehInd the Curtain'', from the film version of the Wizard of Oz) technique.

What primarily interests us here is the simulation of a computer system which takes spoken natural language input, processes it in some principled way, and generates spoken natural language responses. Example applications are telephone timetable inquiry services, hotel room booking services, home banking services and ``intelligent'' telephone answering machines. A survey paper published in 1991 [Fraser & Gilbert (1991a)] was able to assert that ``very few WOZ experiments have attempted to simulate all the components of a speech dialogue system''. However, since then there has been a dramatic increase in the number of groups using the technique to help specify interactive spoken language systems.

Requirements for WOZ simulations

WOZ simulations are only useful if certain conditions are met. The first condition is that the computer system being simulated is capable of being imitated realistically, given human limitations. For example, if it is known that the future computer system will need to undertake substantial database manipulation as part of its function, there is little point in setting up an unconstrained WOZ simulation, since people are not capable of performing such work within a realistic time period.

A second, less obvious precondition is that before the experiments are begun it should be possible to formulate a detailed specification of how the future system is expected to behave. This is necessary in order to ensure that the wizard is correctly simulating the intended system. This specification often needs to be more precise and more detailed than would normally be necessary just to build the computer system. For example, in a speech simulation, the wizard ideally needs to make recognition errors at the same rate and in the same way as the future system. However, while descriptions of speech understanding systems often specify error rates, they rarely indicate what kinds of errors are made, or do so in sufficient detail for the errors to be simulated. Indeed, one of the aims of using the WOZ technique may be to help devise such a specification. The way round this apparent paradox, that the design of the simulation requires a specification but the content of the specification depends on the results of the simulation, will be discussed later when we consider WOZ methodology.

A third condition for the usefulness of the WOZ methodology is that the task must ensure that the illusion that the wizard is a computer can be convincingly maintained. In systems which communicate using text on terminals, only minimal precautions have to be taken, since the only evidence of the ``computer'' the subject sees is the output of characters on a screen (but even here, there may be value in buffering the output so that it appears a line at a time, rather than at the speed of the wizard's typing). In speech output channels, it is necessary to ensure that the wizard's speech is disguised to sound not quite natural, a condition often satisfied by use of a synthesizer. Similar problems arise in controlling the content of the wizard's output, which must use only knowledge likely to be available to a computer. The degree of attention which has to be paid to these issues is related to the likely gullibility (that is, likelihood of believing that the simulated system is real) of the subjects.
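The line-at-a-time buffering mentioned above for text-based simulations could be sketched roughly as follows. This is only an illustration in Python: the class name `LineBufferedDisplay` is hypothetical, and an in-memory stream stands in for the subject's terminal.

```python
import io
import sys


class LineBufferedDisplay:
    """Hold the wizard's keystrokes and release text only as complete lines."""

    def __init__(self, out=sys.stdout):
        self._out = out
        self._pending = []

    def type_char(self, ch):
        """Accept one typed character; show nothing until a newline arrives."""
        if ch == "\n":
            self._out.write("".join(self._pending) + "\n")
            self._out.flush()
            self._pending.clear()
        else:
            self._pending.append(ch)


screen = io.StringIO()            # assumption: stands in for the subject's terminal
display = LineBufferedDisplay(screen)
for ch in "The next train":
    display.type_char(ch)
partial = screen.getvalue()       # still empty: the line is not yet complete
for ch in " leaves at 10:15\n":
    display.type_char(ch)
shown = screen.getvalue()         # the whole line appears at once
```

The subject therefore never sees the rhythm of the wizard's typing, only finished lines.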

Variables in spoken WOZ experiments

In this section we consider some variables in spoken WOZ simulations. By ``variables'' we simply mean things which may vary. We make no distinction here between control variables which are set by the experimenter, response variables which are measured by the experimenter, and confounding factors, in which the experimenter has no interest or over which he has no control. The experimenter must decide how to treat variables in each simulation since there is considerable scope for variation between experiments. For example, in simulations of a telephone train timetable enquiry service, the caller's level of familiarity with telephone information services might be a confounding factor, producing significant differences between speakers. However, in an experiment which divides users into ``experienced'' and ``novice'' classes, this would be a control variable rather than a confounding factor. We shall restrict our discussion here to a straightforward listing of some of the variables in spoken WOZ simulations. For the purposes of our presentation, the variables can be divided into those relating to the subject, those relating to the wizard, and those relating to the communication channel.

Subject variables

Variables which concern the subjects in WOZ simulations can be subclassified into subject recognition variables, subject production variables, and subject knowledge variables.

SUBJECT RECOGNITION VARIABLES relate to the subject's ability to recognise the wizard's words.

ACOUSTIC RECOGNITION. Is the acoustic signal intelligible to the subject? The quality of canned and synthesised speech currently available ranges from fairly good to virtually unintelligible (see Chapter 12). The wizard's speech should therefore display characteristics which locate it either somewhere on this spectrum, or just beyond the best available technology if the system being simulated is expected to include synthesisers currently at the design or development stages. The ability to understand synthetic speech is not constant; rather, it displays learning effects. Thus the ability to decode the acoustic signal is a variable, not just among speakers, but for a given speaker over time.
LEXICAL RECOGNITION. Does the subject recognise the words used by the wizard? The important question here relates not to acoustic recognition but rather to whether or not all of the lexical items used by the wizard are known to the subject. This variable could be expected to interact with the subject's domain expertise variable. For example, in a flight reservation application, the wizard might refer to an apex fare. If this word is not in the subject's vocabulary then he may not even know how to segment it (apex, a pex, ape eggs ...). The subject will either initiate some sort of breakdown recovery or he will adopt a wait-and-see strategy. The subject's unfamiliarity with items of the wizard's vocabulary is likely, sooner or later, to lead to clarification subdialogues which would not otherwise be present.

SUBJECT PRODUCTION VARIABLES relate to the speech and language produced by the subject insofar as they have implications for the ability of the wizard to recognise and understand the subject's words.

ACCENT. A commercial telephone information service cannot screen callers before they make their calls. A strong non-standard accent would cause problems for most currently available speech recognisers (assuming they have been designed or trained for a spectrum of accents centred around a perceived standard). If the wizard is to simulate a plausible future system then he must fail to decode strong accents in some principled way.
VOICE QUALITY. Similar variability can be found in voice quality, but this time the variability is between individuals rather than speech communities.
DIALECT. The subjects may manifest different dialects. Non-standard dialect words and - more problematically - non-standard syntactic forms would probably be unintelligible to the sort of computer system which can currently be envisaged.
VERBOSITY AND POLITENESS. How direct are the subject's requests? What part does politeness play in the subject's talk?

SUBJECT KNOWLEDGE VARIABLES are concerned with what the subject knows.

DOMAIN EXPERTISE. Concerning the application domain, subjects may have expertise which ranges from novice through to expert. The way in which the subject interacts with the system, the questions he asks of it, and the way in which he expects to be addressed by it, are likely to be affected by his level of domain expertise.
SYSTEM EXPERTISE. [Richards & Underwood (1984a)] found that as subjects gained expertise in using a WOZ system, so they learned to frame requests more concisely and simply. Thus, the amount of system expertise a subject possesses is a significant variable.
INFORMATION ABOUT THE WIZARD. What the subject is told about the wizard has an effect on dialogue structure and on the subject's view of the experiment. There is a body of evidence to show that people use different dialogue strategies according to whether they believe they are talking to a human or a machine [Hauptmann & Rudnicky (1988)]. Speech to a computer has been labelled ``formal'' [Grosz (1977)], ``baby talk'' [Guindon et al. (1986)], ``telegraphic'' [Guindon et al. (1987)], and ``computerese'' [Reilly (1987)].
It seems unnecessarily complex to ask the subject to guess whether or not he is talking to a computer; this is to turn a simple WOZ experiment into a Turing test [Turing (1950)]. The experiment would no longer be a simple simulation of future technology if the subject were given this additional discrimination task.
Thus, it seems that for routine simulations the subject should be led to believe that he is actually using the future technology. This can be expected to yield the best guide to how that technology will be used when it becomes available. Potentially there are ethical problems here since a responsible experimenter would not choose to tell an outright lie to the subject. A more appropriate approach is to tell the subject that the research aims to establish how people converse with computers, and to allow her/him to draw her/his own conclusions.
GULLIBILITY. What the subject is told is one thing; what he believes is quite another. In an experiment to determine the effect of awareness of the human operator on subjects' performance, [Newell (1989), p. 146] observes that:
It was pleasing to note that the subjects in the covert [i.e. misinformed] group all expressed surprise on being told that the experiment was based on a simulation. (Indeed, one [male] subject was substantially embarrassed on finding that a female operator had encoded the profanities which he had used when he was having difficulties and which had been faithfully reproduced on the screen!)

It seems that many subjects can be totally misled. Follow-up questioning can be used to determine what subjects believe about simulations. If they are not convinced then the results can be discarded.
It is interesting to note in passing one result of Newell's which appears to demonstrate the opposite of what might be expected [Newell (1989), p. 8]:
Those subjects who were made aware of the operator's existence were more impressed...than those who thought they were talking to a computer.

Wizard variables

Wizard variables can also be divided into wizard recognition and production variables but these must be supplemented with extra classes of dialogue model variables and staging variables.

WIZARD RECOGNITION VARIABLES. Corresponding to the subject's production variables is a set of wizard recognition variables defining the ranges of acoustic, lexical, syntactic and pragmatic phenomena which the wizard is allowed to recognise. One of the hardest tasks for the wizard is restricting what is recognised to what is defined by these variables. We shall see below (Section 3) one possible approach to formalising recognition constraints but, for the most part, the constraints will have to be applied directly by a wizard who knows his rôle intimately.

A tolerable error margin (i.e. of successful recognitions which should have been unsuccessful) should be set and any dialogues which, on post-simulation inspection, are found to stray beyond that margin should be discarded.
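One way of applying such a margin during post-simulation inspection is sketched below. It assumes, hypothetically, that each turn of a transcribed dialogue has been annotated by hand with whether the utterance lay inside the defined recognition range and whether the wizard treated it as recognised. The `AnnotatedTurn` fields and the 5% margin are illustrative choices, not values taken from the experiments cited.

```python
from dataclasses import dataclass


@dataclass
class AnnotatedTurn:
    in_coverage: bool   # hypothetical annotation: utterance was within the defined range
    recognised: bool    # the wizard behaved as though the utterance had been recognised


def over_recognition_rate(turns):
    """Fraction of turns recognised by the wizard although they were out of coverage."""
    if not turns:
        return 0.0
    slips = sum(1 for t in turns if t.recognised and not t.in_coverage)
    return slips / len(turns)


def keep_dialogue(turns, margin=0.05):
    """True if the dialogue stays within the tolerated margin (assumed: 5%)."""
    return over_recognition_rate(turns) <= margin


# Two invented dialogues of ten turns each.
clean = [AnnotatedTurn(True, True)] * 9 + [AnnotatedTurn(False, False)]
sloppy = [AnnotatedTurn(True, True)] * 8 + [AnnotatedTurn(False, True)] * 2
kept = [d for d in (clean, sloppy) if keep_dialogue(d)]
```

Here `sloppy` contains two out-of-coverage utterances which the wizard nevertheless understood, so it would be discarded.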

A particularly difficult problem is that of trying to mimic a speech recogniser which only manages to recognise the words in its limited vocabulary, and these only with, say, a 95% recognition rate. In order to be faithful to the technology the wizard would have to introduce a random (or partially random) 5% failure rate even with words which the system is supposed to know about. This is an almost impossible task. The best that can be expected is for the wizard to introduce occasional deliberate recognition errors. Of course, if the wizard is able to type the subject's words fast enough, an automatic system can be used to generate the appropriate errors with the target frequency.
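The automatic alternative mentioned in the last sentence might look something like the following sketch, which takes the words typed by the wizard and rejects them at a target frequency. The function name, the `<unrecognised>` marker and the treatment of every error as a rejection are assumptions made for illustration; a real recogniser would also produce substitutions and insertions, which are not modelled here.

```python
import random

REJECT = "<unrecognised>"   # hypothetical marker for a word the "system" failed to recognise


def inject_errors(words, vocabulary, error_rate=0.05, rng=None):
    """Return the word sequence as a simulated recogniser might report it.

    Out-of-vocabulary words are always rejected; in-vocabulary words are
    rejected at random with probability `error_rate`.
    """
    if rng is None:
        rng = random.Random()
    output = []
    for word in words:
        if word not in vocabulary or rng.random() < error_rate:
            output.append(REJECT)
        else:
            output.append(word)
    return output


vocabulary = {"when", "does", "the", "train", "leave"}
rng = random.Random(0)      # fixed seed so that the run is repeatable
heard = inject_errors(["train"] * 10000, vocabulary, error_rate=0.05, rng=rng)
observed_rate = heard.count(REJECT) / len(heard)   # close to the 5% target
```

The wizard then responds only to what the filter lets through, rather than trying to produce errors consciously.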

The acoustic front-ends of speech-based information systems designed for use by the general public are likely to include rapid speaker adaptation capabilities. This means that speech recognition rates are likely to improve during the course of individual conversations. It is hard enough for a wizard to generate a fixed percentage of recognition errors; it would be virtually impossible for him to simulate an error rate which varies over time.
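If errors are generated automatically rather than by the wizard, a time-varying rate becomes at least conceivable. The sketch below shows one possible schedule in which the error rate falls with each utterance. The exponential shape, the starting rate, the floor and the half-life are all assumptions chosen for illustration; nothing in the text above says how real speaker adaptation would behave.

```python
def adapted_error_rate(utterance_index, initial=0.10, floor=0.03, half_life=5.0):
    """Assumed error rate for the n-th utterance (0-based), decaying towards `floor`.

    The gap between `initial` and `floor` halves every `half_life` utterances.
    """
    decay = 0.5 ** (utterance_index / half_life)
    return floor + (initial - floor) * decay


# Error rate to apply to each of the first twenty utterances of a conversation.
schedule = [adapted_error_rate(n) for n in range(20)]
```

Each value in `schedule` could be used as the error probability for the corresponding utterance.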

WIZARD PRODUCTION VARIABLES. Just like the subject, the wizard has production variables, but with the wizard these are defined by the performance of the existing or projected technologies.

Thus the whole gamut of speech generation variables (voice quality, intonation, syntax, register, etc.) needs to be considered. Again, the wizard may be required to introduce principled errors at any of these levels if the simulation is to be faithful to the technology.

RESPONSE TIME. One production variable of particular interest is the wizard's response time. The object of a WOZ simulation should be to respond in more or less the same time as it would take the future system to respond and not in the time it would take a human to respond. Obviously, systems are planned to run in real time but the real time course of a human-computer dialogue is not yet known. It may be appropriate to allow a wizard to take slightly longer to respond than a human expert. The wizard will in any case require all the time he can get to apply conscious constraints to his normal recognition and generation capabilities.
In his listening typewriter WOZ experiments, Newell considered the question of response time to be so important that he trained his wizard to use a palantype keyboard [Newell (1978)] (an electronic stenography system which generates normal text) for rapid speech transcription (180 words/minute or more). The reasons why response times are important are simple (see also Chapters 8, 12 for further production variables):
1. Speed of response can be expected to affect dialogue structure and content.
2. Speed of response may also affect the subject's judgements of whether he is talking to a computer or to a human.
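Where the wizard's reply passes through software before reaching the subject, response time can be held roughly constant by delaying replies which are ready too early. The sketch below assumes a hypothetical target latency for the future system (3.5 seconds here, an arbitrary figure) and pads the wizard's response up to it; it cannot, of course, speed up a wizard who is slower than the target.

```python
import time


def padding_needed(elapsed, target_latency):
    """Extra delay in seconds so that the reply arrives at the simulated latency."""
    return max(0.0, target_latency - elapsed)


def release_reply(reply, utterance_end, target_latency, send,
                  clock=time.monotonic, sleep=time.sleep):
    """Hold the wizard's finished reply until the target latency has passed."""
    delay = padding_needed(clock() - utterance_end, target_latency)
    if delay:
        sleep(delay)
    send(reply)
    return delay


# Demonstration with a fake clock and a fake sleep, so that nothing really waits:
# the subject stopped speaking at t=10.0 and the wizard's reply is ready at t=12.0.
slept, sent = [], []
delay = release_reply("The next train leaves at 10:15.", utterance_end=10.0,
                      target_latency=3.5, send=sent.append,
                      clock=lambda: 12.0, sleep=slept.append)
```

In a live session the default `clock` and `sleep` would be used, and `send` would pass the text on to whatever output channel the simulation employs.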

DIALOGUE MODEL VARIABLES. The model of the dialogue employed by the wizard is central to his interpretation of utterances and selection of responses to them. It is worth flagging the dangers of constructing a prototype dialogue model in advance of running simulations. The two-stage experiment carried out by [Guyomard & Siroux (1987)] indicates the amount of work required to define a minimally acceptable dialogue manager. In spite of their positive reports, it is to be expected that many simulation-analysis-redesign iterations would be necessary to define a truly impressive dialogue manager. Since most research projects run to a tight schedule, a two-stage simulation is probably the best that most experimenters can hope for.

STAGING VARIABLES. In this section we consider some practical matters relating to the preparation of the wizard and the tools available to assist him in his work.

TRAINING. The wizard requires training in at least three areas: the application domain, the system capabilities being modelled, and the tools available to assist in playing his role. The wizard should receive as much training as time allows in order to ensure that his performance is as close as possible to the projected performance of the future system.
TOOLS. The wizard needs a lot of information at his fingertips. A range of tools could be designed to present this information as quickly and easily as possible. For example, a range of paper tools (charts, card indexes, etc.) and electronic tools (mouse menu systems, hypertext, etc.) could be used. A wizard's assistant might even be considered necessary.
WIZARD PERSONALITY. So far we have assumed that the wizard is a person. But what if the wizard is part human, part machine? We shall call such a wizard a bionic wizard.
[Kelley (1983a), Kelley (1983b), Kelley (1984)] proposes an iterative development scheme which involves running an initial WOZ simulation and then, in subsequent simulations, incorporating more and more subcomponents of the real system, moving in the direction of a more complex system in the loop. In the development of a speech input/output system this could involve placing a speech recogniser between the subject and the human wizard. Alternatively (or additionally), the wizard could respond with synthesised speech generated from text which could either be typed rapidly (e.g. on a palantype system) or selected from a file of standard responses. In principle, a bionic wizard could include many subcomponents of the system, with the human accomplice merely ``plugging the gaps''. A bionic wizard could expect to encounter a number of problems, not least of which is the lengthening of response time which a mixing of human and computer components might entail. However, if these difficulties can be overcome, iterative development represents a promising technique.

Communication channel variables

The simplest means of connecting the subject and the wizard is by telephone or similar two-way electronic communication channel. The quality of the channel can be yet another variable.

THE SUBJECT-TO-WIZARD CHANNEL. One way of modelling the performance of a speech recogniser is to degrade the subject's speech signal. This would save the wizard from the (almost impossible) task of consciously introducing recognition errors. However, the drawback of this method is that the wizard, who already has enough to cope with, is faced with the extra workload of interpreting degraded speech. The alternative presented above is to place a real speech recogniser between the subject and the wizard. However, if the object of the exercise is to simulate a future system, the use of existing technology might place unrealistic constraints on the simulation dialogues.

THE WIZARD-TO-SUBJECT CHANNEL. No subject is going to believe that they are talking to a machine if they are unable to distinguish its performance from that of a human speaker. An important part of the simulation is the ``de-humanising'' of the wizard's voice. One way to do this is to pass the signal through a vocoder to strip it of human intonation and make it sound ``mechanical''. A secondary effect might be to make it roughly as difficult for the subject to understand the wizard as it would be to understand a speech synthesiser. This similarity could never be better than approximate.

The alternative to degrading the wizard's voice is to place a speech synthesiser between the wizard and the subject. Once again, the usefulness of this strategy depends, in part, on the extent to which the synthesiser approximates to the synthesiser in the projected future system.

CHANNEL INTERACTION. Can signals pass in opposite directions at the same time? The reason why this is important is that it may be desirable to let either subject or system talk in overlap or interrupt the other. On the other hand, it may be desirable or necessary to prevent them from doing so. In either case, it is important that the capabilities planned for the future system should be designed into the WOZ simulation to ensure that turn-taking phenomena recorded in the experiments are relevant for the future system.
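Whether overlap actually occurred in a recorded session can be checked from the timing of the turns. The sketch below assumes a hypothetical log in which each turn carries a speaker label and start and end times in seconds; the `TimedTurn` structure is illustrative rather than a format used by any of the projects mentioned.

```python
from dataclasses import dataclass


@dataclass
class TimedTurn:
    speaker: str   # "subject" or "wizard"
    start: float   # seconds from the start of the recording
    end: float


def find_overlaps(turns):
    """Return (earlier_turn, later_turn, overlap_seconds) for cross-speaker overlaps."""
    ordered = sorted(turns, key=lambda t: t.start)
    overlaps = []
    for i, first in enumerate(ordered):
        for second in ordered[i + 1:]:
            if second.start >= first.end:
                break   # turns are sorted by start, so nothing later overlaps `first`
            if second.speaker != first.speaker:
                shared = min(first.end, second.end) - second.start
                overlaps.append((first, second, shared))
    return overlaps


# An invented session in which the subject starts speaking before the wizard has finished.
session = [
    TimedTurn("wizard", 0.0, 4.0),
    TimedTurn("subject", 3.0, 5.0),
    TimedTurn("wizard", 5.5, 7.0),
]
overlaps = find_overlaps(session)
```

If the future system is planned to be strictly half-duplex, any overlap found in this way suggests that the simulated channel allowed more than the future system would.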

An iterative WOZ methodology

In this section we describe a methodology for using WOZ simulations to specify the functionality of a speech input/output system. The suggestions presented here draw heavily on the work of [Kelley (1983a), Kelley (1983b), Kelley (1984)], [Guyomard & Siroux (1986a), Guyomard & Siroux (1986b), Guyomard & Siroux (1987), Guyomard & Siroux (1988)], the SUNDIAL Project [Peckham (1993)] and the Danish National Project on Spoken Language Dialogue Systems [Dalsgaard & Baekgaard (1994)].

The methodology involves at least three phases: a pre-experimental phase, a first phase, and a subsequent phase or phases. The need for at least three phases in the methodology stems from the difficulty noted earlier, namely that a WOZ simulation is intended to simulate as exactly as possible a future computer system, but the requirements to be satisfied by that system (and thus its precise specification) may be one of the outputs of the simulation work. To get round this circularity, we propose an iterative methodology which over the course of several phases refines both the simulation and the system specification until, ultimately, they converge.

To begin, the simulation incorporates only gross features of the intended system, the wizard in other respects acting ``normally'', that is with full human capabilities. The first phase yields data which can be used to develop an initial specification of requirements and thus some constraints on the wizard's behaviour in the second phase. In principle, the cycle of simulation and specification could be repeated many times, but in practice, two or three phases are likely to be sufficient.

The pre-experimental phase

Before the simulation is carried out it is necessary to analyse the application domain in order to define the wizard's domain knowledge. This domain knowledge may be available on-line in the form of a database (e.g. a travel booking database). In this case the wizard must be trained to use the database query language.

A second pre-experimental task is to decide what the subjects are to be told and how they can be made to interact meaningfully with the system, without simply following a script. This problem can be overcome by the use of scenarios: the subject is assigned a rôle and given some background information. The subject is then given a high-level description of a task to be accomplished (e.g. ``you want to meet Aunt Matilda who is flying into London from Hong Kong this evening''). The subject is free to decide what needs to be asked (e.g. when flights are due, which airport she is arriving at, which terminal she is arriving at, etc.), the order in which the questions should be asked, and the exact wording of the questions.

Thus a vital task in the pre-experimental phase is to design realistic scenarios which constrain the subject as tightly as possible to the application domain of the future system, while giving her/him as much motivation and as much freedom of expression as possible within these bounds.

At the pre-experimental phase a number of other practical matters need to be sorted out, such as:

selecting a location for the experiments;
installing the required hardware and software;
finding subjects.

The first experimental phase

In the first experimental phase very few - perhaps no - constraints should be placed on either subject or wizard. Any constraints which are applied are likely to relate to what the wizard is allowed to say. The wizard's voice should, of course, be distorted so that the interaction does not have the character of a free human-human conversation. This first phase should be used to gather data which can then be used in the definition of an initial lexicon, grammar, and dialogue model.
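As a rough illustration of how first-phase data might feed an initial lexicon, the sketch below counts the word forms found in transcribed subject utterances. The tokenisation rule, the `min_count` threshold and the sample utterances are assumptions; defining a real lexicon, and still more a grammar or dialogue model, would require considerably more analysis.

```python
import re
from collections import Counter


def build_lexicon(transcripts, min_count=1):
    """Map each word form in the transcripts to its frequency.

    Words seen fewer than `min_count` times are left out.
    """
    counts = Counter()
    for utterance in transcripts:
        counts.update(re.findall(r"[a-z']+", utterance.lower()))
    return {word: n for word, n in counts.items() if n >= min_count}


# Invented subject utterances of the kind a flight enquiry scenario might elicit.
transcripts = [
    "When does the flight from Hong Kong arrive",
    "Which terminal does the flight arrive at",
]
lexicon = build_lexicon(transcripts)
frequent = build_lexicon(transcripts, min_count=2)
```

With a realistic amount of data, raising `min_count` would give a crude way of separating core vocabulary from one-off usages.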

Second or subsequent experimental phases

The findings of the first phase should be used to define some constraints for the second phase. A clearer definition should now be available of what the wizard is not allowed to understand and what he is not allowed to say. In an ideal world it is conceivable that there could be many subsequent phases in which the insights of the last phase would be used to refine the current phase. It is also conceivable that at each iteration, a new or improved hardware or software component could be added to a bionic wizard, thus bringing the simulation ever closer, in fact as well as in appearance, to the future system.
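A constraint on what the wizard is not allowed to understand could, for instance, be supported by a simple check of each transcribed utterance against the lexicon defined so far. The following sketch is one hypothetical form such a tool might take; the function name and the sample lexicon are invented for illustration.

```python
import re


def out_of_lexicon(utterance, lexicon):
    """List the words in the utterance which the simulated system may not understand."""
    words = re.findall(r"[a-z']+", utterance.lower())
    return [w for w in words if w not in lexicon]


# Assumed second-phase lexicon, derived from first-phase data.
lexicon = {"when", "does", "the", "flight", "arrive"}
unknown = out_of_lexicon("When does the apex flight arrive?", lexicon)
```

Any word returned in `unknown` is one which the wizard should treat as not understood, for example by initiating a clarification subdialogue.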

WOZ conclusions

In summary, we have introduced the WOZ technique as a means of predicting the functional requirements of future spoken language dialogue systems. Though there are significant technical problems in setting up spoken WOZ simulations, with careful design a wizard can simulate a computer sufficiently well to fool almost all subjects almost all of the time. The fact that, with support, people can simulate future speech systems enables designs to be developed iteratively and evaluation to be carried out before significant resources have been invested in system building. This strategy of early evaluation, which has been recommended in other areas of computer system design, has obvious advantages of cost and speed of convergence to a satisfactory design over the only alternative: build, evaluate and re-build.

We have identified a number of subject, wizard, and communication channel variables for spoken WOZ simulations. Taken together, these should provide an initial framework for staging and for comparing WOZ simulations.


EAGLES SWLG SoftEdition, May 1997.