Humans are natural conversationalists. As very young children we easily learn to follow and participate in linguistic exchanges. Even before we have mastered a single word, we try to engage in interactive exchange by taking turns with talking adults in making non-linguistic noises. Once acquired, natural language endows us with the ability to form complex ideas and discuss them with others. One of the most remarkable aspects of language is its reflexivity: we are able to use language to talk about language. Humans, then, are in principle authorities on language who, conveniently, have the capacity to articulate that knowledge by means of language. Perhaps the best place to look for data on spoken language dialogue is within, in our own expert intuitions.
This view is prevalent in theoretical linguistics. The most famous statement of the doctrine of the primacy of native speaker intuitions derived from linguistic competence, i.e. the underlying knowledge of language, as opposed to observations of data derived from actual linguistic performance, was offered by Noam Chomsky:
Linguistic theory is concerned with an ideal speaker-hearer in a completely homogeneous speech community, who knows its language perfectly and is unaffected by such grammatically irrelevant conditions as memory limitations, distractions, shifts of attention and interest, and errors (random or characteristic) in applying his knowledge of the language in actual performance [Chomsky (1965), p. 3].
Chomsky's robust statement of mainstream linguistic thinking must be set in its historical context. Throughout the 1940s and 1950s linguistics, particularly North American linguistics, was dominated by empiricist structuralism [Hockett (1958)]. The practice of linguists was to go out into the field and collect large amounts of language data, i.e. they recorded and transcribed actual utterances. This data was examined using simple analysis techniques which, as far as possible, kept sophisticated intuitions about the nature of language out of the process. The prevailing view was that intuitions were not to be trusted; they might turn out to be no more than noise distracting researchers from spotting the true - possibly counter-intuitive - regularities in the structure of language. For example, the structuralist approach to establishing the word classes in a language was to work with the minimal criterion of substitutability. If two words can appear in exactly the same sentential context, then they are strong candidates for membership of the same word class.
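The substitutability criterion can be illustrated with a minimal sketch. It assumes a toy three-sentence corpus invented for this illustration and whitespace tokenisation; the function names are hypothetical, and the sketch is not a reconstruction of any structuralist procedure. Two words are reported as candidates for the same class if they occur in at least one identical sentence frame.

```python
from collections import defaultdict


def contexts_by_word(sentences):
    """Map each word to the set of sentence frames in which it occurs.

    A frame is the sentence with the word's position replaced by "_".
    """
    frames = defaultdict(set)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            frame = tuple(tokens[:i] + ["_"] + tokens[i + 1:])
            frames[word].add(frame)
    return frames


def substitutable_pairs(sentences):
    """Return word pairs that share at least one identical frame."""
    frames = contexts_by_word(sentences)
    words = sorted(frames)
    pairs = []
    for i, a in enumerate(words):
        for b in words[i + 1:]:
            if frames[a] & frames[b]:
                pairs.append((a, b))
    return pairs


# Toy corpus, invented for illustration only.
corpus = [
    "the dog slept",
    "the cat slept",
    "the cat purred",
]
print(substitutable_pairs(corpus))
```

On this corpus the sketch pairs "cat" with "dog" and "purred" with "slept". It uses no intuitions about meaning, which is the point of the criterion, and it treats every recorded utterance as equally reliable.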
There were at least two problems with the structuralist approach. First, the number of sentences in a natural language is, practically speaking, infinite so that it is never possible to be sure that enough data has been acquired to motivate a generalisation. Second, structuralism provided no way of distinguishing noise from reliable data. For example, the following two utterances must be treated equally, in spite of the fact that the second is unacceptable to an extent which makes it a potential subject of comment:
a. I used to be able to run all the way to the station.
b. I used to could run all the way to the station.
The point here is not whether or not (b) is interpretable. Rather, it is that the very same speaker who uttered (b) also has the capacity to reject it as a ``slip of the tongue'', if necessary, i.e. as a failed attempt to utter (a). Empiricist structuralism closed its ears to the rich data source offered by explicit statements of native speaker intuitions.
The Chomskyan method in linguistics, which has largely replaced empiricist structuralism, treats intuitions as primary. Even though linguistic acceptability (i.e. grammaticality) judgements turn out to be graded rather than binary, the last thirty years of research have not come up with a better source of data than the intuitions of linguists themselves. This applies particularly to the analysis of dialogue, although powerful statistical methods have been introduced to supplement linguistic categorisations for smaller linguistic units, and increasingly extensive dialogue data, with dialogue act annotation and stochastic analysis, are being used to train dialogue models for spoken language systems.
So how does the experience of theoretical linguistics relate to that of spoken language system design? In fact, the relationship has been close so far. Most researchers in natural language processing (NLP) have a background or interest in linguistics. Perhaps as a result of this, the vast majority of NLP systems have been oriented towards competence rather than performance.
This is how dialogue system designers have tended to proceed.
At each stage in each possible dialogue, the designer attempts to answer the question ``What could happen next?'' This can be answered at multiple levels. For example, an acceptable answer for a given point in a banking application dialogue could be ``The user will ask for an account balance.'' This answer abstracts away from the surface realisation of utterances. It could be realised in many different ways, including the following.
a. How much is in my account?
b. What is my present balance?
c. Can you give me a balance, please?
d. What have I got at the moment?
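The relation between one abstract move and its surface realisations can be sketched as a lookup table. The move label request_balance and the function names are hypothetical; the four wordings are examples (a) to (d) above. This is only a sketch of what a designer working by introspection writes down, not a proposal for how understanding should be implemented.

```python
import string

# One abstract move (hypothetical label) and the wordings (a)-(d) from the text.
REALISATIONS = {
    "request_balance": [
        "How much is in my account?",
        "What is my present balance?",
        "Can you give me a balance, please?",
        "What have I got at the moment?",
    ],
}


def normalise(utterance):
    """Lower-case, strip punctuation and collapse whitespace."""
    table = str.maketrans("", "", string.punctuation)
    return " ".join(utterance.lower().translate(table).split())


# Invert the table: normalised wording -> abstract move.
LOOKUP = {
    normalise(text): move
    for move, texts in REALISATIONS.items()
    for text in texts
}


def abstract_move(utterance):
    """Return the abstract move for a listed wording, or None if unlisted."""
    return LOOKUP.get(normalise(utterance))


print(abstract_move("what is my present balance"))  # a listed wording
print(abstract_move("Could I have my balance?"))    # a wording nobody listed
```

The second call returns None: a perfectly ordinary request is missed simply because the designer did not think of that wording.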
Each branching point in a natural language dialogue offers very large amounts of variability. If the question ``What could happen next?'' is answered at the level of abstract task-oriented speech acts, there are likely to be several possible next moves with non-negligible probability. This modest branching is complicated by the fact that there are potentially very many different ways of linguistically realising each distinct abstract move type. In turn, each of these linguistic realisations may be acoustically realised in infinitely many different ways. The whole solution space cannot be explored by means of introspection alone, although carefully chosen subparts of it may be designed quite effectively on the basis of designers' intuitions. Empirical support for this claim comes from those linguists who have tried to construct computational grammars capable of analysing all the sentences in substantial corpora of real language use.
Most dialogue systems developed to date have side-stepped some of the problems inherent in interactive dialogue by ruling out mixed-initiative turn-taking in dialogue. Whereas in natural conversation participants cooperate and collaborate in the unfolding of the talk, dialogues involving designed systems tend to require the user to follow a strictly predetermined fixed-initiative dialogue plan. In these menu-based systems, standard conventions are set aside and the system conveys to the user the non-negotiable rule of interaction, namely ``I ask the questions; you answer them.'' The following example should convey how menu-based dialogues are typically constrained.
System: Thank you for calling the EuroWatch Weather Forecasting Service. For which country do you require a weather forecast?
User: The United Kingdom.
System: Scotland, England, Wales or Northern Ireland?
User: Scotland.
System: Do you want a weather report for today, a short term forecast for the next 48 hours, or a long range forecast for the next 7 days?
User: A short-term forecast please.
System: Here is the weather forecast for Scotland for the next 48 hours: It will begin wet in all regions and will grow progressively wetter throughout the next two days. Do you require any further information?
User: No thank you.
System: Thank you for calling. Good bye.
In this example, the system asks closed questions. There may be some flexibility as to the exact wording of a user response. For example, the user's first utterance could reasonably have been any of the following (parentheses enclose elements which can be omitted): ``((for) the) United Kingdom (please)'', ``((for) the) U.K. (please)'', ``(for) Britain (please)'' or ``(for) Great Britain (please)''. However, the directness of the system's prompting effectively rules out the much wider range of utterances which could include examples such as ``Oh hello. I was wondering if you could give me a weather forecast for the U.K. please.'' While it is reasonably straightforward to construct a linguistic model which covers most utterances that would be produced by cooperative users in the more constrained case, the same cannot be said of the case where the system asks open questions or allows user initiatives.
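The parenthesis notation used above is itself a small grammar, and in the constrained case its full coverage can be enumerated. The following is a minimal sketch, assuming balanced parentheses and exact matching on lower-cased words; the function names expand and accepts are hypothetical.

```python
def expand(pattern):
    """Expand a pattern in which parentheses enclose optional elements.

    Returns the sorted list of word strings the pattern allows.
    Assumes the parentheses in the pattern are balanced.
    """
    def parse(i):
        # Parse up to the next unmatched ")" or the end of the pattern.
        variants = [""]
        literal = ""
        while i < len(pattern) and pattern[i] != ")":
            if pattern[i] == "(":
                variants = [v + literal for v in variants]
                literal = ""
                inner, i = parse(i + 1)
                i += 1  # skip the closing ")"
                # The group is optional: keep each variant with and without it.
                variants = [v + opt for v in variants for opt in [""] + inner]
            else:
                literal += pattern[i]
                i += 1
        return [v + literal for v in variants], i

    results, _ = parse(0)
    return sorted({" ".join(r.split()) for r in results})


# The four patterns given in the text for the user's first utterance.
PATTERNS = [
    "((for) the) United Kingdom (please)",
    "((for) the) U.K. (please)",
    "(for) Britain (please)",
    "(for) Great Britain (please)",
]
ACCEPTED = {v.lower() for p in PATTERNS for v in expand(p)}


def accepts(utterance):
    """True if the utterance is one of the enumerated wordings."""
    return " ".join(utterance.lower().split()) in ACCEPTED


print(len(ACCEPTED))
print(accepts("for the U.K. please"))
print(accepts("Oh hello. I was wondering if you could give me a "
              "weather forecast for the U.K. please."))
```

The four patterns expand to twenty wordings in all, a set small enough to write down by hand. The longer, more conversational request falls outside it, and no comparably small pattern set would cover the open case.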
In designing a strictly system-led menu dialogue system, all that is required of the designer is to come up with some way - any way - of allowing each task in the application domain to be performed. This approach could be called ``a priori design'' - the system designer states in advance how the user is to be allowed to progress through the task to the goal. The designer does not look beyond existing intuitions about how best to structure dialogues in order to develop a working system. There are, of course, no guarantees that the chosen design will be ergonomically optimal, and it will be necessary to test the system with users in order to fine-tune the dialogue strategy.
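An a priori design of this kind amounts to a fixed sequence of prompts, each with a closed set of acceptable answers. The sketch below encodes the first three system questions of the weather example in that form. The slot names, the keyword lists and the re-prompting behaviour are assumptions made for illustration; only the prompts and the scripted user turns are taken from the example dialogue.

```python
# Each step: (slot name, system prompt, keywords accepted in the answer).
# Slot names and keyword lists are illustrative assumptions.
STEPS = [
    ("country", "For which country do you require a weather forecast?",
     ["united kingdom"]),
    ("region", "Scotland, England, Wales or Northern Ireland?",
     ["scotland", "england", "wales", "northern ireland"]),
    ("period", "Do you want a weather report for today, a short term "
               "forecast for the next 48 hours, or a long range forecast "
               "for the next 7 days?",
     ["today", "short", "long"]),
]


def run_dialogue(user_turns):
    """Walk the fixed step sequence, consuming one user turn per prompt.

    A turn that contains none of the step's keywords makes the system
    repeat the prompt. Returns the slots filled before the turns ran out.
    """
    turns = iter(user_turns)
    slots = {}
    for slot, prompt, keywords in STEPS:
        while slot not in slots:
            print("System:", prompt)
            answer = next(turns, None)
            if answer is None:
                return slots  # the user has stopped responding
            print("User:  ", answer)
            for keyword in keywords:
                if keyword in answer.lower():
                    slots[slot] = keyword
                    break
    return slots


filled = run_dialogue([
    "The United Kingdom.",
    "Scotland.",
    "A short-term forecast please.",
])
print(filled)
```

The user can influence nothing except the value of the current slot. That is exactly what makes the design specifiable from intuition alone, and exactly what limits it.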
However, once a decision has been taken to renounce fixed-initiative, menu-based dialogue in favour of freer, more natural spoken language dialogue, the use of intuitions alone in specification must be called into question for at least three reasons.
In producing utterances, speakers seldom end up saying exactly what they wanted to say. In listening to other people's utterances we filter them with expectations to such an extent that we are quite capable of hearing what we want to hear and not what has really been said.
Consider the case of non-lexical fillers in speech, such as uhm. If asked, most speakers would suggest either that these items have random distribution or that they are inserted when the speaker cannot think what to say next. In fact, research has shown that these items are used in a highly ordered way to structure talk and to aid the smooth transfer of turns between speakers. This is a fact of considerable consequence to system designers, but one which would not have emerged through introspection.
In summary, design by intuition is a cheap, simple and effective approach for specifying and designing system-led, menu-style, interactive voice response systems which, implicitly or explicitly, strongly limit the kind of language which the user may employ. It requires no special materials or resources other than those with which system designers are already naturally endowed. However, there are fundamental problems at the root of this method when the dialogue system in question allows less constrained natural language interaction. Though the design by intuition approach may be a useful complement to other approaches (such as design by observation and design by simulation), it cannot be relied on as the primary specification/design methodology.
RECOMMENDATIONS
Use design by intuition as the primary specification/design methodology only for those applications in which the following conditions apply:
(i) all tasks in the domain can be structured into a fixed sequence of steps,
(ii) the system takes the initiative in all phases of dialogue, and
(iii) the design of system prompts and the nature of the domain constrain the kind of language which may reasonably be employed by the user.