An interactive dialogue system is constructed to enable and support communication between a human user and the service offered by the system. It integrates a set of modules, each of which handles a complex task. The modules are linked to each other, and their interactions are controlled by a kernel module whose overall task is to manage the dialogue. Seen from the dialogue manager, the application functions as an external module (e.g. a remotely functioning database) connected to a human user who may have a number of input and output devices at their disposal.
A dialogue manager may be able to handle several input and output devices in parallel: a user may interact with the dialogue system using multimodal input and output, and several input devices may be used in transferring the same message to the system, for example, DTMF (touch tones) instead of speech input.
Users communicate with the system in a number of transactions. A transaction consists of a number of exchanges, each of which consists of an input utterance (or a sequence of DTMF signals from a touch tone input device) and the corresponding system response (e.g. synthetic or canned speech, or text on a screen). The attention of the interactive dialogue system alternates in a sequence of turns.
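The nesting of these units can be sketched as a simple data model. All class and field names below are illustrative placeholders, not terminology from any particular system:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Turn:
    """One contribution by one party: the user or the system."""
    speaker: str   # "user" or "system"
    content: str   # utterance text, a DTMF string, or screen text

@dataclass
class Exchange:
    """A user input paired with the corresponding system response."""
    user_turn: Turn
    system_turn: Turn

@dataclass
class Transaction:
    """A complete interaction with the service: a sequence of exchanges."""
    exchanges: List[Exchange] = field(default_factory=list)

# A one-exchange transaction: the user asks, the system answers.
t = Transaction(exchanges=[
    Exchange(Turn("user", "When does the next train to Paris leave?"),
             Turn("system", "The next train to Paris leaves at 14:05.")),
])
```

A longer transaction would simply append further exchanges; the alternation of `speaker` values within each exchange is what the text calls the sequence of turns.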
A number of basic terms from interactive dialogue are introduced here:
Now that these basic terms have been defined, we shall consider how interactive dialogue systems compare with command systems, and shall review some issues relating to the different levels of interactive complexity to be found in dialogue systems.
In command systems, the interaction is direct and deterministic: to one stimulus from one agent there corresponds one unique response from the other agent, the response being independent of the state or context of either agent. For example, you press a key on a keyboard and the expected character appears on the screen. With command systems, the human has direct control over the machine. This form, not normally considered a variety of human communication, is usually referred to as the tool metaphor.
A dialogue system can be considered as a kind of interface which mediates communication between a human being and an application system, which may itself include several other systems. The dialogue system must process two kinds of information: that coming from the user and that coming from the task itself, through specialised interfaces, one for the speech technologies and one for the application. One of the dialogue system's main activities is to maintain coherence between the two. The connection between a human being's action (a natural language utterance, for instance) and the response of the system is therefore not direct: the dialogue system must perform a number of internal actions in order to produce a response which is not unique but depends on the internal state of the system and on the context of the interaction. This form of communication is referred to as the agent metaphor or the advisor metaphor.
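The contrast between the two metaphors can be made concrete in a few lines. A minimal sketch, in which the command system is a stateless lookup and the dialogue agent carries internal state (all names and behaviours here are invented for illustration):

```python
# Tool metaphor: one stimulus maps to one unique response,
# independent of any state or context.
def command_response(key: str) -> str:
    return {"a": "A", "b": "B"}.get(key, "?")

# Agent metaphor: the response to the same stimulus depends on
# the system's internal state and the context of the interaction.
class DialogueAgent:
    def __init__(self) -> None:
        self.state = "greeting"   # internal dialogue state

    def respond(self, utterance: str) -> str:
        if self.state == "greeting":
            self.state = "open"
            return "Welcome. How can I help you?"
        return f"You said: {utterance!r}. What else?"

agent = DialogueAgent()
r1 = agent.respond("hello")   # answered from the greeting state
r2 = agent.respond("hello")   # same stimulus, different response
```

The command function always returns the same character for the same key, whereas the agent gives two different responses to the identical utterance "hello", because its internal state changed between the two turns.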
Dialogue systems include different comprehension levels relating to basic components: a recogniser, a parser, an interpretation module, a dialogue manager, a synthesiser, etc. Each of the modules requires associated knowledge bases (lexicons, rules and models concerning the language used, the system, the task, the user, the environment, the dialogue itself). Each of the models has both static and dynamic parts: the static part exists before the dialogue begins; the dynamic part is built and modified during the dialogue. One important component is the dialogue history, which keeps track of the previous exchanges. The different modules and their associated knowledge bases allow the dialogue manager (or system) to perform internal actions including the following:
The different comprehension levels involved (acoustic, phonetic, lexical, syntactic, semantico-pragmatic) may be addressed sequentially. Alternatively, information transfers may take place in parallel between different levels in a non-hierarchical fashion, depending on the dialogue situation.
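The sequential alternative can be sketched as a chain of stand-in stages feeding a dialogue history. Every stage body below is a trivial placeholder for the real module (recogniser, parser, interpreter, response generation), chosen only to make the data flow runnable:

```python
class DialogueManager:
    """Toy sequential pipeline; each method stands in for a real module."""

    def __init__(self) -> None:
        # Dynamic knowledge: the dialogue history is built during dialogue.
        self.history = []

    def recognise(self, audio: str) -> str:
        return audio.lower()            # stand-in for speech recognition

    def parse(self, text: str) -> dict:
        return {"words": text.split()}  # stand-in for syntactic analysis

    def interpret(self, parse: dict) -> str:
        return " ".join(parse["words"])  # stand-in for interpretation

    def respond(self, meaning: str) -> str:
        reply = f"I understood: {meaning}"
        self.history.append((meaning, reply))  # record the exchange
        return reply

    def handle(self, audio: str) -> str:
        """Pass one input sequentially through all comprehension levels."""
        return self.respond(self.interpret(self.parse(self.recognise(audio))))
```

In a non-hierarchical design the stages would instead exchange partial hypotheses in both directions (e.g. the dialogue manager feeding predictions back to the recogniser), which a simple function chain like this cannot express.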
The role and performance of the dialogue system are largely constrained by and therefore dependent on the performance of speech technologies (depending on the recogniser error rate and authorised vocabulary, or on the control parameters of the synthesiser, for instance). They are also greatly dependent on the task objectives and requirements.
Different interactive complexity levels in dialogue systems may be identified. These are described in the following sections.
The interaction is reduced to a question-answer user-interface. The dialogue model is merged into the task model, from which it cannot be distinguished. Dialogues of this kind are often represented by branching tree structures. This category includes interactive voice response (IVR) systems, integrating tone signalling, isolated word recognition and word spotting techniques. The dialogue is strictly guided, leaving very little initiative to the user (system utterances may in some cases be interrupted by the user, for example). Several exchanges may be necessary to provoke one action or to obtain information from the system. This latter feature distinguishes these systems from pure voice control or command language systems in which there is no dialogue.
A question/answer system is a particular limiting case, as it may either be considered as a command system or as a marginal dialogue system: if one particular question always provokes the same response whatever the situation, then the system may be considered as a command system. But if asking the same question can provoke different responses (in menu-driven dialogue systems, for instance, it may depend on the current level in a tree structure), then the system can be called an interactive dialogue system.
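The menu-driven case can be illustrated with a toy branching tree in which the response to a key press depends on the current level. The menu contents and node names are invented for the example:

```python
# A toy IVR menu tree: each node has a prompt, and DTMF keys
# name the child node they lead to.
menu = {
    "root":       {"prompt": "Main menu: press 1 for timetables.",
                   "1": "timetables"},
    "timetables": {"prompt": "Timetables: press 1 for departures.",
                   "1": "departures"},
    "departures": {"prompt": "The next departure is at 14:05."},
}

def step(node: str, key: str) -> str:
    """Return the next node for a key press; stay put on invalid input."""
    return menu[node].get(key, node)
```

Pressing the same key "1" at the root leads to the timetables menu, while pressing "1" again leads to the departures announcement: the same stimulus provokes different responses depending on the current level in the tree, which is what makes such a system a (marginal) dialogue system rather than a command system.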
The system possesses distinct and independent models for the task, for the user, for the system, and for the dialogue itself. The dialogue model takes context into account, using a particular knowledge base (a dialogue history), which is built during the dialogue. Multiple types of reference (anaphora, ellipsis) may be processed. The system may be capable of reasoning, of error or incoherence detection and internal correction, and of anticipation and prediction.
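One use of the dialogue history is reference resolution. The following is a deliberately simplistic sketch of one possible strategy (resolve a pronoun to the most recently mentioned entity); real systems use far richer models, and every name here is hypothetical:

```python
# Dialogue history: entities mentioned in previous exchanges,
# most recent last (a stand-in for a full dialogue history).
history = []

def mention(entity: str) -> None:
    history.append(entity)

def resolve(utterance: str) -> str:
    """Replace the pronoun 'it' with the last entity mentioned."""
    if "it" in utterance.split() and history:
        return utterance.replace("it", history[-1])
    return utterance

mention("the 14:05 train")
resolved = resolve("when does it arrive")
```

Without the history built up during the dialogue, the pronoun in the second utterance could not be interpreted at all, which is why such context cannot live in the static part of the models.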
In this case, the complexity of the spoken language dialogue is compounded by the fact that the result of speech recognition has to be merged with information delivered by other means of communication (media). The dialogue is itself dependent on the system model. Each piece of information delivered by a medium must be timestamped, since the different media do not process information at the same speed, and the dialogue manager has to take event chronology into account.
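The chronology requirement can be shown with a small sketch: events arriving from different media carry timestamps, and the dialogue manager merges them into one time-ordered stream before fusion. The event contents below (a spoken deictic utterance plus two pointing gestures) are an invented example:

```python
from dataclasses import dataclass

@dataclass
class Event:
    medium: str       # "speech", "pointing", "touch", ...
    timestamp: float  # seconds since the start of the interaction
    payload: str

def merge_by_chronology(*streams):
    """Merge per-medium event streams into one time-ordered sequence."""
    return sorted((e for s in streams for e in s), key=lambda e: e.timestamp)

speech  = [Event("speech", 1.40, "put that there")]
gesture = [Event("pointing", 1.10, "object#7"),
           Event("pointing", 1.65, "location#3")]
timeline = merge_by_chronology(speech, gesture)
```

Only after this reordering can the manager align "that" with the gesture that preceded the utterance and "there" with the one that followed it; processing each medium's stream in arrival order would lose that alignment.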
The first category of systems (menu systems) is now used in several real-world application domains (enquiries about cinema programmes, travel timetables, bank accounts, etc.). Most applications deployed in the field work over the telephone and are used by the general public. Members of the two other categories are mostly still industrial and laboratory prototypes, which still impose a lot of constraints (such as a training phase, and a quiet environment) on the user. However, this position is steadily changing as more advanced interactive systems come to be deployed in the field.