Until relatively recently there has been very little discussion of data in the natural language processing (NLP) literature (but see the results of the EAGLES Working Group on written text corpora). While speech technologists concerned themselves with the task of collecting, managing and exploiting speech corpora, computational linguists tended to work primarily with native speaker intuitions, though an increasing number of researchers now practise corpus-based NLP. The lack of attention to real text data is beginning to be addressed through a growing interest in the collection of corpora, which are used to define the coverage of grammars and parsers. For example, the Linguistic Data Consortium, established with U.S. Government funding, aims to collect and distribute large quantities of computer-readable speech and text. This is just one of a number of initiatives around the world to collect very large corpora (counted in millions of words) from which to extract information about aspects of language use, a key area also addressed by the EAGLES initiative.
In the past, much of the data for speech recogniser training has been generated on demand by having a number of people read each of the words to be learned (or sentences containing them), perhaps several times. This approach, though yielding reasonably good results, has to some extent been discredited by work demonstrating the significant differences which exist between read and spontaneous speech [Soclof (1990), for example]. It is becoming increasingly clear that in order to obtain the best possible training data for speech recognition systems it is necessary to collect and analyse samples of real spontaneous speech. Needless to say, this is a much more costly and time-consuming exercise than simply having a small group of cooperative subjects read prepared scripts in the laboratory.
Corpora of spoken language have much to offer for the training or corpus-tuning of speech recognition and speech synthesis systems, in particular for stochastic language models and Hidden Markov Model techniques in the decoder stages of speech recognition. Do they have a role to play in the process of designing spoken language dialogue systems?
The answer to this question must be a clear ``yes''. Many of the objections to using native speakers' intuitions as design data can be addressed by using observational data. One of the problems with intuitions is that the space of possibilities in spoken language dialogue is extremely large: there are just too many different possibilities to allow the designer to explore them by introspection alone. What a corpus of dialogues offers is concrete evidence to give the system designer a strategy for handling the problem. Speech recognisers in general support a restricted finite vocabulary, bounded by the limitations of the current technology. Suppose that some speech recognition system is only capable of operating in real time if it has a lexicon of 100 words or fewer. The most reasonable way to decide which words to include in the lexicon (assuming that the user is not explicitly restricted by means of a menu) would be to select the 100 most frequently occurring words in the chosen domain of discourse. While people can perhaps make fairly good guesses at the two or three most frequently occurring words, no-one could discriminate reliably between the 100th and the 101st words on the basis of intuitions alone.
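Frequency-based lexicon selection of this kind is straightforward to carry out once domain transcripts are available. The following is a minimal sketch; the toy transcripts and the function name are purely illustrative.

```python
from collections import Counter

def select_vocabulary(transcripts, size=100):
    """Rank words by frequency across domain transcripts and keep the top `size`."""
    counts = Counter()
    for utterance in transcripts:
        counts.update(utterance.lower().split())
    return [word for word, _ in counts.most_common(size)]

# Toy corpus standing in for transcribed domain dialogues
transcripts = [
    "i want to book a flight to paris",
    "book a return flight to london please",
    "i want a flight on friday",
]
vocab = select_vocabulary(transcripts, size=5)
```

In a real collection exercise the transcripts would number in the thousands, and the cut-off between the 100th and 101st words would rest on counts rather than guesswork, which is precisely the point made above.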
What is true in the case of lexical selection is also true for higher levels of the system. Suppose that, at a given point in a human-computer dialogue, a cooperative user could reasonably say almost anything. The only way to design a practical dialogue manager would be to equip the system to deal with the most likely cases and to provide it with a repertoire of general purpose recovery strategies to enable it to repair understanding failures and proceed in an orderly fashion with the dialogue when unanticipated utterances are produced. The task of prioritising which cases to manage specifically and which to leave to general failure repair mechanisms is, perhaps, even more difficult than that of selecting which words to include in the recognition vocabulary. By observing human-human dialogues in the chosen application domain it is possible to base these difficult design decisions on a solid foundation of empirical fact, rather than on the shifting sand of mere conjecture.
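The division of labour described above, anticipated cases handled specifically, everything else passed to a general repair strategy, can be sketched in a few lines. The predicates and responses here are hypothetical placeholders, not a proposal for any particular system.

```python
def handle_utterance(utterance, handlers, recovery):
    """Try anticipated cases in priority order; otherwise invoke a repair strategy."""
    for matches, respond in handlers:
        if matches(utterance):
            return respond(utterance)
    return recovery(utterance)

# Anticipated cases, ordered by priority (illustrative only)
handlers = [
    (lambda u: "flight" in u, lambda u: "Which destination?"),
    (lambda u: "cancel" in u, lambda u: "Which booking should I cancel?"),
]

# General-purpose repair for unanticipated utterances
recovery = lambda u: "Sorry, I did not understand. Could you rephrase?"

handle_utterance("i need a flight", handlers, recovery)
handle_utterance("tell me a story", handlers, recovery)
```

The empirical question, which cases deserve a dedicated handler and which can safely fall through to recovery, is exactly the prioritisation problem that corpus observation is meant to inform.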
So far, all that we have done is to argue for the proposition that some empirical data is better than none. Now we must pause to consider just how reliable human-human dialogues can be as data sources for dialogue system design.
Most languages have a ``Standard'' dialect which serves as the basis for NLP systems, sometimes to the exclusion of a significant number of speakers of the language. However, it should not be thought that (ignoring minor idiolectal differences between speakers) dialects are constant across all situations of use. On the contrary, each dialect encompasses a rich variety of different registers. A register is a variety of the language which is selected according to the context of use. So, for example, one might greet an old friend with the word Hi, a bank manager with the words Good morning, and a complete stranger with the words How do you do? Consider the linguistic differences to be found in a conversation about the weather with a small child or a conversation on a similar theme with a potential employer during a job interview. The interactions are likely to differ markedly in lexical selection, grammatical structure, formality, intonation, and indeed almost every conceivable linguistic dimension.
Speakers are able to reason to some extent about how they might speak in completely new situations. For example, many people will have experienced exchanges with an aggressive or brutal interlocutor, have been the weaker partner in a hierarchical relationship, have had conversations with individuals of whom they are afraid, have been asked difficult questions in personal contexts or in examinations, have at some time lost face in conversation, and so on.
What no-one yet has any extensive experience of is engaging in fairly free natural language dialogue with an asocial artificial being. A very considerable part of human-human talk is taken up with interpersonal concerns. Indeed, a legitimate question which has been raised is whether or not the word ``conversation'' makes any sense in the context of human-computer dialogue, so intimately entwined are our notions of conversation and social interaction [Button (1990)]. It is hard to know where to begin speculating how people might react when faced with a non-human dialogue partner. The one safe bet which can be placed is that, in the words of [Jönsson & Dahlbäck (1988)], ``talking to your computer is not like talking to your best friend''.
A result of this is that we have no safe grounds for extrapolating a detailed specification of a spoken natural language dialogue system on the basis of a corpus of human-human dialogue, even if this corpus contains dialogues addressing the tasks foreseen for the planned interactive dialogue system in the target domain.
In summary, human-human dialogue data has considerable value for building an understanding of the domain and its component tasks. In the absence of other information, it can be used to construct initial vocabularies and language models for systems. However, time should be allowed for refining the vocabulary and language models to compensate for possible linguistic changes introduced by the non-personal nature of human-computer dialogues.
Where possible, use human-human dialogue data to build an understanding of the domain and its component tasks.
In the absence of simulation data, use human-human dialogue data to create vocabularies, language models, and dialogue automata, augmented where necessary by careful use of linguistic intuitions.
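To make the language-model half of this guideline concrete, here is a minimal sketch of a maximum-likelihood bigram model estimated from transcribed dialogue data; the three-sentence corpus is illustrative, and a practical model would also need smoothing for unseen word pairs.

```python
from collections import Counter

def train_bigram_model(sentences):
    """Estimate P(w_i | w_{i-1}) by relative frequency, with sentence markers."""
    bigrams = Counter()
    contexts = Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        contexts.update(tokens[:-1])          # every token that serves as a context
        bigrams.update(zip(tokens, tokens[1:]))
    return {pair: count / contexts[pair[0]] for pair, count in bigrams.items()}

# Toy transcripts standing in for human-human dialogue data
corpus = ["book a flight", "book a hotel", "cancel a flight"]
model = train_bigram_model(corpus)
```

Here `model[("book", "a")]` is 1.0, since ``a'' follows ``book'' in every case, while `model[("a", "flight")]` is 2/3; these are exactly the kinds of estimates that would later be refined as human-computer data becomes available.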
Before leaving the topic of observational data, it is worth mentioning a technique which has recently emerged, primarily for the purpose of collecting speech data for training and testing recognisers. This is the so-called system-in-the-loop method, in which users interact with an existing dialogue system while the data generated is collected. Obviously, there are limitations to the approach. First, it presupposes that a dialogue system is already available for data collection. Second, it restricts the exercise to collecting data on usage patterns of a current system, when the planned future system may embody much greater functionality, or operate in a quite different domain.
System-in-the-loop data collections are useful for collecting speech data, and they may supply some baseline facts about how people use spoken language dialogue systems. However, this method should be used to gather more detailed data to guide future system specification only if the functionality of the future system is planned to be a small step beyond that of the current system on which the data collection is based.