4  Modules for Incremental Spoken Dialogue Systems

4.1  Speech Recognition

In Baumann et al., [2009a] ("Assessing and improving the performance of speech recognition for incremental systems") we examined the output of an ASR component (in our case, the open-source package Sphinx4; Walker et al., [2004]) that we queried for its current hypothesis continuously and concurrently to an ongoing utterance, rather than letting it endpoint (i.e., determine the extent of) the utterance. We found that the output was characterised by a high degree of instability: typically, the last one or two words would change from one query moment to the next. (This finding was recently replicated for a different ASR system by Selfridge et al., [2011].)
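The instability described above can be made concrete by diffing consecutive partial hypotheses into "revoke" and "add" edits. The following is a minimal sketch of such a diff, not the actual Sphinx4 or project API; the function name and example words are illustrative.

```python
def diff_hypotheses(prev, curr):
    """Return (revoked, added) word lists between two partial ASR hypotheses."""
    # Find the longest common prefix of the two word sequences.
    common = 0
    for a, b in zip(prev, curr):
        if a != b:
            break
        common += 1
    revoked = prev[common:]   # words retracted since the last query
    added = curr[common:]     # words newly hypothesised
    return revoked, added

# Typical instability: the most recent word flips between two queries.
revoked, added = diff_hypotheses(["take", "the", "red"],
                                 ["take", "the", "grid"])
```

Downstream incremental modules can then consume such edit messages instead of re-reading the whole hypothesis at every query.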

This being a new field, we had to define metrics that capture the additional dimensions of partiality of output and of timing relative to the partial input (metrics which we later found to be useful for the evaluation of other incremental modules as well). With these metrics in hand, we looked at ways to improve this output. We found that a simple method - increasing the amount of right context allowed to the ASR by letting the incremental output lag behind - increased the stability of the output, but at the cost of reduced timeliness. A more sophisticated method which directly addressed the stability of edits - only passing on those changes that persisted over several query episodes - resulted in a better trade-off of stability and timeliness, and was ultimately what we used in our systems.
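The second method mentioned above can be sketched as follows: a word is only passed on once it has survived a given number of consecutive queries unchanged. This is a toy illustration of the persistence idea, not the project's implementation; the class name and the parameter `n` are assumptions.

```python
class StabilitySmoother:
    """Only commit words that have survived `n` consecutive ASR queries
    unchanged (a sketch of the persistence heuristic)."""

    def __init__(self, n=3):
        self.n = n
        self.words = []
        self.age = []        # per-word count of consecutive confirmations

    def update(self, hypothesis):
        # Keep the ages of the unchanged prefix, reset the rest.
        common = 0
        for a, b in zip(self.words, hypothesis):
            if a != b:
                break
            common += 1
        self.age = self.age[:common] + [0] * (len(hypothesis) - common)
        self.words = list(hypothesis)
        self.age = [c + 1 for c in self.age]
        # Output only words old enough to be considered stable.
        return [w for w, c in zip(self.words, self.age) if c >= self.n]
```

Raising `n` trades timeliness for stability, which is exactly the trade-off the paper quantifies.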

(Optimization and evaluation is also discussed in Baumann et al., [2011] ("Evaluation and optimization of incremental processors"), a journal paper that collects and generalises our various attempts at evaluating incremental processors.)

In Baumann et al., [2009b] ("Evaluating the potential utility of asr n-best lists for incremental spoken dialogue systems") we extended our investigations of incremental ASR to look at the whole n-best list during incremental processing. We found that if we were able to re-rank hypotheses, there would be potential to improve according to our metrics. However, this would come at an enormous computational overhead, as this re-ranking would have to be performed continuously, at every query step. For our own work, we concluded that this was not something that could currently realistically be done.

4.2  Natural Language Understanding

In a number of papers, we investigated the task of incremental natural language understanding, that is, of assigning some kind of meaning representation to potentially 'unfinished' utterances (strings of words). We explored both rule-based methods and statistical, data-driven methods.

4.2.1  Rule-Based iNLU

In Atterer and Schlangen, [2009] ("RUBISC - a robust unification-based incremental semantic chunker") we presented a parser for regular grammars that can robustly assign a simple semantic representation to utterance prefixes. In Atterer et al., [2009] ("No sooner said than done? testing incrementality of semantic interpretations of spontaneous speech") we evaluated this parser against a dataset with incremental gold-standard semantics, finding that quite often all the relevant information an utterance delivers has been delivered before the utterance is over.

4.2.2  Statistical iNLU

Siebert and Schlangen, [2008] ("A simple method for resolution of definite reference in a shared visual context") did not directly deal with incremental processing, but its concerns connect to the main focus of the project. In this paper, we looked at how domain-specific meanings of words can be learned from a corpus. We did this within one of the main domains of the project, that of building a puzzle, and we found that a simple approach that learns relevant visual features achieved a high degree of robustness and surprisingly good results. However, as developed in that paper, the composition process that computes the overall meaning was not incremental. A next step, which we have not taken yet, would be to incrementalise composition in order to integrate this method into a real incremental module.

In contrast, in Schlangen et al., [2009] ("Incremental reference resolution: The task, metrics for evaluation, and a bayesian filtering model that is sensitive to disfluencies") we explored the task of incrementally determining the referent of a referring expression. We trained a statistical model that computed a belief distribution over possible referents, updating on each new word. Interestingly, when including silence as an information source (by turning it into a pseudo-word, basically), we found that the model replicated findings from the psycholinguistic literature, namely that hesitations often precede references to hard-to-describe objects, a cue which listeners pick up on.

The model trained in this work can be seen as an example of an "input-incremental-only" approach, where partial input is accepted, but the output is continuously of the same type, i.e., is not composed out of increments. In Heintze et al., [2010] ("Comparing local and sequential models for statistical incremental natural language understanding") we extended such an approach beyond reference resolution to the prediction of full semantic representations (inspired by work by Sagae et al., [2009] and DeVault et al., [2009]). We trained and compared models that predict such full representations ("input-incremental-only") and models that also build up their output representation incrementally.

4.3  Prosody Processing, and Tracking of Conversational Floor

While Schlangen, [2006] ("From reaction to prediction: Experiments with computational models of turn-taking") predates the project, it rehearsed some of the questions that we also tackled in this project. In that paper, we investigated different information sources for deciding whether a user utterance was intended to be finished or not. We systematically moved the decision point from coming after silence of the user to before the predicted event, i.e. to within the utterance. The information sources investigated in this paper were prosody and n-gram language models; in Atterer et al., [2008] ("Towards incremental end-of-utterance detection in dialogue systems") we additionally included features from an incremental parser.
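The move "from reaction to prediction" can be caricatured as follows: instead of waiting for a long silence, combine a much shorter silence threshold with a model-based estimate of whether the utterance prefix is likely complete. This is a toy decision rule, not the classifiers from the papers; all thresholds and the probability input are invented.

```python
def utterance_finished(silence_ms, p_end, silence_thresh=200, p_thresh=0.6):
    """Declare end-of-utterance early if the model judges the prefix likely
    complete; otherwise fall back to a long, purely reactive silence wait."""
    if silence_ms >= silence_thresh and p_end >= p_thresh:
        return True                            # predictive decision
    return silence_ms >= 4 * silence_thresh    # reactive fallback

# Short silence plus a high completion estimate already triggers a decision.
decision = utterance_finished(250, 0.8)
```

In the papers, `p_end` would be supplied by prosodic features, n-gram language models, or parser state rather than being given directly.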

Baumann, [2008] ("Simulating spoken dialogue with a focus on realistic turn-taking") brought a similar classification approach into a live setting, albeit one in simulation. In the work reported in this paper, two instances of a conversational system exchanged (meaningless, pre-recorded) audio messages, with the turn-taking of the systems controlled by simple rules over floor states, which were computed by the classifiers. We showed that simple rules using our classifiers can generate realistic (in terms of distribution of floor and overlap) turn-taking patterns.

The systems described below (Section 5) use our online implementation of f0-tracking algorithms within the Sphinx framework, and use simple rules to classify user silences according to the preceding boundary tones.

4.4  Dialogue Management

With incremental interpretation results available, the question arises what use the dialogue manager can make of this information, and which behaviours profit from potentially being triggered during a user's utterance. In Buß and Schlangen, [2010] ("Modelling sub-utterance phenomena in spoken dialogue systems"), we discussed in general terms which "sub-utterance" phenomena appear to be particularly useful in spoken dialogue systems, and sketched an approach to dialogue management that can model them. We argued for an approach to achieving high reactivity that does indeed leave the generation of this reactivity to the dialogue manager and does not factor it out to some additional reactive layer which does not have access to the dialogue state, as is done in some recent work.

In Buß et al., [2010] ("Collaborating on utterances with a spoken dialogue system using an isu-based approach to incremental dialogue management") we built on this work and described our dialogue manager and system that is able to provide feedback to the user on whether a given referring expression is already informative enough or not.

In Buß and Schlangen, [2011] ("DIUM - An Incremental Dialogue Manager That Can Produce Self-Corrections") we discussed a problem that occurs in this approach, namely that sometimes behaviour might have been executed that turns out to have been based on input that needs to be revised. We presented a dialogue manager that can handle such situations by offering self-corrections. The dialogue management approach is a further development of the models described in our previous papers; as a new feature, it stresses the similarities between incrementing utterances and incrementing dialogues through contributions by modelling both in our IU model (see Section 3.1 above).
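The self-correction situation can be illustrated with a deliberately minimal toy: the dialogue manager has already acted on an input increment which is later revoked, and rather than silently switching, it verbalises the repair. Class and method names here are purely illustrative and do not reflect the DIUM implementation.

```python
class ToyDM:
    """Toy illustration of acting on an increment and later self-correcting
    when that increment is revised by the recogniser."""

    def __init__(self):
        self.committed = None

    def integrate(self, referent):
        # The DM commits to (and acts on) the current best hypothesis.
        self.committed = referent
        return f"Taking the {referent}."

    def revoke(self, new_referent):
        # The earlier input was revised: produce an overt self-correction.
        old, self.committed = self.committed, new_referent
        return f"Sorry, not the {old}, the {new_referent}."

dm = ToyDM()
dm.integrate("cross")
repair = dm.revoke("bridge")
```

The point of the real model is that both the original contribution and the repair are represented uniformly as incremental units.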

4.5  Output Generation and Timing

In Baumann and Schlangen, [2011] ("Predicting the micro-timing of user input for an incremental spoken dialogue system that completes a user's ongoing turn"), we devised a method for actually synthesising completions of a user's utterance (where we factored out the task of predicting this continuation). The idea, basically, was to let an off-the-shelf TTS system synthesise the whole predicted utterance (including the parts that were already heard), scaling the resulting utterance by comparing the predicted timing for the already-heard parts with the actual timing. This method worked surprisingly well, producing utterances whose timing deviated very little from that of the actual utterances.
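The scaling step can be sketched as a single ratio: how much slower or faster was the user than the TTS on the shared prefix, applied to the predicted durations of the completion. The numbers and function name below are illustrative, not values from the paper.

```python
def completion_timing(tts_prefix_dur, observed_prefix_dur, tts_completion_durs):
    """Scale the TTS-predicted word durations of the completion by the ratio
    of observed to synthesised duration of the already-heard prefix."""
    factor = observed_prefix_dur / tts_prefix_dur
    return [d * factor for d in tts_completion_durs]

# The TTS predicted 1.2 s for the prefix, the user actually took 1.5 s,
# so the completion's word durations are stretched by the same factor.
scaled = completion_timing(1.2, 1.5, [0.30, 0.20])
```

The same factor could in principle also drive the speech rate parameter of the synthesiser directly, rather than post-scaling durations.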

4.6  Evaluation of Incremental Processing Modules

Baumann et al., [2011] ("Evaluation and optimization of incremental processors") collects and systematises our work on metrics for evaluating incremental processors. We discuss situations where incremental gold-standard data is available (that is, data where the time-dimension that incremental processing introduces is annotated as well) vs. those where it isn't, and look at three families of metrics for evaluation of incremental processors: similarity metrics, which compare actual and gold-standard output in terms of content; timing metrics, which look at the temporal dimension; and diachronic metrics, which look at the development of hypotheses over time.
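To make the diachronic family concrete, one metric in the spirit of the paper is "edit overhead": the fraction of edit messages that were unnecessary, i.e. later revised. The sketch below uses a simplified definition for illustration; the paper's exact formulation may differ in detail.

```python
def edit_overhead(num_edits, num_necessary):
    """Fraction of unnecessary edits; 0.0 for a perfectly stable processor
    that only ever adds the words of the final output."""
    return (num_edits - num_necessary) / num_edits

# A five-word utterance minimally needs five add-edits; an unstable ASR
# that issued twelve edits (adds plus revokes) wasted the difference.
oh = edit_overhead(12, 5)
```

Such a measure complements timing metrics: a processor can be perfectly stable yet useless if its output arrives only at the end of the utterance.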


This website reports on some results of the research project "InPro", which was led by David Schlangen and ran from 2006 to 2011.