What are speech output systems?

By a speech output system we mean some artifact, whether a dedicated machine or a computer programme, that produces signals intended to be functionally equivalent to speech produced by humans. At present, speech output systems generally produce audio signals only, but laboratory systems are being developed that supplement the audio signal with the visual image of the (artificial) talker's face [Benoît (1991), Benoît et al. (1992)]. Audio-visual (or bimodal) speech output is more intelligible than audio-only output, especially when the audio channel is of degraded quality. In this chapter we will not be concerned with bi- or multimodal speech output systems, and will concentrate on audio-only output instead.

We exclude from the domain of speech output systems such devices as tape recorders and other, more advanced, systems that output speech on the basis of complete, pre-stored messages ("canned speech" or "copy synthesis"), irrespective of the type of coding or information compression used to save storage space. We crucially limit our definition to systems that allow the generation of novel messages, either from scratch (i.e. entirely by rule) or by recombining shorter pre-stored units. This definition also includes hybrid synthesis systems where individually stored words (e.g. digits) are substituted in information slots in a carrier sentence (e.g. in timetable consultation services).
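The hybrid scheme described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a description of any particular system: the carrier sentence, the slot names, the unit names and the file names are all hypothetical. The point it shows is that a small inventory of stored units yields messages that were never stored as a whole.

```python
# Hypothetical carrier sentence: fixed stored phrases plus named slots.
CARRIER = ["the_train_to", "{destination}", "leaves_at", "{hour}", "{minute}"]

# Hypothetical inventory of pre-stored recordings (file names are placeholders).
STORED_UNITS = {
    "the_train_to": "carrier_01.wav",
    "leaves_at": "carrier_02.wav",
    "utrecht": "city_utrecht.wav",
    "leiden": "city_leiden.wav",
    "nine": "num_09.wav",
    "thirty": "num_30.wav",
}


def fill_carrier(carrier, slots):
    """Return the stored recordings to play, in order, for one message."""
    sequence = []
    for item in carrier:
        if item.startswith("{") and item.endswith("}"):
            # Slot: look up the word chosen for this message.
            unit = slots[item[1:-1]]
        else:
            # Fixed part of the carrier sentence.
            unit = item
        sequence.append(STORED_UNITS[unit])
    return sequence


playlist = fill_carrier(
    CARRIER, {"destination": "utrecht", "hour": "nine", "minute": "thirty"}
)
other = fill_carrier(
    CARRIER, {"destination": "leiden", "hour": "nine", "minute": "thirty"}
)
```

Changing a single slot value produces a different message from the same stored units, which is what separates this approach from canned speech.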

It seems to us that two basic types of speech output systems have to be distinguished on the basis of their input, namely text-to-speech (TTS) and concept-to-speech (CTS). Other, more complex, systems combine characteristics of these two.

TEXT-TO-SPEECH. The majority of speech output systems are driven by text input. These systems convert text printed in normal orthography (generally stored in a computer memory as ASCII codes) to speech. Conventional spelling provides a reasonable indication of what sounds and words have to be output, but typically underrepresents prosodic properties of the message, such as the positions of accents, speech melody, and temporal organisation, including speech rhythm. The prosody of an utterance reflects, among other things, the communicative intentions of the writer of the input text, which cannot be reconstructed from the text alone; note the title of a much-cited article: "Accent is predictable, if you're a mind reader" [Bolinger (1972)]. The reconstruction of the writer's intentions is an implicit part of the so-called linguistic interface, i.e. the first part of most advanced text-to-speech systems. Errors in the linguistic interface may detract from the quality of the output speech, and are therefore a legitimate object of evaluation.

CONCEPT-TO-SPEECH. In other types of speech output systems, especially dialogue systems, the communicative intentions are fully specified at the input stage: the system itself determines what message it wants to get across. It may still be the case, of course, that the dialogue system has misconstrued a user's request, and consequently issues an inappropriate response message, but this should not be considered an error on the part of the output system.

INTERPRETING (OR TRANSLATING) TELEPHONY (SL-TRANS, cf. [Morimoto et al. (1990)]; JANUS, cf. [Waibel et al. (1991)]) and face-to-face spoken dialogue translation [Wahlster (1993), VERBMOBIL] combine characteristics of both TTS and CTS. In interpreting telephony, for instance, a spoken utterance in one language (e.g. Japanese) is decomposed into its linguistic message and its speaker-specific properties (e.g. voice characteristics, speed, pitch range). The linguistic message is converted to text, and transmitted. At the receiver end the text is automatically translated into another language (e.g. German) and then converted back to speech in the target language, with the synthesiser's speaker-specific parameters set such that the personal characteristics of the source speaker are approximated in the output signal. Crucially, the sender's intentions do not have to be inferred from the textual representation of the message; the intended focus distribution can be reconstructed directly from the properties of the source language speech signal.
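The data flow in interpreting telephony can be sketched as follows. This is a schematic illustration only, not the design of SL-TRANS, JANUS or VERBMOBIL: all class, field and function names are hypothetical, and a toy two-word lexicon stands in for machine translation. What it shows is that the text, the focus information and the speaker properties travel as separate pieces of information, so the accent placement in the output is taken from the analysed source signal rather than guessed from text.

```python
from dataclasses import dataclass


@dataclass
class SpeakerProperties:
    voice: str          # label for the voice characteristics
    speed: float        # relative speaking rate
    pitch_range: float  # relative pitch range


@dataclass
class AnalysedUtterance:
    text: str                    # linguistic message in the source language
    focus_words: list            # words in focus, read off the source signal
    speaker: SpeakerProperties   # speaker-specific properties


def translate(text, lexicon):
    # Placeholder word-for-word lookup; a real system would translate properly.
    return [lexicon.get(word, word) for word in text.split()]


def plan_synthesis(utterance, lexicon):
    """Build a synthesis request for the target language."""
    target_words = translate(utterance.text, lexicon)
    # Carry focus over to the translated words instead of inferring it from text.
    accented = {lexicon.get(word, word) for word in utterance.focus_words}
    return {
        "words": target_words,
        "accents": [word in accented for word in target_words],
        "speaker": utterance.speaker,  # reused to approximate the source speaker
    }


toy_lexicon = {"train": "Zug", "today": "heute"}  # toy stand-in for translation
source = AnalysedUtterance(
    text="train today",
    focus_words=["today"],
    speaker=SpeakerProperties(voice="speaker_a", speed=1.1, pitch_range=0.8),
)
plan = plan_synthesis(source, toy_lexicon)
```

In a pure text-to-speech system the `accents` field would have to be predicted by the linguistic interface; here it is supplied by the analysis of the source utterance.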

EAGLES SWLG SoftEdition, May 1997.