By a speech output system we mean an artifact, whether a
dedicated machine or a computer programme, that produces signals
intended to be functionally equivalent
to speech produced by humans.
At the present time
speech output systems
generally produce audio signals only, but laboratory systems
are being developed that
supplement the audio signal with the visual image of the
(artificial) talker's face [Benoît (1991), Benoît et al. (1992)]. Audio-visual (or
bi-modal) speech output is
more intelligible than audio-only output, especially when the
audio channel is of degraded
quality. In this chapter we will not be concerned
with bi- or multimodal speech output systems, but will concentrate on audio-only
output instead.
We exclude from the domain of speech output systems such
devices as tape recorders and
other, more advanced, systems that output speech on the basis
of complete, pre-stored
messages (``canned speech'' or ``copy synthesis''), irrespective
of the type of coding or
information compression used to save storage space. We
crucially limit our definition to
systems that allow the generation of novel messages, either
from scratch (i.e. entirely by
rule) or by recombining shorter pre-stored units. This
definition also includes hybrid
synthesis systems where individually stored words (e.g.
digits) are substituted into information slots in a carrier sentence
(e.g. in time-table consultation services).
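As a rough illustration of such slot-filling, the sketch below (in Python, purely for concreteness) shows how individually stored words could be substituted into a carrier sentence of the kind used in a time-table consultation service. The carrier text, slot names and unit inventory are hypothetical assumptions of ours, not taken from any particular system.

# Minimal sketch of hybrid (slot-and-filler) synthesis: individually stored
# words are substituted into fixed information slots of a carrier sentence.
# The carrier text, slot names and unit inventory are illustrative only.

CARRIER = "The train to {destination} departs at {hour} {minutes} from platform {platform}."

# Inventory of individually stored units; a real system would map each
# identifier to a recorded waveform rather than to a text label.
STORED_UNITS = {
    "destination": {"Utrecht", "Leiden", "Amsterdam"},
    "hour": {str(h) for h in range(24)},
    "minutes": {f"{m:02d}" for m in range(60)},
    "platform": {str(p) for p in range(1, 16)},
}

def fill_carrier(slots):
    """Substitute stored words into the carrier sentence, after checking that
    each requested filler is actually available in the unit inventory."""
    for name, value in slots.items():
        if value not in STORED_UNITS.get(name, set()):
            raise ValueError(f"no stored unit {value!r} for slot {name!r}")
    return CARRIER.format(**slots)

print(fill_carrier({"destination": "Utrecht", "hour": "14",
                    "minutes": "05", "platform": "7"}))
# -> The train to Utrecht departs at 14 05 from platform 7.

A real service would of course concatenate the recorded waveforms for the carrier fragments and the fillers rather than return a text string, but the division of labour between a fixed carrier and substituted units is the same.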
It seems to us that two basic types of speech output systems
have to be distinguished on
the basis of their input, namely text-to-speech (TTS) and
concept-to-speech (CTS). Other,
more complex, systems combine characteristics of these two.
- TEXT-TO-SPEECH.
The majority of speech
output systems are driven by text
input. These systems convert text printed in normal
orthography (generally stored
in a computer memory as ASCII codes) to speech.
Conventional spelling provides a
reasonable indication of what sounds and words have to be
output, but typically
underrepresents prosodic properties of the message, such as
the positions of
accents, speech melody, and temporal organisation, including
speech rhythm. The prosody of
an utterance reflects, among other things,
the communicative intentions of the writer of the input
text, which cannot be reconstructed from the text alone;
note the title of
a much-cited article: ``Accent is predictable, if you're a
mind reader'' [Bolinger (1972)]. The reconstruction of the writer's
intentions is an
implicit part of the so-called linguistic interface, i.e. the first stage of
most
advanced text-to-speech
systems. Any errors in the linguistic interface may detract
from the quality of the
output speech, and are therefore a legitimate object of
evaluation.
- CONCEPT-TO-SPEECH.
In other
types of speech output systems, especially
dialogue systems, the communicative intentions are fully
specified at the input
stage: the system itself determines what message it wants
to get across. It may still
be the case, of course, that the dialogue system has
misconstrued a user's request,
and consequently issues an inappropriate response message,
but this should not be
considered an error on the part of the output system.
- INTERPRETING (OR TRANSLATING) TELEPHONY
(SL-TRANS, cf. [Morimoto et al. (1990)];
JANUS, cf. [Waibel et al. (1991)]) and face-to-face
spoken dialogue
translation
[Wahlster (1993), VERBMOBIL] combine characteristics of both TTS
and CTS. In interpreting telephony, for instance, a spoken
utterance in one language
(e.g. Japanese) is decomposed into its linguistic message
and its speaker-specific
properties (e.g. voice characteristics, speed,
pitch range). The linguistic message is
converted to text, and transmitted. At the receiver end the
text is automatically
translated into another language (e.g. German) and then
converted back to speech
in the target language, with the synthesiser's speaker-specific
parameters set such that
the personal characteristics of the source speaker are
approximated in the output
signal. Crucially, the sender's intentions do not have to
be inferred from the textual
representation of the message; the intended focus
distribution can be reconstructed
directly from the properties of the source-language speech
signal.
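The following Python sketch outlines this processing chain in schematic form. It is a minimal illustration under our own assumptions: the function names, data fields and dummy placeholder bodies are ours for concreteness, and do not reflect the actual architecture of SL-TRANS, JANUS or VERBMOBIL.

# Schematic sketch of an interpreting-telephony chain: the source-language
# utterance is decomposed into (a) its linguistic message, converted to text,
# and (b) speaker-specific properties (voice characteristics, speed, pitch
# range); the text is translated and re-synthesised in the target language
# with the synthesiser's speaker parameters set to approximate the source
# speaker.  All names, fields and placeholder bodies are illustrative only.

from dataclasses import dataclass

@dataclass
class SpeakerProperties:
    voice_quality: str        # coarse label for the source voice timbre
    speaking_rate: float      # e.g. syllables per second
    pitch_range: tuple        # (minimum Hz, maximum Hz)

@dataclass
class AnalysedUtterance:
    text: str                 # the linguistic message as text
    speaker: SpeakerProperties  # properties to be restored at synthesis

def analyse(speech_signal):
    """Placeholder for recognition plus speaker analysis of the source speech."""
    return AnalysedUtterance(
        text=str(speech_signal),  # dummy: input is treated as recognised text
        speaker=SpeakerProperties("modal", 4.5, (80.0, 220.0)),
    )

def translate(text, source_lang, target_lang):
    """Placeholder for automatic translation of the transmitted text."""
    return f"[{source_lang}->{target_lang}] {text}"

def synthesise(text, speaker, target_lang):
    """Placeholder for target-language synthesis; a real system would set the
    synthesiser's speaker-specific parameters from `speaker`."""
    return f"<speech lang={target_lang} rate={speaker.speaking_rate}/s> {text}"

def interpret(speech_signal, source_lang="ja", target_lang="de"):
    analysed = analyse(speech_signal)                                # sender side
    translated = translate(analysed.text, source_lang, target_lang)  # transmission and MT
    return synthesise(translated, analysed.speaker, target_lang)     # receiver side

The point of the sketch is that the speaker-specific properties travel alongside the translated text, so that at the receiving end the prosodic intentions need not be guessed from the textual representation alone.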