Why speech output assessment?

Next: Users of this chapter Up: Introduction Previous: What are speech output

Why speech output assessment?

In spite of the rapid progress that is being made in the field of speech technology, any speech output system available today can still be spotted for what it is: non-human, a machine. Most older systems will fail immediately due to their robot-like melody and garbled vowels and consonants. Other, more recently developed synthesis methods using short-segment waveform concatenation techniques such as PSOLA [Moulines & Charpentier (1990)] yield segmental quality that is very close to human speech [Portele et al. (1994)], but still suffer from noticeable defects in matters of melody and timing.

As long as synthetic speech is inferior to human speech, speech output assessment will be a major concern. Speech technology development today is typically evaluation-driven. Large scale speech technology programmes have been launched both in the United States and in Europe [O'Malley & Caisse (1987), Van Bezooijen & Pols (1989), Pols (1991), for overviews see,]. Especially in the European Union, with its many official languages, a strong need was felt for output quality assessment methods and standards that can be applied across languages. With this goal in mind the multinational EU-ESPRIT SAM project was set up [Fourcin et al. (1989)], and later the EU Expert Advisory Group on Language Engineering Standards (EAGLES) programme started; both initiatives included a working group on speech output assessment.

Speech output assessment may be of crucial importance to two interested parties, the systems designers and developers on the one hand, and the prospective buyers and end users of the system (possibly represented by consumer organisations) on the other.

Developers are intent on improving their speech output systems. However, designers who have grown up with their system are used to all its habits; they are likely to understand its output much better than first-time users, and will overrate its performance level. Less subjective quality assessment techniques are needed in order to determine how well a system performs relative to a benchmark test , or how favourably it compares with a previous edition of the system or with other designers' systems (comparative testing or performance evaluation). To the extent that a system performs less than perfect, the designer will have to learn which aspect(s) and/or components of the system are flawed. Designers will therefore also be interested in diagnostic evaluation, either by doing detailed error analyses on the test results, or by running component-specific tests.
The needs of systems users (end users and/or systems providers) are different than those of designers but they, too, rely heavily on assessment techniques. Prospective buyers will always have a specific use of their speech output system in mind. Understandably, they will want the simplest, and therefore cheapest, system that satisfies their needs. The buyer (or his consumer organisation) will therefore need an absolute yardstick in order to determine beforehand if the output speech is good enough to get the message across in the given application.

Next: Users of this chapter Up: Introduction Previous: What are speech output

EAGLES SWLG SoftEdition, May 1997. Get the book...