In spite of the rapid progress being made in speech technology, any speech output system available today can still be spotted for what it is: non-human, a machine. Most older systems give themselves away immediately through their robot-like melody and garbled vowels and consonants. More recently developed synthesis methods based on short-segment waveform concatenation, such as PSOLA [Moulines & Charpentier (1990)], yield segmental quality very close to that of human speech [Portele et al. (1994)], but still suffer from noticeable defects in melody and timing.
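To make the reference concrete, the sketch below illustrates the core overlap-add idea behind time-domain PSOLA: pitch-synchronous, two-period Hann-windowed segments are extracted around analysis pitch marks and re-spaced at synthesis marks to alter the pitch. This is a minimal illustration only, not the algorithm as published by Moulines & Charpentier (1990); the function name `td_psola`, the nearest-mark lookup, and the assumption of precomputed integer pitch marks are ours.

```python
import numpy as np

def td_psola(x, marks, pitch_factor):
    """Minimal TD-PSOLA sketch (illustrative, not the published algorithm).

    x            -- mono waveform as a 1-D array
    marks        -- strictly increasing sample indices of pitch marks
    pitch_factor -- >1 raises pitch, <1 lowers it
    """
    marks = np.asarray(marks)
    periods = np.diff(marks)
    y = np.zeros(len(x) + int(max(periods)))
    t = float(marks[0])                              # next synthesis mark
    while t < marks[-1]:
        i = int(np.argmin(np.abs(marks - t)))        # nearest analysis mark
        p = int(periods[min(i, len(periods) - 1)])   # local pitch period
        c = int(marks[i])
        lo, hi = max(c - p, 0), min(c + p, len(x))   # two-period segment
        seg = x[lo:hi] * np.hanning(hi - lo)         # pitch-synchronous window
        start = int(round(t)) - (c - lo)             # align segment centre at t
        a, b = max(start, 0), min(start + len(seg), len(y))
        y[a:b] += seg[a - start:b - start]           # overlap-add
        t += p / pitch_factor                        # re-spaced synthesis marks
    return y[:len(x)]
```

With, say, pitch_factor = 1.2 the synthesis marks are packed 20% closer together, raising the pitch while leaving the spectral envelope of each windowed period largely intact, which is why the residual defects of such systems are prosodic (melody and timing) rather than segmental.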
As long as synthetic speech remains inferior to human speech, speech output assessment will be a major concern. Speech technology development today is typically evaluation-driven. Large-scale speech technology programmes have been launched both in the United States and in Europe [for overviews see O'Malley & Caisse (1987), Van Bezooijen & Pols (1989), Pols (1991)]. Especially in the European Union, with its many official languages, a strong need was felt for output quality assessment methods and standards that can be applied across languages. With this goal in mind, the multinational EU-ESPRIT SAM project was set up [Fourcin et al. (1989)], later followed by the EU Expert Advisory Group on Language Engineering Standards (EAGLES) programme; both initiatives included a working group on speech output assessment.
Speech output assessment may be of crucial importance to two interested parties: the system designers and developers on the one hand, and the prospective buyers and end users of the system (possibly represented by consumer organisations) on the other.