As pointed out above, speech synthesis assessment methodologies are at a less advanced stage than those for other speech technologies. Chapter 12 recommends several approaches to tackle the most important issues.
Text-to-speech (TTS) synthesis is the conversion of textual information into speech output. This technique is used by systems that handle large amounts of information, or information that varies frequently (products in a catalogue, daily press releases, access to e-mail messages, etc.). Another type of speech synthesis is based on concept-to-speech (CTS) and may be integrated into dialogue systems (cf. Chapter 12 for a brief description).
A TTS system basically involves the following modules:
- a linguistic module, which pre-processes the input text and converts it into a phonetic transcription with prosodic patterns;
- a segmental (concatenation) module, which maps this transcription onto the pre-recorded building blocks (diphones, triphones, words, etc.) to be concatenated;
- an acoustic module, which generates the synthetic speech signal from these segments using a voice coding algorithm.
It is obvious that an end-user and an application developer are mainly interested in the quality and performance of the speech synthesiser as a whole. The synthesiser is then considered as a black box, although some applications may need particular tuning of some modules to account for a specific feature of the application. For example, a reverse-directory application may need to pronounce proper names, including foreign ones; other applications may need to pronounce particular abbreviations or acronyms (military or scientific domains). For this purpose the application developer has to know the different components of the speech synthesiser (glass box analysis). He should know how to pinpoint the module that causes a problem and whether he has any control over each component to correct the problem or to balance its effect.
The linguistic part usually incorporates a pre-processing stage to deal with input data which is not ``standard text''. The data may include non-alphabetic symbols (parentheses, punctuation marks), abbreviations, acronyms, and exception entries. This pre-processing has to replace the symbols by the corresponding text, to expand abbreviations, to supply the correct pronunciation of acronyms, to state whether a string of digits is a number or a sequence of single digits, and to correct orthographic and syntactic mistakes. The application developer has to know whether he has any control over this submodule and at what level. To illustrate this, he may want to obtain the appropriate French pronunciation of the acronyms CNET and CCITT. He may use a single, separate table listing the acronyms and their corresponding (in this case French) pronunciations: CNET /knEt/, CCITT /se: se: i: te: te:/.
He may also be forced to systematically replace these acronyms in the input text whenever they have to be pronounced. The application developer has to know how to handle this; a sketch of the table-based approach follows.
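A minimal sketch of such an exception table, assuming the synthesiser lets the application developer register acronym pronunciations as SAMPA strings; the function name and interface are hypothetical, not any particular product's API.

```python
# Hypothetical pre-processing exception table: acronyms are mapped to
# SAMPA pronunciations before the regular grapheme-to-phoneme stage.

ACRONYM_TABLE = {
    "CNET":  "knEt",                 # pronounced as a word
    "CCITT": "se: se: i: te: te:",   # spelled out letter by letter
}

def preprocess_token(token: str) -> str:
    """Replace a known acronym by its phonetic transcription;
    leave ordinary words untouched for the normal G2P stage."""
    return ACRONYM_TABLE.get(token, token)

print(preprocess_token("CNET"))   # -> knEt
print(preprocess_token("Paris"))  # -> Paris
```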
Of course the application developer should not have to deal with the assessment of the linguistic module itself, but should only be given the best tools and hints to improve it (or to tune it) if he selects the speech synthesiser for its global performance as a black box.
The linguistic part is also in charge of converting the pre-processed orthographic text to its phonetic transcription (or generally any abstract elements representing the speech sounds). This may use special rules as well as phonetic dictionaries. At that stage, words requiring special pronunciations are considered. Syntactic and lexical analyses are carried out to assign ``lexical stress'' and a ``prosodic pattern'' that will give the synthetic speech its naturalness. This may be application-dependent and may then have to be handled by the application developer.
The output of this part is the decomposition of words into morphemes and then into phonetic symbols with prosodic patterns (syntactic/lexical accent).
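The following toy sketch illustrates this dictionary-plus-rules organisation: exceptional words come from a phonetic dictionary, everything else falls back to letter-to-sound rules. The dictionary entry, the rules, and the stress notation are all made up for illustration, not taken from any real lexicon.

```python
# Toy grapheme-to-phoneme stage: exception dictionary first, then
# letter-to-sound rules (both grossly simplified and illustrative).

EXCEPTION_DICT = {
    "colonel": "'k3:nl=",   # irregular pronunciation, with lexical stress
}

LETTER_RULES = {            # one-letter rules, far too crude for real use
    "a": "{", "b": "b", "c": "k", "d": "d", "e": "E",
    "l": "l", "n": "n", "o": "O", "r": "r", "t": "t",
}

def transcribe(word: str) -> str:
    word = word.lower()
    if word in EXCEPTION_DICT:          # special pronunciations first
        return EXCEPTION_DICT[word]
    # fall back to rules, marking default stress on the first syllable
    return "'" + " ".join(LETTER_RULES.get(ch, ch) for ch in word)

print(transcribe("colonel"))  # -> 'k3:nl=
print(transcribe("bat"))      # -> 'b { t
```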
As detailed in Chapter 12, there is no standard, agreed-upon methodology for the assessment of the linguistic module of a TTS system. Our intent here is to define the different tasks that a speech synthesis technology should accomplish, and those that will be needed by the application developer. These may be carried out automatically, but for better tuning they may require some intervention, as illustrated above. The underlying processes may or may not be offered by the technology provider. It may be possible to acquire some modules from another supplier and incorporate them into the adopted synthesiser, though this may not be easy. All these possibilities should be carefully taken into account. If the application developer intends or needs to modify subparts of this module, he should know the format of, and any tools for editing, the rule/dictionary component to be modified.
This module concerns the construction of segmental synthesis by concatenation of pre-recorded segments. It uses the broad phonetic output of the linguistic module to produce the set of basic building blocks (diphones, triphones, individual words if any, etc.) that will be concatenated. This is based on an inventory of labels of speech segments that are associated with ``phonetic'' sequences.
This module outputs a complete sequence of such ``building blocks'' with the appropriate prosodic markers (stress markers, accent position, etc.).
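A sketch of how the broad phonetic output might be mapped onto diphone labels from such an inventory; the label format and the (French ``bonjour'') inventory below are hypothetical.

```python
# Hypothetical diphone selection: turn a phoneme sequence into diphone
# labels drawn from a pre-recorded inventory, padded with silence (_).

INVENTORY = {"_-b", "b-o~", "o~-Z", "Z-u", "u-r", "r-_"}  # "bonjour"

def to_diphones(phonemes):
    """Map a phoneme sequence onto diphone labels; raise if a
    required unit is missing from the inventory."""
    seq = ["_"] + list(phonemes) + ["_"]
    units = []
    for left, right in zip(seq, seq[1:]):
        label = f"{left}-{right}"
        if label not in INVENTORY:
            raise KeyError(f"missing diphone {label} in inventory")
        units.append(label)
    return units

print(to_diphones(["b", "o~", "Z", "u", "r"]))
# -> ['_-b', 'b-o~', 'o~-Z', 'Z-u', 'u-r', 'r-_']
```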
For example, in a class of applications one may need to present a word with a particular focus to capture the attention of the end-user (listener), e.g. the departure time in a travel information system. This ``accent placement'' process/rule may or may not be offered by the system and may or may not be easily accessible to the application developer.
The application developer has to know who provides the necessary inventory of pre-recorded segments and how to create new inventories in order to personalise the voice output.
At this level one may need to incorporate a natural intonation contour and/or a duration model. This is usually done automatically. For special purposes the application developer may need to handle this manually, and the technology provider has to inform him whether this is possible and how to achieve it.
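A minimal sketch of what a rule-based duration model can look like: each phoneme gets an intrinsic duration, stretched under stress and before a phrase boundary. All numeric values are invented for the sketch, not calibrated data.

```python
# Illustrative rule-based duration model (values are made up):
# intrinsic phoneme duration, stretched by stress and finality rules.

INTRINSIC_MS = {"a": 90, "t": 60, "s": 100, "n": 70}

def phoneme_duration(ph, stressed=False, phrase_final=False):
    d = INTRINSIC_MS.get(ph, 80)        # default for unknown phonemes
    if stressed:
        d *= 1.3                        # stress lengthening
    if phrase_final:
        d *= 1.5                        # pre-pausal lengthening
    return round(d)

print(phoneme_duration("a", stressed=True))                     # -> 117
print(phoneme_duration("a", stressed=True, phrase_final=True))  # -> 176
```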
The acoustic module aims at generating synthetic speech from pre-recorded segments, extracted from acoustic speech data, by means of a voice coding algorithm.
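A minimal sketch of the concatenation step, assuming numpy: segment waveforms are joined with a short linear crossfade. Real synthesisers use proper voice-coding and smoothing algorithms (e.g. PSOLA); this only illustrates the idea.

```python
import numpy as np

def concatenate(segments, fade=64):
    """Join 1-D waveform arrays with a linear crossfade of `fade` samples."""
    out = segments[0].copy()
    ramp = np.linspace(0.0, 1.0, fade)
    for seg in segments[1:]:
        # blend the tail of the output with the head of the next segment
        out[-fade:] = out[-fade:] * (1 - ramp) + seg[:fade] * ramp
        out = np.concatenate([out, seg[fade:]])
    return out

# Usage with dummy waveforms standing in for recorded diphones:
a = np.sin(np.linspace(0, 40, 1000))
b = np.sin(np.linspace(0, 60, 1000))
speech = concatenate([a, b])
print(speech.shape)  # (1936,)
```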
The technology provider may deliver a single output voice, male or female. He may also provide a tool to generate personalised voices (the aim here may be to have a company-tailored voice).
For all these modules, the application developer needs to know if human intervention is necessary to obtain satisfactory speech output.
The acoustic module may be based on a non-segmental approach, but the requirements indicated here remain.
The chapter devoted to speech synthesis assessment (Chapter 12) points out the different measures that should be conducted by the technology provider and demanded by the application developer. Such measures have to be conducted with the proposed factors and scales in mind. We may quote: naturalness, acceptability, intelligibility, listening effort, pleasantness, comprehension, and so forth. The evaluation should be application-specific or at least mention the way it is done. A TTS may be acceptable for the 1000 most frequent words/sentences but not for those to be synthesised by the application under development. There is also an important item to consider: measuring the intelligibility of phonemes, words, and sentences, as some applications require understanding of whole sentences while others demand understanding of keywords (names, dates, digits), including some minimal pairs with no dialogue contextualisation.
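As an illustration of the keyword-level measure, the following sketch tallies a keyword intelligibility score from listener responses in a minimal-pair test presented without dialogue context; the stimuli and scoring rule are illustrative only, not a standard test protocol.

```python
# Illustrative scoring of a keyword intelligibility test: listeners
# write down what they heard; we compute the percentage correct.

def intelligibility(stimuli, responses):
    """stimuli, responses: parallel lists of words; returns % correct."""
    correct = sum(s == r for s, r in zip(stimuli, responses))
    return 100.0 * correct / len(stimuli)

stimuli   = ["bat", "pat", "fifteen", "fifty"]   # minimal pairs, no context
responses = ["bat", "bat", "fifteen", "fifty"]   # one confusion
print(f"{intelligibility(stimuli, responses):.1f}% correct")  # 75.0% correct
```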