Text-to-speech systems generally comprise a range of modules that take care of specific tasks. The first module (or complex of modules) converts an orthographic input string to some abstract linguistic code that is explicit in its representation of sounds and prosodic markers. Various modules then act upon this symbolic representation. Typically, one module concatenates the primitive building blocks (phonemes, diphones) in their appropriate order, and another implements whatever coarticulation is needed to obtain smooth, human-like transitions between successive building blocks. Prosodic modules, taking the positions of word stresses, sentence accents, and phrasal and sentence boundaries into account, are then called upon to provide an appropriate temporal organisation (local accelerations and decelerations, pauses) and speech melody.
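The modular organisation described above can be pictured as a chain of functions that each read and extend one shared symbolic representation. The following is only a minimal sketch under stated assumptions: the `Utterance` class, the module names and the toy rules (one unit per letter, fixed durations) are hypothetical illustrations, not the design of any actual text-to-speech system.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Utterance:
    """Hypothetical symbolic representation handed from module to module."""
    text: str
    units: List[str] = field(default_factory=list)         # building blocks; "|" marks a word boundary
    durations_ms: List[int] = field(default_factory=list)  # one duration per unit


def linguistic_analysis(utt: Utterance) -> Utterance:
    # Toy stand-in for orthography-to-symbol conversion:
    # one unit per letter, "|" for each space, everything else dropped.
    utt.units = ["|" if ch == " " else ch
                 for ch in utt.text.lower()
                 if ch.isalpha() or ch == " "]
    return utt


def prosody(utt: Utterance) -> Utterance:
    # Toy temporal organisation: a fixed duration per unit,
    # and a longer pause at each word boundary.
    utt.durations_ms = [200 if unit == "|" else 80 for unit in utt.units]
    return utt


# The order of the list is the order in which modules act on the representation.
PIPELINE: List[Callable[[Utterance], Utterance]] = [linguistic_analysis, prosody]


def synthesise(text: str) -> Utterance:
    utt = Utterance(text)
    for module in PIPELINE:
        utt = module(utt)
    return utt


result = synthesise("Hi you")
print(result.units)
print(sum(result.durations_ms))
```

Because every module has the same signature, a single module can be replaced or inspected without touching the others, which is the property that diagnostic testing relies on.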
End users will typically be interested in the performance of a system as a whole. They will consider the system as a black box that accepts text and outputs speech, a monolith without any internal structure. For them it is only the quality of the output speech that matters. Testing at this level allows systems developed by different manufacturers to be compared, or the improvement of one system relative to an earlier version to be traced over time (comparative testing). However, if the output is less than optimal, it will not be possible to pinpoint the exact module or modules that caused the problem. For diagnostic purposes, therefore, designers often set up glass box evaluations of an experimental character. If the effects of all modules but one are kept constant and the characteristics of the free module are varied systematically, any difference in the assessment of the system's output can be attributed to the variations in that target module. Glass box testing, of course, presupposes that the researcher has control over the input and output of each individual module.
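The logic of a glass box experiment, holding every module constant except one, can be sketched as follows. All names here are hypothetical, and `placeholder_score` merely stands in for a listener judgement or other assessment; it is not a real quality measure.

```python
from typing import Callable, Dict, List


def fixed_front_end(text: str) -> List[str]:
    # Held constant: identical in every experimental condition.
    return text.lower().split()


def make_prosody(pause_ms: int) -> Callable[[List[str]], List[int]]:
    # Free (target) module: pause_ms is the only thing that varies.
    def prosody(words: List[str]) -> List[int]:
        # Toy rule: 80 ms per letter plus a pause after each word.
        return [len(word) * 80 + pause_ms for word in words]
    return prosody


def placeholder_score(durations: List[int]) -> float:
    # Stand-in for an assessment of the output (assumption, not a real metric):
    # here simply the mean word duration in milliseconds.
    return sum(durations) / len(durations)


def glass_box_experiment(text: str, pause_values: List[int]) -> Dict[int, float]:
    words = fixed_front_end(text)  # same input to the target module every time
    results: Dict[int, float] = {}
    for pause_ms in pause_values:
        output = make_prosody(pause_ms)(words)
        results[pause_ms] = placeholder_score(output)
    return results


scores = glass_box_experiment("the cat sat", [0, 100, 200])
for pause_ms, score in scores.items():
    print(pause_ms, score)
```

Since the front end is identical across conditions, any difference between the scores can only stem from the varied parameter of the target module. In a black box comparison, by contrast, only the final scores of whole systems would be available.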
The dichotomy between glass box and black box testing is basic to speech output testing. This has led some researchers to propose a strict terminological division whereby ``evaluation'' signifies glass box testing (or: diagnostic evaluation) only, and ``assessment'' is reserved exclusively for black box testing (or: performance evaluation). In this chapter we will use the terms ``testing'', ``evaluation'' and ``assessment'' interchangeably, and use disambiguating adjectives whenever there is a risk of confusion.