Generally, we feel that the development and testing of the higher-order linguistic modules of speech output systems should be left to language technology (NLP) experts (see also the results of the EAGLES Working Groups on Corpora, Machine Readable Lexicons, Formalisms, and Evaluation). A reasonable division of work would be for speech technology to deal with the mainly word-level linguistic modules that are specific to TTS applications, i.e. text preprocessing and grapheme-phoneme conversion (including stress position, accent placement and boundary marking). Other linguistic tasks, such as morphological analysis and syntactic parsing, are to a large extent common to other branches of linguistic engineering (e.g. grammar checking, automatic translation), where far more resources and manpower are available. However, even if this division of work could be effected, one would like to see attempts made towards early separation of consequential vs. inconsequential errors in word and sentence parsers. Consequential symbolic errors audibly affect the (quality of the) acoustic output, whereas inconsequential errors are not reflected at the audio level. This means that part of speech output testing should still be concerned with the higher-order linguistic modules.
We would advocate a more detailed analysis of the various tasks a text preprocessor has to perform, focussing on those classes of difficulties that crop up in any (European) language. Procedures should be devised that automatically extract representative items from large collections of recent text (newspapers) in each of the relevant error categories, so that multilingual tests can be set up efficiently. Once the test materials have been selected, the correct solutions to, for instance, abbreviation expansion problems can be extracted from existing databases or, where they are missing there, will have to be entered manually.
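As an illustration only, the extraction step could be sketched along the following lines. The four categories and their patterns are our own assumptions, chosen for the example; they are neither an agreed inventory of error categories nor language-independent. The function name extract_items and the cap max_per_category are likewise hypothetical.

```python
import re
from collections import defaultdict

# Hypothetical categories and patterns, for illustration only.
# Order matters: earlier alternatives win, so a date is not also
# counted as a plain number.
TOKEN_RE = re.compile(
    r"(?P<date>\b\d{1,2}[./-]\d{1,2}[./-]\d{2,4}\b)"
    r"|(?P<number>\b\d+(?:[.,]\d+)*\b)"
    r"|(?P<dotted_abbreviation>\b(?:[A-Za-z]\.){2,})"
    r"|(?P<acronym>\b[A-Z]{2,}\b)"
)


def extract_items(text, max_per_category=50):
    """Collect distinct candidate test items per category, in text order."""
    items = defaultdict(list)
    for match in TOKEN_RE.finditer(text):
        category = match.lastgroup  # name of the alternative that matched
        token = match.group()
        bucket = items[category]
        if token not in bucket and len(bucket) < max_per_category:
            bucket.append(token)
    return dict(items)


sample = "On 12.03.1994 the EU paid 1,500 ECU, e.g. to NATO."
for category, tokens in extract_items(sample).items():
    print(category, tokens)
```

The items collected in this way would still need their correct expansions, taken from existing databases or entered by hand.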
A short-term recommendation is to develop multilingual machine-readable pronouncing dictionaries at the single word level which list permissible variations. Comparison of algorithmic output with the model transcriptions requires the development of adequate string alignment procedures. Moreover, not all discrepancies found contribute equally to the overall evaluation. Distance metrics should be developed that allow us to express the differences between two segmentally different phonemic transcriptions in terms of meaningful perceptual distance. Recent work done by [Cucchiarini (1993)] could serve as a starting point.
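To make the idea of alignment with unequal penalties concrete, a minimal sketch is given below. It uses standard dynamic-programming alignment (weighted edit distance) and, purely as an assumption, takes the substitution cost to be the Jaccard distance between small hand-made feature sets. The feature table, the insertion/deletion cost and the function names are hypothetical, and the resulting numbers are not perceptually validated distances.

```python
# Hypothetical feature sets for a handful of phonemes (illustration only).
FEATURES = {
    "p": {"stop", "labial", "voiceless"},
    "b": {"stop", "labial", "voiced"},
    "t": {"stop", "alveolar", "voiceless"},
    "d": {"stop", "alveolar", "voiced"},
    "s": {"fricative", "alveolar", "voiceless"},
    "a": {"vowel", "open"},
    "i": {"vowel", "close", "front"},
}


def substitution_cost(a, b):
    """0 for identical symbols, otherwise Jaccard distance of feature sets."""
    if a == b:
        return 0.0
    fa, fb = FEATURES.get(a), FEATURES.get(b)
    if fa is None or fb is None:
        return 1.0  # unknown symbol: assume maximal difference
    return 1.0 - len(fa & fb) / len(fa | fb)


def align(ref, hyp, indel=1.0):
    """Return (distance, aligned pairs); None marks an inserted or deleted slot."""
    n, m = len(ref), len(hyp)
    cost = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = cost[i - 1][0] + indel
    for j in range(1, m + 1):
        cost[0][j] = cost[0][j - 1] + indel
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i][j] = min(
                cost[i - 1][j - 1] + substitution_cost(ref[i - 1], hyp[j - 1]),
                cost[i - 1][j] + indel,  # symbol missing from hyp
                cost[i][j - 1] + indel,  # extra symbol in hyp
            )
    # Trace back one optimal path to recover the alignment itself.
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and cost[i][j]
                == cost[i - 1][j - 1] + substitution_cost(ref[i - 1], hyp[j - 1])):
            pairs.append((ref[i - 1], hyp[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and (j == 0 or cost[i][j] == cost[i - 1][j] + indel):
            pairs.append((ref[i - 1], None))
            i -= 1
        else:
            pairs.append((None, hyp[j - 1]))
            j -= 1
    pairs.reverse()
    return cost[n][m], pairs


distance, pairs = align(["b", "a", "d"], ["p", "a", "t"])
print(distance, pairs)
```

With such a scheme a voicing error (b for p) is penalised less than the substitution of an unrelated segment, which is the kind of weighting a perceptually motivated metric would have to supply on empirical grounds.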
The correctness of most symbolic output can only be determined on the basis of connected text at the sentence level. What is dearly needed, therefore, is the availability of large, multilingual text corpora with full phonemic annotation, covering not only the permissible pronunciation(s) of the words, with the effects of assimilation across word boundaries and stress shifts, but also the indication of accent positions (and degrees of accent), prosodic boundaries (with break indices of various strengths), and some intonation transcription. Moreover, since these corpora will also have to be used for testing morphological and syntactic parsing, hierarchical word and sentence structure should be indicated; or at least provisions should be made for linguists to enter this type of information at a later stage, resulting in a hierarchically tagged text corpus or tree bank.
The development of corpora of this type is best left to the text corpus experts. We refer to the relevant chapters on database development (Chapters 3, 4, 5) for a discussion of corpus-related matters.
We recommend the development of procedures for strictly modular testing of linguistic interfaces. This means that test materials have to be made available that are specific to each individual module in the linguistic interface. Each module should be given correct input strings, and the correct output string(s) for only the module at hand should be provided. Only in this way can the percolation and compounding of errors made by earlier modules be eliminated. Obviously, such procedures can only be effective if the databases referred to in the previous paragraph contain representations of the correct strings at each of the levels addressed by the various modules.
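A minimal sketch of such a modular test procedure is given below, under the assumption that every database record stores the correct string(s) for each level of representation. The level names, the two toy modules and the transcriptions are made up for the example and do not describe any existing system or database.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# One record: level name -> list of acceptable strings at that level.
# The first string of a level is used as the correct input for a module.
GoldRecord = Dict[str, List[str]]


@dataclass
class ModuleSpec:
    name: str
    run: Callable[[str], str]  # the module under test
    input_level: str           # level it reads
    output_level: str          # level it is expected to produce


def evaluate_in_isolation(modules: List[ModuleSpec],
                          gold: List[GoldRecord]) -> Dict[str, float]:
    """Score each module on correct (gold) input, never on another module's output."""
    scores = {}
    for module in modules:
        correct = total = 0
        for record in gold:
            inputs = record.get(module.input_level)
            accepted = record.get(module.output_level)
            if not inputs or not accepted:
                continue  # record lacks annotation at one of the two levels
            total += 1
            if module.run(inputs[0]) in accepted:
                correct += 1
        if total:
            scores[module.name] = correct / total
    return scores


# Toy modules and data (hypothetical, for illustration only).
def toy_preprocess(text: str) -> str:
    return text.replace("Dr.", "doctor")  # knows one abbreviation only


def toy_g2p(word: str) -> str:
    return {"doctor": "d O k t @", "saint": "s eI n t"}.get(word, "?")


gold = [
    {"orthography": ["Dr."], "normalised": ["doctor"],
     "phonemic": ["d O k t @"]},
    {"orthography": ["St."], "normalised": ["saint", "street"],
     "phonemic": ["s eI n t", "s t r i: t"]},
]
modules = [
    ModuleSpec("preprocess", toy_preprocess, "orthography", "normalised"),
    ModuleSpec("g2p", toy_g2p, "normalised", "phonemic"),
]
results = evaluate_in_isolation(modules, gold)
print(results)
```

In this toy run the preprocessor fails on "St.", yet the grapheme-phoneme module is still scored on the correct normalised form, so the earlier error does not percolate into its score.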
With the availability of cheap mass memory, the need for highly intricate rule-based linguistic interfaces is less strongly felt than it was some years ago. Rather than computing the phonological code that is to be fed to the acoustic modules, the correct code is simply looked up in large lexicons included in the speech output system. If this trend continues, the emphasis of our research efforts will shift from rule development (and testing) to collecting databases. Database collection and annotation will take place regardless of the direction that the field takes in this matter. If choices have to be made, money is spent most safely on the development of corpora, but only if a multilingual notation format can be found that can be used for the transcription of segments and prosodic features of all languages dealt with.
Although it is less important at the isolated word level, it still remains necessary to test grapheme-phoneme conversion. The same holds for the output of post-lexical rules (which change the pronunciation of words in connected speech, e.g. through assimilation). Testing grapheme-phoneme conversion will also remain applicable in the development of cheap speech output systems (such as MULTIVOX and APOLLO), which do not access large lexicons or perform sophisticated linguistic analyses of the input text.