In this section we shall deal with evaluation procedures that have been, or can be, followed when modules in a text-to-speech system yield some intermediary symbolic output. As was stated above, there are no established methods for evaluating the quality of linguistic modules in speech output testing. As a result there is no agreed-upon methodology in this area, nor are there evaluation experts; what little evaluation work is done, is done by the same researchers who developed the modules. In view of the lack of an established methodology we will refrain from making recommendations on the use of specific linguistic tests and test procedures. The need for a more general research effort towards a general methodology in the field of linguistic testing will be discussed in Section 12.6.3.
The first stage of a linguistic interface makes decisions on what to do with punctuation marks and other non-alphabetic textual symbols (e.g.\ parentheses), and expands abbreviations, acronyms, numbers, special symbols, etc. to full-blown orthographic strings, as follows:
abbreviations:    ``i.e.''  → that is;     ``viz.''  → namely
acronyms:         ``NATO''  → naytoe;      ``UN''    → you en
numbers:          ``124''   → one hundred and twenty four;
                  ``1:24''  → twenty four minutes past one
special symbols:  ``#1''    → number one;  ``£1.50'' → one pound fifty
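By way of illustration only, the expansion task can be sketched as follows; the tables, function names and the toy number expander below are our own invention and do not correspond to any particular system discussed in this section:

```python
# Toy expansion tables; real inventories are far larger and
# language-specific (all entries here are illustrative).
ABBREVIATIONS = {"i.e.": "that is", "viz.": "namely"}
ACRONYMS = {"NATO": "naytoe", "UN": "you en"}

UNITS = ["zero", "one", "two", "three", "four", "five", "six",
         "seven", "eight", "nine", "ten", "eleven", "twelve",
         "thirteen", "fourteen", "fifteen", "sixteen", "seventeen",
         "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]

def spell_number(n: int) -> str:
    """Expand 0..999 to English words (British 'and' convention)."""
    if n < 20:
        return UNITS[n]
    if n < 100:
        tens, unit = divmod(n, 10)
        return TENS[tens] + ("" if unit == 0 else " " + UNITS[unit])
    hundreds, rest = divmod(n, 100)
    head = UNITS[hundreds] + " hundred"
    return head if rest == 0 else head + " and " + spell_number(rest)

def preprocess(token: str) -> str:
    """Map one raw token to a full-blown orthographic string."""
    if token in ABBREVIATIONS:
        return ABBREVIATIONS[token]
    if token in ACRONYMS:
        return ACRONYMS[token]
    if token.isdigit():
        return spell_number(int(token))
    return token  # pass ordinary words through unchanged

print(preprocess("124"))   # one hundred and twenty four
print(preprocess("i.e."))  # that is
```

A real preprocessor would in addition need contextual information to choose between alternative readings, e.g. ``1:24'' as a clock time versus a ratio; it is precisely such choices that give rise to the transduction errors tabulated above.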
There are no standardised tests for determining the adequacy of text preprocessors. Yet it seems that all preprocessors meet with the same classes of transduction problems, so that it would make sense to set up a multilingual benchmark for preprocessing. [Laver et al. (1988), Laver et al. (1989)], describing the internal structure of the CSTR text preprocessor, mention a number of transduction problems and present some quantification of their errors in the various categories, which we recapitulate in Table 12.1 [Laver et al. (1988), pp. 12-15]. The test was run on a set of anomaly-rich texts taken from newspapers and technical journals.
The results in Table 12.1 are revealing not so much for the numerical information they offer as for the taxonomy of errors opted for. The only other formal evaluation of a text preprocessor that we have managed to locate uses a completely different set of error categories. [Van Holsteijn (1993)] presents an account of a text preprocessor for Dutch, and gives the results of a comprehensive evaluation of the module. It was observed that the use of abbreviations, acronyms and symbols differs strongly from text to text. Three types of newspaper text were broadly distinguished:
Correctly demarcated expressions could then be characterised further in terms of:
Finally, a distinction is made between unavoidable and avoidable errors. The former type would be the result of incorrect or unavailable syntactic/semantic information that would be needed in order to choose between alternative solutions. The latter type is the kind of error that needs correction, either by the addition of new rules or by inclusion in the exceptions lexicon. Table 12.2 presents some results [after Van Holsteijn (1993)].
Percentage of avoidable errors in four categories; percentage of unavoidable errors in parentheses; N specifies the 100% base per cell.
The proposals by [Laver et al. (1988)] and [Van Holsteijn (1993)] represent rather crude, and disparate, approaches towards a taxonomy of errors of a text preprocessor. What is clearly needed for the evaluation of text preprocessors is a more principled analysis of the various tasks a text preprocessor has to perform, focussing on those classes of difficulties that crop up in the European language concerned. Procedures should be devised that automatically extract representative items from large collections of recent text (newspapers) in each of the relevant error categories, so that multilingual tests can be set up efficiently. Once the test materials have been selected, the correct solutions to, for instance, expansion problems can be extracted from existing databases, or, when missing there, will have to be entered manually.
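As a first approximation, the automatic extraction procedure envisaged here could be based on surface patterns, one per error category; the patterns and category names below are illustrative only and would need refinement for each language:

```python
import re

# Hypothetical surface patterns for harvesting candidate test items
# from raw text, one per error category of interest.
PATTERNS = {
    "abbreviation": re.compile(r"\b(?:[A-Za-z]\.){2,}"),  # i.e., e.g.
    "acronym": re.compile(r"\b[A-Z]{2,}\b"),              # NATO, UN
    "number": re.compile(r"\b\d+(?::\d+)?\b"),            # 124, 1:24
    "symbol": re.compile(r"[#£$%]\d+(?:\.\d+)?"),         # #1, £1.50
}

def extract_items(text: str) -> dict[str, list[str]]:
    """Collect candidate items per category from a stretch of text."""
    return {cat: pat.findall(text) for cat, pat in PATTERNS.items()}

sample = "NATO paid £1.50 for the report, i.e. the 124 page draft."
for category, items in extract_items(sample).items():
    print(category, items)
```

The harvested items would then be checked by hand, so that each category contributes a representative, frequency-balanced portion of the multilingual benchmark.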
By grapheme-phoneme conversion we mean a process that accepts a full-blown orthographic input (i.e. the output of a preprocessor), and outputs a string of phonemes. The output string does not yet contain (word) stress marks, (sentence) accent positions, and boundaries. The correct phonemic representation of a normally spelled word depends on its linear context and hierarchical position (e.g. assimilation to adjacent words: I have to go /aɪ hæf tə gəʊ/ but I have two goals /aɪ hæv tu: gəʊlz/; or the choice between heterophonous homographs: I lead /li:d/ but made of lead /lɛd/; see also Chapter 6). Therefore the adequacy of grapheme-phoneme conversion modules should not, in principle, be tested on the basis of isolated word pronunciation (citation forms). In practice, however, this is precisely what is done. The reasons for this are threefold:
Table 12.3 presents results of a multilingual evaluation of grapheme-phoneme converters for seven EU languages, performed within ESPRIT 291/860 ``Linguistic analyses of European languages,'' based on isolated word pronunciation. Since it has often been reported that many more conversion errors occur in proper names than in ordinary words, the evaluation distinguished between four types of materials:
Note: Newspaper scores are weighted for token frequency. The higher first score for French excludes all preprocessing errors; the higher first German score is based on the use of an exceptions list.
Incidentally, the results should not be taken to indicate that Italian spelling is harder to convert to phonemes than that of any other language, since different conversion methods were used for each language; Italian proper names, however, are no more of a problem than ordinary text words. In English and French spelling, proper names do present a serious problem, so that exceptions lists will be a priority for these languages.
In a complementary test [Nunn & Van Heuven (1993)] compared the performance of three grapheme-phoneme converters for Dutch, i.e. two systems with no or only implicit morphological decomposition [Kerkhoff et al. (1984), Berendsen et al. (1986)] and one that included the MORPA morphological decomposition module. About 2,000 simplex and complex (see Section 12.5.1) test words were selected from newspaper texts that did not belong to the 10,000 most frequent Dutch words, so that dictionary look-up would fail. Phoneme, syllabification, and stress placement errors were found by automated comparison with a hand-made master transcription file. The earlier converters performed at success rates of 60% and 64%, which is considerably poorer than the newspaper text scores in Table 12.3 [Pols (1991), p. 394]. The newer system with explicit morphological decomposition was correct in 78% of cases.
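The automated comparison with a master transcription file can be sketched as follows; the word list and the SAMPA-like transcriptions are invented for the example, and a real scoring procedure would additionally separate phoneme, syllabification and stress errors into distinct counts:

```python
# Sketch of automated scoring against a hand-made master file:
# every system transcription is matched exactly against the master.
# Keys and SAMPA-like transcriptions below are illustrative only.

def score(system: dict[str, str], master: dict[str, str]) -> float:
    """Fraction of words whose transcription matches the master exactly."""
    correct = sum(1 for word, trans in master.items()
                  if system.get(word) == trans)
    return correct / len(master)

master = {"lead_v": "li:d", "lead_n": "lEd", "bishop": "bIS@p"}
system = {"lead_v": "li:d", "lead_n": "li:d", "bishop": "bIS@p"}
print(f"{score(system, master):.0%}")  # 67%
```

Exact string matching presupposes that system and master use the same phoneme inventory and notation; in practice a normalisation step, and often an alignment step for counting partial errors, precedes the comparison.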
Stressed syllables are generally pronounced with greater duration, greater loudness (in terms of acoustical intensity as well as pre-emphasis on higher frequencies), and greater articulatory precision (no consonant deletions, more peripheral vowel formant values). Moreover, when a word is in focus, a prominence-lending fast pitch movement occurs on the stressed syllable of that word. Except in French, where stress always falls on the last full syllable of the word, stress position varies from word to word in the EU languages. However, stress position in these languages is predictable to a large extent on the basis of:
All the EU languages have a proportion of idiosyncratic words that do not comply with the proposed stress rules for diverse reasons. Therefore the coverage of stress rule systems has to be evaluated, and errors have to be corrected by including the problematic words in an exceptions dictionary.
Tests of stress rule modules have been performed only on an ad hoc basis, either checking the output of the rules by hand [Barber et al. (1989), for Italian], or automatically, using the phonemic transcription field in lexical databases containing stress marks [Langeweg (1988), for Dutch], which in turn had been checked by hand in some earlier stage of the database development.
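Such an automatic check can be sketched as follows, confronting a toy penultimate-stress rule with the stress-marked transcription field of a lexical database; the Italian entries and the database format are merely illustrative:

```python
# Hedged sketch of automatic stress-rule coverage checking: rule
# predictions are compared with a hand-checked database field.

def penultimate_stress(syllables: list[str]) -> int:
    """Toy rule: stress the penultimate syllable (0-based index)."""
    return max(len(syllables) - 2, 0)

# Database field: (syllabified word, index of the stressed syllable).
LEXICON = {
    "tavolo": (["ta", "vo", "lo"], 0),  # irregular: antepenultimate
    "gelato": (["ge", "la", "to"], 1),  # regular
    "amico":  (["a", "mi", "co"], 1),   # regular
}

# Words the rule gets wrong go into the exceptions dictionary.
exceptions = [word for word, (syls, stressed) in LEXICON.items()
              if penultimate_stress(syls) != stressed]
coverage = 1 - len(exceptions) / len(LEXICON)
print(exceptions, f"coverage {coverage:.0%}")
```

The words collected in `exceptions` are precisely the idiosyncratic items that, on this approach, would be added to the exceptions dictionary, while `coverage` quantifies how much of the lexicon the rule system handles.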
In morphological decomposition orthographic words are analysed into morphemes, i.e.\ elements belonging to the finite set of smallest subword parts with an identifiable meaning (see Chapter 6). Morphological decomposition is necessary when the language/spelling allows words to be strung together without intervening spaces or hyphens so as to form an indefinitely large number of complex, longer words. For many EU languages word-internal morpheme boundaries are referred to by the grapheme-phoneme conversion. For instance, the English letter sequence sh is pronounced as /ʃ/ when it occurs morpheme-internally, as in bishop, but as /s/ followed by /h/ when a morpheme boundary intervenes, as in mishap.
Obviously, long and complex words will have to be broken up into smaller basic words and affixes (i.e. morphemes) before the parts can be looked up in an exceptions dictionary. If all complex words were to be integrally stored in the lexicon, it would soon grow to unmanageable proportions. For stress placement rules it is sometimes necessary to refer to the hierarchical relationships between the constituent morphemes (e.g. ˈlighthouse keeper vs. light ˈhousekeeper, where ˈ denotes main stress) and to the lexical category of the word-final morpheme (which generally determines the lexical category of the complex word as a whole, e.g. black+bird is a noun, pitch+black is an adjective). Morphological decomposition is a notoriously difficult task, as one input string can often be analysed in a large number of different ways. The hard problem is choosing the correct solution out of the many possible solutions.
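The ambiguity problem can be made concrete with a minimal exhaustive segmentation routine; the morpheme lexicon below is a toy, and real parsers add affix-ordering constraints and likelihood ranking to choose among the candidate analyses:

```python
# Minimal sketch of exhaustive morphological segmentation against a
# small morpheme lexicon; the entries are illustrative only.
MORPHEMES = {"mis", "hap", "mishap", "bi", "shop", "bishop"}

def segmentations(word: str) -> list[list[str]]:
    """Return every way to cover the word with known morphemes."""
    if not word:
        return [[]]  # empty word: one trivial segmentation
    results = []
    for i in range(1, len(word) + 1):
        prefix = word[:i]
        if prefix in MORPHEMES:
            for rest in segmentations(word[i:]):
                results.append([prefix] + rest)
    return results

print(segmentations("mishap"))  # [['mis', 'hap'], ['mishap']]
print(segmentations("bishop"))  # [['bi', 'shop'], ['bishop']]
```

Even with six morphemes both test words already receive two analyses each, and only the monomorphemic reading of bishop is correct; with a realistically sized lexicon the number of spurious candidates grows quickly, which is why ranking the candidate segmentations (as MORPA does) becomes essential.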
As far as we have been able to ascertain, there are no established test procedures for evaluating the performance of morphological decomposition modules. [Laver et al. (1988), pp. 12-16] tested the morphological decomposition module of the CSTR TTS on 500 words randomly sampled from an 85,000-word type list, which was compiled from a large text corpus as well as from two machine-readable dictionaries. The output of the module was examined by hand, and proved to be 70% accurate (which seems rather low considering the fact that the elements of English compounds are generally separated by spaces or hyphens).
The evaluation of the Dutch morphological decomposition module MORPA (MORphological PArser, [Heemskerk & Van Heuven (1993)]) compared the module's output with pre-stored morphological decompositions in a lexical database. In this comparison only segmentation errors were counted, in a sample of 3,077 (simplex and complex) words taken from weekly newspapers. The results showed that in 3% of the input the whole word, or part of it, could not be matched with any entry in the MORPA morpheme lexicon. The frequency of this type of error depends on the coverage of the lexicon. Erroneous analyses were generated in another 1% of the input words. In all other cases the correct morphological segmentation was generated, either as the single correct solution (44%), or as the most likely solution in an ordered list of candidate segmentations (48%), or as one of the less probable candidate solutions (3%). Although both the accuracy and the coverage of the MORPA module seem excellent by today's standards, the module proved too slow for realistic text-to-speech applications. Processing speed is therefore an important criterion in the evaluation of morphological parsers: there will be a speed/accuracy/coverage trade-off.
Syntactic analysis lays the groundwork for the derivation of the prosodic structure needed to demarcate the phonological phrases (whose boundaries block assimilation and stress clash avoidance rules) and intonation domains (whose boundaries are marked by deceleration, pause insertion and boundary marking pitch movements). Syntactic structure also determines (in part) which words have to be accented. Finally, lexical category disambiguation is often a by-product of a syntactic parser.
Although the syntactic parser is an important module in any advanced TTS, we take the view that, in principle, its development and evaluation do not belong to the domain of speech output systems. Syntactic parsing is much more a language engineering challenge, needed in automatic translation systems, grammar checking, and the like. For this reason, we refer to the chapters produced by the EAGLES Working Groups on the evaluation of Automatic Translation and Translation tools.
Appropriate accentuation is necessary to direct the listener's attention to the important words in the sentence. Inappropriate accentuation may lead to misunderstandings and delays in processing time [Terken (1985)]. For this reason most TTS-systems provide for accent placement rules. Accentuation rules can be evaluated at the symbolic and the acoustic level.
[Monaghan & Ladd (1989), Monaghan & Ladd (1990)] tested the symbolic output of a sentence accent assignment algorithm applied to four English 250-word texts (transcripts of radio broadcasts). The algorithm generated primary and secondary accents, which were rated on a 4-point appropriateness scale by three expert judges. [Van Bezooijen & Pols (1989)] tested a Dutch accent assignment algorithm at the symbolic as well as the acoustic level (only one type of accent is postulated for Dutch) using 8 isolated sentences and 8 short newspaper texts. Two important points emerged from this study:
Again, these are scattered tests, addressing only a handful of the problems that a linguistic module has to take care of. We would recommend the development of a comprehensive test procedure that identifies categories of accent placement error at the sentence and the paragraph level. The principles that underlie sentence accent placement are largely the same across EU languages, so that it makes sense to develop the test procedure on a multilingual basis.
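As a starting point for such a procedure, scoring at the symbolic level could compare the algorithm's accent positions with a reference annotation in terms of precision and recall; the sentence and accent positions below are invented for the example:

```python
# Sketch of symbolic-level scoring for accent placement: the
# algorithm's accented word positions are compared with a reference
# annotation (data invented for illustration).

def precision_recall(system: set[int],
                     reference: set[int]) -> tuple[float, float]:
    """Accent positions are word indices within the sentence."""
    hits = len(system & reference)
    precision = hits / len(system) if system else 1.0
    recall = hits / len(reference) if reference else 1.0
    return precision, recall

# "JOHN gave MARY the BOOK": reference accents on words 0, 2 and 4;
# a hypothetical algorithm accents only words 0 and 4.
p, r = precision_recall(system={0, 4}, reference={0, 2, 4})
print(f"precision {p:.0%}, recall {r:.0%}")  # precision 100%, recall 67%
```

A fuller test procedure along the lines recommended above would break these counts down by error category (missed accents, spurious accents, misplaced accents) at both the sentence and the paragraph level.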