

Linguistic aspects


In this section we shall deal with evaluation procedures that have been, or can be, followed when modules in a text-to-speech system yield some intermediary symbolic output. As was stated above, there are no established methods for evaluating the quality of linguistic modules in speech output testing. Consequently, there is no agreed-upon methodology in this area, nor are there evaluation experts; what little evaluation work is done is done by the same researchers who developed the modules. In view of this lack of an established methodology, we will refrain from making recommendations on the use of specific linguistic tests and test procedures. The need for a broader research effort towards a general methodology in the field of linguistic testing will be discussed in Section 12.6.3.




The first stage of a linguistic interface makes decisions on what to do with punctuation marks and other non-alphabetic textual symbols (e.g. parentheses), and expands abbreviations, acronyms, numbers, special symbols, etc. to full-blown orthographic strings, as follows:

abbreviations:    ``i.e.''   →  that is
                  ``viz.''   →  namely
acronyms:         ``NATO''   →  naytoe
                  ``UN''     →  you en
numbers:          ``124''    →  one hundred and twenty four
                  ``1:24''   →  twenty four minutes past one
special symbols:  ``#1''     →  number one
                  ``£1.50''  →  one pound fifty
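The expansion stage illustrated above can be sketched as a set of rule tables consulted token by token. The tables and function below are illustrative toy examples, not the rules of any system discussed in this chapter; a real preprocessor would need far larger tables and context-sensitive rules (e.g. to read ``1:24'' as a clock time only in a temporal context).

```python
# Toy rule tables for text normalisation; all entries are illustrative only.
ABBREVIATIONS = {"i.e.": "that is", "viz.": "namely"}
ACRONYMS = {"NATO": "naytoe", "UN": "you en"}  # word-like vs. spelled-out reading
NUMBERS = {"124": "one hundred and twenty four",
           "1:24": "twenty four minutes past one"}
SYMBOLS = {"#1": "number one", "£1.50": "one pound fifty"}

def preprocess(text: str) -> str:
    """Expand known textual anomalies token by token; pass the rest through."""
    out = []
    for token in text.split():
        for table in (ABBREVIATIONS, ACRONYMS, NUMBERS, SYMBOLS):
            if token in table:
                out.append(table[token])
                break
        else:  # no table matched: keep the token as plain orthography
            out.append(token)
    return " ".join(out)

print(preprocess("The UN budget is £1.50"))  # prints: The you en budget is one pound fifty
```

The token-by-token design already shows why real preprocessing is hard: expansions such as ``1:24'' depend on context, which a pure table lookup cannot capture.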

There are no standardised tests for determining the adequacy of text preprocessors. Yet it seems that all preprocessors meet with the same classes of transduction problems, so that it would make sense to set up a multilingual benchmark for preprocessing. [Laver et al. (1988), Laver et al. (1989)], describing the internal structure of the CSTR text preprocessor, mention a number of transduction problems and present some quantification of their errors in the various categories, which we recapitulate in Table 12.1 [Laver et al. (1988), pp. 12-15]. The test was run on a set of anomaly-rich texts taken from newspapers and technical journals.


# tested   # correct   % correct
Meta-textuals                    95          95         100
Capital-initials                 87          87         100
Digit-bearing                    35          34          97
Hyphens                          15          15         100
Other                            24          24         100
Proper names                                             73
Strings in dictionary                                    85
Strings not in dictionary                                87
Number strings                                          100
Table 12.1: Percentage of correct treatment of textual anomalies by CSTR text preprocessor.  

The results in Table 12.1 are revealing not so much for the numerical information they offer as for the taxonomy of errors opted for. The only other formal evaluation of a text preprocessor that we have managed to locate uses a completely different set of error categories. [Van Holsteijn (1993)] presents an account of a text preprocessor for Dutch, and gives the results of a comprehensive evaluation of the module. It was observed that the use of abbreviations, acronyms and symbols differs strongly from text to text. Three types of newspaper text were broadly distinguished:

  1. editorial text on home and foreign news,
  2. editorial text on sports and business,
  3. telegraphic-style text (i.e. classified ads, film & theatre listings, radio & television guides).
Text segmentation  errors were separately counted for:

Correctly demarcated expressions could then be characterised further in terms of:

Finally, a distinction is made between unavoidable and avoidable errors. The former type would be the result of incorrect or unavailable syntactic/semantic information that would be needed in order to choose between alternative solutions. The latter type is the kind of error that needs correction, either by the addition of new rules or by inclusion in the exceptions lexicon. Table 12.2 presents some results [after Van Holsteijn (1993)].


     (1) segm. sent.            (2) label. express.
A.   0.1 (0.4)   N=786          <0.01 (<0.01)   N=13699
B.   0.2 (0.4)   N=643          <0.01 (<0.01)   N=10388
C.   0.0 (0.5)   N=202          0.0   (2.0)     N=2570

     (3) label. express.        (4) expan. expr.
A.   0.9 (3.8)   N=1904         1.5 (0.0)   N=479
B.   0.3 (4.0)   N=1683         0.0 (0.0)   N=571
C.   0.9 (2.5)   N=1231         0.7 (0.4)   N=560
Table 12.2: Evaluation results for text preprocessor  TextScan 

Percentage of avoidable errors in four categories; percentage of unavoidable errors in parentheses; N specifies the 100% base per cell.

The proposals by [Laver et al. (1988)] and [Van Holsteijn (1993)] represent rather crude, and disparate, approaches towards a taxonomy of errors of a text preprocessor. What is clearly needed for the evaluation of text preprocessors is a more principled analysis of the various tasks a text preprocessor has to perform, focussing on those classes of difficulties that crop up in the European language concerned. Procedures should be devised that automatically extract representative items from large collections of recent text (newspapers) in each of the relevant error categories, so that multilingual tests can be set up efficiently. Once the test materials have been selected, the correct solutions to, for instance, expansion problems can be extracted from existing databases or, where missing there, will have to be entered manually.
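Such an extraction procedure could, as a first approximation, be driven by per-category regular expressions over a corpus. The sketch below is a minimal illustration; the category definitions and patterns are assumptions for the purpose of the example and would have to be tuned per language.

```python
import re
from collections import defaultdict

# Illustrative patterns for a few anomaly categories (assumed, not standard).
CATEGORIES = {
    "abbreviation": re.compile(r"\b(?:[A-Za-z]\.){2,}"),   # e.g. "i.e.", "U.S."
    "acronym":      re.compile(r"\b[A-Z]{2,}\b"),          # e.g. "NATO"
    "digit_string": re.compile(r"\b\d+(?:[.,:]\d+)*\b"),   # e.g. "1:24", "1.50"
    "hyphenated":   re.compile(r"\b\w+(?:-\w+)+\b"),       # e.g. "text-to-speech"
}

def extract_items(corpus_lines, per_category=50):
    """Collect up to `per_category` distinct matches per anomaly category."""
    found = defaultdict(set)
    for line in corpus_lines:
        for name, pattern in CATEGORIES.items():
            for match in pattern.findall(line):
                if len(found[name]) < per_category:
                    found[name].add(match)
    return found

items = extract_items(["NATO met at 1:24", "i.e. the text-to-speech test"])
```

Once such candidate lists exist per language, the correct expansions can be filled in from databases or by hand, as suggested above.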


Grapheme-phoneme conversion


By grapheme-phoneme conversion we mean a process that accepts full-blown orthographic input (i.e. the output of a preprocessor), and outputs a string of phonemes. The output string does not yet contain (word) stress marks, (sentence) accent positions, or boundaries. The correct phonemic representation of a normally spelled word depends on its linear context and hierarchical position (e.g. assimilation to adjacent words: I have to go /aɪf tə ɡəʊ/ but I have two goals /aɪv tuː ɡəʊlz/; or the choice between heterophonous homographs: I lead /liːd/ but made of lead /lɛd/; see also Chapter 6). Therefore the adequacy of grapheme-phoneme conversion modules should not, in principle, be tested on the basis of isolated word pronunciation (citation forms). In practice, however, this is precisely what is done. The reasons for this are threefold:

Table 12.3 presents results of a multilingual evaluation of grapheme-phoneme converters for seven EU languages, performed within ESPRIT  291/860 ``Linguistic analyses of European languages,'' based on isolated word pronunciation. Since it has often been reported that many more conversion errors occur in proper names than in ordinary words, the evaluation distinguished between four types of materials:


Language    Newspaper   Towns   Capitals   First names
Dutch            98.9    85.3       96.7          89.4
English          90.3    46.0       58.0          58.0
French      96.9-94.5    77.3       74.2          84.7
German      93.0-90.0    81.0       61.0          80.0
Greek            98.7    97.3       93.5          97.3
Italian          85.2    85.3       80.6          86.9
Spanish          98.9    95.3       96.6          98.0
Table 12.3: Percentage correct grapheme-phoneme conversion in seven EU languages in four types of materials. 

Note: Newspaper scores are weighted for token frequency. The higher first score for French excludes all preprocessing errors; the higher first German score is based on the use of an exceptions list.

Incidentally, the results should not be taken to indicate that Italian spelling is harder to convert to phonemes than that of any other language, since a different conversion method was used for each language; what they do show is that Italian proper names are no more of a problem than ordinary text words. In English and French, by contrast, proper names do present a serious problem, so that exceptions lists will be a priority for these languages.

In a complementary test, [Nunn & Van Heuven (1993)] compared the performance of three grapheme-phoneme converters for Dutch: two systems with no or only implicit morphological decomposition [Kerkhoff et al. (1984), Berendsen et al. (1986)] and one that included the MORPA morphological decomposition module. About 2,000 simplex and complex (see Section 12.5.1) test words were selected from newspaper texts that did not belong to the 10,000 most frequent Dutch words, so that dictionary look-up would fail. Phoneme, syllabification, and stress placement errors were found by automated comparison with a hand-made master transcription file. The earlier converters performed at success rates of 60% and 64%, considerably poorer than the newspaper text scores in Table 12.3 [Pols (1991), p. 394]. The newer system with explicit morphological decomposition was correct in 78% of cases.
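An automated comparison against a master transcription file of the kind used in such tests can be sketched as follows. The transcription format (space-separated phonemes, with "'" marking stress) and the two-way error taxonomy are assumptions for illustration, not those of the cited study.

```python
# Sketch: score predicted transcriptions against a hand-made master file.
# Format assumed: space-separated phoneme symbols, "'" prefixed to the
# stressed vowel/syllable. A segment mismatch counts as a phoneme error;
# matching segments with a misplaced stress mark count as a stress error.

def score(predicted: dict, master: dict):
    """Return error counts by type and the number of fully correct words."""
    errors = {"phoneme": 0, "stress": 0}
    for word, ref in master.items():
        hyp = predicted.get(word, "")
        if hyp.replace("'", "") != ref.replace("'", ""):
            errors["phoneme"] += 1
        elif hyp != ref:  # segments right, stress mark misplaced
            errors["stress"] += 1
    correct = len(master) - sum(errors.values())
    return errors, correct

master = {"record": "'r e k o: d", "lead": "l i: d", "go": "g @u"}
predicted = {"record": "r e k 'o: d", "lead": "l e d", "go": "g @u"}
errors, correct = score(predicted, master)
```

Separating the error types in this way is what allows the phoneme, syllabification, and stress results mentioned above to be reported independently.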


Word stress


Stressed syllables are generally pronounced with greater duration, greater loudness (in terms of acoustic intensity as well as pre-emphasis of higher frequencies), and greater articulatory precision (no consonant deletions, more peripheral vowel formant values). Moreover, when a word is in focus, a prominence-lending fast pitch movement occurs on the stressed syllable of that word. Except in French, where stress always falls on the last full syllable of the word, stress position varies from word to word in all EU languages. However, stress position in these languages is predictable to a large extent on the basis of:

All the EU languages have a proportion of idiosyncratic words   that do not comply with the proposed stress rules for diverse reasons. Therefore the coverage of stress rule systems has to be evaluated, and errors have to be corrected by including the problematic words in an exceptions dictionary .

Tests of stress rule modules have been performed only on an ad hoc basis, either checking the output of the rules by hand ([Barber et al. (1989)] for Italian), or automatically, using the phonemic transcription field in lexical databases containing stress marks ([Langeweg (1988)] for Dutch), which in turn had been checked by hand in some earlier stage of the database development.
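The automatic variant of such a check can be sketched in a few lines: compare the stress position assigned by rule with the stress mark already present in the database transcription field. The syllable delimiter ("-") and stress mark ("'") below are assumed conventions, not those of the cited databases.

```python
# Sketch: validate rule-assigned stress against a lexical database field.
# Transcriptions use "-" between syllables and "'" before the stressed one.

def stressed_syllable(transcription: str) -> int:
    """Index of the syllable carrying the stress mark (-1 if none)."""
    for i, syl in enumerate(transcription.split("-")):
        if syl.startswith("'"):
            return i
    return -1

def check_stress_rules(rule_output: dict, lexicon: dict):
    """Return the words whose rule-assigned stress disagrees with the lexicon."""
    return [w for w, t in rule_output.items()
            if stressed_syllable(t) != stressed_syllable(lexicon[w])]

lexicon = {"banana": "b@-'na:-n@", "window": "'wIn-do:"}
rule_output = {"banana": "'b@-na:-n@", "window": "'wIn-do:"}
mismatches = check_stress_rules(rule_output, lexicon)
```

The mismatch list is exactly the set of candidate words for the exceptions dictionary mentioned above (once genuine rule errors have been separated from database errors).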


Morphological decomposition


In morphological decomposition orthographic words are analysed into morphemes, i.e. elements belonging to the finite set of smallest subword parts with an identifiable meaning (see Chapter 6). Morphological decomposition is necessary when the language/spelling allows words to be strung together without intervening spaces or hyphens so as to form an indefinitely large number of complex, longer words. For many EU languages word-internal morpheme boundaries are referred to by the grapheme-phoneme conversion rules. For instance, the English letter sequence sh is pronounced as /ʃ/ when it occurs morpheme-internally, as in bishop, but as /s/ followed by /h/ when a morpheme boundary intervenes, as in mishap.

Obviously, long and complex words will have to be broken up into smaller basic words and affixes (i.e. morphemes) before the parts can be looked up in an exceptions dictionary. If all complex words were to be integrally stored in the lexicon, it would soon grow to unmanageable proportions. For stress placement rules it is sometimes necessary to refer to the hierarchical relationships between the constituent morphemes (e.g. ˈlighthouse keeper vs. light ˈhousekeeper, where ``ˈ'' denotes main stress) and to the lexical category of the word-final morpheme (which generally determines the lexical category of the complex word as a whole, e.g. black+bird is a noun, pitch+black is an adjective). Morphological decomposition is a notoriously difficult task, as one input string can often be analysed in a large number of different ways. The hard problem is choosing the correct solution out of the many possible solutions.
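The ambiguity problem can be made concrete with a few lines of code: exhaustive segmentation of a letter string against even a tiny morpheme lexicon typically yields several analyses, among which the parser must then choose. The lexicon below is a toy example constructed for illustration.

```python
# Toy morpheme lexicon; real lexicons hold tens of thousands of entries.
MORPHEMES = {"mis", "hap", "mishap", "bi", "shop", "bishop", "s", "h"}

def segmentations(word, lexicon=MORPHEMES):
    """All ways to split `word` into a sequence of known morphemes."""
    if not word:
        return [[]]  # empty string: one analysis, the empty sequence
    results = []
    for i in range(1, len(word) + 1):
        prefix = word[:i]
        if prefix in lexicon:
            for rest in segmentations(word[i:], lexicon):
                results.append([prefix] + rest)
    return results

print(segmentations("mishap"))  # prints: [['mis', 'hap'], ['mishap']]
```

Even this toy lexicon produces competing analyses for mishap and bishop, which is precisely where the sh example above becomes relevant: only the chosen segmentation tells the grapheme-phoneme rules whether a morpheme boundary intervenes.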

As far as we have been able to ascertain, there are no established test procedures for evaluating the performance of morphological decomposition modules. [Laver et al. (1988), pp. 12-16] tested the morphological decomposition module of the CSTR TTS on 500 words randomly sampled from an 85,000-word type list, which was compiled from a large text corpus as well as from two machine-readable dictionaries. The output of the module was examined by hand, and proved 70% accurate (which seems rather low considering that the elements of English compounds are generally separated by spaces or hyphens).

[Heemskerk & Van Heuven (1993)] evaluated the Dutch morphological decomposition module MORPA (MORphological PArser) by comparing the module's output with pre-stored morphological decompositions in a lexical database. In this comparison only segmentation errors were counted, in a sample of 3,077 (simplex and complex) words taken from weekly newspapers. The results showed that in 3% of the input the whole word, or part of it, could not be matched with any entry in the MORPA morpheme lexicon. The frequency of this type of error depends on the coverage of the lexicon. Erroneous analyses were generated in another 1% of the input words. In all other cases the correct morphological segmentation was generated, either as the single correct solution (44%), or as the most likely solution in an ordered list of candidate segmentations (48%), or as one of the less probable candidate solutions (3%). Although both the accuracy and the coverage of the MORPA module seem excellent by today's standards, the module proved too slow for realistic text-to-speech applications. Processing speed is therefore an important criterion in the evaluation of morphological parsers: there will be a speed/accuracy/coverage trade-off.


Syntactic parsing


Syntactic analysis lays the groundwork for the derivation of the prosodic  structure needed to demarcate the phonological phrases (whose boundaries block assimilation   and stress   clash avoidance rules) and intonation domains (whose boundaries are marked by deceleration, pause insertion and boundary marking pitch movements  ). Syntactic structure also determines (in part) which words have to be accented . Finally, lexical category disambiguation  is often a by-product of a syntactic parser.

Although the syntactic parser is an important module in any advanced TTS, we take the view that, in principle, its development and evaluation do not belong to the domain of speech output systems. Syntactic parsing is much more a language engineering challenge, needed in automatic translation systems, grammar checking, and the like. For this reason, we refer to the chapters produced by the EAGLES Working Groups on the evaluation of Automatic Translation and Translation tools.


Sentence accent


Appropriate accentuation is necessary to direct the listener's attention to the important words in the sentence. Inappropriate accentuation may lead to misunderstandings and delays in processing time [Terken (1985)]. For this reason most TTS-systems  provide for accent placement rules. Accentuation rules can be evaluated at the symbolic and the acoustic level.

[Monaghan & Ladd (1989), Monaghan & Ladd (1990)] tested the symbolic output of a sentence accent assignment algorithm applied to four English 250-word texts (transcripts of radio broadcasts). The algorithm generated primary and secondary accents, which were rated on a 4-point appropriateness scale by three expert judges. [Van Bezooijen & Pols (1989)] tested a Dutch accent assignment algorithm at the symbolic as well as the acoustic level (only one type of accent is postulated for Dutch), using 8 isolated sentences and 8 short newspaper texts. Two important points emerged from this study:

Again, these are scattered tests, addressing only a handful of the problems that a linguistic module has to take care of. We would recommend the development of a comprehensive test procedure  that identifies categories of accent placement error at the sentence and the paragraph level. The principles that underlie sentence accent placement are largely the same across EU languages, so that it makes sense to develop the test procedure on a multilingual basis.    


EAGLES SWLG SoftEdition, May 1997.