Next: Test procedures Up: Methodology Previous: Methodology



One of the most important aspects of a measuring instrument is its reliability. How reliable, for example, is subjects' performance in functional intelligibility tests when they are tested several times? Test/retest intrasubjective reliability of intelligibility was assessed by [Logan et al. (1989)] and [Van Bezooijen (1988)]; in both cases it was found to be good. More attention has been paid to subject dimensions that systematically affect intersubjective reliability. This research was motivated by the finding of large variance in test scores, which may obscure effects of the synthesis systems compared. Most studies in this area examined variability in intelligibility scores. Subject dimensions considered relevant include: age, non-expert experience with synthetic speech, expert experience with synthetic speech, and analytic listening.

Within the ESPRIT-SAM project [Howard-Jones (1992a), Howard-Jones (1992b)], the effect of age was examined with Italian VCV-items. Five age categories were distinguished (10-19, 20-29, 30-44, 45-59, over 60), with between 5 and 8 subjects per group. Group scores for correct consonant identification ranged from 58% for the oldest group to 64% for the youngest group. Thus, little evidence was found for an effect of the subject dimension age.

Non-expert experience with synthetic speech was investigated in several studies. [Howard-Jones (1992a), Howard-Jones (1992b)] compared the performance of 8 subjects experienced with synthetic speech and 24 inexperienced subjects, using German VCV-items. The mean score for the experienced subjects was 79%, that for the inexperienced subjects 62%. There is further evidence that the intelligibility of synthetic speech increases as a result of non-expert experience, both when acquired in the form of training with feedback [e.g., Greenspan et al. (1985), Schwab et al. (1985)] and when acquired in a more natural way without feedback [Pisoni et al. (1985b), Pisoni et al. (1985a), Boogaart & Silverman (1992)]. The learning effect has been found to manifest itself after only a few minutes of exposure. However, there are indications that the effect of learning depends on the type of synthesis used. [Jongenburger & Van Bezooijen (1992)] assessed the intelligibility of two synthesis systems used by the visually handicapped for reading a digital daily newspaper, both in a first confrontation and after one month of experience. An open response CVC identification test was used. For one system, which was allophone-based, consonant intelligibility increased from 58% to 79%; for the other system, which was diphone-based, intelligibility increased from 63% to 68%. It was hypothesised that the characteristics of allophone-based synthesis are easier to learn because they are rule-governed and therefore more invariant than those of diphone-based synthesis. Moreover, no transfer was found from experience with one type of synthesis to the understanding of the other type. This suggests that there is no such thing as general experience in listening to synthetic speech.
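As an aside, the percent-correct scores quoted throughout this section are straightforward to compute. The sketch below scores an open-response consonant identification test of the kind used in the studies above; the function name and the toy data are illustrative assumptions, not taken from any of the cited studies.

```python
# Sketch: scoring an open-response consonant identification test.
# All names and the toy data below are illustrative, not from the studies.

def percent_correct(responses, targets):
    """Percentage of responses that match the target consonants."""
    assert len(responses) == len(targets)
    hits = sum(r == t for r, t in zip(responses, targets))
    return 100.0 * hits / len(targets)

# Toy example: one subject's responses to five VCV items.
targets = ["p", "t", "k", "b", "d"]
responses = ["p", "t", "g", "b", "t"]  # two confusions: k->g, d->t
print(percent_correct(responses, targets))  # 60.0
```

Group scores such as the 79% versus 62% reported above are then simply means of these per-subject percentages.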

The subject dimension expert experience with synthetic speech was examined by [Howard-Jones (1992a)] with English VCV-items. A correct consonant identification score of 30% was obtained for the inexperienced subjects versus 49% for the experts. So, again, improved performance was found as a function of increased exposure.

The last subject dimension we want to mention is experience in listening analytically  to speech. On the basis of a reanalysis of the results from a number of their evaluation studies, [Van Bezooijen & Pols (1993)] conclude that the more ear-training  subjects have, the higher the percentages correct they attain. Furthermore, ear-training was found to result in a reduction of intersubjective differences.

Apart from variance which can be attributed to particular subject dimensions, much apparently individual variability is found in test scores. [Hazan & Shi (1993)] examined the variance in subject scores in various tests, including intelligibility of meaningless VCV-items, intelligibility of Semantically Unpredictable Sentences (SUS, see Section 12.7.7), and speech pattern identification for plosive place and voicing contrasts. A homogeneous group of subjects was used.

Despite the homogeneity of the subject group, a sizeable degree of variability was found in all tests. For the SUS the range (i.e. the difference between the best and worst performing subject) was 28%; for the CVC-test the range was 47%. At the level of speech pattern processing, considerable differences were found in the perceptual weighting given to individual cues to plosive place and voicing contrasts. Hazan & Shi attribute the variability not to audiological differences among listeners, but to the development of different perceptual strategies during language acquisition. They distinguish two types of listeners: ``auditors'' (i.e. users of acoustic information) and ``comprehenders'' (i.e. users of global contextual information).
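The range statistic used here is simply the difference between the best and worst performing subject. The sketch below computes it over per-subject percent-correct scores; the scores are invented for illustration (chosen to reproduce the 28% and 47% ranges mentioned above) and are not data from the study.

```python
# Sketch: the range statistic over per-subject percent-correct scores.
# The score lists below are invented for illustration only.

def score_range(scores):
    """Difference between the best and worst performing subject."""
    return max(scores) - min(scores)

sus_scores = [55, 61, 70, 48, 76, 63]  # hypothetical SUS scores (%)
cvc_scores = [40, 62, 87, 55, 71, 66]  # hypothetical CVC scores (%)
print(score_range(sus_scores))  # 28
print(score_range(cvc_scores))  # 47
```

A large range over a homogeneous subject group, as found by Hazan & Shi, signals individual variability that subject selection alone cannot remove.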

Having established that there is much variability in the scores obtained in speech output evaluation tests, part of which can be attributed to clearly identifiable subject dimensions such as previous experience with synthetic speech, one may wonder what implications this has for the selection of subjects in specific tests. We think the implications for subject selection depend on at least the type of listening required (e.g. global versus analytic mode) and the breadth of application (general public versus specific user groups). Therefore, for some common applications the following recommendations can be formulated:

Recommendations on choice of subjects

  1. Exclude hearing-impaired subjects from speech output assessment. Within the SAM project [Howard-Jones (1992a), Howard-Jones (1992b)] it is specified that subjects should pass the hearing screening test at 20 dB HL at all octave frequencies from 500 to 4000 Hz.
  2. Do not use the same subject more than once.
  3. In diagnostic testing  only include subjects speaking the same language (variety) as the language (variety) tested.
  4. For diagnostic   purposes requiring analytic listening , hire a trained phonetician (with a basic understanding of the relationships between articulation and acoustics) in the initial stages of development of a system in order to obtain subtle information (e.g. degree of voicing  in plosives), or information that is usually not used for functional purposes in real-life communication (e.g. formal aspects of temporal organisation and intonation , cf. [Terken (1993)]).
  5. In specialised applications, select subjects who are representative of the (prospective) users. For example, synthesis integrated in a reading machine for the blind should be tested with visually handicapped subjects. And synthesis intended for long-term use should be tested with subjects with different degrees of experience and familiarisation with the type of synthetic speech of interest.

    The above recommendation was made not only because of (possible) differences in the perception of the speech output, but also because motivation is known to play an important role in the effort people are willing to invest in understanding suboptimal speech. If people have a choice between human and synthetic speech, the synthetic speech will have to be good if it is to stand a chance of being accepted. However, if people do not have a choice, e.g. the visually handicapped, who without synthesis (or braille) would not have access to a daily newspaper, synthesis will be accepted more easily.

  6. Synthesis to be used by the general public for incidental purposes, i.e. synthesis that should be functionally adequate in a first confrontation, should be tested with a wide variety of subjects, including people with a limited command of the language, dialect speakers, and people of different ages. However, none of them should have experience in listening to synthetic speech. In telecommunications research, groups of between 12 and 16 subjects (all with English as their primary language) have been found sufficient to obtain stable mean values in judgment tests.
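Recommendation 1 above states a concrete pass/fail criterion that is easy to express in code. The following sketch checks whether a subject's audiometric thresholds meet the 20 dB HL criterion at all octave frequencies from 500 to 4000 Hz; the function name and data layout are assumptions for illustration.

```python
# Sketch of the hearing screening criterion in recommendation 1:
# a subject passes if the measured threshold is at or below 20 dB HL
# at every octave frequency from 500 to 4000 Hz.
# Function name and data layout are illustrative assumptions.

OCTAVE_FREQS_HZ = (500, 1000, 2000, 4000)

def passes_screening(thresholds_db_hl, limit_db=20):
    """thresholds_db_hl maps frequency (Hz) to the measured threshold (dB HL)."""
    return all(thresholds_db_hl[f] <= limit_db for f in OCTAVE_FREQS_HZ)

# A subject with a mild high-frequency loss fails the screening.
subject = {500: 10, 1000: 15, 2000: 20, 4000: 25}
print(passes_screening(subject))  # False: 25 dB HL at 4000 Hz
```

Such a check makes the exclusion criterion reproducible across laboratories, which matters given the intersubjective variability discussed in this section.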


EAGLES SWLG SoftEdition, May 1997.