One of the most important aspects of a measuring instrument is its reliability. How stable, for example, are subjects' scores in functional intelligibility tests when the same subjects are tested several times? Test/retest intrasubjective reliability of intelligibility was assessed by [Logan et al. (1989)] and [Van Bezooijen (1988)]; in both cases it was found to be good. More attention has been paid to subject dimensions that systematically affect intersubjective reliability. This research was motivated by the finding of large variance in test scores, which may obscure the effects of the synthesis systems being compared. Most studies in this area examined variability in intelligibility scores. Subject dimensions considered relevant include age, non-expert experience with synthetic speech, expert experience with synthetic speech, and analytic listening.
Within the ESPRIT-SAM project [Howard-Jones (1992a), Howard-Jones (1992b)], the effect of age was examined with Italian VCV-items. Five age categories were distinguished (10-19, 20-29, 30-44, 45-59, over 60), with between 5 and 8 subjects per group. The group scores for percentage correct consonant identification ranged from 58% for the oldest group to 64% for the youngest group. Thus little evidence was found for an effect of the subject dimension age.
Non-expert experience with synthetic speech was investigated in several studies. [Howard-Jones (1992a), Howard-Jones (1992b)] compared the performance of 8 subjects experienced with synthetic speech and 24 inexperienced subjects on German VCV-items. The mean score was 79% for the experienced subjects and 62% for the inexperienced subjects. There is further evidence that the intelligibility of synthetic speech increases as a result of non-expert experience, both when acquired in the form of training with feedback [e.g., Greenspan et al. (1985), Schwab et al. (1985)] and when acquired in a more natural way without feedback [Pisoni et al. (1985b), Pisoni et al. (1985a), Boogaart & Silverman (1992)]. The learning effect has been found to manifest itself after only a few minutes of exposure. However, there are indications that the effect of learning depends on the type of synthesis used. [Jongenburger & Van Bezooijen (1992)] assessed the intelligibility of two synthesis systems used by the visually handicapped for reading a digital daily newspaper, once at first confrontation and again after one month of experience. An open response CVC identification test was used. For one system, which was allophone-based, consonant intelligibility increased from 58% to 79%; for the other system, which was diphone-based, intelligibility increased from 63% to 68%. It was hypothesised that the characteristics of allophone-based synthesis are easier to learn because they are rule-governed and therefore more invariant than those of diphone-based synthesis. Moreover, no transfer was found from experience with one type of synthesis to the understanding of the other type. This suggests that there is no such thing as general experience in listening to synthetic speech.
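The size of the learning effect for the two systems can be made explicit as a gain in percentage points. The following is a minimal sketch in Python that uses only the four consonant intelligibility figures reported above; the dictionary layout and the function name `gain` are illustrative assumptions, not part of the original study.

```python
# Consonant intelligibility (% correct) as reported above:
# (first confrontation, after one month of experience).
reported = {
    "allophone-based": (58, 79),
    "diphone-based": (63, 68),
}


def gain(first, later):
    """Return the improvement in percentage points (hypothetical helper)."""
    return later - first


# Gain per synthesis type, in percentage points.
gains = {system: gain(*scores) for system, scores in reported.items()}

for system, points in gains.items():
    print(f"{system}: +{points} percentage points")
```

Expressed this way, the allophone-based system gains 21 percentage points and the diphone-based system 5, which is the asymmetry the hypothesis above sets out to explain.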
The subject dimension expert experience with synthetic speech was examined by [Howard-Jones (1992a)] with English VCV-items. A percentage correct consonant identification of 30% was obtained for the inexperienced subjects versus 49% for the experts. So, again improved performance was found as a function of increased exposure.
The last subject dimension we want to mention is experience in listening analytically to speech. On the basis of a reanalysis of the results from a number of their evaluation studies, [Van Bezooijen & Pols (1993)] conclude that the more ear-training subjects have, the higher the percentages correct they attain. Furthermore, ear-training was found to result in a reduction of intersubjective differences.
Apart from variance which can be attributed to particular subject dimensions, much apparently individual variability is found in test scores. [Hazan & Shi (1993)] examined the variance in subject scores in various tests, including intelligibility of meaningless VCV-items, intelligibility of Semantically Unpredictable Sentences (SUS, see Section 12.7.7), and speech pattern identification for plosive place and voicing contrasts. A homogeneous group of subjects was used:
Despite the homogeneity of the subject group, a sizeable degree of variability was found in all tests. For the SUS the range (i.e. the difference between the best and worst performing subject) was 28%, for the CVC-test the range was 47%. At the level of speech pattern processing, considerable differences were found in the perceptual weighting given to individual cues to plosive place and voicing contrasts. Hazan & Shi attribute the variability not to audiological differences among listeners, but to the development of different perceptual strategies during language acquisition. They distinguish two types of listeners: "auditors" (i.e. users of acoustic information) and "comprehenders" (i.e. users of global contextual information).
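The range statistic used here is simply the best subject score minus the worst. A minimal Python sketch follows; the function name `score_range` and the per-subject scores are hypothetical, invented for illustration only, and are not data from Hazan & Shi.

```python
def score_range(scores):
    """Return the difference between the best and the worst score.

    With scores given as percentages correct, the result is in
    percentage points.
    """
    if not scores:
        raise ValueError("scores must not be empty")
    return max(scores) - min(scores)


# Hypothetical per-subject percentages correct (illustration only).
hypothetical_scores = [52, 61, 70, 48, 76]
print(score_range(hypothetical_scores))
```

Because the range depends only on the two most extreme subjects, it is sensitive to a single unusually good or poor listener, which is worth bearing in mind when comparing ranges across tests with different numbers of subjects.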
Having established that there is much variability in the scores obtained in speech output evaluation tests, part of which can be attributed to clearly identifiable subject dimensions such as previous experience with synthetic speech, one may wonder what implications this has for the selection of subjects in specific tests. We think that the implications for subject selection depend at least in part on the type of listening required (e.g. global versus analytic mode) and on the width of application (general public versus specific user groups). Therefore, for some common applications the following recommendations can be formulated:
The above recommendation was made not only because of (possible) differences in the perception of the speech output, but also because motivation is known to play an important role in the effort people are willing to spend in order to understand suboptimal speech. If people have a choice between human and synthetic speech, the synthetic speech will have to be good if it is to have a chance of being accepted. However, if people do not have a choice, e.g. the visually handicapped, who without synthesis (or braille) will not have access to a daily newspaper, synthesis will be accepted more easily.