Test procedures

As indicated in Section 12.2, speech output assessment techniques can be differentiated along a number of parameters, but no parameters related to the actual test procedure were included there. Test procedures can vary with respect to subjects (see Section 12.3.1), stimuli, and response modality.

Stimuli can vary along a large number of parameters, the most important of which are listed below.

Length and complexity: (e.g. at the word phonology level: monosyllabic, disyllabic, polysyllabic, including only single consonants and vowels or also sequences of consonants and vowels). The more varied in length and complexity the test items are, the more diagnostic information can be obtained and the more representative the test results are for the perception of unrestricted speech output. However, higher linguistic levels are often less suited for diagnostic purposes because subjects' responses are determined by many other sources of information in addition to the acoustic properties of the stimuli (see Section 12.4.1).
Linguistic level: (word, sentence, paragraph). Again, the higher the linguistic level, the better test results can be generalised to unrestricted speech output.
Stimulus set: (fixed set, where all items are presented each time the test is run, versus open set, where each time new (combinations of) test items are presented, e.g. the SUS Test in Section 12.7.7). Of course, in the light of learning effects open sets are more useful and flexible than fixed sets.
Meaningfulness: either at the word level or at the sentence level (meaningful, meaningless, or mixed, i.e. lexically or semantically unpredictable). Each choice seems to have both advantages and disadvantages/restrictions. For example, tests which only use meaningful test items at the word level, such as the DRT and MRT (see Sections 12.7.4 and 12.7.5), have the advantage of being reliable and easy to administer. However, intelligibility may be overestimated, there is a risk of a ceiling effect, and such tests have little diagnostic value. In principle, the mixed approach seems a good choice, because the subjects are not guided in any way as to what constitutes a legal or an illegal response. Nevertheless, there may be a risk of a bias towards meaningful words. For other implications of the choice between meaningful, meaningless, and mixed items at the word level, see Section 12.5.2. For implications at the sentence level, see Section 12.5.2.
Representativeness: e.g. Phonetically Balanced (PB) stimulus lists, with a frequency of occurrence of phonemes in accordance with the phoneme distribution in the language tested or the specific domain of application at hand, or equal representation of each phoneme. If one wants to obtain a global idea of the intelligibility of a system, PB-lists are to be preferred; if one aims at diagnostic information, one usually opts for equal representation.
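The two notions of representativeness can be made concrete with a small check on a candidate stimulus list. The following is only a minimal sketch: the function names, the toy phoneme symbols, and the target distribution are hypothetical illustrations, not part of any test described here. It compares the phoneme proportions observed in a list against either a language-based target (the PB criterion) or a uniform target (equal representation).

```python
from collections import Counter

def phoneme_proportions(items):
    """Relative frequency of each phoneme over a list of transcribed items.

    Each item is a sequence of phoneme symbols (hypothetical notation).
    """
    counts = Counter(p for item in items for p in item)
    total = sum(counts.values())
    return {p: n / total for p, n in counts.items()}

def equal_target(phonemes):
    """Uniform target: every phoneme is equally represented."""
    phonemes = sorted(set(phonemes))
    return {p: 1.0 / len(phonemes) for p in phonemes}

def max_deviation(observed, target):
    """Largest absolute difference between observed and target proportions."""
    phonemes = set(observed) | set(target)
    return max(abs(observed.get(p, 0.0) - target.get(p, 0.0)) for p in phonemes)

# Toy stimulus list (assumed data, for illustration only).
stimuli = [("k", "a", "t"), ("t", "a", "p"), ("p", "a", "k")]
observed = phoneme_proportions(stimuli)

# Assumed language-wide distribution; real figures would come from a corpus.
language_target = {"a": 3 / 9, "k": 2 / 9, "t": 2 / 9, "p": 2 / 9}

print("deviation from language target:", max_deviation(observed, language_target))
print("deviation from equal target:   ", max_deviation(observed, equal_target(observed)))
```

A list with a small deviation from the language target approximates a PB-list, whereas a small deviation from the uniform target indicates equal representation; a single list will rarely satisfy both.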

In Section 12.7, summary descriptions of tests are given where the stimuli have been categorised along these stimulus parameters. Chapter 9 on methodology should also be consulted.

Response modality can vary along a number of parameters as well. The choice seems to be mainly determined by three factors: comparative versus diagnostic, functional versus judgment, and TTS development versus psycholinguistic interest. In the five types of response modalities listed below, 1 and 2 are mainly used within the glass box approach (1 in TTS development, 2 in psycholinguistically oriented research), whereas 3, 4 and 5 are more common in the black box approach. The latter three response modalities can be further differentiated in that 3 and 4 are functional in nature (3 in TTS development, 4 in psycholinguistically oriented research), whereas 5 represents judgment testing. In the list of response modalities a distinction is made between off-line tests, where subjects are given some time to reflect before responding, and on-line tests, where an immediate response is expected from the subjects, tapping the perception process before it is finished.

1. OFF-LINE IDENTIFICATION TESTS, where subjects are asked to transcribe the separate elements (sounds, words) making up the test items. This response modality can be further differentiated. With respect to the nature of the set of response categories there is a choice between:
   - a closed set, where subjects are forced to select the appropriate response from a limited number of pregiven categories, and
   - an open response mode, where the only restrictions are the constraints imposed by the language.
   Transcription can be:
   - in normal spelling, leading to problems in the interpretation of the responses in case of meaningless or lexically unpredictable stimuli (e.g. if subjects write down ``lead'', have they heard /led/ or /li:d/?), or
   - in an unambiguous notation, placing the burden upon the subjects, since they have to be trained to apply this notation system systematically.
2. ON-LINE IDENTIFICATION TESTS, requiring the subject to decide whether the stimulus does or does not exist as a word in the language (so-called lexical decision task, e.g. [Pisoni et al. (1985a), Pisoni et al. (1985b)]).
3. OFF-LINE COMPREHENSION TESTS, in which content questions have to be answered in an open or closed response mode (e.g. [Pisoni et al. (1985a), Pisoni et al. (1985b)]).
4. ON-LINE COMPREHENSION TESTS, requiring the subject to indicate whether a statement is true or not (so-called sentence verification task, e.g. [Manous et al. (1985)]).
5. JUDGMENT TESTS (also called opinion tests), involving the rating of scales (e.g. [Pavlovic et al. (1990), Delogu et al. (1991), ITU-T (1993)]).
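The classification of the five response modalities given in the text can be captured in a small lookup table, which may be convenient when selecting a test type by criterion. This is a minimal sketch: the class, field names, and `select` helper are hypothetical, and a field is left as `None` wherever the text does not state a value.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class ResponseModality:
    number: int
    name: str
    timing: Optional[str]    # "off-line" or "on-line"
    approach: str            # "glass box" or "black box"
    nature: Optional[str]    # "functional" or "judgment"; None if not stated
    interest: Optional[str]  # "TTS development" or "psycholinguistic"; None if not stated

MODALITIES = (
    ResponseModality(1, "off-line identification", "off-line", "glass box", None, "TTS development"),
    ResponseModality(2, "on-line identification", "on-line", "glass box", None, "psycholinguistic"),
    ResponseModality(3, "off-line comprehension", "off-line", "black box", "functional", "TTS development"),
    ResponseModality(4, "on-line comprehension", "on-line", "black box", "functional", "psycholinguistic"),
    ResponseModality(5, "judgment", None, "black box", "judgment", None),
)

def select(**criteria):
    """Return the modalities whose fields match all given criteria."""
    return [
        m for m in MODALITIES
        if all(getattr(m, field) == value for field, value in criteria.items())
    ]

# Example: black box modalities suited to TTS development.
for m in select(approach="black box", interest="TTS development"):
    print(m.number, m.name)
```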

The last response modality will be discussed in some more detail. Pavlovic and co-workers have conducted an extensive series of studies [Pavlovic et al. (1990)] comparing different types of scaling methods that can be used in judgment tests to evaluate speech output. Much attention was paid to:

the magnitude estimation method, where the subject is presented with an auditory stimulus and is asked to express the perceived strength/quality of the relevant attribute (e.g. intelligibility) numerically (``type in a value'') or graphically (``draw a line on the computer screen''), and
the categorical estimation method, where the subject has to select a value from a limited range of prespecified values, e.g. 1 representing extremely poor and 10 excellent intelligibility.

Pavlovic et al. stress that there are important differences between the two types of scaling methods, for example the fact that categorical estimation results in an interval scale, whereas magnitude estimation results in a ratio scale. The former leads to the use of raw ratings, the calculation of the arithmetic mean, and the comparison of conditions in terms of differences; the latter leads to the use of the logarithm of the ratings, the geometric mean, and comparison in terms of ratios. The differences also have implications for the type of conclusions to be drawn from the test results. Both the categorical estimation method (with a 20-point scale) and the magnitude estimation method have been included in SOAP as standard SAM Overall Quality test procedures (see Section 12.7.11).
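The difference in statistical treatment can be illustrated with a short sketch. The function names and the ratings below are invented for illustration and do not come from SOAP or from the studies cited. Categorical ratings are summarised by the arithmetic mean and compared by differences; magnitude estimates are log-transformed, summarised by the geometric mean, and compared by ratios.

```python
import math
from statistics import mean

def categorical_summary(ratings):
    """Interval-scale treatment: arithmetic mean of the raw ratings."""
    return mean(ratings)

def magnitude_summary(ratings):
    """Ratio-scale treatment: geometric mean, i.e. exp of the mean log rating."""
    if any(r <= 0 for r in ratings):
        raise ValueError("magnitude estimates must be positive")
    return math.exp(mean([math.log(r) for r in ratings]))

# Invented ratings for two systems, A and B.
cat_a, cat_b = [14, 16, 15], [10, 12, 11]    # e.g. on a 20-point scale
mag_a, mag_b = [40, 80, 160], [20, 40, 80]   # free numerical estimates

# Categorical estimation: compare conditions by a difference of means.
difference = categorical_summary(cat_a) - categorical_summary(cat_b)

# Magnitude estimation: compare conditions by a ratio of geometric means.
ratio = magnitude_summary(mag_a) / magnitude_summary(mag_b)

print(f"categorical: A exceeds B by {difference} scale points")
print(f"magnitude:   A is rated {ratio:.2f} times B")
```

The geometric mean is used for magnitude estimates because subjects choose their own numerical range; averaging in the log domain keeps one subject's large numbers from dominating the result.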

Recommendations on choice of response modality

For rapid judgment testing, use intra-subject (``internal comparison'') categorical estimation, and when you do, use at least a 10-point scale.
To compare results across tests (``external comparison''), use magnitude estimation, and when you do, use the line length drawing procedure, asking subjects to express the quality of the stimulus relative to the most ideal (human) speech they can imagine.

EAGLES SWLG SoftEdition, May 1997.