As indicated in Section 12.2, speech output assessment techniques
can be differentiated along a
number of parameters, but no parameters related to the actual test
procedure were included
there. Test procedures can vary with respect to
subjects (see Section 12.3.1), stimuli, and
response modality.
Stimuli can vary along a large number of parameters, the
most important of which
are listed below.
- Length and complexity: (e.g. at the word phonology level:
monosyllabic, disyllabic,
polysyllabic, including only single consonants and vowels
or also sequences of
consonants and vowels). The more varied in length and
complexity the test items
are, the more diagnostic information can be obtained and
the more representative
the test results are for the perception of unrestricted
speech output. However,
higher linguistic levels are often less suited for
diagnostic purposes because
subjects' responses are determined by many other sources of
information in
addition to the acoustic properties of the stimuli (see
Section 12.4.1).
- Linguistic level: (word, sentence, paragraph).
Again, the higher the linguistic
level, the better the test results can be generalised to
unrestricted speech output.
- Stimulus set: (fixed set, where all items are
presented each time the test is run,
versus open set, where each time new (combinations of)
test items are presented,
e.g. the SUS Test
in
Section 12.7.7). Of course, since repeated presentation of a fixed set
invites learning effects, open sets are more useful and flexible than fixed sets.
- Meaningfulness: either at the word level or at the
sentence level (meaningful,
meaningless, or mixed, i.e. lexically or semantically
unpredictable). Each choice
seems to have both advantages and disadvantages/restrictions.
For example, tests which only
use meaningful test items at the word level, such as the
DRT and MRT (see Sections 12.7.4 and 12.7.5),
have the advantage of being reliable and easy to administer.
However, intelligibility may be overestimated, there is
a risk of a ceiling effect,
and they have little diagnostic value. In principle, the
mixed approach seems a
mixed approach seems a
good choice, because the subjects are not guided in any
way as to what constitutes
a legal or an illegal response. Nevertheless, there may
be a risk of a bias towards
meaningful words. For other implications of the choice
between meaningful,
meaningless, and mixed items at the word level, see Section 12.5.2. For
implications at the sentence level, see Section 12.5.2.
- Representativeness: e.g. Phonetically Balanced
(PB) stimulus lists, with a
frequency of occurrence of phonemes in accordance with
the phoneme distribution
in the language tested or the specific domain of
application at hand, or equal representation of each
phoneme. If one wants to obtain a global idea of the
intelligibility of a system, PB-lists are to be
preferred; if one aims at diagnostic
information, one usually opts for equal representation. A minimal
sketch of such a distribution check follows below.
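To make the representativeness check concrete, here is a minimal sketch
in Python; the phoneme inventory, target frequencies, and pre-segmented
stimuli are invented for illustration, and a real test would use the
documented phoneme distribution of the language or application domain.

    from collections import Counter

    # Hypothetical target phoneme frequencies (fractions summing to 1).
    TARGET = {"p": 0.10, "t": 0.20, "k": 0.10, "s": 0.15,
              "a": 0.20, "i": 0.15, "e": 0.10}

    def distribution(stimuli):
        # Relative phoneme frequencies of a stimulus list; each stimulus is
        # assumed to be pre-segmented into a sequence of phoneme symbols.
        counts = Counter(ph for item in stimuli for ph in item)
        total = sum(counts.values())
        return {ph: n / total for ph, n in counts.items()}

    def max_deviation(stimuli, target=TARGET):
        # Largest absolute difference between list and target frequencies.
        # A phonetically balanced (PB) list should keep this small; for a
        # diagnostic list one would instead compare against a uniform target.
        dist = distribution(stimuli)
        return max(abs(dist.get(ph, 0.0) - target.get(ph, 0.0))
                   for ph in set(dist) | set(target))

    # Two monosyllabic items, each given as a list of phoneme symbols.
    print(max_deviation([["p", "a", "t"], ["k", "i", "s"]]))

For the equal-representation (diagnostic) variant, the same function can
simply be called with a uniform target over the phoneme inventory.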
In Section 12.7, summary descriptions of tests are given where
the stimuli have been
categorised along these stimulus parameters.
Chapter 9 on methodology should also be consulted.
Response modality can vary along a number of parameters
as well. The choice seems to be mainly determined by three factors:
comparative versus diagnostic, functional versus judgment,
and TTS development versus psycholinguistic interest. In the five types
of response modalities listed below, 1 and 2 are mainly used
within the glass box approach (1 in TTS development, 2 in
psycholinguistically oriented research), whereas 3, 4 and 5
are more common in the black box approach. The latter three
response modalities can be further differentiated in that 3 and 4 are
functional in nature (3 in TTS development, 4 in psycholinguistically
oriented research), whereas 5 represents judgment testing. In the list
of response modalities a distinction is made between off-line tests,
where subjects are given some time to reflect before responding, and
on-line tests, where an immediate response is expected from the
subjects, tapping the perception process before it is finished.
- OFF-LINE IDENTIFICATION TESTS, where subjects are
asked to transcribe the separate elements (sounds, words) making up the
test items. This response modality can be further differentiated.
With respect to the nature of the set of response
categories there is a choice between:
- a closed set, where subjects are forced to
select the appropriate response
from a limited number of pregiven categories, and
- an open response mode, where the only
restrictions are the constraints
imposed by the language.
TRANSCRIPTION can be:
- in normal spelling, leading to problems in the
interpretation of the
responses in case of meaningless or lexically
unpredictable stimuli (e.g. if
subjects write down ``lead'', have they heard /led/ or
/li:d/?), or
- unambiguous notation, placing the burden upon
the subjects, since they
have to be trained to systematically apply this
notation system.
- ON-LINE IDENTIFICATION TESTS, requiring the
subject to decide whether the stimulus does or does not exist as a word
in the language (so-called lexical decision task, e.g.
[Pisoni et al. (1985a), Pisoni et al. (1985b)]); a minimal sketch of
such an on-line trial is given after this list.
- OFF-LINE COMPREHENSION TESTS, in which content
questions have to be answered in an open or closed response mode
(e.g. [Pisoni et al. (1985a), Pisoni et al. (1985b)]).
- ON-LINE COMPREHENSION TESTS,
requiring the subject to indicate whether a
statement is true or not (so-called sentence
verification task, e.g. [Manous et al. (1985)]).
- JUDGMENT TESTS
(also called opinion tests),
involving the rating of scales (e.g. [Pavlovic et al. (1990),
Delogu et al. (1991), ITU-T (1993)]).
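As a rough illustration of the on-line response modalities above, the
following minimal Python sketch runs a lexical decision block, logging
each yes/no response together with its latency; audio presentation is
stubbed out, and the item list and function names are invented for
illustration.

    import time

    def play(stimulus):
        # Stub; a real test would play the synthesised speech token here.
        print("(playing)", stimulus)

    def run_lexical_decision(stimuli):
        # Present each stimulus and record the decision plus its latency.
        # Latency is informative because on-line tests tap the perception
        # process before it is finished.
        results = []
        for stimulus in stimuli:
            play(stimulus)
            start = time.monotonic()
            answer = input("Word in the language? [y/n] ").strip().lower()
            results.append((stimulus, answer == "y", time.monotonic() - start))
        return results

    # Mixed real words and legal nonsense items, as in a lexical decision task.
    for item, is_word, latency in run_lexical_decision(
            ["flight", "blick", "table", "droop"]):
        print(item, is_word, round(latency, 3))

A sentence verification trial (on-line comprehension) would differ only
in presenting a statement and asking for a true/false response.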
The last response modality will be discussed in some more
detail. Pavlovic and co-workers
have conducted an extensive series of studies [Pavlovic et al. (1990)]
comparing
different types of scaling methods that can be used in
judgment tests to evaluate speech
output. Much attention was paid to:
- the magnitude estimation method, where the
subject is presented with an
auditory stimulus and is asked to express the
perceived strength/quality of the
relevant attribute (e.g. intelligibility) numerically
(``type in a value'') or graphically
(``draw a line on the computer screen''), and
- the categorical estimation method,
where the subject has to select a value
from a limited range of prespecified values, e.g. 1
representing extremely poor and
10 excellent intelligibility.
Pavlovic et al. stress that there are important differences
between the two types of scaling methods, for example the fact that
categorical estimation results in an interval scale, whereas magnitude
estimation results in a ratio scale. The former leads to the use of raw
ratings, the calculation of the arithmetic mean, and the comparison of
conditions in terms of differences; the latter leads to the use of the
logarithm of the ratings, the geometric mean, and comparison in terms
of ratios. The differences also have implications for the type of
conclusions to be drawn from the test results. Both the categorical
estimation method (with a 20-point scale) and the magnitude estimation
method have been included in SOAP as standard SAM Overall Quality test
procedures (see Section 12.7.11).
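The practical consequence for the analysis can be made concrete with a
small sketch in Python (ratings invented for illustration): categorical,
interval-scale data are summarised by the arithmetic mean and compared
by differences, whereas magnitude, ratio-scale data are log-transformed,
summarised by the geometric mean, and compared by ratios.

    import math

    def arithmetic_mean(xs):
        return sum(xs) / len(xs)

    def geometric_mean(xs):
        # exp of the mean of the logs; ratings must be strictly positive.
        return math.exp(sum(math.log(x) for x in xs) / len(xs))

    # Invented ratings for two synthesis systems under test.
    categorical_a = [14, 15, 13, 16]    # 20-point categorical estimation
    categorical_b = [11, 12, 10, 12]
    magnitude_a = [45.0, 60.0, 50.0]    # magnitude estimation (e.g. line lengths)
    magnitude_b = [30.0, 40.0, 25.0]

    # Interval scale: conditions are compared in terms of differences.
    print("A - B =", arithmetic_mean(categorical_a) - arithmetic_mean(categorical_b))

    # Ratio scale: conditions are compared in terms of ratios.
    print("A / B =", geometric_mean(magnitude_a) / geometric_mean(magnitude_b))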
- For rapid judgment testing, use intra-subject
(``internal comparison'') categorical estimation,
with at least a 10-point scale.
- To compare results across tests
(``external comparison''), use magnitude estimation
with the line length drawing procedure,
asking subjects to express the quality of
the stimulus relative to the most ideal (human) speech
they can imagine.