
 

Factorial experiments and corpus studies

 

Experimental speech research has traditionally been focussed on factorial experiments, that is, experiments in which a number of factors are defined that are hypothesised to influence some aspects of speech behaviour, in production or in perception (see Chapter 9). The amount of speech in these experiments has typically been small, if only because it was practically impossible to record large amounts of speech in production experiments or to generate large amounts for perception experiments. The major causes of these limitations were the tight control of the speech needed for well-designed factorial experiments and the time required from the subjects. Tight control is necessary to prevent the outcome of factorial experiments from being meaningless: this type of experiment requires that all conceivable factors other than the small number under study be kept constant, while the experimental factors are varied over a limited range. It is not our intention to criticise factorial experiments, if only because they have contributed to virtually all the knowledge we have about speech and because until recently there was hardly an alternative. But it must be acknowledged that, precisely because of the tight control, the speech used in the older experiments may not have been truly "communicative". In most cases the subjects performed in situations quite remote from normal communicative behaviour; therefore, some caution should be exercised in generalising the results of controlled experiments to "normal communicative" speech.
Another reason to be careful in interpreting the results of factorial experiments is the possibility that the experimenter did not completely succeed in keeping all non-experimental factors constant: non-experimental factors may have co-varied with experimental ones, and so may be responsible for at least part of the effects attributed to the experimental factor(s). One case in point is intonation research, which has focussed largely on pitch and duration effects. There is, however, increasing evidence that other factors such as spectral structure, spectral slope and spectral dynamics also play a role, and perhaps quite an important one. In short: there is a danger that factorial experiments lead to overestimating the impact of the factors under investigation, at the cost of factors that were supposed to be constant but that actually co-varied so as to reinforce the effects of the experimental factors.
Now that very large corpora are becoming available, it is possible to set up another type of experiment, in which the behaviour of one or more specific factors is investigated in a very large, perhaps comprehensive, number of different contexts. Instead of trying to neutralise the effect of concomitant factors by keeping them constant (which will normally mean that one of the many different levels of such factors is selected, e.g. a voiceless stop as the right neighbour of the phonemes under study, or only syllables which have a prominence-lending High-Low pitch contour), one may try to sample many different contexts. Of course, in order to make this type of research feasible, one has to assume that subject effects can be treated in exactly the same way as context effects, because it will still be extremely difficult to have subjects perform for very long periods of time. In designing corpus-based experiments one must be aware of the extreme skewing of many frequency distributions observed in spoken language. For instance, in all languages for which data on phoneme frequencies are available, it has turned out that within a system some phonemes occur much more often than others. Random sampling would leave one with a very high likelihood of missing infrequent phonemes and of missing possible contexts, unless the total corpus is made excessively large. Greedy algorithms [Van Santen (1992)] can be used to find the minimum amount of linguistic material that covers a maximum number of phenomena, but even with greedy algorithms it cannot be guaranteed that all possibly relevant conditions are indeed covered: conditions which are not formulated as targets for the search will only be present by chance. Since complete coverage is not practically attainable, corpus research must deal with missing data in one way or another.
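The greedy selection mentioned above can be illustrated with a minimal sketch. It is not the procedure of Van Santen (1992), only the generic greedy set-cover heuristic; the function name greedy_cover, the sentence labels and the phoneme-pair targets are all hypothetical. At each step the candidate that covers the most still-uncovered targets is chosen, and the search stops when nothing more can be gained.

```python
def greedy_cover(candidates, targets):
    """Greedy set-cover heuristic (illustrative sketch only).

    candidates: dict mapping a sentence label to the set of units it contains
    targets:    set of units that the selected material should cover
    Returns the chosen labels (in selection order) and the targets left uncovered.
    """
    uncovered = set(targets)
    chosen = []
    while uncovered:
        # Pick the candidate that adds the most still-uncovered targets.
        best = max(candidates, key=lambda name: len(candidates[name] & uncovered))
        gain = candidates[best] & uncovered
        if not gain:
            break  # no candidate covers any remaining target
        chosen.append(best)
        uncovered -= gain
    return chosen, uncovered


# Hypothetical toy data: each sentence is reduced to the phoneme pairs it contains.
sentences = {
    "s1": {"p-a", "a-t", "t-i"},
    "s2": {"p-a", "a-t"},
    "s3": {"k-u", "u-s"},
    "s4": {"t-i", "k-u"},
}
targets = {"p-a", "a-t", "t-i", "k-u", "u-s", "z-o"}

chosen, uncovered = greedy_cover(sentences, targets)
print(chosen)     # sentences selected by the heuristic
print(uncovered)  # targets that no available sentence contains
```

In this toy case two sentences suffice for every target that occurs in the material, while the pair "z-o" stays uncovered because no candidate contains it. Anything that was never listed in targets plays no part in the selection at all, which is the limitation noted above.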
Attempts have been made to handle missing data by means of knowledge-based arithmetic models, including all relevant parameters; alternatively, "blind" statistical modelling techniques like CART (Classification And Regression Trees) can be used. There seems to be some preference for arithmetic models, unless one can guarantee that the missing data are not concentrated in a few subspaces [Van Santen (1994)].
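How an arithmetic model can fill an empty cell may be sketched as follows. The text does not specify the form of such models, so the sketch assumes the simplest possible case, a purely additive model in which each observed value is the sum of a vowel effect and a context effect. The function fit_additive, the factor levels and the durations (in ms) are invented for illustration and are not taken from Van Santen (1994).

```python
def fit_additive(observed, n_iter=50):
    """Fit value = row_effect + col_effect to the observed cells only.

    observed: dict mapping (row_level, col_level) to a measured value.
    Uses simple alternating averaging (backfitting); illustrative sketch only.
    """
    rows = {r for r, _ in observed}
    cols = {c for _, c in observed}
    row_eff = {r: 0.0 for r in rows}
    col_eff = {c: 0.0 for c in cols}
    for _ in range(n_iter):
        # Re-estimate each row effect from the residuals of its observed cells.
        for r in rows:
            resid = [y - col_eff[c] for (r2, c), y in observed.items() if r2 == r]
            row_eff[r] = sum(resid) / len(resid)
        # Re-estimate each column effect in the same way.
        for c in cols:
            resid = [y - row_eff[r] for (r, c2), y in observed.items() if c2 == c]
            col_eff[c] = sum(resid) / len(resid)
    return row_eff, col_eff


# Invented vowel durations (ms); the cell ("i", "voiceless") was never observed.
observed = {
    ("a", "voiced"): 115.0,
    ("a", "voiceless"): 105.0,
    ("i", "voiced"): 95.0,
}

row_eff, col_eff = fit_additive(observed)
predicted = row_eff["i"] + col_eff["voiceless"]
print(round(predicted, 3))  # model-based estimate for the missing cell
```

Because the model ties every cell to shared factor effects, a value can be estimated for a combination that never occurred in the corpus, provided each factor level was observed somewhere. A tree-based method such as CART has no comparable structural assumption to fall back on when an entire region of the factor space is empty.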



EAGLES SWLG SoftEdition, May 1997. Get the book...