Many applications in language engineering require the testing of hypotheses. An example from the scenario given in Section 9.1.2 was testing whether there were differences between read and spontaneous speech with respect to selected statistics. If the statistic is mean tone unit duration in the two conditions under which speech was recorded, we have a situation calling for simple hypothesis testing, so called because it involves a parameter of a single population (under the null hypothesis, both samples are drawn from the same population).
Following the approach adopted so far, the concepts involved in such testing will be illustrated for this selected example. The first step is to make alternative assertions about what the outcome of the analysis might be. One assertion is that the analysis will provide no evidence of a difference between the two conditions. This is referred to as the null hypothesis (conventionally abbreviated as $H_0$) and asserts here that the mean tone unit duration in the read speech is the same as that in the spontaneous speech.
Other assertions might be made about this situation. These are referred to as alternative hypotheses. One alternative hypothesis would be that the tone unit duration of the read speech will be less than that of the spontaneous speech. A second would be the converse, i.e. that the tone unit duration of the spontaneous speech will be less than that of the read speech. The decision about which of these alternative hypotheses to propose will depend on factors that lead the language engineer to expect differences in one direction or the other. These are referred to as one-tailed (one-directional) hypotheses, as each predicts a specific direction in which read and spontaneous speech will differ. If the language engineer wants to test for a difference but has no theoretical or empirical reason for predicting the direction of the difference, then the hypothesis is said to be two-tailed. Here, a large difference between the means of the read and spontaneous speech, in either direction, might constitute evidence in favour of the alternative hypothesis.
The distinction between one- and two-tailed tests is an important one, as it affects what difference between means is needed to assert a significant difference (i.e., to reject the null hypothesis). In the case of a one-tailed test, smaller differences between means are needed than in the case of a two-tailed test. Basically, this comes down to how the tables are used in the final step of assessing significance (see below). There are no fixed conventions for the format of tables for the different tests, so there is no point in illustrating how to use them here; the tables usually contain guidance as to how they should be used to assess significance.
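To make the one- versus two-tailed distinction concrete, the following minimal sketch (in Python with the scipy library, an assumption of ours rather than anything prescribed in this Handbook) compares the critical t values at the 5% level for the 14 degrees of freedom used in the worked example below:

\begin{verbatim}
from scipy import stats

df = 14          # degrees of freedom (n - 1 pairs, as in the worked example)
alpha = 0.05     # 5% significance level

# One-tailed test: all 5% of the rejection region lies in one tail.
t_one_tailed = stats.t.ppf(1 - alpha, df)

# Two-tailed test: the 5% is split between the two tails (2.5% each).
t_two_tailed = stats.t.ppf(1 - alpha / 2, df)

print(f"one-tailed critical t: {t_one_tailed:.3f}")   # ~1.761
print(f"two-tailed critical t: {t_two_tailed:.3f}")   # ~2.145
\end{verbatim}

The smaller one-tailed critical value is exactly why a smaller difference between means suffices when the direction of the difference has been predicted in advance.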
Hypothesis testing involves asserting what level of support can be given in favour of, on the one hand, the null and, on the other, the alternative hypothesis. Clearly, no difference between the means of the read and spontaneous speech would indicate that the null hypothesis is supported for this sample. A big difference between the means would seem to indicate a statistical difference between these samples, provided that the means differ in the direction hypothesised (for a one-tailed hypothesis) or that a two-tailed test has been formulated. How it is decided whether a particular level of support (a probability) has been provided is described next.
In the read-spontaneous example that we have been working through, we are interested in testing for a difference between the means of two samples where, it is assumed, the samples are from the same speakers. The latter point requires that a related groups test, as opposed to an independent groups test, is used (see Figure 9.1). In this case, the t statistic is computed from:
\[
t = \frac{\bar{x}_1 - \bar{x}_2}{s_{\bar{x}_1 - \bar{x}_2}}
\]
where $\bar{x}_1$ and $\bar{x}_2$ are the means of the two samples and $s_{\bar{x}_1 - \bar{x}_2}$ is the standard deviation (standard error) of the difference between the means.
Thus, if the read speech for 15 speakers had a mean tone unit duration of 40.2 centiseconds and the spontaneous speech 36.4 centiseconds, and the standard deviation of the difference between the means is 2.67, the t value is 1.42. The t value is then used to establish whether two sample means lying this far apart might have come from the same (null hypothesis) or different (alternative hypothesis) distributions. This is done by consulting tables of the t statistic using $n-1$ degrees of freedom (here n refers to the number of pairs of observations).
In assessing a level of support for the alternative hypothesis, decision rules are formulated. Basically, this involves stipulating that if, assuming the samples are from the same distribution, the probability of the means lying this far apart is sufficiently low, then a more likely alternative is that the samples are drawn from different populations. The ``stipulation'' is done in terms of discrete probability levels: conventionally, if there is a less than 5% chance that the samples were from the same distribution, then the hypothesis that the samples were drawn from different distributions is supported (the alternative hypothesis, at that level of significance). Conversely, if there is a greater than 5% chance that the samples are from the same distribution, the null hypothesis is supported. In the worked example, with 14 degrees of freedom, a t value of 1.42 does not support the hypothesis that the samples are drawn from different populations, so the null hypothesis is accepted.
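The worked example can be reproduced with the following sketch (again Python with scipy assumed; the numbers are those given above):

\begin{verbatim}
from scipy import stats

n = 15                 # number of speakers (pairs of observations)
mean_read = 40.2       # mean tone unit duration, read speech (centiseconds)
mean_spont = 36.4      # mean tone unit duration, spontaneous speech
se_diff = 2.67         # standard deviation of the difference between means

t_value = (mean_read - mean_spont) / se_diff   # 3.8 / 2.67 ~= 1.42
df = n - 1                                     # 14 degrees of freedom

# Two-tailed p value: the probability of a t at least this extreme
# if the samples really come from the same distribution.
p_value = 2 * stats.t.sf(t_value, df)

print(f"t = {t_value:.2f}, p = {p_value:.2f}")  # t = 1.42, p ~= 0.18
# p > 0.05, so the null hypothesis is retained, as stated in the text.
\end{verbatim}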
It should be noted that support or rejection of these hypotheses is statistical rather than absolute. Where a 5% significance level is adopted and the null hypothesis is in fact true, 1 occasion out of 20 will nevertheless lead to a difference being asserted (referred to as a Type I error, rejecting the null hypothesis when it is in fact true). The converse error, accepting the null hypothesis when it is false, is referred to as a Type II error; its probability is not fixed by the significance level, but depends on the size of the true difference and on the sample size.
This chapter of the Handbook does not cover all the statistical tests that might be encountered; it only offers background and points to relevant material. However, some comments on Analysis of Variance (ANOVA) are called for, as it is a technique in widespread use in language engineering.
ANOVA is a statistical method for assessing the importance of factors that produce variability in responses or observations. Here ``factor'' refers to a controlled independent variable, and the values the experimenter sets for a factor are referred to as treatment levels. The approach is to control for a factor by specifying different treatment levels for it in order to see if there is an effect. Each level can be thought of as sampling a potentially different population (different in the sense of having a different mean), so factors that have an effect change the variation in sample means.
In the ANOVA approach, two estimates of the variance are obtained: the variance between the sample means (the between groups variance) and the variance of the individual scores about their group mean (the within groups variance). If the treatment factor has had no effect, then the between and within groups variability should both be estimates of the population variance. So, as discussed earlier when the ratio of two sample variances from the same population was considered, if the F ratio of between groups to within groups variance is taken, the value should be about 1 (in which case the null hypothesis is supported). Statistical tables of the F distribution can be consulted to ascertain whether the F ratio is large enough to support the hypothesis that the treatment factor has had an effect, resulting in a between groups variance that is large relative to the within groups variance (in which case the alternative hypothesis is supported). Another way of looking at this is that the between groups variance is affected by individual variation of the units tested plus the treatment effect, whereas the within groups estimate is affected only by individual variation of the units tested.
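The between/within logic can be sketched as follows for a one-way design (Python with numpy and scipy assumed; the scores are invented purely for illustration):

\begin{verbatim}
import numpy as np
from scipy import stats

# Invented scores for three treatment levels of a single factor.
groups = [np.array([40.1, 38.9, 41.3, 39.5]),
          np.array([36.2, 37.0, 35.1, 36.8]),
          np.array([38.4, 39.9, 37.2, 38.8])]

grand_mean = np.mean(np.concatenate(groups))
k = len(groups)                        # number of treatment levels
n_total = sum(len(g) for g in groups)  # total number of scores

# Between groups estimate: variation of group means about the grand mean.
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ms_between = ss_between / (k - 1)

# Within groups estimate: variation of scores about their own group mean.
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
ms_within = ss_within / (n_total - k)

f_ratio = ms_between / ms_within
print(f"F = {f_ratio:.2f}")

# The library routine gives the same F, plus a p value.
f_check, p_value = stats.f_oneway(*groups)
print(f"F = {f_check:.2f}, p = {p_value:.4f}")
\end{verbatim}

An F near 1 would support the null hypothesis; a large F supports a treatment effect.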
ANOVA is a powerful tool which has been developed to examine treatment effects involving several factors. Some examples of its scope: it can be used with two or more factors, and factors associated with independent and related groups can be tested in the same analysis. When more than one factor is involved in an analysis, the dependence between factors (interactions) comes into play and has major implications for the interpretation of results. A good reference covering many of the situations where ANOVA is used is [Winer (1971)]. Though statistical texts cover how the calculations are performed manually, these days the analyses are almost always done with computer packages. The packages are easy to use if ANOVA terminology is known. Indeed, the statistical manuals for these programmes (such as MINITAB, SPSS and BMDP) are important sources which discuss how to choose and conduct an appropriate analysis, and should be consulted.
Parametric tests cannot be used when discrete, rather than continuous, measures are obtained, since the sampling distribution cannot be assumed to approximate the normal distribution via the Central Limit Theorem in these instances. The distinction between discrete and continuous measures is the principal factor governing whether a parametric or non-parametric test can be employed. Continuous and discrete measures relate to another taxonomy of scales - interval, nominal and ordinal: interval scales are continuous and the others are discrete. Statisticians consider this taxonomy misleading, but since it is frequently encountered, the nature of data from the different scales is described here. Interval data are obtained when the distance between any two numbers on the scale is of known size; such a scale is characterised by a constant unit of measurement. This applies to physical measures like duration and frequency measured in Hertz (Hz), which have featured in the examples discussed up to now. Nominal scales are obtained when symbols are used to classify objects (such as the sex of the speakers). Ordinal scales give some idea of the relative magnitude of the units measured, but the difference between two numbers does not indicate the size of the difference. The examples discussed below in connection with Likert scales represent this level of measurement.
In cases where parametric tests cannot be used, non-parametric (also known as distribution-free) tests have to be employed. The computations involved in these tests are straightforward and covered in any elementary textbook [Siegel (1956)]. A reader who has followed the material presented thus far should find it easy to apply the previous ideas to these tests. To help the reader access the particular test needed in selected circumstances (parametric and non-parametric), a tree diagram of the different decisions that need to be made is given in Figure 9.1. Thus, a particular test might require a number of judges to indicate whether they think the synthesiser output is acceptable before and after a change has been made. The dependent variable is the frequency of judges considering the synthesiser output satisfactory or not before and after the change; for this related, nominal design a McNemar test is appropriate, as sketched below.
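For the McNemar case, the data reduce to a 2x2 table of each judge's verdict before and after the change. A minimal sketch (Python with the statsmodels library assumed; the counts are invented):

\begin{verbatim}
from statsmodels.stats.contingency_tables import mcnemar

# Rows: verdict before the change; columns: verdict after the change.
#              after OK   after not OK
table = [[12,        3],   # before OK
         [ 9,        6]]   # before not OK

# Only the discordant cells (3 and 9) carry information about the change.
result = mcnemar(table, exact=True)
print(f"statistic = {result.statistic}, p = {result.pvalue:.3f}")
\end{verbatim}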
Figure 9.1: Summary of decision structure for establishing what statistical test to use for data
A number of representative questions a language engineer might want to answer were considered at the start of this section. Let us just go back over these and consider which ones we are now equipped to answer. First there was how to check whether there are differences between spontaneous and read speech.
If the measures are parametric (such as would be the case for many acoustic variables), then either an independent or related t test would be appropriate to test for differences. An independent t test is needed when samples of spontaneous speech and read speech are drawn from different speaker sets; a related t test is used when the spontaneous and read samples are both obtained from the same group of speakers.
If the measures are non-parametric (e.g. ratings of clarity for the spontaneous and read speech), then a Wilcoxon test would be used when the read and spontaneous versions of the speech are obtained from the same speakers, and a Mann-Whitney U test otherwise.
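The four choices just described map directly onto library routines. A sketch (Python with scipy assumed; the two sample arrays are invented placeholders):

\begin{verbatim}
import numpy as np
from scipy import stats

read = np.array([40.2, 38.7, 41.5, 39.9, 37.8])   # invented measurements
spont = np.array([36.4, 37.1, 35.8, 38.0, 36.2])

# Parametric, same speakers in both conditions: related (paired) t test.
t, p = stats.ttest_rel(read, spont)

# Parametric, different speaker sets: independent t test.
t, p = stats.ttest_ind(read, spont)

# Non-parametric (e.g. ratings), same speakers: Wilcoxon signed-rank test.
w, p = stats.wilcoxon(read, spont)

# Non-parametric, different speaker sets: Mann-Whitney U test.
u, p = stats.mannwhitneyu(read, spont)
\end{verbatim}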
If you find differences between read and spontaneous speech that require you to use the latter for training data (see the application described earlier), how can you check whether the language statistics on your sample of recordings are representative of the language as a whole - or, what might or might not be the same thing, how can you be sure that you have sampled sufficient speech? For this, the background information provided earlier on estimating how close sample estimates are to population estimates is appropriate.
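One way to make that check concrete is a confidence interval on a sample mean, whose width shrinks as more speech is sampled. A sketch assuming Python with scipy (the standard deviation here is invented; the mean and n are from the worked example above):

\begin{verbatim}
import math
from scipy import stats

n = 15            # sample size (number of speakers)
mean = 40.2       # sample mean tone unit duration (centiseconds)
sd = 10.3         # invented sample standard deviation

sem = sd / math.sqrt(n)   # standard error of the mean

# 95% confidence interval: the range within which the population mean
# would be captured for 95% of samples of this size.
low, high = stats.t.interval(0.95, df=n - 1, loc=mean, scale=sem)
print(f"95% CI: [{low:.1f}, {high:.1f}] centiseconds")
# Increasing n narrows the interval, i.e. sampling more speech gives
# a more representative estimate of the population statistic.
\end{verbatim}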
The next questions in our list given in the introduction lead on to the second major theme which we want to cover: the general principles behind setting up a well-controlled experiment. The particular experiments that will be considered concern the assessment of the procedures for segmenting and labelling the speech for training and testing the ANNs. The discussion concerning experimental design, etc., will apply to many more situations, however. Once we have an idea what the experimental data would look like, we can consider how to treat the data statistically, which involves hypothesis testing again.