Estimation is used for making decisions about populations based on simple random samples. A truly random sample is likely to be representative of the population; this does not mean, however, that a variable measured on a second sample will take the same value as on the first. The skill involved in estimating the value of a variable is to impose conditions which allow an acceptable degree of error in the estimate without being so conservative as to be useless in practice (an extreme case of the latter would be recommending a sample of the same order of magnitude as the population). The necessary background skill is to understand how quantities like sample means, proportions and variances are related to means, proportions and variances in the population. The following notation is used in the discussion: $\bar{X}$ is the sample mean, $S$ is the sample standard deviation, and $S^2$ is the sample variance; $\mu$ is the population mean, $\sigma$ is the population standard deviation, and $\sigma^2$ is the population variance. The abbreviations sd and S.D. are sometimes used for standard deviation; S.E. is used for standard error, z is used for z-scores, P stands for the estimated probability (the sample proportion), and p stands for the population proportion.
A fundamental step towards this goal is to relate the sample statistic to a probability distribution. What this means is: if we repeatedly take samples from a population, how do the variables measured on the samples relate to those of the population? To put this empirically: how sure can you be about how close your sample mean lies to the population mean? Even more concretely, if we obtain the means of a set of samples, how does the mean of a particular sample relate to the mean of the population? As has already been said, the mean of the first of two samples is unlikely to be exactly the same as that of the second. However, if repeated samples are taken, the means of all the samples will cluster around the population mean; for this reason the sample mean is regarded as an unbiased estimator of the population mean.
The usual way this is shown is to take a known distribution (i.e., where the population mean is known) and then consider what the distribution would be like when samples of a given size are taken. So, if a population of events has equally likely outcomes and the variable values are 1, 2, 3 and 4, the mean would be 2.5. If all possible pairs are taken (1 and 2, 1 and 3, 1 and 4, 2 and 3, 2 and 4, 3 and 4), the mean of the mean values of all the pairs is also 2.5 (taking all pairs is a way of ensuring that the sample is simple random). An additional important finding is that if the distribution of sample means (the sampling distribution) is plotted as a histogram, the distribution is no longer rectangular but has a peak at 2.5 (1 and 4, and 2 and 3, both have a mean of 2.5, and no other pair shares its mean with another pair). Moreover, the distribution is symmetrical about the mean and approximates more closely to a normal (Gaussian) distribution even though the original distribution was not normal. As sample size gets larger, the approximation to the normal distribution gets better. Moreover, this tendency applies to all distributions, not just the rectangular distribution considered here. The tendency of the means of large samples to approximate the normal distribution is, in fact, a case of the Central Limit Theorem.
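As an illustrative sketch (not part of the original tabulation), the enumeration of all pairs drawn from the values 1 to 4, and a larger simulation of the same idea, can be reproduced in a few lines of Python; numpy is assumed to be available:

```python
import itertools
import numpy as np

values = [1, 2, 3, 4]                      # rectangular (uniform) population
pairs = list(itertools.combinations(values, 2))
pair_means = [np.mean(p) for p in pairs]
print(pair_means)                          # 1.5 2.0 2.5 2.5 3.0 3.5 -> peak at 2.5
print(np.mean(pair_means))                 # 2.5, the population mean

# Larger samples: means of samples of size 30 cluster around 2.5
# and their histogram approximates a normal curve
rng = np.random.default_rng(0)
sample_means = rng.choice(values, size=(10_000, 30)).mean(axis=1)
print(sample_means.mean(), sample_means.std())
```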
This particular result has far-reaching implications when testing between alternative hypotheses (see below). As a rule of thumb, sample sizes of 30 or greater are adequate to approximate the normal distribution.
The statistical quantity standard deviation (sd, S.D.) is a measure of how a set of observations scatter about the mean. For a sample of n observations $X_1, \ldots, X_n$ with mean $\bar{X}$ it is defined numerically as

$$S = \sqrt{\frac{\sum_{i=1}^{n}(X_i - \bar{X})^2}{n-1}}$$

Later the related quantity of the variance will be needed. This is simply the sd squared:

$$S^2 = \frac{\sum_{i=1}^{n}(X_i - \bar{X})^2}{n-1}$$
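As a minimal sketch (the data values are invented purely for illustration), these two quantities correspond to numpy's std and var with the n-1 divisor:

```python
import numpy as np

obs = np.array([12.1, 9.8, 11.4, 10.2, 10.9, 11.7, 9.5, 10.6])  # illustrative data
print(obs.std(ddof=1))   # sample standard deviation S
print(obs.var(ddof=1))   # sample variance S squared
```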
An important aspect of the situation described is that the sample means themselves (rather than the individual observations) have a standard deviation. The sd of the sample means (here the sd of all samples of size two from the rectangular distribution) is related to the sd of the original distribution by the formula:

$$\text{S.E.} = \frac{\sigma}{\sqrt{n}}$$

This quantity is given a particular name to distinguish it from the sd: it is called the standard error (S.E.). In practice, the standard deviation of the population is often not known. In these circumstances, provided the sample is sufficiently large, the standard deviation of the sample can be used to approximate that of the population and the above formula used to calculate the S.E. The S.E. is used in the computation of another quantity, the z score of the sample mean:

$$z = \frac{\bar{X} - \mu}{\text{S.E.}} = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}}$$
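A hedged sketch of these calculations (the sample values and the assumed population mean are invented for illustration):

```python
import numpy as np

sample = np.array([12.1, 9.8, 11.4, 10.2, 10.9, 11.7, 9.5, 10.6])  # illustrative data
n = len(sample)

s = sample.std(ddof=1)          # sample sd, used to approximate sigma
se = s / np.sqrt(n)             # standard error of the mean
mu = 10.0                       # assumed population mean, for illustration only
z = (sample.mean() - mu) / se   # z score of the sample mean
print(se, z)
```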
The importance of this quantity is that it can be translated into a probabilistic statement relating the sample and population means. Put another way, from the z score the probability of a sample mean lying so far from the population mean can be computed.
To show how this is used in practice: if a sample of size 200 is taken, what is the probability that the mean is within 1.5 S.E.s of the population mean? Normal distribution tables give the desired area. Here is a section of a table giving the proportion of the area of a normal distribution associated with given values of z (the tabulated area is that lying between the mean and the given z value):
z    | Area
...  | ...
1.3  | 0.4032
1.4  | 0.4192
1.5  | 0.4332
1.6  | 0.4452
...  | ...
The normal distribution is symmetrical about the mean value (i.e., the peak of the distribution). Only z values above the mean are tabulated, and the row with a z value of 1.5 indicates that 0.4332 of the area of the distribution lies between the mean and 1.5 S.E.s above the mean. Since the distribution is symmetrical, 0.4332 of the area will also lie within 1.5 S.E.s below the mean. Thus, the area within 1.5 S.E.s above or below the mean is 0.4332 + 0.4332, or 0.8664. Converted to percentages, approximately 86.6% of all samples of size 200 will have means within 1.5 S.E.s of the population mean. If, as in any real experiment, one sample is taken, we can make a probabilistic statement about how likely it is that its mean lies within the specified distance of the population mean.
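The same area can be obtained directly from the normal cumulative distribution function rather than from tables; a minimal sketch using scipy (assumed to be available):

```python
from scipy.stats import norm

# Area of the standard normal within 1.5 S.E.s either side of the mean
area = norm.cdf(1.5) - norm.cdf(-1.5)
print(area)   # approximately 0.8664
```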
Another, related, use of S.E.s is in stipulating confidence intervals. If you look at the areas associated with particular z values in the way just described, you should be able to ascertain that the area of a normal distribution enclosed within 1.96 S.E.s of the mean is 95%. Thus, if the S.E. and mean of a sample are known, you can specify a measurement interval that indicates the degree of confidence (here 95%) that the population mean will be within these bounds. This interval runs from 1.96 times the S.E. below the sample mean to 1.96 times the S.E. above the sample mean. This particular interval is called the 95% confidence interval. Other levels of confidence can be adopted by obtaining the corresponding z values.
Since this topic is so important, an example is given. Say a random sample of the mean voice fundamental of 64 male university students has a mean of 98 Hz and a standard deviation of 32 Hz. What is the 95% confidence interval for the mean voice fundamental of the male students at this university? The maximum error of the estimate is approximated (using the sample standard deviation S rather than that of the population as an approximation, see above) as:

$$E = 1.96 \times \frac{S}{\sqrt{n}} = 1.96 \times \frac{32}{\sqrt{64}} = 1.96 \times 4 = 7.84\ \text{Hz}$$
Thus, the 95% confidence interval is from 98 - 7.84 = 90.16 Hz to 98 + 7.84 = 105.84 Hz. Often, confidence intervals are presented graphically along with the means: the mean of the dependent variable is indicated on the y axis with some chosen symbol, and a vertical line representing the confidence interval extends (in this case) from 90.16 to 105.84 and passes through the mean.
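A minimal sketch of this calculation, using the figures of the worked example and the large-sample 1.96 multiplier:

```python
import math

n, mean, sd = 64, 98.0, 32.0      # voice fundamental example
se = sd / math.sqrt(n)            # standard error = 32 / 8 = 4
error = 1.96 * se                 # maximum error of the estimate = 7.84
print(mean - error, mean + error) # 90.16 105.84
```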
Before leaving this section, it is necessary to consider what to do when making corresponding statements about small samples, which cannot be approximated with the normal distribution. Here computation of the mean and standard error S.E. proceeds as before. Since the quantity z is tied to the normal distribution tables, it cannot be used. Instead the analogous quantity t is calculated:

$$t = \frac{\bar{X} - \mu}{S / \sqrt{n}}$$
The distribution of t depends on the sample size n, so (in essence) the t value has to be referred to a different table for each size of sample. The tables corresponding to the t distribution are usually collapsed into one table, and the section of the table used is accessed by a parameter related to the sample size n (the quantity used for accessing the table is n-1 and is called the degrees of freedom). Clearly, since several different distributions are being tabulated, some condensation of the information relative to the z tables is desirable; for this reason, only the t values corresponding to particular probabilities are given. Consideration of t tables emphasises one of the advantages of the Central Limit Theorem: for large samples a single normal (z) table can be used to address a wide variety of issues, whereas t requires a separate distribution for each number of degrees of freedom.
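To illustrate the difference (a sketch with arbitrarily chosen degrees of freedom), scipy gives critical values for both distributions; the t value approaches the z value 1.96 as the sample size grows:

```python
from scipy.stats import norm, t

print(norm.ppf(0.975))             # 1.96: one table serves all large samples
for df in (5, 10, 30, 100):
    print(df, t.ppf(0.975, df))    # 2.57, 2.23, 2.04, 1.98: depends on df
```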
Here the problem faced is similar to that with means: a sample has been taken and the proportions of people meeting and not meeting some criterion are observed. The question is: with what degree of confidence can you assert that the proportions observed reflect those in the population? Once again the solution is directly related to that discussed when estimating how close a sample mean lies to the population mean using z scores. Essentially the z score for means measures:

$$z = \frac{\text{sample mean} - \text{population mean}}{\text{S.E.}}$$

The only difference here is that binomial events are being considered (meet/not meet the criterion). Since the mean of a binomial distribution is np (number tested $\times$ population proportion) and its S.E. is $\sqrt{npq}$ (where q = 1 - p), the z score associated with a particular sample, based on the estimated probability P and the population proportion p, is:

$$z = \frac{P - p}{\sqrt{pq/n}}$$
Normal distribution tables can again be used to assign a probability associated with this particular outcome.
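A sketch of this calculation (the observed count and the hypothesised population proportion are invented for illustration):

```python
import math
from scipy.stats import norm

n, x = 200, 116            # illustrative: 116 of 200 meet the criterion
p = 0.5                    # hypothesised population proportion
P = x / n                  # estimated proportion

z = (P - p) / math.sqrt(p * (1 - p) / n)
prob = 2 * (1 - norm.cdf(abs(z)))   # probability of a sample proportion this far from p
print(z, prob)
```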
To illustrate with an example: Suppose that it is expected that as many men will use the ANN system as will women (p(man) = p(woman) = 0.5). What size of sample is needed to be 95% certain that the proportion of men and women in the sample differs from that in the population by at most 4%? This requires:

$$1.96 \times \sqrt{\frac{0.5 \times 0.5}{n}} = 0.04$$
Solving for n gives 600.25. Therefore, a sample of size at least 601 should be used.
Now consider the effect of requiring a smaller margin of error: if the maximum difference is reduced from 4% to 2%, the required sample size jumps to 2401.
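A sketch of the sample-size calculation (assuming the 95% z value of 1.96 and p = 0.5):

```python
def required_n(margin, p=0.5, z=1.96):
    """n solving z * sqrt(p*(1-p)/n) = margin; round up in practice."""
    return (z / margin) ** 2 * p * (1 - p)

print(required_n(0.04))   # approximately 600.25, so use at least 601
print(required_n(0.02))   # approximately 2401
```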
The relationship between the variance of a sample and that of the population is given by the quantity $\frac{(n-1)S^2}{\sigma^2}$, which is distributed as $\chi^2$ (chi squared) with n-1 degrees of freedom.
Thus, if we have a sample of size 10 drawn from a normal population with population variance 12, the probability of its sample variance exceeding 18 is obtained from:

$$\chi^2 = \frac{(n-1)S^2}{\sigma^2} = \frac{9 \times 18}{12} = 13.5$$
This has 9 degrees of freedom associated with it. Because $\chi^2$ values are only tabulated for particular probabilities (as with t), the probability can only be bracketed. In this case it lies between 0.2 and 0.1.
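The exact probability can be read from the survival function of the chi-squared distribution; a minimal sketch with scipy:

```python
from scipy.stats import chi2

chi_sq = 9 * 18 / 12          # (n-1) * S^2 / sigma^2 = 13.5
print(chi2.sf(chi_sq, df=9))  # approximately 0.14, i.e. between 0.1 and 0.2
```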
If two independent samples are taken from two normal populations with variances $\sigma_1^2$ and $\sigma_2^2$, the ratio of the two sample variances ($S_1^2$ and $S_2^2$), each scaled by its population variance, has the F distribution with $n_1 - 1$ and $n_2 - 1$ degrees of freedom:

$$F = \frac{S_1^2 / \sigma_1^2}{S_2^2 / \sigma_2^2}$$
If two samples (which can differ in size) are taken from the same normal population, the ratio of their variances will be approximately 1. Conversely, if the samples are not from the same normal population, the ratio of their variances will tend to differ from 1 (the ratio of the variances is termed the F ratio). The F tables can be used to assign probabilities that the sample variances were or were not drawn from the same normal distribution. The importance of this in the Analysis of Variance (ANOVA) will be seen later.
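A sketch of how such a probability could be looked up with scipy (the sample sizes and variances are invented for illustration):

```python
from scipy.stats import f

n1, n2 = 16, 21            # illustrative sample sizes
s1_sq, s2_sq = 40.0, 25.0  # illustrative sample variances

F = s1_sq / s2_sq                       # F ratio (equal population variances assumed)
prob = f.sf(F, dfn=n1 - 1, dfd=n2 - 1)  # probability of an F ratio at least this large
print(F, prob)
```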