
Estimating sample means, proportions and variances

Estimation is used for making decisions about populations based on simple random samples. A truly random sample is likely to be representative of the population; this does not mean, however, that a variable measured on a second sample will take the same value as on the first. The skill involved in estimating the value of a variable is to impose conditions which allow an acceptable degree of error in the estimate without being so conservative as to be useless in practice (an extreme case of the latter would be recommending a sample of the same order of magnitude as the population). The necessary background skill is to understand how quantities like sample means, proportions and variances are related to the means, proportions and variances of the population. The following notation is used in the discussion: $\bar{X}$ is the sample mean, $S$ is the sample standard deviation, and $S^2$ is the sample variance; $\mu$ is the population mean, $\sigma$ is the population standard deviation, and $\sigma^2$ is the population variance. The abbreviations sd and S.D. are sometimes used for standard deviation; S.E. is used for standard error, $z$ is used for z scores, $P$ stands for estimated probability, and $p$ stands for proportion.

Estimating means

A fundamental step towards this goal is to relate the sample statistic to a probability distribution. What this means is: if we repeatedly take samples from a population, how do the variables measured on the samples relate to those of the population? To translate this into an empirical question: how sure can you be about how close your sample mean lies to the population mean? Even more concretely, if we obtain the means of a set of samples, how does the mean of a particular sample relate to the mean of the population? As has already been said, the mean of the first of two samples is unlikely to be exactly the same as that of the second. However, if repeated samples are taken, the mean values of the samples will cluster around the population mean; for this reason the sample mean is regarded as an unbiased estimator of the population mean.

The usual way this is shown is to take a known distribution (i.e., one where the population mean is known) and then consider what the distribution would be like when samples of a given size are taken. So, if a population of events has equally likely outcomes and the variable values are 1, 2, 3 and 4, the mean is 2.5. If all possible pairs are taken (1 and 2, 1 and 3, 1 and 4, 2 and 3, 2 and 4, 3 and 4), the mean of the mean values of all the pairs is also 2.5 (taking all pairs is a way of ensuring that the sample is simple random). An additional important finding is that if the distribution of sample means (the sampling distribution) is plotted as a histogram, the distribution is no longer rectangular but has a peak at 2.5 (the pairs 1 and 4, and 2 and 3, both have a mean of 2.5, and no other pair has this mean). Moreover, the distribution is symmetrical about the mean and approximates the normal (Gaussian) distribution even though the original distribution was not normal. As the sample size gets larger, the approximation to the normal distribution gets better. Moreover, this tendency applies to all distributions, not just the rectangular distribution considered here. The tendency of the means of large samples to approximate the normal distribution is, in fact, a case of the Central Limit Theorem.
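As a minimal illustration in Python, the following sketch enumerates all samples of size two from the rectangular distribution above and confirms that the mean of the sample means equals the population mean of 2.5:

    from itertools import combinations
    from statistics import mean

    population = [1, 2, 3, 4]                    # rectangular (uniform) distribution
    pairs = list(combinations(population, 2))    # all simple random samples of size 2
    sample_means = [mean(pair) for pair in pairs]

    print(sample_means)        # [1.5, 2.0, 2.5, 2.5, 3.0, 3.5] -- peaked at 2.5
    print(mean(sample_means))  # 2.5, the population mean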

This particular result has far-reaching implications when testing between alternative hypotheses (see below). As a rule of thumb, sample sizes of 30 or greater are adequate for the approximation to the normal distribution.

The statistical quantity standard deviation (sd, S.D.) is a measure of how a set of observations scatters about the mean. It is defined numerically as:


\[ S = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{X})^2}{n-1}} \]

Later the related quantity, the variance, will be needed; this is simply the sd squared:


\[ S^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{X})^2}{n-1} \]

An important aspect of the situation described is that the sample means themselves (rather than the observations) have a standard deviation. The sd of the sample means (here the sd of the means of all samples of size two from the rectangular distribution) is related to the sd of the original distribution by the formula:


\[ \mathrm{S.E.} = \frac{\sigma}{\sqrt{n}} \]

This quantity is given a particular name to distinguish it from the sd: it is called the standard error (S.E.). In practice, the standard deviation of the population is often not known. In these circumstances, provided the sample is sufficiently large, the standard deviation of the sample can be used to approximate that of the population, and the above formula used to calculate the S.E. The S.E. is used in the computation of another quantity, the z score of the sample mean:


\[ z = \frac{\bar{X} - \mu}{\mathrm{S.E.}} = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \]

The importance of this quantity is that it can be translated into a probabilistic statement relating the sample and population means. Put another way, from the z score, the probability of a sample mean lying a given distance from the population mean can be computed.
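As a minimal sketch of this computation in Python (the function name z_score is illustrative; for large samples the sample sd may stand in for $\sigma$, as noted above):

    import math

    def z_score(sample_mean, pop_mean, sd, n):
        """z score of a sample mean: (sample mean - population mean) / S.E."""
        se = sd / math.sqrt(n)   # standard error of the mean
        return (sample_mean - pop_mean) / se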

To show how this is used in practice: if a sample of size 200 is taken, what is the probability that the mean is within 1.5 S.E.s of the population mean? Normal distribution tables give the desired area. Here is a section of a table giving the proportion of the area of a normal distribution associated with given values of z (the stippled section in the figure indicates what area is tabulated):

[Figure: sketch of a normal distribution; the stippled section, from the mean up to the given z value, indicates the tabulated area]

    z      Area
    ...    ...
    1.3    0.4032
    1.4    0.4192
    1.5    0.4332
    1.6    0.4452
    ...    ...

The normal distribution is symmetrical, and the symmetry is about the mean value (i.e., the peak of the distribution). Only z values above the mean are tabulated; the row with a z value of 1.5 indicates that 0.4332 of the area of the distribution lies within 1.5 S.E.s above the mean. Since the distribution is symmetrical, 0.4332 of the area will also lie within 1.5 S.E.s below the mean. Thus, the area within 1.5 S.E.s above or below the mean is 0.4332 + 0.4332, or 0.8664. Converted to percentages, approximately 86.6% of all samples of size 200 will have means within 1.5 S.E.s of the population mean. If, as in any real experiment, a single sample is taken, we can state how likely it is that its mean lies within the specified distance of the population mean.
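The table lookup can be reproduced with the cumulative distribution function of the normal distribution; the following sketch assumes the scipy library is available:

    from scipy.stats import norm

    # area from the mean up to 1.5 S.E.s above it (tabulated as 0.4332)
    print(norm.cdf(1.5) - 0.5)              # ~0.4332
    # area within 1.5 S.E.s on either side of the mean
    print(norm.cdf(1.5) - norm.cdf(-1.5))   # ~0.8664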

Another, related, use of S.E.s is in stipulating confidence intervals. If you look at the areas associated with particular z values in the way just described, you should be able to ascertain that the area of a normal distribution enclosed within $\pm 1.96$ S.E.s of the mean is 95%. Thus, if the S.E. and mean $\bar{X}$ of a sample are known, you can specify a measurement interval that indicates the degree of confidence (here 95%) that the population mean lies within its bounds: the interval runs from $1.96 \times \mathrm{S.E.}$ below the sample mean to $1.96 \times \mathrm{S.E.}$ above it. This particular interval is called the 95% confidence interval. Other levels of confidence can be adopted by obtaining the corresponding z values.

Since this topic is so important, an example is given. Say a random sample of 64 male university students has a mean voice fundamental frequency of 98 Hz and a standard deviation of 32 Hz. What is the 95% confidence interval for the mean voice fundamental frequency of the male students at this university? The maximum error of the estimate is approximated (using the sample standard deviation $S$ rather than that of the population $\sigma$, see above) as:


\[ E = 1.96 \times \frac{S}{\sqrt{n}} = 1.96 \times \frac{32}{\sqrt{64}} = 7.84 \]

Thus, the 95% confidence interval runs from 98 - 7.84 = 90.16 Hz to 98 + 7.84 = 105.84 Hz. Often, confidence intervals are presented graphically along with the means: the mean of the dependent variable is indicated on the y axis with some chosen symbol, and a vertical line representing the confidence interval extends (in this case) from 90.16 to 105.84, passing through the mean.
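The whole calculation can be sketched as follows, assuming scipy for the critical z value (the hard-coded 1.96 of the text would do equally well):

    import math
    from scipy.stats import norm

    n, xbar, s = 64, 98.0, 32.0      # sample size, mean (Hz) and sd (Hz) from the text
    z = norm.ppf(0.975)              # ~1.96 for a 95% interval
    e = z * s / math.sqrt(n)         # maximum error of the estimate, ~7.84
    print(xbar - e, xbar + e)        # ~90.16 to ~105.84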

Before leaving this section, it is necessary to consider what to do when making corresponding statements about small samples, for which the normal approximation does not hold. Here the computation of the mean $\bar{X}$ and standard error S.E. proceeds as before. Since the quantity z is tied to the normal distribution tables, it cannot be used; instead the analogous quantity t is calculated:


\[ t = \frac{\bar{X} - \mu}{S/\sqrt{n}} \]

The distribution of t depends on the sample size n, so (in essence) the t value has to be referred to a different table for each sample size. The tables corresponding to the t distributions are usually collapsed into one table, and the section of the table to use is accessed by a parameter related to the sample size (the quantity used for accessing the table is n-1 and is called the degrees of freedom). Clearly, since several different distributions are being tabulated, some condensation of the information relative to the z tables is desirable; for this reason, only the t values corresponding to particular probabilities are given. Consideration of t tables emphasises one of the advantages of the Central Limit Theorem: for large samples a single z table can be used to address a wide variety of issues, whereas with t a different distribution is needed for each sample size.
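As a sketch of the small-sample procedure, the following hypothetical function computes a t-based confidence interval, looking up the critical t value for $n-1$ degrees of freedom instead of consulting printed tables (scipy is assumed):

    import math
    from scipy.stats import t

    def t_interval(xbar, s, n, confidence=0.95):
        """Confidence interval for a mean, using the t distribution
        with n-1 degrees of freedom (appropriate for small samples)."""
        se = s / math.sqrt(n)
        crit = t.ppf((1 + confidence) / 2, df=n - 1)
        return xbar - crit * se, xbar + crit * se

    print(t_interval(98.0, 32.0, 10))   # wider than the z-based interval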

 

Estimating proportions

Here the problem faced is similar to that with means: a sample has been taken, and the proportions of people meeting and not meeting some criterion are observed. The question is with what degree of confidence you can assert that the proportions observed reflect those in the population. Once again the solution is directly related to that discussed when estimating how close a sample mean lies to the population mean using z scores. Essentially, the z score for means measures:


\[ z = \frac{\text{observed value} - \text{population mean}}{\mathrm{S.E.}} \]

The only difference here is that binomial events are being considered (meeting/not meeting the criterion). Since the mean of a binomial distribution is $np$ (number tested $\times$ population proportion) and the S.E. is $\sqrt{npq}$ (where $q = 1-p$), the z score associated with a particular sample, relating the estimated probability to the population proportion, is:


\[ z = \frac{x - np}{\sqrt{npq}} = \frac{P - p}{\sqrt{pq/n}} \]

where $x$ is the number of sample members meeting the criterion and $P = x/n$ is the estimated probability.

Normal distribution tables can again be used to assign a probability associated with this particular outcome.
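A minimal sketch of this computation (the function name proportion_z is illustrative):

    import math

    def proportion_z(x, n, p):
        """z score for x 'successes' in a sample of n against population proportion p."""
        q = 1 - p
        return (x - n * p) / math.sqrt(n * p * q)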

To illustrate with an example: suppose that it is expected that as many men will use the ANN system as women ($p(\text{man}) = p(\text{woman}) = 0.5$). What size of sample is needed to be 95% certain that the proportion of men and women in the sample differs from that in the population by at most 4%?


\[ 0.04 = 1.96 \times \sqrt{\frac{0.5 \times 0.5}{n}} \]

Solving for n gives 600.25. Therefore, a sample of size at least 601 should be used.

Now what are the effects of demanding a tighter estimate, say if the permitted difference is reduced from 4% to 2%? The required sample size jumps to 2401.
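Both sample sizes can be reproduced with a short function (sample_size is an illustrative name; scipy is assumed for the critical z value):

    import math
    from scipy.stats import norm

    def sample_size(p, error, confidence=0.95):
        """Smallest n for which the sample proportion lies within
        `error` of the population proportion p at the given confidence."""
        z = norm.ppf((1 + confidence) / 2)   # ~1.96 for 95%
        return math.ceil((z / error) ** 2 * p * (1 - p))

    print(sample_size(0.5, 0.04))   # 601
    print(sample_size(0.5, 0.02))   # 2401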

Estimating variance

The quantity relating the variance of a sample to that of the population is distributed as $\chi^2$ (chi squared) with $n-1$ degrees of freedom:


\[ \chi^2 = \frac{(n-1)S^2}{\sigma^2} \]

Thus, if we have a sample of size 10 drawn from a normal population with population variance 12, the probability of its variance exceeding 18 is:


\[ P(S^2 > 18) = P\left(\chi^2 > \frac{(n-1) \times 18}{\sigma^2}\right) = P\left(\chi^2 > \frac{9 \times 18}{12}\right) = P(\chi^2 > 13.5) \]

This has associated with it 9 degrees of freedom. Because $\chi^2$ values are only tabulated for particular probabilities (as with t), the probability can only be bracketed. In this case $P(\chi^2 > 13.5)$ lies between 0.2 and 0.1.
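The tabulated bracketing can be checked against the exact tail probability; the following sketch assumes scipy:

    from scipy.stats import chi2

    n, s2, sigma2 = 10, 18.0, 12.0
    stat = (n - 1) * s2 / sigma2      # 13.5
    print(chi2.sf(stat, df=n - 1))    # ~0.14, between 0.2 and 0.1 as stated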

Ratio of sample variances

If two independent samples are taken from two normal populations with variances $\sigma_1^2$ and $\sigma_2^2$, the ratio of the two sample variances ($S_1^2$ and $S_2^2$) has the F distribution:


\[ F = \frac{S_1^2}{S_2^2} \]

If two samples (which can differ in size) are taken from the same normal population, then the ratio of their variances will be approximately 1. Conversely, if the samples are not from the same normal population, the ratio of their variances will generally differ from 1 (the ratio of the variances is termed the F ratio). The F tables can be used to assign probabilities that the sample variances were or were not drawn from the same normal distribution. The importance of this in the Analysis of Variance (ANOVA) will be seen later.
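As a sketch with invented sample figures (the variances and sizes below are purely illustrative), the tail probability of an F ratio can be obtained as follows, assuming scipy:

    from scipy.stats import f

    s1_sq, n1 = 20.0, 16     # hypothetical sample variance and size, sample 1
    s2_sq, n2 = 12.0, 21     # hypothetical sample variance and size, sample 2
    F = s1_sq / s2_sq        # F ratio; ~1 if both share a population variance
    print(f.sf(F, dfn=n1 - 1, dfd=n2 - 1))  # probability of an F ratio at least this large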


