4. Sampling Procedures
4.6.1 Sample Sizes for Population Parameter Estimates
4.6.1.1 Sample Sizes for Continuous Variables
Before proceeding too far in the determination of required sample sizes, it is useful to review the statistical theorem that lies at the very heart of sample size estimation: the Central Limit Theorem. This theorem states that the means of samples drawn from a population tend to become normally distributed as the sample size n increases. This normality of sample means applies irrespective of the distribution of the population from which the samples are drawn, provided that the sample is of reasonable size (n > 30). For small sample sizes, the theorem still applies provided that the original population distribution is approximately bell-shaped.
This theorem often causes confusion, but it is so basic to sampling theory that it must be understood before any progress can be made in understanding sample size determination. So let's restate it. Assume that we have a continuous variable (x) whose variability among sampling units in the population may be described by the distribution shown in Fig. 4.10. Such a variable may be, for example, the income of people in our population. The distribution may be of any form (for example, negatively skewed as shown). Assume that the population is of size N and the population distribution has some true mean value μ and a true standard deviation σ.
If we were now to draw a sample of size n from this population, we could calculate the mean income for that sample as m1 and the standard deviation for that sample as S1. We could then draw a second sample of size n from the total population and calculate m2 and S2. This could be repeated for a third sample to obtain m3 and S3, a fourth sample to get m4 and S4, and so on. Having drawn x samples, we could then construct a frequency distribution of the values m1, m2, m3, ..., mx. The Central Limit Theorem states that this distribution, as shown in Fig. 4.11, is normally distributed with mean m (which is an unbiased estimate of the population mean μ).
Figure 4.11 Distribution of the Means of Independent Samples (plot of f(x) against x)
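The repeated-sampling experiment described above can be imitated numerically. The following is a minimal sketch, not part of the original text: the lognormal "income" population, the seed, the sample size of 50 and the 2000 repeats are all arbitrary assumptions chosen for illustration. It draws many samples from a skewed population and compares the mean and spread of the sample means with the population mean and with the population standard deviation divided by the square root of n.

```python
import random
import statistics

def sample_means(population, n, repeats, rng):
    """Draw `repeats` samples of size n (without replacement); return their means."""
    return [statistics.fmean(rng.sample(population, n)) for _ in range(repeats)]

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical skewed "income" population (lognormal is an assumption).
population = [rng.lognormvariate(10.0, 0.5) for _ in range(20_000)]
mu = statistics.fmean(population)      # true population mean
sigma = statistics.pstdev(population)  # true population standard deviation

n = 50
means = sample_means(population, n, repeats=2000, rng=rng)

mean_of_means = statistics.fmean(means)
se_observed = statistics.stdev(means)  # spread of the sample means
se_theory = sigma / n ** 0.5           # population s.d. over root n

print(f"population mean {mu:.0f}, mean of sample means {mean_of_means:.0f}")
print(f"observed s.e. {se_observed:.0f}, theoretical s.e. {se_theory:.0f}")
```

Under these assumptions the two printed pairs should agree closely, even though the population itself is far from normal.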
The standard deviation of this distribution of sample means, which is referred to as the standard error of the mean (s.e.(m)), is given by:
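\[ \text{s.e.}(m) = \sqrt{\frac{N-n}{N} \cdot \frac{\sigma^2}{n}} \qquad (4.1) \]
where σ is the true standard deviation of the population, n is the sample size and N is the population size.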
The above discussion has been based on taking repeated samples from a population. Generally, however, this is not possible, and it is therefore necessary to make some estimates based on a single sample of size n. In such a situation our best estimate of the population mean μ is given by m1, and similarly the best estimate of the population standard deviation σ is given by S1 (hereafter referred to as S). Therefore, on the basis of a single sample, we can estimate what the standard error of the mean would have been, if repeated samples had been drawn, as:
\[ \text{s.e.}(m) = \sqrt{\frac{N-n}{N} \cdot \frac{S^2}{n}} \qquad (4.2) \]
As noted earlier, the standard error is a function of three variables: the variability of the parameter in the population (represented by the standard deviation σ), the sample size (n) and the population size (N). However, for large populations and small sample sizes (which is often the case in transport surveys), the finite population correction factor (N-n)/N is very close to unity. In such situations, the equation for the standard error of the mean may be reduced to the more familiar form of:
\[ \text{s.e.}(m) = \sqrt{\frac{S^2}{n}} = \frac{S}{\sqrt{n}} \qquad (4.3) \]
This equation highlights a most important aspect of sample size determination: as sample size increases, the standard error of the mean will decrease, but only in inverse proportion to the square root of the sample size. Thus, quadrupling the sample size will only halve the standard error of the mean. Increasing sample size is therefore a clear case of diminishing marginal returns with respect to decreases in the standard error of the mean.
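The square-root relationship can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function name `standard_error` and the standard deviation of 5000 are assumptions, and the function simply evaluates equation (4.2), falling back to equation (4.3) when no population size is supplied.

```python
import math

def standard_error(s, n, N=None):
    """Standard error of the mean.

    Uses equation (4.2) when a population size N is given, and the
    simplified equation (4.3) when N is None (very large population).
    """
    fpc = 1.0 if N is None else (N - n) / N  # finite population correction
    return math.sqrt(fpc * s ** 2 / n)

S = 5000.0  # assumed sample standard deviation, for illustration only

# Each quadrupling of n halves the standard error.
for n in (25, 100, 400):
    print(n, standard_error(S, n))  # 1000.0, then 500.0, then 250.0

# With a finite population of 1000, the correction reduces the value a little.
print(round(standard_error(S, 100, N=1000), 1))  # 474.3
```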
Reference to the properties of the normal distribution, dictated by the Central Limit Theorem, also enables an estimate to be made of the accuracy of the sample mean m as a reflection of the true population mean μ. Such estimates are calculated using the concept of confidence limits associated with the normal distribution. Thus, some 95% of all sample means (from samples of size n) would lie within two standard errors on either side of the true mean, so that there is a probability of only about one in twenty that the deviation between a sample mean and the true mean will exceed twice the standard error.
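The "two standard errors" rule of thumb can be verified from the unit normal distribution. This is a small sketch using Python's standard `statistics.NormalDist`; the helper name `coverage` is an assumption introduced here for illustration.

```python
from statistics import NormalDist

def coverage(k):
    """Probability that a normal variate lies within k standard
    deviations on either side of its mean."""
    unit_normal = NormalDist()  # mean 0, standard deviation 1
    return unit_normal.cdf(k) - unit_normal.cdf(-k)

print(round(coverage(2.0), 4))   # about 0.9545
print(round(coverage(1.96), 4))  # about 0.95
```

Two standard errors therefore correspond to slightly more than 95% coverage, while the exact 95% value is close to 1.96 standard errors.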
Given the foregoing discussion, the required sample size can be estimated by solving for n in equation (4.2). This is most easily done in two stages by first solving for n in equation (4.3) such that:
\[ n' = \frac{S^2}{(\text{s.e.}(m))^2} \qquad (4.4) \]
and then correcting for the finite population effect, if necessary, such that:
\[ n = \frac{n'}{1 + (n'/N)} \qquad (4.5) \]
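The two-stage calculation can be sketched as a short function. The names `required_sample_size` and `standard_error` and the input values below are assumptions for illustration. The final check substitutes the result back into equation (4.2) to confirm that the corrected sample size reproduces the target standard error.

```python
import math

def required_sample_size(s, se, N=None):
    """Sample size needed to achieve standard error `se`.

    Stage 1: equation (4.4), assuming an infinite population.
    Stage 2: equation (4.5), correcting for a finite population of size N.
    """
    n_inf = s ** 2 / se ** 2
    if N is None:
        return n_inf
    return n_inf / (1.0 + n_inf / N)

def standard_error(s, n, N):
    """Equation (4.2), used here only to check the result."""
    return math.sqrt((N - n) / N * s ** 2 / n)

# Illustrative inputs (assumed): S = 5000, target s.e. = 500, N = 1000.
n = required_sample_size(s=5000.0, se=500.0, N=1000)
print(round(n, 1))                             # 90.9
print(round(standard_error(5000.0, n, 1000)))  # 500
```

The result is left unrounded here so that the round-trip check is exact. In practice it would be rounded to a whole number of sampling units.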
Whilst the above procedure for the determination of sample size looks relatively straightforward and objective, there are two major problems in the application of the method: the estimation of the population standard deviation (σ) and the selection of an acceptable standard error of the mean (s.e.(m)). The problem with the estimation of the standard deviation is that this is one of the statistics which will be calculated after the survey has been conducted, and yet we are required to estimate it before we conduct the survey in order to calculate the sample size. It is therefore necessary to derive an estimate of the standard deviation from other sources. Three major sources suggest themselves:
(a) Previous surveys of the same, or a similar, population may provide an estimate of the standard deviation of the parameter in question. Due allowance should be made for any differences in the sampling method used in the previous and the current survey.
(b) There may be some theoretical foundations on which to base an estimate of the standard deviation. This technique was used, for example, in the Australian National Travel Survey (Aplin and Flaherty, 1976).
(c) Where little previous information exists about the population, it may be necessary to conduct a pilot survey to obtain the information needed to design the main survey. A problem with this method, however, is that time and money resources often do not permit the conduct of a large enough pilot survey to enable serviceable estimates of the standard deviation to be obtained. In such circumstances, the standard deviation estimates may be more misleading than informative.
Sometimes the estimated sample size can be adjusted during the course of the main survey to overcome any uncertainty in the initial estimate of the standard deviation. Thus using the initial standard deviation estimate, a sample of minimum size could be collected. The standard deviation in this sample could then be computed and compared with the initial estimate. If the standard deviation is larger than estimated, thus indicating that a larger sample should be collected, then a supplementary sample could be collected to augment the initial sample. Whilst this two-step procedure sounds attractive in being able to lessen the demands of accurate estimation of standard deviation, it is only feasible in certain circumstances. Thus the conduct of the survey must be spread over a reasonable time period so that coding, editing and analysis of an initial sample can be completed in time for the supplementary sample data collection to follow on reasonably soon after the collection of the initial sample. Where strict time limitations are placed on the conduct of a survey, supplementary samples may not be feasible.
The second problem in the estimation of sample sizes using the above equations is the specification of an acceptable standard error of the mean. This task basically expresses how confident we wish to be about using the sample mean as an estimate of the true population mean. The specification of a standard error is rarely performed per se; rather it is usual to specify confidence limits of a specified size around the mean and at a certain level of confidence. For example, you may specify that you wish, with C% confidence, to obtain a sample mean which is within a specified range, either relative or absolute, of the population mean. Specified in this way, two judgements must be made in order to calculate the acceptable standard error.
First, a level of confidence must be chosen for the confidence limits. Basically, the level of confidence expresses how frequently the client is prepared to be wrong in accepting the sample mean as a measure of the true mean. For example, if a 95% level of confidence is used, then implicitly it is being stated that the client is prepared to be wrong on 5% of occasions. If such a risk is deemed to be unacceptably large, then a higher level of confidence (such as 99%) can be used. Higher levels of confidence will, however, require larger sample sizes.
The specification of levels of confidence is a difficult task for the client and the survey designer. Not understanding the subtleties of sampling, most clients are unwilling to accept that the survey, for which they are paying, will come up with anything but the correct answer. On the other hand, the survey designer should know that nothing but a full population survey will produce results that are absolutely correct (assuming that everything else about the survey is acceptable). The task of the survey designer, therefore, is to get some indication from the client of what they think is an acceptable level of confidence in the results. By convention, 95% levels of confidence are often assumed for sample surveys in transport. This means that if repeated sample surveys were to be conducted on this topic with a sample of this size, then 5% of the estimates of the mean would lie outside the range of the population mean, plus or minus two standard errors.
Second, it is necessary to specify the confidence limits around the mean, either in absolute or relative terms. If relative measures are used (i.e. the confidence limit is a proportion of the mean) then this requires that an estimate of the mean be available so that an absolute measure of the confidence limit can be calculated. If the parameter being estimated is of some importance then smaller confidence limits can be specified but again this will result in higher sample sizes being necessary. The size of the confidence limits will depend on the use to which the results of the survey are to be put.
The important point to note about acceptable standard errors is that the specification of both the confidence limits and the level of confidence is relatively subjective. More important parameters can be assigned smaller confidence limits and/or higher levels of confidence. Each of these actions will result in a smaller acceptable standard error and thus a higher required sample size. The decision, however, lies in the hands of the sample designer in liaison with the client: is the accuracy of a parameter estimate sufficiently important to warrant the higher costs involved in a larger sample size?
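The effect of these two judgements on sample size can be explored numerically. The sketch below rests on stated assumptions: the acceptable standard error is taken as the absolute confidence limit divided by the two-sided unit normal value for the chosen level of confidence, the population is treated as infinite (equation (4.4)), and the standard deviation of 100 and the limits of 20 and 10 are arbitrary illustrative values. The function names are hypothetical.

```python
from statistics import NormalDist

def z_value(confidence):
    """Two-sided unit normal value, e.g. about 1.96 for confidence = 0.95."""
    return NormalDist().inv_cdf(0.5 + confidence / 2.0)

def infinite_population_sample_size(s, limit, confidence):
    """Equation (4.4), with the acceptable standard error taken as
    limit / z (an assumption stated in the accompanying text)."""
    se = limit / z_value(confidence)
    return s ** 2 / se ** 2

S_ASSUMED = 100.0  # assumed standard deviation, for illustration only

for confidence in (0.90, 0.95, 0.99):
    for limit in (20.0, 10.0):
        n = infinite_population_sample_size(S_ASSUMED, limit, confidence)
        print(f"{confidence:.0%} confidence, limit +/-{limit:g}: n' = {n:.0f}")
```

Under these assumptions, halving the confidence limit quadruples the required sample size, and moving from a 95% to a 99% level of confidence increases it by roughly three quarters.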
To illustrate the points outlined above, consider, as an example, a survey of a population of 1000 households in which an estimate of average household income is required such that there is a 95% probability that the sampling error will be no more than 5% of the sample mean. From a pilot survey of this population, it has been found that with a sample size of 30 the mean income was $24,000 and the standard deviation was $5,000.
The acceptable standard error can be calculated from the specified confidence limits and level of confidence. From a table of unit normal distribution values (see Appendix A), a 95% level of confidence corresponds to a value of 1.96 times the standard error. That is, there is a 95% probability that the error of the mean estimate will be no more than 1.96 times the standard error. However, in our case we want this error to be no more than 5% of our estimated mean. Using the pilot survey mean value as an initial estimate, the confidence limit will therefore be equal to $1,200 (= 0.05 x 24,000). The acceptable standard error is then given by:
\[ \text{s.e.}(m) = \frac{\text{confidence limit}}{z} = \frac{1200}{1.96} = 612 \qquad (4.6) \]
The sample size, for an infinite population, is then given by:
\[ n' = \frac{S^2}{(\text{s.e.}(m))^2} = \frac{5000^2}{612^2} = 67 \qquad (4.7) \]
Applying the finite population correction factor, the final sample size is given by:
\[ n = \frac{n'}{1 + (n'/N)} = \frac{67}{1 + 67/1000} = 63 \qquad (4.8) \]
Having collected data from these 63 households, it may be found that the estimated sample mean income has fallen to $23,200 while the sample standard deviation has increased to $6,000. In such a case, the acceptable standard error becomes $592 (= 0.05 x 23,200 / 1.96), and a new estimate of the required sample size may be given by:
\[ n' = \frac{6000^2}{592^2} = 103 \qquad (4.9) \]
and hence
\[ n = \frac{103}{1 + 103/1000} = 93 \qquad (4.10) \]
If it is convenient, and if the extra expense is deemed worthwhile, then an extra 30 households should be sampled and surveyed. It should be noted that without the extra households in the survey, the sampling error of the mean, at a 95% confidence level, is equal to 6.2% of the sample mean. The question must be asked as to whether the expense of the extra surveys is warranted by the reduction in sampling error from 6.2% to 5.0% of the mean.
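The arithmetic of this example can be reproduced with a short script. This is a sketch under stated assumptions: the function names are hypothetical, z is taken as 1.96 as in the text, and sample sizes are rounded to the nearest whole number at each stage, which is how the figures in the text appear to have been obtained.

```python
import math

Z_95 = 1.96  # unit normal value for a 95% level of confidence
N = 1000     # population of households

def acceptable_se(mean, relative_limit, z=Z_95):
    """Acceptable standard error: the confidence limit divided by z."""
    return relative_limit * mean / z

def sample_size(s, se, N):
    """Equations (4.4) and (4.5), rounded to the nearest whole unit."""
    n_inf = round(s ** 2 / se ** 2)
    n = round(n_inf / (1 + n_inf / N))
    return n_inf, n

def relative_error(s, n, N, mean, z=Z_95):
    """Sampling error at the given z, as a proportion of the mean (eq. 4.2)."""
    se = math.sqrt((N - n) / N * s ** 2 / n)
    return z * se / mean

# Design based on the pilot survey: mean $24,000, standard deviation $5,000.
se_pilot = acceptable_se(24_000, 0.05)
print(round(se_pilot), sample_size(5_000, se_pilot, N))  # 612 (67, 63)

# Revision after the main survey: mean $23,200, standard deviation $6,000.
se_main = acceptable_se(23_200, 0.05)
print(round(se_main), sample_size(6_000, se_main, N))    # 592 (103, 93)

# Sampling error if only the original 63 households are used.
print(round(100 * relative_error(6_000, 63, N, 23_200), 1))  # 6.2
```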