4. Sampling Procedures
4.6.1 Sample Sizes for Population Parameter Estimates
4.6.1.3 Explaining Sample Sizes to Clients
Because of the complexities and subtleties involved in sample design, it is often difficult for a survey designer to explain the concepts to clients and to obtain useful information from the client to assist in the sample design. This input of information from the client is essential because, after all, it is the client who has to live with the results after the survey is completed. Therefore, every attempt should be made to tailor the survey and the sample to the needs of the client. To assist in explaining sample size calculations to clients, and in obtaining the required input from clients, the design aids outlined in this section have proven useful.
The basis of these design aids is a set of two spreadsheet tables, shown in Figures 4.12 and 4.13. Figure 4.12 shows how the confidence limits around a continuous variable change as the sample size is changed, given estimates of the population mean and standard deviation, the size of the population and the level of confidence required by the client. Specifying the level of confidence automatically sets the value of z used in equation 4.6. The outputs of Figure 4.12 are the upper and lower confidence limits which could be expected from any specified sample size.
Continuous Variable Confidence Limit Calculator

Population Mean     = 3.60
Population S.D.     = 2.50
Population Size     = 10000
Level of Confidence = 95% => z = 1.96

Sample Size   s.e.(m)   z*s.e.(m)   Lower Limit   Upper Limit
         50      0.35        0.69          2.91          4.29
        100      0.25        0.49          3.11          4.09
        150      0.20        0.40          3.20          4.00
        200      0.18        0.34          3.26          3.94
        250      0.16        0.31          3.29          3.91
        300      0.14        0.28          3.32          3.88
        350      0.13        0.26          3.34          3.86
        400      0.12        0.24          3.36          3.84
        450      0.12        0.23          3.37          3.83
        500      0.11        0.21          3.39          3.81

       5000      0.03        0.05          3.55          3.65
Figure 4.12 Confidence Limit Estimator for Continuous Variables
For example, if the expected trip rate per person per day is 3.60 (with a standard deviation of 2.50), then with a sample of 300 persons in any one stratum, we would expect that the sample mean trip rate would fall between 3.32 and 3.88 in 95% of samples of this size.
If the client believed that this range was too great, then they could experiment either with different sample sizes or with changing the level of confidence. The bottom line in Figure 4.12 is provided to allow the specification of any desired sample size, which may be outside the range of those provided in the table. Figure 4.12 may be re-used for any continuous variable in the survey by simply changing the population mean and standard deviation.
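The calculation behind Figure 4.12 can be sketched in a few lines of Python. This is a minimal sketch, not the original spreadsheet: the function names are hypothetical, and it assumes that the standard error is s.d./sqrt(n) multiplied by a finite population correction of sqrt((N - n)/N), which is consistent with the tabulated values but is not stated in this section.

```python
from math import sqrt
from statistics import NormalDist


def z_value(confidence):
    """Two-sided standard normal multiplier, e.g. 0.95 gives about 1.96."""
    return NormalDist().inv_cdf(0.5 + confidence / 2.0)


def mean_confidence_limits(mean, sd, n, population, confidence=0.95):
    """Return (s.e., lower limit, upper limit) for a sample mean.

    Assumption: simple random sampling with a finite population
    correction, s.e. = sd / sqrt(n) * sqrt((N - n) / N).
    """
    se = sd / sqrt(n) * sqrt((population - n) / population)
    half_width = z_value(confidence) * se
    return se, mean - half_width, mean + half_width


if __name__ == "__main__":
    # Inputs taken from Figure 4.12: mean 3.60, s.d. 2.50, N = 10000.
    for n in (50, 100, 300, 500, 5000):
        se, lower, upper = mean_confidence_limits(3.60, 2.50, n, 10000)
        print(f"{n:6d}  s.e. = {se:.3f}  limits = {lower:.3f} to {upper:.3f}")
```

Changing the `confidence` argument plays the same role as changing the level of confidence in the spreadsheet, and any sample size can be passed as `n`, as with the bottom line of the table.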
A similar set of calculations is carried out in Figure 4.13 for discrete variables. In this case, however, what needs to be specified is the expected proportion in the population possessing a certain feature. For example, if the expected proportion of trips by bus is 20%, then with a sample of 300 trips in any one stratum, we would expect that the sample proportion of trips by bus would lie between 16% and 24% in 90% of samples of this size (90% being the level of confidence specified in Figure 4.13).
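The corresponding calculation for a proportion can be sketched as follows. Again this is only an illustration under stated assumptions: the function name is hypothetical, the standard error is taken to be sqrt(p(1 - p)/n) with the same assumed finite population correction sqrt((N - n)/N), and z is computed from the normal distribution rather than read from a table (about 1.645 for 90%, where the figure shows 1.64).

```python
from math import sqrt
from statistics import NormalDist


def proportion_confidence_limits(p, n, population, confidence=0.90):
    """Return (s.e., lower limit, upper limit) for a sample proportion.

    Assumption: s.e. = sqrt(p * (1 - p) / n) * sqrt((N - n) / N).
    """
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    se = sqrt(p * (1.0 - p) / n) * sqrt((population - n) / population)
    return se, p - z * se, p + z * se


if __name__ == "__main__":
    # Inputs taken from Figure 4.13: p = 0.20, N = 20000, 90% confidence.
    se, lower, upper = proportion_confidence_limits(0.20, 300, 20000)
    print(f"s.e. = {se:.3f}, limits = {lower:.3f} to {upper:.3f}")
```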
Discrete Variable Confidence Limit Calculator

Population Proportion = 0.20
Population Size       = 20000
Level of Confidence   = 90% => z = 1.64

Sample Size   s.e.(m)   z*s.e.(m)   Lower Limit   Upper Limit
         50      0.06        0.09          0.11          0.29
        100      0.04        0.07          0.13          0.27
        150      0.03        0.05          0.15          0.25
        200      0.03        0.05          0.15          0.25
        250      0.03        0.04          0.16          0.24
        300      0.02        0.04          0.16          0.24
        350      0.02        0.03          0.17          0.23
        400      0.02        0.03          0.17          0.23
        450      0.02        0.03          0.17          0.23
        500      0.02        0.03          0.17          0.23

       2000      0.01        0.01          0.19          0.21
Figure 4.13 Confidence Limit Estimator for Discrete Variables
Figures 4.12 and 4.13 show the effect of changing sample sizes on the confidence limits which could be expected for one variable at a time. However, as noted earlier, one of the problems in sample design for a real survey is that sample sizes must be calculated for many variables across many strata. The effects of varying sample size on the precision obtained for all variables can be summarised for the client as shown in Figures 4.14 through 4.16. These tables are constructed within a standard spreadsheet program, and are designed to be used interactively with a client to give them a feel for the implications of using samples of various sizes and varying numbers of strata (e.g. geographic regions). In Figure 4.14, the client can specify the number of strata and the sample size, the population size (in terms of number of households in the study area) and the required level of confidence (the latter item may be selected on the advice of the survey designer).
Sample Size Design Parameters
Number of Strata = 12
Households in Sample per Strata = 200
Persons in Sample per Strata = 600
(assuming 3 persons per household)
Trips in Sample per Strata = 2400
(assuming 4 trips per person)
Population Size = 60000 households
Level of Confidence = 95% => z = 1.96
Total Households in Sample = 2400
Total Survey Cost = $144,000
(assuming $60 per responding household)
Figure 4.14 Input Screen for Sample Size Design Parameters
The survey designer can also input a unit price per responding household in Figure 4.14 to give the client an indication of the cost implications of their sample design decisions. A second input screen, shown in Figure 4.15, requires the client, or the survey designer, to specify the expected values of key variables in the population together with the expected variability of continuous variables.
The spreadsheet program calculates the standard error for each variable and, using the value of z corresponding to the stated level of confidence, then calculates the upper and lower confidence limits for each key variable. These limits are then stated in simple English as shown in Figure 4.16. If one or more of these ranges are not acceptable to the client, because they think that the precision is not adequate for their purposes, then they can go back to the first input screen in Figure 4.14, change the sample size and observe the effects on the precision shown in Figure 4.16. They can also experiment with the precision obtained with different levels of stratification, by either increasing the number of strata and decreasing the sample size in each stratum, or by decreasing the number of strata and increasing the sample size in each. In this way, the client and the survey designer can interactively experiment with different sample designs and observe the effects on the precision of sample estimates and the cost of the survey, thereby experiencing the nature of the trade-offs in sample design.
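The interactive experiment described above can be imitated with a short script. This is a rough sketch with hypothetical function names, not the authors' spreadsheet. It assumes the sample sizes, household size, trip rate and unit cost of Figure 4.14, uses z = 1.96, and applies no finite population correction (an assumption; with it, the hand-checked values agree with Figure 4.16 to two decimal places).

```python
from math import sqrt

Z_95 = 1.96  # value of z shown in Figure 4.14 for 95% confidence


def design_summary(strata, households_per_stratum, persons_per_household=3,
                   trips_per_person=4, cost_per_household=60):
    """Derive per-stratum sample sizes and total cost from the design inputs."""
    persons = households_per_stratum * persons_per_household
    trips = persons * trips_per_person
    total_households = strata * households_per_stratum
    return {
        "households": households_per_stratum,
        "persons": persons,
        "trips": trips,
        "total_households": total_households,
        "total_cost": total_households * cost_per_household,
    }


def half_width(n, sd=None, proportion=None, z=Z_95):
    """z times the standard error; for a proportion, sd = sqrt(p(1 - p))."""
    if proportion is not None:
        sd = sqrt(proportion * (1.0 - proportion))
    return z * sd / sqrt(n)


def precision_statement(name, n, mean=None, sd=None, proportion=None):
    """Word the confidence limits in the style of Figure 4.16."""
    centre = proportion if proportion is not None else mean
    w = half_width(n, sd=sd, proportion=proportion)
    return (f"The estimated value of {name} lies between "
            f"{centre - w:.2f} and {centre + w:.2f}")


if __name__ == "__main__":
    design = design_summary(strata=12, households_per_stratum=200)
    print(f"Total survey cost = ${design['total_cost']:,}")
    print(precision_statement("Persons per Household", design["households"],
                              mean=3.00, sd=1.50))
    print(precision_statement("% Trips by Bus", design["trips"],
                              proportion=0.05))
```

Re-running `design_summary` with, say, 6 strata of 400 households keeps the cost fixed while narrowing the limits within each stratum, which is the trade-off the client is invited to explore.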
Expected Population Values for Key Variables

Household Variables             Proportion     Mean     S.D.
Persons per Household                          3.00     1.50
Vehicles per Household                         1.50     0.80
Households without Vehicles           0.10
Trips per Household                           12.00     5.00

Person Variables                Proportion     Mean     S.D.
Trips per Person                               4.00     2.50
% Male                                0.50
% Unemployed                          0.10
Average Personal Income ($K)                  28.00     8.00

Trip Variables                  Proportion     Mean     S.D.
Average Trip Length (minutes)                 20.00    12.00
% Trips by Bus                        0.05
% Trips to School                     0.10
Average People per Vehicle                     1.40     0.20
Figure 4.15 Input Screen for Expected Values of Variables in Population
Precision of Sample Estimates

Household Variables
The estimated value of Persons per Household lies between 2.79 and 3.21
The estimated value of Vehicles per Household lies between 1.39 and 1.61
The estimated value of Households without Vehicles lies between 0.06 and 0.14
The estimated value of Trips per Household lies between 11.31 and 12.69

Person Variables
The estimated value of Trips per Person lies between 3.80 and 4.20
The estimated value of % Male lies between 0.46 and 0.54
The estimated value of % Unemployed lies between 0.08 and 0.12
The estimated value of Average Personal Income ($K) lies between 27.36 and 28.64

Trip Variables
The estimated value of Average Trip Length (minutes) lies between 19.52 and 20.48
The estimated value of % Trips by Bus lies between 0.04 and 0.06
The estimated value of % Trips to School lies between 0.09 and 0.11
The estimated value of Average People per Vehicle lies between 1.39 and 1.41
Figure 4.16 Output Screen for Estimated Values of Variables in Sample
4.6.2 Sample Sizes for Hypothesis Testing
The second purpose of a survey (or surveys) may be to test a statistical hypothesis concerning some of the population parameters, e.g. are there significant differences in trip rates in different areas, or has use of a specific mode risen following introduction of a new transport service? To test such hypotheses, it is necessary to compare two sample statistics (each being an estimate of a population parameter under different conditions), each of which has a degree of sampling error associated with it. The tests are performed using statistical significance tests where the power of the test is a function of the sample size of the survey(s).
In using sample survey data to test hypotheses about population behaviour, it is first necessary to ensure that the hypothesis to be tested is correctly specified. While hypotheses are often described as having been rejected or accepted, it should be realised that the rejection of a hypothesis is to conclude that it is false, while the acceptance of a hypothesis merely implies that we have insufficient evidence to believe otherwise. Because of this, the investigator should always state the hypothesis in the form of whatever it is hoped will be rejected. Thus if it is hoped to prove that car ownership is higher in one area than in another, the hypothesis should be that car ownership is equal in the two areas, and then we try to reject the hypothesis (statistically).
An hypothesis that is formulated with the hope of rejecting it is called the null hypothesis and is denoted by H0. The rejection of H0 leads to the "acceptance" of an alternative hypothesis denoted by H1. Thus in the case of testing for differences in two average values, the null hypothesis could be specified as:
H0 : m = m0 (4.15)
The alternative hypothesis could be specified in a number of different ways depending on the purpose of the comparison. Possible alternative hypotheses might include:
H1 : m > m0 (4.16)
H1 : m < m0 (4.17)
H1 : m ≠ m0 (4.18)
Note that only one of these alternative hypotheses can be used in any particular test. The first two alternative hypotheses would constitute a one-tailed test, while the third alternative hypothesis would constitute a two-tailed test.
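The practical difference between one-tailed and two-tailed tests shows up in the critical value of the test statistic. The sketch below is illustrative only: the function name is hypothetical, and it assumes a normally distributed test statistic and a chosen significance level, neither of which is specified at this point in the text.

```python
from statistics import NormalDist


def critical_z(significance, tails):
    """Critical standard normal value for a one- or two-tailed test.

    A two-tailed test splits the significance level between both tails;
    a one-tailed test places all of it in a single tail.
    """
    if tails not in (1, 2):
        raise ValueError("tails must be 1 or 2")
    return NormalDist().inv_cdf(1.0 - significance / tails)


if __name__ == "__main__":
    print(f"two-tailed, 5%: z = {critical_z(0.05, 2):.3f}")
    print(f"one-tailed, 5%: z = {critical_z(0.05, 1):.3f}")
```

At the same significance level the one-tailed critical value is smaller, so a difference in the hypothesised direction is easier to detect, at the price of ignoring differences in the other direction.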
Since the data to be used in the testing of these hypotheses is to be collected using a sample survey, we would be reluctant to base our decision on a strict deterministic interpretation of the stated hypotheses. For example, if we wished to test the null hypothesis that m = m0, then we would base our decision on testing whether m fell within a critical region (D) around m0. Thus, if m0 - D < m < m0 + D, then we would, in practice, not reject the hypothesis that m = m0. Similarly, if m < m0 + D, then we would not reject the hypothesis that m = m0 in favour of the alternative hypothesis that m > m0. The definition of the critical region, D, is somewhat arbitrary and merely serves to give a workable rule for the rejection of hypotheses. Obviously a smaller critical region will make rejection of the null hypothesis easier.
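The decision rule can be written out as a small function. This is a minimal sketch with hypothetical names. It takes the band around m0 to run from m0 - D to m0 + D, treats D as a value supplied by the analyst, and uses made-up car ownership figures purely for illustration.

```python
def reject_null(m, m0, d, alternative="two-sided"):
    """Return True if H0: m = m0 is rejected for the observed value m.

    alternative: "two-sided" (H1: m != m0), "greater" (H1: m > m0)
    or "less" (H1: m < m0). d is the half-width D of the band around m0.
    """
    if d < 0:
        raise ValueError("d must be non-negative")
    if alternative == "two-sided":
        return not (m0 - d < m < m0 + d)
    if alternative == "greater":
        return m >= m0 + d
    if alternative == "less":
        return m <= m0 - d
    raise ValueError("unknown alternative")


if __name__ == "__main__":
    # Hypothetical figures: m0 = 1.50 vehicles per household, D = 0.11.
    print(reject_null(1.58, 1.50, 0.11))             # inside the band
    print(reject_null(1.65, 1.50, 0.11))             # outside the band
    print(reject_null(1.65, 1.50, 0.11, "greater"))  # one-tailed test
```

Shrinking `d` narrows the band, so the same observed value is more likely to lead to rejection of the null hypothesis.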
In testing hypotheses, there are four possible end-states of the hypothesis testing procedure. Two of these states signify that a correct decision has been made while the other two indicate that an error has been made. The four end-states may be depicted as shown in Table 4.1.
Table 4.1 Possible End-States of Hypothesis Testing