Precision essentially depends only on the absolute sample size, not on the relative fraction of the population sampled.
3.7 Stratified simple random sampling
3.7.4 Example - sampling organic matter from a lake
[With thanks to Dr. Rick Routledge for this example].
Suppose that you were asked to estimate the total amount of organic matter suspended in a lake just after a storm. The first scheme that might occur to you could be to cruise around the lake in a haphazard fashion and collect a few sample vials of water which you could then take back to the lab. If you knew the total volume of water in the lake, then you could obtain an estimate of the total amount of organic matter by taking the product of the average concentration in your sample and the total volume of the lake.
The accuracy of your estimate of course depends critically on the extent to which your sample is representative of the entire lake. If you used the haphazard scheme outlined above, you have no way of objectively evaluating the accuracy of the sample. It would be more sensible to take a properly randomized sample.
(How might you go about doing this?)
Nonetheless, taking a randomized sample from the entire lake would still not be a totally sensible approach to the problem. Suppose that the lake were to be fed by a single stream, and that most of the organic matter were concentrated close to the mouth of the stream. If the sample were indeed representative, then most of the vials would contain relatively low concentrations of organic matter, whereas the few taken from around the mouth of the stream would contain much higher concentration levels. That is, there is a real potential for outliers in the sample. Hence, confidence limits based on the normal distribution would not be trustworthy.
Furthermore, the sample mean is not as reliable as it might be. Its value will depend critically on the number of vials sampled from the region close to the stream mouth. This source of variation ought to be controlled.
Finally, it might be useful to estimate not just the total amount of organic matter in the entire lake, but the extent to which this total is concentrated near the mouth of the stream.
You can simultaneously overcome all three deficiencies by taking what is called a stratified random sample. This involves dividing the lake into two or more parts called strata. (These are not the horizontal strata that naturally form in most lakes, although these natural strata might be used in a more complex sampling scheme than the one considered here.) In this instance, the lake could be divided into two parts, one consisting roughly of the area of high concentration close to the stream outlet, the other comprising the remainder of the lake.
Then if a simple random sample of fixed size were to be taken from within each of these “strata”, the results could be used to estimate the total amount of organic matter within each stratum. These subtotals could then be added to produce an estimate of the overall total for the lake.
This procedure, because it involves constructing separate estimates for each stratum, permits us to assess the extent to which the organic matter is concentrated near the stream mouth. It also permits the investigator to control the number of vials sampled from each of the two parts of the lake. Hence, the chance variation in the estimated total ought to be sharply reduced. Finally, we shall soon see that the confidence limits that one can construct are free of the outlier problem that invalidated the confidence limits based on a simple random sampling scheme.
A randomized sample is to be drawn independently from within each stratum.
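As a minimal sketch of what drawing independently within each stratum could look like in practice, the Python fragment below assumes, purely for illustration, that each stratum has already been divided into numbered candidate sampling sites. The site labels, the stratum names, the function name `draw_stratified_sample`, and the seed are all hypothetical and do not come from the lake example itself.

```python
import random

def draw_stratified_sample(sites_by_stratum, n_by_stratum, seed=None):
    """Draw a simple random sample (without replacement) separately in each stratum."""
    rng = random.Random(seed)
    sample = {}
    for stratum, sites in sites_by_stratum.items():
        # rng.sample selects without replacement, so no site is chosen twice
        sample[stratum] = rng.sample(sites, n_by_stratum[stratum])
    return sample

# Hypothetical sampling frame: these site labels are made up for illustration
sites_by_stratum = {
    "near_stream_mouth": [f"A{i}" for i in range(1, 41)],
    "rest_of_lake": [f"B{i}" for i in range(1, 401)],
}
n_by_stratum = {"near_stream_mouth": 5, "rest_of_lake": 5}

chosen = draw_stratified_sample(sites_by_stratum, n_by_stratum, seed=1)
for stratum, picked in chosen.items():
    print(stratum, picked)
```

The sample size in each stratum is fixed in advance by the investigator, which is exactly the control that a simple random sample over the whole lake does not provide.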
How can we use the results of a stratified random sample to estimate the overall total? The simplest way is to construct an estimate of the totals within each of the strata, and then to sum these estimates. A sensible estimate of the average within the h’th stratum is $\bar{y}_h$. Hence, a sensible estimate of the total within the h’th stratum is $\hat{\tau}_h = N_h \bar{y}_h$, and the overall total can be estimated by $\hat{\tau} = \sum_{h=1}^{H} \hat{\tau}_h = \sum_{h=1}^{H} N_h \bar{y}_h$.
If we prefer to estimate the overall average, we can merely divide the estimate of the overall total by the size of the population, $N$. The resulting estimator is called the stratified random sampling estimator of the population average, and is given by $\hat{\mu} = \sum_{h=1}^{H} N_h \bar{y}_h / N$.
This can be expressed as a fancy average if we adjust the order of operations in the above expression. If, instead of dividing the sum by $N$, we divide each term by $N$ and then sum the results, we shall obtain the same result. Hence, $\hat{\mu} = \sum_{h=1}^{H} W_h \bar{y}_h$, where $W_h = N_h/N$, can be viewed as a weighted average of the within-stratum sample averages.
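The equivalence of the two forms can be checked with a short Python sketch. The function names and the small input values are hypothetical, chosen only so that the arithmetic is easy to follow. They are not the lake data.

```python
def stratified_total(N_h, ybar_h):
    """Estimate the population total as the sum of N_h * ybar_h over strata."""
    return sum(N * y for N, y in zip(N_h, ybar_h))

def stratified_mean(N_h, ybar_h):
    """Estimate the population mean by dividing the estimated total by N."""
    return stratified_total(N_h, ybar_h) / sum(N_h)

def stratified_mean_weighted(N_h, ybar_h):
    """Same estimator written as a weighted average with weights N_h / N."""
    N = sum(N_h)
    return sum((N_stratum / N) * y for N_stratum, y in zip(N_h, ybar_h))

# Hypothetical stratum sizes and sample means (illustration only)
N_h = [100, 50]
ybar_h = [2.0, 8.0]

print(stratified_total(N_h, ybar_h))
print(stratified_mean(N_h, ybar_h))
print(stratified_mean_weighted(N_h, ybar_h))
```

The last two printed values agree up to floating-point rounding, because dividing each term by N before summing is the same as dividing the sum by N.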
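The estimated standard error is found as:
$se(\hat{\mu}) = \sqrt{\sum_{h=1}^{H} W_h^2 \, [se(\bar{y}_h)]^2}$
with weights $W_h = N_h/N$, where the estimated $se(\bar{y}_h)$ is given by the formula for simple random sampling:
$se(\bar{y}_h) = \sqrt{\frac{s_h^2}{n_h}(1 - f_h)}$
and $f_h = n_h/N_h$ is the sampling fraction in stratum $h$.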
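The standard error calculation can be sketched in Python as follows. The function names and the input values are hypothetical and serve only as an illustration. The finite population correction is computed from the sampling fraction n_h/N_h within each stratum.

```python
import math

def se_srs_mean(s, n, N):
    """Estimated se of a sample mean under simple random sampling within one stratum."""
    f = n / N  # sampling fraction
    return math.sqrt(s**2 / n * (1 - f))

def se_stratified_mean(N_h, n_h, s_h):
    """Estimated se of the stratified mean: sqrt of sum of W_h^2 * se(ybar_h)^2."""
    N = sum(N_h)
    variance = sum(
        (N_stratum / N) ** 2 * se_srs_mean(s, n, N_stratum) ** 2
        for N_stratum, n, s in zip(N_h, n_h, s_h)
    )
    return math.sqrt(variance)

# Hypothetical stratum sizes, sample sizes, and sample standard deviations
N_h = [100, 50]
n_h = [4, 4]
s_h = [2.0, 4.0]

print(se_stratified_mean(N_h, n_h, s_h))
```

The stratum variances are added rather than the standard errors, which relies on the samples being drawn independently in each stratum.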
A Numerical Example
Suppose that for the lake sampling example discussed earlier the lake were subdivided into two strata, and that the following results were obtained. (All readings are in mg per litre.)
Stratum   $N_h$                 $n_h$   Sample observations              $\bar{y}_h$   $s_h$
1         $7.5 \times 10^8$     5       37.2   46.6   45.3   38.1   40.4   41.52         4.23
2         $2.5 \times 10^7$     5       365    344    388    347    403    369.4         25.7
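We begin by computing the estimated mean for each stratum and its associated standard error. The sampling fraction $n_h/N_h$ is so close to 0 that it can be safely ignored. For example, the standard error of the mean for stratum 1 is found as:
$se(\bar{y}_1) = \sqrt{\frac{s_1^2}{n_1}} = \frac{4.234}{\sqrt{5}} = 1.8935$ (using the unrounded value of $s_1$).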
Next, we estimate the total organic matter in each stratum. This is found by multiplying the mean concentration and its se for each stratum by the volume of that stratum:
$\hat{\tau}_h = N_h \times \hat{\mu}_h$ and $se(\hat{\tau}_h) = N_h \times se(\hat{\mu}_h)$
For example, the estimated total organic matter in stratum 1 is found as:
$\hat{\tau}_1 = N_1 \times \hat{\mu}_1 = 7.5 \times 10^8 \times 41.52 = 311.4 \times 10^8$
Next, we total the organic content of the two strata and find the se of the grand total as $\sqrt{14.175^2 + 2.873^2} \times 10^8$ to give the summary table:
Stratum   $n_h$   $\hat{\mu}_h$   $se(\hat{\mu}_h)$   $\hat{\tau}_h$          $se(\hat{\tau}_h)$
1         5       41.52           1.8935              $311.4 \times 10^8$     $14.175 \times 10^8$
2         5       369.4           11.492              $92.3 \times 10^8$      $2.873 \times 10^8$
Total                                                 $403.7 \times 10^8$     $14.46 \times 10^8$
Finally, the overall grand mean is found by dividing by the total volume of the lake, $7.75 \times 10^8$, to give:
$\hat{\mu} = \frac{403.7 \times 10^8}{7.75 \times 10^8} = 52.09$ mg/L
$se(\hat{\mu}) = \frac{14.46 \times 10^8}{7.75 \times 10^8} = 1.87$ mg/L
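The whole calculation can be reproduced from the raw observations with a short Python sketch. The data are those given in the example. The dictionary layout and variable names are assumptions made for illustration, and the finite population correction is ignored, as in the text. Because the code carries full precision throughout, some intermediate values may differ slightly from hand-rounded figures.

```python
import math
import statistics

# Stratum volumes (N) and sample observations (mg per litre) from the example
strata = {
    1: {"N": 7.5e8, "obs": [37.2, 46.6, 45.3, 38.1, 40.4]},
    2: {"N": 2.5e7, "obs": [365, 344, 388, 347, 403]},
}

total = 0.0      # running estimate of the overall total
var_total = 0.0  # running sum of the squared se's of the stratum totals

for h, d in strata.items():
    n = len(d["obs"])
    ybar = statistics.mean(d["obs"])
    s = statistics.stdev(d["obs"])      # sample sd (divisor n - 1)
    se_mean = s / math.sqrt(n)          # sampling fraction treated as 0
    tau = d["N"] * ybar                 # estimated stratum total
    se_tau = d["N"] * se_mean           # se of the stratum total
    total += tau
    var_total += se_tau ** 2
    print(f"stratum {h}: mean={ybar:.2f} se={se_mean:.4f} "
          f"total={tau:.4g} se_total={se_tau:.4g}")

N = sum(d["N"] for d in strata.values())
se_total = math.sqrt(var_total)
mu = total / N
se_mu = se_total / N

print(f"overall total={total:.4g} se={se_total:.4g}")
print(f"overall mean={mu:.3f} mg/L se={se_mu:.3f} mg/L")
```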
The calculations required to compute the stratified estimate can also be done using the method of weighted averages as shown in the following table:
Stratum   $N_h$                 $W_h$ ($= N_h/N$)   $\bar{y}_h$   $W_h \bar{y}_h$   $se(\bar{y}_h)$   $W_h^2 [se(\bar{y}_h)]^2$
1         $7.5 \times 10^8$     0.9677              41.52         40.180            1.8935            3.3578
2         $2.5 \times 10^7$     0.0323              369.4         11.916            11.492            0.1374
Totals    $7.75 \times 10^8$    1.0000                            52.097                              3.4952
$se(\hat{\mu}) = \sqrt{3.4952}$
Hence the estimate of the overall average is 52.097 mg/L, and the associated estimated standard error is $\sqrt{3.4952} = 1.870$ mg/L. An approximate 95% confidence interval is then found in the usual fashion. As expected, these match the previous results.
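As a worked illustration, and assuming that "the usual fashion" means the estimate plus or minus two standard errors (a large-sample normal approximation, which is an assumption here; with only five observations per stratum, a multiplier from the t distribution would give a somewhat wider interval), the interval would be:

```latex
\hat{\mu} \pm 2\, se(\hat{\mu}) = 52.097 \pm 2(1.870) = (48.36,\ 55.84) \text{ mg/L}
```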
This discussion swept a number of practical difficulties under the carpet. These include (a) estimating the volume of each of the two portions of the lake, (b) taking properly randomized samples from within each stratum, (c) selecting the appropriate size of each water sample, (d) measuring the concentration for each water sample, and (e) choosing the appropriate number of water samples from each stratum. None of these tasks is simple. Estimating the volume of a portion of a lake, for example, typically involves taking numerous depth readings and then applying a formula for approximating integrals. This problem is beyond the scope of these notes.
The standard error of the estimator of the overall average is markedly reduced in this example by the stratification. For the stratified estimator, the standard error was just estimated to be around 2 mg/L, from a sample of total size 10. By contrast, for an estimator based on a simple random sample of the same size, the standard error can be found to be about 20 mg/L. [This involves methods not covered in this class.]
Stratification has reduced the standard error by an order of magnitude.
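The notes do not show how the simple random sampling figure is obtained. One rough way to get a comparable number, offered here only as a sketch under stated assumptions, is to approximate the variance of the whole lake by a within-stratum part plus a between-stratum part, each weighted by $W_h$, and then divide by the total sample size. This approximation and the variable names are assumptions, not the author's method. The inputs are the stratum summaries from the numerical example.

```python
import math

# Stratum weights, sample means, and sample sds from the numerical example
W = [7.5e8 / 7.75e8, 2.5e7 / 7.75e8]
ybar = [41.52, 369.4]
s = [4.23, 25.7]
n_total = 10  # same total effort as the stratified design

# Stratified estimate of the overall mean
mu = sum(w * y for w, y in zip(W, ybar))

# Approximate whole-lake variance: within-stratum part plus between-stratum part
within = sum(w * sd ** 2 for w, sd in zip(W, s))
between = sum(w * (y - mu) ** 2 for w, y in zip(W, ybar))
S2 = within + between

# Approximate se of the mean of a simple random sample of size n_total
# (sampling fraction treated as 0)
se_srs = math.sqrt(S2 / n_total)
print(f"approximate SRS se = {se_srs:.1f} mg/L")
```

Under these assumptions the result is roughly 18 mg/L, in line with the figure of about 20 quoted above. Almost all of it comes from the between-stratum part, which is the source of variation that stratification removes.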
It is also possible that we could reduce the standard error even further without increasing our sampling effort by somehow allocating this effort more efficiently. Perhaps we should take fewer water samples from the region far from the outlet, and take more from the other stratum. This will be covered later in this course.
One can also read in more comprehensive accounts how to construct estimates from samples that are stratified after the sample is selected. This is known as post-stratification. These methods are useful if, e.g., you are sampling a population with a known sex ratio. If you observe that your sample is biased in favor of one sex, you can use this information to build an improved estimate of the quantity of interest through stratifying the sample by sex after it is collected. It is not necessary that you start out with a plan for sampling some specified number of individuals from each sex (stratum).
Nonetheless, in any survey work, it is crucial that you begin with a plan. There are many examples of surveys that produced virtually useless results because the researchers failed to develop an appropriate plan.
The plan should include a statement of your main objective, and detailed descriptions of how you plan to generate the sample, collect the data, enter them into a computer file, and analyze the results. It should also discuss how you propose to check for and correct errors at each stage. It should be tested with a pilot survey, and modified accordingly. Major, ongoing surveys should be reassessed continually for possible improvements. There is no reason to expect that the survey design will be perfect the first time that it is tried, nor that flaws will all be discovered in the first round. On the other hand, one should expect that after many years' experience, the researchers will have honed the survey into a solid instrument. George Gallup’s early surveys were seriously biased. Although it took over a decade for the flaws to come to light, once they did, he corrected his survey design promptly, and continued to build a strong reputation.
One should also be cautious in implementing stratified survey designs for long-term studies. An efficient stratification of the Fraser Delta in 1994, e.g., might be hopelessly out of date 50 years from now, with a substantially altered configuration of channels and islands. You should anticipate the need to revise your stratification periodically.