Voluntary sampling - Sampling Errors - Model Quality Report in Business Statistics

Part 1: Sampling Errors

4.2 Voluntary sampling

Voluntary sampling arises when, for example, businesses are requested, but not required, to take part in a survey, and the survey results are based just on the data received from the companies that choose to respond. The choice of whether to participate thus makes the sample non-probability-based: even if one wished to acknowledge uncertainty, before the data arrive, about which companies will respond by regarding the sample inclusion indicators I_j as random, the inclusion probabilities π_j are rendered unknown by the choice mechanism. As with quota sampling (Section 4.3), the result can range from highly accurate to highly inaccurate, depending on the (possibly unknown) degree to which the volunteer units represent the population in all relevant respects. Any bias that arises from failure of the voluntary sample to match the population in this way is an example of selection bias (see Freedman et al. 1998 for a discussion), in which the self-selection mechanism is correlated with the outcome of interest and some or all of its most important predictors.
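The link between self-selection and bias can be illustrated with a small simulation. The sketch below is not based on the survey data: the population, the response probabilities, and the function names are all hypothetical, chosen only to contrast a response mechanism that is unrelated to the outcome with one that is correlated with it.

```python
import random


def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


def volunteer_bias(outcomes, response_probs, seed=0):
    """Volunteer-sample mean minus population mean for one simulated survey.

    Unit j responds with probability response_probs[j], independently.
    """
    rng = random.Random(seed)
    volunteers = [y for y, p in zip(outcomes, response_probs) if rng.random() < p]
    return mean(volunteers) - mean(outcomes)


# Hypothetical population: outcomes spread evenly over (0, 1].
N = 20_000
outcomes = [(j + 1) / N for j in range(N)]

# Case 1: every unit responds with probability 0.3, unrelated to its outcome.
bias_unrelated = volunteer_bias(outcomes, [0.3] * N)

# Case 2: response probability rises with the outcome (about 30% respond overall).
bias_self_selected = volunteer_bias(outcomes, [0.6 * y for y in outcomes])

print(f"bias, response unrelated to outcome: {bias_unrelated:+.3f}")
print(f"bias, response tied to outcome:      {bias_self_selected:+.3f}")
```

In the first case the volunteer mean should sit close to the population mean; in the second it should be pulled clearly upwards, because units with large outcomes are more likely to respond.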

An example of voluntary sampling is provided by the Stocks Inquiry business survey, conducted by the UK Office for National Statistics (ONS). This survey has both a monthly voluntary component and a quarterly component based on probability sampling: random samples of companies are (a) chosen, (b) required to provide quarterly data, and (c) requested (in addition) to provide monthly data, so that the companies providing voluntary monthly information form a self-selected subset of the probability sample. In practice about 30% of the sampled companies choose to supply the voluntary data. Note that this type of sample could equally well be described as a probability sample (a) with a voluntary sub-sample or (b) with a high degree of (almost certainly) non-ignorable non-response (see chapter 8 and section 9.7).

                  Industry 1                 Industry 2                 Industry 3
Period           P        V        B̂        P        V        B̂        P        V        B̂
1997/Q1      3,420    5,425   +2,005   38,011   38,905     +894   26,617   61,534  +34,917
1997/Q2      3,456    6,148   +2,692   40,502   43,271   +2,769   27,439   62,990  +35,551
1997/Q3      3,455    6,008   +2,553   36,940   44,170   +7,230   26,059   59,931  +33,872

Table 4.1 Estimates based on the probability (P) and voluntary (V) samples, by industry and period, for work-in-progress Opening stocks. All figures are in £000. B̂ = estimated bias.

Available variables in the analysis we present here include industry group number (four-digit SIC92; we focus here on only three industries, coded 1-3); period of return, from 01/1997 to 09/1997; register employment and (VAT) turnover (in £000), based on data gathered roughly 3 months previously; and the Opening and Closing stocks (in £000) for each of three categories: materials, work in progress, and finished goods. The numbers of companies involved in the voluntary and probability samples in this period were 77-87 and 261-275, respectively, varying slightly from quarter to quarter. We concentrate here on the work-in-progress stocks (results for the other two categories were similar). For ease of exposition (a) we present results only on the 77 and 226 companies in the voluntary and probability samples, respectively, with complete data at all time points relevant to the analyses below, and (b) we analyse the data as if the probability sample had been a simple random sample (in fact it was a stratified random sample; the points we wish to make in this section come through more clearly without the extra issue of re-weighting the probability sample back to the population).

                  Industry 1                 Industry 2                 Industry 3
Period           P        V        B̂        P        V        B̂        P        V        B̂
1997/Q1      3,456    6,148   +2,692   40,502   43,271   +2,769   27,439   62,990  +35,551
1997/Q2      3,455    6,008   +2,553   36,940   44,170   +7,230   26,059   59,931  +33,872
1997/Q3      3,898    7,828   +3,930   39,356   49,605  +10,249   24,627   56,638  +32,011

Table 4.2 Estimates based on the probability (P) and voluntary (V) samples, by industry and period, for work-in-progress Closing stocks. All figures are in £000. B̂ = estimated bias.

Some indication of the biases that could arise from basing inferences on the voluntary monthly samples is provided by a direct comparison between the monthly and quarterly data in each of the three periods 01-03/97, 04-06/97, and 07-09/97 that were common to both surveys (for comparability between the monthly and quarterly series, the opening and closing of the first quarter of 1997 were taken to be 01/97 and 03/97 for the voluntary series, and analogously for the other quarters). Tables 4.1-4.3 present sample estimates by industry and period for work-in-progress Opening, Closing, and (Closing − Opening) stocks in each of these three quarters. Within each industry code, probability (P) and voluntary (V) estimates are given and, since we are taking the probability-sampling results to be (design) unbiased, the estimated bias B̂ = V − P from the voluntary data may also be calculated. It is evident from these tables (a) that the voluntary results for both opening and closing stocks are enormously biased on the high side, and (b) that much, though by no means all, of this bias cancels in the subtraction when producing the (Closing − Opening) stocks estimates, which are the principal outcomes of interest.
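The estimated-bias arithmetic, and the partial cancellation in the (Closing − Opening) difference, can be checked directly. The following is a minimal sketch using the industry 1 estimates from Tables 4.1 and 4.2; the variable and function names are illustrative only.

```python
# Industry 1 (P, V) estimates in £000, copied from Tables 4.1 and 4.2.
opening = {"Q1": (3420, 5425), "Q2": (3456, 6148), "Q3": (3455, 6008)}
closing = {"Q1": (3456, 6148), "Q2": (3455, 6008), "Q3": (3898, 7828)}


def estimated_bias(p, v):
    """Estimated bias of the voluntary estimate, taking P as (design) unbiased."""
    return v - p


rows = {}
for quarter in opening:
    p_open, v_open = opening[quarter]
    p_close, v_close = closing[quarter]
    rows[quarter] = {
        "opening": estimated_bias(p_open, v_open),
        "closing": estimated_bias(p_close, v_close),
        # Bias of the difference equals the difference of the two biases.
        "change": estimated_bias(p_close - p_open, v_close - v_open),
    }

for quarter, biases in rows.items():
    print(quarter, biases)
```

For 1997/Q1, for instance, opening and closing biases of +2,005 and +2,692 leave a bias of +687 in the difference: much smaller than either, but far from zero relative to the probability-sample estimate of 36.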

                  Industry 1                 Industry 2                 Industry 3
Period           P        V        B̂        P        V        B̂        P        V        B̂
1997/Q1         36      723     +687    2,491    4,366   +1,875      822    1,456     +634
1997/Q2         -1     -140     -139   -3,562      899   +4,461   -1,380   -3,059   -1,679
1997/Q3        443    1,820   +1,377    2,416    5,435   +3,019   -1,432   -3,293   -1,861

Table 4.3 Estimates based on the probability (P) and voluntary (V) samples, by industry and period, for work-in-progress (Closing − Opening) stocks. All figures are in £000. B̂ = estimated bias.

The leading method for bias reduction with voluntary samples is poststratification (for example, Holt & Smith 1979, Jagers 1986, Smith 1991, Little 1993). Taking for simplicity the case of a single outcome of interest, two ingredients are required for this method: (i) a list, preferably (close to) exhaustive, of covariates likely to be (highly) correlated with the outcome; and (ii) the ability to gather data on these covariates both in the voluntary sample and in the population itself. Dividing each covariate into strata and cross-tabulating the resulting categorical variables, poststratification involves (a) estimating both population and voluntary sample prevalences in the cells of this stratification grid, and (b) re-weighting the voluntary sample to match the estimated population prevalences. Ideally the stability of this method should be checked by sensitivity analysis (see Draper et al. 1993a for examples), varying the covariates used and the cut-points defining their strata across plausible ranges and seeing whether the bias-adjusted estimates are similar. The (approximate) success of this method rests on the assumption that (most or all of) the important covariates have been correctly identified, measured, and adjusted for.
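Steps (a) and (b) can be sketched in a few lines for the case of a single stratifying variable. This is an illustrative sketch rather than the procedure used for the survey: the function name, the stratum labels, and the toy figures are all hypothetical, and the population shares are assumed to have been estimated already.

```python
from collections import Counter


def poststratified_mean(sample, population_shares):
    """Re-weight a sample so its stratum shares match the population's.

    sample: list of (stratum, outcome) pairs from the voluntary sample.
    population_shares: dict mapping stratum -> estimated population share.
    """
    counts = Counter(stratum for stratum, _ in sample)
    empty = [s for s, share in population_shares.items() if share > 0 and counts[s] == 0]
    if empty:
        # An empty sample cell cannot be re-weighted.
        raise ValueError(f"no sample units in strata: {empty}")
    n = len(sample)
    # Weight = population share of the stratum / sample share of the stratum.
    weights = {s: population_shares[s] / (counts[s] / n) for s in counts}
    total_weight = sum(weights[s] for s, _ in sample)
    return sum(weights[s] * y for s, y in sample) / total_weight


# Hypothetical toy data: large units dominate the sample but not the population.
sample = [("small", 10.0), ("large", 100.0), ("large", 120.0), ("large", 80.0)]
shares = {"small": 0.75, "large": 0.25}

raw = sum(y for _, y in sample) / len(sample)
adjusted = poststratified_mean(sample, shares)
print(f"raw mean: {raw:.1f}, poststratified mean: {adjusted:.1f}")
```

The adjusted figure equals the share-weighted average of the stratum means (0.75 × 10 + 0.25 × 100), which is why the method can only remove bias that is explained by the stratifying variables.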

                      Probability sample                      Voluntary sample
Variable      Industry 1  Industry 2  Industry 3      Industry 1  Industry 2  Industry 3

Table 4.4 Comparison of probability and voluntary samples on median register employment (numbers of people) and turnover (£000), by industry, in the first quarter of 1997 (results for the other two quarters were similar).

In this example the only available covariates are register employment (E) and turnover (T), which are fairly highly correlated in both the P and V samples (for example, the correlation, with both variables on the log scale, is +0.74 in the voluntary sample). Table 4.4 shows that at least some of the discrepancy between the probability and voluntary samples should indeed be explainable on the basis of E and/or T: the 30% of the quarterly probability sample that chose to volunteer monthly data heavily over-represented large companies.

To avoid redundancy we present poststratification results here only for one industry (results were similar with the other two industries). With only 17 companies per quarter in this industry in the voluntary sample, bivariate stratification on both E and T would leave empty cells, which does not permit re-weighting, so in the work presented here we first stratified only on register turnover (in any case the high correlation between E and T indicates that there is not much information in E after T has been accounted for). We chose four strata, with the smallest cutpoint selected so that the lowest stratum had at least one company in both samples, and with the other two cutpoints chosen to spread the rest of the distribution out approximately evenly.

Table 4.5 indicates how the probability and voluntary samples in industry 1 were distributed across strata based on register turnover. This provides another view of how sharply the large companies were over-sampled in the voluntary survey: for example, 43% of the probability-sampled companies were in the smallest register-turnover stratum, versus 6% in the voluntary sample. The weights used in the poststratification are also given in this table; for example, the voluntary-sample company in the lowest stratum was given weight (30/70) / (1/17) ≅ 7.29, whereas each of the 6 voluntary companies in the highest stratum received weight (14/70) / (6/17) ≅ 0.57.

Register turnover (£000)       P     V    Weight
[0-8,455]                     30     1      7.29
(8,455-14,784]                12     4      0.73
(14,784-84,657]               14     6      0.57
(84,657-2,284,224]            14     6      0.57
Total                         70    17         -

Table 4.5 Frequency distribution of probability (P) and voluntary (V) samples, across the four register turnover strata, together with the poststratification weights.
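The weights in Table 4.5 follow from the stratum counts alone: each is the probability-sample share of the stratum divided by the voluntary-sample share. A minimal sketch, with an illustrative function name:

```python
# Stratum counts from Table 4.5, lowest to highest register-turnover stratum.
p_counts = [30, 12, 14, 14]  # probability sample, total 70
v_counts = [1, 4, 6, 6]      # voluntary sample, total 17


def poststratification_weights(p_counts, v_counts):
    """Weight per stratum = (P share of stratum) / (V share of stratum)."""
    n_p, n_v = sum(p_counts), sum(v_counts)
    return [(p / n_p) / (v / n_v) for p, v in zip(p_counts, v_counts)]


weights = poststratification_weights(p_counts, v_counts)
rounded = [round(w, 2) for w in weights]
print(rounded)

# The weighted voluntary counts add back up to the voluntary sample size.
weighted_total = sum(w * v for w, v in zip(weights, v_counts))
print(round(weighted_total, 6))
```

Because the single company in the lowest stratum carries a weight of about 7.29, the poststratified estimate leans heavily on that one return, which is one reason the sensitivity of the results to the stratum definitions matters.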

Table 4.6 presents the results of the bias reduction arising from poststratification on register turnover. Separately for each of the stocks categories {Opening, Closing, and (Closing − Opening)}, the P column gives the probability-sample estimate (reported previously in Tables 4.1-4.3), the PV column is the voluntary-sample estimate re-weighted by the poststratification on register turnover, B̂ = PV − P is the estimated bias after poststratification, and R̂ is the percentage (relative) reduction in estimated bias yielded by the poststratification. For example, in 1997/Q2 the raw voluntary-sample estimate of Opening stocks for industry 1 was 6,148, giving an estimated bias of +2,692 (Table 4.1); after re-weighting the new voluntary-sample estimate is 3,556, with an estimated bias of +100 (Table 4.6); and diminishing the estimated bias from 2,692 to 100 represents an estimated bias reduction of (2,692 − 100) / 2,692 ≅ 96.3%. Poststratification has resulted in massive estimated bias reductions ranging from 81% to 97% for the opening and closing stocks, but has produced a more modest estimated improvement in the crucial difference (Closing − Opening), with gains from 53% to 90%.

                     Opening                                Closing
Period          P      PV       B̂   R̂(%)          P      PV       B̂   R̂(%)
Q1          3,420   3,223    -197    90.2      3,456   3,556    +100    96.3
Q2          3,456   3,556    +100    96.3      3,455   3,541     +86    96.6
Q3          3,455   3,541     +86    96.6      3,898   4,628    +730    81.4

                Closing − Opening
Period          P      PV       B̂   R̂(%)
Q1             36     333    +297    56.8
Q2             -1     -15     -14    89.9
Q3            443   1,087    +644    53.2

Table 4.6 Results, by period, from poststratifying on register turnover. In each of the stocks categories {Opening, Closing, and (Closing − Opening)}, P is the probability-sample estimate, PV is the poststratified voluntary sample estimate, B̂ = PV − P is the estimated bias after poststratification, and R̂ is the percentage reduction in estimated bias arising from the poststratification.

                     Opening                                Closing
Period          P      PV       B̂   R̂(%)          P      PV       B̂   R̂(%)
Q1          3,420   3,301    -119    94.1      3,456   3,598    +142    94.7
Q2          3,456   3,598    +142    94.7      3,455   3,549     +94    96.3
Q3          3,455   3,549     +94    96.3      3,898   4,528    +630    84.0

                Closing − Opening
Period          P      PV       B̂   R̂(%)
Q1             36     307    +271    60.6
Q2             -1     -49     -48    65.5
Q3            443     979    +536    61.1

Table 4.7 Results, by period, from poststratifying on register employment. In each of the stocks categories {Opening, Closing, and (Closing − Opening)}, P is the probability-sample estimate, PV is the poststratified voluntary sample estimate, B̂ = PV − P is the estimated bias after poststratification, and R̂ is the percentage reduction in estimated bias arising from the poststratification.
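The percentage reduction R̂ reported in Tables 4.6 and 4.7 can be reproduced from the raw and poststratified biases. In the sketch below the function name is illustrative, and the use of absolute values is an assumption: the worked example in the text involves only positive biases, but taking magnitudes is what reproduces the tabulated figures when a bias is negative.

```python
def bias_reduction_pct(raw_bias, adjusted_bias):
    """Percentage reduction in the magnitude of the estimated bias.

    Assumption: magnitudes are compared, so a change of sign still counts
    as a reduction when the adjusted bias is smaller in absolute value.
    """
    return 100 * (abs(raw_bias) - abs(adjusted_bias)) / abs(raw_bias)


# Industry 1, poststratified on register turnover (raw biases from Tables 4.1 and 4.3).
opening_q2 = round(bias_reduction_pct(2692, 100), 1)   # Opening, 1997/Q2
opening_q1 = round(bias_reduction_pct(2005, -197), 1)  # Opening, 1997/Q1
change_q2 = round(bias_reduction_pct(-139, -14), 1)    # Closing - Opening, 1997/Q2
print(opening_q2, opening_q1, change_q2)
```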

Sensitivity analysis on the poststratification process is straightforward. For example, basing the strata on register employment and using three strata instead of four (with stratum definitions [20-215], (215-449], and (449-12,378], chosen to create approximately equal-sized groups in the voluntary sample) yielded the results in Table 4.7. The two approaches to poststratification have in this case led to similar amounts of bias reduction, although this need not always be true. In practice, when a “gold standard” (such as the probability-sample results here) is not available, any differences revealed by a comparison of this type may indicate that other variables should ideally have been part of the stratum definitions, that is, that poststratification may not have been entirely successful in removing the selection bias present in the voluntary sample.
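A comparison of this type can be laid out as follows, using the industry 1 (Closing − Opening) figures from Tables 4.3, 4.6, and 4.7. The variable names are illustrative; the point of the sketch is to separate what would be observable without a gold standard (the disagreement between the two poststratifications) from what is observable only here (the residual bias of each).

```python
# Industry 1, (Closing - Opening) stocks in £000, copied from Tables 4.3, 4.6 and 4.7.
p_estimate = {"Q1": 36, "Q2": -1, "Q3": 443}
pv_turnover = {"Q1": 333, "Q2": -15, "Q3": 1087}   # four turnover strata
pv_employment = {"Q1": 307, "Q2": -49, "Q3": 979}  # three employment strata

# Observable even without a gold standard: how far the two variants disagree.
variant_gap = {q: pv_turnover[q] - pv_employment[q] for q in p_estimate}

# Observable only here, because the probability sample serves as a benchmark.
residual_bias = {
    q: (pv_turnover[q] - p_estimate[q], pv_employment[q] - p_estimate[q])
    for q in p_estimate
}

for q in p_estimate:
    print(q, "gap between variants:", variant_gap[q], "residual biases:", residual_bias[q])
```

In 1997/Q3 the two variants differ by 108, yet both still overstate the probability-sample estimate by more than 500, so agreement between variants is at best a necessary check and not evidence that the selection bias has been removed.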

In document Model Quality Report in Business Statistics (Page 73-78)