Appendix A Statistical properties of design based estimators in the presence of non
A.1 Design based estimation
A.6. Design based estimators are so named because the investiga- tor puts all of the prior information that they have about the population into the design of the survey. For example, if the population responses are known to vary according to some known characteristics of the respondent then that information is included into the design of the survey by stratifying it ac- cording to these characteristics. One reason for doing this is to reduce the size of the standard errors for the estimates. But as we show below stratiÞcation can be a useful tool for reducing the extent of non response bias.
A.7. Design based estimation starts with the assumption that there is a population of interest comprising N members. Each mem- ber of that population has a vector of Þxed characteristics, xj, and someÞxed quantity of interest to the investigator,yj.1
Interest centers on estimating the sum of theyj for the popula- tion as a whole.2 This could be done by undertaking a census
but that is too expensive so a random sample is taken. Thus, in this approach the only source of the randomness comes from the sampling that is necessary to reduce the cost of estimating the population quantity.
A.8. The random selection of members from the population is de- scribed by two independent3 binary4 random variablesSj and
Rj. Where
1It could be a vector of quantities of interest but we focus on the scalar
case here to simplify the presentation.
2That is, the population quantity of interest isY =PN
j=1yj , whereN
is the number in the population andyj is the quantity of interest for
thejthmember of the population.
3Here indepence means that the joint probability density ofS
j andRj,
f(Sj, Rj),can be written as the product of the two marginal densities,
h(Sj)and g(Rj) . That is, f(Sj, Rj) =h(Sj)g(Rj). In the case of
the Yellow Pages survey the assumption of independence is justiÞed because the telephone numbers of respondents were selected randomly. The fact that this random drawing of telephone numbers occured at different times, depending on when the businesses entered the panel, does not affect the independence ofSj andRj.
4Binary means that S
j can take on one of two values 1 (Selected) or
0 (not selected) and, simlarly, Rj can take on one of two values 1
(response if selected) or 0 (non response if selected). The non response could be split into further categories according to the FDC codes but we will not do that here.
1. Sj describes the selection process
(a) Sj = 1 if individual j is selected; and (b) Sj = 0 if individual j is not selected.
2. Rj describes how members of the population respond if selected:
(a) Rj = 1 if individual j completes the survey when selected; and
(b) Rj = 0if individualj does not complete the survey when selected.
A.9. The sample mean y can be written as follows,
y= PN 1 j=1SjRj N X j=1 SjRjyj
A.10. In many surveys the number of completed responses nc = PN
j=1SjRjis a random variable but in this survey the sampling
continues until the actual number of completed interviews nc is equal the planned number of interviews n. That is nc =n. Thus, for the survey, we can write the sample mean as
y= 1 n N X j=1 SjRjyj
A.11. The expected value of the sample mean is, y= 1 n N X j=1 E(Sj)E(Rj)yj
whereE(X)denotes the expected value of X. Because Sj and Rj are binary variables E(Sj) andE(Rj)can be interpreted, respectively, as the probability that the jth member of the population is selected for interview and the probability that thejth member of the population responds by completing the interview.
A.12. There are three cases to consider here,
1. E(Rj) is equal to a constant π that is independent of both the characteristics of the respondent xj and the respondents answer to the question of interest yj. In
this case the non respondents are said, in the literature, to be ‘missing completely at random’. In this survey ‘missing completely at random non response’ might arise, for example, where the business person is has unusually high demands on their time during the interview period and thus
(a) if in the panel cannot respond this interview period; or
(b) if a potential recruit to the panel is too busy to agree to join the panel.
These would be ‘missing completely at random’ if the factors causing demand on their time did not depend on the businesses characteristics and were not correlated with any of the variables of interest in the survey such as the number of award employees the business has or the number of additional employees the business would hire in the event of a guarantee of no change in the Safety Net;
2. E(Rj) varies with the characteristics of the respondent xj,but does not vary with the respondents answer to the question of interest yj. In this case the non respondents are said, in the literature, to be ‘missing at random’. In this case we write E(Rj) = γ(xj) , where γ(xj) is a function that describes how the probability of response varies with the characteristics of the respondent. In this survey, missing at random, non response might arise, for example, where larger businesses, as measured by turnover, are more (or less) likely than small businesses to respond to the survey;
3. E(Rj) varies with the characteristics of the respondent xj and with the respondents answer to the question of interest yj. In this case the non respondents are said, in the literature, to be ‘non ignorable nonresponse’.5 In
this case we write E(Rj) =θ(xj, yj), whereθ(xj, yj) is a function that describes how the probability of response varies with the characteristics of the respondent and with the answer to the question of interest. In this survey non ignorable nonresponse might, for example, arise if:
5We do not like the term non ignorable non response as it implies,
incorrectly, that the other categories of non response are ignorable. Nonetheless, we use the term because it is used in the literature; see Lohr (1999).
(a) Businesses where the respondent had unusually higher demand on their time during the survey period were more likely to pass-on the SNA to over award em- ployees;
(b) Businesses where the respondent had unusually higher demand on their time during the survey period were more likely than other businesses to have high de- mand for their product and thus, arguably,
i. less likely, than businesses that responded, to report that the 2003 SNA cost jobs in their business; but
ii. more likely, than businesses that responded, to report that they would hire additional employ- ees if there were a guarantee of no change in the Safety Net for a period of Þve years. A.13. The examples given above were deliberately chosen to illus-
trate the point that there are ‘swings and roundabouts’ with bias. The biases mentioned, in the examples, would cause un- derestimation of the extent to which SNAs are passed on, over estimation of the job loss from the 2003 SNA but under es- timation of the number of jobs created by a guarantee of no change in the Safety Net for a period of Þve years. Non sta- tisticians often claim that the bias works against their case and cite example to support that contention. But this is a poor statistical practice. The best statistical practice is to re- administer the survey to non respondents and then estimate the probability of non response and test whether it is ‘missing completely at random’, ‘missing at random’ or ’non ignorable non response’.