Design-Based Estimation and Inference

3.5 Probability Distributions and Design-Based Inference

The whole procedure consists really in solving the problems which Professor Bowley termed direct problems: given a hypothetical population, to find the distribution of certain characters in repeated samples.

If this problem is solved, then the solution of the other problem, which takes the place of the problem of inverse probability, can be shown to follow. —Jerzy Neyman, 1934

3.5.1 Sampling Distributions of Survey Estimates

Neyman’s (1934) “distribution of certain characters in repeated samples” is termed the sampling distribution of a sample estimate. The theoretical sampling distribution is based on all possible samples of size n that could be selected from a finite population of N elements. Using a single sampling plan, if all possible samples of size n from the N population elements were drawn in sequence, sample estimates were computed for each selected sample, and a histogram of the estimated values was plotted, the shape of the sampling distribution would emerge. Provided that the sample size, n, was sufficiently large, the distribution that would begin to appear as each new sample estimate was added to the histogram would be the familiar bell-shaped curve of a Normal distribution.
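Enumerating every possible sample is rarely feasible, but the thought experiment above can be approximated by simulation. The following is a minimal sketch, not the authors' procedure: the population (10,000 skewed values with a mean near 25), the sample size n = 200, the number of replicates, and the function name `simulate_sample_means` are all assumptions made for illustration. The comparison value uses the standard simple random sampling result Var(ȳ) = (1 – n/N)·S²/n, which is not stated in the text.

```python
import math
import random
import statistics

def simulate_sample_means(population, n, num_samples, seed=0):
    """Draw repeated simple random samples (without replacement) and
    return the sample mean computed from each one."""
    rng = random.Random(seed)
    return [statistics.fmean(rng.sample(population, n))
            for _ in range(num_samples)]

# Hypothetical, deliberately skewed finite population of N = 10,000 values.
pop_rng = random.Random(42)
population = [pop_rng.expovariate(1 / 25) for _ in range(10_000)]
N = len(population)
pop_mean = statistics.fmean(population)

n = 200
estimates = simulate_sample_means(population, n, num_samples=2_000)

# Standard SRS result (an assumption brought in for comparison, not from the text):
# Var(ybar) = (1 - n/N) * S^2 / n, where S^2 is the population element variance.
theoretical_se = math.sqrt((1 - n / N) * statistics.variance(population) / n)

print(f"population mean          : {pop_mean:.3f}")
print(f"mean of sample estimates : {statistics.fmean(estimates):.3f}")
print(f"SD of sample estimates   : {statistics.stdev(estimates):.3f}")
print(f"theoretical SRS std error: {theoretical_se:.3f}")
```

Under these assumptions, the mean of the simulated estimates should sit close to the population mean and their standard deviation close to the theoretical standard error. A histogram of `estimates` should look roughly bell-shaped even though the population itself is skewed.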

Figure 3.2 illustrates a set of nine simulated sampling distributions for sample estimates of the population mean. Each individual graph in this figure represents the histogram of sample estimates, $\bar{y}$, computed from 5,000 independent samples from a large finite population with known mean $\bar{Y} = 25$. The nine simulated sampling distributions displayed in this figure represent nine different probability sampling plans: three levels of sample size (n = 500, n = 1,000, n = 5,000) and three levels of clustering (no clustering and clusters of size b = 10 and b = 50).

[Figure 3.2: nine histograms of the simulated sample estimates, one per sampling plan. Horizontal axis: sample estimate (–5.4 to 52.2); vertical axis: % of samples. Rows: SRS; cluster size b = 10; cluster size b = 50. Columns: n = 500, n = 1000, n = 5000. Mean and SD of the estimates in each panel:]

SRS: n = 500, Mean 25.04614, SD 2.811056; n = 1000, Mean 24.9862, SD 2.019553; n = 5000, Mean 25.0038, SD 0.909071

Cluster Size, b = 10: n = 500, Mean 24.95748, SD 4.607751; n = 1000, Mean 25.02338, SD 3.219163; n = 5000, Mean 24.99775, SD 1.438167

Cluster Size, b = 50: n = 500, Mean 25.04225, SD 8.485906; n = 1000, Mean 24.97321, SD 5.898481; n = 5000, Mean 25.02288, SD 2.620669

Figure 3.2

Sampling distributions for a survey estimate (n = 5,000 simulated samples, $\bar{Y} = 25$).

© 2010 by Taylor and Francis Group, LLC

Applied Survey Data Analysis

Since $\bar{y}$ is an unbiased estimator regardless of the sampling plan or the sample size, each sampling distribution is centered at the population mean, $\bar{Y} = 25$. As the sample size decreases or the size of sample clusters increases, the dispersion of sample estimates about the population mean value increases. The degree of dispersion of the sample estimates about the mean of the sampling distribution is the sampling variance associated with the sample design, which can be written as

$$Var(\bar{y}) = \sum_{s=1}^{S} p(s) \cdot \left[\bar{y}_s - E(\bar{y}_s)\right]^2 \tag{3.6}$$

where

s = 1, …, S indexes all possible samples of size n under the design;

p(s) = the probability that sample s was chosen from the set of S possibilities;

$\bar{y}_s$ = the estimate for sample s.

The square root of the sampling variance is the standard error of a probability sample estimate, denoted by $SE(\bar{y})$. Equivalently, the standard error of a design-based estimate is simply the standard deviation of the sampling distribution.
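For a very small population, the sampling variance in Equation 3.6 can be computed exactly by listing every possible sample. The sketch below assumes a hypothetical population of N = 5 values with mean 25 and simple random sampling of n = 2 elements, so that each of the S possible samples has probability p(s) = 1/S. The cross-check against (1 – n/N)·S²/n is the standard SRS formula, which is not given in the text.

```python
from itertools import combinations
from math import sqrt
from statistics import fmean, variance

# Hypothetical finite population (N = 5) with mean 25.
population = [12, 18, 25, 31, 39]
N = len(population)
n = 2

# Enumerate all S possible samples of size n.
samples = list(combinations(population, n))
S = len(samples)
p = 1 / S  # under SRS every sample is equally likely

# Estimate (sample mean) for each possible sample s.
estimates = [fmean(s) for s in samples]

# Expected value of the estimator over the sampling distribution.
expected = sum(p * y_s for y_s in estimates)

# Equation 3.6: probability-weighted squared deviations from the expectation.
sampling_var = sum(p * (y_s - expected) ** 2 for y_s in estimates)
standard_error = sqrt(sampling_var)

# Cross-check with the standard SRS formula (assumption, not from the text).
analytic_var = (1 - n / N) * variance(population) / n

print(f"S = {S}, E(ybar) = {expected:.4f}")
print(f"Var(ybar) by enumeration = {sampling_var:.4f}")
print(f"Var(ybar) by SRS formula = {analytic_var:.4f}")
print(f"SE(ybar) = {standard_error:.4f}")
```

For this population both routes give a sampling variance of 33.75, and the expected value of the estimator equals the population mean of 25, as unbiasedness requires.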

In real-world survey samples, a single sample is observed. It is never practically feasible to observe the full sampling distribution of an estimate, its mean, its variance, or its distributional shape. So how is it possible to make inferential statements based on a sampling distribution that is never observed? Briefly, statistical theory shows that if the sample size n is sufficiently large (e.g., 100 cases) and if $\hat{\theta}$ is an unbiased or otherwise consistent estimator of the population value θ, the sampling distribution converges to an approximately Normal distribution: $f(\hat{\theta}) \sim N(\theta, Var(\hat{\theta}))$. Consequently, the test statistic

$$t = \frac{\hat{\theta} - \theta}{se(\hat{\theta})}$$

where both $\hat{\theta}$ and $se(\hat{\theta})$ are estimated from the survey sample data, follows the Student t probability distribution with df degrees of freedom (to be defined in the following section). This test statistic can be “inverted” to derive the following probability statement:

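$$P\left\{\hat{\theta} - t_{1-\alpha/2,\,df} \cdot se(\hat{\theta}) \le \theta \le \hat{\theta} + t_{1-\alpha/2,\,df} \cdot se(\hat{\theta})\right\} \cong 1 - \alpha$$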

This reexpression of the test statistic as a range of values for θ is the basis for the 100(1 – α)% confidence interval presented in Equation 3.2.

3.5.2 Degrees of Freedom for t under Complex Sample Designs

Probability distributions such as the Student t, χ², and F play a critical role in the construction of confidence intervals for population values or as the reference distributions for formal tests of hypotheses concerning population parameters. Included in the quantities that define the shape of these distributions are degrees of freedom (df) parameters. The degrees of freedom are indices of how precisely the true variance parameters of the reference distribution have been estimated from the sample design. Sample designs with large numbers of degrees of freedom for variance estimation enable more precise estimation of the true variance parameters of the reference distribution. Conversely, the smaller the degrees of freedom afforded by the sample design, the less precisely these variance parameters are estimated. Consider the (1 – α/2 = 0.975) critical values for the Student t distribution with varying degrees of freedom:

$t_{.975,1} = 12.706$; $t_{.975,20} = 2.0860$; $t_{.975,40} = 2.0211$; $t_{.975,\infty} = Z_{.975} = 1.9600$.

Whenever an analyst (or his or her computer software) derives a confidence interval or test statistic from sample data, variance parameters that define the appropriate t, χ², or F reference distribution must be estimated from the sample data.
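To see what these critical values mean in practice, the sketch below builds the 95% confidence interval $\hat{\theta} \pm t_{.975,df} \cdot se(\hat{\theta})$ for each of the four degrees-of-freedom values quoted above. The estimate (25.0) and standard error (2.0) are hypothetical, the function name `confidence_interval` is illustrative, and the lookup table contains only the critical values listed in the text rather than a general t quantile function.

```python
# Critical values t_{.975, df} quoted in the text; float("inf") stands for
# the limiting Normal case (Z_{.975}).
T_975 = {1: 12.706, 20: 2.0860, 40: 2.0211, float("inf"): 1.9600}

def confidence_interval(estimate, se, df, table=T_975):
    """Return (lower, upper) limits of estimate +/- t_{.975, df} * se.

    Only the df values present in `table` are supported in this sketch.
    """
    t_crit = table[df]
    half_width = t_crit * se
    return estimate - half_width, estimate + half_width

# Hypothetical survey estimate and standard error.
estimate, se = 25.0, 2.0

for df in T_975:
    lower, upper = confidence_interval(estimate, se, df)
    print(f"df = {df:>4}: 95% CI = ({lower:.3f}, {upper:.3f}), "
          f"width = {upper - lower:.3f}")
```

With the same estimate and standard error, the interval is widest at df = 1 and narrows toward the Normal-based interval as the degrees of freedom grow, which is why the choice of degrees of freedom matters most for designs with few clusters.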

Precise determination of the degrees of freedom for variance estimation available under complex sample designs used in practice is difficult. Currently, computer software programs for the analysis of complex sample survey data employ a fixed degrees of freedom rule to determine the degrees of freedom for the reference distribution used to construct a confidence interval (e.g., …).

Interested readers are referred to Theory Box 3.1 for a more in-depth (yet not strictly theoretical) explanation of the basis for this rule.

The fixed rule for determining degrees of freedom for complex sample designs is applied by software procedures designed for the analysis of survey data whenever the full survey sample is being analyzed. However, programs may use different rules for determining degrees of freedom for subpopulation analyses. For subpopulations, improved confidence interval coverage of the true population value is obtained using a “variable” degrees of freedom calculation method (Korn and Graubard, 1999):

$$df = \sum_{h=1}^{H} I_h \cdot (a_h - 1)$$

where

$I_h$ = 1 if stratum h has 1 or more subpopulation cases, 0 otherwise;

$a_h$ = the number of clusters (PSUs) sampled in stratum h = 1, …, H.

The variable degrees of freedom are determined as the total number of clusters in strata with one or more subpopulation cases minus the number of strata with at least one subpopulation observation. Rust and Rao (1996) suggest the same rule for calculating degrees of freedom for test statistics when replicated variance estimation methods are used to develop standard errors for subpopulation estimates. See Section 4.5 for a more in-depth discussion of survey estimation for subpopulations.

THEORY BOX 3.1 DEGREES OF FREEDOM FOR VARIANCE ESTIMATION

… only n – 1 unique pieces of information for estimating the variance).

Consequently, for an SRS design, the t statistic is referred to a Student t distribution with n – 1 degrees of freedom.

Now, consider this same test statistic under a more complex stratified cluster sample design: each stratum contributes $a_h - 1$ independent contrasts to the estimate of $var(\bar{y}_{st,cl})$. The t statistic for the complex sample design is no longer referred to a Student t distribution with n – 1 degrees of freedom.

Instead, the correct degrees of freedom for variance estimation under this complex sample design are

$$df_{\text{fixed}} = \sum_{h=1}^{H} (a_h - 1) = \sum_{h=1}^{H} a_h - H = \#\text{clusters} - \#\text{strata}$$

The variance estimation technique known as Taylor series linearization (TSL) in particular (see Section 3.6.2) involves approximating …
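The fixed rule (clusters minus strata) and the variable rule for subpopulations can be compared side by side on a small example. This is a minimal sketch under stated assumptions: the four-stratum design, the stratum labels, and the function names `fixed_df` and `variable_df` are hypothetical and are not taken from any survey software package.

```python
def fixed_df(clusters_per_stratum):
    """Fixed rule: total number of sampled clusters minus number of strata,
    i.e. the sum over strata of (a_h - 1)."""
    return sum(clusters_per_stratum.values()) - len(clusters_per_stratum)

def variable_df(clusters_per_stratum, strata_with_subpop_cases):
    """Variable rule: sum of I_h * (a_h - 1), where I_h = 1 only for strata
    that contain at least one subpopulation case."""
    return sum(
        a_h - 1
        for h, a_h in clusters_per_stratum.items()
        if h in strata_with_subpop_cases
    )

# Hypothetical design: a_h = number of clusters (PSUs) sampled in stratum h.
design = {"h1": 2, "h2": 2, "h3": 3, "h4": 2}

# Hypothetical subpopulation that appears only in strata h1 and h3.
subpop_strata = {"h1", "h3"}

print("fixed df   :", fixed_df(design))                    # 9 clusters - 4 strata
print("variable df:", variable_df(design, subpop_strata))  # (2 - 1) + (3 - 1)
```

In this example the full-sample analysis has 5 degrees of freedom, while the subpopulation analysis has only 3. When every stratum contains subpopulation cases, the two rules give the same value.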