Design-Based Estimation and Inference
2. Multistage sampling within selected PSUs results in a single ultimate cluster of observations for that PSU. Variance estimation methods
3.6.2 The Taylor Series Linearization Method
Taylor series approximations of complex sample variances for weighted sample estimates of finite population means, proportions, and linear regression coefficients have been available since the 1950s (Hansen, Hurwitz, and Madow, 1953; Kish and Hess, 1959). Woodruff (1971) summarized the general application of the TSL methods to a broader class of survey statistics. Binder (1983) advanced the application of the TSL method to variance estimation for analysis techniques such as logistic regression or other generalized linear models. The TSL approach to variance estimation involves a noniterative process of five steps:
3.6.2.1 TSL Step 1
The estimator of interest is written as a function of weighted sample totals.
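Consider a weighted, combined ratio estimator of the population mean of the variable y (Kish, 1965):

$$\bar{y}_w = \frac{\sum_{h}\sum_{\alpha}\sum_{i} w_{h\alpha i}\, y_{h\alpha i}}{\sum_{h}\sum_{\alpha}\sum_{i} w_{h\alpha i}} = \frac{\sum_{h}\sum_{\alpha}\sum_{i} u_{h\alpha i}}{\sum_{h}\sum_{\alpha}\sum_{i} v_{h\alpha i}} = \frac{u}{v}$$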
Figure 3.4 Data set for example TSL, JRR, and BRR sampling variance calculations: 4 strata, 2 PSUs per stratum, ultimate clusters of 2 elements per PSU.

Stratum   PSU (Cluster)   Case   y_i   w_i
1         1               1      .58   1
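As a minimal sketch of Step 1, the weighted ratio mean can be computed as the ratio of two weighted totals, u (the sum of w times y) and v (the sum of w). The values below are hypothetical and are not the Figure 3.4 data; the function name `ratio_mean` is an assumption made for illustration.

```python
def ratio_mean(y, w):
    """Return (ybar_w, u, v) where u = sum(w*y), v = sum(w), ybar_w = u / v."""
    u = sum(wi * yi for wi, yi in zip(w, y))  # weighted total of y
    v = sum(w)                                # total of the analysis weights
    return u / v, u, v

# Hypothetical case-level values (NOT the Figure 3.4 data set).
y = [0.5, 0.25, 0.75, 1.0]
w = [1.0, 2.0, 2.0, 1.0]

ybar_w, u, v = ratio_mean(y, w)
print(ybar_w, u, v)
```

Writing the estimator this way makes explicit that both the numerator and the denominator are sample totals, each subject to sampling variability.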
© 2010 by Taylor and Francis Group, LLC
70 Applied Survey Data Analysis
Notice that the estimator of the ratio mean can be expressed as a ratio of two weighted totals, u and v, which are sums over design strata, PSUs, and individual cases of the constructed variables $u_{h\alpha i} = w_{h\alpha i} \cdot y_{h\alpha i}$ and $v_{h\alpha i} = w_{h\alpha i}$. The concept of a sample total is not limited to sums of single variables or sums of weighted values. For reasons that will be explained more fully in later sections, the individual case-level variates used to construct the sample totals for complex sample survey data may be more complex functions involving many variates and functional forms. For example, under a stratified, cluster sample design, the following estimated totals are employed in TSL variance estimation for simple linear and simple logistic regression coefficients:
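$$u = \sum_{h}\sum_{\alpha}\sum_{i} w_{h\alpha i}\, y_{h\alpha i}\, x_{h\alpha i}$$

3.6.2.2 TSL Step 2
Like many survey estimators, the weighted ratio mean is a nonlinear function of the two weighted sample totals. Consequently,

$$\operatorname{Var}(\bar{y}_w) = \operatorname{Var}\left(\frac{u}{v}\right) \neq \frac{\operatorname{Var}(u)}{\operatorname{Var}(v)}$$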
To solve the problem of the nonlinearity of the sample estimator, a standard mathematical tool, the Taylor series expansion, is used to derive an approximation to the estimator of interest, rewriting it as a linear combination of weighted sample totals:

$$\bar{y}_{w,TSL} \approx \frac{u_0}{v_0} + A\,(u - u_0) + B\,(v - v_0)$$
where A and B symbolically represent the derivatives with respect to u and v, evaluated at the expected values of the sample estimates u0 and v0.
The quadratic, cubic, and higher-order terms in the full Taylor series expansion of $\bar{y}_w$ are dropped (i.e., the remainder is assumed to be negligible).
Further, consistent (and preferably unbiased) sample estimates are generally used in place of the expected values of the sample estimates.
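The first-order expansion in Step 2 can be sketched numerically. For the ratio u/v, the derivatives evaluated at (u0, v0) are A = 1/v0 and B = -u0/v0^2. The expansion point and the perturbed totals below are hypothetical, and the function names are assumptions made for illustration.

```python
def ratio(u, v):
    """Exact (nonlinear) ratio estimator."""
    return u / v

def linearized_ratio(u, v, u0, v0):
    """First-order Taylor approximation of u/v around (u0, v0)."""
    A = 1.0 / v0          # derivative of u/v with respect to u at (u0, v0)
    B = -u0 / v0 ** 2     # derivative of u/v with respect to v at (u0, v0)
    return u0 / v0 + A * (u - u0) + B * (v - v0)

# Hypothetical expansion point and nearby sample totals.
u0, v0 = 3.5, 6.0
u, v = 3.6, 6.1

exact = ratio(u, v)
approx = linearized_ratio(u, v, u0, v0)
print(exact, approx, abs(exact - approx))
```

When the sample totals are close to the expansion point, the dropped higher-order terms are small, which is the sense in which the remainder is assumed to be negligible.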
3.6.2.3 TSL Step 3
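A standard statistical result for the variance of a linear combination (sum) is applied to obtain the approximate variance of the "linearized" form of the estimator, $\bar{y}_{w,TSL}$:

$$\operatorname{var}(\bar{y}_{w,TSL}) \approx A^2 \operatorname{var}(u) + B^2 \operatorname{var}(v) + 2AB\,\operatorname{cov}(u,v)$$

where $A = 1/v_0$ and $B = -u_0/v_0^2$ are the derivatives of $u/v$ with respect to $u$ and $v$, and $u_0$ and $v_0$ are the weighted sample totals computed from the survey data.
Therefore,

$$\operatorname{var}(\bar{y}_{w,TSL}) \approx \frac{\operatorname{var}(u) + \bar{y}_{w,TSL}^2\,\operatorname{var}(v) - 2\,\bar{y}_{w,TSL}\,\operatorname{cov}(u,v)}{v_0^2}$$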
The sampling variance of the nonlinear estimator, $\bar{y}_{w,TSL}$, is thus approximated by a simple algebraic function of quantities that can be readily computed from the complex sample survey data. The sample estimates of the ratio mean, $\bar{y}_{w,TSL}$, and the sample total of the analysis weights, $v_0$, are computed from the survey data. The estimates of var(u), var(v), and cov(u,v) are computed using the relatively simple computational formulas described in Step 4.
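A minimal sketch of the Step 3 result for the ratio mean follows, assuming the standard linearized form [var(u) + ybar^2 var(v) - 2 ybar cov(u,v)] / v0^2. The input values are hypothetical and the function name is an assumption made for illustration.

```python
def tsl_ratio_variance(ybar, v0, var_u, var_v, cov_uv):
    """Linearized variance of the ratio mean ybar = u0 / v0."""
    return (var_u + ybar ** 2 * var_v - 2.0 * ybar * cov_uv) / v0 ** 2

# Hypothetical inputs, chosen only to exercise the formula.
ybar, v0 = 0.5, 10.0
var_u, var_v, cov_uv = 4.0, 9.0, 3.0

var_ybar = tsl_ratio_variance(ybar, v0, var_u, var_v, cov_uv)

# The same value via A^2 var(u) + B^2 var(v) + 2AB cov(u,v).
u0 = ybar * v0
A, B = 1.0 / v0, -u0 / v0 ** 2
var_ybar_ab = A ** 2 * var_u + B ** 2 * var_v + 2.0 * A * B * cov_uv
print(var_ybar, var_ybar_ab)
```

The two expressions agree because substituting A = 1/v0 and B = -u0/v0^2 into the variance of the linear combination and factoring out 1/v0^2 gives the simplified form.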
3.6.2.4 TSL Step 4
Under the TSL method, the variance approximation in Step 3 has been derived for most survey estimators of interest, and software systems such as Stata and SUDAAN provide programs that permit TSL variance estimation for virtually all of the analytical methods used by today’s survey data analyst.
The sampling variances and covariances of individual weighted totals, u or v, are easily estimated using simple formulae (under an assumption of with-replacement sampling of PSUs within strata at the first stage) that require knowledge only of the subtotals for the primary stage strata and clusters:
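$$\operatorname{var}(u) = \sum_{h=1}^{H} \frac{a_h}{a_h - 1}\left[\sum_{\alpha=1}^{a_h} u_{h\alpha}^2 - \frac{u_h^2}{a_h}\right]$$

$$\operatorname{cov}(u,v) = \sum_{h=1}^{H} \frac{a_h}{a_h - 1}\left[\sum_{\alpha=1}^{a_h} u_{h\alpha} v_{h\alpha} - \frac{u_h v_h}{a_h}\right]$$

where $a_h$ is the number of sample PSUs in stratum h, and var(v) has the same form as var(u). Provided that the stratum and cluster identifiers are included in the survey data set, the weighted totals for strata and clusters are easily calculated. These calculations are illustrated here for u:

$$u_{h\alpha} = \sum_{i} u_{h\alpha i}, \qquad u_h = \sum_{\alpha=1}^{a_h} u_{h\alpha}$$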
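The Step 4 calculations can be sketched as follows, assuming the standard with-replacement (ultimate cluster) formula in which each stratum contributes a_h/(a_h - 1) times the sum of cross-products of PSU totals minus the product of stratum totals divided by a_h. The data rows are hypothetical (not the Figure 3.4 data), and the function names `psu_totals` and `wr_cov` are assumptions made for illustration.

```python
from collections import defaultdict

def psu_totals(rows):
    """Aggregate (stratum, psu, y, w) rows to PSU totals [u_ha, v_ha]."""
    totals = defaultdict(lambda: defaultdict(lambda: [0.0, 0.0]))
    for stratum, psu, y, w in rows:
        totals[stratum][psu][0] += w * y  # u_ha: weighted total of y
        totals[stratum][psu][1] += w      # v_ha: total of weights
    return totals

def wr_cov(totals, i, j):
    """With-replacement covariance of totals i and j (0 = u, 1 = v).

    Calling with i == j gives the variance. Requires at least two
    PSUs in every stratum.
    """
    est = 0.0
    for psus in totals.values():
        a_h = len(psus)
        t_i = [t[i] for t in psus.values()]
        t_j = [t[j] for t in psus.values()]
        cross = sum(x * z for x, z in zip(t_i, t_j))
        est += a_h / (a_h - 1) * (cross - sum(t_i) * sum(t_j) / a_h)
    return est

# Hypothetical data: 2 strata, 2 PSUs per stratum, 2 cases per PSU.
rows = [
    (1, 1, 0.50, 1.0), (1, 1, 0.25, 1.0),
    (1, 2, 0.75, 2.0), (1, 2, 0.50, 2.0),
    (2, 1, 1.00, 1.0), (2, 1, 0.00, 1.0),
    (2, 2, 0.50, 2.0), (2, 2, 0.25, 2.0),
]

totals = psu_totals(rows)
var_u = wr_cov(totals, 0, 0)
var_v = wr_cov(totals, 1, 1)
cov_uv = wr_cov(totals, 0, 1)
print(var_u, var_v, cov_uv)
```

Only the PSU-level totals within each stratum enter the calculation, which is why the method needs no information about the later stages of sampling within a PSU.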
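Returning to the example data set in Figure 3.4, these formulas give $\bar{y}_{w,TSL} = 0.4737$ with an estimated standard error of $se(\bar{y}_{w,TSL}) = 0.0093$.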
3.6.2.5 TSL Step 5
Confidence intervals (or hypothesis tests) based on estimated statistics, standard errors, and correct degrees of freedom based on the complex sample design are then constructed and reported as output from the TSL variance estimation program. We show this calculation for a 95% confidence interval for the population mean based on the example data set in Figure 3.4 (note that df = 8 clusters – 4 strata = 4):
$$CI(\bar{y}_{w,TSL}) = \bar{y}_{w,TSL} \pm t_{1-\alpha/2,\,df} \cdot \sqrt{\operatorname{var}_{TSL}(\bar{y}_{w,TSL})}$$

e.g.,

$$CI(\bar{y}_{w,TSL}) = 0.4737 \pm t_{1-\alpha/2,\,4} \cdot (0.0093) = 0.4737 \pm 2.7764 \cdot (0.0093) = (0.4478,\ 0.4996)$$
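The Step 5 interval can be reproduced with a short sketch. The estimate (0.4737), standard error (0.0093), and critical value (2.7764, the 0.975 quantile of a t distribution with 4 degrees of freedom) are taken from the example above; the function name is an assumption made for illustration.

```python
def t_confidence_interval(estimate, std_err, t_crit):
    """Return (lower, upper) limits of estimate +/- t_crit * std_err."""
    half_width = t_crit * std_err
    return estimate - half_width, estimate + half_width

# Values from the worked example: df = 8 clusters - 4 strata = 4.
lower, upper = t_confidence_interval(0.4737, 0.0093, 2.7764)
print(round(lower, 4), round(upper, 4))
```

Because the standard error is entered here rounded to four decimal places, the computed limits can differ from the reported interval (0.4478, 0.4996) in the fourth decimal place.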
Most contemporary software packages employ the TSL approach as the default method of computing sampling variances for complex sample survey data. TSL approximations to sampling variances have been derived for virtually all of the statistical procedures that have important applications in survey data analysis. The following Stata (Version 10) syntax illustrates the command sequence and output for an analysis of the prevalence of at least one lifetime episode of major depression in the National Comorbidity Survey Replication (NCS-R) adult survey population:
. svyset seclustr [pweight=ncsrwtsh], strata(sestrat) ///
      vce(linearized) singleunit(missing)

      pweight: ncsrwtsh
          VCE: linearized
  Single unit: missing
     Strata 1: sestrat
         SU 1: seclustr
        FPC 1: <zero>

. svy: mean mde
(running mean on estimation sample)

Survey: Mean estimation

Number of strata =      42        Number of obs    =    9282
Number of PSUs   =      84        Population size  =    9282
                                  Design df        =      42

--------------------------------------------------------------
             |             Linearized
             |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
         mde |   .1917112   .0048768      .1818694     .201553
--------------------------------------------------------------
Note that Stata explicitly reports the linearized estimate of the standard error (0.0049) of the weighted estimate of the population proportion (0.1917). These Stata commands will be explained in more detail in the upcoming chapters.