Descriptive Analysis for Continuous Variables
5.2 Special Considerations in Descriptive Analysis of Complex Sample Survey Data
5.3.4 Standard Deviations of Continuous Variables
Although experience suggests that it is not a common task, survey analysts may wish to compute an unbiased estimate of the population standard deviation of a continuous variable. Just as weights are required to obtain unbiased (or nearly unbiased) estimates of the population mean, weights must also be employed to obtain consistent estimates of the standard deviation of a random variable in a designated target population. A weighted estimator of the population standard deviation of a variable y can be written as follows:
s_y = \sqrt{\frac{\sum_{i=1}^{n} w_i (y_i - \bar{y}_w)^2}{\sum_{i=1}^{n} w_i - 1}} (5.9)
In Equation 5.9, the estimate of the population mean for the variable y is calculated as in Equation 5.6. Stata currently does not include an explicit command for estimation of these population standard deviations. Users of the SAS software can use PROC UNIVARIATE (with the VARDEF=DF option) or PROC MEANS with a WEIGHT statement to generate this weighted estimate of the population standard deviation S_y.
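The calculation can be illustrated outside of any survey package. The Python sketch below is only an illustration: it assumes the divisor is the sum of the weights minus one, and the function name weighted_std and its arguments are hypothetical rather than part of any software discussed here.

```python
import math


def weighted_std(y, w):
    """Weighted estimate of a population standard deviation.

    Assumption: the divisor is (sum of weights - 1). Other divisors
    (for example n - 1) are used by some software options.
    """
    if len(y) != len(w):
        raise ValueError("y and w must have the same length")
    sum_w = sum(w)
    # Weighted estimate of the population mean.
    mean_w = sum(wi * yi for yi, wi in zip(y, w)) / sum_w
    # Weighted sum of squared deviations from the weighted mean.
    ss = sum(wi * (yi - mean_w) ** 2 for yi, wi in zip(y, w))
    return math.sqrt(ss / (sum_w - 1.0))
```

With all weights equal to 1, this reduces to the usual sample standard deviation with divisor n - 1.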
5.3.5 Estimation of Percentiles and Medians of Population Distributions
Estimation of quantiles, such as the median (Q50) or the 95th percentile (Q95) of the population distribution of a continuous variable, can play an important role in analyses of survey data. A sociologist may wish to compare sample estimates of percentiles of household income for a regional survey population to nationally defined poverty criteria. An epidemiologist may wish to estimate the 95th percentile of prostate-specific antigen (PSA) levels in a metropolitan sample of men over the age of 40.
The ungrouped method of quantile estimation (Loomis, Richardson, and Elliott, 2005) builds on results related to the weighted estimator for totals presented earlier in this chapter (see Section 5.3.2), employing a weighted sample estimate of the population cumulative distribution function (CDF) of a survey variable. Specifically, the CDF for a variable y in a given population of size N is defined as follows:
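F(x) = \frac{1}{N} \sum_{i=1}^{N} I(y_i \le x) (5.10)
where I(y_i \le x) is an indicator variable equal to 1 if y_i is less than or equal to a specified value of x, and 0 otherwise. The weighted estimator of the CDF from a complex sample of size n from this population is then written as follows:
\hat{F}(x) = \frac{\sum_{i=1}^{n} w_i I(y_i \le x)}{\sum_{i=1}^{n} w_i} (5.11)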
© 2010 by Taylor and Francis Group, LLC
132 Applied Survey Data Analysis
The q-th quantile (e.g., q = 0.5 for the median) of a variable y is the smallest value of y such that the population CDF is greater than or equal to q. For example, the median would be the smallest value of y at which the CDF is greater than or equal to 0.5. The ungrouped method of estimating a quantile first considers the order statistics (the sample values of y ordered from smallest to largest), denoted by x_1, …, x_n, and finds the value of j (j = 1, …, n) such that
\hat{F}(x_j) \le q < \hat{F}(x_{j+1}) (5.12)
Then, the estimate of the q-th population quantile Xq is calculated as follows:
\hat{X}_q = x_j + \frac{q - \hat{F}(x_j)}{\hat{F}(x_{j+1}) - \hat{F}(x_j)} (x_{j+1} - x_j) (5.13)
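The steps of the ungrouped method can be sketched in Python. This is a minimal illustration, not the algorithm as implemented in SUDAAN or WesVar: it assumes the estimate is obtained by linear interpolation between the two order statistics that bracket q in Equation 5.12, it collapses tied values before computing the CDF, and it returns the smallest or largest observed value when q falls outside the range of the estimated CDF. The function names weighted_cdf and weighted_quantile are hypothetical.

```python
from bisect import bisect_right
from itertools import accumulate


def weighted_cdf(y, w):
    """Return the distinct sorted values of y and the weighted CDF estimate at each."""
    totals = {}
    for yi, wi in zip(y, w):
        totals[yi] = totals.get(yi, 0.0) + wi  # pool weights of tied values
    xs = sorted(totals)
    cum = list(accumulate(totals[x] for x in xs))
    sum_w = cum[-1]
    return xs, [c / sum_w for c in cum]


def weighted_quantile(y, w, q):
    """Interpolated quantile estimate (assumed linear interpolation)."""
    if not 0.0 < q < 1.0:
        raise ValueError("q must lie strictly between 0 and 1")
    xs, cdf = weighted_cdf(y, w)
    # j is the largest index with cdf[j] <= q, so that cdf[j] <= q < cdf[j + 1].
    j = bisect_right(cdf, q) - 1
    if j < 0:
        return xs[0]   # assumption: q lies below the first CDF step
    if j == len(xs) - 1:
        return xs[-1]  # assumption: no larger order statistic to interpolate toward
    frac = (q - cdf[j]) / (cdf[j + 1] - cdf[j])
    return xs[j] + frac * (xs[j + 1] - xs[j])
```

For example, with values 1, 2, 3 and weights 1, 2, 1, the estimated CDF is 0.25, 0.75, 1.0, so q = 0.5 falls halfway between the first two steps and the interpolated estimate lies halfway between 1 and 2.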
Kovar, Rao, and Wu (1988) report results of a simulation study suggesting that BRR performs well for variance estimation and construction of confidence intervals when working with estimators of nonsmooth functions like quantiles. The WesVar PC software currently implements the BRR variance estimation approach for quantiles, and we recommend the use of this variance estimation approach for estimated quantiles in practice. Variance estimates for the estimated quantile in Equation 5.13 can also be computed using Taylor series linearization (Binder, 1991). The SUDAAN software currently uses the linearized variance estimator. The JRR approach to variance estimation is known to be badly biased for these types of estimators (Miller, 1974), but modifications to the jackknife approach addressing this problem have been developed (Shao and Wu, 1989).
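To make the replication idea concrete, the following Python sketch computes a basic BRR variance estimate as the average squared deviation of the replicate estimates from the full-sample estimate. It is a toy illustration under stated assumptions: the replicate weights are made up (real BRR replicates are formed from half-samples of the sample design), the quantile estimator is a simple step-function version without interpolation, and the names step_quantile and brr_variance are hypothetical.

```python
import math


def step_quantile(y, w, q):
    """Smallest y whose weighted CDF estimate is >= q (no interpolation)."""
    pairs = sorted(zip(y, w))
    total = sum(w)
    cum = 0.0
    for yi, wi in pairs:
        cum += wi
        if cum >= q * total:
            return yi
    return pairs[-1][0]


def brr_variance(estimator, y, full_weights, replicate_weights):
    """Mean squared deviation of the replicate estimates from the full-sample estimate."""
    theta_full = estimator(y, full_weights)
    replicate_estimates = [estimator(y, rw) for rw in replicate_weights]
    squared_devs = [(t - theta_full) ** 2 for t in replicate_estimates]
    return sum(squared_devs) / len(replicate_estimates)


# Hypothetical toy data: four observations and two half-sample replicates,
# each doubling the weights of the retained cases and zeroing the others.
y = [1.0, 2.0, 3.0, 4.0]
full_w = [1.0, 1.0, 1.0, 1.0]
rep_w = [[2.0, 0.0, 2.0, 0.0], [0.0, 2.0, 0.0, 2.0]]


def median(values, weights):
    return step_quantile(values, weights, 0.5)


var_median = brr_variance(median, y, full_w, rep_w)
se_median = math.sqrt(var_median)
```

Because the estimator is passed in as a function, the same routine could be applied to an interpolated quantile estimator or to a weighted mean.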
Example 5.8: Estimating Population Quantiles for Total Household Assets Using the HRS Data
This example considers the total household assets variable collected from the 2006 HRS sample and aims to estimate the 0.25 quantile, the median, and the 0.75 quantile of household assets in the HRS target population. The SUDAAN software is used in this example because Stata (Version 10) does not currently support procedures specifically dedicated to the estimation of quantiles (and their standard errors) in complex sample survey data sets. The following SUDAAN code generates the quantile estimates and standard errors, using an unconditional subclass analysis approach:
proc descript ;
nest stratum secu ;
weight kwgthh ;
subpopn finr = 1 ;
var h8atota ;
percentiles 25 75 / median ;
setenv decwidth = 1 ;
run ;
Table 5.1 summarizes the results provided in the SUDAAN output and compares the estimates with those generated using the BRR approach to variance estimation in the WesVar PC software. The estimated median of the total household assets for the HRS target population is $183,309. The estimate of the mean total household assets from Example 5.7 was $527,313, suggesting that the distribution of total household assets is highly skewed to the higher dollar value ranges.
The analysis of quantiles of the distribution of total household assets for HRS households was repeated using the WesVar PC software (readers are referred to the ASDA Web site for the menu steps needed to perform this analysis in WesVar).
WesVar allows for the use of BRR to estimate the standard errors of estimated quantiles. From the side-by-side comparison in Table 5.1, WesVar's weighted estimates of the quantiles agree exactly with those reported by SUDAAN. However, WesVar's BRR estimates of the corresponding standard errors differ slightly from the TSL standard errors computed by SUDAAN, as expected. The resulting inferences about the population quantiles would not differ substantially in this example.
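The size of the disagreement between the two sets of standard errors can be checked with simple arithmetic. The Python sketch below uses only the values reported in Table 5.1; the dictionary layout and the function name relative_se_difference are hypothetical conveniences for the illustration.

```python
# Estimates and standard errors copied from Table 5.1
# (2006 HRS total household assets, in dollars).
table_5_1 = {
    "Q25": {"estimate": 39852, "se_tsl": 3167, "se_brr": 3249},
    "Q50": {"estimate": 183309, "se_tsl": 10233, "se_brr": 9978},
    "Q75": {"estimate": 495931, "se_tsl": 17993, "se_brr": 17460},
}


def relative_se_difference(se_tsl, se_brr):
    """Percent difference of the BRR standard error relative to the TSL one."""
    return 100.0 * (se_brr - se_tsl) / se_tsl


differences = {
    name: relative_se_difference(row["se_tsl"], row["se_brr"])
    for name, row in table_5_1.items()
}

for name, diff in differences.items():
    print(f"{name}: BRR vs. TSL standard error differs by {diff:+.1f}%")
```

For all three quantiles the BRR and TSL standard errors differ by less than 3 percent in absolute value, which is consistent with the statement that inferences would not differ substantially.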
The DESCRIPT procedure in the SUDAAN software can also be used to estimate quantiles for subpopulations of interest. For example, the same quantiles could be estimated for the subpopulation of adults age 75 and older in the HRS population using the following syntax:
proc descript ;
nest stratum secu ;
weight kwgthh ;
subpopn kage > 74 & finr = 1 ;
var h8atota ;
percentiles 25 75 / median ;
setenv decwidth = 1 ;
run ;
Note the use of the SUBPOPN statement to identify that the estimate is based on respondents who are 75 years of age and older and are the financial reporter for their HRS household unit. The estimated quantile values and standard errors generated by this subpopulation analysis are \hat{Q}_{25,75+} = $40,329.4 ($4,434.8); \hat{Q}_{50,75+} = $177,781.3 ($11,142.9); and \hat{Q}_{75,75+} = $461,308.3 ($27,478.0).
Table 5.1
Estimation of Percentiles of the Distribution of 2006 HRS Total Household Assets
Percentile      SUDAAN (TSL)                  WesVar PC (BRR)
                \hat{Q}_p   se(\hat{Q}_p)     \hat{Q}_p   se(\hat{Q}_p)
Q25             $39,852     $3,167            $39,852     $3,249
Q50 (Median)    $183,309    $10,233           $183,309    $9,978
Q75             $495,931    $17,993           $495,931    $17,460
5.4 Bivariate Relationships between