Preparation for Complex Sample Survey Data Analysis
4.4 Addressing Item Missing Data in Analysis Variables
4.4.1 Potential Bias Due to Ignoring Missing Data
Many analysts simply choose to ignore the missing data problem when performing analyses. If the rates of missing data on key analysis variables are very low (say, less than 1–2% of cases), the penalty for not taking active steps (i.e., weighting or imputation) to address the missing data is probably small.
Consider a univariate analysis to estimate the population mean of a variable y. Under a simple deterministic assumption that the population is composed of “responders” and “nonresponders,” the expected bias in the respondent mean of the variable y is defined as follows:
$$\mathrm{Bias}(\bar{Y}_R) = \bar{Y}_R - \bar{Y} = P_{NR} \times (\bar{Y}_R - \bar{Y}_{NR}) \qquad (4.1)$$
where $\bar{Y}$ is the true population mean, $\bar{Y}_R$ is the population mean for responders, $\bar{Y}_{NR}$ is the population mean for nonresponders, and $P_{NR}$ is the expected proportion of nonrespondents in a given sample. The second equality follows because the population mean is a weighted average of the two group means, $\bar{Y} = (1 - P_{NR})\bar{Y}_R + P_{NR}\bar{Y}_{NR}$. For the bias to be large, the rate of missing data must be sizeable, and respondents must differ in their characteristics from nonrespondents. In statistical terms, the potential bias due to missing data depends on both the missing data pattern and the missing data mechanism.
Chapter 11 will provide a more detailed review of missing data patterns and mechanisms.
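As a purely illustrative calculation with Equation 4.1 (the numbers below are hypothetical and are not taken from any survey), suppose that 14% of the population are nonresponders, that the mean of y is 70 among responders, and that it is 74 among nonresponders. Writing $P_{NR}$ for the nonresponder proportion and $\bar{Y}_R$, $\bar{Y}_{NR}$ for the two group means:

$$\mathrm{Bias}(\bar{Y}_R) = P_{NR} \times (\bar{Y}_R - \bar{Y}_{NR}) = 0.14 \times (70 - 74) = -0.56$$

With the same difference in means but only 1% nonresponders, the bias shrinks to $0.01 \times (70 - 74) = -0.04$, which is consistent with the earlier remark that very low rates of missing data carry a small penalty.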
4.4.2 Exploring Rates and Patterns of Missing Data Prior to Analysis
In Stata, the mvpatterns command displays the patterns and rates of missing data across a set of variables that will be included in an analysis. The following example uses this command to explore the missing data for seven 2005–2006 NHANES measures (eight variables, since age enters as both AGEC and AGECSQ) that will be included in the multiple imputation regression example presented in Section 11.7.2. The variables are diastolic blood pressure (BPXDI1_1), marital status (MARCAT), gender (RIAGENDR), race/ethnicity (RIDRETH1), age (AGEC and AGECSQ), body mass index (BMXBMI), and family poverty index (INDFMPIR). The command syntax to request the missing data summary for this set of variables is as follows:
mvpatterns bpxdi1_1 marcat riagendr ridreth1 agec agecsq ///
bmxbmi indfmpir
The Stata output produced by the mvpatterns command is provided next. The first portion of the output lists the number of observed and missing values for each variable that has at least one missing value. Because the 2005–2006 NHANES data have no missing values for age, gender, and race/ethnicity, these variables are named only in the opening line of the output and do not appear in the variable table. The second portion of the output summarizes the frequencies of the various patterns of missing data across the four variables with missing data, using a coding of “+” for observed and “.” for missing; the symbols in each pattern follow the order of the variables in the first portion (bpxdi1_1, marcat, bmxbmi, indfmpir). For example, 4,308 observations have no missing data for these four variables, and a total of 49 observations have missing data only for the BMXBMI (body mass index) variable (pattern “++.+”).
Variables with No mv’s: riagendr ridreth1 agec agecsq

Variable   type    obs    mv    variable label
bpxdi1_1   int     4581   753   diastolic bp
marcat     byte    5329   5     1=married 2=prev married 3=never married
bmxbmi     float   5237   97    body mass index (kg/m**2)
indfmpir   float   5066   268   family pir
Patterns of Missing Values

_pattern   _mv   _freq
++++       0     4308
.+++       1     666
+++.       1     217
++.+       1     49
.++.       2     41
.+.+       2     39
.+..       3     5
++..       2     4
+.++       1     3
..++       2     1
..+.       3     1

© 2010 by Taylor and Francis Group, LLC
Applied Survey Data Analysis
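For readers who do not have mvpatterns installed, a similar summary can be sketched with the official misstable command. This is an alternative to the command used in the text, not the authors' method; it assumes Stata 11 or later and the same variable names as in the example above.

```stata
* Counts of missing values; variables with no missing values are omitted by default
misstable summarize bpxdi1_1 marcat riagendr ridreth1 agec agecsq ///
    bmxbmi indfmpir

* Missing data patterns, reported as frequencies rather than percentages
misstable patterns bpxdi1_1 marcat riagendr ridreth1 agec agecsq ///
    bmxbmi indfmpir, freq

* Item-missing rate for diastolic blood pressure, using the counts shown above
display 753 / (4581 + 753)
```

The last line expresses the 753 missing diastolic blood pressure values as a share of all 5,334 cases, or roughly 14%, which is well above the 1–2% range in which ignoring missing data is likely to be harmless.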
With recent advances in the theory of statistical analysis with missing data (Little and Rubin, 2002) and today’s improved software (e.g., Carlin, Galati, and Royston, 2008; Raghunathan et al., 2001), data producers and data users often consider different methods for the imputation (or prediction) of missing data values. Depending on the patterns of missing data and the underlying process that generated the missingness, statistically sound imputation strategies can produce data sets with all complete cases for analysis.
In many large survey programs, the data producer may perform imputation for key variables before the survey data set is released for general use.
Typically, methods for single imputation are employed using techniques such as hot deck imputation, regression imputation, and predictive mean matching. Some survey programs may choose to provide data users with multiply imputed data sets (Kennickell, 1998; Schafer, 1996). When data producers perform imputation of item-missing data, a best practice in data dissemination is to provide data users with both the imputed version of the variable (e.g., I_INCOME_AMT) and an indicator variable (e.g., I_INCOME_FLG) that identifies the values of the variables that have been imputed. Data users should expect to find general documentation for the imputations in the technical report for the survey. The survey codebook should clearly identify the imputed variables and the imputation “flag” variables.
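When a data set does include an imputation flag, a quick descriptive check of how many values were imputed, and how they compare with reported values, might look like the following sketch. The variable names are the hypothetical ones used in the text, and the coding of the flag (1 = imputed, 0 = reported) is an assumption that should be confirmed in the survey codebook.

```stata
* Hypothetical variables: i_income_amt (imputed version), i_income_flg (flag)
* Assumed coding: i_income_flg == 1 if the value was imputed, 0 if reported

* How many values were imputed?
tabulate i_income_flg, missing

* Unweighted comparison of reported and imputed values
summarize i_income_amt if i_income_flg == 0
summarize i_income_amt if i_income_flg == 1
```

These are unweighted summaries intended only for checking the data; they are not design-based estimates.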
When imputations are not provided by the data producer (or the analyst chooses to employ his or her own method and model), the task of imputing item missing data will fall to the data user. Details on practical methods for user imputation of item-missing data are provided later in Chapter 11.
4.5 Preparing to Analyze Data for Sample Subpopulations
Analysis of complex sample survey data sets often involves separate estimation and inference for subpopulations or subclasses of the full population.
For example, an analyst may wish to use NHANES data to estimate the prevalence of diabetes separately for male and female adults or compute HRS estimates of retirement expectations only for the population of the U.S. Census South Region. An NCS-R analyst may choose to estimate a separate logistic regression model for predicting the probability of past-year depression status for African American women only. When analyses are focused in this way on these specific subclasses of the survey population, special care must be taken to correctly prepare the data and specify the subclass analysis in the command input to software programs. Proper analysis methods for subclasses of survey data have been well established in the survey methodology literature (Cochran, 1977; Fuller et al., 1989; Kish, 1965; Korn and Graubard, 1999; Lohr, 1999; Rao, 2003), and interested readers can consult these references for more general information on estimation of survey statistics and related variance estimation techniques for subclasses. A short summary of the theory underlying subclass analysis of survey data is also provided in Theory Box 4.2.
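As a minimal sketch of what specifying a subclass analysis in the command input can look like in Stata, the following uses the subpop() option of the svy prefix with an indicator variable for the subclass. The design variable names (sdmvpsu, sdmvstra, wtmec2yr) and the coding riagendr == 2 for females are assumptions about the 2005–2006 NHANES file and should be checked against the documentation; the outcome bpxdi1_1 is borrowed from the earlier missing data example purely for illustration.

```stata
* Declare the complex sample design (variable names are assumptions)
svyset sdmvpsu [pweight = wtmec2yr], strata(sdmvstra)

* Indicator for the subclass of interest: 1 = female, 0 = male (assumed coding)
generate byte female = (riagendr == 2) if !missing(riagendr)

* Estimate the subclass mean; cases outside the subclass stay in the data set
svy, subpop(female): mean bpxdi1_1
```

The design choice here is to identify the subclass with an indicator rather than to delete cases outside it, so that the full sample design remains available to the variance estimation routine.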