Example with NHANES Data - Methods for Improving Efficiency of Planned Missing Data Designs

The National Health and Nutrition Examination Survey (NHANES) is a research program by the National Center for Health Statistics (CDC) examining health and nutrition in the United States (Curtin et al., 2012). The survey conducts both interviews and physical examinations, including laboratory tests. We used data from NHANES to examine how effective our methods for sample selection would be in practice for a two-phase study. The variables selected to represent Phase I included demographic information on race, gender, age, marital status, and income, as well as general health information on diet, self-reported diabetes diagnosis, blood pressure, BMI, smoking status, minutes of activity, television watching, computer usage, and self-reported overall health status. Our Phase II outcome was the laboratory variable glycohemoglobin, a blood test measuring how much sugar is bound to hemoglobin. This test is used to diagnose diabetes and prediabetes. Our main interests in this study are estimating the mean of glycohemoglobin for this population and the relationship between obesity and glycohemoglobin, adjusted for our other covariates.

We chose all 30,468 NHANES participants from 2010 through 2015 with any available data on selected variables. Missing values were singly imputed through predictive mean matching, creating our pseudo-population. From this population, we first selected 3,000 individuals as our Phase I sample. In our Phase I sample, we created three propensity score classes using the predicted probability of obesity given the other selected covariates. The propensity score classes form our categorical Z variable, as described in section 2.4, and the indicator for obesity is our binary X. Afterwards, we selected 500 individuals for our Phase II sample and obtained their glycohemoglobin as our Y, which was approximately normal. We first conducted our Phase II sampling with the goal of estimating the mean of glycohemoglobin in this population. We used several different methods for selecting our Phase II sample and computed the parameter estimates and standard errors over many iterations.
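As a rough illustration of how predicted probabilities of obesity could be turned into three propensity score classes, the sketch below cuts the scores at their tertiles. The text does not state how the class boundaries were chosen, so the use of tertiles is an assumption, and the function name `propensity_classes` is hypothetical. The scores themselves are assumed to come from a model fitted beforehand to the Phase I covariates.

```python
from bisect import bisect_right
from statistics import quantiles


def propensity_classes(scores, n_classes=3):
    """Assign each predicted probability to a quantile-based class (0-indexed).

    Assumption: classes are formed by equal-frequency cut points
    (tertiles when n_classes == 3); the source does not specify this.
    """
    # quantiles() returns the n_classes - 1 interior cut points.
    cuts = quantiles(scores, n=n_classes, method="inclusive")
    # bisect_right counts how many cut points are <= the score,
    # which is the index of the class the score falls in.
    return [bisect_right(cuts, s) for s in scores]


# Hypothetical predicted probabilities of obesity for nine Phase I subjects.
example_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
example_classes = propensity_classes(example_scores)
```

The resulting class label plays the role of the categorical Z, and crossing it with the binary obesity indicator X gives six sampling strata.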

This experiment was later repeated for estimating the difference in mean glycohemoglobin between individuals with obesity and those without.

Table 2.6 displays results for estimating the population mean using multiple methods to select the Phase II sample and estimate the mean. The estimates for the population mean are similar across all methods. For the first method, Ȳ, we select our Phase II sample using simple random sampling (SRS) and then use the sample mean Ȳ as our estimate for the population mean. This method does not make use of any Phase I variables for sample selection or estimation and produces the largest standard error. Stratification by obesity selects the Phase II sample using the optimal allocation when stratifying by obesity, and then estimates the population mean through that same stratification. Using obesity on its own lowers the standard error slightly relative to Ȳ. The next method again selects the Phase II sample using SRS, but then uses obesity and the propensity score for stratification to estimate the population mean. This lowers the standard error further, relative to both Ȳ and stratifying by obesity alone. Note that when simple random sampling is used to select the Phase II sample, the stratified estimator cannot always be computed, because some strata may be empty. For this analysis, we only considered iterations in which the stratified estimator could be computed.
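The stratified estimator referred to above can be sketched as follows. This is a minimal illustration, not the exact estimator used for Table 2.6: it assumes the textbook form (stratum means weighted by population shares), ignores any finite population correction, and uses hypothetical names (`stratified_mean`, `samples`, `weights`). It also shows why an empty stratum under SRS prevents the estimator from being computed.

```python
from math import sqrt
from statistics import mean, variance


def stratified_mean(samples, weights):
    """Stratified estimate of a population mean and its standard error.

    samples: dict mapping stratum -> list of Phase II outcomes
    weights: dict mapping stratum -> population share W_h (summing to 1)
    Assumption: no finite population correction is applied.
    """
    estimate = 0.0
    var = 0.0
    for h, w in weights.items():
        y = samples.get(h, [])
        if len(y) < 2:
            # An empty (or single-observation) stratum leaves the
            # stratum mean or variance undefined.
            raise ValueError(f"stratum {h!r} has too few observations")
        estimate += w * mean(y)
        var += w ** 2 * variance(y) / len(y)
    return estimate, sqrt(var)


# Hypothetical toy data with two equally sized strata.
toy_samples = {"a": [1.0, 3.0], "b": [10.0, 14.0]}
toy_weights = {"a": 0.5, "b": 0.5}
toy_estimate, toy_se = stratified_mean(toy_samples, toy_weights)
```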

The other three methods used both obesity and the propensity score for stratification to choose the Phase II sample and estimate the population mean. The standard error from the proportional allocation was similar to that from simple random sampling followed by stratification. This is unsurprising, as SRS yields, on average, a sample size in each stratum proportional to that stratum's size. Explicitly stratifying by obesity and the propensity score to select the Phase II sample guarantees that we can always use the stratified estimator, unlike SRS. Our adaptive sample selection method outperformed proportional allocation in terms of variance, with an average standard error of 0.0355 compared with 0.0383. The average standard error from the adaptive design was close to the standard error from the optimal design (0.0350), and the optimal design can only be achieved if the population parameters are known beforehand.
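To make the contrast between proportional and optimal allocation concrete, the sketch below computes both and the resulting variance of the stratified mean. It assumes that the optimal allocation is Neyman allocation (sample sizes proportional to stratum share times stratum standard deviation, with equal sampling costs), which the text does not state explicitly. The function names and toy inputs are hypothetical, and the allocations are left fractional rather than rounded.

```python
def proportional_allocation(n, weights):
    """Allocate n units to strata in proportion to population shares W_h."""
    return {h: n * w for h, w in weights.items()}


def neyman_allocation(n, weights, sds):
    """Allocate n units in proportion to W_h * S_h.

    Assumption: equal per-unit cost in every stratum, and the stratum
    standard deviations S_h are known, which is why this design is not
    available in practice without prior information.
    """
    total = sum(weights[h] * sds[h] for h in weights)
    return {h: n * weights[h] * sds[h] / total for h in weights}


def stratified_variance(allocation, weights, sds):
    """Variance of the stratified mean for a given allocation (no FPC)."""
    return sum(weights[h] ** 2 * sds[h] ** 2 / allocation[h] for h in weights)


# Hypothetical strata: equal sizes, unequal outcome variability.
w = {"a": 0.5, "b": 0.5}
s = {"a": 1.0, "b": 3.0}
var_prop = stratified_variance(proportional_allocation(100, w), w, s)
var_opt = stratified_variance(neyman_allocation(100, w, s), w, s)
```

In this toy case the optimal allocation shifts sample toward the more variable stratum, so `var_opt` is smaller than `var_prop`. An adaptive design has to approximate the same shift using estimates of the stratum standard deviations rather than their true values.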

Table 2.7 compares several methods in estimating the difference in the average value of glycohemoglobin between individuals with obesity and those without, adjusted for the other Phase I variables through stratification on the propensity score classes. We selected the Phase II sample using either SRS, proportional allocation, an equal sample size in each stratum, our adaptive approach, or the optimal allocation. In terms of standard errors, SRS again performed similarly to proportional allocation, an equal sample size in each stratum worked much better than both of those methods, and the adaptive design performed better than all three. Our adaptive approach produced standard errors close to those obtained from the optimal design, which of course depended on unknown population parameters, and produced relatively similar estimates to the other approaches. Note that only the adaptive approaches used Bayesian estimation in these instances.
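One common way to adjust a difference in means through stratification on propensity score classes is to take the within-class difference and average it over classes using the class shares as weights. The sketch below shows that calculation. The text does not give the exact estimator behind Table 2.7, so this form, the absence of a finite population correction, and the names `adjusted_difference`, `cells`, and `class_weights` are all assumptions.

```python
from math import sqrt
from statistics import mean, variance


def adjusted_difference(cells, class_weights):
    """Propensity-class-adjusted difference in means and its standard error.

    cells: dict mapping (x, z) -> list of outcomes, where x is 1 for
           obesity and 0 otherwise, and z is the propensity score class
    class_weights: dict mapping z -> population share of class z
    Assumption: each (x, z) cell contains at least two observations.
    """
    estimate = 0.0
    var = 0.0
    for z, w in class_weights.items():
        y1, y0 = cells[(1, z)], cells[(0, z)]
        # Within-class contrast between the two obesity groups.
        estimate += w * (mean(y1) - mean(y0))
        var += w ** 2 * (variance(y1) / len(y1) + variance(y0) / len(y0))
    return estimate, sqrt(var)


# Hypothetical toy data with two propensity classes.
toy_cells = {
    (1, "lo"): [6.0, 8.0], (0, "lo"): [5.0, 7.0],
    (1, "hi"): [9.0, 11.0], (0, "hi"): [6.0, 8.0],
}
toy_diff, toy_diff_se = adjusted_difference(toy_cells, {"lo": 0.5, "hi": 0.5})
```

Because every cell contributes a variance term divided by its own sample size, a sparsely sampled cell can dominate the standard error, which is consistent with equal allocation across strata outperforming SRS and proportional allocation in Table 2.7.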

Table 2.6: Estimate and standard error for the mean of glycohemoglobin.

| Method | Estimate | Standard Error |
|---|---|---|
| Ȳ | 5.570 | 0.0403 |
| Stratification by Obesity | 5.571 | 0.0396 |
| Stratifying by Obesity and PS after SRS | 5.571 | 0.0383 |
| Proportional Allocation using Obesity and PS | 5.572 | 0.0383 |
| Adaptive Allocation using Obesity and PS | 5.568 | 0.0355 |
| Optimal Allocation using Obesity and PS | 5.571 | 0.0350 |

Table 2.7: Estimate and standard error for the difference in mean glycohemoglobin between those with obesity and those without, adjusted for propensity score classes.

| Method | Estimate | Standard Error |
|---|---|---|
| Simple Random Sampling | 0.207 | 0.1183 |
| Proportional Allocation | 0.204 | 0.1137 |
| Equal Sample Size in Each Stratum | 0.206 | 0.0819 |
| Adaptive Allocation | 0.206 | 0.0766 |
| Optimal Allocation | 0.208 | 0.0754 |
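The standard errors in Tables 2.6 and 2.7 can be put on a common scale by converting them to relative efficiencies, that is, ratios of variances. The short sketch below does this for the adaptive design against simple random sampling, using only the values reported in the tables. The helper name `relative_efficiency` is hypothetical, and reading the ratio as a sample size multiplier assumes that variance scales inversely with the Phase II sample size.

```python
def relative_efficiency(se_baseline, se_method):
    """Ratio of variances: values above 1 favor `se_method`."""
    return (se_baseline / se_method) ** 2


# Table 2.6: sample mean under SRS versus adaptive allocation.
re_mean = relative_efficiency(0.0403, 0.0355)

# Table 2.7: SRS versus adaptive allocation for the adjusted difference.
re_diff = relative_efficiency(0.1183, 0.0766)

print(f"mean: {re_mean:.2f}, difference: {re_diff:.2f}")
```

Under that assumption, the ratios come out at roughly 1.29 for the mean and 2.39 for the adjusted difference, so the gain from adaptive allocation is considerably larger for the difference in means than for the overall mean.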
