Handling Attrition in Longitudinal Studies: The Case for Refreshment Samples

(1)

2013, Vol. 28, No. 2, 238–256 DOI:10.1214/13-STS414

©Institute of Mathematical Statistics, 2013

Handling Attrition in Longitudinal Studies:

The Case for Refreshment Samples

Yiting Deng, D. Sunshine Hillygus, Jerome P. Reiter, Yajuan Si and Siyu Zheng

Abstract. Panel studies typically suffer from attrition, which reduces sam-ple size and can result in biased inferences. It is impossible to know whether or not the attrition causes bias from the observed panel data alone. Refresh-ment samples—new, randomly sampled respondents given the questionnaire at the same time as a subsequent wave of the panel—offer information that can be used to diagnose and adjust for bias due to attrition. We review and bolster the case for the use of refreshment samples in panel studies. We in-clude examples of both a fully Bayesian approach for analyzing the con-catenated panel and refreshment data, and a multiple imputation approach for analyzing only the original panel. For the latter, we document a positive bias in the usual multiple imputation variance estimator. We present models appropriate for three waves and two refreshment samples, including nonter-minal attrition. We illustrate the three-wave analysis using the 2007–2008 Associated Press–Yahoo! News Election Poll.

Key words and phrases: Attrition, imputation, missing, panel, survey.

1. INTRODUCTION

Many of the the major ongoing government or government-funded surveys have panel components in-cluding, for example, in the U.S., the American Na-tional Election Study (ANES), the General Social Sur-vey (GSS), the Panel SurSur-vey on Income Dynamics (PSID) and the Current Population Survey (CPS). De-spite the millions of dollars spent each year to col-lect high quality data, analyses using panel data are inevitably threatened by panel attrition (Lynn, 2009), that is, some respondents in the sample do not par-ticipate in later waves of the study because they can-Yiting Deng is Ph.D. Candidate, Fuqua School of Business, Duke University, Box 90120, Durham, North Carolina 27708, USA (e-mail:[email protected]). D. Sunshine Hillygus is Associate Professor, Department of Political Science, Duke University, Box 90204, Durham, North Carolina 27708, USA (e-mail:[email protected]). Jerome P. Reiter is Professor and Siyu Zheng is B.S. Alumnus, Department of Statistical Science, Duke University, Box 90251, Durham, North Carolina 27708, USA (e-mail: [email protected];[email protected]). Yajuan Si is Postdoctoral Associate, Applied Statistical Center, Columbia University, 1255 Amsterdam Avenue, New York, New York 10027, USA (e-mail:[email protected]).

not be located or refuse participation. For instance, the multiple-decade PSID, first fielded in 1968, lost nearly 50 percent of the initial sample members by 1989 due to cumulative attrition and mortality. Even with a much shorter study period, the 2008–2009 ANES Panel Study, which conducted monthly interviews over the course of the 2008 election cycle, lost 36 percent of respondents in less than a year.

At these rates, which are not atypical in large-scale panel studies, attrition can have serious impacts on analyses that use only respondents who completed all waves of the survey. At best, attrition reduces effec-tive sample size, thereby decreasing analysts’ abilities to discover longitudinal trends in behavior. At worst, attrition results in an available sample that is not repre-sentative of the target population, thereby introducing potentially substantial biases into statistical inferences. It is not possible for analysts to determine the degree to which attrition degrades complete-case analyses by using only the collected data; external sources of infor-mation are needed.

One such source is refreshment samples. A refresh-ment sample includes new, randomly sampled respon-dents who are given the questionnaire at the same time as a second or subsequent wave of the panel. Many

(2)

of the large panel studies now routinely include re-freshment samples. For example, most of the longer longitudinal studies of the National Center for Educa-tion Statistics, including the Early Childhood Longi-tudinal Study and the National Educational Longitu-dinal Study, freshened their samples at some point in the study, either adding new panelists or as a separate cross-section. The National Educational Longitudinal Study, for instance, followed 21,500 eighth graders in two-year intervals until 2000 and included refreshment samples in 1990 and 1992. It is worth noting that by the final wave of data collection, just 50% of the original sample remained in the panel. Overlapping or rotating panel designs offer the equivalent of refreshment sam-ples. In such designs, the sample is divided into dif-ferent cohorts with staggered start times such that one cohort of panelists completes a follow-up interview at the same time another cohort completes their baseline interview. So long as each cohort is randomly selected and administered the same questionnaire, the baseline interview of the new cohort functions as a refreshment sample for the old cohort. Examples of such rotating panel designs include the GSS and the Survey of In-come and Program Participation.

Refreshment samples provide information that can be used to assess the effects of panel attrition and to correct for biases via statistical modeling (Hirano et al., 1998). However, they are infrequently used by analysts or data collectors for these tasks. In most cases, attrition is simply ignored, with the analysis run only on those respondents who completed all waves of the study (e.g., Jelici´c, Phelps and Lerner, 2009), perhaps with the use of post-stratification weights (Vandecasteele and Debels, 2007). This is done despite widespread recognition among subject matter experts about the potential problems of panel attrition (e.g.,

Ahern and Le Brocque, 2005).

In this article, we review and bolster the case for the use of refreshment samples in panel studies. We begin in Section 2 by briefly describing existing ap-proaches for handling attrition that do not involve re-freshment samples. In Section 3 we present a hypo-thetical two-wave panel to illustrate how refreshment samples can be used to remove bias from nonignorable attrition. In Section 4 we extend current models for refreshment samples, which are described exclusively with two-wave settings in the literature, to models for three waves and two refreshment samples. In doing so, we discuss modeling nonterminal attrition in these set-tings, which arises when respondents fail to respond to one wave but return to the study for a subsequent one.

In Section5we illustrate the three-wave analysis using the 2007–2008 Associated Press–Yahoo! News Elec-tion Poll (APYN), which is a panel study of the 2008 U.S. Presidential election. Finally, in Section6we dis-cuss some limitations and open research issues in the use of refreshment samples.

2. PANEL ATTRITION IN LONGITUDINAL STUDIES

Fundamentally, panel attrition is a problem of nonre-sponse, so it is useful to frame the various approaches to handling panel attrition based on the assumed miss-ing data mechanisms (Rubin, 1976;Little and Rubin, 2002). Often researchers ignore panel attrition entirely and use only the available cases for analysis, for ex-ample, listwise deletion to create a balanced subpanel (e.g., Bartels, 1993; Wawro, 2002). Such approaches assume that the panel attrition is missing completely at random (MCAR), that is, the missingness is indepen-dent of observed and unobserved data. We speculate that this is usually assumed for convenience, as often listwise deletion analyses are not presented with em-pirical justification of MCAR assumptions. To the ex-tent that diagnostic analyses of MCAR assumptions in panel attrition are conducted, they tend to be reported and published separately from the substantive research (e.g., Zabel, 1998;Fitzgerald, Gottschalk and Moffitt, 1998;Bartels, 1999;Clinton, 2001;Kruse et al., 2009), so that it is not clear if and how the diagnostics influ-ence statistical model specification.

Considerable research has documented that some in-dividuals are more likely to drop out than others (e.g.,

Behr, Bellgardt and Rendtel, 2005;Olsen, 2005), mak-ing listwise deletion a risky analysis strategy. Many analysts instead assume that the data are missing at ran-dom (MAR), that is, missingness depends on observed, but not unobserved, data. One widely used MAR ap-proach is to adjust survey weights for nonresponse, for example, by using post-stratification weights provided by the survey organization (e.g., Henderson, Hillygus and Tompson, 2010). Re-weighting approaches assume that dropout occurs randomly within weighting classes defined by observed variables that are associated with dropout.

Although re-weighting can reduce bias introduced by panel attrition, it is not a fail-safe solution. There is wide variability in the way weights are constructed and in the variables used. Nonresponse weights are of-ten created using demographic benchmarks, for exam-ple, from the CPS, but demographic variables alone are unlikely to be adequate to correct for attrition

(3)

(Vandecasteele and Debels, 2007). As is the case in other nonresponse contexts, inflating weights can re-sult in increased standard errors and introduce instabil-ities due to particularly large or small weights (Lohr, 1998;Gelman, 2007).

A related MAR approach uses predicted probabil-ities of nonresponse, obtained by modeling the re-sponse indicator as a function of observed variables, as inverse probability weights to enable inference by generalized estimating equations (e.g., Robins and Rotnitzky, 1995; Robins, Rotnitzky and Zhao, 1995;

Scharfstein, Rotnitzky and Robins, 1999;Chen, Yi and Cook, 2010). This potentially offers some robustness to model misspecification, at least asymptotically for MAR mechanisms, although inferences can be sen-sitive to large weights. One also can test whether or not parameters differ significantly due to attrition for cases with complete data and cases with incomplete data (e.g.,Diggle, 1989;Chen and Little, 1999;Qu and Song, 2002;Qu et al., 2011), which can offer insight into the appropriateness of the assumed MAR mecha-nism.

An alternative approach to re-weighting is single im-putation, a method often applied by statistical agen-cies in general item nonresponse contexts (Kalton and Kasprzyk, 1986). Single imputation methods replace each missing value with a plausible guess, so that the full panel can be analyzed as if their data were complete. Although there are a wide range of single imputation methods (hot deck, nearest neighbor, etc.) that have been applied to missing data problems, the method most specific to longitudinal studies is the last-observation-carried-forward approach, in which an in-dividual’s missing data are imputed to equal his or her response in previous waves (e.g.,Packer et al., 1996). Research has shown that this approach can introduce substantial biases in inferences (e.g., see Daniels and Hogan, 2008).

Given the well-known limitations of single imputa-tion methods (Little and Rubin, 2002), multiple im-putation (see Section3) also has been used to handle missing data from attrition (e.g., Pasek et al., 2009;

Honaker and King, 2010). As with the majority of available methods used to correct for panel attrition, standard approaches to multiple imputation assume an ignorable missing data mechanism. Unfortunately, it is often expected that panel attrition is not missing at ran-dom (NMAR), that is, the missingness depends on un-observed data. In such cases, the only way to obtain unbiased estimates of parameters is to model the miss-ingness. However, it is generally impossible to know

the appropriate model for the missingness mechanism from the panel sample alone (Kristman, Manno and Cote, 2005; Basic and Rendtel, 2007; Molenberghs et al., 2008).

Another approach, absent external data, is to han-dle the attrition directly in the statistical models used for longitudinal data analysis (Verbeke and Molen-berghs, 2000;Diggle et al., 2002; Fitzmaurice, Laird and Ware, 2004;Hedeker and Gibbons, 2006;Daniels and Hogan, 2008). Here, unlike with other approaches, much research has focused on methods for handling nonignorable panel attrition. Methods include vari-ants of both selection models (e.g., Hausman and Wise, 1979; Siddiqui, Flay and Hu, 1996; Kenward, 1998;Scharfstein, Rotnitzky and Robins, 1999;Vella and Verbeek, 1999; Das, 2004; Wooldridge, 2005;

Semykina and Wooldridge, 2010) and pattern mixture models (e.g.,Little, 1993;Kenward, Molenberghs and Thijs, 2003;Roy, 2003; Lin, McCulloch and Rosen-heck, 2004; Roy and Daniels, 2008). These model-based methods have to make untestable and typically strong assumptions about the attrition process, again because there is insufficient information in the original sample alone to learn the missingness mechanism. It is therefore prudent for analysts to examine how sen-sitive results are to different sets of assumptions about attrition. We note that Rotnitzky, Robins and Scharf-stein (1998) and Scharfstein, Rotnitzky and Robins

(1999) suggest related sensitivity analyses for estimat-ing equations with inverse probability weightestimat-ing.

3. LEVERAGING REFRESHMENT SAMPLES

Refreshment samples are available in many panel studies, but the way refreshment samples are currently used with respect to panel attrition varies widely. Ini-tially, refreshment samples, as the name implies, were conceived as a way to directly replace units who had dropped out (Ridder, 1992). The general idea of using survey or field substitutes to correct for nonresponse dates to some of the earliest survey methods work (Kish and Hess, 1959). Research has shown, however, that respondent substitutes are more likely to resemble respondents rather than nonrespondents, potentially in-troducing bias without additional adjustments (Lin and Schaeffer, 1995; Vehovar, 1999; Rubin and Zanutto, 2001; Dorsett, 2010). Also potentially problematic is when refreshment respondents are simply added to the analysis to boost the sample size, while the attrition process of the original respondents is disregarded (e.g.,

(4)

et al., 2006). In recent years, however, it is most com-mon to see refreshment samples used to diagnose panel attrition characteristics in an attempt to justify an ig-norable missingness assumption or as the basis for dis-cussion about potential bias in the results, without us-ing them for statistical correction of the bias (e.g.,Frick et al., 2006;Kruse et al., 2009).

Refreshment samples are substantially more power-ful than suggested by much of their previous use. Re-freshment samples can be mined for information about the attrition process, which in turn facilitates adjust-ment of inferences for the missing data (Hirano et al.,

1998, 2001;Bartels, 1999). For example, the data can be used to construct inverse probability weights for the cases in the panel (Hirano et al., 1998; Nevo, 2003), an approach we do not focus on here. They also offer information for model-based methods and multiple im-putation (Hirano et al., 1998), which we now describe and illustrate in detail.

3.1 Model-Based Approaches

Existing model-based methods for using refreshment samples (Hirano et al., 1998;Bhattacharya, 2008) are based on selection models for the attrition process. To our knowledge, no one has developed pattern mixture models in the context of refreshment samples, thus, in what follows we only discuss selection models. To il-lustrate these approaches, we use the simple example also presented by Hirano et al. (1998, 2001), which is illustrated graphically in Figure1. Consider a two-wave panel of NP subjects that includes a refreshment sample of NR new subjects during the second wave. Let Y1and Y2be binary responses potentially available

in wave 1 and wave 2, respectively. For the original panel, suppose that we know Y1 for all NP subjects

FIG. 1. Graphical representation of the two-wave model. Here, X represents variables available on everyone.

and that we know Y2 only for NCP< NP subjects due to attrition. We also know Y2 for the NR units in the refreshment sample, but by design we do not know Y1

for those units. Finally, for all i, let W1i= 1 if subject

iwould provide a value for Y2if they were included in

wave 1, and let W1i= 0 if subject i would not provide

a value for Y2if they were included in wave 1. We note

that W1iis observed for all i in the original panel but is

missing for all i in the refreshment sample, since they were not given the chance to respond in wave 1.

The concatenated data can be conceived as a par-tially observed, three-way contingency table with eight cells. We can estimate the joint probabilities in four of these cells from the observed data, namely, P (Y1=

y1, Y2= y2, W1= 1) for y1, y2∈ {0, 1}. We also have

the following three independent constraints involving the cells not directly observed:

1− y1,y2 P (Y1= y1, Y2= y2, W1= 1) = y1,y2 P (Y1= y1, Y2= y2, W1= 0), P (Y1= y1, W1= 0) = y2 P (Y1= y1, Y2= y2, W1= 0), P (Y2= y2)− P (Y2= y2, W1= 1) = y1 P (Y1= y1, Y2= y2, W1= 0).

Here, all quantities on the left-hand side of the equa-tions are estimable from the observed data. The system of equations offers seven constraints for eight cells, so that we must add one constraint to identify all the joint probabilities.

Hirano et al. (1998, 2001) suggest characterizing the joint distribution of (Y1, Y2, W1) via a chain of

ditional models, and incorporating the additional con-straint within the modeling framework. In this context, they suggested letting

Y1i∼ Ber(π1i), (1) logit(π1i)= β0, Y2i|Y1i∼ Ber(π2i), (2) logit(π2i)= γ0+ γ1Y1i, W1i|Y2i, Y1i∼ Ber(πW1i), (3) logit(πW1i)= α0+ αY1Y1i+ αY2Y2i

for all i in the original panel and refreshment sam-ple, plus requiring that all eight probabilities sum to

(5)

TABLE1

Summary of simulation study for the two-wave example. Results include the average of the posterior means across the 500 simulations and the percentage of the 500 simulations in which the 95% central

posterior interval covers the true parameter value. The implied Monte Carlo standard error of the simulated coverage rates is approximately√(0.95)(0.05)/500= 1%

HW MAR AN

Parameter True value Mean 95% Cov. Mean 95% Cov. Mean 95% Cov.

β0 0.3 0.29 96 0.27 87 0.30 97 β_X −0.4 −0.39 95 −0.39 95 −0.40 96 γ0 0.3 0.44 30 0.54 0 0.30 98 γX −0.3 −0.35 94 −0.39 70 −0.30 99 γY1 0.7 0.69 91 0.84 40 0.70 95 α0 −0.4 −0.46 84 0.25 0 −0.40 97 αX 1 0.96 93 0.84 13 1.00 98 αY1 −0.7 — — −0.45 0 −0.70 98 α_Y₂ 1.3 0.75 0 — — 1.31 93

one. Hirano et al. (1998) call this an additive nonig-norable (AN) model. The AN model enforces the ad-ditional constraint by disallowing the interaction be-tween (Y1, Y2) in (3). Hirano et al. (1998) prove that

the AN model is likelihood-identified for general dis-tributions. Fitting AN models is straightforward using Bayesian MCMC; see Hirano et al.(1998) and Deng

(2012) for exemplary Metropolis-within-Gibbs algo-rithms. Parameters also can be estimated via equations of moments (Bhattacharya, 2008).

Special cases of the AN model are informative. By setting (αY2= 0, αY1 = 0), we specify a model for a

MAR missing data mechanism. Setting αY2 = 0

im-plies a NMAR missing data mechanism. In fact, setting (αY₁ = 0, αY2 = 0) results in the nonignorable model

ofHausman and Wise (1979). Hence, the AN model allows the data to determine whether the missingness is MAR or NMAR, thereby allowing the analyst to avoid making an untestable choice between the two mechanisms. By not forcing αY1 = 0, the AN model

permits more complex nonignorable selection mecha-nisms than the model of Hausman and Wise (1979). The AN model does require separability of Y1 and Y2

in the selection model; hence, if attrition depends on the interaction between Y1 and Y2, the AN model will

not fully correct for biases due to nonignorable attri-tion.

As empirical evidence of the potential of refresh-ment samples, we simulate 500 data sets based on an extension of the model in (1)–(3) in which we add a Bernoulli-generated covariate X to each model; that is, we add βXXi to the logit predictor in (1), γXXi to the

logit predictor in (2), and αXXi to the logit predictor in (3). In each we use NP = 10,000 original panel cases and NR= 5000 refreshment sample cases. The param-eter values, which are displayed in Table 1, simulate a nonignorable missing data mechanism. All values of (X, Y1, W1)are observed in the original panel, and all

values of (X, Y2)are observed in the refreshment

sam-ple. We estimate three models based on the data: the

Hausman and Wise (1979) model (set αY1 = 0 when

fitting the models) which we denote with HW, a MAR model (set αY2= 0 when fitting the models) and an AN

model. In each data set, we estimate posterior means and 95% central posterior intervals for each parame-ter using a Metropolis-within-Gibbs sampler, running 10,000 iterations (50% burn-in). We note that interac-tions involving X also can be included and identified in the models, but we do not use them here.

For all models, the estimates of the intercept and co-efficient in the logistic regression of Y1 on X are

rea-sonable, primarily because X is complete and Y1 is

only MCAR in the refreshment sample. As expected, the MAR model results in biased point estimates and poorly calibrated intervals for the coefficients of the models for Y2and W1. The HW model fares somewhat

better, but it still leads to severely biased point esti-mates and poorly calibrated intervals for γ0 and αY2.

In contrast, the AN model results in approximately un-biased point estimates with reasonably well-calibrated intervals.

We also ran simulation studies in which the data gen-eration mechanisms satisfied the HW and MAR mod-els. When (αY1= 0, αY2= 0), the HW model performs

(6)

well and the MAR model performs terribly, as ex-pected. When (αY1= 0, αY2= 0), the MAR model

per-forms well and the HW model perper-forms terribly, also as expected. The AN model performs well in both sce-narios, resulting in approximately unbiased point esti-mates with reasonably well-calibrated intervals.

To illustrate the role of the separability assump-tion, we repeat the simulation study after including a nonzero interaction between Y1 and Y2 in the model

for W1. Specifically, we generate data according to a

response model,

logit(πW1i)= α0+ αY1Y1i+ αY2Y2i

(4)

+ αY1Y2Y1iY2i,

setting αY1Y2= 1. However, we continue to use the AN

model by forcing αY1Y2 = 0 when estimating

param-eters. Table 2 summarizes the results of 100 simula-tion runs, showing substantial biases in all parameters except (β0, βX, γX, αX). The estimates of (β0, βX)are

unaffected by using the wrong value for αY1Y2, since all

the information about the relationship between X and Y1is in the first wave of the panel. The estimates of γX and αX are similarly unaffected because αY1Y2involves

only Y1 (and not X), which is controlled for in the

re-gressions. Table2also displays the results when using (1), (2) and (4) with αY1Y2= 1; that is, we set αY1Y2 at

its true value in the MCMC estimation and estimate all other parameters. After accounting for separability, we are able to recover all true parameter values.

TABLE2

Summary of simulation study for the two-wave example without separability. The true selection model includes a nonzero interaction between Y1and Y2(coefficient αY1Y2= 1). We fit the

AN model plus the AN model adding the interaction term set at its true value. Results include the averages of the posterior means

and posterior standard errors across 100 simulations

AN AN+ αY1Y2 Parameter True value Mean S.E. Mean S.E.

β0 0.3 0.30 0.03 0.30 0.03 βX −0.4 −0.41 0.04 −0.41 0.04 γ0 0.3 0.14 0.06 0.30 0.06 γX −0.3 −0.27 0.06 −0.30 0.05 γ_Y₁ 0.7 0.99 0.07 0.70 0.06 α0 −0.4 −0.55 0.08 −0.41 0.09 αX 1 0.99 0.08 1.01 0.08 αY1 −0.7 −0.35 0.05 −0.70 0.07 αY2 1.3 1.89 0.13 1.31 0.13 αY1Y2 1 — — 1 0

Of course, in practice analysts do not know the true value of αY1Y2. Analysts who wrongly set αY1Y2= 0, or

any other incorrect value, can expect bias patterns like those in Table2, with magnitudes determined by how dissimilar the fixed αY1Y2 is from the true value.

How-ever, the successful recovery of true parameter values when setting αY1Y2 at its correct value suggests an

ap-proach for analyzing the sensitivity of inferences to the separability assumption. Analysts can posit a set of plausible values for αY1Y2, estimate the models

af-ter fixing αY1Y2 at each value and evaluate the

infer-ences that result. Alternatively, analysts might search for values of αY1Y2 that meaningfully alter

substan-tive conclusions of interest and judge whether or not such αY1Y2 seem realistic. Key to this sensitivity

anal-ysis is interpretation of αY1Y2. In the context of the

model above, αY1Y2has a natural interpretation in terms

of odds ratios; for example, in our simulation, setting αY₁Y2 = 1 implies that cases with (Y1 = 1, Y2 = 1)

have exp(2.3)≈ 10 times higher odds of responding at wave 2 than cases with (Y1= 1, Y2= 0). In a

sensitiv-ity analysis, when this is too high to seem realistic, we might consider models with values like αY1Y2 = 0.2.

Estimates from the AN model can serve as starting points to facilitate interpretations.

Although we presented models only for binary data,

Hirano et al. (1998) prove that similar models can be constructed for other data types, for example, they present an analysis with a multivariate normal distri-bution for (Y1, Y2). Generally speaking, one proceeds

by specifying a joint model for the outcome (uncondi-tional on W1), followed by a selection model for W1

that maintains separation of Y1and Y2.

3.2 Multiple Imputation Approaches

One also can treat estimation with refreshment sam-ples as a multiple imputation exercise, in which one creates a modest number of completed data sets to be analyzed with complete-data methods. In multiple im-putation, the basic idea is to simulate values for the missing data repeatedly by sampling from predictive distributions of the missing values. This creates m > 1 completed data sets that can be analyzed or, as rele-vant for many statistical agencies, disseminated to the public. When the imputation models meet certain con-ditions (Rubin, 1987, Chapter 4), analysts of the m completed data sets can obtain valid inferences using complete-data statistical methods and software. Specif-ically, the analyst computes point and variance esti-mates of interest with each data set and combines these estimates using simple formulas developed by Rubin

(7)

(1987). These formulas serve to propagate the uncer-tainty introduced by missing values through the ana-lyst’s inferences. Multiple imputation can be used for both MAR and NMAR missing data, although stan-dard software routines primarily support MAR tation schemes. Typical approaches to multiple impu-tation presume either a joint model for all the data, such as a multivariate normal or log-linear model (Schafer, 1997), or use approaches based on chained equations (Van Buuren and Oudshoorn, 1999;Raghunathan et al., 2001). See Rubin (1996), Barnard and Meng (1999) andReiter and Raghunathan(2007) for reviews of mul-tiple imputation.

Analysts can utilize the refreshment samples when implementing multiple imputation, thereby realizing similar benefits as illustrated in Section3.1. First, the analyst fits the Bayesian models in (1)–(3) by running an MCMC algorithm for, say, H iterations. This algo-rithm cycles between (i) taking draws of the missing values, that is, Y2 in the panel and (Y1, W1)in the

re-freshment sample, given parameter values and (ii) tak-ing draws of the parameters given completed data. Af-ter convergence of the chain, the analyst collects m of these completed data sets for use in multiple imputa-tion. These data sets should be spaced sufficiently so as to be approximately independent, for example, by thin-ning the H draws so that the autocorrelations among parameters are close to zero. For analysts reluctant to run MCMC algorithms, we suggest multiple imputa-tion via chained equaimputa-tions with (Y1, Y2, W1)each

tak-ing turns as the dependent variable. The conditional models should disallow interactions (other than those involving X) to respect separability. This suggestion is based on our experience with limited simulation stud-ies, and we encourage further research into its general validity. For the remainder of this article, we utilize the fully Bayesian MCMC approach to implement multi-ple imputation.

Of course, analysts could disregard the refreshment samples entirely when implementing multiple imputa-tion. For example, analysts can estimate a MAR mul-tiple imputation model by forcing αY2 = 0 in (3) and

using the original panel only. However, this model is exactly equivalent to the MAR model used in Table1

(although those results use both the panel and the re-freshment sample when estimating the model); hence, disregarding the refreshment samples can engender the types of biases and poor coverage rates observed in Ta-ble1. On the other hand, using the refreshment samples allows the data to decide if MAR is appropriate or not in the manner described in Section3.1.

In the context of refreshment samples and the ex-ample in Section 3.1, the analyst has two options for implementing multiple imputation. The first, which we call the “P+ R” option, is to generate completed data sets that include all cases for the panel and refresh-ment samples, for example, impute the missing Y2 in

the original panel and the missing (Y1, W1)in the

re-freshment sample, thereby creating m completed data sets each with NP + NR cases. The second, which we call the “P-only” option, is to generate completed data sets that include only individuals from the initial panel, so that NP individuals are disseminated or used for analysis. The estimation routines may require imput-ing (Y1, W1) for the refreshment sample cases, but in

the end only the imputed Y2are added to the observed

data from the original panel for dissemination/analysis. For the P+ R option, the multiply-imputed data sets are byproducts when MCMC algorithms are used to es-timate the models. The P+ R option offers no advan-tage for analysts who would use the Bayesian model for inferences, since essentially it just reduces from H draws to m draws for summarizing posterior distribu-tions. However, it could be useful for survey-weighted analyses, particularly when the concatenated file has weights that have been revised to reflect (as best as possible) its representativeness. The analyst can apply the multiple imputation methods ofRubin(1987) to the concatenated file.

Compared to the P+ R option, the P-only option of-fers clearer potential benefits. Some statistical agencies or data analysts may find it easier to disseminate or base inferences on only the original panel after using the refreshment sample for imputing the missing val-ues due to attrition, since combining the original and freshened samples complicates interpretation of sam-pling weights and design-based inference. For exam-ple, re-weighting the concatenated data can be tricky with complex designs in the original and refreshment sample. Alternatively, there may be times when a sta-tistical agency or other data collector may not want to share the refreshment data with outsiders, for example, because doing so would raise concerns over data confi-dentiality. Some analysts might be reluctant to rely on the level of imputation in the P+ R approach—for the refreshment sample, all Y1 must be imputed. In

con-trast, the P-only approach only leans on the imputa-tion models for missing Y2. Finally, some analysts

sim-ply may prefer the interpretation of longitudinal anal-yses based on the original panel, especially in cases of multiple-wave designs.

(8)

In the P-only approach, the multiple imputation has a peculiar aspect: the refreshment sample records used to estimate the imputation models are not used or avail-able for analyses. When records are used for impu-tation but not for analysis,Reiter (2008) showed that Rubin’s (1987) variance estimator tends to have posi-tive bias. The bias, which can be quite severe, results from a mismatch in the conditioning used by the ana-lyst and the imputer. The derivation of Rubin’s (1987) variance estimator presumes that the analyst conditions on all records used in the imputation models, not just the available data.

We now illustrate that this phenomenon also arises in the two-wave refreshment sample context. To do so, we briefly review multiple imputation (Rubin, 1987). For l = 1, . . . , m, let q(l) and u(l) be, respec-tively, the estimate of some population quantity Q and the estimate of the variance of q(l) in completed data set D(l). Analysts use ¯qm=

_m

l=1q(l)/m to esti-mate Q and use Tm= (1 + 1/m)bm+ ¯um to estimate var(¯qm), where bm=lm=1(q(l) − ¯qm)2/(m− 1) and

¯um=

_m

l=1u(l)/m. For large samples, inferences for Q are obtained from the t -distribution, (¯qm− Q) ∼ tν_m(0, Tm), where the degrees of freedom is νm = (m− 1)[1 + ¯um/((1+ 1/m)bm)]2. A better degrees of freedom for small samples is presented byBarnard and Rubin(1999). Tests of significance for multicom-ponent null hypotheses are derived byLi et al.(1991),

Li, Raghunathan and Rubin(1991),Meng and Rubin

(1992) andReiter(2007).

Table3summarizes the properties of the P-only mul-tiple imputation inferences for the AN model under the simulation design used for Table 1. We set m= 100, spacing out samples of parameters from the MCMC so as to have approximately independent draws. Re-sults are based on 500 draws of observed data sets, each with new values of missing data. As before, the multiple imputation results in approximately unbiased point estimates of the coefficients in the models for Y1

TABLE3

Bias in multiple imputation variance estimator for P-only method. Results based on 500 simulations

Parameter Q Avg.¯q_∗ Var¯q_∗ Avg.T_∗ 95% Cov.

β0 0.3 0.30 0.0008 0.0008 95.4

βX −0.4 −0.40 0.0016 0.0016 95.8

γ0 0.3 0.30 0.0018 0.0034 99.2

γX −0.3 −0.30 0.0022 0.0031 98.4

γY1 0.7 0.70 0.0031 0.0032 96.4

and for Y2. For the coefficients in the regression of Y2,

the averages of Tmacross the 500 replications tend to be significantly larger than the actual variances, lead-ing to conservative confidence interval coverage rates. Results for the coefficients of Y1are well-calibrated; of

course, Y1has no missing data in the P-only approach.

We also investigated the two-stage multiple imputa-tion approach ofReiter(2008). However, this resulted in some anti-conservative variance estimates, so that it was not preferred to standard multiple imputation.

3.3 Comparing Model-Based and Multiple Imputation Approaches

As in other missing data contexts, model-based and multiple imputation approaches have differential ad-vantages (Schafer, 1997). For any given model, model-based inferences tend to be more efficient than multi-ple imputation inferences based on modest numbers of completed data sets. On the other hand, multiple im-putation can be more robust than fully model-based approaches to poorly fitting models. Multiple imputa-tion uses the posited model only for completing miss-ing values, whereas a fully model-based approach re-lies on the model for the entire inference. For example, in the P-only approach, a poorly-specified imputation model affects inference only through the (NP − NCP) imputations for Y2. Speaking loosely to offer intuition,

if the model for Y2is only 60% accurate (a poor model

indeed) and (NP− NCP)represents 30% of NP, infer-ences based on the multiple imputations will be only 12% inaccurate. In contrast, the full model-based infer-ence will be 40% inaccurate. Computationally, multi-ple imputation has some advantages over model-based approaches, in that analysts can use ad hoc imputation methods like chained equations (Van Buuren and Oud-shoorn, 1999;Raghunathan et al., 2001) that do not re-quire MCMC.

Both the model-based and multiple imputation ap-proaches, by definition, rely on models for the data. Models that fail to describe the data could result in inaccurate inferences, even when the separability as-sumption in the selection model is reasonable. Thus, regardless of the approach, it is prudent to check the fit of the models to the observed data. Unfortunately, the literature on refreshment samples does not offer guid-ance on or present examples of such diagnostics.

We suggest that analysts check models with pre-dictive distributions (Meng, 1994;Gelman, Meng and Stern, 1996; He et al., 2010; Burgette and Reiter, 2010). In particular, the analyst can use the estimated model to generate new values of Y2 for the complete

(9)

cases in the original panel and for the cases in the refreshment sample. The analyst compares the set of replicated Y2 in each sample with the corresponding

original Y2 on statistics of interest, such as summaries

of marginal distributions and coefficients in regressions of Y2 on observed covariates. When the statistics from

the replicated data and observed data are dissimilar, the diagnostics indicate that the imputation model does not generate replicated data that look like the complete data, suggesting that it may not describe adequately the relationships involving Y2or generate plausible values

for the missing Y2. When the statistics are similar, the

diagnostics do not offer evidence of imputation model inadequacy (with respect to those statistics). We rec-ommend that analysts generate multiple sets of repli-cated data, so as to ensure interpretations are not overly specific to particular replications.

These predictive checks can be graphical in nature, for example, resembling grouped residual plots for lo-gistic regression models. Alternatively, as summaries analysts can compute posterior predictive probabili-ties. Formally, let S be the statistic of interest, such as a regression coefficient or marginal probability. Sup-pose the analyst has created T replicated data sets, {R(1)_{, . . . , R}(T )_{}, where T is somewhat large (say, T =} 500). Let SD and SR(l) be the values of S computed

with an observed subsample D, for example, the com-plete cases in the panel or the refreshment sample, and R(l), respectively, where l= 1, . . . , T . For each S we compute the two-sided posterior predictive probability,

ppp= (2/T ) ∗ min _T l=1 I (SD− S_R(l)>0), (5) T l=1 I (S_R(l)− SD>0) . We note that ppp is small when SD and SR(l)

consis-tently deviate from each other in one direction, which would indicate that the model is systematically dis-torting the relationship captured by S. For S with small ppp, it is prudent to examine the distribution of S_R(l)− SD to evaluate if the difference is practically important. We consider probabilities in the 0.05 range (or lower) as suggestive of lack of model fit.

To obtain each R(l), analysts simply add a step to the MCMC that replaces all observed values of Y2

us-ing the parameter values at that iteration, conditional on observed values of (X, Y1, W1). This step is used

only to facilitate diagnostic checks; the estimation of parameters continues to be based on the observed Y2.

When autocorrelations among parameters are high, we recommend thinning the chain so that parameter draws are approximately independent before creating the set of R(l). Further, we advise saving the T replicated data sets, so that they can be used repeatedly with differ-ent S. We illustrate this process of model checking in the analysis of the APYN data in Section5.

4. THREE-WAVE PANELS WITH TWO REFRESHMENTS

To date, model-based and multiple imputation meth-ods have been developed and applied in the context of two-wave panel studies with one refreshment sam-ple. However, many panels exist for more than two waves, presenting the opportunity for fielding multi-ple refreshment sammulti-ples under different designs. In this section we describe models for three-wave panels with two refreshment samples. These can be used as in Sec-tion3.1for model-based inference or as in Section3.2

to implement multiple imputation. Model identification depends on (i) whether or not individuals from the orig-inal panel who did not respond in the second wave, that is, have W1i= 0, are given the opportunity to provide

responses in the third wave, and (ii) whether or not individuals from the first refreshment sample are fol-lowed in the third wave.

To begin, we extend the example from Figure1to the case with no panel returns and no refreshment follow-up, as illustrated in Figure 2. Let Y3 be binary

re-sponses potentially available in wave 3. For the original panel, we know Y3only for NCP2< NCP subjects due to third wave attrition. We also know Y3 for the NR2

units in the second refreshment sample. By design, we do not know (Y1, Y3)for units in the first refreshment

sample, nor do we know (Y1, Y2)for units in the second

refreshment sample. For all i, let W2i = 1 if subject i

would provide a value for Y3 if they were included in

the second wave of data collection (even if they would not respond in that wave), and let W2i= 0 if subject i

would not provide a value for Y3 if they were included

in the second wave. In this design, W2i is missing for

all i in the original panel with W1i= 0 and for all i in

both refreshment samples.

There are 32 cells in the contingency table cross-tabulated from (Y1, Y2, Y3, W1, W2). However, the

ob-served data offer only sixteen constraints, obtained from the eight joint probabilities when (W1= 1, W2=

1) and the following dependent equations (which can be alternatively specified). For all (y1, y2, y3, w1, w2),

(10)

FIG. 2. Graphical representation of the three-wave panel with monotone nonresponse and no follow-up for subjects in refreshment samples. Here, X represents variables available on everyone and is displayed for generality; there is no X in the example in Section4.

where y3, w1, w2∈ {0, 1}, we have 1= y1,y2,y3,w,w2 P (Y1= y1, Y2= y2, Y3= y3, W1= w1, W2= w2), P (Y1= y1, W1= 0) = y2,y3,w2 P (Y1= y1, Y2= y2, Y3= y3, W1= 0, W2= w2), P (Y2= y2)− P (Y2= y2, W1= 1) = y1,y3,w2 P (Y1= y1, Y2= y2, Y3= y3, W1= 0, W2= w2), P (Y1= y1, Y2= y2, W1= 1, W2= 0) = y3 P (Y1= y1, Y2= y2, Y3= y3, W1= 1, W2= 0), P (Y3= y3) = y1,y2,w1,w2 P (Y1= y1, Y2= y2, Y3= y3, W1= w1, W2= w2).

As before, all quantities on the left-hand side of the equations are estimable from the observed data. The first three equations are generalizations of those from the two-wave model. One can show that the entire set of equations offers eight independent constraints, so that we must add sixteen constraints to identify all the probabilities in the table.

Following the strategy for two-wave models, we characterize the joint distribution of (Y1, Y2, Y3, W1,

W2) via a chain of conditional models. In particular,

for all i in the original panel and refreshment samples, we supplement the models in (1)–(3) with

Y3i| Y1i, Y2i, W1i∼ Ber(π3i), logit(π3i)= β0+ β1Y1i (6) + β2Y2i+ β3Y1iY2i, W2i| Y1i, Y2i, W1i, Y3i∼ Ber(πW2i), logit(πW2i)= δ0+ δ1Y1i+ δ2Y2i (7) + δ3Y3i+ δ4Y1iY2i,

plus requiring that all 32 probabilities sum to one. We note that the saturated model—which includes all eli-gible one-way, two-way and three-way interactions— contains 31 parameters plus the sum-to-one require-ment, whereas the just-identified model contains 15 parameters plus the sum-to-one requirement; thus, the needed 16 constraints are obtained by fixing parame-ters in the saturated model to zero.

The sixteen removed terms from the saturated model include the interaction Y1Y2 from the model for W1,

all terms involving W1 from the model for Y3 and all

terms involving W1 or interactions with Y3 from the

model for W2. We never observe W1= 0 jointly with

Y3 or W2, so that the data cannot identify whether or

not the distributions for Y3 or W2 depend on W1. We

therefore require that Y3 and W2 be conditionally

in-dependent of W1. With this assumption, the NCPcases with W1 = 1 and the second refreshment sample can

(11)

identify the interactions of Y1Y2 in (6) and (7).

Essen-tially, the NCP cases with fully observed (Y1, Y2) and

the second refreshment sample considered in isolation are akin to a two-wave panel sample with (Y1, Y2)and

their interaction as the variables from the “first wave” and Y3as the variable from the “second wave.” As with

the AN model, in this pseudo-two-wave panel we can identify the main effect of Y3in (7) but not interactions

involving Y3.

In some multi-wave panel studies, respondents who complete the first wave are invited to complete all sub-sequent waves, even if they failed to complete a pre-vious one. That is, individuals with observed W1i = 0

can come back in future waves. For example, the 2008 ANES increased incentives to attriters to encourage them to return in later waves. This scenario is illus-trated in Figure3. In such cases, the additional infor-mation offers the potential to identify additional pa-rameters from the saturated model. In particular, one gains the dependent equations,

P (Y1= y1, Y3= y3, W1= 0, W2= 1)

=

y2

P (Y1= y1, Y2= y2,

Y3= y3, W1= 0, W2= 1)

for all (y1, y3). When combined with other equations,

we now have 20 independent constraints. Thus, we can add four terms to the models in (6) and (7) and maintain identification. These include two main effects for W1

and two interactions between W1 and Y1, all of which

are identified since we now observe some W2 and Y3

when W1= 0. In contrast, the interaction term Y2W1is

not identified, because Y2is never observed with Y3

ex-cept when W1= 1. Interaction terms involving Y3 also

are not identified. This is intuitively seen by supposing that no values of Y2from the original panel were

miss-ing, so that effectively the original panel plus the sec-ond refreshment sample can be viewed as a two-wave setting in which the AN assumption is required for Y3.

Thus far we have assumed only cross-sectional freshment samples, however, refreshment sample re-spondents could be followed in subsequent waves. Once again, the additional information facilitates es-timation of additional terms in the models. For ex-ample, consider extending Figure3to include incom-plete follow-up in wave three for units from the first refreshment sample. Deng (2012) shows that the ob-served data offer 22 independent constraints, so that we can add six terms to (6) and (7). As before, these include two main effects for W1 and two interactions

for Y1W1. We also can add the two interactions for

Y2W1. The refreshment sample follow-up offers

obser-vations with Y2 and (Y3, W2) jointly observed, which

combined with the other data enables estimation of the one-way interactions. Alternatively, consider extend-ing Figure 2 to include the incomplete follow-up in wave three for units from the first refreshment sam-ple. Here, Deng (2012) shows that the observed data offer 20 independent constraints and that one can add the two main effects for W1 and two interactions for

Y2W1to (6) and (7).

As in the two-wave case (Hirano et al., 1998), we expect that similar models can be constructed for other data types. We have done simulation experiments (not reported here) that support this expectation.

FIG. 3. Graphical representation of the three-wave panel with return of wave 2 nonrespondents and no follow-up for subjects in refreshment samples. Here, X represents variables available on everyone.

(12)

5. ILLUSTRATIVE APPLICATION

To illustrate the use of refreshment samples in prac-tice, we use data from the 2007–2008 Associated Press–Yahoo! News Poll (APYN). The APYN is a one year, eleven-wave survey with three refreshment samples intended to measure attitudes about the 2008 presidential election and politics. The panel was sam-pled from the probability-based KnowledgePanel(R) Internet panel, which recruits panel members via a probability-based sampling method using known pub-lished sampling frames that cover 99% of the U.S. pop-ulation. Sampled noninternet households are provided a laptop computer or MSN TV unit and free internet service.

The baseline (wave 1) of the APYN study was col-lected in November 2007, and the final wave took place after the November 2008 general election. The base-line was fielded to a sample of 3548 adult citizens, of whom 2735 responded, for a 77% cooperation rate. All baseline respondents were invited to participate in each follow-up wave; hence, it is possible, for example, to obtain a baseline respondent’s values in wave t+ 1 even if they did not participate in wave t . Cooperation rates in follow-up surveys varied from 69% to 87%, with rates decreasing towards the end of the panel. Refreshment samples were collected during follow-up waves in January, September and October 2008. For illustration, we use only the data collected in the base-line, January and October waves, including the cor-responding refreshment samples. We assume nonre-sponse to the initial wave and to the refreshment sam-ples is ignorable and analyze only the available cases. The resulting data set is akin to Figure3.

The focus of our application is on campaign inter-est, one of the strongest predictors of democratic at-titudes and behaviors (Prior, 2010) and a key measure for defining likely voters in pre-election polls (Traugott and Tucker, 1984). Campaign interest also has been shown to be correlated with panel attrition (Bartels, 1999;Olson and Witt, 2011). For our analysis, we use an outcome variable derived from answers to the sur-vey question, “How much thought, if any, have you given to candidates who may be running for president in 2008?” Table4 summarizes the distribution of the answers in the three waves. Following convention (e.g.,

Pew Research Center, 2010), we dichotomize answers into people most interested in the campaign and all oth-ers. We let Yt i = 1 if subject i answers “A lot” at time t and Yt i= 0 otherwise, where t ∈ {1, 2, 3} for the base-line, January and October waves, respectively. We let

TABLE4

Campaign interest. Percentage choosing each response option across the panel waves (P1, P2, P3) and refreshment samples (R2, R3). In P3, 83 nonrespondents from P2 returned to the survey. Five participants with missing data in P1 were not used in the analysis

P1 P2 P3 R2 R3

“A lot” 29.8 40.3 65.0 42.0 72.2

“Some” 48.6 44.3 25.9 43.3 20.3

“Not much” 15.3 10.8 5.80 10.2 5.0

“None at all” 6.1 4.4 2.90 3.6 1.9

Available sample size 2730 2316 1715 691 461

Xi denote the vector of predictors summarized in Ta-ble5.

We assume ignorable nonresponse in the initial wave and refreshment samples for convenience, as our pri-mary goal is to illustrate the use and potential benefits of refreshment samples. Unfortunately, we have little evidence in the data to support or refute that assump-tion. We do not have access to X for the nonrespon-dents in the initial panel or refreshment samples, thus, we cannot compare them to respondents’ X as a (par-tial) test of an MCAR assumption. The respondents’ characteristics are reasonably similar across the three samples—although the respondents in the second re-freshment sample (R3) tend to be somewhat older than other samples—which offers some comfort that, with respect to demographics, these three samples are not subject to differential nonresponse bias.

As in Section4, we estimate a series of logistic re-gressions. Here, we denote the 7× 1 vectors of coeffi-cients in front of the Xi with θ and subscripts indicat-ing the dependent variable, for example, θY1represents

the coefficients of X in the model for Y1. Suppressing

conditioning, the series of models is Y1i∼ Bern _exp(θ Y1Xi) 1+ exp(θY1Xi) , Y2i∼ Bern _exp(θ Y2Xi+ γ Y1i) 1+ exp(θY2Xi+ γ Y1i) , W1i∼ Bern _exp(θ W1Xi+ α1Y1i+ α2Y2i) 1+ exp(θW1Xi+ α1Y1i+ α2Y2i) , Y3i∼ Bern exp(θY3Xi+ β1Y1i+ β2Y2i + β3W1i+ β4Y1iY2i+ β5Y1iW1i) /1+ exp(θY3Xi+ β1Y1i + β2Y2i+ β3W1i + β4Y1iY2i+ β5Y1iW1i) ,

(13)

TABLE5

Predictors used in all conditional models, denoted as X. Percentage of respondents in each category in initial panel (P1) and refreshment samples (R2, R3)

Variable Definition P1 R2 R3

AGE1 = 1 for age 30–44, = 0 otherwise 0.28 0.28 0.21

AGE2 = 1 for age 45–59, = 0 otherwise 0.32 0.31 0.34

AGE3 = 1 for age above 60, = 0 otherwise 0.25 0.28 0.34

MALE = 1 for male, = 0 for female 0.45 0.47 0.43

COLLEGE = 1 for having college degree, = 0 otherwise 0.30 0.33 0.31 BLACK = 1 for African American, = 0 otherwise 0.08 0.07 0.07 INT = 1 for everyone (the intercept)

W2i∼ Bern exp(θW2Xi+ δ1Y1i+ δ2Y2i+ δ3Y3i + δ4W1i+ δ5Y1iY2i+ δ6Y1iW1i) /1+ exp(θW2Xi+ δ1Y1i+ δ2Y2i + δ3Y3i+ δ4W1i + δ5Y1iY2i+ δ6Y1iW1i) . We use noninformative prior distributions on all rameters. We estimate posterior distributions of the pa-rameters using a Metropolis-within-Gibbs algorithm, running the chain for 200,000 iterations and treat-ing the first 50% as burn-in. MCMC diagnostics sug-gested that the chain converged. Running the MCMC for 200,000 iterations took approximately 3 hours on a standard desktop computer (Intel Core 2 Duo CPU 3.00 GHz, 4 GB RAM). We developed the code in Mat-lab without making significant efforts to optimize the code. Of course, running times could be significantly faster with higher-end machines and smarter coding in a language like C++.

The identification conditions include no interaction between campaign interest in wave 1 and wave 2 when predicting attrition in wave 2, and no interaction be-tween campaign interest in wave 3 (as well as nonre-sponse in wave 2) and other variables when predicting attrition in wave 3. These conditions are impossible to check from the sampled data alone, but we cannot think of any scientific basis to reject them outright.

Table6summarizes the posterior distributions of the regression coefficients in each of the models. Based on the model for W1, attrition in the second wave is

rea-sonably described as missing at random, since the coef-ficient of Y2in that model is not significantly different

from zero. The model for W2suggests that attrition in

wave 3 is not missing at random. The coefficient for Y3 indicates that participants who were strongly

inter-ested in the election at wave 3 (holding all else

con-stant) were more likely to drop out. Thus, a panel attri-tion correcattri-tion is needed to avoid making biased infer-ences.

This result contradicts conventional wisdom that politically-interested respondents are less likely to at-trite (Bartels, 1999). The discrepancy could result from differences in the survey design of the APYN study compared to previous studies with attrition. For exam-ple, the APYN study consisted of 10–15 minute on-line interviews, whereas the ANES panel analyzed by

Bartels(1999) andOlson and Witt(2011) consisted of 90-minute, face-to-face interviews. The lengthy ANES interviews have been linked to significant panel condi-tioning effects, in which respondents change their at-titudes and behavior as a result of participation in the panel (Bartels, 1999). In contrast, Kruse et al.(2009) finds few panel conditioning effects in the APYN study. More notably, there was a differential incen-tive structure in the APYN study. In later waves of the study, reluctant responders (those who took more than 7 days to respond in earlier waves) received in-creased monetary incentives to encourage their partici-pation. Other panelists and the refreshment sample re-spondents received a standard incentive. Not surpris-ingly, the less interested respondents were more likely to have received the bonus incentives, potentially in-creasing their retention rate to exceed that of the most interested respondents. This possibility raises a broader question about the reasonableness of assuming the ini-tial nonresponse is ignorable, a point we return to in Section6.

In terms of the campaign interest variables, the ob-served relationships with (Y1, Y2, Y3) are consistent

with previous research (Prior, 2010). Not surprisingly, the strongest predictor of interest in later waves is in-terest in previous waves. Older and college-educated participants are more likely to be interested in the elec-tion. Like other analyses of the 2008 election (Lawless,

(14)

TABLE6

Posterior means and 95% central intervals for coefficients in regressions. Column headers are the dependent variable in the regressions

Variable Y₁ Y₂ Y₃ W₁ W₂ INT −1.60 −1.77 0.04 1.64 −1.40 (−1.94, −1.28) (−2.21, −1.32) (−1.26, 1.69) (1.17, 2.27) (−2.17, −0.34) AGE1 0.25 0.27 0.03 −0.08 0.28 (−0.12, 0.63) (−0.13, 0.68) (−0.40, 0.47) (−0.52, 0.37) (−0.07, 0.65) AGE2 0.75 0.62 0.15 0.24 0.27 (0.40, 1.10) (0.24, 1.02) (−0.28, 0.57) (−0.25, 0.72) (−0.07, 0.64) AGE3 1.26 0.96 0.88 0.37 0.41 (0.91, 1.63) (0.57, 1.37) (0.41, 1.34) (−0.14, 0.87) (0.04, 0.80) COLLEGE 0.11 0.53 0.57 0.35 0.58 (−0.08, 0.31) (0.31, 0.76) (0.26, 0.86) (0.04, 0.69) (0.34, 0.84) MALE −0.05 −0.02 −0.02 0.13 0.08 (−0.23, 0.13) (−0.22, 0.18) (−0.29, 0.24) (−0.13, 0.39) (−0.14, 0.29) BLACK 0.75 −0.02 0.11 −0.54 −0.12 (0.50, 1.00) (−0.39, 0.35) (−0.40, 0.64) (−0.92, −0.14) (−0.47, 0.26) Y1 — 2.49 1.94 0.50 0.88 — (2.24, 2.73) (0.05, 3.79) (−0.28, 1.16) (0.20, 1.60) Y2 — — 2.03 −0.58 0.27 — — (1.61, 2.50) (−1.92, 0.89) (−0.13, 0.66) W1 — — −0.42 — 2.47 — — (−1.65, 0.69) — (2.07, 2.85) Y₁Y₂ — — −0.37 — −0.07 — — (−1.18, 0.47) — (−0.62, 0.48) Y1W1 — — −0.52 — −0.62 — — (−2.34, 1.30) — (−1.18, −0.03) Y3 — — — — −1.10 — — — — (−3.04, −0.12)

2009), and in contrast to many previous election cycles, we do not find a significant gender gap in campaign in-terest.

We next illustrate the P-only approach with multi-ple imputation. We used the posterior draws of param-eters to create m= 500 completed data sets of the orig-inal panel only. We thinned the chains until autocorre-lations of the parameters were near zero to obtain the parameter sets. We then estimated marginal probabil-ities of (Y2, Y3) and a logistic regression for Y3 using

maximum likelihood on only the 2730 original panel cases, obtaining inferences via Rubin’s (1987) com-bining rules. For comparison, we estimated the same quantities using only the 1632 complete cases, that is, people who completed all three waves.

The estimated marginal probabilities reflect the re-sults in Table6. There is little difference in P (Y2= 1)

in the two analyses: the 95% confidence interval is (0.38, 0.42) in the complete cases and (0.37, 0.46) in the full panel after multiple imputation. However, there is a suggestion of attrition bias in P (Y3 = 1). The

95% confidence interval is (0.63, 0.67) in the complete

cases and (0.65, 0.76) in the full panel after multiple imputation. The estimated P (Y3= 1 | W2= 0) = 0.78,

suggesting that nonrespondents in the third wave were substantially more interested in the campaign than re-spondents.

Table7displays the point estimates and 95% confi-dence intervals for the regression coefficients for both analyses. The results from the two analyses are quite similar except for the intercept, which is smaller af-ter adjustment for attrition. The relationship between a college education and political interest is somewhat at-tenuated after correcting for attrition, although the con-fidence intervals in the two analyses overlap substan-tially. Thus, despite an apparent attrition bias affecting the marginal distribution of political interest in wave 3, the coefficients for this particular complete-case anal-ysis appear not to be degraded by panel attrition.

Finally, we conclude the analysis with a diagnos-tic check of the three-wave model following the ap-proach outlined in Section3.3. To do so, we generate 500 independent replications of (Y2, Y3)for each of the

(15)

TABLE7

Maximum likelihood estimates and 95% confidence intervals based for coefficients of predictors of Y3using m= 500 multiple

imputations and only complete cases at final wave

Variable Multiple imputation Complete cases INT −0.22 (−0.80, 0.37) −0.64 (−0.98, −0.31) AGE1 −0.03 (−0.40, 0.34) 0.01 (−0.36, 0.37) AGE2 0.08 (−0.30, 0.46) 0.12 (−0.25, 0.49) AGE3 0.74 (0.31, 1.16) 0.76 (0.36, 1.16) COLLEGE 0.56 (0.27, 0.86) 0.70 (0.43, 0.96) MALE −0.09 (−0.33, 0.14) −0.08 (−0.32, 0.16) BLACK 0.07 (−0.38, 0.52) 0.05 (−0.43, 0.52) Y1 1.39 (0.87, 1.91) 1.45 (0.95, 1.94) Y2 2.00 (1.59, 2.40) 2.06 (1.67, 2.45) Y1Y2 −0.33 (−1.08, 0.42) −0.36 (−1.12, 0.40)

then compare the estimated probabilities for (Y2, Y3)

in the replicated data to the corresponding probabilities in the observed data, computing the value of ppp for each cell. We also estimate the regression from Table7

with the replicated data using only the complete cases in the panel, and compare coefficients from the repli-cated data to those estimated with the complete cases in the panel. As shown in Table8, the imputation mod-els generate data that are highly compatible with the observed data in the panel and the refreshment sam-ples on both the conditional probabilities and regres-sion coefficients. Thus, from these diagnostic checks we do not have evidence of lack of model fit.

6. CONCLUDING REMARKS

The APYN analyses, as well as the simulations, il-lustrate the benefits of refreshment samples for diag-nosing and adjusting for panel attrition bias. At the same time, it is important to recognize that the ap-proach alone does not address other sources of non-response bias. In particular, we treated nonnon-response in wave 1 and the refreshment samples as ignorable. Al-though this simplifying assumption is the usual prac-tice in the attrition correction literature (e.g., Hirano et al., 1998;Bhattacharya, 2008), it is worth question-ing whether it is defensible in particular settquestion-ings. For example, suppose in the APYN survey that people dis-interested in the campaign chose not to respond to the refreshment samples, for example, because people dis-interested in the campaign were more likely to agree to take part in a political survey one year out than one month out from the election. In such a scenario, the models would impute too many interested participants

TABLE8

Posterior predictive probabilities (ppp) based on 500 replicated data sets and various observed-data quantities. Results include probabilities for cells with observed data and coefficients in regression of Y3on several predictors estimated with complete

cases in the panel

Quantity Value of ppp Probabilities observable in original data

Pr(Y2= 0) in the 1st refreshment sample 0.84

Pr(Y3= 0) in the 2nd refreshment sample 0.40

Pr(Y2= 0|W1= 1) 0.90 Pr(Y3= 0|W1= 1, W2= 1) 0.98 Pr(Y3= 0|W1= 0, W2= 1) 0.93 Pr(Y2= 0, Y3= 0|W1= 1, W2= 1) 0.98 Pr(Y₂= 0, Y₃= 1|W₁= 1, W₂= 1) 0.87 Pr(Y2= 1, Y3= 0|W1= 1, W2= 1) 0.92 Coefficients in regression of Y₃on INT 0.61 AGE1 0.72 AGE2 0.74 AGE3 0.52 COLLEGE 0.89 MALE 0.76 BLACK 0.90 Y1 0.89 Y2 0.84 Y1Y2 0.89

among the panel attriters, leading to bias. Similar is-sues can arise with item nonresponse not due to attri-tion.

We are not aware of any published work in which nonignorable nonresponse in the initial panel or re-freshment samples is accounted for in inference. One potential path forward is to break the nonresponse ad-justments into multiple stages. For example, in stage one the analyst imputes plausible values for the non-respondents in the initial wave and refreshment sam-ple(s) using selection or pattern mixture models de-veloped for cross-sectional data (seeLittle and Rubin, 2002). These form a completed data set except for at-trition and missingness by design, so that we are back in the setting that motivated Sections3and4. In stage two, the analyst estimates the appropriate AN model with the completed data to perform multiple imputa-tions for attrition (or to use model-based or survey-weighted inference). The analyst can investigate the sensitivity of inferences to multiple assumptions about the nonignorable missingness mechanisms in the ini-tial wave and refreshment samples. This approach is related to two-stage multiple imputation (Shen, 2000;