5.1 Introduction
This chapter lays out the definitions and assumptions that are commonly used to make inference about a full-data distribution from incompletely observed data. Some specific examples help to motivate the discussion.
Example 5.1. Dropout in a longitudinal clinical trial. In the growth hormone study described in Chapter 1, individuals were scheduled for measurement at baseline, month six and month 12. The target of inference is the mean difference in quadriceps strength between the two treatment groups, at month 12, among everyone who began follow up. This is a parameter indexing the distribution of the full data; i.e., the data that would have been observed had everyone completed the study. However, several individuals dropped out of the study before completing follow up. □
Example 5.2. Dropout and mortality in a cohort study. The HER Study (Chapter 1) followed 871 HIV-infected women for up to six years, recording clinical and behavioral outcomes every six months. All women were scheduled for 12 follow up measurements. About 10 percent died from AIDS and other causes, and about another 20 percent withdrew or were lost to follow up. For a specific outcome of interest, like CD4 count, the full data could be defined in several ways. One definition is all CD4 counts that were scheduled to be observed (12 for each woman). Because of the difficulty in conceptualizing CD4 count for someone who has died, an alternate definition is those CD4 counts taken while still alive. □
A key theme of this chapter is the distinction, for purposes of model specification and inference, between full data and observed data. Inferences about full data that are based on incomplete observations must rely on assumptions about the distribution of missing responses. We will demonstrate here that these assumptions are always encoded in a full data model by some combination of structural modeling assumptions and constraints on the parameter space. This is even true – actually it is especially true – for commonly used assumptions such as MAR and ignorability.
Section 5.2 provides definitions for full data and characterizes some common missing data processes such as dropout and monotone missingness. Section 5.3 gives a general conceptualization of the full-data model, and Section 5.4 describes missing data mechanisms such as MAR in the context of a full-data model. In Section 5.5 we show how the MAR assumption applies to dropout in longitudinal studies. The remainder of the chapter is devoted to model specification and interpretation under various assumptions for the missing data mechanism. A number of issues are highlighted for further reading.
5.2 Full versus observed data
5.2.1 Overview
When drawing inference from incomplete data, it is necessary to expand the context of the modeling problem to characterizing some set of ‘full’ (but incompletely observed) data. In this Section we differentiate full data, observed data, and full-data response. The full data refers to all elements of the data that are observed or are intended to be observed, usually including some response or dependent variable, covariates of direct interest, auxiliary variables, and missing data indicators. Observed data comprise the observed subset of the full data. Finally, the full-data response refers to the dependent variable of primary interest; for example, while missing data indicators are part of the full data, their distribution or relation to covariates typically are not of primary interest to the modeler.
As an example, consider a clinical trial of a new antipsychotic agent, where patients are randomized to receive either standard treatment or the new therapy. The schedule calls for symptom scale to be recorded weekly for six weeks, but many patients drop out of the study before their follow up is complete. Here, the full data consists of
– symptom scale outcomes for all weeks, regardless of whether they actually were recorded;
– a binary random variable for each week, indicating whether the symptom scale was recorded (sometimes called missing data indicators);
– the covariate of interest, namely treatment group;
– other baseline covariates that may have been recorded.
The full-data response data consists only of the first and third items, or more simply, the data that was intended to be collected and analyzed.
Finally, the observed data comprises the observed elements of symptom score, the missing data indicators, and the covariates.
This Section lays out notation and definitions for the data only; in Section 5.3, we define various models that can be used to characterize various aspects of the full-data distribution.
Key references: Copas and Eguchi (2005), Heitjan and Rubin, Scharfstein et al., Diggle et al.
5.2.2 Data structures
Consider first the case of bivariate response, where the first response is always observed but the second one may be missing. The full data for individual $i$ consists of a response vector $Y_i = (Y_{i1}, Y_{i2})$, a $2 \times p$ matrix $X_i$ of model covariates, a $2 \times q$ matrix of auxiliary covariates $V_i$, and the missing data indicator $R_i$. Throughout the book, we assume the process that causes response data to be missing is stochastic (as opposed to being part of a study design); hence $R_i$ (and its generalizations below) are always treated as random variables.
The observed data for individual $i$ are $O_i = (Y_{i,\mathrm{obs}}, X_i, V_i, R_i)$, where
$$
Y_{i,\mathrm{obs}} =
\begin{cases}
(Y_{i1}, Y_{i2})^T & \text{if } R_i = 1, \\
Y_{i1} & \text{if } R_i = 0.
\end{cases}
$$
For more general patterns of longitudinal data, similar structures apply.
Consider first the case of temporally aligned observations. Referring to Section ??, the full data response vector is $Y_i = (Y_{i1}, \ldots, Y_{iJ})^T$. The vector $R_i = (R_{i1}, \ldots, R_{iJ})^T$ indicates which components are observed, with $R_{ij} = 1$ if $Y_{ij}$ is observed, and $R_{ij} = 0$ otherwise. Let $J_i = \sum_j R_{ij}$ denote the number of full-data components of $Y_i$ that are observed. A useful way to represent the observed and missing components of the full data is the (possibly temporally unordered) partition $(Y_{i,\mathrm{obs}}^T, Y_{i,\mathrm{mis}}^T)^T$. The subvector $Y_{i,\mathrm{obs}}$ is $J_i \times 1$, with elements $\{Y_{ij} : R_{ij} = 1\}$; similarly, $Y_{i,\mathrm{mis}}$ is $(J - J_i) \times 1$, with elements $\{Y_{ij} : R_{ij} = 0\}$.
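The partition of a full-data response vector into observed and missing parts can be illustrated with a small sketch. The function name `partition_full_response` and the numeric values are hypothetical, and the sketch assumes the full vector is known (as it would be in a simulation); with real data the missing part is of course not available.

```python
def partition_full_response(y, r):
    """Split a full-data response vector y using indicators r (1 = observed)."""
    if len(y) != len(r):
        raise ValueError("y and r must have the same length J")
    y_obs = [yj for yj, rj in zip(y, r) if rj == 1]  # elements with R_ij = 1
    y_mis = [yj for yj, rj in zip(y, r) if rj == 0]  # elements with R_ij = 0
    j_i = sum(r)                                     # J_i = sum_j R_ij
    return y_obs, y_mis, j_i


# Hypothetical individual with J = 4 scheduled measurements.
y_i = [4.1, 3.8, 5.0, 4.4]
r_i = [1, 0, 1, 0]
y_obs, y_mis, j_i = partition_full_response(y_i, r_i)
print(y_obs, y_mis, j_i)  # [4.1, 5.0] [3.8, 4.4] 2
```

Here the observed subvector has length 2 and the missing subvector has length J − 2 = 2, matching the dimensions given in the text.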
For temporally misaligned data the situation is somewhat different, and the definition of full data is not always obvious because the timing of measurements is itself a random variable. For example, the full response data could be all values of a stochastic process {Yi(t)} over a fixed range of t, and the observed data is the subset {Yi(ti1), . . . , Yi(tiJi)}. Here the
observation times ti1, . . . , tiJi might arise from a marked counting process Ni(t), with observations of Yi(t) taken at points where dNi(t) = 1 (i.e., where the counting process jumps). While our focus will primarily be on full data with fixed (as opposed to random) observation times, many of the topics we discuss can be applied to settings where observation times are random. For instance, a key consideration in modeling longitudinal data is whether the missing data process defined by R is independent of the responses Y; for continuous-time processes, the consideration is whether N(t) is in some sense independent of Y(t). We will return to this in later chapters; for a full treatment, see Lin and Ying (1999 JASA).
5.2.3 Dropout and other processes leading to missing responses
Any number of events can lead to missing data. Commonly encountered examples include
(a) missed visits, either at random or for reasons related to response, such as when a patient in a study of depression fails to show up when he is experiencing symptoms;
(b) withdrawal from a study, decided either by the participant or by the investigator conducting the study; common examples in pharmacologic trials are withdrawal due to side effects, toxicity, or lack of efficacy;
(c) loss to follow up, distinguished from withdrawal because the reasons are not reported;
(d) death or disabling event, possibly related to the outcome but sometimes not; for example in a longitudinal study of HIV, accidental death is not outcome related;
(e) missingness by design, as in a longitudinal survey where only a subset of individuals is selected for follow up.
This certainly is not exhaustive but covers many reasons for missing data in longitudinal studies of various types. Adding to the complexity of handling missing data is that the types of missingness listed above may have different causes, and in most cases should be treated differently.
Withdrawal for lack of efficacy is a different process than withdrawal for toxicity; outcome related mortality must be treated differently from death by other causes.
Our focus throughout the book is primarily on dropout, and for simplicity we begin with the assumption that dropout – and its relation to the response process – can be captured using a single random variable. Although it seems overly simplistic, this approach can sometimes be used when dropout has multiple causes (Hogan and Laird, 1997). When there are distinct types of dropout such that they are related to outcome in different ways, then it is straightforward to introduce a multinomial version of the missing data indicator, and many of the same ideas discussed here will apply directly (Scharfstein et al; Scharfstein and Rotnitzky).
Before moving forward to describing models for incomplete data, it is necessary to define two important terms, dropout and monotone missing data pattern. For many of the models we discuss in this and subsequent chapters, monotone missingness is a key requirement. For temporally aligned longitudinal data, missingness is characterized in terms of the missingness indicators R = (R1, . . . , RJ)T. Using these, we can define both terms.
Definition 5.1. Dropout process. For full-data responses Y1, . . . , YJ scheduled to be recorded at times t1, . . . , tJ, let R1, . . . , RJ denote the missing data indicators, with Rj = 1 if Yj is observed and Rj = 0 if missing. A missing data process is a dropout process if for some j < J, Rj = 0 ⇒ Rj+k = 0 for all 1 ≤ k ≤ J − j. The dropout time is tj. □
Missingness that does not lead to dropout usually is called intermittent missingness because rather than truncating the longitudinal process, it creates gaps. Dropout that occurs in the absence of intermittent missingness leads to a monotone pattern for the responses (see Figure xx).
Definition 5.2. Monotone missing data pattern (monotone dropout).
A missing data pattern is monotone if, for each individual, there exists a measurement occasion j such that R1 = · · · = Rj−1 = 1 and Rj = Rj+1 = · · · = RJ = 0; that is, all responses are observed through time j − 1, and no responses are observed thereafter. □
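The two definitions can be checked mechanically for a given indicator vector. The following is a minimal sketch; the function name `classify_pattern` and its labels are hypothetical, and a fully observed vector is reported separately as "complete" because Definition 5.2 requires at least one missing occasion.

```python
def classify_pattern(r):
    """Classify indicators r = (R_1, ..., R_J), where 1 means observed.

    Returns (label, occasion); occasion is the 1-based index j of the first
    missing response for a monotone pattern, and None otherwise.
    """
    if all(rj == 1 for rj in r):
        return "complete", None
    first_missing = r.index(0)            # 0-based position of first R_j = 0
    if any(rj == 1 for rj in r[first_missing:]):
        return "intermittent", None       # a response is observed after a gap
    return "monotone", first_missing + 1  # observed through j - 1, then never


print(classify_pattern([1, 1, 0, 0]))  # ('monotone', 3)
print(classify_pattern([1, 0, 1, 0]))  # ('intermittent', None)
print(classify_pattern([1, 1, 1, 1]))  # ('complete', None)
```

A vector such as (1, 0, 1, 0) combines a gap with later dropout; the sketch labels it "intermittent" because the overall pattern is not monotone.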
5.3 Full-data models and missing data mechanism
We are now prepared to describe models that can be used for drawing inference about a full-data parameter, say θ, from incompletely observed data. To meet this objective, we begin with a model for the joint distribution of the full data, p(y, r | x, ω), indexed by a parameter ω. The parameter of interest is the one indexing the full data response distribution p(y | x, ω). The target of inference θ is a subset or function of ω.
By specifying p(y, r | x, ω), the analyst either implicitly or explicitly specifies both the full-data response model p(y | x, ω) and a missing data mechanism p(r | y, x, ω). This section describes how the three are related.
5.3.1 Targets of inference
In almost all practical settings, the analyst is interested in making inferences about a parameter θ that indexes the full-data response model
p(y | x, θ) = p(y1, y2, . . . , yJ | x, θ).
Examples include the full-data mean µ(θ) = E(Yi | θ), or regression coefficients β = β(θ) in a linear model E(Yi | Xi, β).
When data are completely observed, p(y | x, θ) can be specified directly.
If responses are not fully observed, θ is a function of the parameter ω indexing a larger model p(y, r | x, ω) of the full data (Y , R, X). The form of the model for the responses will depend on specifications and assumptions on the full data model.
Definition 5.3. Full-data model. Let Y denote the full-data response vector for an individual, let R denote the associated vector of missingness indicators, and let X represent covariates of interest. The full data model describes the joint distribution of Y and R, conditionally on covariates X, is indexed by a finite-dimensional parameter ω, and is written
p(y, r | x, ω). □
The full-data response model is determined by the full-data model.
Definition 5.4. Full-data response model. The full-data response model characterizes the distribution of Y conditionally on covariates X. It is indexed by a parameter θ = θ(ω) that is a subset or function of the full-data parameter ω,∗
$$
p(y \mid x, \theta(\omega)) = \int p(y, r \mid x, \omega) \, dr.
$$
□
Inference about θ will depend crucially on choices made for the specification of the full-data model p(y, r | x, ω); however, observed data offer no information to validate these choices. Assumptions made by the data analyst will generally exert considerable influence over final inferences, even for very large samples of observed data. This point can sometimes be lost when using widely adopted models based on assumptions like missing at random (MAR) and widely available in standard software packages. Many commonly-used approaches, such as posterior inference under ignorability, can be viewed as posterior inference about a full data model under some very specific assumptions about the joint distribution of Y and R and/or constraints on the parameter ω.
∗ The integral representation here is meant to be very general; although R is generally discrete in our discussions to this point, we use a slight abuse of notation, writing ‘p(r)dr’ rather than ‘dP(r)’, to maintain consistency of notation and clarity of ideas.
5.3.2 Missing data mechanisms and the full data model
In general terms, the missing data mechanism is the stochastic mechanism that leads to missingness among elements of Y. Formally, we will use the term missing data mechanism to refer to the conditional distribution of missing data indicators R given the full-data response Y and covariates X, p(r | y, x, ω). Generally speaking, specification of p(y, r | x, ω) will imply a missing data mechanism; in practice, the modeler often begins with a working assumption about the missing data mechanism and uses it to specify the full data model. In what follows, we define what is meant by missing data mechanism and then describe a hierarchy of assumptions about this mechanism.
Definition 5.5. Missing data mechanism. The missing data mechanism is the model for the joint distribution of missing data indicators R as a function of Y and X; it is indexed by a finite-dimensional parameter ψ = ψ(ω) and written as
p(r | y, x, ψ(ω)). □
Any full data model can therefore be factored as the product of a full-data response model and the associated missing data mechanism,
p(y, r | x, ω) = p(y | x, θ(ω)) p(r | y, x, ψ(ω)). (5.1)
Importantly, being able to characterize the missing data mechanism does not depend on whether the full-data model has been specified according to (5.1), which is commonly referred to as a selection model factorization.
The implied missing data mechanism can, in principle, be derived from any specification of the full data model, but it may not always take a closed form.
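To make the relationship among the three objects concrete, the sketch below starts from a full-data model given as a table and derives the response model and the implied missing data mechanism. Everything in it is an assumption for illustration: Y and R are both binary, there are no covariates, and the probabilities in `joint` are invented.

```python
from fractions import Fraction as F

# Hypothetical full-data model p(y, r) for binary Y and R; keys are (y, r).
joint = {(0, 0): F(1, 10), (0, 1): F(4, 10), (1, 0): F(3, 10), (1, 1): F(2, 10)}

# Full-data response model: p(y) is the sum of p(y, r) over r.
p_y = {y: sum(p for (yy, _), p in joint.items() if yy == y) for y in (0, 1)}

# Implied missing data mechanism: p(r | y) = p(y, r) / p(y); keys are (r, y).
p_r_given_y = {(r, y): joint[(y, r)] / p_y[y] for (y, r) in joint}

# The selection model factorization p(y, r) = p(y) p(r | y) holds exactly.
assert all(joint[(y, r)] == p_y[y] * p_r_given_y[(r, y)] for (y, r) in joint)

print(p_y[1])               # 1/2
print(p_r_given_y[(1, 0)])  # 4/5  (probability Y is observed when y = 0)
print(p_r_given_y[(1, 1)])  # 2/5  (probability Y is observed when y = 1)
```

In this invented table the probability of observing Y differs between y = 0 and y = 1, so the implied mechanism depends on the response even though the model was specified as a joint table rather than through (5.1).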
5.4 Common assumptions about the missing data mechanism
Restrictions on the missing data mechanism can be classified as missing completely at random (MCAR), missing at random (MAR), or missing not at random (MNAR); these progressively weaker assumptions delineate the dependence of the missing data indicators R on the observed and missing parts of the full-data response vector Y. The taxonomy and its terminology were developed by Rubin (1976) and generalized in a series of papers on coarsening by Heitjan and colleagues (198x). Related work can be found in Keiding etc (1994) and Gill and Robins (1997).
To define missing data assumptions, it is useful to rewrite the missing data mechanism as
p(r | y, x, ψ) = p(r | yobs, ymis, x, ψ)
to emphasize dependence of R on the observed and missing components of Y . Although we write ψ, it is assumed that ψ = ψ(ω) unless stated otherwise.
5.4.1 Missing completely at random (MCAR)
Definition 5.6. Missing completely at random. Missing responses are missing completely at random (MCAR) if, for all x and ψ,
p(r | y, x, ψ) = p(r | x, ψ);
i.e., if p(r | y, x, ψ) is a constant function of y for all x and ψ. □
One implication of MCAR is that missingness can be fully explained by covariates X that are included in the full-data model. Another is that the full-data distribution can be factored as
p(y, r | x, ω) = p(y | x, ω) p(r | x, ω),
meaning that the observed and missing response data have the same distribution, conditionally on X. If in addition the full-data parameter ω can be separated as (θ, ψ), then inference about the full-data response distribution can be based solely on those with complete response data.
Even though this is a valid approach, it may not make optimal use of the available data and posteriors for θ may have unnecessarily high variance.
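A small simulation can illustrate why a complete-case analysis is valid under MCAR. This is only a sketch under invented assumptions: a single response drawn from a normal distribution with mean 50 and standard deviation 10, no covariates, and an 80 percent chance of being observed that does not depend on the response.

```python
import random
import statistics

random.seed(1)
n = 20000

# Hypothetical full data: Y2 ~ Normal(50, 10) for every individual.
y2 = [random.gauss(50, 10) for _ in range(n)]

# MCAR: each Y2 is observed with probability 0.8, regardless of its value.
r = [1 if random.random() < 0.8 else 0 for _ in range(n)]

full_mean = statistics.fmean(y2)                                 # uses all Y2
cc_mean = statistics.fmean(y for y, rj in zip(y2, r) if rj == 1)  # completers

print(round(full_mean, 2), round(cc_mean, 2))
```

The two means agree up to sampling error, but the complete-case mean is based on roughly 80 percent of the sample, which is the loss of efficiency noted above.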
Example 5.3. Missing completely at random without covariates. A longitudinal cohort study of school performance will record test scores each year for two years of secondary school; these are Y1 and Y2. In the first year, 1000 students are sampled and their test scores are recorded. In the second year, budget constraints force the investigators to reduce the sample size to 800. Data on Y2 is collected for a random subsample of the original 1000 students. Because the subsample is randomly drawn, missing responses on Y2 are missing completely at random. □
Example 5.4. Missing completely at random with covariates (continuation of Example 5.3). In the same study, suppose the investigators are interested in comparing test scores between boys and girls, and let X denote gender. The distribution of interest is [Y | X]. Suppose further that the random subsample at time 2 oversamples girls, so that girls are more likely to have Y2 observed. Because gender is a model covariate, and selection within each gender is random, the missing data on Y2 is missing completely at random. □
The MCAR assumption usually is not realistic for longitudinal studies because unplanned missingness is so common. Consider the smoking cessation trial in Example ??, where Y = (Y1, . . . , Y12) are the smoking indicators measured weekly over the course of the study, X is the binary indicator of treatment, and R = (R1, . . . , R12) are the binary missingness indicators. The full-data response model of interest is p(y | x, θ), which characterizes the joint distribution of smoking outcomes conditionally on treatment group X. The MCAR assumption states that p(r | y, x, ψ) = p(r | x, ψ); in this context, it means that probability of nonresponse can depend on treatment group, but within treatment group, nonresponse is completely independent of smoking outcomes. Under the plausible (and testable) scenario that, within treatment group, participants observed to be heavier smokers are more likely to drop out, the MCAR assumption would not hold.
5.4.2 Missing at random (MAR)
A more realistic condition for many longitudinal studies is missing at random (MAR), which essentially requires that the probability of nonresponse is a function of observed responses Yobs and model covariates X.
Definition 5.7. Missing at random. Missing responses are missing at random (MAR) if, for all yobs, x and ψ,
p(r | yobs, ymis, x, ψ) = p(r | yobs, x, ψ);
i.e., if p(r | yobs, ymis, x, ψ) is a constant function of ymis for all yobs, x and ψ. □
The MAR assumption has two important implications for model-based inference. First, MAR implies that missingness can be explained by observed responses Yobs and model covariates X. If, in addition, the parameters θ and ψ are distinct, then the observed-data likelihood is a function of θ only. If we impose the additional condition that θ and ψ are a-priori independent, then the observed data posterior is a function of θ only, and the missing data mechanism does not have to be specified in order to obtain posterior inference about θ. This combined set of assumptions constitutes the ignorability condition, discussed in detail in Section 5.7.
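The reasoning behind the likelihood statement can be written out in a few lines. The derivation below is a sketch that uses the selection model factorization (5.1), assumes θ and ψ are distinct, and treats the missing responses as continuous so that they can be integrated out; the notation L(θ, ψ | yobs, r, x) for the observed-data likelihood is introduced here for illustration.

```latex
\begin{aligned}
L(\theta, \psi \mid y_{obs}, r, x)
  &= \int p(y_{obs}, y_{mis} \mid x, \theta)\,
          p(r \mid y_{obs}, y_{mis}, x, \psi)\, dy_{mis} \\
  &= p(r \mid y_{obs}, x, \psi)
     \int p(y_{obs}, y_{mis} \mid x, \theta)\, dy_{mis}
     && \text{(by MAR)} \\
  &= p(r \mid y_{obs}, x, \psi)\; p(y_{obs} \mid x, \theta).
\end{aligned}
```

The first factor does not involve θ, so as a function of θ the observed-data likelihood is proportional to p(yobs | x, θ). This is the sense in which the missing data mechanism can be left unspecified when the target of inference is θ.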
Example 5.5. A missing at random mechanism (continuation of Example 5.3). If those with lower values of the first test score Y1 are less likely to sit for the second test, then missingness in Y2 depends on Y1. If missingness further depends on X, but does not depend on any other variable, then the missing data mechanism is MAR. □
The implications of the MAR condition for dropout mechanisms require some more development and are discussed in more detail in Section 5.5.
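The mechanism in Example 5.5 can be mimicked in a simulation to show what changes relative to MCAR. The sketch below rests on invented assumptions: the two test scores follow a linear model with normal errors, there is no covariate X, and the chance of sitting the second test is 0.9 for students with Y1 of at least 50 and 0.4 otherwise. The regression adjustment is one simple correction chosen for illustration, not a method prescribed by the text, and it relies on the linear model for Y2 given Y1 being correct.

```python
import random
import statistics

random.seed(2)
n = 20000

# Hypothetical full data: Y1 ~ Normal(50, 10), Y2 = 10 + 0.8 * Y1 + noise.
y1 = [random.gauss(50, 10) for _ in range(n)]
y2 = [10 + 0.8 * a + random.gauss(0, 5) for a in y1]

# MAR: the chance of observing Y2 depends only on the always-observed Y1.
r = [1 if random.random() < (0.9 if a >= 50 else 0.4) else 0 for a in y1]

obs = [(a, b) for a, b, rj in zip(y1, y2, r) if rj == 1]
cc_mean = statistics.fmean(b for _, b in obs)  # complete-case mean of Y2

# Least-squares fit of Y2 on Y1 among completers, then average the
# predictions over everyone, including those with Y2 missing.
m1 = statistics.fmean(a for a, _ in obs)
slope = (sum((a - m1) * (b - cc_mean) for a, b in obs)
         / sum((a - m1) ** 2 for a, _ in obs))
intercept = cc_mean - slope * m1
reg_mean = statistics.fmean(intercept + slope * a for a in y1)

full_mean = statistics.fmean(y2)  # available only because this is simulated
print(round(full_mean, 2), round(cc_mean, 2), round(reg_mean, 2))
```

In this setup the complete-case mean overstates the full-data mean because students with low Y1 are under-represented, while the estimate that conditions on Y1 is close to the full-data mean. The correction works here because, under MAR, the distribution of Y2 given Y1 is the same for students with and without an observed Y2.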
5.4.3 Missing not at random (MNAR)
Although the MAR assumption is fairly general in that it allows the missing data mechanism (and sometimes the missing data itself) to be explained by observables, there are cases where MAR may fail to hold.
MAR will not be valid, for example, when the probability of missingness depends on the value of the missing response – or on other unobservables – even after conditioning on observed data. We refer to mechanisms of