CHAPTER 2: LITERATURE REVIEW
2.3 Bayesian Methods and Missing Data Approaches
2.3.1 Bayesian MCMC
Bayesian methods are increasingly popular in social science and other application areas where the data are observations from an informative sample. An informative sampling design leads to inclusion probabilities that are correlated with the response variable of interest. Under informative sampling, model inference performed on the observed sample will be biased for the population generative model, since the balance of information in the sample differs from that in the population.
Chen (2004) noted that the parametric modeling approach has two disadvantages: (1) parametric covariate models are not robust to model misspecification, so a large bias may be introduced into the regression parameter estimator if the model is misspecified, and (2) computing the estimates of the regression parameters involves intractable integrations when either a nonlinear regression model or a nonnormal covariate model is involved.
However, Bayesian methods can mitigate the first disadvantage by estimating the prior from historical data, or from SRS data in our design, and by employing a noninformative hyperprior if needed. Moreover, because posterior sampling does not require the integrals to be evaluated in closed form, Bayesian methods can accommodate nonnormal covariate models.
Bayesian MCMC methods for missing data were first proposed by Tanner and Wong (1987), who sampled the missing data iteratively from their conditional distributions. Let $X_{\mathrm{mis}}$ and $X_{\mathrm{obs}}$ be the missing values and observed values, respectively, and let $\theta$ be the parameters of the sampling model. Then at each iteration $t$, sample
\[
X_{\mathrm{mis}}^{(t)} \sim p(X_{\mathrm{mis}} \mid X_{\mathrm{obs}}, \theta^{(t-1)}),
\]
\[
\theta^{(t)} \sim p(\theta \mid X_{\mathrm{obs}}, X_{\mathrm{mis}}^{(t)}).
\]
After convergence of the Gibbs sampler, we can treat the sampled values of $\theta$ as draws from the marginal posterior distribution $p(\theta \mid X_{\mathrm{obs}})$. Inferences about $\theta$ can then be made using the posterior samples.
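The two-step scheme above can be sketched for a deliberately simple case. The model here is an assumption made for illustration only: unit-variance normal data with unknown mean θ, a flat prior on θ, and values missing completely at random. The function name `data_augmentation` and the toy data are hypothetical.

```python
import random
import statistics


def data_augmentation(x_obs, n_mis, n_iter=4000, burn_in=1000, seed=0):
    """Two-step sampler for the mean theta of N(theta, 1) data.

    Assumptions (illustrative only): unit variance, a flat prior on theta,
    and n_mis values missing completely at random.
    """
    rng = random.Random(seed)
    n = len(x_obs) + n_mis
    sum_obs = sum(x_obs)
    theta = 0.0  # arbitrary starting value
    draws = []
    for t in range(n_iter):
        # Step 1: draw X_mis from p(X_mis | X_obs, theta) = N(theta, 1)
        x_mis = [rng.gauss(theta, 1.0) for _ in range(n_mis)]
        # Step 2: draw theta from p(theta | X_obs, X_mis) = N(mean, 1/n)
        completed_mean = (sum_obs + sum(x_mis)) / n
        theta = rng.gauss(completed_mean, (1.0 / n) ** 0.5)
        if t >= burn_in:
            draws.append(theta)
    return draws


x_obs = [1.2, 0.7, 2.1, 1.5, 0.9, 1.8, 1.1, 1.4]
draws = data_augmentation(x_obs, n_mis=4)
print(statistics.mean(draws), statistics.mean(x_obs))
```

In this toy model the marginal posterior of θ given the observed values is normal, centred at the observed mean with variance one over the number of observed values, so the retained draws should have a mean close to the observed mean.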
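Fully Bayesian methods for missing covariate data in regression problems are quite straightforward conceptually. To carry out inference for $(\beta, \alpha)$ based on the observed-data posterior, given by
\[
p(\beta, \alpha \mid Y, X_{\mathrm{obs}}) \propto \prod_{i=1}^{n} \left( \int p(Y_i \mid \beta, X_{\mathrm{obs},i}, X_{\mathrm{mis},i}) \times p(X_{\mathrm{mis},i} \mid X_{\mathrm{obs},i}, \alpha) \, dX_{\mathrm{mis},i} \right), \tag{2.21}
\]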
we do the following:
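1. Specify the covariate distribution $p(X_{\mathrm{mis},i} \mid X_{\mathrm{obs},i}, \alpha)$.
2. Specify a joint prior distribution for $(\beta, \alpha)$, in which $\beta$ and $\alpha$ can be taken to be independent or dependent a priori. The joint prior distribution can be proper or improper.
3. To sample from the posterior $p(\beta, \alpha \mid Y, X_{\mathrm{obs}})$, do the following:
– Sample from $p(\beta \mid Y, X_{\mathrm{obs}}, \alpha, X_{\mathrm{mis}})$
– Sample from $p(\alpha \mid Y, X_{\mathrm{obs}}, \beta, X_{\mathrm{mis}})$
– Sample from $p(X_{\mathrm{mis}} \mid Y, X_{\mathrm{obs}}, \beta, \alpha)$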
If the observed data likelihood can be factored, then the efficiency of the Gibbs sampler can be increased by sampling the parameters according to that factorization.
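The three sampling steps can be sketched with a small Gibbs sampler. The model is an assumption chosen so that every full conditional is normal: a response Y_i ~ N(β X_i, 1), a single covariate X_i ~ N(α, 1), flat priors on β and α, and covariates missing completely at random. The helper names `gibbs_missing_covariate` and `simulate` are hypothetical.

```python
import random
import statistics


def gibbs_missing_covariate(y, x, n_iter=2000, burn_in=500, seed=1):
    """Gibbs sampler for Y_i ~ N(beta * X_i, 1), X_i ~ N(alpha, 1).

    Entries of x equal to None are treated as missing covariates.
    Flat priors on beta and alpha are assumed for simplicity.
    """
    rng = random.Random(seed)
    n = len(y)
    missing = [i for i, xi in enumerate(x) if xi is None]
    start = statistics.mean(xi for xi in x if xi is not None)
    x_cur = [start if xi is None else xi for xi in x]
    beta_draws, alpha_draws = [], []
    for t in range(n_iter):
        # Sample from p(beta | Y, X_obs, alpha, X_mis): normal regression update
        sxx = sum(xi * xi for xi in x_cur)
        sxy = sum(xi * yi for xi, yi in zip(x_cur, y))
        beta = rng.gauss(sxy / sxx, (1.0 / sxx) ** 0.5)
        # Sample from p(alpha | Y, X_obs, beta, X_mis): depends on X only
        alpha = rng.gauss(sum(x_cur) / n, (1.0 / n) ** 0.5)
        # Sample from p(X_mis | Y, X_obs, beta, alpha), one unit at a time
        prec = beta * beta + 1.0
        for i in missing:
            mean_i = (beta * y[i] + alpha) / prec
            x_cur[i] = rng.gauss(mean_i, (1.0 / prec) ** 0.5)
        if t >= burn_in:
            beta_draws.append(beta)
            alpha_draws.append(alpha)
    return beta_draws, alpha_draws


def simulate(n=200, beta=2.0, alpha=1.0, p_missing=0.3, seed=0):
    """Simulate data with covariates missing completely at random."""
    rng = random.Random(seed)
    x_full = [rng.gauss(alpha, 1.0) for _ in range(n)]
    y = [rng.gauss(beta * xi, 1.0) for xi in x_full]
    x = [None if rng.random() < p_missing else xi for xi in x_full]
    return y, x


y, x = simulate()
beta_draws, alpha_draws = gibbs_missing_covariate(y, x)
print(statistics.mean(beta_draws), statistics.mean(alpha_draws))
```

Because the covariate model in this sketch does not involve β, the update for α depends only on the completed covariates, which is a small instance of sampling according to a factorization of the likelihood.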
Ibrahim et al. (2002) discussed fully Bayesian methods for MAR covariates in GLMs and considered informative prior elicitation strategies using historical data, where the historical data itself contains missing covariates. Following Ibrahim et al. (2002), let D0 = (n0, Y0, X0) denote the complete historical data, where n0 is the sample size based on the historical data, X0 is the n0 ×p complete-data covariate matrix, and Y0 is the n0 ×1 vector of response variables for the historical data. As with the current dataset, the historical data may also
have missing covariates. Further, denote the $i$th row of $X_0$ by $x_{0i}' = (x_{0i1}, \ldots, x_{0ip})$ and the $i$th component of $Y_0$ by $Y_{0i}$. The joint power prior for $(\beta, \alpha)$ takes the form
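\[
\pi(\beta, \alpha \mid a_0, D_{0,\mathrm{obs}}) \propto \pi^{*}(\beta, \alpha \mid a_0, D_{0,\mathrm{obs}}) \, \pi_0(\beta, \alpha), \tag{2.22}
\]
where
\[
\pi^{*}(\beta, \alpha \mid a_0, D_{0,\mathrm{obs}}) = \prod_{i=1}^{n_0} \int \Big( p(Y_{0i} \mid \beta, X_{0,\mathrm{obs},i}, X_{0,\mathrm{mis},i})^{a_0} \, p(x_{0i1} \mid \alpha_1)^{a_{0i1}} \times \prod_{j=1}^{p-1} p(x_{0i,p-j+1} \mid x_{0i1}, \ldots, x_{0i,p-j}, \alpha_{p-j+1})^{a_{0i,p-j+1}} \Big) \, dX_{0,\mathrm{mis},i}. \tag{2.23}
\]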
Here $p(Y_{0i} \mid \beta, X_{0,\mathrm{obs},i}, X_{0,\mathrm{mis},i})$ is the complete-data likelihood for the $i$th subject with the current data $D = (n, Y, X)$ replaced by the historical data $D_0$; the joint covariate distribution in the above equation is the same as for the current data with $D$ replaced by $D_0$; and $D_{0,\mathrm{obs}} = (n_0, Y_0, X_{0,\mathrm{obs}})$ is the observed historical data.
The term $\pi_0(\beta, \alpha)$ is called the “initial prior” of $(\beta, \alpha)$; that is, $\pi_0(\beta, \alpha)$ is the prior of $(\beta, \alpha)$ before observing the historical data. The quantity $0 < a_0 < 1$ is a scalar prior parameter that weights the complete-data likelihood of the historical data relative to the current study. To properly weight the historical complete-data likelihood, let $a_{0i,p-j+1} = a_0$ if $X_{0i,p-j+1}$ is observed and $a_{0i,p-j+1} = 1$ if $X_{0i,p-j+1}$ is missing, for $i = 1, \ldots, n_0$ and $j = 1, \ldots, p$. The prior parameter $a_0$ can be interpreted as a precision parameter that controls the heaviness of the tails of the joint prior for $(\beta, \alpha)$. It is reasonable to take a vague prior for $\pi_0(\cdot)$ and to take $\beta$ and $\alpha$ to be independent at this stage. The parameter $a_0$ can be taken as fixed or random. When $a_0$ is taken to be random, a beta prior is a reasonable choice (Ibrahim et al., 2002).
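The role of a0 as a weight on the historical likelihood can be seen in a much simpler setting than the one treated by Ibrahim et al. (2002). The sketch below assumes a normal mean with known variance, a flat initial prior, complete data, and a fixed a0; the function name `power_prior_posterior` and the numbers are hypothetical. The endpoints a0 = 0 and a0 = 1 are included only as limiting cases.

```python
def power_prior_posterior(ybar, n, ybar0, n0, a0, sigma2=1.0):
    """Posterior mean and variance of a normal mean under a power prior.

    Assumptions (illustrative only): known variance sigma2, a flat
    initial prior, complete data, and a fixed a0 between 0 and 1.
    """
    # Raising the historical likelihood to the power a0 makes the n0
    # historical observations count as a0 * n0 effective observations.
    n_eff = n + a0 * n0
    mean = (n * ybar + a0 * n0 * ybar0) / n_eff
    var = sigma2 / n_eff
    return mean, var


for a0 in (0.0, 0.5, 1.0):
    print(a0, power_prior_posterior(ybar=1.0, n=10, ybar0=3.0, n0=10, a0=a0))
```

As a0 grows, the posterior mean moves from the current-data mean toward the pooled mean and the posterior variance shrinks, which is the sense in which a0 acts as a precision parameter for the prior.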
A crucial issue with missing covariate data is the specification of a model for the missing covariates. When a parametric distribution is specified for the covariates, the indexing parameters of this distribution are typically viewed as nuisance parameters and usually are not parameters of inferential interest. With many nuisance parameters and large fractions of
missing data, parameter estimation can become too computationally intensive and inefficient. The parametric modeling scheme proposed by Ibrahim et al. (2002), which writes the joint distribution of the covariates as a sequence of one-dimensional conditional distributions, is quite useful in the Bayesian context, since it greatly reduces the number of nuisance parameters that have to be specified and thereby eases the computation.
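Written out for the current data, the sequence of one-dimensional conditional distributions takes the following form. The notation mirrors the covariate factors in (2.23), with $x_{ij}$ the $j$th covariate of subject $i$ and $\alpha_j$ the parameters of the $j$th conditional; it is given here as a reading of that equation, not as a quotation from Ibrahim et al. (2002).

```latex
\[
p(x_{i1}, \ldots, x_{ip} \mid \alpha)
  = p(x_{i1} \mid \alpha_1)
    \prod_{j=1}^{p-1}
    p(x_{i,p-j+1} \mid x_{i1}, \ldots, x_{i,p-j}, \alpha_{p-j+1}),
\qquad \alpha = (\alpha_1, \ldots, \alpha_p).
\]
```

Each factor is a univariate model, for example a logistic regression for a binary covariate, so only the conditionals that involve covariates with missing values need to be specified.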
Ibrahim et al. (2005) reviewed four common approaches for inference in generalized linear models (GLMs) with missing covariate data: maximum likelihood (ML), multiple imputation (MI), fully Bayesian (FB), and weighted estimating equations (WEEs). They used a real dataset and a detailed simulation study to compare the four methods. In comparing the ML, MI, FB, and WEE methods based on correctly specified covariate models, ML, MI, and FB were quite comparable to each other, whereas WEE performed slightly worse.
For non-ignorable informative sampling, one approach is to account for the design by parameterizing it within the Bayesian model (Little, 2004). The Bayesian approach is well equipped to handle complex design features such as clustering through random cluster models (Scott and Smith, 1969), stratification through covariates that distinguish strata, nonresponse (Little, 1982; Rubin, 1987; Little and Rubin, 1986), and response errors. Moreover, by propagating the error in estimating parameters, the Bayesian approach may yield better inferences for small-sample problems where exact frequentist solutions are not available (Little, 2004).
The specification of the joint distribution of the data and the missing data mechanism mainly focuses on two types of models: selection models and pattern-mixture models (Glynn et al., 1986; Little, 1993).
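As in a general missing-data regression problem, let $W = (W_{ij})$ denote a rectangular data set involving the response and all covariates, where $i = 1, \ldots, n$ indexes individuals and $j = 1, \ldots, k$ indexes variables. We partition $W$ into observed and missing values, $W = (W_{\mathrm{obs}}, W_{\mathrm{mis}})$. Let $R = (R_{ij})$ be the missing-data indicator for $W$, with value 1 if $W_{ij}$ is observed and 0 if $W_{ij}$ is missing. The joint distribution of the full data is
\[
f(W, R \mid \beta, \theta) = f(W_{\mathrm{obs}}, W_{\mathrm{mis}}, R \mid \beta, \theta). \tag{2.24}
\]
Selection models specify the joint distribution of $W_i$ and $R_i$ through models for the marginal distribution of $W_i$ and the conditional distribution of $R_i$ given $W_i$:
\[
f(W_{\mathrm{obs}}, W_{\mathrm{mis}}, R \mid \gamma, \phi) = f(W_{\mathrm{obs}}, W_{\mathrm{mis}} \mid \gamma) \, f_{R \mid W}(R \mid W_{\mathrm{obs}}, W_{\mathrm{mis}}, \phi). \tag{2.25}
\]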
An advantage of the selection model factorization is that it includes the model of interest, $f(W_{\mathrm{obs}}, W_{\mathrm{mis}} \mid \gamma)$, directly.
On the other hand, pattern-mixture models specify the marginal distribution of $R_i$ and the conditional distribution of $W_i$ given $R_i$:
\[
f(W_{\mathrm{obs}}, W_{\mathrm{mis}}, R \mid \delta, \nu) = f(R \mid \delta) \, f_{W \mid R}(W_{\mathrm{obs}}, W_{\mathrm{mis}} \mid R, \nu). \tag{2.26}
\]
The pattern-mixture model corresponds more directly to what is actually observed, i.e., the distribution of the data within subgroups having different missing-data patterns.
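The two factorizations describe the same joint distribution from different directions, which a toy discrete example makes concrete. The sketch below assumes a single binary variable W with a binary indicator R; the probabilities and function names are hypothetical and chosen only for illustration.

```python
from itertools import product

# Hypothetical toy example: a single binary variable W and its
# missingness indicator R (1 = observed, 0 = missing).

# Selection-model ingredients: f(W) and f(R | W)
f_w = {0: 0.6, 1: 0.4}
f_r_given_w = {0: {1: 0.9, 0: 0.1},   # W = 0 is observed 90% of the time
               1: {1: 0.5, 0: 0.5}}   # W = 1 is observed 50% of the time


def joint_from_selection(f_w, f_r_given_w):
    """Joint f(W, R) = f(W) * f(R | W)."""
    return {(w, r): f_w[w] * f_r_given_w[w][r]
            for w, r in product(f_w, (0, 1))}


def pattern_mixture_factors(joint):
    """Recover f(R) and f(W | R) from the joint distribution."""
    f_r = {r: sum(p for (w, rr), p in joint.items() if rr == r)
           for r in (0, 1)}
    f_w_given_r = {r: {w: joint[(w, r)] / f_r[r] for w in (0, 1)}
                   for r in (0, 1)}
    return f_r, f_w_given_r


joint = joint_from_selection(f_w, f_r_given_w)
f_r, f_w_given_r = pattern_mixture_factors(joint)
print(f_r)
print(f_w_given_r)
```

In this example the distribution of W among units with R = 0 differs from the distribution among units with R = 1, which is exactly the within-pattern information that the pattern-mixture form makes explicit.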
However, parameterizing an informative design is sometimes difficult to accomplish and may disrupt the desired inference by requiring a change to the underlying population model parameterization. Another approach incorporates the sampling weights into inference about the population, but requires a particular form for the likelihood that does not allow analysts to impose their own population model formulation of inferential interest. For example, Dong et al. (2014) specified an empirical likelihood, while Kunihama et al. (2016) constructed a non-parametric mixture for the likelihood and Rao and Wu (2010) used a sampling-weighted (pseudo) empirical likelihood. All of these approaches impose Dirichlet distribution priors for the mixture components, with hyperparameters specified as a function of the first-order sampling weights. Si et al. (2015) regress the response variable on a Gaussian process function of the weights for sampling designs where sub-groups of sampled units have equal weights (e.g., a stratified sampling design). These approaches are designed for inference about simple mean and total statistics, rather than for inference about the parameters of an analyst-specified population model, which is the focus of our proposed method.
Savitsky et al. (2016) constructed a sampling-weighted pseudo posterior distribution by raising each unit likelihood contribution, under the analyst-specified model, to the power of its sampling weight, to produce $p(y_i \mid \delta_i = 1, \lambda)^{w_i}$, where $\delta_i = 1$ indicates that unit $i$ is included in the sample. Exponentiating by the sampling weight, $w_i \propto 1/\pi_i$, constructs the pseudo likelihood, which is combined with the prior distributions for the model parameters, $\lambda$, to estimate the pseudo posterior. Savitsky et al. (2016) demonstrate that estimation of the model parameters from the pseudo posterior distribution is asymptotically unbiased. This approach provides a “plug-in” approximation to the population likelihood (for $n$ observations), in that the sampling inclusion probabilities, $\pi_i$, are assumed fixed.
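The effect of exponentiating by the sampling weights can be sketched in a case where the pseudo posterior has a closed form. The assumptions here are illustrative and not taken from Savitsky et al. (2016): a normal population model with unit variance and unknown mean λ, a flat prior, Poisson sampling with inclusion probabilities that increase with the response, and weights rescaled to sum to the sample size. The function names are hypothetical.

```python
import math
import random


def pseudo_posterior_normal_mean(y, w):
    """Pseudo posterior of a normal mean (unit variance, flat prior).

    Each unit likelihood N(y_i | lam, 1) is raised to the power w_i, so
    the pseudo posterior is N(sum(w * y) / sum(w), 1 / sum(w)).
    The weights are rescaled to sum to the sample size (an assumption
    of this sketch).
    """
    n = len(y)
    scale = n / sum(w)
    w = [wi * scale for wi in w]
    mean = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
    var = 1.0 / sum(w)
    return mean, var


def informative_sample(pop, rng):
    """Poisson sampling with inclusion probability increasing in y."""
    y, w = [], []
    for yi in pop:
        pi = 0.1 / (1.0 + math.exp(-yi))
        if rng.random() < pi:
            y.append(yi)
            w.append(1.0 / pi)
    return y, w


rng = random.Random(0)
pop = [rng.gauss(0.0, 1.0) for _ in range(20000)]
y, w = informative_sample(pop, rng)
pop_mean = sum(pop) / len(pop)
unweighted_mean, _ = pseudo_posterior_normal_mean(y, [1.0] * len(y))
weighted_mean, _ = pseudo_posterior_normal_mean(y, w)
print(pop_mean, unweighted_mean, weighted_mean)
```

With equal weights the function returns the ordinary posterior, which is pulled upward because large responses are over-represented in the sample; with inverse-probability weights the pseudo posterior mean is centred much closer to the population mean.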