CHAPTER 4: BAYESIAN INFERENCE FOR LIKELIHOOD-DEPEND-

4.2 Designs and Data Structure

We begin this section by reviewing the notation introduced in Chapter 2.

• N is the total study population size, n0 is the size of the simple random sample (SRS), and ns is the size of the supplemental sample;

• Y is the continuous outcome variable, and X and Z are the exposure variables, where X is the expensive exposure of interest and Z is the collection of all other covariates;

• We assume that the conditional density of Y given X and Z is f(Y | X, Z; θ), where f(· | ·) is known up to a d-dimensional parameter θ of interest. When considering the linear model, we have

$$Y = \beta_0 + \beta_1 X + \beta_2 Z + \varepsilon \tag{4.1}$$

where ε ∼ N(0, σ²) and the ε's are independent.
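For reference, under the linear model (4.1) with normal errors, the conditional density and the factorization of the conditional likelihood used in the rest of this section can be written out explicitly. This is a standard consequence of the normality assumption rather than an additional modeling choice; g(X | Z; γ) denotes the distribution of X given Z.

$$f(Y \mid X, Z; \theta) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{-\frac{(Y - \beta_0 - \beta_1 X - \beta_2 Z)^2}{2\sigma^2}\right\}, \qquad f(X, Y \mid Z) = f(Y \mid X, Z; \theta)\, g(X \mid Z; \gamma)$$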

The proposed LDS design is as follows. In the first phase, we observe (Y, Z) for each member of the full cohort and X on a simple random sample (SRS) of size n0. Based on this information, we estimate the conditional likelihood that is used to select the second-phase sample. Specifically, for the subjects without X observed, we estimate the conditional likelihood f(X, Y | Z) by f̂(Y | X̂, Z; θ) ĝ(X̂ | Z; γ), where X̂ is the predicted value of the unobserved X and ĝ(X | Z; γ) is the Bayesian prior that will be detailed later. We then select the supplemental sample as an SRS of size ns from the subjects whose f̂(X̂, Y | Z) falls in the lower end of the f̂(X̂, Y | Z) values. We take the selection range to be the ρ% lowest values of f̂(X̂, Y | Z), where ρ% is a pre-specified percentage. Let f̂(X̂, Y | Z)_(1:N−n0), where the parenthesized subscript denotes order statistics, be the increasing sequence of f̂(X̂, Y | Z) for the first-phase cohort subjects that are not in the SRS. The data structure of the proposed LDS design can then be summarized as follows:

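$$
\begin{aligned}
&\text{SRS sample: } \{X_{0i}, Y_{0i}, Z_{0i}\}, \quad i = 1, \dots, n_0; \\
&\text{Supplemental sample: } \{X_{si}, Y_{si}, Z_{si} : \hat{f}(\hat{X}_{si}, Y_{si} \mid Z_{si}) \in \{\hat{f}(\hat{X}, Y \mid Z)_{(1:N_s)}\}\}, \quad i = 1, \dots, n_s, \\
&\qquad \text{where } N_s = \lceil (N - n_0)\rho\% \rceil \ge n_s \text{ is the size of the supplemental sample selection range;} \\
&\text{Remaining of Phase 1: } \{Y_{ni}, Z_{ni} : \text{all remaining with } X \text{ unobserved}\}, \quad i = 1, \dots, n_{nonv};
\end{aligned}
$$

where the remainder of Phase 1 is also called the non-validation sample, with sample size nnonv = N − n0 − ns.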
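The selection rule in this data structure can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in this work: the function name `select_supplemental`, the dictionary input format and the toy likelihood values are all assumptions made for the example, and ρ is passed as a fraction rather than a percentage.

```python
import math
import random


def select_supplemental(est_lik, rho, n_s, seed=0):
    """Draw the supplemental sample from the low-likelihood selection range.

    est_lik: dict {subject_id: estimated conditional likelihood} for the
             N - n0 subjects outside the SRS (hypothetical input format).
    rho:     selection-range fraction, e.g. 0.25 for the lowest 25%.
    n_s:     supplemental sample size.
    """
    n_range = math.ceil(len(est_lik) * rho)        # N_s = ceil((N - n0) * rho)
    if n_range < n_s:
        raise ValueError("selection range N_s is smaller than n_s")
    ordered = sorted(est_lik, key=est_lik.get)     # ids by increasing likelihood
    selection_range = ordered[:n_range]            # the N_s lowest values
    rng = random.Random(seed)
    supplemental = rng.sample(selection_range, n_s)  # SRS within the range
    return sorted(supplemental), selection_range


# Toy usage: 20 non-SRS subjects, the lowest 25% form the range, pick 3 of them.
toy = {i: (i + 1) / 100 for i in range(20)}
chosen, candidates = select_supplemental(toy, rho=0.25, n_s=3)
```

The subjects that are neither in the SRS nor in `chosen` make up the non-validation sample.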

To estimate g(X | Z; γ), the prior distribution of X conditional on Z, we can fit common candidate distributions for X given Z to the SRS data and choose the best fitted one. For example, if X is continuous, we can check the normal, uniform, logistic, exponential, gamma, log-normal, beta and Weibull distributions, and choose the best fitted distribution as g(X | Z; γ).
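One way to make "best fitted" concrete is sketched below. The text does not state the fitting criterion, so the use of AIC here is an assumption, as are the function names. For brevity the sketch covers only three of the candidate families (those with closed-form maximum likelihood estimates) and ignores the conditioning on Z; in practice one might fit within levels of Z or model the parameters as functions of Z.

```python
import math
import random


def normal_loglik(x):
    """Maximized normal log-likelihood and number of fitted parameters."""
    n = len(x)
    mu = sum(x) / n
    var = sum((v - mu) ** 2 for v in x) / n        # MLE of the variance
    return -0.5 * n * (math.log(2 * math.pi * var) + 1.0), 2


def exponential_loglik(x):
    """Maximized exponential log-likelihood; -inf if any value is negative."""
    if min(x) < 0:
        return -math.inf, 1
    n = len(x)
    rate = n / sum(x)                              # MLE of the rate
    return n * math.log(rate) - n, 1


def lognormal_loglik(x):
    """Maximized log-normal log-likelihood; -inf unless all values are positive."""
    if min(x) <= 0:
        return -math.inf, 2
    logs = [math.log(v) for v in x]
    loglik, k = normal_loglik(logs)
    return loglik - sum(logs), k                   # Jacobian of the log transform


def best_fit(x):
    """Return the candidate with the smallest AIC, plus the AIC of each candidate."""
    fits = {
        "normal": normal_loglik(x),
        "exponential": exponential_loglik(x),
        "lognormal": lognormal_loglik(x),
    }
    aic = {name: 2 * k - 2 * ll for name, (ll, k) in fits.items()}
    return min(aic, key=aic.get), aic


# Toy usage: an SRS of 100 exposure values drawn from N(0, 1).
rng = random.Random(1)
x_srs = [rng.gauss(0.0, 1.0) for _ in range(100)]
best, aic = best_fit(x_srs)
```

Families whose support excludes some of the observed values receive an infinite AIC and are therefore never selected.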

To predict the unobserved X, we can use the Bayesian MCMC introduced in the next section. By setting priors for θ and γ in the MCMC, we can not only predict the unobserved X but also update the estimates of θ and γ using the posterior means.

4.3 Bayesian MCMC Inference

In the proposed LDS design, the probability of observing X depends only on the observed Y and Z and on the X observed in the SRS, so we have a ‘missing at random’ mechanism (Rubin, 1976). In Bayesian methods, the ‘missing at random’ mechanism is ignorable for posterior inference. Therefore, we employ a Bayesian MCMC algorithm to estimate the parameters of interest using the means of the posterior distribution. The algorithm is summarized as the three-step process below:

Step 1. Set up the priors and the outcome model:

(a) Set an appropriate prior for X conditional on Z as g(X | Z; γ), in which γ is a hyperparameter. Set a hyperprior π(γ) for γ to make the prior flexible.

(b) Set a parametric distribution of Y given X and Z as f(Y | X, Z; θ), such as the linear model in equation (4.1), in which the parameter θ = (β0, β1, β2, σ²)′. Set a prior π(θ) for θ.

Step 2. Given f̂(Y | X, Z; θ) and ĝ(X | Z; γ) estimated from the SRS data, and the observed (Y, Z) of the subjects not selected into the SRS:

(a) Run Bayesian MCMC to estimate the unobserved X of the subjects not selected into the SRS, denoted X̂.

(b) Use f̂(Y | X, Z; θ), ĝ(X | Z; γ) and X̂ to estimate the conditional likelihood f̂(X̂, Y | Z).

(c) Order the f̂(X̂, Y | Z) values, then select the supplemental sample as described in the data structure above. Observe the X values of the subjects in the supplemental sample.

Step 3. Combine all available data as shown in the data structure above and run the Bayesian MCMC algorithm to estimate the parameters.
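Steps 2(a) and 2(b) can be illustrated for one simple special case. The sketch below assumes the linear model (4.1) together with a normal prior X | Z ∼ N(γ0 + γ1Z, τ²), which is only one of the candidate priors, and it plugs in fixed parameter values and the conditional posterior mean of X instead of running a full MCMC over θ, γ and X. The function names and the parameter values are hypothetical.

```python
import math


def normal_pdf(v, mean, sd):
    """Density of N(mean, sd^2) evaluated at v."""
    z = (v - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def x_conditional(y, z, theta, gamma):
    """Mean and sd of X | Y, Z under model (4.1) with X | Z ~ N(g0 + g1*Z, tau^2)."""
    b0, b1, b2, sigma = theta        # sigma is the error standard deviation
    g0, g1, tau = gamma
    prec = 1.0 / tau**2 + b1**2 / sigma**2
    mean = ((g0 + g1 * z) / tau**2 + b1 * (y - b0 - b2 * z) / sigma**2) / prec
    return mean, math.sqrt(1.0 / prec)


def estimated_likelihood(y, z, x_hat, theta, gamma):
    """f(Y | X_hat, Z; theta) * g(X_hat | Z; gamma) for one subject."""
    b0, b1, b2, sigma = theta
    g0, g1, tau = gamma
    f_y = normal_pdf(y, b0 + b1 * x_hat + b2 * z, sigma)
    g_x = normal_pdf(x_hat, g0 + g1 * z, tau)
    return f_y * g_x


# Toy usage with illustrative (not estimated) parameter values.
theta = (1.0, 0.5, -0.5, 1.0)        # (beta0, beta1, beta2, sigma)
gamma = (0.0, 0.0, 1.0)              # (gamma0, gamma1, tau)
x_typ, _ = x_conditional(1.0, 0, theta, gamma)   # a typical outcome
x_ext, _ = x_conditional(4.0, 0, theta, gamma)   # an unusually large outcome
lik_typ = estimated_likelihood(1.0, 0, x_typ, theta, gamma)
lik_ext = estimated_likelihood(4.0, 0, x_ext, theta, gamma)
```

In this toy case the subject with the unusual outcome receives the smaller estimated conditional likelihood, so it would be ranked ahead of the typical subject in Step 2(c).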

Usually, the prior is either chosen from historical information or taken to be non-informative. In our design, the framework of the LDS design helps us better estimate the priors using the SRS data. As in Step 1(a) of the algorithm, using X and Z in the SRS data, we can fit common candidate distributions and choose the best fitted one as g(X | Z; γ). For example, if X is continuous, we can check the commonly used Normal, Uniform, Logistic, Exponential, Gamma, Log-Normal, Beta and Weibull distributions. The hyperprior π(γ) can be either non-informative or informative, using the information in the chosen best fitted distribution. In Step 1(b), we can set π(θ) similarly by using Y, X and Z in the SRS data. R code for this is provided in the Supplementary Materials.

The settings, convergence test and estimation of the asymptotic variance of the Bayesian MCMC are similar to those for the EODS design proposed in Chapter 3.

4.4 Simulation Studies

We evaluate the finite-sample performance of our proposed LDS design via simulation studies. Three competing estimators are compared: (i) β̂_WZ1.0, the estimator of Weaver and Zhou (2005) with a = 1, i.e., the cut-points of the two tails of Y are µY ± 1σY; (ii) β̂_WZ1.5, the estimator of Weaver and Zhou (2005) with a = 1.5, i.e., the cut-points of the two tails of Y are µY ± 1.5σY; (iii) β̂_LDS1.0, the estimator under our proposed LDS design, in which we set ρ% = Φ̄(a) with a = 1.0 so that the same information percentage is used as in Weaver and Zhou (2005). We evaluate the LDS design under two settings: (i) a linear relationship between Y and X with X following a Normal distribution; and (ii) a linear relationship between Y and X with X from a mixture distribution.

In the first set of simulation studies, we let X follow a normal distribution. The data are generated from the model Y = β0 + β1X + β2Z + ε, where X ∼ N(0,1), Z ∼ Ber(0.45) and ε ∼ N(0,1). We set β0 = 1, β2 = −0.5 and the study population size N = 2000, and allow β1 to take the value 0 or 0.5.
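The data-generating model of this first setting can be reproduced with a short Python sketch. The function name `simulate_cohort`, its argument names and the seed handling are assumptions for illustration; only the distributions and parameter values come from the text.

```python
import random


def simulate_cohort(n=2000, beta0=1.0, beta1=0.5, beta2=-0.5, p_z=0.45, seed=0):
    """Generate (Y, X, Z) triples from Y = beta0 + beta1*X + beta2*Z + eps."""
    rng = random.Random(seed)
    cohort = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)                  # X ~ N(0, 1)
        z = 1 if rng.random() < p_z else 0       # Z ~ Bernoulli(0.45)
        eps = rng.gauss(0.0, 1.0)                # eps ~ N(0, 1)
        cohort.append((beta0 + beta1 * x + beta2 * z + eps, x, z))
    return cohort


# One simulated study population of size N = 2000 with beta1 = 0.5.
cohort = simulate_cohort(beta1=0.5)
```

In the design itself, X would then be treated as observed only for the SRS and, later, for the supplemental sample.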

We consider different sample sizes for the SRS (n0) and the supplemental sample (ns for the LDS design, which is equal to n1 + n3 as in Weaver and Zhou (2005)). As we are mainly interested in the relationship between the outcome Y and the expensive exposure X, we focus on the estimation of β1. In each simulation, we find the best fitted common distribution for X and use Gibbs sampling to obtain the posterior of β1 and estimate its mean and standard error. We report the mean of the parameter estimates, which reveals bias if there is any, and their empirical standard error. We perform 12000 iterations with 7000 burn-in iterations in 4 chains for each run of the MCMC algorithm. The Gelman-Rubin criterion is always 1, which indicates convergence of the MCMC. The simulation results based on 1000 simulations are shown in Table 4.1.
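For readers unfamiliar with the Gelman-Rubin criterion, the sketch below computes the basic potential scale reduction factor from several chains of post-burn-in draws. It is a textbook version of the statistic and not necessarily the exact variant computed by the MCMC software used here; the toy chains are independent normal draws standing in for real posterior samples, with 5000 retained draws per chain (12000 iterations minus 7000 burn-in).

```python
import random
import statistics


def gelman_rubin(chains):
    """Basic potential scale reduction factor for equal-length chains."""
    n = len(chains[0])                                   # draws kept per chain
    chain_means = [statistics.fmean(c) for c in chains]
    within = statistics.fmean([statistics.variance(c) for c in chains])  # W
    between = n * statistics.variance(chain_means)                       # B
    var_hat = (n - 1) / n * within + between / n         # pooled variance estimate
    return (var_hat / within) ** 0.5


# Toy usage: 4 well-mixed chains, then the same chains with one shifted.
rng = random.Random(0)
chains = [[rng.gauss(0.0, 1.0) for _ in range(5000)] for _ in range(4)]
r_hat_good = gelman_rubin(chains)
shifted = chains[:3] + [[v + 3.0 for v in chains[3]]]
r_hat_bad = gelman_rubin(shifted)
```

Values close to 1 indicate that the chains agree with each other, while a chain stuck in a different region pushes the statistic well above 1.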

We have the following observations from Table 4.1. (i) All three estimators yield approximately unbiased estimates. (ii) The average of the estimated standard errors is very close to the empirical standard error based on 1000 simulations. (iii) The proposed estimator β̂_LDS1.0 attains the smallest empirical standard error in every setting, tying with β̂_WZ1.5 only when n0 = 200, ns = 100 and β1 = 0. For example, when n0 = 100, ns = n1 + n3 = 50 and β1 = 0.5, SE(β̂_LDS1.0) = 0.058, while SE(β̂_WZ1.0) = 0.065 and SE(β̂_WZ1.5) = 0.062.

Table 4.1: Simulation results of the LDS design with Normal X

                                   β1 = 0                    β1 = 0.5
n0    ns (n1+n3)   Method          Mean     SE / Mean ŜE     Mean     SE / Mean ŜE
100   50           β̂_WZ1.0         0.000    0.069 / 0.068    0.501    0.065 / 0.070
100   50           β̂_WZ1.5         0.000    0.064 / 0.060    0.496    0.062 / 0.064
100   50           β̂_LDS1.0       -0.004    0.060 / 0.059    0.503    0.058 / 0.059
100   100          β̂_WZ1.0        -0.001    0.056 / 0.055    0.500    0.057 / 0.056
100   100          β̂_WZ1.5        -0.003    0.048 / 0.047    0.502    0.057 / 0.053
100   100          β̂_LDS1.0       -0.001    0.045 / 0.046    0.503    0.052 / 0.050
200   50           β̂_WZ1.0         0.003    0.058 / 0.056    0.500    0.050 / 0.052
200   50           β̂_WZ1.5        -0.001    0.051 / 0.051    0.500    0.050 / 0.050
200   50           β̂_LDS1.0       -0.003    0.050 / 0.050    0.499    0.048 / 0.048
200   100          β̂_WZ1.0         0.000    0.049 / 0.048    0.500    0.046 / 0.047
200   100          β̂_WZ1.5         0.000    0.042 / 0.042    0.500    0.045 / 0.044
200   100          β̂_LDS1.0        0.000    0.042 / 0.042    0.498    0.044 / 0.043

* Results are based on the model Y = β0 + β1X + β2Z + ε, where X ∼ N(0,1), Z ∼ Ber(0.45) and ε ∼ N(0,1); the true parameter values are β0 = 1, β1 = 0 or 0.5 and β2 = −0.5, and the study population size is N = 2000. β̂_LDS1.0 is the estimator of our proposed LDS design with a = 1.0; β̂_WZ1.0 and β̂_WZ1.5 are the estimators in Weaver and Zhou (2005) with a = 1.0 and 1.5, respectively. SE is the empirical standard error and Mean ŜE is the mean of the estimated standard errors.

Table 4.2: Simulation results of the LDS design with a mixture distribution of X

                                   β1 = 0                    β1 = 0.5
n0    ns (n1+n3)   Method          Mean     SE / Mean ŜE     Mean     SE / Mean ŜE
100   50           β̂_WZ1.0         0.000    0.034 / 0.034    0.543    0.046 / 0.038
100   50           β̂_WZ1.5         0.001    0.030 / 0.029    0.527    0.037 / 0.034
100   50           β̂_LDS1.0        0.000    0.026 / 0.026    0.497    0.022 / 0.022
100   100          β̂_WZ1.0         0.000    0.026 / 0.026    0.529    0.036 / 0.028
100   100          β̂_WZ1.5         0.001    0.022 / 0.022    0.512    0.026 / 0.027
100   100          β̂_LDS1.0        0.000    0.020 / 0.020    0.499    0.018 / 0.017
200   50           β̂_WZ1.0         0.000    0.026 / 0.026    0.535    0.039 / 0.026
200   50           β̂_WZ1.5         0.000    0.023 / 0.023    0.523    0.032 / 0.024
200   50           β̂_LDS1.0        0.000    0.022 / 0.022    0.496    0.019 / 0.019
200   100          β̂_WZ1.0         0.000    0.022 / 0.021    0.524    0.031 / 0.021
200   100          β̂_WZ1.5        -0.001    0.019 / 0.019    0.510    0.022 / 0.019
200   100          β̂_LDS1.0       -0.001    0.017 / 0.018    0.499    0.016 / 0.016

* Results are based on the model Y = β0 + β1X + β2Z + ε, where X ∼ Exp(1) + Log-Normal(0,1), Z ∼ Ber(0.45) and ε ∼ N(0,1); the true parameter values are β0 = 1, β1 = 0 or 0.5 and β2 = −0.5, and the study population size is N = 2000. For β̂_LDS1.0, β̂_WZ1.0 and β̂_WZ1.5, see the footnote of Table 4.1.

We also conduct further simulation studies to assess the robustness of our proposed LDS design when X does not follow a normal distribution, or even any common distribution. We generate X = X1 + X2 with X1 ∼ Exp(1) and X2 ∼ Log-Normal(0,1), and find the best fitted common distribution as described in Section 2.2 in each simulation. We perform 12000 iterations with 7000 burn-in iterations in 4 chains for each run of the MCMC algorithm. The Gelman-Rubin criterion is always 1, which indicates convergence of the MCMC. The simulation results based on 1000 simulations are shown in Table 4.2.

We have the following observations from Table 4.2. (i) β̂_WZ1.0 and β̂_WZ1.5 yield some biased estimates when β1 = 0.5 and the sample size is small, while β̂_LDS1.0 is consistently approximately unbiased. (ii) The proposed estimator β̂_LDS1.0 is still the most efficient. For example, when n0 = 200, ns = n1 + n3 = 100 and β1 = 0.5, SE(β̂_LDS1.0) = 0.016, while SE(β̂_WZ1.0) = 0.031 and SE(β̂_WZ1.5) = 0.022.
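As a rough gauge of the size of this gain, one can compare squared empirical standard errors from Table 4.2. This variance ratio is an illustrative summary that is not reported in the tables, and it ignores the bias of the Weaver and Zhou estimators in this setting. For n0 = 200, ns = 100 and β1 = 0.5:

$$\frac{SE(\hat\beta_{WZ1.0})^2}{SE(\hat\beta_{LDS1.0})^2} = \frac{0.031^2}{0.016^2} \approx 3.75, \qquad \frac{SE(\hat\beta_{WZ1.5})^2}{SE(\hat\beta_{LDS1.0})^2} = \frac{0.022^2}{0.016^2} \approx 1.89$$

By this measure, the empirical variance of β̂_LDS1.0 is about 27% of that of β̂_WZ1.0 and about 53% of that of β̂_WZ1.5 in this setting.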

4.5 Analysis of the Collaborative Perinatal Project Data

We also use data from the Collaborative Perinatal Project (CPP) (Niswander and Gordon, 1972) to illustrate our method. We treat the cohort of 849 subjects as the whole cohort and select an overall SRS of size n0 = 100 from it. Using this SRS, we find the best fitted common distribution for PCB conditional on the other covariates, and use MCMC to predict the unobserved PCB values and to update the estimates of the parameters in the likelihood function using the posterior means. Finally, after estimating the likelihood value of all subjects not selected into the SRS, we select a supplemental sample of size ns = 100 from the Ns = ⌈(849 − 100)Φ̄(a)⌉ subjects with the smallest likelihood for the LDS design. We use the following quadratic model proposed in Zhou et al. (2014) to compare different designs.

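$$\mathrm{IQ} = \beta_0 + \beta_1 \mathrm{PCB} + \beta_2 \mathrm{EDU} + \beta_3 \mathrm{SES} + \beta_4 \mathrm{AGE} + \beta_5 \mathrm{RACE} + \beta_6 \mathrm{SEX} + \beta_7 \mathrm{EDU}^2 + \beta_8 \mathrm{AGE}^2 + \varepsilon \tag{4.2}$$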
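Returning briefly to the selection step, the size of the selection range Ns can be checked numerically. The sketch below assumes that Φ̄(a) denotes the upper-tail probability 1 − Φ(a) of the standard normal distribution and that a = 1.0 as in the simulations; both are assumptions, since neither is stated explicitly for the CPP analysis. The function name is hypothetical.

```python
import math
from statistics import NormalDist


def selection_range_size(n_total, n0, a):
    """N_s = ceil((N - n0) * rho), with rho taken as 1 - Phi(a) (an assumption)."""
    rho = 1.0 - NormalDist().cdf(a)      # assumed meaning of the barred Phi
    return math.ceil((n_total - n0) * rho)


n_range_cpp = selection_range_size(849, 100, 1.0)
print(n_range_cpp)   # 119
```

Under these assumptions the selection range contains 119 subjects, which is consistent with the requirement Ns ≥ ns = 100.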

The results for the CPP data analysis are summarized in Table 4.3. β̂_Full denotes the full data analysis, which is included for the purpose of comparison. β̂_SRS denotes the data analysis based on a simple random sample with the same sample size as that of the LDS design. We also include the estimators β̂_WZ1.0 and β̂_WZ1.5 of Weaver and Zhou (2005) for comparison.

Results in Table 4.3 reveal that none of the estimators demonstrates a significant PCB effect on the IQ scores of children at 7 years of age. The estimator β̂_LDS1.0 for PCB under the LDS design has a smaller standard error (0.279) than the estimators β̂_SRS (0.470), β̂_WZ1.0 (0.370) and β̂_WZ1.5 (0.319); only β̂_Full (0.222) has a smaller one. It is not surprising that the estimator β̂_Full, which is based on the PCB values of all 849 subjects, has the smallest standard error. However, the LDS design, with the PCB values of 649 subjects unknown, achieves a standard error much closer to that of the full-data estimator than the other designs do.

Table 4.3: Analysis results of the LDS design for the Collaborative Perinatal Project data set

Covariate        Int      PCB     EDU     SES     AGE      RACE     SEX      EDU²    AGE²
β̂_Full           94.002   0.031   3.134   1.055   -0.532   -7.874   -0.549   0.817   0.486
SE(β̂_Full)       1.642    0.222   0.558   0.259   0.513    0.796    0.828    0.251   0.352
β̂_LDS1.0         93.161   0.338   3.141   1.037   -0.606   -7.838   -0.570   0.819   0.462
SE(β̂_LDS1.0)     1.720    0.279   0.555   0.258   0.514    0.795    0.829    0.249   0.349
β̂_WZ1.0          93.707   0.093   3.411   1.095   -0.612   -7.869   -0.474   0.815   0.359
SE(β̂_WZ1.0)      3.163    0.370   0.897   0.490   0.818    0.927    0.911    0.427   0.574
β̂_WZ1.5          95.106   0.223   3.712   0.800   -0.738   -8.220   -0.485   0.820   0.714
SE(β̂_WZ1.5)      2.876    0.319   0.794   0.452   0.751    0.910    0.908    0.359   0.503
β̂_SRS            94.030   0.036   3.189   1.033   -0.557   -7.866   -0.458   0.827   0.475
SE(β̂_SRS)        3.438    0.470   1.180   0.542   1.069    1.665    1.728    0.548   0.746

* The outcome is the Wechsler Intelligence Scale for children at 7 years of age (IQ). PCB is the level measured from the third-trimester blood serum specimens that have been preserved from mothers in the CPP study; EDU is the standardized mother’s education level; SES is the socioeconomic status of the child’s family; AGE is the standardized mother’s age; RACE and SEX are the race and gender of the child. The fitted model is IQ = β0 + β1PCB + β2EDU + β3SES + β4AGE + β5RACE + β6SEX + β7EDU² + β8AGE² + ε, where ε is a zero-mean normal variable with unknown variance. For β̂_LDS1.0, β̂_WZ1.0 and β̂_WZ1.5, see the footnote of Table 4.1. β̂_Full is the estimator from the full data analysis and β̂_SRS is the estimator based on a simple random sample with the same sample size as for the LDS design.

4.6 Concluding Remarks

We proposed an innovative and cost-effective sampling design, the Likelihood Dependent Sampling (LDS) design, by considering a new criterion, the conditional likelihood, to identify more informative supplemental samples in the structure of the general ODS design.

The proposed design selects the supplemental sample using the estimated conditional likelihood. The estimated conditional likelihood contains the information of both Y and X, and a small conditional likelihood indicates the rarity of the subject, which is more likely to be associated with the smallest or largest values of X. Using the estimated conditional likelihood, we can rank all candidate subjects in a single ordering of their qualification for the supplemental sample, rather than checking whether Y or the estimated X falls in a predetermined lower or upper tail. Moreover, compared with existing ODS and PDS designs, all of which choose a range of information (percentage) from each of the two tails, the LDS design can select the supplemental sample from a range of the same information (percentage) from just one tail, which is a more informative range of candidates. Thus, the LDS design achieves greater efficiency than existing ODS and PDS designs.

As with the EODS design, the Bayesian MCMC framework naturally accommodates missing data without requiring new techniques for inference. The Bayesian approach handles the uncertainty well by incorporating hyperparameters and hyperpriors, and can be made more robust through different choices of prior distributions. For our proposed LDS design, Bayesian MCMC fits well because it both obtains meaningful informative priors from the overall SRS and incorporates all available information, even from the subjects with the main exposure variable unobserved, thus achieving more accurate estimation. Moreover, the Bayesian MCMC improves the accuracy of the estimated conditional likelihood by carefully estimating the unobserved X and updating all parameters using all available information before we select the supplemental sample. The convergence of the MCMC ensures that we obtain unbiased estimates of the parameters. Both our simulation results and the real data analysis support that, for the same sample size, the proposed LDS design, coupled with the Bayesian MCMC estimator, is more efficient than the competing estimators.
