Simulations - High Dimensional Regression Techniques for Complex Data

To assess the performance of our estimation and variable selection methods, we simulated group testing data in the following manner. To closely mimic the features of the motivating Iowa data, K = 50 clinic sites were conceptualized with 100 individuals at each clinic; i.e., N = 5000. This sample size provides for a stress test of our model’s capability of recovering the true covariates, as compared to the amount

of data available in the motivating data set. The infection status for each individual was generated as a D = 2 dimensional binary vector eYi = ( eYi1, eYi2)0according to eYid = 1 if ωid ≥ 0 and 0 otherwise, where

ωi= (ω1d, ω2d)0is a random draw from a multivariate normal distribution with mean ηiand correlation ma-

trix R. The mean vectors ηiare constructed with the following values: β1= (−4, −1.5, 0.5, 0.25, 0)0, β2=

(−2.5, 1, −0.75, 0.3, 0, 0)0, λ1= (1, 0.75, 0.25, 0, 0)0, λ2= (0.8, 0.3, 0.15, 0, 0, 0)0, a1= (0.9, 0.5, 0.7, 07)0,

a2= (0.75, −0.5, 0.8, 1.2, 0.5, 0.3, 09)0, 0`is an `th dimensional zero vector, b(i)d= bkdif individual i pre-

sented at clinic site k, bkd iid

∼ N (0, I), and R’s off-diagonal elements are set to ρ = 0.99. We set the covariate vectors associated with the fixed effects to standardized versions of x?

i1= (1, N (0, 1), B(0.5), B(0.5), N (0, 1))0

and x?_i2 = (1, N (0, 1), B(0.5), B(0.5), N (0, 1), B(0.5))0, where B(0.5) represents a Bernoulli(0.5) random variable. The random effects covariates for each disease are taken to be equal to the fixed effects covariates. These covariate and parameter configurations provide for an overall disease prevalence rate of about 3% and 9%, which is in keeping with the observed prevalence rate of gonorrhea and chlamydia, respectively, from the motivating data. This process was used to generate 500 individual level data sets.

Group testing outcomes were generated with these individual level data sets in the following manner. Each individual was randomly placed into a group of size cj = 4 individuals. This poses the most difficult

estimation configuration; that is, individuals are pooled across sites rather than within sites and thus each pool contains multiple random effects. To proceed, one of three group testing algorithms was implemented to test each of these master pools; namely, master pool testing (MPT), Dorfman testing (DT), and array testing (AT). Briefly, under MPT, each master pool is tested with no further retesting, regardless of the outcome. DT proceeds by individually retesting each contributing member of a master pool that tested positively. Similarly, AT views the master pools as rows and columns of an array and will retest individuals identified to be likely positives; e.g., individuals residing at the intersection of positive rows and columns. For the specific AT protocol we adopted, see Kim et al. [2007]. Under each protocol, the testing response for the jth pool was simulated as Zjd| eZjd∼ Bernoulli{Sej:dZejd+ (1 − Spj:d)(1 − eZjd)}, where eZjd= max{ eYid: i ∈ Pj}. For

comparative purposes, individual testing (IT) was also implemented. Regarding the testing assay accuracies, two different simulation configurations were considered. The first setting assumes that the sensitivity and specificity for each pool are known; i.e., Sej:1= Sej:2 = 0.95 and Spj:1 = Spj:2 = 0.98 for all j = 1, ..., J .

To perform the assessment of the proposed model, we set m0= 0, C0 = 0.5I, and used flat priors

for all mixing weights. Recall that the prior parameters for a should be chosen in a somewhat informative fashion to avoid a strong a priori correlation between any two random effects; see Chen and Dunson [2003]. To impose a diffuse prior variance on the slab distribution of the fixed and random effects, we took φ2

rd =

ψ2

ld = 100. The prior degrees of freedom and scale matrix for W was set as m0 = D + 1 = 3 and S = I,

where I is a D × D identity matrix, and the proposal degrees of freedom was set to m = 300; these values are discussed in Zhang et al. [2006]. To perform posterior estimation and inference, the last half of 50000 iterations of our MCMC algorithm was used for summary statistics. Point estimates of the model parameters were obtained as the empirical means of the posterior distributions.

Table 6.4 summarizes estimation performance in the first simulation setting; that is, when Sej:dand

Spj:d are known. Overall, these results illustrate our approach provides reliable inference for the fixed and

random effects; i.e., the empirical bias and the variability in the estimates are small relative to the true value of the corresponding parameter. These results also indicate the proposed methodology is adept at identifying non-zero fixed and random effects. That is, covariates with strong (no) effects almost always have posterior inclusion probabilities near 1 (0) in all data sets.

Among the group testing protocols presented in Table 6.4, MPT generally performs the worst in terms of estimation i.e., estimates obtained from analyzing MPT data exhibit the most bias and variability. This is expected because MPT does not complete the classification process like the other protocols and there- fore results in less information about the individuals’ latent statuses. In contrast, the estimation performance under the two classification protocols (DT and AT) is as good if not better than the performance under IT; furthermore, DT/AT estimates are obtained at about 60% of the testing cost on average when compared to IT. Specifically, 5000 tests are used to complete IT, while DT and AT require 2747 and 3258 tests on average, respectively. These results illustrate the “get more for less” phenomenon that has previously been reported with group testing regression [Zhang et al., 2013, McMahan et al., 2017].

Table 6.5 summarizes estimation performance when the assay accuracy probabilities are unknown. In this second setting, the proposed methodology is tasked with estimating 8 additional parameters; i.e., the testing assay accuracies. These results indicate we can accurately estimate these parameters and the variability in the estimates is small relative to the true values. Moreover, there are no appreciable differences between the estimates in Tables 6.4 and 6.5 for DT and AT; i.e., inference for the fixed and random effects is not impacted by having to estimate these additional parameters.

[Table 4 about here.] [Table 5 about here.]

In document High Dimensional Regression Techniques for Complex Data (Page 39-42)