Implementation for genetic association studies

2.2 Methods

2.2.2 Implementation for genetic association studies

The data model

For our primary application of LLARRMA, we start by considering a standard logistic regression to estimate the effects ofmSNPs in a hit region on a case/control outcome in

n individuals, and then describe statistical approaches to identify a subsetmq of SNPs

that has been previously identified by an initial genomewide screen using, for example, single locus regression, and that them SNPs may be highly correlated with each other due to blocks of LD. We assume y= (y1, . . . , yn) to be an n-vector of the dichotomous

response with each of then1 cases coded by 1 and the n0 controls coded by 0, and that

Xis ann×mmatrix of SNP genotypes, where SNPs are coded to reflect additive-only effects as {0,1,2} for unphased genotypes {qq,qQ,QQ}. Logistic regression models the case-control status of individual i as if sampled from Yi ∼ Bin(pi,1), where i’s

propensitypi =P(Yi = 1) is determined by a linear function of the m SNP predictors

logit(pi) = log pi 1−pi =µ+ m X j=1 βjxij, (2.5)

where xij is value of the jth SNP for the ith individual and the ijth element of the

column-centered design matrix X, µ is the intercept, and β = (β1, . . . , βm) are the

effects of them predictors.

Resample model averaging

The RMA procedure described above does not need any modification or further clari- fication for the genetic association hit region application.

Selection within a subsample using the LASSO

To select SNPs within the kth subsample we use LASSO penalized logistic regression. This estimates β for subsample k as

ˆ β(λ;D(k)_{) = argmin} β ( −`(β;D(k)_{) +}_λ m X j=1 |βj| ) , (2.6)

where`(β;D(k)_{) is the log-likelihood of} _β _{for subsampled data}_D(k)_{, and}_λ _{is a penalty}

parameter. The LASSO estimate ˆβ(λ;D(k)_{) still translates into an estimate of the}

Predictive-based choice of λ(k): complement deviance selection

The complement deviance criterion described in the general setting may be explicitly defined for logistic regression. After estimating ˆβ(λ;D(k)_{) over a grid of} _λ _{in order}

to calculate the LASSO path, this criterion finds the value of λ that minimizes the deviance of the complement of subsample k, ie,

ˆ λ(_CompDevk) = argmin λ    −2 X i∈N(\k)

[yilog(ˆpi,λ) + (1−yi) log(1−pˆi,λ)]

 



whereN(\k) ₌_{N \N}(k) _{is the set of (1}₋_φ₎_n _{individuals not selected for subsample} _k_,

and ˆpi,λ is the predicted probability of P(Yi = 1) based upon ˆβ(λ;D(k)) applied to the

design matrix of the complement subsample X(\k).

Discovery-based choice of λ(k)_{: permutation selection}

The permutation selection criterion described in the general setting can be explicitly defined for logistic regression. Given a subsample k, we calculate for a given permutation of the response π(y) the smallest penalty required to zero out all predictors is given by λnull(π, k) = 1 |N(k)_|max_j hx (k) j ,π(y (k)₎_i

wherex(_jk) is the jth column of the subsampled and mean-centered design matrix X(k)_,

and h·,·i denotes the inner product of its two arguments. Calculating this for each of

S permutations π1, . . . ,πS, we estimate the permutation selection λ for subsample k

as given by Equation 2.4.

Incorporating uncertainty due to missing genotypes: hard, dosage and multiple imputation

SNP data within a hit region will often include combinations of markers and individuals for which the genotype is unknown or uncertain. To avoid a potentially wasteful complete cases analysis, it is common to impute the missing genotypes using a program

such as MACH (Li et al., 2010), IMPUTE (Howie, Donnelly and Marchini, 2009) or fastPHASE (Scheet and Stephens, 2006), and analyze the partly-imputed data as if it were fully observed. Imputation methods are typically based on reconstruction and phasing of inferred haplotypes. Dividing the SNP matrixX into missing and observed elements X = {Xmis,Xobs}, methods such as fastPHASE (Scheet and Stephens, 2006)

model the joint distribution p(Xmis|Xobs,ω), where ω includes additional information

used in the imputation (eg, priors). Most GWAS studies, however, do not use this joint distribution directly. Rather, they replace Xmis with a point estimate Xb_mis, each element of which is constructed from its marginal distributions. Specifically, Xmis is replaced by either the “dosage”,Xb_misdose, with elements defined as the expectation of the allele count ˆxij = E(xij|Xobs,ω); or a “hard” imputation, Xb_mishard, with elements imputed as their maximum a posteriori genotype

xij = argmax g∈{0,1,2}

p(xij =g|Xobs,ω).

The simplest approach to modeling missing genotypes within LLARRMA is first to estimate Xmis as either Xb_misdose or Xb_mishard and then subsample Xb = {Xb_mis,X_obs} as if it were complete. This plug-in approach underestimates variability because it fails to incorporate uncertainty about the imputation. Zheng et al. (2011) show that doing this when modeling effects at single loci reduces power a negligible amount when the imputation accuracy is reasonably high. Nonetheless, ignoring imputation uncertainty could be more problematic in multiple locus settings, if, for example, the posterior distribution of haplotypes p(Xmis|Xobs,ω) differs substantially from joint distribution

implied by the product of marginal posteriors Q

ij∈Xmisp(xij|Xobs,ω) (eg, Servin and

Stephens, 2007). A natural way to incorporate imputation uncertainty into our re- sampling framework is through multiple imputation (Little and Rubin, 2002). At each iteration k, we sample a new X?

resultingX? ={X?

mis,Xobs}to give{X?(k),y(k)}=D?(k), and then calculate RMIPs us-

ing ˆγ(D?(k)_{) in place of ˆ}_γ₍_D(k)_{) in Eq 2.2. The resulting RMIPs incorporate additional}

variability because each subsample now includes a potentially different imputation of missing genotypes. We implement hard, dosage and multiple imputation using posterior draws from fastPHASE (making use of the -s option).

In document 5247.pdf (Page 66-70)