
In this chapter, we studied linear models and Gaussian processes. It is important to understand the connections between the two. First, Gaussian processes can be derived from Bayesian linear regression by mapping the data into a higher-dimensional feature space. More importantly, we do not need to know that mapping explicitly, as long as we can prove that it exists and can compute its kernel efficiently. Second, linear mixed models are also closely connected to Gaussian processes if the random effect is treated as a Gaussian process prior (Liu et al., 2007). The work presented here builds on recent advances made in both fields.

In genomics, linear mixed models are often the model of choice when it comes to association testing: the random effect allows us to correct for confounding induced by either shared genetic or environmental factors, which reduces the number of false positives. For computing the random effect covariance matrix, a linear kernel on all markers is commonly used. This can be interpreted in two ways: we can think of the covariance matrix either as a genetic similarity matrix that measures the relatedness between individuals by using the SNP markers, or as a Bayesian linear additive model in which each marker contributes to the phenotype (Goddard et al., 2009). The second interpretation reveals one of the main limitations of linear mixed models: the model assumes that all SNPs are associated with the phenotype, and it does not allow for outlier SNPs, which have a larger effect size (Zaitlen and Kraft, 2012; Lippert et al., 2013). In the following chapter, we will present an algorithm that relaxes these assumptions by including markers with a large effect size as fixed effects in the model. By designing a better background model, we can increase the power to detect weak associations.
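The equivalence of the two interpretations can be made explicit with a short derivation. The following is only a sketch: it assumes, beyond what is stated above, that the marker effects are independent Gaussians with a common variance scaled by the number of markers $M$. The symbols $w$, $u$, $\sigma_g^2$ and the $1/M$ normalization are chosen here for illustration.

```latex
% Assumption (not from the text): i.i.d. Gaussian marker effects with variance sigma_g^2 / M
\begin{align*}
  w &\sim \mathcal{N}\!\left(0, \tfrac{\sigma_g^2}{M} I_M\right), \qquad u = Xw, \\
  \mathbb{E}[u] &= X\,\mathbb{E}[w] = 0, \\
  \operatorname{Cov}[u] &= X\,\mathbb{E}[w w^\top]\,X^\top
     = \tfrac{\sigma_g^2}{M}\, X X^\top
     = \sigma_g^2 K,
  \qquad K = \tfrac{1}{M} X X^\top .
\end{align*}
```

Under this assumption, integrating out $w$ yields a random effect whose covariance is the linear kernel on the markers. Because every $w_j$ is drawn from the same Gaussian, each marker carries a non-zero effect of comparable size, which is the limitation discussed above.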

Chapter 3

Confounder correction for Lasso methods

One of the key challenges in association testing is, as discussed before, to design multivariate methods that can correct for population stratification. Linear mixed models are often used to correct for population stratification, but predominantly consider individual markers in isolation. In contrast, sparse methods increase the power to detect multifactorial associations, but cannot deal with confounding.

The goal of this chapter is to develop an algorithm that combines the merits of linear mixed models and sparse approaches while allowing efficient computation. Our approach tackles the problem in a three-step procedure: in a first step, it estimates how much phenotypic variance can be explained by population structure. In a second step, it transforms the markers and the phenotype such that the correlation due to the population structure is removed. Finally, a sparse solver is used on the transformed data to identify a set of markers that jointly contribute to the phenotype. The additional runtime for confounder correction is a one-time cubic operation in the number of samples, $O(N^3)$, which is negligible compared to the runtime of the sparse solver.
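The three steps can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the implementation used in this chapter: the function names (`fit_delta`, `decorrelate`, `lasso_cd`), the variance ratio `delta`, the grid search over it, and the plain coordinate-descent solver are all choices made for the example. It further assumes that the phenotype is centered and that a sample covariance matrix `K` is already available.

```python
import numpy as np


def fit_delta(Uy, S, deltas):
    """Step 1 (assumed form): choose delta = sigma_e^2 / sigma_g^2 by maximum
    likelihood under a null model y ~ N(0, sigma_g^2 (K + delta I))."""
    n = Uy.shape[0]
    best_delta, best_nll = deltas[0], np.inf
    for delta in deltas:
        d = S + delta                      # eigenvalues of K + delta I
        sigma_g2 = np.mean(Uy ** 2 / d)    # closed-form ML estimate for fixed delta
        nll = 0.5 * (n * np.log(2 * np.pi * sigma_g2) + np.sum(np.log(d)) + n)
        if nll < best_nll:
            best_delta, best_nll = delta, nll
    return best_delta


def decorrelate(X, y, K, deltas=np.logspace(-5, 5, 101)):
    """Steps 1 and 2: estimate delta, then rotate and rescale X and y so that
    the covariance induced by K + delta I becomes proportional to the identity."""
    S, U = np.linalg.eigh(K)               # one-time O(N^3) eigendecomposition
    S = np.clip(S, 0.0, None)              # guard against tiny negative eigenvalues
    delta = fit_delta(U.T @ y, S, deltas)
    scale = 1.0 / np.sqrt(S + delta)
    X_t = scale[:, None] * (U.T @ X)
    y_t = scale * (U.T @ y)
    return X_t, y_t, delta


def lasso_cd(X, y, lam, n_iter=200):
    """Step 3: minimize 0.5 * ||y - Xw||^2 + lam * ||w||_1 by cyclic
    coordinate descent with soft-thresholding."""
    w = np.zeros(X.shape[1])
    col_sq = np.sum(X ** 2, axis=0)
    r = y.copy()                           # current residual y - Xw
    for _ in range(n_iter):
        for j in range(X.shape[1]):
            if col_sq[j] == 0.0:
                continue
            rho = X[:, j] @ r + col_sq[j] * w[j]
            w_new = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
            r += X[:, j] * (w[j] - w_new)  # keep the residual in sync
            w[j] = w_new
    return w
```

Because the rescaling changes the magnitude of the data, `lam` is most easily chosen relative to `np.max(np.abs(X_t.T @ y_t))`, the smallest value at which all weights are exactly zero. Only `decorrelate` involves the cubic eigendecomposition; the sparse solver then runs on the transformed data as usual.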

We define our new approach in Section 3.1 and give a detailed description of the inference scheme in Section 3.2. In Section 3.3, our experiments show that the rigorous combination of sparse and mixed modeling approaches yields greater power to detect true causal effects in a large range of settings. In genome-wide association studies in Arabidopsis thaliana and linkage mapping in mice, our method achieves significantly more accurate phenotype predictions than its competitors and retrieves associations that are enriched for known candidate genes.

3.1 Feature selection in the presence of confounding

Our approach builds on linear mixed models (see Section 2.1.4), explaining the phenotype variability by a sum of individual genetic effects and random confounding variables. In brief, the phenotype of $N$ samples $y = (y_1, \dots, y_N)$ is expressed as a linear function of the markers $X \in \mathbb{R}^{N \times M}$:

$$y = \underbrace{Xw}_{\text{genetic factors}} + \underbrace{u}_{\text{confounding}} + \underbrace{\epsilon}_{\text{noise}}. \tag{3.1}$$

Here, $\epsilon \in \mathbb{R}^N$ denotes observation noise and $u \in \mathbb{R}^N$ are confounding influences. Confounding influences in genetic mapping are typically not directly observed; however, their Gaussian covariance $K$ can in many cases be estimated from the observed data. To account for confounding by population structure, $K$ can be reliably estimated from genetic markers, for example using the realized relationship matrix, which captures the overall genetic similarity between all pairs of samples (Hayes et al., 2009). Similarly, in genetic analyses of gene expression, $K$ can be fit to capture and correct for the confounding effect of gene expression heterogeneity (Listgarten et al., 2010; Fusi et al., 2012). Marginalizing over the random effect $u$ results in a Gaussian marginal likelihood model (Kang et al., 2008) whose covariance matrix accounts for confounding variation and observation noise.
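As an illustration of how $K$ and the marginal covariance might be computed, consider the following sketch. The standardization used here (centering each marker, scaling it to unit variance, and dividing by the number of markers) is one common convention and is an assumption of this example; it is not necessarily the exact estimator of Hayes et al. (2009). The function names and the variance parameters `sigma_g2` and `sigma_e2` are likewise hypothetical.

```python
import numpy as np


def realized_relationship_matrix(X):
    """Genetic similarity between all pairs of samples from an N x M marker
    matrix, computed as Z Z^T / M with column-standardized markers Z."""
    Z = X - X.mean(axis=0)
    sd = Z.std(axis=0)
    keep = sd > 0                          # drop monomorphic markers
    Z = Z[:, keep] / sd[keep]
    return Z @ Z.T / Z.shape[1]


def marginal_covariance(K, sigma_g2, sigma_e2):
    """Covariance of the phenotype after marginalizing over the random effect:
    confounding variation plus independent observation noise."""
    return sigma_g2 * K + sigma_e2 * np.eye(K.shape[0])
```

With this normalization the diagonal of `K` averages to one, so `sigma_g2` and `sigma_e2` can be read on the same scale as the variance explained by confounding and by noise, respectively.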

The resulting mixed model is typically considered in the context of single candidate SNPs, i.e. restricting the sum in Eq. (3.1) to a particular SNP while ignoring all others (see Section 2.1.4). While computationally efficient and easy to interpret, this independent analysis can be compromised by complex genetic architectures with some genetic factors masking others (Platt et al., 2010b). Some improvements can be achieved by step-wise regression or forward selection, which has recently been extended to the mixed model framework (Yang et al., 2012a; Segura et al., 2012). However, these approaches are often caught in suboptimal modes as they are order dependent (Segura et al., 2012). As an alternative, we propose an efficient approach to carry out joint inference over all markers as implied by Eq. (3.1). Our approach assesses all SNPs at the same time while accounting for their interdependencies and without making any assumptions on their ordering. To allow for applications to genome-wide SNP data, we regularize the fixed effects by an $\ell_1$-norm, assigning zero effect size to the majority of SNPs as done in the classical Lasso (see Section 2.1.5). We call this approach LMM-Lasso as it combines the advantages of established linear mixed models (LMM) with sparse Lasso regression.

There is a vast amount of literature using an $\ell_1$-regularized approach for genome-wide association studies (Wu et al., 2009; Lee and Xing, 2012; Kim and Xing, 2009). In Foster et al. (2007), a sparse random effect model is proposed, in which the markers are modeled as random effects drawn from a Laplacian distribution. In Hoggart et al. (2008) and Li et al. (2011), the authors suggest adding principal components to the model to correct for population structure. While these approaches can be effective in some settings, principal components cannot account for family structure or cryptic relatedness (Price et al., 2010). Importantly, none of these approaches considers including random effects to control for confounding. A notable exception is the general $\ell_1$-mixed model framework by Schelldorfer et al. (2011) and Schelldorfer and Bühlmann (2011), who consider a random effect component but do not provide a scalable algorithm that is applicable to genome-wide settings. More recently, Zhou et al. (2013) introduced a fully Bayesian approach to tackle the same problem by using a mixture of two Gaussians as prior. This is conceptually close to the work presented here, as it is equivalent to a linear mixed model with a spike-and-slab prior on the fixed effects and a linear kernel as the random effect covariance matrix.

Probabilistic model. Let $X$ denote the $N \times M$ matrix of $M$ SNPs for $N$ individuals; $x_j$ is then the $N \times 1$ vector representing SNP $j$. We model the phenotype for

$N$ individuals, $y = (y_1, \dots, y_N)$, as the sum of genetic effects $w_j$ of SNPs $x_j$ and confounding influences $u$ (see Eq. (3.1)). The genetic effects are treated as fixed effects, whereas the confounding influences are modeled as random effects. The genetic effect terms are summed over genome-wide polymorphisms, where the great majority of SNPs has zero effect size, i.e. $w_j = 0$, which is achieved by a Laplace shrinkage prior on all weights. The random variable $u$ is not observed directly. Instead, we assume that the distribution of $u$ is Gaussian with covariance $K$, $u \sim \mathcal{N}(0, \sigma_g^2 K)$. Assuming Gaussian noise, $\epsilon \sim \mathcal{N}(0, \sigma_e^2 I)$, and marginalizing over the random variable $u$, we can write down the conditional posterior distribution over the weight vector $w$:

$$p(w \mid y, X, K, \sigma_g^2, \sigma_e^2, \lambda) \propto \underbrace{\mathcal{N}\left(y \mid Xw, \sigma_g^2 K + \sigma_e^2 I\right)}_{\text{marginal likelihood}} \; \underbrace{\prod_{m=1}^{M} e^{-\frac{\lambda}{2}|w_m|}}_{\text{prior}}. \tag{3.2}$$

Here, $\lambda$ denotes the sparsity hyperparameter of the Laplace prior, $\sigma_e^2$ is the residual

noise variance and 2