Applications to real data - Multivariate linear mixed models for statistical genetics

Figure 3.3: Power comparison of alternative methods on simulated data from 1000 Genomes Project genotypes. Simulations were based on 1000 Genomes Project data with four phenotypes, varying the complexity of true causal effects (number of causal variants) (a) and the extent of correlated genetic background (fraction of correlated polygenic signal) (b). The stars denote the default values that are kept constant when varying the respective other parameter. Shown is the average power for different methods and simulation settings across 1,000 repeated experiments, where the error bars denote standard errors. Compared are the full-rank multi-trait set test (mtSet), the PC approximation (mtSet-PC), the single-trait set test (stSet), a multi-trait single-variant LMM (mtLMM-SV) and a single-trait single-variant LMM (stLMM-SV).

Figure 3.4: Power comparison when varying the size of the set component on simulated data from 1000 Genomes Project genotypes. (a) Shown is power at 10% family-wise error rate for mtSet, stSet, mtSet-PC, mtLMM-SV and stLMM-SV when varying the window size (in kb) for set test approaches. While the methods are overall robust, the set test methods were most powerful when the testing size matched the size of the simulated causal region. (b) Shown is the average squared correlation coefficient (R2) within a window as a function of the window size. Larger windows tend to contain fewer tightly correlated SNPs. (c) Shown is the number of SNPs within windows as a function of the window size. When selecting the size of the testing window, both linkage disequilibrium and the number of SNPs within regions should be taken into account.

To illustrate the advantages of mtSet, we considered two real data applications: a human GWAS of four lipid traits from the Northern Finland Birth Cohort (Sabatti et al., 2009) and a GWAS of six haematology traits in a population of outbred rats (Baud et al., 2014). While the first is a medium-sized cohort of unrelated individuals (5,256 individuals), the second is a smaller cohort (1,334 individuals) characterised by strong genetic relatedness. The RRM of both datasets is shown in Fig. 2.3. Fig. B.8 shows the distribution of the number of variants and the average squared correlation coefficient between variants in genetic regions, both as a function of the considered window size. As these figures show, the rat genotypes have a slower LD decay, which is a consequence of the large haplotype blocks that characterise this dataset.
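The within-window LD summary referred to above can be sketched as the mean of the squared pairwise Pearson correlations between the SNPs of one window. This is a minimal illustration only: it assumes genotypes coded as allele counts (0/1/2) with monomorphic SNPs already removed, and the names `pearson_r` and `mean_r2` as well as the toy data are hypothetical, not part of mtSet.

```python
from itertools import combinations
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equally long genotype vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def mean_r2(window_genotypes):
    """Average squared correlation over all SNP pairs in one window.

    window_genotypes: list of SNPs, each a list of allele counts across
    individuals (assumption: no monomorphic SNPs, at least two SNPs).
    """
    pairs = list(combinations(window_genotypes, 2))
    return sum(pearson_r(x, y) ** 2 for x, y in pairs) / len(pairs)


# Toy window: SNPs 0 and 1 are identical, SNP 2 is uncorrelated with both.
toy_window = [
    [0, 1, 2, 0, 1, 2],
    [0, 1, 2, 0, 1, 2],
    [0, 0, 0, 1, 1, 1],
]
toy_mean_r2 = mean_r2(toy_window)
```

In the toy window one of the three SNP pairs is perfectly correlated and the other two are uncorrelated, so the summary is one third. Slower LD decay, as described for the rat genotypes, would show up as this quantity staying high as the window grows.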

3.3.1 Genetic analysis of lipid traits in human

We applied mtSet to data from four lipid-related traits (C-reactive protein (CRP), triglycerides (TRIGL), and LDL and HDL cholesterol levels) measured in 5,256 unrelated individuals from the Northern Finland Birth Cohort (Sabatti et al., 2009), which were previously considered for multi-trait analysis using single-variant LMMs (Korte et al., 2012; Zhou and Stephens, 2014). Following the approach taken in Zhou and Stephens (2014), we regressed out sex, usage of contraceptive pill (Pills31) and oral contraception (ZP4202U) and then quantile-normalised each trait to follow a unit variance normal distribution (see also Fusi et al. (2014a) for a comparison of alternative pre-processing methods in the context of LMMs). For set tests we employed a sliding-window approach, considering a window size of 100 kb and a step size of 50 kb (for a total of 328,517 variants with an MAF of at least 1%). Regions with fewer than 4 SNPs were discarded (1,802 SNPs, corresponding to <0.5%), resulting in a total of 51,658 sets for analysis. Heritability estimates and correlation coefficients on the null models of mtSet were in line with those previously reported in Korte et al. (2012) (Table B.7; the procedure for calculating standard errors is described in Section D.3), where small deviations are likely due to small differences in the phenotype normalisation. Following the strategy described in Section 3.1.2, we estimated P-values using 10 genome-wide permutations per window to fit an empirical null distribution. A genome-wide run required 49 h for mtSet (null model: 2.58 min, average window: 5.18 s) and 5 h for mtSet-PC (null model: 44.89 s, average window: 1.78 s). We again compared mtSet and mtSet-PC to a single-trait set test, and to single-trait and multi-trait LMMs for single-variant testing. All methods yielded well-calibrated P-values (Figs. B.9 and B.10). Significance of QTLs was assessed at the Bonferroni-adjusted significance level α < 0.01. For single-trait methods, we considered the minimum P-value across traits, again adjusting for the additional tests using Bonferroni.
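The sliding-window construction described above can be sketched as follows. This is a minimal illustration under stated assumptions: SNP positions are sorted within one chromosome, windows are half-open intervals starting at position 0, and the function name `sliding_window_sets`, its parameters and the toy positions are hypothetical rather than taken from the mtSet software. The defaults mirror the 100 kb window, 50 kb step and minimum of 4 SNPs quoted in the text.

```python
from bisect import bisect_left


def sliding_window_sets(positions, chrom_length, window=100_000,
                        step=50_000, min_snps=4):
    """Return (start, end, snp_indices) for windows with >= min_snps SNPs.

    positions: sorted base-pair positions of the SNPs on one chromosome.
    Windows are half-open intervals [start, end).
    """
    sets = []
    start = 0
    while start < chrom_length:
        end = start + window
        lo = bisect_left(positions, start)  # first SNP at or after start
        hi = bisect_left(positions, end)    # first SNP at or after end
        if hi - lo >= min_snps:             # discard sparse regions
            sets.append((start, end, list(range(lo, hi))))
        start += step
    return sets


# Toy chromosome of 200 kb with nine SNPs.
toy_positions = [10_000, 20_000, 30_000, 40_000, 120_000,
                 160_000, 170_000, 180_000, 190_000]
toy_sets = sliding_window_sets(toy_positions, chrom_length=200_000)
```

In the toy example the window starting at 50 kb contains a single SNP and is dropped, leaving three sets. Because the step is half the window size, each SNP can fall into two overlapping sets, so neighbouring set tests are not independent.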

Manhattan plots for all methods are shown in Fig. B.11 and a tabular summary of the QTLs is provided in Table B.8. Notably, mtSet identified 14 genome-wide significant quantitative trait loci (QTLs) (α < 0.01, Bonferroni adjusted), 13 of which have previously been identified in a larger meta-analysis (Teslovich et al., 2010), while the remaining one has been reported when applying single-variant multi-trait LMMs to the same dataset (Zhou and Stephens, 2014). Single-variant LMMs missed four associations and single-trait set tests failed to detect three of the associations detected by mtSet. In contrast, mtSet identified all but one association found when considering the union of associations detected by previous methods and retrieved one additional association close to the ANGPTL3 (angiopoietin-like 3) gene, a known regulator of lipid metabolism in mice (Koishi et al., 2002). Notably, mtSet-PC was even slightly better powered than mtSet (identifying 16 QTLs). It retrieved all associations found by mtSet or any other method and found an additional association close to the LCAT (lecithin cholesterol acyltransferase) gene, in which multiple rare alleles are known to contribute to low plasma levels of HDL cholesterol (Cohen et al., 2004). Finally, to assess the robustness of the results to the choice of window size, we repeated the analysis with either a smaller window size of 60 kb or a larger window size of 300 kb. Table B.9 shows that results are overall robust across window sizes.

3.3.2 Genetic analysis of haematology traits in rat

To evaluate mtSet in a setting with strong relatedness, we considered a QTL study of 1,334 outbred rats (The Rat Genome Sequencing and Mapping Consortium, 2013) and applied mtSet to six traits related to basal haematology (concentrations of basophils (basos), eosinophils (eos), large unstained cells (luc), lymphocytes (lymphs), monocytes (monos) and neutrophils (neuts)). We regressed out sex and batch covariates and quantile-normalised each trait to a unit variance normal distribution. Variants were filtered to have a minor allele frequency (MAF) of at least 5%, resulting in 4,138,000 variants for the analysis. Because of the large haplotype blocks in this particular study (multi-parent cross genetic design), we considered larger regions (1 Mb size) and a step size of 500 kb, resulting in a total of 5,220 sets. Heritability estimates from a single-trait LMM were consistent with the marginal heritability estimates of the mtSet null model (Table B.10). To estimate P-values, we used 30 genome-wide permutations per window to fit an empirical null distribution.
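As a rough illustration of how permutation statistics translate into P-values, the sketch below computes plain empirical P-values against a pooled set of null test statistics. It is a simplification and not the procedure of Section 3.1.2, which fits a null distribution to the permutation results and whose details are not given here. The function name `empirical_pvalues` and the toy numbers are hypothetical. If the statistics from 30 permutations of 5,220 sets were pooled in this way, the null would contain 156,600 values.

```python
from bisect import bisect_left


def empirical_pvalues(observed, null_stats):
    """P-value of each observed statistic against a pooled permutation null.

    Uses p = (1 + #{null >= observed}) / (1 + n_null), so that no P-value
    is exactly zero. Larger statistics are taken as stronger evidence.
    """
    null_sorted = sorted(null_stats)
    n_null = len(null_sorted)
    pvals = []
    for stat in observed:
        # Number of null statistics greater than or equal to the observed one.
        n_ge = n_null - bisect_left(null_sorted, stat)
        pvals.append((1 + n_ge) / (1 + n_null))
    return pvals


# Toy example: nine pooled null statistics, three observed statistics.
toy_null = [0.0, 0.1, 0.5, 1.2, 2.0, 0.3, 0.0, 0.8, 4.0]
toy_observed = [5.0, 1.0, 0.0]
toy_pvals = empirical_pvalues(toy_observed, toy_null)
```

The smallest attainable value here is 1 / (1 + n_null), which is one reason a fitted null distribution is attractive when only a few permutations per window are affordable.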

First, to study the calibration of P-values using different correction strategies, we compared mtSet to mtSet-PC and a single-variance component model without correction for population structure (mtSet-NoBg). All three models were calibrated when permuting the SNPs within the set (using genome-wide permutations to retain the LD structure, see Section 3.1.2), which corresponds to empirical data from the null without association signal (Fig. 3.5a). However, for the observed data (Fig. 3.5b, Fig. B.13), only mtSet yielded calibrated results, confirming the expected benefits of the second variance component term to control for relatedness (Kang et al., 2010; Schifano et al., 2012). The QQ-plots in Fig. 3.5c were obtained after removing duplicate SNPs, i.e. only considering unique variants.
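Calibration of the kind assessed above is usually judged from a QQ-plot of observed against expected P-values. The sketch below computes the plotted coordinates only, assuming the common convention that the i-th smallest of n null P-values is expected near (i - 0.5) / n. The function name `qq_points` and the toy values are hypothetical, and the thesis figures may use a different plotting convention.

```python
from math import log10


def qq_points(pvalues):
    """Return (expected, observed) -log10 P pairs for a QQ-plot.

    Pairs are ordered from the smallest to the largest observed P-value.
    """
    n = len(pvalues)
    observed = sorted(pvalues)
    # enumerate() is 0-based, so (i + 0.5) / n equals (rank - 0.5) / n.
    return [(-log10((i + 0.5) / n), -log10(p)) for i, p in enumerate(observed)]


# Toy P-values that sit exactly on their expected uniform quantiles.
toy_qq = qq_points([0.875, 0.125, 0.375, 0.625])
```

For a calibrated method the points lie close to the diagonal, as in the toy example. Systematic departure above the diagonal across the whole range indicates inflation, which is the pattern one would expect from a model that does not account for relatedness.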

We then compared mtSet to different LMMs, including a single-trait set test, a single-trait LMM for single variants and a multi-trait LMM for single variants. Again, significance was assessed at an FWER < 0.01 significance level, adjusting for multiple testing using Bonferroni (considering only unique variants in the dataset). For single-trait methods, we applied the Bonferroni correction across both variants and traits. Fig. 3.5c shows the Manhattan plots for the four methods (for single-trait methods we report the minimum P-value across traits; Manhattan plots for single-trait tests are reported in Fig. B.13). A tabular summary of the results is given in Table B.11. mtSet identified one additional QTL (α < 0.01, Bonferroni adjusted, see Fig. 3.5c). This QTL points to NFKB2, a gene that is involved in immune response in humans (Wit et al., 1998), making it also a plausible candidate gene for haematological traits in rat.
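The multiple-testing rule used for single-trait methods (minimum P-value across traits, Bonferroni over both tests and traits) can be sketched as below. This is an illustration under the assumption that the correction factor is simply the number of tests times the number of traits. The function name `bonferroni_hits` and the toy P-values are hypothetical.

```python
def bonferroni_hits(pvals_by_trait, alpha=0.01):
    """Indices of tests significant after Bonferroni over tests and traits.

    pvals_by_trait: one list per trait, each holding one P-value per test
    (variant or set). Returns (hit_indices, per_test_threshold).
    """
    n_traits = len(pvals_by_trait)
    n_tests = len(pvals_by_trait[0])
    threshold = alpha / (n_tests * n_traits)
    # Minimum P-value across traits for every test.
    min_p = [min(trait[i] for trait in pvals_by_trait) for i in range(n_tests)]
    hits = [i for i, p in enumerate(min_p) if p < threshold]
    return hits, threshold


# Toy example: two traits, five tests, so the threshold is 0.01 / 10.
toy_pvals_by_trait = [
    [0.5, 0.0004, 0.2, 0.003, 0.9],
    [0.6, 0.3, 0.00001, 0.8, 0.002],
]
toy_hits, toy_threshold = bonferroni_hits(toy_pvals_by_trait)
```

A multi-trait method corresponds to passing a single list, in which case the threshold reduces to alpha divided by the number of tests. Under this assumption, 5,220 sets at alpha = 0.01 would give a per-set threshold of roughly 1.9e-6 for mtSet.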
