Current statistical inference problems in genome-wide association studies (GWAS) routinely involve the simultaneous test of hundreds of thousands (or even millions) of null hypotheses. This testing problem entails inference for high-dimensional joint distributions with complex and unknown dependence structures among the sampled genotype and phenotype data. In turn, this leads to complex dependence structures among the test statistics arising from the simultaneous testing of the null hypotheses. Ignoring the dependence structure among the test statistics can lead to a loss in statistical power within a GWAS. The core methodological and computational issue underlying GWAS is multiple hypothesis testing (MHT). Within this chapter, we discuss approaches to tackling the GWAS multiple hypothesis testing problem, compare and contrast their operating characteristics and computational performance, and develop a parallel programming algorithm to implement the permutation maxT and minP multiple testing procedures (MTPs).
2.1.1 Approaches to Controlling the FWER in Genome-wide Association Studies
Of the four types of Type I error rate defined within §1.3, strong control of the FWER at level α seems most befitting for application within a GWAS. MTPs based upon the PFER are generally more conservative (i.e., lead to an increased reporting of Type II errors) than those based upon the FWER [60]; MTPs based upon the PCER are generally less conservative than those which control either the FWER or the FDR, but tend to ignore the multiplicity problem altogether [60]. Furthermore, while MTPs based upon the FDR tend to achieve greater statistical power than those based upon the FWER, particularly when the ratio of the number of false null hypotheses (m1) to the total number of tested null hypotheses (m) is large, in general they can result in a high probability of one or more false positives (i.e., an inflated FWER) [74]. Although it is highly unlikely that all tested null hypotheses in a GWAS are in fact true (i.e., it is unlikely that m1 ≡ 0), it is likely that the ratio m1/m is exceptionally small, far less than 1%. Under these conditions, control of the FDR is close to weak
control of the FWER [61]; strong control of the FWER is close to the best methods for weak control of the Type I error rate [75, 76]. In light of the above, strong control of the FWER seems most befitting for application within a GWAS, which likely explains why many, though not all (see e.g., [77]), methodological approaches for multiple testing in GWAS have focused upon control of the FWER. As such, all multiple testing procedures discussed within Chapters 2 and 3 of this manuscript are assumed to control, in the strong sense, the FWER at some user-specified level α.
As indicated above, there are issues specific to GWAS designs which influence both how investigators control for Type I errors and which MTP is most useful for controlling the adopted Type I error rate. In a multiple hypothesis testing (MHT) problem such as a GWAS, the likelihood of committing at least one Type I error (i.e., the FWER) increases with the number of tested hypotheses, as we have illustrated above through expression (1.1). The goal of the MHT problem is to control some Type I error rate in the strong sense, say the FWER, while simultaneously maximizing statistical power to reject false null hypotheses. To control the FWER at a predefined level, say α, one implements a multiple testing procedure. The choice of implemented MTP is critical: an overly conservative MTP could result in overlooking genetic markers which are truly associated with the disease under investigation (i.e., an excessive Type II error rate); an overly liberal MTP, on the other hand, could result in excessive false positives (i.e., an excessive Type I error rate). The Bonferroni MTP is, by far, the most widely used MTP within a GWAS (for recent articles, see e.g., [78, 79, 80, 81, 82, 83, 84]) for strong control of the FWER at level α, presumably due to its simplicity of application: for a GWAS comprising m markers, one rejects a null hypothesis at FWER level α if its corresponding pointwise p-value does not exceed the ratio α/m.
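The Bonferroni rule stated above can be sketched in a few lines. This is a minimal illustration, not part of the original text; the function name and the toy p-values are hypothetical.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Return one boolean per hypothesis: True where the null is rejected."""
    m = len(p_values)
    threshold = alpha / m  # pointwise significance level alpha/m
    return [p <= threshold for p in p_values]


# Toy pointwise p-values for m = 5 markers (illustrative numbers only).
p_values = [0.001, 0.009, 0.02, 0.04, 0.6]
decisions = bonferroni_reject(p_values, alpha=0.05)
print(decisions)  # [True, True, False, False, False]

# At GWAS scale the pointwise level becomes very small:
# for m = 500,000 markers and alpha = 0.05 it is 0.05 / 500000 = 1e-7.
gwas_threshold = 0.05 / 500_000
```

Only the first two markers fall at or below the pointwise level 0.05/5 = 0.01, so only those two null hypotheses are rejected.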
While the Bonferroni MTP is simple to implement, it ignores LD (see footnote 1 within §1.1 for a review of the LD definition) among the sampled SNP markers. As a consequence, in the presence of correlated SNP markers this MTP is overly conservative [20, 21, 22, 85]. So as to maximize the efficiency of a GWAS, SNPs are often selectively sampled to be nearly free of LD (i.e., to avoid ascertaining redundant information, SNPs should be selected to be essentially statistically independent). In spite of this, some degree of correlation typically exists within the sampled genetic data [20]. Permutation-based MTPs, such as the so-called maxT and minP approaches of [62], are widely considered most powerful for strong control of the FWER at level α within a GWAS, insofar as these MTPs account for the correlation structure amongst the sampled data [22]. We outline the maxT and minP MTPs in more detail within §2.2.4, but point out here that they remain largely
unimplemented due to the high computational effort they demand upon a GWAS data set (see e.g., [20, 21, 86, 87, 88]). For example, performing the necessary number of permutations (100K) upon a typical GWAS data set containing 2500 cases and 2500 controls and m = 500K SNP markers using standard software (e.g., PLINK [63]) can take upwards of four CPU years to complete [21]. To alleviate this computational burden, several algorithms have recently been proposed to approximate the GWAS permutation-based maxT and minP gold standard.
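To make the computational cost concrete, the following is a minimal sketch of a single-step permutation maxT adjustment. It is not the algorithm developed in this chapter. The association statistic (absolute difference in mean genotype between cases and controls), the "+1" correction in the adjusted p-value, the function names, and the toy data are all assumptions made for illustration.

```python
import random


def stat(genotype_row, y):
    """Stand-in statistic: |mean genotype in cases - mean genotype in controls|."""
    cases = [g for g, yi in zip(genotype_row, y) if yi == 1]
    controls = [g for g, yi in zip(genotype_row, y) if yi == 0]
    return abs(sum(cases) / len(cases) - sum(controls) / len(controls))


def maxt_adjusted_p(genotypes, y, n_perm=1000, seed=0):
    """Single-step maxT adjusted p-values from permuting the phenotype labels."""
    rng = random.Random(seed)
    observed = [stat(row, y) for row in genotypes]
    exceed = [0] * len(genotypes)
    y_perm = list(y)
    for _ in range(n_perm):
        rng.shuffle(y_perm)  # permute phenotypes; genotypes stay fixed
        # Every permutation recomputes all m statistics: cost is O(n_perm * m * n).
        max_t = max(stat(row, y_perm) for row in genotypes)
        for j, t in enumerate(observed):
            if max_t >= t:
                exceed[j] += 1
    # Assumed convention: count the observed labelling as one permutation.
    return [(c + 1) / (n_perm + 1) for c in exceed]


# Toy data: 3 markers (rows) by 10 individuals, genotypes coded 0/1/2.
genotypes = [
    [2, 2, 2, 2, 1, 0, 0, 0, 0, 0],
    [0, 1, 2, 0, 1, 2, 0, 1, 2, 0],
    [1, 1, 0, 2, 1, 1, 0, 2, 1, 1],
]
y = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]  # 1 = case, 0 = control
adjusted = maxt_adjusted_p(genotypes, y, n_perm=200)
```

Because each permutation recomputes every marker's statistic, the work grows as the product of the number of permutations, markers, and individuals, which is the source of the CPU-year figures quoted above.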
When correlation exists among the tested null hypotheses, by way of LD among the sampled SNP markers, there is less variation among their corresponding test statistics than if the null hypotheses were mutually independent. This decreases the likelihood of extreme test statistics [20]. With correlated tests, we gain information about the plausibility of a particular null hypothesis from the tests of other null hypotheses. One alternative to permutation MTPs for control of the FWER in GWAS exploits the correlation structure among the sampled markers. It first estimates the LD within the data, then uses the LD estimates to estimate the effective number of independent tested null hypotheses (Meff), and finally modifies the Šidák MTP¹ by replacing the value m within said MTP with the less conservative estimate Meff. By exploiting the correlation within the sampled data, this approach results in a less conservative MHT correction than the Šidák MTP, and hence also than the Bonferroni MTP (i.e., Meff < m); moreover, its computational requirement is low when compared to permutation MTPs. Cheverud (2001) pioneered this approach, and proposed estimating LD from the eigenvalues of the Pearson correlation matrix for the sampled SNP markers [89]. Subsequently, several authors proposed alternative methods for estimating Meff [86, 88, 90]. However, these methods remain conservative when compared to using the actual permutation null distribution [21, 22]. Moreover, [91] and [92] illustrated that the effective number of independent tested null hypotheses varies across p-value levels, thereby demonstrating that the Meff approach can be inaccurate.
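The Meff idea can be sketched as follows. The Šidák pointwise level 1 − (1 − α)^(1/n) is standard. The eigenvalue-variance formula for Meff is an assumption here: it is a commonly quoted form of this type of estimator and is not taken from the text. The function names are hypothetical. The sketch avoids an eigendecomposition by using two identities for a correlation matrix: its eigenvalues sum to m, and their squares sum to the sum of squared entries.

```python
def sidak_level(alpha, n_tests):
    """Pointwise level giving FWER alpha across n_tests independent tests."""
    return 1.0 - (1.0 - alpha) ** (1.0 / n_tests)


def meff_eigen_variance(corr):
    """Assumed estimator: Meff = 1 + (m - 1) * (1 - Var(eigenvalues) / m).

    corr is an m x m Pearson correlation matrix given as a list of lists.
    """
    m = len(corr)
    sum_sq = sum(r * r for row in corr for r in row)  # sum of squared eigenvalues
    var_lambda = (sum_sq - m) / (m - 1)  # sample variance; the mean eigenvalue is 1
    return 1.0 + (m - 1) * (1.0 - var_lambda / m)


independent = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
in_full_ld = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
partial_ld = [[1.0, 0.5], [0.5, 1.0]]

print(meff_eigen_variance(independent))  # 3.0: no reduction without correlation
print(meff_eigen_variance(in_full_ld))   # 1.0: three markers act as one test
print(meff_eigen_variance(partial_ld))   # 1.75

# Replacing m by Meff < m raises the pointwise level (less conservative).
level_m = sidak_level(0.05, 3)
level_meff = sidak_level(0.05, meff_eigen_variance([[1.0, 0.5, 0.5],
                                                    [0.5, 1.0, 0.5],
                                                    [0.5, 0.5, 1.0]]))
```

The two extreme cases behave as expected: uncorrelated markers give Meff = m, while markers in complete LD collapse to a single effective test.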
A second alternative approach for approximating permutation MTPs for control of the FWER is based upon the multivariate normal (MVN) distribution. For many statistical tests commonly employed within a GWAS, such as the Cochran-Armitage trend test, the joint distribution of the test statistics under the complete null hypothesis is asymptotically MVN [93, 94]. The articles of [93] and [94] proposed simulating replicates of the test statistics from this asymptotic MVN under H0 (the complete null hypothesis), and ascertaining adjusted p-values by way of comparing the test statistic replicates with those of the observed data. The proposal of [20] increased
the efficiency of this approach, by direct numerical integration over the MVN probability density function (PDF) under H0. When applied to data sets of the size of candidate gene studies (i.e., a panel of a few hundred SNPs), these methods have been shown to be as accurate as permutation MTPs (less than 1% average error in adjusted p-values) [20]. However, when applied to GWAS data sets, the accuracy of these methods suffers. Utilizing the Wellcome Trust Case Control Consortium (WTCCC) data [13], [21] demonstrated that these MVN methods remove only about two-thirds of the error in the adjusted p-values relative to the Bonferroni MTP. Due to numerical limitations of integrating over high-dimensional MVN PDFs, these methods require the user to partition the data into small LD blocks (of hundreds of markers each) and integrate the MVN PDF within each LD block. Insomuch as inter-block correlation is ignored, these MVN approaches lead to conservative multiplicity correction. To address this problem, [21] proposed a resampling method called SLIDE (a Sliding-window approach for Locally Inter-correlated markers with asymptotic Distribution Errors corrected). However, the accuracy and computational efficiency of this approach depend on the size of the window: a large window leads to increased accuracy and decreased efficiency, while a small window leads to decreased accuracy and increased efficiency.
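The simulation variant of the MVN approach can be sketched as below. This is a minimal illustration of the general idea only, not the procedure of [93], [94], or [20]. The two-sided max-|z| comparison, the small hand-written Cholesky factorization, the function names, and the toy correlation matrix are assumptions.

```python
import math
import random


def cholesky(a):
    """Lower-triangular L with L * L^T = a, for a small positive-definite matrix."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low


def mvn_adjusted_p(observed, corr, n_rep=2000, seed=0):
    """Adjusted p-values: share of null replicates whose max |z| reaches |t_j|."""
    rng = random.Random(seed)
    low = cholesky(corr)
    n = len(observed)
    exceed = [0] * n
    for _ in range(n_rep):
        e = [rng.gauss(0.0, 1.0) for _ in range(n)]
        # z = L e has covariance L L^T = corr under the complete null.
        z = [sum(low[i][k] * e[k] for k in range(i + 1)) for i in range(n)]
        max_abs = max(abs(v) for v in z)
        for j, t in enumerate(observed):
            if max_abs >= abs(t):
                exceed[j] += 1
    return [c / n_rep for c in exceed]


# Toy correlation among three test statistics (markers 0 and 1 in strong LD).
corr = [[1.0, 0.8, 0.2],
        [0.8, 1.0, 0.2],
        [0.2, 0.2, 1.0]]
observed = [3.5, 1.0, 0.0]  # hypothetical observed z-type statistics
adjusted = mvn_adjusted_p(observed, corr)
```

No phenotype permutation is needed here; the cost shifts to factorizing and sampling from the correlation matrix, which is why these methods work block by block at GWAS scale.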
Overall, several permutation approximation methods have recently been proposed, with the intent of: (1) controlling the FWER; (2) avoiding the exceptional computational effort of permutation MTPs; and (3) obtaining greater statistical power over the Bonferroni MTP. The accuracy of these methods seems to be increasing, although some concerns linger. First, there is no agreement upon a standard alternative method. In fact, there is a lack of consistency in the reported results across the Meff methods. For example, the results of [88] suggest the Meff estimate of [90] to be liberal in controlling the FWER at the 5% level; the results of [95] suggest that the FWER attained at the nominal 5% level by the Meff estimate of [86] varies between 3% and 7%, where the variation is dependent upon LD; and [90] suggest that the Meff estimate of [89] is an overestimate for some LD structures in the sampled SNP panel. Second, in order to accurately account for the correlation among the sets of tested hypotheses, one must utilize the joint distribution of the test statistics. The Šidák MTP, which each of the Meff methods uses in computing its respective pointwise significance level, does not guarantee control of the FWER for arbitrary distributions of the test statistics [20, 60]. These methods fail to account for the distribution of the test statistics, and as such the validity of their respective extensions to the Šidák MTP is questionable. Finally, each of the Meff methods, as well as the MVN methods of [20] and [93], cannot cope with missing SNP
data. As such, imputation methods (e.g., the K nearest-neighbor algorithm of [96]) must be used to fill in any missing data, which could lead to differential misclassification bias in the reported results.
In contrast, not only is the permutation-based MTP approach the GWAS gold standard, it is also robust to patterns of missing data [89] and fully accounts for the correlation structure within the sampled data. This robustness arises because the patterns of missing data are preserved within the permuted data, and are thus also reflected in the estimated permutation significance thresholds. In addition, as genotyping technology continues to evolve, one is able to sample DNA sequences within the human genome at increasingly finer resolution. This implies that future genetic samples will arguably exhibit increasing correlation among the sampled markers. Thus, continued implementation of the Bonferroni MTP within future genetic association studies will lead to an increase in the reporting of Type II errors. Therefore, it is imperative that permutation MTPs be implemented within current and future genetic association studies. Over the past three years, significant progress has been made toward this goal. For example, [85] developed Java-based software called PRESTO, which is markedly faster than PLINK [63]. When performing 1K permutations upon a 450K SNP sample of 2938 controls and 1749 cases of Crohn’s disease, PRESTO was approximately eighteen times faster than PLINK. More recently, [22] developed software called PERMORY, which is considerably faster than PLINK. For example, when performing 10K permutations upon a simulated balanced (i.e., equal numbers of cases and controls) GWAS sample of 6000 participants and 500K SNP markers, PERMORY completed the task in 1.9 hours. In contrast, the authors extrapolated the PLINK run time to be 43 days. Based upon this simulated data set, PERMORY is approximately 550 times faster than PLINK.
There is, however, a significant problem with the PERMORY approach. Namely, it is not clear how to handle missing genotype data with this approach, since the authors do not address the matter within the description of their algorithm. In fact, within section 2.4 of the article, the authors have misstated a critical fact regarding permutation for the maxT MTP in the presence of missing genotype data. Specifically, the authors claim that the permutation of phenotype elements (i.e., the random shuffling of the elements of the response vector y; see §2.2.1 for the definition of y) does not change the marginal totals of the 2×3 table (Table 2.1; see §2.2.1–§2.2.3 for appropriate definitions of terms) at locus j. However, this claim is not true for the maxT MTP when some loci are comprised
of missing genotype data. Because the authors fail to handle missing genotype data within their algorithm, the PERMORY approach is essentially incomplete.
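A toy example illustrates why the marginal totals need not be preserved. This is a minimal sketch, not drawn from the original text. It assumes a 2×3 table with rows for cases and controls and columns for genotypes 0/1/2, assumes a missing genotype is encoded as None and excluded from the table at that locus, and uses hypothetical function names and data.

```python
def table_2x3(genotypes, y):
    """Case/control by genotype table at one locus, skipping missing genotypes."""
    table = [[0, 0, 0], [0, 0, 0]]  # row 0 = cases (y = 1), row 1 = controls (y = 0)
    for g, yi in zip(genotypes, y):
        if g is None:  # missing genotype: individual drops out at this locus
            continue
        table[0 if yi == 1 else 1][g] += 1
    return table


def margins(table):
    """Return (row totals, column totals) of the table."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    return rows, cols


genotypes = [0, 1, 2, None, 1, 0]  # the fourth individual is missing at this locus
y = [1, 1, 1, 0, 0, 0]             # observed phenotypes
y_perm = [0, 1, 1, 1, 0, 0]        # a permutation: first and fourth labels swapped

print(margins(table_2x3(genotypes, y)))       # ([3, 2], [2, 2, 1])
print(margins(table_2x3(genotypes, y_perm)))  # ([2, 3], [2, 2, 1])
```

The genotype (column) totals are unchanged, but the case and control (row) totals among non-missing individuals shift from (3, 2) to (2, 3), because the permutation moved a case label onto the individual with the missing genotype.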
2.1.2 An Efficient Approach for Processing the Permutation Null Distribution of the MaxT and MinP Multiple Testing Procedures
We propose an optimized maxT/minP permutation algorithm, denoted GPER,² for conducting multiple hypothesis tests of the null hypothesis of no genotype-phenotype association within large SNP panel genetic association studies entailing a binary disease trait (e.g., a GWAS sample). Whereas previous maxT permutation algorithms (e.g., PLINK, PRESTO, PERMORY) make use of the central processing unit (CPU) of the personal computer (PC), our approach is novel in that we offload the computational burden of the permutation procedure to the graphics processing unit (GPU). Not only does this approach abolish the computational problem for the maxT and minP MTPs, it illustrates the utility of the GPU within the framework of a statistical application. This approach incorporates parallel computing, arguably the programming paradigm for the future of high performance computing (HPC) upon the personal computer (see §1.4.1), and is the key ingredient for many of the algorithms developed henceforth within this dissertation. Moreover, we develop an algorithm for clustering GPER across multiple GPUs (each GPU residing within a single personal computer), for which we demonstrate linear scaling in the computational power of GPER relative to the single-GPU implementation.
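The reason permutation MTPs suit parallel hardware is that permutations are mutually independent: each worker can process its own share, and the exceedance counts are simply summed. The sketch below illustrates only this decomposition. It is not Algorithm 2.1 and involves no GPU code. CPU threads stand in for devices, so no speed-up should be expected from it, and the statistic, function names, and seeding scheme are assumptions.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def stat(row, y):
    """Stand-in statistic: |mean genotype in cases - mean genotype in controls|."""
    cases = [g for g, yi in zip(row, y) if yi == 1]
    controls = [g for g, yi in zip(row, y) if yi == 0]
    return abs(sum(cases) / len(cases) - sum(controls) / len(controls))


def split_permutations(n_perm, n_workers):
    """Share n_perm permutations as evenly as possible among the workers."""
    base, extra = divmod(n_perm, n_workers)
    return [base + (1 if w < extra else 0) for w in range(n_workers)]


def chunk_counts(genotypes, y, observed, n_perm, seed):
    """One worker's share: count permutations whose max statistic reaches t_j."""
    rng = random.Random(seed)  # a separate random stream per worker
    y_perm = list(y)
    counts = [0] * len(observed)
    for _ in range(n_perm):
        rng.shuffle(y_perm)
        max_t = max(stat(row, y_perm) for row in genotypes)
        for j, t in enumerate(observed):
            if max_t >= t:
                counts[j] += 1
    return counts


def parallel_maxt(genotypes, y, n_perm=400, n_workers=4):
    observed = [stat(row, y) for row in genotypes]
    sizes = split_permutations(n_perm, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        parts = list(pool.map(
            lambda w: chunk_counts(genotypes, y, observed, sizes[w], seed=w),
            range(n_workers)))
    totals = [sum(col) for col in zip(*parts)]  # combine the workers' counts
    return [(c + 1) / (n_perm + 1) for c in totals]


genotypes = [
    [2, 2, 2, 2, 1, 0, 0, 0, 0, 0],
    [0, 1, 2, 0, 1, 2, 0, 1, 2, 0],
]
y = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
adjusted = parallel_maxt(genotypes, y, n_perm=200, n_workers=4)
```

Since the only communication is the final summation of counts, the work per worker falls roughly in proportion to the number of workers, which is consistent with the linear scaling described above.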
We provide the underlying details of the GPER algorithm (Algorithm 2.1) within §2.4. Here, we proceed with introducing some notation which will be used throughout the remainder of this manuscript (§2.2), and outline two data management techniques for efficient application of GPER (§2.3).