
² Named from the acronym GPU and the word permutation, emphasizing the utility of the graphics processing

2.2.1 Data Setup for a GWAS

Consider a GWAS in which data are collected on $m$ genetic markers among $n$ study participants, with a binary response (trait) recorded for each participant. For example, such a data set could arise from sampling $n_0$ controls and $n_1$ cases (where $n_1 = n - n_0$) from some population and then obtaining for each participant (by way of, say, blood samples) genotypes for a collection of $m$ genetic markers. The data can be represented succinctly by a single vector (holding the binary responses) and a single matrix (holding the genotypes across participants and SNP loci). Indeed, the data for the $i$th participant consist of the binary response $y_i$, where

\[
y_i =
\begin{cases}
1, & \text{if the participant is a case (diseased)} \\
0, & \text{if the participant is a control (healthy, non-diseased);}
\end{cases}
\tag{2.1}
\]

and the SNP profile $g_i = (g_{1i}, \ldots, g_{mi})'$, where $g_{ji}$ denotes the genotype at the $j$th SNP locus for participant $i$. The genotype at any SNP locus is defined as the number of copies (zero, one, or two) of the minor allele, that is, the allele occurring less frequently at the locus within the population. Thus, for $j = 1, \ldots, m$ and $i = 1, \ldots, n$,

\[
g_{ji} =
\begin{cases}
2, & \text{if participant } i \text{ carries two copies of the minor allele at SNP locus } j \\
1, & \text{if participant } i \text{ carries one copy of the minor allele at SNP locus } j \\
0, & \text{if participant } i \text{ carries no copies of the minor allele at SNP locus } j.
\end{cases}
\tag{2.2}
\]

For notational clarity, we organize the $n$ SNP profiles into the $m \times n$ matrix $G = (g_1 \cdots g_n)$, referred to as the genotype matrix, whose row and column indices identify SNP loci and participants, respectively; we denote the vector of binary responses for the $n$ study participants by $y = (y_1, \ldots, y_n)$, referred to as the response vector.
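The layout just described can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the sizes, the random seed, and the Binomial(2, MAF) draw used to fill the genotype matrix are hypothetical choices made here and are not part of the study design.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed so the toy data are reproducible

m, n0, n1 = 5, 6, 4              # hypothetical sizes: 5 SNP loci, 6 controls, 4 cases
n = n0 + n1

# Response vector y: 1 = case, 0 = control, as in (2.1).
y = np.concatenate([np.zeros(n0, dtype=np.int8), np.ones(n1, dtype=np.int8)])

# Genotype matrix G: rows index SNP loci, columns index participants, as in (2.2).
# Minor-allele counts are simulated as Binomial(2, maf) purely for illustration.
maf = rng.uniform(0.05, 0.5, size=m)
G = rng.binomial(2, np.tile(maf[:, None], (1, n))).astype(np.int8)

g_3 = G[:, 2]     # SNP profile of the third participant (length m)
snp_2 = G[1, :]   # genotypes at the second SNP locus for all participants (length n)
```

Storing the genotypes in a small integer type keeps each SNP row contiguous in memory, which is convenient when the same row is reused across many permutations of `y`.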

2.2.2 The Genetic Model of Inheritance – Statistical Model

Here, let $G_j$ and $Y$ denote the random variables corresponding, respectively, to the genotype at SNP locus $j$, $j = 1, \ldots, m$, and to the binary response. Within a GWAS, we are interested in testing the null hypothesis of no association between $Y$ and $G_j$, which we denote by $H_0^{(j)}$. There are several ways to define the alternative hypothesis that an association exists between $G_j$ and $Y$, each of which rests on the notion of a genetic model of inheritance. A genetic model of inheritance (GMI) for a biallelic SNP locus describes how the risk of disease is expected to change as the number of copies of the minor allele changes. When the GMI between $G_j$ and $Y$ is unknown (and it is rarely known, especially for diseases whose etiology is poorly understood), the GMI under the alternative hypothesis is specified as the general model [97]. If, on the other hand, the GMI between $G_j$ and $Y$ under the alternative hypothesis is known, then, in keeping with the literature, it is assumed to be one of three models: (1) additive; (2) recessive; or (3) dominant [63, 97]. As mentioned in §1.1, by far the most commonly assumed GMI in GWAS is the additive model [23]; for clarity of discussion, it is the GMI we assume here.

The additive GMI assumes that the log-odds of disease change linearly with the number of copies of the minor allele at SNP locus $j$; equivalently, each one-unit increase in the number of copies of the minor allele at the locus produces the same additive change in the log-odds of disease. Mathematically, if $\pi_{jk} = \Pr(Y = 1 \mid G_j = k)$ for $k \in \{0, 1, 2\} = \mathcal{G}$, the additive GMI assumes that the $\pi_{jk}$ satisfy the simple logistic regression model

\[
\log\left(\mathrm{Odds}(\pi_{jk})\right) = \beta_{0j} + \beta_{1j} k \quad \forall k \in \mathcal{G}, \tag{2.3}
\]

where $\beta_{0j}$ and $\beta_{1j}$ are population parameters. Therefore, in terms of model (2.3), the test of $H_0^{(j)}$ against the two-sided alternative hypothesis (denoted $H_a^{(j)}$) under the additive GMI can be expressed as

\[
H_0^{(j)}: \beta_{1j} = 0 \qquad H_a^{(j)}: \beta_{1j} \neq 0. \tag{2.4}
\]
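As a short worked consequence of model (2.3), take $\mathrm{Odds}(\pi) = \pi / (1 - \pi)$ (the usual definition, assumed here since the text does not spell it out). Solving (2.3) for $\pi_{jk}$ and comparing adjacent genotypes gives

\[
\pi_{jk} = \frac{e^{\beta_{0j} + \beta_{1j} k}}{1 + e^{\beta_{0j} + \beta_{1j} k}},
\qquad
\frac{\mathrm{Odds}(\pi_{j,k+1})}{\mathrm{Odds}(\pi_{jk})} = e^{\beta_{1j}}, \quad k \in \{0, 1\}.
\]

Each additional copy of the minor allele therefore multiplies the odds of disease by the same factor $e^{\beta_{1j}}$, and the null hypothesis $\beta_{1j} = 0$ in (2.4) corresponds to $\pi_{j0} = \pi_{j1} = \pi_{j2}$.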

2.2.3 The Cochran-Armitage Trend Test

By combining the elements of the $j$th row of the genotype matrix with those of the response vector, we can cross-classify the sample data for $G_j$ and $Y$ in a $2 \times 3$ contingency table (Table 2.1). To test the null hypothesis of no association between $G_j$ and $Y$ in GWAS, a commonly applied test statistic is that of the Cochran-Armitage trend test (CATT) [19, 20, 21, 22], which can be expressed as [98]

\[
T_j = \frac{n \left( n \sum_{k \in \mathcal{G}} n_{j1k} v_k - n_1 \sum_{k \in \mathcal{G}} n_{jk} v_k \right)^2}
{n_0 \, n_1 \left[ n \sum_{k \in \mathcal{G}} n_{jk} v_k^2 - \left( \sum_{k \in \mathcal{G}} n_{jk} v_k \right)^2 \right]}, \tag{2.5}
\]

where $v_k$, $k \in \mathcal{G}$, denotes the score assigned to genotype $G_j = k$ (the scores specify the trend in the $\pi_{jk}$ that is tested under $H_a^{(j)}$), and $n_{j1k}$ and $n_{jk}$ are the counts of genotype $k$ among the cases and in the entire sample, respectively. In particular, taking $(v_0, v_1, v_2) = (t, t + 1, t + 2)$ for some real number $t$, the CATT statistic can be used to test $H_0^{(j)}$ against $H_a^{(j)}$ under the additive GMI. The reader may wonder why the CATT is used to test $H_0^{(j)}$ instead of performing inference directly on the slope parameter of the simple logistic regression model (2.3) (e.g., by a likelihood ratio test (LRT), score test, or Wald test of $H_0^{(j)}$ [99]). In fact, under $H_0^{(j)}$, the CATT statistic (2.5) is equivalent to Rao's score test statistic for testing the hypotheses (2.4) in that logistic regression model. We provide a formal statement and proof of this result as Proposition A.1 in Appendix A.
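The statistic (2.5) can be computed directly from one row of the genotype matrix and the response vector. The following is a minimal sketch, not the implementation used in this work; the function name `catt_statistic`, its default scores, and the choice to return NaN for a monomorphic SNP (where the denominator of (2.5) is zero) are assumptions made for illustration.

```python
import numpy as np

def catt_statistic(g, y, scores=(0.0, 1.0, 2.0)):
    """CATT statistic (2.5) for a single SNP.

    g: genotypes in {0, 1, 2} for the n participants (one row of G).
    y: binary responses (1 = case, 0 = control).
    scores: (v_0, v_1, v_2); any (t, t+1, t+2) tests the additive trend.
    """
    g = np.asarray(g)
    y = np.asarray(y)
    v = np.asarray(scores, dtype=float)
    n = y.size
    n1 = int(y.sum())
    n0 = n - n1
    n_k = np.bincount(g, minlength=3).astype(float)           # n_{jk}: whole sample
    n_1k = np.bincount(g[y == 1], minlength=3).astype(float)  # n_{j1k}: cases only
    num = n * (n * (n_1k @ v) - n1 * (n_k @ v)) ** 2
    den = n0 * n1 * (n * (n_k @ v ** 2) - (n_k @ v) ** 2)
    if den == 0:
        return float("nan")   # assumption: undefined when only one genotype is observed
    return num / den

# Toy data (hypothetical): 10 participants, 5 cases and 5 controls.
g = np.array([0, 1, 2, 0, 1, 1, 2, 0, 2, 1])
y = np.array([1, 1, 1, 0, 0, 0, 1, 0, 1, 0])
t_default = catt_statistic(g, y)                            # scores (0, 1, 2)
t_shifted = catt_statistic(g, y, scores=(5.0, 6.0, 7.0))    # scores (t, t+1, t+2), t = 5
```

Because both the numerator and the denominator of (2.5) are unchanged when a constant is added to every score, `t_default` and `t_shifted` agree up to floating-point error, which is why any choice of $t$ yields the same test. A further sanity check for equally spaced scores is that the statistic equals $n$ times the squared Pearson correlation between the genotype row and the response vector.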

Table 2.1: Cross-classification of Disease Status and Genotype for SNP Locus $j$.

              Number of Copies of Minor Allele
              0            1            2            Totals
Cases         $n_{j10}$    $n_{j11}$    $n_{j12}$    $n_1$
Controls      $n_{j00}$    $n_{j01}$    $n_{j02}$    $n_0$
Totals        $n_{j0}$     $n_{j1}$     $n_{j2}$     $n$
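The counts in Table 2.1 can be tabulated from one genotype row and the response vector as sketched below. The helper name `genotype_table` and the toy data are hypothetical and serve only to show how the cells and margins relate to each other.

```python
import numpy as np

def genotype_table(g, y):
    """2 x 3 table of counts: row 0 = cases, row 1 = controls;
    columns = 0, 1, 2 copies of the minor allele."""
    g = np.asarray(g)
    y = np.asarray(y)
    cases = np.bincount(g[y == 1], minlength=3)      # n_{j10}, n_{j11}, n_{j12}
    controls = np.bincount(g[y == 0], minlength=3)   # n_{j00}, n_{j01}, n_{j02}
    return np.vstack([cases, controls])

# Toy data (hypothetical): 3 cases followed by 5 controls.
g = np.array([0, 1, 2, 0, 1, 1, 2, 0])
y = np.array([1, 1, 1, 0, 0, 0, 0, 0])

table = genotype_table(g, y)       # cases: [1, 1, 1]; controls: [2, 2, 1]
row_totals = table.sum(axis=1)     # (n_1, n_0) = (3, 5)
col_totals = table.sum(axis=0)     # (n_{j0}, n_{j1}, n_{j2}) = (3, 3, 2)
```

Only the case row and the column totals enter (2.5), so in a permutation setting the column totals can be computed once per SNP and reused.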

2.2.4 The MaxT and MinP Multiple Testing Procedures

Let $t_j$ and $p_j = \Pr\left(T_j \geq t_j \mid H_0^{(j)}\right)$ denote, respectively, the realization of $T_j$ (2.5) and the pointwise p-value in testing $H_0^{(j)}$. Given an MTP, the adjusted p-value in testing $H_0^{(j)}$, denoted $\tilde{p}_j$, is the nominal level of the entire test procedure at which $H_0^{(j)}$ would just be rejected, given the values of all test statistics involved (see, e.g., [60, 62]). That is,

\[
\tilde{p}_j = \inf \left\{ \alpha \in [0, 1] : H_0^{(j)} \text{ is rejected at nominal FWER} = \alpha \right\}, \tag{2.6}
\]

where the nominal FWER is the $\alpha$ level at which the MTP is performed. To control the FWER while accounting for the joint correlation among the test statistics $(T_1, \ldots, T_m)$, [62] proposed the single-step minP adjusted p-value (hereinafter, minP adjusted p-value) for null hypothesis $H_0^{(j)}$, $\tilde{p}_{j(\mathrm{minP})}$, defined by

\[
\tilde{p}_{j(\mathrm{minP})} = \Pr\left( \min_{1 \leq k \leq m} P_k \leq p_j \,\middle|\, H_0 \right), \tag{2.7}
\]

where $P_k$ denotes the random variable for the pointwise p-value in testing null hypothesis $H_0^{(k)}$, $k = 1, \ldots, m$. Alternatively, one may base the multiplicity correction on the single-step maxT adjusted p-values (hereinafter, maxT adjusted p-values), defined in terms of the test statistics $(T_1, \ldots, T_m)$ themselves [60, 62]:

\[
\tilde{p}_{j(\mathrm{maxT})} = \Pr\left( \max_{1 \leq k \leq m} T_k \geq t_j \,\middle|\, H_0 \right). \tag{2.8}
\]
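To give a feel for (2.7) in a case where it has a closed form, suppose, purely for illustration, that under $H_0$ the $P_k$ are independent and exactly Uniform(0, 1). This assumption is made only for the worked example; it does not hold in a GWAS, where neighboring SNPs are correlated and the CATT statistic is discrete. Then

\[
\tilde{p}_{j(\mathrm{minP})} = 1 - \Pr\left( P_1 > p_j, \ldots, P_m > p_j \mid H_0 \right) = 1 - (1 - p_j)^m \leq m \, p_j .
\]

For instance, with $m = 10^6$ markers and $p_j = 5 \times 10^{-8}$, this gives $1 - (1 - 5 \times 10^{-8})^{10^6} \approx 1 - e^{-0.05} \approx 0.049$. When the test statistics are positively correlated, the minimum p-value tends to be less extreme than under independence, so the probability in (2.7) is typically smaller than this bound; capturing that joint behavior is the purpose of (2.7) and (2.8).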

We note that the maxT and minP MTPs control the FWER in the weak sense [60], a point that is essentially absent from the GWAS literature; in particular, the articles [85] and [22] make no mention of it. Strong control of the FWER holds under the property of subset pivotality (see pg. 42 of [62]). The distribution of the pointwise p-values $(P_1, \ldots, P_m)$ is said to possess subset pivotality provided that the joint distribution of the random vector $\left\{ P_j : H_0^{(j)} \in H_0^p \right\}$ is identical under $H_0^p$ and under the complete null hypothesis $H_0$, for all $H_0^p \in \mathcal{P}(H_0)$ [60], where (as previously defined in §1.3) $H_0^p$ denotes a partial null hypothesis over $H_0$ and $\mathcal{P}(\cdot)$ denotes the power set of $(\cdot)$. Subset pivotality holds for the pointwise p-values of the Cochran-Armitage trend test statistics (see pg. 157 of [62]), so the maxT and minP MTPs attain strong control of the FWER when the Cochran-Armitage trend test is used to test $H_0$.

When the distributions of T(m) and P(1) are unknown, the maxT and minP adjusted p-values can be estimated by resampling [60, 62], where T(k) and P(k) denote the kth order statistics for the respective vectors (T1, . . . , Tm) and (P1, . . . , Pm). Here, in accordance with the PERMORY approach, we consider permuting the response vector, y, a total of R times [22]. Then, in accordance with Box 2 of [60], within the rth permutation, r = 1, . . . , R:

1. Randomly shuffle (i.e., permute) the elements of the response vector $y$. Permuting the elements of $y$ while leaving the genotype matrix $G$ intact makes $y$ independent of $G$ (i.e., we are simulating $H_0$) and preserves the correlation structure and distributional properties of the SNP profiles ($g_i$) within $G$.

2. Compute the test statistic for each null hypothesis $H_0^{(j)}$, $t_{j,r}$, $j = 1, \ldots, m$. If implementing the minP MTP, then also compute the pointwise p-value corresponding to $t_{j,r}$, $p_{j,r} = \Pr\left( T_j \geq t_{j,r} \mid H_0^{(j)} \right)$.

3. If implementing the maxT MTP, then locate the maximum of the $t_{j,r}$ over $j$, denoted by $t_{(m),r}$. If implementing the minP MTP, then locate the minimum of the $p_{j,r}$ over $j$, denoted by $p_{(1),r}$.

The maxT and minP permutation adjusted p-values are given by

\[
\tilde{p}^{*}_{j(\mathrm{maxT})} = \frac{\sum_{r=1}^{R} I\left( t_{(m),r} \geq t_j \right)}{R}, \tag{2.9}
\]

and

\[
\tilde{p}^{*}_{j(\mathrm{minP})} = \frac{\sum_{r=1}^{R} I\left( p_{(1),r} \leq p_j \right)}{R}, \tag{2.10}
\]

respectively, where $t_j$ and $p_j$ denote the respective realizations of $T_j$ and $P_j$ under $H_0^{(j)}$ for the observed (non-permuted) data, and $I(\cdot)$ is the indicator function, returning one if its argument is true and zero otherwise.
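The three steps and the estimators (2.9) and (2.10) can be sketched as follows. This is a plain CPU illustration under stated assumptions, not the implementation developed in this work: the function names are hypothetical, the scores are fixed at (0, 1, 2), every SNP is assumed polymorphic, and the pointwise p-values are taken from the asymptotic chi-square distribution with one degree of freedom, computed via the identity $\Pr(\chi^2_1 \geq t) = \mathrm{erfc}(\sqrt{t/2})$.

```python
import math
import numpy as np

def catt_all(G, y):
    """CATT statistics (2.5) with scores (0, 1, 2) for every row of G."""
    G = np.asarray(G, dtype=float)
    y = np.asarray(y, dtype=float)
    n, n1 = y.size, y.sum()
    n0 = n - n1
    s_gy = G @ y                   # sum over k of n_{j1k} v_k
    s_g = G.sum(axis=1)            # sum over k of n_{jk} v_k
    s_gg = (G ** 2).sum(axis=1)    # sum over k of n_{jk} v_k^2
    return n * (n * s_gy - n1 * s_g) ** 2 / (n0 * n1 * (n * s_gg - s_g ** 2))

def chi2_1_sf(t):
    """Upper tail of the chi-square(1) distribution (asymptotic p-value assumption)."""
    return np.array([math.erfc(math.sqrt(v / 2.0)) for v in np.atleast_1d(t)])

def permutation_adjusted_pvalues(G, y, R=1000, seed=0):
    """Estimate the maxT (2.9) and minP (2.10) adjusted p-values with R permutations."""
    rng = np.random.default_rng(seed)
    t_obs = catt_all(G, y)
    p_obs = chi2_1_sf(t_obs)
    max_t = np.empty(R)
    min_p = np.empty(R)
    for r in range(R):
        y_perm = rng.permutation(y)        # step 1: shuffle responses, keep G fixed
        t_r = catt_all(G, y_perm)          # step 2: statistics for permutation r
        p_r = chi2_1_sf(t_r)               #         and their pointwise p-values
        max_t[r] = t_r.max()               # step 3: t_{(m),r}
        min_p[r] = p_r.min()               #         p_{(1),r}
    p_maxT = (max_t[None, :] >= t_obs[:, None]).mean(axis=1)   # (2.9)
    p_minP = (min_p[None, :] <= p_obs[:, None]).mean(axis=1)   # (2.10)
    return p_maxT, p_minP

# Toy usage (hypothetical sizes): 20 SNPs, 30 controls followed by 30 cases.
rng = np.random.default_rng(1)
G = rng.binomial(2, 0.3, size=(20, 60))
y = np.repeat([0, 1], 30)
p_maxT, p_minP = permutation_adjusted_pvalues(G, y, R=200)
```

Under the chi-square assumption used here, every p-value is the same decreasing function of its statistic, so the two procedures should give essentially the same adjusted p-values; they can differ when the pointwise p-values come from marginal null distributions that vary across SNPs. Note also that the loop recomputes all $m$ statistics in each of the $R$ permutations, so its cost grows with $mR$.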

2.3 Data Management Techniques for Efficient Processing of the MaxT and MinP