4.2.5 Joint Test Based on Forward SNP Selection

We propose a method based on forward variable selection in regression that provides a joint test of the statistical significance of a gene or pathway. We assume that n subjects have been genotyped on p SNPs within a candidate gene or pathway and have been measured for a quantitative phenotype. The algorithm for building the multi-SNP linear regression model is as follows, for j = 1, ..., p SNPs:

1. Begin forward selection (i.e. when j = 1) by carrying out single-SNP linear regression models across all p SNPs. After the first iteration (i.e. for j > 1), adjust for the selected SNPs (i.e. $\mathrm{SNP}_1, \ldots, \mathrm{SNP}_{j-1}$) in the model.

2. While adjusting for the selected set of SNPs ($\mathrm{SNP}_1, \ldots, \mathrm{SNP}_{j-1}$), construct p − (j − 1) linear regression models across the remaining p − (j − 1) SNPs that have not yet been selected to represent the candidate gene or pathway. Do not consider any SNPs that are in “high” LD with any of the SNPs already entered in the SNP covariate set, based on a user-defined $r^2$ threshold (i.e. prune any SNPs with pairwise $r^2$ values above a given $r^2$ threshold). Note that for j = 1 we simply select the SNP with the smallest p-value.

3. For each of the p − (j − 1) models, conduct a joint test of all the SNPs, i.e. $H_0: \beta_1 = \cdots = \beta_j = 0$. Select the model with the most significant joint p-value (one that also improves upon the prior joint p-value), provided that the p-value corresponding to the test on the individual parameter is less than a user-defined p-value threshold. If the individual p-value does not meet this p-value threshold, then consider the next “most significant” joint p-value (that also improves upon the prior joint p-value) and its corresponding individual p-value, and so forth. As a final filter, the SNP covariate set may not exceed a user-defined number of members. The j-th SNP corresponding to this joint test is selected as a predictor from the candidate gene or pathway in explaining the phenotypic variation. Record the p-value for this joint test under the j-th iteration.

4. Repeat steps 1 through 3 for each iteration of j until no more SNPs can be added to the multi-SNP linear regression model, based on the stopping criterion defined in the prior step. When the forward selection procedure ceases, the final model, which contains the largest number of variables, will have $p^*$ SNP predictors.

5. The minimum p-value amongst the $p^*$ joint p-values will be the last joint p-value recorded. This follows from the nature of the algorithm, since each admitted SNP must improve upon the prior joint p-value. This set of $p^*_{\min}$ SNPs is chosen to act as a proxy for the candidate gene or pathway. Estimate the p-value of this test statistic via permutation testing. This joint p-value represents the statistical significance of the candidate gene or pathway.
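The five steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the implementation used in this work: the function names (`forward_select`, `permutation_p`, `f_sf`, `rss`, `r2`), the default thresholds, and the fixed random seed are all hypothetical. LD is approximated by the squared Pearson correlation between genotype vectors, the individual p-value is taken from a partial F-test on the newly added SNP (equivalent to its t-test in a linear model), the permutation statistic is assumed to be the minimum joint p-value, and the permutation p-value uses the common (hits + 1) / (permutations + 1) convention. Monomorphic or perfectly collinear SNPs are not handled.

```python
import math
import random

def betainc(a, b, x):
    """Regularized incomplete beta I_x(a, b), via a Lentz continued fraction."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    if x > (a + 1.0) / (a + b + 2.0):
        return 1.0 - betainc(b, a, 1.0 - x)
    front = math.exp(math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
                     + a * math.log(x) + b * math.log(1.0 - x)) / a
    tiny = 1e-300
    c, d = 1.0, 1.0 - (a + b) * x / (a + 1.0)
    d = 1.0 / (d if abs(d) > tiny else tiny)
    h = d
    for m in range(1, 500):
        for num in (m * (b - m) * x / ((a + 2 * m - 1) * (a + 2 * m)),
                    -(a + m) * (a + b + m) * x / ((a + 2 * m) * (a + 2 * m + 1))):
            d = 1.0 + num * d
            d = 1.0 / (d if abs(d) > tiny else tiny)
            c = 1.0 + num / c
            c = c if abs(c) > tiny else tiny
            h *= d * c
        if abs(d * c - 1.0) < 1e-13:
            break
    return front * h

def f_sf(f, d1, d2):
    """Upper-tail p-value P(F > f) for an F(d1, d2) distribution."""
    return betainc(d2 / 2.0, d1 / 2.0, d2 / (d2 + d1 * f))

def rss(y, cols):
    """Residual sum of squares from OLS of y on an intercept plus cols."""
    n = len(y)
    X = [[1.0] + [float(col[i]) for col in cols] for i in range(n)]
    k = len(X[0])
    # Augmented normal equations [X'X | X'y], solved by Gauss-Jordan elimination.
    A = [[sum(X[i][r] * X[i][c] for i in range(n)) for c in range(k)]
         + [sum(X[i][r] * y[i] for i in range(n))] for r in range(k)]
    for col in range(k):
        piv = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        for r in range(k):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    beta = [A[r][k] / A[r][r] for r in range(k)]
    return sum((y[i] - sum(b * x for b, x in zip(beta, X[i]))) ** 2 for i in range(n))

def r2(a, b):
    """Squared Pearson correlation between two genotype vectors (LD proxy)."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    sab = sum((x - ma) * (z - mb) for x, z in zip(a, b))
    saa = sum((x - ma) ** 2 for x in a)
    sbb = sum((z - mb) ** 2 for z in b)
    return sab * sab / (saa * sbb)

def forward_select(y, snps, r2_max=0.5, p_max=0.05, max_snps=5):
    """Return (selected SNP indices, joint p-value recorded at each iteration)."""
    n, tss = len(y), rss(y, [])
    selected, joint_ps = [], []
    while len(selected) < max_snps:
        k = len(selected) + 1
        df2 = n - k - 1
        reduced = rss(y, [snps[s] for s in selected])
        cands = []
        for c in range(len(snps)):
            # Skip SNPs already chosen or in high LD with a chosen SNP.
            if c in selected or any(r2(snps[c], snps[s]) > r2_max for s in selected):
                continue
            full = rss(y, [snps[s] for s in selected] + [snps[c]])
            p_joint = f_sf((tss - full) / k / (full / df2), k, df2)
            p_indiv = f_sf((reduced - full) / (full / df2), 1, df2)
            cands.append((p_joint, p_indiv, c))
        prior = joint_ps[-1] if joint_ps else math.inf
        # Most significant joint test that improves on the prior joint p-value
        # and whose newly added SNP passes the individual p-value filter.
        ok = [t for t in sorted(cands) if t[0] < prior and t[1] < p_max]
        if not ok:
            break
        selected.append(ok[0][2])
        joint_ps.append(ok[0][0])
    return selected, joint_ps

def permutation_p(y, snps, n_perm=200, seed=0, **kw):
    """Permutation p-value for the minimum joint p-value from forward_select."""
    rng = random.Random(seed)
    best = lambda ps: min(ps) if ps else 1.0
    observed = best(forward_select(y, snps, **kw)[1])
    hits = 0
    for _ in range(n_perm):
        y_perm = list(y)
        rng.shuffle(y_perm)  # break the genotype-phenotype link
        hits += best(forward_select(y_perm, snps, **kw)[1]) <= observed
    return (hits + 1) / (n_perm + 1)
```

Here `snps` is a list of genotype vectors (one list of 0/1/2 allele counts per SNP) and `y` is the phenotype list. Calling `forward_select(y, snps)` returns the chosen SNP indices together with the joint p-value recorded at each iteration, and `permutation_p(y, snps)` reruns the whole selection on shuffled phenotypes so that the selection step itself is accounted for in the final p-value.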

Alternative Stopping Criterion

We also allowed for a more relaxed criterion in building the SNP covariate set. If the current joint p-value being evaluated did not improve upon the prior joint p-value, then we still admitted this SNP to the set (provided that its individual p-value met the threshold and the maximum number of SNP members in the set had not yet been reached) and continued building the set under the usual guidelines specified above. We ceased to expand the SNP set when we encountered a joint p-value that was not smaller than the overall minimum joint p-value.

Inclusion of Pairwise SNP x SNP Interactions

We also provided the option to include pairwise SNP x SNP interactions. As we constructed the SNP covariate set, we sequentially added all possible pairwise SNP x SNP interactions to the linear model containing the current state of the SNP set (as well as all other previously entered interactions). We kept an interaction term if the joint p-value that assessed all terms in the model became more significant.
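The keep-or-drop rule for interaction terms can be illustrated with a short sketch. It is a simplification under stated assumptions: the function name `add_interactions` is hypothetical, `joint_p` is a hypothetical caller-supplied function that returns the joint p-value for a list of model terms, the pairs are visited in index order, and the interactions are screened once for a fixed SNP set rather than interleaved with the forward selection as described above.

```python
from itertools import combinations

def add_interactions(snp_cols, joint_p):
    """Greedily keep pairwise product terms that lower the joint p-value.

    snp_cols: genotype vectors of the SNPs currently in the covariate set.
    joint_p:  hypothetical callable mapping a list of model terms to the
              joint p-value of the test on all of those terms.
    """
    terms = list(snp_cols)      # main effects already in the model
    best = joint_p(terms)       # joint p-value before any interaction
    kept = []
    for i, j in combinations(range(len(snp_cols)), 2):
        product = [a * b for a, b in zip(snp_cols[i], snp_cols[j])]
        p = joint_p(terms + [product])
        if p < best:            # keep only if the joint test is more significant
            terms.append(product)
            kept.append((i, j))
            best = p
    return kept, best
```

Each retained interaction stays in the model while later pairs are evaluated, which mirrors the statement that previously entered interactions remain in the model.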

A Note on the Thresholds: $r^2$, Individual P-Value, and Max Number of SNP Members

We analogously implemented the $r^2$, individual p-value, and maximum number of SNP members thresholds described in the Introduction in reference to PLINK’s set-based association test (Section 4.1) so as to allow a fair and direct comparison of PLINK’s approach and our competing method. The essential difference between the two techniques was that PLINK assessed overall statistical significance by averaging the single SNP test statistics contained in the set, whereas the p-value in our proposed method was based on the joint test of the parameters in a general linear model.
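In symbols, the contrast can be written roughly as follows. The notation is introduced here for illustration only and is an assumption: $T_i$ denotes the single-SNP test statistic of the $i$-th SNP in a set of $k$ SNPs, $\mathrm{RSS}_0$ is the residual sum of squares of the intercept-only model, and $\mathrm{RSS}_k$ is that of the linear model containing all $k$ SNPs fitted to $n$ subjects. The second expression is the standard F statistic for a joint test of $k$ regression parameters.

\[
T_{\text{average}} = \frac{1}{k}\sum_{i=1}^{k} T_i
\qquad \text{versus} \qquad
F = \frac{(\mathrm{RSS}_0 - \mathrm{RSS}_k)/k}{\mathrm{RSS}_k/(n - k - 1)}
\]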

Setting the maximum number of members in a set to one and not imposing a p-value filter (i.e. setting the p-value threshold to one) resulted in a test based on the “best” single SNP for PLINK’s set-based test and our forward selection procedure. On the other hand, by not constraining the number of SNPs in the set and by turning the p-value and $r^2$ filters off (i.e. p-value threshold = 1 and $r^2$ threshold = 1), PLINK’s test included all test statistics across all of the SNPs in the data set. For our method, it was

In document Novel statistical methods for the study design and analysis of genome-wide association studies (Page 177-180)