Performance Benchmarking - Robust Computational Tools for Multiple Testing with Genetic Association Studies

To gauge the computational performance of GPER across varying sampling characteristics of GWAS data sets (in particular, sample size and the case/control balance¹¹ of the sample), we simulated subsets of GWAS data sets of varying sample sizes and varying degrees of case/control balance, holding the marker density fixed at m = 40K across the simulations. For each data set, 4K SNP loci (of the 40K total) were simulated under assumed Hardy-Weinberg equilibrium (HWE) among population genotype frequencies at each of the ten (10) minor allele frequencies (MAF; the population frequency of the rarer allele at a particular locus) in the collection {0.01, 0.02, . . . , 0.10} (see the simulation setup within §3.2.4.1 for a justification of this collection of values). For each data set, R = 10 240 random permutations were applied within GPER and PERMORY, and R = 1000 permutations were applied within PLINK.¹²
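The simulation design described above can be sketched as follows. This is a minimal illustration, not the code used to produce the reported data sets: the function names (`hwe_genotype_probs`, `simulate_locus`, `simulate_panel`) are hypothetical, genotypes are coded as minor-allele counts 0/1/2, and case/control labels are not modelled here.

```python
import random

def hwe_genotype_probs(maf):
    """Return (P(0), P(1), P(2)) copies of the minor allele under HWE."""
    q = maf          # minor allele frequency
    p = 1.0 - q      # major allele frequency
    return (p * p, 2.0 * p * q, q * q)

def simulate_locus(n, maf, rng):
    """Draw n genotypes (minor-allele counts) for a single SNP under HWE."""
    return rng.choices((0, 1, 2), weights=hwe_genotype_probs(maf), k=n)

def simulate_panel(n, loci_per_maf, mafs, seed=0):
    """Simulate loci_per_maf SNPs at each MAF; returns a list of (maf, genotypes)."""
    rng = random.Random(seed)
    panel = []
    for maf in mafs:
        for _ in range(loci_per_maf):
            panel.append((maf, simulate_locus(n, maf, rng)))
    return panel

# MAF collection {0.01, 0.02, ..., 0.10} taken from the text.
MAFS = [round(0.01 * k, 2) for k in range(1, 11)]

# Scaled-down demo: 2 loci per MAF instead of the 4K used in the text.
demo = simulate_panel(n=200, loci_per_maf=2, mafs=MAFS, seed=1)
print(len(demo))  # 20 loci in total
```

With `loci_per_maf=4000` the panel would contain the m = 40K markers described in the text, although a pure-Python loop of that size is slow and is shown only for clarity.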

Table 2.5: Computational Time to Perform R = 10 240 Permutations Within GPER and PERMORY, and R = 1000 Permutations Within PLINK.

                              Computational Time (minutes)     GPER Speedup Over
Cases (n1)    Controls (n0)   GPER     PLINK     PERMORY       PLINK†     PERMORY
1000          1000            0.6      43.0      5.1           785x       8x
900           1100            0.5      42.3      5.9           830x       10x
800           1200            0.5      42.1      5.4           890x       10x
1500          1500            0.8      64.7      7.3           790x       8x
1350          1650            0.8      63.9      7.2           835x       8x
1200          1800            0.7      62.8      7.1           880x       9x
2000          2000            1.2      89.7      7.9           775x       6x
1800          2200            1.1      88.8      8.3           840x       7x
1600          2400            1.0      87.1      7.6           910x       7x

† Extrapolated estimate.

¹¹ A balanced/unbalanced GWAS sample comprises equal/unequal numbers of cases and controls.

¹² All simulations conducted within this section assume: the value of ρ to be four (4); the asymptotic Cochran-Armitage trend test statistic to be used to test the null hypothesis of no genotype-phenotype association at each SNP marker, where the additive genetic model of inheritance is assumed under the two-sided alternative hypothesis across SNP loci; and GPER implemented upon a single GPU.

Table 2.5 summarizes the results from this simulation. In all simulations, GPER substantially outperformed both PLINK and PERMORY, as shown by the final two columns of the table. Interestingly, for any fixed case/control balance (i.e., 40%, 45%, or 50% cases within the sample), the performance of PERMORY relative to GPER seems to improve as the sample size increases, as shown by the apparent decreasing trend in the final column of the table; exactly the opposite seems to hold for the performance of PLINK relative to GPER as the sample size increases (column 6). Moreover, as expected (per the methodology of §2.3.1), these data suggest that the computational performance of GPER improves as the sample becomes increasingly unbalanced, as shown by the decreasing computational time for a fixed sample size (column 3). Although this also seems to be true of PLINK (column 4), and of PERMORY (column 5) when n = 3000, the effect is clearest for GPER. For example, consider the samples of n = 4000. Comparing the timing of the unbalanced sample comprising 40% cases (row 9) with that of the balanced sample (row 7), we find the ratios to be 0.83 for GPER, 0.97 for PLINK, and 0.96 for PERMORY. This suggests that the efficiency of GPER relative to each of PLINK and PERMORY increases as the sample becomes increasingly unbalanced.
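The arithmetic behind the n = 4000 comparison can be reproduced from the Table 2.5 entries, as in the sketch below. The helper names are hypothetical. The function `extrapolated_speedup` additionally assumes that run time scales linearly with the number of permutations; the source does not state how the PLINK speedups were extrapolated, so this is one plausible reading only.

```python
def relative_time(unbalanced_minutes, balanced_minutes):
    """Ratio of run times: unbalanced sample over balanced sample."""
    return unbalanced_minutes / balanced_minutes

def extrapolated_speedup(other_minutes, other_perms, gper_minutes, gper_perms):
    """Speedup of GPER after rescaling the other tool to gper_perms permutations.

    ASSUMPTION (not stated in the source): run time is linear in the
    number of permutations.
    """
    scaled_minutes = other_minutes * gper_perms / other_perms
    return scaled_minutes / gper_minutes

# Table 2.5, n = 4000: (row 9: 1600 cases / 2400 controls, row 7: 2000 / 2000).
TIMES = {"GPER": (1.0, 1.2), "PLINK": (87.1, 89.7), "PERMORY": (7.6, 7.9)}

ratios = {tool: round(relative_time(u, b), 2) for tool, (u, b) in TIMES.items()}
print(ratios)  # {'GPER': 0.83, 'PLINK': 0.97, 'PERMORY': 0.96}

# PLINK ran R = 1000 permutations; GPER ran R = 10 240 (row 7 of Table 2.5).
plink_speedup = extrapolated_speedup(89.7, 1000, 1.2, 10240)
```

Under the linearity assumption, `plink_speedup` is roughly 765, against the 775x reported for that row. A gap of this size is to be expected, since the tabulated times are rounded to one decimal place.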

To examine the performance of GPER on marker panels of a size typical of GWAS, we simulated balanced GWAS data sets of varying sample sizes, assuming marker densities of m = 500K and m = 1M, under two different scenarios governing the underlying MAFs of the markers. The first (denoted simulation scenario 1) was identical to that given above, where each marker panel was simulated uniformly over the collection of MAFs {0.01, . . . , 0.10}. For the second (denoted simulation scenario 2), we noted that, by algorithm design, the computational performance of PERMORY appears to depend upon the distribution of MAFs within the GWAS sample. Namely, in theory, PERMORY runs faster on GWAS samples containing a large proportion of markers with small MAF. Thus, when applied to marker panels whose MAF distribution resembles that of the former simulation, the performance of PERMORY could be overstated relative to its anticipated performance in practice. Hence, to obtain an idea of the relative performance of GPER and PERMORY on GWAS samples whose marker panels span the entire MAF domain, we simulated MAFs uniformly over the collection {0.01, . . . , 0.50}. Overall, for GPER we anticipated no difference in performance between the two simulation scenarios since, by design, the GPER algorithm does not depend upon the MAF distribution of the markers. However, as previously alluded to, we anticipated the computational performance of PERMORY to be lower in the latter scenario than in the former.
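The difference between the two scenarios can be made concrete with a short sketch. It assumes, as in the m = 40K design, that "uniformly" means an equal number of markers at each MAF value, and that the second collection advances in steps of 0.01; the source elides the intermediate values, so the step size is an assumption. All names are hypothetical.

```python
def maf_grid(upper_hundredths):
    """MAF values 0.01, 0.02, ..., upper_hundredths / 100 (step 0.01 assumed)."""
    return [k / 100 for k in range(1, upper_hundredths + 1)]

SCENARIO_1 = maf_grid(10)   # {0.01, ..., 0.10}
SCENARIO_2 = maf_grid(50)   # {0.01, ..., 0.50}

def allocate_markers(m, grid):
    """Split m markers equally across the MAF values in grid."""
    if m % len(grid) != 0:
        raise ValueError("m must be divisible by the number of MAF values")
    per_maf = m // len(grid)
    return {maf: per_maf for maf in grid}

def share_at_or_below(allocation, threshold):
    """Fraction of markers whose MAF does not exceed threshold."""
    total = sum(allocation.values())
    low = sum(count for maf, count in allocation.items() if maf <= threshold)
    return low / total

alloc_1 = allocate_markers(500_000, SCENARIO_1)  # 50 000 markers per MAF value
alloc_2 = allocate_markers(500_000, SCENARIO_2)  # 10 000 markers per MAF value

print(share_at_or_below(alloc_1, 0.10))  # 1.0
print(share_at_or_below(alloc_2, 0.10))  # 0.2
```

Under these assumptions, every marker in scenario 1 has MAF of at most 0.10, whereas only one fifth of the markers in scenario 2 do, which is the regime in which PERMORY is expected to lose its advantage on rare variants.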

Table 2.6: Computational Time to Perform R = 10 240 Permutations Within GPER and PERMORY, Across Several Balanced GWAS Sample Sizes, Marker Densities, and Distribution of SNP Minor Allele Frequencies.

                                               Computational Time (minutes)
Marker Density   MAF Range      Sample Size    GPER†          PERMORY
m = 500K         0.01 − 0.10    n = 2000       7.0 (8x)       62.6
                                n = 3000       10.9 (7x)      90.9
                                n = 4000       15.6 (6x)      110.1
m = 500K         0.01 − 0.50    n = 2000       7.0 (16x)      118.7
                                n = 3000       10.9 (16x)     180.1
                                n = 4000       15.6 (13x)     218.7
m = 1M           0.01 − 0.10    n = 2000       14.0 (8x)      120.6
                                n = 3000       22.0 (7x)      175.8
                                n = 4000       31.2 (6x)      240.9
m = 1M           0.01 − 0.50    n = 2000       13.9 (20x)     294.1
                                n = 3000       21.9 (14x)     345.3
                                n = 4000       30.4 (12x)     394.3

† Parenthetic values represent speedup over PERMORY.

Table 2.6 summarizes the computational time to perform R = 10 240 maxT permutations within GPER and PERMORY, across the marker panel densities and sample sizes for the two simulation scenarios. In all simulations, GPER substantially outperformed PERMORY, as shown by the final two columns of the table. In addition, as in the simulation conducted above with m = 40K, the performance of PERMORY relative to GPER improves with increasing sample size. Nonetheless, even for n = 4000, GPER was at least six (6) times faster than PERMORY. Moreover, as expected, the computational performance of PERMORY appears to depend upon the distribution of MAF among the SNPs. Taking the SNP density m = 500K, for example, PERMORY required essentially twice the time to complete the maxT permutations under simulation scenario 2 as under simulation scenario 1. The computational performance of GPER, on the other hand, is insensitive to the MAF distribution of the SNPs: its times are essentially identical across the two scenarios. Overall, based upon these simulations, GPER appears to be the computational tool of choice for the maxT MTP upon GWAS data.
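For readers unfamiliar with the procedure being timed, the following is a minimal CPU sketch of a single-step maxT permutation adjustment built on the Cochran-Armitage trend statistic with additive weights (0, 1, 2). It is not the GPER, PERMORY, or PLINK implementation, and it makes several assumptions: the function names are hypothetical, the parameter ρ is not modelled, the single-step variant is used rather than a step-down variant, and the (count + 1) / (R + 1) convention for permutation p-values is one common choice among several.

```python
import random

def trend_statistic(genotypes, labels):
    """Cochran-Armitage trend chi-square, additive weights, labels 1=case/0=control."""
    n = len(genotypes)
    cases = sum(labels)
    controls = n - cases
    sum_g = sum(genotypes)
    sum_g2 = sum(g * g for g in genotypes)
    sum_gy = sum(g for g, y in zip(genotypes, labels) if y == 1)
    denom = cases * controls * (n * sum_g2 - sum_g * sum_g)
    if denom == 0:
        return 0.0  # monomorphic marker or single-class sample
    return n * (n * sum_gy - cases * sum_g) ** 2 / denom

def maxt_adjusted_pvalues(panel, labels, n_perm, seed=0):
    """Single-step maxT adjusted p-values for every marker in panel."""
    rng = random.Random(seed)
    observed = [trend_statistic(g, labels) for g in panel]
    exceed = [0] * len(panel)
    shuffled = list(labels)
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # permute case/control labels across subjects
        max_stat = max(trend_statistic(g, shuffled) for g in panel)
        for j, obs in enumerate(observed):
            if max_stat >= obs:
                exceed[j] += 1
    return [(count + 1) / (n_perm + 1) for count in exceed]

# Toy data: 10 cases then 10 controls, three markers.
labels = [1] * 10 + [0] * 10
panel = [
    [2] * 10 + [0] * 10,   # perfectly associated with case status
    [0, 1] * 10,           # unrelated to case status
    [0] * 20,              # monomorphic
]
adjusted = maxt_adjusted_pvalues(panel, labels, n_perm=200, seed=1)
```

The cost of this loop grows with the product of the number of permutations, markers, and subjects, which is why dedicated tools are needed at the panel sizes benchmarked in Table 2.6.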
