
3.2.8 An Alternative to Permutation Testing: a Quick and Efficient

of Haplotype Sharing Measures

We developed methods to assess statistical significance of excess haplotype sharing amongst the cases as compared to the controls. Aside from the biological motivation, these methods were computationally quick and efficient.

We investigated the haplotype sharing amongst 60 unrelated individuals (120 haplotypes) of the CEPH sample (Utah Residents with Northern and Western European

Ancestry), genotyped as part of the HapMap project. Across all $\binom{120}{2} = 7{,}140$ pairs

of haplotypes, we examined the number of contiguous alleles shared starting at a reference marker in the PHB gene region. We chose 2 markers, one relatively common (rs2233667, estimated minor allele frequency [MAF] = 0.328) and one rare (rs882031, estimated MAF = 0.025). Surprisingly, a fair number of unrelated, independent haplotypes shared a considerable number of adjacent alleles, regardless of the MAF of the initial reference marker. We therefore hypothesized that case haplotypes and control haplotypes each shared their own distinct set of haplotype patterns. This was similar to Lange and Boehnke’s 2004 assertion that, under the alternative hypothesis, for the groups of transmitted and non-transmitted haplotypes in a family-based study design, the within-group similarity is high while the between-group similarity is low. Thus, we were prompted to design an approach that allowed affected and unaffected haplotypes to cluster according to a relative measure of similarity, and to formally test the statistical significance of the observed clustering in the cases versus the controls.
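As a rough illustration of the pairwise comparison described above, the following sketch counts contiguous matching alleles around a reference marker for every pair of haplotypes. It is only a sketch under stated assumptions: the function name `shared_run_length`, the string encoding of haplotypes, the toy data, and the choice to extend in both directions from the reference marker are all hypothetical and are not taken from the sharing measures defined in Section 3.2.3.

```python
from itertools import combinations
from math import comb


def shared_run_length(hap_a, hap_b, ref_index):
    """Count contiguous matching alleles around ref_index (inclusive).

    Assumes both haplotypes are equal-length sequences of alleles.
    Returns 0 if the haplotypes differ at the reference marker itself.
    """
    if hap_a[ref_index] != hap_b[ref_index]:
        return 0
    length = 1
    # Extend to the right of the reference marker until the first mismatch.
    i = ref_index + 1
    while i < len(hap_a) and hap_a[i] == hap_b[i]:
        length += 1
        i += 1
    # Extend to the left of the reference marker until the first mismatch.
    i = ref_index - 1
    while i >= 0 and hap_a[i] == hap_b[i]:
        length += 1
        i -= 1
    return length


# Toy data (hypothetical): 4 haplotypes over 5 markers, reference marker at index 2.
haplotypes = ["AACGT", "AACGA", "TACGT", "AAGGT"]
ref = 2
scores = {
    (a, b): shared_run_length(haplotypes[a], haplotypes[b], ref)
    for a, b in combinations(range(len(haplotypes)), 2)
}

# The number of unordered pairs is "n choose 2"; for 120 haplotypes this is 7,140.
print(len(scores), comb(120, 2))
```

Every unordered pair is scored exactly once, which is why 120 haplotypes yield 7,140 comparisons.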

We carried out the clustering algorithm in the following manner. Begin cluster

being formed. The thresholds $T_{P_k}$, specific to a haplotype sharing score type (e.g. reference marker and fixed window based $\log_{10}(\mathrm{CHSS})$, Length, and Count) described in Section 3.2.3, served as the cutoffs by which we designated haplotypes to be members of a particular haplotype grouping.

1. Amongst the haplotype sharing scores pertaining to the haplotype pairings with the initial haplotype in the cluster, search for any haplotypes whose scores meet or exceed $T_{P_k}$ and include them in the cluster. Note that for $h_{c_1,1}$, the scores searched are a subset of $4N - 1$ scores out of a total of $\binom{4N}{2}$ haplotype pairing scores for $2N + 2N = 4N$ case and control haplotypes.

2. For each of the haplotypes entered at Step 1, search for any haplotype(s) to further include in the cluster, based on the corresponding subset of haplotype sharing scores.

3. Once the cluster can no longer include additional haplotype members, begin building another cluster starting with an arbitrary single haplotype, granted that there are haplotypes that have not yet been grouped. Repeat Steps 1 and 2 for new clusters to be formed.

4. As soon as no other clusters can be formed after iterating through Steps 1 through 3 and haplotypes remain that have not been assigned to any of the clusters, place them in an “other” bin to be subsequently assessed.

In Steps 1 and 2, we searched only the scores corresponding to haplotypes that had not yet been clustered, so no haplotype was counted more than once. This also saved computational time and resources when searching.
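Steps 1 through 4 can be sketched as a breadth-first expansion over haplotypes whose pairwise scores meet the threshold. This is a minimal sketch, not the implementation used in this work: the function `cluster_haplotypes`, the `score` callback, the integer haplotype indices, and the toy scores are hypothetical. It also assumes that a seed haplotype that attracts no other member is what ends up in the “other” bin, and it picks the lowest unassigned index as the “arbitrary” seed so that runs are reproducible.

```python
from collections import deque


def cluster_haplotypes(score, n_haps, threshold):
    """Grow clusters by threshold linkage on pairwise sharing scores.

    score(i, j) returns the sharing score for haplotypes i and j.
    Returns (clusters, other); singleton seeds go to `other`
    (an assumption of this sketch).
    """
    unassigned = set(range(n_haps))
    clusters, other = [], []
    while unassigned:
        seed = min(unassigned)  # "arbitrary" seed, chosen deterministically here
        unassigned.remove(seed)
        cluster, queue = [seed], deque([seed])
        while queue:
            current = queue.popleft()
            # Only haplotypes not yet clustered are searched (Steps 1 and 2).
            hits = sorted(h for h in unassigned if score(current, h) >= threshold)
            for h in hits:
                unassigned.remove(h)
                cluster.append(h)
                queue.append(h)
        if len(cluster) > 1:
            clusters.append(cluster)
        else:
            other.append(seed)
    return clusters, other


# Toy pairwise scores (hypothetical); any pair not listed scores 0.
pair_scores = {(0, 1): 5.0, (1, 2): 4.2, (3, 4): 6.1}


def score(i, j):
    return pair_scores.get((min(i, j), max(i, j)), 0.0)


clusters, other = cluster_haplotypes(score, 6, threshold=4.0)
print(clusters, other)
```

In the toy data, haplotype 2 joins the first cluster only through its pairing with haplotype 1, mirroring how Step 2 extends a cluster through members added at Step 1.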

To illustrate the clustering algorithm, consider the following example. We begin

1. 3 other haplotypes ($h_{c_1,2}$, $h_{c_1,3}$, $h_{c_1,4}$) have $\log_{10}(\mathrm{CHSS})$ values $\geq T_{99}$ from their pairings with $h_{c_1,1}$, so these are included in cluster $c_1$.

2. (a) 2 other haplotypes ($h_{c_1,5}$ and $h_{c_1,6}$) have $\log_{10}(\mathrm{CHSS})$ values $\geq T_{99}$ from their pairings with $h_{c_1,2}$, so these are included in cluster $c_1$.

(b) 3 other haplotypes ($h_{c_1,7}$, $h_{c_1,8}$, and $h_{c_1,9}$) have $\log_{10}(\mathrm{CHSS})$ values $\geq T_{99}$ from their pairings with $h_{c_1,3}$, so these are included in cluster $c_1$.

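(c) No other haplotypes have $\log_{10}(\mathrm{CHSS})$ values $\geq T_{99}$ from their pairings with $h_{c_1,4}$, so no additional haplotypes are included in cluster $c_1$.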

3. Amongst the $\log_{10}(\mathrm{CHSS})$ values computed from the haplotype pairings with each of the haplotypes entered in Step 2 ($h_{c_1,5}, h_{c_1,6}, \ldots, h_{c_1,9}$), no other values were $\geq T_{99}$; therefore the construction of cluster $c_1$ is complete, and it consists of haplotypes $h_{c_1,1}, h_{c_1,2}, \ldots, h_{c_1,9}$. We begin building another cluster, $c_2$, starting with arbitrary haplotype $h_{c_2,1}$.

4. After iterating through Steps 1 to 3 two more times, there are 3 clusters, $c_1$, $c_2$, and $c_3$, containing haplotypes ($h_{c_1,1}, h_{c_1,2}, \ldots, h_{c_1,9}$), ($h_{c_2,1}, h_{c_2,2}, \ldots, h_{c_2,5}$), and ($h_{c_3,1}, h_{c_3,2}, \ldots, h_{c_3,24}$), respectively. However, $4N - (9 + 5 + 24) = 4N - 38$ haplotypes remain unclustered, and these are placed in the “other” bin.

We employed various methods to handle clusters containing few haplotypes and haplotypes that were categorized into the “other” bin. We postulated that these rare haplotypes could provide useful information in discerning associations between clusters and affection status; we therefore compared the performance of incorporating these rare haplotypes against removing them entirely from the analysis.

sizes ($\mathrm{size}_i$ for $i = 1, \ldots, N_{\mathrm{sizes}}$), and if the number of haplotypes in a given cluster was less than $\mathrm{size}_i$, we placed all of the haplotypes in this cluster in the “other” bin.

The second method was to attempt to regroup “other” haplotypes into any of the clusters (“Regrouping”) in a 3-step process. First, we regrouped “other” haplotypes per criteria which we discuss below; small clusters that did not meet $\mathrm{size}_i$ could possibly be expanded at this step. Second, we imposed the cluster size constraints, $\mathrm{size}_i$, across all clusters. Third, we attempted to regroup “other” haplotypes once again, since some clusters may have moved to the “other” bin in the previous step.

Our strategy to regroup “other” haplotypes into clusters was the following. For each “other” haplotype, we inspected all of the scores from its pairings with the haplotypes already in clusters and found the maximum score. We regrouped the “other” haplotype into the cluster in which the maximum score resided. We required that the maximum score stem from a haplotype that was not originally in the “other” bin. If the maximum score was attained in more than one cluster, we did not regroup the “other” haplotype in question.
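The regrouping rule can be sketched as follows. This is an illustrative sketch only: the function `regroup_other`, the `score` callback, and the toy scores are hypothetical, clusters are assumed to be non-empty lists of haplotype indices, and ties are detected by exact equality of scores. Scores are always looked up against the original cluster members, so a maximum can never stem from a haplotype that started in the “other” bin.

```python
def regroup_other(score, clusters, other):
    """Move each "other" haplotype into the cluster holding its maximum score.

    A haplotype stays in "other" when its maximum score is attained in more
    than one cluster. Returns (new_clusters, still_other).
    """
    new_clusters = [list(members) for members in clusters]
    still_other = []
    for h in other:
        # Best score of h against the ORIGINAL members of each cluster.
        best_per_cluster = [
            max(score(h, m) for m in members) for members in clusters
        ]
        if not best_per_cluster:
            still_other.append(h)
            continue
        top = max(best_per_cluster)
        winners = [k for k, s in enumerate(best_per_cluster) if s == top]
        if len(winners) == 1:
            new_clusters[winners[0]].append(h)
        else:
            still_other.append(h)  # maximum shared by several clusters
    return new_clusters, still_other


# Toy pairwise scores (hypothetical); any pair not listed scores 0.
pair_scores = {(0, 4): 2.0, (2, 4): 1.0, (1, 5): 1.5, (3, 5): 1.5}


def score(i, j):
    return pair_scores.get((min(i, j), max(i, j)), 0.0)


regrouped, remaining = regroup_other(score, [[0, 1], [2, 3]], [4, 5])
print(regrouped, remaining)
```

Here haplotype 4 joins the first cluster, while haplotype 5 stays in “other” because its maximum score of 1.5 occurs in both clusters.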

In the third method, we collected all of the small clusters (i.e. all clusters smaller than $\mathrm{size}_i$) into one group instead of recategorizing them into the “other” bin (“Small Cluster Row”). We did not attempt to regroup “other” haplotypes into clusters for this method.
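The difference between sending small clusters to the “other” bin and collecting them into a “Small Cluster Row” can be sketched in a few lines. The function `apply_size_constraint`, its parameter names, and the toy clusters are hypothetical, and `min_size` stands in for a single cluster size threshold.

```python
def apply_size_constraint(clusters, other, min_size, small_cluster_row=False):
    """Split clusters by size and route the small ones.

    Returns (large_clusters, small_row, other). With small_cluster_row=False,
    haplotypes from small clusters are appended to "other" and small_row is
    empty; with small_cluster_row=True, they form one pooled group instead.
    """
    large = [c for c in clusters if len(c) >= min_size]
    small = [h for c in clusters if len(c) < min_size for h in c]
    if small_cluster_row:
        return large, small, list(other)
    return large, [], list(other) + small


# Toy clusters of haplotype indices (hypothetical).
clusters = [[0, 1, 2, 3], [4, 5], [6, 7, 8]]
other = [9]

to_other = apply_size_constraint(clusters, other, min_size=3)
pooled = apply_size_constraint(clusters, other, min_size=3, small_cluster_row=True)
print(to_other)
print(pooled)
```

With a threshold of 3, the two-member cluster is either merged into “other” or kept as its own pooled row, while the larger clusters are unchanged in both cases.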

Once the clusters were created by way of the “No Regrouping”, “Regrouping”, and “Small Cluster Row” methods, we constructed $R \times 2$ contingency tables in which the number of rows, $R$, represented the number of clusters and the 2 columns categorized affection status (i.e. whether a haplotype originated from an affected or unaffected individual). We cross-classified the clusters by affection status in order to examine whether case haplotypes grouped together differently than control haplotypes. Regardless of whether such a difference existed, the $R \times 2$ tables characterized, across all of the clusters, the frequency at which the case and control haplotypes congregated based on a quantifiable measure of haplotype sharing.

Similar to how we tested for association in the $R \times 2$ tables of the haplotype $\chi^2$ test (Section 3.2.7), we computed the $\chi^2$ statistic with $R - 1$ degrees of freedom. We assessed the effect of including versus removing haplotypes that did not assemble into any cluster, which were plausibly the rare haplotypes, by either keeping or removing the “other” group for each of the 3 methods discussed above (“Keep” or “Delete”). For the Delete method, we removed the row of “other” haplotypes entirely, provided that such a row existed and that deleting it did not result in a table with 0 degrees of freedom (i.e. a table with 1 row). For the Keep method, we simply retained the “other” row when calculating the $\chi^2$ statistic.
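The Keep and Delete variants of the $R \times 2$ test can be sketched with a plain Pearson chi-square computation. This is a sketch under assumptions rather than the code used here: the function names `cluster_table` and `chi_square` and the toy counts are hypothetical, each row is a `[cases, controls]` pair, and every row and column total is assumed to be positive so that no expected count is zero. The p-value, which would come from the chi-square distribution with $R - 1$ degrees of freedom, is left out to keep the example free of external libraries.

```python
def cluster_table(cluster_rows, other_row, keep_other=True):
    """Assemble the R x 2 table of [cases, controls] counts.

    Under Delete (keep_other=False) the "other" row is dropped only if it
    exists and at least 2 rows remain, so the table never has 0 df.
    """
    table = [list(row) for row in cluster_rows]
    if other_row is not None and (keep_other or len(table) < 2):
        table.append(list(other_row))
    return table


def chi_square(table):
    """Pearson chi-square statistic and degrees of freedom (R - 1)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for row, row_total in zip(table, row_totals):
        for observed, col_total in zip(row, col_totals):
            expected = row_total * col_total / n
            stat += (observed - expected) ** 2 / expected
    return stat, len(table) - 1


# Toy counts (hypothetical): two clusters plus an "other" row.
cluster_rows = [[30, 10], [10, 30]]
other_row = [10, 10]

keep_stat, keep_df = chi_square(cluster_table(cluster_rows, other_row, keep_other=True))
del_stat, del_df = chi_square(cluster_table(cluster_rows, other_row, keep_other=False))
print(keep_stat, keep_df, del_stat, del_df)
```

In this toy table the “other” row is perfectly balanced, so keeping it adds a degree of freedom without adding to the statistic.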

Finally, we formed $2 \times 2$ tables in which affection status defined the columns and the 2 rows consisted of the aggregated collection of clusters and the group of “other” haplotypes. We did not attempt to regroup “other” haplotypes into the cluster row. We computed the $\chi^2$ statistic to assess statistical significance.
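For reference, the Pearson statistic for such a $2 \times 2$ table has a well-known closed form. The cell labels below are hypothetical notation introduced only for this illustration: $a$ and $b$ denote the case and control counts in the aggregated cluster row, $c$ and $d$ the case and control counts in the “other” row, and $n = a + b + c + d$. The form shown assumes no continuity correction, which the text above does not specify.

$$
\chi^2 = \frac{n\,(ad - bc)^2}{(a + b)(c + d)(a + c)(b + d)}, \qquad \text{with } 1 \text{ degree of freedom.}
$$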

3.2.9 Illumina’s iControlDB Public Resource: Acquisition, Clean-

In document Novel statistical methods for the study design and analysis of genome-wide association studies (Page 107-111)