2.6 Supplemental Results
3.2.8 An Alternative to Permutation Testing: a Quick and Efficient
of Haplotype Sharing Measures
We developed methods to assess statistical significance of excess haplotype sharing amongst the cases as compared to the controls. Aside from the biological motivation, these methods were computationally quick and efficient.
We investigated haplotype sharing amongst 60 unrelated individuals (120 haplotypes) of the CEPH sample (Utah Residents with Northern and Western European Ancestry), genotyped as part of the HapMap project. Across all C(120, 2) = 7,140 pairs of haplotypes, we examined the number of contiguous shared alleles starting at a reference marker in the PHB gene region. We chose 2 markers, one that was relatively common (rs2233667, estimated minor allele frequency [MAF] = 0.328) and one that was rare (rs882031, estimated MAF = 0.025). Surprisingly, a fair number of unrelated, independent haplotypes shared a considerable number of adjacent alleles, regardless of the MAF of the initial reference marker. We therefore hypothesized that case haplotypes and control haplotypes each shared their own distinct set of haplotype patterns. This is similar to Lange and Boehnke’s (2004) assertion that, under the alternative hypothesis, for the groups of transmitted and non-transmitted haplotypes in a family-based study design, the within-group similarity is high while the between-group similarity is low. Thus, we were prompted to design an approach that allowed affected and unaffected haplotypes to cluster according to a relative measure of similarity and to formally test the statistical significance of the observed clustering in the cases versus the controls.
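The pairwise count described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the study: the function names are hypothetical, haplotypes are assumed to be equal-length sequences of allele codes, and the run of shared alleles is assumed to extend in both directions from the reference marker (the text does not specify the direction).

```python
from itertools import combinations
from math import comb


def shared_run_length(hap_a, hap_b, ref_index):
    """Count contiguous matching alleles around ref_index.

    Returns 0 if the two haplotypes differ at the reference marker.
    Extending in both directions is an assumption of this sketch.
    """
    if hap_a[ref_index] != hap_b[ref_index]:
        return 0
    count = 1
    # Extend to the right of the reference marker.
    i = ref_index + 1
    while i < len(hap_a) and hap_a[i] == hap_b[i]:
        count += 1
        i += 1
    # Extend to the left of the reference marker.
    i = ref_index - 1
    while i >= 0 and hap_a[i] == hap_b[i]:
        count += 1
        i -= 1
    return count


def pairwise_run_lengths(haplotypes, ref_index):
    """Map each unordered pair (i, j), i < j, to its shared run length."""
    return {
        (i, j): shared_run_length(haplotypes[i], haplotypes[j], ref_index)
        for i, j in combinations(range(len(haplotypes)), 2)
    }


# 120 haplotypes give C(120, 2) = 7,140 unordered pairs.
n_pairs = comb(120, 2)
```

With 120 haplotypes, `pairwise_run_lengths` returns one entry per unordered pair, which matches the 7,140 pairs examined in the text.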
We carried out the clustering algorithm in the following manner. Begin cluster c1 with an arbitrary single haplotype, hc1,1, the first haplotype of the first cluster being formed. The thresholds, TPk, specific to a haplotype sharing score type (e.g. reference marker and fixed window based log10(CHSS), Length, and Count) described in Section 3.2.3 served as the cutoffs by which we designated haplotypes to be members of a particular haplotype grouping.
1. Amongst the haplotype sharing scores for the pairings with the initial haplotype in the cluster, search for any haplotypes whose scores meet or exceed TPk and include them in the cluster. Note that for hc1,1, the scores searched are a subset of 4N − 1 scores out of a total of C(4N, 2) haplotype pairing scores for 2N + 2N = 4N case and control haplotypes.
2. For each of the haplotypes entered at Step 1, search for any haplotype(s) to further include in the cluster, based on the corresponding subset of haplotype sharing scores.
3. Once the cluster can no longer include additional haplotype members, begin building another cluster starting with an arbitrary single haplotype, provided that there are haplotypes that have not yet been grouped. Repeat Steps 1 and 2 for new clusters to be formed.
4. As soon as no other clusters can be formed after iterating through Steps 1 through 3 and haplotypes remain that have not been assigned to any of the clusters, place them in an “other” bin to be subsequently assessed.
In Steps 1 and 2, we searched scores corresponding to haplotypes that have not yet been clustered, so haplotypes were not counted more than once. As a result, this also saved computational time and resources when searching.
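Steps 1 through 4 amount to growing each cluster outward from a seed haplotype until no unclustered haplotype meets the threshold. The following is a minimal sketch under stated assumptions, not the original implementation: `cluster_haplotypes` is a hypothetical name, scores are assumed to be stored in a dictionary keyed by `frozenset({i, j})`, the seed is taken in index order rather than at random, and a seed that attracts no other haplotype is assumed to go to the “other” bin.

```python
from collections import deque


def cluster_haplotypes(scores, n_haps, threshold):
    """Threshold-based clustering of haplotypes indexed 0..n_haps-1.

    scores: dict mapping frozenset({i, j}) -> haplotype sharing score.
    Returns (clusters, other), where `other` holds haplotypes that did
    not join any cluster (an assumption about how singletons are handled).
    """
    unassigned = set(range(n_haps))
    clusters, other = [], []
    for seed in range(n_haps):
        if seed not in unassigned:
            continue  # already placed in an earlier cluster
        unassigned.discard(seed)
        cluster = [seed]
        queue = deque([seed])
        while queue:
            current = queue.popleft()
            # Only haplotypes not yet clustered are searched, so no
            # haplotype is counted more than once.
            hits = [
                h for h in sorted(unassigned)
                if scores.get(frozenset((current, h)), float("-inf")) >= threshold
            ]
            for h in hits:
                unassigned.discard(h)
                cluster.append(h)
                queue.append(h)
        if len(cluster) > 1:
            clusters.append(cluster)
        else:
            other.append(seed)
    return clusters, other
```

Because each haplotype is removed from `unassigned` as soon as it is placed, the search space shrinks as clusters grow, which reflects the saving in computation noted above.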
To illustrate the clustering algorithm, consider the following example. We begin cluster c1 with an arbitrary haplotype, hc1,1, and use log10(CHSS) scores with threshold T99.
1. 3 other haplotypes (hc1,2, hc1,3, hc1,4) have log10(CHSS) values ≥ T99 from their pairings with hc1,1, so these are included in cluster c1.
2. (a) 2 other haplotypes (hc1,5 and hc1,6) have log10(CHSS) values ≥ T99 from their pairings with hc1,2, so are included in cluster c1.
(b) 3 other haplotypes (hc1,7, hc1,8, and hc1,9) have log10(CHSS) values ≥ T99 from their pairings with hc1,3, so are included in cluster c1.
(c) No haplotypes have log10(CHSS) values ≥ T99 from their pairings with hc1,4, so no additional haplotypes are included in cluster c1.
3. Amongst the log10(CHSS) values computed from the haplotype pairings with each of the haplotypes entered in Step 2 (hc1,5, hc1,6, . . . , hc1,9), no other log10(CHSS) values were ≥ T99; therefore the construction of cluster c1 is complete and consists of haplotypes hc1,1, hc1,2, . . . , hc1,9. We begin building another cluster, c2, starting with arbitrary haplotype hc2,1.
4. After iterating through Steps 1 to 3 two more times, there are 3 clusters, c1, c2, and c3, containing haplotypes (hc1,1, hc1,2, . . . , hc1,9), (hc2,1, hc2,2, . . . , hc2,5), and (hc3,1, hc3,2, . . . , hc3,24), respectively. However, 4N − 38 haplotypes (9 + 5 + 24 = 38 were clustered) remain unclustered and are placed in the “other” bin.
We employed various methods to handle clusters containing few haplotypes and haplotypes that were categorized into the “other” bin. We postulated that these rare haplotypes could provide useful information in discerning associations between clusters and affection status; we therefore assessed the performance of incorporating these rare haplotypes compared to removing them entirely from the analysis.
The first method (“No Regrouping”) was to impose a series of minimum cluster sizes (sizei for i = 1, . . . , Nsizes): if the number of haplotypes in a given cluster was less than sizei, we placed all of the haplotypes in that cluster in the “other” bin.
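The size constraint can be illustrated as a simple filter. This is a sketch only; `apply_min_size` is a hypothetical helper, and clusters are assumed to be lists of haplotype indices.

```python
def apply_min_size(clusters, other, min_size):
    """Move every haplotype from clusters smaller than min_size to 'other'.

    Returns (kept_clusters, new_other); the inputs are not modified.
    """
    kept = [list(c) for c in clusters if len(c) >= min_size]
    moved = [h for c in clusters if len(c) < min_size for h in c]
    return kept, list(other) + moved
```

Running the filter once per value of sizei would give one clustering per size constraint, so that the effect of the constraint on the downstream test can be compared.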
The second method was to attempt to regroup “other” haplotypes into any of the clusters (“Regrouping”) in a 3-step process. First, we regrouped “other” haplotypes per criteria which we discuss below. Small clusters that did not fulfill sizei could possibly be expanded at this step. Second, we imposed the cluster size constraints, sizei, across all clusters. Third, we attempted to regroup “other” haplotypes once again, since some clusters may have moved to the “other” bin in the previous step.
Our strategy to regroup “other” haplotypes into clusters was the following. For each of the “other” haplotypes, we inspected all of the scores from the pairings with the haplotypes already in clusters and found the maximum score. We regrouped the “other” haplotype into the cluster in which the maximum score resided. We required that the maximum score stem from a haplotype that was not originally in the “other” bin. If the maximum score was tied across more than one cluster, we did not regroup the “other” haplotype in question.
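A single regrouping pass might look like the sketch below. It is an illustration under assumptions rather than the original code: `regroup_other` is a hypothetical name, scores are keyed by `frozenset({i, j})`, clusters are assumed non-empty, and cluster membership is frozen at the start of the pass so that a newly regrouped haplotype cannot attract another “other” haplotype within the same pass. Tracking origin across the full 3-step process is left out.

```python
def regroup_other(clusters, other, scores):
    """Assign each 'other' haplotype to the cluster holding its maximum score.

    A haplotype stays in 'other' when its maximum score is tied across
    more than one cluster, or when it has no score with any clustered
    haplotype. Returns (new_clusters, still_other).
    """
    # Snapshot of membership: only these haplotypes can supply the maximum.
    original = [list(c) for c in clusters]
    new_clusters = [list(c) for c in clusters]
    still_other = []
    for h in other:
        best_per_cluster = [
            max(scores.get(frozenset((h, m)), float("-inf")) for m in members)
            for members in original
        ]
        if not best_per_cluster:
            still_other.append(h)
            continue
        best = max(best_per_cluster)
        winners = [k for k, s in enumerate(best_per_cluster) if s == best]
        if len(winners) == 1 and best > float("-inf"):
            new_clusters[winners[0]].append(h)
        else:
            still_other.append(h)
    return new_clusters, still_other
```

In this sketch a tie within one cluster is harmless, since the destination is the same; only ties between clusters block regrouping.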
In the third method, we collected all of the small clusters (i.e. all of the clusters
that were not as large as sizei) into one group instead of recategorizing them into the
“other” bin (“Small Cluster Row”). We did not attempt to regroup “other” haplotypes into clusters for this method.
Once the clusters were created by way of the “No Regrouping”, “Regrouping”, and “Small Cluster Row” methods, we constructed R × 2 contingency tables where the number of rows, R, represented the number of clusters and the 2 columns categorized affection status (i.e. whether a haplotype originated from an affected or unaffected individual). We cross-classified the clusters by affection status in order to examine whether case haplotypes grouped together differently than control haplotypes. Regardless of whether such a difference existed, the R × 2 tables characterized, across all of the clusters, the frequency at which the case and control haplotypes congregated based on a quantifiable measure of haplotype sharing.
Similar to how we tested for association in the R × 2 tables of the haplotype χ² test (Section 3.2.7), we computed the χ² statistic with R − 1 degrees of freedom. We assessed the performance of including and removing haplotypes that did not assemble into any clusters, which were plausibly the rare haplotypes, by either keeping or removing the “other” group for each of the 3 methods discussed above (“Keep” or “Delete”). For the Delete method, we removed entirely the row of “other” haplotypes, given that such a row existed and that deleting the “other” row did not result in a table with 0 degrees of freedom (i.e. a table with 1 row). On the other hand, for the Keep method, we simply kept the “other” row when calculating the χ² statistic.
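The table construction and the Pearson χ² statistic can be sketched as follows. The function names and the `is_case` lookup (a sequence of booleans indexed by haplotype) are assumptions made for illustration. The sketch computes only the statistic and its degrees of freedom, and it assumes every row and both columns have a nonzero total.

```python
def build_rx2_table(clusters, other, is_case, keep_other=True):
    """Build an R x 2 table of [case count, control count] per cluster.

    With keep_other=False ("Delete"), the 'other' row is dropped unless
    that would leave a single row, i.e. 0 degrees of freedom.
    """
    def row(haps):
        cases = sum(1 for h in haps if is_case[h])
        return [cases, len(haps) - cases]

    rows = [row(c) for c in clusters]
    if other and (keep_other or len(rows) < 2):
        rows.append(row(other))
    return rows


def chi_square_rx2(table):
    """Pearson chi-square statistic for an R x 2 table; returns (stat, df)."""
    row_totals = [sum(r) for r in table]
    col_totals = [sum(r[j] for r in table) for j in range(2)]
    n = sum(row_totals)
    stat = 0.0
    for r, row_total in zip(table, row_totals):
        for j in range(2):
            expected = row_total * col_totals[j] / n
            stat += (r[j] - expected) ** 2 / expected
    return stat, len(table) - 1
```

For example, the table `[[10, 20], [20, 10]]` has an expected count of 15 in every cell, so the statistic is 4 × 25/15 ≈ 6.67 on 1 degree of freedom.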
Finally, we formed 2 × 2 tables for which affection status defined the columns and the 2 rows consisted of the aggregated collection of clusters and the group of “other” haplotypes. We did not attempt to regroup “other” haplotypes into the cluster row. We computed the χ² statistic to assess statistical significance.
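Collapsing the clusters into a single row gives a 2 × 2 table, for which the Pearson χ² statistic has a well-known closed form. The sketch below uses hypothetical function names, assumes `is_case` is a sequence of booleans indexed by haplotype, applies no continuity correction (the text does not say whether one was used), and assumes all four margins are nonzero.

```python
def collapse_to_2x2(clusters, other, is_case):
    """Return [[clustered cases, clustered controls], [other cases, other controls]]."""
    def row(haps):
        cases = sum(1 for h in haps if is_case[h])
        return [cases, len(haps) - cases]

    clustered = [h for c in clusters for h in c]
    return [row(clustered), row(other)]


def chi_square_2x2(table):
    """Pearson chi-square (1 df, no continuity correction) for a 2 x 2 table."""
    (a, b), (c, d) = table
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
```

The closed form n(ad − bc)² / [(a + b)(c + d)(a + c)(b + d)] is algebraically identical to summing (observed − expected)² / expected over the four cells.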