B. Evolutionary Trees and Hash Tables
4. Implicit Bipartition Representation
To represent bipartitions with implicit bipartition representation, we use universal hash functions. Similarlyto Amenta et al. [71], we deο¬ne our universal hashing functions as follows: β1(π΅) = β ππππ mod π1. (4.3) β2(π΅) = β πππ π mod π2. (4.4)
π1 represents the number of entries (or locations) in the hash table (π1 > π β π‘,
where π is the number of taxa and π‘ is the number of trees). π2 represent the largest
bipartition ID (BID) that we can be given to a bipartition (π2 > π β π β π‘, where π is
a large constant). That is, instead of storing the π-bitstring, a shortened version of it (represented bythe BID) will be stored in the hash table. Our implementations requires two sets of random integers ππ and π π in the intervals (0, ..., π1 β 1) and
(0, ..., π2β 1), respectively. ππ represents the πth bit of the π-bitstring representation
of the bipartition π΅. ππ and π π are randomlygenerated everytime when the algorithm
runs and characterize our HashCS and HashRF algorithms as randomized algorithms. Figure 19 shows an example of π-bitstring and implicit bipartition representation based on the universal hash functions.
6 n - b i t b i p a r t i t i o n 0 1 0 0 0 2 1 0 0 1 0 0 1 0 0 0 0 1 0 8 0 0 0 0 1 1 9 1 1 1 1 1 1 8 1 1 0 0 0 4 1 1 1 0 0 1 4 0 0 0 1 1 4 1 0 0 0 0 U V W X Y N5 N3 N8 N9 T r e e T N1 N2 N4 N6 N7 h v a l u e1
Fig. 19. Calculating the β1 hash code from π-bitstring bipartitions representations
based on a post-order traversal of an example tree π . Each node is labeled by ππ, where π is the order in which it was visited during the post-order
traversal. The input taxa are π, π, π, π, and π and are represented bynodes π1, π2, π4, π6, and π7, respectively. To compute the β1 hashing function,
π1 = 23 and the contents of array π are π0 = 6, π1 = 21, π2 = 10, π3 = 8,
and π4 = 19. Each node in the tree π shows an π-bitstring representation
(shaded) along with the corresponding β1 value (unshaded). The β1 val-
ues of the nodes are computed based on Equations 2 and 5. For node π5,
a. Universal Hash Functions, β1 and β2
For faster performance, we actuallyavoid sending the π-bitstring and π-bitstring representations of each bipartition π΅ to our hashing functions. Instead, we use an implicit bipartition representation to compute the hash functions quickly. An implicit bipartition is simplyan integer value that provides the representation of the biparti- tion. Consider an internal node π΅ whose bipartition is represented bya π-bitstring. Let the two children of this node have their π-bitstring representations labeled π΅ππππ‘
and π΅πππβπ‘, which represent disjoint sets of taxa. Then, the hash value for our β1 hash
function is β1(π΅) = β ββ π΅ππππ‘ ππππ mod π1 β β + β ββ π΅πππβπ‘ ππππ mod π1 β β . (4.5)
We can use Equation 4.5 to compute implicit bipartitions, which replace the need for bitstrings in our hashing functions. Computing β2 works similarly. The
above equation is valid since π΅ππππ‘ and π΅πππβπ‘ represent disjoint sets of taxa. We
can use Equation 4.5 to compute implicit bipartitions, which replace the need for bitstrings in our hashing functions. An implicit representation is simplythe hash code for a bipartition. Consider the β1 hashing function. For each bipartition π΅,
the β1 hash values (implicit representations) for its two children π΅ππππ‘ and π΅πππβπ‘,
have alreadybeen computed. Theyare π₯ = β1(π΅ππππ‘) and π¦ = β1(π΅πππβπ‘). Thus, the
implicit representation for node π΅ is β1(π΅) = (π₯ + π¦) mod π1. Computing the β2
value implicitlyfor each bipartition works similarly.
Our HashRF algorithm is implemented onlybased on implicit bipartitions. How- ever, our HashCS algorithm uses a combination of implicit and π-bitstring represen-
tations. The implicit representation is the onlyrepresentation fed to the hashing functions. The π-bitstring is additional information that is collected from the ο¬rst input tree to determine how the taxa should be grouped in the consensus tree and is used for constructing the resulting consensus trees.
Figure 19 provides an example of how to compute the β1 hash code implicitly.
Assume the set of random numbers π = (6, 21, 10, 8, 19) and π1 = 23. The β1 value
for node π3is computed from the π-bitstrings of its two children, π1 and π2, resulting
in π΅ππππ‘= 10000 and π΅πππβπ‘ = 01000. Therefore, β1(11000) = β1(10000)+β1(01000) =
4. The implicit β1values are represented bythe integer values (unshaded boxes). The
hash code for the leaf node containing taxon π is ππ. These hash codes are propagated
up the tree to compute the implicit representations of the parent bipartitions. Thus, to compute the β1 hash code implicitlyfor π3, we calculate (6 + 21) mod 23, which
is 4. Computing the β2 hash code for representing a bipartitionβs ID (BID) works
similarly.
b. Collision Types and Probability
A consequence of using hash functions is that bipartitions mayend up residing in the same location in the hash table. Such an event is considered a collision, and there are three types to consider (see Table I). Type 0 collisions are not regarded as a problematic condition. In our algorithms, this type of collision occurs when the same bipartition has alreadybeen populated into the hash table. In other words, Type 0 collision notiο¬es us it ο¬nds a shared bipartition among input trees. Even if it is called as a collision in this thesis, it is one of the reasons to make our algorithms faster and eο¬cient.
Table I. Collision types in our randomized algorithms.
Collision Type π΅π = π΅π? β1(π΅π) = β1(π΅π)? β2(π΅π) = β2(π΅π)?
Type 0 Yes Yes Yes
Type 1 No Yes No
Type 2 No Yes Yes
resided in the same location in the hash table. That is, β1(π΅π) = β1(π΅π). Figure 20
shows two Type 0 collisions between two same bipartition, π΅1 and π΅3 and a Type 1
collision caused by π΅4. partition which is collided at location 4 in the hash table.
Type 2 (or double) collisions are serious and require a restart of the algorithm if such an event occurs. Otherwise, the resulting output will possiblybe incorrect. Suppose that π΅π β= π΅π. A Type 2 collision occurs when two diο¬erent bipartitions π΅π
and π΅π hash to the same location in the hash table and the bipartition IDs (BIDs)
associated with them are also the same. That is, β1(π΅π) = β1(π΅π) and β2(π΅π) =
β2(π΅π). The probabilityof our randomized algorithms restarting because of a double
collision among anypair of the bipartitions is π(1 π
)
[71]. The probabilityrestarting because of a double collision among anypair of the bipartitions is π(1
π). Given that
we can make π arbitrarilylarge, we do not explicitlycheck for Type 2 collisions. Thus we have the following result.
Theorem 1. Our HashCS and HashRF algorithms have a theoretical error rate of π(1
π), where π is an arbitrarily large constant number.
Proof. Double collision occurs when for two bipartitions π΅1, π΅2, we have β1(π΅1) =
β1(π΅2) and also β2(π΅1) = β2(π΅2). Thus, given two diο¬erent bipartitions, π΅π and π΅π,
where π β= π, the probabilityof β1(π΅π) is equal to β1(π΅π) is
π πππ(β1(π΅π) = β1(π΅π)) = π1
Hash Table 0 4 ... m -11 h1 Bipartitions 1 2 3 T1 T2 2 B 1 B TYPE 0 collsions TYPE 1 collsion Hash Records T1 27 T1 T2 31 T2 46 4 B 3 B h2
Fig. 20. An example of populated hash table with the bipartitions from trees in Fig- ure 16. That is, π΅1 and π΅2 deο¬ne π1, π΅3 and π΅4 are from π2, etc. The implicit
representation of each bipartition is fed to the hash functions β1 and β2. The
shaded value in each hash record contains the bipartition ID (or β2 value).
Each bipartition ID has a linked list of tree indexes that share that particular bipartition. π΅1 and π΅3 are identical bipartition and causes TYPE 0 collisions.
π΅4 from π2 is actuallydiο¬erent bipartitions with π΅1 and π΅3 but hashed to
the same location. It occurs TYPE 1 collision and thus the BID, β31β for the bipartition π΅4 is chained with β27β.
Similarly, the probability of β2(π΅π) is equal to β2(π΅π) is
π πππ(β2(π΅π) = β2(π΅π)) = π1
2, π€βπππ π β= π. (4.7)
Note that π1 is a prime number which satisο¬es π1 β₯ ππ‘ and π2 is also a prime
number which is bigger than πππ‘, where π is the number of taxa, π‘ is the number of trees, and π is a constant integer value. Thus, the probabilityof β1(π΅π) is equal to
β1(π΅π), and β2(π΅π) is equal to β2(π΅π) is
π πππ(β1(π΅π) = β1(π΅π) πππ β2(π΅π) = β2(π΅π)) = π1
1π2, π€βπππ π β= π. (4.8)
The equation 4.8 is for a pair of bipartitions. Therefore, if there are π‘ binarytrees with π taxa, we have (ππ‘)2 pairs of bipartitions. Thus, for a given set of all bipartitions,
π΅ from the set of input trees, we can conclude the probabilityof double collisions for all pairs of bipartitions is
β π΅π, π΅π β π΅, π πππ(β1(π΅π) = β1(π΅π) πππ β2(π΅π) = β2(π΅π)) = π1
1π2 β (ππ‘) 2
= ππ‘1 β πππ‘1 β (ππ‘)2
= 1π.
In practice, however, the error rate of HashCS and HashRF algorithms is much better. Our experiments with varying π from 1 to 10,000, and running HashCS at least
100 times for each π value, resulted in our algorithm producing the correct consensus tree everytime for an overall error rate of 0%. Similarly, HashRF shows the same overall error rate as the veriο¬cation result.