• No results found

B. Evolutionary Trees and Hash Tables

4. Implicit Bipartition Representation

To represent bipartitions with implicit bipartition representation, we use universal hash functions. Similarlyto Amenta et al. [71], we define our universal hashing functions as follows: β„Ž1(𝐡) = βˆ‘ π‘π‘–π‘Ÿπ‘– mod π‘š1. (4.3) β„Ž2(𝐡) = βˆ‘ 𝑏𝑖𝑠𝑖 mod π‘š2. (4.4)

π‘š1 represents the number of entries (or locations) in the hash table (π‘š1 > 𝑛 β‹… 𝑑,

where 𝑛 is the number of taxa and 𝑑 is the number of trees). π‘š2 represent the largest

bipartition ID (BID) that we can be given to a bipartition (π‘š2 > 𝑐 β‹… 𝑛 β‹… 𝑑, where 𝑐 is

a large constant). That is, instead of storing the 𝑛-bitstring, a shortened version of it (represented bythe BID) will be stored in the hash table. Our implementations requires two sets of random integers π‘Ÿπ‘– and 𝑠𝑖 in the intervals (0, ..., π‘š1 βˆ’ 1) and

(0, ..., π‘š2βˆ’ 1), respectively. 𝑏𝑖 represents the 𝑖th bit of the 𝑛-bitstring representation

of the bipartition 𝐡. π‘Ÿπ‘– and 𝑠𝑖 are randomlygenerated everytime when the algorithm

runs and characterize our HashCS and HashRF algorithms as randomized algorithms. Figure 19 shows an example of 𝑛-bitstring and implicit bipartition representation based on the universal hash functions.

6 n - b i t b i p a r t i t i o n 0 1 0 0 0 2 1 0 0 1 0 0 1 0 0 0 0 1 0 8 0 0 0 0 1 1 9 1 1 1 1 1 1 8 1 1 0 0 0 4 1 1 1 0 0 1 4 0 0 0 1 1 4 1 0 0 0 0 U V W X Y N5 N3 N8 N9 T r e e T N1 N2 N4 N6 N7 h v a l u e1

Fig. 19. Calculating the β„Ž1 hash code from 𝑛-bitstring bipartitions representations

based on a post-order traversal of an example tree 𝑇 . Each node is labeled by 𝑁𝑖, where 𝑖 is the order in which it was visited during the post-order

traversal. The input taxa are π‘ˆ, 𝑉, π‘Š, 𝑋, and π‘Œ and are represented bynodes 𝑁1, 𝑁2, 𝑁4, 𝑁6, and 𝑁7, respectively. To compute the β„Ž1 hashing function,

π‘š1 = 23 and the contents of array 𝑅 are π‘Ÿ0 = 6, π‘Ÿ1 = 21, π‘Ÿ2 = 10, π‘Ÿ3 = 8,

and π‘Ÿ4 = 19. Each node in the tree 𝑇 shows an 𝑛-bitstring representation

(shaded) along with the corresponding β„Ž1 value (unshaded). The β„Ž1 val-

ues of the nodes are computed based on Equations 2 and 5. For node 𝑁5,

a. Universal Hash Functions, β„Ž1 and β„Ž2

For faster performance, we actuallyavoid sending the 𝑛-bitstring and π‘˜-bitstring representations of each bipartition 𝐡 to our hashing functions. Instead, we use an implicit bipartition representation to compute the hash functions quickly. An implicit bipartition is simplyan integer value that provides the representation of the biparti- tion. Consider an internal node 𝐡 whose bipartition is represented bya 𝑛-bitstring. Let the two children of this node have their 𝑛-bitstring representations labeled 𝐡𝑙𝑒𝑓𝑑

and π΅π‘Ÿπ‘–π‘”β„Žπ‘‘, which represent disjoint sets of taxa. Then, the hash value for our β„Ž1 hash

function is β„Ž1(𝐡) = βŽ› βŽβˆ‘ 𝐡𝑙𝑒𝑓𝑑 π‘π‘–π‘Ÿπ‘– mod π‘š1 ⎞ ⎠ + βŽ› βŽβˆ‘ π΅π‘Ÿπ‘–π‘”β„Žπ‘‘ π‘π‘–π‘Ÿπ‘– mod π‘š1 ⎞ ⎠ . (4.5)

We can use Equation 4.5 to compute implicit bipartitions, which replace the need for bitstrings in our hashing functions. Computing β„Ž2 works similarly. The

above equation is valid since 𝐡𝑙𝑒𝑓𝑑 and π΅π‘Ÿπ‘–π‘”β„Žπ‘‘ represent disjoint sets of taxa. We

can use Equation 4.5 to compute implicit bipartitions, which replace the need for bitstrings in our hashing functions. An implicit representation is simplythe hash code for a bipartition. Consider the β„Ž1 hashing function. For each bipartition 𝐡,

the β„Ž1 hash values (implicit representations) for its two children 𝐡𝑙𝑒𝑓𝑑 and π΅π‘Ÿπ‘–π‘”β„Žπ‘‘,

have alreadybeen computed. Theyare π‘₯ = β„Ž1(𝐡𝑙𝑒𝑓𝑑) and 𝑦 = β„Ž1(π΅π‘Ÿπ‘–π‘”β„Žπ‘‘). Thus, the

implicit representation for node 𝐡 is β„Ž1(𝐡) = (π‘₯ + 𝑦) mod π‘š1. Computing the β„Ž2

value implicitlyfor each bipartition works similarly.

Our HashRF algorithm is implemented onlybased on implicit bipartitions. How- ever, our HashCS algorithm uses a combination of implicit and 𝑛-bitstring represen-

tations. The implicit representation is the onlyrepresentation fed to the hashing functions. The 𝑛-bitstring is additional information that is collected from the first input tree to determine how the taxa should be grouped in the consensus tree and is used for constructing the resulting consensus trees.

Figure 19 provides an example of how to compute the β„Ž1 hash code implicitly.

Assume the set of random numbers 𝑅 = (6, 21, 10, 8, 19) and π‘š1 = 23. The β„Ž1 value

for node 𝑁3is computed from the 𝑛-bitstrings of its two children, 𝑁1 and 𝑁2, resulting

in 𝐡𝑙𝑒𝑓𝑑= 10000 and π΅π‘Ÿπ‘–π‘”β„Žπ‘‘ = 01000. Therefore, β„Ž1(11000) = β„Ž1(10000)+β„Ž1(01000) =

4. The implicit β„Ž1values are represented bythe integer values (unshaded boxes). The

hash code for the leaf node containing taxon 𝑖 is π‘Ÿπ‘–. These hash codes are propagated

up the tree to compute the implicit representations of the parent bipartitions. Thus, to compute the β„Ž1 hash code implicitlyfor 𝑁3, we calculate (6 + 21) mod 23, which

is 4. Computing the β„Ž2 hash code for representing a bipartition’s ID (BID) works

similarly.

b. Collision Types and Probability

A consequence of using hash functions is that bipartitions mayend up residing in the same location in the hash table. Such an event is considered a collision, and there are three types to consider (see Table I). Type 0 collisions are not regarded as a problematic condition. In our algorithms, this type of collision occurs when the same bipartition has alreadybeen populated into the hash table. In other words, Type 0 collision notifies us it finds a shared bipartition among input trees. Even if it is called as a collision in this thesis, it is one of the reasons to make our algorithms faster and efficient.

Table I. Collision types in our randomized algorithms.

Collision Type 𝐡𝑖 = 𝐡𝑗? β„Ž1(𝐡𝑖) = β„Ž1(𝐡𝑗)? β„Ž2(𝐡𝑖) = β„Ž2(𝐡𝑗)?

Type 0 Yes Yes Yes

Type 1 No Yes No

Type 2 No Yes Yes

resided in the same location in the hash table. That is, β„Ž1(𝐡𝑖) = β„Ž1(𝐡𝑗). Figure 20

shows two Type 0 collisions between two same bipartition, 𝐡1 and 𝐡3 and a Type 1

collision caused by 𝐡4. partition which is collided at location 4 in the hash table.

Type 2 (or double) collisions are serious and require a restart of the algorithm if such an event occurs. Otherwise, the resulting output will possiblybe incorrect. Suppose that 𝐡𝑖 βˆ•= 𝐡𝑗. A Type 2 collision occurs when two different bipartitions 𝐡𝑖

and 𝐡𝑗 hash to the same location in the hash table and the bipartition IDs (BIDs)

associated with them are also the same. That is, β„Ž1(𝐡𝑖) = β„Ž1(𝐡𝑗) and β„Ž2(𝐡𝑖) =

β„Ž2(𝐡𝑗). The probabilityof our randomized algorithms restarting because of a double

collision among anypair of the bipartitions is 𝑂(1 𝑐

)

[71]. The probabilityrestarting because of a double collision among anypair of the bipartitions is 𝑂(1

𝑐). Given that

we can make 𝑐 arbitrarilylarge, we do not explicitlycheck for Type 2 collisions. Thus we have the following result.

Theorem 1. Our HashCS and HashRF algorithms have a theoretical error rate of 𝑂(1

𝑐), where 𝑐 is an arbitrarily large constant number.

Proof. Double collision occurs when for two bipartitions 𝐡1, 𝐡2, we have β„Ž1(𝐡1) =

β„Ž1(𝐡2) and also β„Ž2(𝐡1) = β„Ž2(𝐡2). Thus, given two different bipartitions, 𝐡𝑖 and 𝐡𝑗,

where 𝑖 βˆ•= 𝑗, the probabilityof β„Ž1(𝐡𝑖) is equal to β„Ž1(𝐡𝑗) is

𝑃 π‘Ÿπ‘œπ‘(β„Ž1(𝐡𝑖) = β„Ž1(𝐡𝑗)) = π‘š1

Hash Table 0 4 ... m -11 h1 Bipartitions 1 2 3 T1 T2 2 B 1 B TYPE 0 collsions TYPE 1 collsion Hash Records T1 27 T1 T2 31 T2 46 4 B 3 B h2

Fig. 20. An example of populated hash table with the bipartitions from trees in Fig- ure 16. That is, 𝐡1 and 𝐡2 define 𝑇1, 𝐡3 and 𝐡4 are from 𝑇2, etc. The implicit

representation of each bipartition is fed to the hash functions β„Ž1 and β„Ž2. The

shaded value in each hash record contains the bipartition ID (or β„Ž2 value).

Each bipartition ID has a linked list of tree indexes that share that particular bipartition. 𝐡1 and 𝐡3 are identical bipartition and causes TYPE 0 collisions.

𝐡4 from 𝑇2 is actuallydifferent bipartitions with 𝐡1 and 𝐡3 but hashed to

the same location. It occurs TYPE 1 collision and thus the BID, ”31” for the bipartition 𝐡4 is chained with ”27”.

Similarly, the probability of β„Ž2(𝐡𝑖) is equal to β„Ž2(𝐡𝑗) is

𝑃 π‘Ÿπ‘œπ‘(β„Ž2(𝐡𝑖) = β„Ž2(𝐡𝑗)) = π‘š1

2, π‘€β„Žπ‘’π‘Ÿπ‘’ 𝑖 βˆ•= 𝑗. (4.7)

Note that π‘š1 is a prime number which satisfies π‘š1 β‰₯ 𝑛𝑑 and π‘š2 is also a prime

number which is bigger than 𝑐𝑛𝑑, where 𝑛 is the number of taxa, 𝑑 is the number of trees, and 𝑐 is a constant integer value. Thus, the probabilityof β„Ž1(𝐡𝑖) is equal to

β„Ž1(𝐡𝑗), and β„Ž2(𝐡𝑖) is equal to β„Ž2(𝐡𝑗) is

𝑃 π‘Ÿπ‘œπ‘(β„Ž1(𝐡𝑖) = β„Ž1(𝐡𝑗) π‘Žπ‘›π‘‘ β„Ž2(𝐡𝑖) = β„Ž2(𝐡𝑗)) = π‘š1

1π‘š2, π‘€β„Žπ‘’π‘Ÿπ‘’ 𝑖 βˆ•= 𝑗. (4.8)

The equation 4.8 is for a pair of bipartitions. Therefore, if there are 𝑑 binarytrees with 𝑛 taxa, we have (𝑛𝑑)2 pairs of bipartitions. Thus, for a given set of all bipartitions,

𝐡 from the set of input trees, we can conclude the probabilityof double collisions for all pairs of bipartitions is

βˆ€ 𝐡𝑖, 𝐡𝑗 ∈ 𝐡, 𝑃 π‘Ÿπ‘œπ‘(β„Ž1(𝐡𝑖) = β„Ž1(𝐡𝑗) π‘Žπ‘›π‘‘ β„Ž2(𝐡𝑖) = β„Ž2(𝐡𝑗)) = π‘š1

1π‘š2 β‹… (𝑛𝑑) 2

= 𝑛𝑑1 β‹… 𝑐𝑛𝑑1 β‹… (𝑛𝑑)2

= 1𝑐.

In practice, however, the error rate of HashCS and HashRF algorithms is much better. Our experiments with varying 𝑐 from 1 to 10,000, and running HashCS at least

100 times for each 𝑐 value, resulted in our algorithm producing the correct consensus tree everytime for an overall error rate of 0%. Similarly, HashRF shows the same overall error rate as the verification result.

Related documents