5.3 Load Balancing

5.3.2 Improved Data-to-Processor Assignment

Algorithm

In ExaBayes-1.2.1, having to choose the data distribution algorithm (i.e., MPS or CDD) manually is only a minor inconvenience for users. However, pathological examples can be constructed in which neither of the two data distribution algorithms achieves “good” performance. Intuitively, if we split just some partitions among PEs, we obtain a more balanced distribution scheme, at the cost of some additional precomputations on the PEs (i.e., eigenvector/eigenvalue decompositions and exponentiation of the transition rate matrix Q). If we allow partitions to be only partially assigned to PEs, we obtain the following bicriterion problem for optimizing load balance:

1. distribute alignment patterns such that every PE has the “same” number of patterns (difference between minimum and maximum number of patterns assigned should be ≤ 1) and

2. assign partitions (partially) to PEs, such that the maximum number of partitions (partially) assigned to a PE is minimal.
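The two criteria can be written down compactly. The following is only a sketch of one possible formalization; the symbols $m$ (number of partitions), $P$ (number of PEs), $s_j$ (number of patterns in partition $j$) and $a_{ij}$ (number of patterns of partition $j$ assigned to PE $i$) are notation assumed here and do not appear in the text above.

```latex
% Assumed notation: a_{ij} = number of patterns of partition j assigned to PE i
\begin{align*}
  &\sum_{i=1}^{P} a_{ij} = s_j \quad (j = 1,\dots,m), \qquad a_{ij} \in \mathbb{Z}_{\ge 0}, \\
  &\max_{i} \sum_{j=1}^{m} a_{ij} \;-\; \min_{i} \sum_{j=1}^{m} a_{ij} \;\le\; 1
     && \text{(criterion 1: balanced pattern counts)}, \\
  &\text{minimize} \quad \max_{i} \, \bigl|\{\, j : a_{ij} > 0 \,\}\bigr|
     && \text{(criterion 2: few partitions per PE)}.
\end{align*}
```

A partition $j$ counts towards PE $i$ as soon as $a_{ij} > 0$, whether it is assigned entirely or only partially.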

Finding an optimal solution to this problem is NP-hard, yet a linearithmic-time approximation exists [60]. We refer to this algorithm as the divisible load balancing (DLB) algorithm. The approximation guarantees that the PE with the most partitions is assigned at most one partition more than in an optimal solution. In the following, we briefly describe the version of the algorithm implemented in ExaBayes and ExaML. In the first step, we sort partitions in descending order of their number of patterns. In the second step, we assign entire partitions to PEs until no further partition can be assigned without violating the first criterion (close to equal number of patterns). After the second step, we divide the PEs into a queue A of PEs that hold the fewest partitions and a queue B of PEs that already hold one partition more than the PEs in A. In the third and final step, we partially assign the remaining partitions to PEs: we (i) try to fill up PEs that already have one partition more (i.e., we dequeue from queue B), (ii) then try to finalize a PE that has fewer partitions, or (iii) only partially assign a partition to a PE that has fewer partitions.
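The three steps can be illustrated with a short Python sketch. This is a deliberately simplified illustration and not the implementation in ExaBayes or ExaML: the function name `dlb_sketch`, the round-robin order in step 2 and the way queues A and B are approximated (by visiting the PEs that hold more partitions first) are all assumptions made here, and the sketch does not carry the approximation guarantee of [60].

```python
def dlb_sketch(partition_sizes, num_pes):
    """Return, for each PE, a list of (partition_id, n_patterns) pieces."""
    total = sum(partition_sizes)
    base, extra = divmod(total, num_pes)
    # Criterion 1: per-PE pattern counts differ by at most one.
    capacity = [base + (1 if pe < extra else 0) for pe in range(num_pes)]
    assignment = [[] for _ in range(num_pes)]

    # Step 1: sort partitions by descending number of patterns.
    order = sorted(range(len(partition_sizes)),
                   key=lambda p: partition_sizes[p], reverse=True)

    # Step 2: assign whole partitions round-robin (assumed order) until
    # the next partition no longer fits into the PE whose turn it is.
    remaining, pe, whole_phase = [], 0, True
    for p in order:
        size = partition_sizes[p]
        if whole_phase and size <= capacity[pe]:
            assignment[pe].append((p, size))
            capacity[pe] -= size
            pe = (pe + 1) % num_pes
        else:
            whole_phase = False
            remaining.append(p)

    # Step 3: split the remaining partitions. PEs that already hold more
    # partitions (roughly queue B) are filled up before the others (queue A).
    open_pes = iter(sorted((i for i in range(num_pes) if capacity[i] > 0),
                           key=lambda i: -len(assignment[i])))
    pe = next(open_pes, None)
    for p in remaining:
        left = partition_sizes[p]
        while left > 0:
            take = min(left, capacity[pe])
            assignment[pe].append((p, take))
            capacity[pe] -= take
            left -= take
            if capacity[pe] == 0:
                pe = next(open_pes, None)  # this PE is finalized
    return assignment


if __name__ == "__main__":
    # Hypothetical partition sizes (numbers of patterns), two PEs.
    for pe, pieces in enumerate(dlb_sketch([5, 4, 4, 3, 2, 2], 2)):
        print(pe, pieces, sum(n for _, n in pieces))
```

Because the per-PE capacities sum to the total number of patterns, step 3 always finds room for every leftover pattern, so the first criterion holds by construction; only the number of pieces per PE depends on the heuristic choices.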

Evaluation

We performed a runtime evaluation on a real-world alignment that is a subset of a large-scale supermatrix of insect sequences [77]. The alignment comprises 144 species and 38,400 AA characters. We used the alignment to create 10 distinct datasets with an increasing number of partitions. For each dataset, we determined partition lengths at random, while the number of partitions in a dataset was fixed to 24, 36, 48, 72, 96, 144, 192, 288, 384, and 768, respectively. For generating $n$ partition lengths, we drew $n$ random numbers $x_1, \ldots, x_n$ from an exponential distribution $\exp(1) + 0.1$. For a partition $p$, the value of $x_p / \sum_{i=1}^{n} x_i$ then specifies the proportion of characters that belong to partition $p$. The offset of 0.1 prevents partition lengths from becoming unrealistically small, since the exponential distribution strongly favors small values. Thus, partition lengths are distributed uniformly on the logarithmic scale.

[Plot omitted. Axes: #partitions (x-axis), time [sec] (y-axis). Panels: ExaML and ExaBayes, each with 24 and 48 processes. Curves: DLB, CDD, MPS.]

Figure 5.11: Runtime performance (y-axis) of three different algorithms for data distribution for increasing number of partitions. Top: Runtimes of ExaML using either 24 or 48 processes. Bottom: Runtimes of ExaBayes using either 24 or 48 processes.
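The sampling of partition lengths described above can be sketched in a few lines of Python. The function name `random_partition_lengths`, the `seed` parameter and, in particular, the rounding of proportions to integer lengths (largest fractional parts receive the leftover characters) are assumptions made for this illustration; the text does not state how proportions were converted to whole numbers of characters.

```python
import random


def random_partition_lengths(n_partitions, n_chars, seed=None):
    """Draw partition lengths that sum to n_chars (illustrative sketch)."""
    rng = random.Random(seed)
    # x_i ~ exp(1) + 0.1, as described in the text.
    x = [rng.expovariate(1.0) + 0.1 for _ in range(n_partitions)]
    total = sum(x)
    # Proportion x_p / sum_i x_i, scaled to the alignment length.
    exact = [n_chars * xi / total for xi in x]
    lengths = [int(e) for e in exact]
    # Assumed rounding rule: give the leftover characters to the
    # partitions with the largest fractional parts.
    leftover = n_chars - sum(lengths)
    by_fraction = sorted(range(n_partitions),
                         key=lambda i: exact[i] - lengths[i], reverse=True)
    for i in by_fraction[:leftover]:
        lengths[i] += 1
    return lengths


if __name__ == "__main__":
    # 768 partitions over the 38,400 characters of the alignment.
    lengths = random_partition_lengths(768, 38400, seed=1)
    print(min(lengths), max(lengths), sum(lengths))
```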

The DLB algorithm is implemented in ExaBayes as of version 1.3 and in ExaML as of version 2.0. In ExaML, we performed a typical tree search, and in ExaBayes, we ran a chain under default MCMC parameters (as of version 1.2.1) for $10^4$ generations. We executed both codes using either 24 or 48 processes on a cluster of Intel Sandy Bridge nodes (2 × 6 cores per node). Thus, a total of 2 nodes was needed for runs with 24 processes and 4 nodes were needed for runs with 48 processes (with accompanying higher inter-node communication costs). On the left-hand side of Fig. 5.11, runtimes for whole-partition distribution with fewer than 48 partitions are omitted, since here runtimes are identical to executing the run with 24 processes.

As depicted in Fig. 5.11, the new heuristic consistently executes at least as fast as the better of the two previous data distribution strategies, with only one exception. Compared to CDD, the heuristic is 3.5× faster for 24 processes and up to 5.9× faster for 48 processes. Using the heuristic, ExaML needs up to 3.6× less time than with the MPS scheme for 24 processes, and for 48 processes the runtime can be improved by a factor of up to 3.9×. For large numbers of partitions, the runtime of the MPS scheme converges to the runtime of the new heuristic. However, if the same run is executed with more processes (i.e., 48 instead of 24), this break-even point is shifted substantially towards a higher number of partitions.

The runtime results confirm that CDD performs at acceptable levels for many processes and few partitions. The MPS method performs as well as the DLB heuristic with few processes and many partitions. Both panels show that there is a region where neither of the previous strategies performs acceptably compared to the new heuristic and that this performance gap widens with an increasing number of processes.

Finally, employing the DLB heuristic, ExaML executes about twice as fast with 48 processes as with 24 processes and thus exhibits an optimal scaling factor of about 2.07 in all cases. For comparison, under CDD, scaling factors ranged from 1.24 to 1.75, and under MPS, scaling factors ranged from 1.00 (i.e., no parallel runtime improvement) to 2.04.
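As a small worked example of how such scaling factors can be computed, the following sketch takes the scaling factor to be the runtime with 24 processes divided by the runtime with 48 processes. This reading is an assumption, although it is consistent with a factor of 1.00 meaning no parallel runtime improvement. The function names and the runtimes used below are hypothetical and are not measurements from the evaluation.

```python
def scaling_factor(runtime_fewer, runtime_more):
    """Runtime with fewer processes divided by runtime with more processes."""
    return runtime_fewer / runtime_more


def relative_efficiency(runtime_fewer, runtime_more, procs_fewer, procs_more):
    """Scaling factor divided by the factor by which the process count grew."""
    return scaling_factor(runtime_fewer, runtime_more) / (procs_more / procs_fewer)


if __name__ == "__main__":
    # Hypothetical runtimes in seconds, for illustration only.
    t24, t48 = 1000.0, 500.0
    print(scaling_factor(t24, t48))
    print(relative_efficiency(t24, t48, 24, 48))
```

When the number of processes doubles, a scaling factor of 2 corresponds to a relative efficiency of 1, and values above 2 indicate superlinear scaling.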

For ExaBayes, the comparison among the three data distribution algorithms is less straightforward. This is because for the runtimes of MPS and CDD, we employed ExaBayes version 1.2.1, whereas for the DLB algorithm, ExaBayes version 1.4.1 was used. Because of updates in the PLL and several substantial code modifications (including bug fixes), later versions of ExaBayes generally run faster (not considering the runtime gain achieved by DLB). For instance, along with the implementation of DLB, the I/O overhead at start-up could be reduced: each PE computes the assignment and subsequently reads only the portion of a preprocessed binary alignment that was assigned to it by DLB. Still, in ExaBayes, we observe a similar runtime behavior for the data distribution algorithms (see bottom of Fig. 5.11) as for ExaML. Analogously, there exists a range of partition schemes for which neither MPS nor CDD performs optimally. Specifically, the runtime penalty for CDD increases more rapidly in ExaBayes for a growing number of partitions than in ExaML. Even for the DLB data distribution, datasets with more partitions require more runtime. A likely cause for this is that ExaBayes has a constant proportion of AA matrix proposals, while in ExaML the initially specified matrix is kept fixed. For ExaBayes, we also obtain efficient relative speed-up values between 1.63 and 2.67 (comparing runs with 24 and 48 processes).

5.4 Memory Reduction Techniques
