4.4 Leveraging Task-Parallelism with OmpSs
4.4.2 Optimization and experimental results
In this section we introduce two optimization strategies to improve the performance of the previous implementation of ILUPACK using OmpSs. Furthermore, we evaluate the optimized parallelization on high-end multicore platforms equipped with Intel and AMD processors. Our results report significant performance gains, demonstrating that OmpSs provides an efficient and close-to-seamless means to leverage the concurrency in a complex scientific code like ILUPACK.
All experiments in this section were performed on the int sandy and amd platforms, described in Section 4.3. Moreover, the software included the Mercurium C/C++ compiler (1.99.0) with support for OmpSs, Metis (4.0.3) for the graph reorderings, and ILUPACK (2.4). For the evaluation we employed the A200 matrix (see Table 4.1).
Prioritizing tasks
Taking into account that the bulk of the computational load is concentrated in the leaves, during the computation of the preconditioner and the subsequent iterative PCG solve, it is important to assign priorities so that the leaves of the task dependency trees are executed first. The primary reason for advancing the execution of these tasks is that it provides a better chance to balance the distribution of the workload among the threads.
In order to assign priorities using OmpSs, we had to distinguish the leaf tasks in the calls to ILUPrecond, ILULwSolve and ILUUpSolve, and include the appropriate priority clause as part of the taskifying directive. For example, for the first routine, we created two different routine calls:
#pragma omp task in(...) out(...) priority(high)
void ILUPrecondPar_LeafTask(...);
#pragma omp task in(...) out(...) priority(low)
void ILUPrecondPar_NoLeafTask(...);
which internally simply invoked the original routine ILUPrecond.
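The two wrappers can be sketched more completely as follows. This is a minimal illustration under stated assumptions, not the dissertation's code: the signature and body of ILUPrecond are stand-ins, the constants PRIO_LOW and PRIO_HIGH, the driver precond_sweep and the build flag USE_OMPSS are hypothetical, and the dependency clauses use the OmpSs array-shaping form. It assumes that the priority clause takes an integer expression and that larger values are scheduled earlier. When USE_OMPSS is not defined the pragmas are skipped and the calls simply run in sequence.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical priority values (assumption: larger means scheduled earlier). */
enum { PRIO_LOW = 0, PRIO_HIGH = 1 };

/* Stand-in for the original routine ILUPrecond (assumed signature). It only
   scales a block so that the example has an observable result. */
static void ILUPrecond(const double *in_blk, double *out_blk, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        out_blk[i] = 2.0 * in_blk[i];
}

/* Leaf wrapper: every call becomes a high-priority task under OmpSs. */
#ifdef USE_OMPSS /* hypothetical flag, defined when building with Mercurium */
#pragma omp task in([n]in_blk) out([n]out_blk) priority(PRIO_HIGH)
#endif
void ILUPrecondPar_LeafTask(const double *in_blk, double *out_blk, size_t n)
{
    ILUPrecond(in_blk, out_blk, n);
}

/* Non-leaf wrapper: same work, but tagged with the low priority. */
#ifdef USE_OMPSS
#pragma omp task in([n]in_blk) out([n]out_blk) priority(PRIO_LOW)
#endif
void ILUPrecondPar_NoLeafTask(const double *in_blk, double *out_blk, size_t n)
{
    ILUPrecond(in_blk, out_blk, n);
}

/* Hypothetical driver: issues one task per block and waits for all of them. */
void precond_sweep(const double *in, double *out,
                   size_t nblk, size_t blk, int is_leaf)
{
    assert(in != NULL && out != NULL);
    for (size_t b = 0; b < nblk; ++b) {
        if (is_leaf)
            ILUPrecondPar_LeafTask(in + b * blk, out + b * blk, blk);
        else
            ILUPrecondPar_NoLeafTask(in + b * blk, out + b * blk, blk);
    }
#ifdef USE_OMPSS
#pragma omp taskwait
#endif
}
```

Keeping the wrappers as thin pass-throughs leaves the original routine untouched; the leaf/non-leaf distinction lives only in the directive.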
The effect of introducing priorities on the computation of the preconditioner is graphically illustrated in Figure 4.8, which clearly shows how the priority mechanism enforces that the leaves of the task dependency tree are executed first. (All execution profiles in this dissertation were obtained with Extrae [79] v2.5.1.)
Controlling task granularity to reduce overhead
Our initial experiments with the iterative PCG stage revealed an excessive cost of the vector operations (dot product, axpy update, and 2-norm), much higher than could be expected from their theoretical cost; see the top plot in Figure 4.9. Further investigation revealed that this overhead was due to the large number of tasks that were created for each vector operation. To tackle this problem, we merged certain operations of the PCG iteration in order to increase their granularity. In particular, for the sequence of operations that compose the PCG loop in Figure 4.6, we merged the computation of $v_j$ with $\alpha_j$ (SpMV with dot product); $x_{j+1}$ and $r_{j+1}$ (axpys); and $\zeta_{j+1}$ with $\tau_{j+1}$ (dot product and vector 2-norm).
Figure 4.9 reveals the outcome of collapsing these vector operations, showing much narrower time “bands” for the execution of the corresponding tasks in the merged version.
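The merging described above can be illustrated with the following sketch, in which each pair of operations shares a single loop and would therefore be issued as one task instead of two. The CSR structure, the function names and the vector names d and z are assumptions made for the example (they follow the usual form of the PCG iteration, not necessarily the notation of Figure 4.6), and the last routine returns the sum of squares, leaving the scalar square root of the 2-norm to the caller.

```c
#include <assert.h>
#include <stddef.h>

/* Sparse matrix in CSR format (assumed storage, not ILUPACK's own layout). */
typedef struct {
    size_t n;               /* number of rows            */
    const size_t *row_ptr;  /* n + 1 row offsets         */
    const size_t *col_idx;  /* column index per nonzero  */
    const double *val;      /* value per nonzero         */
} csr_t;

/* Merged SpMV + dot product: computes v = A*d and returns d^T v. */
double fused_spmv_dot(const csr_t *A, const double *d, double *v)
{
    assert(A != NULL && A->n > 0);
    double dot = 0.0;
    for (size_t i = 0; i < A->n; ++i) {
        double s = 0.0;
        for (size_t k = A->row_ptr[i]; k < A->row_ptr[i + 1]; ++k)
            s += A->val[k] * d[A->col_idx[k]];
        v[i] = s;
        dot += d[i] * s;  /* reuse the fresh entry of v while it is at hand */
    }
    return dot;
}

/* Merged axpys: x += alpha*d and r -= alpha*v in one sweep. */
void fused_axpy_pair(size_t n, double alpha, const double *d,
                     const double *v, double *x, double *r)
{
    for (size_t i = 0; i < n; ++i) {
        x[i] += alpha * d[i];
        r[i] -= alpha * v[i];
    }
}

/* Merged dot product + norm: *zeta = r^T z and *tau_sq = r^T r.
   The 2-norm is the square root of *tau_sq. */
void fused_dot_norm(size_t n, const double *r, const double *z,
                    double *zeta, double *tau_sq)
{
    double zs = 0.0, ts = 0.0;
    for (size_t i = 0; i < n; ++i) {
        zs += r[i] * z[i];
        ts += r[i] * r[i];
    }
    *zeta = zs;
    *tau_sq = ts;
}
```

Besides halving the number of tasks for these steps, each merged loop reads its operands once, which is a secondary benefit of the coarser granularity.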
[Trace plots: one row per thread (Thread 1 to Thread 16); legend: ILUPrecond no leaf, ILUPrecond leaf.]
Figure 4.8: Trace of the preconditioner computation without and with priorities (top and bottom, respectively) on the Intel Xeon E5-2670, using 16 cores/threads and a decomposition of the sparse matrix into a tree with 32 leaves, for the A200 problem.
Figure 4.9: Trace of a single PCG iteration of the solve stage with unmerged and merged kernels (top and bottom, respectively) on the Intel Xeon E5-2670, using 16 cores/threads and a decomposition of the sparse matrix into a tree with 32 leaves, for the A200 problem.
DAG concurrency vs acceleration
As argued earlier, there is a trade-off between the computational cost of the preconditioner computation/iterative PCG solve and the concurrency of these two stages, which is determined by the number of levels/leaves of the task dependency trees and the cost of the individual tasks that are involved in the sparse matrix-vector product and the construction/application of the preconditioner. To illustrate this situation, we consider first the solution of the target linear system partitioned into a DAG (tree) with a single leaf/level vs one with multiple levels, using a single core of the Intel-based (int sandy) server in all cases. For example, the computation of the preconditioner for the single-leaf DAG, using the sequential implementation of ILUPACK, required 137.07 seconds. This time is reduced when the multi-level DAG/preconditioner is computed using only one core and the associated DAG consists of up to 128 leaves (concretely, 123.27 and 136.33 seconds for 32 and 128 leaves, respectively), due to differences in the fill-in patterns between the single-level and multi-level cases; but the time then grows to 172.61 seconds for 256 leaves, due to the additional flops associated with the higher number of levels. For the iterative PCG solve, the multi-level partitionings incur an increased computational cost as well as a higher number of iterations (see next subsection). The outcome of these combined factors is that, on the int sandy server, the PCG solve requires 193.17 seconds in the single-leaf DAG vs 200.10, 180.53 and 298.81 seconds with 2, 32 and 128 leaves, respectively, when executed on a single core. The large increase in the 128-leaf DAG is also due to the additional flops required by the larger number of levels.
Figure 4.10 reports the speed-up of the parallel (data-flow) implementations of preconditioner computation and PCG solve (per iteration) for the two platforms employed in the evaluation. The acceleration rates were always computed with respect to the sequential legacy implementation of ILUPACK, running with a single thread/core. For the parallel implementation, in general the best results are obtained when the number of leaves equals or doubles the number of threads/cores. We emphasize that the speed-ups embed the increment in the computational cost that occurs when the number of levels in the DAG is increased. Thus, for the int sandy server, the speed-ups for the calculation of the preconditioner vary between 2.09/1.56 for 2 cores and 32/256 leaves, and 12.11/9.67 for 16 cores and 32/256 leaves. (The superlinear speed-up in the execution with 2 cores/32 leaves may be due to a better utilization of the cache system or a smaller fill-in.) The values for the iterative solve stage are similar: 2.31/0.95 for 2 cores and 16/256 leaves; and 9.44/4.80 for 16 cores and 32/256 leaves. Slightly lower speed-ups were obtained for the amd platform.
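As a rough consistency check, and assuming that the speed-up is the ratio between the sequential time and the parallel time, the preconditioner figures quoted above for the int sandy server with 16 cores imply approximately the following. The symbols $S$, $T_{\mathrm{seq}}$ and $T_p(\ell)$ (time on $p$ cores with $\ell$ leaves) are notation introduced here, and all right-hand values are derived by arithmetic from the reported numbers, not measured.

```latex
\begin{align*}
S(p,\ell) &= \frac{T_{\mathrm{seq}}}{T_p(\ell)}, \qquad T_{\mathrm{seq}} = 137.07\ \mathrm{s},\\[2pt]
T_{16}(32) &\approx \frac{137.07}{12.11} \approx 11.32\ \mathrm{s}, &
\frac{T_1(32)}{T_{16}(32)} &\approx 12.11 \times \frac{123.27}{137.07} \approx 10.89,\\[2pt]
T_{16}(256) &\approx \frac{137.07}{9.67} \approx 14.17\ \mathrm{s}, &
\frac{T_1(256)}{T_{16}(256)} &\approx 9.67 \times \frac{172.61}{137.07} \approx 12.18.
\end{align*}
```

Under these assumptions, the 256-leaf configuration scales better relative to its own single-core time, but it starts from a higher single-core cost, which is why its speed-up over the sequential legacy code is lower.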
DAG concurrency vs numerical properties of the solver
In order to assess the numerical behaviour of the preconditioner/PCG solver as a function of the number of leaves/tasks (i.e., concurrency), we utilize the A-norm defined in [104], with the estimator in [178], as a measure of the numerical accuracy of the approximate solution $x_j$ computed at the $j$-th iteration: $\|x - x_j\|_A$, where $x$ stands for the correct solution. Figure 4.11 shows that, for a fixed residual A-norm, there is a slight increase in the iteration count as the number of leaves grows from 1 (sequential legacy implementation in ILUPACK) up to 256. For example, in order to achieve a residual of order 1.0e-12, the sequential code requires 68 iterations, while this value grows to 77, 79 and 85, for 2, 32 and 256 leaves, respectively. In any case, from the numerical point of view, the parallel methods can still deliver the same level of accuracy (residual A-norm) as the sequential implementation at the expense of a slight increase of the theoretical cost, which is more than compensated in the parallel execution.
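For reference, the A-norm of a vector e is the square root of e^T A e, with A symmetric positive definite. The sketch below evaluates e^T A e for e = x - x_j directly. It assumes that a reference solution x is available and that A is small and dense (row-major), which is only reasonable for illustration; in the experiments x is not known, which is why the estimator of [178] is used instead. The function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Returns e^T A e for e = x - xj, where A is an n-by-n symmetric positive
   definite matrix stored densely in row-major order. The A-norm of the
   error is the square root of the returned value. */
double error_anorm_sq(size_t n, const double *A,
                      const double *x, const double *xj)
{
    assert(n > 0);
    double acc = 0.0;
    for (size_t i = 0; i < n; ++i) {
        double row = 0.0;  /* i-th entry of A*e */
        for (size_t k = 0; k < n; ++k)
            row += A[i * n + k] * (x[k] - xj[k]);
        acc += (x[i] - xj[i]) * row;
    }
    return acc;
}
```

For instance, with A = [2 1; 1 3], x = (1, 1) and x_j = (0, -1), the error is e = (1, 2), A e = (4, 7), and e^T A e = 18.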
[Figure 4.10 plots: four panels showing speed-up (vertical axis, 0 to 14) versus number of leaves (horizontal axis, 1 to 256), with one curve each for 1, 2, 4, 8 and 16 threads. Panel titles: Speed-up of preconditioner computation on Intel Xeon E5-2670 platform; Speed-up of PCG solve on Intel Xeon E5-2670 platform; Speed-up of preconditioner computation on AMD Opteron 6276 platform; Speed-up of PCG solve on AMD Opteron 6276 platform.]
Figure 4.10: Speed-ups attained with the data-flow ILUPACK method parallelized with OmpSs, for the A200 problem. The left-hand side plots correspond to the computation of the preconditioner and the right-hand side plots to the iterative PCG solve.
4.5 Exploiting Task-Parallelism with MPI + OmpSs
In this section we introduce a parallel implementation of the preconditioned iterative solver for sparse linear systems underlying ILUPACK that explores the interoperability between the message-passing MPI programming interface and the OmpSs task-parallel programming model [14]. Our approach commences from the task dependency tree derived from a multi-level graph partitioning of the problem, and statically maps the tasks in the top levels of this tree to the cluster nodes, fixing the inter-node communication pattern. This mapping induces a conformal partitioning of the tasks in the remaining levels of the tree among the nodes, which are then processed concurrently via the OmpSs runtime system.
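One way to picture the static mapping and the conformal partitioning it induces is sketched below. The sketch rests on assumptions that are not taken from the text: the task tree is a complete binary tree numbered heap-style (root 1, children 2i and 2i+1), the number of cluster nodes is a power of two, and a task above the cut level is assigned to the rank that owns its leftmost descendant at the cut. The function names are hypothetical.

```c
#include <assert.h>

/* Depth of node i in a heap-numbered binary tree (root = 1, depth 0). */
static int node_depth(unsigned i)
{
    int d = 0;
    while (i > 1) {
        i >>= 1;
        ++d;
    }
    return d;
}

/* Rank that owns task i when the 2^cut tasks at depth `cut` are mapped one
   per cluster node (ranks 0 .. 2^cut - 1). Tasks at or below the cut inherit
   the rank of their ancestor at the cut, so each subtree stays on one node.
   Tasks above the cut go to the rank of their leftmost descendant at the cut
   (an assumed rule). */
int owner_rank(unsigned i, int cut)
{
    assert(i >= 1 && cut >= 0);
    int d = node_depth(i);
    unsigned anchor = (d >= cut) ? (i >> (d - cut))   /* ancestor at the cut  */
                                 : (i << (cut - d));  /* leftmost descendant  */
    return (int)(anchor - (1u << cut));
}
```

With four cluster nodes (cut = 2), tasks 4 to 7 map to ranks 0 to 3, task 11 (a child of task 5) stays on rank 1, and the root is handled by rank 0. Only the tasks above the cut require data from other ranks, which is what fixes the inter-node communication pattern in this picture.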
In Section 4.4, we exploited the task parallelism exposed by the DAG associated with the sparse matrix to develop a parallel version of the ILUPACK PCG solver for shared-memory multiprocessors that relies on OmpSs [12, 13]. Moreover, a parallel version of ILUPACK for clusters using MPI was developed in previous work [12, 23]. Unfortunately, the previous MPI version of ILUPACK