Application of the preconditioner - Performance and Energy Optimization of the Iterative Soluti

3.3 ILUPACK

3.3.2 Application of the preconditioner

Figure 3.2 offers an algorithmic description of the PCG method. The computation of the preconditioner $M$, explained in the previous section, is the first step of the solver (O0). The subsequent iteration involves a sparse matrix-vector product (SpMV) (O1), the application of the preconditioner (O5), and several vector operations (dot products, axpy-like updates, and a 2-norm; in O2–O4 and O6–O9). In the remainder of this section, we focus on the application of the preconditioner.
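As a point of reference, the structure of one PCG solve can be sketched in a few lines of Python. This is a minimal textbook variant, assuming a dense NumPy matrix and a Jacobi preconditioner as stand-ins; the function name `pcg`, its parameters, and the test problem are illustrative assumptions, and the comments do not attempt to reproduce the O0–O9 labels of Figure 3.2.

```python
import numpy as np

def pcg(A, b, apply_prec, tol=1e-10, max_iter=200):
    """Preconditioned CG for a symmetric positive definite matrix A.

    apply_prec(r) is assumed to return z = M^{-1} r.
    """
    x = np.zeros_like(b, dtype=float)
    r = b - A @ x                         # initial residual
    z = apply_prec(r)                     # preconditioner application
    d = z.copy()                          # first search direction
    rho = r @ z                           # dot product
    norm_b = np.linalg.norm(b)
    for k in range(max_iter):
        w = A @ d                         # matrix-vector product (SpMV in the sparse case)
        alpha = rho / (d @ w)             # dot product
        x += alpha * d                    # axpy-like update
        r -= alpha * w                    # axpy-like update
        if np.linalg.norm(r) <= tol * norm_b:   # 2-norm convergence test
            return x, k + 1
        z = apply_prec(r)                 # preconditioner application
        rho_new = r @ z                   # dot product
        d = z + (rho_new / rho) * d       # axpy-like update of the direction
        rho = rho_new
    return x, max_iter

# Illustrative test problem (an assumption, not taken from the text):
# 1D Laplacian with a Jacobi (diagonal) preconditioner.
n = 50
A = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
b = np.ones(n)
diag = np.diag(A).copy()
x, iters = pcg(A, b, lambda r: r / diag)
```

In ILUPACK the callback passed as `apply_prec` would correspond to the multilevel procedure described in the rest of this section rather than to a diagonal scaling.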

For simplicity, let us drop the subscripts in the corresponding operation (O5) of Figure 3.2: $z := M^{-1} r$. Applying the preconditioner at level $l$ (i.e., computing $z := M_l^{-1} r$) then requires solving the system of linear equations

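$$
\begin{pmatrix} \tilde{L}_B & 0 \\ \tilde{L}_E & I \end{pmatrix}
\begin{pmatrix} \tilde{D}_B & 0 \\ 0 & M_{l+1} \end{pmatrix}
\begin{pmatrix} \tilde{U}_B & \tilde{U}_F \\ 0 & I \end{pmatrix}
P^T \tilde{P}^T \tilde{D}^{-1} z = P^T \tilde{P}^T \tilde{D}\, r. \tag{3.53}
$$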

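Breaking down (3.53), we first recognize two transformations of the residual vector $r$. First, $r_0 := \tilde{D} r$ applies the diagonal scaling to this vector; then the ordering step computes $\hat{r} := P^T \tilde{P}^T r_0$. Once these transformations are completed, the system

$$
\begin{pmatrix} \tilde{L}_B & 0 \\ \tilde{L}_E & I \end{pmatrix}
\begin{pmatrix} \tilde{D}_B & 0 \\ 0 & M_{l+1} \end{pmatrix}
\begin{pmatrix} \tilde{U}_B & \tilde{U}_F \\ 0 & I \end{pmatrix}
w = \hat{r} \tag{3.54}
$$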

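is solved for $w$ ($= P^T \tilde{P}^T \tilde{D}^{-1} z$) in three steps: first, solving

$$
\begin{pmatrix} \tilde{L}_B & 0 \\ \tilde{L}_E & I \end{pmatrix} y = \hat{r} \tag{3.55}
$$

for $y$; then solving recursively

$$
\begin{pmatrix} \tilde{D}_B & 0 \\ 0 & M_{l+1} \end{pmatrix} x = y \tag{3.56}
$$

for $x$; and finally solving for $w$

$$
\begin{pmatrix} \tilde{U}_B & \tilde{U}_F \\ 0 & I \end{pmatrix} w = x. \tag{3.57}
$$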

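In turn, the systems (3.55) and (3.57) are each solved in two steps. Assuming that the vectors $y$ and $\hat{r}$ are split conformally with the blocks of the factors, (3.55) becomes

$$
\begin{pmatrix} \tilde{L}_B & 0 \\ \tilde{L}_E & I \end{pmatrix}
\begin{pmatrix} y_B \\ y_C \end{pmatrix}
=
\begin{pmatrix} \hat{r}_B \\ \hat{r}_C \end{pmatrix}. \tag{3.58}
$$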

This system is then tackled by first solving the unit lower triangular system

$$
\tilde{L}_B y_B = \hat{r}_B \tag{3.59}
$$

for $y_B$, and then computing

$$
y_C := \hat{r}_C - \tilde{L}_E y_B. \tag{3.60}
$$

Splitting the vectors as before, Equation (3.56) involves the diagonal-matrix multiplication

$$
x_B := \tilde{D}_B^{-1} y_B, \tag{3.61}
$$

and the recursive step

$$
x_C := M_{l+1}^{-1} y_C. \tag{3.62}
$$

In the base step of the recursion, the size of $M_{l+1}$ is zero, so only $x_B$ has to be computed. Finally, after an analogous partitioning, Equation (3.57) can be reformulated as

$$
w_C := x_C \tag{3.63}
$$

and

$$
\tilde{U}_B w_B = x_B - \tilde{U}_F w_C, \tag{3.64}
$$

after which $z$ is simply obtained from $z := \tilde{D}(\tilde{P}(P w))$.

Recall that, to save memory, ILUPACK discards the factors $\tilde{L}_E$ and $\tilde{U}_F$ once each level of the preconditioner has been computed, keeping only two sparse rectangular matrices $E$ and $F$, frequently much sparser than $\tilde{L}_E$ and $\tilde{U}_F$, such that $\tilde{L}_E = E \tilde{U}_B^{-1} \tilde{D}_B^{-1}$ and $\tilde{U}_F = \tilde{D}_B^{-1} \tilde{L}_B^{-1} F$. This optimization changes (3.60) to

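$$
y_C := \hat{r}_C - E \tilde{U}_B^{-1} \tilde{D}_B^{-1} y_B, \tag{3.65}
$$

which, in combination with (3.59), yields

$$
y_C := \hat{r}_C - E \tilde{U}_B^{-1} \tilde{D}_B^{-1} \tilde{L}_B^{-1} \hat{r}_B. \tag{3.66}
$$

Furthermore, (3.64) also changes to

$$
\tilde{U}_B w_B = \tilde{D}_B^{-1} y_B - \tilde{D}_B^{-1} \tilde{L}_B^{-1} F w_C. \tag{3.67}
$$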

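Now, (3.66) can be tackled by first solving

$$
\tilde{L}_B \tilde{D}_B \tilde{U}_B s_B = \hat{r}_B \tag{3.68}
$$

for $s_B$, and then obtaining

$$
y_C := \hat{r}_C - E s_B, \tag{3.69}
$$

while (3.67) can be tackled by solving

$$
\tilde{L}_B \tilde{D}_B \tilde{U}_B \hat{s}_B = F w_C \tag{3.70}
$$

for $\hat{s}_B$, and then performing

$$
w_B := s_B - \hat{s}_B. \tag{3.71}
$$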
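The step from (3.67) to (3.70) and (3.71) is compact, so it may help to spell it out. The following derivation uses only relations stated above:

```latex
\begin{align*}
w_B &= \tilde{U}_B^{-1}\tilde{D}_B^{-1} y_B
       - \tilde{U}_B^{-1}\tilde{D}_B^{-1}\tilde{L}_B^{-1} F w_C
    && \text{(3.67) multiplied by } \tilde{U}_B^{-1} \\
    &= (\tilde{L}_B\tilde{D}_B\tilde{U}_B)^{-1}\hat{r}_B
       - (\tilde{L}_B\tilde{D}_B\tilde{U}_B)^{-1} F w_C
    && \text{using } y_B = \tilde{L}_B^{-1}\hat{r}_B \text{ from (3.59)} \\
    &= s_B - \hat{s}_B
    && \text{by (3.68) and (3.70).}
\end{align*}
```

The vector $s_B$ computed for (3.69) is therefore reused in (3.71), which is why only two solves with $\tilde{L}_B \tilde{D}_B \tilde{U}_B$ are needed per level.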

To summarize, at each level the procedure implemented by ILUPACK performs two sparse matrix-vector multiplications and solves two linear systems of the form $\tilde{L} \tilde{D} \tilde{U} x = b$. In addition, three other types of operations appear: diagonal scaling, vector permutation, and vector updates of the form $x := a - b$.
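The per-level procedure can be illustrated with a short recursive sketch. It is not ILUPACK's implementation: it uses dense NumPy arrays instead of sparse storage, calls `np.linalg.solve` as a stand-in for the sparse triangular solves, and assumes the symmetric case with a single combined permutation per level. The dictionary keys (`scale`, `perm`, `L`, `d`, `U`, `E`, `F`) and the helper `build_exact_levels` are hypothetical names introduced only for this example. The helper builds an exact factorization (no dropping), so in this toy setting the preconditioner reproduces the inverse of the matrix, which makes the result easy to check.

```python
import numpy as np

def solve_ldu(level, b):
    """Solve L_B D_B U_B s = b: forward solve, diagonal scaling, backward solve."""
    t = np.linalg.solve(level["L"], b)       # dense stand-in for a sparse triangular solve
    return np.linalg.solve(level["U"], t / level["d"])

def apply_preconditioner(levels, r, l=0):
    """Return z = M_l^{-1} r using (3.68)-(3.71) at every level."""
    lev = levels[l]
    n_b = lev["d"].size
    r_hat = (lev["scale"] * r)[lev["perm"]]           # diagonal scaling, then permutation
    r_b, r_c = r_hat[:n_b], r_hat[n_b:]
    s_b = solve_ldu(lev, r_b)                         # (3.68)
    if r_c.size == 0:                                 # base step: M_{l+1} is empty
        w_b, w_c = s_b, r_c
    else:
        y_c = r_c - lev["E"] @ s_b                    # (3.69), first matrix-vector product
        w_c = apply_preconditioner(levels, y_c, l + 1)    # recursive step; w_C = x_C
        w_b = s_b - solve_ldu(lev, lev["F"] @ w_c)    # (3.70)-(3.71), second product
    w = np.concatenate([w_b, w_c])
    z = np.empty_like(w)
    z[lev["perm"]] = w                                # undo the permutation
    return lev["scale"] * z                           # undo the diagonal scaling

def build_exact_levels(A, block_sizes):
    """Hypothetical helper: exact multilevel factorization of an SPD matrix (no dropping)."""
    levels, S = [], A.copy()
    for n_b in block_sizes:
        n = S.shape[0]
        scale = 1.0 / np.sqrt(np.diag(S))             # symmetric diagonal scaling
        perm = np.arange(n)[::-1].copy()              # arbitrary permutation for the demo
        A_hat = (scale[:, None] * S * scale[None, :])[np.ix_(perm, perm)]
        B, F = A_hat[:n_b, :n_b], A_hat[:n_b, n_b:]
        E, C = A_hat[n_b:, :n_b], A_hat[n_b:, n_b:]
        G = np.linalg.cholesky(B)                     # B = G G^T
        g = np.diag(G)
        L = G / g                                     # unit lower triangular factor
        levels.append({"scale": scale, "perm": perm, "L": L,
                       "d": g ** 2, "U": L.T, "E": E, "F": F})
        # Schur complement becomes the matrix of the next level.
        S = C - E @ np.linalg.solve(B, F) if n_b < n else C
    return levels

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 6))
A = X @ X.T + 6.0 * np.eye(6)                         # SPD test matrix
levels = build_exact_levels(A, [3, 2, 1])             # last level has an empty M_{l+1}
r = rng.standard_normal(6)
z = apply_preconditioner(levels, r)
```

Each call performs two products (with `E` and `F`) and two calls to `solve_ldu`, matching the operation count in the summary above. With a real incomplete factorization, `z` would only approximate the solution of the system with right-hand side `r`.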

Chapter 4. Exploiting Task-Parallelism in ILUPACK

The growth of thread-level hardware parallelism in multicore architectures, with processors that nowadays support from dozens to hundreds of threads (e.g., 64 threads in the IBM PowerPC A2 processor and 240 in the Intel Xeon Phi processor), has guided the development of several data-flow programming models in the past few years. Their purpose is to decouple the description of an algorithm from the “mechanics” of its parallel execution, thereby reducing the coding effort and improving source code portability.

Data-flow programming models assume that data dependencies characterize a number of “correct” concurrent schedules by defining a partial execution order on the operations (tasks) that compose the algorithm. In general, modern data-flow programming models are assisted by a specialized runtime which analyzes data dependencies and orchestrates the parallel execution in order to optimize performance. Emblematic examples of this type of programming model include, among others, DAGuE/PaRSEC [54], Harmony [67], Mentat [89], Qilin [130], StarPU [177], Uintah [44], XKaapi [84], and the target of our work, OmpSs [74, 2]. For the particular domain of dense linear algebra, the application of these models and/or similar approaches has resulted in a collection of high performance libraries (DPLASMA, libflame, MAGMA, PLASMA, etc.). However, the application of data-flow programming paradigms to the parallel solution of sparse systems of linear equations is still immature, mostly because sparse linear algebra operations are much more challenging than their dense counterparts. In particular, data-flow runtime-assisted linear system solvers using supernodal direct and ILU-type iterative methods have been proposed only recently, for example in [106, 119, 122] and [22], respectively.
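The idea that dependencies alone determine the admissible schedules can be illustrated with a toy runtime. The sketch below is a generic Python illustration built on the standard-library modules `graphlib` and `concurrent.futures`; it is not the OmpSs API, and the function `run_dataflow` and the task names are assumptions made for this example.

```python
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED
from graphlib import TopologicalSorter

def run_dataflow(tasks, deps, workers=4):
    """Run every task as soon as all tasks it depends on have finished.

    tasks: name -> callable taking the results of its dependencies, in order.
    deps:  name -> list of names it depends on (missing key = no dependencies).
    """
    sorter = TopologicalSorter(deps)
    sorter.prepare()
    results, order, running = {}, [], {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while sorter.is_active():
            for name in sorter.get_ready():            # tasks whose inputs are available
                inputs = [results[d] for d in deps.get(name, ())]
                running[pool.submit(tasks[name], *inputs)] = name
            done, _ = wait(running, return_when=FIRST_COMPLETED)
            for fut in done:
                name = running.pop(fut)
                results[name] = fut.result()
                order.append(name)                     # record one valid completion order
                sorter.done(name)                      # may release dependent tasks
    return results, order

# Two independent tasks followed by one task that consumes both results.
tasks = {
    "leaf1": lambda: 1.0,
    "leaf2": lambda: 2.0,
    "root": lambda a, b: a + b,
}
deps = {"root": ["leaf1", "leaf2"]}
results, order = run_dataflow(tasks, deps)
```

The two leaf tasks may complete in either order, while `root` can only run after both; any such order is one of the "correct" schedules that the dependencies allow.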

In this chapter we analyze the PCG method in ILUPACK with the goal of exposing task parallelism. The parallelization scheme can be leveraged to implement different versions of the solver using OmpSs, MPI, and a combination of both, achieving significant performance gains. Moreover, this scheme can also be applied to easily parallelize other ILU-type iterative solvers.

The chapter is structured as follows. Section 4.1 describes how to extract task concurrency in the PCG method. Section 4.2 reviews the parallel programming models relevant for this dissertation. Section 4.3 introduces the target platforms and the test cases employed in the evaluation experiments of this chapter. Next, Sections 4.4 and 4.5 present two task-parallel implementations of the ILUPACK solver, for multicore processors with OmpSs and for clusters of multicore processors using MPI+OmpSs. Section 4.6 provides optimized implementations with OmpSs and MPI for NUMA platforms and manycore hardware co-processors based on the Intel Xeon Phi. Finally, some concluding remarks are included in Section 4.7.

4.1 Task-Level Concurrency in the PCG Method

We first present the strategy followed to extract task concurrency in the PCG method in ILUPACK. In Subsection 4.1.1 we define the main concepts to partition a matrix into an adjacency graph. Then, in Subsections 4.1.2 and 4.1.3 we explain how to use the adjacency graph to extract task parallelism in the preconditioner computation, its application, and the remaining operations in the PCG method.

In document Performance and Energy Optimization of the Iterative Solution of Sparse Linear Systems on Multicore Processors (Page 93-96)