
4.4 Leveraging Task-Parallelism with OmpSs

4.4.1 Task-parallel implementation using OmpSs

We first describe how to exploit the concurrency of the solver by means of OmpSs; the resulting performance, convergence rate, and numerical accuracy are illustrated in Section 4.4.2. Using a pair of data structures, we capture the task dependencies that appear in the two most challenging operations of the method, namely the calculation of the preconditioner and its application, and pass this information to the OmpSs runtime, which can then implement a correct and efficient schedule of the entire solver.

Task-parallel PCG method

The operations that compose the iterative PCG solve exhibit a clear set of dependencies that dictate an almost strict order for their computation (see Subsection 4.1.3). These dependencies can be easily controlled using the OmpSs #pragma omp task directive. For example, the RaW dependency $\alpha_j \rightarrow x_{j+1}$ is enforced simply by declaring the headers for the routines that compute $\alpha_j := \sigma_j / (p_j^T v_j)$ (dot product) and $x_{j+1} := x_j + \alpha_j p_j$ (axpy) as follows:

// alpha := sigma / (p^T * v)

#pragma omp task input (n, sigma, p[0:n-1], v[0:n-1]) output(alpha)


// x := x + alpha * p

#pragma omp task input(n, alpha, p[0:n-1]) inout(x[0:n-1])

void AXPY(int *n, double *alpha, double p[], double x[]);

In practice, OmpSs identifies the data dependencies between tasks (i.e., program functions) that dictate the data-flow execution by tracking the order in which the functions are invoked in a serial execution, checking the directionality (input, output, or inout) of each operand in the argument list, and matching the operands' memory addresses at runtime with those of other tasks in execution. In the example, the data dependency between the two functions is detected at execution time, when they are invoked with the same variable α (actually, the same memory address), in the first case as an output operand and in the second as an input operand.
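The RaW dependency on α described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code of the solver: the names dot_alpha, axpy, and pcg_update_x are hypothetical, and the OmpSs in/out/inout clauses are replaced by the corresponding standard OpenMP depend clauses so that the sketch compiles with a stock C compiler (without OpenMP support the pragmas are ignored and the code runs serially). Note that OpenMP array sections are written as [lower:length], whereas the OmpSs headers above use [lower:upper].

```c
#include <assert.h>

/* alpha := sigma / (p^T v). Hypothetical name: the source does not show
 * the header of this routine. */
static void dot_alpha(int n, double sigma, const double p[],
                      const double v[], double *alpha)
{
    double ptv = 0.0;
    for (int i = 0; i < n; i++)
        ptv += p[i] * v[i];
    *alpha = sigma / ptv;
}

/* x := x + alpha * p */
static void axpy(int n, const double *alpha, const double p[], double x[])
{
    for (int i = 0; i < n; i++)
        x[i] += (*alpha) * p[i];
}

/* Two tasks linked by a RaW dependency on *alpha: the first task writes
 * it (out), the second reads it (in). The runtime matches the address. */
void pcg_update_x(int n, double sigma, const double p[], const double v[],
                  double x[], double *alpha)
{
    assert(n > 0);
    #pragma omp parallel
    #pragma omp single
    {
        #pragma omp task depend(in: p[0:n], v[0:n]) depend(out: alpha[0:1])
        dot_alpha(n, sigma, p, v, alpha);

        #pragma omp task depend(in: alpha[0:1], p[0:n]) depend(inout: x[0:n])
        axpy(n, alpha, p, x);
    }
}
```

Because both tasks name the same address for alpha, first as an output and then as an input, the second task cannot start until the first one has completed, regardless of how many threads are available.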

The real opportunities to exploit concurrency in the entire PCG method lie within the computations that involve the preconditioner, the sparse matrix-vector product ($v_j := A p_j$, SpMV), and the dot product ($p_j^T v_j$), as described next.

Task-parallel preconditioner with OmpSs

ILUPACK is quite an involved code, which allocates and releases memory dynamically for complex data structures. This turns the process of capturing the dependencies via OmpSs pragmas, which are directly based on the actual function arguments, into a delicate exercise. Furthermore, proceeding along that line would require an extensive reorganization of the package and a full rewrite of certain parts. For these reasons, we instead decided to create a “skeleton” structure that explicitly exposes the dependencies in the DAG associated with the preconditioner, analogous to the use of “representants” in [34]. The key advantage of this approach is that it limits the number of changes that are necessary to introduce OmpSs in ILUPACK’s legacy code. Besides, we can leverage this “skeleton” structure and parallelization scheme to exploit the concurrency in similar solvers that use different ILU techniques to calculate the preconditioner. In fact, we also parallelized the ILU(0) algorithm using this methodology, with minor changes in the code.

In order to describe how we capture the data dependencies and exploit task parallelism in the preconditioner with OmpSs, we will use the DAG/binary tree represented in Figure 4.2 as a working example. The approach is analogous for unbalanced and/or non-binary trees. The dependencies of this graph can be easily captured using a matrix of integers, dag[3][ntasks], where each column contains, for the corresponding task, the identifiers of the left/right descendant tasks and the ancestor task; see Table 4.2.

In practice, the user explicitly determines the tree-like concurrency of the preconditioner calculation in ILUPACK before the execution commences, by configuring a graph-based symmetric reordering tool (e.g., Metis or Scotch) to fix the number of levels and nodes in the preconditioner, as explained in Subsection 4.1.1. In consequence, this skeleton structure can be automatically created and initialized before the parallel computation of the preconditioner begins. In order to explain how the computation of the preconditioner works, we consider that all the processing within any of the DAG tasks that compose the preconditioner computation is performed by invoking the same function, ILUPrecond, with the following header:

void ILUPrecond(SparseMatrix *spMat, SparseFactor *spFact, int taskid);

In this argument list, spMat is an input (i.e., read-only) structure containing the sparse matrix, spFact is an input-output (i.e., read-write) structure for the sparse triangular factors, and taskid is an input integer that identifies the task to be processed during this invocation. For simplicity, we omit several other parameters that are present in the function definition but are not relevant for the following discussion.

Task                          T0  T1  T2  T3  T4  T5  T6
Task id. j                     0   1   2   3   4   5   6
Left descendant, dag[0][j]     –   –   –   –   0   2   4
Right descendant, dag[1][j]    –   –   –   –   1   3   5
Ancestor, dag[2][j]            4   4   5   5   6   6   –

Table 4.2: Contents of the dag data structure representing the nodes (tasks) and dependencies of the DAG in Figure 4.2. Here, dag[0][j], dag[1][j], and dag[2][j], j = 0, 1, . . . , 6, contain, respectively, the values in the rows labeled “Left descendant”, “Right descendant”, and “Ancestor”. The symbol “–” indicates that the task has no left/right descendants (i.e., it is a leaf) or no ancestor (for the root).
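As an illustration of how the dag structure of Table 4.2 could be initialized automatically, the following sketch fills it for a complete binary tree whose tasks are numbered level by level, from the leaves up to the root. It is only a sketch under these assumptions: the function name build_dag and the sentinel NO_TASK (standing for the “–” entries) are hypothetical, and the number of leaves is assumed to be a power of two.

```c
#include <assert.h>

#define NO_TASK (-1)   /* hypothetical sentinel for the "-" entries */

/* Fill dag[3][ntasks] for a complete binary tree with nleaves leaves.
 * Tasks are numbered level by level: leaves first, root last.
 * Row 0: left descendant, row 1: right descendant, row 2: ancestor. */
void build_dag(int nleaves, int ntasks, int dag[3][ntasks])
{
    assert(nleaves > 0 && (nleaves & (nleaves - 1)) == 0);
    assert(ntasks == 2 * nleaves - 1);

    for (int j = 0; j < ntasks; j++)
        dag[0][j] = dag[1][j] = dag[2][j] = NO_TASK;

    int prev_first = 0;        /* first task id of the level below   */
    int first = nleaves;       /* first task id of the current level */
    for (int width = nleaves / 2; width >= 1; width /= 2) {
        for (int k = 0; k < width; k++) {
            int parent = first + k;
            int left = prev_first + 2 * k;
            int right = left + 1;
            dag[0][parent] = left;
            dag[1][parent] = right;
            dag[2][left] = parent;
            dag[2][right] = parent;
        }
        prev_first = first;
        first += width;
    }
}
```

For nleaves = 4 and ntasks = 7, this reproduces the contents of Table 4.2.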

In the parallel implementation, function ILUPrecond is modified to include two new parameters, dag and vector, and its header declaration is preceded by the “taskifying” OmpSs pragma:

#pragma omp task in (vector[dag[0][taskid]], vector[dag[1][taskid]]) out(vector[taskid])

void ILUPrecondPar(SparseMatrix *spMat, SparseFactor *spFact, int taskid, int vector[], int dag[][]);

Here the contents of vector (in this case an integer array with seven entries, one per task) are irrelevant, since this structure is only used to create references to different memory addresses, which are then passed to OmpSs in order to identify the data dependencies.

To illustrate this, consider for example the following sequence of events. When function ILUPrecondPar is eventually invoked to process task T5 (i.e., with taskid=5), the runtime encounters the following call:

// Process task T_5

#pragma omp task in (...) out(vector[5])

ILUPrecondPar(spMat, spFact, 5, vector, dag);

This identifies vector[5] as an output of this function/task, while the input parameters are irrelevant for the discussion. This is eventually followed by a call to the same function, this time to process task T6 (with dag[1][6]=5):

// Process task T_6

#pragma omp task in (..., vector[5]) out(...)

ILUPrecondPar(spMat, spFact, 6, vector, dag);

which is also captured by the runtime, identifying vector[5] as an input parameter in this case. Now, the order of the calls and the references to memory, in both cases to the same address, &vector[5], first as an output and then as an input, allow the runtime to identify the RaW dependency T5→T6.
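The mechanism can be sketched end to end as follows. This is a hedged illustration, not the ILUPACK code: process_task is a hypothetical stand-in for ILUPrecondPar that only records the order in which tasks run, the OmpSs in/out clauses are expressed with the corresponding standard OpenMP depend clauses, and dag is declared with a fixed second dimension because C requires it. Leaf tasks have no descendants, so the sketch assumes one extra, never-written entry at the end of vector to stand for “no dependency”.

```c
#include <assert.h>

#define NTASKS 7
#define NO_TASK (-1)   /* hypothetical sentinel for the "-" entries */

static int finish_pos[NTASKS];  /* execution position of each task */
static int counter = 0;

/* Hypothetical stand-in for ILUPrecondPar: numerical work omitted. */
static void process_task(int taskid)
{
    int pos;
    #pragma omp atomic capture
    pos = counter++;
    finish_pos[taskid] = pos;
}

void factorize_tree(int dag[3][NTASKS])
{
    /* Contents are irrelevant: only the addresses are used. The last
     * entry is never an output, so depending on it imposes no order. */
    int vector[NTASKS + 1] = {0};
    (void)vector;
    counter = 0;

    #pragma omp parallel
    #pragma omp single
    {
        /* Tasks are created from the leaves to the root (ids 0..6). */
        for (int t = 0; t < NTASKS; t++) {
            int l = (dag[0][t] == NO_TASK) ? NTASKS : dag[0][t];
            int r = (dag[1][t] == NO_TASK) ? NTASKS : dag[1][t];
            #pragma omp task depend(in: vector[l], vector[r]) \
                             depend(out: vector[t])
            process_task(t);
        }
    }
}
```

With this scheme the four leaves may run concurrently, T4 and T5 wait for their respective descendants, and the root T6 is always processed last.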


Task-parallel triangular solves with OmpSs

After the computation of the preconditioner $M = LDU$, its application $z_{j+1} := M^{-1} r_{j+1}$ at each iteration of the PCG method requires the solution of two triangular systems: first, the lower triangular system $y := \hat{L}^{-1} r_{j+1}$, with $\hat{L} = LD$; and next, the upper triangular system $z_{j+1} := U^{-1} y$. The parallelization of the lower triangular system solve presents the same DAG as the preconditioner computation and, therefore, the same approach can be applied. The function that performs the processing associated with each one of the DAG tasks, annotated with the corresponding OmpSs pragma, is:

#pragma omp task in (vector[dag[0][taskid]], vector[dag[1][taskid]]) out(vector[taskid])

void ILULwSolvePar(SparseFactor *spFact, double r[], double y[], int taskid, int vector[], int dag[][]);

For the parallelization of the upper triangular system, the dependencies of the DAG are inverted, which simplifies the process since only one dependency must be considered per task. The function that processes the tasks during this solve is thus annotated as:

#pragma omp task in (vector[dag[2][taskid]]) out(vector[taskid])

void ILUUpSolvePar(SparseFactor *spFact, double y[], double z[], int taskid, int vector[], int dag[][]);

Here vector[dag[2][taskid]] identifies the corresponding ancestor in the binary tree. With this scheme, the OmpSs runtime can detect that, for example, vector[6] is an output for task T6 as well as an input for task T5 (dag[2][5]=6). Therefore, given that during the upper triangular solve the call to function ILUUpSolvePar with taskid=6 is encountered by the runtime before the call to the same function with taskid=5, OmpSs matches the memory addresses and correctly identifies and enforces the RaW dependency T6→T5 for the upper triangular solve.
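The inverted traversal can be sketched in the same style. Again this is only an illustration under assumptions: solve_task is a hypothetical stand-in for ILUUpSolvePar, standard OpenMP depend clauses replace the OmpSs in/out clauses, the tasks are assumed to be numbered so that every ancestor has a larger identifier than its descendants (as in Table 4.2), and an extra entry of vector stands for the missing ancestor of the root.

```c
#include <assert.h>

#define NTASKS 7
#define NO_TASK (-1)   /* hypothetical sentinel for the "-" entries */

static int start_pos[NTASKS];   /* execution position of each task */
static int counter = 0;

/* Hypothetical stand-in for ILUUpSolvePar: numerical work omitted. */
static void solve_task(int taskid)
{
    int pos;
    #pragma omp atomic capture
    pos = counter++;
    start_pos[taskid] = pos;
}

void upper_solve_tree(int dag[3][NTASKS])
{
    int vector[NTASKS + 1] = {0};   /* only the addresses matter */
    (void)vector;
    counter = 0;

    #pragma omp parallel
    #pragma omp single
    {
        /* Tasks are created from the root down to the leaves. */
        for (int t = NTASKS - 1; t >= 0; t--) {
            int a = (dag[2][t] == NO_TASK) ? NTASKS : dag[2][t];
            #pragma omp task depend(in: vector[a]) depend(out: vector[t])
            solve_task(t);
        }
    }
}
```

Each task now waits for a single predecessor, its ancestor, so the root is processed first and the leaves can proceed concurrently once their ancestors have finished.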

In summary, the entries of vector[] act as representants for the tasks in the corresponding DAGs, and together with the dag[][] structure, they govern the dependencies during the computation of the preconditioner and the iterative solution of the subsequent triangular systems.

Task-parallel sparse-matrix vector product with OmpSs

For this operation, we exploit that, after applying the appropriate recursive graph-based reordering, defined by $P$ (see Section 4.1.2), matrix $A$ is disassembled into a collection of submatrices, one per leaf task, like that in (4.3). Thus, the product $v_j := (P^T A P) p_j$ can be decomposed into a number of “independent” smaller matrix-vector products (e.g., 4 in equation (4.5)). This calculation is parallelized using OmpSs pragmas that allow the small matrix-vector suboperations to execute concurrently.

Task-parallel dot product with OmpSs

This operation calculates a single dot product per leaf task, because each task contains a block of the operand vectors, disassembled as in the sparse matrix-vector product (Subsection 4.4.1). Afterwards, the scalar is transformed from an inconsistent to a consistent state, implemented as a reduction of the partial results local to each thread. This involves an atomic addition and, therefore, a synchronization/barrier at the end of this operation (#pragma omp taskwait).
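The pattern of one partial dot product per leaf, followed by an atomic accumulation and a taskwait, can be sketched as follows. This is a simplified illustration rather than the implementation in the solver: the function name block_dot and the offsets array are hypothetical, the blocks are assumed to be contiguous and disjoint (so no entry is counted twice), and standard OpenMP tasks are used in place of OmpSs tasks.

```c
#include <assert.h>

/* Dot product of x and y split into nblocks blocks, one task per block.
 * Block b covers the index range offsets[b] .. offsets[b+1]-1
 * (hypothetical layout, assumed contiguous and disjoint). */
double block_dot(int nblocks, const int offsets[],
                 const double x[], const double y[])
{
    double result = 0.0;
    assert(nblocks > 0);

    #pragma omp parallel
    #pragma omp single
    {
        for (int b = 0; b < nblocks; b++) {
            #pragma omp task shared(result) firstprivate(b)
            {
                double partial = 0.0;
                for (int i = offsets[b]; i < offsets[b + 1]; i++)
                    partial += x[i] * y[i];
                /* Reduction of the local result into the global scalar. */
                #pragma omp atomic
                result += partial;
            }
        }
        /* The scalar is only valid once every block has contributed. */
        #pragma omp taskwait
    }
    return result;
}
```

The taskwait is what makes the operation a synchronization point: any task that consumes the scalar must wait until all leaf contributions have been added.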

In document Performance and Energy Optimization of the Iterative Solution of Sparse Linear Systems on Multicore Processors (Page 108-112)