1.6. Related work on sparse direct solvers
The literature on parallel sparse direct methods is abundant, and many solvers are currently available with different features and efficiency. Many of them are based, to some extent, on task parallelism, although very few rely on a general-purpose runtime system to implement it: apart from PaStiX, described below, no other solver fully relies on a runtime system. Some target multicore architectures, others are designed for GPUs and others for distributed memory systems; some can handle all these types of devices in a single package. A complete taxonomy of all known solvers and methods is beyond the scope of this document; in the rest of this section we list the works that are most closely related to the subject of this thesis.
One of the first multifrontal QR solvers is MA49, developed by Amestoy et al. [14] and still distributed in the HSL Mathematical Software Library. This solver was designed for shared memory multiprocessors and employs a rather simple approach to the use of tree and node parallelism, where each type is exploited separately with a different technique. Tree parallelism is achieved through a hand-coded task queueing system, where a task is defined as the assembly and factorization of a front; the task associated with a front is pushed into the task queue as soon as all the tasks associated with its child fronts are completed. Node parallelism, instead, is entirely delegated to a multithreaded BLAS library.
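The queueing rule described above (a front becomes ready once all of its children are done) can be illustrated with a minimal, serial sketch. This is not MA49 code: the `parent` array encoding of the elimination tree and the function name are assumptions made for illustration, and in a real solver several threads would pull tasks from the queue concurrently.

```python
from collections import deque

def schedule_fronts(parent):
    """Return an order in which fronts can be processed.

    parent[i] is the parent front of front i in the elimination tree,
    or -1 for a root (hypothetical encoding chosen for this sketch).
    """
    n = len(parent)
    # Number of child fronts that still have to complete for each front.
    pending_children = [0] * n
    for p in parent:
        if p >= 0:
            pending_children[p] += 1

    # Leaves have no children, so their tasks are ready immediately.
    queue = deque(i for i in range(n) if pending_children[i] == 0)
    order = []
    while queue:
        front = queue.popleft()
        order.append(front)  # placeholder for "assemble and factorize the front"
        p = parent[front]
        if p >= 0:
            pending_children[p] -= 1
            if pending_children[p] == 0:
                queue.append(p)  # all children done: the parent becomes ready
    return order

# Fronts 0 and 1 are children of 2; fronts 2 and 3 are children of the root 4.
order = schedule_fronts([2, 2, 4, 4, -1])
```

Whatever the tree, each front appears in the resulting order after all of its children, which is the only constraint tree parallelism has to respect.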
More recently, in 2011, Tim Davis released a new multifrontal QR solver named SuiteSparseQR [42] (SPQR), which is also specifically designed for shared memory, multicore systems. In essence, SPQR uses the same approach as MA49 for implementing parallelism but relies on a runtime system, namely Intel Threading Building Blocks (TBB) [95], for handling the tasks: tasks are submitted to the TBB runtime in a recursive fashion and barriers are used to ensure that a front is processed only once the tasks associated with its children are done. As for node parallelism, SPQR relies on multithreaded BLAS libraries.
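The recursive submission scheme with barriers can be sketched as follows. This is only an illustration of the pattern, not SPQR or TBB code: plain Python threads stand in for runtime tasks, and the `children` dictionary and function name are hypothetical.

```python
import threading

def process_subtree(front, children, order, lock):
    # Spawn one task per child subtree (plain threads stand in for runtime tasks).
    tasks = [
        threading.Thread(target=process_subtree, args=(c, children, order, lock))
        for c in children.get(front, [])
    ]
    for t in tasks:
        t.start()
    # Barrier: wait for all child subtrees before touching this front.
    for t in tasks:
        t.join()
    with lock:
        order.append(front)  # placeholder for assembly + factorization

# Hypothetical elimination tree: root 4 with children 2 and 3; 2 has children 0 and 1.
children = {4: [2, 3], 2: [0, 1]}
order = []
process_subtree(4, children, order, threading.Lock())
```

The order in which sibling subtrees complete is not deterministic, but a front is always recorded after its children.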
Even more recently, Davis and his team developed a variant of the SPQR solver, called SPQR-GPU, that runs on a GPU; this is a GPU-only solver in the sense that the CPU is only used to drive the work of the GPU, where all the computation takes place. SPQR-GPU employs 2D communication-avoiding algorithms for the factorization of frontal matrices and uses a task-based approach for implementing parallelism. The elimination tree is split into pieces called “stages” such that each stage fits into the GPU memory (for smaller problems the whole elimination tree can be processed in a single stage). Tasks are handled through two nested, hand-coded schedulers. The outer scheduler, called the “Sparse QR scheduler”, handles the stages and, as soon as a stage is ready to be processed (i.e., when its child stages are finished), spawns an instance of the inner scheduler, called the “Bucket scheduler”, which is in charge of generating the numerical tasks related to the fronts in the stage and of submitting them to the GPU for execution. The tracking of dependencies in both schedulers is done manually.
The UHM solver (which uses the LDL^T factorization) by Kim et al. [76] uses an approach similar to that of SPQR and SPQR-GPU but targets shared memory multicore machines (without GPUs). It also employs a nested tasking mechanism. A recursive submission of OpenMP tasks (one task per front) is used to handle tree parallelism in exactly the same way as SPQR uses Intel TBB. When one such task is executed, it spawns an instance of an inner, hand-coded scheduler which generates OpenMP tasks for the operations related to the associated front. The tracking of the dependencies between these tasks is done in a rather complicated way, where multiple tasks may be generated for the same operation although only one of them actually executes it. Another variant of the UHM solver can handle GPU devices [77]: tree parallelism is handled as in the base variant and node parallelism is achieved through a bulk synchronous execution of task lists with a static mapping of tasks to processing units (CPUs or GPUs) based on a performance model.
In 2014 Hogg et al. [67] released SSIDS, a GPU-only multifrontal solver for sparse,
symmetric indefinite systems. This solver shares some commonalities with SPQR-GPU: the factorization proceeds in successive steps where, at each step, a list of independent tasks is generated and submitted to the GPU for execution.
The authors of SSIDS also developed the MA86 [66] and MA87 [68] solvers that target, respectively, the solution of symmetric indefinite and symmetric positive definite sparse linear systems on multicore machines. Both solvers are based on a left-looking, supernodal factorization where supernodes are decomposed into tiles, and both use a task-based parallelization approach; the scheduling of tasks is achieved with a hand-coded task queueing system. Task dependencies are tracked by associating with each tile a counter which describes its state and, consequently, the operations that can be executed on it; as soon as an operation can be executed on a tile, the corresponding task is pushed into the task pool.
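To make the counter-based tracking concrete, here is a minimal, serial sketch for a dense tiled Cholesky factorization. It is a deliberate simplification and not the MA86/MA87 implementation: it ignores sparsity and supernodes, generates updates as soon as their inputs are available, and all names (`tiled_cholesky_order`, the task tuples) are hypothetical. Each tile (i, j) carries a counter of the updates it has received; once the counter reaches j, the tile is ready for its final operation.

```python
from collections import deque

def tiled_cholesky_order(nt):
    """Simulate counter-driven scheduling for an nt-by-nt tiled Cholesky.

    Returns the executed tasks: ("factor", k), ("solve", i, k)
    or ("update", i, j, k), in execution order.
    """
    # updates[i][j]: number of updates already applied to tile (i, j);
    # tile (i, j) needs j of them before it can be factored or solved.
    updates = [[0] * nt for _ in range(nt)]
    factored = [False] * nt                     # diagonal tile (k, k) factorized
    solved = [[False] * nt for _ in range(nt)]  # off-diagonal tile (i, k) solved

    pool = deque([("factor", 0)])  # tile (0, 0) needs no update
    order = []

    def tile_ready(i, j):
        # Push the final operation of tile (i, j) if its state allows it.
        if updates[i][j] != j:
            return
        if i == j:
            pool.append(("factor", j))
        elif factored[j]:
            pool.append(("solve", i, j))

    while pool:
        task = pool.popleft()
        order.append(task)  # placeholder for the numerical kernel
        if task[0] == "factor":
            k = task[1]
            factored[k] = True
            for i in range(k + 1, nt):
                tile_ready(i, k)
        elif task[0] == "solve":
            _, i, k = task
            solved[i][k] = True
            # An update needs two solved tiles of column k (or one, twice).
            for j in range(k + 1, nt):
                if solved[j][k]:
                    pool.append(("update", max(i, j), min(i, j), k))
        else:
            _, i, j, k = task
            updates[i][j] += 1
            tile_ready(i, j)
    return order

order = tiled_cholesky_order(3)
```

For a 3-by-3 tile grid this produces 10 tasks (3 factorizations, 3 solves and 4 updates), each pushed into the pool exactly once, when the state of the tiles it touches allows it.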
Among the other efforts to port sparse direct methods to GPUs, we can cite the work by George et al. [56], Lucas et al. [86], and Yu et al. [119]. These approaches mainly target the multifrontal method for LU or Cholesky factorizations because of its very good data locality properties. The main idea is to treat some parts of the computation (mostly, trailing submatrix updates) entirely on the GPU. The main originality of these efforts therefore lies in the methods and algorithms used to decide whether or not a task can be processed on a GPU. In most cases this was achieved through a threshold-based criterion on the size of the computational tasks. More complex approaches can be found in [120,99,96]. These improve over previous efforts mostly by proposing techniques for aggregating fine-grain operations into large-grain tasks which maximize the GPU occupancy (either by grouping basic BLAS operations or by treating a complete subtree as a single task) and pipelining to overlap communications with computations. In more recent work, Sao et al. [98] extend the SuperLU_DIST package to support Xeon Phi architectures using techniques analogous to those of their previous effort [99].
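The threshold-based criterion mentioned above can be sketched in a few lines. The choice of a flop count as the measure of task size and the default threshold value are assumptions made for illustration; they are not taken from the cited works.

```python
def gemm_flops(m, n, k):
    """Floating-point operations of a dense update C -= A @ B (A: m x k, B: k x n)."""
    return 2 * m * n * k

def choose_device(m, n, k, threshold_flops=1e8):
    """Offload a trailing submatrix update to the GPU only when it is large enough.

    threshold_flops is a hypothetical tuning parameter for this sketch.
    """
    return "gpu" if gemm_flops(m, n, k) >= threshold_flops else "cpu"

small = choose_device(10, 10, 10)       # 2e3 flops: stays on the CPU
large = choose_device(1000, 1000, 100)  # 2e8 flops: goes to the GPU
```

In practice the threshold would be tuned to the machine, since it has to offset the cost of moving the operands to the device.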
The work most closely related to this thesis is described in the PhD thesis of Xavier Lacoste [80]. Lacoste implemented two variants of the PaStiX [64] solver: one based on STF task parallelism and relying on the StarPU runtime system, the other based on PTG task parallelism and relying on PaRSEC. Both variants implement a 1D, left-looking supernodal factorization and are capable of using multicore systems equipped with multiple GPUs. The variant based on StarPU can also run on distributed memory systems. These implementations do not take full advantage of the features of runtime systems. In the StarPU variant, task dependencies are declared explicitly through the use of tags, rather than inferred by the runtime through data access analysis. In both variants the scheduling of tasks to GPU devices is static and relies on a performance model; moreover, no eviction policy is implemented: once the supernodes that are mapped on the GPU are processed, they are not brought back to the host in order to free space for new ones. This means that the amount of computation that can be done on the GPU is limited by the size of the memory available on the device. Ultimately, Lacoste concluded that, in a shared memory context and up to a certain number of cores, it is possible to use runtime systems to develop sparse direct solvers whose performance is on par with (or slightly lower than) that of a finely tuned, hand-coded package. Nonetheless, he showed how runtime systems can ease the porting of such solvers to GPU-equipped architectures thanks to automatic dependency tracking and transparent data handling.
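The difference between declaring dependencies explicitly and letting the runtime infer them can be illustrated with a toy model of the STF approach: tasks are submitted in sequential order together with the data they read and write, and dependencies follow from those accesses. This is a minimal sketch, not the StarPU API; the class, its methods and the task names are hypothetical.

```python
class StfGraph:
    """Toy sequential-task-flow graph: dependencies are inferred from data accesses."""

    def __init__(self):
        self.deps = {}         # task name -> set of tasks it must wait for
        self.last_writer = {}  # data handle -> last task that wrote it
        self.readers = {}      # data handle -> tasks that read it since the last write

    def submit(self, name, reads=(), writes=()):
        deps = set()
        for d in reads:
            # Read-after-write: wait for the last writer of d.
            if d in self.last_writer:
                deps.add(self.last_writer[d])
        for d in writes:
            # Write-after-write and write-after-read dependencies.
            if d in self.last_writer:
                deps.add(self.last_writer[d])
            deps.update(self.readers.get(d, ()))
        # Record this task's accesses for the tasks submitted after it.
        for d in reads:
            self.readers.setdefault(d, []).append(name)
        for d in writes:
            self.last_writer[d] = name
            self.readers[d] = []
        self.deps[name] = deps

g = StfGraph()
g.submit("factor(A)", writes=["A"])
g.submit("update(B,A)", reads=["A"], writes=["B"])
g.submit("update(C,A)", reads=["A"], writes=["C"])
g.submit("factor(B)", writes=["B"])
```

No dependency is stated by the caller: the two updates are found to depend only on `factor(A)` and are therefore independent of each other, while `factor(B)` waits for `update(B,A)`. With explicit tags, each of these edges would have to be declared by hand.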