1.8. Positioning of the thesis

Direct methods, introduced in Section 1.2, are a popular approach for solving large sparse linear systems of equations. They are often preferred for their robustness over other approaches such as iterative methods, whose efficiency largely depends on the numerical properties of the input problem. As presented in Section 1.3, for both dense and sparse systems, recently developed factorization algorithms expose a high degree of concurrency and allow the use of efficient kernels, yielding good performance on modern multicore architectures. The evolution of the hardware, however, with the growing number of cores per chip, the introduction of accelerators and the shrinking amount of memory per core, confronts researchers with new challenges that we tackle in this study: How can complex and irregular algorithms be implemented in an efficient yet portable way? How can algorithms evolve and improve without being limited or constrained by the complexity of their actual implementation? How can complex workloads be executed on heterogeneous architectures equipped with multiple execution units running at different speeds and with memories of different capacities and speeds? How can the performance of such codes be profiled and analyzed on hybrid architectures?
This thesis attempts to address these issues by relying on modern runtime systems to achieve a performance- and memory-efficient yet portable implementation of the QR multifrontal method for heterogeneous architectures. Modern runtime systems, presented in Section 1.4.2, provide programming interfaces suited to DAG-based algorithms, which have become increasingly popular in linear algebra, as explained for the dense and sparse cases in Sections 1.5 and 1.6, respectively. The use of runtime systems has been widely studied in the context of dense linear algebra but remains a challenge for sparse algorithms such as the multifrontal method. The difficulty of developing sparse methods on top of runtime systems lies in the complexity of the DAG representing the application, which contains a large number of tasks with a great variety of kernels, granularities and memory footprints.
The first issue we address, in Chapter 3, is to validate the relevance of this approach by porting qr_mumps to StarPU in a new version referred to as qrm_starpu. We show that qrm_starpu achieves good performance and provide a detailed performance analysis comparing our new version with the original solver. In Chapter 4 we redesign the previous implementation using a pure Sequential Task Flow (STF) model and improve our solver by integrating 2D, communication-avoiding front factorization algorithms. Furthermore, we develop a memory-aware algorithm that controls the memory consumption of the parallel multifrontal method. As mentioned in Section 1.6, few sparse solvers are based on runtime systems, and none of them fully relies on these tools for handling parallelism, task execution, scheduling and data management. In our approach, we separate the expression of the algorithms from low-level details such as dependency management, data transfers and data consistency, which are delegated to the runtime system. In addition, we take advantage of the expressiveness of the programming models that we use to develop new features.
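To illustrate what delegating dependency management to the runtime means in an STF model, the following minimal sketch records tasks in sequential submission order and infers their dependencies from the data each task declares that it reads or modifies. The `TaskFlow` class, the `submit_front_qr` function and the task names are hypothetical and written for this illustration only; they are not the StarPU interface nor the qrm_starpu code. The example submits a 1D block-column factorization of a single front with three block-columns.

```python
from collections import defaultdict


class TaskFlow:
    """Toy sequential-task-flow recorder (hypothetical, not the StarPU API).

    Tasks are submitted in sequential program order; dependencies are
    inferred from the data each task declares that it reads or modifies.
    """

    def __init__(self):
        self.names = []                    # task names in submission order
        self.deps = {}                     # task name -> set of predecessor names
        self._last_writer = {}             # data handle -> last task that modified it
        self._readers = defaultdict(list)  # data handle -> readers since that write

    def submit(self, name, reads=(), writes=()):
        preds = set()
        for h in reads:                    # read-after-write
            if h in self._last_writer:
                preds.add(self._last_writer[h])
        for h in writes:                   # write-after-write and write-after-read
            if h in self._last_writer:
                preds.add(self._last_writer[h])
            preds.update(self._readers[h])
        # Update the bookkeeping only once all predecessors are known.
        for h in reads:
            if h not in writes:
                self._readers[h].append(name)
        for h in writes:
            self._last_writer[h] = name
            self._readers[h] = []
        self.names.append(name)
        self.deps[name] = preds


def submit_front_qr(flow, ncols):
    """Submit a 1D block-column QR of one front, written as plain loops."""
    for k in range(ncols):
        flow.submit(f"panel({k})", writes=[k])
        for j in range(k + 1, ncols):
            flow.submit(f"update({k},{j})", reads=[k], writes=[j])


flow = TaskFlow()
submit_front_qr(flow, 3)
for name in flow.names:
    print(name, "<-", sorted(flow.deps[name]))
```

The loop nest contains no explicit synchronization: the two updates that follow `panel(0)` depend only on that panel and may therefore run concurrently, which the recorder finds from the access declarations alone.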
Exploiting heterogeneous systems is extremely challenging due to the complexity of these architectures. The current approaches presented in Section 1.6 generally rely on simple static scheduling strategies, and some state-of-the-art solvers exploit only the accelerator without taking advantage of the other resources of the machine. In Chapter 6 we address the data partitioning and scheduling issues that are critical to achieving performance on these architectures. We extend the 1D block-column partitioning used in the qr_mumps solver (see Section 1.7) to a hierarchical block partitioning that makes it possible to generate both fine- and coarse-grained tasks adapted to the capabilities of the processing units. We show that the simplicity of the STF model facilitates the implementation of this data partitioning, despite its complex dependency pattern, and makes it possible to partition data dynamically. We develop a scheduling strategy capable of handling the task heterogeneity in the DAG and the diversity of resources on heterogeneous architectures. Thanks to the modular approach that we employ, the scheduler implemented in qrm_starpu is generic and can be reused as is in other StarPU-based applications.
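The idea of a hierarchical block partitioning can be sketched as follows, under simplifying assumptions: the columns of a front are first split into coarse blocks, and each coarse block is further split into fine blocks, so that a task can be generated at either granularity. The function names, the block sizes and the rule assigning coarse blocks to accelerators and fine blocks to CPU cores are hypothetical choices made for this illustration; they do not reproduce the partitioning or scheduling policy of Chapter 6.

```python
def hierarchical_partition(ncols, coarse, fine):
    """Split columns [0, ncols) into coarse blocks, each split into fine blocks.

    Returns a list of (coarse_range, fine_ranges) using half-open ranges.
    """
    if fine <= 0 or coarse <= 0 or coarse % fine != 0:
        raise ValueError("coarse must be a positive multiple of fine")
    blocks = []
    for c0 in range(0, ncols, coarse):
        c1 = min(c0 + coarse, ncols)
        fine_ranges = [(f0, min(f0 + fine, c1)) for f0 in range(c0, c1, fine)]
        blocks.append(((c0, c1), fine_ranges))
    return blocks


def ranges_for(block, worker_kind):
    """Pick the granularity for one coarse block (hypothetical policy):
    an accelerator takes the whole block, CPU cores take the fine pieces."""
    coarse_range, fine_ranges = block
    return [coarse_range] if worker_kind == "gpu" else fine_ranges


blocks = hierarchical_partition(ncols=10, coarse=4, fine=2)
print(ranges_for(blocks[0], "gpu"))  # one coarse task on columns 0..3
print(ranges_for(blocks[0], "cpu"))  # two fine tasks on columns 0..1 and 2..3
```

Because the fine ranges are nested inside the coarse ones, the choice of granularity can be deferred until the kind of processing unit is known.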
The efficiency of the approaches presented above is assessed with the detailed performance analysis presented in Chapter 2. This performance analysis approach makes it possible to measure and separately analyse several factors that play a role in performance and scalability, such as locality issues and task pipelining. It should be noted that this analysis requires the ability to measure the time spent in several parts of the application. This information can easily be obtained via the runtime system. Nonetheless, the method is not restricted to applications based on task parallelism and can be readily applied to any type of parallel code.
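As an illustration of how measured times can be turned into separate performance factors, the sketch below splits an overall parallel efficiency into a product of four ratios. This particular decomposition, the field names and the sample timings are assumptions made for the example; they are not taken from Chapter 2 and are not measurements.

```python
import math
from dataclasses import dataclass


@dataclass
class Timings:
    """Cumulative times in seconds, summed over all workers (assumed inputs)."""
    t_ref: float        # tasks of the sequential reference run (coarse granularity)
    t_tasks_1: float    # tasks with the parallel granularity, run on one worker
    t_tasks_p: float    # tasks in the parallel run
    t_idle_p: float     # workers waiting for ready tasks in the parallel run
    t_runtime_p: float  # time spent inside the runtime system in the parallel run


def efficiency_factors(t):
    """Return the four factors; their product is t_ref / (total parallel time)."""
    busy = t.t_tasks_p
    active = busy + t.t_idle_p
    total = active + t.t_runtime_p
    return {
        "granularity": t.t_ref / t.t_tasks_1,  # cost of using smaller tasks
        "locality": t.t_tasks_1 / busy,        # tasks slowed down by memory effects
        "pipelining": busy / active,           # lack of ready tasks
        "runtime": active / total,             # overhead of the runtime system
    }


# Made-up timings, for illustration only.
timings = Timings(t_ref=100.0, t_tasks_1=110.0, t_tasks_p=125.0,
                  t_idle_p=15.0, t_runtime_p=10.0)
factors = efficiency_factors(timings)
overall = math.prod(factors.values())
for name, value in factors.items():
    print(f"{name:12s} {value:.3f}")
print(f"{'overall':12s} {overall:.3f}")
```

Since the intermediate terms cancel, the product of the factors equals the reference time divided by the total time consumed by all workers, so each factor isolates one source of loss.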
Finally, in Chapter 7 we address several issues related to the studies presented in the previous chapters. First, we propose a PTG-based version of our solver implemented with PaRSEC. Our work clearly benefits from the features, robustness and efficiency of runtime systems but, at the same time, provides valuable feedback to the community of runtime developers. In Sections 7.2 and 7.3 we discuss how qrm_starpu has been used to tune and validate a simulation engine for StarPU-based applications, and how scheduling contexts can be used to improve the locality of reference to data.