
2.3 Parallel Numerical Libraries

2.3.1 LAPACK

The LAPACK project [15] aims to provide a linear algebra library that is efficient on a wide range of high performance computers. It is developed by a group of academic and private researchers from the US and UK and extends the earlier EISPACK and LINPACK projects. LAPACK specifies a standard library interface with routines for solving systems of linear equations, performing least squares regressions, calculating eigenvalues and performing matrix decompositions. It supports dense and banded matrices but not those stored in any sparse matrix format. The functions are implemented for real and complex numerical types in single and double precision floating point arithmetic. A reference Fortran implementation of LAPACK is available through the Netlib website, although this is a generic implementation and better performance can usually be obtained from an optimised library supplied by the CPU manufacturer.

The EISPACK and LINPACK projects ignored the cost of accessing data elements in computer memory, which leads to poor performance on modern computers where floating point arithmetic is much faster than memory access. Modern computers have a memory hierarchy with multiple levels of fast cache memory that hold frequently used data to hide the cost of accessing slower main memory. LAPACK is therefore designed to reuse data as much as possible so that it runs at the speed of the floating point units rather than at the speed of the memory. Recent CPUs also have multiple processing cores and LAPACK is written to expose any available parallelism to the scheduler.

LAPACK relies on an optimised BLAS implementation for best performance on any computing platform. BLAS is a similar library specification to LAPACK that contains simpler linear algebra functions operating on vectors and matrices. The BLAS are organised into three levels. Level 1 of the BLAS was the first to be proposed and performs operations on vectors [69]. It is efficient on scalar CPUs but not on vector or parallel CPUs. Levels 2 and 3 of the BLAS were proposed later and involve matrix-vector and matrix-matrix operations respectively [38, 37]. Level 3 of the BLAS has the highest ratio of floating point operations to data elements needed (FLOP to word ratio) of the three BLAS levels, so level 3 routines have the most opportunities for data reuse and benefit most from CPUs with a memory hierarchy. Level 2 BLAS operations present fewer opportunities for data reuse than level 3 operations but more than level 1. As with LAPACK, a reference implementation of the BLAS written in Fortran is available from the Netlib website; however, it is not optimised for any particular computer architecture.
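The difference in FLOP to word ratio between the three levels can be illustrated with a small calculation. The sketch below is not taken from the BLAS proposals; it assumes one representative operation per level (a vector update, a matrix-vector product and a matrix-matrix product on n-by-n data), counts each operand element once and ignores scalars. The function name `flop_to_word_ratio` is hypothetical.

```python
def flop_to_word_ratio(level, n):
    """Approximate FLOPs per data element for one representative
    operation at each BLAS level (illustrative counts only)."""
    if level == 1:
        # y <- alpha*x + y on n-vectors: n multiplies and n adds, reads x and y
        flops, words = 2 * n, 2 * n
    elif level == 2:
        # y <- A*x + y with A n-by-n: reads A, x and y
        flops, words = 2 * n * n, n * n + 2 * n
    elif level == 3:
        # C <- A*B + C with all matrices n-by-n: reads A, B and C
        flops, words = 2 * n ** 3, 3 * n * n
    else:
        raise ValueError("level must be 1, 2 or 3")
    return flops / words


if __name__ == "__main__":
    for level in (1, 2, 3):
        print(level, round(flop_to_word_ratio(level, 1000), 2))
```

Under these assumptions the ratio stays near 1 for level 1, approaches 2 for level 2, and grows in proportion to n for level 3, which is why only level 3 routines can keep data in cache long enough to run near the speed of the floating point units.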

LAPACK and BLAS routines follow a standard naming convention based on the type of matrix they operate on. The names consist of four, five or six letters. The meaning of each letter is explained in Appendix A of the LAPACK Installation Guide [22] and the proposals for the level 2 and 3 BLAS [38, 37]. The first letter of any BLAS or LAPACK routine specifies how each data element in the matrix is stored and will be “S” or “D” for single or double precision floating point, respectively, or “C” or “Z” for consecutive pairs of single or double precision floating point numbers representing the real and imaginary parts of a complex number. The following two letters in the routine name represent the form of the matrix and include “GE” for general matrices, “SY” for symmetric matrices, “TR” for upper or lower triangular matrices and “PO” for symmetric positive-definite matrices. The remaining letters indicate the operation the routine performs. Level 1 BLAS routines, which operate only on vectors, omit the matrix-type letters. As an example, the BLAS routine that performs single-precision general matrix-matrix multiplication is named “SGEMM” while the LAPACK routine that performs double-precision positive-definite triangular factorisation is named “DPOTRF”. The BLAS and LAPACK acronyms frequently used in this thesis are summarised in Table 2.1.

Single precision   Double precision   Explanation
SDOT               DDOT               Dot product of two vectors
SSCAL              DSCAL              Multiplication of each element in a vector by a scalar value
SGEMV              DGEMV              General matrix-vector multiplication
SGEMM              DGEMM              General matrix-matrix multiplication
SSYRK              DSYRK              Symmetric rank-K update
STRMM              DTRMM              Triangular matrix multiplication
STRSM              DTRSM              Triangular matrix solve
STRTRI             DTRTRI             Triangular matrix inverse
SLAUUM             DLAUUM             Product of an upper or lower triangular matrix with its own transpose (U U^T or L^T L)
SPOTRF             DPOTRF             Positive-definite triangular matrix factorisation or Cholesky decomposition
SPOTRI             DPOTRI             Calculate the inverse of a matrix from its Cholesky decomposition

Table 2.1: BLAS and LAPACK acronyms used throughout this thesis
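The naming convention can be made concrete with a small decoder. This is a minimal sketch covering only the precision letters and the four matrix types named in the text; the helper `decode_routine_name` and its lookup tables are hypothetical and not part of BLAS or LAPACK. Names whose second and third letters are not one of those four types (for example level 1 BLAS routines such as SDOT) are returned with no matrix type.

```python
PRECISIONS = {
    "S": "single precision real",
    "D": "double precision real",
    "C": "single precision complex",
    "Z": "double precision complex",
}

# Only the matrix types mentioned in the text; the full convention has more.
MATRIX_TYPES = {
    "GE": "general",
    "SY": "symmetric",
    "TR": "triangular",
    "PO": "symmetric positive-definite",
}


def decode_routine_name(name):
    """Split a routine name into (precision, matrix type, operation letters)."""
    name = name.upper()
    if not 4 <= len(name) <= 6:
        raise ValueError("routine names have four, five or six letters")
    if name[0] not in PRECISIONS:
        raise ValueError("unknown precision letter: " + name[0])
    precision = PRECISIONS[name[0]]
    matrix = MATRIX_TYPES.get(name[1:3])
    if matrix is None:
        # Not one of the matrix types covered by this sketch
        return precision, None, name[1:]
    return precision, matrix, name[3:]


if __name__ == "__main__":
    print(decode_routine_name("SGEMM"))
    print(decode_routine_name("DPOTRF"))
```

For “DPOTRF” the decoder yields double precision real data, a symmetric positive-definite matrix and the operation letters “TRF”.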

Some level 2 and 3 BLAS and LAPACK routines also specify “option arguments” [38, 37, 22]. These define miscellaneous options for each subroutine and are implemented as character arguments in Fortran. There are four option arguments named “trans”, “uplo”, “side” and “diag”. “trans” is set to “N” when a matrix argument is not to be transposed by the routine, “T” when the transpose is to be used and “C” when the conjugate transpose is to be used. “trans” appears in BLAS 3 routines as “transA” and “transB” when a routine performs an operation on two matrices. For routines that operate on only the upper or lower half of a matrix, “uplo” can be set to “U” or “L” respectively. “side” is used exclusively in BLAS 3 operations to specify whether a triangular matrix appears on the left, “L”, or right, “R” of an equation to solve. “diag” is also used for triangular matrices and is set to “U” when it is assumed that the diagonal is all ones and “N” when it is not.
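The effect of the option arguments can be sketched with a reference triangular solve written in plain Python. This is an illustration only, not the BLAS implementation: the function `tri_solve` is hypothetical, it handles real data (so “C” behaves like “T”), and it solves for a single right-hand-side vector, which is why there is no “side” argument.

```python
def tri_solve(uplo, trans, diag, a, b):
    """Solve op(A) x = b, where A is triangular and op is chosen by trans."""
    n = len(b)
    if uplo not in ("U", "L") or diag not in ("U", "N"):
        raise ValueError("invalid uplo or diag")
    lower = uplo == "L"
    if trans in ("T", "C"):
        # For real data the conjugate transpose equals the transpose.
        a = [list(col) for col in zip(*a)]
        lower = not lower  # transposing a lower triangle gives an upper one
    elif trans != "N":
        raise ValueError("invalid trans")
    x = list(b)
    # Forward substitution for lower triangles, backward for upper.
    rows = range(n) if lower else range(n - 1, -1, -1)
    for i in rows:
        cols = range(i) if lower else range(i + 1, n)
        s = x[i] - sum(a[i][j] * x[j] for j in cols)
        # diag == "U": the diagonal is assumed to be all ones and is never read
        x[i] = s if diag == "U" else s / a[i][i]
    return x


if __name__ == "__main__":
    a = [[2.0, 0.0], [1.0, 4.0]]  # lower triangular
    print(tri_solve("L", "N", "N", a, [2.0, 9.0]))
    print(tri_solve("L", "T", "N", a, [2.0, 9.0]))
    print(tri_solve("L", "N", "U", a, [2.0, 9.0]))
```

Only the triangle selected by “uplo” is read, so the other half of the array may hold unrelated data, which matches how the real routines treat their matrix arguments.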

All the algorithms used in LAPACK were rewritten as a sequence of operations on matrix blocks in order to use computationally intense routines from level 3 of the BLAS. Each algorithm can be rewritten to use block operations in several ways, and the block algorithm chosen is the one that is expected to give the best average performance across different architectures. Blocking each algorithm also introduces a parameter, the block size, that can be tuned for each architecture so that each block being operated on fits in the CPU cache. Writes to localised areas of memory containing the block are also fast if the CPU cache is a “write-through” cache. LAPACK also contains unblocked versions of blocked routines, which form part of the blocked algorithm. The unblocked versions of the LAPACK routines follow the same naming conventions but end with a “2” and may drop one of the last three characters in the name to remain within the six-character limit; for example, DPOTF2 is the unblocked counterpart of DPOTRF.
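The structure of a blocked algorithm and its unblocked kernel can be sketched for the Cholesky decomposition. The code below is an illustration in plain Python, not the LAPACK source: it uses a right-looking variant, which is only one of the possible block formulations and not necessarily the one LAPACK selects, and the explicit loops stand in for what would be TRSM and SYRK calls in an optimised library. The names `potf2` and `blocked_cholesky` and the block size argument `nb` are chosen here for illustration.

```python
import math


def potf2(a, lo, hi):
    """Unblocked Cholesky of the diagonal block a[lo:hi][lo:hi], in place."""
    for j in range(lo, hi):
        d = a[j][j] - sum(a[j][k] ** 2 for k in range(lo, j))
        if d <= 0.0:
            raise ValueError("matrix is not positive definite")
        a[j][j] = math.sqrt(d)
        for i in range(j + 1, hi):
            s = a[i][j] - sum(a[i][k] * a[j][k] for k in range(lo, j))
            a[i][j] = s / a[j][j]


def blocked_cholesky(a, nb):
    """Return lower-triangular L with A = L L^T, working in blocks of nb."""
    n = len(a)
    a = [list(row) for row in a]  # work on a copy of the input
    for lo in range(0, n, nb):
        hi = min(lo + nb, n)
        # 1. Factorise the diagonal block with the unblocked kernel.
        potf2(a, lo, hi)
        # 2. Triangular solve for the panel below it (role of TRSM).
        for i in range(hi, n):
            for j in range(lo, hi):
                s = a[i][j] - sum(a[i][k] * a[j][k] for k in range(lo, j))
                a[i][j] = s / a[j][j]
        # 3. Symmetric rank-nb update of the trailing matrix (role of SYRK).
        for i in range(hi, n):
            for j in range(hi, i + 1):
                a[i][j] -= sum(a[i][k] * a[j][k] for k in range(lo, hi))
    # Only the lower triangle was computed; zero the rest for the result.
    return [[a[i][j] if j <= i else 0.0 for j in range(n)] for i in range(n)]


if __name__ == "__main__":
    spd = [[4.0, 12.0, -16.0], [12.0, 37.0, -43.0], [-16.0, -43.0, 98.0]]
    for row in blocked_cholesky(spd, nb=2):
        print(row)
```

The result does not depend on `nb`; the block size only changes how the work is grouped, which is what allows it to be tuned for the cache of a particular architecture.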

LAPACK is designed to be efficient on computers with fewer than 100 vector CPUs, while on single serial CPUs it should be no worse than any existing EISPACK or LINPACK implementation. BLAS performance is critical to the efficiency of the algorithms on shared memory systems, while on distributed memory systems exploiting parallelism within each block algorithm is also possible. Using an existing shared memory LAPACK implementation as a starting point for a distributed memory version is desirable because reducing memory accesses is also an aim in distributed memory systems, where the cost of data access is far higher. Each routine in LAPACK is modular and self-contained, and some offer the possibility of exploiting more than simple loop level parallelism. Each routine would therefore have to be analysed separately to produce a parallel distributed memory LAPACK implementation.

The first reference implementation of LAPACK was written in Fortran 77 using non-standard extensions for double precision complex data types. Routines that are available in multiple precisions are automatically generated from a code template as far as possible. Reported experiments show that on a single-CPU Cray system 90% of the peak theoretical arithmetic performance was achieved and, on a multi-CPU Cray system, 70-80% of peak performance was achieved. This performance is similar to that of the matrix-vector and matrix-matrix multiply routines from the BLAS library in use, which are the limiting routines of the LAPACK operations benchmarked.

One of the drawbacks of Fortran 77 is that it does not provide dynamic memory allocation. This means that any LAPACK routine which requires temporary working space in memory has to be passed an appropriately sized workspace argument. Fortran 90 does not have this restriction and also has operations on arrays which are better suited to linear algebra. Fortran 90 and C implementations of LAPACK are also intended, again using automatic code translation as far as possible. In addition, it is planned to add more routines to the LAPACK specification and to add tuning parameters other than the block size. A distributed memory version and a version that takes advantage of more specific features of certain CPUs are also planned.
