
In this section and the following, we present popular libraries and frameworks that are commonly used to perform static matrix calculations.

2.4.1 Basic Linear Algebra Subprograms

As early as the late 1970s, scientists began to collect basic linear algebra routines in a common library in order to avoid redundant implementations on every new computer. As a result, the Basic Linear Algebra Subprograms (Blas) library was created and written in Fortran by Lawson et al. (1979). Over the years, the Blas Technical Forum was founded and established a standard for the routines (Dongarra et al., 2001). An official reference implementation of the standardized Blas can be downloaded from the Netlib website (BLAS, n.d.).

The Blas routines are arranged in the following three levels:

1. Level 1 contains the vector routines described in the initial paper by Lawson et al. (1979), for instance, dot products, vector norms, and vector additions.

2. Level 2 was first presented by Dongarra et al. (1988), and includes matrix-vector operations, e.g. the generalized matrix-vector multiplication (gemv).

3. Level 3 was introduced in Dongarra et al. (1990) and adds matrix-matrix operations, such as the general matrix multiply (gemm).

Besides fundamental addition and multiplication operations, Blas also contains solver routines for triangular and band-diagonal matrix problems.
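For illustration, the following minimal sketch calls the Level 3 routine gemm through the standard C binding (Cblas) to compute C = A · B for row-major matrices; it assumes that a Cblas-capable Blas implementation (for instance the Netlib reference Blas, ATLAS, or Intel MKL) is linked.

#include <cblas.h>

// Minimal sketch: C = A * B via the Level 3 routine dgemm (double precision).
// A is m x k, B is k x n, C is m x n, all stored row-major.
void multiply(int m, int n, int k, const double* A, const double* B, double* C)
{
    // dgemm computes C = alpha * op(A) * op(B) + beta * C.
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                m, n, k,
                1.0, A, k,   // alpha and A with its leading dimension
                     B, n,   // B with its leading dimension
                0.0, C, n);  // beta and C with its leading dimension
}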

Since the 1990s, Blas has become the de facto interface standard for linear algebra calculations. Nearly every hardware vendor provides an implementation that, in contrast to the reference implementation, is optimized for speed on the corresponding hardware. Examples are the Intel Math Kernel Library (MKL, n.d.), AMD Core Math Library (ACML, n.d.), IBM Engineering and Scientific Subroutine Library (ESSL, n.d.), HP Mathematical Software Library (MLib, n.d.), Sun Performance Library (SPL, n.d.), et cetera. Moreover, there are open source implementations, among which the most popular is the Automatically Tuned Linear Algebra Software library (ATLAS, n.d.).

2.4.2 LAPACK and ScaLAPACK

Initially, Blas Level 1 routines were utilized by Linpack (Dongarra et al., 1979), a software package that offers algorithms for solving dense and banded linear equations as well as linear least squares problems. Later, however, Linpack was superseded by Lapack (Anderson et al., 1999), which also includes solvers for eigenvalue and singular value problems. Lapack makes extensive use of Level 2 and 3 routines, and works with block-partitioned algorithms in order to efficiently exploit the memory hierarchy of modern computers.
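As a brief illustration of this added functionality, the following minimal sketch solves a dense linear system Ax = b with Lapack's dgesv driver (LU factorization with partial pivoting) through the Lapacke C interface; it assumes a Lapacke-enabled installation and row-major storage, and both A and b are overwritten by the routine.

#include <vector>
#include <lapacke.h>

// Minimal sketch: solve A * x = b, where A is n x n (row-major) and b holds
// one right-hand side. On return, b contains the solution; the return value
// follows Lapack's info convention (0 means success).
int solve_dense(int n, double* A, double* b)
{
    std::vector<lapack_int> ipiv(n);          // pivot indices of the LU factorization
    return LAPACKE_dgesv(LAPACK_ROW_MAJOR,
                         n, 1,                // n equations, one right-hand side
                         A, n,                // A and its leading dimension
                         ipiv.data(),
                         b, 1);               // b and its leading dimension
}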

Following the development of supercomputers and high performance computing clusters, Lapack was extended by Scalapack (Blackford et al., 1997). Scalapack is a distributed memory implementation of Lapack, and internally uses Lapack and Blas for local computation. Local results of the distributed computation are combined using the Basic Linear Algebra Communication Subprograms (Blacs) as the communication framework. The latter is based on the message passing interface (MPI, n.d.). Both Lapack and Scalapack are also implemented by some of the hardware-vendor-provided math libraries, for instance by Intel MKL (MKL, n.d.).

Recently, there have been further developments to adapt the functionality of Scalapack to many-core and heterogeneous architectures (Agullo et al., 2009). The Magma project (Tomov et al., 2010; Dongarra et al., 2014) aims at writing a dense linear algebra library similar to Lapack but for heterogeneous/hybrid architectures, starting with “multi-core CPU+GPU” systems. The Plasma project (Kurzak et al., 2010) refurbishes the Lapack and Scalapack algorithms in order to fully exploit thread-level parallelism. In particular, mathematical functions are decomposed into a set of tasks that operate on small matrix portions (tiles) and are arranged in a directed acyclic graph (DAG) plan. This dataflow model allows for more flexibility and improves performance on multi-core platforms due to out-of-order execution. Although operating on a coarser level, the tiled matrix multiplication operator (ATmult) presented in Chapter 5 also uses a dataflow DAG to schedule asynchronous tile multiplications.
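The sketch below conveys this tile/DAG idea in a simplified form. It is not the actual Plasma API; instead, it uses OpenMP task dependences over hypothetical NT x NT grids of tile pointers (tile edge length TS) so that the runtime derives the dataflow graph and executes independent tile multiplications out of order.

// Simplified sketch of tile-based matrix multiplication with a task DAG
// (illustrative only, not the Plasma API). Compile with OpenMP enabled.
constexpr int NT = 4;    // tiles per matrix dimension (assumed)
constexpr int TS = 256;  // tile edge length (assumed)

// Sequential kernel: C_tile += A_tile * B_tile, tiles stored row-major.
static void tile_gemm(const double* A, const double* B, double* C)
{
    for (int i = 0; i < TS; ++i)
        for (int k = 0; k < TS; ++k)
            for (int j = 0; j < TS; ++j)
                C[i * TS + j] += A[i * TS + k] * B[k * TS + j];
}

// A, B, C are NT x NT grids of tile pointers; each pointer slot serves as the
// proxy of its tile in the dependence clauses, so tasks that update the same
// C tile are serialized while independent tiles run concurrently.
void tiled_multiply(double* A[NT][NT], double* B[NT][NT], double* C[NT][NT])
{
    #pragma omp parallel
    #pragma omp single
    for (int i = 0; i < NT; ++i)
        for (int j = 0; j < NT; ++j)
            for (int k = 0; k < NT; ++k) {
                #pragma omp task depend(in: A[i][k], B[k][j]) depend(inout: C[i][j])
                tile_gemm(A[i][k], B[k][j], C[i][j]);
            }
}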

2.4.3 Sparse BLAS

The standard Blas routines, as well as the Lapack and Scalapack libraries, are designed for dense and band matrices. In fact, the initial Blas had no special implementation for sparse matrices.

Since this limitation alone would discourage the use of the library for a wide range of applications, the Blas Technical Forum began to establish routines for sparse matrices in its Blas standard specification (Dongarra et al., 2001). The suggested extensions are based on the ideas of Duff et al. (1997). A fundamental difference from the dense Blas is the abstraction of the sparse storage type in the generic interface. In particular, there is no definition of how the sparse matrix should be laid out in storage. Instead, the interface works with a sparse matrix handle that accommodates a user-defined, custom sparse matrix storage implementation. Another difference from the dense Blas is the reduced functionality set: there are no matrix operations in which both operands are sparse, e.g. a sparse matrix-sparse matrix multiplication.
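The sketch below illustrates this handle-based interface: a sparse matrix is assembled entry by entry behind an opaque handle and afterwards used in a sparse matrix-vector multiplication. The routine names follow the NIST reference implementation of the standard (header blas_sparse.h); they are stated here as an assumption, since vendor libraries may deviate from or omit this interface entirely.

#include <blas_sparse.h>   // handle-based sparse Blas interface (assumed available)

void sparse_matvec_example()
{
    const int m = 4, n = 4;
    double x[4] = {1.0, 1.0, 1.0, 1.0};
    double y[4] = {0.0, 0.0, 0.0, 0.0};

    // Create an opaque matrix handle and insert a few nonzero entries; the
    // caller never sees how the matrix is laid out internally.
    blas_sparse_matrix A = BLAS_duscr_begin(m, n);
    BLAS_duscr_insert_entry(A, 2.0, 0, 0);
    BLAS_duscr_insert_entry(A, -1.0, 1, 2);
    BLAS_duscr_insert_entry(A, 3.5, 3, 1);
    BLAS_duscr_end(A);               // the library finalizes its internal storage format

    // y = alpha * A * x + y (sparse matrix-vector multiplication on the handle)
    BLAS_dusmv(blas_no_trans, 1.0, A, x, 1, y, 1);

    BLAS_usds(A);                    // release the handle
}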

The National Institute of Standards and Technology (NIST) provides two versions of the sparse Blas: a reference implementation that is compliant with the BLAST standard (NISTa, n.d.), and the preceding original implementation (NISTb, n.d.). The latter deviates from the standard of Duff et al. (1997) in the following way: instead of a general function that takes an abstract matrix handle, NISTb (n.d.) provides distinct functions for some of the most common sparse matrix storage types. Indeed, this version seems to be more practical than the official sparse Blas (NISTa, n.d.), since it is partially implemented by many available Blas libraries. In contrast, barely any of the vendor-provided libraries implement the more recent standard (Duff et al., 1997; Dongarra et al., 2001) based on sparse matrix handles. For instance, AMD's ACML only provides some of the sparse Level 1 routines. IBM ESSL has implementations for several Level 1 and Level 2 methods, albeit with slightly deviating naming conventions. HP MLIB, SPL, and Intel MKL offer sparse implementations for Level 2 and Level 3 routines that are based on the original NIST version (NISTb, n.d.) with predefined sparse matrix storage types.

To our knowledge, the OSKI library by Vuduc et al. (2005) from the University of California, Berkeley offers the implementation that is closest to the current sparse Blas standard (NISTa, n.d.). Furthermore, the library provides automated tuning, such that the matrix the user provides is analyzed and converted into the most efficient data structure. There are only a few Blas-based libraries that provide a sparse matrix-sparse matrix multiplication routine (SpGEMM, also called SpMM) in addition to the sparse Blas specification. An example is Nvidia’s CuSparse (n.d.) for GPUs. Nevertheless, the routine is implemented by complementary libraries such as CHOLMOD/SuiteSparse (Chen et al., 2008).

To summarize, the Blas interface has proven its importance as the state-of-the-art interface for numerical linear algebra. It is implemented by various hardware vendors, and many methods, particularly operations on dense matrices, have been thoroughly tuned and optimized. Consequently, they are hard to outperform. Nevertheless, most Blas libraries are written in Fortran or C/C++, thus requiring decent programming skills from their users. Furthermore, the introduction of the sparse Blas makes library implementations even more complex due to the plurality of different sparse matrix storage formats. The intention of the sparse Blas standard is to target “a sophisticated user community but not necessarily one that is or needs to be familiar with details of sparse storage schemes” (Dongarra et al., 2001). However, we argue that this aim has not been achieved, since most libraries do not implement the abstraction of the physical sparse matrix structure. In fact, our adaptive matrix type that is described in Chapter 5 implements the decoupling between the logical matrix interface and its physical implementation.

2.4.4 HPC Algorithms

Clearly, Scalapack became the state of the art for highly efficient and scalable library routines for dense linear algebra, complemented by the developments of the Plasma and Magma projects. However, the obvious disadvantage of these libraries is that they do not include any methods for sparse matrix computations. Only lately have a few sparse methods been added to the Magma project (Yamazaki et al., 2014). In fact, the runtime of many applications, such as the sparse eigenvalue computation described in Section 2.2, heavily depends on the efficiency of the sparse matrix-vector multiplication implementation. As a matter of fact, there are numerous groups in the high performance computing community that have been tuning these algorithms for scalability on specific hardware setups.

Algorithms for sparse matrix-vector multiplication (SpGEMV, also called SpMV) have been optimized for multi-core processors (e.g., Vuduc and Moon, 2005) as well as for distributed hybrid CPU+GPU environments (e.g., Schubert et al., 2011). Moreover, SpGEMM is another key routine for which there are efficient implementations on multi-core (Patwary et al., 2015), distributed memory (Buluç and Gilbert, 2012), GPU (Dalton et al., 2015), and hybrid CPU+GPU platforms (Matam et al., 2012). Meanwhile, many algorithms from different HPC research groups have been assembled and published as libraries, e.g., PETSc (Balay et al., 2015), LAMA (Kraus et al., 2013), GHOST (Kreutzer et al., 2015), and PARALUTION (Paralution Labs, 2015).
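For reference, the baseline that these tuned implementations improve upon is the straightforward SpMV kernel over a compressed sparse row (CSR) matrix, sketched below; the optimized HPC variants differ mainly in blocking, vectorization, and communication, not in this basic loop structure.

// Textbook SpMV kernel y = A * x for a CSR matrix: row_ptr has n_rows + 1
// entries, and col_idx/values hold the nonzeros of each row contiguously.
void spmv_csr(int n_rows,
              const int* row_ptr, const int* col_idx, const double* values,
              const double* x, double* y)
{
    for (int i = 0; i < n_rows; ++i) {
        double sum = 0.0;
        for (int idx = row_ptr[i]; idx < row_ptr[i + 1]; ++idx)
            sum += values[idx] * x[col_idx[idx]];
        y[i] = sum;
    }
}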

We mention these algorithms for completeness, while emphasizing that no HPC algorithm or library call alone serves the requirements stated in the introduction. Many algorithms might perform well on their targeted platforms, but require specially designed data structures. As mentioned before, these matrix data types are usually transparent and designed either for homogeneous sparse matrices or for specific topologies such as band-diagonal matrices. Moreover, sparse matrices in HPC contexts are mostly static and not designed to be updated in an ad-hoc fashion, which is one of our main requirements. In HPC environments, data is commonly loaded statically from a file, and the algorithms run completely agnostic of any resource management in the system.

Nonetheless, these algorithms complement our work, and many can be used as multiplication kernels in our system to benefit from their performance. For instance, we use a Blas library method for dense matrix multiplications in the DBMS-integrated Lapeg. In the same way, other algorithms could be used for sparse operations, given that the input format is compatible with that of the Lapeg. Acting in the layer above, our Lapeg optimizer, which is described in Chapter 4, selects a suitable algorithm according to its expected runtime performance. Moreover, we discuss in Chapter 5 how our system supervises memory resource consumption in accordance with the system’s resource management.

Further Library Abstractions

As mentioned before, a major shortcoming for users of Blas and HPC libraries is the bloated interface of different routine calls, which depends on the underlying matrix storage type, matrix shape and orientation, element data type, et cetera. Therefore, some of the mentioned libraries abstract the matrix implementation from the matrix interface. Hence, algorithms can be written independently of the underlying physical matrix format, including dense and sparse matrix types. Examples of such libraries are PETSc, OSKI, LAMA, and MTL4 (Siek and Lumsdaine, 1998). In particular, LAMA (Förster and Kraus, 2011) uses expression templates to offer a C++-embedded, natural language interface for linear algebra operations such as matrix-vector multiplication.
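The following simplified sketch illustrates the expression-template idea (it is not LAMA's or MTL4's actual implementation): the product A * x merely builds a lightweight expression object, which is evaluated element by element when it is assigned to a vector, so no temporary result is materialized.

#include <cstddef>
#include <vector>

struct DenseMatrix {
    std::size_t rows, cols;
    std::vector<double> data;   // row-major
    double operator()(std::size_t i, std::size_t j) const { return data[i * cols + j]; }
};

struct Vector {
    std::vector<double> data;
    // Assigning an expression triggers the actual computation.
    template <typename Expr>
    Vector& operator=(const Expr& e) {
        data.resize(e.size());
        for (std::size_t i = 0; i < e.size(); ++i) data[i] = e[i];
        return *this;
    }
};

// Unevaluated "A * x" node of the expression tree.
struct MatVecExpr {
    const DenseMatrix& A;
    const Vector& x;
    std::size_t size() const { return A.rows; }
    double operator[](std::size_t i) const {
        double s = 0.0;
        for (std::size_t j = 0; j < A.cols; ++j) s += A(i, j) * x.data[j];
        return s;
    }
};

inline MatVecExpr operator*(const DenseMatrix& A, const Vector& x) { return {A, x}; }

// Usage: Vector y; y = A * x;   // written naturally, evaluated lazily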