Linear Algebra on GPUs - Parallel Numerical Libraries

2.3 Parallel Numerical Libraries

2.3.4 Linear Algebra on GPUs

Before the advent of GPGPU programming with CUDA and OpenCL, developers wrote GPU kernels in languages such as C for Graphics (Cg), the High Level Shading Language (HLSL) and the OpenGL Shading Language (GLSL), which replaced hand-coding kernels in proprietary GPU assembly instructions. Kernels would use the GPU's fragment processors more than its vertex processors, as GPUs contained more fragment processors. 3D graphics APIs such as OpenGL and DirectX were used to load data into graphics memory as textures and to trigger kernel execution by drawing polygons. The results would be rendered into the frame buffer and could be read back from there.

Jung [60] uses the BrookGPU library to hide the complexity of transferring data and launching kernels on the GPU when developing a Cholesky decomposition for an nVidia GF6800 GPU with 16 4-way SIMD fragment processors. At each iteration the Cholesky decomposition performs three steps: taking the square root of the diagonal element, normalising the column beneath it, and updating the submatrix. A kernel is implemented for each step, with extra temporary memory allocated to allow instruction streams to overlap without producing undefined results. Unlike OpenGL, BrookGPU has no support for triangular matrices, which makes their Cholesky decomposition slower than a similar LU decomposition, an uncommon result. Their algorithm also does not take into account any GPU caches, which makes the outer product form of the Cholesky decomposition perform better than the inner product form implemented, even though it exhibits less parallelism. The rate-limiting step of the Cholesky decomposition is the square root, which cannot be performed in parallel. This means that for large matrices the speed of their algorithm is bound by memory bandwidth rather than instruction throughput.
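The three steps per iteration can be illustrated with a minimal sketch of the unblocked outer product (right-looking) Cholesky decomposition. This is plain sequential Python written for illustration, not Jung's BrookGPU kernels; the function name `cholesky_outer_product` is hypothetical.

```python
import math


def cholesky_outer_product(a):
    """Return lower-triangular L with L * L^T == a.

    `a` is a symmetric positive definite matrix given as a list of rows.
    """
    n = len(a)
    # Work on a copy so the caller's matrix is left untouched.
    w = [row[:] for row in a]
    for k in range(n):
        # Step 1: square root of the diagonal element (inherently sequential).
        w[k][k] = math.sqrt(w[k][k])
        # Step 2: normalise the column below the diagonal.
        for i in range(k + 1, n):
            w[i][k] /= w[k][k]
        # Step 3: rank-1 update of the trailing submatrix (lower triangle only).
        for j in range(k + 1, n):
            for i in range(j, n):
                w[i][j] -= w[i][k] * w[j][k]
    # The strictly upper triangle still holds input values, so clear it.
    for i in range(n):
        for j in range(i + 1, n):
            w[i][j] = 0.0
    return w


if __name__ == "__main__":
    a = [[4.0, 12.0, -16.0], [12.0, 37.0, -43.0], [-16.0, -43.0, 98.0]]
    print(cholesky_outer_product(a))
    # prints [[2.0, 0.0, 0.0], [6.0, 1.0, 0.0], [-8.0, 5.0, 3.0]]
```

Steps 2 and 3 are independent across elements and are the natural candidates for parallel kernels, whereas step 1 must complete before either can begin.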

BLAS implementations are also available for FPGAs. While GPUs have a high theoretical throughput, only a fraction of peak performance is attainable in practice. FPGAs have a peak performance that is easier to attain, and they are also more power efficient than more general-purpose CPUs and GPUs. Kestur et al. [64] compare BLAS implementations on an FPGA, GPU and CPU in terms of power efficiency as well as throughput.

They start by implementing an IEEE 754 compliant double precision dot product and scalar-vector multiply-add from level 1 of the BLAS (DDOT and DAXPY respectively) and use them to build the double precision matrix-vector multiplication function DGEMV from BLAS level 2. They use a new method of reduction for FPGAs in the DDOT and a new way of storing vectors and matrices in FPGA memory to improve parallel computation in the DGEMV. To perform the sum reduction in the dot product, the authors start with a single accumulator which feeds the running total back into its input until the list of input elements is exhausted. The input is processed in batches, producing several partial sums which are then coalesced into one result. The single accumulator is improved by first adding another to create a double accumulator and then using multiple feed-forward adders to perform the coalesced sum in log2(n) steps. The feed-forward adders reduce the latency of producing the final sum, and the dual-stage adder reduces the RAM bottleneck, further speeding up the reduction.
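The reduction scheme described above can be sketched in software as two phases: several accumulators each build a partial sum, and the partial sums are then combined pairwise in a tree. This is only an analogy for the hardware design, not the authors' circuit; the function names and the `lanes` parameter are assumptions made for illustration.

```python
def partial_sums(x, y, lanes):
    """Accumulate the products x[i] * y[i] into `lanes` running totals."""
    acc = [0.0] * lanes
    for i, (xi, yi) in enumerate(zip(x, y)):
        # Products are dealt round-robin, so each lane is an independent
        # feedback accumulator.
        acc[i % lanes] += xi * yi
    return acc


def tree_reduce(values):
    """Sum a non-empty list pairwise; return (total, number_of_adder_levels)."""
    level = list(values)
    steps = 0
    while len(level) > 1:
        # Each level adds neighbouring pairs, halving the number of values.
        nxt = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])  # an odd element passes through unchanged
        level = nxt
        steps += 1
    return level[0], steps


def ddot(x, y, lanes=8):
    """Dot product built from partial sums followed by a tree reduction."""
    total, _ = tree_reduce(partial_sums(x, y, lanes))
    return total


if __name__ == "__main__":
    print(tree_reduce([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]))  # prints (36.0, 3)
    print(ddot([1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]))       # prints 70.0
```

Eight partial sums are combined in three levels, matching the log2(n) step count quoted for the feed-forward adders.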

In order to produce a DGEMV, Kestur et al. [64] perform multiple independent dot products across the rows of a matrix in parallel using a DAXPY kernel. To improve memory bandwidth when multiple sequential accesses are performed, they introduce bank-interleaved memory, similar to shared memory on a GPU. Sequential elements are stored in sequential banks, all of which can be accessed simultaneously at full bandwidth. Vectors are stored in bank-interleaved memory, and the idea is extended to two dimensions to store matrix elements.
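A minimal sketch of one-dimensional bank interleaving follows, assuming the simplest mapping in which element i lives in bank i mod B at offset i div B. The exact mapping used by the authors is not given here, so the function names and the mapping itself are illustrative assumptions.

```python
def bank_address(index, num_banks):
    """Map a linear element index to a (bank, offset within bank) pair."""
    return index % num_banks, index // num_banks


def store_interleaved(vector, num_banks):
    """Distribute a vector over `num_banks` banks, one element at a time."""
    banks = [[] for _ in range(num_banks)]
    for i, value in enumerate(vector):
        bank, _ = bank_address(i, num_banks)
        banks[bank].append(value)
    return banks


def read_burst(banks, start, count):
    """Read `count` consecutive elements and report which banks were used."""
    num_banks = len(banks)
    values, used = [], []
    for i in range(start, start + count):
        bank, offset = bank_address(i, num_banks)
        values.append(banks[bank][offset])
        used.append(bank)
    return values, used


if __name__ == "__main__":
    banks = store_interleaved(list(range(10)), 4)
    print(banks)                    # prints [[0, 4, 8], [1, 5, 9], [2, 6], [3, 7]]
    print(read_burst(banks, 3, 4))  # prints ([3, 4, 5, 6], [3, 0, 1, 2])
```

A burst of up to B consecutive elements touches each bank at most once, which is what allows all of the accesses to proceed simultaneously.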

The experiments were conducted using a PC with a 3.16GHz Intel Core 2 Duo and 4GB RAM running the Intel MKL. An nVidia 9500GT using CUBLAS 2.2 was added to the system to benchmark GPU performance, but was removed when not in use so as not to affect the measured power consumption. A BEE3 FPGA was used running at 100MHz with a maximum of 16GB memory. The FPGA was found to have much better instruction throughput than the PC and was only slightly slower than the PC. However, when measuring the number of iterations performed per joule of energy using an AC power meter, the FPGA was the most efficient, followed by the PC and then the graphics card.

With the introduction of CUDA, Barrachina et al. [19] repeated earlier work by Jung and others and extended it to a comparison of algorithmic variants of the Cholesky and LU decompositions using a G80-based GPU. Their hybrid code was developed with the sole aim of outperforming traditional CPU-only implementations.

There are three variants each of the blocked Cholesky and LU decomposition algorithms. They all involve the same operations but execute them in a different order. Each algorithm also executes in place, overwriting the input matrix with its output. The three variants of the Cholesky decomposition are shown in Table 2.2 with the rate-determining step of each highlighted in bold. Each variant requires the symmetric rank-k update and triangular matrix solve operations, implemented in single precision as the SSYRK and STRSM routines in the BLAS specification. Each variant also requires an unblocked Cholesky decomposition routine, named SPOTF2 in LAPACK, while variant three additionally requires the BLAS SGEMM operation to perform general matrix-matrix multiplication. The three variants of the blocked LU decomposition algorithm are shown in Table 2.3, but the authors do not identify the rate-determining step in each algorithm, apart from noting that the STRSM routine in CUBLAS 1.0 is not as optimised as the SGEMM routine.

Studies into the performance of the first release of the CUBLAS library found it to perform best when the memory being operated on is aligned on a 128-byte boundary, so the block sizes used in the algorithms were chosen to be multiples of 32 elements (32 single precision elements of 4 bytes each occupy 128 bytes). The variants were implemented as hybrid algorithms by performing the unblocked Cholesky decomposition and the LU column factorisation on the CPU. Recursion was used to divide the matrix into four blocks at each step, with hybrid processing used at the deepest level. Increasing or decreasing the level of recursion was found to have no effect on the performance of the algorithm.
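The operation order of variant 1 in Table 2.2 (SPOTF2, then STRSM, then SSYRK) can be sketched as follows. The helpers `potf2`, `trsm` and `syrk` are pure Python stand-ins for the corresponding LAPACK and BLAS routines, written for illustration only; they do not reproduce the real routine signatures, the CPU/GPU split or the recursion used by the authors.

```python
import math


def potf2(a, k0, k1):
    """Unblocked Cholesky of the diagonal block a[k0:k1][k0:k1], in place."""
    for k in range(k0, k1):
        a[k][k] = math.sqrt(a[k][k])
        for i in range(k + 1, k1):
            a[i][k] /= a[k][k]
        for j in range(k + 1, k1):
            for i in range(j, k1):
                a[i][j] -= a[i][k] * a[j][k]


def trsm(a, k0, k1, n):
    """Solve X * L11^T = A21 in place; A21 is the panel below the block."""
    for i in range(k1, n):
        for j in range(k0, k1):
            s = a[i][j]
            for p in range(k0, j):
                s -= a[i][p] * a[j][p]
            a[i][j] = s / a[j][j]


def syrk(a, k0, k1, n):
    """Update A22 := A22 - A21 * A21^T on the lower triangle, in place."""
    for i in range(k1, n):
        for j in range(k1, i + 1):
            s = 0.0
            for p in range(k0, k1):
                s += a[i][p] * a[j][p]
            a[i][j] -= s


def blocked_cholesky_v1(a, block):
    """Blocked Cholesky following the step order of variant 1."""
    n = len(a)
    a = [row[:] for row in a]  # copy; the blocked steps then work in place
    for k0 in range(0, n, block):
        k1 = min(k0 + block, n)
        potf2(a, k0, k1)    # 1. factorise the diagonal block
        trsm(a, k0, k1, n)  # 2. triangular solve for the panel below it
        syrk(a, k0, k1, n)  # 3. symmetric rank-k update of the remainder
    for i in range(n):
        for j in range(i + 1, n):
            a[i][j] = 0.0   # clear the unused upper triangle
    return a


if __name__ == "__main__":
    a = [[4.0, 12.0, -16.0], [12.0, 37.0, -43.0], [-16.0, -43.0, 98.0]]
    print(blocked_cholesky_v1(a, 2))
    # prints [[2.0, 0.0, 0.0], [6.0, 1.0, 0.0], [-8.0, 5.0, 3.0]]
```

The other variants reorder the same kinds of call, which changes how much of the work falls on each routine and therefore which one dominates the run time.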

The matrix decompositions were used to calculate the solution to a linear system on the GPU. To obtain a double precision solution from a single precision decomposition, an iterative refinement algorithm was used that had originally been developed for the Cell processor found in the PlayStation 3. The iterative refinement algorithm is executed on the CPU in single precision, apart from a matrix-vector multiplication which is performed in double precision, and achieves accuracy equivalent to a full double precision solution.
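The idea behind mixed precision iterative refinement can be sketched as below. This is an illustration under stated assumptions rather than the authors' code: single precision is only emulated by rounding the stored Cholesky factor to IEEE 754 single precision with the `struct` module, the triangular solves run in ordinary Python floats, and all function names are hypothetical. The residual, which contains the matrix-vector multiplication, is computed in double precision.

```python
import math
import struct


def to_single(x):
    """Round a Python float (double) to the nearest single precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def cholesky_single(a):
    """Cholesky factor whose stored entries are rounded to single precision."""
    n = len(a)
    l = [[0.0] * n for _ in range(n)]
    for j in range(n):
        s = a[j][j] - sum(l[j][p] * l[j][p] for p in range(j))
        l[j][j] = to_single(math.sqrt(s))
        for i in range(j + 1, n):
            s = a[i][j] - sum(l[i][p] * l[j][p] for p in range(j))
            l[i][j] = to_single(s / l[j][j])
    return l


def solve_with_factor(l, b):
    """Solve L * L^T * x = b by forward then backward substitution."""
    n = len(b)
    y = [0.0] * n
    for i in range(n):
        y[i] = (b[i] - sum(l[i][p] * y[p] for p in range(i))) / l[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(l[p][i] * x[p] for p in range(i + 1, n))) / l[i][i]
    return x


def refine(a, b, iterations=5):
    """Return (initial low precision solution, refined solution)."""
    n = len(b)
    l = cholesky_single(a)
    x0 = [to_single(v) for v in solve_with_factor(l, b)]
    x = x0[:]
    for _ in range(iterations):
        # Residual r = b - A x, accumulated in double precision.
        r = [b[i] - sum(a[i][j] * x[j] for j in range(n)) for i in range(n)]
        # Correction from the low precision factor, then update in double.
        d = solve_with_factor(l, r)
        x = [xi + di for xi, di in zip(x, d)]
    return x0, x


if __name__ == "__main__":
    a = [[4.0, 1.0, 2.0], [1.0, 3.0, 0.5], [2.0, 0.5, 5.0]]
    x_true = [1.0 / 3.0, -2.0 / 7.0, 0.1]
    b = [sum(a[i][j] * x_true[j] for j in range(3)) for i in range(3)]
    x0, x = refine(a, b)
    print(max(abs(u - v) for u, v in zip(x0, x_true)))  # single precision error
    print(max(abs(u - v) for u, v in zip(x, x_true)))   # error after refinement
```

The factorisation, which dominates the cost, is done once in low precision; each refinement step then needs only a residual and two triangular solves.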

Step  Variant 1  Variant 2  Variant 3
1     SPOTF2     STRSM      SSYRK
2     STRSM      SSYRK      SPOTF2
3     SSYRK      SPOTF2     SGEMM
4     -          -          STRSM

Table 2.2: The three variants of the blocked Cholesky decomposition with the rate-determining step of each algorithm in bold. Each variant requires the SSYRK and STRSM routines from the BLAS to perform symmetric rank-k update and triangular matrix solve operations. Each variant also requires an unblocked Cholesky decomposition routine named SPOTF2 in LAPACK. Variant three additionally requires a single precision matrix multiplication routine implemented as SGEMM in the BLAS.

Step  Variant 1  Variant 2  Variant 3
1     STRSM      SGEMM      STRSM
2     SGEMM      SGEMM      SGEMM
3     SGEMM      SGEMM      -
4     -          STRSM      -

Table 2.3: The three variants of the blocked LU decomposition with the rate-determining step of each algorithm in bold. Each variant requires triangular matrix solve and general matrix multiplication operations implemented as the STRSM and SGEMM routines in the BLAS.

A system with an Intel Core 2 Duo running at 1.86GHz and fitted with an nVidia 8800 Ultra graphics card was used to benchmark performance. The algorithms were implemented using Fortran 77 with CUDA and CUBLAS version 1.0. The CPU implementation used GotoBLAS with the reference LAPACK built on top. The blocked, unpadded GPU implementations were found to outperform the blocked CPU code for matrices of size 3000 × 3000 and larger using the Cholesky decomposition, and 1500 × 1500 and larger using the LU decomposition. The SGEMM routine in CUBLAS 1.0 is better optimised than the SSYRK or STRSM routines, so variant 3 of the blocked Cholesky algorithm performs best. STRSM performs particularly poorly on a large matrix, so variant 1 is slowest. Padding the matrices to a multiple of 32 elements results in a small performance improvement for the SGEMM routine, with the smallest increase in performance in variant 2 of the algorithm, which relies on the SSYRK routine. With the LU decomposition, variant 1 performs worst as it relies heavily on the STRSM routine.
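The padding arithmetic can be sketched as follows. Rounding the dimension up to a multiple of 32 follows from the text; placing an identity block in the padded region, so that a symmetric positive definite matrix stays positive definite, is an assumption made here for illustration and is not necessarily how the authors padded their matrices. Both function names are hypothetical.

```python
def padded_dimension(n, multiple=32):
    """Round n up to the next multiple of `multiple`."""
    return ((n + multiple - 1) // multiple) * multiple


def pad_matrix(a, multiple=32):
    """Embed `a` in the top-left corner of a padded square matrix.

    Assumption: the padding carries ones on the diagonal, so the padded
    matrix remains symmetric positive definite whenever `a` is.
    """
    n = len(a)
    m = padded_dimension(n, multiple)
    padded = [[0.0] * m for _ in range(m)]
    for i in range(n):
        padded[i][:n] = a[i]  # copy the original row into the first n columns
    for i in range(n, m):
        padded[i][i] = 1.0    # identity block in the padded region
    return padded


if __name__ == "__main__":
    print(padded_dimension(3000))  # prints 3008
    print(padded_dimension(1500))  # prints 1504
    print(32 * 4)                  # prints 128, the row size in bytes
```

With this layout the Cholesky factor of the padded matrix contains the factor of the original matrix in its top-left corner, so the padding can simply be discarded afterwards.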

The authors found that although GPUs have poor double precision performance, single precision can be used along with iterative refinement on the CPU and still yield an overall faster routine than using double precision throughout.
