4.2 Current State of the Art Methods
4.2.3 GPU Triangular Solve
The final operation required in the standard blocked Cholesky decomposition is a triangular solve. We present here an extension of the work in [124] that rewrites the algorithm to run on the GPU itself, whereas the original was computed on the CPU. The triangular solve handles matrix equations of the form op(A)X = αB or Xop(A) = αB. There are 16 cases in total, depending on whether A multiplies X from the left or the right, is upper or lower triangular, is transposed or not, and has a unit or non-unit diagonal. When A multiplies X from the left, A is an m × m matrix and the system is solved by forming X = α op(A)^{-1} B.
[Figure 4.5a: diagram of the n × n matrix A, divided into nb × nb blocks, and the m × n matrix B, divided into mb × nb blocks, with the current block marked x.]
(a) Blocked triangular matrix solve for the lower triangular XA^T = αB case. This form of the triangular matrix solve is used in the lower triangular Cholesky decomposition. The block marked x starts on the left of B and is held in registers. It is updated by reading blocks from the current row of B and matching row in A. Blocks of A are transposed in shared memory. Only blocks in B to the left of x that have already been calculated are used to update the current x and as a result only the lower triangle of A is read. After x has been calculated it is written back to B and a new x is defined to the right of the old x. Each row of B is calculated by one processor.
[Figure 4.5b: diagram of the m × m matrix A, divided into mb × mb blocks, and the m × n matrix B, divided into mb × nb blocks, with the current block marked x.]
(b) Blocked triangular matrix solve for the upper triangular A^T X = αB case. This form of the triangular matrix solve is used in the upper triangular Cholesky decomposition. The block marked x starts at the top of B and is held in registers. It is updated by reading blocks from the current column of B and matching column in A. Blocks of A are transposed in shared memory. Only blocks in B above x that have already been calculated are used to update the current x and as a result only the upper triangle of A is read. After x has been calculated it is written back to B and a new x is defined below the old x. Each column of B is calculated by one processor.
When A multiplies X from the right, A is an n × n matrix and the system is solved by forming X = αB op(A)^{-1}. In both cases X and B are m × n matrices. The in-place implementation in the BLAS specification overwrites B with X. The upper triangular Cholesky decomposition uses the A^T X = αB case, where A is the nb × nb upper triangular submatrix on the diagonal and B is the matrix to the right of the diagonal block with m ≤ n. The lower triangular Cholesky decomposition uses the XA^T = αB case, where A is the nb × nb lower triangular submatrix on the diagonal and B is the matrix below the diagonal block with m ≥ n. Both assume non-unit diagonal elements in A. These cases are illustrated in Figures 4.5a and 4.5b.
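The sixteen cases and the order of A in each can be enumerated with a short sketch. The names below (trsm_cases, order_of_a and the string labels) are hypothetical and chosen only for illustration; they are not part of any BLAS interface.

```python
from itertools import product

SIDES = ("left", "right")        # op(A) X = alpha B  or  X op(A) = alpha B
UPLOS = ("upper", "lower")       # which triangle of A is referenced
TRANS = ("no-trans", "trans")    # op(A) = A  or  op(A) = A^T
DIAGS = ("unit", "non-unit")     # whether the diagonal of A is taken to be 1


def trsm_cases():
    """Return every (side, uplo, trans, diag) combination."""
    return list(product(SIDES, UPLOS, TRANS, DIAGS))


def order_of_a(side, m, n):
    """A is m x m when it multiplies X from the left and n x n from the right."""
    return m if side == "left" else n


# The two cases used by the blocked Cholesky decomposition in the text.
CHOLESKY_UPPER = ("left", "upper", "trans", "non-unit")    # A^T X = alpha B
CHOLESKY_LOWER = ("right", "lower", "trans", "non-unit")   # X A^T = alpha B
```

Both Cholesky cases use a transposed A with a non-unit diagonal; they differ only in the side on which A appears and in the triangle that is read.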
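\begin{equation}
X_{i,j} =
\begin{cases}
\alpha B_{i,j} - \sum_{k=i+1}^{m} A_{i,k} X_{k,j} & \text{if $A$ is upper triangular and not transposed,} \\
\alpha B_{i,j} - \sum_{k=0}^{i} A_{i,k} X_{k,j} & \text{if $A$ is lower triangular and not transposed,} \\
\alpha B_{i,j} - \sum_{k=0}^{i} A_{k,i} X_{k,j} & \text{if $A$ is upper triangular and transposed,} \\
\alpha B_{i,j} - \sum_{k=i+1}^{m} A_{k,i} X_{k,j} & \text{if $A$ is lower triangular and transposed.}
\end{cases}
\tag{4.5}
\end{equation}
\begin{equation}
X_{i,j} =
\begin{cases}
\alpha B_{i,j} - \sum_{k=0}^{j} A_{k,j} X_{i,k} & \text{if $A$ is upper triangular and not transposed,} \\
\alpha B_{i,j} - \sum_{k=j+1}^{n} A_{k,j} X_{i,k} & \text{if $A$ is lower triangular and not transposed,} \\
\alpha B_{i,j} - \sum_{k=j+1}^{n} A_{j,k} X_{i,k} & \text{if $A$ is upper triangular and transposed,} \\
\alpha B_{i,j} - \sum_{k=0}^{j} A_{j,k} X_{i,k} & \text{if $A$ is lower triangular and transposed.}
\end{cases}
\tag{4.6}
\end{equation}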
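Each element of X is calculated using Equations 4.5 and 4.6 for op(A)X = αB and Xop(A) = αB respectively. In both equations the indices are zero-based and the upper limit of each sum is exclusive. If A has a non-unit diagonal then each X_{i,j} is also divided by the corresponding diagonal element of A (A_{i,i} in Equation 4.5 and A_{j,j} in Equation 4.6).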
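The element-wise recurrences can be written out as a sequential reference sketch. It is a plain Python model rather than the GPU kernel, and the function names and flags (trsm_left, trsm_right, upper, trans, unit_diag) are hypothetical. It assumes zero-based indices with exclusive upper limits on the sums, and for the transposed op(A)X = αB cases it sums over the triangle of A that is actually stored (k < i for upper triangular A, k > i for lower triangular A).

```python
def trsm_left(A, B, alpha=1.0, upper=False, trans=False, unit_diag=False):
    """Solve op(A) X = alpha B one element at a time; returns X as a new list."""
    m, n = len(B), len(B[0])
    X = [[alpha * B[i][j] for j in range(n)] for i in range(m)]
    # op(A) acts as an upper triangular matrix when exactly one flag is set.
    op_upper = upper != trans
    # Row i needs rows below it (back substitution) or above it (forward).
    rows = range(m - 1, -1, -1) if op_upper else range(m)
    for i in rows:
        ks = range(i + 1, m) if op_upper else range(i)
        for j in range(n):
            for k in ks:
                a = A[k][i] if trans else A[i][k]
                X[i][j] -= a * X[k][j]
            if not unit_diag:
                X[i][j] /= A[i][i]
    return X


def trsm_right(A, B, alpha=1.0, upper=False, trans=False, unit_diag=False):
    """Solve X op(A) = alpha B one element at a time; returns X as a new list."""
    m, n = len(B), len(B[0])
    X = [[alpha * B[i][j] for j in range(n)] for i in range(m)]
    op_upper = upper != trans
    # Column j needs columns to its left when op(A) is upper, to its right otherwise.
    cols = range(n) if op_upper else range(n - 1, -1, -1)
    for j in cols:
        ks = range(j) if op_upper else range(j + 1, n)
        for i in range(m):
            for k in ks:
                a = A[j][k] if trans else A[k][j]
                X[i][j] -= a * X[i][k]
            if not unit_diag:
                X[i][j] /= A[j][j]
    return X
```

The lower triangular Cholesky case corresponds to trsm_right(A, B, upper=False, trans=True) and the upper triangular case to trsm_left(A, B, upper=True, trans=True).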
The equations show dependencies between elements of X that do not allow for an efficient GPU implementation: elements of X cannot be calculated independently of one another. In the reference BLAS implementation the loops over i and j are reversed where needed to satisfy these dependencies. On a GPU, a high degree of synchronisation between threads is needed to implement a correct solution. In addition, when matrices are stored in column major layout, the op(A)X = αB cases require sums down matrix columns, which are implemented via reduction. This requires even more synchronisation and leaves some GPU threads idle in order to fetch data from global memory at maximum bandwidth. The Xop(A) = αB cases require independent sums across matrix rows, which can be carried out simultaneously by multiple threads fetching coalesced data from global memory.
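The independence of rows in the Xop(A) = αB cases can be illustrated for the XA^T = αB case with lower triangular A. The sketch below is sequential Python with hypothetical function names; it only shows that the rows of X may be solved in any order, which is the property that lets separate threads work on separate rows, while the columns within a row must still be visited in order.

```python
def solve_row_right_lower_trans(A, b_row, alpha=1.0):
    """Solve x A^T = alpha b for a single row x, A lower triangular, non-unit diagonal."""
    n = len(b_row)
    x = [alpha * v for v in b_row]
    for j in range(n):              # columns are visited left to right
        for k in range(j):          # only x[k] with k < j is read, and it is final
            x[j] -= A[j][k] * x[k]
        x[j] /= A[j][j]
    return x


def solve_rows(A, B, row_order):
    """Solve each row of B on its own, visiting the rows in the given order."""
    X = [None] * len(B)
    for i in row_order:             # rows never read one another
        X[i] = solve_row_right_lower_trans(A, B[i])
    return X
```

Calling solve_rows with the row indices in forward or reverse order produces the same X, whereas reordering the loop over j inside a row would read values that have not yet been computed.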
The design of the GPU triangular solve algorithm follows that of the SGEMM implementation by Volkov et al. [124]. An mb × nb block of X is stored in registers by each GPU multiprocessor and held there until all updates have been accumulated from blocks of A and B. X is initialised with values from B as in the reference BLAS implementation. When A multiplies X from the left, reading and writing X from registers is done after transposing via shared memory. This forms X^T = αB^T(op(A)^{-1})^T for the left cases, allowing sums to be accumulated independently by multiple threads, as when A multiplies X from the right. B is fetched into shared memory in blocks of kb × nb for the cases where A multiplies X from the left and is fetched directly from global memory otherwise. For the cases where A multiplies X from the left and is not transposed, A is fetched directly from global memory. For all other cases A is fetched into shared memory in blocks of mb × kb when not transposed and kb × mb when transposed.
Up until the point where B is being read from the same block that X will be written to, the operation performed is matrix multiplication. The SAXPY updates in the unrolled inner loop are modified to use subtraction rather than addition, as in the reference BLAS implementation. When the block of B is in the same position as X, each thread updates the elements in a column of X. This allows the dependencies between elements within a column to be satisfied while allowing each column to be processed independently. The length of the final inner loop executed by each thread is determined by another loop, resulting in a triangular loop structure that nvcc is unable to unroll automatically. This results in nvcc storing the block of X in global memory rather than registers so that array offsets can be calculated. To enable X to be stored in registers, the entire triangular loop needs to be manually unrolled.
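The two phases described above, matrix multiplication with subtraction followed by a small triangular solve on the block that coincides with the diagonal block of A, can be sketched sequentially for the XA^T = αB case with lower triangular A. This is a minimal Python model under stated assumptions: the function name and the single block size nb are hypothetical, and registers, shared memory, thread mapping and loop unrolling are not modelled.

```python
def blocked_trsm_right_lower_trans(A, B, alpha, nb):
    """Solve X A^T = alpha B for lower triangular A, one block column at a time."""
    m, n = len(B), len(A)
    X = [[alpha * v for v in row] for row in B]    # X starts as alpha * B
    for j0 in range(0, n, nb):                     # block column of X being computed
        j1 = min(j0 + nb, n)
        # Multiply phase: subtract the contribution of every finished column k < j0.
        for i in range(m):
            for j in range(j0, j1):
                for k in range(j0):
                    X[i][j] -= X[i][k] * A[j][k]
        # Solve phase: triangular solve against the diagonal block of A.
        for i in range(m):
            for j in range(j0, j1):
                for k in range(j0, j):             # triangular inner loop
                    X[i][j] -= X[i][k] * A[j][k]
                X[i][j] /= A[j][j]
    return X
```

The inner loop of the solve phase has a length that depends on j, which is the triangular loop structure the text describes as requiring manual unrolling in the GPU kernel.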
When A multiplies X from the left, a one-dimensional row of thread blocks is scheduled, and when A multiplies X from the right, a one-dimensional column of thread blocks is scheduled. The entire kernel is wrapped in a for loop to enforce data dependencies rather than relying on separate kernel launches to force synchronisation between thread blocks. The amount of work performed by the loop changes on every iteration. The block sizes used for each case in single and double precision are listed in Tables 4.4 and 4.5. Due to the increased register usage of the extra outer and unrolled inner loops, the block size needs to be halved compared to the matrix multiplication kernels to avoid registers spilling into global memory.
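The way the work changes from one pass of the outer loop to the next can be made concrete by counting block operations. The sketch assumes that pass p applies one matrix multiply update for each of the p blocks already solved and then one triangular solve; the function and key names are hypothetical.

```python
def work_per_iteration(size, block):
    """Block operations performed in each pass of the outer loop.

    `size` is the order of A and `block` is the block size along the
    direction in which the solve advances.
    """
    passes = (size + block - 1) // block      # number of blocks to solve
    return [{"multiply_updates": p, "triangular_solves": 1} for p in range(passes)]
```

For example, work_per_iteration(256, 64) describes four passes whose multiply update counts are 0, 1, 2 and 3.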
There are a couple of alternative approaches that may be taken to implement a triangular solve kernel for GPUs, one of which is given in [40]. It involves forming A^{-1} on the GPU before using triangular matrix multiplication to form B = αA^{-1}B or B = αBA^{-1}, and is implemented in the MAGMA library. The CUBLAS implementation of the triangular matrix solve for all precisions involves multiple alternating kernel launches of a matrix multiply kernel optimised for small matrices followed by a smaller triangular solve kernel. This removes the need for an outer for loop to force an ordering of updates and frees registers to allow more thread blocks to fit concurrently on each GPU multiprocessor.
Case                mb  nb  bx  by  Threads  Registers  Shared memory  Blocks per SM
upper(A)X = αB       8  64   8   8       64         32           2396              6
lower(A)X = αB       8  64   8   8       64         32           2396              6
upper(A^T)X = αB     8  64   8   8       64         32           2428              6
lower(A^T)X = αB     8  64   8   8       64         32           2428              6
Xupper(A) = αB      64   8   8   8       64         31            348              8
Xlower(A) = αB      64   8   8   8       64         32            348              8
Xupper(A^T) = αB    64   8   8   8       64         32            316              8
Xlower(A^T) = αB    64   8   8   8       64         32            316              8
Table 4.4: Block sizes for the single precision triangular solve kernels. In each case an mb × nb block of B is stored in registers and updated by a bx × by block of threads. Register usage is higher than for the corresponding matrix multiply kernel due to the extra loop required to enforce ordering of the updates to blocks of B. As a consequence nb is lower to fit the same number of blocks on each multiprocessor. This in turn lowers the bandwidth reduction to 14.22×, which is below the GPU FLOP:word ratio of 17.826, making the kernels bandwidth bound.
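The 14.22× figure in the caption of Table 4.4 can be reproduced with a short calculation. The formula used here, 2 · mb · nb / (mb + nb) FLOPs per word fetched, is an assumption that is not stated in this section; it is adopted because it matches the quoted value for the 8 × 64 and 64 × 8 blocks. The function names are hypothetical.

```python
GPU_FLOP_PER_WORD = 17.826    # FLOP:word ratio quoted in the caption of Table 4.4


def bandwidth_reduction(mb, nb):
    """FLOPs per word fetched when an mb x nb block is updated from mb + nb words."""
    return 2.0 * mb * nb / (mb + nb)


def is_bandwidth_bound(mb, nb, machine_ratio=GPU_FLOP_PER_WORD):
    """True when the kernel performs fewer FLOPs per word than the GPU can sustain."""
    return bandwidth_reduction(mb, nb) < machine_ratio
```

With mb = 8 and nb = 64 the ratio is 1024 / 72, or about 14.22, which is below 17.826, in agreement with the caption's statement that the kernels are bandwidth bound.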