Dynamic Block Sizing - Improvements on the State of the Art

4.3 Improvements on the State of the Art

4.3.3 Dynamic Block Sizing

We now propose the introduction of an additional level of blocking for heterogeneous computing environments consisting of multi-core CPUs with GPU accelerators working in parallel. These blocks are communicated between computing devices, which work on them using their own blocked or unblocked routines. The block size for the coarser level of blocking may then be chosen to balance the workload between the heterogeneous compute devices, helping to ensure that processors are not left idle when they could be doing useful computation.

The block size used for the blocked LAPACK routines on CPUs is chosen so that the working set for the call to the unblocked routine fits in the CPU cache. This is suitable for sequential algorithms executed on single-core CPUs or parallel algorithms executed on homogeneous multi-core CPUs where each core has the same amount of cache.

[Figure: n × n matrix partitioned into submatrices A, B, C and D, with dimension labels j and nb.]

Figure 4.6: Defining a column around submatrix B that extends to the top and bottom of the matrix allows B to be copied to and from GPU memory in a single transfer if the matrix is not padded. This will be faster than copying each column of B separately if the overhead in setting up each copy is large in relation to the time taken to transfer the data. In the upper triangular Cholesky decomposition, this optimisation can be used to transfer B into host memory and back into device memory as it overlaps the update of submatrix D to the right of B.

In the hybrid Cholesky decomposition, the majority of the processing occurs in the matrix multiplication executed on the GPU. This is overlapped with a smaller Cholesky decomposition of the diagonal block executed by the CPU. With a fixed block size, the number of floating point operations consumed by the diagonal block Cholesky is constant across every iteration of the algorithm. The number of floating point operations taken by the matrix multiplication, on the other hand, changes on each iteration, increasing towards the midpoint of the algorithm and decreasing towards the end. This is shown in Figure 4.8, with the floating point operations consumed by the rank-k update and triangular solve routines removed for clarity. The area between the two curves for the matrix multiplication and Cholesky decomposition represents the difference in time taken to execute the two functions on a heterogeneous computing device. When the curve for the matrix multiplication is lower, the compute device executing it has to wait while the Cholesky decomposition of the diagonal block is completed. When it is higher, the device executing the Cholesky decomposition of the diagonal block finishes first and has to wait for the matrix multiplication.
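The shape of the two curves can be reproduced with a short sketch. It assumes an upper triangular, left-looking formulation in which, at column offset j, the matrix multiplication updates the nb × (n - j - nb) block to the right of the diagonal block using the j rows above it, and it uses the usual leading-order operation counts (nb³/3 for the Cholesky decomposition of the diagonal block, 2mnk for a matrix multiplication). The function names are hypothetical and chosen for illustration only.

```python
def potrf_flops(nb):
    """Leading-order flop count for the Cholesky decomposition of an nb x nb block."""
    return nb ** 3 / 3.0


def gemm_flops(n, j, nb):
    """Leading-order flop count for the off-diagonal update at column offset j.

    Assumed shape: an (nb x j) panel multiplied by a (j x (n - j - nb)) panel.
    """
    return 2.0 * nb * j * max(n - j - nb, 0)


def fixed_block_profile(n, nb):
    """Return (j, gemm flops, potrf flops) for every iteration with a fixed block size."""
    profile = []
    for j in range(0, n, nb):
        jb = min(nb, n - j)  # the last block may be narrower than nb
        profile.append((j, gemm_flops(n, j, jb), potrf_flops(jb)))
    return profile


if __name__ == "__main__":
    for j, gemm, potrf in fixed_block_profile(1024, 128):
        print(f"j={j:5d}  gemm={gemm:14.0f}  potrf={potrf:12.0f}")
```

For n = 1024 and nb = 128, the matrix multiplication count is zero in the first and last iterations and peaks in the two middle iterations, while the Cholesky count stays constant.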

[Figure: n × n matrix partitioned into submatrices A, B, C and D, with dimension labels j and nb.]

Figure 4.7: Defining a column around submatrix B that extends to the top and bottom of the matrix allows B to be copied to and from GPU memory in a single transfer if the matrix is not padded. This will be faster than copying each column of B separately if the overhead in setting up each copy is large in relation to the time taken to transfer the data. In the lower triangular Cholesky decomposition this optimisation can be used to transfer B into host memory only as the column overlaps submatrix D which is updated by the GPU in parallel.
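The single-transfer condition in the captions of Figures 4.6 and 4.7 can be checked with simple index arithmetic. The sketch below assumes column-major (LAPACK-style) storage with leading dimension ld, where element (i, c) sits at linear offset i + c·ld; the helper names are hypothetical.

```python
def panel_offsets(n_rows, ld, j, nb, row0=0, row1=None):
    """Linear element offsets of A[row0:row1, j:j+nb] in column-major storage."""
    if row1 is None:
        row1 = n_rows
    return [i + c * ld for c in range(j, j + nb) for i in range(row0, row1)]


def is_single_run(offsets):
    """True if the offsets form one contiguous run, i.e. one copy suffices."""
    return all(b - a == 1 for a, b in zip(offsets, offsets[1:]))


if __name__ == "__main__":
    n, j, nb = 8, 2, 3
    # Full-height column panel, unpadded matrix (ld == n): one contiguous run.
    print(is_single_run(panel_offsets(n, n, j, nb)))
    # Same panel in a padded matrix (ld > n): gaps between columns.
    print(is_single_run(panel_offsets(n, 10, j, nb)))
    # Only the rows of B itself (rows 2..4): gaps between columns.
    print(is_single_run(panel_offsets(n, n, j, nb, row0=2, row1=5)))
```

Only the first case is contiguous, which is why the column is extended to the top and bottom of the matrix and why the optimisation requires an unpadded matrix.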

Changing the block size on each iteration of the hybrid Cholesky decomposition would make better use of the available computing power from the CPU and GPU, as neither would be waiting for the other to finish executing a function before proceeding. Ideally, the block size would be changed on each iteration to minimise the area between the curves for matrix multiplication and Cholesky decomposition. The block size can be increased towards the midpoint then decreased towards the end, which would cause the number of operations consumed by the Cholesky decomposition to increase then decrease in a similar manner to the matrix multiplication. Alternatively, the block size can decrease towards the midpoint and increase towards the end to bring the curve for the matrix multiplication down towards that of the Cholesky decomposition. These two approaches are shown in Figures 4.9 and 4.10, where the area between the curves is noticeably less than in Figure 4.8. The ideal block size for each iteration can be calculated analytically for homogeneous computing devices, as the processing power of each core executing the different functions is the same. In a heterogeneous computing environment this cannot be done, as the number of floating point operations consumed by each function needs to be normalised by the performance of the computing device executing it, and there is a difference between theoretical and actual performance which also varies across architectures.

A tuning run was performed to measure the difference between the execution times of the matrix multiplication on the GPU and the Cholesky decomposition on the CPU over a range of block sizes for a fixed matrix size. At each iteration, the block size with the minimum time difference was selected. It was found that decreasing the block size then increasing it resulted in better performance; however, this tuning run would be costly to implement at runtime. We therefore choose a simpler scheme of starting the block size at N/2 and halving it at each iteration towards the centre, then doubling it until the end of the algorithm.
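The selection rule used in the tuning run (pick, at each iteration, the block size with the smallest time difference) can be sketched as follows. This is an illustration only: it replaces the measured execution times with modelled ones, obtained by dividing leading-order operation counts by assumed sustained rates for each device. The rates, the candidate block sizes and the function names are all hypothetical, and a model of this kind need not reproduce the block size pattern observed in the measured tuning run.

```python
def pick_block_size(n, j, candidates, gpu_rate, cpu_rate):
    """Return the candidate block size with the smallest modelled time gap.

    gpu_rate and cpu_rate are assumed sustained rates in flops per second.
    Returns None if no candidate fits in the remaining n - j columns.
    """
    best_nb, best_gap = None, float("inf")
    for nb in candidates:
        if nb > n - j:
            continue  # block would run past the edge of the matrix
        t_gemm = 2.0 * nb * j * (n - j - nb) / gpu_rate  # modelled GPU time
        t_potrf = nb ** 3 / 3.0 / cpu_rate               # modelled CPU time
        gap = abs(t_gemm - t_potrf)
        if gap < best_gap:
            best_nb, best_gap = nb, gap
    return best_nb


def tuned_schedule(n, candidates, gpu_rate, cpu_rate):
    """Return a list of (column offset, block size) pairs covering all n columns."""
    schedule, j = [], 0
    while j < n:
        nb = pick_block_size(n, j, candidates, gpu_rate, cpu_rate)
        if nb is None:
            nb = n - j  # remainder is smaller than every candidate
        schedule.append((j, nb))
        j += nb
    return schedule


if __name__ == "__main__":
    # Hypothetical rates: 1 Tflop/s for the GPU, 100 Gflop/s for the CPU.
    for j, nb in tuned_schedule(4096, [32, 64, 128, 256, 512], 1e12, 1e11):
        print(j, nb)
```

Replacing the two modelled times with timings measured on the target hardware would turn this into the kind of tuning run described above, at the cost of executing every candidate block size at every iteration.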
