Streaming Kernels - Fluid Interface Movement

5.5 Fluid Interface Movement

6.1.2 Streaming Kernels

Five kernels are responsible for moving distribution functions during the streaming step: Three kernels perform transfers along the x, y and z axes and two perform transfers along the diagonals of the xy and xz planes. Transfers along the diagonals of the yz plane are performed by streaming in the y and z directions separately and consecutively because performing these transfers in a single step would require inter-block synchronization. Refer to Section5.3.2for more information on the design of these kernels.

Before presenting the results of the analysis we will outline our approach.

Analysis Approach

The focus of the streaming kernels is to transfer distribution functions as quickly and efficiently as possible. Therefore, investigating computation efficiency is unnecessary. In this section we will outline how we measure the kernel performance and how we initialize the simulation lattice.

Since the streaming kernels are focused on data transfer, memory throughput stands out as a measure of kernel performance. However, direct measurements of memory throughput via the CUDA profiler are not an exact representation of algorithm performance, but are only useful for determining how close the code is to the hardware limit [15]. Instead, the CUDA Best Prac- tices Guide [15] suggests calculating effective bandwidth as this provides a figure more relevant for measuring performance. It can be calculated as follows:

Effective Bandwidth= Bytes Read+Bytes Written

Execution Time (6.1)

Since bytes read and bytes written are identical for all streaming kernels (they all perform a single read/write pair per cell per distribution function being streamed), execution time is the only value that will affect our measure of effective bandwidth. Therefore execution time alone is sufficient to represent the performance on the streaming kernels, while memory throughput is used to show that performance is near the practical hardware limits.

To independently verify that we have achieved the best possible performance, we use an optimal copy-kernel (Pseudocode12) as a benchmark when testing these kernels. The copy-kernel has the same data structure as the streaming kernels, but does not alter array positions during data transfer. This ensures the copy-kernel has less computation and maximizes opportunities for coalescing data reads and writes. The copy kernel is also designed to be divided into thread blocks along any axis to allow for better latency hiding. At optimal launch configurations the

Pseudocode 12 The copy-kernel used to benchmark the streaming kernels. 1: functionCOPY-KERNEL(A)

2: r =l_gridDim.zdepth m 3: zs=blockIdx.z×r

4: i=tx+ty×pitch+zs×pitch×height

5: k =pitch×height 6: ze=zs+r 7: for zs→zedo 8: t= A[i] 9: syncthreads() 10: A[i] =t+1 11: i+ =k 12: end for 13: end function

inclusion of synchthreads() can cause a negligible performance decrease, but its inclusion has a noticeable performance benefit when the launch configurations are restricted to those that are applicable to the streaming kernels. The addition in line 10 is required to prevent compiler optimizations discarding the read/write pair. Due to its optimizations and simplicity, we can therefore consider the performance of the copy-kernel to be the best possible performance that can be achieved by the streaming kernels.

For simplicity, we use only two performance results from the copy-kernel: a benchmark value that is the best achievable performance for all kernel launch configurations for all tests, and a second adjusted benchmark which represents the best achievable copy-kernel performance with kernel launch configurations that are applicable to the streaming kernel(s) being discussed (the exact restrictions applicable to each kernel are discussed in Section5.3.2.) For kernels that stream in multiple directions (which require multiple read/write pairs per lattice cell), the benchmarks are multiplied by the number of directions where appropriate.

Unless otherwise specified an array size of 256×228×228 was used in these tests. This is the maximum array size for a single distribution function array, fi, such that all 19 distribution

functions can fit in GPU memory assuming the restrictions discussed in Section 6.1.1 and the hardware from Table6.1. The distribution function array was filled with arbitrary floating point values between zero and one; the streaming kernels only transfer data and do not perform cal- culations based on data values, thus the use of arbitrary data does not affect the validity of the results.

All results were calculated as an average over 100 consecutive kernel invocations. Numerous kernel invocations are grouped together to eliminate minor variances in execution time due to the inaccuracy of timers at the microsecond level. These iterations are all timed after a few “warm- up” iterations so that the time taken for the device to transition from idle to active does not affect results.

BlockDim GridDim T M TC MC

copy-kernel (64,3,1) (4,76,57) 98.28 108.44 - - x-axis streaming (256,3,1) (1,76,1) 201.77 105.55 1.42% 98.51% y-axis streaming (128,3,1) (2,76,1) 614.28 103.99 2.93% 97.05% z-axis streaming (256,3,1) (1,76,1) 607.38 105.17 1.77% 98.15% xy-axis diagonal streaming (256,3,1) (1,76,1) 416.48 102.31 4.68% 95.48% xz-axis diagonal streaming (256,3,1) (1,76,1) 414.69 102.75 4.23% 95.90%

Table 6.2: The optimal launch configurations and performance for all the streaming kernels, showing the execution time (T) and memory throughput (M). We also show the performance of the streaming kernels relative to the performance of the copy kernel when applying each kernel’s launch configuration restrictions to the copy kernel. TC is the percentage that each kernel is

slower than the copy kernel and MCis the percentage of the copy kernel throughput achieved.

Analysis Discussion

We begin the analysis by measuring performance of the copy-kernel, a simple kernel designed to provide us with streaming performance benchmarks. We then discuss the performance streaming kernels that transfer along the core x, y and z axes, using the benchmarks derived from the analysis of the copy-kernel. The performance of the kernels that stream along the diagonals is discussed, drawing on similarities in performance with the non-diagonal kernels. Lastly, we compare the performance of the GPU to a single CPU core when streaming in all directions.

The copy kernel launch configuration has three variable parameters: thread block width, thread block height, and the number of cells a single thread will process (i.e. the number of thread blocks to allocate along the z-axis). To limit the large search space, we show the top 4 launch configurations (see Figures6.1aand6.1b). Almost identical performance is shown by configurations b–d because they have the same number of active threads (256). The optimal benchmark performance for the copy-kernel is measured at 98.3 ms for execution time and 108.4 GB/s for memory throughput. This equates to 84.7% of the theoretical peak performance of our GTX 560Ti (128 GB/s [15]). Bauer [6] reports the maximum practical memory throughput achievable is 85% of a device’s theoretical peak performance, which allows us to confirm that our copy-kernel is a realistic benchmark for optimal performance. The memory-write throughput and memory-read throughput were consistently approximately half the total memory throughput — write throughput was consistently 0.4–0.45 GB/s faster than read throughput. Since the results from these individual values are identical to the results of the total memory throughput, we do not report the individual read/write results.

Figures6.2aand6.2bcompare the best possible performance achieved by each of the streaming kernels (their exact values are listed in Table6.2). We see that all kernels make efficient use of available memory bandwidth, almost matching their adjusted benchmarks. There is less than a millisecond between the optimal x- and z-axis streaming, with y-axis streaming being a further

90 100 110 120 130 140 150 160 170 1 19 38 57 76 114 228 Execution T ime (ms)

number of cells processed by each thread (64,3,1) – a (256,1,1) – b (128,2,1) – c (64,4,1) – d

(a)Execution time for selected launch configurations

65 70 75 80 85 90 95 100 105 110 1 19 38 57 76 114 228 Memory Thr oughput (GB / s)

number of cells processed by each thread (64,3,1) – a (256,1,1) – b (128,2,1) – c (64,4,1) – d

(b)Memory throughput for selected launch configurations

Figure 6.1: The performance of the Copy-Kernel as the number of cells each thread must process (n) is varied for the thread blocks of the top four launch configurations. Only results for the optimal values of n are shown — when height_n ∈N so that there are no idle threads.

millisecond slower. This is because the orientation of x- and z-axis streaming results in multiple thread blocks accessing contiguous blocks of memory simultaneously (xy-plane slices), whereas different thread blocks are accessing different xy-plane slices simultaneously when streaming along the y-axis. The performance of the diagonal streaming is noticeably worse. Since the di-

98 99 100 101 102 103 104 105

x-axis y-axis z-axis xy-diagonal xz-diagonal

Execution T ime (ms) measured result adjusted benchmark benchmark

(a)Streaming execution time compared to the benchmarks

102 103 104 105 106 107 108 109 110

x-axis y-axis z-axis xy-diagonal xz-diagonal

Memory Thr oughput (GB / s) measured result adjusted benchmark benchmark

(b)Streaming memory throughput compared to the benchmarks

Figure 6.2: The performance of the streaming kernels with optimal launch configurations. The benchmark refers to the optimal copy-kernel performance and the adjusted benchmark is the optimal copy- kernel performance given the same launch configuration restrictions as each streaming kernel. Each kernel’s execution time has been divided by the number of directions in which distribution functions are streamed so that the performance between different streaming kernels can be compared.

agonal streaming kernels stream along two axes, they are the most restricted in terms of launch configurations and, as a result, are the slowest of all the streaming kernels. The xz-diagonal streams are slightly faster that the xy-diagonal streams, because the xz-diagonal streaming kernel’s access pattern is better suited for coalescing reads on xy-slices. Despite the minor differences

0 10 20 30 40 50 60 643963 1283 1603 1923 2243 execution time (s) 0 0.5 1 1.5 2 2.5 643963 1283 1603 1923 2243 0 5 10 15 20 25 30 execution time (s) speedup

number of lattice cells in scene

CPU GPU

GPU Time GPU speedup

Figure 6.3: These figures show the total execution time and GPU speedup achieved for streaming along all directions for increasing lattice size. Small jumps in the GPU execution time (or spikes in GPU speedup) are seen in the transition from scene sizes that are suited to the GPU and those that are not. For example the big jump in execution time occurs when moving from a lattice size of 1923to 1933as 192 is divisible by warp size and is therefore better suited for GPU memory operations.

in performance all kernels still perform close to the practical hardware limit.

To test the scalability of the streaming kernels, we combined all the kernels to stream as they would as part of a normal LBM simulation. In our full LBM implementation, the distribution functions are not transferred off the graphics card. Therefore, the total GPU time in these scaling tests includes kernel execution time and only initial data transfer (i.e. the distribution functions are not transferred back to the CPU after each iteration on the GPU). This provides us with a real indication of the performance of the streaming kernels as part of our GPU LBM simulation. The combined streaming kernels were compared against a single-core CPU implementation of pure LBM streaming for increasing lattice sizes (Figure6.3). In this test we used 3D array sizes with equal dimensions (i.e. 13_{, 2}3_{, 3}3_,_{· · ·} _{, 224}3_{). The maximum array size was limited to 224}3_as

this was the largest lattice size that could fit into the graphics card memory and had dimensions that are multiples of warp size. For each array size, the optimal launch configuration for each

streaming kernel was selected automatically and used to determine optimal performance. While the CPU execution time scales linearly with respect to problem size, the GPU version has clear sudden increases in execution time. These steps occur exactly at the boundaries between scene sizes that are suited to GPU memory operations and those that are not. For example, with an array width of 128, four warps are required and all threads within each warp are in use, but with an array width of 129, five warps are required and only 81% of all threads will be used — one warp will have 31 idle threads. These idle threads still require processor time and therefore slow the execution of other warps. As the array width increases, there are fewer idle threads and performance improves until the next warp boundary is reached. This behaviour also results in the spikes seen in the speedup graph. For large array sizes, we see the combined streaming kernels are able to achieve a 25–30×speed up over the CPU version, peaking near 30×when the lattice size is optimal for GPU memory operations (i.e. x-dimension is a multiple of warp size).

In document Lattice Boltzmann Liquid Simulations on Graphics Hardware (Page 75-81)