Image filtering

6.5 Implementation and performance evaluation of GPU-SM applications

6.5.2 Image filtering

The second evaluated application is Image Filtering, as described in Section 6.2.3.

Distributed memory implementation

The traditional implementation of convolution for multiple GPUs replicates the halos of the partitions of the input image in each GPU. The convolution matrix is replicated in all GPU memories. The output image does not require replication because computation partitions write to non-overlapping regions.

GPU-SM implementation

In GPU-SM, thanks to shared memory, halo data for the input image partitions can be accessed remotely. The convolution matrix, in contrast, cannot be decomposed because it is fully accessed by every thread in each computation partition. We study two possibilities: (1) replicating it as in the distributed implementation, and (2) storing it in a single GPU or in host memory and relying on caching, as it is small enough to fit in the L1 R/O cache.
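To make the contrast between the two implementations concrete, the following host-side C++ sketch splits the rows of an image across GPUs for a K × K convolution. The names (`Partition`, `partition_rows`, `replicated_rows`, `is_remote_row`) and the choice of a row-wise (Y) decomposition are assumptions made for this illustration; they are not taken from the evaluated implementations.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Rows [begin, end) are computed by one GPU; rows [halo_begin, halo_end)
// are the input rows that this computation partition reads.
struct Partition {
    int begin, end;
    int halo_begin, halo_end;
};

// Split `rows` image rows across `gpus` devices for a k x k convolution
// (k odd), so every output row needs k/2 extra input rows on each side.
std::vector<Partition> partition_rows(int rows, int gpus, int k)
{
    assert(rows > 0 && gpus > 0 && k > 0 && k % 2 == 1);
    const int halo = k / 2;
    const int chunk = (rows + gpus - 1) / gpus;  // ceiling division
    std::vector<Partition> parts;
    for (int g = 0; g < gpus; ++g) {
        Partition p;
        p.begin = std::min(g * chunk, rows);
        p.end = std::min(p.begin + chunk, rows);
        // Halo rows are clamped to the image; border handling is left out.
        p.halo_begin = std::max(p.begin - halo, 0);
        p.halo_end = std::min(p.end + halo, rows);
        parts.push_back(p);
    }
    return parts;
}

// Distributed memory: number of halo rows a GPU must hold as copies.
int replicated_rows(const Partition& p)
{
    return (p.begin - p.halo_begin) + (p.halo_end - p.end);
}

// Shared memory: a row outside [begin, end) is read from the GPU that owns it.
bool is_remote_row(const Partition& p, int row)
{
    return row < p.begin || row >= p.end;
}
```

Under these assumptions, a 4096-row image on 4 GPUs with a 5 × 5 matrix gives each interior partition 1024 rows plus 2 halo rows on each side. The distributed version copies those 4 rows into the local memory, whereas the shared memory version leaves them in the neighbouring GPUs and reads them remotely.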


[Figure 6.14 plots: (a) 3 × 3 convolution and (b) 5 × 5 convolution. Horizontal axis: image size (X x Y) of 128 x 128, 4096 x 4096 and 24576 x 24576, each with the convolution matrix in remote GPU memory, in remote host memory, or replicated. Vertical axis: speedup (x), from 0.0 to 4.0. Bars: X, Y and XY decompositions.]

Figure 6.14: Image filtering: speedups of the multi-GPU GPU-SM implementation for 4 GPUs compared to the original implementation on a single GPU. Results are provided for different convolution matrix sizes and locations, and image sizes.

Performance analysis

Figure 6.14 shows the speedup achieved by running the GPU-SM implementation of the convolution kernel on 4 GPUs. As in the FDTD benchmark, the smallest input data set is too small to benefit from multi-GPU execution. The bars show the results for the configurations that use the L1 R/O cache. Not using this cache introduced slowdowns of up to 1000× when the convolution matrix is accessed remotely, because each thread reads every element in the matrix (9 or 25 in our benchmarks) and each read triggers a remote memory access. Further, using the R/O cache is also 35% faster than using the regular cache hierarchy when the convolution matrix is replicated. Results for the medium-size image indicate a performance loss of around 10% when the convolution matrix is not replicated and is accessed through caching instead, for both the 3 × 3 and 5 × 5 convolution matrix sizes (storing it in host memory instead of in a GPU memory improves the results by 2%). This is because all the thread blocks in all SMs (> 100) are stalled at the beginning of the kernel until each SM brings the matrix into its L1 R/O cache and, therefore, this cost cannot be hidden. Having an L2 shared R/O cache would help minimize the performance overhead, as only the first SM would need to bring the matrix and the rest of the SMs could access cached data. For larger matrix sizes these initial costs are negligible and we then achieve linear speedups.
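The gap between the cached and uncached configurations can be illustrated with a back-of-the-envelope count of remote reads of the convolution matrix. The C++ sketch below is only a model under stated assumptions: one thread per output pixel, no cache evictions, and reads counted in matrix elements rather than in cache lines or PCIe transactions. The function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Matrix stored in a remote memory and NOT cached: every thread running on
// a GPU that does not host the matrix fetches all k*k elements remotely.
std::uint64_t uncached_remote_reads(std::uint64_t remote_threads, std::uint64_t k)
{
    assert(k > 0);
    return remote_threads * k * k;
}

// Matrix cached in the L1 R/O cache: each SM brings the k*k elements once
// and all later reads from that SM hit in its cache.
std::uint64_t cached_remote_reads(std::uint64_t remote_sms, std::uint64_t k)
{
    assert(k > 0);
    return remote_sms * k * k;
}
```

For example, if a 4096 x 4096 image is split evenly across 4 GPUs and the matrix lives in one of them, 12,582,912 threads run on the other three GPUs; with a 3 × 3 matrix this model gives 113,246,208 uncached remote reads. With a hypothetical 75 SMs on those three GPUs, the cached count drops to 675. The model says nothing about latency, so it does not by itself predict the measured slowdowns.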

CHAPTER 6. SHARED MEMORY GPU PROGRAMMING

template <typename T, unsigned Halo>
__global__ void
stencil(T* out, const T* in,
        unsigned cols, unsigned rows, unsigned planes)
{
    // Skip the halo region: the first interior element is at index Halo.
    int k = threadIdx.x + blockIdx.x * blockDim.x + Halo;
    int j = threadIdx.y + blockIdx.y * blockDim.y + Halo;

    if ((k < Halo + cols) && (j < Halo + rows)) {
        for (int p = Halo; p < planes + Halo; ++p) {
            T c = in[IDX_3D(k, j, p)];
            for (int s = 1; s <= Halo; ++s) {
                T left   = in[IDX_3D(k-s, j, p)];
                T right  = in[IDX_3D(k+s, j, p)];
                T top    = in[IDX_3D(k, j-s, p)];
                T bottom = in[IDX_3D(k, j+s, p)];
                T back   = in[IDX_3D(k, j, p-s)];
                T front  = in[IDX_3D(k, j, p+s)];

                c += 3.f * (left + right) +
                     2.f * (top + bottom) +
                     1.f * (back + front);
            }
            out[IDX_3D(k, j, p)] = c;
        }
    }
}

Listing 6.1: FDTD: distributed implementation (simplified version of the kernel).

6.6 Summary

In this chapter we have shown that the remote memory access mechanism enables easier multi-GPU programming. While PCIe offers limited bandwidth compared to GPU memories, the GPU's ability to hide long-latency accesses allows it to tolerate a moderate amount of remote accesses. More precisely, GPUs are able to hide the cost of 10% of remote memory accesses if the kernels produce high GPU core occupancy and the accesses are spread throughout the whole kernel execution. We have also shown that shared memory implementations of FDTD and image filtering computations deliver performance comparable to their highly-optimized distributed counterparts, while being simpler and much more concise. Future interconnects promise higher bandwidths that will open the GPU-SM model to a broader range of applications.

This work has also identified hardware improvements that may have the potential to hide the costs of remote accesses.

• Increasing the number of in-flight warps per GPU core would allow GPUs to tolerate a higher number of remote accesses.


struct work_descriptor {
    float *in, *out;
    cudaStream_t stream[NUM_STREAMS];
    cudaEvent_t events_A[NUM_EVENTS];
    cudaEvent_t events_B[NUM_EVENTS];
    cudaEvent_t *events_prev = events_A;
    cudaEvent_t *events_cur = events_B;
    bool has_left_neigh;
    bool has_right_neigh;
    unsigned planes;
};

void do_rtm(work_descriptor wd[NUM_GPUS])
{
    for (int t = 0; t < TIME_STEPS; ++t) {
        for (int gpu = 0; gpu < NUM_GPUS; ++gpu) {
            // 1a. Compute right boundary
            if (wd[gpu].has_left_neigh)
                cudaStreamWaitEvent(wd[gpu].stream[EXEC_R], wd[gpu-1].events_prev[COMM_R]);
            cudaStreamWaitEvent(wd[gpu].stream[EXEC_R], wd[gpu].events_prev[EXEC_M]);
            launch_stencil(wd[gpu].in, wd[gpu].out,
                           halo + wd[gpu].planes - 2 * halo, halo * 3, // offset, size
                           wd[gpu].stream[EXEC_R]);
            cudaEventRecord(wd[gpu].events_cur[EXEC_R], wd[gpu].stream[EXEC_R]);

            // 1b. Compute left boundary
            if (wd[gpu].has_right_neigh)
                cudaStreamWaitEvent(wd[gpu].stream[EXEC_L], wd[gpu+1].events_prev[COMM_L]);
            cudaStreamWaitEvent(wd[gpu].stream[EXEC_L], wd[gpu].events_prev[EXEC_M]);
            launch_stencil(wd[gpu].in, wd[gpu].out,
                           0, halo * 3, // offset, size
                           wd[gpu].stream[EXEC_L]);
            cudaEventRecord(wd[gpu].events_cur[EXEC_L], wd[gpu].stream[EXEC_L]);

            // 2. Compute center
            cudaStreamWaitEvent(wd[gpu].stream[EXEC_M], wd[gpu].events_prev[EXEC_L]);
            cudaStreamWaitEvent(wd[gpu].stream[EXEC_M], wd[gpu].events_prev[EXEC_R]);
            launch_stencil(wd[gpu].in, wd[gpu].out,
                           halo, wd[gpu].planes, // offset, size
                           wd[gpu].stream[EXEC_M]);
            cudaEventRecord(wd[gpu].events_cur[EXEC_M], wd[gpu].stream[EXEC_M]);

            // 3a. Exchange right boundary
            if (wd[gpu].has_right_neigh) {
                cudaStreamWaitEvent(wd[gpu].stream[COMM_R], wd[gpu].events_cur[EXEC_R]);
                copyAsync(wd[gpu+1].out, wd[gpu].out,
                          halo + wd[gpu].planes - halo, halo, // offset, size
                          wd[gpu].stream[COMM_R]);
                cudaEventRecord(wd[gpu].events_cur[COMM_R], wd[gpu].stream[COMM_R]);
            }
            // 3b. Exchange left boundary
            if (wd[gpu].has_left_neigh) {
                cudaStreamWaitEvent(wd[gpu].stream[COMM_L], wd[gpu].events_cur[EXEC_L]);
                copyAsync(wd[gpu-1].out, wd[gpu].out,
                          halo, halo, // offset, size
                          wd[gpu].stream[COMM_L]);
                cudaEventRecord(wd[gpu].events_cur[COMM_L], wd[gpu].stream[COMM_L]);
            }
        }
        for (int gpu = 0; gpu < NUM_GPUS; ++gpu) {
            swap(wd[gpu].in, wd[gpu].out);
            swap(wd[gpu].events_prev, wd[gpu].events_cur);
        }
    }
}


template <typename T, unsigned Halo>
__global__ void
stencil(T* out, const T* in, const T* in_left, const T* in_right,
        unsigned cols, unsigned rows, unsigned planes)
{
    int k = threadIdx.x + blockIdx.x * blockDim.x + Halo;
    int j = threadIdx.y + blockIdx.y * blockDim.y + Halo;

    if (k < cols && j < rows) {
        for (int p = Halo; p < planes + Halo; ++p) {
            T c = in[IDX_3D(k, j, p)];
            for (int s = 1; s <= Halo; ++s) {
                T left = (in_left && k-s < 0) ? in_left[IDX_3D(cols + (k-s), j, p)]
                                              : in[IDX_3D(k-s, j, p)];

                T right = (in_right && k+s >= cols) ? in_right[IDX_3D((k+s) - cols, j, p)]
                                                    : in[IDX_3D(k+s, j, p)];

                T top    = in[IDX_3D(k, j-s, p)];
                T bottom = in[IDX_3D(k, j+s, p)];
                T back   = in[IDX_3D(k, j, p-s)];
                T front  = in[IDX_3D(k, j, p+s)];

                c += 3.f * (left + right) +
                     2.f * (top + bottom) +
                     1.f * (back + front);
            }
            out[IDX_3D(k, j, p)] = c;
        }
    }
}


struct work_descriptor {
    float *in, *out;
    cudaStream_t stream;
    cudaEvent_t event_prev;
    bool has_left_neigh;
    bool has_right_neigh;
    unsigned planes;
};

void do_rtm(work_descriptor wd[NUM_GPUS])
{
    for (int t = 0; t < TIME_STEPS; ++t) {
        for (int gpu = 0; gpu < NUM_GPUS; ++gpu) {
            float *in_left = nullptr;
            float *in_right = nullptr;
            if (wd[gpu].has_left_neigh) {
                in_left = wd[gpu-1].in;
                cudaStreamWaitEvent(wd[gpu].stream, wd[gpu-1].event_prev);
            }
            if (wd[gpu].has_right_neigh) {
                in_right = wd[gpu+1].in;
                cudaStreamWaitEvent(wd[gpu].stream, wd[gpu+1].event_prev);
            }
            cudaStreamWaitEvent(wd[gpu].stream, wd[gpu].event_prev);
            launch_stencil(wd[gpu].in, in_left, in_right, wd[gpu].out,
                           wd[gpu].has_left_neigh ? 0 : halo, // offset
                           wd[gpu].planes + (wd[gpu].has_right_neigh ? 0 : halo), // size
                           wd[gpu].stream);
        }

        for (int gpu = 0; gpu < NUM_GPUS; gpu++) {
            cudaEventRecord(wd[gpu].event_prev, wd[gpu].stream);
            swap(wd[gpu].in, wd[gpu].out);
        }
    }
}

Listing 6.4: FDTD: GPU-SM implementation (host).


policy, would allow them to distribute remote accesses along the whole kernel execution, without the need to change the code of the application (i.e., data and computation distribution).

• Adding an L2 read-only cache would allow a GPU core to reuse data structures (especially small ones) that have been cached by a different GPU core.

Chapter 7

Automatic Multi-GPU Execution

In the previous chapter we showed that remote access is a viable mechanism to implement shared memory programming on GPUs. However, the GPU architecture and execution model can hide the costs of only a limited amount of remote memory accesses. Therefore, programmers are still responsible for carefully distributing data structures so that performance is not affected. They also need to decide when to use replication.

In this chapter we present AMGE: an auto-parallelization programming framework for multi-GPU systems that relieves programmers from these duties.

7.1 AMGE overview

AMGE (Automatic Multi-GPU Execution) is a programming framework that decomposes and distributes GPU kernels and data to be collaboratively executed on all the GPUs in the system. We implement AMGE using C++ and CUDA, but it can be extended to other languages. Figure 7.1 shows the components in AMGE and how they interact with the hardware. AMGE aggregates the GPU resources in the system and presents them as a single virtual GPU. Thus, programmers are relieved from the burden of decomposing the problem and explicitly managing several GPUs.

Figure 7.1: Overview of AMGE components. The compiler extracts array access pattern information and stores it in the program binary. The runtime system uses this information to decompose and distribute computation and data across the GPUs in the system. In this example, the system is composed of a single CPU and 4 GPUs, connected through a PCIe interconnect.

The AMGE compiler is a source-to-source compiler that analyzes the CUDA kernels in the program to detect their array access patterns and stores this information in the program executable. It also generates optimized kernel versions for the possible array decompositions. We argue that using array dimensionality information is paramount in order to efficiently exploit multi-GPU systems. However, CUDA is an extension of the C/C++ languages, which do not provide data types with such information; programmers typically flatten multi-dimensional arrays into 1D arrays and linearize the dimension indices in each array reference. It is difficult in practice, if not infeasible, for static analysis to reliably recover the dimensionality information once the accesses have been flattened. AMGE provides a new data type for multi-dimensional arrays that makes this information available to the compiler. Details on the implementation of the data type and the generation of optimized kernel versions are discussed in Section 7.3.

The other key feature of AMGE is the use of remote memory accesses between GPUs [5]. On each reference to the array, the underlying implementation determines whether the element being referenced is hosted in the memory local to the GPU executing the code or on a different GPU. References from a GPU to parts of the array stored in different GPU memories are handled using remote memory accesses. This approach ensures correctness regardless of the chosen GPU computation and data distribution configuration and removes the requirement for the compiler analysis to unequivocally determine the bounds of the memory range accessed by a computation partition.
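The per-reference check described above can be pictured with the following host-side C++ sketch. It is not AMGE's data type: the struct `Array2D`, its members, and the distribution by blocks of rows are hypothetical, chosen only to show how keeping the dimensions in the type makes it possible to map a two-dimensional index to the GPU memory that hosts it.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical descriptor of a 2D array distributed by blocks of rows.
struct Array2D {
    std::size_t rows, cols;
    std::size_t gpus;  // number of GPU memories the array is spread over

    // Rows stored in each GPU memory (the last block may be shorter).
    std::size_t rows_per_gpu() const { return (rows + gpus - 1) / gpus; }

    // GPU whose memory hosts element (r, c).
    std::size_t owner(std::size_t r, std::size_t c) const {
        assert(gpus > 0 && r < rows && c < cols);
        return r / rows_per_gpu();
    }

    // Offset of (r, c) inside the owner's local row-major block.
    std::size_t local_offset(std::size_t r, std::size_t c) const {
        assert(gpus > 0 && r < rows && c < cols);
        return (r % rows_per_gpu()) * cols + c;
    }

    // True when a reference issued by `gpu` has to be a remote access.
    bool is_remote(std::size_t gpu, std::size_t r, std::size_t c) const {
        return owner(r, c) != gpu;
    }
};
```

With a 4096 x 4096 array over 4 GPUs, element (1024, 0) is hosted by GPU 1, so a reference to it from GPU 0 is remote while the same reference from GPU 1 is local. Had the kernel used a flattened index such as `r * cols + c`, the row and column would no longer be visible in the source as separate operands, which is the information the dimension-aware type preserves.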

However, remote accesses can impose performance overheads and they must be minimized. On each kernel call, the AMGE runtime determines the best computation and array decompositions using the information generated by the compiler, and distributes them across all GPUs in the system.


void sgemm(ndarray<float, 2, storage::cmo> C, ndarray<float, 2, storage::cmo> A,
           ndarray<float, 2> B)
{
    float partial[SGEMM_TILE_N];
    __shared__ float b_tile_sh[SGEMM_TILE_HEIGHT][SGEMM_TILE_N];
    for (int i = 0; i < SGEMM_TILE_N; i++) partial[i] = 0.0f;

    int mid = threadIdx.y * blockDim.x + threadIdx.x;
    int row = blockIdx.x * (SGEMM_TILE_N * SGEMM_TILE_HEIGHT) + mid;
    int col = blockIdx.y * SGEMM_TILE_N + threadIdx.x;

    for (int i = 0; i < A.get_dim(1); i += SGEMM_TILE_HEIGHT) {
        // Stage a tile of B in shared memory for the whole thread block.
        b_tile_sh[threadIdx.y][threadIdx.x] = B(i + threadIdx.y, col);
        __syncthreads();
        for (int j = 0; j < SGEMM_TILE_HEIGHT; ++j) {
            float a = A(row, i + j);
            for (int k = 0; k < SGEMM_TILE_N; ++k)
                partial[k] += a * b_tile_sh[j][k];
        }
        __syncthreads();
    }
    for (int i = 0; i < SGEMM_TILE_N; i++)
        C(row, i + blockIdx.y * SGEMM_TILE_N) = partial[i];
}

Listing 7.1: Multi-GPU sgemm GPU code with AMGE.

Memory model

Arrays are transparently decomposed and/or replicated before each kernel call. Input arrays can be replicated at the cost of additional space and data transfers, but AMGE never replicates output arrays. Replicated output arrays require additional coherence management to merge partial modifications made on different GPUs. Previous works [57, 61] have to transfer all copies to the host memory for a merging step after every kernel call, imposing a large performance overhead in many workloads. Moreover, replicating output arrays prevents distributing codes with atomics or memory fences. AMGE always distributes output arrays across GPU memories instead, and relies on remote memory accesses to guarantee that they are available to all GPUs. AMGE implements the ADSM model [45] to allow arrays to be used both by host and GPU code. The runtime only transfers arrays between CPU and GPU memories when needed.

In document On the programmability of multi-GPU computing systems (Page 109-119)
