
Automatic multi-GPU execution

by implementing GPU-aware resource management in the OS, called Gdev. Gdev integrates virtual memory management into the OS kernel, which makes it possible to implement mechanisms such as shared memory between processes. It also implements preliminary support for paging, although GPUs do not provide support for restartable page faults (which are needed to implement on-demand paging). Instead, when a process requests more memory than is available on the GPU, Gdev evicts data belonging to other processes that use the GPU and moves it to host memory.

HSA [16] (Heterogeneous System Architecture) is an industry standard for heterogeneous systems that defines a set of features that devices must support. Many of the members of the HSA Foundation are major silicon vendors such as AMD, ARM and Samsung. One of the main objectives of HSA is to completely remove the need for programmers to explicitly manage all the memories in the system. Similarly to our proposal in Chapter 5, HSA provides a single flat virtual address space in which all the memory can be accessed by any processor in the system. HSA also requires support for context switching and paging in order to implement system-wide policies. Currently, only a few devices support HSA, and support in GPUs is limited to integrated designs such as the ones in AMD APUs. While HSA theoretically supports any high-level programming language, current HSA-compliant GPUs use OpenCL.

3.3 Automatic multi-GPU execution

3.3.1 Compiler-based transparent multi-GPU execution

Kim et al. [57] introduce an OpenCL framework that combines multiple GPUs and treats them as a single compute device. When a kernel is launched, the framework decomposes computation and data across the GPUs in the system. The computation grid of the kernel is decomposed into uniform partitions that are distributed across GPUs. In order to perform data decomposition, they compute the array range accessed by each computation partition by performing a sampling run of the kernel on the CPU. This step is only performed if array references are affine transformations of the thread and block identifiers, which is determined using compiler analysis; otherwise the whole array is replicated in all GPU memories. This solution does not detect the dimensionality of the arrays used in the GPU kernels. Therefore, even in cases where data can be decomposed, any array decomposition that is not performed on the highest-order dimension produces tiles whose memory address ranges overlap, which replicates large portions of the array in all memories.
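The overlap problem described above can be illustrated with a minimal sketch. It is not the implementation of [57]: the helper names `partition_grid` and `address_range`, the row-major layout and the 8x8 array size are assumptions made only for this example. The sketch computes the smallest contiguous address range that covers the tile assigned to each GPU, which is what a framework that ignores array dimensionality would have to transfer.

```python
def partition_grid(num_blocks, num_gpus):
    """Split block indices [0, num_blocks) into contiguous, near-uniform chunks."""
    base, extra = divmod(num_blocks, num_gpus)
    chunks, start = [], 0
    for gpu in range(num_gpus):
        size = base + (1 if gpu < extra else 0)
        chunks.append((start, start + size))
        start += size
    return chunks


def address_range(cols, row_span, col_span):
    """Smallest linear [lo, hi) element range covering a tile of a row-major array.

    `cols` is the row length; spans are half-open (start, end) index pairs.
    """
    (r0, r1), (c0, c1) = row_span, col_span
    lo = r0 * cols + c0
    hi = (r1 - 1) * cols + (c1 - 1) + 1
    return lo, hi


ROWS, COLS, GPUS = 8, 8, 2  # hypothetical problem size

# Decomposition on the highest-order dimension (rows): ranges are disjoint.
by_rows = [address_range(COLS, span, (0, COLS))
           for span in partition_grid(ROWS, GPUS)]

# Decomposition on the lowest-order dimension (columns): ranges overlap.
by_cols = [address_range(COLS, (0, ROWS), span)
           for span in partition_grid(COLS, GPUS)]

print(by_rows)  # [(0, 32), (32, 64)]
print(by_cols)  # [(0, 60), (4, 64)]
```

In the column-wise case each GPU needs only 32 of the 64 elements, yet the covering ranges span 60 elements each and share 56 of them, so almost the whole array ends up replicated.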

CHAPTER 3. STATE OF THE ART

The main problem of using replication is that it reduces the size of the problems that can be handled. Another drawback is that replicated array regions that are potentially modified by different computation partitions need to be merged after every kernel execution. This step is executed on the CPU, thus increasing CPU↔GPU traffic and imposing a large overhead in many computations. Furthermore, atomic operations do not work on replicated data.
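The following sketch illustrates one possible way such a merge step could work. The diff-against-original strategy and the function name `merge_replicas` are assumptions made for illustration; the source does not specify how [57] merges replicas. Each replica is compared with the contents of the array before the kernel ran, and every element that a GPU modified is copied into the merged result.

```python
def merge_replicas(original, replicas):
    """Merge per-GPU copies of an array by keeping every modified element.

    `original` holds the array contents before the kernel ran; each entry of
    `replicas` is the copy that one GPU produced. This is an assumed strategy,
    used here only to illustrate the cost and limits of merging.
    """
    merged = list(original)
    for replica in replicas:
        for i, (old, new) in enumerate(zip(original, replica)):
            if new != old:
                merged[i] = new
    return merged


# Disjoint writes: each GPU updates its own half, and the merge is correct.
original = [0, 0, 0, 0]
merged = merge_replicas(original, [[1, 2, 0, 0], [0, 0, 3, 4]])
print(merged)  # [1, 2, 3, 4]

# Atomic increments on a shared counter: each GPU increments its private copy
# once, so both replicas hold 1 and the merged value is 1 instead of 2.
counter = merge_replicas([0], [[1], [1]])
print(counter)  # [1]
```

Under this assumed strategy every replica has to be visited element by element, which hints at why the merge is costly, and the second case shows why atomic updates to replicated data cannot be reconstructed from the final copies alone.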

Lee et al. [61] propose SKMD, which extends the same idea to heterogeneous systems with CPUs and GPUs. SKMD profiles the different devices in the system in order to distribute the computation partitions according to their capabilities. They do not use sampling runs on the CPU and rely solely on the compiler to detect the array region accessed by each partition, which leads to replication in more cases than in [57]. On the other hand, they generate a merge kernel for replicated data (based on the original kernel code), which is more efficient. However, this step is still performed on the CPU, requiring all the replicated copies to be transferred to host memory (which is the part of the merge step that incurs the most overhead).
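As a rough illustration of capability-aware distribution, the sketch below assigns computation partitions to devices in proportion to a measured throughput. The proportional policy, the function name `distribute` and the throughput figures are assumptions; the actual performance model used by SKMD is not described here.

```python
def distribute(num_partitions, throughput):
    """Assign partitions to devices in proportion to their profiled throughput.

    `throughput` maps a device name to a relative speed. Fractional shares are
    rounded down, and leftover partitions go to the devices with the largest
    remainders so that the counts always add up to `num_partitions`.
    """
    total = sum(throughput.values())
    shares = {dev: num_partitions * t / total for dev, t in throughput.items()}
    counts = {dev: int(share) for dev, share in shares.items()}
    leftover = num_partitions - sum(counts.values())
    by_remainder = sorted(shares, key=lambda dev: shares[dev] - counts[dev],
                          reverse=True)
    for dev in by_remainder[:leftover]:
        counts[dev] += 1
    return counts


# Hypothetical profile: the GPUs are 4x and 3x faster than the CPU.
counts = distribute(16, {"cpu": 1.0, "gpu0": 4.0, "gpu1": 3.0})
print(counts)  # {'cpu': 2, 'gpu0': 8, 'gpu1': 6}
```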

The solution proposed in Chapter 7 exploits the support for remote accesses across GPU memories, thus avoiding, in most cases, the data replication and merge operations that these previous works require.

3.3.2 Compiler-based multi-GPU code generation

Lee et al. [63, 62] present OpenMPC, a system for automatic GPU code generation that relies on standard OpenMP annotations for the host code. A first phase detects the variables accessed in each loop (either specified explicitly through clauses or inferred implicitly), the access type, and implicit synchronization points. A second phase divides parallel regions at synchronization points to enforce correctness (there are no kernel-wide synchronization primitives in CUDA). The next phase transforms the CPU-oriented kernel into a CUDA kernel and performs CUDA-specific optimizations. The compiler inserts calls for GPU memory allocation and data transfers between CPU and GPU memories. Sabne et al. [79] extend OpenMPC to support out-of-core computations and multi-GPU execution. Their extension also overlaps communication and computation. The achieved speedups are competitive with hand-written CUDA versions for a single GPU, but they do not scale when several GPUs are used.

PGI proposes custom directives for Fortran programs to automatically generate optimized code for accelerators [97]. Clauses are included to explicitly specify data transfers to the GPU memory. The compiler performs strip-mining² on the program loops to generate inner loops, and assigns the different loops to the block-level and thread-level parallelism offered in CUDA. These directives were later extended and published as the OpenACC standard [75], which is supported by many hardware and systems vendors. OpenACC, however, requires programmers to manually decompose computations so that they can be executed on multiple GPUs.
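The strip-mining transformation mentioned above can be sketched on a simple one-dimensional loop. The example is written in Python for readability and is not the output of any compiler; the function names and the SAXPY computation are chosen only for illustration. The single loop over `i` is split into an outer loop over blocks and an inner loop over the elements of a block, which correspond to the block-level and thread-level indices of a CUDA kernel.

```python
def saxpy_flat(a, x, y):
    """Original loop: one iteration per element."""
    out = list(y)
    for i in range(len(x)):
        out[i] = a * x[i] + y[i]
    return out


def saxpy_strip_mined(a, x, y, block_size):
    """Strip-mined loop: the iteration space is reshaped into blocks x threads."""
    n = len(x)
    out = list(y)
    num_blocks = (n + block_size - 1) // block_size  # ceiling division
    for block in range(num_blocks):       # would map to blockIdx.x
        for thread in range(block_size):  # would map to threadIdx.x
            i = block * block_size + thread
            if i < n:                     # guard for a partial last block
                out[i] = a * x[i] + y[i]
    return out


x = list(range(10))
y = [1] * 10
print(saxpy_strip_mined(2, x, y, block_size=4) == saxpy_flat(2, x, y))  # True
```

The guard on `i < n` plays the same role as the bounds check commonly found at the top of CUDA kernels when the problem size is not a multiple of the block size.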

Our solution presented in Chapter 7 uses compiler analysis to automatically determine how computation and data can be decomposed and distributed, achieving linear speedups.

3.3.3 Multi-GPU libraries

Using libraries is a simple way for programs to run efficiently and transparently on different computer architectures. Programmers only need to express computations in terms of calls to library functions, with no knowledge of the characteristics of the hardware. Libraries can implement optimized versions of these computations for different accelerator architectures and system topologies.

MAGMA [93] is a widely used linear algebra library with support for large systems. It implements heterogeneous execution of computation kernels, using both CPUs and GPUs. Unfortunately, only a subset of the provided functions is able to exploit several GPUs.

Nukada et al. [73] present a 3D FFT library for CUDA. It uses autotuning to generate CUDA kernels that are optimized for different transform sizes.

Babich et al. [24] propose QUDA, a library for numerical lattice quantum chromodynamics (LQCD) calculations. Thanks to a multi-dimensional decomposition of the problem, they are able to scale the performance to up to 256 GPUs.

ArrayFire [17] is a C/C++/Fortran library that provides abstractions for multidimensional arrays and a number of libraries that use them (e.g., data analysis, linear algebra, image and signal processing). However, these arrays can only be used with the functions offered in these libraries and cannot be used in custom user-defined kernels.

² Strip-mining reshapes a multi-dimensional space to create additional dimensions from the elements of one of the original dimensions. It is commonly used to perform blocking.

In the latest versions of CUDA, NVIDIA offers cuBLAS-XT [8], an extension of their cuBLAS [7] library that is able to decompose and spread work across several GPUs for a set of Level 3 BLAS calls. They also provide NVBLAS [11], which offers automatic multi-GPU acceleration for applications that use regular BLAS calls. NVBLAS builds on cuBLAS-XT: it intercepts calls from the application to regular BLAS libraries and transparently replaces them with GPU calls.
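The idea of decomposing a Level 3 BLAS call across devices can be sketched as follows. This is not the cuBLAS-XT API or its scheduling algorithm: the function names, the round-robin assignment and the tile size are assumptions made for illustration, and the "devices" are simulated on the host. The output matrix of a matrix multiplication is split into tiles, and each tile is an independent unit of work that can be assigned to a different GPU.

```python
def tiles(n, tile):
    """Half-open (start, end) spans that cover [0, n) in steps of `tile`."""
    return [(s, min(s + tile, n)) for s in range(0, n, tile)]


def multi_device_gemm(a, b, tile, num_devices):
    """Compute C = A x B tile by tile, recording which device owns each tile.

    Tiles of C are assigned round-robin (an assumed policy). Each tile only
    needs the matching rows of A and columns of B, so tiles can be computed
    independently of each other.
    """
    m, k, n = len(a), len(b), len(b[0])
    c = [[0] * n for _ in range(m)]
    owner = {}
    work = [(rs, cs) for rs in tiles(m, tile) for cs in tiles(n, tile)]
    for idx, ((r0, r1), (c0, c1)) in enumerate(work):
        owner[(r0, c0)] = idx % num_devices
        for i in range(r0, r1):
            for j in range(c0, c1):
                c[i][j] = sum(a[i][p] * b[p][j] for p in range(k))
    return c, owner


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
c, owner = multi_device_gemm(a, b, tile=1, num_devices=2)
print(c)      # [[19, 22], [43, 50]]
print(owner)  # {(0, 0): 0, (0, 1): 1, (1, 0): 0, (1, 1): 1}
```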

The main problem of the library approach is that only the computations provided in the library are able to transparently use multiple GPUs.