automatically determine which data structures are shared. They also add support for stack allocations and global variables, while GMAC is limited to heap allocations. The runtime system inspects the parameters at kernel call boundaries and keeps track of the location of data to determine when data transfers are required. This approach prevents the use of complex data types that contain pointers, as these would have to be inspected recursively. Jablin et al. extend this work in [51] by replacing the parameter inspection with a memory protection mechanism like the one used in GMAC. These works, like ADSM, only target systems with a single GPU.
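The location tracking described above can be illustrated with a minimal sketch. It is not the implementation from the cited works: the class `LocationTracker`, its methods, and the three location states are hypothetical names, transfers are only counted rather than performed, and every kernel argument is conservatively assumed to be written by the kernel.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_map>
#include <vector>

// Where the valid copy of an allocation currently lives (hypothetical states).
enum class Location { Host, Device, Both };

class LocationTracker {
public:
    // Record a new allocation; fresh data is assumed valid on the host only.
    void register_allocation(const void* base) {
        assert(base != nullptr);
        state_[base] = Location::Host;
    }

    // Called at a kernel call boundary with the kernel's pointer arguments.
    // Returns how many host-to-device transfers the call requires.
    std::size_t before_kernel(const std::vector<const void*>& params) {
        std::size_t transfers = 0;
        for (const void* p : params) {
            auto it = state_.find(p);
            if (it == state_.end()) continue;  // not a tracked allocation
            if (it->second == Location::Host) ++transfers;
            // Assumption: the kernel may write every argument it receives.
            it->second = Location::Device;
        }
        return transfers;
    }

    // Called before the CPU touches the data. Returns true if a
    // device-to-host transfer is required first.
    bool before_host_access(const void* base) {
        auto it = state_.find(base);
        if (it == state_.end() || it->second != Location::Device) return false;
        it->second = Location::Both;
        return true;
    }

private:
    std::unordered_map<const void*, Location> state_;
};
```

Only the top-level pointer arguments are looked up here. A pointer stored inside a structure passed to the kernel would go unnoticed unless the structure were traversed recursively, which is the limitation mentioned above.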
Some higher-level languages have been proposed to simplify memory management in GPU programs. C++ AMP [3] provides a multi-dimensional array data type to represent the data that is used in the GPU kernels. Objects created from this type can be used in the CPU code, too. Using this type instead of raw pointers allows the compiler to detect the memory objects passed to each GPU kernel and insert the appropriate coherence annotations for the runtime system. The solution proposed in Chapter 7 takes the same approach of using a special data type, whereas the proposal in Chapter 5 works for any type of memory allocation.
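The idea of a special data type that makes coherence actions explicit can be sketched as follows. This is not C++ AMP code: `TrackedArray` and `launch` are hypothetical names, the array is one-dimensional, and the "device" copy is an ordinary vector that stands in for GPU memory so that the sketch runs without a GPU.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical array type that knows which side holds the newest data.
template <typename T>
class TrackedArray {
public:
    explicit TrackedArray(std::size_t n) : host_(n), device_(n) {
        assert(n > 0);
    }

    // CPU-side access: copy back from the device first if it is newer.
    std::vector<T>& host() {
        if (newest_ == Side::Device) { host_ = device_; ++copies_to_host_; }
        newest_ = Side::Host;  // assumption: the CPU may write the data
        return host_;
    }

    // Kernel-side access: upload first if the host copy is newer.
    std::vector<T>& device() {
        if (newest_ == Side::Host) { device_ = host_; ++copies_to_device_; }
        newest_ = Side::Device;  // assumption: the kernel may write the data
        return device_;
    }

    std::size_t copies_to_device() const { return copies_to_device_; }
    std::size_t copies_to_host() const { return copies_to_host_; }

private:
    enum class Side { Host, Device };
    std::vector<T> host_, device_;
    Side newest_ = Side::Host;
    std::size_t copies_to_device_ = 0, copies_to_host_ = 0;
};

// Stand-in for a kernel launch. Because the argument is a TrackedArray and
// not a raw pointer, the launch site knows exactly which object is used.
template <typename T, typename Kernel>
void launch(TrackedArray<T>& data, Kernel kernel) {
    std::vector<T>& d = data.device();
    for (std::size_t i = 0; i < d.size(); ++i) kernel(d[i]);
}
```

Two consecutive calls to `launch` on the same array trigger a single upload, and the data is copied back only when `host()` is called afterwards. With raw pointers, a compiler or runtime would have to discover this information by other means.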
3.2 System support and GPU virtualization
GPUs are commonly presented as isolated compute-only devices that cannot interact with other devices (e.g., network interfaces or disks). Moreover, GPU programs are not well integrated with many of the mechanisms offered by operating systems, such as inter-process communication and memory paging. Silberstein et al. propose GPUfs in [82], a POSIX-like API for GPU programs that makes the file system directly accessible to GPU code. The API is designed to exploit structured data parallelism such that threads in the same warp cooperate to perform read/write operations. GPUfs is mostly implemented as a library that works with the host OS on the CPU to coordinate the file system's namespace and data. The GPU sends file operation requests to the CPU while the kernel is running, by using a shared communication buffer. A kernel module component enables caching both in the CPU and GPU memories by distributing the buffer cache in the OS. Different GPUs and CPUs can concurrently work on the same file and changes are consolidated using diffing¹ at synchronization points.
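The consolidation step mentioned at the end of this paragraph can be sketched in a few lines. This is an illustration only, not the GPUfs implementation: the element type, the element-wise granularity, and the names `Change`, `diff` and `apply` are assumptions, and conflicting writes to the same element are not handled.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One changed element: its position and its new value.
struct Change {
    std::size_t index;
    int value;
};

// Compare a working buffer with the original copy taken earlier and
// return the elements that differ.
std::vector<Change> diff(const std::vector<int>& original,
                         const std::vector<int>& working) {
    assert(original.size() == working.size());
    std::vector<Change> changes;
    for (std::size_t i = 0; i < working.size(); ++i)
        if (working[i] != original[i]) changes.push_back({i, working[i]});
    return changes;
}

// Apply a set of changes to the shared copy.
void apply(std::vector<int>& shared, const std::vector<Change>& changes) {
    for (const Change& c : changes) shared[c.index] = c.value;
}
```

If two devices each modify a private copy of the same buffer, computing both diffs against the original and applying them to the shared copy merges the updates without one device overwriting the untouched regions of the other.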
Kim et al. present GPUnet in [58], a native networking layer that provides a socket abstraction and high-level networking APIs to GPU programs.
¹ Diffing is a technique that compares a memory buffer with its original copy to find the values that have changed.
GPUnet removes the need for programmers to coordinate the NIC, CPU and GPU in order to transfer data that resides in GPU memory. Further, GPU kernels can trigger network transfers and, therefore, programs no longer need to wait for kernel finalization in order to transfer data. GPUnet takes advantage of GPUDirect, enabling direct communication between NIC buffers and GPU memories and avoiding intermediate copies to host memory. Optimized paths are implemented for communication between sockets of GPUs that reside in the same node (and thus do not require the NIC).
Rossbach et al. [78] propose a new abstraction called PTask for processes that run on the accelerator, and the addition of ports and channels to represent the communication graph among regular processes and PTasks. Using this scheme, unnecessary memory transfers between CPU and GPU memories can be avoided since the placement of memory objects is known to the system runtime. Moreover, more intelligent scheduling policies can be implemented by taking advantage of the features provided by the accelerators in the system (e.g., concurrent GPU execution and memory transfers).
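To make the benefit of an explicit communication graph concrete, the sketch below models tasks connected by channels and counts the copies that are actually needed. It does not reproduce the PTask API: `DataflowGraph`, `Task`, `Channel` and `Placement` are hypothetical names, and a single GPU memory space is assumed.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Memory space in which a task runs and produces its data.
enum class Placement { Cpu, Gpu };

struct Task {
    std::string name;
    Placement where;
};

// A channel carries data from one task's output port to another's input port.
struct Channel {
    std::size_t producer;
    std::size_t consumer;
};

class DataflowGraph {
public:
    // Adds a task and returns its index.
    std::size_t add_task(const std::string& name, Placement where) {
        tasks_.push_back({name, where});
        return tasks_.size() - 1;
    }

    void connect(std::size_t producer, std::size_t consumer) {
        assert(producer < tasks_.size() && consumer < tasks_.size());
        channels_.push_back({producer, consumer});
    }

    // Only channels whose endpoints live in different memory spaces need a
    // CPU/GPU copy. Assumption: GPU tasks share one GPU memory, so data on a
    // GPU-to-GPU channel can stay in place.
    std::size_t required_transfers() const {
        std::size_t n = 0;
        for (const Channel& c : channels_)
            if (tasks_[c.producer].where != tasks_[c.consumer].where) ++n;
        return n;
    }

private:
    std::vector<Task> tasks_;
    std::vector<Channel> channels_;
};
```

For a pipeline of one CPU producer, two GPU tasks and one CPU consumer, the graph reports two transfers. A program that copied the inputs and outputs of every GPU task separately would perform four.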
Duato et al. propose the rCUDA middleware. rCUDA virtualizes the CUDA-RT API and implements a client/server architecture to execute applications on remote GPUs. In [40], the authors use rCUDA to manage all the GPUs in a cluster. This allows applications to use remote GPUs as if they were local. It also enables cluster-level scheduling, which leads to higher GPU utilization. In [39], rCUDA allows programs in guest Virtual Machines to access the GPUs on the physical machine. Compared to our work, rCUDA is limited to the CUDA-RT API and requires a complete implementation of it, while the HPE model proposed in Chapter 5 can be implemented in different programming models (versions exist for CUDA and OpenCL).
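The client/server forwarding of API calls can be illustrated with a small sketch. It is not rCUDA's protocol: the message layout, the three operations, and the `Client` and `Server` classes are hypothetical, and a direct function call stands in for the network connection.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical set of forwarded operations.
enum class Op : std::uint8_t { Alloc, CopyIn, CopyOut };

struct Request {
    Op op;
    std::uint32_t handle;            // remote buffer id (unused for Alloc)
    std::size_t size;                // requested size for Alloc
    std::vector<std::uint8_t> data;  // payload for CopyIn
};

struct Response {
    std::uint32_t handle;
    std::vector<std::uint8_t> data;  // payload for CopyOut
};

// Runs on the node that owns the GPU; vectors stand in for device buffers.
class Server {
public:
    Response handle(const Request& r) {
        switch (r.op) {
        case Op::Alloc:
            buffers_[next_] = std::vector<std::uint8_t>(r.size);
            return {next_++, {}};
        case Op::CopyIn:
            assert(r.data.size() == buffers_.at(r.handle).size());
            buffers_.at(r.handle) = r.data;
            return {r.handle, {}};
        case Op::CopyOut:
            return {r.handle, buffers_.at(r.handle)};
        }
        return {0, {}};
    }

private:
    std::unordered_map<std::uint32_t, std::vector<std::uint8_t>> buffers_;
    std::uint32_t next_ = 1;
};

// Runs inside the application; each API call becomes one request.
class Client {
public:
    explicit Client(Server& s) : server_(s) {}
    std::uint32_t alloc(std::size_t n) {
        return server_.handle({Op::Alloc, 0, n, {}}).handle;
    }
    void copy_in(std::uint32_t h, const std::vector<std::uint8_t>& d) {
        server_.handle({Op::CopyIn, h, d.size(), d});
    }
    std::vector<std::uint8_t> copy_out(std::uint32_t h) {
        return server_.handle({Op::CopyOut, h, 0, {}}).data;
    }

private:
    Server& server_;
};
```

Every operation that the application may call needs both a client stub and a server handler, which suggests why an approach of this kind requires a complete implementation of the API it virtualizes.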
Similarly to rCUDA, Shi et al. propose vCUDA in [80] to implement GPU sharing for programs running in guest Virtual Machines. It provides a virtual GPU view to each application running on the node, though multiple applications actually share a single physical GPU. Instead of sharing a GPU among applications, our solution in Chapter 7 aggregates all the GPU resources into a single virtual GPU to transparently scale the performance of applications. Virtualization proposals for OpenCL-based clusters also exist. Barak et al. introduce VCL in [25], a solution based on MOSIX that exposes all the GPUs in the cluster to standard OpenCL programs. Xiao et al. present VOCL in [98, 99]. Besides transparent access to remote GPUs, VOCL is able to perform live migration of virtual GPUs between different physical GPUs in a cluster. Just as rCUDA is tied to CUDA, these solutions are limited to applications that use the OpenCL API.