cudaMemcpyHostToDevice. For brevity, we denote cudaMemcpy calls from host to device as memcpyH2D, and cudaMemcpy calls from device to host as memcpyD2H. Second, we launch the kernel that implements the DCT processing. The kernel reads the data from the gBlocksIn array and writes its results to the gBlocksOut array, which also resides in GPU global memory. Once the kernel runs to completion, we transfer the results from gBlocksOut in GPU memory to rowToken in host memory. Once all three actions have completed, control returns to process T', which only then proceeds with the next process iteration. Figure 5.2(a) illustrates the GPU execution timeline under the SOK mechanism.
Figure 5.2: (a) Have: synchronous execution of CUDA operations (no overlaps); (b) Want: asynchronous execution (overlapped). [Timelines of memcpyH2D, kernel, and memcpyD2H operations over time t.]
The main limitation of the SOK mechanism is that this type of kernel offloading is blocking. The PPN process cannot start the host-to-device data transfer and the computation of the next iteration until all three phases of the current iteration have completed. For streaming applications, this prevents pipelining of data transfers and processing on the GPU. The impact of data transfers can be mitigated by overlapping them with computation (see Figure 5.2(b)), as demonstrated in several application case studies [14, 143]. NVIDIA’s technical note [100] contains an example code pattern that enables such overlapping of data transfers and kernel execution. State-of-the-art frameworks for run-time task scheduling, such as StarPU [10], use asynchronous data transfers for communication with the GPU. Farago and Nikolov [52] experimented with the SOK mechanism in the context of PPNs, which results in the sequential timeline given in Figure 5.2(a).
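To make the cost of blocking concrete, the two timelines of Figure 5.2 can be compared with a back-of-the-envelope model. The sketch below is an illustration only and is not part of the mechanism described in this chapter: the function names and the stage durations are hypothetical, and it assumes that every iteration has identical phase durations and that the three phases can run on separate hardware resources without contention.

```python
def synchronous_time(n, h2d, kernel, d2h):
    """Total time for n iterations when each iteration blocks until its
    host-to-device copy, kernel, and device-to-host copy have finished."""
    return n * (h2d + kernel + d2h)


def overlapped_time(n, h2d, kernel, d2h):
    """Idealized total time when the phases of consecutive iterations are
    pipelined: the first iteration fills the pipeline, and every further
    iteration costs only as much as the slowest phase."""
    if n == 0:
        return 0
    return (h2d + kernel + d2h) + (n - 1) * max(h2d, kernel, d2h)


if __name__ == "__main__":
    # Hypothetical durations in arbitrary time units.
    n, h2d, kernel, d2h = 3, 2, 5, 2
    print("synchronous:", synchronous_time(n, h2d, kernel, d2h))
    print("overlapped: ", overlapped_time(n, h2d, kernel, d2h))
```

Under these assumptions, the gain from overlapping grows with the number of iterations, and the per-iteration cost is bounded from below by the slowest of the three phases.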
The question that we address in this chapter is how to efficiently offload kernel execution from a PPN node to an accelerator and achieve overlapping of computation and communication phases as illustrated in Figure 5.2(b).
5.2 Approach
To enable the overlapping of computation and communication phases of consecutive PPN process iterations illustrated in Figure 5.2(b), we propose a model-based Asynchronous Offloading of a Kernel (AOK) mechanism for kernel offloading to accelerators. It leverages the asynchronous, data-driven nature of the PPN model combined with asynchronous data transfers to the accelerator.
As a step towards asynchronous communication with the accelerator, we differentiate between two types of channels: channels between processes executing their process iterations on the same device (e.g., a host CPU), and channels between two processes executing their process iterations on different devices (e.g., a host CPU and a GPU accelerator). For simplicity, we refer to the first type as a Shared Memory Channel (SMC), or channel type A, and to the second type as a Distributed Memory Channel (DMC), or channel type B. When the SOK mechanism is used, the PPN process uses only SMC channels to connect to other processes, as illustrated in Figure 5.3(a). In this figure, the data transfers between the host and the GPU are denoted as data xfer. For processing on the GPU, process P first puts data into the SMC channel Ch1. Process T' then reads the token from Ch1 and invokes the kernel offloading mechanism shown in Figure 5.1(b). The SOK mechanism invokes three CUDA operations: (1) a synchronous data transfer from host to GPU, (2) kernel execution, and (3) a synchronous data transfer from GPU to host. The timeline of the CUDA operation execution for three consecutive process iterations is shown in Figure 5.3(c). Only after all three CUDA operations have completed is control returned to process T', which then puts the results of the GPU execution into the SMC channel Ch2.
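The distinction between the two channel types depends only on where the producer and the consumer of a channel execute. A minimal sketch of this rule follows, assuming a hypothetical `Channel` record and a hypothetical process-to-device mapping; the process names `T_host` and `T_gpu` and the device names are illustrative and do not come from the framework itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Channel:
    """A point-to-point channel between two processes (hypothetical type)."""
    name: str
    producer: str
    consumer: str


def channel_type(channel, mapping):
    """Return 'SMC' if producer and consumer are mapped to the same device,
    otherwise 'DMC'. `mapping` maps a process name to a device name."""
    same_device = mapping[channel.producer] == mapping[channel.consumer]
    return "SMC" if same_device else "DMC"


if __name__ == "__main__":
    # All processes on the host CPU: every channel is an SMC.
    host_only = {"P": "cpu", "T_host": "cpu", "C": "cpu"}
    for ch in (Channel("Ch1", "P", "T_host"), Channel("Ch2", "T_host", "C")):
        print(ch.name, channel_type(ch, host_only))

    # Middle process on the GPU: both of its channels become DMCs.
    with_gpu = {"P": "cpu", "T_gpu": "gpu", "C": "cpu"}
    for ch in (Channel("Ch1", "P", "T_gpu"), Channel("Ch2", "T_gpu", "C")):
        print(ch.name, channel_type(ch, with_gpu))
```

Deriving the channel type from the mapping, rather than storing it on the channel, means that remapping a process to another device reclassifies its channels without changing the network description.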
Figure 5.3: Solution approach: (a) SOK vs. (b) AOK. [Panels (a) and (b) show processes P, T', T, and C with channels Ch1, Ch2, SB1, and SB2, and the CUDA operations (1) data xfer, (2) kernel execution, and (3) data xfer between CPU and GPU memory; panels (c) and (d) show the corresponding timelines over time t for iterations i1, i2, and i3.]
The core idea of our approach for asynchronous kernel offloading is to send the data produced by P directly to the GPU, without first transferring it to the host memory of process T', by taking advantage of the different channel types. Under the AOK approach, process P does not block waiting for the data transfer to complete. Instead, P is free to proceed with its next process iteration and start the next data transfer to the GPU. As soon as the GPU receives the input data from process P, it launches the DCT kernel, and once the kernel has completed, we transfer the results directly from GPU memory to process C. In Figure 5.3, the arrows between processes P and T', and T' and C, indicate synchronization between processes, whereas the arrows between processes P and T, and T and C, indicate host-accelerator data transfers. So, instead of having P transfer data from its host memory buffer to the host memory buffer of T', and then having T' transfer the data from that buffer into GPU memory using a synchronous data transfer call, P transfers data directly from its host memory buffer into the GPU memory buffer accessed by T, as indicated by the data xfer marks in Figure 5.3. To realize this pattern, we take advantage of asynchronous DMA transfers between the host and the GPU accelerator. This asynchronous execution enables pipelining of GPU operations (e.g., the kernel execution of the first iteration, denoted as i1, can be overlapped with the host-accelerator data transfer of the second iteration, denoted as i2), thus facilitating the desired overlapping of communication and computation. The timeline of the CUDA operation execution for three consecutive process iterations is shown in Figure 5.3(d).
Our approach is inspired by the technical note from NVIDIA [100], which shows how to use the concept of a CUDA stream to realize overlapping of communication and computation. A CUDA stream describes a sequence of GPU operations (kernel executions, data transfers) that execute in order. Operations from different streams can execute concurrently. This enables overlapping a kernel execution in stream s1 with a data transfer in stream s2.
In addition, we leverage the double-DMA capabilities of Fermi-architecture GPUs to overlap not only data transfers to the GPU with kernel execution, but also GPU data transfers in different streams with each other. This allows us to upload an input token of iteration i3 to the GPU (data xfer in phase 1) while, at the same time, executing the kernel for iteration i2 and downloading the results of iteration i1 to the host memory (data xfer in phase 3), as shown in Figure 5.3(d). Coupled with the PPN properties of data-driven asynchronous execution, the technical capabilities of modern GPUs enable us to design a model-based approach for overlapping communication and computation on GPU-based systems.
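The effect of a second DMA engine on the timeline can be explored with a small scheduling model. The sketch below is a simplified illustration under stated assumptions, not a description of actual GPU scheduling: the function `schedule`, the engine names, and the durations are hypothetical; each iteration issues its three operations in order; and each engine is assumed to serve its operations strictly in issue order.

```python
def schedule(n, h2d, kernel, d2h, copy_engines=2):
    """Return (finish_time, timeline) for n iterations.

    Model assumptions (illustrative only):
      * each iteration runs its H2D copy, kernel, and D2H copy in order;
      * every engine serves operations in issue order (FIFO);
      * with two copy engines, one serves H2D copies and the other D2H
        copies; with one copy engine, both directions share it.
    """
    if copy_engines not in (1, 2):
        raise ValueError("copy_engines must be 1 or 2")
    d2h_engine = "copy1" if copy_engines == 2 else "copy0"
    phases = [("h2d", "copy0", h2d),
              ("kernel", "compute", kernel),
              ("d2h", d2h_engine, d2h)]
    engine_free = {"copy0": 0, "copy1": 0, "compute": 0}
    timeline = []
    finish = 0
    for i in range(1, n + 1):
        ready = 0  # end time of the previous operation of this iteration
        for name, engine, duration in phases:
            start = max(ready, engine_free[engine])
            end = start + duration
            engine_free[engine] = end
            ready = end
            timeline.append((i, name, start, end))
        finish = max(finish, ready)
    return finish, timeline


if __name__ == "__main__":
    # Hypothetical equal phase durations, three iterations.
    for engines in (1, 2):
        finish, timeline = schedule(3, 3, 3, 3, copy_engines=engines)
        print(f"{engines} copy engine(s): finish at t={finish}")
        for i, name, start, end in timeline:
            print(f"  i{i} {name:6s} [{start}, {end})")
```

With equal phase durations and two copy engines, the model places the upload of i3, the kernel of i2, and the download of i1 in the same time slot, which is the three-way overlap described above. With a single shared copy engine and this issue order, the model degenerates to the sequential timeline.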
The chapter is structured as follows. In Section 5.3.1, we introduce a classification of PPN channels that enables model-driven channel design and mapping. In Section 5.3.2, we give an overview of traditional channel design and explain extensions that improve its efficiency by reducing the amount of data transferred at each channel access. In Section 5.4, we cover the AOK mechanism. First, in Section 5.4.1, we present our stream buffer design for type B channels, which enables model-driven overlapping of computation and communication. Second, in Section 5.4.2, we show how to apply the stream buffer design to enable AOK in a PPN. Finally, we show experimental results in Section 5.5 and discuss our findings.