(a)
DT:
    copyin(row token into blocks)
    //SEQ. process statement T (mainDCT)
    for (jt = 0; jt < HNumBlocks; jt++)
        T: mainDCT(block[jt], &block[jt])
    copyout(blocks into row token)

(b)
DT:
    copyin(row token into blocks)
    //PAR. process statement T (mainDCT)
    forall (jt = 0 to HNumBlocks-1)
        T: mainDCT(block[jt], &block[jt])
    copyout(blocks into row token)

(c)
DT:
    copyin(row token into blocks)
    transfer host blocks into GPU memory
    //CUDA process statement T (mainDCT)
    DCTKernel<<<HNumBlocks, ThreadsPerBlock>>>(gBlocksIn, gBlocksOut);
    transfer GPU result back to host blocks
    copyout(blocks into row token)

Figure 4.25: Independent parallelization of the DCT program module (HiPRDG node Y).
obtaining a two-level parallel M-JPEG realization which features task and pipeline parallelism by taking advantage of the PPN model at the top level, and whose program modules internally have data parallelism that can be used for GPU acceleration. We present the overall performance improvements achieved by the multi-level parallelization in Section 6.7.
4.8 Conclusions and Future Work
In this chapter, we introduced the hierarchical intermediate program representation in the polyhedral model called HiPRDG, and presented a structured method for derivation of a HiPRDG from the standard polyhedral model of the application. We also showed how to derive a structured multi-level program (MLP) from a HiPRDG. The hierarchical representation leads to multi-level parallel programs that are well suited for mapping onto heterogeneous platforms, allowing us to target different architectural components with different types of parallelism. As a result of the techniques presented in this chapter, we are now capable of obtaining multi-level parallel programs featuring task, data, and pipeline parallelism.
Chapter 5
PPN Execution on Heterogeneous Platforms with GPUs
5.1 Introduction
In Chapter 4, we showed how to construct a two-level program featuring task, pipeline, and data parallelism. The top level of the resulting program contains coarse-grain autonomous tasks communicating via channels that are generated as FIFO buffers from the PPN specification. Statements executed by computationally intensive tasks are further transformed for data parallelism so that they can be offloaded for execution on an accelerator. In this chapter, we introduce novel techniques for executing the statements of PPN processes on an accelerator while improving the efficiency of host-accelerator communication.
Let us again consider the two-level M-JPEG PPN illustrated in Figure 4.2. The nodes of the top-level PPN, denoted in this figure as S', T', Q, and VOUT, are implemented as parallel tasks and mapped for execution on different threads running on a multicore CPU of the host platform. Communication between the four tasks is organized exclusively using FIFO buffers. At each iteration, the S' process writes a token into its output buffer. The T' process reads the token from the buffer and executes its next iteration, whose statement contains a call to the mainDCT function. Our goal at this stage is simply to offload the execution of mainDCT onto an accelerator.
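The token exchange between a producer and a consumer process over a FIFO channel can be sketched in plain C as follows. This is a minimal single-threaded illustration and not the generated PPN code: the names fifo_t, token_t, fifo_push, fifo_pop, and demo_pipeline, the channel depth, and the token size are all hypothetical, and the doubling of each sample merely stands in for the call to mainDCT. A real PPN channel would block the writer when the buffer is full and the reader when it is empty; here the functions simply report failure.

```c
#include <assert.h>

#define FIFO_CAPACITY 4   /* hypothetical channel depth (number of tokens) */
#define TOKEN_WORDS   8   /* hypothetical token size (number of samples)   */

typedef struct { int data[TOKEN_WORDS]; } token_t;

typedef struct {
    token_t slots[FIFO_CAPACITY];
    int head;   /* index of the next token to read */
    int tail;   /* index of the next free slot     */
    int count;  /* number of tokens currently held */
} fifo_t;

void fifo_init(fifo_t *f) { f->head = f->tail = f->count = 0; }

/* Writes one token; returns 1 on success, 0 if the channel is full. */
int fifo_push(fifo_t *f, const token_t *t) {
    if (f->count == FIFO_CAPACITY) return 0;
    f->slots[f->tail] = *t;
    f->tail = (f->tail + 1) % FIFO_CAPACITY;
    f->count++;
    return 1;
}

/* Reads one token; returns 1 on success, 0 if the channel is empty. */
int fifo_pop(fifo_t *f, token_t *t) {
    if (f->count == 0) return 0;
    *t = f->slots[f->head];
    f->head = (f->head + 1) % FIFO_CAPACITY;
    f->count--;
    return 1;
}

/* Interleaves a producer (standing in for S') and a consumer
 * (standing in for T') for three iterations and returns a checksum
 * of everything the consumer computed. */
long demo_pipeline(void) {
    fifo_t c1;
    long checksum = 0;
    fifo_init(&c1);
    for (int it = 0; it < 3; it++) {
        token_t out, in;
        int ok;
        /* Producer iteration: fill a token and write it to the channel. */
        for (int i = 0; i < TOKEN_WORDS; i++)
            out.data[i] = it * TOKEN_WORDS + i;
        ok = fifo_push(&c1, &out);
        assert(ok);
        /* Consumer iteration: read the token and process it
         * (placeholder computation instead of mainDCT). */
        ok = fifo_pop(&c1, &in);
        assert(ok);
        for (int i = 0; i < TOKEN_WORDS; i++)
            checksum += 2 * in.data[i];
    }
    return checksum;
}
```

Calling demo_pipeline() runs three producer and consumer iterations back to back. In a multithreaded realization the two loop halves would run in separate threads, and the FIFO would additionally need synchronization.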
After generating data-parallel accelerator code for mainDCT (either manually or, e.g., using the approach presented in Chapter 3), it is necessary to provide the host-side accelerator management code for kernel offloading onto a GPU. To execute the node T (DCT) on the GPU, two issues need to be resolved:
• (1) How to manage the offloading of PPN process iterations onto an accelerator (such as a GPU)?
• (2) How to reduce host-accelerator communication overheads, i.e., how to improve the communication efficiency?
The GPU accelerator is typically seen as a co-processor managed by the host. The traditional model for kernel offloading consists of three phases:
• copy-in: data transfer from host memory (main memory) to device memory (GPU global memory),
• kernel: execution of the process iteration (kernel) on the GPU,
• copy-out: data transfer from device memory to host memory.
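The three phases can be made concrete with a host-only emulation in C. This is a sketch under stated assumptions and not CUDA code: two heap buffers play the role of device memory, memcpy stands in for the host-to-device and device-to-host transfers, and kernel_stub is a placeholder that negates each sample instead of computing a DCT. The names offload_sync, kernel_stub, and demo_offload, as well as the block count and the 8x8 block size, are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define H_NUM_BLOCKS 4    /* hypothetical number of blocks per row token */
#define BLOCK_SIZE   64   /* assumed 8x8 samples per block               */

/* Placeholder for the accelerator kernel: negates every sample. */
static void kernel_stub(const short *in, short *out, int n) {
    for (int i = 0; i < n; i++)
        out[i] = (short)(-in[i]);
}

/* Blocking three-phase offload of n samples stored in row.
 * Returns 0 on success and -1 if a buffer cannot be allocated. */
int offload_sync(short *row, int n) {
    assert(n > 0);
    size_t bytes = (size_t)n * sizeof(short);
    short *dev_in  = malloc(bytes);   /* emulated device input buffer  */
    short *dev_out = malloc(bytes);   /* emulated device output buffer */
    if (dev_in == NULL || dev_out == NULL) {
        free(dev_in);
        free(dev_out);
        return -1;
    }
    memcpy(dev_in, row, bytes);        /* (1) copy-in: host to "device"  */
    kernel_stub(dev_in, dev_out, n);   /* (2) kernel execution           */
    memcpy(row, dev_out, bytes);       /* (3) copy-out: "device" to host */
    free(dev_in);
    free(dev_out);
    return 0;
}

/* Fills one row token, offloads it, and checks the result.
 * Returns 1 if every sample was transformed as expected,
 * 0 on a mismatch, and -1 if the offload itself failed. */
int demo_offload(void) {
    enum { N = H_NUM_BLOCKS * BLOCK_SIZE };
    short row[N];
    for (int i = 0; i < N; i++)
        row[i] = (short)i;
    if (offload_sync(row, N) != 0)
        return -1;
    for (int i = 0; i < N; i++)
        if (row[i] != (short)(-i))
            return 0;
    return 1;
}
```

Because offload_sync returns only after the copy-out has finished, the caller is stalled for the whole duration of all three phases, which is the behavior that makes this style of offloading blocking.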
In compiler-assisted parallelization frameworks, kernel offloading is traditionally realized with a drop-in code replacement, that is, the substitution of the sequential code on the host side with the kernel offloading mechanism. We refer to this mechanism as Synchronous Offloading of a Kernel (SOK). The SOK mechanism is illustrated for the function call Y in Figure 5.1.
(a) CPU
//PPN Node T' Process Body:
//READ FROM CHANNEL
pop(C1, &row);
//EXECUTE
for (f = 0; f < NumFrames; f++)
    for (is = 0; is < VNumBlocks; is++) {
        DT: Y(&row, &row);
    }
//WRITE
push(C2, &row);

(b) GPU (drop-in code replacement)
(1) memcpyH2D
cudaMemcpy(gBlocksIn, row, size, cudaMemcpyHostToDevice);
(2) CUDA process statement T (mainDCT)
DCTKernel<<<HNumBlocks, ThreadsPerBlock>>>(gBlocksIn, gBlocksOut);
(3) memcpyD2H
cudaMemcpy(row, gBlocksOut, size, cudaMemcpyDeviceToHost);
Figure 5.1: Blocking kernel offloading using the drop-in code replacement.
The body of the PPN process T' is shown in Figure 5.1(a). At each iteration (f, is), T' executes statement DT. The statement DT invokes the function Y that implements the DCT processing. It takes as input argument a variable of type row, which is an array of HNumBlocks block elements, transforms it, and returns the resulting row. To execute the DCT on the GPU, the call to function Y is simply replaced by the drop-in code replacement shown in Figure 5.1(b), which comprises the three steps of synchronous kernel offloading. First, we transfer the input data from the variable row in the host memory (main memory) to the variable gBlocksIn in the GPU memory using the CUDA API call cudaMemcpy, indicating the direction as