The Parboil benchmark suite is a set of benchmarks designed to measure the performance of heterogeneous systems formed by general-purpose processors and GPUs [IMP]. It is used to evaluate most of the systems proposed in the remainder of this dissertation. This section briefly describes the suite, since this information is common to most of the following chapters.
3.5.1 Benchmark Description
MRI-Q
Magnetic Resonance Imaging Q (MRI-Q) is a computation of a matrix Q, representing the scanner configuration, used in a 3D magnetic resonance image reconstruction algorithm in non-Cartesian space.
MRI-FHD
Magnetic Resonance Imaging FHD (MRI-FHD) is a computation of an image-specific matrix $F^{H}d$, used in a 3D magnetic resonance image reconstruction algorithm.
                 Host to Device             Device to Host
Benchmark    Total (MB)  # Transfers    Total (MB)  # Transfers
MRI-Q              3.05            7          2.00            3
MRI-FHD            3.07           11          2.02            4
CP                 0.61           10          1.00            1
SAD                0.05            1          8.49            1
TPACF              4.73            2          0.03            1
PNS                0.00            0          0.02          224
RPES              61.86            2          4.15            1
Table 3.3: Data transfers in the Parboil benchmark suite
CP
Coulombic Potential (CP) computes the coulombic potential at each grid point over one plane in a 3D grid in which point charges have been randomly distributed. It is adapted from the ‘cionize’ benchmark in VMD.
SAD
Sum of Absolute Differences (SAD) is the sum of absolute differences kernel used in MPEG video encoders. It is based on the full-pixel motion estimation algorithm found in the JM reference H.264 video encoder.
TPACF
Two Point Angular Correlation Function (TPACF) is an equation used here as a way to measure the probability of finding an astronomical body at a given angular distance from another astronomical body.
PNS
Petri Net Simulation (PNS) implements a generic algorithm for Petri net simulation. Petri nets are commonly used to model distributed systems.
RPES
Rys Polynomial Equation Solver (RPES) calculates 2-electron repulsion integrals, which represent the Coulomb interaction between electrons in molecules.
3.5.2 Characterization
For the purposes of this dissertation, the characterization of data transfers between the CPU and the GPU is the most important factor. Table 3.3 summarizes the data transfers between CPU (host) and GPU (device) for all Parboil benchmarks, obtained through application profiling.
Parboil benchmarks can be categorized into small-size transfer benchmarks (PNS) and large-size transfer benchmarks (MRI-Q, MRI-FHD, and RPES). SAD and TPACF fit into both categories. SAD performs small-size data transfers from CPU to GPU, but large-size transfers from GPU to CPU. Analogously, TPACF performs large-size data transfers from CPU to GPU and small-size data transfers in the other direction. Finally, CP is a medium-size transfer benchmark, where data transfers are of moderate size in both directions. All benchmarks except PNS perform few data transfers between CPU and GPU. PNS requires 224 data transfers from the GPU to the CPU.
The ratio of transferred data size to the number of data transfers can be used to estimate the efficiency of data transfers. The larger this ratio, the more efficiently data transfers are performed. SAD, TPACF, and RPES are the benchmarks that perform the largest data transfers and, therefore, best utilize the PCIe bandwidth. PNS and CP, on the other hand, perform small data transfers, so the data transfer bandwidth they achieve is relatively low.
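As an illustration of this ratio, the following sketch computes the average size of a data transfer for each benchmark from the values in Table 3.3. Combining both transfer directions into a single ratio is an assumption made here for brevity, and the names `TABLE_3_3`, `avg_transfer_mb`, and `overall_avg_mb` are hypothetical. Because the table values are rounded to two decimal places, the results are approximate.

```python
# Table 3.3: benchmark -> (host-to-device MB, # transfers,
#                          device-to-host MB, # transfers)
TABLE_3_3 = {
    "MRI-Q":   (3.05, 7, 2.00, 3),
    "MRI-FHD": (3.07, 11, 2.02, 4),
    "CP":      (0.61, 10, 1.00, 1),
    "SAD":     (0.05, 1, 8.49, 1),
    "TPACF":   (4.73, 2, 0.03, 1),
    "PNS":     (0.00, 0, 0.02, 224),
    "RPES":    (61.86, 2, 4.15, 1),
}


def avg_transfer_mb(total_mb, transfers):
    """Average size of one transfer in MB; 0.0 when there are no transfers."""
    return total_mb / transfers if transfers else 0.0


def overall_avg_mb(row):
    """Assumption: merge both directions into one size-per-transfer ratio."""
    h2d_mb, h2d_n, d2h_mb, d2h_n = row
    return avg_transfer_mb(h2d_mb + d2h_mb, h2d_n + d2h_n)


# Benchmarks ordered from largest to smallest average transfer size.
ranking = sorted(TABLE_3_3, key=lambda name: overall_avg_mb(TABLE_3_3[name]),
                 reverse=True)

for name in ranking:
    print(f"{name:8s} {overall_avg_mb(TABLE_3_3[name]):9.4f} MB/transfer")
```

Under this combined ratio, RPES, SAD, and TPACF rank highest and CP and PNS rank lowest, which agrees with the classification given above.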
Chapter 4
Programmability of Heterogeneous Parallel Systems
4.1 Introduction
This chapter discusses programmability issues of heterogeneous parallel systems from the application programmer’s perspective. It identifies two major programmability problems in current heterogeneous parallel systems: separate CPU and accelerator memories, and disjoint virtual address spaces for CPUs and accelerators. The former harms programmability by requiring application programmers to explicitly copy data structures between system and accelerator memories. CPU-accelerator communication through memory copy routines prevents by-reference parameter passing to accelerator calls, so all parameters must be passed by value. Moreover, this change in parameter passing semantics might harm application performance due to the cost of data marshalling (i.e., collecting the data to be copied) and of the memory copy itself. Disjoint virtual system and accelerator address spaces increase code complexity because data structures are referenced by different virtual memory addresses (i.e., pointers) in CPU and accelerator code. Such a constraint might also introduce performance penalties in the accelerator call.
This chapter first presents three different models for data transfers between accelerator and system memory: per-call, double-buffered, and accelerator-hosted. The first two models are currently in use and assume separate CPU and accelerator memories. Hence, in these two models, data communication between CPU and accelerator happens through memory copy routines between system and accelerator memories. These memory copy routines introduce a trade-off between programmability and performance. The third model, accelerator-hosted, is a contribution of this thesis. In this model, accelerator memory is accessible from CPU code, and data structures required by the accelerator are hosted only in accelerator memory. The accelerator-hosted model might increase accelerator memory capacity requirements, but it greatly improves the programmability of heterogeneous parallel systems. This chapter shows how the accelerator-hosted model allows programmers to write simple CPU code for heterogeneous parallel systems that resembles the code they would write for a homogeneous system (i.e., one without accelerators). Hardware and software implementation alternatives for this model are presented in Chapters 5 and 6, respectively.
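The difference between the per-call and accelerator-hosted models can be sketched with a toy simulation. This is not the implementation presented in later chapters: `SimulatedAccelerator`, `scale_kernel`, and the buffer name `"buf"` are hypothetical, accelerator memory is modelled as a Python dictionary, and the cost of CPU accesses to accelerator memory is ignored. The double-buffered model is omitted for brevity.

```python
class SimulatedAccelerator:
    """Toy stand-in for an accelerator with its own memory (hypothetical)."""

    def __init__(self):
        self.memory = {}       # buffer name -> values held in accelerator memory
        self.copied_items = 0  # elements moved by explicit copy routines

    def copy_to_accelerator(self, name, host_data):
        self.memory[name] = list(host_data)
        self.copied_items += len(host_data)

    def copy_to_host(self, name):
        self.copied_items += len(self.memory[name])
        return list(self.memory[name])

    def scale_kernel(self, name, factor):
        """Stands in for an accelerator call that updates a buffer in place."""
        self.memory[name] = [x * factor for x in self.memory[name]]


def per_call(acc, host_data, factor, calls):
    """Per-call model: copy inputs in and results out around every call."""
    for _ in range(calls):
        acc.copy_to_accelerator("buf", host_data)
        acc.scale_kernel("buf", factor)
        host_data = acc.copy_to_host("buf")
    return host_data


def accelerator_hosted(acc, values, factor, calls):
    """Accelerator-hosted model: the data lives only in accelerator memory,
    and CPU code reads and writes it there directly."""
    acc.memory["buf"] = list(values)  # CPU initializes the buffer in place
    for _ in range(calls):
        acc.scale_kernel("buf", factor)
    return acc.memory["buf"]          # CPU reads the results in place


acc_a, acc_b = SimulatedAccelerator(), SimulatedAccelerator()
result_a = per_call(acc_a, [1, 2, 3], 2, calls=3)
result_b = accelerator_hosted(acc_b, [1, 2, 3], 2, calls=3)
print(result_a, acc_a.copied_items)  # [8, 16, 24] 18
print(result_b, acc_b.copied_items)  # [8, 16, 24] 0
```

Both functions produce the same result, but only the per-call version contains explicit copy calls in the CPU code. The accelerator-hosted version reads like code written for a system without accelerators.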
Finally, the implications of a unified virtual address space for accelerator code are presented. This chapter argues that a unified virtual address space that includes both system and accelerator memories allows accelerator code to be written in a more straightforward way, eases the use of data structures with embedded pointers (e.g., linked lists) in the application code, and allows for potential performance gains. This chapter also outlines potential hardware and software approaches to build this unified virtual address space from physically separate system and accelerator memories.
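The difficulty that embedded pointers pose under disjoint address spaces can be illustrated with a small simulation. This is a sketch under stated assumptions: memory is modelled as a dictionary from address to node, and the base addresses `HOST_BASE` and `ACC_BASE`, the node size `NODE_SIZE`, and all function names are hypothetical.

```python
HOST_BASE = 0x1000  # hypothetical address of the list in system memory
ACC_BASE = 0x8000   # hypothetical address of its copy in accelerator memory
NODE_SIZE = 16      # hypothetical size of one list node in bytes
NULL = 0


def build_list(values, base):
    """Lay out a singly linked list as {address: (value, next_address)}."""
    mem = {}
    for i, value in enumerate(values):
        addr = base + i * NODE_SIZE
        nxt = addr + NODE_SIZE if i + 1 < len(values) else NULL
        mem[addr] = (value, nxt)
    return mem


def copy_raw(host_mem):
    """Plain memory copy: nodes move to accelerator addresses, but their
    embedded next pointers still hold system memory addresses."""
    delta = ACC_BASE - HOST_BASE
    return {addr + delta: node for addr, node in host_mem.items()}


def copy_translated(host_mem):
    """Copy that also rewrites every embedded pointer (extra marshalling)."""
    delta = ACC_BASE - HOST_BASE
    return {addr + delta: (value, nxt + delta if nxt != NULL else NULL)
            for addr, (value, nxt) in host_mem.items()}


def traverse(mem, head):
    """Follow next pointers from head; KeyError signals a dangling pointer."""
    values = []
    addr = head
    while addr != NULL:
        value, addr = mem[addr]
        values.append(value)
    return values


host_list = build_list([10, 20, 30], HOST_BASE)

try:
    traverse(copy_raw(host_list), ACC_BASE)
    raw_copy_usable = True
except KeyError:
    raw_copy_usable = False

print(raw_copy_usable)                                  # False
print(traverse(copy_translated(host_list), ACC_BASE))   # [10, 20, 30]
```

With a unified virtual address space, the pointers stored in the list are valid in both CPU and accelerator code, so the translation step in `copy_translated` is not needed.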