The Parboil Benchmark Suite - On the programmability of heterogeneous massively-parallel computing systems

The Parboil benchmark suite is a set of benchmarks designed to measure the performance of heterogeneous systems formed by general-purpose processors and GPUs [IMP]. It is used to evaluate most of the systems proposed in the remainder of this dissertation. This section briefly describes the suite, since this information is common to most of the following chapters.

3.5.1 Benchmark Description

MRI-Q

Magnetic Resonance Imaging Q (MRI-Q) is a computation of a matrix Q, representing the scanner configuration, used in a 3D magnetic resonance image reconstruction algorithm in non-Cartesian space.

MRI-FHD

Magnetic Resonance Imaging FHD (MRI-FHD) is a computation of an image-specific matrix F^H d, used in a 3D magnetic resonance image reconstruction algorithm in non-Cartesian space.

             Host to Device             Device to Host
Benchmark    Total (MB)  # Transfers    Total (MB)  # Transfers
MRI-Q            3.05          7            2.00          3
MRI-FHD          3.07         11            2.02          4
CP               0.61         10            1.00          1
SAD              0.05          1            8.49          1
TPACF            4.73          2            0.03          1
PNS              0.00          0            0.02        224
RPES            61.86          2            4.15          1

Table 3.3: Data transfers in the Parboil benchmark suite

CP

Coulombic Potential (CP) computes the coulombic potential at each grid point over one plane in a 3D grid in which point charges have been randomly distributed. It is adapted from the ‘cionize’ benchmark in VMD.
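The computation described above can be sketched as a direct summation over all point charges for every grid point of one plane. This is a minimal illustration, not the Parboil or VMD implementation: the function name, the argument layout, and the choice of unit Coulomb constant are assumptions made for the example.

```python
import math


def coulombic_potential_slice(charges, nx, ny, spacing, z):
    """Potential at each point of an nx-by-ny grid plane at height z.

    charges: list of (x, y, z, q) point charges (hypothetical layout).
    The Coulomb constant is taken as 1, so results are in arbitrary units.
    """
    grid = [[0.0] * nx for _ in range(ny)]
    for j in range(ny):
        for i in range(nx):
            px, py = i * spacing, j * spacing
            total = 0.0
            for cx, cy, cz, q in charges:
                r = math.sqrt((px - cx) ** 2 + (py - cy) ** 2 + (z - cz) ** 2)
                # Assumes no charge sits exactly on a grid point (r > 0).
                total += q / r
            grid[j][i] = total
    return grid
```

Every grid point is independent of the others, which is what makes this kind of computation a natural fit for a massively parallel accelerator.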

SAD

Sum of Absolute Differences (SAD) is the sum-of-absolute-differences kernel used in MPEG video encoders. It is based on the full-pixel motion estimation algorithm found in the JM reference H.264 video encoder.
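As an illustration of the kernel, the sketch below computes the SAD between two pixel blocks and uses it in an exhaustive full-pixel search for the best-matching position of a block inside a frame. It is a simplified example, not the JM encoder code: the function names and the list-of-rows frame representation are assumptions.

```python
def sad(block_a, block_b):
    """Sum of absolute differences between two equally sized pixel blocks."""
    return sum(
        abs(a - b)
        for row_a, row_b in zip(block_a, block_b)
        for a, b in zip(row_a, row_b)
    )


def best_match(block, frame):
    """Exhaustive search: return (cost, x, y) of the lowest-SAD position."""
    n = len(block)  # square block of n-by-n pixels
    best = None
    for y in range(len(frame) - n + 1):
        for x in range(len(frame[0]) - n + 1):
            candidate = [row[x:x + n] for row in frame[y:y + n]]
            cost = sad(block, candidate)
            if best is None or cost < best[0]:
                best = (cost, x, y)
    return best
```

Each candidate position produces one SAD value, which is consistent with the large device-to-host transfer reported for SAD in Table 3.3.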

TPACF

Two Point Angular Correlation Function (TPACF) is used here to measure the probability of finding an astronomical body at a given angular distance from another astronomical body.
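The core of such a measurement is counting pairs of bodies by angular separation. The sketch below shows only this pair-counting step, assuming that each body is given as a unit vector on the celestial sphere and that the histogram bins are supplied by the caller. The function name and data layout are hypothetical, and the normalization against random catalogs used in a full correlation function is omitted.

```python
import math


def angular_separation_histogram(points, bin_edges):
    """Count point pairs per angular-separation bin.

    points: unit vectors (x, y, z); bin_edges: ascending angles in radians.
    """
    counts = [0] * (len(bin_edges) - 1)
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            dot = sum(a * b for a, b in zip(points[i], points[j]))
            # Clamp to [-1, 1] to guard against rounding before acos.
            angle = math.acos(max(-1.0, min(1.0, dot)))
            for k in range(len(counts)):
                if bin_edges[k] <= angle < bin_edges[k + 1]:
                    counts[k] += 1
                    break
    return counts
```

The input is the full set of positions while the output is a small histogram, which matches the large host-to-device and small device-to-host transfers listed for TPACF in Table 3.3.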

PNS

Petri Net Simulation (PNS) implements a generic algorithm for Petri net simulation. Petri nets are commonly used to model distributed systems.
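For readers unfamiliar with Petri nets, the following sketch shows the basic simulation loop: a transition is enabled when every input place holds enough tokens, and firing it moves tokens from input places to output places. This is a generic illustration and not the PNS benchmark code; the representation of a transition as a pair of dictionaries and all function names are assumptions.

```python
import random


def enabled(marking, transition):
    """A transition is enabled if each input place holds enough tokens."""
    inputs, _ = transition
    return all(marking.get(place, 0) >= n for place, n in inputs.items())


def fire(marking, transition):
    """Return the new marking after consuming inputs and producing outputs."""
    inputs, outputs = transition
    new_marking = dict(marking)
    for place, n in inputs.items():
        new_marking[place] -= n
    for place, n in outputs.items():
        new_marking[place] = new_marking.get(place, 0) + n
    return new_marking


def simulate(marking, transitions, steps, seed=0):
    """Fire randomly chosen enabled transitions for at most `steps` steps."""
    rng = random.Random(seed)
    for _ in range(steps):
        ready = [t for t in transitions if enabled(marking, t)]
        if not ready:
            break  # deadlock: no transition can fire
        marking = fire(marking, rng.choice(ready))
    return marking
```

For example, a net with a single transition that moves one token from place "a" to place "b" drains "a" completely and then stops.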

RPES

Rys Polynomial Equation Solver (RPES) calculates 2-electron repulsion integrals, which represent the Coulomb interaction between electrons in molecules.

3.5.2 Characterization

For the purposes of this dissertation, the characterization of data transfers between the CPU and the GPU is the most important factor. Table 3.3 summarizes the data transfers between CPU (host) and GPU (device) for all Parboil benchmarks, obtained through application profiling.

Parboil benchmarks can be categorized into small-size transfer benchmarks (PNS) and large-size transfer benchmarks (MRI-Q, MRI-FHD, and RPES). SAD and TPACF fit into both categories: SAD performs small-size data transfers from CPU to GPU but large-size transfers from GPU to CPU, whereas TPACF performs large-size data transfers from CPU to GPU and small-size data transfers in the other direction. Finally, CP is a medium-size transfer benchmark, where data transfers are of moderate size in both directions. All benchmarks except PNS perform few data transfers between CPU and GPU; PNS requires 224 data transfers from the GPU to the CPU.

The ratio of the total transferred data size to the number of data transfers, that is, the average transfer size, can be used to estimate the efficiency of data transfers. The larger this ratio, the more efficiently data is transferred. SAD, TPACF, and RPES are the benchmarks that perform the largest data transfers and, therefore, best utilize the PCIe bandwidth. PNS and CP, on the other hand, perform small data transfers, so the data transfer bandwidth they achieve is relatively low.
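The ratio discussed above can be computed directly from the values in Table 3.3. The sketch below does so for each benchmark and direction; the dictionary layout and function names are assumptions made for this example, and a direction with no transfers is reported as None rather than as a ratio.

```python
# Per benchmark: (host-to-device MB, # transfers, device-to-host MB, # transfers),
# copied from Table 3.3.
TABLE_3_3 = {
    "MRI-Q": (3.05, 7, 2.00, 3),
    "MRI-FHD": (3.07, 11, 2.02, 4),
    "CP": (0.61, 10, 1.00, 1),
    "SAD": (0.05, 1, 8.49, 1),
    "TPACF": (4.73, 2, 0.03, 1),
    "PNS": (0.00, 0, 0.02, 224),
    "RPES": (61.86, 2, 4.15, 1),
}


def average_transfer_mb(total_mb, transfers):
    """Average size of one transfer in MB, or None if there were none."""
    return total_mb / transfers if transfers else None


def summarize(table):
    """Map each benchmark to (host-to-device, device-to-host) averages."""
    return {
        name: (average_transfer_mb(h_mb, h_n), average_transfer_mb(d_mb, d_n))
        for name, (h_mb, h_n, d_mb, d_n) in table.items()
    }
```

With these figures, RPES averages 30.93 MB per host-to-device transfer, whereas PNS averages roughly 0.0001 MB per device-to-host transfer, which illustrates the gap between the two ends of the classification.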

Chapter 4

Programmability of Heterogeneous Parallel Systems

4.1 Introduction

This chapter discusses programmability issues of heterogeneous parallel systems from the application programmer’s perspective. It identifies two major programmability problems in current heterogeneous parallel systems: separate CPU and accelerator memories, and disjoint virtual address spaces for CPUs and accelerators. The former harms programmability by requiring application programmers to explicitly copy data structures between system and accelerator memories. CPU-accelerator communication through memory copy routines prevents by-reference parameter passing to accelerator calls, requiring all parameters to be passed by value. Moreover, this change in the parameter passing semantics might harm application performance due to the cost of data marshalling (i.e., collecting the data to be copied) and of the memory copy itself. Disjoint virtual system and accelerator address spaces increase the complexity of the code because data structures are referenced by different virtual memory addresses (i.e., pointers) in CPU and accelerator code. Such a constraint might also introduce performance penalties in the accelerator call.
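The by-value semantics imposed by separate memories can be illustrated with a toy model in which the accelerator owns a private memory and only sees data that has been explicitly copied into it. The Accelerator class and its methods below are hypothetical stand-ins written for this sketch; they do not correspond to any real accelerator API.

```python
class Accelerator:
    """Toy device with its own memory (hypothetical, not a real API)."""

    def __init__(self):
        self.memory = {}  # device-side buffers, addressed by name

    def copy_in(self, name, host_data):
        # Explicit host-to-device copy: the device gets its own copy.
        self.memory[name] = list(host_data)

    def copy_out(self, name):
        # Explicit device-to-host copy of the result.
        return list(self.memory[name])

    def scale(self, name, factor):
        # "Kernel": operates only on device memory.
        self.memory[name] = [factor * x for x in self.memory[name]]


def scale_on_accelerator(acc, host_data, factor):
    """Copy the input in, run the kernel, and copy the result back."""
    acc.copy_in("buf", host_data)
    acc.scale("buf", factor)
    return acc.copy_out("buf")
```

After a call such as scale_on_accelerator(acc, data, 2), the original host list is unchanged and the result exists on the host only because it was copied back. The host and device also refer to the same logical buffer through different handles, mirroring the disjoint address spaces described above.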

This chapter first presents three different models for data transfers between accelerator and system memory: per-call, double-buffered, and accelerator-hosted. The first two models are currently in use and assume separate CPU and accelerator memories. Hence, in these two models, data communication between CPU and accelerator happens through memory copy routines between system and accelerator memories. These memory copy routines present a trade-off between programmability and performance. The third model, accelerator-hosted, is a contribution of this thesis. In this model, accelerator memory is accessible from the CPU code, and data structures required by the accelerator are hosted only in accelerator memory. The accelerator-hosted model might increase the accelerator memory capacity requirements, but it greatly improves the programmability of heterogeneous parallel systems. This chapter shows how the accelerator-hosted model allows programmers to write simple CPU code for heterogeneous parallel systems that resembles the code they would produce for a homogeneous (i.e., without accelerators) system. Hardware and software implementation alternatives for this model are presented in Chapters 5 and 6, respectively.

Finally, the implications of a unified virtual address space for the accelerator code are presented. This chapter argues that a unified virtual address space that includes both system and accelerator memories allows accelerator code to be written in a more straightforward way, eases the task of using data structures with embedded pointers (e.g., linked lists) in the application code, and allows for potential performance gains. This chapter also outlines potential hardware and software approaches to build this unified virtual address space from physically separate system and accelerator memories.
