Figure 2.3: Multi-GPU NUMA system targeted in this thesis.
cation channels with the memory modules (AMD Socket G34 supports up to 4 channels). Each channel uses a 64-bit bus and several channels can be combined to build a wider bus (i.e., ganged mode), although this mode is not commonly used. Read and write accesses to the SDRAM are burst oriented; an access starts at a selected location and consists of a total of four or eight data words. For write operations a mask can be used to specify which individual bytes are actually written to memory.
Conversely, discrete GPUs use GDDR5 (i.e., Graphics DDR version 5), which presents some differences with respect to DDR3. GDDR5 devices are always soldered directly onto the board and are not mounted on a memory module (e.g., DIMM). They operate at a much higher frequency than DDR devices. Each channel uses a 32-bit bus, and channels are always ganged by the GPU memory controller to form a much wider bus (up to 512 bits). Bursts in GDDR5 always have a size of eight data words.
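The bus widths and burst lengths described above translate directly into peak bandwidth and burst size. The following is a minimal sketch of that arithmetic. The transfer rates (1866 MT/s for DDR3 and 4 GT/s per pin for GDDR5) are illustrative assumptions that do not come from the text, and the function names are hypothetical.

```python
def peak_bandwidth_gbps(bus_width_bits: int, transfers_per_second: float) -> float:
    """Peak bandwidth in GB/s (10^9 bytes/s) of a bus of the given width."""
    return bus_width_bits / 8 * transfers_per_second / 1e9


def burst_bytes(channel_width_bits: int, burst_length: int) -> int:
    """Bytes moved by one burst on a single channel."""
    return channel_width_bits // 8 * burst_length


# Assumed data rates (NOT from the text): DDR3-1866 and 4 GT/s GDDR5.
ddr3 = peak_bandwidth_gbps(2 * 64, 1866e6)  # two 64-bit channels
gddr5 = peak_bandwidth_gbps(512, 4e9)       # sixteen ganged 32-bit channels

print(f"DDR3, 2 channels: {ddr3:.1f} GB/s")
print(f"GDDR5, 512-bit:   {gddr5:.1f} GB/s")
print("DDR3 burst of 8 words: ", burst_bytes(64, 8), "bytes")
print("GDDR5 burst of 8 words:", burst_bytes(32, 8), "bytes")
```

Under these assumed rates, the results (roughly 30 GB/s and 256 GB/s) are in line with the approximate figures listed in Table 2.1.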
2.2 Multi-GPU systems
The base system targeted in this thesis is composed of several discrete GPUs connected to the system through a PCI Express (i.e., PCIe) interconnect (Figure 2.3). Since the Fermi microarchitecture [46], NVIDIA GPUs can access not only their own memories but also other memories through the PCIe interconnect (i.e., GPUDirect [5]) without host code intervention. At the same time, NVIDIA introduced a single Unified Virtual Memory Address Space (i.e., UVAS) for all the GPUs in the system. The goal of UVAS is to give every object in the system, no matter which physical memory it resides in, a unique virtual address to be used by application pointers. Combining these two features allows regular load/store instructions to transparently generate local or remote requests, based on the translation of the
CHAPTER 2. REFERENCE HARDWARE AND SOFTWARE ENVIRONMENT
virtual address to the physical location of the data being accessed. Several of the designs proposed in this thesis take advantage of the remote memory access mechanism to implement shared memory programming on multiple GPUs. As far as we know, AMD does not currently support remote memory access between GPU memories.
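The way a unified address space lets one load/store instruction resolve to either a local or a remote request can be illustrated with a toy model. This is only a conceptual sketch: the classes, method names, and address layout are hypothetical and do not describe how the hardware or the CUDA driver actually implements UVAS.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """A contiguous range of the unified virtual address space."""
    base: int
    size: int
    owner: str  # physical memory holding the range, e.g. "gpu0" or "host"


class UnifiedAddressSpace:
    """Toy model: every allocation receives a system-wide unique address."""

    def __init__(self):
        self._regions = []
        self._next_base = 0x1000  # arbitrary starting address (assumption)

    def allocate(self, owner, size):
        """Reserve `size` bytes in the memory of `owner`; return the address."""
        base = self._next_base
        self._regions.append(Region(base, size, owner))
        self._next_base += size
        return base

    def owner_of(self, address):
        """Translate a virtual address to the memory that physically holds it."""
        for region in self._regions:
            if region.base <= address < region.base + region.size:
                return region.owner
        raise ValueError(f"unmapped address {address:#x}")

    def classify_access(self, issuing_device, address):
        """A load/store is local only if the issuer owns the target memory."""
        return "local" if self.owner_of(address) == issuing_device else "remote"


uvas = UnifiedAddressSpace()
a = uvas.allocate("gpu0", 4096)
b = uvas.allocate("gpu1", 4096)
print(uvas.classify_access("gpu0", a + 16))
print(uvas.classify_access("gpu0", b + 16))
```

The point of the model is that the issuing code uses plain addresses in both cases; only the translation step decides whether the request stays in local memory or crosses the interconnect.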
2.2.1 Interconnect
Practically all current discrete GPUs use the PCIe interconnect to interface with the host. However, interconnects that offer higher bandwidth have been announced.
PCI Express
PCIe is an expansion bus that uses a point-to-point topology to connect devices to the root complex [76]. Each device has its own link, which can contain several lanes (links of 1 to 16 lanes are currently common). The root complex can interface with PCIe endpoints (i.e., devices) or with switches that multiplex a link among several devices. This flexibility allows complex device hierarchies to be built. Data is transmitted in packets that can be striped across lanes; therefore, bandwidth increases with the number of lanes. Moreover, PCIe supports full-duplex communication between any two endpoints. Current PCIe 3.0 provides up to 16 GB/s per direction on a 16-lane link. The encoding used to transmit data in previous PCIe versions reduced the effective bandwidth by 20%, but PCIe 3.0 uses a different encoding that introduces an overhead of only about 2%.
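The relation between lanes, encoding, and bandwidth can be made concrete with a short calculation. The sketch below is illustrative only: the per-lane signalling rates and line codes (8b/10b and 128b/130b) are taken from the PCIe specifications rather than from the text above, and the names are hypothetical.

```python
# Per-lane signalling rate (GT/s) and line-code efficiency per generation.
# These values come from the PCIe specifications, not from the text.
PCIE_GENERATIONS = {
    "1.x": (2.5, 8 / 10),
    "2.0": (5.0, 8 / 10),     # 8b/10b encoding: 20% overhead
    "3.0": (8.0, 128 / 130),  # 128b/130b encoding: about 1.5% overhead
}


def pcie_bandwidth_gbps(generation: str, lanes: int) -> float:
    """Peak bandwidth per direction in GB/s, before packet header overhead."""
    rate_gt, efficiency = PCIE_GENERATIONS[generation]
    return lanes * rate_gt * efficiency / 8  # 8 bits per byte


for gen in ("2.0", "3.0"):
    for lanes in (1, 4, 16):
        bw = pcie_bandwidth_gbps(gen, lanes)
        print(f"PCIe {gen} x{lanes}: {bw:.2f} GB/s per direction")
```

For 16 lanes this gives 8 GB/s for PCIe 2.0 and slightly under 16 GB/s for PCIe 3.0, which matches the per-direction figures quoted in this section.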
Standard PCIe packets support headers of up to 30 bytes and payloads of up to 4 KB (although most PCIe controllers in current processors limit the payload to 256 bytes or less [50]). The overhead imposed by the headers makes PCIe poorly suited to very small transfers. On the other hand, transfers larger than the maximum payload size are split by the PCIe controller into several packets (which is more efficient than splitting the transfer into smaller transfers in software). Another source of overhead is that PCIe transactions cannot use pageable host memory. To solve this problem, the GPU driver copies data to temporary non-pageable (or pinned) buffers before transmitting it to the GPU memory. While double-buffering can be used to hide this cost, it is only effective for large data transfers. Figure 2.4 shows the achieved memory transfer rates for different data sizes in a system with an NVIDIA C2050 and a PCIe 2.0 x16 interconnect. Lines with circles represent transfers using pinned memory buffers instead of regular pageable allocations.
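The effect of header overhead on small transfers can be estimated with a simple model. This is a rough sketch that uses the 30-byte header and 256-byte payload limit quoted above and deliberately ignores every other protocol cost (acknowledgements, flow control, latency); the function names are hypothetical.

```python
import math

HEADER_BYTES = 30        # worst-case header size quoted in the text
MAX_PAYLOAD_BYTES = 256  # typical controller payload limit quoted in the text


def packets_needed(transfer_bytes: int, max_payload: int = MAX_PAYLOAD_BYTES) -> int:
    """Number of packets the controller needs for one transfer."""
    return math.ceil(transfer_bytes / max_payload)


def link_efficiency(transfer_bytes: int,
                    max_payload: int = MAX_PAYLOAD_BYTES,
                    header: int = HEADER_BYTES) -> float:
    """Fraction of the bytes on the link that are payload."""
    wire_bytes = transfer_bytes + packets_needed(transfer_bytes, max_payload) * header
    return transfer_bytes / wire_bytes


for size in (4, 64, 256, 4096):
    print(f"{size:5d} B: {packets_needed(size):3d} packet(s), "
          f"{link_efficiency(size):.1%} payload")
```

In this model a 4-byte transfer spends most of the link on headers, while any transfer that fills its packets settles at about 90% payload, which is why only small transfers are penalized.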
[Figure 2.4 plot. x-axis: transfer size (4 KB to 64 MB); y-axis: bandwidth (0 to 7 GB/s); series: Read (pinned), Write (pinned), Read, Write.]
Figure 2.4: PCIe transfer rates for different data sizes.
Memory                               Latency    Bandwidth
DDR3 (CPU, 2 channels)               ∼50 ns     ∼30 GBps
GDDR5 (GPU, 4-8 channels)            ∼500 ns    ∼250 GBps
HBM                                  ∼500 ns    500-1000 GBps

Interconnect                         Latency    Bandwidth
HyperTransport 3                     ∼40 ns     25+25 GBps
QPI                                  ∼40 ns     25+25 GBps
PCIe 2.0                             ∼200 ns    8+8 GBps
PCIe 3.0                             ∼200 ns    16+16 GBps
NVLink (4 interconnection points)    -          80 GBps

Table 2.1: Memory and interconnection network characteristics.
NVLink
NVIDIA will introduce a new interconnect called NVLink [13] in the Pascal GPU family (to be released in 2016), in an effort to replace PCIe with a faster bus. NVLink also uses a point-to-point bus and includes one or several NVLink interconnection points in each device. Devices can bond several interconnection points together to increase the bandwidth. Preliminary studies show 80 GB/s per direction for a device using four interconnection points.
2.2.2 GPU NUMA systems
The interconnects and memories in the system have different latency and bandwidth characteristics (Table 2.1). GPUs access their local memory (Figure 2.3, arc a) at full bandwidth (e.g., ∼ 250 GBps in GDDR5). They
access other memories in the system through the PCIe interconnect: host memory (arc b) or another GPU memory (arc c)1. Remote accesses can
traverse any PCIe switch found between the client and the server GPUs. Both CPU memory and inter-GPU interconnects such as PCIe deliver a bandwidth that is an order of magnitude lower than that of the local GPU memory (e.g., ∼ 12 GBps in PCIe 3.0), thus creating a Non-Uniform Memory Access (i.e., NUMA) system. Future interconnects such as NVLink will increase the bandwidth, but memory interfaces with higher bandwidths have been announced too (HBM [6] will deliver up to 1 TB/s).
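The impact of this bandwidth gap can be approximated with a simple back-of-the-envelope model. The sketch below uses the ∼250 GBps and ∼12 GBps figures quoted above and assumes, as a simplification that is not stated in the text, that local and remote transfers do not overlap; the function name is hypothetical.

```python
LOCAL_GBPS = 250.0  # local GDDR5 bandwidth (Table 2.1)
REMOTE_GBPS = 12.0  # effective PCIe 3.0 bandwidth quoted in the text


def effective_bandwidth_gbps(remote_fraction: float,
                             local: float = LOCAL_GBPS,
                             remote: float = REMOTE_GBPS) -> float:
    """Average bandwidth when `remote_fraction` of the bytes are remote.

    Simplifying assumption: local and remote transfers are serialized,
    so the time per byte is the weighted sum of both costs.
    """
    if not 0.0 <= remote_fraction <= 1.0:
        raise ValueError("remote_fraction must be between 0 and 1")
    time_per_gb = (1.0 - remote_fraction) / local + remote_fraction / remote
    return 1.0 / time_per_gb


for fraction in (0.0, 0.01, 0.1, 0.5, 1.0):
    bw = effective_bandwidth_gbps(fraction)
    print(f"{fraction:4.0%} remote: {bw:6.1f} GB/s")
```

Under this model, a kernel that fetches just 10% of its bytes remotely already sees roughly one third of the local bandwidth, which illustrates why data placement matters in a NUMA system.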
Remote access characteristics
Besides the lower bandwidth and increased latency, remote accesses present some other peculiarities compared to regular memory accesses. They are not cached in the regular memory hierarchy because modifications from different GPUs to the same cache line could produce coherence problems. However, remote accesses to data that is only read in the computation can be safely cached in the read-only (R/O) cache (currently, programmers must specify this by using the __ldg intrinsic provided by CUDA).