2.1 Types of Parallelism
2.2.1 Hardware Processor and Memory Hierarchies
Current GPU architectures use a two-level hierarchy of processing cores. Consider, as examples, NVIDIA’s Fermi and Kepler architectures. The GTX 580 GPU (NVIDIA, 2010, Fermi) contains 16 SMs, each containing 32 SPs, for a total of 512 cores with an aggregate 1.58 Tera-FLOPs of single-precision peak compute throughput. The GTX 680 (NVIDIA, 2012, GTX 680) contains 8 SMXs, each with 192 SPs, for a total of 1,536 cores with an aggregate 3.10 Tera-FLOPs of single-precision peak compute throughput. On the GTX 580 and GTX 680, double-precision peak throughput drops to only 197 and 129 Giga-FLOPs, respectively. This is an 8-fold and a 24-fold decrease instead of the expected two-fold, because only a few of the FPUs can handle 64-bit doubles.
Both processors execute hierarchically organized threads, which are best considered as single-instruction/multiple-data (SIMD) processes, as we will describe shortly in section 2.2.3.
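As a rough cross-check on the peak figures quoted above, the following Python sketch multiplies core count by clock rate by two floating-point operations per fused multiply-add. The core counts and double-precision peaks come from the text; the clock rates (1.544 GHz shader clock for the GTX 580, 1.006 GHz base clock for the GTX 680) and the function name peak_gflops are assumptions introduced here for illustration.

```python
# Peak single-precision throughput = cores x clock x 2 FLOPs per FMA.
# Core counts are from the text; the clock rates are ASSUMED reference
# values and are not stated in the text.

def peak_gflops(sms, cores_per_sm, clock_ghz, flops_per_cycle=2):
    """Return peak GFLOP/s, assuming one FMA (2 FLOPs) per core per cycle."""
    return sms * cores_per_sm * clock_ghz * flops_per_cycle

gtx580_sp = peak_gflops(16, 32, 1.544)   # Fermi: 16 SMs x 32 SPs
gtx680_sp = peak_gflops(8, 192, 1.006)   # Kepler: 8 SMXs x 192 SPs

# Double-precision peaks as quoted in the text (GFLOP/s).
gtx580_dp = 197.0
gtx680_dp = 129.0

for name, sp, dp in (("GTX 580", gtx580_sp, gtx580_dp),
                     ("GTX 680", gtx680_sp, gtx680_dp)):
    print(f"{name}: {sp / 1000:.2f} TFLOP/s single precision, "
          f"{sp / dp:.1f}x the double-precision peak")
```

Under these assumed clocks the sketch lands close to the quoted single-precision peaks and to the 8-fold and 24-fold gaps between single and double precision.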
Memory Types:
Each GPU supports many different types of memory (NVIDIA, 2012 Programming Guide): Registers, Shared, Global Memory, Cache, Surface, Texture, Constant, and Local, as shown in Table 2.1. The Surface, Texture, and Local memory are just special forms of global memory with extra functionality and/or behavior. From a performance point of view, Registers are the fastest form of memory, followed by Shared memory (L1 Cache), L2 Cache, and finally Global Memory. There is also even slower access to Host CPU RAM via DMA transfer. I briefly describe each type of memory in this section.
Registers
• 32K registers per SM
• Each register is 32-bit (4 bytes)
• 128KB per SM
• 2048KB aggregate total
• At least 8 TB/s peak aggregate compute throughput
Shared Memory (L1 Cache)
• 48KB shared
• 16K L1 cache
• or (16K/48K)
• 32 banks (4 bytes per bank)
• Bank conflicts
• 1.43 TB/s peak aggregate I/O throughput
Global Memory (L2 Cache)
• 1.5 GB capacity
• 6 memory controllers
• 768 KB L2 cache shared across all SMs
• 4-8 GB/s data transfer GPU ↔ CPU
• 192.4 GB/s peak aggregate I/O throughput
Constant Memory (CUDA API)
• Read-only memory
• 64 KB
• 2KB cache per SM
• Use Broadcast mode; otherwise access is serialized.
• Can be as fast as registers.
• Declare variables/arrays as constant to use.
Texture Memory (CUDA API)
• Intended to support textures for 2D and 3D graphics
• Read-only memory
• Must be properly initialized before using.
• Created out of global memory.
• Has own separate texture cache
• Supports a variety of pixel formats.
• Supports filtering/interpolation
• Supports addressing operations (clamping, tiling, mirroring)
Local Memory
• Not really a memory type but a platform behavior to deal with Register Spill.
• Local variables that exceed the number of assigned registers per thread are stored in global memory.
• These variables are said to be in Local memory.
• These variables run at global memory speeds, impacting performance.
Table 2.1 - GPU Memory Types: High level summary of the main GPU memory types.
Registers:
Each SM (or SMX) has a large pool of 32-bit (4-byte) registers, which can be flexibly partitioned and assigned directly to the concurrent threads scheduled on that SM. On the GTX 580, each SM has 32K 32-bit registers with an estimated peak aggregate throughput of up to ~9.5 TB/s for the Fused Multiply Add (FMA) instruction. On the GTX 680, each SMX has 64K 32-bit registers with an estimated peak aggregate throughput of up to ~18.5 TB/s for the FMA instruction. Most other ISA⁹ instructions tend to run at about half of these peak FMA rates, so ~4.7 TB/s and ~9.2 TB/s, respectively, are more typical.
Shared Memory:
Each SM (or SMX) has another pool of local memory. This pool, currently 64KB on both the GTX 580 and 680, can be split between two categories: L1 cache and shared memory. The split can be either 16/48, 48/16, or 32/32 KB (the last grouping on Kepler class cards only) (NVIDIA, 2012, Kepler GK110). The L1 cache memory speeds up read-only access to global memory via temporal and spatial locality. The cache-line size, also known as a warp-line, is 128 bytes (32 elements of 32 bits each). This local memory is called shared memory by NVIDIA since it is shared by all SPs on each SM (or SMX).
9 Recall that ISA stands for instruction set architecture, i.e. the various processing operations (arithmetic,
Shared memory is available as a programmable scratch pad to store local variables and arrays. This scratch pad memory allows limited communication and coordination across threads and can help speed up overall performance. The programmer has direct control over how much shared memory to request on each SM for each thread block. However, up to 8 (or 16) concurrent thread blocks per SM (or SMX) need to share the same pool of shared memory. The programmer designates the amount of shared memory as local arrays assigned to each thread block. The CUDA platform determines how many concurrent thread blocks can run at the same time based on the amount of memory the programmer requested and the amount of shared memory available. CUDA then partitions the shared memory across the concurrent thread blocks per SM. For example, if the programmer declared an array of 2,000 32-bit elements in shared memory, this would require 8,000 bytes. CUDA would decide that it could run at most 6 concurrent blocks (6 = ⌊49,152/8,000⌋, where 49,152 bytes = 48KB). CUDA could decide to lay out the 6 blocks of shared memory at starting local offsets of [0; 8,000; 16,000; 24,000; 32,000; and 40,000] respectively.
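The partitioning arithmetic in this example can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not a description of how the CUDA runtime actually allocates memory: shared memory and the per-SM block limit are treated as the only constraints, allocation granularity is ignored, and the function name plan_shared_memory is hypothetical.

```python
def plan_shared_memory(bytes_per_block, shared_bytes=48 * 1024, max_blocks_per_sm=8):
    """Return (concurrent_blocks, starting_offsets) for one SM.

    Simplified model (an assumption): only the shared-memory request and
    the per-SM block limit restrict concurrency. Registers, thread counts,
    and allocation granularity are ignored.
    """
    if bytes_per_block <= 0:
        raise ValueError("bytes_per_block must be positive")
    # Whole blocks only, so round DOWN, then apply the per-SM block limit.
    blocks = min(shared_bytes // bytes_per_block, max_blocks_per_sm)
    offsets = [i * bytes_per_block for i in range(blocks)]
    return blocks, offsets

# 2,000 elements of 32 bits (4 bytes) each = 8,000 bytes per thread block.
blocks, offsets = plan_shared_memory(2000 * 4)
print(blocks)    # 6
print(offsets)   # [0, 8000, 16000, 24000, 32000, 40000]
```

Rounding down matters here: the seventh block would need bytes 48,000 through 55,999, which exceeds the 49,152 bytes available.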
To increase throughput, shared memory is divided into 32 memory banks with a width of 4 bytes per bank. The GTX 680 also supports another addressing mode with a width of 8 bytes per bank. If more than one concurrent thread accesses the same bank at the same time, a bank conflict occurs, and the hardware stalls threads on the SM (SMX) to serialize access and enforce correct behavior. On the GTX 580 and GTX 680, the peak aggregate throughput of shared memory is 1.58 TB/s and 1.03 TB/s, respectively.
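To make the effect of access patterns on bank conflicts concrete, the Python sketch below maps each thread's byte address to a bank and counts how many serialized passes a 32-thread warp would need. It assumes the 4-byte bank mode and consecutive 32-bit words assigned to consecutive banks. It also assumes that threads reading the very same word are served by a single broadcast rather than conflicting, so it counts distinct words per bank. The helper names conflict_degree and warp_addresses are hypothetical.

```python
from collections import defaultdict

NUM_BANKS = 32    # banks per SM, from the text
BANK_WIDTH = 4    # bytes per bank (4-byte mode), from the text

def conflict_degree(byte_addresses):
    """Return the number of serialized passes needed for one warp.

    Assumption: consecutive 32-bit words live in consecutive banks, and
    threads that read the SAME word share one broadcast, so only distinct
    words falling in the same bank are counted as conflicts.
    """
    words_per_bank = defaultdict(set)
    for addr in byte_addresses:
        word = addr // BANK_WIDTH
        words_per_bank[word % NUM_BANKS].add(word)
    return max((len(words) for words in words_per_bank.values()), default=0)

def warp_addresses(stride_words, threads=32):
    """Byte addresses touched when thread t reads element t * stride_words."""
    return [t * stride_words * BANK_WIDTH for t in range(threads)]

for stride in (1, 2, 32, 33):
    print(f"stride {stride:2d}: {conflict_degree(warp_addresses(stride))}-way")
```

In this model a stride of 1 is conflict-free, a stride of 2 is 2-way conflicted, and a stride of 32 sends every thread to bank 0 (32-way). Padding the stride to 33 spreads the accesses back across all banks.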
Global Memory:
All processors have access to a large global memory on the GPU card. The GTX 580 and GTX 680 have total memory of 1.5 Gigabytes and 2.0 Gigabytes, respectively. (Thus, this is really “shared memory,” as that term is normally used in parallel programming circles, see section 2.1.2, although not in GPU terminology.) Scatter/Gather operations are supported on each SM (SMX) between registers and global memory. On the GTX 580, six memory controllers manage 768KB of read/write L2 cache that is shared among all SMs and can provide an aggregate 192.4 GB/s of peak memory I/O throughput. On each SM, there is also 16KB (or 48KB) of read-only L1 cache that is banked in the same manner as shared memory. On the GTX 680, four memory controllers manage 512KB of read/write L2 cache that is shared among all SMXs and can provide an aggregate 192.2 GB/s of peak memory I/O throughput. On each SMX, the read-only L1 cache is typically 16KB (but can also be set up as 48 or 32KB).
There are three other types of memory found on GPU cards, which I briefly describe: constant, texture, and local memory. Other than these brief overviews, I do not discuss these three specialized memory types further in this thesis.
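Before moving on to those specialized types, the global-memory bandwidth figures above can be sanity-checked with a short Python sketch that multiplies total bus width by the effective data rate. The controller counts come from the text; the 64-bit width per controller and the effective data rates (4.008 GT/s for the GTX 580, 6.008 GT/s for the GTX 680) are assumed reference values that the text does not state, and the function name peak_bandwidth_gbs is hypothetical.

```python
def peak_bandwidth_gbs(controllers, bits_per_controller, data_rate_gtps):
    """Return peak memory bandwidth in GB/s.

    bandwidth = (total bus width in bytes) x (effective transfers per second)
    """
    bus_bytes = controllers * bits_per_controller / 8
    return bus_bytes * data_rate_gtps

# Controller counts are from the text. The 64-bit controller width and the
# effective data rates are ASSUMED reference values, not taken from the text.
gtx580_gbs = peak_bandwidth_gbs(6, 64, 4.008)
gtx680_gbs = peak_bandwidth_gbs(4, 64, 6.008)

print(f"GTX 580: {gtx580_gbs:.2f} GB/s")
print(f"GTX 680: {gtx680_gbs:.2f} GB/s")
```

Under these assumptions both cards land at roughly 192 GB/s: the GTX 680 has a narrower bus but a faster effective data rate, which is consistent with the two nearly identical peak figures quoted above.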
Constant Memory:
Constant memory is a small read-only memory with its own small cache that can be used to store constant variables, which can be declared at compile time or via the CUDA API. Access to these constants must be done in broadcast mode (i.e., all threads within a warp access the exact same constant address at the same time); otherwise access is serialized, which hurts performance.
Texture Memory:
Texture memory is intended for use in the 3D graphics rendering pipeline. Texture memory is read-only and supports many pixel formats, plus a lot of special functionality, including addressing modes, filtering/interpolation, etc., that can be applied to pixel data when reading from texture memory. Since the silicon for this extra functionality has already been built into the chip to support 3D graphics, using this extra functionality for compute [...] uses its own separate cache, independent of the L1 cache used for compute operations reading from global memory. Access to texture memory is via a CUDA-specific API.
Local Memory:
Local memory is not really a memory type but a platform behavior to deal with register spill. Register spill occurs when the number of local variables assigned to a thread by the compiler exceeds the number of physically available hardware registers. NVIDIA’s solution is to store the overflow variables in global memory. Read/write access to these “Local” variables runs at global memory speeds, which hurts performance. The L1 cache can help defray the access cost, but only for read-only variables.
Memory Hierarchy:
Similar to a modern CPU memory architecture, the GPU memory architecture is arranged into a complex hierarchy of Registers, L1 Cache, Shared Memory, L2 Cache, GPU RAM, and CPU RAM. CPU cache memory is typically hidden from the CPU programmer. However, GPU shared memory is accessible to the GPU programmer and can be used to cache frequently reused data or to communicate data and coordinate behavior across the threads within a thread block. The peak throughput estimates (Volkov, 2010) above for the three main types of memory (registers, shared, global) suggest that on the GTX 580 using registers can be up to 6.0× faster than shared memory, which in turn can be up to 8.2× faster than global memory. On the GTX 680, registers can be up to 6.0× faster than shared memory, which in turn can be up to 5.4× faster than global memory. Based on these results, programmers should favor performing computations in registers over shared memory, and in shared memory over global memory. Memory transfers between CPU and GPU memory run at a much slower 4-8 GB/s peak throughput, compared to the ~192 GB/s peak throughput of global memory. As a result, programmers should minimize transfers between the CPU and GPU.
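The GTX 580 ratios and the cost of CPU-GPU transfers can be illustrated with a small Python sketch built only on the peak figures quoted in this section. The 256 MiB payload and the helper name transfer_ms are hypothetical, chosen for illustration; real transfers also incur latency and rarely reach peak rates.

```python
# Peak throughput figures for the GTX 580 as quoted in the text (GB/s).
GTX580_GBS = {"registers": 9500.0, "shared": 1580.0, "global": 192.4}
PCIE_GBS = (4.0, 8.0)   # CPU <-> GPU transfer range quoted in the text

reg_vs_shared = GTX580_GBS["registers"] / GTX580_GBS["shared"]
shared_vs_global = GTX580_GBS["shared"] / GTX580_GBS["global"]
print(f"registers vs shared: {reg_vs_shared:.1f}x")     # 6.0x
print(f"shared vs global:    {shared_vs_global:.1f}x")  # 8.2x

def transfer_ms(n_bytes, rate_gbs):
    """Idealized time in milliseconds to move n_bytes at rate_gbs (GB/s)."""
    return n_bytes / (rate_gbs * 1e9) * 1e3

payload = 256 * 1024**2   # hypothetical 256 MiB buffer
for rate in (*PCIE_GBS, GTX580_GBS["global"]):
    print(f"{rate:6.1f} GB/s -> {transfer_ms(payload, rate):6.2f} ms")
```

Even in this idealized model, moving the buffer between CPU and GPU takes between 24 and 48 times longer than streaming it once from global memory, which is why data should stay on the GPU across kernel launches where possible.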