GPU Memory Issues - Brown_unc_0153D

Memory constraints, register spills, coalescence, and bank conflicts are all GPU memory architecture issues that impact performance.

3.4.1 Memory Constraints

GPU programmers aiming for high performance must understand several concepts related to memory: access times, limited capacity, and cache constraints. As discussed in Chapter 2, GPU memory is a multi-level hierarchy of registers, shared memory, and global memory. The GPU memory architecture includes hardware support for L1 and L2 caching as well as for specialized types of memory, such as constant, texture, and local. There is also support for transferring data to and from host memory (CPU RAM) using DMA.

Access Speeds: The consecutive levels of memory have access times that differ by roughly an order of magnitude: Registers ≪ Shared memory ≪ Global Memory ≪ CPU RAM. According to Volkov (Volkov 2010), on the GTX 480 registers are 6× faster than shared memory, which in turn is 7.6× faster than global memory, which in turn is 11.1× faster than CPU RAM [9].

Limited Capacity: Registers are a scarce resource. Fermi architectures support a pool of only 32 K registers; Kepler, a pool of 64 K. Furthermore, both architectures allow a maximum of 63 registers per thread [10]. GPU programmers should expect their programs to have between 21 and 63 registers available per thread on Fermi and between 32 and 63 registers per thread on Kepler, depending on the number of threads scheduled. Shared memory is also scarce since only 16, 32, or 48 KB is available on each SM (or SMX), and it must be shared across all thread blocks concurrently running on each SM [11]. Global memory capacities, on the other hand, vary between 1 and 6 GB, depending on the specific GPU card.

9. Vasily Volkov, in a talk called “Better Performance at Lower Occupancy” given at the GPU Technology Conference in 2010 (Volkov, 2010), compared the bandwidths of registers (8 TB/s), shared memory (1.3 TB/s), and global memory (177 GB/s) on the GTX 480. This results in speedups of 6× (registers over shared memory) and 7.6× (shared memory over global memory). CPU RAM from the same generation of PCs had a quoted maximum throughput of 16 GB/s, resulting in the claimed speedup of 11.1× (global memory over CPU RAM).

10. Most Kepler devices (CC 3.0) support a maximum of 63 registers per thread, whereas the GTX Titan (CC 3.5) supports a maximum of 255 registers per thread.

11. Each SM can have up to eight concurrent thread blocks per SM running at the same time. Each SMX can
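The per-thread register ranges quoted above can be reproduced with a small calculation. The sketch below is only an illustration: the pool sizes and the 63-register cap come from the text, while the maximum resident thread counts per SM (1536 on Fermi, 2048 on Kepler) and the even split of the pool are assumptions introduced here. Real hardware allocates registers with a coarser granularity, so actual figures may differ slightly.

```python
# Pool sizes and the 63-register cap are taken from the text.
# The maximum resident thread counts are assumptions added for illustration.
REGISTER_POOL = {"Fermi": 32 * 1024, "Kepler": 64 * 1024}
MAX_THREADS_PER_SM = {"Fermi": 1536, "Kepler": 2048}  # assumption
MAX_REGS_PER_THREAD = 63  # CC 2.x and CC 3.0; CC 3.5 allows 255 (footnote 10)


def registers_per_thread(arch, resident_threads):
    """Registers each thread can hold if the pool is split evenly."""
    if not 0 < resident_threads <= MAX_THREADS_PER_SM[arch]:
        raise ValueError("resident_threads out of range for " + arch)
    share = REGISTER_POOL[arch] // resident_threads
    return min(share, MAX_REGS_PER_THREAD)


fermi_full = registers_per_thread("Fermi", 1536)    # fully occupied Fermi SM
kepler_full = registers_per_thread("Kepler", 2048)  # fully occupied Kepler SMX
fermi_light = registers_per_thread("Fermi", 256)    # few threads: cap applies
print(fermi_full, kepler_full, fermi_light)
```

Under these assumptions, a fully occupied SM leaves each thread with the lower bound of the quoted range, and lightly occupied SMs are limited by the per-thread cap rather than by the pool.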

Cache Constraints: Fermi and Kepler GPU architectures both support caching, but with a limited number of memory controllers and small caches. The number of memory controllers varies across cards but is fixed for each specific card, typically in the range of one to six. The L1 and L2 cache sizes are also limited, with the read-only L1 cache having 16, 32, or 48 KB per SM and the read/write L2 cache having only 64-128 KB per memory controller. Since there are up to 16 SMs on Fermi-class cards and up to 14 SMXs on Kepler-class cards but only 2-6 memory controllers per card, the SMs must compete to use the memory controllers. As a result, data in GPU caches gets evicted much more frequently than data in CPU caches. GPU programmers should not depend on their data staying in either GPU cache for long. They may therefore wish to migrate essential data into registers, shared memory, or both.

3.4.2 Register Spills

Although programmers often write code as if they have unlimited registers, real CPUs and GPUs have a limited number of registers available for each thread. On GPUs, the maximum number of registers available per thread is fixed at kernel launch time, based on the register pool size (32 K on Fermi or 64 K on Kepler) and the number of concurrent threads running on each SM (SMX). The compiler determines how many registers a kernel needs; register spills occur when that need exceeds what is available. CPU compilers typically store spilled variables on the stack. The CUDA compiler stores spilled variables in local memory, a reserved area of global memory whose access is orders of magnitude slower than register access. In some case studies, I hand-code key routines just to avoid expensive spills.

3.4.3 Coalescence

When the 32 threads in a warp request access to global memory, the GPU memory controller must transfer the requested data to or from the registers in the SM that are assigned to the respective threads. The memory controller performs this transfer in units of data warps (128 bytes). This means that if each thread in a warp requests a distinct four bytes, and these 128 total bytes are contiguous and aligned to a data warp boundary, the memory controller can satisfy all requests with one transfer. This condition is called coalescence. It is similar to the way a CPU makes efficient use of an entire cache line; I therefore call each coalesced data warp a warp line. As we will see in all of my case studies, coalescence is crucial for a GPU to achieve peak throughput. A programmer therefore needs to ensure that transferred data are the correct size (multiples of 128 bytes), aligned (respecting data warp boundaries), coherent (threads of a warp request distinct but contiguous data bytes), and fully used (all warp data is consumed by the threads before another warp line is transferred).
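The coalescence condition can be made concrete by counting how many 128-byte warp lines a single warp-wide request touches. The following is a minimal host-side model, not a description of the actual memory controller: the function names are hypothetical, and it assumes that every distinct 128-byte line touched costs exactly one transfer.

```python
WARP_SIZE = 32
LINE_BYTES = 128  # one data warp (warp line), as defined in the text


def transfers_for_warp(addresses, access_bytes=4):
    """Count the distinct 128-byte lines touched by one warp-wide request."""
    lines = set()
    for addr in addresses:
        first = addr // LINE_BYTES
        last = (addr + access_bytes - 1) // LINE_BYTES
        lines.update(range(first, last + 1))
    return len(lines)


def strided_addresses(base, stride_bytes):
    """Byte address requested by each of the 32 threads in a warp."""
    return [base + t * stride_bytes for t in range(WARP_SIZE)]


coalesced = transfers_for_warp(strided_addresses(0, 4))     # aligned, contiguous
misaligned = transfers_for_warp(strided_addresses(4, 4))    # contiguous, shifted
scattered = transfers_for_warp(strided_addresses(0, 128))   # one line per thread
print(coalesced, misaligned, scattered)
```

In this model the aligned, contiguous request needs a single transfer, shifting the same request by four bytes doubles the cost, and a stride of 128 bytes forces a separate transfer for every thread.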

Alignment requires attention primarily for short runs. Since warp lines behave like cache lines, unaligned data takes one more data transfer than aligned data, and the extra transfer cost is amortized across a long run of data. For example, misaligned runs of 128 bytes take two transfers rather than one, capping coalescence efficiency at 50%, while misaligned runs of 1024 bytes take nine transfers rather than eight, capping efficiency at 89%. Randomly accessed misaligned runs of 16 bytes, however, transfer 128 bytes 7/8 of the time and 256 bytes 1/8 of the time, for an average of 144 bytes transferred per 16 useful bytes, capping efficiency at 11.1%!
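The efficiency figures in this paragraph follow from simple arithmetic, which the sketch below reproduces with exact fractions. The helper names and the 4-byte misalignment offset are illustrative assumptions; the 7/8 and 1/8 probabilities for the 16-byte case are taken directly from the text rather than derived.

```python
from fractions import Fraction

LINE_BYTES = 128  # size of one warp line


def transfers_for_run(start, length):
    """Number of 128-byte lines covered by a contiguous run of bytes."""
    first = start // LINE_BYTES
    last = (start + length - 1) // LINE_BYTES
    return last - first + 1


def efficiency(start, length):
    """Useful bytes divided by bytes actually transferred."""
    return Fraction(length, transfers_for_run(start, length) * LINE_BYTES)


eff_128 = efficiency(4, 128)    # misaligned 128-byte run: 2 transfers
eff_1024 = efficiency(4, 1024)  # misaligned 1024-byte run: 9 transfers

# Random 16-byte runs: one line 7/8 of the time, two lines 1/8 of the time.
expected_bytes = Fraction(7, 8) * 128 + Fraction(1, 8) * 256
eff_16 = Fraction(16) / expected_bytes

for name, eff in [("128", eff_128), ("1024", eff_1024), ("16", eff_16)]:
    print(name, eff, f"{float(eff):.1%}")
```

The exact values are 1/2, 8/9, and 1/9, which round to the 50%, 89%, and 11.1% quoted above.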

Distinctness can be relaxed, but several threads competing for the same memory address can lead to “race conditions,” where one thread overwrites a competing thread’s data. It is up to programmers to prevent race conditions in global memory by either partitioning or using costly atomics. The hardware will prevent race conditions in shared memory at the cost of serializing access for competing threads within a warp (see bank conflicts next).

3.4.4 Bank Conflicts

Bank conflicts occur when multiple threads within the same warp access the same memory bank within shared memory at the same time. The GPU hardware serializes access by competing threads to ensure correct behavior, but at the cost of reduced I/O throughput. A memory access in which k threads hit the same bank at the same time is called a k-way bank conflict. When the worst conflict in a warp is k-way, the scheduler must replay the conflicting instruction k times. If this happens consistently, throughput drops by a factor of k. In the worst case, k is the minimum of the number of banks and the number of threads per warp (both are 32 on Fermi and Kepler hardware). Bank conflicts can be avoided if each thread in a warp accesses its own unique bank in shared memory, but this can be subtle. For example, for 64-bit data types (doubles, long long integers) on Fermi, 32 threads each loading an element of a consecutive run results in at least a two-way bank conflict, since each thread loads the low-order bytes from even banks before the high-order bytes from odd banks [12].
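The degree of a bank conflict can be estimated on the host by mapping each accessed word to its bank. The sketch below is a simplified model with hypothetical function names. It assumes 32 banks that are each one 32-bit word wide, with consecutive words assigned to consecutive banks, and it assumes that threads reading the very same word are served together and do not conflict. A 64-bit element is modeled as two adjacent 32-bit words.

```python
from collections import defaultdict

NUM_BANKS = 32   # banks per SM on Fermi and Kepler, as stated in the text
WARP_SIZE = 32


def conflict_degree(word_indices):
    """Largest number of distinct 32-bit words that map to a single bank."""
    words_per_bank = defaultdict(set)
    for word in word_indices:
        words_per_bank[word % NUM_BANKS].add(word)
    return max(len(words) for words in words_per_bank.values())


def warp_words(element_words, stride_elements):
    """32-bit word indices touched when thread t reads element t * stride."""
    return [t * stride_elements * element_words + w
            for t in range(WARP_SIZE)
            for w in range(element_words)]


floats_unit = conflict_degree(warp_words(1, 1))     # consecutive 32-bit values
floats_stride2 = conflict_degree(warp_words(1, 2))  # every other 32-bit value
floats_column = conflict_degree(warp_words(1, 32))  # stride of 32 words
doubles_unit = conflict_degree(warp_words(2, 1))    # consecutive 64-bit values
print(floats_unit, floats_stride2, floats_column, doubles_unit)
```

In this model, consecutive 32-bit accesses are conflict free, a stride of 32 words produces the worst case of a 32-way conflict, and consecutive 64-bit elements produce the two-way conflict described above.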

12. Kepler-class hardware has a new shared memory access mode that avoids 2-way bank conflicts for 64-bit
