
3.3 NVIDIA GPU

3.3.4 Coalescing memory access

| GPU | G80 | GT200 | Fermi |
|---|---|---|---|
| Maximum number of threads per block | 512 | 512 | 1024 |
| Maximum number of resident warps per multiprocessor | 24 | 32 | 48 |
| Maximum number of resident threads per multiprocessor | 768 | 1024 | 1536 |
| Number of 32-bit registers per multiprocessor | 8K | 16K | 32K |
| Maximum width for 1D texture reference bound to a CUDA array | 8192 | 8192 | 65536 |
| Maximum width for 1D texture reference bound to linear memory | 2^27 | 2^27 | 2^27 |
| Maximum width and height for 2D texture reference bound to a CUDA array | 65536 x 32768 | 65536 x 32768 | 65536 x 65535 |
| Maximum width and height for 2D texture reference bound to linear memory | 65000 x 65000 | 65000 x 65000 | 65000 x 65000 |

Table 3.2: GPU architecture. Information on threads and warps in different generations of GPU.

The most important factor affecting performance in CUDA applications is coalescing global memory access. A global memory access incurs a latency of 300-600 cycles, so coalescing accesses is an efficient way to reduce the impact of this latency. When a certain memory access pattern is achieved, the memory segment requested by a half warp (16 words) or a warp (32 words) can be transferred in a single transaction. For example, on a GPU with a warp size of 32 reading a memory segment of 128 floats, 32 floats can be read in one memory transaction, so only 4 memory transactions are needed in total to read all 128 floats. This is 32 times fewer transactions, and a correspondingly higher bandwidth, than reading the 128 floats individually. However, a specific memory access pattern is required in order to achieve the maximum memory bandwidth. The rules and restrictions differ between generations of GPU: the newer the GPU, the more flexible the global memory access.
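The transaction count in the example above is a rounded-up division, which can be sketched in host-side C++. This is only an illustrative model, not device code: the helper name `transactions_needed` is hypothetical, and it assumes perfectly aligned, sequential access.

```cpp
#include <cstddef>

// Hypothetical helper: number of memory transactions needed to read
// `num_words` consecutive, aligned words when a single transaction can
// deliver `words_per_transaction` words.
constexpr std::size_t transactions_needed(std::size_t num_words,
                                          std::size_t words_per_transaction) {
    // Round up so that a partially filled transaction still counts as one.
    return (num_words + words_per_transaction - 1) / words_per_transaction;
}

// Coalesced: 128 floats, 32 floats per transaction (warp size 32).
constexpr std::size_t coalesced = transactions_needed(128, 32);
// Uncoalesced: every float is fetched by its own transaction.
constexpr std::size_t uncoalesced = transactions_needed(128, 1);

static_assert(coalesced == 4, "128 floats in 4 coalesced transactions");
static_assert(uncoalesced / coalesced == 32, "32 times fewer transactions");
```

The two `static_assert` lines check the figures quoted in the text (4 transactions, a factor of 32) at compile time.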

[Figure 3.4: three panels, (a) to (c), each mapping threads M to M+7 onto global memory addresses N to N+28 in 4-byte steps.]

Figure 3.4: GPU Memory Access. Figure (a) shows coalescing global memory access, which takes only one memory transaction. Figures (b) and (c) show non-coalescing global memory access, which takes 16 individual memory transactions on devices of compute capability 1.0 and 1.1. However, only one memory transaction is required on devices of compute capability 1.2 or higher when the memory access falls within a 32-, 64- or 128-byte segment.

Global memory accesses should be aligned to segments of either 64 bytes (16 words) or 128 bytes (32 words) for coalescing. Figure 3.4 shows three memory access patterns which result in different numbers of memory transactions. When consecutive threads each read one float, in sequence, from an aligned memory segment, as shown in Figure 3.4(a), it takes only one memory transaction to read 16 floats on devices of compute capability 1.x, or 32 floats on devices of compute capability 2.x. Figures 3.4(b) (misaligned starting address) and 3.4(c) (non-sequential access), however, show non-coalescing memory access. This results in 16 individual memory transactions on devices of compute capability 1.0 and 1.1. On devices of compute capability 1.2 or higher, only one memory transaction is needed when the accesses fall within a single 32-, 64- or 128-byte segment.
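The strict rule for compute capability 1.0 and 1.1 can be expressed as a small host-side check. The sketch below is a simplified model written for illustration, not a description of the hardware: the function name `transactions_cc10` is hypothetical, it only covers 4-byte words read by a full half warp, and it returns the transaction counts quoted in the text (1 or 16).

```cpp
#include <array>
#include <cstdint>

constexpr int kHalfWarp = 16;  // threads per half warp

// Simplified model of the compute capability 1.0/1.1 rule for 4-byte words:
// thread k of the half warp must read the k-th word of a 64-byte aligned
// segment. Returns 1 if the access is coalesced, otherwise 16.
int transactions_cc10(const std::array<std::uint64_t, kHalfWarp>& addr) {
    const std::uint64_t base = addr[0];
    if (base % 64 != 0) {
        return kHalfWarp;  // misaligned starting address, as in Figure 3.4(b)
    }
    for (int k = 0; k < kHalfWarp; ++k) {
        if (addr[k] != base + 4 * static_cast<std::uint64_t>(k)) {
            return kHalfWarp;  // non-sequential access, as in Figure 3.4(c)
        }
    }
    return 1;  // aligned and sequential, as in Figure 3.4(a)
}
```

Passing the byte addresses 256, 260, ..., 316 gives one transaction, whereas shifting every address by 4 bytes or swapping two neighbouring addresses gives 16.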

[Figure 3.5: three panels, (a) to (c), each mapping threads onto byte addresses 0 to 124 of a 128-byte memory segment; in one panel the address axis continues through a second 128-byte segment, up to address 252.]

Figure 3.5: GPU Memory Access. Three memory access patterns that are non-coalescing on devices of compute capability 1.0 and 1.1, but are coalescing memory access on devices of compute capability of 1.2 or higher.

The restriction to sequential, aligned global memory access may pose a challenge in programming the GPU. However, coalescing global memory access is more flexible on newer hardware. On devices of compute capability 1.2 or higher, coalescing can be achieved for any access pattern that stays within a segment of 32 bytes for 8-bit words, 64 bytes for 16-bit words, or 128 bytes for 32-bit and 64-bit words. Figure 3.5 shows three access patterns that are non-coalescing on devices of compute capability 1.0 and 1.1, but coalescing on devices of compute capability 1.2 or higher. Figure 3.5(a) shows random access to 32-bit floats within a 128-byte memory segment, which results in one memory transaction on devices of compute capability 1.2 or higher. Likewise, reading 32-bit floats from a misaligned starting address takes only a single memory transaction on these devices, provided that all the addresses lie within one 128-byte segment. However, when some of the accessed addresses fall outside that 128-byte segment, two memory transactions are issued, as shown in Figure 3.5(b).
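For compute capability 1.2 or higher, the transaction counts described above follow from counting how many distinct aligned segments the threads touch. The host-side sketch below is a simplified model under that assumption: the function name `segments_touched` is hypothetical, and it ignores the hardware's ability to shrink a transaction to a smaller segment.

```cpp
#include <cstddef>
#include <cstdint>
#include <set>
#include <vector>

// Simplified model for compute capability 1.2 or higher: one transaction is
// issued per distinct aligned segment touched by the threads' addresses.
std::size_t segments_touched(const std::vector<std::uint64_t>& addresses,
                             std::uint64_t segment_bytes = 128) {
    std::set<std::uint64_t> segments;
    for (std::uint64_t a : addresses) {
        segments.insert(a / segment_bytes);  // index of the segment holding a
    }
    return segments.size();
}
```

Sixteen floats read in any order from addresses 0 to 60 touch one segment, as do sixteen floats starting at the misaligned address 4. Sixteen floats starting at address 96 straddle the boundary at 128 and touch two segments.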

When one or more threads within a warp do not participate in the memory access (shown in Figure 3.5(c)), 16 serialized transactions are issued on devices of compute capability 1.1 or lower, but only one transaction on devices of compute capability 1.2 or higher. However, even though there is only one memory transaction on newer hardware, some memory bandwidth is wasted because all data in the segment are fetched, including the memory addresses that are not required. This shows why the way memory is organized and accessed in CUDA needs careful attention.

[Figure 3.6: two panels, (a) and (b), each showing accesses to byte addresses 0 to 124 of one 128-byte segment of global memory.]

Figure 3.6: GPU Memory Access. Some memory bandwidth is wasted because the whole 128-byte memory segment is fetched even though only some of its memory locations are required.

When the kernel requires only some of the memory locations within a 128-byte memory segment, some memory bandwidth is wasted. Figure 3.6(a) shows a case where only the even memory locations are accessed, so half of the memory bandwidth is wasted. Figure 3.6(b) shows a case where only 1/3 of the memory locations within the 128-byte memory segment are accessed, so 2/3 of the memory bandwidth is wasted. In both cases the whole segment is fetched, even though the GPU kernel code does not need the remaining memory addresses.
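The wasted fraction for such strided patterns can be estimated with a short host-side calculation. The sketch below is an illustrative model only: the names `words_used` and `utilization` are hypothetical, and it assumes 4-byte words, one 128-byte segment, and accesses that start at the first word of the segment.

```cpp
#include <cstddef>

constexpr std::size_t kSegmentBytes = 128;
constexpr std::size_t kWordBytes = 4;  // one float
constexpr std::size_t kWordsPerSegment = kSegmentBytes / kWordBytes;  // 32

// Number of words actually used when every `stride`-th word of the segment
// is read, starting at word 0 (stride must be at least 1).
constexpr std::size_t words_used(std::size_t stride) {
    return (kWordsPerSegment + stride - 1) / stride;
}

// Fraction of the fetched segment that the kernel actually uses.
constexpr double utilization(std::size_t stride) {
    return static_cast<double>(words_used(stride)) / kWordsPerSegment;
}
```

With a stride of 2, 16 of the 32 words are used, so half of the fetched data is wasted. With a stride of 3, 11 of the 32 words are used, which is roughly the 1/3 quoted in the text.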

The following example shows a similar problem with a C struct that has three member variables. An array of these structs is allocated in global memory using cudaMalloc. In the kernel, a global memory access to the variable x in the struct space is issued. All the variables in the struct are fetched even though the kernel requires only one of them. Therefore, 2/3 of the memory bandwidth is wasted.

    // Array of structures: x, y and z of one element are adjacent in memory
    struct space {
        float x;
        float y;
        float z;
    };

    space *d_space;
    cudaMalloc((void**) &d_space, ... );

    // GPU Kernel
    __global__ void run_kernel(space *d_space)
    {
        int gtid = blockIdx.x * blockDim.x + threadIdx.x;
        float x = d_space[gtid].x;  // y and z are fetched too, but never used
        ...
    }

Instead of storing x, y and z together in each struct element, they can be stored in three separate arrays, with memory allocated for each array. Alternatively, the three arrays can be grouped in a struct of pointers, as in the following example. The memory bandwidth is used fully in this case because only the d_space.x array is fetched.

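    // Structure of arrays: each member points to its own array in global memory
    struct space {
        float *x;
        float *y;
        float *z;
    };

    space d_space;  // host-side struct holding three device pointers
    cudaMalloc((void**) &d_space.x, ...);
    cudaMalloc((void**) &d_space.y, ...);
    cudaMalloc((void**) &d_space.z, ...);

    // GPU Kernel: the struct of pointers is passed by value so that
    // d_space.x can be dereferenced on the device
    __global__ void run_kernel(space d_space)
    {
        int gtid = blockIdx.x * blockDim.x + threadIdx.x;
        float x = d_space.x[gtid];  // consecutive threads read consecutive floats
        ...
    }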
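The effect of the two layouts above can be estimated on the host by counting how many 128-byte segments a warp of 32 threads touches when each thread reads its own x. This is a rough sketch rather than a measurement: the names `SpaceAoS` and `segments_for_x_reads` are hypothetical, and it assumes 4-byte floats, no struct padding, and a base address aligned to a segment boundary.

```cpp
#include <cstddef>

constexpr std::size_t kSegmentBytes = 128;

// Same member layout as the first `space` struct (array of structures).
struct SpaceAoS { float x; float y; float z; };
static_assert(sizeof(float) == 4, "model assumes 4-byte floats");
static_assert(sizeof(SpaceAoS) == 3 * sizeof(float), "model assumes no padding");

// Number of 128-byte segments spanned when `num_threads` consecutive threads
// (at least one) each read one float, with `stride_bytes` between the
// addresses of neighbouring threads and the first address segment-aligned.
constexpr std::size_t segments_for_x_reads(std::size_t num_threads,
                                           std::size_t stride_bytes) {
    return ((num_threads - 1) * stride_bytes + sizeof(float) - 1)
               / kSegmentBytes + 1;
}

// Array of structures: neighbouring x values are sizeof(SpaceAoS) bytes apart.
constexpr std::size_t aos_segments = segments_for_x_reads(32, sizeof(SpaceAoS));
// Structure of arrays: neighbouring x values are sizeof(float) bytes apart.
constexpr std::size_t soa_segments = segments_for_x_reads(32, sizeof(float));
```

Under these assumptions the array-of-structures layout spans three segments and the structure-of-arrays layout spans one, so the first layout fetches three times as much data for the same 32 floats. This is consistent with the 2/3 of bandwidth described as wasted in the text.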

In document Simulation and modelling of gravitational microlensing events using graphical processing units : a thesis presented in partial fulfilment of the requirements for the degree of Doctor of Philosophy in Computer Science at Massey University, Albany (Auckland (Page 48-54)