CUDA Programming Model - Compute Unified Device Architecture (CUDA)

2.4 Compute Unified Device Architecture (CUDA)

2.4.1 CUDA Programming Model

The CUDA programming model extends the C and C++ languages, allowing us to explicitly denote data-parallel computations by defining special functions, designated kernels. Kernel functions are executed in parallel by different threads on a physically separate device (the GPU) that operates as a co-processor to the host (the CPU) running the program. These functions define the sequence of work to be carried out individually by each thread mapped over a domain (the set of threads to be invoked) [42]. Threads must be grouped into blocks, which in turn form a grid. In recent GPUs, grids may have up to three dimensions, while on older devices the limit is two dimensions. This information is contained in Table 2.1, which presents the main technical specifications according to the CUDA device compute capability. A complete list of the specifications can be found in the NVIDIA CUDA C programming guide [164]. Moreover, a list of the devices supporting each compute capability can be found at http://developer.nvidia.com/cuda-gpus.

Table 2.1 Principal technical specifications according to the CUDA device compute capability

                                               Compute Capability
Technical Specifications                       1.0          1.1          1.2          1.3          2.x          3.0
Maximum grid dimensionality                    2 (x, y)     2 (x, y)     2 (x, y)     2 (x, y)     3 (x, y, z)  3 (x, y, z)
Maximum x-dimension of a grid                  65535        65535        65535        65535        65535        2^31−1
Maximum y or z-dimension of a grid             65535        65535        65535        65535        65535        65535
Maximum block dimensionality                   3 (x, y, z)  3 (x, y, z)  3 (x, y, z)  3 (x, y, z)  3 (x, y, z)  3 (x, y, z)
Maximum x or y-dimension of a block            512          512          512          512          1024         1024
Maximum z-dimension of a block                 64           64           64           64           64           64
Maximum number of threads per block            512          512          512          512          1024         1024
Warp size (see Section 2.4.2, page 26)         32           32           32           32           32           32
Maximum resident blocks per multiprocessor     8            8            8            8            8            16
Maximum resident warps per multiprocessor      24           24           32           32           48           64
Maximum resident threads per multiprocessor    768          768          1024         1024         1536         2048
Number of 32-bit registers per multiprocessor  8 K          8 K          16 K         16 K         32 K         64 K
Maximum shared memory per multiprocessor       16 KB        16 KB        16 KB        16 KB        48 KB        48 KB
Local memory per thread                        16 KB        16 KB        16 KB        16 KB        512 KB       512 KB
Maximum number of instructions per kernel      2 million    2 million    2 million    2 million    512 million  512 million

For convenience, blocks can organize threads in up to three dimensions. Figure 2.4 presents an example of a two-dimensional grid containing two-dimensional thread blocks. The actual structure of the blocks and the grid depends on the problem being tackled and in most cases is directly related to the structure of the data being processed. For example, if the data is contained in a single array, then it makes sense to use a one-dimensional grid with one-dimensional blocks, each processing a specific region of the array. On the other hand, if the data is contained in a matrix, then it could make more sense to use a two-dimensional grid in which one dimension is used for the columns and the other for the rows. In this specific scenario the blocks could also be organized using two dimensions, such that each block would process a distinct rectangular area of the matrix.
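The matrix case can be sketched on the host in plain C++, assuming that the x dimension is mapped to columns and the y dimension to rows. The names `Dim2`, `blocksFor` and `globalCoord` are hypothetical helpers introduced only for illustration; inside a real kernel the same arithmetic would be written with the built-in `blockIdx`, `blockDim` and `threadIdx` variables.

```cpp
#include <cassert>

// Hypothetical stand-in for the x and y fields of CUDA's dim3 type.
struct Dim2 { int x; int y; };

// Number of blocks needed along each dimension to cover a rows x cols matrix.
// Ceiling division is used so that partially filled edge blocks are included.
inline Dim2 blocksFor(int rows, int cols, Dim2 block) {
    assert(block.x > 0 && block.y > 0);
    return { (cols + block.x - 1) / block.x,
             (rows + block.y - 1) / block.y };
}

// Matrix coordinate handled by one thread: x is the column, y is the row.
// In a kernel: col = blockIdx.x * blockDim.x + threadIdx.x (likewise for y).
inline Dim2 globalCoord(Dim2 bIdx, Dim2 bDim, Dim2 tIdx) {
    return { bIdx.x * bDim.x + tIdx.x,
             bIdx.y * bDim.y + tIdx.y };
}
```

Under these assumptions, a 100 × 70 matrix processed with 16 × 16 blocks needs a grid of 5 × 7 blocks, and each block covers a distinct 16 × 16 tile of the matrix (partially filled at the right and bottom edges).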


Choosing an adequate block size and structure is fundamental to maximizing kernel performance. Unfortunately, it is not always possible to anticipate which block structure is best, and changing it may require rewriting kernels from scratch. Threads within a block can cooperate by sharing data and synchronizing their execution to coordinate memory accesses. However, the number of threads in a block cannot exceed 512 or 1024, depending on the GPU compute capability (see Table 2.1). This limits the scope of synchronization and communication within the computations defined in the kernel. Nevertheless, this limit is necessary in order to leverage the GPU's high core count by allowing threads to be distributed across all the available cores.

Blocks are required to execute independently: it must be possible to execute them in any arbitrary order, either in parallel or in series. This requirement allows the set of thread blocks which compose the grid to be scheduled in any order across any number of cores, enabling applications that scale well with the number of cores present on the device.

Scalability is a fundamental issue, since the key to performance on this platform lies in using massive multi-threading to exploit the large number of device cores and hide global memory latency. To achieve this, we face the challenge of finding an adequate trade-off between the resources used by each thread and the number of simultaneously active threads. The resources to manage include the number of registers, the amount of shared (on-chip) memory used per thread, the number of threads per multiprocessor and the global memory bandwidth [194].
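A rough feel for this trade-off can be obtained with a simplified host-side C++ model, which estimates how many blocks can be resident on one multiprocessor given the per-thread register count and the per-block shared memory. This is only a sketch: `SmLimits`, `residentBlocks` and `residentThreads` are hypothetical names, and the model deliberately ignores the allocation granularity that real hardware applies to registers and shared memory, so actual figures may be lower.

```cpp
#include <algorithm>
#include <cassert>

// Per-multiprocessor limits (hypothetical struct, values supplied by the caller).
struct SmLimits {
    int maxBlocks;    // maximum resident blocks per multiprocessor
    int maxThreads;   // maximum resident threads per multiprocessor
    int registers;    // number of 32-bit registers per multiprocessor
    int sharedBytes;  // shared memory per multiprocessor, in bytes
};

// Simplified estimate: the scarcest resource determines the resident block count.
inline int residentBlocks(const SmLimits& sm, int threadsPerBlock,
                          int regsPerThread, int sharedBytesPerBlock) {
    assert(threadsPerBlock > 0 && regsPerThread > 0 && sharedBytesPerBlock >= 0);
    int byThreads   = sm.maxThreads / threadsPerBlock;
    int byRegisters = sm.registers / (regsPerThread * threadsPerBlock);
    int byShared    = sharedBytesPerBlock == 0
                          ? sm.maxBlocks
                          : sm.sharedBytes / sharedBytesPerBlock;
    return std::min({sm.maxBlocks, byThreads, byRegisters, byShared});
}

// Number of simultaneously active threads implied by the estimate above.
inline int residentThreads(const SmLimits& sm, int threadsPerBlock,
                           int regsPerThread, int sharedBytesPerBlock) {
    return residentBlocks(sm, threadsPerBlock, regsPerThread,
                          sharedBytesPerBlock) * threadsPerBlock;
}
```

As an illustration, take the compute capability 2.x limits from Table 2.1 (1536 resident threads, 32 K registers, 48 KB of shared memory) together with the limit of 8 resident blocks that NVIDIA publishes for that capability. In this model, 256-thread blocks using 20 registers per thread and 4 KB of shared memory per block yield 6 resident blocks (1536 threads), whereas raising the register use to 32 per thread reduces this to 4 blocks (1024 threads).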

CUDA provides a set of intrinsic variables that kernels can use to identify the actual thread location in the domain, allowing each thread to work on separate parts of a dataset [42]. Table 2.2 identifies those built-in variables [164].

Table 2.2 Built-in CUDA kernel variables

Variable    Description
gridDim     Dimensions of the kernel grid.
blockDim    Dimensions of the block.
blockIdx    Index of the block being processed, within the grid.
threadIdx   Thread index within the block.
warpSize    Warp size in threads (see Section 2.4.2, page 26).
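To illustrate how these variables combine, the following host-side C++ sketch flattens a three-dimensional thread index into a linear thread identifier within its block, using the ordering given in NVIDIA's programming guide (x varies fastest, then y, then z), and derives the warp that the thread belongs to. `Dim3`, `linearThreadId` and `warpOf` are hypothetical names used only for illustration; in a kernel the inputs would be `threadIdx`, `blockDim` and `warpSize`.

```cpp
#include <cassert>

// Hypothetical stand-in for CUDA's dim3 / uint3 built-in types.
struct Dim3 { int x; int y; int z; };

// Linear identifier of a thread within its block: x + y*Dx + z*Dx*Dy.
inline int linearThreadId(Dim3 tIdx, Dim3 bDim) {
    assert(tIdx.x < bDim.x && tIdx.y < bDim.y && tIdx.z < bDim.z);
    return tIdx.x + tIdx.y * bDim.x + tIdx.z * bDim.x * bDim.y;
}

// Warps are formed from consecutive linear thread identifiers.
inline int warpOf(int linearId, int warpSize) {
    assert(warpSize > 0);
    return linearId / warpSize;
}
```

For example, in an 8 × 4 × 2 block, thread (3, 2, 1) has linear identifier 3 + 2·8 + 1·32 = 51 and, with a warp size of 32, falls in the second warp of the block (warp index 1).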

Listing 2.1 presents a simple kernel that computes the square of each element of vector x, placing the result in vector y. Kernel functions are declared by using the qualifier __global__ and cannot return any value (i.e. their return type must be void).

The actual number of threads is only defined when the kernel function is called. To this end, we must specify both the grid and the block size by using the CUDA execution configuration syntax (<<<...>>>). Listing 2.2 demonstrates the steps necessary to call the square kernel previously defined in Listing 2.1. These usually involve allocating memory on the device, transferring the input data from the host to the device, defining the number of blocks and the number of threads per block for each kernel, calling the appropriate kernel functions, copying the results back to the host and finally releasing the device memory.

Listing 2.1 Example of a CUDA kernel function.

    __global__ void square(float * x, float * y, int size) {
        // Global index of this thread within the one-dimensional grid
        int idx = blockIdx.x * blockDim.x + threadIdx.x;

        // Threads beyond the end of the array do nothing
        if (idx < size) y[idx] = x[idx] * x[idx];
    }

Listing 2.2 Example for calling a CUDA kernel function.

    //...
    float x[SIZE];
    float y[SIZE];
    int memsize = SIZE * sizeof(float);

    // Fill vector x
    // ...

    // Allocate memory on the device for the vectors x and y
    float * d_x;
    float * d_y;
    cudaMalloc((void**) &d_x, memsize);
    cudaMalloc((void**) &d_y, memsize);

    // Transfer the array x to the device
    cudaMemcpy(d_x, x, memsize, cudaMemcpyHostToDevice);

    // Call the square kernel function using blocks of 256 threads
    const int blockSize = 256;
    int nBlocks = SIZE / blockSize;
    if (SIZE % blockSize > 0) nBlocks++;
    square<<<nBlocks, blockSize>>>(d_x, d_y, SIZE);

    // Transfer the result vector y to the host
    cudaMemcpy(y, d_y, memsize, cudaMemcpyDeviceToHost);

    // Release device memory
    cudaFree(d_x);
    cudaFree(d_y);
    //...
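What the kernel launch amounts to can be mimicked on the host with a sequential C++ sketch in which two nested loops play the role of the grid and of the threads in each block. This is only an emulation for checking the index arithmetic, not how the GPU schedules work: `squareThread` and `emulateSquareLaunch` are hypothetical names, and the built-in variables are passed as ordinary parameters.

```cpp
#include <cassert>
#include <vector>

// Body of the square kernel, with the built-in variables passed in explicitly
// so that it can run on the host.
inline void squareThread(const float* x, float* y, int size,
                         int blockIdxX, int blockDimX, int threadIdxX) {
    int idx = blockIdxX * blockDimX + threadIdxX;
    if (idx < size) y[idx] = x[idx] * x[idx];  // guard against idle threads
}

// Sequential stand-in for square<<<nBlocks, blockSize>>>(x, y, size).
// Returns the number of blocks that would be launched.
inline int emulateSquareLaunch(const std::vector<float>& x,
                               std::vector<float>& y, int blockSize) {
    assert(blockSize > 0);
    int size = static_cast<int>(x.size());
    y.resize(x.size());
    int nBlocks = (size + blockSize - 1) / blockSize;  // ceiling division
    for (int b = 0; b < nBlocks; ++b)          // one iteration per block
        for (int t = 0; t < blockSize; ++t)    // one iteration per thread
            squareThread(x.data(), y.data(), size, b, blockSize, t);
    return nBlocks;
}
```

The single ceiling division used here computes the same block count as the two-step division and remainder test in the host code. Because each emulated thread writes only its own element, the outer loop could visit the blocks in any order without changing the result, which reflects the requirement that blocks execute independently.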

In the code presented (see Listings 2.1 and 2.2), each thread processes a single element of the array. Since the block size has a profound impact on kernel performance, performance is usually one of the most important aspects taken into consideration when choosing the block size and, consequently, the number of blocks. The block size is therefore rarely an exact divisor of the array size, so the actual number of threads (number of blocks × block size) will most likely be greater than the size of the array being processed. As a result, it is frequent to have some idle threads, within the grid blocks, as depicted in Figure 2.5.

Fig. 2.5 Execution of the square kernel grid blocks (see Listings 2.1 and 2.2)
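The number of idle threads follows directly from the launch configuration. The host-side C++ sketch below computes it for a one-dimensional launch in which each thread handles one element; `LaunchShape` and `launchShape` are hypothetical names introduced for illustration.

```cpp
#include <cassert>

// Summary of a one-dimensional launch (hypothetical helper type).
struct LaunchShape {
    int nBlocks;       // number of blocks in the grid
    int totalThreads;  // nBlocks * blockSize
    int idleThreads;   // threads whose index falls beyond the array
};

// One thread per array element, with the block count rounded up.
inline LaunchShape launchShape(int size, int blockSize) {
    assert(size >= 0 && blockSize > 0);
    int nBlocks = (size + blockSize - 1) / blockSize;
    int total = nBlocks * blockSize;
    return { nBlocks, total, total - size };
}
```

For instance, an array of 1000 elements processed with blocks of 256 threads requires 4 blocks, that is 1024 threads, of which 24 remain idle; with 1024 elements there would be no idle threads.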

In document Machine Learning for Adaptive Many Core Machines A Practical Approach (Studies in Big Data) 2015th Edition pdf (Page 37-41)