One obstacle to programming a GPU kernel is the need to specify, up front, a thread layout structure: the cooperative thread array (CTA) layout, which the GPU hardware uses to schedule tens of thousands of threads onto hundreds of processing cores. Every kernel, no matter how simple, requires a CTA layout. Choosing good sizes and shapes for the layout requires the programmer to develop some intuition for which thread layouts work well, and finding optimal sizes and shapes requires many experiments within the specific application. Here are the technical details for specifying layouts.
Recall from Section 3.3 that threads on a GPU are structured in a four-level hierarchy: each individual thread belongs to a thread warp, which belongs to a thread block, which belongs to a grid. In order to launch kernels with thousands of threads, NVIDIA requires programmers to specify two triples ‹x,y,z› at kernel launch: one for the 2D grid of blocks and one for the 3D array of threads in each block. To aid my own understanding, I use a 6-tuple ‹gw, gh, 1, bw, bh, bl› to represent the grid and block layout parameters as a single unit; CUDA itself passes the two triples through its triple angle-bracket launch syntax (<<<grid, block, …>>>). I also use the 6-tuple ‹bx, by, bz, tx, ty, tz› to indicate the unique location of each individual thread within the CTA layout.
To specify the CTA layout up front, the GPU programmer must plan how to partition the data and how to map individual threads onto the data elements. The programmer can simply specify a 1D grid and a 1D thread block layout, with two CTA layout parameters ‹gw,1,1,bw,1,1›, but having additional CTA layout parameters gives programmers the flexibility to treat the underlying data as 1D, 2D, or 3D, as appropriate for their problem space. Most of the GPU kernels in my case studies operate on 1D data that is large, so I typically use two or three CTA parameters per kernel: a 1D or 2D grid with a 1D block, ‹gw,gh,1,bw,1,1›.
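To make the ‹gw,gh,1,bw,1,1› case concrete, the following is a minimal host-side C++ sketch of how a thread's location ‹bx,by,0,tx,0,0› could be flattened into a single linear index over 1D data. The `Dim3` struct and the function name `linearThreadIndex` are hypothetical stand-ins (not CUDA built-ins) so that the sketch compiles without the CUDA toolkit; the row-major ordering of blocks within the grid is an assumption.

```cpp
#include <cstddef>

// Hypothetical stand-in for CUDA's built-in dim3/uint3 types, so this
// sketch compiles as plain C++ without the CUDA toolkit.
struct Dim3 { unsigned x, y, z; };

// Flatten a thread location <bx,by,0,tx,0,0> into one linear thread index
// for a <gw,gh,1,bw,1,1> layout. Blocks are numbered row-major within the
// grid (an assumption). In a real kernel the four arguments would be
// gridDim, blockDim, blockIdx, and threadIdx.
inline std::size_t linearThreadIndex(const Dim3& grid, const Dim3& block,
                                     const Dim3& blockIdx, const Dim3& threadIdx)
{
    // Linear block number: row (by) times row width (gw), plus column (bx).
    const std::size_t blockId =
        static_cast<std::size_t>(blockIdx.y) * grid.x + blockIdx.x;
    // Linear thread number: block number times block width (bw), plus tx.
    return blockId * block.x + threadIdx.x;
}
```

With a purely 1D grid (gh = 1, by = 0) the first multiplication drops out, which is one way to see why fewer layout parameters mean fewer mapping operations.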
template<TBS, gridRS, nWork>                   // Thread Block Size, Row Size (m*c), Work per thread
Layout_1D( n, Grid, Block )                    // Input Size, Grid Layout, Block Layout
    DBS     = nWork * TBS;                     // Fixed, Data Block Size
    nBlocks = ⌈n / DBS⌉;                       // Varies, Cover data with data blocks
    if (nBlocks <= 65534)                      // 1D or 2D Grid Layout?
        gridW = nBlocks; gridH = 1;
    else
        mSQ   = ⌈√nBlocks⌉;                    // Start with a square layout
        gridW = ⌈mSQ / gridRS⌉ * gridRS;       // nCols is a multiple of 'Row Size' hint
        gridH = ⌈nBlocks / gridW⌉;             // nRows needs to cover data
    end if
    Block = dim3( TBS, 1, 1 );                 // Block layout (1D)
    Grid  = dim3( gridW, gridH, 1 );           // Grid layout (2D or 1D)
end Layout_1D

// Example Usage
...
TBS        = 128;                              // 128, Pick a fixed 1D thread block layout
nWork      = 4;                                // 4, Amortize costs across work-items
nSMs       = 14;                               // 14, number of SMXs on a GTX Titan
nConBlocks = 16;                               // 16, expected number of concurrent blocks per SMX
gridRS     = nSMs * nConBlocks;                // 224, A good starting row size for my grid
Layout_1D<TBS,gridRS,nWork>( n, Grid, Block ); // Compute my CTA Layout
Figure 4.3: Compute a CTA layout. In this example, I pick a fixed block size and a fixed number of columns per grid and then allow the number of rows per grid to vary as needed to cover the data.
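As a minimal sketch, the pseudocode in Figure 4.3 could be written in host-side C++ roughly as follows. The `Dim3` struct is a hypothetical stand-in for CUDA's `dim3`, the helper names `ceilDiv` and `ceilSqrt` are illustrative, and the code assumes n > 0 and that the resulting grid dimensions fit in 32 bits. The 65534 threshold is taken directly from the figure.

```cpp
#include <cmath>
#include <cstdint>

// Hypothetical stand-in for CUDA's dim3 so the sketch builds as plain C++.
struct Dim3 { unsigned x, y, z; };

// Threshold from Figure 4.3 for switching from a 1D to a 2D grid.
constexpr std::uint64_t kMaxGridDim = 65534;

// Integer ceiling division: smallest q with q * b >= a (assumes b > 0).
inline std::uint64_t ceilDiv(std::uint64_t a, std::uint64_t b)
{
    return (a + b - 1) / b;
}

// Integer ceiling square root: smallest r with r * r >= x.
inline std::uint64_t ceilSqrt(std::uint64_t x)
{
    auto r = static_cast<std::uint64_t>(std::ceil(std::sqrt(static_cast<double>(x))));
    while (r * r < x) ++r;                          // fix floating-point undershoot
    while (r > 0 && (r - 1) * (r - 1) >= x) --r;    // fix floating-point overshoot
    return r;
}

// TBS = thread block size, gridRS = grid row-size hint, nWork = work per thread.
template <unsigned TBS, unsigned gridRS, unsigned nWork>
void layout1D(std::uint64_t n, Dim3& grid, Dim3& block)
{
    const std::uint64_t DBS     = static_cast<std::uint64_t>(nWork) * TBS; // data block size
    const std::uint64_t nBlocks = ceilDiv(n, DBS);                         // blocks covering [0, n)

    std::uint64_t gridW = nBlocks, gridH = 1;       // 1D grid by default
    if (nBlocks > kMaxGridDim) {                    // too many blocks: use a 2D grid
        const std::uint64_t mSQ = ceilSqrt(nBlocks);   // start from a square layout
        gridW = ceilDiv(mSQ, gridRS) * gridRS;         // round columns up to the hint
        gridH = ceilDiv(nBlocks, gridW);               // rows needed to cover the data
    }
    block = Dim3{TBS, 1u, 1u};
    grid  = Dim3{static_cast<unsigned>(gridW), static_cast<unsigned>(gridH), 1u};
}
```

For example, with TBS = 128, gridRS = 224, and nWork = 4, an input of n = 100,000,000 elements needs 195,313 data blocks; the sketch yields a 448 × 436 grid, which over-covers the data by 15 blocks.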
As shown in Figure 4.3, the user picks a fixed thread block size (TBS) and a fixed amount of work per thread (nWork), from which the code computes a fixed data block size (DBS = nWork·TBS). From the data block size, I compute the number of data blocks (m = ⌈n/DBS⌉, called nBlocks in Figure 4.3) needed to fully cover the data set [0,n). Fully covering the data with data blocks implies that the last data block may be only partially full, and thus my GPU kernels require range checking to avoid data access errors. Initially, I used a one-to-one mapping between thread blocks and data blocks. Recall that Fermi architectures limit their grid dimension values to the range [1, 2^16). As a result, my layout function determines whether a 1D or 2D grid layout is needed to cover all the data blocks (test: m > 65,534). If a 2D grid layout is needed, I start with the square root of m as a good first approximation (√m × √m). I then round the number of columns up to a multiple of a fixed-size row hint (gridRS) and compute the number of rows needed to cover all the data blocks, which gives the final 2D layout (rows×cols). Unfortunately, computing the 2D grid layout as described usually results in some over-coverage because the last row of data blocks is usually only partially full (implying that there are some thread blocks whose corresponding fixed-size data blocks are completely out of range). Thus range checking in my kernels is required to prevent data access errors.
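The range check can be illustrated with a small host-side emulation that visits every (block, thread, work-item) combination of an over-covering layout and touches an element only when its index is in range. This is a sketch under assumptions: the function names are hypothetical, the loops stand in for what the GPU would run in parallel, and the strided access pattern within a data block (stride TBS) is an illustrative choice rather than something the text above specifies.

```cpp
#include <cstddef>
#include <vector>

// Host-side emulation of a <gridW,gridH,1,TBS,1,1> launch in which each
// thread handles nWork elements of its data block. Returns, per element,
// how many times it was touched.
template <unsigned TBS, unsigned nWork>
std::vector<int> touchAll(std::size_t n, unsigned gridW, unsigned gridH)
{
    std::vector<int> hits(n, 0);
    const std::size_t DBS = static_cast<std::size_t>(nWork) * TBS; // data block size

    for (unsigned by = 0; by < gridH; ++by) {
        for (unsigned bx = 0; bx < gridW; ++bx) {
            // One-to-one mapping: thread block (bx, by) owns one data block.
            const std::size_t base =
                (static_cast<std::size_t>(by) * gridW + bx) * DBS;
            for (unsigned tx = 0; tx < TBS; ++tx) {
                for (unsigned w = 0; w < nWork; ++w) {
                    // Assumed access pattern: stride of TBS between work-items.
                    const std::size_t i =
                        base + static_cast<std::size_t>(w) * TBS + tx;
                    if (i < n) {        // range check: skip out-of-range indices
                        hits[i] += 1;
                    }
                }
            }
        }
    }
    return hits;
}

// True when every element was touched exactly once.
inline bool allTouchedOnce(const std::vector<int>& hits)
{
    for (int h : hits) {
        if (h != 1) return false;
    }
    return true;
}
```

With n = 1000, TBS = 128, nWork = 4, and a 2 × 2 grid, the second data block is only partially full and the last two are entirely out of range, yet every element is still touched exactly once and nothing outside [0,n) is accessed.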
CTA Layout Guidelines: Using fewer CTA layout parameters results in fewer mapping operations (as will be shown later in Figure 4.6). Consequently, I prefer 1D layouts over 2D layouts, and 2D layouts over 3D layouts. Since thread scheduling on each SM multi-core is warp-based, I also prefer that the thread block width (bw) be a multiple of the warp size (32), in order to fully utilize all the SP processing cores on each SM. To support the multi-issue capability of the GPU hardware, I prefer to have at least two or four thread warps per thread block (64 or 128 threads per thread block). I generally prefer fixed-size constants for most of my CTA parameters; the only exception is the one parameter that I allow to vary with input size (either the grid columns or the grid rows, gw or gh). I recommend including the fixed-size CTA parameters as part of the kernel's C++ template parameters. This approach supports experimentation while still allowing the GPU kernels to know the actual CTA layout at compile time for better performance.
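One way to carry the fixed-size CTA parameters as template parameters, and to have the compiler check the guidelines above, is sketched below. The names `CtaConfig`, `kWarpSize`, and `DefaultCfg` are hypothetical. Turning the stated preferences into hard `static_assert` checks is an illustrative design choice, not a requirement of CUDA.

```cpp
// Warp size on the NVIDIA GPUs discussed here.
constexpr unsigned kWarpSize = 32;

// Compile-time bundle of the fixed CTA parameters (hypothetical helper).
// TBS = thread block size (bw), nWork = work items per thread.
template <unsigned TBS, unsigned nWork>
struct CtaConfig {
    // Guideline: block width should be a multiple of the warp size.
    static_assert(TBS % kWarpSize == 0,
                  "thread block size should be a multiple of the warp size");
    // Guideline: at least two warps per block (treated as a hard check here).
    static_assert(TBS >= 2 * kWarpSize,
                  "prefer at least two warps per thread block");
    static_assert(nWork >= 1, "each thread needs at least one work item");

    static constexpr unsigned threadsPerBlock = TBS;
    static constexpr unsigned warpsPerBlock   = TBS / kWarpSize;
    static constexpr unsigned dataBlockSize   = TBS * nWork;  // DBS
};

// The values used in the example of Figure 4.3.
using DefaultCfg = CtaConfig<128, 4>;
```

A kernel templated on such a configuration can use `DefaultCfg::dataBlockSize` as a compile-time constant, while a configuration such as `CtaConfig<100, 4>` would be rejected at compile time.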