
3.3 GPU Architecture Issues

3.3.1 CTA Partitioning

Recall that a cooperative thread array (CTA) is a 2-level hierarchy (thread blocks within a grid and threads within a thread block). To take advantage of the massive data-level parallelism offered by GPU threads, GPU programmers need a way to specify the desired thread structure and the thread-to-data mapping. This is done with a 2-step process that I call CTA partitioning.

1) For each GPU kernel launch, GPU programmers must specify the desired shape and size of the CTA layout using two 3D parameters (grid and block), both given as 3-tuples ‹x,y,z›.

2) Inside each GPU kernel, GPU programmers must map the CTA layout parameters that identify unique threads onto actual data locations. The CTA layout parameters are available to programmers inside each GPU kernel via four global read-only 3D parameters (gridDim, blockDim, blockIdx, threadIdx), each a 3-tuple ‹x,y,z›.

When I was first introduced to GPU programming, I found the CTA layout and mapping steps confusing, so I present more details on how to perform both steps below.

CTA layout: CUDA requires that the CTA layout be specified as part of each GPU kernel launch. The grid and block CTA layout parameters are specified as 3-tuples ‹x,y,z› that represent the ‹width, height, length› of a 3D grid or thread block layout. To refer to both the grid and block dimensions as one unit, I combine them into a 6-tuple denoted ‹gw,gh,gl, bw,bh,bl›. Various constraints on the maximum size and shape of these grid and block CTA layout parameters must be taken into consideration. Each tuple value is specified as a 32-bit signed integer. Zero and negative values do not make physical sense. Therefore, the valid range for any tuple value is $[1, 2^{31}-1]$. Programmers can eliminate unwanted dimensions by specifying a default value of one (1) for the unused value. This specification enables 1D, 2D, and 3D layouts, as desired.

CUDA currently does not use the gl (grid.z) CTA dimension value in any way. Consequently, the actual CTA layout is effectively ‹gw,gh,1, bw,bh,bl›. The Fermi architecture only supports a maximum value of 65,535 ($2^{16}-1$) for any dimension of the grid parameter. As a result, the valid range is limited to $[1, 2^{16}-1]$, whereas the Kepler architecture supports the full range $[1, 2^{31}-1]$. For Fermi, this limitation means that if more than 65,535 thread blocks within a grid are needed, then a 2D grid layout is the only way to cover all the desired thread blocks.

CUDA currently limits the thread block size (TBS) to 1,024 threads per block or fewer (bw·bh·bl ≤ 1,024). Since thread processing is actually done in warp-sized batches, the TBS should be a multiple of the warp size (meaning, 0 == TBS % 32). Processing thread blocks in warp-sized batches helps keep the threads within each thread block busy (as well as the SP cores that execute them). The grid CTA layout parameter supports 1D and 2D layouts (thread blocks within a grid), while the block CTA layout parameter supports 1D, 2D, and 3D layouts (individual threads within a thread block). Once specified (at launch time), the grid and thread block layouts remain fixed for the entire execution of a specific GPU kernel.
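The layout constraints above can be collected into a small host-side check. The following is a minimal sketch in plain C++, not code from the case studies: `Dim3` is a hypothetical stand-in for CUDA's `dim3` so that the example runs without a GPU, and the names `kMaxGridDim`, `kMaxTBS`, `kWarpSize`, `tbs`, `isValidLayout`, `blocksFor`, and `fermiGrid` are my own assumptions. It uses the Fermi limits quoted above and assumes at least one thread block is requested.

```cpp
#include <cstdint>

// Hypothetical stand-in for CUDA's dim3, so the sketch runs on the host.
struct Dim3 { std::uint32_t x, y, z; };

constexpr std::uint32_t kMaxGridDim = 65535; // Fermi grid limit per dimension (2^16 - 1)
constexpr std::uint32_t kMaxTBS     = 1024;  // maximum threads per thread block
constexpr std::uint32_t kWarpSize   = 32;    // threads per warp

// Thread block size: bw * bh * bl.
constexpr std::uint64_t tbs(Dim3 b) {
    return std::uint64_t(b.x) * b.y * b.z;
}

// True when <gw,gh,1, bw,bh,bl> satisfies the constraints described in the text.
// The warp-multiple rule is a performance recommendation, enforced here for illustration.
constexpr bool isValidLayout(Dim3 g, Dim3 b) {
    return g.x >= 1u && g.x <= kMaxGridDim &&
           g.y >= 1u && g.y <= kMaxGridDim && g.z == 1u &&
           b.x >= 1u && b.y >= 1u && b.z >= 1u &&
           tbs(b) <= kMaxTBS && tbs(b) % kWarpSize == 0u;
}

// Number of thread blocks needed so that n work items each get a thread.
constexpr std::uint64_t blocksFor(std::uint64_t n, std::uint64_t threadsPerBlock) {
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}

// Grid covering numBlocks thread blocks: 1D when it fits, otherwise 2D.
// Assumes 1 <= numBlocks <= 65535 * 65535. A 2D grid may contain surplus
// blocks, which the kernel must skip with a bounds check.
constexpr Dim3 fermiGrid(std::uint64_t numBlocks) {
    return numBlocks <= kMaxGridDim
        ? Dim3{std::uint32_t(numBlocks), 1, 1}
        : Dim3{kMaxGridDim,
               std::uint32_t((numBlocks + kMaxGridDim - 1) / kMaxGridDim), 1};
}
```

For example, 100,000 thread blocks do not fit in a 1D Fermi grid, so `fermiGrid(100000)` yields a 65,535 × 2 grid. That grid holds 131,070 blocks, so the surplus 31,070 must be skipped inside the kernel.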

CTA mapping: Inside each GPU kernel, programmers must map individual threads onto data locations. To help with this mapping process, CUDA provides four global read-only CTA parameters as 3-tuples (gridDim, blockIdx, blockDim, threadIdx). These CUDA variables are always available to code anywhere inside the CUDA kernel, including nested function calls. The gridDim and blockDim parameters refer back to the original CTA layout (size and shape) parameters specified at kernel launch, with the ‹x,y,z› dimension values meaning ‹width, height, length›, respectively. For convenience, I represent both the gridDim and blockDim variables as a single unit, ‹gw,gh,1, bw,bh,bl›. The blockIdx and threadIdx parameters uniquely identify the currently running thread block (within the current grid) and the currently running thread (within the current thread block). The blockIdx and threadIdx can be thought of as a multi-dimensional block ID (bid) and thread ID (tid), respectively. These variables uniquely identify the location of each individual thread within the structured thread hierarchy. Again, for convenience, I represent both the blockIdx and threadIdx parameters as a single 6-tuple, ‹bx,by,bz, tx,ty,tz›. All four CTA parameters help GPU programmers map individual threads onto their corresponding data items.

As an example, Figure 3.3 contains a code snippet that shows a full 5-dimensional CTA mapping down onto a single unique thread index.

    // Map Block ID (bid) from <gridDim, blockIdx>
    unsigned int gW = gridDim.x,  gH = gridDim.y;    // gL = gridDim.z  (not used)
    unsigned int bX = blockIdx.x, bY = blockIdx.y;   // bZ = blockIdx.z (not used)
    unsigned int bid = (gW*bY) + bX;                 // 2D to 1D mapping

    // Map Thread ID (tid) from <blockDim, threadIdx>
    unsigned int bW = blockDim.x,  bH = blockDim.y,  bL = blockDim.z;
    unsigned int tX = threadIdx.x, tY = threadIdx.y, tZ = threadIdx.z;
    unsigned int tid = (bH*bW*tZ) + (bW*tY) + tX;    // 3D to 1D mapping

    // Map Thread Index from <bid, tid>
    unsigned int TBS  = bH*bW*bL;                    // Thread Block Size
    unsigned int tIdx = (bid*TBS) + tid;             // 2D to 1D mapping

    // Map Data Offset from <tIdx>
    dataOff = ...                                    // Problem Domain Specific

Figure 3.3: A simple mapping from the four CTA layout parameters onto unique block and thread IDs that in turn are mapped onto a unique thread index. Note: This is only one of many possible ways to map CTA layout parameters onto a unique thread index within the CTA.
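The arithmetic in Figure 3.3 can be checked on the host without launching a kernel. The sketch below is plain C++ that mirrors the figure's first three steps. `Dim3` is a hypothetical stand-in for CUDA's `dim3`, the function and parameter names are my own, and 64-bit results are an assumption made here to avoid overflow on large layouts. Inside a real kernel the built-in variables would be read directly instead of being passed as arguments.

```cpp
#include <cstdint>

// Hypothetical stand-in for CUDA's dim3 / uint3, so the sketch runs on the host.
struct Dim3 { std::uint32_t x, y, z; };

// Step 1: 2D position of a block in the grid -> block ID (bid).
constexpr std::uint64_t blockId(Dim3 gDim, Dim3 bIdx) {
    return std::uint64_t(gDim.x) * bIdx.y + bIdx.x;
}

// Step 2: 3D position of a thread in its block -> thread ID (tid).
constexpr std::uint64_t threadId(Dim3 bDim, Dim3 tIdx) {
    return std::uint64_t(bDim.x) * bDim.y * tIdx.z
         + std::uint64_t(bDim.x) * tIdx.y
         + tIdx.x;
}

// Thread block size (TBS): bw * bh * bl.
constexpr std::uint64_t threadBlockSize(Dim3 bDim) {
    return std::uint64_t(bDim.x) * bDim.y * bDim.z;
}

// Step 3: <bid, tid> -> unique thread index within the whole CTA.
constexpr std::uint64_t threadIndex(Dim3 gDim, Dim3 bDim, Dim3 bIdx, Dim3 tIdx) {
    return blockId(gDim, bIdx) * threadBlockSize(bDim) + threadId(bDim, tIdx);
}
```

As a worked case, take the layout ‹3,2,1, 4,2,2›: six thread blocks of 16 threads each, 96 threads in total. The last thread of the last block, ‹2,1,0, 3,1,1›, has bid = 3·1 + 2 = 5 and tid = 8·1 + 4·1 + 3 = 15, so its thread index is 5·16 + 15 = 95, the final slot of the range 0 to 95.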

The code snippet shown in Figure 3.3 has four main steps. First, it maps the 2D grid layout onto a unique block ID (bid) within the grid. Second, it maps the 3D block layout onto a unique thread ID (tid) within the thread block. Third, it maps the bid and tid onto a unique thread index (tIdx) within the entire CTA. Finally, it maps the thread index onto the specific data offset that needs to be processed by this thread. This last step is not shown because it must be defined by the GPU programmer for their specific problem. CTA layout configurations that use fewer dimensions require fewer operations to perform the mapping.
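To illustrate the last two points, consider a purely 1D layout ‹gw,1,1, bw,1,1›. Every y and z term in the general mapping vanishes, leaving one multiply and one add. The sketch below is plain C++ with hypothetical names (`threadIndex1D`, `ownsElement`). Its data-offset step, one array element per thread with a bounds check, is only an assumed example of a problem-specific choice and not the mapping used in my case studies.

```cpp
#include <cstdint>

// 1D layout <gw,1,1, bw,1,1>: tIdx = bX * bW + tX.
// (bY, tY, and tZ are always zero, so their terms drop out.)
constexpr std::uint64_t threadIndex1D(std::uint32_t bW, std::uint32_t bX, std::uint32_t tX) {
    return std::uint64_t(bX) * bW + tX;
}

// Assumed problem-specific step: dataOff == tIdx, one element per thread.
// Threads whose index falls past the end of the n-element array do no work.
constexpr bool ownsElement(std::uint64_t tIdx, std::uint64_t n) {
    return tIdx < n;
}
```

With n = 1,000 elements and bW = 256, four thread blocks supply 1,024 threads. Thread 5 of block 3 has index 3·256 + 5 = 773 and owns an element, while the 24 threads with indices 1,000 through 1,023 fail the bounds check and sit idle.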

Flexibility in specifying the CTA partitioning (layout and map) allows programmers to choose data layouts that best fit their problem domain: 1D, 2D, or 3D. This freedom of choice becomes a burden, however. Before programmers can begin to code, they are forced to make choices on how to structure the CTA layout, how to partition data across the threads, and how to map threads onto data inside each kernel. These choices have large effects on performance, but whether those effects will be positive or negative is initially unclear. In my case studies, my data access skeletons help me explore choices for CTA partitioning and their effects (see Chapter 5 “Data Access Skeletons” for more details).

In document Brown_unc_0153D_15479.pdf (Page 76-79)