Chapter 2: Background and Prior Work

2.4 GPGPU Mechanics

2.4.2 Hardware Architecture

2.4.2.1 Execution Engine

The execution engine (EE) consists of one or more parallel processors. GPU manufacturers scale the number of processors in the EE to realize GPUs with different computational capabilities. This allows a manufacturer to cover the embedded, laptop, desktop, gaming-enthusiast, and supercomputing markets with a common processor architecture. The number of parallel processors in a GPU also varies with each hardware architecture, especially among GPUs of different manufacturers. For example, recent NVIDIA GPUs typically have between one and sixteen processors, while AMD GPUs may have up to 44 (Smith, 2013).

For NVIDIA GPUs, each parallel processor is called a “streaming multiprocessor” (SM). Each SM is capable of executing a single instruction concurrently across several “lanes” of data operands. At any instant, a group of tightly coupled user-defined threads, called a “warp,” is bound to these lanes, one thread per lane. Thus, the threads in a warp execute in lock-step. If threads diverge on a conditional branch (e.g., an if/else), then each branch is executed in turn, with the appropriate threads “masked out” within each branch to ensure that each thread executes only the correct branch.17
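The masking behavior can be mimicked on the host with a small serial model. The following C++ sketch is illustrative only: the 32-lane width, the `Warp` type, and the function `runDivergentBranch` are assumptions introduced here, and real hardware tracks the active mask in silicon rather than in software. The sketch executes both sides of an if/else in turn and reports how many passes were issued.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

constexpr int kWarpSize = 32;             // assumed lane count (hypothetical constant)
using Warp = std::array<int, kWarpSize>;  // one data operand per lane

// Models a warp executing
//     if (v % 2 == 0) v /= 2; else v = 3 * v + 1;
// in lock-step. Each side of the branch is issued once for the whole warp,
// and lanes that did not take that side are masked out.
// Returns the number of passes issued: 1 if all lanes agree, 2 if they diverge.
int runDivergentBranch(Warp& values) {
    // Bit `lane` is set when that lane takes the "if" side.
    std::uint32_t takenMask = 0;
    for (int lane = 0; lane < kWarpSize; ++lane) {
        if (values[lane] % 2 == 0) takenMask |= (1u << lane);
    }

    int passes = 0;
    if (takenMask != 0) {  // first pass: the "if" side
        ++passes;
        for (int lane = 0; lane < kWarpSize; ++lane) {
            if ((takenMask >> lane) & 1u) values[lane] /= 2;
        }
    }
    const std::uint32_t elseMask = ~takenMask;
    if (elseMask != 0) {   // second pass: the "else" side
        ++passes;
        for (int lane = 0; lane < kWarpSize; ++lane) {
            if ((elseMask >> lane) & 1u) values[lane] = 3 * values[lane] + 1;
        }
    }
    return passes;
}
```

In this model, a warp whose lanes all hold even values needs one pass, while a warp with mixed values needs two, which is the cost of divergence described above.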

Although an SM can execute only one warp at a time, it may be oversubscribed: several warps may be assigned to a single SM at once. This facilitates the hiding of memory latencies, because the SM quickly context switches to another ready warp if the currently executing warp stalls on a memory access.
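The benefit of oversubscription can be seen in a toy model. The sketch below is not a description of any real warp scheduler: it assumes (hypothetically) that every warp issues one instruction and then stalls for a fixed number of cycles on memory, and that the SM picks the lowest-numbered ready warp each cycle. The function name `issueSlotsUsed` is introduced here for illustration.

```cpp
#include <cassert>
#include <vector>

// Counts how many of `totalCycles` issue slots an SM fills when `warps` warps
// are resident. Assumed model: a warp issues for one cycle, then stalls on
// memory for `stallCycles` cycles before it is ready again.
int issueSlotsUsed(int warps, int stallCycles, int totalCycles) {
    assert(warps > 0 && stallCycles >= 0 && totalCycles >= 0);
    std::vector<int> readyAt(warps, 0);  // first cycle at which each warp may issue
    int used = 0;
    for (int cycle = 0; cycle < totalCycles; ++cycle) {
        for (int w = 0; w < warps; ++w) {
            if (readyAt[w] <= cycle) {   // switch to the first ready warp
                ++used;
                readyAt[w] = cycle + 1 + stallCycles;
                break;
            }
        }
    }
    return used;
}
```

Under these assumptions, with a three-cycle stall a single resident warp fills one issue slot in four, two warps fill half of the slots, and four or more warps keep the SM busy every cycle.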

 1  // Operate on 'input' of size (x, y, z) in (4, 4, 4)-sized thread blocks
 2  void kernel3d_host(int ***input, int x, int y, int z)
 3  {
 4      dim3 size3d(x, y, z);
 5      dim3 blockSize(4, 4, 4);
 6      dim3 gridSize(size3d.x / blockSize.x, size3d.y / blockSize.y, size3d.z / blockSize.z);
 7      kernel3d_gpu<<<gridSize, blockSize>>>(input, size3d);
 8  }
 9
10  // A 3D CUDA kernel
11  __global__
12  void kernel3d_gpu(int ***input, dim3 size3d)
13  {
14      // Compute spatial location of thread within grid
15      int i = blockDim.x * blockIdx.x + threadIdx.x;
16      int j = blockDim.y * blockIdx.y + threadIdx.y;
17      int k = blockDim.z * blockIdx.z + threadIdx.z;
18
19      // Only operate on input if thread index is within boundaries
20      if ((i < size3d.x) && (j < size3d.y) && (k < size3d.z))
21      {
22          ...
23      }
24  }

Figure 2.21: Code fragments for a three-dimensional CUDA kernel.

To better understand how warps are assigned to the EE, we must first briefly discuss the general GPGPU programming model. In GPGPU programs, threads are organized in a hierarchical and spatial manner. Warps are groups of tightly coupled user threads. Warps are grouped into one-, two-, or three-dimensional blocks. Blocks are arranged into one-, two-, or three-dimensional grids. One grid represents all the threads used to execute a single GPU kernel, as discussed in Chapter 1. Conceptually, it may help to think of individual threads as mapped to a single inner-most iteration of a singly-, doubly-, or triply-nested loop. At execution time, lane-specific hardware registers within the SM inform the currently executing thread of its location within its grid. Using this information, user code can properly index input and output data structures.
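The nested-loop analogy can be made concrete with a serial host-side emulation. In the sketch below, `Dim3` is a hypothetical stand-in for CUDA's `dim3`, and `emulateGrid` and `allVisitedOnce` are names introduced here for illustration. The loops play the role of the hardware registers: each innermost iteration corresponds to one thread, which derives its global coordinates from its block and thread indices.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Dim3 { unsigned x, y, z; };  // hypothetical stand-in for CUDA's dim3

// Serially visits every (block, thread) pair of a grid, computes the global
// (i, j, k) coordinate as a kernel thread would, and counts how often each
// element of a size.x * size.y * size.z problem is touched.
std::vector<int> emulateGrid(Dim3 grid, Dim3 block, Dim3 size) {
    std::vector<int> visits(static_cast<std::size_t>(size.x) * size.y * size.z, 0);
    for (unsigned bz = 0; bz < grid.z; ++bz)
    for (unsigned by = 0; by < grid.y; ++by)
    for (unsigned bx = 0; bx < grid.x; ++bx)
    for (unsigned tz = 0; tz < block.z; ++tz)
    for (unsigned ty = 0; ty < block.y; ++ty)
    for (unsigned tx = 0; tx < block.x; ++tx) {
        const unsigned i = block.x * bx + tx;
        const unsigned j = block.y * by + ty;
        const unsigned k = block.z * bz + tz;
        // Threads outside the problem boundaries do nothing.
        if (i < size.x && j < size.y && k < size.z) {
            ++visits[(static_cast<std::size_t>(k) * size.y + j) * size.x + i];
        }
    }
    return visits;
}

// True when every element was processed by exactly one thread.
bool allVisitedOnce(const std::vector<int>& visits) {
    for (int v : visits) {
        if (v != 1) return false;
    }
    return true;
}
```

For example, a 2x2x2 grid of 4x4x4 blocks covers an 8x8x8 problem exactly once, and the boundary check lets the same grid cover a 7x8x8 problem without touching out-of-range elements.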

A simple example is illustrated by the code fragments in Figure 2.21. In line 5, the host configures a GPU kernel to process data in three-dimensional blocks of size 4x4x4 threads. The number of blocks in the grid is computed in line 6, assuming that the problem size divides evenly by four. This dimensional configuration of the kernel is provided at kernel launch, in line 7. The code in the function kernel3d_gpu() is programmed from the perspective of a single thread, one among many within the grid. This thread determines its (i, j, k) coordinates within the grid on lines 15 through 17. If the coordinates are within the bounds of the problem (line 20), then the thread operates on the input data. If the input problem has the dimensions of 128x128x128, then the resulting grid has 32x32x32 blocks, each with 4x4x4 (or 64) threads. This breakdown of the grid into threads is illustrated in Figure 2.22. Each multi-dimensional block is linearized and decomposed into warps. The threads of each warp are mapped to the hardware lanes. Although not specified by the language, all warp sizes to date on NVIDIA GPUs have been 32 lanes, so under this assumption, each block would be made up of two warps. This is illustrated in Figure 2.23.

Figure 2.22: Grid of 32x32x32 blocks, with blocks of 4x4x4 threads.
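The arithmetic behind the grid sizing and the decomposition of a block into warps can be checked with a few helper functions. This is a minimal host-side sketch: the function names are hypothetical, the warp size of 32 is the assumption stated in the text, and the linearization order (x fastest, then y, then z) follows the convention documented for CUDA thread indices. The rounding-up division is an addition for problem sizes that do not divide evenly by the block size, a case the example in Figure 2.21 assumes away.

```cpp
#include <cassert>

// Number of blocks needed along one dimension, rounding up so that a problem
// size that is not a multiple of the block size is still fully covered.
unsigned ceilDiv(unsigned n, unsigned d) {
    assert(d > 0);
    return (n + d - 1) / d;
}

// Linear index of thread (tx, ty, tz) within a block of dimX * dimY * dimZ threads.
unsigned linearThreadId(unsigned tx, unsigned ty, unsigned tz,
                        unsigned dimX, unsigned dimY) {
    return tx + ty * dimX + tz * dimX * dimY;
}

// Warp and lane to which a linearized thread is mapped (warp size assumed to be 32).
unsigned warpOf(unsigned linearId, unsigned warpSize = 32) { return linearId / warpSize; }
unsigned laneOf(unsigned linearId, unsigned warpSize = 32) { return linearId % warpSize; }

// Number of warps needed to hold every thread of a block.
unsigned warpsPerBlock(unsigned dimX, unsigned dimY, unsigned dimZ,
                       unsigned warpSize = 32) {
    return ceilDiv(dimX * dimY * dimZ, warpSize);
}
```

With these definitions, `ceilDiv(128, 4)` gives the 32 blocks per dimension of the example, `warpsPerBlock(4, 4, 4)` gives two warps, and thread (0, 0, 2) of a 4x4x4 block has linear index 32, which places it in lane 0 of the second warp.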

The EEs of modern GPUs are capable of executing more than one kernel concurrently.18 This feature can be leveraged to increase EE utilization. Consider the following situation. Suppose we have several independent kernels to execute. We issue each kernel to the GPU, one at a time, waiting for each issued kernel to complete before issuing the next. Recall that grid blocks are distributed among the EE’s SMs. Towards the end of the execution of each kernel, SMs will begin to idle after completing their assigned blocks, while other SMs continue executing their remaining work. At the instant before the last block completes, all but one SM will be idle. However, if we issue the independent kernels to the GPU in quick succession, not waiting for each issued kernel to complete before issuing the next, then SMs can be kept busy as they execute blocks of grids that have already been queued up for execution.
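A back-of-the-envelope model illustrates the utilization gain. The sketch below rests on strong simplifying assumptions that are not claims about real hardware: every block takes one time unit, each SM runs one block at a time, and when kernels are queued back to back, an idle SM can immediately pick up a block of any queued kernel. The function names are hypothetical.

```cpp
#include <cassert>
#include <vector>

// Completion time when each kernel must finish before the next is issued.
// Each kernel needs ceil(blocks / numSMs) time units, and the final unit of a
// kernel may leave some SMs idle.
long long sequentialMakespan(const std::vector<long long>& blocksPerKernel,
                             long long numSMs) {
    assert(numSMs > 0);
    long long total = 0;
    for (long long blocks : blocksPerKernel) {
        total += (blocks + numSMs - 1) / numSMs;
    }
    return total;
}

// Completion time when all kernels are queued in quick succession, so that
// idle SMs are filled with blocks of the kernels already queued.
long long backToBackMakespan(const std::vector<long long>& blocksPerKernel,
                             long long numSMs) {
    assert(numSMs > 0);
    long long totalBlocks = 0;
    for (long long blocks : blocksPerKernel) {
        totalBlocks += blocks;
    }
    return (totalBlocks + numSMs - 1) / numSMs;
}
```

Under these assumptions, four kernels of 20 blocks each on 16 SMs take 8 time units when issued one at a time, but only 5 when issued back to back, because the SMs idled by each kernel's last 4 blocks are put to work.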

Figure 2.23: Linearized block of threads decomposed into two warps, which are multiplexed on hardware lanes.

18_{There may be restrictions, however. For example, on NVIDIA GPUs, memory subsystems of the GPU require that concurrently}

A GPGPU kernel is decomposed into a collection of multidimensional blocks, which are made up of threads that are grouped into warps. How is each SM assigned warps to execute? Although SMs execute the instructions of a single warp at a time, SMs are not assigned individual warps. Instead, SMs are assigned blocks, and the SMs independently schedule the warps within each block. How are blocks assigned to SMs? In today’s technology, blocks are assigned to SMs by in-silicon hardware schedulers on the GPU (Bradley, 2012). As software engineers, we have no direct control over how work is distributed among the SMs.19 We merely provide the GPU with a collection of blocks, which are then scheduled by the GPU itself. This has two important implications:

1. We cannot predictably schedule individual SMs.

2. We cannot predict how SMs may be shared among concurrent kernels.

It is for these reasons that we consider the SMs together as a single processor—the execution engine—rather than individual processors.20 We can predictably schedule the EE.

19_{This holds true for AMD GPUs as well (AMD, 2013).}

20_{OpenCL 1.2 supports an optional feature called Device Fission. This feature allows the system designer to divide a single compute device into several logical compute devices. This may be exploited to reserve a segment of compute resources for high-priority work. However, support for this feature is currently limited to OpenCL runtimes that execute on CPUs (including Intel’s Xeon Phi) and tightly coupled heterogeneous processors, such as the Cell BE. We note that the problem of scheduling several logical devices is very similar to scheduling a multi-GPU system.}

There is one more aspect of the EE that affects real-time predictability: the EE is non-preemptive. This is understandable, given the complexities of the GPU’s hardware scheduler. Non-preemption has a significant impact on any real-time system in two ways. First, it becomes impossible to strictly enforce budgets on task execution time. At best, any real-time scheduling algorithm may only attempt to avoid and isolate the harmful effects that violations of provisioned execution times may have. Second, priority inversions become inevitable under any work-conserving scheduler. It will always be the case that low-priority work may be scheduled on an idle EE at time t when higher-priority work for the EE arrives at time t + ε. The higher-priority work cannot be scheduled until the low-priority work completes. We must develop algorithms that limit the duration of such inversions and account for them in analysis.
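The duration of such an inversion follows directly from the arrival and completion times. The sketch below assumes, as the text does, that the EE is treated as a single non-preemptive processor running one piece of work at a time. The function names and the integer time unit (say, microseconds) are hypothetical choices made for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Time that high-priority work arriving at `highArrival` must wait for
// low-priority work that started at `lowStart` and runs, without preemption,
// for `lowExec` time units.
long long inversionDuration(long long lowStart, long long lowExec,
                            long long highArrival) {
    assert(lowExec >= 0 && highArrival >= lowStart);
    return std::max(0LL, lowStart + lowExec - highArrival);
}

// As the arrival offset shrinks toward zero, the inversion approaches the full
// length of the blocking work, so the longest low-priority execution time
// bounds the inversion caused by a single blocking kernel.
long long worstCaseInversion(const std::vector<long long>& lowPriorityExecTimes) {
    if (lowPriorityExecTimes.empty()) return 0;
    return *std::max_element(lowPriorityExecTimes.begin(), lowPriorityExecTimes.end());
}
```

For instance, if low-priority work starts at time 0 and runs for 500 units, high-priority work arriving at time 1 is blocked for 499 units. Under this model, shortening the longest low-priority kernel is one way to limit the inversion.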
