In every cycle, the warp scheduler selects a ready warp from the active warp pool for execution. As long as one in-flight warp is ready in every cycle, throughput is maximized. However, there are several reasons that a warp may not be ready [54]: instruction cache misses, barriers, warp finished before the rest of the warps in the same CTA, control hazards, data hazards, and structural hazards. To evaluate the effectiveness of the GPU’s latency hiding ability and explore how it might be improved, we identify and analyze all the significant sources of execution time delays for a warp.
Instruction cache misses: To hide instruction fetch latency, each warp has a two-entry instruction buffer. When no instruction is available in the buffer, additional delay is incurred before the next instruction can be fetched. This is mainly caused by instruction cache misses.
Barrier: Barrier synchronization allows all the threads within the same CTA to wait for each other before moving forward. Once a warp hits a barrier, it stalls until the rest of the warps within the same CTA reach the barrier. The more warps each CTA has, the more likely a warp will stall at a barrier. So, it’s important to keep all the warps within a CTA progressing at the same rate.
Function done: This is similar to a barrier stall. When a warp finishes before the rest of the warps in its CTA, it stalls until the CTA finishes, at which time a new CTA is issued. When there are no more CTAs available, the stall due to function done is also considered a tail effect.
Control hazards: Unlike CMPs, which are often equipped with sophisticated branch prediction logic, GPUs rely on massive parallelism to hide latency from control hazards. However, from a single warp’s perspective, if a branch or function call instruction executes, the warp stalls until the target address is calculated.
Structural hazards: Structural hazards are caused by the unavailability of functional units when there are active warps ready to issue, or by the unavailability of miss status holding registers (MSHRs) in the memory system. In modern GPU architectures such as Fermi [53], the memory pipeline becomes unavailable when it stalls because the MSHRs are full. Structural hazards often occur in the SFU or MEM pipelines in GPUs, as the throughput of SP is usually much larger than the throughput of MEM and SFU. For instance, the throughput ratio between SP, SFU and MEM is 16:1:8 in Fermi.
Data hazards: Data dependency can introduce stalls when the next instruction of a warp depends on a result from a previous instruction. Currently, the GPU does not support data forwarding, so a warp stalls until all data dependencies have been resolved. If an instruction depends on a load instruction that goes to global memory (DRAM), the warp might stall for hundreds of cycles before the dependency is resolved.
6.3.1 Analyzing CPI Breakdown
To illustrate how different stall factors can contribute to the CPI of a warp, we developed an algorithm to count and categorize the cycles per instruction for each warp. In this section, we use the latency characterization algorithm introduced by Lee et al. [54]. In every cycle, the profiler increments one of the stall counters for each warp from which no instruction is issued. If multiple stall factors overlap, we increment the first applicable stall counter following the order in section 6.3, which is the order in which stalls occur in the pipeline (e.g., an instruction cache miss would happen before other types of stalls).
[Figure 6.2 plot: bars of cycles per instruction (axis ticks 0 to 200) for LBM, BFS, SPM, MRI, HIS, STE, CUT, TPA, SAD, SGE, and AVG. Legend: I$ Miss, Barrier, Function Done, Control Hazards, Structural Hazards-SP, Structural Hazards-SFU, Structural Hazards-MEM, EXE, Data Hazards-memop, Data Hazards-exe.]
Figure 6.2: The CPI per warp breakdown for Parboil benchmarks with GTO scheduling.
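The counting rule in this section can be sketched in a few lines of Python. This is a minimal illustration, not the profiler of Lee et al. [54]: the category identifiers, the `other` catch-all, the trace format, and the assumption that a warp issues at most one instruction per cycle are all hypothetical. The priority order follows the order in which section 6.3 describes the stall types.

```python
from collections import Counter

# Priority order assumed from the per-category descriptions in section 6.3.
# The identifier names are illustrative and not taken from the original profiler.
STALL_ORDER = [
    "icache_miss",
    "barrier",
    "function_done",
    "control_hazard",
    "structural_hazard",
    "data_hazard",
]


def classify_cycle(issued, active_stalls):
    """Return the single category charged to one warp for one cycle."""
    if issued:
        return "issue"
    # When several stall factors overlap, charge only the first in pipeline order.
    for cause in STALL_ORDER:
        if cause in active_stalls:
            return cause
    return "other"  # hypothetical catch-all for unlisted causes


def cpi_breakdown(trace):
    """Compute per-category CPI for one warp.

    trace: iterable of (issued, active_stalls) pairs, one pair per cycle.
    Assumes each issue cycle issues exactly one instruction.
    """
    counts = Counter(classify_cycle(issued, stalls) for issued, stalls in trace)
    instructions = counts["issue"]
    if instructions == 0:
        raise ValueError("warp issued no instructions")
    return {cause: cycles / instructions for cause, cycles in counts.items()}
```

Because every cycle is charged to exactly one category, the components returned by `cpi_breakdown` sum to the warp's total CPI, which is what allows the per-category contributions to be presented as a breakdown.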
Figure 6.2 presents the average CPI breakdown for Parboil applications. Each bar shows the CPI contributed by the various stall factors described in section 6.3. To better investigate the CPI breakdown, we further break down structural hazards into structural hazards due to SP, SFU, and MEM function units, and data hazards into data hazards due to load instructions and execution instructions. The CPI breakdown results are the average across all the warps among all the SMs throughout the kernel execution. The total CPI of each kernel indicates the effectiveness of its latency hiding ability when we launch as many warps as possible, which also shows how many warps are needed to completely hide the latencies of the kernel. Kernels toward the left do not hide latencies well, whereas the kernels on the right have smaller latencies that can be easily hidden with sufficient warps.
We can derive the IPC of an SM by combining CPI with the number of warps each kernel issues per SM. Figure 6.3 shows how CPI per warp and the number of warps determine IPC for Parboil applications. The x-axis represents the average IPC of each kernel, and the y-axis is the number of warps in-flight per SM divided by per-warp CPI. The figure also lists the number of warps each kernel issues per SM. The data trend confirms that IPC = N_warps / CPI, i.e., we can improve IPC by reducing CPI per warp and improving warp occupancy. Since it is difficult to change warp occupancy without modifying the GPU architecture or the existing scheduling scheme, we first investigate how to reduce each component that contributes to CPI.
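As a small worked example of the relation IPC = N_warps / CPI, the sketch below evaluates it for two cases. The function name `sm_ipc` and the input numbers are hypothetical, chosen only for easy arithmetic; they are not measurements from Figure 6.2 or Figure 6.3.

```python
def sm_ipc(num_warps, cpi_per_warp):
    """Estimate the IPC of one SM as N_warps / CPI (per-warp CPI)."""
    if cpi_per_warp <= 0:
        raise ValueError("cpi_per_warp must be positive")
    return num_warps / cpi_per_warp


# Hypothetical inputs, not measured values.
baseline = sm_ipc(num_warps=48, cpi_per_warp=96.0)   # 48 / 96 = 0.5
lower_cpi = sm_ipc(num_warps=48, cpi_per_warp=64.0)  # 48 / 64 = 0.75
```

With occupancy held fixed at 48 warps, cutting per-warp CPI from 96 to 64 raises the estimated IPC from 0.5 to 0.75, which is why reducing the individual CPI components is the first lever examined.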
The most dominant CPI component in Figure 6.2 is structural hazards-MEM, which contributes 35.09% of the total CPI, primarily due to contention in MSHRs and other resources that can mark the MEM function unit unavailable. SPM and LBM in particular experience a significant number of stalls from MSHRs. This is because both kernels are memory bandwidth-intensive with many L1 cache accesses/misses. As a result, their performance is degraded significantly by structural hazards from MEM. The structural hazards due to SP and SFU components are relatively small, contributing 3.62% and 3.30% of the total CPI, respectively. Note that structural hazards indicate the unavailability of certain function units, so they cannot be reduced by increasing the degree of parallelism (adding more warps). In addition, if a kernel suffers significant structural hazards due to one of the function units, it also indicates significant underutilization of the rest of the function units. Moreover, it is hard to improve the utilization balance among different function units, since each kernel consists of many identical threads, so execution characteristics remain relatively stable. It is worth noting that the scheduling policy can sometimes impact structural hazards. The scheduler is responsible for picking the right warp from the active warp pool in every cycle. If there is a phase in which kernels heavily utilize one of the function units, a good scheduling policy would be able to reduce structural hazards by keeping warps moving at different paces so that different warps spread out their intensive utilization.
The next two most significant CPI components are data hazards due to mem and exe operations, i.e., stalls caused by waiting for data to be ready from previous load or arithmetic instructions. Kernels such as MRI, CUT, and TPA suffer from data hazards due to arithmetic instructions. For SPM and HIS, the stalls are due to data hazards from previous load instructions. When there are sufficient warps, the scheduler can easily hide those latencies because, unlike structural hazards, data hazard latency does not increase as the degree of parallelism increases. Furthermore, note that scheduling policy cannot reduce the CPI portions due to data hazards.
Stalls due to barrier and function done correspond to 14.64% and 9.52% of the total CPI in Figure 6.2. Despite a high degree of parallelism, stalls due to barriers and function done can greatly reduce the number of active warps, leaving warps waiting for the rest of the warps in the same CTA if the warps progress at different paces. For instance, in LBM and MRI, stalls due to function done contribute 24.31% and 13.61% of the total CPI, and BFS suffers 53.27% of its stalls due to barrier. Scheduling policy might be able to keep different CTAs progressing differently to avoid an overlap of such stalls from different CTAs, but the effect is very kernel-dependent, especially for kernels with bigger but fewer CTAs per SM. Furthermore, given kernels with the same characteristics, the more warps each CTA has, the harder it is to keep all of them at the same pace, so stalls due to barrier and function done could be longer. As a result, the scheduling policy plays a key role in reducing the CPI components due to barrier and function done. By keeping all the warps at a similar pace throughout execution, in theory, we can easily reduce this CPI portion. However, current scheduling policies such as LRR and GTO do not have such awareness [54].
[Figure 6.3 plot: axes are normalized IPC and # of warps / CPI per warp. Data points, labeled with the number of warps per SM: LBM-28, BFS-32, SPM-48, HIS-24, MRI-40, SAD-16, SGE-16, STE-32, TPA-24, CUT-32.]
Figure 6.3: This figure shows the relationship between number of warps, CPI, and IPC.
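To make the effect of warp pace on barrier stalls concrete, the sketch below computes how long each warp in a CTA waits at a barrier, assuming the barrier releases in the cycle the last warp arrives. The function name `barrier_stall_cycles` and the arrival cycles are hypothetical. The same arithmetic applies to function done stalls if arrival cycles are replaced by the cycles at which each warp finishes.

```python
def barrier_stall_cycles(arrival_cycles):
    """Cycles each warp of a CTA waits at a barrier.

    arrival_cycles: cycle at which each warp reaches the barrier.
    Assumes the barrier releases when the slowest warp arrives.
    """
    if not arrival_cycles:
        return []
    release = max(arrival_cycles)
    return [release - arrival for arrival in arrival_cycles]


# Hypothetical arrival cycles for a four-warp CTA, not measured values.
in_step = barrier_stall_cycles([100, 102, 101, 103])  # warps at a similar pace
skewed = barrier_stall_cycles([100, 180, 140, 260])   # warps at different paces
```

In the first case the warps wait 6 cycles in total, while in the second they wait 360 cycles, even though both CTAs have the same number of warps. This is the gap that a pace-aware scheduling policy would aim to close.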
In this section, we laid out all the key factors that govern GPU throughput from a single warp perspective. To sum up, in order to improve GPU throughput, we need to improve the degree of parallelism, reduce structural and data hazards, and reduce stalls due to barrier and function done. The following chapters focus on approaches that tackle one or more of these aspects.