5.1 Block DASk
5.1.5 Improving Copy Performance using ILP and TLP
As we saw in Chapter 4, my simple Copy kernel had poor performance. In this section, I seek to improve Copy performance by hiding stalls in the GPU hardware, such as waiting on long I/O operations (transfers between registers and global memory take 400-800 cycles) or waiting on the previous instruction's output (a RAW dependency takes 8-11 cycles). I hide stalls using a combination of instruction-level parallelism (ILP) and thread-level parallelism (TLP) to keep the SPs on each SM as busy as possible.
Increasing ILP means increasing the work done per-thread. To test ILP, I wrote code that tested both automatic and manual loop unrolling. As will be seen shortly, manual loop unrolling, also known as software pipelining, gives better performance.
Increasing TLP means increasing the threads-per-block. This increases opportunities for the warp scheduler to hide stalls using independent instructions from other concurrent warps.
Since my simple Copy kernel did not take advantage of either ILP or TLP, it achieved only 25% of the maximum peak throughput and was about 2.7x slower than the built-in CUDA library function cudaMemcpy. Contrast this with Figure 5.1, where each of my three Copy DASks achieves throughput comparable to cudaMemcpy.
Increasing ILP:
In this section, the programmer will learn how to use manual loop unrolling in their own kernels. I demonstrate this using my Block DASk for the Copy primitive. For this Block Copy kernel, manual loop unrolling results in up to 17% better performance than my simple Copy kernel from Chapter 4.
Loop Unrolling: Loops are the classic programming technique for handling multiple work items in serial CPU code. Loop unrolling³ hides stalls caused by dependences between instructions and between loop iterations by rewriting the loop to process more independent data elements inside each loop iteration. Loop unrolling amortizes loop overhead across k elements and also amortizes the cost of indexing and pointer computations. This technique can be implemented either automatically by a compiler or manually by the programmer. Because each independent work item may require registers to track its execution state, I suggest unrolling in small batches of 2-8 work items to avoid exceeding the available registers and spilling into local memory.
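As a minimal sketch of the idea in serial CPU code, the loop below processes four independent elements per iteration and finishes any leftover elements in a short cleanup loop. The function name copyUnrolled4 and the batch size k = 4 are illustrative assumptions and are not taken from the DASk source.

```cpp
#include <cstddef>

// Hypothetical helper: copy n elements from src to dst, unrolled by k = 4.
// Each iteration issues four independent loads followed by four stores, so
// the loop counter update and branch are paid once per four elements.
template <typename T>
void copyUnrolled4(T* dst, const T* src, std::size_t n) {
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        // Four loads with no dependences between them
        T v0 = src[i + 0];
        T v1 = src[i + 1];
        T v2 = src[i + 2];
        T v3 = src[i + 3];
        // Four stores, each depending only on its own load
        dst[i + 0] = v0;
        dst[i + 1] = v1;
        dst[i + 2] = v2;
        dst[i + 3] = v3;
    }
    // Cleanup loop for the final n % 4 elements
    for (; i < n; ++i) {
        dst[i] = src[i];
    }
}
```

The four temporaries are the reason for the small batch size suggested above: each extra element in flight needs at least one more register.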
Processing more work items per thread is the GPU kernel equivalent of loop unrolling. Loop unrolling is just one of a family of optimization techniques (Wadleigh and Crawford, 2000) used to improve the performance of loops. These include loop fission, loop fusion, loop interchange, loop-invariant code motion, loop unrolling, loop reversal, and loop unswitching. Almost the entire family of serial loop optimizations can be repurposed and rebranded as GPU kernel optimizations for parallel programming. Recall from Chapter 4.1 that we related serial iteration to data parallelism: the innermost body of statements within a nested loop structure is kept, while the looping of n iterations across n data elements is replaced by hardware scheduling of n parallel threads onto n data elements.
3 As described in the book Software Optimization for High Performance Computing (Wadleigh and Crawford, 2000).
Automatic Loop Unrolling: A specific example of automatic loop unrolling is given in Figure 5.7. This code would be used in the GPU programmer's shaded boxes in Figure 5.6 to replace the software pipelined copy code currently shown there. The CUDA compiler supports a #pragma unroll directive that takes an optional parameter specifying the number of iterations to unroll (see the directive immediately before the for loop in Figure 5.7 for an example).
// Copy assigned work items (nWork); D is the input array, S the output array
#pragma unroll 4
1: for (i = 0; i < nWork; ++i) {
2:   wOff = (i * TBS) + dOff;   // Work item offset
     // Range check [start, stop]
3:   inRange = (start <= wOff) && (wOff <= stop);
4:   if (inRange) {
5:     S[wOff] = D[wOff];       // Copy work item from input to output
6:   }
7: }
Figure 5.7: Automatic loop unrolling example. The #pragma unroll 4 directive (in lighter grey) placed before a loop asks the CUDA compiler to unroll the loop body k=4 times.
For my loop unrolling example, I rewrote my Block DASk to do the loop unrolling using the #pragma unroll directive. All changes needed to support loop unrolling replace the code found in the shaded boxes in Figure 5.6. The shaded boxes represent the user's portion of this simple DASk, in other words, the code that another GPU programmer would change to support a different algorithm.
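To make the indexing in the unrolled loop concrete, the following host-side sketch emulates one thread block executing the per-thread copy loop. It assumes that dOff is the thread's first element (block base plus thread index) and that successive work items of one thread are TBS elements apart, so that neighboring threads touch neighboring elements. The function names and the blockBase parameter are hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Emulates the work of one thread: nWork items, TBS elements apart,
// each guarded by the [start, stop] range check.
template <typename T>
void copyThreadWork(std::vector<T>& out, const std::vector<T>& in,
                    std::size_t dOff, std::size_t TBS, std::size_t nWork,
                    std::size_t start, std::size_t stop) {
    for (std::size_t i = 0; i < nWork; ++i) {
        std::size_t wOff = (i * TBS) + dOff;              // work item offset
        bool inRange = (start <= wOff) && (wOff <= stop); // range check
        if (inRange) {
            out[wOff] = in[wOff];                         // input to output
        }
    }
}

// Emulates one thread block of TBS threads. Assumption: thread tid starts
// at element blockBase + tid, so the block covers TBS * nWork elements.
template <typename T>
void copyBlockWork(std::vector<T>& out, const std::vector<T>& in,
                   std::size_t blockBase, std::size_t TBS, std::size_t nWork,
                   std::size_t start, std::size_t stop) {
    for (std::size_t tid = 0; tid < TBS; ++tid) {
        copyThreadWork(out, in, blockBase + tid, TBS, nWork, start, stop);
    }
}
```

With TBS = 4 and nWork = 2, thread 0 handles elements 0 and 4, thread 1 handles elements 1 and 5, and so on; any element beyond stop is skipped by the range check.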
After testing the automatically unrolled copy, I found that this approach resulted in only a modest improvement in throughput compared to a baseline of one work item (see Figure 5.8). The change was barely noticeable on the GTX 580 (+0.25% in the best case) and not much better on the GTX Titan (+8.8% in the best case). As the graph shows, only 2-3 work items per thread yield even a minor throughput boost, and throughput drops off gradually beyond four work items per thread. Also note that the automatically unrolled kernel's throughput (~61 GB/s) is actually well short of the original simple Copy kernel's throughput (~81 GB/s) on the GTX Titan. This decreased throughput is caused by a poor memory access pattern that I will explain later in this section.
Figure 5.8: Automatic loop unrolling vs. manual loop unrolling test throughput. Given a fixed input size n = 2^24, fixed grid row size = 224, and fixed block size = 32, I/O throughput (in GB/s) is shown on the y-axis and the amount of work per thread, nWork = [1-16], is shown on the x-axis. The upper and lower panels show throughput results on the GTX 580 (Fermi) and GTX Titan (Kepler) GPUs respectively. The cyan/blue lines represent manual loop unrolling and the tan/red lines represent automatic loop unrolling. Both automatic and manual loop unrolling were tested in batches of 2, 4, 8, and 16.
[Figure 5.8 plot. Panels: "GTX 580 - Throughput for increasing Work" and "GTX Titan - Throughput for increasing Work". x-axis: w = Work per Thread (1-16); y-axis: I/O Throughput (GB/s). Series: Pipeline Base, Software Pipeline B2, B4, B8, B16; Unroll Base, Unroll Batch 2, 4, 8, 16.]
Manual Loop Unrolling: Next, I unrolled my loop by hand, interleaving k instructions from k work items to decrease RAW dependencies. The main idea is that, while a particular instruction for the ith work item may stall, similar instructions from the other k-1 work items can be scheduled in its place to hide the stall and keep each processing core busy doing useful work. This form of manual loop unrolling is also known as software pipelining⁴. More independent work items increase register pressure, which in turn may limit occupancy, so I experimented with manual unrolling (on up to 16 work items) in batches of 2, 4, 8, or 16. Figure 5.9 shows Copy code that range-checks multiple work items using manual loop unrolling. This is the type of code that a GPU programmer would insert into the shaded boxes of Figure 5.6 to support the Copy primitive.
4 As described in the book Software Optimization for High Performance Computing (Wadleigh and Crawford, 2000).
// Process work items [1-4]
// Get work offsets
if (nWork >= 1) { w1 = (0 * TBS) + dOff; }
if (nWork >= 2) { w2 = (1 * TBS) + dOff; }
if (nWork >= 3) { w3 = (2 * TBS) + dOff; }
if (nWork >= 4) { w4 = (3 * TBS) + dOff; }
// Range check [start, stop]
if (nWork >= 1) { t1 = (start <= w1) && (w1 <= stop); }
if (nWork >= 2) { t2 = (start <= w2) && (w2 <= stop); }
if (nWork >= 3) { t3 = (start <= w3) && (w3 <= stop); }
if (nWork >= 4) { t4 = (start <= w4) && (w4 <= stop); }
// Load data (D is the input array)
if (nWork >= 1) { if (t1) { v1 = D[w1]; } }
if (nWork >= 2) { if (t2) { v2 = D[w2]; } }
if (nWork >= 3) { if (t3) { v3 = D[w3]; } }
if (nWork >= 4) { if (t4) { v4 = D[w4]; } }
// Store data (S is the output array)
if (nWork >= 1) { if (t1) { S[w1] = v1; } }
if (nWork >= 2) { if (t2) { S[w2] = v2; } }
if (nWork >= 3) { if (t3) { S[w3] = v3; } }
if (nWork >= 4) { if (t4) { S[w4] = v4; } }
Figure 5.9: Manual loop unrolling example, showing how to copy multiple work items using careful range checking. This example assumes at most four work items per thread. Notice how similar instructions are batched together in groups of four in an interleaved manner. The lighter grey if (nWork >= k) { … } statement wrappers are elided at compile time by the CUDA compiler.
The if (nWork >= k) { … } wrappers are resolved at compile time, so they add no run-time cost. Hand-unrolled code is, however, more verbose, harder to read, and harder to understand. In addition, unrolling k work items generates up to k× as many instructions, which may also use up to k× as many registers.
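One way to obtain wrappers that resolve at compile time is to make nWork a template parameter. The sketch below is an assumption about how this could be arranged, not the actual DASk source: the function name copyBatch4 and the DASK_HD macro are hypothetical, and the macro only exists so that the same code compiles as plain C++ and under nvcc. As in Figure 5.9, D is the input array and S the output array, and all loads are issued before any store.

```cpp
#include <cstddef>

#ifdef __CUDACC__
#define DASK_HD __host__ __device__  // usable from host and device under nvcc
#else
#define DASK_HD                      // plain C++ build: no qualifiers needed
#endif

// Hypothetical per-thread copy of up to four work items. Because nWork is a
// template parameter, every (nWork >= k) test is a compile-time constant and
// the compiler can drop the branches that are never taken.
template <int nWork, typename T>
DASK_HD void copyBatch4(T* S, const T* D, std::size_t dOff, std::size_t TBS,
                        std::size_t start, std::size_t stop) {
    static_assert(nWork >= 1 && nWork <= 4, "sketch handles 1 to 4 work items");
    std::size_t w1 = 0, w2 = 0, w3 = 0, w4 = 0;
    bool t1 = false, t2 = false, t3 = false, t4 = false;
    T v1 = T(), v2 = T(), v3 = T(), v4 = T();

    // Work offsets
    if (nWork >= 1) { w1 = (0 * TBS) + dOff; }
    if (nWork >= 2) { w2 = (1 * TBS) + dOff; }
    if (nWork >= 3) { w3 = (2 * TBS) + dOff; }
    if (nWork >= 4) { w4 = (3 * TBS) + dOff; }

    // Range checks against [start, stop]
    if (nWork >= 1) { t1 = (start <= w1) && (w1 <= stop); }
    if (nWork >= 2) { t2 = (start <= w2) && (w2 <= stop); }
    if (nWork >= 3) { t3 = (start <= w3) && (w3 <= stop); }
    if (nWork >= 4) { t4 = (start <= w4) && (w4 <= stop); }

    // All loads first (input array D)
    if (nWork >= 1 && t1) { v1 = D[w1]; }
    if (nWork >= 2 && t2) { v2 = D[w2]; }
    if (nWork >= 3 && t3) { v3 = D[w3]; }
    if (nWork >= 4 && t4) { v4 = D[w4]; }

    // Then all stores (output array S)
    if (nWork >= 1 && t1) { S[w1] = v1; }
    if (nWork >= 2 && t2) { S[w2] = v2; }
    if (nWork >= 3 && t3) { S[w3] = v3; }
    if (nWork >= 4 && t4) { S[w4] = v4; }
}
```

A call such as copyBatch4<3>(S, D, dOff, TBS, start, stop) instantiates a version in which the fourth work item has been removed entirely.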
Loop Unrolling Results: Tests of both automatic and manual loop unrolling are shown in Figure 5.8. To stay in the stable upper portion of the s-shaped throughput curves, a fixed input size of n = 2^24 was chosen. To show the impact of multiple work items on performance, nWork was increased from 1 to 16. To show the impact of register pressure, work items were batched into groups of 2, 4, 8, and 16. All throughput numbers presented are averages of one hundred runs.
Figure 5.8 shows four surprising things.
First, there is a large difference in starting throughput for the baseline case (nWork = 1) between automatic and manual loop unrolling. This difference is the result of the different memory access patterns used by the two methods. For the automatically unrolled kernel, I wrote each C++ copy instruction as if (inRange) { S[wOff] = D[wOff]; }. This line of code performs two memory accesses, one load from the input and one store to the output, and is then repeated k times. In other words, the memory access pattern ping-pongs back and forth between the input and output arrays. In the manually unrolled kernel, however, the code batches the loads and stores separately as two instruction clusters (k loads from the input followed by k stores to the output). To verify that the performance difference was indeed a result of these two memory access patterns, I rewrote my manually unrolled kernel as a set of if (t1) { S[w1] = D[w1]; } statements. This approach resulted in a large decrease in throughput, similar to the automatically unrolled kernel. So it seems clear that batching several warps of data accesses (input or output) in close proximity to each other results in better throughput from the memory controllers than interleaving accesses to different parts of memory (input and output). Such a conclusion makes sense, since better locality improves L1 and L2 cache usage by each GPU memory controller.
Second, the automatic loop unrolling batch curves separate one work item sooner than I expected. For both automatic and manual loop unrolling, I would expect the throughput curves for batches of 4, 8, and 16 to be exactly the same up to nWork = 4 and then to begin separating from each other at nWork = 5 and nWork = 9. This expected behavior is observed in the manual loop unrolling curves. For the automatic loop unrolling curves, however, the separation occurs one work item sooner, at nWork = 4 and nWork = 8. I speculate that there could be an "off by one" bug in the CUDA compiler in this case.
Third, throughput drops off more rapidly than I was expecting. I expected that lower performance due to lower occupancy caused by increased register usage would not show up until 8 or more work items per thread. The register pressure effect can be seen in the Batch = 16 manually unrolled curve, which has the worst performance for large workloads (nWork > 8), whereas the Batch = 4 and Batch = 8 manually unrolled curves perform better for nWork > 8. However, throughput drops off quickly after only 4 work items per thread, with the largest drop occurring between nWork = 5 and 6. Surprisingly, for the automatically unrolled curves, the Batch = 16 curves have some of the best performance at nWork = 16. I suspect that a memory controller queue length or caching issue is also coming into play here. For instance, a single warp loading 4 work items would fetch 4 warp lines (512 bytes) of data. It could be that the combination of 14 SMs, each running 16 thread blocks of 128 threads (4 warps), with each warp requesting 6 work items (5,376 warp line requests in aggregate), is overflowing the internal request queues in the six memory controllers (896 requests per controller on average) for each set of concurrently active thread blocks.
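The request counts quoted above can be checked with a few lines of arithmetic. This is only a sketch of the calculation: the helper name warpLineRequests is hypothetical, and the 4-byte element size is an assumption consistent with the 32-bit unsigned integers this Copy kernel is described as handling.

```cpp
#include <cstddef>

constexpr std::size_t kWarpSize      = 32;  // threads per warp
constexpr std::size_t kBytesPerItem  = 4;   // assumed 32-bit data elements
constexpr std::size_t kWarpLineBytes = kWarpSize * kBytesPerItem;  // 128 bytes

// Warp line requests issued by one set of concurrently active thread blocks.
constexpr std::size_t warpLineRequests(std::size_t numSMs,
                                       std::size_t blocksPerSM,
                                       std::size_t warpsPerBlock,
                                       std::size_t itemsPerWarp) {
    return numSMs * blocksPerSM * warpsPerBlock * itemsPerWarp;
}

// One warp loading 4 work items fetches 4 warp lines = 512 bytes.
static_assert(4 * kWarpLineBytes == 512, "4 warp lines are 512 bytes");

// 14 SMs x 16 blocks x 4 warps x 6 work items = 5,376 requests in aggregate.
static_assert(warpLineRequests(14, 16, 4, 6) == 5376, "aggregate requests");

// Spread evenly over six memory controllers: 896 requests per controller.
static_assert(warpLineRequests(14, 16, 4, 6) / 6 == 896, "per controller");
```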
Fourth, manual unrolling results in better throughput than automatic unrolling. Since the results for loop unrolling and software pipelining have different starting throughputs, I decided to use the nWork = 1 case as a baseline for both cases. Comparing the throughput results to their respective baselines reveals that software pipelining improves throughput more than loop
unrolling. As the following table shows, the maximal throughput increase when manual unrolling was used was substantially greater for both GPUs than when automatic unrolling was used.
Throughput Increase    Loop Unrolling    Software Pipelining
GTX 580                0.25%             17%
GTX Titan              7%                14%
Table 5.1: Loop Unrolling vs. Software Pipelining Performance
This makes sense, since my manually unrolled code batches similar instructions into groups to reduce data dependencies (grouping and interleaving instructions from k work items). This organization provides plenty of independent instructions for the static compiler or dynamic warp scheduler to exploit in order to hide stalls. As the CUDA platform continues to mature, loop unrolling should eventually be done automatically by the compiler instead of manually by the programmer. For now, it pays for the GPU programmer to manually unroll and batch their instructions for up to 3-4 work items.
Increasing ILP is not the only way to hide stalls, as we will see next. TLP techniques work even better.
Increasing TLP: Another way to hide stalls uses TLP to issue instructions from other independent, concurrent warps. Fermi supports up to 48 warps (1,536 threads) per SM and Kepler supports up to 64 warps (2,048 threads) per SMX. In this section, I show how increasing the number of threads per thread block can increase throughput up to 216 GB/s (2.67× faster than the simple Copy from Chapter 4).
One of the main reasons that the simple Copy in Chapter 4 had poor throughput is that the code did not take advantage of the massive parallelism available through TLP on GPU architectures. The simple Copy launched with a CTA layout of only 32 threads per block (TBS=32), achieving an occupancy of just 16.67% (8/48) and 25% (16/64) on the GTX 580 and GTX Titan respectively. I discussed occupancy and its constraints in Chapter 3.3. Recall that occupancy is the ratio between the number of warps that actually run concurrently on each SM for a given GPU kernel and the theoretical maximum number of warps that could run concurrently on that architecture.
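The occupancy figures above can be reproduced with a short calculation. The sketch below considers only the maximum-warps and maximum-blocks limits and deliberately ignores register and shared-memory limits. The struct and function names are hypothetical, and the per-SM block limits (8 on Fermi, 16 on Kepler) are the values implied by the 8/48 and 16/64 ratios.

```cpp
// Per-SM limits: resident warps from the text, resident blocks as implied
// by the 8/48 and 16/64 occupancy ratios (an assumption of this sketch).
struct SmLimits {
    int maxWarps;   // maximum resident warps per SM
    int maxBlocks;  // maximum resident thread blocks per SM
};

constexpr SmLimits kFermi  = {48, 8};   // e.g. GTX 580
constexpr SmLimits kKepler = {64, 16};  // e.g. GTX Titan
constexpr int kWarpSize = 32;

// Warps needed by one thread block of tbs threads (rounded up).
constexpr int warpsPerBlock(int tbs) {
    return (tbs + kWarpSize - 1) / kWarpSize;
}

// Thread blocks resident on one SM when only these two limits apply.
constexpr int residentBlocks(SmLimits sm, int tbs) {
    return (sm.maxWarps / warpsPerBlock(tbs)) < sm.maxBlocks
               ? (sm.maxWarps / warpsPerBlock(tbs))
               : sm.maxBlocks;
}

// Occupancy = resident warps / maximum warps.
constexpr double occupancy(SmLimits sm, int tbs) {
    return static_cast<double>(residentBlocks(sm, tbs) * warpsPerBlock(tbs)) /
           sm.maxWarps;
}

static_assert(residentBlocks(kFermi, 32) == 8, "8 one-warp blocks on Fermi");
static_assert(residentBlocks(kKepler, 32) == 16, "16 one-warp blocks on Kepler");
```

With TBS = 32 each block is a single warp, so the block limit caps the SM at 8 of 48 warps on Fermi and 16 of 64 warps on Kepler.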
Varying the number of warps on each SM is actually tricky since the programmer has no direct control over the number of concurrent warps that are actually scheduled on each SM. The GPU programmer can directly specify the number of threads per block as part of the CTA layout. However, the CUDA platform schedules as many concurrent thread blocks as it can while staying within various resource constraints.
Thus a GPU programmer can request a certain number of threads per thread block (TBS) and a certain number of thread blocks per grid (GridSize), but the actual occupancy is determined by the SM scheduler based on the maximum warps, maximum blocks, registers-per-thread usage, and shared-memory-per-block usage. Since my Grid-based Copy kernels do not use any shared memory and are relatively simple code requiring just a few registers, CUDA will launch as many concurrent thread blocks on each SM as allowed by the maximum warps and maximum blocks constraints. For full processor utilization, I want all SMs running with as many concurrent thread blocks as reasonably possible. This means that I should request a grid size (blocks per grid) that is a multiple of (#SMs × #Blocks), where #SMs is the number of SMs (or SMXs) on a specific GPU card and #Blocks is the number of concurrent thread blocks that I expect the CUDA platform to schedule onto each SM.
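The grid-size rule can be sketched as rounding the number of blocks up to a whole number of "waves", where one wave is #SMs × #Blocks thread blocks. The function names below are hypothetical; the values 14 and 16 are the SM count and blocks per SM used in the request-queue estimate earlier in this section.

```cpp
// Thread blocks that can be resident on the whole GPU at once.
constexpr int blocksPerWave(int numSMs, int blocksPerSM) {
    return numSMs * blocksPerSM;
}

// Round the number of blocks needed up to a multiple of one full wave,
// so that every SM has a full set of concurrent blocks in each wave.
constexpr int roundUpToWave(int blocksNeeded, int numSMs, int blocksPerSM) {
    return ((blocksNeeded + blocksPerWave(numSMs, blocksPerSM) - 1) /
            blocksPerWave(numSMs, blocksPerSM)) *
           blocksPerWave(numSMs, blocksPerSM);
}

// 14 SMs with 16 concurrent blocks each give waves of 224 blocks.
static_assert(blocksPerWave(14, 16) == 224, "one wave on 14 SMs x 16 blocks");
static_assert(roundUpToWave(1000, 14, 16) == 1120, "five waves of 224");
```

In a real CUDA program the SM count would normally be read from the multiProcessorCount field of cudaDeviceProp rather than hard-coded. Blocks added by the rounding rely on the kernel's range check to skip work items that fall outside the input.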
However, all the SMs also share a limited number of memory controllers on each GPU card (6 on the GTX 580 and 4 on the GTX Titan). Therefore, too many I/O requests may overload the memory controllers, cause cache thrashing, and actually slow down overall throughput. So, my experiments vary the number of warps per thread block and also vary the number of concurrent thread blocks per SM in order to find an optimal balance between thread concurrency and memory controller throughput. To stabilize performance in the upper part of the s-shaped throughput curves, a fixed input size of n = 2^24 data elements is used for all of the following experiments.
To make it easier to experiment with varying the number of threads, my Block DASk for the Copy kernel supports the following four C++ template parameters. The valT template parameter specifies the underlying data type and gives me generic support for data types other than just 32-bit unsigned integers. The BlockCols and BlockRows parameters specify the fixed number of columns and rows in the thread block, while the GridCols parameter specifies the fixed number of columns in the corresponding CTA grid. These three fixed-size CTA template parameters allow me to vary the CTA layout without having to rewrite my copy kernel for each new configuration.
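As an illustration of how four such template parameters could fix the CTA layout at compile time, consider the sketch below. Only the parameter names valT, BlockCols, BlockRows, and GridCols come from the text. The struct CopyLayout, its member functions, and the assumption that the number of grid rows is computed from n with one work item per thread are all hypothetical.

```cpp
#include <cstddef>

// Hypothetical compile-time description of a CTA layout.
template <typename valT, unsigned BlockCols, unsigned BlockRows,
          unsigned GridCols>
struct CopyLayout {
    typedef valT value_type;  // element type being copied

    // Threads per block (TBS) = columns x rows of the thread block.
    static constexpr unsigned threadsPerBlock() {
        return BlockCols * BlockRows;
    }

    // Elements covered by one row of GridCols thread blocks,
    // assuming one work item per thread.
    static constexpr std::size_t elementsPerGridRow() {
        return static_cast<std::size_t>(threadsPerBlock()) * GridCols;
    }

    // Grid rows needed to cover n elements (rounded up).
    static constexpr std::size_t gridRows(std::size_t n) {
        return (n + elementsPerGridRow() - 1) / elementsPerGridRow();
    }
};
```

For example, CopyLayout<unsigned, 32, 4, 224> describes 128-thread blocks in a grid 224 blocks wide; changing the block size for another experiment only requires changing the template arguments.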
Figure 5.10 shows the impact on throughput of my modified Copy kernel as I increase the number of threads per thread block from 32 to 1024 in multiples of the WarpSize (32).
Figure 5.10: Copy throughput for fixed n = 2^24 and increasing block size (threads per block) [32-1024]. The top panel shows throughput for a given block size [32-1024], with the orange bars showing throughput for cudaMemcpy and the gray bars showing throughput for my TLP copy kernel. All measurements are the average of 100 runs on a GTX Titan. The bottom panel shows the number of concurrent thread blocks per SMX for each matching block size, from 16 down to 2. Lighter colors mark block sizes that divide the maximum threads per SMX exactly.
[Figure 5.10 plot. Top panel: "COPY Throughput for increasing Threads (t)", y-axis: I/O Throughput (GB/s), series: cudaMemcpy and Simple. Bottom panel: y-axis: #Blocks. Shared x-axis: t = Number of Threads (32-1024).]
Throughput from cudaMemcpy (orange bars) is also included for comparison. It achieves about 231 GB/s in this experiment and remains consistently clustered around this number. My modified Copy kernel (gray bars), however, quickly grows to a peak of 216 GB/s (at 128 threads per block) and then goes into an up-and-down sawtooth pattern, where performance drops off a small throughput cliff and gradually climbs back up in increasingly longer runs before dropping off another cliff. Each small cliff corresponds to a drop in the number of concurrent thread blocks that can execute at the same time on each SMX due to the maximum warps per SMX constraint. Only some of the block sizes (32, 64, 128, 256, 512, 1024) actually divide evenly (lighter bars) into the maximum number of threads (2,048) per SMX on a Kepler card such as the GTX Titan. On a Fermi card such as the GTX 580, I would expect a similar pattern with cliffs occurring at 32, 64, 96, 128, 192, 256, 384, 512, since these are the block sizes that evenly divide the 1,536 maximum threads per SM on a Fermi card. That each new performance peak is slightly lower than the previous peak is likely due to locality effects in memory. In other words, as we get more data items per data block, each thread warp accesses data warps that are more spread out across each data block in memory. The net result is decreased locality.
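The position of the cliffs follows from integer division of the per-SM thread limit by the block size. The sketch below is an illustration under the assumption that only the thread limit and the resident-block limit bind (16 blocks per SMX, matching the top of the bottom panel's range); the function names are hypothetical.

```cpp
#include <vector>

constexpr int kWarpSize = 32;

// Concurrent thread blocks per SM for block size tbs, limited by the
// maximum resident threads and the maximum resident blocks.
constexpr int concurrentBlocks(int tbs, int maxThreadsPerSM,
                               int maxBlocksPerSM) {
    return (maxThreadsPerSM / tbs) < maxBlocksPerSM
               ? (maxThreadsPerSM / tbs)
               : maxBlocksPerSM;
}

// Block sizes (multiples of the warp size, up to maxTbs) that divide the
// per-SM thread limit exactly.
inline std::vector<int> evenBlockSizes(int maxThreadsPerSM,
                                       int maxTbs = 1024) {
    std::vector<int> sizes;
    for (int tbs = kWarpSize; tbs <= maxTbs; tbs += kWarpSize) {
        if (maxThreadsPerSM % tbs == 0) {
            sizes.push_back(tbs);
        }
    }
    return sizes;
}
```

On a Kepler SMX (2,048 threads), concurrentBlocks gives 16 blocks at 128 threads per block but only 12 at 160, which is the first cliff after the peak. Between cliffs the block count stays constant while the block size grows, so more threads are resident and throughput climbs again.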
In Figure 5.11, I show what happens when I artificially constrain the number of