In this chapter, I demonstrate my Row data access skeleton (DASk) on two useful parallel primitives, Reduce and Scan. The Reduce primitive produces a total sum by accumulating n input elements into a single final sum. The terms “sum” and “accumulate” refer not only to addition but to any associative operation, such as multiplication, maximum, or minimum. The name “Reduce” comes from Map/Reduce, a popular paradigm for distributed computing at very large scales. The Scan primitive, sometimes called Prefix Sum, produces a running sum by accumulating n input elements into n output elements, where the ith output element is either the inclusive {1 ... i} or the exclusive {1 ... i-1} prefix sum of the input.
I have chosen to base both my Reduce and Scan primitives only on the associative property {a+(b+c) = (a+b)+c}, so that I can also support non-commutative “summation-like” operations, such as matrix multiplication. My algorithms may regroup, but not reorder; all sub-problems must work with sequential runs of consecutive data.
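The regroup-but-not-reorder rule can be illustrated with a small host-side sketch. It is not the GPU code of this chapter: it uses string concatenation as a stand-in for an associative, non-commutative operator, and the function names and the `blockSize` parameter are hypothetical, chosen only for illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Serial left-to-right reduction with string concatenation, an associative
// but non-commutative operator whose identity is the empty string.
std::string reduceSerial(const std::vector<std::string>& a) {
    std::string sum;  // identity
    for (const std::string& x : a) sum = sum + x;
    return sum;
}

// Regrouped reduction: split the input into consecutive blocks of blockSize
// elements (blockSize >= 1), reduce each block, then combine the block sums
// in their original block order. Nothing is ever reordered.
std::string reduceBlocked(const std::vector<std::string>& a, std::size_t blockSize) {
    std::vector<std::string> blockSums;
    for (std::size_t start = 0; start < a.size(); start += blockSize) {
        std::size_t stop = std::min(start + blockSize, a.size());
        std::string partial;  // identity
        for (std::size_t i = start; i < stop; ++i) partial = partial + a[i];
        blockSums.push_back(partial);
    }
    return reduceSerial(blockSums);
}
```

For the input ["a", "b", "c", "d", "e", "f", "g"], both functions return "abcdefg" for any block size, because only the grouping changes. Reversing the input changes the result, which is why reordering is not allowed for such operators.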
To support the GPU’s 2-level cooperative thread array (CTA), I think of data as sequential blocks containing sequential runs. Since the CTA is 2-level, I use a 2-part solution. 1) The Row DASk at CTA level 1 coordinates thread blocks within a grid for sequential access to data blocks. 2) A similar mechanism at CTA level 2 coordinates individual threads within a block for sequential access within each data block. To keep throughput high in my implementations, I use both thread-level parallelism (TLP) and instruction-level parallelism (ILP). To support experiments on TLP and ILP, I parameterize my kernels by the number of warps per thread block ‹nWarps› and by the number of work items per thread ‹nWork›.
1) Finding a global memory access pattern that supports both coalescence (stride = 32) and sequential access (stride = 1). The two patterns use different strides and are therefore mutually exclusive.
2) Avoiding serialized instruction replays caused by k-way bank conflicts when accessing shared memory.
3) Avoiding reduced TLP caused by various resource constraints on occupancy.
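The tension in item 1 can be sketched as index arithmetic. This is an illustrative sketch, not the indexing used by the kernels in this chapter: it assumes a warp of 32 threads, interprets "stride" as the distance between consecutive accesses made by one thread, and uses hypothetical function names.

```cpp
#include <cstddef>

constexpr std::size_t kWarpSize = 32;  // threads per warp (assumed)

// Coalesced pattern: on step k, the 32 lanes of a warp touch 32 adjacent
// elements, so each lane advances by a stride of 32 between its own accesses.
constexpr std::size_t coalescedIndex(std::size_t lane, std::size_t k) {
    return k * kWarpSize + lane;
}

// Sequential pattern: each lane owns a run of nWork consecutive elements,
// so it advances by a stride of 1 between its own accesses.
constexpr std::size_t sequentialIndex(std::size_t lane, std::size_t k, std::size_t nWork) {
    return lane * nWork + k;
}
```

Both functions cover the same 32 × nWork elements, but they assign those elements to lanes differently. In the coalesced pattern neighbouring lanes read neighbouring addresses. In the sequential pattern each lane sees a consecutive run, which is what a non-commutative operator needs.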
6.1 Introduction
Reduce primitive
Computes the total sum of a sequence A containing n elements.
Input: A binary associative operator ⨁ with identity 𝕀, and a sequence A = [a_1, a_2, ⋯, a_n].
Output: sum = 𝕀 ⨁ a_1 ⨁ a_2 ⨁ ⋯ ⨁ a_n, where sum is a singleton result.
Scan primitive
Creates the prefix sum of a sequence A containing n elements.
Input: A binary associative operator ⨁ with identity 𝕀, and a sequence A = [a_1, a_2, ⋯, a_n].
Output: A scanned sequence S = [s_1, s_2, ⋯, s_n], where s_i = 𝕀 ⨁ a_1 ⨁ a_2 ⨁ ⋯ ⨁ a_{i−1} for exclusive scan or s_i = a_1 ⨁ a_2 ⨁ ⋯ ⨁ a_i for inclusive scan.
Reduce and Scan primitives are basic building blocks for many parallel algorithms (Blelloch 1989 and 1990; Blelloch and Maggs, 1996; Hillis and Steele, 1986). Each takes as input an associative, but not necessarily commutative, binary summation operator ⨁, its identity 𝕀, and a sequence A = [a_1, a_2, …, a_n]. The Reduce primitive (Harris and Sutherland, 2003), also known as “Fold” or “Total-sum”, returns sum = 𝕀 ⨁ a_1 ⨁ a_2 ⨁ ⋯ ⨁ a_n. The Scan primitive (Blelloch, 1989), also known as “Prefix-sum”, returns either an inclusive prefix-sum sequence, in which s_1 = a_1 and, for i > 1, s_i = s_{i−1} ⨁ a_i, or an exclusive prefix-sum sequence, in which s_1 = 𝕀 and, for i > 1, s_i = s_{i−1} ⨁ a_{i−1}.
In my own GPU algorithms, I use Reduce to produce summation results for performance metrics, such as minimums, maximums, totals, and averages. I use Scan to implement data partitioning schemes like counting sort. Scanning a set of element counts yields starting locations, so that each thread knows where it can safely access its assigned data without competing with other concurrently running threads.
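The role of Scan in counting sort can be sketched serially as follows. This is a minimal CPU illustration, not the GPU partitioning code: the function names are hypothetical, and the keys are assumed to be small non-negative integers in [0, nBuckets).

```cpp
#include <cstddef>
#include <vector>

// Exclusive scan with addition: starts[b] is the number of elements that
// fall in buckets before b, i.e. where bucket b begins in the output.
std::vector<std::size_t> bucketStarts(const std::vector<std::size_t>& counts) {
    std::vector<std::size_t> starts(counts.size());
    std::size_t sum = 0;  // identity for addition
    for (std::size_t b = 0; b < counts.size(); ++b) {
        starts[b] = sum;
        sum += counts[b];
    }
    return starts;
}

// Counting sort of keys assumed to lie in [0, nBuckets).
std::vector<int> countingSort(const std::vector<int>& keys, std::size_t nBuckets) {
    std::vector<std::size_t> counts(nBuckets, 0);
    for (int k : keys) ++counts[static_cast<std::size_t>(k)];
    std::vector<std::size_t> next = bucketStarts(counts);  // first free slot per bucket
    std::vector<int> out(keys.size());
    for (int k : keys) out[next[static_cast<std::size_t>(k)]++] = k;
    return out;
}
```

With counts [2, 0, 3, 1], the exclusive scan gives starting locations [0, 2, 2, 5]. Each bucket then owns a disjoint output range, which is the property that lets parallel threads write without competing for the same locations.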
As Table 6.1 shows, the binary operator ⨁ can represent not only addition but also product, max, min, “and,” and “or”. I assume that ⨁ is associative {(a⨁b)⨁c = a⨁(b⨁c)} but not necessarily commutative {a⨁b = b⨁a}. The fact that data can be regrouped but not reordered complicates the GPU implementations at both the block and thread levels of the CTA hierarchy.
| Operator {⨁} | Identity {𝕀} | Math | Code |
|---|---|---|---|
| Sum {+} | 0 | c = a + b | c=a+b; |
| Floating point** Sum {+} | 0.0 | c = fl(a + b) | c=a+b; |
| Product {×} | 1 | c = a × b | c=a*b; |
| Floating point** Product {×} | 1.0 | c = fl(a × b) | c=a*b; |
| Matrix Multiply {×} (non-commutative) | I | C = AB | 3-level nested loop |
| Minimum {min} | +∞ | c = a if a ≤ b; b if b < a | c=(a<=b)?a:b; |
| Maximum {max} | −∞ | c = a if a ≥ b; b if b > a | c=(a>=b)?a:b; |
| Logical AND {∧} | True | c = a ∧ b | c=a&b; |
| Logical OR {∨} | False | c = a ∨ b | c=a\|b; |

Table 6.1: Common Reduce and Scan binary operators. **Floating point operators are not fully associative due to truncation and round-off errors.
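One way to package an operator together with its identity, as the table pairs them, is a small functor per row. The sketch below is an assumption about how such pairs might be written in C++, not the interface used by the kernels in this chapter. For types without an infinity, it substitutes the largest or smallest representable value as the identity.

```cpp
#include <limits>

// Each functor pairs a binary operator with its identity element.
template <typename T>
struct Sum {
    static T identity() { return T(0); }
    T operator()(T a, T b) const { return a + b; }
};

template <typename T>
struct Product {
    static T identity() { return T(1); }
    T operator()(T a, T b) const { return a * b; }
};

template <typename T>
struct Minimum {
    // +infinity where the type has one, otherwise the largest finite value.
    static T identity() {
        return std::numeric_limits<T>::has_infinity ? std::numeric_limits<T>::infinity()
                                                    : std::numeric_limits<T>::max();
    }
    T operator()(T a, T b) const { return (a <= b) ? a : b; }
};

template <typename T>
struct Maximum {
    // -infinity where the type has one, otherwise the lowest finite value.
    static T identity() {
        return std::numeric_limits<T>::has_infinity ? -std::numeric_limits<T>::infinity()
                                                    : std::numeric_limits<T>::lowest();
    }
    T operator()(T a, T b) const { return (a >= b) ? a : b; }
};
```

In each case, combining the identity with any value a returns a unchanged, which is what lets an accumulator start from the identity.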
Floating Point Associativity:
Programmers should be aware that floating point arithmetic operators are not fully associative: each operation truncates or rounds its result to fit the fixed-size data type, so outputs carry errors [1] (Press et al, 2007). As a result, CPU serial primitives and GPU parallel primitives can give slightly different sums on the same floating point data. Reordering and regrouping long runs of data with large variations in exponent can change the outputs. One possible advantage of forbidding commutativity (reordering) and relying on associativity (regrouping) alone is that floating point evaluations are a bit more stable [2].
[1] Floating point issues are described in much greater detail in the book “Numerical Recipes …” (Press et al, 2007).
[2] There are techniques that provide even more stability, such as storing values in a min heap keyed on absolute value and repeatedly summing the two values nearest zero. Unfortunately, this approach has too many data dependencies to work well on GPUs.
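The floating point caveat can be reproduced with three values. The sketch below assumes IEEE 754 single precision and no fast-math optimizations. The function names are hypothetical, and the `volatile` temporaries are only there to force each intermediate result to be rounded to `float`.

```cpp
// Sum the same three floats with two different groupings.
float sumLeft(float a, float b, float c) {
    volatile float t = a + b;  // round the intermediate to float precision
    return t + c;              // (a + b) + c
}

float sumRight(float a, float b, float c) {
    volatile float t = b + c;  // round the intermediate to float precision
    return a + t;              // a + (b + c)
}
```

Under those assumptions, a = 1e8f, b = -1e8f, and c = 1.0f give 1.0f for `sumLeft` and 0.0f for `sumRight`. Neighbouring floats near 1e8 are 8 apart, so b + c rounds back to -1e8f and the 1.0f is lost.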
Overflow:
Programmers should ensure that sums (especially of long data runs) do not overflow the underlying data type’s maximal (or minimal) value. For instance, adding a billion 32-bit unsigned integers may require storing intermediate and final sums as 64-bit unsigned integers.
Diagrams:
Data organization and access patterns are necessary to improve Reduce and Scan performance, and these are often easier to grasp from diagrams. Table 6.2 introduces symbols that I use in my diagrams throughout this chapter.
| Name | Abbr. | Description |
|---|---|---|
| Run | | An input sequence of n data elements. |
| Sum | ⨁ | An associative binary operation that accumulates two inputs into an output, c = a⨁b. Arrow color indicates access to the current entry (black) and reach back (blue) or reach forward (purple) one or more entries. |
| Identity | 𝕀 | The identity element for operator ⨁, i.e. a = 𝕀⨁a for all a ∈ 𝕌 (zero for addition, one for multiplication). |
| Serial Reduce | SRn | In serial, reduce an input short run of size n ∈ [2, 32] into a final-sum output. ~One I/O per element, plus one for output. |
| Serial Scan | SSn | In serial, scan an input short run of size n ∈ [2, 32] into a prefix-sum output run. The scan output can be either inclusive or exclusive. ~Two I/Os per element (one on input, one on output). |

Table 6.2: Basic nomenclature and symbols for the Reduce and Scan primitives.
CPU Serial Implementations:
The serial implementations of the Reduce and Scan primitives on a von Neumann CPU are similar (see Table 6.3).

Serial Reduce
    Reduce( sum, A, n, ⨁, 𝕀 )
        sum = 𝕀;                     // Identity
        for (i = 0; i < n; ++i)      // Reduce
            sum = sum ⨁ A[i];
        end for
    Input: +, 0, [1 2 3 4 5 6 7 8]
    Evaluates: [0+1+2+3+…+8]
    Output: [36]

Serial Scan (Inclusive)
    Scan_Inclusive( S, A, n, ⨁, 𝕀 )
        sum = 𝕀;                     // Identity
        for (i = 0; i < n; ++i)      // Inclusive Scan
            sum = sum ⨁ A[i];
            S[i] = sum;
        end for
    Input: +, 0, [1 2 3 4 5 6 7 8]
    Evaluates: [1, 1+2, 1+2+3, ⋯, 1+2+…+8]
    Output: [1 3 6 10 15 21 28 36]

Serial Scan (Exclusive)
    Scan_Exclusive( S, A, n, ⨁, 𝕀 )
        sum = 𝕀;                     // Identity
        for (i = 0; i < n; ++i)      // Exclusive Scan
            S[i] = sum;
            sum = sum ⨁ A[i];
        end for
    Input: +, 0, [1 2 3 4 5 6 7 8]
    Evaluates: [0, 0+1, 0+1+2, ⋯, 0+1+2+…+7]
    Output: [0 1 3 6 10 15 21 28], with total sum 36

Table 6.3: Serial Reduce and Inclusive & Exclusive Serial Scan. All three procedures initialize a running sum to identity, then traverse the input array and accumulate new values. Reduce returns the final sum as its output. Both versions of Scan output the current running sum for each input. The exclusive scan is the inclusive scan shifted right by one position, with the identity in the first slot. For each of the three operations, the table gives pseudo-code and a worked example.
As shown in Table 6.3, both Reduce and Scan initialize an accumulator to identity and then, as the code sequentially traverses the data, accumulate the running sum. Reduce writes out the final sum only, whereas Scan writes out the current running sum (inclusive or exclusive) for each input element. The exclusive scan is easy to obtain from the inclusive scan by prepending the identity, reaching back one entry, and dropping the total sum, or final value.
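The serial procedures and the inclusive-to-exclusive conversion described above can be written as a short C++ sketch. The function names and the template interface are illustrative assumptions, and the operator is passed as any callable taking two values.

```cpp
#include <cstddef>
#include <vector>

// Serial Reduce: fold the input left to right, starting from the identity.
template <typename T, typename Op>
T serialReduce(const std::vector<T>& a, Op op, T identity) {
    T sum = identity;
    for (const T& x : a) sum = op(sum, x);
    return sum;
}

// Serial inclusive Scan: s[i] holds the running sum through a[i].
template <typename T, typename Op>
std::vector<T> scanInclusive(const std::vector<T>& a, Op op, T identity) {
    std::vector<T> s(a.size());
    T sum = identity;
    for (std::size_t i = 0; i < a.size(); ++i) {
        sum = op(sum, a[i]);
        s[i] = sum;
    }
    return s;
}

// Exclusive Scan derived from the inclusive one: put the identity first,
// reach back one entry for every other slot, and drop the final total.
template <typename T>
std::vector<T> exclusiveFromInclusive(const std::vector<T>& inclusive, T identity) {
    std::vector<T> s(inclusive.size());
    for (std::size_t i = 0; i < inclusive.size(); ++i)
        s[i] = (i == 0) ? identity : inclusive[i - 1];
    return s;
}
```

With addition, identity 0, and input [1 2 3 4 5 6 7 8], `serialReduce` returns 36, `scanInclusive` returns [1 3 6 10 15 21 28 36], and `exclusiveFromInclusive` applied to that result returns [0 1 3 6 10 15 21 28], matching the examples in Table 6.3.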
GPU Parallel Implementations:
As this case study shows, GPU parallel implementations that achieve high throughput are more complex than CPU serial implementations. My main performance goal was to achieve a solid percentage of peak throughput for both primitives. I achieved this for Reduce and Scan in four main ways:
1) I support TLP via multiple thread warps ‹nWarps› per thread block, respecting constraints on occupancy.
2) I support ILP via multiple work items ‹nWork› per thread.
3) I support coalescence, transferring 32 data elements in a single I/O instruction, for data in main memory.
4) I mitigate bank conflicts for transfers between shared memory and registers.
Throughput Results:
Since this chapter is quite long, I give the initial throughput results here, ahead of the details in the rest of the chapter. The main takeaway is that my Row DASk provides an excellent starting point for implementing the Reduce and Scan primitives. Because Reduce and Scan are more complex than the simple Copy primitive from Chapters 4 and 5, I also had to overcome several performance-hindering issues in my implementations. Table 6.4 shows that the Reduce and Scan primitives can achieve nearly the same peak throughput as the simple Copy primitive.
| Throughput | GTX 580 (Fermi) Reduce | GTX 580 (Fermi) Scan | GTX 580 (Fermi) Copy | GTX Titan (Kepler) Reduce | GTX Titan (Kepler) Scan | GTX Titan (Kepler) Copy |
|---|---|---|---|---|---|---|
| Baseline (GB/s) | 24.2 | 33.6 | 49.4 | 40.0 | 53.7 | 86.0 |
| Best (GB/s) | 172.7 | 164.8 | 175.0 | 227.0 | 225.0 | 236.3 |
| Speedup | 7.1× | 4.9× | 4.4× | 5.7× | 4.2× | 2.75× |

Table 6.4: Best throughputs (in gigabytes per second) for Reduce and Scan on the GTX 580 and GTX Titan respectively, and speedups over baseline throughputs. Copy throughputs using the Grid DASk from Chapter 4 are also included for comparison.
The performance results for Reduce and Scan also show that choosing the right ILP and TLP parameters, based on extensive experiments, gives much better throughput than naively implementing the sequential algorithm on the GPU. For example, the Reduce primitive is up to 7.1× faster than the baseline on the Fermi architecture (GTX 580), and the Scan primitive is up to 4.2× faster than the baseline on the Kepler architecture (GTX Titan).
Moreover, the four plots in Figure 6.1 show how TLP, ILP, and two different approaches to handling bank conflicts (mitigate or avoid) all contribute to improving throughput. Each plot contains five curves, described briefly in the next paragraph.
The easy way to experiment with ILP is to vary the number of work items per thread (nWork) in the range 1-8. The easy way to experiment with TLP is to vary the number of thread warps per block (nWarps), also in the range 1-8. Varying these (and other data access parameters, described later) gives the five curves in each plot. The Baseline ‹nWarps=1, nWork=1› curve has no extra ILP or TLP. The ILP-Focused ‹1, varies› curve increases the work per thread. The TLP-Focused ‹varies, 1› curve increases the warps per thread block. The Mitigate Bank Conflicts curve increases both ILP and TLP and allows bank conflicts, but it uses simple code, which allows CUDA to mitigate the impact of serialized replays. The Avoid Bank Conflicts curve also increases both ILP and TLP, but it uses complex code that avoids bank conflicts completely. As can be seen from the plots in Figure 6.1, increasing both ILP and TLP achieves the best throughput.
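The parameter space behind these curves can be enumerated with a short host-side sketch. It is not the experiment harness used for the plots. The struct and function names are hypothetical, and the sketch assumes 32 threads per warp and that each thread processes exactly nWork elements.

```cpp
#include <cstddef>
#include <vector>

struct LaunchConfig {
    std::size_t nWarps;            // warps per thread block (TLP knob)
    std::size_t nWork;             // work items per thread (ILP knob)
    std::size_t threadsPerBlock;   // 32 * nWarps
    std::size_t elementsPerBlock;  // threadsPerBlock * nWork (assumed)
};

// Enumerate every (nWarps, nWork) pair in the ranges 1..maxWarps, 1..maxWork.
std::vector<LaunchConfig> sweepConfigs(std::size_t maxWarps = 8, std::size_t maxWork = 8) {
    const std::size_t warpSize = 32;  // assumed warp size
    std::vector<LaunchConfig> configs;
    for (std::size_t w = 1; w <= maxWarps; ++w)
        for (std::size_t k = 1; k <= maxWork; ++k)
            configs.push_back({w, k, warpSize * w, warpSize * w * k});
    return configs;
}
```

With the default ranges this yields 64 configurations. Under the stated assumptions they run from the baseline ‹1, 1› with 32 elements per block up to ‹8, 8› with 2048 elements per block.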
[Figure 6.1: four plots of throughput (GB/s, y-axis) versus input size n (x-axis, increasing powers of two from 2^10 to 2^28, logarithmic scale). Panels: GTX 580 full Reduce, GTX Titan full Reduce, GTX 580 full Scan, GTX Titan full Scan.]
Legend, best configuration per curve (Warps, Work):
GTX 580 full Reduce: Baseline (1, 1); Best TLP-Focused (6, 1); Best ILP-Focused (1, 6); Best Mitigate BC (4, 4); Best Avoid BC (3, 8).
GTX Titan full Reduce: Baseline (1, 1); Best TLP-Focused (4, 1); Best ILP-Focused (1, 8); Best Mitigate BC (4, 4); Best Avoid BC (5, 8).
GTX 580 full Scan: Baseline (1, 1); Best TLP-Focused (5, 1); Best ILP-Focused (1, 6); Best Mitigate BC (4, 4); Best Avoid BC (3, 6).
GTX Titan full Scan: Baseline (1, 1); Best TLP-Focused (4, 1); Best ILP-Focused (1, 8); Best Mitigate BC (5, 4); Best Avoid BC (5, 4).
Figure 6.1: Full Reduce and full Scan throughput results (y-axis: gigabytes per second (GB/s)) as a function of input size (x-axis; log scale) on the GTX 580 {Fermi architecture} (left column) and GTX Titan {Kepler architecture} (right column) GPUs.