Parallel Patterns - Brown_unc_0153D

Improvements in Reduce and Scan throughput are also obtained by combining common parallel patterns for each primitive.

6.4.1 Reduce Parallel Patterns

For the Reduce primitive, the singleton output sum depends on all input values. Figure 6.3 illustrates two natural parallel patterns to reduce n elements using p threads: Tree Reduce and Run Reduce (AKA Reduce then Reduce) (Merrill and Grimshaw, 2010, Parallel Scan). I use both patterns at different levels of my nested Reduce GPU implementation.


Figure 6.3: Parallel Reduce Patterns. This figure illustrates two parallel patterns to reduce n elements using p threads. The left panel shows a Tree Reduce pattern of 16 elements in 4 = log2(16) stages. The right panel shows a Run Reduce pattern of 16 elements in two stages (serially reduce 4-element runs into run sums using 4 threads, then serially reduce the 4 run sums into a final sum using 1 thread).

The Tree Reduce pattern (see Figure 6.3, left panel), a fine-grained reduction, is easy to visualize and implement. At each stage, every thread reduces two input elements to one output sum, so log2(n) stages fully reduce n elements down to one final sum. Total work is linear: n + n/2 + n/4 + … + 2 + 1 = 2n - 1 ≈ 2n. Total depth is logarithmic: log2(n). Total I/Os are linear: 3n (or 2n).
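The stage structure of Tree Reduce can be sketched in a few lines of Python. This is only a sequential simulation under stated assumptions, not the GPU implementation discussed in this chapter: each loop iteration stands in for one parallel stage, the function name `tree_reduce` is hypothetical, and the 16 input values are the ones shown in Figure 6.3.

```python
def tree_reduce(values):
    """Sum `values` with a pairwise tree; returns (total, stages)."""
    level = list(values)
    if not level:
        return 0, 0
    stages = 0
    while len(level) > 1:
        # One stage: each (simulated) thread adds one adjacent pair.
        nxt = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            # Odd element count: carry the unpaired last value forward.
            nxt.append(level[-1])
        level = nxt
        stages += 1
    return level[0], stages


data = [3, 1, 1, 4, 1, 4, 3, 2, 3, 3, 2, 1, 1, 3, 2, 2]
print(tree_reduce(data))  # (36, 4)
```

Each pass halves the number of active values, which is why both the stage count and the shrinking thread count show up directly in the loop.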

Tree Reduce, however, has three distinct disadvantages:

• Its relatively high kernel launch costs, since log2(n) stages are required

• Its suboptimal processor utilization, since each stage launches half the threads of the previous stage

• Its relatively high I/O transfer costs, since intermediate sums from each stage must be transferred to the next, using 3n I/Os if all sums are stored and 2n I/Os if continuing threads keep their sums

The Run Reduce pattern (see Figure 6.3, right panel), a coarse-grained reduction, is even easier to visualize and implement. It partitions the n data elements across p threads (cores) into even runs (run length = ⌈n/p⌉). In the first reduce stage, each thread serially reduces its assigned run, producing p run-sums (one per run). In the second reduce stage, the p run-sums are serially reduced by a single thread (core) to a final sum. Total work is linear: n + p. Total depth is linear: ⌈n/p⌉ + p. Total I/Os are linear: n + p + 1. Choosing p = √n minimizes the depth, giving depth = 2√n and work = n + √n.
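The two stages of Run Reduce, and the effect of the choice of p on depth, can be illustrated with a short Python sketch. It is a sequential simulation only: the list comprehension stands in for p threads working in parallel, and the names `run_reduce` and `depth` are hypothetical helpers introduced here for illustration.

```python
import math


def run_reduce(values, p):
    """Two-stage Run Reduce over p runs; returns (total, run_sums)."""
    n = len(values)
    run_len = math.ceil(n / p) if n else 0
    # Stage 1: each of the p (simulated) threads serially reduces its run.
    run_sums = [sum(values[t * run_len:(t + 1) * run_len]) for t in range(p)]
    # Stage 2: a single thread serially reduces the p run-sums.
    total = 0
    for s in run_sums:
        total += s
    return total, run_sums


def depth(n, p):
    """Depth of Run Reduce as stated in the text: ceil(n/p) + p."""
    return math.ceil(n / p) + p


data = [3, 1, 1, 4, 1, 4, 3, 2, 3, 3, 2, 1, 1, 3, 2, 2]
print(run_reduce(data, 4))  # (36, [9, 10, 9, 8])
print(depth(16, 4))         # 8
```

For n = 16, trying every p from 1 to 16 in `depth` gives a minimum of 8 at p = 4, which matches the 2√n figure above.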

As described, the second stage of Run Reduce has essentially the worst possible processor utilization: only one thread is active. As a result, in my actual GPU implementation, I replace the second stage’s serial reduce by a nested run-reduce and tree-reduce to involve more parallel threads.

6.4.2 Scan Parallel Patterns

Like Reduce, Scan looks at all data values. Unlike Reduce, Scan must output a prefix sum for each input data value that depends either on all previous data values or on some previous data values or sums. There are two natural Scan patterns (see Figure 6.4), Scan-then-Fan and Reduce-then-Scan. With both Scan patterns, the input data is partitioned evenly into per-thread runs. For my own GPU Scan implementation, I use both patterns at different levels of the CTA hierarchy.

Figure 6.4: Two Parallel Scan patterns 1) Scan-then-Fan (Top panel), and 2) Reduce-then-Scan (Bottom panel).

Scan-then-Fan:

Scan-then-Fan (see Figure 6.4, top panel) is similar to the Run Reduce pattern and takes five stages:

1. Scan Run: Each thread serially scans its assigned run. Each run is locally correct but is missing a prefix sum from all preceding thread runs.

2. Store Run-Sums: Per-thread run-sums from stage 1 are stored in another array.

3. Scan Run-Sums: The per-thread run-sums from stage 1 are inclusively scanned.


4. Reach Back: The prefix sum missing from each stage 1 run is the exclusive scan of the run-sums, obtained from the stage 3 inclusive scan by reaching back one entry; the first run, which has no predecessor, takes the identity.

5. Run Update (AKA fan): The missing prefix sum for each thread is accumulated into every locally scanned element of that thread's run to generate the final scanned results.

The Scan-then-Fan pattern takes about 4 I/Os per data element, since both stage 1 and stage 5 must read and write each element. With input size n and p threads, it uses linear Work = 2n + 2p, linear Depth = 2⌈n/p⌉ + 2p, and linear Total I/Os = 4n + 4p.
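The five Scan-then-Fan stages can be sketched as a sequential Python simulation. This is an illustration under stated assumptions rather than the GPU code: the function name `scan_then_fan` is hypothetical, addition is assumed as the scan operator (so the identity is 0), and each per-run loop stands in for one thread.

```python
import math


def scan_then_fan(values, p):
    """Inclusive prefix sum of `values` via the five Scan-then-Fan stages."""
    n = len(values)
    run_len = math.ceil(n / p) if n else 0
    runs = [list(values[t * run_len:(t + 1) * run_len]) for t in range(p)]

    # Stage 1 (Scan Runs): each thread serially scans its own run in place.
    for run in runs:
        for i in range(1, len(run)):
            run[i] += run[i - 1]

    # Stage 2 (Store Run-Sums): the last scanned element of a run is its sum.
    run_sums = [run[-1] if run else 0 for run in runs]

    # Stage 3 (Scan Run-Sums): inclusive scan over the p run-sums.
    inclusive, acc = [], 0
    for s in run_sums:
        acc += s
        inclusive.append(acc)

    # Stage 4 (Reach Back): shift the inclusive scan by one entry to get the
    # exclusive scan; the first run gets the identity (0 for addition).
    prefixes = [0] + inclusive[:-1]

    # Stage 5 (Run Update / fan): add each run's prefix to all its elements.
    out = []
    for run, prefix in zip(runs, prefixes):
        out.extend(x + prefix for x in run)
    return out


data = [3, 1, 1, 4, 1, 4, 3, 2, 3, 3, 2, 1, 1, 3, 2, 2]
print(scan_then_fan(data, 4))
# [3, 4, 5, 9, 10, 14, 17, 19, 22, 25, 27, 28, 29, 32, 34, 36]
```

Stages 1 and 5 each read and write every element, which is where the roughly 4 I/Os per element come from.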

Reduce-then-Scan:

The Reduce-then-Scan pattern (Figure 6.4, bottom panel) is derived from the Scan-then-Fan pattern. In stage 1, the serial Scan is replaced by a serial Reduce that computes the run-sums, and in stage 5, the run update (fan) becomes a serial Scan initialized with the missing prefix sums from stage 4. This pattern decreases the total I/Os per element, since the first stage now only reads each element and writes a single run-sum instead of an entire run. With input size n and p threads, it uses linear Work = 2n + 2p, linear Depth = 2⌈n/p⌉ + 2p, and linear Total I/Os = 3n + 4p.
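For comparison, a minimal Python sketch of Reduce-then-Scan follows. As before, it is a sequential simulation with addition assumed as the operator, and the function name `reduce_then_scan` is hypothetical. The only differences from Scan-then-Fan are the first stage, which is read-only, and the last stage, which scans each run starting from its prefix.

```python
import math


def reduce_then_scan(values, p):
    """Inclusive prefix sum of `values` via the Reduce-then-Scan stages."""
    n = len(values)
    run_len = math.ceil(n / p) if n else 0
    runs = [values[t * run_len:(t + 1) * run_len] for t in range(p)]

    # Stages 1-2 (Reduce Runs, Store Run-Sums): each thread reads its run
    # and writes only a single run-sum.
    run_sums = [sum(run) for run in runs]

    # Stage 3 (Scan Run-Sums): inclusive scan over the p run-sums.
    inclusive, acc = [], 0
    for s in run_sums:
        acc += s
        inclusive.append(acc)

    # Stage 4 (Reach Back): exclusive scan; the first run gets the identity.
    prefixes = [0] + inclusive[:-1]

    # Stage 5 (Scan Runs & Update): serial scan seeded with the run's prefix.
    out = []
    for run, prefix in zip(runs, prefixes):
        acc = prefix
        for x in run:
            acc += x
            out.append(acc)
    return out


data = [3, 1, 1, 4, 1, 4, 3, 2, 3, 3, 2, 1, 1, 3, 2, 2]
print(reduce_then_scan(data, 4))
# [3, 4, 5, 9, 10, 14, 17, 19, 22, 25, 27, 28, 29, 32, 34, 36]
```

Each element is read twice but written once, consistent with the 3n term in the I/O count above.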
