High Performance Computing for Science and Engineering II

(1)

High Performance Computing for Science and Engineering II

11.5.2020 - Lecture: Parallel Reduction (GPU)

Lecturer: Dr. Sergio Martin

(2)

Today's Class

Optimizations:

- Improving Warp Efficiency

- Improving Memory Bank Access

- Addressing Instruction Bottlenecks The Reduction Pattern

- Principles and parallel approach pattern.

- Cost and Complexity

- Brent's Theorem

Algorithm Cascading:

(3)

Vector Reduction

Fact: Some problems require reading from an entire array but write to a smaller set.

Example: Vector reduction

double mySum = 0.0;

for (int i = 0; i < N; i++) mySum += vector[i];

10 Inputs

1 Output

Sum: 45

(4)

Parallel Vector Reduction

Thread 1 Thread 2

T1Sum: 25 T2Sum: 20

Sum: 45

Barrier

(5)

Distributed Parallel Vector Reduction

Thread 1 Thread 2

T1Sum: 25 T2Sum: 20

Sum: 45

Barrier

Thread 1 Thread 2

T1Sum: 25 T2Sum: 20

Sum: 45

Barrier

Node 0 Node 1

Sum: 90

MPI_Reduce

Tree-like Pattern

(also within MPI)

(6)

Optimizing Parallel Reduction in CUDA

Mark Harris

NVIDIA Developer Technology

The next set of slides are adapted from the presentation:

(7)

Parallel Reduction

Common and important data parallel primitive Easy to implement in CUDA

- Harder to get it right

Serves as a great optimization example

- We’ll walk step by step through 7 different versions

- Demonstrates several important optimization strategies

!7

(8)

Parallel Reduction Complexity

Log(N) parallel steps, each step S does N/2

^S

independent ops:

- Step Complexity is O(log N) 

For N=2

^D

, performs ∑

_S∈[1..D]

2

^D-S

= N-1 operations

- Work Complexity is O(N) – It is work-efficient

- i.e. does not perform more operations than a sequential algorithm 

With P threads physically in parallel (P processors), time complexity is O(N/P + log N)

-

Compare to O(N) for sequential reduction

(9)

GPU Parallel Reduction

Need to be able to use multiple thread blocks -

To process very large arrays

- To keep all multiprocessors on the GPU busy

- Each thread block reduces a portion of the array

But how do we communicate partial results between thread blocks?

!9

4 7 5 9

Tree-based approach used within each thread block

3 1 7 0 4 1 6 3

11 14

25

Block

(10)

Problem: Global Synchronization

If we could synchronize across all thread blocks, could easily reduce very large arrays.

- Global sync after each block produces its result - Once all blocks reach sync, continue recursively 

But CUDA has no global synchronization. Why?

- Expensive to build in hardware for GPUs with high processor count...

- Would force programmer to run fewer blocks to avoid deadlock, which may reduce overall efficiency  (no more than # multiprocessors * # resident blocks / multiprocessor)

Solution: decompose into multiple kernels

- Kernel launch serves as a global synchronization point.

- Kernel launch has negligible HW overhead, low SW overhead

(11)

Solution: Kernel Decomposition

Avoid global sync by decomposing computation into multiple kernel invocations

In the case of reductions, code for all levels is the same

Recursive kernel invocation

25

Level 0:

8 blocks

!11

Level 1:

1 block

25 25 25 25 25 25 25 25 25

50 50 50 50 100 100

200

3 1 7 0 4 1 6 3 3 1 7 0 4 1 6 3 3 1 7 0 4 1 6 3 3 1 7 0 4 1 6 3 3 1 7 0 4 1 6 3 3 1 7 0 4 1 6 3 3 1 7 0 4 1 6 3 3 1 7 0 4 1 6 3

4 7 5 9 4 7 5 9 4 7 5 9 4 7 5 9 4 7 5 9 4 7 5 9 4 7 5 9 4 7 5 9

11 14 11 14 11 14 11 14 11 14 11 14 11 14 11 14

25 25 25 25 25 25 25 25

(12)

What is Our Optimization Goal?

We should strive to reach GPU peak performance. 

 

Choose the right metric:

- GFLOP/s: for compute-bound kernels - Bandwidth: for memory-bound kernels

Reductions have very low arithmetic intensity:

- 1 flop per element loaded (bandwidth-optimal) - Therefore we should strive for peak bandwidth

Will use G80 GPU for this example:

-

384-bit memory interface, 900 MHz DDR 384 * 1800 / 8 = 86.4 GB/s

(13)

Reduction #1: Interleaved Addressing

!13

__global__ void reduce0(int *g_idata, int *g_odata) {

extern __shared__int sdata[];

// each thread loads one element from global to shared mem unsigned int tid = threadIdx.x;

unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;

sdata[tid] = g_idata[i];

__syncthreads();

// do reduction in shared mem

for(unsigned int s=1; s < blockDim.x; s *= 2) {

if (tid % (2*s) == 0)

sdata[tid] += sdata[tid + s];

}

// write result for this block to global mem if (tid == 0) g_odata[blockIdx.x] = sdata[0];

}

(14)

Parallel Reduction: Interleaved Addressing

10 1 8 -1 0 -2 3 5 -2 -3 2 7 0 11 0 2

Values (shared memory)

0 2 4 6 8 10 12 14

11 1 7 -1 -2 -2 8 5 -5 -3 9 7 11 11 2 2

0 4 8 12

18 1 7 -1 6 -2 8 5 4 -3 9 7 13 11 2 2

0 8

24 1 7 -1 6 -2 8 5 17 -3 9 7 13 11 2 2

0

41 1 7 -1 6 -2 8 5 17 -3 9 7 13 11 2 2

Step 1 Stride 1

Thread IDs Values Thread

IDs Values Thread

IDs Values

(15)

Reduction #1: Interleaved Addressing

!15

__global__ void reduce0(int *g_idata, int *g_odata) {

extern __shared__int sdata[];

// each thread loads one element from global to shared mem unsigned int tid = threadIdx.x;

// do reduction in shared mem

for(unsigned int s=1; s < blockDim.x; s *= 2) {

if (tid % (2*s) == 0)

}

// write result for this block to global mem if (tid == 0) g_odata[blockIdx.x] = sdata[0];

}

Problem: highly divergent

warps are very inefficient, and

% operator is very slow

(16)

Performance for 4M element reduction

Kernel 1:

interleaved addressing with divergent branching

8.054 ms 2.083 GB/s

Note: Block Size = 128 threads for all tests

Bandwidth Time (2²²ints)

(17)

for (unsigned int s=1; s < blockDim.x; s *= 2 {

if (tid % (2*s) == 0) sdata[tid] += sdata[tid + s];

}

Reduction #2: Interleaved Addressing

!17

Just replace divergent branch in inner loop:

for (unsigned int s=1; s < blockDim.x; s *= 2) {

int index = 2 * s * tid;

if (index < blockDim.x) sdata[index] += sdata[index + s];

}

With strided index and non-divergent branch:

(18)

Parallel Reduction: Interleaved Addressing

10 1 8 -1 0 -2 3 5 -2 -3 2 7 0 11 0 2

0 1 2 3 4 5 6 7

11 1 7 -1 -2 -2 8 5 -5 -3 9 7 11 11 2 2

0 1 2 3

18 1 7 -1 6 -2 8 5 4 -3 9 7 13 11 2 2

0 1

24 1 7 -1 6 -2 8 5 17 -3 9 7 13 11 2 2

0

41 1 7 -1 6 -2 8 5 17 -3 9 7 13 11 2 2

Thread IDs Values Thread

IDs Values

New Problem: Shared Memory Bank Conflicts

(19)

Performance for 4M element reduction

!19

Kernel 1:

8.054 ms 2.083 GB/s

Kernel 2:

interleaved addressing with bank conflicts

3.456 ms 4.854 GB/s 2.33x 2.33x

Step Speedup Bandwidth

Time (2²²ints) Cumulative

Speedup

(20)

Parallel Reduction: Sequential Addressing

10 1 8 -1 0 -2 3 5 -2 -3 2 7 0 11 0 2

Values

0 1 2 3 4 5 6 7

8 -2 10 6 0 9 3 7 -2 -3 2 7 0 11 0 2

0 1 2 3

8 7 13 13 0 9 3 7 -2 -3 2 7 0 11 0 2

0

21 20 13 13 0 9 3 7 -2 -3 2 7 0 11 0 2

41 20 13 13 0 9 3 7 -2 -3 2 7 0 11 0 2

Thread Step 1 IDs

Stride 8

Thread IDs

Values Thread

IDs Values

Sequential addressing is conflict free

1

0

(21)

for (unsigned int s=1; s < blockDim.x; s *= 2) {

int index = 2 * s * tid;

if (index < blockDim.x) sdata[index] += sdata[index + s]; 

}

Reduction #3: Sequential Addressing

!21

Just replace strided indexing in inner loop:

for (unsigned int s=blockDim.x/2; s>0; s>>=1) {

if (tid < s) sdata[tid] += sdata[tid + s];

}

With reversed loop and threadID-based indexing:

(22)

Performance for 4M element reduction

Kernel 1:

8.054 ms 2.083 GB/s

Kernel 2:

3.456 ms 4.854 GB/s 2.33x 2.33x

Kernel 3:

sequential addressing

1.722 ms 9.741 GB/s 2.01x 4.68x

Speedup

(23)

for (unsigned int s=blockDim.x/2; s>0; s>>=1 {

if (tid < s) sdata[tid] += sdata[tid + s];

}

Problem: Idle Threads

!23

Problem:

Half of the threads are idle on first loop iteration!

This is wasteful…

(24)

// each thread loads one element from global to shared mem unsigned int tid = threadIdx.x;

Reduction #4: First Add During Load

Halve the number of blocks, and replace single load:

// perform first level of reduction,

// reading from global memory, writing to shared memory unsigned int tid = threadIdx.x;

unsigned int i = blockIdx.x*(blockDim.x*2) + threadIdx.x;

sdata[tid] = g_idata[i] + g_idata[i+blockDim.x];

With two loads and first add of the reduction:

(25)

Performance for 4M element reduction

!25

Kernel 1:

8.054 ms 2.083 GB/s

Kernel 2:

3.456 ms 4.854 GB/s 2.33x 2.33x

Kernel 3:

1.722 ms 9.741 GB/s 2.01x 4.68x Kernel 4:

first add during global load

0.965 ms 17.377 GB/s 1.78x 8.34x

Speedup

(26)

Instruction Bottleneck

At 17 GB/s, we’re far from bandwidth bound -

And we know reduction has low arithmetic intensity

Therefore a likely bottleneck is instruction overhead

-

Ancillary instructions that are not loads, stores, or arithmetic for the core computation - In other words: address arithmetic and loop overhead

Strategy: unroll loops

(27)

Unrolling the Last Warp

As reduction proceeds, # “active” threads decreases -

When s <= 32, we have only one warp left

Instructions are SIMD synchronous within a warp.

- That means when s<=32 (depending on architecture)

-

We don’t need to __syncthreads()

- We don’t need “if (tid < s)” because it doesn’t save any work

Let’s unroll the last 6 iterations of the inner loop

!27

(28)

__device__ void warpReduce(volatile int* sdata, int tid) { sdata[tid] += sdata[tid + 32];

sdata[tid] += sdata[tid + 16];

}

// later…

for (unsigned int s=blockDim.x/2; s>32; s>>=1) { if (tid < s)

syncthreads();

}

if (tid < 32) warpReduce(sdata, tid);

Reduction #5: Unroll the Last Warp

Note: This saves useless work in all warps, not just the last one!

Without unrolling, all warps execute every iteration of the for loop and if statement IMPORTANT:

For this to be correct, we must use the

“volatile” keyword! This prevents concurrency-unaware compiler

optimizations

(29)

Performance for 4M element reduction

!29

Kernel 1:

8.054 ms 2.083 GB/s

Kernel 2:

3.456 ms 4.854 GB/s 2.33x 2.33x

Kernel 3:

1.722 ms 9.741 GB/s 2.01x 4.68x Kernel 4:

0.965 ms 17.377 GB/s 1.78x 8.34x Kernel 5:

unroll last warp

0.536 ms 31.289 GB/s 1.8x 15.01x

Speedup

(30)

Complete Unrolling

If we knew the number of iterations at compile time, we could completely unroll the reduction

-

Luckily, the block size is limited by the GPU to 512 threads - Also, we are sticking to power-of-2 block sizes

So we can easily unroll for a fixed block size -

But we need to be generic

-

How can we unroll for block sizes that we don’t know at compile time?

Templates to the rescue!

- CUDA supports C++ template parameters on device and host functions

(31)

Unrolling with Templates

Specify block size as a function template parameter:

!31

template <unsigned int blockSize>

global void reduce5(int *g_idata, int *g_odata)

(32)

Reduction #6: Completely Unrolled

if (blockSize >= 512) {

if (tid < 256) { sdata[tid] += sdata[tid + 256]; } __syncthreads(); } if (blockSize >= 256) {

if (tid < 128) { sdata[tid] += sdata[tid + 128]; } __syncthreads(); } if (blockSize >= 128) {

if (tid < 64) { sdata[tid] += sdata[tid + 64]; } __syncthreads(); } if (tid < 32) warpReduce<blockSize>(sdata, tid);

Note: all code in RED will be evaluated at compile time.

Results in a very efficient inner loop!

template <unsigned int blockSize>

__device__ void warpReduce(volatile int* sdata, int tid) { if (blockSize >= 64) sdata[tid] += sdata[tid + 32];

if (blockSize >= 32) sdata[tid] += sdata[tid + 16];

if (blockSize >=

8) sdata[tid] += sdata[tid + 4];

}

(33)

27

Invoking Template Kernels

Don’t we still need block size at compile time?

Nope, just a switch statement for 10 possible block sizes:

switch (threads) {

case 512:

reduce5<512><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata); break; 

case 256:

reduce5<256><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata); break;

case 128:

reduce5<128><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata); break;  

case 64:

reduce5< 64><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata); break; 

case 32:

reduce5< 32><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata); break; 

case 16:

reduce5< 16><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata); break;

case 8:

8><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata); break;

reduce5<

case 4:

reduce5<

case 2:

reduce5<

case 1:

reduce5<

}

(34)

Performance for 4M element reduction

Kernel 1:

8.054 ms 2.083 GB/s

Kernel 2:

3.456 ms 4.854 GB/s 2.33x 2.33x

Kernel 3:

1.722 ms 9.741 GB/s 2.01x 4.68x Kernel 4:

0.965 ms 17.377 GB/s 1.78x 8.34x Kernel 5:

0.536 ms 31.289 GB/s 1.8x 15.01x Kernel 6:

completely unrolled

0.381 ms 43.996 GB/s 1.41x 21.16x

Speedup

(35)

What About Cost?

Cost of a parallel algorithm is processor's time complexity

Allocate threads instead of processors: O(N) threads

Time complexity is O(log N), so cost is O(N log N) : not cost efficient!

!35

Cost = Time * # Threads

(36)

Brent's Theorem

(37)

What About Cost?

Cost of a parallel algorithm is processor's time complexity

Allocate threads instead of processors: O(N) threads

Time complexity is O(log N), so cost is O(N log N) : not cost efficient!

!37

Brent’s theorem suggests O(N/log N) threads

Each thread does O(log N) sequential work

Then all O(N/log N) threads cooperate for O(log N) steps Cost = O((N/log N) * log N) = O(N) ! cost efficient

Sometimes called algorithm cascading

Can lead to significant speedups in practice

(38)

Algorithm Cascading

Combine sequential and parallel reduction

Each thread loads and sums multiple elements into shared memory Tree-based reduction in shared memory

Brent’s theorem says each thread should sum O(logN) elements

i.e. 1024 or 2048 elements per block vs. 256

In my experience, beneficial to push it even further

Possibly better latency hiding with more work per thread

More threads per block reduces levels in tree of recursive kernel invocation

High kernel launch overhead in last levels with few blocks

On G80, best perf with 64-256 blocks of 128 threads (1024-4096 elements per thread)

(39)

unsigned int tid = threadIdx.x;

Reduction #7: Multiple Adds / Thread

while (i < n)  {

sdata[tid] += g_idata[i] + g_idata[i+blockSize];

i += gridSize;

}

!39

Replace load and add of two elements:

With a while loop to add as many as necessary:

unsigned int i = blockIdx.x*(blockSize*2) + threadIdx.x;

unsigned int gridSize = blockSize*2*gridDim.x;

sdata[tid] = 0;

(40)

syncthreads();

Reduction #7: Multiple Adds / Thread

Replace load and add of two elements:

unsigned int gridSize = blockSize*2*gridDim.x;

sdata[tid] = 0;

With a while loop to add as many as necessary:

unsigned int tid = threadIdx.x; 

unsigned int i = blockIdx.x*(blockSize*2)+threadIdx.x;

while (i < n) {

sdata[tid] += g_idata[i] + g_idata[i+blockSize];

i += gridSize;

}

syncthreads();

Use Grid Size Loop Stride

to maintain coalescing!

(41)

Performance for 4M element reduction

Kernel 7 on 32M elements: 73 GB/s!

!41

Kernel 1:

8.054 ms 2.083 GB/s

Kernel 2:

3.456 ms 4.854 GB/s 2.33x 2.33x

Kernel 3:

1.722 ms 9.741 GB/s 2.01x 4.68x Kernel 4:

0.965 ms 17.377 GB/s 1.78x 8.34x Kernel 5:

0.536 ms 31.289 GB/s 1.8x 15.01x Kernel 6:

completely unrolled

0.381 ms 43.996 GB/s 1.41x 21.16x Kernel 7:

multiple elements per thread

0.268 ms 62.671 GB/s 1.42x 30.04x

Speedup

(42)

template <unsigned int blockSize> __device__ void warpReduce(volatile int *sdata, unsigned int tid)   {

if (blockSize >= 64) sdata[tid] += sdata[tid + 32]; 

if (blockSize >= 32) sdata[tid] += sdata[tid + 16]; 

if (blockSize >=

if (blockSize >=

} 

template <unsigned int blockSize> global void reduce6(int *g_idata, int *g_odata, unsigned int n) {  

extern shared int sdata[];

unsigned int tid = threadIdx.x;

unsigned int i = blockIdx.x*(blockSize*2) + tid; 

unsigned int gridSize = blockSize*2*gridDim.x; sdata[tid] = 0;

while (i < n) { sdata[tid] += g_idata[i] + g_idata[i+blockSize]; i += gridSize; } __syncthreads();

if (blockSize >= 512) { if (tid < 256) { sdata[tid] += sdata[tid + 256]; } __syncthreads(); }  if (blockSize >= 256) { if (tid < 128) { sdata[tid] += sdata[tid + 128]; } __syncthreads(); } if (blockSize >= 128) { if (tid < 64) { sdata[tid] += sdata[tid + 64]; } __syncthreads(); } if (tid < 32) warpReduce(sdata, tid);

if (tid == 0) g_odata[blockIdx.x] = sdata[0];

}

Final Optimized Kernel

(43)

Performance Comparison

0.01 0.1 1 10

131072

262144

524288

1048576

2097152

4194304

8388608

16777216

33554432

# Elements

Time (ms)

1: Interleaved Addressing: Divergent Branches

2: Interleaved Addressing: Bank Conflicts

3: Sequential Addressing

4: First add during global load

5: Unroll last warp

6: Completely unroll

7: Multiple elements per thread (max 64 blocks)

!43

(44)

Types of optimization

Interesting observation:

Algorithmic optimizations

Changes to addressing, algorithm cascading 11.84x speedup, combined!

Code optimizations

Loop unrolling

2.54x speedup, combined

(45)

Conclusion

Understand CUDA performance characteristics

Memory coalescing Divergent branching Bank conflicts

Latency hiding