CS 250B: Modern Computer Systems

(1)

CS 250B: Modern Computer Systems

GPU Computing Introduction

Sang-Woo Jun

(2)

Graphic Processing – Some History

❑ 1990s: Real-time 3D rendering for video games were becoming common

o Doom, Quake, Descent, … (Nostalgia!)

❑ 3D graphics processing is immensely computation-intensive

Texture mapping

Warren Moore, “Textures and Samplers in Metal,” Metal by Example, 2014

Shading

Gray Olsen, “CSE 470 Assignment 3 Part 2 - Gourad/Phong Shading,” grayolsen.com, 2018

(3)

Graphic Processing – Some History

❑ Before 3D accelerators (GPUs) were common

❑ CPUs had to do all graphics computation, while maintaining framerate!

o Many tricks were played

Doom (1993) : “Affine texture mapping”

• Linearly maps textures to screen location, disregarding depth

• Doom levels did not have slanted walls or ramps, to hide this

(4)

Graphic Processing – Some History

❑ Before 3D accelerators (GPUs) were common

❑ CPUs had to do all graphics computation, while maintaining framerate!

o Many tricks were played

Quake III arena (1999) : “Fast inverse square root”

magic!

(5)

Introduction of 3D Accelerator Cards

❑ Much of 3D processing is short algorithms repeated on a lot of data

o pixels, polygons, textures, …

❑ Dedicated accelerators with simple, massively parallel computation

A Diamond Monster 3D, using the Voodoo chipset (1997)

(Konstantin Lanzet, Wikipedia)

Example: OpenGL Shader Language (“GLSL”) Program a function,

function runs for every single pixel on screen

Source: Tom’s Hardware, 1997

(6)

NVIDIA Ampere-Based GA100 Architecture (2020)

Many many cores, not a lot of cache/control

(7)

Peak Performance vs. CPU

Throughput Power Throughput/Power

Intel Skylake 128 SP GFLOPS/4 Cores 100+ Watts ~1 GFLOPS/Watt

NVIDIA V100 15 TFLOPS 200+ Watts ~75 GFLOPS/Watt

Also,

(8)

System Architecture Snapshot With a GPU (2021)

CPU

GPU GPU Memory

(GDDR6, HBM2,…)

Host Memory (DDR4,…)

Platform Controller Hub

(PCH)

NVMe

Network Interface

… QPI/UPI

12.8 GB/s (QPI) 20.8 GB/s (UPI)

PCIe

16-lane PCIe Gen3: 16 GB/s 16-land PCIe Gen4: 32 GB/s DDR4 2666 MHz

128 GB/s 100s of GB GDDR5: 100s GB/s, 10s of GB

HBM2: ~1 TB/s, 10s of GB

Lots of moving parts!

(9)

High-Performance Graphics Memory

❑ Modern GPUs even employing 3D-stacked memory via silicon interposer

o Very wide bus, very high bandwidth o e.g., HBM2 in Volta, Ampere

Graphics Card Hub, “GDDR5 vs GDDR5X vs HBM vs HBM2 vs GDDR6 Memory Comparison,” 2019

(10)

Massively Parallel Architecture For Massively Parallel Workloads!

❑ NVIDIA CUDA (Compute Uniform Device Architecture) – 2007

o A way to run custom programs on the massively parallel architecture!

❑ OpenCL specification released – 2008

❑ Both platforms expose synchronous execution of a massive number of threads

CPU GPU

Thread

… GPU Threads

Copy over PCIe Copy over PCIe

Automatic memory transfer features exist!

(e.g., cudaMallocManaged, zero copy, …)

Emphasis (typically) more on programmability than performance

(11)

CUDA Execution Abstraction

❑ Block: Multi-dimensional array of threads

o 1D, 2D, or 3D

o Threads in a block can synchronize among themselves o Threads in a block can access shared memory

o CUDA (Thread, Block) ~= OpenCL (Work item, Work group)

❑ Grid: Multi-dimensional array of blocks

o 1D or 2D

o Blocks in a grid can run in parallel, or sequentially

❑ Kernel execution issued in grid units

❑ Limited recursion (depth limit of 24 as of now)

(12)

GPU programming abstraction

❑ “SIMT” (Single Instruction Multiple Threads), introduced by NVIDIA

o Simply put: Identical program (“Kernel”) executed on multiple threads o Thread ID is given as a parameter to the program,

so each thread can perform different work despite identical code

o Another kernel parameter is “block size”, the number of threads to use

for (ii = 0; ii < cnt; ++ii) { C[ii] = A[ii] + B[ii];

}

__global__ void KernelFunction(…) { int tid = threadIdx.x;

int blocksize = ceiling(cnt/blockDim.x);

for (i = 0; i < blocksize; ++i ) { int ii = blocksize*tid+i;

if ( ii < cnt ) C[ii] = A[ii] + B[ii];

} }

CPU Code example GPU Code example

Thread dimensions given as part of request from host software

(13)

Simple CUDA Example

Asynchronous call

NVCC Compiler

Host Compiler

Device Compiler

CPU+GPU Software C/C++

+ CUDA Code

CPU side GPU side

(14)

Simple CUDA Example

1 block

N threads per block

Which of N threads am I?

End-to-End Example: SAXPY

❑ “Single-precision A*X Plus Y”

^X ^Y

BlockDim.x

(16)

End-to-End Example: SAXPY

…

Host Memory

Device Memory

Copy to Device

Call Kernel

Copy Result

% nvcc -o saxpy saxpy.cu

% ./saxpy

Great! Now we know how to use GPUs Bye?

(17)

Matrix Multiplication

Performance Engineering

Results from NVIDIA P100

Coleman et. al., “Efficient CUDA,” 2017 Architecture knowledge is needed (again)

No faster than CPU

(18)

NVIDIA Ampere-Based GA100 Architecture (2020)

Single Streaming Multiprocessor (SM) has 64 INT32 cores, 64 FP32 cores, 32 FP64 cores

(+4 Tensor cores…) GA100 has 108 SMs

(19)

Ampere Execution Architecture

❑ 64 INT32, 64 FP32, 32 FP64, 4 Tensor Cores

o Specialization to make use of chip space…?

❑ Not much on-chip memory per thread

o 164 KB Shared memory o 256KB Registers

❑ Hard limit on compute management

o 32 blocks AND 2048 threads AND 1024 threads/block o e.g., 2 blocks with 1024 threads, or 4 blocks with 512

threads

o Enough registers/shared memory for all threads must be available (all context is resident during execution)

More threads than cores – Threads interleaved to hide memory latency

(20)

Resource Balancing Details

❑ How many threads in a block?

❑ Too small: 4x4 window == 16 threads

o 128 blocks to fill 2048 thread/SM

o SM only supports 32 blocks -> only 512 threads used

• SM has only 64 cores… does it matter? Sometimes!

❑ Too large: 32x48 window == 1536 threads

o Threads do not fit in a block!

o Runtime error: “invalid configuration argument”

❑ Too large: 1024 threads using more than 256 Byte registers

❑ Limitations vary across platforms (Fermi, Pascal, Volta, Ampere, …)

(21)

CS 250B: Modern Computer Systems

GPU Architecture And Performance

Sang-Woo Jun

(22)

GPU Processor Architecture

❑ GPUs have thousands of threads running concurrently at GHzs

❑ Much simpler processor architecture

o Dozens of threads scheduled together in a SIMD fashion

o Much simpler microarchitecture (doesn’t need to boot Linux)

❑ Much higher power budget

o CPUs try to maintain 100 W power budget (Pentium 4 till now) o Thermal design power (TDP) for modern GPUs around 300 W

• TDP: Safe level of power consumption for effective cooling

CPU (i7) adding 1 Billion floats: 2.14s, NVIDIA Turing with only one thread: 29.16s

(23)

GPU Processor Architecture

❑ Cores are organized into units of “warps”

o Threads in a warp share the same Fetch and decode units o Drastically reduces chip resource usage

• One reason why GPUs can fit so many cores

o Basically a warp is one thread with SIMD operations

• But exposes multithread abstraction to the programmer

o Typically 32 threads per warp for NVIDIA, but may change

• Warp size information is not part of programming abstraction

Source: Tor Aamodt

miss?

(24)

GPU Processor Architecture

❑ Each warp hardware can handle many sets of threads

o Context switch in case of memory access request, to hide memory access latency

❑ A large block of threads can map across many streaming multiprocessors

o Thread 0 to 31 map to warp 0, Thread 32 to 63 map to warp 1, …

miss?

(25)

Thread Synchronization in CUDA

❑ Synchronization is possible within a block

o __syncthreads() is a barrier operation

❑ Synchronization is unnecessary within a warp

o SIMD anyways

❑ Synchronization is not (easily) available between blocks

o __syncthreads() does nothing o No shared memory

o We can implement synchronization using slow global memory…

So far, typical parallel, multithreaded programming But, caveats for performance engineering starts here!

(26)

Warp Scheduling Caveats

❑ Remember: Threads within a block share the same fetch, decode units

o All threads in a warp are always executing the same instruction o What if their execution diverges?

• e.g., if (tid%2) func1(), else func2()

• e.g., if (A[tid] < 100) X++, else Y++

❑ Divergence across warps don’t matter

o Different warps, different fetch+decode

❑ What about intra-warp divergence?

(27)

Warp Scheduling Caveats

❑ Intra-warp execution divergence incurs “control divergence”

o The warp processor must execute both paths, one after another

• Whole warp will execute one direction first with some threads suspended, and the other direction with the other threads suspended

o If 32 threads go down 32 different branches, no performance gain with SIMD!

2018, “Using CUDA Warp-Level Primitives,” NVIDIA

(28)

GPU Memory Architecture

❑ Not much on-chip memory per thread

o 4KB Registers per FP32 core o 164 KB Shared memory

❑ Relatively fast off-chip “global” memory

o But not fast enough!

o GDDR6 or HBM2 can deliver up to +1TB/s o Shared across 2048+ threads…

❑ Pretty much no memory consistency between blocks

o Once data goes to off-chip main memory, explicit synchronization critical!

• Expensive!

Morgan Kaufmann “Computer Organization and Design: The Hardware/Software Interface: RISC-V Edition”

(29)

GPU Memory Architecture

❑ Remember: A block can have thousands of threads

o They can all be accessing shared memory at once

o Shared memory hardware can’t have a port for each thread o Serializing memory access will kill performance

• Performance will be limited by one shared memory access per thread per cycle

❑ Organized into banks to distribute access

o Best performance if all threads in warp access different banks

o Best performance if all threads access the same address (broadcast) o Otherwise, bank conflicts drastically reduce performance

8-way bank conflict 1/8 memory bandwidth

(30)

Prominent Performance Engineering Topics

❑ Warp level execution

o Avoid branch divergence within nearby threads

o Algorithmic solutions for warp-size oblivious computations often possible

❑ Shared memory bank conflict

o Map data access per thread to interleaved addresses

❑ Synchronization overhead

o Avoid __syncthreads whenever possible (e.g., Within warp) o Avoid inter-block synchronization

❑ Memory reuse

o Cache-optimized algorithms

(31)

CS 250B: Modern Computer Systems GPU Application Examples

Sang-Woo Jun

(32)

Application 1: Matrix Multiplication

❑ Dividing Matrix Multiplication into blocks of threads

o Simple solution: each thread responsible for one element o Remember: 1024 threads maximum per block

o Spawn as many blocks as needed to cover C

❑ Shared memory is used to do some caching

o Good enough? ^A

B

C

2D Block Thread

(33)

A Naïve Matrix Multiplication Kernel

__global__ void MatrixMult0(float* a, float* b, float* c, int N) { int Row = blockIdx.y*blockDim.y+threadIdx.y;

int Col = blockIdx.x*blockDim.x+threadIdx.x;

if ((Row < N) && (Col < N)) {

for (int k = 0; k < N; ++k) {

c[Row*N+Col] += a[Row*N+k]*b[k*N+Col];

} }

}

MatrixMult0<<<dim3(N/BW,N/BW,1),dim3(BW,BW,1)>>>(d_a,d_b,d_c,N);

Width of a 2D square block of threads

Max threads per block: 1024 Max BW: 32

(34)

Performance So Far

❑ 16,384 x 16,384 Matrix

❑ NVIDIA RTX 2080 ti

❑ Naïve implementation

o Elapsed: 16.865s o GFLOPS: 521

(Peak GFLOPS: 13500)

16K * 16K *16K * 4B / 616GB/s ~= 26s

… Some caching!

(35)

A Naïve Matrix Multiplication Kernel

if ((Row < N) && (Col < N)) {

for (int k = 0; k < N; ++k) {

c[Row*N+Col] += a[Row*N+k]*b[k*N+Col];

} }

}

Is this reused?

(36)

Attempt 2: Local Variable For Reuse

if ((Row < N) && (Col < N)) { float Pvalue = 0;

for (int k = 0; k < N; ++k) {

Pvalue += a[Row*N+k]*b[k*N+Col];

}

c[Row*N+Col] = Pvalue;

} }

Local variable for reuse

(37)

Performance So Far

❑ 16,384 x 16,384 Matrix

❑ NVIDIA RTX 2080 ti

❑ Naïve implementation

o Elapsed: 16.865s o GFLOPS: 521

❑ Local reuse 1

o Elapsed: 5.08 o GFLOPS: 1728

A

B

C Thread1

Thread 2 Is this reused?

(38)

Attempt 3: Shared Memory

❑ Explicitly manage caches using shared

❑ Calculate result by adding N/BW sub-sums

❑ Sub-blocks will be used by all threads before discarded

A

B

C

…

Reuse!

(39)

Multithreaded Load To Shared Memory

Load Use

Source: NVIDIA, UIUC GPU Teaching Kit

(40)

Attempt 3: Shared Memory

__global__ void MatrixMult2(float* a, float* b, float* c, int N) { __shared__ float ds_a[BLOCK_WIDTH][BLOCK_WIDTH];

__shared__ float ds_b[BLOCK_WIDTH][BLOCK_WIDTH];

int bx = blockIdx.x; int by = blockIdx.y;

int tx = threadIdx.x; int ty = threadIdx.y;

int Row = by * blockDim.y + ty;

int Col = bx * blockDim.x + tx;

float Pvalue = 0;

for (int p = 0; p < N/BLOCK_WIDTH; ++p) {

ds_a[ty][tx] = a[Row*N + p*BLOCK_WIDTH+tx];

ds_b[ty][tx] = b[(p*BLOCK_WIDTH+ty)*N + Col];

__syncthreads();

for (int i = 0; i < BLOCK_WIDTH; ++i)Pvalue += ds_a[ty][i] * ds_b[i][tx];

__syncthreads();

}

c[Row*N+Col] = Pvalue;

}

Wait until load is done for all threads

Wait until computation is done for all threads

(41)

Performance So Far

❑ 16,384 x 16,384 Matrix

❑ NVIDIA RTX 2080 ti

❑ Naïve implementation

o Elapsed: 16.865s, GFLOPS: 521

❑ Local reuse 1

❑ Shared memory

(42)

Block Size Considerations

❑ More re-use with larger blocks!

❑ With 16x16 blocks (256 threads)

o 512 word loads from memory o 256 * (2*16) = 8,192 FLOPs o 16 FLOP per load

❑ With 32x32 blocks (1024 threads)

o 1024 word loads from memory o 1024 * (2*32) = 65,536 FLOPs o 32 FLOP per load

Unfortunately, threads per block limited to 1024

(43)

Application 2: Parallel Reduction

❑ Combines an array of elements and produces a single result

o E.g., adding all values in an array, finding maximum, calculating average, …

❑ If the operation is associative, i.e., (A+B)+C == A+(B+C), calculation can be parallelized

Source: Mark Harris, “Optimizing Parallel Reduction in CUDA,” NVIDIA Developer Technology, 2007

(44)

How To Best Allocate Work To Threads?

❑ Straightforward method: divide blocks of work across threads

❑ Will this be efficient?

o Warp affinity of algorithm

o Good data access patterns, etc?

❑ How many threads should we spawn?

o As many threads as cores: Too little threads… Main memory latency not hidden!  o Too many threads: Is there any downsides to this?

Thread 0 Thread 1

(45)

Method 0: Consecutive work blocks

❑ Each kernel run will reduce data size to blocks*threads

o Must run iteratively until reduced to 1

o How many threads, how many blocks? Too small: too many iterations!

o Let’s fix threads per block to 1024 (max for this architecture)

❑ Peak performance when ~64 elements per thread

o ~40ms for 2³⁰ elements o Is this good?

(46)

Our Goal: Memory Saturation

❑ Reduction is an O(N) problem, ideally reading each element exactly once

❑ Not much computation per memory, so likely memory bound

o RTX 2080 ti’s GDDR6 memory has peak bandwidth of 616 GB/s o We want to reach this utilization

o E.g., 2³⁰ elements = 4 GB, ideally 6 ms

❑ Let’s follow the guidelines in NVIDIA’s “Optimizing Parallel Reduction in

CUDA,” NVIDIA Developer Technology, 2007

(47)

❑ Each block of 1024 threads reducing 1024 elements to 1

o Use shared memory!

❑ 47 ms, 91 GB/s

Method 1: Interleaved Addressing

❑ Each block of 1024 threads reducing 1024 elements to 1

o Use shared memory!

❑ 47 ms, 91 GB/s

Where are all the odd threads?

(48)

Method 1b: Better Thread Allocation

❑ More threads are doing work!

❑ 33.41 ms, 128 GB/s

Assuming four banks, many bank conflicts!

(49)

Method 2: Sequential Addressing

❑ Change thread mapping to group to lower elements

❑ Consecutive addresses have no bank conflict

❑ 29.76 ms, 144 GB/s

For N threads, N/2 threads are never used!

(50)

Method 3: More Work per Thread

❑ Instead of 1024 elements per 1024 threads, 2048 elements!

❑ 15.36 ms, 280 GB/s

❑ Q1: What if we use the same method for all previous attempts?

o Interleaved: 47 ms -> 24.35 s, Better thread: 33.41 -> 17.35 ms

❑ Q2: Can we take this further? More work per thread?

(51)

Method 4: Back To Work Blocks Per Thread

❑ But this time, use method 2 to reduce within a block

o Lots of work per thread,

o Small result set per iteration o Best of both worlds?

❑ How many blocks?

o 8,192 blocks: 40 ms

o 32,768 blocks: 28.50 ms

o 131,072 blocks: 8.52 ms! 504 GB/s!

o 524,288 blocks: 17.96 ms…

(52)

Method 4: Why?

❑ Most likely, random access issue in DRAM

o Many threads scheduled potentially out of order causes random access o DRAM isn’t really random access!

o We will get into details later

o Bottom line: Consecutive access is faster when within same page, of multiple KBs

❑ Analyzing performance

o 131,072 blocks -> 32 KB working set within fast random access range

• Each block is scheduled sequentially. No interleaving between blocks on same SM!

o Less blocks -> Larger working set per block -> Random access penalty o More blocks -> Smaller work per thread -> Performance penalty

(53)

Method 5: Consecutive Memory Access

❑ Set stride to total number of threads in grid

o Consecutive threads access consecutive addresses

o At least, threads in a warp always access contiguous addresses at once

❑ Reliably high performance!

o 256 blocks: 9.83 ms o 1024 blocks: 8.3 ms

o 8192 blocks: 7.8 ms, 550 GB/s o 65,536 blocks: 7.9 ms

o 262,144 blocks: 11.52 ms

(54)

Some More Approaches?

❑ The NVIDIA guide suggests loop unrolling when active threads become less than 32

o Within a warp, no __synchthreads needed!

o Adding an if statement to __syncthreads also adds overhead

❑ On modern chips, this changes measures pretty negligible, so omitted

❑ 616 GB/s is ~150 GOPS…

o Remember peak computation is 13,500 GFLOPS o Very much bandwidth bound!

(55)

Application 3: Option Pricing

❑ Options in Computational Finance:

o In finance, a contract giving the buyer of an asset the right (but not the obligation) to buy or sell and underlying asset at a specified price or date.

o Question: How much should I pay for a particular option?

(56)

Option Pricing

Black-Scholes Equation

Geometric Brownian Motion in Finance

Random variable

“Monte Carlo Method”

Simulate massive amount of instances and average return

What we want

(57)

Option Pricing

❑ No memory usage

o Not even shared memory

o Completely computation bound

❑ 537x Performance vs. 1 Thread

❑ Assuming GTX 1080

o 2560 CUDA cores

o Close to linear scaling

Cho et. al., “Monte Carlo Method in CUDA,” 2016

(58)