GPGPUs, CUDA and OpenCL

Timo Lilja

Course arrangements

Course code: T-106.5800 Seminar on Software Techniques

Credits: 3

Thursdays 15-16 at A232, lecture period III only

Mandatory attendance, but you can skip 1 session

Presentation

One-hour presentation

Two presentations per session

Programming project

A small programming project on a given topic, or on your own topic if you have not already received credits for it in some other course

The goal is to parallelize the given program

You can choose whether you want to use Cuda or OpenCL

We provide a development environment for this programming project. More information will be announced later; check the wiki page

Check the course wiki page

Why GPGPU?

GPGPU can in many cases offer a hundredfold increase in performance, a tenfold decrease in price and a threefold increase in power efficiency over a traditional CPU in scientific computing.

Business opportunities in various fields: medical technology,

What is a GPGPU?

Original application in computer graphics and games

General-Purpose Computing on Graphics Processing Units

Origins in programmable vertex and fragment shaders

The first GPGPU programs were written using ordinary graphics APIs in the late 90s

In the early 2000s, first programmable shaders, then fully programmable GPU cores

Parallel Computing Architectures

According to Flynn's taxonomy, defined in 1966 by Michael J. Flynn.

Stream Processing

Programming paradigm related to SIMD

Given a stream of data and a series of operations, called kernel functions

The kernel function is applied to all elements of a stream concurrently

Memory is very hierarchical: local memory is easily accessible, global memory is much more expensive

Memory accesses are usually done in bulk, so memory is optimized for high bandwidth rather than low latency
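
The idea of applying one kernel function to every element of a stream can be sketched in plain C on the CPU. This is only an illustration: the names `kernel_fn`, `scale_and_offset` and `apply_kernel` are hypothetical, and the loop runs sequentially, whereas a stream processor would run the independent iterations concurrently.

```c
#include <assert.h>
#include <stddef.h>

/* A kernel function: maps one input element to one output element. */
typedef float (*kernel_fn)(float);

/* Hypothetical example kernel: scale and offset a single element. */
float scale_and_offset(float x)
{
    return 2.0f * x + 1.0f;
}

/* Apply the kernel to every element of the stream. No iteration depends
 * on another, so a stream processor is free to run them concurrently;
 * this CPU loop simply runs them one after another. */
void apply_kernel(kernel_fn k, const float *in, float *out, size_t n)
{
    assert(k != NULL);
    for (size_t i = 0; i < n; i++)
        out[i] = k(in[i]);
}
```

Because each output element depends only on the matching input element, the work can be split across as many processing units as are available.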

GPU vs. CPU

To support SIMD parallelism, ALUs must be abundant whereas control logic and data caches are not needed that much

NVIDIA GPU

Implementation of a stream processor system

Unified architecture

vertex, pixel and other shaders use the same GPU facilities

Highly hierarchical hardware

Streaming-Processor core (SP)

Streaming multiprocessor (SM)

Texture/processor cluster (TPC)

Streaming processor array (SPA)

Streaming Multiprocessor (SM)

8 Streaming Processor (SP) cores

scalar multiply-add (MAD) and ALU units

single-precision floats and ALU operations in 4 cycles

Fused Multiply-Add unit (FMAD)

IEEE 754R double-precision floating point

1 per multiprocessor: double-precision floats are slow

2 special function units (SFU)

provide transcendental functions and other complex functions, e.g. reciprocal

slow: latencies of 16-32 cycles or more

low-latency interconnect network between SPs and shared-memory banks

multi-threaded instruction fetch and issue unit

caches: instruction cache and read-only constant cache

16 KB read/write shared memory

Texture/Processor Cluster (TPC)

Geometry controller

maps the operations into Streaming Multiprocessors

Provides a 2-dimensional texture cache that exploits (x,y) spatial locality

Streaming multiprocessor (SM) controller

Older NVidia cards (G80) have 2 SMs per TPC, newer ones (GT200) have 3 SMs per TPC

Memory and other features

Memory is highly hierarchical and cached

Thread local memory

Shared memory, which is shared inside a Streaming Multiprocessor (SM)

Global memory, which is accessible to all threads

Other units are mainly used for computer graphics

Texture unit

Raster operation processor (ROP)

Hardware limitations

Branching can cause the program to run fully sequentially

Double-precision floating point numbers are slow

Cuda

Compute Unified Device Architecture

NVidia's proprietary stream programming language

Available for Linux, Mac OS X and Windows

Current release 2.3, first release in 2007

C for Cuda

Compiled through Pathscale's Open64 C compiler

Standard C with kernel extensions

Cuda driver API

Standard C API interface

kernels are explicitly loaded

Cuda toolkit

Programming Cuda (1/2)

Consider adding two vectors A and B and storing the result in C.

In ordinary C

    void VecAdd(const float *A, const float *B, float *C, int N)
    {
        /* N is the number of elements in each vector */
        for (int i = 0; i < N; i++)
            C[i] = A[i] + B[i];
    }

In Cuda

    __global__ void VecAdd(float *A, float *B, float *C)
    {
        /* each thread of the block handles one element */
        int i = threadIdx.x;
        C[i] = A[i] + B[i];
    }

Programming Cuda (2/2)

In order to run a parallel program

1 Data must be copied to the GPU

2 The kernel must be invoked from the CPU code with special syntax

3 The data must be copied back to the CPU

The language used in Cuda kernels is limited

recursion is not supported

function pointers cannot be used

a few other restrictions are documented in the Cuda programming guide

Processing flow on CUDA

Threads, Blocks and Grids (1/2)

Threads

perform a single scalar operation per cycle

Thread blocks

Can be 1-, 2- or 3-dimensional

can communicate through shared memory

can synchronize through __syncthreads()

at most 512 threads per block

Thread blocks are executed as warps of 32 threads on a single SM

Grids

a kernel can be executed by multiple thread blocks

thread blocks are organized into a 1- or 2-dimensional grid, whose coordinates can be used for indexing the block
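
How a thread finds its own element from its block and thread coordinates can be sketched by emulating a 1-dimensional launch in plain C. This is a sequential stand-in, not Cuda code: the function names are hypothetical, and the parameters `block_idx`, `block_dim` and `thread_idx` play the roles of `blockIdx.x`, `blockDim.x` and `threadIdx.x`.

```c
#include <assert.h>

/* Body of the emulated kernel: one "thread" adds one pair of elements. */
void vec_add_thread(int block_idx, int block_dim, int thread_idx,
                    const float *a, const float *b, float *c, int n)
{
    int i = block_idx * block_dim + thread_idx;  /* global element index */
    if (i < n)                                   /* surplus threads do nothing */
        c[i] = a[i] + b[i];
}

/* Sequential stand-in for launching grid_dim blocks of block_dim threads
 * each. On a GPU these iterations would run in parallel. */
void emulate_launch(int grid_dim, int block_dim,
                    const float *a, const float *b, float *c, int n)
{
    assert(grid_dim > 0 && block_dim > 0);
    for (int block = 0; block < grid_dim; block++)
        for (int thread = 0; thread < block_dim; thread++)
            vec_add_thread(block, block_dim, thread, a, b, c, n);
}
```

For example, 3 blocks of 4 threads give 12 emulated threads; with 10 elements the last two threads fail the `i < n` test and write nothing.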

Example: matrix addition (1/2)

In normal C

    void addMatrix(float *a, float *b, float *c, int N)
    {
        int i, j, idx;

        for (i = 0; i < N; i++) {
            for (j = 0; j < N; j++) {
                idx = i + j*N;
                c[idx] = a[idx] + b[idx];
            }
        }
    }

    int main(void)
    {
        ...
        addMatrix(a, b, c, N);
    }

Example: matrix addition (2/2)

In Cuda

    __global__ void addMatrixG(float *a, float *b, float *c, int N)
    {
        int i = blockIdx.x*blockDim.x + threadIdx.x;
        int j = blockIdx.y*blockDim.y + threadIdx.y;
        int idx = i + j*N;

        /* threads outside the N x N matrix do nothing */
        if (i < N && j < N)
            c[idx] = a[idx] + b[idx];
    }

    int main(void)
    {
        dim3 dimBlock(blocksize, blocksize);
        /* round up so the grid covers N even when it is not a multiple of blocksize */
        dim3 dimGrid((N + dimBlock.x - 1)/dimBlock.x, (N + dimBlock.y - 1)/dimBlock.y);
        addMatrixG<<<dimGrid, dimBlock>>>(a, b, c, N);
    }
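
The guard `if (i < N && j < N)` only helps if the grid is large enough to cover all N elements in each dimension, which means rounding the block count up when N is not a multiple of the block size. A minimal C sketch of that rounding follows; the helper name `blocks_needed` is hypothetical and not part of the Cuda API.

```c
#include <assert.h>

/* Number of blocks of size block needed to cover n elements, rounded up.
 * Plain integer division n / block would round down instead. */
int blocks_needed(int n, int block)
{
    assert(n >= 0 && block > 0);
    return (n + block - 1) / block;
}
```

With N = 100 and a block size of 16, plain division gives 6 blocks, which cover only 96 elements per dimension. Rounding up gives 7 blocks, and the guard in the kernel discards the 12 surplus threads.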

Compiling Cuda

C for Cuda code is compiled with the nvcc compiler; its file extension is .cu

The host code is compiled to native x86

The device code is first compiled to Parallel Thread Execution (PTX) assembler and then to the cubin binary format

PTX is a pseudo-assembler with an arbitrarily large register set, almost entirely in SSA form

The NVidia graphics card driver loads the cubin code, and compiles and executes the PTX code

With the Cuda driver API it is possible to upload your own cubin code, not generated by nvcc, to the driver

Other features

Asynchronous execution

Memory hierarchy: Device Memory, Shared Memory, Page-Locked Host Memory

Error Handling

Multiple Devices

Debugger, Profiler and the Device emulation mode

Performance tuning

OpenCL

Open Computing Language was initially developed by Apple

Now developed by the Khronos Group; OpenCL 1.0 was published on December 8, 2008

Both AMD and NVidia support OpenCL 1.0 as of late 2009

Apple's implementation is based on the LLVM compiler framework

OpenCL is a fully open standard

The goal is to support GPGPUs, Cells, DSPs

The OpenCL language is based on C99

Terminology

An OpenCL host is the machine controlling one or more OpenCL devices

A device consists of one or more computing cores

A computing core consists of one or more processing elements

Processing elements execute code as SIMD or SPMD (Single Program, Multiple Data; ordinary OpenMP-style multitasking)

A program consists of one or more kernels

Computation domains can be 1-, 2- or 3-dimensional

Work-items execute kernels in parallel and are grouped into local workgroups

Memory Model

Like in Cuda, memory is hierarchical

Private memory is per work-item

Local memory is shared within a workgroup

Global/Constant memory is unsynchronized

Host Memory

Objects and Running OpenCL

Setup

1 Choose the device (GPU, CPU, Cell)

2 Create a context, which is a collection of devices

3 Create and submit work into a queue

Memory consists of

buffers, which can be read and written freely

images, which can be either read or written in a kernel, but not both, and can be accessed only through specific functions

Work is run asynchronously; synchronous access requires blocking API calls

OpenCL kernel language

Based on ISO C99

No function pointers, recursion, variable length arrays, bit fields

Syntax and other additions

work-items and workgroups

vector types and operations

synchronization

OpenCL, CUDA and Linux

OpenCL host code is compiled with ordinary GCC and linked against Cuda's libOpenCL

Kernels must be embedded into C strings or loaded from external files through the OpenCL API; in Cuda, kernels are precompiled and linked into the binary

The NVidia Cuda SDK (at least in 3.0) has lots of OpenCL examples

Kernel syntax is different!

Conclusion

GPGPUs provide one form of parallelism, namely SIMD

Multi-core CPUs provide MIMD parallelism

Will the future merge these two into a single platform?

NVidia Cuda is strongly a stream-processing SIMD implementation, whereas OpenCL is far more generic, supporting both SIMD and SPMD/MIMD

What kind of applications will benefit from GPGPU stream processing, and who?

Will it make Office applications run faster?

Will it benefit the average user? The average programmer? The average scientist?

At least it will benefit the average gamer

References

GPUs and CUDA
http://www.cis.temple.edu/~ingargio//cis307/readings/cuda.html

NVIDIA's GT200: Inside a Parallel Processor
http://www.realworldtech.com/page.cfm?ArticleID=RWT090808195242&p=1

NVidia Cuda Programming Guide 2.3

Building NVIDIA's GT200
http://www.anandtech.com/video/showdoc.aspx?i=3334&p=2

iXBT Labs: NVIDIA CUDA
http://ixbtlabs.com/articles3/video/cuda-1-p6.html

Lindholm et al. NVIDIA Tesla: A Unified Graphics and Computing Architecture. IEEE Micro, vol. 28, no. 2, March/April 2008

NVidia OpenCL JumpStart Guide

Khronos group's OpenCL overview

Possible topics (1/2)

How to optimize matrix multiplications in Cuda/OpenCL

Section 3.2.2 in the Cuda Programming Guide

Start from CPU multiplication and end up with a GPGPU version, benchmarking after each optimization step

Performance tuning and best practices

Cuda/OpenCL Best Practices Guide

Cuda and OpenCL

API comparison

performance evaluation

User experiences and example applications

NVidia's Cuda/OpenCL SDK

Other applications

Possible topics (2/2)

AMD

Hardware overview and comparison to NVidia

Overview of AMD's implementation of OpenCL

AMD currently leading in GPU performance

High-level languages and GPGPU

Python bindings for both Cuda and OpenCL

C++, FP languages, Matlab

GPGPU IDEs and development tools

Future GPGPU trends

Arrangements once more

Need two presentations for the next session

Next session on Jan 28th or Feb 4th?

For the rest: e-mail me (timo.lilja@tkk.fi) suitable times and topic suggestions ASAP

You can also suggest your own programming project topic by e-mailing me

Check the wiki pages; I will add instructions on how to use Cuda/OpenCL in the course server environment

http://wiki.tkk.fi/display/GPGPUK2010/Running+CUDA+and+OpenCL+in+course+server

Helmholtz Differential Equation (1/2)

An elliptic partial DE, in general form:

$$\nabla^2 \psi + k^2 \psi = 0$$

The height of the wave $V$ at coordinates $(x,y)$ accelerates towards the wave height of the adjacent places $(x-d,y)$, $(x+d,y)$, $(x,y-d)$, $(x,y+d)$:

$$D_t^2 V(x,y) = C \left( \frac{V(x-d,y) + V(x+d,y) + V(x,y-d) + V(x,y+d)}{4} - V(x,y) \right)$$

Add a little friction, $-F\,D_t V(x,y)$, and an impulse, $+I(t,x,y)$

Helmholtz Differential Equation (2/2)

The coefficient $C$ corresponds to the conductivity of the material; $C = 0 \Rightarrow$ the wave can't penetrate this material

The coefficient $F$ corresponds to the friction of the material; $F = 0 \Rightarrow$ no friction, the wave continues forever

$d$ is the distance between two points in the discretized space

We set $d = 1$ and adjust $C$ and $F$ correspondingly, and store $V(x,y)$ in a two-dimensional table

For a numerical solution to be at least somewhat accurate, $C < 1$ and the wavelength $> 4$
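
With d = 1 and the wave heights stored in a table, the acceleration of a single cell can be computed as in the C sketch below. The function name, the row-by-row table layout, the separate `vel` table for the time derivative and the restriction to interior cells are all assumptions made for illustration; the slides do not specify them.

```c
#include <assert.h>

/* Acceleration of the wave height at interior cell (x, y) of a
 * width-by-height table stored row by row, with d = 1:
 *   c * (mean of the four neighbours - V(x,y)) - f * velocity + impulse
 * v holds the heights V, vel holds the time derivative D_t V. */
float wave_acceleration(const float *v, const float *vel,
                        int width, int height, int x, int y,
                        float c, float f, float impulse)
{
    /* assumption: boundary cells are handled elsewhere */
    assert(x > 0 && x < width - 1 && y > 0 && y < height - 1);
    int idx = x + y * width;
    float mean = (v[idx - 1] + v[idx + 1] +
                  v[idx - width] + v[idx + width]) / 4.0f;
    return c * (mean - v[idx]) - f * vel[idx] + impulse;
}
```

Each cell reads only its four neighbours and itself, so all cells of one time step can be computed independently, which is what makes the problem a candidate for parallelization.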

Solving the DE Numerically (1/3)

Given a (set of) DE $y' = y'(t, y)$

Euler's algorithm:

$$y_{t+h} = y_t + h\,y'(t, y_t)$$

follow the tangent for a step $h$

Very inaccurate
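
A single Euler step for a scalar equation can be sketched in C as follows. The names `deriv_fn`, `euler_step` and the test equation `growth` (y' = y, with exact solution exp(t)) are hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Right-hand side y' = f(t, y) of a first-order DE. */
typedef double (*deriv_fn)(double t, double y);

/* One Euler step of size h: follow the tangent at (t, y). */
double euler_step(deriv_fn f, double t, double y, double h)
{
    assert(f != NULL);
    return y + h * f(t, y);
}

/* Hypothetical test equation y' = y, whose exact solution is exp(t). */
double growth(double t, double y)
{
    (void)t;  /* the right-hand side does not depend on t */
    return y;
}
```

Taking ten steps of h = 0.1 from y(0) = 1 with `growth` gives about 2.594, against the exact value e ≈ 2.718, which shows how quickly the error accumulates.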

Solving the DE Numerically (2/3)

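Runge-Kutta algorithm, one variation:

$$y_{t+h} = y_t + h\,\frac{\alpha + 2\beta + 2\gamma + \delta}{6}$$

where

$$\alpha = y'(t, y_t), \quad \beta = y'\left(t + \tfrac{h}{2},\, y_t + \tfrac{h}{2}\alpha\right), \quad \gamma = y'\left(t + \tfrac{h}{2},\, y_t + \tfrac{h}{2}\beta\right), \quad \delta = y'(t + h,\, y_t + h\gamma)$$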
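
The four slope evaluations translate directly into C. The sketch below assumes a scalar equation; the names `deriv_fn`, `rk4_step` and the test equation `growth` (y' = y) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Right-hand side y' = f(t, y) of a first-order DE. */
typedef double (*deriv_fn)(double t, double y);

/* One Runge-Kutta step of size h, using the four slopes alpha, beta,
 * gamma and delta weighted 1:2:2:1. */
double rk4_step(deriv_fn f, double t, double y, double h)
{
    assert(f != NULL);
    double alpha = f(t, y);
    double beta  = f(t + h / 2.0, y + h / 2.0 * alpha);
    double gamma = f(t + h / 2.0, y + h / 2.0 * beta);
    double delta = f(t + h, y + h * gamma);
    return y + h * (alpha + 2.0 * beta + 2.0 * gamma + delta) / 6.0;
}

/* Hypothetical test equation y' = y, whose exact solution is exp(t). */
double growth(double t, double y)
{
    (void)t;  /* the right-hand side does not depend on t */
    return y;
}
```

Ten steps of h = 0.1 from y(0) = 1 with `growth` give about 2.71828, which agrees with e to roughly five decimal places, at the cost of four evaluations of the right-hand side per step.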

Solving the DE Numerically (3/3)

If the DE is of a higher order, we can normalize it:

$$V'_t(x,y) = h\,V''_t(x,y)$$

$$V''_t(x,y) = C \left( \frac{V(x-d,y) + \ldots}{4} - V(x,y) \right) - F\,D_t V(x,y)$$
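
One common way to carry out such a normalization is to store the first derivative as a variable of its own, so that a second-order equation y'' = a(t, y, y') becomes the first-order pair y' = v, v' = a(t, y, v). The C sketch below applies an Euler step to that pair for a scalar equation. The names `state`, `accel_fn`, `euler_step_2nd` and the example `damped` (with coefficients 0.5 and 0.25 picked arbitrarily) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* State of a second-order DE rewritten as a first-order system:
 * the value y and its first derivative v = y' are stored together. */
struct state {
    double y;  /* value */
    double v;  /* first derivative */
};

/* Second derivative y'' = a(t, y, v). */
typedef double (*accel_fn)(double t, double y, double v);

/* One Euler step of the first-order system  y' = v,  v' = a(t, y, v). */
struct state euler_step_2nd(accel_fn a, double t, struct state s, double h)
{
    assert(a != NULL);
    struct state next;
    next.y = s.y + h * s.v;
    next.v = s.v + h * a(t, s.y, s.v);
    return next;
}

/* Hypothetical example: damped oscillator y'' = -0.5*y - 0.25*y'. */
double damped(double t, double y, double v)
{
    (void)t;  /* the acceleration does not depend on t */
    return -0.5 * y - 0.25 * v;
}
```

The same step function could be replaced by a Runge-Kutta step on the pair (y, v); the point is only that after normalization any first-order solver applies.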
