GPGPUs, CUDA and OpenCL
Timo Lilja
Course arrangements
Course code: T-106.5800 Seminar on Software Techniques
Credits: 3
Thursdays 15-16 at A232, lecture period III only
Mandatory attendance, but you can skip 1 session
Presentation
One-hour presentation
Two presentations per session
Programming project
Small programming project on a given topic, or on your own topic if you haven't received credits for it from some other course
The goal is to parallelize the given program
You can choose whether you want to use Cuda or OpenCL
We provide a development environment for this programming project. More information will be announced later, check the wiki page
Check the course wiki page
Contents
1 Introduction
2 NVidia: Hardware, Cuda
3 OpenCL
4 Conclusion
Why GPGPU?
GPGPU can offer a hundredfold increase in performance, a tenfold decrease in price and a threefold increase in power efficiency over a traditional CPU in many scientific computing efforts.
Business opportunities in various fields: medical technology,
What is a GPGPU?
Original application in computer graphics and games
General-Purpose Computing on Graphics Processing Units
Origins in programmable vertex and fragment shaders
The first GPGPU programs were written using normal graphics APIs in the late 90s
In the early 2000s the first programmable shaders appeared, followed by fully programmable GPU cores
Parallel Computing Architectures
According to Flynn's taxonomy, defined in 1966 by Michael J. Flynn.
Stream Processing
Programming paradigm related to SIMD
Given a stream of data and a series of operations, called kernel functions
The kernel function is applied to all elements of a stream concurrently
Memory is very hierarchical: local memory is easily accessible, global memory is much more expensive
Memory accesses are usually done in bulk, so memory is optimized for high bandwidth rather than low latency
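As a minimal sketch of the stream processing model in plain C: a kernel function is written for a single element, and a driver applies it to every element of the stream. The loop here runs sequentially, so it only illustrates the programming model, not parallel execution. The names kernel_fn, scale_and_offset and stream_apply are illustrative assumptions and do not come from any API.

```c
#include <stddef.h>

/* A kernel operates on one element and must not depend on the others. */
typedef float (*kernel_fn)(float);

/* Hypothetical example kernel: y = 2x + 1. */
float scale_and_offset(float x) {
    return 2.0f * x + 1.0f;
}

/* Apply the kernel to every element of the input stream.
 * The iterations are independent of each other, which is what
 * would allow a stream processor to run them concurrently. */
void stream_apply(kernel_fn k, const float *in, float *out, size_t n) {
    for (size_t i = 0; i < n; i++)
        out[i] = k(in[i]);
}
```

Calling stream_apply(scale_and_offset, in, out, n) fills out with the transformed stream; because no iteration reads another iteration's result, the order of execution does not matter.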
GPU vs. CPU
To support SIMD parallelism, ALUs must be abundant whereas control logic and data caches are not needed that much
NVIDIA GPU
Implementation of a stream processor system
Unified architecture
vertex, pixel and other shaders use the same GPU facilities
Highly hierarchical hardware
Streaming processor core (SP)
Streaming multiprocessor (SM)
Texture/processor cluster (TPC)
Streaming processor array (SPA)
Streaming Multiprocessor (SM)
8 Streaming Processor (SP) cores
scalar multiply-add (MAD) and ALU units
single-precision floats and ALU operations in 4 cycles
Fused Multiply-Add unit (FMAD)
IEEE 754R double-precision floating point
1 per multiprocessor: double-precision floats are slow
2 special function units (SFU)
provide transcendental functions
other complex functions: reciprocal
slow: latencies of 16-32 cycles or more
low-latency interconnect network between SPs and shared-memory banks
multi-threaded instruction fetch and issue unit
caches: instruction cache and read-only constant cache
16 KB read/write shared memory
Texture/Processor Cluster (TPC)
Geometry controller
maps the operations into Streaming Multiprocessors
Provides a 2-dimensional texture cache that exploits (x,y) spatial locality
Streaming multiprocessor (SM) controller
Older NVidia cards (G80) have 2 SMs per TPC, newer ones (GT200) have 3 SMs per TPC
Memory and other features
Memory is highly hierarchical and cached
Thread local memory
Shared memory, which is shared inside a Streaming Multiprocessor (SM)
Global memory, which is accessible to all threads
Raster operation processor (ROP)
Other units are mainly used for computer graphics
Texture unit
Hardware limitations
Branching can cause the program to run fully sequentially
Double-precision floating-point numbers are slow
Cuda
Compute Unified Device Architecture
NVidia's proprietary stream programming language
Available for Linux, Mac OS X and Windows
Current release 2.3, first release in 2007
C for Cuda
Compiled through PathScale's Open64 C compiler
Standard C with kernel extensions
Cuda driver API
Standard C API interface
kernels are explicitly loaded
Cuda toolkit
Programming Cuda (1/2)
Consider adding two vectors A and B and storing the result in C.
In ordinary C
void VecAdd(float *A, float *B, float *C) {
    int i;  /* N is the vector length, defined elsewhere */
    for (i = 0; i < N; i++)
        C[i] = A[i] + B[i];
}
In Cuda
__global__ void VecAdd(float *A, float *B, float *C) {
    int i = threadIdx.x;  /* each thread handles one element */
    C[i] = A[i] + B[i];
}
Programming Cuda (2/2)
In order to run a parallel program
1 Data must be copied to the GPU
2 The kernel must be invoked from the CPU code with special syntax
3 The data must be copied back to the CPU
The language used in Cuda kernels is limited
recursion is not supported
function pointers cannot be used
a few other restrictions are documented in the Cuda programming guide
Processing flow on CUDA
Threads, Blocks and Grids (1/2)
Threads perform a single scalar operation per cycle
Thread blocks
Can be 1-, 2- or 3-dimensional
can communicate through shared memory
can synchronize through __syncthreads()
at most 512 threads per block
Thread blocks are executed in warps of 32 threads on a single SM
Grids
a kernel can be executed by multiple thread blocks
thread blocks are organized into a 1- or 2-dimensional grid, whose coordinates can be used to index the blocks
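The block and grid indexing can be sketched in plain C by emulating a 1-dimensional grid sequentially. The struct fields mirror Cuda's built-in variables blockIdx.x, blockDim.x and threadIdx.x; thread_pos, global_index and emulate_grid are hypothetical names chosen for this illustration, and real hardware runs the threads concurrently in warps rather than in nested loops.

```c
/* Position of one thread, mirroring Cuda's built-in variables. */
typedef struct {
    int block_idx;   /* blockIdx.x  */
    int block_dim;   /* blockDim.x  */
    int thread_idx;  /* threadIdx.x */
} thread_pos;

/* Global element index handled by a thread. */
int global_index(thread_pos p) {
    return p.block_idx * p.block_dim + p.thread_idx;
}

/* Sequentially emulate a launch of grid_dim blocks with block_dim
 * threads each; every emulated thread adds one pair of elements. */
void emulate_grid(int grid_dim, int block_dim,
                  const float *a, const float *b, float *c, int n) {
    for (int blk = 0; blk < grid_dim; blk++) {
        for (int t = 0; t < block_dim; t++) {
            thread_pos p = { blk, block_dim, t };
            int i = global_index(p);
            if (i < n)  /* threads past the end of the data do nothing */
                c[i] = a[i] + b[i];
        }
    }
}
```

For example, thread 2 of block 1 with 4 threads per block handles element 1*4 + 2 = 6.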
Example: matrix addition (1/2)
In normal C
void addMatrix(float *a, float *b, float *c, int N) {
    int i, j, idx;
    for (i = 0; i < N; i++) {
        for (j = 0; j < N; j++) {
            idx = i + j*N;
            c[idx] = a[idx] + b[idx];
        }
    }
}
int main(void) {
    ...
    addMatrix(a, b, c, N);
}
Example: matrix addition (2/2)
In Cuda
__global__ void addMatrixG(float *a, float *b, float *c, int N) {
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    int j = blockIdx.y*blockDim.y + threadIdx.y;
    int idx = i + j*N;
    if (i < N && j < N)  /* skip threads outside the matrix */
        c[idx] = a[idx] + b[idx];
}
int main(void) {
    dim3 dimBlock(blocksize, blocksize);
    dim3 dimGrid(N/dimBlock.x, N/dimBlock.y);
    addMatrixG<<<dimGrid, dimBlock>>>(a, b, c, N);
}
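Note that N/dimBlock.x truncates, so when N is not a multiple of the block size the last rows and columns get no threads. A common remedy is to round the block count up and rely on the i < N && j < N guard in the kernel to mask the surplus threads. A small C sketch of the rounding follows; the helper name blocks_needed is an assumption made for illustration.

```c
/* Number of blocks of size block_size needed to cover n elements,
 * i.e. ceil(n / block_size) computed in integer arithmetic.
 * Assumes n >= 0 and block_size > 0. */
int blocks_needed(int n, int block_size) {
    return (n + block_size - 1) / block_size;
}
```

With N = 1000 and 16x16 blocks, N/16 gives 62 blocks per dimension, which covers only 992 elements, whereas the rounded-up count is 63.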
Compiling Cuda
C for Cuda code is compiled with the nvcc compiler; its file extension is .cu
The host code is compiled to native x86
The device code is first compiled to Parallel Thread Execution (PTX) assembler and then to the cubin binary format
PTX is a pseudo-assembler with an arbitrarily large register set, almost entirely in SSA form
The NVidia graphics card driver loads cubin code, and compiles and executes PTX code
With the Cuda C driver API it is possible to upload your own, non-nvcc-generated cubin code to the driver
Other features
Asynchronous execution
Memory hierarchy: Device Memory, Shared Memory, Page-Locked Host Memory
Error Handling
Multiple Devices
Debugger, Profiler and the Device emulation mode
Performance tuning
OpenCL
Open Computing Language was initially developed by Apple
Now developed by the Khronos Group; OpenCL 1.0 was published on December 8, 2008
Both AMD and NVidia support OpenCL 1.0 as of late 2009
Apple's implementation is based on the LLVM compiler framework
OpenCL is a fully open standard with
The goal is to support GPGPUs, Cells, DSPs
The OpenCL language is based on C99
Terminology
An OpenCL host is the machine controlling one or more OpenCL devices
A device consists of one or more computing cores
A computing core consists of one or more processing elements
Processing elements execute code as SIMD or SPMD (Single Program, Multiple Data; ordinary OpenMP-style multitasking)
A program consists of one or more kernels
Computation domains can be 1-, 2- or 3-dimensional
Work-items execute kernels in parallel and are grouped into local workgroups
Memory Model
Like in Cuda, memory is hierarchical
Private memory is per work-item
Local memory is shared within a workgroup
Global/constant memory is unsynchronized
Host Memory
Objects and Running OpenCL
Setup
1 Choose the device (GPU, CPU, Cell)
2 Create a context, which is a collection of devices
3 Create and submit work into a queue
Memory consists of
buffers, which can be accessed freely, read/write
images, which can be either read or written in a kernel, not both, and can be accessed only by specific functions
Work is run asynchronously; synchronous access requires blocking API calls
OpenCL kernel language
Based on ISO C99
No function pointers, recursion, variable-length arrays or bit fields
Syntax and other additions
work-items and workgroups
vector types and operations
synchronization
OpenCL, CUDA and Linux
Compiled with ordinary GCC and linked against Cuda's libOpenCL
Kernels must be embedded into C strings or loaded from external files through the OpenCL API; in Cuda, kernels are precompiled and linked into the binary
NVidia Cuda SDK (at least in 3.0) has lots of OpenCL examples
Kernel syntax is different!
Conclusion
GPGPUs provide one form of parallelism, namely SIMD
Multi-core CPUs provide MIMD parallelism
Will the future merge these two into a single platform?
NVidia Cuda is a strongly stream-processing SIMD implementation, whereas OpenCL is far more generic, supporting both SIMD and SPMD/MIMD
What kind of applications and who will benefit from GPGPU stream processing?
Will it make Office applications run faster?
Will it benefit the average user? The average programmer? The average scientist?
At least it will benefit the average gamer
References
GPUs and CUDA
http://www.cis.temple.edu/~ingargio/cis307/readings/cuda.html
NVIDIA's GT200: Inside a Parallel Processor
http://www.realworldtech.com/page.cfm?ArticleID=RWT090808195242&p=1
NVidia Cuda Programming Guide 2.3
Building NVIDIA's GT200
http://www.anandtech.com/video/showdoc.aspx?i=3334&p=2
iXBT Labs: NVIDIA CUDA
http://ixbtlabs.com/articles3/video/cuda-1-p6.html
Lindholm et al. NVIDIA Tesla: A Unified Graphics and Computing Architecture. IEEE Micro, vol. 28, no. 2, March/April 2008
NVidia OpenCL JumpStart Guide
Khronos Group's OpenCL overview
Possible topics (1/2)
How to optimize matrix multiplications in Cuda/OpenCL
Section 3.2.2 in the Cuda Programming Guide
Starting from CPU multiplication and ending up on the GPGPU, benchmarking after each optimization step
Performance tuning and best practices
Cuda/OpenCL Best Practices Guide
Cuda and OpenCL
API comparison
performance evaluation
User experiences and example applications
NVidia's Cuda/OpenCL SDK
Other applications
Possible topics (2/2)
AMD
Hardware overview and comparison to NVidia
Overview of AMD's implementation of OpenCL
AMD currently leading in GPU performance
High-level languages and GPGPU
Python bindings for both Cuda and OpenCL
C++, FP languages, Matlab
GPGPU IDEs and development tools
Future GPGPU trends
Arrangements once more
Need two presentations for the next session
Next session on Jan 28th or Feb 4th?
For the rest: e-mail me (timo.lilja@tkk.fi) suitable times and topic suggestions ASAP
You can also suggest your own programming project topic by e-mailing me
Check the wiki pages; I will add instructions on how to use Cuda/OpenCL in the course server environment
http://wiki.tkk.fi/display/GPGPUK2010/Running+CUDA+and+OpenCL+in+course+server
Helmholtz Differential Equation (1/2)
An elliptic partial DE, in general form: $\nabla^2 \psi + k^2 \psi = 0$
Height of the wave V at coordinates (x,y) accelerates towards the wave height of adjacent places (x-d,y), (x+d,y), (x,y-d), (x,y+d)
\[ D_t^2 V(x,y) = C \left( \frac{V(x-d,y) + V(x+d,y) + V(x,y-d) + V(x,y+d)}{4} - V(x,y) \right) \]
Add a little friction, $-F\,D_t V(x,y)$, and an impulse, $+I(t,x,y)$
Helmholtz Differential Equation (2/2)
The coefficient C corresponds to the conductivity of the material; C = 0 ⇒ the wave can't penetrate this material
The coefficient F corresponds to the friction of the material; F = 0 ⇒ no friction, the wave continues forever
d is the distance between two points in the discretized space
We set d = 1 and adjust C and F correspondingly, and store V(x,y) in a two-dimensional table
For a numerical solution to be at least somewhat accurate, C < 1 and the wavelength > 4
Solving the DE Numerically (1/3)
Given a (set of) DE $y' = y'(t, y)$
Euler's algorithm:
\[ y_{t+h} = y_t + h\,y'(t, y_t) \]
follow the tangent for a step h
Very inaccurate
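Euler's algorithm can be sketched in C as follows. The function-pointer type deriv_fn, the helper names and the test problem y' = y are assumptions made for illustration only.

```c
/* Right-hand side of the DE: returns y'(t, y). */
typedef double (*deriv_fn)(double t, double y);

/* One Euler step: follow the tangent at (t, y) for a step h. */
double euler_step(deriv_fn f, double t, double y, double h) {
    return y + h * f(t, y);
}

/* Take `steps` Euler steps of size h, starting from (t_start, y_start). */
double euler_integrate(deriv_fn f, double t_start, double y_start,
                       double h, int steps) {
    double t = t_start, y = y_start;
    for (int i = 0; i < steps; i++) {
        y = euler_step(f, t, y, h);
        t += h;
    }
    return y;
}

/* Hypothetical test problem: y' = y, exact solution y(t) = y(0) * e^t. */
double growth(double t, double y) {
    (void)t;  /* this right-hand side does not depend on t */
    return y;
}
```

Integrating y' = y from y(0) = 1 to t = 1 with two steps of h = 0.5 gives 2.25, while the exact value is e ≈ 2.718, which shows how inaccurate the method is with large steps.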
Solving the DE Numerically (2/3)
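Runge-Kutta algorithm, one variation:
\[ y_{t+h} = y_t + h\,\frac{\alpha + 2\beta + 2\gamma + \delta}{6} \]
where
\[ \alpha = y'(t, y_t), \quad \beta = y'\!\left(t + \tfrac{h}{2},\; y_t + \tfrac{h}{2}\alpha\right), \quad \gamma = y'\!\left(t + \tfrac{h}{2},\; y_t + \tfrac{h}{2}\beta\right), \quad \delta = y'(t + h,\; y_t + h\gamma) \]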
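A minimal C sketch of one step of this Runge-Kutta variation, with the four slopes named after α, β, γ and δ. The type deriv_fn, the function names and the test problem y' = y are illustrative assumptions.

```c
/* Right-hand side of the DE: returns y'(t, y). */
typedef double (*deriv_fn)(double t, double y);

/* One fourth-order Runge-Kutta step of size h from (t, y). */
double rk4_step(deriv_fn f, double t, double y, double h) {
    double s_alpha = f(t, y);                              /* slope at the start   */
    double s_beta  = f(t + h / 2.0, y + h / 2.0 * s_alpha); /* first midpoint slope  */
    double s_gamma = f(t + h / 2.0, y + h / 2.0 * s_beta);  /* second midpoint slope */
    double s_delta = f(t + h, y + h * s_gamma);            /* slope at the end     */
    return y + h * (s_alpha + 2.0 * s_beta + 2.0 * s_gamma + s_delta) / 6.0;
}

/* Hypothetical test problem: y' = y, exact solution y(t) = y(0) * e^t. */
double growth(double t, double y) {
    (void)t;  /* this right-hand side does not depend on t */
    return y;
}
```

For y' = y with y(0) = 1, a single step of h = 1 already gives about 2.7083, compared to the exact e ≈ 2.7183.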
Solving the DE Numerically (3/3)
If the DE is of a higher order, we can normalize it into a first-order system:
\[ V'_t(x,y) = h\,V''_t(x,y) \]
\[ V''_t(x,y) = C \left( \frac{V(x-d,y) + \dots}{4} - V(x,y) \right) - F\,D_t V(x,y) \]
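As a rough sketch of how such a system might be stepped on a grid in C, the velocity D_t V can be kept in a second table u alongside the height table v. Several things here are assumptions rather than part of the slides: the function name wave_step, the row-major layout, per-cell tables for C and F, border cells that are held fixed, the omission of the impulse term, and the use of the semi-implicit Euler variant (heights are advanced with the already updated velocities) instead of plain Euler.

```c
/* One time step of the discretized wave equation with d = 1.
 * v      : wave height V(x,y), row-major, width*height cells
 * u      : velocity D_t V(x,y), same layout
 * c, f   : per-cell conductivity C and friction F
 * dt     : time step
 * Assumptions: border cells stay fixed, no impulse term,
 * semi-implicit Euler (velocity first, then height). */
void wave_step(float *v, float *u, const float *c, const float *f,
               int width, int height, float dt) {
    /* Pass 1: update velocities from the current heights.
     * v is only read here, so all cells are independent. */
    for (int y = 1; y < height - 1; y++) {
        for (int x = 1; x < width - 1; x++) {
            int i = x + y * width;
            float avg = (v[i - 1] + v[i + 1] + v[i - width] + v[i + width]) / 4.0f;
            float acc = c[i] * (avg - v[i]) - f[i] * u[i];
            u[i] += dt * acc;
        }
    }
    /* Pass 2: update heights from the new velocities. */
    for (int y = 1; y < height - 1; y++) {
        for (int x = 1; x < width - 1; x++) {
            int i = x + y * width;
            v[i] += dt * u[i];
        }
    }
}
```

Within each pass every cell reads only values that the pass does not modify for other cells, so each pass maps naturally onto one kernel launch with one thread per cell.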