Practical Introduction to OpenMP
http://tinyurl.com/cq-intro-openmp-20151006
By: Bart Oldeman, Calcul Québec – McGill HPC [email protected], [email protected]
Outline of the workshop
Theoretical / practical introduction
• Parallelizing your serial code
• What is OpenMP? Why do we need it?
• How do we run OpenMP codes (on the Guillimin cluster)?
• How to program with OpenMP?
  • Program structure
  • Basic functions
  • Examples
Practical exercises on Guillimin
• Log in, set up the environment, launch OpenMP code
• Analyzing and running examples
Exercise 1: Log in to Guillimin, setting up the environment
1) Log in to Guillimin:
ssh class##@guillimin.hpc.mcgill.ca
2) Check for loaded software modules:
$ module list
3) See all available modules:
$ module av
4) Load necessary modules:
$ module add ifort icc
Parallelizing your serial code
Models for parallel computing
(as an ordinary user sees it ...)
• Implicit parallelization: minimum work for you
  • Threaded libraries (MKL, ACML, GOTO, etc.)
  • Compiler directives (OpenMP)
  • Good for desktops and shared-memory machines
• Explicit parallelization: work is required!
  • You specify what should be done on which CPU
  • Low-level option for shared-memory machines: POSIX Threads (pthreads)
OpenMP — Shared Memory API
• Open Multi-Processing: An Application Program Interface for multi-threaded programs in a
shared-memory environment.
• http://www.openmp.org
• Consists of
  • Compiler directives
  • Runtime library routines
  • Environment variables
• Allows for relatively simple incremental parallelization.
• Not distributed, but can be combined with MPI (hybrid: see Advanced MPI workshop).
Shared memory approach
[Figure: a block of shared memory accessed by Thread 0 and Thread 1, each of which also has its own private memory]
• Most memory is shared by all threads.
• Each thread also has some private memory: variables explicitly declared private, and local variables in functions and subroutines.
OpenMP: fork/join model
[Figure: fork/join timeline. The master thread runs the serial regions; at each fork, worker threads join it for a parallel region, and at each join execution returns to a serial region.]
Master + Workers = Team
What is OpenMP for a user?
• OpenMP is NOT a language!
• OpenMP is NOT a compiler or specific product.
• OpenMP is a de facto industry standard, a specification for an Application Program Interface (API).
• You use its directives, routines, and environment variables.
• You compile and link your code with specific flags.
• History: version 1.0 (1997), 2.5 (2005), 3.0 (2008), 3.1 (2011), 4.0 (2013).
• Different implementations:
  • GCC (4.2+), Intel, PGI, Visual C++, Solaris Studio, Clang (3.7+), ...
Basic features of OpenMP program
• Include basic definitions (#include <omp.h>, INCLUDE 'omp_lib.h', or USE omp_lib).
• Parallel region declared by a directive of the form #pragma omp parallel (C) or !$OMP PARALLEL (Fortran), optionally declaring which variables are private.
• Optional: for code that should only be compiled when OpenMP is enabled, use the _OPENMP preprocessor symbol (C) or the !$ prefix (Fortran).
Example: “Hello from N cores”
Fortran:
PROGRAM hello
!$ USE omp_lib
  IMPLICIT NONE
  INTEGER rank, size
  rank = 0
  size = 1
!$OMP PARALLEL PRIVATE(rank, size)
!$ size = omp_get_num_threads()
!$ rank = omp_get_thread_num()
  WRITE(*,*) 'Hello from processor ',&
       rank, ' of ', size
!$OMP END PARALLEL
END PROGRAM hello

C:
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#endif

int main (int argc, char * argv[]) {
    int rank = 0, size = 1;
#ifdef _OPENMP
#pragma omp parallel private(rank, size)
#endif
    {
#ifdef _OPENMP
        rank = omp_get_thread_num();
        size = omp_get_num_threads();
#endif
        printf("Hello from processor %d" " of %d\n", rank, size );
    }
    return 0;
}
POSIX Threads “Hello from N cores”
// pthreads.c
#include <stdio.h>
#include <pthread.h>
#define SIZE 4

void *hello(void *arg) {
    printf("Hello from processor %d of %d\n", *(int *)arg, SIZE);
    return NULL;
}

int main(int argc, char* argv[]) {
    int i, p[SIZE];
    pthread_t threads[SIZE];
    for (i = 1; i < SIZE; i++) { /* Fork threads */
        p[i] = i;
        pthread_create(&threads[i], NULL, hello, &p[i]);
    }
    p[0] = 0;
    hello(&p[0]); /* thread 0 greets as well */
    for (i = 1; i < SIZE; i++) /* Join threads. */
        pthread_join(threads[i], NULL);
    return 0;
}
Compiling your OpenMP code
• How to compile is NOT defined by the standard.
• A special compilation flag must be used.
• On the Guillimin cluster:
• module add gcc
• gcc -fopenmp hello.c -o hello
• gfortran -fopenmp hello.f90 -o hello
• module add ifort icc
• icc -openmp hello.c -o hello
• ifort -openmp hello.f90 -o hello
• module add pgi
• pgcc -mp hello.c -o hello
Running your OpenMP code
• Important: environment variable OMP_NUM_THREADS.
• export OMP_NUM_THREADS=4
• ./hello
Hello from processor 2 of 4
Hello from processor 0 of 4
Hello from processor 3 of 4
Hello from processor 1 of 4
• unset OMP_NUM_THREADS
• pgcc -mp hello.c -o hello
• ./hello
Hello from processor 0 of 1
• gcc -fopenmp hello.c -o hello
• ./hello
Running your OpenMP code
• On your laptop or desktop, just compile and run your code as above.
• On the Guillimin cluster, use the batch system to submit non-trivial OpenMP jobs! Example: hello.pbs:
#!/bin/bash
#PBS -l nodes=1:ppn=2
#PBS -l walltime=00:05:00
#PBS -V
#PBS -N hello
cd $PBS_O_WORKDIR
export OMP_NUM_THREADS=2
./hello > hello.out
Submit your job:
$ qsub hello.pbs
Exercise 2: “Hello”, compilation
1) Copy all files to your home directory:
$ cp -a /software/workshop/cq-formation-openmp/* ~/
2) Compile your code:
$ ifort -openmp hello.f90 -o hello
$ icc -openmp hello.c -o hello
Exercise 2: “Hello”, job submission
3) View the file “hello.pbs”:
#!/bin/bash
#PBS -l nodes=1:ppn=2
#PBS -l walltime=00:05:00
#PBS -V
#PBS -N hello
cd $PBS_O_WORKDIR
export OMP_NUM_THREADS=2
./hello > hello.out
4) Submit your job:
$ qsub hello.pbs
5) Check the job status:
$ qstat -u $USER
$ showq -u $USER
Exercise 2: “Hello”, compile and run
Alternatively, using interactive qsub, or on your own Mac/Linux/Cygwin/MSYS computer:
1) Interactive login:
$ qsub -I -l nodes=1:ppn=2,walltime=7:00:00
or, on your own computer, clone the workshop files into a new directory:
yourlaptop> git clone -b mcgill \
    https://github.com/calculquebec/cq-formation-intro-openmp.git
yourlaptop> cd cq-formation-intro-openmp
2) Compile your code:
> gfortran -fopenmp hello.f90 -o hello
> gcc -fopenmp hello.c -o hello
3) Run your code:
> # can use any value here; default: number of cores
> export OMP_NUM_THREADS=2
> ./hello
OpenMP directives
Format: sentinel directive [clause, ...] where sentinel is #pragma omp (C) or !$OMP (Fortran). Examples:
• #pragma omp parallel (C), !$OMP PARALLEL, !$OMP END PARALLEL (Fortran): Parallel region construct.
• #pragma omp for: A worksharing construct that makes a loop parallel (!$OMP DO in Fortran).
• #pragma omp parallel for: A combined construct: defines a parallel region that only contains the loop.
• #pragma omp barrier: A synchronization directive: all threads wait for each other here.
OpenMP clauses
Examples:
• Data scope: private, shared, and default.
!$omp parallel private(i) shared(x)
The variable i is private to the thread but the variable x is shared with all other threads.
Default: all variables are shared, except loop variables (C: only the variable of the parallelized outer loop; Fortran: all loop variables) and variables declared inside the block.
• !$omp parallel default(shared) private(i)
All variables are shared except i.
• !$omp parallel default(none) private(i)
Using default(none) requires listing each variable or else the compiler complains (helps debugging!).
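The same clauses exist in C. The sketch below is an illustration only: the function name scale_vector and its parameters are made up for this example. It uses default(none), so every variable referenced in the region has to be listed explicitly; the loop variable i is private automatically because it belongs to the parallelized loop.

```c
/* Multiply every element of x by factor.
   With default(none), each variable used inside the region must appear
   in a data-sharing clause, otherwise compilation fails. */
void scale_vector(double *x, int n, double factor)
{
    int i; /* loop variable of the parallel loop: private automatically */
#pragma omp parallel for default(none) shared(x, n, factor)
    for (i = 0; i < n; i++)
        x[i] = factor * x[i];
}
```

Removing factor from the shared list should make the compiler report the missing variable, which is the debugging aid mentioned above.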
OpenMP important library routines
int omp_get_max_threads(void);
Get the maximum number of threads that a parallel region started here could use.
void omp_set_num_threads(int);
Set number of threads for next parallel region.
int omp_get_thread_num(void);
Get current thread number in parallel region.
int omp_get_num_threads(void);
Get number of threads in parallel region.
double omp_get_wtime(void);
Portable wall clock timing routine.
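A minimal sketch of how two of these routines fit together, assuming a hypothetical helper named observed_team_size (not part of OpenMP): it requests a number of threads and reports how many the next parallel region actually received. The runtime may grant fewer threads than requested, and a build without OpenMP always reports 1.

```c
#ifdef _OPENMP
#include <omp.h>
#endif

/* Request `requested` threads, then return the size of the team that
   the next parallel region really gets (1 in a serial build). */
int observed_team_size(int requested)
{
    int observed = 1;
#ifdef _OPENMP
    omp_set_num_threads(requested);
#pragma omp parallel
    {
        /* one thread records the team size; the others wait at the
           implicit barrier that ends the single construct */
#pragma omp single
        observed = omp_get_num_threads();
    }
#else
    (void)requested; /* unused in a serial build */
#endif
    return observed;
}
```

Call it from your own main and compile with the OpenMP flag of your compiler (for example gcc -fopenmp).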
OpenMP main environment variables
OMP_NUM_THREADS
Sets the maximum number of threads used (default: compiler dependent, but often the number of available (hyper)threads).
OMP_SCHEDULE
Sets the schedule used by loops declared with schedule(runtime).
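As a sketch of how OMP_SCHEDULE is picked up, the loop below is declared with schedule(runtime), so the schedule is read from the environment when the program runs. The function name fill_squares is hypothetical and chosen only for this illustration.

```c
/* Store i*i in out[i]. The loop schedule is not fixed at compile time:
   schedule(runtime) takes it from the OMP_SCHEDULE environment variable. */
void fill_squares(long *out, int n)
{
    int i;
#pragma omp parallel for schedule(runtime)
    for (i = 0; i < n; i++)
        out[i] = (long)i * i;
}
```

For example, setting export OMP_SCHEDULE="dynamic,4" before running selects dynamic scheduling with chunks of 4 iterations, without recompiling. The result is the same for any schedule.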
parallel for
• Example:
void addvectors(const int *a, const int *b, int *c, int n) {
    int i;
#pragma omp parallel for
    for (i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}
• Here i is automatically made private because it is the loop variable. All other variables are shared.
• The loop is split between threads: for example, with two threads and n=10, thread 0 does index 0 to 4 and thread 1 does index 5 to 9.
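To see the split for yourself, one option is to record which thread handles each index. This is only a sketch: record_owners is a made-up helper, and schedule(static) is written explicitly because the default schedule is left to the implementation.

```c
#ifdef _OPENMP
#include <omp.h>
#endif

/* owner[i] receives the number of the thread that executed iteration i
   (always 0 when compiled without OpenMP). */
void record_owners(int *owner, int n)
{
    int i;
#pragma omp parallel for schedule(static)
    for (i = 0; i < n; i++) {
#ifdef _OPENMP
        owner[i] = omp_get_thread_num();
#else
        owner[i] = 0;
#endif
    }
}
```

With a static schedule and no chunk size, each thread gets at most one contiguous block, so the recorded thread numbers are expected to form non-decreasing runs such as 0 0 0 0 0 1 1 1 1 1.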
parallel for
• Eliminate overhead from fork and join (in practice: synchronization) by using just one parallel region for two vector additions:
void addvectors(const int *a, const int *b, int *c, int n) {
int i;
#pragma omp for
for (i = 0; i < n; i++) c[i] = a[i] + b[i]; }
...
#pragma omp parallel
{
    addvectors(a, b, c, n);
    addvectors(b, c, d, n);
}
parallel for, nowait and barrier
void addvectors(const int *a,
const int *b, int *c, int n) {
int i;
#pragma omp for nowait
for (i = 0; i < n; i++) c[i] = a[i] + b[i]; }
...
#pragma omp parallel
{
    addvectors(a, b, c, n);
#pragma omp barrier
    addvectors(b, c, d, n/2);
    addvectors(e, f, g, n);
}
• omp for by default implies a barrier where all threads wait at the end of the loop.
• Eliminate synchronization overhead using the nowait clause.
• But an explicit barrier is then needed wherever a later loop depends on an earlier one (here the second call reads c, which the first call writes).
parallel for: if clause
• The if clause allows conditional parallel regions: if
n is too small, the overhead is not worth it:
void addvectors(const int *a, const int *b, int *c, int n) {
int i;
#pragma omp for
for (i = 0; i < n; i++) c[i] = a[i] + b[i]; }
...
#pragma omp parallel if (n > 10000)
{
    addvectors(a, b, c, n);
}
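The effect of the if clause can be checked with a small probe. The sketch below uses a hypothetical function team_size_for that returns the size of the team created for a given n; when the condition is false the region runs with a single thread.

```c
#ifdef _OPENMP
#include <omp.h>
#endif

/* Return the number of threads used by a region guarded by
   if (n > 10000). For small n the region is executed by one thread. */
int team_size_for(int n)
{
    int size = 1;
#ifdef _OPENMP
#pragma omp parallel if (n > 10000)
    {
#pragma omp single
        size = omp_get_num_threads();
    }
#else
    (void)n; /* unused in a serial build */
#endif
    return size;
}
```

team_size_for(100) returns 1, while team_size_for(20000) returns whatever team size the runtime grants.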
parallel for: scheduling
• schedule(static, 10000) allocates chunks of 10000 loop iterations to every thread:
void addvectors(const int *a, const int *b, int *c, int n) {
int i;
#pragma omp for schedule(static, 10000)
for (i = 0; i < n; i++) c[i] = a[i] + b[i]; }
• Use dynamic instead of static to assign chunks to threads dynamically: when a thread finishes its chunk, it is assigned the next one. Useful when the amount of work differs between iterations.
• guided instead of dynamic: chunk sizes decrease as less work is left to do.
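A small sketch of a loop with unequal work per iteration, where dynamic scheduling can help. The function name triangular_numbers and the chunk size of 4 are illustrative assumptions: iteration i performs i+1 additions, so later iterations are more expensive than earlier ones.

```c
/* out[i] = 0 + 1 + ... + i. The inner loop grows with i, so the cost
   per iteration is uneven; dynamic scheduling hands out chunks of 4
   iterations to whichever thread becomes free. */
void triangular_numbers(long *out, int n)
{
    int i;
#pragma omp parallel for schedule(dynamic, 4)
    for (i = 0; i < n; i++) {
        long total = 0; /* declared inside the loop body: private */
        for (int j = 0; j <= i; j++)
            total += j;
        out[i] = total;
    }
}
```

Replacing dynamic with guided or static in the clause is an easy way to compare timings for the same loop.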
Nested loops
For perfectly nested rectangular loops we can parallelize multiple loops in the nest with the collapse clause:
• Argument is number of loops to collapse.
• Will form a single loop of length NxM and then parallelize and schedule that.
• Useful if N is close to the number of threads so parallelizing the outer loop may not have good load balance
• More efficient alternative to (advanced) nested parallelism
• Example:
#pragma omp parallel for collapse(2)
for (int i=0; i<N; i++) {
    for (int j=0; j<M; j++) {
        ...
    }
}
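A complete function using collapse might look like the following sketch. The name multiplication_table and the row-major layout of the flat array are assumptions made for this example.

```c
/* Fill an n-by-m table stored row by row: table[i*m + j] = i * j.
   collapse(2) merges both loops into one iteration space of n*m
   iterations before distributing it over the threads. */
void multiplication_table(int *table, int n, int m)
{
#pragma omp parallel for collapse(2)
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < m; j++) {
            table[i * m + j] = i * j;
        }
    }
}
```

With n = 3 and 8 threads, parallelizing only the outer loop would leave most threads idle, whereas the collapsed loop offers 3*m iterations to share.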
parallel: manual scheduling (SPMD)
• SPMD=Single Program Multiple Data, like in MPI.
void addvectors(const int *a, const int *b, int *c, int n) {
    int i;
    for (i = 0; i < n; i++) c[i] = a[i] + b[i];
}
....
int tid, nthreads, low, high;
#pragma omp parallel default(none) private(tid, nthreads, \
        low, high) shared(a, b, c, n)
{
    tid = omp_get_thread_num();
    nthreads = omp_get_num_threads();
    low = (n * tid) / nthreads;
    high = (n * (tid + 1)) / nthreads;
    addvectors(&a[low], &b[low], &c[low], high-low);
}
SPMD vs. worksharing
• Worksharing (omp for/omp do) is easiest to implement.
• SPMD (do work based on thread ID) may give better performance but is harder to implement.
• SPMD like in MPI:
  • Instead of using large shared arrays, use smaller arrays private to threads: mark all (non-read-only) global and persistent (static/SAVE) variables threadprivate, and communicate using buffers and barriers.
  • Fewer cache misses using more private data may give better performance.
  • More advanced topic: see Advanced OpenMP workshop in 2016.
sections (SPMD construct)
• Example:
#pragma omp parallel sections
{
#pragma omp section
    addvectors(a, b, c, n);
#pragma omp section
    printf("hello world!\n");
#pragma omp section
    printf("I may or may not be the third thread\n");
}
• The sections are individual code blocks that are distributed over the threads.
• More flexible alternative (OpenMP 3.0): omp task, useful when traversing dynamic data structures (lists, trees, etc.).
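As a rough sketch of the omp task alternative, the code below walks a linked list and creates one task per node. The struct node type and the function square_all are hypothetical and exist only for this illustration; a compiler supporting OpenMP 3.0 or later is assumed.

```c
#include <stddef.h>

/* Hypothetical singly linked list node, for illustration only. */
struct node {
    int value;
    struct node *next;
};

/* Square the value stored in every node. One thread walks the list and
   creates a task per node; any thread of the team may execute a task. */
void square_all(struct node *head)
{
#pragma omp parallel
    {
#pragma omp single
        {
            for (struct node *p = head; p != NULL; p = p->next) {
                /* firstprivate: each task keeps its own copy of p */
#pragma omp task firstprivate(p)
                p->value = p->value * p->value;
            }
        }
        /* all tasks are finished at the barrier ending the single */
    }
}
```

Unlike sections, the number of work items does not have to be known when the code is written: it follows the length of the list at run time.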
workshare (Fortran)
• Example:
integer a(10000), b(10000), c(10000), d(10000)
!$OMP PARALLEL
!$OMP WORKSHARE
c(:) = a(:) + b(:)
!$OMP END WORKSHARE NOWAIT
!$OMP WORKSHARE
d(:) = a(:)
!$OMP END WORKSHARE NOWAIT
!$OMP END PARALLEL
• Array assignments in Fortran are distributed among threads like loops.
Exercise 3: Modifying “Hello”
Ask each CPU to do its own computation by inserting code as follows:
IF (rank == 0) THEN
  a=SQRT(2.0)
  b=0.0
  WRITE(*,*) 'a,b=',a,b,'on proc',rank
END IF
IF (rank == 1) THEN
  a=0.0
  b=SQRT(3.0)
  WRITE(*,*) 'a,b=',a,b,'on proc',rank
END IF
Exercise 4: Modifying “Hello”
Do (almost) the same thing, now using omp sections
!$OMP SECTIONS
!$OMP SECTION
  a=SQRT(2.0)
  b=0.0
  WRITE(*,*) 'a,b=',a,b,'on proc',rank
!$OMP SECTION
  a=0.0
  b=SQRT(3.0)
  WRITE(*,*) 'a,b=',a,b,'on proc',rank
!$OMP END SECTIONS
Race conditions (Data races)
• Example: innerprod.c, innerprod.f90
ip = 0;
#pragma omp parallel for private(i) shared(ip,a,b)
for (i = 0; i < N; i++) ip += a[i] * b[i];
• Problem: could be internally run as (pseudocode; register stands for a CPU register and is a reserved word in real C):
for (i = 0; i < N; i++) {
    int register = ip;
    register += a[i] * b[i];
    ip = register;
}
• Several threads may load ip into their own CPU registers at the same time, add to it, and then overwrite ip, losing the additions made by the other threads!
Race conditions
• Example: a={1,2}, b={3,4}, inner product 1*3+2*4=11, two threads.
thread  statement                   ip  register0  register1
tid=0   int register = ip;          0   0          unknown
tid=1   int register = ip;          0   0          0
tid=0   register += a[0] * b[0];    0   3          0
tid=1   register += a[1] * b[1];    0   3          8
tid=0   ip = register;              3   3          8
tid=1   ip = register;              8   3          8
• Wrong result: ip=8 instead of 11.
Solution: critical section
• Example:
ip = 0;
#pragma omp parallel for private(i) shared(ip,a,b)
for (i = 0; i < N; i++)
#pragma omp critical
    ip += a[i] * b[i];
• critical makes sure only one thread can run the (compound) statement at a time.
Solution: atomic section
• Example:
ip = 0;
#pragma omp parallel for private(i) shared(ip,a,b)
for (i = 0; i < N; i++)
#pragma omp atomic
ip += a[i] * b[i];
• atomic is like critical but applies only to a single update of one specific memory location; it is usually cheaper because it can map to hardware atomic instructions.
Solution: local summing
• Example:
ip = 0;
#pragma omp parallel private(i,localip) shared(ip,a,b)
{
    localip = 0;
#pragma omp for nowait
    for (i = 0; i < N; i++) localip += a[i] * b[i];
#pragma omp atomic
    ip += localip;
}
• Still needs atomic, but it now runs only once per thread instead of N times, greatly reducing overhead and improving performance.
Solution: local summing using array
• Example:
ip = 0;
int *localips = malloc(omp_get_max_threads()*sizeof(*localips));
#pragma omp parallel private(i,tid) shared(ip,a,b,localips)
{
    tid = omp_get_thread_num();
    localips[tid] = 0;
#pragma omp for
    for (i = 0; i < N; i++)
        localips[tid] += a[i] * b[i];
#pragma omp single
    for (i = 0; i < omp_get_num_threads(); i++)
        ip += localips[i];
}
• One single thread does the final summing.
• Could also use omp master here, to force the master thread to do the final summing. omp master works like if (tid==0).
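• master and single are especially useful for I/O.
• Problem here: false sharing. The threads' entries in localips lie in the same cache line, so every update by one thread invalidates that line for the others.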
Solution: local summing using array
• Example:
ip = 0;
int *localips = malloc(omp_get_max_threads()*sizeof(*localips));
#pragma omp parallel private(i,tid,localip) shared(ip,a,b,localips)
{
    tid = omp_get_thread_num();
    localip = 0;
#pragma omp for nowait
    for (i = 0; i < N; i++)
        localip += a[i] * b[i];
    localips[tid] = localip;
#pragma omp barrier
#pragma omp single
    for (i = 0; i < omp_get_num_threads(); i++)
        ip += localips[i];
}
• Minimizes false sharing, moving it outside the loop.
• Note the barrier!
Solution: local summing using array
• Could also use padding in localips, but need to know the cache line size (e.g. 64 bytes):
#define PAD 16 /* 16 ints of 4 bytes = one 64-byte cache line */
ip = 0;
int (*localips)[PAD] = malloc(omp_get_max_threads()*sizeof(*localips));
#pragma omp parallel private(i,tid) shared(ip,a,b,localips)
{
    tid = omp_get_thread_num();
    localips[tid][0] = 0;
#pragma omp for
    for (i = 0; i < N; i++)
        localips[tid][0] += a[i] * b[i];
#pragma omp single
    for (i = 0; i < omp_get_num_threads(); i++)
        ip += localips[i][0];
}
Easiest solution: use reduction
• Example:
ip = 0;
#pragma omp parallel for reduction(+:ip)
for (i = 0; i < N; i++) ip += a[i] * b[i];
• Reduction is the most straightforward solution.
• Caveat: only works on scalars, and you cannot control the order of summation, so rounding errors in floating point calculations may differ from run to run. For vectors use one of the other methods, Fortran, or OpenMP 4.0 user defined reductions.
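Wrapped in a function, the reduction version of the inner product could look like the sketch below. The function name inner_product and the long accumulator are choices made for this example, not taken from innerprod.c.

```c
/* Inner product of two integer vectors of length n.
   reduction(+:ip) gives each thread a private copy of ip, initialised
   to 0, and adds all copies into the original when the loop ends. */
long inner_product(const int *a, const int *b, int n)
{
    long ip = 0;
    int i;
#pragma omp parallel for reduction(+:ip)
    for (i = 0; i < n; i++)
        ip += (long)a[i] * b[i];
    return ip;
}
```

For a={1,2} and b={3,4} this returns 11 for any number of threads, since integer addition does not depend on the summation order.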
Exercise 5: Compilation error
See omp_bug.c or omp_bug.f90 (courtesy of Blaise Barney, Lawrence Livermore National Laboratory) and find the compilation error in:
#pragma omp parallel for \
    shared(a,b,c,chunk) \
    private(i,tid) \
    schedule(static,chunk)
{
    tid = omp_get_thread_num();
    for (i=0; i < N; i++) {
        c[i] = a[i] + b[i];
        printf("tid= %d i= %d c[i]= %f\n", tid, i, c[i]);
    }
}
Exercise 6: Computing π
Consider pi_collect.c (or pi_collect.f90), which uses the series
$$\pi = 4 \arctan 1 = 4 \sum_{i=0}^{\infty} \frac{(-1)^i}{2i+1} = 4 - \frac{4}{3} + \frac{4}{5} - \frac{4}{7} + \frac{4}{9} - \cdots$$
Let's add timings:
double t1, t2;
t1 = omp_get_wtime();
...
t2 = omp_get_wtime();
printf("Time = %.16f\n", t2-t1);
Try some of the other alternatives (atomic, critical) to a reduction and measure the performance.
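As a baseline for those measurements, the series can be summed with a reduction along the following lines. This is a sketch only and is not taken from pi_collect.c; the function name leibniz_pi and its parameter terms are assumptions.

```c
/* Approximate pi with the first `terms` terms of the series
   4 * sum over i of (-1)^i / (2i + 1), summed with a reduction. */
double leibniz_pi(long terms)
{
    double sum = 0.0;
    long i;
#pragma omp parallel for reduction(+:sum)
    for (i = 0; i < terms; i++) {
        double sign = (i % 2 == 0) ? 1.0 : -1.0; /* (-1)^i */
        sum += sign / (2.0 * i + 1.0);
    }
    return 4.0 * sum;
}
```

The series converges slowly (the error after N terms is roughly 1/N), so many terms are needed, which makes the loop long enough for timing differences between reduction, atomic and critical to show.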
Exercise 7: Matrix multiplication
See the file mm.c or mm.f90. Make the initialization and multiplication parallel and measure the speedup.
Further information:
• The standard itself, news, development, links to tutorials:
http://www.openmp.org
• Intel tutorial on YouTube (from Tim Mattson):
http://tinyurl.com/OpenMP-Tutorial
• New: OpenMP 4.0: thread affinity, SIMD, accelerators (GPUs, coprocessors), in GCC 4.9+, Intel compilers 14 and 15 (used in November 12 Xeon Phi and 2016 Advanced OpenMP workshops).
• Questions? Write to the Guillimin support team at