Practical Introduction to OpenMP

http://tinyurl.com/cq-intro-openmp-20151006

By: Bart Oldeman, Calcul Québec – McGill HPC [email protected], [email protected]

(2)

(3)

Outline of the workshop

Theoretical / practical introduction

• Parallelizing your serial code

• What is OpenMP? Why do we need it?

• How do we run OpenMP codes (on the Guillimin cluster)?

• How to program with OpenMP?

  • Program structure

  • Basic functions

  • Examples

(4)

Outline of the workshop

Practical exercises on Guillimin

• Log in, set up the environment, launch OpenMP code

• Analyzing and running examples

(5)

Exercise 1:

Log in to Guillimin, setting up the environment

1) Log in to Guillimin:

ssh class##@guillimin.hpc.mcgill.ca

2) Check for loaded software modules:

$ module list

3) See all available modules:

$ module av

4) Load necessary modules:

$ module add ifort icc

(6)

Parallelizing your serial code

Models for parallel computing

(as an ordinary user sees it ...)

• Implicit Parallelization — minimum work for you

  • Threaded libraries (MKL, ACML, GOTO, etc.)

  • Compiler directives (OpenMP)

• Good for desktops and shared memory machines

• Explicit Parallelization — work is required !

• You tell what should be done on what CPU

• Low-level option for shared memory machines: POSIX Threads (pthreads)

(7)

OpenMP — Shared Memory API

• Open Multi-Processing: an Application Program Interface for multi-threaded programs in a shared-memory environment.

• http://www.openmp.org

• Consists of

  • Compiler directives

  • Runtime library routines

  • Environment variables

• Allows for relatively simple incremental parallelization.

• Not distributed, but can be combined with MPI (hybrid: see Advanced MPI workshop).

(8)

Shared memory approach

[Figure: one block of shared memory accessed by Thread 0 and Thread 1, each of which also has its own private memory.]

• Most memory is shared by all threads.

• Each thread also has some private memory: variables explicitly declared private, local variables in functions and subroutines.

(9)

OpenMP: fork/join model

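[Figure: the master thread runs alone through each serial region; at each fork, worker threads join it for a parallel region, and at each join the master continues alone. Sequence shown: serial region, fork, parallel region, join, serial region, fork, parallel region, join, serial region.]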

Master + Workers = Team

(10)

What is OpenMP for a user?

• OpenMP is NOT a language!

• OpenMP is NOT a compiler or specific product.

• OpenMP is a de facto industry standard, a specification for an Application Program Interface (API).

• You use its directives, routines, and environment variables.

• You compile and link your code with specific flags.

• History: version 1.0 (1997), 2.5 (2005), 3.0 (2008), 3.1 (2011), 4.0 (2013).

• Different implementations:

  • GCC (4.2+), Intel, PGI, Visual C++, Solaris Studio, Clang (3.7+), ...

(11)

Basic features of OpenMP program

• Include basic definitions (#include <omp.h>, INCLUDE 'omp_lib.h', or USE omp_lib).

• Parallel region declared by a directive of the form #pragma omp parallel (C) or !$OMP PARALLEL (Fortran), declaring which variables are private.

• Optional: code only compiled for OpenMP: use the _OPENMP preprocessor symbol (C) or the !$ prefix (Fortran).

(12)

Example: “Hello from N cores”

Fortran:

    PROGRAM hello
    !$ USE omp_lib
      IMPLICIT NONE
      INTEGER rank, size
      rank = 0
      size = 1
    !$OMP PARALLEL PRIVATE(rank, size)
    !$ size = omp_get_num_threads()
    !$ rank = omp_get_thread_num()
      WRITE(*,*) 'Hello from processor ',&
          rank, ' of ', size
    !$OMP END PARALLEL
    END PROGRAM hello

C:

    #include <stdio.h>
    #ifdef _OPENMP
    #include <omp.h>
    #endif

    int main (int argc, char * argv[]) {
      int rank = 0, size = 1;
    #ifdef _OPENMP
    #pragma omp parallel private(rank, size)
    #endif
      {
    #ifdef _OPENMP
        rank = omp_get_thread_num();
        size = omp_get_num_threads();
    #endif
        printf("Hello from processor %d"
               " of %d\n", rank, size );
      }
      return 0;
    }

(13)

POSIX Threads “Hello from N cores”

    // pthreads.c
    #include <stdio.h>
    #include <pthread.h>
    #define SIZE 4

    void *hello(void *arg) {
      printf("Hello from processor %d of %d\n", *(int *)arg, SIZE);
      return NULL;
    }

    int main(int argc, char* argv[]) {
      int i, p[SIZE];
      pthread_t threads[SIZE];

      for (i = 1; i < SIZE; i++) { /* Fork threads */
        p[i] = i;
        pthread_create(&threads[i], NULL, hello, &p[i]);
      }
      p[0] = 0;
      hello(&p[0]); /* thread 0 greets as well */
      for (i = 1; i < SIZE; i++) /* Join threads. */
        pthread_join(threads[i], NULL);
      return 0;
    }

(14)

Compiling your OpenMP code

• How to compile is NOT defined by the standard.

• A special compilation flag must be used.

• On the Guillimin cluster:

• module add gcc

• gcc -fopenmp hello.c -o hello

• gfortran -fopenmp hello.f90 -o hello

• module add ifort icc

• icc -openmp hello.c -o hello

• ifort -openmp hello.f90 -o hello

• module add pgi

• pgcc -mp hello.c -o hello

(15)

Running your OpenMP code

• Important: environment variable OMP_NUM_THREADS.

• export OMP_NUM_THREADS=4

• ./hello

    Hello from processor 2 of 4
    Hello from processor 0 of 4
    Hello from processor 3 of 4
    Hello from processor 1 of 4

• unset OMP_NUM_THREADS

• pgcc -mp hello.c -o hello

• ./hello

Hello from processor 0 of 1

• gcc -fopenmp hello.c -o hello

• ./hello

(16)

Running your OpenMP code

• On your laptop or desktop, just compile and run your code as above.

• On the Guillimin cluster, use the batch system to submit non-trivial OpenMP jobs! Example: hello.pbs:

    #!/bin/bash
    #PBS -l nodes=1:ppn=2
    #PBS -l walltime=00:05:00
    #PBS -V
    #PBS -N hello
    cd $PBS_O_WORKDIR
    export OMP_NUM_THREADS=2
    ./hello > hello.out

Submit your job:

    $ qsub hello.pbs

(17)

Exercise 2: “Hello”, compilation

1) Copy all files to your home directory:

$ cp -a /software/workshop/cq-formation-openmp/* ~/

2) Compile your code:

$ ifort -openmp hello.f90 -o hello

$ icc -openmp hello.c -o hello

(18)

Exercise 2: “Hello”, job submission

3) View the file “hello.pbs”:

    #!/bin/bash
    #PBS -l nodes=1:ppn=2
    #PBS -l walltime=00:05:00
    #PBS -V
    #PBS -N hello
    cd $PBS_O_WORKDIR
    export OMP_NUM_THREADS=2
    ./hello > hello.out

(19)

Exercise 2: “Hello”, job submission

4) Submit your job:

$ qsub hello.pbs

5) Check the job status:

$ qstat -u $USER

$ showq -u $USER

(20)

Exercise 2: “Hello”, compile and run

Alternatively, using interactive qsub, or on your own Mac/Linux/Cygwin/MSYS computer:

1) Interactive login:

    $ qsub -I -l nodes=1:ppn=2,walltime=7:00:00

or, on your own computer, clone the workshop files into a new directory:

    yourlaptop> git clone -b mcgill \
        https://github.com/calculquebec/cq-formation-intro-openmp.git
    yourlaptop> cd cq-formation-intro-openmp

2) Compile your code:

    > gfortran -fopenmp hello.f90 -o hello
    > gcc -fopenmp hello.c -o hello

3) Run your code:

    > # can use any value here; default: number of cores
    > export OMP_NUM_THREADS=2

(21)

OpenMP directives

Format: sentinel directive [clause,] where sentinel is #pragma omp or !$OMP. Examples:

• #pragma omp parallel (C), !$OMP PARALLEL, !$OMP END PARALLEL (Fortran): Parallel region construct.

• #pragma omp for: A workshare construct that makes a loop parallel (!$OMP DO in Fortran).

• #pragma omp parallel for: A combined construct: defines a parallel region that only contains the loop.

• #pragma omp barrier: A synchronization directive: all threads wait for each other here.

(22)

OpenMP clauses

Examples:

• Data scope: private, shared, and default.

!$omp parallel private(i) shared(x)

The variable i is private to the thread but the variable x is shared with all other threads.

Default: all variables are shared except loop variables (C: only the outer loop variable, Fortran: all loop variables) and variables declared inside the block.

• !$omp parallel default(shared) private(i)

All variables are shared except i.

• !$omp parallel default(none) private(i)

Using default(none) requires listing each variable or else the compiler complains (helps debugging!).
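The same clauses in C can be sketched as follows. The function name scale and its parameters are hypothetical and are not part of the workshop files; the point is only that default(none) makes the compiler reject any variable that is used in the region without being listed.

```c
/* Hypothetical example: multiply every element of x by factor, in place.
 * With default(none), each variable used in the region must appear in a
 * data-scope clause, otherwise compilation fails. */
void scale(int *x, int n, int factor)
{
    int i;
#pragma omp parallel for default(none) private(i) shared(x, n, factor)
    for (i = 0; i < n; i++)
        x[i] *= factor;
}
```

Removing factor from the shared list should make an OpenMP compiler report it as unspecified in the enclosing parallel region, which is the debugging aid mentioned above.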

(23)

OpenMP important library routines

int omp_get_max_threads(void);

Get the maximum number of threads that a parallel region started here could use.

void omp_set_num_threads(int);

Set the number of threads for the next parallel region.

int omp_get_thread_num(void);

Get the current thread number in a parallel region.

int omp_get_num_threads(void);

Get the number of threads in a parallel region.

double omp_get_wtime(void);

Portable wall clock timing routine.
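A minimal sketch of how two of these routines fit together, assuming a hypothetical helper named team_size (not part of the workshop files). It requests a team size, opens a parallel region, and reports how many threads the runtime actually provided. The _OPENMP guards keep it compilable without OpenMP, in which case it simply returns 1.

```c
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical helper: request a team size, start a parallel region,
 * and return the number of threads the runtime actually provided. */
int team_size(int requested)
{
    int size = 1; /* value kept when compiled without OpenMP */
#ifdef _OPENMP
    omp_set_num_threads(requested);   /* applies to the next parallel region */
#pragma omp parallel
    {
#pragma omp single
        size = omp_get_num_threads(); /* one thread records the team size */
    }
#endif
    return size;
}
```

The returned value may be smaller than the request, since the runtime is allowed to provide fewer threads than asked for.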

(24)

OpenMP main environment variables

OMP_NUM_THREADS

Sets the maximum number of threads used (default: compiler dependent but often the number of available (hyper)threads).

OMP_SCHEDULE

Sets the loop schedule used by loops declared with schedule(runtime).
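To illustrate how OMP_SCHEDULE is picked up, here is a small sketch of a loop whose schedule is left to the environment. The function name addvectors_runtime is hypothetical; schedule(runtime) is the standard clause that defers the choice to OMP_SCHEDULE.

```c
/* Hypothetical example: c[i] = a[i] + b[i], with the loop schedule
 * taken from the OMP_SCHEDULE environment variable at run time. */
void addvectors_runtime(const int *a, const int *b, int *c, int n)
{
    int i;
#pragma omp parallel for schedule(runtime)
    for (i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}
```

With this in place, different schedules can be tried without recompiling, for example by running export OMP_SCHEDULE="dynamic,1000" before starting the program.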

(25)

parallel for

• Example:

    void addvectors(const int *a, const int *b, int *c, int n)
    {
      int i;
    #pragma omp parallel for
      for (i = 0; i < n; i++)
        c[i] = a[i] + b[i];
    }

• Here i is automatically made private because it is the loop variable. All other variables are shared.

• The loop is split between threads: for example, with two threads and n=10, thread 0 does indices 0 to 4 and thread 1 does indices 5 to 9.
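One way to see this split for yourself is to record which thread handled each index. The sketch below uses a hypothetical function record_owner and asks for schedule(static) explicitly, because the schedule used when none is given is left to the implementation.

```c
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical example: store in owner[i] the number of the thread
 * that executed iteration i. schedule(static) hands out contiguous
 * blocks of iterations in thread order. */
void record_owner(int *owner, int n)
{
    int i;
#pragma omp parallel for schedule(static)
    for (i = 0; i < n; i++) {
#ifdef _OPENMP
        owner[i] = omp_get_thread_num();
#else
        owner[i] = 0; /* serial build: a single "thread" does everything */
#endif
    }
}
```

Printing owner after a run with OMP_NUM_THREADS=2 and n=10 should show the first half of the indices assigned to thread 0 and the second half to thread 1.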

(26)

parallel for

• Eliminate the overhead from fork and join (in practice: synchronization) by using just one parallel region for two vector additions:

    void addvectors(const int *a, const int *b, int *c, int n)
    {
      int i;
    #pragma omp for
      for (i = 0; i < n; i++)
        c[i] = a[i] + b[i];
    }
    ...
    #pragma omp parallel
    {
      addvectors(a, b, c, n);
      addvectors(b, c, d, n);
    }

(27)

parallel for, nowait and barrier

    void addvectors(const int *a, const int *b, int *c, int n)
    {
      int i;
    #pragma omp for nowait
      for (i = 0; i < n; i++)
        c[i] = a[i] + b[i];
    }
    ...
    #pragma omp parallel
    {
      addvectors(a, b, c, n);
    #pragma omp barrier
      addvectors(b, c, d, n/2);
      addvectors(e, f, g, n);
    }

• omp for by default implies a barrier where all threads wait at the end of the loop.

• Eliminate synchronization overhead using the nowait clause.

• But an explicit barrier must then be added wherever a later loop depends on an earlier one (here, before the second call, which reads c).

(28)

parallel for: if clause

• The if clause allows conditional parallel regions: if n is too small, the overhead is not worth it:

    void addvectors(const int *a, const int *b, int *c, int n)
    {
      int i;
    #pragma omp for
      for (i = 0; i < n; i++)
        c[i] = a[i] + b[i];
    }
    ...
    #pragma omp parallel if (n > 10000)
    {
      addvectors(a, b, c, n);
    }

(29)

parallel for: scheduling

• schedule(static, 10000) allocates chunks of 10000 loop iterations to every thread:

    void addvectors(const int *a, const int *b, int *c, int n)
    {
      int i;
    #pragma omp for schedule(static, 10000)
      for (i = 0; i < n; i++)
        c[i] = a[i] + b[i];
    }

• Use dynamic instead of static to assign chunks to threads at run time: when a thread finishes a chunk, it is assigned the next one. Useful for unequal work within iterations.

• guided instead of dynamic: chunk sizes decrease as less work is left to do.
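A minimal sketch of a loop with unequal work per iteration, where a dynamic schedule is a reasonable choice. The functions is_prime and mark_primes and the chunk size 64 are illustrative assumptions, not taken from the workshop files. The cost of testing i by trial division grows with i, so equal static blocks would leave the threads holding the low indices idle early.

```c
/* Trial division: the cost grows with k, so iterations are unequal. */
static int is_prime(int k)
{
    int d;
    if (k < 2)
        return 0;
    for (d = 2; d * d <= k; d++)
        if (k % d == 0)
            return 0;
    return 1;
}

/* Hypothetical example: flags[i] = 1 if i is prime, else 0.
 * Chunks of 64 iterations are handed out as threads become free. */
void mark_primes(int *flags, int n)
{
    int i;
#pragma omp parallel for schedule(dynamic, 64)
    for (i = 0; i < n; i++)
        flags[i] = is_prime(i);
}
```

Replacing dynamic with guided in the clause is a one-word change that can be timed against this version.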

(30)

Nested loops

For perfectly nested rectangular loops we can parallelize multiple loops in the nest with the collapse clause:

• Argument is number of loops to collapse.

• Will form a single loop of length NxM and then parallelize and schedule that.

• Useful if N is close to the number of threads so parallelizing the outer loop may not have good load balance

• More efficient alternative to (advanced) nested parallelism

    #pragma omp parallel for collapse(2)
    for (int i=0; i<N; i++) {
      for (int j=0; j<M; j++) {
        ...
      }
    }
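A concrete instance of the collapse pattern, as a sketch: the function fill_matrix and the row-major layout are assumptions chosen for illustration. Both loop variables are made private automatically because both loops belong to the collapsed nest.

```c
/* Hypothetical example: fill a rows x cols matrix stored row by row
 * with m[i][j] = i + j. collapse(2) merges the two loops into one
 * iteration space of rows * cols iterations before scheduling. */
void fill_matrix(int *m, int rows, int cols)
{
    int i, j;
#pragma omp parallel for collapse(2)
    for (i = 0; i < rows; i++)
        for (j = 0; j < cols; j++)
            m[i * cols + j] = i + j;
}
```

With rows=2 and eight threads, parallelizing only the outer loop would keep at most two threads busy, while the collapsed loop offers 2*cols iterations to share.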

(31)

parallel: manual scheduling (SPMD)

• SPMD = Single Program Multiple Data, like in MPI.

    void addvectors(const int *a, const int *b, int *c, int n)
    {
      int i;
      for (i = 0; i < n; i++)
        c[i] = a[i] + b[i];
    }
    ....
    int tid, nthreads, low, high;
    #pragma omp parallel default(none) private(tid, nthreads, \
        low, high) shared(a, b, c, n)
    {
      tid = omp_get_thread_num();
      nthreads = omp_get_num_threads();
      low = (n * tid) / nthreads;
      high = (n * (tid + 1)) / nthreads;
      addvectors(&a[low], &b[low], &c[low], high-low);
    }

(32)

SPMD vs. worksharing

• Worksharing (omp for/omp do) is easiest to implement.

• SPMD (do work based on thread ID) may give better performance but is harder to implement.

• SPMD like in MPI:

  • Instead of using large shared arrays, use smaller arrays private to threads: mark all (non-read-only) global and persistent (static/SAVE) variables threadprivate, and communicate using buffers and barriers.

• Fewer cache misses using more private data may give better performance.

• More advanced topic: see Advanced OpenMP workshop in 2016.

(33)

sections (SPMD construct)

• Example:

    #pragma omp parallel sections
    {
    #pragma omp section
      addvectors(a, b, c, n);
    #pragma omp section
      printf("hello world!\n");
    #pragma omp section
      printf("I may or may not be the third thread\n");
    }

• The sections are individual code blocks that are distributed over the threads.

• More flexible alternative (OpenMP 3.0): omp task, useful when traversing dynamic data structures (lists, trees, etc.).
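As a sketch of the omp task alternative for a dynamic data structure, the code below squares every value in a singly linked list. The struct node type and the function square_list are hypothetical names for illustration. One thread walks the list and creates a task per node; the other threads in the team execute the tasks.

```c
#include <stddef.h>

/* Hypothetical list node used only for this illustration. */
struct node {
    int value;
    struct node *next;
};

/* Square every element of the list, one task per node (OpenMP 3.0+).
 * The single construct makes one thread create the tasks; all tasks
 * are complete at the barrier that ends the single construct. */
void square_list(struct node *head)
{
#pragma omp parallel
    {
#pragma omp single
        {
            struct node *p;
            for (p = head; p != NULL; p = p->next) {
#pragma omp task firstprivate(p)
                p->value *= p->value;
            }
        }
    }
}
```

Each task captures its own copy of p through firstprivate, so it keeps pointing at the right node while the creating thread moves on.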

(34)

workshare (Fortran)

• Example:

    integer a(10000), b(10000), c(10000), d(10000)
    !$OMP PARALLEL
    !$OMP WORKSHARE
    c(:) = a(:) + b(:)
    !$OMP END WORKSHARE NOWAIT
    !$OMP WORKSHARE
    d(:) = a(:)
    !$OMP END WORKSHARE NOWAIT
    !$OMP END PARALLEL

• Array assignments in Fortran are distributed among threads like loops.

(35)

Exercise 3: Modifying “Hello”

Ask each CPU to do its own computation by inserting code as follows:

    IF (rank == 0) THEN
      a=SQRT(2.0)
      b=0.0
      WRITE(*,*) 'a,b=',a,b,'on proc',rank
    END IF
    IF (rank == 1) THEN
      a=0.0
      b=SQRT(3.0)
      WRITE(*,*) 'a,b=',a,b,'on proc',rank
    END IF

(36)

Exercise 4: Modifying “Hello”

Do (almost) the same thing, now using omp sections

    !$OMP SECTIONS
    !$OMP SECTION
      a=SQRT(2.0)
      b=0.0
      WRITE(*,*) 'a,b=',a,b,'on proc',rank
    !$OMP SECTION
      a=0.0
      b=SQRT(3.0)
      WRITE(*,*) 'a,b=',a,b,'on proc',rank
    !$OMP END SECTIONS

(37)

Race conditions (Data races)

• Example: innerprod.c, innerprod.f90

    ip = 0;
    #pragma omp parallel for private(i) shared(ip,a,b)
    for (i = 0; i < N; i++)
      ip += a[i] * b[i];

• Problem: could be internally run as:

    for (i = 0; i < N; i++) {
      int register = ip;
      register += a[i] * b[i];
      ip = register;
    }

• Threads may sum to their private CPU registers at the same time and overwrite ip, losing the addition from the other threads!

(38)

Race conditions

• Example: a={1,2}, b={3,4}, inner product 1*3+2*4=11, two threads.

    thread  statement                   ip  register0  register1
    tid=0   int register = ip;           0      0       unknown
    tid=1   int register = ip;           0      0          0
    tid=0   register += a[0] * b[0];     0      3          0
    tid=1   register += a[1] * b[1];     0      3          8
    tid=0   ip = register;               3      3          8
    tid=1   ip = register;               8      3          8

• Wrong result: ip=8 instead of 11.

(39)

Solution: critical section

• Example:

    ip = 0;
    #pragma omp parallel for private(i) shared(ip,a,b)
    for (i = 0; i < N; i++)
    #pragma omp critical
      ip += a[i] * b[i];

• critical makes sure only one thread can run the (compound) statement at a time.

(40)

Solution: atomic section

• Example:

    ip = 0;
    #pragma omp parallel for private(i) shared(ip,a,b)
    for (i = 0; i < N; i++)
    #pragma omp atomic
      ip += a[i] * b[i];

• atomic is like critical but applies only to a single update of one specific memory location.

(41)

Solution: local summing

• Example:

    ip = 0;
    #pragma omp parallel private(i,localip) shared(ip,a,b)
    {
      localip = 0;
    #pragma omp for nowait
      for (i = 0; i < N; i++)
        localip += a[i] * b[i];
    #pragma omp atomic
      ip += localip;
    }

• Still needs atomic, but it now executes once per thread instead of N times, greatly reducing overhead and improving performance.

(42)

Solution: local summing using array

• Example:

    ip = 0;
    int *localips = malloc(omp_get_max_threads()*sizeof(*localips));
    #pragma omp parallel private(i,tid) shared(ip,a,b,localips)
    {
      tid = omp_get_thread_num();
      localips[tid] = 0;
    #pragma omp for
      for (i = 0; i < N; i++)
        localips[tid] += a[i] * b[i];
    #pragma omp single
      for (i = 0; i < omp_get_num_threads(); i++)
        ip += localips[i];
    }

• One single thread does the final summing.

• Could also use omp master here, to force the master thread to do the final summing. omp master works like if (tid==0).

(43)

Solution: local summing using array

• Example:

    ip = 0;
    int *localips = malloc(omp_get_max_threads()*sizeof(*localips));
    #pragma omp parallel private(i,tid) shared(ip,a,b,localips)
    {
      tid = omp_get_thread_num();
      localips[tid] = 0;
    #pragma omp for
      for (i = 0; i < N; i++)
        localips[tid] += a[i] * b[i];
    #pragma omp single
      for (i = 0; i < omp_get_num_threads(); i++)
        ip += localips[i];
    }

• master and single are especially useful for I/O.

• Problem here: false cache sharing of the array

(44)

Solution: local summing using array

• Example:

    ip = 0;
    int *localips = malloc(omp_get_max_threads()*sizeof(*localips));
    #pragma omp parallel private(i,tid,localip) shared(ip,a,b,localips)
    {
      tid = omp_get_thread_num();
      localip = 0;
    #pragma omp for nowait
      for (i = 0; i < N; i++)
        localip += a[i] * b[i];
      localips[tid] = localip;
    #pragma omp barrier
    #pragma omp single
      for (i = 0; i < omp_get_num_threads(); i++)
        ip += localips[i];
    }

• Minimizes false sharing, moving it outside the loop.

• Note the barrier!

(45)

Solution: local summing using array

• Could also use padding in localips, but this requires knowing the cache line size (e.g. 64 bytes):

    #define PAD 16 /* 16 ints = 64 bytes, assuming 4-byte int */
    ip = 0;
    int (*localips)[PAD] = malloc(omp_get_max_threads()*sizeof(*localips));
    #pragma omp parallel private(i,tid) shared(ip,a,b,localips)
    {
      tid = omp_get_thread_num();
      localips[tid][0] = 0;
    #pragma omp for
      for (i = 0; i < N; i++)
        localips[tid][0] += a[i] * b[i];
    #pragma omp single
      for (i = 0; i < omp_get_num_threads(); i++)
        ip += localips[i][0];
    }

(46)

Easiest solution: use reduction

• Example:

    ip = 0;
    #pragma omp parallel for reduction(+:ip)
    for (i = 0; i < N; i++)
      ip += a[i] * b[i];

• Reduction is the most straightforward solution.

• Caveat: only works on scalars, and cannot control rounding errors caused by floating point calculations. For vectors use one of the other methods, Fortran, or OpenMP 4.0 user defined reductions.
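For a vector result, one of the "other methods" can be sketched as a hand-written reduction: each thread accumulates into its own local array and merges it into the shared array once at the end. The function histogram, the constant NBINS, and the assumption that the input values are non-negative are all illustrative choices, not part of the workshop files.

```c
#include <string.h>

#define NBINS 4 /* hypothetical number of bins */

/* Hypothetical example: count how many values of data fall in each
 * bin (value % NBINS), assuming all values are non-negative.
 * Each thread fills a private array, then adds it to hist inside a
 * critical section, once per thread rather than once per element. */
void histogram(const int *data, int n, int *hist)
{
    int i, b;
    memset(hist, 0, NBINS * sizeof(*hist));
#pragma omp parallel private(i, b)
    {
        int local[NBINS] = {0}; /* declared in the region: private */
#pragma omp for nowait
        for (i = 0; i < n; i++)
            local[data[i] % NBINS]++;
#pragma omp critical
        for (b = 0; b < NBINS; b++)
            hist[b] += local[b];
    }
}
```

This follows the same pattern as the local summing solution for the inner product, with an array in place of the scalar localip.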

(47)

Exercise 5: Compilation error

See omp_bug.c or omp_bug.f90 (courtesy of Blaise Barney, Lawrence Livermore National Laboratory) and find the compilation error in:

    #pragma omp parallel for \
        shared(a,b,c,chunk) \
        private(i,tid) \
        schedule(static,chunk)
      {
      tid = omp_get_thread_num();
      for (i=0; i < N; i++)
        {
        c[i] = a[i] + b[i];
        printf("tid= %d i= %d c[i]= %f\n", tid, i, c[i]);
        }
      }

(48)

Exercise 6: Computing π

Consider pi_collect.c (or pi_collect.f90):

$$\pi = 4 \arctan 1 = 4 \sum_{i=0}^{\infty} (-1)^i \frac{1}{2i+1} = 4 - \frac{4}{3} + \frac{4}{5} - \frac{4}{7} + \frac{4}{9} - \dots$$

Let’s add timings:

    double t1, t2;
    t1 = omp_get_wtime();
    ...
    t2 = omp_get_wtime();
    printf("Time = %.16f\n", t2-t1);

Try some of the other alternatives (atomic, critical) to a reduction and measure the performance.

(49)

Exercise 7: Matrix multiplication

See the file mm.c or mm.f90. Make the initialization and multiplication parallel and measure the speedup.

(50)

Further information:

• The standard itself, news, development, links to tutorials:

http://www.openmp.org

• Intel tutorial on YouTube (from Tim Mattson):

http://tinyurl.com/OpenMP-Tutorial

• New: OpenMP 4.0: thread affinity, SIMD, accelerators (GPUs, coprocessors), in GCC 4.9+, Intel compilers 14 and 15 (used in November 12 Xeon Phi and 2016 Advanced OpenMP workshops).

• Questions? Write the Guillimin support team at