Overview: The OpenMP Programming Model
l motivation and overviewl the parallel directive: clauses, equivalent pthread code, examples l the ‘for’ directive and scheduling of loop iterations
l Pi example in OpenMP l nested parallel directives l synchronization
l handling of variables between threads l sections
l tasks
l library functions and environment variables l comparison with Pthreads
Refs: Grama et al Ch 7, Hager & Wellein Ch 6-7, Lin & Snyder Ch 6;OpenMP Tutorial from LLNL,The OpenMP Foundation
COMP4300/8300 L17: The OpenMP Programming Model 2021 JJ J • I II× 1
Shared Memory Parallel Programming
l explicit thread programming is messy n low-level primitives
n originally non-standard, although better since Pthreads n used by system programmers, but···
···application programmers have better things to do!
l many application codes can be usefully supported by higher level constructs n led to proprietarydirective-based approaches of Cray, SGI, Sun etc
l OpenMP is an API for shared memory parallel programming targeting Fortran, C and C++
n standardizes the form of the proprietary directives
n avoids the need for explicitly setting up mutexes, condition variables, data scope, and initialization
COMP4300/8300 L17: The OpenMP Programming Model 2021 JJ J • I II× 2
OpenMP
l specifications maintained by OpenMP Architecture Review Board (ARB) n members include AMD, Intel, Fujitsu, IBM, NVIDIA···cOMPunity
l versions 1.0 (Fortran ’97, C ’98), 1.1 and 2.0 (Fortran ’00, C/C++ ’02), 2.5 (unified Fortran and C, 2005), 3.0 (2008), 3.1 (2011), 4.0 (2013), 5.0 (2018)
l comprises of compiler directives, library routines and environment variables n C directives(case sensitive)
#pragma ompdirective-name [clause-list]
n library calls begin withomp
void omp set num threads(int nthreads);
n environment variables begin withOMP export OMP NUM THREADS=4 l OpenMP requires compiler support
n activated via-fopenmp(gcc) or-qopenmp(icc) compiler flags
The Parallel Directive
l OpenMP uses afork/join model: programs execute serially until they encounter a paralleldirective:
n this creates a group of threads
n the number of threads is dependent on theOMP NUM THREADSenvironment variable or set via a function call, e.g.omp set num threads(nthreads) n the main thread becomes themaster thread, with a thread id of 0
# pragma omp parallel [ clause−list ] { /∗ structured block ∗/
}
l each thread executes the structured block
Parallel Directive: Clauses
Clauses are used to specify:
l conditional parallelization: to determine if the parallel construct results in creation/use of threads
if (scalar-expression)
l degree of concurrency: explicit specification of the number of threads created/used num threads(integer-expression)
l data handling: to indicate if specific variables are local (private) to the thread (allocated on the thread’s stack, the default), global (shared), or ‘special’
private(variable-list) shared(variable-list) firstprivate(variable-list) default(shared|none)
COMP4300/8300 L17: The OpenMP Programming Model 2021 JJ J • I II× 5
Compiler Translation: OpenMP to Pthreads
l OpenMP codemain () { int a, b;
// serial segment
# pragma omp parallel num threads (8) private (a) shared (b) { // parallel segment
}// rest of serial segment }
l Pthread equivalent (structured block is outlined) main () {
int a, b;
// serial segment
for (i =0; i<8; i ++) pthread create (... , internal thunk ,...);
for (i =0; i<8; i ++) pthread join (...);
// rest of serial segment
}void ∗ i n t e r n a l t h u n k (void ∗ pa ck ag ed ar gu me nt ) { int a;
// parallel segment }
COMP4300/8300 L17: The OpenMP Programming Model 2021 JJ J • I II× 6
Parallel Directive Examples
# pragma omp parallel if ( is parallel == 1) num threads (8) \ private (a) shared (b) firstprivate (c) l if the value of variableis parallelis one, eight threads are used
l each thread has private copy ofaandc, but shares a single copy ofb
l the value of each private copy ofcis initialized to value of the globalcbefore the parallel region
# pragma omp parallel reduction (+: sum ) num threads (8) \ default( private )
l eight threads get a copy of the variablesum
l when the threads exit, the values of these local copies are accumulated into the sumvariable on the master thread
n otherreductionoperations include∗,−,&,|,^,&&, ||
l all variables areprivateunless otherwise specified
Example: Computing Pi
l computeπby generating random numbers in square with side length of 2 centered at (0,0) and counting numbers that fall within a circle of radius 1
n area of square = 4, area of circle:πr2=π
n ratio of points inside circle to outside approachesπ/4
# pragma omp parallel private (i) shared ( npoints ) \ reduction (+: sum ) num threads (8) { int seed = omp num threads (); // private
num threads = omp get num threads ();
sum = 0;
for (i = 0; i < npoints / num threads ; i ++) { rand x = (double) rand range (& seed , −1, 1);
rand y = (double) rand range (& seed , −1, 1);
if (( rand x ∗ rand x + rand y ∗ rand y ) <= 1.0) sum ++;
} }
l the OpenMP code is very simple – c.f. theequivalent pthread code
The For Work-sharing Directive
l used in conjunction with theparalleldirective to partition theforloop immediately afterwards
# pragma omp parallel shared ( npoints ) reduction (+: sum ) \ num threads (8)
{ int seed = omp get thread num ();
sum = 0;
# pragma omp for
for (i = 0; i < npoints ; i ++) {
rand x = (double) rand range (& seed ,−1,1);
rand y = (double) rand range (& seed ,−1,1);
if (( rand x ∗ rand x + rand y ∗ rand y ) <= 1.0) sum ++;
} }
n the loop index (i) is assumed to be private
n only two directives plus sequential code (code is easy to read/maintain) l there is an implicit synchronization at the end of the loop
n can add anowaitclause to prevent this
l it is common to merge the directives:#pragma omp parallel for ...
COMP4300/8300 L17: The OpenMP Programming Model 2021 JJ J • I II× 9
The Combined parallel for Directive
l the most common use case for parallelizingforloops# pragma omp threadprivate ( seed )
# pragma omp parallel
{ seed = o m p g e t t h r e a d n u m (); } sum = 0;
# pragma omp parallel for shared ( npoints ) \ reduction (+: sum ) num threads (8) for (i = 0; i < npoints ; i ++) {
rand x = (double) rand range (& seed , −1, 1);
rand y = (double) rand range (& seed , −1, 1);
if (( rand x ∗ rand x + rand y ∗ rand y ) <= 1.0) sum ++;
}printf (" s u m = % d \ n ", sum );
l inside the parallel region,sumis treated as athread-localvariable (implicitly initialized to 0)
l at the end of the region, the thread-local versions ofsumare added to the global sum(here initialized to 0) to get the final value
COMP4300/8300 L17: The OpenMP Programming Model 2021 JJ J • I II× 10
Assigning Iterations to Threads
l thescheduleclause of thefordirective assigns iterations to threads l schedule(static[,chunk-size])
n splits the iteration space into chunks of size chunk-size and allocates to threads in round-robin fashion
n if the chunk size is unspecified, the number of chunks equals number of threads l schedule(dynamic[,chunk-size])
n the iteration space is split into chunk-size blocks that are scheduled dynamically l schedule(guided[,chunk-size])
n the chunk size decreases exponentially with iterations to a minimum of chunk-size
l schedule(runtime)
n determine scheduling based on setting of theOMP SCHEDULEenvironment variable
Nesting Parallel Directives
l what happens for nestedforloops?
# pragma omp parallel for num threads (2) for (i = 0; i < Ni; i ++) {
# pragma omp parallel for num threads (2) for (j = 0; j < Nj; j ++) {
...
l by default, the inner loop is serialized and run by one thread
l to enable multiple threads in nested loops requires the environment variable OMP NESTEDto beTRUE
l note - the use of synchronization constructs in nested parallel sections requires care (see the OpenMP specs)!
Sychronization #1
l barrier:threads wait until they have all reached this point#pragma omp barrier
l single: the following structured block is executed only by first thread to reach this point
#pragma omp single[clause-list]
n others wait at the end of structured block unless anowaitclause used
l master: only themaster threadexecutes following block, other threads do NOT wait
#pragma omp master
l critical section: only one thread is ever in the named critical section at any time
#pragma omp critical[name]
l atomic: the memory location updated in the following statement is done so in an atomic fashion
n can achieve the same effect using critical sections (possibly faster?)
COMP4300/8300 L17: The OpenMP Programming Model 2021 JJ J • I II× 13
Sychronization #2
l ordered: some operations within aforloop must be performed as if it were done so in a sequential order
cumul sum [0] = list [0];
# pragma omp parallel for shared ( cumul sum , list , n) for (i =1; i<n; i ++) {
/∗ other processing on list [i] if required ∗/
# pragma omp ordered
{ cumul sum [i] = cumul sum [i−1] + list [i];
} }
l flush:enforces a consistent view of memory
n that variables have been flushed from registers into memory
#pragma omp flush[(variable-list)]
COMP4300/8300 L17: The OpenMP Programming Model 2021 JJ J • I II× 14
Data Handling
l private: an uninitialized local copy of the variable is made for each thread l shared: variables are shared between threads
l firstprivate: make a local copy of an existing variable and assign it the same value
n often better than multiple reads to a shared variable
l lastprivate: copies back to the master thread the variable’s value from the thread which executed the last loop iteration (if executed serially)
l threadprivate: creates private variables but they persist between multiple parallel regions, maintaining their values
l copyin: likefirstprivatebut forthreadprivatevariables
OpenMP Exercise: Matrix-Vector Multiplication
void matmult (int M, int N, double ∗ restrict A ,
double ∗ restrict b , double ∗ restrict c) { int i, j;
for (i =0; i < M; i ++) for (j =0; j < N; j ++)
c[i] = A[i+j ∗ M] ∗ b[j] + c[i];
}
l what are the possible ways you could use OpenMP to parallelize this code? (you may change the order of the loops)
l what data handling clauses would be needed?
l what if one or other of the matrix sizes was small?
l what parallel (and serial) performance issues might arise?
Sections
l consider the partitioning of a fixed number of tasks across threads n much less common thanforloop partitioning
n an example offunctional decomposition
n explicit programming naturally limits number of threads (scalability)
# pragma omp sections { # pragma omp section
{ taskA ();
# pragma omp section} { taskB ();
} }
l separate threads will runtaskA()andtaskB() l it is illegal to branch in or out of section blocks
COMP4300/8300 L17: The OpenMP Programming Model 2021 JJ J • I II× 17
OpenMP Tasks
l a task has
n code to execute
n a data environment (it owns its data)
n an assigned thread that executes the code and uses the data l creating a task involves two activities: packaging and execution
n each encountering thread packages a new instance of a task (code and data) n some thread in the team executes the task at some later time
COMP4300/8300 L17: The OpenMP Programming Model 2021 JJ J • I II× 18
Task Syntax
#pragma omp task[clause . . . ] where each clause can be one of:
l if (scalar-expression)
If the expression is false, the creating thread executes the task (in its own environment)
l untied
Relaxes the default binding of tasks to threads
l any of the previous data-handling directivesdefault,private,firstprivateor shared
Task synchronization:
l at thread barriers (explicit or implicit) and task barriers, all tasks generated in that region must complete
#pragma omp taskwait
Task Example
int fib (int n) { int i, j;
if (n < 2) return n;
else {
# pragma omp task shared (i) firstprivate (n) i = fib (n−1);
# pragma omp task shared (j) firstprivate (n) j = fib (n−2);
# pragma omp taskwait return i+j;
} }int main () { int n = 10;
omp set dynamic (0);
omp set num threads (4);
# pragma omp parallel shared (n) { # pragma omp single
printf (" f i b ( % d ) = % d \ n ", n, fib (n ));
} }
Task Issues
l task switching:n task scheduling points occur on task creation/exit/yields and the task synchronization points
n when a thread encounters a task scheduling point, it is allowed to suspend the current task and execute another (task switching)
n it can then return to original task and resume l tied tasks:
n by default suspended tasks must resume execution on the the same thread as it was previously executing on
n theuntiedclause relaxes this constraint l task generation pitfalls:
n very easy to generate many tasks very quickly!
n the generating task will be suspended and its thread may start working on a long and boring task.
Other threads can consume all their tasks and have nothing to do . . .
COMP4300/8300 L17: The OpenMP Programming Model 2021 JJ J • I II× 21
Library Functions #1
l defined in header file
#include <omp.h>
l controlling threads and processors
void omp set num threads (int num threads );
int omp get num threads ();
int omp get max threads ();
int omp get thread num ();
int o mp g et n um p ro cs ();
int omp in parallel ();
l controlling thread creation:
void omp set dynamic (int dynamic threads );
int omp get dynamic ();
void omp set nested (int nested );
int omp get nested ();
COMP4300/8300 L17: The OpenMP Programming Model 2021 JJ J • I II× 22
Library Functions #2
l mutual exclusion support:
void omp init lock ( omp lock t ∗ lock );
void omp destroy lock ( omp lock t ∗ lock );
void omp set lock ( omp lock t ∗ lock );
void omp unset lock ( omp lock t ∗ lock );
int omp test lock ( omp lock t ∗ lock );
l nested mutual exclusion(lock may be setntimes and only released after it is unset ntimes) :
void omp init nest lock ( o m p n e s t l o c k t ∗ lock );
void omp destroy nest lock ( o m p n e s t l o c k t ∗ lock );
omp set nest l ock ( o m p n e s t l o c k t ∗ lock );
void omp unset nest lock ( o m p n e s t l o c k t ∗ lock );
int omp test nest lock ( o m p n e s t l o c k t ∗ lock );
OpenMP Environment Variables
l OMP NUM THREADS: default number of threads entering parallel region
l OMP DYNAMIC: iftrue, it permits the number of threads to change during execution, in order to optimize system resources
l OMP NESTED: iftrue, it permits nested parallel regions
l OMP SCHEDULE: determines scheduling for loops that are specified with a schedule(runtime)clause
export OMP_SCHEDULE="static,4"
export OMP_SCHEDULE="dynamic"
export OMP_SCHEDULE="guided"
OpenMP and Pthreads
l OpenMP removes the need for a programmer to initialize task attributes, set up arguments to threads, partition iteration spaces etc
l OpenMP code can closely resemble serial code
l OpenMP is particularly useful for static or regular problems
l OpenMP users are hostage to the availability of an OpenMP compiler n performance is heavily dependent on the quality of the compiler l OpenMP more easily allowsincremental parallelization
l in Pthreads, data exchange is more apparent, so false sharing and contention are less likely
l Pthreads has a richer API that is much more flexible, e.g. condition waits, locks of different types etc
l Pthreads is library-based
Must balance the above before deciding on the parallel programming model!
COMP4300/8300 L17: The OpenMP Programming Model 2021 JJ J • I II× 25