Shared Memory Parallel Programming. Overview: The OpenMP Programming Model. The Parallel Directive. OpenMP. export OMP NUM THREADS=4

(1)

Overview: The OpenMP Programming Model

l motivation and overview

l the parallel directive: clauses, equivalent pthread code, examples l the ‘for’ directive and scheduling of loop iterations

l Pi example in OpenMP l nested parallel directives l synchronization

l handling of variables between threads l _sections

l _tasks

l library functions and environment variables l comparison with Pthreads

Refs: Grama et al Ch 7, Hager & Wellein Ch 6-7, Lin & Snyder Ch 6;OpenMP Tutorial from LLNL,The OpenMP Foundation

COMP4300/8300 L17: The OpenMP Programming Model 2021 JJ J • I II× ¹

Shared Memory Parallel Programming

l explicit thread programming is messy n low-level primitives

n originally non-standard, although better since Pthreads n used by system programmers, but···

···application programmers have better things to do!

l many application codes can be usefully supported by higher level constructs n led to proprietarydirective-based approaches of Cray, SGI, Sun etc

l OpenMP is an API for shared memory parallel programming targeting Fortran, C and C++

n standardizes the form of the proprietary directives

n avoids the need for explicitly setting up mutexes, condition variables, data scope, and initialization

COMP4300/8300 L17: The OpenMP Programming Model 2021 JJ J • I II× ²

OpenMP

l specifications maintained by OpenMP Architecture Review Board (ARB) n members include AMD, Intel, Fujitsu, IBM, NVIDIA···cOMPunity

l versions 1.0 (Fortran ’97, C ’98), 1.1 and 2.0 (Fortran ’00, C/C++ ’02), 2.5 (unified Fortran and C, 2005), 3.0 (2008), 3.1 (2011), 4.0 (2013), 5.0 (2018)

l comprises of compiler directives, library routines and environment variables n C directives(case sensitive)

#pragma ompdirective-name [clause-list]

n library calls begin withomp

void omp set num threads(int nthreads);

n environment variables begin withOMP export OMP NUM THREADS=4 l OpenMP requires compiler support

n activated via-fopenmp(gcc) or-qopenmp(icc) compiler flags

The Parallel Directive

l OpenMP uses afork/join model: programs execute serially until they encounter a paralleldirective:

n this creates a group of threads

n the number of threads is dependent on theOMP NUM THREADSenvironment variable or set via a function call, e.g.omp set num threads(nthreads) n the main thread becomes themaster thread, with a thread id of 0

# pragma omp parallel [ clause−list ] { /∗ structured block ∗/

}

l each thread executes the structured block

(2)

Parallel Directive: Clauses

Clauses are used to specify:

l conditional parallelization: to determine if the parallel construct results in creation/use of threads

if (scalar-expression)

l degree of concurrency: explicit specification of the number of threads created/used num threads(integer-expression)

l data handling: to indicate if specific variables are local (private) to the thread (allocated on the thread’s stack, the default), global (shared), or ‘special’

private(variable-list) shared(variable-list) firstprivate(variable-list) default(shared|none)

COMP4300/8300 L17: The OpenMP Programming Model 2021 JJ J • I II× ⁵

Compiler Translation: OpenMP to Pthreads

l OpenMP code

main () { int a, b;

// serial segment

# pragma omp parallel num threads (8) private (a) shared (b) { // parallel segment

}// rest of serial segment }

l Pthread equivalent (structured block is outlined) main () {

int a, b;

// serial segment

for (i =0; i<8; i ++) pthread create (... , internal thunk ,...);

for (i =0; i<8; i ++) pthread join (...);

// rest of serial segment

}void ∗ i n t e r n a l t h u n k (void ∗ pa ck ag ed ar gu me nt ) { int a;

// parallel segment }

COMP4300/8300 L17: The OpenMP Programming Model 2021 JJ J • I II× ⁶

Parallel Directive Examples

# pragma omp parallel if ( is parallel == 1) num threads (8) \ private (a) shared (b) firstprivate (c) l if the value of variableis parallelis one, eight threads are used

l each thread has private copy ofaandc, but shares a single copy ofb

l the value of each private copy ofcis initialized to value of the globalcbefore the parallel region

# pragma omp parallel reduction (+: sum ) num threads (8) \ default( private )

l eight threads get a copy of the variablesum

l when the threads exit, the values of these local copies are accumulated into the sumvariable on the master thread

n _other_reductionoperations include_∗,₋,&,_|,^,&&, _||

l all variables areprivateunless otherwise specified

Example: Computing Pi

l _computeπby generating random numbers in square with side length of 2 centered at (0,0) and counting numbers that fall within a circle of radius 1

n area of square = 4, area of circle:πr²=π

n ratio of points inside circle to outside approachesπ/4

# pragma omp parallel private (i) shared ( npoints ) \ reduction (+: sum ) num threads (8) { int seed = omp num threads (); // private

num threads = omp get num threads ();

sum = 0;

for (i = 0; i < npoints / num threads ; i ++) { rand x = (double) rand range (& seed , −1, 1);

rand y = (double) rand range (& seed , −1, 1);

if (( rand x ∗ rand x + rand y ∗ rand y ) <= 1.0) sum ++;

} }

l the OpenMP code is very simple – c.f. theequivalent pthread code

(3)

The For Work-sharing Directive

l used in conjunction with theparalleldirective to partition theforloop immediately afterwards

# pragma omp parallel shared ( npoints ) reduction (+: sum ) \ num threads (8)

{ int seed = omp get thread num ();

sum = 0;

# pragma omp for

for (i = 0; i < npoints ; i ++) {

rand x = (double) rand range (& seed ,−1,1);

rand y = (double) rand range (& seed ,−1,1);

} }

n the loop index (i) is assumed to be private

n only two directives plus sequential code (code is easy to read/maintain) l there is an implicit synchronization at the end of the loop

n _{can add a}_nowaitclause to prevent this

l it is common to merge the directives:#pragma omp parallel for ...

COMP4300/8300 L17: The OpenMP Programming Model 2021 JJ J • I II× ⁹

The Combined parallel for Directive

l the most common use case for parallelizingforloops

# pragma omp threadprivate ( seed )

# pragma omp parallel

{ seed = o m p g e t t h r e a d n u m (); } sum = 0;

# pragma omp parallel for shared ( npoints ) \ reduction (+: sum ) num threads (8) for (i = 0; i < npoints ; i ++) {

rand x = (double) rand range (& seed , −1, 1);

rand y = (double) rand range (& seed , −1, 1);

}printf (" s u m = % d \ n ", sum );

l inside the parallel region,sumis treated as athread-localvariable (implicitly initialized to 0)

l at the end of the region, the thread-local versions ofsumare added to the global sum(here initialized to 0) to get the final value

COMP4300/8300 L17: The OpenMP Programming Model 2021 JJ J • I II× ¹⁰

Assigning Iterations to Threads

l _the_scheduleclause of thefordirective assigns iterations to threads l schedule(static[,chunk-size])

n splits the iteration space into chunks of size chunk-size and allocates to threads in round-robin fashion

n if the chunk size is unspecified, the number of chunks equals number of threads l schedule(dynamic[,chunk-size])

n the iteration space is split into chunk-size blocks that are scheduled dynamically l schedule(guided[,chunk-size])

n the chunk size decreases exponentially with iterations to a minimum of chunk-size

l schedule(runtime)

n determine scheduling based on setting of theOMP SCHEDULEenvironment variable

Nesting Parallel Directives

l what happens for nestedforloops?

# pragma omp parallel for num threads (2) for (i = 0; i < Ni; i ++) {

# pragma omp parallel for num threads (2) for (j = 0; j < Nj; j ++) {

...

l by default, the inner loop is serialized and run by one thread

l to enable multiple threads in nested loops requires the environment variable OMP NESTEDto beTRUE

l note - the use of synchronization constructs in nested parallel sections requires care (see the OpenMP specs)!

(4)

Sychronization #1

l _barrier:threads wait until they have all reached this point

#pragma omp barrier

l single: the following structured block is executed only by first thread to reach this point

#pragma omp single[clause-list]

n others wait at the end of structured block unless anowaitclause used

l master: only themaster threadexecutes following block, other threads do NOT wait

#pragma omp master

l critical section: only one thread is ever in the named critical section at any time

#pragma omp critical[name]

l atomic: the memory location updated in the following statement is done so in an atomic fashion

n can achieve the same effect using critical sections (possibly faster?)

COMP4300/8300 L17: The OpenMP Programming Model 2021 JJ J • I II× ¹³

Sychronization #2

l ordered: some operations within aforloop must be performed as if it were done so in a sequential order

cumul sum [0] = list [0];

# pragma omp parallel for shared ( cumul sum , list , n) for (i =1; i<n; i ++) {

/∗ other processing on list [i] if required ∗/

# pragma omp ordered

{ cumul sum [i] = cumul sum [i−1] + list [i];

} }

l _flush:enforces a consistent view of memory

n that variables have been flushed from registers into memory

#pragma omp flush[(variable-list)]

COMP4300/8300 L17: The OpenMP Programming Model 2021 JJ J • I II× ¹⁴

Data Handling

l _private: an uninitialized local copy of the variable is made for each thread l _shared: variables are shared between threads

l firstprivate: make a local copy of an existing variable and assign it the same value

n often better than multiple reads to a shared variable

l lastprivate: copies back to the master thread the variable’s value from the thread which executed the last loop iteration (if executed serially)

l threadprivate: creates private variables but they persist between multiple parallel regions, maintaining their values

l _copyin_{: like}firstprivatebut forthreadprivatevariables

OpenMP Exercise: Matrix-Vector Multiplication

void matmult (int M, int N, double ∗ restrict A ,

double ∗ restrict b , double ∗ restrict c) { int i, j;

for (i =0; i < M; i ++) for (j =0; j < N; j ++)

c[i] = A[i+j ∗ M] ∗ b[j] + c[i];

}

l what are the possible ways you could use OpenMP to parallelize this code? (you may change the order of the loops)

l what data handling clauses would be needed?

l what if one or other of the matrix sizes was small?

l what parallel (and serial) performance issues might arise?

(5)

Sections

l consider the partitioning of a fixed number of tasks across threads n much less common thanforloop partitioning

n an example offunctional decomposition

n explicit programming naturally limits number of threads (scalability)

# pragma omp sections { # pragma omp section

{ taskA ();

# pragma omp section} { taskB ();

} }

l separate threads will runtaskA()andtaskB() l it is illegal to branch in or out of section blocks

COMP4300/8300 L17: The OpenMP Programming Model 2021 JJ J • I II× ¹⁷

OpenMP Tasks

l _{a task has}

n code to execute

n a data environment (it owns its data)

n an assigned thread that executes the code and uses the data l creating a task involves two activities: packaging and execution

n each encountering thread packages a new instance of a task (code and data) n some thread in the team executes the task at some later time

COMP4300/8300 L17: The OpenMP Programming Model 2021 JJ J • I II× ¹⁸

Task Syntax

#pragma omp task[clause . . . ] where each clause can be one of:

l _if ₍scalar-expression)

If the expression is false, the creating thread executes the task (in its own environment)

l _untied

Relaxes the default binding of tasks to threads

l any of the previous data-handling directivesdefault,private,firstprivateor shared

Task synchronization:

l at thread barriers (explicit or implicit) and task barriers, all tasks generated in that region must complete

#pragma omp taskwait

Task Example

int fib (int n) { int i, j;

if (n < 2) return n;

else {

# pragma omp task shared (i) firstprivate (n) i = fib (n−1);

# pragma omp task shared (j) firstprivate (n) j = fib (n−2);

# pragma omp taskwait return i+j;

} }int main () { int n = 10;

omp set dynamic (0);

omp set num threads (4);

# pragma omp parallel shared (n) { # pragma omp single

printf (" f i b ( % d ) = % d \ n ", n, fib (n ));

} }

(6)

Task Issues

l task switching:

n task scheduling points occur on task creation/exit/yields and the task synchronization points

n when a thread encounters a task scheduling point, it is allowed to suspend the current task and execute another (task switching)

n it can then return to original task and resume l tied tasks:

n by default suspended tasks must resume execution on the the same thread as it was previously executing on

n _the_untiedclause relaxes this constraint l task generation pitfalls:

n very easy to generate many tasks very quickly!

n the generating task will be suspended and its thread may start working on a long and boring task.

Other threads can consume all their tasks and have nothing to do . . .

COMP4300/8300 L17: The OpenMP Programming Model 2021 JJ J • I II× ²¹

Library Functions #1

l defined in header file

#include <omp.h>

l controlling threads and processors

void omp set num threads (int num threads );

int omp get num threads ();

int omp get max threads ();

int omp get thread num ();

int o mp g et n um p ro cs ();

int omp in parallel ();

l controlling thread creation:

void omp set dynamic (int dynamic threads );

int omp get dynamic ();

void omp set nested (int nested );

int omp get nested ();

COMP4300/8300 L17: The OpenMP Programming Model 2021 JJ J • I II× ²²

Library Functions #2

l mutual exclusion support:

void omp init lock ( omp lock t ∗ lock );

void omp destroy lock ( omp lock t ∗ lock );

void omp set lock ( omp lock t ∗ lock );

void omp unset lock ( omp lock t ∗ lock );

int omp test lock ( omp lock t ∗ lock );

l nested mutual exclusion(lock may be setntimes and only released after it is unset ntimes) :

void omp init nest lock ( o m p n e s t l o c k t ∗ lock );

void omp destroy nest lock ( o m p n e s t l o c k t ∗ lock );

omp set nest l ock ( o m p n e s t l o c k t ∗ lock );

void omp unset nest lock ( o m p n e s t l o c k t ∗ lock );

int omp test nest lock ( o m p n e s t l o c k t ∗ lock );

OpenMP Environment Variables

l OMP NUM THREADS: default number of threads entering parallel region

l OMP DYNAMIC: iftrue, it permits the number of threads to change during execution, in order to optimize system resources

l OMP NESTED: iftrue, it permits nested parallel regions

l OMP SCHEDULE: determines scheduling for loops that are specified with a schedule(runtime)clause

export OMP_SCHEDULE="static,4"

export OMP_SCHEDULE="dynamic"

export OMP_SCHEDULE="guided"

(7)

OpenMP and Pthreads

l OpenMP removes the need for a programmer to initialize task attributes, set up arguments to threads, partition iteration spaces etc

l OpenMP code can closely resemble serial code

l OpenMP is particularly useful for static or regular problems

l OpenMP users are hostage to the availability of an OpenMP compiler n performance is heavily dependent on the quality of the compiler l OpenMP more easily allowsincremental parallelization

l in Pthreads, data exchange is more apparent, so false sharing and contention are less likely

l Pthreads has a richer API that is much more flexible, e.g. condition waits, locks of different types etc

l Pthreads is library-based

Must balance the above before deciding on the parallel programming model!

COMP4300/8300 L17: The OpenMP Programming Model 2021 JJ J • I II× ²⁵