• No results found

Shared Memory Parallel Programming. Overview: The OpenMP Programming Model. The Parallel Directive. OpenMP. export OMP NUM THREADS=4

N/A
N/A
Protected

Academic year: 2022

Share "Shared Memory Parallel Programming. Overview: The OpenMP Programming Model. The Parallel Directive. OpenMP. export OMP NUM THREADS=4"

Copied!
7
0
0

Loading.... (view fulltext now)

Full text

(1)

Overview: The OpenMP Programming Model

l motivation and overview

l the parallel directive: clauses, equivalent pthread code, examples l the ‘for’ directive and scheduling of loop iterations

l Pi example in OpenMP l nested parallel directives l synchronization

l handling of variables between threads l sections

l tasks

l library functions and environment variables l comparison with Pthreads

Refs: Grama et al Ch 7, Hager & Wellein Ch 6-7, Lin & Snyder Ch 6;OpenMP Tutorial from LLNL,The OpenMP Foundation

COMP4300/8300 L17: The OpenMP Programming Model 2021 JJ J • I II× 1

Shared Memory Parallel Programming

l explicit thread programming is messy n low-level primitives

n originally non-standard, although better since Pthreads n used by system programmers, but···

···application programmers have better things to do!

l many application codes can be usefully supported by higher level constructs n led to proprietarydirective-based approaches of Cray, SGI, Sun etc

l OpenMP is an API for shared memory parallel programming targeting Fortran, C and C++

n standardizes the form of the proprietary directives

n avoids the need for explicitly setting up mutexes, condition variables, data scope, and initialization

COMP4300/8300 L17: The OpenMP Programming Model 2021 JJ J • I II× 2

OpenMP

l specifications maintained by OpenMP Architecture Review Board (ARB) n members include AMD, Intel, Fujitsu, IBM, NVIDIA···cOMPunity

l versions 1.0 (Fortran ’97, C ’98), 1.1 and 2.0 (Fortran ’00, C/C++ ’02), 2.5 (unified Fortran and C, 2005), 3.0 (2008), 3.1 (2011), 4.0 (2013), 5.0 (2018)

l comprises of compiler directives, library routines and environment variables n C directives(case sensitive)

#pragma ompdirective-name [clause-list]

n library calls begin withomp

void omp set num threads(int nthreads);

n environment variables begin withOMP export OMP NUM THREADS=4 l OpenMP requires compiler support

n activated via-fopenmp(gcc) or-qopenmp(icc) compiler flags

The Parallel Directive

l OpenMP uses afork/join model: programs execute serially until they encounter a paralleldirective:

n this creates a group of threads

n the number of threads is dependent on theOMP NUM THREADSenvironment variable or set via a function call, e.g.omp set num threads(nthreads) n the main thread becomes themaster thread, with a thread id of 0

# pragma omp parallel [ clause−list ] { /∗ structured block ∗/

}

l each thread executes the structured block

(2)

Parallel Directive: Clauses

Clauses are used to specify:

l conditional parallelization: to determine if the parallel construct results in creation/use of threads

if (scalar-expression)

l degree of concurrency: explicit specification of the number of threads created/used num threads(integer-expression)

l data handling: to indicate if specific variables are local (private) to the thread (allocated on the thread’s stack, the default), global (shared), or ‘special’

private(variable-list) shared(variable-list) firstprivate(variable-list) default(shared|none)

COMP4300/8300 L17: The OpenMP Programming Model 2021 JJ J • I II× 5

Compiler Translation: OpenMP to Pthreads

l OpenMP code

main () { int a, b;

// serial segment

# pragma omp parallel num threads (8) private (a) shared (b) { // parallel segment

}// rest of serial segment }

l Pthread equivalent (structured block is outlined) main () {

int a, b;

// serial segment

for (i =0; i<8; i ++) pthread create (... , internal thunk ,...);

for (i =0; i<8; i ++) pthread join (...);

// rest of serial segment

}void ∗ i n t e r n a l t h u n k (void ∗ pa ck ag ed ar gu me nt ) { int a;

// parallel segment }

COMP4300/8300 L17: The OpenMP Programming Model 2021 JJ J • I II× 6

Parallel Directive Examples

# pragma omp parallel if ( is parallel == 1) num threads (8) \ private (a) shared (b) firstprivate (c) l if the value of variableis parallelis one, eight threads are used

l each thread has private copy ofaandc, but shares a single copy ofb

l the value of each private copy ofcis initialized to value of the globalcbefore the parallel region

# pragma omp parallel reduction (+: sum ) num threads (8) \ default( private )

l eight threads get a copy of the variablesum

l when the threads exit, the values of these local copies are accumulated into the sumvariable on the master thread

n otherreductionoperations include,,&,|,^,&&, ||

l all variables areprivateunless otherwise specified

Example: Computing Pi

l computeπby generating random numbers in square with side length of 2 centered at (0,0) and counting numbers that fall within a circle of radius 1

n area of square = 4, area of circle:πr2

n ratio of points inside circle to outside approachesπ/4

# pragma omp parallel private (i) shared ( npoints ) \ reduction (+: sum ) num threads (8) { int seed = omp num threads (); // private

num threads = omp get num threads ();

sum = 0;

for (i = 0; i < npoints / num threads ; i ++) { rand x = (double) rand range (& seed , −1, 1);

rand y = (double) rand range (& seed , −1, 1);

if (( rand x ∗ rand x + rand y ∗ rand y ) <= 1.0) sum ++;

} }

l the OpenMP code is very simple – c.f. theequivalent pthread code

(3)

The For Work-sharing Directive

l used in conjunction with theparalleldirective to partition theforloop immediately afterwards

# pragma omp parallel shared ( npoints ) reduction (+: sum ) \ num threads (8)

{ int seed = omp get thread num ();

sum = 0;

# pragma omp for

for (i = 0; i < npoints ; i ++) {

rand x = (double) rand range (& seed ,−1,1);

rand y = (double) rand range (& seed ,−1,1);

if (( rand x ∗ rand x + rand y ∗ rand y ) <= 1.0) sum ++;

} }

n the loop index (i) is assumed to be private

n only two directives plus sequential code (code is easy to read/maintain) l there is an implicit synchronization at the end of the loop

n can add anowaitclause to prevent this

l it is common to merge the directives:#pragma omp parallel for ...

COMP4300/8300 L17: The OpenMP Programming Model 2021 JJ J • I II× 9

The Combined parallel for Directive

l the most common use case for parallelizingforloops

# pragma omp threadprivate ( seed )

# pragma omp parallel

{ seed = o m p g e t t h r e a d n u m (); } sum = 0;

# pragma omp parallel for shared ( npoints ) \ reduction (+: sum ) num threads (8) for (i = 0; i < npoints ; i ++) {

rand x = (double) rand range (& seed , −1, 1);

rand y = (double) rand range (& seed , −1, 1);

if (( rand x ∗ rand x + rand y ∗ rand y ) <= 1.0) sum ++;

}printf (" s u m = % d \ n ", sum );

l inside the parallel region,sumis treated as athread-localvariable (implicitly initialized to 0)

l at the end of the region, the thread-local versions ofsumare added to the global sum(here initialized to 0) to get the final value

COMP4300/8300 L17: The OpenMP Programming Model 2021 JJ J • I II× 10

Assigning Iterations to Threads

l thescheduleclause of thefordirective assigns iterations to threads l schedule(static[,chunk-size])

n splits the iteration space into chunks of size chunk-size and allocates to threads in round-robin fashion

n if the chunk size is unspecified, the number of chunks equals number of threads l schedule(dynamic[,chunk-size])

n the iteration space is split into chunk-size blocks that are scheduled dynamically l schedule(guided[,chunk-size])

n the chunk size decreases exponentially with iterations to a minimum of chunk-size

l schedule(runtime)

n determine scheduling based on setting of theOMP SCHEDULEenvironment variable

Nesting Parallel Directives

l what happens for nestedforloops?

# pragma omp parallel for num threads (2) for (i = 0; i < Ni; i ++) {

# pragma omp parallel for num threads (2) for (j = 0; j < Nj; j ++) {

...

l by default, the inner loop is serialized and run by one thread

l to enable multiple threads in nested loops requires the environment variable OMP NESTEDto beTRUE

l note - the use of synchronization constructs in nested parallel sections requires care (see the OpenMP specs)!

(4)

Sychronization #1

l barrier:threads wait until they have all reached this point

#pragma omp barrier

l single: the following structured block is executed only by first thread to reach this point

#pragma omp single[clause-list]

n others wait at the end of structured block unless anowaitclause used

l master: only themaster threadexecutes following block, other threads do NOT wait

#pragma omp master

l critical section: only one thread is ever in the named critical section at any time

#pragma omp critical[name]

l atomic: the memory location updated in the following statement is done so in an atomic fashion

n can achieve the same effect using critical sections (possibly faster?)

COMP4300/8300 L17: The OpenMP Programming Model 2021 JJ J • I II× 13

Sychronization #2

l ordered: some operations within aforloop must be performed as if it were done so in a sequential order

cumul sum [0] = list [0];

# pragma omp parallel for shared ( cumul sum , list , n) for (i =1; i<n; i ++) {

/∗ other processing on list [i] if required ∗/

# pragma omp ordered

{ cumul sum [i] = cumul sum [i−1] + list [i];

} }

l flush:enforces a consistent view of memory

n that variables have been flushed from registers into memory

#pragma omp flush[(variable-list)]

COMP4300/8300 L17: The OpenMP Programming Model 2021 JJ J • I II× 14

Data Handling

l private: an uninitialized local copy of the variable is made for each thread l shared: variables are shared between threads

l firstprivate: make a local copy of an existing variable and assign it the same value

n often better than multiple reads to a shared variable

l lastprivate: copies back to the master thread the variable’s value from the thread which executed the last loop iteration (if executed serially)

l threadprivate: creates private variables but they persist between multiple parallel regions, maintaining their values

l copyin: likefirstprivatebut forthreadprivatevariables

OpenMP Exercise: Matrix-Vector Multiplication

void matmult (int M, int N, double ∗ restrict A ,

double ∗ restrict b , double ∗ restrict c) { int i, j;

for (i =0; i < M; i ++) for (j =0; j < N; j ++)

c[i] = A[i+j ∗ M] ∗ b[j] + c[i];

}

l what are the possible ways you could use OpenMP to parallelize this code? (you may change the order of the loops)

l what data handling clauses would be needed?

l what if one or other of the matrix sizes was small?

l what parallel (and serial) performance issues might arise?

(5)

Sections

l consider the partitioning of a fixed number of tasks across threads n much less common thanforloop partitioning

n an example offunctional decomposition

n explicit programming naturally limits number of threads (scalability)

# pragma omp sections { # pragma omp section

{ taskA ();

# pragma omp section} { taskB ();

} }

l separate threads will runtaskA()andtaskB() l it is illegal to branch in or out of section blocks

COMP4300/8300 L17: The OpenMP Programming Model 2021 JJ J • I II× 17

OpenMP Tasks

l a task has

n code to execute

n a data environment (it owns its data)

n an assigned thread that executes the code and uses the data l creating a task involves two activities: packaging and execution

n each encountering thread packages a new instance of a task (code and data) n some thread in the team executes the task at some later time

COMP4300/8300 L17: The OpenMP Programming Model 2021 JJ J • I II× 18

Task Syntax

#pragma omp task[clause . . . ] where each clause can be one of:

l if (scalar-expression)

If the expression is false, the creating thread executes the task (in its own environment)

l untied

Relaxes the default binding of tasks to threads

l any of the previous data-handling directivesdefault,private,firstprivateor shared

Task synchronization:

l at thread barriers (explicit or implicit) and task barriers, all tasks generated in that region must complete

#pragma omp taskwait

Task Example

int fib (int n) { int i, j;

if (n < 2) return n;

else {

# pragma omp task shared (i) firstprivate (n) i = fib (n−1);

# pragma omp task shared (j) firstprivate (n) j = fib (n−2);

# pragma omp taskwait return i+j;

} }int main () { int n = 10;

omp set dynamic (0);

omp set num threads (4);

# pragma omp parallel shared (n) { # pragma omp single

printf (" f i b ( % d ) = % d \ n ", n, fib (n ));

} }

(6)

Task Issues

l task switching:

n task scheduling points occur on task creation/exit/yields and the task synchronization points

n when a thread encounters a task scheduling point, it is allowed to suspend the current task and execute another (task switching)

n it can then return to original task and resume l tied tasks:

n by default suspended tasks must resume execution on the the same thread as it was previously executing on

n theuntiedclause relaxes this constraint l task generation pitfalls:

n very easy to generate many tasks very quickly!

n the generating task will be suspended and its thread may start working on a long and boring task.

Other threads can consume all their tasks and have nothing to do . . .

COMP4300/8300 L17: The OpenMP Programming Model 2021 JJ J • I II× 21

Library Functions #1

l defined in header file

#include <omp.h>

l controlling threads and processors

void omp set num threads (int num threads );

int omp get num threads ();

int omp get max threads ();

int omp get thread num ();

int o mp g et n um p ro cs ();

int omp in parallel ();

l controlling thread creation:

void omp set dynamic (int dynamic threads );

int omp get dynamic ();

void omp set nested (int nested );

int omp get nested ();

COMP4300/8300 L17: The OpenMP Programming Model 2021 JJ J • I II× 22

Library Functions #2

l mutual exclusion support:

void omp init lock ( omp lock t ∗ lock );

void omp destroy lock ( omp lock t ∗ lock );

void omp set lock ( omp lock t ∗ lock );

void omp unset lock ( omp lock t ∗ lock );

int omp test lock ( omp lock t ∗ lock );

l nested mutual exclusion(lock may be setntimes and only released after it is unset ntimes) :

void omp init nest lock ( o m p n e s t l o c k t ∗ lock );

void omp destroy nest lock ( o m p n e s t l o c k t ∗ lock );

omp set nest l ock ( o m p n e s t l o c k t ∗ lock );

void omp unset nest lock ( o m p n e s t l o c k t ∗ lock );

int omp test nest lock ( o m p n e s t l o c k t ∗ lock );

OpenMP Environment Variables

l OMP NUM THREADS: default number of threads entering parallel region

l OMP DYNAMIC: iftrue, it permits the number of threads to change during execution, in order to optimize system resources

l OMP NESTED: iftrue, it permits nested parallel regions

l OMP SCHEDULE: determines scheduling for loops that are specified with a schedule(runtime)clause

export OMP_SCHEDULE="static,4"

export OMP_SCHEDULE="dynamic"

export OMP_SCHEDULE="guided"

(7)

OpenMP and Pthreads

l OpenMP removes the need for a programmer to initialize task attributes, set up arguments to threads, partition iteration spaces etc

l OpenMP code can closely resemble serial code

l OpenMP is particularly useful for static or regular problems

l OpenMP users are hostage to the availability of an OpenMP compiler n performance is heavily dependent on the quality of the compiler l OpenMP more easily allowsincremental parallelization

l in Pthreads, data exchange is more apparent, so false sharing and contention are less likely

l Pthreads has a richer API that is much more flexible, e.g. condition waits, locks of different types etc

l Pthreads is library-based

Must balance the above before deciding on the parallel programming model!

COMP4300/8300 L17: The OpenMP Programming Model 2021 JJ J • I II× 25

References

Related documents

organic substances. Organic substances are separated from the solution that the total oxygen used to oxidize organic content in waste is also reduced. In the

Health workers in locations without Internet connectivity can access the system using any phone (satellite, fixed-line, mobile, or community pay phone).. For emerging economies

Since our aim was to produce metal ion beams (especially from Gold and Calcium) with good stability and without any major modification of the source (for

The statistical analysis of borehole data illustrated in Figure 6.5 and given below in Table 6.3 contains all relevant site, borehole, time, and analytical data including the

It is possible that colonial ties not only affect the level of food aid shipped from donor to recipient country, but also the responsiveness of aid to recipient needs.. This

Taner Akçam, The Young Turks’ Crime against Humanity: The Armenian Genocide and Ethnic Cleansing in the Ottoman Empire (Princeton, N.J.: Princeton University Press, 2012)... Dashnak

UNCHS (Habitat) provides network support services for the sharing and exchange of information on, inter alia, good and best practices, training and human resources development,