Parallel Computing. Shared memory parallel programming with OpenMP


Table of contents

Introduction

Directives

Scope of data

Synchronization


OpenMP

MP stands for Multi Processing

De-facto standard Application Program Interface (API) for explicit shared memory parallelism.

Extension to existing programming languages (C/C++/Fortran)

Incremental parallelism

(Parallelization of an existing serial program)

Approach

“Workers” (threads) do the work in parallel and “cooperate” through shared memory

Memory accesses instead of explicit messages

“Local” model: parallelization of the serial code


OpenMP – Goals

Standardization

To establish a standard between various competing shared memory platforms

Lean and Mean

Create a simple and limited instruction set for programming shared memory computers.

Ease of Use

Enable incremental parallelization of serial programs (unlike the all-or-nothing approach of MPI)

Portability

Support of all common programming languages (C/C++/Fortran)

Open forum for users and developers


OpenMP – History

“Open specifications for Multi Processing“ maintained by the OpenMP Architecture Review Board

http://www.openmp.org

Supported by

– Commercial compilers: IBM, Portland, Intel
– Open source: GNU gcc
– OpenMP 4.0 from gcc 4.9

OpenMP 4.0 released July 2013

OpenMP – Execution model

Thread-based parallelism

Compiler-directive based

Explicit parallelism

Fork-Join Model (Focus: parallelism of loops)

OpenMP – fork/join

for: distributes the iterations of the loop across the team of threads ⇒ data parallelism

sections: divides the work into sections/work packages, each of which is executed by a thread ⇒ functional parallelism

OpenMP – Memory model

All threads have access to the same globally shared memory

Data in private memory is accessible only by the thread that owns it; no other thread sees changes made to it

Data transfer is through shared memory and is completely transparent to the application

OpenMP – Pros/Cons

Pros

Simple parallelism

Higher abstraction than threads

Sequential version can still be used

Standard for shared memory

Cons

Only shared memory

Limited use (loop type)

OpenMP – Main components

Compiler directives and clauses

Directives appear as pragmas (C/C++) or special comments (Fortran); they take effect only when the appropriate OpenMP compiler flag is specified

Parallel construct

Work-sharing constructs

Synchronization constructs

Data attribute clauses

C/C++:

#pragma omp directive-name [clause[clause]...]

Fortran free form:

!$omp directive-name [clause[clause]...]

Compiling

See: http://openmp.org/wp/openmp-compilers/


Environment Variables

OMP_NUM_THREADS: sets number of threads

OMP_STACKSIZE “size [B|K|M|G]“: size of the stack for threads

OMP_DYNAMIC TRUE|FALSE: dynamic thread adjustment

OMP_SCHEDULE “schedule[,chunk]“: iteration scheduling scheme

OMP_PROC_BIND TRUE|FALSE: bind threads to processors

OMP_NESTED TRUE|FALSE: nested parallelism

...

To set them

In csh/tcsh: setenv OMP_NUM_THREADS 4

In sh/bash: export OMP_NUM_THREADS=4

Basic functions

Query/specify some specific feature or setting

omp_get_thread_num(): get thread ID (0 for master thread)

omp_get_num_threads(): get number of threads in the team

omp_set_num_threads(int n): set number of threads

Allow you to manage fine-grained access (lock)

omp_init_lock(lock_var): initializes the OpenMP lock variable lock_var of type omp_lock_t

Timing functions

omp_get_wtime(): returns elapsed wall-clock time in seconds

omp_get_wtick(): returns timer precision

Functions interface:

C/C++: #include <omp.h>

A first example – 4 threads

Hello World – hello.c

    #ifdef _OPENMP
    #include <omp.h>
    #else
    /* Serial fallbacks so the program also builds without OpenMP. */
    static int omp_get_thread_num(void) { return 0; }
    static int omp_get_num_threads(void) { return 1; }
    #endif
    #include <stdio.h>

    int main(void) {
        int i;
        /* Distribute the 8 iterations across the team of threads. */
    #pragma omp parallel for
        for (i = 0; i < 8; ++i) {
            int id = omp_get_thread_num();
            printf("Hello World from thread %d\n", id);
            if (id == 0)
                printf("There are %d threads\n", omp_get_num_threads());
        }
        return 0;
    }

Proceeding

When a program starts, a single process is started on the CPU.

This process corresponds to the master thread

The master thread can now create and manage additional threads

The management of threads (creation, management and termination) is done by OpenMP without user interaction

    #pragma omp parallel for

instructs that the following for loop is distributed to the available threads

    omp_get_thread_num()

returns the ID of the calling thread

    omp_get_num_threads()

returns the number of threads in the current team

Compile and Run

Compiling

    gcc -fopenmp hello.c

Output

    th@riemann:~$ ./a.out
    Hello World from thread 4


OpenMP pthread translation

A sample OpenMP program with its Pthreads translation that might be performed by an OpenMP compiler

Compiler directives

OpenMP is defined mainly in terms of compiler directives

Directive format in C/C++

    #pragma omp construct [clauses ...]

Constructs are functionalities of the language

Clauses are parameters to those functionalities

construct + clauses = directive

C compilers that are not OpenMP-enabled ignore unknown directives

    $ gcc -Wall hello.c    # gcc version < 4.2
    hello.c: In function 'main':
    hello.c:12: warning: ignoring #pragma omp parallel

⇒ The program can be compiled by any compiler, even one without OpenMP support.

Compiler directives II

Conditional compilation

C/C++: the macro _OPENMP is defined when OpenMP is enabled:

    #ifdef _OPENMP
    /* OpenMP-specific code, e.g. */
    number = omp_get_thread_num();
    #endif

omp parallel

Creates additional threads; the structured block is executed by all threads of the team.

The original thread (master thread) has thread ID 0.

    #pragma omp parallel [clauses]
    /* structured block (no gotos ...) */

Work-sharing between threads I

Loops

The work is divided among the threads. E.g. for two threads:

Thread_1: loop elements 0, ..., (N/2)−1

Thread_2: loop elements (N/2), ..., N−1

    #pragma omp parallel [clauses ...]
    #pragma omp for [clauses ...]
    for (i = 0; i < N; i++)
        a[i] = i * i;

This can be combined (omp parallel for):

    #pragma omp parallel for [clauses ...]
    for (i = 0; i < N; i++)
        a[i] = i * i;

Work-sharing between threads II

Parallel sections

The work is distributed. Each thread processes a section:

    #pragma omp parallel
    #pragma omp sections [clauses ...]
    {
    #pragma omp section
        { /* ... section A runs parallel to B ... */ }
    #pragma omp section
        { /* ... section B runs parallel to A ... */ }
    }

Clauses belong to the sections directive; the individual section directives take none.

Again one can combine:

    #pragma omp parallel sections [clauses ...]

Data sharing attribute clauses

Scope of data

Data clauses that are specified within OpenMP directives control how variables are handled/shared between threads

shared()

the data within a parallel region is shared, which means visible and accessible by all threads simultaneously

private()

The data within a parallel region is private to each thread, which means each thread will have a local copy and use it as a temporary variable.

A private variable is not initialized and the value is not maintained for use outside the parallel region.

Scope of data

Values of private variables are undefined on entry to and on exit from loops.

The following clauses allow variables to be initialized/finalized:

default(shared|private|none)

Specifies the default data-sharing attribute (default(private) is a Fortran option; C/C++ compilers following OpenMP 4.0 accept only shared and none)

none – Each variable has to be declared explicitly as shared() or private()

firstprivate()

Like private(), but all copies are initialized with the values the variables have before the parallel loop/region.

lastprivate()

The variable keeps the value from the sequentially last iteration of the loop after leaving the construct

Example – private

Initialization

    #include <stdio.h>
    #include <omp.h>

    int main(int argc, char* argv[])
    {
        int t = 2;
        int result = 0;
        int A[100], i = 0, j = 0;
        omp_set_num_threads(t);  // explicitly request 2 threads
        for (i = 0; i < 100; i++) {
            A[i] = i;
            result += A[i];
        }
        printf("Array-Sum BEFORE calculation: %d\n", result);

Example – private

Parallel section

        i = 0;
        int T[2];
        T[0] = 0;
        T[1] = 0;
    #pragma omp parallel
        {
    #pragma omp for
            for (i = 0; i < 100; i++) {
                for (j = 0; j < 10; j++) {
                    A[i] = A[i] * 2;
                    T[omp_get_thread_num()]++;
                }
            }
        }

Example – private

Output/print

        i = 0; result = 0;
        for (i = 0; i < 100; i++)
            result += A[i];
        printf("Array-Sum AFTER calculation: %d\n", result);
        printf("Thread 1: %d calculations\n", T[0]);
        printf("Thread 2: %d calculations\n", T[1]);
        return 0;
    }

Example – Calculation without private(j)

Output without private declaration

    Array-Sum BEFORE calculation: 4950
    Array-Sum AFTER calculation: 90425960
    Thread 1: 450 calculations

j is automatically treated as shared

The variable j is shared by both threads; this race on the inner loop counter is the reason for the wrong result

Example – Modifying parallel section

Parallel section with private(j)

    #pragma omp parallel
        {
    #pragma omp for private(j)
            for (i = 0; i < 100; i++) {
                for (j = 0; j < 10; j++) {
                    A[i] = A[i] * 2;
                    T[omp_get_thread_num()]++;
                }
            }
        }

Example – Calculation with private(j)

Output with private declaration

j is now managed individually for each thread, i.e. privately declared

Each thread has its own copy of the variable, so the threads no longer disturb each other's loop counter

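Critical section

Used (similar to the examples in pthreads) to avoid non-deterministic runtime behaviour

Defined in OpenMP (C/C++) as

    #pragma omp critical
    {
        /* ... critical section ... */
    }

In C/C++ the critical section is the structured block that follows the directive; an explicit end directive (!$omp end critical) exists only in Fortran

Can be used to resolve a race condition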

Example – critical

Parallel section

    #pragma omp parallel
        {
    #pragma omp for private(j)
            for (i = 0; i < 100; i++) {
                for (j = 0; j < 10; j++) {
                    A[i] = A[i] * 2;
                    T[omp_get_thread_num()]++;
                    NumOfIters++;  /* added */
                }
            }
        }

Example – critical

Output without critical declaration

    Array-Sum BEFORE calculation: 4950
    ...
    NumOfIters: 693

NumOfIters has the wrong value

The threads interfere with each other when incrementing the shared counter

A private declaration is no solution, since the iterations of all threads must be counted

Example – critical

Parallel section with critical declaration

    #pragma omp parallel
        {
    #pragma omp for private(j)
            for (i = 0; i < 100; i++) {
                for (j = 0; j < 10; j++) {
                    A[i] = A[i] * 2;
                    T[omp_get_thread_num()]++;
    #pragma omp critical  /* added: applies to the next statement */
                    NumOfIters++;
                }
            }
        }

Example – critical

Output with critical declaration

    Array-Sum BEFORE calculation: 4950
    ...
    NumOfIters: 1000

The increment of NumOfIters is now executed serially

The threads no longer interfere with each other when incrementing

Reduction

Reduction of data

Critical sections can become a bottleneck of a calculation. In our example, there is another way to count the number of iterations safely: the reduction

reduction

A reduction operator is a binary operation (such as addition or multiplication).

A reduction is a computation that repeatedly applies the same reduction operator to a sequence of operands in order to get a single result.

All of the intermediate results of the operation should be stored in the same variable: the reduction variable.

Reduction

Usage

    #pragma omp for private(j) reduction(op: var)

The reduction clause identifies the variables that accumulate a result across threads.

We have to give the reduction operator op and the reduction variable var

In C/C++ the operator op can be one of:

    +, *, -, &, ^, |, &&, ||

(division is not a valid reduction operator; min and max are available since OpenMP 3.1)

Multiple threads can accumulate values in the variable var

The clause collects the contributions of the different threads, like a reduction in MPI

Example – reduction

Parallel section with reduction()

    #pragma omp parallel
        {
    #pragma omp for private(j) reduction(+: NumOfIters)
            for (i = 0; i < 100; i++) {
                for (j = 0; j < 10; j++) {
                    A[i] = A[i] * 2;
                    T[omp_get_thread_num()]++;
                    NumOfIters++;
                }
            }
        }

Example – reduction

Output with reduction()

    ...
    NumOfIters: 1000

The reduction is specified by the reduction clause on the loop directive

It requires the reduction operator and the reduction variable, separated by a colon

Conditional parallelism

if clauses

It may be desirable to parallelise loops only when the effort is well justified,

for instance to ensure that the run time with multiple threads is shorter than that of serial execution

Properties

Allows deciding at runtime whether a loop
– is executed in parallel (fork-join)
– or is executed serially

Example – if clause

Parallel section with if

    #pragma omp parallel if(n > 500)  /* the if clause belongs to the parallel directive */
        {
    #pragma omp for private(j) reduction(+:NumOfIters)
            for (i = 0; i < n; i++) {
                for (j = 0; j < 10; j++) {
                    A[i] = A[i] * 2;
                    T[omp_get_thread_num()]++;
                    NumOfIters++;
                }
            }
        }

Synchronization

In a parallel region threads proceed asynchronously. Sometimes coordination is necessary

OpenMP provides several ways to coordinate thread execution.

An important component is the synchronization of threads

An application area we have already met:

Resolving the race condition with the critical section.

Here we had implicit synchronization so that all threads execute the critical section sequentially

We find the same behaviour in atomic synchronisation

barrier construct

At the barrier all threads wait and continue only when all threads have reached the barrier

The barrier guarantees that ALL the code above has been executed

We have

Explicit barriers
– #pragma omp barrier

Implicit barriers
– At the end of the worksharing constructs (i.e. for/DO, sections, single constructs)

barrier – Synchronisation

barrier: threads wait until all have reached a common point.

    #pragma omp parallel
    {
    #pragma omp for nowait
        for (i = 0; i < N; i++) a[i] = b[i] + c[i];
    #pragma omp barrier
    #pragma omp for
        for (i = 0; i < N; i++) d[i] = a[i] + b[i];
    }

atomic construct

atomic: a construct similar to critical

    #pragma omp atomic [clause]

The atomic construct applies only to statements that update the value of a variable

Ensures that no other thread updates the variable between reading and writing

It is a special lightweight form of a critical section

Only the read/write is serialized, and only if two or more threads access the same memory address

atomic – Synchronisation

Behaviour is related to critical

Only permitted for specific update operations, e.g. x++, ++x, x--, --x, x binop= expr

    #pragma omp atomic
    NumOfIters++;

master/single – Synchronisation

    #pragma omp master
    /* code which should be executed once, by the master thread */

Only the master thread executes the region (e.g. I/O); the directive takes no clauses

    #pragma omp single [clause]
    /* code which should be executed once */

Only one thread executes the region (e.g. I/O)

This is not necessarily the master thread!

Summary

OpenMP is the de-facto standard for shared memory parallelism

Easy to use

Code can be parallelized quickly.

Supported by all commonly used compilers

Drawbacks

If your problem becomes big enough, you have to use distributed memory approaches

Less elaborate control over the execution order than in message passing.
