Parallel Computing
Shared memory parallel programming with OpenMP
Table of contents
Introduction Directives Scope of data Synchronization
OpenMP
MP stands for Multi Processing
De facto standard Application Program Interface (API) for explicit shared-memory parallelism.
Extension to existing programming languages (C/C++/Fortran)
Incremental parallelism
(Parallelization of an existing serial program)
Approach
“Workers” (threads) do the work in parallel and “cooperate” through shared memory
Memory accesses instead of explicit messages
“Local” model: parallelization of the serial code
OpenMP – Goals
Standardization
To establish a standard between various competing shared memory platforms
Lean and Mean
Create a simple and limited instruction set for programming shared memory computers.
Ease of Use
Enable incremental parallelization of serial programs (unlike the all-or-nothing approach of MPI)
Portability
Support of all common programming languages
Open forum for users and developers
OpenMP – History
“Open specifications for Multi Processing”, maintained by the OpenMP Architecture Review Board
http://www.openmp.org
Supported by
– Commercial compilers: IBM, Portland, Intel
– Open source: GNU gcc
– OpenMP 4.0 supported from gcc 4.9
OpenMP 4.0 released July 2013
OpenMP – Execution model
Thread-based parallelism
Compiler-directive-based explicit parallelism
Fork-Join Model (Focus: parallelism of loops)
OpenMP – fork/join
for: distributes the iterations of the loop over the team of threads ⇒ data parallelism
sections: divides the work into sections/work packages, each of which is executed by one thread ⇒ functional parallelism
OpenMP – Memory model
All threads have access to the same globally shared memory
Data in private memory is accessible only by the thread owning this memory
No other thread sees changes made to private data
Data transfer is through shared memory and is completely transparent to the application
OpenMP – Pros/Cons
Pros
Simple parallelism
Higher abstraction than threads
Sequential version can still be used
Standard for shared memory
Cons
Only shared memory
Limited use (loop type)
OpenMP – Main components
Compiler directives and clauses: appear as comments (Fortran) or pragmas (C/C++) and take effect only when the appropriate OpenMP compiler flag is specified
Parallel construct
Work-sharing constructs
Synchronization constructs
Data attribute clauses
C/C++:
#pragma omp directive-name [clause[clause]...]
Fortran free form:
!$omp directive-name [clause[clause]...]
Compiling
See: http://openmp.org/wp/openmp-compilers/
Environment Variables
OMP_NUM_THREADS: sets the number of threads
OMP_STACKSIZE “size [B|K|M|G]”: size of the stack for threads
OMP_DYNAMIC TRUE|FALSE: dynamic thread adjustment
OMP_SCHEDULE “schedule[,chunk]”: iteration scheduling scheme
OMP_PROC_BIND TRUE|FALSE: bind threads to processors
OMP_NESTED TRUE|FALSE: nested parallelism
...
To set them
In csh/tcsh: setenv OMP_NUM_THREADS 4
In sh/bash: export OMP_NUM_THREADS=4
Basic functions
Query/specify some specific feature or setting
omp_get_thread_num(): get thread ID (0 for master thread)
omp_get_num_threads(): get number of threads in the team
omp_set_num_threads(int n): set number of threads
Allow you to manage fine-grained access (lock)
omp_init_lock(lock_var): initializes the OpenMP lock variable lock_var of type omp_lock_t
Timing functions
omp_get_wtime(): returns elapsed wall-clock time
omp_get_wtick(): returns timer precision
Function interface:
C/C++: #include <omp.h>
A first example – 4 threads
Hello World – hello.c
#ifdef _OPENMP
#include <omp.h>
#endif
#include <stdio.h>

/* The omp_* calls below require compiling with OpenMP enabled (e.g. -fopenmp). */
int main(void)
{
    int i;
    #pragma omp parallel for
    for (i = 0; i < 8; ++i) {
        int id = omp_get_thread_num();
        printf("Hello World from thread %d\n", id);
        if (id == 0)
            printf("There are %d threads\n", omp_get_num_threads());
    }
    return 0;
}
Procedure
When a program is started, a single process is started on the CPU.
This process corresponds to the master thread.
The master thread can now create and manage additional threads.
The management of threads (creation, managing and termination) is done by OpenMP without user interaction.
#pragma omp parallel for
instructs that the following for loop is distributed over the available threads
omp_get_thread_num()
returns the number (ID) of the current thread
omp_get_num_threads()
returns the total number of threads in the team
Compile and Run
Compiling
gcc -fopenmp hello.c
Output
th@riemann:~$ ./a.out
Hello World from thread 4
Hello World from thread 6
Hello World from thread 3
Hello World from thread 1
Hello World from thread 5
Hello World from thread 2
Hello World from thread 7
Hello World from thread 0
OpenMP pthread translation
A sample OpenMP program with its Pthreads translation that might be performed by an OpenMP compiler
Compiler directives
OpenMP is based mainly on compiler directives
Directive format in C/C++
#pragma omp construct [clauses ...]
constructs are functionalities of the language
clauses are parameters to those functionalities
construct + clauses = directive
A C compiler that is not OpenMP-enabled ignores the unknown directives:
$ gcc -Wall hello.c   # (gcc version < 4.2)
hello.c: In function 'main':
hello.c:12: warning: ignoring #pragma omp parallel
⇒ the program can be compiled by any compiler, even one without OpenMP support.
Compiler directives II
Conditional compilation
C/C++: the macro _OPENMP is defined when OpenMP is enabled:
#ifdef _OPENMP
    /* OpenMP-specific code, e.g. */
    number = omp_get_thread_num();
#endif
omp parallel
Creates additional threads; the code in the parallel region is executed by all threads.
Original thread (master thread) has thread ID 0.
#pragma omp parallel [clauses]
    /* structured block (no gotos ...) */
Work-sharing between threads I
Loops
The work is divided among the threads, e.g. for two threads:
Thread_1: loop elements 0, ..., (N/2)−1
Thread_2: loop elements (N/2), ..., N−1
#pragma omp parallel [clauses ...]
#pragma omp for [clauses ...]
for (i = 0; i < N; i++)
    a[i] = i * i;
This can be combined into a single directive (omp parallel for):
#pragma omp parallel for [clauses ...]
for (i = 0; i < N; i++)
    a[i] = i * i;
Work-sharing between threads II
Sections
The work is distributed. Each thread processes a section:
#pragma omp parallel
#pragma omp sections [clauses ...]
{
    #pragma omp section
    [... section A runs parallel to B ...]
    #pragma omp section
    [... section B runs parallel to A ...]
}
Clauses are given on the sections directive; the individual section directives take none.
Again, the two directives can be combined:
#pragma omp parallel sections [clauses ...]
Data sharing attribute clauses
Scope of data
Data clauses specified within OpenMP directives control how variables are handled/shared between threads
shared()
The data within a parallel region is shared, which means visible and accessible by all threads simultaneously
private()
The data within a parallel region is private to each thread, which means each thread will have a local copy and use it as a temporary variable.
A private variable is not initialized and the value is not maintained for use outside the parallel region.
Scope of data
Values of private variables are undefined on entry to and on exit from loops.
The following clauses allow variables to be initialized/finalized:
default(shared|private|none)
Specifies the default data-sharing attribute of variables
none – each variable has to be declared explicitly as shared() or private()
firstprivate()
Like private(), but all copies are initialized with the value the variable had before the parallel loop/region.
lastprivate()
After leaving the loop, the variable keeps the value from the sequentially last iteration (or the last section)
Example – private
Initialization
#include <stdio.h>
#include <omp.h>

int main(int argc, char *argv[])
{
    int t = 2;
    int result = 0;
    int A[100], i = 0, j = 0;

    omp_set_num_threads(t);   /* explicitly request 2 threads */
    for (i = 0; i < 100; i++) {
        A[i] = i;
        result += A[i];
    }
    printf("Array-Sum BEFORE calculation: %d\n", result);
Example – private
Parallel section
    i = 0;
    int T[2];
    T[0] = 0;
    T[1] = 0;
    #pragma omp parallel
    {
        #pragma omp for
        for (i = 0; i < 100; i++) {
            for (j = 0; j < 10; j++) {   /* j is shared here */
                A[i] = A[i] * 2;
                T[omp_get_thread_num()]++;
            }
        }
    }
Example – private
Output/print
    i = 0; result = 0;
    for (i = 0; i < 100; i++)
        result += A[i];
    printf("Array-Sum AFTER calculation: %d\n", result);
    printf("Thread 1: %d calculations\n", T[0]);
    printf("Thread 2: %d calculations\n", T[1]);
    return 0;
}
Example – Calculation without private(j)
Output without private declaration
Array-Sum BEFORE calculation: 4950
Array-Sum AFTER calculation: 90425960
Thread 1: 450 calculations
Thread 2: 485 calculations
j is shared by default
Variable j is shared by both threads, so each thread disturbs the other's inner-loop counter. This is the reason for the wrong result (expected sum: 4950 · 2^10 = 5068800)
Example – Modifying the parallel section
Parallel section with private(j)
    #pragma omp parallel
    {
        #pragma omp for private(j)
        for (i = 0; i < 100; i++) {
            for (j = 0; j < 10; j++) {
                A[i] = A[i] * 2;
                T[omp_get_thread_num()]++;
            }
        }
    }
Example – Calculation with private(j)
Output with private declaration
Array-Sum BEFORE calculation: 4950
Array-Sum AFTER calculation: 5068800
Thread 1: 500 calculations
Thread 2: 500 calculations
j is now managed individually for each thread, i.e. declared private
Each thread has its own copy of j, so the threads no longer disturb each other's loop counters
Critical section
Used (similar to the examples in pthreads) to avoid race conditions and the resulting nondeterministic runtime behaviour
Defined in OpenMP (C/C++) as
#pragma omp critical
{
    ... critical section ...
}
The critical section ends with its structured block; C/C++ has no end directive (!$omp end critical exists only in Fortran)
Can be used to resolve a race condition
Example – critical
Parallel section
    int NumOfIters = 0;   /* shared counter, declared before the parallel region */
    #pragma omp parallel
    {
        #pragma omp for private(j)
        for (i = 0; i < 100; i++) {
            for (j = 0; j < 10; j++) {
                A[i] = A[i] * 2;
                T[omp_get_thread_num()]++;
                NumOfIters++;   /* added */
            }
        }
    }
Example – critical
Output without critical declaration
Array-Sum BEFORE calculation: 4950
Array-Sum AFTER calculation: 5068800
Thread 1: 500 calculations
Thread 2: 500 calculations
NumOfIters: 693
NumOfIters gives a wrong result
The threads interfere with each other when incrementing the shared counter
A private declaration is no solution, since the iterations of all threads should be counted
Example – critical
Parallel section with critical declaration
    #pragma omp parallel
    {
        #pragma omp for private(j)
        for (i = 0; i < 100; i++) {
            for (j = 0; j < 10; j++) {
                A[i] = A[i] * 2;
                T[omp_get_thread_num()]++;
                #pragma omp critical   /* added critical */
                {
                    NumOfIters++;
                }   /* the block ends the critical section; C has no "end critical" */
            }
        }
    }
Example – critical
Output with critical declaration
Array-Sum BEFORE calculation: 4950
Array-Sum AFTER calculation: 5068800
Thread 1: 500 calculations
Thread 2: 500 calculations
NumOfIters: 1000
The increment of NumOfIters is now executed serially
The threads no longer interfere with each other when incrementing
Reduction
Reduction of data
Critical sections can be a bottleneck of a calculation. In our example, there is another way to count the number of iterations safely: the reduction
reduction
A reduction operator is a binary operation (such as addition or multiplication).
A reduction is a computation that repeatedly applies the same reduction operator to a sequence of operands in order to get a single result.
All of the intermediate results of the operation should be stored in the same variable: the reduction variable.
Reduction
Usage
#pragma omp for private(j) reduction(op: var)
The reduction clause identifies the variables in which a result is accumulated across threads.
We have to give the reduction operator op and the reduction variable var
The operator op can be one of:
+, *, -, &, ^, |, &&, ||
(division is not a valid reduction operator)
Multiple threads can accumulate values in the variable var
Collects contributions from different threads, like a reduction in MPI
Example – reduction
Parallel section with reduction()
    #pragma omp parallel
    {
        #pragma omp for private(j) reduction(+: NumOfIters)
        for (i = 0; i < 100; i++) {
            for (j = 0; j < 10; j++) {
                A[i] = A[i] * 2;
                T[omp_get_thread_num()]++;
                NumOfIters++;
            }
        }
    }
Example – reduction
Output with reduction()
Array-Sum BEFORE calculation: 4950
Array-Sum AFTER calculation: 5068800
Thread 1: 500 calculations
Thread 2: 500 calculations
NumOfIters: 1000
The reduction is requested with the reduction clause on the loop work-sharing directive
It requires the reduction operator and the reduction variable, separated by a colon
Conditional parallelism
if clauses
It may be desirable to parallelize loops only when the effort is well justified,
e.g. to ensure that the run time with multiple threads is shorter than with serial execution
Properties
Allows deciding at runtime whether a loop
– is executed in parallel (fork-join)
– or is executed serially
Example – if clauses
Parallel section with if
    /* the if clause belongs to the parallel construct, not to omp for */
    #pragma omp parallel if(n > 500)
    {
        #pragma omp for private(j) reduction(+: NumOfIters)
        for (i = 0; i < n; i++) {
            for (j = 0; j < 10; j++) {
                A[i] = A[i] * 2;
                T[omp_get_thread_num()]++;
                NumOfIters++;
            }
        }
    }
Synchronization
In a parallel region threads proceed asynchronously. Sometimes coordination is necessary.
OpenMP provides several ways to coordinate thread execution.
An important component is the synchronization of threads
An application area we have already met:
Resolving the race condition by means of the critical section.
There we had implicit synchronization, so that the threads execute the critical section one after another
We find the same behaviour in atomic synchronisation
barrier construct
At the barrier all threads wait and continue only when all threads have reached the barrier
The barrier guarantees that ALL the code above has been executed
We have
Explicit barriers
– #pragma omp barrier
Implicit barriers
– At the end of the work-sharing constructs (i.e. for/DO, sections, single constructs)
barrier – Synchronisation
barrier – Threads wait until all have reached a common point.
#pragma omp parallel
{
    #pragma omp for nowait
    for (i = 0; i < N; i++) a[i] = b[i] + c[i];
    #pragma omp barrier
    #pragma omp for
    for (i = 0; i < N; i++) d[i] = a[i] + b[i];
}
atomic construct
atomic – a similar construct
#pragma omp atomic [clause]
<statement>
The atomic construct applies only to statements that update the value of a variable
Ensures that no other thread updates the variable between reading and writing
It is a special lightweight form of a critical section
Only the read/write of the variable is serialized, and only if two or more threads access the same memory address
atomic – Synchronisation
Behaviour is related to critical
Only permitted for specific update operations, e.g. x++, ++x, x--, --x and x op= expr
#pragma omp atomic
NumOfIters++;
master/single – Synchronisation
#pragma omp master
[Code which should be executed once by the master thread]
Only the master thread executes the region (e.g. I/O)
#pragma omp single [clause]
[Code which should be executed once]
Only one thread executes the region (e.g. I/O)
This is not necessarily the master thread!
Summary
OpenMP is the de facto standard for shared memory parallelism
Easy to use
Code can be parallelized quickly.
Supported by all commonly used compilers
Drawbacks
If your problem becomes big enough, you have to use distributed memory approaches
Less elaborate control over the execution order than in message passing.
Further reading
OpenMP tutorial, Blaise Barney, Lawrence Livermore National Laboratory
https://computing.llnl.gov/tutorials/openMP/
Guide into OpenMP: Easy multithreading programming for C++, Joel Yliluoma
http://bisqwit.iki.fi/story/howto/openmp/
Introduction to High Performance Computing for Scientists and Engineers, G. Hager & G. Wellein
Ch. 6 – Shared memory parallel programming with OpenMP
Ch. 7 – Efficient OpenMP programming