• No results found

A Pattern-Based Comparison of OpenACC & OpenMP for Accelerators

N/A
N/A
Protected

Academic year: 2021

Share "A Pattern-Based Comparison of OpenACC & OpenMP for Accelerators"

Copied!
15
0
0

Loading.... (view fulltext now)

Full text

(1)

Euro-Par Conference (Porto, Portugal), August 2014

A Pattern-Based Comparison

of OpenACC & OpenMP

for Accelerators

Sandra Wienke

1,2

, Christian Terboven

1,2

, James C. Beyer

3

,

Matthias S. Müller

1,2

1

IT Center, RWTH Aachen University

2

JARA-HPC, Aachen

(2)

OpenACC & OpenMP for Accelerators

Sandra Wienke | IT Center der RWTH Aachen University 2

Agenda

Motivation

Overview on OpenACC & OpenMP Device Capabilities

Expressiveness by Pattern-Based Comparison

Map

Stencil

Fork-Join

Superscalar Sequence

Parallel Update

Summary

(3)

OpenACC & OpenMP for Accelerators

Sandra Wienke | IT Center der RWTH Aachen University 3

Motivation

Today/ future: Heterogeneity & specialized accelerators

Consequence: Parallel programming gets more complex

Directive-based programming models may simplify software design

Two well-promoted approaches: OpenACC, OpenMP

User question: Which programming model to use?

Power/ expressiveness, performance, long-term perspective

Targeted hardware: GPU, Xeon Phi,…

Expressiveness by pattern-based approach

Patterns: basic structural entities of algorithms

Used: structured parallel patterns [1], e.g.

map, stencil, reduction, fork-join,

superscalar sequence, nesting, parallel update and geometric decomposition

[1] McCool, M., Reinders, J., Robison, A.: Structured Parallel Programming: Patterns for Efficient Computation. Morgan Kaufmann, first edn. (2012)

(4)

OpenACC & OpenMP for Accelerators

Sandra Wienke | IT Center der RWTH Aachen University 4

Overview on OpenACC & OpenMP

Device Capabilities

Directive-based accelerator

programming: C/C++, Fortran

OpenACC

Initiated by Cray, CAPS, PGI, NVIDIA

2011: specification v1.0

2013: specification v2.0

OpenMP

Broad (shared-memory) standard

2013: specification v4.0

OpenACC OpenMP Description of constructs/ clauses

parallel target offload work

parallel teams, parallel create in par. running threads kernels compiler finds parallelism loop distribute, do, for, simd worksharing across parallel units data target data manage data transfers (block) enter data unstructured data lifetime: to and from exit data the device

update target update data movement in data environment host data interoperability with CUDA/ libs cache advice to put objects to closer memory tile strip-mining of data

declare declare target declare global, static data routine declare target for function calls

async(int) task depend async. exec. w/ dependencies wait taskwait sync of streams/ tasks async wait async. waiting parallel in parallel parallel in parallel/ team nested parallelism

device_type device-specific clause tuning atomic atomic atomic operations

sections non-iterative worksharing critical block executed by master master, single block executed by one thread barrier synchronization of threads

Comparison of Constructs

(5)

OpenACC & OpenMP for Accelerators

Sandra Wienke | IT Center der RWTH Aachen University 5

Overview on OpenACC & OpenMP

Device Capabilities

Execution & Memory Model

Host-directed execution

Weak device memory model (no sync at gang/ team level)

Separate or shared memory

OpenACC

We talk about the following specifications: OpenACC 2.0, OpenMP 4.0.

OpenMP

Device

L1 compute unit p ro c e s s in g e le m e n ts L1 compute unit p ro c e s s in g e le m e n ts L1 compute unit p ro c e s s in g e le m e n ts L1 compute unit p ro c e s s in g e le m e n ts L2

mem

mem

Host

L1 L1 L2 core core core core con trol *OpenCL terminology
(6)

OpenACC & OpenMP for Accelerators

Sandra Wienke | IT Center der RWTH Aachen University 6

Map

Maps in parallel different elements of

input data to output collection

Elemental function:

𝑩 = 𝒑 ∙ 𝑨

𝑻

double f(double p, double aij) { return (p * aij);

}

// [..]

#pragma acc parallel

#pragma acc loop

for(i=0; i<n; i++) {

for(j=0; j<m; j++) {

b[j][i] = f(5.0,a[i][j]); } }

double f(double p, double aij) { return (p * aij);

}

// [..]

#pragma omp target

#pragma omp parallel for

for(i=0; i<n; i++) { for(j=0; j<m; j++) {

b[j][i] = f(5.0,a[i][j]); } }

OpenACC

[*] McCool, M., Reinders, J., Robison, A.: Structured Parallel Programming: Patterns for Efficient Computation. Morgan Kaufmann, first edn. (2012)

task data dependency join fork

Parallelism + workshare

Kernels

,

parallel

,

loop

Levels of parallelism

Automatic; tuning:

gang/worker/vector

Performance portability possible

routine

for function calls

Denote parallelism: tuning, but call from

same context

Parallelism + workshare

Target

,

parallel

,

for

,

teams

,

distribute

Levels of parallelism

Teams distribute

,

parallel for

Performance portability difficult

declare for function calls

No parallelism: call from different

contexts, less tuning

OpenMP [*]

vector

#pragma acc loop vector gang

#pragma acc routine seq

#pragma omp parallel for teams distribute

#pragma omp declare target

(7)

OpenACC & OpenMP for Accelerators

Sandra Wienke | IT Center der RWTH Aachen University 7

Stencil

Access of neighboring input elements

with fixed offsets

Ability for data reuse & cache

optimization

[*] Similar to: McCool, M., Reinders, J., Robison, A.: Structured Parallel Programming: Patterns for Efficient Computation. Morgan Kaufmann, first edn. (2012)

for(i=1; i<n-1; i++) { for(j=1; j<m-1; j++) {

anew[i][j] = (a[i-1][j] + a[i+1][j] + \ a[i][j-1] + a[i][j+1]) * 0.25; } }

for(i=1; i<n-1; i+=4) { for(j=1; j<m-1; j+=64) { for(k=i; k<min(n-1,i+4); k++) { for(l=j; l<min(m-1,j+64); l++) { anew[k][l] = (a[k-1][l] + ^\ a[k+1][l] + a[k][l-1] + \ a[k][l+1]) * 0.25; } } } } OpenACC OpenMP task data dependency join fork

tile: strip-mining

1st no. = block size of most inner loop

cache

: data caching

Just hint, compiler can ignore

(Performance-wise) needed especially

for software-managed mem (GPUs)

Tiling must be expressed explicitly

More development effort than w/

tiling

Performance expected similar to

tiling

No caching hints possible

Maybe performance loss on GPUs

[*]

#pragma acc parallel

#pragma omp target

#pragma omp teams distribute

#pragma acc cache(a[i-1:3][j-1:3])

collapse(2)

#pragma omp parallel for collapse(2)

(8)

OpenACC & OpenMP for Accelerators

Sandra Wienke | IT Center der RWTH Aachen University 8

Fork-Join

Workflow is split (forked) into

parallel/ independent flows

Merged (joined) again later

// no tasking concept

#pragma omp declare target int fib(int n) { int x, y; if (n < 2) {return n;} x = fib(n - 1); y = fib(n - 2); return (x+y); }

#pragma omp end declare target

OpenACC

[*] McCool, M., Reinders, J., Robison, A.: Structured Parallel Programming: Patterns for Efficient Computation. Morgan Kaufmann, first edn. (2012)

task data

dependency join

fork

#pragma omp target #pragma omp parallel #pragma omp single result=fib(n);

Data parallelism

possible

Task parallelism

possible

sections

,

tasks

OpenMP

Data parallelism

possible

Task parallelism

Just approximation

by host-directed

nested parallel

constructs

[*]

#pragma omp task shared(x)

#pragma omp task shared(y)

(9)

OpenACC & OpenMP for Accelerators

Sandra Wienke | IT Center der RWTH Aachen University 9

Superscalar Sequence

Parallelization of ordered list of tasks

Satisfaction of data dependencies

Additional interpretation: heterogeneity

#pragma acc parallel loop

// F = f(A)

#pragma acc parallel loop

// G = g(B)

#pragma acc parallel loop

// H = h(F,G) // S = s(F)

#pragma omp parallel #pragma omp single {

#pragma omp target teams distribute parallel for

// F = f(A)

#pragma omp target teams distribute parallel for // G = g(B)

#pragma omp target teams distribute parallel for // H = h(F,G) // S = s(F) } OpenACC OpenMP

F=f(A)

G=g(B)

H=h(F,G)

S=s(F)

A f B g h s task data dependency join fork

Streaming concept

async(int)

+

wait async

Calling thread

asynchronously continues

Tasking concept

Task depend

encies

#pragma omp task depend(out:F)

#pragma omp task depend(out:G)

#pragma omp task depend(in:F,G) depend(inout:H)

#pragma omp task depend(in:F)

#pragma omp taskwait async(1)

async(2)

#pragma acc wait(1,2) async(3) async(3)

#pragma acc wait(1)

(10)

OpenACC & OpenMP for Accelerators

Sandra Wienke | IT Center der RWTH Aachen University 10

Parallel Update (1/3)

Own extension of pattern definitions

Synchronization of data between (separate) memories

task data

dependency join

fork

Both: data movement

Map data to/ from device, allocate

Data regions, update constructs

void stencilOnAcc(double **a, double ** \ anew, int n, int m) { #pragma acc parallel

#pragma acc loop // stencil comp.

} // [..]

while (iter < iter_max) { stencilOnAcc(A,Anew,n,m);

swapOnHost(A,Anew,n,m);

iter++; }

void stencilOnAcc(double **a, \ double** anew, int n, int m) { #pragma omp target

#pragma omp teams distribute \ parallel for // stencil comp.

} // [..]

while (iter < iter_max) { stencilOnAcc(A,Anew,n,m); swapOnHost(A,Anew,n,m); iter++; } OpenACC OpenMP copy \

#pragma acc data create(Anew[0:n][0:m]) \ (a[1:n-2][0:m], anew[1:n-2][0:m])

present \

copyin(A[0:n][0:m]) {

}

#pragma acc update host(Anew[1:n-2][0:m])

#pragma acc update device(A[1:n-2][0:m])

map(tofrom: \ (a[1:n-2][0:m], anew[1:n-2][0:m])

#pragma omp target data \

map(alloc:Anew[0:n][0:m]) \ map(to:A[0:n][0:m])

{

}

#pragma omp target update \

from(Anew[1:n-2][0:m])

#pragma omp target update \ to(A[1:n-2][0:m])

(11)

OpenACC & OpenMP for Accelerators

Sandra Wienke | IT Center der RWTH Aachen University 11

Parallel Update (2/3)

Own extension of pattern definitions

Synchronization of data between (separate) memories

void stencilOnAcc(double **a, double ** \ anew, int n, int m) { #pragma acc parallel present \

(a[1:n-2][0:m], anew[1:n-2][0:m]) #pragma acc loop // stencil comp.

} // [..]

#pragma acc data create(Anew[0:n][0:m]) \ copyin(A[0:n][0:m]) if(test)

{

while (iter < iter_max) { stencilOnAcc(A,Anew,n,m);

#pragma acc update host(Anew[1:n-2][0:m]) swapOnHost(A,Anew,n,m);

#pragma acc update device(A[1:n-2][0:m]) iter++; } } task data dependency join fork

void stencilOnAcc(double **a, \ double** anew, int n, int m) { #pragma omp target map(tofrom: \ a[1:n-2][0:m], anew[1:n-2][0:m]) #pragma omp teams distribute \ parallel for // stencil comp.

} // [..]

#pragma omp target data \

map(alloc:Anew[0:n][0:m]) \ map(to:A[0:n][0:m]) if(test)

{

while (iter < iter_max) { stencilOnAcc(A,Anew,n,m); #pragma omp target update \

from(Anew[1:n-2][0:m]) swapOnHost(A,Anew,n,m);

#pragma omp target update \ to(A[1:n-2][0:m]) iter++;

} }

OpenACC

Missing data: Error

Present

: no data movement;

error if data not there

Missing data: Fix-Up

Implicit check for data; move if not there

(12)

OpenACC & OpenMP for Accelerators

Sandra Wienke | IT Center der RWTH Aachen University 12 class CArray { public: CArray(int n) { a = new double[n]; } ˜CArray() { delete(a); } void fillArray(int n) { #pragma acc parallel loop

for(int i=0; i<n; i++) { a[i]=i; } } private: double *a; };

Parallel Update (3/3)

OpenACC

Unstructured data lifetimes

enter

/

exit data

API functions also exist

Use cases

intit/ fini functions or

constructors / destructors

Manual deep copies of pointer structures

OpenMP committee is working on it

Own extension of pattern definitions

Synchronization of data between (separate) memories

task data

dependency join

fork

#pragma acc enter data create(a[0:n])

(13)

OpenACC & OpenMP for Accelerators

Sandra Wienke | IT Center der RWTH Aachen University 13

Summary

Parallel Pattern

OpenACC

OpenMP

Map

F

(kernels, parallel, loop,

gang/worker/vector, routine)

F

(target, parallel for, teams

distribute, declare target)

Stencil

I

(tile, cache)

-Reduction

F

(reduction)

F

(reduction)

Fork-Join

-

F

(tasks)

Workpile

-

F

(tasks)

Superscalar Sequence

F

(streams)

F

(tasks)

Nesting

F

(parallel in parallel)

F-

(parallel in target/ teams/

parallel)

Parallel Update*

F

(copy, data region, update,

enter/exit data, error concept)

F-

(map, target data, target

update, fix-up concept)

Geometric Decomposition

F

(acc_set_device, device_type)

F-

(device)

F : supported directly, with a special feature/ language construct F-: base aspects (but not all) directly supported

I : implementable easily and efficiently using other features - : not implementable easily

Following concept of “McCool, M., Reinders, J., Robison, A.: Structured Parallel Programming: Patterns for Efficient Computation. Morgan Kaufmann, first edn. (2012)”. (*) new pattern

(14)

OpenACC & OpenMP for Accelerators

Sandra Wienke | IT Center der RWTH Aachen University 14

Conclusion & Outlook (1/2)

Concurrently: OpenACC one step ahead of OpenMP

Less general concepts, but more features

Today’s recommendation for GPUs: OpenACC

Port to OpenMP will be easy

(with OpenACC 1.0 features, or new OpenMP features)

Long-term perspective: Development of standards

Our assumption (OpenACC, OpenMP committee work): co-existence

OpenMP might be advantageous in the long term

Broader user / vendor base (portability)

But, implementation effort significant

(15)

OpenACC & OpenMP for Accelerators

Sandra Wienke | IT Center der RWTH Aachen University 15

Conclusion & Outlook (2/2)

Long-term perspective: architectures

Possible shared memory by host & device: supported by both models

But, promising perspectives for offload mode in general?

OpenMP might take lead with non-offload features

Long-term perspective: performance

We don’t expect differences between equivalent approaches

(same target hardware, compiler)

Dissimilarities might occur if general concept differs (e.g. streams vs. tasks)

Performance evaluation is future work (including comparison to other

models)

References

Related documents

Turner, The impact of variable high pressure treatments and/or cooking of rice on bacterial populations after storage using culture-independent analysis., Food Control (2018),

In the eastern Pacific, important fisheries exist for walleye pollock, Pacific cod, Pacific hake, Pacific herring, Pacific halibut, sablefish, and Pacific ocean perch.. In the

Research Question 2c The current study will then investigate whether each one can uniquely predict online victimisation, and if so, how much variance does self-esteem,

• It provides the highest perceived level of confidentiality, since no one in the client organization touches the survey (assuming it is mailed directly to an outside vendor for

A gold or silver IRA allows you to buy and sell gold, silver and other precious metals in a special custodial IRA account, harnessing the power of your precious metal investments

It is possible that colonial ties not only affect the level of food aid shipped from donor to recipient country, but also the responsiveness of aid to recipient needs.. This

The identified top talent will discuss respective Individ- ual Development Plan (IDP) with the HR VP on learning and career plan (aspiration) for the next 5 – 10 years this is

[87] demonstrated the use of time-resolved fluorescence measurements to study the enhanced FRET efficiency and increased fluorescent lifetime of immobi- lized quantum dots on a