• No results found

Basic Concepts in Parallelization

N/A
N/A
Protected

Academic year: 2021

Share "Basic Concepts in Parallelization"

Copied!
58
0
0

Loading.... (view fulltext now)

Full text

(1)

1

IWOMP 2010

CCS, University of Tsukuba

Tsukuba, Japan

June 14-16, 2010

Ruud van der Pas

Senior Staff Engineer

Oracle Solaris Studio

Oracle

Menlo Park, CA, USA

(2)

2

Outline

Introduction

Parallel Architectures

Parallel Programming Models

Data Races

(3)

3

(4)

4

Why Parallelization ?

Parallelization is another optimization technique

The goal is to reduce the execution time

To this end, multiple processors, or cores, are used

T

im

e

1 core

parallelization

4 cores

Using 4 cores, the execution

time is 1/4 of the single core

(5)

5

What Is Parallelization ?

"Something" is parallel if there is a certain level

of independence in the order of operations

A sequence of machine instructions

A collection of program statements

An algorithm

The problem you're trying to solve

g

ra

n

u

la

ri

ty

In other words, it doesn't matter in what order

those operations are performed

(6)

6

What is a Thread ?

Loosely said, a thread consists of a series of instructions

with it's own program counter (“PC”) and state

A parallel program executes threads in parallel

These threads are then scheduled onto processors

...

...

...

...

...

...

...

...

Thread 0

...

...

...

...

...

...

...

...

Thread 1

...

...

...

...

...

...

...

...

Thread 2

...

...

...

...

...

...

...

...

Thread 3

P

P

PC

PC

PC

PC

(7)

7

Parallel overhead

The total CPU time often exceeds the serial CPU time:

The newly introduced parallel portions in your

program need to be executed

Threads need time sending data to each other and

synchronizing (“communication”)

Often the key contributor, spoiling all the fun

Typically, things also get worse when increasing the

number of threads

Effi cient parallelization is about minimizing the

communication overhead

(8)

8

Communication

Serial

Execution

W

al

lc

lo

ck

t

im

e

Parallel - With

communication

Additional communication

Less than 4x faster

Consumes additional resources

Wallclock time is more than ¼

of serial wallclock time

Total CPU time increases

Parallel - Without

communication

Embarrassingly

parallel: 4x faster

Wallclock time is ¼ of

serial wallclock time

(9)

9

Load balancing

T

im

e

Perfect Load Balancing

Load Imbalance

All threads fi nish in the

same amount of time

No threads is idle

Thread is idle

Different threads need a different

amount of time to fi nish their

task

Total wall clock time increases

Program does not scale well

1 idle

2 idle

3 idle

(10)

10

About scalability

Defi ne the speed-up S(P) as

S(P) := T(1)/T(P)

The effi ciency E(P) is defi ned as

E(P) := S(P)/P

In the ideal case,

S(P)=P

and

E(P)=P/P=1=100%

Unless the application is

embarrassingly parallel, S(P)

eventually starts to deviate from

the ideal curve

Past this point P

opt

, the application

sees less and less benefi t from

adding processors

Note that both metrics give no

information on the actual run-time

As such, they can be dangerous to

use

Ideal

S

p

e

e

d

-u

p

S

(P

)

P

P

opt

In some cases, S(P) exceeds P

This is called "superlinear" behaviour

Don't count on this to happen though

(11)

11

Amdahl's Law

This "law' describes the effect the non-parallelizable part of a

program has on scalability

Note that the additional overhead caused by parallelization and

speed-up because of cache effects are not taken into account

Assume our program has a parallel fraction “f”

This implies the execution time T(1) := f*T(1) + (1-f)*T(1)

On P processors: T(P) = (f/P)*T(1) + (1-f)*T(1)

Amdahl's law:

S(P) = T(1)/T(P) = 1 / (f/P + 1-f)

(12)

12

Amdahl's Law

It is easy to scale on a

small number of

processors

Scalable performance

however requires a

high degree of

parallelization i.e. f is

very close to 1

This implies that you

need to parallelize that

part of the code where

the majority of the time

is spent

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

0

2

4

6

8

10

12

14

16

1.00

0.99

0.75

0.50

0.25

0.00

Value for f:

Processors

S

p

e

e

d

-u

p

(13)

13

Amdahl's Law in practice

f = (1 - T(P)/T(1))/(1 - (1/P))

We can estimate the parallel fraction “f”

Recall: T(P) = (f/P)*T(1) + (1-f)*T(1)

It is trivial to solve this equation for “f”:

Example:

T(1) = 100 and T(4) = 37 => S(4) = T(1)/T(4) = 2.70

f = (1-37/100)/(1-(1/4)) = 0.63/0.75 = 0.84 = 84%

Estimated performance on 8 processors is then:

T(8) = (0.84/8)*100 + (1-0.84)*100 = 26.5

S(8) = T/T(8) = 3.78

(14)

14

Threads Are Getting Cheap ...

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

0

10

20

30

40

50

60

70

80

90

100

Elapsed time and speed up for f = 84%

Number of cores used

Using 8 cores:

100/26.5 = 3.8x faster

= Elapsed time

= Speed up

(15)

15

Numerical results

Consider:

A = B + C + D + E

Serial Processing

Parallel Processing

Thread 0

Thread 1

The roundoff behaviour is different and so the numerical

results may be different too

This is natural for parallel programs, but it may be hard to

differentiate it from an ordinary bug ....

A = B + C

A = A + D

A = A + E

T1 = B + C

T1 = T1 + T2

T2 = D + E

(16)

16

(17)

17

Typical cache based system

L2

cache

(18)

18

Cache line modifi cations

Main Memory

caches

page

cache line

cache line

(19)

19

Caches in a parallel computer

A cache line starts in

memory

Over time multiple copies

of this line may exist

Cache Coherence ("cc"):

Tracks changes in copies

Makes sure correct cache

line is used

Different implementations

possible

Need hardware support to

make it effi cient

Processors

Caches

Memory

CPU 0

CPU 1

CPU 2

CPU 3

invalidated

invalidated

invalidated

(20)

20

(21)

21

Uniform Memory Access (UMA)

Also called "SMP"

(Symmetric Multi

Processor)

Memory Access time is

Uniform for all CPUs

CPU can be multicore

Interconnect is "cc":

Bus

Crossbar

No fragmentation -

Memory and I/O are

shared resources

Memory

I/O

cache

CPU

Easy to use and to administer

Effi cient use of resources

Said to be expensive

Said to be non-scalable

Pro

Con

cache

CPU

cache

CPU

Interconnect

I/O

I/O

(22)

22

NUMA

Also called "Distributed

Memory" or NORMA (No

Remote Memory Access)

Memory Access time is

Non-Uniform

Hence the name "NUMA"

Interconnect is not "cc":

Ethernet, Infi niband,

etc, ...

Runs 'N' copies of the OS

Memory and I/O are

distributed resources

Said to be cheap

Said to be scalable

Diffi cult to use and administer

In-effi cient use of resources

Pro

Con

I/O

cache

CPU

M

I/O

cache

CPU

M

I/O

cache

CPU

M

Interconnect

(23)

23

The Hybrid Architecture

Second-level Interconnect

Second-level interconnect is not cache coherent

Ethernet, Infi niband, etc, ....

Hybrid Architecture with all Pros and Cons:

UMA within one SMP/Multicore node

NUMA across nodes

(24)

24

cc-NUMA

Two-level interconnect:

UMA/SMP within one system

NUMA between the systems

Both interconnects support cache coherence i.e. the

system is fully cache coherent

Has all the advantages ('look and feel') of an SMP

Downside is the Non-Uniform Memory Access time

Interconnect

2-nd level

(25)

25

(26)

26

How To Program A Parallel Computer?

There are numerous parallel programming models

The ones most well-known are:

Distributed Memory

Sockets (standardized, low level)

PVM - Parallel Virtual Machine (obsolete)

MPI - Message Passing Interface (de-facto std)

Shared Memory

Posix Threads (standardized, low level)

OpenMP (de-facto standard)

Automatic Parallelization (compiler does it for you)

An

y

Cl

us

te

r

an

d/

or

si

ng

le

S

MP

SM

P

on

ly

(27)

27

Parallel Programming Models

Distributed Memory - MPI

(28)

28

What is MPI?

MPI stands for the “Message Passing Interface”

MPI is a very extensive de-facto parallel programming

API for distributed memory systems (i.e. a cluster)

An MPI program can however also be executed on a

shared memory system

First specifi cation: 1994

The current MPI-2 specifi cation was released in 1997

Major enhancements

Remote memory management

Parallel I/O

(29)

29

More about MPI

MPI has its own data types (e.g. MPI_INT)

User defi ned data types are supported as well

MPI supports C, C++ and Fortran

Include fi le <mpi.h> in C/C++ and “mpif.h” in Fortran

An MPI environment typically consists of at least:

A library implementing the API

A compiler and linker that support the library

A run time environment to launch an MPI program

Various implementations available

HPC Clustertools, MPICH, MVAPICH, LAM, Voltaire

MPI, Scali MPI, HP-MPI, ...

(30)

30

The MPI Programming Model

1

M

4

M

5

M

2

M

3

M

0

M

A Cluster Of Systems

(31)

31

The MPI Execution Model

All processes

start

MPI_Finalize

All processes

fi nish

= communication

MPI_Init

(32)

32

The MPI Memory Model

P

private

P

private

P

private

P

private

P

private

Transfer

Mechanism

All threads/processes have

access to their own, private,

memory only

Data transfer and most

synchronization has to be

programmed explicitly

All data is private

Data is shared explicitly by

exchanging buffers

(33)

33

Example - Send “N” Integers

#include <mpi.h>

you = 0; him = 1;

MPI_Init(&argc, &argv);

MPI_Comm_rank

(

MPI_COMM_WORLD

, &me);

if ( me == 0 ) {

error_code =

MPI_Send

(&data_buffer, N,

MPI_INT

,

him, 1957,

MPI_COMM_WORLD

);

} else if ( me == 1 ) {

error_code =

MPI_Recv

(&data_buffer, N,

MPI_INT

,

you, 1957,

MPI_COMM_WORLD

,

MPI_STATUS_IGNORE

);

}

MPI_Finalize();

include fi le

initialize MPI environment

get process ID

process 0 sends

process 1 receives

(34)

34

you = 1

him = 0

Process 0

you = 1

him = 0

Process 1

Run time Behavior

N integers

destination = you = 1

label = 1957

MPI_Send

me = 0

me = 1

Yes ! Connection established

MPI_Recv

N integers

sender = him =0

(35)

35

The Pros and Cons of MPI

Advantages of MPI:

Flexibility - Can use any cluster of any size

Straightforward - Just plug in the MPI calls

Widely available - Several implementations out there

Widely used - Very popular programming model

Disadvantages of MPI:

Redesign of application - Could be a lot of work

Easy to make mistakes - Many details to handle

Hard to debug - Need to dig into underlying system

More resources - Typically, more memory is needed

(36)

36

Parallel Programming Models

Shared Memory - Automatic

(37)

37

Automatic Parallelization (-xautopar)

Compiler performs the parallelization (loop based)

Different iterations of the loop executed in parallel

Same binary used for any number of threads

for (i=0; i<1000; i++)

a[i] = b[i] + c[i];

Thread 3

750-999

Thread 2

500-749

Thread 1

250-499

Thread 0

0-249

OMP_NUM_THREADS=4

(38)

38

Parallel Programming Models

Shared Memory - OpenMP

(39)

39

http://www.openmp.org

0

M

1

M

P

M

Shared Memory

(40)

40

Shared Memory Model

T

private

T

private

T

private

T

private

T

private

Programming Model

Shared

Memory

All threads have access to the

same, globally shared, memory

Data can be shared or private

Shared data is accessible by all

threads

Private data can only be

accessed by the thread that

owns it

Data transfer is transparent to

the programmer

Synchronization takes place,

but it is mostly implicit

(41)

41

About data

In a shared memory parallel program variables have a

"label" attached to them:

Labelled "Private"

Visible to one thread only

Change made in local data, is not seen by others

Example -

Local variables in a function that is

executed in parallel

Labelled "Shared"

Visible to all threads

Change made in global data, is seen by all others

(42)

42

TID = 0

for (i=0,1,2,3,4)

TID = 1

for (i=5,6,7,8,9)

Example - Matrix times vector

i = 0

i = 5

a[

0

] = sum

a[

5

] = sum

sum =

b[

i=0]

[j]*c[j]

sum =

b[

i=5]

[j]*c[j]

i = 1

i = 6

a[

1

] = sum

a[

6

] = sum

sum =

b[

i=1]

[j]*c[j]

sum =

b[

i=6]

[j]*c[j]

... etc ...

for (i=0; i<m; i++)

{

sum = 0.0;

for (j=0; j<n; j++)

sum += b[i][j]*c[j];

a[i] = sum;

}

#pragma omp parallel for default(none) \

private(i,j,sum) shared(m,n,a,b,c)

=

*

j

(43)

43

A Black and White comparison

MPI

OpenMP

De-facto standard

De-facto standard

Endorsed by all key players

Endorsed by all key players

Runs on any number of (cheap) systems

Limited to one (SMP) system

“Grid Ready”

Not (yet?) “Grid Ready”

High and steep learning curve

Easier to get started (but, ...)

You're on your own

Assistance from compiler

All or nothing model

Mix and match model

No data scoping (shared, private, ..)

Requires data scoping

More widely used (but ....)

Increasingly popular (CMT !)

Sequential version is not preserved

Preserves sequential code

Requires a library only

Need a compiler

Requires a run-time environment

No special environment

(44)

44

The Hybrid Parallel

Programming Model

(45)

45

The Hybrid Programming Model

0

M

1

M

P

M

Shared Memory

Distributed Memory

0

M

1

M

P

M

Shared Memory

(46)

46

(47)

47

About Parallelism

Parallelism

Independence

No Fixed Ordering

"Something" that does not

obey this rule, is not

(48)

48

Shared Memory Programming

Threads communicate via shared memory

T

private

T

private

T

private

T

private

T

private

Shared

Memory

X

X

X

Y

(49)

49

What is a Data Race?

Two different threads in a multi-threaded shared

memory program

Access the same (=shared) memory location

Asynchronously

and

Without holding any common exclusive locks

and

(50)

50

Example of a data race

T

private

T

private

Shared

Memory

n

{n = omp_get_thread_num();}

#pragma omp parallel shared(n)

W

(51)

51

Another example

{x = x + 1;}

#pragma omp parallel shared(x)

T

private

T

private

Shared

Memory

x

R/W

R/W

(52)

52

About Data Races

Loosely described, a data race means that the update

of a shared variable is not well protected

A data race tends to show up in a nasty way:

Numerical results are (somewhat) different from run

to run

Especially with Floating-Point data diffi cult to

distinguish from a numerical side-effect

Changing the number of threads can cause the

problem to seemingly (dis)appear

May also depend on the load on the system

(53)

53

A parallel loop

for (i=0; i<8; i++)

a[i] = a[i] + b[i];

Thread 1

Thread 2

a[0]=a[0]+b[0]

a[4]=a[4]+b[4]

a[1]=a[1]+b[1]

a[5]=a[5]+b[5]

a[2]=a[2]+b[2]

a[6]=a[6]+b[6]

a[3]=a[3]+b[3]

a[7]=a[7]+b[7]

Time

Every iteration in this

loop is independent of

(54)

54

Not a parallel loop

for (i=0; i<8; i++)

a[i] = a[

i+1

] + b[i];

Thread 1

Thread 2

a[0]=a[1]+b[0]

a[

4

]=a[5]+b[4]

a[1]=a[2]+b[1]

a[5]=a[6]+b[5]

a[2]=a[3]+b[2]

a[6]=a[7]+b[6]

a[3]=a[

4

]+b[3]

a[7]=a[8]+b[7]

Time

The result is not

deterministic when

(55)

55

About the experiment

We manually parallelized the previous loop

The compiler detects the data dependence and does

not parallelize the loop

Vectors a and b are of type integer

We use the checksum of a as a measure for

correctness:

checksum += a[i] for i = 0, 1, 2, ...., n-2

The correct, sequential, checksum result is computed

as a reference

We ran the program using 1, 2, 4, 32 and 48 threads

(56)

56

Numerical results

threads: 1 checksum 1953 correct value 1953

threads: 1 checksum 1953 correct value 1953

threads: 1 checksum 1953 correct value 1953

threads: 1 checksum 1953 correct value 1953

threads: 2 checksum 1953 correct value 1953

threads: 2 checksum 1953 correct value 1953

threads: 2 checksum 1953 correct value 1953

threads: 2 checksum 1953 correct value 1953

threads: 4 checksum

1905

correct value 1953

threads: 4 checksum

1905

correct value 1953

threads: 4 checksum 1953 correct value 1953

threads: 4 checksum

1937

correct value 1953

threads: 32 checksum

1525

correct value 1953

threads: 32 checksum

1473

correct value 1953

threads: 32 checksum

1489

correct value 1953

threads: 32 checksum

1513

correct value 1953

threads: 48 checksum

936

correct value 1953

threads: 48 checksum

1007

correct value 1953

threads: 48 checksum

887

correct value 1953

threads: 48 checksum

822

correct value 1953

Data Race

In Action !

(57)

57

(58)

58

Parallelism Is Everywhere

Multiple levels of parallelism:

Granularity

Technology

Programming Model

Instruction Level Superscalar

Compiler

Chip Level

Multicore

Compiler, OpenMP, MPI

System Level

SMP/cc-NUMA Compiler, OpenMP, MPI

Grid Level

Cluster

MPI

References

Related documents

Since our aim was to produce metal ion beams (especially from Gold and Calcium) with good stability and without any major modification of the source (for

Closer to set git does amend change past commit history to the timestamp of typing frequently used to the current branch, and then issue the hash.. Safe way to modify a

In the chapters that follow, we position the failure of the market to resolve external effects (inefficiencies) and equitable distributional outcomes as risks to

UNCHS (Habitat) provides network support services for the sharing and exchange of information on, inter alia, good and best practices, training and human resources development,

At the first stage, the total volume of traffic among ToRs (through SDN-enhanced network tomography) help us to find out the “talky” ToR pairs (the ToR pairs with more

Figure 7 Numbers of worldwide fatal accidents broken down by nature of flight and aircraft class for the ten-year period 2002 to 2011.. 0 20 40 60 80 100 120

Therefore the study will be focusing on the following selected HRM practices (training and development, recruitment and selection, performance appraisal,

Put a check in Incl (Include) column next to each company to which you want to copy the report; then click the Next button. The system displays the Select