Programming Techniques for Supercomputers: Distributed memory parallel processing with MPI Introduction

Academic year: 2021


(1)

Programming Techniques for Supercomputers:

Distributed memory parallel processing with MPI

Introduction
MPI in a nutshell
Communication schemes

Prof. Dr. G. Wellein(a,b), Dr. G. Hager(a), J. Habich(a)

(a) HPC Services – Regionales Rechenzentrum Erlangen
(b) Department für Informatik

(2)
(3)

The message passing paradigm

Distributed memory architecture:
• Each process(or) can only access its dedicated address space.
• No global shared address space
• Data exchange and communication between processes is done by explicitly passing messages through a communication network

Message passing library:
• Should be flexible, efficient and portable
• Hide communication hardware and software layers from application programmer

(4)

The Message Passing Paradigm

• Widely accepted standard in HPC / numerical simulation: Message Passing Interface (MPI)
• Process-based approach: All variables are local!
• Same program on each processor/machine (SPMD)
  • No restriction of the general MP model, because processes can be distinguished by their rank (see later)
• The program is written in a sequential language (FORTRAN/C[++])
• Data exchange between processes: Send/receive messages via MPI library calls
  • This is usually the most tedious but also the most flexible way of parallelization
• MPI is standard today:

(5)

MPI: A Modern MP Standard

• MPI forum – defines MPI standard / library subroutine interfaces
• Various implementations by vendors (Intel MPI) or communities (OpenMPI)
• Beginning: April 1992 – Before: vendor-specific libraries
• Members (approx. 60)
  • Application programmers
  • Research institutes & computing centres
  • Manufacturers of supercomputers & software designers
• Latest standard: MPI 2.2 (September 2009)
• Successful free implementation: MPICH (1.2 standard)
• All documents and more pointers available at: http://www.mpi-forum.org/
• MPI 2.2 defines more than 500 subroutines – typically 10-30 are used

(6)

MPI: Goals and Scope

• PORTABILITY: Architecture and hardware independent code

[Figure: an application calls MPI, which runs on SGI Altix 4700, Cray XT series, IBM BlueGene, IB / GigE clusters, ...; MPI implementations shown: IntelMPI, OpenMPI, MVAPICH]

• FORTRAN, C & C++ Interface
• Provides 'well-defined' and 'safe' data transfer
• Enables development of parallel libraries
• Support heterogeneous environment (e.g. clusters with heterogeneous compute nodes)

(7)
(8)

MPI in a nutshell

The software architecture

• Operating system view:
  • parallel work done by tasks/processes
• Programmer's view: Library routines for
  • coordination
  • communication
  • synchronization
  Library calls (>500, 10 essential)
• User's view: MPI execution environment provides
  • resource allocation
  • startup method
  • and other (implementation-dependent) behavior

[Figure: layer diagram with the elements User, Application, MPI execution environment, MPI Library (low level communication primitives), Tasks in OS, CPUs in hardware, Interconnect hardware]

(9)

MPI in a nutshell

Parallel execution

• Tasks run throughout program execution: All variables are local
• Startup phase:
  • launch tasks
  • establishes communication context ("communicator") among all tasks
• Point-to-point data transfer:
  • usually between pairs of tasks
  • usually coordinated
  • may be blocking or non-blocking
  • explicit synchronization is needed for non-blocking
• Collective communication:
  • between all tasks or a subgroup of tasks
  • presently blocking-only
  • reductions, scatter/gather operations
  • efficiency of library call
• Clean shutdown

(10)

MPI in a nutshell

C and Fortran interfaces

• Required header files:
  • C: #include <mpi.h>
  • Fortran: include 'mpif.h'
  • Fortran90: USE MPI
• Bindings:
  • C: error = MPI_Xxxx(parameter,...);
  • Fortran: call MPI_XXXX(argument,...,ierror)
• MPI constants (global/common): All upper case in C
• Arrays:
  • C: indexed from 0
  • Fortran: indexed from 1
• Here: concentrate on Fortran interface!
• Most frequent source of errors in C: call by reference with return values!

(11)

MPI in a nutshell

MPI Error handling

• C MPI routines
  • return an int, which may be ignored
• Fortran MPI routines
  • ierror argument cannot be omitted!
• Return value MPI_SUCCESS
  • indicates that all went ok
• Default: Abort parallel computation in case of other return values
  • but can also define error handlers (not covered in this lecture)

(12)

MPI in a nutshell

Initialization and finalization (1)

• Each node must start/terminate at least one MPI process
  • Usually handled automatically
  • More than one process per core is often, but not always possible
  • Number of MPI processes per node: Determined by MPI execution environment
• First call in MPI program: initialization of parallel machine
    call MPI_INIT(ierror)
• Last call: clean shutdown of parallel machine
    call MPI_FINALIZE(ierror)
  • Only process with rank 0 (see later) is guaranteed to return from MPI_FINALIZE (cf. standard)
• Stdout/stderr of each MPI process

(13)

MPI in a nutshell

Initialization and finalization (2)

• Frequent source of errors: MPI_Init() in C
  C binding:
    int MPI_Init(int *argc, char ***argv);
• If MPI_Init() is called in a function (bad idea anyway), this function must have pointers to the original data:
    void init_all(int *argc, char ***argv) {
      MPI_Init(argc, argv);
    }
    init_all(&argc, &argv);
• Depending on implementation, mistakes at this point might even go unnoticed until code is ported

(14)

MPI in a nutshell

Communicator and rank (1)

• MPI_INIT defines "communicator" MPI_COMM_WORLD:

[Figure: MPI_COMM_WORLD drawn as a set of eight processes, labeled with their ranks 0 to 7]

• MPI_COMM_WORLD defines the processes that belong to the parallel machine
• other communicators (subsets) are possible (cf. MPI communicators section)

(15)

MPI in a nutshell

Communicator and rank (2)

• The rank identifies each process within a communicator (e.g. MPI_COMM_WORLD):
  • obtain rank:
      integer rank, ierror
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierror)
  • rank = 0,1,2,…, (number of MPI tasks – 1)
• Obtain number of MPI tasks in communicator:
      integer size, ierror
      call MPI_COMM_SIZE(MPI_COMM_WORLD, size, ierror)

(16)

MPI in a nutshell

Communicator and rank (3)

• MPI_COMM_WORLD is
  • effectively an MPI-global variable and required as argument for nearly all MPI calls
• rank
  • is target label for MPI messages
  • can drive user-defined directives what each process should do:
      if (rank == 0) then
        ... ! *** do work for rank 0 ***
      else
        ... ! *** do work for other ranks ***
      end if

(17)

MPI in a nutshell

The first MPI program everyone has to code

program hello
  use mpi
  implicit none
  integer rank, size, ierror

  call MPI_INIT(ierror)
  call MPI_COMM_SIZE(MPI_COMM_WORLD, size, ierror)
  call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierror)
  write(*,*) 'Hello World! I am ',rank,' of ',size
  call MPI_FINALIZE(ierror)
end program

(18)

MPI in a nutshell

Compiling and running MPI code

• Compile:
    mpif90 -o hello hello.f90
• Run on 4 processors:
    mpirun -np 4 ./hello
• Output (order undefined):
    Hello World! I am 3 of 4
    Hello World! I am 1 of 4
    Hello World! I am 0 of 4
    Hello World! I am 2 of 4

• Compile time: include files or module information file needed
• Link time: MPI library required
• Most implementations provide mpif77, mpif90, mpicc and mpiCC wrappers
  • not standardized, so variations must be expected, e.g., with Intel-MPI (mpiifort, mpiicc etc.)
• Startup facilities
  • mpirun (legacy)

(19)

MPI in a nutshell

Transmitting a message – parameters involved

• For successful and reliable data transfer MPI requires information:
  • Which processor is sending the message.
  • Where is the data on the sending processor.
  • What kind of data is being sent.
  • How much data is there.
  • Which processor(s) are receiving the message.
  • Where should the data be left on the receiving processor.
  • How much data is the receiving processor prepared to accept.
• Sender and receiver must pass their information to MPI separately
• Holds for point-to-point

(20)

MPI in a nutshell

MPI process communication

• Communication between two processes: Sending / Receiving of MPI-Messages
• MPI-Message: Array of elements of a particular MPI datatype

[Figure: an MPI message passed from a SEND on one process to a RECEIVE on another (ranks 0 and 1)]

• MPI data types:
  • basic data types

(21)

MPI in a nutshell

Basic Fortran and C data types

MPI datatype              FORTRAN datatype
MPI_CHARACTER             CHARACTER(1)
MPI_INTEGER               INTEGER
MPI_REAL                  REAL
MPI_DOUBLE_PRECISION      DOUBLE PRECISION
MPI_COMPLEX               COMPLEX
MPI_LOGICAL               LOGICAL
MPI_BYTE
MPI_PACKED

MPI datatype              C datatype
MPI_CHAR / MPI_SHORT      signed char / short
MPI_INT / MPI_LONG        signed int / long
MPI_UNSIGNED_CHAR / …     unsigned char / …
MPI_FLOAT / MPI_DOUBLE    float / double
MPI_LONG_DOUBLE           long double
MPI_BYTE
MPI_PACKED

(22)

MPI in a nutshell

MPI data types cont’d

• MPI_BYTE: Eight binary digits
  • hack value, do not use
• MPI_PACKED: can implement new data types → however, it is more flexible to use …
• … derived data types: Built at run time from basic data types (see later)
• Data type matching is strictly required: Same MPI data type in SEND and RECEIVE call
  • type must match on both ends in order for the communication to take place
• Support for heterogeneous systems/clusters
  • implementation-dependent
  • automatic data type conversion between systems of differing architecture may be needed

(23)

MPI in a nutshell

Point-to-point communication

• Communication between exactly two processes within the same communicator
• Identification of source and destination via the rank within the communicator!

[Figure: MPI_COMM_WORLD with ranks 0 to 7; an MPI message travels from the send buffer of the source process to the receive buffer of the destination process]

• Blocking communication: After MPI call returns
  • Source process can safely modify send buffer
  • receive buffer on destination process contains message from source

(24)

MPI in a nutshell

Blocking Standard Send: MPI_Send

• Fortran syntax:
    call MPI_SEND(buf, count, datatype, dest, tag, comm, ierror)
• Completion of MPI_SEND:
  • status of destination process is not defined – message may or may not have been received after return!

(25)

MPI in a nutshell

Blocking standard receive: MPI_Recv

• Fortran syntax:
    call MPI_RECV(buf, count, datatype, source, tag, comm, status, ierror)
• Completion of MPI_RECV:
  • Message has been received successfully and completely
  • Size of buf needs to be at least the size of the message to be received. May be larger → MPI_Get_count

(26)

MPI in a nutshell

Determine exact parameters of a message received

• Determine number of data items received via MPI_Get_count
• MPI_Recv accepts wildcards for source=MPI_ANY_SOURCE and tag=MPI_ANY_TAG
• The actual values of the wildcard parameters can be determined via status object:

    call MPI_RECV(field, count, MPI_REAL, &
                  MPI_ANY_SOURCE, MPI_ANY_TAG, &
                  MPI_COMM_WORLD, status, ierror)
    write(*,*) 'Received from #', status(MPI_SOURCE), &
               ' with tag ', status(MPI_TAG)

(27)

MPI in a nutshell

Requirements for point-to-point communication

For a communication to succeed:
• sender must specify a valid destination.
• receiver must specify a valid source rank (or MPI_ANY_SOURCE).
• communicator must be the same (e.g., MPI_COMM_WORLD).
• tags must match (resp. MPI_ANY_TAG for receiver).
• message datatypes must match.
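The requirements listed above can be sketched as a small predicate. This is plain C for illustration only, not MPI code: the struct `envelope`, the function `envelopes_match` and the constants `ANY_SOURCE` and `ANY_TAG` are hypothetical stand-ins (the latter two for MPI_ANY_SOURCE and MPI_ANY_TAG); a real MPI library performs this kind of check internally.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for MPI_ANY_SOURCE and MPI_ANY_TAG (illustration only). */
enum { ANY_SOURCE = -1, ANY_TAG = -1 };

/* Hypothetical description of what a send or a receive call specifies. */
struct envelope {
    int comm;      /* communicator id */
    int source;    /* rank of the sender (a receive may use ANY_SOURCE) */
    int dest;      /* rank of the receiver */
    int tag;       /* message tag (a receive may use ANY_TAG) */
    int datatype;  /* id of the data type */
};

/* Returns true if the posted receive r may accept the message sent with s. */
bool envelopes_match(struct envelope s, struct envelope r)
{
    if (s.comm != r.comm) return false;   /* same communicator */
    if (s.dest != r.dest) return false;   /* message is addressed to this receiver */
    if (r.source != ANY_SOURCE && r.source != s.source) return false;
    if (r.tag != ANY_TAG && r.tag != s.tag) return false;
    return s.datatype == r.datatype;      /* data types must match */
}
```

Note the asymmetry: only the receiving side may use wildcards, the sender always names a concrete destination and tag.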

(28)

MPI in a nutshell

Summary of basic MPI API calls

• Beginner's MPI procedure toolbox:
  • MPI_INIT         let's get going
  • MPI_COMM_SIZE    how many are we?
  • MPI_COMM_RANK    who am I?
  • MPI_SEND         send data to someone else
  • MPI_RECV         receive data from some-/anyone
  • MPI_GET_COUNT    how many items have I received?
  • MPI_FINALIZE     finish off
• Standard send/receive calls provide the simplest way of point-to-point communication
• Send/receive buffer may safely be reused after the call has completed
• MPI_SEND must have a specific target/tag, MPI_RECV does not

(29)

MPI in a nutshell

Parallel integration in MPI

• Determine integral over some function f(x) on a finite interval [a,b] with integrate(a,b)
• Split up interval [a,b] into equal disjoint chunks
• Compute partial sums in parallel
• Collect partial sums on "root" process (rank=0) and sum up partial sums (receive size-1 messages!)
• Each other process (rank > 0) needs to send its partial sum
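The decomposition described above can be sketched without any MPI calls. The following plain C code is a serial stand-in under stated assumptions: the midpoint rule, the extra parameters `f` and `n`, and the names `partial_sum`, `collect` and `square` are illustrative choices, not taken from the lecture. The loop in `collect` replaces the size-1 receives that rank 0 would post in a real MPI program.

```c
/* Example integrand (hypothetical): f(x) = x^2. */
double square(double x) { return x * x; }

/* Midpoint-rule integral of f over [a,b] using n subintervals. */
double integrate(double (*f)(double), double a, double b, int n)
{
    double h = (b - a) / n;
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += f(a + (i + 0.5) * h);
    return sum * h;
}

/* Work of one process: integrate its own chunk of [a,b]. */
double partial_sum(double (*f)(double), double a, double b,
                   int rank, int size, int n)
{
    double chunk = (b - a) / size;   /* equal disjoint chunks */
    double lo = a + rank * chunk;    /* chunk start depends on the rank */
    return integrate(f, lo, lo + chunk, n);
}

/* Serial stand-in for the collection step on rank 0. */
double collect(double (*f)(double), double a, double b, int size, int n)
{
    double res = 0.0;
    for (int rank = 0; rank < size; rank++)
        res += partial_sum(f, a, b, rank, size, n);
    return res;
}
```

For example, `collect(square, 0.0, 3.0, 4, 1000)` splits [0,3] into four chunks and approximates the exact value 9.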

(30)

MPI in a nutshell

First complete MPI example

• Remarks:
  • gathering results from processes is a very common task in MPI – there are more efficient and elegant ways to do this (see later).
  • this is a reduction operation (summation). There are more efficient and elegant ways to do this (following slides).
  • the 'master' process waits for one receive operation to be completed before the next one is initiated. There are more efficient ways... You guessed it!
  • 'master-worker' schemes are quite common in MPI programming but scalability to high process counts may be limited
  • error checking is rarely done in MPI programs – debuggers are often more efficient if something goes wrong
  • every process has its own res variable, but only the master process

(31)

MPI in a nutshell

Useful MPI Calls: MPI_GET_PROCESSOR_NAME

• MPI_GET_PROCESSOR_NAME provides a means of getting an identification of the piece of hardware the process is running on
• Example:
    character*(MPI_MAX_PROCESSOR_NAME) nam
    integer namlen
    ....
    call MPI_GET_PROCESSOR_NAME(nam, namlen, ierror)
    write(*,*) 'Rank ',rank,' runs on ',nam(1:namlen)

(32)

MPI in a nutshell

Useful MPI Calls: MPI_WTIME

• Time measurement function (elapsed wallclock time)
    DOUBLE PRECISION FUNCTION MPI_WTIME()
  returns current timer value; determine resolution of timer with
    DOUBLE PRECISION FUNCTION MPI_WTICK()
• MPI_WTIME has no defined starting point (always calculate time differences!)
• No ierror argument in Fortran!
• Example:
    double precision start_time, total_time
    start_time = MPI_WTIME()
    ! ... do some work here
    total_time = MPI_WTIME() - start_time

(33)

MPI in a nutshell

Useful MPI Calls: MPI_ABORT

• MPI_ABORT forces an MPI program to terminate:
    MPI_ABORT(comm, errorcode, ierror)
    integer errorcode
• Aborts all processes in communicator
• errorcode will be handed as exit value to calling environment
• Safe and well-defined way of terminating an MPI program (if implemented correctly)
• In general, if something unexpected happens, try to shut down MPI the standard way

(34)

MPI communication schemes

Point-to-point communication: blocking vs. non-blocking

Collective Communication

(35)

Blocking Point-to-Point Communication in MPI

(36)

Blocking Point-to-Point Communication

• Point-to-Point communication:
  • Simplest form of message passing
  • Two processes within a communicator exchange a message
• Two classes of point-to-point communication:
  • Blocking (this section)
  • Non-blocking (next section)
• Communication calls in both classes are available in two (three) flavors
  • Synchronous
  • Buffered
  • (Ready – never use it)
• Reminder: "Blocking" communication ↔ After MPI call returns
  • Source process can safely modify send buffer
  • receive buffer on destination process contains message from source

(37)

Blocking Point-To-Point Communication: Synchronous Send

• The sender gets the information that the message has been received.
• Analogous to the beep or okay-sheet of a fax.
• Syncs the two processes

(38)

Blocking Point-To-Point Communication: Buffered Send

• One only knows when the message has left.
• We can write a new message and put it into the post box
• We do not care about the time of delivery. → No need to sync with other process

(39)

Point-to-Point Communication: Blocking Communication Mode

• Completion of send/receive ↔ buffer can safely be reused!

Communication mode   Completion condition                                                   MPI routine (blocking)
Synchronous send     Only completes when the receive has started.                           MPI_SSEND
Buffered send        Always completes, irrespective of the receive process.                 MPI_BSEND
Standard send        Either synchronous or buffered.                                        MPI_SEND
Ready send           Always completes, irrespective of whether the receive has completed.   MPI_RSEND
Receive              Completes when a message has arrived.                                  MPI_RECV

(40)

Point-to-Point Communication: Blocking Synchronous Send MPI_SSEND

• MPI_SSEND completes after message has been accepted by the destination ("handshaking").
• Synchronization of source and destination!
• Predictable and safe behavior!
• MPI_SSEND should be used for debugging purposes!
• Problems:
  • Performance (high latency, risk of serialization – best bandwidth)
  • Deadlock situations (see later)
• Syntax (FORTRAN): like MPI_SEND
    MPI_SSEND(buf, count, datatype, dest, tag, comm, ierror)

(41)

Point-to-Point Communication: Blocking Synchronous Send MPI_SSEND

• Diagrammatic view of synchronous communication:

[Figure: timeline for rank=0 and rank=1. Rank 0 calls MPI_SSEND: post send, wait for recv, transfer, handshake, then do other work. Rank 1 does some work, then calls MPI_RECV: post receive, transfer, handshake, then do other work.]

(42)

Point-to-Point Communication: Blocking Buffered Send MPI_BSEND

• MPI_BSEND:
  1) Copy message to a (system) buffer
  2) Complete
• Completes immediately!
• Predictable behavior & no synchronization!
• Programmer has to attach extra buffer space (MPI_BUFFER_ATTACH / MPI_BUFFER_DETACH)
• Only one buffer can be attached per process at a time
• Additional copy operations!
• Syntax (FORTRAN): (cf. also syntax of MPI_SEND)
    MPI_BSEND(buf, count, datatype, dest, tag, comm, ierror)

(43)

Point-to-Point Communication: MPI_BSEND

• Diagrammatic view of blocking buffered communication:

[Figure: timeline for rank=0 and rank=1. Rank 0: MPI_BUFFER_ATTACH, then MPI_BSEND (copy send -> tmp, post send), do some work while the transfer proceeds, MPI_BUFFER_DETACH, carry on. Rank 1: do some work, then MPI_RECV (post receive, wait for recv, transfer).]

(44)

Point-to-Point Communication: Providing Buffer Space for MPI_BSEND

• Syntax (FORTRAN):
    MPI_BUFFER_ATTACH(tmpbuf, size, ierror)
    tmpbuf: address of buffer space
    size: buffer size in bytes
• Determine size of tmpbuf: (messagesize of all BSENDs) + (number of intended BSENDs * MPI_BSEND_OVERHEAD)
• Best way to get required size for one message:
    call MPI_PACK_SIZE(count, datatype, comm, s, ierror)
    size = s + MPI_BSEND_OVERHEAD
• Free buffer space after use:
    MPI_BUFFER_DETACH(tmpbuf, size, ierror)
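The sizing rule above amounts to a simple sum. The sketch below is plain C with no MPI dependency; the function name `bsend_buffer_size` and its parameters are hypothetical. In a real program each entry of `packed_sizes` would come from MPI_PACK_SIZE and `overhead` would be the constant MPI_BSEND_OVERHEAD from the MPI header.

```c
#include <stddef.h>

/* Bytes to attach for nmsgs buffered sends.
 * packed_sizes[i]: packed size of message i in bytes (assumed to be the value
 *                  MPI_PACK_SIZE would report for that message).
 * overhead:        per-message overhead (MPI_BSEND_OVERHEAD in real code). */
size_t bsend_buffer_size(const size_t *packed_sizes, size_t nmsgs, size_t overhead)
{
    size_t total = 0;
    for (size_t i = 0; i < nmsgs; i++)
        total += packed_sizes[i] + overhead;   /* message size plus overhead */
    return total;
}
```

The overhead is added once per intended send, not once per buffer, which is easy to get wrong when several messages share one attached buffer.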

(45)

Point-to-Point Communication: Semantics: Message order preservation

• Semantics: Rules, guaranteed by MPI implementations
  1) Message Order Preservation (within same communicator)

[Figure: MPI_COMM_WORLD with ranks 0 to 7 and two messages numbered 1 and 2]

(46)

Point-to-Point Communication: Semantics: Guarantees progress

• Progress: It is not possible for a matching send and receive pair to remain permanently outstanding.
• Matching means: data types, tags and receivers match

[Figure: MPI_COMM_WORLD with ranks 0 to 7; a receiving process says "I want one message!"]

(47)

Point-to-Point Communication: Example – what is the problem here?

• Example with 2 processes, each sending a message to the other:

    integer buf(200000)
    if(rank.EQ.0) then
      dest=1
      source=1
    else
      dest=0
      source=0
    end if
    call MPI_SEND(buf, 200000, MPI_INTEGER, dest, 0, &
                  MPI_COMM_WORLD, ierror)
    call MPI_RECV(buf, 200000, MPI_INTEGER, source, 0, &
                  MPI_COMM_WORLD, status, ierror)

This program will not work correctly on all systems!

(48)

Point-to-Point Communication: Deadlocks

• Deadlock: Some of the outstanding blocking communication can never be completed (program stalls)
• Example: MPI_SEND is implemented as synchronous send for large messages!

    rank 0:              rank 1:
    MPI_SEND(buf,..)     MPI_SEND(buf,..)
    MPI_RECV(buf,..)     MPI_RECV(buf,..)

• One remedy: reorder send/receive pair on one process (e.g. rank 0):

    rank 0:              rank 1:
    MPI_RECV(buf,..)     MPI_SEND(buf,..)
    MPI_SEND(buf,..)     MPI_RECV(buf,..)

(49)

Point-to-Point Communication: Deadlocks

• Other possibilities to avoid deadlocks:
  • Use buffered sends (see above)
  • MPI_SENDRECV (see later)
  • Use non-blocking communication (see later)
• Replacing MPI_SEND by MPI_SSEND in an MPI program
  • Can often reveal deadlocks which would otherwise go unnoticed until the code is ported elsewhere
  • Ideally, every MPI program should work with MPI_SSEND alone
  • Probably with low efficiency, but this is for debugging only

(50)

Point-to-Point Communication: Extension: MPI_SENDRECV

Motivation:
• Start Send/Receive call at the same time
• Shift messages, Ring topologies, "Ring shift"
• Avoid deadlocks
• Example:
    MPI_SEND(buf,..)
    MPI_RECV(buf,..)
  Deadlock if MPI_Send behaves synchronously

[Figure: ring of four processes with ranks 0 to 3]

(51)

Point-to-Point Communication: Extension: MPI_SENDRECV

• MPI_SENDRECV: separate buffer for send and receive operation
• Syntax: simple combination of send and receive arguments

    MPI_SENDRECV(sendbuf, sendcount, sendtype, dest, sendtag, &
                 recvbuf, recvcount, recvtype, source, recvtag, &
                 comm, status, ierror)

[Figure: processes 0, 1 and 2, each with a separate send buffer (S) and receive buffer (R)]

(52)

Point-to-Point Communication: Extension: MPI_SENDRECV

• Example (Guarantees progress – no deadlock)

    ! Determine rank of processor to the left
    lrank = rank - 1
    if(rank .eq. 0) lrank=size-1

    ! Determine rank of processor to the right
    rrank = rank + 1
    if(rank .eq. (size-1)) rrank=0

    call MPI_SENDRECV(sendbuf, 5, MPI_INTEGER, rrank, 1, &
                      recvbuf, 5, MPI_INTEGER, lrank, 1, &
                      MPI_COMM_WORLD, status, ierror )
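The neighbour computation in the ring-shift example can be written as two small helper functions. This is a plain C sketch under stated assumptions: the function names, the `periodic` flag and the constant `PROC_NULL` are hypothetical, with `PROC_NULL` standing in for MPI_PROC_NULL, which a real program would take from the MPI header. With `periodic` set to 0 the ends of the chain get no neighbour instead of wrapping around.

```c
/* Hypothetical stand-in for MPI_PROC_NULL (illustration only). */
enum { PROC_NULL = -2 };

/* Rank of the left neighbour; wraps to size-1 in a closed ring. */
int left_neighbor(int rank, int size, int periodic)
{
    if (rank > 0)
        return rank - 1;
    return periodic ? size - 1 : PROC_NULL;
}

/* Rank of the right neighbour; wraps to 0 in a closed ring. */
int right_neighbor(int rank, int size, int periodic)
{
    if (rank < size - 1)
        return rank + 1;
    return periodic ? 0 : PROC_NULL;
}
```

For size = 4 and a closed ring, rank 0 gets left neighbour 3 and rank 3 gets right neighbour 0, matching the Fortran code above.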

(53)

Point-to-Point Communication: Extension: MPI_SENDRECV

• Notes:
  • Completely compatible with point-to-point messages
  • MPI_SENDRECV matches a simple xRECV or xSEND
  • Blocking call
• MPI_PROC_NULL can build an open chain:
  source or destination can be specified as MPI_PROC_NULL
  (cf. example: rank=0: specify lrank=MPI_PROC_NULL; rank=3: specify rrank=MPI_PROC_NULL)

[Figure: open chain of four processes with ranks 0 to 3]

(54)

Point-to-Point Communication: Extension: MPI_SENDRECV_REPLACE

Equivalent to "dangerous" send/recv implementation shown (slide 51)

Syntax: simple combination of send and receive arguments with identical send/receive buffers, data types and counts

    MPI_SENDRECV_REPLACE(buf, count, datatype, dest, sendtag, &
                         source, recvtag, &
                         comm, status, ierror)

[Figure: processes 0, 1 and 2, each with a single buffer buf used for both sending and receiving]

(55)

Blocking Point-to-Point Communication: Summary

• Blocking MPI communication calls:
  • MPI call has been completed → send/receive buffer can safely be reused!
  • No assumption about the matching receive call (destination) except for synchronous send!
  • May require synchronization of the processes (Performance!)
• Available Send communication modes:
  • Synchronous communication (performance drawbacks; deadlock dangers)
  • Buffered communication:
    1) Copy Send buffer to communication buffer
    2) MPI call completes
    3) MPI handles communication
  • Behavior of standard send can be synchronous or buffered or depending on

(56)

Non-Blocking Point-to-Point Communication in MPI

(57)

Non-Blocking Point-To-Point Communication

• After initiating the communication one can return to perform other work.
• At some later time one must test or wait for the completion of the non-blocking operation.

[Figure: non-blocking synchronous send]

(58)

Non-Blocking Point-to-Point Communication: Introduction

• Motivation:
  • Avoid deadlocks
  • Avoid idle processors
  • Avoid useless synchronization
  • Overlap communication and useful work (hide the 'communication cost'; "overlap communication and computation")
  • However: This is not guaranteed by the standard!
• Principle:

[Figure: timeline. The program calls SEND(buf), does some work (without using buf), waits (sync), and then uses buf again. Meanwhile a dedicated helper thread for communication (if available in the MPI library implementation) posts the SEND, waits for the RECV and transfers the data.]

(59)

Non-Blocking Point-to-Point Communication: Introduction

• Detailed steps for non-blocking communication
  1) Setup communication operation (MPI)
  2) Build unique request handle (MPI)
  3) Return request handle and control to user program (MPI)
  4) User program continues while MPI library hopefully performs communication asynchronously
  5) Status of a communication can be probed using its unique request handle

All non-blocking operations must have matching wait (or test) operations as some system or application resources can be freed only when the non-blocking operation is completed.

(60)

Non-Blocking Point-to-Point Communication: Introduction

• The return of a non-blocking communication call does not imply completion of the communication
• Check for completion: Use request handle!
• Do not reuse buffer until completion of communication has been checked!
• Data transfer can be overlapped with user program execution (if supported by hardware and MPI implementation)
  • Most "standard" implementations today do not support fully asynchronous transfers

(61)

Non-Blocking Point-to-Point Communication: Communication Models

• Communication models for non-blocking communication

Non-Blocking Operation    MPI call
Standard send             MPI_ISEND()
Synchronous send          MPI_ISSEND()
Buffered send             MPI_IBSEND()
Ready send                MPI_IRSEND()

(62)

Blocking and Non-Blocking Point-To-Point Communication

• A blocking send can be used with a non-blocking receive, and vice-versa.
• Non-blocking sends can use any mode
  • standard – MPI_ISEND (← focus on this in next slides)
  • synchronous – MPI_ISSEND
  • buffered – MPI_IBSEND
  • ready – MPI_IRSEND
• Synchronous mode affects completion, i.e. MPI_Wait / MPI_Test, not initiation, i.e., MPI_I….
• The non-blocking operation immediately followed by a matching wait is equivalent to the blocking operation

(63)

Non-Blocking Point-to-Point Communication: Standard Send/Receive MPI_ISEND

• Standard non-blocking send
• Do not reuse buf after MPI_Isend has returned!
• Successive test/wait for completion operation is required!
• request handle is unique identifier for each non-blocking communication

(64)

Non-Blocking Point-to-Point Communication: Standard Send/Receive MPI_IRECV

• Standard non-blocking receive
• Do not use buf after MPI_Irecv has returned!
• Successive test/wait for completion operation is required!
• No status argument! Is provided at corresponding wait/test operation
• MPI_Irecv + immediately following wait/test operation: Potential Fortran pitfall (cf. later)

(65)

Non-Blocking Point-to-Point Communication: Standard Send/Receive MPI_ISEND/IRECV

[Figure: timeline. The program calls MPI_ISEND(sendbuf, ..., request,...) and then does some work without using sendbuf. Meanwhile MPI posts the send request, waits for the matching receive and transfers the data.]

How do we know when the transfer is complete?

(66)

Non-Blocking Point-to-Point Communication: Testing for Completion

• Ensure that communication has completed before the send/receive buffer is reused.
• MPI supports multiple concurrent non-blocking calls ("open communications")
• MPI provides two test modes:
  • WAIT type: Wait until the communication has been completed and buffer can safely be reused: Blocking
  • TEST type: Return TRUE (FALSE) if the communication has (not) completed: Non-blocking

Ch k i l f l ti

(67)

Non-Blocking Point-to-Point Communication: Example: Wait for Completion

• Examples:

    integer request
    ...
    call MPI_ISEND(buf,…,request,…)
    call MPI_WAIT(request,status,…)
    (together equivalent to MPI_SEND(buf,…))

    integer request
    ...
    call MPI_IRECV(buf,...,request,...)
    ! DO SOME WORK
    ! DO NOT USE buf
    call MPI_WAIT(request,status,ierror)
    buf(1) = 42.0

(68)

Non-Blocking Point-to-Point Communication: Testing for completion of multiple open communications

• Test multiple open communications for completion:
  • Test all / any / some of the communication requests

Test for completion                 WAIT type (blocking)   TEST type (query only)
At least one; return exactly one    MPI_WAITANY            MPI_TESTANY
Every one                           MPI_WAITALL            MPI_TESTALL
At least one; return all            MPI_WAITSOME           MPI_TESTSOME

(69)

Non-Blocking Point-to-Point Communication: Parallel integration

ƒ Wait for completion of the

I ti

(70)

Non-Blocking Point-to-Point Communication: Pitfalls due to compiler optimization

• Fortran:

    MPI_IRECV(recvbuf, ..., request, ierror)
    MPI_WAIT(request, status, ierror)
    recvbuf = recvbuf + 42

• may be compiled as

    MPI_IRECV(recvbuf, ..., request, ierror)
    registerA = recvbuf
    MPI_WAIT(request, status, ierror)    ! may update recvbuf, but the compiler does not recognize this
    recvbuf = registerA + 42

• i.e. old data is written instead of received data!

• Workarounds:

  • recvbuf may be allocated in a common block, or

  • calling MPI_ADDRESS(recvbuf, iaddr_dummy, ierror) after MPI_WAIT
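A sketch of the second workaround, assuming at least two processes. MPI_ADDRESS was deprecated in MPI-2 and removed in MPI-3, so this example uses its replacement MPI_GET_ADDRESS. The extra call makes recvbuf visible to an external routine after MPI_WAIT, so the compiler has to reload it from memory. Program name and values are illustrative.

```fortran
program irecv_register_pitfall
  use mpi
  implicit none
  integer :: rank, nprocs, ierror, request
  integer :: status(MPI_STATUS_SIZE)
  integer(kind=MPI_ADDRESS_KIND) :: iaddr_dummy
  double precision :: recvbuf

  call MPI_INIT(ierror)
  call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierror)
  call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierror)

  if (nprocs < 2) then
     if (rank == 0) print *, 'run with at least 2 processes'
  else if (rank == 0) then
     recvbuf = 1.0d0
     call MPI_SEND(recvbuf, 1, MPI_DOUBLE_PRECISION, 1, 0, &
                   MPI_COMM_WORLD, ierror)
  else if (rank == 1) then
     recvbuf = 0.0d0
     call MPI_IRECV(recvbuf, 1, MPI_DOUBLE_PRECISION, 0, 0, &
                    MPI_COMM_WORLD, request, ierror)
     call MPI_WAIT(request, status, ierror)
     ! dummy call: tells the compiler that recvbuf may have changed
     call MPI_GET_ADDRESS(recvbuf, iaddr_dummy, ierror)
     recvbuf = recvbuf + 42
     print *, 'rank 1: recvbuf =', recvbuf
  end if

  call MPI_FINALIZE(ierror)
end program irecv_register_pitfall
```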

(71)

Non-Blocking Point-to-Point Communication and strided sub-arrays

• Fortran allows passing sub-arrays:

    MPI_ISEND(buf(7,:,:), ..., request_handle, ierror)
    ! other work
    MPI_WAIT(request_handle, status, ierror)

  • The content of this non-contiguous sub-array is stored in a temporary array.

  • Then MPI_ISEND is called.

  • On return, the temporary array is released.

  • The data may be transferred while other work is done, or inside of MPI_WAIT, but the data in the temporary array is already lost!

• Do not use non-contiguous sub-arrays in non-blocking calls!

• For contiguous sub-arrays: Use the first sub-array element (buf(1,1,9)) instead of the whole sub-array (buf(:,:,9:13))

• Call by reference is necessary → call by in-and-out-copy is forbidden
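One possible way to follow this rule (an assumption of this sketch, not a recipe from the slides) is to copy the non-contiguous section into an explicit contiguous array that stays alive until MPI_WAIT returns. Array extents, the name plane and the tag are illustrative; at least two processes are assumed.

```fortran
program strided_isend
  use mpi
  implicit none
  integer, parameter :: n1 = 8, n2 = 3, n3 = 2   ! hypothetical extents
  integer :: rank, nprocs, ierror, request
  integer :: status(MPI_STATUS_SIZE)
  double precision :: buf(n1, n2, n3)
  double precision :: plane(n2, n3)         ! user-managed contiguous copy

  call MPI_INIT(ierror)
  call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierror)
  call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierror)

  if (nprocs < 2) then
     if (rank == 0) print *, 'run with at least 2 processes'
  else if (rank == 0) then
     buf = 1.0d0
     buf(7, :, :) = 7.0d0
     plane = buf(7, :, :)                   ! explicit copy, lives until MPI_WAIT
     call MPI_ISEND(plane, n2*n3, MPI_DOUBLE_PRECISION, 1, 0, &
                    MPI_COMM_WORLD, request, ierror)
     ! ... other work that does not modify plane ...
     call MPI_WAIT(request, status, ierror)
  else if (rank == 1) then
     call MPI_RECV(plane, n2*n3, MPI_DOUBLE_PRECISION, 0, 0, &
                   MPI_COMM_WORLD, status, ierror)
     buf = 0.0d0
     buf(7, :, :) = plane                   ! unpack into the strided section
     print *, 'rank 1: buf(7,1,1) =', buf(7, 1, 1)
  end if

  call MPI_FINALIZE(ierror)
end program strided_isend
```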

(72)

Collective Communication in MPI

(73)

Collective Communication Introduction

Collective communication always involves every process in the specified communicator

• Features:

  • All processes must call the subroutine

• Remarks:

  • All processes must call the subroutine!

  • All processes must call the subroutine!!

  • Always blocking: buffer can be reused after return

  • May or may not synchronize the processes

  • Cannot interfere with point-to-point communication

  • Datatype matching

  • No tags

  • Sent message must fill receive buffer (count is exact)

  • Can be “built” out of point-to-point communications by hand; however, collective communication may allow optimized internal implementations, e.g., tree-based algorithms

(74)

Collective Communication: Barriers

• Synchronize processes:

(75)

Collective Communication: Synchronization

• Syntax:

    MPI_BARRIER(comm, ierror)

• MPI_BARRIER blocks the calling process until all other group members (=processes) have called it.

• MPI_BARRIER is hardly ever needed, since all synchronization is done automatically by the data communication. However, it can be useful for debugging, profiling, …

[Figure: timeline (t) of processes 0–3 synchronizing at MPI_BARRIER]
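For the profiling use case, a barrier before and after a code section gives all processes a common start and makes the measured time that of the slowest process. A minimal sketch; the uneven dummy loop is only a stand-in for real work.

```fortran
program barrier_timing
  use mpi
  implicit none
  integer :: rank, ierror, i
  double precision :: t0, t1, x

  call MPI_INIT(ierror)
  call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierror)

  call MPI_BARRIER(MPI_COMM_WORLD, ierror)  ! common starting point
  t0 = MPI_WTIME()

  ! hypothetical, deliberately unbalanced work: higher ranks do more
  x = 0.0d0
  do i = 1, 1000000*(rank + 1)
     x = x + sqrt(dble(i))
  end do

  call MPI_BARRIER(MPI_COMM_WORLD, ierror)  ! wait for the slowest process
  t1 = MPI_WTIME()

  if (rank == 0) print *, 'elapsed time [s]:', t1 - t0, ' (x =', x, ')'

  call MPI_FINALIZE(ierror)
end program barrier_timing
```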

(76)

Collective Communication: Broadcast

• A one-to-many communication.

(77)

Collective Communication: Broadcast

• Every process receives one copy of the message from a root process

• Syntax:

    call MPI_BCAST(buffer, count, datatype, root, comm, ierror)
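A short sketch of a broadcast: the root fills an array of three (made-up) parameters and every process ends up with a copy. Note that all processes, including the root, issue the same call.

```fortran
program bcast_example
  use mpi
  implicit none
  integer, parameter :: root = 0
  integer :: rank, ierror
  double precision :: params(3)             ! hypothetical input parameters

  call MPI_INIT(ierror)
  call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierror)

  if (rank == root) then
     params = (/ 0.0d0, 2.0d0, 1.0d-6 /)    ! e.g. read from an input file
  else
     params = -1.0d0                        ! overwritten by the broadcast
  end if

  ! buffer is the send buffer on root and the receive buffer elsewhere
  call MPI_BCAST(params, 3, MPI_DOUBLE_PRECISION, root, &
                 MPI_COMM_WORLD, ierror)

  print *, 'rank', rank, 'has params', params

  call MPI_FINALIZE(ierror)
end program bcast_example
```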

(78)

Collective Communication: Scatter

• Root process scatters data to all processes

  • Specify root process (cf. example: root=1)

  • Send and receive details are different

[Figure: sendbuf on the root process is distributed to recvbuf on each process]

(79)

Collective Communication: Scatter / Scatterv syntax

• MPI_SCATTER(sendbuf, sendcount, sendtype, recvbuf, recvcount, recvtype, root, comm, ierror)

  • root process sends the i-th segment of sendbuf to the i-th process

  • In general: recvcount = sendcount

  • sendbuf is ignored for all non-root processes

• More flexible: Send segments of different sizes to different processes!

• MPI_SCATTERV(sendbuf, sendcount, displ, sendtype, recvbuf, recvcount, recvtype, root, comm, ierror)

    integer sendcount(*), displ(*)

  • sendcount(*): contains the message count for each process

  • displ(*): “pointer” to the starting position of each segment in sendbuf (offset in units of sendtype)
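A minimal MPI_SCATTER sketch with equal-sized segments: the root distributes two integers to every process. The segment length chunk and the data are illustrative assumptions.

```fortran
program scatter_example
  use mpi
  implicit none
  integer, parameter :: root = 0, chunk = 2 ! hypothetical segment length
  integer :: rank, nprocs, ierror, i
  integer, allocatable :: sendbuf(:)
  integer :: recvbuf(chunk)

  call MPI_INIT(ierror)
  call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierror)
  call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierror)

  if (rank == root) then
     allocate(sendbuf(chunk*nprocs))
     sendbuf = (/ (i, i = 1, chunk*nprocs) /)
  else
     allocate(sendbuf(1))                   ! ignored on non-root processes
  end if

  ! sendcount is the count sent to EACH process, not the total
  call MPI_SCATTER(sendbuf, chunk, MPI_INTEGER, &
                   recvbuf, chunk, MPI_INTEGER, &
                   root, MPI_COMM_WORLD, ierror)

  print *, 'rank', rank, 'got', recvbuf

  deallocate(sendbuf)
  call MPI_FINALIZE(ierror)
end program scatter_example
```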

(80)

Collective Communication: Gather

• Root process gathers data from all processes

  • Specify root process (cf. example: root=1)

  • Send and receive details are different

[Figure: sendbuf of each process is collected into recvbuf on the root process]

(81)

Collective Communication: Gather/Gatherv

• Gather: MPI_GATHER(sendbuf, sendcount, sendtype, recvbuf, recvcount, recvtype, root, comm, ierror)

  • Each process sends sendbuf to root process

  • root process receives messages and stores them in rank order

  • In general: recvcount = sendcount

  • recvbuf is ignored for all non-root processes

• MPI_GATHERV is also available
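A minimal MPI_GATHER sketch: each process contributes one integer and the root receives them in rank order. The value 10*rank is only a placeholder for real local data.

```fortran
program gather_example
  use mpi
  implicit none
  integer, parameter :: root = 0
  integer :: rank, nprocs, ierror, sendval
  integer, allocatable :: recvbuf(:)

  call MPI_INIT(ierror)
  call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierror)
  call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierror)

  sendval = 10*rank                         ! hypothetical local result

  if (rank == root) then
     allocate(recvbuf(nprocs))
  else
     allocate(recvbuf(1))                   ! ignored on non-root processes
  end if

  ! recvcount is the count received from EACH process, not the total
  call MPI_GATHER(sendval, 1, MPI_INTEGER, &
                  recvbuf, 1, MPI_INTEGER, &
                  root, MPI_COMM_WORLD, ierror)

  if (rank == root) print *, 'gathered in rank order:', recvbuf

  deallocate(recvbuf)
  call MPI_FINALIZE(ierror)
end program gather_example
```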

(82)

Collective Communication: Gather/Scatter

Example MPI_SCATTERV: root=0, sendcount = {2,1,2,1}, displ = {0,2,4,7}

sendbuf on process 0: a[1] a[2] a[3] a[4] a[5] a[6] a[7] a[8]

Process   recvbuf
0         a[1] a[2]
1         a[3]
2         a[5] a[6]
3         a[8]

• MPI_GATHERV: Receive messages of different sizes from different processes! (Specify displ and recvcount array with the receiving parameters)
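The MPI_SCATTERV example above, with sendcount = {2,1,2,1} and displ = {0,2,4,7}, can be written out as follows. This sketch assumes exactly four processes and fills a with a(i) = i so that the received values identify the elements.

```fortran
program scatterv_example
  use mpi
  implicit none
  integer, parameter :: root = 0
  integer :: rank, nprocs, ierror, i, recvcount
  integer :: a(8), recvbuf(2)
  integer :: sendcount(4), displ(4)

  call MPI_INIT(ierror)
  call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierror)
  call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierror)

  if (nprocs /= 4) then
     if (rank == 0) print *, 'this sketch needs exactly 4 processes'
     call MPI_FINALIZE(ierror)
     stop
  end if

  sendcount = (/ 2, 1, 2, 1 /)              ! elements per process
  displ     = (/ 0, 2, 4, 7 /)              ! zero-based offsets into a
  a = (/ (i, i = 1, 8) /)                   ! only significant on root

  recvcount = sendcount(rank + 1)
  recvbuf = 0

  call MPI_SCATTERV(a, sendcount, displ, MPI_INTEGER, &
                    recvbuf, recvcount, MPI_INTEGER, &
                    root, MPI_COMM_WORLD, ierror)

  print *, 'rank', rank, 'received', recvbuf(1:recvcount)

  call MPI_FINALIZE(ierror)
end program scatterv_example
```

Elements a(4) and a(7) are skipped by the displacements and are not sent to any process.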

(83)

Collective Communication: Reduction Operations

• Combine data from several processes

(84)

Global Operations: Introduction

Compute a result which involves data distributed across all processes of a group

• Example: global maximum of a variable: max( var[rank] )

• MPI provides 12 predefined operations

• Definition of user-defined operations: MPI_OP_CREATE & MPI_OP_FREE

• MPI assumes that the operations are associative! (Floating-point operations may not be exactly associative because of rounding errors.)
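The rounding remark can be checked without MPI. In IEEE double precision, the two groupings below give 1 and 0, respectively, because 1 is lost when added to -1.0d20 first. A reduction may combine partial results in a different order for different process counts, so sums can differ in the last digits. The values are chosen only to make the effect obvious.

```fortran
program fp_associativity
  implicit none
  double precision :: a, b, c

  a =  1.0d20
  b = -1.0d20
  c =  1.0d0

  ! parentheses fix the evaluation order
  print *, '(a + b) + c =', (a + b) + c     ! a + b is exactly 0, then + 1
  print *, 'a + (b + c) =', a + (b + c)     ! b + c rounds back to b
end program fp_associativity
```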

(85)

Global Operations: Predefined Operations

Name       Operation      Name       Operation
MPI_SUM    Sum            MPI_PROD   Product
MPI_MAX    Maximum        MPI_MIN    Minimum
MPI_LAND   Logical AND    MPI_BAND   Bit-AND
MPI_LOR    Logical OR     MPI_BOR    Bit-OR
MPI_LXOR   Logical XOR    MPI_BXOR   Bit-XOR

(86)

Global Operations: Syntax

• Results stored on root process:

    call MPI_REDUCE(sendbuf, recvbuf, count, datatype, op, &
                    root, comm, ierror)

  • Result in recvbuf on root process.

  • Status of recvbuf on other processes is undefined.

  • count > 1: Perform operations on all 'count' elements of an array

If results are desired to be stored on all processes:

• MPI_ALLREDUCE: No root argument

• Combination of MPI_REDUCE with a successive MPI_BCAST of the result from the root process
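A sketch of the global-maximum case with the result on all processes. The local value var is a placeholder (here simply the rank); every process prints the same maximum.

```fortran
program allreduce_example
  use mpi
  implicit none
  integer :: rank, ierror
  double precision :: var, gmax

  call MPI_INIT(ierror)
  call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierror)

  var = dble(rank)                          ! hypothetical local value

  ! same argument list as MPI_REDUCE, but without root
  call MPI_ALLREDUCE(var, gmax, 1, MPI_DOUBLE_PRECISION, MPI_MAX, &
                     MPI_COMM_WORLD, ierror)

  print *, 'rank', rank, 'sees global maximum', gmax

  call MPI_FINALIZE(ierror)
end program allreduce_example
```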

(87)

Global Operations: Example

Compute e(i) = max{a(i),b(i),c(i),d(i)} (i=1,2,3,4)

Process   Data distribution
0         a[1] a[2] a[3] a[4]
1         b[1] b[2] b[3] b[4]
2         c[1] c[2] c[3] c[4]
3         d[1] d[2] d[3] d[4]

    MPI_REDUCE(...,e,4,...,MPI_MAX,0,...)

Result on process 0: e[1] e[2] e[3] e[4]
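The element-wise maximum can be sketched as below. Each process holds one four-element array x (playing the role of a, b, c or d), filled with made-up values so that the maximum of each element comes from a different process.

```fortran
program reduce_max_example
  use mpi
  implicit none
  integer, parameter :: root = 0
  integer :: rank, nprocs, ierror, i
  double precision :: x(4), e(4)

  call MPI_INIT(ierror)
  call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierror)
  call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierror)

  ! hypothetical local data
  x = (/ (dble(mod(rank + i, nprocs)), i = 1, 4) /)
  e = 0.0d0

  ! count = 4: the maximum is taken separately for each array element
  call MPI_REDUCE(x, e, 4, MPI_DOUBLE_PRECISION, MPI_MAX, &
                  root, MPI_COMM_WORLD, ierror)

  if (rank == root) print *, 'e =', e       ! e is only defined on root

  call MPI_FINALIZE(ierror)
end program reduce_max_example
```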

(88)

Global Reduction: Parallel integration

    call MPI_Comm_size(MPI_COMM_WORLD, size, ierror)
    call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierror)

    ! integration limits
    a=0.d0 ; b=2.d0 ; res=0.d0
    mya=a+rank*(b-a)/size
    myb=mya+(b-a)/size

    ! integrate f(x) over my own chunk
    psum = integrate(mya,myb)

    call MPI_Reduce(psum, res, 1, &
                    MPI_DOUBLE_PRECISION, MPI_SUM, &
                    0, MPI_COMM_WORLD, ierror)
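The fragment above leaves integrate and the integrand undefined. A complete sketch follows, assuming f(x) = x**2 and a midpoint rule with a fixed number of steps; both are illustrative choices, not taken from the slides. The variable nprocs replaces size to avoid shadowing the Fortran intrinsic.

```fortran
program parallel_integration
  use mpi
  implicit none
  integer :: rank, nprocs, ierror
  double precision :: a, b, mya, myb, psum, res

  call MPI_INIT(ierror)
  call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierror)
  call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierror)

  ! integration limits
  a = 0.d0 ; b = 2.d0 ; res = 0.d0
  mya = a + rank*(b - a)/nprocs
  myb = mya + (b - a)/nprocs

  ! integrate f(x) over my own chunk
  psum = integrate(mya, myb)

  call MPI_REDUCE(psum, res, 1, MPI_DOUBLE_PRECISION, MPI_SUM, &
                  0, MPI_COMM_WORLD, ierror)

  ! exact value for f(x) = x**2 on [0,2] is 8/3
  if (rank == 0) print *, 'integral =', res

  call MPI_FINALIZE(ierror)

contains

  ! assumed integrand
  double precision function f(x)
    double precision, intent(in) :: x
    f = x*x
  end function f

  ! assumed quadrature: midpoint rule with a fixed number of steps
  double precision function integrate(lo, hi)
    double precision, intent(in) :: lo, hi
    integer, parameter :: nsteps = 100000
    integer :: i
    double precision :: h
    h = (hi - lo)/nsteps
    integrate = 0.d0
    do i = 1, nsteps
       integrate = integrate + f(lo + (i - 0.5d0)*h)
    end do
    integrate = integrate*h
  end function integrate

end program parallel_integration
```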

(89)

Communication schemes: Survey

               Point-to-Point                                 Collective

Blocking       MPI_SEND(mybuf…)                               MPI_BARRIER(…)
               MPI_SSEND(mybuf…)                              MPI_BCAST(…)
               MPI_BSEND(mybuf…)                              MPI_ALLREDUCE(…)
               MPI_RECV(mybuf…)                               (All processes of the communicator
               MPI_SENDRECV(mybuf…)                           must call the operation!)
               (mybuf can be modified after call returns)

Non-blocking   MPI_ISEND(mybuf…)                              ---
               MPI_IRECV(mybuf…)
               MPI_WAIT/MPI_TEST
               MPI_WAITALL/MPI_TESTALL
               (mybuf may only be modified after WAIT/TEST)

Blocking and non-blocking point-to-point calls can be mixed, e.g. MPI_Send can be matched by an appropriate MPI_Irecv(…)
