• No results found

Message Passing Interface (MPI)

N/A
N/A
Protected

Academic year: 2021

Share "Message Passing Interface (MPI)"

Copied!
394
0
0

Loading.... (view fulltext now)

Full text

(1)

Message Passing Interface

(MPI)

JalelChergui Isabelle Dupays Denis Girou Pierre-François Lavallée Dimitri Lecas Philippe Wautelet

(2)

MPI – Plan I

1 Introduction . . . 7

1.1 Availability and updating . . . 8

1.2 Definitions . . . 9

1.3 Message Passing concepts . . . 12

1.4 History . . . 15 1.5 Library . . . 17 2 Environment . . . 21 2.1 Description . . . 22 2.2 Exemple . . . 25 3 Point-to-point Communications . . . 26 3.1 General Concepts . . . 27

3.2 Predefined MPI Datatypes . . . 33

3.3 Other Possibilities . . . 35

3.4 Example: communication ring . . . 41

4 Collective communications . . . 54

4.1 General concepts . . . 55

4.2 Global synchronization: MPI_BARRIER(). . . 57

4.3 Broadcast : MPI_BCAST(). . . 60

4.4 Scatter :MPI_SCATTER(). . . 66

4.5 Gather : MPI_GATHER(). . . 72

4.6 Gather-to-all :MPI_ALLGATHER(). . . 78

(3)

MPI – Plan II

4.8 All-to-all : MPI_ALLTOALL(). . . 91 4.9 Global reduction . . . 98 4.10 Additions . . . 120 5 One-sided Communication . . . 121 5.1 Introduction . . . 122

5.1.1 Reminder : The Concept of Message-Passing . . . 123

5.1.2 The Concept of One-sided Communication . . . 124

5.1.3 RMA Approach of MPI . . . 125

5.2 Memory Window . . . 126

5.3 Data Transfer . . . 130

5.4 Completion of the Transfer : the synchronization . . . 134

5.4.1 Active Target Synchronization . . . 135

5.4.2 Passive Target Synchronization . . . 144

5.5 Conclusions . . . 148 6 Derived datatypes. . . .149 6.1 Introduction . . . 150 6.2 Contiguous datatypes . . . 152 6.3 Constant stride . . . 153 6.4 Other subroutines . . . 155 6.5 Examples . . . 156

(4)

MPI – Plan III

6.5.3 The datatype "matrix block" . . . 160

6.6 Homogenous datatypes of variable strides . . . 162

6.7 Subarray Datatype Constructor . . . 176

6.8 Heterogenous datatypes . . . 181

6.9 Subroutines annexes . . . 190

6.10 Conclusion . . . 193

7 Optimizations . . . 194

7.1 Introduction . . . 195

7.2 Point-to-Point Send Modes . . . 196

7.2.1 Key Terms . . . 197

7.2.2 Synchronous Sends . . . 198

7.2.3 Buffered Sends . . . 200

7.2.4 Standard Sends . . . 204

7.2.5 Ready Sends. . . .205

7.2.6 Performances of Send Modes . . . 206

7.3 Computation-Communication Overlap . . . 207

8 Communicators . . . 213

8.1 Introduction . . . 214

8.2 Default Communicator . . . 215

8.3 Example. . . .216

8.4 Groups and Communicators . . . 218

(5)

MPI – Plan IV

8.6 Topologies . . . 224

8.6.1 Cartesian topologies . . . 225

8.6.2 Subdividing a cartesian topology. . . .240

9 MPI-IO . . . 246 9.1 Introduction . . . 247 9.1.1 Presentation . . . 247 9.1.2 Issues . . . 249 9.1.3 Definitions. . . .252 9.2 File management . . . 254

9.3 Reads/Writes : general concepts . . . 257

9.4 Individual Reads/Writes . . . 261

9.4.1 Via explicit offsets . . . 261

9.4.2 Via Individual File Pointers. . . .270

9.4.3 Shared File Pointers . . . 282

9.5 Collective Reads/Writes . . . 289

9.5.1 Via explicit offsets . . . 290

9.5.2 Via individual file pointers . . . 294

9.5.3 Via shared file pointers . . . 308

9.6 Explicit positioning of pointers in a file . . . 314

9.7 Definition of views . . . 323

(6)

MPI – Plan V

9.8.2 Via individual file pointers . . . 383

9.8.3 NonBlocking and collective Reads/Writes . . . 387

9.9 Advices. . . .392

(7)

7/246 1 – Introduction

1 Introduction

1.1 Availability and updating . . . 8

1.2 Definitions . . . 9

1.3 Message Passing concepts . . . 12

1.4 History . . . 15 1.5 Library . . . 17 2 Environment 3 Point-to-point Communications 4 Collective communications 5 One-sided Communication 6 Derived datatypes 7 Optimizations 8 Communicators 9 MPI-IO 10 Conclusion

(8)

8/246 1 – Introduction 1.1 – Availability and updating

1 – Introduction

1.1 – Availability and updating

This document is likely to be updated regularly. The most recent version is available on the Web server of IDRIS, sectionSupport de cours: https://cours.idris.fr

IDRIS

Institut du développement et des ressources en informatique scientifique Rue John Von Neumann

Bâtiment 506 BP 167

91403 ORSAY CEDEX France

http://www.idris.fr

(9)

9/246 1 – Introduction 1.2 – Definitions

1 – Introduction

1.2 – Definitions

The sequential programming model:

the program is executed by one and only one process ;

all the variables and constants of the program are allocated in the memory of the process ;

a process is executed on a physical processor of the machine.

MEMORY

P

Program

(10)

10/246 1 – Introduction 1.2 – Definitions

In themessage passingprogramming model :

the program is written in a classic language (Fortran,C,C++, etc.) ;

all the variables of the program are private and reside in the local memory of each process ;

each process may execute different parts of a program ;

a variable is exchanged between two or many processes via a call to subroutines. Memories Processes Network Programs 0 1 2 3

(11)

11/246 1 – Introduction 1.2 – Definitions

TheSPMDexecution model :

SingleProgram,MultipleData;

the same program is executed by all the processes ;

all machines support this programming model and some support only this one ;

it is a particular case of the general model ofMPMD(MultipleProgram, MultipleData), which it can also emulate.

Memories Processes Network Single program 0 1 2 3 d0 d1 d2 d3

(12)

12/246 1 – Introduction 1.3 – Message Passing concepts

1 – Introduction

1.3 – Message Passing concepts

If a message is sent to a process, the latter must receive it

t

target

source

I send

I receive

1

0

(13)

13/246 1 – Introduction 1.3 – Message Passing concepts

A message consists of data chunks passing from the sending process to the receiving process

In addition to the data (scalar variables, arrays, etc.) to be sent, a message has to contain the following information :

the identifier of the sending process ;the datatype ;

the length ;

the identifier of the receiving process.

Memories Processes Network sender receiver datatype length DATA d1 2 d2 1 d1 message message

(14)

14/246 1 – Introduction 1.3 – Message Passing concepts

The exchanged messages are interpreted and managed by an environment comparable to telephony, fax, postal mail, e-mail, etc.

The message is sent to a specified address.

The receiving process must be able to classify and interpret the messages which are sent to it.

The environment in question is MPI (MessagePassingInterface). An MPI application is a group of independent processes executing, each one, their own code and communicating via calls to MPI library subroutines.

(15)

15/246 1 – Introduction 1.4 – History

1 – Introduction

1.4 – History

Version 1.0: in June 1994, the MPI forum (MessagePassingInterfaceForum), with the participation of about forty organizations, came to the definition of a set of subroutines concerning theMPI message passing library.

Version 1.1: June 1995, with only minor changes.

Version 1.2: in 1997, with minor changes for a better consistency of the naming of some subroutines.

Version 1.3: September 2008, with clarifications in MPI 1.2, according to clarifications themselves made by MPI-2.1

Version 2.0: released in July 1997, this version brought deliberately

non-integrated, essential additions in MPI 1.0 (process dynamic management, one-sided communications, parallel I/O, etc.).

Version 2.1: June 2008, with only clarifications in MPI 2.0 but without any changes.

(16)

16/246 1 – Introduction 1.4 – History

Version 3.0: September 2012

changes and important additions compared to version 2.2 ;main changes :

nonblocking collective communications ;

revision of the implementation of one-sided communications ;Fortran (2003-2008) bindings ;

C++ bindings removed ;

interfacing of external tools (for debugging and performance measures) ;etc.

seehttp://meetings.mpi-forum.org/MPI_3.0_main_page.php https://svn.mpi-forum.org/trac/mpi-forum-web/wiki

(17)

17/246 1 – Introduction 1.5 – Library

1 – Introduction

1.5 – Library

Message Passing Interface Forum,MPI: A Message-Passing Interface Standard, Version 3.0, High Performance Computing Center Stuttgart (HLRS), 2012 https://fs.hlrs.de/projects/par/mpi/mpi30/

Message Passing Interface Forum,MPI: A Message-Passing Interface Standard, Version 2.2, High Performance Computing Center Stuttgart (HLRS), 2009 https://fs.hlrs.de/projects/par/mpi/mpi22/

William Gropp, Ewing Lusk et Anthony Skjellum : Using MPI: Portable Parallel Programming with the Message Passing Interface, second edition, MIT Press, 1999

William Gropp, Ewing Lusk et Rajeev Thakur : Using MPI-2, MIT Press, 1999 ☞ Peter S. Pacheco : Parallel Programming with MPI, Morgan Kaufman Ed., 1997 ☞ Additional references :

http://www.mpi-forum.org/docs/

http://www.mpi-forum.org/mpi2_1/index.htm

http://www.mcs.anl.gov/research/projects/mpi/learning.html http://www.mpi-forum.org/archives/notes/notes.html

(18)

18/246 1 – Introduction 1.5 – Library

MPI open source implementations : they can be installed on a large number of architectures but their performances are generally below the ones of the constructors’ implementations.

1 MPICH2: http://www.mpich.org/ 2 Open MPI:http://www.open-mpi.org/

(19)

19/246 1 – Introduction 1.5 – Library

The debuggers

1 Totalviewhttp://www.roguewave.com/products/totalview.aspx 2 DDThttp://www.allinea.com/products/ddt/

Tools for performance measurements 1 MPE:MPI Parallel Environment

http://www.mcs.anl.gov/research/projects/perfvis/download/index.htm 2 Scalasca: Scalable Performance Analysis of Large-Scale Applications

http://www.scalasca.org/ 3 Vampir

(20)

20/246 1 – Introduction 1.5 – Library

Some open source parallel scientific libraries :

1 ScaLAPACK: linear algebraic problem solvers using direct methods. The

sources can be downloaded from the website http://www.netlib.org/scalapack/

2 PETSc: linear and non-linear algebraic problem solvers using iterative methods. The sources can be downloaded from the website

http://www.mcs.anl.gov/petsc/

3 MUMPS: MUltifrontal Massively Parallel sparse direct Solvers. The sources can be downloaded from the websitehttp://graal.ens-lyon.fr/MUMPS/ 4 FFTW: fast Fourier transforms. The sources can be downloaded from the

(21)

21/246 2 – Environment 1 Introduction 2 Environment 2.1 Description . . . 22 2.2 Exemple . . . 25 3 Point-to-point Communications 4 Collective communications 5 One-sided Communication 6 Derived datatypes 7 Optimizations 8 Communicators 9 MPI-IO 10 Conclusion

(22)

22/246 2 – Environment 2.1 – Description

2 – Environment

2.1 – Description

Every program unit calling MPI subroutines has to include a header file. In Fortran, we must use thempi module introduced in MPI-2 (in MPI-1, it was the mpif.h file), and in C/C++ the mpi.h file.

TheMPI_INIT() subroutine initializes the necessary environment :

integer, intent(out) :: code

call MPI_INIT(code)

TheMPI_FINALIZE() subroutine disables this environment :

integer, intent(out) :: code

(23)

23/246 2 – Environment 2.1 – Description

All the operations made by MPI are related tocommunicators. The default communicator isMPI_COMM_WORLD which includes all the active processes.

0 1

3

2 4

5 6

(24)

24/246 2 – Environment 2.1 – Description

At any moment, we can know the number of processes managed by a given communicator by theMPI_COMM_SIZE() subroutine :

integer, intent(out) :: nb_procs,code

call MPI_COMM_SIZE( MPI_COMM_WORLD ,nb_procs,code)

Similarly, the MPI_COMM_RANK() subroutine allows to obtain the process rank (i.e. its instance number, which is a number between 0 and the value sent by

MPI_COMM_SIZE() – 1) :

integer, intent(out) :: rank,code

(25)

25/246 2 – Environment 2.2 – Exemple

2 – Environment

2.2 – Exemple 1 program who_am_I 2 use mpi 3 implicit none 4 integer :: nb_procs,rank,code 5 6 call MPI_INIT(code) 7

8 call MPI_COMM_SIZE( MPI_COMM_WORLD ,nb_procs,code)

9 call MPI_COMM_RANK( MPI_COMM_WORLD ,rank,code)

10

11 print *,’I am the process ’,rank,’ among ’,nb_procs

12

13 call MPI_FINALIZE(code)

14 end program who_am_I

> mpiexec -n 7 who_am_I

I am the process 3 among 7

I am the process 0 among 7

I am the process 4 among 7

I am the process 1 among 7

I am the process 5 among 7

I am the process 2 among 7

(26)

26/246 3 – Point-to-point Communications

1 Introduction 2 Environment

3 Point-to-point Communications

3.1 General Concepts . . . 27 3.2 Predefined MPI Datatypes . . . 33 3.3 Other Possibilities . . . 35 3.4 Example: communication ring . . . 41 4 Collective communications 5 One-sided Communication 6 Derived datatypes 7 Optimizations 8 Communicators 9 MPI-IO 10 Conclusion

(27)

27/246 3 – Point-to-point Communications 3.1 – General Concepts

3 – Point-to-point Communications

3.1 – General Concepts

Apoint-to-point communication occurs between two processes, one is called the senderprocess and the other one is called thereceiverprocess.

0 1

3

2 4

5 6

(28)

27/246 3 – Point-to-point Communications 3.1 – General Concepts

3 – Point-to-point Communications

3.1 – General Concepts

Apoint-to-point communication occurs between two processes, one is called the senderprocess and the other one is called thereceiverprocess.

0 1

3

2 4

5 6

Sender

(29)

27/246 3 – Point-to-point Communications 3.1 – General Concepts

3 – Point-to-point Communications

3.1 – General Concepts

Apoint-to-point communication occurs between two processes, one is called the senderprocess and the other one is called thereceiverprocess.

0 1 3 2 4 5 6 Sender Receiver

(30)

27/246 3 – Point-to-point Communications 3.1 – General Concepts

3 – Point-to-point Communications

3.1 – General Concepts

Apoint-to-point communication occurs between two processes, one is called the senderprocess and the other one is called thereceiverprocess.

0 1 3 2 4 5 6 Sender Receiver 1000

(31)

28/246 3 – Point-to-point Communications 3.1 – General Concepts

The sender and the receiver are identified by theirrankin the communicator.The so-calledmessage envelopeis composed of:

the rank of the sender process;the rank of the receiver process;the tag of the message;

the communicator which define the process group and communication context.

The exchanged data are predefined (integer, real, etc) or personal derived datatypes.

There are in each case many communicationmodes, calling different protocols which will be discussed in Chapter 7.

(32)

29/246 3 – Point-to-point Communications 3.1 – General Concepts

1 program point_to_point

2 use mpi

3 implicit none

4

5 integer, dimension( MPI_STATUS_SIZE ) :: status

6 integer, parameter :: tag=100

7 integer :: rank,value,code

8

9 call MPI_INIT(code)

10

11 call MPI_COMM_RANK( MPI_COMM_WORLD ,rank,code)

12

13 if (rank ==2) then

14 value=1000

15 call MPI_SEND(value,1, MPI_INTEGER ,5,tag, MPI_COMM_WORLD ,code)

16 elseif (rank ==5) then

17 call MPI_RECV(value,1, MPI_INTEGER ,2,tag, MPI_COMM_WORLD ,status,code)

18 print *,’I, process 5, I have received ’,value,’ from the process2’

19 end if

20

21 call MPI_FINALIZE(code)

22

23 end program point_to_point

> mpiexec -n 7 point_to_point

(33)

30/246 3 – Point-to-point Communications 3.2 – Predefined MPI Datatypes

3 – Point-to-point Communications

3.2 – Predefined MPI Datatypes

MPI Type Fortran Type

MPI_INTEGER INTEGER

MPI_REAL REAL

MPI_DOUBLE_PRECISION DOUBLE PRECISION

MPI_COMPLEX COMPLEX

MPI_LOGICAL LOGICAL

MPI_CHARACTER CHARACTER

(34)

31/246 3 – Point-to-point Communications 3.2 – Predefined MPI Datatypes

MPI Type C Type

MPI_CHAR signed char

MPI_SHORT signed short

MPI_INT signed int

MPI_LONG signed long int

MPI_UNSIGNED_CHAR unsigned char MPI_UNSIGNED_SHORT unsigned short MPI_UNSIGNED unsigned int MPI_UNSIGNED_LONG unsigned long int

MPI_FLOAT float

MPI_DOUBLE double

MPI_LONG_DOUBLE long double Table 2: Predefined MPI Datatypes (C)

(35)

32/246 3 – Point-to-point Communications 3.3 – Other Possibilities

3 – Point-to-point Communications

3.3 – Other Possibilities

On the reception of a message, the rank of the sender and the tag can be replaced respectively by the MPI_ANY_SOURCE and MPI_ANY_TAGwildcards. ☞ A communication with the dummy process of rankMPI_PROC_NULL has no effect. ☞ MPI_STATUS_IGNORE is a predefined constant that can be used instead of the

status variable.

MPI_SUCCESSis a predefined constant which allows testing the return code of a MPI subroutine.

There are syntactic variants,MPI_SENDRECV() and MPI_SENDRECV_REPLACE(), which combine a send and a receive operations (for the former, the receive memory zone must be different from the send one).

It is possible to create more complex data structures using derived datatypes (see the Chapter 6)

(36)

33/246 3 – Point-to-point Communications 3.3 – Other Possibilities

0 1

1000

1001

Figure 8: sendrecvCommunication between the Processes 0 and 1

1 program sendrecv 2 use mpi 3 implicit none 4 integer :: rank,value,num_proc,code 5 integer,parameter :: tag=110 6 7 call MPI_INIT(code)

8 call MPI_COMM_RANK( MPI_COMM_WORLD ,rank,code)

9

10 ! We suppose that we have exactly 2 processes

11 num_proc=mod(rank+1,2)

12

13 call MPI_SENDRECV(rank+1000,1, MPI_INTEGER ,num_proc,tag,value,1, MPI_INTEGER , &

14 num_proc,tag, MPI_COMM_WORLD , MPI_STATUS_IGNORE ,code)

15

16 print *,’I, process ’,rank,’, I have received’,value,’from the process ’,num_proc

17

18 call MPI_FINALIZE(code)

(37)

34/246 3 – Point-to-point Communications 3.3 – Other Possibilities

> mpiexec -n 2 sendrecv

I, process 1, I have received 1000 from the process 0

I, process 0, I have received 1001 from the process 1

Warning! It must be noticed that if theMPI_SEND() subroutine is implemented in a synchronousway (see the Chapter 7) in the used MPI implementation, the previous code will deadlock if, rather than to use the MPI_SENDRECV() subroutine, we used a

MPI_SEND() subroutine followed by a MPI_RECV()one. In this case, each of the two

processes would wait a receive command which will never occur, because the two sends would stay suspended. Therefore, for correctness and portability reasons, it is absolutely necessary to avoid such situations.

call MPI_SEND(rank+1000,1, MPI_INTEGER ,num_proc,tag, MPI_COMM_WORLD ,code)

(38)

35/246 3 – Point-to-point Communications 3.3 – Other Possibilities

1 program joker

2 use mpi

3 implicit none

4 integer, parameter :: m=4,tag1=11,tag2=22

5 integer, dimension(m,m) :: A

6 integer :: nb_procs,rank,code

7 integer, dimension( MPI_STATUS_SIZE ):: status

8 integer :: nb_elements,i

9 integer, dimension(:), allocatable :: C

10

11 call MPI_INIT(code)

12 call MPI_COMM_SIZE( MPI_COMM_WORLD ,nb_procs,code)

13 call MPI_COMM_RANK( MPI_COMM_WORLD ,rank,code)

14

(39)

36/246 3 – Point-to-point Communications 3.3 – Other Possibilities

16 if (rank == 0) then

17 ! Initialisation of the matrix A on the process 0

18 A(:,:) = reshape((/ (i,i=1,m*m) /), (/ m,m /))

19 ! Sending of 2 elements of the matrix A to the process 1

20 call MPI_SEND(A(1,1),2, MPI_INTEGER ,1,tag1, MPI_COMM_WORLD ,code)

21 ! Sending of 3 elements of the matrix A to the process 2

22 call MPI_SEND(A(1,2),3, MPI_INTEGER ,2,tag2, MPI_COMM_WORLD ,code)

23 else

24 ! We check before the receive if a message has arrived and from which process

25 call MPI_PROBE( MPI_ANY_SOURCE , MPI_ANY_TAG , MPI_COMM_WORLD ,status,code)

26 ! We check how many elements must be received

27 call MPI_GET_COUNT(status, MPI_INTEGER ,nb_elements,code)

28 ! We allocate the reception array C on each process

29 if (nb_elements /= 0 ) allocate (C(1:nb_elements))

30 ! We receive the message

31 call MPI_RECV(C,nb_elements, MPI_INTEGER ,status( MPI_SOURCE ),status( MPI_TAG ), &

32 MPI_COMM_WORLD ,status,code)

33 print *,’I, process ’,rank,’, I have received ’,nb_elements, &

34 ’elements from the process ’, &

35 status( MPI_SOURCE ),’My array C is ’, C(:)

36 end if

37

38 call MPI_FINALIZE(code)

(40)

37/246 3 – Point-to-point Communications 3.3 – Other Possibilities

> mpiexec -n 3 sendrecv1

I, process 1, I have received 2 elements from the process 0 My array C is 1 2

(41)

38/246 3 – Point-to-point Communications 3.4 – Example: communication ring

3 – Point-to-point Communications

3.4 – Example: communication ring

0 1

3

2 4

5 6

(42)

38/246 3 – Point-to-point Communications 3.4 – Example: communication ring

3 – Point-to-point Communications

3.4 – Example: communication ring

0 1 3 2 4 5 6 1000 1001 1002 1003 1004 1005 1006

(43)

39/246 3 – Point-to-point Communications 3.4 – Example: communication ring

If all the processes execute a send followed by a receive, all the communications could potentially start simultaneously and then will not occur in a ring way (outside the already mentioned portability problem if the implementation of MPI_SEND()is made in a synchronous way in the MPI library):

...

value=rank+1000

call MPI_SEND(value,1, MPI_INTEGER ,num_proc_next,tag, MPI_COMM_WORLD ,code)

call MPI_RECV(value,1, MPI_INTEGER ,num_proc_previous,tag, MPI_COMM_WORLD , &

status,code) ...

(44)

40/246 3 – Point-to-point Communications 3.4 – Example: communication ring

To ensure that the communications will really happen in aringway, like for the exchange of atokenbetween processes, it is necessary to use a different method and to have one process which starts the chain:

0 1 2 3 4 5 6

(45)

40/246 3 – Point-to-point Communications 3.4 – Example: communication ring

To ensure that the communications will really happen in aringway, like for the exchange of atokenbetween processes, it is necessary to use a different method and to have one process which starts the chain:

0 1 2 3 4 5 6 1000

(46)

40/246 3 – Point-to-point Communications 3.4 – Example: communication ring

To ensure that the communications will really happen in aringway, like for the exchange of atokenbetween processes, it is necessary to use a different method and to have one process which starts the chain:

0 1 2 3 4 5 6 1001

(47)

40/246 3 – Point-to-point Communications 3.4 – Example: communication ring

To ensure that the communications will really happen in aringway, like for the exchange of atokenbetween processes, it is necessary to use a different method and to have one process which starts the chain:

0 1 2 3 4 5 6 1002

(48)

40/246 3 – Point-to-point Communications 3.4 – Example: communication ring

To ensure that the communications will really happen in aringway, like for the exchange of atokenbetween processes, it is necessary to use a different method and to have one process which starts the chain:

0 1 2 3 4 5 6 1003

(49)

40/246 3 – Point-to-point Communications 3.4 – Example: communication ring

To ensure that the communications will really happen in aringway, like for the exchange of atokenbetween processes, it is necessary to use a different method and to have one process which starts the chain:

0 1 2 3 4 5 6 1004

(50)

40/246 3 – Point-to-point Communications 3.4 – Example: communication ring

To ensure that the communications will really happen in aringway, like for the exchange of atokenbetween processes, it is necessary to use a different method and to have one process which starts the chain:

0 1 2 3 4 5 6 1005

(51)

40/246 3 – Point-to-point Communications 3.4 – Example: communication ring

To ensure that the communications will really happen in aringway, like for the exchange of atokenbetween processes, it is necessary to use a different method and to have one process which starts the chain:

0 1 2 3 4 5 6 1006

(52)

41/246 3 – Point-to-point Communications 3.4 – Example: communication ring

1 program ring

2 use mpi

3 implicit none

4 integer, dimension( MPI_STATUS_SIZE ) :: status

5 integer, parameter :: tag=100

6 integer :: nb_procs,rank,value, &

7 num_proc_previous,num_proc_next,code

8 call MPI_INIT(code)

9 call MPI_COMM_SIZE( MPI_COMM_WORLD ,nb_procs,code)

10 call MPI_COMM_RANK( MPI_COMM_WORLD ,rank,code)

11

12 num_proc_next=mod(rank+1,nb_procs)

13 num_proc_previous=mod(nb_procs+rank-1,nb_procs)

14

15 if (rank == 0) then

16 call MPI_SEND(rank+1000,1, MPI_INTEGER ,num_proc_next,tag, &

17 MPI_COMM_WORLD ,code)

18 call MPI_RECV(value,1, MPI_INTEGER ,num_proc_previous,tag, &

19 MPI_COMM_WORLD ,status,code)

20 else

21 call MPI_RECV(value,1, MPI_INTEGER ,num_proc_previous,tag, &

22 MPI_COMM_WORLD ,status,code)

23 call MPI_SEND(rank+1000,1, MPI_INTEGER ,num_proc_next,tag, &

24 MPI_COMM_WORLD ,code)

25 end if

26

27 print *,’I, process ’,rank,’, I have received ’,value,’ from process ’, &

28 num_proc_previous

29

30 call MPI_FINALIZE(code)

(53)

42/246 3 – Point-to-point Communications 3.4 – Example: communication ring

> mpiexec -n 7 ring

I, process 1, I have received 1000 from process 0

I, process 2, I have received 1001 from process 1

I, process 3, I have received 1002 from process 2

I, process 4, I have received 1003 from process 3

I, process 5, I have received 1004 from process 4

I, process 6, I have received 1005 from process 5

(54)

43/246 4 – Collective communications 1 Introduction 2 Environment 3 Point-to-point Communications 4 Collective communications 4.1 General concepts . . . 55 4.2 Global synchronization: MPI_BARRIER(). . . 57 4.3 Broadcast : MPI_BCAST(). . . 60 4.4 Scatter :MPI_SCATTER(). . . 66 4.5 Gather : MPI_GATHER(). . . 72 4.6 Gather-to-all :MPI_ALLGATHER(). . . 78 4.7 Extended gather :MPI_GATHERV(). . . 84 4.8 All-to-all : MPI_ALLTOALL(). . . 91 4.9 Global reduction . . . 98 4.10 Additions . . . 120 5 One-sided Communication 6 Derived datatypes 7 Optimizations 8 Communicators 9 MPI-IO 10 Conclusion

(55)

44/246 4 – Collective communications 4.1 – General concepts

4 – Collective communications

4.1 – General concepts

Thecollectivecommunications allow to make a series of point-to-point communications in one single call.

A collective communication always concerns all the processes of the indicated communicator.

For each process, the call ends when its participation in the collective call is completed, in the sense of point-to-point communications (when the concerned memory area can be changed).

It is useless to add a global synchronization (barrier) after a collective call.The management oftagsin these communications is transparent and

system-dependent. Therefore, they are never explicitly defined during the calling of these subroutines. This has among other advantages that the collective communications never interfere with point-to-point communications.

(56)

45/246 4 – Collective communications 4.1 – General concepts

There are three types of subroutines :

the one which ensures the global synchronizations : MPI_BARRIER(). ❷ the ones which only transfer data :

global distribution of data : MPI_BCAST() ; ❏ selective distribution of data : MPI_SCATTER() ; ❏ collection of distributed data : MPI_GATHER();

collection by all the processes of distributed data : MPI_ALLGATHER() ; ❏ selective distribution, by all the processes, of distributed data :

MPI_ALLTOALL().

the ones which, in addition to the communications management, carry out operations on the transferred data :

reduction operations (sum, product, maximum, minimum, etc.) whether they are of a predefined or personal type : MPI_REDUCE() ;

reduction operations with broadcasting of the result (it is in fact equivalent to an MPI_REDUCE() followed by an MPI_BCAST()) :

(57)

46/246 4 – Collective communications 4.2 – Global synchronization:MPI_BARRIER()

4 – Collective communications

4.2 – Global synchronization:MPI_BARRIER()

Bar-rière

P0 P1 P2 P3

Figure 11: Global Synchronization : MPI_BARRIER()

integer, intent(out) :: code

(58)

46/246 4 – Collective communications 4.2 – Global synchronization:MPI_BARRIER()

4 – Collective communications

4.2 – Global synchronization:MPI_BARRIER()

Bar-rière

P0 P1 P2 P3 P0 P1 P2 P3

Figure 11: Global Synchronization : MPI_BARRIER()

integer, intent(out) :: code

(59)

46/246 4 – Collective communications 4.2 – Global synchronization:MPI_BARRIER()

4 – Collective communications

4.2 – Global synchronization:MPI_BARRIER()

Bar-rière

P0 P1 P2 P3 P0 P1 P2 P3 P0 P1 P2 P3

Figure 11: Global Synchronization : MPI_BARRIER()

integer, intent(out) :: code

(60)

47/246 4 – Collective communications 4.3 – Broadcast :MPI_BCAST()

4 – Collective communications

4.3 – Broadcast :MPI_BCAST() 0 1 2 3 P0 P1 P2 A P3 MPI_BCAST()

(61)

47/246 4 – Collective communications 4.3 – Broadcast :MPI_BCAST()

4 – Collective communications

4.3 – Broadcast :MPI_BCAST() 0 1 2 3 A P0 P1 P2 A P3 MPI_BCAST() P0 A

(62)

47/246 4 – Collective communications 4.3 – Broadcast :MPI_BCAST()

4 – Collective communications

4.3 – Broadcast :MPI_BCAST() 0 1 2 3 A A P0 P1 P2 A P3 MPI_BCAST() P0 A P1 A

(63)

47/246 4 – Collective communications 4.3 – Broadcast :MPI_BCAST()

4 – Collective communications

4.3 – Broadcast :MPI_BCAST() 0 1 2 3 A A A P0 P1 P2 A P3 MPI_BCAST() P0 A P1 A P2 A

(64)

47/246 4 – Collective communications 4.3 – Broadcast :MPI_BCAST()

4 – Collective communications

4.3 – Broadcast :MPI_BCAST() 0 1 2 3 A A A A P0 P1 P2 A P3 MPI_BCAST() P0 A P1 A P2 A P3 A

(65)

48/246 4 – Collective communications 4.3 – Broadcast :MPI_BCAST() 1 program bcast 2 use mpi 3 implicit none 4 5 integer :: rank,value,code 6 7 call MPI_INIT(code)

8 call MPI_COMM_RANK( MPI_COMM_WORLD ,rank,code)

9

10 if (rank ==2) value=rank+1000

11

12 call MPI_BCAST(value,1, MPI_INTEGER ,2, MPI_COMM_WORLD ,code)

13

14 print *,’I, process ’,rank,’ I have received ’,value,’ of the process2’

15

16 call MPI_FINALIZE(code)

17

18 end program bcast

> mpiexec -n 4 bcast

I, process 2 I have received 1002 of the process 2

I, process 0 I have received 1002 of the process 2

I, process 1 I have received 1002 of the process 2

(66)

49/246 4 – Collective communications 4.4 – Scatter :MPI_SCATTER()

4 – Collective communications

4.4 – Scatter :MPI_SCATTER() 0 1 2 3 P0 P1 P2 A0 A1 A2 A3 P3 MPI_SCATTER()

(67)

49/246 4 – Collective communications 4.4 – Scatter :MPI_SCATTER()

4 – Collective communications

4.4 – Scatter :MPI_SCATTER() 0 1 2 3 A0 P0 P1 P2 A0 A1 A2 A3 P3 MPI_SCATTER() P0 A0

(68)

49/246 4 – Collective communications 4.4 – Scatter :MPI_SCATTER()

4 – Collective communications

4.4 – Scatter :MPI_SCATTER() 0 1 2 3 A0 A1 P0 P1 P2 A0 A1 A2 A3 P3 MPI_SCATTER() P0 A0 P1 A1

(69)

49/246 4 – Collective communications 4.4 – Scatter :MPI_SCATTER()

4 – Collective communications

4.4 – Scatter :MPI_SCATTER() 0 1 2 3 A0 A2 A1 P0 P1 P2 A0 A1 A2 A3 P3 MPI_SCATTER() P0 A0 P1 A1 P2 A2

(70)

49/246 4 – Collective communications 4.4 – Scatter :MPI_SCATTER()

4 – Collective communications

4.4 – Scatter :MPI_SCATTER() 0 1 2 3 A0 A2 A1 A3 P0 P1 P2 A0 A1 A2 A3 P3 MPI_SCATTER() P0 A0 P1 A1 P2 A2 P3 A3

(71)

50/246 4 – Collective communications 4.4 – Scatter :MPI_SCATTER()

1 program scatter

2 use mpi

3 implicit none

4

5 integer, parameter :: nb_values=8

6 integer :: nb_procs,rank,block_length,i,code

7 real, allocatable, dimension(:) :: values,data

8

9 call MPI_INIT(code)

10 call MPI_COMM_SIZE( MPI_COMM_WORLD ,nb_procs,code)

11 call MPI_COMM_RANK( MPI_COMM_WORLD ,rank,code)

12 block_length=nb_values/nb_procs 13 allocate(data(block_length)) 14 15 if (rank ==2) then 16 allocate(values(nb_values)) 17 values(:)=(/(1000.+i,i=1,nb_values)/)

18 print *,’I, process ’,rank,’send my values array : ’,&

19 values(1:nb_values)

20 end if

21

22 call MPI_SCATTER(values,block_length, MPI_REAL ,data,block_length, &

23 MPI_REAL ,2, MPI_COMM_WORLD ,code)

24 print *,’I, process ’,rank,’, I have received ’, data(1:block_length), &

25 ’ of the process 2’

26 call MPI_FINALIZE(code)

27

28 end program scatter

> mpiexec -n 4 scatter

I, process 2 send my values array :

1001. 1002. 1003. 1004. 1005. 1006. 1007. 1008.

I, process 0, I have received 1001. 1002. of the process 2

I, process 1, I have received 1003. 1004. of the process 2

(72)

51/246 4 – Collective communications 4.5 – Gather :MPI_GATHER()

4 – Collective communications

4.5 – Gather : MPI_GATHER() 0 1 2 3 P0 A0 P1 A1 P2 A2 P3 A3 MPI_GATHER()

(73)

51/246 4 – Collective communications 4.5 – Gather :MPI_GATHER()

4 – Collective communications

4.5 – Gather : MPI_GATHER() 0 1 2 3 A0 P0 A0 P1 A1 P2 A2 P3 A3 MPI_GATHER() P0 P1 P2 A0 P3

(74)

51/246 4 – Collective communications 4.5 – Gather :MPI_GATHER()

4 – Collective communications

4.5 – Gather : MPI_GATHER() 0 1 2 3 A0 A1 P0 A0 P1 A1 P2 A2 P3 A3 MPI_GATHER() P0 P1 P2 A0 A1 P3

(75)

51/246 4 – Collective communications 4.5 – Gather :MPI_GATHER()

4 – Collective communications

4.5 – Gather : MPI_GATHER() 0 1 2 3 A0 A2 A1 P0 A0 P1 A1 P2 A2 P3 A3 MPI_GATHER() P0 P1 P2 A0 A1 A2 P3

(76)

51/246 4 – Collective communications 4.5 – Gather :MPI_GATHER()

4 – Collective communications

4.5 – Gather : MPI_GATHER() 0 1 2 3 A0 A2 A1 A3 P0 A0 P1 A1 P2 A2 P3 A3 MPI_GATHER() P0 P1 P2 A0 A1 A2 A3 P3

(77)

52/246 4 – Collective communications 4.5 – Gather :MPI_GATHER()

1 program gather

2 use mpi

3 implicit none

4 integer, parameter :: nb_values=8

5 integer :: nb_procs,rank,block_length,i,code

6 real, dimension(nb_values) :: data

7 real, allocatable, dimension(:) :: values

8

9 call MPI_INIT(code)

10 call MPI_COMM_SIZE( MPI_COMM_WORLD ,nb_procs,code)

11 call MPI_COMM_RANK( MPI_COMM_WORLD ,rank,code)

12 13 block_length=nb_values/nb_procs 14 15 allocate(values(block_length)) 16 17 values(:)=(/(1000.+rank*block_length+i,i=1,block_length)/)

18 print *,’I, process ’,rank,’send my values array : ’,&

19 values(1:block_length)

20

21 call MPI_GATHER(values,block_length, MPI_REAL ,data,block_length, &

22 MPI_REAL ,2, MPI_COMM_WORLD ,code)

23

24 if (rank ==2) print *,’I, process 2’, ’ have received ’,data(1:nb_values)

25

26 call MPI_FINALIZE(code)

27

28 end program gather

> mpiexec -n 4 gather

I, process 1 send my values array :1003. 1004. I, process 0 send my values array :1001. 1002. I, process 2 send my values array :1005. 1006. I, process 3 send my values array :1007. 1008.

(78)

53/246 4 – Collective communications 4.6 – Gather-to-all :MPI_ALLGATHER()

4 – Collective communications

4.6 – Gather-to-all :MPI_ALLGATHER() 0 1 2 3 P0 A0 P1 A1 P2 A2 P3 A3 MPI_ALLGATHER()

(79)

53/246 4 – Collective communications 4.6 – Gather-to-all :MPI_ALLGATHER()

4 – Collective communications

4.6 – Gather-to-all :MPI_ALLGATHER() 0 1 2 3 A0 A0 A0 A0 P0 A0 P1 A1 P2 A2 P3 A3 MPI_ALLGATHER() P0 A0 P1 A0 P2 A0 P3 A0

(80)

53/246 4 – Collective communications 4.6 – Gather-to-all :MPI_ALLGATHER()

4 – Collective communications

4.6 – Gather-to-all :MPI_ALLGATHER() 0 1 2 3 A0 A0 A0 A0 A1 A1 A1 A1 P0 A0 P1 A1 P2 A2 P3 A3 MPI_ALLGATHER() P0 A0 A1 P1 A0 A1 P2 A0 A1 P3 A0 A1

(81)

53/246 4 – Collective communications 4.6 – Gather-to-all :MPI_ALLGATHER()

4 – Collective communications

4.6 – Gather-to-all :MPI_ALLGATHER() 0 1 2 3 A0 A0 A0 A0 A1 A1 A1 A1 A2 A2 A2 A2 P0 A0 P1 A1 P2 A2 P3 A3 MPI_ALLGATHER() P0 A0 A1 A2 P1 A0 A1 A2 P2 A0 A1 A2 P3 A0 A1 A2

(82)

53/246 4 – Collective communications 4.6 – Gather-to-all :MPI_ALLGATHER()

4 – Collective communications

4.6 – Gather-to-all :MPI_ALLGATHER() 0 1 2 3 A0 A0 A0 A0 A1 A1 A1 A1 A2 A2 A2 A2 A3 A3 A3 A3 P0 A0 P1 A1 P2 A2 P3 A3 MPI_ALLGATHER() P0 A0 A1 A2 A3 P1 A0 A1 A2 A3 P2 A0 A1 A2 A3 P3 A0 A1 A2 A3

(83)

54/246 4 – Collective communications 4.6 – Gather-to-all :MPI_ALLGATHER()

1 program allgather

2 use mpi

3 implicit none

4

5 integer, parameter :: nb_values=8

6 integer :: nb_procs,rank,block_length,i,code

7 real, dimension(nb_values) :: data

8 real, allocatable, dimension(:) :: values

9

10 call MPI_INIT(code)

11

12 call MPI_COMM_SIZE( MPI_COMM_WORLD ,nb_procs,code)

13 call MPI_COMM_RANK( MPI_COMM_WORLD ,rank,code)

14 15 block_length=nb_values/nb_procs 16 allocate(values(block_length)) 17 18 values(:)=(/(1000.+rank*block_length+i,i=1,block_length)/) 19

20 call MPI_ALLGATHER(values,block_length, MPI_REAL ,data,block_length, &

21 MPI_REAL , MPI_COMM_WORLD ,code)

22

23 print *,’I, process ’,rank,’, I have received ’,data(1:nb_values)

24

25 call MPI_FINALIZE(code)

26

27 end program allgather

> mpiexec -n 4 allgather

I, process 1, I have received 1001. 1002. 1003. 1004. 1005. 1006. 1007. 1008.

I, process 3, I have received 1001. 1002. 1003. 1004. 1005. 1006. 1007. 1008.

I, process 2, I have received 1001. 1002. 1003. 1004. 1005. 1006. 1007. 1008.

(84)

55/246 4 – Collective communications 4.7 – Extended gather :MPI_GATHERV()

4 – Collective communications

4.7 – Extended gather : MPI_GATHERV()

0 1 2 3 P0 A0 P1 A1 P2 A2 P3 A3 MPI_GATHERV()

(85)

55/246 4 – Collective communications 4.7 – Extended gather :MPI_GATHERV()

4 – Collective communications

4.7 – Extended gather : MPI_GATHERV()

0 1 2 3 A0 P0 A0 P1 A1 P2 A2 P3 A3 MPI_GATHERV() P0 P1 P2 A0 P3

(86)

55/246 4 – Collective communications 4.7 – Extended gather :MPI_GATHERV()

4 – Collective communications

4.7 – Extended gather : MPI_GATHERV()

0 1 2 3 A0 A1 P0 A0 P1 A1 P2 A2 P3 A3 MPI_GATHERV() P0 P1 P2 A0 A1 P3

(87)

55/246 4 – Collective communications 4.7 – Extended gather :MPI_GATHERV()

4 – Collective communications

4.7 – Extended gather : MPI_GATHERV()

0 1 2 3 A0 A2 A1 P0 A0 P1 A1 P2 A2 P3 A3 MPI_GATHERV() P0 P1 P2 A0 A1 A2 P3

(88)

55/246 4 – Collective communications 4.7 – Extended gather :MPI_GATHERV()

4 – Collective communications

4.7 – Extended gather : MPI_GATHERV()

0 1 2 3 A0 A2 A1 A3 P0 A0 P1 A1 P2 A2 P3 A3 MPI_GATHERV() P0 P1 P2 A0 A1 A2 A3 P3

(89)

56/246 4 – Collective communications 4.7 – Extended gather :MPI_GATHERV()

1 program gatherv

2 use mpi

3 implicit none

4 INTEGER, PARAMETER :: nb_values=10

5 INTEGER :: nb_procs, rank, block_length, i, code

6 REAL, DIMENSION(nb_values) :: data,remainder

7 REAL, ALLOCATABLE, DIMENSION(:) :: values

8 INTEGER, ALLOCATABLE, DIMENSION(:) :: nb_elements_received,displacement

9

10 CALL MPI_INIT(code)

11 CALL MPI_COMM_SIZE( MPI_COMM_WORLD ,nb_procs,code)

12 CALL MPI_COMM_RANK( MPI_COMM_WORLD ,rank,code)

13

14 block_length=nb_values/nb_procs

15 remainder=mod(nb_values,nb_procs)

16 if(remainder > rank) block_length = block_length + 1

17 ALLOCATE(values(block_length))

18 values(:) = (/(1000.+(rank*(nb_values/nb_procs))+min(rank,remainder)+i, &

19 i=1,block_length)/)

20

21 PRINT *, ’I, process ’, rank,’send my values array : ’,&

22 values(1:block_length) 23 24 IF (rank == 2) THEN 25 ALLOCATE(nb_elements_received(nb_procs),displacement(nb_procs)) 26 nb_elements_received(1) = nb_values/nb_procs 27 if (remainder > 0) nb_elements_received(1)=nb_elements_received(1)+1 28 displacement(1) = 0 29 DO i=2,nb_procs 30 displacement(i) = displacement(i-1)+nb_elements_received(i-1) 31 nb_elements_received(i) = nb_values/nb_procs

32 if (remainder > i-1) nb_elements_received(i)=nb_elements_received(i)+1

33 END DO

34 END IF

(90)

57/246 4 – Collective communications 4.7 – Extended gather :MPI_GATHERV()

CALL MPI_GATHERV(values,block_length, MPI_REAL ,data,nb_elements_received,&

displacement, MPI_REAL ,2, MPI_COMM_WORLD ,code)

IF (rank == 2) PRINT *, ’I, processus 2 receives ’, data(1:nb_values)

CALL MPI_FINALIZE(code)

end program gatherv

> mpiexec -n 4 gatherv

I, process 0 send my values’ array : 1001. 1002. 1003.

I, process 2 send my values’ array : 1007. 1008.

I, process 3 send my values’ array : 1009. 1010.

I, process 1 send my values’ array : 1004. 1005. 1006.

I, process 2 receives 1001. 1002. 1003. 1004. 1005. 1006. 1007. 1008.

(91)

58/246 4 – Collective communications 4.8 – All-to-all :MPI_ALLTOALL()

4 – Collective communications

4.8 – All-to-all :MPI_ALLTOALL() 0 1 2 3 P0 A0 A1 A2 A3 P1 B0 B1 B2 B3 P2 C0 C1 C2 C3 P3 D0 D1 D2 D3 MPI_ALLTOALL()

(92)

58/246 4 – Collective communications 4.8 – All-to-all :MPI_ALLTOALL()

4 – Collective communications

4.8 – All-to-all :MPI_ALLTOALL() 0 1 2 3 A0 A1 A2 A3 P0 A0 A1 A2 A3 P1 B0 B1 B2 B3 P2 C0 C1 C2 C3 P3 D0 D1 D2 D3 MPI_ALLTOALL() P0 A0 P1 A1 P2 A2 P3 A3

(93)

58/246 4 – Collective communications 4.8 – All-to-all :MPI_ALLTOALL()

4 – Collective communications

4.8 – All-to-all :MPI_ALLTOALL() 0 1 2 3 A0 A1 A2 A3 B0 B1 B2 B3 P0 A0 A1 A2 A3 P1 B0 B1 B2 B3 P2 C0 C1 C2 C3 P3 D0 D1 D2 D3 MPI_ALLTOALL() P0 A0 B0 P1 A1 B1 P2 A2 B2 P3 A3 B3

(94)

58/246 4 – Collective communications 4.8 – All-to-all :MPI_ALLTOALL()

4 – Collective communications

4.8 – All-to-all :MPI_ALLTOALL() 0 1 2 3 A0 A1 A2 A3 B0 B1 B2 B3 C0 C1 C2 C3 P0 A0 A1 A2 A3 P1 B0 B1 B2 B3 P2 C0 C1 C2 C3 P3 D0 D1 D2 D3 MPI_ALLTOALL() P0 A0 B0 C0 P1 A1 B1 C1 P2 A2 B2 C2 P3 A3 B3 C3

(95)

58/246 4 – Collective communications 4.8 – All-to-all :MPI_ALLTOALL()

4 – Collective communications

4.8 – All-to-all :MPI_ALLTOALL() 0 1 2 3 A0 A1 A2 A3 B0 B1 B2 B3 C0 C1 C2 C3 D0 D1 D2 D3 P0 A0 A1 A2 A3 P1 B0 B1 B2 B3 P2 C0 C1 C2 C3 P3 D0 D1 D2 D3 MPI_ALLTOALL() P0 A0 B0 C0 D0 P1 A1 B1 C1 D1 P2 A2 B2 C2 D2 P3 A3 B3 C3 D3

(96)

59/246 4 – Collective communications 4.8 – All-to-all :MPI_ALLTOALL()

1 program alltoall

2 use mpi

3 implicit none

4

5 integer, parameter :: nb_values=8

6 integer :: nb_procs,rank,block_length,i,code

7 real, dimension(nb_values) :: values,data

8

9 call MPI_INIT(code)

10 call MPI_COMM_SIZE( MPI_COMM_WORLD ,nb_procs,code)

11 call MPI_COMM_RANK( MPI_COMM_WORLD ,rank,code)

12

13 values(:)=(/(1000.+rank*nb_values+i,i=1,nb_values)/)

14 block_length=nb_values/nb_procs

15

16 print *,’I, process ’,rank,’send my values array : ’,&

17 values(1:nb_values)

18

19 call MPI_ALLTOALL(values,block_length, MPI_REAL ,data,block_length, &

20 MPI_REAL , MPI_COMM_WORLD ,code)

21

22 print *,’I, process ’,rank,’, I have received ’,data(1:nb_values)

23

24 call MPI_FINALIZE(code)

(97)

60/246 4 – Collective communications 4.8 – All-to-all :MPI_ALLTOALL()

> mpiexec -n 4 alltoall

I, process 1 send my values array :

1009. 1010. 1011. 1012. 1013. 1014. 1015. 1016. I, process 0 send my values array :

1001. 1002. 1003. 1004. 1005. 1006. 1007. 1008. I, process 2 send my values array :

1017. 1018. 1019. 1020. 1021. 1022. 1023. 1024. I, process 3 send my values array :

1025. 1026. 1027. 1028. 1029. 1030. 1031. 1032.

I, process 0, I have received 1001. 1002. 1009. 1010. 1017. 1018. 1025. 1026.

I, process 2, I have received 1005. 1006. 1013. 1014. 1021. 1022. 1029. 1030.

I, process 1, I have received 1003. 1004. 1011. 1012. 1019. 1020. 1027. 1028.

(98)

61/246 4 – Collective communications 4.9 – Global reduction

4 – Collective communications

4.9 – Global reduction

Areductionis an operation applied to a set of elements in order to obtain one single value. Classical examples are the sum of the elements of a vector

(SUM(A(:))) or the search of the maximum value element in a vector (MAX(V(:))). ☞ MPI proposes high-level subroutines in order to operate reductions on distributed

data on a group of processes. The result is obtained on one process

(MPI_REDUCE()) or on all (MPI_ALLREDUCE(), which is in fact equivalent to an

MPI_REDUCE() followed by an MPI_BCAST()).

If several elements are implied by process, the reduction function is applied to each one of them.

TheMPI_SCAN() subroutine allows also to make partial reductions by

considering, for each process, the previous processes of the group and itself.

TheMPI_OP_CREATE() and MPI_OP_FREE() subroutines allow personal reduction

(99)

62/246 4 – Collective communications 4.9 – Global reduction

Table 3: Main Predefined Reduction Operations (there are also other logical operations)

Name Opération

MPI_SUM Sum of elements MPI_PROD Product of elements MPI_MAX Maximum of elements MPI_MIN Minimum of elements

MPI_MAXLOC Maximum of elements and location MPI_MINLOC Minimum of elements and location MPI_LAND Logical AND

MPI_LOR Logical OR

(100)

63/246 4 – Collective communications 4.9 – Global reduction

0 1

3

2 4

5 6

(101)

63/246 4 – Collective communications 4.9 – Global reduction 0 1 3 2 4 5 6 1000

(102)

63/246 4 – Collective communications 4.9 – Global reduction 0 1 3 2 4 5 6 1 1000+1

(103)

63/246 4 – Collective communications 4.9 – Global reduction 0 1 3 2 4 5 6 1 1000+1+2 2

(104)

63/246 4 – Collective communications 4.9 – Global reduction 0 1 3 2 4 5 6 1 1000+1+2+3 2 3

(105)

63/246 4 – Collective communications 4.9 – Global reduction 0 1 3 2 4 5 6 1 1000+1+2+3+4 2 3 4

(106)

63/246 4 – Collective communications 4.9 – Global reduction 0 1 3 2 4 5 6 1 1000+1+2+3+4+5 2 3 4 5

(107)

63/246 4 – Collective communications 4.9 – Global reduction 0 1 3 2 4 5 6 1 1000+1+2+3+4+5+6 =1021 2 3 4 5 6

(108)

64/246 4 – Collective communications 4.9 – Global reduction 1 program reduce 2 use mpi 3 implicit none 4 integer :: nb_procs,rank,value,sum,code 5 6 call MPI_INIT(code)

7 call MPI_COMM_SIZE( MPI_COMM_WORLD ,nb_procs,code)

8 call MPI_COMM_RANK( MPI_COMM_WORLD ,rank,code)

9 10 if (rank ==0) then 11 value=1000 12 else 13 value=rank 14 endif 15

16 call MPI_REDUCE(value,sum,1, MPI_INTEGER , MPI_SUM ,0, MPI_COMM_WORLD ,code)

17

18 if (rank == 0) then

19 print *,’I, process 0, I have the global sum value ’,sum

20 end if

21

22 call MPI_FINALIZE(code)

23 end program reduce

> mpiexec -n 7 reduce

(109)

65/246 4 – Collective communications 4.9 – Global reduction

0 1

3

2 4

5 6

(110)

65/246 4 – Collective communications 4.9 – Global reduction 0 1 3 2 4 5 6 10

(111)

65/246 4 – Collective communications 4.9 – Global reduction 0 1 3 2 4 5 6 1 10×1

(112)

65/246 4 – Collective communications 4.9 – Global reduction 0 1 3 2 4 5 6 1 10×1×2 2

(113)

65/246 4 – Collective communications 4.9 – Global reduction 0 1 3 2 4 5 6 1 10×1×2×3 2 3

(114)

65/246 4 – Collective communications 4.9 – Global reduction 0 1 3 2 4 5 6 1 10×1×2×3×4 2 3 4

(115)

65/246 4 – Collective communications 4.9 – Global reduction 0 1 3 2 4 5 6 1 10×1×2×3×4×5 2 3 4 5

(116)

65/246 4 – Collective communications 4.9 – Global reduction 0 1 3 2 4 5 6 1 10×1×2×3×4×5×6 =7200 2 3 4 5 6

(117)

65/246 4 – Collective communications 4.9 – Global reduction 0 1 3 2 4 5 6 1 10×1×2×3×4×5×6 =7200 2 3 4 5 6 7200 7200 7200 7200 7200 7200 7200

(118)

66/246 4 – Collective communications 4.9 – Global reduction 1 program allreduce 2 3 use mpi 4 implicit none 5 6 integer :: nb_procs,rank,value,product,code 7 8 call MPI_INIT(code)

9 call MPI_COMM_SIZE( MPI_COMM_WORLD ,nb_procs,code)

10 call MPI_COMM_RANK( MPI_COMM_WORLD ,rank,code)

11 12 if (rank ==0) then 13 value=10 14 else 15 value=rank 16 endif 17

18 call MPI_ALLREDUCE(value,product,1, MPI_INTEGER , MPI_PROD , MPI_COMM_WORLD ,code)

19

20 print *,’I,process ’,rank,’,I have received the value of the global product ’,product

21

22 call MPI_FINALIZE(code)

23

(119)

67/246 4 – Collective communications 4.9 – Global reduction

> mpiexec -n 7 allreduce

I, process 6, I have received the value of the global product 7200

I, process 2, I have received the value of the global product 7200

I, process 0, I have received the value of the global product 7200

I, process 4, I have received the value of the global product 7200

I, process 5, I have received the value of the global product 7200

I, process 3, I have received the value of the global product 7200

(120)

68/246 4 – Collective communications 4.10 – Additions

4 – Collective communications

4.10 – Additions

TheMPI_SCATTERV(), MPI_GATHERV(), MPI_ALLGATHERV() and

MPI_ALLTOALLV() subroutines extend MPI_SCATTER(), MPI_GATHER(),

MPI_ALLGATHER() and MPI_ALLTOALL() in the case where the number of

elements to transmit or gather is different following the processes.

Two new subroutines have been added to extend the possibilities of collective subroutines in some particular cases:

➳ MPI_ALLTOALLW(): MPI_ALLTOALLV() version where the displacements are

expressed in bytes and not in elements,

(121)

69/246 5 – One-sided Communication 1 Introduction 2 Environment 3 Point-to-point Communications 4 Collective communications 5 One-sided Communication 5.1 Introduction . . . 122 5.1.1 Reminder : The Concept of Message-Passing . . . 123 5.1.2 The Concept of One-sided Communication . . . 124 5.1.3 RMA Approach of MPI . . . 125 5.2 Memory Window . . . 126 5.3 Data Transfer . . . 130 5.4 Completion of the Transfer : the synchronization . . . 134 5.4.1 Active Target Synchronization . . . 135 5.4.2 Passive Target Synchronization . . . 144 5.5 Conclusions . . . 148 6 Derived datatypes 7 Optimizations 8 Communicators 9 MPI-IO 10 Conclusion

(122)

70/246 5 – One-sided Communication 5.1 – Introduction

5 – One-sided Communication

5.1 – Introduction

There are various approaches to transfer data between two different processes. Among the most commonly used are :

Point-to-point communications by message-passing (MPI, etc.) ;

One-sided communications (direct access to the memory of a distant process). Also called RMA for Remote Memory Access , it is one of the major

(123)

71/246 5 – One-sided Communication 5.1 – Introduction

5 – One-sided Communication

5.1 – Introduction

5.1.1 – Reminder : The Concept of Message-Passing

t target source I send I receive

1

0

Figure 20: Message-Passing

In message-passing, a sender (origin) sends a message to a destination process (target) which will make all what is necessary to receive this message. This requires that the sender as well as the receiver be involved in the communication. This can be restrictive and difficult to implement in some algorithms (for example when it is necessary to manage a global counter).

(124)

72/246 5 – One-sided Communication 5.1 – Introduction

5 – One-sided Communication

5.1 – Introduction

5.1.2 – The Concept of One-sided Communication

The concept of one-sided communication is not new,MPIhaving simply unified the already existing constructors’ solutions (such as shmem (CRAY), lapi (IBM), ...) by offering its own RMA primitives. Through these subroutines, a process has a direct access (in read, write or update) to the memory of another remote process. In this approach, the remote process does not have to participate in the data-transfer process. The principle advantages are the following :

enhanced performances when the hardware allows it,a simpler programming for some algorithms.

(125)

73/246 5 – One-sided Communication 5.1 – Introduction

5 – One-sided Communication

5.1 – Introduction

5.1.3 – RMA Approach of MPI

The use ofMPIRMA is done in three steps:

definition on each process of a memory area (local memory window) visible and eventually accessible to remote processes ;

start of the data transfer directly from the memory of a process to the memory of another process. It is therefore necessary to specify the type, the number and the initial and final localization of data.

completion of current transfers by a step of synchronization, the data are then available.

(126)

74/246 5 – One-sided Communication 5.2 – Memory Window

5 – One-sided Communication

5.2 – Memory Window

All the processes participating in an one-sided communication have to specify which part of their memory will be available to the other processes ; it is the notion of memory window.

More precisely, the MPI_WIN_CREATE() collective operation allows the creation of an MPI window object. This object is composed, for each process, of a specific memory area called local memory window. For each process, a local memory window is characterized by its initial address, its size in bytes (which can be zero) and the displacement unit size inside this window (in bytes). These

(127)

75/246 5 – One-sided Communication 5.2 – Memory Window

Example:

0 1

MPI Window Object win1

MPI Window Object win2 First local window

Second local window

First local window Second local window

(128)

76/246 5 – One-sided Communication 5.2 – Memory Window

Once the transfers are finished, the window has to be freed with the

MPI_WIN_FREE() subroutine.

☞ MPI_WIN_GET_ATTR() allows to know the characteristics of a local memory

window by using the keywordsMPI_WIN_BASE, MPI_WIN_SIZE or MPI_WIN_DISP_UNIT.

Remark :

The choice of the displacement unit associated with the local memory window is important (essential in a heterogenous environment and facilitating the encoding in all the cases). The size of a MPI datatype is obtained by calling the

(129)

77/246 5 – One-sided Communication 5.2 – Memory Window 1 program window 2 3 use mpi 4 implicit none 5

6 integer :: code, rank, size_real, win, n=4

7 integer (kind= MPI_ADDRESS_KIND ) :: dim_win, size, base, unit

8 real(kind=kind(1.d0)), dimension(:), allocatable :: win_local

9 logical :: flag

10

11 call MPI_INIT(code)

12 call MPI_COMM_RANK( MPI_COMM_WORLD , rank, code)

13 call MPI_TYPE_SIZE( MPI_DOUBLE_PRECISION ,size_real,code)

14

15 if (rank==0) n=0

16 allocate(win_local(n))

17 dim_win = size_real*n

18

19 call MPI_WIN_CREATE(win_local, dim_win, size_real, MPI_INFO_NULL , &

20 MPI_COMM_WORLD , win, code)

21 call MPI_WIN_GET_ATTR(win, MPI_WIN_SIZE , size, flag, code)

22 call MPI_WIN_GET_ATTR(win, MPI_WIN_BASE , base, flag, code)

23 call MPI_WIN_GET_ATTR(win, MPI_WIN_DISP_UNIT , unit, flag, code)

24 call MPI_WIN_FREE(win,code)

25 print *,"process", rank,"size, base, unit = " &

26 ,size, base, unit

27 call MPI_FINALIZE(code)

28 end program window

> mpiexec -n 3 window

process 1 size, base, unit = 32 17248330400 8

process 0 size, base, unit = 0 2 8

(130)

78/246 5 – One-sided Communication 5.3 – Data Transfer

5 – One-sided Communication

5.3 – Data Transfer

MPIallows a process to read (MPI_GET()), to write (MPI_PUT()) and to update

(MPI_ACCUMULATE()) data located in the local memory window of a remote process.

We call the process which makes the calling of the initialization subroutine of the transferoriginand we call the process which has the local memory window which will be used in the transfer proceduretarget.

During the initialization of the transfer, the target process does not call anyMPI subroutine. All the necessary informations are specified under the form of parameters during the calling of theMPIsubroutine by the origin.

(131)

79/246 5 – One-sided Communication 5.3 – Data Transfer

In particular, we find :

parameters related to the origin : the datatype of elements ; their number ;

the memory address of the first element. ☞ parameters related to the target :

the rank of the target as well as the MPI window object, which uniquely determines a local memory window ;

a displacement in this local window ;

(132)

80/246 5 – One-sided Communication 5.3 – Data Transfer

1 program example_put

2 integer :: nb_orig=10, nb_target=10, target=1, win, code

3 integer (kind= MPI_ADDRESS_KIND ) :: displacement=40

4 integer, dimension(10) :: B

5 ...

6 call MPI_PUT(B,nb_orig, MPI_INTEGER ,target,displacement,nb_target, MPI_INTEGER ,win,code)

7 ...

8 end program example_put

0 1

First Local Window

First Local Window Displacement of 40 units in the local window

B(1:10)

(133)

81/246 5 – One-sided Communication 5.3 – Data Transfer

Remarks :

TheMPI_GET syntax is identical to theMPI_PUT syntax, only the direction of the

data transfer is reversed.

The RMA transfer data subroutines are nonblocking primitives (deliberate choice ofMPI).

On the target process, the only accessible data are the ones contained in the local memory window.

MPI_ACCUMULATE() admits among its parameters an operation which should be

either of the MPI_REPLACEtype, or one of the predefined reduction operations : MPI_SUM,MPI_PROD, MPI_MAX, etc. This cannot be in any case an operation defined by the user.

(134)

82/246 5 – One-sided Communication 5.4 – Completion of the Transfer : the synchronization

5 – One-sided Communication

5.4 – Completion of the Transfer : the synchronization

The data transfer starts after the call to one of the nonblocking subroutines

(MPI_PUT(), ...). But when is the transfer completed and the data really available ?

After a synchronization by the programmer. These synchronizations can be classified into two types :

Active targetsynchronization (collective operation, all the processes associated with the window involved in the synchronization) ;

Passive targetsynchronization (only the origin process calls the synchronization subroutine).

(135)

83/246 5 – One-sided Communication 5.4 – Completion of the Transfer : the synchronization

5 – One-sided Communication

5.4 – Completion of the Transfer : the synchronization 5.4.1 – Active Target Synchronization

It is made by using theMPIsubroutine MPI_WIN_FENCE().

☞ MPI_WIN_FENCE() is a collective operation on all the processes associated with

the MPI window object.

MPI_WIN_FENCE()acts as a synchronization barrier. It waits for the end of all the

data transfers (RMA or not) using the local memory window and initiated since the last call toMPI_WIN_FENCE().

This primitive will help separate the calculation parts of the code (where data of the local memory window are used viaload orstore) of RMA data transfer parts. ☞ Anassert argument of the MPI_WIN_FENCE() primitive, of integer type, allows to

refine its behavior for better performances. Different values are predefined MPI_MODE_NOSTORE, MPI_MODE_NOPUT, MPI_MODE_NOPRECEDE,

References

Related documents

Enrollment from 7/1/11-6/30/12: Campus Number of Students Admitted New Starts since 6/30/12 Re-Enrollments since 6/30/12 Transferred to Program from Another Program since

Item 5 revealed that 94% of officials of ministry of education and local government education authority and proprietors of private primary schools supported that NCE graduates who

predictors of each type of dating aggression was an excellent fit to the data (see Table 2). Results showed that witnessing violence in the school significantly predicted

Ιν ουρ mοδελ, ωιτη mοτιϖατεδ προϖιδερσ ανδ δεχρεασινγ mαργιναλ υτιλιτψ οφ προ…τσ, σοχιαλλψ οπτιmαλ θυαλιτψ προϖισιον ισ γενεραλλψ νοτ οβταινεδ

Southwest Chicken Salad 7.00 – Spring mix lettuce with crispy chicken strips, diced tomato, roasted corn, black beans and a chipotle avocado ranch dressing. Bistro Roast Beef

Although total labor earnings increase with the unskilled unions’ bargaining power, we can say nothing when the increase in production is due to stronger skilled unions, since

potential photonics-related research and innovation topics as input to the Societal Challenges work programme or for joint programme activities.?. Aim of the

We believe that private health funds should recognize the medical importance of rigid contact lenses for keratoconus. These lenses need to be reclassified as medical devices such