Message Passing Interface
(MPI)
JalelChergui Isabelle Dupays Denis Girou Pierre-François Lavallée Dimitri Lecas Philippe Wautelet INSTITUT DU DÉVELOPPEMENTMPI – Plan I
1 Introduction . . . 7
1.1 Availability and updating . . . 7
1.2 Definitions . . . 8
1.3 Message Passing concepts . . . 11
1.4 History . . . 14 1.5 Library . . . 16 2 Environment . . . 20 2.1 Description . . . 20 2.2 Exemple . . . 23 3 Point-to-point Communications . . . 24 3.1 General Concepts . . . 24
3.2 Predefined MPI Datatypes . . . 27
3.3 Other Possibilities . . . 29
3.4 Example: communication ring . . . 35
4 Collective communications . . . 40
4.1 General concepts . . . 40
4.2 Global synchronization: MPI_BARRIER(). . . 42
4.3 Broadcast : MPI_BCAST(). . . 43
4.4 Scatter :MPI_SCATTER(). . . 45
4.5 Gather : MPI_GATHER(). . . 47
4.6 Gather-to-all :MPI_ALLGATHER(). . . 49
4.7 Extended gather :MPI_GATHERV(). . . 51
INSTITUT DU DÉVELOPPEMENT
MPI – Plan II
4.8 All-to-all : MPI_ALLTOALL(). . . 54 4.9 Global reduction . . . 57 4.10 Additions . . . 64 5 One-sided Communication . . . 65 5.1 Introduction . . . 655.1.1 Reminder : The Concept of Message-Passing . . . 66
5.1.2 The Concept of One-sided Communication . . . 67
5.1.3 RMA Approach of MPI . . . 68
5.2 Memory Window . . . 69
5.3 Data Transfer . . . 73
5.4 Completion of the Transfer : the synchronization. . . .77
5.4.1 Active Target Synchronization . . . 78
5.4.2 Passive Target Synchronization . . . 87
5.5 Conclusions . . . 91 6 Derived datatypes . . . 92 6.1 Introduction . . . 92 6.2 Contiguous datatypes . . . 94 6.3 Constant stride . . . 95 6.4 Other subroutines . . . 97 6.5 Examples . . . 98
6.5.1 The datatype "matrix row" . . . 98
6.5.2 The datatype "matrix line" . . . 100
INSTITUT DU DÉVELOPPEMENT
MPI – Plan III
6.5.3 The datatype "matrix block" . . . 102
6.6 Homogenous datatypes of variable strides . . . 104
6.7 Subarray Datatype Constructor . . . 110
6.8 Heterogenous datatypes . . . 115
6.9 Subroutines annexes . . . 119
6.10 Conclusion . . . 122
7 Optimizations . . . 123
7.1 Introduction . . . 123
7.2 Point-to-Point Send Modes . . . 124
7.2.1 Key Terms . . . 125
7.2.2 Synchronous Sends . . . 126
7.2.3 Buffered Sends . . . 128
7.2.4 Standard Sends . . . 132
7.2.5 Ready Sends. . . .133
7.2.6 Performances of Send Modes . . . 134
7.3 Computation-Communication Overlap . . . 135
8 Communicators . . . 141
8.1 Introduction . . . 141
8.2 Default Communicator . . . 142
8.3 Example. . . .143
8.4 Groups and Communicators . . . 145
8.5 Communicator Resulting from Another . . . 147
INSTITUT DU DÉVELOPPEMENT
MPI – Plan IV
8.6 Topologies . . . 151
8.6.1 Cartesian topologies . . . 152
8.6.2 Subdividing a cartesian topology. . . .167
9 MPI-IO . . . 173 9.1 Introduction . . . 173 9.1.1 Presentation . . . 173 9.1.2 Issues . . . 175 9.1.3 Definitions. . . .178 9.2 File management . . . 180
9.3 Reads/Writes : general concepts . . . 183
9.4 Individual Reads/Writes . . . 187
9.4.1 Via explicit offsets . . . 187
9.4.2 Via Individual File Pointers. . . .192
9.4.3 Shared File Pointers . . . 197
9.5 Collective Reads/Writes . . . 200
9.5.1 Via explicit offsets . . . 201
9.5.2 Via individual file pointers . . . 203
9.5.3 Via shared file pointers . . . 209
9.6 Explicit positioning of pointers in a file . . . 211
9.7 Definition of views . . . 214
9.8 NonBlocking Reads/Writes . . . 226
9.8.1 Via explicit displacements . . . 227
INSTITUT DU DÉVELOPPEMENT
MPI – Plan V
9.8.2 Via individual file pointers . . . 230
9.8.3 NonBlocking and collective Reads/Writes . . . 232
9.9 Advices. . . .235
10 Conclusion . . . 236
INSTITUT DU DÉVELOPPEMENT
7/236 1 – Introduction 1.1 – Availability and updating
1 – Introduction
1.1 – Availability and updating
This document is likely to be updated regularly. The most recent version is available on the Web server of IDRIS, sectionSupport de cours: https://cours.idris.fr
☞ IDRIS
Institut du développement et des ressources en informatique scientifique Rue John Von Neumann
Bâtiment 506 BP 167
91403 ORSAY CEDEX France
http://www.idris.fr
☞ Translated with the help of Yazeed Zabrie.
INSTITUT DU DÉVELOPPEMENT
8/236 1 – Introduction 1.2 – Definitions
1 – Introduction
1.2 – Definitions
➊ The sequential programming model:
☞ the program is executed by one and only one process ;
☞ all the variables and constants of the program are allocated in the memory
of the process ;
☞ a process is executed on a physical processor of the machine.
MEMORY
P
Program
Figure 1: Sequential programming model
INSTITUT DU DÉVELOPPEMENT
9/236 1 – Introduction 1.2 – Definitions
➋ In themessage passingprogramming model :
☞ the program is written in a classic language (Fortran,C,C++, etc.) ;
☞ all the variables of the program are private and reside in the local memory of
each process ;
☞ each process may execute different parts of a program ;
☞ a variable is exchanged between two or many processes via a call to
subroutines. Memories Processes Network Programs 0 1 2 3
Figure 2: Message-Passing Programming Model
INSTITUT DU DÉVELOPPEMENT
10/236 1 – Introduction 1.2 – Definitions
➌ TheSPMDexecution model :
☞ SingleProgram,MultipleData;
☞ the same program is executed by all the processes ;
☞ all machines support this programming model and some support only this
one ;
☞ it is a particular case of the general model ofMPMD(MultipleProgram, MultipleData), which it can also emulate.
Memories Processes Network Single program 0 1 2 3 d0 d1 d2 d3
Figure 3: Single Program, Multiple Data
INSTITUT DU DÉVELOPPEMENT
11/236 1 – Introduction 1.3 – Message Passing concepts
1 – Introduction
1.3 – Message Passing concepts
☞ If a message is sent to a process, the latter must receive it
t
target
source
I send
I receive
1
0
Figure 4: Message Passing
INSTITUT DU DÉVELOPPEMENT
12/236 1 – Introduction 1.3 – Message Passing concepts
☞ A message consists of data chunks passing from the sending process to the
receiving process
☞ In addition to the data (scalar variables, arrays, etc.) to be sent, a message has
to contain the following information :
➳ the identifier of the sending process ; ➳ the datatype ;
➳ the length ;
➳ the identifier of the receiving process.
Memories Processes Network sender receiver datatype length DATA d1 2 d2 1 d1 message message
Figure 5: Message Construction
INSTITUT DU DÉVELOPPEMENT
13/236 1 – Introduction 1.3 – Message Passing concepts
☞ The exchanged messages are interpreted and managed by an environment
comparable to telephony, fax, postal mail, e-mail, etc.
☞ The message is sent to a specified address.
☞ The receiving process must be able to classify and interpret the messages which
are sent to it.
☞ The environment in question is MPI (MessagePassingInterface). An MPI application is a group of independent processes executing, each one, their own code and communicating via calls to MPI library subroutines.
INSTITUT DU DÉVELOPPEMENT
14/236 1 – Introduction 1.4 – History
1 – Introduction
1.4 – History
☞ Version 1.0: in June 1994, the MPI forum (MessagePassingInterfaceForum), with the participation of about forty organizations, came to the definition of a set of subroutines concerning theMPI message passing library.
☞ Version 1.1: June 1995, with only minor changes.
☞ Version 1.2: in 1997, with minor changes for a better consistency of the naming
of some subroutines.
☞ Version 1.3: September 2008, with clarifications in MPI 1.2, according to
clarifications themselves made by MPI-2.1
☞ Version 2.0: released in July 1997, this version brought deliberately
non-integrated, essential additions in MPI 1.0 (process dynamic management, one-sided communications, parallel I/O, etc.).
☞ Version 2.1: June 2008, with only clarifications in MPI 2.0 but without any
changes.
☞ Version 2.2: September 2009, with only "small" additions
INSTITUT DU DÉVELOPPEMENT
15/236 1 – Introduction 1.4 – History
☞ Version 3.0: September 2012
➳ changes and important additions compared to version 2.2 ; ➳ main changes :
➠ nonblocking collective communications ;
➠ revision of the implementation of one-sided communications ; ➠ Fortran (2003-2008) bindings ;
➠ C++ bindings removed ;
➠ interfacing of external tools (for debugging and performance measures) ; ➠ etc.
➳ seehttp://meetings.mpi-forum.org/MPI_3.0_main_page.php https://svn.mpi-forum.org/trac/mpi-forum-web/wiki
INSTITUT DU DÉVELOPPEMENT
16/236 1 – Introduction 1.5 – Library
1 – Introduction
1.5 – Library
☞ Message Passing Interface Forum,MPI: A Message-Passing Interface Standard, Version 3.0, High Performance Computing Center Stuttgart (HLRS), 2012 https://fs.hlrs.de/projects/par/mpi/mpi30/
☞ Message Passing Interface Forum,MPI: A Message-Passing Interface Standard, Version 2.2, High Performance Computing Center Stuttgart (HLRS), 2009 https://fs.hlrs.de/projects/par/mpi/mpi22/
☞ William Gropp, Ewing Lusk et Anthony Skjellum : Using MPI: Portable Parallel
Programming with the Message Passing Interface, second edition, MIT Press, 1999
☞ William Gropp, Ewing Lusk et Rajeev Thakur : Using MPI-2, MIT Press, 1999
☞ Peter S. Pacheco : Parallel Programming with MPI, Morgan Kaufman Ed., 1997
☞ Additional references : http://www.mpi-forum.org/docs/ http://www.mpi-forum.org/mpi2_1/index.htm http://www.mcs.anl.gov/research/projects/mpi/learning.html http://www.mpi-forum.org/archives/notes/notes.html INSTITUT DU DÉVELOPPEMENT
17/236 1 – Introduction 1.5 – Library
☞ MPI open source implementations : they can be installed on a large number of
architectures but their performances are generally below the ones of the constructors’ implementations.
1 MPICH2: http://www.mpich.org/
2 Open MPI:http://www.open-mpi.org/
INSTITUT DU DÉVELOPPEMENT
18/236 1 – Introduction 1.5 – Library
☞ The debuggers
1 Totalviewhttp://www.roguewave.com/products/totalview.aspx
2 DDThttp://www.allinea.com/products/ddt/ ☞ Tools for performance measurements
1 MPE:MPI Parallel Environment
http://www.mcs.anl.gov/research/projects/perfvis/download/index.htm
2 Scalasca: Scalable Performance Analysis of Large-Scale Applications
http://www.scalasca.org/
3 Vampir
http://www.vampir.eu/
INSTITUT DU DÉVELOPPEMENT
19/236 1 – Introduction 1.5 – Library
☞ Some open source parallel scientific libraries :
1 ScaLAPACK: linear algebraic problem solvers using direct methods. The
sources can be downloaded from the website http://www.netlib.org/scalapack/
2 PETSc: linear and non-linear algebraic problem solvers using iterative
methods. The sources can be downloaded from the website http://www.mcs.anl.gov/petsc/
3 MUMPS: MUltifrontal Massively Parallel sparse direct Solvers. The sources
can be downloaded from the websitehttp://graal.ens-lyon.fr/MUMPS/
4 FFTW: fast Fourier transforms. The sources can be downloaded from the websitehttp://www.fftw.org
INSTITUT DU DÉVELOPPEMENT
20/236 2 – Environment 2.1 – Description
2 – Environment
2.1 – Description
☞ Every program unit calling MPI subroutines has to include a header file. In
Fortran, we must use thempi module introduced in MPI-2 (in MPI-1, it was the mpif.h file), and in C/C++ the mpi.h file.
☞ TheMPI_INIT() subroutine initializes the necessary environment : integer, intent(out) :: code
call MPI_INIT(code)
☞ TheMPI_FINALIZE() subroutine disables this environment : integer, intent(out) :: code
call MPI_FINALIZE(code)
INSTITUT DU DÉVELOPPEMENT
21/236 2 – Environment 2.1 – Description
☞ All the operations made by MPI are related tocommunicators. The default
communicator isMPI_COMM_WORLD which includes all the active processes.
0 1
3
2 4
5 6
Figure 6: MPI_COMM_WORLD Communicator
INSTITUT DU DÉVELOPPEMENT
22/236 2 – Environment 2.1 – Description
☞ At any moment, we can know the number of processes managed by a given
communicator by theMPI_COMM_SIZE() subroutine : integer, intent(out) :: nb_procs,code
call MPI_COMM_SIZE( MPI_COMM_WORLD ,nb_procs,code)
☞ Similarly, the MPI_COMM_RANK() subroutine allows to obtain the process rank (i.e. its instance number, which is a number between 0 and the value sent by
MPI_COMM_SIZE() – 1) : integer, intent(out) :: rank,code
call MPI_COMM_RANK( MPI_COMM_WORLD ,rank,code)
INSTITUT DU DÉVELOPPEMENT
23/236 2 – Environment 2.2 – Exemple
2 – Environment
2.2 – Exemple 1 program who_am_I 2 use mpi 3 implicit none 4 integer :: nb_procs,rank,code 5 6 call MPI_INIT(code) 78 call MPI_COMM_SIZE( MPI_COMM_WORLD ,nb_procs,code) 9 call MPI_COMM_RANK( MPI_COMM_WORLD ,rank,code) 10
11 print *,’I am the process ’,rank,’ among ’,nb_procs
12
13 call MPI_FINALIZE(code) 14 end program who_am_I
> mpiexec -n 7 who_am_I I am the process 3 among 7 I am the process 0 among 7 I am the process 4 among 7 I am the process 1 among 7 I am the process 5 among 7 I am the process 2 among 7 I am the process 6 among 7
INSTITUT DU DÉVELOPPEMENT
24/236 3 – Point-to-point Communications 3.1 – General Concepts
3 – Point-to-point Communications
3.1 – General Concepts
☞ Apoint-to-point communication occurs between two processes, one is called the senderprocess and the other one is called thereceiverprocess.
0 1 3 2 4 5 6 Sender Receiver 1000
Figure 7: Point-to-point communication
INSTITUT DU DÉVELOPPEMENT
25/236 3 – Point-to-point Communications 3.1 – General Concepts
☞ The sender and the receiver are identified by theirrankin the communicator. ☞ The so-calledmessage envelopeis composed of:
① the rank of the sender process; ② the rank of the receiver process; ③ the tag of the message;
④ the communicator which define the process group and communication
context.
☞ The exchanged data are predefined (integer, real, etc) or personal derived
datatypes.
☞ There are in each case many communicationmodes, calling different protocols
which will be discussed in Chapter 7.
INSTITUT DU DÉVELOPPEMENT
26/236 3 – Point-to-point Communications 3.1 – General Concepts
1 program point_to_point
2 use mpi
3 implicit none
4
5 integer, dimension( MPI_STATUS_SIZE ) :: status
6 integer, parameter :: tag=100
7 integer :: rank,value,code
8
9 call MPI_INIT(code) 10
11 call MPI_COMM_RANK( MPI_COMM_WORLD ,rank,code) 12
13 if (rank ==2) then
14 value=1000
15 call MPI_SEND(value,1, MPI_INTEGER ,5,tag, MPI_COMM_WORLD ,code) 16 elseif (rank ==5) then
17 call MPI_RECV(value,1, MPI_INTEGER ,2,tag, MPI_COMM_WORLD ,status,code) 18 print *,’I, process 5, I have received ’,value,’ from the process2’
19 end if
20
21 call MPI_FINALIZE(code) 22
23 end program point_to_point > mpiexec -n 7 point_to_point
I, process 5, I have received 1000 from the process 2
INSTITUT DU DÉVELOPPEMENT
27/236 3 – Point-to-point Communications 3.2 – Predefined MPI Datatypes
3 – Point-to-point Communications
3.2 – Predefined MPI Datatypes
MPI Type Fortran Type
MPI_INTEGER INTEGER
MPI_REAL REAL
MPI_DOUBLE_PRECISION DOUBLE PRECISION
MPI_COMPLEX COMPLEX
MPI_LOGICAL LOGICAL
MPI_CHARACTER CHARACTER
Table 1: Predefined MPI Datatypes (Fortran)
INSTITUT DU DÉVELOPPEMENT
28/236 3 – Point-to-point Communications 3.2 – Predefined MPI Datatypes
MPI Type C Type
MPI_CHAR signed char
MPI_SHORT signed short
MPI_INT signed int
MPI_LONG signed long int
MPI_UNSIGNED_CHAR unsigned char MPI_UNSIGNED_SHORT unsigned short MPI_UNSIGNED unsigned int MPI_UNSIGNED_LONG unsigned long int
MPI_FLOAT float
MPI_DOUBLE double
MPI_LONG_DOUBLE long double
Table 2: Predefined MPI Datatypes (C)
INSTITUT DU DÉVELOPPEMENT
29/236 3 – Point-to-point Communications 3.3 – Other Possibilities
3 – Point-to-point Communications
3.3 – Other Possibilities
☞ On the reception of a message, the rank of the sender and the tag can be
replaced respectively by the MPI_ANY_SOURCE and MPI_ANY_TAGwildcards.
☞ A communication with the dummy process of rankMPI_PROC_NULL has no effect.
☞ MPI_STATUS_IGNORE is a predefined constant that can be used instead of the status variable.
☞ MPI_SUCCESSis a predefined constant which allows testing the return code of a
MPI subroutine.
☞ There are syntactic variants,MPI_SENDRECV() and MPI_SENDRECV_REPLACE(), which combine a send and a receive operations (for the former, the receive memory zone must be different from the send one).
☞ It is possible to create more complex data structures using derived datatypes (see
the Chapter 6)
INSTITUT DU DÉVELOPPEMENT
30/236 3 – Point-to-point Communications 3.3 – Other Possibilities
0 1
1000
1001
Figure 8: sendrecvCommunication between the Processes 0 and 1
1 program sendrecv 2 use mpi 3 implicit none 4 integer :: rank,value,num_proc,code 5 integer,parameter :: tag=110 6 7 call MPI_INIT(code)
8 call MPI_COMM_RANK( MPI_COMM_WORLD ,rank,code) 9
10 ! We suppose that we have exactly 2 processes
11 num_proc=mod(rank+1,2)
12
13 call MPI_SENDRECV(rank+1000,1, MPI_INTEGER ,num_proc,tag,value,1, MPI_INTEGER , & 14 num_proc,tag, MPI_COMM_WORLD , MPI_STATUS_IGNORE ,code) 15
16 print *,’I, process ’,rank,’, I have received’,value,’from the process ’,num_proc 17
18 call MPI_FINALIZE(code) 19 end program sendrecv
INSTITUT DU DÉVELOPPEMENT
31/236 3 – Point-to-point Communications 3.3 – Other Possibilities
> mpiexec -n 2 sendrecv
I, process 1, I have received 1000 from the process 0 I, process 0, I have received 1001 from the process 1
Warning! It must be noticed that if theMPI_SEND() subroutine is implemented in a synchronousway (see the Chapter 7) in the used MPI implementation, the previous code will deadlock if, rather than to use the MPI_SENDRECV() subroutine, we used a MPI_SEND() subroutine followed by a MPI_RECV()one. In this case, each of the two processes would wait a receive command which will never occur, because the two sends would stay suspended. Therefore, for correctness and portability reasons, it is absolutely necessary to avoid such situations.
call MPI_SEND(rank+1000,1, MPI_INTEGER ,num_proc,tag, MPI_COMM_WORLD ,code)
call MPI_RECV(value,1, MPI_INTEGER ,num_proc,tag, MPI_COMM_WORLD , MPI_STATUS_IGNORE ,code)
INSTITUT DU DÉVELOPPEMENT
32/236 3 – Point-to-point Communications 3.3 – Other Possibilities
1 program joker
2 use mpi
3 implicit none
4 integer, parameter :: m=4,tag1=11,tag2=22
5 integer, dimension(m,m) :: A
6 integer :: nb_procs,rank,code
7 integer, dimension( MPI_STATUS_SIZE ):: status
8 integer :: nb_elements,i
9 integer, dimension(:), allocatable :: C 10
11 call MPI_INIT(code)
12 call MPI_COMM_SIZE( MPI_COMM_WORLD ,nb_procs,code) 13 call MPI_COMM_RANK( MPI_COMM_WORLD ,rank,code) 14
15 A(:,:) = 0
INSTITUT DU DÉVELOPPEMENT
33/236 3 – Point-to-point Communications 3.3 – Other Possibilities
16 if (rank == 0) then
17 ! Initialisation of the matrix A on the process 0
18 A(:,:) = reshape((/ (i,i=1,m*m) /), (/ m,m /))
19 ! Sending of 2 elements of the matrix A to the process 1 20 call MPI_SEND(A(1,1),2, MPI_INTEGER ,1,tag1, MPI_COMM_WORLD ,code) 21 ! Sending of 3 elements of the matrix A to the process 2 22 call MPI_SEND(A(1,2),3, MPI_INTEGER ,2,tag2, MPI_COMM_WORLD ,code) 23 else
24 ! We check before the receive if a message has arrived and from which process 25 call MPI_PROBE( MPI_ANY_SOURCE , MPI_ANY_TAG , MPI_COMM_WORLD ,status,code) 26 ! We check how many elements must be received
27 call MPI_GET_COUNT(status, MPI_INTEGER ,nb_elements,code) 28 ! We allocate the reception array C on each process
29 if (nb_elements /= 0 ) allocate (C(1:nb_elements))
30 ! We receive the message
31 call MPI_RECV(C,nb_elements, MPI_INTEGER ,status( MPI_SOURCE ),status( MPI_TAG ), &
32 MPI_COMM_WORLD ,status,code)
33 print *,’I, process ’,rank,’, I have received ’,nb_elements, &
34 ’elements from the process ’, &
35 status( MPI_SOURCE ),’My array C is ’, C(:)
36 end if
37
38 call MPI_FINALIZE(code) 39 end program joker
INSTITUT DU DÉVELOPPEMENT
34/236 3 – Point-to-point Communications 3.3 – Other Possibilities
> mpiexec -n 3 sendrecv1
I, process 1, I have received 2 elements from the process 0 My array C is 1 2
I, process 2, I have received 3 elements from the process 0 My array C is 5 6 7
INSTITUT DU DÉVELOPPEMENT
35/236 3 – Point-to-point Communications 3.4 – Example: communication ring
3 – Point-to-point Communications
3.4 – Example: communication ring
0 1 3 2 4 5 6 1000 1001 1002 1003 1004 1005 1006
Figure 9: Communication Ring
INSTITUT DU DÉVELOPPEMENT
36/236 3 – Point-to-point Communications 3.4 – Example: communication ring
If all the processes execute a send followed by a receive, all the communications could potentially start simultaneously and then will not occur in a ring way (outside the already mentioned portability problem if the implementation of MPI_SEND()is made in a synchronous way in the MPI library):
...
value=rank+1000
call MPI_SEND(value,1, MPI_INTEGER ,num_proc_next,tag, MPI_COMM_WORLD ,code) call MPI_RECV(value,1, MPI_INTEGER ,num_proc_previous,tag, MPI_COMM_WORLD , &
status,code) ...
INSTITUT DU DÉVELOPPEMENT
37/236 3 – Point-to-point Communications 3.4 – Example: communication ring
To ensure that the communications will really happen in aringway, like for the exchange of atokenbetween processes, it is necessary to use a different method and to have one process which starts the chain:
0 1 2 3 4 5 6 1000 1001 1002 1003 1004 1005 1006
Figure 10: Communication Ring
INSTITUT DU DÉVELOPPEMENT
38/236 3 – Point-to-point Communications 3.4 – Example: communication ring
1 program ring
2 use mpi
3 implicit none
4 integer, dimension( MPI_STATUS_SIZE ) :: status
5 integer, parameter :: tag=100
6 integer :: nb_procs,rank,value, &
7 num_proc_previous,num_proc_next,code
8 call MPI_INIT(code)
9 call MPI_COMM_SIZE( MPI_COMM_WORLD ,nb_procs,code) 10 call MPI_COMM_RANK( MPI_COMM_WORLD ,rank,code) 11
12 num_proc_next=mod(rank+1,nb_procs)
13 num_proc_previous=mod(nb_procs+rank-1,nb_procs) 14
15 if (rank == 0) then
16 call MPI_SEND(rank+1000,1, MPI_INTEGER ,num_proc_next,tag, &
17 MPI_COMM_WORLD ,code)
18 call MPI_RECV(value,1, MPI_INTEGER ,num_proc_previous,tag, &
19 MPI_COMM_WORLD ,status,code)
20 else
21 call MPI_RECV(value,1, MPI_INTEGER ,num_proc_previous,tag, &
22 MPI_COMM_WORLD ,status,code)
23 call MPI_SEND(rank+1000,1, MPI_INTEGER ,num_proc_next,tag, &
24 MPI_COMM_WORLD ,code)
25 end if
26
27 print *,’I, process ’,rank,’, I have received ’,value,’ from process ’, &
28 num_proc_previous
29
30 call MPI_FINALIZE(code) 31 end programring
INSTITUT DU DÉVELOPPEMENT
39/236 3 – Point-to-point Communications 3.4 – Example: communication ring
> mpiexec -n 7 ring
I, process 1, I have received 1000 from process 0 I, process 2, I have received 1001 from process 1 I, process 3, I have received 1002 from process 2 I, process 4, I have received 1003 from process 3 I, process 5, I have received 1004 from process 4 I, process 6, I have received 1005 from process 5 I, process 0, I have received 1006 from process 6
INSTITUT DU DÉVELOPPEMENT
40/236 4 – Collective communications 4.1 – General concepts
4 – Collective communications
4.1 – General concepts
☞ Thecollectivecommunications allow to make a series of point-to-point
communications in one single call.
☞ A collective communication always concerns all the processes of the indicated communicator.
☞ For each process, the call ends when its participation in the collective call is
completed, in the sense of point-to-point communications (when the concerned memory area can be changed).
☞ It is useless to add a global synchronization (barrier) after a collective call. ☞ The management oftagsin these communications is transparent and
system-dependent. Therefore, they are never explicitly defined during the calling of these subroutines. This has among other advantages that the collective communications never interfere with point-to-point communications.
INSTITUT DU DÉVELOPPEMENT
41/236 4 – Collective communications 4.1 – General concepts
☞ There are three types of subroutines :
❶ the one which ensures the global synchronizations : MPI_BARRIER().
❷ the ones which only transfer data :
❏ global distribution of data : MPI_BCAST() ;
❏ selective distribution of data : MPI_SCATTER() ;
❏ collection of distributed data : MPI_GATHER();
❏ collection by all the processes of distributed data : MPI_ALLGATHER() ;
❏ selective distribution, by all the processes, of distributed data :
MPI_ALLTOALL().
❸ the ones which, in addition to the communications management, carry out
operations on the transferred data :
❏ reduction operations (sum, product, maximum, minimum, etc.) whether
they are of a predefined or personal type : MPI_REDUCE() ;
❏ reduction operations with broadcasting of the result (it is in fact
equivalent to an MPI_REDUCE() followed by an MPI_BCAST()) : MPI_ALLREDUCE().
INSTITUT DU DÉVELOPPEMENT
42/236 4 – Collective communications 4.2 – Global synchronization:MPI_BARRIER()
4 – Collective communications
4.2 – Global synchronization:MPI_BARRIER()
Bar-rière
P0 P1 P2 P3 P0 P1 P2 P3 P0 P1 P2 P3
Figure 11: Global Synchronization : MPI_BARRIER()
integer, intent(out) :: code
call MPI_BARRIER( MPI_COMM_WORLD ,code)
INSTITUT DU DÉVELOPPEMENT
43/236 4 – Collective communications 4.3 – Broadcast :MPI_BCAST()
4 – Collective communications
4.3 – Broadcast :MPI_BCAST() 0 1 2 3 A A A A P0 P1 P2 A P3 MPI_BCAST() P0 A P1 A P2 A P3 AFigure 12: Broadcast : MPI_BCAST()
INSTITUT DU DÉVELOPPEMENT
44/236 4 – Collective communications 4.3 – Broadcast :MPI_BCAST() 1 program bcast 2 use mpi 3 implicit none 4 5 integer :: rank,value,code 6 7 call MPI_INIT(code)
8 call MPI_COMM_RANK( MPI_COMM_WORLD ,rank,code) 9
10 if (rank ==2) value=rank+1000 11
12 call MPI_BCAST(value,1, MPI_INTEGER ,2, MPI_COMM_WORLD ,code) 13
14 print *,’I, process ’,rank,’ I have received ’,value,’ of the process2’ 15
16 call MPI_FINALIZE(code) 17
18 end program bcast > mpiexec -n 4 bcast
I, process 2 I have received 1002 of the process 2 I, process 0 I have received 1002 of the process 2 I, process 1 I have received 1002 of the process 2 I, process 3 I have received 1002 of the process 2
INSTITUT DU DÉVELOPPEMENT
45/236 4 – Collective communications 4.4 – Scatter :MPI_SCATTER()
4 – Collective communications
4.4 – Scatter :MPI_SCATTER() 0 1 2 3 A0 A2 A1 A3 P0 P1 P2 A0 A1 A2 A3 P3 MPI_SCATTER() P0 A0 P1 A1 P2 A2 P3 A3Figure 13: Scatter :MPI_SCATTER()
INSTITUT DU DÉVELOPPEMENT
46/236 4 – Collective communications 4.4 – Scatter :MPI_SCATTER()
1 program scatter
2 use mpi
3 implicit none
4
5 integer, parameter :: nb_values=8
6 integer :: nb_procs,rank,block_length,i,code
7 real, allocatable, dimension(:) :: values,data 8
9 call MPI_INIT(code)
10 call MPI_COMM_SIZE( MPI_COMM_WORLD ,nb_procs,code) 11 call MPI_COMM_RANK( MPI_COMM_WORLD ,rank,code) 12 block_length=nb_values/nb_procs 13 allocate(data(block_length)) 14 15 if (rank ==2) then 16 allocate(values(nb_values)) 17 values(:)=(/(1000.+i,i=1,nb_values)/)
18 print *,’I, process ’,rank,’send my values array : ’,&
19 values(1:nb_values)
20 end if
21
22 call MPI_SCATTER(values,block_length, MPI_REAL ,data,block_length, &
23 MPI_REAL ,2, MPI_COMM_WORLD ,code)
24 print *,’I, process ’,rank,’, I have received ’, data(1:block_length), &
25 ’ of the process 2’
26 call MPI_FINALIZE(code) 27
28 end program scatter > mpiexec -n 4 scatter
I, process 2 send my values array :
1001. 1002. 1003. 1004. 1005. 1006. 1007. 1008.
I, process 0, I have received 1001. 1002. of the process 2 I, process 1, I have received 1003. 1004. of the process 2 I, process 3, I have received 1007. 1008. of the process 2 I, process 2, I have received 1005. 1006. of the process 2
INSTITUT DU DÉVELOPPEMENT
47/236 4 – Collective communications 4.5 – Gather :MPI_GATHER()
4 – Collective communications
4.5 – Gather : MPI_GATHER() 0 1 2 3 A0 A2 A1 A3 P0 A0 P1 A1 P2 A2 P3 A3 MPI_GATHER() P0 P1 P2 A0 A1 A2 A3 P3Figure 14: Gather :MPI_GATHER()
INSTITUT DU DÉVELOPPEMENT
48/236 4 – Collective communications 4.5 – Gather :MPI_GATHER()
1 program gather
2 use mpi
3 implicit none
4 integer, parameter :: nb_values=8
5 integer :: nb_procs,rank,block_length,i,code
6 real, dimension(nb_values) :: data
7 real, allocatable, dimension(:) :: values 8
9 call MPI_INIT(code)
10 call MPI_COMM_SIZE( MPI_COMM_WORLD ,nb_procs,code) 11 call MPI_COMM_RANK( MPI_COMM_WORLD ,rank,code) 12 13 block_length=nb_values/nb_procs 14 15 allocate(values(block_length)) 16 17 values(:)=(/(1000.+rank*block_length+i,i=1,block_length)/) 18 print *,’I, process ’,rank,’send my values array : ’,&
19 values(1:block_length)
20
21 call MPI_GATHER(values,block_length, MPI_REAL ,data,block_length, &
22 MPI_REAL ,2, MPI_COMM_WORLD ,code)
23
24 if (rank ==2) print *,’I, process 2’, ’ have received ’,data(1:nb_values) 25
26 call MPI_FINALIZE(code) 27
28 end program gather > mpiexec -n 4 gather
I, process 1 send my values array :1003. 1004. I, process 0 send my values array :1001. 1002. I, process 2 send my values array :1005. 1006. I, process 3 send my values array :1007. 1008.
I, process 2 have received 1001. 1002. 1003. 1004. 1005. 1006. 1007. 1008.
INSTITUT DU DÉVELOPPEMENT
49/236 4 – Collective communications 4.6 – Gather-to-all :MPI_ALLGATHER()
4 – Collective communications
4.6 – Gather-to-all :MPI_ALLGATHER() 0 1 2 3 A0 A0 A0 A0 A1 A1 A1 A1 A2 A2 A2 A2 A3 A3 A3 A3 P0 A0 P1 A1 P2 A2 P3 A3 MPI_ALLGATHER() P0 A0 A1 A2 A3 P1 A0 A1 A2 A3 P2 A0 A1 A2 A3 P3 A0 A1 A2 A3Figure 15: Allgather :MPI_ALLGATHER()
INSTITUT DU DÉVELOPPEMENT
50/236 4 – Collective communications 4.6 – Gather-to-all :MPI_ALLGATHER()
1 program allgather
2 use mpi
3 implicit none
4
5 integer, parameter :: nb_values=8
6 integer :: nb_procs,rank,block_length,i,code
7 real, dimension(nb_values) :: data
8 real, allocatable, dimension(:) :: values 9
10 call MPI_INIT(code)
11
12 call MPI_COMM_SIZE( MPI_COMM_WORLD ,nb_procs,code) 13 call MPI_COMM_RANK( MPI_COMM_WORLD ,rank,code) 14 15 block_length=nb_values/nb_procs 16 allocate(values(block_length)) 17 18 values(:)=(/(1000.+rank*block_length+i,i=1,block_length)/) 19
20 call MPI_ALLGATHER(values,block_length, MPI_REAL ,data,block_length, &
21 MPI_REAL , MPI_COMM_WORLD ,code)
22
23 print *,’I, process ’,rank,’, I have received ’,data(1:nb_values) 24
25 call MPI_FINALIZE(code) 26
27 end program allgather > mpiexec -n 4 allgather
I, process 1, I have received 1001. 1002. 1003. 1004. 1005. 1006. 1007. 1008.
I, process 3, I have received 1001. 1002. 1003. 1004. 1005. 1006. 1007. 1008.
I, process 2, I have received 1001. 1002. 1003. 1004. 1005. 1006. 1007. 1008.
I, process 0, I have received 1001. 1002. 1003. 1004. 1005. 1006. 1007. 1008.
INSTITUT DU DÉVELOPPEMENT
51/236 4 – Collective communications 4.7 – Extended gather :MPI_GATHERV()
4 – Collective communications
4.7 – Extended gather : MPI_GATHERV()
0 1 2 3 A0 A2 A1 A3 P0 A0 P1 A1 P2 A2 P3 A3 MPI_GATHERV() P0 P1 P2 A0 A1 A2 A3 P3
Figure 16: Gatherv : MPI_GATHERV()
INSTITUT DU DÉVELOPPEMENT
52/236 4 – Collective communications 4.7 – Extended gather :MPI_GATHERV()
1 program gatherv
2 use mpi
3 implicit none
4 INTEGER, PARAMETER :: nb_values=10
5 INTEGER :: nb_procs, rank, block_length, i, code
6 REAL, DIMENSION(nb_values) :: data,remainder
7 REAL, ALLOCATABLE, DIMENSION(:) :: values
8 INTEGER, ALLOCATABLE, DIMENSION(:) :: nb_elements_received,displacement 9
10 CALL MPI_INIT(code)
11 CALL MPI_COMM_SIZE( MPI_COMM_WORLD ,nb_procs,code) 12 CALL MPI_COMM_RANK( MPI_COMM_WORLD ,rank,code) 13
14 block_length=nb_values/nb_procs 15 remainder=mod(nb_values,nb_procs)
16 if(remainder > rank) block_length = block_length + 1 17 ALLOCATE(values(block_length))
18 values(:) = (/(1000.+(rank*(nb_values/nb_procs))+min(rank,remainder)+i, &
19 i=1,block_length)/)
20
21 PRINT *, ’I, process ’, rank,’send my values array : ’,&
22 values(1:block_length) 23 24 IF (rank == 2) THEN 25 ALLOCATE(nb_elements_received(nb_procs),displacement(nb_procs)) 26 nb_elements_received(1) = nb_values/nb_procs 27 if (remainder > 0) nb_elements_received(1)=nb_elements_received(1)+1 28 displacement(1) = 0 29 DO i=2,nb_procs 30 displacement(i) = displacement(i-1)+nb_elements_received(i-1) 31 nb_elements_received(i) = nb_values/nb_procs
32 if (remainder > i-1) nb_elements_received(i)=nb_elements_received(i)+1
33 END DO
34 END IF
35
INSTITUT DU DÉVELOPPEMENT
53/236 4 – Collective communications 4.7 – Extended gather :MPI_GATHERV()
CALL MPI_GATHERV(values,block_length, MPI_REAL ,data,nb_elements_received,& displacement, MPI_REAL ,2, MPI_COMM_WORLD ,code)
IF (rank == 2) PRINT *, ’I, processus 2 receives ’, data(1:nb_values) CALL MPI_FINALIZE(code)
end program gatherv > mpiexec -n 4 gatherv
I, process 0 send my values’ array : 1001. 1002. 1003. I, process 2 send my values’ array : 1007. 1008. I, process 3 send my values’ array : 1009. 1010. I, process 1 send my values’ array : 1004. 1005. 1006.
I, process 2 receives 1001. 1002. 1003. 1004. 1005. 1006. 1007. 1008.
1009. 1010.
INSTITUT DU DÉVELOPPEMENT
54/236 4 – Collective communications 4.8 – All-to-all :MPI_ALLTOALL()
4 – Collective communications
4.8 – All-to-all :MPI_ALLTOALL() 0 1 2 3 A0 A1 A2 A3 B0 B1 B2 B3 C0 C1 C2 C3 D0 D1 D2 D3 P0 A0 A1 A2 A3 P1 B0 B1 B2 B3 P2 C0 C1 C2 C3 P3 D0 D1 D2 D3 MPI_ALLTOALL() P0 A0 B0 C0 D0 P1 A1 B1 C1 D1 P2 A2 B2 C2 D2 P3 A3 B3 C3 D3Figure 17: Alltoall :MPI_ALLTOALL()
INSTITUT DU DÉVELOPPEMENT
55/236 4 – Collective communications 4.8 – All-to-all :MPI_ALLTOALL()
1 program alltoall
2 use mpi
3 implicit none
4
5 integer, parameter :: nb_values=8
6 integer :: nb_procs,rank,block_length,i,code
7 real, dimension(nb_values) :: values,data
8
9 call MPI_INIT(code)
10 call MPI_COMM_SIZE( MPI_COMM_WORLD ,nb_procs,code) 11 call MPI_COMM_RANK( MPI_COMM_WORLD ,rank,code) 12
13 values(:)=(/(1000.+rank*nb_values+i,i=1,nb_values)/) 14 block_length=nb_values/nb_procs
15
16 print *,’I, process ’,rank,’send my values array : ’,&
17 values(1:nb_values)
18
19 call MPI_ALLTOALL(values,block_length, MPI_REAL ,data,block_length, &
20 MPI_REAL , MPI_COMM_WORLD ,code)
21
22 print *,’I, process ’,rank,’, I have received ’,data(1:nb_values) 23
24 call MPI_FINALIZE(code) 25 end program alltoall
INSTITUT DU DÉVELOPPEMENT
56/236 4 – Collective communications 4.8 – All-to-all :MPI_ALLTOALL()
> mpiexec -n 4 alltoall
I, process 1 send my values array :
1009. 1010. 1011. 1012. 1013. 1014. 1015. 1016. I, process 0 send my values array :
1001. 1002. 1003. 1004. 1005. 1006. 1007. 1008. I, process 2 send my values array :
1017. 1018. 1019. 1020. 1021. 1022. 1023. 1024. I, process 3 send my values array :
1025. 1026. 1027. 1028. 1029. 1030. 1031. 1032.
I, process 0, I have received 1001. 1002. 1009. 1010. 1017. 1018. 1025. 1026.
I, process 2, I have received 1005. 1006. 1013. 1014. 1021. 1022. 1029. 1030.
I, process 1, I have received 1003. 1004. 1011. 1012. 1019. 1020. 1027. 1028.
I, process 3, I have received 1007. 1008. 1015. 1016. 1023. 1024. 1031. 1032.
INSTITUT DU DÉVELOPPEMENT
57/236 4 – Collective communications 4.9 – Global reduction
4 – Collective communications
4.9 – Global reduction
☞ Areductionis an operation applied to a set of elements in order to obtain one
single value. Classical examples are the sum of the elements of a vector
(SUM(A(:))) or the search of the maximum value element in a vector (MAX(V(:))).
☞ MPI proposes high-level subroutines in order to operate reductions on distributed
data on a group of processes. The result is obtained on one process
(MPI_REDUCE()) or on all (MPI_ALLREDUCE(), which is in fact equivalent to an MPI_REDUCE() followed by an MPI_BCAST()).
☞ If several elements are implied by process, the reduction function is applied to
each one of them.
☞ TheMPI_SCAN() subroutine allows also to make partial reductions by
considering, for each process, the previous processes of the group and itself.
☞ TheMPI_OP_CREATE() and MPI_OP_FREE() subroutines allow personal reduction operations.
INSTITUT DU DÉVELOPPEMENT
58/236 4 – Collective communications 4.9 – Global reduction
Table 3: Main Predefined Reduction Operations (there are also other logical operations)
Name Opération
MPI_SUM Sum of elements MPI_PROD Product of elements MPI_MAX Maximum of elements MPI_MIN Minimum of elements
MPI_MAXLOC Maximum of elements and location MPI_MINLOC Minimum of elements and location MPI_LAND Logical AND
MPI_LOR Logical OR
MPI_LXOR Logical exclusive OR
INSTITUT DU DÉVELOPPEMENT
59/236 4 – Collective communications 4.9 – Global reduction 0 1 3 2 4 5 6 1 1000+1+2+3+4+5+6 =1021 2 3 4 5 6
Figure 18: Distributed reduction (sum)
INSTITUT DU DÉVELOPPEMENT
60/236 4 – Collective communications 4.9 – Global reduction 1 program reduce 2 use mpi 3 implicit none 4 integer :: nb_procs,rank,value,sum,code 5 6 call MPI_INIT(code)
7 call MPI_COMM_SIZE( MPI_COMM_WORLD ,nb_procs,code) 8 call MPI_COMM_RANK( MPI_COMM_WORLD ,rank,code) 9 10 if (rank ==0) then 11 value=1000 12 else 13 value=rank 14 endif 15
16 call MPI_REDUCE(value,sum,1, MPI_INTEGER , MPI_SUM ,0, MPI_COMM_WORLD ,code) 17
18 if (rank == 0) then
19 print *,’I, process 0, I have the global sum value ’,sum
20 end if
21
22 call MPI_FINALIZE(code) 23 end program reduce
> mpiexec -n 7 reduce
I, process 0, I have the global sum value 1021
INSTITUT DU DÉVELOPPEMENT
61/236 4 – Collective communications 4.9 – Global reduction 0 1 3 2 4 5 6 1 10×1×2×3×4×5×6 =7200 2 3 4 5 6 7200 7200 7200 7200 7200 7200 7200
Figure 19: Distributed reduction (product) with broadcast of the result
INSTITUT DU DÉVELOPPEMENT
62/236 4 – Collective communications 4.9 – Global reduction 1 program allreduce 2 3 use mpi 4 implicit none 5 6 integer :: nb_procs,rank,value,product,code 7 8 call MPI_INIT(code)
9 call MPI_COMM_SIZE( MPI_COMM_WORLD ,nb_procs,code) 10 call MPI_COMM_RANK( MPI_COMM_WORLD ,rank,code) 11 12 if (rank ==0) then 13 value=10 14 else 15 value=rank 16 endif 17
18 call MPI_ALLREDUCE(value,product,1, MPI_INTEGER , MPI_PROD , MPI_COMM_WORLD ,code) 19
20 print *,’I,process ’,rank,’,I have received the value of the global product ’,product 21
22 call MPI_FINALIZE(code) 23
24 end program allreduce
INSTITUT DU DÉVELOPPEMENT
63/236 4 – Collective communications 4.9 – Global reduction
> mpiexec -n 7 allreduce
I, process 6, I have received the value of the global product 7200 I, process 2, I have received the value of the global product 7200 I, process 0, I have received the value of the global product 7200 I, process 4, I have received the value of the global product 7200 I, process 5, I have received the value of the global product 7200 I, process 3, I have received the value of the global product 7200 I, process 1, I have received the value of the global product 7200
INSTITUT DU DÉVELOPPEMENT
64/236 4 – Collective communications 4.10 – Additions
4 – Collective communications
4.10 – Additions
☞ TheMPI_SCATTERV(), MPI_GATHERV(), MPI_ALLGATHERV() and MPI_ALLTOALLV() subroutines extend MPI_SCATTER(), MPI_GATHER(), MPI_ALLGATHER() and MPI_ALLTOALL() in the case where the number of elements to transmit or gather is different following the processes.
☞ Two new subroutines have been added to extend the possibilities of collective
subroutines in some particular cases:
➳ MPI_ALLTOALLW(): MPI_ALLTOALLV() version where the displacements are expressed in bytes and not in elements,
➳ MPI_EXSCAN(): exclusive version of MPI_SCAN().
INSTITUT DU DÉVELOPPEMENT
65/236 5 – One-sided Communication 5.1 – Introduction
5 – One-sided Communication
5.1 – Introduction
There are various approaches to transfer data between two different processes. Among the most commonly used are :
➊ Point-to-point communications by message-passing (MPI, etc.) ;
➋ One-sided communications (direct access to the memory of a distant process).
Also called RMA for Remote Memory Access , it is one of the major contributions ofMPI.
INSTITUT DU DÉVELOPPEMENT
66/236 5 – One-sided Communication 5.1 – Introduction
5 – One-sided Communication
5.1 – Introduction
5.1.1 – Reminder : The Concept of Message-Passing
t target source I send I receive
1
0
Figure 20: Message-PassingIn message-passing, a sender (origin) sends a message to a destination process (target) which will make all what is necessary to receive this message. This requires that the sender as well as the receiver be involved in the communication. This can be restrictive and difficult to implement in some algorithms (for example when it is necessary to manage a global counter).
INSTITUT DU DÉVELOPPEMENT
67/236 5 – One-sided Communication 5.1 – Introduction
5 – One-sided Communication
5.1 – Introduction
5.1.2 – The Concept of One-sided Communication
The concept of one-sided communication is not new,MPIhaving simply unified the already existing constructors’ solutions (such as shmem (CRAY), lapi (IBM), ...) by offering its own RMA primitives. Through these subroutines, a process has a direct access (in read, write or update) to the memory of another remote process. In this approach, the remote process does not have to participate in the data-transfer process. The principle advantages are the following :
☞ enhanced performances when the hardware allows it, ☞ a simpler programming for some algorithms.
INSTITUT DU DÉVELOPPEMENT
68/236 5 – One-sided Communication 5.1 – Introduction
5 – One-sided Communication
5.1 – Introduction
5.1.3 – RMA Approach of MPI
The use ofMPIRMA is done in three steps:
➊ definition on each process of a memory area (local memory window) visible and
eventually accessible to remote processes ;
➋ start of the data transfer directly from the memory of a process to the memory of
another process. It is therefore necessary to specify the type, the number and the initial and final localization of data.
➌ completion of current transfers by a step of synchronization, the data are then
available.
INSTITUT DU DÉVELOPPEMENT
69/236 5 – One-sided Communication 5.2 – Memory Window
5 – One-sided Communication
5.2 – Memory Window
All the processes participating in an one-sided communication have to specify which part of their memory will be available to the other processes ; it is the notion of memory window.
☞ More precisely, the MPI_WIN_CREATE() collective operation allows the creation of an MPI window object. This object is composed, for each process, of a specific memory area called local memory window. For each process, a local memory window is characterized by its initial address, its size in bytes (which can be zero) and the displacement unit size inside this window (in bytes). These
characteristics can be different on each process.
INSTITUT DU DÉVELOPPEMENT
70/236 5 – One-sided Communication 5.2 – Memory Window
Example:
0 1
MPI Window Object win1
MPI Window Object win2 First local window
Second local window
First local window Second local window
Figure 21: Creation of two MPI window objects, win1 and win2
INSTITUT DU DÉVELOPPEMENT
71/236 5 – One-sided Communication 5.2 – Memory Window
☞ Once the transfers are finished, the window has to be freed with the
MPI_WIN_FREE() subroutine.
☞ MPI_WIN_GET_ATTR() allows to know the characteristics of a local memory window by using the keywordsMPI_WIN_BASE, MPI_WIN_SIZE or
MPI_WIN_DISP_UNIT. Remark :
☞ The choice of the displacement unit associated with the local memory window is
important (essential in a heterogenous environment and facilitating the encoding in all the cases). The size of a MPI datatype is obtained by calling the
MPI_TYPE_SIZE() subroutine.
INSTITUT DU DÉVELOPPEMENT
72/236 5 – One-sided Communication 5.2 – Memory Window 1 program window 2 3 use mpi 4 implicit none 5
6 integer :: code, rank, size_real, win, n=4
7 integer (kind= MPI_ADDRESS_KIND ) :: dim_win, size, base, unit 8 real(kind=kind(1.d0)), dimension(:), allocatable :: win_local
9 logical :: flag
10
11 call MPI_INIT(code)
12 call MPI_COMM_RANK( MPI_COMM_WORLD , rank, code)
13 call MPI_TYPE_SIZE( MPI_DOUBLE_PRECISION ,size_real,code) 14
15 if (rank==0) n=0
16 allocate(win_local(n))
17 dim_win = size_real*n
18
19 call MPI_WIN_CREATE(win_local, dim_win, size_real, MPI_INFO_NULL , &
20 MPI_COMM_WORLD , win, code)
21 call MPI_WIN_GET_ATTR(win, MPI_WIN_SIZE , size, flag, code) 22 call MPI_WIN_GET_ATTR(win, MPI_WIN_BASE , base, flag, code) 23 call MPI_WIN_GET_ATTR(win, MPI_WIN_DISP_UNIT , unit, flag, code) 24 call MPI_WIN_FREE(win,code)
25 print *,"process", rank,"size, base, unit = " &
26 ,size, base, unit
27 call MPI_FINALIZE(code) 28 end program window
> mpiexec -n 3 window
process 1 size, base, unit = 32 17248330400 8
process 0 size, base, unit = 0 2 8
process 2 size, base, unit = 32 17248330400 8
INSTITUT DU DÉVELOPPEMENT
73/236 5 – One-sided Communication 5.3 – Data Transfer
5 – One-sided Communication
5.3 – Data Transfer
MPIallows a process to read (MPI_GET()), to write (MPI_PUT()) and to update (MPI_ACCUMULATE()) data located in the local memory window of a remote process. We call the process which makes the calling of the initialization subroutine of the transferoriginand we call the process which has the local memory window which will be used in the transfer proceduretarget.
During the initialization of the transfer, the target process does not call anyMPI
subroutine. All the necessary informations are specified under the form of parameters during the calling of theMPIsubroutine by the origin.
INSTITUT DU DÉVELOPPEMENT
74/236 5 – One-sided Communication 5.3 – Data Transfer
In particular, we find :
☞ parameters related to the origin :
the datatype of elements ; their number ;
the memory address of the first element.
☞ parameters related to the target :
the rank of the target as well as the MPI window object, which uniquely determines a local memory window ;
a displacement in this local window ;
the number of elements to be transferred and their datatype.
INSTITUT DU DÉVELOPPEMENT
75/236 5 – One-sided Communication 5.3 – Data Transfer
1 program example_put
2 integer :: nb_orig=10, nb_target=10, target=1, win, code 3 integer (kind= MPI_ADDRESS_KIND ) :: displacement=40 4 integer, dimension(10) :: B
5 ...
6 call MPI_PUT(B,nb_orig, MPI_INTEGER ,target,displacement,nb_target, MPI_INTEGER ,win,code) 7 ...
8 end program example_put
0 1
First Local Window
First Local Window Displacement of 40 units in the local window
B(1:10)
Figure 22: Example of a MPI_PUT
INSTITUT DU DÉVELOPPEMENT
76/236 5 – One-sided Communication 5.3 – Data Transfer
Remarks :
☞ TheMPI_GET syntax is identical to theMPI_PUT syntax, only the direction of the data transfer is reversed.
☞ The RMA transfer data subroutines are nonblocking primitives (deliberate choice
ofMPI).
☞ On the target process, the only accessible data are the ones contained in the local
memory window.
☞ MPI_ACCUMULATE() admits among its parameters an operation which should be
either of the MPI_REPLACEtype, or one of the predefined reduction operations : MPI_SUM,MPI_PROD, MPI_MAX, etc. This cannot be in any case an operation defined by the user.
INSTITUT DU DÉVELOPPEMENT
77/236 5 – One-sided Communication 5.4 – Completion of the Transfer : the synchronization
5 – One-sided Communication
5.4 – Completion of the Transfer : the synchronization
The data transfer starts after the call to one of the nonblocking subroutines (MPI_PUT(), ...). But when is the transfer completed and the data really available ?
After a synchronization by the programmer. These synchronizations can be classified into two types :
➊ Active targetsynchronization (collective operation, all the processes associated
with the window involved in the synchronization) ;
➋ Passive targetsynchronization (only the origin process calls the synchronization
subroutine).
INSTITUT DU DÉVELOPPEMENT
78/236 5 – One-sided Communication 5.4 – Completion of the Transfer : the synchronization
5 – One-sided Communication
5.4 – Completion of the Transfer : the synchronization 5.4.1 – Active Target Synchronization
☞ It is made by using theMPIsubroutine MPI_WIN_FENCE().
☞ MPI_WIN_FENCE() is a collective operation on all the processes associated with the MPI window object.
☞ MPI_WIN_FENCE()acts as a synchronization barrier. It waits for the end of all the
data transfers (RMA or not) using the local memory window and initiated since the last call toMPI_WIN_FENCE().
☞ This primitive will help separate the calculation parts of the code (where data of
the local memory window are used viaload orstore) of RMA data transfer parts.
☞ Anassert argument of the MPI_WIN_FENCE() primitive, of integer type, allows to refine its behavior for better performances. Different values are predefined MPI_MODE_NOSTORE, MPI_MODE_NOPUT, MPI_MODE_NOPRECEDE,
MPI_MODE_NOSUCCEED. A zero value for this argument is always valid.
INSTITUT DU DÉVELOPPEMENT
79/236 5 – One-sided Communication 5.4 – Completion of the Transfer : the synchronization
Remarks :
☞ The fact of having chosen nonblocking RMA subroutines for initialisation of the
transfers and having a synchronization for the completion of actual transfers authorizes the implementation to gather, during the execution, different transfers towards the same target in one single transfer. The latencies effects are therefore reduced and the performances are improved.
☞ The collective character of the synchronization results in that we have nothing to
do with One Sided Communication... In fact all the processes of the
communicator will have to participate in the synchronization, which makes it of less interest !
INSTITUT DU DÉVELOPPEMENT
80/236 5 – One-sided Communication 5.4 – Completion of the Transfer : the synchronization
The Proper Use of MPI_WIN_FENCE()
☞ It is important to make sure that between two successive calls to
MPI_WIN_FENCE() either there is only local operations (load/store) on variables that are in the local memory window of the process, or there is MPI_PUT()RMA operations orMPI_ACCUMULATE(), but never both at once !
0 1 MPI_WIN_FENCE() MPI_WIN_FENCE() | win_loc(:) = win_loc(:) + 1.0 MPI_WIN_FENCE() MPI_WIN_FENCE() MPI_PUT() ① MPI_WIN_FENCE() MPI_WIN_FENCE()
Is the previous program compliant with the proper use ofMPI_WIN_FENCE() ? All depends on the code portion represented by①. If this latter does not produce
aload/storeon the local window (assignment or use of a variable stored in the local window), then it is correct ; otherwise, the result is undefined.
INSTITUT DU DÉVELOPPEMENT
81/236 5 – One-sided Communication 5.4 – Completion of the Transfer : the synchronization Example 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 4 4 4 4 6 6 6 8 8 12 MPI_PUT MPI_GET MPI_ACCUMULATE tab tab tab tab tab tab win_loc win_loc win_loc win_loc win_loc win_loc P0 P1 corresponding line l. 1,30 l. 31,35 l. 37 l. 38,42 l. 44 l. 45,49 l. 50 l. 51,55 l. 57 l. 58,63 l. 65
Figure 23: Example corresponding to the ex_fence program
INSTITUT DU DÉVELOPPEMENT
82/236 5 – One-sided Communication 5.4 – Completion of the Transfer : the synchronization
1 program ex_fence
2 use mpi
3 implicit none
4
5 integer, parameter :: assert=0
6 integer :: code, rank, size_real, win, i, nb_elements, target, m=4, n=4 7 integer (kind= MPI_ADDRESS_KIND ) :: displacement, dim_win
8 real(kind=kind(1.d0)), dimension(:), allocatable :: win_local, tab 9
10 call MPI_INIT(code)
11 call MPI_COMM_RANK( MPI_COMM_WORLD , rank, code)
12 call MPI_TYPE_SIZE( MPI_DOUBLE_PRECISION ,size_real,code) 13 14 if (rank==0) then 15 n=0 16 allocate(tab(m)) 17 endif 18 19 allocate(win_local(n)) 20 dim_win = size_real*n 21
22 call MPI_WIN_CREATE(win_local, dim_win, size_real, MPI_INFO_NULL , &
23 MPI_COMM_WORLD , win, code)
INSTITUT DU DÉVELOPPEMENT
83/236 5 – One-sided Communication 5.4 – Completion of the Transfer : the synchronization
24 if (rank==0) then
25 tab(:) = (/ (i, i=1,m) /)
26 else 27 win_local(:) = 0.0 28 end if 29 30 call MPI_WIN_FENCE(assert,win,code) 31 if (rank==0) then
32 target = 1; nb_elements = 2; displacement = 1
33 call MPI_PUT(tab, nb_elements, MPI_DOUBLE_PRECISION , target, displacement, &
34 nb_elements, MPI_DOUBLE_PRECISION , win, code)
35 end if 36 37 call MPI_WIN_FENCE(assert,win,code) 38 if (rank==0) then 39 tab(m) = sum(tab(1:m-1)) 40 else 41 win_local(n) = sum(win_local(1:n-1)) 42 endif 43 44 call MPI_WIN_FENCE(assert,win,code) 45 if (rank==0) then 46 nb_elements = 1; displacement = m-1
47 call MPI_GET(tab, nb_elements, MPI_DOUBLE_PRECISION , target, displacement, &
48 nb_elements, MPI_DOUBLE_PRECISION , win, code)
49 end if
INSTITUT DU DÉVELOPPEMENT
84/236 5 – One-sided Communication 5.4 – Completion of the Transfer : the synchronization 50 call MPI_WIN_FENCE(assert,win,code) 51 if (rank==0) then 52 tab(m) = sum(tab(1:m-1)) 53 else 54 win_local(:) = win_local(:) + 1 55 endif 56 57 call MPI_WIN_FENCE(assert,win,code) 58 if (rank==0) then 59 nb_elements = m-1; displacement = 1
60 call MPI_ACCUMULATE(tab(2), nb_elements, MPI_DOUBLE_PRECISION , target, &
61 displacement, nb_elements, MPI_DOUBLE_PRECISION , &
62 MPI_SUM , win, code)
63 end if 64 65 call MPI_WIN_FENCE(assert,win,code) 66 call MPI_WIN_FREE(win,code) 67 68 if (rank==0) then
69 print *,"process", rank, "tab=",tab(:) 70 else
71 print *,"process", rank, "win_local=",win_local(:)
72 endif
73
74 call MPI_FINALIZE(code) 75 end program ex_fence
INSTITUT DU DÉVELOPPEMENT
85/236 5 – One-sided Communication 5.4 – Completion of the Transfer : the synchronization
Some Details and Restrictions...
☞ It is possible to work on different local memory windows which overlap, even if
this is not recommanded (such a use implies so many restrictions). We will presume next never being in this case.
☞ It is always important to separate by a call to MPI_WIN_FENCE()astoreand by a call to the MPI_PUT()or MPI_ACCUMULATE() subroutine that accesses the same local memory window even at different locations that do not overlap.
☞ Between two successive calls to the MPI_WIN_FENCE()subroutine, we have the following constraints :
• the MPI_PUT() subroutines do not allow the overlapping inside the same
local memory window. In other terms, the memory areas involved during calls to many MPI_PUT()subroutines that modify the same local memory window, must not overlap ;
INSTITUT DU DÉVELOPPEMENT
86/236 5 – One-sided Communication 5.4 – Completion of the Transfer : the synchronization
• the MPI_ACCUMULATE() subroutines allow the overlapping inside the same
local memory window, on the condition that the datatypes and the used reduction operations be identical during all of these calls ;
• using consecutively the MPI_PUT()and MPI_ACCUMULATE() subroutines do
not allow the overlapping inside the same local memory window ;
• aload and a call to theMPI_GET() subroutine can access concurrently
whatever part of the local window, provided that it was not updated previously either by astore, or when calling the MPI_PUT() or
MPI_ACCUMULATE()subroutine.
INSTITUT DU DÉVELOPPEMENT
87/236 5 – One-sided Communication 5.4 – Completion of the Transfer : the synchronization
5 – One-sided Communication
5.4 – Completion of the Transfer : the synchronization 5.4.2 – Passive Target Synchronization
☞ It is made via calls to theMPI MPI_WIN_LOCK() and MPI_WIN_UNLOCK() subroutines.
☞ Unlike the synchronization byMPI_WIN_FENCE() (which is a collective operation of barrier type), here only the origin process will participate in the
synchronization. Consequently, all the necessary calls to the data transfer (initialization of the transfer, synchronization) are only made by the origin process ; this is true One Sided Communication.
☞ Thelock andunlock operations apply only to a given local memory window (i.e.
identified by a target process number and an MPI window object). The period which starts atlockand ends atunlock is called an access period to the local memory window. It is only during this period that the origin process will have access to the local memory window of the target process.
INSTITUT DU DÉVELOPPEMENT
88/236 5 – One-sided Communication 5.4 – Completion of the Transfer : the synchronization
☞ In order to use it, it only requires the origin process to surround the call to the
RMA initialization primitives of data transfer by MPI_WIN_LOCK()and MPI_WIN_UNLOCK(). For the target process, noMPIsubroutine call is necessary.
☞ WhenMPI_WIN_UNLOCK() returns control, all the data transfers initiated after the MPI_WIN_LOCK()are completed.
☞ The first MPI_WIN_LOCK() argument allows to specify if the fact of making many simultaneous accesses via RMA operations on the same local memory window is authorized (MPI_LOCK_SHARED) or not (MPI_LOCK_EXCLUSIVE).
☞ A basic use of passive target synchronizations consists of creating RMA blocking
versions (put, get, accumulate) without the target having to callMPIsubroutines.
INSTITUT DU DÉVELOPPEMENT
89/236 5 – One-sided Communication 5.4 – Completion of the Transfer : the synchronization
1 subroutine get_blocking(orig_addr, orig_count, orig_datatype, target_rank, &
2 target_disp, target_count, target_datatype, win, code)
3 integer, intent(in) :: orig_count, orig_datatype, target_rank, target_count, &
4 target_datatype, win
5 integer, intent(out) :: code
6 integer(kind= MPI_ADDRESS_KIND ), intent(in) :: target_disp 7 real(kind=kind(1.d0)), dimension(:) :: orig_addr
8
9 call MPI_WIN_LOCK( MPI_LOCK_SHARED , target_rank, 0, win, code)
10 call MPI_GET(orig_addr, orig_count, orig_datatype, target_rank, target_disp, &
11 target_count, target_datatype, win, code)
12 call MPI_WIN_UNLOCK(target_rank, win, code) 13 end subroutine get_blocking
INSTITUT DU DÉVELOPPEMENT
90/236 5 – One-sided Communication 5.4 – Completion of the Transfer : the synchronization
Remark Concerning the Fortran Codes
For portability, during the use of passive target synchronizations (MPI_WIN_LOCK(), MPI_WIN_UNLOCK()), it is necessary to allocate the memory window with
MPI_ALLOC_MEM(). This function accepts as argument pointers of C type (i.e. Fortran CRAY pointers, which do not belong to the Fortran95 standard). In the case where these latter ones are not available, a C program must be used to do the allocation of the memory window...
INSTITUT DU DÉVELOPPEMENT
91/236 5 – One-sided Communication 5.5 – Conclusions
5 – One-sided Communication
5.5 – Conclusions
☞ The RMA concepts ofMPIare difficult to implement on non-trivial applications. An in-depth knowledge of the standard is necessary in order not to fall in many traps.
☞ The performances can widely vary from one implementation to another. ☞ The advantage of the RMA concept ofMPIresides essentially in the passive
target approach. It is only in this case that the use of RMA subroutines is really essential (application requiring that a process access data that belong to a remote process without interruption of the latter...)
INSTITUT DU DÉVELOPPEMENT
92/236 6 – Derived datatypes 6.1 – Introduction
6 – Derived datatypes
6.1 – Introduction
☞ In the communications, the exchanged data have datatypes : MPI_INTEGER, MPI_REAL,MPI_COMPLEX, etc.
☞ We can create more complex data structures by using subroutines such as
MPI_TYPE_CONTIGUOUS(), MPI_TYPE_VECTOR(),MPI_TYPE_CREATE_HVECTOR()
☞ Each time that we use a datatype, it is mandatory to validate it by using the
MPI_TYPE_COMMIT() subroutine.
☞ If we wish to reuse the same name to define another derived datatype, we have to
free it first with theMPI_TYPE_FREE() subroutine.
INSTITUT DU DÉVELOPPEMENT
93/236 6 – Derived datatypes 6.1 – Introduction
MPI_TYPE_[CREATE_H]VECTOR MPI_TYPE_CREATE_STRUCT
MPI_TYPE_[CREATE_H]INDEXED
MPI_REAL, MPI_INTEGER, MPI_LOGICAL MPI_TYPE_CONTIGUOUS
Figure 24: Hierarchy of the MPI constructors
INSTITUT DU DÉVELOPPEMENT