Message Passing Interface (MPI) - Dissertation Outline

1.6 Dissertation Outline

2.1.3 Message Passing Interface (MPI)

Multithreading and OpenMP are designed mainly for shared memory platforms. MPI is a portable, ef- ficient, and flexible standard specifying the interfaces that can be used by message-passing programs on distributed memory platforms. MPI itself is not an implementation, but a vendor-independent specification about what functionalities a standard-compliant library and runtime should provide. MPI interfaces have been defined for languages C/C++ and Fortran. There are a variety of implementations available in public domain (e.g. OpenMPI [13], MPICH [10,11]). Usually a shared file system (e.g. General Purpose File System (GPFS)) is mounted to all compute nodes to facilitate data sharing.

each MPI program may concurrently run multiple processes. Each communication may involve all processes, or a portion of processes. MPI definescommunicatorsandgroupswhich define the communication context and are used to specify which processes may communicate with each other. The processes within a process group are ordered and each process is identified by its rank in the group assigned automatically by the MPI system during the initialization. The identifiers are used by application developers to specify the source and destination of messages. The pre-defined default communicator isMPI COMM WORLDwhich includes all processes. MPI-1 supports both point-to-point and collective communication. MPI guarantees messages do not overtake each other. Fairness of communication handling is not guaranteed in MPI, so it is the users’ responsibility to prevent starvation.

Point-to-Point communication routines: The basic support operations aresendandreceive. Different types of routines are provided including synchronous send, blocking send/receive, non-blocking send/ receive, buffered send, combined send/receive and “ready” send. Blocking send calls do not return until the message data have been safely stored so that the sender is free to modify the send buffer. However, the messages may have not been sent out. The usual way to implement it is to copy the data to a temporary system buffer and thus it incurs the additional overhead of memory-to-memory copying. Alternatively MPI implementations may choose not to buffer messages for performance reasons. In this case, a send call does not return until the data have has been moved to the matching receiver. In other words, the sender and receiver may or may not be loosely coupled depending on implementations. Synchronous send calls do not return until a matching receiver is found and starts to receive the message. Blocking receive calls do not complete until the message is received. A communication buffer should not be accessed or modified until the corresponding communication completes. To maximize performance, MPI provides nonblocking communication routines that can be used to make communication and computation overlap as much as possible. A nonblocking send call initiates the operation and returns before the message is copied out of the send buffer. The program can continue to run while the message is copied out of the send buffer simultaneous in background by MPI runtime. A nonblocking receive call initiates the operation and returns before a message is received and stored into the receive buffer.

Collective communication routines:Only two processes can be involved in point-to-point communica- tions. Collective communication mechanisms allow more than two processes to communicate. The collective communication mechanism supported by MPI includebarriersynchronization, broadcast,gather,scatter,

gather-to-all,reduction,reduce-scatter,scan, etc. When reaching thebarriersynchronization point, each process blocks until all processes in the group reach the same point. Broadcastssend a message from a “root” process to all other processes in the same group.Scatterdistributes data from a single source process to each process in the group, and each process receives a portion of the data (i.e. the message is split inton

segments and thei-th segment is sent to thei-th process). Thegatheroperation allows a destination process to receive messages from all other processes in the group and store them in rank order.Gather-to-alldistributes the concatenation of the data across processes to all processes. Reduceapplies a reduction operation across all members of a group. In other words, it operates on a list of data elements stored in different processes and produces a single output stored in the specified process. One example is sum calculation across all data distributed across processes. Reduce-scatterapplies element-wise reduction on a vector and distributes the result across the processes.

Process Topologies: MPI allows programmers to create a virtual topology and map MPI processes to positions in the topology. Two types of topologies are supported - Catesian (grid) and Graph. MPI does not define how to map virtual topologies to the physical structure of the underlying parallel system. Process topologies are usually used for the purposes of convenience or efficiency. Domain-specific communication patterns can be expressed easily with process topologies and ease the application development. For most parallel systems, the communication cost is not constant for all pairs of nodes (e.g. some nodes are “closer” than others). Process topologies can help MPI runtime to optimize the mapping of processes to physical processors/cores based on the physical characteristics and structures.

MPI I/O:MPI I/O adds parallel I/O support to MPI. It provides a high-level interface to describe data partitioning and data transfers. It lets users read and write files in synchronous and asynchronous modes. Accesses to a file can be independent or collective. Collective accesses allow for read and write optimization on various levels. MPI data types are used to express the data layout in files and data partitioning among processes. There has been substantial research on how to improve the performance of parallel IO [119,83,

120,47,127].

Summary: Overall, MPI provides powerful communication primitives that can be used by application developers to coordinate different tasks of a single parallel program. In MPI 1.0 and 2.0, fault tolerance is not supported, and it is the users’ responsibility to recover their programs from faults (e.g. hardware failure, process hang, network paritition). There have been some MPI extensions that support process checkpointing and

recovery [69,70] but they have not been standardized. In addition, data affinity/locality is not incorporated, which makes it inappropriate to run massively parallel applications on commodity clusters.

In document HIGH PERFORMANCE INTEGRATION OF DATA PARALLEL FILE SYSTEMS AND COMPUTING: OPTIMIZING MAPREDUCE (Page 36-39)