Message Passing Architecture
5.6 EXAMPLE MESSAGE PASSING ARCHITECTURES
Examples of message passing machines include Caltech Hypercube, the Inmos Transputer systems, Meiko CS-2, Cosmic Cube, nCUBE/2, iPSC/2, iPSC/860, CM-5. Other recent systems include the IBM Scalable Power Series (IBM POWERparallel 3,SP 3).
The Caltech Hypercube (the Cosmic Cube) was an n-dimensional hypercube system with a single host, known as the Intermediate Host (IH), for global control. The original system was based on the simple store-and-forward routing mechanism. The system started with a set of routine libraries known as the crystalline operating system (CrOS), which supported C and FORTRAN. The system supported only col- lective operations (broadcast) to/from the IH. Two years later, the Caltech project team introduced a hardware wormhole routing chip. The Cosmic Cube is managed using a host runtime system called the Cosmic environment (CE). The processes of a given computation are called the process group. The system can be used by more than one user. Users have to specify the cube size needed using a CE routine. Allocation will be granted based on the available hypercube nodes. In this system, C programming is supported by the help of a dynamic process structure with active process scheduling.
The Cosmic Cube is considered the first working hypercube multicomputer mess- age passing system. The Cosmic cube system has been constructed using 64 node for the Intel iPSC. Each node has 128 KB of dynamic RAM that has parity checking for error detection but no correction. In addition, each node has 8 KB of ROM in order to store the initialization and bootstrap programs. The basic packet size is 64 bits with queues in each node. In this system, messages are communicated via trans- missions (send/receive). When a message request is made, the calls will return. In case the message request is not finished (pending), the calls will not return and the program will continue. Each node in this system has a kernel that requires about 9 KB of code and 4 KB of tables. This kernel, called the Reactive Kernel,
is divided in two: an inner kernel, which is responsible for performing messages such as send,receive,queue, andhandlingmessages, and an outer kernel, which includes processes tocreate,copy, andstopthe processes.
The Meiko Computing Surface CS-1 was the first Inmos Transputer T800-based system. The Transputer was a 32-bit microprocessor with fast task-switching capability through hardware intercommunication. The system was programmed using a communication sequential processes (CSP) language called Occam. The language used abstract links known as channels and supported synchronous block- ing send and receive primitives.
T9000 represented another version of the Inmos Transputer processor T800. T9000 has the ability to perform both integer and floating-point operations. Although T9000 is a RISC processor, it uses microprogramming. Instructions take one or more processor cycles to execute. Its internal memory capacity is at least 64 KB. It also has 16 KB instruction and data cache. The memory interface cir- cuitry can generate a variety of signals to match the external memory chips. With the T9000, data transfers are synchronized using atwo-way hand-shakingmechanism. According to this technique, synchronization is achieved using two different pack- ets. The first one, called thedata packet, is sent from the source to the destination transputer process. The other packet, called the acknowledge packet, is sent from the destination to the source transputer. When the destination transputer is ready to get the data from the data packet, it should send an acknowledge signal rep- resented by the acknowledge packet telling the source transputer to send the data packet.
The Intel iPSC is a commercial message passing hypercube developed after the Cosmic Cube. The iPSC/1 used Intel 286 processors with a 287 floating-point coprocessor. Each node consists of a single board computer having two buses, a pro- cess bus and I/O bus. Nodes are controlled by the Cube manager. Each node has seven communication channels (links) to communicate with other nodes and a separate channel for communication with the Cube Manager. FORTRAN message passing routines are supported. The software environment used in iPSC1 was called NX1, and has a more distributed processes environment than those included in the Caltech CrOS. The NX1 was based on the Caltech Reactive Kernel. It provided the typical set of features needed in a message passing environment. These include communication topology hiding, multiple processes per node, any- to-any message passing, asynchronous messaging, and nonblocking communication primitives. Later, Intel machines implemented an improved version of NX, called NX2. In the case of the Paragon, NX2 was implemented upon an OSF/1 Unix microkernel.
ThenCUBE/2has up to a few thousand nodes connected in a binary hypercube network. Each node consists of a CPU-chip and DRAM chips on a small double- sided printed circuit board. The CPU chip contains a 64 bit integer unit, an IEEE floating-point unit, a DRAM memory interface, a network interface with 28 DMA channels, and routers that support cut-through routing across a 13-dimensional hypercube. The processor runs at 20 MHz and delivers roughly 5 MIPS or 1.5 MFLOPS.
The Thinking Machine CM-5 had up to a few thousand nodes interconnected in a hypertree (incomplete fat tree). Each node consists of a 33 MHz SPARC RISC processor chip-set, local DRAM memory, and a network interface to the hypertree and broadcast/scan/prefix control networks. Compared to its predecessors, CM-5 represented a true distributed memory message passing system. It featured two interconnection networks, and Sparc-based processing nodes. Each node has four vector units for pipeline arithmetic operations. The CM-5 programming environ- ment consisted of the CMOST operating system, the CMMD message passing library, and various array-style compilers. The latter includes CMF, supporting a F90-like SIMD programming style. The CMMD message passing system offered users access to routines from the lowest level, the Active Message Layer (AML), a point-to-point library, channels, and a cooperative functions library.
Having briefly reviewed a number of the early introduced message passing systems, we now discuss in some detail the features of a recent message passing system, the IBM Scalable POWERparallel 3 system.
5.6.1 The IBM Scalable POWERparallel 3
The IBM POWER3 (SP 3) is the most recent IBM supercomputer series (1999/ 2000). The SP 3 consists of 2 to 512 POWER3 Architecture RISC System/6000 pro- cessor nodes. Each node has its own private memory and its own copy of the AIX operating system. The POWER3 processor is an eight-stage pipeline processor. Two instructions can be executed per clock-cycle except for the multiply and divide. A multiply instruction takes two clock cycles while a divide instruction takes 13 to 17 cycles. The FPU contains two execution units using double precision (64 bit). Both execution units are identical and conform to the IEEE 754 binary floating- point standard. Figure 5.11 shows a block diagram of a typical SP 3 node.
Nodes are connected by a high-performance scalable packet-switched network in a distributed memory and message passing. The network’s building block is a two- staged 1616 switch board, made up of 44 bidirectional crossbar switching elements (SEs). Each link is bidirectional and has a 40 MB/s bandwidth in each direction. The switch uses buffered cut-through wormhole routing. This intercon- nection arrangement allows all processors to send messages simultaneously. For full connectivity, at least one extra stage is provided. This stage guarantees that there are at least four different paths between every pair of nodes. This form of path redundancy helps in reducing network congestion as well as recovery in the presence of failures.
The communication protocol supports end-to-end packet acknowledgment. For every packet sent by a source node, there is a returned acknowledgment after the packet has reached the destination node. This allows source nodes to discover packet loss. Automatic retransmission of a packet is made if the acknowledgment is not received within a preset time interval.
A message passing programming style is the preferred style for performance on the SP 3. Several message passing libraries used by FORTRAN and C are supported
on the SP 3. The SP 3 also supports the data parallel programming model with high- performance FORTRAN.
The message passing model in SP assumes that any task can send a message to any other task using message passing communication mechanism. The main mess- age passing operations are thesendandreceiveoperations. The send operation can be eithersynchronous(returns only when a matching receive is called) orasynchro- nous(returns immediately without waiting for a matching receive call) andblocking (returns as soon as the send buffer has been cleared) ornonblocking(returns without waiting for clearing the send buffer). The receive operation can be eitherblocking (returns only after the data has been received) ornonblocking(returns immediately a flag that the data either are not available or are in the receive buffer). Particular implementation of the different message passing libraries in SP 3 is presented below. The IBM SP 3 programming environment offers three message passing libraries: PVM, MPL, and MPI. The native library in the IBM SP-2 Parallel Operating Environment is MPL. It is implemented on top of HPS and IP protocols. The MPI implementation MPICH is implemented on top of MPL. The MPL is designed for SP-2 and, therefore, there is no special software initialization and messages are passed directly to hardware. Single-step asynchronous blocking subroutines are called mpc_bsend( ) and mpc_brecv( ). There are also dual asynchronous blocking subroutines mpc_bvsend( ) and mpc_bvrecv( ) for exchanging noncontiguous pieces of data. The MPI subroutine design for SP-2 is very similar to MPL; only the sub- routine names and parameters differ. No encoding is needed and all messages are passed directly to hardware through MPL subroutines. Single-step asynchronous blocking subroutines are called MPI_Send( ) and MPI_Recv( ).
The PVM is aimed at heterogeneous parallel systems. The asynchronous block- ing send subroutine is pvm_send( ). Sending a message requires three steps. First, the PVM sending buffer has to be initialized by pvm_initsend( ). There are three defined input parameters, which determine a mechanism of data coding and packing:
POWER3 POWER3 CPU CPU L2 Cache L2 Cache 4MB 4MB Memory I/O 256 MB to 4 G Controller Memory
PCI PCI PCI Switch Bridge Bridge Bridge Adapter
PvmDataDefault, the data is packed into PVM buffer and encoded according to the XDR format; PvmDataRaw, the data is packed into the PVM buffer with no encod- ing; PvmDataInPlace, the data is not copied to the PVM buffer, but it is fetched from user space memory during execution pvm_send( ). Then the message is packed into the PVM buffer by using any combination of pvm_pack( ) routines. Finally, the pvm_send( ) subroutine is called. The matching subroutine to pvm_send( ) is block- ing pvm_recv( ). Receiving a message requires two steps. The incoming message has to be accepted by pvm_recv( ) and then it has to be unpacked into user data space using pvm_unpack( ) functions.
The PVM and MPI message passing libraries are covered in more details in Chapters 8 and 9, respectively.
5.7 MESSAGE PASSING VERSUS SHARED MEMORY