Cluster Communication Using UMP
with task 2 using UMP channel buffers in shared memory, shown as 1->2 and 2->1. Task 1 and task 3 communicate using UMP channel buffers in MEMORY CHANNEL space, shown as 1->3 and 3->1. Task 3 is reading a message from task 1 using an outbuf. The outbuf can be written only by task 1 but is mapped for transmission to all other cluster members. On node 2, the same region is mapped for reception. Access to each outbuf is controlled by a unique cluster spinlock.
Our rationale for taking this approach is that a short software path is more appropriate for small messages because overhead dominates message transfer time, whereas the overhead of lock manipulation is a small component of message transfer time for large messages. We felt that this approach helped to control the use of cluster resources and maintained the lowest possible latency for short messages yet still accommodated large messages. Note that outbufs are still fixed-size buffers but are generally configured to be much larger than the N2 buffers.
This approach worked for PVM because its message transfer semantics make it acceptable to fail a message send request due to buffer space restrictions (e.g., if both the N2 buffer and the outbuf are full). When we analyzed the requirements for MPI, however, we found that this approach was not possible. For this reason, we changed the design to use only the N2 buffers. Instead of writing the message as a single operation, the message is streamed through the buffer in a series of fragments. Not only does this approach support arbitrarily large messages, but it also improves message bandwidth by allowing (and, for messages exceeding the available buffer capacity, requiring) the overlapped writing and reading of the message. Deadlock is avoided by using a background thread to write the message. Since overflow is now handled using the streaming N2 buffers, outbufs were not necessary to achieve the required level of performance for large messages and were not implemented. Outbufs are retained in the design to provide fast multicast messaging, even though in the current implementation they are not yet supported.
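The streaming scheme described above can be illustrated with a small sketch. It is not the UMP implementation: a bounded queue stands in for a fixed-size N2 buffer, and the names `stream_write`, `stream_read`, `send_and_receive`, the fragment size, and the end-of-message marker are all hypothetical choices made for illustration.

```python
import queue
import threading

FRAGMENT_SIZE = 4   # bytes per fragment (hypothetical; tiny for illustration)
BUFFER_SLOTS = 2    # capacity of the fixed-size buffer, in fragments


def stream_write(buf, message):
    """Push a message through the bounded buffer one fragment at a time."""
    for start in range(0, len(message), FRAGMENT_SIZE):
        # put() blocks while the buffer is full, so the writer paces itself
        # to the reader instead of failing the send.
        buf.put(message[start:start + FRAGMENT_SIZE])
    buf.put(None)  # end-of-message marker (an assumption of this sketch)


def stream_read(buf):
    """Drain fragments until the end-of-message marker and reassemble them."""
    parts = []
    while True:
        fragment = buf.get()
        if fragment is None:
            return b"".join(parts)
        parts.append(fragment)


def send_and_receive(message):
    """Send a message of any length through a small fixed-size buffer."""
    buf = queue.Queue(maxsize=BUFFER_SLOTS)
    # The write runs in a background thread, so the caller stays free to
    # read; a message larger than the buffer therefore cannot deadlock.
    writer = threading.Thread(target=stream_write, args=(buf, message))
    writer.start()
    received = stream_read(buf)
    writer.join()
    return received
```

In this sketch, writing and reading overlap whenever the message is longer than `FRAGMENT_SIZE * BUFFER_SLOTS` bytes, which mirrors the overlap the text attributes to the streamed design.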
Achieving the performance goals set for UMP was not easy. In addition to the buffer architecture described earlier, several other techniques were used.
• No syscalls were allowed anywhere in the UMP messaging functions, so UMP runs completely in user space.
• Calls to library routines and any expensive arithmetic operations were minimized.
• Global state was cached in local memory wherever possible.
• Careful attention was paid to data alignment issues, and all transfers are multiples of 32-bit data.
At the programmer's level, UMP operation is based on duplex point-to-point links called channels, which correspond to the N2 buffers already described. A channel is a pair of unidirectional buffers used to provide two-way communication between a pair of process endpoints anywhere in the cluster. UMP provides functions to open a channel between a pair of tasks. While the resources are allocated by the first task to open the channel, the connection is not complete until the second task also opens the same channel. Once a channel has been opened by both sides, UMP functions can be used to send and receive messages on that channel. It is possible to direct UMP to use shared memory or MEMORY CHANNEL address space for the channel buffers, depending on the relative location of the associated processes. In addition, UMP provides a function to wait on any event (e.g., arrival of a message, creation or deletion of a channel). In total, UMP provides a dozen functions, which are listed in Table 3. Most of the functions relate to initialization, shutdown, and miscellaneous operations. Three functions establish the channel connection, and three functions perform all message communications.
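The two-sided open described above can be modeled with a toy state table. This is only a sketch of the stated semantics (the first opener allocates, the second opener completes the connection); the class `ChannelTable` and its methods are hypothetical and do not reflect the UMP function signatures.

```python
class ChannelTable:
    """Toy model of two-sided channel opening (not the UMP implementation)."""

    def __init__(self):
        # channel key -> set of endpoint handles that have opened it so far
        self._opened_by = {}

    @staticmethod
    def _key(a, b):
        # Both sides must arrive at the same key, whichever one opens first.
        return (min(a, b), max(a, b))

    def open(self, local, remote):
        """Record that `local` opened the channel to `remote`.

        Returns True once both endpoints have opened the channel.
        """
        key = self._key(local, remote)
        # The first opener creates (allocates) the entry for the channel.
        sides = self._opened_by.setdefault(key, set())
        sides.add(local)
        return len(sides) == 2

    def is_connected(self, a, b):
        """True only after both endpoints have opened the channel."""
        return len(self._opened_by.get(self._key(a, b), ())) == 2
```

With endpoint handles 1 and 2, `open(1, 2)` returns `False` because only one side has opened the channel, and the later `open(2, 1)` returns `True`.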
UMP channels provide guaranteed error detection but not recovery. Through the use of TruCluster MEMORY CHANNEL Software error-checking routines, we were able to provide efficient error detection in UMP. We decided to let the higher layers implement error recovery. As a result, designers of higher layers can control the performance penalty they incur by specifying their own error recovery mechanisms, or, since reliability is high, can adopt a fail-on-error strategy.
Performance
UMP avoids any calls to the kernel and any copying of data across the kernel boundary. Messages are written directly into the reception buffer of the destination channel. Data is copied once from the user's buffer to physical memory on the destination node by the sending process. The receiving process then copies the data from local physical memory to the destination user's buffer. By comparison, the number of copies involved in a similar operation over a LAN using sockets is greater. In this case, the data has to be copied into the kernel, where the network driver uses DMA to copy it again into the memory of the network adapter. At this point the data is transmitted onto the LAN.
The first version of UMP used one large shared region of MEMORY CHANNEL space to contain its channel buffers and a broadcast mapping to transmit this simultaneously to all nodes in the cluster. This version of UMP also used loopback to reflect transmissions back to the corresponding receive region on the sending node, which resulted in a loss of available bandwidth. Using our AlphaServer 2100 4/190 development machines, we measured
• Latency: 11 µs (MEMORY CHANNEL), 4 µs (shared memory)
• Bandwidth: 16 MB/s (MEMORY CHANNEL), 30 MB/s (shared memory)
To increase bandwidth, we modified UMP to use transmit-only regions for its channel buffers, thus eliminating loopback. The performance measured for the revised UMP using the same machines was
• Latency: 9 µs (MEMORY CHANNEL), 3 µs (shared memory)
• Bandwidth: 23 MB/s (MEMORY CHANNEL), 32 MB/s (shared memory)
Table 3
UMP API Functions
Function Name   Description
ump_init        Initializes UMP and allocates the necessary resources.
ump_exit        Shuts down UMP and deallocates any resources used by the calling process.
ump_open        Opens a duplex channel between two endpoints over a given transport (shared memory or MEMORY CHANNEL). Channel endpoints are identified by user-supplied, 64-bit integer handles.
ump_close       Closes a specified UMP channel, deallocating all resources assigned to that channel as necessary.
ump_listen      Registers an endpoint for a channel over a specified transport. This can be used by a server process to wait on connections from clients with unknown handles. This function returns immediately, but the channel is created only when another task opens the channel. This can be detected using ump_wait.
ump_wait        Waits for a UMP event to occur, either on one specified channel to this task or on all channels to this task.
ump_read        Reads a message from a specified channel.
ump_write       Writes a message to a specified channel. This function is blocking, i.e., it does not return until the complete message has been written to the channel.
ump_nbread      Starts reading a message from a channel, i.e., it returns as soon as a specified amount of the message has been received, but not necessarily all the message.
ump_nbwrite     Starts writing a message to a specified channel, i.e., it returns as soon as the write has started. A background thread will continue writing the message until it is completely transmitted.
ump_mcast       Writes a message to a specified list of channels.
ump_info        Returns UMP configuration and status information.
Figure 8 shows the message transfer time and Figure 9 shows the bandwidth for various message sizes for the revised version of UMP using both blocking and nonblocking writes over shared memory and the MEMORY CHANNEL network. Using newer AlphaServer 4100 5/300 machines, which have a faster I/O subsystem than the older machines, and version 1.5 MEMORY CHANNEL adapters, the measured latency is 5.8 µs (MEMORY CHANNEL), 2 µs (shared memory). The peak bandwidth achieved is 61 MB/s (MEMORY CHANNEL), 75 MB/s (shared memory). In the nonblocking cases, the buffer size used was 256 kilobytes (KB) for shared memory and 32 KB for MEMORY CHANNEL. Further work is under way to improve the performance using shared memory as the transport. This work is aimed at eliminating the high-end falloff in bandwidth in the blocking case and the notch when the message size exceeds the buffer size in the nonblocking case. Note that these effects are not displayed in the MEMORY CHANNEL results.
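To relate the latency and peak-bandwidth figures quoted above, a first-order model can help: transfer time is taken as a fixed latency plus message size divided by peak bandwidth. This model is an assumption introduced here for illustration, not a result from the measurements; it ignores the buffer-size effects just described, and it treats 1 MB as 10**6 bytes. The function names are hypothetical.

```python
def transfer_time_us(size_bytes, latency_us, bandwidth_mb_per_s):
    """First-order estimate: fixed latency plus size over peak bandwidth.

    With MB taken as 10**6 bytes (an assumption), 1 MB/s equals
    1 byte per microsecond, so the division yields microseconds.
    """
    return latency_us + size_bytes / bandwidth_mb_per_s


def half_bandwidth_size(latency_us, bandwidth_mb_per_s):
    """Message size at which effective bandwidth is half the peak.

    Effective bandwidth is size / transfer_time; it reaches half the peak
    when the size-dependent term equals the latency term.
    """
    return latency_us * bandwidth_mb_per_s
```

Under this model, the 5.8 µs latency and 61 MB/s peak quoted for MEMORY CHANNEL give a half-bandwidth message size of roughly 354 bytes, and the 2 µs and 75 MB/s quoted for shared memory give 150 bytes. Below those sizes the fixed latency dominates the transfer time.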
Message-passing Libraries
Message-passing libraries provide the programmer with a set of facilities to build parallel applications. Typically, these services include the ability to send and receive a variety of data types to and from other peer processes in a variety of modes, as well as collective operations that span a set of peer processes. Other facilities may be provided in addition to the basic set, e.g., PVM provides functions for managing PVM processes (spawning, killing, signaling, etc.), whereas MPI (at least in its first revision, MPI-1) does not. PVM is probably the most widely used message-passing system. It has been available for approximately five years, and implementations are available for a wide variety of platforms. MPI is an emerging standard for message passing that is growing rapidly in popularity; many new applications are being written for it.
Parallel Virtual Machine
Parallel Virtual Machine (PVM) is supported on a wide variety of platforms, including supercomputers and networks of workstations (NOWs). PVM uses a variety of underlying communications methods: shared memory on multiprocessors, various native message-passing systems on massively parallel processors (MPPs), and UDP/IP or TCP/IP on NOWs. The large software overhead in the IP stacks makes it difficult to provide high-performance communications for PVM when using networks like Ethernet or FDDI. The high cost of communications for these systems means that only the more coarse-grained parallel applications have demonstrated performance improvements as a result of parallelization using PVM. Using the MEMORY CHANNEL cluster technology described earlier, we have implemented an optimized PVM that offers low latency and high-bandwidth communications. The PVM library and daemon use UMP to provide seamless communications over the MEMORY CHANNEL cluster.
Figure 8
UMP Communications Performance: Message Transfer Time
[Plot of message transfer time against message size (bytes, 10 to 1,000,000) for UMP blocking and nonblocking writes over shared memory and MEMORY CHANNEL.]
Figure 9
UMP Communications Performance: Bandwidth
[Plot of bandwidth against message size (bytes, 0 to 1,000,000) for UMP blocking and nonblocking writes over shared memory and MEMORY CHANNEL.]
When we began to develop PVM for MEMORY CHANNEL clusters, we had one overriding goal: to use the hardware performance the MEMORY CHANNEL interconnect offers to provide a PVM with industry-leading communications performance, specifically with regard to latency. Initially, we set a target latency for PVM of less than 15 µs using shared memory and less than 30 µs using the MEMORY CHANNEL transport.
Our first task was to build a prototype using the public-domain PVM implementation. We used an early prototype of the MEMORY CHANNEL system jointly developed by Digital and Encore. The prototype had a hardware latency of 4 µs. We modified the shared-memory version of PVM to use the prototype hardware and achieved a PVM latency of 60 µs. Profiling and straightforward code analysis revealed that most of the overhead was caused by
• PVM's support for heterogeneity (i.e., external data representation [XDR] encoding)
• Messages being copied multiple times inside PVM
• A large number of function calls in the critical communications path
• Inefficient coding of the low-level data copy routines
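The first item in the list above can be made concrete with a small sketch. XDR represents a 32-bit integer as 4 bytes in big-endian order, so a heterogeneous sender converts every element, whereas nodes that share a data format can ship the in-memory representation unchanged. The helper names below are hypothetical, and the code illustrates only the encoding difference, not PVM's own routines.

```python
import struct
from array import array


def encode_xdr_ints(values):
    """Encode 32-bit signed integers XDR-style: 4 bytes each, big-endian."""
    return struct.pack(">%di" % len(values), *values)


def decode_xdr_ints(data):
    """Inverse of encode_xdr_ints."""
    return list(struct.unpack(">%di" % (len(data) // 4), data))


def copy_native_ints(values):
    """Homogeneous-cluster shortcut: copy the native representation as-is.

    No per-element conversion takes place; the receiver must use the same
    integer size and byte order as the sender.
    """
    return array("i", values).tobytes()
```

On a little-endian machine the two byte strings differ for the same input, which is why the XDR path must touch every element on both the sending and the receiving side.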
Since we wanted to achieve the maximum possible performance available from the hardware, we decided to reimplement the PVM library, eliminating support for heterogeneity from the communications functions of PVM and focusing on maximum performance inside a Digital cluster. Heterogeneity would then be supported by using a special PVM gateway process.
The overall architecture of the Digital PVM implementation is shown in Figure 10. To maximize performance, we decided that, wherever possible, an operation should be executed in-line rather than be requested from a remote task or daemon. This contrasts with PVM's traditional approach of relaying such requests to the PVM daemon for service. For example, when a PVM task starts, often it first calls pvm_mytid to request a unique task identifier (TID). Previously, this