Lehrstuhl für Betriebssysteme
MPICH for SCI-Connected Clusters
Agenda
• Introduction, Related Work & Motivation
• Implementation
• Performance
• Work in Progress
• Summary
Message-Passing
MPI: the de facto standard for technical and scientific
message-passing programming:
• Open Standard / Open Source implementations
• Reliable specification
• Easy to use, but extensive functionality if required
• Portable code
• Efficient implementation on every architecture
• Many libraries & tools available
MPI on SAN-Clusters
Available SAN-Interconnects with MPI:
• DEC MemoryChannel
Digital MPI (by DEC, commercial)
• Giganet
MPI/Pro (by MPI Software, commercial)
• Myrinet
MPI-FM (by CSAG at UCSD, freely available)
LAMP (by NEC / GMD, not publicly available)
• SCI
ScaMPI (by Scali, commercial)
Sun MPI (by Sun, commercial)
Motivation for SCI-MPICH
Lack of free MPI implementations for SCI clusters:
• researchers, universities etc. prefer free software
• free software eases the decision for an SCI cluster
• availability of source code prevents "sudden death" of the
development
• multiple available implementations for one platform
stimulate the development
CS research, small labs, etc.: want everything for free
professional users: want support
SCI-MPICH Design
• SMI-Library
"Shared Memory Interface" offers high-level services based on SISCI:
- Startup, Memory Allocation & Management, Synchronization,
Load/Store Barriers / Error Checking
[Figure: SCI-MPICH software layers. Labels: MPI API, MPI Profiling Interface, MPI Runtime Library, ADI-2, ch_smi, SMI Library, SISCI, IRM; Address Space (SCI mapped, locally mapped); Initialization, Services, Communication; user space, kernel space]
MPICH Message Protocols
SHORT: Message contents inside a Control Packet:
• 64 byte write for each short message
• packet structure with implicit synchronization
EAGER: Message transfer without interaction of the receiver:
• static buffers at the receiver
• ringbuffer of buffer-pointers at the sender
• remote-write of the message followed by control packet
RENDEZ-VOUS: Transfer of arbitrary sized messages:
• dynamic buffer allocation by receiving process
• synchronization between sender & receiver before, during and after transfer
• write-read interleave of transfer buffer
[Figure: packet layout. Fields: Header, Payload (47 bytes), Alignment, Unique ID]
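The choice between the three protocols above can be illustrated with a minimal sketch. It assumes that a message is sent SHORT when it fits the 47-byte payload of a control packet, EAGER when it fits a static receive buffer that is currently free, and RENDEZ-VOUS otherwise. The names `select_protocol`, `SHORT_PAYLOAD_MAX` and `EAGER_BUFFER_SIZE` are hypothetical, and the 16 KiB eager buffer size is an assumed value, not one given on the slides.

```python
from enum import Enum

# Hypothetical thresholds for illustration only. The slides give a 47-byte
# payload for the control packet; the eager buffer size is an assumed value.
SHORT_PAYLOAD_MAX = 47          # bytes that fit inside one control packet
EAGER_BUFFER_SIZE = 16 * 1024   # assumed size of one static receive buffer


class Protocol(Enum):
    SHORT = "short"
    EAGER = "eager"
    RENDEZVOUS = "rendezvous"


def select_protocol(message_size, eager_buffer_free=True):
    """Pick a transfer protocol for a message of `message_size` bytes."""
    if message_size < 0:
        raise ValueError("message size must be non-negative")
    # SHORT: the message contents travel inside the control packet itself.
    if message_size <= SHORT_PAYLOAD_MAX:
        return Protocol.SHORT
    # EAGER: remote-write into a static buffer without receiver interaction.
    if message_size <= EAGER_BUFFER_SIZE and eager_buffer_free:
        return Protocol.EAGER
    # RENDEZ-VOUS: the receiver allocates a buffer and both sides synchronize.
    return Protocol.RENDEZVOUS


if __name__ == "__main__":
    for size in (8, 47, 48, 4096, 1 << 20):
        print(size, select_protocol(size).value)
```

In this sketch a message falls back to RENDEZ-VOUS when no eager buffer is free; whether SCI-MPICH does the same is not stated on the slides.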
MPI Benchmarks
Benchmark Environment:
A: MPICH 1.1.2 on PentiumII @ 450 MHz, Solaris 7 with
• ch_smi: communication via SCI (SCI-MPICH)
• ch_p4: communication via TCP/IP on FastEthernet
• ch_shmem: communication via Shared Memory
B: ScaMPI 1.7 on PentiumII @ 400 MHz, Linux 2.2
• hpcLine at RWTH Aachen
Point-to-Point Performance:
[Figure: ping-pong between Process 0 and Process 1. Process 0 calls MPI_Send() then MPI_Recv(); Process 1 calls MPI_Recv() then MPI_Send(); the elapsed time is the roundtrip delay]
Point-to-Point Performance
Roundtrip/2-Latency for blocking send/receive:
[Figure: latency [us] (0 to 40) over message size [byte] (1 to 1K) for ScaMPI 2, SCI-MPICH, MPICH shmem, MPICH lfshmem]
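How a roundtrip/2 latency is derived from a ping-pong measurement can be sketched as follows. This is an illustration, not the benchmark code behind the slides: the function names are hypothetical, the `roundtrip` callable stands in for one MPI_Send()/MPI_Recv() exchange, and 1 MByte is assumed to mean 10^6 bytes.

```python
import time


def half_roundtrip_latency_us(total_seconds, iterations):
    """Roundtrip/2 latency in microseconds for `iterations` ping-pong exchanges."""
    if iterations <= 0:
        raise ValueError("iterations must be positive")
    return total_seconds / iterations / 2 * 1e6


def bandwidth_mbyte_per_s(message_size, one_way_seconds):
    """Bandwidth in MByte/s (assuming 1 MByte = 10**6 bytes) for one transfer."""
    if one_way_seconds <= 0:
        raise ValueError("transfer time must be positive")
    return message_size / one_way_seconds / 1e6


def measure_half_roundtrip_us(roundtrip, iterations=1000):
    """Time `iterations` calls of `roundtrip` and return the roundtrip/2 latency.

    `roundtrip` is a placeholder for one send followed by the matching
    receive on the initiating process.
    """
    if iterations <= 0:
        raise ValueError("iterations must be positive")
    start = time.perf_counter()
    for _ in range(iterations):
        roundtrip()
    elapsed = time.perf_counter() - start
    return half_roundtrip_latency_us(elapsed, iterations)


if __name__ == "__main__":
    # A no-op stands in for the real exchange, so this only shows the call shape.
    print(measure_half_roundtrip_us(lambda: None, iterations=10000))
```

Averaging over many iterations hides timer resolution. The same one-way time, divided into the message size, gives a bandwidth figure for larger messages.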
Point-to-Point Performance
Bandwidth for blocking send/receive:
[Figure: bandwidth [MByte/s] (0 to 140) over message size [byte] (1 to 1M) for MPICH lfshmem, MPICH shmem, SCI-MPICH, ScaMPI 2, MPICH p4]
NAS Parallel Benchmarks
Standard Suite of Parallel Kernel/Application Benchmarks
(NASA Ames Research Center)
Fortran 77 / MPI codes for
• Simulated CFD application (LU)
• 3D Fast Fourier Transformation (FT)
• Conjugate Gradient (CG)
• 3D Scalar Poisson by Multigrid (MG)
• others (Integer Sort, "Embarrassingly Parallel", CFD Block Solvers etc.)
Comparison of Speedup for different MPICH communication
devices / MPI implementations.
Compilers used with full optimization:
• Sun f77 5.0 for Solaris 7 (MPICH)
NPB: CG & MG
[Figure: CG Benchmark, Class A. Speedup (axis 0 to 4) with 2 CPUs and 4 CPUs for ch_smi, ScaMPI, ch_p4, ch_shmem. Data labels in extraction order: 1.66, 1.62, 3.54, 3.45, 2.45, 1.91, 1.88]
[Figure: MG Benchmark, Class W. Speedup (axis 0 to 4) with 2 CPUs and 4 CPUs for ch_smi, ScaMPI, ch_p4, ch_shmem. Data labels in extraction order: 1.83, 1.54, 1.75, 3.22, 2.95, 1.71, 1.89]
NPB: LU & FT
[Figure: LU Benchmark, Class A. Speedup (axis 0 to 4) with 2 CPUs and 4 CPUs for ch_smi, ScaMPI, ch_p4, ch_shmem. Data labels in extraction order: 1.98, 1.94, 1.70, 3.75, 1.94, 3.96, 3.95]
[Figure: FT Benchmark, Class W. Speedup (axis 0 to 4) with 2 CPUs and 4 CPUs for ch_smi, ScaMPI, ch_p4, ch_shmem. Data labels in extraction order: 1.17, 1.38, 2.71, 3.60, 1.57, 1.85, 1.42]
MOPS: 67.0 127.4 26.0 50.7
Further Development
• MPI-IO
fast parallel I/O via SCI
• Adaptive Buffer Sizing
optimize memory layout during execution
• Asynchronous Transfer
utilize DMA, Threads & Interrupts for true asynchronous transfers
• Collective Operations
replace Point-to-Point with shared-memory operations
• Cluster Manager
Parallel IO
Implementation of an SCI-based ADIO device for MPI-IO/ROMIO
• preliminary results of the examples/io/coll_perf benchmark
[Figure: coll_perf, 2 processes. Write bandwidth and read bandwidth in MB/s (axis 0 to 12) for AD_SMI (SCI), AD_SMI (FastEthernet), AD_NFS (FastEthernet). Data labels in extraction order: 0.145, 11.925, 1.333, 1.059, 0.461, 5.715]
Uniprocessor vs. SMP
Platform: Windows NT on Dual-PentiumPro (200MHz, 256 MB RAM)
Testcase 1: NAS-Benchmark BT, Class W
processes:       1          4                         9
dedicated nodes  340.83 s   84.73 s (speedup 4.02)    -
shared nodes     340.83 s   146.32 s (speedup 2.32)   72.30 s (speedup 4.71)
Testcase 2: NAS-Benchmark CG, Class A
processes:       1          4                         8
dedicated nodes  127.08 s   32.27 s (speedup 3.93)    -
shared nodes     127.08 s   59.34 s (speedup 2.14)    35.42 s (speedup 3.58)
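The speedup figures for the two testcases follow from the measured runtimes as serial time divided by parallel time. The sketch below recomputes them from the timings listed above and adds the parallel efficiency (speedup divided by process count), which is not on the slide. The function names are hypothetical.

```python
def speedup(t_serial, t_parallel):
    """Speedup of a parallel run relative to the single-process runtime."""
    if t_serial <= 0 or t_parallel <= 0:
        raise ValueError("runtimes must be positive")
    return t_serial / t_parallel


def efficiency(t_serial, t_parallel, processes):
    """Parallel efficiency: speedup divided by the number of processes."""
    if processes <= 0:
        raise ValueError("process count must be positive")
    return speedup(t_serial, t_parallel) / processes


# Runtimes in seconds as listed for the two testcases:
# (label, processes, single-process runtime, parallel runtime)
RUNS = [
    ("BT class W, dedicated nodes", 4, 340.83, 84.73),
    ("BT class W, shared nodes", 4, 340.83, 146.32),
    ("BT class W, shared nodes", 9, 340.83, 72.30),
    ("CG class A, dedicated nodes", 4, 127.08, 32.27),
    ("CG class A, shared nodes", 4, 127.08, 59.34),
    ("CG class A, shared nodes", 8, 127.08, 35.42),
]

if __name__ == "__main__":
    for label, procs, t1, tp in RUNS:
        s = speedup(t1, tp)
        e = efficiency(t1, tp, procs)
        print(f"{label}, {procs} processes: speedup {s:.2f}, efficiency {e:.2f}")
```

Recomputed values can differ from the listed ones in the last digit, depending on whether the result is rounded or truncated.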
Collective Operations
MPI_Bcast() for 8 Bytes
[Figure: latency [us] (0 to 40) over number of processes (2 to 16) for ScaMPI and SCI-MPICH]