MPICH FOR SCI-CONNECTED CLUSTERS

(1)

Lehrstuhl für Betriebssysteme

MPICH

FOR

SCI-C

ONNECTED

C

LUSTERS

(2)

A

GENDA

• Introduction, Related Work & Motivation

• Implementation

• Performance

• Work in Progress

• Summary

(3)

M

ESSAGE

-P

ASSING

MPI: the de-facto standard for technical and scientific

Message-Passing-Programming:

• Open Standard / Open Source implementations

• Reliable specification

• Easy to use, but extensive functionality if required

• Portable code

• Efficient implementation on every architecture

• Many libraries & tools available

(4)

MPI

ON

SAN-C

LUSTERS

Available SAN-Interconnects with MPI:

• DEC MemoryChannel

Digital MPI (by DEC, commercial)

• Giganet

MPI/Pro (by MPI Software, commercial)

• Myrinet

MPI-FM (by CSAG at UCSD, freely available) LAMP (by NEC / GMD, not publically available)

• SCI

ScaMPI (by Scali, commercial) Sun MPI (by Sun, commercial)

(5)

M

OTIVATION FOR

SCI-MPICH

Lack of free MPI-Implementations for SCI-Clusters:

• researchers, universities etc. prefer free software

• free software eases the decision for a SCI-Cluster

• availabitlity of source code prevents „sudden death“ of the

development

• multiple available implementations for one platform

stimu-late the development

CS research, small labs, etc. professional

users want support

want everything for free

(6)

SCI-MPICH D

ESIGN

• SMI-Library

„Shared Memory Interface“ offers high-level services based on SISCI: - Startup, Memory Allocation & Management, Synchronization,

Load/Store Barriers / Error Checking M P I A P I M P I P P r o f i l i n g I n t e r f a c e M P I R R u n t i m e L i b r a r y c h _ s m i S M I L i b r a r y S I S C I I R M A d d r e s s S p a c e S C I m a p p e d l o c a l l y m a p p e d I n i t i a l i z a t i o n S e r v i c e s C o m m u n i c a t i o n u s e r s p a c e k e r n e l s p a c e A D I - 2

(7)

MPICH M

ESSAGE

P

ROTOCOLS

SHORT: Message contents inside a Control Packet:

• 64 byte write for each short message

• packet structure with implicit synchronization

EAGER: Message transfer without interaction of the receiver:

• static buffers at the receiver

• ringbuffer of buffer-pointers at the sender

• remote-write of the message followed by control packet

RENDEZ-VOUS: Transfer of arbitrary sized messages:

• dynamic buffer allocation by receiving process

• synchronization between sender & receiver before, during and after transfer • write-read interleave of transfer buffer

Header Payload: 47 bytes Alignment

Unique ID

(8)

MPI B

ENCHMARKS

Benchmark Environment:

A

: MPICH 1.1.2 on PentiumII @ 450 MHz, Solaris 7 with • ch_smi: communication via SCI (SCI-MPICH)

• ch_p4: communication via TCP/IP on FastEthernet • ch_shmem: communication via Shared Memory

B

: ScaMPI 1.7 on PentiumII @ 400 MHz, Linux 2.2 • hpcLine at RWTH Aachen

Point-to-Point Performance:

Process 0 Process 1 MPI_Send() MPI_Recv() MPI_Recv() MPI_Send() Roundtrip Delay

(9)

P

OINT

-

TO

-P

OINT

P

ERFORMANCE

Roundtrip/2-Latency for blocking send/receive:

1 4 16 64 256 1K

message size [byte] 0 10 20 30 40 latency [us] ScaMPI 2 SCI−MPICH MPICH shmem MPICH lfshmem

(10)

P

OINT

-

TO

-P

OINT

P

ERFORMANCE

Bandwidth for blocking send/receive:

1 4 16 64 256 1K 4K 16K 64K 256K 1M message size [byte]

0 20 40 60 80 100 120 140 bandwidth [MByte/s] MPICH lfshmem MPICH shmem SCI−MPICH ScaMPI 2 MPICH p4

(11)

NAS P

ARALLEL

B

ENCHMARKS

Standard Suite of Parallel Kernel/Application Benchmarks

(NASA Ames Research Center)

Fortran 77 / MPI codes for

• Simulated CFD application (LU)

• 3D Fast Fourier Transformation (FT) • Conjugate Gradient (CG)

• 3D Scalar Poisson by Multigrid (MG)

• others (Integer Sort, „Embarassing Parallel“, CFD Block Solvers etc.)

Comparison of Speedup for different MPICH communication

devices / MPI implementations.

Compilers used with full optimization:

• Sun f77 5.0 for Solaris 7 (MPICH)

(12)

NPB: CG & MG

C G B e n c h m a r k , C l a s s A 1 , 6 6 _{1 , 6 2} 3 , 5 4 3 , 4 5 2 , 4 5 1 , 9 1 1 , 8 8 0 1 2 3 4 c h _ s m i S c a M P I c h _ p 4 c h _ s h m e m S p eed u p 2 C P U s 4 C P U s M G B e n c h m a r k , C l a s s W 1 , 8 3 1 , 5 4 1 , 7 5 3 , 2 2 2 , 9 5 1 , 7 1 1 , 8 9 0 1 2 3 4 c h _ s m i S c a M P I c h _ p 4 c h _ s h m e m S p eed u p 2 C P U s 4 C P U s

(13)

NPB: LU & FT

L U B e n c h m a r k , C l a s s A 1 , 9 8 _{1 , 9 4} 1 , 7 0 3 , 7 5 1 , 9 4 3 , 9 6 3 , 9 5 0 1 2 3 4 c h _ s m i S c a M P I c h _ p 4 c h _ s h m e m S peed up 2 C P U s 4 C P U s F T B e n c h m a r k , C l a s s W 1 , 1 7 1 , 3 8 2 , 7 1 3 , 6 0 1 , 5 7 1 , 8 5 1 , 4 2 0 1 2 3 4 c h _ s m i S c a M P I c h _ p 4 c h _ s h m e m S peed up 2 C P U s 4 C P U s MOPS: 67,0 127,4 26,0 50,7

(14)

F

URTHER

D

EVELOPMENT

• MPI-IO

fast parallel I/O via SCI

• Adaptive Buffer Sizing

optimize memory layout during execution

• Asynchronous Transfer

utilize DMA, Threads & Interrupts for true asynchronous transfers

• Collective Operations

replace Point-to-Point with shared-memory operations

• Cluster Manager

(15)

P

ARALLEL

IO

Implementation of a SCI-based ADIO device for MPI-IO/ROMIO

• prelimitary results of the examples/io/coll_perf benchmark

c o l l _ p e r f , 2 p r o c e s s e s 0 , 1 4 5 1 1 , 9 2 5 1 , 3 3 3 _{1 , 0 5 9} 0 , 4 6 1 5 , 7 1 5 0 1 2 3 4 5 6 7 8 9 1 0 1 1 1 2 A D _ S M I ( S C I ) A D _ S M I ( F a s t E t h e r n e t ) ( F a s t E t h e r n e t )A D _ N F S MB /s w r i t e b a n d w i d t h r e a d b a n d w i d t h

(16)

U

NIPROCESSOR VS

. SMP

Platform: Windows NT on Dual-PentiumPro (200MHz, 256 MB RAM)

Testcase 1: NAS-Benchmark BT, Class W

Testcase 2: NAS-Benchmark CG, Class A

processes: 1 4 9 dedicated nodes 340,83 s 84,73 s (speedup4,02) -shared nodes 340,83 s 146,32 s (speedup2,32) 72,30 s (speedup4,71) processes: 1 4 8 dedicated nodes 127,08 s 32,27s (speedup3,93) -shared nodes 127,08 s 59,34 s (speedup2,14) 35,42 s (speedup3,58)

(17)

C

OLLECTIVE

O

PERATIONS

MPI_Bcast() for 8 Bytes

2 4 6 8 10 12 14 16 number of processes 0 10 20 30 40 latency [us] ScaMPI SCI−MPICH

(18)

C

OLLECTIVE

O

PERATIONS

MPI_Bcast() for 256 kB

2 4 6 8 10 12 14 16 number of processes 0 5 10 15 20 latency [ms] ScaMPI SCI−MPICH