Awareness of MPI Virtual Process Topologies on the
Single-Chip Cloud Computer
Steffen Christgau, Bettina Schnor
Potsdam University, Institute of Computer Science, Operating Systems and Distributed Systems
Outline
The Single-Chip Cloud Computer (SCC)
RCKMPI – MPI on the SCC
Topology-Awareness for RCKMPI – The concept
Topology-Awareness for RCKMPI – Evaluation
Summary and Future Work
1. The Single-Chip Cloud Computer (SCC)
Many-Core Architecture Research Community (MARC) established by Intel
research of universities together with Intel
world-wide community (dominated by European institutions)
Website, regular symposia (every 6 months, 5 up to now)
Our group is MARC member
Focus: application scalability
Single-Chip Cloud Computer
[Figure: SCC block diagram with four memory controllers (MC 0 to MC 3), VRC and SIF; each tile holds two cores (Core0, Core1) with a 256 KB L2 cache each, a 16 KB MPB and a router]
24 tiles, 48 P54C cores, connected via Network-on-Chip, no cache coherence
2. RCKMPI: MPI on the SCC
fork of MPICH2
[Figure: MPICH2 software stack. The application calls the Message Passing Interface; the MPI implementation (MPICH2) contains ROMIO (MPI IO implementation, on top of ADIO), the Abstract Device Interface (ADI3) with the Cray, BG and Channel-3 devices, the Channel-3 channels Sock, Nemesis, SCCMPB, SCCSHM and SCCMulti, and the Process Management Interface]
SCCMPB uses the fast Message Passing Buffer of each tile as shared memory and divides it into n equal-size Exclusive Write Sections (EWS)
⇒ remote write, local read
[Figure: MPI Rank 0 to MPI Rank n of MPI_COMM_WORLD run on Core 0 to Core n; each MPB holds one section per rank (rank 1, ..., rank n-1, rank n)]
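The equal-size split can be illustrated with a little arithmetic. The following is only a sketch under assumptions that the slides do not state: each core is taken to own half of its tile's 16 KB MPB (8 KB), and sections are taken to be aligned to 32-byte cache lines. The function name `ews_bytes` is hypothetical and not part of RCKMPI.

```c
#include <assert.h>

/* Assumptions (not stated on the slides): each core owns half of its
 * tile's 16 KB MPB, and sections are aligned to 32-byte cache lines. */
#define MPB_BYTES_PER_CORE 8192u
#define CACHE_LINE_BYTES   32u

/* Hypothetical helper: size in bytes of one Exclusive Write Section
 * when n MPI processes are started and the MPB is split equally. */
unsigned ews_bytes(unsigned n)
{
    unsigned total_lines = MPB_BYTES_PER_CORE / CACHE_LINE_BYTES; /* 256 */

    assert(n > 0 && n <= total_lines);
    /* Each section gets a whole number of cache lines. */
    return (total_lines / n) * CACHE_LINE_BYTES;
}
```

Under these assumptions, `ews_bytes(2)` yields 4096 bytes while `ews_bytes(48)` yields only 160 bytes, which shows why the space available per communication partner shrinks as more processes are started.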
Comparison of different CH3 devices at maximum Manhattan distance 8
[Figure: bandwidth (MByte/s, logarithmic axis from 0.001 to 100) over message size (4 Byte to 4 MiByte) for the RCKMPI sccmulti, sccmpb and sccshm CH3 devices]
Bandwidths for Manhattan distance 0, 5 and 8 (two processes started).
[Figure: bandwidth (MByte/s) over message size (4 Byte to 4 MiByte) for three core pairs: Core 00 and 01, Core 00 and 10, Core 00 and 47]
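The distances 0, 5 and 8 for these core pairs can be reproduced with a short calculation. This is a sketch that assumes a 6 x 4 tile mesh with two cores per tile and cores numbered tile by tile, row by row; the slides give only the tile and core counts, so the mesh shape and numbering are assumptions here, and `manhattan_distance` is a hypothetical helper.

```c
#include <assert.h>
#include <stdlib.h>

/* Assumptions (not stated on the slides): 6 x 4 tile mesh, two cores
 * per tile, cores numbered consecutively tile by tile, row by row. */
#define MESH_COLS      6
#define CORES_PER_TILE 2
#define NUM_CORES      48

/* Hypothetical helper: number of mesh hops between the tiles that
 * host core_a and core_b. */
int manhattan_distance(int core_a, int core_b)
{
    assert(core_a >= 0 && core_a < NUM_CORES);
    assert(core_b >= 0 && core_b < NUM_CORES);

    int tile_a = core_a / CORES_PER_TILE;
    int tile_b = core_b / CORES_PER_TILE;

    int dx = abs(tile_a % MESH_COLS - tile_b % MESH_COLS);
    int dy = abs(tile_a / MESH_COLS - tile_b / MESH_COLS);

    return dx + dy;
}
```

With this numbering, cores 00 and 01 share a tile (distance 0), core 10 sits five tiles along the first row (distance 5), and core 47 sits in the opposite corner (distance 8), matching the three pairs in the plot.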
Bandwidths for maximum Manhattan distance 8 and varied number of MPI processes
[Figure: bandwidth (MByte/s) over message size (4 Byte to 4 MiByte) for 2, 12, 24 and 48 MPI processes]
Remember:
SCCMPB uses the fast Message Passing Buffer of each tile as shared memory (total 384 KB)
[Figure: MPB of core/rank 0, holding one section per rank (rank 1, ..., rank n-1, rank n) of MPI_COMM_WORLD; MPI Rank 0 to MPI Rank n run on Core 0 to Core n]
The MPB is equally divided into n sections
⇒ depending on the
3. Topology Awareness for RCKMPI: The Concept
The bandwidth between two RCKMPI processes depends on:
the number of started MPI processes, since a fully connected network between all MPI processes is managed.
Application Behaviour
[Figure: communication between process 1, process 2, ..., process n]
Goal: The bandwidth between communicating processes, so-called neighbors
Requirements:
1. An improved MPB layout must consider both communication neighbors and group communication (barriers, broadcasts, gather/scatter, ...)
Putting things together ...
[Figure: three MPB layouts, each shown for processes 1 to n: the original layout (one write section per process, holding a channel header and the message payload), the new layout without topology information (channel headers p1 ... pn followed by a payload area), and the new layout with topology information]
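One possible reading of the new layout can be put into numbers. The sketch below is not taken from RCKMPI: it assumes 8 KB of MPB per core, 32-byte cache lines, a small channel header of a fixed number of cache lines for every process, and a payload area that is shared equally among the topology neighbors only. The function name and all three parameters are hypothetical.

```c
#include <assert.h>

/* Assumptions (not stated on the slides): 8 KB of MPB per core and
 * 32-byte cache lines. */
#define MPB_BYTES_PER_CORE 8192u
#define CACHE_LINE_BYTES   32u

/* Hypothetical helper: payload bytes available per neighbor if every
 * one of n_procs processes keeps a channel header of header_lines
 * cache lines and the rest of the MPB is split among n_neighbors. */
unsigned payload_bytes_per_neighbor(unsigned n_procs,
                                    unsigned header_lines,
                                    unsigned n_neighbors)
{
    unsigned total_lines  = MPB_BYTES_PER_CORE / CACHE_LINE_BYTES;
    unsigned header_total = n_procs * header_lines;

    assert(n_neighbors > 0 && header_total < total_lines);

    /* Whatever the headers leave over forms the payload area. */
    unsigned payload_lines = total_lines - header_total;
    return (payload_lines / n_neighbors) * CACHE_LINE_BYTES;
}
```

Under these assumptions, 48 processes with headers of 2 cache lines and 2 neighbors each would leave 2560 bytes of payload per neighbor, compared with 160 bytes per section when the same buffer is split equally among all 48 processes. Headers are kept for every process so that group communication with non-neighbors remains possible, as the requirement above demands.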
MPI offers an API to specify a virtual process topology:
#include <mpi.h>

#define NUM_DIMS 2

int grid_dims[NUM_DIMS];
int grid_periods[NUM_DIMS];
MPI_Comm comm_topo;

/* for a grid, set all items of grid_periods to 0;
   zeroed grid_dims entries let MPI_Dims_create choose the extents */
for (int i = 0; i < NUM_DIMS; i++) {
    grid_periods[i] = 0;
    grid_dims[i] = 0;
}

/* numProcs holds the number of started MPI processes */
MPI_Dims_create(numProcs, NUM_DIMS, grid_dims);
/* reorder = 1 (true): the library may renumber the ranks */
MPI_Cart_create(MPI_COMM_WORLD, NUM_DIMS, grid_dims,
                grid_periods, 1, &comm_topo);
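To see what such a topology tells the library, it helps to list the neighbors of one rank. The sketch below does this without calling MPI: it relies on the row-major rank order that MPI uses for Cartesian communicators and handles the non-periodic 2D case from the listing above. The helper `grid_neighbors` is hypothetical; in an MPI program the same information is available from `MPI_Cart_shift`.

```c
#include <assert.h>

/* Hypothetical helper: store the ranks of the direct neighbors of
 * `rank` in a non-periodic rows x cols grid with row-major rank order
 * and return how many neighbors were found (2, 3 or 4 for grids with
 * at least two rows and two columns). */
int grid_neighbors(int rank, int rows, int cols, int neighbors[4])
{
    assert(rows > 0 && cols > 0);
    assert(rank >= 0 && rank < rows * cols);

    int row = rank / cols;
    int col = rank % cols;
    int count = 0;

    if (row > 0)        neighbors[count++] = rank - cols; /* previous row */
    if (row < rows - 1) neighbors[count++] = rank + cols; /* next row */
    if (col > 0)        neighbors[count++] = rank - 1;    /* previous column */
    if (col < cols - 1) neighbors[count++] = rank + 1;    /* next column */

    return count;
}
```

In a grid of, say, 8 x 6 processes, no rank has more than four neighbors, so an MPB layout that favors neighbors has to serve at most 4 instead of 47 partners with payload space. Note that the ranks refer to the topology communicator (`comm_topo`), which may differ from the ranks in `MPI_COMM_WORLD` when reordering is enabled.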
4. Topology Awareness for RCKMPI: Evaluation
[Figure: bandwidth (MByte/s) over message size (4 Byte to 4 MiByte); series: enhanced RCKMPI with 1D topology (48 procs, 2 cache lines), enhanced RCKMPI with 1D topology (48 procs, 3 cache lines), enhanced RCKMPI without topology (48 procs)]
Results for 2D CFD application with ring topology:
[Figure: speedup over number of processes (4 to 48); series: enhanced RCKMPI with topology information (2 CL), original RCKMPI]
Summary and Future work
SCC is equipped with a fast NoC.
Message Passing Buffer on tile is beneficial for fast communication.
Bandwidth performance gain for MPI applications by using MPI’s virtual process topologies and rearranging the MPB layout.
Current/Future Work:
Comparison with I. C. Ureña and M. Gerndt: Improved RCKMPI’s SCCMPB Channel: Scaling and Dynamic Processes Support, ARCS 2012
Fixed the One-Sided Communication in RCKMPI
⇒ support of applications based on Global Arrays