Awareness of MPI Virtual Process Topologies on the Single-Chip Cloud Computer

(1)

Awareness of MPI Virtual Process Topologies on the

Single-Chip Cloud Computer

Steffen Christgau, Bettina Schnor

Potsdam University Institute of Computer Science Operating Systems and Distributed Systems

(2)

Outline

The Single-Chip Cloud Computer (SCC)

RCKMPI – MPI on the SCC

Topology-Awareness for RCKMPI – The concept Topology-Awareness for RCKMPI – Evaluation Summary and Future Work

(3)

Outline

The Single-Chip Cloud Computer (SCC) RCKMPI – MPI on the SCC

(4)

Outline

Topology-Awareness for RCKMPI – The concept

Topology-Awareness for RCKMPI – Evaluation Summary and Future Work

(5)

Outline

Topology-Awareness for RCKMPI – The concept Topology-Awareness for RCKMPI – Evaluation

(6)

Outline

(7)

1. The Single-Chip Cloud Computer (SCC)

Many-Core Architecture Research Community (MARC) established by Intel

research of universities together with Intel

world-wide community (dominated by European institutions) Website, regular symposia (every 6 months, 5 up to now) Our group is MARC member

Focus: application scalability

(8)

Single-Chip Cloud Computer

MC 0 MC 1 MC 3 MC 2 VRC SIF MPB Core0 Core1 L2 Cache L2 Cache 256 KB 256 KB 16 KB Router

24 tiles, 48 P54C cores, connected via Network-on-Chip,

noCache-Coherence

(9)

2. RCKMPI: MPI on the SCC

fork of MPICH2

Application

MPICH2 ROMIO

(MPI IO Implementation)

Message Passing Interface

ADIO Abstract Device Interface (ADI3)

Cray BG Channel-3 Device Soc k Nemesis SCCMPB SCCSHM SCCMulti Process Management Interf ace MPI Implementation MPI

(10)

2. RCKMPI: MPI on the SCC

fork of MPICH2

Application

MPICH2 ROMIO

(MPI IO Implementation)

Message Passing Interface

ADIO

Abstract Device Interface (ADI3) Cray BG Channel-3 Device Soc k Nemesis SCCMPB SCCSHM SCCMulti Process Management Interf ace MPI Implementation MPI

(11)

SCCMPB uses the fast Message Passing Buffer of each tile as shared

memory and divides it intonequal-sizeExclusive Write Sections (EWS)

=

⇒

remote write, local read

rank

Core 0 Core n

MPI Rank 0 MPI Rank n

MPI_COMM_WORLD

rank 1 rank n-1 rank n

(12)

SCCMPB uses the fast Message Passing Buffer of each tile as shared

memory and divides it intonequal-sizeExclusive Write Sections (EWS)

=

⇒

remote write, local read

rank

Core 0 Core n

MPI_COMM_WORLD

(13)

Comparison of different CH3-devices at maximum

Manhattan distance 8

0.001 0.01 0.1 1 10 100 4 16 64 256 1 Ki 4 Ki 16 Ki 64 Ki 256 Ki 1 Mi 4 Mi band widt h / MB yt e/ s

message size / Byte

RCKMPI sccmulti CH device RCKMPI sccmpb CH device RCKMPI sccshm CH device

(14)

Bandwidths for Manhattan distance 0, 5 and 8 (two

processes started).

0 10 20 30 40 50 60 70 80 90

4 16 64 256 1 Ki 4 Ki 16 Ki 64 Ki 256 Ki 1 Mi 4 Mi

ba nd w id th / M B yt e/ s message size / Byte Core 00 and 01 Core 00 and 10 Core 00 and 47

(15)

Bandwidths for maximum Manhattan distance 8, and varied

number of MPI processes

0 10 20 30 40 50 60 70 4 16 64 256 1 Ki 4 Ki 16 Ki 64 Ki 256 Ki 1 Mi 4 Mi band width / MByt e/ s

message size / Byte

2 MPI processes 12 MPI processes 24 MPI processes 48 MPI processes

(16)

Remember:

SCCMPB uses the fast Message Passing Buffer of each tile as shared memory (total 384 KB)

rank

Core 0 Core n

MPI_COMM_WORLD

MPB of core/rank 0

The MPB is equally devided innsections

=

⇒

depending on the

(17)

3. Topology Awareness for RCKMPI: The Concept

The bandwidth between 2 RCKMPI processes depends on:

the number of started MPI processes since a fully-connected network between all MPI processes is managed.

(18)

Application Behaviour

process 1 process 2 process n Tv Tu uu uv vv qu qv

Goal:The bandwidth between communicating processes, so-calledneighbors

(19)

Application Behaviour

process 1 process 2 process n Tv Tu uu uv vv qu qv

Goal:The bandwidth between communicating processes, so-calledneighbors

(20)

Requirements:

1 An improved MPB layout must consider both: communication neighbors

and group communication (barriers, broadcasts, gather/scatter, ...)

(21)

Putting things together ...

proc. n-1 proc. n proc. 1 proc. 2 proc. n-1 proc. n

proc. 1 proc. 2

p1p2 pn

channel headers payload area write section

channel header message payload

original layout

new layout without topology information

p1 proc. 2

p1p2 pn

with topology information pn

(22)

MPI offers API to specifyvirtual process topology 1

#d e f i n e NUM_DIMS 2

2 3

i n t grid_dims [NUM_DIMS] ;

4

i n t grid_periods [NUM_DIMS] ;

5

MPI_Comm comm_topo ;

6

7

/* f o r a grid , s e t a l l items o f grid_periods to 0 */

8

f o r ( i n t i = 0 ; i < NUM_DIMS; i++)

9

grid_periods [ i ] = 0 ;

10

11

MPI_Dims_create ( numProcs , NUM_DIMS, grid_dims ) ;

12

MPI_Cart_create (MPI_COMM_WORLD, NUM_DIMS, grid_dims ,

(23)

MPI offers API to specifyvirtual process topology 1

#d e f i n e NUM_DIMS 2

2 3

i n t grid_dims [NUM_DIMS] ;

4

i n t grid_periods [NUM_DIMS] ;

5

MPI_Comm comm_topo ;

6

7

/* f o r a grid , s e t a l l items o f grid_periods to 0 */

8

f o r ( i n t i = 0 ; i < NUM_DIMS; i++)

9

grid_periods [ i ] = 0 ;

10

11

MPI_Dims_create ( numProcs , NUM_DIMS, grid_dims ) ;

12

MPI_Cart_create (MPI_COMM_WORLD, NUM_DIMS, grid_dims ,

13

grid_periods , true , &comm_topo ) ;

(24)

4. Topology Awareness for RCKMPI: Evaluation

0 5 10 15 20 25 30 4 16 64 256 1 Ki 4 Ki 16 Ki 64 Ki 256 Ki 1 Mi 4 Mi bandwidth / MByte/s

message size / Byte

enhanced RCKMPI with 1D topology (48 procs, 2 Cache lines) enhanced RCKMPI with 1D topology (48 procs, 3 Cache lines) enhanced RCKMPI without topology (48 procs)

(25)

Results for 2D CFD application withring topology:

process 1

process 2

(26)

0 5 10 15 20 25 30 4 8 12 16 20 24 28 32 36 40 44 48 speedup number of processes

enhanced RCKMPI with topology information, 2 CL original RCKMPI

(27)

Summary and Future work

SCC is equipped with a fast NoC.

Message Passing Bufferon tile is beneficial for fast communication.

Bandwidth performance gain for MPI applications by using MPI’svirtual

process topologiesandrearranging the MPB layout.

Current/Future Work:

Comparison with

I. C. Ureña, and M. Gerndt:Improved RCKMPI’s SCCMPB Channel: Scaling and Dynamic Processes Support, ARCS 2012

Fixed the One-Sided Communication in RCKMPI

=

⇒

support of applications based on Global Arrays

(28)

Summary and Future work

Current/Future Work: Comparison with

I. C. Ureña, and M. Gerndt:Improved RCKMPI’s SCCMPB Channel:

Scaling and Dynamic Processes Support, ARCS 2012

=

⇒

support of applications based on Global Arrays

(29)

Summary and Future work

Current/Future Work: Comparison with

I. C. Ureña, and M. Gerndt:Improved RCKMPI’s SCCMPB Channel:

Scaling and Dynamic Processes Support, ARCS 2012

=

⇒

support of