
Awareness of MPI Virtual Process Topologies on the Single-Chip Cloud Computer


Academic year: 2021



Steffen Christgau, Bettina Schnor

Potsdam University, Institute of Computer Science, Operating Systems and Distributed Systems


Outline

The Single-Chip Cloud Computer (SCC)

RCKMPI – MPI on the SCC

Topology-Awareness for RCKMPI – The concept

Topology-Awareness for RCKMPI – Evaluation

Summary and Future Work


1. The Single-Chip Cloud Computer (SCC)

Many-Core Architecture Research Community (MARC), established by Intel

joint research by universities and Intel

world-wide community (dominated by European institutions)

website, regular symposia (every 6 months, 5 up to now)

our group is a MARC member

Focus: application scalability


Single-Chip Cloud Computer

[Figure: SCC chip layout with four memory controllers (MC 0 to MC 3), VRC and SIF; each tile contains Core0 and Core1 with a 256 KB L2 cache each, a 16 KB MPB and a router]

24 tiles, 48 P54C cores, connected via a Network-on-Chip, no cache coherence


2. RCKMPI: MPI on the SCC

fork of MPICH2

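[Figure: MPICH2 software stack. The application uses the Message Passing Interface; the MPI implementation below it consists of ROMIO (the MPI-IO implementation, on top of ADIO), the Abstract Device Interface (ADI3) with the Cray, BG and Channel-3 devices, and the Process Management Interface. The Channel-3 device offers the Sock, Nemesis, SCCMPB, SCCSHM and SCCMulti channels.]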


SCCMPB uses the fast Message Passing Buffer of each tile as shared memory and divides it into n equal-size Exclusive Write Sections (EWS)

=> remote write, local read

[Figure: Core 0 to Core n run MPI Rank 0 to MPI Rank n of MPI_COMM_WORLD; each MPB is split into one section per rank (rank 1, ..., rank n-1, rank n)]


Comparison of different CH3 devices at maximum Manhattan distance 8

[Figure: bandwidth (MByte/s, logarithmic axis from 0.001 to 100) versus message size (4 Byte to 4 MiB) for the RCKMPI sccmulti, sccmpb and sccshm CH devices]


Bandwidths for Manhattan distance 0, 5 and 8 (two processes started)

[Figure: bandwidth (0 to 90 MByte/s) versus message size (4 Byte to 4 MiB) for the core pairs 00 and 01, 00 and 10, 00 and 47]
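The three core pairs in this plot match Manhattan distances 0, 5 and 8 under a simple model of the chip. The sketch below is only an illustration: it assumes a 6 x 4 tile mesh with two cores per tile, that cores 2t and 2t+1 share tile t, and that tiles are numbered row by row. The function name `scc_manhattan_distance` is hypothetical and not part of RCKMPI.

```c
#include <assert.h>
#include <stdlib.h>

/* Assumed SCC geometry: 6 x 4 mesh of tiles, two cores per tile. */
#define SCC_MESH_COLS      6
#define SCC_MESH_ROWS      4
#define SCC_CORES_PER_TILE 2
#define SCC_NUM_CORES      (SCC_MESH_COLS * SCC_MESH_ROWS * SCC_CORES_PER_TILE)

/* Hypothetical helper: number of mesh hops between the tiles of two cores.
 * Assumes cores 2t and 2t+1 sit on tile t and tiles are numbered row by row. */
int scc_manhattan_distance(int core_a, int core_b)
{
    assert(core_a >= 0 && core_a < SCC_NUM_CORES);
    assert(core_b >= 0 && core_b < SCC_NUM_CORES);

    int tile_a = core_a / SCC_CORES_PER_TILE;
    int tile_b = core_b / SCC_CORES_PER_TILE;

    /* Horizontal and vertical hop counts on the mesh. */
    int dx = abs(tile_a % SCC_MESH_COLS - tile_b % SCC_MESH_COLS);
    int dy = abs(tile_a / SCC_MESH_COLS - tile_b / SCC_MESH_COLS);

    return dx + dy;
}
```

With this numbering, cores 00 and 01 share a tile (distance 0), cores 00 and 10 are five hops apart, and cores 00 and 47 sit in opposite corners of the mesh (distance 8).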


Bandwidths for maximum Manhattan distance 8 and a varied number of MPI processes

[Figure: bandwidth (0 to 70 MByte/s) versus message size (4 Byte to 4 MiB) for 2, 12, 24 and 48 MPI processes]


Remember:

SCCMPB uses the fast Message Passing Buffer of each tile as shared memory (total 384 KB)

[Figure: Core 0 to Core n run MPI Rank 0 to MPI Rank n of MPI_COMM_WORLD; the MPB of core/rank 0 holds one section per rank (rank 1, ..., rank n-1, rank n)]

The MPB is equally divided into n sections

=> depending on the


3. Topology Awareness for RCKMPI: The Concept

The bandwidth between 2 RCKMPI processes depends on:

the number of started MPI processes, since RCKMPI manages a fully connected network between all MPI processes.
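A rough sketch of this dependency: if one core's MPB share is cut into n equal Exclusive Write Sections, the space each sender can write into shrinks as n grows. The constants are assumptions about the hardware (8 KB of MPB per core, which matches the 384 KB total for 48 cores, and 32-byte cache lines), and `ews_bytes` is a hypothetical helper rather than an RCKMPI function.

```c
#include <assert.h>
#include <stddef.h>

#define MPB_BYTES_PER_CORE 8192u  /* assumption: 16 KB tile MPB shared by 2 cores */
#define CACHE_LINE_BYTES   32u    /* assumption: SCC cache-line size */

/* Hypothetical helper: bytes per Exclusive Write Section when one core's
 * MPB share is divided equally among num_procs writers. The result is
 * rounded down to a whole number of cache lines. */
size_t ews_bytes(unsigned num_procs)
{
    assert(num_procs > 0);

    size_t lines_total = MPB_BYTES_PER_CORE / CACHE_LINE_BYTES;
    size_t lines_per_section = lines_total / num_procs;

    return lines_per_section * CACHE_LINE_BYTES;
}
```

Under these assumptions a section holds 4096 bytes with 2 processes but only 160 bytes with 48 processes, so larger messages need more write/read rounds per transfer.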


Application Behaviour

[Figure: communication graph of process 1, process 2, ..., process n; labels: Tu, Tv, uu, uv, vv, qu, qv]

Goal: The bandwidth between communicating processes, so-called neighbors


Requirements:

1. An improved MPB layout must consider both communication neighbors and group communication (barriers, broadcasts, gather/scatter, ...)


Putting things together ...

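[Figure: three MPB layouts. Original layout: one write section per process (proc. 1, proc. 2, ..., proc. n-1, proc. n), each holding a channel header and the message payload. New layout without topology information: channel headers (p1, p2, ..., pn) followed by a payload area divided among proc. 1, proc. 2, ..., proc. n-1, proc. n. New layout with topology information: channel headers (p1, p2, ..., pn) followed by a payload area with sections labeled p1, proc. 2 and pn.]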
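One plausible reading of the layout with topology information, sketched under stated assumptions rather than taken from the RCKMPI sources: every process keeps a small channel header in the MPB, and the remaining payload area is shared only among the communication neighbors. The 8 KB per-core MPB share, the 32-byte cache line, and all function names are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define MPB_BYTES_PER_CORE 8192u  /* assumption: MPB share of one core */
#define CACHE_LINE_BYTES   32u    /* assumption: SCC cache-line size */

/* Left neighbor of rank in a 1D ring of num_procs processes. */
int ring_left(int rank, int num_procs)
{
    assert(num_procs > 0 && rank >= 0 && rank < num_procs);
    return (rank + num_procs - 1) % num_procs;
}

/* Right neighbor of rank in a 1D ring of num_procs processes. */
int ring_right(int rank, int num_procs)
{
    assert(num_procs > 0 && rank >= 0 && rank < num_procs);
    return (rank + 1) % num_procs;
}

/* Hypothetical layout: each of num_procs processes gets a header of
 * header_lines cache lines; the rest of the MPB share is split equally
 * among num_neighbors neighbors. Returns the payload bytes per neighbor. */
size_t neighbor_payload_bytes(unsigned num_procs, unsigned header_lines,
                              unsigned num_neighbors)
{
    size_t lines_total = MPB_BYTES_PER_CORE / CACHE_LINE_BYTES;
    size_t header_total = (size_t)num_procs * header_lines;

    assert(num_neighbors > 0 && header_total <= lines_total);

    return (lines_total - header_total) / num_neighbors * CACHE_LINE_BYTES;
}
```

Under these assumptions, 48 processes with 2 header cache lines and 2 ring neighbors leave 2560 bytes of payload per neighbor, compared with 160 bytes when the same 8 KB is split equally 48 ways.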


MPI offers an API to specify a virtual process topology:

#include <mpi.h>

#define NUM_DIMS 2

int main(int argc, char **argv)
{
    int numProcs;
    int grid_dims[NUM_DIMS] = {0};  /* 0 lets MPI_Dims_create choose the extent */
    int grid_periods[NUM_DIMS];
    MPI_Comm comm_topo;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numProcs);

    /* for a grid, set all items of grid_periods to 0 */
    for (int i = 0; i < NUM_DIMS; i++)
        grid_periods[i] = 0;

    MPI_Dims_create(numProcs, NUM_DIMS, grid_dims);
    MPI_Cart_create(MPI_COMM_WORLD, NUM_DIMS, grid_dims,
                    grid_periods, 1 /* reorder allowed */, &comm_topo);

    MPI_Finalize();
    return 0;
}
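Once such a Cartesian topology exists, the neighbors of a rank follow from its grid coordinates. The MPI-free sketch below mimics that lookup for a non-periodic 2D grid with row-major rank numbering; in an MPI program, `MPI_Cart_shift` provides this information. The helper `grid_neighbor` and the constant `NO_NEIGHBOR` (standing in for `MPI_PROC_NULL`) are hypothetical names, and the 8 x 6 grid in the usage note is an assumed shape for 48 processes.

```c
#include <assert.h>

#define NO_NEIGHBOR (-1)  /* stand-in for MPI_PROC_NULL in this sketch */

/* Hypothetical helper: rank of the process that lies d_row rows and d_col
 * columns away from rank in a non-periodic rows x cols grid with row-major
 * numbering, or NO_NEIGHBOR if that position is outside the grid. */
int grid_neighbor(int rank, int rows, int cols, int d_row, int d_col)
{
    assert(rows > 0 && cols > 0);
    assert(rank >= 0 && rank < rows * cols);

    int row = rank / cols + d_row;
    int col = rank % cols + d_col;

    if (row < 0 || row >= rows || col < 0 || col >= cols)
        return NO_NEIGHBOR;

    return row * cols + col;
}
```

For an 8 x 6 grid, `grid_neighbor(0, 8, 6, 0, 1)` yields rank 1 and `grid_neighbor(0, 8, 6, 1, 0)` yields rank 6, while the corner rank 0 has no neighbor above or to its left. A topology-aware layout only needs large MPB sections for these few ranks.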


4. Topology Awareness for RCKMPI: Evaluation

[Figure: bandwidth (0 to 30 MByte/s) versus message size (4 Byte to 4 MiB) for enhanced RCKMPI with 1D topology (48 procs, 2 cache lines), enhanced RCKMPI with 1D topology (48 procs, 3 cache lines) and enhanced RCKMPI without topology (48 procs)]


Results for a 2D CFD application with ring topology:

[Figure: ring topology; labels: process 1, process 2]


[Figure: speedup (0 to 30) versus number of processes (4 to 48) for enhanced RCKMPI with topology information (2 CL) and original RCKMPI]


Summary and Future work

SCC is equipped with a fast NoC.

The on-tile Message Passing Buffer is beneficial for fast communication.

MPI applications gain bandwidth by using MPI's virtual process topologies and rearranging the MPB layout.

Current/Future Work:

Comparison with I. C. Ureña and M. Gerndt: Improved RCKMPI's SCCMPB Channel: Scaling and Dynamic Processes Support, ARCS 2012

Fixed the One-Sided Communication in RCKMPI

=> support of applications based on Global Arrays
