DEEP and DEEP-ER

(1)

DEEP and DEEP-ER

Estela Suarez

Jülich Supercomputing Centre

10.05.2016

The research leading to these results has received funding from the European Community's Seventh Framework Programme (FP7/2007-2013) under Grant Agreement n° 287530 and n° 610476

(2)

DEEP

• Cluster-Booster archit.

• Software stack

• Programming environ.

• Energy efficiency

• Applications:

– Co-design

– Evaluation/demonstration

– Code modernisation

DEEP-ER

• Extend memory hierarchy

• High-performance

I/O

• Scalable

resiliency

• Applications:

– Co-design

– Evaluation/demonstration

– Code modernisation

2

Topics

(3)

CLUSTER-BOOSTER

ARCHITECTURE

(4)

“Standard” heterogeneity

CN

InfiniBand

CN

GPU

Flat topology

Simple management of

resources

Static assignment of

accelerators to CPUs

Accelerators cannot act

autonomously

(5)

CN

InfiniBand

Cluster

Cluster-Booster architecture

5

BI

BN

BI

BN

BI

BN

EXTOLL

Booster

Flexible assignment of resources (CPUs, accelerators)

Direct communication between accelerators

(6)

DEEP Architecture

6

Intel

®

_Xeon

®

(7)

DEEP System

• Installed at JSC

• 1,5 racks

• 500 TFlop/s

peak perf.

• 3.5 GFlop/s/W

• Water cooled

7

Cluster

(128 Xeon)

Booster

(384 Xeon Phi

KNC)

(8)

Booster Nodes (KNC) from 1 to 384

Boos

ter Node

s

(KNC)

fr

om 1

to 384

Booster measurements

MPI Linktest: ping-pong

8

Latency: 870 us

BW: 1.3 MB/s

PCIe

BW-issue in 3

nodes

(9)

Booster Nodes (KNC) from 1 to 384

Boos

ter Node

s

(KNC)

fr

om 1

to 384

Booster measurements

MPI Linktest: ping-pong

9

Stddev < 10%

(10)

GreenICE system

Alternative Booster implementation

• Interconnect EXTOLL ASIC “Tourmalet”

• 32 KNC-node system

• Implement 4



4 

2 topology, with Z

dimension open

Experiment 2-phase immersion cooling

• NOVEC liquid from 3M

• Evaporates at about 50 degrees

• Condensates again in a water cooling pipe

• Allows very high-density integration

10

(11)

Xeon Phi

DEEP-ER Architecture

Innovation

Simplified Interconnect

On-Node NVM

Self-Booting Nodes

Network

Attached Memory

11

Xeon

(12)

DEEP-ER Aurora Blade

prototype

Eurotech’s Aurora technology

Direct water cooled, high density

12 7U 19 inch Rootcard -18x EXTOLL -18x NVMe Chassis: -18x KNL -94GB Mem -1x backplane 4U 3U EXTOLL Tourmalet Aurora Blade DEEP-ER Booster

(in construction)

Aurora Blade Chassis

NVMe Intel Xeon Phi

(13)

SOFTWARE

(14)

14

Programming environment

Cluster

_Booster

Booster

Interface

Inf

in

ib

and

Ex

toll

Cluster Booster

Protocol

MPI_Comm_spawn

ParaStation MPI

(15)

0 100 200 300 400 500 600 700 800 900 1000

Message size

Th

roug

hp

ut

[MB/s]

CBP Message-based RMA limit

Cluster-Booster Protocol

15

(16)

Application running on DEEP

16 16

Source code

Compiler

Application

binaries

DEEP

Runtime

(17)

DEEP Offload (with OmpSs)

Performance & Scalability evaluation

17 Rank 0 master Rank 0-15 slave0 Rank 0 Worker Rank 1 Worker Rank 2 Worker Rank 3 wk63 Rank 16-31 slave1 Rank 0 Worker Rank 1 Worker Rank 2 Worker Rank 3 wk127 Rank 240-255 slave15 Rank 0 Worker Rank 1 Worker Rank 2 Worker Rank 3 wk1023 x16 x16 x16

Figure 7: FWI hierarchical MPI architecture

64 128 256 512 1024 64 128 256 512 1024 S p e e d -u p # nodes (16 cores) Ideal OmpSs Offload OmpSs Offload (no I/O)

Figure 8: Scalability of FWI application on up to 1024 nodes

VI. Concl usions and f ut ur e wor k

This paper presents the OmpSs O✏oad model that was originally developed to ease the porting of complex ap-plications to the highly heterogeneous cluster architecture proposed on the DEEP Exascale project. The OmpSs O✏oad model has completely fulﬁlled its design goals, combining the ease of use of Intel O✏oad with the ﬂexibility, per-formance and scalability of the nativeMPI Comm spawn

API. Moreover, our approach is fully integrated with the rest of features provided by OmpSs, such as support for OpenMP codes and CUDA or OpenCL kernels. Although it was originally conceived for heterogeneous clusters we have also successfully used it to develop hierarchical MPI applications such as FWI. We think that these hierarchical MPI architectures will play an important role in exploiting

future Exascale systems. Hence, tools such as OmpSs Of-ﬂoad will be essential for designing such architectures and helping with their implementation for complex and large applications.

As future work, we plan to integrate our allocation API with a resource manager/job scheduler to avoid the need to reserve all the resources that will be required before the program is launched. We also plan to investigate the potential of OmpSs O✏oad to improve the malleability of existing MPI applications, as well as the implications of using this o✏oad model from the resilience point of view.

Ref er ences

[1] D. A. Mallon, N. Eicker, M. E. Innocenti, G. Lapenta, T.

Lip-pert, and E. Suarez, “ On the scalability of the clusters-booster

concept: a critical assessment of the DEEP architecture,” in

Proceedings of the Future HPC Systems: the Challenges of Power-Constrained Performance. ACM, 2012, p. 3.

[2] A. Duran, E. Ayguad´e, R. M. Badia, J. Labarta, L. Martinell,

X. Martorell, and J. Planas, “ OmpSs: a proposal for

pro-gramming heterogeneous multi-core architectures,” Parallel

Processing Letters, vol. 21, no. 02, pp. 173–193, 2011.

[3] K. O. W. Group et al., “ The OpenCL speciﬁcation,” A.

Munshi, Ed, 2008.

[4] C. Nvidia, “ Compute Uniﬁed Device Architecture

program-ming guide,” 2007.

[5] C. J. Newburn, R. Deodhar, S. Dmitriev, R. Murty,

R. Narayanaswamy, J. Wiegert, F. Chinchilla, and

R. McGuire, “ O✏oad compiler runtime for the Intel R

Xeon Phi R coprocessor,” in Supercomputing. Springer,

2013, pp. 239–254.

[6] “ OpenMP 4.0 speciﬁcation,”

http://www.openmp.org/mp-documents/OpenMP4.0.0.pdf, 2013, [Online; accessed 20-Dec-2013].

[7] O. W. Groupet al.,“ The OpenACC application programming

interface,” 2011.

[8] F. Sainz, S. Mateo, V. Beltran, J. L. Bosque, X. Martorell,

and E. Ayguad´e, “ Leveraging OmpSs to exploit hardware

accelerators,” in 26th IEEE International Symposium on

Computer Architecture and High Performance Computing, SBAC-PAD 2014, Paris, France, October 22-24, 2014.

IEEE, 2014, pp. 112–119. [Online]. Available: http:

//dx.doi.org/10.1109/SBAC-PAD.2014.26

[9] J. Duato, A. J. Pena, F. Silla, R. Mayo, and E. S.

Quintana-Orti, “ rCUDA: Reducing the number of GPU-based

accel-erators in high performance clusters,” in High Performance

Computing and Simulation (HPCS), 2010 International Con-ference on. IEEE, 2010, pp. 224–231.

[10] A. Barak and A. Shiloh, “ The mosix Virtual OpenCLl (VCL)

cluster platform,” in Proc. Intel European Research and

Innovation Conference, 2011.

[11] F. Sainz and V. Beltran. (2015) OmpSs Collective

O✏oad. User Manual. [Online]. Available: http://pm.bsc.es/

ompss- docs/user-guide/run-programs- archs-o✏oad.html

Published in: “Collective Offload for Heterogeneous Clusters”, HiPC 2015

FWI (full wave inversion) code

(18)

Scalable I/O

• Improve I/O scalability on all usage-levels

• Used also for checkpointing

(19)

Resiliency

• Develop a hierarchical, distributed checkpoint/restart

mechanism leveraging DEEP-ER architecture

(20)

APPLICATIONS

(21)

Application-driven approach

• DEEP+DEEP-ER applications:

– Brain simulation (EPFL)

– Space weather simulation (KULeuven)

– Climate simulation (Cyprus Institute)

– Computational fluid engineering (CERFACS)

– High temperature superconductivity (CINECA)

– Seismic imaging (CGG)

– Human exposure to electromagnetic fields (INRIA)

– Geoscience (LRZ Munich)

– Radio astronomy (Astron)

– Oil exploration (BSC)

– Lattice QCD (University of Regensburg)

• Goals:

– Co-design and evaluation of architecture and its programmability

– Analysis of the I/O and resiliency requirements of HPC codes

(22)

Cluster-Booster Advantages

• More

flexible

than a standard architecture

→ This enables different use models:

1. Dynamic ratio of processors/coprocessors

2. Use Booster as pool of accelerators (globally shared)

3. Discrete use of the Booster

4. Discrete use + I/O offload

5. Specialized symmetric mode

• Enables a

more efficient use of system resources

– Only resources actually needed are blocked by applications

– Dynamic allocation further increases system utilization

(23)

23

Code optimisations

0

2

4

6

8

10

12

0

20

40

60

80

100

120

140

160 Speed

u

p

Gflo

p

s/s

Impact of different optimizations of

wave propagator on Xeon Phi

Gflops/s

speedup

BSC: Enhancing Oil Exploration (FWI, wave propagator)

(24)

Using SIONlib

LRZ: Rapid crustal deformation & earthquake source equation (Seisol)

1 process per node, 16 threads per process

writing 20 checkpoint files (4GB/checkpoint)

24

0.1

1.0

10.0

100.0

16

32

64

128

256

512 1,024

Ba

n

d

width

[GB

/s]

# of nodes

Accumulated bandwidth on SuperMUC

(25)

25

Using NVMe

Inria: Assessment of Human exposure to EM fields

24 MPI processes, 1 thread per process

0

10

20

30

40

50

60 P1

P2

P3

P4

W

riting

time

[s

]

I/O performance of MAXW-DGTD

sdv-work

NVMe

Increasing

model precision

P1<P2<P3<P4

(26)

Summary

●

DEEP:

–

Holistic project (HW+SW +App.) in strong co-design

–

Showed the potential of an alternative approach to

heterogeneous computing

–

Prototype available also to external users

• If interested, please contact

[email protected]

●

DEEP-ER:

–

HW: Prototype under development (impacted by KNL delay)

–

SW + APPs: Very good progress in software and applications

–

APPs: Benefit from new memory technologies shown.

Strong co-design within the project.

(27)

DEEP/-ER contribution

• To SRA:

– HPC System Architecture and component (

Cluster-Booster concept,

EXTOLL network, memory, energy efficiency, etc. )

– System Software management: (

Interconnect management, ParaStation

cluster management, etc.)

– Programming environment (

Heterogeneous programming model (MPI +

OmpSs), optimisations for resiliency, etc.)

– Balance compute, I/O and storage performance (

BeeGFS and I/O

software to efficiently manage multilevel memory hierarchy, etc.)

– Energy and resiliency:

(Warm water cooling, RAS monitoring, DEEP-ER

resiliency, etc.)

– Application code modernisation

• To international collaboration:

– European Exascale Projects (EEP) initiative

(28)

DEEP and DEEP-ER

28