Technology emerging from the DEEP & DEEP-ER projects

(1)

Technology emerging from the

DEEP & DEEP-ER projects

Estela Suarez

Jülich Supercomputing Centre

03.06.2016

The research leading to these results has received funding from the European Community's Seventh Framework Programme (FP7/2007-2013) under Grant Agreement n° 287530 and n° 610476

(2)

DEEP

•

Cluster-Booster archit.

• Software stack

• Programming environ.

• Energy efficiency

• Applications:

– Co-design

– Evaluation/demonstration

– Code modernisation

DEEP-ER

• Extend memory hierarchy

• High-performance

I/O

• Scalable

resiliency

• Applications:

– Co-design

– Evaluation/demonstration

– Code modernisation

2

Topics

(3)

CLUSTER-BOOSTER

ARCHITECTURE

(4)

“Standard” heterogeneity

CN

InfiniBand

CN

GPU

Flat topology

Simple management of

resources

Static assignment of

accelerators to CPUs

Accelerators cannot act

autonomously

(5)

CN

InfiniBand

Cluster

Cluster-Booster architecture

5

BI

BN

BI

BN

BI

BN

EXTOLL

Booster

Flexible assignment of resources (CPUs, accelerators)

Direct communication between accelerators

(6)

DEEP Architecture

6

Intel

®

Xeon

®

(7)

DEEP System

•

Installed at JSC

•

1,5 racks

•

500 TFlop/s

peak perf.

•

3.5 GFlop/s/W

•

Water cooled

7

Cluster

(128 Xeon)

Booster

(384 Xeon Phi

KNC)

(8)

Node Card

Interface Card

8

(9)

Booster Nodes (KNC) from 1 to 384

Boos

ter Nodes

(KNC)

fr

om 1 t

o

384

Booster measurements

MPI Linktest: ping-pong

15

Latency: 870 us

BW: 1.3 MB/s

PCIe

BW-issue in 3

nodes

(10)

Booster Nodes (KNC) from 1 to 384

Boos

ter Nodes

(KNC)

fr

om 1 t

o

384

Booster measurements

MPI Linktest: ping-pong

16

(11)

Main EXTOLL characteristics

– Direct network: no switches required

– Integrates network interface controller

– Supports 6+1 links

– Capable of tunneling PCIe (allows

remote-booting KNC from the

network)

Current (A3) version of EXTOLL ASIC

– 270 million transistors

– Link bandwidth: 100 G

– MPI latency: 850 ns

– MPI bandwidth: 8.5 GB/s

– Message rate: 70 million mgs/sec

– PCIe Gen3 x16

Network EXTOLL Tourmalet

Tourmalet PCI Express Board

Tourmalet Chip and Wafer

(12)

GreenICE system

Alternative Booster implementation

• Interconnect EXTOLL

ASIC “Tourmalet”

• 32 KNC-node system

• Implement 4



4



2 topology, with Z

dimension open

2-phase immersion cooling

• NOVEC liquid from 3M

• Evaporates at about 50 degrees

• Condensates again in a water cooling pipe

• Allows very high-density integration

19

(13)

MPI performance

Latency

(14)

MPI performance

Bandwidth

21

(15)

DEEP Architecture

23

(16)

Xeon Phi

DEEP-ER Architecture

Innovation

Simplified Interconnect

On-Node NVM

Self-Booting Nodes

Network

Attached Memory

24

Xeon

(17)

DEEP-ER Aurora Blade

prototype

Eurotech’s Aurora technology

Direct water cooled, high density

25 7U 19 inch Rootcard -18x EXTOLL -18x NVMe Chassis: -18x KNL -94GB Mem -1x backplane 4U 3U EXTOLL Tourmalet Aurora Blade DEEP-ER Booster

(in construction)

Aurora Blade Chassis

NVMe Intel Xeon Phi

(18)

Network Attached Memory

NAM architecture

•

Xilinx Virtex 7 FPGA

•

Hybrid Memory Cube (HMC)

– Bandwidth HMC↔FPGA: 40+ GByte/s – HMC Conrtroler: Open source development

•

Attached to TOURMALET NIC

libNAM

(libc based) for ease of use

Use cases:

•

global (shared) storage

•

compute node for an X-OR C/R app

•

“active memory”, etc.

26 EXTOLL Tourmalet NIC 12 EXTOLL lanes HDI 6 HDI 6 HDI 6 HDI 6 HDI 6 HDI 6

UHEI Aspin v2 Board

HDI 6 _{Virtex 7}Xilinx FPGA HMC DRAM 16 HMC lanes 12-lane EXTOLL Cable

Xilinx Virtex 7 FPGA

EXTOLL Endpoint HMC Controller 16 HMC lanes 12 EXTOLL lanes RMA Engine NAM Functions NAM Board

(19)

SOFTWARE

(20)

28

Programming environment

Cluster

Booster

Interface

Infinib

and

Extoll

Cluster Booster

Protocol

MPI_Comm_spawn

ParaStation MPI

(21)

0 100 200 300 400 500 600 700 800 900 1000

Message size

Throu

ghp

ut

[MB/s]

CBP Message-based RMA limit

Cluster-Booster Protocol

29

(22)

Application running on DEEP

30 30

Source code

Compiler

Application

binaries

DEEP

Runtime

(23)

DEEP Offload (with OmpSs)

Performance & Scalability evaluation

31 Rank 0 master Rank 0-15 slave0 Rank 0 WorkerRank 1 WorkerRank 2 WorkerRank 3 wk63 Rank 16-31 slave1 Rank 0 WorkerRank 1 WorkerRank 2 WorkerRank 3 wk127 Rank 240-255 slave15 Rank 0 WorkerRank 1 WorkerRank 2 WorkerRank 3 wk1023 x16 x16 x16

Figure 7: FWI hierarchical MPI architecture

64 128 256 512 1024 64 128 256 512 1024 S p e e d -u p # nodes (16 cores) Ideal OmpSs Offload OmpSs Offload (no I/O)

Figure 8: Scalability of FWI application on up to 1024 nodes

VI. Concl usions and f ut ur e wor k

This paper presents the OmpSs O

✏

oad model that was

originally developed to ease the porting of complex

ap-plications to the highly heterogeneous cluster architecture

proposed on the DEEP Exascale project. The OmpSs O

✏

oad

model has completely

fulﬁlled its design goals, combining

the ease of use of Intel O

✏

oad with the

ﬂexibility,

per-formance and scalability of the native

MPI

Comm spawn

API. Moreover, our approach is fully integrated with the

rest of features provided by OmpSs, such as support for

OpenMP codes and CUDA or OpenCL kernels. Although

it was originally conceived for heterogeneous clusters we

have also successfully used it to develop hierarchical MPI

applications such as FWI. We think that these hierarchical

MPI architectures will play an important role in exploiting

future Exascale systems. Hence, tools such as OmpSs

Of-ﬂoad will be essential for designing such architectures and

helping with their implementation for complex and large

applications.

As future work, we plan to integrate our allocation API

with a resource manager/job scheduler to avoid the need

to reserve all the resources that will be required before

the program is launched. We also plan to investigate the

potential of OmpSs O

✏

oad to improve the malleability of

existing MPI applications, as well as the implications of

using this o

✏

oad model from the resilience point of view.

Ref er ences

[1] D. A. Mallon, N. Eicker, M. E. Innocenti, G. Lapenta, T. Lip-pert, and E. Suarez, “ On the scalability of the clusters-booster concept: a critical assessment of the DEEP architecture,” in

Proceedings of the Future HPC Systems: the Challenges of Power-Constrained Performance. ACM, 2012, p. 3.

[2] A. Duran, E. Ayguad´e, R. M. Badia, J. Labarta, L. Martinell, X. Martorell, and J. Planas, “ OmpSs: a proposal for pro-gramming heterogeneous multi-core architectures,” Parallel Processing Letters, vol. 21, no. 02, pp. 173–193, 2011. [3] K. O. W. Group et al., “ The OpenCL speciﬁcation,” A.

Munshi, Ed, 2008.

[4] C. Nvidia, “ Compute Uniﬁed Device Architecture program-ming guide,” 2007.

[5] C. J. Newburn, R. Deodhar, S. Dmitriev, R. Murty, R. Narayanaswamy, J. Wiegert, F. Chinchilla, and R. McGuire, “ O✏oad compiler runtime for the Intel R

Xeon Phi R coprocessor,” in Supercomputing. Springer,

2013, pp. 239–254.

[6] “ OpenMP 4.0 speciﬁcation,” http://www.openmp.org/mp-documents/OpenMP4.0.0.pdf, 2013, [Online; accessed 20-Dec-2013].

[7] O. W. Groupet al.,“ The OpenACC application programming interface,” 2011.

[8] F. Sainz, S. Mateo, V. Beltran, J. L. Bosque, X. Martorell, and E. Ayguad´e, “ Leveraging OmpSs to exploit hardware accelerators,” in 26th IEEE International Symposium on Computer Architecture and High Performance Computing, SBAC-PAD 2014, Paris, France, October 22-24, 2014. IEEE, 2014, pp. 112–119. [Online]. Available: http: //dx.doi.org/10.1109/SBAC-PAD.2014.26

[9] J. Duato, A. J. Pena, F. Silla, R. Mayo, and E. S. Quintana-Orti, “ rCUDA: Reducing the number of GPU-based accel-erators in high performance clusters,” in High Performance Computing and Simulation (HPCS), 2010 International Con-ference on. IEEE, 2010, pp. 224–231.

[10] A. Barak and A. Shiloh, “ The mosix Virtual OpenCLl (VCL) cluster platform,” in Proc. Intel European Research and Innovation Conference, 2011.

[11] F. Sainz and V. Beltran. (2015) OmpSs Collective O✏oad. User Manual. [Online]. Available: http://pm.bsc.es/ ompss- docs/user-guide/run-programs- archs-o✏oad.html Published in: “Collective Offload for Heterogeneous Clusters”, HiPC 2015

(24)

Scalable I/O

• Improve I/O scalability on all usage-levels

• Used also for checkpointing

(25)

Filesystem

•

Two instances:

– Global FS on HDD server

– Cache FS on NVM at node

•

API for cache domain handling

– Synchronous version

– Asynchronous version

(26)

Resiliency

• Develop a hierarchical, distributed checkpoint/restart

mechanism leveraging DEEP-ER architecture

(27)

APPLICATIONS

(28)

Application-driven approach

•

DEEP+DEEP-ER applications:

– Brain simulation (EPFL)

– Space weather simulation (KULeuven)

– Climate simulation (Cyprus Institute)

– Computational fluid engineering (CERFACS)

– High temperature superconductivity (CINECA)

– Seismic imaging (CGG)

– Human exposure to electromagnetic fields (INRIA)

– Geoscience (LRZ Munich)

– Radio astronomy (Astron)

– Oil exploration (BSC)

– Lattice QCD (University of Regensburg)

•

Goals:

– Co-design and evaluation of architecture and its programmability

– Analysis of the I/O and resiliency requirements of HPC codes

(29)

Cluster-Booster Advantages

• More

flexible

than a standard architecture

→ This enables different use models:

1. Dynamic ratio of processors/coprocessors

2. Use Booster as pool of accelerators (globally shared)

3. Discrete use of the Booster

4. Discrete use + I/O offload

5. Specialized symmetric mode

• Enables a

more efficient use of system resources

– Only resources actually needed are blocked by applications

– Dynamic allocation further increases system utilization

(30)

39

Code optimisations

0

2

4

6

8

10

12

0

20

40

60

80

100

120

140

160

Speedup

Gflop

s/s

Impact of different optimizations of

wave propagator on Xeon Phi

Gflops/s

speedup

BSC: Enhancing Oil Exploration (FWI, wave propagator)

(31)

Using SIONlib

LRZ: Rapid crustal deformation & earthquake source equation (Seisol)

1 process per node, 16 threads per process

writing 20 checkpoint files (4GB/checkpoint)

40

0.1

1.0

10.0

100.0

16

32

64

128

256

512

1,024

Bandwidth

[GB/

s]

# of nodes

Accumulated bandwidth on SuperMUC

(32)

41

Using NVMe

Inria: Assessment of Human exposure to EM fields

24 MPI processes, 1 thread per process

0

10

20

30

40

50

60

P1

P2

P3

P4

W

riting time [

s]

I/O performance of MAXW-DGTD

sdv-work

NVMe

Increasing

model precision

P1<P2<P3<P4

(33)

DEEP/-ER

Emerging Technologies

•

Cluster-Booster Architecture:

– Alternative approach to heterogeneity

– High flexibility enabling various use modes

•

Hardware components:

– Booster (new kind of cluster of accelerators)

– GreenICE Booster (2-phase immersion cooling)

– EXTOLL network tested at scale

– Warm-water cooling

– Memory hierarchy based on NVM

– Network Attached Memory

(34)

DEEP/-ER

Emerging Technologies

43

•

Software

–

Cluster-Booster Protocol

: low-level communication protocol

between different high-speed networks

–

Programming environment

for future heterogeneous systems

–

ParaStation Global MPI

supporting EXTOLL and CBP

–

OmpSs

extensions for DEEP Offload

–

Resiliency

extensions for OmpSs (task recovery) and ParaStation

–

BeeGFS

extension for local caches (on NVM)

–

SIONlib

extensions for buddy-checkpointing, integration with SCR

and use of BeeGFS functionality

–

E10

scalability optimisations for MPI-I/O

–

Extrae/Paraver

support for DEEP Offload

–

Applications

modernisation and optimisation

(35)

DEEP and DEEP-ER

44

www.deep-project.eu

www.deep-er.eu

EU-Exascale projects

20 partners

Total budget: 28,3 M€

EU-funding: 14,5 M€

Nov 2011 – Mar 2017

Visit us @

ISC’16, Frankfurt

(Germany)

20.-22.06.2016

-Booth #1340

-BoF #11

-Workshop