System-level heterogeneity with Intel Xeon Phi processors

(1)

System-Level Heterogeneity with Intel® Xeon Phi™ Processors

Estela Suarez

Jülich Supercomputing Centre

This project has received funding from the European Union's Seventh Framework Programme for research, technological development and demonstration under grant agreements 287530 (DEEP) and 610476 (DEEP-ER).

(2)

Collaborative R&D in DEEP & DEEP-ER

European Union

Exascale projects

20 partners

Total budget: 28.3 M€

EU-funding: 14.5 M€

Combined term: 5 years

Visit us @ ISC 2016, Frankfurt (Germany)

June 19 – 23, 2016

Booth #1340

(3)

DEEP/-ER Approach

Co-design between applications, system SW and HW

– Application and operational requirements shape the system architecture

– HW provides performance, scalability and energy efficiency

– System SW enables applications to leverage HW potential

Objectives

– Deliver highest scalability and workload performance

– Provide leading energy efficiency (energy per result)

– Offer a familiar, easy-to-use programming environment and APIs based on standards

– Ensure sustainability of system architecture and SW

Design elements

– Exploit benefits of processor heterogeneity

– Leverage technology advances in storage-class memory and interconnects

(4)

Co-Design Applications

DEEP + DEEP-ER applications

• Brain simulation (EPFL)

• Space weather simulation (KULeuven)

• Climate simulation (CYI)

• Computational fluid engineering (CERFACS)

• High temperature superconductivity (CINECA)

• Seismic imaging (CGG)

• Human exposure to electromagnetic fields (INRIA)

• Geoscience (BADW-LRZ)

• Radio astronomy (Astron)

• Oil exploration (BSC)

• Lattice QCD (UREG)

Goals

• Drive co-design cycle

(5)

System-Level Heterogeneity

DEEP-ER – ISC 2016, Intel Booth – 21.06.2016

Accelerated Cluster

– Fixed, static ratio and assignment of accelerators to CPUs

– Static management of resources

– Accelerators do not act autonomously

– General-purpose Cluster interconnect

– Programming via local offload interfaces (OpenCL, CUDA, CELO, OpenACC, …)

Cluster-Booster Architecture

– No fixed ratio or assignment between resources (Multicore & Manycore nodes)

– Dynamic management and association of resources

– High-throughput network in the Booster

– Programming via MPI and “global”
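The difference between a fixed accelerator-to-CPU ratio and free association of resources can be illustrated with a deliberately simplified sketch. Everything here is an assumption made for illustration: the function names, the job size, and the choice of one accelerator per host are hypothetical and do not describe any DEEP scheduler.

```python
import math


def accelerated_cluster_reservation(cpu_nodes, accelerators, accels_per_host):
    """Resources reserved when accelerators are statically bound to hosts.

    Hypothetical model: a job must take whole hosts, and every host
    brings a fixed number of accelerators with it.
    """
    hosts = max(cpu_nodes, math.ceil(accelerators / accels_per_host))
    return hosts, hosts * accels_per_host


def cluster_booster_reservation(cpu_nodes, accelerators):
    """Resources reserved when both node types are allocated independently."""
    return cpu_nodes, accelerators


# Hypothetical job: little scalar work, a lot of highly parallel work.
job = {"cpu_nodes": 2, "accelerators": 32}

static = accelerated_cluster_reservation(
    job["cpu_nodes"], job["accelerators"], accels_per_host=1
)
dynamic = cluster_booster_reservation(job["cpu_nodes"], job["accelerators"])

print("accelerated cluster reserves (hosts, accelerators):", static)
print("cluster-booster reserves (cluster, booster nodes): ", dynamic)
```

In this toy model the statically coupled system reserves 32 hosts to reach 32 accelerators, although the job only needs 2 CPU nodes; the decoupled system reserves exactly what was requested.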

(6)

DEEP System Architecture

Cluster part

– High single thread & complex code performance
→ Intel® Xeon® processors

– High any-to-any connection performance & infrastructure integration
→ standard switched HPC fabric (InfiniBand™)

Booster part

– High throughput, autonomous operation
→ Intel Xeon Phi coprocessor (codenamed “Knights Corner”)

– Need for low latency, spatial application structures
→ 3D Torus direct-connected network (EXTOLL)

– Network bridging and KNC control
→ Booster Interface layer

Both parts

– Efficiency needs
→ use of liquid cooling & dense packaging
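A 3D torus suits spatially structured applications because every node has direct links to its neighbours along each axis, with wrap-around at the edges. The sketch below shows the generic neighbour and hop-count arithmetic for such a topology. The 4 x 4 x 4 dimensions and the function names are assumptions for illustration only; they are not the dimensions or the routing of the EXTOLL network in the DEEP Booster.

```python
def torus_neighbors(coord, dims):
    """Return the neighbours reached by one hop along each axis, in both directions."""
    neighbors = []
    for axis in range(3):
        for step in (-1, 1):
            n = list(coord)
            # Modulo arithmetic implements the wrap-around link of the torus.
            n[axis] = (n[axis] + step) % dims[axis]
            neighbors.append(tuple(n))
    return neighbors


def torus_hops(a, b, dims):
    """Minimal hop count between two nodes, taking the shorter way around each ring."""
    hops = 0
    for pa, pb, size in zip(a, b, dims):
        delta = abs(pa - pb)
        hops += min(delta, size - delta)
    return hops


dims = (4, 4, 4)  # hypothetical torus size
corner = (0, 0, 3)
neighbors = torus_neighbors(corner, dims)
print(neighbors)
print(torus_hops((0, 0, 0), (3, 0, 3), dims))
```

Thanks to the wrap-around links, the node at (0, 0, 3) is a direct neighbour of (0, 0, 0), and the opposite corners of a ring are never more than half the ring length apart.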

(7)

DEEP Prototype Systems

Eurotech Aurora Prototype

Cluster part

– 128 dual-socket Intel Xeon E5-2403 nodes

– QDR InfiniBand™

– Eurotech Aurora liquid cooling & packaging

Booster part

– 384 Intel Xeon Phi 7120X nodes

– FPGA implementation of EXTOLL interconnect

– 24 Booster interface nodes with Intel Xeon processor

– Eurotech Aurora liquid cooling & packaging
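The node counts of the prototype give a quick feel for its proportions. The short calculation below only divides the figures listed on this slide; the resulting values are averages, and how Booster nodes are actually wired to Booster interface nodes is not stated here, so the per-interface figure should be read as an assumption of even distribution.

```python
cluster_nodes = 128    # dual-socket Intel Xeon nodes (from the slide)
booster_nodes = 384    # Intel Xeon Phi nodes (from the slide)
interface_nodes = 24   # Booster interface nodes (from the slide)

# Average number of Booster nodes per Cluster node.
booster_per_cluster = booster_nodes / cluster_nodes

# Average number of Booster nodes per interface node,
# assuming (hypothetically) an even distribution.
booster_per_interface = booster_nodes / interface_nodes

print(f"Booster nodes per Cluster node:   {booster_per_cluster:.1f}")
print(f"Booster nodes per interface node: {booster_per_interface:.1f}")
```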

(8)

DEEP/DEEP-ER Programming Model

ParaStation Global MPI layer

– Expert-level programming

– Efficient communication across the whole system

– Dynamic process spawning and control in both directions

Task-based OmpSs programming model

– Pragma based, emphasizes ease of use

– Efficient communication across the whole system

– Dynamic spawning of massively parallel tasks in both parts
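In a task-based model such as OmpSs, the programmer annotates which data each task reads and writes, and the runtime derives the execution order from those annotations. The sketch below imitates that idea with Python's standard thread pool. It is not OmpSs and does not use its pragmas or API: the class `TinyTaskRuntime` and its methods are hypothetical names, and only read-after-write dependencies are tracked.

```python
from concurrent.futures import Future, ThreadPoolExecutor


class TinyTaskRuntime:
    """Hypothetical toy runtime: tasks start once their named inputs exist."""

    def __init__(self, workers=4):
        self._pool = ThreadPoolExecutor(max_workers=workers)
        self._producer = {}  # data name -> Future that yields its value

    def put(self, name, value):
        """Register an input value that is available immediately."""
        ready = Future()
        ready.set_result(value)
        self._producer[name] = ready

    def task(self, fn, ins, out):
        """Submit fn; it receives the values of `ins` and produces `out`."""
        deps = [self._producer[name] for name in ins]

        def run():
            # Block until every producer task has delivered its result.
            return fn(*(dep.result() for dep in deps))

        self._producer[out] = self._pool.submit(run)

    def get(self, name):
        """Wait for a named value, similar in spirit to a taskwait."""
        return self._producer[name].result()

    def shutdown(self):
        self._pool.shutdown()


rt = TinyTaskRuntime()
rt.put("a", 2)
rt.put("b", 3)
rt.task(lambda a, b: a + b, ins=["a", "b"], out="c")  # independent of "d"
rt.task(lambda a, b: a * b, ins=["a", "b"], out="d")  # may run concurrently with "c"
rt.task(lambda c, d: c * d, ins=["c", "d"], out="e")  # waits for both
result = rt.get("e")
rt.shutdown()
print(result)
```

Tasks are submitted in program order, so a producer is always queued before its consumers; this keeps the toy runtime free of deadlocks even with a single worker thread.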

(9)

Massively Parallel Tasks in OmpSs

Published in: Sainz, F., Bellón, J., Beltran, V., Labarta, J., “Collective Offload for Heterogeneous Clusters”, IEEE 22nd International Conference on High Performance Computing (HiPC), 2015

Figure 7: FWI hierarchical MPI architecture [diagram: master (rank 0); slaves slave0 … slave15 (ranks 0-15 … 240-255, x16); workers up to wk1023]

Figure 8: Scalability of FWI application on up to 1024 nodes [plot: speed-up vs. # nodes (16 cores), 64 to 1024; series: Ideal, OmpSs Offload, OmpSs Offload (no I/O)]
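Figure 8 plots speed-up against node count, with both axes running from 64 to 1024, which suggests that speed-up is normalised so that the smallest run counts as 64. Under that assumption, speed-up and parallel efficiency can be computed as sketched below. The function names are hypothetical and the timings are invented placeholders used only to exercise the formulas; they are not measurements from the paper.

```python
def scaled_speedup(base_nodes, base_time, time):
    """Speed-up normalised so that the baseline run equals base_nodes."""
    return base_nodes * base_time / time


def parallel_efficiency(base_nodes, base_time, nodes, time):
    """Fraction of the ideal speed-up achieved on `nodes` nodes."""
    return scaled_speedup(base_nodes, base_time, time) / nodes


# Hypothetical run times in seconds (NOT data from the paper).
timings = {64: 1000.0, 128: 520.0, 256: 280.0}

base_nodes = 64
base_time = timings[base_nodes]
for nodes, seconds in timings.items():
    s = scaled_speedup(base_nodes, base_time, seconds)
    e = parallel_efficiency(base_nodes, base_time, nodes, seconds)
    print(f"{nodes:5d} nodes: speed-up {s:7.1f}, efficiency {e:.2f}")
```

A run that lies on the ideal line has an efficiency of 1.0; the gap between a measured curve and the ideal line in such a plot is exactly this efficiency loss.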

VI. Conclusions and future work

This paper presents the OmpSs Offload model that was originally developed to ease the porting of complex applications to the highly heterogeneous cluster architecture proposed on the DEEP Exascale project. The OmpSs Offload model has completely fulfilled its design goals, combining the ease of use of Intel Offload with the flexibility, performance and scalability of the native MPI_Comm_spawn API. Moreover, our approach is fully integrated with the rest of features provided by OmpSs, such as support for OpenMP codes and CUDA or OpenCL kernels. Although it was originally conceived for heterogeneous clusters we have also successfully used it to develop hierarchical MPI applications such as FWI. We think that these hierarchical MPI architectures will play an important role in exploiting future Exascale systems. Hence, tools such as OmpSs Offload will be essential for designing such architectures and helping with their implementation for complex and large applications.

As future work, we plan to integrate our allocation API with a resource manager/job scheduler to avoid the need to reserve all the resources that will be required before the program is launched. We also plan to investigate the potential of OmpSs Offload to improve the malleability of existing MPI applications, as well as the implications of using this offload model from the resilience point of view.
