Technology emerging from the
DEEP & DEEP-ER projects
Estela Suarez
Jülich Supercomputing Centre
03.06.2016
The research leading to these results has received funding from the European Community's Seventh Framework Programme (FP7/2007-2013) under Grant Agreement n° 287530 and n° 610476
DEEP
•
Cluster-Booster archit.
• Software stack
• Programming environ.
• Energy efficiency
• Applications:
– Co-design
– Evaluation/demonstration
– Code modernisation
DEEP-ER
• Extend memory hierarchy
• High-performance
I/O
• Scalable
resiliency
• Applications:
– Co-design
– Evaluation/demonstration
– Code modernisation
2Topics
CLUSTER-BOOSTER
ARCHITECTURE
“Standard” heterogeneity
CN
CN
CN
InfiniBand
CN
CN
CN
GPU
GPU
GPU
GPU
GPU
GPU
Flat topology
Simple management of
resources
Static assignment of
accelerators to CPUs
Accelerators cannot act
autonomously
CN
CN
CN
InfiniBand
Cluster
Cluster-Booster architecture
5BI
BN
BI
BN
BI
BN
BN
BN
BN
BN
BN
BN
EXTOLL
Booster
Flexible assignment of resources (CPUs, accelerators)
Direct communication between accelerators
DEEP Architecture
6
Intel
®Xeon
®DEEP System
•
Installed at JSC
•
1,5 racks
•
500 TFlop/s
peak perf.
•
3.5 GFlop/s/W
•
Water cooled
7Cluster
(128 Xeon)
Booster
(384 Xeon Phi
KNC)
Node Card
Interface Card
8
Booster Nodes (KNC) from 1 to 384
Boos
ter Nodes
(KNC)
fr
om 1 t
o
384
Booster measurements
MPI Linktest: ping-pong
15
Latency: 870 us
BW: 1.3 MB/s
PCIe
BW-issue in 3
nodes
Booster Nodes (KNC) from 1 to 384
Boos
ter Nodes
(KNC)
fr
om 1 t
o
384
Booster measurements
MPI Linktest: ping-pong
16
Main EXTOLL characteristics
– Direct network: no switches required
– Integrates network interface controller
– Supports 6+1 links
– Capable of tunneling PCIe (allows
remote-booting KNC from the
network)
Current (A3) version of EXTOLL ASIC
– 270 million transistors
– Link bandwidth: 100 G
– MPI latency: 850 ns
– MPI bandwidth: 8.5 GB/s
– Message rate: 70 million mgs/sec
– PCIe Gen3 x16
Network EXTOLL Tourmalet
Tourmalet PCI Express Board
Tourmalet Chip and Wafer
GreenICE system
Alternative Booster implementation
• Interconnect EXTOLL
ASIC “Tourmalet”
• 32 KNC-node system
• Implement 4
4
2 topology, with Z
dimension open
2-phase immersion cooling
• NOVEC liquid from 3M
• Evaporates at about 50 degrees
• Condensates again in a water cooling pipe
• Allows very high-density integration
19
MPI performance
Latency
MPI performance
Bandwidth
21
DEEP Architecture
23
Xeon Phi
DEEP-ER Architecture
Innovation
Simplified Interconnect
On-Node NVM
Self-Booting Nodes
Network
Attached Memory
24Xeon
DEEP-ER Aurora Blade
prototype
Eurotech’s Aurora technology
Direct water cooled, high density
25 7U 19 inch Rootcard -18x EXTOLL -18x NVMe Chassis: -18x KNL -94GB Mem -1x backplane 4U 3U EXTOLL Tourmalet Aurora Blade DEEP-ER Booster
(in construction)
Aurora Blade Chassis
NVMe Intel Xeon Phi
Network Attached Memory
NAM architecture
•
Xilinx Virtex 7 FPGA
•
Hybrid Memory Cube (HMC)
– Bandwidth HMC↔FPGA: 40+ GByte/s – HMC Conrtroler: Open source development
•
Attached to TOURMALET NIC
libNAM
(libc based) for ease of use
Use cases:
•
global (shared) storage
•
compute node for an X-OR C/R app
•
“active memory”, etc.
26 EXTOLL Tourmalet NIC 12 EXTOLL lanes HDI 6 HDI 6 HDI 6 HDI 6 HDI 6 HDI 6
UHEI Aspin v2 Board
HDI 6 Virtex 7Xilinx FPGA HMC DRAM 16 HMC lanes 12-lane EXTOLL Cable
Xilinx Virtex 7 FPGA
EXTOLL Endpoint HMC Controller 16 HMC lanes 12 EXTOLL lanes RMA Engine NAM Functions NAM Board
SOFTWARE
28
Programming environment
Cluster
Booster
Booster
Interface
Infinib
and
Extoll
Cluster Booster
Protocol
MPI_Comm_spawn
ParaStation MPI
0 100 200 300 400 500 600 700 800 900 1000
Message size
Throu
ghp
ut
[MB/s]
CBP Message-based RMA limitCluster-Booster Protocol
29Application running on DEEP
30 30Source code
Compiler
Application
binaries
DEEP
Runtime
DEEP Offload (with OmpSs)
Performance & Scalability evaluation
31 Rank 0 master Rank 0-15 slave0 Rank 0 WorkerRank 1 WorkerRank 2 WorkerRank 3 wk63 Rank 16-31 slave1 Rank 0 WorkerRank 1 WorkerRank 2 WorkerRank 3 wk127 Rank 240-255 slave15 Rank 0 WorkerRank 1 WorkerRank 2 WorkerRank 3 wk1023 x16 x16 x16
Figure 7: FWI hierarchical MPI architecture
64 128 256 512 1024 64 128 256 512 1024 S p e e d -u p # nodes (16 cores) Ideal OmpSs Offload OmpSs Offload (no I/O)
Figure 8: Scalability of FWI application on up to 1024 nodes
VI. Concl usions and f ut ur e wor k
This paper presents the OmpSs O
✏
oad model that was
originally developed to ease the porting of complex
ap-plications to the highly heterogeneous cluster architecture
proposed on the DEEP Exascale project. The OmpSs O
✏
oad
model has completely
fulfilled its design goals, combining
the ease of use of Intel O
✏
oad with the
flexibility,
per-formance and scalability of the native
MPI
Comm spawn
API. Moreover, our approach is fully integrated with the
rest of features provided by OmpSs, such as support for
OpenMP codes and CUDA or OpenCL kernels. Although
it was originally conceived for heterogeneous clusters we
have also successfully used it to develop hierarchical MPI
applications such as FWI. We think that these hierarchical
MPI architectures will play an important role in exploiting
future Exascale systems. Hence, tools such as OmpSs
Of-fload will be essential for designing such architectures and
helping with their implementation for complex and large
applications.
As future work, we plan to integrate our allocation API
with a resource manager/job scheduler to avoid the need
to reserve all the resources that will be required before
the program is launched. We also plan to investigate the
potential of OmpSs O
✏
oad to improve the malleability of
existing MPI applications, as well as the implications of
using this o
✏
oad model from the resilience point of view.
Ref er ences
[1] D. A. Mallon, N. Eicker, M. E. Innocenti, G. Lapenta, T. Lip-pert, and E. Suarez, “ On the scalability of the clusters-booster concept: a critical assessment of the DEEP architecture,” in
Proceedings of the Future HPC Systems: the Challenges of Power-Constrained Performance. ACM, 2012, p. 3.
[2] A. Duran, E. Ayguad´e, R. M. Badia, J. Labarta, L. Martinell, X. Martorell, and J. Planas, “ OmpSs: a proposal for pro-gramming heterogeneous multi-core architectures,” Parallel Processing Letters, vol. 21, no. 02, pp. 173–193, 2011. [3] K. O. W. Group et al., “ The OpenCL specification,” A.
Munshi, Ed, 2008.
[4] C. Nvidia, “ Compute Unified Device Architecture program-ming guide,” 2007.
[5] C. J. Newburn, R. Deodhar, S. Dmitriev, R. Murty, R. Narayanaswamy, J. Wiegert, F. Chinchilla, and R. McGuire, “ O✏oad compiler runtime for the Intel R
Xeon Phi R coprocessor,” in Supercomputing. Springer,
2013, pp. 239–254.
[6] “ OpenMP 4.0 specification,” http://www.openmp.org/mp-documents/OpenMP4.0.0.pdf, 2013, [Online; accessed 20-Dec-2013].
[7] O. W. Groupet al.,“ The OpenACC application programming interface,” 2011.
[8] F. Sainz, S. Mateo, V. Beltran, J. L. Bosque, X. Martorell, and E. Ayguad´e, “ Leveraging OmpSs to exploit hardware accelerators,” in 26th IEEE International Symposium on Computer Architecture and High Performance Computing, SBAC-PAD 2014, Paris, France, October 22-24, 2014. IEEE, 2014, pp. 112–119. [Online]. Available: http: //dx.doi.org/10.1109/SBAC-PAD.2014.26
[9] J. Duato, A. J. Pena, F. Silla, R. Mayo, and E. S. Quintana-Orti, “ rCUDA: Reducing the number of GPU-based accel-erators in high performance clusters,” in High Performance Computing and Simulation (HPCS), 2010 International Con-ference on. IEEE, 2010, pp. 224–231.
[10] A. Barak and A. Shiloh, “ The mosix Virtual OpenCLl (VCL) cluster platform,” in Proc. Intel European Research and Innovation Conference, 2011.
[11] F. Sainz and V. Beltran. (2015) OmpSs Collective O✏oad. User Manual. [Online]. Available: http://pm.bsc.es/ ompss- docs/user-guide/run-programs- archs-o✏oad.html Published in: “Collective Offload for Heterogeneous Clusters”, HiPC 2015
Scalable I/O
• Improve I/O scalability on all usage-levels
• Used also for checkpointing
Filesystem
•
Two instances:
– Global FS on HDD server
– Cache FS on NVM at node
•
API for cache domain handling
– Synchronous version
– Asynchronous version
Resiliency
• Develop a hierarchical, distributed checkpoint/restart
mechanism leveraging DEEP-ER architecture
APPLICATIONS
Application-driven approach
•
DEEP+DEEP-ER applications:
– Brain simulation (EPFL)
– Space weather simulation (KULeuven)
– Climate simulation (Cyprus Institute)
– Computational fluid engineering (CERFACS)
– High temperature superconductivity (CINECA)
– Seismic imaging (CGG)
– Human exposure to electromagnetic fields (INRIA)
– Geoscience (LRZ Munich)
– Radio astronomy (Astron)
– Oil exploration (BSC)
– Lattice QCD (University of Regensburg)
•
Goals:
– Co-design and evaluation of architecture and its programmability
– Analysis of the I/O and resiliency requirements of HPC codes
Cluster-Booster Advantages
• More
flexible
than a standard architecture
→ This enables different use models:
1. Dynamic ratio of processors/coprocessors
2. Use Booster as pool of accelerators (globally shared)
3. Discrete use of the Booster
4. Discrete use + I/O offload
5. Specialized symmetric mode
• Enables a
more efficient use of system resources
– Only resources actually needed are blocked by applications
– Dynamic allocation further increases system utilization
39
Code optimisations
0
2
4
6
8
10
12
0
20
40
60
80
100
120
140
160
Speedup
Gflop
s/s
Impact of different optimizations of
wave propagator on Xeon Phi
Gflops/s
speedup
BSC: Enhancing Oil Exploration (FWI, wave propagator)
Using SIONlib
LRZ: Rapid crustal deformation & earthquake source equation (Seisol)
1 process per node, 16 threads per process
writing 20 checkpoint files (4GB/checkpoint)
40
0.1
1.0
10.0
100.0
16
32
64
128
256
512
1,024
Bandwidth
[GB/
s]
# of nodes
Accumulated bandwidth on SuperMUC
41
Using NVMe
Inria: Assessment of Human exposure to EM fields
24 MPI processes, 1 thread per process
0
10
20
30
40
50
60
P1
P2
P3
P4
W
riting time [
s]
I/O performance of MAXW-DGTD
sdv-work
NVMe
Increasing
model precision
P1<P2<P3<P4
DEEP/-ER
Emerging Technologies
•
Cluster-Booster Architecture:
– Alternative approach to heterogeneity
– High flexibility enabling various use modes
•
Hardware components:
– Booster (new kind of cluster of accelerators)
– GreenICE Booster (2-phase immersion cooling)
– EXTOLL network tested at scale
– Warm-water cooling
– Memory hierarchy based on NVM
– Network Attached Memory
DEEP/-ER
Emerging Technologies
43
•
Software
–
Cluster-Booster Protocol
: low-level communication protocol
between different high-speed networks
–
Programming environment
for future heterogeneous systems
–
ParaStation Global MPI
supporting EXTOLL and CBP
–
OmpSs
extensions for DEEP Offload
–
Resiliency
extensions for OmpSs (task recovery) and ParaStation
–
BeeGFS
extension for local caches (on NVM)
–
SIONlib
extensions for buddy-checkpointing, integration with SCR
and use of BeeGFS functionality
–
E10
scalability optimisations for MPI-I/O
–
Extrae/Paraver
support for DEEP Offload
–
Applications
modernisation and optimisation
DEEP and DEEP-ER
44www.deep-project.eu
www.deep-er.eu
EU-Exascale projects
20 partners
Total budget: 28,3 M€
EU-funding: 14,5 M€
Nov 2011 – Mar 2017
Visit us @
ISC’16, Frankfurt
(Germany)
20.-22.06.2016
-Booth #1340
-BoF #11
-Workshop