DEEP and DEEP-ER
Estela Suarez
Jülich Supercomputing Centre
10.05.2016
The research leading to these results has received funding from the European Community's Seventh Framework Programme (FP7/2007-2013) under Grant Agreement n° 287530 and n° 610476
Topics
DEEP
• Cluster-Booster architecture
• Software stack
• Programming environment
• Energy efficiency
• Applications:
– Co-design
– Evaluation/demonstration
– Code modernisation
DEEP-ER
• Extend memory hierarchy
• High-performance I/O
• Scalable resiliency
• Applications:
– Co-design
– Evaluation/demonstration
– Code modernisation
CLUSTER-BOOSTER ARCHITECTURE
“Standard” heterogeneity
[Diagram: compute nodes (CN) with attached GPUs, connected via InfiniBand]
• Flat topology
• Simple management of resources
• Static assignment of accelerators to CPUs
• Accelerators cannot act autonomously
Cluster-Booster architecture
[Diagram: Cluster of compute nodes (CN) on InfiniBand, Booster Interface nodes (BI), and Booster of Booster Nodes (BN) on EXTOLL]
• Flexible assignment of resources (CPUs, accelerators)
• Direct communication between accelerators
DEEP Architecture
[Diagram; surviving label: Intel® Xeon®]
DEEP System
• Installed at JSC
• 1.5 racks
• 500 TFlop/s peak performance
• 3.5 GFlop/s/W
• Water cooled
Cluster (128 Xeon)
Booster (384 Xeon Phi KNC)
Booster measurements
MPI Linktest: ping-pong
[Plots: pairwise ping-pong results, Booster Nodes (KNC) from 1 to 384 on both axes]
Latency: 870 us
BW: 1.3 MB/s
PCIe BW issue in 3 nodes
Stddev < 10%
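As a reading aid for ping-pong numbers, the sketch below shows how one-way latency and bandwidth are commonly derived from a measured round-trip time. It assumes the usual ping-pong convention (one-way time is half the round trip); the function names are hypothetical and this is not the Linktest implementation.

```python
def one_way_latency_us(round_trip_s: float) -> float:
    """One-way latency in microseconds, assuming a symmetric path."""
    return round_trip_s / 2.0 * 1e6


def bandwidth_mb_per_s(message_bytes: int, round_trip_s: float) -> float:
    """Bandwidth in MB/s (10^6 bytes/s) for one direction of the ping-pong."""
    one_way_s = round_trip_s / 2.0
    return message_bytes / one_way_s / 1e6
```

Small messages are normally used for the latency figure and large messages for the bandwidth figure, since the fixed start-up cost dominates the former and is negligible in the latter.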
GreenICE system
Alternative Booster implementation
• Interconnect EXTOLL ASIC “Tourmalet”
• 32 KNC-node system
• Implement 4×4×2 topology, with Z dimension open
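The 4×4×2 layout accounts for the 32 nodes of the system. The sketch below enumerates the neighbours of each node, assuming that X and Y wrap around (torus links) while Z does not, which is one reading of "Z dimension open". The names DIMS, WRAPS and neighbours are hypothetical and do not come from EXTOLL software.

```python
from itertools import product

DIMS = (4, 4, 2)              # X, Y, Z extents from the slide
WRAPS = (True, True, False)   # assumption: X and Y are closed rings, Z is open


def neighbours(node):
    """Return the directly linked nodes of `node`, given as (x, y, z)."""
    result = []
    for axis, (size, wraps) in enumerate(zip(DIMS, WRAPS)):
        for step in (-1, 1):
            coord = node[axis] + step
            if wraps:
                coord %= size          # wrap around the ring
            elif not 0 <= coord < size:
                continue               # open dimension: no link past the edge
            neighbour = node[:axis] + (coord,) + node[axis + 1:]
            if neighbour != node and neighbour not in result:
                result.append(neighbour)
    return result


NODES = list(product(*(range(n) for n in DIMS)))
```

Under these assumptions every node has five links: two in X, two in Y and one in Z.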
Experiment 2-phase immersion cooling
• NOVEC liquid from 3M
• Evaporates at about 50 °C
• Condenses again in a water cooling pipe
• Allows very high-density integration
DEEP-ER Architecture
[Diagram labels: Xeon, Xeon Phi]
Innovation
• Simplified Interconnect
• On-Node NVM
• Self-Booting Nodes
• Network Attached Memory
DEEP-ER Aurora Blade prototype
• Eurotech’s Aurora technology
• Direct water cooled, high density
DEEP-ER Booster (in construction)
[Diagram: Aurora Blade Chassis, 19 inch, 7U (4U + 3U); labels: EXTOLL Tourmalet, Aurora Blade, NVMe, Intel Xeon Phi]
• Rootcard: 18x EXTOLL, 18x NVMe
• Chassis: 18x KNL, 94GB Mem, 1x backplane
SOFTWARE
Programming environment
[Diagram: Cluster (InfiniBand), Booster Interface, Booster (EXTOLL); labels: Cluster Booster Protocol, MPI_Comm_spawn, ParaStation MPI]
[Plot: Cluster-Booster Protocol, throughput [MB/s] versus message size; legend: CBP message-based, RMA limit]
Application running on DEEP
[Diagram: Source code, Compiler, Application binaries, DEEP Runtime]
DEEP Offload (with OmpSs)
Performance & Scalability evaluation
[Diagram: a master (rank 0) drives 16 slaves, slave0 to slave15 (ranks 0-15, 16-31, ..., 240-255); each slave drives its own group of workers, the groups ending at wk63, wk127, ..., wk1023]
Figure 7: FWI hierarchical MPI architecture
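The worker labels in Figure 7 (wk63, wk127, ..., wk1023) suggest that each of the 16 slaves drives 64 consecutive workers, 1024 in total. The sketch below spells out that mapping; the 64-per-slave figure is inferred from the labels rather than stated in the source, and the function names are hypothetical.

```python
NUM_SLAVES = 16           # slave0 .. slave15 in Figure 7
WORKERS_PER_SLAVE = 64    # inferred: wk63 is the last worker under slave0


def slave_of(worker: int) -> int:
    """Index of the slave that drives global worker `worker`."""
    return worker // WORKERS_PER_SLAVE


def workers_of(slave: int) -> range:
    """Global worker indices driven by slave `slave`."""
    first = slave * WORKERS_PER_SLAVE
    return range(first, first + WORKERS_PER_SLAVE)
```

For example, workers_of(1) covers wk64 to wk127, matching the second group in the figure.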
[Plot: speed-up versus # nodes (16 cores), 64 to 1024 on both axes; series: Ideal, OmpSs Offload, OmpSs Offload (no I/O)]
Figure 8: Scalability of FWI application on up to 1024 nodes
VI. Conclusions and future work
This paper presents the OmpSs Offload model that was originally developed to ease the porting of complex applications to the highly heterogeneous cluster architecture proposed on the DEEP Exascale project. The OmpSs Offload model has completely fulfilled its design goals, combining the ease of use of Intel Offload with the flexibility, performance and scalability of the native MPI_Comm_spawn API. Moreover, our approach is fully integrated with the rest of features provided by OmpSs, such as support for OpenMP codes and CUDA or OpenCL kernels. Although it was originally conceived for heterogeneous clusters we have also successfully used it to develop hierarchical MPI applications such as FWI. We think that these hierarchical MPI architectures will play an important role in exploiting future Exascale systems. Hence, tools such as OmpSs Offload will be essential for designing such architectures and helping with their implementation for complex and large applications.
As future work, we plan to integrate our allocation API with a resource manager/job scheduler to avoid the need to reserve all the resources that will be required before the program is launched. We also plan to investigate the potential of OmpSs Offload to improve the malleability of existing MPI applications, as well as the implications of using this offload model from the resilience point of view.
Published in: “Collective Offload for Heterogeneous Clusters”, HiPC 2015
FWI (full waveform inversion) code
Scalable I/O
• Improve I/O scalability on all usage-levels
• Used also for checkpointing
Resiliency
• Develop a hierarchical, distributed checkpoint/restart mechanism leveraging the DEEP-ER architecture
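To illustrate what a hierarchical checkpoint scheme can look like, the sketch below writes frequent checkpoints to a fast node-local level and less frequent ones to a global level, then picks the newest checkpoint that can still be restored after a failure. The two levels, their intervals and all names are assumptions chosen for illustration; the slides do not describe the actual DEEP-ER mechanism at this level of detail.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Level:
    name: str
    interval: int  # a checkpoint is written every `interval` steps


# Hypothetical levels, ordered from fastest/least safe to slowest/safest.
LEVELS = (
    Level("node-local NVM", 10),
    Level("global parallel file system", 100),
)


def levels_due(step: int) -> list:
    """Names of the levels that should receive a checkpoint at `step`."""
    return [lv.name for lv in LEVELS if step > 0 and step % lv.interval == 0]


def restart_step(failed_step: int, node_local_lost: bool) -> int:
    """Newest checkpointed step that is still restorable after a failure."""
    usable = LEVELS[1:] if node_local_lost else LEVELS
    return max((failed_step // lv.interval) * lv.interval for lv in usable)
```

With these intervals, a failure at step 257 restarts from step 250 if the node-local data survives, and from step 200 if it is lost.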
APPLICATIONS
Application-driven approach
• DEEP+DEEP-ER applications:
– Brain simulation (EPFL)
– Space weather simulation (KULeuven)
– Climate simulation (Cyprus Institute)
– Computational fluid engineering (CERFACS)
– High temperature superconductivity (CINECA)
– Seismic imaging (CGG)
– Human exposure to electromagnetic fields (INRIA)
– Geoscience (LRZ Munich)
– Radio astronomy (Astron)
– Oil exploration (BSC)
– Lattice QCD (University of Regensburg)
• Goals:
– Co-design and evaluation of architecture and its programmability
– Analysis of the I/O and resiliency requirements of HPC codes
Cluster-Booster Advantages
• More flexible than a standard architecture
→ This enables different use models:
1. Dynamic ratio of processors/coprocessors
2. Use Booster as pool of accelerators (globally shared)
3. Discrete use of the Booster
4. Discrete use + I/O offload
5. Specialized symmetric mode
• Enables a more efficient use of system resources
– Only resources actually needed are blocked by applications
– Dynamic allocation further increases system utilization
Code optimisations
[Plot: Impact of different optimizations of wave propagator on Xeon Phi; axes: GFlop/s (0 to 160) and speedup (0 to 12)]
BSC: Enhancing Oil Exploration (FWI, wave propagator)
Using SIONlib
LRZ: Rapid crustal deformation & earthquake source equation (Seisol)
1 process per node, 16 threads per process
writing 20 checkpoint files (4GB/checkpoint)
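SIONlib maps many task-local files onto a small number of shared files, giving each task its own chunk aligned to file system block boundaries. The sketch below illustrates only that layout idea; it is a conceptual model with hypothetical function names, it ignores file metadata, and it does not use the SIONlib API.

```python
def align_up(size: int, block: int) -> int:
    """Round `size` up to the next multiple of `block`."""
    return -(-size // block) * block


def chunk_layout(task_sizes, fs_block_size):
    """Return (offsets, total_bytes) for one block-aligned chunk per task."""
    offsets = []
    cursor = 0
    for size in task_sizes:
        offsets.append(cursor)                    # this task's chunk starts here
        cursor += align_up(size, fs_block_size)   # next chunk starts on a block boundary
    return offsets, cursor
```

Because no two tasks share a file system block, each task can write its chunk independently while the file system only has to manage a single file.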
[Plot: Accumulated bandwidth on SuperMUC; bandwidth [GB/s] (0.1 to 100.0) versus # of nodes (16 to 1,024)]
Using NVMe
Inria: Assessment of Human exposure to EM fields
24 MPI processes, 1 thread per process
[Plot: I/O performance of MAXW-DGTD; writing time [s] (0 to 60) for P1 to P4; series: sdv-work, NVMe; increasing model precision P1<P2<P3<P4]
Summary
● DEEP:
– Holistic project (HW+SW+App.) in strong co-design
– Showed the potential of an alternative approach to heterogeneous computing
– Prototype available also to external users
• If interested, please contact
[email protected]
●