www.cineca.it
CINECA HPC Infrastructure: state of the art and road map
Installed HPC Engines
Eurora (Eurotech): hybrid cluster
64 nodes
1024 SandyBridge cores
64 K20 GPUs
64 Xeon Phi coprocessors
150 TFlops peak
FERMI (IBM BG/Q)
10240 nodes
163840 PowerA2 cores
2 PFlops peak
PLX (IBM x86+GPU): hybrid cluster
274 nodes
3288 Westmere cores
548 NVIDIA M2070 (Fermi)
300 TFlops peak
FERMI @ CINECA
PRACE Tier-0 System
Architecture: 10 BG/Q frames
Model: IBM-BG/Q
Processor Type: IBM PowerA2, 1.6 GHz
Computing Cores: 163840
Computing Nodes: 10240
RAM: 1GByte / core
Internal Network: 5D Torus
Disk Space: 2PByte of scratch space
Peak Performance: 2PFlop/s
The PRACE RI provides access to distributed, persistent, pan-European, world-class HPC computing and data management resources and services. Expertise in efficient use of the resources is available through participating centers throughout Europe. Available resources are announced for each Call for Proposals.
Peer reviewed open access
PRACE Projects (Tier-0)
PRACE Preparatory (Tier-0)
DECI Projects (Tier-1)
Tier-0: European
Tier-1: National
Tier-2: Local
1. Chip: 16 PowerA2 cores
2. Single Chip Module
3. Compute card: one chip module, 16 GB DDR3 memory
4. Node Card: 32 compute cards, optical modules, link chips, torus
5a. Midplane: 16 node cards
5b. I/O drawer: 8 I/O cards w/ 16 GB, 8 PCIe Gen2 x8 slots
6. Rack: 2 midplanes
7. System: 20 PF/s
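FERMI's size follows directly from this packaging (a quick sanity check using only the figures quoted in this document, with 10 racks):
( nodes = 10 racks * 2 midplanes * 16 node cards * 32 compute cards = 10240 compute nodes )
( cores = 10240 nodes * 16 cores = 163840 cores )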
BG/Q I/O architecture
[diagram] BG/Q compute racks connect over PCIe to the I/O nodes housed in the BG/Q I/O drawers (8 I/O nodes per drawer); the I/O nodes reach the file system servers through an InfiniBand (IB) switch.
At least one I/O node is needed for each partition/job.
PowerA2 chip, basic info
• 64-bit RISC processor
• Power instruction set (Power1…Power7, PowerPC)
• 4 floating point units per core & 4-way MT
• 16 cores + 1 + 1 (17th processor core for system functions)
• 1.6 GHz
• 32 MByte cache
• system-on-a-chip design
• 16 GByte of RAM at 1.33 GHz
• peak perf 204.8 gigaflops
• power draw of 55 watts
• 45 nanometer copper/SOI process (same as Power7)
PowerA2 FPU
• Each FPU on each core has four pipelines
• execute scalar floating point instructions
• four-wide SIMD instructions
• two-wide complex arithmetic SIMD instructions
• six-stage pipeline
• maximum of eight concurrent floating point operations
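From these figures the quoted peak numbers can be cross-checked (a back-of-the-envelope calculation using only values stated above):
( per chip: 16 cores * 8 Flop/clk * 1.6 GHz = 204.8 GFlops )
( FERMI system: 10240 chips * 204.8 GFlops ≈ 2.1 PFlops, i.e. the quoted 2 PFlops peak )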
EURORA
#1 in The Green500 List, June 2013
What does EURORA stand for?
EURopean many integrated cORe Architecture
What is EURORA?
Prototype Project
Funded by the PRACE-2IP EU project
Grant agreement number: RI-283493
Co-designed by CINECA and EUROTECH
Where is EURORA?
EURORA is installed at CINECA
When was EURORA installed?
March 2013
Who is using EURORA?
All Italian and EU researchers through PRACE
Why EURORA?
(project objectives)
Address Today's HPC Constraints:
Flops/Watt,
Flops/m2,
Flops/Dollar.
Efficient Cooling Technology:
hot water cooling (free cooling);
measure power efficiency, evaluate PUE & TCO (both defined after this list).
Improve Application Performance:
at the same rate as in the past (~Moore’s
Law);
new programming models.
Evaluate Hybrid (accelerated)
Technology:
Intel Xeon Phi;
NVIDIA Kepler.
Custom Interconnection Technology:
3D Torus network (FPGA);
evaluation of accelerator-to-accelerator communications.
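For reference (standard definitions, not specific to this document): PUE is the ratio of total facility power to the power drawn by the IT equipment alone, so a value close to 1.0 means nearly all power goes into computation; TCO is the total cost of ownership.
( PUE = total facility power / IT equipment power )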
64 compute cards
128 Xeon SandyBridge (2.1GHz, 95W and 3.1GHz, 150W)
16GByte DDR3 1600MHz per node
160GByte SSD per node
1 FPGA (Altera Stratix V) per node
IB QDR interconnect
3D Torus interconnect
128 accelerator cards (NVIDIA K20 and Intel Phi)
EURORA node card
Xeon Phi
Node energy efficiency
CINECA HPC infrastructure [diagram]
HPC Engines:
FERMI (IBM BGQ): #12 Top500, 2 PFlops peak, 163840 cores, 163 TByte RAM, PowerA2 1.6 GHz
Eurora (Eurotech hybrid): #1 Green500, 0.17 PFlops peak, 1024 x86 cores, 64 Intel Phi, 64 NVIDIA K20
PLX (IBM x86+GPU): 0.3 PFlops peak, ~3500 x86 procs, 548 NVIDIA GPUs, 20 NVIDIA Quadro, 16 fat nodes
HPC Data store: Workspace 3.6 PByte, Repository 1.8 PByte, Tape 1.5 PB
Network: custom (FERMI, EURORA), InfiniBand (EURORA, PLX, Store, Nubes), GbE infrastructure, Internet, fibre to Store and external data sources (Labs, PRACE, EUDAT, Projects)
Data processing workloads: FERMI, PLX; visualization, high throughput, big memory, DB, data movers, processing, web services, FEC, NUBES cloud services, web archive, FTP
HPC workloads: PRACE, ISCRA, LISA, Labs, Industry agreements, Projects, Training
HPC Services; HPC Cloud: Nubes, FEC, PLX, Store
CINECA services
• High Performance Computing
• Computational workflow
• Storage
• Data analytics
• Data preservation (long term)
• Data access (web/app)
• Remote Visualization
• HPC Training
• HPC Consulting
• HPC Hosting
• Monitoring and Metering
• …
(Data centric) Infrastructure (Q3 2014) [diagram]
Core Data Processing: FERMI, x86 cluster, new analytics cluster; visualization, big memory, DB, data mover, processing, web services, web archive, FTP; SaaS, analytics and parallel applications
Core Data Store: Workspace 3.6 PByte, Repository 5 PByte, Tape 5+ PByte, new storage
Cloud service; scale-out data processing
Internal data sources; external data sources: Laboratories, PRACE, EUDAT, Human Brain Prj, other data sources
High-level system requirements
Electrical power consumption:
400 kW
Physical system size:
5 racks
System peak performance (CPU+GPU):
on the order of 1 PFlops
System peak performance (CPU only):
on the order of 300 TFlops
New Tier 1 CINECA
High-level system requirements
CPU architecture:
Intel Xeon Ivy Bridge
Cores per CPU:
8 @ >3 GHz, or 12 @ 2.4 GHz
The choice of frequency and number of cores depends on the socket TDP, on the system density and on the cooling capacity
Number of servers:
500 - 600
( Peak perf = 600 * 2socket * 12core * 3GHz * 8Flop/clk = 345TFlops )
The number of servers may depend on the cost or on the configuration geometry in terms of the number of CPU-only nodes and CPU+GPU nodes
GPU architecture:
Nvidia K40
Number of GPUs:
>500
( Peak perf = 700 * 1.43TFlops = 1PFlops )
The number of GPU cards may depend on the cost or on the configuration geometry in terms of the number of CPU-only nodes and CPU+GPU nodes
High-level system requirements
Identified vendors:
IBM, Eurotech
DRAM memory:
1 GByte/core
The option of having a subset of nodes with a larger amount of memory will be requested
Local non-volatile memory:
>500 GByte
SSD/HD depending on the cost and on the system configuration
Cooling:
liquid cooling system with a free-cooling option
Scratch disk space:
>300 TByte
(provided by CINECA)
Roadmap 50PFlops
[timeline]
Power consumption: EURORA 50 kW, PLX 350 kW, BGQ 1000 kW (+ ENI); later EURORA or PLX upgrade 400 kW, BGQ 1000 kW, data repository 200 kW (- ENI)
R&D: Eurora; EuroExa STM/ARM board; EuroExa STM/ARM prototype; PCP proto 1 PF in a rack; EuroExa STM/ARM PF platform; ETP proto towards exascale
Deployment: Eurora industrial prototype 150 TF; Eurora or PLX upgrade 1 PF peak, 350 TF scalar; multi-petaflop Tier-0 system 50 PF; Tier-1 towards exascale
Timeline: 2013 - 2020
Roadmap to Exascale
HPC Architectures
two models
Hybrid:
Server class processors:
Server class nodes
Special purpose nodes
Accelerator devices:
Nvidia
Intel
AMD
FPGA
Homogeneous:
Server class nodes:
Standard processors
Special purpose nodes
Architectural trends
Peak Performance: Moore's law
FPU Performance: Dennard's law
Number of FPUs: Moore + Dennard
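These trends combine in the usual peak-performance estimate (the same form used for the Tier-1 sizing earlier in this document); with Dennard scaling broken, the clock term stays roughly flat and growth has to come from the number of FPUs:
( Peak Flops = number of FPUs * Flop per clock per FPU * clock frequency )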
Programming Models
fundamental paradigm:
Message passing
Multi-threads
Consolidated standard: MPI & OpenMP (see the hybrid sketch after this list)
New task-based programming models
Special purpose for accelerators:
CUDA
Intel offload directives
OpenACC, OpenCL, etc…
NO consolidated standard
Scripting:
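A minimal hybrid sketch of the two consolidated standards named above: MPI for message passing across nodes, OpenMP for multi-threading within a node (illustrative only; compiled e.g. with mpicc -fopenmp):

/* Minimal hybrid MPI + OpenMP "hello world" sketch:
 * one MPI rank per node, several OpenMP threads per rank. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank, nranks;

    /* Ask for an MPI library that tolerates threaded regions. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    /* Each rank spawns a team of threads for node-local parallelism. */
    #pragma omp parallel
    {
        printf("rank %d of %d, thread %d of %d\n",
               rank, nranks, omp_get_thread_num(), omp_get_num_threads());
    }

    MPI_Finalize();
    return 0;
}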
But!
Si lattice: 0.54 nm
There will still be 4~6 cycles (or technology generations) left until we reach 11~5.5 nm technologies, at which point we will hit the downscaling limit, in some year between 2020 and 2030 (H. Iwai, IWJT2008).
300 atoms!
14 nm VLSI
Dennard scaling law
(downscaling)
L’ = L / 2
V’ = V / 2
F’ = F * 2
D’ = 1 / L
2= 4D
P’ = P
do not hold anymore!
The power crisis!
L’ = L / 2
V’ = ~V
F’ = ~F * 2
D’ = 1 / L
2= 4 * D
P’ = 4 * P
Increase the number of cores to keep the evolution of architectures on Moore's law.
Programming crisis!
Core frequency and per-core performance no longer grow following Moore's law.
[figure: old vs. new VLSI generation]
Moore's Law
Economic and market law
From the WSJ:
Stacy Smith, Intel's chief financial officer, later gave some more detail on the economic benefits of staying on the Moore's Law race.
The cost per chip "is going down more than the capital intensity is going up," Smith said, suggesting Intel's profit margins should not suffer because of heavy capital spending. "This is the economic beauty of Moore's Law."
And Intel has a good handle on the next production shift, shrinking circuitry to 10 nanometers. Holt said the company has test chips running on that technology. "We are projecting similar kinds of improvements in cost out to 10 nanometers," he said.
So, despite the challenges, Holt could not be induced to say there's any looming end to Moore's Law, the invention race that has been a key driver of electronics innovation since first defined by Intel's co-founder in the mid-1960s.
What about Applications?
In a massively parallel context, an upper limit for the scalability of parallel applications is determined by the fraction of the overall execution time spent in non-scalable operations (Amdahl's law).
maximum speedup tends to 1 / (1 − P)
P = parallel fraction
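A quick worked example (numbers chosen here for illustration): if 99% of the execution time is perfectly parallel (P = 0.99), the speedup can never exceed 1 / (1 − 0.99) = 100, no matter how many cores are used.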