• No results found

Challenges on Extreme Scale Computers - Complexity, Energy, Reliability

N/A
N/A
Protected

Academic year: 2021

Share "Challenges on Extreme Scale Computers - Complexity, Energy, Reliability"

Copied!
19
0
0

Loading.... (view fulltext now)

Full text

(1)

Challenges on Extreme Scale Computers -

Complexity, Energy, Reliability

Dieter Kranzlmüller

Munich Network Management Team

Ludwig-Maximilians-Universität München (LMU) & Leibniz SupercompuFng Centre (LRZ)

of the Bavarian Academy of Sciences and HumaniFes

Leibniz Supercompu<ng Centre

of the Bavarian Academy of Sciences and Humani<es

D. Kranzlmüller UT Dallas, CS Dept 2

Cuboid containing compuFng systems 72 x 36 x 36 meters InsFtute Building InsFtute Building Lecture Halls VisualisaFon Centre

With 156 employees + 38 extra staff for more than 90.000 students and for more than 30.000 employees including 8.500 scienFsts

(2)

Leibniz Supercompu<ng Centre

of the Bavarian Academy of Sciences and Humani<es

n 

Computer Centre for all

Munich UniversiFes

D. Kranzlmüller UT Dallas, CS Dept 3

IT Service Provider:

• Munich ScienFfic Network (MWN)

• Web servers • e-Learning

• E-Mail • Groupware • Special equipment:

•  Virtual Reality Laboratory •  Video Conference

•  Scanners for slides and large

documents

•  Large scale ploders

IT Competence Centre:

• Hotline and support

• ConsulFng (security, networking, scienFfc

compuFng, …)

• Courses (text ediFng, image processing, UNIX, Linux, HPC, …)

(3)

Leibniz Supercompu<ng Centre

of the Bavarian Academy of Sciences and Humani<es

n 

Regional Computer

Centre for all

Bavarian UniversiFes

n 

Computer Centre for all

Munich UniversiFes

D. Kranzlmüller UT Dallas, CS Dept 5

Virtual Reality & Visualiza<on Centre (LRZ)

(4)

Examples from the V2C

D. Kranzlmüller UT Dallas, CS Dept 7

Leibniz Supercompu<ng Centre

of the Bavarian Academy of Sciences and Humani<es

n 

NaFonal SupercompuFng

Centre

n 

Regional Computer

Centre for all

Bavarian UniversiFes

n 

Computer Centre for all

Munich UniversiFes

(5)

Gauss Centre for Supercompu<ng (GCS)

n 

CombinaFon of the 3 German naFonal supercompuFng centers:

–  John von Neumann InsFtute for CompuFng (NIC), Jülich –  High Performance CompuFng Center Studgart (HLRS)

–  Leibniz SupercompuFng Centre (LRZ), Garching n. Munich

n 

Founded on 13. April 2007

n 

HosFng member of PRACE

(Partnership for Advanced CompuFng in Europe)

D. Kranzlmüller UT Dallas, CS Dept 9

PRACE Research Infrastructure Created

n 

Establishment of the legal framework

–  PRACE AISBL created with seat in Brussels in April (AssociaFon InternaFonale Sans But LucraFf) –  20 members represenFng 20 European countries

–  InauguraFon in Barcelona on June 9

n 

Funding secured for 2010 - 2015

–  400 Million € from France, Germany, Italy, Spain

Provided as Tier-0 services on TCO basis

–  Funding decision for 100 Million € in The Netherlands

expected soon

–  70+ Million € from EC FP7 for preparatory and implementaFon

Grants INFSO-RI-211528 and 261557

Complemented by ~ 60 Million € from PRACE members

(6)

PRACE Tier-0 Systems

n  Curie @ GENCI:

Bull Cluster, 1.7 PFlop/s

n  FERMI @ CINECA:

IBM BG/Q, 2.1 PFlop/s

n  Hermit @ HLRS:

Cray XE6, 1 Pflop/s

n  JUQUEEN @ FZJ:

IBM Blue Gene/Q, 5.9 PFlop/s

n  MareNostrum @ BSC:

IBM System X iDataPlex, 1 PFlop/s

n  SuperMUC @ LRZ:

IBM System X iDataPlex, 3.2 PFlop/s

D. Kranzlmüller UT Dallas, CS Dept 11

Leibniz Supercompu<ng Centre

of the Bavarian Academy of Sciences and Humani<es n 

European

SupercompuFng Centre

n 

NaFonal SupercompuFng

Centre

n 

Regional Computer

Centre for all

Bavarian UniversiFes

n 

Computer Centre for all

Munich UniversiFes

SGI UV SGI Altix Linux Clusters

SuperMUC

(7)

SuperMUC @ LRZ

D. Kranzlmüller UT Dallas, CS Dept 13

Video: SuperMUC rendered on SuperMUC by LRZ

hdp://youtu.be/OlAS6iiqWrQ

Top 500 Supercomputer List (June 2012)

D. Kranzlmüller UT Dallas, CS Dept 14

(8)

LRZ Supercomputers

D. Kranzlmüller 15

SuperMUC Phase II

UT Dallas, CS Dept

(9)

SuperMUC and its predecessors

D. Kranzlmüller UT Dallas, CS Dept 17

10 m

SuperMUC and its predecessors

D. Kranzlmüller UT Dallas, CS Dept 18

10 m

22 m 11 m

(10)

SuperMUC and its predecessors

D. Kranzlmüller UT Dallas, CS Dept 19

10 m

22 m 11 m

LRZ Building Extension

(11)

I/O nodes NAS

80 Gbit/s

18 Thin node islands

(each >8000 cores) (8200 cores) 1 Fat node island è SuperMIG

$HOME 1.5 PB / 10 GB/s

Snapshots/Replika 1.5 PB (separate fire section)

non blocking non blocking

pruned tree (4:1) SB-EP 16 cores/node 2 GB/core WM-EX 40cores/node 6.4 GB/core 10 PB 200 GB/s GPFS for $WORK $SCRATCH Internet Achive and Backup ~ 30 PB

Desaster Recovery Site

Compute nodes Compute nodes

SuperMUC Architecture

D. Kranzlmüller UT Dallas, CS Dept 21

Power Consump<on at LRZ

D. Kranzlmüller UT Dallas, CS Dept 22

0 5.000 10.000 15.000 20.000 25.000 30.000 35.000 Str o m ve rb ra u ch in MW h

(12)

Cooling SuperMUC

D. Kranzlmüller UT Dallas, CS Dept 23

(13)

LRZ Application Mix

q  Computational Fluid Dynamics: Optimisation of turbines and wings, noise reduction, air conditioning in trains

q  Fusion: Plasma in a future fusion reactor (ITER)

q  Astrophysics: Origin and evolution of stars and galaxies q  Solid State Physics: Superconductivity, surface properties q  Geophysics: Earth quake scenarios

q  Material Science: Semiconductors q  Chemistry: Catalytic reactions

q  Medicine and Medical Engineering: Blood flow, aneurysms, air conditioning of operating theatres

q  Biophysics: Properties of viruses, genome analysis q  Climate research: Currents in oceans

D. Kranzlmüller UT Dallas, CS Dept 25

Increasing numbers

Date System Flop/s Cores

2000 HLRB-I 2 Tflop/s 1512

2006 HLRB-II 62 Tflop/s 9728

2012 SuperMUC 3200 Tflop/s 155656

2015 SuperMUC Phase II 3.2 + 3.2 Pflop/s 229960

(14)

SuperMUC Jobsize 2015 (in Cores)

Leibniz SupercompuFng Centre, ApplicaFon Support Group

21-Nov-15 27

1

st

LRZ Extreme Scale Workshop

n 

July 2013:

1

st

LRZ Extreme Scale Workshop

n 

ParFcipants:

–  15 internaFonal projects

n 

Prerequisites:

–  Successful run on 4 islands (32768 cores)

n 

ParFcipaFng Groups (Sosware packages):

–  LAMMPS, VERTEX, GADGET, WaLBerla, BQCD, Gromacs, APES, SeisSol, CIAO

n 

Successful results (> 64000 Cores):

–  Invited to parFcipate in PARCO Conference (Sept. 2013) including a publicaFon of their approach

(15)

1

st

LRZ Extreme Scale Workshop

n 

Regular SuperMUC operation

–  4 Islands maximum

–  Batch scheduling system

n 

Entire SuperMUC reserved 2,5 days for challenge:

–  0,5 Days for testing

–  2 Days for executing

–  16 (of 19) Islands available

n 

Consumed computing time for all groups:

–  1 hour of runtime = 130.000 CPU hours

–  1 year in total

D. Kranzlmüller UT Dallas, CS Dept 29

Results (Sustained TFlop/s on 128000

cores)

D. Kranzlmüller UT Dallas, CS Dept 30

Name MPI # cores Description TFlop/s/island TFlop/s max

Linpack IBM 128000 TOP500 161 2560

Vertex IBM 128000 Plasma Physics 15 245

GROMACS IBM, Intel 64000 Molecular Modelling 40 110

Seissol IBM 64000 Geophysics 31 95

waLBerla IBM 128000 Lattice Boltzmann 5.6 90

LAMMPS IBM 128000 Molecular Modelling 5.6 90

APES IBM 64000 CFD 6 47

(16)

Results

n 

5 Software packages were running on max 16 islands:

–  LAMMPS –  VERTEX –  GADGET –  WaLBerla –  BQCD n 

VERTEX reached

245 TFlop/s on

16 islands

(A. Marek)

D. Kranzlmüller UT Dallas, CS Dept 31

Lessons learned – Technical Perspective

n 

Hybrid (MPI+OpenMP) on SuperMUC still slower than pure

MPI (e.g. GROMACS), but applications scale to larger core

counts (e.g. VERTEX)

n 

Core pinning needs a lot of experience by the programmer

n 

Parallel IO still remains a challenge for many applications,

both with regard to stability and speed.

n 

Several stability issues with GPFS were observed for very

large jobs due to writing thousands of files in a single

directory. This will be improved in the upcoming versions of

the application codes.

(17)

Extreme Scaling - Continuation

n 

LRZ Extreme Scale Benchmark Suite (LESS) will be available

in two versions: public and internal

n 

All teams will have the opportunity to run performance

benchmarks after upcoming SuperMUC maintenances

n 

2

nd

LRZ Extreme Scaling Workshop

è

2-5 June 2014

–  Full system production runs on 18 islands with sustained Pflop/s

(4h SeisSol, 7h Gadget)

–  4 existing + 6 additional full system applications

–  High I/O bandwidth in user space possible (66 GB/s of 200 GB/s max)

–  Important goal: minimize energy*runtime (3-15 W/core)

n 

Extreme Scale-Out SuperMUC Phase 2

D. Kranzlmüller UT Dallas, CS Dept 33

Extreme Scale-Out SuperMUC Phase 2

n 

12 May – 12 June 2015 (30 days)

n 

Selected Group of Early Users

n 

Nightly OperaFon: general queue max 3 islands

n 

DayFme OperaFon: special queue max 6 islands (full system)

n 

Total available: 63,432,000 core hours

n 

Total used: 43,758,430 core hours (UFlisaFon: 68.98%)

Lessons learned

(2015):

n 

PreparaFon is everything

n 

Finding Heisenbugs is difficult

n 

MPI is at its limits

n 

Hybrid (MPI+OpenMP) is the way to go

n 

I/O libraries gewng even more important

(18)

Partnership Ini<a<ve

Computa<onal Sciences πCS

n 

Individualized services

for selected scienFfic groups –

flagship role

–  Dedicated point-of-contact

–  Individual support and guidance and targeted training &educaFon

–  Planning dependability for use case specific opFmized IT infrastructures

–  Early access to latest IT infrastructure (hard- and sosware) developments

and specificaFon of future requirements

–  Access to IT competence network and experFse at CS and Math departments

n 

Partner contribu<on

–  Embedding IT experts in user groups

–  Joint research projects (including funding)

–  ScienFfic partnership – equal fooFng – joint publicaFons

n 

LRZ benefits

–  Understanding the (current and future) needs and requirements of the

respecFve scienFfic domain

–  Developing future services for all user groups

–  ThemaFc focusing: Environmental Compu<ng

D. Kranzlmüller UT Dallas, CS Dept 35

SeisSol - Numerical Simula<on of Seismic Wave

Phenomena

Dr. ChrisFan PelFes, Department of Earth and Environmental Sciences (LMU) Prof. Michael Bader, Department of InformaFcs (TUM)

1,42 Petaflop/s on 147.456 Cores of SuperMUC

(44,5 % of Peak Performance)

(19)

Extreme Scaling - Conclusions

n 

The number of compute cores, the complexity (and heterogeneity)

is steadily increasing

n 

Users need to possibility to reliably execute (and opFmize) their

codes on the full size machines

n 

The Extreme Scaling Workshop Series @ LRZ offers a number of

incenFves for users

n 

The lessons learned from the Extreme Scaling Workshop are very

valuable for the operaFon of the center

n 

The LRZ Partnership IniFaFve ComputaFonal Science (piCS) tries to

improve user support

hdp://www.sciencedirect.com/science/arFcle/pii/S1877050914003433

D. Kranzlmüller UT Dallas, CS Dept 37

D. Kranzlmüller UT Dallas, CS Dept 38

Challenges on Extreme Scale Computers

Complexity, Energy, Realiability

Dieter Kranzlmüller

References

Related documents

Recent rheological data from low-temperature deformation experiments on olivine suggest that strain hardening occurs due to the presence of temperature-independent back

Free keywords culture, economy, society, institutions, community, managers, sciences, relativity, European, Dutch, globalization, influence, rationality, modern,

To Customer’s Holding Tank or Waste Treatment Plant Fresh Water Holding Tank LC3 Level Control &#34;Y&#34; Strainer Gate Valve Ball Check Rotameter Pump

UNCHS (Habitat) provides network support services for the sharing and exchange of information on, inter alia, good and best practices, training and human resources development,

As taking into account long lags might be key to appropriately describe certain aspects of the relationship between life expectancy and per capita income (at least for the

Furthermore, us- ing our theoretical predictions, we find that the statistical properties are invariant between 3-clique networks and original networks for several observable

Multimedia applications are used in education to transfer knowledge effectively, and enhance the ability of teachers and learners in their process, in terms of

Ancaman kepada keselamatan Sabah – pembinaan negara-bangsa di sebalik masyarakat yang berbilang etnik, pelarian dan orang tidak bernegara, dan migran ekonomi, ditambah pula