1
Getting the Performance Out Of
Getting the Performance Out Of
High Performance Computing
High Performance Computing
Jack Dongarra Innovative Computing Lab
University of Tennessee and
Computer Science and Math Division Oak Ridge National Lab
http://
http://www.cs.utk.edu/~dongarrawww.cs.utk.edu/~dongarra//
2
Getting the Performance Out Of
Getting the Performance Out Of
High Performance Computing
High Performance Computing
Jack Dongarra Innovative Computing Lab
University of Tennessee and
Computer Science and Math Division Oak Ridge National Lab
http://
3
4
Getting the Performance into
Getting the Performance into
High Performance Computing
High Performance Computing
Jack Dongarra Innovative Computing Lab
University of Tennessee and
Computer Science and Math Division Oak Ridge National Lab
http://
5
Technology Trends:
Technology Trends:
Microprocessor Capacity
Microprocessor Capacity
2X transistors/Chip Every 1.5 years Called “Moore’s Law”
Microprocessors have become smaller, denser, and more powerful.
Not just processors, storage, internet bandwidth, etc
Gordon Moore (co-founder of Intel) predicted in 1965 that the transistor density of semiconductor chips would double roughly every 18 months. 6 Earth Simulator ASCI White Pacific EDSAC 1 UNIVAC 1 IBM 7090 CDC 6600 IBM 360/195 CDC 7600 Cray 1 Cray X-MPCray 2 TMC CM-2 TMC CM-5 Cray T3D ASCI Red 1950 1960 1970 1980 1990 2000 2010 1 KFlop/s 1 MFlop/s 1 GFlop/s 1 TFlop/s 1 PFlop/s Scalar Super Scalar Vector Parallel Super Scalar/Vector/Parallel
Moore’s Law
Moore’s Law
7
Next Generation: IBM Blue Gene/L
Next Generation: IBM Blue Gene/L
and ASCI Purple
and ASCI Purple
♦
Announced 11/19/02
Ø
One of 2 machines for LLNL
Ø
360 TFlop/s
Ø
130,000 proc
Ø
Linux
Ø
FY 2005
Plus ASCI Purple IBM Power 5 based 12K proc, 100 TFlop/s8
To Be Provocative…
To Be Provocative…
Citation in the Press, March 10
Citation in the Press, March 10
thth, 2008
, 2008
DOE Supercomputers Sit Idle
♦
How could this happen?
ØComplexity of programming these machines were
underestimated
ØUsers were unprepared for
the lack of reliability of the hardware and software
ØLittle effort was spent to carry out medium and long term research activities to solve problems that were
foreseen 5 years ago in the areas of applications,
algorithm, middleware, programming models, and computer architectures, … WASHINGTON, Mar. 10, 2008
GAO reports that
after almost
5 years of effort and several
hundreds of M$’s spent at
the DOE labs, the
high performance
computers recently
purchased did not
meet users’ expectation
and are sitting idle…Alan
Laub head of the DOE
efforts9
Software Technology & Performance
Software Technology & Performance
♦ Tendency to focus on the hardware
♦ Software required to bridge an ever widening gap
♦ Gaps between potential and delivered performance is very steep
ØPerformance only if the data and controls are setup just right
ØOtherwise, dramatic performance degradations, very unstable situation
ØWill become more unstable as systems change and become more complex
♦ Challenge for applications, libraries, and tools is formidable with Tflop/s level, even greater with Pflops, some might say insurmountable.
10
Linpack (100x100) Analysis,
Linpack (100x100) Analysis,
The Machine on My Desk 12 Years Ago and Today
The Machine on My Desk 12 Years Ago and Today
♦ Compaq 386/SX20 SX with FPA - .16 Mflop/s
♦ Pentium IV – 2.8 GHz – 1317 Mflop/s
♦ 12 years è we see a factor of ~ 8231 ØDoubling in less than 12 months, for 12 years
♦ Moore’s Law gives us a factor of 256.
♦ How do we get a factor > 8000? ØClock speed increase = 128x
ØExternal Bus Width & Caching –
Ø16 vs. 64 bits = 4x
ØFloating Point
-Ø4/8 bits multi vs. 64 bits (1 clock) = 8x
ØCompiler Technology = 2x
♦ However the potential for that Pentium 4 is 5.6 Gflop/s and here we are getting 1.32 Gflop/s
ØStill a factor of 4.25 off of peak
vComplex set of interaction between ØApplication ØAlgorithms ØProgramming language ØCompiler ØMachine instructions ØHardware
vMany layers of translation from
the application to the hardware
11
Where Does Much of Our Lost Performance Go? or
Where Does Much of Our Lost Performance Go? or
Why Should I Care About the Memory Hierarchy?
Why Should I Care About the Memory Hierarchy?
1 100 10000 1000000 1980 1982 1984 1986 1988 1990 1992 1994 1996 1998 2000 2002 2004 Year Performance
Processor-DRAM Memory Gap (latency) µProc 60%/yr. (2X/1.5yr) DRAM 9%/yr. (2X/10 yrs) “Moore’s Law” Processor-Memory Performance Gap: (grows 50% / year) CPU DRAM 12
Optimizing Computation and
Optimizing Computation and
Memory Use
Memory Use
♦
Computational optimizations
ØTheoretical peak:(# fpus)*(flops/cycle)*cycle time
ØPentium 4: (1 fpu)*(2 flops/cycle)*(2.8 Ghz) = 5600 MFLOP/s
♦
Operations like:
Ø y = α x + y : 3 operands (24 Bytes) needed for 2 flops;
5600 Mflop/s requires 8400 MWord/s bandwidth from memory
♦
Memory optimization
ØTheoretical peak: (bus width) * (bus speed)
ØPentium 4: (32 bits)*(533 Mhz) = 2132 MB/s = 266 MWord/s
Off by a factor of 30 from what’s required to drive the processor from memory to peak performance
13
Memory Hierarchy
Memory Hierarchy
♦ By taking advantage of the principle of locality: ØPresent the user with as much memory as is available in
the cheapest technology.
ØProvide access at the speed offered by the fastest
technology. Control Datapath Secondary Storage (Disk) Processor Registers Main Memory (DRAM) Level 2 and 3 Cache (SRAM) On -Chip Cache 1s 10,000,000s (10s ms) 100,000 s (.1s ms) Speed (ns): 10s 100s 100s Gs Size (bytes): Ks Ms Tertiary Storage (Disk/Tape) 10,000,000,000s (10s sec) 10,000,000 s (10s ms) Ts Distributed Memory Remote Cluster Memory 14
Tool To Help Understand What’s Going
Tool To Help Understand What’s Going
On In the Processor
On In the Processor
♦
Complex system with many filters
♦
Need to identify bottlenecks
♦
Prioritize optimization
15
Tools for Performance Analysis,
Tools for Performance Analysis,
Modeling and Optimization
Modeling and Optimization
♦
PAPI
♦Dynist
♦SVPablo
ROSE: Compiler Framework Recognition of high-level abstractions
Specification of Transformations
16
Example PERC on Climate Model
Example PERC on Climate Model
♦ Interaction with the SciDAC Climate development effort ♦ Profiling ØIdentifying performance bottleneck and prioritizing enhancements ♦ Evaluation of code over time 9 month period ♦ Produced 400% improvement via decreased overhead and increased scalability
17
Signatures: Key Factors in Applications
Signatures: Key Factors in Applications
and System that Affect Performance
and System that Affect Performance
♦
Application Signatures
Ø Characterization of operations needed to be performed by application
Ø Description of application demands on resources
Ø
Ø AlgorithmAlgorithm Signatures
ØOpts counts
ØMemory ref patterns
ØData dependencies
ØI/O characteristics
Ø
Ø SoftwareSoftware Signatures
ØSync points
ØThread level parallelism
ØInst level parallelism
ØRatio of mem ref to flpt ops
Ø Predict application behavior and performance
♦ Hardware Signatures Ø Performance capabilities of Machine
Ø Latencies and bandwidth of memory hierarchy
Ø Local to node & to remote node
Ø Instruction issue rates
Ø Cache size
Ø TLB size
§Execution signature
§combine application and machine signatures to provide accurate performance models
.
Parallel or Distributed Application
Performance Monitor Observation and Model Signature Comparison Generation of Observation Signature Feedback Feedback (Degree of Similarity) (Degree of Similarity)
Live Performance Data Live Performance Data
Application Signature
Model Signature
18
Algorithms
Algorithms
vs
vs
Applications
Applications
X X
Multigrid Schemes
X X
Stiff ODE Solvers
Circuit Simulation Electronic Device Simulation Structural Mechanics Inverse Problems Adjustment of Geodetic Networks Comp Fluid Dynamics Weather Simulation Quantum Chemistry Lattice Gauge (QCD) X X X X X X Sparse Linear System Solvers X X Linear Least Squares X X X X Nonlinear Algebraic System Solvers X X X Sparse Eigenvalue Problems X X X FFT X X X Rapid Elliptic Problem Solvers X X Integral Transformations Monte Carlo Schemes X
From: Supercomputing Tradeoffs and the Cedar System, E. Davidson, D. Kuck, D. Lawrie, and A. Sameh, in High -Speed Computing, Scientific Applications and Algorithm Design, Ed R. Wilhelmson, U of I Press, 1986.
19
Update to
Update to
Sameh’s
Sameh’s
Table?
Table?
♦
Next step by looking
at:
ØApplication Signatures ØAlgorithms choices ØSoftware profile
ØArchitecture (Machine)
♦
Data mine to extract
information
♦
Need signatures for
A
3S
Application Performance Matrixhttp://www.krellinst.org/matrix/
20
Performance Tuning
Performance Tuning
♦
Motivation: performance of many applications
dominated by a few kernels
♦
Conventional approach: handtuning by user or
vendor
ØVery time consuming and tedious work
ØEven with intimate knowledge of architecture and compiler, performance hard to predict
ØGrowing list of kernels to tune
ØMust be redone for every architecture, compiler ØCompiler technology often lags architecture ØNot just a compiler problem:
ØBest algorithm may depend on input, so some tuning at run -time.
21
Automatic Performance Tuning to
Automatic Performance Tuning to
Hide Complexity
Hide Complexity
♦ Approach: for each kernel
1.Identify and generate a space of algorithms
2.Search for the fastest one, by running them
♦ What is a space of algorithms? Ø Depending on kernel and input, may vary
Øinstruction mix and order
Ømemory access patterns
Ødata structures
Ømathematical formulation
♦ When do we search?
Ø Once per kernel and architecture
Ø At compile time
Ø At run time
Ø All of the above
22
Some Automatic Tuning Projects
Some Automatic Tuning Projects
♦ATLAS (www.netlib.org/atlas) (Dongarra, Whaley) used in Matlab and many SciDAC and ASCI projects
♦PHIPAC (www.icsi.berkeley.edu/~bilmes/phipac) (Bilmes,Asanovic,Vuduc,Demmel)
♦Sparsity (www.cs.berkeley.edu/~yelick/sparsity) (Yelick, Im)
♦Self Adapting Linear Algebra Software (SALAS) (Dongarra, Eijkhout, Gropp, Keyes)
♦FFTs and Signal Processing
ØFFTW (www.fftw.org)
Ø Won 1999 Wilkinson Prize for Numerical Software
ØSPIRAL (www.ece.cmu.edu/~spiral) Ø Extensions to other Ø transforms, DSPs ØUHFFT Ø Extensions to higher Ø dimension, parallelism 0.0 500.0 1000.0 1500.0 2000.0 2500.0 3000.0 3500.0 AMD Athlon-600
DEC ev56-533 DEC ev6-500
HP9000/735/135 IBM PPC604-112 IBM Power2-160 IBM Power3-200 Intel P-III 933 MHz
Intel P-4 2.53 GHz w/SSE2
SGI R10000ip28-200SGI R12000ip30-270Sun UltraSparc2-200 Architectures
MFLOP/S
Vendor BLAS ATLAS BLAS F77 BLAS
23
Futures for High Performance Scientific
Futures for High Performance Scientific
Computing
Computing
♦
Numerical software will be adaptive,
exploratory, and intelligent
♦
Determinism in numerical computing will
be gone.
Ø After all, its not reasonable to ask for exactness in numerical computations.
ØReproducibility at a cost
♦
Importance of floating point arithmetic
will be undiminished.
Ø16, 32, 64, 128 bits and beyond.
♦
Reproducibility, fault tolerance, and
auditability
♦
Adaptivity
is a key so applications can
effectively use the resources.
24
Citation in the Press, March 10
Citation in the Press, March 10
thth, 2008
, 2008
DOE Supercomputers
Live up to Expectation
WASHINGTON, Mar. 10, 2008
GAO reported today that after almost 5 years of effort and several hundreds of M$’s spent at DOE labs, the high performance computers recently purchased have exceeded users’ expectation and are helping to solve some of
our most challenging problems. Alan Laub head of
DOE’s HPC efforts reported today at the annual meeting of the SciDAC PI
How can this happen?
♦ Close interactions of with the
applications and the CS and Math ISIC groups
♦ Dramatic improvements in
adaptability of software to the execution environment
♦ Improved processor-memory
bandwidth
♦ New large-scale system
architectures and software
Ø Aggressive fault management and reliability
♦ Exploration of some alternative
architectures and languages
Ø Application teams to help drive the design of new
25
With Apologies to Gary Larson…
With Apologies to Gary Larson…
♦ SciDAC is helping
♦ Teams are developing the scientific computing software and hardware infrastructure needed to use terascale computers and beyond.