The HPC Challenge Benchmark:
A Candidate for Replacing
LINPACK in the TOP500?
Jack Dongarra
University of Tennessee and
Oak Ridge National Laboratory
2007 SPEC Benchmark Workshop January 21, 2007
Outline
♦ Look at LINPACK
♦ Brief discussion of DARPA HPCS Program
♦ HPC Challenge Benchmark
What Is LINPACK?
♦ Most people think LINPACK is a benchmark.
♦ LINPACK is a package of mathematical software for solving problems in linear algebra, mainly dense systems of linear equations.
♦ The project had its origins in 1974.
♦ LINPACK: “LINear algebra PACKage”
Computing in 1974
♦ High Performance Computers:
¾ IBM 370/195, CDC 7600, Univac 1110, DEC PDP-10, Honeywell 6030
♦ Fortran 66
♦ Run efficiently
♦ BLAS (Level 1)
¾ Vector operations
♦ Trying to achieve software portability
♦ LINPACK package was released in 1979
The Accidental Benchmarker
♦ Appendix B of the Linpack Users’ Guide
¾ Designed to help users extrapolate execution time for Linpack software package
♦ First benchmark report from 1977
¾ Cray 1 to DEC PDP-10
Dense matrices; linear systems; least squares problems; singular values
LINPACK Benchmark?
♦ The LINPACK Benchmark is a measure of a computer’s floating-point rate of execution for solving Ax=b.
¾ It is determined by running a computer program that solves a dense system of linear equations.
♦ Information is collected and available in the LINPACK Benchmark Report.
♦ Over the years the characteristics of the benchmark have changed a bit.
¾ In fact, there are three benchmarks included in the Linpack Benchmark report.
♦ LINPACK Benchmark since 1977
¾ Dense linear system solve with LU factorization using partial pivoting
¾ Operation count is: $\frac{2}{3} n^3 + O(n^2)$
¾ Benchmark Measure: MFlop/s
¾ Original benchmark measures the execution rate for a Fortran
For Linpack with n = 100
♦ Not allowed to touch the code.
♦ Only set the optimization in the compiler and run.
♦ Provide historical look at computing
♦ Table 1 of the report (52 pages of 95 page report)
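The n = 100 measurement described above can be illustrated with a minimal sketch in plain Python. It solves a dense system by Gaussian elimination with partial pivoting and converts the elapsed time to MFlop/s. The function names, the random test matrix, and the choice of 2n^2 for the lower-order term of the operation count are assumptions made for this illustration; this is not the Fortran benchmark code.

```python
import random
import time


def lu_solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting.

    Works on copies of the inputs. A zero pivot raises ZeroDivisionError.
    """
    a = [row[:] for row in a]
    b = b[:]
    n = len(a)
    for k in range(n - 1):
        # Partial pivoting: bring the largest entry of column k to row k.
        p = max(range(k, n), key=lambda i: abs(a[i][k]))
        if p != k:
            a[k], a[p] = a[p], a[k]
            b[k], b[p] = b[p], b[k]
        pivot_row = a[k]
        for i in range(k + 1, n):
            row = a[i]
            m = row[k] / pivot_row[k]
            for j in range(k + 1, n):
                row[j] -= m * pivot_row[j]
            b[i] -= m * b[k]
    # Back substitution on the upper triangle.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = b[i]
        for j in range(i + 1, n):
            s -= a[i][j] * x[j]
        x[i] = s / a[i][i]
    return x


def linpack_flops(n):
    """Nominal operation count; the 2 n^2 term is an assumed lower-order term."""
    return 2.0 * n ** 3 / 3.0 + 2.0 * n ** 2


def run_benchmark(n=100, seed=1):
    """Return (MFlop/s, max error) for one random n x n system."""
    rng = random.Random(seed)
    a = [[rng.uniform(-1.0, 1.0) for _ in range(n)] for _ in range(n)]
    b = [sum(row) for row in a]  # row sums, so the exact solution is all ones
    start = time.perf_counter()
    x = lu_solve(a, b)
    elapsed = time.perf_counter() - start
    mflops = linpack_flops(n) / elapsed / 1.0e6
    max_error = max(abs(xi - 1.0) for xi in x)
    return mflops, max_error


if __name__ == "__main__":
    rate, err = run_benchmark(100)
    print(f"n=100: {rate:.2f} MFlop/s, max error {err:.2e}")
```

The rate is computed from the nominal operation count rather than from counted operations, which is why an algorithm that changes the count would distort the reported rate.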
Linpack Benchmark Over Time
♦ In the beginning there was only the Linpack 100 Benchmark (1977)
¾ n=100 (80KB); size that would fit in all the machines
¾ Fortran; 64 bit floating point arithmetic
¾ No hand optimization (only compiler options); source code available
♦ Linpack 1000 (1986)
¾ n=1000 (8MB); wanted to see higher performance levels
¾ Any language; 64 bit floating point arithmetic
¾ Hand optimization OK
♦ Linpack Table 3 (Highly Parallel Computing - 1991) (Top500; 1993)
¾ Any size (n as large as you can; n=10^6; 8TB; ~6 hours)
¾ Any language; 64 bit floating point arithmetic
¾ Hand optimization OK
¾ Strassen’s method not allowed (confuses the operation count and rate)
¾ Reference implementation available
♦ In all cases results are verified by looking at: $\frac{\|Ax - b\|}{\|A\| \, \|x\| \, n \, \varepsilon} = O(1)$
♦ Operations count for factorization: $\frac{2}{3} n^3 - \frac{1}{2} n^2$; solve: $2 n^2$
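The verification test and the quoted problem sizes can be checked with a short sketch. The slide does not say which norm is used, so the infinity norm here is an assumption. Taking ε as the double-precision machine epsilon is also an assumption, and the function names are hypothetical.

```python
import sys


def matrix_inf_norm(a):
    """Maximum absolute row sum."""
    return max(sum(abs(v) for v in row) for row in a)


def vector_inf_norm(v):
    """Largest absolute entry."""
    return max(abs(t) for t in v)


def scaled_residual(a, x, b):
    """||Ax - b|| / (||A|| ||x|| n eps), using infinity norms (an assumption)."""
    n = len(a)
    r = [sum(a[i][j] * x[j] for j in range(n)) - b[i] for i in range(n)]
    eps = sys.float_info.epsilon
    return vector_inf_norm(r) / (
        matrix_inf_norm(a) * vector_inf_norm(x) * n * eps
    )


def matrix_bytes(n):
    """Storage for an n x n matrix of 64-bit floating point values."""
    return 8 * n * n


if __name__ == "__main__":
    a = [[4.0, 1.0], [2.0, 3.0]]
    b = [5.0, 5.0]
    print(scaled_residual(a, [1.0, 1.0], b))    # exact solution
    print(scaled_residual(a, [1.0, 1.001], b))  # perturbed solution
    for n in (100, 1000, 10 ** 6):
        print(n, matrix_bytes(n), "bytes")
```

A correct solve gives a value of order one. A wrong answer gives a value many orders of magnitude larger, because the residual is divided by n times the machine epsilon. The storage figures reproduce the 80KB, 8MB and 8TB sizes quoted for n = 100, 1000 and 10^6.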
Motivation for Additional Benchmarks
♦ From Linpack Benchmark and Top500: “no single number can reflect overall performance”
♦ Clearly need something more than Linpack
♦ HPC Challenge Benchmark
¾ Test suite stresses not only the processors, but the memory system and the interconnect.
¾ The real utility of the HPCC benchmarks is that architectures can be described with a wider range of metrics than just Flop/s from Linpack.
Linpack Benchmark
♦ Good
¾ One number
¾ Simple to define & easy to rank
¾ Allows problem size to change with machine and over time
¾ Stresses the system with a run of a few hours
♦ Bad
¾ Emphasizes only “peak” CPU speed and number of CPUs
¾ Does not stress local bandwidth
¾ Does not stress the network
¾ Does not test gather/scatter
¾ Ignores Amdahl’s Law (Only does weak scaling)
♦ Ugly
¾ MachoFlops
At The Time The Linpack Benchmark Was Created …
♦ If we think about computing in the late 70’s, perhaps the LINPACK benchmark was a reasonable thing to use.
♦ Memory wall, not so much a wall but a step.
♦ In the 70’s, things were more in balance
¾ The memory kept pace with the CPU
¾ n cycles to execute an instruction, n cycles to bring in a word from memory
♦ Showed compiler optimization
Many Changes
♦ Many changes in our hardware over the past 30 years
¾ Superscalar, Vector, Distributed Memory, Shared Memory, Multicore, …
♦ While there have been some changes to the Linpack Benchmark, not all of them reflect the advances made in the hardware.
♦ Today’s memory hierarchy is much more complicated.
[Figure: Top500 Systems/Architectures, 1993-2006; number of systems (0 to 500) by architecture: Const., Cluster, MPP, SMP, SIMD, Single Proc.]
High Productivity Computing Systems
Goal:
Provide a generation of economically viable high productivity computing systems for the national security and industrial user community (2010; started in 2002)
Fill the Critical Technology and Capability Gap
Today (late 80's HPC Technology) ... to ... Future (Quantum/Bio Computing)
Applications:
Intelligence/surveillance, reconnaissance, cryptanalysis, weapons analysis, airborne contaminant modeling and biotechnology
[Figure: HPCS program focus areas: analysis & assessment, performance characterization & prediction, system architecture, software technology, hardware technology, programming models, industry R&D]
HPCS Program Focus Areas
Focus on:
¾ Real (not peak) performance of critical national security applications
  ¾ Intelligence/surveillance
  ¾ Reconnaissance
  ¾ Cryptanalysis
  ¾ Weapons analysis
  ¾ Airborne contaminant modeling
  ¾ Biotechnology
¾ Programmability: reduce cost and time of developing applications
HPCS Roadmap
Phase 1: $20M (2002), Concept Study
Phase 2: $170M (2003-2005), Advanced Design & Prototypes
Phase 3: ~$250M each (2006-2010), Full Scale Development
¾ 5 vendors in phase 1; 3 vendors in phase 2; 1+ vendors in phase 3
¾ MIT Lincoln Laboratory leading measurement and evaluation team
[Figure: roadmap timeline from Today to Petascale Systems; milestones: New Evaluation Framework, Test Evaluation Framework, Validated Procurement Evaluation Methodology, TBD]
Predicted Performance Levels for Top500
[Figure: Linpack performance in TFlop/s (log scale, 1 to 100,000) from Jun-03 to Jun-09 for the Top500 Total, #1, #10 and #500, with predicted curves for each; data labels shown: 6,267, 4,648, 3,447, 557, 405, 294, 59, 44, 33, 5.46, 2.86, 3.95]
A PetaFlop Computer by the End of the Decade
♦ At least 10 companies developing a Petaflop system in the next decade.
¾ Cray
¾ IBM
¾ Sun
¾ Dawning, Galactic, Lenovo (Chinese companies)
¾ Hitachi, NEC, Fujitsu (Japanese “Life Simulator”, 10 Pflop/s; Keisoku project, $1B, 7 years)
¾ Bull
♦ 2+ Pflop/s Linpack; 6.5 PB/s data streaming BW; 3.2 PB/s bisection BW; 64,000 GUPS
PetaFlop Computers in 2 Years!
♦ Oak Ridge National Lab
¾ Planned for 4th Quarter 2008 (1 Pflop/s peak)
¾ From Cray’s XT family
¾ Use quad core from AMD
¾ 23,936 Chips
¾ Each chip is a quad-core processor (95,744 processors)
¾ Each processor does 4 flops/cycle
¾ Clock rate of 2.8 GHz
¾ Hypercube connectivity
¾ Interconnect based on Cray XT technology
¾ 6MW, 136 cabinets
♦ Los Alamos National Lab
¾ Roadrunner (2.4 Pflop/s peak)
¾ Use IBM Cell and AMD processors
¾ 75,000 cores
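The Oak Ridge figures above can be multiplied out as a quick sanity check on the quoted peak. This is only arithmetic on the numbers from the slide; the function name is hypothetical.

```python
def peak_flops(chips, cores_per_chip, flops_per_cycle, clock_hz):
    """Theoretical peak: every core retires flops_per_cycle on every cycle."""
    return chips * cores_per_chip * flops_per_cycle * clock_hz


if __name__ == "__main__":
    chips = 23936
    cores = chips * 4                       # quad-core chips
    peak = peak_flops(chips, 4, 4, 2.8e9)   # 4 flops/cycle at 2.8 GHz
    print(f"{cores} cores, peak = {peak / 1e15:.2f} Pflop/s")
```

The product comes to about 1.07 Pflop/s from 95,744 cores, which matches the "1 Pflop/s peak" figure.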
HPC Challenge Goals
♦ To examine the performance of HPC architectures using kernels with more challenging memory access patterns than the Linpack Benchmark
¾ The Linpack benchmark works well on all architectures, even cache-based, distributed memory multiprocessors, due to
1. Extensive memory reuse
2. Scalable with respect to the amount of computation
3. Scalable with respect to the communication volume
4. Extensive optimization of the software
♦ To complement the Top500 list
♦ Stress CPU, memory system, interconnect
♦ Allow for optimizations
¾ Record effort needed for tuning
¾ Base run requires MPI and BLAS
Tests on Single Processor and System
● Local - only a single processor (core) is performing computations.
● Embarrassingly Parallel - each processor (core) in the entire system is performing computations but they do not communicate with each other explicitly.
● Global - all processors in the system are performing computations and they explicitly communicate with each other.
HPC Challenge Benchmark
Consists of basically 7 benchmarks;
¾ Think of it as a framework or harness for adding benchmarks of interest.
1. LINPACK (HPL) - MPI Global (Ax = b)
2. STREAM - Local; single CPU
   *STREAM - Embarrassingly parallel
3. PTRANS (A ← A + B^T) - MPI Global
4. RandomAccess - Local; single CPU
   *RandomAccess - Embarrassingly parallel
   RandomAccess - MPI Global
5. BW and Latency - MPI
6. FFT - Global, single CPU, and EP
7. Matrix Multiply - single CPU and EP
[Figure: diagram labels: proc i, proc k, random integer]
April 18, 2006 Oak Ridge National Lab, CSM/FT
HPCS Performance Targets
● HPCC was developed by HPCS to assist in testing new HEC systems
● Each benchmark focuses on a different part of the memory hierarchy
● HPCS performance targets attempt to
  Flatten the memory hierarchy
  Improve real application performance
  Make programming easier
[Figure: Memory Hierarchy: registers, cache(s), local memory, remote memory, disk, tape; transfer units between levels: instructions, operands, lines, blocks, messages, pages]
HPC Challenge Performance Targets
● LINPACK: linear system solve
Ax = b
● STREAM: vector operations
A = B + s * C
● FFT: 1D Fast Fourier Transform
Z = fft(X)
● RandomAccess: integer update
T[i] = XOR( T[i], rand)
HPCS targets annotated on the memory hierarchy figure (Max; Relative):
  LINPACK        2 Pflop/s      8x
  STREAM         6.5 Pbyte/s    40x
  FFT            0.5 Pflop/s    200x
  RandomAccess   64000 GUPS     2000x
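Two of the kernels listed above, the STREAM triad and the RandomAccess table update, can be sketched in a few lines to show how different their memory access patterns are. The function names, the table contents and the 64-bit shift-register generator are assumptions chosen for illustration. The benchmark's own generator, table size and update count are defined by the HPCC source and are not reproduced here.

```python
MASK64 = (1 << 64) - 1
POLY = 0x7  # feedback term of the illustrative shift-register generator


def stream_triad(b, c, s):
    """STREAM-style triad A = B + s * C: unit-stride, bandwidth-bound."""
    return [bi + s * ci for bi, ci in zip(b, c)]


def next_random(r):
    """Advance a 64-bit shift-register pseudo-random value."""
    high_bit = r >> 63
    r = (r << 1) & MASK64
    return r ^ POLY if high_bit else r


def random_access(table, updates, seed=1):
    """RandomAccess-style update T[i] = XOR(T[i], rand) at scattered indices.

    The table length must be a power of two. The table is updated in place.
    """
    size = len(table)
    if size & (size - 1):
        raise ValueError("table length must be a power of two")
    r = seed
    for _ in range(updates):
        r = next_random(r)
        table[r & (size - 1)] ^= r
    return table


def gups(updates, seconds):
    """Giga-updates per second."""
    return updates / seconds / 1.0e9


if __name__ == "__main__":
    print(stream_triad([1.0, 2.0], [3.0, 4.0], 2.0))
    table = random_access(list(range(16)), 64)
    print(table)
```

The triad walks memory in order, so it measures sustainable bandwidth. The table update touches an unpredictable location on every step, so it measures how well the memory system handles fine-grained scattered accesses. Because XOR is its own inverse, applying the same update stream twice restores the table, which gives a simple self-check.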
Computational Resources and HPC Challenge Benchmarks
Computational resources and the benchmarks that exercise them:
¾ CPU computational speed: HPL, Matrix Multiply
¾ Memory bandwidth: STREAM
¾ Node Interconnect bandwidth: Random & Natural Ring Bandwidth & Latency
How Does The Benchmarking Work?
♦ Single program to download and run
¾ Simple input file similar to HPL input
♦ Base Run and Optimization Run
¾ Base run must be made
¾ User supplies MPI and the BLAS
¾ Optimized run allowed to replace certain routines
¾ User specifies what was done
♦ Results uploaded via website (monitored)
♦ HTML table and Excel spreadsheet generated with performance results
¾ Intentionally we are not providing a single figure of merit (no overall ranking)
♦ Each run generates a record which contains 188 pieces of information from the benchmark run.
♦ Goal: no more than 2x the time to execute HPL.
http://icl.cs.utk.edu/hpcc/
HPCC Kiviat Chart
Different Computers are Better at Different Things, No
HPCC Awards Info and Rules
Class 1 (Objective)
♦ Performance
  1. G-HPL $500
  2. G-RandomAccess $500
  3. EP-STREAM system $500
  4. G-FFT $500
♦ Must be full submissions through the HPCC database
Class 2 (Subjective)
♦ Productivity (Elegant Implementation)
¾ Implement at least two tests from Class 1
¾ $1500 (may be split)
¾ Deadline: October 15, 2007
¾ Select 3 as finalists
♦ This award is weighted
¾ 50% on performance and
¾ 50% on code elegance, clarity, and size.
♦ Submissions format flexible
Winners (in both classes) will be announced at SC07 HPCC BOF
Class 2 Awards
♦ Subjective
♦ Productivity (Elegant Implementation)
¾ Implement at least two tests from Class 1
¾ $1500 (may be split)
¾ Deadline: October 15, 2007
¾ Select 5 as finalists
♦ Most "elegant" implementation with special emphasis being placed on:
¾ Global HPL, Global RandomAccess, EP STREAM (Triad) per system and Global FFT.
♦ This award is weighted
¾ 50% on performance and
5 Finalists for Class 2 – November 2005
♦ Cleve Moler, Mathworks
¾ Environment: Parallel Matlab Prototype
¾ System: 4 Processor Opteron
♦ Calin Cascaval, C. Barton, G. Almasi, Y. Zheng, M. Farreras, P. Luk, and R. Mak, IBM
¾ Environment: UPC
¾ System: Blue Gene/L
♦ Bradley Kuszmaul, MIT
¾ Environment: Cilk
¾ System: 4-processor 1.4 GHz AMD Opteron 840 with 16GiB of memory
♦ Nathan Wichman, Cray
¾ Environment: UPC
¾ System: Cray X1E (ORNL)
♦ Petr Konecny, Simon Kahan, and John Feo, Cray
¾ Environment: C + MTA pragmas
¾ System: Cray MTA2
2006 Competitors
♦ Some Notable Class 1 Competitors
¾ SGI (NASA): Columbia, 10,000 CPUs
¾ NEC (HLRS): SX-8, 512 CPUs
¾ IBM (DOE LLNL): BG/L, 131,072 CPUs; Purple, 10,240 CPUs
¾ Cray (DOE ORNL): X1, 1008 CPUs; Jaguar XT3, 5200 CPUs
¾ DELL (MIT LL): LLGrid, 300 CPUs
¾ Cray (DOD ERDC): Sapphire XT3, 4096 CPUs
¾ Cray (Sandia): Red Storm XT3, 11,648 CPUs
♦ Class 2: 6 Finalists
¾ Calin Cascaval (IBM) UPC on Blue Gene/L [Current Language]
¾ Bradley Kuszmaul (MIT CSAIL) Cilk on SGI Altix [Current Language]
¾ Cleve Moler (Mathworks) Parallel Matlab on a cluster [Current Language]
¾ Brad Chamberlain (Cray) Chapel [Research Language]
¾ Vivek Sarkar (IBM) X10 [Research Language]
¾ Vadim Gurev (St. Petersburg, Russia) MCSharp [Student Submission]
The Following are the Winners of the 2006 HPC Challenge Class 1 Awards
The Following are the Winners of the 2006 HPC Challenge Class 2 Awards
2006 Programmability
Speedup vs Relative Code Size
[Figure: log-log scatter plot of Speedup against Relative Code Size; labeled regions: “All too often” (Java, Matlab, Python, etc.), Traditional HPC, Ref]
♦ Class 2 Award
¾ 50% Performance
¾ 50% Elegance
♦ 21 Codes submitted by 6 teams
♦ Speedup relative to serial C on workstation
♦ Code size relative to serial C
2006 Programming Results Summary
♦ 21 of 21 smaller than C+MPI Ref; 20 smaller than serial
♦ 15 of 21 faster than serial; 19 in HPCS quadrant
Top500 and HPC Challenge Rankings
♦ It should be clear that the HPL (Linpack Benchmark - Top500) is a relatively poor predictor of overall machine performance.
♦ For a given set of applications such as:
¾ Calculations on unstructured grids
¾ Effects of strong shock waves
¾ Ab-initio quantum chemistry
¾ Ocean general circulation model
¾ CFD calculations w/multi-resolution grids
¾ Weather forecasting
♦ There should be a different mix of components used to help predict the system performance.
Will the Top500 List Go Away?
♦ The Top500 continues to serve a valuable role in high performance computing.
¾ Historical basis
¾ Presents statistics on deployment
¾ Projection on where things are going
¾ Impartial view
¾ It’s simple to understand
¾ It’s fun
No Single Number for HPCC?
♦ Of course everyone wants a single number.
♦ With HPCC Benchmark you get 188 numbers per system run!
♦ Many have suggested weighting the seven tests in HPCC to come up with a single number.
¾ LINPACK, MatMul, FFT, Stream, RandomAccess, Ptranspose, bandwidth & latency
♦ But your application is different than mine, so weights are dependent on the application.
♦ Score = W1*LINPACK + W2*MM + W3*FFT+ W4*Stream + W5*RA + W6*Ptrans + W7*BW/Lat
♦ Problem is that the weights depend on your job mix.
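The dependence of the ranking on the weights can be made concrete with a small sketch of the Score formula above. Both systems and both weight sets are invented for illustration and are not measured results. Every component is assumed to be normalized so that a larger value is better, including the bandwidth and latency entry.

```python
TESTS = ("LINPACK", "MM", "FFT", "Stream", "RA", "Ptrans", "BW/Lat")

# Hypothetical normalized results (1.0 = best observed); not real data.
SYSTEMS = {
    "A": {"LINPACK": 1.0, "MM": 1.0, "FFT": 0.5, "Stream": 0.4,
          "RA": 0.2, "Ptrans": 0.5, "BW/Lat": 0.5},
    "B": {"LINPACK": 0.6, "MM": 0.6, "FFT": 0.6, "Stream": 1.0,
          "RA": 1.0, "Ptrans": 0.6, "BW/Lat": 0.7},
}

# Hypothetical job mixes: one dominated by dense linear algebra,
# one dominated by streaming and irregular memory access.
DENSE_MIX = {"LINPACK": 0.5, "MM": 0.3, "FFT": 0.05, "Stream": 0.05,
             "RA": 0.0, "Ptrans": 0.05, "BW/Lat": 0.05}
IRREGULAR_MIX = {"LINPACK": 0.05, "MM": 0.05, "FFT": 0.1, "Stream": 0.3,
                 "RA": 0.3, "Ptrans": 0.1, "BW/Lat": 0.1}


def score(results, weights):
    """Score = W1*LINPACK + W2*MM + ... + W7*BW/Lat."""
    return sum(weights[t] * results[t] for t in TESTS)


def rank(systems, weights):
    """System names ordered from highest to lowest score."""
    return sorted(systems, key=lambda name: score(systems[name], weights),
                  reverse=True)


if __name__ == "__main__":
    print("dense mix:    ", rank(SYSTEMS, DENSE_MIX))
    print("irregular mix:", rank(SYSTEMS, IRREGULAR_MIX))
```

With these made-up numbers the two job mixes rank the systems in opposite order, so any single published score would carry its weights as a hidden assumption.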
Tools Needed to Help With Performance
♦ A tool that analyzes an application, perhaps statically and/or dynamically.
♦ Output a set of weights for various sections of the application
¾ [ W1, W2, W3, W4, W5, W6, W7, W8 ]
¾ The tool would also point to places where we were missing a benchmarking component for the mapping.
♦ Think of the benchmark components as a basis set for scientific applications
♦ A specific application has a set of "coefficients" of the basis set.
♦ Score = W1*HPL + W2*MM + W3*FFT+ W4*Stream +
Future Directions
♦ Looking at reducing execution time
♦ Constructing a framework for benchmarks
♦ Developing machine signatures
♦ Plans are to expand the benchmark collection
¾ Sparse matrix operations
¾ I/O
¾ Smith-Waterman (sequence alignment)
♦ Port to new systems
♦ Provide more implementations
¾ Languages (Fortran, UPC, Co-Array)
¾ Environments
Collaborators
• HPC Challenge
– Piotr Łuszczek, U of Tennessee
– David Bailey, NERSC/LBL
– Jeremy Kepner, MIT Lincoln Lab
– David Koester, MITRE
– Bob Lucas, ISI/USC
– Rusty Lusk, ANL
– John McCalpin, IBM, Austin
– Rolf Rabenseifner, HLRS Stuttgart
– Daisuke Takahashi, Tsukuba, Japan
http://icl.cs.utk.edu/hpcc/
• Top500
– Hans Meuer, Prometeus
– Erich Strohmaier, LBNL/NERSC