• No results found

The HPC Challenge Benchmark: A Candidate for Replacing LINPACK in the TOP500?

N/A
N/A
Protected

Academic year: 2021

Share "The HPC Challenge Benchmark: A Candidate for Replacing LINPACK in the TOP500?"

Copied!
47
0
0

Loading.... (view fulltext now)

Full text

(1)

1

The HPC Challenge Benchmark:

The HPC Challenge Benchmark:

A Candidate for Replacing

A Candidate for Replacing

LINPACK in the TOP500?

LINPACK in the TOP500?

Jack Dongarra

University of Tennessee and

Oak Ridge National Laboratory

2007 SPEC Benchmark Workshop January 21, 2007

(2)

2

Outline

Outline

-

-

The HPC Challenge Benchmark:

The HPC Challenge Benchmark:

A Candidate for Replacing

A Candidate for Replacing

Linpack

Linpack

in the TOP500?

in the TOP500?

Look at LINPACK

Brief discussion of DARPA HPCS Program

HPC Challenge Benchmark

(3)

3

What Is LINPACK?

What Is LINPACK?

Most people think LINPACK is a benchmark.LINPACK is a package of mathematical

software for solving problems in linear algebra, mainly dense linear systems of linear equations.The project had its origins in 1974

LINPACK: “LINear algebra PACKage”

(4)

4

Computing in 1974

Computing in 1974

High Performance Computers:

¾ IBM 370/195, CDC 7600, Univac 1110, DEC

PDP-10, Honeywell 6030

Fortran 66

Run efficientlyBLAS (Level 1)

¾ Vector operations

Trying to achieve software portability

LINPACK package was released in 1979

(5)

5

The Accidental

The Accidental

Benchmarker

Benchmarker

Appendix B of the Linpack Users’ Guide

¾ Designed to help users extrapolate execution

time for Linpack software package

First benchmark report from 1977;

¾ Cray 1 to DEC PDP-10

Dense matrices Linear systems

Least squares problems Singular values

(6)

6

LINPACK Benchmark?

LINPACK Benchmark?

The LINPACK Benchmark is a measure of a computer’s floating-point rate of execution for solving Ax=b.

¾ It is determined by running a computer program that solves a

dense system of linear equations.

Information is collected and available in the LINPACK Benchmark Report.

Over the years the characteristics of the benchmark has changed a bit.

¾ In fact, there are three benchmarks included in the Linpack

Benchmark report.

LINPACK Benchmark since 1977

¾ Dense linear system solve with LU factorization using partial

pivoting

¾ Operation count is: 2/3 n3 + O(n2)

¾ Benchmark Measure: MFlop/s

¾ Original benchmark measures the execution rate for a Fortran

(7)

7

For

For

Linpack

Linpack

with n = 100

with n = 100

Not allowed to touch the code.

Only set the optimization in the compiler and run.Provide historical look at computing

Table 1 of the report (52 pages of 95 page report)

(8)

8

Linpack Benchmark Over Time

Linpack Benchmark Over Time

In the beginning there was only the Linpack 100 Benchmark (1977)

¾ n=100 (80KB); size that would fit in all the machines ¾ Fortran; 64 bit floating point arithmetic

¾ No hand optimization (only compiler options); source code availableLinpack 1000 (1986)

¾ n=1000 (8MB); wanted to see higher performance levels ¾ Any language; 64 bit floating point arithmetic

¾ Hand optimization OK

Linpack Table 3 (Highly Parallel Computing - 1991) (Top500; 1993)

¾ Any size (n as large as you can; n=106; 8TB; ~6 hours);

¾ Any language; 64 bit floating point arithmetic ¾ Hand optimization OK

¾ Strassen’s method not allowed (confuses the operation count and rate)

¾ Reference implementation available

In all cases results are verified by looking at:

Operations count for factorization ; solve

|| || (1) || || || || Ax b O A x nε − = 3 2 2 1 3 n − 2 n 2 2n

(9)

9

Motivation for Additional Benchmarks

Motivation for Additional Benchmarks

From Linpack Benchmark and

Top500: “no single number can reflect overall performance”

Clearly need something more

than Linpack

HPC Challenge Benchmark

¾ Test suite stresses not only

the processors, but the memory system and the interconnect.

¾ The real utility of the HPCC

benchmarks are that

architectures can be described with a wider range of metrics than just Flop/s from Linpack.

Linpack Benchmark

Good

¾ One number

¾ Simple to define & easy to rank ¾ Allows problem size to change

with machine and over time

¾ Stresses the system with a run

of a few hours

Bad

¾ Emphasizes only “peak” CPU

speed and number of CPUs

¾ Does not stress local bandwidth ¾ Does not stress the network ¾ Does not test gather/scatter ¾ Ignores Amdahl’s Law (Only

does weak scaling)

Ugly

¾ MachoFlops

(10)

10

At The Time The

At The Time The

Linpack

Linpack

Benchmark Was

Benchmark Was

Created

Created

If we think about computing in late 70’s

Perhaps the LINPACK benchmark was a

reasonable thing to use.

Memory wall, not so much a wall but a step.In the 70’s, things were more in balance

¾ The memory kept pace with the CPU

¾ n cycles to execute an instruction, n

cycles to bring in a word from memoryShowed compiler optimization

(11)

11

Many Changes

Many Changes

Many changes in our hardware over the past 30 years

¾ Superscalar, Vector,

Distributed Memory, Shared Memory, Multicore, …

While there has been some changes to the

Linpack Benchmark not all of them reflect the advances made in

the hardware.

Today’s memory hierarchy is much more complicated. 0 100 200 300 400 500 19 93 19 94 19 95 19 96 19 97 19 98 19 99 20 00 20 01 20 02 20 03 20 04 20 05 20 06 Const. Cluster MPP SMP SIMD Single Proc. Top500 Systems/Architectures

(12)

High Productivity Computing Systems

High Productivity Computing Systems

Goal:

Provide a generation of economically viable high productivity computing systems for the national security and industrial user community (2010; started in 2002)

Goal:

Provide a generation of economically viable high productivity computing systems for the national security and industrial user community (2010; started in 2002)

Fill the Critical Technology and Capability Gap

Today (late 80's HPC Technology) ... to ... Future (Quantum/Bio Computing)

Fill the Critical Technology and Capability Gap

Today (late 80's HPC Technology) ... to ... Future (Quantum/Bio Computing)

Applications:

Intelligence/surveillance, reconnaissance, cryptanalysis, weapons analysis, airborne contaminant modeling and biotechnology

Analysis & Analysis & Assessment Assessment Performance Characterization & Prediction System Architecture Software Technology Hardware Technology Programming Models Industry R&D Industry R&D

HPCS Program Focus Areas

Focus on:

¾Real (not peak) performance of critical national

security applications

¾ Intelligence/surveillance ¾ Reconnaissance

¾ Cryptanalysis ¾ Weapons analysis

¾ Airborne contaminant modeling ¾ Biotechnology

¾Programmability: reduce cost and time of

developing applications

(13)

13 team

HPCS Roadmap

HPCS Roadmap

Phase 1 $20M (2002) Phase 2 $170M (2003-2005) Phase 3 (2006-2010) ~$250M each Concept Study Advanced Design & Prototypes Full Scale Development TBD New Evaluation Framework Test Evaluation Framework team

¾ 5 vendors in phase 1; 3 vendors in phase 2; 1+ vendors in phase 3

¾ MIT Lincoln Laboratory leading measurement and evaluation team

Validated Procurement Evaluation Methodology

Today

Petascale Systems

(14)

14

Predicted Performance Levels for Top500

Predicted Performance Levels for Top500

6,267 4,648 3,447 557 405 294 59 44 33 5.46 2.86 3.95 1 10 100 1,000 10,000 100,000 Jun-0 3 Dec-0 3 Jun-04Dec-0 4 Ju n-05 Dec-0 5 Ju n-06 De c-06 Ju n-07 De c-07 Ju n-08 Dec-0 8 Ju n-09 TFl o p/ s Li npa c k Total #1 #10 #500 Total Pred. #1 Pred. #10 Pred. #500

(15)

15

A

A

PetaFlop

PetaFlop

Computer by the End of the

Computer by the End of the

Decade

Decade

At least 10 Companies developing a

Petaflop system in the next decade.

¾Cray ¾IBM ¾Sun ¾Dawning ¾Galactic ¾Lenovo ¾Hitachi ¾NEC ¾Fujitsu ¾Bull Japanese Japanese “

“Life SimulatorLife Simulator”” (10 (10 Pflop/sPflop/s)) Keisoku

Keisoku project $1B 7 yearsproject $1B 7 years

}

Chinese Chinese Companies Companies

}

}

2+ Pflop/s Linpack 6.5 PB/s data streaming BW 3.2 PB/s Bisection BW 64,000 GUPS
(16)

16

PetaFlop

PetaFlop

Computers in 2 Years!

Computers in 2 Years!

Oak Ridge National Lab

¾ Planned for 4th Quarter 2008 (1 Pflop/s peak)

¾ From Cray’s XT family ¾ Use quad core from AMD

¾ 23,936 Chips

¾ Each chip is a quad core-processor (95,744 processors) ¾ Each processor does 4 flops/cycle

¾ Cycle time of 2.8 GHz

¾ Hypercube connectivity

¾ Interconnect based on Cray XT technology ¾ 6MW, 136 cabinets

Los Alamos National Lab

¾ Roadrunner (2.4 Pflop/s peak)

¾ Use IBM Cell and AMD processors ¾ 75,000 cores

(17)

17

HPC Challenge Goals

HPC Challenge Goals

To examine the performance of HPC architectures using kernels with more

challenging

memory access patterns than the Linpack Benchmark

¾ The Linpack benchmark works well on all architectures ―

even cache-based, distributed memory multiprocessors due to

1. Extensive memory reuse

2. Scalable with respect to the amount of computation 3. Scalable with respect to the communication volume 4. Extensive optimization of the software

To

complement

the Top500 list

Stress CPU, memory system, interconnectAllow for optimizations

¾ Record effort needed for tuning ¾ Base run requires MPI and BLAS

(18)

Tests on Single Processor and System

Local - only a single processor (core) is performing computations.

Embarrassingly Parallel - each processor (core) in the entire system is performing computations but they do no communicate with each other explicitly.

Global - all processors in the system are performing computations and they explicitly communicate with each other.

(19)

19

HPC Challenge Benchmark

HPC Challenge Benchmark

Consists of basically 7 benchmarks;

¾ Think of it as a framework or harness for adding benchmarks of interest.

1. LINPACK (HPL) ― MPI Global (Ax = b) 2. STREAM ― Local; single CPU

*STREAM ― Embarrassingly parallel 3. PTRANS (A A + BT) MPI Global

4. RandomAccess Local; single CPU

*RandomAccess Embarrassingly parallel

RandomAccess MPI Global 5. BW and Latency – MPI

6. FFT - Global, single CPU, and EP 7. Matrix Multiply – single CPU and EP

proci prock

Random integer

(20)

April 18, 2006 Oak Ridge National Lab, CSM/FT 20

HPCS Performance Targets

HPCS Performance Targets

HPCC was developed by HPCS to assist in testing new HEC systemsEach benchmark focuses on a different part of the memory hierarchyHPCS performance targets attempt to

Flatten the memory hierarchy

Improve real application performance Make programming easier

HPCC was developed by HPCS to assist in testing new HEC systems

Each benchmark focuses on a different part of the memory hierarchy

HPCS performance targets attempt to

Flatten the memory hierarchy

Improve real application performance Make programming easier

Cache(s) Cache(s) Local Memory Local Memory Registers Registers Remote Memory Remote Memory Disk Disk Tape Tape Instructions Memory Hierarchy Memory Hierarchy Operands Lines Blocks Messages Pages

(21)

April 18, 2006 Oak Ridge National Lab, CSM/FT 21

HPCS Performance Targets

HPCS Performance Targets

HPCC was developed by HPCS to assist in testing new HEC systemsEach benchmark focuses on a different part of the memory hierarchyHPCS performance targets attempt to

Flatten the memory hierarchy

Improve real application performance Make programming easier

HPCC was developed by HPCS to assist in testing new HEC systems

Each benchmark focuses on a different part of the memory hierarchy

HPCS performance targets attempt to

Flatten the memory hierarchy

Improve real application performance Make programming easier

● LINPACK: linear system solve

Ax = b Cache(s) Cache(s) Local Memory Local Memory Registers Registers Remote Memory Remote Memory Disk Disk Tape Tape Instructions Memory Hierarchy Memory Hierarchy Operands Lines Blocks Messages Pages

(22)

April 18, 2006 Oak Ridge National Lab, CSM/FT 22

HPCS Performance Targets

HPCS Performance Targets

HPCC was developed by HPCS to assist in testing new HEC systemsEach benchmark focuses on a different part of the memory hierarchyHPCS performance targets attempt to

Flatten the memory hierarchy

Improve real application performance Make programming easier

HPCC was developed by HPCS to assist in testing new HEC systems

Each benchmark focuses on a different part of the memory hierarchy

HPCS performance targets attempt to

Flatten the memory hierarchy

Improve real application performance Make programming easier

HPC Challenge

HPC Challenge

● LINPACK: linear system solve

Ax = b

● STREAM: vector operations

A = B + s * C

● FFT: 1D Fast Fourier Transform

Z = fft(X)

● RandomAccess: integer update

T[i] = XOR( T[i], rand)

Cache(s) Cache(s) Local Memory Local Memory Registers Registers Remote Memory Remote Memory Disk Disk Tape Tape Instructions Memory Hierarchy Memory Hierarchy Operands Lines Blocks Messages Pages

(23)

April 18, 2006 Oak Ridge National Lab, CSM/FT 23

HPCS Performance Targets

HPCS Performance Targets

HPCC was developed by HPCS to assist in testing new HEC systemsEach benchmark focuses on a different part of the memory hierarchyHPCS performance targets attempt to

Flatten the memory hierarchy

Improve real application performance Make programming easier

HPCC was developed by HPCS to assist in testing new HEC systems

Each benchmark focuses on a different part of the memory hierarchy

HPCS performance targets attempt to

Flatten the memory hierarchy

Improve real application performance Make programming easier

HPC Challenge

HPC Challenge Performance TargetsPerformance Targets

● LINPACK: linear system solve

Ax = b

● STREAM: vector operations

A = B + s * C

● FFT: 1D Fast Fourier Transform

Z = fft(X)

● RandomAccess: integer update

T[i] = XOR( T[i], rand)

Cache(s) Cache(s) Local Memory Local Memory Registers Registers Remote Memory Remote Memory Disk Disk Tape Tape Instructions Memory Hierarchy Memory Hierarchy Operands Lines Blocks Messages Pages Max Relative 8x 40x 200x 64000 GUPS 2000x 2 Pflop/s 6.5 Pbyte/s 0.5 Pflop/s

(24)

Computational Resources and HPC Challenge Benchmarks Computational resources Computational resources CPU computational speed Memory bandwidth Node Interconnect bandwidth

(25)

Computational resources Computational resources CPU computational speed Memory bandwidth Node Interconnect bandwidth HPL Matrix Multiply STREAM

Random & Natural Ring Bandwidth & Latency

Computational Resources and HPC Challenge Benchmarks

(26)

26

How Does The Benchmarking Work?

How Does The Benchmarking Work?

Single program to download and run

¾ Simple input file similar to HPL input

Base Run and Optimization Run

¾ Base run must be made

¾ User supplies MPI and the BLAS

¾ Optimized run allowed to replace certain routines

¾ User specifies what was done

Results upload via website (monitored)

html table and Excel spreadsheet generated with performance results

¾ Intentionally we are not providing a single figure of merit

(no over all ranking)

Each run generates a record which contains 188 pieces of information from the benchmark run.Goal: no more than 2 X the time to execute HPL.

(27)

27

http://icl.cs.utk.edu/hpcc/

(28)
(29)
(30)

30

HPCC

HPCC

Kiviat

Kiviat

Chart

Chart

(31)
(32)
(33)

33

Different Computers are Better at Different

Different Computers are Better at Different

Things, No

(34)

34

HPCC Awards Info and Rules

HPCC Awards Info and Rules

Class 1 (Objective)Performance 1.G-HPL $500 2.G-RandomAccess $500 3.EP-STREAM system $500 4.G-FFT $500

Must be full submissions through the HPCC

database

Class 2 (Subjective)

Productivity (Elegant Implementation)

¾ Implement at least two

tests from Class 1

¾ $1500 (may be split) ¾ Deadline:

¾October 15, 2007

¾ Select 3 as finalists

This award is weighted

¾ 50% on performance and ¾ 50% on code elegance,

clarity, and size.

Submissions format flexible

Winners (in both classes) will be announced at SC07 HPCC BOF

Winners (in both classes) will be announced at SC07 HPCC BOF

(35)

35

Class 2 Awards

Class 2 Awards

Subjective

Productivity (Elegant Implementation)

¾ Implement at least two tests from Class 1 ¾ $1500 (may be split)

¾ Deadline:

¾October 15, 2007

¾ Select 5 as finalists

Most "elegant" implementation with special emphasis being placed on:

Global HPL, Global RandomAccess, EP STREAM

(Triad) per system and Global FFT. This award is weighted

¾ 50% on performance and

(36)

36

5 Finalists for Class 2

5 Finalists for Class 2

November 2005

November 2005

Cleve Moler, Mathworks ¾ Environment: Parallel

Matlab Prototype

¾ System: 4 Processor

Opteron

Calin Caseval, C. Bartin, G. Almasi, Y. Zheng, M.

Farreras, P. Luk, and R. Mak, IBM

¾ Environment: UPC ¾ System: Blue Gene L

Bradley Kuszmaul, MIT ¾ Environment: Cilk

¾ System: 4-processor

1.4Ghz AMD Opteron 840 with 16GiB of memory

Nathan Wichman, Cray ¾ Environment: UPC

¾ System: Cray X1E (ORNL)Petr Konency, Simon

Kahan, and John Feo, Cray ¾ Environment: C + MTA

pragmas

¾ System: Cray MTA2

(37)

37

2006 Competitors

2006 Competitors

Some Notable Class 1 Competitors

SGI (NASA) Columbia 10,000 CPUs NEC (HLRS) SX-8 512 CPUs IBM (DOE LLNL) BG/L 131,072 CPU Purple 10,240 CPU

Cray (DOE ORNL) X1 1008 CPUs Jaguar XT3 5200 CPUs DELL (MIT LL) 300 CPUs LLGridClass 2: 6 Finalists

¾ Calin Cascaval (IBM) UPC on Blue Gene/L [Current Language]

¾ Bradley Kuszmaul (MIT CSAIL) Cilk on SGI Altix [Current Language]

¾ Cleve Moler (Mathworks) Parallel Matlab on a cluster [Current Language]

¾ Brad Chamberlain (Cray) Chapel [Research Language]

¾ Vivek Sarkar (IBM) X10 [Research Language]

¾ Vadim Gurev (St. Petersburg, Russia) MCSharp [Student Submission]

Cray (DOD ERDC) XT3 4096 CPUs Sapphire Cray (Sandia) XT3 11,648 CPU Red Storm

(38)

38

The Following are the Winners of the 2006

The Following are the Winners of the 2006

HPC Challenge Class 1 Awards

(39)

39

The Following are the Winners of the 2006

The Following are the Winners of the 2006

HPC Challenge Class 2 Awards

(40)

40

2006 Programmability

2006 Programmability

Speedup vs Relative Code Size

10-1 100 10 10-3 10-2 10-1 100 101 102 103

Relative Code Size

Sp e e d u p

“All too often” Java, Matlab, Python, etc. Traditional HPC Ref ♦Class 2 Award ¾50% Performance ¾50% Elegance21 Codes submitted by 6 teamsSpeedup relative to serial C on workstation

Code size relative to

serial CClass 2 Award ¾50% Performance ¾50% Elegance21 Codes submitted by 6 teamsSpeedup relative to serial C on workstationCode size relative to

(41)

41

2006 Programming Results Summary

2006 Programming Results Summary

21 of 21 smaller than C+MPI Ref; 20 smaller than serial

15 of 21 faster than serial; 19 in HPCS quadrant

21 of 21 smaller than C+MPI Ref; 20 smaller than serial

(42)

42

Top500 and HPC Challenge Rankings

Top500 and HPC Challenge Rankings

It should be clear that the HPL (Linpack Benchmark - Top500) is a relatively poor predictor of overall machine performance.For a given set of applications such as:

¾ Calculations on unstructured grids ¾ Effects of strong shock waves ¾ Ab-initio quantum chemistry ¾ Ocean general circulation model

¾ CFD calculations w/multi-resolution grids ¾ Weather forecasting

There should be a different mix of components used to help predict the system performance.

(43)

43

Will the Top500 List Go Away?

Will the Top500 List Go Away?

The Top500 continues to serve a valuable role in high performance computing.

¾ Historical basis

¾ Presents statistics on deployment ¾ Projection on where things are going ¾ Impartial view

¾ Its simple to understand ¾ Its fun

(44)

44

No Single Number for HPCC?

No Single Number for HPCC?

Of course everyone wants a single number.

With HPCC Benchmark you get 188 numbers per system run!

Many have suggested weighting the seven tests in HPCC to come up

with a single number.

¾ LINPACK, MatMul, FFT, Stream, RandomAccess,

Ptranspose, bandwidth & latency

But your application is different than mine, so weights are

dependent on the application.

Score = W1*LINPACK + W2*MM + W3*FFT+ W4*Stream + W5*RA + W6*Ptrans + W7*BW/Lat

Problem is that the weights depend on your job mix.

(45)

45

Tools Needed to Help With Performance

Tools Needed to Help With Performance

A tools that analyzed an application perhaps statically and/or dynamically.

Output a set of weights for various sections of the application

¾ [ W1, W2, W3, W4, W5, W6, W7, W8 ]

¾ The tool would also point to places where we were

missing a benchmarking component for the mapping.

Think of the benchmark components as a basis set for scientific applications

A specific application has a set of "coefficients" of the basis set.

Score = W1*HPL + W2*MM + W3*FFT+ W4*Stream +

(46)

46

Future Directions

Future Directions

Looking at reducing execution time

Constructing a framework for benchmarksDeveloping machine signatures

Plans are to expand the benchmark collection

¾ Sparse matrix operations ¾ I/O

¾ Smith-Waterman (sequence alignment)

Port to new systems

Provide more implementations

¾ Languages (Fortran, UPC, Co-Array) ¾ Environments

(47)

Collaborators

HPC Challenge

Piotr Łuszczek, U of Tennessee

David Bailey, NERSC/LBL

Jeremy Kepner, MIT Lincoln Lab

David Koester, MITRE

Bob Lucas, ISI/USC

Rusty Lusk, ANL

John McCalpin, IBM, Austin

Rolf Rabenseifner, HLRS Stuttgart

Daisuke Takahashi, Tsukuba, Japan

http://icl.cs.utk.edu/hpcc/

Top500

Hans Meuer, Prometeus

Erich Strohmaier, LBNL/NERSC

References

Related documents