High Performance Computing, Computational Grid, and Numerical Libraries. Technology Trends: Microprocessor Capacity

1

High Performance Computing, Computational Grid, and Numerical Libraries

Jack Dongarra Innovative Computing Lab

University of Tennessee

http://www.cs.utk.edu/~dongarra/

The 14th Symposium on Computer Architecture and High Performance Computing Vitoria/ES - Brazil - October 28-30, 2002

2

Technology Trends: Microprocessor Capacity

2X transistors/chip every 1.5 years, called “Moore’s Law”

Microprocessors have become smaller, denser, and more powerful.

Not just processors: bandwidth, storage, etc.

Gordon Moore (co-founder of Intel) predicted in 1965 that the transistor density of semiconductor chips would double roughly every 18 months.
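The doubling rule quoted on this slide can be turned into a quick calculation. The sketch below is illustrative only: the function name `growth_factor` and the sample time spans are assumptions, not part of the talk.

```python
def growth_factor(years, doubling_period_years=1.5):
    """Multiplicative growth after `years`, given a fixed doubling period."""
    return 2.0 ** (years / doubling_period_years)


# Doubling every 1.5 years corresponds to roughly 59% growth per year.
annual_rate = growth_factor(1.0) - 1.0

if __name__ == "__main__":
    print(f"annual growth: {annual_rate:.1%}")
    for span in (3, 15, 30):
        print(f"{span:2d} years -> factor of {growth_factor(span):,.0f}")
```

Under this rule, 15 years is ten doublings (about a factor of 1,000) and 30 years is twenty doublings (about a factor of a million).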

3

Moore’s Law

[Figure: peak performance of leading machines, 1950 to 2010, on a scale from 1 KFlop/s to 1 PFlop/s. Eras: Scalar, Super Scalar, Vector, Parallel, Super Scalar/Vector/Parallel. Machines shown: EDSAC 1, UNIVAC 1, IBM 7090, CDC 6600, IBM 360/195, CDC 7600, Cray 1, Cray X-MP, Cray 2, TMC CM-2, TMC CM-5, Cray T3D, ASCI Red, ASCI White Pacific.]

4

H. Meuer, H. Simon, E. Strohmaier, & JD

- Listing of the 500 most powerful computers in the world
- Yardstick: Rmax from LINPACK MPP (Ax=b, dense problem)
- Updated twice a year: SC‘xy in the States in November, meeting in Mannheim, Germany in June
- All data available from www.top500.org

[Figure: TPP performance, rate versus size.]
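The yardstick on this slide is a rate obtained by timing the solution of a dense system Ax=b. The sketch below illustrates the idea in pure Python; it is not the LINPACK or HPL code. The function names are hypothetical, and the operation count 2/3 n^3 + 2 n^2 is the conventional figure for this kind of solve, used here as an assumption.

```python
import random
import time


def solve_dense(a, b):
    """Solve Ax = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    a = [row[:] for row in a]  # work on copies so the inputs are untouched
    b = b[:]
    for k in range(n):
        # Partial pivoting: move the largest entry of column k to the diagonal.
        p = max(range(k, n), key=lambda i: abs(a[i][k]))
        if a[p][k] == 0.0:
            raise ValueError("matrix is singular")
        a[k], a[p] = a[p], a[k]
        b[k], b[p] = b[p], b[k]
        for i in range(k + 1, n):
            m = a[i][k] / a[k][k]
            for j in range(k, n):
                a[i][j] -= m * a[k][j]
            b[i] -= m * b[k]
    # Back substitution on the resulting upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(a[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (b[i] - s) / a[i][i]
    return x


def linpack_style_rate(n, seed=0):
    """Return (flop/s, max residual) for one random dense n x n solve."""
    rng = random.Random(seed)
    a = [[rng.uniform(-1.0, 1.0) for _ in range(n)] for _ in range(n)]
    b = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    start = time.perf_counter()
    x = solve_dense(a, b)
    elapsed = time.perf_counter() - start
    flops = (2.0 / 3.0) * n**3 + 2.0 * n**2  # assumed conventional count
    residual = max(
        abs(sum(a[i][j] * x[j] for j in range(n)) - b[i]) for i in range(n)
    )
    return flops / elapsed, residual
```

Calling `linpack_style_rate(100)` gives a rate for an interpreted, unoptimized solver; the gap between that number and a tuned library on the same machine is a small-scale version of the measured-versus-peak gap the list tracks.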

5

Fastest Computer Over Time

[Figure: scatter plot of GFlop/s (0 to 70) versus year (1990 to 2000). Machines shown: Cray Y-MP (8), TMC CM-2 (2048), Fujitsu VP-2600.]

A computation that took 1 full year to complete in 1980 can now be done in ~10 hours!

6

Fastest Computer Over Time

[Figure: scatter plot of GFlop/s (0 to 700) versus year (1990 to 2000). Machines shown: Cray Y-MP (8), TMC CM-2 (2048), Fujitsu VP-2600, NEC SX-3 (4), TMC CM-5 (1024), Fujitsu VPP-500 (140), Intel Paragon (6788), Hitachi CP-PACS (2040).]

A computation that took 1 full year to complete in 1980 can now be done in ~16 minutes!

7

Fastest Computer Over Time

[Figure: scatter plot of GFlop/s (0 to 7000) versus year (1990 to 2000). Machines shown: Cray Y-MP (8), TMC CM-2 (2048), Fujitsu VP-2600, NEC SX-3 (4), TMC CM-5 (1024), Fujitsu VPP-500 (140), Intel Paragon (6788), Hitachi CP-PACS (2040), Intel ASCI Red (9152), ASCI Blue Pacific SST (5808), SGI ASCI Blue Mountain (5040), Intel ASCI Red Xeon (9632), ASCI White Pacific (7424).]

A computation that took 1 full year to complete in 1980 can today be done in ~27 seconds!

8

Fastest Computer Over Time

[Figure: scatter plot of TFlop/s (0 to 70) versus year (1990 to 2002). Machines shown: Cray Y-MP (8), TMC CM-2 (2048), Fujitsu VP-2600, NEC SX-3 (4), TMC CM-5 (1024), Fujitsu VPP-500 (140), Intel Paragon (6788), Hitachi CP-PACS (2040), Intel ASCI Red (9152), ASCI Blue Mountain (5040), Intel ASCI Red Xeon (9632), ASCI White Pacific (7424), Japanese Earth Simulator NEC (5104).]

A computation that took 1 full year to complete in 1980 can today be done in ~5.4 seconds!
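The four "1 full year in 1980" comparisons on these slides each imply a speedup factor. A minimal sketch of that arithmetic follows; the variable names are illustrative, and a 365-day year is assumed.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # 31,536,000 s, ignoring leap years

# "Now" times quoted on the four slides, converted to seconds.
quoted_times = {
    "~10 hours": 10 * 3600,
    "~16 minutes": 16 * 60,
    "~27 seconds": 27,
    "~5.4 seconds": 5.4,
}

# Implied speedup relative to the one-year 1980 computation.
speedups = {label: SECONDS_PER_YEAR / secs for label, secs in quoted_times.items()}

if __name__ == "__main__":
    for label, factor in speedups.items():
        print(f"{label:>13}: about {factor:,.0f}x faster")
```

The implied factors run from under a thousand (10 hours) to nearly six million (5.4 seconds).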

9

Machines at the Top of the List

Year  Computer                                     Measured Gflop/s  Factor  Peak Gflop/s  Factor  Processors  Efficiency
1993  Fujitsu NWT                                  124.5             -       236           -       140         53%
1994  Intel Paragon XP/S MP                        281.1             2.3     338           1.4     6768        83%
1995  Intel Paragon XP/S MP                        281.1             1       338           1.0     6768        83%
1996  Hitachi CP-PACS                              368.2             1.3     614           1.8     2048        60%
1997  Intel ASCI Option Red (200 MHz Pentium Pro)  1338              3.6     1830          3.0     9152        73%
1998  ASCI Blue-Pacific SST, IBM SP 604E           2144              1.6     3868          2.1     5808        55%
1999  ASCI Red Intel Pentium II Xeon core          2379              1.1     3207          0.8     9632        74%
2000  ASCI White-Pacific, IBM SP Power 3           4938              2.1     11136         3.5     7424        44%
2001  ASCI White-Pacific, IBM SP Power 3           7226              1.5     11136         1.0     7424        65%
2002  Earth Simulator Computer, NEC                35860             5.0     40960         3.7     5120        88%

Peak is the theoretical peak. Each Factor column is the change (Δ) from the previous year in the column to its left.
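The efficiency column in this table is the measured rate divided by the theoretical peak. The sketch below recomputes it from the measured and peak Gflop/s values listed for each year; the dictionary layout and function name are illustrative assumptions.

```python
# year: (measured Gflop/s, theoretical peak Gflop/s), as listed in the table
top_machines = {
    1993: (124.5, 236),
    1994: (281.1, 338),
    1995: (281.1, 338),
    1996: (368.2, 614),
    1997: (1338, 1830),
    1998: (2144, 3868),
    1999: (2379, 3207),
    2000: (4938, 11136),
    2001: (7226, 11136),
    2002: (35860, 40960),
}


def efficiency_percent(measured, peak):
    """Measured rate as a whole-number percentage of theoretical peak."""
    return round(100.0 * measured / peak)


efficiencies = {year: efficiency_percent(m, p) for year, (m, p) in top_machines.items()}

if __name__ == "__main__":
    for year, eff in efficiencies.items():
        print(year, f"{eff}%")
```

The recomputed values match the percentages on the slide, from 44% (ASCI White in 2000) to 88% (Earth Simulator in 2002).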

10

A Tour de Force in Engineering

♦ Homogeneous, Centralized, Proprietary, Expensive!
♦ Target Application: CFD (Weather, Climate, Earthquakes)
♦ 640 NEC SX/6 Nodes (mod)
Ø 5120 CPUs which have vector ops
♦ 40 TFlop/s (peak)
♦ $250-$500 million for things in building
♦ Footprint of 4 tennis courts
♦ 7 MWatts
Ø Say 10 cents/kWh: $16.8K/day = $6M/year!
♦ Expect to be on top of Top500 until 60-100 TFlop ASCI machine arrives
♦ For the Top500 (June 2002)
Ø Equivalent to ~1/6 of Σ Top 500
Ø Performance of ESC > Σ of the next top 12 computers
Ø Σ of all the DOE computers = 27.5 TFlop/s
Ø Performance of ESC > all the DOE + DOD machines (37.2 TFlop/s)
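The power-cost figure on this slide follows from simple arithmetic. A minimal sketch, assuming continuous operation at 7 MW and the slide's 10 cents/kWh rate (the function name is illustrative):

```python
def power_cost(megawatts, dollars_per_kwh=0.10):
    """Return (cost per day, cost per 365-day year) for a constant load."""
    kwh_per_day = megawatts * 1000 * 24
    per_day = kwh_per_day * dollars_per_kwh
    return per_day, per_day * 365


per_day, per_year = power_cost(7)

if __name__ == "__main__":
    print(f"${per_day:,.0f}/day, ${per_year / 1e6:.2f}M/year")
```

This reproduces the slide's $16.8K/day and roughly $6M/year.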

11

Top10 of the Top500

Rank  Manufacturer  Computer                            Rmax [TF/s]  Installation Site                        Country  Year  Area of Installation  # Proc
1     NEC           Earth-Simulator                     35.86        Earth Simulator Center                   Japan    2002  Research              5120
2     IBM           ASCI White SP Power3                7.23         Lawrence Livermore National Laboratory   USA      2000  Research              8192
3     HP            AlphaServer SC ES45 1 GHz           4.46         Pittsburgh Supercomputing Center         USA      2001  Academic              3016
4     HP            AlphaServer SC ES45 1 GHz           3.98         Commissariat a l’Energie Atomique (CEA)  France   2001  Research              2560
5     IBM           SP Power3 375 MHz                   3.05         NERSC/LBNL                               USA      2001  Research              3328
6     HP            AlphaServer SC ES45 1 GHz           2.92         Los Alamos National Laboratory
7     Intel         ASCI Red                            2.38         Sandia National Laboratory               USA      1999  Research              9632
8     IBM           pSeries 690 1.3 GHz                 2.31         Oak Ridge National Laboratory
9     IBM           ASCI Blue Pacific SST, IBM SP 604e  2.14         Lawrence Livermore National Laboratory
10    IBM           pSeries 690 1.3 GHz                 2.00         IBM/US Army Research Lab (ARL)           USA      2002  Vendor                768

12

TOP500 - Performance

[Figure: log-scale performance (100 Mflop/s to 1 Pflop/s) from Jun-93 to Jun-02 for three curves. SUM rises from 1.17 TF/s to 220 TF/s; N=1 rises from 59.7 GF/s to 35.8 TF/s; N=500 rises from 0.4 GF/s to 134 GF/s. Labeled N=1 systems: Fujitsu 'NWT' (NAL), Intel ASCI Red (Sandia), IBM ASCI White (LLNL), NEC Earth Simulator. "My Laptop" is marked for comparison.]

13

Performance Extrapolation

[Figure: the N=1, N=500, and Sum curves extended from Jun-93 to Jun-10 on a log scale (100 MFlop/s, 100 GFlop/s, 100 TFlop/s marked). Labeled points: Earth Simulator, ASCI Purple, "My Laptop".]
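An extrapolation of this kind can be sketched by fitting a constant annual growth factor to the endpoints of each TOP500 curve. The code below is a rough illustration, not the method behind the slide: the endpoint values are the chart labels from the TOP500 performance slide, their pairing with the three curves is an assumption, and growth is assumed to stay exponential.

```python
import math


def annual_growth(start, end, years):
    """Constant yearly growth factor that takes `start` to `end` in `years`."""
    return (end / start) ** (1.0 / years)


def years_to_reach(current, target, growth_per_year):
    """Years of continued exponential growth needed to reach `target`."""
    return math.log(target / current) / math.log(growth_per_year)


# (June 1993, June 2002) in Gflop/s; pairing of labels to curves is assumed.
series = {
    "SUM": (1170.0, 220000.0),
    "N=1": (59.7, 35800.0),
    "N=500": (0.4, 134.0),
}
YEARS = 9.0
PFLOP_IN_GFLOPS = 1.0e6

growth = {name: annual_growth(a, b, YEARS) for name, (a, b) in series.items()}
years_to_pflop = {
    name: years_to_reach(series[name][1], PFLOP_IN_GFLOPS, growth[name])
    for name in series
}

if __name__ == "__main__":
    for name in series:
        print(f"{name:>5}: x{growth[name]:.2f}/year, "
              f"1 Pflop/s about {years_to_pflop[name]:.1f} years after June 2002")
```

All three curves grow by a factor of roughly 1.8 to 2.0 per year under these assumptions, which is faster than the 2X per 1.5 years of Moore's Law quoted earlier in the talk.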

14

Manufacturers

HP 168 (12 in the top 100), IBM 164 (47 in the top 100)

[Figure: number of systems (0 to 500) per manufacturer, Jun-93 to Jun-02. Manufacturers: Cray, SGI, IBM, Sun, HP, TMC, Intel, Fujitsu, NEC, Hitachi, others.]

15

Architectures

[Figure: number of systems (0 to 500) per architecture class, Jun-93 to Jun-02. Classes: Single Processor, SMP, MPP, SIMD, Constellation, Cluster - NOW. Example systems labeled: Y-MP C90, Sun HPC, Paragon, CM5, T3D, T3E, SP2, Cluster of Sun HPC, ASCI Red, CM2, VP500, SX3.]

Constellation: # of p/n … n

16

Kflops per Inhabitant

[Figure: bar chart of Kflops per inhabitant (axis 0 to 700). Luxembourg 643, Japan 450, USA 358, Germany 245, Scandinavia 207, UK 203, France 158, Switzerland 141, Italy 67. White is the ES contribution (283) and blue is the ASCI contribution (76).]

17

80 Clusters on the Top500

♦ A total of 42 Intel-based and 8 AMD-based PC clusters are in the TOP500.
Ø 31 of these Intel-based clusters are IBM Netfinity systems delivered by IBM.
♦ A substantial part of these are installed at industrial customers, especially in the oil industry.
Ø Including 5 Sun and 5 Alpha-based clusters and 21 HP AlphaServer.
♦ 14 of these clusters are labeled as 'Self-Made'.

18

Cluster on the Top500

[Figure: number of clusters (0 to 80) in the Top500, Jun-97 to Jun-02, by type: AMD, Intel, IBM Netfinity, Alpha, HP Alpha Server, Sparc.]

Processor Breakdown (count, share): Pentium III, 37, 46%; Alpha, 25, 31%; AMD, 8, 10%; Sparc, 5, 6%; Pentium 4, 3, 4%; Itanium, 2, 3%

19

Distributed and Parallel Systems

[Figure: spectrum from distributed (heterogeneous) systems to massively parallel (homogeneous) systems, with labels SETI@home, Entropia/UD, Grid based computing, Google, network of ws, clusters w/special interconnect, parallel dist mem, ASCI Tflop/s, Earth Simulator.]

Distributed systems (heterogeneous)
♦ Gather (unused) resources
♦ Steal cycles
♦ System SW manages resources
♦ System SW adds value
♦ 10% - 20% overhead is OK
♦ Resources drive applications
♦ Time to completion is not critical
♦ Time-shared
♦ SETI@home
Ø ~ 400,000 machines
Ø Averaging 40 Tflop/s

Massively parallel systems (homogeneous)
♦ Bounded set of resources
♦ Apps grow to consume all cycles
♦ Application manages resources
♦ System SW gets in the way
♦ 5% overhead is maximum
♦ Apps drive purchase of equipment
♦ Real-time constraints
♦ Space-shared
♦ Earth Simulator
Ø 5000 processors
Ø Averaging 35 Tflop/s

20

SETI@home: Global Distributed Computing

♦ Running on 500,000 PCs, ~1000 CPU Years per Day
Ø 485,821 CPU Years so far
♦ Sophisticated Data & Signal Processing Analysis
♦ Distributes Datasets from Arecibo Radio Telescope

21

SETI@home

♦ Use thousands of Internet-connected PCs to help in the search for extraterrestrial intelligence.
♦ When a participant's computer is idle, this software downloads a 300 kilobyte chunk of data for analysis. Each client performs about 3 Tflop (3 trillion floating-point operations) on it in 15 hours.
♦ The results of this analysis are sent back to the SETI team and combined with those of thousands of other participants.
♦ Largest distributed computation project in existence
Ø 2500 machines today
Ø Averaging 40 Tflop/s
♦ Today a number of companies are trying this for profit.
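The per-client and aggregate figures on the SETI@home slides can be cross-checked with a short calculation. The sketch below assumes that the "3 Tflops ... in 15 hours" figure means about 3 x 10^12 floating-point operations per work unit, and that every client delivers that rate continuously; both are assumptions, and the names are illustrative.

```python
FLOP_PER_WORK_UNIT = 3.0e12   # assumed reading of "about 3 Tflops" per client
HOURS_PER_WORK_UNIT = 15.0

# Sustained rate of one client, in flop/s.
per_client_flops = FLOP_PER_WORK_UNIT / (HOURS_PER_WORK_UNIT * 3600)


def aggregate_tflops(clients):
    """Aggregate rate in Tflop/s if every client sustains the per-client rate."""
    return clients * per_client_flops / 1e12


if __name__ == "__main__":
    print(f"per client: {per_client_flops / 1e6:.1f} Mflop/s")
    for clients in (400_000, 500_000):
        print(f"{clients:,} clients: {aggregate_tflops(clients):.1f} Tflop/s")
```

Under these assumptions each client sustains about 56 Mflop/s, and 400,000 to 500,000 clients give roughly 22 to 28 Tflop/s, the same order of magnitude as the 40 Tflop/s average quoted on the slides.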

22

Grid Computing - from ET to Anthrax

23

♦ Google query attributes
Ø 150M queries/day (2000/second)
Ø 3B documents in the index
♦ Data centers
Ø 15,000 Linux systems in 6 data centers
Ø 15 TFlop/s and 1000 TB total capability
Ø 40-80 1U/2U servers/cabinet
Ø 100 Mb Ethernet switches/cabinet with gigabit Ethernet uplink
Ø Growth from 4,000 systems (June 2000), 18M queries then
♦ Performance and operation
Ø Simple reissue of failed commands to new servers
Ø No performance debugging: problems are not reproducible

Source: Monika Henzinger, Google
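The "simple reissue of failed commands to new servers" policy can be illustrated with a short sketch. This is not Google's implementation: `ServerError`, `query_with_reissue`, and the `send` callback are hypothetical names introduced only for illustration.

```python
import random


class ServerError(Exception):
    """Raised by a (hypothetical) server that fails to answer."""


def query_with_reissue(query, servers, send, max_attempts=3, rng=random):
    """Send `query` to one server; on failure, reissue it to a different one.

    `send(server, query)` is a caller-supplied function that returns a result
    or raises ServerError. Failures are not diagnosed, only routed around.
    """
    candidates = list(servers)
    rng.shuffle(candidates)
    attempts = candidates[:max_attempts]
    last_error = None
    for server in attempts:
        try:
            return send(server, query)
        except ServerError as err:
            last_error = err  # no debugging: just move on to a new server
    raise ServerError(f"all {len(attempts)} attempts failed") from last_error
```

With many interchangeable servers, retrying elsewhere is cheaper than reproducing a failure, which fits the slide's remark that problems are not reproducible.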

24

Motivation for Grid Computing

♦ Today there is a complex interplay and increasing interdependence among the sciences.
♦ Many science and engineering problems require widely dispersed resources to be operated as systems.
♦ What we do as collaborative infrastructure developers will have profound influence on the future of science.
♦ Networking, distributed computing, and parallel computation research have matured to make it possible for distributed systems to support high-performance applications, but...
Ø Resources are dispersed
Ø Connectivity is variable
Ø Dedicated access may not be possible

In the past: Isolation. Today: Collaboration.

25

The Grid

26

Grids are Hot

IPG NASA http://nas.nasa.gov/~wej/home/IPG
Globus http://www.globus.org/
Legion http://www.cs.virginia.edu/~grimshaw/
AppLeS http://www-cse.ucsd.edu/groups/hpcl/
NetSolve http://www.cs.utk.edu/netsolve/
NINF http://phase.etl.go.jp/ninf/
Condor http://www.cs.wisc.edu/condor/
CUMULVS http://www.epm.ornl.gov/cs/
WebFlow http://www.npac.syr.edu/users/gcf/
NGC http://www.nordicgrid.net

27

University of Tennessee Deployment: Scalable Intracampus Research Grid: SInRG

♦ Federated Ownership: CS, Chem Eng., Medical School, Computational Ecology, El. Eng.
♦ Real applications, middleware development, logistical networking

The Knoxville Campus has two DS-3 commodity Internet connections and one DS-3 Internet2/Abilene connection.

An OC-3 ATM link routes IP traffic between the Knoxville campus, National Transportation Research Center, and Oak Ridge National Laboratory. UT participates in several national networking initiatives including Internet2 (I2), Abilene, the federal Next Generation Internet (NGI) initiative, Southern Universities Research Association (SURA) Regional Information Infrastructure (RII), and Southern Crossroads (SoX).

The UT campus consists of a meshed ATM OC-12 being migrated over to switched Gigabit by early 2002.

28

Grids vs. Capability Computing

♦ Not an “either/or” question
Ø each addresses different needs
Ø both are part of an integrated solution
♦ Grid strengths
Ø coupling necessarily distributed resources
Ø instruments, archives, and people
Ø eliminating time and space barriers
Ø remote resource access and capacity computing
Ø Grids are not a cheap substitute for capability HPC
♦ Capability computing strengths
Ø supporting foundational computations
Ø terascale and petascale “nation scale” problems
Ø engaging tightly coupled teams and computations

29

Software Technology & Performance

♦ Tendency to focus on hardware
♦ Software required to bridge an ever widening gap
♦ Gap between usable and deliverable performance is very steep
Ø Performance only if the data and controls are set up just right
Ø Otherwise, dramatic performance degradations, very unstable situation
Ø Will become more unstable
♦ Challenge of Libraries, PSEs and Tools is formidable at the Tflop/s level, even greater at Pflop/s; some might say insurmountable.

30

Where Does the Performance Go? or Why Should I Care About the Memory Hierarchy?

[Figure: Processor-DRAM Memory Gap (latency). Performance (log scale, 1 to 1000) versus time (1980 to 2000) for CPU and DRAM. µProc: 60%/yr. (2X/1.5 yr), “Moore’s Law”. DRAM: 9%/yr. (2X/10 yrs).]

Processor-Memory Performance Gap: (grows 50% / year)
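The gap figure on this slide follows from the two growth rates. A minimal sketch of the arithmetic, using the slide's 60%/yr and 9%/yr rates (the names are illustrative):

```python
import math

CPU_GROWTH = 1.60    # processor performance: 60% per year
DRAM_GROWTH = 1.09   # DRAM performance: 9% per year

# The ratio between the two curves grows by this factor every year.
gap_growth = CPU_GROWTH / DRAM_GROWTH


def gap_after(years):
    """Processor-to-DRAM performance ratio after `years`, starting from 1."""
    return gap_growth ** years


def doubling_time(growth_per_year):
    """Years needed to double at a constant yearly growth factor."""
    return math.log(2.0) / math.log(growth_per_year)


if __name__ == "__main__":
    print(f"gap grows about {gap_growth - 1:.0%} per year")
    print(f"CPU doubles every {doubling_time(CPU_GROWTH):.1f} years")
    print(f"DRAM doubles every {doubling_time(DRAM_GROWTH):.1f} years")
    print(f"gap after 20 years: about {gap_after(20):,.0f}x")
```

The ratio grows by about 47% per year, which the slide rounds to 50%. A 9% yearly rate doubles in about 8 years, so the slide's "2X/10 yrs" is a rounded figure.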

31

Optimizing Computation and Memory Use

♦ Computational optimizations
Ø Theoretical peak: (# fpus) * (flops/cycle) * (MHz)
Ø Pentium 4: (1 fpu) * (2 flops/cycle) * (2.53 GHz) = 5060 MFLOP/s
♦ Operations like:
Ø α = x^T y : 2 operands (16 Bytes) needed for 2 flops; at 5060 Mflop/s requires 5060 MW/s bandwidth
Ø y = α x + y : 3 operands (24 Bytes) needed for 2 flops; at 5060 Mflop/s requires 7590 MW/s bandwidth
♦ Memory optimization
Ø Theoretical peak: (bus width) * (bus speed)
Ø Pentium 4: (32 bits) * (533 MHz) = 2132 MB/s = 266 MW/s
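The numbers on this slide can be reproduced, and the resulting memory-bound rates estimated, with a few lines of arithmetic. The sketch below uses the slide's Pentium 4 figures and 8-byte words; the function names, and the simplification that every operand comes from main memory, are assumptions.

```python
WORD_BYTES = 8  # one 64-bit operand


def peak_mflops(fpus, flops_per_cycle, clock_mhz):
    """Theoretical compute peak in Mflop/s."""
    return fpus * flops_per_cycle * clock_mhz


def required_mwords(mflops, operands, flops):
    """Operand traffic (Mwords/s) needed to sustain `mflops`."""
    return mflops * operands / flops


def bus_mwords(bus_bits, bus_mhz):
    """Theoretical memory peak in Mwords/s."""
    return (bus_bits / 8) * bus_mhz / WORD_BYTES


def memory_bound_mflops(mwords_per_s, operands, flops):
    """Rate achievable when every operand must come over the memory bus."""
    return mwords_per_s * flops / operands


peak = peak_mflops(1, 2, 2530)              # Pentium 4 at 2.53 GHz
dot_need = required_mwords(peak, 2, 2)      # alpha = x^T y
axpy_need = required_mwords(peak, 3, 2)     # y = alpha x + y
bus = bus_mwords(32, 533)                   # 32-bit bus at 533 MHz
dot_rate = memory_bound_mflops(bus, 2, 2)
axpy_rate = memory_bound_mflops(bus, 3, 2)

if __name__ == "__main__":
    print(f"peak {peak} Mflop/s; dot needs {dot_need:.0f} MW/s, axpy {axpy_need:.0f} MW/s")
    print(f"bus delivers {bus:.1f} MW/s")
    print(f"memory-bound: dot {dot_rate:.0f} Mflop/s, axpy {axpy_rate:.0f} Mflop/s")
```

Under these assumptions the bus supplies about one nineteenth of the bandwidth the dot product needs at peak, so a vector operation streaming from main memory runs at a few percent of the 5060 Mflop/s peak. This is the case for reusing data in cache.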

32

Memory Hierarchy

♦ By taking advantage of the principle of locality:
Ø Present the user with as much memory as is available in the cheapest technology.
Ø Provide access at the speed offered by the fastest technology.

[Figure: memory hierarchy from the processor (control, datapath, registers) and on-chip cache through level 2 and 3 cache (SRAM), main memory (DRAM), distributed memory, remote cluster memory, secondary storage (disk), and tertiary storage (disk/tape). Speed ranges from 1s of ns at the registers to 10,000,000,000s of ns (10s of sec) at tertiary storage; size ranges from 100s of bytes through Ks, Ms, and Gs to Ts.]

33

Self Adapting Software

♦ Software system that …
Ø Obtains information on the underlying system where it will run.
Ø Adapts the application to the presented data and the available resources, perhaps providing automatic algorithm selection.
Ø During execution, performs optimization and perhaps reconfigures based on newly available resources.
Ø Allows the user to provide for faults and recover without additional user involvement.
♦ The moral of the story
Ø We know the concepts of how to improve things.
Ø Capture insights/experience: do what humans do well
Ø Automate the dull stuff

34

Software Generation Strategy - ATLAS BLAS

♦ Takes ~ 20 minutes to run, generates Level 1, 2, & 3 BLAS
♦ “New” model of high performance programming where critical code is machine generated using parameter optimization.
♦ Designed for modern architectures
Ø Need reasonable C compiler
♦ Today ATLAS is used within various ASCI and SciDAC activities and by Matlab, Mathematica, Octave, Maple, Debian, Scyld Beowulf, SuSE,…
♦ Parameter study of the hw
♦ Generate multiple versions of code, with different values of key performance parameters
♦ Run and measure the performance for various versions
♦ Pick best and generate library
♦ Level 1 cache multiply optimizes for:
Ø TLB access
Ø L1 cache reuse
Ø FP unit usage
Ø Memory fetch
Ø Register reuse
Ø Loop overhead minimization
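The generate, measure, and select loop described on this slide can be sketched in a few lines. This is a toy illustration of empirical tuning, not ATLAS itself: the only tuning parameter is the block size of a blocked matrix multiply, and the function names and candidate values are assumptions.

```python
import time


def blocked_matmul(a, b, n, block):
    """Multiply two n x n matrices (lists of lists) using square blocking."""
    c = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, block):
        for kk in range(0, n, block):
            for jj in range(0, n, block):
                for i in range(ii, min(ii + block, n)):
                    ci, ai = c[i], a[i]
                    for k in range(kk, min(kk + block, n)):
                        aik, bk = ai[k], b[k]
                        for j in range(jj, min(jj + block, n)):
                            ci[j] += aik * bk[j]
    return c


def autotune(n=48, candidates=(4, 8, 16, 48), repeats=2):
    """Time each candidate block size and return (best block, all timings)."""
    a = [[(i + j) % 7 + 1.0 for j in range(n)] for i in range(n)]
    b = [[(i * j) % 5 + 1.0 for j in range(n)] for i in range(n)]
    timings = {}
    for block in candidates:
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            blocked_matmul(a, b, n, block)
            best = min(best, time.perf_counter() - start)
        timings[block] = best  # keep the fastest run for this variant
    return min(timings, key=timings.get), timings


if __name__ == "__main__":
    best_block, timings = autotune()
    print("selected block size:", best_block)
```

In pure Python the timing differences mostly reflect interpreter overhead rather than cache behavior, so the selected value says little about the hardware. The point is the structure: enumerate variants, measure each on the target machine, and keep the winner.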

35

ATLAS (DGEMM n = 500)

♦ ATLAS is faster than all other portable BLAS implementations and it is comparable with machine-specific libraries provided by the vendor.
♦ Looking at sparse operations

[Figure: bar chart of MFLOP/s (0 to 3500) for Vendor BLAS, ATLAS BLAS, and F77 BLAS across architectures: AMD Athlon-600, DEC ev56-533, DEC ev6-500, HP9000/735/135, IBM PPC604-112, IBM Power2-160, IBM Power3-200, Intel P-III 933 MHz, Intel P-4 2.53 GHz w/SSE2, SGI R10000ip28-200, SGI R12000ip30-270, Sun UltraSparc2-200.]

36

GrADS - Grid Application Development System

♦ Problem: Grid has distributed, heterogeneous, dynamic resources; how do we use them?
♦ Goal: reliable performance on dynamically changing resources
♦ Minimize work of preparing an application for Grid execution
Ø Provide generic versions of key components (currently built in to applications or manually done)
Ø E.g., scheduling, application launch, performance monitoring
♦ Provide high-level programming tools to help automate application preparation
Ø Performance modeler, mapper, binder

37

NSF/NGS GrADS - GrADSoft Architecture

♦ Goal: reliable performance on dynamically changing resources

[Figure: GrADSoft architecture. Components: Source Application, Whole-Program Compiler, Libraries, Software Components, Configurable Object Program, Binder, Scheduler, Resource Negotiator, Grid Runtime System, Real-time Performance Monitor, Performance Problem. Paths labeled Performance Feedback and Negotiation.]

PIs: Ken Kennedy, Fran Berman, Andrew Chien, Keith Cooper, JD, Ian Foster, Lennart Johnsson, Dan Reed, Carl Kesselman, John Mellor-Crummey, Linda Torczon & Rich Wolski

39

Intelligent Component

♦ System to mediate between user application and multiple possible libraries
♦ Self-Adaptivity and Learning Behavior
Ø Heuristics are tuned based on data
Ø System gradually gets smarter (database)
Ø The system can educate the user
♦ User Interaction
Ø User can guide the system by providing further information
Ø System teaches user about properties of the data

40

Research Areas

♦ Automatically generating performance models (e.g. for ScaLAPACK) on Grid resources

♦ Evaluating Performance “Contracts”

♦ Near Optimal Scheduling (execution) on the Grid

♦ Rescheduling for changing resources

♦ Checkpointing and fault tolerance

♦ High-latency tolerant algorithms (SANS ideas)

♦ Porting applications/libraries to GrADS framework

♦ Developing generic GrADSoft interfaces (APIs)

41

LAPACK For Clusters

♦ Developing middleware which couples cluster system information with the specifics of a user problem to launch cluster-based applications on the “best” set of resources available.
♦ Using ScaLAPACK as the prototype software, but developing a framework

[Figure: cluster configuration. ~ Mbit switch (fully connected); ~ Gbit switch (fully connected); remote memory server, e.g. IBP (TCP/IP); local network file server, Sun’s NFS (UDP/IP), e.g. 100 Mbit; users, etc.]

42

GrADS ScaLAPACK versus Non-GrADS ScaLAPACK

[Figure: bars of time taken (seconds, 0 to 2500) with legend entries ScaLAPACK Application, Grid Overhead / Spawn, GrADS Overhead. Cases: N=5000, GrADS (6 proc) vs. Non-GrADS (17 proc); N=10000, GrADS (10 proc) vs. Non-GrADS (17 proc); N=15000, GrADS (15 proc) vs. Non-GrADS (17 proc); N=20000, GrADS (14 proc) vs. Non-GrADS (17 proc).]

Grid consists of 17 machines from two heterogeneous, shared (possibly loaded) clusters.

GrADS schedules execution on appropriate machines. Non-GrADS uses all the machines.

43

Research Directions

♦ Parameterizable libraries
♦ Fault tolerant algorithms
♦ Annotated libraries
♦ Hierarchical algorithm libraries
♦ “Grid” (network) enabled strategies

A new division of labor among compiler writers, library writers, algorithm developers, and application developers will emerge.

44

Futures for Numerical Algorithms and Software

♦ Numerical software will be adaptive, exploratory, and intelligent
♦ Determinism in numerical computing will be gone.
Ø After all, it's not reasonable to ask for exactness in numerical computations.
Ø Auditability of the computation, reproducibility at a cost
♦ Importance of floating point arithmetic will be undiminished.
Ø 16, 32, 64, 128 bits and beyond.
♦ Reproducibility, fault tolerance, and auditability
♦ Adaptivity is key so applications can effectively use the resources.
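The remark about determinism and "reproducibility at a cost" can be illustrated with floating-point summation, where the result depends on the order of operations. The sketch below is an illustration chosen here, not an example from the talk; it uses Python's `math.fsum`, which returns a correctly rounded sum at extra cost.

```python
import math
import random

# Floating-point addition is not associative: grouping changes the result.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)

# A parallel or adaptive code may add the same numbers in a different order
# on each run, so plain sums can differ from run to run.
rng = random.Random(1)
values = [rng.uniform(-1.0, 1.0) * 10.0 ** rng.randint(-8, 8) for _ in range(1000)]
shuffled = values[:]
rng.shuffle(shuffled)

naive_a = sum(values)
naive_b = sum(shuffled)

# A correctly rounded sum does not depend on ordering, but costs more work.
exact_a = math.fsum(values)
exact_b = math.fsum(shuffled)

if __name__ == "__main__":
    print(left_to_right == right_to_left)  # False
    print("naive sums equal:", naive_a == naive_b)
    print("fsum results equal:", exact_a == exact_b)
```

An order-independent result is available, but only by paying for a more expensive summation, which is the trade-off the slide points to.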

45

Collaborators / Support

♦ TOP500
Ø H. Meuer, Mannheim U
Ø H. Simon, NERSC
Ø E. Strohmaier, NERSC
♦ GrADS
Ø Sathish Vadhiyar, UTK
Ø Asim YarKhan, UTK
Ø Ken Kennedy, Fran Berman, Andrew Chien, Keith Cooper, Ian Foster, Carl Kesselman, Lennart Johnsson, Dan Reed, Linda Torczon, & Rich Wolski
♦ Thanks
Ø Next Generation Software (NGS), NSF
Ø Scientific Discovery through Advanced Computing (SciDAC)