1
High Performance Computing, Computational Grid, and Numerical Libraries
Jack Dongarra Innovative Computing Lab
University of Tennessee
http://www.cs.utk.edu/~dongarra/
The 14th Symposium on Computer Architecture and High Performance Computing Vitoria/ES - Brazil - October 28-30, 2002
2
Technology Trends: Microprocessor Capacity
2X transistors/chip every 1.5 years, called “Moore’s Law”
Microprocessors have become smaller, denser, and more powerful.
The same trend holds not just for processors but also for bandwidth, storage, etc.
Gordon Moore (co-founder of Intel) predicted in 1965 that the transistor density of semiconductor chips would double roughly every 18 months.
3
[Figure: Moore’s Law. Peak performance of leading machines, 1950-2010, on a scale from 1 KFlop/s to 1 PFlop/s: EDSAC 1, UNIVAC 1, IBM 7090, CDC 6600, IBM 360/195, CDC 7600, Cray 1, Cray X-MP, Cray 2, TMC CM-2, TMC CM-5, Cray T3D, ASCI Red, ASCI White Pacific. Eras: Scalar, Super Scalar, Vector, Parallel, Super Scalar/Vector/Parallel]
4
H. Meuer, H. Simon, E. Strohmaier, & JD
- Listing of the 500 most powerful Computers in the World
- Yardstick: Rmax from LINPACK MPP (Ax=b, dense problem)
- Updated twice a year: SC‘xy in the States in November, meeting in Mannheim, Germany in June
- All data available from www.top500.org
[Figure: TPP performance; axes: Size, Rate]
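The yardstick above can be illustrated with a small sketch. It is not the benchmark code itself: it solves a random dense Ax=b in pure Python with Gaussian elimination and partial pivoting, then converts the elapsed time into a rate using the nominal LINPACK operation count 2/3 n^3 + 2 n^2. The function names solve_dense and linpack_rate are hypothetical, chosen only for this example.

```python
import random
import time


def solve_dense(a, b):
    """Solve Ax = b by Gaussian elimination with partial pivoting.

    Works on copies, so the caller's inputs are left untouched.
    """
    n = len(a)
    a = [row[:] for row in a]
    b = b[:]
    for k in range(n):
        # Partial pivoting: bring the largest remaining entry of column k up.
        p = max(range(k, n), key=lambda i: abs(a[i][k]))
        if a[p][k] == 0.0:
            raise ValueError("matrix is singular")
        a[k], a[p] = a[p], a[k]
        b[k], b[p] = b[p], b[k]
        for i in range(k + 1, n):
            m = a[i][k] / a[k][k]
            for j in range(k, n):
                a[i][j] -= m * a[k][j]
            b[i] -= m * b[k]
    # Back substitution on the resulting upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(a[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (b[i] - s) / a[i][i]
    return x


def linpack_rate(n, seed=0):
    """Return (Mflop/s, max residual) for one random n-by-n dense solve."""
    rng = random.Random(seed)
    a = [[rng.uniform(-1.0, 1.0) for _ in range(n)] for _ in range(n)]
    b = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    start = time.perf_counter()
    x = solve_dense(a, b)
    elapsed = time.perf_counter() - start
    flops = (2.0 / 3.0) * n**3 + 2.0 * n**2  # nominal LINPACK operation count
    residual = max(
        abs(sum(a[i][j] * x[j] for j in range(n)) - b[i]) for i in range(n)
    )
    return flops / elapsed / 1e6, residual


if __name__ == "__main__":
    rate, residual = linpack_rate(100)
    print(f"n=100: {rate:.1f} Mflop/s, residual {residual:.2e}")
```

Interpreted Python will report a rate far below what the same machine reaches with a tuned library; the point is only how a solve time becomes a flop rate.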
5
Fastest Computer Over Time
[Figure: GFlop/s (0-70) vs. year (1990-2000): Cray Y-MP (8), TMC CM-2 (2048), Fujitsu VP-2600]
A computation that took 1 full year to complete in 1980 can now be done in ~ 10 hours!
6
Fastest Computer Over Time
[Figure: GFlop/s (0-700) vs. year (1990-2000): Cray Y-MP (8), TMC CM-2 (2048), Fujitsu VP-2600, NEC SX-3 (4), TMC CM-5 (1024), Fujitsu VPP-500 (140), Intel Paragon (6788), Hitachi CP-PACS (2040)]
A computation that took 1 full year to complete in 1980 can now be done in ~ 16 minutes!
7
Fastest Computer Over Time
[Figure: GFlop/s (0-7000) vs. year (1990-2000): Cray Y-MP (8), TMC CM-2 (2048), Fujitsu VP-2600, NEC SX-3 (4), TMC CM-5 (1024), Fujitsu VPP-500 (140), Intel Paragon (6788), Hitachi CP-PACS (2040), Intel ASCI Red (9152), ASCI Blue Pacific SST (5808), SGI ASCI Blue Mountain (5040), Intel ASCI Red Xeon (9632), ASCI White Pacific (7424)]
A computation that took 1 full year to complete in 1980 can today be done in ~ 27 seconds!
8
Fastest Computer Over Time
[Figure: TFlop/s (0-70) vs. year (1990-2002): Cray Y-MP (8), TMC CM-2 (2048), Fujitsu VP-2600, NEC SX-3 (4), TMC CM-5 (1024), Fujitsu VPP-500 (140), Intel Paragon (6788), Hitachi CP-PACS (2040), Intel ASCI Red (9152), ASCI Blue Mountain (5040), Intel ASCI Red Xeon (9632), ASCI White Pacific (7424), Japanese Earth Simulator NEC (5104)]
A computation that took 1 full year to complete in 1980 can today be done in ~ 5.4 seconds!
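The "one full year in 1980" comparisons on the Fastest Computer Over Time slides translate directly into speedup factors. A minimal sketch of that arithmetic, assuming a 365-day year (the helper name implied_speedup is hypothetical):

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # assumes a 365-day year


def implied_speedup(new_runtime_seconds):
    """Speedup implied when a one-year job shrinks to the given runtime."""
    return SECONDS_PER_YEAR / new_runtime_seconds


# Runtimes quoted on the four slides, converted to seconds.
quoted_runtimes = {
    "~10 hours": 10 * 3600,
    "~16 minutes": 16 * 60,
    "~27 seconds": 27,
    "~5.4 seconds": 5.4,
}

if __name__ == "__main__":
    for label, seconds in quoted_runtimes.items():
        print(f"{label:>13}: speedup of about {implied_speedup(seconds):,.0f}x")
```

The last entry corresponds to a speedup of several million over the 1980 baseline.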
9
Machines at the Top of the List
Year  Computer                                     Measured Gflop/s  Factor Δ from Previous Year  Theoretical Peak Gflop/s  Factor Δ from Previous Year  Number of Processors  Efficiency
1993  Fujitsu NWT                                  124.5             -                            236                       -                            140                   53%
1994  Intel Paragon XP/S MP                        281.1             2.3                          338                       1.4                          6768                  83%
1995  Intel Paragon XP/S MP                        281.1             1                            338                       1.0                          6768                  83%
1996  Hitachi CP-PACS                              368.2             1.3                          614                       1.8                          2048                  60%
1997  Intel ASCI Option Red (200 MHz Pentium Pro)  1338              3.6                          1830                      3.0                          9152                  73%
1998  ASCI Blue-Pacific SST, IBM SP 604E           2144              1.6                          3868                      2.1                          5808                  55%
1999  ASCI Red Intel Pentium II Xeon core          2379              1.1                          3207                      0.8                          9632                  74%
2000  ASCI White-Pacific, IBM SP Power 3           4938              2.1                          11136                     3.5                          7424                  44%
2001  ASCI White-Pacific, IBM SP Power 3           7226              1.5                          11136                     1.0                          7424                  65%
2002  Earth Simulator Computer, NEC                35860             5.0                          40960                     3.7                          5120                  88%
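The efficiency column in the table is the measured LINPACK rate divided by the theoretical peak. A short sketch that recomputes it from the measured and peak Gflop/s figures quoted on the slide (the names TOP_MACHINES and efficiency_percent are hypothetical):

```python
# (year, computer, measured Gflop/s, theoretical peak Gflop/s), from the slide.
TOP_MACHINES = [
    (1993, "Fujitsu NWT", 124.5, 236),
    (1994, "Intel Paragon XP/S MP", 281.1, 338),
    (1995, "Intel Paragon XP/S MP", 281.1, 338),
    (1996, "Hitachi CP-PACS", 368.2, 614),
    (1997, "Intel ASCI Option Red", 1338, 1830),
    (1998, "ASCI Blue-Pacific SST", 2144, 3868),
    (1999, "ASCI Red Pentium II Xeon", 2379, 3207),
    (2000, "ASCI White-Pacific", 4938, 11136),
    (2001, "ASCI White-Pacific", 7226, 11136),
    (2002, "Earth Simulator", 35860, 40960),
]


def efficiency_percent(measured, peak):
    """Fraction of theoretical peak actually achieved, in percent."""
    return 100.0 * measured / peak


if __name__ == "__main__":
    for year, name, measured, peak in TOP_MACHINES:
        print(f"{year}  {name:<26} {efficiency_percent(measured, peak):5.1f}%")
```

Rounded to whole percent, the results agree with the efficiency values listed for each year.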
10
A Tour de Force in Engineering
♦ Homogeneous, Centralized, Proprietary, Expensive!
♦ Target Application: CFD-Weather, Climate, Earthquakes
♦ 640 NEC SX/6 Nodes (mod)
Ø 5120 CPUs which have vector ops
♦ 40 TeraFlops (peak)
♦ $250-$500 million for things in building
♦ Footprint of 4 tennis courts
♦ 7 MWatts
Ø Say 10 cents/kWh: $16.8K/day = $6M/year!
♦ Expect to be on top of Top500 until 60-100 TFlop ASCI machine arrives
♦ For the Top500 (June 2002)
Ø Equivalent ~ 1/6 Σ Top 500
Ø Performance of ESC > Σ Next Top 12 Computers
Ø Σ of all the DOE computers = 27.5 TFlop/s
Ø Performance of ESC > All the DOE + DOD machines (37.2 TFlop/s)
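The power-cost and "1/6 of the Top500" figures on this slide are simple arithmetic. A minimal sketch that reproduces them from the slide's own inputs (7 MW, 10 cents per kWh) and from the 35.86 TF/s and 220 TF/s values quoted elsewhere in the talk; the function name electricity_cost is hypothetical:

```python
def electricity_cost(power_mw, dollars_per_kwh):
    """Return (cost per day, cost per 365-day year) in dollars."""
    kwh_per_day = power_mw * 1000.0 * 24.0
    per_day = kwh_per_day * dollars_per_kwh
    return per_day, per_day * 365.0


if __name__ == "__main__":
    per_day, per_year = electricity_cost(7.0, 0.10)
    print(f"${per_day:,.0f}/day, ${per_year / 1e6:.1f}M/year")

    # Earth Simulator Rmax as a share of the June 2002 Top500 total.
    share = 35.86 / 220.0
    print(f"Earth Simulator share of Top500 sum: {share:.3f} (1/6 = {1 / 6:.3f})")
```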
11
Top10 of the Top500
Rank  Manufacturer  Computer                               Rmax [TF/s]  Installation Site                              Country  Year  Area of Installation  # Proc
1     NEC           Earth-Simulator                        35.86        Earth Simulator Center                         Japan    2002  Research              5120
2     IBM           ASCI White SP Power3                   7.23         Lawrence Livermore National Laboratory         USA      2000  Research              8192
3     HP            AlphaServer SC ES45 1 GHz              4.46         Pittsburgh Supercomputing Center               USA      2001  Academic              3016
4     HP            AlphaServer SC ES45 1 GHz              3.98         Commissariat a l’Energie Atomique (CEA)        France   2001  Research              2560
5     IBM           SP Power3 375 MHz                      3.05         NERSC/LBNL                                     USA      2001  Research              3328
6     HP            AlphaServer SC ES45 1 GHz              2.92         Los Alamos National Laboratory                 USA      2002  Research              2048
7     Intel         ASCI Red                               2.38         Sandia National Laboratory                     USA      1999  Research              9632
8     IBM           pSeries 690 1.3 GHz                    2.31         Oak Ridge National Laboratory                  USA      2002  Research              864
9     IBM           ASCI Blue Pacific SST, IBM SP 604e     2.14         Lawrence Livermore National Laboratory         USA      1999  Research              5808
10    IBM           pSeries 690 1.3 GHz                    2.00         IBM/US Army Research Lab (ARL)                 USA      2002  Vendor                768
12
TOP500 - Performance
[Figure: TOP500 performance, Jun-93 to Jun-02, log scale from 100 Mflop/s to 1 Pflop/s. Curves: SUM (1.17 TF/s to 220 TF/s), N=1 (59.7 GF/s to 35.8 TF/s), N=500 (0.4 GF/s to 134 GF/s). Labeled N=1 systems: Fujitsu 'NWT' NAL, Intel ASCI Red Sandia, IBM ASCI White LLNL, NEC Earth Simulator. "My Laptop" marked for comparison]
13
Performance Extrapolation
[Figure: TOP500 performance extrapolated from Jun-93 to Jun-10, log scale from 100 MFlop/s to 10 PFlop/s. Curves: Sum, N=1, N=500. Marked: Earth Simulator, ASCI Purple, "My Laptop"]
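An extrapolation of this kind can be sketched with a two-point exponential fit. This is only an illustration, not the method used to draw the slide: it assumes the Jun-93 and Jun-02 SUM values from the TOP500 performance chart (1.17 TF/s and 220 TF/s) and steady exponential growth between and beyond them. The helper names annual_growth and years_until are hypothetical.

```python
import math


def annual_growth(start, end, years):
    """Constant yearly growth factor that takes start to end in the given years."""
    return (end / start) ** (1.0 / years)


def years_until(current, target, growth):
    """Years needed to reach target from current at a constant growth factor."""
    return math.log(target / current) / math.log(growth)


if __name__ == "__main__":
    # Assumed data points in Gflop/s: TOP500 SUM in Jun-93 and Jun-02 (9 years).
    sum_1993, sum_2002 = 1170.0, 220000.0
    g = annual_growth(sum_1993, sum_2002, 9)
    print(f"SUM grows by about {g:.2f}x per year")
    wait = years_until(sum_2002, 1.0e6, g)  # 1 PFlop/s = 1e6 Gflop/s
    print(f"SUM would reach 1 PFlop/s about {wait:.1f} years after Jun-02")
```

A two-point fit is sensitive to outliers such as the Earth Simulator, so the result is best read as a rough trend rather than a forecast.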
14
Manufacturers
HP 168 (12 < 100), IBM 164 (47 < 100)
[Figure: number of TOP500 systems (0-500) by manufacturer, Jun-93 to Jun-02: Cray, SGI, IBM, Sun, HP, TMC, Intel, Fujitsu, NEC, Hitachi, others]
15
Architectures
[Figure: number of TOP500 systems (0-500) by architecture, Jun-93 to Jun-02: Single Processor, SMP, MPP, SIMD, Constellation, Cluster - NOW. Example systems labeled: Y-MP C90, Sun HPC, Paragon, CM5, T3D, T3E, SP2, Cluster of Sun HPC, ASCI Red, CM2, VP500, SX3]
Constellation: # of p/n … n
16
Kflops per Inhabitant
[Figure: bar chart of Kflops per inhabitant (0-700) for Japan, USA, Germany, Scandinavia, UK, France, Switzerland, Italy, Luxembourg; bar labels 450, 358, 245, 207, 203, 158, 141, 67, 643; contribution labels 283 and 76]
White is ES contribution and Blue is ASCI contribution
17
80 Clusters on the Top500
♦ A total of 42 Intel based and 8 AMD based PC clusters are in the TOP500.
Ø31 of these Intel based clusters are IBM Netfinity systems delivered by IBM.
♦ A substantial part of these are installed at industrial customers, especially in the oil industry.
ØIncluding 5 Sun and 5 Alpha based clusters and 21 HP AlphaServer.
♦ 14 of these clusters are labeled as 'Self-Made'.
18
Cluster on the Top500
[Figure: number of clusters in the TOP500 (0-80), Jun-97 to Jun-02, by type: AMD, Intel, IBM Netfinity, Alpha, HP Alpha Server, Sparc]
Processor Breakdown: Pentium III, 37, 46%; Alpha, 25, 31%; AMD, 8, 10%; Sparc, 5, 6%; Pentium 4, 3, 4%; Itanium, 2, 3%
19
Distributed and Parallel Systems
[Figure: spectrum from distributed systems (heterogeneous) to massively parallel systems (homogeneous). Systems placed along it: SETI@home, Entropia/UD, Grid based Computing, Google, Network of ws, Clusters w/special interconnect, Parallel Dist mem, ASCI Tflop/s, Earth Simulator]
Distributed systems (heterogeneous)
♦ Gather (unused) resources
♦ Steal cycles
♦ System SW manages resources
♦ System SW adds value
♦ 10% - 20% overhead is OK
♦ Resources drive applications
♦ Time to completion is not critical
♦ Time-shared
♦ SETI@home
Ø ~ 400,000 machines
Ø Averaging 40 Tflop/s
Massively parallel systems (homogeneous)
♦ Bounded set of resources
♦ Apps grow to consume all cycles
♦ Application manages resources
♦ System SW gets in the way
♦ 5% overhead is maximum
♦ Apps drive purchase of equipment
♦ Real-time constraints
♦ Space-shared
♦ Earth Simulator
Ø 5000 processors
Ø Averaging 35 Tflop/s
20
SETI@home: Global Distributed Computing
♦ Running on 500,000 PCs, ~1000 CPU Years per Day
Ø 485,821 CPU Years so far
♦ Sophisticated Data & Signal Processing Analysis
♦ Distributes Datasets from Arecibo Radio Telescope
21
SETI@home
♦ Use thousands of Internet-connected PCs to help in the search for extraterrestrial intelligence.
♦ When a participant's computer is idle or being wasted, this software downloads a 300 kilobyte chunk of data for analysis. Each client performs about 3 Tflop of work on a chunk in 15 hours.
♦ The results of this analysis are sent back to the SETI team and combined with those of thousands of other participants.
♦ Largest distributed computation project in existence
Ø2500 machines today
ØAveraging 40 Tflop/s
♦ Today a number of companies are trying this for profit.
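The per-client figure can be turned into a rough aggregate rate. This sketch assumes that "3 Tflops in 15 hours" means 3 x 10^12 floating-point operations per work unit, and that every participating PC works at that pace; both are assumptions made for illustration, and the function names are hypothetical.

```python
def client_rate_flops(work_flop, hours):
    """Sustained flop/s of one client that finishes work_flop in the given hours."""
    return work_flop / (hours * 3600.0)


def aggregate_tflops(num_clients, work_flop=3.0e12, hours=15.0):
    """Aggregate rate in Tflop/s if every client sustains the same pace."""
    return num_clients * client_rate_flops(work_flop, hours) / 1.0e12


if __name__ == "__main__":
    per_client = client_rate_flops(3.0e12, 15.0)
    print(f"one client: about {per_client / 1e6:.1f} Mflop/s")
    for clients in (400_000, 500_000):
        print(f"{clients:,} clients: about {aggregate_tflops(clients):.1f} Tflop/s")
```

Under these assumptions the aggregate comes out in the tens of Tflop/s, the same order of magnitude as the 40 Tflop/s average quoted on the slides.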
22
Grid Computing - from ET to Anthrax
23
♦ Google query attributes
Ø150M queries/day (2000/second)
Ø3B documents in the index
♦ Data centers
Ø15,000 Linux systems in 6 data centers
Ø15 TFlop/s and 1000 TB total capability
Ø40-80 1U/2U servers/cabinet
Ø100 Mbit Ethernet switches/cabinet with gigabit Ethernet uplink
Øgrowth from 4,000 systems (June 2000)
Ø18M queries then
♦ Performance and operation
Øsimple reissue of failed commands to new servers
Øno performance debugging
Øproblems are not reproducible
Source: Monika Henzinger, Google
24
Motivation for Grid Computing
In the past: Isolation
Today: Collaboration
♦ Today there is a complex interplay and increasing interdependence among the sciences.
♦ Many science and engineering problems require widely dispersed resources be operated as systems.
♦ What we do as collaborative infrastructure developers will have profound influence on the future of science.
♦ Networking, distributed computing, and parallel computation research have matured to make it possible for distributed systems to support high-performance applications, but...
ØResources are dispersed
ØConnectivity is variable
ØDedicated access may not be possible
25
The Grid
26
Grids are Hot
IPG NASA http://nas.nasa.gov/~wej/home/IPG
Globus http://www.globus.org/
Legion http://www.cs.virgina.edu/~grimshaw/
AppLeS http://www-cse.ucsd.edu/groups/hpcl/
NetSolve http://www.cs.utk.edu/netsolve/
NINF http://phase.etl.go.jp/ninf/
Condor http://www.cs.wisc.edu/condor/
CUMULVS http://www.epm.ornl.gov/cs/
WebFlow http://www.npac.syr.edu/users/gcf/
NGC http://www.nordicgrid.net
27
University of Tennessee Deployment: Scalable Intracampus Research Grid: SInRG
♦ Federated Ownership: CS, Chem Eng., Medical School, Computational Ecology, El. Eng.
♦ Real applications, middleware development, logistical networking
The Knoxville Campus has two DS-3 commodity Internet connections and one DS-3 Internet2/Abilene connection.
An OC-3 ATM link routes IP traffic between the Knoxville campus, National Transportation Research Center, and Oak Ridge National Laboratory. UT participates in several national networking initiatives including Internet2 (I2), Abilene, the federal Next Generation Internet (NGI) initiative, Southern Universities Research Association (SURA) Regional Information Infrastructure (RII), and Southern Crossroads (SoX).
The UT campus consists of a meshed ATM OC-12 being migrated over to switched Gigabit by early 2002.
28
Grids vs. Capability Computing
♦ Not an “either/or” question
Øeach addresses different needs
Øboth are part of an integrated solution
♦ Grid strengths
Øcoupling necessarily distributed resources
Øinstruments, archives, and people
Øeliminating time and space barriers
Øremote resource access and capacity computing
ØGrids are not a cheap substitute for capability HPC
♦ Capability computing strengths
Øsupporting foundational computations
Øterascale and petascale “nation scale” problems
Øengaging tightly coupled teams and computations
29
Software Technology & Performance
♦ Tendency to focus on hardware
♦ Software required to bridge an ever widening gap
♦ The gap between usable and deliverable performance is very steep
ØPerformance only if the data and controls are set up just right
ØOtherwise, dramatic performance degradations, very unstable situation
ØWill become more unstable
♦ The challenge of libraries, PSEs, and tools is formidable at the Tflop/s level, even greater at Pflop/s; some might say insurmountable.
30
Where Does the Performance Go? or Why Should I Care About the Memory Hierarchy?
[Figure: Processor-DRAM Memory Gap (latency). Performance (log scale, 1-1000) vs. time (1980-2000). CPU (µProc): 60%/yr. (2X/1.5yr), “Moore’s Law”. DRAM: 9%/yr. (2X/10 yrs). Processor-Memory Performance Gap: grows 50% / year]
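The roughly 50% per year gap growth follows from compounding the two rates quoted on the chart. A minimal sketch of that calculation, using only the 60%/yr and 9%/yr figures from the slide (the function names are hypothetical):

```python
def gap_growth_per_year(cpu_rate, dram_rate):
    """Yearly factor by which the processor/DRAM performance ratio widens."""
    return (1.0 + cpu_rate) / (1.0 + dram_rate)


def gap_after(years, cpu_rate=0.60, dram_rate=0.09):
    """Cumulative processor/DRAM gap after the given number of years."""
    return gap_growth_per_year(cpu_rate, dram_rate) ** years


if __name__ == "__main__":
    yearly = gap_growth_per_year(0.60, 0.09)
    print(f"gap widens by about {100 * (yearly - 1):.0f}% per year")
    for years in (5, 10, 20):
        print(f"after {years:2d} years the gap is about {gap_after(years):,.0f}x")
```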
31
Optimizing Computation and Memory Use
♦ Computational optimizations
ØTheoretical peak: (# fpus) * (flops/cycle) * MHz
ØPentium 4: (1 fpu) * (2 flops/cycle) * (2.53 GHz) = 5060 MFLOP/s
♦ Operations like:
Ø α = x^T y : 2 operands (16 Bytes) needed for 2 flops; at 5060 Mflop/s this requires 5060 MW/s bandwidth
Ø y = α x + y : 3 operands (24 Bytes) needed for 2 flops; at 5060 Mflop/s this requires 7590 MW/s bandwidth
♦ Memory optimization
ØTheoretical peak: (bus width) * (bus speed)
ØPentium 4: (32 bits) * (533 MHz) = 2132 MB/s = 266 MW/s
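The numbers on this slide can be combined into a simple balance calculation: how fast can a memory-bound kernel run if every operand has to come over the bus? The sketch below uses only the slide's Pentium 4 figures and assumes 8-byte words; the function names are hypothetical and the model ignores caches entirely.

```python
def peak_mflops(fpus, flops_per_cycle, mhz):
    """Theoretical compute peak in Mflop/s."""
    return fpus * flops_per_cycle * mhz


def bus_mwords_per_s(bus_bits, bus_mhz, word_bytes=8):
    """Theoretical memory bandwidth in millions of 8-byte words per second."""
    return (bus_bits / 8.0) * bus_mhz / word_bytes


def memory_bound_mflops(bandwidth_mw, operands, flops):
    """Flop rate sustainable when each group of flops needs that many operands."""
    return bandwidth_mw * flops / operands


if __name__ == "__main__":
    peak = peak_mflops(1, 2, 2530)            # 5060 Mflop/s
    bandwidth = bus_mwords_per_s(32, 533)     # 266.5 MW/s
    dot = memory_bound_mflops(bandwidth, operands=2, flops=2)
    axpy = memory_bound_mflops(bandwidth, operands=3, flops=2)
    print(f"peak {peak} Mflop/s, bus {bandwidth:.1f} MW/s")
    print(f"dot product from memory: {dot:.1f} Mflop/s ({100 * dot / peak:.1f}% of peak)")
    print(f"axpy from memory: {axpy:.1f} Mflop/s ({100 * axpy / peak:.1f}% of peak)")
```

Both kernels land at a few percent of peak when run out of main memory, which is the case for paying attention to data reuse.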
32
Memory Hierarchy
♦ By taking advantage of the principle of locality:
ØPresent the user with as much memory as is available in the cheapest technology.
ØProvide access at the speed offered by the fastest technology.
[Figure: memory hierarchy, from the Processor (Control, Datapath, Registers) through On-Chip Cache, Level 2 and 3 Cache (SRAM), Main Memory (DRAM), Distributed Memory, Remote Cluster Memory, and Secondary Storage (Disk) to Tertiary Storage (Disk/Tape). Speed (ns) ranges from 1s at the registers to 10,000,000,000s (10s sec) at tertiary storage; size (bytes) ranges from 100s through Ks, Ms, and Gs to Ts]
33
Self Adapting Software
♦ Software system that …
ØObtains information on the underlying system where it will run.
ØAdapts the application to the presented data and the available resources, perhaps providing automatic algorithm selection.
ØDuring execution, performs optimization and perhaps reconfigures based on newly available resources.
ØAllows the user to provide for faults and recovers without additional user involvement.
♦ The moral of the story
ØWe know the concepts of how to improve things.
ØCapture insights/experience – do what humans do well
ØAutomate the dull stuff
34
Software Generation Strategy - ATLAS BLAS
♦ Takes ~ 20 minutes to run, generates Level 1, 2, & 3 BLAS
♦ “New” model of high performance programming where critical code is machine generated using parameter optimization.
♦ Designed for modern architectures
Ø Need reasonable C compiler
♦ Today ATLAS is used within various ASCI and SciDAC activities and by Matlab, Mathematica, Octave, Maple, Debian, Scyld Beowulf, SuSE,…
♦ Parameter study of the hw
♦ Generate multiple versions of code, with different values of key performance parameters
♦ Run and measure the performance for various versions
♦ Pick best and generate library
♦ Level 1 cache multiply optimizes for:
Ø TLB access
Ø L1 cache reuse
Ø FP unit usage
Ø Memory fetch
Ø Register reuse
Ø Loop overhead minimization
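The generate, run, measure, and pick-best loop described above can be sketched in a few lines. This is not ATLAS code and makes no claim about how ATLAS is implemented: it tunes a single parameter (the block size of a pure-Python blocked matrix multiply) by timing each candidate and keeping the fastest. The names matmul_blocked and tune_block_size are hypothetical.

```python
import random
import time


def matmul_blocked(a, b, n, nb):
    """Multiply two n-by-n matrices (lists of lists) using nb-by-nb blocks."""
    c = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, nb):
        for kk in range(0, n, nb):
            for jj in range(0, n, nb):
                for i in range(ii, min(ii + nb, n)):
                    row_a, row_c = a[i], c[i]
                    for k in range(kk, min(kk + nb, n)):
                        a_ik, row_b = row_a[k], b[k]
                        for j in range(jj, min(jj + nb, n)):
                            row_c[j] += a_ik * row_b[j]
    return c


def tune_block_size(n, candidates, repeats=2, seed=0):
    """Time every candidate block size and return (best, all timings)."""
    rng = random.Random(seed)
    a = [[rng.random() for _ in range(n)] for _ in range(n)]
    b = [[rng.random() for _ in range(n)] for _ in range(n)]
    timings = {}
    for nb in candidates:
        best = float("inf")
        for _ in range(repeats):  # keep the best of a few runs to reduce noise
            start = time.perf_counter()
            matmul_blocked(a, b, n, nb)
            best = min(best, time.perf_counter() - start)
        timings[nb] = best
    return min(timings, key=timings.get), timings


if __name__ == "__main__":
    best_nb, timings = tune_block_size(64, [4, 8, 16, 32, 64])
    for nb, seconds in timings.items():
        print(f"nb={nb:3d}: {seconds:.4f} s")
    print(f"selected block size: {best_nb}")
```

In an interpreted language the timing differences mostly reflect loop overhead rather than cache behavior, so the sketch shows the search structure rather than realistic tuning results.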
35
ATLAS (DGEMM n = 500)
♦ ATLAS is faster than all other portable BLAS implementations and it is comparable with machine-specific libraries provided by the vendor.
♦ Looking at sparse operations
[Figure: DGEMM (n = 500) performance in MFLOP/S (0-3500) for Vendor BLAS, ATLAS BLAS, and F77 BLAS across architectures: AMD Athlon-600, DEC ev56-533, DEC ev6-500, HP9000/735/135, IBM PPC604-112, IBM Power2-160, IBM Power3-200, Intel P-III 933 MHz, Intel P-4 2.53 GHz w/SSE2, SGI R10000ip28-200, SGI R12000ip30-270, Sun UltraSparc2-200]
36
GrADS - Grid Application Development System
♦ Problem: Grid has distributed, heterogeneous, dynamic resources; how do we use them?
♦ Goal: reliable performance on dynamically changing resources
♦ Minimize work of preparing an application for Grid execution
Ø Provide generic versions of key components (currently built in to applications or manually done)
ØE.g., scheduling, application launch, performance monitoring
♦ Provide high-level programming tools to help automate application preparation
ØPerformance modeler, mapper, binder
37
NSF/NGS GrADS - GrADSoft Architecture
♦ Goal: reliable performance on dynamically changing resources
[Figure: GrADSoft architecture. Components: Source Application, Whole-Program Compiler, Libraries, Software Components, Configurable Object Program, Binder, Scheduler, Resource Negotiator, Grid Runtime System, Real-time Performance Monitor. Flows: Performance Feedback, Performance Problem, Negotiation]
PIs: Ken Kennedy, Fran Berman, Andrew Chien, Keith Cooper, JD, Ian Foster, Lennart Johnsson, Dan Reed, Carl Kesselman, John Mellor-Crummey, Linda Torczon & Rich Wolski
39
Intelligent Component
♦ System to mediate between user application and multiple possible libraries
♦ Self-Adaptivity and Learning Behavior
Ø Heuristics are tuned based on data
Ø System gradually gets smarter (database)
Ø The system can educate the user
♦ User Interaction
Ø User can guide the system by providing further information
Ø System teaches user about properties of the data
40
Research Areas
♦ Automatically generating performance models (e.g. for ScaLAPACK) on Grid resources
♦ Evaluating Performance “Contracts”
♦ Near Optimal Scheduling (execution) on the Grid
♦ Rescheduling for changing resources
♦ Checkpointing and fault tolerance
♦ High-latency tolerant algorithms (SANS ideas)
♦ Porting applications/libraries to GrADS framework
♦ Developing generic GrADSoft interfaces (API’s)
41
LAPACK For Clusters
♦ Developing middleware which couples cluster system information with the specifics of a user problem to launch cluster based applications on the “best” set of resources available.
♦ Using ScaLAPACK as the prototype software, but developing a framework
[Figure: cluster configuration: users, etc.; ~ Mbit switch (fully connected); ~ Gbit switch (fully connected); remote memory server, e.g. IBP (TCP/IP); local network file server, SUN’s NFS (UDP/IP), e.g. 100 Mbit]
42
GrADS ScaLAPACK versus Non-GrADS ScaLAPACK
[Figure: time taken (seconds, 0-2500), broken down into ScaLAPACK Application, Grid Overhead / Spawn, and GrADS Overhead, for N=20000 GrADS (14 proc) vs. Non-GrADS (17 proc); N=15000 GrADS (15 proc) vs. Non-GrADS (17 proc); N=10000 GrADS (10 proc) vs. Non-GrADS (17 proc); N=5000 GrADS (6 proc) vs. Non-GrADS (17 proc)]
Grid consists of 17 machines from two heterogeneous, shared (possibly loaded) clusters.
GrADS schedules execution on appropriate machines. Non-GrADS uses all the machines.
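Why a subset of machines can beat using all 17 is easy to see with a toy performance model. The sketch below is purely illustrative and is not the GrADS scheduler: it assumes that an evenly distributed factorization runs at the pace of the slowest machine in the chosen set and that each extra machine adds a fixed overhead. The model, the overhead constant, and the names modeled_time and choose_machines are all hypothetical.

```python
def modeled_time(speeds_mflops, n, per_machine_overhead=0.5):
    """Toy model: even data split, so the slowest machine sets the pace."""
    p = len(speeds_mflops)
    flops = (2.0 / 3.0) * n**3
    return flops / (p * min(speeds_mflops) * 1e6) + per_machine_overhead * p


def choose_machines(effective_mflops, n, per_machine_overhead=0.5):
    """Try the fastest 1, 2, ..., p machines and keep the best modeled time."""
    ranked = sorted(effective_mflops.items(), key=lambda kv: kv[1], reverse=True)
    best_names, best_time = None, float("inf")
    for p in range(1, len(ranked) + 1):
        subset = ranked[:p]
        t = modeled_time([speed for _, speed in subset], n, per_machine_overhead)
        if t < best_time:
            best_names, best_time = [name for name, _ in subset], t
    return best_names, best_time


if __name__ == "__main__":
    # Hypothetical effective speeds (Mflop/s) after accounting for current load.
    machines = {"a": 500.0, "b": 500.0, "c": 500.0, "d": 50.0}
    chosen, seconds = choose_machines(machines, n=5000)
    print(f"use {chosen}: modeled time {seconds:.1f} s")
    print(f"all machines: modeled time {modeled_time(list(machines.values()), 5000):.1f} s")
```

In this example the heavily loaded machine is left out, because including it would slow every other participant down.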
43
Research Directions
♦ Parameterizable libraries
♦ Fault tolerant algorithms
♦ Annotated libraries
♦ Hierarchical algorithm libraries
♦ “Grid” (network) enabled strategies
A new division of labor between compiler writers, library writers, algorithm developers, and application developers will emerge.
44
Futures for Numerical Algorithms and Software
♦ Numerical software will be adaptive, exploratory, and intelligent
♦ Determinism in numerical computing will be gone.
Ø After all, it’s not reasonable to ask for exactness in numerical computations.
Ø Auditability of the computation, reproducibility at a cost
♦ Importance of floating point arithmetic will be undiminished.
Ø 16, 32, 64, 128 bits and beyond.
♦ Reproducibility, fault tolerance, and auditability
♦ Adaptivity is a key so applications can effectively use the resources.
45
Collaborators / Support
♦ TOP500
ØH. Meuer, Mannheim U
ØH. Simon, NERSC
ØE. Strohmaier, NERSC
♦ GrADS
ØSathish Vadhiyar, UTK
ØAsim YarKhan, UTK
ØKen Kennedy, Fran Berman, Andrew Chien, Keith Cooper, Ian Foster, Carl Kesselman, Lennart Johnsson, Dan Reed, Linda Torczon, & Rich Wolski
♦ Thanks
ØNext Generation Software (NGS), NSF
ØScientific Discovery through Advanced Computing (SciDAC)