25 years of High Performance Computing: An Application Perspective

(1)

Jackson State University Internet Seminar

September 28 2004

Geoffrey Fox

Computer Science, Informatics, Physics

Pervasive Technology Laboratories

Indiana University, Bloomington IN 47401

[email protected]

(2)

Personal Perspective I

n So I understood high performance computing in detail from 1980-1995

• Hypercube, HPF, CRPC, Sundry Grand Challenges

• I used HPT (Holes in Paper Tape) as an undergraduate

n I summarized applications for the “Source Book of Parallel Computing” in 2002 and had 3 earlier books

n I tried (and failed so far) to develop Java as a good high performance computing language from 1996-2003

n My last grant in this area developed HPJava (http://www.hpjava.org) but it was small and ended in 2003

n I have watched DoD scientists develop parallel code in their HPC Modernization Program

n I have worked closely with NASA on Earthquake Simulations from 1997-now

(3)

Personal Perspective II

n I have studied broadly requirements and best practice in biology and complex systems (e.g. critical infrastructure and network) simulations

n Nearly all my research is nowadays in Grid (distributed) computing

n I have struggled to develop computational science as an academic discipline

n I taught classes in parallel computing and computational science to a dwindling audience, with the last ones in 1998 and 2000 and a new one next semester to JSU!

n I read High End Computing Revitalization reports/discussions and remembered Petaflop meetings

n I will discuss views from the “wilderness” of new users and new applications

(4)

Some Impressions I

n Computational Science is highly successful, with simulations in 1980 being “toy 2D explorations” whereas today we have full 3D multidisciplinary simulations with magnificent visualization

• 128 node hypercube in 1983-5 had about 3 megaflop performance but it did run at 80% efficiency

• LANL told us about multigrid in 1984

• Today’s highest end machines including Earth Simulator can realize teraflop performance at 1-10% of peak speed

• The whole talk can be devoted to descriptions of these simulations and their visualizations

n Curiously Japan seems to view the Earth Simulator as a failure (it didn’t predict anything special) whereas the US is using it as a motivation for several large new machines

n Some industry has adopted HPC (oil, drug) but runs at modest capability and most of the action is in capacity computing and embarrassingly parallel computations (finance, biotech)

• Aerospace is in between

(5)

Some Impressions II

n There is a group of users with HPC knowledge at their fingertips who can use current hardware with great effectiveness and maximum realistic efficiency

n I suspect in most fields, the knowledge of “average” users is at best an ability to use MPI crudely and their use of machines will be good only if they are wise enough to use good libraries like PETSc

• From 1980-95, users were “early adopters” and willing to endure great pain; today we need to support a broader range of mainstream users

• “strategy for HPC” is different for new users and new applications

n Computer Science students (at universities I have been at) have little interest in algorithms or software for parallel computing

n Increasing gulf between the Internet generation raised on Python and Java and the best tools (Fortran, C, C++) of HPC

• Matlab and Mathematica represent another disparate approach

• Java Grande was meant to address this but failed

(6)

Top10 HPC Machines June 2004

(7)

Top500 from Dongarra

Flops

(8)

Current HPC Systems

(9)

Memory Bandwidth/FLOP

Characteristics of Commodity Networks

(10)

Some Impressions III

n At 100,000 ft the situation today in HPC is not drastically different from that expected around 1985

• Simulations getting larger in size and sophistication

• Move from regular to adaptive irregular data structures

• Growing importance of multidisciplinary simulations

• Perhaps Moore’s law has continued and will continue for longer than expected

• Computation reasonably respected as a science methodology

n I expected more performance increase from explicit parallelism and less from more sophisticated chips

• i.e. I expected all machines (PCs) to be (very) parallel and software like Word to be parallel

• I expected $10^5$ to $10^6$-way not $10^4$-way parallelism in high end supercomputers with nCUBE/Transputer/CM2 plus Weitek style architectures

(11)

Some Impressions IV

n So parallel applications succeeded roughly as expected but the manner was a little different

n As expected, essentially all scientific simulations could be parallelized and the CS/Applied Math/Application community has developed remarkable algorithms

• As noted, many scientists are unaware of them today, and some techniques like adaptive meshes and multipole methods are not easy to understand and use

• Field of parallel algorithms so successful that it has almost put itself out of business

n The parallel software model MPI is roughly the same as the “mail box communication” system described, say, in a 1980 memo by myself and Eugene Brookes (LLNL)

• Even in 1980 we thought it pretty bad
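To make the "mail box communication" idea concrete, here is a minimal sketch in Python. It is not the scheme from the 1980 memo and it is not MPI. It only illustrates the style the slide refers to, where each process owns a mailbox and all cooperation happens through explicit sends and receives. The names `Mailboxes` and `ring_sum` are hypothetical, and threads stand in for processors.

```python
import queue
import threading


class Mailboxes:
    """One FIFO mailbox per rank; an illustrative stand-in for message passing."""

    def __init__(self, num_ranks):
        self._boxes = [queue.Queue() for _ in range(num_ranks)]

    def send(self, dest, message):
        # Deposit a message in the destination rank's mailbox.
        self._boxes[dest].put(message)

    def receive(self, rank, timeout=5.0):
        # Block until a message arrives in this rank's own mailbox.
        return self._boxes[rank].get(timeout=timeout)


def ring_sum(num_ranks):
    """Pass a token along ranks 0..num_ranks-1; each rank adds its own id."""
    boxes = Mailboxes(num_ranks)
    result = []

    def worker(rank):
        token = boxes.receive(rank) + rank
        if rank == num_ranks - 1:
            result.append(token)          # last rank reports the total
        else:
            boxes.send(rank + 1, token)   # forward to the next rank

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(num_ranks)]
    for t in threads:
        t.start()
    boxes.send(0, 0)                      # inject the initial token
    for t in threads:
        t.join()
    return result[0]


if __name__ == "__main__":
    print(ring_sum(4))
```

Even in this toy, the programmer has to spell out who sends what to whom and in which order, which is the burden the slide complains about.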

(12)

Why $10^6$ Processors Should Work I?

n In days gone by, we derived formulae for overhead

Speed Up $S$ = Efficiency $\varepsilon \cdot$ Number of Processors $N$

n $\varepsilon = 1/(1+f)$ where $f$ is Overhead

n $f$ comes from communication cost, load imbalance and any difficulties from parallel algorithm being ineffective

n $f(\text{communication}) = \text{Constant} \cdot t_{comm} / (n^{1/d} \cdot t_{calc})$

n This worked in 1982 and in analyses I have seen recently, it still works

[Figure: $t_{comm}$ and $t_{calc}$]

(13)

Why $10^6$ Processors Should Work II?

n $f(\text{communication}) = \text{Constant} \cdot t_{comm} / (n^{1/d} \cdot t_{calc})$

n Here $n$ is grain size and $d$ (often 3) is (geometric) dimension.

n So this supports scaled speed-up: efficiency is constant if we scale problem size $nN$ proportional to number of processors $N$

n As $t_{comm}/t_{calc}$ is naturally independent of $N$, there seems no obvious reason that one can’t scale parallelism to large $N$ for large problems at fixed $n$

n However many applications complain about large $N$ and say their applications won’t scale properly and demand the less cost effective architecture with fewer but more powerful nodes
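The overhead formula can be checked numerically with a short sketch. The function names, the default `constant=1.0` and the ratio `t_comm_over_t_calc=1.0` are illustrative assumptions, not values from the talk. Only the functional form $f = \text{Constant} \cdot t_{comm} / (n^{1/d} \cdot t_{calc})$ and $\varepsilon = 1/(1+f)$ come from the slides.

```python
def comm_overhead(grain_size, dim=3, t_comm_over_t_calc=1.0, constant=1.0):
    """Communication overhead f = constant * (t_comm / t_calc) / grain_size**(1/dim)."""
    return constant * t_comm_over_t_calc / grain_size ** (1.0 / dim)


def efficiency(grain_size, **model):
    """Efficiency = 1 / (1 + f); note that it does not depend on processor count."""
    return 1.0 / (1.0 + comm_overhead(grain_size, **model))


def speedup(num_procs, grain_size, **model):
    """Speed-up S = efficiency * N at fixed grain size (scaled speed-up)."""
    return efficiency(grain_size, **model) * num_procs


if __name__ == "__main__":
    # Fixed grain size of 1000 points per processor in 3D (assumed value).
    for num_procs in (10**2, 10**4, 10**6):
        print(num_procs, efficiency(1000), speedup(num_procs, 1000))
```

Under these assumptions the efficiency column is identical for every processor count, which is the point of the slide: in this model nothing prevents scaling to $10^6$ processors as long as the grain size $n$ is held fixed.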

(14)

Some Impressions V

n I always thought of parallel computing as a map from an application through a model to a computer

n I am surprised that modern HPC computer architectures do not clearly reflect physical structure of most applications

• After all Parallel Computing Works because Mother Nature and Society (which we are simulating) are parallel

n GRAPE and earlier particle dynamics successfully match special characteristics (low memory, communication bandwidth) of $O(N^2)$ algorithms.

Parallel Computing Works 1994

Of course vectors were introduced to reflect natural scientific data structures

Note Irregular problems still have geometrical structure even if no constant stride long vectors

I think mismatch between hardware and problem architecture reflects software (Languages)

(15)

Computational Science

Seismic Simulation of Los Angeles Basin

• This is (sophisticated) wave equation similar to Laplace example and you divide Los Angeles geometrically and assign roughly equal number of grid points to each processor
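A minimal sketch of the geometric division described above, assuming a regular 2D grid and a rectangular mesh of processors. The actual Los Angeles basin code is not shown in the talk, and the helper names `block_ranges`, `decompose_grid` and `points_owned` are hypothetical.

```python
def block_ranges(num_points, num_procs):
    """Split range(num_points) into contiguous blocks whose sizes differ by at most one."""
    base, extra = divmod(num_points, num_procs)
    ranges = []
    start = 0
    for rank in range(num_procs):
        size = base + (1 if rank < extra else 0)  # first `extra` ranks get one more point
        ranges.append((start, start + size))
        start += size
    return ranges


def decompose_grid(nx, ny, px, py):
    """Map each processor (i, j) of a px-by-py mesh to its rectangular sub-grid."""
    x_blocks = block_ranges(nx, px)
    y_blocks = block_ranges(ny, py)
    return {(i, j): (x_blocks[i], y_blocks[j]) for i in range(px) for j in range(py)}


def points_owned(block):
    """Number of grid points inside one ((x0, x1), (y0, y1)) block."""
    (x0, x1), (y0, y1) = block
    return (x1 - x0) * (y1 - y0)


if __name__ == "__main__":
    for proc, block in sorted(decompose_grid(100, 60, 4, 3).items()):
        print(proc, block, points_owned(block))
```

Each processor then updates only its own block and exchanges boundary values with its geometric neighbours, which is what keeps the communication overhead proportional to the surface rather than the volume of the block.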

(16)

Computational Science

Irregular 2D Simulation -- Flow over an Airfoil

• The Laplace grid points become finite element mesh nodal points arranged as triangles filling space

• All the action (triangles) is near the wing boundary

• Use domain decomposition, but no longer with equal area; instead each processor gets an equal triangle count
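One common way to obtain domains with equal element counts rather than equal areas is recursive coordinate bisection. The sketch below uses it as an illustration and is not necessarily the method behind the airfoil example. It assumes each triangle is represented by its centroid `(x, y)`, and the name `bisect_by_count` is hypothetical.

```python
def bisect_by_count(points, num_parts):
    """Recursively split points into num_parts groups of near-equal count.

    At each level the cut is made across the longer extent of the point set,
    at the position that balances the number of points, not the area.
    """
    if num_parts == 1:
        return [list(points)]
    if not points:
        return [[] for _ in range(num_parts)]
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    axis = 0 if (max(xs) - min(xs)) >= (max(ys) - min(ys)) else 1
    ordered = sorted(points, key=lambda p: p[axis])
    left_parts = num_parts // 2
    cut = len(ordered) * left_parts // num_parts  # share of points for the left side
    return (bisect_by_count(ordered[:cut], left_parts)
            + bisect_by_count(ordered[cut:], num_parts - left_parts))


if __name__ == "__main__":
    # Dense cluster of centroids near the "wing" plus a sparse far field (made-up data).
    near_wing = [(i * 0.001, (i % 10) * 0.001) for i in range(100)]
    far_field = [(1.0 + i, float(i)) for i in range(20)]
    parts = bisect_by_count(near_wing + far_field, 4)
    print([len(part) for part in parts])
```

The resulting domains are small where the mesh is dense and large where it is sparse, so every processor gets a similar amount of work.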

(17)

BlueGene/L has Architecture I expected to be natural

STOP PRESS: 16000 node BlueGene/L regains #1 TOP500 Position 29 Sept 2004

(18)

BlueGene/L Fundamentals

n Low Complexity nodes give more flops per transistor and per watt

n 3D Interconnect supports many scientific simulations as nature as we see it is 3D

(19)

1024 Nodes full system with hypercube Interconnect

1987 MPP

(20)

Prescott has 125 Million Transistors

Compared to nCUBE: 100X Clock, 500X Density

50000X Potential Peak Performance Improvement

Probably more like 1000X Realized Performance Improvement

(21)

1993 Top500

1993-2004: Number of Nodes 4X, Performance 1000X

(22)

Performance per Transistor

(23)

Some Impressions VI

n Two key features of today’s applications

• Is the simulation built on fundamental equations or phenomenological (coarse grained) degrees of freedom

• Is the application deluged with interesting data

n Most of HPCC activity 1990-2000 dealt with applications like QCD, CFD, structures, astrophysics, quantum chemistry, material science, neutron transport where reasonably accurate information is available to describe basic degrees of freedom

n Classic model is to set up numerics of “well established equations” (e.g. Navier Stokes) and solve with known boundary values and initial conditions

n Many interesting applications today have unknown boundary conditions, initial conditions and equations

• They have a lot of possibly streaming data instead

n For this purpose, the goal of Grid technology is to manage the experimental data

(24)

Data Deluged Science

n In the past, we worried about data in the form of parallel I/O or MPI-IO, but we didn’t consider it as an enabler of new algorithms and new ways of computing

n Data assimilation was not central to HPCC

n ASC set up because didn’t want test data!

n Now particle physics will get 100 petabytes from CERN

• Nuclear physics (Jefferson Lab) in same situation

• Use around 30,000 CPU’s simultaneously 24X7

n Weather, climate, solid earth (EarthScope)

n Bioinformatics curated databases (Biocomplexity only 1000’s of data points at present)

n Virtual Observatory and SkyServer in Astronomy

n Environmental Sensor nets

(25)

Weather Requirements

(26)

Data Deluged Science

[Figure labels: Data; Information; Ideas; Simulation; Model; Assimilation; Reasoning; Datamining; Computational Science; Informatics]

(27)

Virtual Observatory Astronomy Grid

Integrate Experiments

[Figure panels: Radio; Far-Infrared; Visible; Visible + X-ray; Dust Map; Galaxy Density]

(28)

DAME: Distributed Aircraft Maintenance Environment

Data Deluged Engineering

Rolls Royce and UK e-Science Program

~ Gigabyte per aircraft per Engine per transatlantic flight

~5000 engines

[Figure labels: In flight data; Airline; Maintenance Centre; Ground Station; Global Network such as SITA; Internet, e-mail, pager; Engine Health (Data) Center]

(29)

USArray Seismic Sensors

(30)

[Figure labels: Topography 1 km; Stress Change; Earthquakes; PBO; Site-specific Irregular Scalar Measurements; Constellations for Plate Boundary-Scale Vector Measurements; Ice Sheets; Volcanoes; Long Valley, CA; Northridge, CA; Hector Mine, CA; Greenland]

(31)

SERVOGrid

Geoscience Research and

[Figure labels: Databases; Research Simulations; Analysis and Visualization Portal; Repositories; Federated Databases; Data Filter Services; Field Trip Data; Streaming Data; Sensors; Discovery Services; Research; Education; Customization Services; From Research to Education; Education Grid; Computer Farm]

(32)

SERVOGrid Requirements

n Seamless Access to Data repositories and large scale computers

n Integration of multiple data sources including sensors, databases, file systems with analysis system

• Including filtered OGSA-DAI (Grid database access)

n Rich meta-data generation and access with SERVOGrid specific Schema extending openGIS (Geography as a Web service) standards and using Semantic Grid

n Portals with component model for user interfaces and web control of all capabilities

n Collaboration to support world-wide work

n Basic Grid tools: workflow and notification

n NOT metacomputing

(33)

Non Traditional Applications: Biology

n At a fine scale we have molecular dynamics (protein folding) and at the coarsest scale CFD (e.g. blood flow) and structures (body mechanics)

n A lot of interest is in between these scales with

• Genomics: largely pattern recognition or data mining

• Subcellular structure: reaction kinetics, network structure

• Cellular and above (organisms, biofilms) where cell structure matters: Cellular Potts Model

• Continuum Organ models, blood flow etc.

• Neural Networks

n Cells, Neurons, Genes are structures defining biology simulation approach in same way boundary layer defines CFD

n Data mining can be considered as a special case of a simulation where model is “pattern to be looked for” and data set determines “dynamics” (where the pattern is)

(34)

Non Traditional Applications: Earthquakes

n We know the dynamics at a coarse level (seismic wave propagation) and somewhat at a fine scale (granular physics for friction)

n Unknown details of constituents and sensitivity of phase transitions (earthquakes) to detail make it hard to use classical simulation methods to forecast earthquakes

n Data deluge (Seismograms, dogs barking, SAR) again does not directly tell you the friction laws etc. needed for classic simulations

n Approaches like “pattern informatics” combine data mining with simulation

• One is looking for “dynamics” of “earthquake signals” to see if the “big one” is preceded by a certain structure in small quakes or other phenomenology

(35)

Computational Infrastructure for “Old and New” Applications

n The Earth Simulator’s performance and continued leadership of TOP500 has prompted several US projects to regain the number 1 position

n There are scientific fields (Cosmology, Quantum Chromodynamics) which need very large machines

n There are “national security” applications (such as stockpile stewardship) which also need the biggest machines available

n However much of the most interesting science does not

• The “new” fields just need more work to be able to run high resolution studies

• Scientists typically prefer several “small” (100-500 node) clusters with dedicated “personal” management to single large machines

(36)

High-End Revitalization and the Crusades

n RADICAL CHANGE MAY NOT BE THE ANSWER: RESPONDING TO THE HEC (HPCwire 108424, September 24, 2004)

n Dear High-End Crusader …..

n There are various efforts aimed at revitalizing HPC

• http://www.cra.org/Activities/workshops/nitrd/

n Some involve buying large machines

n Others are investigating new approaches to HPC

• In particular addressing embarrassing software problem

• Looking at new architectures

n I have not seen many promising new software ideas

• We cannot it seems express parallelism in a way that is convenient for users and easy to implement

n I think hardware which is simpler and has more flops/transistor will improve performance of several key science areas but not all

(37)

Non Traditional Applications: Critical Infrastructure Simulations

n These include electrical/gas/water grids and Internet, transportation, cell/wired phone dynamics.

n One has some “classic SPICE style” network simulations in area like power grid (although load and infrastructure data incomplete)

• 6000 to 17000 generators

• 50000 to 140000 transmission lines

• 40000 to 100000 substations

n Substantial DoE involvement through DHS

(38)

Non Traditional Applications: Critical Infrastructure Simulations

n Activity data for people/institutions essential for detailed dynamics but again these are not “classic” data but need to be “fitted” in data assimilation style in terms of some assumed lower level model.

• They tell you goals of people but not their low level movement

n Disease and Internet virus spread and social network simulations can be built on dynamics coming from infrastructure simulations

• Many results like “small world” internet connection structure are qualitative and it is unclear if they can be extended to detailed simulations

• A lot of interest in (regulatory) networks in Biology

(39)

(Non) Traditional Structure

n 1) Traditional: Known equations plus boundary values

n 2) Data assimilation: somewhat uncertain initial conditions and approximations corrected by data assimilation

n 3) Data deluged Science: Phenomenological degrees of freedom swimming in a sea of data

[Figure labels: Known Data; Prediction; Known Equations on Agreed DoF; Phenomenological Degrees of Freedom Swimming in a Sea of Data]

(40)

[Figure: HPC Simulation fed by several Data Filters; distributed filters massage data for simulation; Other Grid]

(41)

Data Assimilation

n Data assimilation implies one is solving some optimization problem which might have Kalman Filter like structure

n Due to data deluge, one will become more and more dominated by the data ($N_{obs}$ much larger than number of simulation points).

n Natural approach is to form for each local (position, time) patch the “important” data combinations so that optimization doesn’t waste time on large error or insensitive data.

n Data reduction done in natural distributed fashion NOT on HPC machine as distributed computing most cost effective if calculations essentially independent

• Filter functions must be transmitted from HPC machine
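As a reminder of what "Kalman Filter like structure" means, here is the textbook update for a single scalar unknown. It is a sketch for intuition only, since the talk does not specify the optimization it has in mind. The function names and the Gaussian error assumption behind the variances are assumptions of this example.

```python
def kalman_update(estimate, variance, observation, obs_variance):
    """Blend a model estimate with one observation, weighting by their variances."""
    gain = variance / (variance + obs_variance)       # trust in the new observation
    new_estimate = estimate + gain * (observation - estimate)
    new_variance = (1.0 - gain) * variance            # uncertainty always shrinks
    return new_estimate, new_variance


def assimilate(estimate, variance, observations, obs_variance):
    """Fold a stream of observations of the same quantity into the estimate."""
    for observation in observations:
        estimate, variance = kalman_update(estimate, variance, observation, obs_variance)
    return estimate, variance


if __name__ == "__main__":
    # Model says 0.0 with variance 1.0; sensors keep reporting values near 2.0.
    print(assimilate(0.0, 1.0, [2.1, 1.9, 2.0, 2.05], obs_variance=0.5))
```

With many observations the result is dominated by the data rather than by the initial model value, which mirrors the slide's remark about $N_{obs}$ being much larger than the number of simulation points.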

(42)

Distributed Filtering

[Figure: HPC Machine and Distributed Machine; a Data Filter at each geographically distributed sensor patch (local patch 1, local patch 2) reduces $N_{obs}^{\text{local patch}}$ observations to $N_{filtered}^{\text{local patch}}$ values; the HPC machine sends the needed filter and receives the filtered data]

$N_{obs}^{\text{local patch}} \gg N_{filtered}^{\text{local patch}} \approx \text{Number\_of\_Unknowns}^{\text{local patch}}$

In simplest approach, filtered data gotten by linear transformations on original data based on Singular Value Decomposition of Least squares matrix

Factorize Matrix to product of
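The data reduction can be illustrated without any numerical library. The slide bases the filter on a Singular Value Decomposition of the least squares matrix; the sketch below swaps in the simpler normal-equation projection $A^T y$, which is also a linear transformation of the raw data and carries the same information for a well-conditioned fit. Everything here is an assumption for illustration: the straight-line model with two unknowns per patch, and the helper names `filter_patch` and `solve_2x2`.

```python
def filter_patch(design_rows, observations):
    """Run at the sensor patch: reduce N_obs readings to k filtered values.

    design_rows is the N_obs-by-k least squares matrix A (the "filter function"
    sent from the HPC machine); the return value is (A^T A, A^T y).
    """
    k = len(design_rows[0])
    ata = [[sum(row[i] * row[j] for row in design_rows) for j in range(k)]
           for i in range(k)]
    aty = [sum(row[i] * y for row, y in zip(design_rows, observations))
           for i in range(k)]
    return ata, aty


def solve_2x2(matrix, rhs):
    """Run at the HPC machine: solve the 2-unknown normal equations by Cramer's rule."""
    (a, b), (c, d) = matrix
    det = a * d - b * c
    return ((d * rhs[0] - b * rhs[1]) / det, (a * rhs[1] - c * rhs[0]) / det)


if __name__ == "__main__":
    # One patch: 1000 readings of a signal modelled as offset + slope * t (made-up data).
    times = [float(t) for t in range(1000)]
    design = [(1.0, t) for t in times]
    readings = [3.0 + 2.0 * t for t in times]
    ata, aty = filter_patch(design, readings)      # only 2 + 4 numbers leave the patch
    print(solve_2x2(ata, aty))
```

Only a handful of numbers per patch travel to the HPC machine instead of the full observation stream, and each patch can be filtered independently of the others.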

(43)

Some Questions for Non Traditional Applications

n No systematic study of how best to represent data deluged sciences without known equations

n Obviously data assimilation very relevant

n Role of Cellular Automata (CA) and refinements like the New Kind of Science by Wolfram

• Can CA or Potts model parameterize any system?

n Relationship to back propagation and other neural network representations

n Relationship to “just” interpolating data and then extrapolating a little

n Role of Uncertainty Analysis – everything (equations, model, data) is uncertain!

n Relationship of data mining and simulation

n A new trade-off: How to split funds between sensors and simulation engines

(44)

Some Impressions VII

n My impression is that the knowledge of how to use HPC machines effectively is not broadly distributed

• Many current users less sophisticated than you were in 1981

n Most simulations are still performed on sequential machines with approaches that make it difficult to parallelize

• Code has to be re-engineered to use MPI – major “productivity”

n The parallel algorithms in new areas are not well understood even though they are probably similar to those already developed

• Equivalent of multigrid (multiscale) not used – again mainly due to software engineering issues – it’s too hard

n Trade-off between time stepped and event driven simulations not well studied for new generation of network (critical infrastructure) simulations.
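The trade-off between the two simulation styles can be seen in a toy comparison. This is a sketch with made-up event times; the function names are hypothetical, and real infrastructure simulations would attach state changes to each event rather than just counting them.

```python
import heapq


def time_stepped(event_times, end_time, dt):
    """Advance a clock in fixed steps of dt; return (events handled, loop iterations)."""
    pending = sorted(event_times)
    handled = 0
    index = 0
    num_steps = int(round(end_time / dt))
    for step in range(1, num_steps + 1):
        now = step * dt
        # Process every event that became due during this step.
        while index < len(pending) and pending[index] <= now:
            handled += 1
            index += 1
    return handled, num_steps


def event_driven(event_times, end_time):
    """Jump directly from event to event; return (events handled, loop iterations)."""
    heap = list(event_times)
    heapq.heapify(heap)
    handled = 0
    while heap and heap[0] <= end_time:
        heapq.heappop(heap)   # the clock jumps straight to the next event time
        handled += 1
    return handled, handled


if __name__ == "__main__":
    events = [0.3, 2.7, 9.1]
    print(time_stepped(events, end_time=10.0, dt=0.5))
    print(event_driven(events, end_time=10.0))
```

With sparse events the event driven loop does far fewer iterations, while a time stepped loop has a regular structure that is easier to divide among processors. Which one wins for network simulations is the open question the slide raises.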

(45)

Some Impressions VIII

n I worked on Java Grande partly for obvious possible advantages of Java over Fortran/C++ as a language but also so HPC could better leverage the technologies and intellectual capital of the Internet generation

n I still think HPC will benefit from

n A) Building environments similar to those in the Internet world

• Why would somebody grow up using Internet goodies and then switch to Fortran and MPI for their “advanced” work

n B) Always asking when to use special HPC and when commodity software/architectures can be used

• Python often misused IMHO and standards like HPF, MPI don’t properly discuss hybrid HPC/Commodity systems and their relation

• The rule of the Millisecond

n I still think new languages (or dialects) that bridge simulation and data, HPC and commodity world are useful

(46)

Interaction of Commodity and HPC Software and Services

n Using commodity hardware or software obviously

• Saves money and

• Broadens community that can be involved e.g. base parallel language on Java or C# to involve the Internet generation

n Technologies roughly divide by communication latency

• I can get high bandwidth in all cases?

• e.g. Web Services and SOAP can use GridFTP and parallel streams as well as slow HTTP protocols

n >1 millisecond latency: message based services

n 10-1000 microseconds: method based scripting

n 1-20 microseconds: MPI

• Often used when there are better solutions

n < 1 microsecond: inlining, optimizing compilers etc.?

n To maximize re-use and eventual productivity, use the approach with highest acceptable latency

• Only 10% of code is the HPC part?
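The latency rule can be written down as a small lookup. The tier boundaries below are one reading of the ranges on the slide, which overlap between 10 and 20 microseconds, so treat the exact thresholds as assumptions. The names `TIERS` and `choose_approach` are hypothetical.

```python
# (minimum latency the approach typically imposes, in seconds; approach name)
# Ordered from highest latency to lowest, following the slide's tiers.
TIERS = [
    (1e-3, "message based services"),          # > 1 millisecond
    (10e-6, "method based scripting"),         # 10-1000 microseconds
    (1e-6, "MPI"),                             # 1-20 microseconds
    (0.0, "inlining / optimizing compilers"),  # < 1 microsecond
]


def choose_approach(acceptable_latency_seconds):
    """Pick the highest-latency (most reusable) approach the component can tolerate."""
    for min_latency, name in TIERS:
        if acceptable_latency_seconds >= min_latency:
            return name
    raise ValueError("latency must be non-negative")


if __name__ == "__main__":
    for latency in (5e-3, 100e-6, 5e-6, 1e-7):
        print(latency, choose_approach(latency))
```

Applied component by component, a rule like this would reserve MPI and compiler-level optimization for the small part of a code that actually needs microsecond latency.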