High Performance Computing

(1)

Trey Breckenridge

Computing Systems Manager, Engineering Research Center, Mississippi State University

(2)

What is High Performance Computing?

• HPC is ill defined and context dependent.

• In the late 1980s, the US Government defined supercomputers as processors capable of more than 100 MFLOPS. This definition is clearly obsolete, as modern desktop PCs are capable of about 5 GFLOPS.

• Another approach is to describe HPC as the fastest computers at any point in time; however, that definition depends more on budget than on technology.

• For the intent of this presentation, we will define HPC as:

Computing resources which provide at least an order of magnitude more computing power than is normally

(3)

What does the definition really mean?

• That definition sounds like HPC is hardware only. Isn’t the software important too?

• HPC covers the full range of supercomputing activities, including existing supercomputer systems, special-purpose and experimental systems, and the new generation of large-scale parallel architectures.

• HPC exists on a broad range of computer systems, from departmental clusters of desktop workstations to large parallel processing systems.

(4)

Why High Performance Computing?

• To achieve the maximum amount of computation in a minimum amount of time – SPEED!

• To solve problems that could not be solved without large computer systems.

• Traditionally, HPC has been used in scientific and engineering fields for work with massively complex simulations.

• Computations are typically “floating point” intensive.

(5)

Areas of HPC Use

• Traditional:

– Computational Fluid Dynamics (CFD)

– Climate, Weather, and Ocean Modeling and Simulation (CWO)

– Nuclear Modeling and Simulation

– Geophysical/Petroleum Modeling

• Emerging:

– Computer Graphics/Scientific Visualization

– Financial Modeling

– Database Applications

– Bioinformatics

(6)

Parallel Computing

• A collection of processing elements that can communicate and cooperate to solve large problems more quickly than a single processing element.

• Simultaneous use of multiple processors to execute different parts of a program.

• Goal: to reduce the wall-clock time of a run.

• No single processor is ever again likely to match the performance of existing parallel HPC systems.

(7)

Types of Parallelism

• Overt

– Parallelism is visible to the programmer

– May be difficult to program (correctly)

– Large improvements in performance

• Covert

– Parallelism is not visible to the programmer

– Compiler responsible for parallelism

– Easy to do

(8)

Speed Up

• Speed Up is one quantitative measure of the benefit of parallelism.

• Speed Up is defined as S / T(N), where

– S = best serial time

– T(N) = time required for N processors

• Since S/N is the best possible parallel time, speedup typically should not exceed N.

• S is sometimes difficult to measure, causing many people to substitute T(1) for S.
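The definition above can be sketched in a few lines of Python. The timings are made-up numbers for illustration only, and T(1) is substituted for S as the last bullet describes; the function name `speedup` is an assumption of this sketch.

```python
def speedup(serial_time, parallel_time):
    """Return S / T(N): serial time divided by time on N processors."""
    if serial_time <= 0 or parallel_time <= 0:
        raise ValueError("times must be positive")
    return serial_time / parallel_time


# Hypothetical wall-clock timings in seconds, keyed by processor count N.
timings = {1: 120.0, 2: 64.0, 4: 35.0, 8: 22.0}

# T(1) stands in for S here, since the best serial time is hard to measure.
serial = timings[1]
speedups = {n: speedup(serial, t) for n, t in timings.items()}

for n, s in speedups.items():
    print(f"N={n:2d}  speedup={s:5.2f}  (ideal {n})")
```

With these sample numbers every speedup stays below N, as the S/N bound suggests it typically should.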

(9)

(10)

Efficiency

• Speed up does not measure how efficiently the processors are being used

– Is it worth using 100 processors to get a speed up of 2?

• Efficiency is defined as the ratio of the speed up to the number of processors required to achieve it

– The best efficiency is 1
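As a small worked example of the ratio above, the sketch below evaluates the 100-processor case from the slide alongside an ideal case. The function name `efficiency` and the second data point are assumptions made for illustration.

```python
def efficiency(speedup, processors):
    """Return speedup divided by the number of processors used."""
    if processors < 1:
        raise ValueError("processors must be at least 1")
    return speedup / processors


low = efficiency(2.0, 100)    # speed up of 2 on 100 processors
ideal = efficiency(8.0, 8)    # hypothetical perfect linear speed up

print(f"100 processors, speed up 2: efficiency {low:.2f}")
print(f"8 processors, speed up 8:   efficiency {ideal:.2f}")
```

An efficiency of 0.02 means that, on average, each processor contributes 2% of what it could in the ideal case.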

(11)

(12)

Processors

• Vector

– Large rows of data are operated on simultaneously

• Scalar

– Data is operated on in a sequential fashion

– Instruction sets

  • Complex Instruction Set Computer (CISC)

  • Reduced Instruction Set Computer (RISC)

  • Post-RISC or CISC/RISC

    – UltraSPARC

    – IBM Power4

    – IA64

(13)

Scalar vs. Vector Arithmetic

C     Element-wise sum of arrays b and c, stored in a
      DO 10 i = 1, n
         a(i) = b(i) + c(i)
   10 CONTINUE

Scalar:                    Vector:

a(1) = b(1) + c(1)         a = b + c
a(2) = b(2) + c(2)
…
a(n) = b(n) + c(n)

(14)

Where is Scalar better?

• If the vector length is small

• If the loop contains IF statements

• If partial vectorization involves large overhead

• If recursion is used
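One reading of the recursion item is a loop in which each iteration needs the result of the previous one (a recurrence). The Python sketch below assumes that reading and shows why such a loop cannot simply be turned into a whole-array operation: reading all right-hand sides at once uses stale values. Both function names are illustrative only.

```python
def sequential_recurrence(a, b):
    """Scalar order: each a[i] uses the a[i-1] that was just computed."""
    out = list(a)
    for i in range(1, len(out)):
        out[i] = out[i - 1] + b[i]
    return out


def naive_vector_recurrence(a, b):
    """Whole-array form: every right-hand side is read from the old a."""
    out = list(a)
    shifted = a[:-1]                       # old a[0..n-2], read all at once
    out[1:] = [x + y for x, y in zip(shifted, b[1:])]
    return out


a = [1, 0, 0, 0]
b = [0, 1, 1, 1]
print(sequential_recurrence(a, b))
print(naive_vector_recurrence(a, b))
```

The two results differ, which is why a loop of this shape is usually left in scalar form unless it is restructured first.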

(15)

Architectural Classifications

Flynn’s Taxonomy

• Published by Flynn in 1972

– Outdated, but still widely used

• Categorizes machines by instruction streams and data streams

– A stream of instructions (the algorithm) tells the computer what to do.

– A stream of data (the input) is affected by these instructions.

• Four Categories

– SISD – Single Instruction, Single Data

– MISD – Multiple Instruction, Single Data

– SIMD – Single Instruction, Multiple Data

– MIMD – Multiple Instruction, Multiple Data

(16)

SISD

Single Instruction, Single Data

• Conventional single processor computers

• Each arithmetic instruction initiates an operation on a data item taken from a single stream of data elements.

• Historical supercomputers and most contemporary microprocessors are SISD

(17)

SIMD

Single Instruction, Multiple Data

• Many simple processing elements (thousands)

• Each processor has its own local memory

• Each processor runs the same program

• Each processor processes different data streams

• All processors work in lock-step (synchronously)

• Very efficient for array/matrix operations

• Most older vector/array computers are SIMD

• Example machines:

– Cray YMP

(18)

MISD

Multiple Instruction, Single Data

• Very few machines fit this category

• None have been commercially successful or have had any impact on computational science

(19)

MIMD

Multiple Instruction, Multiple Data

• Most diverse of the four classifications

• Multiple processors

• Each processor either has its own memory or accesses shared memory

• Each processor can run the same or different programs

• Each processor processes different data streams

• Processors can work synchronously or asynchronously

(20)

MIMD (cont.)

• Processors can be either tightly or loosely coupled

• Examples include:

– Processors and memory units specifically designed to be components of a parallel architecture (e.g., Intel Paragon)

– Large scale parallel machines built from “off the shelf” workstations (e.g., Beowulf Cluster)

– Small scale multiprocessors made by connecting multiple vector processors together (e.g., Cray T90)

– Wide variety of other designs as well

(21)

SPMD Computing

• Not a Flynn category, per se, but instead a combination of categories.

• SPMD stands for single program, multiple data

– The same program is run on the processors of an MIMD machine.

– Occasionally the processors may synchronize.

– Because an entire program is executed on separate data, it is possible that different branches are taken, leading to asynchronous parallelism.

• SPMD came about from a desire to do SIMD-like calculations on MIMD machines

– SPMD is not a hardware paradigm, but instead the software equivalent of SIMD
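The SPMD idea can be sketched in Python, under the assumption that threads from the standard library stand in for the processors of an MIMD machine. The function `program`, the rank count, and the data are all hypothetical; the point is only that one program runs everywhere and may branch differently depending on its own data.

```python
from concurrent.futures import ThreadPoolExecutor


def program(rank, size, data):
    """The single program: every rank runs this on its own slice."""
    chunk = data[rank::size]               # this rank's share of the data
    total = sum(chunk)
    # The branch taken depends on local data, so ranks may diverge.
    if total % 2 == 0:
        return rank, "even-branch", total
    return rank, "odd-branch", total * 10


data = list(range(10))
size = 4                                    # hypothetical number of ranks
with ThreadPoolExecutor(max_workers=size) as pool:
    results = list(pool.map(lambda r: program(r, size, data), range(size)))

for rank, branch, value in results:
    print(rank, branch, value)
```

Collecting the results at the end plays the role of the occasional synchronization mentioned above.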

(22)

Memory Classifications

Organization

• Shared Memory (SM-MIMD)

– Bus based

– Interconnection network

• Distributed Memory (DM-MIMD)

– Local

– Message passing

• Virtual shared memory (VSM-MIMD)

– Physically distributed, but appears as one image

Access

• Uniform Memory Access (UMA)

(23)

Memory Organization

Shared Memory

• One common memory block between all processors

Bus Based

• Since the bus has limited bandwidth, the number of processors that can be used is limited to a few tens.

• Examples include typical multiprocessor PCs, SGI

(24)

Memory Organization

Switch based

• Uses a (complex) interconnection network to connect processors to shared memory modules

• May use multi-stage networks - NUMA

• Increases bandwidth to memory over bus-based systems

• Every processor still has access to global memory

(25)

Memory Organization

Distributed Memory

• Message Passing. Memory physically distributed through the machine. Each processor has private memory.

• Contents of private memory can only be accessed by that processor. If required by another processor, then it must be sent explicitly.

• In general, machines can be scaled to thousands of processors.

• Requires special programming techniques.

(26)

Memory Organization

Virtual Shared Memory

• Objective is to have the scalability of distributed memory with the programmability of shared memory

• Global address space mapped onto physically distributed memory

• Data moves between processors “on demand” or as it is accessed

(27)

Compute Clusters

• A compute cluster connects multiple standalone machines via a network interconnect and uses software to access the combined systems as one computer.

• The standalone machines could be inexpensive single-processor workstations or multimillion-dollar multiprocessor servers.

• Individual machines can be connected via numerous networking technologies using a variety of topologies.

– 100BaseT Ethernet – inexpensive, low performance, high latency

– Myrinet (2 Gb/s) – expensive, high performance, low latency

– Proprietary high speed network

• Nearly 20% of the fastest 500 supercomputers in the world are clusters.

(28)

Beowulf Clusters

• First developed in 1994 at NASA Goddard

• Goal is to build a supercomputer utilizing a large number of inexpensive, commodity off-the-shelf (COTS) parts.

• Increasingly used for HPC applications due to the high cost of MPPs and the wide availability of networked workstations.

• Not a panacea for HPC. Many applications require shared memory or vector solutions.

• Existing Beowulf clusters range from 2 to 4000 processors and are likely to reach 10,000 processors in the near future.

(29)

Metacomputing

• Metacomputing is a dynamic environment with an informal pool of nodes that can join or leave the environment whenever they desire.

– SETI@HOME

• Why do we need metacomputing?

– Our computational needs are infinite

– Our financial resources are finite

• Someday we will utilize computing cycles just like we utilize electricity from the power company.

– Enables us to “buy” cycles on an as-needed basis.

• Commonly referred to as “The Grid” or “Computational Grids”

(30)

Job Execution

• Most HPC systems do not allow interactive access.

• Batch-style jobs are submitted to the system via a queuing mechanism.

• Schedulers determine the order in which jobs should be run. Factors include

– User priority

– Resource availability

• The goal of the Scheduler is to maximize system utilization.

• Scheduler optimization is an important component and is a field of study of its own.
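A single scheduling pass over a batch queue might look like the sketch below. It considers only the two factors named above, user priority and resource availability, and it is a deliberately simple illustration, not the policy of any real scheduler. The job names, priorities, and processor counts are invented.

```python
import heapq


def schedule(jobs, free_processors):
    """Return job names in start order for one scheduling pass.

    jobs: list of (name, priority, processors_needed); a lower priority
    number means more urgent. Jobs that do not fit are skipped for now.
    """
    heap = [(priority, index, name, need)
            for index, (name, priority, need) in enumerate(jobs)]
    heapq.heapify(heap)
    started = []
    while heap:
        priority, index, name, need = heapq.heappop(heap)
        if need <= free_processors:        # resource availability check
            free_processors -= need
            started.append(name)
    return started


queue = [("cfd_run", 2, 64), ("weather", 1, 96), ("viz", 3, 16)]
print(schedule(queue, free_processors=128))
```

Here the small low-priority job starts ahead of a larger, higher-priority one because it fits in the processors left over, which raises utilization at the cost of strict priority order.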

(31)

(32)

Programming Languages

• It has been said, “I don’t know what language they will be using to program high performance computers 10 years from now, but I do know it will be called FORTRAN.”

• C and C++ are making strides in the HPC community due to their ability to create complex data structures and better I/O routines.

• FORTRAN 90 incorporated many of the features of C (e.g., pointers).

• High Performance Fortran (HPF) is FORTRAN 90 with directive-based extensions allowing for shared and distributed memory machines – clusters, traditional supercomputers, and massively parallel processors

• Today, many programmers prefer to do their data structures, communications, etc. in C, while doing the computations in FORTRAN.

(33)

Compilers

• Compilers are an often overlooked area of HPC, but are of critical importance.

• Application run times are directly related to the ability of the compiler to produce highly optimized code.

– Poor compiler optimization could result in run times increasing by an order of magnitude.

• Optimization Levels

– None, Basic, Interprocedural analysis, Runtime profile analysis, Floating-point, Data flow analysis, Advanced

(34)

Distributed Memory Parallel Programming

• Message passing is a programming paradigm where one effectively writes multiple programs for parallel execution.

• The problem must be decomposed, typically by domain or function.

• Each process knows only about its own local data. If data is required from a different process, it must send a message to that process asking for the data.

• Access to remote data is much slower than to local data, so a major objective is to minimize remote communications.
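The sketch below illustrates the message passing style in Python under a loud assumption: threads and `queue.Queue` mailboxes stand in for separate processes and a real message passing library. Each worker touches only its own chunk, and the only data exchanged is one explicit message per worker. The names `worker`, `inboxes`, and `results` are hypothetical.

```python
import queue
import threading


def worker(rank, local_data, inboxes, results):
    """Sum local data; rank 0 also collects the partial sums it is sent."""
    partial = sum(local_data)               # uses only this rank's data
    if rank != 0:
        inboxes[0].put((rank, partial))     # explicit send to rank 0
        return
    total = partial
    for _ in range(len(inboxes) - 1):
        sender, value = inboxes[0].get()    # blocking receive
        total += value
    results.put(total)


data = list(range(1, 101))                  # problem: sum of 1..100
size = 4                                    # hypothetical process count
chunks = [data[r::size] for r in range(size)]    # domain decomposition
inboxes = [queue.Queue() for _ in range(size)]
results = queue.Queue()

threads = [threading.Thread(target=worker,
                            args=(r, chunks[r], inboxes, results))
           for r in range(size)]
for t in threads:
    t.start()
for t in threads:
    t.join()

total = results.get()
print(total)
```

Only three small messages cross between workers, which reflects the goal of keeping remote communication to a minimum.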

(35)

Message Passing Environments

• PVM – Parallel Virtual Machine

– Portable and operable across heterogeneous computers

– Performance sacrificed for flexibility

– Well defined protocol allows for interoperability between different implementations

• MPI – Message Passing Interface

– Today’s standard for message passing

– Widely adopted by most vendors

– Portable and operable across heterogeneous computers

– Good performance with reasonable efficiency

(36)

Shared Memory Parallel Programming

• Every processor has direct access to the memory of every other processor in the system

• Not widely used at programmer level, but widely used at the system level (even on single-processor systems via multithreading)

• Allows low-latency, high-bandwidth communications

• Portability is poor

• Easy to program (compared to message passing)

• Directive controlled parallelism
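For contrast with message passing, here is a minimal shared memory sketch using Python threads. It follows an explicit threading model rather than directives, so it illustrates the shared-data idea only; the names `shared`, `lock`, and `accumulate` are assumptions of this example.

```python
import threading

shared = {"total": 0}          # memory visible to every thread
lock = threading.Lock()        # protects updates to the shared total


def accumulate(values):
    """Add a slice of values into the shared total."""
    subtotal = sum(values)     # private work, no synchronization needed
    with lock:                 # one thread at a time updates shared state
        shared["total"] += subtotal


data = list(range(1, 101))
workers = 4                    # hypothetical thread count
threads = [threading.Thread(target=accumulate, args=(data[i::workers],))
           for i in range(workers)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(shared["total"])
```

No data is sent anywhere: every thread writes directly to the same variable, and the lock is the only coordination needed.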

(37)

Shared Memory Environments

• POSIX Threads (Pthreads)

• SHMEM

• OpenMP

– Quickly becoming the standard API for shared memory programming

– Emphasis on performance and scalability

– Allows for fine-grain or coarse-grain parallelism

– Some implementations are interoperable with MPI and PVM

(38)

Benchmarking

• Benchmarking is an important aspect of HPC and is used for purchase decisions, system configuration, and application tuning.

• Rule 1: All vendors lie about their benchmarks!!

• Purchase decisions should not be based on published benchmark results. If at all possible, run your code on the exact machine you are considering for purchase.

• LINPACK

– Mother of all benchmarks

– Not originally designed to be a benchmark, but instead a set of high performance library routines for linear algebra.

– Reports average megaflop rates by dividing the total number of floating-point operations by time
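The rate calculation in the last item can be sketched as follows. This is not LINPACK itself: a simple y = y + alpha * x loop is timed instead, assuming two floating-point operations per element, and the measured figure reflects the Python interpreter rather than the hardware. The helper name `mflops` is illustrative.

```python
import time


def mflops(flop_count, seconds):
    """Average rate: floating-point operations / time, in millions."""
    return flop_count / seconds / 1e6


n = 200_000
x = [1.0] * n
y = [2.0] * n
alpha = 0.5

start = time.perf_counter()
for i in range(n):
    y[i] = y[i] + alpha * x[i]     # one multiply and one add per element
elapsed = time.perf_counter() - start

rate = mflops(2 * n, elapsed)
print(f"{rate:.1f} MFLOPS")
```

Because the result is an average over the whole run, it hides how the rate varies between different parts of a real application.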

(39)

(40)

Summary

• HPC is parallel computing.

• HPC involves a broad spectrum of components, and is only as fast as the weakest component, whether that be processor, memory, network interconnect, compiler, or software.

• HPC exists on a broad range of computer systems, from departmental clusters of desktop workstations to large parallel processing systems.

(41)

Additional Information

• Dowd, Kevin and Severance, Charles. High Performance Computing, Second Edition. O’Reilly & Associates, Inc., 1998.

• Dongarra, Jack. High Performance Computing: Technology, Methods and Applications. Elsevier, 1995.

• Buyya, Rajkumar. High Performance Cluster Computing, Volume 1. Prentice Hall PTR, 1999.

• Foster, Ian and Kesselman, Carl. The Grid: Blueprint for a New Computing Infrastructure. Morgan Kaufmann