Parallel Programming and High-Performance Computing

(1)

Parallel Programming

and High-Performance Computing

Part 1: Introduction Dr. Ralf-Peter Mundani

(2)

• materials: http://www5.in.tum.de/lehre/vorlesungen/parhpp/SS08/

• Ralf-Peter Mundani

– email [email protected], phone 289–25057, room 3181 (city centre) – consultation-hour: Tuesday, 4:00—6:00 pm (room 02.05.058)

• Ioan Lucian Muntean

– email [email protected], phone 289–18692, room 02.05.059 • lecture (2 SWS)

– weekly

(3)

• content

– part 1: introduction

– part 2: high-performance networks – part 3: foundations

– part 4: programming memory-coupled systems – part 5: programming message-coupled systems – part 6: dynamic load balancing

(4)

• motivation

• classification of parallel computers • levels of parallelism

• quantitative performance evaluation

I think there is a world market for maybe five computers.

(5)

• numerical simulation: from phenomena to predictions

physical phenomenon technical process

1. modelling

determination of parameters, expression of relations 2. numerical treatment

model discretisation, algorithm development 3. implementation

software development, parallelisation 4. visualisation

illustration of abstract simulation results 5. validation

comparison of results with reality 6. embedding

mathematics

computer science application

(6)

• why parallel programming and HPC?

– complex problems (especially the so called “grand challenges”) demand for more computing power

• climate or geophysics simulation (tsunami, e. g.) • structure or flow simulation (crash test, e. g.) • development systems (CAD, e. g.)

• large data analysis (Large Hadron Collider at CERN, e. g.) • military applications (crypto analysis, e. g.)

• …

(7)

• objectives (in case all resources would be available N-times) – throughput: compute N problems simultaneously

• running N instances of a sequential program with different data sets (“embarrassing parallelism”); SETI@home, e. g.

• drawback: limited resources of single nodes

– response time: compute one problem at a fraction (1/N) of time

• running one instance (i. e. N processes) of a parallel program for jointly solving a problem; finding prime numbers, e. g.

• drawback: writing a parallel program; communication – problem size: compute one problem with N-times larger data

• running one instance (i. e. N processes) of a parallel program, using the sum of all local memories for computing larger problem sizes; iterative solution of SLE, e. g.

(8)

• motivation

• classification of parallel computers

• levels of parallelism

(9)

• definition: “A collection of processing elements that communicate and cooperate to solve large problems” (ALMASE and GOTTLIEB, 1989)

• possible appearances of such processing elements – specialised units (steps of a vector pipeline, e. g.)

– parallel features in modern monoprocessors (superscalar architectures, instruction pipelining, VLIW, multithreading, multicore, …)

– several uniform arithmetical units (processing elements of array computers, e. g.)

– processors of a multiprocessor computer (i. e. the actual parallel computers)

– complete stand-alone computers connected via LAN (work station or PC clusters, so called virtual parallel computers)

(10)

• reminder: dual core, quad core, manycore, and multicore

– observation: increasing frequency (and thus core voltage) over past years – problem: thermal power dissipation increases linearly in frequency and

(11)

• reminder: dual core, quad core, manycore, and multicore (cont’d)

– 25% reduction in frequency (and thus core voltage) leads to 50% reduction in dissipation

dissipation

performance

(12)

• reminder: dual core, quad core, manycore, and multicore (cont’d)

– idea: installation of two cores per die with same dissipation as single core system

dissipation

performance

(13)

• commercial parallel computers

– manufacturers: starting from 1983, big players and small start-ups (see tabular; out of business: no longer in the parallel business)

– names have been coming and going rapidly

– in addition: several manufacturers of vector computers and non-standard architectures

company country year status in 2003

Sequent U.S. 1984 acquired by IBM

Intel U.S. 1984 out of business

Meiko U.K. 1985 bankrupt

nCUBE U.S. 1985 out of business

(14)

• commercial parallel computers (cont’d)

company country year status in 2003

Encore U.S. 1986 out of business

Floating Point Systems U.S. 1986 acquired by SUN

Myrias Canada 1987 out of business

Ametek U.S. 1987 out of business

Silicon Graphics U.S. 1988 active

C-DAC India 1991 active

Kendall Square Research U.S. 1992 bankrupt

(15)

• arrival of clusters

– in the late eighties, PCs became a commodity market with rapidly increasing performance, mass production, and decreasing prices – growing attractiveness for parallel computers

– 1994: Beowulf, the first parallel computer built completely out of commodity hardware

• NASA Goddard Space Flight Centre • 16 Intel DX4 processors

• multiple 10 Mbit Ethernet links • Linux with GNU compilers • MPI library

– 1996: Beowulf cluster performing more than 1 GFlops

(16)

• arrival of clusters (cont’d)

– 2005: InfiniBand cluster at TUM

• 36 Opteron nodes (quad boards) • 4 Itanium nodes (quad boards)

• 4 Xeon nodes (dual boards) for interactive tasks • InfiniBand 4× Switch, 96 ports

(17)

• supercomputers

– supercomputing or high-performance scientific computing as the most important application of the big number crunchers

– national initiatives due to huge budget requirements

• Accelerated Strategic Computing Initiative (ASCI) in the U.S. – in the sequel of the nuclear testing moratorium in 1992/93 – decision: develop, build, and install a series of five

supercomputers of up to $100 million each in the U.S.

– start: ASCI Red (1997, Intel-based, Sandia National Laboratory, the world’s first TFlops computer)

– then: ASCI Blue Pacific (1998, LLNL), ASCI Blue Mountain, ASCI White, …

(18)

• supercomputers (cont’d)

– federal “Bundeshöchstleistungsrechner” initiative in Germany • decision in the mid-nineties

• three federal supercomputing centres in Germany (Munich, Stuttgart, and Jülich)

• one new installation every second year (i. e. a six year upgrade cycle for each centre)

• the newest one to be among the top 10 of the world

– overview and state of the art: Top500 list (updated every six month), see http://www.top500.org

(19)

• MOORE’s law

– observation of Intel co-founder Gordon E. MOORE,

describes important trend in history of computer hardware (1965)

number of transistors that can be placed on an integrated circuit is increasing exponentially, doubling approximately every two years

(20)

(21)

(22)

(23)

(24)

(25)

(26)

(27)

(28)

(29)

(30)

• The Earth Simulator – world’s #1 from 2002—04 – installed in 2002 in Yokohama, Japan

– ES-building (approx. 50m × 65m × 17m) – based on NEC SX-6 architecture

– developed by three governmental agencies – highly parallel vector supercomputer

– consists of 640 nodes (plus 2 control & 128 data switching) • 8 vector processors (8 GFlops each)

• 16 GB shared memory

(31)

• BlueGene/L – world’s #1 since 2004 – installed in 2005 at LLNL, CA, USA

(beta-system in 2004 at IBM)

– cooperation of DoE, LLNL, and IBM – massive parallel supercomputer

– consists of 65,536 nodes (plus 12 front-end and 1204 I/O nodes) • 2 PowerPC 440d processors (2.8 GFlops each)

• 512 MB memory

Î 131,072 processors (367 TFlops peak performance) and

33.5 TB memory; 280.6 TFlops sustained performance (Linpack)

– nodes configured as 3D torus (32 × 32 × 64); global reduction tree for fast operations (global max / sum) in a few microseconds

(32)

• HLRB II (world’s #6 for 04/2006)

– installed in 2006 at LRZ, Garching – installation costs 38 M€

– monthly costs approx. 400,000 € – upgrade in 2007 (finished)

– one of Germany’s 3 supercomputers – SGI Altix 4700

– consists of 19 nodes (SGI NUMA link 2D torus)

• 256 blades (ccNUMA link with partition fat tree)

(33)

• standard classification according to FLYNN

– global data and instruction streams as criterion

• instruction stream: sequence of commands to be executed • data stream: sequence of data subject to instruction streams – two-dimensional subdivision according to

• amount of instructions per time a computer can execute • amount of data elements per time a computer can process – hence, FLYNN distinguishes four classes of architectures

• SISD: single instruction, single data • SIMD: single instruction, multiple data • MISD: multiple instruction, single data • MIMD: multiple instruction, multiple data

(34)

• standard classification according to FLYNN (cont’d)

– SISD

• one processing unit that has access to one data memory and to one program memory

• classical monoprocessor following VON NEUMANN’s principle

processor program memory

(35)

– SIMD

• several processing units, each with separate access to a (shared or distributed) data memory; one program memory

• synchronous execution of instructions

• example: array computer, vector computer

• advantages: easy programming model due to control flow with a strict synchronous-parallel execution of all instructions

• drawbacks: specialised hardware necessary, easily becomes out-dated due to recent developments at commodity market

processor

program memory data memory

(36)

– MISD

• several processing units that have access to one data memory; several program memories

• not very popular class (mainly for special applications such as Digital Signal Processing)

• operating on a single stream of data, forwarding results from one processing unit to the next

• example: systolic array (network of primitive processing elements that “pump” data)

(37)

– MIMD

• several processing units, each with separate access to a (shared or distributed) data memory; several program memories

• classification according to (physical) memory organisation – shared memory Î shared (global) address space

– distributed memory Î distributed (local) address space • example: multiprocessor systems, networks of computers

processor program memory

data memory

(38)

• processor coupling

– cooperation of processors / computers as well as their shared use of various resources require communication and synchronisation

– the following types of processor coupling can be distinguished • memory-coupled multiprocessor systems (MemMS)

• message-coupled multiprocessor systems (MesMS)

global memory distributed memory

shared

(39)

• processor coupling (cont’d)

– central issues

• scalability: costs for adding new nodes / processors • programming model: costs for writing parallel programs

• portability: costs for portation (migration), i. e. transfer from one system to another while preserving executability and flexibility • load distribution: costs for obtaining a uniform load distribution

among all nodes / processors

– MemMS are advantageous concerning scalability, MesMS are typically better concerning the rest

– hence, combination of MemMS and MesMS for exploiting all advantages Î distributed / virtual shared memory (DSM / VSM) – physical distributed memory with global shared address space

(40)

– uniform memory access (UMA)

• each processor P has direct access via the network to each memory module M with same access times to all data

• standard programming model can be used (i. e. no explicit send / receive of messages necessary)

• communication and synchronisation via shared variables

(inconsistencies (write conflicts, e. g.) have to prevented in general by the programmer)

(41)

– symmetric multiprocessor (SMP)

• only a small amount of processors, in most cases a central bus, one address space (UMA), but bad scalability

• cache-coherence implemented in hardware (i. e. a read always provides a variable’s value from its last write)

• example: double or quad boards, SGI Challenge

C: cache M C … C C

(42)

– non-uniform memory access (NUMA)

• memory modules physically distributed among processors

• shared address space, but access times depend on location of data (i. e. local addresses faster than remote addresses)

• differences in access times are visible in the program • example: DSM / VSM, Cray T3E

(43)

– cache-coherent non-uniform memory access (ccNUMA)

• caches for local and remote addresses; cache-coherence implemented in hardware for entire address space

• problem with scalability due to frequent cache actualisations • example: SGI Origin 2000

C M … network M C

(44)

– cache-only memory access (COMA)

• each processor has only cache-memory

• entirety of all cache-memories = global shared memory • cache-coherence implemented in hardware

• example: Kendall Square Research KSR-1

(45)

– no remote memory access (NORMA)

• each processor has direct access to its local memory only

• access to remote memory only via explicit message exchange (due to distributed address space) possible

• synchronisation implicitly via the exchange of messages

• performance improvement between memory and I/O due to parallel data transfer (Direct Memory Access, e. g.) possible

• example: IBM SP2, ASCI Red / Blue / White

P

…

network

(46)

• motivation

• classification of parallel computers

• levels of parallelism

(47)

• the suitability of a parallel architecture for a given parallel program strongly depends on the granularity of parallelism

• some remarks on granularity

– quantitative meaning: ratio of computational effort and communication / synchronisation effort (≈ amount of instructions between two necessary communication / synchronisation steps)

– qualitative meaning: level on which work is done in parallel

program level process level

block level instruction level

(48)

• program level

– parallel processing of different programs – independent units without any shared data

– no or only small amount of communication / synchronisation – organised by the OS

• process level

– a program is subdivided into processes to be executed in parallel

– each process consists of a larger amount of sequential instructions and has a private address space

(49)

• block level

– blocks of instructions are executed in parallel

– each block consists of a smaller amount of instructions and shares the address space with other blocks

– communication via shared variables; synchronisation mechanisms – term of block often referred to as light-weight-process (thread)

• instruction level

– parallel execution of machine instructions

– optimising compilers can increase this potential by modifying the order of commands (better exploitation of superscalar architecture and

pipelining mechanisms)

• sub-instruction level

(50)

• motivation

• classification of parallel computers • levels of parallelism

(51)

• execution time

– time T of a parallel program between start of the execution on one processor and end of all computations on the last processor

– during execution all processors are in one of the following states • compute

– computation time T_COMP

– time spent for computations • communicate

– communication time T_COMM

– time spent for send and receive operations • idle

– idle time T_IDLE

(52)

• parallel profile

– measures the amount of parallelism of a parallel program – graphical representation

• x-axis shows time, y-axis shows amount of parallel activities • identification of computation, communication, and idle periods – example

proc. C proc. B proc. A

(53)

• parallel profile (cont’d)

– degree of parallelism

• P(t) indicates the amount of processes (of one application) that can be executed in parallel at any point in time (i. e.

y-values of the previous example for any time t) – average parallelism (often referred to as parallel index)

• A(p) indicates the average amount of processes that can be executed in parallel, hence

or

where p is the amount of processes and t_i is the time when exactly i processes are busy

, t t i A(p) _p 1 i i p 1 i i

∑

= = ⋅ =

∫

⋅ − = 2 1 t t 1 2 P(t)dt t t 1 A(p)

(54)

• parallel profile (cont’d)

– previous example: A(p) = (1⋅18 + 2⋅4 + 3⋅13) / 35 = 65/35 = 1.86

– for A(p) exist several theoretical (typically quite pessimistic) estimates, often used as arguments against parallel systems

P(t) 1 2 3 5 10 15 20 25 30 35 40 45 time

(55)

• comparison multiprocessor / monoprocessor

– correlation of multi- and monoprocessor systems’ performance – important: program that can be executed on both systems

– definitions

• P(1): amount of unit operations of a program on the monoprocessor system

• P(p): amount of unit operations of a program on the multiprocessor systems with p processors

• T(1): execution time of a program on the monoprocessor system (measured in steps or clock cycles)

• T(p): execution time of a program on the multiprocessor system (measured in steps or clock cycles) with p processors

(56)

• comparison multiprocessor / monoprocessor (cont’d)

– simplifying preconditions • T(1) = P(1)

– one operation to be executed in one step on the monoprocessor system

• T(p) ≤ P(p)

– more than one operation to be executed in one step (for p ≥ 2) on the multiprocessor system with p processors

(57)

– speed-up

• S(p) indicates the improvement in processing speed

• in general, 1 ≤ S(p) ≤ p – efficiency

• E(p) indicates the relative improvement in processing speed

• improvement is normalised by the amount of processors p • in general, 1/p ≤ E(p) ≤ 1 T(p) T(1) S(p)= p S(p) E(p)=

(58)

– speed-up and efficiency can be seen in two different ways • algorithm-independent

– best known sequential algorithm for the monoprocessor system is compared to the respective parallel algorithm for the

multiprocessor system

Î absolute speed-up

Î absolute efficiency • algorithm-dependent

(59)

– overhead

• O(p) indicates the necessary overhead of a multiprocessor system for organisation, communication, and synchronisation

• in general, 1 ≤ O(p) – parallel index

• I(p) indicates the amount of operations executed on average per time unit P(1) P(p) O(p)= T(p) P(p) I(p)=

(60)

– utilisation

• U(p) indicates the amount of operations each processor executes on average per time unit

• conforms to the normalised parallel index – conclusions

• all defined expressions have a value of 1 for p = 1

p I(p)

(61)

– example (1)

• a monoprocessor systems needs 6000 steps for the execution of 6000 operations to compute some result

• a multiprocessor system with five processors needs 6750 operations for the computation of the same result, but it needs only 1500 steps for the execution

• thus P(1) = T(1) = 6000, P(5) = 6750, and T(5) = 1500 • speed-up and efficiency can be computed as

S(5) = 6000/1500 = 4 and E(5) = 4/5 = 0.8

Î there is an acceleration of factor 4 compared to the

(62)

– example (2)

• parallel index and utilisation can be computed as I(5) = 6750/1500 = 4.5 and U(5) = 4.5/5 = 0.9

Î on average 4.5 processors are simultaneously busy, i. e. each processor is working only for 90% of the execution time

• overhead can be computed as O(5) = 6750/6000 = 1.125

(63)

• scalability

– objective: adding further processing elements to the system shall reduce the execution time without any program modifications

– i. e. a linear performance increase with an efficiency close to 1 – important for the scalability is a sufficient problem size

• one porter may carry one suitcase in a minute • 60 porters won’t do it in a second

• but 60 porters may carry 60 suitcases in a minute

– in case of a fixed problem size and an increasing amount of processors saturation will occur for a certain value of p, hence scalability is limited – when scaling the amount of processors together with the problem size

(so called scaled problem analysis) this effect will not appear for good scalable hard- and software systems

(64)

• AMDAHL’s law

– the probably most important and most famous estimate for the speed-up (even if quite pessimistic)

– underlying model

• each program consists of a sequential part s, 0 ≤ s ≤ 1, that can only be executed in a sequential way; synchronisation, data I/O, e. g

• furthermore, each program consists of a parallelisable part 1−s that can be executed in parallel by several processes; finding the

maximum value within a set of numbers, e. g.

(65)

• AMDAHL’s law (cont’d)

– the speed-up can thus be computed as

– when increasing p → ∞ we finally get AMDAHL’s law

Î speed-up is bounded: S(p) ≤ 1/s

– the sequential part can have a dramatic impact on the speed-up – therefore central effort of all (parallel) algorithms: keep s small

p s 1 s 1 T(1) p s 1 T(1) s T(1) T(p) T(1) − + = ⋅ − + ⋅ = = S(p) s 1 p s 1 s 1 lim S(p) lim p p→∞ = →∞ ₊ − =

(66)

• AMDAHL’s law (cont’d)

– example

• s = 0.1 and, thus, S(p) ≤ 10

• independent from p the speed-up is bounded by this limit • where’s the error?

10 S(p)

(67)

• GUSTAFSON’s law

– addresses the shortcomings of AMDAHL’s law as it states that any

sufficient large problem can be efficiently parallelised

– instead of a fixed problem size it supposes a fixed time concept – underlying model

• execution time on the parallel machine is normalised to 1 • this contains a non-parallelisable part σ, 0 ≤ σ ≤ 1

– hence, the execution time for the sequential program on the monoprocessor can be written as

T(1) = σ + p⋅(1−σ)

(68)

• GUSTAFSON’s law (cont’d)

– difference to AMDAHL

• sequential part s(p) is not constant, but gets smaller with increasing p s(p) ∈ ]0, 1[

• often more realistic, because more processors are used for a larger problem size, and here parallelisable parts typically increase (more computations, less declarations, …)

• speed-up is not bounded for increasing p

, ) (1 p s(p) σ − ⋅ + σ σ =

(69)

• GUSTAFSON’s law (cont’d)

– some more thoughts about speed-up

• theory tells: a superlinear speed-up does not exist

– each parallel algorithm can be simulated on a monoprocessor system by emulating in a loop always the next step of a

processor from the multiprocessor system • but superlinear speed-up can be observed

– when improving an inferior sequential algorithm

– when a parallel program (that does not fit into the main memory of the monoprocessor system) completely runs in cache and main memory of the nodes from the multiprocessor system

(70)

• communication—computation-ratio (CCR)

– important quantity measuring the success of a parallelisation

• gives the relation of pure communication time and pure computing time

• a small CCR is favourable

• typically: CCR decreases with increasing problem size – example

• N×N matrix distributed among p processors (N/p rows each)

• iterative method: in each step, each matrix element is replaced by the average of its eight neighbour values