High Performance Computing

(1)

1

High Performance Computing

Introduction, overview


Jesper Larsson Träff traff@par. …

Institute of Computer Engineering, Parallel Computing, 191-4 Treitlstrasse 3, 5. Stock (DG)

(2)

2

High Performance Computing: A (biased) overview

Concerns: Either

1. Achieving highest possible performance as needed by some application(s)

2. Getting highest possible performance out of given (highly parallel) system

• Ad 1: Anything goes, including designing and building new systems, raw application performance matters

• Ad 2: Understanding and exploiting details at all levels of given system

(3)

3

Our themes for this lecture (Ad. 2)

• Understanding modern processors: Processor architecture, memory system, single-core performance, multi-core parallelism

• Understanding parallel computers: Communication networks

• Programming parallel systems efficiently and effectively: Algorithms, interfaces, tools, tricks

All issues at all levels are relevant…

…but not always to the same extent and at the same time

(4)

4

Traditional “Scientific Computing”/HPC applications

• Weather

• Long-term weather forecast

• Climate (simulations: coupled models, multi-scale, multi-physics)

• Earth Science

• Nuclear physics

• Computational chemistry, Computational astronomy, Computational fluid dynamics, …

• Protein folding, Molecular Dynamics (MD)

• Cryptography (code-breaking, NSA)

• Weapons (design, nuclear stock pile), defense (“National

Qualified estimates say these problems require TeraFLOPS, PetaFLOPS, ExaFLOPS, …

(5)

5

Other, newer High-Performance Computing applications

• Machine Learning (ML), Deep Neural Networks (DNN) or other

• Data analytics (Google, Amazon, FB, …), “big data”

• Irregular data (graphs), irregular access patterns (graph algorithms)

Applications have different characteristics (operations, loops, tasks, access patterns, locality) and requirements (computation, memory, communication).

Different HPC architecture trade-offs for different applications

(6)

6

Ad. 1: Special purpose HPC systems for Molecular Dynamics

Special purpose computers have a history in HPC (computer science in general)

“Colossus” replica, Tony Sale 2006: Lorenz cipher code breaking

Thomas Haigh: Colossal genius: Tutte, Flowers, and a bad imitation of Turing. Commun. ACM 60(1): 29-35 (2017)

Henry Shipley: Turing: Colossus computer revisited. Nat. 483(7389): 275 (2012)

(7)

7

Example: N-body problem

N-body computations of forces between molecules to determine movements: Special type of computation with specialized algorithms that could potentially be executed orders of magnitude more efficiently (time, energy) on special-purpose hardware

M. Snir: “A Note on N-Body Computations with Cutoffs”. Theory Comp. Syst. 37(2): 295-318, 2004

(8)

8

MDGRAPE-3: PetaFLOPS performance in 2006, more than 3 times faster than BlueGene/L (Top500 #1 at that time)

MDGRAPE-4: Last in the series of a Japanese project of MD supercomputers (RIKEN)

(9)

9

MDGRAPE-4: Last in the series of a Japanese project of MD supercomputers (RIKEN)

Ohmura I, Morimoto G, Ohno Y, Hasegawa A, Taiji M. MDGRAPE-4: A special-purpose computer system for molecular dynamics simulations. Phil. Trans. R. Soc. A 372: 20130387, 2014.

http://dx.doi.org/10.1098/rsta.2013.0387

(10)

10

Anton (van Leeuwenhoek): Another special purpose MD system

512-node (8x8x8 torus) Anton machine

D. E. Shaw Research (DESRES)

Special purpose Anton chip (ASIC)

(11)

11

From “Encyclopedia on Parallel Computing”, Springer 2011:

“Prior to Anton’s completion, few reported all-atom protein simulations had reached 2μs, the longest being a 10-μs simulation that took over 3 months on the NCSA Abe supercomputer […].

On June 1, 2009, Anton completed the first millisecond-long simulation – more than 100 times longer than any reported previously.”

(12)

12

J. P. Grossman, Brian Towles, Brian Greskamp, David E. Shaw: Filtering, Reductions and Synchronization in the Anton 2 Network. IPDPS 2015: 860-870

Brian Towles, J. P. Grossman, Brian Greskamp, David E. Shaw: Unifying on-chip and inter-node switching within the Anton 2 network. ISCA 2014: 1-12

David E. Shaw, Martin M. Deneroff, Ron O. Dror, Jeffrey Kuskin, Richard H. Larson, John K. Salmon, Cliff Young, Brannon Batson, Kevin J. Bowers, Jack C. Chao, Michael P. Eastwood, Joseph Gagliardo, J. P. Grossman, Richard C. Ho, Doug Ierardi, István Kolossváry, John L. Klepeis, Timothy Layman, Christine McLeavey, Mark A. Moraes, Rolf Mueller, Edward C. Priest, Yibing Shan, Jochen Spengler, Michael Theobald, Brian Towles, Stanley C. Wang: Anton, a special-purpose machine for molecular dynamics simulation. Commun. ACM 51(7): 91-97 (2008)

Ron O. Dror, Cliff Young, David E. Shaw: Anton, A Special-Purpose Molecular Simulation Machine. Encyclopedia of Parallel Computing 2011: 60-71

(13)

13

Recent Anton 2 installation (from 2016):

Pittsburgh Supercomputing Center (PSC), see

• https://www.psc.edu/resources/anton

• https://www.psc.edu/news-publications/2181-anton-2-will-increase-speed-size-of-molecular-simulations

(14)

14

David E. Shaw, J. P. Grossman, Joseph A. Bank, Brannon Batson, J. Adam Butts, Jack C. Chao, Martin M. Deneroff, Ron O. Dror, Amos Even, Christopher H. Fenton, Anthony Forte, Joseph Gagliardo, Gennette Gill, Brian Greskamp, Richard C. Ho, Douglas J. Ierardi, Lev Iserovich, Jeffrey Kuskin, Richard H. Larson, Timothy Layman, Li-Siang Lee, Adam K. Lerer, Chester Li, Daniel Killebrew, Kenneth M. Mackenzie, Shark Yeuk-Hai Mok, Mark A. Moraes, Rolf Mueller, Lawrence J. Nociolo, Jon L. Peticolas, Terry Quan, Daniel Ramot, John K. Salmon, Daniele Paolo Scarpazza, U. Ben Schafer, Naseer Siddique, Christopher W. Snyder, Jochen Spengler, Ping Tak Peter Tang, Michael Theobald, Horia Toma, Brian Towles, Benjamin Vitale, Stanley C. Wang, Cliff Young: Anton 2: Raising the Bar for Performance and Programmability in a Special-Purpose Molecular Dynamics Supercomputer. SC 2014: 41-53

(15)

15

Ad. 1: Special purpose HPC for Deep Neural Network processing

Google Tensor Processing Units (TPU) for DNN

• TPUv1: inference (2018): claims 15-30 times faster, 30-80 times more energy efficient than CPU/GPU

• TPUv2, v3: training (2020), 10 times performance/watt gains over Top500 supercomputers

Norman P. Jouppi, Doe Hyun Yoon, George Kurian, Sheng Li, Nishant Patil, James Laudon, Cliff Young, David A. Patterson:

A domain-specific supercomputer for training deep neural networks. Commun. ACM 63(7): 67-78 (2020)

Norman P. Jouppi, Cliff Young, Nishant Patil, David A. Patterson: A domain-specific architecture for deep neural networks. Commun. ACM 61(9): 50-59 (2018)

Motivation for and Evaluation of the First Tensor Processing Unit. IEEE Micro 38(3): 10-19 (2018)

(16)

16

TPUv1

Systolic array MM processor

(17)

17

Integrated interconnect TPUv2/v3

(18)

18

2-dimensional torus of TPUs

From (October 2020): https://www.servethehome.com/google-tpuv3-discussed-at-hot-chips-32/

(19)

19

(20)

20

Special purpose architectures

• Dedicated functional units for special types of operations (FMA, MV, MM, …)

• Special ISA (special compiler support), often VLIW

• Special (short?) data formats

• Aggressive, special memory system

Special purpose/general purpose (Turing complete): Matter of degree

• Standardized ISA

• Balanced memory system, balanced communication system

• General data formats

• …

John L. Hennessy, David A. Patterson: A new golden age for

Turing Award lecture 2018

(21)

21

Ad 1.: Special purpose to general purpose

Special purpose advantages:

• Higher performance (FLOPS) for special types of computations/applications

• More efficient (energy, number of transistors, …)

Special-purpose systems sometimes have wider applicability:

• Graphics processing units (GPU) for general purpose computing (GPGPU)

• Field Programmable Gate Arrays (FPGA)

HPC systems: Special purpose processors as accelerators (GPU, FPGA, Xeon Phi, …)

(22)

22

General purpose MD software packages

• GROMACS www.gromacs.org

• NAMD www.ks.uiuc.edu/Research/namd/

(23)

23

Other typical components in scientific computing applications

• Dense and sparse matrices, linear equations

• PDE (“Partial Differential Equations”, multi-grid methods)

• N-body problems (MD again)

• …

• Many (parallel) support libraries:

• BLAS -> LAPACK -> ScaLAPACK

• Intel’s MKL (Math Kernel Library)

• MAGMA/PLASMA

• FLAME/Elemental/PLAPACK [R. van de Geijn]

• PETSc (“Portable Extensible Toolkit for Scientific computation”)

(24)

24

Ad. 2: Template High-Performance Computing architecture

Georg Hager, Gerhard Wellein: Introduction to High Performance Computing for Scientists and Engineers. Chapman and Hall / CRC computational science series, CRC Press 2011, ISBN 978-1-439-81192-4, pp. I-XXV, 1-330

• Typical elements of modern, parallel (High-Performance Computing) architectures: “A qualitative approach”

• Balance: Which architecture for which applications?

• Levels of parallelism

• Parallelism in programming model/interface

(25)

25

[Figure: template architecture; labels: caches L1 … Lk, SIMD, Acc, Main memory, NIC, Communication network]

• Hierarchical designs: core, processor, node, rack, island, …

• Orthogonal capabilities: Accelerators, vectors

• Different types of parallelism at all levels

(26)

26

[Figure: template architecture; labels: caches L1 … Lk, SIMD, Acc, Main memory, NIC]

• Total number of cores (what counts as a core?)

• Size of memories

• Properties of communication network

(27)

27

[Figure: core with SIMD unit and Acc; memory hierarchy: caches L1 … Lk, Main memory]

• Compute performance: How many instructions can each core perform per clock cycle (superscalar≥1)

• Special instructions & FUs: Vector, SIMD, FMA, (CISC…)

• Accelerator (if integrated in core)

Parallelism in core:

• Implicit, hidden (ILP)

• Explicit SIMD

• Explicit accelerator (GPU)

How expressed, exploited?

(28)

28

Compute performance measured in FLOPS: Floating Point Operations per Second

Floating Point: In HPC almost always 64-bit IEEE Floating Point number (32 bits too little for many scientific applications, but not all!)

FLOPS prefixes:

M(ega)FLOPS 10⁶

G(iga)FLOPS 10⁹

T(era)FLOPS 10¹²

P(eta)FLOPS 10¹⁵

E(xa)FLOPS 10¹⁸

Z(etta)FLOPS 10²¹

Y(otta)FLOPS 10²⁴

System peak Floating Point Performance (Rpeak)

Definition (HW peak performance):

Rpeak ≈ ClockFrequency x #FLOP/Cycle x #CPUs x #Cores/CPU

Optimistic, best case upper bound
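The Rpeak definition above can be written out as a small function. This is only a sketch: the function names and the machine parameters in the example (2 GHz, 16 FLOP/cycle, 2 CPUs, 32 cores each) are made-up assumptions, not data for any system in these slides.

```c
#include <stdio.h>

/* Nominal hardware peak performance in FLOPS:
   Rpeak = clock frequency x FLOP/cycle x #CPUs x #cores per CPU. */
double rpeak(double clock_hz, double flop_per_cycle, int cpus, int cores_per_cpu)
{
    return clock_hz * flop_per_cycle * (double)cpus * (double)cores_per_cpu;
}

/* Print Rpeak for a hypothetical node (all four parameters are invented). */
void print_example_rpeak(void)
{
    double r = rpeak(2.0e9, 16.0, 2, 32);
    printf("Rpeak = %.3f TFLOPS\n", r / 1.0e12);
}
```

Calling print_example_rpeak() from a main function reports 2.048 TFLOPS for these assumed parameters; like any Rpeak, this is a best-case bound that real applications will not reach.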

(29)

29

[Figure: core with SIMD unit; memory hierarchy: caches L1 … Lk, Main memory]

• Compute performance: How many instructions can core perform per clock cycle (superscalar≥1)

• Special instructions & FUs: Vector, SIMD (v≥1 operations per cycle)

Vector processor: Performance from wide SIMD unit. High performance for applications with large vectors

Superscalar: Multiple pipelines (integer, logical, FP add, FP mul, …). Requires right mix of instructions

(30)

30

Parallelism through

• Pipelining: Even complex instructions can be completed at a rate of one per cycle. Problem: dependencies, branches

• Multiple pipelines: Several different, independent instructions can be executed concurrently

Superscalar: Multiple pipelines (integer, logical, FP add, FP mul, …)

(31)

31

[Figure: core with SIMD unit and accelerator (Acc) with Acc memory; memory hierarchy: caches L1 … Lk, Main memory]

• Compute performance: How many instructions can core perform per clock cycle (superscalar≥1)

• Special instructions & FUs: Vector, SIMD

• Accelerator: In core or external (e.g., GPU)

Heavily accelerated system, one or more accelerators

How tightly integrated with memory system/core?

High performance for applications that fit with accelerator model

(32)

32

[Figure: core with SIMD unit and Acc; memory hierarchy: caches L1 … Lk, Main memory]

• Memory hierarchy: Latency (number of cycles to access first Byte), Bandwidth (Bytes/second)

• Balance between compute performance and memory bandwidth

• Memory access times not uniform (NUMA)

(33)

33

Definition (HW Peak Performance):

Rpeak ≈ ClockFrequency x #FLOP/Cycle x #CPUs x #Cores/CPU

Definition:

The hardware efficiency is the ratio Rmax/Rpeak, with Rmax the measured (sustained) application performance, Rpeak the nominal HW peak performance

Measured application performance (sustained performance): How many FLOPS does application achieve on system?

Note: This efficiency measure is totally different from the algorithmic (parallel) efficiency E = SU/p, the speedup SU divided by the number of processors p

What if efficiency « 1?

(34)

34

[Figure: core with SIMD unit and Acc; memory hierarchy: caches L1 … Lk, Main memory]

Application is (loosely speaking):

• Compute bound, if time for FLOPs per Byte read+written larger than (inverse) memory bandwidth

• Memory bound, if time for FLOPs per Byte read+written smaller than (inverse) memory bandwidth

(35)

35

Given application (kernel) A:

Arithmetical (Operational) intensity OI: Count (average) number of (Floating Point) OPerations per Byte read/written by the application/algorithm. Property of application

Performance and memory bandwidth (MB): properties of processor and memory system

Required BW, RB: HW Performance in (FL)OPS divided by OI

Memory bound: RB > MB

Compute bound: RB < MB

Example: Calculate RB on 2GHz, not superscalar processor, 64-bit Float

a = x*x+2*x*x*x+3*x*x*x*x+4*x*x*x*x*x;

OI = 16/(2*8) = 1 FLOP/Byte (16 FLOPs: 13 multiplications and 3 additions; 2*8 Bytes: x read and a written as 64-bit values), RB = 2GByte/s

Can memory system deliver?
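The calculation on this slide can be reproduced with a few helper functions. This is a minimal sketch; the function names are illustrative, and the memory bandwidth passed to is_memory_bound would have to come from a measurement or specification that the slide does not give.

```c
/* Operational intensity: (floating point) operations per Byte read + written. */
double operational_intensity(double flops, double bytes)
{
    return flops / bytes;
}

/* Required bandwidth RB in Bytes/s: HW performance in FLOPS divided by OI. */
double required_bandwidth(double perf_flops, double oi)
{
    return perf_flops / oi;
}

/* Returns 1 if memory bound (RB > MB), 0 if compute bound. */
int is_memory_bound(double rb, double mb)
{
    return rb > mb;
}
```

For the polynomial above, operational_intensity(16, 2*8) gives 1 FLOP/Byte, and at 2 GFLOPS (2 GHz, one operation per cycle) required_bandwidth(2.0e9, 1.0) gives 2 GByte/s, matching the slide.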

(36)

36

[Figure: template node; memory hierarchy: caches L1 … Lk, SIMD, Acc, Main memory]

• Cache hierarchy: 2, 3, 4, … levels: How to exploit efficiently (capacity, associativity, …)?

• Caches shared at certain levels (different in different processors, e.g., AMD, Intel, …)

• Caches coherent?

• Memory typically (very) NUMA

Cache management most often transparent (done by CPU); can have huge performance impact.

Applications do not benefit equally well from cache system

Shared memory parallelism (OpenMP, threads, MPI, …)

(37)

37

[Figure: template node; caches L1 … Lk, SIMD, Acc, Main memory, NIC]

Properties of communication network:

• Latency (time to initiate communication, first Byte), Bandwidth (Bytes/second) or time per unit

• Contention?

• How powerful is the network (performance, capabilities)?

• How is communication network integrated with memory and processor?

• What can communication coprocessor (NIC) do?

• Possible to “overlap” communication and computation?

Overlap: Processors and communication system work in parallel

(38)

38

[Figure: template node; caches L1 … Lk, SIMD, Acc, Main memory, NIC]

Application is:

• Communication bound: Time for FLOPs per Byte (OI) smaller than (inverse) communication bandwidth

Large number of cores with large compute performance (accelerator) share network bandwidth

Network parallelism:

• Explicit (MPI-like), implicit?

• Between cores, between nodes?

(39)

39

Roofline model: How well does application exploit given HW?

1. Estimate HW peak performance, in (FL)OPS

2. Estimate (main) memory bandwidth, in Bytes/s

3. Roof 1: Compute and plot Memory Performance as function of OI: Memory Performance (in (FL)OPS) = OI (Operations/Byte) * BW (Bytes/s)

4. Roof 2: Plot HW peak performance as function of OI (constant)

Architectural HW roofs

Definition: The (OI,Roof) plot is the roofline model for the given architecture. The unit for “Roof” is (FL)OPS. Slope of memory roof is MB
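The two roofs can be combined into one function: for a given OI, attainable performance is bounded by the lower of the memory roof and the compute roof. The sketch below assumes this standard min-formulation; the function names are illustrative, and "ridge point" (the OI where the two roofs meet) is terminology from the roofline literature rather than from this slide.

```c
/* Roofline bound in (FL)OPS for operational intensity oi (operations/Byte),
   given HW peak performance (FLOPS) and memory bandwidth (Bytes/s). */
double roofline(double oi, double peak_flops, double mem_bw)
{
    double memory_roof = oi * mem_bw;   /* Roof 1: slope is the bandwidth */
    return memory_roof < peak_flops ? memory_roof : peak_flops; /* Roof 2 caps it */
}

/* OI at which the memory roof reaches the compute roof. */
double ridge_point(double peak_flops, double mem_bw)
{
    return peak_flops / mem_bw;
}
```

With assumed values of 1 TFLOPS peak and 100 GByte/s bandwidth, a kernel with OI = 0.5 is bounded by 50 GFLOPS (memory roof), while any kernel with OI above the ridge point of 10 FLOP/Byte is bounded by the 1 TFLOPS compute roof.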

(40)

40

The application/kernel/algorithm/implementation

1. Estimate OI (arithmetical/operational intensity)

2. Measure achieved performance of application (FLOPS)

The (OI,Performance) is one point in the roofline model: If this point is close to a roof, the application is exploiting the HW (memory system, compute capability) well

Roofline model (most often): log-log plot

(41)

41

“Theoretical” roofline analysis:

Inspect kernel, algorithm, application: How many FLOPs per Byte read+written. Use specifications of hardware.

“Empirical” roofline analysis:

Measure memory bandwidth, e.g., STREAM benchmark.

Measure OI, e.g., using hardware performance counters (how many operations of different types, how many Bytes read+written?)

Hardware performance: Study architecture, specification

(42)

42

Motivation for and Evaluation of the First Tensor Processing Unit. IEEE Micro 38(3): 10-19 (2018)

TPUv1, GPU K80, Intel Haswell roofline models

Roofs for TPUv1

Application

(43)

43

Roofline analysis of application on given hardware

• If application is close to either memory or performance roofs: Exploits architecture well

• If application is close to memory roof, but far from performance roof: too low OI, rethink algorithm to do more operations per Byte read+written

• If application is far from roofs: Architecture not exploited, wrong mix of operations, no vectorization, dependencies, …

(44)

44

From https://crd.lbl.gov/departments/computer-science/par/research/roofline/introduction/

(45)

45

Sophisticated roofline models:

• HW roofs for different types of functional units: FMA, SIMD (avx2), …

• HW roof for different kinds of operations: FP, integer, logical

• Different types of memory (caches, L1, L2, L3, main memory)

• Communication?

(46)

46

From https://crd.lbl.gov/departments/computer-science/par/research/roofline/software/ert/

(47)

47

(48)

48

(49)

49

From

https://crd.lbl.gov/departments/computer-science/PAR/research/roofline/introduction/

Typical arithmetical (operational) intensities

Prefix-sums, BFS, DFS, Merging, Sorting, …

(50)

50

Roofline model(s): References

Samuel Williams, Andrew Waterman, David A. Patterson: Roofline: an insightful visual performance model for multicore architectures. Commun. ACM 52(4): 65-76 (2009)

Nicolas Denoyelle, Brice Goglin, Aleksandar Ilic, Emmanuel Jeannot, Leonel Sousa: Modeling Non-Uniform Memory Access on Large Compute Nodes with the Cache-Aware Roofline Model. IEEE Trans. Parallel Distrib. Syst. 30(6): 1374-1389 (2019)

Aleksandar Ilic, Frederico Pratas, Leonel Sousa: Beyond the Roofline: Cache-Aware Power and Energy-Efficiency Modeling for Multi-Cores. IEEE Trans. Computers 66(1): 52-58 (2017)

David Cardwell, Fengguang Song: An Extended Roofline Model with Communication-Awareness for Distributed-Memory HPC Systems. HPC Asia 2019: 26-35

(51)

51

Some past and present HPC architectures

Looking at Top500 list: www.top500.org

Ranks supercomputer performance by LINPACK benchmark (HPL), updated twice yearly (June, ISC Germany; November, ACM/IEEE Supercomputing)

(52)

52

Serious background of Top500:

Benchmarking to evaluate (super)computer performance

In HPC: Often based on one single benchmark

High Performance LINPACK (HPL) solves a system of linear equations under specified constraints (minimum number of operations), see www.top500.org

HPL performs well (high computational efficiency, high OI) on many architectures; allows a wide range of optimizations

HPL is less demanding on communication performance: Compute bound, OI (operational intensity) of O(n) FLOPs per Byte

HPL does not give a balanced view of “overall” system capabilities (communication)

HPL is politically important… (much money lost because of HPL…)

(53)

53

LINPACK performance as reported in Top500

• Rmax: FLOPS measured by solving large LINPACK instance

• Nmax: Problem size for reaching Rmax

• N/2: Problem size for reaching Rmax/2

• Rpeak: System Peak Performance as computed by system owner

Number of double precision floating point operations needed for solving the linear system must be (at least) 2/3 n³ + O(n²)

Excludes

• Strassen and other asymptotically fast matrix-matrix multiplication methods

• Algorithms that compute with less than 64-bit precision
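The operation count above gives a quick way to relate problem size, run time and operational intensity for HPL. The sketch keeps only the leading 2/3 n³ term (the O(n²) term is dropped) and assumes, as an idealization, that the n x n matrix of 64-bit values is moved to and from memory only once; the function names are illustrative.

```c
/* Leading-order operation count for solving a dense n x n linear system. */
double hpl_flops(double n)
{
    return (2.0 / 3.0) * n * n * n;
}

/* Estimated run time in seconds at a sustained rate (FLOPS). */
double hpl_seconds(double n, double rate_flops)
{
    return hpl_flops(n) / rate_flops;
}

/* Idealized OI: 2/3 n^3 operations over 8 n^2 Bytes of matrix data = n/12. */
double hpl_oi(double n)
{
    return hpl_flops(n) / (8.0 * n * n);
}
```

Under these assumptions the OI grows linearly with n, which is one way to read the earlier statement that HPL has an OI of O(n) FLOPs per Byte and is compute bound for large instances.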

(54)

54

June 2019

#500 system

#1 system

What are the systems at the jumps?

All systems

Factor >100 between

#1 and #500

(55)

55

June 2020

#500 system

#1 system All systems

(56)

56

June 2020: Rank #1 (June 2021: same)

System: Fugaku: A64FX 48C 2.2GHz, Tofu interconnect D, Fujitsu; RIKEN Center for Computational Science, Japan

Cores: 7,299,072

Rmax (TFLOPS): 415,530.0

Rpeak (TFLOPS): 513,854.7

Power (kW): 28,335

Hardware efficiency ≈ 80%
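The efficiency figure follows directly from the two reported numbers. The sketch below also derives performance per Watt from the reported power; the function names are illustrative, and the inputs are the Top500 values quoted on this slide.

```c
/* Hardware efficiency: measured (sustained) Rmax over nominal Rpeak. */
double hw_efficiency(double rmax, double rpeak)
{
    return rmax / rpeak;
}

/* Energy efficiency in GFLOPS per Watt from Rmax in TFLOPS and power in kW. */
double gflops_per_watt(double rmax_tflops, double power_kw)
{
    return (rmax_tflops * 1000.0) / (power_kw * 1000.0);
}
```

For Fugaku, hw_efficiency(415530.0, 513854.7) is about 0.81, and gflops_per_watt(415530.0, 28335.0) is about 14.7 GFLOPS/W.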

(57)

57

June 2019: Rank #1

System: Summit: IBM Power System AC922, IBM POWER9 22C 3.07GHz, NVIDIA Volta GV100, Dual-rail Mellanox EDR Infiniband, IBM; DOE/SC/Oak Ridge National Laboratory, United States

Cores: 2,414,592

Rmax (TFLOPS): 148,600.0

Rpeak (TFLOPS): 200,794.9

Power (kW): 10,096

(58)

58

June 2019: Rank #2

System: Sierra: IBM Power System S922LC, IBM POWER9 22C 3.1GHz, NVIDIA Volta GV100, Dual-rail Mellanox EDR Infiniband, IBM / NVIDIA / Mellanox; DOE/NNSA/LLNL, United States

Cores: 1,572,480

Rmax (TFLOPS): 94,640.0

Rpeak (TFLOPS): 125,712.0

Power (kW): 7,438

(59)

59

November 2017: Rank #1

System: Sunway TaihuLight: Sunway MPP, Sunway SW26010 260C 1.45GHz, Sunway, NRCPC; National Supercomputing Center Wuxi, China

Cores: 10,649,600

Rmax (TFLOPS): 93,014.6

Rpeak (TFLOPS): 125,435.9

Power (kW): 15,371

(60)

60

MasPar (1987-1996) MP2

Thinking Machines (1982-94) CM2, CM5

MasPar, CM2:

SIMD machines

CM5: MIMD machine

(61)

61

“Top” HPC systems 1993-2000 (from www.top500.org)

(62)

62

Earth Simulator (2002) and Earth Simulator 2 (2009)

(63)

63

K computer (2011) and Fugaku (2020)

(64)

64

HPL is politically important… (much money lost because of HPL…)

HPL is used to make projections on supercomputing performance trends (as Moore’s “Law”)

HPL is a co-driver for supercomputing “performance” development: It is hard (for a compute center, for a politician, …) to defend building a system that will not rank highly on Top500

Strong (political) drive towards Exascale: PetaFLOPS was achieved in 2008, ExaFLOPS expected ca. 2018-2020, by simple extrapolation from Top500

(65)

65

November 2016

According to projection, 2018/19 ExaFlop prediction will not hold

Why not? Any specific obstacles to ExaScale performance?

(66)

66

November 2017

According to projection, 2018/19 ExaFlop prediction will not hold

(67)

67

June 2019

According to projection, 2018/19 ExaFlop prediction will not hold

(68)

68

June 2021

(69)

69

Other HPC systems benchmarks

Intended to complement HPL or to highlight other aspects

HPCC: www.hpcchallenge.org: Benchmark suite (DGEMM, STREAM, PTRANS, Random Access, FFT, B_Eff)

HPCG: http://hpcg-benchmark.org

HPGMG: https://crd.lbl.gov/departments/computer-science/PAR/research/hpgmg

Graph500 (Graph search, BFS): www.graph500.org

Green500 (Energy consumption/efficiency): www.green500.org

STREAM: www.cs.virginia.edu/stream: Memory performance

Still active? Part of Top500

(70)

70

NAS Parallel Benchmarks (NPB):

https://www.nas.nasa.gov/publications/npb.html: Benchmark suite of small kernels

• IS: Integer sort

• EP: Embarrassingly parallel

• CG: Conjugate Gradient

• MG: Multigrid

• FT: Discrete 3D Fast Fourier Transform

• BT: Block tridiagonal solver

• SP: Scalar Pentadiagonal solver

• LU: Lower-Upper factorization Gauss-Seidel solver

Often used in research papers. What is evaluated, under which conditions, and compared to what? Understand the benchmarks

(71)

71

Mini Application suite (https://mantevo.org):

• MiniAMR: Adaptive Mesh Refinement

• MiniFE: Finite Elements

• MiniGhost: 3D halo exchange (ghost cells) for finite differencing

• MiniMD: Molecular Dynamics

• CloverLeaf: compressible Euler equations

• TeaLeaf: Linear heat conduction equation

(72)

72

Using top500: Broad trends in HPC systems architecture

• Very early days: Single-processor supercomputers (vector)

• After ‘94, all supercomputers are parallel computers

• Earlier days: Custom-made, unique, highest performance processor + highest performance network

• Later days, now: Custom convergence, weaker standard processors, but more of them, weaker networks (InfiniBand, Tori, …)

• Recent years: Accelerators (again): GPUs, FPGA, MIC, …

Much interesting computer history in top500 list; but also much is lost, and many details are not there. See what you can find

(73)

73

Example: the Earth Simulator 2002-2004 (#1)

https://www.nytimes.com/2002/04/20/technology/japanese-computer-is-worlds-fastest-as-us-falls-back.html (pay to read)

(74)

74

June 2002, Earth Simulator

System: Earth-Simulator; Vendor: NEC

Cores: 5120

Rmax (GFLOPS): 35860.00

Rpeak (GFLOPS): 40960.00

Power (kW): 3200.00

• Rmax: Performance achieved on HPL

• Rpeak: “Theoretical Peak Performance”, best case, all processors fully busy

Power: Processors only (cooling, storage)?

(75)

75

Power supply

• ~40TFLOPS

• 5120 vector processors

• 8 (NEC SX6) processors per node

• 640 nodes, 640x640 full crossbar interconnect

BUT: Energy expensive

Earth Simulator 2 (2009): only vector system on Top500

• ~15MW

(76)

76

Vector processor operates on long vectors, not only scalars

Peak performance: 8GFlops

Long vectors: 256 elements

Vector architecture pioneered by Cray (Cray-1 1976; late 1960s, early 1970s). Other vendors: Convex, Fujitsu, NEC, …

(77)

77

Vector processor operates on long vectors, not only scalars

Peak performance: 8GFlops (with all vector pipes and FUs active)

Long vectors: 256 (double/long) elements

Observe:

• Pipelines

• Caches

• Register banks

(78)

78

[Figure, step 1: Main memory, Vector registers, SIMD unit]

• One instruction

• Several, deep pipelines can be kept busy by long vector registers, no branches, no pipeline stalls

• Sufficient memory bandwidth to prefetch next register during vector instruction execution must be available

(79)

79

[Figure, step 2: Main memory, Vector registers, SIMD unit]

• One instruction

(80)

80

[Figure: Main memory, Vector registers, SIMD units up to SIMD k]

• One instruction

• Can sustain several operations per clock over a long interval

Banked memory for high vector bandwidth

(81)

81

[Figure: Main memory, Vector registers, several SIMD units]

• One instruction

• Can sustain several operations per clock over a long interval

HPC: Pipelines for different types of (mostly Floating Point) operations found in applications (add, mul, divide, √, …; additional special hardware)

Large vector register bank, different types (index, mask)

Banked memory for high vector bandwidth

(82)

82

Prototypical SIMD/data parallel architecture

One (vector) instruction operates on multiple data (long vectors)

G. Blelloch: “Vector Models for Data Parallel Computing”, MIT Press, 1990

(83)

83

Simple “data parallel (SIMD) loop”, n independent (floating point) operations translated into n/v vector operations

int a[n], b[n], c[n];
double x[n], y[n], z[n];
double xx[n], yy[n], zz[n];

for (i=0; i<n; i++) {
  a[i] = b[i]+c[i];
  x[i] = y[i]+z[i];
  xx[i] = (yy[i]*zz[i])/xx[i];
}

Translates to something like (each vector instruction handles v elements, t is a temporary vector register):

for (i=0; i<n; i+=v) {
  vadd(a+i,b+i,c+i);
  vdadd(x+i,y+i,z+i);
  vdmul(t,yy+i,zz+i);
  vddiv(xx+i,t,xx+i);
}

Can keep both integer and floating point pipes busy

n>>v: iteration i can prefetch vector for iteration i+v

(84)

85

High memory bandwidth by organizing memory into banks (NEC SX-6: 2K banks)

Element i, i+1, i+2, … in different banks, element i and i+2K in same bank: bank conflict, expensive because of serialization

32 Memory units, 64 banks each

Special communication processor (RCU) directly connected to memory system
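The effect of the access pattern on bank usage can be illustrated with a small model. The sketch assumes the simplest interleaving, element i in bank i mod 2048 (the "2K banks" quoted above); the real address-to-bank mapping of the hardware may differ, and the function names are illustrative.

```c
#define NUM_BANKS 2048   /* "2K banks", as quoted for the NEC SX-6 */

/* Bank holding element i under simple interleaving (an assumed mapping). */
int bank_of(long i)
{
    return (int)(i % NUM_BANKS);
}

/* Number of distinct banks touched by v accesses to elements 0, s, 2s, ...
   Few distinct banks means many conflicts and serialized accesses. */
int distinct_banks(long stride, int v)
{
    char seen[NUM_BANKS] = {0};
    int count = 0;
    for (int k = 0; k < v; k++) {
        int b = bank_of((long)k * stride);
        if (!seen[b]) {
            seen[b] = 1;
            count++;
        }
    }
    return count;
}
```

In this model a 256-element vector with stride 1 spreads over 256 banks, stride 16 over only 128, and stride 2048 hits a single bank for every element, the worst case described above.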

(85)

86

Vectorizable loop structures

Simple loop, integer (long) and floating point operations:

for (i=0; i<n; i++) {
  a[i] = b[i]+c[i];
}

DAXPY, fused multiply add (FMA):

for (i=0; i<n; i++) {
  a[i] = a[i]+b[i]*c[i];
}

Typically pipelines for

• floating point add, multiply, divide;

• some integer operations;

• daxpy; square root; …

(86)

87

Conditional execution handled by masking:

for (i=0; i<n; i++) {
  if (cond[i]) a[i] = b[i]+c[i];
}

Roughly translates to:

for (i=0; i<n; i++) {
  R[i] = b[i]+c[i];
  MASK[i] = cond[i];
  if (MASK[i]) a[i] = R[i];
}

MASK special register for conditional store (predicated store instruction), R temporary register

Wasteful when number of true-branches is small, always Ω(n)
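On a standard processor the same idea is often written as a branch-free select, which a vectorizing compiler can typically map to a blend or masked instruction. The sketch below is an approximation of the masked loop above, not the vector hardware's predicated store: it writes a[i] in every iteration, storing the old value back where the condition is false. The function name is illustrative.

```c
/* Masked form of: if (cond[i]) a[i] = b[i] + c[i];
   The sum is computed for all n elements, so the work is always n additions,
   regardless of how many conditions are true. */
void masked_add(int n, double *a, const double *b, const double *c, const int *cond)
{
    for (int i = 0; i < n; i++) {
        double r = b[i] + c[i];      /* computed unconditionally */
        a[i] = cond[i] ? r : a[i];   /* select replaces the branch */
    }
}
```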

(87)

88

Gather/Scatter operations. Compiler may need help:

#pragma vdir vector,nodep
for (i=0; i<n; i++) {
  a[ixa[i]] = b[ixb[i]]+c[ixc[i]];
}

Can cause memory bank conflicts, depending on index vector (many indices to same bank: serialization)

Memory bandwidth dependent on access pattern

(88)

89

Prefix-sums:

#pragma vdir vector
for (i=1; i<n; i++) {
  a[i] = a[i-1]+a[i];
}

Min/max operations:

min = a[0];
#pragma vdir vector
for (i=0; i<n; i++) {
  if (a[i]<min) min = a[i];
}

With special hardware support
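The prefix-sums loop has a loop-carried dependency (a[i] needs a[i-1]), which is why it needs special support. Without such support, one well-known alternative is a doubling scheme in which every step is a fully independent, vectorizable loop. The sketch below shows that swapped-in technique, not what the vector hardware does; it performs O(n log n) additions instead of n-1, and the function name is illustrative.

```c
#include <stdlib.h>
#include <string.h>

/* Inclusive prefix sums by doubling: in the step with distance d, every
   element i >= d adds the element d positions to its left. All iterations
   of the inner loop are independent. Returns 0 on success, -1 on failure. */
int prefix_sums_doubling(size_t n, double *a)
{
    if (n == 0) return 0;
    double *t = malloc(n * sizeof *t);   /* temporary for the updated values */
    if (t == NULL) return -1;
    for (size_t d = 1; d < n; d *= 2) {
        for (size_t i = 0; i < n; i++)
            t[i] = (i >= d) ? a[i] + a[i - d] : a[i];
        memcpy(a, t, n * sizeof *a);
    }
    free(t);
    return 0;
}
```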

(89)

90

Vectorizable loop structures

Strided access:

#pragma vdir vector,nodep
for (i=0; i<n; i++) {
  a[s*i] = b[s*i]+c[s*i];
}

Can cause memory bank conflicts (some strides always bad)

Large-vector processors currently out of fashion in HPC, almost non-existent

NEC SX-8 (2005), NEC SX-9 (2008), NEC SX-ACE (2013)

2009-2013: No NEC vector processors (market lost?)

(90)

91

NEC SX-Aurora TSUBASA: Vector Engine (ca. 2017)

• 8-core vector processor

• 1.2 TBytes/Second memory bandwidth

Rpeak: 2.45TFLOPS

(91)

92

Many scientific applications fit well with vector model. Irregular, non-numerical applications often not

Mature compiler technology for vectorization and optimization (loop splitting, loop fusion…). Aim: Keep vector pipes busy

Allen, Kennedy: “Optimizing Compilers for Modern Architectures”, MKP 2002

Scalar (non-vectorizable) code carried out by standard, scalar processor; amount limits performance (Amdahl’s Law)

Vector programming model: Loops, sequential control flow, compiler handles parallelism (implicit) by vectorizing loops (some help from programmer)
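The remark above on scalar code and Amdahl's Law can be made concrete. In the sketch below, s is the fraction of the work that stays scalar and v is the speedup of the vectorized remainder; both parameter names and the function name are illustrative.

```c
/* Amdahl's Law for vectorization: overall speedup when a fraction s of the
   work remains scalar and the fraction 1 - s is sped up by a factor v. */
double amdahl_speedup(double s, double v)
{
    return 1.0 / (s + (1.0 - s) / v);
}
```

With 10% scalar code the speedup stays below 10 no matter how fast the vector unit is; with half the code scalar and v = 2 it is only 4/3.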

(92)

93

Small scale vectorization: Standard processors

• MMX, SSE, AVX, AVX2,… (128 bit vectors, 256 bit vectors)

• Intel MIC/Xeon Phi: 512 bit vectors, new, special vector instructions (2013: Compiler support not yet mature; 2016: Much better), AVX-512 (2018: Xeon Phi defunct!)

High performance on standard processors:

• Exploit vectorization potential

• Check whether loops were indeed vectorized (gcc -ftree-vectorizer-verbose=n …, in combination with architecture specific optimizations)

2, 4, 8 Floating Point operations simultaneously by one vector instruction (no integers?)

(93)

94

Support for vectorization in OpenMP 4.0

#pragma omp simd [clauses…]
for (i=0; i<n; i++) {
  a[i] = b[i]+c[i];
}

Clauses: reduction (for sums), collapse (for nested loops)
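As an illustration of the reduction clause, a dot product can be written as follows. This is a minimal sketch with an illustrative function name; it assumes a compiler with OpenMP SIMD support (with gcc, for example, enabled by -fopenmp or -fopenmp-simd). Without such a flag the pragma is ignored and the loop runs as ordinary sequential code with the same result up to floating point rounding order.

```c
/* Dot product with an explicit SIMD reduction: each SIMD lane accumulates
   a partial sum, and the partial sums are combined after the loop. */
double dot(int n, const double *x, const double *y)
{
    double sum = 0.0;
    #pragma omp simd reduction(+:sum)
    for (int i = 0; i < n; i++)
        sum += x[i] * y[i];
    return sum;
}
```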

(94)

95

Explicit parallelism

• 8-way SMP (8 vector processors per shared-memory node)

• Not cache-coherent

• Nodes connected by full crossbar

2-level explicit parallelism:

• Intra-node with shared-memory communication

• Inter-node with communication over crossbar

(95)

96

Coherence

Memory system is coherent, if any update (write) to memory by any processor will eventually become visible to any other processor

[Figure: processors with caches L1 … Lk above Main memory; copies of x in two L1 caches]

Cache coherence: Any update to a value in cache of some processor will eventually become visible to any other processor (regardless of whether in cache of other processor)

Maintaining cache coherence (across sockets/large multi-cores) can be expensive!

(96)

97

Memory behavior, memory model

• Access (read, write) to different locations may take different time (NUMA: memory network, placement of memory controllers, caches, write buffers)

• In which order will updates to different locations by some processor become visible to other processors?

• Memory model specifies: Which accesses can overtake which other accesses

Sequential consistency: Accesses take effect in program order

Most modern processors are not sequentially consistent

(97)

98

No cache-coherence: Earth Simulator/NEC SX

• Scalar unit of vector processor has cache

• Caches of different processors not coherent

• Vector units read/write directly to memory, no vector caches

• Write-through cache

Different design choice:

Cray X1 (vector computer, early 2000s) had a different, cache-coherent design (coherent on nodes, not across)

• Nodes must coordinate and synchronize

• Parallel programming model (OpenMP, MPI) helps

D. Abts, S. Scott, D. J. Lilja: “So Many States, So Little Time: Verifying Memory Coherence in the Cray X1”, IPDPS 2003: 11

(98)

99

Example: MPI and cache non-coherence

Processes i and j on same node

Rank i: MPI_Send(&x,…,comm);

Rank j: MPI_Recv(&y,…,comm,&status);

x: Mem of rank i; y: Mem of rank j, with a copy of y in cache of j

[Figure: vectorized memcpy writes x to y in memory of rank j; the copy of y in cache of j is left in an incoherent state]

Coherency/consistency needed after MPI_Recv: rank j must invalidate cache(lines) at the point where MPI requires coherence (at MPI_Recv)

(99)

100

Example: MPI and cache non-coherence

[Figure: ranks i and j; MPI_Send(&x,…,comm) on i, MPI_Recv(&y,…,comm,&status) on j; vectorized memcpy writes into y in memory of j while cache of j still holds y: incoherent state]

Coherency/consistency needed after MPI_Recv: clear_cache instruction invalidates all cache lines

Expensive: 1) clear_cache itself; 2) all cached values lost!

Further complication with MPI: structured data/data types; address &y alone does not tell where the data are

(100)

101

Example: OpenMP and cache non-coherence

#pragma omp parallel for
for (i=0; i<n; i++) {
  x[i] = f(y[i]);
}

Sequential region after the loop: All x[i]’s visible to all threads

OpenMP: All regions (parallel, critical, …) require memory in a consistent state (caches coherent); implicit flush/fence constructs to force visibility (in OpenMP construct)

Lesson: Higher-level programming models can help to alleviate need for low-level, fine-grained cache coherency.
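The OpenMP pattern above can be sketched as a complete function. This is an illustration under assumptions: f is a hypothetical stand-in for the function on the slide, and map_then_sum is a name invented here. The implicit barrier at the end of the parallel loop implies a flush, so the sequential loop that follows may read every x[i] without any explicit cache management by the programmer.

```c
/* Hypothetical stand-in for the function f on the slide. */
static double f(double v)
{
    return 2.0 * v + 1.0;
}

/* Fills x[i] = f(y[i]) in parallel, then sums x sequentially. */
double map_then_sum(const double *y, double *x, int n)
{
    int i;
    double sum = 0.0;

    /* Parallel region: iterations are divided among the threads;
       the loop variable i is private to each thread. */
    #pragma omp parallel for
    for (i = 0; i < n; i++) {
        x[i] = f(y[i]);
    }
    /* Implicit barrier and flush here: all x[i] are now visible. */

    /* Sequential region: executed by the initial thread only. */
    for (i = 0; i < n; i++) {
        sum += x[i];
    }
    return sum;
}
```

Compiled without OpenMP support the pragma is ignored and the function still computes the same result sequentially.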

(101)

102

Cache coherence debate

• Cache: Beneficial for applications with spatial and/or temporal locality (not all applications have this: Graph algorithms)

• Caches a major factor in single-processor performance increase (since sometime in the 1980s)

Many new challenges for caches in parallel processors:

• Coherency

• Scalability

• Resource consumption (logic=transistors=chip area; energy)

• …

Milo M. K. Martin, Mark D. Hill, Daniel J. Sorin: Why on-chip cache coherence is here to stay. Commun. ACM 55(7): 78-89 (2012)

Too expensive?

(102)

103

MPI and OpenMP

Still most widely used programming interfaces/models for parallel HPC (there are contenders)

MPI: Message-Passing Interface, see www.mpi-forum.org

• MPI processes (ranks) communicate explicitly: point-to-point communication, one-sided communication, collective communication, parallel I/O

• Subgrouping and encapsulation (communicators)

• Much support functionality

OpenMP: shared-memory interface (C/Fortran pragma extension), data (loops) and task parallel support, see www.openmp.org

(103)

104

Partitioned Global Address Space (PGAS) alternative to MPI

Addressing mechanism by which part of the processor-local address space can be shared between processes; referencing non-local parts of partitioned space leads to implicit communication

Language or library supported:

Some data structures (typically arrays) can be declared as shared (partitioned) across (all) threads

Note:

PGAS not same as Distributed Shared Memory (DSM). PGAS explicitly controls which data structures (arrays) are partitioned, and how they are partitioned

(104)

105

PGAS: Data structures (simple arrays) partitioned (shared) over the memory of p threads

Global array(s): Simple, block cyclic distribution of array a

Each block of global array in local memory of some process/thread

[Figure: global array a divided into blocks; the blocks that thread k owns are marked]
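A block-cyclic distribution can be sketched with three small index functions. The names bc_owner, bc_local_index and bc_is_remote, and the parameters b (block size) and p (number of threads), are assumptions made for this illustration and do not come from any particular PGAS language.

```c
/* Owner of global index i: blocks of b consecutive elements are
   dealt out round-robin to the p threads. */
int bc_owner(long i, long b, int p)
{
    return (int)((i / b) % p);
}

/* Position of global index i within the local memory of its owner:
   (number of complete rounds) * b + offset inside the block. */
long bc_local_index(long i, long b, int p)
{
    return (i / (b * p)) * b + i % b;
}

/* 1 if an access to a[i] by thread k would require communication,
   0 if a[i] lies in the local memory of thread k. */
int bc_is_remote(long i, long b, int p, int k)
{
    return bc_owner(i, b, p) != k;
}
```

With b = 2 and p = 3, indices 0,1 go to thread 0, indices 2,3 to thread 1, indices 4,5 to thread 2, and indices 6,7 again to thread 0, where they occupy local positions 2 and 3.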

(105)

106

Global array(s):

Thread k:

b = a[i];
a[j] = b;

entails communication if index i or index j is not owned by thread k

PGAS Memory model: Defines when update becomes visible to other threads

Each block of global array in local memory of some process/thread

[Figure: global array a divided into blocks; the blocks that thread k owns are marked]

(106)

107

Global array(s):

Thread k:

a[i] = b[j];

possible even if neither a[i] nor b[j] is owned by k

PGAS Memory model: Defines when update becomes visible to other threads

[Figure: global array a divided into blocks; the blocks that thread k owns are marked]

(107)

108

Global array(s): partitioned (shared) over the memory of p threads

forall(i=0; i<n; i++) {
  a[i] = f(x[i]);
}

Owner computes rule: Thread k performs updates only on the elements (indices) owned by/local to k

[Figure: global array a divided into blocks; the blocks that thread k owns are marked]
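The owner computes rule can be sketched for a block-cyclic distribution as follows. This is a sequential illustration under assumptions: the arrays are modeled as ordinary C arrays in one address space, f is a hypothetical stand-in for the function in the forall loop, and owner_computes, b (block size), p (number of threads) and k (thread id) are names chosen here.

```c
/* Hypothetical stand-in for the function f in the forall loop. */
static double f(double v)
{
    return v * v;
}

/* Work done by thread k (0 <= k < p): visit only the blocks of size b
   that a block-cyclic distribution assigns to k, and update a[i] there.
   Returns the number of elements updated by thread k. */
long owner_computes(double *a, const double *x, long n, long b, int p, int k)
{
    long updated = 0;
    long start, i;

    /* First owned block starts at k*b; the next one is p blocks later. */
    for (start = (long)k * b; start < n; start += (long)p * b) {
        long end = (start + b < n) ? start + b : n;
        for (i = start; i < end; i++) {
            a[i] = f(x[i]);
            updated++;
        }
    }
    return updated;
}
```

Calling the function once for every k from 0 to p-1 updates each element exactly once, and no thread ever writes an element it does not own.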

(108)

109

Typical PGAS features:

• Array assignments/operations translated into communication when necessary based on ownership

• Mostly simple, block-cyclic distributions of (multi-dimensional) arrays

• Collective communication support for redistribution, collective data transfer (transpositions, gather/scatter) and reduction-type operations

• Bulk-operations, array operations

Often less support for library building (process subgroups) than MPI

Even more extreme: SIMD array languages, array operations parallelized by library and runtime

(109)

110

Some PGAS languages/interfaces:

• UPC/UPC++: Unified Parallel C, C/C++ language extension; collective communication support; severe limitations

• CaF: Co-array Fortran, standardized, but limited PGAS extension to Fortran

• CAF2: considerably more powerful, non-standardized Fortran extension

• X10 (Habanero): IBM asynchronous PGAS language

• Chapel: Cray, powerful data structure support

• Titanium: Java-extension

• Global Arrays (GA): older, PGAS-like library for array programming, see http://hpc.pnl.gov/globalarrays/

• HPF: High-Performance Fortran

Fortran is still an important language in HPC

(110)

111

Mattias De Wael, Stefan Marr, Bruno De Fraine, Tom Van Cutsem, Wolfgang De Meuter: Partitioned Global Address Space Languages. ACM Comput. Surv. 47(4): 62:1-62:27 (2015)

Activity, maturity of PGAS languages?

UPC finds some applications

Martina Prugger, Lukas Einkemmer, Alexander Ostermann: Evaluation of the partitioned global address space (PGAS) model for an inviscid Euler solver. Parallel Computing 60: 22-40 (2016)

No new developments for the past decade? Implementation status and performance not discussed. Many PGAS language implementations use MPI as (default) communication layer

(111)

112

The Earth Simulator: Interconnect

Full crossbar:

• Each node has a direct link (cable) to each other node

• Full bidirectional communication over each link

• All pairs of nodes can communicate simultaneously without having to share bandwidth

• Processors on node share crossbar bandwidth

• Strong: 12.6 GByte/s BW vs. 64 GFLOPS/node; application needs an arithmetic intensity (AI) of ca. 5 FLOPs for each Byte communicated (64/12.6 ≈ 5.1), otherwise processor idles
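The balance between network bandwidth and compute rate can be checked with a small calculation. This is a sketch under a simple assumption that is not stated on the slide: computation and communication do not overlap, and both run at their peak rates. The function names are chosen for this illustration.

```c
/* FLOPs a node can execute in the time it takes to communicate one
   byte: peak rate (GFLOPS) divided by bandwidth (GByte/s). */
double balance_flops_per_byte(double peak_gflops, double bw_gbytes_per_s)
{
    return peak_gflops / bw_gbytes_per_s;
}

/* 1 if communicating `bytes` takes longer than executing `flops`,
   i.e. the processors would wait for the network; 0 otherwise. */
int is_communication_bound(double flops, double bytes,
                           double peak_gflops, double bw_gbytes_per_s)
{
    double t_comp = flops / (peak_gflops * 1e9);
    double t_comm = bytes / (bw_gbytes_per_s * 1e9);
    return t_comm > t_comp;
}
```

For the figures on the slide, balance_flops_per_byte(64.0, 12.6) is about 5.1, so an application performing 4 FLOPs per byte communicated would be communication bound, whereas one performing 6 FLOPs per byte would not.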

(112)

113

Fully connected network, p nodes: each pairing consists of floor(p/2) disjoint pairs, and in all pairings all paired nodes can communicate directly

Maximum distance between any two nodes (diameter): one link

Fully connected network realized as (indirect) crossbar network

[Figure: fully connected network of nodes N]
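The cost of full connectivity can be made concrete by counting links. A minimal sketch with function names chosen for this illustration: a direct fully connected network on p nodes needs one link per unordered pair of nodes, while a single pairing uses only floor(p/2) of them at a time.

```c
/* Number of direct links in a fully connected network on p nodes:
   one link per unordered pair of nodes. */
long fc_links(long p)
{
    return p * (p - 1) / 2;
}

/* Number of disjoint pairs that can communicate simultaneously. */
long fc_pairs_per_pairing(long p)
{
    return p / 2;
}
```

The number of links grows quadratically with p (for example, 204480 links for p = 640), which is the reason cheaper, multi-stage networks are of interest for larger systems.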

(113)

114

Hierarchical/Hybrid communication subsystems

• Processors placed in shared-memory nodes; processors on same node are “closer” than processors on different nodes

• Different communication media within nodes (e.g., shared memory) and between nodes (e.g., crossbar network)

• Processors on same node share bandwidth of inter-node network

• Compute nodes may have one or more “lanes” (rails) to network(s)

[Figure: four shared-memory nodes, each with a memory M and four processors P]

(114)

115

[Figure: four shared-memory nodes, each with a memory M and four processors P]

Actually, many more hierarchy levels:

• Cache (and memory) hierarchy:

L1 (data/instruction) -> L2 -> L3 (…)

• Processors (multi-core) share caches at certain levels (processors may differ, e.g., AMD vs. Intel)

• Network may itself be hierarchical (Clos/fat tree: InfiniBand): Nodes, Racks, Islands, …

(115)

116

Hierarchical communication system

Processors can be partitioned (non-trivially) such that:

• Processors in same partition communicate with roughly same performance (latency, bandwidth, number of ports, …)

• Processors in different partitions communicate with roughly same (lower) performance

[Figure: processors partitioned into Part 0, Part 1, …, Part k]

Each partition can again be hierarchical

A crossbar network is not hierarchical (all processors can communicate with same performance)

(116)

117

Programming model and system hierarchy

“Pure”, homogeneous programming models oblivious to hierarchy

• MPI (no performance model, only indirect mechanisms for grouping processes according to system structure: MPI topologies)

• UPC (local/global, no grouping at all)

• …

Implementation challenge for compiler/library implementer to take hierarchy into account:

• Point-to-point communication uses closest path, e.g., shared memory when possible

• Efficient, hierarchical collective communication algorithms exist (for some cases, still incomplete and immature)

(117)

118

“Pure”, homogeneous programming models oblivious to hierarchy

Application programmer relies on language/library to efficiently exploit system hierarchy:

• Portability!

• Performance portability?! All library/language functions give good performance on (any) given system, thus an application whose performance is dominated by library/language functions will perform predictably when porting to another system

Sensible to analyze performance in terms of collective operations (building blocks), e.g.,

T(n,p) = T_Allreduce(p) + T_Alltoall(n) + T_Bcast(np) + O(n)

(118)

119

Hybrid/heterogeneous programming models (“MPI+X”)

• Conscious of certain aspects/levels of hierarchy

• Possibly more efficient application code:

• Example: MPI+OpenMP

• Less portable, less performance portable

• Sometimes unavoidable (accelerators): OpenCL, OpenMP, OpenACC, …

[Figure: four shared-memory nodes, each with a memory M and four processors P; OpenMP within each node, MPI between master threads]

(119)

120

Earth Simulator 2/SX-9, 2009

Compared to SX-6/Earth Simulator:

• More pipes

• Special pipes (square root)

Peak performance >100 GFLOPS/processor

(120)

121

Earth Simulator 2/SX-9 system

Peak performance/CPU: 102.4 GFLOPS
CPUs/PN: 8
Peak performance/PN: 819.2 GFLOPS
Shared memory/PN: 128 GByte
Total number of CPUs: 1280
Total number of PNs: 160
Total peak performance: 131 TFLOPS
Total main memory: 20 TByte

(121)

122

Cheaper communication network than full crossbar: Fat-Tree

(122)

123

Fat-Tree: Indirect (multi-stage), hierarchical network

[Figure: binary tree network; processors P at the leaves, switch nodes N at the inner levels]

Tree network, max 2 log p “hops” between processors, p-1 “wires”
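The 2 log p bound on the number of hops can be illustrated by computing the distance between two leaves. This is a sketch under assumptions that the slide does not spell out: the p processors are numbered 0 to p-1 from left to right at the leaves of a complete binary tree, p is a power of two, and every tree edge counts as one hop. The function name tree_hops is chosen here.

```c
/* Hops between leaf processors i and j in a complete binary tree:
   climb from both leaves until the paths meet in a common ancestor,
   then count the same number of edges down again. */
int tree_hops(unsigned i, unsigned j)
{
    int levels = 0;
    while (i != j) {
        i >>= 1;   /* parent of i on the next level */
        j >>= 1;   /* parent of j on the next level */
        levels++;
    }
    return 2 * levels;
}
```

With p = 8, processors 0 and 7 meet only at the root, three levels up, giving 2 log2 8 = 6 hops; neighbors below the same switch, such as 2 and 3, need 2 hops.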

(123)

124

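[Figure: tree network with processors P at the leaves and switch nodes N at the inner levels; wires drawn “fatter” towards the root]

Bandwidth increases towards the root: “fatter” wires

C. E. Leiserson: Fat-Trees: Universal Networks for Hardware-Efficient Supercomputing. IEEE Trans. Computers 34(10): 892-901, 1985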

(124)

125

Fat-Tree: Indirect (multi-stage), hierarchical network

[Figure: fat-tree network; processors P at the leaves, switch nodes N at the inner levels]

Thinking Machines CM5, on first, unofficial Top500

C. E. Leiserson: Fat-Trees: Universal Networks for Hardware-Efficient Supercomputing. IEEE Trans. Computers 34(10): 892-901, 1985

(125)

126

[Figure: fat-tree in which each “fat” inner node is built from several small switches N; processors P at the leaves]

Realization with small crossbar switches

Example: InfiniBand

(126)

127

Example: The Blue Genes, 2004 (#1)

(127)

128

November 2004, Blue Gene/L

System: BlueGene/L DD2 beta-System (0.7 GHz PowerPC 440)
Vendor: IBM
Cores: 32768
Rmax (GFLOPS): 70720.00
Rpeak (GFLOPS): 91750.00

(128)

129

Large number of cores (2012: 1572864 – Sequoia system), weaker cores, limited memory per core/node

IBM Blue Gene/L

• ~200.000 processing cores

• 256MBytes to 1GByte/core

Note: Not possible to locally maintain state of whole system, 256MBytes/200.000 ~ 1KByte

• Applications that need to maintain state information for each other process in trouble

• Libraries (e.g., MPI) that need to maintain state information for each process in (big) trouble
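The memory estimate above can be reproduced with a one-line calculation. A minimal sketch, with a function name chosen for this illustration: if a process wanted to keep state for every process in the system, the whole local memory divided by the number of processes is an upper bound on the state per process.

```c
/* Upper bound on the number of bytes of state that one process can
   keep per process if all of its local memory were used for that. */
long long bytes_per_process_state(long long local_mem_bytes, long long nprocs)
{
    return local_mem_bytes / nprocs;
}
```

For 256 MBytes and 200000 processes this gives 1342 bytes, so even a few integers and pointers per peer (message queues, sequence numbers, buffer addresses) would consume a large share of the memory before the application stores any data of its own.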

(129)

130

• “slow” processors, 700-800MHz

• Simpler processors, limited out-of-order, branch-prediction

• BG/L: 2-core, not cache-coherent

• BG/P: 4-core, cache-coherent

• BG/Q: ?

• Very memory constrained (512MB to 4GB/node)

• Simple, low-bisection 3d-torus network

Energy efficient, heavily present on Green500

Note: Torus is not a hierarchical network
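Distances in a 3d-torus can be sketched as follows. The assumptions are made here for illustration and are not taken from the Blue Gene documentation: nodes are identified by coordinates (x, y, z), messages travel along shortest paths, and each dimension wraps around. The function names are hypothetical.

```c
/* Shortest distance between positions a and b on a ring of n nodes:
   go either directly or the other way around the wrap-around link. */
static int ring_dist(int a, int b, int n)
{
    int d = (a > b) ? a - b : b - a;
    return (d < n - d) ? d : n - d;
}

/* Hops between nodes src and dst in a 3d-torus with the given
   dimension sizes: the ring distances of the three dimensions add up. */
int torus_hops(const int src[3], const int dst[3], const int dims[3])
{
    int d, hops = 0;
    for (d = 0; d < 3; d++)
        hops += ring_dist(src[d], dst[d], dims[d]);
    return hops;
}

/* Diameter: at most floor(n/2) hops are needed per dimension. */
int torus_diameter(const int dims[3])
{
    return dims[0] / 2 + dims[1] / 2 + dims[2] / 2;
}
```

In an 8 x 8 x 8 torus the diameter is 12 hops. The distance between two nodes depends only on their coordinates and changes gradually, without the distinct levels of a hierarchical network.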

(130)

131

On the BlueGene/L System

José E. Moreira, Valentina Salapura, George Almási, Charles Archer, Ralph Bellofatto, Peter Bergner, Randy Bickford, Matthias A. Blumrich, José R. Brunheroto, Arthur A. Bright, Michael Brutman, José G. Castaños, Dong Chen, Paul Coteus, Paul Crumley, Sam Ellis, Thomas Engelsiepen, Alan Gara, Mark Giampapa, Tom Gooding, Shawn Hall, Ruud A. Haring, Roger L. Haskin, Philip Heidelberger, Dirk Hoenicke, Todd Inglett, Gerard V. Kopcsay, Derek Lieber, David Limpert, Patrick McCarthy, Mark Megerian, Michael Mundy, Martin Ohmacht, Jeff Parker, Rick A. Rand, Don Reed, Ramendra K. Sahoo, Alda Sanomiya, Richard Shok, Brian E. Smith, Gordon G. Stewart, Todd Takken, Pavlos Vranas, Brian P. Wallenfelt, Michael Blocksome, Joe Ratterman: The Blue Gene/L Supercomputer: A Hardware and Software Story. International Journal of Parallel Programming 35(3): 181-206 (2007)

(131)

132

On MPI for the BlueGene/L System

George Almási, Charles Archer, José G. Castaños, John A. Gunnels, C. Christopher Erway, Philip Heidelberger, Xavier Martorell, José E. Moreira, Kurt W. Pinnow, Joe Ratterman, Burkhard D. Steinmacher-Burow, William Gropp, Brian R. Toonen: Design and implementation of message-passing services for the Blue Gene/L supercomputer. IBM Journal of Research and Development 49(2-3): 393-406 (2005)

(132)

133

Example: Road Runner, 2008 (#1)

First PetaFLOP system, seriously accelerated

Decommissioned 31.3.2013