Overview: Classical Parallel Hardware

(1)

Overview: Classical Parallel Hardware

Review of Single Processor Design

l

so we talk the same language

l

many things happen in parallel even on a single processor

l

identify potential issues for parallel hardware

l

why use 2 CPUs if you can double the speed on one! Multiple Processor Design

l

architectural classifications

l

shared/distributed memory

l

hierarchical/flat memory

l

dynamic/static processor connectivity

l

evaluating static networks

(2)

The Processor

Performs:

l

floating point operations (FLOPS) - add, mult, division (sqrt maybe!)

l

integer operations (MIPS) - adds etc, also logical ops and instruction processing

n

MIPS: Machine Instructions Per Second (on a very old VAX 100/780)

n

anyway, what is a machine instruction on different CPUs!

l

our primary focus will be in floating point operations The clock:

l

all ops take a fixed number of clock ticks to complete

l

clock speed is measured in GHz (109 cycles/second) or nsec (10−9 seconds)

n

Apple iPhone 6 ARM A8 1.4GHz (0.71ns), NCI Gadi Intel Xeon Cascade Lake 3.2GHz (0.31ns), IBM zEC12 processor 5.5Ghz (0.18ns)

l

clock speed is limited by etching+speed of light, hence motivates parallelism

n

(to my knowledge) IBM zEC12 is fastest commodity processor at 5.5GHz

(3)

Processor Performance

FLOPS/sec Prefix Occurrence

103 kilo very badly written code 106 mega badly written code

109 giga single-core

1012 tera multiple chip (NCI)

1015 peta 23 machines in Top500 (Nov 2012, measured) 1018 exa around 2020!

l

PC 2.5GHz Core2 Quad, 4(core)*4(ops)*2.5GHz

≡

40GF

l

Bunyip Pentium III, 96(nodes)*2(sockets)*1(op)*550MHz

≡

105GF

(4)

Adding Numbers

Consider adding two double precision (8 byte) numbers

0 1 11 12 63

±

Exponent Significand

Possible steps:

l

determine largest exponent

l

normalize significand of the smaller exponent to the larger

l

add significand

l

re normalize the significand and exponent of the result

(5)

Operation Pipelining

Step in Pipeline Waiting 1 2 3 4 Done X(6) X(5)→ X(4)→ X(3)→ X(2)→ X(1)

l

X(1) takes 4 ticks to appear (startup latency); X(2) appears 1 tick after X(1)

l

asymptotically achieves 1 result per clock tick

l

the operation (X) is said to be pipelined: steps in the pipeline are running in parallel

l

requires same op consecutively on different (independent) data items

n

good for “vector operations” (note limitations on chaining output data to input)

l

not all operations are pipelined, e.g. integer mult., division, sqrt. E.g. Alpha EV6 operation latency repeat

+,-,* 4 1

/ 15 12

(6)

Instruction Pipelining

l

break instruction execution into

k

stages;

⇒

can get

≤ k-way

||

ism (generally, the circuitry for each stage is independent)

l

eg. (k = 5): stages FI = Fetch Instrn., DI = Decode Instrn., FO = Fetch Operand, EX = Execute Instrn., WB = Write Back

(branch): FI DI FO EX WB

(guess) FI DI FO EX WB

(sure) FI DI FO EX WB

n

note: FO & WB stages may involve memory accesses (and may possibly stall the pipeline)

n

conditional branch instructions are problematic: the wrong guess may require flushing succeeding instructions from the pipeline

l

tendency to increase the number of stages, but this increases the startup latency e.g. number of stages: UltraSPARC II (9), UltraSPARC III (14), Intel Prescott (31)

(7)

Pipelining: Dependent Instructions

principle: CPU must ensure result is the same as if no pipelining (or

||

ism)

l

instructions requiring only 1 cycle in the EX stage:

add %i1 , −1, %i1 ! i1 = i1 − 1

cmp %i1 , 0 ! is i1 = 0?

can be solved by pipeline feedback from EX stage to next cycle

l

(important) instructions requiring

c

cycles for the EX stage (ie. f.p. * +, load, store) are normally implemented by having

c

EX stages. This requires any dependent

instruction to be delayed by

c

cycles. e.g.

c

= 3

fmuld %f0 , %f2 , %f4 ! I0: f4 = f0 ∗ f2 (f.p.)

.... ! I1:

.... ! I2:

faddd %f4 , %f6 , %f6 ! I3: f6 = f4 + f6 (f.p.)

I0: FI DI FO EX1 EX2 EX3 WB

(8)

Superscalar (Multiple Instruction Issue)

A small number (w) of instructions are scheduled by the H/W to execute together.

l

groups must have an appropriate ‘instruction mix’ e.g. UltraSPARC (w = 4):







≤

2 different floating point

≤

1 load / store

; ≤

1 branch

≤

2 integer / logical







instr’ns per group

l

have

≤ w-way

||

ism over different types of instr’ns

l

generally requires:

n

multiple (≥ w) instruction fetches

n

an extra grouping (G) stage in the pipeline

l

problem: amplifies dependencies and other problems of pipelining by a factor of

w!

l

issues: the instruction mix must be balanced for maximum performance!

(9)

Review - Instruction Level Parallelism

l

instruction level parallelism (ILP): pipelining and superscalar, offer

≤ kw-way

||

ism

l

branch prediction techniques alleviates issue of conditional branches

n

uses a table recording the result of recently-taken branches

l

out-of-order execution: alleviates the issue of dependencies

n

pulls fetched instructions into a buffer of size

W

,

W

≥ w

n

execute them in any order provided dependencies are not violated

n

must keep track of all ‘in-flight’ instructions and associated registers (O(

W

2)

area and power!)

l

in most situations, the compiler can do as good a job as a human at exposing this parallelism (ILP was part of the ‘Free Lunch’)

An exception: how to get a UltraSPARC III to execute 1 f.p. multiply (and add!) per cycle

(10)

Memory Structure

Consider the DAXPY computation:

y

(

i

) = 1

.

234

∗ x

(

i

) +

y

(

i

)

If theoretically the CPU can perform the 2 FLOPs in 1 cycle

l

the memory system must load two

double

s (x(

i

) and

y

(

i

) – 16 bytes) and store one (y(

i

) – 8 bytes) each clock cycle

n

on a 1 GHz system this implies 16 GB/s load traffic and 8 GB/s store traffic

l

typically:

n

a processor core can only issue one load OR store instruction in a clock cycle

n

DDR3-SDRAM memory is available clocked at

≈

1066 MHz with access times accordingly

Memory latency and bandwidth are critical performance issues.

l

caches: reduce latency and provide improved cache to CPU bandwidth

(11)

Cache Memory

Main Memory – large cheap memory

↓

– large latency/small bandwidth Cache – small fast expensive memory

↓↓↓↓↓

– lower latency/higher bandwidth CPU Registers

l

part of the memory hierarchy

l

cache hit: data is in cache and received in a few cycles

l

cache miss: data must be fetched from main memory (or higher level cache)

l

try to ensure data is in cache (or as close to the CPU as possible)

l

can we ‘block’ algorithm to minimize memory traffic?

Cache memory is effective because algorithms often use data that:

l

was recently accessed from memory (temporal locality)

l

was close to other recently accessed data (spatial locality)

(12)

Cache Mapping

Blocks of main memory are mapped to a cache line

l

cache lines are typically 16-128 bytes wide

l

mapping may be direct, or n-way associative. E.g. 2-way associative mapping

Cache Main Memory Mapping

Line 0 0 or 1 2 or 3 0 or 1 2 or 3 Line 1 0 or 1 2 or 3 0 or 1 2 or 3 Line 2 0 or 1 2 or 3 0 or 1 2 or 3 Line 3 0 or 1 2 or 3 0 or 1 2 or 3 0 or 1 2 or 3 0 or 1 2 or 3

l

entire cache line is fetched from memory, not just one element (why?)

n

try to structure code to use an entire cache line of data

n

best to have unit stride

(13)

Memory Banks

Memory bandwidth can be improved by having multiple parallel paths to/from memory, organized in

b

banks.

Bank 1 Bank 2 Bank 3 Bank 4

C P U

l

traditional solution used by vector processors

l

good performance for unit stride

l

very bad performance if bank conflicts occur

(14)

Parallelism in Memory Access

Dual Inline Memory Module

(DIMM)

l

a processor may have 2 or 4 DIMMs (i.e.

b

= 16

,

32), depending on its lowest-level

cache line size

l

each DRAM chip access is pipelined (select row address, select column address, access byte)

(15)

Going Parallel

l

inevitably the performance of a single processor is limited by the clock speed

l

improved manufacturing increases clock but is ultimately limited by the speed of light

l

ILP allows multiple ops at once, but is limited

It’s time to go parallel

Hardware Issues

l

Flynn’s Taxonomy of parallel processors

n

SIMD/MIMD

l

shared/distributed memory

l

hierarchical/flat memory

l

dynamic/static processor connectivity

l

characteristics of static networks

(16)

Architecture Classification: Flynn’s Taxonomy

l

why classify:

n

what kind of parallelism is employed? Which architecture has the best prospects for the future? What has already been achieved by current

architecture types? Reveal configurations that have not yet considered by the system architect. Enable the building of performance models.

Flynn’s taxonomy is based on the degree of parallelism, with 4 categories determined according to the number of instruction and data streams

Data Stream

Single Multiple

Single SISD SIMD

Instruction 1CPU Array/Vector Processor

Stream Multiple MISD MIMD

(17)

SIMD and MIMD

SIMD: Single Instruction Multiple Data

l

also known as data parallel processors or array processors

l

vector processors (to some extent)

l

current examples include SSE instructions, SPEs on CellBE, GPUs

l

NVIDIA’s SIMT (T = Threads) is a slight variation

MIMD: Multiple Instruction Multiple Data

l

examples include quad-core PC, 24-core Xeons on Gadi

CPU CPU CPU CPU

I N T E R C O N N E C T CPU and Control CPU and Control CPU and Control CPU and Control Global Control Unit I N T E R C O N N E C T S I M D M I M D

(18)

MIMD

Most successful parallel model

l

more general purpose than SIMD (e.g. CM5 could emulate CM2)

l

harder to program, as processors are not synchronized at the instruction level Design issues for MIMD machines

l

scheduling: efficient allocation of processors to tasks in a dynamic fashion

l

synchronization: prevent processors accessing the same data simultaneously

l

interconnect design: processor to memory and processor to processor

interconnects. Also I/O network - often processors dedicated to I/O devices

l

overhead: inevitably there is some overhead associated with coordinating activities between processors, e.g. resolve contention for resources

l

partitioning: identifying parallelism in algorithms that is capable of exploiting concurrent processing streams is non-trivial

(Aside: SPMD Single Program Multiple Data: more restrictive than MIMD, implying that all processors run the same executable.)

(19)

Address Space Organization: Message Passing

l

each processor has local or private memory

l

interact solely by message passing

l

commonly known as distributed memory parallel computers

l

memory bandwidth scales with number of processors

l

example: between “nodes” on the NCI Gadi system

I N T E R C O N N E C T PROCESSOR MEMORY PROCESSOR MEMORY MEMORY PROCESSOR PROCESSOR MEMORY

(20)

Address Space Organization: Shared Address Space

l

processors interact by modifying data objects stored in a shared address space

l

simplest solution is a flat or uniform memory access (UMA)

l

scalability of memory bandwidth and processor-processor communications (arising from cache line transfers) are problems

l

so is synchronizing access to shared data objects

l

example: dual/quad core laptop (ignoring caches)

MEMORY MEMORY MEMORY MEMORY

I N T E R C O N N E C T

(21)

Non-Uniform Memory Access (NUMA)

Machine includes some hierarchy in its memory structure

l

all memory is visible to the programmer (single address space), but some memory takes longer to access than others

l

in this sense, cache introduces one level of NUMA

l

between sockets on the NCI Gadi system or in a multi-socket Opteron system

(some ANU research on an Opteron (2011))

I N T E R C O N N E C T

PROCESSOR PROCESSOR PROCESSOR PROCESSOR Cache Cache Cache Cache

MEMORY MEMORY MEMORY MEMORY I N T E R C O N N E C T

(22)

Shared Address Space Access

Parallel Random Access Machine (PRAM): any shared memory machine

l

what happens when multiple processors try to read/write to the same memory location at the same time?

l

PRAM models

n

Exclusive-read, exclusive-write (EREW PRAM)

n

Concurrent-read, exclusive-write (CREW PRAM)

n

Exclusive-read, concurrent-write (ERCW PRAM)

n

Concurrent-read, concurrent-write (CRCW PRAM)

Concurrent read OK, but concurrent-write requires arbitration:

l

Common: allowed if all values being written are identical

l

Arbitrary: an arbitrary processor is allowed to proceed and the rest fail

l

Priority: processors are organized into a predefined prioritized list; the process with highest priority succeeds, the rest fail

(23)

Dynamic Processor Connectivity: Crossbar

l

is a non-blocking network, in that the connection of two processors does not block connection between other processors

l

complexity grows as

O

(

p

2)

l

may be used to connect processors each with their own local memory Memory Processor and Memory Processor and Memory Processor and Memory Processor and Memory Processor and Memory Processor and Memory Processor and Memory Processor and

l

may also be used to connect processors with various memory banks

(24)

Dynamic Processor Connectivity: Multi-staged Networks

000 010 100 110 001 011 101 111 Processors S W I T C H I N G N E T W O R K O M E G A N E T W O R K Memory 010 100 001 000 011 101 110 111

l

consist of lg

p

stages, where

p

is the number of processors (lg

p

= log₂(

p

))

l

s

and

t

are binary representations of message source and destination at stage 1

n

route through if the most significant bits of

s

and

t

are the same

n

otherwise, crossover

(25)

Dynamic Processor Connectivity: Bus

l

processor gains exclusive access to the bus for some period

l

the performance of the bus limits scalability

B U S

Performance

crossbar

>

multistage

>

bus Cost

crossbar

>

multistage

>

bus

Discussion point: which would you use for small, medium and large numbers of processors (

p)? What values of

p

would (roughly) be the cuttoff points?

(26)

Static Processor Connectivity: Complete, Mesh, Tree

Completely connected (becomes very complex!)

Linear array/ring, mesh/2d torus

Tree (static if nodes are processors)

Processors Switches

(27)

Static Processor Connectivity: Hypercube

0100 0101 0001 0000 1100 1000 1101 1001 1110 1010 1011 1111 0110 0010 0111 0011

l

a multidimensional mesh with exactly two processors in each dimension

p

= 2d

where

d

is the dimension of the hypercube

l

advantages:

≤ d

‘hops’ between any two processors

l

disadvantage: the number of connections per processor is

d

l

can be generalized to

k-ary

d

-cubes, e.g. a 4

×

4 torus is a 4-ary 2-cube

(28)

Static Processor Connectivity: Hypercube Characteristics

0100 0101 0001 0000 1100 1000 1101 1001 1110 1010 1011 1111 0110 0010 0111 0011

l

two processors are connected directly ONLY IF their binary labels differ by one bit

l

in a

d-dimensional hypercube, each processor directly connects to

d

others

l

a

d-dimensional hypercube can be partitioned into two

d

−

1 sub-cubes etc

l

the number of links in the shortest path between two processors is the Hamming distance between their labels

n

the Hamming distance between two processors labeled

s

and

t

is the number of bits that are on in the binary representation of

s

⊕ t

where

⊕

is the bitwise

(29)

Evaluating Static Interconnection Networks #1

Diameter

l

the maximum distance between any two processors in the network

l

directly determines communication time (latency)

Connectivity

l

the multiplicity of paths between any two processors

l

a high connectivity is desirable as it minimizes contention (also enhances fault-tolerance)

l

arc connectivity of the network: the minimum number of arcs that must be removed for the network to break it into two disconnected networks

n

1 for linear arrays and binary trees

n

2 for rings and 2D meshes

n

4 for a 2D torus

(30)

Evaluating Static Interconnection Networks #2

Channel width

l

the number of bits that can be communicated simultaneously over a link connecting two processors

Bisection width and bandwidth

l

bisection width is the minimum number of communication links that have to be removed to partition the network into two equal halves

l

bisection bandwidth is the minimum volume of communication allowed between two halves of the network with equal numbers of processors

Cost

l

many criteria can be used; we will use the number of communication links or wires required by the network

(31)

Summary: Static Interconnection Characteristics

Bisection Arc Cost

Network Diameter width connectivity (no. of links) Completely-connected 1

p

2

/

4

p

−

1

p

(

p

−

1)

/

2 Binary Tree 2 lg((

p

+ 1)

/

2) 1 1

p

−

1 Linear array

p

−

1 1 1

p

−

1 Ring

bp/

2

c

2 2

p

2D Mesh 2(

√

p

−

1)

√

p

2 2(

p

−

√

p

) 2D Torus 2

b

√

p/

2

c

2

√

p

4 2

p

Hypercube lg

p

p/

2 lg

p

(

p

lg

p

)

/

2

Note: the Binary Tree suffers from a bottleneck: all traffic between the left and right sub-trees must pass through the root. The fat tree interconnect alleviates this:

l

usually, only the leaf nodes contain a processor

l

it can be easily ‘partitioned’ into sub-networks without degraded performance (useful for supercomputers)

Discussion point: which would you use for small, medium and large numbers of processors (

p)? What values of

p

would (roughly) be the cuttoff points?

(32)

NCI’s Raijin: A Petascale Supercomputer

l

57,472 cores (dual socket, 8 core Intel Xeon Sandy Bridge, 2.6 GHz) in 3592 compute nodes

l

160 TBytes (approx.) of main memory

l

Mellanox Infiniband FDR interconnect (52km cables)

l

interconnects: ring (cores), full (sockets), fat tree (nodes)

l

10 PBytes (approx.) of usable fast filesystem (for shortterm scratch space apps, home directories)

l

power: 1.5 MW max. load

l

cooling systems: 100 tonnes of water

l

24th fastest in the world in debut (November 2012); first petaflop system in Australia

n

fastest (parallel) filesystem in the s. hemisphere

n

custom monitoring and deployment

n

custom Linux kernel

(33)

(34)