Overview: Classical Parallel Hardware
Review of Single Processor Designl
so we talk the same languagel
many things happen in parallel even on a single processorl
identify potential issues for parallel hardwarel
why use 2 CPUs if you can double the speed on one! Multiple Processor Designl
architectural classificationsl
shared/distributed memoryl
hierarchical/flat memoryl
dynamic/static processor connectivityl
evaluating static networksThe Processor
Performs:l
floating point operations (FLOPS) - add, mult, division (sqrt maybe!)l
integer operations (MIPS) - adds etc, also logical ops and instruction processingn
MIPS: Machine Instructions Per Second (on a very old VAX 100/780)n
anyway, what is a machine instruction on different CPUs!l
our primary focus will be in floating point operations The clock:l
all ops take a fixed number of clock ticks to completel
clock speed is measured in GHz (109 cycles/second) or nsec (10−9 seconds)n
Apple iPhone 6 ARM A8 1.4GHz (0.71ns), NCI Gadi Intel Xeon Cascade Lake 3.2GHz (0.31ns), IBM zEC12 processor 5.5Ghz (0.18ns)l
clock speed is limited by etching+speed of light, hence motivates parallelismn
(to my knowledge) IBM zEC12 is fastest commodity processor at 5.5GHzProcessor Performance
FLOPS/sec Prefix Occurrence
103 kilo very badly written code 106 mega badly written code
109 giga single-core
1012 tera multiple chip (NCI)
1015 peta 23 machines in Top500 (Nov 2012, measured) 1018 exa around 2020!
l
PC 2.5GHz Core2 Quad, 4(core)*4(ops)*2.5GHz≡
40GFl
Bunyip Pentium III, 96(nodes)*2(sockets)*1(op)*550MHz≡
105GFAdding Numbers
Consider adding two double precision (8 byte) numbers
0 1 11 12 63
±
Exponent SignificandPossible steps:
l
determine largest exponentl
normalize significand of the smaller exponent to the largerl
add significandl
re normalize the significand and exponent of the resultOperation Pipelining
Step in Pipeline Waiting 1 2 3 4 Done X(6) X(5)→ X(4)→ X(3)→ X(2)→ X(1)l
X(1) takes 4 ticks to appear (startup latency); X(2) appears 1 tick after X(1)l
asymptotically achieves 1 result per clock tickl
the operation (X) is said to be pipelined: steps in the pipeline are running in parallell
requires same op consecutively on different (independent) data itemsn
good for “vector operations” (note limitations on chaining output data to input)l
not all operations are pipelined, e.g. integer mult., division, sqrt. E.g. Alpha EV6 operation latency repeat+,-,* 4 1
/ 15 12
Instruction Pipelining
l
break instruction execution intok
stages;⇒
can get≤ k-way
||
ism (generally, the circuitry for each stage is independent)l
eg. (k = 5): stages FI = Fetch Instrn., DI = Decode Instrn., FO = Fetch Operand, EX = Execute Instrn., WB = Write Back(branch): FI DI FO EX WB
(guess) FI DI FO EX WB
(guess) FI DI FO EX WB
(guess) FI DI FO EX WB
(sure) FI DI FO EX WB
n
note: FO & WB stages may involve memory accesses (and may possibly stall the pipeline)n
conditional branch instructions are problematic: the wrong guess may require flushing succeeding instructions from the pipelinel
tendency to increase the number of stages, but this increases the startup latency e.g. number of stages: UltraSPARC II (9), UltraSPARC III (14), Intel Prescott (31)Pipelining: Dependent Instructions
principle: CPU must ensure result is the same as if no pipelining (or
||
ism)l
instructions requiring only 1 cycle in the EX stage:add %i1 , −1, %i1 ! i1 = i1 − 1
cmp %i1 , 0 ! is i1 = 0?
can be solved by pipeline feedback from EX stage to next cycle
l
(important) instructions requiringc
cycles for the EX stage (ie. f.p. * +, load, store) are normally implemented by havingc
EX stages. This requires any dependentinstruction to be delayed by
c
cycles. e.g.c
= 3fmuld %f0 , %f2 , %f4 ! I0: f4 = f0 ∗ f2 (f.p.)
.... ! I1:
.... ! I2:
faddd %f4 , %f6 , %f6 ! I3: f6 = f4 + f6 (f.p.)
I0: FI DI FO EX1 EX2 EX3 WB
I1: FI DI FO EX1 EX2 EX3 WB
I2: FI DI FO EX1 EX2 EX3 WB
Superscalar (Multiple Instruction Issue)
A small number (w) of instructions are scheduled by the H/W to execute together.
l
groups must have an appropriate ‘instruction mix’ e.g. UltraSPARC (w = 4):
≤
2 different floating point≤
1 load / store; ≤
1 branch≤
2 integer / logical
instr’ns per group
l
have≤ w-way
||
ism over different types of instr’nsl
generally requires:n
multiple (≥ w) instruction fetchesn
an extra grouping (G) stage in the pipelinel
problem: amplifies dependencies and other problems of pipelining by a factor ofw!
l
issues: the instruction mix must be balanced for maximum performance!Review - Instruction Level Parallelism
l
instruction level parallelism (ILP): pipelining and superscalar, offer≤ kw-way
||
isml
branch prediction techniques alleviates issue of conditional branchesn
uses a table recording the result of recently-taken branchesl
out-of-order execution: alleviates the issue of dependenciesn
pulls fetched instructions into a buffer of sizeW
,W
≥ w
n
execute them in any order provided dependencies are not violatedn
must keep track of all ‘in-flight’ instructions and associated registers (O(W
2)area and power!)
l
in most situations, the compiler can do as good a job as a human at exposing this parallelism (ILP was part of the ‘Free Lunch’)An exception: how to get a UltraSPARC III to execute 1 f.p. multiply (and add!) per cycle
Memory Structure
Consider the DAXPY computation:y
(i
) = 1.
234∗ x
(i
) +y
(i
)If theoretically the CPU can perform the 2 FLOPs in 1 cycle
l
the memory system must load twodouble
s (x(i
) andy
(i
) – 16 bytes) and store one (y(i
) – 8 bytes) each clock cyclen
on a 1 GHz system this implies 16 GB/s load traffic and 8 GB/s store trafficl
typically:n
a processor core can only issue one load OR store instruction in a clock cyclen
DDR3-SDRAM memory is available clocked at≈
1066 MHz with access times accordinglyMemory latency and bandwidth are critical performance issues.
l
caches: reduce latency and provide improved cache to CPU bandwidthCache Memory
Main Memory – large cheap memory
↓
– large latency/small bandwidth Cache – small fast expensive memory↓↓↓↓↓
– lower latency/higher bandwidth CPU Registersl
part of the memory hierarchyl
cache hit: data is in cache and received in a few cyclesl
cache miss: data must be fetched from main memory (or higher level cache)l
try to ensure data is in cache (or as close to the CPU as possible)l
can we ‘block’ algorithm to minimize memory traffic?Cache memory is effective because algorithms often use data that:
l
was recently accessed from memory (temporal locality)l
was close to other recently accessed data (spatial locality)Cache Mapping
Blocks of main memory are mapped to a cache line
l
cache lines are typically 16-128 bytes widel
mapping may be direct, or n-way associative. E.g. 2-way associative mappingCache Main Memory Mapping
Line 0 0 or 1 2 or 3 0 or 1 2 or 3 Line 1 0 or 1 2 or 3 0 or 1 2 or 3 Line 2 0 or 1 2 or 3 0 or 1 2 or 3 Line 3 0 or 1 2 or 3 0 or 1 2 or 3 0 or 1 2 or 3 0 or 1 2 or 3
l
entire cache line is fetched from memory, not just one element (why?)n
try to structure code to use an entire cache line of datan
best to have unit strideMemory Banks
Memory bandwidth can be improved by having multiple parallel paths to/from memory, organized in
b
banks.Bank 1 Bank 2 Bank 3 Bank 4
C P U
l
traditional solution used by vector processorsl
good performance for unit stridel
very bad performance if bank conflicts occurParallelism in Memory Access
Dual Inline Memory Module
(DIMM)
l
a processor may have 2 or 4 DIMMs (i.e.b
= 16,
32), depending on its lowest-levelcache line size
l
each DRAM chip access is pipelined (select row address, select column address, access byte)Going Parallel
l
inevitably the performance of a single processor is limited by the clock speedl
improved manufacturing increases clock but is ultimately limited by the speed of lightl
ILP allows multiple ops at once, but is limitedIt’s time to go parallel
Hardware Issues
l
Flynn’s Taxonomy of parallel processorsn
SIMD/MIMDl
shared/distributed memoryl
hierarchical/flat memoryl
dynamic/static processor connectivityl
characteristics of static networksArchitecture Classification: Flynn’s Taxonomy
l
why classify:n
what kind of parallelism is employed? Which architecture has the best prospects for the future? What has already been achieved by currentarchitecture types? Reveal configurations that have not yet considered by the system architect. Enable the building of performance models.
Flynn’s taxonomy is based on the degree of parallelism, with 4 categories determined according to the number of instruction and data streams
Data Stream
Single Multiple
Single SISD SIMD
Instruction 1CPU Array/Vector Processor
Stream Multiple MISD MIMD
SIMD and MIMD
SIMD: Single Instruction Multiple Data
l
also known as data parallel processors or array processorsl
vector processors (to some extent)l
current examples include SSE instructions, SPEs on CellBE, GPUsl
NVIDIA’s SIMT (T = Threads) is a slight variationMIMD: Multiple Instruction Multiple Data
l
examples include quad-core PC, 24-core Xeons on GadiCPU CPU CPU CPU
I N T E R C O N N E C T CPU and Control CPU and Control CPU and Control CPU and Control Global Control Unit I N T E R C O N N E C T S I M D M I M D
MIMD
Most successful parallel modell
more general purpose than SIMD (e.g. CM5 could emulate CM2)l
harder to program, as processors are not synchronized at the instruction level Design issues for MIMD machinesl
scheduling: efficient allocation of processors to tasks in a dynamic fashionl
synchronization: prevent processors accessing the same data simultaneouslyl
interconnect design: processor to memory and processor to processorinterconnects. Also I/O network - often processors dedicated to I/O devices
l
overhead: inevitably there is some overhead associated with coordinating activities between processors, e.g. resolve contention for resourcesl
partitioning: identifying parallelism in algorithms that is capable of exploiting concurrent processing streams is non-trivial(Aside: SPMD Single Program Multiple Data: more restrictive than MIMD, implying that all processors run the same executable.)
Address Space Organization: Message Passing
l
each processor has local or private memoryl
interact solely by message passingl
commonly known as distributed memory parallel computersl
memory bandwidth scales with number of processorsl
example: between “nodes” on the NCI Gadi systemI N T E R C O N N E C T PROCESSOR MEMORY PROCESSOR MEMORY MEMORY PROCESSOR PROCESSOR MEMORY
Address Space Organization: Shared Address Space
l
processors interact by modifying data objects stored in a shared address spacel
simplest solution is a flat or uniform memory access (UMA)l
scalability of memory bandwidth and processor-processor communications (arising from cache line transfers) are problemsl
so is synchronizing access to shared data objectsl
example: dual/quad core laptop (ignoring caches)MEMORY MEMORY MEMORY MEMORY
I N T E R C O N N E C T
Non-Uniform Memory Access (NUMA)
Machine includes some hierarchy in its memory structure
l
all memory is visible to the programmer (single address space), but some memory takes longer to access than othersl
in this sense, cache introduces one level of NUMAl
between sockets on the NCI Gadi system or in a multi-socket Opteron system(some ANU research on an Opteron (2011))
I N T E R C O N N E C T
MEMORY MEMORY MEMORY MEMORY
PROCESSOR PROCESSOR PROCESSOR PROCESSOR Cache Cache Cache Cache
MEMORY MEMORY MEMORY MEMORY I N T E R C O N N E C T
PROCESSOR PROCESSOR PROCESSOR PROCESSOR Cache Cache Cache Cache
Shared Address Space Access
Parallel Random Access Machine (PRAM): any shared memory machine
l
what happens when multiple processors try to read/write to the same memory location at the same time?l
PRAM modelsn
Exclusive-read, exclusive-write (EREW PRAM)n
Concurrent-read, exclusive-write (CREW PRAM)n
Exclusive-read, concurrent-write (ERCW PRAM)n
Concurrent-read, concurrent-write (CRCW PRAM)Concurrent read OK, but concurrent-write requires arbitration:
l
Common: allowed if all values being written are identicall
Arbitrary: an arbitrary processor is allowed to proceed and the rest faill
Priority: processors are organized into a predefined prioritized list; the process with highest priority succeeds, the rest failDynamic Processor Connectivity: Crossbar
l
is a non-blocking network, in that the connection of two processors does not block connection between other processorsl
complexity grows asO
(p
2)l
may be used to connect processors each with their own local memory Memory Processor and Memory Processor and Memory Processor and Memory Processor and Memory Processor and Memory Processor and Memory Processor and Memory Processor andl
may also be used to connect processors with various memory banksDynamic Processor Connectivity: Multi-staged Networks
000 010 100 110 001 011 101 111 Processors S W I T C H I N G N E T W O R K O M E G A N E T W O R K Memory 010 100 001 000 011 101 110 111l
consist of lgp
stages, wherep
is the number of processors (lgp
= log2(p
))l
s
andt
are binary representations of message source and destination at stage 1n
route through if the most significant bits ofs
andt
are the samen
otherwise, crossoverDynamic Processor Connectivity: Bus
l
processor gains exclusive access to the bus for some periodl
the performance of the bus limits scalabilityMEMORY MEMORY MEMORY MEMORY
PROCESSOR PROCESSOR PROCESSOR PROCESSOR Cache Cache Cache Cache
B U S
Performance
crossbar
>
multistage>
bus Costcrossbar
>
multistage>
busDiscussion point: which would you use for small, medium and large numbers of processors (
p)? What values of
p
would (roughly) be the cuttoff points?Static Processor Connectivity: Complete, Mesh, Tree
Completely connected (becomes very complex!)Linear array/ring, mesh/2d torus
Tree (static if nodes are processors)
Processors Switches
Static Processor Connectivity: Hypercube
0100 0101 0001 0000 1100 1000 1101 1001 1110 1010 1011 1111 0110 0010 0111 0011l
a multidimensional mesh with exactly two processors in each dimensionp
= 2dwhere
d
is the dimension of the hypercubel
advantages:≤ d
‘hops’ between any two processorsl
disadvantage: the number of connections per processor isd
l
can be generalized tok-ary
d
-cubes, e.g. a 4×
4 torus is a 4-ary 2-cubeStatic Processor Connectivity: Hypercube Characteristics
0100 0101 0001 0000 1100 1000 1101 1001 1110 1010 1011 1111 0110 0010 0111 0011l
two processors are connected directly ONLY IF their binary labels differ by one bitl
in ad-dimensional hypercube, each processor directly connects to
d
othersl
ad-dimensional hypercube can be partitioned into two
d
−
1 sub-cubes etcl
the number of links in the shortest path between two processors is the Hamming distance between their labelsn
the Hamming distance between two processors labeleds
andt
is the number of bits that are on in the binary representation ofs
⊕ t
where⊕
is the bitwiseEvaluating Static Interconnection Networks #1
Diameter
l
the maximum distance between any two processors in the networkl
directly determines communication time (latency)Connectivity
l
the multiplicity of paths between any two processorsl
a high connectivity is desirable as it minimizes contention (also enhances fault-tolerance)l
arc connectivity of the network: the minimum number of arcs that must be removed for the network to break it into two disconnected networksn
1 for linear arrays and binary treesn
2 for rings and 2D meshesn
4 for a 2D torusEvaluating Static Interconnection Networks #2
Channel width
l
the number of bits that can be communicated simultaneously over a link connecting two processorsBisection width and bandwidth
l
bisection width is the minimum number of communication links that have to be removed to partition the network into two equal halvesl
bisection bandwidth is the minimum volume of communication allowed between two halves of the network with equal numbers of processorsCost
l
many criteria can be used; we will use the number of communication links or wires required by the networkSummary: Static Interconnection Characteristics
Bisection Arc Cost
Network Diameter width connectivity (no. of links) Completely-connected 1
p
2/
4p
−
1p
(p
−
1)/
2 Binary Tree 2 lg((p
+ 1)/
2) 1 1p
−
1 Linear arrayp
−
1 1 1p
−
1 Ringbp/
2c
2 2p
2D Mesh 2(√
p
−
1)√
p
2 2(p
−
√
p
) 2D Torus 2b
√
p/
2c
2√
p
4 2p
Hypercube lgp
p/
2 lgp
(p
lgp
)/
2Note: the Binary Tree suffers from a bottleneck: all traffic between the left and right sub-trees must pass through the root. The fat tree interconnect alleviates this:
l
usually, only the leaf nodes contain a processorl
it can be easily ‘partitioned’ into sub-networks without degraded performance (useful for supercomputers)Discussion point: which would you use for small, medium and large numbers of processors (
p)? What values of
p
would (roughly) be the cuttoff points?NCI’s Raijin: A Petascale Supercomputer
l
57,472 cores (dual socket, 8 core Intel Xeon Sandy Bridge, 2.6 GHz) in 3592 compute nodesl
160 TBytes (approx.) of main memoryl
Mellanox Infiniband FDR interconnect (52km cables)l
interconnects: ring (cores), full (sockets), fat tree (nodes)l
10 PBytes (approx.) of usable fast filesystem (for shortterm scratch space apps, home directories)l
power: 1.5 MW max. loadl
cooling systems: 100 tonnes of waterl
24th fastest in the world in debut (November 2012); first petaflop system in Australian
fastest (parallel) filesystem in the s. hemispheren
custom monitoring and deploymentn
custom Linux kernelFurther Reading: Parallel Hardware
l
The Free Lunch Is Over!l
Ch 1, 2.1-2.4 of Introduction to Parallel Computingl
Ch 1, 2 of Principles of Parallel Programmingl
lecture notes from Calvin Lin:A Success Story: ISA Parallel Architectures