Dr. Gerhard Wellein, Dr. G. Hager, T. Zeiser HPC Services – Regionales Rechenzentrum Erlangen
University Erlangen-Nürnberg, summer semester 2007
Modern processors
Performance measures
Strategies to build faster computers
Multi-Core processors
Basic features of modern microprocessors
Pipelining
Superscalar architectures
Performance measures
Pure measures:
MFlops: Millions of Floating Point Operations per Second
(relevant for many technical & scientific applications)
MIPS: Millions of Instructions per Second
(e.g. data bases, web servers,…)
Execution time: wallclock time, user time, …..?
$\text{MFlops} = \frac{\text{Number of floating point operations executed}}{10^6 \cdot \text{execution time}}$
$\text{MIPS} = \frac{\text{Number of instructions executed}}{10^6 \cdot \text{execution time}}$
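The two definitions above can be written down directly. This is a minimal sketch, assuming the execution time is given in seconds; the function names `mflops` and `mips` are illustrative and not part of the slides.

```c
/* MFlops = floating point operations / (10^6 * execution time in seconds). */
double mflops(double flop_count, double seconds)
{
    return flop_count / (1.0e6 * seconds);
}

/* MIPS = instructions / (10^6 * execution time in seconds). */
double mips(double instruction_count, double seconds)
{
    return instruction_count / (1.0e6 * seconds);
}
```

Which time is passed in (wallclock or user time) changes the result, which is the open question raised on the slide.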
Performance measures
MFlops and MIPS numbers can be computed / measured for hardware (peak performance) and application programs (sustained performance)
Standard benchmark programs
LINPACK (cf. discussion about TOP500; http://www.top500.org): LINPACK ~ peak performance in most cases
STREAM (http://www.cs.virginia.edu/stream/): STREAM ~ quality of data access
SPEC ( http://www.spec.org)
A lot of different benchmarks: WEB Server, MAIL Server,…
SPEC CPU2000: numerical performance (14 application benchmarks in FORTRAN or C)
HPC Challenge: Several specific HPC benchmarks http://icl.cs.utk.edu/hpcc/
Performance measures
SPEC numbers:
Relative performance compared to a reference system (SUN Ultra10; 300 MHz Sparc processor) * 100
Geometric mean over all applications
Processor        Clock / system                     Compiler                  SPEC_fp2000
Intel P3         1.0 GHz, Dell Prec. 420            Intel Compiler V5.0       340
Intel Itanium2   1.5 GHz, 6 MB L3                   Intel Compiler V8         2043
Intel P4         3.2 GHz, DDR-400, Dell Prec. 360   Intel Compiler V7.1       1285
AMD Opteron      2.2 GHz, ASUS SK8N                 Intel Compiler V7.0       1490
AMD Opteron      2.2 GHz, ASUS SK8N                 PGI F90 5.1; gcc 3.3.1    1514
There must be other things besides clock speed!
Strategies to build faster computers….
How to build faster computers Survey
1. Increase performance / throughput of CPU
   a) Reduce cycle time, i.e. increase clock speed (Moore)
   b) Increase throughput, i.e. superscalar (internal parallelism)
2. Improve data access time
   a) Increase cache size
   b) Improve main memory access (bandwidth & latency)
3. Use parallel computing (shared memory)
   a) Requires shared-memory parallel programming
   b) Shared/separate caches
   c) Possible memory access bottlenecks
4. Use parallel computing (distributed memory)
   “Cluster” of computers tightly connected
   a) Almost unlimited scaling of memory and performance
   b) Distributed-memory parallel programming
[Figure: schematic diagrams of CPU, cache and memory arrangements]
How to build faster computers (1) Increase single processor performance
Reduce cycle time (increase clock speed)
Limited by current technology
Transition ECL → CMOS is necessary; done 10 years ago
Typical processor frequencies
PC: approx. 2.0-3.6 GHz
RISC: 1.3-1.7 GHz
Vector: 0.5-2.0 GHz
Problems with large power dissipation, even with CMOS
Power dissipation goes like (voltage)^2 times frequency
Requires pipelining of hardware units (cf. discussion later)
2 x clock speed = 2 x performance?
Memory does not run at CPU speed – DRAM gap!
How to build faster computers (2)
Increase single processor performance: Moore’s law
1965: G. Moore claimed that the number of transistors on a processor chip doubles every 12-24 months
Processor speed grew roughly at the same rate
My computer: 350 MHz (1998) – 3,000 MHz (2004)
Growth rate: 43 % p.a. -> doubles every 24 months
Problem: Power dissipation (see RRZE systems…)
This trend is currently changing: see multi-core
[Figure: Intel Corp.; label: market volume]
How to build faster computers (3) Increase single processor performance
Use internal parallelism (Instruction level parallelism) to increase throughput
Multiple arithmetic units can work in parallel (e.g. multiply and add units)
Intel Pentium4 / AMD Opteron: 1 Multiply & 1 Add unit -> 2 Flop / cycle
Intel Itanium2: 2 MultiplyAdd units -> 4 Flop / cycle
NEC SX8: 1 Multiply & 1 Add unit (4-way vector pipes) -> 8 Flop / cycle
Multiple Load/Store units are available for concurrent data transfer (memory/cache <-> registers)
Problem: Memory bandwidth becomes a bottleneck very quickly!
Thus: Memory bandwidth limits internal parallelism
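The Flop / cycle figures translate into peak performance once they are multiplied by the clock frequency. The following is a minimal sketch; the function names and the bytes-per-flop parameter are illustrative assumptions, not taken from the slides.

```c
/* Peak floating point rate in GFlop/s: flops per cycle times clock in GHz. */
double peak_gflops(double flops_per_cycle, double clock_ghz)
{
    return flops_per_cycle * clock_ghz;
}

/* Data rate in GByte/s needed to sustain a given GFlop/s rate if every
   flop consumes bytes_per_flop bytes that are not already in registers
   (bytes_per_flop is a property of the code, assumed by the caller). */
double required_bandwidth_gbs(double gflops, double bytes_per_flop)
{
    return gflops * bytes_per_flop;
}
```

For example, `peak_gflops(4.0, 1.5)` gives 6 GFlop/s for an Itanium2 at 1.5 GHz (the clock listed in the SPEC table). A loop that needs one fresh 8-byte operand per flop would then ask for 48 GByte/s, which illustrates why memory bandwidth limits internal parallelism.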
How to build faster computers (4)
Data throughput
Memory (DRAM) gap
Memory bandwidth grows only at a rate of 7% a year
Memory latency remains constant / increases in terms of processor speed
Loading a single data item from main memory can cost 100s of cycles on a 3 GHz CPU
Introducing memory hierarchies (caches) – complex optimization of code
Cache sizes can “easily” be enlarged -> Moore’s law
Optimization of main memory access is mandatory for most applications
How to build faster computers (5) Parallel Computing
Parallel Computing (data/functional parallelism)
Multiple CPUs share work and solve a problem cooperatively
Bookkeeping is shifted from hardware to software (user or compiler!)
Basic architecture concepts:
Shared Memory
UMA (Uniform Memory Access) machines:
Easy to program but memory bandwidth is limited! (e.g. 2- or 4-way SMPs / Multi-Core chips)
ccNUMA (cache-coherent Non-Uniform Memory Access):
Scalable to 100s of processors (e.g. SGI Altix)
Distributed Memory
NORMA (No-Remote Memory Access) machines:
(e.g. Xeon cluster; 10s to 10,000s of procs.)
NUMA (Non-Uniform Memory Access): CRAY T3E, Altix
Allow for scaling of (local) memory bandwidth but hard to program & communication bandwidth limited!
Exploiting Moore’s law without substantially increasing the single processor’s clock speed:
Multiple (independent) processor cores per chip
Multi-Core processors
Moore‘s law:
In the past: Smaller circuits -> Faster clock speeds
In the future: Smaller circuits -> Put several processors on a single silicon die (chip)
Available multi-core processors
AMD Opteron (cf. RRZE cluster), IBM Power4/5,
Intel: Xeon “Woodcrest” (Dual-Core) & “Clovertown” (Quad-Core) Intel Conroe
Technical advantages of Multi-Core technology
Power consumption using a single silicon die:
2 processor cores at 2 GHz << 1 core at 4 GHz
Price of the processor is mainly determined by the silicon die
Problems:
lower single core/processor performance -> PARALLELIZATION
Memory Wall – Now several processors share one FSB
Multi-core processors The party is over!
[Figure: processor chip with arithmetic units, FP registers, L1 caches and L2 cache, separated from main memory by the „DRAM Gap“]
Intel Xeon / Core (“Woodcrest”)
It is not a faster processor – it is a parallel computer on a chip.
Dual-Core: Put 2 processors on a chip which (may) share resources (L2 cache, memory bandwidth)
Efficient use of both cores for a single application -> programmer
Multi-core processors: The party is over!
Power and performance relative to a single core at maximum frequency:

Configuration           Power    Performance
Max frequency           1.00x    1.00x
Over-clocked (+20%)     1.73x    1.13x
Under-clocked (-20%)    0.51x    0.87x
Dual-core (-20%)        1.02x    1.73x

By courtesy of D. Vrsalovic, Intel
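The power figures quoted for the over-clocked, under-clocked and dual-core configurations are close to what a simple cubic model gives. The sketch below assumes that power per core scales with the third power of the clock frequency, which is the assumption used in the derivation on the next slide; the function name is illustrative.

```c
/* Relative power of `cores` cores, each clocked at freq_factor times the
   baseline frequency, assuming power per core scales with frequency^3. */
double relative_power(int cores, double freq_factor)
{
    return cores * freq_factor * freq_factor * freq_factor;
}
```

With this model, `relative_power(1, 1.2)` is about 1.73, `relative_power(1, 0.8)` about 0.51 and `relative_power(2, 0.8)` about 1.02. The performance figures (1.13x, 0.87x) are measured values from the Intel slide and are not reproduced by this model.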
Multi-Core Processors How many of them will be useful?
Question: What fraction of performance must be sacrificed per core in order to benefit from m cores?
Prerequisite: Overall power dissipation should be unchanged
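W: power dissipation
p: performance (1 core)
p_m: performance (m cores)
ε_f: relative frequency change Δf_c/f_c
ε_p: relative performance change Δp/p
m: number of cores

Power dissipation after a frequency change:
$W + \Delta W = (1 + \varepsilon_f)^3 \, W$
Unchanged overall power dissipation with m cores:
$(1 + \varepsilon_f)^3 \, m = 1 \;\Rightarrow\; \varepsilon_f = m^{-1/3} - 1$
Performance of m cores:
$p_m = (1 + \varepsilon_p) \, p \, m$
Benefit from m cores:
$p_m \ge p \;\Rightarrow\; \varepsilon_p \ge \frac{1}{m} - 1$

[Figure: required relative frequency reduction -ε_f vs. core count (m); marker: “Available today”]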
Evolutionary Configurable Architecture: “Micro2015” Vision and Research
(© 2006 Intel Corporation; Intel Tera-Scale Computing Research Program)
Evolution:
Dual core
• Symmetric multithreading
Multi-core array
• CMP with ~10 cores
Many-core array
• CMP with 10s-100s low power cores
• Scalar cores
• Capable of TFLOPS+
• Full System-on-Chip
• Servers, workstations, embedded…
Large, scalar cores for high single-thread performance
Scalar plus many core for highly threaded workloads
Basic features of modern
microprocessors
Architecture of modern microprocessors
Application: High Level Programming Language (e.g. C / C++ / Fortran) - portable
Compiler translates program to machine-specific instructions (IA32, IA64)
Modern computers / microprocessors – von Neumann concept is still visible, but
Several memory levels (3-4)
Multiple Arithmetical Logical Units (e.g. 8 hardware units for integer and fp operations on Itanium2)
[Figure: layers Application / Compiler / Instruction Set / Computer (Control Unit, Mem, ALU, IO)]
Architecture of modern microprocessors
History
In the beginning (~30-40 years ago): Complex Instruction Set Computers (CISC):
Powerful & complex instructions, e.g: A=B*C: 1 instruction
Instruction set is close to high-level programming language
Variable length of instructions - Save storage!
Mid 80´s: Reduced Instruction Set Computer (RISC) evolved:
Fixed instruction length; enables pipelining and high clock frequencies
Uses simple instructions, e.g.: A=B*C is split into at least 4 operations (LD B, LD C, MULT A=B*C, ST A)
Nowadays: Superscalar RISC processors
IA32 (P4, Athlon, Opteron): Compiler still generates CISC instructions;
but processor core: RISC like
~2001: Explicitly Parallel Instruction Computing (EPIC) introduced
Compiler builds large groups of instructions to be executed in parallel
First processors: Intel Itanium1/2 using the IA64 instruction set.
Simple view of modern processors
Cache based microprocessors (e.g. Intel P4, AMD Opteron)
[Figure: cache based processor with arithmetic & functional units, registers, L1 D-cache, L1 I-cache (fetch, decode, branch prediction), L2 cache (data / instr.) and main memory; frequency ~3 GHz (processor), ~0.4 GHz (main memory)]
Processor is built up by:
• Arithmetic & functional units, e.g. Multiply-unit, Integer-units, MMX, …
• These units can only use operands resident in the registers
• Operands are read (written) by load (store) units from main memory/caches to registers
• Caches are fast but small pieces of memory (5-10 times faster than main memory)
• A lot of additional logic: e.g. branch prediction
[Figure: Intel® Core™ architecture block diagram. Copyright © 2006 Intel Corporation; slide dated 22.04.2007.
Fetch / Decode: 32 KB instruction cache, next IP, branch target buffer, instruction decode (4 issue), microcode sequencer, register allocation table (RAT).
Scheduler / Dispatch ports: reservation stations (RS), 32 entry.
Execute: ports with integer arithmetic, integer shift/rotate, SIMD, FP add, FP div/mul, load, store address and store data units; memory order buffer (MOB); 32 KB data cache.
Retire: re-order buffer (ROB), 96 entry; IA register set.
Bus unit; connection to L2 cache.]
Disclaimer: This block diagram is for example purposes only.
Significant hardware blocks have been arranged or omitted for clarity.
Some resources (Bus Unit, L2 Cache, etc…) are shared between cores.
Simple view of modern processors
Intel Itanium 2 – physical view
[Figure: Itanium 2 die with L2 data cache, floating point unit, L1i cache, IA32, IEU, MMU, L1d cache, D-TLB, L2 tag, L3 tag, pipeline, bus logic and L3 cache; source: H. Strauss, HP]
450 million transistors on a 2 cm by 2 cm die!
In 2006/7:
More than 1700 million transistors on a 2.5 cm by 2.5 cm die
Architecture of modern microprocessors
Pipelining of arithmetic/functional units
Split complex operations (e.g. multiplication) into several simple / fast sub-operations (stages)
Makes short cycle time possible (simpler logic circuits), e.g.:
Multiplication takes 5 cycles, but
processor can work on 5 different multiplications simultaneously
Can produce one result each cycle after the pipeline is full
Drawback:
Pipeline must be filled - startup times
Requires complex instruction scheduling by compiler/hardware – software-pipelining / out-of-order
Extensive use requires large number of independent instructions – instruction level parallelism
Vector supercomputers use this method extensively
Pipelining:
5-stage multiplication pipeline: A(i)=B(i)*C(i) ; i=1,...,N

Cycle   Separate       Mult.       Add.        Normal.   Insert
        Mant. / Exp.   Mantissa    Exponents   Result    Sign
1       B(1) C(1)
2       B(2) C(2)      B(1) C(1)
3       B(3) C(3)      B(2) C(2)   B(1) C(1)
4       B(4) C(4)      B(3) C(3)   B(2) C(2)   A(1)
5       B(5) C(5)      B(4) C(4)   B(3) C(3)   A(2)      A(1)
6       B(6) C(6)      B(5) C(5)   B(4) C(4)   A(3)      A(2)
...     ...            ...         ...         ...       ...
N+4                                                      A(N)

First result is available after 5 cycles (= latency of pipeline)!
Pipelining
Benefits and drawbacks (1)
Pipelining versus purely sequential execution of multiplication
Sequential:
1 multiplication = 5 cycles
N multiplications: Tseq(N) = (5*N) cycles
Pipelining:
Start-up = 5 cycles
N multiplications: Tpipe(N) = (4+N) cycles
Speed-up:
Tseq / Tpipe = (5*N) / (N+4) = 5 / (1 + 4/N) ~ 5 for large N (>>5)
Throughput (results per cycle) of pipeline:
N / Tpipe(N) = N / (4+N) = 1 / (1 + 4/N) ~ 1 for large N
Pipelining
Benefits and drawbacks (2)
In general (m-stage pipe / pipeline depth: m)
Speed-up:
Tseq / Tpipe = (m*N) / (N+m-1) ~ m for large N (>>m)
Throughput (results per cycle):
N / Tpipe(N) = N / (N+m-1) = 1 / [ 1+(m-1)/N ] ~ 1 for large N
Number of independent operations (NC) required to achieve Tp results per cycle:
Tp = 1 / [ 1+(m-1)/NC ]  =>  NC = Tp (m-1) / (1-Tp)
Tp = 0.5  =>  NC = m-1
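The pipeline formulas above are easy to evaluate for concrete values of m and N. The following is a minimal sketch with illustrative function names; it assumes an ideal pipeline in which a new independent operation can be started every cycle.

```c
/* Cycles needed for n independent operations on an m-stage pipeline. */
long pipeline_cycles(long m, long n)
{
    return n + m - 1;
}

/* Speed-up versus purely sequential execution (m cycles per operation). */
double pipeline_speedup(long m, long n)
{
    return (double)(m * n) / (double)pipeline_cycles(m, n);
}

/* Results per cycle delivered for n independent operations. */
double pipeline_throughput(long m, long n)
{
    return (double)n / (double)pipeline_cycles(m, n);
}

/* Number of independent operations NC needed to reach a throughput of
   tp results per cycle (0 < tp < 1): NC = tp (m-1) / (1 - tp). */
double ops_for_throughput(long m, double tp)
{
    return tp * (double)(m - 1) / (1.0 - tp);
}
```

For the 5-stage multiplication pipeline, `pipeline_cycles(5, 12)` is 16 and `ops_for_throughput(5, 0.5)` is 4, i.e. m-1 operations are needed to reach half of the asymptotic throughput.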
Pipelining
Benefits and drawbacks (3)
Drawbacks:
N small (e.g. N=1) – No speed-up!
Increasing clock frequency -> pipeline depth m increases
Operations (e.g. Multiplications) within pipeline must be independent!
Optimal scheduling of instructions by compiler depends on pipeline depth!
Effective pipeline length for execution of an arithmetic operation is much longer than number of pipeline stages of arithmetic unit.
Pipelining
Benefits and drawbacks (4)
Pipelining Efficient use
Efficient use of pipelining requires intelligent compilers
Rearrangement of instructions to hide latencies
High level of „Software pipelining“ (in particular Itanium)
Remove interdependencies that block parallel execution (user/ programmer)
Out-of-order execution on processor (except Itanium)
Example:
Fortran code:
  do i=1,N
    a(i) = a(i) * c
  end do
Simple pseudo code:
  loop: load a[i]
        mult a[i] = c, a[i]
        store a[i]
        branch.loop
Latencies:
  load a[i]             Load operand to register (4 cycles)
  mult a[i] = c,a[i]    Multiply a(i) with c (2 cycles); a[i], c in registers
  store a[i]            Write back result from register to mem./cache (2 cycles)
  branch.loop           Increase loop counter as long as i is less than or equal to N (0 cycles)
Assumptions:
• One load, one store & one multiply (mult) can be issued per cycle
• The processor stalls if an instruction is waiting for its operands
Pipelining Efficient use
a[i]=a[i]*c; N=12

Naive instruction issue (T = 96 cycles), first 19 cycles:
Cycle 1:   load a[1]
Cycle 5:   mult a[1]=c,a[1]
Cycle 7:   store a[1]
Cycle 9:   load a[2]
Cycle 13:  mult a[2]=c,a[2]
Cycle 15:  store a[2]
Cycle 17:  load a[3]
...

Optimized instruction issue (T = 19 cycles):
Cycle   load          mult                  store
1       load a[1]                                          Prolog
2       load a[2]
3       load a[3]
4       load a[4]
5       load a[5]     mult a[1]=c,a[1]
6       load a[6]     mult a[2]=c,a[2]
7       load a[7]     mult a[3]=c,a[3]      store a[1]     Kernel
8       load a[8]     mult a[4]=c,a[4]      store a[2]
9       load a[9]     mult a[5]=c,a[5]      store a[3]
10      load a[10]    mult a[6]=c,a[6]      store a[4]
11      load a[11]    mult a[7]=c,a[7]      store a[5]
12      load a[12]    mult a[8]=c,a[8]      store a[6]
13                    mult a[9]=c,a[9]      store a[7]     Epilog
14                    mult a[10]=c,a[10]    store a[8]
15                    mult a[11]=c,a[11]    store a[9]
16                    mult a[12]=c,a[12]    store a[10]
17                                          store a[11]
18                                          store a[12]
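The two cycle counts for a[i]=a[i]*c can be reproduced from the latencies alone. This is a sketch under the slide's assumptions (load 4 cycles, mult 2 cycles, store 2 cycles, one load, one mult and one store issued per cycle); the function names are illustrative.

```c
/* Latencies in cycles, as assumed on the slide. */
enum { LOAD_LATENCY = 4, MULT_LATENCY = 2, STORE_LATENCY = 2 };

/* Naive issue: every iteration waits for its load, mult and store in turn,
   and the next iteration starts only after the store has finished. */
long naive_cycles(long n)
{
    return n * (LOAD_LATENCY + MULT_LATENCY + STORE_LATENCY);
}

/* Software-pipelined issue: iteration i issues its load in cycle i, its
   mult in cycle i + 4 and its store in cycle i + 6; the store of the last
   iteration finishes at the end of cycle n + 7. */
long pipelined_cycles(long n)
{
    return n + LOAD_LATENCY + MULT_LATENCY + STORE_LATENCY - 1;
}
```

For N=12 this gives 96 and 19 cycles. The second expression has the same form N+m-1 as the pipeline formula, with an effective pipeline depth of m=8.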
Pipelining Efficient use
Optimized kernel:
Software pipelining by compiler: reordering instructions considering their latencies
Pseudo code of the loop kernel:
  loop: load a[i+6]
        mult a[i+2] = c, a[i+2]
        store a[i]
        branch.loop
Latency of MULT pipeline: 2 cycles
Latency of load: 4 cycles
The number of cycles spent in the loop kernel should be much larger than in prolog/epilog
Dependencies within loop body prevent efficient software pipelining:
Fortran code:
  do i=2,N
    a(i) = a(i-1) * c
  end do
Computation of a[i-1] must be completed before a[i] is started!
Pipelining Efficient use
a[i]=a[i-1]*c; N=12

Naive instruction issue (T = 96 cycles), first 19 cycles:
Cycle 1:   load a[1]
Cycle 5:   mult a[2]=c,a[1]
Cycle 7:   store a[2]
Cycle 9:   load a[2]
Cycle 13:  mult a[3]=c,a[2]
Cycle 15:  store a[3]
Cycle 17:  load a[3]
...

Optimized instruction issue (T = 26 cycles), first 19 cycles:
Cycle   load         mult                 store
1       load a[1]                                         Prolog
5                    mult a[2]=c,a[1]
7                    mult a[3]=c,a[2]     store a[2]      Kernel
9                    mult a[4]=c,a[3]     store a[3]
11                   mult a[5]=c,a[4]     store a[4]
13                   mult a[6]=c,a[5]     store a[5]
15                   mult a[7]=c,a[6]     store a[6]
17                   mult a[8]=c,a[7]     store a[7]
19                   mult a[9]=c,a[8]     store a[8]
...
Pipelining Efficient use
Performance impact of dependencies on Intel Xeon 2.66 GHz
[Figure: measured performance of A(i)=A(i+1)*c and A(i)=A(i-1)*c]
Start-up of long effective pipeline
High performance for data in caches (N < 30000)
Why?
Pipelining Efficient use
Basic types of (potential) dependencies within loop body may prevent efficient software pipelining, e.g.:
No dependency:
  do i=1,N
    a(i) = a(i) * c
  end do
Dependency:
  do i=2,N
    a(i) = a(i-1) * c
  end do
Pseudo-dependency:
  do i=1,N-1
    a(i) = a(i+1) * c
  end do
General version (offset as input parameter):
  do i=max(1+offset,1),min(N+offset,N)
    a(i) = a(i-offset) * c
  end do
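The three cases differ only in the sign of the offset, which the general version makes visible. Below is a minimal C sketch of that general loop with 0-based indexing; the function name and the bounds handling are assumptions for illustration.

```c
/* a[i] = a[i - offset] * c for all i where both i and i - offset lie
   inside [0, n).
   offset > 0: real dependency (each element needs the updated predecessor)
   offset = 0: no dependency
   offset < 0: pseudo-dependency (only old values are read) */
void scale_with_offset(double *a, int n, int offset, double c)
{
    int first = offset > 0 ? offset : 0;
    int last  = offset < 0 ? n + offset : n;   /* exclusive upper bound */

    for (int i = first; i < last; ++i)
        a[i] = a[i - offset] * c;
}
```

Starting from {1, 1, 1, 1} with c=2, offset=1 yields {1, 2, 4, 8} because every result feeds the next iteration, whereas offset=-1 yields {2, 2, 2, 1} because each iteration reads a value that has not been overwritten yet. Since the offset is only known at run time, a compiler has to assume the worst case unless it generates separate code paths.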
Pipelining Data dependencies
Pipelining
Further potential problems
Typical number of pipeline stages: 2-5 for the hardware pipelines on modern CPUs.
1 or 2 MultAdd units per processor, i.e. processor core
Modern microprocessors do not provide pipelines for div / sqrt or exp / sin!
Example: Cycles per operation (8-byte) (Xeon/Netburst)

Operation         Latency     Throughput   Cycles/Operation
y=a+y (y=a*y)     4*          2*           1*
y=a/y             70*         70*          35*
y=dsqrt(y)        70*         70*          35*
y=sin(y)          ~160-180    130          130

* Using SIMD instructions (SSE2)
Reduce number of complex operations if necessary.
Replace function call with a table lookup if the function is frequently computed for a few different arguments only.
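One common way to reduce the number of complex operations is to hoist a division by a loop-invariant value out of the loop and multiply by its reciprocal instead. This is a sketch of that idea, not something shown on the slides; the function name is illustrative, and the result can differ from a true division in the last bit because of rounding.

```c
/* y[i] = x[i] / a, computed with one division and n multiplications
   instead of n divisions. */
void divide_all(const double *x, double *y, int n, double a)
{
    const double inv_a = 1.0 / a;   /* the only division */

    for (int i = 0; i < n; ++i)
        y[i] = x[i] * inv_a;        /* pipelined multiply */
}
```

With the cycle counts in the table, each division that is replaced by a multiplication saves on the order of tens of cycles.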
Pipelining
Instruction pipeline
Besides the arithmetic and functional units, the instruction execution itself is also pipelined, e.g. one instruction performs at least 3 steps:
Fetch instruction from L1I
Decode instruction
Execute instruction
Hardware pipelining on processor (all units can run concurrently):

Cycle (t)   1               2               3               4               …
Fetch       Instruction 1   Instruction 2   Instruction 3   Instruction 4
Decode                      Instruction 1   Instruction 2   Instruction 3
Execute                                     Instruction 1   Instruction 2

Branches can stall this pipeline! (Speculative execution, predication)
Each unit is pipelined itself (cf. Execute = multiply pipeline)
Pipelining
PowerPC Instruction Pipeline
14-stage pipeline for FP operations!
Pipeline of P4:
20 stages!
Superscalar Processors
Superscalar Processors can run multiple Instruction Pipelines at the same time!
Parallel hardware components / pipelines are available to
fetch / decode / issue multiple instructions per cycle (typically 2 – 8 per cycle)
load (store) multiple operands (results) from (to) cache per cycle (typically 2-4 8-byte words per cycle)
perform multiple integer / address calculations per cycle (e.g. 6 integer units on Itanium2)
perform multiple floating point operations per cycle (typically 2 or 4 floating point operations per cycle)
On superscalar RISC processors, out-of-order execution hardware is available to optimize the usage of the parallel hardware
Superscalar Processors
Instruction Level Parallelism through superscalar execution
Multiple units enable use of Instruction Level Parallelism (ILP):
Issuing m concurrent instructions per cycle: m-way superscalar
Modern processors are 3- to 6-way superscalar & can perform 2 or 4 floating point operations per cycle
[Figure: 4-way „superscalar“: four instruction pipelines (fetch from L1I, decode, execute) running side by side over time t]
Superscalar Processor Exploit ILP
Example: Calculate norm of a vector
Naive version:
  t=0
  do i=1,n
    t=t+a(i)*a(i)
  end do
2 FP Mult/Add units cannot be busy at the same time because of the dependency in summation variable t („Load-after-Store dependency“)
2nd MADD has to wait for the first to complete, although in principle two independent MADDs could be done:
  R1 = MADD(R1,A(I))
  STALL
  R1 = MADD(R1,A(I+1))
Superscalar Processor
Exploit ILP: Modulo variable expansion
Optimized version:
  t1=0
  t2=0
  do i=1,N,2
    t1=t1+a(i)*a(i)
    t2=t2+a(i+1)*a(i+1)
  end do
  t=t1+t2
Two independent „instruction streams” can be processed by two separate FP Mult/Add units!
  R1 = MADD(R1,A(I))      R2 = MADD(R2,A(I+1))
  R1 = MADD(R1,A(I+2))    R2 = MADD(R2,A(I+3))
  …
Most compilers can do those optimizations automatically!
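The same transformation can be written in C. The sketch below also handles an odd number of elements, which the Fortran loop with stride 2 implicitly assumes away; the function names are illustrative. Because floating point addition is not associative, the two versions may differ in the last bits for general input.

```c
#include <stddef.h>

/* Naive version: every update of t depends on the previous one. */
double sum_squares_naive(const double *a, size_t n)
{
    double t = 0.0;
    for (size_t i = 0; i < n; ++i)
        t += a[i] * a[i];
    return t;
}

/* Modulo variable expansion: two independent partial sums t1 and t2
   can keep two multiply-add units busy at the same time. */
double sum_squares_two_streams(const double *a, size_t n)
{
    double t1 = 0.0, t2 = 0.0;
    size_t i;

    for (i = 0; i + 1 < n; i += 2) {
        t1 += a[i] * a[i];
        t2 += a[i + 1] * a[i + 1];
    }
    if (i < n)                      /* leftover element when n is odd */
        t1 += a[i] * a[i];
    return t1 + t2;
}
```

Both functions return the squared norm; the norm itself is the square root of that value.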
Superscalar Processors Some pitfalls
Data dependencies can prevent the parallel use of hardware, e.g.
  for (i=0;…) A(i) = A(i-1)*c
(only one multiplication can be performed at a time)
Data dependencies: the compiler cannot resolve aliasing conflicts!
  void subscale( A , B )
  ….
  for (i=0;…) A(i) = B(i-1)*c
In C / C++ the pointers A and B can point to the same memory location -> see above
You should tell the compiler if you never use aliasing (-fno-alias on Intel compiler)
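Besides a compiler flag, the no-aliasing promise can also be made per function in the source code. The sketch below uses the C99 `restrict` qualifier as an alternative to the flag named on the slide; the function name and signature are illustrative, not the `subscale` routine itself.

```c
/* a[i] = b[i-1] * c. The restrict qualifiers promise the compiler that the
   memory written through a is not accessed through b inside this function,
   so all iterations may be treated as independent.
   Calling this with overlapping arrays is undefined behaviour. */
void scale_shifted(double *restrict a, const double *restrict b,
                   int n, double c)
{
    for (int i = 1; i < n; ++i)
        a[i] = b[i - 1] * c;
}
```

Unlike a global flag, the promise is limited to this one function, but it is equally unchecked: the programmer remains responsible for never passing aliased arguments.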
Superscalar Processors Some pitfalls
Avoid frequent and random (not predictable) branches in the application code, e.g.
  do i=1,….
    if( random(0:1) > 0.5) then
      <Block1>
    else
      <Block2>
    endif
  enddo
Superscalar processors try to predict the branch and speculatively start the pipeline for the next iterations.
If the branch was mispredicted the pipeline has to be flushed!
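When the two blocks are simple, an unpredictable branch can sometimes be replaced by arithmetic on the comparison result. The following sketch illustrates this for counting elements above a threshold; it is an example chosen here, not taken from the slides, and an optimizing compiler may already perform the same transformation on the first version.

```c
/* Branching version: the if is hard to predict for random data. */
int count_above_branch(const double *x, int n, double threshold)
{
    int count = 0;
    for (int i = 0; i < n; ++i) {
        if (x[i] > threshold)
            count++;
    }
    return count;
}

/* Branch-free version: in C a comparison yields 0 or 1, which is added
   directly, so the loop body contains no data-dependent branch. */
int count_above_nobranch(const double *x, int n, double threshold)
{
    int count = 0;
    for (int i = 0; i < n; ++i)
        count += (x[i] > threshold);
    return count;
}
```

Both functions return the same count; only the second one avoids a pipeline flush per mispredicted element.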
Superscalar Processor Efficient Use of Pipelining and ILP
Efficient use of pipelining/ILP requires intelligent compilers
Rearrangement of instructions to hide latencies
„Software pipelining“
Remove interdependencies that block parallel execution
Programmer should
Avoid unpredictable branches (stop and restart of pipeline!)
Avoid Data dependencies (if possible)
Tell compiler that instructions are independent
(e.g. do not use pointer aliasing: -fno-alias with Intel compiler)
Long FP pipeline is inefficient for very small loops
Pipeline must be filled, i.e. long start-up times
Summary:
A large number of independent / parallel instructions is mandatory to use pipelined, superscalar processors efficiently.
Most of the work can be done by the compiler; however, the programmer must provide reasonable code