
Dr. Gerhard Wellein, Dr. G. Hager, T. Zeiser HPC Services – Regionales Rechenzentrum Erlangen

University Erlangen-Nürnberg Sommersemester 2007

Modern processors

Performance measures

Strategies to build faster computers

Multi-core processors

Basic features of modern microprocessors

Pipelining

Superscalar architectures

[email protected] 2

Performance measures

ƒ Pure measures:

ƒ MFlops: Millions of Floating Point Operations per Second

(relevant for many technical & scientific applications)

ƒ MIPS: Millions of Instructions per Second

(e.g. data bases, web servers,…)

ƒ Execution time: wallclock time, user time, …..?

$\mathrm{MFlops} = \dfrac{\text{number of floating point operations executed}}{10^{6} \times \text{execution time}}$

$\mathrm{MIPS} = \dfrac{\text{number of instructions executed}}{10^{6} \times \text{execution time}}$
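The two definitions above can be sketched in a few lines of Python. This is only an illustration: the function names `mflops` and `mips` and the operation counts and run time in the example are hypothetical, not measurements from the lecture.

```python
def mflops(flop_count: float, seconds: float) -> float:
    """Millions of floating point operations per second."""
    return flop_count / (1e6 * seconds)


def mips(instruction_count: float, seconds: float) -> float:
    """Millions of instructions per second."""
    return instruction_count / (1e6 * seconds)


# Hypothetical run: 2e9 floating point operations and 5e9 instructions
# executed in 4 s of wallclock time.
print(mflops(2e9, 4.0))  # 500.0
print(mips(5e9, 4.0))    # 1250.0
```

Which execution time is used (wallclock or user time) changes the result, which is why the slide lists it as an open question.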

Performance measures

ƒ MFlops and MIPS numbers can be computed or measured for hardware (peak performance) and for application programs (sustained performance)

ƒ Standard benchmark programs

ƒ LINPACK (cf. discussion about TOP500; http://www.top500.org): LINPACK ~ peak performance in most cases

ƒ STREAM (http://www.cs.virginia.edu/stream/): STREAM ~ quality of data access

ƒ SPEC (http://www.spec.org): a lot of different benchmarks (web server, mail server, …)

SPEC CPU2000: numerical performance (14 application benchmarks in FORTRAN or C)

ƒ HPC Challenge: several specific HPC benchmarks (http://icl.cs.utk.edu/hpcc/)

Performance measures

ƒ SPEC numbers:

ƒ Relative performance compared to a reference system (SUN Ultra10; 300 MHz Sparc processor) * 100

ƒ Geometric mean over all applications

Processor                   Version             Compiler                  SPEC_fp2000
Intel P3 (Dell Prec. 420)   1.0 GHz             Intel Compiler V5.0       340
Intel Itanium2              1.5 GHz, 6 MB L3    Intel Compiler V8         2043
Intel P4 (Dell Prec. 360)   3.2 GHz, DDR-400    Intel Compiler V7.1       1285
AMD Opteron                 2.2 GHz, ASUS SK8N  Intel Compiler V7.0       1490
AMD Opteron                 2.2 GHz, ASUS SK8N  PGI F90 5.1; gcc 3.3.1    1514
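How a SPEC-style number is formed (performance relative to the reference system times 100, then the geometric mean over all applications) can be sketched as follows. The function name `spec_rating` and all run times are hypothetical; they are not SPEC CPU2000 data.

```python
import math


def spec_rating(reference_times, measured_times):
    """Geometric mean of 100 * (reference time / measured time)."""
    ratios = [100.0 * ref / t for ref, t in zip(reference_times, measured_times)]
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


# Hypothetical run times in seconds for three applications.
reference = [1000.0, 2000.0, 4000.0]   # reference system
measured = [500.0, 500.0, 500.0]       # system under test
print(round(spec_rating(reference, measured)))  # 400
```

The geometric mean keeps a single very fast application from dominating the rating, which an arithmetic mean of the ratios would allow.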

There must be other things besides clock speed!

Strategies to build faster computers….

How to build faster computers: Survey

1. Increase performance / throughput of CPU
   a) Reduce cycle time, i.e. increase clock speed (Moore)
   b) Increase throughput, i.e. superscalar (internal parallelism)

2. Improve data access time
   a) Increase cache size
   b) Improve main memory access (bandwidth & latency)

3. Use parallel computing (shared memory)
   a) Requires shared-memory parallel programming
   b) Shared/separate caches
   c) Possible memory access bottlenecks

4. Use parallel computing (distributed memory): "cluster" of tightly connected computers
   a) Almost unlimited scaling of memory and performance
   b) Requires distributed-memory parallel programming

[Figure: block diagrams of the four options, each built from CPU, cache and memory boxes]

How to build faster computers (1) Increase single processor performance

ƒ Reduce cycle time (increase clock speed)

ƒ Limited by current technology

ƒ Transition ECL → CMOS is necessary; done 10 years ago

ƒ Typical processor frequencies

ƒ PC: approx. 2.0-3.6 GHz

ƒ RISC: 1.3-1.7 GHz

ƒ Vector: 0.5-2.0 GHz

ƒ Problems with large power dissipation, even with CMOS

ƒ Power dissipation goes like (Voltage)^2 times frequency

ƒ Requires pipelining of hardware units (cf. discussion later)

2 x clock speed = 2 x performance? Memory does not run at CPU speed: DRAM gap!

Market volume

How to build faster computers (2)
Increase single processor performance: Moore's law

ƒ 1965: G. Moore claimed that the number of transistors on a processor chip doubles every 12-24 months

ƒ Processor speed grew roughly at the same rate
  My computer: 350 MHz (1998) to 3,000 MHz (2004)
  Growth rate: 43 % p.a. -> doubles roughly every 24 months

ƒ Problem: Power dissipation (see RRZE systems…)

This trend is currently changing: see multi-core

[Figure: Moore's law chart; source: Intel Corp.]
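The growth rate quoted for the clock frequency example (350 MHz in 1998, 3,000 MHz in 2004) can be checked with a short calculation. The helper names `annual_growth` and `doubling_time_years` are hypothetical.

```python
import math


def annual_growth(f_start: float, f_end: float, years: float) -> float:
    """Compound annual growth rate between two clock frequencies."""
    return (f_end / f_start) ** (1.0 / years) - 1.0


def doubling_time_years(rate: float) -> float:
    """Years needed to double at a constant annual growth rate."""
    return math.log(2.0) / math.log(1.0 + rate)


rate = annual_growth(350.0, 3000.0, 2004 - 1998)
print(f"{rate:.0%}")  # 43%
print(f"{doubling_time_years(rate):.2f} years")
```

The doubling time comes out slightly below two years, which the slide rounds to 24 months.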

How to build faster computers (3) Increase single processor performance

ƒ Use internal parallelism (Instruction level parallelism) to increase throughput

ƒ Multiple arithmetic units can work in parallel (e.g. multiply and add units)

ƒ Intel Pentium4 / AMD Opteron: 1 Multiply & 1 Add unit -> 2 Flop / cycle

ƒ Intel Itanium2: 2 MultiplyAdd units -> 4 Flop / cycle

ƒ NEC SX8: 1 Multiply & 1 Add unit (4-way vector pipes) -> 8 Flop / cycle

ƒ Multiple Load/Store units are available for concurrent data transfer (memory/cache <-> registers)

ƒ Problem: Memory bandwidth becomes a bottleneck very quickly!

ƒ Thus: Memory bandwidth limits internal parallelism
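The Flop / cycle figures above translate into a peak performance once a clock frequency is chosen: peak = Flop per cycle times clock frequency. In the sketch below the Flop / cycle values are the ones from the list, while the clock frequencies are assumptions picked from the "typical processor frequencies" ranges given earlier; `peak_gflops` is a hypothetical helper.

```python
def peak_gflops(flops_per_cycle: int, clock_ghz: float) -> float:
    """Peak floating point performance in GFlop/s."""
    return flops_per_cycle * clock_ghz


# (Flop / cycle from the slide, assumed clock in GHz)
examples = {
    "Pentium4 / Opteron": (2, 3.0),
    "Itanium2": (4, 1.5),
    "NEC SX8": (8, 2.0),
}
for name, (fpc, ghz) in examples.items():
    print(f"{name}: {peak_gflops(fpc, ghz):.1f} GFlop/s peak")
```

These are upper bounds only; as the last two bullets state, memory bandwidth usually keeps sustained performance well below peak.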

How to build faster computers (4)
Data throughput

ƒ Memory (DRAM) Gap

ƒ Memory bandwidth grows by only about 7% a year

ƒ Memory latency remains constant / increases in terms of processor speed

ƒ Loading a single data item from main memory can cost 100s of cycles on a 3 GHz CPU

ƒ Introducing memory hierarchies (caches): complex optimization of code

ƒ Cache sizes can "easily" be enlarged -> Moore's law

Optimization of main memory access is mandatory for most applications

How to build faster computers (5) Parallel Computing

ƒ Parallel Computing (data/functional parallelism)

ƒ Multiple CPUs share work and solve a problem cooperatively

ƒ Bookkeeping is shifted from hardware to software (user or compiler!)

ƒ Basic architecture concepts:

ƒ Shared Memory

ƒ UMA (Uniform Memory Access) machines: easy to program, but memory bandwidth is limited! (e.g. 2- or 4-way SMPs / multi-core chips)

ƒ ccNUMA (cache-coherent Non-Uniform Memory Access): scalable to 100s of processors (e.g. SGI Altix)

ƒ Distributed Memory

ƒ NORMA (No-Remote Memory Access) machines (e.g. Xeon cluster; 10s to 10,000s of processors)

ƒ NUMA (Non-Uniform Memory Access): CRAY T3E, Altix

Allow for scaling of (local) memory bandwidth, but hard to program & communication bandwidth limited!

Exploiting Moore's law without substantially increasing the single processor's clock speed:

Multiple (independent) processor cores per chip

Multi-Core processors

ƒ Moore‘s law:

ƒ In the past: Smaller circuits -> Faster clock speeds

ƒ In the future: Smaller circuits -> Put several processors on a single silicon die (chip)

ƒ Available multi-core processors

ƒ AMD Opteron (cf. RRZE cluster), IBM Power4/5,

Intel: Xeon “Woodcrest” (Dual-Core) & “Clovertown” (Quad-Core) Intel Conroe

ƒ Technical advantages of Multi-Core technology

ƒ Power consumption using a single silicon die:

2 processor cores at 2 GHz << 1 core at 4 GHz

ƒ Price of the processor is mainly determined by the silicon die

ƒ Problems:

ƒ lower single core/processor performance -> PARALLELIZATION

ƒ Memory Wall – Now several processors share one FSB

Multi-core processors
The party is over!

[Figure: Intel Xeon / Core ("Woodcrest") processor chip with two cores, each with its own arithmetic unit, FP registers and L1 cache, and a common L2 cache; main memory sits behind the "DRAM gap"]

It is not a faster processor: it is a parallel computer on a chip.

Dual-Core: Put 2 processors on a chip which (may) share resources (L2 cache, memory bandwidth)

Efficient use of both cores for a single application -> programmer

Multi-core processors
The party is over!

[Figure, built up over four slides: power and performance relative to a single core at maximum frequency. By courtesy of D. Vrsalovic, Intel]

Configuration           Power    Performance
Max frequency           1.00x    1.00x
Over-clocked (+20%)     1.73x    1.13x
Under-clocked (-20%)    0.51x    0.87x
Dual-core (-20%)        1.02x    1.73x
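The power figures in this chart are consistent with power dissipation growing with the third power of the clock frequency, the model used in the analysis that follows in the lecture. A minimal sketch under that assumption (`relative_power` is a hypothetical helper; the performance figures are taken from the chart and are not modelled here):

```python
def relative_power(freq_factor: float, cores: int = 1) -> float:
    """Power relative to one core at nominal clock, assuming power ~ frequency**3."""
    return cores * freq_factor ** 3


print(round(relative_power(1.2), 2))            # over-clocked (+20%): 1.73
print(round(relative_power(0.8), 2))            # under-clocked (-20%): 0.51
print(round(relative_power(0.8, cores=2), 2))   # dual-core (-20%): 1.02
```

Two cores at 80% of the clock therefore stay within roughly the original power budget while offering up to 1.73x the performance, provided the application can use both cores.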


Multi-Core Processors How many of them will be useful?

ƒ Question: What fraction of performance must be sacrificed per core in order to benefit from m cores?

ƒ Prerequisite: Overall power dissipation should be unchanged

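ƒ Definitions:
  W: power dissipation
  p: performance (1 core)
  p_m: performance (m cores)
  ε_f: relative frequency change Δf_c/f_c
  ε_p: relative performance change Δp/p
  m: number of cores

ƒ Power dissipation after a relative frequency change ε_f:

$W + \Delta W = (1 + \varepsilon_f)^3 \, W$

ƒ Unchanged overall power dissipation with m cores:

$(1 + \varepsilon_f)^3 \, m = 1 \;\Rightarrow\; \varepsilon_f = m^{-1/3} - 1$

ƒ Performance of m cores:

$p_m = (1 + \varepsilon_p) \, p \, m$

ƒ m cores are beneficial if

$p_m \ge p \;\Rightarrow\; \varepsilon_p \ge \frac{1}{m} - 1$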
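The two results of this derivation, $\varepsilon_f = m^{-1/3} - 1$ and $\varepsilon_p \ge 1/m - 1$, can be tabulated for a few core counts. The function names below are hypothetical; the formulas are the ones from the slide.

```python
def required_frequency_change(m: int) -> float:
    """eps_f from (1 + eps_f)**3 * m = 1 (constant total power)."""
    return m ** (-1.0 / 3.0) - 1.0


def minimum_performance_change(m: int) -> float:
    """Smallest eps_p with p_m >= p, from p_m = (1 + eps_p) * p * m."""
    return 1.0 / m - 1.0


print(" m   eps_f    eps_p (min)")
for m in (2, 4, 8, 16):
    print(f"{m:2d}  {required_frequency_change(m):+.1%}  {minimum_performance_change(m):+.1%}")
```

For m = 2 the required frequency reduction is about 21%, close to the 20% under-clocking used in the dual-core example, while each core may lose up to 50% of its performance before two cores stop paying off.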

Multi-Core Processors
How many of them will be useful?

ƒ Required relative frequency reduction vs. core count (m)

[Figure: required relative frequency reduction as a function of the core count m; annotation: "Available today"]

Evolutionary Configurable Architecture: "Micro2015" Vision and Research
(Intel Tera-Scale Computing Research Program; © 2006 Intel Corporation)

Evolution:

ƒ Dual core: symmetric multithreading

ƒ Multi-core array: CMP with ~10 cores

ƒ Many-core array: CMP with 10s-100s of low power cores; scalar cores; capable of TFLOPS+; full System-on-Chip; servers, workstations, embedded…

ƒ Large, scalar cores for high single-thread performance

ƒ Scalar plus many core for highly threaded workloads

Basic features of modern

microprocessors

Architecture of modern microprocessors

ƒ Application: High Level Programming Language (e.g. C / C++ / Fortran), portable

ƒ Compiler translates the program into machine-specific instructions (IA32, IA64)

ƒ Modern computers / microprocessors: the von Neumann concept is still visible, but there are

ƒ Several memory levels (3-4)

ƒ Multiple Arithmetical Logical Units (e.g. 8 hardware units for integer and fp operations on Itanium2)

[Figure: layers Application / Compiler / Instruction Set on top of a computer consisting of control unit, memory, ALU and I/O]

Architecture of modern microprocessors

History

ƒ In the beginning (~30-40 years ago) Complex Instruction Set Computers (CISC) :

ƒ Powerful & complex instructions, e.g: A=B*C: 1 instruction

ƒ Instruction set is close to high-level programming language

ƒ Variable length of instructions - Save storage!

ƒ Mid 80´s: Reduced Instruction Set Computer (RISC) evolved:

ƒ Fixed instruction length; enables pipelining and high clock frequencies

ƒ Uses simple instructions, e.g.: A=B*C is split into at least 4 operations (LD B, LD C, MULT A=B*C, ST A)

ƒ Nowadays: Superscalar RISC processors

ƒ IA32 (P4, Athlon, Opteron): Compiler still generates CISC instructions, but the processor core is RISC-like

ƒ ~2001: Explicitly Parallel Instruction Computing (EPIC) introduced

ƒ Compiler builds large groups of instructions to be executed in parallel

ƒ First processors: Intel Itanium1/2 using the IA64 instruction set.

Simple view of modern processors
Cache based microprocessors (e.g. Intel P4, AMD Opteron)

[Figure: cache based processor. Main memory (frequency ~0.4 GHz) is connected to the processor (frequency ~3 GHz), which contains the L2 cache (data / instr.), L1 D-cache, L1 I-cache, registers, arithmetic & functional units, and fetch / decode / branch prediction logic]

Processor is built up by:

• Arithmetic & functional units, e.g. multiply unit, integer units, MMX, …

• These units can only use operands resident in the registers

• Operands are read (written) by load (store) units from (to) main memory/caches into (from) registers

• Caches are fast but small pieces of memory (5-10 times faster than main memory)

• A lot of additional logic, e.g. branch prediction

Intel® Core™: Architecture Block Diagram

[Figure: block diagram of the Intel Core microarchitecture. Copyright © 2006 Intel Corporation; slide dated 22.04.2007]

Labelled blocks: 32 KB instruction cache, next IP, branch target buffer, instruction decode (4 issue), microcode sequencer (Fetch / Decode); register allocation table (RAT); reservation stations (RS), 32 entry; scheduler / dispatch ports; execution ports for FP add, FP div/mul, integer arithmetic, integer shift/rotate, SIMD, load, store address and store data (Execute); memory order buffer (MOB); 32 KB data cache; re-order buffer (ROB), 96 entry; IA register set (Retire); bus unit; connection to L2 cache.

Disclaimer: This block diagram is for example purposes only. Significant hardware blocks have been arranged or omitted for clarity. Some resources (Bus Unit, L2 Cache, etc…) are shared between cores.

Simple view of modern processors
Intel Itanium 2: physical view

[Figure: die photo (H. Strauss, HP) with labelled areas: L1i cache, L1d cache, L2 data cache, L2 tag, L3 cache, L3 tag, floating point, IEU, MMU, D-TLB, IA32, pipeline, bus logic]

450 Mio. transistors on a 2 cm by 2 cm die!

In 2006/7: More than 1700 Mio. transistors on a 2.5 cm by 2.5 cm die

Architecture of modern microprocessors

Pipelining of arithmetic/functional units

ƒ Split complex operations (e.g. multiplication) into several simple / fast sub-operations (stages)

ƒ Makes short cycle time possible (simpler logic circuits), e.g.:

ƒ Multiplication takes 5 cycles, but

ƒ processor can work on 5 different multiplications simultaneously

ƒ Can produce one result each cycle after the pipeline is full

ƒ Drawback:

ƒ Pipeline must be filled - startup times

ƒ Requires complex instruction scheduling by compiler/hardware (software pipelining / out-of-order execution)

ƒ Extensive use requires a large number of independent instructions (instruction level parallelism)

ƒ Vector supercomputers use this method extensively

Pipelining:
5-stage Multiplication-Pipeline: A(i)=B(i)*C(i) ; i=1,...,N

Operation               Cycle 1    Cycle 2    Cycle 3    Cycle 4    Cycle 5    Cycle 6    ...  Cycle N+4
Separate Mant. / Exp.   B(1) C(1)  B(2) C(2)  B(3) C(3)  B(4) C(4)  B(5) C(5)  B(6) C(6)  ...
Mult. Mantissa                     B(1) C(1)  B(2) C(2)  B(3) C(3)  B(4) C(4)  B(5) C(5)  ...
Add. Exponents                                B(1) C(1)  B(2) C(2)  B(3) C(3)  B(4) C(4)  ...
Normal. Result                                           A(1)       A(2)       A(3)       ...
Insert Sign                                                         A(1)       A(2)       ...  A(N)

First result is available after 5 cycles (= latency of pipeline)!

Pipelining
Benefits and drawbacks (1)

ƒ Pipelining versus purely sequential execution of multiplication

Sequential: 1 multiplication = 5 cycles; N multiplications: $T_{seq}(N) = 5N$ cycles

Pipelining: start-up = 5 cycles; N multiplications: $T_{pipe}(N) = 4 + N$ cycles

Speed-up:

$T_{seq} / T_{pipe} = 5N / (N + 4) = 5 / (1 + 4/N) \approx 5$ for large N (N >> 5)

Throughput (results per cycle) of the pipeline:

$N / T_{pipe}(N) = N / (4 + N) = 1 / (1 + 4/N) \approx 1$ for large N
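The cycle count $T_{pipe}(N) = N + 4$ can be reproduced by stepping a 5-stage pipeline cycle by cycle. This is a toy model, not a description of real hardware; `pipeline_cycles` is a hypothetical helper.

```python
def pipeline_cycles(n_ops: int, depth: int = 5) -> int:
    """Cycle in which the last of n_ops results sits in the final pipeline stage."""
    stages = [None] * depth   # stages[0] is the first stage, stages[-1] the last
    next_op = 1
    finished = 0
    cycle = 0
    while finished < n_ops:
        cycle += 1
        # Every operation advances one stage; a new one enters if any is left.
        entering = next_op if next_op <= n_ops else None
        stages = [entering] + stages[:-1]
        next_op += 1
        if stages[-1] is not None:
            finished += 1     # one result per cycle once the pipeline is full
    return cycle


for n in (1, 10, 1000):
    t_pipe = pipeline_cycles(n)
    t_seq = 5 * n
    print(f"N={n}: T_pipe={t_pipe}, speed-up={t_seq / t_pipe:.2f}")
```

For N = 1 there is no speed-up at all, while for N = 1000 the speed-up is already within half a percent of the pipeline depth.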

Pipelining
Benefits and drawbacks (2)

ƒ In general (m-stage pipe / pipeline depth m):

Speed-up:

$T_{seq} / T_{pipe} = mN / (N + m - 1) \approx m$ for large N (N >> m)

Throughput (results per cycle):

$N / T_{pipe}(N) = N / (N + m - 1) = 1 / [1 + (m-1)/N] \approx 1$ for large N

ƒ Number of independent operations ($N_C$) required to achieve $T_p$ results per cycle:

$T_p = 1 / [1 + (m-1)/N_C] \;\Rightarrow\; N_C = T_p (m-1) / (1 - T_p)$

$T_p = 0.5 \;\Rightarrow\; N_C = m - 1$
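The general formulas can be evaluated directly to see how many independent operations a deep pipeline needs. The function names are hypothetical; the pipeline depths in the example (5, and 20 as quoted for the P4 instruction pipeline) serve only as illustrations.

```python
def throughput(n: float, m: int) -> float:
    """Results per cycle for n operations on an m-stage pipeline."""
    return n / (n + m - 1)


def ops_for_throughput(tp: float, m: int) -> float:
    """Independent operations N_C needed to reach tp results per cycle (0 < tp < 1)."""
    return tp * (m - 1) / (1.0 - tp)


for m in (5, 20):
    for tp in (0.5, 0.9):
        print(f"m={m}, T_p={tp}: N_C = {ops_for_throughput(tp, m):.0f}")
```

Reaching 90% of the peak throughput takes nine times as many independent operations as reaching 50%, and the requirement grows linearly with the pipeline depth.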

Pipelining

Benefits and drawbacks (3)

ƒ Drawbacks:

ƒ N small (e.g. N=1): no speed-up!

ƒ Increasing clock frequency -> pipeline depth m increases

ƒ Operations (e.g. Multiplications) within pipeline must be independent!

ƒ Optimal scheduling of instructions by compiler depends on pipeline depth!

ƒ Effective pipeline length for execution of an arithmetic operation is much longer than number of pipeline stages of arithmetic unit.


Pipelining
Efficient use

ƒ Efficient use of pipelining requires intelligent compilers

ƒ Rearrangement of instructions to hide latencies

ƒ High level of "software pipelining" (in particular Itanium)

ƒ Remove interdependencies that block parallel execution (user / programmer)

ƒ Out-of-order execution on processor (except Itanium)

ƒ Example:

Fortran code:

    do i=1,N
      a(i) = a(i) * c
    end do

Simple pseudo code:

    loop: load a[i]
          mult a[i] = c, a[i]
          store a[i]
          branch.loop

Latencies:

    load a[i]             Load operand to register (4 cycles)
    mult a[i] = c,a[i]    Multiply a(i) with c (2 cycles); a[i], c in registers
    store a[i]            Write back result from register to mem./cache (2 cycles)
    branch.loop           Increase loop counter as long as i is less than or equal to N (0 cycles)

Assumptions:

• One load, one store & one multiply (mult) can be issued per cycle

• The processor stalls if there is an instruction which is waiting for operands

Pipelining
Efficient use

a[i]=a[i]*c; N=12

Naive instruction issue (T = 96 cycles):

    Cycle  Instruction
    1      load a[1]
    5      mult a[1]=c,a[1]
    7      store a[1]
    9      load a[2]
    13     mult a[2]=c,a[2]
    15     store a[2]
    17     load a[3]
    ...

Optimized instruction issue (T = 19 cycles):

    Cycle  Load         Mult                 Store
    1      load a[1]
    2      load a[2]
    3      load a[3]
    4      load a[4]
    5      load a[5]    mult a[1]=c,a[1]
    6      load a[6]    mult a[2]=c,a[2]
    7      load a[7]    mult a[3]=c,a[3]     store a[1]
    8      load a[8]    mult a[4]=c,a[4]     store a[2]
    9      load a[9]    mult a[5]=c,a[5]     store a[3]
    10     load a[10]   mult a[6]=c,a[6]     store a[4]
    11     load a[11]   mult a[7]=c,a[7]     store a[5]
    12     load a[12]   mult a[8]=c,a[8]     store a[6]
    13                  mult a[9]=c,a[9]     store a[7]
    14                  mult a[10]=c,a[10]   store a[8]
    15                  mult a[11]=c,a[11]   store a[9]
    16                  mult a[12]=c,a[12]   store a[10]
    17                                       store a[11]
    18                                       store a[12]
    19                                       (store a[12] completes)

Prolog: cycles 1-6 (pipeline fills); kernel: cycles 7-12 (load, mult and store issued every cycle); epilog: cycles 13-19 (pipeline drains).
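Both cycle counts follow from the stated latencies (load 4, mult 2, store 2 cycles) and the assumption that one load, one mult and one store can be issued per cycle. A small sketch that recomputes them; the function names are hypothetical and the model ignores everything a real processor adds (caches, branch cost, issue limits).

```python
LOAD, MULT, STORE = 4, 2, 2   # latencies in cycles, as given on the slide


def naive_cycles(n: int) -> int:
    """Each iteration waits for load, mult and store to finish before the next starts."""
    return n * (LOAD + MULT + STORE)


def software_pipelined_cycles(n: int) -> int:
    """One load issued per cycle; mult and store follow as soon as operands are ready."""
    last_cycle = 0
    for i in range(1, n + 1):
        load_issue = i                         # load a[i] issued in cycle i
        mult_issue = load_issue + LOAD         # operand in register after 4 cycles
        store_issue = mult_issue + MULT        # product available after 2 cycles
        last_cycle = store_issue + STORE - 1   # store occupies 2 cycles
    return last_cycle


print(naive_cycles(12))               # 96
print(software_pipelined_cycles(12))  # 19
```

In this model the optimized schedule costs N + 7 cycles, so the fixed prolog/epilog overhead of 7 cycles becomes negligible for long loops.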

Pipelining
Efficient use

ƒ Optimized kernel:

ƒ Software pipelining by compiler: reordering instructions considering the latencies of the instructions

ƒ The number of cycles spent in the loop kernel should be much larger than in the prolog/epilog

Pseudo code (latency of load: 4 cycles; latency of MULT pipeline: 2 cycles):

    loop: load a[i+6]
          mult a[i+2] = c, a[i+2]
          store a[i]
          branch.loop

ƒ Dependencies within the loop body prevent efficient software pipelining:

Fortran code:

    do i=2,N
      a(i) = a(i-1) * c
    end do

Computation of a[i-1] must be completed before a[i] is started!

Pipelining
Efficient use

a[i]=a[i-1]*c; N=12

Naive instruction issue (T = 96 cycles):

    Cycle  Instruction
    1      load a[1]
    5      mult a[2]=c,a[1]
    7      store a[2]
    9      load a[2]
    13     mult a[3]=c,a[2]
    15     store a[3]
    17     load a[3]
    ...

Optimized instruction issue (T = 26 cycles):

    Cycle  Load         Mult                 Store
    1      load a[1]
    5                   mult a[2]=c,a[1]
    7                   mult a[3]=c,a[2]     store a[2]
    9                   mult a[4]=c,a[3]     store a[3]
    11                  mult a[5]=c,a[4]     store a[4]
    13                  mult a[6]=c,a[5]     store a[5]
    15                  mult a[7]=c,a[6]     store a[6]
    17                  mult a[8]=c,a[7]     store a[7]
    19                  mult a[9]=c,a[8]     store a[8]
    ...

Prolog: the single load and the first mult; kernel: one mult and one store every 2 cycles, because each mult has to wait for the result of the previous one.

Pipelining Efficient use

ƒ Performance impact of dependencies on Intel Xeon 2.66 GHz

[Figure: measured performance of A(i)=A(i+1)*c and A(i)=A(i-1)*c as a function of N. Annotations: start-up of long effective pipeline; high performance for data in caches (N < 30000); "Why?"]

Pipelining Efficient use

ƒ Basic types of (potential) dependencies within the loop body may prevent efficient software pipelining, e.g.:

No dependency:

    do i=1,N
      a(i) = a(i) * c
    end do

Dependency:

    do i=2,N
      a(i) = a(i-1) * c
    end do

Pseudo-dependency:

    do i=1,N-1
      a(i) = a(i+1) * c
    end do

General version (offset as input parameter):

    do i=max(1-offset,1),min(N-offset,N)
      a(i) = a(i+offset) * c
    end do
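The difference between a real dependency and a pseudo-dependency can be made concrete by running the general loop twice: once strictly in program order, and once with all loads taken from the original array, as if every load had been issued before any store. This is a sketch with hypothetical function names and 0-based Python lists; it shows only the semantics, not the timing.

```python
def scale_in_order(a, c, offset):
    """a[i] = a[i+offset] * c, iterations executed strictly in ascending order."""
    a = list(a)
    n = len(a)
    for i in range(max(-offset, 0), min(n - offset, n)):
        a[i] = a[i + offset] * c
    return a


def scale_loads_first(a, c, offset):
    """Same loop, but every iteration reads the original (pre-loop) values."""
    old = list(a)
    a = list(a)
    n = len(a)
    for i in range(max(-offset, 0), min(n - offset, n)):
        a[i] = old[i + offset] * c
    return a


data = [1.0, 1.0, 1.0, 1.0]
for offset, label in ((0, "no dependency"), (+1, "pseudo-dependency"), (-1, "dependency")):
    same = scale_in_order(data, 2.0, offset) == scale_loads_first(data, 2.0, offset)
    print(f"{label:18s} loads may run ahead of stores: {same}")
```

Only for offset = -1, i.e. a(i) = a(i-1) * c, do the two variants differ, because each iteration needs the value stored by the previous one.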

Pipelining Data dependencies


Pipelining

Further potential problems

ƒ Typical number of pipeline stages: 2-5 for the hardware pipelines on modern CPUs.

ƒ 1 or 2 MultAdd units per processor, i.e. processor core

ƒ Modern microprocessors do not provide pipelines for div / sqrt or exp / sin!

Example: Cycles per operation (8-byte operands) (Xeon/Netburst)

    Operation        Latency     Throughput   Cycles/Operation
    y=a+y (y=a*y)    4*          2*           1*
    y=a/y            70*         70*          35*
    y=dsqrt(y)       70*         70*          35*
    y=sin(y)         ~160-180    130          130

    * Using SIMD instructions (SSE2)

ƒ Reduce number of complex operations if necessary.

ƒ Replace function call with a table lookup if the function is frequently computed for a few different arguments only.
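The table-lookup advice can be sketched as follows, assuming the expensive function is only ever needed for a small, known set of arguments that can be addressed by an integer code. The argument set `ANGLES`, the table `SIN_TABLE` and the index list `codes` are hypothetical.

```python
import math

# Hypothetical case: sin() is only ever needed for three distinct arguments.
ANGLES = (0.0, math.pi / 6, math.pi / 2)
SIN_TABLE = [math.sin(x) for x in ANGLES]   # computed once, before the hot loop

codes = [i % 3 for i in range(9)]           # integer code of the argument per element

# Direct evaluation: one sin() call per element.
direct = sum(math.sin(ANGLES[k]) for k in codes)

# Table lookup: one indexed load per element, no function call.
via_table = sum(SIN_TABLE[k] for k in codes)

print(direct == via_table)  # True
```

In a compiled language this replaces an operation costing on the order of a hundred cycles with a single load, which pays off as long as the table stays in cache.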

Pipelining
Instruction pipeline

ƒ Besides the arithmetic and functional units, instruction execution itself is also pipelined; e.g. one instruction performs at least 3 steps:

    Fetch instruction from L1I -> Decode instruction -> Execute instruction

ƒ Hardware pipelining on processor (all units can run concurrently):

    t   Fetch from L1I   Decode          Execute
    1   Instruction 1
    2   Instruction 2    Instruction 1
    3   Instruction 3    Instruction 2   Instruction 1
    4   Instruction 4    Instruction 3   Instruction 2

ƒ Branches can stall this pipeline! (Speculative execution, predication)

ƒ Each unit is pipelined itself (cf. Execute = multiply pipeline)

Pipelining

PowerPC Instruction Pipeline

14-stage pipeline for FP operations!

Pipeline of P4:

20 stages!

Superscalar Processors

ƒ Superscalar processors can run multiple instruction pipelines at the same time!

ƒ Parallel hardware components / pipelines are available to

ƒ fetch / decode / issue multiple instructions per cycle (typically 2-8 per cycle)

ƒ load (store) multiple operands (results) from (to) cache per cycle (typically 2-4 8-byte words per cycle)

ƒ perform multiple integer / address calculations per cycle (e.g. 6 integer units on Itanium2)

ƒ perform multiple floating point operations per cycle (typically 2 or 4 floating point operations per cycle)

ƒ On superscalar RISC processors, out-of-order execution hardware is available to optimize the usage of the parallel hardware

Superscalar Processors
Instruction Level Parallelism through superscalar execution

ƒ Multiple units enable use of Instruction Level Parallelism (ILP):

ƒ Issuing m concurrent instructions per cycle: m-way superscalar

ƒ Modern processors are 3- to 6-way superscalar & can perform 2 or 4 floating point operations per cycle

[Figure: 4-way "superscalar" execution: four copies of the fetch / decode / execute instruction pipeline running side by side over time t]

Superscalar Processor
Exploit ILP

ƒ Example: Calculate norm of a vector

ƒ Naive version:

    t=0
    do i=1,n
      t=t+a(i)*a(i)
    end do

ƒ 2nd MADD has to wait for the first to complete, although in principle two independent MADDs could be done:

    R1 = MADD(R1,A(I))
    R1 = MADD(R1,A(I+1))
    STALL

2 FP Mult/Add units cannot be busy at the same time because of the dependency in the summation variable t ("Load-after-Store dependency")

Superscalar Processor
Exploit ILP: Modulo variable expansion

ƒ Optimized version:

    t1=0
    t2=0
    do i=1,N,2
      t1=t1+a(i)*a(i)
      t2=t2+a(i+1)*a(i+1)
    end do
    t=t1+t2

Two independent "instruction streams" can be processed by two separate FP Mult/Add units:

    R1 = MADD(R1,A(I))      R2 = MADD(R2,A(I+1))
    R1 = MADD(R1,A(I+2))    R2 = MADD(R2,A(I+3))

Most compilers can do those optimizations automatically!
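A direct transcription of the transformation shows that the two-accumulator version computes the same sum. Python will not run it faster, so this sketch only checks the semantics; the function names are hypothetical, and the remainder handling for odd lengths is an addition that the Fortran loop (which assumes even N) does not need.

```python
def sum_squares_naive(a):
    """Single accumulator: every update depends on the previous one."""
    t = 0.0
    for x in a:
        t += x * x
    return t


def sum_squares_two_accumulators(a):
    """Two independent accumulators, combined after the loop."""
    t1 = t2 = 0.0
    n_even = len(a) - len(a) % 2
    for i in range(0, n_even, 2):
        t1 += a[i] * a[i]           # stream 1
        t2 += a[i + 1] * a[i + 1]   # stream 2, independent of stream 1
    if n_even < len(a):             # leftover element for odd lengths
        t1 += a[-1] * a[-1]
    return t1 + t2


v = [1.0, 2.0, 3.0, 4.0, 5.0]
print(sum_squares_naive(v), sum_squares_two_accumulators(v))  # 55.0 55.0
```

The summation order changes, so for general floating point data the two results can differ in the last bits; this is one reason compilers may apply the transformation only when relaxed floating point semantics are allowed.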

Superscalar Processors
Some pitfalls

ƒ Data dependencies can prevent the parallel use of hardware, e.g.

    for (i=0;…) A(i) = A(i-1)*c

(only one multiplication can be performed at a time)

ƒ Data dependencies: the compiler cannot resolve aliasing conflicts!

    void subscale( A , B )
    ….for (i=0;…) A(i) = B(i-1)*c

In C/C++ the pointers A and B can point to the same memory location -> see above

You should tell the compiler if you never use aliasing (-fno-alias on the Intel compiler)
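Why the compiler has to be careful can be seen by calling the same routine once with two distinct arrays and once with the same array passed for both arguments. The sketch uses Python lists in place of C pointers; `subscale` mirrors the routine named on the slide, while its exact signature and loop bounds are assumptions.

```python
def subscale(a, b, c):
    """a[i] = b[i-1] * c for i = 1 .. len(a)-1, executed in program order."""
    for i in range(1, len(a)):
        a[i] = b[i - 1] * c


x = [1.0, 1.0, 1.0, 1.0]
y = [1.0, 1.0, 1.0, 1.0]
subscale(x, y, 2.0)   # distinct arrays: all iterations are independent
print(x)              # [1.0, 2.0, 2.0, 2.0]

z = [1.0, 1.0, 1.0, 1.0]
subscale(z, z, 2.0)   # aliased: same loop becomes the recurrence A(i) = A(i-1)*c
print(z)              # [1.0, 2.0, 4.0, 8.0]
```

Since both calls are legal, a compiler that cannot rule out aliasing must generate code that is correct for the second case and therefore cannot overlap the iterations.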

Superscalar Processors
Some pitfalls

ƒ Avoid frequent and random (not predictable) branches in the application code, e.g.

    do i=1,….
      if( random(0:1) > 0.5) then
        <Block1>
      else
        <Block2>
      endif
    enddo

Superscalar processors try to predict the branch and speculatively start the pipeline for the next iterations.

If the branch was mispredicted, the pipeline has to be flushed!

Superscalar Processor Efficient Use of Pipelining and ILP

ƒ Efficient use of pipelining/ILP requires intelligent compilers

ƒ Rearrangement of instructions to hide latencies

ƒ „Software pipelining“

ƒ Remove interdependencies that block parallel execution

ƒ Programmer should

ƒ Avoid unpredictable branches (stop and restart of pipeline!)

ƒ Avoid Data dependencies (if possible)

ƒ Tell the compiler that instructions are independent (e.g. do not use pointer aliasing: -fno-alias with the Intel compiler)

ƒ Long FP pipeline is inefficient for very small loops

ƒ Pipeline must be filled, i.e. long start-up times

ƒ Summary:

ƒ A large number of independent / parallel instructions is mandatory to use pipelined, superscalar processors efficiently.

ƒ Most of the work can be done by the compiler; however, the programmer must provide reasonable code
