Modern processors: performance measures, CPU, memory


Dr. Gerhard Wellein, Dr. G. Hager, T. Zeiser HPC Services – Regionales Rechenzentrum Erlangen

University Erlangen-Nürnberg, summer semester 2007

Modern processors

Performance Measures

Strategies to build faster computers Multi-Core processors

Basic features of modern microprocessors Pipelining

Superscalar Architectures


Performance measures

Pure measures:

MFlops: Millions of Floating Point Operations per Second

(relevant for many technical & scientific applications)

MIPS: Millions of Instructions per Second

(e.g. data bases, web servers,…)

Execution time: wallclock time, user time, …..?

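$\mathrm{MFlops} = \dfrac{\text{number of floating point operations executed}}{10^6 \cdot \text{execution time}}$

$\mathrm{MIPS} = \dfrac{\text{number of instructions executed}}{10^6 \cdot \text{execution time}}$

(execution time in seconds)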
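As a small illustration of the two definitions, the following sketch computes MFlops and MIPS from operation counts and a measured execution time. The function names and the sample numbers are made up for this example and do not come from a real measurement.

```python
def mflops(flop_count, seconds):
    """Millions of floating point operations per second."""
    return flop_count / (1.0e6 * seconds)


def mips(instruction_count, seconds):
    """Millions of instructions per second."""
    return instruction_count / (1.0e6 * seconds)


# Hypothetical run: 2 * 10**9 floating point operations and
# 6 * 10**9 instructions in 4 s of wallclock time.
print(mflops(2 * 10**9, 4.0))
print(mips(6 * 10**9, 4.0))
```

Which execution time is used (wallclock or user time) changes the result, so the choice should always be stated together with the number.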


Performance measures

MFlops and MIPS numbers can be computed or measured for hardware (peak performance) and for application programs (sustained performance)

Standard benchmark programs

LINPACK (cf. discussion about TOP500; http://www.top500.org): LINPACK ~ peak performance in most cases

STREAM (http://www.cs.virginia.edu/stream/): STREAM ~ quality of data access

SPEC ( http://www.spec.org)

A lot of different benchmarks: WEB Server, MAIL Server,…

SPEC CPU2000: numerical performance (14 application benchmarks in FORTRAN or C)

HPC Challenge: Several specific HPC benchmarks http://icl.cs.utk.edu/hpcc/


Performance measures

SPEC numbers:

Relative performance compared to a reference system (SUN Ultra10; 300 MHz Sparc processor) * 100

Geometric mean over all applications
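A minimal sketch of how such a number can be formed, assuming the relative performance of each benchmark is the ratio of the reference run time to the measured run time. The function name and the sample run times are hypothetical; the official SPEC run rules contain more detail than is shown here.

```python
import math


def spec_number(reference_times, measured_times):
    """100 * geometric mean of the per-benchmark speed ratios."""
    ratios = [ref / run for ref, run in zip(reference_times, measured_times)]
    log_mean = sum(math.log(r) for r in ratios) / len(ratios)
    return 100.0 * math.exp(log_mean)


# Hypothetical run times in seconds for two benchmarks.
reference = [100.0, 200.0]
print(spec_number(reference, reference))      # the reference system itself
print(spec_number(reference, [50.0, 50.0]))   # a system that is 2x and 4x faster
```

The geometric mean is used so that no single benchmark with a very large ratio dominates the result.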

SPEC_fp2000 results:

Processor                 | Version            | Compiler               | SPEC_fp2000
Intel P3 – Dell Prec. 420 | 1.0 GHz            | Intel Compiler V5.0    | 340
Intel Itanium2            | 1.5 GHz, 6 MB L3   | Intel Compiler V8      | 2043
Intel P4 – Dell Prec. 360 | 3.2 GHz, DDR-400   | Intel Compiler V7.1    | 1285
AMD Opteron               | 2.2 GHz, ASUS SK8N | Intel Compiler V7.0    | 1490
AMD Opteron               | 2.2 GHz, ASUS SK8N | PGI F90 5.1; gcc 3.3.1 | 1514

There must be other things besides clock speed!

Strategies to build faster computers….


How to build faster computers Survey

1. Increase performance / throughput of the CPU

a) Reduce cycle time, i.e. increase clock speed (Moore)

b) Increase throughput, i.e. superscalar (internal parallelism)

2. Improve data access time

a) Increase cache size

b) Improve main memory access (bandwidth & latency)

3. Use parallel computing (shared memory)

a) Requires shared-memory parallel programming

b) Shared/separate caches

c) Possible memory access bottlenecks

4. Use parallel computing (distributed memory): “cluster” of tightly connected computers

a) Almost unlimited scaling of memory and performance

b) Requires distributed-memory parallel programming

[Figure: schematic CPU / cache / memory layouts for the four strategies]

How to build faster computers (1) Increase single processor performance

Reduce cycle time (increase clock speed)

Limited by current technology

Transition ECL → CMOS is necessary; done 10 years ago

Typical processor frequencies

PC: approx. 2.0-3.6 GHz

RISC: 1.3-1.7 GHz

Vector: 0.5-2.0 GHz

Problems with large power dissipation, even with CMOS

Power dissipation scales like (voltage)² times frequency

Requires pipelining of hardware units (cf. discussion later)

2 x clock speed = 2 x performance? Memory does not run at CPU speed – DRAM gap!

Market volume


How to build faster computers (2)

Increase single processor performance: Moore’s law

In 1965 G. Moore claimed: the number of transistors on a processor chip doubles every 12-24 months

Processor speed grew at roughly the same rate

My computer: 350 MHz (1998) – 3,000 MHz (2004)

Growth rate: 43 % p.a. -> doubles roughly every 24 months

Problem: power dissipation (see RRZE systems…)

This trend is currently changing: see multi-core

(Figure source: Intel Corp.)
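The growth rate quoted for the 350 MHz to 3,000 MHz example can be checked with a few lines of arithmetic. This is only a sketch of the calculation, assuming constant exponential growth over the six years; the function names are chosen for this example.

```python
import math


def annual_growth(f_start, f_end, years):
    """Constant yearly growth rate that turns f_start into f_end."""
    return (f_end / f_start) ** (1.0 / years) - 1.0


def doubling_time_months(growth_per_year):
    """Months needed to double at the given yearly growth rate."""
    return 12.0 * math.log(2.0) / math.log(1.0 + growth_per_year)


g = annual_growth(350.0, 3000.0, 2004 - 1998)
print(round(100 * g), "% per year")
print(round(doubling_time_months(g), 1), "months to double")
```

The result is about 43 % per year and a doubling time of a little under two years, in line with the numbers on the slide.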

How to build faster computers (3) Increase single processor performance

Use internal parallelism (Instruction level parallelism) to increase throughput

Multiple arithmetic units can work in parallel (e.g. multiply and add units)

Intel Pentium4 / AMD Opteron: 1 multiply & 1 add unit -> 2 Flop / cycle

Intel Itanium2: 2 multiply-add units -> 4 Flop / cycle

NEC SX8: 1 multiply & 1 add unit (4-way vector pipes) -> 8 Flop / cycle

Multiple load/store units are available for concurrent data transfer (memory/cache <-> registers)

Problem: Memory bandwidth becomes a bottleneck very quickly!

Thus: Memory bandwidth limits internal parallelism
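The Flop / cycle figures translate into a peak performance once a clock frequency is fixed. The sketch below assumes peak = clock frequency times Flop per cycle and uses the clock rates from the SPEC_fp2000 comparison earlier in these slides; the function name is made up for the example.

```python
def peak_gflops(clock_ghz, flops_per_cycle):
    """Peak floating point performance of one processor in GFlops."""
    return clock_ghz * flops_per_cycle


print(peak_gflops(3.2, 2))   # Intel P4 at 3.2 GHz
print(peak_gflops(2.2, 2))   # AMD Opteron at 2.2 GHz
print(peak_gflops(1.5, 4))   # Intel Itanium2 at 1.5 GHz
```

Sustained application performance is usually well below these values, mostly because of the memory bandwidth limit named above.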

How to build faster computers (4) Data throughput

Memory (DRAM) Gap

Memory bandwidth grows by only about 7% per year

Memory latency remains constant in absolute terms, i.e. it increases when measured in processor cycles

Loading a single data item from main memory can cost 100s of cycles on a 3 GHz CPU

Introducing memory hierarchies (caches) – requires complex optimization of code

Cache sizes can “easily” be enlarged -> Moore’s law

Optimization of main memory access is mandatory for most applications

How to build faster computers (5) Parallel Computing

Parallel Computing (data/functional parallelism)

Multiple CPUs share work and solve a problem cooperatively

Bookkeeping is shifted from hardware to software (user or compiler!)

Basic architecture concepts:

Shared Memory

UMA (Uniform Memory Access) machines: easy to program, but memory bandwidth is limited! (e.g. 2- or 4-way SMPs / multi-core chips)

ccNUMA (cache-coherent Non-Uniform Memory Access): scalable to 100s of processors (e.g. SGI Altix)

Distributed Memory

NORMA (No-Remote Memory Access) machines (e.g. Xeon cluster; 10s to 10,000s of processors)

NUMA (Non-Uniform Memory Access): CRAY T3E, Altix

Allow for scaling of (local) memory bandwidth, but hard to program & communication bandwidth is limited!

Exploiting Moore’s law without substantially increasing the single processor’s clock speed:

Multiple (independent) processor cores per chip


Multi-Core processors

Moore‘s law:

In the past: Smaller circuits -> Faster clock speeds

In the future: Smaller circuits -> Put several processors on a single silicon die (chip)

Available multi-core processors

AMD Opteron (cf. RRZE cluster), IBM Power4/5,

Intel: Xeon “Woodcrest” (Dual-Core) & “Clovertown” (Quad-Core) Intel Conroe

Technical advantages of Multi-Core technology

Power consumption using a single silicon die:

2 processor cores at 2 GHz << 1 core at 4 GHz

Price of the processor is mainly determined by the silicon die

Problems:

lower single core/processor performance -> PARALLELIZATION

Memory Wall – Now several processors share one FSB

Multi-core processors: The party is over!

[Figure: processor chip with two cores (arithmetic unit, FP registers and L1 cache each), L2 cache, and main memory behind the „DRAM Gap“]

Intel Xeon / Core (“Woodcrest”)

It is not a faster processor – it is a parallel computer on a chip.

Dual-Core: Put 2 processors on a chip which (may) share resources (L2 cache, memory bandwidth)

Efficient use of both cores for a single application is the programmer's responsibility

Relative max. frequency, power and performance (by courtesy of D. Vrsalovic, Intel):

Configuration          | Max. frequency | Power | Performance
Baseline (single core) | 1.00x          | 1.00x | 1.00x
Over-clocked           | +20%           | 1.73x | 1.13x
Under-clocked          | -20%           | 0.51x | 0.87x
Dual-core              | -20%           | 1.02x | 1.73x
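The power values in this comparison are consistent with a simple model in which the power of one core scales with the cube of its clock frequency. The sketch below checks this; the cubic model is an assumption of the sketch, and the performance values (1.13x, 0.87x, 1.73x) are taken from the chart and are not derived here.

```python
def relative_power(freq_factor, cores=1):
    """Power relative to one core at nominal clock, assuming power ~ f**3 per core."""
    return cores * freq_factor ** 3


print(round(relative_power(1.2), 2))            # over-clocked by 20%
print(round(relative_power(0.8), 2))            # under-clocked by 20%
print(round(relative_power(0.8, cores=2), 2))   # two cores, each under-clocked by 20%
```

Two slower cores therefore fit into almost the same power budget as one core at nominal frequency.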

Multi-Core Processors How many of them will be useful?

Question: What fraction of performance must be sacrificed per core in order to benefit from m cores?

Prerequisite: Overall power dissipation should be unchanged

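W: power dissipation

p: performance (1 core)

p_m: performance (m cores)

ε_f: relative frequency change ∆f_c/f_c

ε_p: relative performance change ∆p/p

m: number of cores

Power dissipation after a frequency change:

$W + \Delta W = (1 + \varepsilon_f)^3 \, W$

Unchanged overall power dissipation with m cores:

$(1 + \varepsilon_f)^3 \, m = 1 \;\Rightarrow\; \varepsilon_f = m^{-1/3} - 1$

Performance of m cores must not fall below that of the single core:

$p_m = (1 + \varepsilon_p) \, p \, m \ge p \;\Rightarrow\; \varepsilon_p \ge \frac{1}{m} - 1$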
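To get a feeling for these relations, the following sketch evaluates the required relative frequency change ε_f and the largest tolerable per-core performance loss ε_p for a few core counts. It assumes the cubic power model used on this slide; the function names are chosen for this example.

```python
def frequency_change(m):
    """eps_f for which m cores dissipate as much as one: (1 + eps_f)**3 * m = 1."""
    return m ** (-1.0 / 3.0) - 1.0


def min_performance_change(m):
    """Smallest eps_p for which m cores still match one core: (1 + eps_p) * m >= 1."""
    return 1.0 / m - 1.0


for m in (2, 4, 8, 16):
    print(m, round(frequency_change(m), 3), round(min_performance_change(m), 3))
```

For two cores the clock has to drop by roughly 20 %, and the chip still pays off as long as each core loses less than 50 % of its performance.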

[Figure: required relative frequency reduction vs. core count (m), with the range available today marked]

Multi-Core Processors: How many of them will be useful?

Evolutionary Configurable Architecture: “Micro2015” Vision and Research (Intel Tera-Scale Computing Research Program)

Evolution:

Dual core: symmetric multithreading

Multi-core array: CMP with ~10 cores

Many-core array: CMP with 10s-100s of low power cores; scalar cores; capable of TFLOPS+; full System-on-Chip; servers, workstations, embedded…

Large, scalar cores for high single-thread performance

Scalar plus many core for highly threaded workloads

Basic features of modern

microprocessors

Architecture of modern microprocessors

Application: High Level Programming Language (e.g. C / C++ / Fortran) - portable

Compiler translates the program into machine-specific instructions (IA32, IA64)

Modern computers / microprocessors – the von Neumann concept is still visible, but:

Several memory levels (3-4)

Multiple arithmetic logic units (e.g. 8 hardware units for integer and fp operations on Itanium2)

[Figure: von Neumann computer (control unit, memory, ALU, I/O) and the layers application – compiler – instruction set]

Architecture of modern microprocessors

History

In the beginning (~30-40 years ago) Complex Instruction Set Computers (CISC) :

Powerful & complex instructions, e.g: A=B*C: 1 instruction

Instruction set is close to high-level programming language

Variable length of instructions - Save storage!

Mid-1980s: Reduced Instruction Set Computer (RISC) evolved:

Fixed instruction length; enables pipelining and high clock frequencies

Uses simple instructions, e.g.: A=B*C is split into at least 4 operations (LD B, LD C, MULT A=B*C, ST A)

Nowadays: Superscalar RISC processors

IA32 (P4, Athlon, Opteron): Compiler still generates CISC instructions;

but processor core: RISC like

~2001: Explicitly Parallel Instruction Computing (EPIC) introduced

Compiler builds large groups of instructions to be executed in parallel

First processors: Intel Itanium1/2 using the IA64 instruction set.

Simple view of modern processors

Cache based microprocessors (e.g. Intel P4, AMD Opteron)

[Figure: cache based processor. Processor (frequency ~3 GHz): arithmetic & functional units, registers, L1 D-cache, L1 I-cache with fetch / decode / branch prediction, L2 cache (data / instructions). Main memory (frequency ~0.4 GHz).]

The processor is built up from:

• Arithmetic & functional units, e.g. multiply unit, integer units, MMX, …

• These units can only use operands resident in the registers

• Operands are read (written) by load (store) units from (to) main memory/caches to (from) registers

• Caches are fast but small pieces of memory (5-10 times faster than main memory)

• A lot of additional logic, e.g. branch prediction

Intel® Core™ Architecture Block Diagram

[Figure: block diagram. Fetch / Decode: 32 KB instruction cache, next IP, branch target buffer, instruction decode (4 issue), microcode sequencer, register allocation table (RAT). Retire: re-order buffer (ROB, 96 entries), IA register set. Scheduler / dispatch ports: reservation stations (RS, 32 entries). Execute: ports with integer arithmetic, integer shift/rotate, SIMD, FP add, FP div/mul, load, store address and store data units; memory order buffer (MOB); 32 KB data cache. Bus unit and connection to L2 cache.]

Disclaimer: This block diagram is for example purposes only. Significant hardware blocks have been arranged or omitted for clarity. Some resources (Bus Unit, L2 Cache, etc…) are shared between cores.

Simple view of modern processors

Intel Itanium 2 – physical view

[Figure: die photo (H. Strauss, HP) with labeled regions: L1i cache, L1d cache, L2 data cache, L2 tag, L3 tag, L3 cache, floating point, IEU, MMU, D-TLB, IA32, pipeline, bus logic]

450 million transistors on a 2 cm by 2 cm die!

In 2006/7: more than 1700 million transistors on a 2.5 cm by 2.5 cm die

Architecture of modern microprocessors

Pipelining of arithmetic/functional units

Split complex operations (e.g. multiplication) into several simple / fast sub-operations (stages)

Makes short cycle time possible (simpler logic circuits), e.g.:

Multiplication takes 5 cycles, but

processor can work on 5 different multiplications simultaneously

Can produce one result each cycle after the pipeline is full

Drawback:

Pipeline must be filled - startup times

Requires complex instruction scheduling by compiler/hardware – software pipelining / out-of-order execution

Extensive use requires a large number of independent instructions – instruction level parallelism

Vector supercomputers use this method extensively

Pipelining: 5-stage multiplication pipeline

A(i)=B(i)*C(i) ; i=1,...,N

Stage                 | Cycle 1   | Cycle 2   | Cycle 3   | Cycle 4   | Cycle 5   | Cycle 6   | ... | Cycle N+4
Separate mant. / exp. | B(1) C(1) | B(2) C(2) | B(3) C(3) | B(4) C(4) | B(5) C(5) | B(6) C(6) | ... |
Mult. mantissa        |           | B(1) C(1) | B(2) C(2) | B(3) C(3) | B(4) C(4) | B(5) C(5) | ... |
Add. exponents        |           |           | B(1) C(1) | B(2) C(2) | B(3) C(3) | B(4) C(4) | ... |
Normal. result        |           |           |           | A(1)      | A(2)      | A(3)      | ... |
Insert sign           |           |           |           |           | A(1)      | A(2)      | ... | A(N)

First result is available after 5 cycles (= latency of pipeline)!
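The occupancy of such a pipeline can be reproduced with a short simulation. This is a sketch under the idealized assumptions of the slide (one new operation enters per cycle, no stalls); the function name and the representation of empty stages as None are choices made for this example.

```python
def simulate_pipeline(n_ops, depth=5):
    """For each cycle, list the operation index held by each stage (None = empty)."""
    timeline = []
    for cycle in range(1, n_ops + depth):      # cycles 1 .. n_ops + depth - 1
        stages = []
        for stage in range(depth):             # stage 0 is the first pipeline stage
            op = cycle - stage                 # operation that entered 'stage' cycles ago
            stages.append(op if 1 <= op <= n_ops else None)
        timeline.append(stages)
    return timeline


for cycle, stages in enumerate(simulate_pipeline(n_ops=6), start=1):
    print(cycle, stages)
```

With N = 6 operations the simulation runs for N + 4 = 10 cycles, and operation 1 reaches the last stage in cycle 5.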

Pipelining

Benefits and drawbacks (1)

Pipelining versus purely sequential execution of multiplication

Sequential: 1 multiplication = 5 cycles

N multiplications: T_seq(N) = (5*N) cycles

Pipelining: start-up = 5 cycles

N multiplications: T_pipe(N) = (4+N) cycles

Speed-up: T_seq / T_pipe = (5*N) / (N+4) = 5 / (1 + 4/N) ~ 5 for large N (>>5)

Throughput (results per cycle) of pipeline: N / T_pipe(N) = N / (4+N) = 1 / (1 + 4/N) ~ 1 for large N

Pipelining

In general (m-stage pipe / pipeline depth: m)

Speed-up: T_seq / T_pipe = (m*N) / (N+m-1) ~ m for large N (>>m)

Throughput (results per cycle): N / T_pipe(N) = N / (N+m-1) = 1 / [ 1+(m-1)/N ] ~ 1 for large N

Number of independent operations (N_C) required to achieve T_p results per cycle:

T_p = 1 / [ 1+(m-1)/N_C ]  =>  N_C = T_p (m-1) / (1 - T_p)

Example: T_p = 0.5  =>  N_C = m-1
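The formulas for an m-stage pipeline are easy to evaluate numerically. The sketch below simply codes them up; the function names are chosen for this example.

```python
def pipeline_cycles(n, m):
    """Cycles needed for n independent operations on an m-stage pipeline."""
    return n + m - 1


def speedup(n, m):
    """Speed-up over purely sequential execution (m cycles per operation)."""
    return m * n / pipeline_cycles(n, m)


def throughput(n, m):
    """Average number of results per cycle."""
    return n / pipeline_cycles(n, m)


def ops_for_throughput(t_p, m):
    """Independent operations N_C needed to reach t_p results per cycle (t_p < 1)."""
    return t_p * (m - 1) / (1.0 - t_p)


print(speedup(12, 5), throughput(12, 5))
print(ops_for_throughput(0.5, 5), ops_for_throughput(0.9, 5))
```

For a 5-stage pipeline, 4 independent operations give half of the peak throughput, while 90 % of peak already needs about 36.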

Pipelining: Drawbacks

N small (e.g. N=1) – no speed-up!

Increasing clock frequency -> pipeline depth m increases

Operations (e.g. multiplications) within the pipeline must be independent!

Optimal scheduling of instructions by the compiler depends on pipeline depth!

The effective pipeline length for the execution of an arithmetic operation is much longer than the number of pipeline stages of the arithmetic unit.

Pipelining: Efficient use

Efficient use of pipelining requires intelligent compilers

Rearrangement of instructions to hide latencies

High level of „Software pipelining“ (in particular Itanium)

Remove interdependencies that block parallel execution (user/ programmer)

Out-of-order execution on processor (except Itanium)

Example:

Fortran Code:

do i=1,N
  a(i) = a(i) * c
end do

Simple Pseudo Code:

loop: load a[i]
      mult a[i] = c, a[i]
      store a[i]
      branch.loop

Latencies:

load a[i]: load operand to register (4 cycles)

mult a[i] = c,a[i]: multiply a(i) with c (2 cycles); a[i], c in registers

store a[i]: write back result from register to mem./cache (2 cycles)

branch.loop: increase loop counter as long as i is less than or equal to N (0 cycles)


Assumptions:

• One load, one store & one multiply (mult) can be issued per cycle

• The processor stalls if an instruction is waiting for its operands

a[i]=a[i]*c; N=12

Naive instruction issue: T = 96 cycles

Each iteration takes 4 + 2 + 2 = 8 cycles: load a[1] in cycle 1, mult a[1]=c,a[1] in cycle 5, store a[1] in cycle 7, load a[2] in cycle 9, ...

Optimized instruction issue: T = 19 cycles

Cycle | load  | mult            | store | Phase
1     | a[1]  |                 |       | Prolog
2     | a[2]  |                 |       | Prolog
3     | a[3]  |                 |       | Prolog
4     | a[4]  |                 |       | Prolog
5     | a[5]  | a[1]=c,a[1]     |       | Prolog
6     | a[6]  | a[2]=c,a[2]     |       | Prolog
7     | a[7]  | a[3]=c,a[3]     | a[1]  | Kernel
8     | a[8]  | a[4]=c,a[4]     | a[2]  | Kernel
9     | a[9]  | a[5]=c,a[5]     | a[3]  | Kernel
10    | a[10] | a[6]=c,a[6]     | a[4]  | Kernel
11    | a[11] | a[7]=c,a[7]     | a[5]  | Kernel
12    | a[12] | a[8]=c,a[8]     | a[6]  | Kernel
13    |       | a[9]=c,a[9]     | a[7]  | Epilog
14    |       | a[10]=c,a[10]   | a[8]  | Epilog
15    |       | a[11]=c,a[11]   | a[9]  | Epilog
16    |       | a[12]=c,a[12]   | a[10] | Epilog
17    |       |                 | a[11] | Epilog
18    |       |                 | a[12] | Epilog
19    |       |                 |       | Epilog (last store completes)
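The two cycle counts for a[i]=a[i]*c can be reproduced from the latencies of the example (load 4, mult 2, store 2 cycles). The sketch assumes, as stated in the assumptions, that one load, one mult and one store can be issued per cycle; the function names are made up for this illustration.

```python
LOAD, MULT, STORE = 4, 2, 2   # latencies in cycles, taken from the example


def naive_cycles(n):
    """Each iteration runs load, mult and store strictly one after another."""
    return n * (LOAD + MULT + STORE)


def software_pipelined_cycles(n):
    """One load is issued per cycle; mult and store follow once their operand is ready."""
    last_load_issue = n                         # load a[i] is issued in cycle i
    last_mult_issue = last_load_issue + LOAD    # operand arrives LOAD cycles later
    last_store_issue = last_mult_issue + MULT   # result arrives MULT cycles later
    return last_store_issue + STORE - 1         # the store itself occupies STORE cycles


print(naive_cycles(12), software_pipelined_cycles(12))
```

For N = 12 this gives 96 and 19 cycles; for large N the software-pipelined loop approaches one iteration per cycle.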

Optimized kernel:

Software pipelining by compiler: reordering instructions considering the latencies of the instructions

The number of cycles spent in the loop kernel should be much larger than in prolog/epilog

Pseudo Code:

loop: load a[i+6]
      mult a[i+2] = c, a[i+2]
      store a[i]
      branch.loop

Latency of MULT pipeline: 2 cycles

Latency of load: 4 cycles

Dependencies within loop body prevent efficient software pipelining:

Fortran Code:

do i=1,N
  a(i) = a(i-1) * c
end do

Computation of a[i-1] must be completed before a[i] is started!

a[i]=a[i-1]*c; N=12

Naive instruction issue: T = 96 cycles

Optimized instruction issue: T = 26 cycles

[Figure: cycle-by-cycle instruction issue (cycles 1-19 shown) for the loop with dependency, with prolog and kernel marked]

Performance impact of dependencies on Intel Xeon 2.66 GHz

[Figure: performance of A(i)=A(i+1)*c and A(i)=A(i-1)*c vs. loop length N. Annotations: start-up of long effective pipeline; high performance for data in caches (N < 30000)]

Why?

Pipelining: Data dependencies

Basic types of (potential) dependencies within the loop body may prevent efficient software pipelining, e.g.:

No dependency:

do i=1,N
  a(i) = a(i) * c
end do

Dependency:

do i=2,N
  a(i) = a(i-1) * c
end do

Pseudo-Dependency:

do i=1,N-1
  a(i) = a(i+1) * c
end do

General version (offset as input parameter):

do i=max(1+offset,1),min(N+offset,N)
  a(i) = a(i-offset) * c
end do
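The difference between a real dependency and a pseudo-dependency can be made visible with a small experiment. The sketch compares strict in-order execution with an execution in which every iteration reads its operand before any result is written back, which is a rough stand-in for overlapped (pipelined) execution. This model and all function names are assumptions of the example; indices are 0-based here.

```python
def run_in_order(a, c, offset):
    """Execute a[i] = a[i - offset] * c one iteration after another."""
    a = list(a)
    n = len(a)
    for i in range(max(offset, 0), min(n + offset, n)):
        a[i] = a[i - offset] * c
    return a


def run_overlapped(a, c, offset):
    """Same loop, but every iteration reads its operand before any result is stored."""
    old = list(a)
    new = list(a)
    n = len(a)
    for i in range(max(offset, 0), min(n + offset, n)):
        new[i] = old[i - offset] * c
    return new


def overlap_is_safe(a, c, offset):
    """True if reading all operands early does not change the result."""
    return run_in_order(a, c, offset) == run_overlapped(a, c, offset)


data = [1.0, 2.0, 3.0, 4.0]
print(overlap_is_safe(data, 2.0, 0))    # no dependency
print(overlap_is_safe(data, 2.0, 1))    # dependency: a(i) needs the new a(i-1)
print(overlap_is_safe(data, 2.0, -1))   # pseudo-dependency: a(i) needs the old a(i+1)
```

Only offset = 1 changes the result when operands are read early, which is why this loop cannot be software pipelined efficiently.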


Pipelining

Further potential problems

Typical number of pipeline stages: 2-5 for the hardware pipelines on modern CPUs.

1 or 2 MultAdd units per processor, i.e. processor core

Modern microprocessors do not provide pipelines for div / sqrt or exp / sin!

Example: Cycles per operation (8-byte operands, Xeon/Netburst)

Operation        | y=a+y (y=a*y) | y=a/y | y=dsqrt(y) | y=sin(y)
Latency          | 4*            | 70*   | 70*        | ~160-180
Throughput       | 2*            | 70*   | 70*        | 130
Cycles/Operation | 1*            | 35*   | 35*        | 130

* Using SIMD instructions (SSE2)

Reduce number of complex operations if necessary.

Replace function call with a table lookup if the function is frequently computed for a few different arguments only.
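A minimal sketch of the table lookup idea, assuming the expensive function is sin and only three distinct arguments occur. The argument values and function names are hypothetical, and whether the lookup is actually faster depends on the hardware and the table size.

```python
import math


def build_table(arguments):
    """Evaluate the expensive function once per distinct argument."""
    return {x: math.sin(x) for x in set(arguments)}


def sum_of_sines(values, table):
    """Sum sin(v) over all values using the precomputed table instead of calling sin."""
    return sum(table[v] for v in values)


angles = [0.0, 0.5, 1.0]                        # hypothetical small set of arguments
values = [angles[i % 3] for i in range(3000)]   # many evaluations, few distinct arguments
table = build_table(angles)
total = sum_of_sines(values, table)
print(len(table), total)
```

Here sin is evaluated 3 times instead of 3000 times; the rest of the loop only reads stored results.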

Pipelining

Instruction pipeline

Besides the arithmetic and functional units, instruction execution itself is also pipelined; e.g. one instruction performs at least 3 steps:

Fetch instruction from L1I

Decode instruction

Execute instruction

Hardware pipelining on the processor (all units can run concurrently):

Cycle t | Fetch from L1I | Decode        | Execute
1       | Instruction 1  |               |
2       | Instruction 2  | Instruction 1 |
3       | Instruction 3  | Instruction 2 | Instruction 1
4       | Instruction 4  | Instruction 3 | Instruction 2
…

Branches can stall this pipeline! (Speculative execution, predication)

Each unit is itself pipelined (cf. Execute = multiply pipeline)

Pipelining

PowerPC Instruction Pipeline

14-stage pipeline for FP operations!

Pipeline of P4:

20 stages!

Superscalar Processors

Superscalar Processors can run multiple Instruction Pipelines at the same time!

Parallel hardware components / pipelines are available to

fetch / decode / issue multiple instructions per cycle (typically 2 – 8 per cycle)

load (store) multiple operands (results) from (to) cache per cycle (typically 2-4 8-byte words per cycle)

perform multiple integer / address calculations per cycle (e.g. 6 integer units on Itanium2)

perform multiple floating point operations per cycle (typically 2 or 4 floating point operations per cycle)

On superscalar RISC processors, out-of-order execution hardware is available to optimize the usage of the parallel hardware

Superscalar Processors

Instruction Level Parallelism through superscalar execution

Multiple units enable use of Instruction Level Parallelism (ILP):

Issuing m concurrent instructions per cycle: m-way superscalar

Modern processors are 3- to 6-way superscalar & can perform 2 or 4 floating point operations per cycle

[Figure: 4-way „superscalar“ execution: four instruction pipelines (fetch from L1I, decode, execute) running side by side over time t]

Superscalar Processor Exploit ILP

Example: Calculate norm of a vector

Naive version:

t=0
do i=1,n
  t=t+a(i)*a(i)
end do

The 2nd MADD has to wait for the first to complete, although in principle two independent MADDs could be done

The 2 FP Mult/Add units cannot be busy at the same time because of the dependency in summation variable t

„Load-after-Store dependency“

R1 = MADD(R1,A(I))

R1 = MADD(R1,A(I+1))

STALL

Superscalar Processor

Exploit ILP: Modulo variable expansion

Optimized version:

! assumes N is even
t1=0
t2=0
do i=1,N,2
  t1=t1+a(i)*a(i)
  t2=t2+a(i+1)*a(i+1)
end do
t=t1+t2

Two independent „instruction streams” can be processed by two separate FP Mult/Add units!

R1 = MADD(R1,A(I))     R2 = MADD(R2,A(I+1))

R1 = MADD(R1,A(I+2))   R2 = MADD(R2,A(I+3))

…

Most compilers can do those optimizations automatically!
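The same transformation written out as a complete function, including the remainder element for vectors of odd length. This is only a sketch of the loop structure: in an interpreted language it does not expose any hardware parallelism, and the function name is made up for the example.

```python
def sum_of_squares_two_streams(a):
    """Sum of a[i]**2 using two independent partial sums."""
    t1 = 0.0
    t2 = 0.0
    n_even = len(a) - len(a) % 2
    for i in range(0, n_even, 2):
        t1 += a[i] * a[i]            # first independent stream
        t2 += a[i + 1] * a[i + 1]    # second independent stream
    if n_even < len(a):              # remainder element for odd lengths
        t1 += a[-1] * a[-1]
    return t1 + t2


print(sum_of_squares_two_streams([1.0, 2.0, 3.0]))
print(sum_of_squares_two_streams([1.0, 2.0, 3.0, 4.0]))
```

Because floating point addition is not associative, the two-stream result can differ from the naive loop in the last digits.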

Superscalar Processors Some pitfalls

Data dependencies can prevent the parallel use of hardware, e.g.

for (i=0;…) A(i) = A(i-1)*c

(only one multiplication can be in flight at a time)

Data dependencies: the compiler cannot resolve aliasing conflicts!

void subscale( A , B )

….for (i=0;…) A(i) = B(i-1)*c

In C/C++ the pointers A and B can point to the same memory location -> see above

You should tell the compiler if you are never using aliasing (-fno-alias on the Intel compiler)
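Why the compiler has to be careful can be seen by running the same loop once with two separate arrays and once with both arguments referring to the same array. The sketch uses Python lists in place of C pointers; the function name is hypothetical.

```python
def scale_shifted(a, b, c):
    """a[i] = b[i-1] * c for i = 1 .. len(a)-1, executed strictly in order."""
    for i in range(1, len(a)):
        a[i] = b[i - 1] * c
    return a


x = [1.0, 1.0, 1.0, 1.0]
print(scale_shifted(list(x), list(x), 2.0))   # a and b are separate arrays

y = list(x)
print(scale_shifted(y, y, 2.0))               # a and b are the same array
```

With separate arrays all iterations are independent. With aliased arguments each iteration needs the result of the previous one, so the compiler must assume the slow case unless it is told otherwise.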

Superscalar Processors Some pitfalls

Avoid frequent and random (not predictable) branches in the application code, e.g.

do i=1,….
  if( random(0:1) > 0.5) then
    <Block1>
  else
    <Block2>
  endif
enddo

Superscalar processors try to predict the branch and speculatively start the pipeline for the next iterations.

If the branch was mispredicted the pipeline has to be flushed!

Superscalar Processor Efficient Use of Pipelining and ILP

Efficient use of pipelining/ILP requires intelligent compilers

Rearrangement of instructions to hide latencies

„Software pipelining“

Remove interdependencies that block parallel execution

Programmer should

Avoid unpredictable branches (stop and restart of pipeline!)

Avoid Data dependencies (if possible)

Tell compiler that instructions are independent

(e.g. do not use pointer aliasing: -fno-alias with the Intel compiler)

Long FP pipeline is inefficient for very small loops

Pipeline must be filled, i.e. long start-up times

Summary:

A large number of independent / parallel instructions is mandatory to use pipelined, superscalar processors efficiently.

Most of the work can be done by the compiler; however, the programmer must provide reasonable code
