Pipeline

(1)

CS 211: Computer Architecture

Instructor: Prof. Bhagi Narahari

Dept. of Computer Science

Course URL:

www.seas.gwu.edu/~narahari/cs211/

(2)

How to improve performance?

• Recall that performance is a function of
  – CPI: cycles per instruction
  – Clock cycle
  – Instruction count
• Reducing any of the 3 factors will lead to improved performance
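The three factors listed above are usually combined into the CPU time equation. The symbols below ($IC$ for instruction count, $T_{clk}$ for the clock cycle time) are shorthand assumed here, not notation taken from the slides:

```latex
\[
\text{CPU time} = IC \times CPI \times T_{clk}
\]
```

Because the factors multiply, reducing any one of them while the other two stay fixed reduces CPU time in proportion.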

(3)

How to improve performance?

• First step is to apply the concept of pipelining to the instruction execution process
  – Overlap computations
• What does this do?
  – Decrease clock cycle
  – Decrease effective CPU time compared to original clock cycle
• Appendix A of Textbook

(4)

Pipeline Approach to Improve System Performance

• Analogous to fluid flow in pipelines and assembly lines in factories
• Divide the process into “stages” and send tasks into a pipeline
  – Overlap computations of different tasks by operating on them concurrently in different stages

(5)

Instruction Pipeline

• Instruction execution process lends itself naturally to pipelining
  – overlap the subtasks of instruction fetch, decode and execute

(6)

Linear Pipeline Processor

Linear pipeline processes a sequence of subtasks with linear precedence
At a higher level: sequence of processors
Data flowing in streams from stage $S_1$ to the final stage $S_k$
Control of data flow: synchronous or asynchronous

[Figure: linear chain of stages $S_1$, $S_2$, ..., $S_k$]

(7)

Synchronous Pipeline

All transfers simultaneous
One task or operation enters the pipeline per cycle
Processors reservation table: diagonal

(8)

Time Space Utilization of Pipeline

[Figure: time-space diagram of tasks T1 to T4 occupying stages S1, S2, S3; horizontal axis: time (in pipeline cycles)]

Full pipeline after 4 cycles
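A time-space diagram like the one on this slide can be generated with a few lines of code. The sketch below assumes an ideal linear synchronous pipeline in which task t enters one cycle after task t-1 and spends exactly one cycle in each stage; the function name and the text layout are illustrative choices, not part of the course material.

```python
def time_space_diagram(num_stages, num_tasks):
    """Return one text row per stage of a linear synchronous pipeline.

    Task t (0-based) occupies stage s (0-based) during cycle t + s,
    which is what makes the reservation table diagonal.
    """
    total_cycles = num_stages + num_tasks - 1
    rows = []
    for stage in range(num_stages):
        cells = []
        for cycle in range(total_cycles):
            task = cycle - stage  # which task sits in this stage now
            cells.append(f"T{task + 1}" if 0 <= task < num_tasks else "--")
        rows.append(f"S{stage + 1}: " + " ".join(cells))
    return rows


for row in time_space_diagram(num_stages=3, num_tasks=4):
    print(row)
```

Each printed row is a stage and each column a pipeline cycle; idle slots show up as `--` while the pipeline fills and drains.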

(9)

Asynchronous Pipeline

Transfers performed when individual processors are ready
Handshaking protocol between processors
Mainly used in multiprocessor systems with message-passing

(10)

Pipeline Clock and Timing

[Figure: adjacent stages $S_i$ and $S_{i+1}$, with stage delay $\tau_m$ and latch delay $d$]

Clock cycle of the pipeline: $\tau$
Latch delay: $d$
$$\tau = \max\{\tau_m\} + d$$
Pipeline frequency: $f = 1/\tau$

(11)

Speedup and Efficiency

A k-stage pipeline processes n tasks in k + (n-1) clock cycles: k cycles for the first task and n-1 cycles for the remaining n-1 tasks

Total time to process n tasks ($\tau$ is the pipeline clock cycle):
$$T_k = [k + (n-1)]\,\tau$$

For the non-pipelined processor:
$$T_1 = n k \tau$$

Speedup factor:
$$S_k = \frac{T_1}{T_k} = \frac{n k \tau}{[k + (n-1)]\,\tau} = \frac{n k}{k + (n-1)}$$
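The cycle count and speedup factor can be checked numerically. This is a minimal sketch that assumes every stage takes exactly one clock cycle and that the non-pipelined processor needs k cycles per task, as on the slide; the function names are made up for illustration.

```python
def pipeline_cycles(k, n):
    """Clock cycles for a k-stage pipeline to finish n tasks."""
    return k + (n - 1)


def speedup(k, n):
    """Speedup over a non-pipelined processor that needs n * k cycles."""
    return (n * k) / pipeline_cycles(k, n)


for n in (1, 10, 1000):
    print(n, pipeline_cycles(4, n), round(speedup(4, n), 3))
```

With a single task there is no overlap and the speedup is 1; as n grows the speedup approaches the number of stages k.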

(12)

Efficiency and Throughput

Efficiency of the k-stage pipeline:
$$E_k = \frac{S_k}{k} = \frac{n}{k + (n-1)}$$

Pipeline throughput (the number of tasks per unit time), with clock cycle $\tau$ and frequency $f = 1/\tau$; note equivalence to IPC:
$$H_k = \frac{n}{[k + (n-1)]\,\tau} = \frac{n f}{k + (n-1)}$$
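It can help to look at what these expressions approach for a long stream of tasks. The limits below follow directly from the formulas for $S_k$, $E_k$, and $H_k$, with $k$ held fixed and $f$ the pipeline frequency:

```latex
\[
\lim_{n \to \infty} S_k = \lim_{n \to \infty} \frac{n k}{k + (n-1)} = k,
\qquad
\lim_{n \to \infty} E_k = \lim_{n \to \infty} \frac{n}{k + (n-1)} = 1,
\qquad
\lim_{n \to \infty} H_k = \lim_{n \to \infty} \frac{n f}{k + (n-1)} = f
\]
```

In other words, the fill time of $k - 1$ cycles is amortized away, and an ideal pipeline completes one task per clock cycle.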

(13)

Pipeline Performance: Example

• Task has 4 subtasks with time: t1=60, t2=50, t3=90, and t4=80 ns (nanoseconds)
• latch delay = 10 ns
• Pipeline cycle time = 90+10 = 100 ns
• For non-pipelined execution
  – time = 60+50+90+80 = 280 ns
• Speedup for above case is: 280/100 = 2.8 !!
• Pipeline time for 1000 tasks = (1000 + 4 - 1) cycles = 1003 x 100 ns
• Sequential time = 1000 x 280 ns
• Throughput = 1000/1003 tasks per cycle
• What is the problem here ?
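The arithmetic in this example can be reproduced directly. The sketch below uses only the numbers given on the slide; the function and variable names are illustrative. One possible reading of the closing question, offered here as an assumption rather than the lecturer's answer, is that the slowest stage plus latch delay sets the clock, so the speedup stays well below the stage count of 4.

```python
def pipeline_example(stage_delays_ns, latch_delay_ns, num_tasks):
    """Recompute the slide's example from its raw numbers."""
    k = len(stage_delays_ns)
    # The clock must accommodate the slowest stage plus the latch.
    cycle_ns = max(stage_delays_ns) + latch_delay_ns
    pipelined_ns = (k + num_tasks - 1) * cycle_ns
    sequential_ns = num_tasks * sum(stage_delays_ns)
    return {
        "cycle_ns": cycle_ns,
        "pipelined_ns": pipelined_ns,
        "sequential_ns": sequential_ns,
        "speedup": sequential_ns / pipelined_ns,
        "tasks_per_cycle": num_tasks / (k + num_tasks - 1),
    }


result = pipeline_example([60, 50, 90, 80], latch_delay_ns=10, num_tasks=1000)
for name, value in result.items():
    print(name, value)
```

Trying more balanced stage delays (for instance four stages of 70 ns) shows how the speedup moves toward 4 when no single stage dominates the cycle time.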

(14)

Non-linear pipelines and pipeline control algorithms

• Can have non-linear path in pipeline…
  – How to schedule instructions so they do not conflict for resources
• How does one control the pipeline at the microarchitecture level
  – How to build a scheduler in hardware ?
  – How much time does scheduler have to make decision ?

(15)

Non-linear Dynamic Pipelines

• Multiple processors (k-stages) as linear pipeline
• Variable functions of individual processors
• Functions may be dynamically assigned
• Feedforward and feedback connections

(16)

Reservation Tables

• Reservation table: displays time-space flow of data through the pipeline, analogous to opcode of pipeline
  – Not diagonal, as in linear pipelines
• Multiple reservation tables for different functions
  – Functions may be dynamically assigned
• Feedforward and feedback connections
• The number of columns in the reservation table: evaluation time of a given function

(17)

Reservation Tables (Examples)

(18)

Latency Analysis

• Latency: the number of clock cycles between two initiations of the pipeline
• Collision: an attempt by two initiations to use the same pipeline stage at the same time
• Some latencies cause collision, some not

(19)

Collisions (Example)

[Figure: reservation table over time steps 1 to 10 with initiations x₁ to x₄; cells marked x₁x₂, x₂x₃, and x₃x₄ show collisions]

(20)

Latency Cycle

[Figure: reservation table over time steps 1 to 18 with initiations x₁, x₂, x₃ repeating in cycles]

Latency cycle: the sequence of initiations which has repetitive subsequence and without collisions
Latency sequence length: the number of time intervals within the cycle
Average latency: the sum of all latencies divided by the number of latencies along the cycle

(21)

Collision Free Scheduling

Goal: to find the shortest average latency
Lengths: for a reservation table with n columns, the maximum forbidden latency is $m \le n - 1$, and a permissible latency $p$ satisfies $1 \le p \le m - 1$
Ideal case: $p = 1$ (static pipeline)
Collision vector: $C = (C_m\, C_{m-1} \ldots C_2\, C_1)$
  [ $C_i = 1$ if latency $i$ causes collision ]
  [ $C_i = 0$ for permissible latencies ]
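The collision vector can be derived mechanically from a reservation table: a latency is forbidden whenever some stage is marked in two columns that far apart. The sketch below assumes a table represented as one set of marked time steps per stage. The sample table is a hypothetical three-stage, eight-column example chosen for illustration, not the table from the slides.

```python
def forbidden_latencies(table):
    """Distances between any two marks in the same row (stage)."""
    forbidden = set()
    for row in table:
        times = sorted(row)
        for a in range(len(times)):
            for b in range(a + 1, len(times)):
                forbidden.add(times[b] - times[a])
    return forbidden


def collision_vector(table):
    """Return C = (C_m ... C_1) as a bit string, C_m first."""
    forbidden = forbidden_latencies(table)
    if not forbidden:
        return ""  # no stage is reused, so no latency collides
    m = max(forbidden)
    return "".join("1" if i in forbidden else "0" for i in range(m, 0, -1))


# Hypothetical reservation table: marked time steps for stages S1..S3.
SAMPLE_TABLE = [{1, 6, 8}, {2, 4}, {3, 5, 7}]

print(sorted(forbidden_latencies(SAMPLE_TABLE)))
print(collision_vector(SAMPLE_TABLE))
```

For this sample table, latencies 2, 4, 5, and 7 are forbidden, so latencies 1, 3, and 6 are the permissible ones below m = 7.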

(22)

Collision Vector

[Figure: reservation table with initiations x₁ and x₂ overlaid for two latency values]

C = (? ? . . . ? ?)

(23)

Back to our focus: Computer Pipelines

• Execute billions of instructions, so throughput is what matters
• MIPS desirable features:
  – all instructions same length,
  – registers located in same place in instruction format,

(24)

Designing a Pipelined Processor

• Go back and examine your datapath and control diagram
• associate resources with states
• ensure that flows do not conflict, or figure out how to resolve

(25)

5 Steps of MIPS Datapath

[Figure: MIPS datapath in five steps: Instruction Fetch; Instr. Decode / Reg. Fetch; Execute / Addr. Calc; Memory Access; Write Back]

What do we need to do to pipeline the process ?

(26)

5 Steps of MIPS/DLX Datapath

[Figure: the five-step datapath with pipeline registers IF/ID, ID/EX, EX/MEM, and MEM/WB between the steps]

• Data stationary control

(27)

Graphically Representing Pipelines

• Can help with answering questions like:
  – how many cycles does it take to execute this code?
  – what is the ALU doing during cycle 4?

(28)

Visualizing Pipelining

[Figure: instructions in program order (vertical axis) against time in clock cycles (horizontal axis); each instruction passes through Ifetch, Reg, ALU, DMem, Reg]

(29)

Conventional Pipelined Execution Representation

[Figure: successive instructions, each passing through IFetch, Dcd, Exec, Mem, WB and offset by one cycle; axes: program flow (vertical) and time (horizontal)]

(30)

Single Cycle, Multiple Cycle, vs. Pipeline

[Figure: timing of Load, Store, and R-type instructions under three implementations. Single Cycle Implementation: one clock cycle per instruction, with a "Waste" interval marked. Multiple Cycle Implementation: Ifetch, Reg, Exec, Mem, Wr steps spread over cycles 1 to 10. Pipeline Implementation: the steps of successive instructions overlapped.]

(31)

The Five Stages of Load

• Ifetch: Instruction Fetch
  – Fetch the instruction from the Instruction Memory
• Reg/Dec: Registers Fetch and Instruction Decode
• Exec: Calculate the memory address
• Mem: Read the data from the Data Memory
• Wr: Write the data back to the register file

[Figure: Load occupies Ifetch, Reg/Dec, Exec, Mem, Wr in cycles 1 to 5]

(32)

The Four Stages of R-type

• Ifetch: Instruction Fetch
  – Fetch the instruction from the Instruction Memory
• Reg/Dec: Registers Fetch and Instruction Decode
• Exec:
  – ALU operates on the two register operands
  – Update PC
• Wr: Write the ALU output back to the register file

[Figure: R-type occupies Ifetch, Reg/Dec, Exec, Wr in cycles 1 to 4]

(33)

Pipelining the R-type and Load Instruction

• We have pipeline conflict or structural hazard:
  – Two instructions try to write to the register file at the same time!

[Figure: R-type and Load instructions overlapped in the pipeline; the point where two of them reach Wr in the same clock cycle is marked "Oops!"]

(34)

Important Observation

• Each functional unit can only be used once per instruction
• Each functional unit must be used at the same stage for all instructions:
  – Load uses Register File’s Write Port during its 5th stage
  – R-type uses Register File’s Write Port during its 4th stage

[Figure: Load, stages 1 to 5: Ifetch, Reg/Dec, Exec, Mem, Wr. R-type, stages 1 to 4: Ifetch, Reg/Dec, Exec, Wr]

(35)

Solution 1: Insert “Bubble” into the Pipeline

• Insert a “bubble” into the pipeline to prevent 2 writes at the same cycle
  – The control logic can be complex.
  – Lose instruction fetch and issue opportunity.

[Figure: R-type and Load instructions in the pipeline with a bubble inserted]

(36)

Solution 2: Delay R-type’s Write by One Cycle

• Delay R-type’s register write by one cycle:
  – Now R-type instructions also use Reg File’s write port at Stage 5
  – Mem stage is a NOOP stage: nothing is being done.

[Figure: R-type, stages 1 to 5: Ifetch, Reg/Dec, Exec, Mem, Wr; R-type and Load instructions overlapped in the pipeline]

(37)

Why Pipeline?

• Suppose we execute 100 instructions
• Single Cycle Machine
  – 45 ns/cycle x 1 CPI x 100 inst = 4500 ns
• Multicycle Machine
  – 10 ns/cycle x 4.6 CPI (due to inst mix) x 100 inst = 4600 ns
• Ideal pipelined machine

(38)

Why Pipeline? Because the resources are there!

[Figure: Inst 0 to Inst 4 overlapped in time, each using Im, Reg, ALU, Dm, Reg in successive cycles]

(39)

Problems with Pipeline processors?

• Limits to pipelining: Hazards prevent next instruction from executing during its designated clock cycle and introduce stall cycles which increase CPI
  – Structural hazards: HW cannot support this combination of instructions - two dogs fighting for the same bone
  – Data hazards: Instruction depends on result of prior instruction still in the pipeline
    » Data dependencies
  – Control hazards: Caused by delay between the fetching of instructions and decisions about changes in control flow (branches and jumps).
    » Control dependencies
• Can always resolve hazards by stalling
• More stall cycles = more CPU time = less performance

(40)

Back to our old friend: CPU time equation

• Recall equation for CPU time
• So what are we doing by pipelining the instruction execution process ?
  – Clock ?
  – Instruction Count ?
  – CPI ?

(41)

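Speed Up Equation for Pipelining

$$\mathrm{CPI}_{pipelined} = \text{Ideal CPI} + \text{Average Stall cycles per Inst}$$

$$\text{Speedup} = \frac{\text{Ideal CPI} \times \text{Pipeline depth}}{\text{Ideal CPI} + \text{Pipeline stall CPI}} \times \frac{\text{Cycle Time}_{unpipelined}}{\text{Cycle Time}_{pipelined}}$$

For simple RISC pipeline, CPI = 1:

$$\text{Speedup} = \frac{\text{Pipeline depth}}{1 + \text{Pipeline stall CPI}} \times \frac{\text{Cycle Time}_{unpipelined}}{\text{Cycle Time}_{pipelined}}$$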

(42)

One Memory Port/Structural Hazards

[Figure: pipeline diagram of Load followed by Instr 1 to Instr 4 over cycles 1 to 7, each instruction beginning with Ifetch and Reg]

(43)

One Memory Port/Structural Hazards

[Figure: pipeline diagram in instruction order of Load, Instr 1, Instr 2, and a Stall over cycles 1 to 7, with a bubble inserted]

(44)

Example: Dual-port vs. Single-port

• Machine A: Dual ported memory (“Harvard Architecture”)
• Machine B: Single ported memory, but its pipelined implementation has a 1.05 times faster clock rate
• Ideal CPI = 1 for both
• Note - Loads will cause stalls of 1 cycle
• Recall our friend:
  – CPU = IC x CPI x Clk

(45)

Example…

• Machine A: Dual ported memory (“Harvard Architecture”)
• Machine B: Single ported memory, but its pipelined implementation has a 1.05 times faster clock rate
• Ideal CPI = 1 for both
• Loads are 40% of instructions executed

SpeedUp_A = Pipe. Depth/(1 + 0) x (clock_unpipe/clock_pipe)
          = Pipeline Depth
SpeedUp_B = Pipe. Depth/(1 + 0.4 x 1) x (clock_unpipe/(clock_unpipe / 1.05))
          = (Pipe. Depth/1.4) x 1.05
          = 0.75 x Pipe. Depth
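The two speedups can be evaluated with the speedup equation for pipelining. The sketch below is a direct transcription of that equation; the pipeline depth of 5 is an assumption made only so that concrete numbers can be printed (the ratio between the two machines does not depend on it), and the function name is illustrative.

```python
def pipeline_speedup(depth, stall_cpi, cycle_time_ratio=1.0, ideal_cpi=1.0):
    """Speedup over the unpipelined machine.

    cycle_time_ratio is (unpipelined cycle time) / (pipelined cycle time).
    """
    return (ideal_cpi * depth) / (ideal_cpi + stall_cpi) * cycle_time_ratio


DEPTH = 5  # assumed pipeline depth, for illustration only

# Machine A: dual-ported memory, so loads cause no structural stalls.
speedup_a = pipeline_speedup(DEPTH, stall_cpi=0.0)
# Machine B: 40% loads, each stalling 1 cycle, but a 1.05x faster clock.
speedup_b = pipeline_speedup(DEPTH, stall_cpi=0.4 * 1, cycle_time_ratio=1.05)

print(round(speedup_a, 2), round(speedup_b, 2), round(speedup_a / speedup_b, 2))
```

Under these numbers Machine A comes out about 1.33 times faster than Machine B, since 1.4 / 1.05 is roughly 1.33 regardless of the depth chosen.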

(46)

Data Dependencies

• True dependencies and False dependencies
  – false implies we can remove the dependency
  – true implies we are stuck with it!
• Three types of data dependencies defined in terms of how succeeding instruction depends on preceding instruction
  – RAW: Read after Write or Flow dependency
  – WAR: Write after Read or anti-dependency
  – WAW: Write after Write or output dependency

(47)

Three Generic Data Hazards

• Read After Write (RAW)
  Instr_J tries to read operand before Instr_I writes it
• Caused by a “Dependence” (in compiler nomenclature). This hazard results from an actual need for communication.

I: add r1,r2,r3
J: sub r4,

(48)

RAW Dependency

• Example program (a) with two instructions
  – i1: load r1, a;
  – i2: add r2, r1, r1;
• Program (b) with two instructions
  – i1: mul r1, r4, r5;
  – i2: add r2, r1, r1;
• In both cases we cannot read in i2 until i1 has completed writing the result
  – In (a) this is due to load-use dependency

(49)

Three Generic Data Hazards

• Write After Read (WAR)
  Instr_J writes operand before Instr_I reads it
• Called an “anti-dependence” by compiler writers. This results from reuse of the name “r1”.
• Can’t happen in MIPS 5 stage pipeline because:
  – All instructions take 5 stages, and
  – Reads are always in stage 2, and
  – Writes are always in stage 5

I: sub r4,r1,r3
J: add r1,r2,r3
K: mul r6,r1,r7

(50)

Three Generic Data Hazards

• Write After Write (WAW)
  Instr_J writes operand before Instr_I writes it.
• Called an “output dependence” by compiler writers. This also results from the reuse of name “r1”.
• Can’t happen in MIPS 5 stage pipeline because:
  – All instructions take 5 stages, and
  – Writes are always in stage 5
• Will see WAR and WAW in later more complicated pipes

I: sub r1,r4,r3
J: add r1,r2,r3
K: mul r6,r1,r7

(51)

WAR and WAW Dependency

• Example program (a):
  – i1: mul r1, r2, r3;
  – i2: add r2, r4, r5;
• Example program (b):
  – i1: mul r1, r2, r3;
  – i2: add r1, r4, r5;
• In both cases we have dependence between i1 and i2
  – in (a) because r2 must be read before it is written into
  – in (b) because r1 must be written by i2 after it has been written into by i1

(52)

What to do with WAR and WAW ?

• Problem:
  – i1: mul r1, r2, r3;
  – i2: add r2, r4, r5;

(53)

What to do with WAR and WAW

• Solution: Rename Registers
  – i1: mul r1, r2, r3;
  – i2: add r6, r4, r5;
• Register renaming can solve many of these false dependencies
  – note the role that the compiler plays in this
  – specifically, the register allocation process, i.e., the process that assigns registers to variables

(54)

Hazard Detection in H/W

• Suppose instruction i is about to be issued and a predecessor instruction j is in the instruction pipeline
• How to detect and store potential hazard information
  – Note that hazards in machine code are based on register usage
  – Keep track of results in registers and their usage
    » Constructing a register data flow graph
• For each instruction i construct set of Read registers and Write registers
  – Rregs(i) is set of registers that instruction i reads from
  – Wregs(i) is set of registers that instruction i writes to

(55)

Hazard Detection in Hardware

• A RAW hazard exists on register $\rho$ if $\rho \in \mathrm{Rregs}(i) \cap \mathrm{Wregs}(j)$
  – Keep a record of pending writes (for inst's in the pipe) and compare with operand regs of current instruction.
  – When instruction issues, reserve its result register.
  – When an operation completes, remove its write reservation.
• A WAW hazard exists on register $\rho$ if $\rho \in \mathrm{Wregs}(i) \cap \mathrm{Wregs}(j)$
• A WAR hazard exists on register $\rho$ if $\rho \in \mathrm{Wregs}(i) \cap \mathrm{Rregs}(j)$
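The three hazard conditions are plain set intersections, so they translate almost literally into code. In this sketch, instruction i is the one about to issue and j is a predecessor still in the pipeline, following the slide's convention; representing register sets as Python sets of strings is an illustrative choice, not how hardware would store them.

```python
def detect_hazards(rregs_i, wregs_i, rregs_j, wregs_j):
    """Return the registers involved in each hazard between i and j.

    i is about to issue; j is a predecessor still in the pipeline.
    An empty set means that kind of hazard is absent.
    """
    return {
        "RAW": rregs_i & wregs_j,  # i reads what j has yet to write
        "WAW": wregs_i & wregs_j,  # i and j write the same register
        "WAR": wregs_i & rregs_j,  # i writes what j has yet to read
    }


# j: add r1,r2,r3   followed by   i: sub r4,r1,r3
hazards = detect_hazards(
    rregs_i={"r1", "r3"}, wregs_i={"r4"},
    rregs_j={"r2", "r3"}, wregs_j={"r1"},
)
print(hazards)
```

Here only the RAW entry is non-empty (register r1), which matches the add/sub pair used as the RAW example earlier in the deck.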

(56)

Internal Forwarding: Getting rid of some hazards

• In some cases the data needed by the next instruction at the ALU stage has been computed by the ALU (or some stage defining it) but has not been written back to the registers
• Can we “forward” this result by bypassing stages ?

(57)

Data Hazard on R1

add r1,r2,r3
sub r4,r1,r3
and r6,r1,r7
or  r8,r1,r9

[Figure: pipeline diagram of the instructions above, beginning with Ifetch and Reg]

(58)

Forwarding to Avoid Data Hazard

add r1,r2,r3
sub r4,r1,r3
and r6,r1,r7
or  r8,r1,r9
xor r10,r1,r11

[Figure: pipeline diagram of the instructions above, beginning with Ifetch and Reg]

(59)

Internal Forwarding of Instructions

• Forward result from ALU/Execute unit to execute unit in next stage
• Also can be used in cases of memory access
• in some cases, operand fetched from memory has been computed previously by the program
  – can we “forward” this result to a later stage thus avoiding an extra read from memory ?
  – Who does this ?
• Internal forwarding cases
  – Stage i to Stage i+k in pipeline

(60)

Internal Data Forwarding

Store-load forwarding

[Figure: memory M, access unit, and registers R1 and R2, shown before and after forwarding]

Before: STO M,R1 ; LD R2,M
After:  STO M,R1 ; MOVE R2,R1

(61)

Internal Data Forwarding

Load-load forwarding

[Figure: memory M, access unit, and registers R1 and R2, shown before and after forwarding]

Before: LD R1,M ; LD R2,M
After:  LD R1,M ; MOVE R2,R1

(62)

Internal Data Forwarding

Store-store forwarding

[Figure: memory M, access unit, and registers R1 and R2, shown before and after forwarding]

Before: STO M,R1 ; STO M,R2
After:  STO M,R2

(63)

HW Change for Forwarding

[Figure: forwarding datapath showing Registers, NextPC, Immediate, muxes feeding the ALU, Data Memory, and the ID/EX, EX/MEM, and MEM/WR pipeline registers]

(64)

What about memory operations?

[Figure: pipeline carrying the fields op, Rd, Ra, Rb through each stage, with operand latches A and B, result R, data memory (D Mem) with latch T, and Rd going to the reg file]

• If instructions are initiated in order and operations always occur in the same stage, there can be no hazards between memory operations!
• What does delaying WB on arithmetic operations cost?
  – cycles ?
  – hardware ?
• What about data dependence on loads?
    R1 <- R4 + R5
    R2 <- Mem[ R2 + I ]
    R3 <- R2 + R1
  “Delayed Loads”
• Can recognize this in decode stage and introduce bubble while stalling fetch stage
• Tricky situation:
    R1 <- Mem[ R2 + I ]
    Mem[R3+34] <- R1
  Handle with bypass in memory stage!

(65)

Data Hazard Even with Forwarding

lw  r1, 0(r2)
sub r4,r1,r6
and r6,r1,r7

[Figure: pipeline diagram in instruction order; each instruction passes through Ifetch, Reg, ALU, DMem, Reg]

(66)

Data Hazard Even with Forwarding

lw  r1, 0(r2)
sub r4,r1,r6
and r6,r1,r7
or  r8,r1,r9

[Figure: pipeline diagram in which the instructions after the lw each show a Bubble before continuing through Ifetch, Reg, ALU, DMem, Reg]

(67)

What can we (S/W) do?

(68)

Software Scheduling to Avoid Load Hazards

Try producing fast code for
    a = b + c;
    d = e - f;
assuming a, b, c, d, e, and f in memory.

Slow code:
    LW  Rb,b
    LW  Rc,c
    ADD Ra,Rb,Rc
    SW  a,Ra
    LW  Re,e
    LW  Rf,f
    SUB Rd,Re,Rf

Fast code:
    LW  Rb,b
    LW  Rc,c
    LW  Re,e
    ADD Ra,Rb,Rc
    LW  Rf,f
    SW  a,Ra
    SUB Rd,Re,Rf
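The benefit of the reordering can be counted. The sketch below assumes a pipeline with forwarding in which a load followed immediately by an instruction that uses the loaded register costs exactly one stall cycle; that stall model, the tiny parser, and the function names are assumptions made for illustration. The instruction lists are the slow and fast sequences as they appear on the slide.

```python
def parse(instr):
    """Split 'ADD Ra,Rb,Rc' into ('ADD', ['Ra', 'Rb', 'Rc'])."""
    op, operands = instr.split(None, 1)
    return op, [part.strip() for part in operands.split(",")]


def load_use_stalls(program):
    """Count loads whose result is needed by the very next instruction."""
    stalls = 0
    for prev, curr in zip(program, program[1:]):
        prev_op, prev_operands = parse(prev)
        _, curr_operands = parse(curr)
        if prev_op != "LW":
            continue
        loaded_reg = prev_operands[0]
        # In this syntax every operand after the first is a source
        # (for 'SW a,Ra' the stored register is the second operand).
        if loaded_reg in curr_operands[1:]:
            stalls += 1
    return stalls


SLOW = ["LW Rb,b", "LW Rc,c", "ADD Ra,Rb,Rc", "SW a,Ra",
        "LW Re,e", "LW Rf,f", "SUB Rd,Re,Rf"]
FAST = ["LW Rb,b", "LW Rc,c", "LW Re,e", "ADD Ra,Rb,Rc",
        "LW Rf,f", "SW a,Ra", "SUB Rd,Re,Rf"]

print(load_use_stalls(SLOW), load_use_stalls(FAST))
```

Under this model the slow sequence incurs two load-use stalls (before ADD and before SUB), while the fast sequence incurs none because an independent instruction sits between each load and its first use.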

(69)

Control Hazards: Branches

• Instruction flow
  – Stream of instructions processed by Inst. Fetch unit
  – Speed of “input flow” puts bound on rate of outputs generated
• Branch instruction affects instruction flow
  – Do not know next instruction to be executed until branch outcome known
• When we hit a branch instruction
  – Need to compute target address (where to branch)
  – Resolution of branch condition (true or false)

(70)

Control Hazard on Branches: Three Stage Stall

10: beq r1,r3,36
14: and r2,r3,r5
18: or r6,r1,r7
22: add r8,r1,r9
36: xor r10,r1,r11

[Figure: pipeline timing diagram for the instructions above]

(71)

Branch Stall Impact

• If CPI = 1, 30% branch, Stall 3 cycles => new CPI = 1.9!
• Two part solution:
– Determine branch taken or not sooner, AND
– Compute taken branch address earlier
• MIPS branch tests if register = 0 or ≠ 0
• MIPS Solution:
– Move Zero test to ID/RF stage
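The CPI figure quoted on this slide follows from adding the average stall cycles per instruction to the ideal CPI. The second line below is an illustration only: it assumes the branch penalty drops to 1 cycle once the test moves to the ID/RF stage, and it reuses the same 30% branch frequency.

```latex
\begin{align*}
\text{CPI}_{\text{new}} &= \text{CPI}_{\text{ideal}} + \text{branch frequency} \times \text{stall cycles}
  = 1 + 0.30 \times 3 = 1.9 \\
\text{CPI}_{\text{1-cycle penalty}} &= 1 + 0.30 \times 1 = 1.3
  \quad \text{(assumed penalty of 1 cycle)}
\end{align*}
```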

(72)

Pipelined MIPS (DLX) Datapath

[Figure: datapath with stages Instruction Fetch; Instr. Decode / Reg. Fetch; Execute / Addr. Calc.; Memory Access; Write Back]

This is the correct 1 cycle latency implementation!

(73)

Four Branch Hazard Alternatives

#1: Stall until branch direction is clear – flushing pipe
#2: Predict Branch Not Taken
– Execute successor instructions in sequence
– “Squash” instructions in pipeline if branch actually taken
– Advantage of late pipeline state update
– 47% DLX branches not taken on average
– PC+4 already calculated, so use it to get next instruction
#3: Predict Branch Taken
– 53% DLX branches taken on average
– But haven’t calculated branch target address in DLX
» DLX still incurs 1 cycle branch penalty

(74)

Four Branch Hazard Alternatives

#4: Delayed Branch
– Define branch to take place AFTER a following instruction

branch instruction
sequential successor₁
sequential successor₂
...
sequential successor_n
branch target if taken

(sequential successor₁ through sequential successor_n: Branch delay of length n)

– 1 slot delay allows proper decision and branch target address in 5 stage pipeline
– DLX uses this
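The delayed-branch rule can be illustrated with a toy trace. This is a sketch under stated assumptions, not a model of DLX: every branch is assumed taken and forward, a branch placed inside a delay slot is treated as an ordinary instruction, and the function name `execution_order` and the labels are hypothetical.

```python
# Sketch of delayed-branch semantics on a toy program.
# Each instruction is (label, target): target is None for a non-branch,
# or the label of the destination for a branch that is assumed taken.

def execution_order(program, delay_slots=1, max_steps=100):
    index = {label: i for i, (label, _) in enumerate(program)}
    order = []
    pc = 0
    pending = None                      # (slots_left, target_pc) of a branch in flight
    while pc < len(program) and len(order) < max_steps:
        label, target = program[pc]
        order.append(label)
        next_pc = pc + 1
        if pending is not None:
            slots_left, target_pc = pending
            slots_left -= 1
            if slots_left == 0:
                next_pc = target_pc     # delay slots consumed: redirect fetch
                pending = None
            else:
                pending = (slots_left, target_pc)
        elif target is not None:
            if delay_slots == 0:
                next_pc = index[target] # no delay: jump immediately
            else:
                pending = (delay_slots, index[target])
        pc = next_pc
    return order

program = [
    ("branch", "target"),
    ("succ1", None),
    ("succ2", None),
    ("target", None),
]

for n in (0, 1, 2):
    print(n, execution_order(program, delay_slots=n))
```

With one delay slot the first sequential successor executes before control reaches the branch target, which is the behaviour the compiler must fill with useful work.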

(75)

(76)

Delayed Branch

• Where to get instructions to fill branch delay slot?
– Before branch instruction
– From the target address: only valuable when branch taken
– From fall through: only valuable when branch not taken
– Cancelling branches allow more slots to be filled
• Compiler effectiveness for single branch delay slot:
– Fills about 60% of branch delay slots
– About 80% of instructions executed in branch delay slots useful in computation
– About 50% (60% x 80%) of slots usefully filled
• Delayed Branch downside: 7-8 stage pipelines, multiple instructions issued per clock (superscalar)

(77)

Evaluating Branch Alternatives

Scheduling scheme    Branch penalty   CPI    speedup v. unpipelined   speedup v. stall
Stall pipeline       3                1.42   3.5                      1.0
Predict taken        1                1.14   4.4                      1.26
Predict not taken    1                1.09   4.5                      1.29
Delayed branch       0.5              1.07   4.6                      1.31

Conditional & Unconditional = 14%, 65% change PC

$$\text{Pipeline speedup} = \frac{\text{Pipeline depth}}{1 + \text{Branch frequency} \times \text{Branch penalty}}$$
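The CPI column can be recomputed from the branch statistics quoted on this slide. The sketch below rests on assumptions that the slide does not spell out: an ideal CPI of 1, a pipeline depth of 5, equal cycle times, and a "predict not taken" penalty that is paid only by the 65% of branches that change the PC.

```python
# Sketch: recompute CPI and speedup over an unpipelined machine per scheme.
# Assumptions: ideal CPI = 1, depth = 5, equal pipelined/unpipelined cycle time.

BRANCH_FREQ = 0.14   # conditional and unconditional branches
CHANGE_PC = 0.65     # fraction of branches that change the PC
DEPTH = 5            # assumed pipeline depth

def cpi(effective_penalty):
    """Ideal CPI of 1 plus average branch stall cycles per instruction."""
    return 1 + BRANCH_FREQ * effective_penalty

schemes = {
    "Stall pipeline": 3,
    "Predict taken": 1,
    "Predict not taken": 1 * CHANGE_PC,   # penalty only when the PC changes
    "Delayed branch": 0.5,
}

for name, penalty in schemes.items():
    c = cpi(penalty)
    print(f"{name:18s} CPI = {c:.2f}  speedup v. unpipelined = {DEPTH / c:.2f}")
```

The CPI values match the table to two decimals. The computed speedups land close to the tabulated ones, with small last-digit differences that may come from rounding or truncation in the source.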

(78)

Branch Prediction based on history

• Can we use history of branch behaviour to predict branch outcome?
• Simplest scheme: use 1 bit of “history”
– Set bit to Predict Taken (T) or Predict Not-taken (NT)
– Pipeline checks bit value and predicts
» If incorrect then need to invalidate instruction
– Actual outcome used to set the bit value
• Example: let initial value = T, actual outcome of branches is NT, T, T, NT, T, T
– Predictions are: T, NT, T, T, NT, T
» 4 wrong (the 1st, 2nd, 4th and 5th), 2 correct = 33% accuracy
• In general, can have k-bit predictors: more when we cover superscalar processors.
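The 1-bit scheme is simple enough to replay in a few lines. This is a minimal sketch that assumes a single branch with one history bit and uses the example's initial value and outcome sequence; the function name `one_bit_predictor` is hypothetical. Printing the tally gives a direct check on the accuracy figure quoted for the example.

```python
# Sketch of a 1-bit branch predictor: always predict the last observed outcome.

def one_bit_predictor(outcomes, initial="T"):
    bit = initial
    predictions = []
    for actual in outcomes:
        predictions.append(bit)   # predict from the current history bit
        bit = actual              # the actual outcome sets the bit
    return predictions

outcomes = ["NT", "T", "T", "NT", "T", "T"]
predictions = one_bit_predictor(outcomes)
correct = sum(p == a for p, a in zip(predictions, outcomes))

print("predictions:", predictions)
print(f"{correct} correct out of {len(outcomes)}")
```

Because the bit flips on every misprediction, a branch that alternates direction is mispredicted twice per change, which is the weakness that multi-bit predictors address.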

(79)

Summary: Control and Pipelining

• Just overlap tasks; easy if tasks are independent
• Speed Up ≤ Pipeline Depth; if ideal CPI is 1, then:

$$\text{Speedup} = \frac{\text{Pipeline depth}}{1 + \text{Pipeline stall CPI}} \times \frac{\text{Cycle Time}_{\text{unpipelined}}}{\text{Cycle Time}_{\text{pipelined}}}$$

• Hazards limit performance on computers:
– Structural: need more HW resources
– Data (RAW,WAR,WAW): need forwarding, compiler scheduling
– Control: delayed branch, prediction

(80)

Summary #1/2: Pipelining

• What makes it easy
– all instructions are the same length
– just a few instruction formats
– memory operands appear only in loads and stores
• What makes it hard? HAZARDS!
– structural hazards: suppose we had only one memory
– control hazards: need to worry about branch instructions
– data hazards: an instruction depends on a previous instruction
• Pipelines pass control information down the pipe just as data moves down pipe
• Forwarding/Stalls handled by local control
• Exceptions stop the pipeline

(81)

Introduction to ILP

• What is ILP?
– Processor and Compiler design techniques that speed up execution by causing individual machine operations to execute in parallel
• ILP is transparent to the user
– Multiple operations executed in parallel even though the system is handed a single program written with a sequential processor in mind
• Same execution hardware as a normal RISC machine

(82)

[Figure: division of work between Compiler and Hardware for Superscalar, Dataflow, Indep. Arch., VLIW and TTA. Steps shown: Frontend and Optimizer; Determine Dependences; Determine Independences; Bind Operations to Function Units; Bind Transports to Busses; Execute]

B. Ramakrishna Rau and Joseph A. Fisher. Instruction-level parallel processing: History, overview, and perspective. The Journal of Supercomputing, 7(1-2):9-50, May 1993.