ELEC 5200/6200 Computer Architecture and Design Spring 2017 Lecture 5: Pipelining

(1)

ELEC 5200/6200

Computer Architecture and Design

Spring 2017

Lecture 5: Pipelining

2/20/2017 ELEC 5200-001/6200-001 Lecture 5 1

Ujjwal Guin, Assistant Professor

Department of Electrical and Computer Engineering Auburn University, Auburn, AL 36849

http://www.auburn.edu/~uzg0005/

Adapted from Dr. Chen-Huan Chiang (Intel) and Prof. Vishwani D. Agrawal (Auburn University) [Adapted from Computer Organization and Design, Patterson & Hennessy, 2014]

(2)

ILP: Instruction Level Parallelism

• Single-cycle and multi-cycle datapaths execute one instruction at a time.
• How can we get better performance?
• Answer: Execute multiple instructions at the same time.
  – Pipelining – Enhance a multi-cycle datapath to fetch one instruction every cycle.
  – Parallelism – Fetch multiple instructions every cycle.

(3)

Automobile Team Assembly

• 1 car assembled every four hours → 6 cars per day
• 180 cars per month → 2,160 cars per year

[Figure: one team performs the 1-hour tasks on each car in turn]

(4)

Automobile Assembly Line

Task 1 (1 hour) → Task 2 (1 hour) → Task 3 (1 hour) → Task 4 (1 hour)

• First car assembled in 4 hours (pipeline latency) → 1 car completed per hour thereafter
• 21 cars on first day, thereafter 24 cars per day → 717 cars per month
• 8,637 cars per year
• What gives 4X increase?

(5)

Throughput: Team Assembly

[Figure: team assembly timeline. The red car passes through Mechanical, Electrical, Painting, and Testing from start to completion; the blue car is started only when the red car is completed.]

Time of assembling one car = n hours, where n is the number of nearly equal subtasks, each requiring 1 unit of time

Throughput = 1/n cars per unit time

(6)

Throughput: Assembly Line

Time to complete first car = n time units (latency)

Cars completed in time T = T – n + 1

Throughput = (T – n + 1)/T = 1 – (n – 1)/T cars per unit time

$$\frac{\text{Throughput (assembly line)}}{\text{Throughput (team assembly)}} = \frac{1 - (n-1)/T}{1/n} = n - \frac{n(n-1)}{T} \to n \quad \text{as } T \to \infty$$

[Figure: assembly-line timeline. Car 1, Car 2, Car 3, Car 4, ... start one time unit apart and each passes through Mechanical, Electrical, Painting, and Testing; Car 1 completes after n time units and Car 2 one time unit later.]

Key idea: overlap execution
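The car counts on the two assembly slides can be checked with a minimal sketch. The function names are illustrative and not part of the lecture; the only assumptions are the ones stated on the slides (n equal subtasks of one time unit each).

```python
def cars_completed_team(T, n):
    """Team assembly: one car is finished every n time units."""
    return T // n


def cars_completed_line(T, n):
    """Assembly line: first car after n units, then one car per unit."""
    return max(0, T - n + 1)


def throughput_ratio(T, n):
    """Assembly-line throughput divided by team throughput (1/n)."""
    return (cars_completed_line(T, n) / T) / (1 / n)


if __name__ == "__main__":
    n = 4  # four 1-hour tasks
    for hours in (24, 24 * 30, 24 * 360):
        print(hours, cars_completed_team(hours, n),
              cars_completed_line(hours, n),
              round(throughput_ratio(hours, n), 3))
```

The ratio approaches n = 4 as T grows, which is the 4X increase asked about on the assembly-line slide.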

(7)

Some Features of Assembly Line

Task 1 (1 hour) → Task 2 (1 hour) → Task 3 (1 hour) → Task 4 (1 hour)

Mechanical → Electrical → Painting → Testing

• Electrical parts delivered just in time (JIT)
• Defect found: stall the assembly line to fix the cause of the defect
• The 3 cars in the assembly line are suspects, to be removed (flush pipeline)

(8)

Pros and Cons

• Advantages:
  – Efficient use of labor.
  – Specialists can do a better job.
  – Just in time (JIT) methodology eliminates warehouse cost.
• Disadvantages:
  – Penalty of defect latency.
  – Lack of flexibility in production.
  – Assembly line work is monotonous and boring.
• https://www.youtube.com/watch?v=IjarLbD9r30
• https://www.youtube.com/watch?v=ANXGJe6i3G8
• https://www.youtube.com/watch?v=5lp4EbfPAtI

(9)

Pipelining a Digital System

• Key idea: break big computation up into pieces
• Separate each piece with a pipeline register

[Figure: a 1 ns block of logic split into five 200 ps pieces separated by pipeline registers]

(10)

Pipelining a Digital System

• Why do this? Because it's faster for repeated computations
  – Non-pipelined: 1 operation finishes every 1 ns
  – Pipelined: 1 operation finishes every 200 ps

(11)

Pipelining a Processor

• Recall the 5 steps in instruction execution:
  1. Instruction Fetch (IF)
  2. Instruction Decode and Register Read (ID)
  3. Execute operation or calculate address (ALU or EX)
  4. Memory access (MEM)
  5. Write result into register (WB)
• Review: Single-Cycle Processor
  – All 5 steps done in a single clock cycle
  – Dedicated hardware required for each step
• What happens if we break execution into multiple cycles, and add extra hardware?
  – Recall that in multi-cycle, the datapath hardware differs from single-cycle

(12)

Review - Single-Cycle Processor

[Figure: single-cycle MIPS datapath (PC, instruction memory, register file, sign extend, ALU, data memory, adders, multiplexers) with its five steps marked: IF Instruction Fetch, ID Instruction Decode, EX Execute/Address Calc., MEM Memory Access, WB Write Back]

(13)

Pipelining - Key Idea

• Question: What happens if we break execution into multiple cycles, and add the extra hardware?
• Answer: in the best case, we can start executing a new instruction on each clock cycle – this is pipelining
• Pipelining stages:
  – IF - Instruction Fetch
  – ID - Instruction Decode
  – EX - Execute / Address Calculation
  – MEM - Memory Access (read / write)
  – WB - Write Back (results into register file)

(14)

Project Summary

• A RISC CPU is to be designed in the VHDL modeling language, verified via the Mentor Graphics “ModelSim” or Aldec “Active-HDL” simulator, and implemented on the Altera DE2 FPGA board using Altera’s Quartus II software.
• The project consists of six parts. Due dates will be listed above as the semester progresses. You should read the problem definitions of all six parts before starting Part 1, i.e., Instruction Set Architecture (ISA).
• Please submit only the List Format (not the wave format) of the simulation results in Part 3, Part 4, and Part 5. Always annotate your simulation results.
• Maintain a single folder for submitting the project parts. When submitting a later part, all the previous parts need to be in the folder.

(15)

Instruction Set Architecture Classes

[Figure: block diagrams of processor, ALU, and memory for four ISA classes]

a) Stack b) Accumulator c) Register-Memory d) Register-Register

(16)

Basic Pipelined Processor

[Figure: the single-cycle datapath with the pipeline registers IF/ID, ID/EX, EX/MEM, and MEM/WB inserted between the stages]

(17)

Single-Cycle vs. Pipelined Execution

[Figure: timing of lw $1, 100($0); lw $2, 200($0); lw $3, 300($0), each drawn as Instruction Fetch, REG RD, ALU, MEM, REG WR along a time axis in ps.
Non-pipelined: a new instruction starts every 800 ps.
Pipelined: a new instruction starts every 200 ps.]

Note: REGRD is at the end of a stage but REGWR is at the beginning of a stage

(18)

Single-Cycle vs. Pipelined Execution (cont.)

• Time taken in pipeline stages is limited by the slowest operation
  – Either ALU operation or memory access
• Time taken in the ALU stage (i.e., EX) is used as the pipeline clock cycle in the following discussion
• If most memory accesses are cache accesses, MEM < ALU
• Assumptions (Fig 4.27 on p.276)
  – Write to the register/memory occurs in the first half of the clock cycle
  – Read from register/memory occurs in the second half of the clock cycle
  – Without this assumption, Cycle 5 of the following example will have issues
• In the "Executing Multiple Instructions" example, Clock Cycle 5 is where the register file is used by 2 instructions in different stages (ID and WB)
  – How can hardware be designed to satisfy such an assumption?

[Figure: pipelined timing of the three lw instructions, one starting every 200 ps]

(19)

Comments about Pipelining

• The good news
  – Multiple instructions are being processed at the same time
  – This works because stages are isolated by registers
  – Best case speedup of #Stages
• The bad news
  – Instructions interfere with each other - Hazards
     • Different instructions may need the same piece of hardware (e.g., memory) in the same clock cycle --- Structural Hazard
     • The next instruction to fetch (IF) is not known until the EX stage of the branch instruction --- Control Hazard
     • An instruction may require a result produced by an earlier instruction that is not yet complete --- Data Hazard
  – Worst case: Must suspend execution - Stall

(20)

Example - Executing Multiple Instructions

• Consider the following instruction sequence

lw $r0, 10($r1)
sw $r3, 20($r4)
add $r5, $r6, $r7
sub $r8, $r9, $r10

(21)

Executing Multiple Instructions

Clock Cycle 1

[Figure: pipelined datapath with LW in IF]

(22)

Executing Multiple Instructions

Clock Cycle 2

[Figure: pipelined datapath with SW in IF and LW in ID]

(23)

Executing Multiple Instructions

Clock Cycle 3

[Figure: pipelined datapath with ADD in IF, SW in ID, and LW in EX]

(24)

Executing Multiple Instructions

Clock Cycle 4

[Figure: pipelined datapath with SUB in IF, ADD in ID, SW in EX, and LW in MEM]

(25)

Executing Multiple Instructions

Clock Cycle 5

[Figure: pipelined datapath with SUB in ID, ADD in EX, SW in MEM, and LW in WB]

(26)

Executing Multiple Instructions

Clock Cycle 6

[Figure: pipelined datapath with SUB in EX, ADD in MEM, and SW in WB]

(27)

Executing Multiple Instructions

Clock Cycle 7

[Figure: pipelined datapath with SUB in MEM and ADD in WB]

(28)

Executing Multiple Instructions

Clock Cycle 8

[Figure: pipelined datapath with SUB in WB]

(29)

Compact View

                      CC 1  CC 2  CC 3  CC 4  CC 5  CC 6  CC 7  CC 8
lw  $r0, 10($r1)      IM    REG   ALU   DM    REG
sw  $r3, 20($r4)            IM    REG   ALU   DM    REG
add $r5, $r6, $r7                 IM    REG   ALU   DM    REG
sub $r8, $r9, $r10                      IM    REG   ALU   DM    REG
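The cycle-by-cycle slides and the compact view can be regenerated with a short sketch. It assumes an ideal five-stage pipeline with no stalls; `pipeline_diagram` is an illustrative name, not something defined in the lecture.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def pipeline_diagram(instructions):
    """Return {cycle: {stage: instruction}} for an ideal pipeline (no stalls)."""
    diagram = {}
    for i, instr in enumerate(instructions):
        for s, stage in enumerate(STAGES):
            cycle = i + s + 1  # instruction i enters IF in cycle i + 1
            diagram.setdefault(cycle, {})[stage] = instr
    return diagram


if __name__ == "__main__":
    program = ["lw", "sw", "add", "sub"]
    for cycle, stages in sorted(pipeline_diagram(program).items()):
        row = "  ".join(f"{st}:{stages.get(st, '--'):<3}" for st in STAGES)
        print(f"CC{cycle}  {row}")
```

In cycle 5 the sketch places lw in WB and sub in ID, which is the cycle where the register file is both written and read.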

(30)

Pipeline Hazards

• Where one instruction cannot immediately follow another
• Types of hazards
  – Structural hazards - attempt to use same resource twice
  – Control hazards - attempt to make decision before condition is evaluated
  – Data hazards - attempt to use data before it is ready
• We can always resolve hazards by waiting
  – i.e. stall

(31)

Structural Hazards

• Attempt to use same resource twice at same time
• Example: A Single Memory for both instructions and data
  – Accessed by IF stage
  – Accessed at same time by MEM stage
• Solutions
  – Delay second access by one clock cycle, OR
  – Provide separate memories for instructions and data (IM and DM)
     • This is what MIPS does
     • Recall “Harvard Architecture”
     • Real pipelined processors have separate caches

(32)

Structural Hazard - Single Memory

[Figure: four overlapped instructions (IF ID EX MEM WB) on a time axis. With a single memory, the MEM stage of the first instruction and the IF stage of the fourth fall in the same cycle: Memory Conflict.]

(33)

Control Hazards

• Attempt to make a decision before condition is evaluated
• Example: beq $s0, $s1, offset
  – Must begin fetching the instruction following the branch on the very next clock cycle
  – But the pipeline does not know what the next instruction is, since it has only just received the branch instruction from memory
  – Possible solutions: stall, predict, or delayed decision
• If we add hardware to the second stage to:
  – Compare fetched registers for equality
  – Compute branch target and update PC
  – This allows the branch to be taken at the end of the second clock cycle
     • May not be possible for longer pipelines, since the branch may not be resolved in the 2nd stage; the slowdown is then larger
  – Must make sure that the additional hardware does not increase the pipeline clock cycle.

(34)

Control Hazard Solutions

• Stall - Stop loading instructions until result is available
• Predict - Assume an outcome and continue fetching (undo if prediction is wrong)
  – Always assuming branch untaken
  – Or assuming half of the branches taken and half untaken
• Delayed branch (used in MIPS)
  – Always executes the next SAFE instruction in the sequence
     • A safe instruction is an instruction which is not affected by the branch
  – MIPS software will place such a safe instruction immediately after the delayed branch
     • This step is hidden from the MIPS assembly programmer
  – If the branch is taken, it changes the address of the instruction that follows the safe instruction

(35)

Control Hazard – Stall

All following discussions assume the extra hardware in the 2nd stage.

[Figure: add $r4,$r5,$r6; beq $r0,$r1,tgt; one STALL cycle (a row of bubbles); then sw $s4,200($t5). With the extra hardware, beq writes the PC at the end of its second stage, and the new PC is used by the fetch after the stall.]

(36)

Control Hazard - Correct Prediction

[Figure: fetch assuming branch taken. add $r4,$r5,$r6; beq $r0,$r1,tgt; tgt: sw $s4,200($t5) follow one another, each passing through IF ID EX MEM WB with no bubble.]

(37)

Control Hazard - Incorrect Prediction

[Figure: add $r4,$r5,$r6; beq $r0,$r1,tgt; the instruction fetched after beq (tgt: sw $s4,200($t5)) is “squashed” after its IF and becomes a row of bubbles (incorrect prediction - STALL); or $r8,$r8,$r9 is then fetched and passes through IF ID EX MEM WB.]

(38)

Control Hazard - Delayed Branch

[Figure: add $r4,$r5,$r6; beq $r0,$r1,tgt; branch slot: and $r6,$r6,$r7 (always executes); tgt: sw $s4,200($t5). The correct PC is available in time for the fetch that follows the branch slot. Alternatively, re-arrange the code to execute the previous “add” in the branch slot.]

(39)

Summary - Control Hazard Solutions

• Stall - stop fetching instructions until result is available
  – Significant performance penalty
  – Hardware required to stall
• Predict - assume an outcome and continue fetching (undo if prediction is wrong)
  – Performance penalty only when guess is wrong
  – Hardware required to "squash" instructions
• Delayed branch - specify in architecture that the following instruction is always executed
  – Compiler re-orders instructions into delay slot
  – Insert "NOP" (no-op) operations when can't use (~50%)
  – This is how original MIPS worked

(40)

Example: Delayed branch

Loop: lw $8, 100($7)
      addi $7, $7, 4
      beq $7, $4, Loop

• addi is not a “safe” instruction to be placed in the branch slot (i.e., the instruction after beq)
  – Because of the dependence on $7 between addi and beq.
• lw seems a safe instruction candidate, but its location does not allow it to be moved to the branch slot
  – Because “addi $7, $7, 4” is after “lw $8, 100($7)”; i.e., if lw is moved to the branch slot, the value of $7 is off by 4.

(41)

Example: delayed branch (cont.)

• Changes made to the MIPS code
  – Swapping addi and lw location
  – Changing offset from 100 to 100-4=96
• In order to keep the results of the two programs identical
  – The value of $7 at the new location should be the value prior to “addi $7,$7,4”

Loop: addi $7, $7, 4
      lw $8, 96($7)
      beq $7, $4, Loop

• After the above swapping and changing of the offset, lw can be safely moved to the delay slot

Loop: addi $7, $7, 4

beq $7, $4, Loop

lw $8, 96($7) # delay slot
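A small simulation can confirm that the rewritten loop loads from the same addresses as the original. This is a sketch under simplifying assumptions: only registers $7 and $4 and the load addresses are modeled, the delay-slot instruction always executes, and the function names and the `max_iters` guard are illustrative.

```python
def run_original(r7, r4, max_iters=100):
    """Loop: lw $8,100($7); addi $7,$7,4; beq $7,$4,Loop (no delay slot)."""
    addresses = []
    for _ in range(max_iters):
        addresses.append(r7 + 100)   # lw $8, 100($7)
        r7 += 4                      # addi $7, $7, 4
        if r7 != r4:                 # beq $7, $4, Loop not taken: leave loop
            break
    return addresses, r7


def run_delay_slot(r7, r4, max_iters=100):
    """Loop: addi $7,$7,4; beq $7,$4,Loop; lw $8,96($7) in the delay slot."""
    addresses = []
    for _ in range(max_iters):
        r7 += 4                      # addi $7, $7, 4
        taken = (r7 == r4)           # beq $7, $4, Loop
        addresses.append(r7 + 96)    # delay slot: lw $8, 96($7) always executes
        if not taken:
            break
    return addresses, r7


if __name__ == "__main__":
    print(run_original(0, 4))
    print(run_delay_slot(0, 4))
```

Both versions produce the same address trace because (old $7 + 4) + 96 equals old $7 + 100.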

(42)

Data Hazards

• Attempt to use data before it is ready
• Solutions
  – Stalling - wait until result is available
  – Forwarding (Bypassing) - make data available inside datapath
  – Re-ordering instructions - use compiler to avoid hazards
• Examples:

add $s0, $t0, $t1 ; $s0 = $t0+$t1
sub $t2, $s0, $t3 ; $t2 = $s0-$t3

lw $s0, 0($t0) ; $s0 = MEM[$t0]
sub $t2, $s0, $t3 ; $t2 = $s0-$t3

(43)

Data Hazard - Stalling

[Figure: add $s0,$t0,$t1 followed by sub $t2,$s0,$t3. sub is held for two STALL cycles (rows of bubbles) so that it reads $s0 (R s0) in the cycle in which add writes $s0 in WB (W s0).]

One more STALL (a third) may be needed to be completely free of the data hazard if a register file that writes in the first half of the cycle and reads in the second half cannot be designed.

(44)

Data Hazards - Forwarding

• Key idea: connect new value directly to next stage
• Still read s0, but ignore in favor of new result
• Forwarding is valid only if the destination stage is later in time than the source stage
  – Problem: what about load instructions?
     • If the “add” is replaced by “lw”, data won’t be available until the MEM stage.

(45)

Data Hazards - Forwarding

• STALL still required for LOAD instruction
  – Because data is available only after MEM
• MIPS architecture calls this delayed load; initial implementations required the compiler to deal with this

[Figure: lw $s0,20($t1) followed by sub $t2,$s0,$t3. After one STALL, the new value of s0 is forwarded from lw's MEM stage to sub's EX stage.]

(46)

Data Hazards - Reordering Instructions

• What are the hazards in this code?

lw $t0, 0($t1)
lw $t2, 4($t1)
sw $t2, 0($t1)
sw $t0, 4($t1)

• Data forwarding resolves the data hazard on $t2 but still introduces a STALL, because sw uses $t2 immediately after the lw that loads it
• Reorder instructions to remove the hazard without any STALL when using data forwarding:

lw $t0, 0($t1)
lw $t2, 4($t1)
sw $t0, 4($t1)
sw $t2, 0($t1)
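The effect of the reordering can be counted with a minimal sketch. It assumes full forwarding, so the only remaining stall is a load immediately followed by an instruction that reads the loaded register (including a store that uses it as store data, as on the slide). The tuple encoding `(op, dest, sources)` and the function name are illustrative.

```python
def load_use_stalls(program):
    """Count one-cycle stalls: a lw immediately followed by a reader of its destination."""
    stalls = 0
    for prev, curr in zip(program, program[1:]):
        op, dest, _sources = prev
        if op == "lw" and dest in curr[2]:
            stalls += 1
    return stalls


# (op, destination register or None, list of source registers)
ORIGINAL = [
    ("lw", "$t0", ["$t1"]),
    ("lw", "$t2", ["$t1"]),
    ("sw", None, ["$t2", "$t1"]),
    ("sw", None, ["$t0", "$t1"]),
]
REORDERED = [
    ("lw", "$t0", ["$t1"]),
    ("lw", "$t2", ["$t1"]),
    ("sw", None, ["$t0", "$t1"]),
    ("sw", None, ["$t2", "$t1"]),
]

if __name__ == "__main__":
    print(load_use_stalls(ORIGINAL), load_use_stalls(REORDERED))
```

Swapping the two stores puts one instruction between the second load and its use, which is enough for forwarding to cover the dependence.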

(47)

Summary - Pipelining Overview

• Pipelining increases throughput (but does not reduce latency)
• Hazards limit performance
  – Structural hazards
  – Control hazards
  – Data hazards

(48)

Summary: Hazards

• Structural hazards
  – Cause: resource conflict
  – Remedies: (i) hardware resources, (ii) stall (bubble)
• Data hazards
  – Cause: data unavailability
  – Remedies: (i) forwarding, (ii) stall (bubble), (iii) code reordering
• Control hazards
  – Cause: out-of-sequence execution (branch or jump)
  – Remedies: (i) stall (bubble), (ii) branch prediction/pipeline flush, (iii) delayed branch/pipeline flush

(49)

Control Unit

for

Pipelined MIPS

(50)

Single-Cycle Control Logic

Inputs: instruction type and opcode (instruction bits 31-26). Outputs: control signals.

Instr.  Opcode   RegDst ALUSrc MemtoReg RegWrite MemRead MemWrite Branch ALUOp1 ALUOp0 Jump
R       000000   1      0      0        1        0       0        0      1      0      0
lw      100011   0      1      1        1        1       0        0      0      0      0
sw      101011   X      1      X        0        0       1        0      0      0      0
beq     000100   X      0      X        0        0       0        1      0      1      0
J       000010   X      X      X        0        X       0        X      X      X      1

(51)

Single-Cycle Control Circuit

[Figure: control circuit. Opcode bits Op5-Op0 are decoded into the R, lw, sw, beq, and J lines, which drive the outputs RegDst, ALUSrc, MemtoReg, RegWrite, MemRead, MemWrite, Branch, ALUOp1, ALUOp0, and Jump.]

(52)

ALU Control Logic

Inputs: ALUOp from the control unit (CU) and the funct code from the IR (bits 0-5). Outputs to ALU: 3-bit code.

Instr.   ALUOp1 ALUOp0   F5 F4 F3 F2 F1 F0   3-bit code   Operation
lw, sw   0      0        X  X  X  X  X  X    010          Add
B        0      1        X  X  X  X  X  X    110          Subtract
R        1      X        X  X  0  0  0  0    010          Add
R        1      X        X  X  0  0  1  0    110          Subtract
R        1      X        X  X  0  1  0  0    000          AND
R        1      X        X  X  0  1  0  1    001          OR
R        1      X        X  X  1  0  1  0    111          slt
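The ALU control table can be written as a lookup, shown here as a minimal sketch. The function name and the integer encoding of the inputs are assumptions made for illustration; don't-care bits (X) are handled by masking.

```python
# Low four funct bits (F3..F0) -> 3-bit ALU operation code for R-type instructions
R_TYPE = {
    0b0000: 0b010,  # add
    0b0010: 0b110,  # subtract
    0b0100: 0b000,  # AND
    0b0101: 0b001,  # OR
    0b1010: 0b111,  # slt
}


def alu_control(alu_op, funct):
    """alu_op: 2-bit int (ALUOp1 ALUOp0); funct: 6-bit int (F5..F0)."""
    if alu_op & 0b10:                  # R-type: ALUOp1 = 1, ALUOp0 is don't-care
        return R_TYPE[funct & 0b1111]  # F5 and F4 are don't-care
    if alu_op & 0b01:                  # branch: subtract to compare
        return 0b110
    return 0b010                       # lw / sw: add for address calculation


if __name__ == "__main__":
    print(format(alu_control(0b10, 0b101010), "03b"))  # slt funct code
```

A funct value outside the table raises `KeyError`, which stands in for instructions the table does not cover.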

(53)

ALU Control

[Figure: the ALU control block takes funct bits F3-F0 and ALUOp1, ALUOp0 from the control circuit and drives the 3-bit operation select of the ALU, which outputs zero, result, and overflow.]

Operation select   ALU function
000                AND
001                OR
010                Add
110                Subtract
111                Set on less than

(54)

Returning to Pipelined Control

• Opcode input to control is supplied by the pipeline register IF/ID in the ID (instruction decode) cycle.
• Nine control signals are generated in the ID cycle, but none is used. They are saved in the pipeline register ID/EX.
• ALUSrc, RegDst and ALUOp (2 bits) are used in the EX (execute) cycle. Remaining 5 control signals are saved in the pipeline register EX/MEM.
• Branch, MemWrite and MemRead are used in the MEM (memory access) cycle. Remaining 2 control signals are saved in the pipeline register MEM/WB.
• MemtoReg and RegWrite are used in the WB (write back) cycle.
• Pipelined control is shown without Jump.

(55)

Pipelined Datapath with Control Signals

[Figure: pipelined datapath with pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB and the control signals PCSrc, RegWrite, ALUSrc, ALUOp, RegDst, Branch, MemRead, MemWrite, and MemtoReg marked on the units they control]

(56)

Control

• Basic approach:
  – Based on single-cycle control
  – Place control unit in ID stage
  – Pass control signals to following stages
• Later: extra features to deal with:
  – Data forwarding
  – Stalls
  – Exceptions

(57)

Control for Pipelined Datapath

[Figure: the Control block in ID produces three groups of signals. EX (RegDst, ALUOp[1:0], ALUSrc) is used after ID/EX; M (Branch, MemRead, MemWrite) is carried through ID/EX and EX/MEM; WB (RegWrite, MemtoReg) is carried through ID/EX, EX/MEM, and MEM/WB.]

(58)

Control for Pipelined Datapath

             Execution/Address Calculation      Memory access              Write-back
             stage control lines                stage control lines        stage control lines
Instruction  RegDst ALUOp1 ALUOp0 ALUSrc        Branch MemRead MemWrite    RegWrite MemtoReg
R-format     1      1      0      0             0      0       0           1        0
lw           0      0      0      1             0      1       0           1        1
sw           X      0      0      1             0      0       1           0        X
beq          X      0      1      0             1      0       0           0        X
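How the nine control signals thin out from stage to stage can be sketched as follows. The dictionary layout and the `decode` and `advance` names are illustrative only; `None` stands for a don't-care (X) entry in the table.

```python
# Per instruction: (EX group, M group, WB group)
# EX: RegDst, ALUOp1, ALUOp0, ALUSrc | M: Branch, MemRead, MemWrite | WB: RegWrite, MemtoReg
CONTROL = {
    "R-format": ((1, 1, 0, 0), (0, 0, 0), (1, 0)),
    "lw":       ((0, 0, 0, 1), (0, 1, 0), (1, 1)),
    "sw":       ((None, 0, 0, 1), (0, 0, 1), (0, None)),
    "beq":      ((None, 0, 1, 0), (1, 0, 0), (0, None)),
}


def decode(instr):
    """Signals generated in ID and saved in the ID/EX register."""
    ex, m, wb = CONTROL[instr]
    return {"EX": ex, "M": m, "WB": wb}


def advance(bundle):
    """Drop the group used by the current stage; pass the rest to the next register."""
    remaining = [k for k in ("EX", "M", "WB") if k in bundle]
    return {k: bundle[k] for k in remaining[1:]}


def signal_count(bundle):
    return sum(len(group) for group in bundle.values())


if __name__ == "__main__":
    id_ex = decode("lw")
    ex_mem = advance(id_ex)
    mem_wb = advance(ex_mem)
    print(signal_count(id_ex), signal_count(ex_mem), signal_count(mem_wb))
```

The counts 9, 5, and 2 match the numbers of signals saved in ID/EX, EX/MEM, and MEM/WB.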

(59)

Datapath and Control Unit

(60)

Tracking Control Signals - Cycle 1

LW

(61)

Tracking Control Signals - Cycle 2

SW LW

(62)

Tracking Control Signals - Cycle 3

[Figure: pipelined datapath with control in cycle 3: ADD in IF, SW in ID, and LW in EX, with the WB, M, and EX control bits shown in the pipeline registers]

(63)

Tracking Control Signals - Cycle 4

[Figure: pipelined datapath with control in cycle 4: SUB in IF, ADD in ID, SW in EX, and LW in MEM]

(64)

Tracking Control Signals - Cycle 5

[Figure: pipelined datapath with control in cycle 5: SUB in ID, ADD in EX, SW in MEM, and LW in WB]

(65)

Data Hazards Revisited…

• Data hazards occur when data is used before it is stored
  – RAW (read after write).

[Figure: pipeline diagram over clock cycles CC 1 to CC 9 for the sequence below, in program execution order; each instruction passes through IM, Reg, ALU, DM, Reg.]

sub $2, $1, $3
and $12, $2, $5
or $13, $6, $2
add $14, $2, $2
sw $15, 100($2)

Value of register $2: 10 in CC 1-4, 10/–20 in CC 5, –20 in CC 6-9

(66)

Data Hazards Revisited… (cont.)

• Data hazards can be classified into 3 types, depending on the order of read and write accesses in the instructions.
• Consider two instructions i and j, with i occurring before j
  – RAW (read after write)
     • j tries to read a source before i writes it
     • So j incorrectly gets the old value
  – WAR (write after read)
     • j tries to write a destination before it is read by i
     • So i incorrectly gets the new value
     • WAR never happens in MIPS because all READs are early, in the ID stage, and all WRITEs are later, in the WB stage
     • WAR can occur when some instructions write results early in the pipeline (for example, auto-increment addressing) and other instructions read a source late in the pipeline
  – WAW (write after write)
     • j tries to write an operand before it is written by i
     • The writes end up performed in the wrong order, leaving the value written by i rather than the value written by j in the destination
     • MIPS pipeline writes a register only in WB stage and avoids WAW
     • WAW only occurs in pipelines that write in more than one pipeline stage, or allow an instruction to proceed even when a previous instruction is stalled
• Can RAR (read after read) be a data hazard?

(67)

Data Hazard Solution: Forwarding

• Key idea: connect data internally before it's stored

[Figure: pipeline diagram marking an EX hazard and a MEM hazard]

(68)

Data Hazard Solution: Forwarding

• Add hardware to feed back ALU and MEM results to both ALU inputs

(69)

Forwarding Unit

(70)

Controlling Forwarding

• Data hazard at “EX” stage: (EX Hazard)
  – EX/MEM - test whether the instruction in EX/MEM writes register file and examine rd register
  – ID/EX - test whether the instruction in ID/EX reads rs or rt register and matches rd register in EX/MEM
• Data hazard at “MEM” stage: (MEM Hazard)
  – MEM/WB - test whether the instruction in MEM/WB writes register file and examine rd (or rt) register
  – ID/EX - test whether the instruction in ID/EX reads rs or rt register and matches rd (or rt) register in MEM/WB

(71)

Forwarding Unit Detail - EX Hazard

if (EX/MEM.RegWrite
    and (EX/MEM.RegisterRd ≠ 0)
    and (EX/MEM.RegisterRd = ID/EX.RegisterRs))
        ForwardA = 10

if (EX/MEM.RegWrite
    and (EX/MEM.RegisterRd ≠ 0)
    and (EX/MEM.RegisterRd = ID/EX.RegisterRt))
        ForwardB = 10

(72)

Forwarding Unit Detail - MEM Hazard

if (MEM/WB.RegWrite
    and (MEM/WB.RegisterRd ≠ 0)
    and (MEM/WB.RegisterRd = ID/EX.RegisterRs))
        ForwardA = 01

if (MEM/WB.RegWrite
    and (MEM/WB.RegisterRd ≠ 0)
    and (MEM/WB.RegisterRd = ID/EX.RegisterRt))
        ForwardB = 01

(73)

MEM Hazard Complication

• One complication is potential data hazards between the result of the instruction in the WB stage, the result of the instruction in the MEM stage, and the source operand of the instruction in the ALU stage.
• Example: What if a register is changed more than once?
  – add $1, $1, $2;
  – add $1, $1, $3;
  – add $1, $1, $4;
• Answer: forward the most recent result (in MEM stage)

(74)

Forwarding Unit Detail - MEM Hazard Revised

if (MEM/WB.RegWrite
    and (MEM/WB.RegisterRd ≠ 0)
    and (EX/MEM.RegisterRd ≠ ID/EX.RegisterRs)
    and (MEM/WB.RegisterRd = ID/EX.RegisterRs))
        ForwardA = 01

if (MEM/WB.RegWrite
    and (MEM/WB.RegisterRd ≠ 0)
    and (EX/MEM.RegisterRd ≠ ID/EX.RegisterRt)
    and (MEM/WB.RegisterRd = ID/EX.RegisterRt))
        ForwardB = 01
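The EX and MEM hazard conditions can be combined into one function, sketched below. The argument names mirror the pipeline-register fields but are otherwise illustrative. One assumption differs slightly from the slide: the MEM hazard is suppressed whenever the full EX hazard condition holds (including EX/MEM.RegWrite and Rd ≠ 0), rather than on a register-number match alone.

```python
def forwarding_unit(id_ex_rs, id_ex_rt,
                    ex_mem_regwrite, ex_mem_rd,
                    mem_wb_regwrite, mem_wb_rd):
    """Return (ForwardA, ForwardB) as two-bit strings: '00', '10', or '01'."""
    def select(src):
        ex_hazard = ex_mem_regwrite and ex_mem_rd != 0 and ex_mem_rd == src
        mem_hazard = mem_wb_regwrite and mem_wb_rd != 0 and mem_wb_rd == src
        if ex_hazard:
            return "10"   # forward from EX/MEM: the most recent result wins
        if mem_hazard:
            return "01"   # forward from MEM/WB
        return "00"       # no hazard: use the value read from the register file
    return select(id_ex_rs), select(id_ex_rt)


if __name__ == "__main__":
    # Third instruction of: add $1,$1,$2; add $1,$1,$3; add $1,$1,$4
    print(forwarding_unit(1, 4, True, 1, True, 1))
```

For the three-add example, the third add takes $1 from EX/MEM even though MEM/WB also holds a value for $1.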

(75)

Hazard Detection Unit - Control Detail

if (ID/EX.MemRead
    and ((ID/EX.RegisterRt = IF/ID.RegisterRs)
         or (ID/EX.RegisterRt = IF/ID.RegisterRt)))
        stall

(76)

Pipelined Processor with Hazard Detection

[Figure: pipelined datapath with a forwarding unit and a hazard detection unit. The hazard detection unit reads ID/EX.MemRead, ID/EX.RegisterRt, IF/ID.RegisterRs, and IF/ID.RegisterRt and drives PCWrite, IF/IDWrite, and the multiplexer that zeros the control signals. The forwarding unit reads Rs, Rt, EX/MEM.RegisterRd, and MEM/WB.RegisterRd. This is how “stall” is implemented.]

(77)

Hazard Detection Unit

How “stall” is implemented

• MUX zeros out control signals for instruction in ID
  – “squashes” the instruction
  – “no-op” propagates through following stages
• IF/ID holds stalled instruction until next clock cycle
• PC holds current value until next clock cycle (re-loads first instruction)
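The three actions above can be tied to the load-use condition in a minimal sketch. The output names `PCWrite` and `IFIDWrite` follow the labels in the datapath figure; `ZeroControl` is a hypothetical name for the select line of the multiplexer that zeros the control signals.

```python
def hazard_detection_unit(id_ex_memread, id_ex_rt, if_id_rs, if_id_rt):
    """Return the stall-related outputs for the instruction currently in ID."""
    stall = bool(id_ex_memread) and id_ex_rt in (if_id_rs, if_id_rt)
    return {
        "PCWrite": not stall,     # PC holds its value during a stall
        "IFIDWrite": not stall,   # IF/ID keeps the stalled instruction
        "ZeroControl": stall,     # control signals zeroed: a no-op enters EX
    }


if __name__ == "__main__":
    # lw $2, ... is in EX while an instruction reading $2 is in ID
    print(hazard_detection_unit(True, 2, 2, 5))
```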

(78)

Control (Branch) Hazards

• Just stalling for each branch is not practical
• Common assumption: branch not taken
• When assumption fails: flush three instructions

– Note that the following figure does not assume the extra hardware to reduce the

(79)

Reducing Branch Delay

• Key idea: move branch logic to ID stage of pipeline
  – New adder calculates branch target (PC + 4 + (extend(IMM) << 2))
  – New hardware tests rs == rt immediately after register read
  – Add flush signal to squash instruction in IF/ID register
• Reduced penalty (1 cycle) when branch taken
• Example on the next slide: Figure 4.62, p. 320
  – Assume that branch is taken (i.e., $1==$3)
     • One bubble
  – i.e., one instruction is flushed

36 sub $10, $4, $8
40 beq $1, $3, 7 # PC-relative branch 40+4+7*4 = 72
44 and $12, $2, $5
...
72 lw $4, 50($7)
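The branch-target arithmetic in the example can be checked with a short sketch. The helper names are illustrative; the immediate is treated as a 16-bit two's-complement word offset.

```python
def sign_extend16(imm):
    """Interpret the low 16 bits of imm as a two's-complement value."""
    imm &= 0xFFFF
    return imm - 0x10000 if imm & 0x8000 else imm


def branch_target(pc, imm16):
    """Branch target = PC + 4 + (sign-extended immediate << 2)."""
    return pc + 4 + (sign_extend16(imm16) << 2)


if __name__ == "__main__":
    print(branch_target(40, 7))  # the beq at address 40 with offset 7
```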


(80)

A couple of details are ignored

(i) IF.Flush comes from control unit;

(ii) output of the equivalence check of rs and rt should be fed into control unit, which then

determines the branch control for the MUX in front of PC

(81)

Branch Prediction

• Key idea: instead of always assuming branch not taken, use a prediction based on previous history
  – Branch history table: a small memory
     • Indexed by lower bits of the address of the branch instruction
     • Using one bit to save the history of “what happened” on last execution
        – branch taken (‘1’)
        – branch not taken (‘0’)
  – Use history to make prediction

(82)

Branch Prediction

• Useful for program loops.
• A one-bit prediction scheme: a one-bit buffer carries a “history bit” that tells what happened on the last branch instruction
  – History bit = 1, branch was taken
  – History bit = 0, branch was not taken

[Figure: two-state diagram. State 0 (predict branch not taken) stays in 0 on not taken and moves to 1 on taken; state 1 (predict branch taken) stays in 1 on taken and moves to 0 on not taken.]

(83)

Branch Prediction

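[Figure: branch prediction hardware. Low-order bits of the PC are used as an index into a table holding addresses of recent branch instructions, target addresses, and history bit(s). A comparator (=) and the prediction logic select the next PC from PC+4 or the stored target.]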

(84)

Branch Prediction for a Loop

[Flowchart: a: I = 0; loop body (b, c): I = I + 1 and X = X + R(I); d: I – 10 = 0? (N: back to b, Y: on to e); e: Store X in memory]

Execution of instruction d:

Execution  Old        Next instr.           New        Prediction
seq.       hist. bit  Pred.  I    Act.      hist. bit
1          0          e      1    b         1          Bad
2          1          b      2    b         1          Good
3          1          b      3    b         1          Good
4          1          b      4    b         1          Good
5          1          b      5    b         1          Good
6          1          b      6    b         1          Good
7          1          b      7    b         1          Good
8          1          b      8    b         1          Good
9          1          b      9    b         1          Good
10         1          b      10   e         0          Bad

(85)

Prediction Accuracy

• One-bit predictor:
  – 2 errors out of 10 predictions
  – Prediction accuracy = 80%
• To improve prediction accuracy, use two-bit predictor:
  – A prediction must be wrong twice before it is changed

(86)

Two-Bit Prediction Buffer

• Implemented as a two-bit counter.
• Can improve correct prediction statistics.

[Figure: four-state diagram with states 00 and 01 (predict branch not taken) and 10 and 11 (predict branch taken); arcs labeled taken and not taken connect the states.]

(87)

Branch Prediction for a Loop

[Flowchart: 1: I = 0; loop body (2, 3): I = I + 1 and X = X + R(I); 4: I – 10 = 0? (N: back to 2, Y: on to 5); 5: Store X in memory]

Execution of instruction 4:

Execution  Old         Next instr.           New         Prediction
seq.       pred. buf.  Pred.  I    Act.      pred. buf.
1          10          2      1    2         11          Good
2          11          2      2    2         11          Good
3          11          2      3    2         11          Good
4          11          2      4    2         11          Good
5          11          2      5    2         11          Good
6          11          2      6    2         11          Good
7          11          2      7    2         11          Good
8          11          2      8    2         11          Good
9          11          2      9    2         11          Good
10         11          2      10   5         10          Bad
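Both loop tables can be reproduced with a minimal sketch. It assumes the two-bit buffer behaves as a saturating counter (taken increments, not taken decrements, states 10 and 11 predict taken), which matches the transitions in the table; the state diagram on the slide may differ in other transitions. Function names and initial states are illustrative.

```python
def one_bit_accuracy(outcomes, history=0):
    """Predict taken when the history bit is 1; the bit then records the actual outcome."""
    correct = 0
    for taken in outcomes:
        correct += (history == 1) == taken
        history = 1 if taken else 0
    return correct / len(outcomes)


def two_bit_accuracy(outcomes, counter=0b10):
    """Predict taken when the counter is 10 or 11; saturating up/down counter (assumed)."""
    correct = 0
    for taken in outcomes:
        correct += (counter >= 0b10) == taken
        counter = min(0b11, counter + 1) if taken else max(0b00, counter - 1)
    return correct / len(outcomes)


# Branch outcomes for the 10-iteration loop: back to the loop body 9 times, then exit
LOOP = [True] * 9 + [False]

if __name__ == "__main__":
    print(one_bit_accuracy(LOOP), two_bit_accuracy(LOOP))
```

Running the loop a second time (`LOOP * 2`) keeps the same rates: the one-bit scheme mispredicts at both the entry and the exit of every run, while the two-bit scheme mispredicts only at the exit.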

(88)

Performance Comparison

(89)

Single-Cycle Performance

• Assume
  – 200 ps for memory access
  – 100 ps for ALU operation
  – 50 ps for register file read or write
• Cycle time set according to longest instruction:
  lw ≡ IF + ID/RegRead + ALU + MEM + RegWrite = 200 + 50 + 100 + 200 + 50 = 600 ps
• Cycles Per Instruction (CPI) = 1
• Av. instruction execution time = clock cycle time = 600 ps

(90)

Multicycle Performance

• Consider SPECINT2000* instruction mix:

Instruction   Frequency   Cycles
lw            25%         5
sw            10%         4
branch        11%         3
jump           2%         3
ALU instr.    52%         4

• Av. CPI = 0.25×5 + 0.10×4 + 0.11×3 + 0.02×3 + 0.52×4 = 4.12
• Clock cycle time determined from longest operation (memory access) = 200 ps
• Av. instruction execution time = 4.12×200 = 824 ps

(91)

Pipeline Performance

• Neglect initial latency (reasonable for long programs).
• One instruction completed every clock cycle unless delayed by hazard. Average CPI:
  – lw: 2 cycles in 50% of cases due to hazard → 1.5 cycles
  – sw: 1 cycle
  – ALU: 1 cycle
  – branch: 2 cycles in 25% of cases due to hazard → 1.25 cycles
  – jump: 2 cycles
• For SPECINT2000
  Av. CPI = 0.25×1.5 + 0.10×1 + 0.11×1.25 + 0.02×2.0 + 0.52×1 = 1.17
• Clock cycle time (longest operation: memory access) = 200 ps
• Av. instruction execution time = 1.17×200 = 234 ps

(92)

Comparing Alternatives

Type of datapath and control   Clock cycle time   Average CPI   Av. instruction execution time
Single-cycle                   600 ps             1.00          600 ps
Multicycle                     200 ps             4.12          824 ps
Pipelined                      200 ps             1.17          234 ps
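The table entries follow from the instruction mix and the per-class cycle counts given on the preceding slides, as this sketch shows. The dictionary and function names are illustrative.

```python
MIX = {"lw": 0.25, "sw": 0.10, "branch": 0.11, "jump": 0.02, "alu": 0.52}
MULTICYCLE_CYCLES = {"lw": 5, "sw": 4, "branch": 3, "jump": 3, "alu": 4}
PIPELINED_CYCLES = {"lw": 1.5, "sw": 1, "branch": 1.25, "jump": 2, "alu": 1}


def average_cpi(mix, cycles):
    """Weighted average of cycles per instruction over the instruction mix."""
    return sum(mix[k] * cycles[k] for k in mix)


def exec_time_ps(cpi, clock_ps):
    """Average instruction execution time = CPI x clock cycle time."""
    return cpi * clock_ps


if __name__ == "__main__":
    print("single-cycle", exec_time_ps(1.0, 600))
    for name, cycles in (("multicycle", MULTICYCLE_CYCLES),
                         ("pipelined", PIPELINED_CYCLES)):
        cpi = average_cpi(MIX, cycles)
        print(name, round(cpi, 2), round(exec_time_ps(round(cpi, 2), 200)))
```

The unrounded pipelined CPI is 1.1725; the slides round it to 1.17 before multiplying by 200 ps, which gives 234 ps.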

(93)

Exceptions

• A typical exception occurs when the ALU produces an overflow signal.
• Control asserts the following actions on exception:
  – Change the PC address to 4000 0040hex. This is the location of the exception routine. This is done by adding an additional input to the PC input multiplexer.
  – Overflow is detected in the EX cycle. Similar to data hazard and pipeline flush,
     • Set IF/ID to 0 (nop).
     • Generate ID.Flush and EX.Flush signals to set all control signals to 0 in ID/EX and EX/MEM registers. This also prevents the ALU result (presumed contaminated) from being written in the WB cycle.

(94)
