ELEC 5200/6200
Computer Architecture and Design
Spring 2017
Lecture 5: Pipelining
2/20/2017 ELEC 5200-001/6200-001 Lecture 5 1
Ujjwal Guin, Assistant Professor
Department of Electrical and Computer Engineering Auburn University, Auburn, AL 36849
http://www.auburn.edu/~uzg0005/
Adapted from Dr. Chen-Huan Chiang (Intel) and Prof. Vishwani D. Agrawal (Auburn University) [Adapted from Computer Organization and Design, Patterson & Hennessy, 2014]
ILP: Instruction Level Parallelism
Single-cycle and multi-cycle datapaths execute one instruction at a time.
How can we get better performance?
Answer: Execute multiple instructions at the same time.
– Pipelining – Enhance a multi-cycle datapath to fetch one instruction every cycle.
– Parallelism – Fetch multiple instructions every cycle.
Automobile Team Assembly
1 car assembled every four hours
6 cars per day
180 cars per month
2,160 cars per year
(Figure: one team performs four tasks of 1 hour each on one car at a time.)
Automobile Assembly Line
Task 1 (1 hour), Task 2 (1 hour), Task 3 (1 hour), Task 4 (1 hour)
First car assembled in 4 hours (pipeline latency)
1 car completed per hour thereafter
21 cars on first day, thereafter 24 cars per day
717 cars per month
8,637 cars per year
What gives 4X increase?
Throughput: Team Assembly
(Figure: on a time axis, the red car passes through Mechanical, Electrical, Painting, Testing; the blue car is started only after the red car is completed.)
Time of assembling one car = n hours
where n is the number of nearly equal subtasks, each requiring 1 unit of time
Throughput = 1/n cars per unit time
Throughput: Assembly Line
Time to complete first car = n time units (latency)
Cars completed in time T = T – n + 1
Throughput = (T – n + 1)/T = 1 – (n – 1)/T cars per unit time

$$\frac{\text{Throughput (assembly line)}}{\text{Throughput (team assembly)}} = \frac{1 - (n-1)/T}{1/n} = n - \frac{n(n-1)}{T} \rightarrow n \quad \text{as } T \rightarrow \infty$$

(Figure: Car 1 to Car 4 each pass through Mechanical, Electrical, Painting, Testing, each car starting one time unit after the previous one; the completions of Car 1 and Car 2 are marked on the time axis.)
Key idea: overlap execution
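The throughput formulas above can be checked numerically. The sketch below is only an illustration; the function names are made up for this example, and the 30-day month and 360-day year are assumptions chosen because they reproduce the car counts quoted on the assembly-line slide.

```python
def cars_completed(n, T):
    """Cars finished by an n-stage assembly line after T time units."""
    # The first car needs n units (latency); one more car finishes every unit after that.
    return max(T - n + 1, 0)

def line_throughput(n, T):
    """Assembly-line throughput in cars per unit time."""
    return cars_completed(n, T) / T

def team_throughput(n):
    """Team-assembly throughput: one car every n units."""
    return 1 / n

n = 4  # four 1-hour tasks
for hours in (24, 720, 8640):  # 1 day, 30 days, 360 days (assumed calendar)
    ratio = line_throughput(n, hours) / team_throughput(n)
    print(hours, cars_completed(n, hours), round(ratio, 3))
```

The ratio approaches n = 4 as T grows, which is the "4X increase" asked about earlier.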
Some Features of Assembly Line
Task 1: Mechanical (1 hour), Task 2: Electrical (1 hour), Task 3: Painting (1 hour), Task 4: Testing (1 hour)
Electrical parts delivered (JIT)
Defect found: stall assembly line to fix the cause of defect
3 cars in the assembly line are suspects, to be removed (flush pipeline)
Pros and Cons
Advantages:
Efficient use of labor.
Specialists can do better job.
Just in time (JIT) methodology eliminates warehouse cost.
Disadvantages:
Penalty of defect latency.
Lack of flexibility in production.
Assembly line work is monotonous and boring.
https://www.youtube.com/watch?v=IjarLbD9r30
https://www.youtube.com/watch?v=ANXGJe6i3G8
https://www.youtube.com/watch?v=5lp4EbfPAtI
Pipelining a Digital System
Key idea: break big computation up into pieces
Separate each piece with a pipeline register
(Figure: a 1 ns computation split into five 200 ps pieces separated by pipeline registers.)
Pipelining a Digital System
Why do this? Because it's faster for repeated computations
Non-pipelined: 1 operation finishes every 1 ns
Pipelined: 1 operation finishes every 200 ps
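A small calculation makes the "repeated computations" point concrete. This is a sketch under the slide's numbers (1 ns split into five 200 ps stages); it assumes the pipeline registers add no delay, and the function names are illustrative only.

```python
def non_pipelined_time_ps(num_ops, op_time_ps=1000):
    """Total time when each operation must finish before the next starts."""
    return num_ops * op_time_ps

def pipelined_time_ps(num_ops, stages=5, stage_time_ps=200):
    """Total time when a new operation enters the pipeline every stage time."""
    if num_ops == 0:
        return 0
    # The first result appears after `stages` cycles, then one result per cycle.
    return (stages + num_ops - 1) * stage_time_ps

for n in (1, 10, 1000):
    speedup = non_pipelined_time_ps(n) / pipelined_time_ps(n)
    print(n, non_pipelined_time_ps(n), pipelined_time_ps(n), round(speedup, 2))
```

A single operation gains nothing (same 1 ns latency), while a long run of operations approaches a speedup of 5.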
Pipelining a Processor
Recall the 5 steps in instruction execution:
1. Instruction Fetch (IF)
2. Instruction Decode and Register Read (ID)
3. Execution operation or calculate address (ALU or EX)
4. Memory access (MEM)
5. Write result into register (WB)
Review: Single-Cycle Processor
– All 5 steps done in a single clock cycle
– Dedicated hardware required for each step
What happens if we break execution into multiple cycles, and add extra hardware?
– Recall that in Multi-cycle, datapath hardware differs from single-cycle
Review - Single-Cycle Processor
(Figure: single-cycle datapath with PC, instruction memory, register file, sign extension, ALU and data memory, divided into IF Instruction Fetch, ID Instruction Decode, EX Execute/Address Calc., MEM Memory Access, WB Write Back.)
Pipelining - Key Idea
Question: What happens if we break execution into multiple cycles, and add the extra hardware?
Answer: in the best case, we can start executing a new instruction on each clock cycle. This is pipelining.
Pipelining stages:
– IF - Instruction Fetch
– ID - Instruction Decode
– EX - Execute / Address Calculation
– MEM - Memory Access (read / write)
– WB - Write Back (results into register file)
Project Summary
A RISC CPU is to be designed in the VHDL modeling language, verified via the Mentor Graphics "ModelSim" or Aldec “Active-HDL” simulator, and implemented on the Altera DE2 FPGA board using Altera’s Quartus II software.
The project consists of six parts. Due dates will be listed above as the semester progresses. You should read the problem definitions of all six parts before actually starting with Part 1, i.e., Instruction Set Architecture (ISA).
Please submit only the List Format (do not submit wave format) of the simulation results in part 3, part 4, and part 5. Always annotate your simulation results.
Maintain a single folder for submitting the project parts. When submitting a later part, all the previous parts need to be in the folder.
Instruction Set Architecture Classes
(Figure: processor, ALU and memory organization for four ISA classes: a) Stack, b) Accumulator, c) Register-Memory, d) Register-Register.)
Basic Pipelined Processor
(Figure: the single-cycle datapath with pipeline registers IF/ID, ID/EX, EX/MEM and MEM/WB inserted between the stages.)
Single-Cycle vs. Pipelined Execution
(Figure: timing of lw $1, 100($0); lw $2, 200($0); lw $3, 300($0), each passing through Instruction Fetch, REG RD, ALU, MEM, REG WR.)
Non-pipelined: a new instruction starts every 800 ps
Pipelined: a new instruction starts every 200 ps
Note: REG RD is at the end of a stage but REG WR is at the beginning of a stage
Single-Cycle vs. Pipelined Execution (cont.)
Time taken in pipeline stages is limited by the slowest operation
– Either ALU operation or Memory access
Time taken in ALU stage (i.e. EX) is used as pipeline clock cycle in the following discussion
If most memory access is cache access, MEM < ALU
Assumptions (Fig 4.27 on p.276)
– Write to the register/memory occurs in the first half of the clock cycle
– Read from register/memory occurs in the second half of the clock cycle
– If no such assumption, Cycle 5 of the following example will have issues
Executing Multiple Instructions
Clock Cycle 5, where the register file is used for 2 instructions at their different stages (ID and WB)
– How to design such an assumption?
(Figure: pipelined execution of lw $1, 100($0); lw $2, 200($0); lw $3, 300($0), each starting 200 ps after the previous one.)
Comments about Pipelining
The good news
– Multiple instructions are being processed at the same time
– This works because stages are isolated by registers
– Best case speedup of #Stages
The bad news
– Instructions interfere with each other - Hazards
Different instructions may need the same piece of hardware (e.g., memory) in the same clock cycle --- Structural Hazard
The instruction to fetch next (IF) is not known until the branch instruction reaches EX --- Control Hazard
An instruction may require a result produced by an earlier instruction that is not yet complete --- Data Hazard
– Worst case: Must suspend execution - Stall
Example - Executing Multiple Instructions
Consider the following instruction sequence
lw $r0, 10($r1)
sw $r3, 20($r4)
add $r5, $r6, $r7
sub $r8, $r9, $r10
Executing Multiple Instructions
Clock Cycle   IF    ID    EX    MEM   WB
1             LW
2             SW    LW
3             ADD   SW    LW
4             SUB   ADD   SW    LW
5                   SUB   ADD   SW    LW
6                         SUB   ADD   SW
7                               SUB   ADD
8                                     SUB
Compact View
                    CC 1  CC 2  CC 3  CC 4  CC 5  CC 6  CC 7  CC 8
lw $r0, 10($r1)     IM    REG   ALU   DM    REG
sw $r3, 20($r4)           IM    REG   ALU   DM    REG
add $r5, $r6, $r7               IM    REG   ALU   DM    REG
sub $r8, $r9, $r10                    IM    REG   ALU   DM    REG
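The compact view can be generated mechanically: instruction k (counting from 0) occupies stage s in clock cycle k + s + 1. The sketch below assumes an ideal pipeline with no hazards or stalls; the helper names are invented for this illustration.

```python
STAGES = ["IM", "REG", "ALU", "DM", "REG"]

def pipeline_schedule(instructions):
    """Return one {clock cycle: stage name} dict per instruction (ideal pipeline)."""
    return [
        {index + 1 + s: name for s, name in enumerate(STAGES)}
        for index, _ in enumerate(instructions)
    ]

def render(instructions):
    """Format the schedule as a text table, one row per instruction."""
    schedule = pipeline_schedule(instructions)
    last_cycle = len(instructions) + len(STAGES) - 1
    width = max(len(text) for text in instructions)
    header = " " * width + "".join(f" CC{c:<3}" for c in range(1, last_cycle + 1))
    rows = [header.rstrip()]
    for text, stages in zip(instructions, schedule):
        cells = "".join(f" {stages.get(c, ''):<5}" for c in range(1, last_cycle + 1))
        rows.append((text.ljust(width) + cells).rstrip())
    return "\n".join(rows)

program = [
    "lw $r0, 10($r1)",
    "sw $r3, 20($r4)",
    "add $r5, $r6, $r7",
    "sub $r8, $r9, $r10",
]
print(render(program))
```

Four instructions on five stages finish in 4 + 5 - 1 = 8 cycles, matching CC 1 through CC 8.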
Pipeline Hazards
Where one instruction cannot immediately follow another
Types of hazards
– Structural hazards - attempt to use same resource twice
– Control hazards - attempt to make decision before condition is evaluated
– Data hazards - attempt to use data before it is ready
We can always resolve hazards by waiting
– i.e. stall
Structural Hazards
Attempt to use same resource twice at same time
Example: A Single Memory for both instructions and data
– Accessed by IF stage
– Accessed at same time by MEM stage
Solutions
– Delay second access by one clock cycle, OR
– Provide separate memories for instructions and data (IM and DM)
This is what MIPS does
Recall “Harvard Architecture”
Real pipelined processors have separate caches
Structural Hazard - Single Memory
(Figure: four instructions, each IF ID EX MEM WB, starting in successive cycles; a memory conflict occurs where the MEM stage of the first instruction and the IF stage of the fourth fall in the same cycle.)
Control Hazards
Attempt to make a decision before condition is evaluated
Example: beq $s0, $s1, offset
– Must begin fetching the instruction following the branch on the very next clock cycle
– But the pipeline does not know what is the next instruction since it only just received the branch instruction from memory
– Possible solutions: Stall, predict, or delayed decision
If we add hardware to second stage to:
– Compare fetched registers for equality
– Compute branch target and update PC
– This allows branch to be taken at end of second clock cycle
May not be possible for longer pipelines, since the branch may not be resolved in the 2nd stage, giving a larger slowdown
– Must make sure that the additional hardware does not increase pipeline clock cycle.
Control Hazard Solutions
Stall - Stop loading instructions until result is available
Predict - Assume an outcome and continue fetching (undo if prediction is wrong)
– Always assuming branch untaken
– Or assuming half of the branches taken and half untaken
Delayed branch (used in MIPS)
– Always executes the next SAFE instruction in the sequence
a safe instruction is an instruction which is not affected by the branch
– MIPS software will place such a safe instruction immediately after the delayed branch
This step is hidden from the MIPS assembly programmer
– If the branch is taken, it changes the address of the instruction that follows the safe instruction
Control Hazard – Stall
All following discussions assume the extra hardware at the 2nd stage
(Figure: add $r4,$r5,$r6; beq $r0,$r1,tgt; a STALL that inserts one bubble; sw $s4,200($t5). Labels: “beq writes PC here with the extra hardware” and “new PC used here”.)
Control Hazard - Correct Prediction
Fetch assuming branch taken
(Figure: add $r4,$r5,$r6; beq $r0,$r1,tgt; tgt: sw $s4,200($t5), executed back to back with no bubbles.)
Control Hazard - Incorrect Prediction
(Figure: add $r4,$r5,$r6; beq $r0,$r1,tgt; or $r8,$r8,$r9 is fetched and then “squashed” into a bubble (incorrect prediction - STALL); tgt: sw $s4,200($t5) follows.)
Control Hazard - Delayed Branch
(Figure: add $r4,$r5,$r6; beq $r0,$r1,tgt; branch slot: and $r6,$r6,$r7, which always executes; tgt: sw $s4,200($t5). Label: “correct PC avail. here”.)
Or re-arrange the code to execute the previous “add” in the branch slot
Summary - Control Hazard Solutions
Stall - stop fetching instruction until result is available
– Significant performance penalty
– Hardware required to stall
Predict - assume an outcome and continue fetching (undo if prediction is wrong)
– Performance penalty only when guess wrong
– Hardware required to "squash" instructions
Delayed branch - specify in architecture that following instruction is always executed
– Compiler re-orders instructions into delay slot
– Insert "NOP" (no-op) operations when can't use (~50%)
– This is how original MIPS worked
Example: Delayed branch
Loop: lw $8, 100($7)
addi $7, $7, 4
beq $7, $4, Loop
addi is not a “safe” instruction to be placed at the branch slot (i.e. the instruction after beq)
– Because of the dependence on $7 between addi and beq.
lw seems a safe instruction candidate but its location does not allow it to be moved to the branch slot
– Because “addi $7, $7, 4” is after “lw $8, 100($7)”; i.e., if lw is moved to the branch slot, the value of $7 is off by 4.
Example: delayed branch (cont.)
Changes made to the MIPS code
– Swapping addi and lw location
– Changing offset from 100 to 100-4=96
In order to keep the results of the two programs identical
– The address used by lw at its new location must equal the one computed from the value of $7 prior to “addi $7,$7,4”
Loop: addi $7, $7, 4
lw $8, 96($7)
beq $7, $4, Loop
After the above swapping and changing of the offset, lw
can be safely moved to the delay slot
Loop: addi $7, $7, 4
beq $7, $4, Loop
lw $8, 96($7) # delay slot
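To see that the rewritten loop touches the same memory addresses, the two versions can be modelled in a few lines. This is a hand-written behavioural sketch, not a MIPS simulator: each function mirrors one of the listings above, records the effective address of every lw, and treats beq exactly as written (loop again only when $7 equals $4). The function names and initial register values are assumptions for the example.

```python
def original_loop(r7, r4):
    """Loop: lw $8,100($7); addi $7,$7,4; beq $7,$4,Loop (no delay slot)."""
    load_addresses = []
    while True:
        load_addresses.append(r7 + 100)  # lw $8, 100($7)
        r7 += 4                          # addi $7, $7, 4
        if r7 != r4:                     # beq falls through when $7 != $4
            return load_addresses, r7

def delay_slot_loop(r7, r4):
    """Loop: addi $7,$7,4; beq $7,$4,Loop; lw $8,96($7) in the delay slot."""
    load_addresses = []
    while True:
        r7 += 4                          # addi $7, $7, 4
        taken = (r7 == r4)               # beq $7, $4, Loop decides here
        load_addresses.append(r7 + 96)   # delay-slot lw always executes
        if not taken:
            return load_addresses, r7

print(original_loop(0, 4))
print(delay_slot_loop(0, 4))
```

Both versions load from the same addresses in the same order and leave $7 with the same value, which is the condition the offset change from 100 to 96 is meant to preserve.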
Data Hazards
Attempt to use data before it is ready
Solutions
– Stalling - wait until result is available
– Forwarding (Bypassing) - make data available inside datapath
– Re-ordering instructions - use compiler to avoid hazards
Examples:
add $s0, $t0, $t1 ; $s0 = $t0+$t1
sub $t2, $s0, $t3 ; $t2 = $s0-$t3
lw $s0, 0($t0) ; $s0 = MEM[$t0]
sub $t2, $s0, $t3 ; $t2 = $s0-$t3
Data Hazard - Stalling
(Figure: add $s0,$t0,$t1 followed by sub $t2,$s0,$t3; two STALL bubbles delay sub so that $s0 is read (R s0) in the same cycle in which it is written (W s0) in WB.)
May need one more STALL, i.e. the 3rd, to be absolutely data hazard free, if such a register file cannot be designed
Data Hazards - Forwarding
Key idea: connect new value directly to next stage
Still read s0, but ignore in favor of new result
Forwarding is valid only if the destination stage is later in time than the source stage
– Problem: what about load instructions?
If the “add” is replaced by “lw”, data won’t be available until after the MEM stage.
Data Hazards - Forwarding
STALL still required for LOAD instruction
– Because data available after MEM
MIPS architecture calls this delayed load; initial implementations required compiler to deal with this
(Figure: lw $s0,20($t1) followed by sub $t2,$s0,$t3; one STALL bubble, after which the new value of s0 is forwarded from the end of MEM to the EX stage of sub.)
Data Hazards - Reordering Instructions
What are the hazards in this code?
lw $t0, 0($t1)
lw $t2, 4($t1)
sw $t2, 0($t1)
sw $t0, 4($t1)
Data forwarding resolves the data hazards but still introduces a STALL
Reorder instructions to remove the hazard without any STALL when using data forwarding:
lw $t0, 0($t1)
lw $t2, 4($t1)
sw $t0, 4($t1)
sw $t2, 0($t1)
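The effect of reordering can be checked by counting load-use pairs: with forwarding, a stall is still needed whenever an instruction reads a register that the immediately preceding lw loads. The sketch below applies that rule; the tiny parser handles only the lw, sw and three-register forms used in these slides, and all names are illustrative assumptions.

```python
import re

def parse(instruction):
    """Return (opcode, destination register or None, set of source registers)."""
    opcode = instruction.split()[0]
    regs = re.findall(r"\$\w+", instruction)
    if opcode == "lw":
        return opcode, regs[0], {regs[1]}
    if opcode == "sw":
        return opcode, None, {regs[0], regs[1]}  # sw reads both registers
    return opcode, regs[0], set(regs[1:])        # e.g. add/sub rd, rs, rt

def count_load_use_stalls(program):
    """Count instructions that read the result of the lw directly before them."""
    stalls = 0
    for previous, current in zip(program, program[1:]):
        prev_op, prev_dest, _ = parse(previous)
        _, _, sources = parse(current)
        if prev_op == "lw" and prev_dest in sources:
            stalls += 1
    return stalls

original = ["lw $t0, 0($t1)", "lw $t2, 4($t1)", "sw $t2, 0($t1)", "sw $t0, 4($t1)"]
reordered = ["lw $t0, 0($t1)", "lw $t2, 4($t1)", "sw $t0, 4($t1)", "sw $t2, 0($t1)"]
print(count_load_use_stalls(original), count_load_use_stalls(reordered))
```

The original order has one load-use pair (lw $t2 followed by sw $t2); swapping the two stores removes it.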
Summary - Pipelining Overview
Pipelining increases throughput (but does not reduce latency)
Hazards limit performance
– Structural hazards
– Control hazards
– Data hazards
Summary: Hazards
Structural hazards
– Cause: resource conflict
– Remedies: (i) hardware resources, (ii) stall (bubble)
Data hazards
– Cause: data unavailability
– Remedies: (i) forwarding, (ii) stall (bubble), (iii) code reordering
Control hazards
– Cause: out-of-sequence execution (branch or jump)
– Remedies: (i) stall (bubble), (ii) branch prediction/pipeline flush, (iii) delayed branch/pipeline flush
Control Unit for Pipelined MIPS
Single-Cycle Control Logic
Inputs: instruction type and opcode (instruction bits 31 30 29 28 27 26). Outputs: control signals.
Instr.  Opcode        RegDst  ALUSrc  MemtoReg  RegWrite  MemRead  MemWrite  Branch  ALUOp1  ALUOp0  Jump
R       0 0 0 0 0 0   1       0       0         1         0        0         0       1       0       0
lw      1 0 0 0 1 1   0       1       1         1         1        0         0       0       0       0
sw      1 0 1 0 1 1   X       1       X         0         0        1         0       0       0       0
beq     0 0 0 1 0 0   X       0       X         0         0        0         1       0       1       0
J       0 0 0 0 1 0   X       X       X         0         X        0         X       X       X       1
Single-Cycle Control Circuit
(Figure: opcode bits Op5 to Op0 are decoded into the lines R, lw, sw, beq and J, which are combined to generate RegDst, ALUSrc, MemtoReg, RegWrite, MemRead, MemWrite, Branch, ALUOp1, ALUOp0 and Jump.)
ALU Control Logic
Inputs: ALUOp1, ALUOp0 from the control unit (CU) and the funct. code from IR (bits 0-5). Outputs to ALU: 3-bit code.
Instr. type  ALUOp1  ALUOp0  F5  F4  F3  F2  F1  F0  3-bit code  Operation
lw, sw       0       0       X   X   X   X   X   X   010         Add
B            0       1       X   X   X   X   X   X   110         Subtract
R            1       X       X   X   0   0   0   0   010         Add
R            1       X       X   X   0   0   1   0   110         Subtract
R            1       X       X   X   0   1   0   0   000         AND
R            1       X       X   X   0   1   0   1   001         OR
R            1       X       X   X   1   0   1   0   111         slt
ALU Control
(Figure: ALU with a 3-bit operation select from control and outputs zero, result and overflow.)
Operation select  ALU function
000               AND
001               OR
010               Add
110               Subtract
111               Set on less than
(Figure: the ALU control block takes F3 F2 F1 F0 and ALUOp1, ALUOp0 from the control circuit.)
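The ALU control truth table can be written as a small lookup. The sketch below is one possible software rendering of that table, not a description of the actual gate-level circuit; `alu_control` and its argument encoding (ALUOp as a 2-bit integer, funct as the 6-bit function field) are assumptions made for this example.

```python
# Low four funct bits (F3 F2 F1 F0) -> 3-bit ALU operation select, R-type rows of the table.
R_TYPE_OPERATIONS = {
    0b0000: 0b010,  # add
    0b0010: 0b110,  # subtract
    0b0100: 0b000,  # AND
    0b0101: 0b001,  # OR
    0b1010: 0b111,  # set on less than
}

def alu_control(alu_op, funct):
    """Return the 3-bit ALU operation select for ALUOp (ALUOp1 ALUOp0) and funct."""
    if alu_op & 0b10:                  # ALUOp1 = 1: R-type, ALUOp0 is a don't care
        return R_TYPE_OPERATIONS[funct & 0b1111]  # F5 and F4 are don't cares
    if alu_op == 0b01:                 # branch: subtract to compare
        return 0b110
    return 0b010                       # lw/sw: add to form the address

print(format(alu_control(0b10, 0b101010), "03b"))  # slt funct code
```

Only the five funct patterns listed in the table are handled; any other R-type funct value raises a KeyError in this sketch.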
Returning to Pipelined Control
Opcode input to control is supplied by the pipeline register IF/ID in the ID (instruction decode) cycle.
Nine control signals are generated in the ID cycle, but none is used. They are saved in the pipeline register ID/EX.
ALUSrc, RegDst and ALUOp (2 bits) are used in the EX (execute) cycle. Remaining 5 control signals are saved in the pipeline register EX/MEM.
Branch, MemWrite and MemRead are used in the MEM (memory access) cycle. Remaining 2 control signals are saved in the pipeline register MEM/WB.
MemtoReg and RegWrite are used in the WB (write back) cycle.
Pipelined control is shown without Jump.
Pipelined Datapath with Control Signals
(Figure: pipelined datapath with pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB and the control signals PCSrc, RegWrite, ALUSrc, ALUOp, RegDst, Branch, MemRead, MemWrite and MemtoReg.)
Control
Basic approach:
– Based on single-cycle control
– Place control unit in ID stage
– Pass control signals to following stages
Later: extra features to deal with:
– Data forwarding
– Stalls
– Exceptions
Control for Pipelined Datapath
(Figure: Control in the ID stage generates three groups of signals. EX: RegDst, ALUOp[1:0], ALUSrc. M: MemRead, MemWrite, Branch. WB: RegWrite, MemtoReg. ID/EX holds EX, M and WB; EX/MEM holds M and WB; MEM/WB holds WB.)
Control for Pipelined Datapath
              Execution/Address Calculation     Memory access                Write-back
              stage control lines               stage control lines          stage control lines
Instruction   RegDst  ALUOp1  ALUOp0  ALUSrc    Branch  MemRead  MemWrite    RegWrite  MemtoReg
R-format      1       1       0       0         0       0        0           1         0
lw            0       0       0       1         0       1        0           1         1
sw            X       0       0       1         0       0        1           0         X
beq           X       0       1       0         1       0        0           0         X
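The way control signals travel with an instruction can be sketched as a dictionary that loses one group per stage. The values below are copied from the control table for the pipelined datapath (None stands for a don't care); the data layout and the `next_register` helper are assumptions made for illustration.

```python
X = None  # don't care

CONTROL = {
    "R-format": {"EX": {"RegDst": 1, "ALUOp1": 1, "ALUOp0": 0, "ALUSrc": 0},
                 "M": {"Branch": 0, "MemRead": 0, "MemWrite": 0},
                 "WB": {"RegWrite": 1, "MemtoReg": 0}},
    "lw":       {"EX": {"RegDst": 0, "ALUOp1": 0, "ALUOp0": 0, "ALUSrc": 1},
                 "M": {"Branch": 0, "MemRead": 1, "MemWrite": 0},
                 "WB": {"RegWrite": 1, "MemtoReg": 1}},
    "sw":       {"EX": {"RegDst": X, "ALUOp1": 0, "ALUOp0": 0, "ALUSrc": 1},
                 "M": {"Branch": 0, "MemRead": 0, "MemWrite": 1},
                 "WB": {"RegWrite": 0, "MemtoReg": X}},
    "beq":      {"EX": {"RegDst": X, "ALUOp1": 0, "ALUOp0": 1, "ALUSrc": 0},
                 "M": {"Branch": 1, "MemRead": 0, "MemWrite": 0},
                 "WB": {"RegWrite": 0, "MemtoReg": X}},
}

ORDER = ("EX", "M", "WB")

def next_register(groups):
    """Drop the group consumed in the current stage and pass the rest on."""
    used = next(name for name in ORDER if name in groups)
    return {name: signals for name, signals in groups.items() if name != used}

def signal_count(groups):
    return sum(len(signals) for signals in groups.values())

id_ex = CONTROL["lw"]           # written at the end of ID
ex_mem = next_register(id_ex)   # EX group used in EX
mem_wb = next_register(ex_mem)  # M group used in MEM
print(signal_count(id_ex), signal_count(ex_mem), signal_count(mem_wb))
```

The counts 9, 5 and 2 match the number of control signals saved in ID/EX, EX/MEM and MEM/WB.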
Datapath and Control Unit
Tracking Control Signals - Cycle 1
LW (IF)
Tracking Control Signals - Cycle 2
SW (IF), LW (ID)
Tracking Control Signals - Cycle 3
ADD (IF), SW (ID), LW (EX)
(Figure: pipelined datapath with the control values for LW entering the EX stage.)
Tracking Control Signals - Cycle 4
SUB (IF), ADD (ID), SW (EX), LW (MEM)
Tracking Control Signals - Cycle 5
SUB (ID), ADD (EX), SW (MEM), LW (WB)
Data Hazards Revisited…
Data hazards occur when data is used before it is stored
– RAW (read after write).
Program execution order (in instructions):
sub $2, $1, $3
and $12, $2, $5
or $13, $6, $2
add $14, $2, $2
sw $15, 100($2)
(Figure: pipeline diagram over CC 1 to CC 9, time in clock cycles. Value of register $2: 10 in CC 1 to CC 4, 10/–20 in CC 5, –20 in CC 6 to CC 9.)
Data Hazards Revisited… (cont.)
Data hazards can be classified into 3 types, depending on the order of read and write accesses in the instructions.
Consider two instructions i and j, with i occurring before j
– RAW (read after write)
j tries to read a source before i writes it
So j incorrectly gets the old value
– WAR (write after read)
j tries to write a destination before it is read by i
So i incorrectly gets the new value
WAR never happens in MIPS because all READs are early in ID stage and all WRITEs are later in WB stage
WAR can occur when some instructions write results early in the pipeline and other instructions read a source late in the pipeline, for example with auto-increment addressing
– WAW (write after write)
j tries to write an operand before it is written by i
The writes end up performed in the wrong order, leaving the value written by i rather than the value written by j in the destination
MIPS pipeline writes a register only in WB stage and avoids WAW
WAW only occurs in pipelines that write in more than one pipeline stage, or allow an instruction to proceed even when a previous instruction is stalled
Can RAR (read after read) be a data hazard?
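The three categories can be told apart by comparing the destination and source registers of the two instructions. The sketch below only detects the register overlaps that make each hazard type possible; whether a hazard actually occurs depends on the pipeline, as noted above for MIPS. The tuple representation and the function name are assumptions for this example.

```python
def classify(i, j):
    """Classify register overlaps between an earlier instruction i and a later j.

    Each instruction is (destination register or None, set of source registers).
    """
    i_dest, i_sources = i
    j_dest, j_sources = j
    kinds = set()
    if i_dest is not None and i_dest in j_sources:
        kinds.add("RAW")  # j reads what i writes
    if j_dest is not None and j_dest in i_sources:
        kinds.add("WAR")  # j writes what i reads
    if i_dest is not None and i_dest == j_dest:
        kinds.add("WAW")  # both write the same register
    return kinds

sub_instr = ("$2", {"$1", "$3"})    # sub $2, $1, $3
and_instr = ("$12", {"$2", "$5"})   # and $12, $2, $5
sw_instr = (None, {"$15", "$2"})    # sw $15, 100($2) writes no register
print(classify(sub_instr, and_instr))
print(classify(and_instr, sw_instr))
```

Two instructions that only read the same register produce an empty set here, because neither of them changes the value.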
Data Hazard Solution: Forwarding
Key idea: connect data internally before it's stored
EX Hazard
MEM Hazard
Data Hazard Solution: Forwarding
Add hardware to feed back ALU and MEM results to both ALU inputs
Forwarding Unit
Controlling Forwarding
Data hazard at “EX” stage: (EX Hazard)
– EX/MEM - test whether the instruction in EX/MEM writes register file and examine rd register
– ID/EX - test whether the instruction in ID/EX reads rs or rt register and matches rd register in EX/MEM
Data hazard at “MEM” stage: (MEM Hazard)
– MEM/WB - test whether the instruction in MEM/WB writes register file and examine rd (or rt) register
– ID/EX - test whether the instruction in ID/EX reads rs or rt register and matches rd (or rt) register in MEM/WB
Forwarding Unit Detail - EX Hazard
if (EX/MEM.RegWrite
and (EX/MEM.RegisterRd ≠ 0)
and (EX/MEM.RegisterRd = ID/EX.RegisterRs))
    ForwardA = 10
if (EX/MEM.RegWrite
and (EX/MEM.RegisterRd ≠ 0)
and (EX/MEM.RegisterRd = ID/EX.RegisterRt))
    ForwardB = 10
Forwarding Unit Detail - MEM Hazard
if (MEM/WB.RegWrite
and (MEM/WB.RegisterRd ≠ 0)
and (MEM/WB.RegisterRd = ID/EX.RegisterRs))
    ForwardA = 01
if (MEM/WB.RegWrite
and (MEM/WB.RegisterRd ≠ 0)
and (MEM/WB.RegisterRd = ID/EX.RegisterRt))
    ForwardB = 01
MEM Hazard Complication
One complication is potential data hazards between the result of the instruction in WB stage, the result of the instruction in MEM stage and the source operand of the instruction in ALU stage.
Example: What if a register is changed more than once?
– add $1, $1, $2;
– add $1, $1, $3;
– add $1, $1, $4;
Answer: forward most recent result (in MEM stage)
Forwarding Unit Detail - MEM Hazard Revised
if (MEM/WB.RegWrite
and (MEM/WB.RegisterRd ≠ 0)
and (EX/MEM.RegisterRd ≠ ID/EX.RegisterRs)
and (MEM/WB.RegisterRd = ID/EX.RegisterRs))
    ForwardA = 01
if (MEM/WB.RegWrite
and (MEM/WB.RegisterRd ≠ 0)
and (EX/MEM.RegisterRd ≠ ID/EX.RegisterRt)
and (MEM/WB.RegisterRd = ID/EX.RegisterRt))
    ForwardB = 01
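The EX and MEM hazard conditions can be combined into one selection function. The sketch below checks the EX hazard first and only then the MEM hazard, which gives the most recent result priority in the same spirit as the revised condition. The `PipeReg` class, the function name, and the use of "00" to mean "take the operand from the register file" are assumptions of this example and are not stated on the slides.

```python
from dataclasses import dataclass

@dataclass
class PipeReg:
    """The two fields of EX/MEM or MEM/WB that the forwarding unit inspects."""
    reg_write: bool = False
    rd: int = 0

def forwarding_unit(ex_mem, mem_wb, id_ex_rs, id_ex_rt):
    """Return (ForwardA, ForwardB) as two-character bit strings."""
    def select(source):
        # EX hazard: the newest result is in EX/MEM, so it is checked first.
        if ex_mem.reg_write and ex_mem.rd != 0 and ex_mem.rd == source:
            return "10"
        # MEM hazard: used only when EX/MEM does not supply this register.
        if mem_wb.reg_write and mem_wb.rd != 0 and mem_wb.rd == source:
            return "01"
        return "00"
    return select(id_ex_rs), select(id_ex_rt)

# Third instruction of: add $1,$1,$2; add $1,$1,$3; add $1,$1,$4
print(forwarding_unit(PipeReg(True, 1), PipeReg(True, 1), id_ex_rs=1, id_ex_rt=4))
```

For the triple-add example, operand A comes from EX/MEM ("10"), i.e. from the second add rather than the first.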
Hazard Detection Unit - Control Detail
if (ID/EX.MemRead
and ((ID/EX.RegisterRt = IF/ID.RegisterRs)
or (ID/EX.RegisterRt = IF/ID.RegisterRt)))
    stall
Pipelined Processor with Hazard Detection
(Figure: pipelined datapath with forwarding unit and hazard detection unit. The hazard detection unit takes ID/EX.MemRead, ID/EX.RegisterRt, IF/ID.RegisterRs and IF/ID.RegisterRt, and drives PCWrite, IF/IDWrite and the mux that zeros the control signals. This is how “stall” is implemented.)
Hazard Detection Unit
How “stall” is implemented
MUX zeros out control signals for instruction in ID
– “squashes” the instruction
– “no-op” propagates through following stages
IF/ID holds stalled instruction until next clock cycle
PC holds current value until next clock cycle (re-loads first instruction)
Control (Branch) Hazards
Just stalling for each branch is not practical
Common assumption: branch not taken
When assumption fails: flush three instructions
– Note that the following figure does not assume the extra hardware to reduce the
Reducing Branch Delay
Key idea: move branch logic to ID stage of pipeline
– New adder calculates branch target (PC + 4 + (extend(IMM) << 2))
– New hardware tests rs == rt immediately after register read
– Add flush signal to squash instruction in IF/ID register
Reduced penalty (1 cycle) when branch taken
Example on the next slide: Figure 4.62, p. 320
– Assume that branch is taken (i.e., $1==$3)
One bubble
– i.e., One instruction is flushed
36 sub $10, $4, $8
40 beq $1, $3, 7 # PC-relative branch 40+4+7*4 = 72
44 and $12, $2, $5
...
72 lw $4, 50($7)
A couple of details are ignored
(i) IF.Flush comes from control unit;
(ii) output of the equivalence check of rs and rt should be fed into control unit, which then
determines the branch control for the MUX in front of PC
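The branch target arithmetic from the example (40 + 4 + 7*4 = 72) can be written out as follows. This is a sketch of the address calculation only; the helper names are made up for this example, and the immediate is treated as a 16-bit two's-complement field.

```python
def sign_extend16(imm):
    """Interpret the low 16 bits of imm as a two's-complement value."""
    imm &= 0xFFFF
    return imm - 0x10000 if imm & 0x8000 else imm

def branch_target(pc, imm16):
    """PC-relative target: PC + 4 + (sign-extended immediate << 2)."""
    return pc + 4 + (sign_extend16(imm16) << 2)

print(branch_target(40, 7))       # beq at address 40 with offset 7
print(branch_target(40, 0xFFFF))  # offset -1 branches back to the beq itself
```

The immediate counts instructions relative to the instruction after the branch, which is why it is shifted left by 2 before being added to PC + 4.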
Branch Prediction
Key idea: instead of always assuming branch not taken, use a prediction based on previous history
– Branch history table: a small memory
Indexed by lower bits of the address of the branch instruction
Using one bit to save the history of “what happened” on last execution
– branch taken (‘1’)
– branch not taken (‘0’)
– Use history to make prediction
Branch Prediction
Useful for program loops.
A one-bit prediction scheme: a one-bit buffer carries a “history bit” that tells what happened on the last branch instruction
History bit = 1, branch was taken
History bit = 0, branch was not taken
(State diagram: state 0 predicts branch not taken, state 1 predicts branch taken; a taken branch leads to state 1 and a not-taken branch leads to state 0.)
Branch Prediction
(Figure: a table indexed by the low-order bits of the PC, with columns for recent branch addresses, target instructions and history bit(s); prediction logic selects between PC+4 and the target as the next PC.)
Branch Prediction for a Loop
Flowchart: a: I = 0; b: I = I + 1; c: X = X + R(I); d: I – 10 = 0? (Y: go to e, N: go to b); e: Store X in memory
Execution of Instruction d
Execution  Old hist.  Next instr.              New hist.  Prediction
seq.       bit        Pred.   I    Act.        bit
1          0          e       1    b           1          Bad
2          1          b       2    b           1          Good
3          1          b       3    b           1          Good
4          1          b       4    b           1          Good
5          1          b       5    b           1          Good
6          1          b       6    b           1          Good
7          1          b       7    b           1          Good
8          1          b       8    b           1          Good
9          1          b       9    b           1          Good
10         1          b       10   e           0          Bad
Prediction Accuracy
One-bit predictor:
2 errors out of 10 predictions
Prediction accuracy = 80%
To improve prediction accuracy, use two-bit predictor:
A prediction must be wrong twice before it is changed
Two-Bit Prediction Buffer
Implemented as a two-bit counter.
Can improve correct prediction statistics.
(State diagram: states 00 and 01 predict branch not taken; states 10 and 11 predict branch taken; transitions between states are labeled taken / not taken.)
Branch Prediction for a Loop
Flowchart: 1: I = 0; 2: I = I + 1; 3: X = X + R(I); 4: I – 10 = 0? (Y: go to 5, N: go to 2); 5: Store X in memory
Execution of Instruction 4
Execution  Old pred.  Next instr.              New pred.  Prediction
seq.       buf        Pred.   I    Act.        buf
1          10         2       1    2           11         Good
2          11         2       2    2           11         Good
3          11         2       3    2           11         Good
4          11         2       4    2           11         Good
5          11         2       5    2           11         Good
6          11         2       6    2           11         Good
7          11         2       7    2           11         Good
8          11         2       8    2           11         Good
9          11         2       9    2           11         Good
10         11         2       10   5           10         Bad
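Both loop tables can be reproduced with a few lines of simulation. The sketch below makes two assumptions that go beyond the slides: "taken" means the loop test sends control back to the top of the loop, and the two-bit buffer behaves as a saturating counter that predicts taken in states 10 and 11. The starting states (0 and 10) are the ones shown in the tables.

```python
def loop_outcomes(iterations=10):
    """Outcomes of the loop test: back to the loop body 9 times, then exit."""
    return [True] * (iterations - 1) + [False]

def one_bit_correct(outcomes, history=0):
    """Count correct predictions when the last outcome is the next prediction."""
    correct = 0
    for taken in outcomes:
        correct += int((history == 1) == taken)
        history = 1 if taken else 0
    return correct

def two_bit_correct(outcomes, state=0b10):
    """Count correct predictions for a two-bit saturating counter (assumed)."""
    correct = 0
    for taken in outcomes:
        correct += int((state >= 0b10) == taken)
        state = min(state + 1, 0b11) if taken else max(state - 1, 0b00)
    return correct

outcomes = loop_outcomes()
print(one_bit_correct(outcomes), two_bit_correct(outcomes))
```

For one pass through the 10-iteration loop this gives 8 and 9 correct predictions, i.e. 80% and 90% accuracy.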
Performance Comparison
Single-Cycle Performance
Assume
200 ps for memory access
100 ps for ALU operation
50 ps for register file read or write
Cycle time set according to longest instruction:
lw ≡ IF + ID/RegRead + ALU + MEM + RegWrite
= 200 + 50 +100 + 200 + 50 = 600 ps
Cycles Per Instruction (CPI) = 1
Av. instruction execution time = clock cycle time = 600 ps
Multicycle Performance
Consider SPECINT2000* instruction mix:
25% lw 5 cycles
10% sw 4 cycles
11% branch 3 cycles
2% jump 3 cycles
52% ALU instr. 4 cycles
Av. CPI = 0.25×5 + 0.10×4 + 0.11×3 + 0.02×3 + 0.52×4
= 4.12
Clock cycle time determined from longest operation (memory access) = 200 ps
Av. instruction execution time = 4.12×200 = 824 ps
Pipeline Performance
Neglect initial latency (reasonable for long programs).
One instruction completed every clock cycle unless delayed by hazard.
Average CPI:
lw: 2 cycles in 50% of cases due to hazard, 1.5 cycles on average
sw: 1 cycle
ALU: 1 cycle
branch: 2 cycles in 25% of cases due to hazard, 1.25 cycles on average
jump: 2 cycles
For SPECINT2000
Av. CPI = 0.25×1.5 + 0.10×1 + 0.11×1.25 + 0.02×2.0 + 0.52×1 = 1.17
Clock cycle time (longest operation: memory access) = 200 ps
Av. instruction execution time = 1.17×200 = 234 ps
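The three performance estimates can be recomputed from the instruction mix and the per-class cycle counts. The sketch below uses only numbers given on these slides; the dictionary layout and function name are illustrative assumptions.

```python
MIX = {"lw": 0.25, "sw": 0.10, "branch": 0.11, "jump": 0.02, "alu": 0.52}
MULTICYCLE_CYCLES = {"lw": 5, "sw": 4, "branch": 3, "jump": 3, "alu": 4}
PIPELINED_CYCLES = {"lw": 1.5, "sw": 1, "branch": 1.25, "jump": 2, "alu": 1}

def average_cpi(cycles):
    """Weighted average cycles per instruction for the given mix."""
    return sum(MIX[kind] * cycles[kind] for kind in MIX)

# Single-cycle: clock set by lw = IF + register read + ALU + MEM + register write.
single_cycle_ps = 200 + 50 + 100 + 200 + 50
multicycle_ps = average_cpi(MULTICYCLE_CYCLES) * 200
pipelined_ps = average_cpi(PIPELINED_CYCLES) * 200

print(single_cycle_ps)
print(round(average_cpi(MULTICYCLE_CYCLES), 2), round(multicycle_ps))
print(round(average_cpi(PIPELINED_CYCLES), 4), round(pipelined_ps, 1))
```

The unrounded pipelined CPI is 1.1725, so the product is 234.5 ps; the slide rounds the CPI to 1.17 before multiplying and reports 234 ps.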
Comparing Alternatives
Type of datapath and control   Clock cycle time   Average CPI   Av. instruction execution time
Single-cycle                   600 ps             1.00          600 ps
Multicycle                     200 ps             4.12          824 ps
Pipelined                      200 ps             1.17          234 ps
Exceptions
A typical exception occurs when the ALU produces an overflow signal.
Control asserts the following actions on exception:
– Change the PC address to 4000 0040hex. This is the location of the exception routine. This is done by adding an additional input to the PC input multiplexer.
– Overflow is detected in the EX cycle. Similar to data hazard and pipeline flush:
Set IF/ID to 0 (nop).
Generate ID.Flush and EX.Flush signals to set all control signals to 0 in ID/EX and EX/MEM registers. This also prevents the ALU result (presumed contaminated) from being written in the WB cycle.