ENEE446---Lectures-11/7/07 A. Yavuz Oruç
Professor, UMD, College Park
Copyright © 2007 A. Yavuz Oruç. All rights reserved.
Instruction Scheduling in Multi Function Pipelines
(1) Static Instruction Scheduling
In-Order or Compiler-Driven (Hardware)
Out-of-order Execution (Software)
Static Scheduling-In-order execution:
Fetch, decode and execute steps of instructions and operand
fetch and store steps are overlapped or pipelined in the order
they are written in the program.
Example:
MIPS Pipeline
IF ID EX ME WB          1st Instruction
   IF ID EX ME WB       2nd Instruction
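The overlap in the MIPS pipeline example can be sketched in a few lines of Python. This is only an illustration: the helper names `pipeline_diagram` and `total_cycles` are hypothetical, and the sketch assumes an ideal five-stage pipeline in which no instruction ever stalls.

```python
STAGES = ["IF", "ID", "EX", "ME", "WB"]  # stage names as used in the notes

def pipeline_diagram(n):
    """Return one text row per instruction; row i is shifted right by i stages."""
    return ["   " * i + " ".join(STAGES) for i in range(n)]

def total_cycles(n):
    """Cycles needed to finish n instructions in an ideal (stall-free) pipeline."""
    return len(STAGES) + n - 1

for row in pipeline_diagram(2):
    print(row)
print(total_cycles(2))
```

Each additional instruction adds one cycle rather than five, which is the benefit that the stalls discussed later erode.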
Static Scheduling-Out-of-order execution
The latencies apply when there is a data dependence between the
instructions.
Previous Instruction      Next Instruction          Execution Time   Latency (Stall)
FP ALU Instruction        FP ALU Instruction        4                3
FP ALU Instruction        SD double Instruction     4                2
LD double Instruction     FP ALU Instruction        1                1
LD double Instruction     SD double Instruction     1                0
Integer Instruction       Integer Instruction       1                0
Dependency Stalls
Based on the latencies specified in the table above:
LD   IF ID EX DM WB
        IF ID stall FP1 FP2
Out-of-order execution may reduce the number of stalls:
Assembly Code:
Loop:
//Load a double precision number from memory location [R1] into F0
L.D F0,0(R1) ; F0=vector element
//Add the double precision numbers in F0 and F2 and store the result in F4.
ADD.D F4,F0,F2 ;
//Store the double precision number in F4 into memory location [R1]
S.D 0(R1),F4 ;
//Decrement R1 by 8.
DSUBUI R1,R1,8 ;
//Branch back to the loop if R1 is not zero
BNEZ R1,Loop ; branch if R1!=zero
//No operation
With Stalls:
Loop:
1  L.D F0,0(R1) ;
2  Stall; (Because ADD.D needs F0)
3  ADD.D F4,F0,F2 ;
4  Stall; (Because S.D must wait on ADD.D for F4)
5  Stall;
6  S.D 0(R1),F4 ;
7  DSUBUI R1,R1,8 ;
8  Stall; (Because BNEZ must wait on DSUBUI for R1)
9  BNEZ R1,Loop ;
10 Stall; Delayed branch slot (To determine the branch direction)
Reorder to Reduce the Stalls (Compiler generated)
1 Loop: L.D F0,0(R1)
2 Stall; (ADD.D needs F0)
3 ADD.D F4,F0,F2;
4 DSUBUI R1,R1,8;
5 Stall; (BNEZ must wait on DSUBUI for R1)
6 BNEZ R1,Loop; delayed branch (the instruction following it will always be executed!)
7 S.D 8(R1),F4 ; store address is adjusted since R1 is changed.
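The cycle counts of the stalled and reordered loops can be reproduced with a small in-order issue model. This is a sketch under stated assumptions, not the method used to produce the listings: it assumes one instruction issues per cycle, takes the stall values from the latency table, adds a one-cycle integer-to-branch latency inferred from the stalled listing (the table does not give it), charges one extra cycle for an unfilled branch delay slot, and ignores memory addresses. The names `LATENCY`, `issue_cycles`, and `loop_cycles` are hypothetical.

```python
# Stall cycles between a producer class and a dependent consumer class.
LATENCY = {
    ("fp_alu", "fp_alu"): 3,
    ("fp_alu", "store"): 2,
    ("load", "fp_alu"): 1,
    ("load", "store"): 0,
    ("int", "int"): 0,
    ("int", "branch"): 1,  # assumption: inferred from the stalled listing
}

def issue_cycles(instrs):
    """In-order, single-issue: return the cycle in which each instruction issues."""
    cycles, last_writer = [], {}
    for i, (cls, dest, srcs) in enumerate(instrs):
        cycle = cycles[-1] + 1 if cycles else 1
        for reg in srcs:
            if reg in last_writer:
                p = last_writer[reg]
                stall = LATENCY.get((instrs[p][0], cls), 0)
                cycle = max(cycle, cycles[p] + 1 + stall)
        cycles.append(cycle)
        if dest is not None:
            last_writer[dest] = i
    return cycles

def loop_cycles(instrs):
    """Cycles per iteration; an unfilled branch delay slot costs one more cycle."""
    total = issue_cycles(instrs)[-1]
    return total + 1 if instrs[-1][0] == "branch" else total

original = [
    ("load",   "F0", ["R1"]),        # L.D    F0,0(R1)
    ("fp_alu", "F4", ["F0", "F2"]),  # ADD.D  F4,F0,F2
    ("store",  None, ["F4", "R1"]),  # S.D    0(R1),F4
    ("int",    "R1", ["R1"]),        # DSUBUI R1,R1,8
    ("branch", None, ["R1"]),        # BNEZ   R1,Loop
]
# Compiler reordering: DSUBUI moves up, S.D moves into the branch delay slot.
reordered = [original[0], original[1], original[3], original[4], original[2]]

print(loop_cycles(original), loop_cycles(reordered))
```

Under these assumptions the model gives 10 cycles for the original order and 7 for the reordered one, matching the two listings.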
Can the number of clock cycles be reduced further?
Loop Unrolling Technique to reduce stalls:
Loop:
1 L.D F0,0(R1)       Stall;
2 ADD.D F4,F0,F2     Stall; Stall;
3 S.D 0(R1),F4 ;     //Drop DSUBUI & BNEZ
4 L.D F6,-8(R1)      Stall;
5 ADD.D F8,F6,F2     Stall; Stall;
6 S.D -8(R1),F8 ;    //Drop DSUBUI & BNEZ
7 L.D F10,-16(R1)    Stall;
8 ADD.D F12,F10,F2   Stall; Stall;
Number of clock cycles per iteration = 27/4 = 6.75.
Almost nothing is gained by unrolling the loop without also rescheduling it.
What if we unroll and move the loads to the beginning of the loop body?
14 cycles / 4 = 3.5 cycles per iteration, which is close to the 3
operations per iteration.
This is the bare minimum to load, add, and store the result, i.e.,
x[i] = x[i] + s;
The number of cycles should approach 3 as the number of times we
unroll increases.
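The trend toward 3 cycles can be checked with a little arithmetic. The sketch below is an assumption-laden model rather than a derivation from the notes: `unrolled_only` assumes each copy of the body costs 6 cycles (L.D plus 1 stall, ADD.D plus 2 stalls, S.D) and the loop overhead costs 3 (DSUBUI, 1 stall, BNEZ), which reproduces the 27 cycles quoted for four copies; `unrolled_and_scheduled` assumes the unroll factor k is large enough (k = 4 in the notes) that rescheduling hides every stall. Both function names are hypothetical.

```python
from fractions import Fraction

def unrolled_only(k):
    """Cycles per element when the body is copied k times but not rescheduled."""
    return Fraction(6 * k + 3, k)

def unrolled_and_scheduled(k):
    """Cycles per element assuming rescheduling hides every stall.

    3 instructions per element, plus DSUBUI and BNEZ once per unrolled iteration.
    """
    return Fraction(3 * k + 2, k)

for k in (4, 8, 16):
    print(k, float(unrolled_only(k)), float(unrolled_and_scheduled(k)))
```

The overhead term 2/k shrinks as k grows, which is why the scheduled version approaches 3 cycles per element while the unscheduled one stays near 6.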
Dynamic Scheduling (Out-of-Order Execution in Hardware)
Simplifies compiling
Key idea: Move the interdependent instructions as far apart as
possible without violating the integrity of the code.
Example:
LDD R1,R2;
ADD R1,R3;
SUB R0,R4;
can execute faster when it is reordered as
Scoreboarding Algorithm
(Copyright by John Kubiatowicz (http.cs.berkeley.edu/~kubitron). The notes (actual images) that follow are taken from http.cs.berkeley.edu/~kubitron with some changes.)
Key Steps:
1--Decode and issue instruction (In-order). Do not issue if there is
a structural hazard (resource conflicts)
2--Do not issue if a previously issued instruction has the same
destination register (i.e., stall when there is potential for WAW)
3--Read operands. Do not start execution until all operands are read
Hazards are controlled and avoided by the scoreboard by stalling
instructions that would cause structural and data hazards.
The scoreboard decides when a stalled instruction can read its
operands, when it can resume execution, and when it can write its results into its
target registers. Since instructions can execute out of order, WAR hazards are
possible unless an instruction that would overwrite a register is stalled until
earlier instructions have read that register.
Stalling instructions with WAW hazards ensures that out-of-order execution of instructions will not cause an incorrect writing of results into registers.
Stalling instructions until their operands become ready ensures that out-of-order execution of instructions will not lead to incorrect results, and also avoids potential deadlocks:
Stalling the issue of SUB until it gets both its operands avoids a potential deadlock: If both ADD and SUB are allowed to issue and execute with SUB ahead of ADD, then ADD must wait for SUB to complete execution as there is only one integer unit, and SUB must wait to read R1 and write back R2, leading to a deadlock.
Scoreboard Data Structure:
For each instruction we keep a table of entries:
Instruction status: (1) ID and Issue, (2) Operand Read, (3) Execute, (4) Write back. Indicates where the instruction is in the pipeline.
For each functional unit we keep a table of entries:
Functional unit status: Indicates the state of the functional unit (FU).
Busy: Yes or No
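Op: +, -, *, /, etc.
Fi: Destination register to which the functional unit writes its result.
Fj, Fk: Source registers from which the functional unit receives its operands.
Qj, Qk: Functional units outputting to source registers Fj, Fk.
Rj, Rk: Yes or No. Indicate whether Fj, Fk are ready and not yet read.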
Instruction status, Condition to Proceed, and Bookkeeping (When done):

Issue
  Condition to Proceed: FU is free (No structural hazards), No WAW hazard
  Bookkeeping (When done):
    Busy(FU) <- Yes;
    Op(FU) <- op; (From instr)
    Fi(FU) <- D; (From instr)
    Fj(FU) <- S1; (From instr)
    Fk(FU) <- S2; (From instr)
    Qj(FU) <- Result(S1); (functional unit that outputs to S1)
    Qk(FU) <- Result(S2); (functional unit that outputs to S2)
    if Qj(FU) = null, Rj <- Yes else Rj <- No;
    if Qk(FU) = null, Rk <- Yes else Rk <- No;
    Result(D) <- FU

Read operands
  Condition to Proceed: Rj(FU) and Rk(FU) are Yes
  Bookkeeping (When done): Rj <- No, Rk <- No (Clear for next read of operands)

Execution Complete
  Condition to Proceed: FU is done
  Bookkeeping (When done): -

Write result
  Condition to Proceed: No WAR hazard. There is a WAR hazard if, for some FUx,
    (Fj(FUx) = Fi(FU) AND Rj(FUx) = Yes) OR
    (Fk(FUx) = Fi(FU) AND Rk(FUx) = Yes)
  Bookkeeping (When done):
    For all f, if Qj(f) = FU then Rj(f) <- Yes; if Qk(f) = FU then Rk(f) <- Yes;
    Busy(FU) <- No;
    Result(Fi(FU)) <- Null; (Remove Fi from FU.)
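The scoreboard bookkeeping rules can be sketched as a small Python class. This is a minimal illustration, not the scoreboard of any particular machine: the class name `Scoreboard` and its method names are hypothetical, timing is not modeled (only the conditions and table updates are), and clearing Qj/Qk on write-back is a choice made for this sketch.

```python
class Scoreboard:
    """Toy scoreboard; field names (Busy, Fi, Fj, Fk, Qj, Qk, Rj, Rk) follow the notes."""

    def __init__(self, fu_names):
        self.fu = {name: dict(Busy=False, Op=None, Fi=None, Fj=None, Fk=None,
                              Qj=None, Qk=None, Rj=False, Rk=False)
                   for name in fu_names}
        self.result = {}  # register -> FU that will write it, i.e. Result(D)

    def can_issue(self, fu, dest):
        # FU is free (no structural hazard) and no pending writer of dest (no WAW).
        return not self.fu[fu]["Busy"] and self.result.get(dest) is None

    def issue(self, fu, op, dest, s1, s2):
        assert self.can_issue(fu, dest)
        qj, qk = self.result.get(s1), self.result.get(s2)
        self.fu[fu].update(Busy=True, Op=op, Fi=dest, Fj=s1, Fk=s2,
                           Qj=qj, Qk=qk, Rj=qj is None, Rk=qk is None)
        if dest is not None:
            self.result[dest] = fu

    def can_read(self, fu):
        entry = self.fu[fu]
        return entry["Busy"] and entry["Rj"] and entry["Rk"]

    def read_operands(self, fu):
        assert self.can_read(fu)
        self.fu[fu].update(Rj=False, Rk=False)  # clear for next read of operands

    def can_write(self, fu):
        # WAR check: another busy FU still has to read the register we would overwrite.
        fi = self.fu[fu]["Fi"]
        for other, e in self.fu.items():
            if other != fu and e["Busy"]:
                if (e["Fj"] == fi and e["Rj"]) or (e["Fk"] == fi and e["Rk"]):
                    return False
        return True

    def write_result(self, fu):
        assert self.can_write(fu)
        for e in self.fu.values():
            if e["Qj"] == fu:
                e["Qj"], e["Rj"] = None, True  # clearing Qj is this sketch's choice
            if e["Qk"] == fu:
                e["Qk"], e["Rk"] = None, True
        self.result[self.fu[fu]["Fi"]] = None
        self.fu[fu]["Busy"] = False

sb = Scoreboard(["Mult", "Add", "Div"])
sb.issue("Mult", "MUL.D", "F0", "F2", "F4")
sb.issue("Add", "ADD.D", "F6", "F0", "F8")  # RAW on F0: Add must wait for Mult
print(sb.can_read("Add"), sb.can_issue("Div", "F0"))
```

In the usage lines, the ADD.D cannot read its operands until the multiplier writes F0, and a third instruction targeting F0 cannot issue because of the WAW check.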
Tomasulo's Algorithm (www.cs.ucf.edu/courses/eel5708/slides/lecture_15_tomasulo.ppt)
The main idea is to use register renaming to avoid stalling instructions because of WAR and WAW hazards.
Distributed Computing: Control is distributed into the function units.
Register Renaming: Operands are held in buffers, called "reservation stations". Registers in instructions are replaced by values or pointers to reservation stations (RS).
Parallelism: More reservation stations than registers => more parallelism than compiler optimization.
Common Data Bus: Results flow to FUs from reservation stations over a Common Data Bus that broadcasts the results to all FUs => avoidance of RAW hazards (similar to data forwarding).
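The interplay of reservation stations and the Common Data Bus can be sketched as follows. This is a minimal illustration under assumptions: the class `ReservationStation` and the function `broadcast` are hypothetical names, the fields Vj, Vk, Qj, Qk follow the reservation-station fields listed later in these notes, and `None` stands in for the "0 => ready" convention used there.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ReservationStation:
    name: str
    op: Optional[str] = None
    Vj: Optional[float] = None  # operand values, valid when the matching Q is None
    Vk: Optional[float] = None
    Qj: Optional[str] = None    # name of the station that will produce operand j
    Qk: Optional[str] = None
    busy: bool = False

    def ready(self):
        """An instruction may execute once neither operand is still pending."""
        return self.busy and self.Qj is None and self.Qk is None

def broadcast(stations, tag, value):
    """Common Data Bus: every station waiting on `tag` captures `value`."""
    for rs in stations:
        if rs.Qj == tag:
            rs.Vj, rs.Qj = value, None
        if rs.Qk == tag:
            rs.Vk, rs.Qk = value, None

mult = ReservationStation("Mult1", op="MUL.D", Vj=2.0, Vk=3.0, busy=True)
add = ReservationStation("Add1", op="ADD.D", Qj="Mult1", Vk=1.0, busy=True)
print(add.ready())
broadcast([mult, add], "Mult1", mult.Vj * mult.Vk)
print(add.ready(), add.Vj)
```

The waiting station receives its operand directly from the bus, without going through the register file, which is the sense in which the bus resembles data forwarding.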
Example (Register renaming):
DIV.D F0,F2,F4;
ADD.D F6,F0,F8; ---> RAW with DIV.D
S.D F6,0(R1); ---> RAW with ADD.D
SUB.D F8,F10,F14; ---> WAR with ADD.D
MUL.D F6,F10,F8; ---> WAW with ADD.D, RAW with SUB.D, WAR with S.D
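The effect of renaming on this example can be checked with a short Python sketch. It is only an illustration: the functions `hazards` and `rename` and the fresh register names `P0`, `P1`, ... are hypothetical, each instruction is reduced to (opcode, destination, sources), and real Tomasulo hardware renames to reservation-station tags rather than to new register names.

```python
def hazards(instrs):
    """Return the set of (kind, earlier_index, later_index) register hazards."""
    found = set()
    for j, (_, dj, sj) in enumerate(instrs):
        for i in range(j):
            _, di, si = instrs[i]
            if di is not None and di in sj:
                found.add(("RAW", i, j))
            if dj is not None and dj in si:
                found.add(("WAR", i, j))
            if dj is not None and dj == di:
                found.add(("WAW", i, j))
    return found

def rename(instrs):
    """Give every destination a fresh name; sources follow the latest mapping."""
    mapping, out = {}, []
    for op, dest, srcs in instrs:
        new_srcs = tuple(mapping.get(r, r) for r in srcs)
        new_dest = None
        if dest is not None:
            new_dest = f"P{len(mapping)}"  # distinct because every dest here is new or remapped below
            new_dest = f"P{len(out)}"      # index-based name keeps names unique in all cases
            mapping[dest] = new_dest
        out.append((op, new_dest, new_srcs))
    return out

code = [
    ("DIV.D", "F0", ("F2", "F4")),
    ("ADD.D", "F6", ("F0", "F8")),
    ("S.D",   None, ("F6", "R1")),
    ("SUB.D", "F8", ("F10", "F14")),
    ("MUL.D", "F6", ("F10", "F8")),
]

print(sorted(hazards(code)))
print(sorted(hazards(rename(code))))
```

After renaming, only the RAW dependences remain; the WAR and WAW hazards were artifacts of reusing the names F6 and F8.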
Renaming can be done statically by a compiler, but static renaming has two limitations:
- The number of registers places an upper bound on the number of registers that can be renamed.
- The renaming of a register is limited to blocks of code between branches unless a sophisticated analysis of where branches might lead the program execution is carried out across branches.
Tomasulo Algorithm Steps:
1. Issue: Get the next instruction from the FIFO instruction queue. If a
matching reservation station is free (no structural hazard), control issues the instruction. If the operands are available, it sends them to the reservation station; otherwise, it keeps track of the functional units that will produce the operands and renames the registers in the instruction by pointing them to the reservation station buffers.
2. Execution: Operate on operands (EX). If both operands of an instruction waiting in a reservation station
are ready, execute the instruction; otherwise monitor the Common Data Bus for results. Waiting until all operands are ready avoids RAW hazards. (Potential structural hazard due to multiple issue of instructions: more than one reservation station may compete for execution on the attached FU.)
3. Write result: Complete execution (WB). Write on Common Data Bus to all
Each reservation station has seven fields:
1-Op: Operation to perform in the attached functional unit
2-Qj, 3-Qk: Reservation stations that will produce the source operands for the attached functional unit (operands to be used). Qj, Qk = 0 => ready or available in Vj, Vk.
4-Vj, 5-Vk: Values of source operands. (For each operand, either the V field or the Q field is valid: Vj or Qj, and Vk or Qk.)
6- A: Address of a memory operand or result.