Instruction Scheduling in Multi Function Pipelines

ENEE446---Lectures-11/7/07 A. Yavuz Oruç

Professor, UMD, College Park

(1) Static Instruction Scheduling

In-Order or Compiler-Driven (Hardware) Out-of-order Execution (Software)

(2) Dynamic Scheduling (Real-time Scheduling-Hardware)

Static Scheduling-In-order execution:

The fetch, decode, and execute steps of instructions, as well as the operand fetch and store steps, are overlapped (pipelined) in the order in which the instructions are written in the program.

Example: MIPS Pipeline

1st Instruction:  IF  ID  EX  ME  WB
2nd Instruction:      IF  ID  EX  ME  WB
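The overlap in the diagram above can be sketched in a few lines of Python. This is only an illustration, assuming an ideal five-stage pipeline with no stalls in which instruction i enters IF one cycle after instruction i-1; the names `STAGES`, `pipeline_diagram`, and `total_cycles` are made up for this sketch.

```python
# Stage names as used in the notes (ME = memory access).
STAGES = ["IF", "ID", "EX", "ME", "WB"]


def pipeline_diagram(num_instructions):
    """Return one text row per instruction; row i is shifted right by i cycles."""
    rows = []
    for i in range(num_instructions):
        cells = ["  "] * i + STAGES  # two blanks stand for an idle cycle
        rows.append(" ".join(cells))
    return rows


def total_cycles(num_instructions):
    """Cycles to finish n instructions in an ideal (stall-free) pipeline."""
    return len(STAGES) + num_instructions - 1


if __name__ == "__main__":
    for row in pipeline_diagram(2):
        print(row)
    print("cycles:", total_cycles(2))
```

Without pipelining, n instructions would need 5n cycles; with ideal overlap the count grows by only one cycle per extra instruction.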


Static Scheduling-Out-of-order execution

The latencies apply when there is a data dependence between the instructions.

Previous Instruction    Next Instruction        Execution Time   Latency (Stall)
FP ALU Instruction      FP ALU Instruction      4                3
FP ALU Instruction      ST double Instruction   4                2
LD double Instruction   FP ALU Instruction      1                1
LD double Instruction   ST double Instruction   1                0
Integer Instruction     Integer Instruction     1                0
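As a rough sketch, the latency table can be kept as a lookup keyed by the (previous, next) instruction pair. The helper `stalls_needed` and the idea of a "distance" between the two dependent instructions are assumptions introduced here for illustration; pairs that are not in the table are assumed to need no stall.

```python
# Stall cycles between a producing instruction and a dependent consumer,
# copied from the latency table in the notes.
LATENCY = {
    ("FP ALU", "FP ALU"): 3,
    ("FP ALU", "Store double"): 2,
    ("Load double", "FP ALU"): 1,
    ("Load double", "Store double"): 0,
    ("Integer", "Integer"): 0,
}


def stalls_needed(producer, consumer, distance=1):
    """Stalls to insert when `consumer` follows `producer` by `distance` slots.

    distance=1 means back to back; each independent instruction placed in
    between hides one cycle of latency. Unlisted pairs default to 0 (assumption).
    """
    latency = LATENCY.get((producer, consumer), 0)
    return max(0, latency - (distance - 1))


if __name__ == "__main__":
    print(stalls_needed("Load double", "FP ALU"))   # back-to-back load then add
    print(stalls_needed("FP ALU", "Store double"))  # back-to-back add then store
```

The `distance` argument is what a compiler exploits: moving unrelated instructions between a producer and its consumer shrinks the number of stalls.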


Dependency Stalls

Based on the latencies specified above:

LD    IF    ID    EX    DM    WB
            IF    ID    stall FP1   FP2

Out-of-order execution may reduce the number of stalls:

Assembly Code:

Loop:

//Load a double precision number from memory location [R1] into F0

L.D F0,0(R1) ; F0=vector element

//Add the double precision numbers in F0 and F2 and store the result in F4.

ADD.D F4,F0,F2 ;

//Store the double precision number in F4 into memory location [R1]

S.D 0(R1),F4 ;

//Decrement R1 by 8.

DSUBUI R1,R1,8 ;

//Branch back to the loop if R1 is not zero

BNEZ R1,Loop ; branch if R1!=zero

//No operation

With Stalls

Loop:

1 L.D F0,0(R1) ;

2 Stall; (Because ADD.D needs F0)

3 ADD.D F4,F0,F2 ;

4 Stall; (Because S.D must wait on ADD.D for F4)

5 Stall;

6 S.D 0(R1),F4 ;

7 DSUBUI R1,R1,8 ;

8 Stall; (Because BNEZ must wait on DSUBUI for R1)

9 BNEZ R1,Loop ;

10 Stall; (Delayed branch slot, needed to determine the branch direction)

Reorder to Reduce the Stalls (Compiler generated)

1 Loop: L.D F0,0(R1)

2 Stall (ADD.D needs F0)

3 ADD.D F4,F0,F2;

4 DSUBUI R1,R1,8;

5 Stall; (BNEZ must wait on DSUBUI for R1)

6 BNEZ R1,Loop; (Delayed branch: the instruction following it will always be executed!)

7 S.D 8(R1),F4 ; (The store address is adjusted since R1 has already been decremented.)
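The two listings (10 cycles with stalls, 7 cycles after reordering) can be reproduced with a small in-order issue model. This is a sketch under stated assumptions, not the method used to produce the slides: the instruction kinds, the `Instr` tuple, and `schedule` are hypothetical names, the integer-to-branch stall of 1 is inferred from the stall shown before BNEZ, pairs not listed are assumed to need no stall, and an unfilled delayed branch slot is charged one cycle.

```python
from collections import namedtuple

# text: assembly text, kind: latency class, dest: register written (or None),
# srcs: registers read.
Instr = namedtuple("Instr", "text kind dest srcs")

# (producer kind, consumer kind) -> stall cycles. Unlisted pairs default to 0.
LATENCY = {
    ("fp_alu", "fp_alu"): 3,
    ("fp_alu", "store"): 2,
    ("load", "fp_alu"): 1,
    ("load", "store"): 0,
    ("int", "int"): 0,
    ("int", "branch"): 1,  # assumption: matches the stall before BNEZ
}


def schedule(instrs):
    """Return (issue cycle of each instruction, total cycles per iteration)."""
    issue = []
    last_writer = {}  # register -> index of the latest instruction writing it
    for i, ins in enumerate(instrs):
        cycle = issue[-1] + 1 if issue else 1  # in-order: one issue per cycle
        for reg in ins.srcs:
            if reg in last_writer:
                p = last_writer[reg]
                lat = LATENCY.get((instrs[p].kind, ins.kind), 0)
                cycle = max(cycle, issue[p] + 1 + lat)
        issue.append(cycle)
        if ins.dest is not None:
            last_writer[ins.dest] = i
    total = issue[-1]
    if instrs[-1].kind == "branch":
        total += 1  # the delayed branch slot is left empty
    return issue, total


LD = Instr("L.D F0,0(R1)", "load", "F0", ("R1",))
ADD = Instr("ADD.D F4,F0,F2", "fp_alu", "F4", ("F0", "F2"))
SUB = Instr("DSUBUI R1,R1,8", "int", "R1", ("R1",))
BR = Instr("BNEZ R1,Loop", "branch", None, ("R1",))

original = [LD, ADD, Instr("S.D 0(R1),F4", "store", None, ("R1", "F4")), SUB, BR]
reordered = [LD, ADD, SUB, BR, Instr("S.D 8(R1),F4", "store", None, ("R1", "F4"))]

if __name__ == "__main__":
    for name, prog in (("original", original), ("reordered", reordered)):
        cycles, total = schedule(prog)
        print(name, cycles, total)
```

In the reordered loop the store sits in the delayed branch slot, so no cycle is wasted after BNEZ and the two stalls after ADD.D are hidden behind DSUBUI and BNEZ.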


Can the number of clock cycles be reduced further?

Loop Unrolling Technique to reduce stalls:

Loop:
1 L.D F0,0(R1)
  Stall;
2 ADD.D F4,F0,F2
  Stall;
  Stall;
3 S.D 0(R1),F4 ; //Drop DSUBUI & BNEZ
4 L.D F6,-8(R1)
  Stall;
6 S.D -8(R1),F8 ; //Drop DSUBUI & BNEZ
7 L.D F10,-16(R1)
  Stall;

Number of clock cycles per iteration = 27/4 = 6.75.

Almost nothing is gained by unrolling the loop alone: the reordered loop already takes 7 cycles per iteration.

What if we unroll and move the loads to the beginning of the program:


14 cycles/4 = 3.5 cycles per iteration, which is close to 3 operations per loop iteration. Three is the bare minimum needed to load, add, and store the result, i.e.,

x[i] = x[i] + s;

The number of cycles per iteration should approach 3 as the number of times we unroll increases.
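The trend toward 3 cycles can be checked with a one-line formula. The sketch below assumes that, after unrolling and scheduling, every stall is hidden, each element costs 3 instructions (load, add, store), and the loop overhead is 2 instructions (DSUBUI and BNEZ), which matches 14 = 3*4 + 2 for four copies. The function name and parameters are hypothetical.

```python
def cycles_per_iteration(unroll_factor, ops_per_element=3, loop_overhead=2):
    """Cycles per original iteration for a stall-free unrolled, scheduled loop."""
    total_cycles = ops_per_element * unroll_factor + loop_overhead
    return total_cycles / unroll_factor


if __name__ == "__main__":
    for k in (1, 2, 4, 8, 16):
        print(k, cycles_per_iteration(k))
```

The overhead term 2/k shrinks as the unroll factor k grows, which is why the per-iteration cost approaches 3. In practice the unroll factor is limited by the number of available registers and by code size.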


Dynamic Scheduling (Out-of-Order Execution in Hardware)

Simplifies compiling

Key idea: Move interdependent instructions as far apart as possible without violating the integrity of the code.

Example:

LDD R1,R2;

ADD R1,R3;

SUB R0,R4;

can execute faster when it is reordered as


Scoreboarding Algorithm

(Copyright by John Kubiatowicz (http.cs.berkeley.edu/~kubitron). The notes (actual images) that follow are taken from http.cs.berkeley.edu/~kubitron, with some changes.)

Key Steps:

1--Decode and issue the instruction (in order). Do not issue if there is a structural hazard (resource conflict).

2--Do not issue if a previously issued instruction has the same destination register (i.e., stall when there is potential for a WAW hazard).

3--Read operands. Do not proceed to execution until all operands are read.

The scoreboard controls and avoids hazards by stalling instructions that would cause structural and data hazards.

The scoreboard decides when a stalled instruction can read its operands and resume execution, and when it can write its result into its target register. Since instructions can execute out of order, WAR hazards are possible unless the instructions that would cause them are stalled.

Stalling instructions with WAW hazards ensures that out-of-order execution of instructions will not cause an incorrect writing of results into registers.

Stalling instructions until their operands become ready ensures that out-of-order execution of instructions will not lead to incorrect results, and also avoids potential deadlocks:

Stalling the issue of SUB until it gets both its operands avoids a potential deadlock: If both ADD and SUB are allowed to issue and execute with SUB ahead of ADD, then ADD must wait for SUB to complete execution as there is only one integer unit, and SUB must wait to read R1 and write back R2, leading to a deadlock.


Scoreboard Data Structure:

For each instruction we keep a table of entries:

Instruction status: (1) ID and Issue, (2) Operand Read, (3) Execute, (4) Write back. Indicates where the instruction is in the pipeline.

For each functional unit we keep a table of entries:

Functional unit status: Indicates the state of the functional unit (FU).

Busy: Yes or No

Op: +, -, *, /,etc.

F_j,F_k: Source-registers from which the functional unit receives its operands.

Q_j, Q_k: Functional units outputting to source registers F_j, F_k.

Instruction status, Condition to Proceed, and Bookkeeping (When done):

Issue
Condition to Proceed: FU is free (no structural hazards), no WAW hazard
Bookkeeping (When done):
Busy(FU) <- Yes;
Op(FU) <- op; (From instr)
F_i(FU) <- D; (From instr)
F_j(FU) <- S1; (From instr)
F_k(FU) <- S2; (From instr)
Q_j(FU) <- Result(S1); (functional unit that outputs to S1)
Q_k(FU) <- Result(S2); (functional unit that outputs to S2)
if Q_j(FU) = null, R_j <- Yes else R_j <- No;
if Q_k(FU) = null, R_k <- Yes else R_k <- No;
Result(D) <- FU

Read operands
Condition to Proceed: R_j(FU) and R_k(FU) are Yes
Bookkeeping (When done): R_j <- No, R_k <- No (Clear for next read of operands)

Execution Complete
Condition to Proceed: FU is done

Write result
Condition to Proceed: No WAR hazard. There is a WAR hazard if, for some FU_x,
F_j(FU_x) = F_i(FU) AND R_j(FU_x) = Yes OR
F_k(FU_x) = F_i(FU) AND R_k(FU_x) = Yes
Bookkeeping (When done):
For all f, if Q_j(f) = FU then R_j(f) <- Yes; if Q_k(f) = FU then R_k(f) <- Yes;
Busy(FU) <- No;
Result(F_i(FU)) <- Null; (Remove F_i from FU.)
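The conditions and bookkeeping above can be sketched as a small Python class. This is an illustrative model only, not the code behind the slides: the class and method names are made up, timing is ignored, every instruction is assumed to have a destination register, and clearing Q_j/Q_k when a result is written is a simplification added here so that stale entries cannot match a later instruction on the same unit.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FUStatus:
    busy: bool = False
    op: Optional[str] = None
    fi: Optional[str] = None  # destination register F_i
    fj: Optional[str] = None  # source registers F_j, F_k
    fk: Optional[str] = None
    qj: Optional[str] = None  # units producing F_j, F_k (None = no producer)
    qk: Optional[str] = None
    rj: bool = False          # F_j, F_k ready and not yet read
    rk: bool = False


class Scoreboard:
    def __init__(self, fu_names):
        self.fu = {name: FUStatus() for name in fu_names}
        self.result = {}  # register -> unit that will write it (Result table)

    def can_issue(self, fu, dest):
        # No structural hazard (unit free) and no WAW hazard (dest not pending).
        return not self.fu[fu].busy and dest not in self.result

    def issue(self, fu, op, dest, s1, s2):
        assert self.can_issue(fu, dest)
        u = self.fu[fu]
        u.busy, u.op, u.fi, u.fj, u.fk = True, op, dest, s1, s2
        u.qj, u.qk = self.result.get(s1), self.result.get(s2)
        u.rj, u.rk = u.qj is None, u.qk is None
        self.result[dest] = fu

    def can_read_operands(self, fu):
        return self.fu[fu].rj and self.fu[fu].rk

    def read_operands(self, fu):
        assert self.can_read_operands(fu)
        self.fu[fu].rj = self.fu[fu].rk = False  # clear for next read

    def can_write_result(self, fu):
        # WAR check: stall while another unit still has to read our destination.
        fi = self.fu[fu].fi
        for name, x in self.fu.items():
            if name != fu and x.busy:
                if (x.fj == fi and x.rj) or (x.fk == fi and x.rk):
                    return False
        return True

    def write_result(self, fu):
        assert self.can_write_result(fu)
        for x in self.fu.values():
            if x.qj == fu:
                x.rj, x.qj = True, None  # clearing qj is a choice of this sketch
            if x.qk == fu:
                x.rk, x.qk = True, None
        del self.result[self.fu[fu].fi]
        self.fu[fu] = FUStatus()  # Busy(FU) <- No


if __name__ == "__main__":
    sb = Scoreboard(["Integer", "Mult", "Add"])
    sb.issue("Mult", "MUL.D", "F0", "F2", "F4")
    sb.issue("Add", "SUB.D", "F8", "F6", "F0")  # waits on Mult for F0
    print(sb.can_read_operands("Add"))
```

A fuller model would also track the per-instruction status table and the execution latency of each unit; both are left out to keep the sketch short.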


Tomasulo's Algorithm (www.cs.ucf.edu/courses/eel5708/slides/lecture_15_tomasulo.ppt)

The main idea is to use register renaming to avoid stalling instructions because of WAR and WAW hazards.

Distributed Computing: Control is distributed into the functional units.

Register Renaming: Operands are held in buffers, called "reservation stations" (RS). Registers in instructions are replaced by values or pointers to reservation stations.

Parallelism: More reservation stations than registers => more parallelism than compiler optimization.

Common Data Bus: Results flow to FUs from reservation stations over a Common Data Bus that broadcasts the results to all FUs => avoidance of RAW hazards (similar to data forwarding).

Example (Register renaming):

DIV.D F0,F2,F4;
ADD.D F6,F0,F8; ---> RAW with DIV.D
S.D F6,0(R1); ---> RAW with ADD.D
SUB.D F8,F10,F14; ---> WAR with ADD.D
MUL.D F6,F10,F8; ---> WAW with ADD.D, RAW with SUB.D, WAR with S.D
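The hazards in this example can be found mechanically by comparing the register sets that each pair of instructions reads and writes. The sketch below is a simplified pairwise check with hypothetical names (`hazards`, `program`); it ignores intervening writes, and it treats the store as reading F6 and R1 and writing no register, so its dependence on ADD.D through F6 is reported as RAW.

```python
def hazards(earlier, later):
    """Classify hazards between two instructions given as (name, writes, reads)."""
    _, w1, r1 = earlier
    _, w2, r2 = later
    found = set()
    if w1 & r2:
        found.add("RAW")  # later reads what earlier writes
    if r1 & w2:
        found.add("WAR")  # later overwrites what earlier still has to read
    if w1 & w2:
        found.add("WAW")  # both write the same register
    return found


program = [
    ("DIV.D", {"F0"}, {"F2", "F4"}),
    ("ADD.D", {"F6"}, {"F0", "F8"}),
    ("S.D", set(), {"F6", "R1"}),
    ("SUB.D", {"F8"}, {"F10", "F14"}),
    ("MUL.D", {"F6"}, {"F10", "F8"}),
]

if __name__ == "__main__":
    for i, a in enumerate(program):
        for b in program[i + 1:]:
            found = hazards(a, b)
            if found:
                print(a[0], "->", b[0], sorted(found))
```

Only the RAW entries are true data dependences; the WAR and WAW entries come from reusing the names F6 and F8, which is what register renaming removes.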


Renaming can be done statically by a compiler, but static renaming has two limitations:

- The number of registers places an upper bound on the number of registers that can be renamed.

- The renaming of a register is limited to blocks of code between branches, unless a sophisticated analysis of where branches might lead the program execution is carried out across branches.
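A minimal sketch of renaming on straight-line code: every destination gets a fresh name, and later reads are redirected to the newest name of their source register. The function `rename` and the names P0, P1, ... are hypothetical, and the sketch assumes an unlimited supply of fresh names and no branches, so it sidesteps both limitations listed above.

```python
def rename(program, prefix="P"):
    """Rename destinations of (op, dest, srcs) tuples to fresh names.

    Sources that were written earlier in the program are replaced by the
    current name of that register; other sources keep their original name.
    """
    mapping = {}  # architectural register -> current fresh name
    counter = 0
    renamed = []
    for op, dest, srcs in program:
        new_srcs = tuple(mapping.get(s, s) for s in srcs)  # read old mapping first
        new_dest = None
        if dest is not None:
            new_dest = f"{prefix}{counter}"
            counter += 1
            mapping[dest] = new_dest
        renamed.append((op, new_dest, new_srcs))
    return renamed


program = [
    ("DIV.D", "F0", ("F2", "F4")),
    ("ADD.D", "F6", ("F0", "F8")),
    ("S.D", None, ("F6", "R1")),
    ("SUB.D", "F8", ("F10", "F14")),
    ("MUL.D", "F6", ("F10", "F8")),
]

if __name__ == "__main__":
    for instr in rename(program):
        print(instr)
```

After renaming, no two instructions write the same name and no instruction writes a name that an earlier instruction reads, so only the RAW dependences remain.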


Tomasulo Algorithm Steps:

1. Issue: Get the next instruction from the FIFO instruction queue. If a matching reservation station is free (no structural hazard), control issues the instruction. If the operands are available, it sends them to the reservation station; otherwise, it keeps track of the functional units that will produce the operands, and renames the registers in the instruction by pointing them to the reservation station buffers.

2. Execution: Operate on operands (EX). If both operands of an instruction are ready, then execute it; otherwise, monitor the Common Data Bus for results. Waiting until all operands are ready avoids RAW hazards. (There is a potential structural hazard due to multiple issue of instructions: more than one reservation station may compete for execution on the attached FU.)

3. Write result: Complete execution (WB). Write on Common Data Bus to all


Each reservation station has seven fields:

1-Op: Operation to perform in the attached functional unit.

2-Qj, 3-Qk: Reservation stations that will produce the source operands for the attached functional unit. Qj, Qk = 0 => the operand is ready (available in Vj, Vk).

4-Vj, 5-Vk: Values of the source operands. (For each operand, either the V field or the Q field is valid: Vj or Qj, and Vk or Qk.)

6-A: Address of a memory operand or result.

7-Busy: Indicates that the reservation station and its attached functional unit are occupied.
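The reservation station fields and the role of the Common Data Bus can be sketched as follows. This is an illustration under assumptions, not the full algorithm: the names `ReservationStation`, `issue`, and `broadcast` are made up, `None` plays the role of "Q = 0", `reg_status` is a hypothetical table recording which station will produce each register, and execution timing and memory operations are left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReservationStation:
    name: str
    busy: bool = False
    op: Optional[str] = None
    vj: Optional[float] = None  # operand values, valid when qj/qk is None
    vk: Optional[float] = None
    qj: Optional[str] = None    # station that will produce the operand
    qk: Optional[str] = None
    a: Optional[int] = None     # address field for loads and stores (unused here)

    def ready(self):
        """Both operands present, so the attached FU may execute."""
        return self.busy and self.qj is None and self.qk is None


def issue(rs, op, dest, src_j, src_k, regs, reg_status):
    """Fill a free station; sources are read before the destination is renamed."""
    assert not rs.busy  # structural hazard check
    rs.busy, rs.op = True, op
    rs.qj = reg_status.get(src_j)
    rs.vj = None if rs.qj else regs[src_j]
    rs.qk = reg_status.get(src_k)
    rs.vk = None if rs.qk else regs[src_k]
    reg_status[dest] = rs.name  # dest now points at this station


def broadcast(stations, producer, value):
    """Common Data Bus: every station waiting on `producer` captures the value."""
    for rs in stations:
        if rs.qj == producer:
            rs.vj, rs.qj = value, None
        if rs.qk == producer:
            rs.vk, rs.qk = value, None


if __name__ == "__main__":
    regs = {"F0": 0.0, "F2": 6.0, "F4": 2.0, "F8": 1.0}
    reg_status = {}
    div, add = ReservationStation("Div1"), ReservationStation("Add1")
    issue(div, "DIV.D", "F0", "F2", "F4", regs, reg_status)
    issue(add, "ADD.D", "F6", "F0", "F8", regs, reg_status)  # waits on Div1
    broadcast([div, add], "Div1", regs["F2"] / regs["F4"])
    print(add)
```

Because the waiting station records the producer's name rather than the register name, later writes to F0 by other instructions cannot disturb the value that ADD.D is waiting for.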
