
ENEE446---Lectures-11/7/07 A. Yavuz Oruç

Professor, UMD, College Park

Copyright © 2007 A. Yavuz Oruç. All rights reserved.

Instruction Scheduling in Multi Function Pipelines

(1) Static Instruction Scheduling

In-Order or Compiler-Driven (Hardware)

Out-of-order Execution (Software)

(2) Dynamic Scheduling (Real-time Scheduling-Hardware)

Static Scheduling-In-order execution:

The fetch, decode, and execute steps of instructions, together with the operand fetch and store steps, are overlapped (pipelined) in the order in which the instructions are written in the program.

Example: MIPS Pipeline

Cycle             1   2   3   4   5   6
1st Instruction   IF  ID  EX  ME  WB
2nd Instruction       IF  ID  EX  ME  WB
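The overlap in the MIPS example can be sketched in a few lines of Python. This is only an illustration of an ideal pipeline, assuming one instruction is fetched per cycle and nothing stalls; the names `pipeline_rows` and `total_cycles` are made up for this sketch.

```python
STAGES = ("IF", "ID", "EX", "ME", "WB")

def pipeline_rows(n_instructions, stages=STAGES):
    """Row i lists what instruction i does in cycles 1, 2, 3, ...

    An empty string means the instruction has not been fetched yet.
    Assumes an ideal pipeline: one fetch per cycle, no stalls.
    """
    return [[""] * i + list(stages) for i in range(n_instructions)]

def total_cycles(n_instructions, stages=STAGES):
    """Cycles needed to finish all instructions in the ideal pipeline."""
    return len(stages) + n_instructions - 1

if __name__ == "__main__":
    for i, row in enumerate(pipeline_rows(3), start=1):
        print(f"instr {i}: " + " ".join(f"{s:>2}" for s in row))
```

Each instruction starts one cycle after its predecessor, so n instructions finish in n + 4 cycles instead of 5n.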


Static Scheduling-Out-of-order execution

The latencies apply when there is a data dependence between the instructions.

Previous Instruction   Next Instruction   Execution Time (cycles)   Latency (stall cycles)
FP ALU                 FP ALU             4                         3
FP ALU                 Store double       4                         2
Load double            FP ALU             1                         1
Load double            Store double       1                         0
Integer                Integer            1                         0

Dependency Stalls

Based on the latencies specified above:

Cycle                      1   2   3   4      5    6
LD                         IF  ID  EX  DM     WB
Dependent FP instruction       IF  ID  stall  FP1  FP2

Out-of-order execution may reduce the number of stalls:

Assembly Code:

Loop:
    L.D    F0,0(R1)    ; F0 = vector element: load a double from memory location [R1]
    ADD.D  F4,F0,F2    ; add the doubles in F0 and F2 and store the result in F4
    S.D    0(R1),F4    ; store the double in F4 into memory location [R1]
    DSUBUI R1,R1,8     ; decrement R1 by 8
    BNEZ   R1,Loop     ; branch back to Loop if R1 != zero
    NOP                ; no operation (delayed branch slot)

With Stalls

Cycle   Instruction
1       Loop: L.D F0,0(R1)
2       Stall (because ADD.D needs F0)
3       ADD.D F4,F0,F2
4       Stall (because S.D must wait on ADD.D for F4)
5       Stall
6       S.D 0(R1),F4
7       DSUBUI R1,R1,8
8       Stall (because BNEZ must wait on DSUBUI for R1)
9       BNEZ R1,Loop
10      Stall (delayed branch slot, needed to determine the branch direction)

Reorder to Reduce the Stalls (Compiler generated)

Cycle   Instruction
1       Loop: L.D F0,0(R1)
2       Stall (ADD.D needs F0)
3       ADD.D F4,F0,F2
4       DSUBUI R1,R1,8
5       Stall (BNEZ must wait on DSUBUI for R1)
6       BNEZ R1,Loop     ; delayed branch (the instruction following it will always be executed!)
7       S.D 8(R1),F4     ; store address is adjusted since R1 has already been changed
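The cycle counts of the two schedules (10 with stalls, 7 after reordering) can be reproduced with a small in-order issue model. This is a rough sketch, not the lecture's method: the helper `loop_cycles` and the instruction encoding are invented here, the `("int", "branch")` latency of 1 is an assumption read off the stall listings rather than the latency table, and any pair not listed is assumed to need no stall.

```python
# Stall cycles between a producer and a dependent consumer, keyed by
# (producer kind, consumer kind). The first five entries come from the
# latency table; ("int", "branch") is an ASSUMPTION taken from the stall
# listings, where BNEZ waits one cycle on DSUBUI.
LATENCY = {
    ("fp_alu", "fp_alu"): 3,
    ("fp_alu", "store"): 2,
    ("load", "fp_alu"): 1,
    ("load", "store"): 0,
    ("int", "int"): 0,
    ("int", "branch"): 1,
}

def loop_cycles(instrs):
    """instrs: list of (kind, dest, sources). Returns cycles for one loop pass.

    Models in-order issue of at most one instruction per cycle; a consumer
    issues no earlier than its producer's issue cycle + 1 + stall latency.
    """
    issue = []          # issue cycle of each instruction
    last_writer = {}    # register -> index of the latest instruction writing it
    for i, (kind, dest, sources) in enumerate(instrs):
        cycle = issue[-1] + 1 if issue else 1
        for reg in sources:
            p = last_writer.get(reg)
            if p is not None:
                stall = LATENCY.get((instrs[p][0], kind), 0)  # unlisted pair: 0
                cycle = max(cycle, issue[p] + 1 + stall)
        issue.append(cycle)
        if dest is not None:
            last_writer[dest] = i
    # A branch at the very end leaves its delay slot empty: one extra cycle.
    return issue[-1] + (1 if instrs[-1][0] == "branch" else 0)

ORIGINAL = [
    ("load", "F0", ["R1"]),           # L.D    F0,0(R1)
    ("fp_alu", "F4", ["F0", "F2"]),   # ADD.D  F4,F0,F2
    ("store", None, ["F4", "R1"]),    # S.D    0(R1),F4
    ("int", "R1", ["R1"]),            # DSUBUI R1,R1,8
    ("branch", None, ["R1"]),         # BNEZ   R1,Loop
]
REORDERED = [
    ORIGINAL[0], ORIGINAL[1], ORIGINAL[3], ORIGINAL[4],
    ("store", None, ["F4", "R1"]),    # S.D 8(R1),F4 fills the delay slot
]

if __name__ == "__main__":
    print(loop_cycles(ORIGINAL), loop_cycles(REORDERED))
```

Moving S.D into the branch delay slot hides both the FP ALU to store latency and the otherwise empty slot, which is where the three saved cycles come from.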

Can the number of clock cycles be reduced further?

Loop Unrolling Technique to reduce stalls:

Loop:
1   L.D   F0,0(R1)       ; Stall
2   ADD.D F4,F0,F2       ; Stall, Stall
3   S.D   0(R1),F4       ; DSUBUI & BNEZ dropped
4   L.D   F6,-8(R1)      ; Stall
5   ADD.D F8,F6,F2       ; Stall, Stall
6   S.D   -8(R1),F8      ; DSUBUI & BNEZ dropped
7   L.D   F10,-16(R1)    ; Stall
8   ADD.D F12,F10,F2     ; Stall, Stall

Number of clock cycles per iteration = 27/4 = 6.75.

Almost nothing is gained by unrolling the loop alone: the reordered loop already takes 7 cycles per iteration.

What if we unroll the loop and also move the loads to the beginning of the unrolled loop?

14 cycles / 4 = 3.5 cycles per iteration, which is pretty close to 3 operations per loop iteration.

This is the bare minimum needed to load, add, and store the result, i.e.,

x[i] = x[i] + s;

The number of cycles per iteration should approach 3 as the number of times we unroll the loop increases.
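The trend toward 3 cycles can be checked with a one-line formula. The sketch below assumes that the unrolled and scheduled loop has no stalls left, so that every instruction costs exactly one cycle; the notes confirm this only for an unroll factor of 4 (14 cycles). The function name and its parameters are hypothetical.

```python
def cycles_per_iteration(unroll_factor, ops_per_element=3, loop_overhead=2):
    """Cycles per original iteration for an unrolled, stall-free schedule.

    ops_per_element: load, add, store for each element (3).
    loop_overhead:   DSUBUI and BNEZ, paid once per unrolled loop body (2).
    ASSUMPTION: the schedule contains no stall cycles at all.
    """
    total = ops_per_element * unroll_factor + loop_overhead
    return total / unroll_factor

if __name__ == "__main__":
    for k in (4, 8, 16, 64):
        print(k, cycles_per_iteration(k))
```

With an unroll factor of k the cost is 3 + 2/k cycles per element, so the loop overhead is amortized over more elements as k grows.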


Dynamic Scheduling (Out-of-Order Execution in Hardware)

Simplifies compiling

Key idea: Move interdependent instructions as far away from each other as possible without violating the integrity of the code.

Example:

LDD R1,R2;

ADD R1,R3;

SUB R0,R4;

can execute faster when it is reordered so that the independent SUB separates the dependent LDD and ADD:

LDD R1,R2;

SUB R0,R4;

ADD R1,R3;

Scoreboarding Algorithm

(Copyright by John Kubiatowicz (http.cs.berkeley.edu/~kubitron). The notes (actual images) that follow are taken from http.cs.berkeley.edu/~kubitron with some changes.)

Key Steps:

1--Decode and issue instruction (In-order). Do not issue if there is a structural hazard (resource conflict).

2--Do not issue if a previously issued, still active instruction has the same destination register (i.e., stall when there is potential for WAW).

3--Read operands. Do not start execution until all operands have been read.

Hazards are controlled and avoided by the scoreboard by stalling instructions that would cause structural and data hazards.

The scoreboard decides when a stalled instruction can read its operands and resume execution, and when it can write its results into its target registers. Since instructions can execute out-of-order, WAR hazards are possible unless the instructions that would cause them are stalled before they write their results.

Stalling instructions with WAW hazards ensures that out-of-order execution of instructions will not cause results to be written into registers incorrectly.

Stalling instructions until their operands become ready ensures that out-of-order execution of instructions will not lead to incorrect results, and it also avoids potential deadlocks:

Stalling the issue of SUB until it gets both its operands avoids a potential deadlock: if both ADD and SUB are allowed to issue and execute with SUB ahead of ADD, then ADD must wait for SUB to complete execution, as there is only one integer unit, and SUB must wait to read R1 and write back R2, leading to a deadlock.

Scoreboard Data Structure:

For each instruction we keep a table of entries:

Instruction status: (1) ID and Issue, (2) Operand Read, (3) Execute, (4) Write back. Indicates where the instruction is in the pipeline.

For each functional unit we keep a table of entries:

Functional unit status: Indicates the state of the functional unit (FU).

Busy: Yes or No

Op: +, -, *, /, etc.

Fi: Destination register into which the functional unit writes its result.

Fj,Fk: Source registers from which the functional unit receives its operands.

Qj,Qk: Functional units outputting to source registers Fj, Fk.

Rj,Rk: Flags (Yes or No) indicating whether Fj, Fk are ready to be read.

Scoreboard pipeline control, for an instruction with operation op, destination D, and sources S1, S2 that uses functional unit FU:

Issue
Condition to Proceed: FU is free (no structural hazard) and there is no WAW hazard.
Bookkeeping (When done):
Busy(FU) <- Yes; Op(FU) <- op; Fi(FU) <- D; Fj(FU) <- S1; Fk(FU) <- S2 (all from the instruction)
Qj(FU) <- Result(S1) (functional unit that outputs to S1)
Qk(FU) <- Result(S2) (functional unit that outputs to S2)
If Qj(FU) = null, Rj <- Yes, else Rj <- No
If Qk(FU) = null, Rk <- Yes, else Rk <- No
Result(D) <- FU

Read operands
Condition to Proceed: Rj(FU) and Rk(FU) are Yes.
Bookkeeping (When done): Rj <- No, Rk <- No (clear for next read of operands)

Execution Complete
Condition to Proceed: FU is done.

Write result
Condition to Proceed: No WAR hazard. There is a WAR hazard if, for some FUx,
(Fj(FUx) = Fi(FU) AND Rj(FUx) = Yes) OR (Fk(FUx) = Fi(FU) AND Rk(FUx) = Yes).
Bookkeeping (When done):
For all f, if Qj(f) = FU then Rj(f) <- Yes; if Qk(f) = FU then Rk(f) <- Yes
Result(Fi(FU)) <- Null (remove Fi from FU)
Busy(FU) <- No
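The conditions and bookkeeping above map almost directly onto code. The following Python sketch is one possible reading of the table, not the original implementation: the class and method names are invented, timing and execution are left out, and clearing Qj/Qk in `read_operands` is an extra step added here so that a stale producer name cannot set a ready flag again later.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class FUStatus:
    busy: bool = False
    op: Optional[str] = None
    fi: Optional[str] = None   # destination register
    fj: Optional[str] = None   # source registers
    fk: Optional[str] = None
    qj: Optional[str] = None   # FUs that will produce Fj, Fk (None = no producer pending)
    qk: Optional[str] = None
    rj: bool = False           # Fj, Fk ready and not yet read
    rk: bool = False

class Scoreboard:
    def __init__(self, fu_names):
        self.fu = {name: FUStatus() for name in fu_names}
        self.result = {}       # register -> FU that will write it (None if no FU will)

    def can_issue(self, fu, dest):
        # FU is free (no structural hazard) and no active writer of dest (no WAW)
        return not self.fu[fu].busy and self.result.get(dest) is None

    def issue(self, fu, op, dest, s1, s2):
        if not self.can_issue(fu, dest):
            return False
        qj, qk = self.result.get(s1), self.result.get(s2)
        self.fu[fu] = FUStatus(True, op, dest, s1, s2, qj, qk, qj is None, qk is None)
        self.result[dest] = fu
        return True

    def can_read_operands(self, fu):
        u = self.fu[fu]
        return u.busy and u.rj and u.rk

    def read_operands(self, fu):
        if not self.can_read_operands(fu):
            return False
        u = self.fu[fu]
        u.rj = u.rk = False    # clear for next read of operands
        u.qj = u.qk = None     # ASSUMPTION: extra step, not in the table above
        return True

    def can_write_result(self, fu):
        # WAR hazard: some active unit still has to read the register we overwrite
        fi = self.fu[fu].fi
        return not any(
            x.busy and ((x.fj == fi and x.rj) or (x.fk == fi and x.rk))
            for x in self.fu.values()
        )

    def write_result(self, fu):
        if not self.can_write_result(fu):
            return False
        for x in self.fu.values():
            if x.qj == fu:
                x.rj = True
            if x.qk == fu:
                x.rk = True
        self.result[self.fu[fu].fi] = None
        self.fu[fu] = FUStatus()   # Busy(FU) <- No
        return True
```

For example, after `sb = Scoreboard(["Mult", "Add", "Div"])`, issuing a multiply into F0 and then an add that reads F0 leaves the add unable to read its operands until the multiply has written its result.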


Tomasulo's Algorithm (www.cs.ucf.edu/courses/eel5708/slides/lecture_15_tomasulo.ppt)

The main idea is to use register renaming to avoid stalling instructions because of WAR and WAW hazards.

Distributed Computing: Control is distributed into the functional units.

Register Renaming: Operands are held in buffers, called "reservation stations". Registers in instructions are replaced by values or pointers to reservation stations (RS).

Parallelism: More reservation stations than registers => more parallelism than compiler optimization.

Common Data Bus: Results flow to FUs from reservation stations, rather than through registers, over a Common Data Bus that broadcasts the results to all FUs => avoidance of RAW hazards (similar to data forwarding).

Example (Register renaming):

DIV.D F0,F2,F4;
ADD.D F6,F0,F8;   ---> RAW with DIV.D
S.D   F6,0(R1);   ---> RAW with ADD.D
SUB.D F8,F10,F14; ---> WAR with ADD.D
MUL.D F6,F10,F8;  ---> WAW with ADD.D, RAW with SUB.D, WAR with S.D
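The hazards in this example can be found mechanically by comparing the destination and source registers of every pair of instructions. The Python sketch below does that and then repeats the check on a renamed version of the code. The function `find_hazards` and the temporary register names `S` and `T` are illustrative choices, and the pairwise check is deliberately simple: it does not look at writes that occur between the two instructions.

```python
def find_hazards(program):
    """program: list of (name, dest, sources) in program order.

    Returns a set of (kind, later_instruction, earlier_instruction, register).
    """
    hazards = set()
    for j, (later, dest_j, srcs_j) in enumerate(program):
        for earlier, dest_i, srcs_i in program[:j]:
            if dest_i is not None and dest_i in srcs_j:
                hazards.add(("RAW", later, earlier, dest_i))   # reads what earlier writes
            if dest_j is not None and dest_j in srcs_i:
                hazards.add(("WAR", later, earlier, dest_j))   # writes what earlier reads
            if dest_j is not None and dest_j == dest_i:
                hazards.add(("WAW", later, earlier, dest_j))   # both write the same register
    return hazards

PROGRAM = [
    ("DIV.D", "F0", ["F2", "F4"]),
    ("ADD.D", "F6", ["F0", "F8"]),
    ("S.D",   None, ["F6", "R1"]),
    ("SUB.D", "F8", ["F10", "F14"]),
    ("MUL.D", "F6", ["F10", "F8"]),
]

# Same code with the destinations of ADD.D and SUB.D renamed to the
# hypothetical temporaries S and T (and their later uses updated).
RENAMED = [
    ("DIV.D", "F0", ["F2", "F4"]),
    ("ADD.D", "S",  ["F0", "F8"]),
    ("S.D",   None, ["S", "R1"]),
    ("SUB.D", "T",  ["F10", "F14"]),
    ("MUL.D", "F6", ["F10", "T"]),
]

if __name__ == "__main__":
    for h in sorted(find_hazards(PROGRAM)):
        print(h)
    print(sorted(find_hazards(RENAMED)))
```

After renaming, only the three true (RAW) dependences remain; the WAR and WAW hazards were caused by register names being reused, not by the flow of data.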

Renaming can be done statically by a compiler, but static renaming has two limitations:

- The number of registers places an upper bound on the number of registers that can be renamed.

- The renaming of a register is limited to blocks of code between branches, unless a sophisticated analysis of where branches might lead the program execution is carried out across branches.

Tomasulo Algorithm Steps:

1. Issue: Get the next instruction from the FIFO instruction queue. If a matching reservation station is free (no structural hazard), control issues the instruction. If the operands are available, it sends them to the reservation station; otherwise, it keeps track of the functional units that will produce the operands and renames the registers in the instruction by pointing them to the reservation station buffers.

2. Execution: Operate on operands (EX). If both operands of a functional unit are ready, then execute the instruction; otherwise monitor the Common Data Bus for results. Waiting until all operands are ready avoids RAW hazards. (There is a potential structural hazard due to multiple issue of instructions: more than one reservation station may compete for execution on the attached FU.)

3. Write result: Complete execution (WB). Write on Common Data Bus to all


Each reservation station has seven fields:

1-Op: Operation to perform in the attached functional unit.

2-Qj, 3-Qk: Reservation stations that will produce the source operands for the attached functional unit. Qj, Qk = 0 => the operands are ready or available in Vj, Vk.

4-Vj, 5-Vk: Values of the source operands. (For each operand, either the V field or the Q field is valid, never both.)

6-A: Address of a memory operand or result.

7-Busy: Indicates that the reservation station and its attached functional unit are occupied.
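A reservation station and the renaming done at issue time can be sketched as follows. This is a simplified illustration under several assumptions: the names `RS`, `Tomasulo`, `issue`, `ready`, and `broadcast` are invented for the sketch, `None` plays the role of Q = 0, execution is not modeled (the caller supplies the result value), and memory operations and the A field are not exercised.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class RS:
    name: str
    busy: bool = False
    op: Optional[str] = None
    vj: Optional[float] = None   # operand values, valid when the matching Q is None
    vk: Optional[float] = None
    qj: Optional[str] = None     # stations that will produce the operands
    qk: Optional[str] = None
    a: Optional[int] = None      # memory address (unused in this sketch)

class Tomasulo:
    def __init__(self, station_names, regs):
        self.rs = {n: RS(n) for n in station_names}
        self.regs = dict(regs)   # register -> current value
        self.reg_status = {}     # register -> station that will produce its value

    def _operand(self, reg):
        """Return (value, None) if reg is ready, else (None, producing station)."""
        producer = self.reg_status.get(reg)
        if producer is None:
            return self.regs[reg], None
        return None, producer

    def issue(self, station, op, dest, s1, s2):
        if self.rs[station].busy:
            return False         # structural hazard: no free station
        vj, qj = self._operand(s1)
        vk, qk = self._operand(s2)
        self.rs[station] = RS(station, True, op, vj, vk, qj, qk)
        self.reg_status[dest] = station   # rename dest to this station
        return True

    def ready(self, station):
        r = self.rs[station]
        return r.busy and r.qj is None and r.qk is None

    def broadcast(self, station, value):
        """Common Data Bus: deliver a result to waiting stations and registers."""
        for r in self.rs.values():
            if r.qj == station:
                r.vj, r.qj = value, None
            if r.qk == station:
                r.vk, r.qk = value, None
        for reg, producer in list(self.reg_status.items()):
            if producer == station:
                self.regs[reg] = value
                del self.reg_status[reg]
        self.rs[station] = RS(station)    # free the station
```

Because an issued instruction copies the operand values that are already available, a later instruction that overwrites one of its source registers (a WAR hazard) cannot disturb it, and a register is only updated by the station it currently points to, which takes care of WAW hazards.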
