CS 211: Computer Architecture
CS 211: Computer Architecture
Instructor: Prof. Bhagi Narahari
Instructor: Prof. Bhagi Narahari
Dept. of Computer Science
Dept. of Computer Science
Course URL:
Course URL:
www.seas.gwu.edu/~narahari/cs211/
How to
How to
improve performance?
improve performance?
•• Recall performance is function of
Recall performance is function of
–
– CPI: cycles per instructionCPI: cycles per instruction –
– Clock cycleClock cycle –
– Instruction countInstruction count
•• Reducing any of the 3 factors will lead to
Reducing any of the 3 factors will lead to
improved performance
How to
How to
improve performance?
improve performance?
•• First step is to
First step is to apply concept of pipelining to
apply concept of pipelining to
the instruction execution process
the instruction execution process
–
– Overlap computationsOverlap computations
•• What does this do?
What does this do?
–
– Decrease clock cycleDecrease clock cycle –
– Decrease effective CPU time compared to originalDecrease effective CPU time compared to original clock cycle
clock cycle
•• Appendix A of Textbook
Appendix A of Textbook
–
Pipeline Approach to Improve
Pipeline Approach to Improve
System Performance
System Performance
•• Analogous to fluid flow in pipelines and
Analogous to fluid flow in pipelines and
assembly line in factories
assembly line in factories
•• Divide process into “stages” and send
Divide process into “stages” and send tasks
tasks
into a pipeline
into a pipeline
–
– Overlap computations of different tasks byOverlap computations of different tasks by
operating on them concurrently in different stages operating on them concurrently in different stages
Instruction Pipeline
Instruction Pipeline
•• Instruction execution process lends itself
Instruction execution process lends itself
naturally to pipelining
naturally to pipelining
–
– overlap the subtasks of instruction fetch, decodeoverlap the subtasks of instruction fetch, decode and execute
Linear Pipeline
Linear Pipeline
Processor
Processor
Linear pipeline processes a sequence of
Linear pipeline processes a sequence of
subtasks with linear precedence
subtasks with linear precedence
At a higher level -
At a higher level - Sequence of processors
Sequence of processors
Data flowing in streams from stage S
Data flowing in streams from stage S
11to the
to the
final stage S
final stage S
kkControl of data flow :
Control of data flow : synchronous
synchronous or
or
asynchronous
asynchronous
S
S
11S
S
22S
S
kk 3 3 3 3Synchronous
Synchronous
Pipeline
Pipeline
All transfers simultaneous All transfers simultaneous
One task or operation enters the pipeline per cycle One task or operation enters the pipeline per cycle Processors reservation table : diagonal
Processors reservation table : diagonal
4
4
4 4
Time Space Utilization of
Time Space Utilization of
Pipeline
Pipeline
S1
S1
S2
S2
S3
S3
T1 T1 T1 T1 T1 T1 T2 T2 T2 T2 T2 T2 T3 T3 T3 T3 T4 T41
2
3
4
1
2
3
4
Time (in pipeline cycles)
Time (in pipeline cycles)
Full pipeline after 4 cycles Full pipeline after 4 cycles
Asynchronous
Asynchronous
Pipeline
Pipeline
Transfers performed when individual processors are ready Transfers performed when individual processors are ready Handshaking protocol between processors
Handshaking protocol between processors Mainly used in multiprocessor systems with
Mainly used in multiprocessor systems with message-passingmessage-passing
5
5
5 5
Pipeline Clock and
Pipeline Clock and
Timing
Timing
S
S
iiS
S
i+1i+1 m m ddClock cycle of the pipeline : Clock cycle of the pipeline : Latch delay : Latch delay : d d = max { = max { mm} + d} + d Pipeline frequency : Pipeline frequency : f f 6 6 6 6
Speedup and
Speedup and
Efficiency
Efficiency
k-stage
k-stage pipeline processespipeline processes nn tasks tasks in in k k + + (n-1) (n-1) clockclock cycles:
cycles: k
k cycles for the first task andcycles for the first task and n-1n-1 cyclescycles for the remaining
for the remaining n-1n-1 taskstasks Total time to process
Total time to process nn taskstasks
T
T k k = [ k + (n-1)] = [ k + (n-1)] For the non-pipelined processor
For the non-pipelined processor
T T 11 = n k = n k Speedup factor Speedup factor n k n k 7 7 7 7
Efficiency and
Efficiency and
Efficiency and
Efficiency and
Throughput
Throughput
Throughput
Throughput
Efficiency of the k-stages
Efficiency of the k-stages pipeline :pipeline : E E k k == SSk k k k == n n k + (n-1) k + (n-1)
Pipeline throughput (the number of tasks per unit
Pipeline throughput (the number of tasks per unit time) :time) : note equivalence to IPC
note equivalence to IPC
H H k k == nn [ k + (n-1)] [ k + (n-1)] == n f n f k + (n-1) k + (n-1) 10 10 10 10
Pipeline Performance:
Pipeline Performance:
Example
Example
•• Task has 4 subtasks with time: t1=60, Task has 4 subtasks with time: t1=60, t2=50, t3=90, and t4=80 nst2=50, t3=90, and t4=80 ns (nanoseconds)
(nanoseconds) •• latch delay = 10latch delay = 10
•• Pipeline cycle time = 90+10 = 1Pipeline cycle time = 90+10 = 100 ns00 ns •• For non-pipelined executionFor non-pipelined execution
–
– time = 60+50+90+80 = 280 nstime = 60+50+90+80 = 280 ns
•• Speedup for above case is: 280/100 = 2.8 !!Speedup for above case is: 280/100 = 2.8 !!
•• Pipeline Time for 1000 tasks = 1000 + 4-1= 1003*100 nsPipeline Time for 1000 tasks = 1000 + 4-1= 1003*100 ns •• Sequential time = 1000*280nsSequential time = 1000*280ns
•• Throughput= 1000/1003Throughput= 1000/1003 •• What is the problem here ?What is the problem here ?
Non-linear pipelines and
Non-linear pipelines and
pipeline control algorithms
pipeline control algorithms
•• Can have non-linear path in pipeline…
Can have non-linear path in pipeline…
–
– How to schedule instructions so they do no How to schedule instructions so they do no conflictconflict for resources
for resources
•• How does one control the
How does one control the pipeline at the
pipeline at the
microarchitect
microarchitecture
ure level
level
–
– How to build a scheduler in hardware ?How to build a scheduler in hardware ? –
– How much time does scheduler have to makeHow much time does scheduler have to make decision ?
Non-linear Dynamic Pipelines
Non-linear Dynamic Pipelines
•• Multiple processors (Multiple processors (k-stagesk-stages) as linear pipeline) as linear pipeline •• Variable functions of individual processorsVariable functions of individual processors
•• Functions may be dynamically assignedFunctions may be dynamically assigned •• Feedforward and feedback connectionsFeedforward and feedback connections
CS211 16 CS211 16
Reservation Tables
Reservation Tables
•• Reservation table : displays time-space flow of Reservation table : displays time-space flow of data data through the pipeline analogous to
through the pipeline analogous to opcodeopcode of pipelineof pipeline
–
– Not diagonal, as in linear pipelinesNot diagonal, as in linear pipelines
•• Multiple reservation tables for different functionsMultiple reservation tables for different functions
–
– Functions may be dynamically assignedFunctions may be dynamically assigned
•• Feedforward and feedback connectionsFeedforward and feedback connections
•• The number of columns in the reservation table :The number of columns in the reservation table : evaluation time of a given function
Reservation Tables (Examples)
Reservation Tables (Examples)
Reservation Tables (Examples)
Latency Analysis
Latency Analysis
•• Latency
Latency :
: the
the number
number of c
of clock
lock cycles
cycles
between two initiations of the
between two initiations of the pipeline
pipeline
•• Collision
Collision : an attempt by two
: an attempt by two initiations to use
initiations to use
the same pipeline stage at the same
the same pipeline stage at the same time
time
•• Some latencies cause collision, some not
Some latencies cause collision, some not
Collisions (Example)
Collisions (Example)
Collisions (Example)
Collisions (Example)
x x11 x x11xx22 x x11 x x22 xx33 x x22xx33 x x11 xx44 xx 1 1xx22 xx22xx33 x x44 x x33xx44 1 1 22 33 44 55 66 77 88 99 1100 16 16 16 16Latency Cycle
Latency Cycle
Latency Cycle
Latency Cycle
x x11 1 1 22 33 44 55 66 77 88 9 19 100 1111 1122 1133 1144 1155 1166 1177 x x11 x x11 x x11 x x11 x x11 x x11 x x22 xx11 x x22 x x22 x x22 x x22 x x22 xx33 x x22 x x22 x x33 x x33 x x33 x x33 18 18 x x33 Cycle Cycle Cycle Cycle Latency cycleLatency cycle : : the sequthe sequence of ence of initiations which initiations which has repehas repetitivetitive subsequence and without collisions
subsequence and without collisions Latency sequence length
Latency sequence length : : the the number number of time of time intervalsintervals within the cycle
within the cycle Average latency
Average latency : the sum of all latencies divided by: the sum of all latencies divided by the number of latencies along
the number of latencies along the cyclethe cycle
17
17
17 17
Collision Free
Collision Free
Collision Free
Collision Free
Scheduling
Scheduling
Scheduling
Scheduling
GoalGoal : : to find to find the shthe shortest aortest average verage latencylatency Lengths
Lengths : : for for reservation reservation table table withwith nn columns, maximum forbiddencolumns, maximum forbidden latency is
latency is m <= n – 1m <= n – 1, and permissible latency, and permissible latency p p isis 1
1 <= <= p p <= <= m m – – 11 Ideal case
Ideal case : : p p = = 1 1 (static (static pipeline)pipeline) Collision vector
Collision vector :: C = (C C = (C mmC C m-1m-1 . . .C . . .C 2 2C C 11 ) )
[[ C C i i = 1= 1 if latencyif latency i i causes collision ]causes collision ] [[ C C i i = 0= 0 for permissible latencies ]for permissible latencies ]
18
18
18 18
Collision Vector
Collision Vector
Collision Vector
Collision Vector
x x x x x x x x x x x xx
x
11x
x
11x
x
11x
x
11x
x
11x
x
11 Reservation Table Reservation Table x x22 x x22 x x22 x x22 x x22 x x22C
C =
= (?
(? ?
? .
. .
. .
. ?
? ?)
?)
Value XValue X11 Value XValue X
2 2 19 19 19 19
Back
Back
to
to
our
our
focus:
focus:
Computer
Computer
Pipelines
Pipelines
•• Execute
Execute billions of instructions
billions of instructions
, so
, so
throughput is what matters
throughput is what matters
•• MIPS desirable features:
MIPS desirable features:
–
– all instructions same length,all instructions same length, –
– registers located in same place in instructionregisters located in same place in instruction format,
format, –
Designing a Pipelined
Designing a Pipelined
Processor
Processor
•• Go back and examine your datapath and control
Go back and examine your datapath and control
diagram
diagram
•• associated resources with states
associated resources with states
•• ensure that flows do not conflict, or f
ensure that flows do not conflict, or figure out how
igure out how
to resolve
to resolve
5 Steps of MIPS
5 Steps of MIPS
Datapath
Datapath
Memory Memory Access Access Writ Writ e e Back Back Instruction Instruction Fetch Fetch Instr. Decode Instr. Decode Reg. Fetch Reg. Fetch Execute Execute Addr. Addr. Calc Calc L L M M D D A A L L U U M M U U X X M M e emm o or r y y R R e e g g F F i i l l e e M M U U X X M M U U X X D D a at t a a M M e emm o or r y y M M U U X X Sign Sign Extend Extend
4
4
A A d d d d e e r r Zero?Zero? Next SEQ PC Next SEQ PC A A d d d d r r e e s s s s Next PC Next PC I I n n s s t t RD RD RS1 RS1 RS2 RS2 Imm ImmWhat do we need to do to pipeline the process ?
What do we need to do to pipeline the process ?
5 Steps of MIPS/DLX Datapath
5 Steps of MIPS/DLX Datapath
Memory Memory Access Access Writ Writ e e Back Back Instruction Instruction Fetch Fetch Instr. Decode Instr. Decode Reg. Fetch Reg. Fetch Execute Execute Addr. Addr. Calc Calc A A L L U U M M e emm o or r y y R R e e g g F F i i l l e e M M U U X X M M U U X X D D a at t a a M M e emm o or r y y M M U U X X Sign Sign Extend Extend Zero? Zero? I I F F /
/ I I DD / / I I DDE E X X M M E E M M / / WW B B E E X X / / MM E E M M
4
4
A A d d d d e e r r Next SEQ PCNext SEQ PC Next SEQ PCNext SEQ PC
R RDD RRDD RRDD WW BB DD aatt aa
•
• Data stationary control
Data stationary control
Next PC Next PC A A d d d d r r e e s s s s RS1 RS1 RS2 RS2 Imm Imm M M U U X X
Graphically Representing
Graphically Representing
Pipelines
Pipelines
•• Can help with
Can help with answering questions like:
answering questions like:
–
– how many cycles does it take to how many cycles does it take to execute this code?execute this code? –
– what is the ALU doing during cycle 4? what is the ALU doing during cycle 4? –
Visualizing Pipelining
Visualizing Pipelining
II n n s s t t r. r. O O r r d d e e r rTime (clock cycles) Time (clock cycles)
Reg Reg A A L L U U DMem DMem Ifetch
Ifetch RegReg
Reg Reg A A L L U U DMem DMem Ifetch
Ifetch RegReg
Reg Reg A A L L U U DMem DMem Ifetch
Ifetch RegReg
Reg Reg A A L L U U DMem DMem Ifetch
Ifetch RegReg
C
Conventional Pipelined Execution
Conventional Pipelined Execution
Representation
Representation
IIFFeettcchh DDccdd EExxeecc MMeemm WWBB I IFFeettcchh DDccdd EExxeecc MMeemm WWBB I IFFeettcchh DDccdd ExExeecc MMeemm WWBB I IFFeettcchh DDccdd EExxeecc MMeemm WWBB IIFFeettcchh DDccdd EExxeecc MMeemm WWBB IIFFeettcchh DDccdd EExxeecc MMeemm WWBB Program Flow Program Flow Time TimeSingle Cycle, Multiple Cycle,
Single Cycle, Multiple Cycle,
vs. Pipeline
vs. Pipeline
Clk Clk Cycle 1 Cycle 1Multiple Cycle Implementation: Multiple Cycle Implementation:
IIffeettcchh RReegg EExxeecc MMeemm WWrr C
Cyycclle e 22 CCyycclle e 33 CCyycclle e 44 CCyycclle e 55 CCyycclle e 66 CCyycclle e 77 CCyycclle e 88 CCyycclle e 99 CCyycclle e 1100
L
Looaadd IIffeettcchh RReegg EExxeecc MMeemm WWrr
IIffeettcchh RReegg EExxeecc MMeemm L
Looaadd SSttoorree
Pipeline
Pipeline ImplementatioImplementation:n:
IIffeettcchh RReegg EExxeecc MMeemm WWrr Store
Store Clk
Clk
Single Cycle
Single Cycle Implementation:Implementation: L
Looaadd SSttoorree WWaassttee
Ifetch Ifetch R-type R-type
IIffeettcchh RReegg EExxeecc MMeemm WWrr R-type
R-type C
The Five Stages
The Five Stages
of Load
of Load
•• Ifetch: Instruction Fetch
Ifetch: Instruction Fetch
–
– Fetch the instruction from tFetch the instruction from the Instruction Memoryhe Instruction Memory
•• Reg/Dec:
Reg/Dec: Registers
Registers Fetch
Fetch and I
and Instruction
nstruction Decode
Decode
•• Exec: Calculate the memory address
Exec: Calculate the memory address
•• Mem: Read the data from the Data Memory
Mem: Read the data from the Data Memory
•• Wr: Write the data back to the register file
Wr: Write the data back to the register file
C
Cyycclle e 11 CCyycclle e 22 CCyycclle e 33 CCyycclle e 44 CCyycclle e 55
IIffeettcchh RReegg//DDeecc EExxeecc MMeemm WWrr Load
The Four Stages
The Four Stages
of R-type
of R-type
•• Ifetch: Instruction Fetch
Ifetch: Instruction Fetch
–
– Fetch the instruction from tFetch the instruction from the Instruction Memoryhe Instruction Memory
•• Reg/Dec:
Reg/Dec: Registers
Registers Fetch
Fetch and I
and Instruction
nstruction Decode
Decode
•• Exec:
Exec:
–
– ALU operates on the two register operandsALU operates on the two register operands –
– Update PCUpdate PC
•• Wr: Write the ALU output back to the register file
Wr: Write the ALU output back to the register file
C
Cyycclle e 11 CCyycclle e 22 CCyycclle e 33 CCyycclle e 44
IIffeettcchh RReegg//DDeecc EExxeecc WWrr R-type
Pipelining the R-type and
Pipelining the R-type and
Load Instruction
Load Instruction
•• We have pipeline conflict or structural hazard:
We have pipeline conflict or structural hazard:
–
– Two instructions try to write to the Two instructions try to write to the register file at the sameregister file at the same time!
time!
Clock Clock
C
Cyycclle e 11 CCyycclle e 22 CCyycclle e 33 CCyycclle e 44 CCyycclle e 55 CCyycclle e 66 CCyycclle e 77 CCyycclle e 88 CCyycclle e 99
IIffeettcchh RReegg//DDeecc EExxeecc WWrr R-type
R-type
IIffeettcchh RReegg//DDeecc EExxeecc WWrr R-type
R-type
IIffeettcchh RReegg//DDeecc EExxeecc MMeemm WWrr Load
Load
IIffeettcchh RReegg//DDeecc EExxeecc WrWr
R-type R-type
IIffeettcchh RReegg//DDeecc EExxeecc WWrr R-type
R-type
Ops!
Important
Important
Observation
Observation
•• Each functional unit can only be
Each functional unit can only be used
used once
once
per
per
instruction
instruction
•• Each functional unit must be used at the
Each functional unit must be used at the same
same
stage
stage
for all
for all instructions:
instructions:
–
– Load uses Load uses Register FRegister File’s Write ile’s Write Port duPort during ring itsits 5th5th stagestage
–
– R-type uses Register File’s Write Port during itsR-type uses Register File’s Write Port during its 4th4th stagestage
IIffeettcchh RReegg//DDeecc EExxeecc MeemM m WWrr Load
Load
1 2
1 2 33 44 55
IIffeettcchh RReegg//DDeecc EExxeecc WWrr R-type
R-type
1 2
1 2 33 44
Solution 1: Insert “Bubble”
Solution 1: Insert “Bubble”
into the Pipeline
into the Pipeline
•• Insert a “bubble” into the
Insert a “bubble” into the pipeline to prevent 2 writes
pipeline to prevent 2 writes
at the same cycle
at the same cycle
–
– The control logic can be complex.The control logic can be complex. –
– Lose instruction fetch and issue opportunity.Lose instruction fetch and issue opportunity.
Clock Clock
C
Cyycclle e 11 CCyycclle e 22 CCyycclle e 33 CCyycclle e 44 CCyycclle e 55 CCyycclle e 66 CCyycclle e 77 CCyycclle e 88 CCyycclle e 99
IIffeettcchh RReegg//DDeecc EExxeecc WWrr R-type
R-type
IIffeettcchh RReegg//DDeecc EExxeecc IIffeettcchh RReegg//DDeecc EExxeecc MMeemm WWrr
Load Load
IIffeettcchh RReegg//DDeecc EExxeecc WWrr R-type
R-type
IIffeettcchh RReegg//DDeecc EExxeecc WWrr R-type
R-type PipelinePipeline
Bubble
Bubble
Solution 2: Delay R-type’s
Solution 2: Delay R-type’s
Write by One Cycle
Write by One Cycle
•• Delay R-type’s register write by one cycle:Delay R-type’s register write by one cycle:
–
– Now R-type instructions also use Reg File’s write port at Stage 5Now R-type instructions also use Reg File’s write port at Stage 5 –
– Mem stage is a Mem stage is a NOOPNOOP stage: nothing is being done.stage: nothing is being done.
Clock Clock
C
Cyycclle e 11 CCyycclle e 22 CCyycclle e 33 CCyycclle e 44 CCyycclle e 55 CCyycclle e 66 CCyycclle e 77 CCyycclle e 88 CCyycclle e 99
IIffeettcchh RReegg//DDeecc MemMem WrWr R-type
R-type
IIffeettcchh RReegg//DDeecc MemMem WrWr R-type
R-type
IIffeettcchh RReegg//DDeecc EExxeecc MMeemm WWrr Load
Load
IIffeettcchh RReegg//DDeecc MemMem WrWr R-type
R-type
IIffeettcchh RReegg//DDeecc MemMem WrWr R-type
R-type
IIffeettcchh RReegg//DDeecc ExecExec WrWr R-type
R-type MemMem
Exec Exec Exec Exec Exec Exec Exec Exec 1 1 22 33 44 55
Why
Why
Pipeline?
Pipeline?
•• Suppose we execute 100
Suppose we execute 100 instructions
instructions
•• Single Cycle Machine
Single Cycle Machine
–
– 45 ns/cycle 45 ns/cycle x 1 CPI x 1 CPI x 100 insx 100 inst = 4500 t = 4500 nsns
•• Multicycle Machine
Multicycle Machine
–
– 10 ns/cycle x 4.6 CPI (due to 10 ns/cycle x 4.6 CPI (due to inst mix) x 100 inst = inst mix) x 100 inst = 4600 ns4600 ns
•• Ideal pipelined machine
Ideal pipelined machine
–
Why Pipeline? Because the
Why Pipeline? Because the
resources are there!
resources are there!
II n n s s t t r. r. O O r r d d e e r r
Time (clock cycles) Time (clock cycles)
Inst 0
Inst 0
Inst 1
Inst 1
Inst 2
Inst 2
Inst 4
Inst 4
Inst 3
Inst 3
A A L L U U ImIm RegReg DDmm RReegg
A A L L U U Im
Im RegReg DDmm RReegg
A A L L U U Im
Im RegReg DDmm RReegg
A A L L U U Im
Im RegReg DDmm RReegg
A A L L U U Im
Problems with Pipeline
Problems with Pipeline
processors?
processors?
•• Limits to pipelining:Limits to pipelining: HazardsHazards prevent next instruction fromprevent next instruction from executing during its designated clock cycle
executing during its designated clock cycle and introduceand introduce stall cycles which increase CPI
stall cycles which increase CPI
–
– Structural hazardsStructural hazards: HW cannot support this combination of: HW cannot support this combination of instructions - two dogs fighting for
instructions - two dogs fighting for the same bonethe same bone –
– Data hazardsData hazards: Instruction depends on result of prior: Instruction depends on result of prior instruction still in the pipeline
instruction still in the pipeline »
» Data dependenciesData dependencies –
– Control hazardsControl hazards: Caused by delay between the fetching of: Caused by delay between the fetching of instructions and decisions about changes in control flow instructions and decisions about changes in control flow (branches and jumps).
(branches and jumps). »
» Control dependenciesControl dependencies
•• Can always resolve hazards by stallingCan always resolve hazards by stalling
•• More stall cycles = more CPU time = More stall cycles = more CPU time = less performanceless performance
–
Back to our old friend: CPU
Back to our old friend: CPU
time equation
time equation
•• Recall equation for CPU timeRecall equation for CPU time
•• So what are we doing by So what are we doing by pipelining the instructionpipelining the instruction execution process ?
execution process ?
–
– Clock ?Clock ? –
– Instruction Count ?Instruction Count ? –
– CPI ?CPI ? »
Speed Up Equation for
Speed Up Equation for
Pipelining
Pipelining
pipeline pipeline d d unpipelin unpipelinTime
Time
Cycle
Cycle
Time
Time
Cycle
Cycle
CPI
CPI
stall
stall
Pipeline
Pipeline
CPI
CPI
Ideal
Ideal
depth
depth
Pipeline
Pipeline
CPI
CPI
Ideal
Ideal
Speedup
Speedup
d d unpipelin unpipelinTime
Time
Cycle
Cycle
depth
depth
Pipeline
Pipeline
Speedup
Speedup
Inst
Inst
per
per
cycles
cycles
Stall
Stall
Average
Average
CPI
CPI
Ideal
Ideal
CPI
CPI
pipelinedpipelinedFor simple RISC pipeline, CPI = 1:
One Memory Port/Structural
One Memory Port/Structural
Hazards
Hazards
II n n s s t t r. r. O O r r d d e e r rTime (clock cycles) Time (clock cycles)
Load
Load
Instr 1
Instr 1
Instr 2
Instr 2
Instr 3
Instr 3
Instr 4
Instr 4
Reg Reg A A L L U U DMem DMem IfetchIfetch RegReg
Reg Reg A A L L U U DMem DMem Ifetch
Ifetch RegReg
Reg Reg A A L L U U DMem DMem Ifetch
Ifetch RegReg
Reg Reg A A L L U U DMem DMem Ifetch
Ifetch RegReg
C
Cyycclle e 11CCyycclle e 22 CCyycclle e 3 C3 Cyycclle e 44Cycle 5Cycle 5 CCyycclle e 66CCyycclle e 77
Reg Reg A A L L U U DMem DMem Ifetch
One Memory Port/Structural
One Memory Port/Structural
Hazards
Hazards
II n n s s t t r. r. O O r r d d e eTime (clock cycles) Time (clock cycles)
Load
Load
Instr 1
Instr 1
Instr 2
Instr 2
Stall
Stall
Reg Reg A A L L U U DMem DMem IfetchIfetch RegReg
Reg Reg A A L L U U DMem DMem Ifetch
Ifetch RegReg
Reg Reg A A L L U U DMem DMem Ifetch
Ifetch RegReg
C
Cyycclle e 11CCyycclle e 22 CCyycclle e 3 C3 Cyycclle e 44Cycle 5Cycle 5 CCyycclle e 66CCyycclle e 77
B
Example: Dual-port vs.
Example: Dual-port vs.
Single-port
port
••
Machine A: Dual ported memory (“Harvard
Machine A: Dual ported memory (“Harvard
Architecture”)
Architecture”)
••
Machine B: Single ported memory, but its
Machine B: Single ported memory, but its
pipelined implementation has a 1.05 times faster
pipelined implementation has a 1.05 times faster
clock rate
clock rate
••
Ideal CPI = 1 for both
Ideal CPI = 1 for both
••
Note - Loads will cause stalls of 1 cycle
Note - Loads will cause stalls of 1 cycle
••
Recall our friend:
Recall our friend:
–
–
CPU = IC*CPI*Clk
CPU = IC*CPI*Clk
»
Example…
Example…
•• Machine A: Dual ported memory (“HarvardMachine A: Dual ported memory (“Harvard Architecture”)
Architecture”)
•• Machine B: Single ported memory, but its pipelinedMachine B: Single ported memory, but its pipelined implementation has a 1.05 times faster clock rate implementation has a 1.05 times faster clock rate •• Ideal CPI = 1 for bothIdeal CPI = 1 for both
•• Loads are 40% of instructions executedLoads are 40% of instructions executed SpeedUp
SpeedUpAA = Pipe. Depth/(1 + 0) x (clock= Pipe. Depth/(1 + 0) x (clockunpipeunpipe/clock/clockpipepipe)) = Pipeline Depth
= Pipeline Depth SpeedUp
SpeedUpBB = Pipe. Depth/(1 + 0.4 x 1) x= Pipe. Depth/(1 + 0.4 x 1) x (clock
(clockunpipeunpipe/(clock/(clockunpipeunpipe/ 1.05)/ 1.05) = (Pi
= (Pipe. pe. Depth/1.4) Depth/1.4) x x 1.051.05 = 0.75 x Pipe. Depth
= 0.75 x Pipe. Depth SpeedUp
Data Dependencies
Data Dependencies
•• True dependencies and False
True dependencies and False dependencies
dependencies
–
– false implies we can remove the dependencyfalse implies we can remove the dependency –
– true implies we are stuck with it!true implies we are stuck with it!
•• Three types of data
Three types of data dependencies defined in
dependencies defined in
terms of how
terms of how succeeding instruct
succeeding instruction depends
ion depends
on preceding instruction
on preceding instruction
–
– RAW: Read after Write or Flow dependencyRAW: Read after Write or Flow dependency –
– WAR: Write after Read WAR: Write after Read or anti-dependencyor anti-dependency –
•
•
Read After Write (RAW)
Read After Write (RAW)
Instr
Instr
JJtries to read operand before Instr
tries to read operand before Instr
IIwrites it
writes it
•• Caused by a “
Caused by a “Dependence
Dependence
” (in compiler
” (in compiler
nomenclature).
nomenclature). This
This hazard re
hazard results
sults from an
from an actual
actual
need for
need for communication.
communication.
Three Generic Data Hazards
Three Generic Data Hazards
I: add
I: add r1
r1
,r2,r3
,r2,r3
J: sub r4,
RAW Dependency
RAW Dependency
•• Example program (a) with two
Example program (a) with two instructions
instructions
–
– i1: load r1, a;i1: load r1, a; –
– i2: add r2, r1,r1;i2: add r2, r1,r1;
•• Program (b) with two
Program (b) with two instructions
instructions
–
– i1: mul r1, r4, r5;i1: mul r1, r4, r5; –
– i2: add r2, r1, r1;i2: add r2, r1, r1;
•• Both cases we cannot read in i2 until i1 has
Both cases we cannot read in i2 until i1 has
completed writing the result
completed writing the result
–
– In (a) this is due toIn (a) this is due to load-use dependency load-use dependency –
•
•
Write After Read (WAR)
Write After Read (WAR)
Instr
Instr
JJwrites operand
writes operand before
before
Instr
Instr
IIreads it
reads it
•• Called an “
Called an “anti-dependence
anti-dependence
” by compiler writers.
” by compiler writers.
This results from reuse of the name “
This results from reuse of the name “r1
r1
”.
”.
•• Can’t happen in MIPS 5 stage pipeline because:
Can’t happen in MIPS 5 stage pipeline because:
–
–
All instructions take 5 stages, and
All instructions take 5 stages, and
–
–
Reads are always in stage 2, and
Reads are always in stage 2, and
I: sub r4,
I: sub r4,r1
r1
,r3
,r3
J: add
J: add r1
r1
,r2,r3
,r2,r3
K: mul r6,r1,r7
K: mul r6,r1,r7
Three Generic Data Hazards
Three Generic Data Hazards
Three Generic Data Hazards
•
•
Write After Write (WAW)
Write After Write (WAW)
Instr
Instr
JJwrites operand
writes operand before
before
Instr
Instr
IIwrites it.
writes it.
•• Called an “
Called an “output dependence
output dependence
” by compiler writers
” by compiler writers
This also results from the reuse of name “
This also results from the reuse of name “ r1
r1
”.
”.
•• Can’t happen in MIPS 5
Can’t happen in MIPS 5 stage pipeline because:
stage pipeline because:
–
–
All instructions take 5 stages, and
All instructions take 5 stages, and
–
–
Writes are always in stage 5
Writes are always in stage 5
•• Will see WAR and
Will see WAR and WAW in later more
WAW in later more complicated
complicated
pipes
pipes
I: sub
I: sub r1
r1
,r4,r3
,r4,r3
J: add
J: add r1
r1
,r2,r3
,r2,r3
K: mul r6,r1,r7
K: mul r6,r1,r7
WAR
WAR
and
and
WAW
WAW
Dependency
Dependency
•• Example program (a):
Example program (a):
–
– i1: mul r1, r2, r3;i1: mul r1, r2, r3; –
– i2: add r2, r4, r5;i2: add r2, r4, r5;
•• Example program (b):
Example program (b):
–
– i1: mul r1, r2, r3;i1: mul r1, r2, r3; –
– i2: add r1, r4, r5;i2: add r1, r4, r5;
•• both cases we have dependence between i1 and i2
both cases we have dependence between i1 and i2
–
– in (a) due to r2 must be in (a) due to r2 must be read before it is written intoread before it is written into –
– in (b) due to r1 in (b) due to r1 must be written by i2 after it must be written by i2 after it has been writtenhas been written into by i1
What to do with WAR and
What to do with WAR and
WAW ?
WAW ?
•• Problem:
Problem:
–
– i1: mul r1, r2, r3;i1: mul r1, r2, r3; –
– i2: add r2, r4, r5;i2: add r2, r4, r5;
What to do with WAR and
What to do with WAR and
WAW
WAW
•• Solution:
Solution: Rename
Rename Registers
Registers
–
– i1: mul r1, r2, r3;i1: mul r1, r2, r3; –
– i2: add r6, r4, r5;i2: add r6, r4, r5;
•• Register renaming can solve many of these
Register renaming can solve many of these
false dependencies
false dependencies
–
– note the role that the compiler plays note the role that the compiler plays in thisin this –
– specifically, the register allocation process--i.e., thespecifically, the register allocation process--i.e., the process that assigns registers to variables
Hazard Detection in H/W
Hazard Detection in H/W
•• Suppose instructionSuppose instruction i i is about to be issued and a is about to be issued and a predecessor instruction
predecessor instruction j j is in the instruction pipelineis in the instruction pipeline •• How to detect and store potential hazard informationHow to detect and store potential hazard information
–
– Note that hazards in machine code are based on Note that hazards in machine code are based on registerregister usage
usage –
– Keep track of results in Keep track of results in registers and their usageregisters and their usage »
» Constructing a register data flow graphConstructing a register data flow graph
•• For each instructionFor each instruction i i construct set of Read registersconstruct set of Read registers and Write registers
and Write registers
–
– Rregs(i) is set of registers that instruction i Rregs(i) is set of registers that instruction i reads fromreads from –
– Wregs(i) is set of registers that instruction i writes toWregs(i) is set of registers that instruction i writes to –
Hazard Detection in
Hazard Detection in
Hardware
Hardware
•• A RAW hazard exists on registerA RAW hazard exists on register ifif Rregs(Rregs( i i )) Wregs(Wregs( j j )) –
– Keep a record of pending writes (for inst's in the Keep a record of pending writes (for inst's in the pipe) andpipe) and compare with operand regs of
compare with operand regs of current instruction.current instruction. –
– When instruction issues, reserve its result register.When instruction issues, reserve its result register. –
– When on operation completes, remove its write reservation.When on operation completes, remove its write reservation.
•• A WAW hazard exists on registerA WAW hazard exists on register ifif Wregs(Wregs( i i )) Wregs(Wregs( j j )) •• A WAR hazard exists on registerA WAR hazard exists on register ifif Wregs(Wregs( i i )) Rregs(Rregs( j j ))
ρ ρ ρ ρ ρ ρ
Internal Forwarding: Getting
Internal Forwarding: Getting
rid of some hazards
rid of some hazards
•• In some cases the data needed by the next
In some cases the data needed by the next
instruction at the ALU stage has been
instruction at the ALU stage has been
computed by the ALU (or some stage defining
computed by the ALU (or some stage defining
it) but has not been written back to the
it) but has not been written back to the
registers
registers
•• Can we “forward”
Can we “forward” this result by bypassing
this result by bypassing
stages ?
II n n s s t t r. r. O O r r d d e e r r
add
add r1
r1
,r2,r3
,r2,r3
sub r4,
sub r4,r1
r1
,r3
,r3
and r6,
and r6,r1
r1
,r7
,r7
or
r8,
or
r8,r1
r1
,r9
,r9
Reg Reg A A L L U U DMem DMem IfetchIfetch RegReg
Reg Reg A A L L U U DMem DMem Ifetch
Ifetch RegReg
Reg Reg A A L L U U DMem DMem Ifetch
Ifetch RegReg
Reg Reg A A L L U U DMem DMem Ifetch
Ifetch RegReg
Data Hazard on R1
Data Hazard on R1
Time (clock cycles) Time (clock cycles)
Time (clock cycles) Time (clock cycles)
Forwarding to Avoid Data
Forwarding to Avoid Data
Hazard
Hazard
II n n s s t t r. r. O O r r d d e e r radd
add r1
r1
,r2,r3
,r2,r3
sub r4,
sub r4,r1
r1
,r3
,r3
and r6,
and r6,r1
r1
,r7
,r7
or
r8,
or
r8,r1
r1
,r9
,r9
xor r10,
xor r10,r1
r1
,r11
,r11
Reg Reg A A L L U U DMem DMem IfetchIfetch RegReg
Reg Reg A A L L U U DMem DMem Ifetch
Ifetch RegReg
Reg Reg A A L L U U DMem DMem Ifetch
Ifetch RegReg
Reg Reg A A L L U U DMem DMem Ifetch
Ifetch RegReg
Reg Reg A A L L U U DMem DMem Ifetch
Internal Forwarding of
Internal Forwarding of
Instructions
Instructions
•• Forward result from ALU/Execute unit to
Forward result from ALU/Execute unit to
execute unit in next stage
execute unit in next stage
•• Also can be used in cases of memory access
Also can be used in cases of memory access
•• in some cases, operand fetched from memory hasin some cases, operand fetched from memory has been computed previously by the program
been computed previously by the program –
– can we “forward” this result to a can we “forward” this result to a later stage thuslater stage thus avoiding an extra read from memory ?
avoiding an extra read from memory ? –
– Who does this ?Who does this ?
•• Internal forwarding casesInternal forwarding cases –
– Stage i to Stage i+k in pipelineStage i to Stage i+k in pipeline –
Internal Data
Internal Data
Forwarding
Forwarding
Store-load forwarding
Store-load forwarding
Memory Memory M M Access Unit Access Unit R R11 RR22 STO M,R1 STO M,R1 LD R2,MLD R2,M Memory Memory M M Access Unit Access Unit R R11 RR22 STO M,R1STO M,R1 MOVE R2,R1MOVE R2,R1
38
38
38 38
Internal Data
Internal Data
Forwarding
Forwarding
Load-load forwarding Load-load forwarding Memory Memory M M Access Unit Access Unit R R11 RR22 LD R1,M LD R1,M LD R2,MLD R2,M Memory Memory M M Access Unit Access Unit R R11 RR22 LD R1,M LD R1,M MOVE R2,R1MOVE R2,R1 39 39 39 39Internal Data
Internal Data
Forwarding
Forwarding
Store-store forwarding
Store-store forwarding
Memory Memory M M Access Unit Access Unit R R11 RR22 STOSTO M, M, R1R1 STO M,R2STO M,R2
Memory Memory M M Access Unit Access Unit R R11 RR22 STO M,R2 STO M,R2 40 40 40 40
HW Change for
HW Change for
Forwarding
Forwarding
M M E E M M / / WW R R I I D D / / E E X X E E X X / / MM E E M M DataData Memory Memory A A L L U U m m u uxx m m u uxx R R e e g gi i s s t t e er r s s NextPC NextPC Immediate Immediate m m u uxx
What about memory
What about memory
operations?
operations?
A A BB op Rd Ra Rb op Rd Ra Rb op Rd Ra Rb op Rd Ra Rb Rd Rd to reg to reg file file R R Rd Rd ºº If instructions are initiated in order andIf instructions are initiated in order andoperations always occur in the same stage, operations always occur in the same stage, there ca
there can be n be no hazardno hazards between s between memorymemory operations!
operations!
ºº What does delaying WB on What does delaying WB on arithmeticarithmetic operations cost? operations cost? – cycles ? – cycles ? – hardware ? – hardware ?
ºº What about data dependence on loads?What about data dependence on loads? R1 <- R4 + R5 R1 <- R4 + R5 R2 <- Mem[ R2 + I ] R2 <- Mem[ R2 + I ] R3 <- R2 + R1 R3 <- R2 + R1 “
“Delayed LoadsDelayed Loads””
ºº Can recognize this in decode stage andCan recognize this in decode stage and introduce bubble while stalling fetch stage introduce bubble while stalling fetch stage ºº Tricky situation:Tricky situation:
R1 <- Mem[ R2 + I ] R1 <- Mem[ R2 + I ] Mem[R3+34] <- R1 Mem[R3+34] <- R1
Handle with bypass in memory stage! Handle with bypass in memory stage!
D D Mem Mem T T
Time (clock cycles) Time (clock cycles)
II n n s s t t r. r. O O r r d d
lw
lw
r1, 0
r1, 0
(r2)
(r2)
sub r4,
sub r4,r1
r1
,r6
,r6
and r6,
and r6,r1
r1
,r7
,r7
Data Hazard Even with
Data Hazard Even with
Forwarding
Forwarding
Reg Reg A A L L U U DMem DMem IfetchIfetch RegReg
Reg Reg A A L L U U DMem DMem Ifetch
Ifetch RegReg
Reg
Reg A A L L U U DMemDMem Ifetch
Data Hazard Even with
Data Hazard Even with
Forwarding
Forwarding
Time (clock cycles) Time (clock cycles)
or r8,
or r8,r1
r1
,r9
,r9
II n n s s t t r. r. O O r r d d e e r rlw
lw
r1, 0
r1, 0
(r2)
(r2)
sub r4,
sub r4,r1
r1
,r6
,r6
and r6,
and r6,r1
r1
,r7
,r7
Reg Reg A A L L U U DMem DMem IfetchIfetch RegReg
Reg Reg Ifetch Ifetch A A L L U U DMem
DMem RegReg
Bubble Bubble Ifetch Ifetch A A L L U U DMem
DMem RegReg
Bubble
Bubble RegReg
Ifetch Ifetch A A L L U U DMem DMem Bubble
What can we (S/W) do?
Try producing fast code for
Try producing fast code for
a = b + c;
a = b + c;
d = e – f;
d = e – f;
assuming a, b, c, d ,e,
assuming a, b, c, d ,e, and f in memory.
and f in memory.
Slow code: Slow code: LW Rb,b LW Rb,b LW LW RcRc,c,c ADD Ra,Rb, ADD Ra,Rb,RcRc SW a,Ra SW a,Ra LW Re,e LW Re,e LW LW RfRf,f,f SUB Rd,Re, SUB Rd,Re,RfRf
Software Scheduling to Avoid
Software Scheduling to Avoid
Load Hazards
Load Hazards
Fast code: Fast code: LW Rb,b LW Rb,b LW Rc,c LW Rc,c LW Re,e LW Re,e ADD Ra,Rb,Rc ADD Ra,Rb,Rc LW Rf,f LW Rf,f SW a,Ra SW a,Ra SUB Rd,Re,Rf SUB Rd,Re,RfControl
Control
Hazards:
Hazards:
Branches
Branches
•• Instruction flow
Instruction flow
–
– Stream of instructions processed by Inst. Fetch unitStream of instructions processed by Inst. Fetch unit –
– Speed of “input flow” puts bound on rate oSpeed of “input flow” puts bound on rate off outputs generated
outputs generated
•• Branch instruction affects instruction flow
Branch instruction affects instruction flow
–
– Do not know next instruction to be executed untilDo not know next instruction to be executed until branch outcome known
branch outcome known
•• When we hit
When we hit a branch instruction
a branch instruction
–
– Need to compute target address (where to branch)Need to compute target address (where to branch) –
– Resolution of branch condition (true or false)Resolution of branch condition (true or false) –
Control Hazard on Branches
Control Hazard on Branches
Three Stage Stall
Three Stage Stall
10: beq r1,r3,36
10: beq r1,r3,36
14: and r2,r3,r5
14: and r2,r3,r5
18:
18: or
or r6,r1,r7
r6,r1,r7
22: add r8,r1,r9
22: add r8,r1,r9
36: xor r10,r1,r11
36: xor r10,r1,r11
Reg Reg A A L L U U DMem DMem IfetchIfetch RegReg
Reg Reg A A L L U U DMem DMem Ifetch
Ifetch RegReg
Reg Reg A A L L U U DMem DMem Ifetch
Ifetch RegReg
Reg Reg A A L L U U DMem DMem Ifetch
Ifetch RegReg
Reg Reg A A L L U U DMem DMem Ifetch
Branch Stall Impact
Branch Stall Impact
•• If CPI = 1, 30% branch,
If CPI = 1, 30% branch,
Stall 3 cycles => new CPI = 1.9!
Stall 3 cycles => new CPI = 1.9!
•• Two part solution:
Two part solution:
–
– Determine branch taken or not sooner, Determine branch taken or not sooner, ANDAND –
– Compute taken branch address earlierCompute taken branch address earlier
•• MIPS branch tests if register = 0 or
MIPS branch tests if register = 0 or
0
0
•• MIPS Solution:
MIPS Solution:
–
– Move Zero test to ID/RF stageMove Zero test to ID/RF stage –
Pipelined MIPS (DLX)
Pipelined MIPS (DLX)
Datapath
Datapath
Memory Memory Access Access Write Write Back Back Instruction Instruction Fetch Fetch Instr. Decode Instr. Decode Reg. Fetch Reg. Fetch Execute Execute Addr. Calc. Addr. Calc.This is the correct 1
This is the correct 1 cyclecycle
latency implementation!
Four Branch Hazard
Four Branch Hazard
Alternatives
Alternatives
#1: Stall until branch direction is clear
#1: Stall until branch direction is clear – flushing pipe– flushing pipe #2: Predict Branch Not Taken
#2: Predict Branch Not Taken
–
– Execute successor instructions in sequenceExecute successor instructions in sequence –
– “Squash” instructions in pipeline if branch actually taken“Squash” instructions in pipeline if branch actually taken –
– Advantage of late pipeline state updateAdvantage of late pipeline state update –
– 47% DLX branches not taken on average47% DLX branches not taken on average –
– PC+4 already calculated, so use it PC+4 already calculated, so use it to get next instructionto get next instruction
#3: Predict Branch Taken #3: Predict Branch Taken
–
– 53% DLX branches taken on average53% DLX branches taken on average –
– But haven’t calculated branch target address in DLXBut haven’t calculated branch target address in DLX
»
» DLX still incurs 1 DLX still incurs 1 cycle branch penaltycycle branch penalty »
Four Branch Hazard
Four Branch Hazard
Alternatives
Alternatives
#4: Delayed Branch #4: Delayed Branch
–
– Define branch to take placeDefine branch to take place AFTERAFTER a following instructiona following instruction
branch instruction branch instruction sequential successor sequential successor11 sequential successor sequential successor22 ... ... sequential successor sequential successornn branch target if taken branch target if taken –
– 1 slot delay allows 1 slot delay allows proper decision and branch targetproper decision and branch target address in 5 stage pipeline
address in 5 stage pipeline –
– DLX uses thisDLX uses this
Branch delay of length Branch delay of length nn
Delayed Branch
Delayed Branch
•• Where to get instructions to fill branch delay slotWhere to get instructions to fill branch delay slot??
–
– Before branch instructionBefore branch instruction –
– From the target address: only valuable when branch takenFrom the target address: only valuable when branch taken –
– From fall through: only valuable when branch not From fall through: only valuable when branch not takentaken –
– Cancelling branches allow more slots to be filledCancelling branches allow more slots to be filled
•• Compiler effectiveness for single branch delay slot:Compiler effectiveness for single branch delay slot:
–
– Fills about 60% of branch delay slotsFills about 60% of branch delay slots –
– About 80% of instructions executed in branch delay About 80% of instructions executed in branch delay slots usefulslots useful in computation
in computation –
– About 50% (60% x 80%) of slots usefully filledAbout 50% (60% x 80%) of slots usefully filled
•• Delayed Branch downside: 7-8 stage pipelines, multipleDelayed Branch downside: 7-8 stage pipelines, multiple instructions issued per clock (superscalar)
Evaluating Branch
Evaluating Branch
Alternatives
Alternatives
S
Scchheedduulliinngg BBrraanncchh CCPPII ssppeeeedduup p vv.. ssppeeeedduup p vv.. s
scchheemme e ppeennaallttyy uunnppiippeelliinneedd ssttaalll l S
Sttaalll l ppiippeelliinnee 33 11..4422 33..55 11..00 P
Prreeddiicct t ttaakkeenn 11 11..1144 44..44 11..2266 P
Prreeddiicct t nnoot t ttaakkeenn 11 11..0099 44..55 11..2299 D
Deellaayyeed d bbrraanncchh 00..55 11..0077 44..66 1.311.31
Conditional & Unconditional = 14%, 65% change PC
Conditional & Unconditional = 14%, 65% change PC
PP ii pp e l
e l ii nn e s
e s pp e e d
e e d uu p =
p =
PP ii pp e l
e l ii nn e d
e d e p
e p tt hh
1
Branch Prediction based on
Branch Prediction based on
history
history
•• Can we use history of branch behaviour to Can we use history of branch behaviour to predictpredict branch outcome ?
branch outcome ?
•• Simplest scheme: use 1 bit of “history”Simplest scheme: use 1 bit of “history”
–
– Set bit to Set bit to Predict Taken (T) or Predict Not-taken (NT)Predict Taken (T) or Predict Not-taken (NT) –
– Pipeline checks bit value Pipeline checks bit value and predictsand predicts »
» If incorrect then need to invalidate instructionIf incorrect then need to invalidate instruction –
– Actual outcome used to set the bit Actual outcome used to set the bit valuevalue
•• Example: let initial value = T, actual outcome ofExample: let initial value = T, actual outcome of branches
branches is- is- NT, NT, T,T,NT,T,TT,T,NT,T,T
–
– Predictions are:Predictions are: TT,, NTNT,T,,T,TT,,NTNT,T,T »
» 3 wrong (in red), 3 3 wrong (in red), 3 correct = 50% accuracycorrect = 50% accuracy
•• In general, can have k-bit predictors: more when weIn general, can have k-bit predictors: more when we cover superscalar processors.
Summary :
Summary :
Control and Pipelining
Control and Pipelining
•• Just overlap tasks; easy if tasks are independent
Just overlap tasks; easy if tasks are independent
•• Speed Up
Speed Up
Pipeline Depth; if ideal CPI is 1, then:
Pipeline Depth; if ideal CPI is 1, then:
•• Hazards limit performance on
Hazards limit performance on computers:
computers:
–
– Structural: need more HW resourcesStructural: need more HW resources –
– Data (RAW,WAR,WAW): need forwarding, compilerData (RAW,WAR,WAW): need forwarding, compiler scheduling
scheduling –
– Control: delayed branch, predictionControl: delayed branch, prediction
pipeline pipeline d d unpipelin unpipelin
Time
Time
Cycle
Cycle
Time
Time
Cycle
Cycle
CPI
CPI
stall
stall
Pipeline
Pipeline
1
1
depth
depth
Pipeline
Pipeline
Speedup
Speedup
Summary #1/2:
Summary #1/2:
Pipelining
Pipelining
•• What makes it easy
What makes it easy
–
– all instructions are the same lengthall instructions are the same length –
– just a few instruction formats just a few instruction formats –
– memory operands appear only in loads and storesmemory operands appear only in loads and stores
•• What makes it hard?
What makes it hard? HAZARDS!
HAZARDS!
–
– structural structural hazards: hazards: suppose suppose we we had had only only one one memorymemory –
– control hazardcontrol hazards: s: need to need to worry about worry about branch branch instructionsinstructions –
– data hazardata hazards: ds: an insan instruction depentruction depends on ds on a previousa previous instruction
instruction
•• Pipelines pass control information down the pipe
Pipelines pass control information down the pipe just
just
as data moves down pipe
as data moves down pipe
•• Forwarding/Stalls handled by local control
Forwarding/Stalls handled by local control
•• Exceptions stop the pipeline
Exceptions stop the pipeline
Introduction to ILP
Introduction to ILP
•• What is ILP?
What is ILP?
–
– Processor and Compiler design techniques thatProcessor and Compiler design techniques that speed up execution by
speed up execution by causing individual machinecausing individual machine operations to execute in parallel
operations to execute in parallel
•• ILP is transparent to the
ILP is transparent to the user
user
–
– Multiple operations executed in parallel evenMultiple operations executed in parallel even though the system is handed a single program though the system is handed a single program written with a sequential processor in mind written with a sequential processor in mind
•• Same execution hardware as a normal
Same execution hardware as a normal RISC
RISC
machine
Frontend and Optimizer Frontend and Optimizer
Determine Dependences Determine Dependences
Determine Independences Determine Independences
Bind Operations to Function Units Bind Operations to Function Units
Bind Transports to Busses Bind Transports to Busses
Determine Dependences Determine Dependences
Bind Transports to Busses Bind Transports to Busses
Execute Execute Superscalar Superscalar Dataflow Dataflow Indep. Arch. Indep. Arch. VLIW VLIW TTA TTA C
Coommppiilleer r HHaarrddwwaarree
Determine Independences Determine Independences
Bind Operations to Function Units Bind Operations to Function Units
B. Ramakrishna Rau and Joseph A.
B. Ramakrishna Rau and Joseph A. Fisher. InstructionFisher. Instruction-level parallel: History -level parallel: History overview, and perspective. Theoverview, and perspective. The Journal of Supercomputing, 7(1-2):9-50, May
Journal of Supercomputing, 7(1-2):9-50, May 1993.1993.