Computer Architecture and Organization
Chapter 8
What is Pipelining?
Implementation technique whereby multiple instructions are overlapped in execution
Throughput of an instruction pipeline is determined by how often an instruction exits the pipeline - the time per instruction does not change, only throughput increases
What is Pipelining?
Potential speedup = Number of pipe stages
Pipeline rate limited by slowest pipeline stage (length of the machine cycle)
Unbalanced lengths of pipe stages reduce speedup (involves overhead), since the clock can run no faster than the time needed for the slowest pipeline stage
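The effect of unbalanced stages can be checked with a few lines of Python. This is a minimal sketch: the function names and the stage times are made up for illustration, and per-stage overhead is ignored.

```python
def pipeline_cycle_time(stage_times_ns):
    """The clock period is set by the slowest stage."""
    return max(stage_times_ns)


def pipeline_speedup(stage_times_ns):
    """Steady-state speedup over running all stages back to back."""
    unpipelined_time = sum(stage_times_ns)
    return unpipelined_time / pipeline_cycle_time(stage_times_ns)


# Hypothetical stage lengths in ns (not taken from the slides).
balanced = [10, 10, 10, 10, 10]
unbalanced = [5, 5, 20, 10, 10]

print(pipeline_speedup(balanced))    # 5.0, equal to the number of stages
print(pipeline_speedup(unbalanced))  # 2.5, limited by the 20 ns stage
```

Both machines do 50 ns of work per instruction, but the unbalanced one must clock every stage at 20 ns, so half of the potential speedup is lost.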
What is Pipelining?
Time to “fill” the pipeline and time to “drain” it (pipeline bubbles or no-operations (NOPs), load delay) reduces speedup
Dependencies (data, procedural, resource conflicts, anti-dependency) stall the pipeline - implies that there would be a
chain of one or more hazards between the two instructions, thereby reducing or
True data dependency
ADD r1, r2 (r1 := r1+r2;)
MOVE r3,r1 (r3 := r1;)
Can fetch and decode second instruction in parallel with first
Can NOT execute second instruction until first is finished
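A true data dependency can be detected by comparing the register the first instruction writes with the registers the second one reads. The sketch below assumes a hypothetical `Instr` record (text, destination, sources); it is not a real assembler or simulator API.

```python
from typing import FrozenSet, NamedTuple


class Instr(NamedTuple):
    """Hypothetical instruction record used only for this illustration."""
    text: str
    dest: str
    sources: FrozenSet[str]


def has_true_dependency(first: Instr, second: Instr) -> bool:
    """True if `second` reads the register that `first` writes."""
    return first.dest in second.sources


add = Instr("ADD r1, r2", dest="r1", sources=frozenset({"r1", "r2"}))
move = Instr("MOVE r3, r1", dest="r3", sources=frozenset({"r1"}))

print(has_true_dependency(add, move))  # True: MOVE needs the new r1
print(has_true_dependency(move, add))  # False: ADD does not read r3
```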
Procedural Dependency
Cannot execute instructions after a branch in parallel with instructions before a branch
Also, if instruction length is not fixed, instructions have to be decoded to find out how many fetches are needed
This prevents simultaneous fetches
Resource Conflicts
Two or more instructions requiring access to the same resource at the same time
– e.g. two arithmetic instructions
Can duplicate resources
Anti-dependency
Write-after-read dependency
– R3:=R3 + R5; (I1)
– R4:=R3 + 1; (I2)
– R3:=R5 + 1; (I3)
– R7:=R3 + R4; (I4)
– I3 cannot complete before I2 starts, as I2 needs a value in R3 and I3 changes R3
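The I1 to I4 sequence can be scanned mechanically for pairs in which a later instruction overwrites a register that an earlier instruction still has to read. This is a small sketch; the tuple encoding of the instructions is an assumption made for the example.

```python
# (label, destination register, source registers) for I1..I4 above.
program = [
    ("I1", "R3", {"R3", "R5"}),
    ("I2", "R4", {"R3"}),
    ("I3", "R3", {"R5"}),
    ("I4", "R7", {"R3", "R4"}),
]


def anti_dependencies(instrs):
    """Return (earlier, later) pairs where `later` writes a register `earlier` reads."""
    pairs = []
    for a in range(len(instrs)):
        for b in range(a + 1, len(instrs)):
            earlier_label, _, earlier_sources = instrs[a]
            later_label, later_dest, _ = instrs[b]
            if later_dest in earlier_sources:
                pairs.append((earlier_label, later_label))
    return pairs


print(anti_dependencies(program))  # [('I1', 'I3'), ('I2', 'I3')]
```

The (I2, I3) pair is the one described above. The scan also reports (I1, I3), because I1 reads R3 as well.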
Dependencies
Extra incrementer to update the PC more often (instead of the ALU)
A separate MDR for loads (memory to CPU) and stores (CPU to memory)
High memory bandwidth to accommodate more data to and from memory
Stage
A stage of the pipeline is a portion of the pipeline execution where different phases in different instructions execute at the same time.
Stages are connected one to the next to form a pipe - instructions enter at one end, progress through the stages, and exit at the other end.
Depth
The depth of a pipeline is the number of instructions it can overlap
Designer’s Goal
To balance the length of each pipeline stage
For perfectly balanced stages:
Tip = Tiup / Ns
Tip = Time per instruction on the pipelined machine
Tiup = Time per instruction on unpipelined machine
Ns = Number of pipe stages
Example 1
Given: Consider an unpipelined machine with 6 execution stages of lengths 55 ns, 55 ns, 60 ns, 60 ns, 55 ns, and 55 ns.
Determine: Instruction latency on this machine;
How much time does it take to execute 100 instructions?
Solution to Example 1
Instruction latency = 55+55+60+60+55+55 = 340 ns
Time to execute 100 instructions = 100*340 = 34,000 ns
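The arithmetic of Example 1 can be reproduced directly. The helper names below are illustrative only.

```python
stage_times_ns = [55, 55, 60, 60, 55, 55]  # stage lengths from Example 1


def unpipelined_latency(stage_times):
    """One instruction passes through every stage in turn."""
    return sum(stage_times)


def unpipelined_total(stage_times, n_instructions):
    """Without overlap, each instruction pays the full latency."""
    return n_instructions * unpipelined_latency(stage_times)


print(unpipelined_latency(stage_times_ns))     # 340
print(unpipelined_total(stage_times_ns, 100))  # 34000
```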
Example 2
Given: Suppose we introduce pipelining on this machine. Assume that when introducing pipelining, the clock skew adds 10 ns of overhead to each execution stage
Determine: instruction latency on the pipelined machine
How much time does it take to execute 100 instructions?
Solution to Example 2
Remember that in the pipelined implementation, the length of the pipe stages must all be the same, i.e., the speed of the slowest stage plus overhead. With 10 ns overhead it comes to:
The length of pipelined stage = MAX(lengths of unpipelined stages) + overhead = 60 + 10 = 70 ns
Instruction latency = 6 stages * 70 ns = 420 ns (once the pipeline is full, one instruction completes every 70 ns)
Time to execute 100 instructions = 70*6*1 + 70*1*99 = 420 + 6930 = 7350 ns
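The same calculation can be written as a short sketch. It uses the 10 ns overhead from the solution above; the function names are illustrative. The first instruction needs one cycle per stage to fill the pipeline, and every later instruction completes one cycle after the previous one.

```python
stage_times_ns = [55, 55, 60, 60, 55, 55]  # stage lengths from Example 1
overhead_ns = 10                           # per-stage overhead used in the solution


def pipelined_stage_length(stage_times, overhead):
    """All pipe stages run at the speed of the slowest stage plus overhead."""
    return max(stage_times) + overhead


def pipelined_total(stage_times, overhead, n_instructions):
    """Fill the pipeline once, then finish one instruction per cycle."""
    cycle = pipelined_stage_length(stage_times, overhead)
    n_stages = len(stage_times)
    return cycle * (n_stages + n_instructions - 1)


print(pipelined_stage_length(stage_times_ns, overhead_ns))  # 70
print(pipelined_total(stage_times_ns, overhead_ns, 100))    # 7350
```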
What is the speedup obtained from pipelining?
Solution: not considering any stalls introduced by different types of hazards
Average instruction time not pipelined = 340 ns
Average instruction time pipelined = 70 ns
Speedup = 340 / 70 = 4.86
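As a quick check, the speedup can be computed both from the average instruction times and from the total times for the 100-instruction run in the two examples. The second figure is slightly lower because it includes the time to fill the pipeline. The function name is illustrative.

```python
def speedup(time_unpipelined, time_pipelined):
    """Ratio of unpipelined time to pipelined time."""
    return time_unpipelined / time_pipelined


steady_state = speedup(340, 70)     # average instruction times
whole_run = speedup(34_000, 7_350)  # totals for 100 instructions

print(round(steady_state, 2))  # 4.86
print(round(whole_run, 2))     # 4.63
```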
Instruction sequences (Sidetrack)
Arithmetic Instruction Sequence
1. Fetch the instruction from memory;
2. Decode the instruction (it is an arithmetic instruction, but the CPU has to find that out through a decode operation);
3. Fetch the operands from the register file;
4. Apply the operands to the ALU;
5. Write the result back to the register file.
Branch Instruction Sequence
1. Fetch the instruction from memory;
2. Decode the instruction (it is a branch instruction);
3. Fetch the components of the address from the instruction or register file;
4. Apply the components of the address to the ALU (address arithmetic);
5. Copy the resulting effective address into the PC, thus accomplishing the branch.
Load and Store Instructions Sequence
1. Fetch the instruction from memory;
2. Decode the instruction (it is a load or store instruction);
3. Fetch the components of the address from the instruction or register file;
4. Apply the components of the address to the ALU (address arithmetic);
5. Apply the resulting effective address to memory along with a read (load) or write (store) signal. If it is a write signal, the data item to be written
Frequency of occurrence of instruction types for a variety of languages. The percentages do not sum to 100 due to roundoff. (Adapted from [Knuth, 1991].)
Statement     Average Percent of Time
Assignment    47
If            23
Call          15
Loop          6
Goto          3
Other
Pipelined vs. Unpipelined
F – Fetch Instruction
D – Decode Instruction
O – Fetch Operand
E – Execute in the ALU
S – Store Result
[Figure: timing diagrams along a time axis. Unpipelined: the F D O E S sequences of successive instructions run one after another. Pipelined: the F D O E S sequences of successive instructions overlap in time. Annotation: both fetching and executing the instruction are done by one unit.]
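The two timing diagrams can be regenerated as text. The sketch below assumes that every stage takes exactly one clock cycle and that there are no stalls; the function names are illustrative.

```python
STAGES = ["F", "D", "O", "E", "S"]


def unpipelined_chart(n_instructions):
    """Each instruction runs all five stages before the next one starts."""
    return " ".join(STAGES * n_instructions)


def pipelined_chart(n_instructions):
    """Each instruction starts one cycle (two characters) after the previous one."""
    return ["  " * i + " ".join(STAGES) for i in range(n_instructions)]


def total_cycles(n_instructions, pipelined):
    """Cycle count under the one-cycle-per-stage, no-stall assumption."""
    k = len(STAGES)
    return k + n_instructions - 1 if pipelined else k * n_instructions


print(unpipelined_chart(3))
for row in pipelined_chart(3):
    print(row)
print(total_cycles(3, pipelined=False))  # 15
print(total_cycles(3, pipelined=True))   # 7
```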
Hazards
prevent the next instruction in the instruction stream from executing during its designated clock cycle
Three classes of hazards:
– Structural hazards
– Data hazards
– Control hazards
Structural Hazard
arise from resource conflicts when the hardware cannot support all possible combinations of instructions in simultaneous overlapped execution
Common instances of structural hazards arise when:
Some functional unit is not fully pipelined - then a sequence of instructions using that unpipelined unit cannot proceed at the rate of one per clock cycle
Some resource has not been duplicated enough to allow all combinations of instructions in the pipeline to execute
Solution of Structural Hazard
stall the pipeline for one clock cycle when a data-memory access occurs (bubbles)
Stall - prevent the succeeding instruction from doing its phase of the cycle
allows the stalled instruction to proceed without conflict but defeats the purpose of overlapping the instructions for the stage
Pipeline Bubble
[Figure: pipeline timing diagram over clock cycles 1 to 9 showing R-type instructions (Fetch, Decode, Exec, Store) and a Load instruction (Fetch, Decode, Oper, Exec, Store), with bubbles inserted.]
Insert a bubble into the pipeline to prevent two writes or stores at the same cycle
Delay Store Cycle
Add NOP (No Operation)
[Figure: pipeline timing diagram over clock cycles 1 to 9 showing R-type instructions (Fetch, Decode, Exec, Store) and a Load instruction (Fetch, Decode, Oper, Exec, Store), with NOPs inserted.]
Data Hazard
arise when an instruction depends on the result of a previous instruction in a way that is exposed by the overlapping of instructions in the pipeline
Data Hazard Classification
RAW (read after write) - j tries to read a source before i writes it, so j incorrectly gets the old value
Solution: Forwarding (“Forward” result from one stage to another)
– The ALU result from the Op Fetch/Exec register is always fed back to the ALU input latches
– If the forwarding hardware detects that the previous ALU operation has written the register corresponding to the source for the current ALU operation, control logic selects the forwarded result as the ALU input rather than the value read from the register file
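The selection made by the forwarding control logic can be sketched as a single comparison. This is a simplified software model under stated assumptions: the register file is a plain dictionary, only the immediately preceding ALU operation is considered, and all names are hypothetical.

```python
def select_alu_input(src_reg, regfile, prev_dest, prev_alu_result):
    """Pick the forwarded ALU result if the previous operation wrote src_reg,
    otherwise use the value read from the register file."""
    if prev_dest is not None and prev_dest == src_reg:
        return prev_alu_result
    return regfile[src_reg]


# Register file contents before "ADD r1, r2" has written its result back.
regfile = {"r1": 1, "r2": 2, "r3": 0}
prev_dest = "r1"
prev_alu_result = regfile["r1"] + regfile["r2"]  # 3, still in the pipeline register

# "MOVE r3, r1" reads r1: the forwarded 3 is chosen over the stale 1.
print(select_alu_input("r1", regfile, prev_dest, prev_alu_result))  # 3
# Reading r2 is unaffected and comes from the register file.
print(select_alu_input("r2", regfile, prev_dest, prev_alu_result))  # 2
```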
Data Hazard Classification
WAW (write after write) - j tries to write an operand before it is written by i. The writes end up being performed in the wrong order, leaving the value written by i rather than the value written by j in the destination
Solution: Stall
Data Hazard Classification
WAR (write after read) - j tries to write a destination before it is read by i, so i incorrectly gets the new value
Solution: Stall
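The three classes can be told apart by comparing the destination and source registers of an earlier instruction i and a later instruction j. The sketch below only reports which classes are possible for a pair; whether a stall is actually needed depends on the pipeline. The function name and argument layout are assumptions.

```python
def classify_hazards(i_dest, i_sources, j_dest, j_sources):
    """Return the data hazard classes possible between earlier i and later j."""
    hazards = set()
    if i_dest in j_sources:
        hazards.add("RAW")  # j reads what i writes
    if i_dest == j_dest:
        hazards.add("WAW")  # both write the same register
    if j_dest in i_sources:
        hazards.add("WAR")  # j writes what i reads
    return hazards


# i: R3 := R3 + R5, j: R4 := R3 + 1
print(sorted(classify_hazards("R3", {"R3", "R5"}, "R4", {"R3"})))  # ['RAW']
# i: R4 := R3 + 1, j: R3 := R5 + 1
print(sorted(classify_hazards("R4", {"R3"}, "R3", {"R5"})))        # ['WAR']
# i: R3 := R3 + R5, j: R3 := R5 + 1
print(sorted(classify_hazards("R3", {"R3", "R5"}, "R3", {"R5"})))  # ['WAR', 'WAW']
```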
Control Hazard
arise from the pipelining of branches and other instructions that change the PC
Solution: Stall the pipeline as soon as the branch is detected - Program Counter value changed to target address
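The cost of stalling on branches can be estimated with a simple model. This is a sketch under loud assumptions: every branch stalls the pipeline for the same fixed number of cycles, there are no other hazards, and the example numbers are hypothetical rather than taken from the slides.

```python
def cycles_with_branch_stalls(n_instructions, n_stages, n_branches,
                              stall_cycles_per_branch):
    """Ideal pipelined cycle count plus a fixed stall for every branch."""
    ideal_cycles = n_stages + n_instructions - 1
    return ideal_cycles + n_branches * stall_cycles_per_branch


# Hypothetical program: 100 instructions on a 5-stage pipeline,
# 20 of them branches, each assumed to cost 3 stall cycles.
print(cycles_with_branch_stalls(100, 5, 0, 3))   # 104, no branches
print(cycles_with_branch_stalls(100, 5, 20, 3))  # 164, with branch stalls
```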
References
David Patterson and John L. Hennessy. Computer Architecture: A Quantitative Approach. 2nd Edition. 1996.
William Stallings. Computer Architecture and Organization. 2000.
Andrie Coronel and Ariel Maguyon. Computer Architecture slides. First Sem. SY 2006-2007, Ateneo de Manila University.