Multiple-Issue Processors
Pipelining can achieve CPI close to 1
– Mechanisms for handling hazards – Static or dynamic scheduling
– Static or dynamic branch handling
Increase in transistor counts (Moore’s Law):
– More complex hardware mechanisms
– Or, more replicated functional units (more parallelism)
Solution: start more than one instruction in the same clock cycle
→ CPI < 1 (or IPC > 1, Instructions per Cycle)
Two approaches:
– Superscalar: instructions are chosen dynamically by the hardware
– VLIW (Very Long Instruction Word): instructions are chosen statically
Superscalar Processors
Hardware attempts to issue up to n instructions on every cycle, where n is called the issue width of the processor and the processor is said to have n issue slots and to be a n-issue processor
Instructions issued must respect data dependences
In some cycles not all issue slots can be used
Extra hardware is needed to detect more combinations of dependences and hazards and to provide more bypasses
Branches?
– With branch prediction, we can predict branches and fetch instructions – Can we execute such predicted instructions?
Speculative Execution
Speculative execution – execute control dependent instructions even when we are not sure if they
should be executed
Hardware undo, in case of a mis-prediction
Multi-issue + branch prediction + Tomasulo
– Implemented in most current processors
Key Idea: Execute out-of-order but commit in order
Remember Control Hazards in 5-stage pipeline?
When a branch is executed, PC is not affected until the branch instruction reaches the MEM stage.
By this time 3 instructions have been fetched from the fall-through path.
IF Reg ALU Mem Reg
IF Reg ALU Mem Reg
IF Reg ALU Mem Reg
IF Reg ALU Mem Reg
c1 c2 c3 c4 c5 c6 c7 c8 c9 c10
BEQZ R1, label
SUB R4, R2, R5
AND R6, R2, r7
OR R8, r2, R9 :
:
IF Reg ALU Mem Reg
XOR R10, R1, R11
Kill instructions in EX, DEC and IF
as they move forwards
label:
Tomasulo with Hardware Speculation
Issue:
– Get instruction from queue
– Issue if an RS is free and an ROB entry is also free
– Stall if no RS or no free ROB entry
– Instructions now tagged with ROB entry number, not RS.Id
Execute:
– Same as before: monitor CDB and start instruction when operands are available
Write Result:
– Broadcast result with ROB identifier – ROB captures result to commit later – Store operations are saved in the ROB
until store data is available and store instruction is committed
Commit:
– If branch, check prediction and squash following instructions if incorrect – If store, send data and address to
memory unit and perform write action – Else, update register with new value and
release ROB entry
Address unit
FP adders FP multipliers Memory unit
Address unit
From instruction fetch unit
Instruction Queue
FP registers
Re- Order Buffer
Load buffers
Reservation stations
1 2 3
4 5 6
11
. . .
ld f1, 4(r1) st f4, 8(r2) mul f3, f1, f2 add f4, f5, f3
VLIW Processors
Compiler chooses and “packs” independent instructions into a single long “instruction” word or “bundle”
No need for hardware to check the instructions for dependences
Not all portions of the long instruction word will be used in every cycle
Compiler must be able to expose a lot of parallelism in the instruction flow
Example:
Memory op 1 Memory op 2 fp op 1 int op
ld f18,-32(r1) ld f22,-40(r1) addd f4,f0,f2
fp op 2
addd f8,f6,f2
Superscalar vs. VLIW Processors
+
Superscalar can handle dynamic events like cache misses, unpredictable memory dependences, branches, etc.+
Superscalar can exploit old binaries from previous implementations-
Superscalar complexity limits issue width to 4 or 8+
VLIW requires much simpler hardware implementation+
VLIW implementations can have wider issue-
VLIW requires more complex compiler support-
VLIW cannot use old binaries when pipeline implementation changes-
VLIW code size increases because of empty issue slotsWhat are the practical Limitations to ILP?
Limitations on max issue width and instruction window size
Effects of realistic branch prediction
The effect of limited numbers of registers (ROB size)
The effects of cache performance
150 119
75 18
63 55
0 50 100 150 200
tomcatv doduc fpppp li espresso gcc
Instruction issues per cycle
H&P, fig 3.1 (p.157) shows the available ILP in a perfect processor, with none of the above constraints.
This is shown for 6 of the SPEC92 benchmarks – the first 3 are integer, the final 3 are floating point
benchmarks.
These levels of ILP are impossible to achieve in practice, due to the
limitations on window size, issue width, branch prediction accuracy and cache performance.
Effect of Instruction Window
55 63
18
75
119
150
36 41
15
61 59 60
10 15 12
49
16
45
10 13 11
35
15
34
8 8 9 14
9 14
0 20 40 60 80 100 120 140 160
gcc espresso li fpppp doduc tomcatv
Instructions Per Clock
Infinite 2048 512 128 32
Branch Prediction Limits ILP
We’ve looked at various branch prediction schemes
Here we see the accuracy achieved by 3 standard schemes
Profile-based, 2-bit counter and Tournament predictors are in use today
H&P, fig 3.2 (p.159) shows the branch prediction accuracy for three types of dynamic branch prediction scheme.
99 95 86
88 86
88
99 84
84 77
82 70
100 97 97 98 96 94
0 20 40 60 80 100
tomcatv doduc fpppp li espresso
gcc
Branch prediction accuracy
Profile-based 2-bit counter Tournament
Effect of Branch prediction
Effect of Limited ROB
Limits to Multiple-issue
Inherent limitation of ILP in most programs:
– In practice we would need N independent instructions to keep a W- issue processor busy, where N=W*pipeline width
– Data and control dependences significantly limit amount of ILP
Complexity of the hardware based on issue width
– Number of functional units increases linearly → OK
– Number of ports for register file increases linearly → bad – Number of ports for memory increases linearly → bad
– Number of dependence tests increases quadratically → bad
– Bypass/forwarding logic and wires increases quadratically → bad
Summary of Factors Limiting ILP in Real Programs
Compared with an ideal processor
– Finite number of registers (introduces WAW and WAR stalls) – Imperfect branch prediction (pipeline flushes)
– Limited issue width
– Instruction fetch delays (cache misses) – Limited instruction window
Implications for future performance growth?
– Single processor has inherent limits
– To use future silicon area, need to go to multiple cores