
(1)

Multiple-Issue Processors

 Pipelining can achieve CPI close to 1

– Mechanisms for handling hazards

– Static or dynamic scheduling

– Static or dynamic branch handling

 Increase in transistor counts (Moore’s Law):

– More complex hardware mechanisms

– Or, more replicated functional units (more parallelism)

 Solution: start more than one instruction in the same clock cycle

→ CPI < 1 (or IPC > 1, Instructions per Cycle)

 Two approaches:

– Superscalar: instructions are chosen dynamically by the hardware

– VLIW (Very Long Instruction Word): instructions are chosen statically
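The relation between CPI, IPC and issue width can be checked with a few lines of Python. This is a minimal sketch under idealised assumptions (no stalls, a 5-stage pipeline, and a hypothetical helper `ideal_cycles` that is not part of the slides):

```python
import math

def ideal_cycles(instructions: int, issue_width: int, depth: int = 5) -> int:
    """Cycles for a stall-free pipeline: issue cycles plus pipeline fill."""
    return math.ceil(instructions / issue_width) + depth - 1

def cpi(cycles: int, instructions: int) -> float:
    """Average clock cycles per instruction."""
    return cycles / instructions

def ipc(cycles: int, instructions: int) -> float:
    """Average instructions per clock cycle (the reciprocal of CPI)."""
    return instructions / cycles

n = 1000  # hypothetical instruction count
for width in (1, 2, 4):
    cycles = ideal_cycles(n, width)
    print(f"{width}-issue: CPI = {cpi(cycles, n):.3f}, IPC = {ipc(cycles, n):.3f}")
```

A single-issue pipeline can only approach CPI = 1 from above; starting two or more instructions per cycle is what pushes CPI below 1.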

(2)

Superscalar Processors

 Hardware attempts to issue up to n instructions on every cycle, where n is the issue width of the processor; such a processor has n issue slots and is called an n-issue processor

 Instructions issued must respect data dependences

 In some cycles not all issue slots can be used

 Extra hardware is needed to detect more combinations of dependences and hazards and to provide more bypasses

 Branches?

– With branch prediction, we can predict branches and fetch instructions

– Can we execute such predicted instructions?
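Why some issue slots stay empty can be illustrated with a small model of in-order issue. This is a sketch, not a description of any real issue logic: the function `issue_in_order`, the instruction encoding and the example program are all hypothetical, and every result is assumed to be available (through a bypass) one cycle after issue.

```python
from typing import List, Tuple

# (destination register, source registers)
Instr = Tuple[str, Tuple[str, ...]]

def issue_in_order(program: List[Instr], width: int) -> List[List[int]]:
    """Group instruction indices into issue cycles for an in-order machine.

    Assumption: only dependences inside the same issue group block an
    instruction, because older results are bypassed in time.
    """
    groups: List[List[int]] = []
    i = 0
    while i < len(program):
        group: List[int] = []
        written = set()
        while i < len(program) and len(group) < width:
            dest, srcs = program[i]
            # RAW or WAW hazard against an instruction in this group:
            # stop here and leave the remaining slots empty.
            if written & set(srcs) or dest in written:
                break
            group.append(i)
            written.add(dest)
            i += 1
        groups.append(group)
    return groups

program = [
    ("r1", ("r2", "r3")),  # 0
    ("r4", ("r1", "r5")),  # 1: needs r1 from instruction 0
    ("r6", ("r7", "r8")),  # 2: independent
    ("r9", ("r6", "r4")),  # 3: needs r6 and r4
]
print(issue_in_order(program, width=2))
```

With a 2-issue machine the four instructions need three cycles, so only four of the six available issue slots are used.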

(3)

Speculative Execution

 Speculative execution – execute control dependent instructions even when we are not sure if they should be executed

 Hardware undoes their effects in case of a misprediction

 Multi-issue + branch prediction + Tomasulo

– Implemented in most current processors

 Key Idea: Execute out-of-order but commit in order

(4)

Remember Control Hazards in 5-stage pipeline?

 When a branch is executed, PC is not affected until the branch instruction reaches the MEM stage.

 By this time 3 instructions have been fetched from the fall-through path.

Instruction          c1   c2   c3   c4   c5   c6   c7   c8   c9   c10
BEQZ R1, label       IF   Reg  ALU  Mem  Reg
SUB  R4, R2, R5           IF   Reg  ALU  Mem  Reg
AND  R6, R2, R7                IF   Reg  ALU  Mem  Reg
OR   R8, R2, R9                     IF   Reg  ALU  Mem  Reg
  :
label:
XOR  R10, R1, R11                        IF   Reg  ALU  Mem  Reg

Kill the instructions in EX, DEC and IF (SUB, AND and OR) as they move forwards
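The cost of these killed instructions can be estimated with a simple CPI model. This is a rough sketch: it assumes the pipeline always fetches the fall-through path, so only taken branches pay the 3-cycle penalty, and the branch statistics used are hypothetical.

```python
def effective_cpi(base_cpi: float,
                  branch_fraction: float,
                  taken_fraction: float,
                  penalty_cycles: int = 3) -> float:
    """Base CPI plus the average number of killed fetch slots per instruction."""
    return base_cpi + branch_fraction * taken_fraction * penalty_cycles

# Hypothetical workload: 20% of instructions are branches, 60% of them taken.
print(round(effective_cpi(1.0, 0.20, 0.60), 2))
```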

(5)

Tomasulo with Hardware Speculation

 Issue:

– Get instruction from queue

– Issue if an RS is free and an ROB entry is also free

– Stall if no RS or no free ROB entry

– Instructions now tagged with ROB entry number, not RS.Id

 Execute:

– Same as before: monitor CDB and start instruction when operands are available

 Write Result:

– Broadcast result with ROB identifier

– ROB captures result to commit later

– Store operations are saved in the ROB until store data is available and store instruction is committed

 Commit:

– If branch, check prediction and squash following instructions if incorrect

– If store, send data and address to memory unit and perform write action

– Else, update register with new value and release ROB entry

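[Figure: Tomasulo datapath extended with a Re-Order Buffer. Instructions arrive from the instruction fetch unit into the Instruction Queue; the other blocks are the FP registers, Re-Order Buffer, address unit, load buffers, reservation stations, FP adders, FP multipliers and memory unit. Example instructions shown in the figure: ld f1, 4(r1); st f4, 8(r2); mul f3, f1, f2; add f4, f5, f3]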
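The "execute out of order, commit in order" behaviour of the Issue, Write Result and Commit steps can be sketched with a small software model of the ROB. This is an illustration only: the class and field names (`ReorderBuffer`, `RobEntry`, `mispredicted`) are hypothetical, and reservation stations, the CDB and ROB tags are left out.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional

@dataclass
class RobEntry:
    kind: str                      # "alu", "store" or "branch"
    dest: Optional[str] = None     # register name, or address for a store
    value: Optional[int] = None
    done: bool = False
    mispredicted: bool = False

class ReorderBuffer:
    def __init__(self, size: int):
        self.size = size
        self.entries = deque()     # program order, oldest entry at the left
        self.registers = {}
        self.memory = {}

    def issue(self, entry: RobEntry) -> bool:
        """Allocate an entry; return False (stall) if the ROB is full."""
        if len(self.entries) == self.size:
            return False
        self.entries.append(entry)
        return True

    def write_result(self, entry: RobEntry, value=None, mispredicted=False):
        """Record a finished result; nothing becomes architecturally visible yet."""
        entry.value = value
        entry.mispredicted = mispredicted
        entry.done = True

    def commit(self) -> bool:
        """Retire the oldest entry if it has finished; return True on success."""
        if not self.entries or not self.entries[0].done:
            return False
        head = self.entries.popleft()
        if head.kind == "branch":
            if head.mispredicted:
                self.entries.clear()   # squash all younger instructions
        elif head.kind == "store":
            self.memory[head.dest] = head.value
        else:
            self.registers[head.dest] = head.value
        return True

rob = ReorderBuffer(size=4)
mul = RobEntry("alu", dest="f3")
add = RobEntry("alu", dest="f4")
rob.issue(mul)
rob.issue(add)
rob.write_result(add, 7)       # the younger instruction finishes first
print(rob.commit())            # False: the head (mul) is not done yet
rob.write_result(mul, 6)
rob.commit()
rob.commit()
print(rob.registers)
```

Because registers and memory are only updated in `commit`, squashing a mispredicted path needs nothing more than discarding the younger ROB entries.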

(6)

VLIW Processors

 Compiler chooses and “packs” independent instructions into a single long “instruction” word or “bundle”

 No need for hardware to check the instructions for dependences

 Not all portions of the long instruction word will be used in every cycle

 Compiler must be able to expose a lot of parallelism in the instruction flow

 Example (the int op slot is unused in this bundle):

Memory op 1      Memory op 2      fp op 1          fp op 2          int op
ld f18,-32(r1)   ld f22,-40(r1)   addd f4,f0,f2    addd f8,f6,f2    -
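The packing job that a VLIW compiler performs can be sketched with a greedy, in-order bundler. This is a simplification, not how a production compiler schedules: operation latencies are ignored, and the slot names, the `pack` and `empty_slots` helpers and the operation encoding are hypothetical.

```python
from typing import Dict, List, Tuple

SLOTS = ("mem1", "mem2", "fp1", "fp2", "int")
UNIT_SLOTS = {"mem": ("mem1", "mem2"), "fp": ("fp1", "fp2"), "int": ("int",)}

# (assembly text, functional unit, destination register, source registers)
Op = Tuple[str, str, str, Tuple[str, ...]]

def pack(ops: List[Op]) -> List[Dict[str, str]]:
    """Fill bundles in program order; start a new bundle when the needed
    slot type is full or the operation depends on the current bundle."""
    bundles: List[Dict[str, str]] = []
    current: Dict[str, str] = {}
    written = set()
    for text, unit, dest, srcs in ops:
        free = [s for s in UNIT_SLOTS[unit] if s not in current]
        depends = bool(written & set(srcs)) or dest in written
        if not free or depends:
            bundles.append(current)
            current, written = {}, set()
            free = list(UNIT_SLOTS[unit])
        current[free[0]] = text
        written.add(dest)
    if current:
        bundles.append(current)
    return bundles

def empty_slots(bundles: List[Dict[str, str]]) -> int:
    """Issue slots that must be filled with no-ops."""
    return sum(len(SLOTS) - len(b) for b in bundles)

slide_ops = [
    ("ld f18,-32(r1)", "mem", "f18", ("r1",)),
    ("ld f22,-40(r1)", "mem", "f22", ("r1",)),
    ("addd f4,f0,f2", "fp", "f4", ("f0", "f2")),
    ("addd f8,f6,f2", "fp", "f8", ("f6", "f2")),
]
bundles = pack(slide_ops)
print(bundles, empty_slots(bundles))
```

The four independent operations from the example fit into one bundle with one unused slot; each unused slot still occupies space in the instruction word.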

(7)

Superscalar vs. VLIW Processors

+ Superscalar can handle dynamic events like cache misses, unpredictable memory dependences, branches, etc.

+ Superscalar can exploit old binaries from previous implementations

- Superscalar complexity limits issue width to 4 or 8

+ VLIW requires much simpler hardware implementation

+ VLIW implementations can have wider issue

- VLIW requires more complex compiler support

- VLIW cannot use old binaries when pipeline implementation changes

- VLIW code size increases because of empty issue slots

(8)

What are the practical Limitations to ILP?

 Limitations on max issue width and instruction window size

 Effects of realistic branch prediction

 The effect of limited numbers of registers (ROB size)

 The effects of cache performance

[Figure: instruction issues per cycle available in a perfect processor]

Benchmark   Instruction issues per cycle
gcc         55
espresso    63
li          18
fpppp       75
doduc       119
tomcatv     150

H&P, fig 3.1 (p.157) shows the available ILP in a perfect processor, with none of the above constraints.

This is shown for 6 of the SPEC92 benchmarks – the first 3 are integer, the final 3 are floating point benchmarks.

These levels of ILP are impossible to achieve in practice, due to the limitations on window size, issue width, branch prediction accuracy and cache performance.

(9)

Effect of Instruction Window

[Figure: Instructions Per Clock for each benchmark, by instruction window size]

Benchmark   Infinite   2048   512   128   32
gcc         55         36     10    10    8
espresso    63         41     15    13    8
li          18         15     12    11    9
fpppp       75         61     49    35    14
doduc       119        59     16    15    9
tomcatv     150        60     45    34    14
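Why a smaller window lowers the achievable IPC can be reproduced with a toy model. The sketch below is not the simulator behind the figure: it assumes unit latency, unlimited issue width and perfect branch prediction, and the function `ipc_with_window` and the 8-instruction dependence list are hypothetical.

```python
from typing import List, Optional, Sequence

def ipc_with_window(deps: Sequence[Sequence[int]], window: int) -> float:
    """IPC when only the `window` oldest un-issued instructions are visible.

    deps[i] lists the earlier instructions whose results instruction i needs.
    """
    n = len(deps)
    if n == 0:
        return 0.0
    issued_at: List[Optional[int]] = [None] * n
    pending = list(range(n))
    cycle = 0
    while pending:
        cycle += 1
        visible = pending[:window]
        # An instruction is ready once all its producers issued in an earlier cycle.
        ready = [
            i for i in visible
            if all(issued_at[d] is not None and issued_at[d] < cycle for d in deps[i])
        ]
        for i in ready:
            issued_at[i] = cycle
        done = set(ready)
        pending = [i for i in pending if i not in done]
    return n / cycle

# A 4-instruction dependence chain followed by 4 independent instructions.
deps = [[], [0], [1], [2], [], [], [], []]
for w in (8, 4, 2):
    print(w, round(ipc_with_window(deps, w), 2))
```

With a window of 8 the independent instructions overlap with the chain; with a window of 2 they stay hidden behind it until the chain has almost drained.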

(10)

Branch Prediction Limits ILP

 We’ve looked at various branch prediction schemes

 Here we see the accuracy achieved by 3 standard schemes

 Profile-based, 2-bit counter and Tournament predictors are in use today

H&P, fig 3.2 (p.159) shows the branch prediction accuracy for three types of branch prediction scheme.

[Figure: branch prediction accuracy (%) for each benchmark, by prediction scheme]

Benchmark   Profile-based   2-bit counter   Tournament
gcc         88              70              94
espresso    86              82              96
li          88              77              98
fpppp       86              84              97
doduc       95              84              97
tomcatv     99              99              100
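As a reminder of how the 2-bit counter scheme works, here is a minimal sketch of a table of 2-bit saturating counters together with an accuracy measurement. The table size, the PC indexing and the loop-branch trace are assumptions made for illustration; they are not the configuration used for the figure.

```python
from typing import Iterable, Tuple

class TwoBitPredictor:
    """Table of 2-bit saturating counters: 0-1 predict not taken, 2-3 predict taken."""

    def __init__(self, entries: int = 1024):
        self.entries = entries
        self.counters = [1] * entries   # start in the weakly not-taken state

    def _index(self, pc: int) -> int:
        return (pc >> 2) % self.entries  # drop the byte offset of 4-byte instructions

    def predict(self, pc: int) -> bool:
        return self.counters[self._index(pc)] >= 2

    def update(self, pc: int, taken: bool) -> None:
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)

def accuracy(predictor: TwoBitPredictor, trace: Iterable[Tuple[int, bool]]) -> float:
    """Fraction of branches in the trace that were predicted correctly."""
    correct = total = 0
    for pc, taken in trace:
        correct += predictor.predict(pc) == taken
        predictor.update(pc, taken)
        total += 1
    return correct / total

# Hypothetical loop branch: taken 9 times, then not taken, repeated 10 times.
loop_trace = [(0x400, i % 10 != 9) for i in range(100)]
print(accuracy(TwoBitPredictor(), loop_trace))
```

On this trace the predictor misses twice in the first pass through the loop (warm-up and loop exit) and once per pass afterwards, because a single not-taken outcome does not flip a saturated counter.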

(11)

Effect of Branch Prediction

(12)

Effect of Limited ROB

(13)

Limits to Multiple-issue

 Inherent limitation of ILP in most programs:

– In practice we would need N independent instructions to keep a W-issue processor busy, where N = W × pipeline depth

– Data and control dependences significantly limit amount of ILP

 Complexity of the hardware based on issue width

– Number of functional units increases linearly → OK

– Number of ports for register file increases linearly → bad

– Number of ports for memory increases linearly → bad

– Number of dependence tests increases quadratically → bad

– Bypass/forwarding logic and wires increases quadratically → bad
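The linear and quadratic growth rates can be made concrete with a short calculation. This is a back-of-the-envelope sketch: it assumes two source registers and one destination per instruction, and counts only the comparisons between instructions issued in the same cycle; the function names are hypothetical.

```python
def register_ports(width: int, reads: int = 2, writes: int = 1) -> int:
    """Register file ports needed if every issue slot reads and writes each cycle."""
    return width * (reads + writes)

def dependence_comparisons(width: int, sources: int = 2) -> int:
    """Each instruction compares every source with the destination of
    every earlier instruction in the same issue group."""
    return sources * width * (width - 1) // 2

for w in (1, 2, 4, 8):
    print(f"W={w}: ports={register_ports(w)}, comparisons={dependence_comparisons(w)}")
```

Doubling the issue width from 4 to 8 doubles the register ports (12 to 24) but more than quadruples the same-cycle dependence comparisons (12 to 56).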

(14)

Summary of Factors Limiting ILP in Real Programs

 Compared with an ideal processor

– Finite number of registers (introduces WAW and WAR stalls)

– Imperfect branch prediction (pipeline flushes)

– Limited issue width

– Instruction fetch delays (cache misses)

– Limited instruction window

 Implications for future performance growth?

– Single processor has inherent limits

– To use future silicon area, need to go to multiple cores