Computer Architecture Elementary Pipelining Study


6.1 [M] Consider the following instructions at the given addresses in the memory:

1000 Add R3, R2, #20
1004 Subtract R5, R4, #3
1008 And R6, R4, #0x3A
1012 Add R7, R2, R4

Initially, registers R2 and R4 contain 2000 and 50, respectively. These instructions are executed in a computer that has a five-stage pipeline as shown in Figure 6.2. The first instruction is fetched in clock cycle 1, and the remaining instructions are fetched in successive cycles.


(a) Draw a diagram similar to Figure 6.1 that represents the flow of the instructions through the pipeline. Describe the operation being performed by each pipeline stage during clock cycles 1 through 8.

--

(6.1.a) Flow Of Instructions:

*Assuming a positive-edge clock, where every result is latched at the END of its cycle.

Operation | Cycle 1 | Cycle 2 | Cycle 3 | Cycle 4 | Cycle 5 | Cycle 6 | Cycle 7 | Cycle 8
1000 Add R3, R2, #20 | Fetch: IR ← (1000) | Decode: RA ← [R2], RB ← [R3] | Compute: RZ ← [R2] + #20 | Memory: RY ← [RZ] | Write: R3 ← [RY] | - | - | -
1004 Subtract R5, R4, #3 | - | Fetch: IR ← (1004) | Decode: RA ← [R4], RB ← [R5] | Compute: RZ ← [R4] - #3 | Memory: RY ← [RZ] | Write: R5 ← [RY] | - | -
1008 And R6, R4, #0x3A | - | - | Fetch: IR ← (1008) | Decode: RA ← [R4], RB ← [R6] | Compute: RZ ← [R4] & #0x3A | Memory: RY ← [RZ] | Write: R6 ← [RY] | -
1012 Add R7, R2, R4 | - | - | - | Fetch: IR ← (1012) | Decode: RA ← [R2], RB ← [R4] | Compute: RZ ← [R2] + [R4] | Memory: RY ← [RZ] | Write: R7 ← [RY]
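The staggered flow above can be generated mechanically. The following is a minimal sketch, assuming an ideal five-stage pipeline with no stalls and one new instruction fetched per cycle; the function name `schedule` and the stage labels are illustrative choices, not taken from the textbook.

```python
# Ideal five-stage pipeline: instruction i enters Fetch in cycle (1 + i)
# and advances one stage per cycle. No stalls are modelled (assumption).
STAGES = ["Fetch", "Decode", "Compute", "Memory", "Write"]

def schedule(instructions, first_cycle=1):
    """Map each instruction to a {cycle: stage} dictionary."""
    table = {}
    for i, instr in enumerate(instructions):
        table[instr] = {first_cycle + i + s: stage
                        for s, stage in enumerate(STAGES)}
    return table

program = [
    "1000 Add R3, R2, #20",
    "1004 Subtract R5, R4, #3",
    "1008 And R6, R4, #0x3A",
    "1012 Add R7, R2, R4",
]

timeline = schedule(program)
for instr, stages in timeline.items():
    # One row per instruction, one column per clock cycle 1..8.
    row = [stages.get(cycle, "-") for cycle in range(1, 9)]
    print(f"{instr:26s}", *(f"{cell:8s}" for cell in row))
```

With four instructions the last Write lands in cycle 8, which is why the problem asks about cycles 1 through 8.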

(b) With reference to Figures 5.8 and 5.9, describe the contents of registers R2, R3, R4, R5, R6, R7, IR, PC, RA, RB, RY, and RZ in the pipeline during cycles 2 to 8.


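(6.1.b) Contents of Registers R2, R3, R4, R5, R6, R7, IR, PC, RA, RB, RY, and RZ:

Each entry is the value loaded into the register during that cycle; it is latched at the end of the cycle. A "?" marks an unknown value.

Register | Cycle 1 | Cycle 2 | Cycle 3 | Cycle 4 | Cycle 5 | Cycle 6 | Cycle 7 | Cycle 8
[PC] | 1004 | 1008 | 1012 | ? | ? | ? | ? | ?
[IR] | Add (1000) | Subtract (1004) | And (1008) | Add (1012) | ? | ? | ? | ?
[RA] | ? | [R2] = 2000 | [R4] = 50 | [R4] = 50 | [R2] = 2000 | ? | ? | ?
[RB] | ? | [R3] = ? | [R5] = ? | [R6] = ? | [R4] = 50 | ? | ? | ?
[RZ] | ? | ? | [R2] + #20 = 2020 | [R4] - #3 = 47 | [R4] & #0x3A = 50 | [R2] + [R4] = 2050 | ? | ?
[RY] | ? | ? | ? | [R2] + #20 = 2020 | [R4] - #3 = 47 | [R4] & #0x3A = 50 | [R2] + [R4] = 2050 | ?
[R2] | 2000 | 2000 | 2000 | 2000 | 2000 | 2000 | 2000 | 2000
[R3] | ? | ? | ? | ? | 2020 | 2020 | 2020 | 2020
[R4] | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50
[R5] | ? | ? | ? | ? | ? | 47 | 47 | 47
[R6] | ? | ? | ? | ? | ? | ? | 50 | 50
[R7] | ? | ? | ? | ? | ? | ? | ? | 2050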
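The numeric entries in the table can be cross-checked with a few lines of Python. This is only a sketch of the arithmetic, assuming the initial values [R2] = 2000 and [R4] = 50; it executes the instructions one after another and does not model the pipeline registers. The tuple encoding of the program and the `execute` helper are hypothetical.

```python
import operator

# Hypothetical encoding: (operation, destination, source register, second operand).
# The second operand is either a register name (str) or an immediate (int).
OPS = {"Add": operator.add, "Subtract": operator.sub, "And": operator.and_}

def execute(program, registers):
    """Run the instructions sequentially and return the final register values."""
    regs = dict(registers)
    for op, dest, src1, src2 in program:
        a = regs[src1]
        b = regs[src2] if isinstance(src2, str) else src2
        regs[dest] = OPS[op](a, b)
    return regs

program_6_1 = [
    ("Add", "R3", "R2", 20),
    ("Subtract", "R5", "R4", 3),
    ("And", "R6", "R4", 0x3A),
    ("Add", "R7", "R2", "R4"),
]

final = execute(program_6_1, {"R2": 2000, "R4": 50})
print(final)
```

The results R3 = 2020, R5 = 47, R6 = 50, and R7 = 2050 agree with the values shown for RZ, RY, and the destination registers.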


6.2

6.2 [M] Repeat Problem 6.1 for the following program:

1000 Add R3, R2, #20
1004 Subtract R5, R4, #3
1008 And R6, R3, #0x3A // Data dependency on R3!
1012 Add R7, R2, R4

Assume that the pipeline provides forwarding paths to the ALU from registers RY and RZ in Figure 5.8 and that the processor uses forwarding of operands.

--

From Problem 6.1…

Initially, registers R2 and R4 contain 2000 and 50, respectively. These instructions are executed in a computer that has a five-stage pipeline as shown in Figure 6.2. The first instruction is fetched in clock cycle 1, and the remaining instructions are fetched in successive cycles.


(a) Draw a diagram similar to Figure 6.1 that represents the flow of the instructions through the pipeline. Describe the operation being performed by each pipeline stage during clock cycles 1 through 8.

--

(6.2.a) Flow Of Instructions:

*Assuming a positive-edge clock, where every result is latched at the END of its cycle.

Operation | Cycle 1 | Cycle 2 | Cycle 3 | Cycle 4 | Cycle 5 | Cycle 6 | Cycle 7 | Cycle 8
1000 Add R3, R2, #20 | Fetch: IR ← (1000) | Decode: RA ← [R2], RB ← [R3] | Compute: RZ ← [R2] + #20 | Memory: RY ← [RZ] | Write: R3 ← [RY] | - | - | -
1004 Subtract R5, R4, #3 | - | Fetch: IR ← (1004) | Decode: RA ← [R4], RB ← [R5] | Compute: RZ ← [R4] - #3 | Memory: RY ← [RZ] | Write: R5 ← [RY] | - | -
1008 And R6, R3, #0x3A | - | - | Fetch: IR ← (1008) | Decode: RA ← [RZ] (forwarded “R3”), RB ← [R6] | Compute: RZ ← [R3] & #0x3A | Memory: RY ← [RZ] | Write: R6 ← [RY] | -
1012 Add R7, R2, R4 | - | - | - | Fetch: IR ← (1012) | Decode: RA ← [R2], RB ← [R4] | Compute: RZ ← [R2] + [R4] | Memory: RY ← [RZ] | Write: R7 ← [RY]
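The dependency on R3 that makes forwarding necessary can be found mechanically. The sketch below scans a program for read-after-write dependencies between nearby instructions. The window of three instructions is an assumption: it reflects that, without forwarding, a result written in the Write stage is not yet readable by the three instructions directly behind the producer. The tuple layout and the name `raw_hazards` are illustrative.

```python
def raw_hazards(program, window=3):
    """Return (producer_index, consumer_index, register) for every
    read-after-write dependency whose producer is at most `window`
    instructions ahead of the consumer."""
    hazards = []
    for j, (_, _, sources) in enumerate(program):
        for i in range(max(0, j - window), j):
            dest = program[i][1]
            if dest in sources:
                hazards.append((i, j, dest))
    return hazards

# Hypothetical encoding: (operation, destination, source registers).
program_6_2 = [
    ("Add", "R3", ("R2",)),
    ("Subtract", "R5", ("R4",)),
    ("And", "R6", ("R3",)),      # reads R3 produced two instructions earlier
    ("Add", "R7", ("R2", "R4")),
]

hazards = raw_hazards(program_6_2)
print(hazards)  # [(0, 2, 'R3')]
```

Only the And instruction depends on an in-flight result, so it is the only row in the diagram that needs a forwarded operand.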

(b) With reference to Figures 5.8 and 5.9, describe the contents of R2, R3, R4, R5, R6, R7, IR, PC, RA, RB, RY, and RZ in the pipeline during cycles 2 to 8.


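(6.2.b) Contents of Registers R2, R3, R4, R5, R6, R7, IR, PC, RA, RB, RY, and RZ:

Each entry is the value loaded into the register during that cycle; it is latched at the end of the cycle. A "?" marks an unknown value.

Register | Cycle 1 | Cycle 2 | Cycle 3 | Cycle 4 | Cycle 5 | Cycle 6 | Cycle 7 | Cycle 8
[PC] | 1004 | 1008 | 1012 | ? | ? | ? | ? | ?
[IR] | Add (1000) | Subtract (1004) | And (1008) | Add (1012) | ? | ? | ? | ?
[RA] | ? | [R2] = 2000 | [R4] = 50 | “R3” (forwarded) = 2020 | [R2] = 2000 | ? | ? | ?
[RB] | ? | [R3] = ? | [R5] = ? | [R6] = ? | [R4] = 50 | ? | ? | ?
[RZ] | ? | ? | [R2] + #20 = 2020 | [R4] - #3 = 47 | [R3] & #0x3A = 32 | [R2] + [R4] = 2050 | ? | ?
[RY] | ? | ? | ? | [R2] + #20 = 2020 | [R4] - #3 = 47 | [R3] & #0x3A = 32 | [R2] + [R4] = 2050 | ?
[R2] | 2000 | 2000 | 2000 | 2000 | 2000 | 2000 | 2000 | 2000
[R3] | ? | ? | ? | ? | 2020 | 2020 | 2020 | 2020
[R4] | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50
[R5] | ? | ? | ? | ? | ? | 47 | 47 | 47
[R6] | ? | ? | ? | ? | ? | ? | 32 | 32
[R7] | ? | ? | ? | ? | ? | ? | ? | 2050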
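The only value that differs from Problem 6.1 is the result of the And instruction. A short bit-level check, assuming [R2] = 2000 so that the forwarded R3 is 2020, shows where 32 comes from and why Problem 6.1 gave 50 for the same mask.

```python
MASK = 0x3A                      # immediate operand of the And instruction

r3_forwarded = 2000 + 20         # result of "Add R3, R2, #20"
r6_problem_6_2 = r3_forwarded & MASK
r6_problem_6_1 = 50 & MASK       # Problem 6.1 used [R4] = 50 instead of R3

# Show the operands in binary so the surviving bit is visible.
print(f"R3   = {r3_forwarded:012b} ({r3_forwarded})")
print(f"mask = {MASK:012b} ({MASK})")
print(f"R6   = {r6_problem_6_2:012b} ({r6_problem_6_2})")
```

Only bit 5 (value 32) is set in both 2020 and 0x3A, so R6 becomes 32.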


6.7

6.7 [M] Assume that 20 percent of the dynamic count of the instructions executed for a program are branch instructions. Delayed branching is used, with one delay slot. Assume that there are no stalls caused by other factors.

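(a.) First, derive an expression for the execution time in cycles if all delay slots are filled with NOP instructions.

Branches: 20% of instructions
Delay slots: 100% No-Operation

--

Let $N$ be the dynamic instruction count of the program, not counting delay-slot NOPs. Every branch then adds one NOP cycle:

$T_{NOP} = N + (0.20)(1.00)N = 1.20N$ cycles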

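(b.) Then, derive another expression that reflects the execution time with 70 percent of delay slots filled with useful instructions by the optimizing compiler.

Branches: 20% of instructions
Delay slots: 30% No-Operation, 70% Useful Operation

--

With $N$ the dynamic instruction count not counting NOPs, only the 30% of delay slots that remain unfilled cost an extra cycle:

$T_{opt} = N + (0.20)(0.30)N = 1.06N$ cycles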

(c.) From these expressions, determine the compiler’s contribution to the increase in performance, expressed as a speedup percentage.

--

$Speedup = T_{NOP} / T_{opt} = 1.20N / 1.06N \approx 1.132$

The optimizing compiler with a one-delay-slot processor makes execution about 13.2% faster than the same processor without an optimizing compiler.
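The expressions for delayed branching reduce to a one-line formula, sketched here in Python. The function name `cycles_per_instruction` is illustrative, and the model assumes that every unfilled delay slot costs exactly one extra cycle and that nothing else stalls the pipeline, as the problem states.

```python
def cycles_per_instruction(branch_fraction, slot_fill_rates):
    """Average cycles per useful instruction with delayed branching.

    slot_fill_rates holds, for each delay slot, the fraction of the time
    the compiler fills it with a useful instruction (0.0 means always NOP).
    """
    unfilled = sum(1.0 - fill for fill in slot_fill_rates)
    return 1.0 + branch_fraction * unfilled

t_nop = cycles_per_instruction(0.20, [0.0])    # all delay slots hold NOPs
t_opt = cycles_per_instruction(0.20, [0.70])   # 70% of slots hold useful work
speedup = t_nop / t_opt

print(f"{t_nop:.2f}N vs {t_opt:.2f}N cycles, "
      f"compiler speedup {100 * (speedup - 1):.1f}%")
```

Multiplying either result by the instruction count N gives the execution time in cycles.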


6.8

6.8 [D] Repeat Problem 6.7, but this time for a pipelined processor with two branch delay slots. The output from the optimizing compiler is such that the first delay slot is filled with a useful instruction 70 percent of the time, but the second slot is filled with a useful instruction only 10 percent of the time. Compare the compiler-optimized execution time for this case with the compiler-optimized execution time for Problem 6.7. Assume that the two processors have the same clock rate. Indicate which processor/compiler combination is faster, and determine the speedup percentage by which it is faster.

--

(a.) The execution time in cycles if all delay slots are filled with NOP instructions, with $N$ the dynamic instruction count not counting NOPs:

$T_{NOP} = N + (0.20)(2)(1.00)N = 1.40N$ cycles

(b.) The execution time in cycles if the delay slots are filled as described above.

Branches: 20% of instructions
Slot #1: 30% No-Operation, 70% Useful Operation
Slot #2: 90% No-Operation, 10% Useful Operation

$T_{opt} = N + (0.20)(0.30)N + (0.20)(0.90)N = 1.24N$ cycles


(c.) From these expressions, determine the compiler’s contribution to the increase in performance, expressed as a speedup percentage.

--

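The one-delay-slot pipeline with the optimizing compiler is best: it needs $1.06N$ cycles (Problem 6.7) against $1.24N$ cycles for the two-delay-slot pipeline, so it is $1.24 / 1.06 \approx 1.170$ times as fast, or about 17.0% faster.

An optimizing compiler for a two-delay-slot processor makes execution $1.40 / 1.24 \approx 1.129$ times as fast, or about 12.9% faster, than the same processor without an optimizing compiler.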
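The comparison between the one-slot and two-slot machines can be checked numerically. The sketch below is a simplified model under the problem's assumptions (same clock rate, no other stalls, each unfilled slot costs one cycle); the helper `execution_cycles` and the sample instruction count are hypothetical.

```python
def execution_cycles(n, branch_fraction, useful_fill):
    """Total cycles for n useful instructions when each branch is followed
    by len(useful_fill) delay slots, slot k being useful with probability
    useful_fill[k] and a NOP otherwise."""
    nops = sum(branch_fraction * (1.0 - p) * n for p in useful_fill)
    return n + nops

N = 1_000_000  # arbitrary instruction count; it cancels in every ratio

one_slot_opt = execution_cycles(N, 0.20, [0.70])         # Problem 6.7, optimized
two_slot_nop = execution_cycles(N, 0.20, [0.0, 0.0])     # Problem 6.8, all NOPs
two_slot_opt = execution_cycles(N, 0.20, [0.70, 0.10])   # Problem 6.8, optimized

compiler_gain = two_slot_nop / two_slot_opt    # effect of the compiler alone
one_vs_two = two_slot_opt / one_slot_opt       # one-slot machine vs two-slot

print(f"compiler gain on two slots: {100 * (compiler_gain - 1):.1f}%")
print(f"one-slot faster than two-slot by: {100 * (one_vs_two - 1):.1f}%")
```

The second delay slot is rarely filled, which is why the two-slot machine loses to the one-slot machine even with the optimizing compiler.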


6.14

6.14 [E] Assume that a program contains no branch instructions. It is executed on the superscalar processor shown in Figure 6.13.

(a.) What is the best execution time in cycles that can be expected if the mix of instructions consists of 75 percent arithmetic instructions and 25 percent memory-access instructions?

Arithmetic (75%) + Memory Access (25%)

--

Given the assumptions marked on Figure 6.13 above, the Arithmetic path takes on average:

Again, given the assumptions marked on Figure 6.13 above, the Arithmetic path takes:

Because the two pipelines receive significantly unequal numbers of instructions, we can safely assume that the Load/Store path (1/4 of all instructions, the minority) will be able to finish before the Arithmetic path (3/4 of all instructions, the majority).

[Figure 6.13 annotated with Jordan’s Assumptions. Per-stage latencies marked on the two paths: 1 cycle, 1 cycle, 1 cycle, 1 cycle; and 1 cycle, 1 cycle, 2 cycles, 1 cycle.]

Thus our limiting “bottleneck” is the Arithmetic sector of the pipeline:

(b.) How does this time compare to the best execution time on the simpler processor in Figure 6.2 using the same clock?

--

Given the assumptions marked on Figure 6.2 above, the simple pipeline will be limited by the memory access.

[Figure 6.2 annotated with Jordan’s Assumptions (in yellow). Per-stage latencies marked: 1 cycle, 1 cycle, 1 cycle, 1 cycle, 2 cycles, 1 cycle.]

Thus our comparison is between the execution time of the simple pipeline and that of the superscalar processor. Because both processors would be dealing with the same set of instructions and the same clock, the comparison simplifies to a comparison of cycle counts.

So we saved a little more than one cycle by re-routing memory access to a separate pipeline; we could have saved even more cycles with a more balanced instruction mix (i.e., Arithmetic 50%, Memory Access 50%).

Furthermore, looking at our “Speed Up Ratio”, we find that our double-path superscalar processor is faster than a single-path pipelined processor.
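As a rough cross-check of the bottleneck argument, the sketch below uses a deliberately simplified steady-state model. It assumes that each execution path of the superscalar processor completes one instruction per cycle, that the fetch unit can keep both paths busy, and that pipeline fill time is negligible. It does not reproduce the per-stage latency assumptions annotated on the figures, so its numbers are an idealized bound rather than the result derived above. The function names are hypothetical.

```python
def superscalar_cycles_per_instruction(mix):
    """Steady-state cycles per instruction when every execution path
    completes one instruction per cycle: the busiest path sets the pace."""
    return max(mix.values())

def simple_cycles_per_instruction():
    """A single-issue pipeline completes at most one instruction per cycle."""
    return 1.0

mix_6_14 = {"arithmetic": 0.75, "load_store": 0.25}

superscalar = superscalar_cycles_per_instruction(mix_6_14)
simple = simple_cycles_per_instruction()
speedup = simple / superscalar

print(f"superscalar: {superscalar:.2f}N cycles, simple: {simple:.2f}N cycles")
print(f"idealized speedup: {speedup:.2f}")

# A perfectly balanced mix would let both paths run at full rate.
balanced = superscalar_cycles_per_instruction(
    {"arithmetic": 0.50, "load_store": 0.50})
```

Under these idealized assumptions the arithmetic path is the bottleneck at 0.75N cycles, and a balanced 50/50 mix would lower the bound to 0.50N cycles.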


6.15

HAS BRANCHES

6.15 [M] Repeat Problem 6.14 to find the best possible execution times for the processors in Figures 6.2 and 6.13, assuming that the mix of instructions consists of 15 percent branch instructions that are never taken, 65 percent arithmetic instructions, and 20 percent memory access instructions. Assume a prediction accuracy of 100 percent for all branch instructions (the branch delay $\delta_{branch}$, counted in cycles, is minimized).

--

(a.) What is the best execution time in cycles that can be expected if the mix of instructions consists of Arithmetic (65%) + Memory Access (20%) + Branch Never Taken (15%) instructions?

Arithmetic (65%) + Memory Access (20%) + Branch Never Taken (15%)

--

Making the same assumptions as in Problem 6.14, marked on Figure 6.13 above:

Because the number of Arithmetic instructions outweighs the number of Memory Accesses, the “bottleneck constraint” will originate from the Arithmetic sector of the pipeline, and the Arithmetic path again sets the execution time.

We now have the additional possibility of a branch delay; however, this is taken care of by the Fetch sector, as a precursor to the Execution stage.

(b.) How does this time compare to the best execution time on the simpler processor in Figure 6.2 using the same clock?

--

Making the same assumptions as in Problem 6.14, marked on Figure 6.2 above, we now have the additional possibility of a single-cycle branch delay.

Thus our comparison is between the execution time of the simple pipeline, including its branch delay, and that of the superscalar processor. Because both processors would be dealing with the same set of instructions and the same clock, the comparison simplifies to a comparison of cycle counts.

So we can see that the simple pipeline takes longer to reconcile the branch delays, but the superscalar pipeline is essentially unaffected by branching considerations (because these are taken care of in the fetch stage).

Furthermore, looking at our “Speed Up Ratio”, we find that our double-path superscalar processor is faster than a single-path pipelined processor when we also take branching effects into consideration.


KEY For Tables Of Operations

Fetch = F

Decode = D

Execute (Using Processor Hardware) = E

Execute With Register Renaming (Not Using Processor Hardware) = E

Write Back = W

Waiting on = X (the name of the register, or the number of the instruction, being waited on)

NOTE: This method of display was developed in collaboration with Nathan Genetzky.

(Supplementary.A.)Completed The Preceding Table Of Operations

NOTE: Horizontal = Temporal-axis

:

F D E W

F D R3 E W

F D ONE E W

F D R6 E W

F D E W

F D R7 E W

F D FIVE E W

F D R1 E W

F D E W

F D R6 R6 E W

F D SEVEN E W

Registers written back, in time order: R3 R7 R7 R6,R2 R5 R1 R0 R3 R6 R2


(B.2) What are the savings from using out-of-order capability?

--

We save SEVEN cycles: 11 operations are performed in 17 cycles (using out-of-order capability) instead of 24 cycles (not using out-of-order capability). This is really significant!

(Supplementary.B.)Completed The Preceding Table Of Operations

NO OUT OF ORDER CAPABILITY NO REGISTER RENAMING

NOTE: Horizontal = Temporal-axis

:

F D E W

F D R3 E E E E E W

F D ONE ONE ONE ONE ONE E W

F D R6 R6 R6 R6 R6 E W

F D THREE THREE THREE THREE E W

F D R7 R7 R7 R7 R7 R7 E W

F D FIVE FIVE FIVE FIVE FIVE FIVE E W

F D R1 R1 R1 SIX SIX SIX E W

F D SEVEN SEVEN SEVEN SEVEN SEVEN SEVEN E E E E E W

F D R6 R6 R6 R6 R6 R6 R6 R6 R6 R6 R6 E W

F D NINE NINE NINE NINE NINE NINE NINE NINE NINE NINE NINE E W


(Supplementary.C)Re-Completed The Table Of Operations

OUT OF ORDER CAPABILITY

NOTE: Horizontal = Temporal-axis

:

F D E W

F D R3 R3 E E W

F D E W

F D R6 R6 R6 R6 E W

F D E W

F D R7 R7 E W

F D E W

F D R1 R1 R1 E W

F D E W

F D R6 R6 E W

F D E W

Registers written back, in time order: R3 R7 R7 R5 R2 R3 R6 R1,R6 R0,R2


(A.)OUT OF ORDER EXECUTION

# | Instruction | Fetch | Decode | Execute | Writeback
0 | ADD R3, R1, R2 | 0 | 1 | 2 | 3
1 | LOAD R6, [R3] | 1 | 2 | 4 | 9
2 | AND R7, R5, 3 | 2 | 3 | 5 | 7
3 | ADD R1, R6, R0 | 3 | 4 | 10 | 11
4 | SRL R7, R0, 8 | 4 | 5 | 6 | 7
5 | OR R2, R4, R7 | 5 | 6 | 8 | 9
6 | SUB R5, R3, R4 | 6 | 7 | 9 | 10
7 | ADD R0, R1, R10 | 7 | 8 | 12 | 13
8 | LOAD R6, [R5] | 8 | 9 | 10 | 15
9 | SUB R2, R1, R6 | 9 | 10 | 16 | 17
10 | AND R3, R7, 15 | 10 | 11 | 13 | 14
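The cycle numbers in this table are enough to recover the write-back order and the quoted savings. The sketch below only post-processes the table: it does not simulate the scheduler, and the in-order figure of 24 cycles is taken from the answer to (B.2) rather than computed. The variable names are illustrative.

```python
# (instruction, fetch, decode, execute, writeback) copied from the table.
schedule = [
    ("ADD R3, R1, R2", 0, 1, 2, 3),
    ("LOAD R6, [R3]", 1, 2, 4, 9),
    ("AND R7, R5, 3", 2, 3, 5, 7),
    ("ADD R1, R6, R0", 3, 4, 10, 11),
    ("SRL R7, R0, 8", 4, 5, 6, 7),
    ("OR R2, R4, R7", 5, 6, 8, 9),
    ("SUB R5, R3, R4", 6, 7, 9, 10),
    ("ADD R0, R1, R10", 7, 8, 12, 13),
    ("LOAD R6, [R5]", 8, 9, 10, 15),
    ("SUB R2, R1, R6", 9, 10, 16, 17),
    ("AND R3, R7, 15", 10, 11, 13, 14),
]

IN_ORDER_CYCLES = 24  # figure quoted in (B.2) for the in-order case

def destination(instruction):
    """The destination register is the first operand after the mnemonic."""
    return instruction.split()[1].rstrip(",")

last_writeback = max(row[4] for row in schedule)
# sorted() is stable, so instructions writing in the same cycle keep program order.
writeback_order = [destination(row[0])
                   for row in sorted(schedule, key=lambda row: row[4])]
cycles_saved = IN_ORDER_CYCLES - last_writeback

print(writeback_order)
print(f"last write-back in cycle {last_writeback}, saving {cycles_saved} cycles")
```

The resulting order matches the register row printed beneath the out-of-order chart, and the difference of seven cycles matches the savings stated in (B.2).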

(C.) SUPER SCALAR (2 Instructions Each Stage) WITH OUT OF ORDER CAPABILITIES

(B.) NO OUT OF ORDER OPERATIONS