Computer Architecture TDTS10

(1)

Computer Architecture TDTS10

Erik Larsson

Department of Computer Science

why parallelism? Performance gain from

increasing clock frequency is no longer an option.

Outline

 Superscalar Processors

 Very Long Instruction Word Processors

 Parallel computers

2

Pipelining

FI DI CO FO EI WO

FI: Fetch Instruction DI: Decode Instruction CO: Calculate operand FO: Fetch Operand EI: Execute Instruction WO: Write Operand

Superpipelining

FI DI CO FO EI WO

(2)

Superscalar Architecture

5

FI DI CO FO EI WO

6

Superscalar Architectures

 Superscalar architectures allow several instructions to be issued and completed per clock cycle.

 A superscalar architecture consists of a number of pipelines that are working in parallel.

 Depending on the number and kind of parallel units available, a certain number of instructions can be executed in parallel.

Limitations

 Resource conflicts

 Control dependency

 Data conflicts

 True data dependency

 Output dependency

 Anti-dependency

7

Limitations worse -

Resource conflict - similar to structural hazards

Control - branches Data-conflicts

8

True Data Dependency

MUL R4,R3,R1 R4 ← R3 * R1 - - - - ADD R2,R4,R5 R2 ← R4 + R5

FI DI CO FO EI WO

MUL R4,R3,R1 ADD R2,R4,R5

Assume:

R1=2, R2=0, R3=2, R4=0, R5=2 Correct: MUL makes: R4=4 ADD makes: R2=6

R4 becomes 4

Value of R4 is read.

R4=0 at this point

FO EI WO

(3)

9

Output Dependency

 Two instructions are writing into the same location; if the second instruction writes before the first one, an error occurs:

MUL R4,R3,R1 R4 ← R3 * R1

- - - -

ADD R4,R2,R5 R4 ← R2 + R5

FI DI CO FO EI WO

MUL R4,R3,R1 ADD R4,R2,R5

10

Antidependency

 An antidependency exists if an instruction uses a location as an operand while a following one is writing into that location; if the first one is still using the location when the second one writes into it, an error occurs:

MUL R4,R3,R1 R4 ← R3 * R1

- - - -

ADD R3,R2,R5 R3 ← R2 + R5

Output Dependency and Antidependency

 Without dependencies by using additional registers:

 Case 1:

MUL R4,R3,R1 R4 ← R3 * R1

- - - -

ADD R7,R2,R5 R7 ← R2 + R5

 Case 2:

MUL R4,R3,R1 R4 ← R3 * R1

- - - -

ADD R6,R2,R5 R6 ← R2 + R5

Storage conflicts

R4

Policies for Parallel Instruction Execution

 The ability of a superscalar processor to execute instructions in parallel is determined by:

 the number and nature of parallel pipelines

 the mechanism used to find independent instructions.

 The policies are characterized by:

 the order in which instructions are issued for execution.

 the order in which instructions are completed.

 Execution policies:

 In-order issue with in-order completion.

 In-order issue with out-of-order completion.

 Out-of-order issue with out-of-order completion.

Correct operation

(4)

13

 Two instructions can be fetched and decoded at a time;

 Three functional units can work in parallel: floating point unit, integer adder, integer multiplier;

 Two instructions can be written back (completed) at a time;

14

Policies for Parallel Instruction Execution

We consider the following instruction sequence:

I1: ADDF R12,R13,R14 R12 ← R13 + R14 (float. pnt.) I2: ADD R1,R8,R9 R1 ← R8 + R9

I3: MUL R4,R2,R3 R4 ← R2 * R3 I4: MUL R5,R6,R7 R5 ← R6 * R7 I5: ADD R10,R5,R7 R10 ← R5 + R7 I6: ADD R11,R2,R3 R11 ← R2 + R3

 I1 requires two cycles to execute;

 I3 and I4 are in conflict for the same functional unit;

 I5 depends on the value produced by I4 (we have a true data dependency between I4 and I5);

 I2, I5 and I6 are in conflict for the same functional unit;

15

In-Order Issue

with In-Order Completion

Decode/

Issue Decode/

Issue ExecuteExecuteExecute Write back/

complete Write back/

complete Cycle

I1 I2 1

I3 I4 I1 I2 2

I5 I6 I1 ^STALL 3

I3 I1 I2 4

I4 I3 5

I5 I4 6

I6 I5 7

I6 8

I1: ADDF R12,R13,R14 I2: ADD R1,R8,R9 I3: MUL R4,R2,R3 I4: MUL R5,R6,R7 I5: ADD R10,R5,R7 I6: ADD R11,R2,R3

16

In-Order Issue with In-Order Completion

 Instructions are issued in the exact order that would correspond to sequential execution; results are written (completion) in the same order.

 An instruction cannot be issued before the previous one has been issued;

 An instruction completes only after the previous one has completed.

 To guarantee in-order completion, instruction issuing stalls when there is a conflict and when the unit requires more than one cycle to execute;

 The processor detects and handles (by stalling) true data dependencies and resource conflicts.

(5)

17

In-Order Issue with In-Order Completion

 With in-order issue&in-order completion the processor has not to bother about output-dependency and antidependency. It has only to detect true data dependencies.

 No one of the two dependencies will be violated if instructions are issued/completed in-order:

Output dependency

MUL R4,R3,R1 R4 ← R3 * R1

- - - -

ADD R4,R2,R5 R4 ← R2 + R5

Antidependency

MUL R4,R3,R1 R4 ← R3 * R1

- - - -

ADD R3,R2,R5 R3 ← R2 + R5

18

Out-of-Order Issue with Out-of-Order Completion

Decode/

Issue Decode/

Issue ExecuteExecuteExecute Write back/

complete Write back/

complete Cycle

I1 I2 1

I3 I4 I1 I2 2

I5 I4 I1 I3 I2 3

I6 I4 I1 I3 4

I5 I4 I6 5

I6 I5 6

7 8

I1: ADDF R12,R13,R14 I2: ADD R1,R8,R9 I3: MUL R4,R2,R3 I4: MUL R5,R6,R7 I5: ADD R10,R5,R7 I6: ADD R11,R2,R3

Comments on Superscalars

 Main techniques for superscalar processors:

 additional pipelined units which are working in parallel;

 out-of-order issue&out-of-order completion;

 register renaming.

 All of the above techniques are aimed to enhance performance.

 Experiments have shown:

 only adding additional units is not efficient

 out-of-order issue is extremely important

 register renaming can improve performance with more than 30%

 it is important to provide a fetching/decoding capacity so that instruction window is sufficiently large.

Some Architectures

PowerPC 604

 six independent execution units:

 Branch execution unit

 Load/Store unit

 3 Integer units

 Floating-point unit

 in-order issue Power PC 620

 provides in addition to the 604 out-of-order issue

(6)

21

Some Architectures

Pentium

 three independent execution units:

 2 Integer units

 Floating point unit

 in-order issue

two instructions issued per clock cycle.

Pentium II to 4

 provides in addition to the Pentium out-of-order issue

 five instructions can be issued in one cycle

22

Summary Superscalars

Good

 The hardware solves everything:

 Hardware detects potential parallelism between instructions;

 Hardware tries to issue as many instructions as possible in parallel.

 Hardware solves register renaming.

 Binary compatibility

 If functional units are added in a new version of the architecture or some other improvements have been made to the architecture (without changing the instruction sets), old programs can benefit from the additional potential of parallelism.

 Why?

Because the new hardware will issue the old instruction sequence in a more efficient way.

23

Summary Superscalars

Bad

 Very complex

 Much hardware is needed for run-time detection. There is a limit in how far we can go with this technique.

 Power consumption can be very large!

 The instruction window is limited ⇒ this limits the capacity to detect potentially parallel instructions

Outline

24

(7)

25

The Alternative: VLIW Processors

 VLIW architectures rely on compile-time detection of parallelism

⇒ the compiler analysis the program and detects operations to be executed in parallel; such operations are packed into one

“large” instruction.

 After one instruction has been fetched all the corresponding operations are issued in parallel.

 No hardware is needed for run-time detection of parallelism.

 The instruction window problem is solved: the compiler can potentially analyse the whole program in order to detect parallel operations.

26

VLIW Processors

 Detection of parallelism and packaging of operations into instructions is done, by the compiler, off-line.

VLIW Processors Summary VLIW Processors

 Advantages

 the number of FUs can be increased without needing additional sophisticated hardware to detect parallelism.

 power consumption can be reduced.

 Compilers can detect parallelism based on global analysis program.

 Problems:

 large number of registers needed in order to keep all FUs active.

 Large data transport capacity is needed between

 FUs and the register file

 register files and memory.

 instruction cache and fetch unit.

 Large code size, partially because unused operations -> wasted bits.

 Incompatibility of binary code

(8)

Outline

29 30

Parallel Computers

 One solution to the need for high performance: architectures in which several CPUs are running in order to solve a certain application.

 Such computers have been organized in very different ways.

 Some key features:

 number and complexity of individual CPUs

 availability of common (shared memory)

 interconnection topology

 performance of interconnection network

 I/O devices

The need for high performance!

Two main factors contribute to high performance of modern processors:

Fast circuit technology Architectural features:

large caches multiple fast buses pipelining

superscalar architectures (multiple funct. units)

31

Classification of Computer Architectures

 Flynn’s classification is based on the nature of the instruction flow executed by the computer and that of the data flow on which the instructions operate.

 The multiplicity of instruction stream and data stream gives us four different classes:

 Single instruction, single data stream - SISD

 Single instruction, multiple data stream - SIMD

 Multiple instruction, single data stream - MISD

 Multiple instruction, multiple data stream- MIMD

 References

 Flynn, M., Some Computer Organizations and Their Effectiveness, IEEE Trans. Comput., Vol. C-21, pp. 948, 1972.

 Duncan, Ralph, "A Survey of Parallel Computer Architectures", IEEE Computer. February 1990, pp. 5-16.

Processor organisation

32

Single instruction, single data stream (SISD)

Single instruction, multiple data stream (SIMD)

Multiple instruction, single data stream (MISD)

Multiple instruction, multiple data stream (MIMD)

Uniprocessor Vector

processor Array processor

Shared

memory Distributed memory

Symmetric multiprocessor (SMP)

Nonuniform memory access (NUMA) Processor organizations

(9)

33

Single Instruction stream, Single Data stream (SISD)

34

Single Instruction stream, Multiple Data stream (SIMD) with shared memory

SIMD with no shared

memory Multiple Instruction stream, Multiple Data

stream (MIMD) with shared memory

(10)

37

MIMD with no shared memory

38

Multiprocessors

 Shared memory MIMD computers are called multiprocessors:

Amdahl's law

 S = (s + p)/(s + p/N)= 1 / (s + p/N)

 where S is speed, s is seriel and p is parallel, and N is the number of processors.

 (s+p) is the total execution time

 p is the time that can be executed in parallel

 s is the time that cannot be executed in parallel

 Example, assume that 20% must be executed serial while 80%

can be executed in parallel, and that there are 4 processors.

 Optimal speed-up is 4 but this example gives: S= (0,2 + 0,8)/

(0,2+0,8/4) = 1/0,4 = 2,5

39

N - parallel units

40

Multiprocessors

 IBM System/370 (1970s): two IBM CPUs connected to shared memory.

 IBM System370/XA (1981): multiple CPUs can be connected to shared memory.

 IBM System/390 (1990s): similar features like S370/XA, with improved performance.

Possibility to connect several multiprocessor systems together through fast fiber-optic connection.

 CRAY X-MP (mid 1980s): from one to four vector processors connected to shared memory (cycle time: 8.5 ns).

 CRAY Y-MP (1988): from one to 8 vector processors connected to shared memory; 3 times more powerful than CRAY X-MP (cycle time: 4 ns).

 C90 (early 1990s): further development of CRAY Y-MP; 16 vector processors.

 CRAY 3 (1993): maximum 16 vector processors (cycle time 2ns).

 Butterfly multiprocessor system, by BBN Advanced Computers (1985/87): maximum 256 Motorola 68020 processors, interconnected by a sophisticated dynamic switching network.

 BBN TC2000 (1990): improved version of the Butterfly using Motorola 88100 RISC processor.

(11)

Questions

 What is a superscalar architecture?

 How does Amdahl's law work?

 Give examples why many parallel units will not improve performance?

 What is the difference between: In-order issue with in-order completion, In-order issue with out-of-order completion, and Out-of-order issue with out-of-order completion?

 Give examples of True data dependency, Output dependency, and Anti-dependency

 Give advantages and disadvantages of VLIW

 How did Flynn classify computers?

41 www.liu.se