Computer Architecture TDTS10
Erik Larsson
Department of Computer Science
Why parallelism? Gaining performance by
increasing the clock frequency is no longer an option.
Outline
Superscalar Processors
Very Long Instruction Word Processors
Parallel computers
Pipelining
[Figure: pipelined execution; successive instructions pass through the stages FI, DI, CO, FO, EI and WO]
FI: Fetch Instruction
DI: Decode Instruction
CO: Calculate Operand
FO: Fetch Operand
EI: Execute Instruction
WO: Write Operand
Superpipelining
[Figure: superpipelined execution through the same stages FI, DI, CO, FO, EI and WO]
Superscalar Architecture
[Figure: superscalar execution; several instructions pass through the stages FI, DI, CO, FO, EI and WO in parallel]
Superscalar Architectures
Superscalar architectures allow several instructions to be issued and completed per clock cycle.
A superscalar architecture consists of a number of pipelines that are working in parallel.
Depending on the number and kind of parallel units available, a certain number of instructions can be executed in parallel.
Limitations
Resource conflicts
Control dependency
Data conflicts
True data dependency
Output dependency
Anti-dependency
Limitations (more severe than in a single pipeline):
Resource conflicts: similar to structural hazards
Control dependency: branches
Data conflicts
True Data Dependency
MUL R4,R3,R1 R4 ← R3 * R1
- - - -
ADD R2,R4,R5 R2 ← R4 + R5
[Figure: MUL R4,R3,R1 and ADD R2,R4,R5 in parallel pipelines, stages FI DI CO FO EI WO]
Assume: R1=2, R2=0, R3=2, R4=0, R5=2
Correct: MUL makes R4=4; ADD makes R2=6
In the figure, ADD reads the value of R4 while R4=0; R4 becomes 4 only later, when MUL writes its result. ADD then computes R2 = 0 + 2 = 2 instead of 6.
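The effect of this true data dependency can be reproduced with a small sketch. The register dictionary and the function names are illustrative assumptions, not part of any real simulator; the sketch only contrasts reading R4 after and before MUL has written it.

```python
def initial_registers():
    # Register values assumed in the example: R1=2, R2=0, R3=2, R4=0, R5=2
    return {"R1": 2, "R2": 0, "R3": 2, "R4": 0, "R5": 2}


def run_correct():
    """ADD reads R4 only after MUL has written it."""
    regs = initial_registers()
    regs["R4"] = regs["R3"] * regs["R1"]   # MUL R4,R3,R1
    regs["R2"] = regs["R4"] + regs["R5"]   # ADD R2,R4,R5
    return regs


def run_with_hazard():
    """ADD fetches its operands before MUL writes R4, so it sees a stale value."""
    regs = initial_registers()
    stale_r4 = regs["R4"]                  # operand fetch of ADD: R4 is still 0
    regs["R4"] = regs["R3"] * regs["R1"]   # MUL writes R4 = 4 afterwards
    regs["R2"] = stale_r4 + regs["R5"]     # ADD computes with the stale operand
    return regs


if __name__ == "__main__":
    print(run_correct()["R2"])      # 6
    print(run_with_hazard()["R2"])  # 2
```

Stalling ADD until MUL has written R4 corresponds to `run_correct`.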
Output Dependency
Two instructions are writing into the same location; if the second instruction writes before the first one, an error occurs:
MUL R4,R3,R1 R4 ← R3 * R1
- - - -
ADD R4,R2,R5 R4 ← R2 + R5
[Figure: MUL R4,R3,R1 and ADD R4,R2,R5 in parallel pipelines, stages FI DI CO FO EI WO]
Antidependency
An antidependency exists if an instruction uses a location as an operand while a following one is writing into that location; if the first one is still using the location when the second one writes into it, an error occurs:
MUL R4,R3,R1 R4 ← R3 * R1
- - - -
ADD R3,R2,R5 R3 ← R2 + R5
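The three kinds of data conflict can be told apart by comparing the destination and source registers of two instructions. The following is a minimal sketch; the `Instr` type and the `dependencies` function are hypothetical names introduced only for illustration.

```python
from typing import List, NamedTuple, Tuple


class Instr(NamedTuple):
    op: str
    dest: str
    srcs: Tuple[str, str]


def dependencies(first: Instr, second: Instr) -> List[str]:
    """Classify the data conflicts between an instruction and a following one."""
    found = []
    if first.dest in second.srcs:
        found.append("true data dependency")   # second reads what first writes
    if first.dest == second.dest:
        found.append("output dependency")      # both write the same location
    if second.dest in first.srcs:
        found.append("antidependency")         # second overwrites an operand of first
    return found


if __name__ == "__main__":
    mul = Instr("MUL", "R4", ("R3", "R1"))
    print(dependencies(mul, Instr("ADD", "R2", ("R4", "R5"))))  # ['true data dependency']
    print(dependencies(mul, Instr("ADD", "R4", ("R2", "R5"))))  # ['output dependency']
    print(dependencies(mul, Instr("ADD", "R3", ("R2", "R5"))))  # ['antidependency']
```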
Output Dependency and Antidependency
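Output dependencies and antidependencies are storage conflicts. Both can be removed by using additional registers:
Case 1 (output dependency removed: ADD writes R7 instead of R4):
MUL R4,R3,R1 R4 ← R3 * R1
- - - -
ADD R7,R2,R5 R7 ← R2 + R5
Case 2 (antidependency removed: ADD writes R6 instead of R3):
MUL R4,R3,R1 R4 ← R3 * R1
- - - -
ADD R6,R2,R5 R6 ← R2 + R5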
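The idea of using additional registers can be sketched as a simple renaming pass. This is only an illustration under assumptions: the slides pick spare registers R7 and R6, whereas the hypothetical `rename` function below gives every written register a fresh name P0, P1, ... and redirects later reads to the latest name.

```python
def rename(program):
    """Give each destination a fresh register; reads follow the latest writer."""
    mapping = {}      # original register name -> current renamed register
    next_free = 0     # index of the next unused renamed register
    renamed = []
    for op, dest, src1, src2 in program:
        # Sources are looked up first, so they refer to earlier writers only.
        new_srcs = tuple(mapping.get(s, s) for s in (src1, src2))
        new_dest = f"P{next_free}"
        next_free += 1
        mapping[dest] = new_dest
        renamed.append((op, new_dest) + new_srcs)
    return renamed


if __name__ == "__main__":
    # Output dependency: both instructions write R4.
    print(rename([("MUL", "R4", "R3", "R1"), ("ADD", "R4", "R2", "R5")]))
    # Antidependency: ADD overwrites R3, which MUL reads.
    print(rename([("MUL", "R4", "R3", "R1"), ("ADD", "R3", "R2", "R5")]))
    # True data dependency: ADD still reads the result of MUL after renaming.
    print(rename([("MUL", "R4", "R3", "R1"), ("ADD", "R2", "R4", "R5")]))
```

In the first two cases the renamed instructions no longer share a destination or overwrite an operand. In the third case the true data dependency remains, since ADD reads P0.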
Policies for Parallel Instruction Execution
The ability of a superscalar processor to execute instructions in parallel is determined by:
the number and nature of parallel pipelines
the mechanism used to find independent instructions.
The policies are characterized by:
the order in which instructions are issued for execution.
the order in which instructions are completed.
Execution policies:
In-order issue with in-order completion.
In-order issue with out-of-order completion.
Out-of-order issue with out-of-order completion.
Correct operation
Two instructions can be fetched and decoded at a time;
Three functional units can work in parallel: floating point unit, integer adder, integer multiplier;
Two instructions can be written back (completed) at a time;
We consider the following instruction sequence:
I1: ADDF R12,R13,R14 R12 ← R13 + R14 (floating point)
I2: ADD R1,R8,R9 R1 ← R8 + R9
I3: MUL R4,R2,R3 R4 ← R2 * R3
I4: MUL R5,R6,R7 R5 ← R6 * R7
I5: ADD R10,R5,R7 R10 ← R5 + R7
I6: ADD R11,R2,R3 R11 ← R2 + R3
I1 requires two cycles to execute;
I3 and I4 are in conflict for the same functional unit;
I5 depends on the value produced by I4 (we have a true data dependency between I4 and I5);
I2, I5 and I6 are in conflict for the same functional unit;
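The conflicts listed above can be derived mechanically from the instruction sequence. The sketch below assumes a tuple encoding (name, functional unit, destination, sources) and the unit labels "fp", "add" and "mul"; these, and the function names, are illustrative choices.

```python
from itertools import combinations

# (name, functional unit, destination, sources)
PROGRAM = [
    ("I1", "fp",  "R12", ("R13", "R14")),
    ("I2", "add", "R1",  ("R8", "R9")),
    ("I3", "mul", "R4",  ("R2", "R3")),
    ("I4", "mul", "R5",  ("R6", "R7")),
    ("I5", "add", "R10", ("R5", "R7")),
    ("I6", "add", "R11", ("R2", "R3")),
]


def resource_conflicts(program):
    """Group instructions that compete for the same functional unit."""
    groups = {}
    for name, unit, _dest, _srcs in program:
        groups.setdefault(unit, []).append(name)
    return {unit: names for unit, names in groups.items() if len(names) > 1}


def true_dependencies(program):
    """Pairs (producer, consumer) where a later instruction reads an earlier result.

    Simplification: intervening writes to the same register are not tracked,
    which is sufficient here because every register is written at most once.
    """
    deps = []
    for (n1, _, dest1, _), (n2, _, _, srcs2) in combinations(program, 2):
        if dest1 in srcs2:
            deps.append((n1, n2))
    return deps


if __name__ == "__main__":
    print(resource_conflicts(PROGRAM))  # {'add': ['I2', 'I5', 'I6'], 'mul': ['I3', 'I4']}
    print(true_dependencies(PROGRAM))   # [('I4', 'I5')]
```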
In-Order Issue with In-Order Completion
Cycle   Decode/Issue   Execute     Write back/complete
1       I1 I2
2       I3 I4          I1 I2
3       I5 I6          I1 STALL
4                      I3          I1 I2
5                      I4          I3
6                      I5          I4
7                      I6          I5
8                                  I6
I1: ADDF R12,R13,R14; I2: ADD R1,R8,R9; I3: MUL R4,R2,R3; I4: MUL R5,R6,R7; I5: ADD R10,R5,R7; I6: ADD R11,R2,R3
In-Order Issue with In-Order Completion
Instructions are issued in the exact order that would correspond to sequential execution; results are written (completion) in the same order.
An instruction cannot be issued before the previous one has been issued;
An instruction completes only after the previous one has completed.
To guarantee in-order completion, instruction issuing stalls when there is a conflict and when the unit requires more than one cycle to execute;
The processor detects and handles (by stalling) true data dependencies and resource conflicts.
With in-order issue and in-order completion, the processor does not have to deal with output dependencies and antidependencies; it only has to detect true data dependencies.
Neither of the two dependencies will be violated if instructions are issued and completed in order:
Output dependency
MUL R4,R3,R1 R4 ← R3 * R1
- - - -
ADD R4,R2,R5 R4 ← R2 + R5
Antidependency
MUL R4,R3,R1 R4 ← R3 * R1
- - - -
ADD R3,R2,R5 R3 ← R2 + R5
Out-of-Order Issue with Out-of-Order Completion
Columns: Decode/Issue (two slots), Execute (three units), Write back/complete (two slots), Cycle
I1 I2 1
I3 I4 I1 I2 2
I5 I4 I1 I3 I2 3
I6 I4 I1 I3 4
I5 I4 I6 5
I6 I5 6
7 8
I1: ADDF R12,R13,R14; I2: ADD R1,R8,R9; I3: MUL R4,R2,R3; I4: MUL R5,R6,R7; I5: ADD R10,R5,R7; I6: ADD R11,R2,R3
Comments on Superscalars
Main techniques for superscalar processors:
additional pipelined units which are working in parallel;
out-of-order issue and out-of-order completion;
register renaming.
All of the above techniques aim to enhance performance.
Experiments have shown:
only adding additional units is not efficient
out-of-order issue is extremely important
register renaming can improve performance by more than 30%
it is important to provide enough fetching/decoding capacity so that the instruction window is sufficiently large.
Some Architectures
PowerPC 604
six independent execution units:
Branch execution unit
Load/Store unit
3 Integer units
Floating-point unit
in-order issue
PowerPC 620
provides out-of-order issue in addition to the features of the 604
Pentium
three independent execution units:
2 Integer units
Floating point unit
in-order issue
two instructions issued per clock cycle.
Pentium II to 4
provide out-of-order issue in addition to the features of the Pentium
five instructions can be issued in one cycle
Summary Superscalars
Good
The hardware solves everything:
Hardware detects potential parallelism between instructions;
Hardware tries to issue as many instructions as possible in parallel.
Hardware performs register renaming.
Binary compatibility
If functional units are added in a new version of the architecture or some other improvements have been made to the architecture (without changing the instruction sets), old programs can benefit from the additional potential of parallelism.
Why?
Because the new hardware will issue the old instruction sequence in a more efficient way.
Bad
Very complex
Much hardware is needed for run-time detection. There is a limit to how far we can go with this technique.
Power consumption can be very large!
The instruction window is limited ⇒ this limits the capacity to detect potentially parallel instructions
The Alternative: VLIW Processors
VLIW architectures rely on compile-time detection of parallelism
⇒ the compiler analyses the program and detects operations to be executed in parallel; such operations are packed into one “large” instruction.
After one instruction has been fetched all the corresponding operations are issued in parallel.
No hardware is needed for run-time detection of parallelism.
The instruction window problem is solved: the compiler can potentially analyse the whole program in order to detect parallel operations.
VLIW Processors
Detection of parallelism and packaging of operations into instructions are done off-line by the compiler.
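Compile-time packing can be illustrated with a greedy sketch. It is not how any particular VLIW compiler works: the tuple encoding, the one-slot-per-unit layout and the `pack_bundles` function are assumptions, every dependency is conservatively pushed to a later bundle, and operation latencies are ignored.

```python
# (name, functional unit, destination, sources)
PROGRAM = [
    ("I1", "fp",  "R12", ("R13", "R14")),
    ("I2", "add", "R1",  ("R8", "R9")),
    ("I3", "mul", "R4",  ("R2", "R3")),
    ("I4", "mul", "R5",  ("R6", "R7")),
    ("I5", "add", "R10", ("R5", "R7")),
    ("I6", "add", "R11", ("R2", "R3")),
]

# Assumed long-instruction layout: one slot per functional unit.
SLOTS = {"fp": 1, "add": 1, "mul": 1}


def pack_bundles(program, slots):
    """Place each operation in the earliest long instruction (bundle) that
    comes after all operations it depends on and still has a free slot."""
    bundles = []   # list of bundles, each a list of operations
    placed = []    # (bundle index, operation) for operations already placed
    for op in program:
        _name, unit, dest, srcs = op
        earliest = 0
        for idx, (_, _, p_dest, p_srcs) in placed:
            depends = (p_dest in srcs       # true data dependency
                       or p_dest == dest    # output dependency
                       or dest in p_srcs)   # antidependency
            if depends:
                earliest = max(earliest, idx + 1)
        i = earliest
        while True:
            if i == len(bundles):
                bundles.append([])
            used = sum(1 for other in bundles[i] if other[1] == unit)
            if used < slots[unit]:
                bundles[i].append(op)
                placed.append((i, op))
                break
            i += 1
    return [[o[0] for o in bundle] for bundle in bundles]


if __name__ == "__main__":
    print(pack_bundles(PROGRAM, SLOTS))  # [['I1', 'I2', 'I3'], ['I4', 'I6'], ['I5']]
```

The second and third bundles leave slots unused; in a fixed-width long instruction those slots still occupy bits.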
Summary VLIW Processors
Advantages
the number of FUs can be increased without needing additional sophisticated hardware to detect parallelism.
power consumption can be reduced.
Compilers can detect parallelism based on global analysis of the whole program.
Problems:
large number of registers needed in order to keep all FUs active.
Large data transport capacity is needed between
FUs and the register file
register files and memory.
instruction cache and fetch unit.
Large code size, partly because unused operation slots in an instruction are wasted bits.
Incompatibility of binary code
Parallel Computers
One solution to the need for high performance: architectures in which several CPUs are running in order to solve a certain application.
Such computers have been organized in very different ways.
Some key features:
number and complexity of individual CPUs
availability of common (shared) memory
interconnection topology
performance of interconnection network
I/O devices
The need for high performance!
Two main factors contribute to high performance of modern processors:
Fast circuit technology
Architectural features:
large caches
multiple fast buses
pipelining
superscalar architectures (multiple functional units)
Classification of Computer Architectures
Flynn’s classification is based on the nature of the instruction flow executed by the computer and that of the data flow on which the instructions operate.
The multiplicity of instruction stream and data stream gives us four different classes:
Single instruction, single data stream - SISD
Single instruction, multiple data stream - SIMD
Multiple instruction, single data stream - MISD
Multiple instruction, multiple data stream - MIMD
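The four classes follow directly from whether there is one or more than one stream of each kind. A minimal sketch, with `flynn_class` as a hypothetical helper name:

```python
def flynn_class(instruction_streams: int, data_streams: int) -> str:
    """Return the Flynn class for the given numbers of instruction and data streams."""
    if instruction_streams < 1 or data_streams < 1:
        raise ValueError("stream counts must be at least 1")
    instr = "S" if instruction_streams == 1 else "M"
    data = "S" if data_streams == 1 else "M"
    return f"{instr}I{data}D"


if __name__ == "__main__":
    print(flynn_class(1, 1))  # SISD
    print(flynn_class(1, 8))  # SIMD
    print(flynn_class(4, 1))  # MISD
    print(flynn_class(4, 4))  # MIMD
```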
References
Flynn, M., "Some Computer Organizations and Their Effectiveness", IEEE Trans. Comput., Vol. C-21, pp. 948, 1972.
Duncan, R., "A Survey of Parallel Computer Architectures", IEEE Computer, February 1990, pp. 5-16.
Processor organisation
Single instruction, single data stream (SISD)
  Uniprocessor
Single instruction, multiple data stream (SIMD)
  Vector processor
  Array processor
Multiple instruction, single data stream (MISD)
Multiple instruction, multiple data stream (MIMD)
  Shared memory
    Symmetric multiprocessor (SMP)
    Nonuniform memory access (NUMA)
  Distributed memory
Single Instruction stream, Single Data stream (SISD)
Single Instruction stream, Multiple Data stream (SIMD) with shared memory
SIMD with no shared memory
Multiple Instruction stream, Multiple Data stream (MIMD) with shared memory
MIMD with no shared memory
Multiprocessors
Shared memory MIMD computers are called multiprocessors:
Amdahl's law
S = (s + p) / (s + p/N) = 1 / (s + p/N)
where S is the speed-up, s is the serial part, p is the parallel part, and N is the number of processors (parallel units). The last step assumes that the times are normalized so that s + p = 1.
(s + p) is the total execution time
p is the time that can be executed in parallel
s is the time that cannot be executed in parallel
Example: assume that 20% must be executed serially while 80% can be executed in parallel, and that there are 4 processors.
The optimal speed-up is 4, but this example gives: S = (0.2 + 0.8) / (0.2 + 0.8/4) = 1/0.4 = 2.5
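The formula is easy to evaluate for different numbers of processors. The function name `amdahl_speedup` and the chosen values of N are illustrative.

```python
def amdahl_speedup(serial: float, parallel: float, n: int) -> float:
    """Speed-up S = (s + p) / (s + p/N) for N processors."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return (serial + parallel) / (serial + parallel / n)


if __name__ == "__main__":
    # The example above: 20% serial, 80% parallel, 4 processors.
    print(round(amdahl_speedup(0.2, 0.8, 4), 3))  # 2.5
    # More processors help less and less: S can never exceed 1/s = 5 here.
    for n in (1, 4, 16, 64, 1024):
        print(n, round(amdahl_speedup(0.2, 0.8, n), 3))
```

As N grows, the term p/N vanishes and the speed-up approaches (s + p)/s, so the serial part bounds what additional parallel units can deliver.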
Multiprocessors
IBM System/370 (1970s): two IBM CPUs connected to shared memory.
IBM System/370-XA (1981): multiple CPUs can be connected to shared memory.
IBM System/390 (1990s): features similar to the System/370-XA, with improved performance.
Possibility to connect several multiprocessor systems together through fast fiber-optic connections.
CRAY X-MP (mid 1980s): from one to four vector processors connected to shared memory (cycle time: 8.5 ns).
CRAY Y-MP (1988): from one to 8 vector processors connected to shared memory; 3 times more powerful than CRAY X-MP (cycle time: 4 ns).
C90 (early 1990s): further development of CRAY Y-MP; 16 vector processors.
CRAY 3 (1993): maximum 16 vector processors (cycle time 2ns).
Butterfly multiprocessor system, by BBN Advanced Computers (1985/87): maximum 256 Motorola 68020 processors, interconnected by a sophisticated dynamic switching network.
BBN TC2000 (1990): improved version of the Butterfly using Motorola 88100 RISC processor.
Questions
What is a superscalar architecture?
How does Amdahl's law work?
Give examples of why adding many parallel units may not improve performance.
What is the difference between in-order issue with in-order completion, in-order issue with out-of-order completion, and out-of-order issue with out-of-order completion?
Give examples of true data dependency, output dependency, and antidependency.
Give advantages and disadvantages of VLIW
How did Flynn classify computers?
www.liu.se