Parallel Computer Structures
Three types
• Pipeline Computer
• Array Processor
• Multiprocessor Computer
Pipelining
Break instructions into steps
Work on instructions like in an assembly line
Allows for more instructions to be executed in less time
Ideally, an n-stage pipeline is up to n times faster than a non-pipelined processor
What is Pipelining?
Like an Automobile Assembly Line for Instructions
Each step does a little job of processing the instruction
Ideally each step operates in parallel
Simple Model: Instruction Fetch, Instruction Decode, Instruction Execute
F1 D1 E1
   F2 D2 E2
Pipeline
• It is a technique of decomposing a sequential process into suboperations, with each suboperation completed in a dedicated segment. Pipelining is commonly known as an assembly-line operation. It is similar to the assembly line in car manufacturing.
The first station in an assembly line sets up a chassis, next station is
Execution in a pipelined processor
Pipeline Stages
We can divide the execution of an instruction into the following 5 “classic” stages:
IF: Instruction Fetch
ID: Instruction Decode, register fetch
EX: Execution
MEM: Memory Access
WB: Write Back
Pipeline Stages
A RISC processor has a 5-stage instruction pipeline to execute all the instructions in the RISC instruction set. The following are the 5 stages of the RISC pipeline with their respective operations:
Stage 1 (Instruction Fetch)
In this stage the CPU reads the instruction from the memory address held in the program counter.
Stage 2 (Instruction Decode)
Pipeline Stages (contd…)
Stage 3 (Instruction Execute)
In this stage, ALU operations are performed.
Stage 4 (Memory Access)
In this stage, memory operands are read from or written to the memory address specified by the instruction.
Stage 5 (Write Back)
RISC Pipeline Stages
Fetch instruction → Decode instruction → Execute instruction → Access operand → Write result
Without Pipelining
[Figure: Instr 1 and Instr 2 executed one after the other over instruction cycles 1 to 10]
• Normally, you would perform the fetch, decode,
With Pipelining
[Figure: Instr 1 to Instr 5 overlapped across clock cycles 1 to 9]
• The processor is able to perform each stage simultaneously.
Pipeline (cont.)
The length of a pipeline cycle depends on the longest step
Thus in RISC, all instructions were made to be the same length
Each stage takes 1 clock cycle
Stages of Execution in Pipelined MIPS
5 stage instruction pipeline
1) I-fetch: Fetch Instruction, Increment PC
2) Decode: Decode Instruction, Read Registers
3) Execute:
   Mem-reference: Calculate Address
   R-format: Perform ALU Operation
4) Memory:
   Load: Read Data from Data Memory
   Store: Write Data to Data Memory
Pipelined Execution Representation
IFtch Dcd   Exec  Mem   WB
      IFtch Dcd   Exec  Mem   WB
            IFtch Dcd   Exec  Mem   WB
                  IFtch Dcd   Exec  Mem   WB
                        IFtch Dcd   Exec  Mem   WB
Program Flow: top to bottom
To simplify the pipeline, every instruction takes the same number of steps, called stages
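The staggered representation can be generated mechanically. The following is a minimal sketch, assuming every instruction passes through the same five stages and enters the pipeline one clock cycle after the previous one; the function names `pipeline_diagram` and `total_cycles` are illustrative and not part of the slides.

```python
def pipeline_diagram(num_instructions, stages=("IFtch", "Dcd", "Exec", "Mem", "WB")):
    """Return one text row per instruction.

    Assumption: instruction i (0-based) enters the pipeline i cycles after
    the first one, so its row is shifted right by i columns.
    """
    width = max(len(stage) for stage in stages) + 1  # one column per clock cycle
    rows = []
    for i in range(num_instructions):
        cells = [""] * i + list(stages)
        rows.append("".join(cell.ljust(width) for cell in cells).rstrip())
    return rows


def total_cycles(num_instructions, num_stages=5):
    """Clock cycles until the last instruction leaves the pipeline."""
    if num_instructions <= 0:
        return 0
    return num_stages + num_instructions - 1


if __name__ == "__main__":
    for row in pipeline_diagram(5):
        print(row)
    print("cycles:", total_cycles(5))
```

With five instructions the last row ends in the ninth column, which agrees with the nine clock cycles shown in the "With Pipelining" figure.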
Consider a ‘k’-segment pipeline with clock cycle time ‘Tp’.
Let there be ‘n’ tasks to be completed in the pipelined processor.
The first instruction takes ‘k’ cycles to come out of the pipeline, but each of the other ‘n – 1’ instructions takes only ‘1’ more cycle, i.e. a total of ‘n – 1’ cycles.
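Performance of a pipelined processor
So, the time taken to execute ‘n’ instructions in a pipelined processor is:
$$T_{\text{pipelined}} = (k + n - 1)\,T_p$$
Performance of a pipelined processor (contd..)
The speedup of the pipelined processor over a non-pipelined processor, when ‘n’ tasks are executed on the same processor, taking the non-pipelined time per task as $k\,T_p$, is:
$$S = \frac{T_{\text{non-pipelined}}}{T_{\text{pipelined}}} = \frac{n\,k\,T_p}{(k + n - 1)\,T_p}$$
When the number of tasks ‘n’ is significantly larger than k, that is, n >> k, the term k + n - 1 is approximately n, so S approaches k.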
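The timing argument can be checked numerically. This is a small sketch under the model described here: the first task needs k cycles and each later task one more cycle. It additionally assumes that the non-pipelined processor spends k × Tp on every task. The function names are illustrative.

```python
def pipelined_time(n, k, tp):
    """Time for n tasks on a k-segment pipeline with clock cycle time tp."""
    if n < 1 or k < 1:
        raise ValueError("n and k must be at least 1")
    return (k + n - 1) * tp


def non_pipelined_time(n, k, tp):
    """Assumption: without pipelining every task takes k * tp."""
    if n < 1 or k < 1:
        raise ValueError("n and k must be at least 1")
    return n * k * tp


def speedup(n, k, tp=1.0):
    """Ratio of non-pipelined time to pipelined time."""
    return non_pipelined_time(n, k, tp) / pipelined_time(n, k, tp)


if __name__ == "__main__":
    for n in (1, 10, 100, 10_000):
        print(n, round(speedup(n, k=5), 3))
```

As n grows with k fixed, the printed ratio moves toward k, which is the n >> k case.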
Pipeline Hazards
There are situations, called hazards, that prevent the next instruction in the instruction stream from executing during its designated cycle
There are three classes of hazards
Structural hazard
Data hazard
Branch hazard
Structural Hazards. They arise from resource conflicts when the hardware cannot support all possible combinations of instructions in simultaneous overlapped execution.
Data Hazards. They arise when an instruction depends on the result of a previous instruction in a way that is exposed by the overlapping of instructions in the pipeline.
Control Hazards. They arise from the pipelining of branches and other instructions that change the PC.
What Makes Pipelining Hard?
Power failing,
Arithmetic overflow,
I/O device request,
OS call,
Pipeline Hazards
Structural hazard
Resource conflicts when the hardware cannot support all possible combinations of instructions simultaneously
Data hazard
An instruction depends on the results of a previous instruction
Branch hazard
Structural hazard
Single Memory is a Structural Hazard
[Figure: pipeline diagram, in instruction order, of Load, Instr 1, Instr 2, Instr 3 and Instr 4, each passing through Mem, Reg, ALU, Mem, Reg]
• Can’t read same memory twice in same clock cycle
Structural hazard
Fetch Instruction (FI)
Decode Instruction (DI)
Fetch Operand (FO)
Execute Instruction (EI)
Write Operand (WO)
Memory data fetch is required in both FI and FO
[Figure: space-time diagram of instructions 1 to 5 passing through segments S1 to S5]
Structural hazard
To solve this hazard, we “stall” the pipeline until the resource is freed
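Stalling on a single memory can be illustrated with a tiny model. The sketch below assumes a 5-stage pipeline in which the memory access stage comes three cycles after the fetch, and a single memory port that serves either one instruction fetch or one data access per cycle. The function names and the `mem_stage_offset` parameter are illustrative assumptions, not taken from the slides.

```python
def fetch_cycles_single_memory(uses_memory, mem_stage_offset=3):
    """Return the clock cycle (1-based) in which each instruction is fetched.

    uses_memory[i] is True when instruction i reads or writes data memory.
    Assumption: a data access happens mem_stage_offset cycles after the
    fetch, and a fetch is stalled while the single memory is busy.
    """
    busy = set()      # cycles in which the memory serves a data access
    fetch = []
    cycle = 0
    for is_mem in uses_memory:
        cycle += 1
        while cycle in busy:          # stall: memory not free for the fetch
            cycle += 1
        fetch.append(cycle)
        if is_mem:
            busy.add(cycle + mem_stage_offset)
    return fetch


def stall_cycles(uses_memory):
    """Number of cycles lost to structural stalls."""
    fetch = fetch_cycles_single_memory(uses_memory)
    return fetch[-1] - len(uses_memory) if fetch else 0


if __name__ == "__main__":
    # Load followed by four instructions that do not touch data memory
    print(fetch_cycles_single_memory([True, False, False, False, False]))
```

In this model the Load's data access collides with the fetch of the third instruction after it, so that fetch is delayed by one cycle.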
Structural Hazard Solution:
1. Add more Hardware
Data hazard
Example:
ADD R1 <- R2 + R3
SUB R4 <- R1 - R5
AND R6 <- R1 AND R7
OR R8 <- R1 OR R9
XOR R10 <- R1 XOR
Data hazard
Fetch Instruction (FI)
Decode Instruction (DI)
Fetch Operand (FO)
Execute Instruction (EI)
Write Operand (WO)
FO: fetch data value
WO: store the executed value
[Figure: segments S1 to S5]
Data hazard
The delayed load approach inserts no-operation instructions to avoid the data conflict
ADD R1 <- R2 + R3
No-op
No-op
SUB R4 <- R1 - R5
AND
OR
XOR
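The no-op insertion can be sketched as a small pass over the instruction list. This is only an illustration: the tuple format, the function name `insert_noops` and the `min_distance` parameter are assumptions. `min_distance=3` is chosen so that two no-ops separate ADD from SUB, as in the sequence above.

```python
NOP = ("NOP", None, ())


def insert_noops(program, min_distance=3):
    """Insert NOPs so that a reader of a register is at least
    min_distance slots after the instruction that writes it.

    program: list of (name, dest_register, source_registers).
    Assumption: min_distance depends on the pipeline; 3 is illustrative.
    """
    scheduled = []
    last_write = {}   # register -> position of its latest writer
    for name, dest, sources in program:
        needed = 0
        for src in sources:
            if src in last_write:
                gap = len(scheduled) - last_write[src]
                needed = max(needed, min_distance - gap)
        scheduled.extend([NOP] * needed)
        if dest is not None:
            last_write[dest] = len(scheduled)
        scheduled.append((name, dest, sources))
    return scheduled


if __name__ == "__main__":
    program = [
        ("ADD", "R1", ("R2", "R3")),
        ("SUB", "R4", ("R1", "R5")),
        ("AND", "R6", ("R1", "R7")),
        ("OR", "R8", ("R1", "R9")),
    ]
    print([name for name, _, _ in insert_noops(program)])
```

Only SUB is close enough to ADD to need padding; AND and OR already sit far enough behind the write of R1.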
Data hazard
It can be further solved by a simple hardware technique called forwarding (also called bypassing or short-circuiting)
The insight in forwarding is that the result is not really needed by SUB until after the ADD has actually produced it
Data Hazard Classification
Three types of data hazards
RAW: Read After Write
WAW: Write After Write
WAR: Write After Read
• RAR: Read After Read (not a hazard)
Read After Write (RAW)
A read after write (RAW) data hazard refers to a situation where an instruction refers to a result that has not yet been calculated or retrieved.
This can occur because even though an instruction is executed after a previous instruction, the previous instruction has not been completely processed through the pipeline.
Example:
Write After Read (WAR)
A write after read (WAR) data hazard represents a problem with concurrent execution.
For example:
i1. i2.
Write After Write (WAW)
A write after write (WAW) data hazard may occur in a concurrent execution environment.
Example:
i1. R2 <- R4 + R7
i2. R2 <- R1 + R3
We must delay the WB (Write Back) of i2 until i1 has completed its write
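The three hazard classes can be told apart by comparing the registers two instructions read and write. The following is a minimal sketch; the `(dest, sources)` tuple format and the function name `classify_hazards` are illustrative, and the WAR pair in the demo is a hypothetical example, since the slide's own WAR example is not shown.

```python
def classify_hazards(first, second):
    """Return the set of data hazards between two instructions.

    first, second: (dest_register, source_registers), in program order.
    """
    dest1, sources1 = first
    dest2, sources2 = second
    hazards = set()
    if dest1 is not None and dest1 in sources2:
        hazards.add("RAW")   # second reads what first writes
    if dest2 is not None and dest2 in sources1:
        hazards.add("WAR")   # second writes what first reads
    if dest1 is not None and dest1 == dest2:
        hazards.add("WAW")   # both write the same register
    return hazards


if __name__ == "__main__":
    # WAW pair from the slide: i1. R2 <- R4 + R7, i2. R2 <- R1 + R3
    print(classify_hazards(("R2", ("R4", "R7")), ("R2", ("R1", "R3"))))
    # RAW pair: ADD R1 <- R2 + R3, SUB R4 <- R1 - R5
    print(classify_hazards(("R1", ("R2", "R3")), ("R4", ("R1", "R5"))))
    # Hypothetical WAR pair: R4 <- R1 + R5, then R5 <- R1 + R2
    print(classify_hazards(("R4", ("R1", "R5")), ("R5", ("R1", "R2"))))
```

Two instructions that only read the same register (RAR) produce an empty set, which matches RAR not being a hazard.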
Branch hazards
Branch hazards can cause a greater performance loss for pipelines
When a branch instruction is executed, it may or may not change the PC
If a branch changes the PC to its target
Branch hazards
There are FOUR schemes to handle branch hazards
Freeze scheme
Predict-untaken scheme
5-Stage Pipelining
Fetch Instruction (FI)
Decode Instruction (DI)
Fetch Operand (FO)
Execute Instruction (EI)
Write Operand (WO)
[Figure: space-time diagram of instructions 1 to 5 passing through segments S1 to S5]
Branch Untaken (Freeze approach)
The simplest method of dealing with branches is to redo the fetch following a branch
Branch Taken (Freeze approach)
The simplest method of dealing with branches is to redo the fetch following a branch
Branch Taken (Freeze approach)
The simplest scheme to handle branches is to freeze the pipeline, holding or deleting any instructions after the branch until the branch destination is known
The attractiveness of this solution lies primarily
Branch Hazards (Predicted-untaken)
A higher-performance, and only slightly more complex, scheme is to treat every branch as not taken
It is implemented by continuing to fetch instructions as if the branch were a normal instruction
The pipeline looks the same if the branch is not taken
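The difference between the freeze and predicted-untaken schemes can be put in numbers with a very simple model. The sketch assumes a fixed `penalty` (in clock cycles) that is paid for every branch under the freeze scheme, and only for taken branches under predict-untaken. The function name, the scheme labels and the penalty value are illustrative assumptions.

```python
def branch_stall_cycles(outcomes, penalty, scheme):
    """Total cycles lost to branches under a given scheme.

    outcomes: list of booleans, True when the branch is taken.
    penalty:  cycles lost each time the fetch after a branch is redone
              (assumed to be the same for both schemes).
    scheme:   "freeze" or "predict_untaken".
    """
    outcomes = list(outcomes)
    if scheme == "freeze":
        # Every branch stalls the pipeline, taken or not.
        return penalty * len(outcomes)
    if scheme == "predict_untaken":
        # Only taken branches discard the instructions already fetched.
        return penalty * sum(1 for taken in outcomes if taken)
    raise ValueError(f"unknown scheme: {scheme}")


if __name__ == "__main__":
    outcomes = [True, False, False, True]
    print(branch_stall_cycles(outcomes, penalty=1, scheme="freeze"))
    print(branch_stall_cycles(outcomes, penalty=1, scheme="predict_untaken"))
```

Under these assumptions predict-untaken never loses more cycles than freeze, and it loses none at all when no branch is taken.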
Branch Taken (Predicted-taken)
An alternative scheme is to treat every branch as taken
As soon as the branch is decoded and the target
Array Processor
An array processor is a synchronous parallel computer with multiple ALUs, called processing elements (PEs), that can operate in parallel in lockstep fashion.
Array Processor Classification
SIMD (Single Instruction Multiple Data): an array processor that has a single instruction, multiple data organization. It manipulates vector instructions by means of multiple functional units responding to a common instruction.
Examples: ILLIAC-IV, CM-2 (Connection Machine), MP-1 (MasPar-1), BSP (Burroughs Scientific Processor)
Attached array processor: an auxiliary processor attached to a general-purpose computer.
Array Processor Architecture – SIMD
• SIMD has two basic configurations:
a. Array processors using RAM, also known as dedicated memory organization (ILLIAC-IV, CM-2, MP-1)
b. Associative processor using content-addressable memory, also known as global memory organization
SIMD Architecture – Array Processor using RAM
• Here we have a control unit (CU) and multiple synchronized PEs.
• The control unit controls all the PEs below it.
• The control unit decodes all the instructions given to it and decides where each decoded instruction should be executed.
• The vector instructions are broadcast to all the PEs.
• This broadcasting achieves spatial parallelism through duplicate PEs.
SIMD Architecture – Array Processor using RAM
Processing Element
• A PE consists of an ALU with working registers and a local memory PMEMi, which is used to store distributed data.
• All PEs perform the same function synchronously under the supervision of the CU in lockstep fashion.
• Before execution in a PE, the vector operands should be loaded into its PMEM.
• Data can be loaded into the PMEM from an external source or by the CU.
• When executing an instruction, not all PEs have to work; only the enabled PEs do.
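The broadcast-and-mask behaviour described above can be mimicked in a few lines. This is a software sketch only: the class names, the `enabled` flag and the `broadcast` method are illustrative assumptions, and the sequential loop merely stands in for PEs that would work in lockstep in hardware.

```python
class ProcessingElement:
    """A PE with a local memory (PMEM) and an enable flag."""

    def __init__(self, local_memory):
        self.pmem = list(local_memory)
        self.enabled = True


class ControlUnit:
    """Broadcasts one operation to all PEs; only enabled PEs execute it."""

    def __init__(self, pes):
        self.pes = pes

    def set_mask(self, mask):
        # mask[i] is True when PE i should take part in the next instruction
        for pe, enabled in zip(self.pes, mask):
            pe.enabled = enabled

    def broadcast(self, operation, address):
        # Same operation, same PMEM address, different data in each PE
        for pe in self.pes:
            if pe.enabled:
                pe.pmem[address] = operation(pe.pmem[address])


if __name__ == "__main__":
    pes = [ProcessingElement([value]) for value in (1, 2, 3, 4)]
    cu = ControlUnit(pes)
    cu.set_mask([True, False, True, False])
    cu.broadcast(lambda x: x * 10, address=0)
    print([pe.pmem[0] for pe in pes])
```

Each PE applies the same operation to its own data, and the disabled PEs leave their local memory untouched.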
SIMD Architecture – Array Processor using RAM
Interconnection Network and Host Computer
IN: All communication between PEs is done through the interconnection network (IN). It performs all the routing and manipulation functions. The interconnection network is under the control of the CU.
SIMD Architecture – Masking and data routing organization
• One PE is connected to another PE via its routing register R.
• When one PE communicates with another PE, it is the contents of the R register that are transferred.
• During an instruction cycle, only the enabled PEs take the operands sent to them, while the other PEs discard the operands sent to them.
• For an enabled PE the status register S = 1 and for a
SIMD Architecture – Associative processor using content-addressable memory
• In this configuration the PEs do not have private memory. The memories attached to the PEs are replaced by parallel memory modules shared by all PEs via an alignment network.
• The alignment network does path switching between the PEs and the parallel memory.
• PE-to-PE communication is also via the alignment network.
• The alignment network is controlled by the CU.
Attached Array Processor
• In this configuration the attached array processor has an input/output interface to a common processor and another interface with a local memory.
Advantages
The principal reason for using the array processor is speed.
• The design of most array processors optimizes their performance for repetitive arithmetic operations, making them much faster at vector arithmetic than the host CPU. Since most array processors operate asynchronously from the host CPU, they constitute a co-processor which increases the capacity of the system.
Multiprocessor Computer
• System contains two or more processors of approximately comparable capabilities.
• All processors share access to a common set of memory modules, I/O channels, and peripheral devices.
• The entire system must be controlled by a single integrated operating system providing interactions between processors and their programs.
Multiprocessor Computer
Interprocess communication can be done through shared memories or through an interrupt network.
Multiprocessor hardware system organization is determined by the interconnection structure used between memories and I/O channels.
Some of the different interconnections used are:
• Time-Shared Common Bus
• Crossbar Switch Network