
Parallel computer Structures

Three types:

• Pipeline Computer
• Array Processor
• Multiprocessor Computer

Pipelining

• Break instructions into steps
• Work on instructions like in an assembly line
• Allows more instructions to be executed in less time
• Ideally, an n-stage pipeline approaches n times the throughput of a non-pipelined processor

What is Pipelining?

• Like an automobile assembly line for instructions
• Each step does a small part of the job of processing the instruction
• Ideally, all steps operate in parallel
• Simple model: Instruction Fetch, Instruction Decode, Instruction Execute

F1 D1 E1
   F2 D2 E2

Pipeline

• Pipelining is a technique of decomposing a sequential process into suboperations, with each suboperation completed in a dedicated segment.
• A pipeline is commonly known as an assembly-line operation.
• It is similar to the assembly line in car manufacturing.
• The first station in an assembly line sets up a chassis, the next station is…

Execution in a pipelined processor

Pipeline Stages

We can divide the execution of an instruction into the following 5 “classic” stages:

IF: Instruction Fetch
ID: Instruction Decode, register fetch
EX: Execution
MEM: Memory Access
WB: Write Back

Pipeline Stages

A RISC processor has a 5-stage instruction pipeline that executes all the instructions in the RISC instruction set. The 5 stages of the RISC pipeline and their operations are:

Stage 1 (Instruction Fetch)

In this stage, the CPU reads the instruction from the memory address held in the program counter.

Stage 2 (Instruction Decode)

Pipeline Stages (contd…)

Stage 3 (Instruction Execute)

In this stage, ALU operations are performed.

Stage 4 (Memory Access)

In this stage, memory operands are read from or written to the memory address specified by the instruction.

Stage 5 (Write Back)

RISC Pipeline Stages

• Fetch instruction
• Decode instruction
• Execute instruction
• Access operand
• Write result

Note:

Without Pipelining

[Figure: Instr 1 and Instr 2 plotted against instruction cycles 1 to 10]

• Normally, you would perform the fetch, decode,

With Pipelining

[Figure: Instr 1 to Instr 5 plotted against clock cycles 1 to 9]

• The processor is able to perform each stage simultaneously.
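The overlap shown on this slide can be sketched as a small text chart. This is only an illustration: the function name `pipeline_chart` is made up for this example, and it assumes the five classic stages (IF, ID, EX, MEM, WB), one new instruction entering per clock cycle, and no hazards.

```python
def pipeline_chart(n_instr, stages=("IF", "ID", "EX", "MEM", "WB")):
    """Return one text row per instruction, plus the total cycle count."""
    k = len(stages)
    total_cycles = k + n_instr - 1          # cycle in which the last instruction finishes
    rows = []
    for i in range(n_instr):
        # Instruction i enters the pipeline i cycles after the first one,
        # then occupies exactly one stage per cycle.
        cells = ["."] * i + list(stages) + ["."] * (total_cycles - i - k)
        rows.append(f"Instr {i + 1}: " + " ".join(f"{c:>3}" for c in cells))
    return rows, total_cycles


rows, cycles = pipeline_chart(5)
print("\n".join(rows))
print("clock cycles needed:", cycles)
```

With five instructions and five stages the chart spans nine clock cycles, which matches the cycle axis of the figure.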

Pipeline (cont.)

• The length of the pipeline clock cycle depends on the longest step
• Thus, in RISC, all instructions were made to be the same length
• Each stage takes 1 clock cycle

Stages of Execution in Pipelined MIPS

5-stage instruction pipeline

1) I-fetch: Fetch Instruction, Increment PC
2) Decode: Instruction, Read Registers
3) Execute:
   Mem-reference: Calculate Address
   R-format: Perform ALU Operation
4) Memory:
   Load: Read Data from Data Memory
   Store: Write Data to Data Memory

Pipelined Execution Representation

[Figure: instructions in program-flow order, each passing through the stages IFtch, Dcd, Exec, Mem, WB]

• To simplify the pipeline, every instruction takes the same number of steps, called stages

Consider a ‘k’-segment pipeline with clock cycle time ‘Tp’.

Let there be ‘n’ tasks to be completed in the pipelined processor.

The first instruction takes ‘k’ cycles to come out of the pipeline, but each of the other ‘n – 1’ instructions takes only 1 more cycle, i.e., a total of ‘n – 1’ cycles.

Performance of a pipelined processor

So, the time taken to execute ‘n’ instructions in a pipelined processor is:

(k + n – 1) × Tp

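Performance of a pipelined processor (contd..)

The speedup of the pipelined processor over a non-pipelined processor, when ‘n’ tasks are executed on the same processor, is:

S = (time without pipelining) / (time with pipelining) = (n × k × Tp) / ((k + n – 1) × Tp)

assuming each task takes ‘k’ cycles of length ‘Tp’ without pipelining.

When the number of tasks ‘n’ is significantly larger than ‘k’, that is, n >> k, the speedup approaches k:

S ≈ k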
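The timing argument above can be checked numerically. This is a minimal sketch, not a model of any real processor: the function names are hypothetical, and it assumes that a non-pipelined processor spends k cycles of length Tp on every task, while the pipelined one needs k cycles for the first task and one more cycle for each of the remaining n – 1.

```python
def pipelined_time(n, k, tp):
    """k cycles for the first task, then one cycle for each of the other n - 1."""
    return (k + n - 1) * tp


def non_pipelined_time(n, k, tp):
    # Assumption: without pipelining each task passes through all k segments alone.
    return n * k * tp


def speedup(n, k, tp=1.0):
    return non_pipelined_time(n, k, tp) / pipelined_time(n, k, tp)


for n in (1, 10, 100, 10_000):
    print(n, round(speedup(n, k=5), 3))
```

For k = 5 the printed speedup starts at 1.0 for a single task and climbs towards 5 as n grows, which is the n >> k case described on the slide.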

Pipeline Hazards

• There are situations, called hazards, that prevent the next instruction in the instruction stream from executing during its designated cycle
• There are three classes of hazards:
  • Structural hazard
  • Data hazard
  • Branch hazard

Structural Hazards. They arise from resource conflicts when the hardware cannot support all possible combinations of instructions in simultaneous overlapped execution.

Data Hazards. They arise when an instruction depends on the result of a previous instruction in a way that is exposed by the overlapping of instructions in the pipeline.

Control Hazards (branch hazards). They arise from the pipelining of branches and other instructions

What Makes Pipelining Hard?

• Power failing,
• Arithmetic overflow,
• I/O device request,
• OS call,

Pipeline Hazards

• Structural hazard
  Resource conflicts when the hardware cannot support all possible combinations of instructions simultaneously
• Data hazard
  An instruction depends on the results of a previous instruction
• Branch hazard

Structural hazard

Single Memory is a Structural Hazard

[Figure: Load, Instr 1, Instr 2, Instr 3 and Instr 4 in instruction order, each passing through memory (M), register (Reg) and ALU stages]

• Can’t read same memory twice in same clock cycle
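The single-memory conflict can be located with a few lines of code. This is a sketch under stated assumptions rather than a description of a specific machine: the helper `memory_conflicts` is hypothetical, the pipeline has 5 stages with one instruction issued per clock cycle, every instruction uses the memory in its 1st cycle (fetch), and a load or store uses it again in its 4th cycle (data access).

```python
def memory_conflicts(is_load_or_store):
    """Find cycles in which a data access and an instruction fetch collide.

    is_load_or_store[i] is True when instruction i (0-based, program order)
    accesses data memory. Returns (cycle, accessing_index, fetching_index).
    """
    conflicts = []
    n = len(is_load_or_store)
    for i, touches_memory in enumerate(is_load_or_store):
        if not touches_memory:
            continue
        access_cycle = (i + 1) + 3      # instruction i is fetched in cycle i + 1
        fetching = access_cycle - 1     # index of the instruction fetched in that cycle
        if fetching < n:
            conflicts.append((access_cycle, i, fetching))
    return conflicts


program = ["Load", "Instr 1", "Instr 2", "Instr 3", "Instr 4"]
for cycle, a, b in memory_conflicts([True, False, False, False, False]):
    print(f"cycle {cycle}: {program[a]} (data access) vs {program[b]} (fetch)")
```

Under these assumptions the Load's data access and the fetch of Instr 3 both need the memory in clock cycle 4.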

Structural hazard

Five stages (S1 to S5): Fetch Instruction (FI), Decode Instruction (DI), Fetch Operand (FO), Execution Instruction (EI), Write Operand (WO)

A memory fetch is required in both FI and FO

[Figure: pipeline diagram with rows S1 to S5 and columns 1 to 5]

Structural hazard

• To solve this hazard, we “stall” the pipeline until the resource is freed

Structural Hazards Solution

1. Add more Hardware

Data hazard

Example:

ADD R1 <- R2 + R3
SUB R4 <- R1 - R5
AND R6 <- R1 AND R7
OR  R8 <- R1 OR R9
XOR R10 <- R1 XOR …

Data hazard

Five stages (S1 to S5): Fetch Instruction (FI), Decode Instruction (DI), Fetch Operand (FO), Execution Instruction (EI), Write Operand (WO)

FO: fetch data value
WO: store the executed value

Data hazard

• The delay load approach inserts no-operation instructions to avoid the data conflict

ADD    R1 <- R2 + R3
No-op
No-op
SUB    R4 <- R1 - R5
AND
OR
XOR
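The delay load idea can be sketched as a tiny scheduling pass. Everything here is illustrative: `insert_noops` and the tuple encoding of instructions are made up for this example, and `gap=2` is an assumption chosen to match the two No-ops on the slide (a result is taken to be readable by an instruction at least `gap + 1` slots after its producer).

```python
def insert_noops(program, gap=2):
    """Insert "No-op" entries so no instruction reads a register too early.

    program is a list of (opcode, destination, sources) in program order.
    """
    schedule = []    # opcodes in issue order, including inserted No-ops
    ready_at = {}    # register -> earliest slot index at which it may be read
    for op, dest, sources in program:
        earliest = max((ready_at.get(r, 0) for r in sources), default=0)
        while len(schedule) < earliest:
            schedule.append("No-op")
        schedule.append(op)
        # This instruction sits in slot len(schedule) - 1, so a reader of
        # dest must sit in slot len(schedule) + gap or later.
        ready_at[dest] = len(schedule) + gap
    return schedule


program = [
    ("ADD", "R1", ("R2", "R3")),
    ("SUB", "R4", ("R1", "R5")),
    ("AND", "R6", ("R1", "R7")),
    ("OR", "R8", ("R1", "R9")),
]
print(insert_noops(program))
```

For the first four instructions of the example this yields ADD, No-op, No-op, SUB, AND, OR: only SUB has to wait, because by the time AND and OR issue the value of R1 is already available.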

Data hazard

• The data hazard can also be resolved by a simple hardware technique called forwarding (also called bypassing or short-circuiting)
• The insight in forwarding is that SUB does not really need the result until after the ADD has actually produced it

Data Hazard Classification

• Three types of data hazards:
  • RAW: Read After Write
  • WAW: Write After Write
  • WAR: Write After Read
• RAR: Read After Read (not a hazard)

Read After Write (RAW)

• A read after write (RAW) data hazard refers to a situation where an instruction refers to a result that has not yet been calculated or retrieved.
• This can occur because, even though an instruction is executed after a previous instruction, the previous instruction has not been completely processed through the pipeline.

Example:

Write After Read (WAR)

• A write after read (WAR) data hazard represents a problem with concurrent execution.

For example:

i1.
i2.

Write After Write (WAW)

• A write after write (WAW) data hazard may occur in a concurrent execution environment.

Example:

i1. R2 <- R4 + R7
i2. R2 <- R1 + R3

We must delay the WB (Write Back) of i2 until i1 has finished executing
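The RAW, WAR and WAW cases can be told apart from the registers that two instructions read and write. This is a minimal sketch with a hypothetical helper, `classify_hazards`; it reports potential hazards only, since whether one actually causes a stall depends on the pipeline.

```python
def classify_hazards(first, second):
    """Name the data hazards between two instructions given in program order.

    Each instruction is a pair (destination_register, source_registers).
    """
    dest1, sources1 = first
    dest2, sources2 = second
    hazards = []
    if dest1 in sources2:
        hazards.append("RAW")   # second reads what first writes
    if dest2 in sources1:
        hazards.append("WAR")   # second writes what first reads
    if dest1 == dest2:
        hazards.append("WAW")   # both write the same register
    return hazards


i1 = ("R2", ("R4", "R7"))   # i1. R2 <- R4 + R7
i2 = ("R2", ("R1", "R3"))   # i2. R2 <- R1 + R3
print(classify_hazards(i1, i2))   # ['WAW']
```

The same function reports RAW for the earlier ADD/SUB pair, where SUB reads the R1 that ADD writes.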

Branch hazards

• Branch hazards can cause a greater performance loss for pipelines
• When a branch instruction is executed, it may or may not change the PC
• If a branch changes the PC to its target

Branch hazards

• There are FOUR schemes to handle branch hazards
  • Freeze scheme
  • Predict-untaken scheme

5-Stage Pipelining

Five stages (S1 to S5): Fetch Instruction (FI), Decode Instruction (DI), Fetch Operand (FO), Execution Instruction (EI), Write Operand (WO)

[Figure: pipeline diagram with rows S1 to S5 and columns 1 to 5]

Branch Untaken (Freeze approach)

• The simplest method of dealing with branches is to redo the fetch following a branch

Branch Taken (Freeze approach)

• The simplest method of dealing with branches is to redo the fetch following a branch

Branch Taken (Freeze approach)

• The simplest scheme to handle branches is to freeze the pipeline, holding or deleting any instructions after the branch until the branch destination is known
• The attractiveness of this solution lies primarily

Branch Hazards (Predicted-untaken)

• A higher-performance, and only slightly more complex, scheme is to treat every branch as not taken
• It is implemented by continuing to fetch instructions as if the branch were a normal instruction
• The pipeline looks the same if the branch is not taken
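The difference between the freeze and predicted-untaken schemes can be illustrated with a rough cost model. This is a sketch only: `total_cycles` is a hypothetical helper, `penalty` is an assumed number of cycles lost whenever the fetch after a branch has to be redone, and the model ignores every other kind of hazard.

```python
def total_cycles(branches, k=5, penalty=1, scheme="predict-untaken"):
    """Cycles to push a list of instructions through a k-stage pipeline.

    branches[i] is None for a non-branch, True for a taken branch and
    False for an untaken branch.
    """
    n = len(branches)
    base = k + n - 1                      # hazard-free pipelined time
    if scheme == "freeze":
        # Every branch pays the penalty, whichever way it goes.
        lost = sum(1 for b in branches if b is not None)
    elif scheme == "predict-untaken":
        # Only taken branches pay; untaken ones look like normal instructions.
        lost = sum(1 for b in branches if b is True)
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    return base + penalty * lost


trace = [None, False, None, True, None]   # hypothetical 5-instruction trace
print(total_cycles(trace, scheme="freeze"))            # 11
print(total_cycles(trace, scheme="predict-untaken"))   # 10
```

In this made-up trace the untaken branch costs nothing under predicted-untaken, which is why that scheme finishes one cycle earlier than freezing.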

Branch Taken (Predicted-taken)

• An alternative scheme is to treat every branch as taken
• As soon as the branch is decoded and the target

Array Processor

An array processor is a synchronous parallel computer with multiple ALUs, called processing elements (PEs), that can operate in parallel in lockstep fashion.

Array Processor Classification

SIMD (Single Instruction Multiple Data): an array processor that has a single-instruction, multiple-data organization. It manipulates vector instructions by means of multiple functional units responding to a common instruction.

Examples: ILLIAC-IV, CM-2 (Connection Machine), MP-1 (MasPar-1), BSP (Burroughs Scientific Processor)

Attached array processor: an auxiliary processor attached to a general-purpose computer.

Array Processor Architecture – SIMD

• SIMD has two basic configurations:

a. Array processors using RAM, also known as dedicated memory organization (ILLIAC-IV, CM-2, MP-1)

b. Associative processor using content-addressable memory, also known as global memory organization

SIMD Architecture – Array Processor using RAM

• Here we have a control unit (CU) and multiple synchronized PEs.
• The control unit controls all the PEs below it.
• The control unit decodes all the instructions given to it and decides where each decoded instruction should be executed.
• Vector instructions are broadcast to all the PEs. This broadcasting achieves spatial parallelism through duplicated PEs.

SIMD Architecture – Array Processor using RAM

Processing Element

• A PE consists of an ALU with working registers and a local memory PMEMi, which is used to store distributed data.
• All PEs perform the same function synchronously, in lockstep, under the supervision of the CU.
• Before execution in a PE, the vector operands should be loaded into its PMEM.
• Data can be loaded into the PMEM from an external source or by the CU.
• When executing an instruction, not all PEs have to work; only the enabled PEs do.

SIMD Architecture – Array Processor using RAM

Interconnection Network and Host Computer

IN: All communication between PEs is done through the interconnection network (IN). It performs all the routing and manipulation functions. The interconnection network is under the control of the CU.

SIMD Architecture – Masking and data routing organization

• One PE is connected to another PE via its routing register R.
• When one PE communicates with another PE, it is the contents of the R register that are transferred.
• During an instruction cycle, only the enabled PEs take the operands sent to them; the other PEs discard the operands sent to them.
• For an enabled PE the status register S = 1 and for a
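Masked lockstep execution can be sketched in a few lines. The `PE` class and `broadcast` function below are hypothetical names for illustration, and the sketch assumes that S = 0 marks a disabled PE, which the slide text does not state in full.

```python
from dataclasses import dataclass


@dataclass
class PE:
    value: int     # operand held in this PE's local memory
    s: int = 1     # status register S: 1 = enabled (0 = disabled is an assumption)


def broadcast(pes, operation):
    """The control unit sends one operation to every PE; only enabled PEs apply it."""
    for pe in pes:
        if pe.s == 1:
            pe.value = operation(pe.value)


pes = [PE(1), PE(2, s=0), PE(3), PE(4, s=0)]
broadcast(pes, lambda x: x * 10)
print([pe.value for pe in pes])   # [10, 2, 30, 4]
```

The loop runs sequentially here; in the hardware described above all enabled PEs would perform the operation in the same instruction cycle.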

SIMD Architecture – Associative processor using content-addressable memory

• In this configuration the PEs do not have private memory. The memories attached to the PEs are replaced by parallel memory modules shared by all PEs via an alignment network.
• The alignment network does path switching between the PEs and the parallel memory.
• PE-to-PE communication is also via the alignment network.
• The alignment network is controlled by the CU.

Attached Array Processor

• In this configuration the attached array processor has an input/output interface to a common processor and another interface with a local memory.

Advantages

The principal reason for using an array processor is speed.

• The design of most array processors optimizes their performance for repetitive arithmetic operations, making them much faster at vector arithmetic than the host CPU. Since most array processors operate asynchronously from the host CPU, they act as a co-processor that increases the capacity of the system.

Multiprocessor Computer

• The system contains two or more processors of approximately comparable capabilities.
• All processors share access to a common set of memory modules, I/O channels, and peripheral devices.
• The entire system must be controlled by a single integrated operating system providing interactions between processors and their programs.

Multiprocessor Computer

Interprocessor communication can be done through shared memories or through an interrupt network.

Multiprocessor hardware system organization is determined by the interconnection structure used between the memories and I/O channels.

Some of the different interconnections used are:

• Time-Shared Common Bus
• Crossbar-Switch Network