Concurrent Processors
PARALLEL PROCESSING
Example of parallel processing: multiple functional units
PARALLEL COMPUTERS
Architectural Classification
• Flynn's classification: based on the multiplicity of instruction streams and data streams
  - Instruction stream: sequence of instructions read from memory
  - Data stream: operations performed on the data in the processor unit

                            Number of Data Streams
                            Single     Multiple
  Number of      Single     SISD       SIMD
  Instruction
  Streams        Multiple   MISD       MIMD
SISD COMPUTER SYSTEMS
[Diagram: Control Unit --instruction stream--> Processor Unit --data stream--> Memory]
• Characteristics:
  - One control unit, one processor unit, and one memory unit
  - Parallel processing may be achieved by means of:
    - multiple functional units
    - pipeline processing
MISD COMPUTER SYSTEMS
[Diagram: multiple control units (CU), each driving its own processor (P) with a separate instruction stream from memory (M), while all processors operate on a single data stream]
• Characteristics:
  - There is no computer at present that can be classified as MISD
SIMD COMPUTER SYSTEMS
[Diagram: one Control Unit broadcasts a single instruction stream to an array of processor units (P), which reach the memory modules (M) through an alignment network; a data bus links the control unit to memory]
• Characteristics:
  - Only one copy of the program exists
  - A single controller executes one instruction at a time
MIMD COMPUTER SYSTEMS
[Diagram: processor-memory pairs (P, M) connected through an interconnection network to a shared memory]
• Characteristics:
  - Multiple processing units (multiprocessor system)
  - Execution of multiple instructions on multiple data
• Types of MIMD computer systems:
  - Shared memory multiprocessors
  - Message-passing multicomputers (multicomputer system)
• The main difference between a multicomputer system and a multiprocessor system is that the multiprocessor system is controlled by one operating system that provides interaction between processors, and all the components of the system cooperate in the solution of a problem.
PIPELINING
• A technique of decomposing a sequential process into suboperations, with each subprocess being executed in a special dedicated segment that operates concurrently with all other segments.
Example: compute Ai * Bi + Ci for i = 1, 2, 3, ..., 7
Suboperations in each segment:
  Segment 1: R1 ← Ai, R2 ← Bi        Load Ai and Bi
  Segment 2: R3 ← R1 * R2, R4 ← Ci   Multiply and load Ci
  Segment 3: R5 ← R3 + R4            Add
[Diagram: Ai and Bi are latched into R1 and R2; the multiplier result (R3) and Ci (R4) feed the adder, whose output is latched into R5]
OPERATIONS IN EACH PIPELINE STAGE

Clock Pulse   Segment 1     Segment 2          Segment 3
Number        R1    R2      R3        R4       R5
1             A1    B1      ---       ---      ---
2             A2    B2      A1*B1     C1       ---
3             A3    B3      A2*B2     C2       A1*B1 + C1
4             A4    B4      A3*B3     C3       A2*B2 + C2
5             A5    B5      A4*B4     C4       A3*B3 + C3
6             A6    B6      A5*B5     C5       A4*B4 + C4
7             A7    B7      A6*B6     C6       A5*B5 + C5
8             ---   ---     A7*B7     C7       A6*B6 + C6
9             ---   ---     ---       ---      A7*B7 + C7
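The reservation table above can be checked with a short simulation; this is an illustrative sketch (the segment behavior follows the example, while the function name and data layout are ours), updating the segment latches back-to-front so every segment works concurrently:

```python
def pipeline(a, b, c):
    """Simulate the 3-segment pipeline computing Ai*Bi + Ci.
    Returns (results, clock_pulses)."""
    n = len(a)
    seg1 = None      # latch after segment 1: (R1, R2, i)  -- R1 <- Ai, R2 <- Bi
    seg2 = None      # latch after segment 2: (R3, R4)     -- R3 <- R1*R2, R4 <- Ci
    results = []     # output of segment 3: R5 <- R3 + R4
    pulses = 0
    i = 0
    while len(results) < n:
        pulses += 1
        # Update back-to-front: each segment consumes what its
        # predecessor latched on the previous clock pulse.
        if seg2 is not None:
            results.append(seg2[0] + seg2[1])          # segment 3: add
        seg2 = (seg1[0] * seg1[1], c[seg1[2]]) if seg1 is not None else None
        seg1 = (a[i], b[i], i) if i < n else None       # segment 1: load Ai, Bi
        if seg1 is not None:
            i += 1
    return results, pulses
```

With 7 tasks the simulation takes 9 clock pulses, matching the table: results emerge from pulse 3 onward, one per pulse.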
GENERAL PIPELINE
• General Structure of a 4-Segment Pipeline
[Diagram: Input → S1 → R1 → S2 → R2 → S3 → R3 → S4 → R4, with a common clock driving all segment registers]
• Space-Time Diagram
The following diagram shows 6 tasks T1 through T6 executed in 4 segments.

Clock cycle   1    2    3    4    5    6    7    8    9
Segment 1     T1   T2   T3   T4   T5   T6
Segment 2          T1   T2   T3   T4   T5   T6
Segment 3               T1   T2   T3   T4   T5   T6
Segment 4                    T1   T2   T3   T4   T5   T6

No matter how many segments there are, once the pipeline is full it delivers one result per clock cycle.
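The space-time behavior generalizes: a k-segment pipeline completes n tasks in k + n - 1 clock cycles, so its speedup over a non-pipelined unit taking k cycles per task approaches k. A quick sketch (the function names are ours):

```python
def pipeline_cycles(k, n):
    """Clock cycles for n tasks in a k-segment pipeline:
    k cycles to fill the pipe, then one cycle per remaining task."""
    return k + (n - 1)

def pipeline_speedup(k, n):
    """Speedup over a non-pipelined unit needing k cycles per task;
    approaches k as n grows."""
    return (n * k) / pipeline_cycles(k, n)
```

For the diagram's 6 tasks in 4 segments this gives 9 cycles, matching the chart.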
Parallelism
Executing two or more operations at the same time is known as parallelism.
Pipelining
Pipelining is an implementation technique in which multiple instructions are overlapped in execution. The computer pipeline is divided into stages. Each stage completes a part of an instruction in parallel. The stages are connected one to the next to form a pipe: instructions enter at one end, progress through the stages, and exit at the other end.
Vector Processing
Vector systems provide instructions that operate at the vector level.
Vector Processors
Vector processors have high-level operations that work on linear arrays of numbers.
Architecture
- Memory-memory (results are stored in memory)
- Vector-register (load/store architecture)
Advantages of vector processing
- Data hazards can be eliminated due to the nature of the data.
- Memory latency can be reduced due to pipelined load and store operations.
Multiple issue processors
Also known as superscalar processors.
Types of multiple issue processors
1. Static Multiple Issue
   SIMD instructions (single-instruction multiple-data) for specialized applications, including both graphics and scientific applications. A single instruction specifies operations, element by element, on arrays (vectors) of data. The operations are explicitly parallel.
2. Dynamic Multiple Issue
   Issue multiple instructions in each clock cycle.
Vector Processing
- Control logic grows linearly with issue width
- Vector unit switches off when not in use, giving higher energy efficiency
- More predictable real-time performance
- Vector instructions expose data parallelism without speculation

Multiple Issue Processing
- Control logic grows quadratically with issue width
- Control logic consumes energy regardless of available parallelism
VECTOR PROCESSING
There is a class of computational problems that is beyond the capabilities of a conventional computer. These problems require a vast number of computations that would take a conventional computer days or even weeks to complete.
Vector Processing Applications
Problems that can be efficiently formulated in terms of vectors and matrices:
- Long-range weather forecasting
- Petroleum exploration
- Seismic data analysis
- Medical diagnosis
- Aerodynamics and space flight simulations
- Artificial intelligence and expert systems
- Mapping the human genome
- Image processing
Vector Processor (computer)
Supercomputer = Vector Instruction + Pipelined floating-point arithmetic
High computational speed, fast and large memory system.
Extensive use of parallel processing.
It is equipped with multiple functional units and each unit has its own
pipeline configuration.
Optimized for the type of numerical calculations involving vectors and
matrices of floating-point numbers.
Limited in their use to a number of scientific applications:
  o numerical weather forecasting,
  o seismic wave analysis,
  o space research.
They have limited use and a limited market because of their high price.
Problems with the conventional approach
Limits to conventional exploitation of ILP:
1) Pipelined clock rate: at some point, each increase in clock rate has a corresponding CPI increase (branches, other hazards).
2) Instruction fetch and decode: at some point, it's hard to fetch and decode more instructions per clock cycle.
3) Cache hit rate: some long-running (scientific) programs have very large data sets accessed with poor locality.
Alternative Model: Vector Processing

SCALAR (1 operation):
  add r3, r1, r2        r3 ← r1 + r2

VECTOR (N operations):
  add.vv v3, v1, v2     v3 ← v1 + v2, element by element over the vector length
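In plain code the contrast looks like this; a minimal sketch (the function names are ours), where the scalar form performs one operation per instruction while the vector form expresses N element-wise operations in one step:

```python
def scalar_add(r1, r2):
    # SCALAR: one add instruction, one result (add r3, r1, r2)
    return r1 + r2

def vector_add(v1, v2):
    # VECTOR: one add.vv instruction specifies N independent
    # element-wise additions over the vector length
    assert len(v1) == len(v2)
    return [x + y for x, y in zip(v1, v2)]
```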
Properties of Vector Processors
- Each result independent of previous result
  => long pipeline, compiler ensures no dependencies => high clock rate
- Vector instructions access memory with known pattern
  => highly interleaved memory
  => memory latency amortized over 64 elements
  => no (data) caches required! (Do use instruction cache)
- Reduces branches and branch problems in pipelines
Styles of Vector Architectures
- Memory-memory vector processors: all vector operations are memory to memory.
- Vector-register processors: all vector operations between vector registers (except load and store).
  - Vector equivalent of load-store architectures.
  - Includes all vector machines since the late 1980s: Cray, Convex, Fujitsu, Hitachi, NEC.
Components of Vector Processor
- Vector Register: fixed-length bank holding a single vector
  - has at least 2 read and 1 write ports
  - typically 8-32 vector registers, each holding 64-128 64-bit elements
- Vector Functional Units (FUs): fully pipelined, start new operation every clock
  - typically 4 to 8 FUs: FP add, FP mult, FP reciprocal (1/X), integer add, logical, shift; may have multiple of the same unit
- Vector Load-Store Units (LSUs): fully pipelined unit to load or store a vector; may have multiple LSUs
- Scalar registers: single element for FP scalar or address
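Since a vector register holds a fixed number of elements (64-128 above), operations on longer vectors are split into register-sized chunks, a technique known as strip-mining. A minimal sketch, assuming a 64-element register (the constant MVL and the function name are ours):

```python
MVL = 64  # assumed maximum vector length of one vector register

def strip_mined_add(a, b):
    """Add two arbitrarily long vectors in MVL-element chunks,
    the way a strip-mined loop feeds a vector register file."""
    out = []
    for start in range(0, len(a), MVL):
        va = a[start:start + MVL]   # vector load into register 1
        vb = b[start:start + MVL]   # vector load into register 2
        out.extend(x + y for x, y in zip(va, vb))  # vector add, then store
    return out
```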
Vector Advantages
Easy to get high performance; N operations:
- are independent
- use the same functional unit
- access disjoint registers
- access registers in the same order as previous instructions
- access contiguous memory words or a known pattern
- can exploit large memory bandwidth
- hide memory latency (and any other latency)
Scalable (higher performance as more HW resources become available)
Compact: describe N operations with 1 short instruction (vs. VLIW)
Predictable (real-time) performance vs. statistical performance (cache)
Multimedia ready: choose N * 64b, 2N * 32b, 4N * 16b, 8N * 8b
Mature, developed compiler technology
Advantages
- Each result is independent of previous results, allowing deep pipelines and high clock rates.
- A single vector instruction performs a great deal of work, meaning fewer fetches and fewer branches (and in turn fewer mispredictions).
- Vector instructions access memory a block at a time, which allows memory latency to be amortized over many elements.
- Vector instructions access memory with known patterns, which allows multiple memory banks to simultaneously supply operands.
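The last point can be illustrated with low-order interleaving: consecutive addresses map to consecutive banks, so a unit-stride vector access keeps every bank busy, while certain strides cause bank conflicts. A hypothetical sketch (the bank count and names are ours):

```python
NUM_BANKS = 8  # assumed number of interleaved memory banks

def bank_of(addr):
    # Low-order interleaving: consecutive addresses hit consecutive banks
    return addr % NUM_BANKS

def banks_touched(base, stride, n):
    """Distinct banks referenced by an n-element vector access."""
    return {bank_of(base + i * stride) for i in range(n)}
```

A unit-stride access spreads across all 8 banks; a stride equal to the bank count funnels every reference into one bank, serializing the accesses.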
Disadvantages
- Not as fast with scalar instructions.
- Complexity of the multi-ported vector register file (VRF).
- Difficulties implementing precise exceptions.
Vector processors and multiple issue processors are compared on the basis of the following two factors:
- Cost
- Performance
Cost Comparison
• While comparing the cost we must approximate the area used by both technologies in the form of additional / required units.
• The cost of execution units is about the same for both.
• A major difference lies in the storage hierarchy: vector processors use a small data cache, whereas multiple issue processors devote much more area to their data caches.
Performance Comparison
The performance of vector processors depends on two factors:
- Percentage of code that is vectorizable.
- Average length of vectors.
For short vectors, the data cache is sufficient in multiple issue (M.I.) machines; therefore, on short vectors, M.I. processors perform better than an equivalent vector processor. As vectors get longer, the performance of the M.I. machine becomes much more dependent on the size of the data cache.
Vectors Lower Power
Vector
• One instruction fetch, decode, dispatch per vector
• Structured register accesses
• Smaller code for high performance; less power in instruction cache misses
• Bypass cache
• One TLB lookup per group of loads or stores
• Move only necessary data across chip boundary
Single-issue Scalar
• One instruction fetch, decode, dispatch per operation
• Arbitrary register accesses add area and power
• Loop unrolling and software pipelining for high performance increase instruction cache footprint
• All data passes through cache; wastes power if there is no temporal locality
• One TLB lookup per load or store
Superscalar Energy Efficiency Even Worse
Vector
• Control logic grows linearly with issue width
• Vector unit switches off when not in use
• Vector instructions expose parallelism without speculation
• Software control of speculation when desired:
  - whether to use vector mask or compress/expand for conditionals
Superscalar
• Control logic grows quadratically with issue width
• Control logic consumes energy regardless of available parallelism
• Speculation to increase visible parallelism wastes energy
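The conditional-execution choice mentioned above, vector mask versus compress/expand, can be sketched in plain code; this is illustrative only, with Python lists standing in for vector registers and the function names being ours:

```python
def masked_op(v, mask, f):
    """Vector-mask execution: every lane computes, but results are
    written back only where the mask bit is set."""
    return [f(x) if m else x for x, m in zip(v, mask)]

def compress_expand_op(v, mask, f):
    """Compress active elements into a shorter vector, operate on it,
    then expand the results back to their original positions."""
    compressed = [x for x, m in zip(v, mask) if m]                # compress
    computed = iter(f(x) for x in compressed)                     # operate
    return [next(computed) if m else x for x, m in zip(v, mask)]  # expand
```

Both produce the same result: masking wastes work in inactive lanes, while compress/expand pays for two data movements but runs a shorter vector.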
VECTOR PROCESSOR
Commonly called supercomputers, the vector processors are machines built primarily to handle large scientific and engineering calculations. Their performance derives from a heavily pipelined
architecture which operations on vectors and matrices can efficiently exploit.
• Each result is independent of the previous result => high clock rate
MULTIPLE ISSUE PROCESSOR
A superscalar processor can execute more than one instruction during a clock cycle by simultaneously dispatching multiple instructions to
different execution units on the processor.
It implements a form of parallelism called instruction-level parallelism within a single processor
The CPU can execute multiple instructions per clock cycle
Feature of all current x86 architectures.
Multiple Issue machine
Some machines now try to go beyond pipelining to execute more than one instruction at a clock cycle, producing an effective CPI < 1.
This is possible if we duplicate some of the functional parts of the processor (e.g., have two ALUs or a register file with 4 read ports and 2 write ports), and have logic to issue several instructions concurrently.
There are two general approaches to multiple issue:
- static multiple issue (where the scheduling is done at compile time), and
- dynamic multiple issue (where the scheduling is done at run time by the hardware).
Static Multiple Issue
1. SIMD instructions (single-instruction multiple-data) for specialized
applications, including both graphics and scientific applications.
A single instruction specifies operations, element by element, on arrays (vectors) of data.
The operations are explicitly parallel, so no complex checking is required.
2. EPIC (explicitly parallel instruction computing) architectures, also called VLIW (very long instruction word) machines.
These are general purpose architectures (instruction set) based upon large instructions containing several operations which are to be performed in parallel.
By making the parallelism explicit, we gain two advantages over superscalar:
a. much less logic (and hence less time) is required to identify parallelism at execution time, and
b. the compiler can examine a much larger window of instructions when scheduling the parallel operations.
Dynamic Multiple Issue (superscalar)
- Issue multiple instructions in each clock cycle
- Requires multiple arithmetic units and register files with additional ports to avoid structural hazards
- Generally extended to dynamic pipeline scheduling:
  - issue instructions (in order) to reservation stations / functional units
  - execute instructions out of order as operands become available
  - commit results (in order) to registers / memory
- Feature of all current x86 architectures
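The in-order-issue / out-of-order-execute / in-order-commit discipline above can be sketched with a toy model (everything here is illustrative; real hardware uses reservation stations and a reorder buffer rather than a sort):

```python
def dynamic_schedule(instrs, ready_at):
    """instrs: instruction names in program order.
    ready_at[name]: cycle at which that instruction's operands arrive.
    Instructions execute as soon as their operands are ready
    (out of order) but always commit in program order."""
    exec_order = sorted(instrs, key=lambda name: ready_at[name])
    commit_order = list(instrs)   # commit is always program order
    return exec_order, commit_order
```

For example, if a load stalls the first instruction's operands until cycle 3, later independent instructions execute first, yet results still commit in program order.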