
(1)

Concurrent Processors

(2)

PARALLEL PROCESSING

 Example of parallel processing: multiple functional units

(3)

PARALLEL COMPUTERS

Architectural Classification

                                       Number of Data Streams
                                       Single      Multiple
Number of Instruction      Single      SISD        SIMD
Streams                    Multiple    MISD        MIMD

Parallel Processing

 Flynn's classification

Based on the multiplicity of Instruction Streams and Data Streams

Instruction stream: sequence of instructions read from memory

Data stream: operations performed on the data in the processor

(4)

SISD COMPUTER SYSTEMS

[Diagram: Control Unit → Processor Unit → Memory, linked by an instruction stream and a data stream]

Characteristics:

One control unit, one processor unit, and one memory unit

Parallel processing may be achieved by means of:

multiple functional units

pipeline processing

(5)

MISD COMPUTER SYSTEMS

[Diagram: three M → CU → P chains sharing one memory; multiple instruction streams operate on a single data stream]

Characteristics

- There is no computer at present that can be classified as MISD

(6)

SIMD COMPUTER SYSTEMS

[Diagram: a single Control Unit issues one instruction stream to processor units P1 ... Pn, which access memory modules M1 ... Mn through an alignment network and a data bus]

Characteristics

 Only one copy of the program exists

 A single controller executes one instruction at a time

(7)

MIMD COMPUTER SYSTEMS

[Diagram: processor-memory pairs (P, M) connected through an Interconnection Network to a Shared Memory]

Characteristics:

Multiple processing units (multiprocessor system)

Execution of multiple instructions on multiple data

Types of MIMD computer systems

- Shared memory multiprocessors

- Message-passing multicomputers (multicomputer system)

• The main difference between a multicomputer system and a multiprocessor system is that the multiprocessor system is controlled by one operating system that provides interaction between processors, and all the components of the system cooperate in the solution of a problem.

(8)

PIPELINING

• A technique of decomposing a sequential process into suboperations, with each subprocess being executed in a special dedicated segment that operates concurrently with all other segments.

Example: Ai * Bi + Ci for i = 1, 2, 3, ..., 7

Suboperations in each segment:

Segment 1: R1 ← Ai, R2 ← Bi (load Ai and Bi)

Segment 2: R3 ← R1 * R2, R4 ← Ci (multiply and load Ci)

Segment 3: R5 ← R3 + R4 (add)

[Diagram: Ai and Bi enter registers R1 and R2; a Multiplier feeds R3 while Ci is loaded from memory into R4; an Adder combines R3 and R4 into R5]

(9)

OPERATIONS IN EACH PIPELINE STAGE

Clock Pulse    Segment 1       Segment 2           Segment 3
Number         R1      R2      R3         R4       R5
1              A1      B1      ---        ---      ---
2              A2      B2      A1 * B1    C1       ---
3              A3      B3      A2 * B2    C2       A1 * B1 + C1
4              A4      B4      A3 * B3    C3       A2 * B2 + C2
5              A5      B5      A4 * B4    C4       A3 * B3 + C3
6              A6      B6      A5 * B5    C5       A4 * B4 + C4
7              A7      B7      A6 * B6    C6       A5 * B5 + C5
8              ---     ---     A7 * B7    C7       A6 * B6 + C6
9              ---     ---     ---        ---      A7 * B7 + C7
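To make the table concrete, here is a minimal Python sketch (an addition, not from the slides) that simulates the three-segment pipeline with explicit inter-segment latches; the printed register contents reproduce the table above pulse by pulse.

```python
# Simulation of the 3-segment pipeline computing Ai*Bi + Ci (slide notation).
# Segment 1 loads (R1, R2); segment 2 multiplies into R3 and loads Ci into R4;
# segment 3 adds into R5. All latches update on the same clock pulse.
A = [1, 2, 3, 4, 5, 6, 7]
B = [10, 20, 30, 40, 50, 60, 70]
C = [100, 200, 300, 400, 500, 600, 700]

R1 = R2 = R3 = R4 = R5 = None            # inter-segment latches, empty at start
n, k = len(A), 3                         # 7 tasks, 3 segments
fmt = lambda x: "---" if x is None else x

for clock in range(1, n + k):            # n + k - 1 = 9 clock pulses in total
    # Compute all new latch values from the OLD ones, then update together,
    # which models the simultaneous clocking of every segment register.
    new_R5 = R3 + R4 if R3 is not None else None            # segment 3: add
    new_R3 = R1 * R2 if R1 is not None else None            # segment 2: multiply
    new_R4 = C[clock - 2] if 2 <= clock <= n + 1 else None  # segment 2: load Ci
    new_R1 = A[clock - 1] if clock <= n else None           # segment 1: load Ai
    new_R2 = B[clock - 1] if clock <= n else None           # segment 1: load Bi
    R1, R2, R3, R4, R5 = new_R1, new_R2, new_R3, new_R4, new_R5
    print(f"pulse {clock}: R1={fmt(R1)} R2={fmt(R2)} "
          f"R3={fmt(R3)} R4={fmt(R4)} R5={fmt(R5)}")
```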

(10)

GENERAL PIPELINE

General Structure of a 4-Segment Pipeline

[Diagram: Input → S1/R1 → S2/R2 → S3/R3 → S4/R4, all segments driven by a common clock]

Space-Time Diagram

The following diagram shows 6 tasks T1 through T6 executed in 4 segments.

Clock cycle   1    2    3    4    5    6    7    8    9
Segment 1     T1   T2   T3   T4   T5   T6
Segment 2          T1   T2   T3   T4   T5   T6
Segment 3               T1   T2   T3   T4   T5   T6
Segment 4                    T1   T2   T3   T4   T5   T6

No matter how many segments there are, once the pipeline is full, one task completes per clock cycle.
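This point can be stated with the standard pipeline speedup relation; the numbers below are a worked instance for the diagram above, assuming each segment takes one clock period tp and a non-pipelined task takes tn = k · tp:

```latex
S = \frac{n \, t_n}{(k + n - 1) \, t_p}
\qquad
k = 4,\; n = 6:\;\; (k + n - 1) = 9 \text{ cycles pipelined}
\;\;\text{vs.}\;\; k \cdot n = 24 \text{ cycles sequential},
\qquad S = \frac{24}{9} \approx 2.67
```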

(11)

Parallelism

Executing two or more operations at the same time is known as parallelism.

Pipelining

Pipelining is an implementation technique in which multiple instructions are overlapped in execution. The computer pipeline is divided into stages. Each stage completes a part of an instruction in parallel. The stages are connected one to the next to form a pipe: instructions enter at one end, progress through the stages, and exit at the other end.

(12)

Vector Processing

 Vector systems provide instructions that operate at the vector level.

(13)

Vector Processors

 Vector processors have high-level operations that work on linear arrays of numbers.

Architecture

Memory-memory (results are stored in memory)

Vector-register (load/store architecture; operands and results are held in vector registers)

(14)

Advantages of vector processing

 Data hazards can be eliminated due to the structured nature of the data.

 Memory latency can be reduced due to pipelined load and store operations.

(15)

Multiple issue processors

 Also known as superscalar processors.

(16)

Types of multiple issue processor

 Static Multiple Issue

SIMD instructions (single-instruction multiple-data) for specialized applications, including both graphics and scientific applications. A single instruction specifies operations, element by element, on arrays (vectors) of data. The operations are explicitly parallel.

 Dynamic Multiple Issue

Issue multiple instructions in each clock cycle

(17)

Vector Processing

 Control logic grows linearly with issue width

 Vector unit switches off when not in use - higher energy efficiency

 More predictable real-time performance

 Vector instructions expose data parallelism without speculation

Multiple Issue Processing

 Control logic grows quadratically with issue width

 Control logic consumes energy regardless of available parallelism

(18)

VECTOR PROCESSING

Vector Processing

There is a class of computational problems that are beyond the capabilities of a conventional computer. These problems require a vast number of computations that would take a conventional computer days or even weeks to complete.

Vector Processing Applications

Problems that can be efficiently formulated in terms of vectors and matrices:

- Long-range weather forecasting
- Petroleum exploration
- Seismic data analysis
- Medical diagnosis
- Aerodynamics and space flight simulations
- Artificial intelligence and expert systems
- Mapping the human genome
- Image processing

Vector Processor (computer)

(19)

Supercomputer = Vector Instructions + Pipelined floating-point arithmetic

High computational speed, fast and large memory system.

Extensive use of parallel processing.

Equipped with multiple functional units, each with its own pipeline configuration.

Optimized for the type of numerical calculations involving vectors and matrices of floating-point numbers.

Limited in their use to a number of scientific applications:

o numerical weather forecasting
o seismic wave analysis
o space research

They have limited use and a limited market because of their high price.

(20)

Problems with conventional approach

 Limits to conventional exploitation of ILP:

1) pipelined clock rate: at some point, each increase in clock rate brings a corresponding CPI increase (branches, other hazards)

2) instruction fetch and decode: at some point, it becomes hard to fetch and decode more instructions per clock cycle

3) cache hit rate: some long-running (scientific) programs have very large data sets accessed with poor locality, so caches are ineffective

(21)

Alternative Model: Vector Processing

SCALAR (1 operation):

add r3, r1, r2        r3 ← r1 + r2

VECTOR (N operations):

add.vv v3, v1, v2     v3 ← v1 + v2, element by element over the vector length
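As an illustration (my addition; NumPy arrays stand in for vector registers), the same contrast in executable form:

```python
import numpy as np

N = 64                               # vector length
v1 = np.arange(N, dtype=np.float64)
v2 = np.ones(N, dtype=np.float64)

# SCALAR model: one add instruction per element, N instructions issued.
v3_scalar = np.empty(N)
for i in range(N):
    v3_scalar[i] = v1[i] + v2[i]     # analogous to: add r3, r1, r2

# VECTOR model: one instruction specifies all N element-wise adds.
v3_vector = v1 + v2                  # analogous to: add.vv v3, v1, v2

assert np.array_equal(v3_scalar, v3_vector)
```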

(22)

Properties of Vector Processors

 Each result independent of previous result

=> long pipeline, compiler ensures no dependencies => high clock rate

 Vector instructions access memory with known pattern => highly interleaved memory

=> amortize memory latency over 64 elements

=> no (data) caches required! (Do use instruction cache)

 Reduces branches and branch problems in pipelines

(23)

Styles of Vector Architectures

memory-memory vector processors: all vector operations are memory to memory

vector-register processors: all vector operations between vector registers (except load and store)

 Vector equivalent of load-store architectures

 Includes all vector machines since late 1980s: Cray, Convex, Fujitsu, Hitachi, NEC

(24)

Components of Vector Processor

Vector Register: fixed-length register bank holding a single vector

 has at least 2 read ports and 1 write port

 typically 8-32 vector registers, each holding 64-128 64-bit elements

Vector Functional Units (FUs): fully pipelined, start new operation every clock

 typically 4 to 8 FUs: FP add, FP mult, FP reciprocal (1/X), integer add, logical, shift; may have multiple of same unit

Vector Load-Store Units (LSUs): fully pipelined unit to load or store a vector; may have multiple LSUs

Scalar registers: single element for FP scalar or address
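A toy Python sketch of these components (hypothetical class and method names; the sizes follow the slide, and this is an illustration, not a real ISA model):

```python
import numpy as np

class VectorRegisterFile:
    """Toy model: 8 vector registers of 64 64-bit elements each
    (the low end of the 8-32 registers / 64-128 elements on the slide)."""
    def __init__(self, num_regs=8, vlen=64):
        self.regs = [np.zeros(vlen, dtype=np.float64) for _ in range(num_regs)]

    def load(self, rd, memory, base, stride=1):
        """Vector LSU: one load operation fills a whole vector register."""
        vlen = len(self.regs[rd])
        self.regs[rd] = memory[base:base + vlen * stride:stride].copy()

    def add_vv(self, rd, rs1, rs2):
        """Vector FU: element-wise add; fully pipelined hardware would
        produce one result per clock."""
        self.regs[rd] = self.regs[rs1] + self.regs[rs2]

memory = np.arange(1024, dtype=np.float64)
vrf = VectorRegisterFile()
vrf.load(0, memory, base=0)          # v0 <- memory[0:64]
vrf.load(1, memory, base=64)         # v1 <- memory[64:128]
vrf.add_vv(2, 0, 1)                  # v2 <- v0 + v1
```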

(25)

Vector Advantages

 Easy to get high performance; the N operations:

 are independent

 use the same functional unit

 access disjoint registers

 access registers in the same order as previous instructions

 access contiguous memory words or a known pattern

 can exploit large memory bandwidth

 hide memory latency (and any other latency)

 Scalable (get higher performance as more HW resources available)

 Compact: Describe N operations with 1 short instruction (v. VLIW)

 Predictable (real-time) performance vs. statistical performance (cache)

 Multimedia ready: choose N * 64b, 2N * 32b, 4N * 16b, 8N * 8b

 Mature, developed compiler technology

(26)


Advantages

Each result is independent of previous results, allowing deep pipelines and high clock rates.

A single vector instruction performs a great deal of work, meaning fewer fetches and fewer branches (and in turn fewer mispredictions).

Vector instructions access memory a block at a time, which allows memory latency to be amortized over many elements.

Vector instructions access memory with known patterns, which allows multiple memory banks to simultaneously supply operands.

(27)


Disadvantages

Not as fast on scalar code

Complexity of the multi-ported vector register file (VRF)

Difficulty implementing precise exceptions

(28)

Vector processors and multiple issue processors are compared on the basis of the following two factors:

- Cost
- Performance

(29)

Cost Comparison

• When comparing cost, we must approximate the chip area used by each technology in the form of the additional units it requires.

• The cost of execution units is about the same for both.

• A major difference lies in the storage hierarchy.

(30)

Vector processors use a small data cache, whereas multiple issue processors depend on a large data cache.

(31)

Performance Comparison

 The performance of vector processors depends on two factors:

- Percentage of code that is vectorizable
- Average length of vectors

For short vectors, the data cache is sufficient in multiple issue machines; therefore, on short vectors, M.I. processors perform better than an equivalent vector processor.

As vectors get longer, the performance of the M.I. machine becomes much more dependent on the size of the data cache, while the vector processor's advantage grows.

(32)

Vectors Lower Power

Vector

• One instruction fetch, decode, dispatch per vector

• Structured register accesses

• Smaller code for high performance; less power in instruction cache misses

• Bypass cache

• One TLB lookup per group of loads or stores

• Move only necessary data across chip boundary

Single-issue Scalar

• One instruction fetch, decode, dispatch per operation

• Arbitrary register accesses add area and power

• Loop unrolling and software pipelining for high performance increase the instruction cache footprint

• All data passes through cache; power is wasted if there is no temporal locality

• One TLB lookup per load or store

(33)

Superscalar Energy Efficiency

Even Worse

Vector

• Control logic grows linearly with issue width

• Vector unit switches off when not in use

• Vector instructions expose parallelism without speculation

• Software control of speculation when desired: whether to use vector mask or compress/expand for conditionals

Superscalar

• Control logic grows quadratically with issue width

• Control logic consumes energy regardless of available parallelism

• Speculation to increase visible parallelism wastes energy
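The "vector mask or compress/expand" choice above can be sketched as follows (an illustrative addition; NumPy boolean indexing stands in for hardware mask registers):

```python
import numpy as np

a = np.array([4.0, -1.0, 9.0, -16.0])

# Vector-mask approach: operate only on enabled lanes, leave others untouched.
mask = a >= 0                          # vector mask register
out = np.zeros_like(a)
out[mask] = np.sqrt(a[mask])           # masked element-wise operation

# Compress/expand approach: pack active elements into a dense vector,
# run a full-width unmasked vector op, then scatter the results back.
compressed = a[mask]                   # compress
results = np.sqrt(compressed)          # no mask needed here
expanded = np.zeros_like(a)
expanded[np.flatnonzero(mask)] = results   # expand

assert np.allclose(out, expanded)
```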

(34)

VECTOR PROCESSOR

Commonly called supercomputers, vector processors are machines built primarily to handle large scientific and engineering calculations. Their performance derives from a heavily pipelined architecture which operations on vectors and matrices can efficiently exploit.

• Each result is independent of the previous result => high clock rate

(35)

MULTIPLE ISSUE PROCESSOR

A superscalar processor can execute more than one instruction during a clock cycle by simultaneously dispatching multiple instructions to different execution units on the processor.

It implements a form of parallelism called instruction-level parallelism within a single processor.

 The CPU can execute multiple instructions per clock cycle

 Feature of all current x86 architectures

(36)

Multiple Issue machine

Some machines now try to go beyond pipelining and execute more than one instruction per clock cycle, producing an effective CPI < 1.

This is possible if we duplicate some of the functional parts of the processor (e.g., have two ALUs, or a register file with 4 read ports and 2 write ports) and add logic to issue several instructions concurrently.

There are two general approaches to multiple issue:

static multiple issue (where the scheduling is done at compile time) and dynamic multiple issue (where the scheduling is done at run time by the hardware).

(37)

Static Multiple Issue

1. SIMD instructions (single-instruction multiple-data) for specialized applications, including both graphics and scientific applications.

 A single instruction specifies operations, element by element, on arrays (vectors) of data.

 The operations are explicitly parallel, so no complex checking is required.

(38)

2. EPIC (explicitly parallel instruction computing) architectures, also called VLIW (very long instruction word) machines.

 These are general-purpose architectures (instruction sets) based upon large instructions containing several operations that are to be performed in parallel.

By making the parallelism explicit, we gain two advantages over superscalar:

a. much less logic (and hence less time) is required to identify parallelism at execution time, and

(39)

Dynamic Multiple Issue

(superscalar)

 Issue multiple instructions in each clock cycle

 Requires multiple arithmetic units and register files with additional ports to avoid structural hazards

 Generally extended to dynamic pipeline scheduling

 issue instructions (in order) to reservation stations / functional units

 execute instructions out of order as operands are available

 commit results (in order) to registers / memory

 Feature of all current x86 architectures

