Concurrent Processors
PARALLEL PROCESSING
Example of parallel processing: multiple functional units
PARALLEL COMPUTERS
Architectural Classification
• Flynn's classification: based on the multiplicity of instruction streams and data streams
  - Instruction stream: sequence of instructions read from memory
  - Data stream: operations performed on the data in the processor unit

                            Number of Data Streams
                            Single     Multiple
  Number of      Single     SISD       SIMD
  Instruction
  Streams        Multiple   MISD       MIMD
SISD COMPUTER SYSTEMS
[Diagram: Control Unit --instruction stream--> Processor Unit --data stream--> Memory]
• Characteristics:
  - One control unit, one processor unit, and one memory unit
  - Parallel processing may be achieved by means of:
    - multiple functional units
    - pipeline processing
MISD COMPUTER SYSTEMS
[Diagram: multiple control units (CU), each driving its own processor (P) with a separate instruction stream from memory (M), while all processors operate on a single data stream]
• Characteristics:
  - There is no computer at present that can be classified as MISD
SIMD COMPUTER SYSTEMS
[Diagram: one Control Unit broadcasts a single instruction stream to an array of processor units (P), which reach the memory modules (M) through an alignment network; a data bus links the control unit to memory]
• Characteristics:
  - Only one copy of the program exists
  - A single controller executes one instruction at a time
MIMD COMPUTER SYSTEMS
[Diagram: processor-memory pairs (P, M) connected through an interconnection network to a shared memory]
• Characteristics:
  - Multiple processing units (multiprocessor system)
  - Execution of multiple instructions on multiple data
• Types of MIMD computer systems:
  - Shared memory multiprocessors
  - Message-passing multicomputers (multicomputer system)
• The main difference between a multicomputer system and a multiprocessor system is that the multiprocessor system is controlled by one operating system that provides interaction between processors, and all the components of the system cooperate in the solution of a problem.
PIPELINING
• A technique of decomposing a sequential process into suboperations, with each subprocess being executed in a special dedicated segment that operates concurrently with all other segments.
Example: compute Ai * Bi + Ci for i = 1, 2, 3, ..., 7
Suboperations in each segment:
  Segment 1: R1 ← Ai, R2 ← Bi        Load Ai and Bi
  Segment 2: R3 ← R1 * R2, R4 ← Ci   Multiply and load Ci
  Segment 3: R5 ← R3 + R4            Add
[Diagram: Ai and Bi are latched into R1 and R2; the multiplier result (R3) and Ci (R4) feed the adder, whose output is latched into R5]
OPERATIONS IN EACH PIPELINE STAGE

Clock Pulse   Segment 1     Segment 2          Segment 3
Number        R1    R2      R3        R4       R5
1             A1    B1      ---       ---      ---
2             A2    B2      A1*B1     C1       ---
3             A3    B3      A2*B2     C2       A1*B1 + C1
4             A4    B4      A3*B3     C3       A2*B2 + C2
5             A5    B5      A4*B4     C4       A3*B3 + C3
6             A6    B6      A5*B5     C5       A4*B4 + C4
7             A7    B7      A6*B6     C6       A5*B5 + C5
8             ---   ---     A7*B7     C7       A6*B6 + C6
9             ---   ---     ---       ---      A7*B7 + C7
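The reservation table above can be checked with a short simulation; this is an illustrative sketch (the segment behavior follows the example, while the function name and data layout are ours), updating the segment latches back-to-front so every segment works concurrently:

```python
def pipeline(a, b, c):
    """Simulate the 3-segment pipeline computing Ai*Bi + Ci.
    Returns (results, clock_pulses)."""
    n = len(a)
    seg1 = None      # latch after segment 1: (R1, R2, i)  -- R1 <- Ai, R2 <- Bi
    seg2 = None      # latch after segment 2: (R3, R4)     -- R3 <- R1*R2, R4 <- Ci
    results = []     # output of segment 3: R5 <- R3 + R4
    pulses = 0
    i = 0
    while len(results) < n:
        pulses += 1
        # Update back-to-front: each segment consumes what its
        # predecessor latched on the previous clock pulse.
        if seg2 is not None:
            results.append(seg2[0] + seg2[1])          # segment 3: add
        seg2 = (seg1[0] * seg1[1], c[seg1[2]]) if seg1 is not None else None
        seg1 = (a[i], b[i], i) if i < n else None       # segment 1: load Ai, Bi
        if seg1 is not None:
            i += 1
    return results, pulses
```

With 7 tasks the simulation takes 9 clock pulses, matching the table: results emerge from pulse 3 onward, one per pulse.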
GENERAL PIPELINE
• General Structure of a 4-Segment Pipeline
[Diagram: Input → S1 → R1 → S2 → R2 → S3 → R3 → S4 → R4, with a common clock driving all segment registers]
• Space-Time Diagram
The following diagram shows 6 tasks T1 through T6 executed in 4 segments.

Clock cycle   1    2    3    4    5    6    7    8    9
Segment 1     T1   T2   T3   T4   T5   T6
Segment 2          T1   T2   T3   T4   T5   T6
Segment 3               T1   T2   T3   T4   T5   T6
Segment 4                    T1   T2   T3   T4   T5   T6

No matter how many segments there are, once the pipeline is full it delivers one result per clock cycle.
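The space-time behavior generalizes: a k-segment pipeline completes n tasks in k + n - 1 clock cycles, so its speedup over a non-pipelined unit taking k cycles per task approaches k. A quick sketch (the function names are ours):

```python
def pipeline_cycles(k, n):
    """Clock cycles for n tasks in a k-segment pipeline:
    k cycles to fill the pipe, then one cycle per remaining task."""
    return k + (n - 1)

def pipeline_speedup(k, n):
    """Speedup over a non-pipelined unit needing k cycles per task;
    approaches k as n grows."""
    return (n * k) / pipeline_cycles(k, n)
```

For the diagram's 6 tasks in 4 segments this gives 9 cycles, matching the chart.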
Parallelism
Executing two or more operations at the same time is known as parallelism.
Pipelining
Pipelining is an implementation technique in which multiple instructions are overlapped in execution. The computer pipeline is divided into stages. Each stage completes a part of an instruction in parallel. The stages are connected one to the next to form a pipe: instructions enter at one end, progress through the stages, and exit at the other end.
Vector Processing
Vector systems provide instructions that operate at the vector level.
Vector Processors
Vector processors have high-level operations that work on linear arrays of numbers.
Architecture
- Memory-memory (results are stored in memory)
- Vector-register (load/store architecture)
Advantages of vector processing
- Data hazards can be eliminated due to the nature of the data.
- Memory latency can be reduced due to pipelined load and store operations.
Multiple issue processors
Also known as superscalar processors.
Types of multiple issue processors
1. Static Multiple Issue
   SIMD instructions (single-instruction multiple-data) for specialized applications, including both graphics and scientific applications. A single instruction specifies operations, element by element, on arrays (vectors) of data. The operations are explicitly parallel.
2. Dynamic Multiple Issue
   Issue multiple instructions in each clock cycle.
Vector Processing
- Control logic grows linearly with issue width
- Vector unit switches off when not in use, giving higher energy efficiency
- More predictable real-time performance
- Vector instructions expose data parallelism without speculation

Multiple Issue Processing
- Control logic grows quadratically with issue width
- Control logic consumes energy regardless of available parallelism
VECTOR PROCESSING
There is a class of computational problems that is beyond the capabilities of a conventional computer. These problems require a vast number of computations that would take a conventional computer days or even weeks to complete.
Vector Processing Applications
Problems that can be efficiently formulated in terms of vectors and matrices:
- Long-range weather forecasting
- Petroleum exploration
- Seismic data analysis
- Medical diagnosis
- Aerodynamics and space flight simulations
- Artificial intelligence and expert systems
- Mapping the human genome
- Image processing
Vector Processor (computer)
Supercomputer = Vector Instruction + Pipelined floating-point arithmetic
High computational speed, fast and large memory system.
Extensive use of parallel processing.
It is equipped with multiple functional units and each unit has its own
pipeline configuration.
Optimized for the type of numerical calculations involving vectors and
matrices of floating-point numbers.
Limited in their use to a number of scientific applications:
  o numerical weather forecasting,
  o seismic wave analysis,
  o space research.
They have limited use and a limited market because of their high price.
Problems with the conventional approach
Limits to conventional exploitation of ILP:
1) Pipelined clock rate: at some point, each increase in clock rate has a corresponding CPI increase (branches, other hazards).
2) Instruction fetch and decode: at some point, it's hard to fetch and decode more instructions per clock cycle.
3) Cache hit rate: some long-running (scientific) programs have very large data sets accessed with poor locality.
Alternative Model: Vector Processing

SCALAR (1 operation):
  add r3, r1, r2        r3 ← r1 + r2

VECTOR (N operations):
  add.vv v3, v1, v2     v3 ← v1 + v2, element by element over the vector length
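In plain code the contrast looks like this; a minimal sketch (the function names are ours), where the scalar form performs one operation per instruction while the vector form expresses N element-wise operations in one step:

```python
def scalar_add(r1, r2):
    # SCALAR: one add instruction, one result (add r3, r1, r2)
    return r1 + r2

def vector_add(v1, v2):
    # VECTOR: one add.vv instruction specifies N independent
    # element-wise additions over the vector length
    assert len(v1) == len(v2)
    return [x + y for x, y in zip(v1, v2)]
```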
Properties of Vector Processors
- Each result independent of previous result
  => long pipeline, compiler ensures no dependencies => high clock rate
- Vector instructions access memory with known pattern
  => highly interleaved memory
  => memory latency amortized over 64 elements
  => no (data) caches required! (Do use instruction cache)
- Reduces branches and branch problems in pipelines
Styles of Vector Architectures
- Memory-memory vector processors: all vector operations are memory to memory.
- Vector-register processors: all vector operations between vector registers (except load and store).
  - Vector equivalent of load-store architectures.
  - Includes all vector machines since the late 1980s: Cray, Convex, Fujitsu, Hitachi, NEC.
Components of Vector Processor
- Vector Register: fixed-length bank holding a single vector
  - has at least 2 read and 1 write ports
  - typically 8-32 vector registers, each holding 64-128 64-bit elements
- Vector Functional Units (FUs): fully pipelined, start new operation every clock
  - typically 4 to 8 FUs: FP add, FP mult, FP reciprocal (1/X), integer add, logical, shift; may have multiple of the same unit
- Vector Load-Store Units (LSUs): fully pipelined unit to load or store a vector; may have multiple LSUs
- Scalar registers: single element for FP scalar or address
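Since a vector register holds a fixed number of elements (64-128 above), operations on longer vectors are split into register-sized chunks, a technique known as strip-mining. A minimal sketch, assuming a 64-element register (the constant MVL and the function name are ours):

```python
MVL = 64  # assumed maximum vector length of one vector register

def strip_mined_add(a, b):
    """Add two arbitrarily long vectors in MVL-element chunks,
    the way a strip-mined loop feeds a vector register file."""
    out = []
    for start in range(0, len(a), MVL):
        va = a[start:start + MVL]   # vector load into register 1
        vb = b[start:start + MVL]   # vector load into register 2
        out.extend(x + y for x, y in zip(va, vb))  # vector add, then store
    return out
```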
Vector Advantages
Easy to get high performance; N operations:
- are independent
- use the same functional unit
- access disjoint registers
- access registers in the same order as previous instructions
- access contiguous memory words or a known pattern
- can exploit large memory bandwidth
- hide memory latency (and any other latency)
Scalable (higher performance as more HW resources become available)
Compact: describe N operations with 1 short instruction (vs. VLIW)
Predictable (real-time) performance vs. statistical performance (cache)
Multimedia ready: choose N * 64b, 2N * 32b, 4N * 16b, 8N * 8b
Mature, developed compiler technology
Advantages
- Each result is independent of previous results, allowing deep pipelines and high clock rates.
- A single vector instruction performs a great deal of work, meaning fewer fetches and fewer branches (and in turn fewer mispredictions).
- Vector instructions access memory a block at a time, which allows memory latency to be amortized over many elements.
- Vector instructions access memory with known patterns, which allows multiple memory banks to simultaneously supply operands.
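The last point can be illustrated with low-order interleaving: consecutive addresses map to consecutive banks, so a unit-stride vector access keeps every bank busy, while certain strides cause bank conflicts. A hypothetical sketch (the bank count and names are ours):

```python
NUM_BANKS = 8  # assumed number of interleaved memory banks

def bank_of(addr):
    # Low-order interleaving: consecutive addresses hit consecutive banks
    return addr % NUM_BANKS

def banks_touched(base, stride, n):
    """Distinct banks referenced by an n-element vector access."""
    return {bank_of(base + i * stride) for i in range(n)}
```

A unit-stride access spreads across all 8 banks; a stride equal to the bank count funnels every reference into one bank, serializing the accesses.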
Disadvantages
- Not as fast with scalar instructions.
- Complexity of the multi-ported vector register file (VRF).
- Difficulties implementing precise exceptions.
Vector processors and multiple issue processors are compared on the basis of the following two factors:
- Cost
- Performance
Cost Comparison
• While comparing the cost we must approximate the area used by both technologies in the form of additional / required units.
• The cost of execution units is about the same for both.
• A major difference lies in the storage hierarchy: vector processors use a small data cache, whereas multiple issue processors devote much more area to their data caches.
Performance Comparison
The performance of vector processors depends on two factors:
- Percentage of code that is vectorizable.
- Average length of vectors.
For short vectors, the data cache is sufficient in multiple issue (M.I.) machines; therefore, on short vectors, M.I. processors perform better than an equivalent vector processor. As vectors get longer, the performance of the M.I. machine becomes much more dependent on the size of the data cache.
Vectors Lower Power
Vector
• One instruction fetch, decode, dispatch per vector
• Structured register accesses
• Smaller code for high performance; less power in instruction cache misses
• Bypass cache
• One TLB lookup per group of loads or stores
• Move only necessary data across chip boundary
Single-issue Scalar
• One instruction fetch, decode, dispatch per operation
• Arbitrary register accesses add area and power
• Loop unrolling and software pipelining for high performance increase instruction cache footprint
• All data passes through cache; wastes power if there is no temporal locality
• One TLB lookup per load or store
Superscalar Energy Efficiency Even Worse
Vector
• Control logic grows linearly with issue width
• Vector unit switches off when not in use
• Vector instructions expose parallelism without speculation
• Software control of speculation when desired:
  - whether to use vector mask or compress/expand for conditionals
Superscalar
• Control logic grows quadratically with issue width
• Control logic consumes energy regardless of available parallelism
• Speculation to increase visible parallelism wastes energy
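The conditional-execution choice mentioned above, vector mask versus compress/expand, can be sketched in plain code; this is illustrative only, with Python lists standing in for vector registers and the function names being ours:

```python
def masked_op(v, mask, f):
    """Vector-mask execution: every lane computes, but results are
    written back only where the mask bit is set."""
    return [f(x) if m else x for x, m in zip(v, mask)]

def compress_expand_op(v, mask, f):
    """Compress active elements into a shorter vector, operate on it,
    then expand the results back to their original positions."""
    compressed = [x for x, m in zip(v, mask) if m]                # compress
    computed = iter(f(x) for x in compressed)                     # operate
    return [next(computed) if m else x for x, m in zip(v, mask)]  # expand
```

Both produce the same result: masking wastes work in inactive lanes, while compress/expand pays for two data movements but runs a shorter vector.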
VECTOR PROCESSOR
Commonly called supercomputers, the vector processors are machines built primarily to handle large scientific and engineering calculations. Their performance derives from a heavily pipelined
architecture which operations on vectors and matrices can efficiently exploit.
• Each result is independent of the previous result => high clock rate
MULTIPLE ISSUE PROCESSOR
A superscalar processor can execute more than one instruction during a clock cycle by simultaneously dispatching multiple instructions to
different execution units on the processor.
It implements a form of parallelism called instruction-level parallelism within a single processor
The CPU can execute multiple instructions per clock cycle
Feature of all current x86 architectures.
Multiple Issue machine
Some machines now try to go beyond pipelining to execute more than one instruction at a clock cycle, producing an effective CPI < 1.
This is possible if we duplicate some of the functional parts of the processor (e.g., have two ALUs or a register file with 4 read ports and 2 write ports), and have logic to issue several instructions concurrently.
There are two general approaches to multiple issue:
- static multiple issue (where the scheduling is done at compile time), and
- dynamic multiple issue (where the scheduling is done at run time by the hardware).
Static Multiple Issue
1. SIMD instructions (single-instruction multiple-data) for specialized
applications, including both graphics and scientific applications.
A single instruction specifies operations, element by element, on arrays (vectors) of data.
The operations are explicitly parallel, so no complex checking is required.
2. EPIC (explicitly parallel instruction computing) architectures, also called VLIW (very long instruction word) machines.
These are general purpose architectures (instruction set) based upon large instructions containing several operations which are to be performed in parallel.
By making the parallelism explicit, we gain two advantages over superscalar:
a. much less logic (and hence less time) is required to identify parallelism at execution time, and
b. the compiler can examine a much larger window of instructions when scheduling the parallel operations.
Dynamic Multiple Issue (superscalar)
- Issue multiple instructions in each clock cycle
- Requires multiple arithmetic units and register files with additional ports to avoid structural hazards
- Generally extended to dynamic pipeline scheduling:
  - issue instructions (in order) to reservation stations / functional units
  - execute instructions out of order as operands become available
  - commit results (in order) to registers / memory
- Feature of all current x86 architectures
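The in-order-issue / out-of-order-execute / in-order-commit discipline above can be sketched with a toy model (everything here is illustrative; real hardware uses reservation stations and a reorder buffer rather than a sort):

```python
def dynamic_schedule(instrs, ready_at):
    """instrs: instruction names in program order.
    ready_at[name]: cycle at which that instruction's operands arrive.
    Instructions execute as soon as their operands are ready
    (out of order) but always commit in program order."""
    exec_order = sorted(instrs, key=lambda name: ready_at[name])
    commit_order = list(instrs)   # commit is always program order
    return exec_order, commit_order
```

For example, if a load stalls the first instruction's operands until cycle 3, later independent instructions execute first, yet results still commit in program order.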