Vector Processing
The other kind of SIMD machine is the vector processor.
• A processor can operate on an entire vector in one instruction
• Work is done automatically in parallel (simultaneously)
• The operands of the instructions are complete vectors instead of single elements
• Data parallelism
• The most important unit of a vector computer is the pipelined functional unit
• A vector instruction performs an operation on each element in consecutive cycles
• Vector functional units are pipelined
• Each pipeline stage operates on a different data element
• Vector instructions allow deeper pipelines
• No intra-vector dependencies
• No hardware interlocking within a vector
• No control flow within a vector
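The pipelining points above can be illustrated with a small timing sketch. It assumes an idealized pipeline that accepts one new element per cycle and never stalls, which is what the absence of intra-vector dependencies allows. The function name and its parameters are hypothetical and do not describe any particular machine.

```python
def pipeline_trace(n_elements, n_stages):
    """Return, for each cycle, which element index occupies each stage.

    Assumption: an idealized pipeline, one new element enters per cycle,
    no stalls. None marks an empty stage.
    """
    if n_elements == 0:
        return []
    # The first result appears after n_stages cycles; each later
    # element completes one cycle after the previous one.
    total_cycles = n_stages + n_elements - 1
    trace = []
    for cycle in range(total_cycles):
        row = []
        for stage in range(n_stages):
            elem = cycle - stage  # element that entered `stage` cycles ago
            row.append(elem if 0 <= elem < n_elements else None)
        trace.append(row)
    return trace


if __name__ == "__main__":
    for cycle, row in enumerate(pipeline_trace(3, 2)):
        print(cycle, row)
```

Under these assumptions a vector of n elements finishes in n_stages + n - 1 cycles, so the pipeline fill cost is paid once per vector instruction rather than once per element.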
A Vector ALU
• The machine takes two n-element vectors as input and operates on corresponding elements in parallel, using a vector ALU that operates on all n elements simultaneously.
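A minimal functional sketch of the vector ALU described above. The name vector_alu is hypothetical, and the Python loop only models the result of the operation: real hardware would process the element pairs in parallel rather than one after another.

```python
import operator


def vector_alu(op, a, b):
    """Apply a binary operation to corresponding elements of two vectors.

    This models only WHAT the vector ALU computes, not its timing.
    """
    if len(a) != len(b):
        raise ValueError("both input vectors must have the same length n")
    return [op(x, y) for x, y in zip(a, b)]


if __name__ == "__main__":
    print(vector_alu(operator.add, [1, 2, 3], [10, 20, 30]))
    print(vector_alu(operator.mul, [1.5, 2.0], [2.0, 4.0]))
```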
An Example of a Floating-Point Pipeline
Vector Processing(Contd…)
• Each vector data register holds N M-bit values
• Vector control registers: VLEN, VSTR, VMASK
• The vector mask register (VMASK) indicates which elements of a vector to operate on
• VMASK is set by vector test instructions, e.g., VMASK[i] = (Vk[i] == 0)
• The maximum VLEN is N
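The control registers can be sketched in a few lines. This is an illustration only: the function names are hypothetical, VLEN is taken to be the number of elements an instruction processes, and VSTR is assumed to be the distance between consecutive vector elements in memory (the slides name the register but do not define it).

```python
def vector_test_eq_zero(vk):
    """Vector test instruction: VMASK[i] = (Vk[i] == 0)."""
    return [x == 0 for x in vk]


def masked_vadd(vd, va, vb, vlen, vmask):
    """Add va and vb into a copy of vd, honoring VLEN and VMASK.

    Only the first `vlen` elements are considered, and only those whose
    mask bit is set are written; all other elements of vd are unchanged.
    """
    if vlen > len(vd):
        raise ValueError("VLEN cannot exceed the register length N")
    out = list(vd)
    for i in range(vlen):
        if vmask[i]:
            out[i] = va[i] + vb[i]
    return out


def strided_load(memory, base, vstr, vlen):
    """Load `vlen` elements starting at `base`, `vstr` words apart.

    Assumption: VSTR is the stride between consecutive elements.
    """
    return [memory[base + i * vstr] for i in range(vlen)]


if __name__ == "__main__":
    mask = vector_test_eq_zero([0, 5, 0, 7])
    print(mask)
    print(masked_vadd([9, 9, 9, 9], [1, 2, 3, 4], [10, 20, 30, 40], 3, mask))
    print(strided_load(list(range(10)), base=1, vstr=3, vlen=3))
```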
The architecture of a vector supercomputer
• As shown in the figure, the vector processor is attached to the scalar processor as an optional feature. Program and data are first loaded into the main memory through a host computer.
• All instructions are first decoded by the scalar control unit. If the decoded instruction is a scalar operation or a program control operation, it is executed directly by the scalar processor using the scalar functional pipelines.
• If the instruction is decoded as a vector operation, it is sent to the vector control unit, which supervises and coordinates the flow of vector data between the main memory and the vector functional pipelines.
• A number of vector functional pipelines may be built into a vector processor.
• Two pipelined vector supercomputer models are described below: the vector register (register-to-register) architecture and the vector memory-memory architecture.
• The figure shows a register-to-register architecture. Vector registers are used to hold the vector operands and the intermediate and final vector results. The vector functional pipelines retrieve operands from and put results into the vector registers.
• All vector registers are programmable in user instructions. Each vector register is equipped with a component counter, which keeps track of the component registers used in successive pipeline cycles. The length of each vector register is usually fixed, for example, sixty-four 64-bit component registers per vector register in a Cray Series supercomputer.
Vector Memory-Memory versus Vector Register Architecture
Vector Memory-Memory
• Instructions operate on memory-resident vectors, reading source operands from vectors located in memory and writing results to a destination vector in memory.
Vector Register Architecture
Advantages of vector register architectures over vector memory-memory architectures
• A vector memory-memory architecture has to write all intermediate results to memory and then has to read them back from memory.
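The cost of writing intermediates to memory can be made concrete by counting element accesses for a two-step computation such as A*B + C. The sketch below is a back-of-the-envelope model, not a measurement. It assumes all operands start in memory, the final result must end up in memory, and each vector fits in one vector register. The function names are hypothetical.

```python
def memory_memory_traffic(n):
    """(reads, writes) in elements for A*B + C on a memory-memory machine."""
    mul_reads, mul_writes = 2 * n, n  # read A and B, write temporary T
    add_reads, add_writes = 2 * n, n  # read T back and C, write the result
    return mul_reads + add_reads, mul_writes + add_writes


def vector_register_traffic(n):
    """(reads, writes) in elements for A*B + C on a vector register machine."""
    loads = 3 * n   # load A, B and C into vector registers once
    stores = n      # store only the final result
    # The intermediate product stays in a vector register: no memory traffic.
    return loads, stores


if __name__ == "__main__":
    n = 64
    print("memory-memory  :", memory_memory_traffic(n))
    print("vector register:", vector_register_traffic(n))
```

Under these assumptions the memory-memory version moves 6n elements against 4n for the register version, and the gap widens as more operations are chained on the same intermediate results.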
Vector Chaining
• For example, one way to compute R1 = R1*R2 + R3, where R1, R2, and R3 are all vector registers, would be to do the vector multiplication element by element, store the result somewhere, and then do the vector addition.
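The approach just described finishes the whole multiplication before the addition starts. With chaining, each product element is forwarded to the adder as soon as it leaves the multiplier, so the two pipelines overlap. A rough timing sketch follows, assuming idealized pipelines that accept one element per cycle. The stage counts are illustrative values, not actual Cray-1 figures, and the function names are hypothetical.

```python
def unchained_cycles(n, mul_stages, add_stages):
    """Multiply all n elements, then add: the two pipelines run back to back."""
    if n == 0:
        return 0
    return (mul_stages + n - 1) + (add_stages + n - 1)


def chained_cycles(n, mul_stages, add_stages):
    """Products feed the adder directly: both units act as one long pipeline."""
    if n == 0:
        return 0
    return mul_stages + add_stages + n - 1


if __name__ == "__main__":
    n, mul_stages, add_stages = 64, 7, 6  # illustrative assumptions
    print("unchained:", unchained_cycles(n, mul_stages, add_stages))
    print("chained  :", chained_cycles(n, mul_stages, add_stages))
```

In this model chaining saves n - 1 cycles, because the addition no longer waits for the last product to be produced.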
Registers and functional units of the Cray-1
• Eight 24-bit A registers are used to address memory.
• Sixty-four 24-bit B registers are used to hold A registers when they are not needed, rather than writing them back to memory.
• Eight 64-bit S registers hold scalar quantities. Values in these registers can be used as operands for both integer and floating-point operations.
• Sixty-four 64-bit T registers are extra storage for the S registers, to reduce the number of LOADs and STOREs.
• Eight vector registers with 64-bit elements. Each register can hold a 64-element floating-point vector. Two vectors can be added, subtracted, or multiplied in one 16-bit instruction.