SHARED MEMORY - aca notes

• An alignment network is used as the inter-PE memory communication network.

• The number of memory modules (m) should be relatively prime to the number of PEs so that conflict-free parallel memory access can be achieved through skewing.

Examples of Shared Memory: Burroughs Scientific Processor (BSP, with 16 PEs and 17 memory modules), CM/200.
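The role of the relatively prime module count can be sketched as follows. This is a minimal model, assuming simple interleaving in which address a is stored in module a mod m and PE i reads address base + i*stride; the function names are hypothetical, and the actual BSP alignment network is not modeled.

```python
def modules_touched(m, stride, n_pes, base=0):
    """Module hit by each PE when PE i reads address base + i*stride
    and address a is stored in module a % m (assumed interleaving)."""
    return [(base + i * stride) % m for i in range(n_pes)]


def conflict_free(m, stride, n_pes):
    """True when all PEs hit distinct modules in the same cycle."""
    hits = modules_touched(m, stride, n_pes)
    return len(set(hits)) == n_pes


# 16 PEs over 16 modules: a stride of 4 lands on only four modules.
print(conflict_free(16, 4, 16))
# 16 PEs over 17 modules (the BSP configuration): strides 1..16 never collide.
print(all(conflict_free(17, s, 16) for s in range(1, 17)))
```

With 17 modules, every stride that is not a multiple of 17 spreads the 16 accesses over 16 different modules, which is the point of choosing a module count relatively prime to the number of PEs.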

SIMD Instructions: SIMD computers execute vector instructions for arithmetic, logic, data routing and masking operations over vector quantities. In the case of

• Bit-Slice SIMD: vectors are binary vectors.

• Word-Parallel SIMD: vector components are 4-byte or 8-byte numerical values.

All SIMD instructions operate on vectors of equal length n, where n corresponds to the number of PEs.

HOST & I/O: All I/O operations are handled by the host computer in an SIMD organization. A special control memory is used between the host and the array control unit. This is a staging memory for holding programs and data.

Divided data sets are distributed to the local memory modules or the shared memory modules before program execution starts.

The host manages the mass storage or the graphics display of computational results.

Computer Arithmetic Principles

Arithmetic operations can be performed by considering two basic forms of operations:

1) One that is constrained by the fixed memory size.

2) The other, which is performed by rounding off or truncating the value.

Fixed-Point Operations: Fixed-point numbers are represented with a sign, using sign-magnitude, 1’s complement, or 2’s complement notation. The 1’s complement notation introduces a second zero, also known as the dirty zero.

Fixed-point operations include the general arithmetic operations:

1) Add 2) Subtract 3) Multiply 4) Divide
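The second zero of 1’s complement can be seen in a small sketch. It assumes an 8-bit word; the constant and function names are hypothetical and chosen only for illustration.

```python
BITS = 8                      # assumed word size for this sketch
MASK = (1 << BITS) - 1


def negate_ones(pattern):
    """1's complement negation: invert every bit."""
    return ~pattern & MASK


def negate_twos(pattern):
    """2's complement negation: invert every bit and add one."""
    return (-pattern) & MASK


def value_ones(pattern):
    """Decode a 1's complement bit pattern to a signed integer."""
    if pattern >> (BITS - 1):          # sign bit set
        return -(~pattern & MASK)
    return pattern


# Negating zero in 1's complement gives the all-ones pattern, a second zero.
print(format(negate_ones(0), "08b"), value_ones(negate_ones(0)))
# In 2's complement, negating zero gives zero again.
print(format(negate_twos(0), "08b"))
```

Both 00000000 and 11111111 decode to zero in 1’s complement, whereas 2’s complement has a single zero pattern.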

Floating-Point Numbers: A floating-point number has two parts:

1) M: Mantissa

2) E: Exponent, with an implied base R

The value represented is X = M × R^E, where R = 2 in the binary number system. A 32-bit word is used as follows:

1) 1 bit for the sign s (bit 0),

2) 8 bits for the exponent E (bits 1-8), and

3) 23 bits for the mantissa M (bits 9-31).

The exponent range (-127, 128) is represented by E in (0, 255), and X = (-1)^s × 2^(E-127) × (1.M).

The following special cases exist:

1) If E = 255 and M ≠ 0, X is Not a Number (NaN).

2) If E = 255 and M = 0, X is infinite.

3) If E = 0 and M ≠ 0, X is a denormalized number.

4) If E = 0 and M = 0, X is +0 or -0, depending on the sign bit.
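The field layout and the four special cases can be checked with a short sketch. It uses Python's standard struct module to obtain the 32-bit pattern of a value; the function names are hypothetical.

```python
import struct


def decode_single(x):
    """Round x to 32-bit single precision and return its fields (s, E, M)."""
    (word,) = struct.unpack(">I", struct.pack(">f", x))
    s = word >> 31               # 1 sign bit
    e = (word >> 23) & 0xFF      # 8 exponent bits, biased by 127
    m = word & 0x7FFFFF          # 23 mantissa bits
    return s, e, m


def classify(x):
    """Apply the special cases on E and M listed above."""
    s, e, m = decode_single(x)
    if e == 255:
        return "NaN" if m != 0 else "infinity"
    if e == 0:
        return "denormalized" if m != 0 else "zero"
    return "normalized"


def normalized_value(s, e, m):
    """X = (-1)^s * 2^(E-127) * (1.M), valid for 0 < E < 255."""
    return (-1) ** s * 2.0 ** (e - 127) * (1 + m / 2 ** 23)


print(decode_single(-6.5))                      # sign, biased exponent, mantissa
print(normalized_value(*decode_single(-6.5)))   # reconstructs -6.5
print(classify(float("inf")), classify(float("nan")), classify(-0.0))
```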

Floating-Point Operations: The operations that can be performed are as follows:

1) X + Y = (Mx × 2^(Ex-Ey) + My) × 2^Ey
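The addition formula can be tried out numerically. The sketch below is an illustration only: it uses math.frexp, whose mantissa lies in [0.5, 1) rather than the 1.M form, and it swaps the operands so that the one with the smaller exponent is the one that gets scaled (an assumed convention that mirrors a right shift in hardware). The function name is hypothetical.

```python
import math


def add_by_alignment(x, y):
    """Compute x + y as (Mx * 2^(Ex-Ey) + My) * 2^Ey."""
    mx, ex = math.frexp(x)        # x == mx * 2**ex
    my, ey = math.frexp(y)        # y == my * 2**ey
    if ex > ey:                   # keep Ex <= Ey so Mx is scaled down
        mx, ex, my, ey = my, ey, mx, ex
    aligned = math.ldexp(mx, ex - ey)     # Mx * 2^(Ex - Ey)
    return math.ldexp(aligned + my, ey)   # (aligned + My) * 2^Ey


print(add_by_alignment(6.5, 0.75))
```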

Flynn’s classification scheme is based on the notion of a stream of information. Two types of information flow into a processor: instructions and data. The instruction stream is defined as the sequence of instructions performed by the processing unit. The data stream is defined as the data traffic exchanged between the memory and the processing unit. According to Flynn’s classification, either of the instruction or data streams can be single or multiple. Computer architecture can be classified into the following four distinct categories:

1) Single-Instruction Single-Data streams (SISD);

2) Single-Instruction Multiple-Data streams (SIMD);

3) Multiple-Instruction Single-Data streams (MISD); and

4) Multiple-Instruction Multiple-Data streams (MIMD).

The architecture of SIMD computer models is determined by:

1) Memory Distribution, and

2) Addressing Schemes Used.

SIMD computers use a single control unit and distributed memories, and some of them use associative memories. The instructions of an SIMD computer are decoded by the array control unit. The major components of SIMD computers are:

1) Processing Elements (PEs), which are passive in the SIMD array.

2) Arithmetic and Logic Units (ALUs), which execute instructions broadcast from the control unit.

All PEs must operate in lockstep, synchronized by the same array controller.

DISTRIBUTED MEMORY MODEL

• It consists of an array of PEs which is controlled by the same array control unit.

• The host computer is responsible for loading programs and data into the control memory.

• When an instruction is sent to the control unit for decoding, a scalar or program control operation is executed by the scalar processor attached to the control unit.

• A vector operation is broadcast to all the PEs for parallel execution. The partitioned data is distributed to all the local memories through a vector data bus.

• The data routing network is under program control through the control unit.

• Masking logic is provided to enable or disable any PE during an instruction cycle.
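The broadcast and masking behaviour described in the bullets above can be sketched as follows. This is a conceptual model only: the list comprehension stands in for PEs that would run in lockstep in hardware, and the function name is hypothetical.

```python
def broadcast(op, local_data, mask):
    """Apply one broadcast operation to the local data of every enabled PE.

    local_data[i] is the operand held in the local memory of PE i and
    mask[i] is 1 when PE i takes part in this instruction cycle.
    """
    return [op(value) if enabled else value
            for value, enabled in zip(local_data, mask)]


local_data = [1, 2, 3, 4]
mask = [1, 0, 1, 1]            # PE 1 is disabled for this instruction
print(broadcast(lambda v: v * 10, local_data, mask))
```

The disabled PE keeps its old value while every enabled PE executes the same instruction on its own data.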

Examples of Distributed Memory:

1) MESH ARCHITECTURE: Illiac IV, Goodyear MPP, AMT DAP 610.

2) HYPERCUBE: CM-2, X-Net.


Introduction:

In computing, a pipeline is a set of data processing elements connected in series, so that the output of one element is the input of the next one. The elements of a pipeline are often executed in parallel or in time-sliced fashion; in that case, some amount of buffer storage is often inserted between elements.

Computer-related pipelines include:

1. Instruction pipelines, such as the classic RISC pipeline, which are used in processors to allow overlapping execution of multiple instructions with the same circuitry. The circuitry is usually divided up into stages, including instruction decoding, arithmetic, and register fetching stages, wherein each stage processes one instruction at a time.

2. Graphics pipelines, found in most graphics cards, which consist of multiple arithmetic units, or complete CPUs, that implement the various stages of common rendering operations (perspective projection, window clipping, color and light calculation, rendering, etc.).

3. Software pipelines, consisting of multiple processes arranged so that the output stream of one process is automatically and promptly fed as the input stream of the next one. Unix pipelines are the classical implementation of this concept.
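A software pipeline of the kind described in item 3 can be sketched with generators, where each stage consumes the output stream of the previous one. The stage names below are hypothetical and the stages run inside one process rather than as separate Unix processes.

```python
def read_numbers(limit):
    """First stage: produce a stream of integers."""
    for n in range(limit):
        yield n


def square(stream):
    """Second stage: transform each element as it arrives."""
    for n in stream:
        yield n * n


def keep_even(stream):
    """Third stage: filter the stream."""
    for n in stream:
        if n % 2 == 0:
            yield n


# Stages are connected in series; each element flows through all of them.
pipeline = keep_even(square(read_numbers(6)))
print(list(pipeline))
```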

Advantages of Pipelining:

1. The cycle time of the processor is reduced, thus increasing instruction issue-rate in most cases.

2. Some combinational circuits such as adders or multipliers can be made faster by adding more circuitry. If pipelining is used instead, it can save circuitry vs. a more complex combinational circuit.

Disadvantages of Pipelining:

1. A non-pipelined processor executes only a single instruction at a time. This prevents branch delays (in effect, every branch is delayed) and problems with serial instructions being executed concurrently. Consequently the design is simpler and cheaper to manufacture.

2. The instruction latency in a non-pipelined processor is slightly lower than in a pipelined equivalent. This is due to the fact that extra flip flops must be added to the data path of a pipelined processor.

3. A non-pipelined processor will have a stable instruction bandwidth. The performance of a pipelined processor is much harder to predict and may vary more widely between different programs.

Arithmetic pipelines

The arithmetic operations most often used in the literature to illustrate arithmetic pipelines are floating-point addition and multiplication.

Floating-point addition

Consider the addition of two normalized floating-point numbers:

A = (Ea, Ma) and B = (Eb, Mb) to obtain the sum

S = (Es, Ms)

where E and M represent the exponent and mantissa, respectively.

The addition follows the steps shown below:

1. Equalize the exponents:

if Ea < Eb, swap A and B; Ediff = Ea - Eb; shift Mb right by Ediff bits.

2. Add mantissae:

Ms = Ma + Mb; Es = Ea

3. Normalize Ms and adjust Es to reflect the number of shifts required to normalize.

4. The normalized Ms might have a larger number of bits than can be accommodated by the mantissa field in the representation. If so, round Ms.

5. If rounding causes a mantissa overflow, renormalize Ms and adjust Es accordingly.
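The five steps can be traced in a small integer model. This is a sketch under stated assumptions, not the hardware design: operands are positive and normalized, the mantissa is an 8-bit integer M with value M × 2^E, guard bits are kept during alignment so that rounding has something to work with, and rounding is round-half-up. The name fp_add and the width P are hypothetical.

```python
P = 8  # assumed mantissa width in bits


def fp_add(a, b, p=P):
    """Add two positive normalized numbers given as (E, M), value = M * 2**E,
    with 2**(p-1) <= M < 2**p. Returns the normalized, rounded (Es, Ms)."""
    (ea, ma), (eb, mb) = a, b
    # Step 1: equalize exponents; p guard bits preserve the shifted-out bits.
    if ea < eb:
        (ea, ma), (eb, mb) = (eb, mb), (ea, ma)
    ediff = ea - eb
    ma_wide = ma << p
    mb_wide = (mb << p) >> ediff
    # Step 2: add mantissae; the guard bits lower the exponent by p.
    ms = ma_wide + mb_wide
    es = ea - p
    # Step 3: normalize, i.e. count the bits beyond the p-bit field.
    extra = ms.bit_length() - p
    es += extra
    # Step 4: round the mantissa to p bits (round half up).
    ms = (ms + (1 << (extra - 1))) >> extra
    # Step 5: rounding may overflow the mantissa field; renormalize.
    if ms == 1 << p:
        ms >>= 1
        es += 1
    return es, ms


print(fp_add((0, 208), (-3, 192)))   # 208 + 24
print(fp_add((0, 255), (-8, 128)))   # 255 + 0.5, rounding overflows
```

In the second call the rounded mantissa no longer fits in 8 bits, so step 5 shifts it right once and increments the exponent.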

Figure shows a five-stage pipeline configuration for the addition process given above.

Floating-point add pipeline

The throughput of the above pipeline can be enhanced by rearranging the computations into a larger number of stages, each consuming a smaller amount of time, as shown in Figure 3.6. Here, equalizing exponents is performed using a subtract exponents stage and a shift stage that shifts mantissa appropriately. Similarly, normalizing is split into two stages.

This eight-stage pipeline provides a speedup of 8/5 = 1.6 over the pipeline of the above figure.

Modified floating-point add pipeline

In the pipeline of the figure above we have assumed that the shift stages can perform an arbitrary number of shifts in one cycle. If that is not the case, the shifters have to be used repeatedly. Figure 3.7 shows the rearranged pipeline where the feedback paths indicate the reuse of the corresponding stage.

Floating-point multiplication

Consider the multiplication of two floating-point numbers A = (Ea, Ma) and B = (Eb, Mb), resulting in the product P = (Ep, Mp). The multiplication follows the pipeline configuration shown in figure 1 and the steps are listed below:

1. Add exponents: Ep = Ea + Eb.

2. Multiply mantissae: Mp = Ma * Mb. Mp will be a double-length mantissa.

3. Normalize Mp and adjust Ep accordingly.

4. Convert Mp into single-length mantissa by rounding.

5. If rounding causes a mantissa overflow, renormalize and adjust Ep accordingly.
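The multiplication steps can be traced with the same kind of integer model. The sketch assumes positive normalized operands with an 8-bit integer mantissa M and value M × 2^E, and round-half-up rounding; the name fp_mul and the width P are hypothetical.

```python
P = 8  # assumed mantissa width in bits


def fp_mul(a, b, p=P):
    """Multiply two positive normalized numbers given as (E, M),
    value = M * 2**E, with 2**(p-1) <= M < 2**p."""
    (ea, ma), (eb, mb) = a, b
    # Step 1: add exponents.
    ep = ea + eb
    # Step 2: multiply mantissae; the result is double length.
    mp = ma * mb
    # Step 3: normalize, i.e. count the bits beyond the p-bit field.
    extra = mp.bit_length() - p
    ep += extra
    # Step 4: convert to a single-length mantissa by rounding (half up).
    mp = (mp + (1 << (extra - 1))) >> extra
    # Step 5: renormalize if rounding overflowed the mantissa field.
    if mp == 1 << p:
        mp >>= 1
        ep += 1
    return ep, mp


print(fp_mul((2, 192), (-1, 160)))   # 768 * 80
print(fp_mul((0, 181), (0, 181)))    # rounding overflows the mantissa
```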

Stage 2 in the above pipeline would consume the largest amount of time. In the figure below, stage 2 is split into two stages, one performing partial products and the other accumulating them. In fact, the operations of these two stages can be overlapped in the sense that when the accumulate stage is adding, the other stage can be producing the next partial product.

Floating-point multiplication pipeline

Floating-point multiplier pipeline with feedback loops

Floating-point adder/multiplier

The pipelines shown so far in this section are unifunction pipelines since they are designed to perform only one function. Note that the pipelines of the figures above have several common stages. If a processor is required to perform both addition and multiplication, the two pipelines can be merged into one as shown in the figure above. Obviously, there will be two distinct paths of dataflow in this pipeline, one for addition and the other for multiplication. This is a multifunction pipeline. A multifunction pipeline can perform more than one operation. The interconnection between the stages of the pipeline changes according to the function it is performing.

Obviously, a control input that determines the particular function to be performed on the operand being input is needed for proper operation of the multifunction pipeline.

Static Arithmetic Pipelines

Most of today’s arithmetic pipelines are designed to perform fixed functions.

These arithmetic and logic units perform fixed-point and floating-point operations separately. The fixed-point unit is also called the integer unit. The floating-point unit can be built either as part of the control processor or on a separate coprocessor.

These arithmetic units perform scalar operations involving one pair of operands at a time. The pipelining in scalar arithmetic pipelines is controlled by software loops. Vector arithmetic units can be designed with pipeline hardware directly under firmware or hardwired control.

Scalar and vector arithmetic pipelines differ mainly in the areas of register files and the control mechanisms involved. Vector hardware pipelines are often built as an add-on option to a scalar processor or as an attached processor driven by a control processor. Both scalar and vector processors are used in modern supercomputers.

Arithmetic Pipeline Stages

Depending on the function to be implemented, different pipeline stages in an arithmetic unit require different hardware logic. Since all arithmetic operations (such as add, subtract, multiply, divide, squaring, square rooting, logarithm, etc.) can be implemented with the basic add and shifting operations, the core arithmetic stages require some form of hardware to add or to shift.

For example, a typical three-stage floating-point adder includes a first stage for exponent comparison and equalization, which is implemented with an integer adder and some shifting logic; a second stage for fraction addition using a high-speed carry look-ahead adder; and a third stage for fraction normalization and exponent readjustment using a shifter and another addition logic.

Arithmetic or logical shifts can be easily implemented with shift registers. High-speed addition requires either the use of a carry-propagation adder (CPA), which adds two numbers and produces an arithmetic sum as shown in Fig. 6.22a, or the use of a carry-save adder (CSA) to "add" three input numbers and produce one sum output and a carry output, as exemplified in the figure below.

In a CPA, the carries generated in successive digits are allowed to propagate from the low end to the high end, using either ripple carry propagation or some carry lookahead technique.

In a CSA, the carries are not allowed to propagate but instead are saved in a carry vector. In general, an n-bit CSA is specified as follows: Let X, Y, and Z be three n-bit input numbers, expressed as X = (x_{n-1}, x_{n-2}, ..., x_1, x_0). The CSA performs bitwise operations simultaneously on all columns of digits to produce two n-bit output numbers, denoted as

S^b = (0, S_{n-1}, S_{n-2}, ..., S_1, S_0) and C = (C_n, C_{n-1}, ..., C_1, 0).

Note that the leading bit of the bitwise sum S^b is always a 0, and the tail bit of the carry vector C is always a 0. The input-output relationships are expressed below:

S_i = x_i ⊕ y_i ⊕ z_i

C_{i+1} = x_i y_i ∨ y_i z_i ∨ z_i x_i (6.21)

for i = 0, 1, 2, ..., n - 1, where ⊕ is the exclusive OR and ∨ is the logical OR operation.

Note that the arithmetic sum of three input numbers, i.e., S = X + Y + Z, is obtained by adding the two output numbers, i.e., S = S^b + C, using a CPA. We use the CPA and CSAs to implement the pipeline stages of a fixed-point multiply unit as follows.
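The CSA relations can be checked on integers, treating each Python int as a bit vector. This is a minimal sketch; the function name is hypothetical, and the left shift by one places C_{i+1} in column i + 1 so that the tail bit of C is 0.

```python
def carry_save_add(x, y, z):
    """Return (S^b, C) for three non-negative integers used as bit vectors."""
    s = x ^ y ^ z                             # S_i = x_i XOR y_i XOR z_i
    c = ((x & y) | (y & z) | (z & x)) << 1    # C_{i+1} = x_i y_i OR y_i z_i OR z_i x_i
    return s, c


x, y, z = 0b1011, 0b0110, 0b1101
s, c = carry_save_add(x, y, z)
# The final sum still needs one carry-propagate addition: S = S^b + C.
print(s + c == x + y + z)
```

No carry travels between columns inside carry_save_add; all propagation is deferred to the single addition s + c, which is the job of the CPA.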

Multiply Pipeline Design

a) An n-bit carry-propagate adder (CPA), which allows either carry propagation or applies the carry-lookahead technique.

b) An n-bit carry-save adder (CSA), where S^b is the bitwise sum of X, Y, and Z, and C is the carry vector generated without carry propagation between digits.

Consider the multiplication of two 8-bit integers A × B = P, where P is the 16-bit product in double precision. This fixed-point multiplication can be written as the summation of eight partial products as shown below:

P = A × B = P0 + P1 + P2 + ... + P7

where × and + are the arithmetic multiply and add operations, respectively.

Note that the partial product Pj is obtained by multiplying the multiplicand A by the jth bit of B and then shifting the result j bits to the left, for j = 0, 1, 2, ..., 7. Thus Pj is (8 + j) bits long with j trailing zeros.

The first stage generates all eight partial products, ranging from 8 bits to 15 bits, simultaneously. The second stage is made up of two levels of four CSAs, which essentially merge eight numbers into four numbers ranging from 13 to 15 bits. The third stage consists of two CSAs, which merge four numbers into two 16-bit numbers. The final stage is a CPA, which adds up the last two numbers to produce the final product P.

A pipeline unit for fixed point multiplication of 8 bit integers.
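The data flow through the multiply unit can be sketched in software. This is an illustration under assumptions: the CSA levels are formed by grouping the remaining numbers three at a time, which reduces 8 numbers to 6, 4, 3 and finally 2, and may not match the exact wiring of the figure. The function names are hypothetical.

```python
def carry_save_add(x, y, z):
    """Three numbers in, bitwise sum and carry vector out."""
    return x ^ y ^ z, ((x & y) | (y & z) | (z & x)) << 1


def multiply_8bit(a, b):
    """Multiply two 8-bit unsigned integers with CSA levels and one CPA."""
    # Stage 1: generate the eight partial products Pj = (A * bit j of B) << j.
    numbers = [(a << j) if (b >> j) & 1 else 0 for j in range(8)]
    # Stages 2 and 3: CSA levels merge the numbers until only two remain.
    while len(numbers) > 2:
        full = len(numbers) - len(numbers) % 3
        reduced = []
        for k in range(0, full, 3):
            reduced.extend(carry_save_add(*numbers[k:k + 3]))
        reduced.extend(numbers[full:])     # leftovers pass to the next level
        numbers = reduced
    # Final stage: a carry-propagate addition yields the 16-bit product.
    return (numbers[0] + numbers[1]) & 0xFFFF


print(multiply_8bit(200, 123) == 200 * 123)
```

Six CSAs are used in total, four in the first two levels and two in the last two, which agrees with the stage description above.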

SIMD Computers and Performance Enhancement

SIMD (Single Instruction, Multiple Data) is a technique employed to achieve data level parallelism.

Performance Enhancement Using SIMD:

The SIMD concept is a method of improving performance in applications where highly repetitive operations need to be performed. Simply put, SIMD is a technique of performing the same operation, be it arithmetic or otherwise, on multiple pieces of data simultaneously.

Traditionally, when an application is being programmed and a single operation needs to be performed across a large dataset, a loop is used to iterate through each element in the dataset and perform the required procedure. During each iteration, a single piece of data has a single operation performed on it. This is known as Single Instruction Single Data (SISD) programming. SISD is generally trivial to implement and both the intent and method of the programmer can quickly be seen at a later time.

Loops such as this, however, are typically very inefficient, as they may have to iterate thousands, or even millions of times.

Ideally, to increase performance, the number of iterations of a loop needs to be reduced.

One method of reducing iterations is known as loop unrolling. This takes the single operation that was being performed in the loop, and carries it out multiple times in each iteration. For example, if a loop was previously performing a single operation and taking 10,000 iterations, its efficiency could be improved by performing this operation 4 times in each loop and only having 2500 iterations.
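Loop unrolling by a factor of four can be sketched as follows. The example only shows the transformation; in an interpreted language it is not expected to give the speedup that unrolling can give in compiled code. The function names are hypothetical.

```python
def scale_simple(data, k):
    """One operation per iteration: len(data) iterations."""
    out = [0] * len(data)
    for i in range(len(data)):
        out[i] = data[i] * k
    return out


def scale_unrolled(data, k):
    """Four operations per iteration: about len(data) / 4 iterations."""
    n = len(data)
    out = [0] * n
    main = n - n % 4
    for i in range(0, main, 4):
        out[i] = data[i] * k
        out[i + 1] = data[i + 1] * k
        out[i + 2] = data[i + 2] * k
        out[i + 3] = data[i + 3] * k
    for i in range(main, n):       # remainder when n is not a multiple of 4
        out[i] = data[i] * k
    return out


data = list(range(10000))
print(scale_unrolled(data, 3) == scale_simple(data, 3))
```

For 10,000 elements the unrolled main loop runs 2,500 times, matching the example in the text.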

The SIMD concept takes loop unrolling one step further by incorporating the multiple actions in each loop iteration, and performing them simultaneously. With SIMD, not only can the number of loop iterations be reduced, but also the multiple operations that are required can be reduced to a single, optimized action.

SIMD does this through the use of ‘packed vectors’ (hence the alternate name of vector processing). A packed vector, like traditional programming vectors or arrays, is a data structure that contains multiple pieces of basic data. Unlike traditional vectors, however, a SIMD packed vector can then be used as an argument for a specific instruction (For example an arithmetic operation) that will then be performed on all elements in the vector simultaneously (Or very close to). Because of this, the number of values that can be loaded into the vector directly affects performance; the more values being processed at once, the faster a complete dataset can be completed.

This size depends on two things:

1. The data type being used (i.e. int, float, double etc.)

2. The SIMD implementation
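How the two factors combine can be shown with a small calculation. The register widths of 128 and 256 bits and the element sizes in the table are illustrative assumptions, not properties of a particular SIMD implementation; the names are hypothetical.

```python
# Assumed element sizes in bits for a few common data types.
ELEMENT_BITS = {"int8": 8, "int16": 16, "int32": 32, "float": 32, "double": 64}


def lanes(register_bits, dtype):
    """Number of values of the given type that fit in one packed vector."""
    return register_bits // ELEMENT_BITS[dtype]


for dtype in ("int8", "float", "double"):
    print(dtype, lanes(128, dtype), lanes(256, dtype))
```

Under these assumptions a 128-bit register holds four floats but only two doubles, so the same loop over doubles needs twice as many vector operations.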

When values are stored in packed vectors and ‘worked upon’ by a SIMD operation, they are actually moved to a special set of CPU registers where the parallel processing takes place. The size and number of these registers is determined by the SIMD implementation being used.

The other area that dictates the usefulness of a SIMD implementation (Other than the level of hardware performance itself) is the instruction set. The instruction set is the list of available operations that a SIMD implementation provides for use with packed vectors. These typically include operations to efficiently store and load values to and from a vector, arithmetic operations (add, subtract, divide, square root etc), logical operations (AND, OR etc) and comparison operations (greater than, equal to etc).

The more operations a SIMD implementation provides, the simpler it is for a developer to perform the required function. SIMD operations are available directly when writing code in assembly, but not in the C language itself. To simplify SIMD optimization in C, intrinsics can be used; these are essentially functions declared in a header file that map to the corresponding assembler instructions.

SIMD Example

The best way to demonstrate the effectiveness of SIMD is through an example. One area where SIMD instructions are particularly useful is within image manipulation. When a raster-based image, for example a photo, has a filter of some kind applied to it, the filter has to process the colour value of each pixel and return the new value. The larger the image, the more pixels that need to be processed. The operation of calculating each new pixel value, however, is the same for every pixel. Put another way, there is a single
