SIMD Array
6.2 Algorithms used for the Arithmetic Functions
Before considering the algorithms that were developed, the
representation of the different data values had to be decided upon. In this design there are four basic data types:
•Input Value and Weight Value
During the development of the Tx Extraction Network it was decided to use an 8-bit 2's complement integer to represent the weights and the input values. Simulations carried out on various data lengths had shown that 8 bits provided a safe margin of accuracy in these models for the recognition phase.
•Address Pointers
The size of the address pointers directly limits the maximum size of network that can be implemented. The size was chosen to be 16 bits. This provides an adequate address range (65,536 locations) for most networks, and is a convenient size for storing the pointers.
Richard Palmer Phd. Thesis
•Synaptic Product
The size of this value is dependent upon the length of both the input and weight values. In this implementation 8 bits are used to represent both these values, resulting in the synaptic product requiring 16 bits for its representation.
•Partial Sum
The size of the partial sum is again fairly arbitrary; all that is required is that it be large enough to keep overflows infrequent. The size chosen was 32 bits, which again is a convenient length for storage.
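The arithmetic behind these sizes can be checked with a short Python sketch. It is an illustration only, and the constant and function names are hypothetical. It confirms that every product of two 8-bit 2's complement values fits in 16 bits, gives the number of locations a 16-bit pointer can address, and estimates how many worst-case products a 32-bit partial sum can absorb before an overflow becomes possible.

```python
# Bit widths of the four data types (names are illustrative).
INPUT_WEIGHT_BITS = 8     # 2's complement input and weight values
POINTER_BITS = 16         # address pointers
PRODUCT_BITS = 16         # synaptic product
PARTIAL_SUM_BITS = 32     # partial sum


def signed_range(bits: int) -> tuple[int, int]:
    """Return (min, max) of a 2's complement integer of the given width."""
    return -(1 << (bits - 1)), (1 << (bits - 1)) - 1


lo8, hi8 = signed_range(INPUT_WEIGHT_BITS)
# Largest-magnitude synaptic product over the extreme operand values.
worst_product = max(abs(a * b) for a in (lo8, hi8) for b in (lo8, hi8))
lo16, hi16 = signed_range(PRODUCT_BITS)
assert lo16 <= worst_product <= hi16  # every 8x8 product fits in 16 bits

# Number of distinct addresses reachable with a 16-bit pointer.
address_range = 1 << POINTER_BITS

# How many maximal products can be accumulated before 32 bits may overflow.
_, hi32 = signed_range(PARTIAL_SUM_BITS)
safe_accumulations = hi32 // worst_product

print(address_range, worst_product, safe_accumulations)  # prints: 65536 16384 131071
```

Even in the worst case, roughly 131,000 maximal products can be accumulated before the partial sum can overflow. This is consistent with the text's aim of keeping overflows infrequent rather than impossible.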
After considering the different data representations, it was decided to use a 16-bit datapath, with special instructions included to process the 8-bit inputs and weights and the double-length partial sum. The width of the datapath was chosen primarily so that each pointer could be updated in a single clock cycle. This matters because pointer manipulations make up a large part of the processor's operations.
The arithmetic operations to be performed on the datapath were then considered. These operations were broken down into three groups, each considered separately.
6.2.1 Pointer Manipulations
One pointer manipulation is the increment of the relevant pointer
during each memory fetch. This requires a conventional adder unit, with the ability to set one of the inputs to zero, and the carry_in high. The
only other pointer manipulation function is the ADD_MOD_STEP
instruction, which requires the use of a conventional adder with no additional facilities.
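The two pointer operations can be modelled with a single 16-bit adder, as in the following Python sketch. The source only states that ADD_MOD_STEP needs a conventional adder. The assumption here is that it adds a step value to a pointer and wraps modulo 2^16. The function names and the `step` operand are hypothetical.

```python
MASK16 = 0xFFFF


def adder16(a: int, b: int, carry_in: int = 0) -> tuple[int, int]:
    """Model of a conventional 16-bit adder: returns (sum, carry_out)."""
    total = (a & MASK16) + (b & MASK16) + (carry_in & 1)
    return total & MASK16, total >> 16


def increment_pointer(ptr: int) -> int:
    """Pointer increment: one adder input forced to zero, carry_in forced high."""
    result, _ = adder16(ptr, 0, carry_in=1)
    return result


def add_mod_step(ptr: int, step: int) -> int:
    """Assumed ADD_MOD_STEP behaviour: plain addition, wrapping modulo 2**16."""
    result, _ = adder16(ptr, step)
    return result
```

The sketch reuses one adder for both operations, which reflects the text's point that pointer updates need no special arithmetic hardware.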
6.2.2 Partial Sum Addition
on the datapath to cater for the different length operands: the synaptic product being 16 bits wide, while the partial sum is 32 bits wide.
As mentioned previously, any operation carried out on the partial sum is performed as a double-length operation using two successive instructions. Between these two instructions, the carry generated by the first addition must be carried into the second addition. The sign of the synaptic product must also be carried through to an operand of the second addition. This sign extension converts the 16-bit synaptic product into a 32-bit value. It requires either a 0 extension (0h0000) for a positive synaptic product or a -1 extension (0hffff) for a negative synaptic product. The algorithm used to add the synaptic product to the partial sum can be described thus:
LSWord Partial Sum := LSWord Partial Sum + Synaptic Product
IF Sign of Synaptic Product is +ve
    MSWord Partial Sum := MSWord Partial Sum + 0h0000 + Carry
ELSE
    MSWord Partial Sum := MSWord Partial Sum + 0hffff + Carry
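The following Python function is a sketch of this two-instruction sequence, not the processor's instruction set. It splits a 32-bit partial sum into its LS and MS words and adds the 16-bit product to the LS word. It then adds the sign-extension word and the carry to the MS word. Values are handled as raw unsigned bit patterns. The function name is hypothetical, and saturation is not modelled here.

```python
MASK16 = 0xFFFF


def add_product_to_partial_sum(partial_sum: int, product: int) -> int:
    """Add a 16-bit synaptic product to a 32-bit partial sum in two 16-bit steps.

    Arguments and result are raw bit patterns (non-negative integers).
    """
    ls_word = partial_sum & MASK16
    ms_word = (partial_sum >> 16) & MASK16

    # First addition: LS word of the partial sum plus the synaptic product.
    total = ls_word + (product & MASK16)
    ls_word, carry = total & MASK16, total >> 16

    # Second addition: 0h0000 for a +ve product, 0hffff for a -ve one, plus carry.
    extension = 0xFFFF if product & 0x8000 else 0x0000
    ms_word = (ms_word + extension + carry) & MASK16

    return (ms_word << 16) | ls_word
```

For example, adding the product 0hfffd (-3) to a partial sum of 5 produces a carry out of the LS word. That carry cancels the 0hffff extension in the MS word, leaving a result of 2.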
The other complication in this partial sum addition is the requirement for saturated arithmetic, which ensures that the sign of the partial sum does not change when an overflow occurs. Saturated arithmetic operates during the second addition cycle and requires special logic on the output of the addition unit. This logic performs the following algorithm:
If No Overflow
    MSWord Partial Sum := Output of Adder
If Overflow from +ve to -ve
    MSWord Partial Sum := 0h7fff
If Overflow from -ve to +ve
    MSWord Partial Sum := 0h8000
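The saturation logic can be sketched in Python as follows. This is an illustrative model that assumes overflow is detected by comparing the exact signed sum with the 16-bit range, rather than by reading an adder overflow flag. The function names are hypothetical. As in the algorithm above, only the MS word of the partial sum is forced to 0h7fff or 0h8000.

```python
MASK16 = 0xFFFF


def to_signed16(word: int) -> int:
    """Interpret a 16-bit pattern as a 2's complement integer."""
    word &= MASK16
    return word - 0x10000 if word & 0x8000 else word


def saturating_ms_add(ms_word: int, extension: int, carry: int) -> int:
    """Second (MS word) addition with saturation on overflow."""
    adder_output = (ms_word + extension + carry) & MASK16
    true_sum = to_signed16(ms_word) + to_signed16(extension) + carry
    if true_sum > 0x7FFF:     # overflow from +ve to -ve
        return 0x7FFF
    if true_sum < -0x8000:    # overflow from -ve to +ve
        return 0x8000
    return adder_output       # no overflow: output of the adder
```

The first two checks in the function correspond to the two overflow branches of the algorithm. Any in-range result passes through unchanged.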
6.2.3 Multiplication Instructions
The multiplication algorithm requires special consideration since it is the main function of each synaptic update. To ensure maximum throughput, the algorithm should require no more cycles than the minimum needed to load the operands and compute the product. This factor weighs heavily in the choice of algorithm, since it must cater for:
•Unsigned Multiplication (+ve x +ve)
•Mixed Multiplication (+ve x -ve and -ve x +ve)
•Signed Multiplication (-ve x -ve)
The algorithm chosen was taken from the Texas Instruments 74888/74890
Bit Slice Processor^]. This algorithm can perform all the above
multiplication types, with no time overheads in sign correction after the product has been computed.
The algorithm requires three registers for the multiplication, two of these being general purpose registers (acc and reg_b), and the third being a special shift register (reg_sh). Before a multiplication can start, the registers should be loaded as below:
acc zero
reg_b multiplicand
reg_sh multiplier
During mixed and signed multiplication the multiplicand holds the negative integer, allowing it to be adjusted during the last multiplication iteration if required. This adjustment is controlled by the 'Mode Bit' in the multiplication algorithm. To ensure that the operands are loaded into the correct registers, the following algorithm is used:
Load Weight into reg_b and reg_sh
If Sign of Weight is -ve
    Load Input into reg_b
Else
    Load Input into reg_sh
This algorithm is implemented in hardware, thus reducing the number of instruction cycles required to load the operands.
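The register routing can be illustrated with a short Python sketch. It follows the conditional load above, with the registers held as 8-bit patterns in a dictionary. The dictionary representation and the function name are hypothetical. In the hardware this routing happens without extra instruction cycles, and the sketch only shows where each operand ends up.

```python
def load_operands(weight: int, input_value: int) -> dict[str, int]:
    """Place 8-bit 2's complement operands into the multiplier registers."""
    regs = {"acc": 0}
    # The weight is first loaded into both reg_b and reg_sh.
    regs["reg_b"] = weight & 0xFF
    regs["reg_sh"] = weight & 0xFF
    if weight & 0x80:                      # sign of weight is -ve
        regs["reg_b"] = input_value & 0xFF
    else:
        regs["reg_sh"] = input_value & 0xFF
    return regs
```

With a weight of -3 and an input of 5, for instance, the input replaces the weight in reg_b and the weight remains in reg_sh.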
Each multiplication requires N iterations to produce the 2N-bit result, where N is the operand length in bits. Upon completion the result is held in the registers as below:
acc MSByte of product
reg_sh LSByte of product
reg_b Unchanged
The multiplication operation can be expressed as the following recursion, which repeats for J = 0 to N. Iterations J = 0 to N-1 are termed iterative cycles (MULTI), while iteration J = N is termed the terminating cycle (MULTT):
$$P_{J+1} = 2^{-1}\left(P_J + M\left(\text{Multiplicand} \times \text{Multiplier}[J]\right)\right)$$
Where:
$P_J$ = Partial Product for the Jth Iteration
$J$ = Iteration Number [Repeat for J = 0 to N]
$[J]$ = Bit at position J
$2^{-1}$ = Right Shift [Varies according to iteration type and sign of operands]
$M$ = Mode Bit [Varies according to iteration type]
•Right Shift
The right shift operates as a double length shift on the partial product. The type of shift mode used varies according to the signs of the operands, and whether it is an iterative cycle or a terminating cycle. A selection mechanism switches between an arithmetic shift, which shifts in the MSBit of the partial sum, and a logical shift which uses the carry out from the adder unit. The selection between these different shift modes is shown in Table 6.1.
•Mode Bit
The mode bit performs the sign adjustment on the multiplicand and depends only on the type of iteration. During the iterative cycles the mode bit is set to 1, so the multiplicand is added to the partial product unaltered. During the terminating cycle the mode bit is set to -1, selecting the 2's complement of the multiplicand for the addition. This 2's complementation is performed by inverting the multiplicand and setting the adder's carry in high.

Table 6.1 Multiplication Shift Modes
Type of Multiply          Normal Iteration    Final Iteration
Signed Multiplication     MSBit Sum           MSBit Sum
Mixed Multiplication      Carry Out           MSBit Sum
Unsigned Multiplication   Carry Out           MSBit Sum
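A minimal Python sketch of the recursion follows, under stated assumptions. It models a conventional 2's complement shift-and-add multiplier. On iterations 0 to N-2 the multiplicand is added when the current multiplier bit is set (mode bit 1). On the final iteration it is subtracted instead (mode bit -1). After each step the double-length acc:reg_sh pair is shifted right arithmetically. The sketch uses N iterations rather than the separate MULTI/MULTT cycle count given above. It keeps acc as an unbounded Python integer and does not model the Carry Out/MSBit Sum selection of Table 6.1. It should therefore not be read as the exact 74888/74890 behaviour, and it does not reproduce the finite-width sign corruption discussed in this section. The function names are hypothetical.

```python
def to_signed(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of value as a 2's complement integer."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def shift_add_multiply(multiplicand: int, multiplier: int, n: int) -> int:
    """2's complement shift-and-add multiply of two n-bit operands.

    Returns the 2n-bit product as a bit pattern: the upper half comes from
    acc and the lower half from reg_sh, as in the register table above.
    """
    reg_b = to_signed(multiplicand, n)    # multiplicand, unchanged throughout
    reg_sh = multiplier & ((1 << n) - 1)  # multiplier bits, consumed LSB first
    acc = 0                               # upper half of the partial product
    for j in range(n):
        mode = -1 if j == n - 1 else 1    # subtract on the terminating cycle
        if reg_sh & 1:                    # Multiplier[J]
            acc += mode * reg_b
        # Double-length right shift: acc's LSB moves into reg_sh's MSB.
        reg_sh = (reg_sh >> 1) | ((acc & 1) << (n - 1))
        acc >>= 1                         # arithmetic shift on a Python int
    return ((acc << n) | reg_sh) & ((1 << (2 * n)) - 1)
```

For 4-bit operands, `shift_add_multiply(3, -2, 4)` returns 0hfa, the 8-bit pattern for -6. No sign correction is needed after the loop, because the final subtraction accounts for the negative weight of the multiplier's sign bit.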
One last point concerns a problem encountered when simulating multiplications of large positive and negative integers: the sign of the final product was found to become corrupt when multiplying large negative integers. To counter this, a sign guard bit was introduced, which requires extending both the multiplier and the multiplicand from 8 bits to 9 bits. This is done by copying the MSBit of each operand into its sign guard bit. The multiplication is then performed on the extended 9-bit multiplier and multiplicand, giving an 18-bit product. The top two bits of this 18-bit product are sign guard bits and can be truncated. The remaining 16 bits then represent the synaptic product.
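The extension and truncation steps can be checked with the following Python sketch. It uses Python's exact `*` operator as a stand-in for the hardware's iterative 9-bit multiplication, so it illustrates only the bookkeeping around the guard bits. The function names are hypothetical. The assertion inside `guarded_product` confirms that the two truncated bits are copies of bit 15 and carry no information.

```python
def to_signed(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of value as a 2's complement integer."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def sign_extend(value: int, from_bits: int, to_bits: int) -> int:
    """Copy the MSBit of a from_bits pattern into the added high bits."""
    value &= (1 << from_bits) - 1
    if value >> (from_bits - 1):
        value |= ((1 << to_bits) - 1) ^ ((1 << from_bits) - 1)
    return value


def guarded_product(weight: int, input_value: int) -> int:
    """8x8 product via 9-bit sign-guarded operands; returns a 16-bit pattern."""
    a9 = sign_extend(weight, 8, 9)
    b9 = sign_extend(input_value, 8, 9)
    # Exact 9x9 2's complement product, held as an 18-bit pattern.
    p18 = (to_signed(a9, 9) * to_signed(b9, 9)) & ((1 << 18) - 1)
    # The top two bits are sign guard bits: both must equal bit 15.
    assert (p18 >> 16) == (0b11 if p18 & 0x8000 else 0b00)
    return p18 & 0xFFFF  # truncate to the 16-bit synaptic product
```

This works because any product of two 8-bit 2's complement values lies between -16256 and 16384. Bits 15 to 17 of the 18-bit result are therefore always identical.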