A NEW COMPACT TRIANGULAR ARRAY DIVISION
ARCHITECTURE USING NEWTON-RAPHSON ALGORITHM
B. Nagaraj1, S. Sarankumar2, S. Saranya3
1Latha Infotech, Coimbatore, Tamilnadu, India.
2Project Associate, Accenture, Tamilnadu, India.
3Project Associate, Centre for Management, Tamilnadu, India.
ABSTRACT—Division is the most complex arithmetic operation in present digital architectures and high-performance computing systems. Generally, this operation involves repeated subtraction and multiplication. Many algorithms have been proposed for hardware division units that achieve fast computation in processors at the expense of other parameters. Devising a technique that attains both low power consumption and low area is therefore appealing and necessary. Accordingly, this work proposes the implementation of the well-known Newton-Raphson algorithm in the basic non-restoring architecture. The proposed algorithm is applicable to signed-digit (SD) numbers, both integer and floating point, of any radix. The main drawbacks of the existing non-restoring division architecture are its additional area and power consumption and its limited applications. These can be reduced by introducing Newton-Raphson iterations, namely Goldschmidt’s iteration and Markstein’s iteration, into the architecture. These ideas are implemented in VHDL using the Xilinx and ModelSim tools. A power reduction of up to 6% and an area reduction of up to 60% are observed when compared to the existing division algorithm.
INDEX TERMS— Newton-Raphson algorithm, non-restoring division, signed-digit (SD), Goldschmidt iterations, Markstein iterations.
1. Introduction
The most complex and time-consuming of the four basic arithmetic operations is division. Integer division, i.e., division of two integers yielding an integer quotient and an integer remainder, is one of the basic arithmetic operations. In modern microprocessors, integer division takes many clock cycles, even more than double-precision floating-point division. Furthermore, the number of clock cycles for integer division varies depending on the operand values [3]. The arithmetic division operation can be considered as a sequence of subtractions and comparisons. Because these two operations have carry chains in opposite directions, it is not possible to provide the respective carries simultaneously. Hence, it is generally not possible to start a comparison before the corresponding subtraction has been computed. In this scenario, division is regarded as a slow operation, characterized by a delay of O(n^2) [1]. Digit-recurrence (or subtract-and-shift) algorithms, both restoring and non-restoring, are widely used to perform integer division. In processors, integer division is usually implemented in software or firmware using a shifter and an adder (subtracter) based on either of these two algorithms. In general, the non-restoring algorithm is slightly faster than the restoring one.
SRT (Sweeney, Robertson and Tocher) division is the digit-recurrence algorithm used for floating-point division. SRT division uses subtraction as the fundamental operator to retire a fixed number of quotient bits in each iteration. Fast division requires both low latency and a short cycle time. The latency is set by the radix r: higher radices give lower latency. The cycle time is set by the operations occurring in each cycle: quotient digit selection and partial remainder generation [4]. For floating-point arithmetic, division based on Newton-Raphson iterations is a viable alternative to SRT-based division. The Newton-Raphson iterations, which comprise the Goldschmidt and Markstein iterations, are employed here for both integer and floating-point division. As a result, higher-precision numbers can also be considered. The Newton-Raphson iterations are usually performed in a higher precision than the precision of the operands and of the quotient [2].
2. Related Works
The fundamental methods for designing division circuits are presented in [1], [5] and [6]. Such an operation is defined in [3], where the dividend N, the divisor D, the quotient Q and the remainder R satisfy
N = QD + R
The quotient is obtained via a sequence of subtractions and shifts. In the generic step j of this sequence, the intermediate remainder R_j is compared with the divisor D. If R_j ≥ D, the quotient bit q_j is set to 1; otherwise it is set to 0.
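The subtract-and-shift sequence described above can be illustrated with a minimal Python sketch of textbook restoring division. The function name `restoring_divide` and the bit-serial formulation (the remainder is shifted left and the next dividend bit is brought in) are assumptions made for illustration and are not taken from [3]. The comparison is written as R ≥ D so that an exact match yields a quotient bit of 1.

```python
def restoring_divide(N, D, n):
    """Textbook restoring division of an n-bit dividend N by a divisor D > 0.

    Returns (Q, R) with N = Q*D + R and 0 <= R < D.
    """
    if D <= 0:
        raise ValueError("divisor must be positive")
    R = 0  # intermediate remainder
    Q = 0  # quotient, built one bit per step
    for j in range(n - 1, -1, -1):
        # Shift the remainder left and bring in the next dividend bit.
        R = (R << 1) | ((N >> j) & 1)
        if R >= D:
            # Comparison succeeded: subtract and record a quotient bit of 1.
            R -= D
            Q = (Q << 1) | 1
        else:
            # Comparison failed: keep the remainder, quotient bit is 0.
            Q = Q << 1
    return Q, R
```

For example, `restoring_divide(7, 2, 3)` returns `(3, 1)`, since 7 = 3·2 + 1. Each step needs the comparison result before the next subtraction can begin, which is the serial dependency mentioned in the Introduction.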
A) Radix-2 Non-Restoring Algorithm
The hardware algorithm is based on the radix-2 non-restoring integer division algorithm[1].
N: n-bit dividend; D: n-bit divisor
Rem: remainder
R_0 = N (partial remainder); q_0 = 1
for j = 1, n do
    R_j = |R_{j-1}| - 2^{n-j} D
    if R_j = 0 then
        |R_j| = 0; q_j = 0
    else
        if R_j < 0 then
            q_j = 1 - q_{j-1}; |R_j| = -R_j
        else
            q_j = q_{j-1}
        end if
    end if
end for
if R_n = 0 then
    Rem = 0
else
    if q_n = 1 then
        Rem = R_n
    else
        Rem = D - R_n
    end if
end if
In the radix-2 non-restoring division algorithm, each partial remainder is represented as a pair consisting of its sign and its absolute value, and the absolute value is held in the SD2 representation. The calculation of the absolute value of each partial remainder is performed in parallel with the sign detection of that partial remainder and can start before the sign detection of the preceding partial remainder has completed. Since normalization of the divisor is not needed, an area-consuming leading-one (or leading-zero) detector and variable-amount shifts are not required. Each quotient digit is obtained directly from the sign of the corresponding partial remainder [3].
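For comparison, the following is a minimal Python sketch of the classic two's-complement form of radix-2 non-restoring division, in which each quotient bit is taken from the sign of the partial remainder. It is not the SD2 sign/absolute-value formulation of [3]: the signed partial remainder is kept as an ordinary integer, and the name `nonrestoring_divide` is an assumption for illustration.

```python
def nonrestoring_divide(N, D, n):
    """Classic radix-2 non-restoring division (not the SD2 variant).

    N is an n-bit non-negative dividend, D a positive divisor.
    Returns (Q, Rem) with N = Q*D + Rem and 0 <= Rem < D.
    """
    if D <= 0 or not (0 <= N < (1 << n)):
        raise ValueError("need D > 0 and 0 <= N < 2**n")
    R = N  # signed partial remainder
    Q = 0
    for j in range(1, n + 1):
        weighted = D << (n - j)  # 2^(n-j) * D
        # Subtract when the previous remainder is non-negative, add otherwise;
        # the remainder is never restored inside the loop.
        R = R - weighted if R >= 0 else R + weighted
        # The quotient bit comes directly from the sign of the new remainder.
        Q = (Q << 1) | (1 if R >= 0 else 0)
    if R < 0:
        # Single final correction step for a negative remainder.
        R += D
    return Q, R
```

For example, `nonrestoring_divide(7, 2, 3)` returns `(3, 1)`. Unlike the restoring scheme, no step has to undo a subtraction, and the only correction happens once after the loop.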
B) SRT Division Algorithm
SRT dividers are commonly used in modern floating-point units. Previous works address only radix-2 and radix-4 dividers.
Higher-precision and higher-radix dividers can also be implemented using this algorithm. Higher-radix dividers can be obtained by integrating several lower-radix divider stages. SRT division uses subtraction as the fundamental operation to obtain a fixed number of quotient bits in each iteration. The partial remainder is initially set to the dividend. In each step, the divisor is compared with the partial remainder to produce a quotient digit. The quotient digit is then multiplied by the divisor and subtracted from the partial remainder; the result is then shifted by one position to form the new partial remainder. Computers often use radix-2 or radix-4 [7-10]. One step is required for each quotient digit. The latency of division is analyzed by defining the format of the operands and the range of each quotient digit [4]. Since the area is proportional to the number of blocks, area can be reduced without impacting latency by clocking the divider at a higher frequency than the rest of the processor [11-18].
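The digit recurrence described above can be sketched in Python for the radix-2 case. This is a minimal illustration under stated assumptions, not the divider of [4]: it uses the standard recurrence w ← 2w − q·d (shift, then subtract), the textbook selection thresholds ±1/2 with quotient digits in {−1, 0, +1}, a divisor normalized to [1/2, 1), and exact rational arithmetic in place of hardware registers. The names `srt_radix2` and `digits_to_value` are hypothetical.

```python
from fractions import Fraction

def srt_radix2(x, d, steps):
    """Radix-2 SRT recurrence with quotient digits in {-1, 0, +1}.

    Assumes 1/2 <= d < 1 and |x| <= d. Returns (digits, w), where w is the
    final partial remainder.
    """
    x, d = Fraction(x), Fraction(d)
    half = Fraction(1, 2)
    if not (half <= d < 1 and abs(x) <= d):
        raise ValueError("need 1/2 <= d < 1 and |x| <= d")
    w = x  # the partial remainder starts as the dividend
    digits = []
    for _ in range(steps):
        shifted = 2 * w
        # Quotient digit selection by comparison against fixed thresholds.
        if shifted >= half:
            q = 1
        elif shifted < -half:
            q = -1
        else:
            q = 0
        w = shifted - q * d  # partial remainder generation
        digits.append(q)
    return digits, w

def digits_to_value(digits):
    """Value of the signed-digit quotient: sum of q_j * 2^(-j), j = 1..k."""
    return sum(Fraction(q, 2 ** (j + 1)) for j, q in enumerate(digits))
```

After k steps the digits satisfy x = d·Q + w·2^(−k) with |w| ≤ d, so the quotient estimate Q is within 2^(−k) of x/d. Because the selection only compares against constants, a real divider can make it from a few leading bits of the partial remainder.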
C) Newton-Raphson Algorithm
The Newton-Raphson algorithm is applicable to both integer and floating-point numbers. The revised standard for floating-point arithmetic (IEEE 754-2008) standardizes both binary and decimal floating-point arithmetic [19-23] and introduces a correctly rounded fused multiply-add operation. As a consequence, software implementation of binary and decimal division may become common practice in the near future [2]. Division algorithms based on Newton-Raphson iterations, for which correct rounding can be proved, rely mainly on two iterations: the Goldschmidt iteration and the Markstein iteration. Newton-Raphson iterations work in a higher precision and support correct rounding of the result. They also accommodate higher radices for the division operation.
3. The Base Architecture
The original architecture for an (N=4, D=4) non-restoring SD2 divisor (eight columns and four rows), based on the non-restoring division algorithm, consists mainly of a rectangular array of n rows and 2n columns and n^2 cells.
Fig. 1: Original non-restoring divisor circuit [1]
Every elementary cell in Fig. 1 is composed of the following circuits: an hc circuit, a sum circuit, and an absc circuit. The hc circuit accepts two input signals: abss, which comes from the output of the upper level and can take one of the values (-1, 0, +1), and the -d signal, which can be 0 or -1. It produces a vertical output h and a horizontal carry c. This leads to the deduction of the values (-2, -1, 0 and +1). The absc circuit accepts three input signals: x_i, null_{i-1}, and sign_{i-1}. Both null_{i-1} and sign_{i-1} are 1-bit values, while x_i is represented using the SD2 representation. The same notation is used for the three outputs: null_i and sign_i are 1-bit values, while abss_i is represented using the SD2 notation. The x_i and abss_i variables, due to their SD2 representation, are each represented by two bits. The inputs and outputs of the absc circuit are as follows:
• Horizontal input: null_{i-1} (0 if the sign was not previously determined; 1 otherwise) and sign_{i-1} (1 if the sign was previously determined and it was negative; 0 otherwise).
• Vertical input: x_i.
• Horizontal output: null_i (0 if the sign was not previously determined and x_i = 0; 1 otherwise) and sign_i, where sign_i = sign_{i-1} if the sign was previously determined; sign_i = 1 if the sign was not previously determined and x_i = -1; sign_i = 0 otherwise.
• Vertical output: abss_i = x_i if sign_i = 0; abss_i = -x_i otherwise.
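The input/output rules of the absc circuit listed above can be checked with a small behavioural model in Python. This is only a sketch under stated assumptions: the function names `absc_cell`, `absolute_value` and `sd2_value` are hypothetical, digits are processed from the most significant position, and each SD2 digit is modelled as an integer in {-1, 0, +1} rather than as two wires.

```python
def absc_cell(x, null_in, sign_in):
    """Behavioural model of one absc cell for an SD2 digit x in {-1, 0, +1}.

    Returns (null_out, sign_out, abss).
    """
    if null_in:
        # The sign was already determined by a more significant digit.
        sign_out = sign_in
    else:
        # The first non-zero digit decides the sign of the whole number.
        sign_out = 1 if x == -1 else 0
    null_out = 1 if (null_in or x != 0) else 0
    abss = x if sign_out == 0 else -x
    return null_out, sign_out, abss

def absolute_value(digits):
    """Chain absc cells over SD2 digits given most significant digit first."""
    null, sign = 0, 0
    out = []
    for x in digits:
        null, sign, abss = absc_cell(x, null, sign)
        out.append(abss)
    return sign, out

def sd2_value(digits):
    """Integer value of an SD2 digit string, most significant digit first."""
    value = 0
    for x in digits:
        value = 2 * value + x
    return value
```

For example, the digit string `[0, -1, 1, 0]` encodes -2; `absolute_value` returns sign 1 and the digits `[0, 1, -1, 0]`, which encode +2. The sign depends only on the first non-zero digit, which is why the absolute value can be formed digit by digit while the sign information is still propagating.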
The original structure of the non-restoring division algorithm is further OR-optimized. As a result of the optimization, the unnecessary circuits are replaced by direct inputs, so that the circuit computation time is reduced. Furthermore, the whole circuit is considered as two sections, a rectangular part and a triangular part, as in Fig. 2.
Fig. 2: Optimized non-restoring SD2 divisor circuit [1]
The rectangular part is then curtailed taking into consideration two assumptions:
• Remove the signals going from the right circuit to the left one. In other words, design the circuits without the carry line that affected the left circuit.
• Compute the value of all the input signals of the circuits in the right part of the divisor.
The remaining triangular part forms the functional part where further optimization is done in an iterative manner.
The partial remainders are obtained computationally, step by step.
The remainder and quotient can also be obtained from the XOR gates, as shown in the final architecture. The final triangular array division circuit architecture is shown in Fig. 4.
Fig. 4: Final architecture [1]
In this architecture the number of gates is reduced drastically, by about 40%, and as a result the power consumption is reduced. The triangular array architecture achieves fast computation and also saves area.
Problem Definition
The final non-restoring architecture (Fig. 4) discussed above is generally confined to integer division. It focuses mainly on positive integer division with operands of at most 5 bits, even though it saves area and power. The area and the circuit complexity can be reduced further, without affecting the delay of the system, by introducing the Newton-Raphson algorithm into the final non-restoring architecture. This extends the applicability of the basic architecture in hardware and allows its use in advanced processors that require high-speed operation, while saving power and reducing the complexity of the design.
4. The Proposed Architecture
In the proposed architecture, the basic non-restoring structure is combined with the Newton-Raphson algorithm. With the Newton-Raphson algorithm introduced, the architecture can be used for the division of both integer and floating-point numbers. Since division of floating-point numbers is normally time-consuming, merging the efficient architecture with this widely used algorithm is expected to give good results.
The Newton-Raphson algorithm comprises mainly two iterations: Goldschmidt’s iteration and Markstein’s iteration. These iterations provide higher precision and faithful rounding of the quotient compared with the precision of the operands used. Signed-digit numbers of arbitrary radix and precision can be handled efficiently. Here we assume that neither overflow nor underflow occurs. In this paper we consider 32-bit numbers as operands, which can be extended to any radix according to the user’s choice. Let F_{β,p} denote the set of radix-β, precision-p floating-point numbers. For any z ≠ 0 in R, if z ∈ [β^{e_z}, β^{e_z+1}) with e_z ∈ Z, then e_z denotes the exponent of z, and ulp(z) := β^{e_z+1-p} is its unit in the last place. The middle of two consecutive floating-point numbers in F_{β,p} is called a midpoint in precision p: every midpoint m in precision p can be written as m = ±(s_m + 1/2 · β^{1-p}) β^{e_m}, with s_m a significand of precision p in [1, β). Since we do not consider overflow or underflow, computing the quotient a/b is equivalent to computing the quotient of their significands. We therefore assume, without loss of generality, that both a and b lie in the interval [1, β).
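The definitions of the exponent, the unit in the last place and the midpoint can be made concrete with a short Python sketch using exact rational arithmetic. The function names `exponent`, `ulp` and `midpoint` are hypothetical helpers introduced only for illustration; overflow and underflow are ignored, as in the text.

```python
from fractions import Fraction

def exponent(z, beta):
    """Return e such that beta**e <= |z| < beta**(e + 1), for z != 0."""
    z = abs(Fraction(z))
    if z == 0:
        raise ValueError("exponent is undefined for z = 0")
    e = 0
    while z >= beta:
        z /= beta
        e += 1
    while z < 1:
        z *= beta
        e -= 1
    return e

def ulp(z, beta, p):
    """Unit in the last place of z in radix beta, precision p."""
    return Fraction(beta) ** (exponent(z, beta) + 1 - p)

def midpoint(s_m, e_m, beta, p):
    """Positive midpoint (s_m + 1/2 * beta**(1-p)) * beta**e_m."""
    half_ulp = Fraction(1, 2) * Fraction(beta) ** (1 - p)
    return (Fraction(s_m) + half_ulp) * Fraction(beta) ** e_m
```

For instance, in radix 2 with precision 3 the numbers 1 and 1.25 are consecutive, `ulp(1, 2, 3)` is 1/4, and `midpoint(1, 0, 2, 3)` is 9/8, exactly halfway between them. Midpoints matter because a quotient that falls close to one is the hard case for rounding to nearest.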
In this paper, we consider three different precisions: p_i is the precision of the input operands, p_w is the working precision in which the intermediate computations are performed, and p_o is the output precision. Hence, given a, b ∈ F_{β,p_i}, the division algorithm computes RN_{β,p_o}(a/b), the quotient rounded to nearest in radix β and precision p_o. Thus a multiprecision quotient can be obtained. We assume that p_w ≥ p_i and p_w ≥ p_o.
A. Equations
To compute a/b, an initial approximation to 1/b is obtained from a lookup table addressed by the first digits of b. One next refines the approximation to 1/b using iteration (1) below:
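x_{n+1} = x_n + x_n (1 - b x_n)    (1)
Here x_n denotes the n-th approximation to 1/b. Then y_n = a x_n is taken as an initial approximation to a/b that can be improved using
y_{n+1} = y_n + x_n (a - b y_n)    (2)
To compute the reciprocal 1/b using equation (1), the following two iterations are used, where r_n denotes the residual and RN denotes rounding to nearest:
Markstein:
r_{n+1} = RN(1 - b x_n)
x_{n+1} = RN(x_n + r_{n+1} x_n)    (3)
Goldschmidt:
r_1 = RN(1 - b x_0)
r_{n+2} = RN(r_{n+1}^2)
x_{n+1} = RN(x_n + r_{n+1} x_n)    (4)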
The Markstein iteration [7], [8] derives immediately from Equation (1). The Goldschmidt iteration is obtained from the Markstein iteration by substituting r_{n+1} with r_n^2. Even though both iterations are mathematically equivalent, they behave differently when used in floating-point arithmetic, as compared with integer division. In the Goldschmidt iteration, the residual r_{n+2} and the approximation x_{n+1} can be computed concurrently. Hence, this iteration is faster owing to its parallelism. The Goldschmidt iteration is used at the beginning, where accuracy is not an issue, and the Markstein iteration is then used to obtain the correctly rounded result.
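The interplay of the two iterations can be sketched in Python, writing x for the running approximation to 1/b, r for the residual 1 − b·x and y for the approximation to a/b. This is an illustration under stated assumptions rather than the implemented VHDL design: operands are binary64 floats in [1, 2), the 8-entry table `RECIP_TABLE` is a hypothetical stand-in for the lookup table, and each Python operation rounds separately instead of using a fused multiply-add, so correct rounding of the final quotient is not guaranteed.

```python
# Hypothetical 8-entry table: reciprocal of the midpoint of each interval
# [1 + i/8, 1 + (i+1)/8), addressed by the first three fraction bits of b.
RECIP_TABLE = [1.0 / (1.0 + (i + 0.5) / 8.0) for i in range(8)]

def initial_reciprocal(b):
    """Initial approximation x_0 to 1/b for b in [1, 2)."""
    if not (1.0 <= b < 2.0):
        raise ValueError("b must lie in [1, 2)")
    return RECIP_TABLE[int((b - 1.0) * 8)]

def goldschmidt_steps(b, x, steps):
    """Goldschmidt iteration: b is used only once, to form r_1."""
    r = 1.0 - b * x          # r_1 = 1 - b * x_0
    for _ in range(steps):
        x = x + r * x        # x_{n+1} = x_n + r_{n+1} * x_n
        r = r * r            # r_{n+2} = r_{n+1}^2, independent of the new x
    return x

def markstein_step(b, x):
    """Markstein iteration: the residual is recomputed from b every time."""
    r = 1.0 - b * x          # r_{n+1} = 1 - b * x_n
    return x + r * x         # x_{n+1} = x_n + r_{n+1} * x_n

def divide(a, b, goldschmidt=3):
    """Approximate a/b: fast Goldschmidt steps, then a Markstein clean-up."""
    x = goldschmidt_steps(b, initial_reciprocal(b), goldschmidt)
    x = markstein_step(b, x)
    y = a * x                      # initial approximation to a/b
    return y + x * (a - b * y)     # one refinement of the quotient
```

Inside `goldschmidt_steps` the two assignments in the loop body do not depend on each other, which is the parallelism noted above. The price is that rounding errors in r are never corrected against b, so a Markstein step, which recomputes the residual from b, is applied before the quotient is formed.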
B. Simulation result
The simulation is done in ModelSim, and the Xilinx tool is used for power estimation and area analysis. A Spartan-3 FPGA is used for the implementation. The power is reduced from 0.099 to 0.093. Moreover, the number of 4-input LUTs is reduced from 2949 to 417. Thus, the proposed design is more efficient in power and area than the non-restoring algorithm.
Fig. 5: Division using Newton-Raphson algorithm
TABLE I Comparison of results
Algorithm                   Power    No. of 4-input LUTs    No. of slices
Non-restoring algorithm     0.099    2949                   1896
Newton-Raphson algorithm    0.093    417                    234
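The relative savings implied by Table I can be recomputed directly from its entries. The short Python sketch below does only that arithmetic; the helper name `reduction_percent` is hypothetical, and no unit is assumed for the power column.

```python
def reduction_percent(before, after):
    """Relative reduction from `before` to `after`, in percent."""
    return 100.0 * (before - after) / before

# Values taken from Table I.
power = reduction_percent(0.099, 0.093)
luts = reduction_percent(2949, 417)
slices = reduction_percent(1896, 234)

print(f"power: {power:.1f}%  4-input LUTs: {luts:.1f}%  slices: {slices:.1f}%")
```

From the table values, the power reduction is about 6.1%, while the LUT and slice counts fall by about 85.9% and 87.7% respectively. The 60% area figure quoted in the text is stated in terms of gate count, which is a different measure from the LUT and slice counts in the table.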
5. Conclusions
The implementation of the triangular array division circuit using the Newton-Raphson algorithm is presented as an efficient method for division operations. The area, in terms of gate count, is reduced by a further 60% without affecting the delay of the circuit. Moreover, the power consumption is reduced by 6% when compared with the basic non-restoring algorithm.
References
[1] Marco Domenico Santambrogio, Renato Stefanelli, “A New Compact SD2 Positive Integer Triangular Array Division Circuit,” IEEE Transactions on VLSI Systems, Vol. 19, Jan 2011.
[2] Nicolas Louvet, Jean-Michel Muller, Adrien Panhaleux, “Newton-Raphson Algorithms for Floating-Point Division Using an FMA,” IEEE conference, pages 200-209, Dec 2010.
[3] N. Takagi, S. Kadowaki, and K. Takagi, “A hardware algorithm for integer division using the SD2 representation,” IEICE Trans. Fundam. Electron. Commun. Comput. Sci., vol. E89-A, no. 10, pp. 2874–2881, 2006.
[4] D. L. Harris, S. F. Oberman, and M. A. Horowitz, “SRT division architectures and implementations,” in Proc. 13th IEEE Symp. Comput. Arithmetic, 1997, pp. 18–25.
[5] M. D. Ercegovac and T. Lang, Digital Arithmetic. San Mateo, CA: Morgan Kaufmann, 2004.
[6] S. Kadowaki, “A hardware algorithm for integer division,” in Proc. 17th IEEE Symp. Comput. Arithmetic, Washington, DC, 2005, pp. 140–146.
[7] Peter Markstein. Computation of elementary functions on the IBM RISC System/6000 processor. IBM J. Res. Dev., 34(1):111–119.
[8] Peter Markstein ,IA-64 and elementary functions: speed and precision. Hewlett-Packard professional books. 2000
[9] Lindstrom, Mary J., and Douglas M. Bates. "Newton—Raphson and EM algorithms for linear mixed-effects models for repeated-measures data." Journal of the American Statistical Association 83.404 (1988): 1014-1022.
[10] Lindstrom, M. J., & Bates, D. M. (1988). Newton—Raphson and EM algorithms for linear mixed-effects models for repeated-measures data. Journal of the American Statistical Association, 83(404), 1014-1022.
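[11] Lindstrom, Mary J., and Douglas M. Bates. "Newton—Raphson and EM algorithms for linear mixed-effects models for repeated-measures data." Journal of the American Statistical Association 83, no. 404 (1988): 1014-1022.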
[12] Lindstrom, M.J. and Bates, D.M., 1988. Newton—Raphson and EM algorithms for linear mixed-effects models for repeated-measures data. Journal of the American Statistical Association, 83(404), pp.1014-1022.
[13] Lindstrom MJ, Bates DM. Newton—Raphson and EM algorithms for linear mixed-effects models for repeated-measures data. Journal of the American Statistical Association. 1988 Dec 1;83(404):1014-22.
[14] Shakeel PM, Baskar S, Dhulipala VS, Mishra S, Jaber MM., “Maintaining security and privacy in health care system using learning based Deep-Q-Networks”, Journal of medical systems, 2018 Oct 1;42(10):186. https://doi.org/10.1007/s10916-018-1045-z
[15]Sridhar KP, Baskar S, Shakeel PM, Dhulipala VS., “Developing brain abnormality recognize system using multi-objective pattern producing neural network”, Journal of Ambient Intelligence and Humanized Computing, 2018:1-9. https://doi.org/10.1007/s12652-018-1058-y
[16]Shakeel PM, Baskar S, Dhulipala VS, Jaber MM., “Cloud based framework for diagnosis of diabetes mellitus using K-means clustering”, Health information science and systems, 2018 Dec 1;6(1):16.https://doi.org/10.1007/s13755-018-0054-0
[17] MuhammedShafi. P, Selvakumar. S, Mohamed Shakeel. P, “An Efficient Optimal Fuzzy C Means (OFCM) Algorithm with Particle Swarm Optimization (PSO) To Analyze and Predict Crime Data”, Journal of Advanced Research in Dynamic and Control Systems, Issue: 06, 2018, Pages: 699-707
[18] Sampath, R., and A. Saradha. "Alzheimer's Disease Image Segmentation with Self-Organizing Map Network." JSW 10.6 (2015): 670-680.
[19] Sampath, R., and Dr A. Saradha. "Classification of Alzheimer Disease Stages Exploiting an ANFIS Classifier." International Journal of Applied Engineering Research.[Electronic] 9.22 (2014): 16979-16990.
[20] Sampath, R., and J. Indumathi. "Earlier detection of Alzheimer disease using N-fold cross validation approach." Journal of medical systems 42.11 (2018): 217.
[21] Sampath, R., et al. "Study of Connectivity Properties and Network Topology for Neuroimaging Classification by Using Adaptive Nero-Fuzzy Inference System." (2006).
[22] Sampath, R., and Dr A. Saradha. “A Hybrid approach for Alzheimer’s disease Classification using 2D Gabor Wavelet transform and Extreme Machine Learning Classifier” Journal of Pure and Applied Microbiology, Vol. 9. 5. 2015
[23] Sampath, R., and Dr A. Saradha ““Alzheimer’s