A New Algorithm for Carry-Free Addition of Binary Signed-Digit Numbers
Klaus Schneider and Adrian Willenbücher
Embedded Systems Group, University of Kaiserslautern, Kaiserslautern, Germany
{schneider, willenbuecher}@cs.uni-kl.de

Abstract—Signed-digit (SD) numbers generalize traditional radix numbers by allowing negative digits within a certain range. Typically, this leads to redundant number representations that can be used to avoid the carry propagation problem of addition of radix numbers. Unfortunately, as proved by Avizienis, the standard algorithm for carry-free addition of SD numbers does not work for the binary case. In this paper, we therefore construct a special algorithm for the carry-free addition and subtraction of binary SD numbers, i.e., addition and subtraction of $n$-digit numbers are performed with circuits of depth $O(1)$ and size $O(n)$. This is possible by computing, in addition to the transfer digits used by the standard algorithm, one additional bit that allows us to distinguish relevant cases to avoid propagation of dependencies. The additional bit and the transfer digit used to compute the sum digit at position $i$ depend only on the summands’ digits at positions $i$ and $i-1$, so that all sum digits can be computed with a hardware circuit of a depth that is independent of the number of digits. We first explain the basics of the standard addition algorithm to derive the additional information needed to fix the algorithm for the binary case. After proving the correctness of our algorithm, we present experimental results that show that our implementation clearly outperforms two’s complement addition even for small numbers, and saves 50% of the required chip area compared to other carry-free implementations.
I. INTRODUCTION
Although there are many other number systems, simple radix numbers to a base $B > 0$ are still popular in computer arithmetic. An $n$-digit radix-$B$ number is thereby given as a sequence of digits $[x_{n-1}, \ldots, x_0]$ with $x_i \in \{0, \ldots, B-1\}$ that denotes the following natural number:

$$[x_{n-1}, \ldots, x_0]_B := \sum_{i=0}^{n-1} x_i \cdot B^i$$
It is well known that the addition of radix-$B$ numbers suffers inherently from carry propagation: in the worst case, a carry is generated when adding the least significant digits $x_0$ and $y_0$, and is then propagated from the rightmost digits $x_0, y_0$ to the leftmost digits $x_{n-1}, y_{n-1}$. As a consequence, simple carry-ripple adders have depth$^1$ $O(n)$. Even though this can be reduced to a depth of $O(\log(n))$, e.g., by carry-lookahead adders [1], the depth still grows with the number of digits.

$^1$ The depth of a circuit is the length of the longest path from inputs to outputs.
Circuits with a depth depending on the number of digits $n$ limit the clock speed of synchronous circuits in terms of $n$. For radix-$B$ numbers, it is not difficult to see that addition, subtraction, multiplication, division and comparison operations of $n$-digit numbers all require a depth of at least $O(\log(n))$ since the digits of the results depend on all digits of the operands. For all basic operations, optimal $O(\log(n))$ algorithms are known, even though these sometimes require substantial mathematical effort [2]–[4].
Since this minimal $O(\log(n))$ depth cannot be improved for radix-$B$ numbers, one has to consider non-conventional number systems for improvements. For example, residue number systems (RNS) [5], [6] encode a number $x$ by its residues $(x_1, \ldots, x_n) := ((x \bmod p_1), \ldots, (x \bmod p_n))$ that are unique for numbers $x \in \{0, \ldots, (\prod_{i=1}^{n} p_i) - 1\}$ for pairwise relatively prime numbers $p_i$. Addition, subtraction, and multiplication can be done in parallel on the residues, and thus, with a depth $O(1)$. Division can only be done by iterative methods like Newton-Raphson or Goldschmidt iteration which lead again to a depth of $O(\log(n))$. The main problems for RNS numbers are however that comparison ($<$) is not possible and that conversions to and from radix numbers are relatively expensive.
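The component-wise nature of RNS arithmetic can be illustrated with a small sketch. The moduli (3, 5, 7) and all function names below are assumptions chosen for illustration only; the decoder uses exhaustive search, which is adequate for such a tiny range but is not how a practical conversion would be done.

```python
from math import prod

MODULI = (3, 5, 7)  # illustrative, pairwise relatively prime; range is 0..104


def to_rns(x):
    """Encode x by its residues modulo each p_i."""
    return tuple(x % p for p in MODULI)


def rns_add(a, b):
    """Add residue-wise; every component is independent of the others."""
    return tuple((ai + bi) % p for ai, bi, p in zip(a, b, MODULI))


def rns_mul(a, b):
    """Multiply residue-wise; again no information crosses components."""
    return tuple((ai * bi) % p for ai, bi, p in zip(a, b, MODULI))


def from_rns(r):
    """Decode by exhaustive search over the representable range."""
    for x in range(prod(MODULI)):
        if to_rns(x) == r:
            return x
    raise ValueError("not a valid residue tuple")
```

Results are only meaningful as long as they stay below the product of the moduli (105 in this sketch).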
An alternative to RNS numbers are signed-digit (SD) numbers [3], [7]–[12] that allow negative digits of a range $\{-D, \ldots, +D\}$ with $D < B$ for radix-$B$ numbers. Due to the redundant number representation, addition and subtraction can be implemented with a depth of $O(1)$, i.e., independent of the number of digits, while multiplication, division, and comparison can still be implemented with a depth of $O(\log(n))$. The key to carry-free addition is thereby to switch to another representation of the sum in case carries would have to be generated (see Section II-A).
However, the standard algorithm for addition and subtraction of SD numbers [7] does not work for the important base $B = 2$, as we will also explain in Section II-A. For this reason, Parhami [8] and others [13] suggested recoding the given input numbers so that the later addition and subtraction of binary numbers become carry-free.
In this paper, we prove that the standard algorithm of Avizienis can be refined to correctly handle binary SD numbers. Avizienis’ algorithm computes for two digits $x_i$ and $y_i$ a transfer digit $t_{i+1} \in \{-1, 0, +1\}$ and an interim sum $w_i$ such that the sum digit $s_i$ can be computed as $s_i = t_i + w_i$. Our algorithm computes an additional condition $\ell_i$ that stores some important information to define the transfer and sum digits. Our transfer digits $t_i$ depend on the operand digits $x_{i-1}, y_{i-1}, x_{i-2}, y_{i-2}$ and the additional condition $\ell_i$ depends on $x_{i-1}, y_{i-1}$ only, so that our algorithm still has depth $O(1)$. We implemented our algorithm on FPGAs and compared its speed and area requirements with previous approaches to SD addition and also with a carry-lookahead adder. It turned out that our algorithm is faster than a hybrid carry-lookahead/carry-ripple adder for more than 24 bits on our hardware platform, and requires just about 50% of the chip area of other SD addition circuits.
Our paper is organized as follows: In Section II, we discuss Avizienis’ algorithm for adding SD numbers. In Section III, we first analyze why that algorithm does not work for the case of binary numbers, and then develop a solution for this problem in Section III-B. To demonstrate the efficiency of our algorithm, we present experimental results in Section IV.
II. PREVIOUS WORK
In this section, we review known results about signed-digit numbers. To this end, we provide new proofs that allow us to discuss in the next section where the difficulties in defining a carry-free addition for binary SD numbers come from.

A. Signed-Digit Numbers
Avizienis introduced in [7] the following SD numbers to a radix $B > 1$ and a digit set $\{-D, \ldots, +D\}$:
Definition 1: Given some number $D$ and a radix $B > 1$, a sequence $[x_{n-1}, \ldots, x_0]$ of digits $x_i \in \{-D, \ldots, +D\}$ encodes the following integer:

$$[x_{n-1}, \ldots, x_0]_{D,B} := \sum_{i=0}^{n-1} x_i \cdot B^i$$
There may be several SD representations of the same number. For example, for $B = 3$ and $D = 2$, the value 5 can be encoded as $[2, -1]$, $[1, 2]$ or $[1, -1, -1]$. To understand the different redundant representations of a number, we list the following well-known theorem without proof:
Theorem 1 (Uniqueness of Division with Remainder): For all integers $x, y \in \mathbb{Z}$ with $y \neq 0$, there are uniquely defined numbers $q, r \in \mathbb{Z}$ with $x = q \cdot y + r$ and $0 \le r < |y|$. We therefore write $q := (x \text{ div } y)$ and $r := (x \bmod y)$.

By the above theorem, we conclude the following result:
Lemma 1 (SD Number Representations): $x = [x_{n-1}, \ldots, x_0]_{D,B} = [x'_{n-1}, \ldots, x'_0]_{D,B}$ implies $x_0 = x'_0 + k \cdot B$ for some $k \in \mathbb{Z}$.

Proof: Using $y_1 := [x_{n-1}, \ldots, x_1]_{D,B}$ and $y'_1 := [x'_{n-1}, \ldots, x'_1]_{D,B}$, we obviously have $x = y_1 \cdot B + x_0 = y'_1 \cdot B + x'_0$, and therefore $x_0 - x'_0 = (y'_1 - y_1) \cdot B$ holds. Hence, $x_0 - x'_0$ is a multiple of $B$, so that the proposition holds with $k := y'_1 - y_1$.
Due to the redundant representations of a number, it is not possible to reduce equality testing to checking the equality of the corresponding digits. However, due to the (constant depth) reduction x = y ⇔ x−y = 0, checking equality can be reduced to checking whether the result is zero. This is possible with depth O(log(n)) if zero has a unique representation (i.e., all digits being zero). To be able to check equality of SD numbers, Avizienis therefore imposed that D < B must hold because of the following result:
Theorem 2 (Unique Representation of Zero): The number 0 has a unique representation as SD number $[x_{n-1}, \ldots, x_0]_{D,B}$ if and only if $D < B$ holds.

Proof: For any $n$, we have $[x_{n-1}, \ldots, x_0]_{D,B} = 0$ for $x_i = 0$. For any other representation $[x'_{n-1}, \ldots, x'_0]_{D,B} = 0$ with $x'_0 \neq x_0$, we would have $x'_0 - x_0 = x'_0 = k \cdot B$ with $k \neq 0$ by the previous lemma. However, this is impossible iff $x'_0 \in \{-D, \ldots, +D\} \subseteq \{-(B-1), \ldots, B-1\}$ holds. Hence, we see that $x_0 = 0$ is uniquely determined for $x = 0$ if and only if $D < B$ holds. Then, we have $[x_{n-1}, \ldots, x_1]_{D,B} = 0$, and the same argument applies to the next digit $x_1$, and so on.
For example, we would have $[1, -B]_{D,B} = [-1, B]_{D,B} = 0$ if we allowed $D = B$. Hence, we always assume $D < B$ in the following to ensure the unique representation of 0.
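The redundancy of Definition 1 and the uniqueness of zero stated in Theorem 2 can be checked by brute force for small parameters. The following is only a sketch for experimentation; the function names are assumptions, and the enumeration is exponential in the number of digits.

```python
from itertools import product


def sd_value(digits, B):
    """Value of [x_{n-1}, ..., x_0]_{D,B}; most significant digit first."""
    value = 0
    for d in digits:
        value = value * B + d
    return value


def sd_encodings(x, n, D, B):
    """All n-digit encodings of x with digits in {-D, ..., +D}."""
    return [list(ds) for ds in product(range(-D, D + 1), repeat=n)
            if sd_value(ds, B) == x]


two_digit_fives = sd_encodings(5, 2, D=2, B=3)    # encodings of 5 for B=3, D=2
zeros_d_lt_b = sd_encodings(0, 3, D=2, B=3)       # D < B
zeros_d_eq_b = sd_encodings(0, 2, D=3, B=3)       # D = B
```

With $D < B$ the list of zero encodings contains only the all-zero sequence, whereas $D = B$ also admits $[1, -B]$ and $[-1, B]$.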
This uniqueness result can be generalized to other least significant digits $x_0$: Assume first that $B \le 2 \cdot D$ (and thus $B - D \le D$) holds, so that we can partition the legal digits $\{-D, \ldots, +D\}$ into the following intervals:

$$\underbrace{-D, \ldots, D-B}_{\mathcal{D}_{-1}} \qquad \underbrace{D-B+1, \ldots, B-D-1}_{\mathcal{D}_{0}} \qquad \underbrace{B-D, \ldots, D}_{\mathcal{D}_{+1}}$$

By Lemma 1, the digits $\mathcal{D}_{-1} := \{-D, \ldots, D-B\}$ and $\mathcal{D}_{+1} := \{B-D, \ldots, D\}$ can be mapped to each other by either adding or subtracting $B$, while for the digits $\mathcal{D}_0 := \{D-B+1, \ldots, B-D-1\}$ no legal digits are obtained this way. Thus, digits in $\mathcal{D}_0$ are uniquely determined, while digits in either $\mathcal{D}_{-1}$ or $\mathcal{D}_{+1}$ have exactly one alternative. Choosing the alternative, we have to either increment or decrement the next digit $x_{i+1}$, and then the same discussion can be repeated for $x_{i+1}$.
However, if $B > 2 \cdot D$ holds, then there are no alternatives left for the digits (since $x_0 - B \le D - B < D - 2 \cdot D = -D$). Hence, to ensure redundancy, we have to impose as second constraint $B \le 2 \cdot D$ (in addition to $D < B$) to obtain the following result:
Lemma 2 (Redundancy of SD Representations): For any SD number $x = [x_{n-1}, \ldots, x_0]_{D,B}$ with $D < B \le 2 \cdot D$, the following holds:
• If $(x \bmod B) \in \{0, \ldots, B-D-1\}$, then $x_0$ is uniquely defined as $x_0 := (x \bmod B)$.
• If $(x \bmod B) \in \{B-D, \ldots, D\}$, then either $x_0 = (x \bmod B)$ or $x_0 = (x \bmod B) - B$ holds, thus there are exactly two solutions for $x_0$.
• If $(x \bmod B) \in \{D+1, \ldots, B-1\}$, then $x_0$ is uniquely defined as $x_0 := (x \bmod B) - B$.
Table I
POSSIBLE DECOMPOSITIONS $u_i = x_i + y_i = t_{i+1} \cdot B + w_i$ WITH $x_i, y_i, w_i \in \{-D, \ldots, +D\}$ ASSUMING $D < B \le 2 \cdot D$.

| range of $u_i$ | possible decomposition $u_i = t_{i+1} \cdot B + w_i$ with $w_i \in \{-D, \ldots, +D\}$ |
|---|---|
| $u_i \in \{-2D, \ldots, -D-1\}$ | $(t_{i+1}, w_i) = (-1, B+u_i)$ with $w_i \in \{B-2D, \ldots, B-D-1\} \subseteq \{-D+1, \ldots, D-1\}$ |
| $u_i \in \{-D, \ldots, -B+D\}$ | $(t_{i+1}, w_i) = (-1, B+u_i)$ with $w_i \in \{B-D, \ldots, D\} \subseteq \{-D+1, \ldots, D\}$, or $(t_{i+1}, w_i) = (0, u_i)$ with $w_i \in \{-D, \ldots, -B+D\} \subseteq \{-D, \ldots, D-1\}$ |
| $u_i \in \{-(B-D-1), \ldots, B-D-1\}$ | $(t_{i+1}, w_i) = (0, u_i)$ with $w_i \in \{-(B-D-1), \ldots, B-D-1\} \subseteq \{-D+1, \ldots, D-1\}$ |
| $u_i \in \{B-D, \ldots, D\}$ | $(t_{i+1}, w_i) = (0, u_i)$ with $w_i \in \{B-D, \ldots, D\} \subseteq \{-D+1, \ldots, D\}$, or $(t_{i+1}, w_i) = (+1, -B+u_i)$ with $w_i \in \{-D, \ldots, -B+D\} \subseteq \{-D, \ldots, D-1\}$ |
| $u_i \in \{D+1, \ldots, 2D\}$ | $(t_{i+1}, w_i) = (+1, -B+u_i)$ with $w_i \in \{D-B+1, \ldots, 2D-B\} \subseteq \{-D+1, \ldots, D-1\}$ |
The constraint D < B is added to ensure the unique representation of zero (to ensure that we can check equality of SD numbers) while the second constraint B ≤ 2·D is added to ensure a minimal redundancy that can be exploited for a carry-free addition as explained below. Note that Avizienis imposed a stronger second constraint B <2·D that then excludes the case B = 2. We will see in the following discussion why he did so and why we will not be that strict.
The above lemma is the key to constructing a carry-free addition algorithm: If two SD numbers $[x_{n-1}, \ldots, x_0]_{D,B}$ and $[y_{n-1}, \ldots, y_0]_{D,B}$ have to be added, we may first consider the expression $[u_{n-1}, \ldots, u_0]_{D,B}$ with $u_i := x_i + y_i$. Since each $x_i$ and each $y_i$ are legal digits, we have $-2 \cdot D \le u_i \le 2 \cdot D$.
According to Avizienis, each $u_i$ is decomposed into an outgoing transfer digit $t_{i+1}$ and an interim sum digit $w_i$ so that $x_i + y_i = u_i = t_{i+1} \cdot B + w_i$ holds. Due to $-2 \cdot B < -2 \cdot D \le x_i + y_i \le 2 \cdot D < 2 \cdot B$, it follows that $t_{i+1} \in \{-1, 0, +1\}$ holds for all such decompositions. Note that a particular choice $t_{i+1} \in \{-1, 0, +1\}$ determines the range of $u_i = t_{i+1} \cdot B + w_i$, so that we can easily prove the following lemma (note that $D < B \le 2 \cdot D$ implies $-B-D < -2D < -D \le -B+D < 0 < B-D \le D < 2D < B+D$):
Lemma 3: For given digits $x_i, y_i \in \{-D, \ldots, +D\}$ with $D < B \le 2 \cdot D$, the number $u_i = x_i + y_i$ can be decomposed as $u_i = t_{i+1} \cdot B + w_i$ with $w_i \in \{-D, \ldots, +D\}$ and $t_{i+1} \in \{-1, 0, +1\}$ as shown in Table I.
The proof is easily obtained by checking the cases mentioned in Table I.
The final step of the computation now consists of computing the sum digits $s_i := w_i + t_i$ by means of the transfer and interim sum digits. We have to make sure that these additions will not produce a carry. For this reason, Avizienis demanded that $w_i \in \{-D+1, \ldots, +D-1\}$ must hold, which is also possible according to the following lemma:
Lemma 4: For given digits $x_i, y_i \in \{-D, \ldots, +D\}$ with $D < B < 2 \cdot D$, the number $u_i = x_i + y_i$ can be decomposed as $u_i = t_{i+1} \cdot B + w_i$ with $w_i \in \{-D+1, \ldots, +D-1\}$ as shown in Table I.

Proof: The proof is easily obtained by checking all the cases mentioned in Table I. Note that the cases with $u_i \in \{-D, \ldots, -B+D\}$ and $u_i \in \{B-D, \ldots, D\}$ allow two different decompositions, and for each decomposition there is one $u_i$ that produces an interim sum $w_i \notin \{-D+1, \ldots, +D-1\}$. In that case, however, we use the other possible decomposition and can therefore ensure $w_i \in \{-D+1, \ldots, +D-1\}$.
Since it is possible to find a decomposition with $w_i \in \{-D+1, \ldots, +D-1\}$, it is now possible to compute the final sum digits $s_i := w_i + t_i$ without producing a carry! However, the reader might have noted that we had to strengthen the constraint $D < B \le 2 \cdot D$ used before to $D < B < 2 \cdot D$ to make this possible.
Based on the above lemma, the carry-free addition due to Avizienis is now as follows:
Theorem 3 (Carry-Free Addition by Avizienis): The addition of SD numbers $x = [x_{n-1}, \ldots, x_0]_{D,B}$ and $y = [y_{n-1}, \ldots, y_0]_{D,B}$ with $D < B < 2 \cdot D$ can be computed in depth $O(1)$ with $O(n)$ work (gates) as follows:
1) for $i \in \{0, \ldots, n-1\}$, compute $u_i := x_i + y_i$
2) for $i \in \{0, \ldots, n-1\}$, compute
$$t_{i+1} := \begin{cases} +1 & \text{if } u_i \ge +D \\ -1 & \text{if } u_i \le -D \\ 0 & \text{if } -D < u_i < +D \end{cases}$$
3) for $i \in \{0, \ldots, n-1\}$, compute digits $s_i := t_i + \underbrace{u_i - t_{i+1} \cdot B}_{=: w_i}$ with $t_0 := 0$
The final sum is then the SD number $s = [t_n, s_{n-1}, \ldots, s_0]_{D,B}$.
Each of the above steps can be performed in parallel, so that the sum can be computed in three steps. Moreover, in the cases $u_i \in \{-D, \ldots, -B+D\}$ and $u_i \in \{B-D, \ldots, D\}$ the algorithm prefers the decomposition with $t_{i+1} = 0$ except for the cases $u_i = \pm D$, where the other possible decomposition is used. This way, we always have $w_i \in \{-D+1, \ldots, +D-1\}$ and therefore, the final addition $s_i := t_i + w_i$ produces a legal digit.
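The three steps of Theorem 3 can be sketched in software as follows. This is an illustrative model, not a hardware description: digits are plain integers, numbers are lists with the most significant digit first, and the function names are assumptions. The loops stand in for operations that a circuit would perform for all positions at once.

```python
def sd_value(digits, B):
    """Value of an SD number given most significant digit first."""
    value = 0
    for d in digits:
        value = value * B + d
    return value


def avizienis_add(x, y, D, B):
    """Add two n-digit SD numbers following the three steps of Theorem 3."""
    assert D < B < 2 * D, "the standard algorithm requires D < B < 2*D"
    n = len(x)
    xs, ys = x[::-1], y[::-1]                  # index i has weight B**i
    # Step 1: digit-wise sums u_i.
    u = [xs[i] + ys[i] for i in range(n)]
    # Step 2: transfer digits; t[i+1] depends on u[i] only, and t[0] = 0.
    t = [0] * (n + 1)
    for i in range(n):
        t[i + 1] = 1 if u[i] >= D else (-1 if u[i] <= -D else 0)
    # Step 3: s_i = t_i + w_i with interim sum w_i = u_i - t_{i+1} * B.
    s = [t[i] + u[i] - t[i + 1] * B for i in range(n)]
    assert all(-D <= d <= D for d in s)        # every sum digit is legal
    return [t[n]] + s[::-1]
```

For example, `avizienis_add([2, -1], [1, 2], D=2, B=3)` adds two encodings of 5 and returns a three-digit encoding of 10. Calling the function with `D=1, B=2` trips the first assertion, which reflects the restriction $B < 2 \cdot D$.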
Other operations can be implemented as follows:
• Subtraction of $x$ and $y$ can be simply performed by addition of $x$ and $-y = [-y_{n-1}, \ldots, -y_0]_{D,B}$, which can also be done with depth $O(1)$ and work $O(n)$.$^2$
• Checking equality of $x$ and $y$ is reduced to checking whether $x - y = 0$ holds. The subtraction can be done with depth $O(1)$ and work $O(n)$, but checking that all obtained digits are zero requires depth $O(\log n)$ and work $O(n)$.
• Comparing $x < y$ is reduced to testing for $x - y < 0$. The subtraction can be done with depth $O(1)$ and work $O(n)$, but checking the sign may require depth $O(\log(n))$ since some of the leading digits can be zero (the sign of the first non-zero digit determines the sign).
• Multiplication can be obtained by adding the partial products $x \cdot y_i \cdot B^i$, which can be arranged with a depth of $O(\log(n))$ and work $O(n^2)$ [14], [15].
• Division can be implemented by multiplication with the integer reciprocal, requiring depth $O(\log(n))$ and work $O(n^2)$ [2].
Hence, SD numbers are an interesting number representation that leads to efficient arithmetic algorithms.
B. Binary SD Numbers
Avizienis already noted that his algorithm does not work for binary SD numbers for the reasons we explained in the previous section. Using the weaker constraints $D < B \le 2 \cdot D$, we can reconsider Table I, which reduces for $B = 2$ and $D = 1$ to the following decompositions:

| $u_i$ | $(t_{i+1}, w_i)$ |
|---|---|
| $-2$ | $(-1, 0)$ |
| $-1$ | $(0, -1)$ or $(-1, +1)$ |
| $0$ | $(0, 0)$ |
| $+1$ | $(0, +1)$ or $(+1, -1)$ |
| $+2$ | $(+1, 0)$ |

As can be seen, there is no decomposition that always allows us to achieve that $w_i \in \{-D+1, \ldots, +D-1\} = \{0\}$ holds. For this reason, it was widely accepted that there is no carry-free addition for general binary SD numbers.
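The decompositions of Table I can be enumerated directly from the defining equation $u_i = t_{i+1} \cdot B + w_i$. The sketch below (function names are assumptions) lists all legal pairs and then filters for the restricted interim sums required by the standard algorithm.

```python
def decompositions(u, D, B):
    """All (t, w) with u = t*B + w, t in {-1, 0, +1} and w in {-D, ..., +D}."""
    return [(t, u - t * B) for t in (-1, 0, 1) if -D <= u - t * B <= D]


def restricted(u, D, B):
    """Decompositions whose interim sum w lies in {-D+1, ..., D-1}."""
    return [(t, w) for (t, w) in decompositions(u, D, B) if abs(w) <= D - 1]


binary_table = {u: decompositions(u, D=1, B=2) for u in range(-2, 3)}
binary_gaps = [u for u in range(-2, 3) if not restricted(u, D=1, B=2)]
```

For $B = 2$, $D = 1$ the restricted list is empty exactly for $u_i = \pm 1$, whereas for a parameter choice such as $B = 3$, $D = 2$ every $u_i$ has at least one restricted decomposition.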
One possible solution is to consider a radix $B = 2^k$ and to represent digits $x_i$ then as two’s complement numbers with $k+1$ bits. The disadvantage is that the depth is increased to $O(\log(k))$ (due to addition of two’s complement numbers with $k$ bits), as considered in [11]. Since small numbers $k$ can be chosen, this may still be a practical solution. Many papers also consider variants of these SD number representations, e.g., using asymmetric digit sets [12].

$^2$ The work of a parallel algorithm is the number of executed operations, i.e., the number of gates of the corresponding circuit.
As another solution, Parhami [8] suggested recoding a given binary SD number $x$ of length $n$ to an equivalent SD number $x'$ of length $n+1$ such that there are no two neighboring digits $x'_{i+1}$ and $x'_i$ with $x'_{i+1} \cdot x'_i = 1$. Unfortunately, the output of his addition algorithm does not satisfy this condition, so that it has to be recoded again before another addition takes place. This not only increases the required chip area, but also adds further latency to each addition. Other works on recoding SD numbers are discussed in [13].
We therefore considered whether it is possible to construct a direct algorithm for the addition of binary SD numbers despite the problems with the decomposition mentioned in the previous section. As we report in the next section, it turns out that there is indeed such an algorithm, and it can be efficiently implemented in hardware.
III. OUR ALGORITHM FOR CARRY-FREE ADDITION OF BINARY SD NUMBERS
A. Analyzing the Problem
Below, we first analyze the problem for base $B = 2$ and then construct a carry-free binary SD addition algorithm. We have to add two digits $x_i$ and $y_i$ of given numbers plus the transfer digit $t_i$ that comes from the neighboring digits to the right. All of $x_i$, $y_i$, and $t_i$ belong to the digit set $\{-1, 0, +1\}$, and we have to define transfer digits $t_{i+1}$, an interim sum $w_i$, and the final sum digit $s_i$ such that the following constraints hold:
1) $x_i + y_i = 2 \cdot t_{i+1} + w_i$
2) $s_i = t_i + w_i$
3) $t_{i+1}$, $w_i$ and $s_i$ are digits from $\{-1, 0, +1\}$
4) $t_{i+1}$ is defined independently of $t_i$ (to avoid a propagation chain)
To this end, consider Table II: The first three columns list the possible inputs for $x_i$, $y_i$ and $t_i$. The next two columns are values for $t_{i+1}$ and $w_i$ that were computed by the algorithm of the previous section, i.e.,
$$t_{i+1} := \begin{cases} +1 & \text{if } x_i + y_i \ge +1 \\ -1 & \text{if } x_i + y_i \le -1 \\ 0 & \text{otherwise} \end{cases}$$
and $w_i := x_i + y_i - 2 \cdot t_{i+1}$ and $s_i := x_i + y_i + t_i - 2 \cdot t_{i+1}$. As can be seen, the algorithm sometimes computes values for $s_i$ that are not in the allowed range. The symbol “*” in the rightmost column marks these rows (where the algorithm fails) and we have colored these rows in dark gray. It is not difficult to see that a correct result would have been possible, since $[t_{i+1}, s_i]_{1,2} = [-1, +2]_{1,2} = 0 = [0, 0]_{1,2}$ and $[t_{i+1}, s_i]_{1,2} = [+1, -2]_{1,2} = 0 = [0, 0]_{1,2}$ hold.
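Table II
VALUES OF $t_{i+1}$ AND $s_i$ FOR STANDARD SD ADDITION.

| $x_i$ | $y_i$ | $t_i$ | $t_{i+1}$ | $w_i$ | $s_i$ | mark |
|---|---|---|---|---|---|---|
| -1 | -1 | -1 | -1 | 0 | -1 | |
| -1 | -1 | 0 | -1 | 0 | 0 | |
| -1 | -1 | +1 | -1 | 0 | +1 | |
| -1 | 0 | -1 | -1 | +1 | 0 | + |
| -1 | 0 | 0 | -1 | +1 | +1 | + |
| -1 | 0 | +1 | -1 | +1 | +2 | * |
| -1 | +1 | -1 | 0 | 0 | -1 | |
| -1 | +1 | 0 | 0 | 0 | 0 | |
| -1 | +1 | +1 | 0 | 0 | +1 | |
| 0 | -1 | -1 | -1 | +1 | 0 | + |
| 0 | -1 | 0 | -1 | +1 | +1 | + |
| 0 | -1 | +1 | -1 | +1 | +2 | * |
| 0 | 0 | -1 | 0 | 0 | -1 | |
| 0 | 0 | 0 | 0 | 0 | 0 | |
| 0 | 0 | +1 | 0 | 0 | +1 | |
| 0 | +1 | -1 | +1 | -1 | -2 | * |
| 0 | +1 | 0 | +1 | -1 | -1 | + |
| 0 | +1 | +1 | +1 | -1 | 0 | + |
| +1 | -1 | -1 | 0 | 0 | -1 | |
| +1 | -1 | 0 | 0 | 0 | 0 | |
| +1 | -1 | +1 | 0 | 0 | +1 | |
| +1 | 0 | -1 | +1 | -1 | -2 | * |
| +1 | 0 | 0 | +1 | -1 | -1 | + |
| +1 | 0 | +1 | +1 | -1 | 0 | + |
| +1 | +1 | -1 | +1 | 0 | -1 | |
| +1 | +1 | 0 | +1 | 0 | 0 | |
| +1 | +1 | +1 | +1 | 0 | +1 | |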
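The failing rows of Table II can be reproduced by specializing the standard rule to $B = 2$, $D = 1$ and enumerating all 27 input triples. This is a sketch with assumed names; it models a single digit position only.

```python
from itertools import product

DIGITS = (-1, 0, 1)


def standard_step(x, y, t_in):
    """One digit position of the standard algorithm for B = 2, D = 1."""
    u = x + y
    t_out = 1 if u >= 1 else (-1 if u <= -1 else 0)
    w = u - 2 * t_out                  # interim sum
    return t_out, w, w + t_in          # (t_{i+1}, w_i, s_i)


# Input triples (x_i, y_i, t_i) for which s_i is not a legal digit.
failing = [(x, y, t) for x, y, t in product(DIGITS, repeat=3)
           if standard_step(x, y, t)[2] not in DIGITS]
```

The list `failing` contains exactly the four rows marked “*” in Table II.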
However, we cannot simply change these rows in the table to correct the outputs $t_{i+1}$ and $s_i$, since the computation of $t_{i+1}$ must be independent of $t_i$, and should only depend on $x_i$ and $y_i$. Hence, changing the value of $t_{i+1}$ in a row forces us to make the same change in all rows where $x_i$ and $y_i$ have the same values. We therefore say that two input triples $(x_i, y_i, t_i)$ and $(x'_i, y'_i, t'_i)$ are equivalent iff $x_i = x'_i \wedge y_i = y'_i$ holds. The symbol “+” denotes the rows that are equivalent in this sense to another input that leads to wrong results, and we have colored these rows in a lighter gray. We therefore see that we have four critical input classes $(x_i, y_i, t_i) = (-1, 0, *)$, $(x_i, y_i, t_i) = (0, -1, *)$, $(x_i, y_i, t_i) = (0, +1, *)$, and $(x_i, y_i, t_i) = (+1, 0, *)$ that refer to the decomposition cases in Table I where two decompositions are possible.
Since we have to define a decomposition $2 \cdot t_{i+1} + w_i = x_i + y_i$ independent of $t_i$, there is no solution by the information given in this table. For example, consider the critical input class $(x_i, y_i, t_i) = (-1, 0, *)$: Using $t_{i+1} = -1$ as computed by the algorithm leads to value $s_i = +2$ for $t_i = +1$. Using $t_{i+1} = 0$ instead leads to value $s_i = -2$ for $t_i = -1$, and using $t_{i+1} = +1$ leads to forbidden values of $s_i$ for all values of $t_i$ (see Table III). Thus, it is not possible to define a decomposition for $t_{i+1}$ that only depends on $x_i$ and $y_i$, as remarked by Avizienis!

Table III
ALTERNATIVE VALUES OF $t_{i+1}$ FOR ($x_i = -1$ AND $y_i = 0$).

| $x_i$ | $y_i$ | $t_i$ | $t_{i+1}$ | $s_i$ | $t_{i+1}$ | $s_i$ |
|---|---|---|---|---|---|---|
| -1 | 0 | -1 | 0 | -2 | +1 | -4 |
| -1 | 0 | 0 | 0 | -1 | +1 | -3 |
| -1 | 0 | +1 | 0 | 0 | +1 | -2 |
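The impossibility argument can be confirmed exhaustively: for each pair $(x_i, y_i)$ one can search for a transfer digit that yields a legal $s_i$ for every possible incoming $t_i$. The following sketch uses assumed names and checks one digit position in isolation.

```python
DIGITS = (-1, 0, 1)


def safe_transfers(x, y):
    """Transfer digits t_out for which s = x + y + t_in - 2*t_out is a
    legal digit for every possible incoming transfer t_in."""
    return [t_out for t_out in DIGITS
            if all((x + y + t_in - 2 * t_out) in DIGITS for t_in in DIGITS)]


# Pairs (x_i, y_i) that admit no transfer digit independent of t_i.
critical = [(x, y) for x in DIGITS for y in DIGITS
            if not safe_transfers(x, y)]
```

The pairs collected in `critical` are the four critical input classes identified above; all other pairs have exactly one safe transfer digit.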
B. Solution
Our algorithm uses additional information that solves the problem explained in the previous section. As the algorithm describes a hardware circuit, we make use of an encoding of the digits $\{-1, 0, +1\}$ by a pair of booleans $(x.0, x.1)$. There are many encodings of the digits $\{-1, 0, +1\}$, but the following two are the most popular ones:

| Value | sign-value | neg-pos |
|---|---|---|
| -1 | (true, true) | (true, false) |
| 0 | (false, false) | (false, false) |
| +1 | (false, true) | (false, true) |
We choose the neg-pos encoding for our algorithm because it lends itself well to a concise description of the logic equations below; in addition, it makes negating a value a simple swap of the pair’s elements.
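A minimal software model of the neg-pos encoding, with assumed function names, shows why negation is just a swap of the two components:

```python
def encode(d):
    """neg-pos encoding of a digit: (x.0, x.1) = (d is -1, d is +1)."""
    return (d == -1, d == +1)


def decode(p):
    """Map a pair of booleans back to a digit: x.0 contributes -1, x.1 contributes +1."""
    return (-1 if p[0] else 0) + (1 if p[1] else 0)


def negate(p):
    """Negation swaps the 'negative' and the 'positive' flag."""
    return (p[1], p[0])
```

For instance, `decode(negate(encode(+1)))` yields the digit $-1$, and zero is mapped to itself.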
The key idea of our solution is to choose different decompositions $x_i + y_i = 2 \cdot t_{i+1} + w_i$ in the critical cases (with gray color) of Table II. Since we cannot do this based on $x_i$ and $y_i$ only, and since we are not allowed to consider $t_i$, we introduce a new input $\ell_i$ such that

$$(t_i = +1 \rightarrow \neg \ell_i) \wedge (t_i = -1 \rightarrow \ell_i)$$

holds, and we generate an output $\ell_{i+1}$ that maintains this property as an invariant

$$(t_{i+1} = +1 \rightarrow \neg \ell_{i+1}) \wedge (t_{i+1} = -1 \rightarrow \ell_{i+1}) \qquad (1)$$

that is forwarded to the full adder that receives $x_{i+1}$ and $y_{i+1}$ as inputs, while $\ell_i$ is provided in addition to $t_i$ by the full adder for $x_{i-1}$ and $y_{i-1}$.
Using $\ell_i$, we can then decide whether we use the one or the other possible decomposition in the critical cases (with gray color) of Table II. Note that $\ell_i$ does not hold the full information of $t_i$, since it is not determined for $t_i = 0$. To establish the above invariant, we define

$$\ell_{i+1} := x_i.0 \vee y_i.0$$

which means that $\ell_{i+1}$ holds if and only if at least one of the digits $x_i$, $y_i$ is $-1$.
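We prove that equation (1) holds by inspecting Table IV, where the solution computed by our algorithm is given as the three rightmost columns, and we can also verify that the important equation $x_i + y_i + t_i = 2 \cdot t_{i+1} + s_i$ holds, and that all computed values are legal digits. Note that the inputs in Table IV are arbitrary, but input $\ell_i$ must respect the invariant mentioned above. We use ‘*’ in case its value is a don’t care (i.e., if $t_i = 0$).

Table IV
VALUES OF $t_{i+1}$ AND $s_i$ FOR OUR SD ADDITION ALGORITHM.

| $x_i$ (x) | $y_i$ (y) | $t_i$ (tin) | $\ell_i$ (lin) | $\ell_{i+1}$ (lout) | $t_{i+1}$ (tout) | $s_i$ (s) |
|---|---|---|---|---|---|---|
| -1 | -1 | -1 | T | T | -1 | -1 |
| -1 | -1 | 0 | * | T | -1 | 0 |
| -1 | -1 | +1 | F | T | -1 | +1 |
| -1 | 0 | -1 | T | T | -1 | 0 |
| -1 | 0 | 0 | T | T | -1 | +1 |
| -1 | 0 | 0 | F | T | 0 | -1 |
| -1 | 0 | +1 | F | T | 0 | 0 |
| -1 | +1 | -1 | T | T | 0 | -1 |
| -1 | +1 | 0 | * | T | 0 | 0 |
| -1 | +1 | +1 | F | T | 0 | +1 |
| 0 | -1 | -1 | T | T | -1 | 0 |
| 0 | -1 | 0 | T | T | -1 | +1 |
| 0 | -1 | 0 | F | T | 0 | -1 |
| 0 | -1 | +1 | F | T | 0 | 0 |
| 0 | 0 | -1 | T | F | 0 | -1 |
| 0 | 0 | 0 | * | F | 0 | 0 |
| 0 | 0 | +1 | F | F | 0 | +1 |
| 0 | +1 | -1 | T | F | 0 | 0 |
| 0 | +1 | 0 | T | F | 0 | +1 |
| 0 | +1 | 0 | F | F | +1 | -1 |
| 0 | +1 | +1 | F | F | +1 | 0 |
| +1 | -1 | -1 | T | T | 0 | -1 |
| +1 | -1 | 0 | * | T | 0 | 0 |
| +1 | -1 | +1 | F | T | 0 | +1 |
| +1 | 0 | -1 | T | F | 0 | 0 |
| +1 | 0 | 0 | T | F | 0 | +1 |
| +1 | 0 | 0 | F | F | +1 | -1 |
| +1 | 0 | +1 | F | F | +1 | 0 |
| +1 | +1 | -1 | T | F | +1 | -1 |
| +1 | +1 | 0 | * | F | +1 | 0 |
| +1 | +1 | +1 | F | F | +1 | +1 |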
As can be seen, in case of non-critical inputs (those that are not given in gray color, i.e., those with $x_i + y_i \neq \pm 1$), the decomposition of $x_i + y_i = u_i$ into $t_{i+1} \cdot 2 + w_i$ depends only on $x_i$ and $y_i$, while in the critical cases, it also depends on $\ell_i$. Using the information of $\ell_i$, it is possible to choose a decomposition where legal digits are always obtained for $t_{i+1}$ and $s_i$ without generating a carry digit.
It is interesting to note that $\ell_i$ and $\ell_{i+1}$ have strong relationships to $t_i$ and $t_{i+1}$ due to the mentioned invariants. However, $\ell_{i+1}$ only depends on the digits $x_i$ and $y_i$, while $t_{i+1}$ depends on $\ell_i$, but not on $t_i$. This is very important: In principle, we could replace $\ell_i$ by $(t_i = -1)$ without making the equations incorrect. However, the hardware circuit would then suffer from carry propagation since $t_{i+1}$ would then depend on $t_i$.
Figure 1 defines a full adder using the Quartz language [16] that can be cascaded to obtain a carry-free binary SD adder. Inputs are declared by ? while outputs are declared with !. The inputs tin, x, and y are thereby pairs of booleans that encode digits $\{-1, 0, +1\}$ via the neg-pos encoding, i.e., $\varepsilon(x.0, x.1) = (x.0 \Rightarrow -1 \mid 0) + (x.1 \Rightarrow +1 \mid 0)$ maps a pair of booleans to the corresponding digit. The module also makes use of local boolean variables w1, w2, w3, w4, w, u1, and u0. w is thereby defined such that it holds if and only if one of the critical input cases is given (the gray shaded ones in Table IV). Variables u1 and u0 are used to define some common subexpressions.
    module SgnFullAdd(
        (bool * bool) ?tin, ?x, ?y, bool ?lin,
        (bool * bool) !tout, !s, bool !lout)
    {
        bool w1, w2, w3, w4, w, u1, u0;
        // define the critical input cases:
        w1 = !x.0 & !x.1 & y.1;    // x == 0 & y == +1
        w2 = !x.0 & !x.1 & y.0;    // x == 0 & y == -1
        w3 = !y.0 & !y.1 & x.1;    // y == 0 & x == +1
        w4 = !y.0 & !y.1 & x.0;    // y == 0 & x == -1
        w  = w1 | w2 | w3 | w4;
        u1 = !lin & w;             // tin != -1 & critical input
        u0 = lin & w;              // tin != +1 & critical input
        // determine lout := x = -1 | y = -1
        lout = x.0 | y.0;
        // tout.0 holds iff x = y = -1 | tin != +1 & x + y = -1
        tout.0 = x.0 & y.0 | lin & (w2 | w4);
        // tout.1 holds iff x = y = +1 | tin != -1 & x + y = +1
        tout.1 = x.1 & y.1 | !lin & (w1 | w3);
        // determine sum digit
        s.0 = tin.0 & !u0 | u1 & !tin.1;
        s.1 = tin.1 & !u1 | u0 & !tin.0;
    }
Figure 1. Implementation of a Full Adder for Binary SD Numbers
As can be seen, tout only depends on x, y, lin; s depends on tin, lin, x, y; and lout depends on x, y. Therefore, there is no dependency from tin to tout and neither is there one from lin to lout. Dependencies between neighboring full adder modules are shown in Figure 2. As can be seen, a sum digit $s_i$ depends on $x_i, y_i, x_{i-1}, y_{i-1}, x_{i-2}, y_{i-2}$; $\ell_i$ depends on $x_{i-1}, y_{i-1}$; and $t_i$ depends on $x_{i-1}, y_{i-1}, x_{i-2}, y_{i-2}$.
It is not difficult to prove that the following theorem holds, where $\varepsilon(x)$ maps the pair of booleans $x = (x.0, x.1)$ to a digit $\{-1, 0, +1\}$ according to the neg-pos encoding:
Theorem 4 (Correctness of SgnFullAdd): If x, y, tin are pairs of booleans that encode digits $\{-1, 0, +1\}$, and if lin is a boolean such that condition $(lin \rightarrow \neg tin.1) \wedge (\neg lin \rightarrow \neg tin.0)$ holds, then the following holds for module SgnFullAdd shown in Figure 1:
• tout and s encode signed binary digits $\{-1, 0, +1\}$
• $(lout \rightarrow \neg tout.1) \wedge (\neg lout \rightarrow \neg tout.0)$
• $\varepsilon(x) + \varepsilon(y) + \varepsilon(tin) = 2 \cdot \varepsilon(tout) + \varepsilon(s)$
Proof: The proof can be made by an exhaustive enumeration of all cases, which has been performed by means of the Averest tool set.
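The exhaustive enumeration is small enough to be replayed in a few lines. The following sketch is a direct Python transcription of the equations of Figure 1, not the Averest proof itself; the helper names `enc`, `eps` and `check_theorem4` are assumptions, and the code treats (true, true) as an illegal code word for outputs.

```python
from itertools import product


def enc(d):
    """neg-pos encoding of a digit as (x.0, x.1)."""
    return (d == -1, d == 1)


def eps(p):
    """Digit value of a neg-pos pair."""
    return (-1 if p[0] else 0) + (1 if p[1] else 0)


def sgn_full_add(tin, x, y, lin):
    """Boolean equations of module SgnFullAdd (Figure 1)."""
    w1 = (not x[0]) and (not x[1]) and y[1]    # x == 0 & y == +1
    w2 = (not x[0]) and (not x[1]) and y[0]    # x == 0 & y == -1
    w3 = (not y[0]) and (not y[1]) and x[1]    # y == 0 & x == +1
    w4 = (not y[0]) and (not y[1]) and x[0]    # y == 0 & x == -1
    w = w1 or w2 or w3 or w4
    u1 = (not lin) and w
    u0 = lin and w
    lout = x[0] or y[0]
    tout = ((x[0] and y[0]) or (lin and (w2 or w4)),
            (x[1] and y[1]) or ((not lin) and (w1 or w3)))
    s = ((tin[0] and not u0) or (u1 and not tin[1]),
         (tin[1] and not u1) or (u0 and not tin[0]))
    return tout, s, lout


def check_theorem4():
    """Check the three claims of Theorem 4 for all admissible inputs."""
    digits = (-1, 0, 1)
    for dx, dy, dt, lin in product(digits, digits, digits, (False, True)):
        tin = enc(dt)
        if (lin and tin[1]) or ((not lin) and tin[0]):
            continue                           # precondition on lin violated
        tout, s, lout = sgn_full_add(tin, enc(dx), enc(dy), lin)
        assert tout != (True, True) and s != (True, True)
        assert not (lout and tout[1]) and not ((not lout) and tout[0])
        assert dx + dy + dt == 2 * eps(tout) + eps(s)
    return True
```

Calling `check_theorem4()` runs through all 36 admissible input combinations and raises an `AssertionError` if any of the three properties is violated.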
Thus, all bits $\ell_i$, then all transfer digits, and then all sum digits are computed in three parallel steps, thus requiring time $O(1)$. Hence, we obtained a carry-free addition of binary SD numbers without the need to re-encode the inputs. The crucial fact used here is that we can extract enough information from the next less-significant digits to distinguish the cases where forbidden digits for $s_i$ would be computed within the critical inputs. Note that $\ell_i$ does not have the complete information to determine $t_i$ since that would lead to a dependency between $t_{i+1}$ and $t_i$ that would introduce a carry chain.
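The three parallel steps can be modelled at the digit level as follows. This is a sketch under stated assumptions: digits are plain integers instead of boolean pairs, the case distinction for the transfer digit is read off Table IV, the initial values $t_0 := 0$ and $\ell_0 := \text{false}$ are chosen for illustration (the value of $\ell_0$ is a don't care because $t_0 = 0$), and the function names are hypothetical.

```python
def sd_value(digits):
    """Value of a binary SD number given most significant digit first."""
    value = 0
    for d in digits:
        value = 2 * value + d
    return value


def bsd_add(x, y):
    """Carry-free addition of two n-digit binary SD numbers."""
    n = len(x)
    xs, ys = x[::-1], y[::-1]                  # index i has weight 2**i
    # Step 1: condition bits; l[i+1] depends on x_i and y_i only.
    l = [False] + [xs[i] == -1 or ys[i] == -1 for i in range(n)]
    # Step 2: transfer digits; t[i+1] depends on x_i, y_i and l[i], never on t[i].
    t = [0] * (n + 1)
    for i in range(n):
        u = xs[i] + ys[i]
        if abs(u) == 2:
            t[i + 1] = u // 2                  # non-critical: x_i = y_i = +-1
        elif u == -1:
            t[i + 1] = -1 if l[i] else 0       # critical: choice guided by l[i]
        elif u == 1:
            t[i + 1] = 0 if l[i] else 1        # critical: choice guided by l[i]
    # Step 3: sum digits from x_i, y_i, t[i] and t[i+1].
    s = [xs[i] + ys[i] + t[i] - 2 * t[i + 1] for i in range(n)]
    assert all(d in (-1, 0, 1) for d in s)     # no forbidden digit occurs
    return [t[n]] + s[::-1]
```

Although the loops run sequentially here, no iteration reads a value produced by another iteration of the same step, which mirrors the constant depth of the circuit.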
C. Conversion to/from Binary Numbers
Converting radix-2 or two’s complement numbers to binary SD numbers does not require any logic resources. For a radix-2 number $x = [x_{n-1}, \ldots, x_0]$, the equivalent SD number $x$ in neg-pos encoding is $x.0 := [0, \ldots, 0]$ and $x.1 := [x_{n-1}, \ldots, x_0]$; for a two’s complement number $x = [x_{n-1}, \ldots, x_0]$, an equivalent SD number is $x.0 := [x_{n-1}, 0, \ldots, 0]$ and $x.1 := [0, x_{n-2}, \ldots, x_0]$. The correctness of this can be easily seen from the equation $[x_{n-1}, \ldots, x_0]_{2C} = -x_{n-1} \cdot 2^{n-1} + \sum_{i=0}^{n-2} x_i \cdot 2^i$, where $x_{2C}$ denotes the two’s complement interpretation of a bitvector $x$.
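Both conversions amount to rewiring bits, which the following sketch makes explicit. Bit vectors are modelled as Python lists of 0/1 with the most significant bit first; the function names are assumptions, and `sd_value` is only a helper for checking the result.

```python
def radix2_to_sd(bits):
    """Unsigned bits [x_{n-1}, ..., x_0] -> neg-pos vectors (x.0, x.1)."""
    return [0] * len(bits), list(bits)


def twos_complement_to_sd(bits):
    """Two's complement bits [x_{n-1}, ..., x_0] -> neg-pos vectors (x.0, x.1)."""
    neg = [bits[0]] + [0] * (len(bits) - 1)    # only the sign bit counts negatively
    pos = [0] + list(bits[1:])
    return neg, pos


def sd_value(neg, pos):
    """Sum of (x_i.1 - x_i.0) * 2**i over all digit positions."""
    value = 0
    for n_bit, p_bit in zip(neg, pos):
        value = 2 * value + (p_bit - n_bit)
    return value
```

For example, the two's complement bitvector `[1, 0, 1, 1]` becomes `x.0 = [1, 0, 0, 0]` and `x.1 = [0, 0, 1, 1]`, which encodes $-8 + 3 = -5$.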
To convert an SD number $x$ back to a radix-2 or a two’s complement number, the bitvector $[x_{n-1}.0, \ldots, x_0.0]$ is interpreted as a radix-2 number and subtracted from the radix-2 number $[x_{n-1}.1, \ldots, x_0.1]$ (since $[x_{n-1}, \ldots, x_0]_{1,2} = \sum_{i=0}^{n-1} (x_i.1 - x_i.0) \cdot 2^i$). This requires a single $n$-bit subtraction which needs time $O(\log(n))$ and returns an $(n+1)$-bit radix-2/two’s complement number.

IV. BENCHMARK RESULTS
A. Setup
We implemented our addition algorithm in hardware on a Xilinx Virtex 5 FPGA, along with Parhami’s algorithm [8], and a simple addition of two’s complement numbers to make comparisons. On these FPGAs, simple addition is implemented using a dedicated carry logic and fast carry chains, resulting in a combination of carry-lookahead and carry-ripple adders. This method is the fastest and the smallest carry-based addition for all but very high bit-width numbers. For Parhami’s method, we chose the sign-value encoding, since it was the one focused on in [8]. Our benchmarks were set up as follows:
• To measure latency, we registered the inputs and outputs of the respective adder implementation. The synthesis and implementation tools were set to optimize for clock frequency, and the given latencies are the minimum clock periods which were still routable.
• For area, the design solely comprised the adder circuit, with the FPGA’s pins serving as the inputs and the outputs of the adder. The tools were set to optimize for area, and the area is measured in occupied lookup tables (LUTs).
For our benchmarks, we assumed that the inputs are given as signed-digit numbers. This is necessary in order to ensure that the input is as general as possible so that the synthesis tools are not able to optimize the circuit unrealistically by exploiting don’t-care conditions. We measured the following benchmarks:
• add2: addition circuit with two $n$-digit inputs and an $(n+1)$-digit output
• add3: addition circuit with three $n$-digit inputs and an $(n+2)$-digit output

B. Results
Table V shows the latency and the maximum frequency of the two-input and the three-input adder for our new addition algorithm and compares it to Parhami’s adder. The values were determined for an input width of $n = 64$, but they are actually independent of $n$ (with very small deviations due to slight variations in the LUT array and the routing network of the FPGA). We included the values for a 64-bit native addition as a reference.
As can be seen, our algorithm is more than 40% faster than Parhami’s SD addition. It also tends to achieve a frequency which is 50% higher than a 64-bit native FPGA addition. This is to be expected, since our algorithm has a constant $O(1)$ latency, while the best latency which any carry-based addition can achieve is $O(\log n)$. In fact, our algorithm is so efficient that the breakeven point is at $n = 24$, for which native addition has a latency of 2.11 ns. Interestingly, Parhami’s adder is actually slower than native FPGA addition for the three-input case, even though it is faster for the two-input case.
Table V
LATENCY OF TWO-INPUT AND THREE-INPUT ADDERS IN NANOSECONDS, RESP. MAXIMUM FREQUENCY IN MHZ.
                       add2 (ns / MHz)   add3 (ns / MHz)
Our adder              2.02 / 495        3.14 / 318
Parhami’s adder        2.88 / 347        4.97 / 201
Simple adder (64-bit)  3.19 / 313        4.71 / 212

Table VI
AREA REQUIREMENTS OF TWO-INPUT AND THREE-INPUT ADDERS IN LUTS PER INPUT BIT.
                        add2   add3
Our adder               3.0    6.0
Parhami’s adder         7.3    14.7
Two’s complement adder  1.0    2.0

In Table VI, we show the area requirements for the different algorithms. For all of them, the occupied area is proportional to their input width, hence we give the number of LUTs per input digit (measured for n = 64). For example, our method requires 3 LUTs per digit, so adding two 32-digit numbers requires 96 LUTs. As expected, three-input adders need twice the area of two-input adders, since they are just two adders in sequence. Our method requires three times as much area as the native addition algorithm, and less than half of Parhami’s algorithm.
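Since the area scales linearly with the input width, the LUT count for a given width follows directly from Table VI. The sketch below assumes that the per-digit figures measured at n = 64 carry over unchanged to other widths; `LUTS_PER_DIGIT` and `estimated_luts` are illustrative names.

```python
# LUTs per input digit, copied from Table VI (measured for n = 64).
LUTS_PER_DIGIT = {
    "ours":    {"add2": 3.0, "add3": 6.0},
    "parhami": {"add2": 7.3, "add3": 14.7},
    "native":  {"add2": 1.0, "add3": 2.0},
}


def estimated_luts(adder, bench, n_digits):
    """Assumption: area is exactly proportional to the input width."""
    return LUTS_PER_DIGIT[adder][bench] * n_digits


if __name__ == "__main__":
    # Example from the text: two 32-digit inputs with our adder.
    print(estimated_luts("ours", "add2", 32))
    for adder in LUTS_PER_DIGIT:
        print(adder, estimated_luts(adder, "add2", 64),
              estimated_luts(adder, "add3", 64))
```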
Note that in the case of an ASIC implementation, our algorithm would likely perform even better compared to a two’s complement adder since the latter benefits from the dedicated carry-propagation chain on the FPGA, an advantage which would not exist on an ASIC.
V. CONCLUSION
We developed an algorithm for adding binary SD numbers which does not require the recoding step of previous approaches [8]. Our algorithm makes use of an additional input i to determine suitable transfer and interim sum digits, which avoids carry generation. By implementing our addition algorithm on an FPGA, we showed that our method is more than 40 % faster and needs less than half as much area as previous approaches to binary SD addition. It has a lower latency than even the fastest carry-based two’s complement addition for input widths as low as 24 bits, allowing it to be used as a replacement in many practical, latency-critical hardware designs.
REFERENCES
[1] P. Kogge and H. Stone, “A parallel algorithm for the efficient solution of a general class of recurrences,” IEEE Transactions on Computers (T-C), vol. 22, pp. 786–793, 1973.
[2] P. Beame, S. Cook, and H. Hoover, “Log depth circuits for division and related problems,” in Foundations of Computer Science (FOCS). West Palm Beach, Florida, USA: IEEE Computer Society, 1984, pp. 1–6.
[3] B. Parhami, Computer Arithmetic – Algorithms and Hardware Designs. Oxford University Press, 2000.
[4] M. Ercegovac and T. Lang, Digital Arithmetic. Morgan Kaufmann, 2003.
[5] H. Garner, “The residue number system,” IRE Transactions on Electronic Computers, vol. 8, pp. 140–147, June 1959.
[6] H. Garner, R. Arnold, B. Benson, C. Brockus, R. Gonzalez, and D. Rozenberg, “Residue number systems for computers,” University of Michigan, Technical Report 61-483, October 1961.
[7] A. Avizienis, “Signed-digit number representations for fast parallel arithmetic,” IRE Transactions on Electronic Computers, vol. 10, no. 3, pp. 389–400, September 1961.
[8] B. Parhami, “Carry-free addition of recoded binary signed-digit numbers,” IEEE Transactions on Computers (T-C), vol. 37, no. 11, pp. 1470–1476, 1988.
[9] B. Parhami, “Generalized signed-digit number systems: A unifying framework for redundant number representations,” IEEE Transactions on Computers (T-C), vol. 39, no. 1, pp. 89–98, January 1990.
[10] S.-H. Shieh and C.-W. Wu, “Asymmetric high-radix signed-digit number systems for carry-free addition,” Journal of Information Science and Engineering, vol. 19, no. 6, pp. 1015–1039, 2003.
[11] G. Jaberipur and M. Ghodsi, “High radix signed digit number systems: Representation paradigms,” Scientia Iranica, vol. 10, no. 4, pp. 383–391, 2003.
[12] S. Gorgin and G. Jaberipur, “A family of high radix signed digit adders,” in Symposium on Computer Arithmetic (ARITH). Tübingen, Germany: IEEE Computer Society, 2011, pp. 112–120.
[13] M. Joye and S.-M. Yen, “Optimal left-to-right binary signed-digit recoding,” IEEE Transactions on Computers (T-C), vol. 49, no. 7, pp. 740–748, 2000.
[14] C. Koc and S. Johnson, “Multiplication of signed-digit numbers,” Electronics Letters, vol. 30, no. 11, pp. 840–841, 1994.
[15] C. Hung and B. Parhami, “Generalized signed-digit multiplication and its systolic realizations,” in Circuits and Systems. Detroit, Michigan, USA: IEEE Computer Society, 1993, pp. 1505–1508.
[16] K. Schneider, “The synchronous programming language Quartz,” Department of Computer Science, University of Kaiserslautern, Kaiserslautern, Germany, Internal Report 375, December 2009.
[17] S. Arno and F. Wheeler, “Signed digit representation of minimal Hamming weight,” IEEE Transactions on Computers (T-C), vol. 42, no. 8, pp. 1007–1010, August 1993.
[18] A. Booth, “A signed binary multiplication technique,” Quarterly Journal of Mechanics and Applied Mathematics (QJMAM), vol. 4, no. 2, pp. 236–240, 1951.
[19] D. Phatak, T. Goff, and I. Koren, “Constant-time addition and simultaneous format conversion based on redundant binary representations,” IEEE Transactions on Computers (T-C), vol. 50, 2001.
[20] G. Reitwiesner, Advances in Computers. Academic Press, 1960, ch. Binary Arithmetic.