A High Speed Residue-to-Binary Converter for Balanced 4-Moduli Set


Journal of Computing and Security

http://www.jcomsec.org


MohammadReza Taheri$^{a}$, Nasim Shafiee$^{a}$, Mohammad Esmaeildoust$^{b}$, Zhale Amirjamshidi$^{c}$, Reza Sabbaghi-nadooshan$^{c}$, Keivan Navi$^{a,*}$

$^{a}$ Faculty of Computer Science and Engineering, Shahid Beheshti University, GC, Tehran, Iran.

$^{b}$ Faculty of Marine Engineering, Khorramshahr University of Marine Science and Technology, Khuzestan, Iran.

$^{c}$ Electronic Engineering Department, Islamic Azad University, Central Tehran Branch, Tehran, Iran.

ARTICLE INFO

Article history: Received: 6 January 2014; Revised: 7 November 2015; Accepted: 7 December 2015; Published Online: 7 February 2016

Keywords: Mixed Radix Conversion, Residue Arithmetic, Residue Number System, Residue-to-Binary Converter

ABSTRACT

The moduli set $\{2^{n-1}-1, 2^{n+1}-1, 2^n, 2^n-1\}$ has recently been proposed in the literature for the class of $4n$-bit dynamic range in the residue number system (RNS). Because it uses only moduli of the form $2^k-1$ besides the modulus $2^n$, this moduli set admits an efficient Arithmetic Unit (AU). The efficiency of an RNS system depends not only on the residue arithmetic unit but also on the residue-to-binary converter. In this paper, a new two-level residue-to-binary converter architecture based on Mixed Radix Conversion (MRC) is presented for this moduli set. The proposed converter includes two levels of design based on MRC properties. First, the 3-moduli subset $\{2^{n-1}-1, 2^{n+1}-1, 2^n-1\}$ is organized so that several intermediate values need not be calculated, which reduces the cost. Then, the two-moduli set $\{(2^{n-1}-1)(2^{n+1}-1)(2^n-1), 2^n\}$ is formed to compute the binary equivalent of the RNS number. The proposed architecture is shown to be more efficient in terms of both hardware cost and conversion delay than the related state-of-the-art works.

1 Introduction

The carry-free nature of the residue number system (RNS) makes it suitable for use at the arithmetic level in VLSI design to achieve parallelism [1], [2]. In RNS, a weighted number is decomposed into a set of residues. Since arithmetic operations on residues can be performed without carry propagation between them, RNS yields high-speed addition, subtraction and multiplication [3], [4], which is appropriate for digital signal processing (DSP) [5], [6], image processing [7], cryptography [8], [9] and communication systems [10]. However, arithmetic operations like division, sign detection and comparison are difficult in RNS.

∗ Corresponding author.

Table 1. Comparison of arithmetic operation for different moduli sets for high dynamic range applications

| Moduli set | Design | Critical modulus | Delay |
|---|---|---|---|
| $\{2^n-1, 2^n, 2^n+1, 2^{n+1}-1\}$ | [13, 14] | $2^n+1$ | $2\log_2 n + 6$ |
| $\{2^n-1, 2^n, 2^n+1, 2^{n+1}+1\}$ | [13, 15] | $2^{n+1}+1$ | $2\log_2(n+1) + 6$ |
| $\{2^n-3, 2^n-1, 2^n+1, 2^n+3\}$ | [16] | $2^n+3$ | $2\log_2(n-1) + 7$ |
| $\{2^n, 2^{n+1}-1, 2^n-1, 2^{n-1}-1\}$ | [17] | $2^{n+1}-1$ | $2\log_2(n+1) + 3$ |

and reverse conversion. A reverse converter has a more complex architecture, and its complexity grows with the number of moduli. Therefore, an effective reverse converter design is needed in order to benefit from RNS [12].

Many works have been reported on balanced 4-moduli sets such as $\{2^n-1, 2^n, 2^n+1, 2^{n+1}-1\}$ [13, 14], $\{2^n-1, 2^n, 2^n+1, 2^{n+1}+1\}$ [13, 15], $\{2^n-3, 2^n-1, 2^n+1, 2^n+3\}$ [16] and $\{2^n, 2^{n+1}-1, 2^n-1, 2^{n-1}-1\}$ [17]. The efficiency of arithmetic operations is limited by the critical modulus. The critical moduli in [13–17] are shown in Table 1. The unit gate delays of the parallel prefix adders for moduli $2^k-1$, $2^k+1$ and $2^k+3$ are $2\log_2 n + 3$, $2\log_2 n + 6$ and $2\log_2(n-1) + 7$, respectively [18–20]. Therefore, as shown in Table 1, the moduli set $\{2^n, 2^{n+1}-1, 2^n-1, 2^{n-1}-1\}$ [17] provides a more efficient arithmetic unit. However, a more efficient reverse converter for the moduli set $\{2^n, 2^{n+1}-1, 2^n-1, 2^{n-1}-1\}$, with lower hardware requirements and delay than [17] and the converters for the other moduli sets in the literature, is needed. Therefore, in this paper, a new design of the reverse converter for this 4-moduli set is presented. The proposed converter achieves less delay and more desirable hardware requirements than the state-of-the-art converters.

This paper is organized as follows. Section 2 gives background on RNS. Section 3 presents the design of the proposed RNS-to-binary converter. Section 4 evaluates the hardware requirements and critical path delay of the proposed reverse converter. Section 5 compares the performance of the proposed converter with converters for other moduli sets, and Section 6 concludes the paper.

2 Background

A residue number system is defined in terms of a set of relatively prime moduli $\{P_1, P_2, \ldots, P_n\}$, that is, $\gcd(P_i, P_j) = 1$ for $i \neq j$. An integer $X$ in the range $[0, M-1]$ can be represented as $X = (x_1, x_2, \ldots, x_n)$, where $x_i = X \bmod P_i$, $0 \le x_i < P_i$, and $M = P_1 \times P_2 \times \cdots \times P_n$ is the dynamic range of the RNS [21].
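The definition above can be illustrated with a minimal Python sketch. The helper names `moduli_set` and `to_rns` are hypothetical and introduced only for illustration; the moduli are listed in the order $P_1, \ldots, P_4$ used later in this paper, and $n$ is assumed to be even so that the moduli are pairwise coprime.

```python
from itertools import combinations
from math import gcd, prod


def moduli_set(n):
    """Moduli (2^n - 1, 2^(n+1) - 1, 2^(n-1) - 1, 2^n); n is assumed even."""
    return (2**n - 1, 2**(n + 1) - 1, 2**(n - 1) - 1, 2**n)


def to_rns(x, moduli):
    """Forward conversion: the residue of x for each modulus."""
    return tuple(x % p for p in moduli)


P = moduli_set(6)                                   # (63, 127, 31, 64)
assert all(gcd(a, b) == 1 for a, b in combinations(P, 2))
M = prod(P)                                         # dynamic range
print(P, M, to_rns(9876543, P))
```

For $n = 6$ this yields the residues (33, 7, 5, 63) for $X = 9876543$, the same operands used in the numerical example of Section 3.3.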

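Reverse conversion algorithms are principally based on the Chinese remainder theorem (CRT), mixed-radix conversion (MRC) and the new Chinese remainder theorems (New CRTs) [11]. Through the MRC, the number $X$ can be calculated using

$$X = v_n \prod_{i=1}^{n-1} P_i + \cdots + v_3 P_2 P_1 + v_2 P_1 + v_1 \tag{1}$$

The coefficients $\{v_1, v_2, \ldots, v_n\}$ can be obtained from the following formulas:

$$v_1 = x_1 \tag{2}$$

$$v_2 = \left| (x_2 - v_1) \left|P_1^{-1}\right|_{P_2} \right|_{P_2} \tag{3}$$

$$v_3 = \left| \left( (x_3 - v_1) \left|P_1^{-1}\right|_{P_3} - v_2 \right) \left|P_2^{-1}\right|_{P_3} \right|_{P_3} \tag{4}$$

In general,

$$v_n = \left| \left( \cdots \left( (x_n - v_1) \left|P_1^{-1}\right|_{P_n} - v_2 \right) \left|P_2^{-1}\right|_{P_n} - \cdots - v_{n-1} \right) \left|P_{n-1}^{-1}\right|_{P_n} \right|_{P_n} \tag{5}$$

where $\left|P_i^{-1}\right|_{P_j}$ is the multiplicative inverse of $P_i$ modulo $P_j$ [11].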
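As a minimal sketch of Equations (1) to (5), the generic MRC can be written in a few lines of Python. The function name `mrc` is hypothetical, and the inverses are computed with Python's built-in `pow(a, -1, m)` (available from Python 3.8) rather than the closed forms derived later in this paper.

```python
def mrc(residues, moduli):
    """Mixed-radix conversion: returns the digits [v1, ..., vn] and X."""
    v = []
    for k, (x, p) in enumerate(zip(residues, moduli)):
        t = x
        for j in range(k):
            # Subtract the earlier digit v_j, then multiply by |P_j^{-1}|_{P_k}.
            t = ((t - v[j]) * pow(moduli[j], -1, p)) % p
        v.append(t)
    # Equation (1): X = v1 + v2*P1 + v3*P1*P2 + ...
    X, weight = 0, 1
    for digit, p in zip(v, moduli):
        X += digit * weight
        weight *= p
    return v, X


digits, X = mrc((33, 7, 5, 63), (63, 127, 31, 64))
print(digits, X)
```

The nested loop makes the serial dependency of MRC visible: digit $v_k$ needs all earlier digits, which is the delay that the two-level organization in Section 3 aims to shorten.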

Three types of adders are used to realize the hardware architecture of the reverse converter: Carry Save Adders (CSAs), which are regular for operations in modulo $2^n$ and use an End Around Carry (EAC) for operations in modulo $2^k-1$; Carry Propagate Adders (CPAs); and Modular Adders (MAs). For an MA in modulo $2^k-1$, a CPA with EAC is used, which has similar area and twice the delay of a regular CPA [22]. These are explained further in Section 4.

3 Proposed RNS to Binary Converter

The two-level architecture, realized by the MRC method, can lead to an efficient implementation of the RNS-to-binary converter for the moduli set $\Psi = \{2^{n-1}-1, 2^{n+1}-1, 2^n-1, 2^n\}$. In the first step, the number $Y$ is calculated from the residues in the subset $\Gamma = \{2^n-1, 2^{n+1}-1, 2^{n-1}-1\}$ using MRC in a parallel manner. In the second step, the MRC method is applied to the superset $\Lambda = \{(2^{n-1}-1)(2^{n+1}-1)(2^n-1), 2^n\}$ and the final result is realized. The proposed reverse converter scheme is composed of two parts, as shown in Figure 1. The details are presented in the next subsections.

Figure 1. Proposed schema for residue-to-binary conversion. (First step: MRC on the subset $\{2^n-1, 2^{n+1}-1, 2^{n-1}-1\}$ with inputs $x_1$, $x_2$, $x_3$ produces $Y$. Second step: MRC on the superset $\{(2^n-1)(2^{n+1}-1)(2^{n-1}-1), 2^n\}$ with inputs $Y$ and $x_4$ produces $X$.)

3.1 First Step Design

In the first step, the reverse converter of the subset $\Gamma$ is designed. In order to decrease the delay caused by the serial nature of the MRC method, the approach proposed in [23] is used. With this approach, more parallelism is obtained without noteworthy hardware redundancy. Also, to reduce the total delay of the architecture, all moduli of the form $2^k-1$ are handled in the first step, while the modulus $2^n$ is included in the next step. Using modulo $2^n$ in the second step leads to a significant improvement in delay, because this modulus is faster than moduli of the form $2^k-1$. The first step of the design is described as follows. The weighted number $Y$ can be calculated as

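$$Y = Z_1 + Z_2 P_1 + Z_3 P_1 P_2 \tag{6}$$

where

$$Z_1 = x_1 \tag{7}$$

$$Z_2 = \left| (x_2 - x_1) \left|P_1^{-1}\right|_{P_2} \right|_{P_2} \tag{8}$$

$$Z_3 = \left| \left( (x_3 - x_1) \left|P_1^{-1}\right|_{P_3} - Z_2 \right) \left|P_2^{-1}\right|_{P_3} \right|_{P_3} \tag{9}$$

and $P_1 = 2^n-1$, $P_2 = 2^{n+1}-1$ and $P_3 = 2^{n-1}-1$.

Proposition 1. The multiplicative inverse of $P_1$ modulo $P_2$ is $\left|P_1^{-1}\right|_{P_2} = -2$.

Proof. By the definition of the multiplicative inverse, we have:

$$\left| (2^n-1) \times \left|P_1^{-1}\right|_{P_2} \right|_{2^{n+1}-1} = 1$$

$$\rightarrow \left| (2^n-1) \times (-2) \right|_{2^{n+1}-1} = \left| 2 - 2^{n+1} \right|_{2^{n+1}-1} = \left| 1 - (2^{n+1}-1) \right|_{2^{n+1}-1} = 1$$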

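Proposition 2. The multiplicative inverse of $P_1$ modulo $P_3$ is $\left|P_1^{-1}\right|_{P_3} = 1$.

Proof. Based on the definition of the multiplicative inverse, it is clear that:

$$\left| (2^n-1) \times \left|P_1^{-1}\right|_{P_3} \right|_{2^{n-1}-1} = 1$$

$$\rightarrow \left| 2^n-1 \right|_{2^{n-1}-1} = \left| 2 \times (2^{n-1}-1) + 1 \right|_{2^{n-1}-1} = 1$$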

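Proposition 3. The multiplicative inverse of $P_2$ modulo $P_3$ is $\left|P_2^{-1}\right|_{P_3} = \sum_{i=0}^{\frac{n}{2}-1} 2^{2i}$.

Proof. Based on the definition of the multiplicative inverse:

$$\left| (2^{n+1}-1) \times \left|P_2^{-1}\right|_{P_3} \right|_{2^{n-1}-1} = 1$$

$$\rightarrow \left| (2^{n+1}-1) \times \sum_{i=0}^{\frac{n}{2}-1} 2^{2i} \right|_{2^{n-1}-1} = \left| (2^{n+1}-1) \times \frac{1-2^n}{-3} \right|_{2^{n-1}-1}$$

$$= \left| \left( 4 \times (2^{n-1}-1) + 3 \right) \times \frac{2^n-1}{3} \right|_{2^{n-1}-1} = \left| 2^n-1 \right|_{2^{n-1}-1} = \left| 2 \times (2^{n-1}-1) + 1 \right|_{2^{n-1}-1} = 1$$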
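Propositions 1 to 3 can be checked numerically. The following Python sketch uses a hypothetical helper `inverses` and assumes even $n \ge 4$, as implied by the upper summation limit $\frac{n}{2}-1$ in Proposition 3; it is a sanity check, not part of the converter.

```python
def inverses(n):
    """Closed-form inverses from Propositions 1-3 for an even n >= 4."""
    P1, P2, P3 = 2**n - 1, 2**(n + 1) - 1, 2**(n - 1) - 1
    inv_p1_mod_p2 = -2 % P2                                   # Proposition 1
    inv_p1_mod_p3 = 1                                         # Proposition 2
    inv_p2_mod_p3 = sum(2**(2 * i) for i in range(n // 2))    # Proposition 3
    return (P1, P2, P3), (inv_p1_mod_p2, inv_p1_mod_p3, inv_p2_mod_p3)


for n in range(4, 33, 2):
    (P1, P2, P3), (a, b, c) = inverses(n)
    assert (P1 * a) % P2 == 1
    assert (P1 * b) % P3 == 1
    assert (P2 * c) % P3 == 1

print(inverses(6)[1])
```

For $n = 6$ the three inverses are 125 (that is, $-2$ modulo 127), 1 and 21.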

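After obtaining the multiplicative inverses, $Z_2$ can be calculated as follows:

$$Z_2 = \left| (x_2 - x_1) \times (-2) \right|_{2^{n+1}-1} \tag{10}$$

Lemma 1. If $V$ is an $n$-bit number in the interval $[0, 2^n-1]$, the residue of $(-V)$ in modulo $2^n-1$ equals the one's complement of $V$ [24].

Lemma 2. If $V$ is an $n$-bit number in the interval $[0, 2^n-1]$, the multiplication of $V$ by $2^p$ in modulo $2^n-1$ equals its $p$-bit circular left shift [24].

By multiplying $x_2 - x_1$ by $-2$, based on Lemma 2, $Z_2$ results as:

$$Z_2 = \left| L_1 - L_2 \right|_{2^{n+1}-1} \tag{11}$$

where $L_1$ and $L_2$ are the $(n+1)$-bit values of $2x_1$ and $2x_2$ in modulo $2^{n+1}-1$, each obtained by a 1-bit circular left shift:

$$L_1 = x_{1,n-1} \cdots x_{1,0}\,0 \tag{12}$$

$$L_2 = x_{2,n-1} \cdots x_{2,0}\,x_{2,n} \tag{13}$$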

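$$Z_2 = \begin{cases} L_1 - L_2 & \text{if } L_1 - L_2 \ge 0 \\ L_1 - L_2 + (2^{n+1}-1) & \text{if } L_1 - L_2 < 0 \end{cases} \tag{14}$$

To calculate $Z_3$, after calculating $\left|P_1^{-1}\right|_{P_3}$ and $\left|P_2^{-1}\right|_{P_3}$, the results are substituted into Equation (9) as follows:

$$Z_3 = \left| \left( (x_3 - x_1) \times 1 - Z_2 \right) \times \left( 2^0 + 2^2 + \cdots + 2^{n-2} \right) \right|_{2^{n-1}-1} = \left| (x_3 - x_1 - Z_2) \times \left( 2^0 + 2^2 + \cdots + 2^{n-2} \right) \right|_{2^{n-1}-1} \tag{15}$$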

To avoid waiting for the computation of $Z_2$ in modulo $2^{n+1}-1$ when computing $Z_3$, the following method can be used. The result of subtracting $L_2$ from $L_1$ is either a non-negative number smaller than $2^{n+1}-1$ or a negative number greater than $1-2^{n+1}$. In the first case the result is already reduced modulo $2^{n+1}-1$; however, when $L_1 - L_2$ is negative, $2^{n+1}-1$ must be added to it. The outgoing carry of the adder used for subtracting $L_2$ from $L_1$ distinguishes the two cases indicated in Equation (14).
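The two cases of Equation (14) can be sketched in Python as follows. The helper names `cls` and `z2_from_shifts` are hypothetical; $L_1$ and $L_2$ are taken to be the 1-bit circular left shifts of $x_1$ and $x_2$ in $n+1$ bits (an assumption that matches the values in the numerical example of Section 3.3), and the Boolean `negative` stands in for the adder's carry-out.

```python
def cls(v, k, width):
    """k-bit circular left shift of a width-bit value (Lemma 2)."""
    k %= width
    mask = (1 << width) - 1
    return ((v << k) | (v >> (width - k))) & mask


def z2_from_shifts(x1, x2, n):
    """Z2 = |L1 - L2| modulo 2^(n+1) - 1, selected as in Equation (14)."""
    m = (1 << (n + 1)) - 1
    L1 = cls(x1, 1, n + 1)          # 2*x1; x1 has n bits, so its MSB here is 0
    L2 = cls(x2, 1, n + 1)          # |2*x2| modulo 2^(n+1) - 1
    diff = L1 - L2
    negative = diff < 0             # plays the role of the carry-out signal
    return (diff + m if negative else diff), negative


print(z2_from_shifts(33, 7, 6))
```

The same flag is what later selects whether the correction term $Z_{3,8}$ is added when $Z_3$ is formed.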

If $L_1 \ge L_2$, $Z_3$ can be obtained from Equation (16):

$$Z_3 = \left| (x_3 - x_1 - L_1 + L_2) \times \left( 2^0 + 2^2 + \cdots + 2^{n-2} \right) \right|_{2^{n-1}-1} \tag{16}$$

For simplicity, $x_3 - x_1 - L_1 + L_2$ is rewritten in its bit-level representation and then split into numbers of length $n-1$ bits, which eases applying the coefficient $2^0 + 2^2 + \cdots + 2^{n-2}$.

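$$Z_3 = \left| \left( \begin{aligned} & x_{3,n-2} \cdots x_{3,0} - \underbrace{0 \cdots 0}_{n-2} x_{1,n-1} - x_{1,n-2} \cdots x_{1,0} \\ & - \underbrace{0 \cdots 0}_{n-3} L_{1,n} L_{1,n-1} - L_{1,n-2} \cdots L_{1,0} \\ & + \underbrace{0 \cdots 0}_{n-3} L_{2,n} L_{2,n-1} + L_{2,n-2} \cdots L_{2,0} \end{aligned} \right) \times \left( 2^0 + 2^2 + \cdots + 2^{n-2} \right) \right|_{2^{n-1}-1} \tag{17}$$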

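and using Lemma 1:

$$Z_3 = \left| \left( \begin{aligned} & x_{3,n-2} \cdots x_{3,0} + \underbrace{1 \cdots 1}_{n-2} \bar{x}_{1,n-1} + \bar{x}_{1,n-2} \cdots \bar{x}_{1,0} \\ & + \underbrace{1 \cdots 1}_{n-3} \bar{L}_{1,n} \bar{L}_{1,n-1} + \bar{L}_{1,n-2} \cdots \bar{L}_{1,0} \\ & + \underbrace{0 \cdots 0}_{n-3} L_{2,n} L_{2,n-1} + L_{2,n-2} \cdots L_{2,0} \end{aligned} \right) \times \left( 2^0 + 2^2 + \cdots + 2^{n-2} \right) \right|_{2^{n-1}-1} \tag{18}$$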

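Equation (18) can be simplified as follows:

$$Z_3 = \left| \left( Z_{3,1} + Z_{3,2} + Z_{3,3} + Z_{3,4} + Z_{3,5} + Z_{3,6} + Z_{3,7} \right) \times \left( 2^0 + 2^2 + \cdots + 2^{n-2} \right) \right|_{2^{n-1}-1} \tag{19}$$

where

$$\begin{aligned} Z_{3,1} &= x_{3,n-2} \cdots x_{3,0} \\ Z_{3,2} &= \underbrace{1 \cdots 1}_{n-2} \bar{x}_{1,n-1} \\ Z_{3,3} &= \bar{x}_{1,n-2} \cdots \bar{x}_{1,0} \\ Z_{3,4} &= \underbrace{1 \cdots 1}_{n-3} \bar{L}_{1,n} \bar{L}_{1,n-1} \\ Z_{3,5} &= \bar{L}_{1,n-2} \cdots \bar{L}_{1,0} \\ Z_{3,6} &= \underbrace{0 \cdots 0}_{n-3} L_{2,n} L_{2,n-1} \\ Z_{3,7} &= L_{2,n-2} \cdots L_{2,0} \end{aligned}$$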

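In the other case, when $L_1 < L_2$,

$$Z_3 = \left| \left( x_3 - x_1 - L_1 + L_2 - (2^{n+1}-1) \right) \times \left( 2^0 + 2^2 + \cdots + 2^{n-2} \right) \right|_{2^{n-1}-1}$$

and since $\left| -(2^{n+1}-1) \right|_{2^{n-1}-1} = \left| -3 \right|_{2^{n-1}-1}$, the following expression results:

$$\left| -(2^{n+1}-1) \times \left( 2^0 + 2^2 + \cdots + 2^{n-2} \right) \right|_{2^{n-1}-1} = \left| -1 \right|_{2^{n-1}-1} = \underbrace{1 \cdots 1}_{n-2} 0 \tag{20}$$

Therefore, $Z_3$ can be rewritten as

$$Z_3 = \left| \left( Z_{3,1} + Z_{3,2} + Z_{3,3} + Z_{3,4} + Z_{3,5} + Z_{3,6} + Z_{3,7} \right) \times \left( 2^0 + 2^2 + \cdots + 2^{n-2} \right) + Z_{3,8} \right|_{2^{n-1}-1} \tag{21}$$

where

$$Z_{3,8} = \underbrace{1 \cdots 1}_{n-2} 0$$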

In Figure 2, the operands of $Z_3$ are generated by Operand Preparation Unit 1 (OPU1) with $x_1$, $x_2$ and $x_3$ as its inputs. The values of $Z_{3,1}$, $Z_{3,2}$, $Z_{3,3}$, $Z_{3,4}$, $Z_{3,5}$, $Z_{3,6}$ and $Z_{3,7}$ are then reduced to $S_1$ and $C_1$ by CSA1, CSA2, CSA3, CSA4 and CSA5.

$Z_3$ is then obtained as

$$Z_3 = \begin{cases} \left| (S_1 + C_1) \times \left( 2^0 + 2^2 + \cdots + 2^{n-2} \right) \right|_{2^{n-1}-1} & \text{if } L_1 - L_2 \ge 0 \\ \left| (S_1 + C_1) \times \left( 2^0 + 2^2 + \cdots + 2^{n-2} \right) + Z_{3,8} \right|_{2^{n-1}-1} & \text{if } L_1 - L_2 < 0 \end{cases} \tag{22}$$

Based on Lemma 2,

$$Z_3 = \begin{cases} \left| \varphi + \theta \right|_{2^{n-1}-1} & \text{if } L_1 - L_2 \ge 0 \\ \left| \varphi + \theta + Z_{3,8} \right|_{2^{n-1}-1} & \text{if } L_1 - L_2 < 0 \end{cases} \tag{23}$$

$$\varphi = \sum_{i=0}^{\frac{n-2}{2}} \mathrm{CLS}(S_1, 2i) \tag{24}$$

$$\theta = \sum_{i=0}^{\frac{n-2}{2}} \mathrm{CLS}(C_1, 2i) \tag{25}$$

where $\mathrm{CLS}(x, y)$ denotes the $y$-bit circular left shift of $x$. Operand Preparation Unit 2 (OPU2) is used to implement $Z_3$ with $S_1$ and $C_1$ as its inputs. OPU2 generates multiple $(n-1)$-bit outputs, namely $\mathrm{CLS}(C_1, n-2), \ldots, \mathrm{CLS}(C_1, 0)$ and $\mathrm{CLS}(S_1, n-2), \ldots, \mathrm{CLS}(S_1, 0)$. To obtain a compact final output, all of these outputs must be reduced by a CSA tree. The output of this CSA tree and the output of the MUX are connected to the CSA block and then to the modulo $2^{n-1}-1$ adder (MA2). The final result of MA2 is the $Z_3$ signal, which will be used in the next step.
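The shift-and-add evaluation of $Z_3$ in Equations (22) to (25) can be sketched in Python as follows. This is a simplified model under stated assumptions: the helper names `cls` and `z3_by_shifts` are hypothetical, $n$ is assumed even, and a single reduced operand `t` stands in for the carry-save pair $S_1 + C_1$ instead of modelling the CSA tree bit by bit.

```python
def cls(v, k, width):
    """k-bit circular left shift of a width-bit value (Lemma 2)."""
    k %= width
    mask = (1 << width) - 1
    return ((v << k) | (v >> (width - k))) & mask


def z3_by_shifts(x1, x2, x3, n):
    """Z3 via circular shifts in n-1 bits, following Equations (16)-(25)."""
    w = n - 1
    m3 = (1 << w) - 1                      # 2^(n-1) - 1
    m2 = (1 << (n + 1)) - 1                # 2^(n+1) - 1
    L1, L2 = (2 * x1) % m2, (2 * x2) % m2
    t = x3 - x1 - L1 + L2                  # operand of Equation (16)
    if L1 < L2:
        t -= m2                            # the case that produces Z_{3,8}
    t %= m3                                # stands in for S1 + C1, reduced
    # Multiplying by 2^0 + 2^2 + ... + 2^(n-2) becomes a sum of circular shifts.
    return sum(cls(t, 2 * i, w) for i in range(n // 2)) % m3


print(z3_by_shifts(33, 7, 5, 6))
```

For the operands of the numerical example ($n = 6$) the sketch returns 25, which agrees with $Z_3 = 11001_2$.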

$Z_2$ is the other signal that must be prepared for the next step. To decrease the delay of the proposed design, $Z_2$ is computed in parallel with $Z_3$. Based on Equation (11), $Z_2$ is obtained by a modular adder in modulo $2^{n+1}-1$, named MA1. Also, $\bar{Z}_2$ is needed in the second step of the design; therefore, the output of MA1 goes to Operand Preparation Unit 3 (OPU3), which consists of $n+1$ inverters, to produce $\bar{Z}_2$. In the first step, all of the above-mentioned calculations are done in parallel with computing $Z_3$. Thus, the delay of computing $Z_2$ is not part of the critical path delay. Hardware implementations of $Z_2$ and $Z_3$ are shown in Figure 2.

After the calculation of $Z_2$ and $Z_3$, $Y$ can be obtained from its residues in the 3-moduli set $\Gamma$ as

$$Y = Z_1 + Z_2 P_1 + Z_3 P_1 P_2 \tag{26}$$

$$Y = Z_1 + Z_2 \times (2^n-1) + Z_3 \times (2^n-1) \times (2^{n+1}-1) \tag{27}$$

At the next level of design, $Z_1$, $Z_2$ and $Z_3$ are used with consideration of the value of $Y$. Furthermore, $Z_2$ and $Z_3$ are computed at the first stage for a more parallel architecture. There is no need to compute the final value of $Y$ at the first stage. Only some arrangements of $Z_1$, $Z_2$ and $Z_3$ that are needed in computing $Y$, denoted by $y_i$, $i = 1, \ldots, 7$, are used at the next level of design. Therefore, the delay of computing $Y$ is omitted. The $y_i$ signals are given by the following expressions:

$$\begin{aligned} Y &= Z_1 + Z_2 (2^n-1) + Z_3 (2^n-1)(2^{n+1}-1) \\ &= Z_1 + Z_2 \underbrace{0 \cdots 0}_{n} - Z_2 + Z_3 \underbrace{0 \cdots 0}_{2n+1} - Z_3 \underbrace{0 \cdots 0}_{n+1} - Z_3 \underbrace{0 \cdots 0}_{n} + Z_3 \end{aligned} \tag{28}$$

$$Y = y_1 + y_2 + y_3 + y_4 + y_5 + y_6 + y_7 \tag{29}$$

where $y_1 = Z_1$, $y_2 = Z_2 \underbrace{0 \cdots 0}_{n}$, $y_3 = -Z_2$, $y_4 = Z_3 \underbrace{0 \cdots 0}_{2n+1}$, $y_5 = -\left( Z_3 \underbrace{0 \cdots 0}_{n+1} \right)$, $y_6 = -\left( Z_3 \underbrace{0 \cdots 0}_{n} \right)$ and $y_7 = Z_3$.

3.2 Second Step Design

After the computation of $y_i$, $i = 1, \ldots, 7$, the two-moduli superset $\Lambda$ is considered for obtaining the weighted number $X$. The residues of the weighted number $X$ in modulo $P_{123}$ and $P_4$ are equal to $Y$ and $x_4$, respectively, where $P_{123} = (2^n-1) \times (2^{n+1}-1) \times (2^{n-1}-1)$ and $P_4 = 2^n$. The MRC method for a moduli set with two moduli is used to calculate $X$ as follows:

$$X = v_1 + v_2 P_{123} \tag{30}$$

where

$$v_1 = Y \tag{31}$$

$$v_2 = \left| (x_4 - Y) \left|P_{123}^{-1}\right|_{P_4} \right|_{P_4} \tag{32}$$

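Proposition 4. The multiplicative inverse of $P_{123}$ modulo $P_4$ is equal to $-2^{n-1}-1$.

Proof. According to the definition of the multiplicative inverse, we have:

$$\left| (2^n-1) \times (2^{n+1}-1) \times (2^{n-1}-1) \times \left|P_{123}^{-1}\right|_{P_4} \right|_{2^n} = 1$$

$$\rightarrow \left| (2^n-1) \times (2^{n+1}-1) \times (2^{n-1}-1) \times (-2^{n-1}-1) \right|_{2^n} = \left| (-1) \times (-1) \times (-2^{2n-2}+1) \right|_{2^n} = \left| (-1) \times (-1) \times 1 \right|_{2^n} = 1$$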

thus,

$$v_2 = \left| (x_4 - Y) \times (-1) \times (2^{n-1}+1) \right|_{2^n} = \left| (Y - x_4) \times (2^{n-1}+1) \right|_{2^n} \tag{33}$$

By replacing $Y$ based on Equation (29), $v_2$ can be rewritten as:

$$v_2 = \left| \left( y_1 + y_2 + y_3 + y_4 + y_5 + y_6 + y_7 + \bar{x}_4 + 1 \right) \times (2^{n-1}+1) \right|_{2^n} \tag{34}$$

Figure 2. Hardware schema for first step design. (OPU1 takes $x_1$, $x_2$, $x_3$ and produces $Z_{3,1}, \ldots, Z_{3,7}$, $L_1$ and $L_2$; the $(n-1)$-bit CSAs with EAC, CSA1 to CSA5, reduce these to $S_1$ and $C_1$; OPU2 produces $\mathrm{CLS}(S_1, 0), \ldots, \mathrm{CLS}(S_1, n-2)$ and $\mathrm{CLS}(C_1, 0), \ldots, \mathrm{CLS}(C_1, n-2)$ for the $(n-1)$-input CSA tree with EAC; a comparator on $L_1$ and $L_2$ drives a MUX that selects $Z_{3,8}$ or 0 for CSA6; the modulo $2^{n-1}-1$ adder MA2 outputs $Z_3$, and the modulo $2^{n+1}-1$ adder MA1 outputs $Z_2$.)

Accordingly, only the $n$ lowest-weighted bits are used in the operations. Therefore, $v_2$ can be expressed as:

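$$v_2 = \left| Z_{1,0} \underbrace{0 \cdots 0}_{n-1} + \bar{Z}_{2,0} \underbrace{0 \cdots 0}_{n-1} + Z_{3,0} Z_3 + \bar{x}_{4,0} \underbrace{0 \cdots 0}_{n-1} + \bar{x}_4 + Z_1 + \bar{Z}_2 + \mu \right|_{2^n} \tag{35}$$

$$v_2 = \left| k_1 + k_2 + Z_1 + \bar{Z}_2 + \mu \right|_{2^n} \tag{36}$$

where

$$k_1 = \mathrm{XOR}\left( \bar{x}_{4,0}, \bar{x}_{4,n-1} \right) \bar{x}_{4,n-2} \cdots \bar{x}_{4,1} \bar{x}_{4,0} \tag{37}$$

$$k_2 = \mathrm{XOR}\left( Z_{1,0}, \bar{Z}_{2,0}, Z_{3,0} \right) Z_3 \tag{38}$$

$$\mu = \underbrace{0 \cdots 0}_{n-2} 10 \tag{39}$$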
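Equations (36) to (39) can be exercised with a short Python sketch. The function name `v2_from_operands` is hypothetical; $\bar{Z}_2$ is modelled as the $n$-bit complement of $Z_2$ with its most significant bit dropped, which is an interpretation of the text that reproduces the values $k_1$, $k_2$ and $v_2$ of the numerical example.

```python
def v2_from_operands(Z1, Z2, Z3, x4, n):
    """v2 = |k1 + k2 + Z1 + ~Z2 + mu| modulo 2^n, Equations (36)-(39)."""
    mask = (1 << n) - 1
    msb = n - 1
    x4_bar = ~x4 & mask                              # n-bit complement of x4
    Z2_bar = ~Z2 & mask                              # complement of Z2, MSB dropped
    k1 = x4_bar ^ ((x4_bar & 1) << msb)              # Equation (37)
    k2 = (((Z1 ^ Z2_bar ^ Z3) & 1) << msb) | Z3      # Equation (38)
    mu = 0b10                                        # Equation (39)
    return (k1 + k2 + Z1 + Z2_bar + mu) & mask       # Equation (36)


print(v2_from_operands(33, 52, 25, 63, 6))
```

With $Z_1 = 33$, $Z_2 = 52$, $Z_3 = 25$ and $x_4 = 63$ the sketch gives $k_1 = 0$, $k_2 = 57$ and $v_2 = 39$.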

The structure of OPU4 is implemented based on the above equations for $v_2$. Its inputs are $Z_1$, $Z_2$, $Z_3$ and $x_4$, and its outputs are $k_1$ and $k_2$. CSA7 neglects the $n$th bit of its outputs and places a 0 in the least significant bit of its carry output. It also omits the most significant bit of $\bar{Z}_2$, in accordance with Equation (29) of the previous subsection. The same procedure is followed by CSA8 and CSA9. Finally, the outputs of the CSAs go to modular adder 3 (MA3) as its inputs to compute $v_2$. The hardware implementation of $v_2$ is shown in Figure 3.

Figure 3. Calculation of $v_2$. (OPU4 takes $x_4$, $Z_1$, $Z_2$, $Z_3$ and produces $k_1$ and $k_2$; the $n$-bit CSAs CSA7, CSA8 and CSA9 reduce $k_1$, $k_2$, $Z_1$, $\bar{Z}_2$ and $\mu$; the modulo $2^n$ adder MA3 outputs $v_2$.)

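Substituting $P_{123}$ into Equation (30) gives

$$X = v_1 + v_2 \times (2^n-1) \times (2^{n+1}-1) \times (2^{n-1}-1) \tag{40}$$

Since $v_1 = Y$, $v_1$ is replaced by $Y$ as follows:

$$X = Y + v_2 \times (2^n-1) \times (2^{n+1}-1) \times (2^{n-1}-1) \tag{41}$$

Equation (41) can be simplified as

$$\begin{aligned} X ={} & v_2 Z_3 Z_2 Z_1 - \underbrace{0 \cdots 0}_{n-1} v_2 \underbrace{0 \cdots 0}_{2n+1} - \underbrace{0 \cdots 0}_{n} v_2 Z_3 Z_2 - \underbrace{0 \cdots 0}_{n+1} v_2 Z_3 v_2 \\ & + v_2 \underbrace{0 \cdots 0}_{n+1} + v_2 \underbrace{0 \cdots 0}_{n} + v_2 Z_3 \end{aligned} \tag{42}$$

$$\begin{aligned} X ={} & v_2 Z_3 Z_2 Z_1 + \underbrace{1 \cdots 1}_{n-1} \bar{v}_2 \underbrace{1 \cdots 1}_{2n+1} + \underbrace{1 \cdots 1}_{n} \bar{v}_2 \bar{Z}_3 \bar{Z}_2 + \underbrace{1 \cdots 1}_{n+1} \bar{v}_2 \bar{Z}_3 \bar{v}_2 \\ & + v_2 \underbrace{0 \cdots 0}_{n-1} 11 + v_2 \underbrace{0 \cdots 0}_{n} + v_2 Z_3 \end{aligned} \tag{43}$$

Here juxtaposition denotes bit concatenation, with $Z_1$ and $v_2$ of $n$ bits, $Z_2$ of $n+1$ bits and $Z_3$ of $n-1$ bits; the three subtractions in Equation (42) are replaced by one's complements in Equation (43), and the three resulting $+1$ corrections appear as the trailing $11$ of the fifth term.

$X$ is the summation of seven values, $\sum_{k=1}^{7} X_k$, where $X_1 = v_2 Z_3 Z_2 Z_1$, $X_2 = \underbrace{1 \cdots 1}_{n-1} \bar{v}_2 \underbrace{1 \cdots 1}_{2n+1}$, $X_3 = \underbrace{1 \cdots 1}_{n} \bar{v}_2 \bar{Z}_3 \bar{Z}_2$, $X_4 = \underbrace{1 \cdots 1}_{n+1} \bar{v}_2 \bar{Z}_3 \bar{v}_2$, $X_5 = v_2 \underbrace{0 \cdots 0}_{n-1} 11$, $X_6 = v_2 \underbrace{0 \cdots 0}_{n}$ and $X_7 = v_2 Z_3$. Each $X_k$, with a bit length of $4n$, enters the carry save adder tree, and the outputs of the carry save adder tree are connected to the inputs of a $4n$-bit CPA to compute the weighted number $X$. Figure 4 depicts this architecture.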
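The final assembly of $X$ from seven $4n$-bit operands can be modelled in Python as below. The function name `assemble_x` is hypothetical, and the widths of the leading runs of ones follow from requiring every operand to be exactly $4n$ bits wide (with $Z_1$, $v_2$ of $n$ bits, $Z_2$ of $n+1$ bits and $Z_3$ of $n-1$ bits); this is a sketch of the arithmetic, not of the CSA tree itself.

```python
def assemble_x(Z1, Z2, Z3, v2, n):
    """Sum the seven 4n-bit operands X1..X7 modulo 2^(4n)."""
    mask = (1 << (4 * n)) - 1

    def inv(a):
        return ~a & mask                   # 4n-bit one's complement

    X1 = (v2 << (3 * n)) | (Z3 << (2 * n + 1)) | (Z2 << n) | Z1
    X2 = inv(v2 << (2 * n + 1))
    X3 = inv((v2 << (2 * n)) | (Z3 << (n + 1)) | Z2)
    X4 = inv((v2 << (2 * n - 1)) | (Z3 << n) | v2)
    X5 = (v2 << (n + 1)) | 0b11            # "11" absorbs the three +1 corrections
    X6 = v2 << n
    X7 = (v2 << (n - 1)) | Z3
    return (X1 + X2 + X3 + X4 + X5 + X6 + X7) & mask


print(assemble_x(33, 52, 25, 39, 6))
```

For the digits of the numerical example the result is 9876543, matching $Y + v_2 P_{123}$.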

Figure 4. Hardware implementation for calculation of $X$. (Operand Preparation Unit 5 takes $v_2$, $Z_1$, $Z_2$, $Z_3$ and produces $X_1, \ldots, X_7$; these are reduced by the $(2n+1)$-bit CSA10, $(3n+1)$-bit CSA11, $(3n+2)$-bit CSA12, $(3n+3)$-bit CSA13 and $4n$-bit CSA14; a $4n$-bit CPA outputs $X$.)

3.3 Numerical Example

Considering the moduli set $\{63, 127, 31, 64\}$, which is derived from the moduli set $\Psi$ when $n = 6$, the RNS number $(33, 7, 5, 63)$ can be converted to its equivalent weighted number $X$ as follows.

First stage:

$x_1 = 33_{10} = 100001_2$, $x_2 = 7_{10} = 0000111_2$, $x_3 = 5_{10} = 00101_2$

By substituting these values in Equations (6), (11), (19) and (21), the following results are obtained:

$Z_1 = 33_{10} = 100001_2$, $L_1 = 1000010_2$, $L_2 = 0001110_2$, $Z_2 = 0110100_2$, $Z_3 = 11001_2$

Second stage: by considering Equations (31), (36) and (43), the desired values in the second step are obtained:

$v_1 = Y = 33 + 52 \times 63 + 25 \times 63 \times 127 = 203334_{10} = 110001101001000110_2$, $k_1 = 0$, $k_2 = 57_{10} = 111001_2$, $v_2 = 39_{10} = 100111_2$

$X = 203334 + 39 \times 63 \times 127 \times 31 = 9876543$

Thus $X = 9876543$, and the verification can be simply done as

$x_1 = |9876543|_{63} = 33$, $x_2 = |9876543|_{127} = 7$, $x_3 = |9876543|_{31} = 5$, $x_4 = |9876543|_{64} = 63$

4 Hardware Cost and the Delay of Proposed Converter

Table 2. Different conditions of a full adder cell according to constant inputs

| Number of constant inputs | Constant value | Reduced gates |
|---|---|---|
| 1 | 1 | A pair of two-input XNOR and OR gates |
| 1 | 0 | A pair of two-input XOR and AND gates |
| 2 | Same value (both 0 or both 1) | Wire |
| 2 | Different values (one input 0, the other 1) | Inverter gate |

which two numbers are aggregated in a ripple structure. The logic function of a full adder is described by the following equations:

$$Sum = \mathrm{XOR}(x, y, z) = xyz + x\bar{y}\bar{z} + \bar{x}y\bar{z} + \bar{x}\bar{y}z \tag{44}$$

$$Carry = xy + xz + zy \tag{45}$$

According to Equations (44) and (45), if one of the inputs of the full adder equals 1 (for instance $z = 1$), $Sum$ and $Carry$ are equivalent to $xy + \bar{x}\bar{y}$ and $x + y$, respectively. If one of the inputs equals 0 (for instance $z = 0$), $Sum$ and $Carry$ are equivalent to $x\bar{y} + \bar{x}y$ and $xy$, respectively. In these two cases only two-input gates are used. If two inputs have constant values, the simplification process is the same as above. Table 2 shows the different conditions of a full adder cell according to constant inputs.
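The simplifications above can be confirmed by enumerating the truth table. The following Python sketch (with a hypothetical `full_adder` helper) checks each constant-input case against Equations (44) and (45).

```python
from itertools import product


def full_adder(x, y, z):
    """Sum and carry of a one-bit full adder, Equations (44) and (45)."""
    s = x ^ y ^ z
    c = (x & y) | (x & z) | (z & y)
    return s, c


for x, y in product((0, 1), repeat=2):
    # One constant input equal to 1: Sum is XNOR(x, y), Carry is OR(x, y).
    assert full_adder(x, y, 1) == (1 - (x ^ y), x | y)
    # One constant input equal to 0: Sum is XOR(x, y), Carry is AND(x, y).
    assert full_adder(x, y, 0) == (x ^ y, x & y)

for x in (0, 1):
    # Two equal constants: Sum is x itself (a wire), Carry is the constant.
    assert full_adder(x, 0, 0) == (x, 0)
    assert full_adder(x, 1, 1) == (x, 1)
    # Two different constants: Sum is NOT x (an inverter), Carry is x.
    assert full_adder(x, 0, 1) == (1 - x, x)

print("all constant-input cases verified")
```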

In the first step of the design of the reverse converter, the modular adders (MA1, MA2) are implemented by CPAs with EAC (Figure 2). The modulo $2^k-1$ adder has similar area and twice the delay of a $k$-bit CPA. The last modular adder (MA3) is implemented by a regular $n$-bit CPA whose carry-out is neglected (Figure 3). The CSAs used in the design of the reverse converter are divided into regular CSAs and CSAs with EAC [13].

(1) In the first level of design, the CSAs of the reduction tree are CSAs with EAC. The output of the final CSA enters the modulo $2^{n-1}-1$ adder (MA2).

(2) The summation of five operands in modulo $2^n$ computes $v_2$ at the start of the second step of design. The reduction tree used for computing $v_2$ employs three regular CSAs. Because the output of the reduction tree must be in modulo $2^n$, the carry-out signal of every CSA in the architecture shown in Figure 3 is neglected to save hardware.

(3) In the last part of the second step of design, for computing $X$, CSAs whose inputs are signals with different bit widths are used.

The only difference between a regular CSA and a CSA with EAC is how the carry output $c$ of the CSA is handled. Figure 5 shows the basic architecture of a CSA and a CSA with EAC. The delay of an $n$-bit CSA equals the delay of one full adder cell. In a CSA, as in a CPA, the hardware cost can be reduced according to the constant input values. In general, the hardware cost of an $n$-bit CSA is equal to the hardware cost of $n$ full adder cells. Table 3 shows the hardware cost and the delay of the various components in the proposed reverse converter.
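The difference between the two CSA variants can be sketched as a behavioural Python model. The function names `csa` and `csa_eac` are hypothetical; the regular CSA satisfies $x + y + z = 2c + s$, while in the EAC variant the most significant carry bit is assumed to wrap around to bit 0, which keeps the result congruent modulo $2^n-1$.

```python
def csa(x, y, z):
    """Regular CSA: bitwise sum and carry, with x + y + z == 2*c + s."""
    s = x ^ y ^ z
    c = (x & y) | (x & z) | (y & z)
    return s, c


def csa_eac(x, y, z, n):
    """n-bit CSA with end-around carry, for arithmetic modulo 2^n - 1."""
    mask = (1 << n) - 1
    s, c = csa(x, y, z)
    # Doubling the carry vector modulo 2^n - 1 is a 1-bit circular left shift.
    c = ((c << 1) | (c >> (n - 1))) & mask
    return s, c


s, c = csa(9, 14, 7)
print(s, c, 2 * c + s)
s, c = csa_eac(9, 14, 7, 4)
print(s, c, (s + c) % 15)
```

In both cases the three operands are reduced to two in the time of one full adder; only the wiring of the carry vector differs.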

5 Comparison

This section compares the proposed reverse converter architecture for the moduli set $\Psi$ with converters for other balanced 4-moduli sets in the same dynamic range class, namely $\{2^n-1, 2^n, 2^n+1, 2^{n+1}-1\}$ [13, 14], $\{2^n-1, 2^n, 2^n+1, 2^{n+1}+1\}$ [13, 15], $\{2^n-3, 2^n-1, 2^n+1, 2^n+3\}$ [16] and $\{2^n, 2^{n+1}-1, 2^n-1, 2^{n-1}-1\}$ [17]. The comparisons are made in terms of the delay and the area of the reverse converter. Table 4 shows the comparison between the proposed reverse converter and its state-of-the-art counterparts. In order to achieve a fair comparison, the delay and the area of the modular adders and carry save adders are considered the same as in [24]. As shown in Table 4, the proposed reverse converter for the moduli set $\Psi$ achieves the highest speed compared to the converters for $\{2^n-1, 2^n, 2^n+1, 2^{n+1}-1\}$ [13, 14], $\{2^n-1, 2^n, 2^n+1, 2^{n+1}+1\}$ [13, 15], $\{2^n-3, 2^n-1, 2^n+1, 2^n+3\}$ [16] and $\{2^n, 2^{n+1}-1, 2^n-1, 2^{n-1}-1\}$ [17]. It is worth mentioning that the proposed reverse converter is the fastest adder-based reverse converter in the balanced 4-moduli class [17].

Figure 5. Basic architecture for (a) a Carry Save Adder, for which $x + y + z = 2c + s$, and (b) a CSA with End Around Carry, in which the carry bit $c_{n-1}$ is fed back to the least significant carry position.

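Table 3. Hardware and delay of various components in the proposed reverse converter.

| Component | Area | Delay |
|---|---|---|
| OPU1 | $(2n+1)A_{Inv}$ | $1D_{Inv}$ |
| Comparator | $3nA_{AND} + nA_{OR3}$ | $nD_{AND} + nD_{OR3}$ |
| CSA1 | $1A_{FA} + (n-2)(A_{XNOR} + A_{OR})$ | $1D_{FA}$ |
| CSA2 | $2A_{FA}$ | $1D_{FA}$ |
| CSA3 | $(n-1)A_{FA}$ | $1D_{FA}$ |
| CSA4 | $(n-1)A_{FA}$ | $1D_{FA}$ |
| CSA5 | $(n-1)A_{FA}$ | $1D_{FA}$ |
| OPU2 | 0 | 0 |
| CSA Tree | $(n^2-n)A_{FA}$ | $pD_{FA}$ |
| CSA6 | $(n-2)A_{FA} + A_{XOR} + A_{AND}$ | $1D_{FA}$ |
| MA1 | $(n+1)A_{FA}$ | $(2n+2)D_{FA}$ |
| MA2 | $(n-1)A_{FA}$ | $(2n-2)D_{FA}$ |
| CSA7 | $nA_{FA}$ | $1D_{FA}$ |
| CSA8 | $(n-1)A_{FA} + A_{XOR} + A_{AND}$ | $1D_{FA}$ |
| CSA9 | $(n-2)A_{XOR} + 1A_{XNOR} + (n-2)A_{AND} + 1A_{OR}$ | $1D_{Inv}$ |
| MA3 | $(n-3)A_{FA} + 1A_{XOR} + 1A_{AND}$ | $(n-3)D_{FA} + 1D_{HA}$ |
| OPU5 | $(2n-1)A_{Inv}$ | $1D_{Inv}$ |
| CSA10 | $(n-2)A_{FA} + 2A_{XOR} + 2A_{AND}$ | $1D_{FA}$ |
| CSA11 | $(n-2)A_{FA} + (2n+2)(A_{XNOR} + A_{OR})$ | $1D_{FA}$ |
| CSA12 | $2nA_{FA} + (n+1)(A_{XOR} + A_{AND})$ | $1D_{FA}$ |
| CSA13 | $2nA_{FA} + (n+1)(A_{XOR} + A_{AND})$ | $1D_{FA}$ |
| CSA14 | $(3n+1)A_{FA} + 2(A_{XOR} + A_{AND})$ | $1D_{FA}$ |
| CPA | $(4n-2)A_{FA} + 2(A_{XOR} + A_{AND})$ | $(4n-2)D_{FA} + 1D_{HA}$ |
| MUX 2:1 | 0 | 0 |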

Table 4. Hardware requirements and delay of reverse converters.

| Moduli set | Design | Hardware requirements | Delay |
|---|---|---|---|
| $\{2^n-1, 2^n, 2^n+1, 2^{n+1}-1\}$ | [13]-1 | $(9n + 5 + ((n-4)(n+1)/2))A_{FA} + 2nA_{XNOR} + 2nA_{OR} + (6n+1)A_{INV}$ | $((23n+12)/2)D_{FA}$ |
| $\{2^n-1, 2^n, 2^n+1, 2^{n+1}+1\}$ | [13] | $2n^2 + 11n + 3$ | $11.5nD_{FA}$ |
| $\{2^n-1, 2^n, 2^n+1, 2^{n+1}+1\}$ | [13]-2 | $(6n+7)A_{INV} + (n^2+12n+12)A_{FA} + 2n(A_{XNOR} + A_{OR}) + (4n+8)A_{2:1MUX}$ | $(16n+22)D_{FA}$ |
| $\{2^n-1, 2^n, 2^n+1, 2^{n+1}+1\}$ | [15] | $(58n + 23 + \log_2(c+1))A_{FA}$ | $(24n + 17 + \log_2(c+1))D_{FA}$ |
| $\{2^n-3, 2^n-1, 2^n+1, 2^n+3\}$ | [16]-C1CE | $(25.5n + 12 + 2.5n^2)A_{FA} + 5nA_{HA} + 3n(A_{XNOR} + A_{OR})$ | $(18n+23)D_{FA}$ |
| $\{2^n-3, 2^n-1, 2^n+1, 2^n+3\}$ | [16]-C2CE | $(20n+17)A_{FA} + (3n-4)A_{HA} + 2^n(5n+2)A_{ROM}$ | $(13n+22)D_{FA} + 3D_{ROM}$ |
| $\{2^n-3, 2^n-1, 2^n+1, 2^n+3\}$ | [16]-C3CE | $(23n+11)A_{FA} + (2n-2)A_{HA} + (6n+4)2^nA_{ROM}$ | $(16n+14)D_{FA} + D_{ROM}$ |
| $\{2^n-1, 2^n, 2^n+1, 2^{n+1}-1\}$ | [14]-3 stage-CE | $(n^2+10n+3)A_{FA} + A_{HA} + (3n+2)A_{INV} + 2A_{2:1MUX}$ | $(9n+6+m)D_{FA}$ ∗ |
| $\{2^n, 2^{n+1}-1, 2^n-1, 2^{n-1}-1\}$ | [17]-D1-C-I | $(n^2+16n+6)A_{FA} + 4nA_{INV} + (n+2)(A_{XNOR} + A_{OR}) + (3n-5)(A_{XOR} + A_{AND})$ | $(12n+9+q)D_{FA}$ ∗ |
| $\{2^n, 2^{n+1}-1, 2^n-1, 2^{n-1}-1\}$ | [17]-D1-C-III | $(n^2+24n+24)A_{FA} + (2n+3)A_{HA} + 2(A_{XNOR} + A_{OR}) + (2n+1)A_{3:1MUX} + 4nA_{INV}$ | $(8n+11+q)D_{FA}$ ∗ |
| $\{2^n, 2^{n+1}-1, 2^n-1, 2^{n-1}-1\}$ | [17]-D1-C-II | $(n^2+22n+22)A_{FA} + (2n+2)A_{HA} + 10(2n+1)A_{ROM} + 2(A_{XNOR} + A_{OR}) + (2n+1)A_{2:1MUX} + 4nA_{INV}$ | $(8n+11+q)D_{FA}$ ∗ |
| $\{2^n, 2^{n+1}-1, 2^n-1, 2^{n-1}-1\}$ | Proposed | $(n^2+21n-11)A_{FA} + (3n+1)(A_{XNOR} + A_{OR}) + (6n+9)A_{AND} + (3n+9)A_{XOR} + nA_{OR3} + (5n+1)A_{INV}$ | $(7n+1)D_{FA} + n(D_{OR3} + D_{AND}) + 2D_{HA} + 4D_{INV}$ |

∗ $m$ and $q$ are the number of levels in a CSA tree with $(n+2)$ and $(n+1)$ inputs, respectively.

which confirms the remarkable improvement in terms of speed of the reverse converter. Also, reduced hardware resources are achieved compared to [13–17].

6 Conclusion

In this paper, the quadruple moduli set $\Psi$ was the focus of study in reducing the computational intensity of the reverse converter design. $\Psi$ has a dynamic range of $4n$ bits and uses only moduli of the form $2^k-1$ besides the modulus $2^n$, which provides efficient arithmetic operations in the RNS channels. The new reverse converter eliminates extra intermediate calculations. For each level of design, the moduli subsets are selected to make the design more efficient in both delay and hardware cost. In a nutshell, the overall area and time complexity analysis indicates that the proposed reverse converters are more efficient than the converters for the 4-moduli set

Table 5. Unit gate area and delay of reverse converters.

| Moduli set | Design | Unit gate area | Unit gate delay |
|---|---|---|---|
| $\{2^n-1, 2^n, 2^n+1, 2^{n+1}-1\}$ | [13]-1 | $3.5n^2 + 72.5n + 23$ | $46n + 24$ |
| $\{2^n-1, 2^n, 2^n+1, 2^{n+1}+1\}$ | [13]-2 | $7n^2 + 128n + 146$ | $64n + 88$ |
| $\{2^n-3, 2^n-1, 2^n+1, 2^n+3\}$ | [16]-C1CE | $17.5n^2 + 210.5n + 84$ | $72n + 92$ |
| $\{2^n-1, 2^n, 2^n+1, 2^{n+1}-1\}$ | [14]-3 stage-CE | $7n^2 + 76n + 41$ | $36n + 24 + 4m$ ∗ |
| $\{2^n, 2^{n+1}-1, 2^n-1, 2^{n-1}-1\}$ | [17]-D1-C-I | $7n^2 + 136n + 30$ | $48n + 36 + 4q$ ∗ |
| $\{2^n, 2^{n+1}-1, 2^n-1, 2^{n-1}-1\}$ | [17]-D1-C-II | $7n^2 + 204n + 174$ | $32n + 44 + 4q$ ∗ |
| $\{2^n, 2^{n+1}-1, 2^n-1, 2^{n-1}-1\}$ | Proposed | $7n^2 + 169n - 47$ | $30n + 4$ |

∗ $m$ and $q$ are the number of levels in a CSA tree with $(n+2)$ and $(n+1)$ inputs, respectively.

Acknowledgements

The authors are grateful to the anonymous reviewers for their valuable comments and suggestions, which improved the quality of the manuscript. The authors would also like to thank Dr. B. Yoberd and Ms. F. Shaker for their literature contributions.

References

[1] MA Bayoumi and P Srinivasan. Parallel arithmetic: from algebra to architecture. In Circuits and Systems, 1990., IEEE International Symposium on, pages 2630–2633. IEEE, 1990.

[2] T Stouraitis and V Paliouras. Considering the alternatives in low-power design. Circuits and Devices Magazine, IEEE, 17(4):22–29, 2001.

[3] Behrooz Parhami. Computer arithmetic: algorithms and hardware designs. Oxford University Press, Inc., 2009.

[4] Mi Lu. Arithmetic and logic in computer systems, volume 169. John Wiley & Sons, 2005.

[5] Richard Conway and John Nelson. Improved RNS FIR filter architectures. Circuits and Systems II: Express Briefs, IEEE Transactions on, 51(1):26–28, 2004.

[6] Ricardo Chaves and Leonel Sousa. RDSP: A RISC DSP based on residue number system. In Digital System Design, 2003. Proceedings. Euromicro Symposium on, pages 128–135. IEEE, 2003.

[7] Wei Wang, MNS Swamy, and M Omair Ahmad. RNS application for digital image processing. In System-on-Chip for Real-Time Applications, 2004. Proceedings. 4th IEEE International Workshop on, pages 77–80. IEEE, 2004.

[8] Sung-Ming Yen, Seungjoo Kim, Seongan Lim, and Sang-Jae Moon. RSA speedup with Chinese remainder theorem immune against hardware fault cryptanalysis. Computers, IEEE Transactions on, 52(4):461–472, 2003.

[9] Mohammad Esmaeildoust, Dimitrios Schinianakis, Hamid Javashi, Thanos Stouraitis, and Keivan Navi. Efficient RNS implementation of elliptic curve point multiplication over. Very Large Scale Integration (VLSI) Systems, IEEE Transactions on, 21(8):1545–1549, 2013.

[10] Javier Ramírez, Antonio García, U Meyer-Baese, and A Lloris. Fast RNS FPL-based communications receiver design and implementation. In Field-Programmable Logic and Applications: Reconfigurable Computing Is Going Mainstream, pages 472–481. Springer, 2002.

[11] Keivan Navi, Amir Sabbagh Molahosseini, and Mohammad Esmaeildoust. How to teach residue number system to computer scientists and engineers. Education, IEEE Transactions on, 54(1):156–163, 2011.

[12] MohammadReza Taheri, Elham Khani, Mohammad Esmaeildoust, and Keivan Navi. Efficient reverse converter design for five moduli set $\{2^n, 2^{2n+1}-1, 2^{n/2}-1, 2^{n/2}+1, 2^n+1\}$. Journal of Computations & Modelling, 2(1):93–108, 2012.

[13] PV Ananda Mohan and AB Premkumar. RNS-to-binary converters for two four-moduli sets $\{2^n-1, 2^n, 2^n+1, 2^{n+1}-1\}$ and $\{2^n-1, 2^n, 2^n+1, 2^{n+1}+1\}$. Circuits and Systems I: Regular Papers, IEEE Transactions on, 54(6):1245–1254, 2007.

[14] B Cao, T Srikanthan, and CH Chang. Efficient reverse converters for four-moduli sets $\{2^n-1, 2^n, 2^n+1, 2^{n+1}-1\}$ and $\{2^n-1, 2^n, 2^n+1, 2^{n-1}-1\}$. IEE Proceedings-Computers and Digital Techniques, 152(5):687–696, 2005.

[16] PV Ananda Mohan. New reverse converters for the moduli set $\{2^n-3, 2^n-1, 2^n+1, 2^n+3\}$. AEU-International Journal of Electronics and Communications, 62(9):643–658, 2008.

[17] Mohammad Esmaeildoust, Keivan Navi, MohammadReza Taheri, Amir Sabbagh Molahosseini, and Siavash Khodambashi. Efficient RNS to binary converters for the new 4-moduli set $\{2^n, 2^{n+1}-1, 2^n-1, 2^{n-1}-1\}$. IEICE Electronics Express, 9(1):1–7, 2012.

[18] Lampros Kalampoukas, Dimitris Nikolos, Costas Efstathiou, Haridimos T Vergos, and John Kalamatianos. High-speed parallel-prefix modulo $2^n-1$ adders. IEEE Transactions on Computers, (7):673–680, 2000.

[19] Costas Efstathiou, Haridimos T Vergos, and Dimitris Nikolos. Fast parallel-prefix modulo $2^n+1$ adders. Computers, IEEE Transactions on, 53(9):1211–1216, 2004.

[20] Riyaz Patel, Mohammed Benaissa, Neil Powell, Said Boussakta, et al. Novel power-delay-area-efficient approach to generic modular addition. Circuits and Systems I: Regular Papers, IEEE Transactions on, 54(6):1279–1292, 2007.

[21] W Kenneth Jenkins and Benjamin J Leon. The use of residue number systems in the design of finite impulse response digital filters. Circuits and Systems, IEEE Transactions on, 24(4):191–201, 1977.

[22] Mohammad Esmaeildoust, Keivan Navi, and MohammadReza Taheri. High speed reverse converter for new five-moduli set $\{2^n, 2^{2n+1}-1, 2^{n/2}-1, 2^{n/2}+1, 2^n+1\}$. IEICE Electronics Express, 7(3):118–125, 2010.

[23] Keivan Navi, Mohammad Esmaeildoust, and Amir Sabbagh Molahosseini. A general reverse converter architecture with low complexity and high performance. IEICE Transactions on Information and Systems, 94(2):264–273, 2011.

[24] Amir Sabbagh Molahosseini, Keivan Navi, Chitra Dadkhah, Omid Kavehei, and Somayeh Timarchi. Efficient reverse converter designs for the new 4-moduli sets and based on new CRTs. Circuits and Systems I: Regular Papers, IEEE Transactions on, 57(4):823–835, 2010.

MohammadReza Taheri received his B.Sc. in Computer Hardware Engineering from Isfahan University, Isfahan, Iran. He obtained his M.Sc. degree in Computer System Architecture from the Science and Research Branch of Islamic Azad University, Tehran, Iran. He is currently pursuing his Ph.D. degree in Computer Architecture at Shahid Beheshti University, Tehran, Iran. He has also been a member of the Nanotechnology and Quantum Computing Laboratory of Shahid Beheshti University since 2009. His current research interests include residue number systems, low power arithmetic, approximate computing, and circuit techniques for emerging technologies.

Nasim Shafiee received the B.Sc. degree in computer hardware engineering from Shahid Beheshti University, Tehran, Iran, in 2014. She has been a member of the Nanotechnology and Quantum Computing Laboratory of Shahid Beheshti University since 2013. Her research interests include low power computer arithmetic, approximate computing, and robotics.

Mohammad Esmaeildoust received his M.Sc. degree in computer architecture from Shahid Beheshti University of Technology, Tehran, Iran, in 2008. He also received the Ph.D. degree in computer architecture from Shahid Beheshti University of Technology, Tehran, Iran, in 2012. He is currently an Assistant Professor in the Faculty of Marine Engineering, Khorramshahr University of Marine Science and Technology. His research interests include VLSI design, cryptography, network security, and computer arithmetic.

Zhale Amirjamshidi earned her M.Sc. in electronic engineering from the Central Branch of Islamic Azad University, Tehran, Iran. She is currently pursuing the Ph.D. degree in electronic engineering at Iran University of Science and Technology, Tehran, Iran. Her research interests are mainly focused on low power digital arithmetic and renewable energy.

Reza Sabbaghi-nadooshan received his B.Sc. and M.Sc. in electrical engineering from the Iran University of Science and Technology, Tehran, Iran, in 1991 and 1994, and his Ph.D. in electrical engineering from the Science and Research Branch, Islamic Azad University, Tehran, Iran, in 2010. In 1998, he became a faculty member of the Department of Electronics in the Central Tehran Branch, Islamic Azad University, Tehran, Iran. His current research interests include nanocomputing and networks-on-chips. He is a member of IEEE.