High-Throughput 16-Bit Systolic Multiplier Using Modular Shifting Algorithm For NIST Pentanomials


JAYARAJKUMAR S.

Department of Electronics and Communication, K.S.Rangasamy College of Technology, Tiruchengode, Namakkal. Email: [email protected]

SIVANANDAM K.

Asst. Professor, Department of Electronics and Communication, K.S.Rangasamy College of Technology, Tiruchengode, Namakkal. Email: [email protected]

Abstract— Finite field multipliers with high throughput and low latency have attracted considerable attention in recent cryptographic systems and coding theory, but such multipliers over the Galois field GF(2^m) for National Institute of Standards and Technology (NIST) pentanomials are not plentiful. We introduce two pairs of low-latency, high-throughput bit-parallel and digit-serial systolic multipliers based on NIST pentanomials. We propose a decomposition technique that realizes the multiplication with several parallel arrays in a two-dimensional (2-D) systolic structure (BP-I) with a critical path of 2T_X, where T_X is the propagation delay of an XOR gate. The parallel arrays of the 2-D systolic structure are projected along the vertical direction to obtain the proposed 16-bit digit-serial structure (PDS-I) with the same critical path. For high-throughput applications, we propose another pair of bit-parallel (BP-II) and modified 16-bit digit-serial (PDS-II) structures based on a novel modular reduction method, in which the critical path is reduced to (T_A + T_X), where T_A is the propagation delay of an AND gate. A scheme for data sharing between the processing elements (PEs) of adjacent systolic arrays is suggested to further reduce the area complexity of BP-I and BP-II. Existing methods consume more power and incur a high area overhead. The proposed systolic multiplier reduces area and power for ASIC implementations and also reduces the average computation time. The systolic multiplier is a better choice for high-speed VLSI implementation.

Keywords—Bit-parallel, finite field, high-throughput, low-latency, NIST pentanomials, digit-serial, 16-bit systolic multiplier.

Manuscript received April, 2016.

I. INTRODUCTION

Finite field multiplication over GF(2^m) is a basic field operation that is used extensively in applications such as error-control coding, computer algebra, and cryptographic systems. Irreducible pentanomials are used to generate the large binary extension fields required in elliptic curve cryptography (ECC) [3]-[7]. The National Institute of Standards and Technology (NIST) [8] has specified three pentanomials for ECC operations. A sequential algorithm can often be transformed into a parallel version that runs readily on array processors executing operations in a systolic manner, and the systolic array is one solution where highly parallel computational power is required.

Systolic structures consist of replicated basic cells, and each basic cell is linked to its adjacent cells through pipelining. A systolic multiplier based on a general polynomial is described in [11], and two systolic multipliers for error correction have been presented in [10] and [13]. Several low-latency Montgomery multipliers were recently introduced in [15]. To achieve an area-time trade-off, digit-serial multipliers based on pentanomials are described in [16] and [17]. Very recently, low-latency digit-serial systolic multipliers were proposed in [18]. However, the design approach in [18] is suitable only for almost equally spaced polynomials (AESP) and cannot be applied to fields based on the NIST-recommended pentanomials. Existing systolic multipliers generally comprise bit-parallel and digit-serial structures. Bit-parallel designs are intended mainly for high-speed implementation and involve a large chip area, particularly for larger values of the field order m, but pentanomials can provide significant optimization in terms of both area and speed.


The critical paths of existing digit-serial structures generally increase with the digit size or field order, which reduces the throughput rate, and the average computation time (ACT) of the digit-serial structures rises with the digit size.

The systolic architecture is a better choice for high-speed VLSI implementation. To increase the throughput rate, a novel modular reduction method is presented that reduces the critical paths of the bit-parallel and digit-serial structures to (T_A + T_X). We have also proposed two modified bit-parallel structures with low area complexity based on a data distribution scheme.

II. METHODOLOGY

1.1 Proposed bit-parallel and Digit-serial Register involvement systolic multiplier

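Let P, Q, and R be field elements in GF(2^m), where R is the product of P and Q, given by

$$R = P \cdot Q \bmod f(z) \tag{1}$$

For the 16-bit systolic multiplier we take m = 16 (bit indices [15:0]), w = 4, and d = 4, so that m = wd.

Let $P = \sum_{i=0}^{m-1} p_i z^i$, $Q = \sum_{i=0}^{m-1} q_i z^i$, and $R = \sum_{i=0}^{m-1} r_i z^i$, for $p_i, q_i, r_i \in \{0, 1\}$.

We expand (1) as

$$R = \sum_{i=0}^{m-1} p_i \left( Q \cdot z^i \bmod f(z) \right) = \sum_{i=0}^{m-1} p_i \, Q^{(i)} \tag{2}$$

where $Q^{(i)} = Q \cdot z^i \bmod f(z)$.

The product in (1) can hence be expressed as w inner products of the vectors $\mathbf{P}_u = [p_u, p_{w+u}, \ldots, p_{(d-1)w+u}]$ and $\mathbf{Q}_u = [Q^{(u)}, Q^{(w+u)}, \ldots, Q^{((d-1)w+u)}]$, for u = 0, 1, ..., w-1 (here u = 0, 1, 2, 3), as

$$R = \mathbf{Q}_0 \mathbf{P}_0^T + \mathbf{Q}_1 \mathbf{P}_1^T + \cdots + \mathbf{Q}_{w-1} \mathbf{P}_{w-1}^T = \mathbf{Q}_0 \mathbf{P}_0^T + \mathbf{Q}_1 \mathbf{P}_1^T + \mathbf{Q}_2 \mathbf{P}_2^T + \mathbf{Q}_3 \mathbf{P}_3^T = \sum_{u=0}^{w-1} \mathbf{Q}_u \mathbf{P}_u^T = \sum_{u=0}^{w-1} R_u \tag{3}$$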
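The regrouping from the bit-serial sum (2) into the w partial sums of (3) can be checked numerically. The following Python sketch is illustrative only: polynomials are stored as integers (bit i is the coefficient of z^i), the function names are hypothetical, and the degree-16 pentanomial z^16 + z^5 + z^3 + z^2 + 1 is an assumption made for this example, not a value taken from the paper.

```python
# Illustrative sketch: M, W, D follow the text (m = 16, w = 4, d = 4).
# F is an ASSUMED degree-16 pentanomial, z^16 + z^5 + z^3 + z^2 + 1.
M, W, D = 16, 4, 4
F = (1 << 16) | (1 << 5) | (1 << 3) | (1 << 2) | 1


def mod_shift(q):
    """Return q * z mod f(z); bit i of q is the coefficient of z^i."""
    q <<= 1
    if q >> M:              # a z^M term appeared, so subtract (XOR) f(z)
        q ^= F
    return q


def multiply_direct(p, q):
    """R = sum_i p_i * (Q z^i mod f), the bit-serial form of (2)."""
    r = 0
    for i in range(M):
        if (p >> i) & 1:
            r ^= q
        q = mod_shift(q)
    return r


def multiply_decomposed(p, q):
    """R = sum_u R_u, where group u uses bits p_u, p_{w+u}, ... as in (3)."""
    shifted = [q]
    for _ in range(M - 1):
        shifted.append(mod_shift(shifted[-1]))   # shifted[i] = Q z^i mod f
    r = 0
    for u in range(W):                # one partial sum R_u per u
        r_u = 0
        for v in range(D):            # d terms per partial sum
            i = v * W + u
            if (p >> i) & 1:
                r_u ^= shifted[i]
        r ^= r_u                      # the R_u are added (XORed) together
    return r
```

Both functions return the same value for every operand pair, because (3) only regroups the terms of (2) according to u = i mod w; the identity does not depend on the particular choice of f(z).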

Assume that the field GF(2^m) is generated by a pentanomial of degree m given by $f(z) = z^m + z^{k_1} + z^{k_2} + z^{k_3} + 1$, for $1 \le k_3 < k_2 < k_1 \le m-1$. Then we can obtain $Q^{(1)}$ from Q as

$$Q^{(1)} = Q \cdot z \bmod f(z) = q^{(1)}_{m-1} z^{m-1} + \cdots + q^{(1)}_{1} z + q^{(1)}_{0} \tag{4}$$

Similarly, we can find $Q^{(i)}$ from Q for $i \ge 2$ for NIST pentanomials as

$$Q^{(i)} = Q \cdot z^{i} \bmod f(z) = q^{(i)}_{m-1} z^{m-1} + \cdots + q^{(i)}_{1} z + q^{(i)}_{0} \tag{5}$$
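For a pentanomial, the modular shift in (4) amounts to a rewiring plus three XORs, because z^m is congruent to z^{k1} + z^{k2} + z^{k3} + 1 modulo f(z). The sketch below, with hypothetical function names, shows this at the bit level and checks it against an integer version using the NIST pentanomial z^163 + z^7 + z^6 + z^3 + 1; the other two NIST pentanomials are listed for reference.

```python
# NIST-recommended pentanomials f(z) = z^m + z^k1 + z^k2 + z^k3 + 1, keyed by m.
NIST_PENTANOMIALS = {163: (7, 6, 3), 283: (12, 7, 5), 571: (10, 5, 2)}


def mod_shift_bits(q, m, k1, k2, k3):
    """Bit-level Q*z mod f(z); q is a list of m bits, q[i] = coefficient of z^i."""
    top = q[m - 1]                 # coefficient that would move to z^m
    out = [top] + q[:m - 1]        # shift up by one; the "+1" of f feeds bit 0
    for k in (k1, k2, k3):
        out[k] ^= top              # z^m also folds into z^k1, z^k2, z^k3
    return out


def mod_shift_int(q, m, k1, k2, k3):
    """Same operation on an integer whose bit i is the coefficient of z^i."""
    f = (1 << m) | (1 << k1) | (1 << k2) | (1 << k3) | 1
    q <<= 1
    return q ^ f if q >> m else q


def to_bits(x, m):
    """Integer -> list of m coefficient bits (index i holds z^i)."""
    return [(x >> i) & 1 for i in range(m)]


def from_bits(bits):
    """List of coefficient bits -> integer."""
    return sum(b << i for i, b in enumerate(bits))


def q_power(q, i, m, k1, k2, k3):
    """Q^(i) = Q * z^i mod f(z), by i successive modular shifts as in (5)."""
    for _ in range(i):
        q = mod_shift_int(q, m, k1, k2, k3)
    return q
```

Only the three positions k1, k2, k3 and bit 0 depend on the outgoing top coefficient, which is why the shift costs three XOR gates regardless of m.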

In the digit-serial realization, the partial products can be accumulated as soon as they are computed, which reduces the average computation time. To reduce the complexity of the modular reduction operation, we now present a data sharing scheme.

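We can treat $z^m, z^{m+1}, \ldots, z^{m+3w-2}$ as an extended polynomial basis and define

$$Q_E^{(w-1)} = \sum_{i} q^{E}_{i} z^{i} = Q^{(w-1)} + \sum_{i \ge m} q^{(w-1)}_{i} z^{i} \tag{6}$$

which for m = 16 and w = 4 reads $Q_E^{(3)} = Q^{(3)} + \sum_{i \ge 16} q^{(3)}_{i} z^{i}$.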

iw-1= ik1-1 ……… (7)

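The bit-select operation Y is further defined as

$$Y\left(Q_E^{(w-1)},\, w-1\right) = Q^{(w-1)}, \qquad \text{i.e.,}\quad Y\left(Q_E^{(3)},\, 3\right) = Q^{(3)} \ \text{for } w = 4 \tag{8}$$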

Equation (8) can significantly reduce the register complexity of the systolic multiplier, since many bits can be shared.

We now modify the bit-parallel systolic multiplier to further reduce its area complexity. Let us define

QE(v+1)w+u = i(v+1) 3+u Xi ……(9)

i(v+1) 3+u = k1-1(v+1) 3+i …(10)

The proposed data distribution technique can significantly reduce the XOR gate complexity and the register count.

We now introduce the 16-bit digit-serial systolic multiplier, which is used to increase the throughput rate and reduce the delay. Let us define g = (2i-k1+k3); the extended polynomial can then be defined as

QAi = i {A}j . x j ……(11)

Equation (11) can reduce the modular reduction time from 2T_X to T_X. The bit-parallel systolic multipliers


Fig. 1. Proposed 16-bit bit-parallel systolic multiplier.

Fig. 2. Structure of PRC cell and PE-1.

The proposed bit-parallel systolic structure (BP-I) consists of one pre-computing cell (PRC), one pipelined adder tree (PAT), and w systolic arrays, each containing d PEs. The PRC consists of an M-I cell and a bit-rewiring cell; it yields w outputs ($Q^{(0)}, Q^{(1)}, \ldots, Q^{(w-1)}$) that are fed to the w arrays of the structure, and its M-I cell derives $Q_E^{(w-1)}$ from Q. A regular PE consists of an M-II cell, which derives $Q^{((v+1)w+u)}$ from $Q^{(vw+u)}$, an AND cell, and an XOR cell. During each cycle period, the result of the AND cell is added in the XOR cell to the input arriving from the left, and the sum is latched out to the right. The output of the M-II cell is latched out to the next PE. The critical path of BP-I is max{T_PRC, T_M-II, T_A + T_X} = 2T_X, where T_PRC, T_M-II, T_A, and T_X denote the propagation delays of the PRC cell, the M-II cell, the AND gate, and the XOR gate, respectively. BP-I yields the first output (d + 1 + log2 w) cycles after the operands are fed to the structure, and successive outputs are available in every cycle thereafter.

To reduce the area complexity of BP-I further, identical data can be shared by the PEs of BP-I, which gives the modified structure MBP-I. The M-III cell in a regular PE of array-1 derives $Q_E^{((v+1)w+u)}$ from $Q_E^{(vw+u)}$, and (m+3w-6) bits are selected and shared by the (w-1) PEs in the same column in array-2 to array-w. This data distribution scheme significantly reduces the XOR gate complexity and the register count, since there is no M-III cell in any PE outside array-1. The area complexity of MBP-I is thus significantly smaller than that of BP-I, while the critical path and latency of MBP-I are exactly the same as those of BP-I.
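The latency of (d + 1 + log2 w) cycles can be traced stage by stage. The sketch below is a functional model under stated assumptions, not a gate-level or cycle-accurate description of BP-I: it counts one cycle for the PRC, one per PE column, and one per adder-tree level, and it assumes w is a power of two. The modulus z^16 + z^5 + z^3 + z^2 + 1 and all function names are hypothetical choices for illustration.

```python
# Functional (not gate-level) model of BP-I for m = 16, w = 4, d = 4.
# F is an ASSUMED degree-16 pentanomial, z^16 + z^5 + z^3 + z^2 + 1.
M, W, D = 16, 4, 4
F = 0x1002D


def mod_shift(q, steps=1):
    """Return q * z^steps mod f(z); bit i of q is the coefficient of z^i."""
    for _ in range(steps):
        q <<= 1
        if q >> M:
            q ^= F
    return q


def reference_multiply(p, q):
    """Plain shift-and-add product, used only to check the model."""
    r = 0
    for i in range(M):
        if (p >> i) & 1:
            r ^= q
        q = mod_shift(q)
    return r


def bp1_model(p, q):
    """Return (product, cycles) for one operand pair pushed through the model."""
    cycles = 0
    # PRC: array u receives Q^(u) = Q * z^u mod f(z).
    lanes = [mod_shift(q, u) for u in range(W)]
    cycles += 1
    sums = [0] * W
    for v in range(D):                          # one PE per column v
        for u in range(W):                      # the w arrays work in parallel
            if (p >> (v * W + u)) & 1:          # AND cell with bit p_{vw+u}
                sums[u] ^= lanes[u]             # XOR cell adds to the running sum
            lanes[u] = mod_shift(lanes[u], W)   # M-II: Q^(vw+u) -> Q^((v+1)w+u)
        cycles += 1
    while len(sums) > 1:                        # PAT: log2(w) XOR levels
        sums = [sums[i] ^ sums[i + 1] for i in range(0, len(sums), 2)]
        cycles += 1
    return sums[0], cycles
```

For m = 16, w = 4, and d = 4 the model reports 4 + 1 + 2 = 7 cycles, and the count depends only on d and w rather than on the operand values.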

1.2 Proposed 16-bit Digit-serial systolic Multiplier-I

The proposed bit-parallel systolic multiplier is projected along the vertical direction to obtain the proposed 16-bit digit-serial systolic multiplier (PDS-I). PDS-I consists of (m+1) PEs and one accumulation (AC) cell. The bits of input B are loaded into an input register, and the register output is fed to PE-1 as well as to the M-IV cell, which performs the modular operation by one degree in each cycle period. The m output bits of the M-IV cell are then latched back into the input register and used by PE-0 in the next cycle period. Each regular PE, from PE-1 to PE-(d-1), contains an M-II cell, an AND cell, an XOR cell, and a register cell. The AC cell is an m-bit parallel bit-level finite field accumulator: during each cycle period the newly received input is added to the previously accumulated result, and the sum is stored in the register cell for use in the next cycle. PDS-I has the same critical path as BP-I. It yields the first output of the desired product (d+w) cycles after the pair of operands is fed to the structure, and successive outputs are produced at intervals of w cycles thereafter.
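The digit-serial operation can be sketched functionally as w passes over a single row of PEs, with the accumulator summing one partial result per pass. This is a behavioural sketch under assumptions, not the authors' circuit: the register operand (B in the text) is taken to play the role of Q in (2), the modulus z^16 + z^5 + z^3 + z^2 + 1 is an illustrative choice, and the function names are hypothetical.

```python
# Behavioural sketch of a digit-serial pass structure for m = 16, w = 4, d = 4.
# F is an ASSUMED degree-16 pentanomial, z^16 + z^5 + z^3 + z^2 + 1.
M, W, D = 16, 4, 4
F = 0x1002D


def mod_shift(q, steps=1):
    """Return q * z^steps mod f(z); bit i of q is the coefficient of z^i."""
    for _ in range(steps):
        q <<= 1
        if q >> M:
            q ^= F
    return q


def reference_multiply(p, q):
    """Plain shift-and-add product, used only to check the sketch."""
    r = 0
    for i in range(M):
        if (p >> i) & 1:
            r ^= q
        q = mod_shift(q)
    return r


def pds1_model(p, q):
    """Accumulate the w partial results R_u, one per pass u."""
    acc = 0                      # AC cell register
    reg = q                      # input register; holds Q^(u) during pass u
    for u in range(W):
        lane, r_u = reg, 0
        for v in range(D):                   # PE v receives bit p_{vw+u}
            if (p >> (v * W + u)) & 1:       # AND cell
                r_u ^= lane                  # XOR cell
            lane = mod_shift(lane, W)        # M-II: advance by w degrees
        acc ^= r_u                           # accumulate the new partial result
        reg = mod_shift(reg)                 # M-IV: advance by one degree
    return acc
```

Each PE sees one bit of its digit per pass, so a new operand pair can be accepted only every w passes, which matches the output interval of w cycles stated above.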

Fig. 3. Block diagram of the proposed 16-bit digit-serial systolic multiplier-I (input digits a0 ... aw-1, aw ... a2w-1, ..., am-w ... am-1; output C).

Fig. 4. Internal structure of PE[0], regular PE, and AC cell.

The digit-level pipelined strategy offers low latency and requires fewer registers than the bit-level parallel architecture.

1.3 Proposed 16-bit Digit-serial systolic multiplier-II

Based on the proposed 16-bit digit-serial systolic multiplier-I and the novel modular reduction method, we have derived the PDS-II multiplier. It consists of (m+1) PEs and one accumulation (AC) cell. The internal structures of PE-0, the regular PE, and the AC cell are shown in Fig. 6. All bits of operand B are fed to the M-V cell, whose output is first loaded into the (m+g)-bit register; the latched output bits are fed to PE-1 and to the M-VII cell, which performs the modular operation by one degree in each cycle period. The output bits of the M-VII cell are then latched back into the register and used by PE-0 in the next cycle period. Each regular PE, from PE-1 to PE-(d-1), contains an M-VI cell, an AND cell, an XOR cell, and a register cell, the same as in BP-II. The AC cell comprises m parallel bit-level finite field accumulators.

Fig. 5. Block diagram of the proposed 16-bit digit-serial systolic multiplier-II (input digits a0 ... aw-1, aw ... a2w-1, ..., am-w ... am-1; output C).

During each cycle period, the newly received input is added to the previously accumulated result, and the result of the addition is stored in the register cell for use in the next cycle period.

Fig. 6. Internal structure of PE[0], regular PE, and AC cell.

The 16-bit digit-serial systolic multiplier-II has the same critical path as the corresponding bit-parallel multiplier. It yields the first output of the desired product after (d+w) cycles, while successive outputs are produced at intervals of w cycles thereafter.
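The output schedule stated above can be written down directly. The helper below is a small illustrative sketch with a hypothetical function name; it only restates the (d+w)-cycle latency and the w-cycle output interval from the text.

```python
def digit_serial_output_cycle(k, d=4, w=4):
    """Cycle at which the k-th product (k = 1, 2, ...) is available.

    The first product appears after d + w cycles; each later product
    follows w cycles after the previous one.
    """
    if k < 1:
        raise ValueError("k counts products starting from 1")
    return (d + w) + (k - 1) * w


# First three output cycles for the 16-bit case (d = 4, w = 4).
schedule = [digit_serial_output_cycle(k) for k in (1, 2, 3)]
```

With d = 4 and w = 4 the products appear at cycles 8, 12, and 16, so the sustained rate is one product every w cycles.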

1.4 Area and time analysis

The area complexity, in terms of total logic gate count and register count, and the time complexity, in terms of latency, critical path, and average computation time, of the proposed and existing designs ([9]–[19]) are compared. The proposed bit-parallel structures have lower latency and lower area complexity than the existing bit-parallel designs. It is worth noting that the proposed structures achieve a flexible low-latency realization in which the latency is independent of the field order. Among the 16-bit digit-serial systolic designs, the proposed structures have a constant critical path, while the critical paths of the existing designs are a function of either the digit size or the field order of the binary extension field. The proposed structures therefore have considerably lower time complexity than the previous designs.

III. RESULT AND DISCUSSION

The bit-parallel architecture processes an entire word of input data per clock cycle and is ideal for high-speed applications when pipelined at the bit level, whereas the 16-bit digit-level pipelined design has lower latency and requires fewer registers. The 16-bit digit-serial systolic multiplier accommodates increases in field order and digit size in ASIC implementations.

TABLE I

Device Utilization

Logic Utilization          Used    Available    Utilization
No. of Slices              208     2448         8%
No. of Slice Flip Flops    310     4896         6%
No. of 4 input LUTs        249     4896         5%
No. of bonded IOBs         54      158          34%
No. of GCLKs               1       24           4%

Timing Summary:

All values displayed in nanoseconds (ns)

Timing constraint : Default path analysis

Clock period : 8.733ns (frequency: 114.508MHz)

Total number of paths / destination ports: 4826 / 387

Total = 8.733ns
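The reported frequency follows directly from the clock period. The short check below is illustrative only, and the variable names are hypothetical.

```python
# Maximum clock frequency implied by the reported minimum clock period.
clock_period_ns = 8.733
frequency_mhz = 1e3 / clock_period_ns     # 1 / (period in ns) * 1000 = MHz

print(f"{frequency_mhz:.3f} MHz")         # prints 114.508 MHz
```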

Fig. 7. Comparison of area and power complexity.

3.1 Evaluation and parameter selection

The quantitative analysis shows that the proposed 16-bit digit-serial systolic multiplier reduces area and power in the ASIC implementation. PDS-I and PDS-II achieve the high performance required for high-speed VLSI implementation.

TABLE II

COMPARISON OF SYNTHESIS RESULTS FOR ASIC IMPLEMENTATIONS OF THE SYSTOLIC MULTIPLIER

Design    Area        Power    APD
[16]      47351       7.05     333825
[17]      527072      86.93    310372
PDS-I     103.291     4.935    509.7410
PDS-II    95.3175     2.349    224.0276

Unit for Area: µm^2; ACT: ns; APP: µm^2·ns.

IV. CONCLUSION

Low-latency, high-throughput systolic designs for multipliers over GF(2^m) based on the NIST-recommended pentanomials are presented. We have proposed an algorithm that decomposes the multiplication so that it is handled separately by multiple systolic arrays in parallel in order to achieve low latency. Based on the proposed decomposition scheme, we have proposed a pair of bit-parallel and 16-bit digit-serial systolic multipliers. We have proposed an efficient data distribution approach across multiple systolic arrays to reduce the register count and hence the overall area complexity. Moreover, we have presented a novel modular reduction approach to reduce the delay of the modular reduction process, and based on it we have derived another pair of bit-parallel and 16-bit digit-serial structures in which the critical path is reduced to attain a high throughput rate. The synthesis results show that the proposed multipliers have significantly lower latency, lower area-time complexity, and higher throughput than the existing competing designs. To the best of our knowledge, this is the first report on low-latency systolic multipliers for finite fields in which the latency is independent of the field order. Existing methods consume more power and incur a high area overhead. The 16-bit systolic multiplier reduces area and power for ASIC implementations and also reduces the average computation time. The systolic multiplier is a better choice for high-speed VLSI implementation.

ACKNOWLEDGEMENT

We would like to thank the Centre for VLSI Design, Department of Electronics and Communication Engineering, K.S.Rangasamy College of Technology, Tiruchengode, Tamilnadu, for providing the FPGA kit and Synopsys tools.

REFERENCES

[1] I. Blake, G. Seroussi, and N. P. Smart, Elliptic Curves in Cryptography ser. London Mathematical Society Lecture Note Series. Cambridge, U.K.: Cambridge Univ. Press, 1999.

[2] N. R. Murthy and M. N. S. Swamy, “Cryptographic applications of Brahmagupta-Bhaskara equation,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 53, no. 7, pp. 1565–1571, 2006.

[3] J. Xie, P. K. Meher, and J. He, “Low-complexity multiplier for GF(2^m) based on all-one polynomials,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 21, no. 1, pp. 168–173, Jan. 2013.

[4] L. Song and K. K. Parhi, “Low-energy digit-serial/parallel finite field multipliers,” J. VLSI Signal Process., vol. 19, pp. 149–166, 1998.

[5] P. K. Meher, “On efficient implementation of accumulation in finite field over GF(2^m) and its applications,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 17, no. 4, pp. 541–550, 2009.

[6] C.-Y. Lee, J.-S. Horng, I.-C. Jou, and E.-H. Lu, “Low complexity bit-parallel systolic Montgomery multipliers for special classes of GF(2^m),” IEEE Trans. Comput., vol. 54, no. 9, pp. 1061–1070, 2005.

[7] P. K. Meher, “Systolic and super-systolic multipliers for finite field GF(2m) based on irreducible trinomials,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 55, no. 4, pp. 1031–1040, May 2008.


[9] C. W. Chiou et al., “Concurrent error detection in Montgomery multiplication over GF(2^m),” IEICE Trans. Fundam. Electron. Commun. Comput. Sci., vol. E89-A, no. 2, pp. 566–574, 2006.

[10] C.-S. Yeh et al., “Systolic multipliers for finite fields GF(2^m),” IEEE Trans. Comput., vol. C-33, no. 4, pp. 357–360, Apr. 1984.

[11] S. K. Jain, L. Song, and K. K. Parhi, “Efficient semi systolic architectures for finite field arithmetic,” IEEE Trans. Very Large Scale Integr (VLSI) Syst., vol. 6, no. 1, pp. 734–749, Mar. 1998.

[12] S. B. Sarmadi and M. A. Hasan,“Concurrent error detection in finite field arithmetic operations using pipelined and systolic architectures,” IEEE Trans. Comput., vol. 58, no. 11, pp. 1553–1567, Nov. 2009.

[13] A. Hariri and A. Reyhani-Masoleh, “Digit-level semi-systolic and systolic structures for the shifted polynomial basis multiplication over binary extension fields,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 19, no. 11, pp. 2125–2129, Mar. 2009.

[14] J. Xie, J. He, and P. K. Meher, “Low latency systolic Montgomery multiplier for finite field based on pentanomials,” IEEE Trans.Very Large Scale Integr. (VLSI) Syst, vol. 21, no. 2, pp. 385–389, Feb. 2013.

[15] S. Kumar, T. Wollinger, and C. Paar, “Optimum digit serial multipliers for curve-based cryptography,” IEEE Trans. Comput., vol.55, no. 10, pp. 1306–1311, Oct. 2006.

[16] C. H. Kim, C. P. Hong, and S. Kwon, “A digit serial multiplier for finite field,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 13, no. 4, pp. 476–483, Apr. 2005.

[17] J.-S. Pan, C.-Y. Lee, and P. K. Meher, “Low-latency digit serial and digit-parallel systolic multipliers for large binary extension fields,”IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 60, no. 12, pp. 3195–3204, Dec. 2013.

[18] S. Shanmugapriyan and K. Sivanandam, “Area efficient run time reconfigurable architecture for double precision multiplier” IEEE International Conference on Intelligent Systems and Control pp. 2109-2113, Jan.2015.