
T.Thangam

, IJRIT 406

IJRIT International Journal of Research in Information Technology, Volume 1, Issue 11, November, 2013, Pg. 406-418

International Journal of Research in Information Technology (IJRIT)

www.ijrit.com

ISSN 2001-5569

LUT Based Computing for Memory Size Reduction

T.Thangam1, K.Gowri2

1Associate Professor, Department of ECE, PSNA College of Engineering and Technology, Dindigul, Tamilnadu, India [email protected]

2Associate Professor, Department of ECE, PSNA College of Engineering and Technology, Dindigul, Tamilnadu, India [email protected]

Abstract

In this paper, we propose a memory-based multiplier whose look-up table (LUT) is reduced in size. Antisymmetric product coding (APC) reduces the LUT size by a factor of two, and combining it with the odd-multiple-storage (OMS) technique reduces the LUT to one fourth of the size of the conventional LUT. The proposed LUT-based multiplier uses a 12-bit word size. The LUT design can be used for efficient implementation of high-precision multiplication through input operand decomposition. The area and delay are also shown to improve on existing results.

Keywords: Digital Signal Processing (DSP), Look-Up Table (LUT), Antisymmetric Product Coding (APC), Odd-Multiple-Storage (OMS).

1. Introduction

Memory units are used in many specific applications, including mobile devices, consumer products, automotive systems, biomedical instruments and space applications. Upcoming memories are expected to provide faster access and to consume less power. Embedded memories will have a dominating presence in systems-on-chip (SoCs), where they may exceed 90% of the total SoC content [1]. When computational functions are performed by look-up tables (LUTs) instead of by actual calculation, the process closely resembles human-like computing. Memory-based computations are simple to design and offer advantages such as greater potential for high-throughput, low-latency implementation and less dynamic power consumption.

Lookup tables store numeric data in a multidimensional array format. The savings in processing time can be significant, since retrieving a value from memory is often faster than performing an expensive computation or input/output operation. The tables may be precalculated and stored in static program storage, or calculated as part of a program initialization phase (memoization). A conventional LUT-based multiplier is shown in Fig. 1 [1], in which the fixed coefficient A is multiplied with the input word X.

Fig.1. Conventional LUT-based multiplier

A positive binary input X of word length L can take 2^L possible values, so there are 2^L possible values of the product C = A·X. Therefore, for memory-based multiplication, an LUT of 2^L words, consisting of precomputed product values corresponding to all possible values of X, is conventionally used. The product word A·Xi is stored at location Xi for 0 ≤ Xi ≤ 2^L − 1, so that if the L-bit binary value of Xi is used as the address for the LUT, the corresponding product value A·Xi is available at its output.
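The conventional scheme can be sketched in a few lines of Python. This is only an illustration of the idea of using the input word as the table address; the function names and the values of A and L are assumptions chosen for the example and do not come from the paper.

```python
def build_conventional_lut(A, L):
    """Precompute A * X for every L-bit input X (2**L words in total)."""
    return [A * X for X in range(2 ** L)]


def lut_multiply(lut, X):
    """Multiply by table lookup: X is used directly as the LUT address."""
    return lut[X]


# Illustrative values only (hypothetical coefficient, 7-bit input).
A, L = 11, 7
lut = build_conventional_lut(A, L)
```

The table holds one word per possible input, which is the 2^L-word cost that the APC and OMS techniques aim to reduce.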

In the odd-multiple-storage (OMS) scheme for LUT design, only the odd multiples of the fixed coefficient need to be stored, whereas the antisymmetric product coding (APC) approach halves the LUT size by recoding the product words as antisymmetric pairs [2,3]. However, the OMS technique cannot be combined directly with the APC scheme, since the APC words are generated according to odd numbers [4]. Therefore, the APC approach is combined with a modified OMS technique, which simplifies the two’s complement operations, since the input address and LUT output can always be transformed into odd integers for efficient memory-based multiplication [5]. In this paper, we discuss the design of an LUT multiplier with a 12-bit word size, based on the work in [1].

2. PROPOSED LUT DESIGN BASED ON APC AND MODIFIED OMS TECHNIQUE

This section discusses the proposed APC technique and its further optimization by combining it with a modified form of OMS.

2.1 APC technique for LUT

X and A in Fig. 1 are assumed to be positive integers [6,7]. Table 1 shows the product words for an input word X of length L = 7. The input word X in the first column of each row is the two’s complement of that in the third column of the same row [1]. In addition, the sum of the product values corresponding to these two input values on the same row is 128A. Let the product values in the second and fourth columns of a row be u and v, respectively. Since u = [(u + v)/2 − (v − u)/2] and v = [(u + v)/2 + (v − u)/2], for (u + v) = 128A we get

\[
u = 64A - \frac{v - u}{2}, \qquad v = 64A + \frac{v - u}{2} \tag{1}
\]

Because the product values in the second and fourth columns of Table 1 have negative mirror symmetry about 64A, the LUT size can be reduced: instead of storing u and v, only (v − u)/2 is stored for each pair of inputs on a given row. The 6-bit LUT addresses and the corresponding coded words are listed in the fifth and sixth columns of Table 1, respectively.

The 6-bit address X’ = (x’5x’4x’3x’2x’1x’0) of the antisymmetric product code (APC) word is given by

\[
X' = \begin{cases} X_L, & \text{if } x_6 = 1 \\ X'_L, & \text{if } x_6 = 0 \end{cases} \tag{2}
\]

where XL = (x5x4x3x2x1x0) denotes the six least significant bits of X, and X’L is the two’s complement of XL. The desired product is obtained by adding the stored value (v − u)/2 to the fixed value 64A when x6 is 1, or subtracting it from 64A when x6 is 0, using the formula

Product word = 64A + (sign value) × (APC word) (3)

where sign value = 1 for x6 = 1 and sign value = −1 for x6 = 0. The product value for X = (1000000) corresponds to the APC value “zero,” which can be derived by resetting the LUT output instead of storing it in the LUT.
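The address mapping of (2) and the product formula of (3) can be checked with a short Python sketch for L = 7. The function names are hypothetical and the code is a behavioural model under stated assumptions, not the authors' hardware design. In particular, the input X = (0000000) is treated as a special case whose encoded word is 64A, as the paper states for Table 3.

```python
def build_apc_lut(A):
    """64-word APC table: address X' holds the APC word X' * A."""
    return [addr * A for addr in range(64)]


def apc_address(X):
    """Equation (2): X_L if x6 = 1, else the 6-bit two's complement of X_L."""
    x6 = (X >> 6) & 1
    x_low = X & 0x3F
    return x_low if x6 else (-x_low) & 0x3F


def apc_multiply(A, X, lut):
    """Equation (3): product = 64A + (sign value) * (APC word)."""
    x6 = (X >> 6) & 1
    if X == 0:
        # Special case: the encoded word for X = 0000000 is 64A.
        apc_word = 64 * A
    else:
        # Address 0 is reached only for X = 1000000 and yields the word 0.
        apc_word = lut[apc_address(X)]
    sign = 1 if x6 else -1
    return 64 * A + sign * apc_word
```

For example, X = (0000001) maps to address (111111), and 64A − 63A gives A, which agrees with the first row of Table 1.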

Table 1. APC words for different input values for L = 7

Input X   Product value   Input X   Product value   Address x’5x’4x’3x’2x’1x’0   APC word

0000001 A 1111111 127A 111111 63A

0000010 2A 1111110 126A 111110 62A

0000011 3A 1111101 125A 111101 61A

0000100 4A 1111100 124A 111100 60A

0000101 5A 1111011 123A 111011 59A

0000110 6A 1111010 122A 111010 58A

0000111 7A 1111001 121A 111001 57A

0001000 8A 1111000 120A 111000 56A

0001001 9A 1110111 119A 110111 55A

0001010 10A 1110110 118A 110110 54A

0001011 11A 1110101 117A 110101 53A

0001100 12A 1110100 116A 110100 52A

0001101 13A 1110011 115A 110011 51A

0001110 14A 1110010 114A 110010 50A

0001111 15A 1110001 113A 110001 49A

0010000 16A 1110000 112A 110000 48A

0010001 17A 1101111 111A 101111 47A

0010010 18A 1101110 110A 101110 46A

0010011 19A 1101101 109A 101101 45A

0010100 20A 1101100 108A 101100 44A

0010101 21A 1101011 107A 101011 43A

0010110 22A 1101010 106A 101010 42A

0010111 23A 1101001 105A 101001 41A

0011000 24A 1101000 104A 101000 40A

0011001 25A 1100111 103A 100111 39A

0011010 26A 1100110 102A 100110 38A

0011011 27A 1100101 101A 100101 37A

0011100 28A 1100100 100A 100100 36A

0011101 29A 1100011 99A 100011 35A

0011110 30A 1100010 98A 100010 34A

0011111 31A 1100001 97A 100001 33A

0100000 32A 1100000 96A 100000 32A


0100001 33A 1011111 95A 011111 31A

0100010 34A 1011110 94A 011110 30A

0100011 35A 1011101 93A 011101 29A

0100100 36A 1011100 92A 011100 28A

0100101 37A 1011011 91A 011011 27A

0100110 38A 1011010 90A 011010 26A

0100111 39A 1011001 89A 011001 25A

0101000 40A 1011000 88A 011000 24A

0101001 41A 1010111 87A 010111 23A

0101010 42A 1010110 86A 010110 22A

0101011 43A 1010101 85A 010101 21A

0101100 44A 1010100 84A 010100 20A

0101101 45A 1010011 83A 010011 19A

0101110 46A 1010010 82A 010010 18A

0101111 47A 1010001 81A 010001 17A

0110000 48A 1010000 80A 010000 16A

0110001 49A 1001111 79A 001111 15A

0110010 50A 1001110 78A 001110 14A

0110011 51A 1001101 77A 001101 13A

0110100 52A 1001100 76A 001100 12A

0110101 53A 1001011 75A 001011 11A

0110110 54A 1001010 74A 001010 10A

0110111 55A 1001001 73A 001001 9A

0111000 56A 1001000 72A 001000 8A

0111001 57A 1000111 71A 000111 7A

0111010 58A 1000110 70A 000110 6A

0111011 59A 1000101 69A 000101 5A

0111100 60A 1000100 68A 000100 4A

0111101 61A 1000011 67A 000011 3A

0111110 62A 1000010 66A 000010 2A

0111111 63A 1000001 65A 000001 A

1000000 64A 1000000 64A 000000 0

2.2 Modified OMS for LUT design

It is shown in [2] that, for the multiplication of any binary word X of size L with a fixed coefficient A, instead of storing all 2^L possible values of C = A·X, only 2^L/2 words corresponding to the odd multiples of A need to be stored in the LUT, while all the even multiples of A can be derived by left-shift operations on one of those odd multiples. The LUT for the multiplication of an L-bit input with a W-bit coefficient may be designed by the algorithm used in [1].

As shown in Table 2, the thirty-two odd multiples A × (2i + 1) are stored as Pi, for i = 0, 1, 2, ..., 31, at thirty-two memory locations. A barrel shifter producing a maximum of five left shifts can be used to derive all the even multiples of A.
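The odd-multiple storage idea can be illustrated with the following Python sketch, which recovers any nonzero 6-bit multiple of A from one stored odd multiple and a number of left shifts. The function names are assumptions made for this example; the location rule (odd value 2i + 1 stored at location i) follows the definition of Pi above.

```python
def build_oms_lut(A):
    """32 odd multiples P_i = A * (2i + 1) for i = 0..31."""
    return [A * (2 * i + 1) for i in range(32)]


def oms_lookup(lut, Xp):
    """Return (A * Xp, location, shifts) for a nonzero 6-bit value Xp."""
    if not 1 <= Xp <= 63:
        raise ValueError("Xp must be a nonzero 6-bit value")
    shifts = 0
    while Xp & 1 == 0:          # strip zeros on the right to reach the odd part
        Xp >>= 1
        shifts += 1
    index = (Xp - 1) // 2       # odd value 2i + 1 is stored at location i
    return lut[index] << shifts, index, shifts
```

For Xp = (110000) this gives location 1 (P1 = 3A) and four left shifts, i.e. 16 × 3A, matching the corresponding row of Table 2.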

As required by (3), the word to be stored for X = (0000000) is 64A, which would be obtained from A by six left shifts using a barrel shifter. If 64A is not derived from A, a maximum of only five left shifts is required to obtain all other even multiples of A. A maximum of five bit shifts can be implemented by a two-stage logarithmic barrel shifter, but the implementation of six shifts requires a five-stage barrel shifter. Therefore, 2A is stored for input X = (0000000), so that the product 64A can be derived by five arithmetic left shifts.

The product values and encoded words for the input words X = (0000000) and (1000000) are shown separately in Table 3. For X = (0000000), the desired encoded word 64A is derived by five left shifts of 2A [stored at address (100000)]. For X = (1000000), the APC word “0” is derived by resetting the LUT output with an active-high RESET signal given by

\[
\text{RESET} = \overline{(x_0 + x_1 + x_2 + x_3 + x_4 + x_5)} \cdot x_6 \tag{4}
\]

It may be seen from Tables 2 and 3 that the 7-bit input word X can be mapped into a 6-bit LUT address (d5d4d3d2d1d0) by a simple set of mapping relations

\[
d_i = x''_{i+1} \ \text{for } i = 0, 1, 2, 3, 4, \qquad d_5 = \overline{x''_0} \tag{5}
\]

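where X” = (x”5x”4x”3x”2x”1x”0) is generated by shifting out all the trailing zeros of X’ by an arithmetic right shift, followed by address mapping, i.e.,

\[
X'' = \begin{cases} Y_L, & \text{if } x_6 = 1 \\ Y'_L, & \text{if } x_6 = 0 \end{cases} \tag{6}
\]

where YL and Y’L denote XL and X’L, respectively, after the zeros have been shifted out.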

Table 2. OMS-based design of the LUT of APC words for L = 7

Input X’ (x’5x’4x’3x’2x’1x’0)   Product value   # of shifts   Shifted input X”   Stored APC word   Address d5d4d3d2d1d0

000001   A        0   000001   P0 = A     000000
000010   2×A      1   000001   P0 = A     000000
000100   4×A      2   000001   P0 = A     000000
001000   8×A      3   000001   P0 = A     000000
010000   16×A     4   000001   P0 = A     000000
100000   32×A     5   000001   P0 = A     000000
000011   3A       0   000011   P1 = 3A    000001
000110   2×3A     1   000011   P1 = 3A    000001
001100   4×3A     2   000011   P1 = 3A    000001
011000   8×3A     3   000011   P1 = 3A    000001
110000   16×3A    4   000011   P1 = 3A    000001
000101   5A       0   000101   P2 = 5A    000010
001010   2×5A     1   000101   P2 = 5A    000010
010100   4×5A     2   000101   P2 = 5A    000010
101000   8×5A     3   000101   P2 = 5A    000010
000111   7A       0   000111   P3 = 7A    000011
001110   2×7A     1   000111   P3 = 7A    000011
011100   4×7A     2   000111   P3 = 7A    000011
111000   8×7A     3   000111   P3 = 7A    000011
001001   9A       0   001001   P4 = 9A    000100
010010   2×9A     1   001001   P4 = 9A    000100
100100   4×9A     2   001001   P4 = 9A    000100
001011   11A      0   001011   P5 = 11A   000101
010110   2×11A    1   001011   P5 = 11A   000101
101100   4×11A    2   001011   P5 = 11A   000101
001101   13A      0   001101   P6 = 13A   000110
011010   2×13A    1   001101   P6 = 13A   000110
110100   4×13A    2   001101   P6 = 13A   000110
001111   15A      0   001111   P7 = 15A   000111
011110   2×15A    1   001111   P7 = 15A   000111
111100   4×15A    2   001111   P7 = 15A   000111
010001   17A      0   010001   P8 = 17A   001000
100010   2×17A    1   010001   P8 = 17A   001000
010011   19A      0   010011   P9 = 19A   001001
100110   2×19A    1   010011   P9 = 19A   001001
010101   21A      0   010101   P10 = 21A  001010
101010   2×21A    1   010101   P10 = 21A  001010
010111   23A      0   010111   P11 = 23A  001011
101110   2×23A    1   010111   P11 = 23A  001011
011001   25A      0   011001   P12 = 25A  001100
110010   2×25A    1   011001   P12 = 25A  001100
011011   27A      0   011011   P13 = 27A  001101
110110   2×27A    1   011011   P13 = 27A  001101
011101   29A      0   011101   P14 = 29A  001110
111010   2×29A    1   011101   P14 = 29A  001110
011111   31A      0   011111   P15 = 31A  001111
111110   2×31A    1   011111   P15 = 31A  001111
100001   33A      0   100001   P16 = 33A  010000
100011   35A      0   100011   P17 = 35A  010001
100101   37A      0   100101   P18 = 37A  010010
100111   39A      0   100111   P19 = 39A  010011
110111   55A      0   110111   P27 = 55A  011011
111001   57A      0   111001   P28 = 57A  011100
111011   59A      0   111011   P29 = 59A  011101
111101   61A      0   111101   P30 = 61A  011110
111111   63A      0   111111   P31 = 63A  011111

Table 3. Products and encoded words for X = (0000000) and (1000000)

Input X (x6x5x4x3x2x1x0)   Product value   Encoded word   Stored value   # of shifts   Address d5d4d3d2d1d0

1000000   64A   0     ---   --   ---
0000000   0     64A   2A    5    100000

3. MULTIPLIER DESIGN USING MODIFIED LUT TABLE

3.1. Design of Multiplier using APC for L = 7

Fig. 2 shows the structure of the LUT-based multiplier for a word length of L = 7 using the APC technique. It consists of a six-input LUT of 64 words that stores the APC values of the product words given in the sixth column of Table 1, except on the last row, where 2A is stored for input X = (0000000) instead of storing a “0” for input X = (1000000). A multiplexer circuit generates the 6-bit address (x’5x’4x’3x’2x’1x’0) according to (2), where x6 is the control bit and (x5x4x3x2x1x0) are the inputs. The RESET control signal is generated according to (4). The output of the LUT is added to or subtracted from 64A, for x6 = 1 or 0, respectively, according to (3) by the add/subtract cell. Hence, x6 is used as the control for the add/subtract cell.

3.2. Implementation of the Designed LUT Using Modified OMS

The proposed APC–OMS combined design of the LUT for L = 7 and for any coefficient width W is shown in Fig. 3.

It consists of an LUT of thirty-three words of (W + 4)-bit width, a six-to-thirty-three-line address decoder, a barrel shifter, an address generation circuit, and a control circuit for generating the RESET signal and the control word (s1s0) for the barrel shifter.

The precomputed values of A × (2i + 1) are stored as Pi, for i = 0, 1, 2, ..., 31, at thirty-two consecutive locations of the memory array, as specified in Table 2, while 2A is stored for input X = (0000000) at LUT address “100000,” as specified in Table 3. The decoder takes the 6-bit address from the address generator and generates thirty-three word-select signals, i.e., {wi, for 0 ≤ i ≤ 32}, to select the referenced word from the LUT. The 6-to-33-line decoder is a simple modification of a 5-to-32-line decoder, as shown in Fig. 4a. The control bits s0 and s1 used by the barrel shifter to produce the desired number of shifts of the LUT output are generated by the control circuit. Note that (s1s0) is a 2-bit binary equivalent of the required number of shifts specified in Tables 2 and 3. The RESET signal given by (4) can alternatively be generated as (d5 AND x6). The control circuit that generates the control word and RESET is shown in Fig. 4b. The address-generator circuit receives the 7-bit input operand X and maps it onto the 6-bit address word (d5d4d3d2d1d0), according to (5) and (6). A simplified address generator is presented later in this section.
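The combined APC–OMS data path described above can be modelled behaviourally in Python as follows. This is a sketch under stated assumptions rather than the circuit of Fig. 3: the function names are hypothetical, the barrel shifter is modelled as an integer left shift, and the RESET condition is modelled as (d5 AND x6), which the text gives as an alternative form of (4).

```python
def build_apc_oms_lut(A):
    """33 words: P_0..P_31 at addresses 0..31, and 2A at address 100000 (32)."""
    lut = [A * (2 * i + 1) for i in range(32)]
    lut.append(2 * A)
    return lut


def address_and_shifts(X):
    """Map a 7-bit input X to (address d, number of left shifts, RESET)."""
    x6 = (X >> 6) & 1
    x_low = X & 0x3F
    xp = x_low if x6 else (-x_low) & 0x3F    # equation (2)
    if xp == 0:
        # X = 0000000 reads 2A at address 32 and shifts it five times;
        # X = 1000000 asserts RESET (d5 = 1 and x6 = 1).
        return 32, 5, bool(x6)
    shifts = 0
    while xp & 1 == 0:                       # shift out the zeros on the right
        xp >>= 1
        shifts += 1
    return xp >> 1, shifts, False            # d_i = x''_{i+1}, d5 = 0


def apc_oms_multiply(A, X, lut):
    """Product = 64A + APC word for x6 = 1, and 64A - APC word for x6 = 0."""
    d, shifts, reset = address_and_shifts(X)
    apc_word = 0 if reset else lut[d] << shifts
    x6 = (X >> 6) & 1
    return 64 * A + apc_word if x6 else 64 * A - apc_word
```

Checking the model against A × X for every 7-bit input confirms that thirty-three stored words are sufficient, compared with 128 words in the conventional LUT.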

Fig. 2. LUT-based multiplier for L = 7 using the APC technique


Fig. 3. Proposed APC–OMS combined LUT design for the multiplication of W-bit fixed coefficient A with 7-bit input X

Fig. 4a. Five-to-thirty-two-line address decoder


Fig. 4b Control circuit for generation of s0, s1, and RESET.

3.3 LUT Design for Signed and Unsigned Operands

The APC–OMS combined optimization of the LUT can also be performed for signed values of A and X. When both operands are in sign-magnitude form, the multiples of the magnitude of the fixed coefficient are stored in the LUT, and the sign of the product is obtained by the XOR operation of the sign bits of both operands. When both operands are in two’s complement form, a two’s complement operation on the output of the LUT must be performed for x6 = 1. There is no need to add the fixed value 64A in this case, because the product values are naturally in antisymmetric form. The add/subtract circuit in Fig. 2 is then not required; instead, a circuit is needed to perform the two’s complement operation on the LUT output. For the multiplication of an unsigned input X with a signed or unsigned coefficient A, the products can be stored in two’s complement representation, and the add/subtract circuit in Fig. 2 can be modified as shown in Fig. 5.

A straightforward implementation of the sign-modification circuit involves multiplexing the LUT output and its two’s complement. To reduce the area–time complexity of such a straightforward implementation, we discuss here a simple design for sign modification of the LUT output. Note that, except for the last word, all words in the LUT are odd multiples of A. The fixed coefficient could be even or odd, but if we assume A to be an odd number, then all the stored product words (except the last one) are odd. If the stored value P is an odd number, it can be expressed as

\[
P = P_{D-1} P_{D-2} \cdots P_1 1 \tag{7}
\]

and its two’s complement is given by

\[
P' = P'_{D-1} P'_{D-2} \cdots P'_1 1 \tag{8}
\]

where P’i is the one’s complement of Pi for 1 ≤ i ≤ D − 1, and D = W + L − 1 is the width of the stored words. If we store the two’s complement of all the product values and change the sign of the LUT output for x6 = 1, then the sign of the last LUT word need not be changed. Based on (7) and (8), we can therefore have a simple sign-modification circuit [shown in Fig. 6(a)] when A is an odd integer. However, the fixed coefficient A could be even as well. When A is a nonzero even integer, we can express it as A’ × 2^l, where A’ is an odd integer. Instead of storing multiples of A, we can store multiples of A’ in the LUT, and the LUT output can be left shifted by l bits by a hardwired shifter.
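The property expressed by (7) and (8) can be verified numerically. The Python sketch below compares the general two’s complement of a D-bit word with the shortcut for odd words, in which the upper D − 1 bits are inverted and the least significant bit stays at 1. It also shows the factoring of an even coefficient into an odd part and a power of two. All function names are assumptions made for this illustration.

```python
def twos_complement(P, D):
    """General D-bit two's complement (invert and add one)."""
    return (-P) & ((1 << D) - 1)


def negate_odd(P, D):
    """For odd P: invert bits D-1..1 and keep the LSB at 1, so no carry chain."""
    if P & 1 == 0:
        raise ValueError("P must be odd")
    mask = (1 << D) - 1
    return (~P & mask & ~1) | 1


def split_even_coefficient(A):
    """Write a nonzero A as A_odd * 2**l with A_odd odd; returns (A_odd, l)."""
    if A == 0:
        raise ValueError("A must be nonzero")
    l = 0
    while A & 1 == 0:
        A >>= 1
        l += 1
    return A, l
```

Because the shortcut needs only inverters on bits D − 1 down to 1, it avoids the adder that a general two’s complement would require, which is the saving the sign-modification circuit relies on.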

Similarly, using (5) and (6), we can have an address-generation circuit as shown in Fig. 6(b), since every shifted address YL (except the last one) is an odd integer.


Fig. 5. Modification of the add/subtract cell in Fig. 2 for the two’s complement representation of product words.

Fig. 6(a). Optimized implementation of the sign modification of the odd LUT output.

Fig. 6(b). Address-generation circuit.

4. RESULTS

Fig.7 LUT APC–OMS Optimization Top Module Symbol

Fig. 8. RTL schematic of Top Module

Fig.9. Simulation Result

Table 4. Performance analysis of the LUT-based multiplier for different word lengths

Word size   Addition scheme              Time delay (CSD based multiplier)   Area, proposed LUT based (Used / Unused / Total)   Slack detection

8 bit       Carry Selective Difference   12.365 ns   4 / 9315 / 9319    0.087 ns
8 bit       Wallace Tree                 12.452 ns   8 / 9311 / 9319    ---
16 bit      Carry Selective Difference   12.965 ns   8 / 9311 / 9319    0.314 ns
16 bit      Wallace Tree                 13.279 ns   16 / 9303 / 9319   0.326 ns
32 bit      Carry Selective Difference   16.524 ns   16 / 9303 / 9319   2.270 ns
32 bit      Wallace Tree                 14.254 ns   32 / 9287 / 9319   0.544 ns

SYNTHESIS REPORT

Source Parameters
Input File Name : "APC_OMS.prj"
Input Format : mixed
Ignore Synthesis Constraint File : NO

Target Parameters
Output File Name : "APC_OMS"
Output Format : NGC
Target Device : xc3s500e-4-fg320

Device utilization summary
Selected Device : 3s500efg320-4
Number of Slices : 13 out of 4656 0%
Number of 4 input LUTs : 25 out of 9312 0%
Number of IOs : 27
Number of bonded IOBs : 27 out of 232 11%

Timing Detail
All values displayed in nanoseconds (ns)
Timing constraint: Default path analysis
Total number of paths / destination ports : 291 / 19
Delay : 9.761ns
Source : X<4> (PAD)
Destination : APC_PROD<10> (PAD)
Data Path : X<4> to APC_PROD<10>
Total : 9.761ns (7.520ns logic, 2.241ns route) (77.0% logic, 23.0% route)

CPU : 0.42 / 6.28 s [ Elapsed: 0.00 / 6.00 s]
Total memory usage : 159444 kilobytes
Number of errors : 0 (0 filtered)
Number of warnings : 26 (0 filtered)
Number of infos : 2 (0 filtered)

5. Conclusion

The proposed LUT multiplier for W × L = 12 × 7 is coded in VHDL and synthesized in Xilinx ISE 10.1i. ModelSim 6.3c is used for simulation, where the LUTs are implemented as arrays of constants and the additions are implemented by the carry selective difference and Wallace tree schemes. We have shown the possibility of using LUT-based multipliers for reduced memory size.

6. REFERENCES

[1] P. K. Meher, “LUT optimization for memory-based computation,” IEEE Trans. Circuits Syst. II, vol. 57, no. 4, Apr. 2010, pp. 285–289.

[2] P. K. Meher, “New approach to LUT implementation and accumulation for memory-based multiplication,” in Proc. IEEE ISCAS, May 2009, pp. 453–456.


[3] P. K. Meher, “New look-up-table optimizations for memory-based multiplication,” in Proc. ISIC, Dec. 2009, pp. 663–666.

[4] R. Ramya and S. Sudha, “LUT optimization using combined APC-OMS technique for memory-based computation,” IJCAES, vol. 3, Aug. 2013.

[5] A. Srinivasalu and G. Ramanjaneya Reddy, “Optimization of memory based LUT multiplier,” IJECIERD, vol. 3, Oct. 2013, pp. 125–132.

[6] J.-I. Guo, C.-M. Liu, and C.-W. Jen, “The efficient memory-based VLSI array design for DFT and DCT,” IEEE Trans. Circuits Syst. II, Analog Digit. Signal Process., vol. 39, no. 10, Oct. 1992, pp. 723–733.

[7] H.-R. Lee, C.-W. Jen, and C.-M. Liu, “On the design automation of the memory-based VLSI architectures for FIR filters,” IEEE Trans. Consum. Electron., vol. 39, no. 3, Aug. 1993, pp. 619–629.
