VLSI Implementation of Least Mean Square Adaptive Filter with Low Power and Area

(1)

Sathya priya Thiyagarajan, IJRIT 26

International Journal of Research in Information Technology (IJRIT)

www.ijrit.com www.ijrit.com www.ijrit.com

www.ijrit.com

ISSN 2001-5569

VLSI Implementation of Least Mean Square Adaptive Filter with Low Power and Area

Sathya priya Thiyagarajan¹, Kavitha Karuppansamy²,Boopathi Raja Govindasamy³ Jayanthi Loganathan⁴and Saranya Natarajan⁵

1,4 PG scholar, Department of ECE, Velalar College of Engineering and Technology, Anna University Chennai, Tamilnadu, India

[email protected] , [email protected]

2,3 Assistant Professor, Department of ECE, Velalar College of Engineering and Technology, Anna University Chennai, Tamilnadu, India

[email protected], [email protected]

5 PG scholar, Department of EEE, Velalar College of Engineering and Technology, Anna University Chennai, Tamilnadu, India

[email protected]

Abstract

The Least Mean Square (LMS) adaptive filter is the most popular and most widely used adaptive filter, because of its simplicity and also its satisfactory convergence performance. Partial Product Generator is used for achieving lower area-delay-power and adaptation- delay. The Delayed Least Mean Square (DLMS) algorithm introduces a delay in the updating of filter coefficients. The DLMS algorithm is the most popular for real-time adaptive system implementations. An efficient architecture for the implementation of a delayed Least Mean Square adaptive filter is developed. From synthesis results, we found that the proposed design offers nearly 16%

less area-delay product (ADP) and nearly 13% less energy-delay product (EDP) than the existing systolic structures for filter lengths N = 8, 16 and 32. The main advantage of Delayed LMS adaptive filter architecture is used to reduce the power and time complexity.

Keywords: Adaptive filter, Least Mean Square (LMS) algorithm, sampling period reduction, Mean Square Error (MSE).

1. Introduction

Adaptive filter is a filter that varies in time, adapting their coefficients according to some reference. The aim of adaptive filter is to estimate a sequence of scalars from an observation sequence filtered by a system in which coefficients vary. These coefficients converge towards the optimum coefficients which minimize the Mean Square Error (MSE) between the filtered observation signal and the desired sequence. The direct-form LMS adaptive filter includes an inner- product computation to obtain the filter output in a long critical path. When the critical path exceeds the desired sample period, pipelined implementation is required to reduce it. In conventional LMS algorithm, the pipelined implementation is not available because of its recursive behavior. But in the Delayed LMS (DLMS) algorithm [2], pipelined implementation of the filter is available. To implement the DLMS algorithm in systolic architectures [2], [3] and [4], lot of work has been done to increase the maximum usable frequency. But, these include filter length N for an adaptation delay of ~ N cycles. For a large adaptation delay, filter order is high and also the convergence performance degrades considerably. To reduce the adaptation delay Visvanathan et al. [5] have proposed a modified systolic architecture. In a transpose-form LMS adaptive filter, the number of delays in weights which varies from 1 to N depends on the filter output at any instant. Van and Feng [6] have proposed a systolic architecture with large processing elements (PEs).It also achieves the critical path of one MAC operation with lower adaptation delay. A fine-grained pipelined design is proposed by Ting et al. [7] to limit the critical path to the maximum of one addition time. This method supports high sampling frequency and large number of pipeline latches, but it involves a lot of area overhead for pipelining and higher power consumption than in [6].Further effort has been made by Meher and Maheshwari [8] to reduce the number of adaptation delays. The 2-bit multiplication cell with an efficient adder tree is proposed by Meher and Park [9].To minimize the critical path and also reduce the number of pipeline delays along with the area, energy consumption and sampling period, the pipelined inner-product computation is used. The proposed design is found to be more efficient in terms of the energy-delay product (EDP) and power-delay product (PDP) compared to the existing methods.

(2)

2. Design of LMS Adaptive Filter

There are two main computing blocks in the general adaptive filter architecture: 1) Error-Computation block, and 2) Weight-Update block. The block diagram of the DLMS adaptive filter is shown in Fig. 1, where the delay introduced by the whole of adaptive filters structure. But in proposed method the adaptation delay of conventional LMS can be decomposed into two parts: one part is the delay introduced by the pipeline stages in FIR filtering, and another part is the delay involved in pipelining of weight update process. This allows us to perform optimal pipelining by feed forward cut- set retiming of both these sections separately. It is used to minimize the number of pipeline stages and adaptation delay.

Based on such a decomposition of delay, the DLMS adaptive filter can be implemented by a structure shown in Fig. 2.

During the weight update, the error with delays is used, while the filtering unit uses the weights delayed by cycles.

Fig. 1 Structure of conventional delayed LMS Adaptive filter

Fig. 2 Structure of modified delayed LMS Adaptive filter

2.1 Error Computation Block

The proposed structure for an N-tap DLMS adaptive filter of error-computation unit is shown in Fig. 3. It consists of N number of 2-b partial product generators (PPG) with respect to N multipliers and a cluster of L/2 binary adder trees, which is followed by a single shift–add tree.

Fig. 3 Structure of error-computation block 2.1.1 Structure of PPG

The structure of each PPG is shown in Fig. 4. It consists of L/2 number of 2-to-3 decoders and also the same number of AND/OR cells (AOC). Each of the 2-to-3 decoders takes a 2-b digit (u1u0) as input and produces 3 outputs b0 = u0. , b1 = ·u1, & b2 = u0· u1, such that b0 = 1 for (u1u0) = 1, b1 = 1 for (u1u0) = 2, and b2 = 1 for (u1u0) = 3.

(3)

Sathya priya Thiyagarajan, IJRIT 28 Fig. 4 Structure of PPG (AOC stands for AND/OR cell)

The decoder output b0, b1 and b2 along with w, 2w, and 3w are fed to an AOC.The value of w, 2w, and 3w are in 2’s complement representation and sign-extended to have (W + 2) bits each. While computing the partial product corresponding to the most significant digit (MSD), i.e., ( ) of the input sample, the AOC (L/2 − 1) is fed with w, −2w, and −w as input since ( ) can have four possible values 0, 1, −2, and −1.

2.1.2 Structure of AOC

The structure and function of an AOC are shown in Fig. 5. Each AOC consists of three AND cells and two OR cells. The Fig. 5b and Fig. 5c are shown the structure and function of AND cells and OR cells respectively. Each AND cell takes an n-bit input D and a single bit input b, and also consists of n AND gates.

Fig. 5 (a) Structure and function of AND/OR cell. Binary operators • and + in (b) and (c) are implemented using AND and OR gates, respectively

This distributes all the n bits of input D to its n AND gates as one of the input. The other inputs of all the n AND gates are fed with the single-bit input b. As shown in Fig. 5c, Like AND gates, each OR cell takes a pair of n-bit input words and has n OR gates. A pair of bits in the same bit position in B and D is fed to the same OR gate. The w, 2w, and 3w are output of an AOC which is corresponding to the decimal values 1, 2, and 3 of the 2-b input (u1u0), respectively. The decoder along with the AOC performs a multiplication of input operand w with a 2-b digit (u1u0).The PPG of Fig. 4 performs L/2 parallel multiplications of input word w with a 2-b digit, which produce L/2 partial products of the product word wu.

2.1.3 Structure of Adder Tree

Conventionally, the shift-adds operation is performed on the partial products of each PPG separately to obtain the product value. To compute the desired inner product all the N product values are added. However, the shift-add operation increases the word length, and consequently increases the adder size of N − 1 additions of the product values. To avoid such increase in word size of the adders, all the N partial products of the same place value from all the N PPGs are added

(4)

by one adder tree. All the L/2 partial products are generated by each of the N PPGs, which are added by (L/2) binary adder trees. The shift-add tree is used to add the outputs of the L/2 adder trees according to their place values. To add N partial product, each of the binary adder trees require stages of adders, and stages of adders are required by shift–add tree to add L/2 output of L/2 binary adder trees. The addition scheme for four-tap filter of error- computation block and input word size L = 8 is shown in Fig. 6. For N = 4 and L = 8, the adder network requires four binary adder trees for each two stages and a two-stage shift–add tree. In proposed method, Ripple Carry Adder is used instead of Carry Save Adder.

Fig. 6 Adder-structure of the filtering unit for N = 4 and L = 8.

2.2 Weight-Update Block

The proposed structure of weight-update block is shown in Fig. 7. To update N filter weights, the weight update block performs N multiply- accumulate operations of the form (µ×e) × .

Fig. 7 Structure of weight-update block.

By taken the step size µ as a negative power of 2, we can realize the multiplication with currently available error only by a shift operation. Each of the MAC unit performs the multiplication of the shifted value of error with the delayed input samples , which is followed by the additions with the corresponding old weight values .All the N multiplications for MAC operations are performed by N PPGs, which is followed by N shift– add trees. Each of the PPGs generates L/2 partial products corresponding to the product of recently shifted error value µ × e with L/2 and the number of 2-b digits of the input word ,where the sub expression 3µ×e is shared within the multiplier. Since the scaled error (µ×e) is multiplied with the entire N delayed input values in the weight-update block, so the sub expression can be shared across

(5)

all the multipliers. This leads to the substantial reduction of the adder complexity. So the final outputs of MAC units comprise the desired updated weights which are used as inputs to the error-computation block as well as the weight- update block for the next iteration.

2.3 Adaptation Delay

As shown in Fig. 2, the adaptation delay is decomposed into n1 and n2. The error-computation block generates the delayed error by −1 cycles as shown in Fig. 3, which is fed to the weight-update block shown in Fig. 7 after scaling by µ; then the input is delayed by 1 cycle before the PPG to make the total delay introduced by FIR filtering be .In Fig. 7, the weight-update block generates , and the weights are delayed by +1 cycles. However, it should be noted that the delay by 1 cycle is due to the latch before the PPG, which is included in the delay of the error-computation block, i.e., , Therefore, the delay generates in the weight-update block becomes .

3. Results and Discussions

The design is simulated in ModelSim tool by VHDL language is shown in Fig. 8.The power analysis and area utilization is done by implemented the design in Xilinx ISE9.2i is shown in Table 1.

Fig. 8 Modified delayed LMS Adaptive filter

Table 1

PARAMETER GATE

COUNT

POWER (mW) Filter with

Carry Save Adder

8670 100

Filter with Ripple Carry

Adder

5456 51

4. Conclusion

Partial product generator is used to compute inner-product computation by common sub expression sharing in order to reduce adaptation delay, area and power. The optimized balanced pipelining across the time-consuming blocks of the structure is proposed to reduce the adaptation delay and power consumption, as well. When the adaptive filter is necessary to be operating at a lower sampling rate, one can reduce the power consumption further with the help of the proposed design which have lower operating voltage and clock slower than the maximum usable frequency.

(6)

References

[1] Pramod Kumar, Meher and Sang Yoon Park, “Area-Delay-Power Efficient Fixed-Point LMS Adaptive Filter With Low Adaptation Delay”, IEEE Transactions on VLSI Systems, Volume 21, Jan 2013, pp. 346-403.

[2] Meyer.M.D and Agrawal.D.P, “A modular pipelined implementation of a delayed LMS transversal adaptive filter” ,in Proc. IEEE Int. Symp.Circuits Syst., May1990, pp. 1943–1946.

[3] Haimi-Cohen.R and Herzberg.H, “A systolic array realization of an LMS adaptive filter and the effects of delayed adaptation”, IEEE Trans.Signal Process., vol. 40, no. 11, Nov 1992, pp. 2799–2803.

[4] Agrawal.D. P and Meyer.M.D, “A high sampling rate delayed LMS filter architecture”, IEEE Trans. Circuits Syst. II, Analog Digital Signal Process., vol. 40, no. 11, Nov 1993,pp. 727–729.

[5] Ramanathan.S and Visvanathan.V, “A systolic architecture for LMS adaptive filtering with minimal adaptation delay”, in Proc. Int. Conf. Very Large Scale Integr. (VLSI) Design, Jan. 1996,pp. 286–289.

[6] Feng.W.S and Van.L.D, “An efficient systolic architecture for the DLMS adaptive filter and its applications”, IEEE Trans.Circuits Syst. II, Analog Digital Signal Process .,vol. 48,no.4, Apr 2001,pp.359–366.

[7] Cowan.C.F.N,Ting.L.K and Woods.Y.Yi.R, “High speed FPGA-based implementations of delayed-LMS filters Very Large Scale Integr. (VLSI) Signal Process., vol. 39, nos. 1–2, Jan 2005, pp. 113–131.

[8] Meher.P.K and Maheshwari.M, “A high-speed FIR adaptive filter architecture using a modified delayed LMS algorithm,” in Proc. IEEE Int. Symp. Circuits Syst., May 2011, pp. 121–124.

[9] Meher.P.K and Park.S.Y,“Low adaptation-delay LMS adaptive filter part-I and II” IEEE Int.Midwest Symp.

Circuits Syst., Aug 2011,pp. 1–4.

[10] Caraiscos.C and Liu.B , “A roundoff error analysis of the LMS adaptive algorithm” IEEE Trans. Acoust., Speech, Signal Process., vol. 32, no. 1, Feb 1984, pp. 34–41.

[11] Parhi.K.K,”VLSI Digital Signal Procesing Systems: Design and Implementation” New York, USA: Wiley, 1999.

[12] Menard .D, Rocher.R,, Scalart.P and Sentieys .O, “Accuracy evaluation of fixed-point LMS algorithm,” in Proc.

IEEE Int. Conf. Acoust., Speech, Signal Process., May 2004,pp. 237–240.

Author’s Profile

Sathya priya.T has received B.E degree in Electronics and Communication Engineering from Anna University, Coimbatore 2012. She is currently pursuing Master of Engineering in VLSI Design in Velalar College of Engineering and Technology under Anna University, Chennai. She has presented one paper in International Conference and three papers in National Conferences. Her areas of interest in research are Digital Electronics and VLSI Signal Processing.

Kavitha.K has received B.E degree in Electronics and Communication Engineering from Kongu Engineering College, Perundurai in 1997 and M.E –VLSI Design in Kongu Engineering College, Perundurai in the year 2009. She is working as an Assistant Professor in the department of ECE, Velalar College of Engineering and Technology, Erode. She has presented eight papers in National Conferences and two papers in International Conferences. Her areas of interest are Digital Electronics and VLSI Design.

Boopathi Raja.G has received B.E degree in Electronics and Communication Engineering from SSM College of Engineering, Komarapalayam in 2011 and M.E –Applied Electronics in Muthayammal Engineering College, Rasipuram in the year 2013. He is working as an Assistant Professor in the department of ECE, Velalar College of Engineering and Technology, Erode. He has presented three papers in National and International Conferences. His areas of interest are Nano Electronics and VLSI Design.

Jayanthi.L has received B.E degree in Electronics and Communication Engineering from Anna University, Coimbatore 2012. She is currently pursuing Master of Engineering in VLSI Design in Velalar College of Engineering and Technology under Anna University, Chennai. She has presented three papers in National Conferences and one paper in International Conference. Her areas of interest in research are Digital Electronics and VLSI Signal Processing.

Saranya.N has received B.E degree in Information Technology from Anna University, Coimbatore 2012. She is currently pursuing Master of Engineering in Applied Electronics in Velalar College of Engineering and Technology under Anna University, Chennai. She has presented one paper in National Conference and four papers in International Conferences. Her areas of interest in research are Data mining and Networking.