Global Optimization Methods for Adaptive IIR Filters


OCLOO, SENANU K. Global Optimization Methods for Adaptive IIR Filters. (Under the direction of Associate Professor W. Edmonson).

Adaptive filtering systems mimic the ability of biological systems to change their internal configuration so as to better survive in their environment. This ability is critical because adaptive filters operate in noisy, time-varying environments. At design time, although performance objectives are well-defined, there is limited a priori information about the characteristics of the input signals. As a result, systems capable of meeting performance specifications while operating under such conditions need to be able to make on-the-fly changes to their structure so as to constantly improve performance. Over the last couple of decades, the efficacy and robustness of adaptive filters have been demonstrated in practice, and today they are used in a wide variety of applications ranging from radar, sonar and active noise control to channel equalization, adaptive antenna systems and hearing aids.


by

Senanu K. Ocloo

A dissertation submitted to the Graduate Faculty of North Carolina State University

in partial fulfillment of the requirements for the Degree of

Doctor of Philosophy

Electrical and Computer Engineering

Raleigh, North Carolina 2007

Approved By:

Dr. Mo-Yuen Chow Dr. Ethelbert Chukwu


Dedication

To my parents, John and Edith Ocloo, who are my inspiration and have always loved,


Biography


Acknowledgements

First of all, I would like to express my thanks to my advisor, Dr. W. Edmonson, for his guidance and support throughout the course of my PhD. Among the many things I learned from you is the ability to think outside the box and push the envelope. I thank Dr. W. Alexander for bringing me on board at N. C. State University and guiding me through my Master’s and PhD programs. Your sage advice helped me through many tough situations.

The High Performance DSP lab (HiPER DSP), set up by Dr. W. Alexander and now headed by Dr. Edmonson, provided an atmosphere conducive to success, one that encouraged members to strive for excellence while helping each other. As a result, many thanks go to the members of HiPER DSP who have supported me over the years: Lalit, Gary, Chuanhua, Viren, Sanath and many others. Thanks for the encouragement, laughs and everything. I would like to mention specially Cranos and Ramsey, with whom I started the doctoral program and who have been with me through the ups and the downs, through the good and the bad. I will never forget you!

Special thanks go to Dr. Seth & Elizabeth Tetteh-Ocloo and family (Dickson, Shirley, Seth Jr., and Jonelle) who are responsible for bringing me to the United States and for providing for and supporting me throughout this long journey. Thanks for your patience, love and unfailing support. I will never forget or take for granted what you have done for me. I would also like to express my gratitude to all my extended family (cousins, in-laws, nephews and nieces) who have encouraged me over the years. Your love was always evident. It truly takes a village to raise a child.

I would like to thank my parents, John and Edith, who instilled in me the discipline I needed for the journey and also the love for it. I can never repay what you have given me. Also, to my brother, Delali, and my sister, Kafui, thanks for all the support and for allowing me to bounce ideas off you.


List of Figures

1.1 Interaction of an Adaptive Filter with its Environment

2.1 Basic adaptive filtering setup

2.2 The four major classes of adaptive filtering applications

2.3 Transversal structure of an adaptive filter

2.4 Cascade Structure of an Adaptive Filter

2.5 Lattice-Ladder Structure of an Adaptive Filter with N = M + 1

2.6 Details for i-th stage of a Lattice Filter

2.7 Basic Genetic Algorithm Cycle

2.8 1-Point Crossover of two binary strings

2.9 Flow of Simulated Annealing

3.1 Basic cycle of a Branch-and-Bound Algorithm

4.1 Exponential Growth of Boxes

4.2 Interplay between q and k

4.3 Flow of SIBB Algorithm

4.4 Effect of K on Estimator

5.1 System Identification Setup

5.2 Error surface for undermodeled system

5.3 Final filter coefficients for each run

5.4 Trajectory of filter coefficients

5.5 Evolution of filter coefficients

5.6 System Identification Setup with Noise

5.7 Final Points for all 100 independent runs of RCGA

5.8 Estimated MSE Values

5.9 Effect of q on Convergence Rate


List of Tables

2.1 Summary of IIR LMS Algorithm

2.2 Genetic Search Learning Algorithm

3.1 Global Optimization using Branch-and-Bound

4.1 Smoothed Interval Branch-and-Bound Algorithm

4.2 Tunable Parameters of SIBB

5.1 Initial Parameters

5.2 Final Filter Coefficients

5.3 Steps of our modified Genetic Algorithm

5.4 Simulation Parameters for Windowed GA

5.5 Simulation Parameters for Modified ASA

5.6 Mean of Final Points

5.7 Variance of Final Points

5.8 Success Rates

5.9 Convergence Rates

5.10 Average Number of Function Evaluations

5.11 Estimated MSE Values

5.12 Variance of Final Points for SIBB and ASA

5.13 Variance of Final Points for BCGA and RCGA

5.14 Convergence Rates


Chapter 1 Introduction


1.1 The Need for Adaptability

Biological systems in nature have the ability to adjust their behavior or structure so as to better survive in their environment. For instance, it is believed that polar bears developed white fur as camouflage in the arctic regions. Most other bears have black or brown fur. These changes in the behavior and/or physiology of organisms are a result of changes in their underlying genetic makeup that take place over many generations. Organisms that are capable of adapting to their ever-changing environment survive and those that cannot die off - a phenomenon known as survival-of-the-fittest.

Adaptation can be described as the process of modifying a thing so as to suit new conditions [3]. This presupposes the existence of a goal (which, in the case of biological systems, is survival), and a mechanism for achieving it. However, it does not require complete knowledge about future environmental or operating conditions. Designers of signal processing systems face numerous situations where a device’s operating environment is known to change with time. Furthermore, at design time, complete knowledge about the operating environment is unavailable, although the desired performance objectives are clearly defined. In such situations, systems that are capable of mimicking the behavior of biological systems are required. For instance, in hearing aid applications, mechanical and acoustic feedback severely limits the maximum achievable gain. Feedback can also cause system instability which is sometimes audible as a continuous high-frequency tone [4]. Designers of feedback cancellation systems for hearing aids do not have the exact values of the feedback signal (as a function of time) at design time, but only information about the spectral and stochastic characteristics of feedback signals. In such a situation, where partial rather than complete a priori knowledge is available, systems that are capable of adapting to changes in the feedback signal - much like a biological system - represent the best choice.


In other words, signals with different frequencies travel at different velocities [5]. Consider a wireless communication channel where the transmitter moves around, causing the channel between the transmitter and receiver to change over time. Surrounding buildings will reflect signals, trees may absorb them and movement of the transmitter will cause Doppler effects. As a result, several delayed and attenuated versions of the original, transmitted signal arrive at the base station. Again, system designers face a situation where the channel response changes with time and channel characteristics (impulse and frequency response) are not known a priori [6]. As a result, a system that is effective in canceling out the effects of the time-varying channel needs to be able to make adjustments on-the-fly. In other words, such a system needs to be adaptive.

One of the most important artificial adaptive systems designed for use in signal processing applications is what is known as the adaptive filter. These systems are designed to mimic the behavior of biological systems and over the past couple of decades, their robustness has been demonstrated in practice. Examples of areas where such systems are employed include radar, sonar, channel equalization, echo cancellation, active noise control, adaptive antenna systems and hearing aids [1] [2] [5]. Adaptive filters interact with their environment as shown in Figure 1.1. Sensors provide input information which is processed via a well-defined series of steps so as to determine what changes to the internal configuration of the filter, if any, are required. This way, the filter outputs effect the desired changes in the environment. Adaptive filters also monitor their performance by measuring how far they are from the goal state and this information, called the error, plays a vital role in determining what changes are necessary.

Figure 1.1: Interaction of an Adaptive Filter with its Environment (blocks: Environment, Adaptive System; signals: Inputs, Outputs)

1.2 The Case for Adaptive IIR Filters

Adaptive filters have been found to be very effective tools for extracting data from corrupted signals in time-varying environments. They adapt to changes in their operating environment by adjusting their internal configuration (filter coefficients) on-the-fly, following a well-defined series of steps referred to as an adaptive algorithm. As with all adaptive processes, this requires specification of a performance goal that is to be achieved over time; in this case, the goal is to minimize the system error in some sense. Mathematically, this is expressed via a cost or performance function which is a stochastic function of the error. Naturally, we strive to minimize this function with a minimum of computing resources (power, time and memory) without degrading the ability of the algorithm to locate globally optimum filter coefficients.


tend to have pole-zero structures because they have rational transfer functions [8]. Secondly, they are generally able to meet performance specifications using fewer filter coefficients than equivalent adaptive FIR implementations [9]. Antoniou [10] notes that reductions of as much as 5 to 10 times can be achieved which in turn translates into a reduction in the number of multiplier blocks required to implement the filter. This has important ramifications for the total amount of power consumed by the filter. For instance, when filters are implemented using fixed-point arithmetic [7], multipliers turn out to be the most power-hungry components and so, the amount of power consumed by the filter is determined to a large extent by these multipliers. Reducing the number of coefficients reduces the number of multipliers required and therefore, lowers the power consumed by the filter [11]. Thus, the ability of adaptive IIR filters to meet performance specifications with fewer coefficients translates into lower power consumption making adaptive IIR solutions particularly attractive in applications where power and memory resources are limited. Examples of such applications include cellphones, hearing aids, biomedical and space applications.

1.3 Research Motivation

In the previous section, we made the case for using adaptive IIR filters, stating that they have advantages of better accuracy in modeling physical plants, lower filter order and lower power consumption. Despite these advantages, adaptive IIR filters have not found widespread application for two reasons. First, they can become unstable because they have non-trivial poles which may move outside the unit circle in the z-plane during adaptation. Secondly, they have feedback which causes the associated


Algorithms [19] [20] [21]. However, these approaches may not guarantee convergence to globally optimum filter coefficients in finite time and/or may yield unstable filters. This dissertation proposes a novel algorithm for adapting the coefficients of adaptive IIR filters aimed at addressing some of the aforementioned issues. The algorithm is designed to locate the global minima of the performance function while minimizing the use of computational resources. In this regard, the following performance metrics are of interest:

• Computational Efficiency: This relates to the amount of computing resources required to accomplish the task. Specifically, we consider the

– Computational complexity of the adaptive algorithm

– Convergence rate, meaning the number of data samples the algorithm has to process before it settles on a final solution

• Success Rate: This is a measure of how successful the adaptive algorithm is at locating optimal solutions. In cases where local and global optima exist, it refers to the algorithm’s ability to locate global optima

• Stability: Stable filters produce bounded outputs in response to bounded inputs [7]. Unstable filters are undesirable and must be avoided. Thus, the stability metric refers to the ability of the algorithm to produce stable filters.

The specific goals of this research are to:

1. Develop an adaptive algorithm capable of processing discrete-time data in a fashion suitable for adaptive IIR filtering applications

2. Minimize the number of parameters of the algorithm that need to be tuned


Chapter 2 Background

This dissertation presents a novel approach to adapting the coefficients of adaptive Infinite Impulse Response (IIR) filters. Significant research has been conducted in this area over the past few decades and thus, numerous algorithms for solving the problems associated with adaptive IIR filters already exist. In order to understand the state-of-the-art as far as adaptation algorithms go, and also, the significance of our work, an understanding of adaptive filter theory, the properties of existing algorithms and alternative optimization techniques is required. This chapter presents the basics of adaptive filter theory, highlighting the problems faced when adaptive IIR filters are used and a literature survey of a number of existing adaptation algorithms.

2.1 Adaptive Filters


Figure 2.1 shows the general setup of an adaptive filtering application where x(n) represents the input to the filter, d(n) represents the desired or reference signal and y(n) represents the filter output. Inputs x(n) and d(n) are modeled as random processes [22]. The error signal, e(n), is given as the difference between the desired signal and the filter output:

$$e(n) = d(n) - y(n). \tag{2.1}$$

Figure 2.1: Basic adaptive filtering setup (blocks: Adaptive Filter, summing junction; signals: x(n), d(n), y(n), e(n))

There are four main classes of applications for which adaptive filters are used, namely, system identification, inverse modeling, interference cancellation and linear prediction [5] [23]. Figure 2.2 depicts setups typical of these classes. System identification setups are used in applications such as vibrational studies of mechanical systems [5] and acoustic echo cancellation [24], while inverse modeling configurations are used in channel equalization [7] and speech analysis [2], to name a few. Active Noise Control represents an application where the interference cancellation setup is used. Linear prediction setups, shown in Figure 2.2(d), are used in signal encoding, spectral estimation and image compression [2].

Figure 2.2: The four major classes of adaptive filtering applications: (a) System Identification, (b) Inverse Modeling, (c) Interference Cancellation, (d) Linear Prediction

is a linear function of inputs and outputs, and whose adjustments are held constant after adaptation are classified as linear filters. As a result, linear system theory can be used to analyze them [25].

There are two kinds of realizable (causal) adaptive linear filters namely Finite Impulse Response (FIR) and Infinite Impulse Response (IIR) filters. They form their output according to the following difference equation:

$$y(n) = \sum_{i=0}^{N-1} b_i(n)\,x(n-i) + \sum_{i=1}^{M} a_i(n)\,y(n-i) \tag{2.2}$$

where N and M represent the lengths of the feedforward and feedback sections of the filter respectively, and the $a_i$'s and $b_i$'s are the adjustable parameters of the filter, called filter coefficients. Note that adaptive FIR filters have no feedback coefficients, implying that all $a_i$'s are zero. A more compact form of (2.2) can be obtained using linear algebra. Let u(n) and w(n) represent column vectors of the inputs and filter coefficients respectively such that

$$\mathbf{u}(n) = [x(n)\;\; x(n-1)\;\cdots\; x(n-N+1)\;\; y(n-1)\;\cdots\; y(n-M)]^T \tag{2.3}$$

$$\mathbf{w}(n) = [b_0(n)\;\; b_1(n)\;\cdots\; b_{N-1}(n)\;\; a_1(n)\;\cdots\; a_M(n)]^T. \tag{2.4}$$

Then

$$y(n) = \mathbf{w}^T(n)\,\mathbf{u}(n). \tag{2.5}$$

An equivalent representation of (2.2) also exists in the z-domain [7] where the z-transform of a discrete-time signal, s(n), is defined as

$$S(z) = \sum_{k=-\infty}^{\infty} s(k)\,z^{-k}. \tag{2.6}$$

Let $\mathcal{Z}$ represent the z-transform operator and let $\mathcal{Z}\{x(n)\} = X(z)$ and $\mathcal{Z}\{y(n)\} = Y(z)$. Then, at time n, the z-transform of (2.2) is:

$$Y(z) = \sum_{i=0}^{N-1} b_i\,\mathcal{Z}\{x(n-i)\} + \sum_{i=1}^{M} a_i\,\mathcal{Z}\{y(n-i)\}. \tag{2.7}$$

Note that the time indices of $b_i$ and $a_i$ have been dropped to reflect the fact that filter

Using the time-shifting property of z-transforms [7] which states that

$$\mathcal{Z}\{s(n-n_0)\} = z^{-n_0}S(z), \tag{2.8}$$

equation (2.7) becomes

$$Y(z) = \sum_{i=0}^{N-1} b_i z^{-i}X(z) + \sum_{i=1}^{M} a_i z^{-i}Y(z). \tag{2.9}$$

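The transfer function of such a filter in the z-domain, defined as the ratio of the output and input signals, is therefore

$$H(z) = \frac{Y(z)}{X(z)} = \frac{\sum_{i=0}^{N-1} b_i z^{-i}}{1 - \sum_{i=1}^{M} a_i z^{-i}} \tag{2.10}$$

$$\phantom{H(z)} = \frac{b_0 + b_1 z^{-1} + b_2 z^{-2} + \cdots + b_{N-1} z^{-(N-1)}}{1 - a_1 z^{-1} - a_2 z^{-2} - \cdots - a_M z^{-M}}. \tag{2.11}$$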
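As an illustration of how (2.2) and (2.11) relate, the following minimal Python sketch evaluates the difference equation for fixed (non-adapting) coefficients and evaluates the transfer function on the unit circle. The function names `iir_filter` and `freq_response` are hypothetical and chosen only for this example; the sign convention follows (2.2), in which the feedback terms are added.

```python
import cmath

def iir_filter(b, a, x):
    """Evaluate (2.2) with fixed coefficients.

    b holds b_0..b_{N-1}; a holds a_1..a_M (feedback terms are added,
    matching the sign convention of (2.2)).
    """
    y = []
    for n in range(len(x)):
        # Feedforward section: current and past inputs.
        acc = sum(b[i] * x[n - i] for i in range(len(b)) if n - i >= 0)
        # Feedback section: past outputs only.
        acc += sum(a[i - 1] * y[n - i] for i in range(1, len(a) + 1) if n - i >= 0)
        y.append(acc)
    return y

def freq_response(b, a, omega):
    """Evaluate H(z) of (2.11) at z = exp(j*omega)."""
    z_inv = cmath.exp(-1j * omega)
    num = sum(b[i] * z_inv ** i for i in range(len(b)))
    den = 1 - sum(a[i - 1] * z_inv ** i for i in range(1, len(a) + 1))
    return num / den

# Single pole at z = 0.5: the impulse response is 0.5**n.
impulse = [1.0, 0.0, 0.0, 0.0, 0.0]
h = iir_filter([1.0], [0.5], impulse)
```

For this one-pole example the gain at zero frequency is 1/(1 - 0.5) = 2, which `freq_response([1.0], [0.5], 0.0)` reproduces.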

Adaptive filters can be realized in a number of different structures, one of them being the tapped-delay-line or transversal structure shown in Figure 2.3. The $z^{-1}$ blocks represent delays of one clock cycle. Adaptive filters may also be realized using a cascade structure where two or more lower order filters are connected in series [23] as depicted in Figure 2.4. This structure is obtained by factoring (2.11) in terms of its poles and zeros [8]. In practice, it is typical to employ a cascade of second-order filters or second-order sections such that

$$H(z) = \prod_{i=1}^{Q} H_i(z) \tag{2.12}$$

where

$$H_i(z) = \frac{b_{0i} + b_{1i} z^{-1} + b_{2i} z^{-2}}{1 - a_{1i} z^{-1} - a_{2i} z^{-2}} \tag{2.13}$$

and Q represents the number of filter sections. Note that H(z) may also be expanded as a sum of sections to give

$$H(z) = \sum_{i=1}^{P} H_i(z) \tag{2.14}$$

where each $H_i(z)$, i = 1, ..., P, represents the transfer function for the i-th section.
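The cascade form (2.12)-(2.13) can be sketched in a few lines of Python by passing the signal through each second-order section in turn. This is only an illustrative sketch with fixed coefficients; the names `biquad` and `cascade` and the tuple layout `(b0, b1, b2, a1, a2)` are assumptions made for the example.

```python
def biquad(section, x):
    """Filter x through one second-order section of the form (2.13).

    section = (b0, b1, b2, a1, a2), with denominator 1 - a1 z^-1 - a2 z^-2.
    """
    b0, b1, b2, a1, a2 = section
    x1 = x2 = y1 = y2 = 0.0  # delayed inputs and outputs, initially at rest
    out = []
    for xn in x:
        yn = b0 * xn + b1 * x1 + b2 * x2 + a1 * y1 + a2 * y2
        x2, x1 = x1, xn
        y2, y1 = y1, yn
        out.append(yn)
    return out

def cascade(sections, x):
    """Connect the sections in series, as in (2.12)."""
    for section in sections:
        x = biquad(section, x)
    return x

# Two identical one-pole sections (pole at z = 0.5) in series:
# the overall impulse response is (n + 1) * 0.5**n.
one_pole = (1.0, 0.0, 0.0, 0.5, 0.0)
response = cascade([one_pole, one_pole], [1.0, 0.0, 0.0, 0.0])
```

Because the sections are applied one after another, their transfer functions multiply, which is why the example produces the response of a double pole.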

Figure 2.3: Transversal structure of an adaptive filter

Figure 2.4: Cascade Structure of an Adaptive Filter

Figure 2.5: Lattice-Ladder Structure of an Adaptive Filter with N = M + 1

Yet another filter form that is particularly attractive in speech processing applications is what is known as the lattice or lattice-ladder structure [27] shown in Figure 2.5. Note that this structure comprises a lattice and a ladder section. All stages of the lattice section are identical (except for the κ parameters) as shown in Figure 2.6. The κ parameters are known as the reflection coefficients and the $v_i$'s are known as the ladder coefficients.

Figure 2.6: Details for i-th stage of a Lattice Filter (signals: $f_{i-1}(n)$, $g_{i-1}(n)$, $f_i(n)$, $g_i(n)$)

As mentioned in Section 1.2, there are two kinds of causal [7] adaptive linear filters namely Finite Impulse Response (FIR) and Infinite Impulse Response (IIR) filters. The differences in the way they form their outputs result in differing temporal and spectral characteristics which will be discussed in greater detail later on. However, their purpose is the same - to adapt to changes in their operating environment and thereby, achieve a desired performance objective.

Performance objectives are inherent to adaptive systems. For a biological system, the objective may be to be able to survive on less water, to withstand cooler temperatures or to hide from predators. In the case of adaptive filters, the goal is to minimize the output error, e(n), in some sense. To this end, performance objectives are expressed as mathematical functions of e(n) called cost functions which we denote by ξ. Designers of adaptive filtering systems have a number of different cost functions to choose from, including the following [2] [23]:

• Least Squares: $\xi = \frac{1}{K+1}\sum_{i=0}^{K} |e(n-i)|^2$

• Mean Square Error: $\xi = E[e^2(n)]$

• Weighted Least Squares: $\xi = \sum_{i=0}^{K} \lambda^i\,|e(n-i)|^2$, $0 < \lambda < 1$

• Instantaneous Squared Value: $\xi = |e(n)|^2$

where E[·] is the statistical expectation operator. Note that the MSE criterion represents a special case of the general class of cost functions

$$\xi_p = E[|e(n)|^p], \quad p = 1, 2, 3, 4, \ldots \tag{2.15}$$

and is the optimal criterion when input data is drawn from a Gaussian distribution. In practice, designers base their decision on which cost function to employ mainly on its mathematical tractability. As a result, the computationally simple Mean Squared Error (MSE) criterion is the most widely used one and for this reason, in the remainder of this dissertation, we restrict ourselves to performance functions based on this criterion. The performance function can be expanded as follows:

$$\xi = E[e^2(n)] = E[\{d(n) - y(n)\}^2] \tag{2.16}$$

$$\phantom{\xi} = E[d^2(n)] - 2E[d(n)y(n)] + E[y^2(n)]. \tag{2.17}$$

It can also be expressed in terms of auto- and cross-correlation statistics. Let s(n) and z(n) represent two real-valued, discrete-time random processes, and let $\phi_{ss}(m)$ and $\phi_{sz}(m)$ represent their auto- and cross-correlation functions respectively. Then

$$\phi_{ss}(m) = E[s(n)\,s(n+m)], \tag{2.18}$$

$$\phi_{sz}(m) = E[s(n)\,z(n+m)]. \tag{2.19}$$

Thus, ξ can be expressed as

$$\xi = \phi_{ee}(0) = \phi_{dd}(0) - 2\phi_{dy}(0) + \phi_{yy}(0). \tag{2.20}$$
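The equivalence between (2.16) and (2.20) can be checked numerically. The sketch below replaces the expectations with time averages over a short record, which is an assumption made purely for illustration; the sample values of d(n) and y(n) are hypothetical, and the function names are not taken from the dissertation.

```python
def mean(values):
    """Time average used here as a stand-in for the expectation E[.]."""
    return sum(values) / len(values)

def mse_direct(d, y):
    """Sample version of (2.16): average of e^2(n) with e(n) = d(n) - y(n)."""
    return mean([(dn - yn) ** 2 for dn, yn in zip(d, y)])

def mse_from_correlations(d, y):
    """Sample version of (2.20): phi_dd(0) - 2*phi_dy(0) + phi_yy(0)."""
    phi_dd = mean([dn * dn for dn in d])
    phi_dy = mean([dn * yn for dn, yn in zip(d, y)])
    phi_yy = mean([yn * yn for yn in y])
    return phi_dd - 2.0 * phi_dy + phi_yy

# Hypothetical desired signal and filter output.
d = [1.0, -0.5, 0.25, 2.0]
y = [0.5, -0.25, 0.5, 1.5]
xi_direct = mse_direct(d, y)
xi_corr = mse_from_correlations(d, y)
```

Both routes give the same value for any pair of sequences, since the identity is purely algebraic once the same averaging operator is used for every term.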

Irrespective of which cost function a designer picks, performance objectives are considered met when ξ is minimized. In other words, if Q represents the set of points that minimize ξ, then the optimal set of filter coefficients, $\mathbf{w}_{opt}$, lies in Q.

The optimal MSE operating point can be determined analytically using linear algebra techniques but the sheer computational complexity of this approach, together with the effects of using finite precision processing units, prevents it from being used in practice. Rather, iterative adaptation algorithms that locate $\mathbf{w}_{opt}$ over a number of time steps are preferred. These algorithms process new, incoming data every time step, with the goal of converging to a final solution that approximates $\mathbf{w}_{opt}$.

The task of locating this point can be viewed as an optimization problem, expressed mathematically as

$$\min_{\mathbf{w}} \xi. \tag{2.22}$$

An assumption that is typically made in order to facilitate this process is that the input stochastic process is ergodic [13]. This permits the estimation of ensemble averages using time averages. The nature of the cost function, that is, whether it is unimodal or multimodal, determines to a large extent the type of algorithm used. Adaptive FIR filters have cost functions that are quadratic in the filter coefficients [5] [13] and thus possess a single minimum point. This means that their cost functions are unimodal, and therefore, local optimization algorithms are appropriate. On the other hand, in general, adaptive IIR filters have multimodal cost functions [12] [28] and therefore require the use of global optimization techniques. Regardless of the minimization scheme employed, it is imperative that the optimal operating point be located while using a minimum of computational resources (e.g. memory, time) and that the filter remain BIBO stable [7] throughout this process.

concerns are best managed by utilizing cascade filter structures made up of second-order sections [29] [30]. This simplifies the process of checking for stability because the stability requirement reduces to ensuring that the two feedback coefficients of each section, $a_{1i}$ and $a_{2i}$, satisfy the constraints $|a_{2i}| < 1$ and $|a_{1i}| < 1 - a_{2i}$, which describe a feasible region referred to as the stability triangle [7] [9].
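The stability triangle test can be sketched as follows and cross-checked against the pole locations of a section with denominator $1 - a_1 z^{-1} - a_2 z^{-2}$, as in (2.13). The function names are hypothetical and the coefficient pairs are example values only, chosen away from the boundary of the triangle.

```python
import cmath

def in_stability_triangle(a1, a2):
    """Check |a2| < 1 and |a1| < 1 - a2 for one second-order section."""
    return abs(a2) < 1.0 and abs(a1) < 1.0 - a2

def section_poles(a1, a2):
    """Roots of z^2 - a1*z - a2 = 0, i.e. the poles of the section."""
    root = cmath.sqrt(a1 * a1 + 4.0 * a2)
    return (a1 + root) / 2.0, (a1 - root) / 2.0

def poles_inside_unit_circle(a1, a2):
    """Direct test: both poles must have magnitude below one."""
    return all(abs(p) < 1.0 for p in section_poles(a1, a2))

# Example coefficient pairs (a1, a2).
pairs = [(0.5, 0.3), (1.5, -0.7), (0.9, 0.3), (0.0, -1.2)]
triangle = [in_stability_triangle(a1, a2) for a1, a2 in pairs]
direct = [poles_inside_unit_circle(a1, a2) for a1, a2 in pairs]
```

The triangle test needs only two comparisons per section, whereas the direct test requires a square root, which is one reason the second-order cascade is convenient when stability must be checked at every adaptation step.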

The combination of BIBO stability requirements and the nature of the MSE cost function determine the complexity of the adaptive algorithm employed.

2.2 Adaptive IIR Filtering Algorithms

Numerous signal processing and communication systems operate in environments where conditions change with time. Given that there is limited a priori information about the operating environment at design time, adaptive filters represent the most effective devices for such applications. Although both FIR and IIR implementations can be used to solve any particular filtering problem, in practice, adaptive FIR implementations are more widely used because of the simplicity of minimizing the associated MSE cost function and the fact that the filter is guaranteed to be stable. However, adaptive IIR implementations exhibit qualities that make them advantageous in numerous applications [28]. First, they can more accurately model physical plants, which tend to have pole-zero structures, because adaptive IIR filters have rational transfer functions themselves. Unlike FIR filters, whose poles are restricted to the origin of the z-plane, the poles of IIR filters can be placed anywhere inside the unit circle. This has several advantages, one of them being narrower transition bands. Secondly, they are generally able to meet performance specifications using fewer filter coefficients than equivalent adaptive FIR implementations [9]. This advantage translates into lower power consumption, making adaptive IIR solutions particularly attractive in applications where power and memory resources are limited. Examples of such applications include cellphones, hearing aids, biomedical and space applications.

have non-trivial poles which may move outside the unit circle in the z-plane during adaptation. Secondly, in general, the presence of feedback terms in equation (2.2) causes the associated MSE cost function to be multimodal [8]. The characteristics of the input excitation and the relative number of poles and zeros between the plant and adaptive filter model determine whether or not the cost function is multimodal.

Definition 1 Consider an adaptive filtering system where a filter with transfer function given by (2.11) models a plant with transfer function

$$P(z) = \frac{d_0 + d_1 z^{-1} + d_2 z^{-2} + \cdots + d_{N_d-1} z^{-(N_d-1)}}{1 - c_1 z^{-1} - c_2 z^{-2} - \cdots - c_{M_c} z^{-M_c}}. \tag{2.23}$$

Let $n^* = \min\{M - M_c,\; N - N_d\}$. Then, H(z) is of sufficient order if $n^* \geq 0$ and is of insufficient order if $n^* < 0$ [31].
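Definition 1 amounts to a simple comparison of orders, sketched below. The function names and the example orders are hypothetical; N and M are the filter's feedforward and feedback lengths from (2.2), and $N_d$ and $M_c$ are the corresponding plant quantities from (2.23).

```python
def order_margin(N, M, Nd, Mc):
    """Return n* = min(M - Mc, N - Nd) from Definition 1."""
    return min(M - Mc, N - Nd)

def is_sufficient_order(N, M, Nd, Mc):
    """The filter is of sufficient order when n* >= 0."""
    return order_margin(N, M, Nd, Mc) >= 0

# A filter matching the plant orders exactly is of sufficient order.
matched = is_sufficient_order(N=3, M=2, Nd=3, Mc=2)
# A filter with fewer poles and zeros than the plant is undermodeled.
undermodeled = is_sufficient_order(N=2, M=1, Nd=3, Mc=2)
```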

The conditions under which the error surface may be multimodal can be classified as [32]:

1. Sufficient order filter excited with white noise

2. Insufficient order filter excited with white noise

3. Sufficient order filter excited with colored noise

4. Insufficient order filter excited with colored noise

In 1981, Stearns [12] conjectured that the MSE error surface for Category 1 was unimodal. Fan and Nayeri [33], however, showed through counterexamples that this held true only for first- (M=1) and second-order (M=2) systems. Söderström and Stoica [34] later proved that a filter of sufficient order, when excited with white noise, yields a unimodal MSE error surface if

$$(N_b + 1) - M_c \geq 0. \tag{2.24}$$

in [31] and [35], Fan and Nayeri derived conditions under which systems that fall into Categories 2, 3 and 4 produce unimodal and multimodal error surfaces.

In practice, it is usually difficult, if not impossible, to determine $N_d$ and $M_c$ and therefore, whether the conditions for unimodality in any category have been met. As a result, designers consider the general case where the cost function is multimodal. The need to minimize the cost function while ensuring the filter remains stable creates such a challenging mathematical problem that it has not been adequately solved to date.

Significant research effort has been focused on developing algorithms for adaptive IIR filters [14] [15] [16] [17] [18] [19]. Adaptation necessarily involves optimization which in turn requires managing constraints and trade-offs [36], and these adaptive algorithms seek to do the same. One primary trade-off is that of success rate versus computational efficiency. On one hand, these algorithms must locate $\mathbf{w}_{opt}$ by determining the global minimum of ξ rather than a local minimum. This relates to the success rate of the algorithm and therefore, the efficacy of the entire adaptive filtering system. On the other hand, this must be done while minimizing the use of power, time, computing and memory resources because real-time applications where these systems are used have limited resources. This relates to the computational efficiency of the system. The more computing, memory and power resources are needed, the bigger the physical size and monetary cost of the system, which is undesirable in many applications. The most effective system at locating global minima is not necessarily the most computationally efficient one, and so the key is to strike a balance between the two requirements.

All algorithms for adjusting the coefficients of adaptive IIR filters seek to manage the trade-off between computational efficiency and success rate, while constraining the poles of the filter to lie inside the unit circle. In evaluating the suitability of an algorithm, the following criteria are used:

• Ability to locate wopt: The primary goal of an adaptive algorithm is to


in determining how well an algorithm performs. Assume that the algorithm converges to a point, $\mathbf{w}_0$, after a number of iterations.

– Bias: This measures the difference, in the mean, between $\mathbf{w}_0$ and $\mathbf{w}_{opt}$, and is defined as $E[\mathbf{w}_0] - \mathbf{w}_{opt}$. In general, unbiased estimates of $\mathbf{w}_{opt}$ are required.

– Excess MSE: The presence of noise in the measurement of gradients (e.g. Jacobians, Hessians) and other parts of the adaptation process causes algorithms to converge to points that oscillate about $\mathbf{w}_{opt}$. As a result, the mean-square error associated with $\mathbf{w}_0$, $\xi_{w_0}$, is greater than the minimum mean-square error attainable, $\xi_{min}$, which is associated with $\mathbf{w}_{opt}$. “Excess MSE” therefore refers to the excess mean-square error associated with $\mathbf{w}_0$ and is defined as [5]:

$$\text{Excess MSE} = E[\xi_{w_0} - \xi_{min}] \tag{2.25}$$

The ideal algorithm produces zero excess MSE.

• Convergence rate: This refers to the number of time steps/iterations it takes the algorithm to converge to a final solution. Naturally, the faster the algorithm converges, the better.

• Computational complexity: As much as possible, this must be kept to a minimum. With iterative algorithms, it is sometimes difficult to determine the exact number of multiplications, additions and so on that are required. Consequently, the number of times the algorithm evaluates the cost function (and Jacobian, Hessian, if appropriate) is used as a measure of the computational complexity.

• Stability: Poles of solution points need to be constrained to lie inside the unit circle in the z-plane.
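The bias and excess MSE criteria listed above can be estimated from a set of independent runs by replacing the expectations with averages over the runs. The sketch below assumes this sample-average approach; the function names and all numerical values are hypothetical and serve only to show the arithmetic.

```python
def bias(final_points, w_opt):
    """Per-coefficient sample estimate of E[w0] - w_opt over independent runs."""
    runs = len(final_points)
    return [
        sum(w[k] for w in final_points) / runs - w_opt[k]
        for k in range(len(w_opt))
    ]

def excess_mse(final_mse_values, xi_min):
    """Sample estimate of (2.25): average of xi_w0 - xi_min over the runs."""
    return sum(xi - xi_min for xi in final_mse_values) / len(final_mse_values)

# Hypothetical final points of three runs and a hypothetical optimum.
final_points = [[0.5, 0.25], [0.75, 0.25], [1.0, 0.25]]
w_opt = [0.75, 0.25]
b_est = bias(final_points, w_opt)

# Hypothetical MSE values reached by the three runs.
e_est = excess_mse([0.125, 0.25, 0.375], xi_min=0.125)
```

In this made-up example the final points scatter symmetrically about the optimum, so the estimated bias is zero even though the excess MSE is not.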


2.3 Literature Survey

Over the last couple of decades, researchers have developed a number of algorithms designed to help realize the advantages of adaptive IIR filters. In this section, we survey state-of-the-art IIR filtering algorithms. In each case, we briefly describe the algorithm and discuss its advantages and disadvantages.

2.3.1 IIR-LMS Algorithm

The IIR-Least Mean Square (IIR-LMS) algorithm represents a stochastic version of the well-known steepest descent algorithm [37] [38] [39], which updates the current estimate using the following recursion:

$$ x_{n+1} = x_n + s_n d_n. \tag{2.26} $$

Here, $s_n$ represents a step size and $d_n$ denotes the search direction, which is taken as the negative of the gradient vector at $x_n$. This forms the basis of the IIR-LMS algorithm, which is developed as follows.

The recursive equation that governs the output of adaptive IIR filters is shown in (2.2) and reproduced here for convenience:

$$ y(n) = \sum_{i=0}^{N-1} b_i(n)\,x(n-i) + \sum_{i=1}^{M} a_i(n)\,y(n-i) \tag{2.27} $$

Minimization of $\xi = E[e^2(n)]$ requires an infinite number of realizations of the random process $e^2(n)$, something that is not possible in practice. Instead, a coarse estimate,

$$ \hat{\xi}(n) = e^2(n) = d^2(n) - 2d(n)y(n) + y^2(n), \tag{2.28} $$

is used.

Let u and w be as defined in (2.3) and (2.4) respectively. Then from (2.28) ˆ


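where $\nabla_w$ is the gradient operator with respect to the filter coefficient vector $w$. Noting that $d(n)$ is independent of $w$, the gradient vector becomes:

$$ \hat{\nabla}\xi(n) = -2e(n)\nabla_w y(n) = -2e(n)\left[ \frac{\partial y(n)}{\partial b_0(n)} \;\; \frac{\partial y(n)}{\partial b_1(n)} \;\; \cdots \;\; \frac{\partial y(n)}{\partial b_{N-1}(n)} \;\; \frac{\partial y(n)}{\partial a_1(n)} \;\; \cdots \;\; \frac{\partial y(n)}{\partial a_M(n)} \right]^T. \tag{2.30} $$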

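The partial derivatives in (2.30) are calculated using (2.27) as follows:

$$ \frac{\partial y(n)}{\partial b_i(n)} = x(n-i) + \sum_{l=1}^{M} a_l(n)\frac{\partial y(n-l)}{\partial b_i(n)}, \quad i = 0, 1, \ldots, N-1 \tag{2.31} $$

$$ \frac{\partial y(n)}{\partial a_i(n)} = y(n-i) + \sum_{l=1}^{M} a_l(n)\frac{\partial y(n-l)}{\partial a_i(n)}, \quad i = 1, \ldots, M \tag{2.32} $$

Only the feedback terms $a_l(n)y(n-l)$ in (2.27) depend on past outputs, so the recursive sums run over the feedback coefficients $a_l(n)$.

To simplify later expressions, define

$$ \alpha_i(n) = \frac{\partial y(n)}{\partial a_i(n)}, \quad i = 1, \ldots, M \tag{2.33} $$

$$ \beta_i(n) = \frac{\partial y(n)}{\partial b_i(n)}, \quad i = 0, 1, \ldots, N-1 \tag{2.34} $$

Using the assumption that the filter coefficients, $a_i$ and $b_i$, vary slowly in time, the following simplifications can be made:

$$ \frac{\partial y(n-l)}{\partial a_i(n)} \approx \frac{\partial y(n-l)}{\partial a_i(n-l)} = \alpha_i(n-l) \tag{2.35} $$

$$ \frac{\partial y(n-l)}{\partial b_i(n)} \approx \frac{\partial y(n-l)}{\partial b_i(n-l)} = \beta_i(n-l) \tag{2.36} $$

for $l = 1, 2, \ldots, M$. Substituting (2.35) and (2.36) into equations (2.32) and (2.31) respectively, the equations for updating $\alpha_i(n)$ and $\beta_i(n)$ are:

$$ \beta_i(n) = x(n-i) + \sum_{l=1}^{M} a_l(n)\beta_i(n-l) \tag{2.37} $$

$$ \alpha_i(n) = y(n-i) + \sum_{l=1}^{M} a_l(n)\alpha_i(n-l) \tag{2.38} $$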

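Hence, the LMS update equation for IIR filters, using the output error method, becomes

$$ w(n+1) = w(n) + 2\mu e(n)\eta(n) \tag{2.39} $$

where

$$ \eta(n) = [\beta_0(n) \; \cdots \; \beta_{N-1}(n) \;\; \alpha_1(n) \;\; \alpha_2(n) \; \cdots \; \alpha_M(n)]^T. \tag{2.40} $$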

The $\alpha_i(n)$’s and $\beta_i(n)$’s are updated using equations (2.38) and (2.37). Placing (2.39) in the form of (2.26), we get $s_n = \mu$ and $d_n = 2e(n)\eta(n)$. In summary, the steps of the IIR-LMS algorithm are as listed in Table 2.1.

Table 2.1: Summary of IIR LMS Algorithm [13]

Steps of the IIR-LMS Algorithm

1. Compute output: $y(n) = w^T(n)u(n)$, where $u(n) = [x(n) \; \cdots \; x(n-N+1) \;\; y(n-1) \; \cdots \; y(n-M)]^T$
2. Estimate error: $e(n) = d(n) - y(n)$
3. Update $\eta(n)$:
   $\beta_i(n) = x(n-i) + \sum_{l=1}^{M} a_l(n)\beta_i(n-l)$
   $\alpha_i(n) = y(n-i) + \sum_{l=1}^{M} a_l(n)\alpha_i(n-l)$
   $\eta(n) = [\beta_0(n) \; \cdots \; \beta_{N-1}(n) \;\; \alpha_1(n) \; \cdots \; \alpha_M(n)]^T$
4. Adapt coefficients: $w(n+1) = w(n) + 2\mu e(n)\eta(n)$
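The four steps of Table 2.1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in the cited works: the function name `iir_lms`, the zero initial conditions, and the fixed step size `mu` are choices made here for the example, and no pole-stability check is performed. The gradient recursions are driven by the feedback coefficients, as follows from differentiating (2.27).

```python
import numpy as np

def iir_lms(x, d, N, M, mu):
    """Output-error IIR-LMS sketch following Table 2.1.

    x, d : input and desired sequences (1-D arrays of equal length)
    N, M : number of feedforward (b) and feedback (a) coefficients
    mu   : fixed step size (illustrative choice; no stability check is done)
    """
    L = len(x)
    b = np.zeros(N)              # b_0 .. b_{N-1}
    a = np.zeros(M)              # a_1 .. a_M
    y = np.zeros(L)
    e = np.zeros(L)
    beta = np.zeros((L, N))      # beta_i(n)  ~ dy(n)/db_i
    alpha = np.zeros((L, M))     # alpha_i(n) ~ dy(n)/da_i
    for n in range(L):
        # Regressor pieces, with zeros assumed before n = 0
        xs = np.array([x[n - i] if n - i >= 0 else 0.0 for i in range(N)])
        ys = np.array([y[n - i] if n - i >= 0 else 0.0 for i in range(1, M + 1)])
        y[n] = b @ xs + a @ ys                   # Step 1: filter output
        e[n] = d[n] - y[n]                       # Step 2: output error
        beta[n] = xs                             # Step 3: gradient recursions
        alpha[n] = ys
        for l in range(1, M + 1):
            if n - l >= 0:
                beta[n] += a[l - 1] * beta[n - l]
                alpha[n] += a[l - 1] * alpha[n - l]
        b = b + 2 * mu * e[n] * beta[n]          # Step 4: coefficient update
        a = a + 2 * mu * e[n] * alpha[n]
    return b, a, e
```

For a matched-order, noise-free identification problem with a white input, the coefficients returned by this sketch should approach those of the plant when `mu` is small.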


this problem [41] [42]. Although they work well in non-stationary environments, they have the disadvantage of increased computational complexity.

Boukis et al [40] address the problem of determining the optimal step-size by proposing an algorithm that uses trainable step-sizes. The proposed algorithm, called the “Adaptive Step Size Output Error” (ASSOE) algorithm, introduces a step for training µ, which is:

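$$ \mu(n+1) = \mu(n) - \rho\,\nabla_\mu J(n)\big|_{\mu=\mu(n-1)}, \tag{2.41} $$

where $J(n)$ represents the instantaneous squared error

$$ J(n) = \frac{1}{2}e^2(n) = \frac{1}{2}[d(n) - y(n)]^2 \tag{2.42} $$

and the gradient is defined as

$$ \nabla_\mu J(n) = -e(n)\,u^T(n)\,\frac{\partial w(n)}{\partial \mu(n-1)}. \tag{2.43} $$

Using assumptions similar to those of the IIR-LMS algorithm, the term $\Upsilon(n) = \frac{\partial w(n)}{\partial \mu(n-1)}$ is derived as

$$ \Upsilon(n) = \lambda\Upsilon(n-1) - \lambda\mu(n-1)\Phi_f(n-1)\Phi_f^T(n-1)\Upsilon(n-1) + \lambda\mu(n-1)e(n-1)\frac{\partial \Phi_f(n-1)}{\partial w(n-1)}\Upsilon(n-1) + e(n-1)\Phi_f(n-1) \tag{2.44} $$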

where $\Phi_f$ is an estimate of the gradient of y(n) with respect to w. It turns out that the multiplicative term, λ (λ ∈ [0,1]), determines whether or not ASSOE behaves as an extension of a variable step-size LMS algorithm developed for FIR filters.

Simulation results showed that larger values of λ lead to faster convergence, which is not unexpected. Boukis et al note that the range of allowable values for µ(n) is determined by the unknown plant and that when compared to the LMS algorithm that uses a constant step size set to the optimum value, the ASSOE algorithm performed only marginally worse. However, in practice, when the optimum µ is unknown and time-varying, ASSOE performs better.


that approach. The primary advantage of these algorithms is that they are computationally simple. This simplifies implementation and lends itself to theoretical analysis [43]. However, it is well known that they are sensitive to initial conditions. Thus, if the algorithm is initialized to a point that lies within the vicinity of a local minimum point, it converges to that local minimum. However, if the initial point is within the vicinity of a global minimum point, it converges to the global minimum point. As it is usually impossible to determine whether the initial point lies in the vicinity of a local or global minimum point, it is impossible to initialize the algorithm to a point that guarantees convergence to $w_{opt}$. Another drawback of IIR-LMS style algorithms is that they require a procedure for checking the stability of the filter. This usually takes the form of using second-order sections as described in Section 2.1 and ultimately increases the complexity of the overall algorithm. In general, the drawbacks associated with IIR-LMS type algorithms tend to outweigh the advantages and therefore, such algorithms are not used in practice.

2.3.2 Global Least Mean Square (GLMS) Algorithm

Steepest-descent algorithms such as IIR-LMS are known to be sensitive to initial conditions, as a result of which convergence to optimum operating points (global minima) cannot be guaranteed. Edmonson et al [17] address this issue by developing a modified IIR-LMS algorithm, called the “Global Least Mean Square” (GLMS) algorithm, that has better convergence characteristics. They base this algorithm on a method known as Stochastic Approximation with convolution Smoothing (SAS) [44] where a non-convex function is convolved with a “noise” probability density function (pdf) to smooth it. The variance of the noise pdf determines the amount of smoothing that occurs.


its original, non-convex shape. The update recursion for the filter coefficients is:

$$ w(n+1) = w(n) - \lambda(n)\Xi(n) + \eta\beta \tag{2.45} $$

where λ(n) is a gain sequence, η is a random vector, and β controls the degree of smoothing. The parameter Ξ(n) is a vector of gradient estimates, computed using the relationships in (2.37) and (2.38).

Experimental tests on system identification setups showed that GLMS outperformed IIR-LMS in noisy and noiseless environments and was comparable to Steiglitz-McBride approaches [28] [45] [46]. Overall, Edmonson et al found GLMS to be only marginally more computationally complex than IIR-LMS, which is a positive attribute. However, as the authors point out, work remains to be done to determine the optimal smoothing schedule, β, and also its convergence properties. As a stochastic optimization technique, GLMS is conjectured to converge to global minima with probability 1, although no analytical work proving this exists.
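A single iteration of the update (2.45) can be sketched as below. The function name `glms_step` is hypothetical, the gradient estimate Ξ(n) is assumed to be supplied by the caller (for instance from the recursions (2.37) and (2.38)), and drawing η from a standard Gaussian is an assumption made for illustration; the source does not fix the noise distribution or the schedules for λ(n) and β.

```python
import numpy as np

def glms_step(w, grad_est, lam, beta, rng):
    """One GLMS-style update: w(n+1) = w(n) - lam * Xi(n) + eta * beta.

    w        : current coefficient vector
    grad_est : gradient estimate Xi(n), supplied by the caller
    lam      : gain lambda(n) for this iteration
    beta     : smoothing parameter (beta = 0 removes the perturbation)
    rng      : numpy random Generator used to draw eta (Gaussian is an assumption)
    """
    eta = rng.standard_normal(w.shape)   # random perturbation vector
    return w - lam * grad_est + beta * eta
```

With `beta = 0` the step reduces to an ordinary gradient step, which is consistent with the description of smoothing being removed as the algorithm proceeds.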

2.3.3 Genetic Algorithms

Adaptive IIR filters, unlike their FIR counterparts, generally have multimodal cost functions, making global optimization algorithms necessary for minimizing them. Steepest descent algorithms such as IIR-LMS, ASSOE and GLMS are local optimization schemes and therefore may converge to locally optimal points. Global optimization algorithms provide a more effective means of minimizing ξ because they are designed to avoid local minima. Genetic Algorithms represent one such scheme.

Figure 2.7: Basic Genetic Algorithm Cycle [21] (stages: Old Population, Selection, Reproduction, Mutation, New Population)

Any implementation of a genetic algorithm begins with the generation of an initial population of points inside the given search space. Each point is referred to as a chromosome. As per the original formulation of the algorithm, each member of this randomly generated population is represented by a binary string of length L. Then, each chromosome is evaluated independently of other chromosomes using a fitness function. This function, which is equivalent to a cost function, assigns fitness values to each chromosome such that the “healthiest” ones have larger values. The bit string is said to represent the genotype and the fitness value, the phenotype.


Universal Sampling (SUS) [51] also uses the roulette wheel analogy but instead of a single selection point, it utilizes N equally spaced ones, where N represents the size of the intermediate population. This way, the entire population is determined with one spin of the wheel. Once the intermediate population has been formed, reproduction takes place.
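The single-spin selection described above can be sketched as follows. The function name and argument names are hypothetical, and fitness values are assumed to be non-negative with a positive sum; this is an illustration of the idea rather than the procedure of [51].

```python
import random

def stochastic_universal_sampling(fitness, n_select, rng=random):
    """Select n_select indices with one spin and equally spaced pointers.

    fitness  : non-negative fitness values, one per chromosome (assumed)
    n_select : size of the intermediate population
    rng      : object providing uniform(a, b), e.g. random or random.Random
    """
    total = float(sum(fitness))
    step = total / n_select               # spacing between the pointers
    start = rng.uniform(0.0, step)        # the single "spin" of the wheel
    selected = []
    cumulative = 0.0                      # fitness mass before chromosome idx
    idx = 0
    for k in range(n_select):
        pointer = start + k * step
        # Advance until the pointer falls inside the slot of chromosome idx
        while idx < len(fitness) - 1 and cumulative + fitness[idx] < pointer:
            cumulative += fitness[idx]
            idx += 1
        selected.append(idx)
    return selected
```

Because the pointers are equally spaced, a chromosome holding three quarters of the total fitness receives three of four selections regardless of where the spin lands.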

Reproduction mimics biological mating by randomly selecting members of the intermediate population and exchanging information between them. During the exchange of information, a process called crossover, a pair of strings “recombine” to form two new strings that will be part of a new population. The following example from [51] illustrates the process of recombination. Consider two binary strings: 1101001100101101 and yxyyxyxxyyyxyxxy. A 1-point crossover occurs by randomly choosing a recombination point and swapping the segments of the strings to create two new strings, as illustrated in Figure 2.8. The two resulting strings are 11010yxxyyyxyxxy and yxyyx01100101101.

Figure 2.8: 1-Point Crossover of two binary strings (the parents are cut after the fifth position into 11010 | 01100101101 and yxyyx | yxxyyyxyxxy, and the tails are swapped)
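The recombination in this example can be reproduced with a few lines of Python. The function name `one_point_crossover` is hypothetical; the recombination point is passed in explicitly here so that the example is deterministic, whereas in a GA it would be drawn at random.

```python
def one_point_crossover(parent_a, parent_b, point):
    """Swap the tails of two equal-length strings after position `point`."""
    child_1 = parent_a[:point] + parent_b[point:]
    child_2 = parent_b[:point] + parent_a[point:]
    return child_1, child_2

# The example from the text, with the recombination point after position 5
children = one_point_crossover("1101001100101101", "yxyyxyxxyyyxyxxy", 5)
```

In a full GA the point could be chosen with, for example, `random.randint(1, len(parent_a) - 1)` so that both segments are non-empty.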

Next, mutation takes place. Each bit of each chromosome in the population flips (0’s become 1’s and vice versa) with probability $p_m$, where $p_m$ is usually less than 1% [51]. The process serves to add small perturbations to the newly generated strings and ensure that all parts of the search space are reached. It also helps the algorithm escape from local minima. Note that the process of going from the current population to the next represents one iteration in the execution of GA.
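Bit-flip mutation of a binary-coded chromosome can be sketched as below. The function name `mutate` and the string representation of chromosomes are assumptions made for the example.

```python
import random

def mutate(bits, p_m, rng=random):
    """Flip each bit of a '0'/'1' string independently with probability p_m."""
    flipped = []
    for bit in bits:
        if rng.random() < p_m:                  # rng.random() lies in [0, 1)
            flipped.append('1' if bit == '0' else '0')
        else:
            flipped.append(bit)
    return ''.join(flipped)
```

With a value such as `p_m = 0.01`, most chromosomes pass through unchanged and only occasional bits are perturbed.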


incorrect paths. This is due to the randomization process. Although GA has the advantage of being able to search large domains and escape local minima, its primary disadvantage is its high computational complexity [21]. The need to compute fitness values for all chromosomes in the population during each and every iteration greatly increases the computational cost of this approach. Another disadvantage of this formulation of GA lies in the use of binary strings for encoding chromosomes. This approach, called Binary-Coded GA, leads to discretization of the search space, as a result of which the exact global minimum may not be located [52]. In order to avoid this, some researchers suggest representing chromosomes as real numbers, an approach that is called Real-Coded GA (RCGA). As Herrera et al [53] note, employing real numbers does not degrade performance because the desirable properties of GA do not derive from the use of bit strings.

Etter [19] was the first to investigate the application of GA to adaptive filters. This work served to demonstrate the viability of Binary-Coded GA schemes as adaptive IIR filtering algorithms. Nambiar et al [20] also used GA to solve system identification problems and found that their implementations exhibited slow convergence and high computational complexity. They suggested using GA only to locate points close to global minima since Binary-Coded GA discretizes the search space and therefore may not be able to locate the exact optimum point. This conclusion is similar to that obtained in [54] where they developed a hybrid GA and Simulated Annealing (see Section 2.3.4) scheme for adaptive IIR filtering. Ma and Cowan [55] later extended the work of Nambiar et al in [54] by using “more difficult” plant models. They explored direct form, parallel, cascade and lattice filter structures and concluded that lattice structures produced the best result. They found cascade structures to perform better than parallel ones. The authors identified three main advantages of GA, namely: (1) No stability check is needed when lattice structures are used. Restricting the search space of reflection coefficients to (−1, 1) ensures stable results. (2) When poles lie very close to the unit circle, GA is better than gradient-descent schemes because there is no chance poles will move outside the unit circle due to noisy measurements. (3) GA is highly parallelizable.


2.3.1) and GA, Ng et al [21] developed a hybrid scheme that combines gradient-descent and evolutionary mechanisms. They took the approach of using RCGA whenever the gradient-descent portion converged to a local minimum, or the convergence speed was slow. They achieved this by using the mutation operator to perturb the filter coefficients, thereby generating a new set of coefficients. Stability tests were performed to ensure that stable filters resulted. Then, the filter with the highest fitness value (lowest MSE) was picked for adaptation by the gradient-descent scheme. The fitness of the j-th chromosome was measured through its mean-square error, $\bar{e}_j^2$, defined as:

$$ \bar{e}_j^2 = \frac{1}{t_e}\sum_{n=1}^{t_e} [d(n) - y_j(n)]^2 \tag{2.46} $$

where $t_e$ represents the window size over which the errors will be accumulated, and $y_j(n)$ is the estimated output associated with the j-th set of estimated parameters. Note that the fitness value is inversely proportional to the mean-square error, $\bar{e}_j^2$. This process was repeated until the global minimum point was reached. Table 2.2 [21] details the steps of the algorithm. The authors compared the performance of their hybrid algorithm to that of Binary-Coded GA and IIR-LMS for a system identification problem and found the hybrid algorithm to converge much faster than both. In general, the new algorithm retained much of the simplicity of the gradient-descent algorithm, and required fewer computations than a standard implementation of GA.

2.3.4 Simulated Annealing

Table 2.2: Genetic Search Learning Algorithm [21]

Steps of Hybrid Algorithm

Step 0: Initialization: Initialize the filter coefficients (the initial chromosome) by some small real random numbers.
Step 1: IIR-LMS Adaptation: Calculate the system error e(n).
Step 2: If e(n) < tolerance, then STOP. Otherwise, update filter coefficients using IIR-LMS update recursions (2.37), (2.38) and (2.39).
Step 3: Compute the error gradient, ∆E = [e(n) − e(n−τ)]/τ, where τ is the window size for estimation of ∆E.
Step 4: If (|∆E| < gradient threshold) and (e(n) > error threshold) such that a given local minimum criterion is satisfied, then start the genetic search by mutating the current chromosome. Check the stability of the filter. Calculate the fitness of each new offspring over a period of time $t_e$. Select the offspring with minimum MSE and


algorithm to solve continuous, multimodal optimization problems. The only requirement for the minimization of a function f(x), where $x \in \mathbb{R}^n$, is that it be bounded; it does not need to be continuous. SA, like GA, is a guided, random search method. However, unlike GA, SA evolves a single point in the search space.

There are three main parts to any SA algorithm, namely generation of new points, acceptance/rejection of newly generated points and annealing of the acceptance temperature. Together, these processes evolve a given initial, randomly generated point, $x_0$, that tends to a global minimum [59]. In the first part of the algorithm, candidate points, $x'$, are generated with the aid of probability density functions (pdf) centered at the current point, $x_i$. These pdfs, referred to as generating pdfs, are typically uniform. Next, the candidate point is accepted or rejected based on a chosen criterion. The most widely used one is the Metropolis Criterion [57]:

If $\Delta f \le 0$, then accept the new point: $x_{i+1} = x'$
Else, accept the new point with probability:

$$ p(\Delta f) = \exp(-\Delta f / T) $$

where $\Delta f = f(x') - f(x_i)$ and T is the temperature. Since the goal is to minimize the cost function, f(x), points that result in lower values of f are automatically accepted. However, note that points for which $\Delta f > 0$ are not automatically discarded. The probability of accepting a point that results in an increase in the cost function grows with the temperature: the higher the temperature, the greater the number accepted. The ability to accept “worse” points is an important feature that enables SA to make uphill climbs, and therefore, escape local minima. As a result of this inbuilt mechanism, SA is also known as probabilistic hill climbing.
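The Metropolis Criterion translates directly into code. The sketch below uses hypothetical function names and assumes a strictly positive temperature; it separates the acceptance probability from the random draw so that the dependence on temperature is easy to inspect.

```python
import math
import random

def acceptance_probability(delta_f, temperature):
    """Metropolis acceptance probability for a change delta_f at temperature T > 0."""
    if delta_f <= 0:
        return 1.0                                # descent steps are always accepted
    return math.exp(-delta_f / temperature)       # uphill steps: exp(-delta_f / T)

def metropolis_accept(delta_f, temperature, rng=random):
    """Return True if the candidate point is accepted."""
    if delta_f <= 0:
        return True
    return rng.random() < acceptance_probability(delta_f, temperature)
```

For a fixed uphill step, evaluating `acceptance_probability` at two temperatures shows the larger value at the higher temperature, which is the mechanism behind the uphill climbs described above.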


be reduced only when the algorithm is at equilibrium. Annealing schedules take this into consideration and try to strike a balance between the competing needs for fast convergence and slow cooling. Examples of such schedules include an exponential annealing schedule suggested by Kirkpatrick et al,

$$ T_{i+1} = \alpha T_i, \quad \alpha \approx 0.9, \tag{2.47} $$

the Boltzmann-type schedule [60]

$$ T_{i+1} = T_0\,\frac{\log_e i_0}{\log_e i} \tag{2.48} $$

where $i_0$ is an initial time index, and a linear cooling schedule proposed by Randelman and Grest [61] where the temperature is reduced by an amount, $\Delta T$, every L steps:

$$ T_{i+1} = T_i - \Delta T. \tag{2.49} $$

In each case, the best equilibrium point, $x_i^{opt}$, is recorded before the temperature is reduced, and after annealing, the algorithm continues from $x_i^{opt}$.
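The three schedules (2.47) to (2.49) can be written as small functions. The function names, the default `i_0 = 2`, and the clamping of the linear schedule at zero are assumptions made for this sketch; the Boltzmann-type schedule requires `i > 1` so that the logarithm is positive.

```python
import math

def exponential_schedule(T_i, alpha=0.9):
    """Exponential cooling (2.47): T_{i+1} = alpha * T_i."""
    return alpha * T_i

def boltzmann_schedule(T_0, i, i_0=2):
    """Boltzmann-type cooling (2.48): T_0 * ln(i_0) / ln(i), for i > 1."""
    return T_0 * math.log(i_0) / math.log(i)

def linear_schedule(T_i, delta_T):
    """Linear cooling (2.49): T_{i+1} = T_i - delta_T, clamped at zero (assumption)."""
    return max(T_i - delta_T, 0.0)
```

The logarithmic schedule decays far more slowly than the exponential one, which illustrates the tension between slow cooling and fast convergence noted above.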

The process of generating candidate points, accepting or rejecting them and annealing the temperature is repeated until there is no significant reduction in f as T is reduced or some other predetermined termination criteria are met. Figure 2.9 shows the general flow of the algorithm. At termination, the best point found is returned.

As a heuristic, global optimization technique, the convergence properties of SA are of great interest, and over the years, significant research effort has gone into determining just what they are. Modeling the evolution of points via Markov chains, researchers found SA to converge to global minima asymptotically. In other words, SA converges to global minima with probability 1 if there are an infinite number of transitions at each temperature, $T_i$, and if $\lim_{i\to\infty} T_i = 0$ [59]. Clearly, this requires an infinite amount of time and data, and any implementation of SA can only approximate such a process. As a result, any particular implementation of SA is not guaranteed to converge to a global minimum point.

Figure 2.9: Flow of the SA algorithm (blocks: Initialize Parameters; Generate Candidate Points, ensuring points lie inside the search space; Accept/Reject Candidate Points using the Metropolis Criterion; Is Algorithm at Equilibrium? (Yes/No); Anneal Temperature, recording the current best point and reducing the temperature; Stopping Criteria Satisfied? (Yes/No); End)


to tune [59] [60] [62] [63]. As a result, over the last two decades, researchers have developed annealing/cooling schedules that result in faster convergence. For instance, Mean Field Annealing (MFA) [64] [65] utilizes a deterministic approximation to a Gibbs distribution that allows it to converge in only 1/50-th of the time required by traditional SA. However, MFA does not guarantee convergence to global minima. The desire to make SA converge faster led to the creation of some annealing schedules that quench rather than cool, a technique sometimes referred to as Simulated Quenching [60]. These schedules, however, nullify the theoretical guarantee of convergence associated with SA.

The relative simplicity of SA, coupled with its ability to locate global minima, makes it an attractive option for signal processing applications [66]. As a result, several researchers have sought to apply it to adaptive filtering, and in particular, adaptive IIR filters [54]. Chen et al [18] applied a variant of SA, called Adaptive Simulated Annealing (ASA), to just this kind of problem. The algorithm is based on a version of SA known as Very Fast Simulated Re-Annealing (VFSR) [67] whose salient feature is its ability to adapt to the changes in the nature of multidimensional cost functions. ASA achieves this by employing a generating pdf, parameterized by a set of generating temperatures, one for each dimension. These temperatures control the way the neighborhood around the current point is explored and are periodically re-scaled (re-annealed) to reflect the different slopes of ξ at the current point. Thus, ASA reduces temperatures during the annealing and re-annealing steps.

The annealing step occurs after every $N_{genera}$ candidate points have been accepted. For an N-dimensional problem, temperatures are reduced as follows:

$$ k_a = k_a + 1 \tag{2.50} $$

$$ k_j = k_j + 1 \tag{2.51} $$

$$ T_{k_a} = T_0 \exp(-c\,k_a^{1/N}) \tag{2.52} $$

$$ T^j_{g,k_j} = T^j_{g,0} \exp(-c\,k_j^{1/N}), \quad 1 \le j \le N \tag{2.53} $$

where $k_a$ and $k_j$ represent annealing times, $T^j_{g,k_j}$ represents the j-th generating

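accepted points, re-annealing takes place. First, the slope of the function along each coordinate axis is estimated as

$$ s_j = \frac{\partial \xi}{\partial x_j}, \quad 1 \le j \le N \tag{2.54} $$

and then, the temperatures are re-scaled:

$$ T^j_{g,k_j} = \frac{s_{max}}{s_j}\,T^j_{g,k_j}, \quad 1 \le j \le N \tag{2.55} $$

where $s_{max} = \max_j\{s_j\}$. The annealing times are also adjusted as

$$ k_j = \left( -\frac{1}{c}\log_e \frac{T^j_{g,k_j}}{T^j_{g,0}} \right)^N, \quad 1 \le j \le N \tag{2.56} $$

$$ k_a = \left( -\frac{1}{c}\log_e \frac{T_{k_a}}{T_0} \right)^N. \tag{2.57} $$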
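The annealing-time adjustments (2.56) and (2.57) are the inverses of the temperature laws (2.52) and (2.53): given a re-scaled temperature, they recover the annealing time that would have produced it. The sketch below, with hypothetical function names, makes this relationship explicit; it assumes $c > 0$ and a temperature no larger than the initial one.

```python
import math

def asa_temperature(T_initial, k, c, N):
    """Temperature after annealing time k: T_initial * exp(-c * k**(1/N))."""
    return T_initial * math.exp(-c * k ** (1.0 / N))

def asa_annealing_time(T, T_initial, c, N):
    """Annealing time that yields temperature T: (-(1/c) * ln(T / T_initial))**N."""
    return (-(1.0 / c) * math.log(T / T_initial)) ** N

# Round trip: the annealing time recovered from a temperature matches the original
T_example = asa_temperature(10.0, 5.0, c=1.0, N=3)
k_example = asa_annealing_time(T_example, 10.0, c=1.0, N=3)   # close to 5.0
```

After the re-scaling in (2.55), applying `asa_annealing_time` to the new temperature resets the annealing time so that subsequent annealing steps continue from a consistent point on the cooling curve.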

As proposed by Ingber [67], Chen et al use an acceptance rule known as the Barker Criterion [68] which accepts uphill climbs with a probability given by:

$$ p(\Delta f) = \frac{1}{1 + \exp\left(\dfrac{\Delta f}{q}\right)} \tag{2.58} $$

where q is a constant. Unlike the Metropolis criterion, the Barker criterion may not accept descent steps, particularly those that do not lower f by much [69].
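The behaviour noted above is easy to see numerically. The following sketch evaluates (2.58) with a hypothetical function name; it assumes $q > 0$ and moderate values of $\Delta f / q$, since `math.exp` overflows for very large arguments.

```python
import math

def barker_probability(delta_f, q):
    """Barker acceptance probability (2.58): 1 / (1 + exp(delta_f / q))."""
    return 1.0 / (1.0 + math.exp(delta_f / q))
```

A step with $\Delta f = 0$ is accepted with probability 0.5, and a small descent step is accepted with probability only slightly above 0.5, whereas the Metropolis criterion would accept both with certainty.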

Simulations conducted by Chen et al on system identification setups showed that ASA was able to locate the global minimum points. The authors note that the results obtained were comparable to those obtained in [21] and [70] where GA was used. They concluded that the efficiency of the Adaptive Simulated Annealing algorithm is of the order of GA.


Chapter 3 Analytical Approach

As discussed in Chapter 2, existing algorithms for adjusting the coefficients of adaptive IIR filters usually fail to locate the global minima of MSE cost functions. Consequently, adaptive IIR filters have not seen widespread usage. The sheer complexity of the problem, coupled with the limitations on computing power and memory, dictate that new and innovative solutions are required. This chapter describes a global optimization technique known as Branch-and-Bound and an arithmetic tool, Interval Arithmetic, which together form the basis of our approach.

3.1 Branch-and-Bound

Branch-and-Bound [71] [72] is a recursive, global optimization approach that guarantees convergence to global minima. It requires a way of splitting up the search space into non-overlapping subregions (branching), and a way of computing upper and lower bounds on the cost function over a region (bounding). These two mechanisms are required because branch-and-bound is based on the following two observations:


2. If the lower bound over a region D is greater than the upper bound of another region F, then D can be safely discarded from the search process (pruning).

Branch-and-bound algorithms work by bounding the cost function over the search space, discarding regions that are guaranteed not to contain global minimum points, and then splitting the remaining regions. This process of bounding, pruning and branching is applied repeatedly to the remaining subregions until one or more predetermined termination criteria are satisfied. Figure 3.1 illustrates this process. Note that gradient information is not used in this process.

Figure 3.1: Flow of the branch-and-bound process (blocks: BEGIN; Bound; Prune; Branch; Termination Criteria Met? No: repeat, Yes: END)

Consider the problem of minimizing a scalar-valued function f(x) over a region $X_0 \subseteq \mathbb{R}^N$. We seek the set of global minimizers, $X^*$, such that

$$ X^* = \{x^* \in X_0 \mid f(x^*) \le f(x) \;\; \forall x \in X_0\} \tag{3.1} $$

and the minimum value of f, $f^*$, given by:

$$ f^* = \min_{x \in X_0} f(x). \tag{3.2} $$

We introduce some notation: let $f_{ub}$ and $f_{lb}$ represent the upper and lower bounds on $f^*$ respectively. The upper bound of f over a region $Y \subset X_0$ will be denoted by $f^U(Y)$ and the lower bound by $f^L(Y)$. Without loss of generality, we consider coordinate-axis aligned regions or polytopes in $\mathbb{R}^N$, which shall be referred to as boxes. The width of the largest dimension of a box Y will be denoted by w(Y).

Definition 2 (Partition) Let D represent a box in $\mathbb{R}^N$ and I, a finite set of indices. Then $P = \{X_i : i \in I\}$ is said to be a partition of D if

$$ D = \bigcup_{i \in I} X_i $$

Table 3.1 shows the steps of a generic branch-and-bound algorithm designed to locate all global minima of f. It has as inputs:

- M initial search regions or boxes believed to contain global minimum points
- $\epsilon_x$: tolerance on the width of boxes ($\epsilon_x > 0$)
- $\epsilon_f$: tolerance on $f_{ub} - f_{lb}$ ($\epsilon_f > 0$)

Input parameters $\epsilon_x$ and $\epsilon_f$ are selected such that the following conditions are satisfied at termination:

- $w(X^{(i)}) \le \epsilon_x$

Table 3.1: Global Optimization using Branch-and-Bound

Steps of a generic Branch-and-Bound algorithm

1: Place all M initial search boxes, $\{X_i\}_{i=1}^{M}$, in a list, $L_1$. Bound f over all boxes and set $f_{ub} = \min_{1 \le i \le M} f^U(X_i)$.
2: (PRUNE) Discard from $L_1$ all boxes for which $f^L(X) > f_{ub}$. If $L_1$ is empty, go to Step 7. Otherwise, proceed to Step 3.
3: Form a partition, P, by selecting one or more boxes from $L_1$ for further processing. Remove these boxes from $L_1$ and go to Step 4.
4: (BRANCH) Partition every box in P and denote by $P'$ the set of all new sub-boxes created. Proceed to Step 5.
5: (BOUND) Compute bounds of f over each box in $P'$. For each box, if $f^U(X) - f^L(X) \le \epsilon_f$ or $w(X) \le \epsilon_x$, place it in $L_2$. Otherwise, add it to $L_1$. Let Q denote the number of boxes in $L_1$. Continue to Step 6.
6: Let $\bar{f}_{est} = \min_{1 \le j \le Q} f^U(X_j)$, $X_j \in P'$. If $\bar{f}_{est} < f_{ub}$, set $f_{ub} = \bar{f}_{est}$. Go to Step 2.
7: Discard from $L_2$ all boxes for which $f^L(X) > f_{ub}$. Denote by $X^{(1)}, X^{(2)}, \ldots, X^{(p)}$ the p boxes that remain. If global minima exist, they lie in these p boxes.

- $f_{ub} - f_{lb} \le \epsilon_f$
- $f_{lb} < f^* < f_{ub}$

The algorithm presented in Table 3.1 provides bounds on $f^*$ in addition to the location of global minima. At termination, the global minima (if any exist) lie in the remaining p boxes.
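A one-dimensional sketch of the bound, prune and branch cycle is given below. It is an illustration under several assumptions rather than the algorithm of Table 3.1 in full: every remaining box is bisected at each pass, boxes are plain `(lo, hi)` tuples, the test function $f(x) = (x^2 - 1)^2$ and its hand-written bounding routine `f_bounds` are chosen only for the example, and floating-point rounding is ignored. A box is pruned when its lower bound exceeds the best upper bound found so far.

```python
def sq_bounds(lo, hi):
    """Range of t**2 for t in [lo, hi]."""
    if lo >= 0:
        return lo * lo, hi * hi
    if hi <= 0:
        return hi * hi, lo * lo
    return 0.0, max(lo * lo, hi * hi)

def f_bounds(box):
    """Lower and upper bounds of f(x) = (x**2 - 1)**2 over box = (lo, hi)."""
    s_lo, s_hi = sq_bounds(*box)
    return sq_bounds(s_lo - 1.0, s_hi - 1.0)

def branch_and_bound(bounds, boxes, eps_x=1e-3, eps_f=1e-6):
    """Return the boxes that may contain global minima, and the final f_ub."""
    work = list(boxes)                    # boxes needing further processing
    done = []                             # boxes meeting a tolerance
    f_ub = min(bounds(b)[1] for b in work)
    while work:
        # PRUNE: drop boxes whose lower bound exceeds the best upper bound
        work = [b for b in work if bounds(b)[0] <= f_ub]
        # BRANCH: bisect every remaining box
        children = []
        for lo, hi in work:
            mid = 0.5 * (lo + hi)
            children += [(lo, mid), (mid, hi)]
        work = []
        # BOUND: bound f over each child and update the threshold f_ub
        for b in children:
            f_lo, f_hi = bounds(b)
            f_ub = min(f_ub, f_hi)
            if f_hi - f_lo <= eps_f or b[1] - b[0] <= eps_x:
                done.append(b)
            else:
                work.append(b)
    return [b for b in done if bounds(b)[0] <= f_ub], f_ub

boxes, f_ub = branch_and_bound(f_bounds, [(-2.0, 2.0)])
```

For this test function the surviving boxes cluster around the two global minimizers, x = −1 and x = 1, and `f_ub` is a small non-negative number bounding the minimum value from above.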

Note that branch-and-bound employs lists ($L_1$ and $L_2$) to track boxes that require further processing, suggesting the need for memory. The amount of memory required is a function of the number of boxes produced after splitting a single box. This, in turn, is exponentially related to the number of dimensions that are split, k. Specifically, each box produces $q^k$ new boxes, where q represents the number of sub-intervals produced after each dimension is split. This phenomenon of exponential growth is referred to as the curse of dimensionality [73], and can severely degrade the performance of the algorithm. This is because the greater the number of boxes in memory, the greater the amount of work that has to be done to discard them, and therefore the slower the algorithm [74].

One way to address the potential slowness of branch-and-bound algorithms is to minimize the number of boxes created. This relates to the strategy for splitting boxes and although several methods have been suggested [75] [76] [77], the best strategy is not known. For instance, Hansen [78] suggests splitting boxes along the two largest dimensions (each dimension being split in two) while Markót et al [79] note that splitting along one dimension is typically the best choice.

Another way to speed up branch-and-bound is to utilize schemes that lead to boxes being discarded earlier than they otherwise would be. In fact, as Ratz [77] notes, the power of branch-and-bound derives from its ability to quickly discard boxes that are guaranteed not to contain global minima. Boxes that remain after the search space has been reduced have a higher probability of containing global minimum points and so by concentrating computational efforts on only these boxes, the time required to converge will be reduced. This realization led to the development of a number of accelerating procedures for use during pruning [77] [80], some of which are:


decreasing over it. Monotonicity is determined by first computing the Jacobian, $\nabla f$, for all $x \in B$. Let $\bar{g}_i$ represent the set of all values for the i-th component of $\nabla f$ $\forall x \in B$. Then, if $0 \notin \bar{g}_i$ for all i, f is monotonically increasing or decreasing over B.

Convexity Test: Since f must be convex in the neighborhood of its minimum points, any box, B, over which f is not convex can be discarded. Convexity requires the Hessian of f, H(x), to be positive definite [39]. Thus, if analysis of H(x) results in the determination that f is concave, the box under consideration can be discarded.

Midpoint Test: In Step 6 of Table 3.1, a threshold, $f_{ub}$, is set using the minimum upper bound obtained. The lower this value is, the better, because there is a greater chance of discarding boxes in Step 2. A better threshold may be obtained by considering the value of f at the midpoint/center of a box, $f_{mp}$. This way, $\bar{f}_{est} = \min f_{mp} \le \min f^U$. Clearly, we obtain a threshold that is no worse than before and yet could be significantly better.

Although these procedures add complexity to the algorithm, their benefits more than justify their use. Other variants of branch-and-bound which aim at more effectively pruning boxes include Branch-and-Cut [81], which uses cutting planes, and Branch-and-Prune [82]. Ultimately, accelerating procedures seek to reduce the computational time and effort required without preventing the algorithm from converging to global minima.


Branch-and-Bound a very competitive option when convergence to sub-optimal local minima is unacceptable.

3.2 Interval Arithmetic

Interval Arithmetic [85] [86] was developed as a way of bounding the errors (e.g. rounding, quantization) that accrue during numerical computations, and is based on the manipulation of intervals of real numbers instead of individual real numbers. An interval X is defined as X = [a, b] such that

$$ X = \{x \in \mathbb{R} : a \le x \le b\}. \tag{3.3} $$

A real number x, also called a degenerate interval, is defined as X = [x, x]. The basic arithmetic operations of addition, subtraction, multiplication and division are defined for intervals. Let $\star$ denote one of these operations for real numbers. Then

$$ X \star Y = \{x \star y : x \in X, \; y \in Y\} \tag{3.4} $$

$$ \phantom{X \star Y} = [\min(x \star y), \; \max(x \star y)]. \tag{3.5} $$

Consider two intervals, X = [a, b] and Y = [c, d]. Then

$$ X + Y = [a + c, \; b + d] \tag{3.6} $$

$$ X - Y = [a - d, \; b - c] \tag{3.7} $$

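$$ X * Y = \begin{cases}
[ac, bd] & \text{if } a \ge 0 \text{ and } c \ge 0 \\
[bc, bd] & \text{if } a \ge 0 \text{ and } c < 0 < d \\
[bc, ad] & \text{if } a \ge 0 \text{ and } d \le 0 \\
[ad, bd] & \text{if } a < 0 < b \text{ and } c \ge 0 \\
[bc, ac] & \text{if } a < 0 < b \text{ and } d \le 0 \\
[ad, bc] & \text{if } b \le 0 \text{ and } c \ge 0 \\
[ad, ac] & \text{if } b \le 0 \text{ and } c < 0 < d \\
[bd, ac] & \text{if } b \le 0 \text{ and } d \le 0 \\
[\min(bc, ad), \max(ac, bd)] & \text{if } a < 0 < b \text{ and } c < 0 < d
\end{cases} \tag{3.8} $$

$$ \frac{1}{Y} = \left[\frac{1}{d}, \; \frac{1}{c}\right] \quad (0 \notin Y) \tag{3.9} $$

$$ \frac{X}{Y} = X * \frac{1}{Y} \quad (0 \notin Y). \tag{3.10} $$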
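The basic operations can be collected in a small class. This is a sketch with hypothetical names, not a production interval library: multiplication takes the minimum and maximum of the four endpoint products, which by (3.5) gives the same result as the nine-case rule (3.8), subtraction uses $[a - d, b - c]$, and ordinary floating-point rounding is used, so the outward rounding needed for guaranteed enclosures is not performed.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Interval:
    lo: float
    hi: float

    def __add__(self, other):
        return Interval(self.lo + other.lo, self.hi + other.hi)

    def __sub__(self, other):
        return Interval(self.lo - other.hi, self.hi - other.lo)

    def __mul__(self, other):
        # min/max over the endpoint products reproduces the case analysis
        products = (self.lo * other.lo, self.lo * other.hi,
                    self.hi * other.lo, self.hi * other.hi)
        return Interval(min(products), max(products))

    def reciprocal(self):
        if self.lo <= 0.0 <= self.hi:
            raise ZeroDivisionError("0 lies in the interval")
        return Interval(1.0 / self.hi, 1.0 / self.lo)

    def __truediv__(self, other):
        return self * other.reciprocal()
```

For instance, `Interval(-1, 2) * Interval(-3, 4)` falls under the last case of the multiplication rule, giving [min(bc, ad), max(ac, bd)] = [−6, 8].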

Equation (3.10) defines division for cases where the denominator interval does not contain zero. For cases where it does, extended interval arithmetic [78] is defined.

$$ X / Y = \begin{cases}
[b/c, \infty] & \text{if } b \le 0 \text{ and } d = 0 \\
[-\infty, b/d] \cup [b/c, \infty] & \text{if } b \le 0 \text{ and } c < 0 < d \\
[-\infty, b/d] & \text{if } b \le 0 \text{ and } c = 0 \\
[-\infty, \infty] & \text{if } a < 0 < b \\
[-\infty, a/c] & \text{if } a \ge 0 \text{ and } d = 0 \\
[-\infty, a/c] \cup [a/d, \infty] & \text{if } a \ge 0 \text{ and } c < 0 < d \\
[a/d, \infty] & \text{if } a \ge 0 \text{ and } c = 0
\end{cases} \tag{3.11} $$
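The case analysis in (3.11) can be sketched as a function that returns a list of one or two pieces. The function name and the tuple representation are assumptions made for this example, and the degenerate denominator [0, 0], which (3.11) does not cover, is rejected.

```python
import math

def extended_divide(a, b, c, d):
    """X / Y for X = [a, b] and Y = [c, d] with c <= 0 <= d, following (3.11).

    Returns a list of (lo, hi) pieces; two pieces are returned when the
    result is a union of two semi-infinite intervals.
    """
    if not (c <= 0.0 <= d) or c == d:
        raise ValueError("Y must contain zero and must not be [0, 0]")
    inf = math.inf
    if a < 0 < b:
        return [(-inf, inf)]
    if b <= 0:
        if d == 0:
            return [(b / c, inf)]
        if c == 0:
            return [(-inf, b / d)]
        return [(-inf, b / d), (b / c, inf)]
    # Remaining case: a >= 0
    if d == 0:
        return [(-inf, a / c)]
    if c == 0:
        return [(a / d, inf)]
    return [(-inf, a / c), (a / d, inf)]
```

Dividing [1, 2] by [−1, 1], for example, yields the two pieces [−∞, −1] and [1, ∞], reflecting the gap that opens up around zero.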

In comparing intervals, X and Y are said to be equal if a = c and b = d, and X < Y if b < c. Similarly, X > Y if a > d. Also, an interval [a, b] is said to be positive if a ≥ 0, strictly positive if a > 0, negative if b ≤ 0, and strictly negative if b < 0. The midpoint of X is defined as

$$ m(X) = \frac{a + b}{2}. \tag{3.12} $$

Clearly, the definition of an interval given in (3.3) suggests that set theory may be applied to intervals. Thus, operations such as intersection, union, cartesian product and so on are defined for intervals [87].

One of the salient features of interval arithmetic is that the outcomes of computations performed on digital computers are guaranteed to enclose the infinite precision results because of the way rounding is performed. To illustrate, consider a function f(x) defined over a region D such that $f^L \le f(x) \le f^U$, $\forall x \in D$. Note that $f^L$ and $f^U$ may not be exactly representable in a floating-point or fixed-point number system. The result of computing the range of f over D using interval arithmetic will be an interval $F = [\underline{f}, \bar{f}]$ where $\underline{f}$ is the largest machine representable number smaller than or equal to $f^L$, and $\bar{f}$ is the smallest machine representable number that is larger than or equal to $f^U$. By rounding $f^L$ to $\underline{f}$ and $f^U$ to $\bar{f}$, a process called outward rounding, F is guaranteed to contain the true range, $[f^L, f^U]$, such that $\underline{f} \le f^L \le f(x) \le f^U \le \bar{f}$.

Because the outcome of an interval computation is guaranteed to contain the infinite-precision result, interval-based operations are said to be reliable.
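Outward rounding can be approximated in Python as sketched below. Standard Python does not expose directed rounding modes, so this sketch substitutes a simpler technique: each endpoint is computed with round-to-nearest and then moved one floating-point step outward using `math.nextafter` (available from Python 3.9). The resulting enclosure is valid but may be slightly wider than the tightest one described in the text. The function names `outward` and `add_outward` are assumptions made for this illustration.

```python
import math
from fractions import Fraction


def outward(lo, hi):
    """Widen [lo, hi] by one floating-point step on each side."""
    return math.nextafter(lo, -math.inf), math.nextafter(hi, math.inf)


def add_outward(x, y):
    """Interval addition (3.6) followed by outward widening."""
    return outward(x[0] + y[0], x[1] + y[1])


# Degenerate intervals holding the doubles nearest to 0.1 and 0.2.
lo, hi = add_outward((0.1, 0.1), (0.2, 0.2))

# Exact rational sum of the two stored doubles, used only for checking.
exact = Fraction(0.1) + Fraction(0.2)

print(lo < hi)                                # True
print(Fraction(lo) <= exact <= Fraction(hi))  # True
```

The check with `Fraction` confirms that the exact sum of the two stored values lies inside the widened interval, even though that sum is not itself representable as a double.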

Although interval computations produce reliable results, the true result may be overestimated [85]. In general, whenever a variable occurs more than once in an interval expression, each occurrence is treated as a different variable. This creates what is known as the dependency problem. For example, subtracting X = [a, b] from itself using (3.7) gives

\[ X - X = [a, b] - [a, b] = [a - b,\ b - a] \tag{3.13} \]

instead of the interval [0, 0] that one would expect. This is because, instead of computing the result as $\{x - x : x \in X\}$, it is computed as $\{x - y : x \in X,\ y \in X\}$. This overestimation can severely degrade the performance of interval-based algorithms and must therefore be mitigated.
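The overestimation in (3.13) can be reproduced numerically with a short sketch. Intervals are represented as `(lo, hi)` tuples and the helper name `sub` is hypothetical. The example contrasts the interval result of X − X with the values x − x obtained by sampling points of X.

```python
def sub(x, y):
    # Interval subtraction: [a, b] - [c, d] = [a - d, b - c]
    return (x[0] - y[1], x[1] - y[0])


X = (1.0, 3.0)

# Interval evaluation treats the two occurrences of X as independent.
naive = sub(X, X)
print(naive)  # (-2.0, 2.0)

# Sampling the actual expression x - x over X always gives zero.
samples = [X[0] + k * (X[1] - X[0]) / 10 for k in range(11)]
true_values = {x - x for x in samples}
print(true_values)  # {0.0}
```

The computed width, 2(b − a), is twice the width of X, whereas the true range has zero width.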
