is the identity matrix. In using the LMS algorithm, we recognize that

$$\hat{\mathbf{w}}(n) = z^{-1}[\hat{\mathbf{w}}(n+1)] \tag{3.39}$$

where $z^{-1}$ is the unit-time delay operator, implying storage. Using Eqs. (3.38) and (3.39), we may thus represent the LMS algorithm by the signal-flow graph depicted in Fig. 3.3. This signal-flow graph reveals that the LMS algorithm is an example of a stochastic feedback system. The presence of feedback has a profound impact on the convergence behavior of the LMS algorithm.

FIGURE 3.3 Signal-flow graph representation of the LMS algorithm. The graph embodies feedback depicted in color.

3.6 MARKOV MODEL PORTRAYING THE DEVIATION OF THE LMS ALGORITHM FROM THE WIENER FILTER

To perform a statistical analysis of the LMS algorithm, we find it more convenient to work with the weight-error vector, defined by

$$\boldsymbol{\varepsilon}(n) = \mathbf{w}_o - \hat{\mathbf{w}}(n) \tag{3.40}$$

where $\mathbf{w}_o$ is the optimum Wiener solution defined by Eq. (3.32) and $\hat{\mathbf{w}}(n)$ is the corresponding estimate of the weight vector computed by the LMS algorithm. Thus, in terms of $\boldsymbol{\varepsilon}(n)$, assuming the role of a state, we may rewrite Eq. (3.38) in the compact form

$$\boldsymbol{\varepsilon}(n+1) = \mathbf{A}(n)\boldsymbol{\varepsilon}(n) + \mathbf{f}(n) \tag{3.41}$$

Here, we have

$$\mathbf{A}(n) = \mathbf{I} - \eta\,\mathbf{x}(n)\mathbf{x}^T(n) \tag{3.42}$$

where $\mathbf{I}$ is the identity matrix. The additive noise term in the right-hand side of Eq. (3.41) is defined by

$$\mathbf{f}(n) = -\eta\,\mathbf{x}(n)e_o(n) \tag{3.43}$$

where

$$e_o(n) = d(n) - \mathbf{w}_o^T\mathbf{x}(n) \tag{3.44}$$

is the estimation error produced by the Wiener filter.

Equation (3.41) represents a Markov model of the LMS algorithm, with the model being characterized as follows:

• The updated state of the model, denoted by the vector $\boldsymbol{\varepsilon}(n+1)$, depends on the old state $\boldsymbol{\varepsilon}(n)$, with the dependence itself being defined by the transition matrix $\mathbf{A}(n)$.

• Evolution of the state over time $n$ is perturbed by the intrinsically generated noise $\mathbf{f}(n)$, which acts as a “driving force”.

Figure 3.4 shows a vector-valued signal-flow graph representation of this model. The branch labeled $z^{-1}\mathbf{I}$ represents the memory of the model, with $z^{-1}$ acting as the unit-time delay operator, as shown by

$$z^{-1}[\boldsymbol{\varepsilon}(n+1)] = \boldsymbol{\varepsilon}(n) \tag{3.45}$$

This figure highlights the presence of feedback in the LMS algorithm in a more compact manner than that in Fig. 3.3.

The signal-flow graph of Fig. 3.4 and the accompanying equations provide the framework for the convergence analysis of the LMS algorithm under the assumption of a small learning-rate parameter. However, before proceeding with this analysis, we will digress briefly to present two building blocks with that goal in mind: the Langevin equation, presented in Section 3.7, followed by Kushner’s direct-averaging method, presented in Section 3.8. With those two building blocks in hand, we will then go on to study convergence analysis of the LMS algorithm in Section 3.9.

FIGURE 3.4 Signal-flow graph representation of the Markov model described in Eq. (3.41); the graph embodies feedback depicted in color.
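The step from the LMS update of Eq. (3.38) to the Markov model of Eq. (3.41) is purely algebraic, so it can be checked numerically. The following is a minimal sketch, not part of the original text: it runs both recursions on the same synthetic data and tracks the largest difference between the state $\boldsymbol{\varepsilon}(n)$ and $\mathbf{w}_o - \hat{\mathbf{w}}(n)$. The function name `lms_and_markov`, the data model, the parameter values, and the use of a least-squares fit as a stand-in for the Wiener solution are all assumptions made for illustration.

```python
import numpy as np

def lms_and_markov(eta=0.01, M=3, N=500, seed=0):
    """Run the LMS update (Eq. 3.38) and the Markov recursion (Eq. 3.41) side by side."""
    rng = np.random.default_rng(seed)
    w_true = rng.standard_normal(M)                 # hypothetical weights that generate d(n)
    X = rng.standard_normal((N, M))                 # input vectors x(n), one per row
    d = X @ w_true + 0.1 * rng.standard_normal(N)   # desired response d(n)

    # Stand-in for the Wiener solution w_o: least-squares fit on this sample (assumption)
    w_o = np.linalg.lstsq(X, d, rcond=None)[0]

    w_hat = np.zeros(M)                             # LMS estimate, started at zero
    eps = w_o - w_hat                               # weight-error vector, Eq. (3.40)
    I = np.eye(M)
    max_gap = 0.0
    for n in range(N):
        x = X[n]
        # LMS update, Eq. (3.38)
        w_hat = w_hat + eta * x * (d[n] - x @ w_hat)
        # Markov model update, Eqs. (3.41) to (3.44)
        e_o = d[n] - w_o @ x
        A = I - eta * np.outer(x, x)
        f = -eta * x * e_o
        eps = A @ eps + f
        # Both recursions should describe the same weight error
        max_gap = max(max_gap, float(np.max(np.abs(eps - (w_o - w_hat)))))
    return max_gap

gap = lms_and_markov()
print(f"largest |eps(n) - (w_o - w_hat(n))| over the run: {gap:.2e}")
```

The identity holds for any fixed vector used in place of $\mathbf{w}_o$, which is why the approximate Wiener solution does not affect the check; only floating-point rounding separates the two trajectories.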

3.7 THE LANGEVIN EQUATION: CHARACTERIZATION OF BROWNIAN MOTION

Restating the remarks made towards the end of Section 3.5 in more precise terms insofar as stability or convergence is concerned, we may say that the LMS algorithm (for small enough $\eta$) never attains a perfectly stable or convergent condition. Rather, after a large number of iterations, n, the algorithm approaches a “pseudo-equilibrium” condition, which, in qualitative terms, is described by the algorithm executing Brownian motion around the Wiener solution. This kind of stochastic behavior is explained nicely by the Langevin equation of nonequilibrium thermodynamics.³ So, we will make a brief digression to introduce this important equation.

Let $v(t)$ denote the velocity of a macroscopic particle of mass $m$ immersed in a viscous fluid. It is assumed that the particle is small enough for its velocity due to thermal fluctuations to be deemed significant. Then, from the equipartition law of thermodynamics, the mean energy of the particle is given by

$$\frac{1}{2}\mathbb{E}[v^2(t)] = \frac{1}{2}k_B T \tag{3.46}$$

where $k_B$ is Boltzmann’s constant and $T$ is the absolute temperature. The total force exerted on the particle by the molecules in the viscous fluid is made up of two components:

(i) a continuous damping force equal to $-\alpha v(t)$ in accordance with Stokes’ law, where $\alpha$ is the coefficient of friction;

(ii) a fluctuating force $F_f(t)$, whose properties are specified on the average.

The equation of motion of the particle in the absence of an external force is therefore given by

$$m\frac{dv}{dt} = -\alpha v(t) + F_f(t)$$

Dividing both sides of this equation by $m$, we get

$$\frac{dv}{dt} = -\gamma v(t) + \Gamma(t) \tag{3.47}$$

where

$$\gamma = \frac{\alpha}{m} \tag{3.48}$$

and

$$\Gamma(t) = \frac{F_f(t)}{m} \tag{3.49}$$

The term $\Gamma(t)$ is the fluctuating force per unit mass; it is a stochastic force because it depends on the positions of the incredibly large number of atoms constituting the particle, which are in a state of constant and irregular motion. Equation (3.47) is called the Langevin equation, and $\Gamma(t)$ is called the Langevin force. The Langevin equation, which describes the motion of the particle in the viscous fluid at all times (if its initial conditions are specified), was the first mathematical equation describing nonequilibrium thermodynamics.

In Section 3.9, we show that a transformed version of the LMS algorithm has the same mathematical form as the discrete-time version of the Langevin equation. But, before doing that, we need to describe our next building block.
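To give a feel for what a discrete-time version of the Langevin equation looks like, here is a minimal simulation sketch of Eq. (3.47) using a simple Euler step. It is not taken from the text. It assumes that the Langevin force $\Gamma(t)$ is white and Gaussian with intensity `q`, which the section does not state, and the names `simulate_langevin`, `gamma`, `q`, and `dt` as well as all numerical values are hypothetical. Under that assumption the velocity settles into a fluctuating steady state whose variance is close to `q / (2 * gamma)` when `dt` is small.

```python
import numpy as np

def simulate_langevin(gamma=1.0, q=2.0, dt=0.01, steps=2000, particles=2000, seed=0):
    """Euler steps of dv/dt = -gamma * v + Gamma(t) for an ensemble of particles."""
    rng = np.random.default_rng(seed)
    v = np.zeros(particles)                  # every particle starts at rest
    for _ in range(steps):
        # Assumed white Gaussian Langevin force: its integral over dt has variance q * dt
        kick = np.sqrt(q * dt) * rng.standard_normal(particles)
        # Damping pulls v toward zero, the random kick pushes it away again
        v = v - gamma * v * dt + kick
    return v

v_final = simulate_langevin()
var_est = float(v_final.var())               # ensemble variance after 20 time constants
var_limit = 2.0 / (2.0 * 1.0)                # q / (2 * gamma) for the default parameters
print(f"ensemble variance: {var_est:.3f}, small-dt limit: {var_limit:.3f}")
```

The velocity never comes to rest: damping and random forcing balance each other, which is the same qualitative picture as the “pseudo-equilibrium” described for the LMS algorithm.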

3.8 KUSHNER’S DIRECT-AVERAGING METHOD

The Markov model of Eq. (3.41) is a nonlinear stochastic difference equation. This equation is nonlinear because the transition matrix $\mathbf{A}(n)$ depends on the outer product $\mathbf{x}(n)\mathbf{x}^T(n)$ of the input vector $\mathbf{x}(n)$. Hence, the dependence of the weight-error vector $\boldsymbol{\varepsilon}(n+1)$ on $\mathbf{x}(n)$ violates the principle of superposition, which is a requirement for linearity. Moreover, the equation is stochastic because the training sample $\{\mathbf{x}(n), d(n)\}$ is drawn from a stochastic environment. Given these two realities, we find that a rigorous statistical analysis of the LMS algorithm is indeed a very difficult task.

However, under certain conditions, the statistical analysis of the LMS algorithm can be simplified significantly by applying Kushner’s direct-averaging method to the model of Eq. (3.41). For a formal statement of this method, we write the following (Kushner, 1984):

Consider a stochastic learning system described by the Markov model

$$\boldsymbol{\varepsilon}(n+1) = \mathbf{A}(n)\boldsymbol{\varepsilon}(n) + \mathbf{f}(n)$$

where, for some input vector $\mathbf{x}(n)$, we have

$$\mathbf{A}(n) = \mathbf{I} - \eta\,\mathbf{x}(n)\mathbf{x}^T(n)$$

and the additive noise $\mathbf{f}(n)$ is linearly scaled by the learning-rate parameter $\eta$. Provided that

• the learning-rate parameter $\eta$ is sufficiently small, and

• the additive noise $\mathbf{f}(n)$ is essentially independent of the state $\boldsymbol{\varepsilon}(n)$,

the state evolution of a modified Markov model described by the two equations

$$\boldsymbol{\varepsilon}_0(n+1) = \bar{\mathbf{A}}(n)\boldsymbol{\varepsilon}_0(n) + \mathbf{f}_0(n) \tag{3.50}$$

$$\bar{\mathbf{A}}(n) = \mathbf{I} - \eta\,\mathbb{E}[\mathbf{x}(n)\mathbf{x}^T(n)] \tag{3.51}$$

is practically the same as that of the original Markov model for all n.

The deterministic matrix $\bar{\mathbf{A}}(n)$ of Eq. (3.51) is the transition matrix of the modified Markov model. Note also that we have used the symbol $\boldsymbol{\varepsilon}_0(n)$ for the state of the modified Markov model to emphasize the fact that the evolution of this model over time is identically equal to that of the original Markov model only for the limiting case of a vanishingly small learning-rate parameter $\eta$.

A proof of the statement embodying Eqs. (3.50) and (3.51) is addressed in Problem 3.7, assuming ergodicity (i.e., substituting time averages for ensemble averages). For the discussion presented herein, it suffices to say the following:

1. As mentioned previously, when the learning-rate parameter $\eta$ is small, the LMS algorithm has a long memory. Hence, the evolution of the updated state $\boldsymbol{\varepsilon}_0(n+1)$ can be traced in time, step by step, all the way back to the initial condition $\boldsymbol{\varepsilon}(0)$.

2. When $\eta$ is small, we are justified in ignoring all second- and higher-order terms in $\eta$ in the series expansion of $\boldsymbol{\varepsilon}_0(n+1)$.

3. Finally, the statement embodied in Eqs. (3.50) and (3.51) is obtained by invoking ergodicity, whereby ensemble averages are substituted for time averages.
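The claim that the modified model of Eqs. (3.50) and (3.51) tracks the original model of Eq. (3.41) for small $\eta$ can be explored with a short numerical experiment. The sketch below is illustrative only and rests on several assumptions that are not in the text: the input is white Gaussian, so that $\mathbb{E}[\mathbf{x}(n)\mathbf{x}^T(n)] = \mathbf{I}$; the Wiener error $e_o(n)$ is white and independent of the input; the noise term $\mathbf{f}_0(n)$ is taken to be the same $-\eta\,\mathbf{x}(n)e_o(n)$ that drives the original model; and both models start from the same initial state. The function name `compare_models` and all parameter values are hypothetical.

```python
import numpy as np

def compare_models(eta=0.001, M=4, N=5000, noise_std=0.1, seed=0):
    """Evolve the original and the direct-averaged Markov models on the same data."""
    rng = np.random.default_rng(seed)
    w_o = rng.standard_normal(M)             # hypothetical Wiener solution
    I = np.eye(M)
    R = np.eye(M)                            # E[x x^T] for the white Gaussian input assumed here
    A_bar = I - eta * R                      # deterministic transition matrix, Eq. (3.51)

    eps = w_o.copy()                         # original state; eps(0) = w_o when w_hat(0) = 0
    eps0 = w_o.copy()                        # modified state, same initial condition
    max_dev = 0.0
    for _ in range(N):
        x = rng.standard_normal(M)
        e_o = noise_std * rng.standard_normal()    # white Wiener error (assumption)
        f = -eta * x * e_o                   # the same noise drives both models (assumption)
        eps = (I - eta * np.outer(x, x)) @ eps + f  # original model, Eq. (3.41)
        eps0 = A_bar @ eps0 + f              # modified model, Eq. (3.50)
        max_dev = max(max_dev, float(np.linalg.norm(eps - eps0)))
    scale = float(np.linalg.norm(w_o))
    # Largest gap between the two states and final size of the modified state,
    # both relative to the size of the initial weight error
    return max_dev / scale, float(np.linalg.norm(eps0)) / scale

rel_dev, rel_final = compare_models()
print(f"largest relative gap between the two models: {rel_dev:.3f}")
print(f"final relative size of the modified state:   {rel_final:.3f}")
```

Rerunning with a larger `eta` (and a correspondingly shorter run) is one way to see the two trajectories drift apart as the small-learning-rate condition is relaxed.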

3.9 STATISTICAL LMS LEARNING THEORY FOR SMALL LEARNING-RATE PARAMETER

Now that we are equipped with Kushner’s direct-averaging method, the stage is set for a principled statistical analysis of the LMS algorithm by making three justifiable assumptions:

Assumption I: The learning-rate parameter $\eta$ is small

By making this assumption, we justify the application of Kushner’s direct-averaging method—hence the adoption of the modified Markov model of Eqs. (3.50) and (3.51) as the basis for the statistical analysis of the LMS algorithm.

From a practical perspective, the choice of small $\eta$ also makes sense. In particular, the LMS algorithm exhibits its most robust behavior with respect to external disturbances when $\eta$ is small; the issue of robustness is discussed in Section 3.12.

Assumption II: The estimation error $e_o(n)$ produced by the Wiener filter is white.

This assumption is satisfied if the generation of the desired response is described by the linear regression model

$$d(n) = \mathbf{w}_o^T\mathbf{x}(n) + e_o(n) \tag{3.52}$$

Equation (3.52) is simply a rewrite of Eq. (3.44), which, in effect, implies that the weight vector of the Wiener filter is matched to the weight vector of the regression model describing the stochastic environment of interest.
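As a rough illustration of Assumption II, the sketch below generates data from the regression model of Eq. (3.52) and checks that the resulting estimation error looks white. It is an assumed setup rather than anything prescribed by the text: the weight vector `w_o`, the noise level, and the sample size are made up, a least-squares fit stands in for the Wiener solution, and whiteness is judged only by the sample autocorrelation at a few nonzero lags.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 20000, 3
w_o = np.array([0.5, -1.0, 2.0])             # hypothetical regression weight vector
X = rng.standard_normal((N, M))              # input vectors x(n), one per row
e_o = 0.1 * rng.standard_normal(N)           # white noise term of Eq. (3.52)
d = X @ w_o + e_o                            # desired response, Eq. (3.52)

# Least-squares estimate standing in for the Wiener solution (assumption)
w_fit = np.linalg.lstsq(X, d, rcond=None)[0]
resid = d - X @ w_fit                        # sample version of e_o(n), Eq. (3.44)

# Normalized sample autocorrelation of the error at lags 1 to 5;
# values near zero are consistent with a white sequence
r0 = np.mean(resid * resid)
acf = np.array([np.mean(resid[k:] * resid[:-k]) / r0 for k in range(1, 6)])
print("fitted weights:", np.round(w_fit, 3))
print("autocorrelation at lags 1..5:", np.round(acf, 3))
```

When the fitted weights match the weights of the regression model, the error that remains is just the model noise, which is the matching condition described in the paragraph above.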

Assumption III: The input vector x(n) and the desired

In document Neural Networks and Learning Machines (3rd Edition) (Page 135-139)