• No results found

CS 295 (UVM) The LMS Algorithm Fall / 23

N/A
N/A
Protected

Academic year: 2021

Share "CS 295 (UVM) The LMS Algorithm Fall / 23"

Copied!
23
0
0

Loading.... (view fulltext now)

Full text

(1)

The LMS Algorithm

The LMS Objective Function.

Global solution.

Pseudoinverse of a matrix.

Optimization (learning) by gradient descent.

LMS or Widrow-Hoff Algorithms.

Convergence of the Batch LMS Rule.

Compiled by LATEX at 3:50 PM on February 27, 2013.

(2)

LMS Objective Function

Again we consider the problem of programming a linear threshold function

x1 w2 w3 wn w1 x2 x3

xn

w0 y

y D sgn. w

T

x C w

0

/ D (

1 ; if w

T

x C w

0

> 0 ; 1 ; if w

T

x C w

0

 0 : so that it agrees with a given dichotomy of m feature vectors,

X

m

D f. x

1

; `

1

/; : : : ; . x

m

; `

m

/ g:

where x

i

2 R

n

and `

i

2 f 1 ; C 1 g , for i D 1 ; 2 ; : : : ; m.

CS 295 (UVM) The LMS Algorithm Fall 2013 2 / 23

(3)

LMS Objective Function (cont.)

Thus, given a dichotomy

X

m

D f. x

1

; `

1

/; : : : ; . x

m

; `

m

/ g;

we seek a solution weight vector w and bias w

0

, such that, sgn w

T

x

i

C w

0

 D `

i

; for i D 1 ; 2 ; : : : ; m.

Equivalently, using homogeneous coordinates (or augmented feature vectors), sgn b w

T

b x

i

 D `

i

; for i D 1 ; 2 ; : : : ; m, where b x

i

D 1 ; x

Ti



T

. Using normalized coordinates,

sgn b w

T

b x

0

i

 D 1 ; or, b w

T

b x

0

i

> 0 ; for i D 1 ; 2 ; : : : ; m, where b x

0

i

D `

i

b x

i

.

(4)

LMS Objective Function (cont.)

Alternatively, let b 2 R

m

satisfy b

i

> 0 for i D 1 ; 2 ; : : : ; m. We call b a margin vector.

Frequently we will assume that b

i

D 1.

Our learning criterion is certainly satisfied if

b w

T

b x

0

i

D b

i

for i D 1 ; 2 ; : : : ; m. (Note that this condition is sufficient, but not necessary.) Equivalently, the above is satisfied if the expression

E. b w / D

m

X

i

D

1

 b

i

b w

T

b x

0

i



2

:

equals zero. The above expression we call the LMS objective function, where LMS stands for least mean square. (N.B. we could normalize the above by dividing the right side by m.)

CS 295 (UVM) The LMS Algorithm Fall 2013 4 / 23

(5)

LMS Objective Function: Remarks

Converting the original problem of classification (satisfying a system of inequalities) into one of optimization is somewhat ad hoc. And there is no guarantee that we can find a b w

? 2 R

n

C

1

that satisfies E. b w

? / D 0. However, this conversion can lead to practical compromises if the original inequalities possess inconsistencies.

Also, there is no unique objective function. The LMS expression,

E. b w / D

m

X

i

D

1

b

i

b w

T

b x

0

i



2

:

can be replaced by numerous candidates, e.g.

m

X

i

D

1

ˇ ˇ b

i

b w

T

b x

0

i

ˇ ˇ;

m

X

i

D

1



1 sgn b w

T

b x

0

i

 

;

m

X

i

D

1



1 sgn b w

T

b x

0

i

 

2

; etc.

However, minimizing E. b w

? / is generally an easy task.

(6)

Minimizing the LMS Objective Function

Inspection suggests that the LMS Objective Function E. b w / D

m

X

i

D

1

b

i

b w

T

b x

0

i



2

describes a parabolic function. It may have a unique global minimum, or an infinite number of global minima which occupy a connected linear set. (The latter can occur if m < n C 1.) Letting,

X D 0 B B B B

@ b x

0

T 1

b x

0

T

::

2

: b x

0

T m

1 C C C C A

D 0 B B B

@

`

1

x O

1

0 ;

1

x O

1

0 ;

2

   x O

1

0 ;

n

`

2

x O

2

0 ;

1

x O

2

0 ;

2

   x O

2

0 ;

n

:: : :: : :: : : :: :: :

`

m

x O

m

0 ;

1

x O

m

0 ;

2

   x O

m

0 ;

n

1 C C C A

2 R

m

.

n

C

1

/ ;

then,

E. b w / D

m

X

i

D

1

b

i

b x

0

T i

b w



2

D 0 B B B B

@ b

1

b x

0

T 1

b w b

2

b x

0

T 2

b w :: : b

m

b x

0

T m

b w

1 C C C C A

2

D b

0 B B B B

@ b x

0

T 1

b x

0

T

::

2

: b x

0

T m

1 C C C C A

b w

2

D k b X b w j

2

:

CS 295 (UVM) The LMS Algorithm Fall 2013 6 / 23

(7)

Minimizing the LMS Objective Function (cont.)

E. b w / D k b X b w k

2

D . b X b w /

T

. b X b w / D 

b

T

b w

T

X

T



. b X b w / D b w

T

X

T

X b w b w

T

X

T

b b

T

X b w C b

T

b D b w

T

X

T

X b w 2b

T

X b w C k b k

2

As an aside, note that

X

T

X D .b x 0

1

; : : : ; b x 0

m

/ 0 B

@ b x

0

T

::

1

: b x

0

T m

1 C A D

m

X

i

D

1

b x 0

i

b x

0

T

i

2 R .

n

C

1

/.

n

C

1

/ ;

b

T

X D . b

1

; : : : ; b

m

/ 0 B

@ b x

0

T 1

::

: b x

0

T m

1 C A D

m

X

i

D

1

b

i

b x

0

T

i

2 R

n

C

1

:

(8)

Minimizing the LMS Objective Function (cont.)

Okay, so how do we minimize E. b w / D b w

T

X

T

X b w 2b

T

X b w C k b k

2

Using calculus (e.g., Math 121), we can compute the gradient of E. b w / , and algebraically determine a value of b w

? which makes each component vanish. That is, solve

rE.b w / D 0 B B B B B B B B B B

@

@E

@ b w

0

@E

@ b w

1

:: :

@E

@ b w

n

1 C C C C C C C C C C A

D 0 :

It is straightforward to show that

rE.b w / D 2X

T

X b w 2X

T

b :

CS 295 (UVM) The LMS Algorithm Fall 2013 8 / 23

(9)

Minimizing the LMS Objective Function (cont.)

Thus,

rE.b w / D 2X

T

X b w 2X

T

b D 0 if

b w ? D . X

T

X /

1

X

T

b D X Ž b ; where the matrix,

X Ž D .

def

X

T

X /

1

X

T

2 R .

n

C

1

/

m

is called the pseudoinverse of X . If X

T

X is singular, one defines

X Ž D

def

lim

!

0

. X

T

X C  I /

1

X

T

: Observe that if X

T

X is nonsingular,

X Ž X D . X

T

X /

1

X

T

X D I :

(10)

Example

The following example appears in R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, Second Edition, Wiley, NY, 2001, p. 241.

Given the dichotomy,

X

4

D ˚ . 1 ; 2 /

T

; 1 ; . 2 ; 0 /

T

; 1 ; . 3 ; 1 /

T

; 1 ; . 2 ; 3 /

T

; 1  we obtain,

X D 0 B B B B B

@ b x

0

T 1

b x

0

T 2

b x

0

T 3

b x

0

T 4

1 C C C C C A

D 0 B B

@

1 1 2

1 2 0

1 3 1

1 2 3

1 C C A : Whence,

X

T

X D 0

@

4 8 6

8 18 11

6 11 14

1

A ; and, X Ž D . X

T

X /

1

X

T

D 0 B B

@

5 4

13 12

3 4

7 12 1

2 1 6

1 2

1 6

0

13

0

13

1 C C A :

CS 295 (UVM) The LMS Algorithm Fall 2013 10 / 23

(11)

Example (cont.)

Letting, b D . 1 ; 1 ; 1 ; 1 /

T

, then

b w D X Ž b D 0 B B

@

5 4

13 12

3 4

7 12 1

2 1 6

1 2

1 6

0

13

0

13

1 C C A

0 B B B

@ 1 1 1 1 1 C C C A

D 0 B B

@

11 3 4 3 2 3

1 C C A :

Whence,

w

0

D 11

3 and w D

 4 3 ; 2

3



T

x1 x2

1 1

2 2

3 3

4 4

5

(12)

Method of Steepest Descent

An alternative approach is the method of steepest descent.

We begin by representing Taylor’s theorem for functions of more than one variable:

let x 2 R

n

, and f W R

n

! R , so

f . x / D f . x

1

; x

2

; : : : ; x

n

/ 2 R:

Now let ı x 2 R

n

, and consider

f . x C ı x / D f . x

1

C ı x

1

; : : : ; x

n

C ı x

n

/:

Define F W R ! R , such that,

F . s / D f . x C s ı x /:

Thus,

F . 0 / D f . x /; and, F . 1 / D f . x C ı x /:

CS 295 (UVM) The LMS Algorithm Fall 2013 12 / 23

(13)

Method of Steepest Descent (cont.)

Taylor’s theorem for a single variable (Math 21/22),

F . s / D F . 0 / C 1 1 Š F

0 . 0 / s C 1 2 Š F

00 . 0 / s

2

C 1 3 Š F

000 . 0 / s

3

C    :

Our plan is to set s D 1 and replace F . 1 / by f . x C ı x / , F . 0 / by f . x / , etc.

To evaluate F 0 . 0 / we will invoke the multivariate chain rule, e.g., d

ds f u . s /; v . s /  D @ f

@ u . u ; v / u 0 . s / C @ f

@ v . u ; v / v 0 . s /:

Thus,

F 0 . s / D dF

ds . s / D df

ds . x

1

C s ı x

1

; : : : ; x

n

C s ı x

n

/ D @ f

@ x

1

. x C s ı x / d

ds . x

1

C s ı x

1

/ C    C @ f

@ x

n

. x C s ı x / d

ds . x

n

C s ı x

n

/ D @ f

@ x

1

. x C s ı x /  ı x

1

C    C @ f

@ x

n

. x C s ı x /  ı x

n

:

(14)

Method of Steepest Descent (cont.)

Thus,

F 0 . 0 / D @ f

@ x

1

. x /  ı x

1

C    C @ f

@ x

n

. x /  ı x

n

D r f . x /

T

ı x :

CS 295 (UVM) The LMS Algorithm Fall 2013 14 / 23

(15)

Method of Steepest Descent (cont.)

Thus, it is possible to show

f . x C ı x / D f . x / C r f . x /

T

ı x C O kı x k

2



D f . x / C kr f . x / kkı x k cos  C O kı x k

2

 ; where  defines the angle between r f . x / and ı x. Ifx k  1, then

ı f D f . x C ı x / f . x /  kr f . x /kkı x k cos :

Thus, the greatest reduction ı f occurs if cos  D 1, that is if ı x D r f , where

 > 0. We thus seek a local minimum of the LMS objective function by taking a sequence of steps

b w . t C 1 / D b w . t /  rE  b w . t / 

:

(16)

Training an LTU using Steepest Descent

We now return to our original problem. Given a dichotomy X

m

D f. x

1

; `

1

/; : : : ; . x

m

; `

m

/g

of m feature vectors x

i

2 R

n

with `

i

2 f 1 ; 1 g for i D 1 ; : : : ; m, we construct the set of normalized, augmented feature vectors

X b 0

m

D f.`

i

; `

i

x

i

;

1

; : : : ; `

i

x

i

;

n

/

T

2 R

n

C

1

j i D 1 ; : : : ; m g:

Given a margin vector, b 2 R

m

, with b

i

> 0 for i D 1 ; : : : ; m, we construct the LMS objective function,

E. b w / D 1 2

m

X

i

D

1

. b w

T

b x

0

i

b

i

/

2

D 1

2 k X b w b k

2

;

(the factor of

12

is added with some foresight), and then evaluate its gradient

rE.b w / D

m

X

i

D

1

. b w

T

b x

0

i

b

i

/ b x

0

i

D X

T

. X b w b /:

CS 295 (UVM) The LMS Algorithm Fall 2013 16 / 23

(17)

Training an LTU using Steepest Descent (cont.)

Substitution into the steepest descent update rule,

b w . t C 1 / D b w . t / rE  b w . t / 

; yields the batch LMS update rule,

b w . t C 1 / D b w . t / C 

m

X

i

D

1



b

i

b w . t /

T

b x

0

i

 b x

0

i

:

Alternatively, one can abstract the sequential LMS, or Widrow-Hoff rule, from the above:

b w . t C 1 / D b w . t / C  

b b w . t /

T

b x

0 . t /  b x

0 . t /:

where b x 0 . t / 2 b X 0

m

is the element of the dichotomy that is presented to the LTU at

epoch t. (Here, we assume that b is fixed; otherwise, replace it by b . t / .)

(18)

Sequential LMS Rule

The sequential LMS rule

b w . t C 1 / D b w . t / C  h

b b w . t /

T

b x

0 . t / i b x

0 . t /;

resembles the sequential perceptron rule,

b w . t C 1 / D b w . t / C  2 h

1 sgn b w . t /

T

b x

0 . t /  i b x

0 . t /:

Sequential rules are well suited to real-time implementations, as only the current values for the weights, i.e. the configuration of the LTU itself, need to be stored. They also work with dichotomies of infinite sets.

CS 295 (UVM) The LMS Algorithm Fall 2013 18 / 23

(19)

Convergence of the batch LMS rule

Recall, the LMS objective function in the form

E. b w / D 1 2 X b w b

2

;

has as its gradient,

rE.b w / D X

T

X b w X

T

b D X

T

. X b w b /:

Substitution into the rule of steepest descent,

b w . t C 1 / D b w . t /  rE b w . t /;

yields,

b w . t C 1 / D b w . t /  X

T

X b w . t / b 

(20)

Convergence of the batch LMS rule (cont.)

The algorithm is said to converge to a fixed point b w

? , if for every finite initial value

b w . 0 / < 1 ,

t

!1 lim b w . t / D b w

? :

The fixed points b w

? satisfy rE.b w ? / D 0, whence

X

T

X b w

? D X

T

b :

The update rule becomes,

b w . t C 1 / D b w . t /  X

T

X b w . t / b  D b w . t /  X

T

X b w . t / b w

? 

Let ı b w . t / D

def

b w . t / b w

? . Then,

ı b w . t C 1 / D ıb w . t /  X

T

X ı b w . t / D I  X

T

X /ı b w . t /:

CS 295 (UVM) The LMS Algorithm Fall 2013 20 / 23

(21)

Convergence of the batch LMS rule (cont.)

Convergence b w . t / ! b w

? occurs if ı b w . t / D b w . t / b w

? ! 0. Thus we require that ı b w . t C 1 /

<

ı b w . t /

. Inspecting the update rule, ı b w . t C 1 / D 

I  X

T

X  ı b w . t /;

this reduces to the condition that all the eigenvalues of I  X

T

X have magnitudes less than 1.

We will now evaluate the eigenvalues of the above matrix.

(22)

Convergence of the batch LMS rule (cont.)

Let S 2 R .

n

C

1

/.

n

C

1

/ denote the similarity transform that reduces the symmetric matrix X

T

X to a diagonal matrix ƒ 2 R .

n

C

1

/.

n

C

1

/ . Thus, S

T

S D SS

T

D I, and

S X

T

X S

T

D ƒ D diag.

0

; 

1

; : : : ; 

n

/;

The numbers, 

0

; 

1

; : : : ; 

n

, represent the eigenvalues of X

T

X . Note that 0  

i

for i D 0 ; 1 ; : : : ; n. Thus,

S ı b w . t C 1 / D S . I  X

T

X / S

T

S ı b w . t /;

D . I ƒ/ S ı b w . t /:

Thus convergence occurs if

k S ı b w . t C 1 /k < k S ı b w . t /k;

which occurs if the eigenvalues of I ƒ all have magnitudes less than one.

CS 295 (UVM) The LMS Algorithm Fall 2013 22 / 23

(23)

Convergence of the batch LMS rule (cont.)

The eigenvalues of I ƒ equal 1 

i

, for i D 0 ; 1 ; : : : ; n. (These are in fact the eigenvalues of I  X

T

X .) Thus, we require that

1 < 1 

i

< 1 ; or 0 < 

i

< 2 ; for all i. Let 

max

D max

0



i



n



i

denote the largest eigenvalue of X

T

X , then convergence requires that

0 <  < 2



max

:

References

Related documents

On the left, the values are separated by gender to compare results obtained in the right (green) and left (blue) hemisphere.. On the right, the values are separated by hemisphere

Neurofilament Light Chain in Blood and CSF as Marker of Disease Progression in Mouse Models and in Neurodegenerative Diseases. Plasma Concentration of the Neurofilament Light

The Support Provider Directory is a valuable data asset to the Navy Enterprise because it is authoritative information provided by the Echelon II/III community of

Municipal and county planning authorities are encouraged to develop policies to improve resilience to coastal and inland flooding, salt water intrusion, and other related impacts

clarus effector candidates encoding putative secreted enzymes involved in cell wall modifica- tion.. They were divided in two subgroups depending whether their putative

• Make daily download of prices and price related information from 4 online retailers registered in Norway with highest expenditure figures. • Most

■ Acquire an 8,000 square foot facility in Putnam to house six medical exam rooms, six dental operatories, two behavioral health offices to provide 12,000 medical, dental

To communicate with the ONS network element (NE) using TL1 commands through a Telnet session over a craft interface or a LAN connection, you can choose either of the following