
The p-norm generalization of the LMS algorithm for adaptive filtering


Academic year: 2021


(1)


Jyrki Kivinen, University of Helsinki

Manfred Warmuth, University of California, Santa Cruz

Babak Hassibi, California Institute of Technology

(2)

Least Mean Squares (LMS) update

Pick learning rate $\eta > 0$. Initialize $w_0 = 0 \in \mathbb{R}^n$.

At time $t$, for $t = 1, \ldots, T$, the algorithm

• observes input $x_t \in \mathbb{R}^n$

• makes prediction $w_{t-1} \cdot x_t \in \mathbb{R}$

• observes feedback $y_t \in \mathbb{R}$, and

• updates its hypothesis as

$$w_t = w_{t-1} - \eta (w_{t-1} \cdot x_t - y_t) x_t$$
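As a rough illustration of the update above, here is a minimal pure-Python sketch. The function names `dot`, `lms_step` and `run_lms`, and the use of plain lists as vectors, are assumptions made for this example and do not come from the talk.

```python
def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(ai * bi for ai, bi in zip(a, b))


def lms_step(w, x, y, eta):
    """Return w - eta * (w . x - y) * x, the LMS update for one example."""
    err = dot(w, x) - y
    return [wi - eta * err * xi for wi, xi in zip(w, x)]


def run_lms(examples, n, eta):
    """Run LMS from w_0 = 0 over a sequence of (x_t, y_t) pairs.

    Returns the final weights and the predictions w_{t-1} . x_t,
    each made before the corresponding feedback y_t is used.
    """
    w = [0.0] * n
    predictions = []
    for x, y in examples:
        predictions.append(dot(w, x))
        w = lms_step(w, x, y, eta)
    return w, predictions
```

Calling `run_lms(examples, n, eta)` on a list of `(x, y)` pairs follows the protocol on the slide: predict with the old weights, then update.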

(3)

Main Results

• Techniques from machine learning lead to generalizations of LMS

• H∞-optimal filtering in signal processing is similar to relative on-line loss bounds in machine learning

(4)

Motivation

• Non-Gaussian modeling

• Get away from rotation-invariant algorithms

• Develop algorithms that work well when instances are orthogonal and target weight vectors are “sparse”

(5)

Expected Bounds for LMS

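• Assume $y_t = u \cdot x_t + \nu_t$, where the $\nu_t$ are i.i.d. with $E[\nu_t^2] = \varepsilon$. Then

$$E\left[\frac{1}{T} \sum_{t=1}^{T} (u \cdot x_t - w_{t-1} \cdot x_t)^2\right] \le \varepsilon + \frac{1}{T} X_2^2 \|u\|_2^2$$

Better algorithms exist for the probabilistic setting.

However, our goal is to weaken the assumptions.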

(6)

H∞ bound for LMS [HSK96]

Assume $\|x_t\|_2 \le X_2$ for all $t$. Choose $\eta = 1/X_2^2$. For any $u \in \mathbb{R}^n$,

$$\frac{\sum_{t=1}^{T} (u \cdot x_t - w_{t-1} \cdot x_t)^2}{\sum_{t=1}^{T} (u \cdot x_t - y_t)^2 + X_2^2 \|u\|_2^2} \le 1$$

• If some $u$ with small norm is a good predictor, then LMS must approximate the predictions of $u$

• Bound holds for any $u$ and $(x_t, y_t)$

• No probabilistic assumptions

• LMS is H∞-optimal: no algorithm can achieve ratio < 1 for all $u$ and $(x_t, y_t)$
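The ratio in the H∞ bound can be checked numerically. The following is a minimal sketch on synthetic data; the data generator, the seed, and the names `hinf_ratio` and `x_bound` are assumptions made for illustration, and a single run only illustrates the bound rather than proving it.

```python
import random


def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(ai * bi for ai, bi in zip(a, b))


def hinf_ratio(examples, u, x_bound):
    """A priori filtering error of LMS divided by
    (noise energy + X_2^2 * ||u||_2^2), using eta = 1 / X_2^2."""
    eta = 1.0 / x_bound ** 2
    w = [0.0] * len(u)
    filtering_error = 0.0
    noise_energy = 0.0
    for x, y in examples:
        y_hat = dot(w, x)                       # prediction w_{t-1} . x_t
        filtering_error += (dot(u, x) - y_hat) ** 2
        noise_energy += (dot(u, x) - y) ** 2
        w = [wi - eta * (y_hat - y) * xi for wi, xi in zip(w, x)]
    return filtering_error / (noise_energy + x_bound ** 2 * dot(u, u))


# Hypothetical synthetic data: random inputs, noisy linear feedback.
rng = random.Random(0)
n, T = 5, 200
u = [rng.uniform(-1.0, 1.0) for _ in range(n)]
xs = [[rng.uniform(-1.0, 1.0) for _ in range(n)] for _ in range(T)]
examples = [(x, dot(u, x) + rng.gauss(0.0, 0.5)) for x in xs]
x_bound = max(dot(x, x) ** 0.5 for x in xs)     # X_2 for this sequence
ratio = hinf_ratio(examples, u, x_bound)
```

The bound on the slide says `ratio` should not exceed 1 for any comparison vector `u` and any data sequence, not only for the one that generated the feedback.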

(7)

Two related problems

A priori filtering: Control Theory

Try to match $u \cdot x_t$

$$\sum_{t=1}^{T} (u \cdot x_t - w_{t-1} \cdot x_t)^2$$

Prediction: On-line Learning

Try to match $y_t$

$$\sum_{t=1}^{T} (y_t - w_{t-1} \cdot x_t)^2$$

(8)

Comparison of known LMS-related bounds

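• For $\eta = 1/X_2^2$,

$$\sum_{t=1}^{T} (u \cdot x_t - w_{t-1} \cdot x_t)^2 \le \sum_{t=1}^{T} (u \cdot x_t - y_t)^2 + X_2^2 \|u\|_2^2$$

• For $\eta = \alpha/X_2^2$ ($0 < \alpha < 1$), writing $\mathrm{Loss}_u = \sum_{t=1}^{T} (u \cdot x_t - y_t)^2$,

$$\sum_{t=1}^{T} (y_t - w_{t-1} \cdot x_t)^2 \le \frac{1}{1-\alpha} \mathrm{Loss}_u + \frac{1}{\alpha} X_2^2 \|u\|_2^2$$

With tuned $\alpha$, the right-hand side is at most

$$\mathrm{Loss}_u + 2 \sqrt{\mathrm{Loss}_u}\, X_2 \|u\|_2 + X_2^2 \|u\|_2^2$$

[CBLW96]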
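The tuning step can be worked out directly. The derivation below is a sketch that takes the prediction bound in the form $\mathrm{Loss}_u/(1-\alpha) + X_2^2\|u\|_2^2/\alpha$; the shorthand $L$, $K$ and $g$ is introduced here for convenience and is not in the slides. It assumes $L > 0$ and $K > 0$.

```latex
% Shorthand (assumed): L = Loss_u, K = X_2^2 ||u||_2^2.
\begin{align*}
g(\alpha) &= \frac{L}{1-\alpha} + \frac{K}{\alpha},
  \qquad 0 < \alpha < 1, \\
g'(\alpha) &= \frac{L}{(1-\alpha)^2} - \frac{K}{\alpha^2} = 0
  \;\Longrightarrow\;
  \alpha^* = \frac{\sqrt{K}}{\sqrt{L} + \sqrt{K}}, \\
g(\alpha^*) &= \sqrt{L}\,(\sqrt{L} + \sqrt{K}) + \sqrt{K}\,(\sqrt{L} + \sqrt{K})
  = L + 2\sqrt{LK} + K .
\end{align*}
```

Substituting back gives $\mathrm{Loss}_u + 2\sqrt{\mathrm{Loss}_u}\,X_2\|u\|_2 + X_2^2\|u\|_2^2$. Note that $\alpha^*$ depends on $\mathrm{Loss}_u$ and $\|u\|_2$, so this choice presumes those quantities (or estimates of them) are available beforehand.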

(9)

Generalizing the LMS bound

• Replace $\|x\|_2 \|u\|_2$ by $\|x\|_p \|u\|_q$, where $1/p + 1/q = 1$ and $\|x\|_p = (\sum_i |x_i|^p)^{1/p}$

• Instead of comparing predictions to $u \cdot x_t$ for a fixed target $u$, compare to $u_t \cdot x_t$ where $u_t$ may change

• Replace $(\cdot)^2$ by a more general loss

(10)

Basic LMS

Step at time $t$: $-\eta (w_{t-1} \cdot x_t - y_t) x_t$

[Figure: adding this step to $w_{t-1}$ gives $w_t$.]

(11)

p-norm LMS

Write $\theta_t = f(w_t)$

[Figure: $f$ maps $w_{t-1}$ in “W-space” to $\theta_{t-1}$ in “Θ-space”; the additive step takes $\theta_{t-1}$ to $\theta_t$; $f^{-1}$ maps $\theta_t$ back to $w_t$ in “W-space”.]

(12)

p-norm LMS

based on [GLS01]

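$$w_t = f^{-1}\big(f(w_{t-1}) - \eta (w_{t-1} \cdot x_t - y_t) x_t\big)$$

where $f: \mathbb{R}^n \to \mathbb{R}^n$ is given by

$$f_i(w) = \frac{\mathrm{sign}(w_i) |w_i|^{q-1}}{\|w\|_q^{q-2}} \quad \text{and} \quad f_i^{-1}(\theta) = \frac{\mathrm{sign}(\theta_i) |\theta_i|^{p-1}}{\|\theta\|_p^{p-2}}$$

When $p = q = 2$, $f(w) = w$: LMS

For large $p$, $f^{-1}$ emphasizes differences in components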
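The link function and the resulting update can be sketched in a few lines of Python. This is an illustrative sketch only: the names `p_norm`, `link` and `p_norm_lms_step` are assumptions, and mapping the zero vector to zero is a convention chosen here so that the update can start from $w_0 = 0$.

```python
import math


def p_norm(v, r):
    """The r-norm (sum_i |v_i|^r)^(1/r) of a list v."""
    return sum(abs(vi) ** r for vi in v) ** (1.0 / r)


def link(v, r):
    """Map v_i to sign(v_i) |v_i|^(r-1) / ||v||_r^(r-2).

    With r = q this is f; with the dual exponent r = p it is f^{-1}.
    The zero vector is mapped to zero (a convention assumed here).
    """
    norm = p_norm(v, r)
    if norm == 0.0:
        return [0.0] * len(v)
    return [math.copysign(abs(vi) ** (r - 1), vi) / norm ** (r - 2) for vi in v]


def p_norm_lms_step(w, x, y, eta, p):
    """One p-norm LMS update: w_t = f^{-1}(f(w_{t-1}) - eta * err * x_t)."""
    q = p / (p - 1.0)                            # dual exponent, 1/p + 1/q = 1
    err = sum(wi * xi for wi, xi in zip(w, x)) - y
    theta = [ti - eta * err * xi for ti, xi in zip(link(w, q), x)]
    return link(theta, p)
```

With `p = 2.0` the link is the identity, so `p_norm_lms_step` reduces to the basic LMS step.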

(13)

A priori filtering bound

Theorem. Assume $\|x_t\|_p \le X_p$ for all $t$, and let $\eta = 1/((p-1) X_p^2)$. Then for any $u$ the p-norm algorithm satisfies

$$\sum_{t=1}^{T} (u \cdot x_t - w_{t-1} \cdot x_t)^2 \le \sum_{t=1}^{T} (y_t - u \cdot x_t)^2 + (p-1) X_p^2 \|u\|_q^2$$

• $1/p + 1/q = 1$ and $2 \le p < \infty$, $1 < q \le 2$

• How do we get the dual norm pair $(\infty, 1)$ (where $\|x\|_\infty = \max_i |x_i|$)?

For $p = 2 \ln n$, $(p-1) \|x\|_p^2 \|u\|_q^2 \le (2e \ln n) \|x\|_\infty^2 \|u\|_1^2$
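The constant $2e \ln n$ follows from three standard norm estimates. The short derivation below is a sketch of that reasoning and assumes $n \ge 3$, so that $p = 2 \ln n \ge 2$.

```latex
% Assumes n >= 3, so that p = 2 ln n >= 2 and 1 < q <= 2.
\begin{align*}
\|x\|_p &\le n^{1/p} \|x\|_\infty
  \;\Longrightarrow\;
  \|x\|_p^2 \le n^{2/p} \|x\|_\infty^2
  = n^{1/\ln n} \|x\|_\infty^2 = e\, \|x\|_\infty^2, \\
\|u\|_q &\le \|u\|_1 \qquad (q \ge 1), \\
p - 1 &< 2 \ln n .
\end{align*}
% Multiplying the three estimates:
\[
(p-1)\, \|x\|_p^2\, \|u\|_q^2 \le (2e \ln n)\, \|x\|_\infty^2\, \|u\|_1^2 .
\]
```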

(14)

Comparison with basic LMS

New bounds are incomparable with the old ones because for $p > 2$ and $q < 2$

$$\|x\|_p \le \|x\|_2 \quad \text{and} \quad \|u\|_q \ge \|u\|_2$$

Compare $p = 2$ and $p = O(\log n)$ in two extreme cases:

Sparse target, dense instances: Let $u = (1, 0, \ldots, 0)$ and $x = (1, \ldots, 1)$.

• $\|x\|_2^2 \|u\|_2^2 = n$

• $(\log n) \|x\|_\infty^2 \|u\|_1^2 = \log n$

• Thus large $p$ is better

Dense target, sparse instances: Let $u = (1, 1, \ldots, 1)$ and $x = (1, 0, \ldots, 0)$.

• $\|x\|_2^2 \|u\|_2^2 = n$

• $(\log n) \|x\|_\infty^2 \|u\|_1^2 = n^2 \log n$

• Thus $p = 2$ is better
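The two extreme cases are easy to evaluate for a concrete dimension. The sketch below computes both norm products; the helper names `norm` and `bound_terms` are assumptions for this example, and the natural logarithm stands in for the unspecified $\log n$ factor.

```python
import math


def norm(v, p):
    """The p-norm of a list v; p = math.inf gives the max norm."""
    if p == math.inf:
        return max(abs(vi) for vi in v)
    return sum(abs(vi) ** p for vi in v) ** (1.0 / p)


def bound_terms(x, u):
    """Return (||x||_2^2 ||u||_2^2, ln(n) ||x||_inf^2 ||u||_1^2)."""
    n = len(x)
    lms_term = norm(x, 2) ** 2 * norm(u, 2) ** 2
    p_norm_term = math.log(n) * norm(x, math.inf) ** 2 * norm(u, 1) ** 2
    return lms_term, p_norm_term


n = 16
ones = [1.0] * n
unit = [1.0] + [0.0] * (n - 1)

sparse_target = bound_terms(x=ones, u=unit)   # dense instances, sparse target
dense_target = bound_terms(x=unit, u=ones)    # sparse instances, dense target
```

For `sparse_target` the second entry is the smaller one, and for `dense_target` the first entry is, matching the two conclusions on the slide.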

(15)

The p-norm LMS can behave like EG

Hadamard matrix:

instances → +1 +1 +1 +1
          → +1 −1 +1 −1
          → +1 +1 −1 −1
          → +1 −1 −1 +1
             ↑  ↑  ↑  ↑
             targets

• Instances are orthogonal

• Target weight vectors are unit vectors

• LMS: error ≥ $1 - k/n$

• p-norm LMS with $p = O(\log n)$: error ≥ $(\ln n)/k$
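A matrix of this kind can be generated for any power-of-two dimension. The sketch below uses the Sylvester doubling construction, which is an assumption made here (the slide only shows the 4×4 case), and the name `hadamard` is hypothetical.

```python
def hadamard(n):
    """Sylvester construction of an n x n matrix with +1/-1 entries
    and mutually orthogonal rows; n must be a power of two."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    h = [[1]]
    while len(h) < n:
        # Double the size: [[H, H], [H, -H]].
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    return h


def rows_orthogonal(h):
    """True if every pair of distinct rows has inner product zero."""
    return all(
        sum(a * b for a, b in zip(h[i], h[j])) == 0
        for i in range(len(h))
        for j in range(i + 1, len(h))
    )
```

Each row serves as an instance $x_t$, and with a unit target vector $e_j$ the label $x_t \cdot e_j$ is the $j$-th entry of that row, which is why the columns are marked as targets.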

(16)

Time-varying target

(following [HW01])

Up to now, the model has been $y_t = u \cdot x_t + \text{noise}$, where the target $u$ is fixed.

Generalize this to $y_t = u_t \cdot x_t + \text{noise}$, where the target $u_t$ may vary over time.

Example 1: target makes one jump. Choose $a, b \in \mathbb{R}^n$ and take

$$u_t = \begin{cases} a & \text{for } 1 \le t \le T/2 \\ b & \text{for } T/2 < t \le T \end{cases}$$

Example 2: target moves steadily. Choose $a, b \in \mathbb{R}^n$ and take

$$u_t = \frac{T-t}{T-1}\, a + \frac{t-1}{T-1}\, b$$

(17)

Algorithms for time-varying target

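Old update:

$$w_t' = f^{-1}\big(f(w_{t-1}) - \eta (w_{t-1} \cdot x_t - y_t) x_t\big)$$

Bounding update:

$$w_t = \begin{cases} w_t' & \text{if } \|w_t'\|_q \le U_q \\ U_q \dfrac{w_t'}{\|w_t'\|_q} & \text{otherwise} \end{cases}$$

where $U_q > 0$ is a norm bound

We rescale the weight vector whenever its $q$-norm is larger than $U_q$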
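The bounding step on its own is a short function. This is a minimal sketch, with `bound_weights` a hypothetical name; it takes the already updated weights and rescales them only when their $q$-norm exceeds the bound.

```python
def bound_weights(w, q, u_q):
    """Return w unchanged if ||w||_q <= u_q, else u_q * w / ||w||_q."""
    norm = sum(abs(wi) ** q for wi in w) ** (1.0 / q)
    if norm <= u_q:
        return list(w)
    return [u_q * wi / norm for wi in w]
```

After rescaling, the weight vector has $q$-norm equal to the bound and keeps its direction.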

(18)

Bound for time-varying target

Theorem. Assume $\|x_t\|_p \le X_p$ for all $t$, and let $\eta = 1/((p-1) X_p^2)$. Then if $\|u_t\|_q \le U_q$ for all $t$, the bounded p-norm LMS satisfies

$$\sum_{t=1}^{T} (u_t \cdot x_t - w_{t-1} \cdot x_t)^2 \le \sum_{t=1}^{T} (y_t - u_t \cdot x_t)^2 + (p-1) X_p^2 U_q^2 + 2(p-1) X_p^2 U_q \sum_{t=1}^{T-1} \|u_{t+1} - u_t\|_q$$

• Only the total distance $\sum_t \|u_{t+1} - u_t\|_q$ traveled by the target matters

• Cost $2(p-1) X_p^2 U_q$ per unit of target movement

• For a fixed target $u_{t+1} = u_t$, we recover the previous bound

• However, $U_q$ needs to be known in advance
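To see how the movement term behaves, the sketch below builds the two example target sequences (one jump, steady movement) and measures the total distance traveled. All function names are assumptions made for this illustration, and `T` is assumed to be at least 2.

```python
def q_norm(v, q):
    """The q-norm of a list v."""
    return sum(abs(vi) ** q for vi in v) ** (1.0 / q)


def jump_targets(a, b, T):
    """u_t = a for 1 <= t <= T/2 and u_t = b for T/2 < t <= T."""
    return [list(a) if t <= T / 2 else list(b) for t in range(1, T + 1)]


def steady_targets(a, b, T):
    """u_t = (T - t)/(T - 1) * a + (t - 1)/(T - 1) * b."""
    return [
        [(T - t) / (T - 1) * ai + (t - 1) / (T - 1) * bi for ai, bi in zip(a, b)]
        for t in range(1, T + 1)
    ]


def path_length(targets, q):
    """Total distance sum_t ||u_{t+1} - u_t||_q traveled by the target."""
    return sum(
        q_norm([n_i - c_i for c_i, n_i in zip(cur, nxt)], q)
        for cur, nxt in zip(targets, targets[1:])
    )


def movement_penalty(targets, p, x_p, u_q):
    """The term 2 (p - 1) X_p^2 U_q * (path length) from the bound."""
    q = p / (p - 1.0)
    return 2.0 * (p - 1.0) * x_p ** 2 * u_q * path_length(targets, q)
```

For the same endpoints `a` and `b`, both sequences travel a total distance of $\|b - a\|_q$, so by this bound a single jump and a steady drift cost the same.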

(19)

Bregman divergences

• Key tool in analyzing and understanding the algorithms

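• Fix a strictly convex differentiable $F: \mathbb{R}^n \to \mathbb{R}$. Denote the gradient by $f = \nabla F$.

• Now the Bregman divergence $d_F: \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}$ is

$$d_F(u, w) = F(u) - F(w) - f(w) \cdot (u - w)$$

[Figure: graph of $F$ with the points $w$ and $u$ marked; $d_F(u, w)$ is the gap at $u$ between $F$ and its tangent at $w$.]

$d_F(u, w)$ is the error of the first-order Taylor approximation of $F(u)$ around $w$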

(20)

Basic properties of Bregman divergences

• $d_F(u, w) \ge 0$, and $d_F(u, w) = 0$ iff $u = w$

• not symmetrical (in general)

• does not satisfy triangle inequality

• $d_F(u, w)$ is convex in $u$, not necessarily in $w$

Connection to exponential families (roughly):

• $F$ is cumulant function, $f$ is link function

• $w$ is expectation parameter, $f(w)$ canonical parameter

• $d_F(u, w)$ is the KL divergence between distributions parameterized by $u$ and $w$

(21)

Example: p-norm divergence

[GLS01]

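$$F(w) = \frac{1}{2} \|w\|_q^2$$

• Then the gradient $f = \nabla F$ satisfies

$$f_i(w) = \frac{\mathrm{sign}(w_i) |w_i|^{q-1}}{\|w\|_q^{q-2}} \quad \text{and} \quad f_i^{-1}(\theta) = \frac{\mathrm{sign}(\theta_i) |\theta_i|^{p-1}}{\|\theta\|_p^{p-2}}$$

• The divergence is

$$d_F(u, w) = \frac{1}{2} \|u\|_q^2 - \frac{1}{2} \|w\|_q^2 - f(w) \cdot (u - w).$$

• Special case $p = q = 2$ gives $d_F(u, w) = \frac{1}{2} \|u - w\|_2^2$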
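The divergence can be evaluated numerically to check the special case. The sketch below is illustrative; the names `q_norm`, `grad_half_sq_norm` and `p_norm_divergence` are assumptions, and the gradient at the zero vector is taken to be zero by convention.

```python
import math


def q_norm(v, q):
    """The q-norm of a list v."""
    return sum(abs(vi) ** q for vi in v) ** (1.0 / q)


def grad_half_sq_norm(w, q):
    """Gradient f(w) of F(w) = (1/2) ||w||_q^2 (zero at w = 0 by convention)."""
    norm = q_norm(w, q)
    if norm == 0.0:
        return [0.0] * len(w)
    return [math.copysign(abs(wi) ** (q - 1), wi) / norm ** (q - 2) for wi in w]


def p_norm_divergence(u, w, q):
    """d_F(u, w) = F(u) - F(w) - f(w) . (u - w) for F = (1/2) ||.||_q^2."""
    f_w = grad_half_sq_norm(w, q)
    linear = sum(fi * (ui - wi) for fi, ui, wi in zip(f_w, u, w))
    return 0.5 * q_norm(u, q) ** 2 - 0.5 * q_norm(w, q) ** 2 - linear
```

With `q = 2.0` the result agrees with half the squared Euclidean distance between `u` and `w`; for other `q` in (1, 2] it is still nonnegative but no longer symmetric in general.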

(22)

Deriving the updates

• Define a regularized instantaneous loss

$$C_t(w) = d_F(w, w_{t-1}) + \frac{\eta}{2} (y_t - w \cdot x_t)^2$$

• Basic aim is to have

$$w_t = \arg\min_w C_t(w)$$

• Minimize by setting $\nabla C_t(w_t) = 0$, obtaining the implicit update

$$f(w_t) = f(w_{t-1}) - \eta (w_t \cdot x_t - y_t) x_t$$

• Approximate $w_t \cdot x_t \approx w_{t-1} \cdot x_t$ to obtain the update

$$f(w_t) = f(w_{t-1}) - \eta (w_{t-1} \cdot x_t - y_t) x_t$$
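In the special case $p = q = 2$, where $f$ is the identity, the implicit update can be solved in closed form without the approximation. The derivation below is a worked sketch of that case only.

```latex
% Special case p = q = 2, so f(w) = w.
\begin{align*}
w_t &= w_{t-1} - \eta\, (w_t \cdot x_t - y_t)\, x_t
  && \text{(implicit update)} \\
w_t \cdot x_t - y_t &= (w_{t-1} \cdot x_t - y_t)
  - \eta\, (w_t \cdot x_t - y_t)\, \|x_t\|_2^2
  && \text{(inner product with } x_t\text{, subtract } y_t\text{)} \\
w_t \cdot x_t - y_t &= \frac{w_{t-1} \cdot x_t - y_t}{1 + \eta \|x_t\|_2^2} \\
w_t &= w_{t-1} - \frac{\eta}{1 + \eta \|x_t\|_2^2}\,
  (w_{t-1} \cdot x_t - y_t)\, x_t .
\end{align*}
```

The step size is thus divided by a factor that grows with $\|x_t\|_2^2$, which resembles a normalized LMS step; dropping that factor gives back the basic LMS update.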

(23)

Analyzing the update

• Measure of progress

$$d_F(u, w_{t-1}) - d_F(u, w_t) = \eta (y_t - w_{t-1} \cdot x_t)\, x_t \cdot (u - w_{t-1}) - d_F(w_{t-1}, w_t)$$

• Massage the term $(y_t - w_{t-1} \cdot x_t)\, x_t \cdot (u - w_{t-1})$ until $(u \cdot x_t - w_{t-1} \cdot x_t)^2$ and $(y_t - u \cdot x_t)^2$ appear; throw the rest away

• Estimate $d_F(w_{t-1}, w_t)$ in terms of $\|x_t\|_p$

• Sum over $t = 1, \ldots, T$
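The progress identity can be sanity-checked numerically for the p-norm divergence. The sketch below performs one update and returns the difference between the two sides of the identity; the helper names and the convention that the zero vector maps to zero are assumptions, and the check is an illustration, not a proof.

```python
import math


def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(ai * bi for ai, bi in zip(a, b))


def r_norm(v, r):
    """The r-norm of a list v."""
    return sum(abs(vi) ** r for vi in v) ** (1.0 / r)


def link(v, r):
    """sign(v_i) |v_i|^(r-1) / ||v||_r^(r-2); f for r = q, f^{-1} for r = p."""
    norm = r_norm(v, r)
    if norm == 0.0:
        return [0.0] * len(v)
    return [math.copysign(abs(vi) ** (r - 1), vi) / norm ** (r - 2) for vi in v]


def divergence(u, w, q):
    """Bregman divergence d_F(u, w) for F = (1/2) ||.||_q^2."""
    diff = [ui - wi for ui, wi in zip(u, w)]
    return 0.5 * r_norm(u, q) ** 2 - 0.5 * r_norm(w, q) ** 2 - dot(link(w, q), diff)


def progress_gap(u, w_prev, x, y, eta, p):
    """Left side minus right side of the progress identity after one update."""
    q = p / (p - 1.0)
    err = dot(w_prev, x) - y
    theta = [fi - eta * err * xi for fi, xi in zip(link(w_prev, q), x)]
    w_new = link(theta, p)                       # w_t = f^{-1}(theta_t)
    lhs = divergence(u, w_prev, q) - divergence(u, w_new, q)
    step = dot(x, [ui - wi for ui, wi in zip(u, w_prev)])
    rhs = eta * (y - dot(w_prev, x)) * step - divergence(w_prev, w_new, q)
    return lhs - rhs
```

Up to floating-point rounding, `progress_gap` should return zero for any choice of `u`, `w_prev`, `x`, `y`, `eta` and `p >= 2`, since the identity only uses $f(w_t) - f(w_{t-1}) = -\eta (w_{t-1} \cdot x_t - y_t) x_t$.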

(24)

Conclusion

• LMS and normalized LMS can be derived from an optimization problem involving a certain Bregman divergence

• Different Bregman divergences lead to different algorithms, with loss bounds in terms of different norms

• Bounds can be generalized for time-varying targets (and generalized linear models, not presented in the talk); the proofs are easy

• Algorithms for p = 2 can be kernelized; for p > 2 probably not

Bottom line: Machinery from on-line machine learning carries over to H∞-optimal filtering

(25)

Where are we headed?

• Develop p-norm Kalman filter

• Prove relative loss bounds

References
