The p-norm generalization of the LMS algorithm for adaptive filtering

(1)

Jyrki Kivinen, University of Helsinki

Manfred Warmuth, University of California, Santa Cruz

Babak Hassibi, California Institute of Technology

(2)

Least Mean Squares (LMS) update

Pick learning rate $\eta > 0$. Initialize $w_0 = 0 \in \mathbb{R}^n$. At time $t$, for $t = 1, \ldots, T$, the algorithm

• observes input $x_t \in \mathbb{R}^n$

• makes prediction $w_{t-1} \cdot x_t \in \mathbb{R}$

• observes feedback $y_t \in \mathbb{R}$, and

• updates its hypothesis as

$$w_t = w_{t-1} - \eta\,(w_{t-1} \cdot x_t - y_t)\,x_t$$
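The protocol above can be sketched in a few lines of NumPy. This is a minimal illustration, not code from the talk: the function name `lms`, the synthetic noise-free data, and the choice of learning rate $\eta = 1/\max_t \|x_t\|_2^2$ are assumptions made for the example.

```python
import numpy as np

def lms(xs, ys, eta):
    """Run LMS; return the predictions w_{t-1}.x_t and the final weights."""
    xs = np.asarray(xs, dtype=float)
    ys = np.asarray(ys, dtype=float)
    w = np.zeros(xs.shape[1])              # w_0 = 0
    preds = np.empty(len(ys))
    for t, (x, y) in enumerate(zip(xs, ys)):
        preds[t] = w @ x                   # predict before seeing y_t
        w = w - eta * (preds[t] - y) * x   # LMS update
    return preds, w

# Hypothetical noise-free data generated by a fixed target u
rng = np.random.default_rng(0)
u = np.array([1.0, -2.0, 0.5])
xs = rng.normal(size=(2000, 3))
ys = xs @ u
eta = 1.0 / np.max(np.sum(xs**2, axis=1))  # assumed tuning: 1 / max_t ||x_t||_2^2
preds, w = lms(xs, ys, eta)
```

The prediction at time $t$ uses $w_{t-1}$, so `preds[0]` is always zero; the weights change only after the feedback $y_t$ has been seen.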

(3)

Main Results

• Techniques from machine learning lead to generalizations of LMS

• $H^\infty$-optimal filtering in signal processing is similar to relative on-line loss bounds in machine learning

(4)

Motivation

• Non-Gaussian modeling

• Get away from rotation-invariant algorithms

• Develop algorithms that work well when instances are orthogonal and target weight vectors are "sparse"

(5)

Expected Bounds for LMS

• Assume $y_t = u \cdot x_t + \nu_t$, where the $\nu_t$ are i.i.d. with $E[\nu_t^2] = \epsilon$. Then

$$E\left[\frac{1}{T}\sum_{t=1}^{T} (u \cdot x_t - w_{t-1} \cdot x_t)^2\right] \le \epsilon + \frac{1}{T}\,X_2^2\,\|u\|_2^2$$

Better algorithms exist for the probabilistic setting. However, our goal is to weaken the assumptions.

(6)

$H^\infty$ bound for LMS

[HSK96]

Assume $\|x_t\|_2 \le X_2$ for all $t$. Choose $\eta = 1/X_2^2$. For any $u \in \mathbb{R}^n$,

$$\frac{\sum_{t=1}^{T} (u \cdot x_t - w_{t-1} \cdot x_t)^2}{\sum_{t=1}^{T} (u \cdot x_t - y_t)^2 + X_2^2\,\|u\|_2^2} \le 1$$

• If some $u$ with small norm is a good predictor, then LMS must approximate the predictions of $u$

• Bound holds for any $u$ and $(x_t, y_t)$

• No probabilistic assumptions

• LMS is $H^\infty$-optimal: no algorithm can achieve ratio $< 1$ for all $u$ and $(x_t, y_t)$
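The ratio in the bound can be evaluated numerically for any sequence. The sketch below is an illustration under assumptions not in the slides: the helper name `hinf_ratio`, the synthetic inputs, and the Laplace-distributed disturbances (chosen only to avoid Gaussian noise) are all hypothetical.

```python
import numpy as np

def hinf_ratio(xs, ys, u):
    """Ratio from the bound for LMS with eta = 1/X_2^2 and w_0 = 0."""
    X2_sq = np.max(np.sum(xs**2, axis=1))    # X_2^2 = max_t ||x_t||_2^2
    eta = 1.0 / X2_sq
    w = np.zeros(xs.shape[1])
    num = 0.0
    for x, y in zip(xs, ys):
        num += (u @ x - w @ x) ** 2          # a priori filtering error
        w = w - eta * (w @ x - y) * x        # LMS update
    den = np.sum((xs @ u - ys) ** 2) + X2_sq * (u @ u)
    return num / den

rng = np.random.default_rng(1)
xs = rng.normal(size=(500, 5))
u = rng.normal(size=5)
ys = xs @ u + rng.laplace(size=500)          # hypothetical non-Gaussian disturbances
ratio = hinf_ratio(xs, ys, u)
```

Because the bound holds for any comparison vector, the same check can be repeated with a $u$ unrelated to how `ys` was generated.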

(7)

Two related problems

A priori filtering (control theory): try to match $u \cdot x_t$

$$\sum_{t=1}^{T} (u \cdot x_t - w_{t-1} \cdot x_t)^2$$

Prediction (on-line learning): try to match $y_t$

$$\sum_{t=1}^{T} (y_t - w_{t-1} \cdot x_t)^2$$

(8)

Comparison of known LMS-related bounds

• For $\eta = 1/X_2^2$,

$$\sum_{t=1}^{T} (u \cdot x_t - w_{t-1} \cdot x_t)^2 \le \sum_{t=1}^{T} (u \cdot x_t - y_t)^2 + X_2^2\,\|u\|_2^2$$

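• For $\eta = \alpha/X_2^2$ ($0 < \alpha < 1$),

$$\sum_{t=1}^{T} (y_t - w_{t-1} \cdot x_t)^2 \le \frac{1}{1-\alpha}\underbrace{\sum_{t=1}^{T} (u \cdot x_t - y_t)^2}_{\mathrm{Loss}_u} + \frac{1}{\alpha}\,X_2^2\,\|u\|_2^2$$

and with tuned $\alpha$,

$$\sum_{t=1}^{T} (y_t - w_{t-1} \cdot x_t)^2 \le \mathrm{Loss}_u + 2\sqrt{\mathrm{Loss}_u}\,X_2\,\|u\|_2 + X_2^2\,\|u\|_2^2$$

[CBLW96]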
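The tuned form follows from minimizing the right-hand side of the prediction bound over $\alpha$. A sketch of the calculation, using the shorthand $L = \mathrm{Loss}_u$ and $B = X_2\|u\|_2$ (both introduced here only for brevity):

```latex
\begin{align*}
g(\alpha) &= \frac{L}{1-\alpha} + \frac{B^2}{\alpha}, \qquad
g'(\alpha) = \frac{L}{(1-\alpha)^2} - \frac{B^2}{\alpha^2} = 0
\;\Longrightarrow\; \alpha^* = \frac{B}{\sqrt{L} + B}, \\
g(\alpha^*) &= \sqrt{L}\,\bigl(\sqrt{L} + B\bigr) + B\,\bigl(\sqrt{L} + B\bigr)
 = L + 2\sqrt{L}\,B + B^2 .
\end{align*}
```

Note that $\alpha^*$ depends on $\mathrm{Loss}_u$ and $\|u\|_2$, so this tuning presumes those quantities (or estimates of them) are available beforehand.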

(9)

Generalizing the LMS bound

• Replace $\|x\|_2\|u\|_2$ by $\|x\|_p\|u\|_q$, where $1/p + 1/q = 1$ and $\|x\|_p = \left(\sum_i |x_i|^p\right)^{1/p}$

• Instead of comparing predictions to $u \cdot x_t$ for a fixed target $u$, compare to $u_t \cdot x_t$, where $u_t$ may change

• Replace $(\cdot)^2$ by a more general loss

(10)

Basic LMS

$$\Delta_t = -\eta\,(w_{t-1} \cdot x_t - y_t)\,x_t$$

[Figure: the update adds $\Delta_t$ to $w_{t-1}$ to give $w_t$, that is, $w_{t-1} + \Delta_t = w_t$.]

(11)

p-norm LMS

Write $\theta_t = f(w_t)$

[Figure: $f$ maps $w_{t-1}$ in "W-space" to $\theta_{t-1}$ in "Θ-space", where $\theta_{t-1} + \Delta_t = \theta_t$; $f^{-1}$ maps $\theta_t$ back to $w_t$.]

(12)

p-norm LMS

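(based on [GLS01])

$$w_t = f^{-1}\bigl(f(w_{t-1}) - \eta\,(w_{t-1} \cdot x_t - y_t)\,x_t\bigr)$$

where $f: \mathbb{R}^n \to \mathbb{R}^n$ is given by

$$f_i(w) = \frac{\mathrm{sign}(w_i)\,|w_i|^{q-1}}{\|w\|_q^{q-2}} \quad\text{and}\quad f_i^{-1}(\theta) = \frac{\mathrm{sign}(\theta_i)\,|\theta_i|^{p-1}}{\|\theta\|_p^{p-2}}$$

When $p = q = 2$, $f(w) = w$ and the update is LMS.

For large $p$, $f^{-1}$ emphasizes differences in components.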
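A minimal NumPy sketch of the p-norm update follows. The names `link` and `pnorm_lms_step` are hypothetical, and treating $f(0) = 0$ is a convention assumed here so that the start $w_0 = 0$ is well defined. Since $f^{-1}$ has the same form as $f$ with $p$ in place of $q$, one helper serves both directions.

```python
import numpy as np

def link(w, q):
    """f_i(w) = sign(w_i)|w_i|^(q-1) / ||w||_q^(q-2); f(0) = 0 by convention."""
    norm = np.linalg.norm(w, ord=q)
    if norm == 0.0:
        return np.zeros_like(w)
    return np.sign(w) * np.abs(w) ** (q - 1) / norm ** (q - 2)

def pnorm_lms_step(w, x, y, eta, p):
    """One step w_t = f^{-1}(f(w_{t-1}) - eta (w_{t-1}.x_t - y_t) x_t)."""
    q = p / (p - 1.0)                            # dual exponent, 1/p + 1/q = 1
    theta = link(w, q) - eta * (w @ x - y) * x   # additive step in theta-space
    return link(theta, p)                        # f^{-1}: same form with p

# Example step from w_0 = 0 with p = 2 ln n (illustrative values)
n = 4
w1 = pnorm_lms_step(np.zeros(n), np.array([1.0, -1.0, 1.0, 1.0]), 1.0,
                    eta=0.1, p=2 * np.log(n))
```

With `p=2.0` the helper `link` is the identity and the step coincides with the basic LMS update.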

(13)

A priori filtering bound

Theorem. Assume $\|x_t\|_p \le X_p$ for all $t$, and let $\eta = 1/((p-1)X_p^2)$. Then for any $u$ the p-norm algorithm satisfies

$$\sum_{t=1}^{T} (u \cdot x_t - w_{t-1} \cdot x_t)^2 \le \sum_{t=1}^{T} (y_t - u \cdot x_t)^2 + (p-1)\,X_p^2\,\|u\|_q^2$$

• $1/p + 1/q = 1$ and $2 \le p < \infty$, $1 < q \le 2$

• How do we get the dual norm pair $(\infty, 1)$ (where $\|x\|_\infty = \max_i |x_i|$)?

For $p = 2\ln n$, $(p-1)\,\|x\|_p^2\,\|u\|_q^2 \le (2e\ln n)\,\|x\|_\infty^2\,\|u\|_1^2$
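The last inequality can be checked with standard norm comparisons. A short sketch of the steps, assuming $n \ge 3$ so that $p = 2\ln n \ge 2$:

```latex
\begin{align*}
\|x\|_p &\le n^{1/p}\,\|x\|_\infty
 \;\Longrightarrow\; \|x\|_p^2 \le n^{2/p}\,\|x\|_\infty^2
 = e^{(2\ln n)/p}\,\|x\|_\infty^2 = e\,\|x\|_\infty^2
 \quad (p = 2\ln n), \\
\|u\|_q &\le \|u\|_1 \quad (q \ge 1), \qquad p - 1 < 2\ln n, \\
(p-1)\,\|x\|_p^2\,\|u\|_q^2 &\le (2e\ln n)\,\|x\|_\infty^2\,\|u\|_1^2 .
\end{align*}
```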

(14)

Comparison with basic LMS

New bounds are incomparable with the old ones because for $p > 2$ and $q < 2$

$$\|x\|_p \le \|x\|_2 \quad\text{and}\quad \|u\|_q \ge \|u\|_2$$

Compare p = 2 and p = O(log n) in two extreme cases:

Sparse target, dense instances: Let $u = (1, 0, \ldots, 0)$ and $x = (1, \ldots, 1)$.

• $\|x\|_2^2\,\|u\|_2^2 = n$

• $(\log n)\,\|x\|_\infty^2\,\|u\|_1^2 = \log n$

• Thus large $p$ is better

Dense target, sparse instances: Let $u = (1, 1, \ldots, 1)$ and $x = (1, 0, \ldots, 0)$.

• $\|x\|_2^2\,\|u\|_2^2 = n$

• $(\log n)\,\|x\|_\infty^2\,\|u\|_1^2 = n^2 \log n$

• Thus $p = 2$ is better
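The two extreme cases can be evaluated directly. The following sketch computes both norm products for a concrete dimension; the helper name `bound_terms` and the value $n = 1024$ are assumptions for the example, and the natural logarithm stands in for $\log n$.

```python
import numpy as np

def bound_terms(x, u):
    """Return (||x||_2^2 ||u||_2^2, (ln n) ||x||_inf^2 ||u||_1^2)."""
    n = len(x)
    two_norm = np.sum(x**2) * np.sum(u**2)
    inf_one = np.log(n) * np.max(np.abs(x)) ** 2 * np.sum(np.abs(u)) ** 2
    return two_norm, inf_one

n = 1024
e1 = np.zeros(n)
e1[0] = 1.0                                  # (1, 0, ..., 0)
ones = np.ones(n)                            # (1, 1, ..., 1)

sparse_target = bound_terms(x=ones, u=e1)    # (n, ln n)
dense_target = bound_terms(x=e1, u=ones)     # (n, n^2 ln n)
```

In the first case the $(\infty, 1)$ term is smaller by a factor of $n / \ln n$; in the second it is larger by a factor of $n \ln n$.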

(15)

The p-norm LMS can behave like EG

Hadamard Matrix:

Rows are instances, columns are targets:

$$\begin{pmatrix} +1 & +1 & +1 & +1 \\ +1 & -1 & +1 & -1 \\ +1 & +1 & -1 & -1 \\ +1 & -1 & -1 & +1 \end{pmatrix}$$

• Instances are orthogonal

• Target weight vectors are unit vectors

• LMS: error $\ge 1 - k/n$

• p-norm LMS with $p = O(\log n)$: error $\ge (\ln n)/k$

(16)

Time-varying target

(following [HW01])

Up to now, the model has been $y_t = u \cdot x_t + \text{noise}$, where the target $u$ is fixed.

Generalize this to $y_t = u_t \cdot x_t + \text{noise}$, where the target $u_t$ may vary over time.

Example 1: target makes one jump. Choose $a, b \in \mathbb{R}^n$ and take

$$u_t = \begin{cases} a & \text{for } 1 \le t \le T/2 \\ b & \text{for } T/2 < t \le T \end{cases}$$

Example 2: target moves steadily. Choose $a, b \in \mathbb{R}^n$ and take

$$u_t = \frac{T-t}{T-1}\,a + \frac{t-1}{T-1}\,b$$
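Both example targets are easy to generate, which also makes it possible to measure how far each one travels. The sketch below is illustrative only: the function names and the particular $a$, $b$, $T$ are assumptions.

```python
import numpy as np

def jump_target(a, b, T):
    """u_t = a for 1 <= t <= T/2 and b for T/2 < t <= T (row t-1 holds u_t)."""
    t = np.arange(1, T + 1)[:, None]
    return np.where(t <= T / 2, a, b)

def drift_target(a, b, T):
    """u_t = ((T - t) a + (t - 1) b) / (T - 1), moving from a to b (T >= 2)."""
    t = np.arange(1, T + 1)[:, None]
    return ((T - t) * a + (t - 1) * b) / (T - 1)

def path_length(us, q):
    """Total distance sum_t ||u_{t+1} - u_t||_q travelled by the target."""
    return np.sum(np.linalg.norm(np.diff(us, axis=0), ord=q, axis=1))

a = np.array([1.0, 0.0, 0.0])
b = np.array([0.0, 2.0, 0.0])
T = 10
jump = jump_target(a, b, T)
drift = drift_target(a, b, T)
```

In both examples the target travels a total q-norm distance of $\|b - a\|_q$: once in a single jump, and once in $T - 1$ equal steps.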

(17)

Algorithms for time-varying target

Old update:

$$w'_t = f^{-1}\bigl(f(w_{t-1}) - \eta\,(w_{t-1} \cdot x_t - y_t)\,x_t\bigr)$$

Bounding update:

$$w_t = \begin{cases} w'_t & \text{if } \|w'_t\|_q \le U_q \\ U_q\,\dfrac{w'_t}{\|w'_t\|_q} & \text{otherwise} \end{cases}$$

where $U_q > 0$ is a norm bound.

We rescale the weight vector whenever its q-norm is larger than $U_q$.
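The rescaling step on its own is a one-liner. A minimal sketch, with the hypothetical name `bound_q_norm` and the intermediate weight vector passed in as `w_prime`:

```python
import numpy as np

def bound_q_norm(w_prime, U_q, q):
    """Return w_prime if its q-norm is at most U_q, else rescale it to q-norm U_q."""
    norm = np.linalg.norm(w_prime, ord=q)
    if norm <= U_q:
        return w_prime
    return U_q * w_prime / norm

inside = bound_q_norm(np.array([3.0, 4.0]), U_q=10.0, q=2)   # unchanged
scaled = bound_q_norm(np.array([3.0, 4.0]), U_q=1.0, q=2)    # rescaled to norm 1
```

Only the length of the vector changes; its direction is preserved.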

(18)

Bound for time-varying target

Theorem. Assume $\|x_t\|_p \le X_p$ for all $t$, and let $\eta = 1/((p-1)X_p^2)$. Then if $\|u_t\|_q \le U_q$ for all $t$, the bounded p-norm LMS satisfies

$$\sum_{t=1}^{T} (u_t \cdot x_t - w_{t-1} \cdot x_t)^2 \le \sum_{t=1}^{T} (y_t - u_t \cdot x_t)^2 + (p-1)\,X_p^2\,U_q^2 + 2(p-1)\,X_p^2\,U_q \sum_{t=1}^{T-1} \|u_{t+1} - u_t\|_q$$

• Only the total distance $\sum_t \|u_{t+1} - u_t\|_q$ traveled by the target matters

• Cost $2(p-1)\,X_p^2\,U_q$ per unit of target movement

• For a fixed target, $u_{t+1} = u_t$, we recover the previous bound

• However, $U_q$ needs to be known in advance

(19)

Bregman divergences

• Key tool in analyzing and understanding the algorithms

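• Fix a strictly convex differentiable $F: \mathbb{R}^n \to \mathbb{R}$. Denote the gradient by $f = \nabla F$.

• Now the Bregman divergence $d_F: \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}$ is

$$d_F(u, w) = F(u) - F(w) - f(w) \cdot (u - w)$$

[Figure: graph of $F$ with the points $w$ and $u$ marked and the gap $d_F(u, w)$ indicated.]

$d_F(u, w)$ is the error of the first-order Taylor approximation of $F(u)$ around $w$.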

(20)

Basic properties of Bregman divergences

• $d_F(u, w) \ge 0$, and $d_F(u, w) = 0$ iff $u = w$

• not symmetrical (in general)

• does not satisfy triangle inequality

• $d_F(u, w)$ is convex in $u$, not necessarily in $w$

Connection to exponential families (roughly):

• $F$ is cumulant function, $f$ is link function

• $w$ is expectation parameter, $f(w)$ canonical parameter

• d_F(u,w) is the KL divergence between distributions parameterized by u and w
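The definition and the first two properties can be tried out on a concrete $F$. In the sketch below the generic helper `bregman` and the choice of $F$ (negative entropy on positive vectors, whose divergence is the unnormalized relative entropy) are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def bregman(F, f, u, w):
    """d_F(u, w) = F(u) - F(w) - f(w).(u - w), with f the gradient of F."""
    return F(u) - F(w) - f(w) @ (u - w)

# Illustrative choice (an assumption): negative entropy on positive vectors
F = lambda w: np.sum(w * np.log(w) - w)
f = lambda w: np.log(w)

u = np.array([0.2, 0.3, 0.5])
w = np.array([0.6, 0.3, 0.1])
d_uw = bregman(F, f, u, w)   # equals sum_i u_i log(u_i / w_i) - u_i + w_i
d_wu = bregman(F, f, w, u)   # generally differs from d_uw
d_uu = bregman(F, f, u, u)   # zero when both arguments coincide
```

Swapping the arguments changes the value, which illustrates the lack of symmetry.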

(21)

Example: p-norm divergence

[GLS01]

• $F(w) = \frac{1}{2}\|w\|_q^2$

• Then the gradient $f = \nabla F$ satisfies

$$f_i(w) = \frac{\mathrm{sign}(w_i)\,|w_i|^{q-1}}{\|w\|_q^{q-2}} \quad\text{and}\quad f_i^{-1}(\theta) = \frac{\mathrm{sign}(\theta_i)\,|\theta_i|^{p-1}}{\|\theta\|_p^{p-2}}$$

• The divergence is

$$d_F(u, w) = \frac{1}{2}\|u\|_q^2 - \frac{1}{2}\|w\|_q^2 - f(w) \cdot (u - w)$$

• Special case $p = q = 2$ gives $d_F(u, w) = \frac{1}{2}\|u - w\|_2^2$
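The p-norm divergence can be computed directly from these formulas. A minimal sketch with hypothetical function names, taking the gradient at $w = 0$ to be zero (an assumed convention):

```python
import numpy as np

def F(w, q):
    """F(w) = (1/2) ||w||_q^2."""
    return 0.5 * np.linalg.norm(w, ord=q) ** 2

def grad_F(w, q):
    """f_i(w) = sign(w_i)|w_i|^(q-1) / ||w||_q^(q-2), with f(0) = 0."""
    norm = np.linalg.norm(w, ord=q)
    if norm == 0.0:
        return np.zeros_like(w)
    return np.sign(w) * np.abs(w) ** (q - 1) / norm ** (q - 2)

def pnorm_divergence(u, w, q):
    """d_F(u, w) = F(u) - F(w) - f(w).(u - w)."""
    return F(u, q) - F(w, q) - grad_F(w, q) @ (u - w)

u = np.array([1.0, -0.5, 2.0])
w = np.array([0.3, 0.8, -1.0])
d_euclid = pnorm_divergence(u, w, q=2.0)   # equals 0.5 * ||u - w||_2^2
d_q = pnorm_divergence(u, w, q=1.2)        # nonnegative, not a squared distance
```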

(22)

Deriving the updates

• Define a regularized instantaneous loss

$$C_t(w) = d_F(w, w_{t-1}) + \frac{\eta}{2}\,(y_t - w \cdot x_t)^2$$

• Basic aim is to have

$$w_t = \arg\min_w C_t(w)$$

• Minimize by setting $\nabla C_t(w_t) = 0$, obtaining the implicit update

$$f(w_t) = f(w_{t-1}) - \eta\,(w_t \cdot x_t - y_t)\,x_t$$

• Approximate $w_t \cdot x_t \approx w_{t-1} \cdot x_t$ to obtain the update

$$f(w_t) = f(w_{t-1}) - \eta\,(w_{t-1} \cdot x_t - y_t)\,x_t$$
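The gradient computation behind the implicit update can be written out term by term. A short sketch, using only the definition of $d_F$ and $f = \nabla F$:

```latex
\begin{align*}
\nabla_w\, d_F(w, w_{t-1})
 &= \nabla_w \bigl[ F(w) - F(w_{t-1}) - f(w_{t-1}) \cdot (w - w_{t-1}) \bigr]
 = f(w) - f(w_{t-1}), \\
\nabla_w\, \tfrac{\eta}{2}\,(y_t - w \cdot x_t)^2
 &= -\eta\,(y_t - w \cdot x_t)\,x_t
 = \eta\,(w \cdot x_t - y_t)\,x_t, \\
\nabla C_t(w_t) = 0
 \;&\Longleftrightarrow\;
 f(w_t) = f(w_{t-1}) - \eta\,(w_t \cdot x_t - y_t)\,x_t .
\end{align*}
```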

(23)

Analyzing the update

• Measure of progress

$$d_F(u, w_{t-1}) - d_F(u, w_t) = \eta\,(y_t - w_{t-1} \cdot x_t)\,x_t \cdot (u - w_{t-1}) - d_F(w_{t-1}, w_t)$$

• Massage the term $(y_t - w_{t-1} \cdot x_t)\,x_t \cdot (u - w_{t-1})$ until $(u \cdot x_t - w_{t-1} \cdot x_t)^2$ and $(y_t - u \cdot x_t)^2$ appear; throw the rest away

• Estimate $d_F(w_{t-1}, w_t)$ in terms of $\|x_t\|_p$

• Sum over t = 1, . . . , T
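The measure-of-progress identity follows by expanding the three divergences, cancelling the $F$ terms, and then substituting the update $f(w_t) = f(w_{t-1}) - \eta\,(w_{t-1} \cdot x_t - y_t)\,x_t$. A sketch of the calculation:

```latex
\begin{align*}
d_F(u, w_{t-1}) - d_F(u, w_t) + d_F(w_{t-1}, w_t)
 &= -f(w_{t-1}) \cdot (u - w_{t-1}) + f(w_t) \cdot (u - w_t)
    - f(w_t) \cdot (w_{t-1} - w_t) \\
 &= \bigl( f(w_t) - f(w_{t-1}) \bigr) \cdot (u - w_{t-1}) \\
 &= \eta\,(y_t - w_{t-1} \cdot x_t)\,x_t \cdot (u - w_{t-1}) .
\end{align*}
```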

(24)

Conclusion

• LMS and normalized LMS can be derived from an optimization problem involving a certain Bregman divergence

• Different Bregman divergences lead to different algorithms, with loss bounds in terms of different norms

• Bounds can be generalized for time-varying targets (and generalized linear models, not presented in the talk); the proofs are easy

• Algorithms for $p = 2$ can be kernelized; for $p > 2$ probably not

Bottom line: Machinery from on-line machine learning carries over to $H^\infty$-optimal filtering

(25)

Where are we headed?

• Develop p-norm Kalman filter

• Prove relative loss bounds