The p-norm generalization of the LMS algorithm for adaptive filtering
Jyrki Kivinen, University of Helsinki
Manfred Warmuth, University of California, Santa Cruz
Babak Hassibi, California Institute of Technology
Least Mean Squares (LMS) update
Pick learning rate $\eta > 0$. Initialize $w_0 = 0 \in \mathbb{R}^n$.
At time $t$, for $t = 1, \ldots, T$, the algorithm
• observes input $x_t \in \mathbb{R}^n$
• makes prediction $w_{t-1} \cdot x_t \in \mathbb{R}$
• observes feedback $y_t \in \mathbb{R}$, and
• updates its hypothesis as
$$w_t = w_{t-1} - \eta (w_{t-1} \cdot x_t - y_t) x_t$$
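The protocol above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the function name `lms`, the toy data, and the choice of learning rate (one over the largest squared input norm) are assumptions made for the example.

```python
import numpy as np

def lms(xs, ys, eta):
    """Run the LMS update over the rows of xs; return predictions and final weights."""
    w = np.zeros(xs.shape[1])            # w_0 = 0
    preds = np.empty(len(ys))
    for t, (x, y) in enumerate(zip(xs, ys)):
        preds[t] = w @ x                 # prediction uses w_{t-1}
        w = w - eta * (preds[t] - y) * x # w_t = w_{t-1} - eta (w_{t-1}.x_t - y_t) x_t
    return preds, w

# Hypothetical toy data: noise-free linear target (an assumption for illustration).
rng = np.random.default_rng(0)
xs = rng.normal(size=(200, 5))
u = np.array([1.0, -2.0, 0.0, 0.5, 0.0])
ys = xs @ u
eta = 1.0 / np.max(np.sum(xs ** 2, axis=1))
preds, w = lms(xs, ys, eta)
```

The first prediction is always zero because the algorithm starts from $w_0 = 0$.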
Main Results
• Techniques from machine learning lead to generalizations of LMS
• H∞-optimal filtering in signal processing is similar to relative on-line loss bounds in machine learning
Motivation
• Non-Gaussian modeling
• Get away from rotation invariant algorithms
• Develop algorithms that work well when instances are orthogonal and target weight vectors are “sparse”
Expected Bounds for LMS
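• Assume $y_t = u \cdot x_t + \nu_t$, where the $\nu_t$ are i.i.d. with $E[\nu_t^2] = \varepsilon$. Then
$$E\left[\frac{1}{T}\sum_{t=1}^{T}(u \cdot x_t - w_{t-1} \cdot x_t)^2\right] \le \varepsilon + \frac{1}{T} X_2^2 \|u\|_2^2$$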
Better algorithms exist for the probabilistic setting. However, our goal is to weaken the assumptions.
H∞ bound for LMS
[HSK96] Assume $\|x_t\|_2 \le X_2$ for all $t$. Choose $\eta = 1/X_2^2$. For any $u \in \mathbb{R}^n$,
$$\frac{\sum_{t=1}^{T}(u \cdot x_t - w_{t-1} \cdot x_t)^2}{\sum_{t=1}^{T}(u \cdot x_t - y_t)^2 + X_2^2\|u\|_2^2} \le 1$$
• If some $u$ with small norm is a good predictor, then LMS must approximate the predictions of $u$
• Bound holds for any $u$ and any sequence $(x_t, y_t)$
• No probabilistic assumptions
• LMS is H∞-optimal: no algorithm can achieve ratio < 1 for all $u$ and $(x_t, y_t)$
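The ratio in the H∞ bound can be checked numerically. The sketch below runs LMS with $\eta = 1/X_2^2$ on arbitrary feedback and evaluates the ratio for several comparison vectors $u$; the helper name `hinf_ratio` and the random data are assumptions made for illustration.

```python
import numpy as np

def hinf_ratio(xs, ys, u):
    """Filtering error of LMS divided by (loss of u + X_2^2 ||u||_2^2)."""
    X2_sq = np.max(np.sum(xs ** 2, axis=1))   # X_2^2: bound on squared input norms
    eta = 1.0 / X2_sq
    w = np.zeros(xs.shape[1])
    filt_err = 0.0
    for x, y in zip(xs, ys):
        pred = w @ x
        filt_err += (u @ x - pred) ** 2       # compare with u . x_t, not y_t
        w = w - eta * (pred - y) * x
    loss_u = np.sum((xs @ u - ys) ** 2)
    return filt_err / (loss_u + X2_sq * (u @ u))

# Hypothetical data: the feedback is unrelated to the inputs (no model assumed).
rng = np.random.default_rng(1)
xs = rng.normal(size=(500, 8))
ys = rng.normal(size=500)
ratios = [hinf_ratio(xs, ys, rng.normal(size=8)) for _ in range(20)]
```

Every entry of `ratios` should stay at or below 1, whatever $u$ and whatever sequence is fed in.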
Two related problems
A priori filtering: Control Theory
Try to match $u \cdot x_t$
$$\sum_{t=1}^{T}(u \cdot x_t - w_{t-1} \cdot x_t)^2$$
Prediction: On-line Learning
Try to match $y_t$
$$\sum_{t=1}^{T}(y_t - w_{t-1} \cdot x_t)^2$$
Comparison of known LMS-related bounds
• For $\eta = 1/X_2^2$ (the learning rate of the H∞ bound for LMS),
$$\sum_{t=1}^{T}(u \cdot x_t - w_{t-1} \cdot x_t)^2 \le \sum_{t=1}^{T}(u \cdot x_t - y_t)^2 + X_2^2\|u\|_2^2$$
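• For $\eta = \alpha/X_2^2$ ($0 < \alpha < 1$)
$$\sum_{t=1}^{T}(y_t - w_{t-1} \cdot x_t)^2 \le \frac{1}{1-\alpha}\underbrace{\sum_{t=1}^{T}(u \cdot x_t - y_t)^2}_{\mathrm{Loss}_u} + \frac{1}{\alpha}X_2^2\|u\|_2^2$$
With tuned $\alpha$, the right-hand side is at most
$$\mathrm{Loss}_u + 2\sqrt{\mathrm{Loss}_u}\,X_2\|u\|_2 + X_2^2\|u\|_2^2$$
[CBLW96]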
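The tuned form of the prediction bound follows from minimizing the right-hand side over $\alpha$. The short derivation below is a sketch: the shorthand $L$ and $B$ is introduced here for readability, and it assumes $L > 0$ so that the minimizer lies in $(0, 1)$.

```latex
\[
g(\alpha) = \frac{L}{1-\alpha} + \frac{B}{\alpha},
\qquad L = \mathrm{Loss}_u, \quad B = X_2^2\|u\|_2^2 .
\]
\[
g'(\alpha) = \frac{L}{(1-\alpha)^2} - \frac{B}{\alpha^2} = 0
\;\Longrightarrow\;
\alpha^\ast = \frac{\sqrt{B}}{\sqrt{L} + \sqrt{B}},
\qquad 1 - \alpha^\ast = \frac{\sqrt{L}}{\sqrt{L} + \sqrt{B}} .
\]
\[
g(\alpha^\ast) = \sqrt{L}\,(\sqrt{L} + \sqrt{B}) + \sqrt{B}\,(\sqrt{L} + \sqrt{B})
= L + 2\sqrt{LB} + B .
\]
```

Note that $\alpha^\ast$ depends on $\mathrm{Loss}_u$ and $\|u\|_2$, so this tuning presupposes some knowledge of both quantities.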
Generalizing the LMS bound
• Replace $\|x\|_2\|u\|_2$ by $\|x\|_p\|u\|_q$, where $1/p + 1/q = 1$ and $\|x\|_p = \left(\sum_i |x_i|^p\right)^{1/p}$
• Instead of comparing predictions to $u \cdot x_t$ for a fixed target $u$, compare to $u_t \cdot x_t$, where $u_t$ may change
• Replace $(\cdot)^2$ by a more general loss
Basic LMS
$$\Delta_t = -\eta(w_{t-1} \cdot x_t - y_t)x_t$$
[Figure: the update vector $\Delta_t$ is added to $w_{t-1}$ to give $w_t$.]
p-norm LMS
Write $\theta_t = f(w_t)$
[Figure: $f$ maps $w_{t-1}$ from "W-space" to $\theta_{t-1}$ in "Θ-space"; there $\Delta_t$ is added to $\theta_{t-1}$ to give $\theta_t$, and $f^{-1}$ maps $\theta_t$ back to $w_t$.]
p-norm LMS
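Based on [GLS01]:
$$w_t = f^{-1}\bigl(f(w_{t-1}) - \eta(w_{t-1} \cdot x_t - y_t)x_t\bigr)$$
where $f: \mathbb{R}^n \to \mathbb{R}^n$ is given by
$$f_i(w) = \frac{\mathrm{sign}(w_i)\,|w_i|^{q-1}}{\|w\|_q^{q-2}} \quad\text{and}\quad f_i^{-1}(\theta) = \frac{\mathrm{sign}(\theta_i)\,|\theta_i|^{p-1}}{\|\theta\|_p^{p-2}}$$
When $p = q = 2$, $f(w) = w$ and the update is LMS
For large $p$, $f^{-1}$ emphasizes differences in components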
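A minimal NumPy sketch of the p-norm update follows. The names `link` and `pnorm_lms_step` are made up for this example. The code relies on the fact that $f^{-1}$ has the same form as $f$ with $p$ in place of $q$, and it rewrites $|w_i|^{q-1}/\|w\|_q^{q-2}$ as $\|w\|_q\,(|w_i|/\|w\|_q)^{q-1}$, which is algebraically the same but avoids overflow for large exponents. Returning 0 for the zero vector is a convention assumed here.

```python
import numpy as np

def link(w, q):
    """f_i(w) = sign(w_i) |w_i|^(q-1) / ||w||_q^(q-2), with f(0) = 0."""
    norm = np.linalg.norm(w, ord=q)
    if norm == 0.0:
        return np.zeros_like(w)
    return norm * np.sign(w) * (np.abs(w) / norm) ** (q - 1)

def pnorm_lms_step(w, x, y, eta, p):
    """One p-norm LMS step: move in theta-space, then map back with f^{-1}."""
    q = p / (p - 1.0)                          # dual exponent, 1/p + 1/q = 1
    theta = link(w, q) - eta * (w @ x - y) * x
    return link(theta, p)                      # f^{-1} is the link with exponent p

# Hypothetical values for a quick sanity check.
w = np.array([0.5, -1.0, 2.0])
x = np.array([1.0, 0.0, -1.0])
y, eta, p = 0.3, 0.1, 4.0
roundtrip = link(link(w, p / (p - 1.0)), p)    # should reproduce w
step_p2 = pnorm_lms_step(w, x, y, eta, 2.0)    # p = 2 case
step_lms = w - eta * (w @ x - y) * x           # plain LMS step
```

With $p = 2$ the link is the identity, so `step_p2` and `step_lms` coincide.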
A priori filtering bound
Theorem. Assume $\|x_t\|_p \le X_p$ for all $t$, and let $\eta = 1/((p-1)X_p^2)$. Then for any $u$ the p-norm algorithm satisfies
$$\sum_{t=1}^{T}(u \cdot x_t - w_{t-1} \cdot x_t)^2 \le \sum_{t=1}^{T}(y_t - u \cdot x_t)^2 + (p-1)X_p^2\|u\|_q^2$$
• $1/p + 1/q = 1$ and $2 \le p < \infty$, $1 < q \le 2$
• How do we get the dual norm pair $(\infty, 1)$ (where $\|x\|_\infty = \max_i |x_i|$)?
For $p = 2\ln n$, $(p-1)\|x\|_p^2\|u\|_q^2 \le (2e\ln n)\|x\|_\infty^2\|u\|_1^2$
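The inequality for $p = 2\ln n$ follows from two standard norm comparisons. The derivation below is a sketch of that step and assumes $n \ge 3$, so that $p = 2\ln n \ge 2$.

```latex
\[
\|x\|_p \le n^{1/p}\|x\|_\infty ,
\qquad
\|u\|_q \le \|u\|_1 \quad (q \ge 1).
\]
\[
p = 2\ln n
\;\Longrightarrow\;
n^{2/p} = e^{(2\ln n)/p} = e ,
\qquad p - 1 < 2\ln n .
\]
\[
(p-1)\,\|x\|_p^2\,\|u\|_q^2
\;\le\; (2\ln n)\, n^{2/p}\, \|x\|_\infty^2\,\|u\|_1^2
\;=\; (2e\ln n)\,\|x\|_\infty^2\,\|u\|_1^2 .
\]
```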
Comparison with basic LMS
New bounds are incomparable with the old ones, because for $p > 2$ and $q < 2$
$$\|x\|_p \le \|x\|_2 \quad\text{and}\quad \|u\|_q \ge \|u\|_2$$
Compare $p = 2$ and $p = O(\log n)$ in two extreme cases:
Sparse target, dense instances: Let $u = (1, 0, \ldots, 0)$ and $x = (1, \ldots, 1)$.
• $\|x\|_2^2\|u\|_2^2 = n$
• $(\log n)\|x\|_\infty^2\|u\|_1^2 = \log n$
• Thus large $p$ better
Dense target, sparse instances: Let $u = (1, 1, \ldots, 1)$ and $x = (1, 0, \ldots, 0)$.
• $\|x\|_2^2\|u\|_2^2 = n$
• $(\log n)\|x\|_\infty^2\|u\|_1^2 = n^2\log n$
• Thus $p = 2$ better
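The two extreme cases can be evaluated directly. The sketch below computes $\|x\|_2^2\|u\|_2^2$ and $(\log n)\|x\|_\infty^2\|u\|_1^2$ for both; the helper name `bound_terms` and the dimension $n = 1024$ are arbitrary choices for illustration, and the natural logarithm is assumed.

```python
import numpy as np

def bound_terms(u, x):
    """Return (||x||_2^2 ||u||_2^2, log(n) ||x||_inf^2 ||u||_1^2)."""
    n = len(u)
    l2_term = np.sum(x ** 2) * np.sum(u ** 2)
    log_term = np.log(n) * np.max(np.abs(x)) ** 2 * np.sum(np.abs(u)) ** 2
    return l2_term, log_term

n = 1024
e1 = np.zeros(n)
e1[0] = 1.0                      # (1, 0, ..., 0)
ones = np.ones(n)                # (1, 1, ..., 1)

sparse_target = bound_terms(u=e1, x=ones)   # (n, log n): large p wins
dense_target = bound_terms(u=ones, x=e1)    # (n, n^2 log n): p = 2 wins
```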
The p-norm LMS can behave like EG
Hadamard matrix (rows are instances, columns are targets):
instances → +1 +1 +1 +1
          → +1 −1 +1 −1
          → +1 +1 −1 −1
          → +1 −1 −1 +1
             ↑  ↑  ↑  ↑
             targets
• Instances are orthogonal
• Target weight vectors are unit vectors
• LMS: error ≥ $1 - k/n$
• p-norm LMS with $p = O(\log n)$: error ≥ $(\ln n)/k$
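The LMS figure for the Hadamard example can be reproduced under one reading of the slide, which is an assumption here: $k$ is the number of instances seen so far, and "error" is the average squared error over all $n$ instances when the target is a unit vector and $\eta = 1/n$. The sketch builds the matrix with the Sylvester construction (which matches the 4 × 4 matrix shown) and covers only plain LMS; the function names are made up for the example.

```python
import numpy as np

def sylvester_hadamard(m):
    """Return the 2^m x 2^m Hadamard matrix from the Sylvester construction."""
    H = np.array([[1.0]])
    for _ in range(m):
        H = np.block([[H, H], [H, -H]])
    return H

def lms_error_after_k(H, target_index, k):
    """Average squared error over all rows after LMS has seen the first k rows."""
    n = H.shape[0]
    ys = H[:, target_index]          # labels produced by the unit target e_j
    w = np.zeros(n)
    eta = 1.0 / n                    # every row has squared 2-norm n
    for x, y in zip(H[:k], ys[:k]):
        w = w - eta * (w @ x - y) * x
    return np.mean((H @ w - ys) ** 2)

H = sylvester_hadamard(5)            # n = 32
n = H.shape[0]
errors = {k: lms_error_after_k(H, target_index=3, k=k) for k in (0, 8, 16, 32)}
```

Because the rows are orthogonal, each update fits the current instance without changing the predictions on the others, so the unseen instances keep squared error 1 and the average is $1 - k/n$.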
Time-varying target
(following [HW01])
Up to now, the model has been $y_t = u \cdot x_t + \text{noise}$, where the target $u$ is fixed.
Generalize this to $y_t = u_t \cdot x_t + \text{noise}$, where the target $u_t$ may vary over time.
Example 1: target makes one jump. Choose $a, b \in \mathbb{R}^n$ and take
$$u_t = \begin{cases} a & \text{for } 1 \le t \le T/2 \\ b & \text{for } T/2 < t \le T \end{cases}$$
Example 2: target moves steadily. Choose $a, b \in \mathbb{R}^n$ and take
$$u_t = \frac{T-t}{T-1}\,a + \frac{t-1}{T-1}\,b$$
Algorithms for time-varying target
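Old update:
$$w'_t = f^{-1}\bigl(f(w_{t-1}) - \eta(w_{t-1} \cdot x_t - y_t)x_t\bigr)$$
Bounding update:
$$w_t = \begin{cases} w'_t & \text{if } \|w'_t\|_q \le U_q \\ U_q\,\dfrac{w'_t}{\|w'_t\|_q} & \text{otherwise} \end{cases}$$
where $U_q > 0$ is a norm bound
We rescale the weight vector whenever its q-norm is larger than $U_q$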
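The bounding step is a simple rescaling onto the q-norm ball of radius $U_q$. A minimal sketch, with the hypothetical function name `rescale_to_ball`:

```python
import numpy as np

def rescale_to_ball(w_prime, U_q, q):
    """Return w'_t unchanged if ||w'_t||_q <= U_q, else scale it to q-norm U_q."""
    norm = np.linalg.norm(w_prime, ord=q)
    if norm <= U_q:
        return w_prime
    return (U_q / norm) * w_prime

# Hypothetical example with q = 2, where the norms are easy to read off.
inside = rescale_to_ball(np.array([0.3, 0.4]), U_q=1.0, q=2)    # norm 0.5, kept
outside = rescale_to_ball(np.array([3.0, 4.0]), U_q=1.0, q=2)   # norm 5, rescaled
```

In the bounded algorithm this function would be applied to the result of the old update at every step.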
Bound for time-varying target
Theorem. Assume $\|x_t\|_p \le X_p$ for all $t$, and let $\eta = 1/((p-1)X_p^2)$. Then if $\|u_t\|_q \le U_q$ for all $t$, the bounded p-norm LMS satisfies
$$\sum_{t=1}^{T}(u_t \cdot x_t - w_{t-1} \cdot x_t)^2 \le \sum_{t=1}^{T}(y_t - u_t \cdot x_t)^2 + (p-1)X_p^2U_q^2 + 2(p-1)X_p^2U_q\sum_{t=1}^{T-1}\|u_{t+1} - u_t\|_q$$
• Only the total distance $\sum_t \|u_{t+1} - u_t\|_q$ traveled by the target matters
• Cost $2(p-1)X_p^2U_q$ per unit of target movement
• For a fixed target, $u_{t+1} = u_t$, we recover the previous bound
• However, $U_q$ needs to be known in advance
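To see that only the total distance matters, the sketch below computes $\sum_t \|u_{t+1} - u_t\|_q$ for the two example targets (one jump from $a$ to $b$, and a steady move from $a$ to $b$). The vectors, the horizon $T$, and the exponent $q$ are arbitrary values chosen for illustration.

```python
import numpy as np

def path_length(us, q):
    """Sum over t of ||u_{t+1} - u_t||_q, for targets stacked as rows of us."""
    return sum(np.linalg.norm(d, ord=q) for d in np.diff(us, axis=0))

T, q = 100, 1.5
a = np.array([1.0, 0.0, 0.0])
b = np.array([0.0, 2.0, -1.0])
ts = np.arange(1, T + 1)

# Example 1: u_t = a for t <= T/2 and u_t = b afterwards.
jump = np.where((ts <= T / 2)[:, None], a, b)

# Example 2: u_t = (T - t)/(T - 1) a + (t - 1)/(T - 1) b.
lam = ((ts - 1) / (T - 1))[:, None]
drift = (1 - lam) * a + lam * b

direct = np.linalg.norm(b - a, ord=q)
```

Both `path_length(jump, q)` and `path_length(drift, q)` equal `direct`, so the two targets incur the same movement term in the bound.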
Bregman divergences
• Key tool in analyzing and understanding the algorithms
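• Fix a strictly convex differentiable $F: \mathbb{R}^n \to \mathbb{R}$. Denote the gradient by $f = \nabla F$.
• Now the Bregman divergence $d_F: \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}$ is
$$d_F(u, w) = F(u) - F(w) - f(w) \cdot (u - w)$$
[Figure: graph of $F$ with the points $w$ and $u$ marked and the gap $d_F(u, w)$ indicated.]
$d_F(u, w)$ is the error of the first-order Taylor approximation of $F(u)$ around $w$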
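The definition translates directly into code. The sketch below takes $F$ and its gradient as Python callables (a choice made for this example) and evaluates the divergence for two familiar cases: $F(w) = \frac{1}{2}\|w\|_2^2$, which gives half the squared Euclidean distance, and the negative entropy $F(w) = \sum_i (w_i \ln w_i - w_i)$, included here as an assumed illustration of asymmetry.

```python
import numpy as np

def bregman(F, f, u, w):
    """d_F(u, w) = F(u) - F(w) - f(w) . (u - w), where f is the gradient of F."""
    return F(u) - F(w) - f(w) @ (u - w)

# Squared Euclidean norm: F(w) = 0.5 ||w||_2^2, gradient f(w) = w.
F_sq = lambda w: 0.5 * (w @ w)
f_sq = lambda w: w
u = np.array([1.0, 2.0, -1.0])
w = np.array([0.0, 1.0, 1.0])
d_sq = bregman(F_sq, f_sq, u, w)          # 0.5 * ||u - w||^2 = 3.0

# Negative entropy on positive vectors: gradient is ln(w) componentwise.
F_ent = lambda w: np.sum(w * np.log(w) - w)
f_ent = lambda w: np.log(w)
a = np.array([0.2, 0.8])
b = np.array([0.5, 0.5])
d_ab = bregman(F_ent, f_ent, a, b)
d_ba = bregman(F_ent, f_ent, b, a)        # differs from d_ab in general
```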
Basic properties of Bregman divergences
• $d_F(u, w) \ge 0$, and $d_F(u, w) = 0$ iff $u = w$
• not symmetrical (in general)
• does not satisfy triangle inequality
• $d_F(u, w)$ convex in $u$, not necessarily in $w$
Connection to exponential families (roughly):
• $F$ is cumulant function, $f$ is link function
• $w$ is expectation parameter, $f(w)$ canonical parameter
• $d_F(u, w)$ is the KL divergence between distributions parameterized by $u$ and $w$
Example: p-norm divergence
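[GLS01]
• $F(w) = \frac{1}{2}\|w\|_q^2$
• Then the gradient $f = \nabla F$ satisfies
$$f_i(w) = \frac{\mathrm{sign}(w_i)\,|w_i|^{q-1}}{\|w\|_q^{q-2}} \quad\text{and}\quad f_i^{-1}(\theta) = \frac{\mathrm{sign}(\theta_i)\,|\theta_i|^{p-1}}{\|\theta\|_p^{p-2}}$$
• The divergence is
$$d_F(u, w) = \frac{1}{2}\|u\|_q^2 - \frac{1}{2}\|w\|_q^2 - f(w) \cdot (u - w).$$
• Special case $p = q = 2$ gives $d_F(u, w) = \frac{1}{2}\|u - w\|_2^2$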
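The p-norm divergence can be evaluated numerically as a sanity check. The sketch below is an illustration under assumed names (`pnorm_divergence`) and random test vectors; it checks the special case $q = 2$ and evaluates the divergence for one value $1 < q < 2$.

```python
import numpy as np

def pnorm_divergence(u, w, q):
    """d_F(u, w) for F(w) = 0.5 ||w||_q^2, using the gradient formula for f."""
    norm_w = np.linalg.norm(w, ord=q)
    if norm_w == 0.0:
        grad = np.zeros_like(w)                       # f(0) = 0 by convention
    else:
        grad = norm_w * np.sign(w) * (np.abs(w) / norm_w) ** (q - 1)
    return 0.5 * np.linalg.norm(u, ord=q) ** 2 - 0.5 * norm_w ** 2 - grad @ (u - w)

rng = np.random.default_rng(2)        # hypothetical test vectors
u = rng.normal(size=6)
w = rng.normal(size=6)

d_q2 = pnorm_divergence(u, w, 2.0)    # should equal 0.5 ||u - w||_2^2
d_q13 = pnorm_divergence(u, w, 1.3)   # a genuinely non-Euclidean case
d_same = pnorm_divergence(w, w, 1.3)  # zero when both arguments coincide
```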
Deriving the updates
• Define a regularized instantaneous loss
$$C_t(w) = d_F(w, w_{t-1}) + \frac{\eta}{2}(y_t - w \cdot x_t)^2$$
• Basic aim is to have
$$w_t = \arg\min_w C_t(w)$$
• Minimize by setting $\nabla C_t(w_t) = 0$, obtaining the implicit update $f(w_t) = f(w_{t-1}) - \eta(w_t \cdot x_t - y_t)x_t$
• Approximate $w_t \cdot x_t \approx w_{t-1} \cdot x_t$ to obtain the update $f(w_t) = f(w_{t-1}) - \eta(w_{t-1} \cdot x_t - y_t)x_t$
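The gradient computation behind the implicit update is short enough to spell out. It uses only the definition of $d_F$ and $f = \nabla F$:

```latex
\[
\nabla_w\, d_F(w, w_{t-1})
= \nabla_w \bigl[ F(w) - F(w_{t-1}) - f(w_{t-1}) \cdot (w - w_{t-1}) \bigr]
= f(w) - f(w_{t-1}),
\]
\[
\nabla_w\, \frac{\eta}{2}(y_t - w \cdot x_t)^2 = \eta\,(w \cdot x_t - y_t)\,x_t ,
\]
\[
\nabla C_t(w_t) = 0
\;\Longleftrightarrow\;
f(w_t) = f(w_{t-1}) - \eta\,(w_t \cdot x_t - y_t)\,x_t .
\]
```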
Analyzing the update
• Measure of progress
$$d_F(u, w_{t-1}) - d_F(u, w_t) = \eta(y_t - w_{t-1} \cdot x_t)\,x_t \cdot (u - w_{t-1}) - d_F(w_{t-1}, w_t)$$
• Massage the term $(y_t - w_{t-1} \cdot x_t)\,x_t \cdot (u - w_{t-1})$ until $(u \cdot x_t - w_{t-1} \cdot x_t)^2$ and $(y_t - u \cdot x_t)^2$ appear; throw rest away
• Estimate $d_F(w_{t-1}, w_t)$ in terms of $\|x_t\|_p$
• Sum over $t = 1, \ldots, T$
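One way to carry out these steps for a fixed target is sketched below. The shorthand $a_t$ and $\nu_t$ is introduced here, and the estimate $d_F(w_{t-1}, w_t) \le \frac{p-1}{2}\,\eta^2 (y_t - w_{t-1} \cdot x_t)^2 \|x_t\|_p^2$ is assumed rather than proved; it is the form commonly used for the p-norm divergence.

```latex
\[
a_t = u \cdot x_t - w_{t-1} \cdot x_t, \qquad \nu_t = y_t - u \cdot x_t,
\qquad\text{so}\qquad
y_t - w_{t-1} \cdot x_t = a_t + \nu_t, \quad x_t \cdot (u - w_{t-1}) = a_t .
\]
\[
\eta = \frac{1}{(p-1)X_p^2}
\;\Longrightarrow\;
d_F(w_{t-1}, w_t) \le \frac{\eta}{2}\,(a_t + \nu_t)^2 ,
\]
\[
d_F(u, w_{t-1}) - d_F(u, w_t)
\;\ge\; \eta\,(a_t + \nu_t)\,a_t - \frac{\eta}{2}\,(a_t + \nu_t)^2
\;=\; \frac{\eta}{2}\,\bigl(a_t^2 - \nu_t^2\bigr).
\]
\[
\text{Summing, with } d_F(u, w_T) \ge 0 \text{ and } d_F(u, w_0) = \tfrac{1}{2}\|u\|_q^2 \text{ for } w_0 = 0:
\qquad
\sum_{t=1}^{T} a_t^2 \;\le\; \sum_{t=1}^{T} \nu_t^2 + (p-1)X_p^2\|u\|_q^2 .
\]
```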
Conclusion
• LMS and normalized LMS can be derived from an optimization problem involving a certain Bregman divergence
• Different Bregman divergences lead to different algorithms, with loss bounds in terms of different norms
• Bounds can be generalized for time-varying targets (and generalized linear models, not presented in the talk); proofs easy
• Algorithms for p = 2 can be kernelized, for p > 2 probably not
Bottom line: Machinery from on-line machine learning carries over to H∞-optimal filtering
Where are we headed?
• Develop p-norm Kalman filter
• Prove relative loss bounds