Lecturer: Shivani Agarwal Scribe: Aadirupa
1 Introduction
In this lecture we shall look at a fairly general setting of online convex optimization which, as we shall see, encompasses some of the online learning problems we have seen so far as special cases. A standard (offline) convex optimization problem involves a convex set Ω ⊆ Rn and a fixed convex cost function c : Ω 7→ R;
the goal is to find a point x∗ ∈ Ω that minimizes c(x) over Ω. An online convex optimization problem also involves a convex set Ω ⊆ Rn, but proceeds in trials: at each trial t, the player/algorithm must choose a point xt∈ Ω, after which a convex cost function ct: Ω→R is revealed by the environment/adversary, and a loss of ct(xt) is incurred for the trial; the goal is to minimize the total loss incurred over a sequence of trials, relative to the best single point in Ω in hindsight.
Online Convex Optimization Convex set Ω ⊆ Rn
For t = 1, . . . , T : – Play xt∈ Ω
– Receive cost function ct: Ω→R – Incur loss ct(xt)
We will denote the total loss incurred by an algorithm A on T trials as
LT[A] =
T
X
t=1
ct(xt) ,
and the total loss of a fixed point x ∈ Ω as
LT[x] =
T
X
t=1
ct(x) ;
the regret of A w.r.t. the best fixed point in Ω is then simply LT[A] − inf
x∈ΩLT[x] .
Let us consider which of the online learning problems we have seen earlier can be viewed as instances of online convex optimization (OCO):
1. Online linear regression (GD analysis). Here one selects a weight vector wt∈ Rn, and incurs a loss given by ct(wt) = `yt(wt·xt), which is assumed to be convex in its argument, and is therefore convex in wt; however the total loss is compared against the best fixed weight vector in onlyu ∈ Rn: kuk2≤ U for some U > 0. Call this latter set Ω. Clearly, if the weight vectors wt are also selected to be in Ω, e.g. by Euclidean projection onto Ω (which in this case amounts to a simple normalization), then this becomes an instance of OCO.
2. Online linear regression (EG analysis). Here one selects wt ∈ ∆n, and incurs a loss given by ct(wt) = `yt(wt· xt), which is assumed to be convex in its argument, and is therefore convex in wt; the total loss is also compared against the best fixed weight vector in ∆n. Thus this can be viewed as an instance of OCO with Ω = ∆n.
1
3. Online prediction from expert advice. Here one selects a probability vector ˆpt∈ ∆n, and incurs a loss ct(ˆpt) = `yt(ˆpt· ξt), which is assumed to be convex in its argument, and is therefore convex in pˆt. However the total loss is compared only against the best vector inei ∈ ∆n : i ∈ [n] , where ei
denotes a unit vector with 1 in the i-th position and 0 elsewhere. In general, the smallest loss in this set is not the same as the smallest loss in ∆n, and therefore this is not an instance of OCO.
4. Online decision/allocation among experts. Here one selects a probability vector ˆpt∈ ∆n, and incurs a linear loss ct(ˆpt) = ˆpt· `t, which is clearly convex in ˆpt. The total loss is again compared against the best vector inei∈ ∆n: i ∈ [n] as above, but in this case, comparison against this set is actually equivalent to comparison against ∆n, since
inf
ˆ
p∈∆nLT[ˆp] = inf
ˆ p∈∆n
T
X
t=1
ˆ
p · `t= inf
ˆ
p∈∆np ·ˆ XT
t=1
`t
= min
i∈[n]ei·XT
t=1
`t
= min
i∈[n]
T
X
t=1
ei· `t.
Thus online allocation with a linear loss as above can be viewed as an instance of OCO with Ω = ∆n.
2 Online Gradient Descent Algorithm
The online gradient descent (OGD) algorithm [7] generalizes the basic gradient descent algorithm used for standard (offline) optimization problems, and can be described as follows:
Algorithm Online Gradient Descent (OGD) Parameter: η > 0
Initialize: x1∈ Ω For t = 1, . . . , T :
– Play xt∈ Ω
– Receive cost function ct: Ω 7→ R – Incur loss ct(xt)
– Update:
ext+1← xt− η∇ct(xt) xt+1← PΩ(xet+1) ,
where PΩ(ex) = argmin
x∈Ω
kx −xke 2 (Euclidean projection of ex onto Ω)
Theorem 2.1 (Zinkevich, 2003). Let Ω ⊆ Rn be a closed, non-empty, convex set with kx − x1k2≤ D ∀x ∈ Ω. Let ct : Ω→R (t = 1, 2, . . .) be any sequence of convex cost functions with bounded subgradients, k∇ct(x)k2≤ G ∀t, ∀x ∈ Ω. Then
LTOGD(η) − inf
x∈ΩLT[x] ≤ 1 2
D2
η + η G2T
.
Moreover, setting η∗= D
G√
T to minimize the right-hand side of the above bound (assuming D, G and T , or upper bounds on them, are known in advance), we have
LTOGD(η∗) − inf
x∈ΩLT[x] ≤ DG√ T .
Notes. Before proving the above theorem, we note the following:
1. The analysis of the GD algorithm for online regression can be viewed as a special case of the above, with D ≤ U and G ≤ ZR2 (see Lecture 17 notes).
2. With η∗ as above (which depends on knowledge of D, G and T ), one gets an O(√
T ) regret bound.
Such an O(√
T ) regret bound (with worse constants) can also be obtained by using a time-varying parameter ηt= 1/√
t ∀t (in fact this was the version contained in the original paper of Zinkevich [7]).
Proof of Theorem 2.1. The proof is similar to that of the regret bound we obtained for the GD algorithm for online regression in Lecture 17. Let x ∈ Ω. We have ∀t ∈ [T ]:
ct(xt) − ct(x)
≤ ∇ct(xt) · (xt− x) , by convexity of ct
= xt−xet+1 η
· (xt− x)
= 1
2η
kx − xtk22− kx −xet+1k22+ kxt−xet+1k22
≤ 1
2η
kx − xtk22− kx − xt+1k22+ kxt−xet+1k22
= 1
2η
kx − xtk22− kx − xt+1k22+ η2k∇ct(xt)k22 Summing over t = 1 . . . T , we thus get
T
X
t=1
ct(xt) − ct(x)
≤ 1
2η
kx − x1k22− kx − xT +1k22+ η2G2T
≤ 1
2η D2+ η2G2T . It is easy to see that the right-hand side is minimized at η∗ = D
G√
T; substituting this back in the above inequality gives the desired result.
2.1 Logarithmic regret bound for OGD when cost functions are strongly convex (using time-varying parameter η
t)
The above result gives an O(√
T ) regret bound for the OGD algorithm in general. When the cost functions ct are known to be α-strongly convex, one can in fact obtain an O(ln T ) regret bound with a slight variant of the OGD algorithm that uses a suitable time-varying parameter ηt[3]:
Theorem 2.2 (Hazan et al, 2007). Let Ω ⊆ Rn be a closed, non-empty, convex set with kx − x1k2 ≤ D ∀x ∈ Ω. Let ct : Ω→R (t = 1, 2, . . .) be any sequence of α-strongly convex cost functions with bounded subgradients, k∇ct(xt)k2≤ G ∀t, ∀x ∈ Ω. Then the OGD algorithm with time-varying parameter ηt= αt1 satisfies
LT
OGD
ηt= 1
αt
− inf
x∈ΩLT[x] ≤ G2
2α ln T + 1 . Proof. Let x ∈ Ω. Proceeding similarly as before, we have ∀t ∈ [T ]:
ct(xt) − ct(x)
≤ ∇ct(xt) · (xt− x) −α
2kxt− xk22, by α-strong convexity of ct
= xt−xet+1 ηt
· (xt− x) −α
2kxt− xk22
= 1
2ηt
kx − xtk22− kx −ext+1k22+ kxt−ext+1k22− αηtkxt− xk22
≤ 1
2ηt
kx − xtk22− kx − xt+1k22+ kxt−ext+1k22− αηtkxt− xk22
= 1
2ηt
kx − xtk22− kx − xt+1k22+ ηt2k∇ct(xt)k22− αηtkxt− xk22 . Summing over t = 1 . . . T then gives
T
X
t=1
ct(xt) − ct(x)
≤ G2 2
T
X
t=1
ηt+1 2
1 η1
− α
kx − x1k22+1 2
T
X
t=2
1 ηt
− 1 ηt−1
− α
kx − xtk22
≤ G2
2α 1 + ln T + 0 + 0 .
3 Online Mirror Descent Algorithm
The online mirror descent (OMD) algorithm, described for example in [6], generalizes the basic mirror descent algorithm used for standard (offline) optimization problems [4, 5, 1] much as OGD generalizes basic gradient descent. The OMD framework allows one to obtain O(√
T ) regret bounds for cost functions with bounded subgradients not only in Euclidean norm (assumed by the OGD analysis), but in any suitable norm.
Let eΩ ∈ Rnbe a convex set containing Ω (i.e. Ω ⊆ eΩ), and let F : eΩ→R be a strictly convex and differentiable function. Let BF : eΩ × eΩ→R denote the Bregman divergence associated with F ; recall that for all u, v ∈ eΩ, this is defined as
BF(u, v) = F (u) − F (v) − (u − v) · ∇F (v) .
Then the OMD algorithm using the set eΩ and function F can be described as follows:
Algorithm Online Mirror Descent (OMD)
Parameters: η > 0; convex set eΩ ⊇ Ω; strictly convex, differentiable function F : eΩ→R Initialize: x1= argmin
x∈Ω
F (x) For t = 1, . . . , T :
– Play xt∈ Ω
– Receive cost function ct: Ω→R – Incur loss ct(xt)
– Update:
∇F (xet+1) ← ∇F (xt) − η∇ct(xt) (Assume this yields ext+1∈ eΩ) xt+1← PΩF(ext+1) ,
where PΩF(ex) = argmin
x∈Ω
BF(x,ex) (Bregman projection ofex onto Ω)
Below we will obtain a regret bound for OMD (with appropriate choice of the function F ) for cost functions with subgradients bounded in an arbitrary norm. Before we do so, let us recall the notion of a dual norm.
Specifically, let k· k be any norm defined on a closed convex set eΩ ⊆ Rn. Then the corresponding dual norm is defined as
kuk∗ = sup
x∈eΩ:kxk≤1
x · u = 2 sup
x∈eΩ
x · u −12kxk2 .
Theorem 3.1. Let Ω ⊆ Rnbe a closed, non-empty, convex set. Let ct: Ω→R (t = 1, 2, . . .) be any sequence of convex cost functions with bounded subgradients k∇ct(xt)k ≤ G ∀t, ∀x ∈ Ω (where k· k is an arbitrary norm). Let eΩ ⊇ Ω be a convex set and F : eΩ→R be a strictly convex function such that
(a) F (x) − F (x1) ≤ D2∀x ∈ Ω; and
(b) the restriction of F to Ω is α-strongly convex w.r.t. k· k∗, the dual norm of k· k.
Then
LTOMD(η) − inf
x∈ΩLT[x] ≤ 1 η
D2+η2G2T 2α
.
Moreover, setting η∗= DG q2α
T to minimize the right-hand side of the above bound (assuming D, G, T and α, or upper bounds on them, are known in advance), we have
LTOMD(η∗) − inf
x∈ΩLT[x] ≤ DG r2T
α .
Proof. Let x ∈ Ω. Proceeding as before, we have ∀t ∈ [T ]:
ct(xt) − ct(x)
≤ ∇ct(xt) · (xt− x) , by convexity of ct
= ∇F (xt) − ∇F (ext+1) η
· (xt− x)
= 1
η
BF(x, xt) − BF(x,xet+1) + BF(xt,ext+1)
≤ 1
η
BF(x, xt) − BF(x, xt+1) − BF(xt+1,xet+1) + BF(xt,ext+1) . Summing over t = 1 . . . T , we thus get
T
X
t=1
ct(xt) − ct(x)
≤ 1
η
BF(x, x1) − BF(x, xT +1) +1
η
T
X
t=1
BF(xt,xet+1) − BF xt+1,ext+1 .
Now, it can be shown that
BF(x, x1) ≤ F (x) − F (x1) ≤ D2. Also, we have
T
X
t=1
BF(xt,ext+1) − BF(xt+1,xet+1)
=
T
X
t=1
F (xt) − F (xt+1) − ∇F (ext+1) · (xt− xt+1)
≤
T
X
t=1
∇F (xt) · (xt− xt+1) −α
2kxt− xt+1k2∗
− ∇F (ext+1).(xt− xt+1)
, by α-strong convexity of F on Ω
=
T
X
t=1
− η∇ct(xt) · (xt− xt+1) −α
2kxt− xt+1k2∗
≤
T
X
t=1
ηGkxt− xt+1k∗−α
2kxt− xt+1k2∗
, by H¨olders inequality
≤
T
X
t=1
η2G2 2α ,
since η22αG2 +α2kxt− xt+1k2∗− ηGkxt− xt+1k∗=ηG
√2α−pα
2kxt− xt+1k∗
2
≥ 0
= η2G2T 2α . Thus we get
T
X
t=1
ct(xt) − ct(x)
≤ 1
η
D2+η2G2T 2α
.
It is easy to verify that the right-hand side is minimized at η∗=DG q2α
T ; substituting this back in the above inequality gives the desired result.
Examples. Examples of OMD algorithms include the following:
1. OGD. Here eΩ = Rn and F (x) = 12kxk22. Note that F is 1-strongly convex on Rn (and therefore on any Ω ⊆ Rn).
2. Exponential weights. Here Ω = a∆n for some a > 0; eΩ = Rn++; and F (x) =Pn
i=1xiln x −Pn i=1xi. Note that F is strictly convex on eΩ and 1-strongly convex on Ω = ∆n.
Exercise. Suppose the cost functions ct are all β-strongly convex for some β > 0 (and have subgradients bounded in some arbitrary norm k· k as above). Can you obtain a logarithmic (O(ln(T ))) regret bound for the OMD algorithm in this case, as we did for the OGD algorithm? Why or why not?
More details on online convex optimization, including missing details in some of the above proofs, can be found for example in [2].
4 Next Lecture
In the next lecture, we will consider how algorithms for online supervised learning problems (such as online regression and classification) can be used in a batch setting, and will see how regret bounds for online learning algorithms can be converted to generalization error bounds for the resulting batch algorithms.
References
[1] Amir Beck and Marc Teboulle. Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 31(3):167–175, 2003.
[2] S´ebastion Bubeck. Introduction to online optimization. Lecture Notes, Princeton University, 2011.
[3] Elad Hazan, Amit Agarwal, and Satyen Kale. Logarithmic regret algorithms for online convex optimiza- tion. Machine Learning, 69:169–192, 2007.
[4] A. Nemirovski. Efficient methods for large-scale convex optimization problems. Ekonomika i Matem- aticheskie Metody, 15, 1979.
[5] A. Nemirovski and D. Yudin. Problem Complexity and Method Efficiency in Optimization. Wiley Inter- science, 1983.
[6] Shai Shalev-Shwartz. Online Learning: Theory, Algorithms, and Applications. PhD thesis, The Hebrew University of Jerusalem, 2007.
[7] Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceed- ings of the 20th International Conference on Machine Learning (ICML), 2003.