Online Convex Optimization

Lecturer: Shivani Agarwal Scribe: Aadirupa

1 Introduction

In this lecture we shall look at a fairly general setting of online convex optimization which, as we shall see, encompasses some of the online learning problems we have seen so far as special cases. A standard (offline) convex optimization problem involves a convex set Ω ⊆ Rⁿ and a fixed convex cost function c : Ω → R; the goal is to find a point x^∗ ∈ Ω that minimizes c(x) over Ω. An online convex optimization problem also involves a convex set Ω ⊆ Rⁿ, but proceeds in trials: at each trial t, the player/algorithm must choose a point x^t ∈ Ω, after which a convex cost function c_t : Ω → R is revealed by the environment/adversary, and a loss of c_t(x^t) is incurred for the trial; the goal is to minimize the total loss incurred over a sequence of trials, relative to the best single point in Ω in hindsight.

Online Convex Optimization

Convex set Ω ⊆ Rⁿ

For t = 1, . . . , T :

– Play x^t ∈ Ω

– Receive cost function c_t : Ω → R

– Incur loss c_t(x^t)

We will denote the total loss incurred by an algorithm A on T trials as

$$L_T[A] = \sum_{t=1}^{T} c_t(x^t),$$

and the total loss of a fixed point x ∈ Ω as

$$L_T[x] = \sum_{t=1}^{T} c_t(x);$$

the regret of A w.r.t. the best fixed point in Ω is then simply

$$L_T[A] - \inf_{x \in \Omega} L_T[x].$$
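The protocol and the regret definition can be made concrete with a small sketch. Everything specific here is an assumption chosen for illustration: Ω = [−1, 1], costs c_t(x) = (x − z_t)², a naive player that repeats the previous target, and the helper names `total_loss`, `fixed_point_loss` and `regret`. The infimum over Ω is only approximated by a finite grid of candidate points.

```python
def total_loss(costs, plays):
    """L_T[A]: sum of c_t(x^t) over the trials."""
    return sum(c(x) for c, x in zip(costs, plays))

def fixed_point_loss(costs, x):
    """L_T[x]: loss of playing the same point x in every trial."""
    return sum(c(x) for c in costs)

def regret(costs, plays, candidates):
    """L_T[A] minus the smallest L_T[x] over a finite list of candidates.
    The infimum over Omega is only approximated by the candidate list."""
    best = min(fixed_point_loss(costs, x) for x in candidates)
    return total_loss(costs, plays) - best

# Illustrative data (assumed): Omega = [-1, 1], c_t(x) = (x - z_t)^2.
targets = [1.0, -1.0, 1.0, -1.0]
costs = [lambda x, z=z: (x - z) ** 2 for z in targets]
plays = [0.0, 1.0, -1.0, 1.0]                 # "repeat the previous target"
grid = [i / 100.0 for i in range(-100, 101)]  # candidate fixed points in Omega

# The player pays 1 + 4 + 4 + 4 = 13; the best fixed point x = 0 pays 4.
print(regret(costs, plays, grid))
```

In this toy run the algorithm's plays chase an alternating sequence and do worse than the single point x = 0, which is exactly the kind of gap that regret measures.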

Let us consider which of the online learning problems we have seen earlier can be viewed as instances of online convex optimization (OCO):

1. Online linear regression (GD analysis). Here one selects a weight vector w^t ∈ Rⁿ, and incurs a loss given by c_t(w^t) = ℓ_{y^t}(w^t · x^t), which is assumed to be convex in its argument, and is therefore convex in w^t; however the total loss is compared against the best fixed weight vector only in {u ∈ Rⁿ : ‖u‖₂ ≤ U} for some U > 0. Call this latter set Ω. Clearly, if the weight vectors w^t are also selected to be in Ω, e.g. by Euclidean projection onto Ω (which in this case amounts to a simple normalization), then this becomes an instance of OCO.

2. Online linear regression (EG analysis). Here one selects w^t ∈ ∆_n, and incurs a loss given by c_t(w^t) = ℓ_{y^t}(w^t · x^t), which is assumed to be convex in its argument, and is therefore convex in w^t; the total loss is also compared against the best fixed weight vector in ∆_n. Thus this can be viewed as an instance of OCO with Ω = ∆_n.

3. Online prediction from expert advice. Here one selects a probability vector p̂^t ∈ ∆_n, and incurs a loss c_t(p̂^t) = ℓ_{y^t}(p̂^t · ξ^t), which is assumed to be convex in its argument, and is therefore convex in p̂^t. However the total loss is compared only against the best vector in {e_i ∈ ∆_n : i ∈ [n]}, where e_i denotes a unit vector with 1 in the i-th position and 0 elsewhere. In general, the smallest loss in this set is not the same as the smallest loss in ∆_n, and therefore this is not an instance of OCO.

4. Online decision/allocation among experts. Here one selects a probability vector p̂^t ∈ ∆_n, and incurs a linear loss c_t(p̂^t) = p̂^t · ℓ^t, which is clearly convex in p̂^t. The total loss is again compared against the best vector in {e_i ∈ ∆_n : i ∈ [n]} as above, but in this case, comparison against this set is actually equivalent to comparison against ∆_n, since

$$\inf_{\hat{p} \in \Delta_n} L_T[\hat{p}]
= \inf_{\hat{p} \in \Delta_n} \sum_{t=1}^{T} \hat{p} \cdot \ell^t
= \inf_{\hat{p} \in \Delta_n} \hat{p} \cdot \Big( \sum_{t=1}^{T} \ell^t \Big)
= \min_{i \in [n]} e_i \cdot \Big( \sum_{t=1}^{T} \ell^t \Big)
= \min_{i \in [n]} \sum_{t=1}^{T} e_i \cdot \ell^t .$$

Thus online allocation with a linear loss as above can be viewed as an instance of OCO with Ω = ∆_n.
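The key step above is that a linear function on the simplex attains its minimum at a vertex e_i. The following sketch checks this numerically on random data; the loss vectors, the sampling scheme and the helper names are assumptions made only for illustration, and random sampling can of course only support the claim, not prove it.

```python
import random

def linear_loss(p, cum):
    """p . (sum_t l^t) for a probability vector p and cumulative losses cum."""
    return sum(pi * ci for pi, ci in zip(p, cum))

def random_simplex_point(n, rng):
    """Sample a point of the simplex by normalizing positive random weights."""
    w = [rng.random() + 1e-12 for _ in range(n)]
    s = sum(w)
    return [wi / s for wi in w]

rng = random.Random(0)
n, T = 5, 20
losses = [[rng.random() for _ in range(n)] for _ in range(T)]  # l^1, ..., l^T
cum = [sum(l[i] for l in losses) for i in range(n)]            # sum_t l^t

best_vertex = min(cum)   # min_i e_i . sum_t l^t
sampled = min(linear_loss(random_simplex_point(n, rng), cum)
              for _ in range(1000))

# No sampled point of the simplex should beat the best vertex.
print(best_vertex <= sampled + 1e-9)
```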

2 Online Gradient Descent Algorithm

The online gradient descent (OGD) algorithm [7] generalizes the basic gradient descent algorithm used for standard (offline) optimization problems, and can be described as follows:

Algorithm Online Gradient Descent (OGD)

Parameter: η > 0

Initialize: x¹ ∈ Ω

For t = 1, . . . , T :

– Play x^t ∈ Ω

– Receive cost function c_t : Ω → R

– Incur loss c_t(x^t)

– Update:

x̃^{t+1} ← x^t − η∇c_t(x^t)

x^{t+1} ← P_Ω(x̃^{t+1}) ,

where P_Ω(x̃) = argmin_{x∈Ω} ‖x − x̃‖₂ (Euclidean projection of x̃ onto Ω)
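The OGD update can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: Ω is taken to be a Euclidean ball of a given radius centred at the origin (so that the projection is a rescaling), each cost is supplied through a hypothetical callable returning a (sub)gradient, and the names `project_ball` and `ogd` are invented here.

```python
import math

def project_ball(x, radius):
    """Euclidean projection onto {x : ||x||_2 <= radius} (rescale if outside)."""
    norm = math.sqrt(sum(xi * xi for xi in x))
    if norm <= radius:
        return list(x)
    return [xi * radius / norm for xi in x]

def ogd(grads, x1, eta, radius):
    """Run OGD; grads[t](x) returns a (sub)gradient of c_t at x.
    Returns the list of played points x^1, ..., x^T."""
    x = list(x1)
    plays = []
    for grad in grads:
        plays.append(x)                                    # play x^t
        g = grad(x)                                        # cost is revealed
        x_tilde = [xi - eta * gi for xi, gi in zip(x, g)]  # gradient step
        x = project_ball(x_tilde, radius)                  # projection step
    return plays

# Illustrative run (assumed data): the same linear cost g . x in every trial.
plays = ogd([lambda x: [1.0, 0.0]] * 4, x1=[0.0, 0.0], eta=0.5, radius=1.0)
print(plays)  # the iterates move to the boundary point (-1, 0) and stay there
```

For other choices of Ω only `project_ball` would need to change; the update itself is independent of the shape of the set.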

Theorem 2.1 (Zinkevich, 2003). Let Ω ⊆ Rⁿ be a closed, non-empty, convex set with ‖x − x¹‖₂ ≤ D ∀x ∈ Ω. Let c_t : Ω → R (t = 1, 2, . . .) be any sequence of convex cost functions with bounded subgradients, ‖∇c_t(x)‖₂ ≤ G ∀t, ∀x ∈ Ω. Then

$$L_T[\mathrm{OGD}(\eta)] - \inf_{x \in \Omega} L_T[x] \le \frac{1}{2} \left( \frac{D^2}{\eta} + \eta G^2 T \right).$$

Moreover, setting η^∗ = D/(G√T) to minimize the right-hand side of the above bound (assuming D, G and T, or upper bounds on them, are known in advance), we have

$$L_T[\mathrm{OGD}(\eta^*)] - \inf_{x \in \Omega} L_T[x] \le D G \sqrt{T} .$$

Notes. Before proving the above theorem, we note the following:

1. The analysis of the GD algorithm for online regression can be viewed as a special case of the above, with D ≤ U and G ≤ ZR₂ (see Lecture 17 notes).

2. With η^∗ as above (which depends on knowledge of D, G and T), one gets an O(√T) regret bound. Such an O(√T) regret bound (with worse constants) can also be obtained by using a time-varying parameter η_t = 1/√t ∀t (in fact this was the version contained in the original paper of Zinkevich [7]).
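As a sanity check on the O(√T) bound with η^∗ = D/(G√T), the sketch below runs OGD on randomly generated linear costs c_t(x) = g_t · x over the unit Euclidean ball and compares the resulting regret with DG√T. The data, the choice of Ω, the starting point x¹ = 0 (so that D = 1) and the normalization of the gradients (so that G = 1) are all assumptions made for illustration.

```python
import math
import random

def project_ball(x, radius=1.0):
    """Euclidean projection onto the ball of the given radius."""
    norm = math.sqrt(sum(xi * xi for xi in x))
    return list(x) if norm <= radius else [xi * radius / norm for xi in x]

rng = random.Random(0)
n, T = 3, 500
D, G = 1.0, 1.0                 # unit ball with x^1 = 0; gradients of norm <= 1
eta = D / (G * math.sqrt(T))    # the step size eta* from the theorem

x = [0.0] * n
alg_loss, grad_sum = 0.0, [0.0] * n
for _ in range(T):
    g = project_ball([rng.uniform(-1.0, 1.0) for _ in range(n)])  # ||g||_2 <= G
    alg_loss += sum(gi * xi for gi, xi in zip(g, x))   # c_t(x^t) = g_t . x^t
    grad_sum = [s + gi for s, gi in zip(grad_sum, g)]
    x = project_ball([xi - eta * gi for xi, gi in zip(x, g)])     # OGD update

# For linear costs the best fixed point in the unit ball is
# -grad_sum / ||grad_sum||_2, with total loss -||grad_sum||_2.
best_loss = -math.sqrt(sum(s * s for s in grad_sum))
regret = alg_loss - best_loss
print(regret, D * G * math.sqrt(T))
```

The printed regret should not exceed the printed bound; on benign random data it is typically much smaller, since the bound is a worst-case guarantee.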

Proof of Theorem 2.1. The proof is similar to that of the regret bound we obtained for the GD algorithm for online regression in Lecture 17. Let x ∈ Ω. We have ∀t ∈ [T ]:

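$$
\begin{aligned}
c_t(x^t) - c_t(x)
&\le \nabla c_t(x^t) \cdot (x^t - x), \quad \text{by convexity of } c_t \\
&= \frac{x^t - \tilde{x}^{t+1}}{\eta} \cdot (x^t - x) \\
&= \frac{1}{2\eta} \Big( \|x - x^t\|_2^2 - \|x - \tilde{x}^{t+1}\|_2^2 + \|x^t - \tilde{x}^{t+1}\|_2^2 \Big) \\
&\le \frac{1}{2\eta} \Big( \|x - x^t\|_2^2 - \|x - x^{t+1}\|_2^2 + \|x^t - \tilde{x}^{t+1}\|_2^2 \Big), \quad \text{since projecting onto } \Omega \text{ does not increase the distance to } x \in \Omega \\
&= \frac{1}{2\eta} \Big( \|x - x^t\|_2^2 - \|x - x^{t+1}\|_2^2 + \eta^2 \|\nabla c_t(x^t)\|_2^2 \Big).
\end{aligned}
$$

Summing over t = 1 . . . T, we thus get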

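$$
\begin{aligned}
\sum_{t=1}^{T} \big( c_t(x^t) - c_t(x) \big)
&\le \frac{1}{2\eta} \Big( \|x - x^1\|_2^2 - \|x - x^{T+1}\|_2^2 + \eta^2 G^2 T \Big) \\
&\le \frac{1}{2\eta} \Big( D^2 + \eta^2 G^2 T \Big).
\end{aligned}
$$

It is easy to see that the right-hand side is minimized at η^∗ = D/(G√T); substituting this back in the above inequality gives the desired result.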
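The minimization in the last step can be spelled out as follows. The function name f is introduced here only for this calculation; it denotes the right-hand side of the bound in Theorem 2.1 viewed as a function of η > 0.

```latex
\begin{aligned}
f(\eta) &= \frac{1}{2}\left(\frac{D^2}{\eta} + \eta G^2 T\right),
\qquad
f'(\eta) = \frac{1}{2}\left(-\frac{D^2}{\eta^2} + G^2 T\right) = 0
\;\Longleftrightarrow\;
\eta = \eta^* = \frac{D}{G\sqrt{T}}, \\
f''(\eta) &= \frac{D^2}{\eta^3} > 0
\quad\text{(so } \eta^* \text{ is a minimizer)}, \\
f(\eta^*) &= \frac{1}{2}\left(D G \sqrt{T} + D G \sqrt{T}\right) = D G \sqrt{T}.
\end{aligned}
```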

2.1 Logarithmic regret bound for OGD when cost functions are strongly convex (using time-varying parameter η_t)

The above result gives an O(√T) regret bound for the OGD algorithm in general. When the cost functions c_t are known to be α-strongly convex, one can in fact obtain an O(ln T) regret bound with a slight variant of the OGD algorithm that uses a suitable time-varying parameter η_t [3]:

Theorem 2.2 (Hazan et al, 2007). Let Ω ⊆ Rⁿ be a closed, non-empty, convex set with ‖x − x¹‖₂ ≤ D ∀x ∈ Ω. Let c_t : Ω → R (t = 1, 2, . . .) be any sequence of α-strongly convex cost functions with bounded subgradients, ‖∇c_t(x)‖₂ ≤ G ∀t, ∀x ∈ Ω. Then the OGD algorithm with time-varying parameter η_t = 1/(αt) satisfies

$$L_T\Big[\mathrm{OGD}\Big(\eta_t = \frac{1}{\alpha t}\Big)\Big] - \inf_{x \in \Omega} L_T[x] \le \frac{G^2}{2\alpha} \big( \ln T + 1 \big).$$

Proof. Let x ∈ Ω. Proceeding similarly as before, we have ∀t ∈ [T ]:

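$$
\begin{aligned}
c_t(x^t) - c_t(x)
&\le \nabla c_t(x^t) \cdot (x^t - x) - \frac{\alpha}{2} \|x^t - x\|_2^2, \quad \text{by } \alpha\text{-strong convexity of } c_t \\
&= \frac{x^t - \tilde{x}^{t+1}}{\eta_t} \cdot (x^t - x) - \frac{\alpha}{2} \|x^t - x\|_2^2 \\
&= \frac{1}{2\eta_t} \Big( \|x - x^t\|_2^2 - \|x - \tilde{x}^{t+1}\|_2^2 + \|x^t - \tilde{x}^{t+1}\|_2^2 - \alpha \eta_t \|x^t - x\|_2^2 \Big) \\
&\le \frac{1}{2\eta_t} \Big( \|x - x^t\|_2^2 - \|x - x^{t+1}\|_2^2 + \|x^t - \tilde{x}^{t+1}\|_2^2 - \alpha \eta_t \|x^t - x\|_2^2 \Big) \\
&= \frac{1}{2\eta_t} \Big( \|x - x^t\|_2^2 - \|x - x^{t+1}\|_2^2 + \eta_t^2 \|\nabla c_t(x^t)\|_2^2 - \alpha \eta_t \|x^t - x\|_2^2 \Big).
\end{aligned}
$$

Summing over t = 1 . . . T then gives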

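$$
\begin{aligned}
\sum_{t=1}^{T} \big( c_t(x^t) - c_t(x) \big)
&\le \frac{G^2}{2} \sum_{t=1}^{T} \eta_t
+ \frac{1}{2} \Big( \frac{1}{\eta_1} - \alpha \Big) \|x - x^1\|_2^2
+ \frac{1}{2} \sum_{t=2}^{T} \Big( \frac{1}{\eta_t} - \frac{1}{\eta_{t-1}} - \alpha \Big) \|x - x^t\|_2^2 \\
&\le \frac{G^2}{2\alpha} \big( 1 + \ln T \big) + 0 + 0 .
\end{aligned}
$$

Here the non-positive term −(1/(2η_T))‖x − x^{T+1}‖₂² has been dropped in the first line. In the second line, η_t = 1/(αt) gives 1/η_1 − α = 0 and 1/η_t − 1/η_{t−1} − α = 0 for all t ≥ 2, while Σ_{t=1}^T η_t = (1/α) Σ_{t=1}^T 1/t ≤ (1/α)(1 + ln T).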
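The logarithmic bound can be checked on a toy problem. The sketch below is an illustration under assumed data: Ω = [−1, 1], costs c_t(x) = (α/2)(x − z_t)² with random targets z_t ∈ {−1, +1}, and x¹ = 0. These costs are α-strongly convex and their derivatives are bounded by G = 2α on Ω, so Theorem 2.2 applies. The function name `ogd_strongly_convex` is hypothetical.

```python
import math
import random

def ogd_strongly_convex(targets, alpha, x1=0.0):
    """1-D OGD on Omega = [-1, 1] with eta_t = 1 / (alpha * t) for the costs
    c_t(x) = (alpha / 2) * (x - z_t)^2. Returns the played points."""
    x, plays = x1, []
    for t, z in enumerate(targets, start=1):
        plays.append(x)
        grad = alpha * (x - z)               # derivative of c_t at x^t
        x_tilde = x - grad / (alpha * t)     # step with eta_t = 1 / (alpha t)
        x = max(-1.0, min(1.0, x_tilde))     # projection onto [-1, 1]
    return plays

rng = random.Random(1)
alpha, T = 0.5, 1000
targets = [rng.choice([-1.0, 1.0]) for _ in range(T)]
plays = ogd_strongly_convex(targets, alpha)

cost = lambda x, z: 0.5 * alpha * (x - z) ** 2
alg_loss = sum(cost(x, z) for x, z in zip(plays, targets))
x_star = sum(targets) / T                    # minimizer of the total cost
best_loss = sum(cost(x_star, z) for z in targets)

G = 2.0 * alpha                              # |c_t'(x)| <= 2 * alpha on Omega
bound = G ** 2 / (2.0 * alpha) * (1.0 + math.log(T))
print(alg_loss - best_loss, bound)
```

With this particular cost family the update reduces to a running average of the targets seen so far, which gives some intuition for why the step size 1/(αt) is natural here.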

3 Online Mirror Descent Algorithm

The online mirror descent (OMD) algorithm, described for example in [6], generalizes the basic mirror descent algorithm used for standard (offline) optimization problems [4, 5, 1] much as OGD generalizes basic gradient descent. The OMD framework allows one to obtain O(√T) regret bounds for cost functions with bounded subgradients not only in Euclidean norm (assumed by the OGD analysis), but in any suitable norm.

Let Ω̃ ⊆ Rⁿ be a convex set containing Ω (i.e. Ω ⊆ Ω̃), and let F : Ω̃ → R be a strictly convex and differentiable function. Let B_F : Ω̃ × Ω̃ → R denote the Bregman divergence associated with F; recall that for all u, v ∈ Ω̃, this is defined as

B_F(u, v) = F (u) − F (v) − (u − v) · ∇F (v) .
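The definition translates directly into code. The sketch below is illustrative only: the helper name `bregman` and the two example functions are choices made here, namely F(x) = ½‖x‖₂² (for which B_F(u, v) = ½‖u − v‖₂²) and the unnormalized negative entropy F(x) = Σ_i x_i ln x_i − Σ_i x_i (for which B_F is the generalized KL divergence).

```python
import math

def bregman(F, grad_F, u, v):
    """B_F(u, v) = F(u) - F(v) - (u - v) . grad F(v)."""
    g = grad_F(v)
    return F(u) - F(v) - sum((ui - vi) * gi for ui, vi, gi in zip(u, v, g))

# F(x) = 0.5 * ||x||_2^2, for which B_F(u, v) = 0.5 * ||u - v||_2^2.
sq = lambda x: 0.5 * sum(xi * xi for xi in x)
sq_grad = lambda x: list(x)

# Unnormalized negative entropy; its gradient is (ln x_1, ..., ln x_n).
ent = lambda x: sum(xi * math.log(xi) - xi for xi in x)
ent_grad = lambda x: [math.log(xi) for xi in x]

u, v = [0.2, 0.8], [0.5, 0.5]
print(bregman(sq, sq_grad, u, v))    # approximately 0.5 * (0.09 + 0.09) = 0.09
print(bregman(ent, ent_grad, u, v))  # sum_i u_i ln(u_i / v_i), as both sum to 1
```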

Then the OMD algorithm using the set Ω̃ and function F can be described as follows:

Algorithm Online Mirror Descent (OMD)

Parameters: η > 0; convex set Ω̃ ⊇ Ω; strictly convex, differentiable function F : Ω̃ → R

Initialize: x¹ = argmin_{x∈Ω} F(x)

For t = 1, . . . , T :

– Play x^t ∈ Ω

– Receive cost function c_t : Ω → R

– Incur loss c_t(x^t)

– Update:

∇F(x̃^{t+1}) ← ∇F(x^t) − η∇c_t(x^t) (Assume this yields x̃^{t+1} ∈ Ω̃)

x^{t+1} ← P_Ω^F(x̃^{t+1}) ,

where P_Ω^F(x̃) = argmin_{x∈Ω} B_F(x, x̃) (Bregman projection of x̃ onto Ω)

Below we will obtain a regret bound for OMD (with appropriate choice of the function F ) for cost functions with subgradients bounded in an arbitrary norm. Before we do so, let us recall the notion of a dual norm.

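Specifically, let ‖·‖ be any norm defined on a closed convex set Ω̃ ⊆ Rⁿ. Then the corresponding dual norm is defined as

$$\|u\|_* = \sup_{x \in \tilde{\Omega} \,:\, \|x\| \le 1} x \cdot u ,
\qquad \text{or equivalently via} \qquad
\|u\|_*^2 = 2 \sup_{x \in \tilde{\Omega}} \Big( x \cdot u - \tfrac{1}{2} \|x\|^2 \Big).$$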
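As a concrete instance of the first (supremum over the unit ball) form of the definition, take ‖·‖ to be the ℓ₁ norm on Rⁿ; its dual norm is then the ℓ∞ norm. The sketch below compares the closed form max_i |u_i| with a crude random search over the ℓ₁ unit sphere. The vector u, the sample size and the helper names are assumptions for illustration, and the random search only gives a lower estimate of the supremum.

```python
import random

def l1_norm(x):
    return sum(abs(xi) for xi in x)

def dual_of_l1(u):
    """sup of x . u over the l1 unit ball. A linear function attains its
    supremum over this ball at an extreme point +/- e_i, giving max_i |u_i|."""
    return max(abs(ui) for ui in u)

rng = random.Random(0)
u = [0.3, -2.0, 1.1]
best = 0.0
for _ in range(2000):
    x = [rng.uniform(-1.0, 1.0) for _ in u]
    scale = l1_norm(x)
    x = [xi / scale for xi in x]          # rescale onto the l1 unit sphere
    best = max(best, sum(xi * ui for xi, ui in zip(x, u)))

# The sampled values never exceed the closed form, which the vertex -e_2 attains.
print(best <= dual_of_l1(u) + 1e-9, dual_of_l1(u))
```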

Theorem 3.1. Let Ω ⊆ Rⁿ be a closed, non-empty, convex set. Let c_t : Ω → R (t = 1, 2, . . .) be any sequence of convex cost functions with bounded subgradients ‖∇c_t(x)‖ ≤ G ∀t, ∀x ∈ Ω (where ‖·‖ is an arbitrary norm). Let Ω̃ ⊇ Ω be a convex set and F : Ω̃ → R be a strictly convex function such that

(a) F(x) − F(x¹) ≤ D² ∀x ∈ Ω; and

(b) the restriction of F to Ω is α-strongly convex w.r.t. ‖·‖_∗, the dual norm of ‖·‖.

Then

$$L_T[\mathrm{OMD}(\eta)] - \inf_{x \in \Omega} L_T[x] \le \frac{1}{\eta} \left( D^2 + \frac{\eta^2 G^2 T}{2\alpha} \right).$$

Moreover, setting η^∗ = (D/G)√(2α/T) to minimize the right-hand side of the above bound (assuming D, G, T and α, or upper bounds on them, are known in advance), we have

$$L_T[\mathrm{OMD}(\eta^*)] - \inf_{x \in \Omega} L_T[x] \le D G \sqrt{\frac{2T}{\alpha}} .$$

Proof. Let x ∈ Ω. Proceeding as before, we have ∀t ∈ [T ]:

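$$
\begin{aligned}
c_t(x^t) - c_t(x)
&\le \nabla c_t(x^t) \cdot (x^t - x), \quad \text{by convexity of } c_t \\
&= \frac{\nabla F(x^t) - \nabla F(\tilde{x}^{t+1})}{\eta} \cdot (x^t - x) \\
&= \frac{1}{\eta} \Big( B_F(x, x^t) - B_F(x, \tilde{x}^{t+1}) + B_F(x^t, \tilde{x}^{t+1}) \Big) \\
&\le \frac{1}{\eta} \Big( B_F(x, x^t) - B_F(x, x^{t+1}) - B_F(x^{t+1}, \tilde{x}^{t+1}) + B_F(x^t, \tilde{x}^{t+1}) \Big),
\end{aligned}
$$

where the last step uses the generalized Pythagorean inequality for Bregman projections, B_F(x, x̃^{t+1}) ≥ B_F(x, x^{t+1}) + B_F(x^{t+1}, x̃^{t+1}) for x ∈ Ω. Summing over t = 1 . . . T, we thus get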

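$$
\sum_{t=1}^{T} \big( c_t(x^t) - c_t(x) \big)
\le \frac{1}{\eta} \Big( B_F(x, x^1) - B_F(x, x^{T+1}) \Big)
+ \frac{1}{\eta} \sum_{t=1}^{T} \Big( B_F(x^t, \tilde{x}^{t+1}) - B_F(x^{t+1}, \tilde{x}^{t+1}) \Big).
$$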

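Now, it can be shown that

$$B_F(x, x^1) \le F(x) - F(x^1) \le D^2 ,$$

where the first inequality holds because x¹ minimizes F over Ω, so that ∇F(x¹) · (x − x¹) ≥ 0 for all x ∈ Ω. Also, we have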

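$$
\begin{aligned}
\sum_{t=1}^{T} \Big( B_F(x^t, \tilde{x}^{t+1}) - B_F(x^{t+1}, \tilde{x}^{t+1}) \Big)
&= \sum_{t=1}^{T} \Big( F(x^t) - F(x^{t+1}) - \nabla F(\tilde{x}^{t+1}) \cdot (x^t - x^{t+1}) \Big) \\
&\le \sum_{t=1}^{T} \Big( \nabla F(x^t) \cdot (x^t - x^{t+1}) - \frac{\alpha}{2} \|x^t - x^{t+1}\|_*^2 - \nabla F(\tilde{x}^{t+1}) \cdot (x^t - x^{t+1}) \Big), \quad \text{by } \alpha\text{-strong convexity of } F \text{ on } \Omega \\
&= \sum_{t=1}^{T} \Big( \eta \nabla c_t(x^t) \cdot (x^t - x^{t+1}) - \frac{\alpha}{2} \|x^t - x^{t+1}\|_*^2 \Big) \\
&\le \sum_{t=1}^{T} \Big( \eta G \|x^t - x^{t+1}\|_* - \frac{\alpha}{2} \|x^t - x^{t+1}\|_*^2 \Big), \quad \text{by H\"older's inequality} \\
&\le \sum_{t=1}^{T} \frac{\eta^2 G^2}{2\alpha} \\
&= \frac{\eta^2 G^2 T}{2\alpha} ,
\end{aligned}
$$

where the third line uses the update ∇F(x̃^{t+1}) = ∇F(x^t) − η∇c_t(x^t), and the last inequality holds since

$$\frac{\eta^2 G^2}{2\alpha} + \frac{\alpha}{2} \|x^t - x^{t+1}\|_*^2 - \eta G \|x^t - x^{t+1}\|_*
= \left( \frac{\eta G}{\sqrt{2\alpha}} - \sqrt{\frac{\alpha}{2}} \, \|x^t - x^{t+1}\|_* \right)^2 \ge 0 .$$

Thus we get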

$$\sum_{t=1}^{T} \big( c_t(x^t) - c_t(x) \big) \le \frac{1}{\eta} \left( D^2 + \frac{\eta^2 G^2 T}{2\alpha} \right).$$

It is easy to verify that the right-hand side is minimized at η^∗ = (D/G)√(2α/T); substituting this back in the above inequality gives the desired result.
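The second equality in the per-trial chain of the proof is the three-point identity for Bregman divergences: (∇F(x^t) − ∇F(x̃^{t+1})) · (x^t − x) = B_F(x, x^t) − B_F(x, x̃^{t+1}) + B_F(x^t, x̃^{t+1}). It can be checked numerically; the sketch below does so for the unnormalized negative entropy, with arbitrary positive test vectors chosen here purely for illustration.

```python
import math

def F(x):
    """Unnormalized negative entropy sum_i x_i ln x_i - sum_i x_i."""
    return sum(xi * math.log(xi) - xi for xi in x)

def grad_F(x):
    return [math.log(xi) for xi in x]

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def bregman(u, v):
    """B_F(u, v) = F(u) - F(v) - (u - v) . grad F(v)."""
    diff = [ui - vi for ui, vi in zip(u, v)]
    return F(u) - F(v) - dot(diff, grad_F(v))

x = [0.6, 0.3, 0.1]         # comparator x
x_t = [0.2, 0.5, 0.3]       # current point x^t
x_tilde = [0.4, 0.9, 0.2]   # unprojected point (need not lie in the simplex)

lhs = dot([a - b for a, b in zip(grad_F(x_t), grad_F(x_tilde))],
          [a - b for a, b in zip(x_t, x)])
rhs = bregman(x, x_t) - bregman(x, x_tilde) + bregman(x_t, x_tilde)
print(abs(lhs - rhs) < 1e-12)
```

The identity is purely algebraic and holds for any differentiable F, so the two sides agree up to floating-point rounding.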

Examples. Examples of OMD algorithms include the following:

1. OGD. Here Ω̃ = Rⁿ and F(x) = ½‖x‖₂². Note that F is 1-strongly convex on Rⁿ (and therefore on any Ω ⊆ Rⁿ).

2. Exponential weights. Here Ω = a∆_n for some a > 0; Ω̃ = Rⁿ₊₊; and F(x) = Σ_{i=1}^n x_i ln x_i − Σ_{i=1}^n x_i. Note that F is strictly convex on Ω̃ and 1-strongly convex on Ω = ∆_n.
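For the exponential weights example with Ω = ∆_n, the OMD update can be written out explicitly. Since ∇F(x) = (ln x_1, . . . , ln x_n), the mirror step becomes x̃_i = x_i exp(−η g_i), and the Bregman projection onto the simplex under this F is a normalization. The sketch below assumes linear costs c_t(x) = x · ℓ^t (so the gradient is ℓ^t itself) and a = 1; the function names and the data are hypothetical.

```python
import math

def omd_entropy_step(x, grad, eta):
    """One OMD step for F(x) = sum_i x_i ln x_i - sum_i x_i on the simplex.
    grad F(x) = ln x, so the mirror step is multiplicative; the Bregman
    (generalized KL) projection onto the simplex is a normalization."""
    x_tilde = [xi * math.exp(-eta * gi) for xi, gi in zip(x, grad)]
    total = sum(x_tilde)
    return [xi / total for xi in x_tilde]

def exponential_weights(loss_vectors, eta):
    """Start from the uniform vector (the minimizer of F over the simplex)
    and apply the OMD step after each trial. Returns the played points."""
    n = len(loss_vectors[0])
    x = [1.0 / n] * n
    plays = []
    for loss in loss_vectors:
        plays.append(x)
        x = omd_entropy_step(x, loss, eta)   # gradient of x . loss is loss
    return plays

# Illustrative run (assumed data) with eta = ln 2, so a unit loss halves a weight.
plays = exponential_weights([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]],
                            eta=math.log(2.0))
print(plays)  # approximately [0.5, 0.5], [1/3, 2/3], [0.2, 0.8]
```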

Exercise. Suppose the cost functions c_t are all β-strongly convex for some β > 0 (and have subgradients bounded in some arbitrary norm ‖·‖ as above). Can you obtain a logarithmic (O(ln T)) regret bound for the OMD algorithm in this case, as we did for the OGD algorithm? Why or why not?

More details on online convex optimization, including missing details in some of the above proofs, can be found for example in [2].

4 Next Lecture

In the next lecture, we will consider how algorithms for online supervised learning problems (such as online regression and classification) can be used in a batch setting, and will see how regret bounds for online learning algorithms can be converted to generalization error bounds for the resulting batch algorithms.

References

[1] Amir Beck and Marc Teboulle. Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 31(3):167–175, 2003.

[2] Sébastien Bubeck. Introduction to online optimization. Lecture Notes, Princeton University, 2011.

[3] Elad Hazan, Amit Agarwal, and Satyen Kale. Logarithmic regret algorithms for online convex optimization. Machine Learning, 69:169–192, 2007.

[4] A. Nemirovski. Efficient methods for large-scale convex optimization problems. Ekonomika i Matematicheskie Metody, 15, 1979.

[5] A. Nemirovski and D. Yudin. Problem Complexity and Method Efficiency in Optimization. Wiley Interscience, 1983.

[6] Shai Shalev-Shwartz. Online Learning: Theory, Algorithms, and Applications. PhD thesis, The Hebrew University of Jerusalem, 2007.

[7] Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning (ICML), 2003.