An Optimal Algorithm for Bandit and Zero-Order Convex Optimization with Two-Point Feedback

(1)

An Optimal Algorithm for Bandit and Zero-Order

Convex Optimization with Two-Point Feedback

Department of Computer Science and Applied Mathematics Weizmann Institute of Science

Rehovot 7610001, Israel

Editor:Alexander Rakhlin

Abstract

We consider the closely related problems of bandit convex optimization with two-point feedback, and zero-order stochastic convex optimization with two function evaluations per round. We provide a simple algorithm and analysis which is optimal for convex Lipschitz functions. This improves on Duchi et al. (2015), which only provides an optimal result for smooth functions; Moreover, the algorithm and analysis are simpler, and readily extend to non-Euclidean problems. The algorithm is based on a small but surprisingly powerful modification of the gradient estimator.

Keywords: zero-order optimization, bandit optimization, stochastic optimization, gradi-ent estimator

1. Introduction

We consider the problem of bandit convex optimization with two-point feedback Agarwal et al. (2010). This problem can be defined as a repeated game between a learner and an adversary as follows: At each roundt, the adversary picks a convex functionftonRd, which

is not revealed to the learner. The learner then chooses a point wt from some known and closed convex set W ⊆_Rd_{, and suffers a loss} _f

t(wt). As feedback, the learner may choose two points w_t0,w00_t ∈ W and receive1 ft(w0t), ft(wt00). The learner’s goal is to minimize average regret, defined as

1

T

X

t=1

ft(wt)− min

w∈W 1

T

X

t=1

ft(w).

In this paper, we focus on obtaining bounds on the expected average regret (with respect to the learner’s randomness).

A closely-related and easier setting is zero-order stochastic convex optimization. In this setting, our goal is to approximately solve F(w) = minw∈WEξ[f(w;ξ)], given limited access to {f(·;ξt)}Tt=1 where ξt are i.i.d. instantiations. Specifically, we assume that each

1. This is slightly different than the model of Agarwal et al. (2010), where the learner only choosesw0t,wt00

and the loss is 1 2(ft(w

0

t) +ft(w

00

t)). However, our results and analysis can be easily translated to their

setting, and the model we discuss translates more directly to the zero-order stochastic optimization considered later.

c

(2)

f(·, ξ_t) is not directly observed, but rather can be queried at two points. This models situations where computing gradients directly is complicated or infeasible. It is well-known (Cesa-Bianchi et al., 2004) that given an algorithm with expected average regretRT in the bandit optimization setting above, if we feed it with the functions ft(w) =f(w;ξt), then the average ¯wT = _T1 PT_t=1wt of the points generated satisfies the following bound on the expected optimization error:

E[F( ¯wT)]− min

w∈WF(w)≤RT.

Thus, an algorithm for bandit optimization can be converted to an algorithm for zero-order stochastic optimization with similar guarantees.

The bandit optimization setting with two-point feedback was proposed and studied in Agarwal et al. (2010). Independently, Nesterov (2011) and considered two-point methods for stochastic optimization. Both papers are based on randomized gradient estimates which are then fed into standard first-order algorithms (e.g. gradient descent, or more generally mirror descent). However, the regret/error guarantees in both papers were suboptimal in terms of the dependence on the dimension. Recently, Duchi et al. (2015) considered a similar approach for the stochastic optimization setting, attaining an optimal error guarantee when

f(·;ξ) is a smooth function (differentiable and with Lipschitz-continuous gradients). Related results in the smooth case were also obtained by Ghadimi and Lan (2013). However, to tackle the general case, where f(·;ξ) may be non-smooth, Duchi et al. (2015) resorted to a non-trivial smoothing scheme and a significantly more involved analysis. The resulting bounds have additional factors (logarithmic in the dimension) compared to the guarantees in the smooth case. Moreover, an analysis is only provided for Euclidean problems (where the domainW and Lipschitz parameter of ft scale with theL2 norm).

In this note, we present and analyze a simple algorithm with the following properties:

• For Euclidean problems, it is optimal up to constants for both smooth and non-smooth functions. This closes the gap between the non-smooth and non-non-smooth Euclidean problems in this setting.

• The algorithm and analysis are readily applicable to non-Euclidean problems. We give an example for the 1-norm, with the resulting bound optimal up to logarithmic factors.

• The algorithm and analysis are simpler than those proposed in Duchi et al. (2015). They apply equally to the bandit and zero-order optimization setting, and can be readily extended using standard techniques, e.g. improved bounds for strongly-convex functions; regret/error bounds holding with high-probability rather than just in ex-pectation; and improved bounds if allowed k > 2 observations per round instead of just two (Hazan et al., 2007; Shalev-Shwartz, 2007; Agarwal et al., 2010).

(3)

which queries at w and w+δu (where u is a random unit vector and δ > 0 is a small parameter), and returns

d

δ (f(w+δu)−f(w))u. (1)

The intuition is readily seen in the one-dimensional (d= 1) case, where the expectation of this expression equals

1

2δ (f(w+δ)−f(w−δ)), (2)

which indeed approximates the derivative of f (assuming f is differentiable) at w, if δ is small enough.

In contrast, our algorithm uses a slightly different estimator (also used in Agarwal et al., 2010), which queries atw−δu,w+δu, and returns

d

2δ(f(w+δu)−f(w−δu))u. (3)

Again, the intuition is readily seen in the cased= 1, where the expectation of this expression also equals Eq. (2).

When δ is sufficiently small and f is differentiable at w, both estimators compute a good approximation of the true gradient∇f(w). However, whenf is not differentiable, the variance of the estimator in Eq. (1) can be quadratic in the dimensiond, as pointed out by Duchi et al. (2015): For example, for f(w) =kwk2 andw= 0, the second moment equals

E

"

d

δ(f(δu)−f(0))u

2 2

#

= Ed2kuk42

= d2.

Since the performance of the algorithm crucially depends on the second moment of the gradient estimate, this leads to a highly sub-optimal guarantee. In Duchi et al. (2015), this was handled by adding an additional random perturbation and using a more involved analysis. Surprisingly, it turns out that the slightly different estimator in Eq. (3) does not suffer from this problem, and its second moment is essentially linear in the dimension d.

We note that in this work, we assume thatuis a random unit vector, similar to previous works. However, our results can be readily extended to other distributions, such as uniform in the Euclidean unit ball, or a Gaussian distribution.

2. Algorithm and Main Results

We consider the algorithm described in Figure 1, which performs standard mirror descent using a randomized gradient estimator ˜gt of a (smoothed) version of ft at point wt. Fol-lowing Duchi et al. (2015), we assume that one can indeed query ft at any pointwt+δtut as specified in the algorithm2.

The analysis of the algorithm is presented in the following theorem:

2. This may require us to query at a distanceδt outsideW. If we must query withinW, then a standard technique (see Agarwal et al., 2010) is to simply run the algorithm on a slightly smaller set (1−)W, where >0 is sufficiently large so thatwt+δtut must be inW. Since the formal guarantee in Thm. 1

(4)

Algorithm 1 Two-Point Bandit Convex Optimization Algorithm Input: Step size η, functionr:W 7→R, exploration parameters δt>0 Initialize θ1=0.

for t= 1, . . . , T−1do

Predictwt= arg maxw∈Whθt,wi −r(w)

Sampleut uniformly from the Euclidean unit sphere {w:kwk2 = 1} Queryft(wt+δtut) and ft(wt−δtut)

Set ˜gt= _2δd_t (ft(wt+δtut)−ft(wt−δtut))ut Updateθt+1 =θt−ηg˜t

end for

Theorem 1 Assume the following conditions hold:

1. r is 1-strongly convex with respect to a norm k · k, and sup_w_∈Wr(w) ≤R2 _{for some}

R <∞.

2. ft is convex and G2-Lipschitz with respect to the 2-norm k · k2.

3. The dual norm k · k∗ of k · k is such that 4

p

Eutkutk4∗≤p∗ for some p∗<∞.

If η = R

p∗G2 √

dT, and δt chosen such that δt ≤ p∗R

q

d

T, then the sequence w1, . . . ,wT

generated by the algorithm satisfies the following for any T andw∗ ∈ W:

E

"

1

T

X

t=1

ft(wt)− 1

T

X

t=1

ft(w∗)

#

≤c p∗G2R

r

d T,

where c is some numerical constant.

We note that condition 1 is standard in the analysis of the mirror-descent method (see the specific corollaries below), whereas conditions 2 and 3 are needed to ensure that the variance of our gradient estimator is controlled.

As mentioned earlier, the bound on the average regret which appears in Thm. 1 imme-diately implies a similar bound on the error in a stochastic optimization setting, for the average point ¯wT = _T1

PT

t=1wt. We note that the result is robust to the choice of η, and is the same up to constants as long as η = Θ(R/p∗G2

√

dT). Also, the constant c, while always strictly positive, shrinks as δt→0 (see the proof below for details).

As a first application of the theorem, let us consider the case wherek · kis the Euclidean normk·k₂. In this case, we can taker(w) = 1₂kwk2

2, and the algorithm reduces to a standard variant of online gradient descent, defined asθt+1 =θt−g˜tandwt= arg minw∈Wkw−θtk2. In this case, we get the following corollary:

Corollary 2 Suppose ft for all t is G2-Lipschitz with respect to the Euclidean norm, and

W ⊆ {w : kwk2 ≤ R}. Then using k · k = k · k2 and r(w) = 1₂kwk22, it holds for some

constant c and any w∗ ∈ W that

E

"

1

T

X

t=1

ft(wt)− 1

T

X

t=1

ft(w∗)

#

≤c G2R

r

(5)

The proof is immediately obtained from Thm. 1, noting thatp∗= 1 in our case. This bound matches (up to constants) the lower bound in Duchi et al. (2015), hence closing the gap between upper and lower bounds in this setting.

As a second application, let us consider the case where k · k is the 1-norm, k · k₁, the domainW is the simplex in Rd,d >1 (although our result easily extends to any subset of

the 1-norm unit ball), and we use a standard entropic regularizer:

Corollary 3 Suppose ft for all t isG1-Lipschitz with respect to the L1 norm. Then using

k · k=k · k₁ and r(w) =Pd

i=1wilog(dwi), it holds for some constant c and any w∗ ∈ W

that

E

"

1

T

X

t=1

ft(wt)− 1

T

X

t=1

ft(w∗)

#

≤c G1

s

dlog2(d)

T .

This bound matches (this time up to a factor polylogarithmic in d) the lower bound in Duchi et al. (2015) for this setting.

Proof The function r is 1-strongly convex with respect to the 1-norm (see for instance Shalev-Shwartz, 2012, Example 2.5), and has value at most log(d) on the simplex. Also, ifft isG1-Lipschitz with respect to the 1-norm, then it must be

√

dG1-Lipschitz with respect to the Euclidean norm. Finally, to satisfy condition 3 in Thm. 1, we upper bound p4

E[kutk4∞] using the following lemma, whose proof is given in the appendix:

Lemma 4 If u is uniformly distributed on the unit sphere inRd, d >1, then p4 E[kuk4∞]≤

c

q

log(d)

d where c is a positive numerical constant independent of d.

Plugging these observations into Thm. 1 leads to the desired result.

Finally, we make two additional remarks on possible extensions and improvements to Thm. 1.

Remark 5 (Querying at k >2 points) If the algorithm is allowed to queryft atk >2,

then it can be modified to attain an improved regret bound, by computingbk/2cindependent estimates of ˜gt at every round (using a freshly sampled ut each time), and using their

average. This leads to a new gradient estimator ˜gk_t, which satisfies E[kg˜ktk2]≤ 1kE[kg˜tk 2_{] +}

k_E[˜gt]k2. Based on the proof of Thm. 1, it is easily verified that this leads to an average

expected regret bound of cG_√2R T

1 +p∗

p

d/k

for some numerical constant c.

Remark 6 (Non-Euclidean Geometries) When considering norms other than the Eu-clidean norm, it is tempting to conjecture that our algorithm and analysis can be improved, by samplingutfrom a distribution adapted to the geometry of that norm (not necessarily the

Euclidean ball), and assuming ft is Lipschitz w.r.t. the dual norm. However, adapting the

(6)

3. Proof of Theorem 1

As discussed in the introduction, the key to getting improved results compared to previous papers is the use of a slightly different random gradient estimator, which turns out to have significantly less variance. The formal proof relies on a few simple lemmas listed below. The key lemma is Lemma 10, which establishes the improved variance behavior.

Lemma 7 For any w∗ ∈ W, it holds that

T

X

t=1

h˜gt,wt−w∗i ≤ 1

ηR

2₊_η T

X

t=1

k˜gtk2∗.

This lemma is the canonical result on the convergence of online mirror descent, and the proof is standard (see e.g. Shalev-Shwartz, 2012).

Lemma 8 Define the function

ˆ

ft(w) =Eut[ft(w+δtut)],

over W, where ut is a vector picked uniformly at random from the Euclidean unit sphere.

Then the function is convex, Lipschitz with constant G2, satisfies sup

w∈W

|fˆt(w)−ft(w)| ≤δtG2,

and is differentiable with the following gradient:

∇fˆt(w) =Eut

d δt

ft(w+δtut)ut

.

Proof The fact that the function is convex and Lipschitz is immediate from its definition and the assumptions in the theorem. The inequality follows from ut being a unit vector and that ftis assumed to beG2-Lipschitz with respect to the 2-norm. The differentiability property follows from Lemma 2.1 in Flaxman et al. (2005).

Lemma 9 For any functiong which isL-Lipschitz with respect to the2-norm, it holds that if u is uniformly distributed on the Euclidean unit sphere, then

r

E

h

(g(u)−_E[g(u)])4

i

≤cL

2

d .

for some numerical constant c.

Proof A standard result on the concentration of Lipschitz functions on the Euclidean unit sphere implies that

(7)

for some numerical constant c0 >0 (see the proof of Proposition 2.10 and Corollary 2.6 in Ledoux, 2005). Therefore,

r

E

h

(g(u)−E[g(u)])4

i

=

s Z ∞

t=0

Pr(g(u)−E[g(u)])4> t

dt = s Z ∞ t=0

Pr|g(u)−_E[g(u)]|>√4tdt≤

s Z ∞ t=0 2 exp −c

0_d√_t

L2 dt= s 2 L 4 (c0d)2, where in the last step we used the fact that R∞

x=0exp(−

√

x)dx= 2. The expression above equals cL2/dfor some numerical constantc.

Lemma 10 It holds thatE[˜gt|wt] =∇fˆt(wt) (where fˆt(·) is as defined in Lemma 8), and E[k˜gtk2|wt]≤cdp2∗G22 for some numerical constant c.

Proof For simplicity of notation, we drop the t subscript. Since u has a symmetric distribution around the origin,

E[˜g|w] =Eu

d

2δ(f(w+δu)−f(w−δu))u

=Eu

d

2δ(f(w+δu))u

+Eu

d

2δf(w−δu)(−u)

=Eu

d

2δ(f(w+δu))u

+Eu

d

2δf(w+δu)(u)

=Eu

d

δf(w+δu)u

which equals∇fˆ(w) by Lemma 8.

As to the second part of the lemma, we have the following, where α is an arbitrary parameter and where we use the elementary inequality (a−b)2 ≤2(a2+b2).

E[kg˜k2∗|w] =Eu

"

d

2δ(f(w+δu)−f(w−δu))u

2 ∗ # = d 2 4δ2Eu

h

kuk2_∗(f(w+δu)−f(w−δu))2i = d

2 4δ2Eu

h

kuk2_∗((f(w+δu)−α)−(f(w−δu)−α))2

i

≤ d

2 2δ2Eu

h

kuk2_∗(f(w+δu)−α)2+ (f(w−δu)−α)2

i

= d 2 2δ2

Eu

h

kuk2_∗(f(w+δu)−α)2

i

+Eu

h

kuk2_∗(f(w−δu)−α)2

i

(8)

Again using the symmetric distribution ofu, this equals

d2 2δ2

Eu

h

kuk2_∗(f(w+δu)−α)2i+Eu

h

kuk2_∗(f(w+δu)−α)2i = d

2

δ2Eu

h

kuk2

∗(f(w+δu)−α)2

i

.

Applying Cauchy-Schwartz and using the condition p4

Eukuk4∗ ≤p∗ stated in the theorem, we get the upper bound

d2 δ2

p

Eu[kuk4∗]

r

Eu

h

(f(w+δu)−α)4

i

= p 2 ∗d2

δ2

r

Eu

h

(f(w+δu)−α)4

i

.

In particular, taking α = Eu[f(w+δu)] and using Lemma 9 (noting that f(w+δu) is G2δ-Lipschitz w.r.t. u in terms of the 2-norm), this is at most p

2

∗d2

δ2 c (G2δ)2

d = cdp 2 ∗G22 as required.

We are now ready to prove the theorem. Taking expectations on both sides of the inequality in Lemma 7, we have

E

" _T X

t=1

h˜gt,wt−w∗i

#

≤ 1 ηR

2₊_η T

X

t=1

Ekg˜tk2∗

= 1

ηR

2₊_η T

X

t=1

EEkg˜tk2∗|wt

. (4)

Using Lemma 10, the right hand side is at most

1

ηR

2₊_ηcdp2 ∗G22T

The left hand side of Eq. (4), by Lemma 10 and convexity of ˆft, equals

E

" _T X

t=1

h_E[˜gt|wt],wt−w∗i

#

=E

" _T X

t=1

h∇fˆt(wt),wt−w∗i

# ≥_E " _T X t=1 ˆ

ft(wt)−fˆt(w∗)

#

.

By Lemma 8, this is at least

E

" _T X

t=1

(ft(wt)−ft(w∗))

#

−2G2 T

X

t=1

δt.

Combining these inequalities and plugging back into Eq. (4), we get

E

" _T X

t=1

(ft(wt)−ft(w∗))

#

≤2G2 T

X

t=1

δt+ 1

ηR

2₊_cdp2 ∗G22ηT.

Choosingη =R/(p∗G2

√

dT), and anyδt≤p∗R

p

d/T, we get

E

" _T X

t=1

(ft(wt)−ft(w∗))

#

≤(c+ 3)p∗G2R

(9)

Dividing both sides byT, the result follows.

Acknowledgments

This research was supported in part by an Israel Science Foundation grant 425/13 and an FP7 Marie Curie CIG grant. We thank the anonymous reviewers for several helpful comments.

Appendix A. Proof of Lemma 4

We note that the distribution of kuk4

∞ is identical to that of knk4

∞

knk4 2

, where n∼ N(0, Id) is a standard Gaussian random vector. Moreover, by a standard concentration bound on the norm of Gaussian random vectors (e.g. Corollary 2.3 in Barvinok, 2005, with = 1/2):

max

(

Pr knk2≤

r d 2 ! ,Pr

knk2 ≥

√ 2d ) ≤exp −d 16 .

Finally, for any value of n, we always have knk∞

knk2 ≤1, since the Euclidean norm is always larger than the infinity norm. Combining these observations, and using1Afor the indicator function of the eventA, we have

E[kuk4∞] =E

knk4 ∞

knk4 2

= Pr knk2≤

r d 2 ! E "

knk4 ∞

knk4 2

knk2 ≤

r

d

2

#

+ Pr knk2>

r d 2 ! E "

knk4 ∞

knk4 2

knk2 >

r d 2 # ≤exp − d 16

∗1 + Pr knk2>

r d 2 ! E   

knk4 ∞

_p

d/24

knk2 >

r d 2    = exp − d 16 + 2 d 2 E h

knk4_∞1_k_n_k2_>√_d/2

i ≤exp − d 16 + 4

d2E

knk4_∞

. (5)

Thus, it remains to upper bound Eknk4∞

(10)

distributed standard Gaussian random variables, we have for any scalarz≥1 that

Pr(knk∞≤z) = n

Y

i=1

Pr(|n_i| ≤z) = (Pr(|n₁| ≤z))d

= (1−Pr(|n₁|> z))d (1)

≥ 1−dPr(|n₁|> z) = 1−2dPr(n1 > z)

(2)

≥ 1−dexp(−z2/2),

where (1) is Bernoulli’s inequality, and (2) is using a standard tail bound for a Gaussian random variable. In particular, the above implies that

Pr (knk∞> z)≤dexp(−z2/2). Therefore, for an arbitrary positive scalarr ≥1,

Eknk4∞

=

Z ∞

z=0

Pr knk4 ∞> z

dz

≤

Z r

z=0 1dz+

Z ∞

z=r

Pr knk∞> 4

√ zdz

≤r+

Z ∞

z=r

dexp

− √

z

2

dz

=r+ 4d(2 +√r) exp

− √

r

2

.

In particular, pluggingr = 4 log2(d) (which is larger than 1, since we assumed >1), we get 4(2 + 2 log(d) + log2(d)). Plugging this back into Eq. (5), we get that

E[kuk4∞] ≤ exp

−d

16

+ 162 + 2 log(d) + log 2₍_d₎

d2 ,

which can be shown to be at most c0

log(d) d

2

for alld > 1, where c0 <150 is a numerical

constant. In particular, this means that p4

E[kuk4∞]≤ 4

√ c0

q

log(d)

d as required.

References

A. Agarwal, O. Dekel, and L. Xiao. Optimal algorithms for online convex optimization with multi-point bandit feedback. In Conference on Learning Theory (COLT), 2010.

A. Barvinok. Measure concentration lecture notes. http://www.math.lsa.umich.edu/

~barvinok/total710.pdf, 2005.

N. Cesa-Bianchi, A. Conconi, and C. Gentile. On the generalization ability of on-line learning algorithms. Information Theory, IEEE Transactions on, 50(9):2050–2057, 2004.

(11)

A. Flaxman, A. Kalai, and B. McMahan. Online convex optimization in the bandit setting: gradient descent without a gradient. In ACM-SIAM Symposium on Discrete Algorithms (SODA), 2005.

S. Ghadimi and G. Lan. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341–2368, 2013.

E. Hazan, A. Agarwal, and S. Kale. Logarithmic regret algorithms for online convex opti-mization. Machine Learning, 69(2):169–192, 2007.

M. Ledoux.The concentration of measure phenomenon, volume 89. American Mathematical Soc., 2005.

Y. Nesterov. Random gradient-free minimization of convex functions. Technical Report 2011/16, ECORE, 2011.

S. Shalev-Shwartz. Online learning: Theory, algorithms, and applications. PhD thesis, The Hebrew University, 2007.