4 Handling Saddle-Points: Zeroth-Order Cubic Regularization Method

x_N =

P_{N −1}

k=0 x_k

N = γ_k[1 − 16LCs(2ˆs + s^∗)(log d)²γ_k]x_k P_{N −1}

k=0 γ_k[1 − 16LCs(2ˆs + s^∗)(log d)²γ_k]

due to the constant choice of γ_k in (3.5). Hence, (3.6) follows by using the choice of parameters in (3.5) into the above relation.

Remark 6 While for convex case, similar to the nonconvex case, the complexity of Algorithm 6 depends poly-logarithmically on d, it only linearly depends on the choice of ˆs, facilitating zeroth-order stochastic optimization in high-dimensions under sparsity assumptions.

Remark 7 As discussed in detail in [WDBS18], both Assumption 4 and 5 are implied when we assume the function f depends on only s of the d coordinates. But, both Assumption 4 and 5 are comparatively weaker than that assumption. Furthermore, unlike [WDBS18], we do not make any assumption on the sparsity or smoothness of the second-order derivative of the objective function f for our results.

Remark 8 As mentioned before, [WDBS18] considers only the convex case. Furthermore, their gradient estimator with zeroth-order oracle requires poly(s, s^∗, log d) function queries in each itera-tion whereas our estimator is based on only one funcitera-tion query per iteraitera-tion. Moreover, [WDBS18]

requires computationally expensive debiased Lasso estimators whereas our method requires only sim-ple thresholding operations (for convex case) to handle sparsity.

4 Handling Saddle-Points: Zeroth-Order Cubic Regularization Method

In this section, we study zeroth-order stochastic cubic regularized Newton method for unconstrained version of Problem 1.1. Throughout this section, we equip our space with the self- dual Euclidean norm, i.e., k · k = kk2. Furthermore, for a matrix A, we denote by kAkF, its Frobenious norm and by kAk, its operator norm. We also make the following smoothness assumption on the Hessian of the objective function f , which is a generalization of the assumption in Equation 1.2.

Assumption 6 The function f is twice differentiable and has Lipschitz continuous Hessian i.e., there exists LH > 0 such that

k∇²f (y) − ∇²f (y)k ≤ L^Hky − xk ∀x, y ∈ R^d. It can be easily seen that the above assumption is equivalent to

k∇f (y) − ∇f (x) − ∇²f (x)(y − x)k ≤ L_H

2 ky − xk², (4.1)

|f (y) − f (x) − h∇f (x), y − xi − 1

2hy − x, ∇²f (x)(y − x)i| ≤ L_H

6 ky − xk³. (4.2) Note that such an assumption in standard in the analysis of second-order optimization tech-niques [NP06]. We next describe a general technique for estimating the Hessian of a function based on Stein’s identity in Section 4.1 and use it to provide a zeroth-order cubic regularization method and its analysis in Section 4.2.

4.1 Estimating Hessian with Zeroth-Order Information

Charles Stein, in his seminal paper [Ste72], proposed a method for characterizing Gaussian ran-dom variables. Specifically, a ranran-dom vector, u ∼ N(0, Id), is standard Gaussian if and only if, E [u g(u)] = E [∇g(u)], for all absolutely continuous function g. Note that Stein’s identity, naturally relate function queries (left hand side of Equation 1.9) to gradients (right hand side of Equation 1.9) and thus is naturally suited for zeroth-order optimization. Indeed the Gaussian smoothing tech-nique proposed by [NS17], is based on the Stein’s identity. Indeed, if we let g(u) = f (x + νu) in Equation 1.9, it is easy to see that the identity in Equation 1.3 holds by simply evaluating the Gaussian Stein’s identity in Equation 1.9. Recall that the results in Sections 2 and 3 are essentially based on approximately estimating the gradient information based on the Gaussian smoothing technique [NS17]. In this section, we develop techniques for approximately estimating the Hessian using zeroth-order oracle, based on second-order Stein’s identities. It is worth noting that [Erd16]

also use Stein’s identities to estimate the Hessian but they only work in the restricted framework of generalized linear models with Gaussian data. Our use of Stein’s identity to estimate Hessians, is completely different and we provide Hessian estimators for a general class of non-covnex, smooth functions, even for deterministic functions.

The second-order Gaussian Stein’s identity, that we provide here informally for convenience, states thats E[(uu^⊤− Id) g(u)] = E[∇²g(u)], for all functions g with well-defined Hessians. Similar to first-order Stein’s identity, this naturally relates function queries to Hessians. In order to leverage this, similar to the previous case, we let g(u) = f (x + νu) and note that we have

E (uu^⊤− Id)f (x + νu) ν²

= E[∇²f (x + νu)] = ∇²f_ν(x) = H_fν. (4.3) This provides a way of approximately estimating the Hessian of the function f_ν by approximating the expectation on the left hand side using Gaussian samples. Hence, we can leverage this estimate of Hessian of the smoothed function to get an approximate estimate of Hessian of f . Similar to the gradient-free setting, we now have the following estimates of the Hessian.

H_ν(x, ξ, u) = 1 ν²

uu^⊤− Id

F (x + νu, ξ), Hν(x, ξ, u) = 1

ν²

uu^⊤− Id

[F (x + νu, ξ) − F (x, ξ)] , H_ν(x, ξ, u) = 1

2ν²

uu^⊤− Id

[F (x + νu, ξ) − F (x, ξ) + F (x − νu, ξ) − F (x, ξ)] , (4.4)

H¯_ν = 1 b

i=1

H_ν(x, ξ, u_i). (4.5)

Note that above quantities are all unbiased estimators of H_fν, with the last one, also being the variance reduced version. Note however that the above two estimators, unlike the gradient case, are not robust w.r.t the smoothing parameter ν in the sense that their variances blow up when ν converges to 0. Hence, for the rest of this section we only focus on the Hessian estimator defined in (4.4). The above Hessian-estimator has several advantages that we elaborate now. Recall that for second-order optimization algorithms, for example, cubic regularization, it is important to be able to compute the Hessian matrix operating on a vector v ∈ R^d, efficiently. For the proposed estimator above, such a Hessian-vector product boils down to just inner-product based operations

and could be done in time linear in the dimensionality. To the best of our knowledge, no such estimator for computing the hessian of a function exists. We now present the following results that characterize the estimation and approximation capability of the H_ν(x, ξ, u).

Lemma 4.1 (Variance Bound) Let the Hessian estimator be defined in (4.4) and Assumption 6 hold for F (x, ξ). Then, we have

E[kHν(x, ξ, u)k⁴F] ≤ (d + 16)⁸ L⁴_H(d + 16)²ν⁴

9 + L⁴

. (4.6)

As a consequence, we have

E[kHν(x, ξ, u)k²] ≤ (d + 16)⁴ L²_H(d + 16)ν²

3 + L²

. (4.7)

Proof. Noting (4.4) and Holder’s inequality, we have

E[kHν(x, ξ, u)k⁴F] = E

which together with Assumptions 2, assumption (4.2) for F (x, ξ), and the fact that Eh Moreover, by Holder’s inequality, we have

E[kHν(x, ξ, u)k²] ≤ E[kHν(x, ξ, u)k²F] ≤ E[kHν(x, ξ, u)k⁴F]

1 2 , which together with (4.6), imply (4.7).

Lemma 4.2 (Approximation Error) Under Assumption 6, denoting the Hessian of f by H_f for simplicity, we have

kHf^ν − Hfk ≤ L_Hν(d + 6)⁵²

4 . (4.8)

Proof. Taking y = x + νu in Equation 4.2, note that we have which together with (4.9) and (1.5), imply that

kHf^ν− Hfk ≤ L_Hν

Remark 9 Note that (4.8) is obtained only under Assumption 6. However one could obtain an im-proved bound on the approximation error, by making the more restrictive assumption of interchange-ability of differentiation and expectation as follows: kHf^ν − Hfk = kE[∇²f (x + νu)] − ∇²f (x)k ≤ Ek∇²f (x + νu) − ∇²f (x)k ≤ LHνEkuk ≤ LHν√

d. While this provides an improved dependency on d, we remark that this improvement does not translate to the improvement in the number of zeroth-order oracle calls, at least for the cubic regularized method as discussed in Section 4.2.

Remark 10 Recall from Section 1.1 that one could obtain high-probability results via the approach proposed in [GL13, LZ16] under sub-Gaussian tail assumption on the function F . To allow for functions F , that have heavy-tails, one could also leverage the spectral truncation argument to construct a robust Hessian estimators; see for example, [Min18]. Let φ : R → R be a non-decreasing function such that

− log(1 − x + x²/2) ≤ φ(x) ≤ log(1 + x + x²/2), ∀x ∈ R.

Recall the definition of a spectral function below.

Definition 4.1 Let A ∈ R^d×d be a real symmetric matrix with eigenvalue decomposition A = U ΛU^⊤ where U is the matrix of eigenvectors of A and Λ is a diagonal matrix of eigenvalues (λ₁, . . . , λ_d). A real-valued function φ(·) is a spectral function if it acts on the matrix as follows: Then, we define the robust Hessian estimator as

H(x, ξ, u) =˜ 1

κ · φκ · Hν(x, ξ, u),

Algorithm 7 Zeroth-order Stochastic Cubic Regularized Newton Method

Input: x₀ ∈ R^d, smoothing parameter ν > 0, non-negative sequence α_k, positive integer sequences m_k and b_k, iteration limit N ≥ 1 and probability distribution P^R(·) over {1, . . . , N}.

for k = 1, . . . , N do

1. Generate u^G(H)_k = [u^G(H)_k,1 , . . . , u^G(H)_k,m

k(bk)], where u^G(H)_k,j ∼ N(0, Id), call the stochastic oracle to compute m_k stochastic gradients G^k,jν and b_k stochastic Hessians Hν^k,j according to (1.4) and (4.4), respectively, and take their averages:

G¯^k_ν = 1 m_k

j=1

F (x_k−1+ νu_k,j, ξ_k,j) − F (xk−1, ξ^G_k,j)

ν u^G_k,j, (4.10)

H¯_ν^k= 1 b_k

i=1

[F (x_k−1+ νu^H_k,i, ξ_k,i^H) + F (x_k−1− νu^Hk,i, ξ_k,i^H) − 2F (xk−1, ξ_k,i^H)]

2ν²

u^H_k,i(u^H_k,i)^⊤− Id

. (4.11) 2. Compute

x_k= argmin

x∈R^d

n ˜f^k(x) ≡ ˜f (x, x_k−1, ¯H_ν^k, ¯G^k_ν, α_k)o

, (4.12)

where

f (x, y, H, g, α) = hg, x − yi +˜ 1

2hH(x − y), x − yi +α

6kx − yk³. (4.13) end for

Output: Generate R according to P_R(·) and output zR.

where κ > 0 is a tuning parameter. This provides us with a robust Hessian estimator that al-lows for the function F to have heavy tails. Furthermore, the more standard median-of-means estimator [NY83] provides a robust gradient estimator as well. A thorough treatment of the esti-mation error of the robust Hessian and gradient follows from an analysis similar to that of [Min18]

and [NY83] respectively, although we do not outline the details in the current paper. We also re-mark that while the spectral truncation argument makes the estimator robust, the computational advantage of the vanilla estimator in Equation 4.4 is lost.

4.2 Zeroth-Order Stochastic Cubic Regularized Newton Method

Our goal in this subsection is to provide a second-order algorithmic framework using the estimated gradient and Hessian based on Stein’s identities. In particular, we present a zeroth-order stochastic cubic regularized Newton method in Algorithm 7. Note that the output of this algorithm, similar to the other algorithms presented in this paper for nonconvex problems, is a random index from the generated trajectory. In order to analyze its complexity, we first state a result due to [NP06]

that provides optimality conditions of the cubic regularized subproblem in step 2 of Algorithm 7.

Lemma 4.3 ([NP06]) Let ¯x = argmin

Our next result is the analogous result of Lemma 2.1 for the averaged Hessian matrices.

Lemma 4.4 Let ¯H_ν^k be computed by (4.11), b_k ≥ 4(1 + 2 log 2d). Then under Assumptions 1 and Proof. First, note that by Theorem 1 in [Tro16], we have

E[k ¯H_ν^k− ∇²f_ν(x_k−1)k²] ≤ 2C(d)

which together with the above inequality and the fact that Combining this inequality with (4.8), we obtain (4.14). Moreover, by Holder’s inequality we have

Eh Now, by vector-valued Rosenthal’s inequality (see, for example, Theorem 5.2 in [Pin94]) and (4.6), we obtain

which together with the above inequality and (4.16) imply (4.15).

We now proceed to provide the complexity results for Algorithm 7. We first require two inter-mediate results.

Lemma 4.5 Let {xk} be computed by Algorithm 7. Then under Assumptions 1 and 2, we have pE[kxk− xk−1k²]

≥ max





 s

E[k∇f (xk)k] − δ_k^g− δ_k^H

L_H+ α_k , −2

α_k+ 2L_H

E[λmin ∇²f (x_k)] +q

2(α_k+ LH)δ^H_k





 , (4.17) where δ_k^g, δ^H_k > 0 are chosen such that

E[k∇f (xk−1) − ¯G^k_νk²] ≤ (δk^g)², E[k∇²f (x_k−1) − ¯H_ν^kk³] ≤ 2(LH + α_k)δ_k^H

2 . (4.18) Proof. By the equality condition in Lemma 4.3 and (4.1), we have

k∇f (xk)k ≤ k∇f (xk) − ∇f (xk−1) − ∇²f (x_k−1)(x_k− xk−1)k + k∇f (xk−1) − ¯G^k_νk + k∇²f (x_k−1) − ¯H_ν^kk · kxk− xk−1k + α_k

2 kxk− xk−1k²

≤ (L_H + α_k)

2 kxk− xk−1k²+ k∇f (xk−1) − ¯G^k_νk + k∇²f (x_k−1) − ¯H_ν^kk · kxk− xk−1k

≤ (LH + α_k)kxk− xk−1k²+ k∇f (xk−1) − ¯G^k_νk +k∇²f (x_k−1) − ¯H_ν^kk² 2(LH+ α_k) .

Taking expectation from both sides of the above inequality and noting that δ^g_k, δ_k^H given in (4.18) are well-defined by properly choosing m_k and b_k in Lemmas 2.1 and 4.4, we obtain

E[k∇f (xk)k] − δ_k^g− δ_k^H

LH+ α_k ≤ E[kxk− xk−1k²]. (4.19) Also, by smoothness assumption of the Hessian and the inequality relation in Lemma 4.3

∇²f (x_k) ∇²f (x_k−1) − LHkxk− xk−1kId ∇²f (x_k−1) − ¯H_ν^k+ ¯H_ν^k− LHkxk− xk−1kId

∇²f (x_k−1) − ¯H_ν^k−(α_k+ 2L_H)kxk− xk−1k

2 I_d,

which implies that

(α_k+ 2L_H)kxk− xk−1k

2 ≥ λmin

f (x_k−1) − ¯H_ν^k

− λmin ∇²f (x_k) .

Taking expectation from both sides of the above inequality and noting definition of δ_k^H in (4.18), we obtain

pE[kxk− xk−1k²] ≥ E[kxk− xk−1k] ≥ −2 α_k+ 2L_H

2(α_k+ L_H)δ_k^H+ E[λ_min ∇²f (x_k)]

. Combining the above inequality with (4.19), we obtain (4.17).

Lemma 4.6 Let {xk} be computed by Algorithm 7 for a given iteration limit N ≥ 1. Then under Assumptions 1 and 2, we have

E[kxR− xR−1k³] ≤ 36 where R is an integer random variable whose probability distribution PR(·) is supported on {1, . . . , N}

and given by Moreover, by Lemma 4.3, we have

f˜^k(x_k) = −1

2h ¯H_ν^k(x_k− xk−1), (x_k− xk−1)i − α_k

3 kxk− xk−1k³ ≤ −α_k

12kxk− xk−1k³. Combining the above two relations, we obtain

α_k

where the last inequality follows from the Young’s inequality. Taking expectation from both sides, re-arranging the terms, and noting (4.18), we obtain

α_k Summing up the above inequalities, dividing both sides by PN

k=1α_k, and noting (4.21), we obtain (4.20).

Theorem 4.1 Let {xk} be computed by Algorithm 7 for a given iteration limit N ≥ 1. Moreover, assume that the parameters are set to

α_k = L_H, ν ≤ 1

Then under Assumptions 1 and 2, we have 5√

ǫ ≥ max

pE[k∇f(xR)k], −5 8√

L_HE[λ_min ∇²f (x_R)]

, (4.23)

where R is uniformly distributed over {1, . . . , N}. As a consequence, to obtain an ǫ second-order stationary point of the problem, the total number of samples required to compute the gradient and Hessian are, respectively, bounded by

O d ǫ⁷²

, O˜ d⁴ ǫ⁵²

Proof. First, note that by (4.22), Lemmas 2.1, and 4.4, we can ensure that (4.18) is satisfied by δ_k^g = 2ǫ/5 and δ^H_k = ǫ/138. Moreover, by Lemma 4.6, we have

E[kxR− xR−1k³] ≤ 1 L

3 2

12√

L_H(f (x₀) − f^∗)

N + 7ǫ³²

Hence, by choosing N according (4.22), and noting Lemma 4.6, we obtain (4.23). Therefore, x_R is an 4ǫ second-order stationary point of the problem. Finally, note that the total number of required samples to obtain such a solution is bounded by

k=1

m_k= O d ǫ⁷²

k=1

b_k = ˜O d⁴ ǫ⁵²

Remark 11 Note that [TSJ⁺17] provide a high-probability complexity result for stochastic Newton method with inexact gradient and Hessian information of the order ˜O ǫ^−3.5. This dependence on ǫ is better compared to algorithms that only use stochastic first-order information to avoid sad-dle points. They mainly focus on sub-sampled Newton method common in the finite-sum setting and require their stochastic Hessians to be almost-surely bounded. However, this assumption does not imply the zeroth-order Hessian estimators in Equation 4.4 are bounded almost-surely, which complicates the analysis.

Remark 12 Note that by Theorem 4.1, the total number of calls to the zeroth-order oracle is of the order ˜O _ǫ^3.5^d when ǫ ≈ d⁻³. This shows the advantage of using the (estimated) second-order information for converging to high-accuracy second-order stationary points. The linear dependence on d is the price to pay for having access to only zeroth-order information, similar to the pre-vious sections. Furthermore, depending on the quality of the solution required, a wide variety of intermediate complexity results are possible, thereby providing practical flexibility.

5 Discussion

In this work, we propose and analyze zeroth-order stochastic approximation algorithms for con-vex and nonconcon-vex problems motivated by modern machine learning challenges. Specifically, we

provide zeroth-order algorithms to deal with constraints, dimensionality and saddle-points in non-convex stochastic optimization problems. While our focus was on general stochastic optimization problems, one could naturally obtain better rates in the case of finite-sum optimization problems with various variance reduction techniques. Several concrete extensions are possible for future work. The performance of conditional gradient algorithm in the high-dimensional constrained op-timization setting is not well-explored; the interaction between the geometry of the constraint set, sparsity structure and zeroth-order information is extremely interesting to explore. Obtaining re-gret bounds for the non-convex problems considered in this work is more challenging. Furthermore, lower bounds can be explored for the cases considered in this paper when f is nonconvex. Finally, obtaining second-order stationarity results in the constrained setting is more challenging. We plan to extend our results for these setting in the future.

References

[AZ18] Zeyuan Allen-Zhu. Natasha 2: Faster non-convex optimization than sgd. In Advances in Neural Information Processing Systems, pages 2680–2691, 2018.

[BCB12] S´ebastien Bubeck and Nicolo Cesa-Bianchi. Regret analysis of stochastic and non-stochastic multi-armed bandit problems. Foundations and Trends in Machine Learn-R

ing, 5(1):1–122, 2012.

[Bec17] Amir Beck. First-Order Methods in Optimization, volume 25. SIAM, 2017.

[Ber16] Dimitri P Bertsekas. Nonlinear programming. Athena scientific Belmont, 2016.

[BNS16] Srinadh Bhojanapalli, Behnam Neyshabur, and Nati Srebro. Global optimality of local search for low rank matrix recovery. In Advances in Neural Information Processing Systems, pages 3873–3881, 2016.

[BS15] Dimitri P Bertsekas and Athena Scientific. Convex optimization algorithms. Athena Scientific Belmont, 2015.

[BTN01] Ahron Ben-Tal and Arkadi Nemirovski. Lectures on modern convex optimization: anal-ysis, algorithms, and engineering applications, volume 2. Siam, 2001.

[BV04] Stephen Boyd and Lieven Vandenberghe. Convex optimization. Cambridge university press, 2004.

[CDHS18] Yair Carmon, John C Duchi, Oliver Hinder, and Aaron Sidford. Accelerated methods for nonconvex optimization. SIAM Journal on Optimization, 28(2):1751–1772, 2018.

[CGT11a] Coralia Cartis, Nicholas IM Gould, and Philippe L Toint. Adaptive cubic regularisation methods for unconstrained optimization. part i: motivation, convergence and numerical results. Mathematical Programming, 127(2):245–295, 2011.

[CGT11b] Coralia Cartis, Nicholas IM Gould, and Philippe L Toint. Adaptive cubic regularisation methods for unconstrained optimization. part ii: worst-case function-and derivative-evaluation complexity. Mathematical programming, 130(2):295–319, 2011.

[CGT18] Coralia Cartis, Nick IM Gould, and Philippe L Toint. Second-order optimality and beyond: Characterization and evaluation complexity in convexly constrained nonlinear optimization. Foundations of Computational Mathematics, 18(5):1073–1107, 2018.

[CRS⁺18] Krzysztof Choromanski, Mark Rowland, Vikas Sindhwani, Richard Turner, and Adrian Weller. Structured evolution with compact architectures for scalable policy optimiza-tion. In Proceedings of the 35th International Conference on Machine Learning. PMLR, 2018.

[CSV09] Andrew Conn, Katya Scheinberg, and Luis Vicente. Introduction to derivative-free optimization, volume 8. Siam, 2009.

[CZS⁺17] Pin-Yu Chen, Huan Zhang, Yash Sharma, Jinfeng Yi, and Cho-Jui Hsieh. Zoo: Zeroth order optimization based black-box attacks to deep neural networks without training substitute models. In Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, pages 15–26. ACM, 2017.

[DJWW15] John Duchi, Michael Jordan, Martin Wainwright, and Andre Wibisono. Optimal rates for zero-order convex optimization: The power of two function evaluations. IEEE Transactions on Information Theory, 61(5):2788–2806, 2015.

[DR70] V. Demyanov and A. Rubinov. Approximate methods in optimization problems. Amer-ican Elsevier Publishing Co, 1970.

[Erd16] Murat A Erdogdu. Newton-stein method: an optimization method for glms via stein’s lemma. The Journal of Machine Learning Research, 17(1):7565–7616, 2016.

[FW56] Marguerite Frank and Philip Wolfe. An algorithm for quadratic programming. Naval Research Logistics Quarterly, 3:95–110, 1956.

[Gha18] Saeed Ghadimi. Conditional gradient type methods for composite nonlinear and stochastic optimization. Mathematical Programming, 2018.

[GHJY15] Rong Ge, Furong Huang, Chi Jin, and Yang Yuan. Escaping from saddle pointsonline stochastic gradient for tensor decomposition. In Conference on Learning Theory, pages 797–842, 2015.

[GL13] S. Ghadimi and G. Lan. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341–2368, 2013.

[GLM16] Rong Ge, Jason D Lee, and Tengyu Ma. Matrix completion has no spurious local minimum. In Advances in Neural Information Processing Systems, pages 2973–2981, 2016.

[Hea82] Donald Hearn. The gap function of a convex program. Operations Research Letters, 2, 1982.

[HK12] Elad Hazan and Satyen Kale. Projection-free online learning. In Proceedings of the 29th International Coference on International Conference on Machine Learning, pages 1843–1850. Omnipress, 2012.

[HL16] Elad Hazan and Haipeng Luo. Variance-reduced and projection-free stochastic opti-mization. In International Conference on Machine Learning, pages 1263–1271, 2016.

[Jag13] Martin Jaggi. Revisiting frank-wolfe: Projection-free sparse convex optimization. In ICML (1), pages 427–435, 2013.

[JGN⁺17] Chi Jin, Rong Ge, Praneeth Netrapalli, Sham M Kakade, and Michael I Jordan. How to escape saddle points efficiently. In International Conference on Machine Learning, pages 1724–1732, 2017.

[JK17] Prateek Jain and Purushottam Kar. Non-convex optimization for machine learning.

Foundations and Trends in Machine Learning, 10(3-4):142–336, 2017.R

[JNR12] Kevin Jamieson, Robert Nowak, and Ben Recht. Query complexity of derivative-free optimization. In Advances in Neural Information Processing Systems, pages 2672–2680, 2012.

[JTK14] Prateek Jain, Ambuj Tewari, and Purushottam Kar. On iterative hard threshold-ing methods for high-dimensional m-estimation. In Advances in Neural Information Processing Systems, pages 685–693, 2014.

[KK19] Kenji Kawaguchi and Leslie Pack Kaelbling. Elimination of all bad local minima in deep learning. arXiv preprint arXiv:1901.00279, 2019.

[LZ16] Guanghui Lan and Yi Zhou. Conditional gradient sliding for convex optimization.

SIAM Journal on Optimization, 26(2):1379–1409, 2016.

[MGR18] Horia Mania, Aurelia Guy, and Benjamin Recht. Simple random search provides a competitive approach to reinforcement learning. In Advances in Neural Information Processing Systems, 2018.

[MHK18a] Aryan Mokhtari, Hamed Hassani, and Amin Karbasi. Conditional gradient method for stochastic submodular maximization: Closing the gap. In International Conference on Artificial Intelligence and Statistics, pages 1886–1895, 2018.

[MHK18b] Aryan Mokhtari, Hamed Hassani, and Amin Karbasi. Stochastic conditional gradient methods: From convex minimization to submodular maximization. arXiv preprint arXiv:1804.09554, 2018.

[Min18] Stanislav Minsker. Sub-gaussian estimators of the mean of a random matrix with heavy-tailed entries. The Annals of Statistics, 46(6A):2871–2903, 2018.

[MK87] Katta G Murty and Santosh N Kabadi. Some np-complete problems in quadratic and nonlinear programming. Mathematical programming, 39(2):117–129, 1987.

[Moc12] Jonas Mockus. Bayesian approach to global optimization: theory and applications, volume 37. Springer Science & Business Media, 2012.

[Nes04] Y. E. Nesterov. Introductory Lectures on Convex Optimization: a basic course. Kluwer Academic Publishers, Massachusetts, 2004.

[Nes13] Yurii Nesterov. Introductory lectures on convex optimization: A basic course, vol-ume 87. Springer Science & Business Media, 2013.

[NP06] Yurii Nesterov and Boris Polyak. Cubic regularization of newton method and its global performance. Mathematical Programming, 108(1):177–205, 2006.

[NS17] Yurii Nesterov and Vladimir Spokoiny. Random gradient-free minimization of convex functions. Foundations of Computational Mathematics, 17(2):527–566, 2017.

[NW06] Jorge Nocedal and Stephen J Wright. Nonlinear Equations. Springer, 2006.

[NY83] A. S. Nemirovski and D. Yudin. Problem complexity and method efficiency in opti-mization. Wiley-Interscience Series in Discrete Mathematics. John Wiley, XV, 1983.

[Pin94] Iosif Pinelis. Optimum bounds for the distributions of martingales in banach spaces.

The Annals of Probability, pages 1679–1706, 1994.

[RK16] Reuven Rubinstein and Dirk Kroese. Simulation and the Monte Carlo method, vol-ume 10. John Wiley & Sons, 2016.

[RSPS16] Sashank Reddi, Suvrit Sra, Barnab´as P´oczos, and Alexander Smola. Stochastic Frank-Wolfe Methods for Nonconvex Optimization. 2016 54th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pages 1244–1251, 2016.

[RZS⁺18] Sashank Reddi, Manzil Zaheer, Suvrit Sra, Barnabas Poczos, Francis Bach, Ruslan Salakhutdinov, and Alex Smola. A generic approach for escaping saddle points. In International Conference on Artificial Intelligence and Statistics, pages 1233–1242, 2018.

[Sha13] Ohad Shamir. On the complexity of bandit and derivative-free stochastic convex opti-mization. In Conference on Learning Theory, pages 3–24, 2013.

[SHC⁺17] Tim Salimans, Jonathan Ho, Xi Chen, Szymon Sidor, and Ilya Sutskever. Evolu-tion strategies as a scalable alternative to reinforcement learning. arXiv preprint arXiv:1703.03864, 2017.

[SLA12] Jasper Snoek, Hugo Larochelle, and Ryan Adams. Practical bayesian optimization of machine learning algorithms. In Advances in neural information processing systems,

In document arxiv: v2 [math.oc] 14 Jan 2019 (Page 22-35)