Efficient optimization with mirror descent

Next, we propose a specialized iterative algorithm for solving the optimization problem (6.11). First, however, we introduce the mirror descent method as a generic tool for convex optimization.

6.4.1 Mirror descent

Mirror descent (MD) is an optimization procedure that generalizes subgradient methods to non-Euclidean spaces. For an optimization problem $\min_{x \in X} f(x)$, each step tries to minimize a local linearization of $f$ while staying close to the current iterate.

It uses a differentiable mirror map $\omega(\cdot)$ to measure locality, which must be 1-strongly convex with respect to a norm $\|\cdot\|$. Significantly, MD attains the optimal convergence rate for non-smooth convex functions in the black-box model (Ben-Tal and Nemirovski, 2015).

Mirror descent is given by the recurrence

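$$x_0 = \arg\min_{x \in X} \omega(x), \qquad x_{t+1} = \mathrm{Prox}_{x_t}\bigl(\gamma_t f'(x_t)\bigr),$$

where $f'(x_t)$ is a subgradient of $f$ at $x_t$, $\gamma_t$ are step sizes, and the proximity operator is defined as

$$\mathrm{Prox}_x(\psi) = \arg\min_{y \in X} \bigl\{ \omega(y) + \langle \psi - \omega'(x), y \rangle \bigr\}.$$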

This proximity operator moves from $x$ in the direction of $-\psi$ while staying close to the original point $x$.
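As a concrete instance of the recurrence, the following is a minimal sketch under stated assumptions: $X$ is taken to be the probability simplex and $\omega(x) = \sum_i x_i \log x_i$ the negative entropy, in which case the proximity operator has the closed form $\mathrm{Prox}_x(\psi) \propto x \odot \exp(-\psi)$. The function names, the constant step size, and the linear test objective are illustrative choices, not part of the original text.

```python
import numpy as np

def prox_entropy(x, psi):
    """Prox_x(psi) on the simplex for omega(x) = sum_i x_i log x_i (assumed mirror map)."""
    logits = np.log(x) - psi
    logits -= logits.max()              # shift for numerical stability; cancels on normalization
    y = np.exp(logits)
    return y / y.sum()

def mirror_descent_simplex(subgrad, n, steps, gamma):
    """Iterate x_{t+1} = Prox_{x_t}(gamma * f'(x_t)); return the step-size-weighted average."""
    x = np.full(n, 1.0 / n)             # x_0 = argmin of omega over the simplex (uniform)
    weighted_sum = np.zeros(n)
    for _ in range(steps):
        x = prox_entropy(x, gamma * subgrad(x))
        weighted_sum += gamma * x
    return weighted_sum / (gamma * steps)

# Illustrative objective f(x) = <c, x>, minimized at the vertex with the smallest c_i.
c = np.array([0.9, 0.2, 0.5, 0.7])
x_bar = mirror_descent_simplex(lambda x: c, n=4, steps=500, gamma=0.1)
gap = float(c @ x_bar - c.min())        # suboptimality of the averaged iterate
```

The averaged iterate concentrates on the coordinate with the smallest cost, and `gap` shrinks as the number of steps grows.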

With the above steps, letting

$$\bar{x}_T = \frac{\sum_{t=1}^{T} \gamma_t x_t}{\sum_{t=1}^{T} \gamma_t}$$

and choosing step sizes appropriately, it is shown in Theorem 5.3.1 of (Ben-Tal and Nemirovski, 2015) that

$$f(\bar{x}_T) - \min_{x \in X} f(x) \le \frac{\Omega\, L(f)}{\sqrt{T}},$$

where $L(f)$ is the Lipschitz constant of $f$ w.r.t. the considered norm on $X$ and $\Omega$ is related to the radius of $X$ w.r.t. $\omega(\cdot)$ (e.g. $\Omega \le \sqrt{2(\max \omega(\cdot) - \min \omega(\cdot))}$). We refer the reader to Section 5.3 of (Ben-Tal and Nemirovski, 2015) or (Bubeck, 2016) for a more comprehensive treatment of the general mirror descent scheme.

In the spectrahedron setup (i.e. for the minimization $\min_{M \in \Delta_n} f(M)$), the negative von Neumann entropy of a matrix, $\omega(M) = \sum_{i=1}^{n} \lambda_i \log \lambda_i$, can be chosen as the mirror map, where $\lambda_i$ are the eigenvalues of $M$. Notice that this map is 1-strongly convex with respect to the $\ell_1$ norm of the eigenvalues, i.e., to the matrix trace norm. Working out the proximal mapping, this gives us the following multiplicative update rule (cf. part 2 of (Bubeck, 2016)):

$$M_{t+1} \propto \exp\bigl(\log M_t - \gamma_t f'(M_t)\bigr), \tag{6.12}$$

with matrix exponential and logarithm, $M_0 = \frac{1}{n} I_n$, and the right-hand side normalized to unit trace to obtain $M_{t+1}$. Here $L(f)$ is the Lipschitz constant of $f$ w.r.t. the matrix trace norm. In this setup it also follows that the radius $\Omega$ of $\Delta_n$ satisfies $\Omega = O(\sqrt{\log n})$ (Ben-Tal and Nemirovski, 2015; Bubeck, 2016).
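The update (6.12) can be sketched in a few lines of NumPy. This is an illustration under stated assumptions rather than a reference implementation: it uses a constant step size, computes the matrix exponential by a dense eigendecomposition, and keeps $\log M_t$ as the state (up to an additive multiple of the identity, which the trace normalization removes) so that no matrix logarithm is evaluated. The linear test objective $f(M) = C \cdot M$, whose minimum over $\Delta_n$ is $\lambda_{\min}(C)$, and all function names are hypothetical.

```python
import numpy as np

def sym_expm_normalized(A):
    """Return exp(A) / Tr(exp(A)) for a symmetric matrix A."""
    w, V = np.linalg.eigh(A)
    w = np.exp(w - w.max())             # the shift by w.max() cancels after normalization
    E = (V * w) @ V.T
    return E / np.trace(E)

def matrix_md_min(grad, n, steps, gamma):
    """Entropic mirror descent on the spectrahedron; returns the averaged iterate."""
    log_M = np.zeros((n, n))            # log M_0 = -log(n) I, a multiple of I, so use 0
    M = np.eye(n) / n                   # M_0 = I / n
    avg = np.zeros((n, n))
    for _ in range(steps):
        log_M = log_M - gamma * grad(M) # log M_t - gamma * f'(M_t)
        M = sym_expm_normalized(log_M)  # exponentiate and normalize to unit trace
        avg += M
    return avg / steps

# Illustrative problem: minimize C . M over unit-trace PSD matrices.
rng = np.random.default_rng(0)
B = rng.standard_normal((5, 5))
C = (B + B.T) / 2
M_bar = matrix_md_min(lambda M: C, n=5, steps=2000, gamma=0.05)
gap = float(np.trace(C @ M_bar) - np.linalg.eigvalsh(C)[0])
```

Here `gap` measures how far the averaged iterate is from the smallest eigenvalue of `C`; it is nonnegative because `M_bar` lies in the spectrahedron.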

Note that (6.12) is written for the general minimization problem, while we consider a maximization problem, in which case the update step can be written as

$$M_{t+1} \propto \exp\Bigl(\sum_{\tau=1}^{t} \alpha_\tau f'(M_\tau)\Bigr), \tag{6.13}$$

for some weights $\alpha_\tau$, where we also unrolled the recursion.
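The unrolling step can be checked directly. The sketch below works with the minimization form (6.12) and uses two facts: normalizing to unit trace only adds a multiple of the identity inside the exponential, since $\exp(A + cI_n) = e^{c}\exp(A)$, and $\log M_0 = -\log(n)\, I_n$ is itself a multiple of the identity. The scalars $c_\tau$ are introduced here only for the derivation.

```latex
\begin{align*}
\log M_{t+1} &= \log M_t - \gamma_t f'(M_t) - c_t I_n,
  \qquad c_t = \log \operatorname{Tr} \exp\bigl(\log M_t - \gamma_t f'(M_t)\bigr), \\
\log M_{t+1} &= \log M_0 - \sum_{\tau \le t} \gamma_\tau f'(M_\tau)
  - \Bigl(\sum_{\tau \le t} c_\tau\Bigr) I_n, \\
M_{t+1} &\propto \exp\Bigl(-\sum_{\tau \le t} \gamma_\tau f'(M_\tau)\Bigr).
\end{align*}
```

Flipping the sign for a maximization problem gives the form of (6.13), with the weights $\alpha_\tau$ playing the role of the step sizes $\gamma_\tau$, up to the indexing convention used for the first iterate.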

6.4.2 Solving the subgraph problem with mirror descent

In this section we propose a mirror descent-based iterative algorithm for solving (6.11). We first consider a modification of our original SDP obtained by adding a violation variable $s \ge 0$. For some fixed penalty value $p \ge 0$ we write

$$\max_{M \in \Delta_n,\, s} \; C \cdot M - ps \quad \text{s.t.} \quad Q_\gamma(M) + sD \succeq 0. \tag{6.14}$$

Recalling that $D$ is the degree matrix of $G$, the term $sD$ provides a measure of how violated the SDP constraint is. If $s \ge \gamma^2$, it is possible to prove that the SDP constraint is trivially satisfied for any $M$ in $\Delta_n$ at a cost of $ps$ in the objective (which follows from the fact that $L_{\mathrm{Star}(r)} \preceq 2D$). To avoid such trivial solutions, we set $p \ge \frac{4\,\mathrm{OPT}}{\gamma^2}$, where OPT is the optimal value of the cost function $C \cdot M$. This means that in a solution with nonnegative cost, $s$ can be at most $\gamma^2/4$, since $C \cdot M - ps \ge 0$ implies $s \le \mathrm{OPT}/p$. In practice we can replace OPT with $\|C\|$, which is an upper bound on the optimal value.

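Introducing the Lagrange multiplier $Y \succeq 0$ corresponding to the constraint $Q_\gamma(M) + sD \succeq 0$, we then obtain the saddle point problem

$$\max_{M \in \Delta_n,\, M \ge 0,\, s} \; \min_{Y \succeq 0} \; C \cdot M - ps + Y \cdot \bigl(Q_\gamma(M) + sD\bigr),$$

from which we obtain the dual

$$\min_{Y \in \Delta_n^{p,D}} f(Y), \quad \text{where} \quad f(Y) = \max_{M \in \Delta_n,\, M \ge 0} \bigl(C + P_\gamma(Y)\bigr) \cdot M,$$

and we defined $\Delta_n^{p,D}$ to be the scaled spectrahedron $\{X \succeq 0 : D \cdot X = p\}$ w.r.t. the degree matrix $D$.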

As is standard in the spectrahedron setup $\max_{X \in \Delta_n} f(X)$, we will use the negative von Neumann entropy $\omega(X) = \sum_{i=1}^{n} \lambda_i \log \lambda_i$ as our mirror map. Finally, to apply mirror descent, we need access to the gradient of $f$ at $Y^{(t)}$, for which we utilize Danskin's theorem (Ben-Tal and Nemirovski, 2015), stated below.

Theorem 6.4.1 (Danskin's Theorem). Let $f(x) = \max_z g(x, z)$, where $g(\cdot, z)$ is a convex function for all $z$. Define $Z_0(x) = \{z_0 : g(x, z_0) = \max_z g(x, z)\}$ to be the set of maximizers $z$ given a point $x$. Then, under certain regularity conditions, the subdifferential of $f$ at $x$ is given by

$$\partial f(x) = \mathrm{conv}\,\{\partial g(x, z) : z \in Z_0(x)\}.$$
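The theorem can be sanity-checked numerically on a simplified instance. The sketch below assumes $f(Y) = \max_{M \in \Delta_n} (C + Y) \cdot M = \lambda_{\max}(C + Y)$, i.e. it replaces $P_\gamma$ by the identity map purely for illustration. When the top eigenvalue is simple the maximizer $M = vv^\top$ is unique, so Danskin's theorem predicts $\nabla f(Y) = vv^\top$, which is compared against a central finite difference along a symmetric direction `H`. The matrices are synthetic and chosen so that the top eigenvalue is well separated.

```python
import numpy as np

def f_and_grad(C, Y):
    """f(Y) = lambda_max(C + Y) and its Danskin gradient v v^T (top eigenvector v)."""
    w, V = np.linalg.eigh(C + Y)        # eigh returns eigenvalues in ascending order
    v = V[:, -1]
    return w[-1], np.outer(v, v)

def random_symmetric(rng, n):
    A = rng.standard_normal((n, n))
    return (A + A.T) / 2

rng = np.random.default_rng(1)
n = 6
C = 2.0 * np.diag(np.arange(n, dtype=float))   # eigenvalues 0, 2, ..., 10 (clear gap)
Y = 0.1 * random_symmetric(rng, n)             # small perturbation keeps the gap
H = random_symmetric(rng, n)                   # symmetric test direction

_, G = f_and_grad(C, Y)
h = 1e-5
finite_diff = (f_and_grad(C, Y + h * H)[0] - f_and_grad(C, Y - h * H)[0]) / (2 * h)
analytic = float(np.sum(G * H))                # directional derivative <grad f(Y), H>
```

The two quantities `finite_diff` and `analytic` agree up to discretization error, which is what the theorem predicts when the set of maximizers is a single point.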

Then, using the above theorem, $\nabla_Y f(Y)$ is given by

$$\nabla_Y f(Y^{(t)}) = Q_\gamma(M^{(t)}), \quad \text{where} \quad M^{(t)} = \arg\max_{M \in \Delta_n,\, M \ge 0} \bigl(C + P_\gamma(Y^{(t)})\bigr) \cdot M.$$

Hence, computation of the gradient requires finding $M^{(t)}$, which plays the role of the primal update at time $t$. However, this is just the rank-1 matrix given by the projection onto the top eigenvector of $C + P_\gamma(Y^{(t)})$, where $M \ge 0$ is once again ensured by Perron-Frobenius. Using the definition of the mirror map, for a step size $\eta$, we obtain the well-known multiplicative update rule for $Y$ (see (Kivinen and Warmuth, 1997)),

$$Y^{(t+1)} \propto \exp\bigl(\log Y^{(t)} - \eta\, Q_\gamma(M^{(t)})\bigr),$$

where $Y^{(t+1)}$ is normalized so that $D \cdot Y^{(t+1)} = p$ and $Y^{(0)} = \frac{p}{\mathrm{Tr}(D)} I_n$. We can also avoid computing the matrix logarithm and unwind the recursion to get

$$Y^{(t+1)} \propto \exp\Bigl(-\eta \sum_{\tau=1}^{t} Q_\gamma(M^{(\tau)})\Bigr), \tag{6.15}$$

to directly compute $Y^{(t+1)}$ from a running sum of $Q_\gamma(M^{(\tau)})$ matrices. We formally present the resulting algorithm in Algorithm 2, where the $\mathrm{eig}(\cdot)$ operator returns the eigenvector of the operand corresponding to the largest eigenvalue.

Algorithm 2 Mirror descent for connected subgraph detection
  Input: $C, p, r, \gamma, \eta, \epsilon$
  Output: $\hat{M}, \hat{S}$
  $Y^{(0)} \leftarrow \frac{p}{\mathrm{Tr}(D)} I_n$
  $G^{(0)} \leftarrow 0$
  for $t = 1, \dots, T$ do
    $v \leftarrow \mathrm{eig}\bigl(C + P_\gamma(Y^{(t-1)})\bigr)$
    $M^{(t)} \leftarrow v v^\top$
    $G^{(t)} \leftarrow G^{(t-1)} + Q_\gamma(M^{(t)})$
    $Y^{(t)} \leftarrow \exp\bigl(-\eta\, G^{(t)}\bigr)$
    $Y^{(t)} \leftarrow \frac{p}{D \cdot Y^{(t)}}\, Y^{(t)}$
  end for
  $\hat{M} \leftarrow \frac{1}{T} \sum_{t=1}^{T} M^{(t)}$
  $\hat{S} \leftarrow \{i : \hat{M}_{ii} > \epsilon\}$
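The loop of Algorithm 2 can be sketched in NumPy as follows. This is a minimal illustration and not the implementation used in this work: the operators $Q_\gamma$ and $P_\gamma$ are defined elsewhere in the text, so here they are passed in as callables `Q` and `P` (assumed linear and adjoint to each other), dense eigendecompositions stand in for the fast approximate solvers, and the function names are hypothetical. The toy inputs at the bottom (a 5-node path graph with `Q(M) = -M`) are placeholders chosen only so that the code runs.

```python
import numpy as np

def top_eigvec(A):
    """eig(.): unit eigenvector for the largest eigenvalue of a symmetric matrix."""
    _, V = np.linalg.eigh(A)
    return V[:, -1]

def scaled_expm(A, D, p):
    """Return exp(A) for symmetric A, rescaled so that D . Y = Tr(D Y) = p."""
    w, V = np.linalg.eigh(A)
    E = (V * np.exp(w - w.max())) @ V.T    # the shift cancels in the rescaling
    return (p / np.sum(D * E)) * E

def mirror_descent_subgraph(C, D, Q, P, p, eta, eps, T):
    n = C.shape[0]
    Y = (p / np.trace(D)) * np.eye(n)      # Y^(0)
    G = np.zeros((n, n))                   # running sum of Q(M^(tau))
    M_sum = np.zeros((n, n))
    for _ in range(T):
        v = top_eigvec(C + P(Y))
        M = np.outer(v, v)                 # rank-1 primal iterate M^(t)
        G += Q(M)
        Y = scaled_expm(-eta * G, D, p)    # unrolled update, normalized to D . Y = p
        M_sum += M
    M_hat = M_sum / T
    S_hat = np.flatnonzero(np.diag(M_hat) > eps)   # truncate the diagonal entries
    return M_hat, S_hat, Y

# Toy placeholder instance (assumption): path graph on 5 nodes, Q(M) = -M, P(Y) = -Y.
A = np.diag(np.ones(4), 1)
A = A + A.T
D = np.diag(A.sum(axis=1))
M_hat, S_hat, Y = mirror_descent_subgraph(
    C=A, D=D, Q=lambda M: -M, P=lambda Y: -Y, p=1.0, eta=0.5, eps=0.05, T=50)
```

Keeping only the running sum `G` mirrors the unrolled update: each iteration needs one top-eigenvector computation and one matrix exponential, and no matrix logarithm.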

Using standard results on the convergence of mirror descent (Ben-Tal and Nemirovski, 2015), which we noted in the previous section and which are applicable thanks to our careful formulation of the problem, we obtain the following convergence bound.

Theorem 6.4.2. Setting $\eta = \frac{1}{4\,\mathrm{OPT}}$, Algorithm 2 converges to an $\epsilon$-multiplicative approximation of the optimum in $T = O\bigl(\frac{\log n}{\gamma^2 \epsilon^2}\bigr)$ steps.

Moreover, each iteration consists of computing the top eigenvector of a nonnegative matrix and the matrix exponential of the sum of a Laplacian and a rank-1 term (cf. Lemma 6.3.2). Thanks to recent theoretical results, both of these objects can be approximated sufficiently closely in nearly-linear time (Orecchia et al., 2012; Cohen et al., 2016) by exploiting well-known dimensionality reduction techniques and numerical linear algebra results. In practice, existing iterative solvers, combined with the use of the Johnson-Lindenstrauss lemma to keep a low-dimensional sketch of the matrix exponential, already provide a very efficient computational approach to this problem.
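The sketching idea can be illustrated as follows. Since $\exp(A) = \exp(A/2)\exp(A/2)^\top$ for symmetric $A$, multiplying $\exp(A/2)$ by a random $n \times k$ Gaussian matrix $R$ with $\mathbb{E}[RR^\top] = I_n$ gives an $n \times k$ factor $S$ with $SS^\top \approx \exp(A)$. This is only a hedged sketch of the principle: for simplicity it forms $\exp(A/2)$ by a dense eigendecomposition, whereas a nearly-linear-time method would apply $\exp(A/2)$ to the $k$ columns of $R$ with an iterative routine. The function name, the cycle-graph test matrix, and the sketch size are assumptions made for the example.

```python
import numpy as np

def sketched_expm(A, k, rng):
    """Return S (n x k) such that S @ S.T approximates exp(A) for symmetric A."""
    n = A.shape[0]
    w, V = np.linalg.eigh(A / 2.0)
    half = (V * np.exp(w)) @ V.T                  # exp(A/2); half @ half = exp(A)
    R = rng.standard_normal((n, k)) / np.sqrt(k)  # Gaussian sketch with E[R R^T] = I_n
    return half @ R

# Illustrative input: A = -0.1 * Laplacian of a cycle graph on n nodes.
n, k = 200, 50
P = np.roll(np.eye(n), 1, axis=1)                 # cyclic shift; P + P.T is the adjacency
L = 2 * np.eye(n) - P - P.T
A = -0.1 * L

S = sketched_expm(A, k, np.random.default_rng(0))
approx_trace = float(np.sum(S * S))               # Tr(S S^T), estimate of Tr(exp(A))
exact_trace = float(np.exp(np.linalg.eigvalsh(A)).sum())
```

The sketch stores `n * k` numbers instead of `n * n`, and quantities such as the trace used for normalization can be estimated from `S` alone. SciPy's `scipy.sparse.linalg.expm_multiply` is one existing routine that applies a matrix exponential to a block of vectors without forming it.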

In this work, we did not perform a theoretical study of the approximation guarantees achievable when rounding our relaxation to an integral solution in the worst case, as this is a much more challenging mathematical task. Instead, we empirically tested a number of different rounding techniques, including random projections and truncating the diagonal entries.
