

Riemannian gradient in the direction $\xi_x$, i.e., $\mathrm{D}\,\mathrm{grad}_x f[\xi_x]$. The second term is the correction term that corresponds to the manifold structure and the metric.

• $\mathrm{D}\,\mathrm{grad}_x f[\xi_{\bar{x}}]$: The computational cost depends on the cost function $f$ and its first-order derivative.

• Correction term: It involves matrix multiplications with a total cost of $O(np^2 + p^3)$.

All the manifold-related operations are of linear complexity in $n$ and $m$, and cubic in $p$. For the case of interest, $p \ll \min(n, m)$, these operations are computationally very efficient. The ingredients that depend on the problem at hand are the evaluation of the cost function $f$, the computation of its first-order derivative, and the computation of its directional derivative along a search direction. In Section 5.6, these computations are worked out for two specific examples, low-rank matrix completion and multivariate regression, where we exploit the least-squares nature of the cost function.

5.4 An optimization scheme for the trace norm regularized convex problem (5.1)

Starting with a rank-1 problem, we alternate a second-order local optimization algorithm on the fixed-rank manifold with a first-order rank-one update to obtain an algorithm for the convex problem with trace norm penalty (5.1). The scheme is shown in Table 5.2. The rank-one update decreases the cost, and the updated iterate lies in $\mathcal{M}_{p+1}$.
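The alternation described above can be sketched in a few lines of NumPy. This is only an illustrative sketch under strong assumptions, not the implementation used in this chapter: the names `trace_norm_scheme`, `grad_F` and `solve_fixed_rank` are hypothetical, the fixed-rank step is a closed-form stand-in that is valid only for the toy cost $F(X) = \frac{1}{2}\|X - A\|_F^2$ (for which $L_F = 1$), the step size is the value $(\sigma_1 - \lambda)/L_F$ from Proposition 5.4 without any linesearch, and a dense SVD replaces the cheaper rank-one SVD update.

```python
import numpy as np

def trace_norm_scheme(grad_F, solve_fixed_rank, L_F, lam, U, B, V, eps=1e-8):
    """Alternate fixed-rank solves with rank-one updates (hypothetical sketch)."""
    while True:
        # Step 1: critical point of the fixed-rank problem at the current rank.
        U, B, V = solve_fixed_rank(U, B, V)
        X = U @ B @ V.T
        # Step 2: dominant singular triplet of the dual variable S = Grad_X F.
        Us, s, Vst = np.linalg.svd(grad_F(X), full_matrices=False)
        sigma1, p = s[0], B.shape[0]
        if sigma1 - lam <= eps or p == min(X.shape):
            return X, p
        # Rank-one update X+ = X - beta * u v^T with beta = (sigma1 - lam) / L_F.
        beta = (sigma1 - lam) / L_F
        X_plus = X - beta * np.outer(Us[:, 0], Vst[0, :])
        # Represent X+ at rank p + 1 (dense SVD, for brevity only).
        U2, s2, V2t = np.linalg.svd(X_plus, full_matrices=False)
        U, B, V = U2[:, :p + 1], np.diag(s2[:p + 1]), V2t[:p + 1, :].T

# Toy instance (assumption): A has singular values 5, 3, 2, 0.5 and lam = 1.
rng = np.random.default_rng(0)
n, m, lam = 8, 6, 1.0
Q1, _ = np.linalg.qr(rng.standard_normal((n, 4)))
Q2, _ = np.linalg.qr(rng.standard_normal((m, 4)))
A = Q1 @ np.diag([5.0, 3.0, 2.0, 0.5]) @ Q2.T

def solve_fixed_rank(U, B, V):
    # Stand-in for the second-order fixed-rank solver: for this quadratic F,
    # the soft-thresholded rank-p truncated SVD of A is a stationary point.
    p = B.shape[0]
    Ua, sa, Vat = np.linalg.svd(A, full_matrices=False)
    return Ua[:, :p], np.diag(sa[:p] - lam), Vat[:p, :].T

# Random rank-one starting point (U0, B0, V0).
U0, s0, V0t = np.linalg.svd(rng.standard_normal((n, m)), full_matrices=False)
X, p = trace_norm_scheme(lambda Z: Z - A, solve_fixed_rank, 1.0, lam,
                         U0[:, :1], np.diag(s0[:1]), V0t[:1, :].T)
print("final rank:", p)
```

In this toy instance the loop should stop at rank 3, the number of singular values of $A$ that exceed $\lambda$, and return the soft-thresholded SVD of $A$. For a general cost $F$ the stand-in solver would have to be replaced by an actual fixed-rank optimization routine warm-started at the updated iterate.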

Proposition 5.4. Assume that the function $F$ in (5.1) has a Lipschitz continuous derivative with Lipschitz constant $L_F$, i.e., $\|\mathrm{Grad}_X F - \mathrm{Grad}_Y F\|_F \le L_F \|X - Y\|_F$ for all $X, Y \in \mathbb{R}^{n \times m}$, where $\mathrm{Grad}_X F$ is the Euclidean gradient of $F$ in $\mathbb{R}^{n \times m}$. If $X = UBV^T$ is a stationary point of (5.3) with $(U, B, V) \in \mathrm{St}(p_0, n) \times S_{++}(p_0) \times \mathrm{St}(p_0, m)$, then the rank-one update

$$X_+ = X - \beta u v^T \tag{5.23}$$

ensures a decrease in the cost function $F(X) + \lambda \|X\|_*$, provided that $\beta > 0$ is sufficiently small and the unit-norm descent directions $u \in \mathbb{R}^n$ and $v \in \mathbb{R}^m$ are the dominant left and right singular vectors of the dual variable $S = \mathrm{Grad}_X F$.

Additionally, the maximum decrease in the cost function in (5.1) is obtained for $\beta = (\sigma_1 - \lambda)/L_F$, where $\sigma_1$ is the dominant singular value of $S$.

Proof. This is in fact a descent step as shown by Cai et al. (2010); Ma et al. (2011); Mazumder et al. (2010), but now projected onto the rank-one dominant subspace. The proof follows.

Algorithm to solve convex problem (5.1)

0. • Initialize $p$ to $p_0$, a rank guess.
   • Initialize the threshold $\epsilon$ for the convergence criterion; refer to Proposition 5.2.
   • Initialize the iterates $(U_0, B_0, V_0) \in \mathrm{St}(p_0, n) \times S_{++}(p_0) \times \mathrm{St}(p_0, m)$.

1. Solve the fixed-rank problem (5.3) with rank $p$ to obtain a critical point $(U, B, V)$.

2. Compute $\sigma_1$ (the dominant singular value) of the dual variable $S = \mathrm{Grad}_X F$, where $X = UBV^T$.
   • If $\sigma_1 - \lambda \le \epsilon$ (or duality gap $\le \epsilon$), then by Proposition 5.2 output $X = UBV^T$ as the solution to problem (5.1) and stop.
   • Else, compute the update as shown in Proposition 5.4 and compute the new point $(U_+, B_+, V_+)$ as described in (5.23). Set $p = p + 1$ and repeat step 1.

Table 5.2: Algorithm to solve the trace norm minimization problem (5.1).

Since $X = UBV^T$ is a stationary point of problem (5.3) and not the global optimum of (5.1), by virtue of Proposition 5.2 we have $\|S\|_{op} > \lambda$ (strict inequality). We assume that $F$ is smooth, so that its first derivative is Lipschitz continuous with Lipschitz constant $L_F$, i.e., $\|\mathrm{Grad}_X F - \mathrm{Grad}_Y F\|_F \le L_F \|X - Y\|_F$ for any $X, Y \in \mathbb{R}^{n \times m}$ (Nesterov, 2003, Chapter 2). Therefore, the update (5.23), $X_+ = X - \beta u v^T$, results in the inequalities

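$$F(X_+) \le F(X) + \langle \mathrm{Grad}_X F, X_+ - X \rangle + \frac{L_F}{2} \|X_+ - X\|_F^2 = F(X) - \beta \sigma_1 + \frac{L_F}{2} \beta^2 \quad \text{(from Lipschitz continuity)}.$$

The equality uses $\langle S, u v^T \rangle = u^T S v = \sigma_1$ and $\|u v^T\|_F = 1$. Also,

$$\|X_+\|_* \le \|X\|_* + \beta \quad \text{(from the triangle inequality of the matrix norm in (5.23))}$$

$$\Rightarrow \quad F(X_+) + \lambda \|X_+\|_* \le F(X) + \lambda \|X\|_* - \beta \left( \sigma_1 - \lambda - \frac{L_F}{2} \beta \right) \tag{5.24}$$

for $\beta > 0$, where $\sigma_1$ is the largest singular value of $S$ ($= \mathrm{Grad}_X F$). The maximum decrease in the cost function is obtained by maximizing $\beta \left( \sigma_1 - \lambda - \frac{L_F}{2} \beta \right)$ with respect to $\beta$, which gives $\beta_{\max} = (\sigma_1 - \lambda)/L_F > 0$. In addition, $\beta_{\max} = 0 \Leftrightarrow \sigma_1 - \lambda = 0$, which characterizes global optimality as shown in Proposition 5.2. This proves the proposition.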
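The inequality (5.24) can be checked numerically. The sketch below is a sanity check under stated assumptions rather than part of the proof: it uses the toy cost $F(X) = \frac{1}{2}\|X - A\|_F^2$ with a random matrix $A$, and the helper names `nuclear_norm`, `estimate_lipschitz` and `rank_one_step` are hypothetical. Because (5.24) relies only on the Lipschitz bound and the triangle inequality, the check is run at an arbitrary rank-one point with $\sigma_1 > \lambda$ instead of a stationary point of (5.3). The constant $L_F$ is obtained from a two-point ratio of gradient differences, which is exact for this quadratic cost but can underestimate $L_F$ in general.

```python
import numpy as np

def nuclear_norm(X):
    """Trace norm: sum of the singular values of X."""
    return np.linalg.svd(X, compute_uv=False).sum()

def estimate_lipschitz(grad_F, shape, rng):
    """Two-point ratio ||Grad_X F - Grad_Y F||_F / ||X - Y||_F."""
    X, Y = rng.standard_normal(shape), rng.standard_normal(shape)
    return np.linalg.norm(grad_F(X) - grad_F(Y)) / np.linalg.norm(X - Y)

def rank_one_step(X, S, lam, L_F):
    """Rank-one update X+ = X - beta * u v^T with beta = (sigma1 - lam) / L_F."""
    U, s, Vt = np.linalg.svd(S, full_matrices=False)
    beta = (s[0] - lam) / L_F
    return X - beta * np.outer(U[:, 0], Vt[0, :]), beta, s[0]

rng = np.random.default_rng(0)
n, m, lam = 7, 5, 0.1
A = rng.standard_normal((n, m))

def grad_F(Z):
    # Euclidean gradient of the toy cost 0.5 * ||Z - A||_F^2.
    return Z - A

def cost(Z):
    # Full objective F(Z) + lam * ||Z||_*.
    return 0.5 * np.linalg.norm(Z - A) ** 2 + lam * nuclear_norm(Z)

L_F = estimate_lipschitz(grad_F, (n, m), rng)
X = np.outer(rng.standard_normal(n), rng.standard_normal(m))  # arbitrary rank-one point
X_plus, beta, sigma1 = rank_one_step(X, grad_F(X), lam, L_F)

bound = beta * (sigma1 - lam - 0.5 * L_F * beta)  # guaranteed decrease from (5.24)
decrease = cost(X) - cost(X_plus)                 # decrease actually observed
print(f"beta = {beta:.4f}, guaranteed = {bound:.4f}, observed = {decrease:.4f}")
```

The observed decrease should be at least the guaranteed one, since the triangle inequality on the trace norm is generally not tight.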

A representation of $X_+ = X - \beta u v^T$ on $\mathcal{M}_{p+1}$ is obtained by computing the singular value decomposition of $X_+$. Since $X_+$ is a rank-one update of $X = UBV^T$, the singular value decomposition of $X_+$ requires only $O(np^2 + mp^2 + p^3)$ operations (Brand, 2006). Finally, we perform a backtracking linesearch along the rank-one descent direction to compute a good value of $\beta$, starting from the value $(\sigma_1 - \lambda)/L_F$, where $L_F$ is the Lipschitz constant for the first-order derivative of $F$ (Nesterov, 2003). The justification for this value is given in Proposition 5.4. In many problem instances, it is easy to estimate $L_F$ by randomly selecting two points, say $X, Y \in \mathbb{R}^{n \times m}$, and computing $\|\mathrm{Grad}_X F - \mathrm{Grad}_Y F\|_F / \|X - Y\|_F$ (Nesterov, 2003).

There is no theoretical guarantee that the algorithm in Table 5.2 stops at $p = p^*$ (the optimal rank). However, convergence to the global solution is guaranteed by the fact that the algorithm alternates between fixed-rank optimization and rank updates (unconstrained projected rank-1 gradient steps), both of which are descent iterates. Disregarding the fixed-rank step, the algorithm reduces to a gradient algorithm for a convex problem with classical global convergence guarantees. This theoretical certificate, however, does not capture the convergence properties of an algorithm that empirically always converges at a rank