Riemannian gradient in the direction $\xi_x$, i.e., $\mathrm{D}\,\mathrm{grad}_x f[\xi_x]$. The second term is the correction term, which corresponds to the manifold structure and the metric.
• $\mathrm{D}\,\mathrm{grad}_x f[\xi_{\bar{x}}]$: The computational cost depends on the cost function $f$ and its first-order derivative.
• Correction term: It involves matrix multiplications with a total cost of $O(np^2 + p^3)$.
It is clear that all the manifold-related operations are of linear complexity in $n$ and $m$, and cubic in $p$. For the case of interest, $p \ll \min(n, m)$, these operations are computationally very efficient. The ingredients that depend on the problem at hand are the evaluation of the cost function $f$, the computation of its first-order derivative, and the computation of its directional derivative along a search direction. In Section 5.6, these computations are worked out for two specific examples, low-rank matrix completion and multivariate regression, where we exploit the least-squares nature of the cost function.
5.4 An optimization scheme for the trace norm regularized convex problem (5.1)
Starting with a rank-1 problem, we alternate a second-order local optimization algorithm on the fixed-rank manifold with a first-order rank-one update, which yields an algorithm for the convex problem with trace norm penalty (5.1). The scheme is shown in Table 5.2. The rank-one update decreases the cost, with the updated iterate in $\mathcal{M}_{p+1}$.
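The alternation described above can be sketched in a few lines of NumPy. This is only an illustrative sketch under stated assumptions: it works with a dense matrix $X$ rather than the factors $(U, B, V)$, it computes a full SVD of the dual variable where only the dominant singular triplet is needed, and the names `trace_norm_scheme`, `grad_F`, and `solve_fixed_rank` are hypothetical. The fixed-rank solver is passed in as a callable, and the step size $(\sigma_1 - \lambda)/L_F$ is the one stated in Proposition 5.4, used here without the backtracking linesearch.

```python
import numpy as np

def trace_norm_scheme(grad_F, solve_fixed_rank, X0, lam, L_F, eps=1e-8, max_updates=50):
    """Alternate a fixed-rank solve with rank-one updates X+ = X - beta * u v^T.

    grad_F(X): Euclidean gradient of F at X (the dual variable S).
    solve_fixed_rank(X): hypothetical stand-in for the fixed-rank solver; it
        should return a critical point of the fixed-rank problem started at X.
    """
    X = X0
    for _ in range(max_updates):
        X = solve_fixed_rank(X)                  # fixed-rank optimization step
        S = grad_F(X)                            # dual variable
        # Dense SVD for brevity; only the dominant singular triplet is used.
        U, s, Vt = np.linalg.svd(S, full_matrices=False)
        sigma1, u, v = s[0], U[:, 0], Vt[0, :]
        if sigma1 - lam <= eps:                  # stopping test on sigma_1 - lambda
            break
        beta = (sigma1 - lam) / L_F              # step size from Proposition 5.4
        X = X - beta * np.outer(u, v)            # rank-one update (rank grows by at most one)
    return X

# Toy usage (assumed problem): F(X) = 0.5 * ||X - A||_F^2, so Grad F = X - A and L_F = 1.
# The fixed-rank solver is replaced by the identity map, purely for illustration.
rng = np.random.default_rng(0)
Q1, _ = np.linalg.qr(rng.standard_normal((6, 5)))
Q2, _ = np.linalg.qr(rng.standard_normal((5, 5)))
sing_vals = np.array([5.0, 3.0, 2.0, 0.5, 0.1])
A = Q1 @ np.diag(sing_vals) @ Q2.T
X_hat = trace_norm_scheme(lambda X: X - A, lambda X: X, np.zeros_like(A), lam=1.0, L_F=1.0)
print(np.linalg.matrix_rank(X_hat, tol=1e-8))
```

For this toy cost, the minimizer of $F(X) + \lambda\|X\|_*$ is known in closed form (soft-thresholding of the singular values of $A$), which makes the sketch easy to check. In a realistic setting the identity stand-in would be replaced by an actual fixed-rank solver.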
Proposition 5.4. Assume that the function $F$ in (5.1) has a Lipschitz continuous derivative with Lipschitz constant $L_F$, such that $\|\mathrm{Grad}_X F - \mathrm{Grad}_Y F\|_F \le L_F \|X - Y\|_F$ for all $X, Y \in \mathbb{R}^{n \times m}$, where $\mathrm{Grad}_X F$ is the Euclidean gradient of the function $F$ in $\mathbb{R}^{n \times m}$. If $X = UBV^T$ is a stationary point of (5.3) with $(U, B, V) \in \mathrm{St}(p_0, n) \times S_{++}(p_0) \times \mathrm{St}(p_0, m)$, then the rank-one update
$$X_+ = X - \beta u v^T \qquad (5.23)$$
ensures a decrease in the cost function $F(X) + \lambda \|X\|_*$, provided that $\beta > 0$ is sufficiently small and the unit-norm descent directions $u \in \mathbb{R}^n$ and $v \in \mathbb{R}^m$ are the dominant left and right singular vectors of the dual variable $S = \mathrm{Grad}_X F$.
Additionally, the maximum decrease in the cost function in (5.1) is obtained for $\beta = (\sigma_1 - \lambda)/L_F$, where $\sigma_1$ is the dominant singular value of $S$.
Proof. This is in fact a descent step as shown by Cai et al. (2010); Ma et al. (2011); Mazumder et al. (2010), but now projected onto the rank-one dominant subspace. The proof follows.
Since $X = UBV^T$ is a stationary point of problem (5.3) and not the global optimum of (5.1), by virtue of Proposition 5.2 we have $\|S\|_{\mathrm{op}} > \lambda$ (strict inequality). We assume that $F$ is smooth and that its first derivative is Lipschitz continuous with Lipschitz constant $L_F$, i.e.,
Algorithm to solve the convex problem (5.1)
0. • Initialize $p$ to $p_0$, a rank guess.
   • Initialize the threshold $\epsilon$ for the convergence criterion; refer to Proposition 5.2.
   • Initialize the iterates $(U_0, B_0, V_0) \in \mathrm{St}(p_0, n) \times S_{++}(p_0) \times \mathrm{St}(p_0, m)$.
1. Solve the fixed-rank problem (5.3) with rank $p$ to obtain a critical point $(U, B, V)$.
2. Compute $\sigma_1$ (the dominant singular value) of the dual variable $S = \mathrm{Grad}_X F$, where $X = UBV^T$.
   • If $\sigma_1 - \lambda \le \epsilon$ (or duality gap $\le \epsilon$), due to Proposition 5.2, output $X = UBV^T$ as the solution to problem (5.1) and stop.
   • Else, compute the update as shown in Proposition 5.4 and compute the new point $(U_+, B_+, V_+)$ as described in (5.23). Set $p = p + 1$ and repeat step 1.
Table 5.2: Algorithm to solve the trace norm minimization problem (5.1).
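$\|\mathrm{Grad}_X F - \mathrm{Grad}_Y F\|_F \le L_F \|X - Y\|_F$ for any $X, Y \in \mathbb{R}^{n \times m}$ (Nesterov, 2003, Chapter 2). Therefore, the update (5.23), $X_+ = X - \beta u v^T$, results in the inequalities
$$
\begin{aligned}
F(X_+) &\le F(X) + \langle \mathrm{Grad}_X F,\, X_+ - X \rangle + \frac{L_F}{2} \|X_+ - X\|_F^2 \\
&= F(X) - \beta \sigma_1 + \frac{L_F}{2} \beta^2 \quad \text{(from Lipschitz continuity)}, \\
\|X_+\|_* &\le \|X\|_* + \beta \quad \text{(from the triangle inequality of the matrix norm in (5.23))} \\
\Rightarrow \; F(X_+) + \lambda \|X_+\|_* &\le F(X) + \lambda \|X\|_* - \beta \left( \sigma_1 - \lambda - \frac{L_F}{2} \beta \right)
\end{aligned}
\qquad (5.24)
$$
for $\beta > 0$, where $\sigma_1$ is the largest singular value of $S$ ($= \mathrm{Grad}_X F$). The maximum decrease in the cost function is obtained by maximizing $\beta \left( \sigma_1 - \lambda - \frac{L_F}{2} \beta \right)$ with respect to $\beta$, which gives $\beta_{\max} = \frac{\sigma_1 - \lambda}{L_F} > 0$. In addition, $\beta_{\max} = 0 \Leftrightarrow \sigma_1 - \lambda = 0$, which characterizes global optimality as shown in Proposition 5.2. This proves the proposition.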
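The bound (5.24) is easy to check numerically. The sketch below does so on an assumed test problem, a masked least-squares cost $F(X) = \frac{1}{2}\|\mathrm{mask} \odot (X - A)\|_F^2$ with a 0/1 mask, for which $L_F = 1$. The problem data, the sizes, and the helper names (`cost`, `bound`) are hypothetical and chosen only for illustration. Note that the two inequalities behind (5.24) do not themselves use stationarity of $X$ (stationarity and non-optimality are what give $\sigma_1 > \lambda$ in the proof), so a random rank-$p$ point is used here.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, p, lam = 8, 6, 2, 0.1

# Assumed test problem: F(X) = 0.5 * ||mask * (X - A)||_F^2 with a 0/1 mask,
# whose gradient mask * (X - A) is Lipschitz continuous with L_F = 1.
A = rng.standard_normal((n, m))
mask = (rng.random((n, m)) < 0.7).astype(float)
L_F = 1.0

def F(X):
    return 0.5 * np.sum((mask * (X - A)) ** 2)

def grad_F(X):
    return mask * (X - A)

def cost(X):
    # F(X) + lambda * ||X||_* (trace norm via NumPy's nuclear norm)
    return F(X) + lam * np.linalg.norm(X, ord="nuc")

# A random rank-p point and the dominant singular triplet of S = Grad_X F.
X = rng.standard_normal((n, p)) @ rng.standard_normal((p, m))
U, s, Vt = np.linalg.svd(grad_F(X))
sigma1, u, v = s[0], U[:, 0], Vt[0, :]

def bound(beta):
    # Right-hand side of (5.24).
    return cost(X) - beta * (sigma1 - lam - 0.5 * L_F * beta)

beta_max = (sigma1 - lam) / L_F
betas = beta_max * np.array([0.25, 0.5, 1.0, 1.5])
bound_holds = [cost(X - b * np.outer(u, v)) <= bound(b) + 1e-10 for b in betas]
print(bound_holds)
```

Among the tested step sizes, the right-hand side of (5.24) is smallest at `beta_max`, where the guaranteed decrease equals $(\sigma_1 - \lambda)^2 / (2 L_F)$.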
A representation of $X_+ = X - \beta u v^T$ on $\mathcal{M}_{p+1}$ is obtained by computing the singular value decomposition of $X_+$. Since $X_+$ is a rank-one update of $X = UBV^T$, the singular value decomposition of $X_+$ only requires $O(np^2 + mp^2 + p^3)$ operations (Brand, 2006). Finally, we perform a backtracking linesearch along the rank-one descent direction to compute a good value of $\beta$, starting from the value $\frac{\sigma_1 - \lambda}{L_F}$, where $L_F$ is the Lipschitz constant for the first-order derivative of $F$ (Nesterov, 2003). The justification for this value is given in Proposition 5.4. In many problem instances, it is easy to estimate $L_F$ by randomly selecting two points, say $X$ and $Y \in \mathbb{R}^{n \times m}$, and computing $\|\mathrm{Grad}_X F - \mathrm{Grad}_Y F\|_F / \|X - Y\|_F$ (Nesterov, 2003).
There is no theoretical guarantee that the algorithm in Table 5.2 stops at $p = p^*$ (the optimal rank). However, convergence to the global solution is guaranteed by the fact that the algorithm alternates between fixed-rank optimization and rank updates (unconstrained projected rank-1 gradient steps), both of which are descent iterations. Disregarding the fixed-rank step, the algorithm reduces to a gradient algorithm for a convex problem with classical global convergence guarantees. This theoretical certificate, however, does not capture the convergence properties of an algorithm that empirically always converges at a rank