A Non-Euclidean Gradient Descent Framework for Non-Convex Matrix Factorization

(1)

Supplementary Materials to “A Non-Euclidean

Gradient Descent Framework for Non-Convex

Matrix Factorization”

Ya-Ping Hsieh, Yu-Chun Kao, Rabeeh Karimi Mahabadi, Alp Yurtsever, Anastasios Kyrillidis, Member, IEEE,

and Volkan Cevher, Senior Member, IEEE

APPENDIXA

PROOF OFSUBLINEARRATE OFNUCLEARGRADIENTDESCENT

We first use following lemma to prove the sublinear rate.

Lemma 1. For the sequence of the iterates_{Ui}ki=0, we have

f (UiUiT)− f(Ui+1Ui+1T)≥ αi· k∇f(Xi)· Uik2S∞ (A.1)

and f (UiUiT)− f(U∗U∗T)≤ βi· k∇f(Xi)· UikS∞ (A.2) where αi= 1.117 ηi and βi= (2 +19₈₁)k∆Uik∗= (2 + 19 81)D∗(Ui, U ∗_).

Defineδi= f (UiUiT)− f(U∗U∗T) and follow the previous lemma. We know{δi} is an positive decreasing sequence and

δi+1≤ δi− αi· k∇f(Xi)· Uik2S∞ ≤ δi− αi β2 i · δ 2 i

Dividing both sides with (δi· δi+1), we obtain, by assumption (III.5),

1 δi+1 − 1 δi ≥ αi β2 i ·_δδi i+1 ≥ αi β2 i ≥ α_˜i D2 ∗ . Telescoping the inequality we get the desired result.

Now we prove (A.1) of lemma 1. The smoothness gives

f (UiUiT)− f(Ui+1Ui+1T)

≥h∇f(Xi), Xi− Xi+1i − L 2kXi− Xi+1k 2 ∗ =_∇f(Xi), (Ui− Ui+1)UiT + Ui(Ui− Ui+1)T | {z } 1

−∇f(Xi), (Ui− Ui+1)(Ui− Ui+1)T

| {z } 2 −L₂ kXi− Xi+1k2∗ | {z } 3 . (A.3) For 1_{we have} ∇f(Xi), (Ui− Ui+1)UiT + Ui(Ui− Ui+1)T =2h∇f(Xi)Ui, Ui− Ui+1i =2ηi∇f(Xi)Ui, [∇f(Xi)Ui]#∞ =2ηik∇f(Xi)Uik2S∞. (A.4)

(2)

To upper bound 2_{, we use}

∇f(Xi), (Ui− Ui+1)(Ui− Ui+1)T

=η2 ik∇f(Xi)· Uik2S∞· T race(∇f(Xi)A1A T 1) ≤η2 ik∇f(Xi)· Uik2S∞· k∇f(Xi)kS∞ (∗) ≤1₄ηik∇f(Xi)· Uik2S∞ (A.5)

in which A1 is the first singular vector of ∇f(Xi)· Ui, and(∗) is by ηi≤ _{4k∇f (X}1_i_)k

S∞.

kUiUiT − Ui+1Ui+1T kS∞

=_kUi(Ui− Ui+1)T + (Ui− Ui+1)UiT

− (Ui− Ui+1)(Ui− Ui+1)TkS∞

≤2kUikS∞kUi− Ui+1k∗+kUi− Ui+1kS∞kUi− Ui+1k∗

=2ηikUikS∞k∇f(Xi)· UikS∞+ η 2 ik∇f(Xi)· Uik2S∞ =ηik∇f(Xi)· UikS∞ [2kUikS∞+ ηik∇f(Xi)· UikS∞] ≤ηik∇f(Xi)· UikS∞ [2kUikS∞+ ηik∇f(Xi)kS∞kUikS∞] (1) ≤ ηik∇f(Xi)· UikS∞ 9 4kUikS∞ (A.6) where(1) is by ηi≤_{4k∇f (X}1 i)kS∞.

Plugging above inequalities into (A.3), we obtain

≥2ηik∇f(Xi)Uik2S∞− 1 4ηik∇f(Xi)· Uik 2 S∞ −L₂(9 4ηikUikS∞k∇f(Xi)· UikS∞) 2 ≥ηik∇f(Xi)· Uik2S∞ " 7 4 − L 2 9 4 2 ηikUik2S∞ # (∗) ≥ 1.117 ηik∇f(Xi)· Uik2S∞ (A.7) where(_{∗) is by η}i≤_4LkX1_i_k S∞ = 1

4LkUik2_S∞. We have thus finished the first part of lemma 1.

Now we give the proof of (A.2) of lemma 1. We denote RUi≡ arg min R R is unitary kUi− U∗Rk∗. (A.8) and define ∆Ui≡ Ui− U ∗_R Ui. We begin with f (UiUiT)− f(U∗U∗T) ≤h∇f(Xi), Xi− X∗i =_∇f(Xi), ∆UiU T i + ∇f(Xi), Ui∆TUi −∇f(Xi), ∆Ui∆ T Ui =2h∇f(Xi)Ui, ∆Uii −∇f(Xi), ∆Ui∆ T Ui ≤2k∇f(Xi)· UikS∞k∆Uik∗+ ∇f(Xi), ∆Ui∆ T Ui | {z } 1 . (A.9)

(3)

To upper bound 1_{, we use} ∇f(Xi), ∆Ui∆ T Ui =_h∇f(Xi)∆Ui, ∆Uii ≤k∇f(Xi)∆UikS∞k∆Uik∗ =_k∇f(Xi)P∆_Ui∆UikS∞k∆Uik∗ ≤k∇f(Xi)P∆_UikS∞k∆UikS∞k∆Uik∗ (∗) ≤k∇f(Xi)PUikS∞+k∇f(Xi)PU∗kS∞ k∆Uik 2 ∗ (A.10)

in which PU denotes the projection onto Col(U ). (∗) is due to Span(Col(∆Ui)) ⊆ Span(Col(Ui)∪ Col(U

∗ r)) and k∆UikS∞ ≤ k∆Uik∗. Continuing, we get k∇f(Xi)PUikS∞ =k∇f(Xi)UiU † ikS∞ (1) ≤ k∇f(Xi)UikS∞ 1 σr(Ui) (2) ≤ k∇f(Xi)UikS∞ 10 9σr(U∗) (A.11)

in which Ui† denotes the pseudoinverse ofUi. Here,(1) is due to σ1(Ui†) = σr(Ui)−1, and(2) is by assumption (III.5), Weyl’s

inequality andσr(U∗RUi) = σr(U ∗_). Similarly, we have k∇f(Xi)PU∗k_S_∞ =k∇f(X_i)U∗(U∗)†k_S_∞ ≤ k∇f(Xi)U∗kS∞ | {z } A 1 σr(U∗) . (A.12)

To upper bound A_{, we use the following inequality.} k∇f(Xi)U∗kS∞ =k∇f(Xi)U∗RUikS∞ ≤k∇f(Xi)UikS∞+k∇f(Xi)∆UikS∞ =_k∇f(Xi)UikS∞+k∇f(Xi)P∆Ui∆UikS∞ ≤k∇f(Xi)UikS∞+k∇f(Xi)P∆UikS∞k∆UikS∞ (1) ≤ k∇f(Xi)UikS∞ +_k∇f(Xi)PUikS∞+k∇f(Xi)PU∗kS∞ k∆UikS∞ (2) ≤ k∇f(Xi)UikS∞ +10 9 k∇f(Xi)UikS∞+k∇f(Xi)U ∗ kS∞ k∆_U ikS∞ σr(U∗) (3) ≤ k∇f(Xi)UikS∞ + 1 10 10 9 k∇f(Xi)UikS∞+k∇f(Xi)U ∗ kS∞ =10 9 k∇f(Xi)UikS∞+ 1 10k∇f(Xi)U ∗ kS∞. (A.13)

Here, (1) is owing to the similar reason of (A.10), (2) is obtained by plugging in (A.11) and (A.12), and (3) is by assumption (III.5) and_k∆UikS∞ ≤ k∆Uik∗. Thus we arrive at

k∇f(Xi)U∗kS∞ ≤

10 9

2

k∇f(Xi)UikS∞. (A.14)

Plugging this into (A.12), we get

k∇f(Xi)PU∗k_S ∞ ≤ 10 9 2 k∇f(Xi)UikS∞ 1 σr(U∗) . (A.15)

(4)

Combining (A.11) and (A.15) with (A.10), we obtain ∇f(Xi), ∆Ui∆ T Ui ≤(k∇f(Xi)UikS∞ 10 9σr(U∗) + 10 9 2 k∇f(Xi)UikS∞ 1 σr(U∗) )k∆Uik 2 ∗ =k∇f(Xi)UikS∞ 190 81 k∆Uik∗ σr(U∗)k∆ Uik∗ (∗) ≤19₈₁k∇f(Xi)UikS∞k∆Uik∗ (A.16)

where(_{∗) is by assumption (III.5). Now we plug (A.16) into (A.9) and obtain} f (UiUiT)− f(U∗U∗T)≤ (2 +

19

81)k∆Uik∗k∇f(Xi)· UikS∞. (A.17)

The last part is to proveminiγi≥ 1₄η by showingkUikS∞≤

11

9kU0kS∞ and

k∇f(Xi)kS∞ ≤

40L

81 σr(U0)σ1(U0) +k∇f(X0)kS∞. (A.18)

By assumption (III.5) and Weyl’s inequality, we have for every i_{≥ 0} (1−₁₀1 )σ1(U∗)≤ σ1(Ui)≤ (1 + 1 10)σ1(U ∗_{) , and thus} 1 + 1 10 1₋ 1 10 σ1(U0)≥ σ1(Ui). (A.19) For_k∇f(Xi)kS∞, we have k∇f(Xi)kS∞ ≤ k∇f(Xi)− ∇f(X0)kS∞+k∇f(X0)kS∞ ≤ LS1→S∞kXi− X0k∗+k∇f(X0)kS∞ ≤ LS1→S∞ kXi− X∗k∗+kX0− X∗k∗ +_k∇f(X0)kS∞. (A.20) Since kXi− X∗k∗=kUi(Ui− U∗RUi) T + (Ui− U∗RUi)(U ∗_R Ui) T k∗ ≤ kUi− U∗RUik∗ kUikS∞+kU ∗ kS∞ , (A.21) we have kXi− X∗k∗+kX0− X∗k∗ ≤kUi− U∗RUik∗(kUikS∞+kU ∗ kS∞) +_kU0− U∗RU0k∗(kU0kS∞+kU ∗ kS∞) (A.22) ≤σr(U ∗₎ 10 σ1(U0)( 11 9 + 10 9 + 1 + 10 9 ) ≤ 1 1− 1 10 σr(U0) 10 σ1(U0) 40 9 =40 81σr(U0)σ1(U0) by applying inequality (A.19).

(5)

APPENDIXB

PROOF OFSUBLINEARRATE FORNUCLEARGRADIENTDESCENT, TENSORVERSION

We first define the action (_{·) on a bounded linear operator T of H}1⊗ H2 andH1 whereH1 andH2 are Hilbert spaces.

∀T ∈ L(H1⊗ H2, R), h1∈ H1, T = X i λi(ai⊗ bi) T _{· h}1, X i λihai, h1ibi∈ H2. (B.1)

Immediately we have _{hT, h}1⊗ h2i = hT · h1, h2i, since

hT, h1⊗ h2i = * X i λi(ai⊗ bi), h1⊗ h2 + =X i λihai, h1ihbi, h2i =hT · h1, h2i. (B.2)

For the cases Hi = Rni×mi, we define the norm to be injective cross norm with eachHi having the spectral normkkS∞ as

primal norm and the consequent dual norm, nuclear norm _kk∗.

kxk , sup kaik∗≤1 ha1⊗ a2, xi (B.3) which satisfies kh1⊗ h2k = kh1kS∞kh2kS∞ ka1⊗ a2kdual=ka1k∗ka2k∗. (B.4)

We also use_kh1⊗ h2kS∞ andka1⊗ a2k∗ to denotekh1⊗ h2k and ka1⊗ a2kdual.

We use following lemma to prove the sublinear rate.

f (Ui⊗ Ui)− f(Ui+1⊗ Ui+1)≥ αi· k[∇f(Xi)· Ui]#∞k 2 S∞ = αi· k∇f(Xi)· Uik 2 ∗ (B.5) and f (Ui⊗ Ui)− f(U∗⊗ U∗)≤ βi· k[∇f(Xi)· Ui]#∞kS∞ (B.6) where αi= 1.117 ηi and βi= (2 +19₈₁)k∆UikS∞ = (2 + 19 81)D∞(Ui, U ∗_).

Defineδi= f (Ui⊗ Ui)− f(U∗⊗ U∗) and follow the previous lemma. We know{δi} is an positive decreasing sequence and

δi+1≤ δi− αi· k[∇f(X) · U]#∞k 2 S∞ ≤ δi− αi β2 i · δ 2 i

Dividing both sides with (δi· δi+1), we obtain, by assumption (B.14),

1 δi+1 − 1 δi ≥ αi β2 i ·_δδi i+1 ≥ αi β2 i ≥ _˜αi D2 S∞ . Telescoping the inequality we get the desired result.

Now we prove (B.5) of lemma 2. We assume_{∇f(X) is symmetric, i.e. ∇f(X) =}P

iλi(ai⊗ ai) throughout. The smoothness

gives

f (Ui⊗ Ui)− f(Ui+1⊗ Ui+1)≥ h∇f(Xi), Xi− Xi+1i −

L 2kXi− Xi+1k 2 S∞ =_h∇f(Xi), (Ui− Ui+1)⊗ Ui+ Ui⊗ (Ui− Ui+1)i | {z } 1 − h∇f(Xi), (Ui− Ui+1)⊗ (Ui− Ui+1)i | {z } 2 −L₂ kXi− Xi+1k2S∞ | {z } 3 (B.7)

(6)

For 1_{we have} h∇f(Xi), (Ui− Ui+1)⊗ Ui+ Ui⊗ (Ui− Ui+1)i (∗) = 2h∇f(Xi)· Ui, Ui− Ui+1i = 2ηi∇f(Xi)· Ui, [∇f(Xi)· Ui]#∞ = 2ηik∇f(Xi)· Uik2∗ (B.8)

where(_{∗) is by the assumption that ∇f(X}i) is symmetric.

h∇f(Xi), (Ui− Ui+1)⊗ (Ui− Ui+1)i = ηi2k∇f(Xi)· Uik2∗∇f(Xi), ABT ⊗ ABT

≤ η2

ik∇f(Xi)· Uik2∗· k∇f(Xi)k∗kABTk2S∞

(∗)

≤ 1₄ηik∇f(Xi)· Uik2∗ (B.9)

in which A and B are respectively the left and right singular vectors of_∇f(Xi)· Ui.(∗) is by ηi ≤_{4k∇f (X}1

i)k∗.

kUi⊗ Ui− Ui+1⊗ Ui+1kS∞=kUi⊗ (Ui− Ui+1) + (Ui− Ui+1)⊗ Ui− (Ui− Ui+1)⊗ (Ui− Ui+1)kS∞

≤ 2kUikS∞kUi− Ui+1kS∞+kUi− Ui+1k

2 S∞ = 2ηikUikS∞k∇f(Xi)· Uik∗+ η 2 ik∇f(Xi)· Uik2∗ = ηik∇f(Xi)· Uik∗ [2kUikS∞+ ηik∇f(Xi)· Uik∗] (1) ≤ ηik∇f(Xi)· Uik∗ [2kUikS∞+ ηik∇f(Xi)k∗kUikS∞] (2) ≤ ηik∇f(Xi)· Uik∗ 9 4kUikS∞ (B.10) (2) is by ηi≤ 4k∇f (X1 i)k∗ and(1) is due to k∇f(Xi)· Uik∗= sup kyk_S∞≤1h∇f(X i)· Ui, yi = sup kykS∞≤1 h∇f(Xi), Ui⊗ yi ≤ sup kykS∞≤1 k∇f(Xi)k∗kUikS∞kykS∞ =_k∇f(Xi)k∗kUikS∞ (B.11)

Plugging above inequalities into (B.7), we obtain

f (Ui⊗ Ui)− f(Ui+1⊗ Ui+1)≥ 2ηik∇f(Xi)· Uik2∗−

1 4ηik∇f(Xi)· Uik 2 ∗ −L₂(9 4ηi kUikS∞k∇f(Xi)· Uik∗) 2 ≥ ηik∇f(Xi)· Uik2∗ " 7 4 − L 2 9 4 2 ηikUik2S∞ # (∗) ≥ 1.117 ηik∇f(Xi)· Uik2∗ (B.12) (∗) is by ηi≤ _4LkX1 ikS∞ = 1

4LkUik2_S∞. We thus finish the first part of lemma 2.

We give the proof of (B.6) of lemma 2 which only holds for the phase retrieval case. We then use ˜f ( ˜X) to denote the original objective function, i.e. ˜X = ˜U ˜UT_{, ˜}_{f ( ˜}_{X) = f (X) = f (U}

⊗ U) and ˜U , a p× 1 vector, is the vectorization of U, a m× n matrix where p = m · n. We have the following equalities.

h∇f(Xi), U1⊗ U2i =

D

∇ ˜f ( ˜Xi), ˜U1U˜2 TE

and its immediate consequence

h∇f(Xi)· U1, U2i =

D

∇ ˜f ( ˜Xi) ˜U1, ˜U2

E

(7)

in which both sides of the first equation are the first order term of the objective function. The assumption made here, which corresponds to IV.5, would be

˜ D∞≡ max ˜ U :f ( ˜U ˜U>_{)≤f ( ˜}_U₀_U˜> 0) D∞( ˜U , ˜U∗)≤σmin( ˜U ∗₎ 10 = σ1( ˜U∗) 10 = k ˜U∗_k l2 10 . (B.14) We denote R_U˜_i ≡ arg min R unitaryk ˜ Ui− ˜U∗RkS∞ = arg min R∈{1,−1}k ˜ Ui− ˜U∗RkS∞. and define ∆_U˜_i≡ ˜Ui− ˜U∗R_U˜_i. f (Ui⊗ Ui)− f(U∗⊗ U∗)≤ h∇f(Xi), Xi− X∗i =∇f(Xi), Ui⊗ Ui− U∗RU˜i⊗ U ∗_R ˜ Ui =D_{∇ ˜}f ( ˜Xi), ∆_U˜ i ˜ UT i E +D_{∇ ˜}f ( ˜Xi), ˜Ui∆TU˜i E −D∇ ˜f ( ˜Xi), ∆_U˜ i∆ T ˜ Ui E = 2D∇ ˜f ( ˜Xi) ˜Ui, ∆U˜i E −D∇ ˜f ( ˜Xi), ∆U˜i∆ T ˜ Ui E ≤ 2k∇f( ˜Xi) ˜Uik∗k∆_U˜_ikS∞+ D ∇ ˜f ( ˜Xi), ∆_U˜_i∆T_U˜ i E | {z } 1 (B.15) To upper bound 1_, D ∇ ˜f ( ˜Xi), ∆U˜i∆ T ˜ Ui E = D ∇ ˜f ( ˜Xi)∆U˜i, ∆U˜i E ≤ k∇ ˜f ( ˜Xi)∆U˜ik∗k∆U˜ikS∞ =_{k∇ ˜}f ( ˜Xi)(P∆_Ui˜ ∆U˜i)k∗k∆U˜ikS∞ ≤ k∇ ˜f ( ˜Xi)P∆_Ui˜ k∗k∆U˜ik 2 S∞ (∗) ≤ (k∇ ˜f ( ˜Xi)P_U˜ ik∗+k∇ ˜f ( ˜Xi)PU˜∗k∗)k∆U˜ik 2 S∞ (B.16)

in which PU denotes the projection onto Col(U ). (∗) is due to Span(Col(∆_U˜

i))⊆ Span(Col( ˜Ui)∪ Col( ˜U ∗ r)). k∇ ˜f ( ˜Xi)PUik∗=k∇ ˜f ( ˜Xi)UiU † ik∗ (1) ≤ k∇ ˜f ( ˜Xi)Uik∗ 1 σmin(Ui) (2) ≤ k∇ ˜f ( ˜Xi)Uik∗ 10 9σmin(U∗) (B.17)

in which Ui† denotes the pseudoinverse of Ui and σmin denotes the smallest non-zero singular value. (1) is due to σ1(Ui†) =

σr(Ui)−1.(2) is by assumption (B.14) and Weyl’s inequality.

Similarly, we have k∇ ˜f ( ˜Xi)P_U˜∗k∗=k∇ ˜f ( ˜Xi) ˜U∗( ˜U∗)†k∗≤ k∇ ˜f ( ˜Xi) ˜U∗k∗ | {z } A 1 σmin( ˜U∗) . (B.18)

To upper bound A_{, we use the following inequality.} k∇ ˜f ( ˜Xi) ˜U∗k∗=k∇ ˜f ( ˜Xi) ˜U∗RU˜ik∗ ≤ k∇ ˜f ( ˜Xi) ˜Uik∗+k∇ ˜f ( ˜Xi)∆_U˜_ik∗ =k∇ ˜f ( ˜Xi) ˜Uik∗+k∇ ˜f ( ˜Xi)P∆_Ui˜ ∆U˜ik∗ ≤ k∇ ˜f ( ˜Xi) ˜Uik∗+k∇ ˜f ( ˜Xi)P∆_Ui˜ k∗k∆U˜ikS∞ (1) ≤ k∇ ˜f ( ˜Xi) ˜Uik∗+ (k∇ ˜f ( ˜Xi)PU˜ik∗+k∇ ˜f ( ˜Xi)PU˜∗k∗)k∆U˜ikS∞ (2) ≤ k∇ ˜f ( ˜Xi) ˜Uik∗+ (k∇ ˜f ( ˜Xi) ˜Uik∗ 10 9 +k∇ ˜f ( ˜Xi) ˜U ∗ k∗)k∆ ˜ UikS∞ σmin( ˜U∗) (3) ≤ k∇ ˜f ( ˜Xi) ˜Uik∗+ (k∇ ˜f ( ˜Xi) ˜Uik∗10 9 +k∇ ˜f ( ˜Xi) ˜U ∗ k∗)1 10 = 10 9 k∇ ˜f ( ˜Xi) ˜Uik∗+ 1 10k∇ ˜f ( ˜Xi) ˜U ∗ k∗. (B.19)

(8)

(1) is owing to the similar reason of (B.16). (2) is obtained by plugging in (B.17) and (B.18). (3) is by assumption (B.14). Thus we arrive k∇ ˜f ( ˜Xi) ˜U∗k∗≤ 10 9 2 k∇ ˜f ( ˜Xi) ˜Uik∗. (B.20)

Plugging this into (B.18), we get

k∇ ˜f ( ˜Xi)PU˜∗k∗≤ 10 9 2 k∇ ˜f ( ˜Xi) ˜Uik∗ 1 σmin( ˜U∗) . (B.21)

Combining (B.17) and (B.21) with (B.16), we obtain

D ∇ ˜f ( ˜Xi), ∆U˜i∆ T ˜ Ui E ≤ (k∇ ˜f ( ˜Xi) ˜Uik∗ 10 9σmin(U∗) + 10 9 2 k∇ ˜f ( ˜Xi) ˜Uik∗ 1 σmin(U∗) )k∆U˜ik 2 S∞ =k∇ ˜f ( ˜Xi) ˜Uik∗ 190 81 k∆U˜ikS∞ σmin(U∗)k∆ ˜ UikS∞ (∗) ≤ 19₈₁k∇ ˜f ( ˜Xi) ˜Uik∗k∆_U˜_ikS∞ (B.22)

where(∗) is by assumption (B.14). Now we plug (B.22) into (B.15) and obtain f (Ui⊗ Ui)− f(U∗⊗ U∗)≤ (2 +

19

81)k∆U˜ikS∞k∇ ˜f ( ˜Xi) ˜Uik∗

(∗∗)

≤ (2 +19₈₁)C0k∆_U˜_ikS∞k∇f(Xi)· Uik∗ (B.23)

where C0 is a constant and (∗∗) is obtained by the connection between ∇f(Xi)· Ui and ˜f ( ˜Xi) ˜Ui (see (B.13)) and the

equivalence of norms of finite dimensional Banach space. The last part is to proveminiγi≥ 1₄η by showingkUikS∞≤

11

9kU0kS∞ and

k∇f(Xi)rk∗≤40L

81 σmin(U0)σ1(U0) +k∇f(X0)rk∗. (B.24) By assumption (B.14) and Weyl’s inequality, we have for every i_{≥ 0}

(1₋ 1 10)σ1(U ∗₎ ≤ σ1(Ui)≤ (1 + 1 10)σ1(U ∗_{) , and thus} 1 + 1 10 1₋ 1 10 σ1(U0)≥ σ1(Ui). (B.25)

Since _k∇f(Xi)rk∗ is the Ky Fanr-norm of ∇f(Xi), we have

k∇f(Xi)rk∗≤ k(∇f(Xi)− ∇f(X0))rk∗+k∇f(X0)rk∗ ≤ k∇f(Xi)− ∇f(X0)k∗+k∇f(X0)rk∗ ≤ LS∞→S1kXi− X0kS∞+k∇f(X0)rk∗ ≤ LS∞→S1(kXi− X ∗ kS∞+kX0− X ∗ kS∞) +k∇f(X0)rk∗. Since kXi− X∗kS∞ =kUi⊗ (Ui− U ∗_R Ui) + (Ui− U ∗_R Ui)⊗ (U ∗_R Ui)kS∞ ≤ kUi− U∗RUikS∞(kUikS∞+kU ∗ kS∞) we have kXi− X∗kS∞+kX0− X ∗ kS∞≤ kUi− U ∗_R UikS∞(kUikS∞+kU ∗ kS∞) +kU0− U∗RU0kS∞(kU0kS∞+kU ∗ kS∞) ≤ σmin(U ∗₎ 10 σ1(U0)( 11 9 + 10 9 + 1 + 10 9 ) ≤ 1 1− 1 10 σmin(U0) 10 σ1(U0) 40 9 = 40 81σmin(U0)σ1(U0) by applying inequality (B.25).

(9)

APPENDIXC

PROOF OFLINEARRATE FORNUCLEARGRADIENTDESCENT

We use U and U+ _{to denotes the current state and the updated state. Let}_X∗_{= U}∗_(U∗₎T _{be the optimum,}_{X = U U}T _and

X+_{= U}+_(U+₎T_.

U+_{= U}

− ηU∇f(UUT)#· U (C.1)

whereηU = _16(LkXk 1

S∞+k∇f (X)#QUQTUkS∞) which also denoted asη for simplicity.

We now start to prove the following key lemma.

Lemma 3. Given DF(U, U∗)≤ ρσr(Ur∗) and D∗(U, U∗)≤_81κ1 σr(X

∗₎ σ1(U∗), 1 ηU − U +_{, U} − Ur∗RU =_∇f(X)#_{U , U} − Ur∗RU ≥0.86ηk∇f(X)# Uk2 F + 0.7 µ 4 σr(X ∗_)D F(U, Ur∗) 2 −L₄kX∗ − X∗ rk 2 F, (C.2)

in which RU = arg min R R is unitary kU − U∗ rRkF. First, we define∆_{≡ U − U}∗ rRU and thus ∇f(X)#_{U , U} − Ur∗RU =1 2∇f(X) # , X− X∗ r + 1 2∇f(X) # , ∆∆T. (C.3)

First, we lower bound _∇f(X)#_{, X}

− X∗ r: f (X)≥ f(X+ )−∇f(X), X+ − X −L₂kX+ − Xk2 ∗ ≥ f(X∗₎ −∇f(X), X+ − X −L₂kX+ − Xk2 ∗, (C.4) and f (X∗ r)≥ f(X) + h∇f(X), Xr∗− Xi + µ 2kX ∗ r − Xk 2 ∗. (C.5)

Noticing PSD matrices form a convex cone, we obtain_h∇f(X∗), X∗_{i = 0 and consequently h∇f(X}∗_{), X}∗

ri = 0. Therefore we have f (Xr∗)≤ f(X∗) +h∇f(X∗), Xr∗− X∗i + L 2kX ∗ r− X∗k2∗ = f (X∗) +L 2kX ∗ r− X ∗ k2 ∗. (C.6)

Summing up previous three inequalities, we have

h∇f(X), X − X∗ ri ≥∇f(X), X − X+₋ L 2kX + − Xk2 ∗ +µ 2kX ∗ r− Xk 2 ∗− L 2kX ∗ r− X ∗ k2 ∗. (C.7) Let A≡ I − η 2QUQ T U∇f(X) #_{, we have} X+ − X = (U − η∇f(X)#_{U )(U} − η∇f(X)#_{U )}T − UUT =_−η∇f(X)#_XA − ηAT_X ∇f(X)# _(C.8)

(10)

where we have used the property of_∇f(X)# being symmetric. Plugging the previous equality into (C.7), we achieve

h∇f(X), X − Xr∗i − µ 2kX ∗ r− Xk 2 ∗+ L 2kX ∗ r − X ∗ k2 ∗ ≥∇f(X), X − X+₋L 2kX + − Xk2 ∗ ≥2η∇f(X), ∇f(X)#_XA −L₂(2_kη∇f(X)#_XA k∗)2 (C.9)

where we have used_{kY + Y}T_k∗≤ 2kY k∗.

For the two terms on the RHS of (C.9) we have the following bounds. ∇f(X), ∇f(X)# XA =h∇f(X), ∇f(X)# U UT −η₂∇f(X), ∇f(X)#_{U U}T_Q UQTU∇f(X) #i ≥k∇f(X)#_U k2 F− η 2k∇f(X) #_U k2 FkQUQTU∇f(X) # kS∞ ≥(1 −₃₂1 )k∇f(X)#_U k2 F (C.10)

where the last inequality is due to the choice of the step size. Continuing, we compute

|∇f(X)#_XA k∗≤ k∇f(X)#Uk∗kUkS∞kAkS∞ ≤ k∇f(X)# Uk∗kUkS∞ 1 + 1 32 . (C.11)

Now plugging these two bounds into (C.9), we have

h∇f(X), X − X∗ ri − µ 2kX ∗ r− Xk 2 ∗+ L 2kX ∗ r − X∗k 2 ∗ ≥2η k∇f(X)#_U k2 F " 1₋ 1 32− ηLkUk 2 S∞ 33 32 2# ≥2η k∇f(X)#_U k2 F " 1−₃₂1 −₁₆1 33₃₂ 2# ≥18η₁₀ k∇f(X)# Uk2 F That is, h∇f(X), X − Xr∗i ≥18η₁₀ k∇f(X)#_U k2 F+ µ 2kX ∗ r− Xk 2 ∗− L 2kX ∗ r− X ∗ k2 ∗. (C.12)

We now lower bound_∇f(X)#_{, ∆∆}T_{, the second term of (C.3):}

∇f(X)#_{, ∆∆}T =Q∆QT∆∇f(X) #_{, ∆∆}T ≥ − |Trace(∆∆T_Q ∆QT∆∇f(X) #₎ | ≥ − kQ∆QT∆∇f(X) # kS∞h∆, ∆i ≥ −hkQUQTU∇f(X) # kS∞ +_kQU∗ rQ T U∗ r∇f(X) # kS∞ i · DF(U, Ur∗) 2 _(C.13)

where the last inequality is owing to Span(Col(∆))_{⊆ Span(Col(U) ∪ Col(U}∗ r)). kQUQTU∇f(X) # kS∞ DF(U, U ∗ r) 2_{= η 16(L} kXkS∞+k∇f(X) #_Q UQTUkS∞)kQUQ T U∇f(X) # kS∞DF(U, U ∗ r) 2 = 16ηL_kXkS∞kQUQ T U∇f(X) # kS∞DF(U, U ∗ r) 2₊ 16ηk∇f(X)#_Q UQTUk2S∞DF(U, U ∗ r)2 (C.14)

(11)

We bound the underlined term by considering two possible conditions, _k∇f(X)#QUQTUkS∞ ≤ µσr(X) 40 and k∇f(X)#_Q UQTUkS∞ > µσr(X) 40 . 16ηL_kXkS∞kQUQ T U∇f(X) # kS∞D 2 ≤ max 16ηLkXkS∞µσr(X) 40 D 2_{, 16η40κτ (X)} k∇f(X)#_Q UQTUk 2 S∞D 2 ≤ 16ηLkXkS∞ µσr(X) 40 D 2_{+ 16η40κτ (X)} k∇f(X)#_Q UQTUk 2 S∞D 2 ≤ µσr₄₀(X)D2_{+ 16η40κτ (X)} k∇f(X)#_Q UQTUk 2 S∞D 2 (C.15)

in which D denotes DF(U, Ur∗). Combining the previous inequality with inequality (C.14), we get

kQUQTU∇f(X) # kS∞ D 2 ≤µσ₄₀r(X)D2+ (40κτ (X) + 1)16ηk∇f(X)# QUQTUk 2 S∞D 2 (i) ≤ µσ₄₀r(X)D2+ (40(101 99 ) 2 κτ (Xr∗) + 1)16ηk∇f(X) # QUQTUk 2 S∞(ρσr(U ∗ r)) 2 ≤ µσr₄₀(X)D2+ 16· 43ηκτ(X∗ r)k∇f(X) # QUQTUk 2 S∞σr(X ∗ r)ρ 2 (ii) ≤ µσr₄₀(X)D2_{+ 16} · 43ηκτ(X∗ r)k∇f(X) #_U k2 S∞ σr(Xr∗) σr(X) ρ2 (iii) ≤ µσr₄₀(X)D2_{+ 16} · 43ηκτ(Xr∗)k∇f(X) #_U k2 S∞( 100 99 ) 2_ρ2 (iv) ≤ µσr₄₀(X)D2₊2η 29k∇f(X) #_U k2 S∞ (C.16)

(i) and (iii) is due to the assumption DF(U, U∗)≤ ρσr(Ur∗) and lemma 6. (ii) is owing to

k∇f(X)#_U kS∞ =kU T ∇f(X)# kS∞ =kU T_Q UQTU∇f(X) # kS∞ ≥ σmin(U )k∇f(X) #_Q UQTUkS∞

andσmin(U ) = σr(U ) =pσr(X). (iv) is obtained by plugging ρ = _{100κτ (X}1 ∗ r).

We first note that _∇f(U∗(U∗₎T_)U∗ _{= 0, since X}∗ _{is the optimum, and thus} _∇f(X∗_)Q U∗

r = 0. Now we start to bound

kQU∗ rQ T U∗ r∇f(X) # kS∞. kQU∗ rQ T U∗ r∇f(X) # kS∞ ≤ kQUr∗Q T U∗ r∇f(X)kS∞ =_kQU∗ rQ T U∗ r(∇f(X) − ∇f(X ∗₎₎ kS∞ ≤ k∇f(X) − ∇f(X∗₎ kS∞ ≤ L (kX − Xr∗k∗+kXr∗− X ∗ k∗) (C.17)

where the last inequality is owing to L-smoothness and the triangular inequality. Plugging inequalities (C.16) and (C.17) into (C.13), we get

∇f(X)#_{, ∆∆}T_{≥ −} µσr(X) 40 D 2₊2η 29k∇f(X) #_U k2 S∞+ L (kX − X ∗ rk∗+kXr∗− X ∗ k∗) D2 . (C.18)

Now we plug two bounds (C.12) and (C.18) into (C.3) to get

∇f(X)#_{U , U} − Ur∗RU ≥ 1 2 18η 10 k∇f(X) #_U k2 F+ µ 2kX ∗ r − Xk 2 ∗− L 2kX ∗ r − X ∗ k2 ∗ −1₂ µσr₄₀(X)D2₊2η 29k∇f(X) #_U k2 S∞+ L (kX − X ∗ rk∗+kXr∗− X∗k∗) D2 ≥ 0.86ηk∇f(X)#_U k2 F− L 4kX ∗ r− X∗k 2 ∗ +µ 4 |X∗ r− Xk 2 ∗− σr(X∗)D2 20 − 2κD 2₍ kX − X∗ rk∗+kXr∗− X∗k∗) (C.19)

(12)

Lemma 4. If DF(U, U∗)≤ ρσr(Ur∗) and ρ≤ 1001 , then for any unitary matrixR kX − X∗ rk∗=kUUT− Ur∗(Ur∗)Tk∗ =_kUUT − Ur∗RU T _{+ U}∗ rRU T − Ur∗R(U ∗ rR) T k∗ ≤ kU − U∗ rRk∗kUkS∞+kU − U ∗ rRk∗kUr∗kS∞ (i) ≤ kU − U∗ rRk∗(1 + ρ)kUr∗kS∞+kU − U ∗ rRk∗kUr∗kS∞ ≤ (2 + ρ)kU − U∗ rRk∗kUr∗kS∞ ≤ (2.01)kU − Ur∗Rk∗kUr∗kS∞ (C.20)

where (i) is due to lemma 6. Lemma 5. Let X = U UT _and_X∗

rUr∗(Ur∗) T _then kX − X∗ rk 2 F ≥ 2( √ 2− 1)σr(Xr∗)DF(U, Ur∗) 2_. _(C.21) See reference [2].

Combining lemmas 4 and 5, we obtain a lower bound for the underlined term in (C.19).

kXr∗− Xk 2 ∗− σr(X∗)D2 20 − 2κD 2₍ kX − Xr∗k∗+kXr∗− X ∗ k∗) (i) ≥kX∗ r− Xk2F− σr(X∗)D2 20 − 2κD 2 kX − X∗ rk∗+ σr(X ∗₎ 200κ1.5_{τ (X}∗ r) (ii) ≥ 2(√2− 1)σr(Xr∗)D 2 −σr(X ∗_)D2 20 − 2κD 2 kX − X∗ rk∗−σr(X ∗_)D2 50 (iii) ≥ 2(√2_{− 1) −} 1 20− 1 50 σr(Xr∗)D 2 − 2κD2_(2.01) 1 81κ σr(X∗) σ1(U∗)kU ∗ rkS∞ ≥ 2(√2− 1) −₂₀1 −₅₀1 −₂₀1 σr(Xr∗)D2 ≥ 0.7 σr(Xr∗) D 2 _(C.22)

where(i) is due to_{k · k}∗≥ k · kF, and (ii) is owing to lemma 5. (iii) is due to the assumption ˜D∗≤81κ1 σr(X?) σ1(U?). Combining (C.22) with (C.19), we get ∇f(X)#_{U , U} − U∗ rRU ≥ 0.86ηk∇f(X)#Uk2F− L 4kX ∗ r− X∗k 2 ∗+ 0.7 µ 4 σr(X ∗ r) D 2_, _(C.23)

the desired lemma 3.

DF(U+, Ur∗) 2₌ _min R is unitarykU + − U∗ rk 2 F ≤ kU + − U∗ rRUk2F =kU − U∗ rRUk2F − 2η∇f(X) # U , U− U∗ rRU + η2k∇f(X)#Uk2F (i) ≤ DF(U, U∗)2− 2η −L₄kX∗ r− X∗k 2 ∗+ 0.7µ 4 σr(X ∗ r)D(U, U∗) 2 − (2(0.86)− 1)η2 k∇f(X)#_U k2 F ≤ 1₋0.7 µ η 2 σr(X ∗ r) D(U, U∗)2₊η L 2 kX ∗ r− X ∗ k2 ∗ (C.24)

in which RU = arg min R R is unitary

kU − U∗

rRkF.(i) is by lemma 3.

Lemma 6. Let U and U∗

r be twon× r matrices such that DF(U, Ur∗)≤ ρσr(Ur∗), for ρ ≤ 1 100. Define X ≡ UU T _and X∗ r ≡ Ur∗(Ur∗)T. Then we have (1−₁₀₀1 )σ1(Ur∗)≤ σ1(U )≤ (1 + 1 100)σ1(U ∗ r) (C.25) (1₋ 1 100)σr(U ∗ r)≤ σr(U )≤ (1 + 1 100)σr(U ∗ r) (C.26) and thus τ (U )_≤ 101 99τ (U ∗ r) and τ (X)≤ ( 101 99 ) 2_{τ (X}∗ r) (C.27)

(13)

Proof. By _{k · k}S∞ ≤ k · kF and Weyl’s inequality for perturbation of singular values, we obtain |σi(Ur∗)− σi(U )| ≤ ρσr(Ur∗)≤ 1 100σr(U ∗ r). (C.28)

Now we show min

i≥0ηi ≥ 1 16η by verifying kUiU T i kS∞ ≤ _1+ρ 1−ρ 2 kX0kS∞ and k∇f(UiU T i ) #_Q UiQ T UikS∞ ≤ 4Lσ1(U0)σr(X∗)

81κσ1(U∗)(1−ρ) +k∇f(X0)kS∞. The first one is an immediate result of lemma 6. Applying the same arguments of (A.20),

(A.21) and (A.22), the second part is a direct consequence of assumption ˜D∗≤ 81κ1 σr(X?)

σ1(U?) and assumption ˜DF ≤ ρσr(U

? r).

APPENDIXD

PROOF OFSUBLINEARRATE OFSPECTRALGRADIENTDESCENT We first use following lemma to prove the sublinear rate.

f (UiUiT)− f(Ui+1Ui+1T)≥ αi· k[∇f(Xi)· Ui]#∞k 2 S∞ = αi· k∇f(Xi)· Uik2∗ (D.1) and f (UiUiT)− f(U ∗_U∗T₎ ≤ βi· k[∇f(Xi)· Ui]#∞kS∞ (D.2) where αi= 1.117 ηi and βi= (2 +19₈₁)D∞(Ui, U∗).

Defineδi= f (UiUiT)− f(U∗U∗T) and follow the previous lemma. We know{δi} is an positive decreasing sequence and

δi+1≤ δi− αi· k[∇f(X) · U]#∞k2S∞ ≤ δi− αi β2 i · δ2 i

Dividing both sides with (δi· δi+1), we obtain, by assumption (IV.5),

1 δi+1 − 1 δi ≥ αi β2 i ·_δδi i+1 ≥ αi β2 i ≥ _˜αi D2 S∞ . Telescoping the inequality we get the desired result.

Now we prove (D.1) of lemma 7. The smoothness gives

f (UiUiT)− f(Ui+1Ui+1T)≥ h∇f(Xi), Xi− Xi+1i −

L 2kXi− Xi+1k 2 S∞ =∇f(Xi), (Ui− Ui+1)UiT+ Ui(Ui− Ui+1)T | {z } 1

−∇f(Xi), (Ui− Ui+1)(Ui− Ui+1)T

| {z } 2 −L₂ kXi− Xi+1k2S∞ | {z } 3 (D.3) For 1_{we have} ∇f(Xi), (Ui− Ui+1)UiT + Ui(Ui− Ui+1)T =2_h∇f(Xi)Ui, Ui− Ui+1i =2ηi∇f(Xi)Ui, [∇f(Xi)Ui]#∞ =2ηik∇f(Xi)Uik2∗. (D.4)

∇f(Xi), (Ui− Ui+1)(Ui− Ui+1)T

=η2 ik∇f(Xi)· Uik2∗· T race(AT∇f(Xi)A Ir 0 0 0(n−r)×(n−r) ) ≤η2 ik∇f(Xi)· Uik2∗· k∇f(Xi)rk∗ (∗) ≤1₄ηik∇f(Xi)· Uik2∗ (D.5)

(14)

in which A is the left-singular vectors of _∇f(Xi)· Ui and k∇f(Xi)rk∗ equals to the sum of the top r singular values of

∇f(Xi). (∗) is by ηi≤ 4k∇f (X1i)rk∗.

kUiUiT− Ui+1Ui+1T kS∞

=_kUi(Ui− Ui+1)T + (Ui− Ui+1)UiT − (Ui− Ui+1)(Ui− Ui+1)TkS∞

≤2kUikS∞kUi− Ui+1kS∞+kUi− Ui+1k

2 S∞ =2ηikUikS∞k∇f(Xi)· Uik∗+ η 2 ik∇f(Xi)· Uik2∗ =ηi k∇f(Xi)· Uik∗ [2kUikS∞+ ηik∇f(Xi)· Uik∗] (1) ≤ ηi k∇f(Xi)· Uik∗ [2kUikS∞+ ηik∇f(Xi)rk∗kUikS∞] (2) ≤ ηi k∇f(Xi)· Uik∗ 9 4kUikS∞ (D.6)

where(1) is due to the rank of Ui is less than r and(2) is by ηi≤ _{4k∇f (X}1

i)rk∗. Plugging above inequalities into (D.3), we

obtain

≥2ηik∇f(Xi)Uik2∗− 1 4ηik∇f(Xi)· Uik 2 ∗ −L₂(9 4ηikUikS∞k∇f(Xi)· Uik∗) 2 ≥ηik∇f(Xi)· Uik2∗ " 7 4− L 2 9 4 2 ηikUik2S∞ # (∗) ≥ 1.117 ηik∇f(Xi)· Uik2∗ (D.7) (_{∗) is by η}i≤ _4LkX1 ikS∞ = 1

4LkUik2_S∞. We have thus finished the first part of lemma 7.

Now we give the proof of (D.2) of lemma 7. We denote RUi≡ arg min R R is unitary kUi− U∗RkS∞. (D.8) and define ∆Ui≡ Ui− U ∗_R Ui. Then we have f (UiUiT)− f(U∗U∗T) ≤h∇f(Xi), Xi− X∗i =_∇f(Xi), ∆UiU T i + ∇f(Xi), Ui∆TUi − ∇f(Xi), ∆Ui∆ T Ui =2_h∇f(Xi)Ui, ∆Uii −∇f(Xi), ∆Ui∆ T Ui ≤2k∇f(Xi)· Uik∗k∆UikS∞+ ∇f(Xi), ∆Ui∆ T Ui | {z } 1 (D.9)

∇f(Xi), ∆Ui∆ T Ui = h∇f(Xi)∆Ui, ∆Uii ≤ k∇f(Xi)∆Uik∗k∆UikS∞ =k∇f(Xi)P∆_Ui∆Uik∗k∆UikS∞ ≤ k∇f(Xi)P∆_Uik∗k∆Uik 2 S∞ (∗) ≤ k∇f(Xi)PUik∗ +k∇f(Xi)PU∗k_∗ k∆Uik 2 S∞ (D.10)

in which PU denotes the projection onto Col(U ), and (∗) is due to Span(Col(∆Ui)) ⊆ Span(Col(Ui)∪ Col(U

∗ r)).

(15)

k∇f(Xi)PUik∗=k∇f(Xi)UiU † ik∗ (1) ≤ k∇f(Xi)Uik∗ 1 σr(Ui) (2) ≤ k∇f(Xi)Uik∗ 10 9σr(U∗) (D.11)

in which U_i† denotes the pseudoinverse of Ui. Here(1) is due to σ1(Ui†) = σr(Ui)−1, and(2) is by assumption (IV.5), Weyl’s

inequality andσr(U∗RUi) = σr(U ∗_{). Similarly, we have} k∇f(Xi)PU∗k_∗=k∇f(X_i)U∗(U∗)†k_∗ ≤ k∇f(Xi)U∗k∗ | {z } A 1 σr(U∗) . (D.12)

To upper bound A_{, we use the following inequality.}

k∇f(Xi)U∗k∗=k∇f(Xi)U∗RUik∗ ≤ k∇f(Xi)Uik∗+k∇f(Xi)∆Uik∗ =_k∇f(Xi)Uik∗+k∇f(Xi)P∆_Ui∆Uik∗ ≤ k∇f(Xi)Uik∗+k∇f(Xi)P∆_Uik∗k∆UikS∞ (1) ≤ k∇f(Xi)Uik∗+ (k∇f(Xi)PUik∗ +k∇f(Xi)PU∗k_∗)k∆_U ikS∞ (2) ≤ k∇f(Xi)Uik∗+ 10 9 k∇f(Xi)Uik∗ +_k∇f(Xi)U∗k∗ k∆_U ikS∞ σr(U∗) (3) ≤ k∇f(Xi)Uik∗+ 1 10 10 9 k∇f(Xi)Uik∗ +_k∇f(Xi)U∗k∗ =10 9 k∇f(Xi)Uik∗+ 1 10k∇f(Xi)U ∗ k∗. (D.13)

Here, (1) is owing to the similar reason of (D.10), (2) is obtained by plugging in (D.11) and (D.12) and (3) is by assumption (IV.5). Thus we arrive at

k∇f(Xi)U∗k∗≤

10 9

2

k∇f(Xi)Uik∗. (D.14)

Plugging this into (D.12), we get

k∇f(Xi)PU∗k_∗≤ 10 9 2 k∇f(Xi)Uik∗ 1 σr(U∗) . (D.15)

Combining (D.11) and (D.15) with (D.10), we obtain

∇f(Xi), ∆Ui∆ T Ui ≤ k∇f(Xi)Uik∗ 10 9σr(U∗) + 10 9 2 k∇f(Xi)Uik∗ 1 σr(U∗) k∆Uik 2 S∞ =_k∇f(Xi)Uik∗ 190 81 k∆UikS∞ σr(U∗) k∆ UikS∞ (∗) ≤ 19₈₁k∇f(Xi)Uik∗k∆UikS∞ (D.16)

where(∗) is by assumption (IV.5). Now we plug (D.16) into (D.9) and obtain f (UiUiT)− f(U∗U∗T)≤ 2 +19 81 k∆UikS∞k∇f(Xi)· Uik∗. (D.17)

(16)

The last part is to proveminiγi≥ 1₄η by showingkUikS∞≤ 11 9kU0kS∞ and k∇f(Xi)rk∗≤ 40L 81 σr(U0)σ1(U0) +k∇f(X0)rk∗. (D.18) By assumption (IV.5) and Weyl’s inequality, we have for everyi_{≥ 0}

(1₋ 1 10)σ1(U ∗₎ ≤ σ1(Ui)≤ (1 + 1 10)σ1(U ∗_{) , and thus} 1 + 1 10 1− 1 10 σ1(U0)≥ σ1(Ui). (D.19)

Since _k∇f(Xi)rk∗ is the Ky Fanr-norm of ∇f(Xi), we have

k∇f(Xi)rk∗≤ k((∇f(Xi)− ∇f(X0))rk∗+k∇f(X0)rk∗ ≤ k∇f(Xi)− ∇f(X0)k∗+k∇f(X0)rk∗ ≤ LS∞→S1kXi− X0kS∞+k∇f(X0)rk∗ ≤ LS∞→S1 kXi− X∗kS∞+kX0− X ∗ kS∞ +_k∇f(X0)rk∗. Since kXi− X∗kS∞ =kUi(Ui− U ∗_R Ui) T + (Ui− U∗RUi)(U ∗_R Ui) T kS∞ ≤ kUi− U∗RUikS∞(kUikS∞+kU ∗ kS∞), we have kXi− X∗kS∞+kX0− X ∗ kS∞ ≤kUi− U∗RUikS∞(kUikS∞+kU ∗ kS∞) +_kU0− U∗RU0kS∞(kU0kS∞+kU ∗ kS∞) ≤σr(U ∗₎ 10 σ1(U0)( 11 9 + 10 9 + 1 + 10 9 ) ≤₁ 1 − 1 10 σr(U0) 10 σ1(U0) 40 9 =40 81σr(U0)σ1(U0) by applying inequality (D.19). APPENDIXE PROOF OFLEMMA1 Using chain rules, we see that

∇f(A) = ∇lse(Ax)x>. (E.1)

To prove the convexity off , we compute h∇f(A) − ∇f(A0_{), A} − A0 i =∇lse(Ax)x> − ∇lse(A0_x)x>_{, A} − A0 =_{h∇lse(Ax) − ∇lse(A}0x), Ax_{− A}0x_i ≥ 0

since the lse function is convex.

We now turn to the smoothness parameters. Since transposing a matrix does not alter the Schatten-p norm, we have k∇f(A) − ∇f(A0₎ kF = ∇lse(Ax) − ∇lse(A0_x)_x> _F ≤ kxk2k∇lse(Ax) − ∇lse(A0x)k2 ≤ kxk2k(A − A0)xk2, by (IV.2) ≤ kxk2 2kA − A0kF

(17)

Using similar arguments, we compute k∇f(A) − ∇f(A0)_kS1 = ∇lse(Ax) − ∇lse(A0x)x> _S 1 ≤ kxk2k∇lse(Ax) − ∇lse(A0x)k2 ≤ kxk2k∇lse(Ax) − ∇lse(A0x)k1 ≤ kxk2k(A − A0)xk∞, by (IV.3) ≤ kxk2 2kA − A0kS∞

which establishes (IV.9).

APPENDIXF PROOF OFLEMMA2 The convexity of ˆf follows immediately from the convexity of f . For any two positive semi-definite matricesZ1= A1 B1

B> 1 D1 andZ2= A2 B2 B> 2 D2 , we have k∇ ˆf (Z1)− ∇ ˆf (Z2)kqSq = 1 2 0 _∇f(B1)− ∇f(B2) ∇f(B1)− ∇f(B2) 0 q Sq (1) = 1 2 ∇f(B1)− ∇f(B2) 0 0 ∇f(B1)− ∇f(B2) q Sq (2) = 1 2 k∇f(B1)− ∇f(B2)kqSq+k∇f(B1)− ∇f(B2)k q Sq (3) ≤ Lq kB1− B2kqSp, (F.1) where

(1) is because_{|| · ||}Sq is permutation invariant,

(2) is because of the block-diagonal structure, and (3) uses the smoothness of f .

It remains to see that_kB1− B2kSp≤ kZ1− Z2kSp. In order to prove this, we use the permutation invariance of the || · ||Sp,

and the Pinching inequality [1]:

kZ1− Z2kp_S_p = B1− B2 A1− A2 D1− D2 B1>− B>2 p Sp ≥ kB1− B2kp_S_p. APPENDIXG

SYNTHETICDATA FORPHASERETRIEVAL

(18)

Original Picture

Nuclear Wirtinger Flow Wirtinger Flow Fienup

Phaselift Phaselamp Phasemax

Truncated Amplitude Flow SketchyCGM Reweighted Wirtinger Flow

(19)

Original Picture

Nuclear Wirtinger Flow Wirtinger Flow Fienup

Phaselift Phaselamp Phasemax

Truncated Amplitude Flow SketchyCGM Reweighted Wirtinger Flow

(20)

APPENDIXH

WIRTINGERFLOW V.S. NUCLEARWIRTINGERFLOW

In Figure 4 we present the images recovered from nuclear Wirtinger flow and Wirtinger flow, indexed by time. Our experiments show that the nuclear Wirtinger flow quickly finds area (att = 4s) where meaningful image characteristics start to emerge. At t = 8s, a fully visible image is recovered, and the reconstruction stays at the solution for a short period. However, the nuclear Wirtinger flow eventually overfits and returns a noisy figure; see Figure 3. This phenomenon is possibly due to the mismatch of the mathematical model and real Fourier Ptychographic reconstructions.

In contrast, the Wirtinger flow recovers only partial image at t = 8s, and exhibits oscillating behaviors. Eventually the Wirtinger flow overfits, and return solutions like random noise.

We stress that the Wirtinger flow fails to recover the image for all the initializations we have tried, whereas the nuclear Wirtinger flow is quite robust to the choice of initial point.

2700 20 40 60 80 100 120 140 160 20 40 60 80 100 120 140 160 0.062953 20 40 60 80 100 120 140 160 20 40 60 80 100 120 140 160

Fig. 3: Final solution of the nuclear Wirtinger flow,t = 37s.

APPENDIXI

SPECTRALGRADIENTMETHODS FORFA S TTE X T

Four more datasets are presented in Figure 5. The results are in accordance with our observations in Section VI-B: The heuristic version of (V.4) is the best optimization algorithm, in that it solves the training problem most efficiently, but is prone to overfitting. On the other hand, the theoretical iterates (V.4) is either the best or comparable to the other methods in terms of prediction accuracy.

(21)

REFERENCES [1] R. Bhatia, Matrix analysis. Springer Science & Business Media, 2013, vol. 169.

[2] S. Tu, R. Boczar, M. Simchowitz, M. Soltanolkotabi, and B. Recht, “Low-rank solutions of linear matrix equations via procrustes flow,” in Proceedings of The 33rd International Conference on Machine Learning, 2016, pp. 964–973.

(22)

(a) Nuclear Wirtinger flow at t = 2s. (b) Wirtinger flow at t = 2s.

(c) Nuclear Wirtinger flow at t = 4s. (d) Wirtinger flow at t = 4s.

(e) Nuclear Wirtinger flow at t = 8s. (f) Wirtinger flow at t = 8s.

(g) Nuclear Wirtinger flow at t = 14s. (h) Wirtinger flow at t = 14s.

(i) Nuclear Wirtinger flow at t = 16s. (j) Wirtinger flow at t = 16s.

(23)

2 4 6 8 10 12 epochs 0.08 0.10 0.12 0.14 0.16 0.18 loss

yelp review polarity

[∇Af ;∇Bf ]#∞ [∇f]#∞ ∇f 2 4 6 8 10 12 epochs 0.952 0.953 0.954 0.955 0.956 0.957 0.958 a ccura cy

yelp review polarity

[∇Af ;∇Bf ]#∞ [∇f]#∞ ∇f 2 4 6 8 10 12 epochs 0.05 0.10 0.15 0.20 0.25 0.30 0.35 0.40 0.45 loss ag news [∇Af ;∇Bf ]#∞ [∇f]#∞ ∇f 2 4 6 8 10 12 epochs 0.914 0.916 0.918 0.920 0.922 0.924 0.926 a ccura cy ag news [∇Af ;∇Bf ]#∞ [∇f]#∞ ∇f 2 4 6 8 10 12 epochs 0.06 0.08 0.10 0.12 0.14 0.16 0.18 loss sogou news [∇Af ;∇Bf ]#∞ [∇f]#∞ ∇f 2 4 6 8 10 12 epochs 0.963 0.964 0.965 0.966 0.967 0.968 0.969 0.970 a ccura cy sogou news [∇Af ;∇Bf ]#∞ [∇f]#∞ ∇f 2 4 6 8 10 12 epochs 0.12 0.13 0.14 0.15 0.16 0.17 0.18 0.19 loss

amazon review polarity

[∇Af ;∇Bf ]#∞ [∇f]#∞ ∇f 2 4 6 8 10 12 epochs 0.940 0.941 0.942 0.943 0.944 0.945 0.946 0.947 a ccura cy

amazon review polarity

[∇Af ;∇Bf ]#∞ [∇f]#∞ ∇f

Fig. 5: From left to right, training loss and test accuracy. From top to bottom, results on Yelp Review Polarity, AG News, Sogou News, and Amazon Review Polarity