
A Non-Euclidean Gradient Descent Framework for Non-Convex Matrix Factorization


Academic year: 2021


Supplementary Materials to “A Non-Euclidean Gradient Descent Framework for Non-Convex Matrix Factorization”

Ya-Ping Hsieh, Yu-Chun Kao, Rabeeh Karimi Mahabadi, Alp Yurtsever, Anastasios Kyrillidis, Member, IEEE, and Volkan Cevher, Senior Member, IEEE

APPENDIX A

PROOF OF SUBLINEAR RATE OF NUCLEAR GRADIENT DESCENT

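We first use the following lemma to prove the sublinear rate.

Lemma 1. For the sequence of iterates $\{U_i\}_{i=0}^{k}$, we have

\[ f(U_iU_i^T) - f(U_{i+1}U_{i+1}^T) \ge \alpha_i \cdot \|\nabla f(X_i)\cdot U_i\|_{S_\infty}^2 \tag{A.1} \]

and

\[ f(U_iU_i^T) - f(U^*U^{*T}) \le \beta_i \cdot \|\nabla f(X_i)\cdot U_i\|_{S_\infty} \tag{A.2} \]

where $\alpha_i = 1.117\,\eta_i$ and $\beta_i = (2+\tfrac{19}{81})\|\Delta_{U_i}\|_* = (2+\tfrac{19}{81})\,D_*(U_i, U^*)$.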

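Define $\delta_i = f(U_iU_i^T) - f(U^*U^{*T})$. By the previous lemma, $\{\delta_i\}$ is a positive decreasing sequence and

\[ \delta_{i+1} \le \delta_i - \alpha_i \cdot \|\nabla f(X_i)\cdot U_i\|_{S_\infty}^2 \le \delta_i - \frac{\alpha_i}{\beta_i^2}\cdot \delta_i^2 . \]

Dividing both sides by $\delta_i \cdot \delta_{i+1}$, we obtain, by assumption (III.5),

\[ \frac{1}{\delta_{i+1}} - \frac{1}{\delta_i} \ge \frac{\alpha_i}{\beta_i^2}\cdot\frac{\delta_i}{\delta_{i+1}} \ge \frac{\alpha_i}{\beta_i^2} \ge \frac{\tilde\alpha_i}{D_*^2} . \]

Telescoping this inequality gives the desired result.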
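The telescoping step can be illustrated numerically. The sketch below is only an illustration under assumed values: the constant `c` stands in for a uniform lower bound on $\alpha_i/\beta_i^2$, and `delta0`, `c`, and `steps` are hypothetical numbers, not quantities from the paper. It runs the worst-case recursion $\delta_{i+1} = \delta_i - c\,\delta_i^2$ and checks the resulting $O(1/k)$ bound $\delta_k \le 1/(1/\delta_0 + c\,k)$.

```python
def simulate(delta0, c, steps):
    """Iterate the worst-case recursion delta_{i+1} = delta_i - c * delta_i**2."""
    deltas = [delta0]
    for _ in range(steps):
        d = deltas[-1]
        deltas.append(d - c * d * d)
    return deltas


# Hypothetical values chosen so that c * delta0 < 1 (keeps the sequence positive).
delta0, c, steps = 1.0, 0.1, 200
deltas = simulate(delta0, c, steps)

# Telescoping 1/delta_{i+1} - 1/delta_i >= c gives delta_k <= 1 / (1/delta0 + c*k).
bounds = [1.0 / (1.0 / delta0 + c * k) for k in range(steps + 1)]
holds = all(d <= b + 1e-12 for d, b in zip(deltas, bounds))
print("sublinear bound holds:", holds)
```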

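Now we prove (A.1) of Lemma 1. Smoothness gives

\[
\begin{aligned}
f(U_iU_i^T) - f(U_{i+1}U_{i+1}^T)
&\ge \langle \nabla f(X_i), X_i - X_{i+1}\rangle - \frac{L}{2}\|X_i - X_{i+1}\|_*^2 \\
&= \underbrace{\langle \nabla f(X_i), (U_i-U_{i+1})U_i^T + U_i(U_i-U_{i+1})^T\rangle}_{\text{①}}
 - \underbrace{\langle \nabla f(X_i), (U_i-U_{i+1})(U_i-U_{i+1})^T\rangle}_{\text{②}}
 - \underbrace{\frac{L}{2}\|X_i - X_{i+1}\|_*^2}_{\text{③}} .
\end{aligned}
\tag{A.3}
\]

For ① we have

\[
\begin{aligned}
\langle \nabla f(X_i), (U_i-U_{i+1})U_i^T + U_i(U_i-U_{i+1})^T\rangle
&= 2\langle \nabla f(X_i)U_i, U_i - U_{i+1}\rangle \\
&= 2\eta_i \langle \nabla f(X_i)U_i, [\nabla f(X_i)U_i]^{\#_\infty}\rangle \\
&= 2\eta_i \|\nabla f(X_i)U_i\|_{S_\infty}^2 .
\end{aligned}
\tag{A.4}
\]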
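The identity used in the last line of (A.4) can be checked numerically. The sketch below assumes that the sharp operator for this appendix is the one associated with the nuclear norm, that is, a maximizer of $\langle g, s\rangle - \tfrac12\|s\|_*^2$, which is attained at the rank-one matrix $\sigma_1 u_1 v_1^T$. The helper name `nuclear_sharp` and the random matrix standing in for $\nabla f(X_i)U_i$ are hypothetical.

```python
import numpy as np


def nuclear_sharp(g):
    """Assumed sharp operator w.r.t. the nuclear norm: sigma_1 * u_1 v_1^T."""
    u, s, vt = np.linalg.svd(g, full_matrices=False)
    return s[0] * np.outer(u[:, 0], vt[0, :])


rng = np.random.default_rng(0)
g = rng.standard_normal((6, 3))  # stands in for grad f(X_i) U_i
g_sharp = nuclear_sharp(g)

spec = np.linalg.norm(g, 2)                     # spectral norm ||g||_{S_inf}
inner = np.sum(g * g_sharp)                     # <g, g^#>
nuc_of_sharp = np.linalg.norm(g_sharp, "nuc")   # nuclear norm of the step direction

print(inner, spec ** 2, nuc_of_sharp)
```

Under this assumption, `inner` equals the squared spectral norm of `g`, and the nuclear norm of the direction equals the spectral norm of `g`, which are the two facts used when bounding ① and ③.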


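To upper bound ②, we use

\[
\begin{aligned}
\langle \nabla f(X_i), (U_i-U_{i+1})(U_i-U_{i+1})^T\rangle
&= \eta_i^2 \|\nabla f(X_i)\cdot U_i\|_{S_\infty}^2 \cdot \mathrm{Trace}(\nabla f(X_i)A_1A_1^T) \\
&\le \eta_i^2 \|\nabla f(X_i)\cdot U_i\|_{S_\infty}^2 \cdot \|\nabla f(X_i)\|_{S_\infty} \\
&\overset{(*)}{\le} \frac14 \eta_i \|\nabla f(X_i)\cdot U_i\|_{S_\infty}^2
\end{aligned}
\tag{A.5}
\]

in which $A_1$ is the first singular vector of $\nabla f(X_i)\cdot U_i$, and $(*)$ is by $\eta_i \le \frac{1}{4\|\nabla f(X_i)\|_{S_\infty}}$.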

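To upper bound ③, we use

\[
\begin{aligned}
\|U_iU_i^T - U_{i+1}U_{i+1}^T\|_*
&= \|U_i(U_i-U_{i+1})^T + (U_i-U_{i+1})U_i^T - (U_i-U_{i+1})(U_i-U_{i+1})^T\|_* \\
&\le 2\|U_i\|_{S_\infty}\|U_i-U_{i+1}\|_* + \|U_i-U_{i+1}\|_{S_\infty}\|U_i-U_{i+1}\|_* \\
&= 2\eta_i\|U_i\|_{S_\infty}\|\nabla f(X_i)\cdot U_i\|_{S_\infty} + \eta_i^2\|\nabla f(X_i)\cdot U_i\|_{S_\infty}^2 \\
&= \eta_i\|\nabla f(X_i)\cdot U_i\|_{S_\infty}\left[2\|U_i\|_{S_\infty} + \eta_i\|\nabla f(X_i)\cdot U_i\|_{S_\infty}\right] \\
&\le \eta_i\|\nabla f(X_i)\cdot U_i\|_{S_\infty}\left[2\|U_i\|_{S_\infty} + \eta_i\|\nabla f(X_i)\|_{S_\infty}\|U_i\|_{S_\infty}\right] \\
&\overset{(1)}{\le} \eta_i\|\nabla f(X_i)\cdot U_i\|_{S_\infty}\,\frac94\|U_i\|_{S_\infty}
\end{aligned}
\tag{A.6}
\]

where (1) is by $\eta_i \le \frac{1}{4\|\nabla f(X_i)\|_{S_\infty}}$. The left-hand side is measured in the nuclear norm, matching term ③ of (A.3).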

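Plugging the above inequalities into (A.3), we obtain

\[
\begin{aligned}
f(U_iU_i^T) - f(U_{i+1}U_{i+1}^T)
&\ge 2\eta_i\|\nabla f(X_i)U_i\|_{S_\infty}^2 - \frac14\eta_i\|\nabla f(X_i)\cdot U_i\|_{S_\infty}^2 - \frac L2\left(\frac94\eta_i\|U_i\|_{S_\infty}\|\nabla f(X_i)\cdot U_i\|_{S_\infty}\right)^2 \\
&\ge \eta_i\|\nabla f(X_i)\cdot U_i\|_{S_\infty}^2\left[\frac74 - \frac L2\left(\frac94\right)^2\eta_i\|U_i\|_{S_\infty}^2\right] \\
&\overset{(*)}{\ge} 1.117\,\eta_i\|\nabla f(X_i)\cdot U_i\|_{S_\infty}^2
\end{aligned}
\tag{A.7}
\]

where $(*)$ is by $\eta_i \le \frac{1}{4L\|X_i\|_{S_\infty}} = \frac{1}{4L\|U_i\|_{S_\infty}^2}$. This finishes the first part of Lemma 1.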
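The constant 1.117 in (A.7) follows from substituting the largest admissible step size into the bracket. A minimal check with exact rational arithmetic (the variable names are illustrative only):

```python
from fractions import Fraction

# Largest value of eta_i * L * ||U_i||_{S_inf}^2 allowed by the step-size rule.
eta_L_U2 = Fraction(1, 4)

# Bracket in (A.7): 7/4 - (L/2) * (9/4)^2 * eta_i * ||U_i||^2.
constant = Fraction(7, 4) - Fraction(1, 2) * Fraction(9, 4) ** 2 * eta_L_U2
print(constant, float(constant))
```

The exact value is 143/128, which is slightly above 1.117.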

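Now we give the proof of (A.2) of Lemma 1. We denote

\[ R_{U_i} \equiv \arg\min_{R \text{ is unitary}} \|U_i - U^*R\|_* \tag{A.8} \]

and define $\Delta_{U_i} \equiv U_i - U^*R_{U_i}$. We begin with

\[
\begin{aligned}
f(U_iU_i^T) - f(U^*U^{*T})
&\le \langle \nabla f(X_i), X_i - X^*\rangle \\
&= \langle \nabla f(X_i), \Delta_{U_i}U_i^T\rangle + \langle \nabla f(X_i), U_i\Delta_{U_i}^T\rangle - \langle \nabla f(X_i), \Delta_{U_i}\Delta_{U_i}^T\rangle \\
&= 2\langle \nabla f(X_i)U_i, \Delta_{U_i}\rangle - \langle \nabla f(X_i), \Delta_{U_i}\Delta_{U_i}^T\rangle \\
&\le 2\|\nabla f(X_i)\cdot U_i\|_{S_\infty}\|\Delta_{U_i}\|_* + \underbrace{\langle \nabla f(X_i), \Delta_{U_i}\Delta_{U_i}^T\rangle}_{\text{①}} .
\end{aligned}
\tag{A.9}
\]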


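To upper bound ①, we use

\[
\begin{aligned}
\langle \nabla f(X_i), \Delta_{U_i}\Delta_{U_i}^T\rangle
&= \langle \nabla f(X_i)\Delta_{U_i}, \Delta_{U_i}\rangle \\
&\le \|\nabla f(X_i)\Delta_{U_i}\|_{S_\infty}\|\Delta_{U_i}\|_* \\
&= \|\nabla f(X_i)P_{\Delta_{U_i}}\Delta_{U_i}\|_{S_\infty}\|\Delta_{U_i}\|_* \\
&\le \|\nabla f(X_i)P_{\Delta_{U_i}}\|_{S_\infty}\|\Delta_{U_i}\|_{S_\infty}\|\Delta_{U_i}\|_* \\
&\overset{(*)}{\le} \left(\|\nabla f(X_i)P_{U_i}\|_{S_\infty} + \|\nabla f(X_i)P_{U^*}\|_{S_\infty}\right)\|\Delta_{U_i}\|_*^2
\end{aligned}
\tag{A.10}
\]

in which $P_U$ denotes the projection onto $\mathrm{Col}(U)$. $(*)$ is due to $\mathrm{Span}(\mathrm{Col}(\Delta_{U_i})) \subseteq \mathrm{Span}(\mathrm{Col}(U_i) \cup \mathrm{Col}(U_r^*))$ and $\|\Delta_{U_i}\|_{S_\infty} \le \|\Delta_{U_i}\|_*$. Continuing, we get

\[
\|\nabla f(X_i)P_{U_i}\|_{S_\infty} = \|\nabla f(X_i)U_iU_i^\dagger\|_{S_\infty}
\overset{(1)}{\le} \|\nabla f(X_i)U_i\|_{S_\infty}\frac{1}{\sigma_r(U_i)}
\overset{(2)}{\le} \|\nabla f(X_i)U_i\|_{S_\infty}\frac{10}{9\,\sigma_r(U^*)}
\tag{A.11}
\]

in which $U_i^\dagger$ denotes the pseudoinverse of $U_i$. Here, (1) is due to $\sigma_1(U_i^\dagger) = \sigma_r(U_i)^{-1}$, and (2) is by assumption (III.5), Weyl's inequality and $\sigma_r(U^*R_{U_i}) = \sigma_r(U^*)$. Similarly, we have

\[
\|\nabla f(X_i)P_{U^*}\|_{S_\infty} = \|\nabla f(X_i)U^*(U^*)^\dagger\|_{S_\infty}
\le \underbrace{\|\nabla f(X_i)U^*\|_{S_\infty}}_{\text{Ⓐ}}\,\frac{1}{\sigma_r(U^*)} .
\tag{A.12}
\]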

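To upper bound Ⓐ, we use the following inequality.

\[
\begin{aligned}
\|\nabla f(X_i)U^*\|_{S_\infty}
&= \|\nabla f(X_i)U^*R_{U_i}\|_{S_\infty} \\
&\le \|\nabla f(X_i)U_i\|_{S_\infty} + \|\nabla f(X_i)\Delta_{U_i}\|_{S_\infty} \\
&= \|\nabla f(X_i)U_i\|_{S_\infty} + \|\nabla f(X_i)P_{\Delta_{U_i}}\Delta_{U_i}\|_{S_\infty} \\
&\le \|\nabla f(X_i)U_i\|_{S_\infty} + \|\nabla f(X_i)P_{\Delta_{U_i}}\|_{S_\infty}\|\Delta_{U_i}\|_{S_\infty} \\
&\overset{(1)}{\le} \|\nabla f(X_i)U_i\|_{S_\infty} + \left(\|\nabla f(X_i)P_{U_i}\|_{S_\infty} + \|\nabla f(X_i)P_{U^*}\|_{S_\infty}\right)\|\Delta_{U_i}\|_{S_\infty} \\
&\overset{(2)}{\le} \|\nabla f(X_i)U_i\|_{S_\infty} + \left(\frac{10}{9}\|\nabla f(X_i)U_i\|_{S_\infty} + \|\nabla f(X_i)U^*\|_{S_\infty}\right)\frac{\|\Delta_{U_i}\|_{S_\infty}}{\sigma_r(U^*)} \\
&\overset{(3)}{\le} \|\nabla f(X_i)U_i\|_{S_\infty} + \frac{1}{10}\left(\frac{10}{9}\|\nabla f(X_i)U_i\|_{S_\infty} + \|\nabla f(X_i)U^*\|_{S_\infty}\right) \\
&= \frac{10}{9}\|\nabla f(X_i)U_i\|_{S_\infty} + \frac{1}{10}\|\nabla f(X_i)U^*\|_{S_\infty} .
\end{aligned}
\tag{A.13}
\]

Here, (1) holds for a reason similar to (A.10), (2) is obtained by plugging in (A.11) and (A.12), and (3) is by assumption (III.5) and $\|\Delta_{U_i}\|_{S_\infty} \le \|\Delta_{U_i}\|_*$. Thus we arrive at

\[ \|\nabla f(X_i)U^*\|_{S_\infty} \le \left(\frac{10}{9}\right)^2\|\nabla f(X_i)U_i\|_{S_\infty} . \tag{A.14} \]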

Plugging this into (A.12), we get

\[ \|\nabla f(X_i)P_{U^*}\|_{S_\infty} \le \left(\frac{10}{9}\right)^2\|\nabla f(X_i)U_i\|_{S_\infty}\frac{1}{\sigma_r(U^*)} . \tag{A.15} \]


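Combining (A.11) and (A.15) with (A.10), we obtain

\[
\begin{aligned}
\langle \nabla f(X_i), \Delta_{U_i}\Delta_{U_i}^T\rangle
&\le \left(\|\nabla f(X_i)U_i\|_{S_\infty}\frac{10}{9\,\sigma_r(U^*)} + \left(\frac{10}{9}\right)^2\|\nabla f(X_i)U_i\|_{S_\infty}\frac{1}{\sigma_r(U^*)}\right)\|\Delta_{U_i}\|_*^2 \\
&= \|\nabla f(X_i)U_i\|_{S_\infty}\,\frac{190}{81}\,\frac{\|\Delta_{U_i}\|_*}{\sigma_r(U^*)}\,\|\Delta_{U_i}\|_* \\
&\overset{(*)}{\le} \frac{19}{81}\|\nabla f(X_i)U_i\|_{S_\infty}\|\Delta_{U_i}\|_*
\end{aligned}
\tag{A.16}
\]

where $(*)$ is by assumption (III.5). Now we plug (A.16) into (A.9) and obtain

\[ f(U_iU_i^T) - f(U^*U^{*T}) \le \left(2 + \frac{19}{81}\right)\|\Delta_{U_i}\|_*\,\|\nabla f(X_i)\cdot U_i\|_{S_\infty} . \tag{A.17} \]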
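The rational constants in (A.13) through (A.17) can be traced with exact arithmetic. The sketch below only replays the bookkeeping of the coefficients; the variable names are illustrative.

```python
from fractions import Fraction

ratio = Fraction(10, 9)   # coefficient from (A.11)
tenth = Fraction(1, 10)   # from assumption (III.5): ||Delta|| / sigma_r(U*) <= 1/10

# (A.13) reads b <= (10/9) a + (1/10) b, hence b <= (10/9) / (9/10) * a.
coeff_Ustar = ratio / (1 - tenth)

# (A.16): sum of the two projection bounds, then one factor ||Delta||/sigma_r <= 1/10.
total = ratio + coeff_Ustar
after_assumption = total * tenth

# (A.17): the factor multiplying ||Delta||_* ||grad f(X_i) U_i||_{S_inf}.
beta_factor = 2 + after_assumption
print(coeff_Ustar, total, after_assumption, beta_factor)
```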

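The last part is to prove $\min_i \gamma_i \ge \frac14\eta$ by showing $\|U_i\|_{S_\infty} \le \frac{11}{9}\|U_0\|_{S_\infty}$ and

\[ \|\nabla f(X_i)\|_{S_\infty} \le \frac{40L}{81}\,\sigma_r(U_0)\,\sigma_1(U_0) + \|\nabla f(X_0)\|_{S_\infty} . \tag{A.18} \]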

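By assumption (III.5) and Weyl's inequality, we have for every $i \ge 0$

\[ \left(1 - \frac{1}{10}\right)\sigma_1(U^*) \le \sigma_1(U_i) \le \left(1 + \frac{1}{10}\right)\sigma_1(U^*), \quad\text{and thus}\quad \frac{1 + \frac{1}{10}}{1 - \frac{1}{10}}\,\sigma_1(U_0) \ge \sigma_1(U_i) . \tag{A.19} \]

For $\|\nabla f(X_i)\|_{S_\infty}$, we have

\[
\begin{aligned}
\|\nabla f(X_i)\|_{S_\infty}
&\le \|\nabla f(X_i) - \nabla f(X_0)\|_{S_\infty} + \|\nabla f(X_0)\|_{S_\infty} \\
&\le L_{S_1\to S_\infty}\|X_i - X_0\|_* + \|\nabla f(X_0)\|_{S_\infty} \\
&\le L_{S_1\to S_\infty}\left(\|X_i - X^*\|_* + \|X_0 - X^*\|_*\right) + \|\nabla f(X_0)\|_{S_\infty} .
\end{aligned}
\tag{A.20}
\]

Since

\[
\|X_i - X^*\|_* = \|U_i(U_i - U^*R_{U_i})^T + (U_i - U^*R_{U_i})(U^*R_{U_i})^T\|_*
\le \|U_i - U^*R_{U_i}\|_*\left(\|U_i\|_{S_\infty} + \|U^*\|_{S_\infty}\right),
\tag{A.21}
\]

we have

\[
\begin{aligned}
\|X_i - X^*\|_* + \|X_0 - X^*\|_*
&\le \|U_i - U^*R_{U_i}\|_*\left(\|U_i\|_{S_\infty} + \|U^*\|_{S_\infty}\right) + \|U_0 - U^*R_{U_0}\|_*\left(\|U_0\|_{S_\infty} + \|U^*\|_{S_\infty}\right) \\
&\le \frac{\sigma_r(U^*)}{10}\,\sigma_1(U_0)\left(\frac{11}{9} + \frac{10}{9} + 1 + \frac{10}{9}\right) \\
&\le \frac{1}{1 - \frac{1}{10}}\,\frac{\sigma_r(U_0)}{10}\,\sigma_1(U_0)\,\frac{40}{9}
= \frac{40}{81}\,\sigma_r(U_0)\,\sigma_1(U_0)
\end{aligned}
\tag{A.22}
\]

by applying inequality (A.19).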
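Weyl's inequality for singular values, used to obtain (A.19), states that $|\sigma_j(U) - \sigma_j(U^*)| \le \|U - U^*\|_{S_\infty}$. The sketch below checks it on random data. It is a simplified illustration: the perturbation is measured directly in the spectral norm with no rotation $R$, and the matrix sizes and seed are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
U_star = rng.standard_normal((8, 3))
sig_star = np.linalg.svd(U_star, compute_uv=False)

# Perturbation scaled so that its spectral norm equals sigma_r(U*) / 10.
E = rng.standard_normal((8, 3))
E *= 0.1 * sig_star[-1] / np.linalg.norm(E, 2)
U = U_star + E
sig = np.linalg.svd(U, compute_uv=False)

# Weyl: every singular value moves by at most the spectral norm of E.
weyl_gap = np.max(np.abs(sig - sig_star))
print(weyl_gap, 0.1 * sig_star[-1])
```

Since $\sigma_r(U^*) \le \sigma_1(U^*)$, the same perturbation keeps $\sigma_1(U)$ within a factor $1 \pm \frac{1}{10}$ of $\sigma_1(U^*)$, which is the two-sided bound in (A.19).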


APPENDIX B

PROOF OF SUBLINEAR RATE FOR NUCLEAR GRADIENT DESCENT, TENSOR VERSION

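We first define the action $(\cdot)$ of a bounded linear operator $T$ on $H_1 \otimes H_2$ on an element of $H_1$, where $H_1$ and $H_2$ are Hilbert spaces. For all $T \in L(H_1 \otimes H_2, \mathbb{R})$ and $h_1 \in H_1$, with $T = \sum_i \lambda_i (a_i \otimes b_i)$,

\[ T \cdot h_1 \triangleq \sum_i \lambda_i \langle a_i, h_1\rangle\, b_i \in H_2 . \tag{B.1} \]

Immediately we have $\langle T, h_1 \otimes h_2\rangle = \langle T\cdot h_1, h_2\rangle$, since

\[
\langle T, h_1 \otimes h_2\rangle = \Big\langle \sum_i \lambda_i (a_i \otimes b_i),\, h_1 \otimes h_2 \Big\rangle
= \sum_i \lambda_i \langle a_i, h_1\rangle\langle b_i, h_2\rangle
= \langle T\cdot h_1, h_2\rangle .
\tag{B.2}
\]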
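A small numerical illustration of (B.1) and (B.2), under the simplifying assumption that $H_1 = \mathbb{R}^{n_1}$ and $H_2 = \mathbb{R}^{n_2}$ are spaces of vectors, so that $a \otimes b$ can be stored as the outer product $ab^T$. The helper `act` and all dimensions are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
n1, n2, k = 4, 5, 3
lam = rng.standard_normal(k)
a = rng.standard_normal((k, n1))
b = rng.standard_normal((k, n2))

# T = sum_i lam_i (a_i tensor b_i), stored as an n1 x n2 matrix.
T = sum(lam[i] * np.outer(a[i], b[i]) for i in range(k))


def act(h1):
    """Action (B.1): T . h1 = sum_i lam_i <a_i, h1> b_i, an element of H_2."""
    return sum(lam[i] * (a[i] @ h1) * b[i] for i in range(k))


h1 = rng.standard_normal(n1)
h2 = rng.standard_normal(n2)

lhs = np.sum(T * np.outer(h1, h2))  # <T, h1 tensor h2>
rhs = act(h1) @ h2                  # <T . h1, h2>
print(lhs, rhs)
```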

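For the case $H_i = \mathbb{R}^{n_i \times m_i}$, we take the norm to be the injective cross norm, where each $H_i$ carries the spectral norm $\|\cdot\|_{S_\infty}$ as primal norm and, consequently, the nuclear norm $\|\cdot\|_*$ as dual norm:

\[ \|x\| \triangleq \sup_{\|a_i\|_* \le 1} \langle a_1 \otimes a_2, x\rangle \tag{B.3} \]

which satisfies

\[ \|h_1 \otimes h_2\| = \|h_1\|_{S_\infty}\|h_2\|_{S_\infty}, \qquad \|a_1 \otimes a_2\|_{\mathrm{dual}} = \|a_1\|_*\|a_2\|_* . \tag{B.4} \]

We also use $\|h_1 \otimes h_2\|_{S_\infty}$ and $\|a_1 \otimes a_2\|_*$ to denote $\|h_1 \otimes h_2\|$ and $\|a_1 \otimes a_2\|_{\mathrm{dual}}$.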

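We use the following lemma to prove the sublinear rate.

Lemma 2. For the sequence of iterates $\{U_i\}_{i=0}^{k}$, we have

\[ f(U_i \otimes U_i) - f(U_{i+1} \otimes U_{i+1}) \ge \alpha_i \cdot \|[\nabla f(X_i)\cdot U_i]^{\#_\infty}\|_{S_\infty}^2 = \alpha_i \cdot \|\nabla f(X_i)\cdot U_i\|_*^2 \tag{B.5} \]

and

\[ f(U_i \otimes U_i) - f(U^* \otimes U^*) \le \beta_i \cdot \|[\nabla f(X_i)\cdot U_i]^{\#_\infty}\|_{S_\infty} \tag{B.6} \]

where $\alpha_i = 1.117\,\eta_i$ and $\beta_i = (2+\tfrac{19}{81})\|\Delta_{U_i}\|_{S_\infty} = (2+\tfrac{19}{81})\,D_\infty(U_i, U^*)$.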

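Define $\delta_i = f(U_i \otimes U_i) - f(U^* \otimes U^*)$. By the previous lemma, $\{\delta_i\}$ is a positive decreasing sequence and

\[ \delta_{i+1} \le \delta_i - \alpha_i \cdot \|[\nabla f(X_i)\cdot U_i]^{\#_\infty}\|_{S_\infty}^2 \le \delta_i - \frac{\alpha_i}{\beta_i^2}\cdot\delta_i^2 . \]

Dividing both sides by $\delta_i \cdot \delta_{i+1}$, we obtain, by assumption (B.14),

\[ \frac{1}{\delta_{i+1}} - \frac{1}{\delta_i} \ge \frac{\alpha_i}{\beta_i^2}\cdot\frac{\delta_i}{\delta_{i+1}} \ge \frac{\alpha_i}{\beta_i^2} \ge \frac{\tilde\alpha_i}{D_{S_\infty}^2} . \]

Telescoping this inequality gives the desired result.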

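Now we prove (B.5) of Lemma 2. We assume $\nabla f(X)$ is symmetric, i.e. $\nabla f(X) = \sum_i \lambda_i (a_i \otimes a_i)$, throughout. Smoothness gives

\[
\begin{aligned}
f(U_i \otimes U_i) - f(U_{i+1} \otimes U_{i+1})
&\ge \langle \nabla f(X_i), X_i - X_{i+1}\rangle - \frac L2\|X_i - X_{i+1}\|_{S_\infty}^2 \\
&= \underbrace{\langle \nabla f(X_i), (U_i-U_{i+1}) \otimes U_i + U_i \otimes (U_i-U_{i+1})\rangle}_{\text{①}}
 - \underbrace{\langle \nabla f(X_i), (U_i-U_{i+1}) \otimes (U_i-U_{i+1})\rangle}_{\text{②}}
 - \underbrace{\frac L2\|X_i - X_{i+1}\|_{S_\infty}^2}_{\text{③}} .
\end{aligned}
\tag{B.7}
\]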


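For ① we have

\[
\begin{aligned}
\langle \nabla f(X_i), (U_i-U_{i+1}) \otimes U_i + U_i \otimes (U_i-U_{i+1})\rangle
&\overset{(*)}{=} 2\langle \nabla f(X_i)\cdot U_i, U_i - U_{i+1}\rangle \\
&= 2\eta_i\langle \nabla f(X_i)\cdot U_i, [\nabla f(X_i)\cdot U_i]^{\#_\infty}\rangle \\
&= 2\eta_i\|\nabla f(X_i)\cdot U_i\|_*^2
\end{aligned}
\tag{B.8}
\]

where $(*)$ is by the assumption that $\nabla f(X_i)$ is symmetric.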

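To upper bound ②, we use

\[
\begin{aligned}
\langle \nabla f(X_i), (U_i-U_{i+1}) \otimes (U_i-U_{i+1})\rangle
&= \eta_i^2\|\nabla f(X_i)\cdot U_i\|_*^2\,\langle \nabla f(X_i), AB^T \otimes AB^T\rangle \\
&\le \eta_i^2\|\nabla f(X_i)\cdot U_i\|_*^2 \cdot \|\nabla f(X_i)\|_*\,\|AB^T\|_{S_\infty}^2 \\
&\overset{(*)}{\le} \frac14\eta_i\|\nabla f(X_i)\cdot U_i\|_*^2
\end{aligned}
\tag{B.9}
\]

in which $A$ and $B$ are respectively the left and right singular vectors of $\nabla f(X_i)\cdot U_i$. $(*)$ is by $\eta_i \le \frac{1}{4\|\nabla f(X_i)\|_*}$.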

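To upper bound ③, we use

\[
\begin{aligned}
\|U_i \otimes U_i - U_{i+1} \otimes U_{i+1}\|_{S_\infty}
&= \|U_i \otimes (U_i-U_{i+1}) + (U_i-U_{i+1}) \otimes U_i - (U_i-U_{i+1}) \otimes (U_i-U_{i+1})\|_{S_\infty} \\
&\le 2\|U_i\|_{S_\infty}\|U_i-U_{i+1}\|_{S_\infty} + \|U_i-U_{i+1}\|_{S_\infty}^2 \\
&= 2\eta_i\|U_i\|_{S_\infty}\|\nabla f(X_i)\cdot U_i\|_* + \eta_i^2\|\nabla f(X_i)\cdot U_i\|_*^2 \\
&= \eta_i\|\nabla f(X_i)\cdot U_i\|_*\left[2\|U_i\|_{S_\infty} + \eta_i\|\nabla f(X_i)\cdot U_i\|_*\right] \\
&\overset{(1)}{\le} \eta_i\|\nabla f(X_i)\cdot U_i\|_*\left[2\|U_i\|_{S_\infty} + \eta_i\|\nabla f(X_i)\|_*\|U_i\|_{S_\infty}\right] \\
&\overset{(2)}{\le} \eta_i\|\nabla f(X_i)\cdot U_i\|_*\,\frac94\|U_i\|_{S_\infty} .
\end{aligned}
\tag{B.10}
\]

Here (2) is by $\eta_i \le \frac{1}{4\|\nabla f(X_i)\|_*}$ and (1) is due to

\[
\begin{aligned}
\|\nabla f(X_i)\cdot U_i\|_*
&= \sup_{\|y\|_{S_\infty} \le 1}\langle \nabla f(X_i)\cdot U_i, y\rangle
= \sup_{\|y\|_{S_\infty} \le 1}\langle \nabla f(X_i), U_i \otimes y\rangle \\
&\le \sup_{\|y\|_{S_\infty} \le 1}\|\nabla f(X_i)\|_*\|U_i\|_{S_\infty}\|y\|_{S_\infty}
= \|\nabla f(X_i)\|_*\|U_i\|_{S_\infty} .
\end{aligned}
\tag{B.11}
\]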

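Plugging the above inequalities into (B.7), we obtain

\[
\begin{aligned}
f(U_i \otimes U_i) - f(U_{i+1} \otimes U_{i+1})
&\ge 2\eta_i\|\nabla f(X_i)\cdot U_i\|_*^2 - \frac14\eta_i\|\nabla f(X_i)\cdot U_i\|_*^2 - \frac L2\left(\frac94\eta_i\|U_i\|_{S_\infty}\|\nabla f(X_i)\cdot U_i\|_*\right)^2 \\
&\ge \eta_i\|\nabla f(X_i)\cdot U_i\|_*^2\left[\frac74 - \frac L2\left(\frac94\right)^2\eta_i\|U_i\|_{S_\infty}^2\right] \\
&\overset{(*)}{\ge} 1.117\,\eta_i\|\nabla f(X_i)\cdot U_i\|_*^2 .
\end{aligned}
\tag{B.12}
\]

$(*)$ is by $\eta_i \le \frac{1}{4L\|X_i\|_{S_\infty}} = \frac{1}{4L\|U_i\|_{S_\infty}^2}$. This finishes the first part of Lemma 2.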

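We now give the proof of (B.6) of Lemma 2, which only holds for the phase retrieval case. We use $\tilde f(\tilde X)$ to denote the original objective function, i.e. $\tilde X = \tilde U\tilde U^T$ and $\tilde f(\tilde X) = f(X) = f(U \otimes U)$, where $\tilde U$, a $p \times 1$ vector, is the vectorization of $U$, an $m \times n$ matrix, with $p = m \cdot n$. We have the following equality

\[ \langle \nabla f(X_i), U_1 \otimes U_2\rangle = \langle \nabla\tilde f(\tilde X_i), \tilde U_1\tilde U_2^T\rangle \]

and its immediate consequence

\[ \langle \nabla f(X_i)\cdot U_1, U_2\rangle = \langle \nabla\tilde f(\tilde X_i)\tilde U_1, \tilde U_2\rangle \tag{B.13} \]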


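in which both sides of the first equation are the first-order term of the objective function. The assumption made here, which corresponds to (IV.5), would be

\[
\tilde D_\infty \equiv \max_{\tilde U:\, f(\tilde U\tilde U^T) \le f(\tilde U_0\tilde U_0^T)} D_\infty(\tilde U, \tilde U^*)
\le \frac{\sigma_{\min}(\tilde U^*)}{10} = \frac{\sigma_1(\tilde U^*)}{10} = \frac{\|\tilde U^*\|_{\ell_2}}{10} .
\tag{B.14}
\]

We denote

\[ R_{\tilde U_i} \equiv \arg\min_{R \text{ unitary}}\|\tilde U_i - \tilde U^*R\|_{S_\infty} = \arg\min_{R \in \{1,-1\}}\|\tilde U_i - \tilde U^*R\|_{S_\infty} \]

and define $\Delta_{\tilde U_i} \equiv \tilde U_i - \tilde U^*R_{\tilde U_i}$. Then

\[
\begin{aligned}
f(U_i \otimes U_i) - f(U^* \otimes U^*)
&\le \langle \nabla f(X_i), X_i - X^*\rangle \\
&= \langle \nabla f(X_i), U_i \otimes U_i - U^*R_{\tilde U_i} \otimes U^*R_{\tilde U_i}\rangle \\
&= \langle \nabla\tilde f(\tilde X_i), \Delta_{\tilde U_i}\tilde U_i^T\rangle + \langle \nabla\tilde f(\tilde X_i), \tilde U_i\Delta_{\tilde U_i}^T\rangle - \langle \nabla\tilde f(\tilde X_i), \Delta_{\tilde U_i}\Delta_{\tilde U_i}^T\rangle \\
&= 2\langle \nabla\tilde f(\tilde X_i)\tilde U_i, \Delta_{\tilde U_i}\rangle - \langle \nabla\tilde f(\tilde X_i), \Delta_{\tilde U_i}\Delta_{\tilde U_i}^T\rangle \\
&\le 2\|\nabla\tilde f(\tilde X_i)\tilde U_i\|_*\|\Delta_{\tilde U_i}\|_{S_\infty} + \underbrace{\langle \nabla\tilde f(\tilde X_i), \Delta_{\tilde U_i}\Delta_{\tilde U_i}^T\rangle}_{\text{①}} .
\end{aligned}
\tag{B.15}
\]

To upper bound ①,

\[
\begin{aligned}
\langle \nabla\tilde f(\tilde X_i), \Delta_{\tilde U_i}\Delta_{\tilde U_i}^T\rangle
&= \langle \nabla\tilde f(\tilde X_i)\Delta_{\tilde U_i}, \Delta_{\tilde U_i}\rangle \\
&\le \|\nabla\tilde f(\tilde X_i)\Delta_{\tilde U_i}\|_*\|\Delta_{\tilde U_i}\|_{S_\infty} \\
&= \|\nabla\tilde f(\tilde X_i)(P_{\Delta_{\tilde U_i}}\Delta_{\tilde U_i})\|_*\|\Delta_{\tilde U_i}\|_{S_\infty} \\
&\le \|\nabla\tilde f(\tilde X_i)P_{\Delta_{\tilde U_i}}\|_*\|\Delta_{\tilde U_i}\|_{S_\infty}^2 \\
&\overset{(*)}{\le} \left(\|\nabla\tilde f(\tilde X_i)P_{\tilde U_i}\|_* + \|\nabla\tilde f(\tilde X_i)P_{\tilde U^*}\|_*\right)\|\Delta_{\tilde U_i}\|_{S_\infty}^2
\end{aligned}
\tag{B.16}
\]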

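in which $P_U$ denotes the projection onto $\mathrm{Col}(U)$. $(*)$ is due to $\mathrm{Span}(\mathrm{Col}(\Delta_{\tilde U_i})) \subseteq \mathrm{Span}(\mathrm{Col}(\tilde U_i) \cup \mathrm{Col}(\tilde U_r^*))$. Next,

\[
\|\nabla\tilde f(\tilde X_i)P_{\tilde U_i}\|_* = \|\nabla\tilde f(\tilde X_i)\tilde U_i\tilde U_i^\dagger\|_*
\overset{(1)}{\le} \|\nabla\tilde f(\tilde X_i)\tilde U_i\|_*\frac{1}{\sigma_{\min}(\tilde U_i)}
\overset{(2)}{\le} \|\nabla\tilde f(\tilde X_i)\tilde U_i\|_*\frac{10}{9\,\sigma_{\min}(\tilde U^*)}
\tag{B.17}
\]

in which $\tilde U_i^\dagger$ denotes the pseudoinverse of $\tilde U_i$ and $\sigma_{\min}$ denotes the smallest non-zero singular value. (1) is due to $\sigma_1(\tilde U_i^\dagger) = \sigma_r(\tilde U_i)^{-1}$. (2) is by assumption (B.14) and Weyl's inequality.

Similarly, we have

\[
\|\nabla\tilde f(\tilde X_i)P_{\tilde U^*}\|_* = \|\nabla\tilde f(\tilde X_i)\tilde U^*(\tilde U^*)^\dagger\|_*
\le \underbrace{\|\nabla\tilde f(\tilde X_i)\tilde U^*\|_*}_{\text{Ⓐ}}\,\frac{1}{\sigma_{\min}(\tilde U^*)} .
\tag{B.18}
\]

To upper bound Ⓐ, we use the following inequality.

\[
\begin{aligned}
\|\nabla\tilde f(\tilde X_i)\tilde U^*\|_*
&= \|\nabla\tilde f(\tilde X_i)\tilde U^*R_{\tilde U_i}\|_* \\
&\le \|\nabla\tilde f(\tilde X_i)\tilde U_i\|_* + \|\nabla\tilde f(\tilde X_i)\Delta_{\tilde U_i}\|_* \\
&= \|\nabla\tilde f(\tilde X_i)\tilde U_i\|_* + \|\nabla\tilde f(\tilde X_i)P_{\Delta_{\tilde U_i}}\Delta_{\tilde U_i}\|_* \\
&\le \|\nabla\tilde f(\tilde X_i)\tilde U_i\|_* + \|\nabla\tilde f(\tilde X_i)P_{\Delta_{\tilde U_i}}\|_*\|\Delta_{\tilde U_i}\|_{S_\infty} \\
&\overset{(1)}{\le} \|\nabla\tilde f(\tilde X_i)\tilde U_i\|_* + \left(\|\nabla\tilde f(\tilde X_i)P_{\tilde U_i}\|_* + \|\nabla\tilde f(\tilde X_i)P_{\tilde U^*}\|_*\right)\|\Delta_{\tilde U_i}\|_{S_\infty} \\
&\overset{(2)}{\le} \|\nabla\tilde f(\tilde X_i)\tilde U_i\|_* + \left(\frac{10}{9}\|\nabla\tilde f(\tilde X_i)\tilde U_i\|_* + \|\nabla\tilde f(\tilde X_i)\tilde U^*\|_*\right)\frac{\|\Delta_{\tilde U_i}\|_{S_\infty}}{\sigma_{\min}(\tilde U^*)} \\
&\overset{(3)}{\le} \|\nabla\tilde f(\tilde X_i)\tilde U_i\|_* + \left(\frac{10}{9}\|\nabla\tilde f(\tilde X_i)\tilde U_i\|_* + \|\nabla\tilde f(\tilde X_i)\tilde U^*\|_*\right)\frac{1}{10} \\
&= \frac{10}{9}\|\nabla\tilde f(\tilde X_i)\tilde U_i\|_* + \frac{1}{10}\|\nabla\tilde f(\tilde X_i)\tilde U^*\|_* .
\end{aligned}
\tag{B.19}
\]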


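Here (1) holds for a reason similar to (B.16), (2) is obtained by plugging in (B.17) and (B.18), and (3) is by assumption (B.14). Thus we arrive at

\[ \|\nabla\tilde f(\tilde X_i)\tilde U^*\|_* \le \left(\frac{10}{9}\right)^2\|\nabla\tilde f(\tilde X_i)\tilde U_i\|_* . \tag{B.20} \]

Plugging this into (B.18), we get

\[ \|\nabla\tilde f(\tilde X_i)P_{\tilde U^*}\|_* \le \left(\frac{10}{9}\right)^2\|\nabla\tilde f(\tilde X_i)\tilde U_i\|_*\frac{1}{\sigma_{\min}(\tilde U^*)} . \tag{B.21} \]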

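Combining (B.17) and (B.21) with (B.16), we obtain

\[
\begin{aligned}
\langle \nabla\tilde f(\tilde X_i), \Delta_{\tilde U_i}\Delta_{\tilde U_i}^T\rangle
&\le \left(\|\nabla\tilde f(\tilde X_i)\tilde U_i\|_*\frac{10}{9\,\sigma_{\min}(U^*)} + \left(\frac{10}{9}\right)^2\|\nabla\tilde f(\tilde X_i)\tilde U_i\|_*\frac{1}{\sigma_{\min}(U^*)}\right)\|\Delta_{\tilde U_i}\|_{S_\infty}^2 \\
&= \|\nabla\tilde f(\tilde X_i)\tilde U_i\|_*\,\frac{190}{81}\,\frac{\|\Delta_{\tilde U_i}\|_{S_\infty}}{\sigma_{\min}(U^*)}\,\|\Delta_{\tilde U_i}\|_{S_\infty} \\
&\overset{(*)}{\le} \frac{19}{81}\|\nabla\tilde f(\tilde X_i)\tilde U_i\|_*\|\Delta_{\tilde U_i}\|_{S_\infty}
\end{aligned}
\tag{B.22}
\]

where $(*)$ is by assumption (B.14). Now we plug (B.22) into (B.15) and obtain

\[
\begin{aligned}
f(U_i \otimes U_i) - f(U^* \otimes U^*)
&\le \left(2 + \frac{19}{81}\right)\|\Delta_{\tilde U_i}\|_{S_\infty}\|\nabla\tilde f(\tilde X_i)\tilde U_i\|_* \\
&\overset{(**)}{\le} \left(2 + \frac{19}{81}\right)C_0\,\|\Delta_{\tilde U_i}\|_{S_\infty}\|\nabla f(X_i)\cdot U_i\|_*
\end{aligned}
\tag{B.23}
\]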

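where $C_0$ is a constant and $(**)$ is obtained from the connection between $\nabla f(X_i)\cdot U_i$ and $\nabla\tilde f(\tilde X_i)\tilde U_i$ (see (B.13)) and the equivalence of norms on finite-dimensional Banach spaces. The last part is to prove $\min_i \gamma_i \ge \frac14\eta$ by showing $\|U_i\|_{S_\infty} \le \frac{11}{9}\|U_0\|_{S_\infty}$ and

\[ \|\nabla f(X_i)_r\|_* \le \frac{40L}{81}\,\sigma_{\min}(U_0)\,\sigma_1(U_0) + \|\nabla f(X_0)_r\|_* . \tag{B.24} \]

By assumption (B.14) and Weyl's inequality, we have for every $i \ge 0$

\[ \left(1 - \frac{1}{10}\right)\sigma_1(U^*) \le \sigma_1(U_i) \le \left(1 + \frac{1}{10}\right)\sigma_1(U^*), \quad\text{and thus}\quad \frac{1 + \frac{1}{10}}{1 - \frac{1}{10}}\,\sigma_1(U_0) \ge \sigma_1(U_i) . \tag{B.25} \]

Since $\|\nabla f(X_i)_r\|_*$ is the Ky Fan $r$-norm of $\nabla f(X_i)$, we have

\[
\begin{aligned}
\|\nabla f(X_i)_r\|_*
&\le \|(\nabla f(X_i) - \nabla f(X_0))_r\|_* + \|\nabla f(X_0)_r\|_* \\
&\le \|\nabla f(X_i) - \nabla f(X_0)\|_* + \|\nabla f(X_0)_r\|_* \\
&\le L_{S_\infty\to S_1}\|X_i - X_0\|_{S_\infty} + \|\nabla f(X_0)_r\|_* \\
&\le L_{S_\infty\to S_1}\left(\|X_i - X^*\|_{S_\infty} + \|X_0 - X^*\|_{S_\infty}\right) + \|\nabla f(X_0)_r\|_* .
\end{aligned}
\]

Since

\[
\|X_i - X^*\|_{S_\infty} = \|U_i \otimes (U_i - U^*R_{U_i}) + (U_i - U^*R_{U_i}) \otimes (U^*R_{U_i})\|_{S_\infty}
\le \|U_i - U^*R_{U_i}\|_{S_\infty}\left(\|U_i\|_{S_\infty} + \|U^*\|_{S_\infty}\right),
\]

we have

\[
\begin{aligned}
\|X_i - X^*\|_{S_\infty} + \|X_0 - X^*\|_{S_\infty}
&\le \|U_i - U^*R_{U_i}\|_{S_\infty}\left(\|U_i\|_{S_\infty} + \|U^*\|_{S_\infty}\right) + \|U_0 - U^*R_{U_0}\|_{S_\infty}\left(\|U_0\|_{S_\infty} + \|U^*\|_{S_\infty}\right) \\
&\le \frac{\sigma_{\min}(U^*)}{10}\,\sigma_1(U_0)\left(\frac{11}{9} + \frac{10}{9} + 1 + \frac{10}{9}\right) \\
&\le \frac{1}{1 - \frac{1}{10}}\,\frac{\sigma_{\min}(U_0)}{10}\,\sigma_1(U_0)\,\frac{40}{9}
= \frac{40}{81}\,\sigma_{\min}(U_0)\,\sigma_1(U_0)
\end{aligned}
\]

by applying inequality (B.25).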


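APPENDIX C

PROOF OF LINEAR RATE FOR NUCLEAR GRADIENT DESCENT

We use $U$ and $U^+$ to denote the current state and the updated state. Let $X^* = U^*(U^*)^T$ be the optimum, $X = UU^T$ and $X^+ = U^+(U^+)^T$. The update is

\[ U^+ = U - \eta_U \nabla f(UU^T)^{\#} \cdot U \tag{C.1} \]

where $\eta_U = \frac{1}{16\left(L\|X\|_{S_\infty} + \|\nabla f(X)^{\#}Q_UQ_U^T\|_{S_\infty}\right)}$, which is also denoted $\eta$ for simplicity.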

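We now prove the following key lemma.

Lemma 3. Given $D_F(U, U^*) \le \rho\,\sigma_r(U_r^*)$ and $D_*(U, U^*) \le \frac{1}{81\kappa}\frac{\sigma_r(X^*)}{\sigma_1(U^*)}$,

\[
\begin{aligned}
\frac{1}{\eta}\langle U - U^+, U - U_r^*R_U\rangle
&= \langle \nabla f(X)^{\#}U, U - U_r^*R_U\rangle \\
&\ge 0.86\,\eta\,\|\nabla f(X)^{\#}U\|_F^2 + 0.7\,\frac{\mu}{4}\,\sigma_r(X^*)\,D_F(U, U_r^*)^2 - \frac L4\|X^* - X_r^*\|_F^2 ,
\end{aligned}
\tag{C.2}
\]

in which $R_U = \arg\min_{R \text{ is unitary}}\|U - U_r^*R\|_F$.

First, we define $\Delta \equiv U - U_r^*R_U$ and thus

\[ \langle \nabla f(X)^{\#}U, U - U_r^*R_U\rangle = \frac12\langle \nabla f(X)^{\#}, X - X_r^*\rangle + \frac12\langle \nabla f(X)^{\#}, \Delta\Delta^T\rangle . \tag{C.3} \]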

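First, we lower bound $\langle \nabla f(X)^{\#}, X - X_r^*\rangle$:

\[
\begin{aligned}
f(X) &\ge f(X^+) - \langle \nabla f(X), X^+ - X\rangle - \frac L2\|X^+ - X\|_*^2 \\
&\ge f(X^*) - \langle \nabla f(X), X^+ - X\rangle - \frac L2\|X^+ - X\|_*^2 ,
\end{aligned}
\tag{C.4}
\]

and

\[ f(X_r^*) \ge f(X) + \langle \nabla f(X), X_r^* - X\rangle + \frac{\mu}{2}\|X_r^* - X\|_*^2 . \tag{C.5} \]

Noticing that PSD matrices form a convex cone, we obtain $\langle \nabla f(X^*), X^*\rangle = 0$ and consequently $\langle \nabla f(X^*), X_r^*\rangle = 0$. Therefore we have

\[
f(X_r^*) \le f(X^*) + \langle \nabla f(X^*), X_r^* - X^*\rangle + \frac L2\|X_r^* - X^*\|_*^2
= f(X^*) + \frac L2\|X_r^* - X^*\|_*^2 .
\tag{C.6}
\]

Summing up the previous three inequalities, we have

\[
\langle \nabla f(X), X - X_r^*\rangle \ge \langle \nabla f(X), X - X^+\rangle - \frac L2\|X^+ - X\|_*^2 + \frac{\mu}{2}\|X_r^* - X\|_*^2 - \frac L2\|X_r^* - X^*\|_*^2 .
\tag{C.7}
\]

Let $A \equiv I - \frac{\eta}{2}Q_UQ_U^T\nabla f(X)^{\#}$; we have

\[
\begin{aligned}
X^+ - X &= (U - \eta\nabla f(X)^{\#}U)(U - \eta\nabla f(X)^{\#}U)^T - UU^T \\
&= -\eta\,\nabla f(X)^{\#}XA - \eta\,A^TX\nabla f(X)^{\#}
\end{aligned}
\tag{C.8}
\]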


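where we have used the property of $\nabla f(X)^{\#}$ being symmetric. Plugging the previous equality into (C.7), we achieve

\[
\begin{aligned}
\langle \nabla f(X), X - X_r^*\rangle - \frac{\mu}{2}\|X_r^* - X\|_*^2 + \frac L2\|X_r^* - X^*\|_*^2
&\ge \langle \nabla f(X), X - X^+\rangle - \frac L2\|X^+ - X\|_*^2 \\
&\ge 2\eta\,\langle \nabla f(X), \nabla f(X)^{\#}XA\rangle - \frac L2\left(2\|\eta\nabla f(X)^{\#}XA\|_*\right)^2
\end{aligned}
\tag{C.9}
\]

where we have used $\|Y + Y^T\|_* \le 2\|Y\|_*$.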

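For the two terms on the RHS of (C.9) we have the following bounds.

\[
\begin{aligned}
\langle \nabla f(X), \nabla f(X)^{\#}XA\rangle
&= \langle \nabla f(X), \nabla f(X)^{\#}UU^T\rangle - \frac{\eta}{2}\langle \nabla f(X), \nabla f(X)^{\#}UU^TQ_UQ_U^T\nabla f(X)^{\#}\rangle \\
&\ge \|\nabla f(X)^{\#}U\|_F^2 - \frac{\eta}{2}\|\nabla f(X)^{\#}U\|_F^2\,\|Q_UQ_U^T\nabla f(X)^{\#}\|_{S_\infty} \\
&\ge \left(1 - \frac{1}{32}\right)\|\nabla f(X)^{\#}U\|_F^2
\end{aligned}
\tag{C.10}
\]

where the last inequality is due to the choice of the step size. Continuing, we compute

\[
\|\nabla f(X)^{\#}XA\|_* \le \|\nabla f(X)^{\#}U\|_*\|U\|_{S_\infty}\|A\|_{S_\infty}
\le \|\nabla f(X)^{\#}U\|_*\|U\|_{S_\infty}\left(1 + \frac{1}{32}\right) .
\tag{C.11}
\]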

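Now plugging these two bounds into (C.9), we have

\[
\begin{aligned}
\langle \nabla f(X), X - X_r^*\rangle - \frac{\mu}{2}\|X_r^* - X\|_*^2 + \frac L2\|X_r^* - X^*\|_*^2
&\ge 2\eta\,\|\nabla f(X)^{\#}U\|_F^2\left[1 - \frac{1}{32} - \eta L\|U\|_{S_\infty}^2\left(\frac{33}{32}\right)^2\right] \\
&\ge 2\eta\,\|\nabla f(X)^{\#}U\|_F^2\left[1 - \frac{1}{32} - \frac{1}{16}\left(\frac{33}{32}\right)^2\right] \\
&\ge \frac{18\eta}{10}\|\nabla f(X)^{\#}U\|_F^2 .
\end{aligned}
\]

That is,

\[
\langle \nabla f(X), X - X_r^*\rangle \ge \frac{18\eta}{10}\|\nabla f(X)^{\#}U\|_F^2 + \frac{\mu}{2}\|X_r^* - X\|_*^2 - \frac L2\|X_r^* - X^*\|_*^2 .
\tag{C.12}
\]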
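The numerical factor $\frac{18}{10}$ in (C.12) can be confirmed with exact arithmetic; the variable names below are illustrative only.

```python
from fractions import Fraction

# Bracket in the derivation of (C.12), using eta * L * ||U||^2 <= 1/16.
bracket = 1 - Fraction(1, 32) - Fraction(1, 16) * Fraction(33, 32) ** 2
factor = 2 * bracket  # coefficient of eta * ||grad f(X)^# U||_F^2
print(bracket, float(factor))
```

The exact coefficient is 14783/8192, which is about 1.8046 and therefore at least 18/10.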

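We now lower bound $\langle \nabla f(X)^{\#}, \Delta\Delta^T\rangle$, the second term of (C.3):

\[
\begin{aligned}
\langle \nabla f(X)^{\#}, \Delta\Delta^T\rangle
&= \langle Q_\Delta Q_\Delta^T\nabla f(X)^{\#}, \Delta\Delta^T\rangle \\
&\ge -\left|\mathrm{Trace}\left(\Delta\Delta^TQ_\Delta Q_\Delta^T\nabla f(X)^{\#}\right)\right| \\
&\ge -\|Q_\Delta Q_\Delta^T\nabla f(X)^{\#}\|_{S_\infty}\langle \Delta, \Delta\rangle \\
&\ge -\left[\|Q_UQ_U^T\nabla f(X)^{\#}\|_{S_\infty} + \|Q_{U_r^*}Q_{U_r^*}^T\nabla f(X)^{\#}\|_{S_\infty}\right]\cdot D_F(U, U_r^*)^2
\end{aligned}
\tag{C.13}
\]

where the last inequality is owing to $\mathrm{Span}(\mathrm{Col}(\Delta)) \subseteq \mathrm{Span}(\mathrm{Col}(U) \cup \mathrm{Col}(U_r^*))$. Using the definition of $\eta$,

\[
\begin{aligned}
\|Q_UQ_U^T\nabla f(X)^{\#}\|_{S_\infty}D_F(U, U_r^*)^2
&= \eta\cdot 16\left(L\|X\|_{S_\infty} + \|\nabla f(X)^{\#}Q_UQ_U^T\|_{S_\infty}\right)\|Q_UQ_U^T\nabla f(X)^{\#}\|_{S_\infty}D_F(U, U_r^*)^2 \\
&= \underline{16\eta L\|X\|_{S_\infty}\|Q_UQ_U^T\nabla f(X)^{\#}\|_{S_\infty}D_F(U, U_r^*)^2} + 16\eta\,\|\nabla f(X)^{\#}Q_UQ_U^T\|_{S_\infty}^2D_F(U, U_r^*)^2 .
\end{aligned}
\tag{C.14}
\]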


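We bound the underlined term by considering two possible conditions, $\|\nabla f(X)^{\#}Q_UQ_U^T\|_{S_\infty} \le \frac{\mu\sigma_r(X)}{40}$ and $\|\nabla f(X)^{\#}Q_UQ_U^T\|_{S_\infty} > \frac{\mu\sigma_r(X)}{40}$:

\[
\begin{aligned}
16\eta L\|X\|_{S_\infty}\|Q_UQ_U^T\nabla f(X)^{\#}\|_{S_\infty}D^2
&\le \max\left\{16\eta L\|X\|_{S_\infty}\frac{\mu\sigma_r(X)}{40}D^2,\; 16\eta\cdot 40\kappa\tau(X)\|\nabla f(X)^{\#}Q_UQ_U^T\|_{S_\infty}^2D^2\right\} \\
&\le 16\eta L\|X\|_{S_\infty}\frac{\mu\sigma_r(X)}{40}D^2 + 16\eta\cdot 40\kappa\tau(X)\|\nabla f(X)^{\#}Q_UQ_U^T\|_{S_\infty}^2D^2 \\
&\le \frac{\mu\sigma_r(X)}{40}D^2 + 16\eta\cdot 40\kappa\tau(X)\|\nabla f(X)^{\#}Q_UQ_U^T\|_{S_\infty}^2D^2
\end{aligned}
\tag{C.15}
\]

in which $D$ denotes $D_F(U, U_r^*)$. Combining the previous inequality with (C.14), we get

\[
\begin{aligned}
\|Q_UQ_U^T\nabla f(X)^{\#}\|_{S_\infty}D^2
&\le \frac{\mu\sigma_r(X)}{40}D^2 + (40\kappa\tau(X) + 1)\,16\eta\,\|\nabla f(X)^{\#}Q_UQ_U^T\|_{S_\infty}^2D^2 \\
&\overset{(i)}{\le} \frac{\mu\sigma_r(X)}{40}D^2 + \left(40\left(\tfrac{101}{99}\right)^2\kappa\tau(X_r^*) + 1\right)16\eta\,\|\nabla f(X)^{\#}Q_UQ_U^T\|_{S_\infty}^2\left(\rho\,\sigma_r(U_r^*)\right)^2 \\
&\le \frac{\mu\sigma_r(X)}{40}D^2 + 16\cdot 43\,\eta\,\kappa\tau(X_r^*)\|\nabla f(X)^{\#}Q_UQ_U^T\|_{S_\infty}^2\,\sigma_r(X_r^*)\,\rho^2 \\
&\overset{(ii)}{\le} \frac{\mu\sigma_r(X)}{40}D^2 + 16\cdot 43\,\eta\,\kappa\tau(X_r^*)\|\nabla f(X)^{\#}U\|_{S_\infty}^2\,\frac{\sigma_r(X_r^*)}{\sigma_r(X)}\,\rho^2 \\
&\overset{(iii)}{\le} \frac{\mu\sigma_r(X)}{40}D^2 + 16\cdot 43\,\eta\,\kappa\tau(X_r^*)\|\nabla f(X)^{\#}U\|_{S_\infty}^2\left(\tfrac{100}{99}\right)^2\rho^2 \\
&\overset{(iv)}{\le} \frac{\mu\sigma_r(X)}{40}D^2 + \frac{2\eta}{29}\|\nabla f(X)^{\#}U\|_{S_\infty}^2 .
\end{aligned}
\tag{C.16}
\]

(i) and (iii) are due to the assumption $D_F(U, U^*) \le \rho\,\sigma_r(U_r^*)$ and Lemma 6. (ii) is owing to

\[
\|\nabla f(X)^{\#}U\|_{S_\infty} = \|U^T\nabla f(X)^{\#}\|_{S_\infty} = \|U^TQ_UQ_U^T\nabla f(X)^{\#}\|_{S_\infty} \ge \sigma_{\min}(U)\,\|\nabla f(X)^{\#}Q_UQ_U^T\|_{S_\infty}
\]

and $\sigma_{\min}(U) = \sigma_r(U) = \sqrt{\sigma_r(X)}$. (iv) is obtained by plugging in $\rho = \frac{1}{100\,\kappa\,\tau(X_r^*)}$.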

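We first note that $\nabla f(U^*(U^*)^T)U^* = 0$, since $X^*$ is the optimum, and thus $\nabla f(X^*)Q_{U_r^*} = 0$. Now we bound $\|Q_{U_r^*}Q_{U_r^*}^T\nabla f(X)^{\#}\|_{S_\infty}$:

\[
\begin{aligned}
\|Q_{U_r^*}Q_{U_r^*}^T\nabla f(X)^{\#}\|_{S_\infty}
&\le \|Q_{U_r^*}Q_{U_r^*}^T\nabla f(X)\|_{S_\infty} \\
&= \|Q_{U_r^*}Q_{U_r^*}^T(\nabla f(X) - \nabla f(X^*))\|_{S_\infty} \\
&\le \|\nabla f(X) - \nabla f(X^*)\|_{S_\infty} \\
&\le L\left(\|X - X_r^*\|_* + \|X_r^* - X^*\|_*\right)
\end{aligned}
\tag{C.17}
\]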

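where the last inequality is owing to $L$-smoothness and the triangle inequality. Plugging inequalities (C.16) and (C.17) into (C.13), we get

\[
\langle \nabla f(X)^{\#}, \Delta\Delta^T\rangle \ge -\left(\frac{\mu\sigma_r(X)}{40}D^2 + \frac{2\eta}{29}\|\nabla f(X)^{\#}U\|_{S_\infty}^2 + L\left(\|X - X_r^*\|_* + \|X_r^* - X^*\|_*\right)D^2\right) .
\tag{C.18}
\]

Now we plug the two bounds (C.12) and (C.18) into (C.3) to get

\[
\begin{aligned}
\langle \nabla f(X)^{\#}U, U - U_r^*R_U\rangle
&\ge \frac12\left(\frac{18\eta}{10}\|\nabla f(X)^{\#}U\|_F^2 + \frac{\mu}{2}\|X_r^* - X\|_*^2 - \frac L2\|X_r^* - X^*\|_*^2\right) \\
&\quad - \frac12\left(\frac{\mu\sigma_r(X)}{40}D^2 + \frac{2\eta}{29}\|\nabla f(X)^{\#}U\|_{S_\infty}^2 + L\left(\|X - X_r^*\|_* + \|X_r^* - X^*\|_*\right)D^2\right) \\
&\ge 0.86\,\eta\,\|\nabla f(X)^{\#}U\|_F^2 - \frac L4\|X_r^* - X^*\|_*^2 \\
&\quad + \frac{\mu}{4}\,\underline{\left(\|X_r^* - X\|_*^2 - \frac{\sigma_r(X^*)D^2}{20} - 2\kappa D^2\left(\|X - X_r^*\|_* + \|X_r^* - X^*\|_*\right)\right)} .
\end{aligned}
\tag{C.19}
\]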


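Lemma 4. If $D_F(U, U^*) \le \rho\,\sigma_r(U_r^*)$ and $\rho \le \frac{1}{100}$, then for any unitary matrix $R$

\[
\begin{aligned}
\|X - X_r^*\|_* &= \|UU^T - U_r^*(U_r^*)^T\|_* \\
&= \|UU^T - U_r^*RU^T + U_r^*RU^T - U_r^*R(U_r^*R)^T\|_* \\
&\le \|U - U_r^*R\|_*\|U\|_{S_\infty} + \|U - U_r^*R\|_*\|U_r^*\|_{S_\infty} \\
&\overset{(i)}{\le} \|U - U_r^*R\|_*(1 + \rho)\|U_r^*\|_{S_\infty} + \|U - U_r^*R\|_*\|U_r^*\|_{S_\infty} \\
&\le (2 + \rho)\|U - U_r^*R\|_*\|U_r^*\|_{S_\infty} \\
&\le 2.01\,\|U - U_r^*R\|_*\|U_r^*\|_{S_\infty}
\end{aligned}
\tag{C.20}
\]

where (i) is due to Lemma 6.

Lemma 5. Let $X = UU^T$ and $X_r^* = U_r^*(U_r^*)^T$; then

\[ \|X - X_r^*\|_F^2 \ge 2(\sqrt2 - 1)\,\sigma_r(X_r^*)\,D_F(U, U_r^*)^2 . \tag{C.21} \]

See reference [2].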

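Combining Lemmas 4 and 5, we obtain a lower bound for the underlined term in (C.19).

\[
\begin{aligned}
&\|X_r^* - X\|_*^2 - \frac{\sigma_r(X^*)D^2}{20} - 2\kappa D^2\left(\|X - X_r^*\|_* + \|X_r^* - X^*\|_*\right) \\
&\quad\overset{(i)}{\ge} \|X_r^* - X\|_F^2 - \frac{\sigma_r(X^*)D^2}{20} - 2\kappa D^2\left(\|X - X_r^*\|_* + \frac{\sigma_r(X^*)}{200\,\kappa^{1.5}\,\tau(X_r^*)}\right) \\
&\quad\overset{(ii)}{\ge} 2(\sqrt2 - 1)\,\sigma_r(X_r^*)D^2 - \frac{\sigma_r(X^*)D^2}{20} - 2\kappa D^2\|X - X_r^*\|_* - \frac{\sigma_r(X^*)D^2}{50} \\
&\quad\overset{(iii)}{\ge} \left(2(\sqrt2 - 1) - \frac{1}{20} - \frac{1}{50}\right)\sigma_r(X_r^*)D^2 - 2\kappa D^2\,(2.01)\,\frac{1}{81\kappa}\frac{\sigma_r(X^*)}{\sigma_1(U^*)}\|U_r^*\|_{S_\infty} \\
&\quad\ge \left(2(\sqrt2 - 1) - \frac{1}{20} - \frac{1}{50} - \frac{1}{20}\right)\sigma_r(X_r^*)D^2 \\
&\quad\ge 0.7\,\sigma_r(X_r^*)\,D^2
\end{aligned}
\tag{C.22}
\]

where (i) is due to $\|\cdot\|_* \ge \|\cdot\|_F$, (ii) is owing to Lemma 5, and (iii) is due to the assumption $\tilde D_* \le \frac{1}{81\kappa}\frac{\sigma_r(X^*)}{\sigma_1(U^*)}$. Combining (C.22) with (C.19), we get

\[
\langle \nabla f(X)^{\#}U, U - U_r^*R_U\rangle \ge 0.86\,\eta\,\|\nabla f(X)^{\#}U\|_F^2 - \frac L4\|X_r^* - X^*\|_*^2 + 0.7\,\frac{\mu}{4}\,\sigma_r(X_r^*)\,D^2 ,
\tag{C.23}
\]

the desired Lemma 3.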
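The constants 0.86 and 0.7 in Lemma 3 come from simple arithmetic, sketched below. The check of the gradient coefficient assumes $\|\cdot\|_{S_\infty} \le \|\cdot\|_F$, so that the spectral-norm term of (C.18) can be absorbed into the Frobenius-norm term of (C.12); the variable names are illustrative.

```python
import math
from fractions import Fraction

# Gradient coefficient in (C.19): (1/2) * 18/10 - (1/2) * 2/29.
grad_coeff = Fraction(1, 2) * Fraction(18, 10) - Fraction(1, 2) * Fraction(2, 29)

# Cross term in step (iii) of (C.22): 2 * 2.01 / 81 must not exceed 1/20.
cross = 2 * 2.01 / 81

# Final distance coefficient in (C.22).
dist_coeff = 2 * (math.sqrt(2) - 1) - 1 / 20 - 1 / 50 - 1 / 20
print(float(grad_coeff), cross, dist_coeff)
```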

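Using Lemma 3, we bound the distance after one update:

\[
\begin{aligned}
D_F(U^+, U_r^*)^2 &= \min_{R \text{ is unitary}}\|U^+ - U_r^*R\|_F^2 \\
&\le \|U^+ - U_r^*R_U\|_F^2 \\
&= \|U - U_r^*R_U\|_F^2 - 2\eta\,\langle \nabla f(X)^{\#}U, U - U_r^*R_U\rangle + \eta^2\|\nabla f(X)^{\#}U\|_F^2 \\
&\overset{(i)}{\le} D_F(U, U^*)^2 - 2\eta\left(-\frac L4\|X_r^* - X^*\|_*^2 + 0.7\,\frac{\mu}{4}\,\sigma_r(X_r^*)\,D(U, U^*)^2\right) - (2(0.86) - 1)\,\eta^2\|\nabla f(X)^{\#}U\|_F^2 \\
&\le \left(1 - \frac{0.7\,\mu\,\eta}{2}\,\sigma_r(X_r^*)\right)D(U, U^*)^2 + \frac{\eta L}{2}\|X_r^* - X^*\|_*^2
\end{aligned}
\tag{C.24}
\]

in which $R_U = \arg\min_{R \text{ is unitary}}\|U - U_r^*R\|_F$. (i) is by Lemma 3.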
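The recursion (C.24) has the form $d_{k+1} \le (1 - c)\,d_k + e$, which contracts linearly to a neighborhood of radius $e/c$. The sketch below iterates the equality case with hypothetical values: `contraction` stands in for $\frac{0.7\mu\eta}{2}\sigma_r(X_r^*)$ and `bias` for $\frac{\eta L}{2}\|X_r^* - X^*\|_*^2$, and neither number comes from the paper.

```python
def iterate(d0, contraction, bias, steps):
    """Iterate d_{k+1} = (1 - contraction) * d_k + bias and return the history."""
    d = d0
    history = [d]
    for _ in range(steps):
        d = (1.0 - contraction) * d + bias
        history.append(d)
    return history


# Hypothetical values for illustration only.
d0, contraction, bias = 1.0, 0.05, 1e-4
history = iterate(d0, contraction, bias, 400)

# Fixed point of the recursion: the error floor caused by the rank-r approximation term.
limit = bias / contraction
print(history[-1], limit)
```

When $X^*$ has rank $r$, the bias term vanishes and the squared distance converges linearly to zero.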

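Lemma 6. Let $U$ and $U_r^*$ be two $n \times r$ matrices such that $D_F(U, U_r^*) \le \rho\,\sigma_r(U_r^*)$, for $\rho \le \frac{1}{100}$. Define $X \equiv UU^T$ and $X_r^* \equiv U_r^*(U_r^*)^T$. Then we have

\[ \left(1 - \frac{1}{100}\right)\sigma_1(U_r^*) \le \sigma_1(U) \le \left(1 + \frac{1}{100}\right)\sigma_1(U_r^*) \tag{C.25} \]

\[ \left(1 - \frac{1}{100}\right)\sigma_r(U_r^*) \le \sigma_r(U) \le \left(1 + \frac{1}{100}\right)\sigma_r(U_r^*) \tag{C.26} \]

and thus

\[ \tau(U) \le \frac{101}{99}\,\tau(U_r^*) \quad\text{and}\quad \tau(X) \le \left(\frac{101}{99}\right)^2\tau(X_r^*) . \tag{C.27} \]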


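Proof. By $\|\cdot\|_{S_\infty} \le \|\cdot\|_F$ and Weyl's inequality for perturbation of singular values, we obtain

\[ |\sigma_i(U_r^*) - \sigma_i(U)| \le \rho\,\sigma_r(U_r^*) \le \frac{1}{100}\,\sigma_r(U_r^*) . \tag{C.28} \]

Now we show $\min_{i \ge 0}\eta_i \ge \frac{1}{16}\eta$ by verifying

\[
\|U_iU_i^T\|_{S_\infty} \le \left(\frac{1+\rho}{1-\rho}\right)^2\|X_0\|_{S_\infty}
\quad\text{and}\quad
\|\nabla f(U_iU_i^T)^{\#}Q_{U_i}Q_{U_i}^T\|_{S_\infty} \le \frac{4L\,\sigma_1(U_0)\,\sigma_r(X^*)}{81\,\kappa\,\sigma_1(U^*)(1-\rho)} + \|\nabla f(X_0)\|_{S_\infty} .
\]

The first one is an immediate result of Lemma 6. Applying the same arguments as in (A.20), (A.21) and (A.22), the second part is a direct consequence of the assumption $\tilde D_* \le \frac{1}{81\kappa}\frac{\sigma_r(X^*)}{\sigma_1(U^*)}$ and the assumption $\tilde D_F \le \rho\,\sigma_r(U_r^*)$.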

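APPENDIX D

PROOF OF SUBLINEAR RATE OF SPECTRAL GRADIENT DESCENT

We first use the following lemma to prove the sublinear rate.

Lemma 7. For the sequence of iterates $\{U_i\}_{i=0}^{k}$, we have

\[ f(U_iU_i^T) - f(U_{i+1}U_{i+1}^T) \ge \alpha_i \cdot \|[\nabla f(X_i)\cdot U_i]^{\#_\infty}\|_{S_\infty}^2 = \alpha_i \cdot \|\nabla f(X_i)\cdot U_i\|_*^2 \tag{D.1} \]

and

\[ f(U_iU_i^T) - f(U^*U^{*T}) \le \beta_i \cdot \|[\nabla f(X_i)\cdot U_i]^{\#_\infty}\|_{S_\infty} \tag{D.2} \]

where $\alpha_i = 1.117\,\eta_i$ and $\beta_i = (2+\tfrac{19}{81})\,D_\infty(U_i, U^*)$.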

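Define $\delta_i = f(U_iU_i^T) - f(U^*U^{*T})$. By the previous lemma, $\{\delta_i\}$ is a positive decreasing sequence and

\[ \delta_{i+1} \le \delta_i - \alpha_i \cdot \|[\nabla f(X_i)\cdot U_i]^{\#_\infty}\|_{S_\infty}^2 \le \delta_i - \frac{\alpha_i}{\beta_i^2}\cdot\delta_i^2 . \]

Dividing both sides by $\delta_i \cdot \delta_{i+1}$, we obtain, by assumption (IV.5),

\[ \frac{1}{\delta_{i+1}} - \frac{1}{\delta_i} \ge \frac{\alpha_i}{\beta_i^2}\cdot\frac{\delta_i}{\delta_{i+1}} \ge \frac{\alpha_i}{\beta_i^2} \ge \frac{\tilde\alpha_i}{D_{S_\infty}^2} . \]

Telescoping this inequality gives the desired result.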

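Now we prove (D.1) of Lemma 7. Smoothness gives

\[
\begin{aligned}
f(U_iU_i^T) - f(U_{i+1}U_{i+1}^T)
&\ge \langle \nabla f(X_i), X_i - X_{i+1}\rangle - \frac L2\|X_i - X_{i+1}\|_{S_\infty}^2 \\
&= \underbrace{\langle \nabla f(X_i), (U_i-U_{i+1})U_i^T + U_i(U_i-U_{i+1})^T\rangle}_{\text{①}}
 - \underbrace{\langle \nabla f(X_i), (U_i-U_{i+1})(U_i-U_{i+1})^T\rangle}_{\text{②}}
 - \underbrace{\frac L2\|X_i - X_{i+1}\|_{S_\infty}^2}_{\text{③}} .
\end{aligned}
\tag{D.3}
\]

For ① we have

\[
\begin{aligned}
\langle \nabla f(X_i), (U_i-U_{i+1})U_i^T + U_i(U_i-U_{i+1})^T\rangle
&= 2\langle \nabla f(X_i)U_i, U_i - U_{i+1}\rangle \\
&= 2\eta_i\langle \nabla f(X_i)U_i, [\nabla f(X_i)U_i]^{\#_\infty}\rangle \\
&= 2\eta_i\|\nabla f(X_i)U_i\|_*^2 .
\end{aligned}
\tag{D.4}
\]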
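For the spectral case, the last equality of (D.4) relies on the sharp operator associated with the spectral norm. The sketch below assumes this operator is a maximizer of $\langle g, s\rangle - \tfrac12\|s\|_{S_\infty}^2$, which for $g = U\Sigma V^T$ is attained at $\|g\|_*\,UV^T$. The helper name `spectral_sharp` and the random test matrix are hypothetical.

```python
import numpy as np


def spectral_sharp(g):
    """Assumed sharp operator w.r.t. the spectral norm: ||g||_* * U V^T."""
    u, s, vt = np.linalg.svd(g, full_matrices=False)
    return s.sum() * (u @ vt)


rng = np.random.default_rng(3)
g = rng.standard_normal((6, 3))  # stands in for grad f(X_i) U_i
g_sharp = spectral_sharp(g)

nuc = np.linalg.norm(g, "nuc")              # nuclear norm ||g||_*
inner = np.sum(g * g_sharp)                 # <g, g^#>
spec_of_sharp = np.linalg.norm(g_sharp, 2)  # spectral norm of the step direction
print(inner, nuc ** 2, spec_of_sharp)
```

Under this assumption the step direction has spectral norm $\|g\|_*$, which is what converts $\|U_i - U_{i+1}\|_{S_\infty}$ into $\eta_i\|\nabla f(X_i)\cdot U_i\|_*$ when bounding ③.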

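To upper bound ②, we use

\[
\begin{aligned}
\langle \nabla f(X_i), (U_i-U_{i+1})(U_i-U_{i+1})^T\rangle
&= \eta_i^2\|\nabla f(X_i)\cdot U_i\|_*^2 \cdot \mathrm{Trace}\left(A^T\nabla f(X_i)A\begin{bmatrix} I_r & 0 \\ 0 & 0_{(n-r)\times(n-r)}\end{bmatrix}\right) \\
&\le \eta_i^2\|\nabla f(X_i)\cdot U_i\|_*^2 \cdot \|\nabla f(X_i)_r\|_* \\
&\overset{(*)}{\le} \frac14\eta_i\|\nabla f(X_i)\cdot U_i\|_*^2
\end{aligned}
\tag{D.5}
\]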


in which $A$ contains the left singular vectors of $\nabla f(X_i)\cdot U_i$, and $\|\nabla f(X_i)_r\|_*$ equals the sum of the top $r$ singular values of $\nabla f(X_i)$. $(*)$ is by $\eta_i \le \frac{1}{4\|\nabla f(X_i)_r\|_*}$.

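To upper bound ③, we use

\[
\begin{aligned}
\|U_iU_i^T - U_{i+1}U_{i+1}^T\|_{S_\infty}
&= \|U_i(U_i-U_{i+1})^T + (U_i-U_{i+1})U_i^T - (U_i-U_{i+1})(U_i-U_{i+1})^T\|_{S_\infty} \\
&\le 2\|U_i\|_{S_\infty}\|U_i-U_{i+1}\|_{S_\infty} + \|U_i-U_{i+1}\|_{S_\infty}^2 \\
&= 2\eta_i\|U_i\|_{S_\infty}\|\nabla f(X_i)\cdot U_i\|_* + \eta_i^2\|\nabla f(X_i)\cdot U_i\|_*^2 \\
&= \eta_i\|\nabla f(X_i)\cdot U_i\|_*\left[2\|U_i\|_{S_\infty} + \eta_i\|\nabla f(X_i)\cdot U_i\|_*\right] \\
&\overset{(1)}{\le} \eta_i\|\nabla f(X_i)\cdot U_i\|_*\left[2\|U_i\|_{S_\infty} + \eta_i\|\nabla f(X_i)_r\|_*\|U_i\|_{S_\infty}\right] \\
&\overset{(2)}{\le} \eta_i\|\nabla f(X_i)\cdot U_i\|_*\,\frac94\|U_i\|_{S_\infty}
\end{aligned}
\tag{D.6}
\]

where (1) holds because the rank of $U_i$ is at most $r$ and (2) is by $\eta_i \le \frac{1}{4\|\nabla f(X_i)_r\|_*}$. Plugging the above inequalities into (D.3), we obtain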

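\[
\begin{aligned}
f(U_iU_i^T) - f(U_{i+1}U_{i+1}^T)
&\ge 2\eta_i\|\nabla f(X_i)U_i\|_*^2 - \frac14\eta_i\|\nabla f(X_i)\cdot U_i\|_*^2 - \frac L2\left(\frac94\eta_i\|U_i\|_{S_\infty}\|\nabla f(X_i)\cdot U_i\|_*\right)^2 \\
&\ge \eta_i\|\nabla f(X_i)\cdot U_i\|_*^2\left[\frac74 - \frac L2\left(\frac94\right)^2\eta_i\|U_i\|_{S_\infty}^2\right] \\
&\overset{(*)}{\ge} 1.117\,\eta_i\|\nabla f(X_i)\cdot U_i\|_*^2 .
\end{aligned}
\tag{D.7}
\]

$(*)$ is by $\eta_i \le \frac{1}{4L\|X_i\|_{S_\infty}} = \frac{1}{4L\|U_i\|_{S_\infty}^2}$. This finishes the first part of Lemma 7.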

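Now we give the proof of (D.2) of Lemma 7. We denote

\[ R_{U_i} \equiv \arg\min_{R \text{ is unitary}}\|U_i - U^*R\|_{S_\infty} \tag{D.8} \]

and define $\Delta_{U_i} \equiv U_i - U^*R_{U_i}$. Then we have

\[
\begin{aligned}
f(U_iU_i^T) - f(U^*U^{*T})
&\le \langle \nabla f(X_i), X_i - X^*\rangle \\
&= \langle \nabla f(X_i), \Delta_{U_i}U_i^T\rangle + \langle \nabla f(X_i), U_i\Delta_{U_i}^T\rangle - \langle \nabla f(X_i), \Delta_{U_i}\Delta_{U_i}^T\rangle \\
&= 2\langle \nabla f(X_i)U_i, \Delta_{U_i}\rangle - \langle \nabla f(X_i), \Delta_{U_i}\Delta_{U_i}^T\rangle \\
&\le 2\|\nabla f(X_i)\cdot U_i\|_*\|\Delta_{U_i}\|_{S_\infty} + \underbrace{\langle \nabla f(X_i), \Delta_{U_i}\Delta_{U_i}^T\rangle}_{\text{①}} .
\end{aligned}
\tag{D.9}
\]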

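To upper bound ①, we use

\[
\begin{aligned}
\langle \nabla f(X_i), \Delta_{U_i}\Delta_{U_i}^T\rangle
&= \langle \nabla f(X_i)\Delta_{U_i}, \Delta_{U_i}\rangle \\
&\le \|\nabla f(X_i)\Delta_{U_i}\|_*\|\Delta_{U_i}\|_{S_\infty} \\
&= \|\nabla f(X_i)P_{\Delta_{U_i}}\Delta_{U_i}\|_*\|\Delta_{U_i}\|_{S_\infty} \\
&\le \|\nabla f(X_i)P_{\Delta_{U_i}}\|_*\|\Delta_{U_i}\|_{S_\infty}^2 \\
&\overset{(*)}{\le} \left(\|\nabla f(X_i)P_{U_i}\|_* + \|\nabla f(X_i)P_{U^*}\|_*\right)\|\Delta_{U_i}\|_{S_\infty}^2
\end{aligned}
\tag{D.10}
\]

in which $P_U$ denotes the projection onto $\mathrm{Col}(U)$, and $(*)$ is due to $\mathrm{Span}(\mathrm{Col}(\Delta_{U_i})) \subseteq \mathrm{Span}(\mathrm{Col}(U_i) \cup \mathrm{Col}(U_r^*))$.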


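Next,

\[
\|\nabla f(X_i)P_{U_i}\|_* = \|\nabla f(X_i)U_iU_i^\dagger\|_*
\overset{(1)}{\le} \|\nabla f(X_i)U_i\|_*\frac{1}{\sigma_r(U_i)}
\overset{(2)}{\le} \|\nabla f(X_i)U_i\|_*\frac{10}{9\,\sigma_r(U^*)}
\tag{D.11}
\]

in which $U_i^\dagger$ denotes the pseudoinverse of $U_i$. Here (1) is due to $\sigma_1(U_i^\dagger) = \sigma_r(U_i)^{-1}$, and (2) is by assumption (IV.5), Weyl's inequality and $\sigma_r(U^*R_{U_i}) = \sigma_r(U^*)$. Similarly, we have

\[
\|\nabla f(X_i)P_{U^*}\|_* = \|\nabla f(X_i)U^*(U^*)^\dagger\|_*
\le \underbrace{\|\nabla f(X_i)U^*\|_*}_{\text{Ⓐ}}\,\frac{1}{\sigma_r(U^*)} .
\tag{D.12}
\]

To upper bound Ⓐ, we use the following inequality.

\[
\begin{aligned}
\|\nabla f(X_i)U^*\|_*
&= \|\nabla f(X_i)U^*R_{U_i}\|_* \\
&\le \|\nabla f(X_i)U_i\|_* + \|\nabla f(X_i)\Delta_{U_i}\|_* \\
&= \|\nabla f(X_i)U_i\|_* + \|\nabla f(X_i)P_{\Delta_{U_i}}\Delta_{U_i}\|_* \\
&\le \|\nabla f(X_i)U_i\|_* + \|\nabla f(X_i)P_{\Delta_{U_i}}\|_*\|\Delta_{U_i}\|_{S_\infty} \\
&\overset{(1)}{\le} \|\nabla f(X_i)U_i\|_* + \left(\|\nabla f(X_i)P_{U_i}\|_* + \|\nabla f(X_i)P_{U^*}\|_*\right)\|\Delta_{U_i}\|_{S_\infty} \\
&\overset{(2)}{\le} \|\nabla f(X_i)U_i\|_* + \left(\frac{10}{9}\|\nabla f(X_i)U_i\|_* + \|\nabla f(X_i)U^*\|_*\right)\frac{\|\Delta_{U_i}\|_{S_\infty}}{\sigma_r(U^*)} \\
&\overset{(3)}{\le} \|\nabla f(X_i)U_i\|_* + \frac{1}{10}\left(\frac{10}{9}\|\nabla f(X_i)U_i\|_* + \|\nabla f(X_i)U^*\|_*\right) \\
&= \frac{10}{9}\|\nabla f(X_i)U_i\|_* + \frac{1}{10}\|\nabla f(X_i)U^*\|_* .
\end{aligned}
\tag{D.13}
\]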

Here, (1) holds for a reason similar to (D.10), (2) is obtained by plugging in (D.11) and (D.12), and (3) is by assumption (IV.5). Thus we arrive at

\[ \|\nabla f(X_i)U^*\|_* \le \left(\frac{10}{9}\right)^2\|\nabla f(X_i)U_i\|_* . \tag{D.14} \]

Plugging this into (D.12), we get

\[ \|\nabla f(X_i)P_{U^*}\|_* \le \left(\frac{10}{9}\right)^2\|\nabla f(X_i)U_i\|_*\frac{1}{\sigma_r(U^*)} . \tag{D.15} \]

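Combining (D.11) and (D.15) with (D.10), we obtain

\[
\begin{aligned}
\langle \nabla f(X_i), \Delta_{U_i}\Delta_{U_i}^T\rangle
&\le \left(\|\nabla f(X_i)U_i\|_*\frac{10}{9\,\sigma_r(U^*)} + \left(\frac{10}{9}\right)^2\|\nabla f(X_i)U_i\|_*\frac{1}{\sigma_r(U^*)}\right)\|\Delta_{U_i}\|_{S_\infty}^2 \\
&= \|\nabla f(X_i)U_i\|_*\,\frac{190}{81}\,\frac{\|\Delta_{U_i}\|_{S_\infty}}{\sigma_r(U^*)}\,\|\Delta_{U_i}\|_{S_\infty} \\
&\overset{(*)}{\le} \frac{19}{81}\|\nabla f(X_i)U_i\|_*\|\Delta_{U_i}\|_{S_\infty}
\end{aligned}
\tag{D.16}
\]

where $(*)$ is by assumption (IV.5). Now we plug (D.16) into (D.9) and obtain

\[ f(U_iU_i^T) - f(U^*U^{*T}) \le \left(2 + \frac{19}{81}\right)\|\Delta_{U_i}\|_{S_\infty}\|\nabla f(X_i)\cdot U_i\|_* . \tag{D.17} \]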


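The last part is to prove $\min_i \gamma_i \ge \frac14\eta$ by showing $\|U_i\|_{S_\infty} \le \frac{11}{9}\|U_0\|_{S_\infty}$ and

\[ \|\nabla f(X_i)_r\|_* \le \frac{40L}{81}\,\sigma_r(U_0)\,\sigma_1(U_0) + \|\nabla f(X_0)_r\|_* . \tag{D.18} \]

By assumption (IV.5) and Weyl's inequality, we have for every $i \ge 0$

\[ \left(1 - \frac{1}{10}\right)\sigma_1(U^*) \le \sigma_1(U_i) \le \left(1 + \frac{1}{10}\right)\sigma_1(U^*), \quad\text{and thus}\quad \frac{1 + \frac{1}{10}}{1 - \frac{1}{10}}\,\sigma_1(U_0) \ge \sigma_1(U_i) . \tag{D.19} \]

Since $\|\nabla f(X_i)_r\|_*$ is the Ky Fan $r$-norm of $\nabla f(X_i)$, we have

\[
\begin{aligned}
\|\nabla f(X_i)_r\|_*
&\le \|(\nabla f(X_i) - \nabla f(X_0))_r\|_* + \|\nabla f(X_0)_r\|_* \\
&\le \|\nabla f(X_i) - \nabla f(X_0)\|_* + \|\nabla f(X_0)_r\|_* \\
&\le L_{S_\infty\to S_1}\|X_i - X_0\|_{S_\infty} + \|\nabla f(X_0)_r\|_* \\
&\le L_{S_\infty\to S_1}\left(\|X_i - X^*\|_{S_\infty} + \|X_0 - X^*\|_{S_\infty}\right) + \|\nabla f(X_0)_r\|_* .
\end{aligned}
\]

Since

\[
\|X_i - X^*\|_{S_\infty} = \|U_i(U_i - U^*R_{U_i})^T + (U_i - U^*R_{U_i})(U^*R_{U_i})^T\|_{S_\infty}
\le \|U_i - U^*R_{U_i}\|_{S_\infty}\left(\|U_i\|_{S_\infty} + \|U^*\|_{S_\infty}\right),
\]

we have

\[
\begin{aligned}
\|X_i - X^*\|_{S_\infty} + \|X_0 - X^*\|_{S_\infty}
&\le \|U_i - U^*R_{U_i}\|_{S_\infty}\left(\|U_i\|_{S_\infty} + \|U^*\|_{S_\infty}\right) + \|U_0 - U^*R_{U_0}\|_{S_\infty}\left(\|U_0\|_{S_\infty} + \|U^*\|_{S_\infty}\right) \\
&\le \frac{\sigma_r(U^*)}{10}\,\sigma_1(U_0)\left(\frac{11}{9} + \frac{10}{9} + 1 + \frac{10}{9}\right) \\
&\le \frac{1}{1 - \frac{1}{10}}\,\frac{\sigma_r(U_0)}{10}\,\sigma_1(U_0)\,\frac{40}{9}
= \frac{40}{81}\,\sigma_r(U_0)\,\sigma_1(U_0)
\end{aligned}
\]

by applying inequality (D.19).

APPENDIX E

PROOF OF LEMMA 1

Using the chain rule, we see that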

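\[ \nabla f(A) = \nabla\mathrm{lse}(Ax)\,x^\top . \tag{E.1} \]

To prove the convexity of $f$, we compute

\[
\begin{aligned}
\langle \nabla f(A) - \nabla f(A'), A - A'\rangle
&= \langle \nabla\mathrm{lse}(Ax)x^\top - \nabla\mathrm{lse}(A'x)x^\top, A - A'\rangle \\
&= \langle \nabla\mathrm{lse}(Ax) - \nabla\mathrm{lse}(A'x), Ax - A'x\rangle \ge 0
\end{aligned}
\]

since the lse function is convex.

We now turn to the smoothness parameters. Since transposing a matrix does not alter the Schatten-$p$ norm, we have

\[
\begin{aligned}
\|\nabla f(A) - \nabla f(A')\|_F
&= \left\|\left(\nabla\mathrm{lse}(Ax) - \nabla\mathrm{lse}(A'x)\right)x^\top\right\|_F \\
&\le \|x\|_2\,\|\nabla\mathrm{lse}(Ax) - \nabla\mathrm{lse}(A'x)\|_2 \\
&\le \|x\|_2\,\|(A - A')x\|_2, \quad\text{by (IV.2)} \\
&\le \|x\|_2^2\,\|A - A'\|_F .
\end{aligned}
\]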


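Using similar arguments, we compute

\[
\begin{aligned}
\|\nabla f(A) - \nabla f(A')\|_{S_1}
&= \left\|\left(\nabla\mathrm{lse}(Ax) - \nabla\mathrm{lse}(A'x)\right)x^\top\right\|_{S_1} \\
&\le \|x\|_2\,\|\nabla\mathrm{lse}(Ax) - \nabla\mathrm{lse}(A'x)\|_2 \\
&\le \|x\|_2\,\|\nabla\mathrm{lse}(Ax) - \nabla\mathrm{lse}(A'x)\|_1 \\
&\le \|x\|_2\,\|(A - A')x\|_\infty, \quad\text{by (IV.3)} \\
&\le \|x\|_2^2\,\|A - A'\|_{S_\infty}
\end{aligned}
\]

which establishes (IV.9).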
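The statements of this appendix can be checked numerically on a small instance. The sketch below assumes that lse denotes the log-sum-exp function, so that its gradient is the softmax map; the helper names, dimensions and random data are hypothetical. It compares (E.1) against central finite differences and evaluates the monotonicity and both smoothness inequalities at one random pair $(A, A')$.

```python
import numpy as np


def lse(z):
    """Numerically stable log-sum-exp (assumed meaning of lse)."""
    m = np.max(z)
    return m + np.log(np.sum(np.exp(z - m)))


def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()


def f(A, x):
    return lse(A @ x)


def grad_f(A, x):
    """Formula (E.1): grad f(A) = grad lse(Ax) x^T."""
    return np.outer(softmax(A @ x), x)


rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3))
A2 = rng.standard_normal((4, 3))
x = rng.standard_normal(3)

# Central finite differences, entry by entry.
eps = 1e-6
num = np.zeros_like(A)
for i in range(A.shape[0]):
    for j in range(A.shape[1]):
        E = np.zeros_like(A)
        E[i, j] = eps
        num[i, j] = (f(A + E, x) - f(A - E, x)) / (2 * eps)
max_err = np.max(np.abs(num - grad_f(A, x)))

G = grad_f(A, x) - grad_f(A2, x)
monotone_gap = np.sum(G * (A - A2))                                   # convexity check
fro_lhs, fro_rhs = np.linalg.norm(G, "fro"), np.linalg.norm(x) ** 2 * np.linalg.norm(A - A2, "fro")
s1_lhs, s1_rhs = np.linalg.norm(G, "nuc"), np.linalg.norm(x) ** 2 * np.linalg.norm(A - A2, 2)
print(max_err, monotone_gap, fro_lhs <= fro_rhs, s1_lhs <= s1_rhs)
```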

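APPENDIX F

PROOF OF LEMMA 2

The convexity of $\hat f$ follows immediately from the convexity of $f$. For any two positive semi-definite matrices $Z_1 = \begin{bmatrix} A_1 & B_1 \\ B_1^\top & D_1\end{bmatrix}$ and $Z_2 = \begin{bmatrix} A_2 & B_2 \\ B_2^\top & D_2\end{bmatrix}$, we have

\[
\begin{aligned}
\|\nabla\hat f(Z_1) - \nabla\hat f(Z_2)\|_{S_q}^q
&= \frac12\left\|\begin{bmatrix} 0 & \nabla f(B_1) - \nabla f(B_2) \\ \nabla f(B_1) - \nabla f(B_2) & 0\end{bmatrix}\right\|_{S_q}^q \\
&\overset{(1)}{=} \frac12\left\|\begin{bmatrix} \nabla f(B_1) - \nabla f(B_2) & 0 \\ 0 & \nabla f(B_1) - \nabla f(B_2)\end{bmatrix}\right\|_{S_q}^q \\
&\overset{(2)}{=} \frac12\left(\|\nabla f(B_1) - \nabla f(B_2)\|_{S_q}^q + \|\nabla f(B_1) - \nabla f(B_2)\|_{S_q}^q\right) \\
&\overset{(3)}{\le} L^q\,\|B_1 - B_2\|_{S_p}^q ,
\end{aligned}
\tag{F.1}
\]

where

(1) is because $\|\cdot\|_{S_q}$ is permutation invariant,

(2) is because of the block-diagonal structure, and

(3) uses the smoothness of $f$.

It remains to see that $\|B_1 - B_2\|_{S_p} \le \|Z_1 - Z_2\|_{S_p}$. To prove this, we use the permutation invariance of $\|\cdot\|_{S_p}$ and the pinching inequality [1]:

\[
\|Z_1 - Z_2\|_{S_p}^p = \left\|\begin{bmatrix} B_1 - B_2 & A_1 - A_2 \\ D_1 - D_2 & B_1^\top - B_2^\top\end{bmatrix}\right\|_{S_p}^p \ge \|B_1 - B_2\|_{S_p}^p .
\]

APPENDIX G

SYNTHETIC DATA FOR PHASE RETRIEVAL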


[Two figures of reconstructed images on synthetic data. Panels in each figure: Original Picture; Nuclear Wirtinger Flow; Wirtinger Flow; Fienup; Phaselift; Phaselamp; Phasemax; Truncated Amplitude Flow; SketchyCGM; Reweighted Wirtinger Flow.]


APPENDIX H

WIRTINGER FLOW VS. NUCLEAR WIRTINGER FLOW

In Figure 4 we present the images recovered by the nuclear Wirtinger flow and the Wirtinger flow, indexed by time. Our experiments show that the nuclear Wirtinger flow quickly finds a region (at t = 4s) where meaningful image characteristics start to emerge. At t = 8s, a fully visible image is recovered, and the reconstruction stays at the solution for a short period. However, the nuclear Wirtinger flow eventually overfits and returns a noisy figure; see Figure 3. This phenomenon is possibly due to a mismatch between the mathematical model and real Fourier ptychographic reconstructions.

In contrast, the Wirtinger flow recovers only a partial image at t = 8s and exhibits oscillating behavior. Eventually the Wirtinger flow overfits and returns solutions that resemble random noise.

We stress that the Wirtinger flow fails to recover the image for all the initializations we have tried, whereas the nuclear Wirtinger flow is quite robust to the choice of initial point.

Fig. 3: Final solution of the nuclear Wirtinger flow, t = 37s.

APPENDIX I

SPECTRAL GRADIENT METHODS FOR FASTTEXT

Four more datasets are presented in Figure 5. The results are in accordance with our observations in Section VI-B: the heuristic version of (V.4) is the best optimization algorithm, in that it solves the training problem most efficiently, but it is prone to overfitting. On the other hand, the theoretical iterates (V.4) are either the best or comparable to the other methods in terms of prediction accuracy.


REFERENCES

[1] R. Bhatia, Matrix Analysis. Springer Science & Business Media, 2013, vol. 169.

[2] S. Tu, R. Boczar, M. Simchowitz, M. Soltanolkotabi, and B. Recht, “Low-rank solutions of linear matrix equations via Procrustes flow,” in Proceedings of the 33rd International Conference on Machine Learning, 2016, pp. 964–973.


(a) Nuclear Wirtinger flow at t = 2s. (b) Wirtinger flow at t = 2s.

(c) Nuclear Wirtinger flow at t = 4s. (d) Wirtinger flow at t = 4s.

(e) Nuclear Wirtinger flow at t = 8s. (f) Wirtinger flow at t = 8s.

(g) Nuclear Wirtinger flow at t = 14s. (h) Wirtinger flow at t = 14s.

(i) Nuclear Wirtinger flow at t = 16s. (j) Wirtinger flow at t = 16s.


[Plots of training loss and test accuracy versus epochs (2 to 12) for the three methods $[\nabla_A f; \nabla_B f]^{\#_\infty}$, $[\nabla f]^{\#_\infty}$ and $\nabla f$, on yelp review polarity, ag news, sogou news, and amazon review polarity.]

Fig. 5: From left to right, training loss and test accuracy. From top to bottom, results on Yelp Review Polarity, AG News, Sogou News, and Amazon Review Polarity.
