Supplementary Materials to “A Non-Euclidean
Gradient Descent Framework for Non-Convex
Matrix Factorization”
Ya-Ping Hsieh, Yu-Chun Kao, Rabeeh Karimi Mahabadi, Alp Yurtsever, Anastasios Kyrillidis, Member, IEEE,
and Volkan Cevher, Senior Member, IEEE
APPENDIXA
PROOF OFSUBLINEARRATE OFNUCLEARGRADIENTDESCENT
We first use following lemma to prove the sublinear rate.
Lemma 1. For the sequence of the iterates{Ui}ki=0, we have
f (UiUiT)− f(Ui+1Ui+1T)≥ αi· k∇f(Xi)· Uik2S∞ (A.1)
and f (UiUiT)− f(U∗U∗T)≤ βi· k∇f(Xi)· UikS∞ (A.2) where αi= 1.117 ηi and βi= (2 +1981)k∆Uik∗= (2 + 19 81)D∗(Ui, U ∗).
Defineδi= f (UiUiT)− f(U∗U∗T) and follow the previous lemma. We know{δi} is an positive decreasing sequence and
δi+1≤ δi− αi· k∇f(Xi)· Uik2S∞ ≤ δi− αi β2 i · δ 2 i
Dividing both sides with (δi· δi+1), we obtain, by assumption (III.5),
1 δi+1 − 1 δi ≥ αi β2 i ·δδi i+1 ≥ αi β2 i ≥ α˜i D2 ∗ . Telescoping the inequality we get the desired result.
Now we prove (A.1) of lemma 1. The smoothness gives
f (UiUiT)− f(Ui+1Ui+1T)
≥h∇f(Xi), Xi− Xi+1i − L 2kXi− Xi+1k 2 ∗ =∇f(Xi), (Ui− Ui+1)UiT + Ui(Ui− Ui+1)T | {z } 1
−∇f(Xi), (Ui− Ui+1)(Ui− Ui+1)T
| {z } 2 −L2 kXi− Xi+1k2∗ | {z } 3 . (A.3) For 1 we have ∇f(Xi), (Ui− Ui+1)UiT + Ui(Ui− Ui+1)T =2h∇f(Xi)Ui, Ui− Ui+1i =2ηi∇f(Xi)Ui, [∇f(Xi)Ui]#∞ =2ηik∇f(Xi)Uik2S∞. (A.4)
To upper bound 2 , we use
∇f(Xi), (Ui− Ui+1)(Ui− Ui+1)T
=η2 ik∇f(Xi)· Uik2S∞· T race(∇f(Xi)A1A T 1) ≤η2 ik∇f(Xi)· Uik2S∞· k∇f(Xi)kS∞ (∗) ≤14ηik∇f(Xi)· Uik2S∞ (A.5)
in which A1 is the first singular vector of ∇f(Xi)· Ui, and(∗) is by ηi≤ 4k∇f (X1i)k
S∞.
To upper bound 3 , we use
kUiUiT − Ui+1Ui+1T kS∞
=kUi(Ui− Ui+1)T + (Ui− Ui+1)UiT
− (Ui− Ui+1)(Ui− Ui+1)TkS∞
≤2kUikS∞kUi− Ui+1k∗+kUi− Ui+1kS∞kUi− Ui+1k∗
=2ηikUikS∞k∇f(Xi)· UikS∞+ η 2 ik∇f(Xi)· Uik2S∞ =ηik∇f(Xi)· UikS∞ [2kUikS∞+ ηik∇f(Xi)· UikS∞] ≤ηik∇f(Xi)· UikS∞ [2kUikS∞+ ηik∇f(Xi)kS∞kUikS∞] (1) ≤ ηik∇f(Xi)· UikS∞ 9 4kUikS∞ (A.6) where(1) is by ηi≤4k∇f (X1 i)kS∞.
Plugging above inequalities into (A.3), we obtain
f (UiUiT)− f(Ui+1Ui+1T)
≥2ηik∇f(Xi)Uik2S∞− 1 4ηik∇f(Xi)· Uik 2 S∞ −L2(9 4ηikUikS∞k∇f(Xi)· UikS∞) 2 ≥ηik∇f(Xi)· Uik2S∞ " 7 4 − L 2 9 4 2 ηikUik2S∞ # (∗) ≥ 1.117 ηik∇f(Xi)· Uik2S∞ (A.7) where(∗) is by ηi≤4LkX1ik S∞ = 1
4LkUik2S∞. We have thus finished the first part of lemma 1.
Now we give the proof of (A.2) of lemma 1. We denote RUi≡ arg min R R is unitary kUi− U∗Rk∗. (A.8) and define ∆Ui≡ Ui− U ∗R Ui. We begin with f (UiUiT)− f(U∗U∗T) ≤h∇f(Xi), Xi− X∗i =∇f(Xi), ∆UiU T i + ∇f(Xi), Ui∆TUi −∇f(Xi), ∆Ui∆ T Ui =2h∇f(Xi)Ui, ∆Uii −∇f(Xi), ∆Ui∆ T Ui ≤2k∇f(Xi)· UikS∞k∆Uik∗+ ∇f(Xi), ∆Ui∆ T Ui | {z } 1 . (A.9)
To upper bound 1 , we use ∇f(Xi), ∆Ui∆ T Ui =h∇f(Xi)∆Ui, ∆Uii ≤k∇f(Xi)∆UikS∞k∆Uik∗ =k∇f(Xi)P∆Ui∆UikS∞k∆Uik∗ ≤k∇f(Xi)P∆UikS∞k∆UikS∞k∆Uik∗ (∗) ≤k∇f(Xi)PUikS∞+k∇f(Xi)PU∗kS∞ k∆Uik 2 ∗ (A.10)
in which PU denotes the projection onto Col(U ). (∗) is due to Span(Col(∆Ui)) ⊆ Span(Col(Ui)∪ Col(U
∗ r)) and k∆UikS∞ ≤ k∆Uik∗. Continuing, we get k∇f(Xi)PUikS∞ =k∇f(Xi)UiU † ikS∞ (1) ≤ k∇f(Xi)UikS∞ 1 σr(Ui) (2) ≤ k∇f(Xi)UikS∞ 10 9σr(U∗) (A.11)
in which Ui† denotes the pseudoinverse ofUi. Here,(1) is due to σ1(Ui†) = σr(Ui)−1, and(2) is by assumption (III.5), Weyl’s
inequality andσr(U∗RUi) = σr(U ∗). Similarly, we have k∇f(Xi)PU∗kS∞ =k∇f(Xi)U∗(U∗)†kS∞ ≤ k∇f(Xi)U∗kS∞ | {z } A 1 σr(U∗) . (A.12)
To upper bound A , we use the following inequality. k∇f(Xi)U∗kS∞ =k∇f(Xi)U∗RUikS∞ ≤k∇f(Xi)UikS∞+k∇f(Xi)∆UikS∞ =k∇f(Xi)UikS∞+k∇f(Xi)P∆Ui∆UikS∞ ≤k∇f(Xi)UikS∞+k∇f(Xi)P∆UikS∞k∆UikS∞ (1) ≤ k∇f(Xi)UikS∞ +k∇f(Xi)PUikS∞+k∇f(Xi)PU∗kS∞ k∆UikS∞ (2) ≤ k∇f(Xi)UikS∞ +10 9 k∇f(Xi)UikS∞+k∇f(Xi)U ∗ kS∞ k∆U ikS∞ σr(U∗) (3) ≤ k∇f(Xi)UikS∞ + 1 10 10 9 k∇f(Xi)UikS∞+k∇f(Xi)U ∗ kS∞ =10 9 k∇f(Xi)UikS∞+ 1 10k∇f(Xi)U ∗ kS∞. (A.13)
Here, (1) is owing to the similar reason of (A.10), (2) is obtained by plugging in (A.11) and (A.12), and (3) is by assumption (III.5) andk∆UikS∞ ≤ k∆Uik∗. Thus we arrive at
k∇f(Xi)U∗kS∞ ≤
10 9
2
k∇f(Xi)UikS∞. (A.14)
Plugging this into (A.12), we get
k∇f(Xi)PU∗kS ∞ ≤ 10 9 2 k∇f(Xi)UikS∞ 1 σr(U∗) . (A.15)
Combining (A.11) and (A.15) with (A.10), we obtain ∇f(Xi), ∆Ui∆ T Ui ≤(k∇f(Xi)UikS∞ 10 9σr(U∗) + 10 9 2 k∇f(Xi)UikS∞ 1 σr(U∗) )k∆Uik 2 ∗ =k∇f(Xi)UikS∞ 190 81 k∆Uik∗ σr(U∗)k∆ Uik∗ (∗) ≤1981k∇f(Xi)UikS∞k∆Uik∗ (A.16)
where(∗) is by assumption (III.5). Now we plug (A.16) into (A.9) and obtain f (UiUiT)− f(U∗U∗T)≤ (2 +
19
81)k∆Uik∗k∇f(Xi)· UikS∞. (A.17)
The last part is to proveminiγi≥ 14η by showingkUikS∞≤
11
9kU0kS∞ and
k∇f(Xi)kS∞ ≤
40L
81 σr(U0)σ1(U0) +k∇f(X0)kS∞. (A.18)
By assumption (III.5) and Weyl’s inequality, we have for every i≥ 0 (1−101 )σ1(U∗)≤ σ1(Ui)≤ (1 + 1 10)σ1(U ∗) , and thus 1 + 1 10 1− 1 10 σ1(U0)≥ σ1(Ui). (A.19) Fork∇f(Xi)kS∞, we have k∇f(Xi)kS∞ ≤ k∇f(Xi)− ∇f(X0)kS∞+k∇f(X0)kS∞ ≤ LS1→S∞kXi− X0k∗+k∇f(X0)kS∞ ≤ LS1→S∞ kXi− X∗k∗+kX0− X∗k∗ +k∇f(X0)kS∞. (A.20) Since kXi− X∗k∗=kUi(Ui− U∗RUi) T + (Ui− U∗RUi)(U ∗R Ui) T k∗ ≤ kUi− U∗RUik∗ kUikS∞+kU ∗ kS∞ , (A.21) we have kXi− X∗k∗+kX0− X∗k∗ ≤kUi− U∗RUik∗(kUikS∞+kU ∗ kS∞) +kU0− U∗RU0k∗(kU0kS∞+kU ∗ kS∞) (A.22) ≤σr(U ∗) 10 σ1(U0)( 11 9 + 10 9 + 1 + 10 9 ) ≤ 1 1− 1 10 σr(U0) 10 σ1(U0) 40 9 =40 81σr(U0)σ1(U0) by applying inequality (A.19).
APPENDIXB
PROOF OFSUBLINEARRATE FORNUCLEARGRADIENTDESCENT, TENSORVERSION
We first define the action (·) on a bounded linear operator T of H1⊗ H2 andH1 whereH1 andH2 are Hilbert spaces.
∀T ∈ L(H1⊗ H2, R), h1∈ H1, T = X i λi(ai⊗ bi) T · h1, X i λihai, h1ibi∈ H2. (B.1)
Immediately we have hT, h1⊗ h2i = hT · h1, h2i, since
hT, h1⊗ h2i = * X i λi(ai⊗ bi), h1⊗ h2 + =X i λihai, h1ihbi, h2i =hT · h1, h2i. (B.2)
For the cases Hi = Rni×mi, we define the norm to be injective cross norm with eachHi having the spectral normkkS∞ as
primal norm and the consequent dual norm, nuclear norm kk∗.
kxk , sup kaik∗≤1 ha1⊗ a2, xi (B.3) which satisfies kh1⊗ h2k = kh1kS∞kh2kS∞ ka1⊗ a2kdual=ka1k∗ka2k∗. (B.4)
We also usekh1⊗ h2kS∞ andka1⊗ a2k∗ to denotekh1⊗ h2k and ka1⊗ a2kdual.
We use following lemma to prove the sublinear rate.
Lemma 2. For the sequence of the iterates{Ui}ki=0, we have
f (Ui⊗ Ui)− f(Ui+1⊗ Ui+1)≥ αi· k[∇f(Xi)· Ui]#∞k 2 S∞ = αi· k∇f(Xi)· Uik 2 ∗ (B.5) and f (Ui⊗ Ui)− f(U∗⊗ U∗)≤ βi· k[∇f(Xi)· Ui]#∞kS∞ (B.6) where αi= 1.117 ηi and βi= (2 +1981)k∆UikS∞ = (2 + 19 81)D∞(Ui, U ∗).
Defineδi= f (Ui⊗ Ui)− f(U∗⊗ U∗) and follow the previous lemma. We know{δi} is an positive decreasing sequence and
δi+1≤ δi− αi· k[∇f(X) · U]#∞k 2 S∞ ≤ δi− αi β2 i · δ 2 i
Dividing both sides with (δi· δi+1), we obtain, by assumption (B.14),
1 δi+1 − 1 δi ≥ αi β2 i ·δδi i+1 ≥ αi β2 i ≥ ˜αi D2 S∞ . Telescoping the inequality we get the desired result.
Now we prove (B.5) of lemma 2. We assume∇f(X) is symmetric, i.e. ∇f(X) =P
iλi(ai⊗ ai) throughout. The smoothness
gives
f (Ui⊗ Ui)− f(Ui+1⊗ Ui+1)≥ h∇f(Xi), Xi− Xi+1i −
L 2kXi− Xi+1k 2 S∞ =h∇f(Xi), (Ui− Ui+1)⊗ Ui+ Ui⊗ (Ui− Ui+1)i | {z } 1 − h∇f(Xi), (Ui− Ui+1)⊗ (Ui− Ui+1)i | {z } 2 −L2 kXi− Xi+1k2S∞ | {z } 3 (B.7)
For 1 we have h∇f(Xi), (Ui− Ui+1)⊗ Ui+ Ui⊗ (Ui− Ui+1)i (∗) = 2h∇f(Xi)· Ui, Ui− Ui+1i = 2ηi∇f(Xi)· Ui, [∇f(Xi)· Ui]#∞ = 2ηik∇f(Xi)· Uik2∗ (B.8)
where(∗) is by the assumption that ∇f(Xi) is symmetric.
To upper bound 2 , we use
h∇f(Xi), (Ui− Ui+1)⊗ (Ui− Ui+1)i = ηi2k∇f(Xi)· Uik2∗∇f(Xi), ABT ⊗ ABT
≤ η2
ik∇f(Xi)· Uik2∗· k∇f(Xi)k∗kABTk2S∞
(∗)
≤ 14ηik∇f(Xi)· Uik2∗ (B.9)
in which A and B are respectively the left and right singular vectors of∇f(Xi)· Ui.(∗) is by ηi ≤4k∇f (X1
i)k∗.
To upper bound 3 , we use
kUi⊗ Ui− Ui+1⊗ Ui+1kS∞=kUi⊗ (Ui− Ui+1) + (Ui− Ui+1)⊗ Ui− (Ui− Ui+1)⊗ (Ui− Ui+1)kS∞
≤ 2kUikS∞kUi− Ui+1kS∞+kUi− Ui+1k
2 S∞ = 2ηikUikS∞k∇f(Xi)· Uik∗+ η 2 ik∇f(Xi)· Uik2∗ = ηik∇f(Xi)· Uik∗ [2kUikS∞+ ηik∇f(Xi)· Uik∗] (1) ≤ ηik∇f(Xi)· Uik∗ [2kUikS∞+ ηik∇f(Xi)k∗kUikS∞] (2) ≤ ηik∇f(Xi)· Uik∗ 9 4kUikS∞ (B.10) (2) is by ηi≤ 4k∇f (X1 i)k∗ and(1) is due to k∇f(Xi)· Uik∗= sup kykS∞≤1h∇f(X i)· Ui, yi = sup kykS∞≤1 h∇f(Xi), Ui⊗ yi ≤ sup kykS∞≤1 k∇f(Xi)k∗kUikS∞kykS∞ =k∇f(Xi)k∗kUikS∞ (B.11)
Plugging above inequalities into (B.7), we obtain
f (Ui⊗ Ui)− f(Ui+1⊗ Ui+1)≥ 2ηik∇f(Xi)· Uik2∗−
1 4ηik∇f(Xi)· Uik 2 ∗ −L2(9 4ηi kUikS∞k∇f(Xi)· Uik∗) 2 ≥ ηik∇f(Xi)· Uik2∗ " 7 4 − L 2 9 4 2 ηikUik2S∞ # (∗) ≥ 1.117 ηik∇f(Xi)· Uik2∗ (B.12) (∗) is by ηi≤ 4LkX1 ikS∞ = 1
4LkUik2S∞. We thus finish the first part of lemma 2.
We give the proof of (B.6) of lemma 2 which only holds for the phase retrieval case. We then use ˜f ( ˜X) to denote the original objective function, i.e. ˜X = ˜U ˜UT, ˜f ( ˜X) = f (X) = f (U
⊗ U) and ˜U , a p× 1 vector, is the vectorization of U, a m× n matrix where p = m · n. We have the following equalities.
h∇f(Xi), U1⊗ U2i =
D
∇ ˜f ( ˜Xi), ˜U1U˜2 TE
and its immediate consequence
h∇f(Xi)· U1, U2i =
D
∇ ˜f ( ˜Xi) ˜U1, ˜U2
E
in which both sides of the first equation are the first order term of the objective function. The assumption made here, which corresponds to IV.5, would be
˜ D∞≡ max ˜ U :f ( ˜U ˜U>)≤f ( ˜U0U˜> 0) D∞( ˜U , ˜U∗)≤σmin( ˜U ∗) 10 = σ1( ˜U∗) 10 = k ˜U∗k l2 10 . (B.14) We denote RU˜i ≡ arg min R unitaryk ˜ Ui− ˜U∗RkS∞ = arg min R∈{1,−1}k ˜ Ui− ˜U∗RkS∞. and define ∆U˜i≡ ˜Ui− ˜U∗RU˜i. f (Ui⊗ Ui)− f(U∗⊗ U∗)≤ h∇f(Xi), Xi− X∗i =∇f(Xi), Ui⊗ Ui− U∗RU˜i⊗ U ∗R ˜ Ui =D∇ ˜f ( ˜Xi), ∆U˜ i ˜ UT i E +D∇ ˜f ( ˜Xi), ˜Ui∆TU˜i E −D∇ ˜f ( ˜Xi), ∆U˜ i∆ T ˜ Ui E = 2D∇ ˜f ( ˜Xi) ˜Ui, ∆U˜i E −D∇ ˜f ( ˜Xi), ∆U˜i∆ T ˜ Ui E ≤ 2k∇f( ˜Xi) ˜Uik∗k∆U˜ikS∞+ D ∇ ˜f ( ˜Xi), ∆U˜i∆TU˜ i E | {z } 1 (B.15) To upper bound 1 , D ∇ ˜f ( ˜Xi), ∆U˜i∆ T ˜ Ui E = D ∇ ˜f ( ˜Xi)∆U˜i, ∆U˜i E ≤ k∇ ˜f ( ˜Xi)∆U˜ik∗k∆U˜ikS∞ =k∇ ˜f ( ˜Xi)(P∆Ui˜ ∆U˜i)k∗k∆U˜ikS∞ ≤ k∇ ˜f ( ˜Xi)P∆Ui˜ k∗k∆U˜ik 2 S∞ (∗) ≤ (k∇ ˜f ( ˜Xi)PU˜ ik∗+k∇ ˜f ( ˜Xi)PU˜∗k∗)k∆U˜ik 2 S∞ (B.16)
in which PU denotes the projection onto Col(U ). (∗) is due to Span(Col(∆U˜
i))⊆ Span(Col( ˜Ui)∪ Col( ˜U ∗ r)). k∇ ˜f ( ˜Xi)PUik∗=k∇ ˜f ( ˜Xi)UiU † ik∗ (1) ≤ k∇ ˜f ( ˜Xi)Uik∗ 1 σmin(Ui) (2) ≤ k∇ ˜f ( ˜Xi)Uik∗ 10 9σmin(U∗) (B.17)
in which Ui† denotes the pseudoinverse of Ui and σmin denotes the smallest non-zero singular value. (1) is due to σ1(Ui†) =
σr(Ui)−1.(2) is by assumption (B.14) and Weyl’s inequality.
Similarly, we have k∇ ˜f ( ˜Xi)PU˜∗k∗=k∇ ˜f ( ˜Xi) ˜U∗( ˜U∗)†k∗≤ k∇ ˜f ( ˜Xi) ˜U∗k∗ | {z } A 1 σmin( ˜U∗) . (B.18)
To upper bound A , we use the following inequality. k∇ ˜f ( ˜Xi) ˜U∗k∗=k∇ ˜f ( ˜Xi) ˜U∗RU˜ik∗ ≤ k∇ ˜f ( ˜Xi) ˜Uik∗+k∇ ˜f ( ˜Xi)∆U˜ik∗ =k∇ ˜f ( ˜Xi) ˜Uik∗+k∇ ˜f ( ˜Xi)P∆Ui˜ ∆U˜ik∗ ≤ k∇ ˜f ( ˜Xi) ˜Uik∗+k∇ ˜f ( ˜Xi)P∆Ui˜ k∗k∆U˜ikS∞ (1) ≤ k∇ ˜f ( ˜Xi) ˜Uik∗+ (k∇ ˜f ( ˜Xi)PU˜ik∗+k∇ ˜f ( ˜Xi)PU˜∗k∗)k∆U˜ikS∞ (2) ≤ k∇ ˜f ( ˜Xi) ˜Uik∗+ (k∇ ˜f ( ˜Xi) ˜Uik∗ 10 9 +k∇ ˜f ( ˜Xi) ˜U ∗ k∗)k∆ ˜ UikS∞ σmin( ˜U∗) (3) ≤ k∇ ˜f ( ˜Xi) ˜Uik∗+ (k∇ ˜f ( ˜Xi) ˜Uik∗10 9 +k∇ ˜f ( ˜Xi) ˜U ∗ k∗)1 10 = 10 9 k∇ ˜f ( ˜Xi) ˜Uik∗+ 1 10k∇ ˜f ( ˜Xi) ˜U ∗ k∗. (B.19)
(1) is owing to the similar reason of (B.16). (2) is obtained by plugging in (B.17) and (B.18). (3) is by assumption (B.14). Thus we arrive k∇ ˜f ( ˜Xi) ˜U∗k∗≤ 10 9 2 k∇ ˜f ( ˜Xi) ˜Uik∗. (B.20)
Plugging this into (B.18), we get
k∇ ˜f ( ˜Xi)PU˜∗k∗≤ 10 9 2 k∇ ˜f ( ˜Xi) ˜Uik∗ 1 σmin( ˜U∗) . (B.21)
Combining (B.17) and (B.21) with (B.16), we obtain
D ∇ ˜f ( ˜Xi), ∆U˜i∆ T ˜ Ui E ≤ (k∇ ˜f ( ˜Xi) ˜Uik∗ 10 9σmin(U∗) + 10 9 2 k∇ ˜f ( ˜Xi) ˜Uik∗ 1 σmin(U∗) )k∆U˜ik 2 S∞ =k∇ ˜f ( ˜Xi) ˜Uik∗ 190 81 k∆U˜ikS∞ σmin(U∗)k∆ ˜ UikS∞ (∗) ≤ 1981k∇ ˜f ( ˜Xi) ˜Uik∗k∆U˜ikS∞ (B.22)
where(∗) is by assumption (B.14). Now we plug (B.22) into (B.15) and obtain f (Ui⊗ Ui)− f(U∗⊗ U∗)≤ (2 +
19
81)k∆U˜ikS∞k∇ ˜f ( ˜Xi) ˜Uik∗
(∗∗)
≤ (2 +1981)C0k∆U˜ikS∞k∇f(Xi)· Uik∗ (B.23)
where C0 is a constant and (∗∗) is obtained by the connection between ∇f(Xi)· Ui and ˜f ( ˜Xi) ˜Ui (see (B.13)) and the
equivalence of norms of finite dimensional Banach space. The last part is to proveminiγi≥ 14η by showingkUikS∞≤
11
9kU0kS∞ and
k∇f(Xi)rk∗≤40L
81 σmin(U0)σ1(U0) +k∇f(X0)rk∗. (B.24) By assumption (B.14) and Weyl’s inequality, we have for every i≥ 0
(1− 1 10)σ1(U ∗) ≤ σ1(Ui)≤ (1 + 1 10)σ1(U ∗) , and thus 1 + 1 10 1− 1 10 σ1(U0)≥ σ1(Ui). (B.25)
Since k∇f(Xi)rk∗ is the Ky Fanr-norm of ∇f(Xi), we have
k∇f(Xi)rk∗≤ k(∇f(Xi)− ∇f(X0))rk∗+k∇f(X0)rk∗ ≤ k∇f(Xi)− ∇f(X0)k∗+k∇f(X0)rk∗ ≤ LS∞→S1kXi− X0kS∞+k∇f(X0)rk∗ ≤ LS∞→S1(kXi− X ∗ kS∞+kX0− X ∗ kS∞) +k∇f(X0)rk∗. Since kXi− X∗kS∞ =kUi⊗ (Ui− U ∗R Ui) + (Ui− U ∗R Ui)⊗ (U ∗R Ui)kS∞ ≤ kUi− U∗RUikS∞(kUikS∞+kU ∗ kS∞) we have kXi− X∗kS∞+kX0− X ∗ kS∞≤ kUi− U ∗R UikS∞(kUikS∞+kU ∗ kS∞) +kU0− U∗RU0kS∞(kU0kS∞+kU ∗ kS∞) ≤ σmin(U ∗) 10 σ1(U0)( 11 9 + 10 9 + 1 + 10 9 ) ≤ 1 1− 1 10 σmin(U0) 10 σ1(U0) 40 9 = 40 81σmin(U0)σ1(U0) by applying inequality (B.25).
APPENDIXC
PROOF OFLINEARRATE FORNUCLEARGRADIENTDESCENT
We use U and U+ to denotes the current state and the updated state. LetX∗= U∗(U∗)T be the optimum,X = U UT and
X+= U+(U+)T.
U+= U
− ηU∇f(UUT)#· U (C.1)
whereηU = 16(LkXk 1
S∞+k∇f (X)#QUQTUkS∞) which also denoted asη for simplicity.
We now start to prove the following key lemma.
Lemma 3. Given DF(U, U∗)≤ ρσr(Ur∗) and D∗(U, U∗)≤81κ1 σr(X
∗) σ1(U∗), 1 ηU − U +, U − Ur∗RU =∇f(X)#U , U − Ur∗RU ≥0.86ηk∇f(X)# Uk2 F + 0.7 µ 4 σr(X ∗)D F(U, Ur∗) 2 −L4kX∗ − X∗ rk 2 F, (C.2)
in which RU = arg min R R is unitary kU − U∗ rRkF. First, we define∆≡ U − U∗ rRU and thus ∇f(X)#U , U − Ur∗RU =1 2∇f(X) # , X− X∗ r + 1 2∇f(X) # , ∆∆T. (C.3)
First, we lower bound ∇f(X)#, X
− X∗ r: f (X)≥ f(X+ )−∇f(X), X+ − X −L2kX+ − Xk2 ∗ ≥ f(X∗) −∇f(X), X+ − X −L2kX+ − Xk2 ∗, (C.4) and f (X∗ r)≥ f(X) + h∇f(X), Xr∗− Xi + µ 2kX ∗ r − Xk 2 ∗. (C.5)
Noticing PSD matrices form a convex cone, we obtainh∇f(X∗), X∗i = 0 and consequently h∇f(X∗), X∗
ri = 0. Therefore we have f (Xr∗)≤ f(X∗) +h∇f(X∗), Xr∗− X∗i + L 2kX ∗ r− X∗k2∗ = f (X∗) +L 2kX ∗ r− X ∗ k2 ∗. (C.6)
Summing up previous three inequalities, we have
h∇f(X), X − X∗ ri ≥∇f(X), X − X+ − L 2kX + − Xk2 ∗ +µ 2kX ∗ r− Xk 2 ∗− L 2kX ∗ r− X ∗ k2 ∗. (C.7) Let A≡ I − η 2QUQ T U∇f(X) #, we have X+ − X = (U − η∇f(X)#U )(U − η∇f(X)#U )T − UUT =−η∇f(X)#XA − ηATX ∇f(X)# (C.8)
where we have used the property of∇f(X)# being symmetric. Plugging the previous equality into (C.7), we achieve
h∇f(X), X − Xr∗i − µ 2kX ∗ r− Xk 2 ∗+ L 2kX ∗ r − X ∗ k2 ∗ ≥∇f(X), X − X+ −L 2kX + − Xk2 ∗ ≥2η∇f(X), ∇f(X)#XA −L2(2kη∇f(X)#XA k∗)2 (C.9)
where we have usedkY + YTk∗≤ 2kY k∗.
For the two terms on the RHS of (C.9) we have the following bounds. ∇f(X), ∇f(X)# XA =h∇f(X), ∇f(X)# U UT −η2∇f(X), ∇f(X)#U UTQ UQTU∇f(X) #i ≥k∇f(X)#U k2 F− η 2k∇f(X) #U k2 FkQUQTU∇f(X) # kS∞ ≥(1 −321 )k∇f(X)#U k2 F (C.10)
where the last inequality is due to the choice of the step size. Continuing, we compute
|∇f(X)#XA k∗≤ k∇f(X)#Uk∗kUkS∞kAkS∞ ≤ k∇f(X)# Uk∗kUkS∞ 1 + 1 32 . (C.11)
Now plugging these two bounds into (C.9), we have
h∇f(X), X − X∗ ri − µ 2kX ∗ r− Xk 2 ∗+ L 2kX ∗ r − X∗k 2 ∗ ≥2η k∇f(X)#U k2 F " 1− 1 32− ηLkUk 2 S∞ 33 32 2# ≥2η k∇f(X)#U k2 F " 1−321 −161 3332 2# ≥18η10 k∇f(X)# Uk2 F That is, h∇f(X), X − Xr∗i ≥18η10 k∇f(X)#U k2 F+ µ 2kX ∗ r− Xk 2 ∗− L 2kX ∗ r− X ∗ k2 ∗. (C.12)
We now lower bound∇f(X)#, ∆∆T, the second term of (C.3):
∇f(X)#, ∆∆T =Q∆QT∆∇f(X) #, ∆∆T ≥ − |Trace(∆∆TQ ∆QT∆∇f(X) #) | ≥ − kQ∆QT∆∇f(X) # kS∞h∆, ∆i ≥ −hkQUQTU∇f(X) # kS∞ +kQU∗ rQ T U∗ r∇f(X) # kS∞ i · DF(U, Ur∗) 2 (C.13)
where the last inequality is owing to Span(Col(∆))⊆ Span(Col(U) ∪ Col(U∗ r)). kQUQTU∇f(X) # kS∞ DF(U, U ∗ r) 2= η 16(L kXkS∞+k∇f(X) #Q UQTUkS∞)kQUQ T U∇f(X) # kS∞DF(U, U ∗ r) 2 = 16ηLkXkS∞kQUQ T U∇f(X) # kS∞DF(U, U ∗ r) 2+ 16ηk∇f(X)#Q UQTUk2S∞DF(U, U ∗ r)2 (C.14)
We bound the underlined term by considering two possible conditions, k∇f(X)#QUQTUkS∞ ≤ µσr(X) 40 and k∇f(X)#Q UQTUkS∞ > µσr(X) 40 . 16ηLkXkS∞kQUQ T U∇f(X) # kS∞D 2 ≤ max 16ηLkXkS∞µσr(X) 40 D 2, 16η40κτ (X) k∇f(X)#Q UQTUk 2 S∞D 2 ≤ 16ηLkXkS∞ µσr(X) 40 D 2+ 16η40κτ (X) k∇f(X)#Q UQTUk 2 S∞D 2 ≤ µσr40(X)D2+ 16η40κτ (X) k∇f(X)#Q UQTUk 2 S∞D 2 (C.15)
in which D denotes DF(U, Ur∗). Combining the previous inequality with inequality (C.14), we get
kQUQTU∇f(X) # kS∞ D 2 ≤µσ40r(X)D2+ (40κτ (X) + 1)16ηk∇f(X)# QUQTUk 2 S∞D 2 (i) ≤ µσ40r(X)D2+ (40(101 99 ) 2 κτ (Xr∗) + 1)16ηk∇f(X) # QUQTUk 2 S∞(ρσr(U ∗ r)) 2 ≤ µσr40(X)D2+ 16· 43ηκτ(X∗ r)k∇f(X) # QUQTUk 2 S∞σr(X ∗ r)ρ 2 (ii) ≤ µσr40(X)D2+ 16 · 43ηκτ(X∗ r)k∇f(X) #U k2 S∞ σr(Xr∗) σr(X) ρ2 (iii) ≤ µσr40(X)D2+ 16 · 43ηκτ(Xr∗)k∇f(X) #U k2 S∞( 100 99 ) 2ρ2 (iv) ≤ µσr40(X)D2+2η 29k∇f(X) #U k2 S∞ (C.16)
(i) and (iii) is due to the assumption DF(U, U∗)≤ ρσr(Ur∗) and lemma 6. (ii) is owing to
k∇f(X)#U kS∞ =kU T ∇f(X)# kS∞ =kU TQ UQTU∇f(X) # kS∞ ≥ σmin(U )k∇f(X) #Q UQTUkS∞
andσmin(U ) = σr(U ) =pσr(X). (iv) is obtained by plugging ρ = 100κτ (X1 ∗ r).
We first note that ∇f(U∗(U∗)T)U∗ = 0, since X∗ is the optimum, and thus ∇f(X∗)Q U∗
r = 0. Now we start to bound
kQU∗ rQ T U∗ r∇f(X) # kS∞. kQU∗ rQ T U∗ r∇f(X) # kS∞ ≤ kQUr∗Q T U∗ r∇f(X)kS∞ =kQU∗ rQ T U∗ r(∇f(X) − ∇f(X ∗)) kS∞ ≤ k∇f(X) − ∇f(X∗) kS∞ ≤ L (kX − Xr∗k∗+kXr∗− X ∗ k∗) (C.17)
where the last inequality is owing to L-smoothness and the triangular inequality. Plugging inequalities (C.16) and (C.17) into (C.13), we get
∇f(X)#, ∆∆T ≥ − µσr(X) 40 D 2+2η 29k∇f(X) #U k2 S∞+ L (kX − X ∗ rk∗+kXr∗− X ∗ k∗) D2 . (C.18)
Now we plug two bounds (C.12) and (C.18) into (C.3) to get
∇f(X)#U , U − Ur∗RU ≥ 1 2 18η 10 k∇f(X) #U k2 F+ µ 2kX ∗ r − Xk 2 ∗− L 2kX ∗ r − X ∗ k2 ∗ −12 µσr40(X)D2+2η 29k∇f(X) #U k2 S∞+ L (kX − X ∗ rk∗+kXr∗− X∗k∗) D2 ≥ 0.86ηk∇f(X)#U k2 F− L 4kX ∗ r− X∗k 2 ∗ +µ 4 |X∗ r− Xk 2 ∗− σr(X∗)D2 20 − 2κD 2( kX − X∗ rk∗+kXr∗− X∗k∗) (C.19)
Lemma 4. If DF(U, U∗)≤ ρσr(Ur∗) and ρ≤ 1001 , then for any unitary matrixR kX − X∗ rk∗=kUUT− Ur∗(Ur∗)Tk∗ =kUUT − Ur∗RU T + U∗ rRU T − Ur∗R(U ∗ rR) T k∗ ≤ kU − U∗ rRk∗kUkS∞+kU − U ∗ rRk∗kUr∗kS∞ (i) ≤ kU − U∗ rRk∗(1 + ρ)kUr∗kS∞+kU − U ∗ rRk∗kUr∗kS∞ ≤ (2 + ρ)kU − U∗ rRk∗kUr∗kS∞ ≤ (2.01)kU − Ur∗Rk∗kUr∗kS∞ (C.20)
where (i) is due to lemma 6. Lemma 5. Let X = U UT andX∗
rUr∗(Ur∗) T then kX − X∗ rk 2 F ≥ 2( √ 2− 1)σr(Xr∗)DF(U, Ur∗) 2. (C.21) See reference [2].
Combining lemmas 4 and 5, we obtain a lower bound for the underlined term in (C.19).
kXr∗− Xk 2 ∗− σr(X∗)D2 20 − 2κD 2( kX − Xr∗k∗+kXr∗− X ∗ k∗) (i) ≥kX∗ r− Xk2F− σr(X∗)D2 20 − 2κD 2 kX − X∗ rk∗+ σr(X ∗) 200κ1.5τ (X∗ r) (ii) ≥ 2(√2− 1)σr(Xr∗)D 2 −σr(X ∗)D2 20 − 2κD 2 kX − X∗ rk∗−σr(X ∗)D2 50 (iii) ≥ 2(√2− 1) − 1 20− 1 50 σr(Xr∗)D 2 − 2κD2(2.01) 1 81κ σr(X∗) σ1(U∗)kU ∗ rkS∞ ≥ 2(√2− 1) −201 −501 −201 σr(Xr∗)D2 ≥ 0.7 σr(Xr∗) D 2 (C.22)
where(i) is due tok · k∗≥ k · kF, and (ii) is owing to lemma 5. (iii) is due to the assumption ˜D∗≤81κ1 σr(X?) σ1(U?). Combining (C.22) with (C.19), we get ∇f(X)#U , U − U∗ rRU ≥ 0.86ηk∇f(X)#Uk2F− L 4kX ∗ r− X∗k 2 ∗+ 0.7 µ 4 σr(X ∗ r) D 2, (C.23)
the desired lemma 3.
DF(U+, Ur∗) 2= min R is unitarykU + − U∗ rk 2 F ≤ kU + − U∗ rRUk2F =kU − U∗ rRUk2F − 2η∇f(X) # U , U− U∗ rRU + η2k∇f(X)#Uk2F (i) ≤ DF(U, U∗)2− 2η −L4kX∗ r− X∗k 2 ∗+ 0.7µ 4 σr(X ∗ r)D(U, U∗) 2 − (2(0.86)− 1)η2 k∇f(X)#U k2 F ≤ 1−0.7 µ η 2 σr(X ∗ r) D(U, U∗)2+η L 2 kX ∗ r− X ∗ k2 ∗ (C.24)
in which RU = arg min R R is unitary
kU − U∗
rRkF.(i) is by lemma 3.
Lemma 6. Let U and U∗
r be twon× r matrices such that DF(U, Ur∗)≤ ρσr(Ur∗), for ρ ≤ 1 100. Define X ≡ UU T and X∗ r ≡ Ur∗(Ur∗)T. Then we have (1−1001 )σ1(Ur∗)≤ σ1(U )≤ (1 + 1 100)σ1(U ∗ r) (C.25) (1− 1 100)σr(U ∗ r)≤ σr(U )≤ (1 + 1 100)σr(U ∗ r) (C.26) and thus τ (U )≤ 101 99τ (U ∗ r) and τ (X)≤ ( 101 99 ) 2τ (X∗ r) (C.27)
Proof. By k · kS∞ ≤ k · kF and Weyl’s inequality for perturbation of singular values, we obtain |σi(Ur∗)− σi(U )| ≤ ρσr(Ur∗)≤ 1 100σr(U ∗ r). (C.28)
Now we show min
i≥0ηi ≥ 1 16η by verifying kUiU T i kS∞ ≤ 1+ρ 1−ρ 2 kX0kS∞ and k∇f(UiU T i ) #Q UiQ T UikS∞ ≤ 4Lσ1(U0)σr(X∗)
81κσ1(U∗)(1−ρ) +k∇f(X0)kS∞. The first one is an immediate result of lemma 6. Applying the same arguments of (A.20),
(A.21) and (A.22), the second part is a direct consequence of assumption ˜D∗≤ 81κ1 σr(X?)
σ1(U?) and assumption ˜DF ≤ ρσr(U
? r).
APPENDIXD
PROOF OFSUBLINEARRATE OFSPECTRALGRADIENTDESCENT We first use following lemma to prove the sublinear rate.
Lemma 7. For the sequence of the iterates{Ui}ki=0, we have
f (UiUiT)− f(Ui+1Ui+1T)≥ αi· k[∇f(Xi)· Ui]#∞k 2 S∞ = αi· k∇f(Xi)· Uik2∗ (D.1) and f (UiUiT)− f(U ∗U∗T) ≤ βi· k[∇f(Xi)· Ui]#∞kS∞ (D.2) where αi= 1.117 ηi and βi= (2 +1981)D∞(Ui, U∗).
Defineδi= f (UiUiT)− f(U∗U∗T) and follow the previous lemma. We know{δi} is an positive decreasing sequence and
δi+1≤ δi− αi· k[∇f(X) · U]#∞k2S∞ ≤ δi− αi β2 i · δ2 i
Dividing both sides with (δi· δi+1), we obtain, by assumption (IV.5),
1 δi+1 − 1 δi ≥ αi β2 i ·δδi i+1 ≥ αi β2 i ≥ ˜αi D2 S∞ . Telescoping the inequality we get the desired result.
Now we prove (D.1) of lemma 7. The smoothness gives
f (UiUiT)− f(Ui+1Ui+1T)≥ h∇f(Xi), Xi− Xi+1i −
L 2kXi− Xi+1k 2 S∞ =∇f(Xi), (Ui− Ui+1)UiT+ Ui(Ui− Ui+1)T | {z } 1
−∇f(Xi), (Ui− Ui+1)(Ui− Ui+1)T
| {z } 2 −L2 kXi− Xi+1k2S∞ | {z } 3 (D.3) For 1 we have ∇f(Xi), (Ui− Ui+1)UiT + Ui(Ui− Ui+1)T =2h∇f(Xi)Ui, Ui− Ui+1i =2ηi∇f(Xi)Ui, [∇f(Xi)Ui]#∞ =2ηik∇f(Xi)Uik2∗. (D.4)
To upper bound 2 , we use
∇f(Xi), (Ui− Ui+1)(Ui− Ui+1)T
=η2 ik∇f(Xi)· Uik2∗· T race(AT∇f(Xi)A Ir 0 0 0(n−r)×(n−r) ) ≤η2 ik∇f(Xi)· Uik2∗· k∇f(Xi)rk∗ (∗) ≤14ηik∇f(Xi)· Uik2∗ (D.5)
in which A is the left-singular vectors of ∇f(Xi)· Ui and k∇f(Xi)rk∗ equals to the sum of the top r singular values of
∇f(Xi). (∗) is by ηi≤ 4k∇f (X1i)rk∗.
To upper bound 3 , we use
kUiUiT− Ui+1Ui+1T kS∞
=kUi(Ui− Ui+1)T + (Ui− Ui+1)UiT − (Ui− Ui+1)(Ui− Ui+1)TkS∞
≤2kUikS∞kUi− Ui+1kS∞+kUi− Ui+1k
2 S∞ =2ηikUikS∞k∇f(Xi)· Uik∗+ η 2 ik∇f(Xi)· Uik2∗ =ηi k∇f(Xi)· Uik∗ [2kUikS∞+ ηik∇f(Xi)· Uik∗] (1) ≤ ηi k∇f(Xi)· Uik∗ [2kUikS∞+ ηik∇f(Xi)rk∗kUikS∞] (2) ≤ ηi k∇f(Xi)· Uik∗ 9 4kUikS∞ (D.6)
where(1) is due to the rank of Ui is less than r and(2) is by ηi≤ 4k∇f (X1
i)rk∗. Plugging above inequalities into (D.3), we
obtain
f (UiUiT)− f(Ui+1Ui+1T)
≥2ηik∇f(Xi)Uik2∗− 1 4ηik∇f(Xi)· Uik 2 ∗ −L2(9 4ηikUikS∞k∇f(Xi)· Uik∗) 2 ≥ηik∇f(Xi)· Uik2∗ " 7 4− L 2 9 4 2 ηikUik2S∞ # (∗) ≥ 1.117 ηik∇f(Xi)· Uik2∗ (D.7) (∗) is by ηi≤ 4LkX1 ikS∞ = 1
4LkUik2S∞. We have thus finished the first part of lemma 7.
Now we give the proof of (D.2) of lemma 7. We denote RUi≡ arg min R R is unitary kUi− U∗RkS∞. (D.8) and define ∆Ui≡ Ui− U ∗R Ui. Then we have f (UiUiT)− f(U∗U∗T) ≤h∇f(Xi), Xi− X∗i =∇f(Xi), ∆UiU T i + ∇f(Xi), Ui∆TUi − ∇f(Xi), ∆Ui∆ T Ui =2h∇f(Xi)Ui, ∆Uii −∇f(Xi), ∆Ui∆ T Ui ≤2k∇f(Xi)· Uik∗k∆UikS∞+ ∇f(Xi), ∆Ui∆ T Ui | {z } 1 (D.9)
To upper bound 1 , we use
∇f(Xi), ∆Ui∆ T Ui = h∇f(Xi)∆Ui, ∆Uii ≤ k∇f(Xi)∆Uik∗k∆UikS∞ =k∇f(Xi)P∆Ui∆Uik∗k∆UikS∞ ≤ k∇f(Xi)P∆Uik∗k∆Uik 2 S∞ (∗) ≤ k∇f(Xi)PUik∗ +k∇f(Xi)PU∗k∗ k∆Uik 2 S∞ (D.10)
in which PU denotes the projection onto Col(U ), and (∗) is due to Span(Col(∆Ui)) ⊆ Span(Col(Ui)∪ Col(U
∗ r)).
k∇f(Xi)PUik∗=k∇f(Xi)UiU † ik∗ (1) ≤ k∇f(Xi)Uik∗ 1 σr(Ui) (2) ≤ k∇f(Xi)Uik∗ 10 9σr(U∗) (D.11)
in which Ui† denotes the pseudoinverse of Ui. Here(1) is due to σ1(Ui†) = σr(Ui)−1, and(2) is by assumption (IV.5), Weyl’s
inequality andσr(U∗RUi) = σr(U ∗). Similarly, we have k∇f(Xi)PU∗k∗=k∇f(Xi)U∗(U∗)†k∗ ≤ k∇f(Xi)U∗k∗ | {z } A 1 σr(U∗) . (D.12)
To upper bound A , we use the following inequality.
k∇f(Xi)U∗k∗=k∇f(Xi)U∗RUik∗ ≤ k∇f(Xi)Uik∗+k∇f(Xi)∆Uik∗ =k∇f(Xi)Uik∗+k∇f(Xi)P∆Ui∆Uik∗ ≤ k∇f(Xi)Uik∗+k∇f(Xi)P∆Uik∗k∆UikS∞ (1) ≤ k∇f(Xi)Uik∗+ (k∇f(Xi)PUik∗ +k∇f(Xi)PU∗k∗)k∆U ikS∞ (2) ≤ k∇f(Xi)Uik∗+ 10 9 k∇f(Xi)Uik∗ +k∇f(Xi)U∗k∗ k∆U ikS∞ σr(U∗) (3) ≤ k∇f(Xi)Uik∗+ 1 10 10 9 k∇f(Xi)Uik∗ +k∇f(Xi)U∗k∗ =10 9 k∇f(Xi)Uik∗+ 1 10k∇f(Xi)U ∗ k∗. (D.13)
Here, (1) is owing to the similar reason of (D.10), (2) is obtained by plugging in (D.11) and (D.12) and (3) is by assumption (IV.5). Thus we arrive at
k∇f(Xi)U∗k∗≤
10 9
2
k∇f(Xi)Uik∗. (D.14)
Plugging this into (D.12), we get
k∇f(Xi)PU∗k∗≤ 10 9 2 k∇f(Xi)Uik∗ 1 σr(U∗) . (D.15)
Combining (D.11) and (D.15) with (D.10), we obtain
∇f(Xi), ∆Ui∆ T Ui ≤ k∇f(Xi)Uik∗ 10 9σr(U∗) + 10 9 2 k∇f(Xi)Uik∗ 1 σr(U∗) k∆Uik 2 S∞ =k∇f(Xi)Uik∗ 190 81 k∆UikS∞ σr(U∗) k∆ UikS∞ (∗) ≤ 1981k∇f(Xi)Uik∗k∆UikS∞ (D.16)
where(∗) is by assumption (IV.5). Now we plug (D.16) into (D.9) and obtain f (UiUiT)− f(U∗U∗T)≤ 2 +19 81 k∆UikS∞k∇f(Xi)· Uik∗. (D.17)
The last part is to proveminiγi≥ 14η by showingkUikS∞≤ 11 9kU0kS∞ and k∇f(Xi)rk∗≤ 40L 81 σr(U0)σ1(U0) +k∇f(X0)rk∗. (D.18) By assumption (IV.5) and Weyl’s inequality, we have for everyi≥ 0
(1− 1 10)σ1(U ∗) ≤ σ1(Ui)≤ (1 + 1 10)σ1(U ∗) , and thus 1 + 1 10 1− 1 10 σ1(U0)≥ σ1(Ui). (D.19)
Since k∇f(Xi)rk∗ is the Ky Fanr-norm of ∇f(Xi), we have
k∇f(Xi)rk∗≤ k((∇f(Xi)− ∇f(X0))rk∗+k∇f(X0)rk∗ ≤ k∇f(Xi)− ∇f(X0)k∗+k∇f(X0)rk∗ ≤ LS∞→S1kXi− X0kS∞+k∇f(X0)rk∗ ≤ LS∞→S1 kXi− X∗kS∞+kX0− X ∗ kS∞ +k∇f(X0)rk∗. Since kXi− X∗kS∞ =kUi(Ui− U ∗R Ui) T + (Ui− U∗RUi)(U ∗R Ui) T kS∞ ≤ kUi− U∗RUikS∞(kUikS∞+kU ∗ kS∞), we have kXi− X∗kS∞+kX0− X ∗ kS∞ ≤kUi− U∗RUikS∞(kUikS∞+kU ∗ kS∞) +kU0− U∗RU0kS∞(kU0kS∞+kU ∗ kS∞) ≤σr(U ∗) 10 σ1(U0)( 11 9 + 10 9 + 1 + 10 9 ) ≤1 1 − 1 10 σr(U0) 10 σ1(U0) 40 9 =40 81σr(U0)σ1(U0) by applying inequality (D.19). APPENDIXE PROOF OFLEMMA1 Using chain rules, we see that
∇f(A) = ∇lse(Ax)x>. (E.1)
To prove the convexity off , we compute h∇f(A) − ∇f(A0), A − A0 i =∇lse(Ax)x> − ∇lse(A0x)x>, A − A0 =h∇lse(Ax) − ∇lse(A0x), Ax− A0xi ≥ 0
since the lse function is convex.
We now turn to the smoothness parameters. Since transposing a matrix does not alter the Schatten-p norm, we have k∇f(A) − ∇f(A0) kF = ∇lse(Ax) − ∇lse(A0x)x> F ≤ kxk2k∇lse(Ax) − ∇lse(A0x)k2 ≤ kxk2k(A − A0)xk2, by (IV.2) ≤ kxk2 2kA − A0kF
Using similar arguments, we compute k∇f(A) − ∇f(A0)kS1 = ∇lse(Ax) − ∇lse(A0x)x> S 1 ≤ kxk2k∇lse(Ax) − ∇lse(A0x)k2 ≤ kxk2k∇lse(Ax) − ∇lse(A0x)k1 ≤ kxk2k(A − A0)xk∞, by (IV.3) ≤ kxk2 2kA − A0kS∞
which establishes (IV.9).
APPENDIXF PROOF OFLEMMA2 The convexity of ˆf follows immediately from the convexity of f . For any two positive semi-definite matricesZ1= A1 B1
B> 1 D1 andZ2= A2 B2 B> 2 D2 , we have k∇ ˆf (Z1)− ∇ ˆf (Z2)kqSq = 1 2 0 ∇f(B1)− ∇f(B2) ∇f(B1)− ∇f(B2) 0 q Sq (1) = 1 2 ∇f(B1)− ∇f(B2) 0 0 ∇f(B1)− ∇f(B2) q Sq (2) = 1 2 k∇f(B1)− ∇f(B2)kqSq+k∇f(B1)− ∇f(B2)k q Sq (3) ≤ Lq kB1− B2kqSp, (F.1) where
(1) is because|| · ||Sq is permutation invariant,
(2) is because of the block-diagonal structure, and (3) uses the smoothness of f .
It remains to see thatkB1− B2kSp≤ kZ1− Z2kSp. In order to prove this, we use the permutation invariance of the || · ||Sp,
and the Pinching inequality [1]:
kZ1− Z2kpSp = B1− B2 A1− A2 D1− D2 B1>− B>2 p Sp ≥ kB1− B2kpSp. APPENDIXG
SYNTHETICDATA FORPHASERETRIEVAL
Original Picture
Nuclear Wirtinger Flow Wirtinger Flow Fienup
Phaselift Phaselamp Phasemax
Truncated Amplitude Flow SketchyCGM Reweighted Wirtinger Flow
Original Picture
Nuclear Wirtinger Flow Wirtinger Flow Fienup
Phaselift Phaselamp Phasemax
Truncated Amplitude Flow SketchyCGM Reweighted Wirtinger Flow
APPENDIXH
WIRTINGERFLOW V.S. NUCLEARWIRTINGERFLOW
In Figure 4 we present the images recovered from nuclear Wirtinger flow and Wirtinger flow, indexed by time. Our experiments show that the nuclear Wirtinger flow quickly finds area (att = 4s) where meaningful image characteristics start to emerge. At t = 8s, a fully visible image is recovered, and the reconstruction stays at the solution for a short period. However, the nuclear Wirtinger flow eventually overfits and returns a noisy figure; see Figure 3. This phenomenon is possibly due to the mismatch of the mathematical model and real Fourier Ptychographic reconstructions.
In contrast, the Wirtinger flow recovers only partial image at t = 8s, and exhibits oscillating behaviors. Eventually the Wirtinger flow overfits, and return solutions like random noise.
We stress that the Wirtinger flow fails to recover the image for all the initializations we have tried, whereas the nuclear Wirtinger flow is quite robust to the choice of initial point.
2700 20 40 60 80 100 120 140 160 20 40 60 80 100 120 140 160 0.062953 20 40 60 80 100 120 140 160 20 40 60 80 100 120 140 160
Fig. 3: Final solution of the nuclear Wirtinger flow,t = 37s.
APPENDIXI
SPECTRALGRADIENTMETHODS FORFA S TTE X T
Four more datasets are presented in Figure 5. The results are in accordance with our observations in Section VI-B: The heuristic version of (V.4) is the best optimization algorithm, in that it solves the training problem most efficiently, but is prone to overfitting. On the other hand, the theoretical iterates (V.4) is either the best or comparable to the other methods in terms of prediction accuracy.
REFERENCES [1] R. Bhatia, Matrix analysis. Springer Science & Business Media, 2013, vol. 169.
[2] S. Tu, R. Boczar, M. Simchowitz, M. Soltanolkotabi, and B. Recht, “Low-rank solutions of linear matrix equations via procrustes flow,” in Proceedings of The 33rd International Conference on Machine Learning, 2016, pp. 964–973.
(a) Nuclear Wirtinger flow at t = 2s. (b) Wirtinger flow at t = 2s.
(c) Nuclear Wirtinger flow at t = 4s. (d) Wirtinger flow at t = 4s.
(e) Nuclear Wirtinger flow at t = 8s. (f) Wirtinger flow at t = 8s.
(g) Nuclear Wirtinger flow at t = 14s. (h) Wirtinger flow at t = 14s.
(i) Nuclear Wirtinger flow at t = 16s. (j) Wirtinger flow at t = 16s.
2 4 6 8 10 12 epochs 0.08 0.10 0.12 0.14 0.16 0.18 loss
yelp review polarity
[∇Af ;∇Bf ]#∞ [∇f]#∞ ∇f 2 4 6 8 10 12 epochs 0.952 0.953 0.954 0.955 0.956 0.957 0.958 a ccura cy
yelp review polarity
[∇Af ;∇Bf ]#∞ [∇f]#∞ ∇f 2 4 6 8 10 12 epochs 0.05 0.10 0.15 0.20 0.25 0.30 0.35 0.40 0.45 loss ag news [∇Af ;∇Bf ]#∞ [∇f]#∞ ∇f 2 4 6 8 10 12 epochs 0.914 0.916 0.918 0.920 0.922 0.924 0.926 a ccura cy ag news [∇Af ;∇Bf ]#∞ [∇f]#∞ ∇f 2 4 6 8 10 12 epochs 0.06 0.08 0.10 0.12 0.14 0.16 0.18 loss sogou news [∇Af ;∇Bf ]#∞ [∇f]#∞ ∇f 2 4 6 8 10 12 epochs 0.963 0.964 0.965 0.966 0.967 0.968 0.969 0.970 a ccura cy sogou news [∇Af ;∇Bf ]#∞ [∇f]#∞ ∇f 2 4 6 8 10 12 epochs 0.12 0.13 0.14 0.15 0.16 0.17 0.18 0.19 loss
amazon review polarity
[∇Af ;∇Bf ]#∞ [∇f]#∞ ∇f 2 4 6 8 10 12 epochs 0.940 0.941 0.942 0.943 0.944 0.945 0.946 0.947 a ccura cy
amazon review polarity
[∇Af ;∇Bf ]#∞ [∇f]#∞ ∇f
Fig. 5: From left to right, training loss and test accuracy. From top to bottom, results on Yelp Review Polarity, AG News, Sogou News, and Amazon Review Polarity