I Proof of Proposition - A Matrix Factorization Approach for Learning Semidefinite-Representabl

Proof of Proposition 10. To simplify notation in the proof we denote α8:= α8(q, L?) = 96

√

qkL?k2.

We show that

kL(t)_{− L}(t+1)_k

2 ≤ α9ξL?(L(t)), (34)

for some function α9 := α9(q, r, L?) that we specify later. In the proof of Theorem 9 we showed

that ξL?(L(t)) ≤ γtξ_L?(L(0)) for some γ < 1. Hence establishing (34) immediately implies that the

sequence {L(t)}∞

t=1 is Cauchy.

Our proof builds on the proof of Theorem 9. Let

L(t)_{= L}?_{◦ (I + E}(t)_{) ◦ M}

where E(t) is a linear map that satisfies kE(t)k_eu< 1/α0. In the proof of Theorem 9 we show that

L(t+1)_{= L}?_{◦ (I + E}(t+1)_{) ◦ W ◦ M ◦ N ,}

where kE(t+1)k_eu ≤ kE(t)_k

eu, W is a rank-preserver, and N is a positive definite rank-preserver.

Moreover, as a consequence of applying Proposition 8 to establish (23) in the proof, we obtain the bound kW − Ik2 ≤ 3α7kE(t)keu. We use these bounds and relations to prove (34).

By the triangle inequality we have

kL(t)− L(t+1)k₂ ≤ kL?◦ E(t)◦ Mk₂+ kL?◦ E(t+1)◦ W ◦ M ◦ N k₂

+ kL?◦ M ◦ (N − I)k2+ kL?◦ (W − I) ◦ M ◦ N k2. (35)

By Proposition 5 applied to the pairs of linear maps L(t), L? and L(t+1), L? we have kM − Q₁k₂, kW ◦ M ◦ N − Q2k2 ≤ α8kE(t)keu, for some pair of orthogonal rank-preservers Q1, Q2. Since

α8/α0 ≤ 1 we have kMk2≤ 2 and kW ◦ M ◦ N k2 ≤ 2. Consequently kL?◦ E(t)◦ Mk2, kL?◦ E(t+1)◦

W ◦ M ◦ N k2 ≤ 2kL?k2kE(t)keu.

Next we bound kN −Ik2. By utilizing kW ◦M◦N −Q2k2 ≤ α8/α0, kM−Q1k2 ≤ α8kE(t)keu, and

kW −Ik₂ ≤ 3α₇kE(t)_k

eu, one can show that kN −Q3k2 ≤ (6α7+2α8+2)kE(t)keu, where Q3 = Q01◦Q2

is an orthogonal rank-preserver. Since N is self-adjoint, we have kN2−Ik2≤ 3(6α7+2α8+2)kE(t)keu,

and hence kN − Ik2 ≤ 3(6α7+ 2α8+ 2)kE(t)keu. This also implies the bound kN k2 ≤ 3.

We apply these bounds to obtain kL? ◦ M ◦ (N − I)k₂ ≤ 6(6α₇+ 2α8+ 2)kL?k2kE(t)keu, and

kL?_{◦ (W − I) ◦ M ◦ N k}

2 ≤ 9α7kL?k2kE(t)k2.

We define α9 := (4 + 6(6α7+ 2α8+ 2) + 9α7)kL?k2(these are exactly the sum of the coefficients

of kE(t)k_eu in the above bounds). The result follows by adding these bounds, and subsequently taking the infimum over E(t) in (35).

Acknowledgements

The authors were supported in part by NSF Career award CCF-1350590, by Air Force Office of Scientific Research grants FA9550-14-1-0098 and FA9550-16-1-0210, by a Sloan research fellowship, and an A*STAR (Agency for Science, Technology, and Research, Singapore) fellowship.

References

[1] A. Agarwal, A. Anandkumar, P. Jain, and P. Netrapalli. Learning Sparsely Used Overcomplete Dictionaries via Alternating Minimization. SIAM Journal on Optimization, 26(4):2775–2799, 2016.

[2] A. Agarwal, A. Anandkumar, and P. Netrapalli. A Clustering Approach to Learning Sparsely Used Overcomplete Dictionaries. IEEE Transactions on Information Theory, 63(1), 2017. [3] M. Aharon, M. Elad, and A. Bruckstein. K-SVD: An Algorithm for Designing Overcomplete

Dictionaries for Sparse Representation. IEEE Transactions on Signal Processing, 54(11):4311– 4322, 2006.

[4] S. Arora, R. Ge, T. Ma, and A. Moitra. Simple, Efficient, and Neural Algorithms for Sparse Coding. Journal of Machine Learning Research: Workshop and Conference Proceedings, 40, 2015.

[5] S. Arora, R. Ge, and A. Moitra. New Algorithms for Learning Incoherent and Overcomplete Dictionaries. Journal of Machine Learning Research: Workshop and Conference Proceedings, 35:1–28, 2014.

[6] B. Barak, J. A. Kelner, and D. Steurer. Dictionary Learning and Tensor Decomposition via the Sum-of-Squares Method. In Proceedings of the Forty-seventh Annual ACM Symposium on Theory of Computing. ACM, 2015.

[7] A. R. Barron. Universal Approximation Bounds for Superpositions of a Sigmoidal Function. IEEE Transactions on Information Theory, 39(3):930–945, 1993.

[8] B. N. Bhaskar, G. Tang, and B. Recht. Atomic Norm Denoising with Applications to Line Spectral Estimation. IEEE Transactions on Signal Processing, 61(23):5987–5999, 2013. [9] T. Blumensath and M. E. Davies. Iterative Hard Thresholding for Compressed Sensing. Applied

and Computational Harmonic Analysis, 27:265–274, 2009.

[10] A. M. Bruckstein, D. L. Donoho, and M. Elad. From Sparse Solutions of Systems of Equations to Sparse Modeling of Signals and Images. SIAM Review, 51(1):34–81, 2009.

[11] E. J. Cand`es and Y. Plan. Tight Oracle Inequalities for Low-Rank Matrix Recovery From a Minimal Number of Noisy Random Measurements. IEEE Transactions on Information Theory, 57(4):2342–2359.

[12] E. J. Cand`es and B. Recht. Exact Matrix Completion via Convex Optimization. Foundations of Computational Mathematics, 9(6):717–772, 2009.

[13] E. J. Cand`es, J. Romberg, and T. Tao. Robust Uncertainty Principles: Exact Signal Recon- struction from Highly Incomplete Frequency Information. IEEE Transactions on Information Theory, 52(2):489–509, 2006.

[14] E. J. Cand`es and T. Tao. Near-Optimal Signal Recovery From Random Projections: Universal Encoding Strategies? IEEE Transactions on Information Theory, 52(12):5406–5425, 2006. [15] V. Chandrasekaran, P. Parillo, and A. S. Willsky. Latent Variable Graphical Model Selection

via Convex Optimization. The Annals of Statistics, 40(4):1935–1967, 2012.

[16] V. Chandrasekaran, B. Recht, P. A. Parrilo, and A. S. Willsky. The Convex Geometry of Linear Inverse Problems. Foundations of Computational Mathematics, 12(6):805–849, 2012. [17] S. S. Chen, D. L. Donoho, and M. A. Saunders. Atomic Decomposition by Basis Pursuit.

SIAM Journal on Scientific Computing, 20(1):33–61, 1998.

[18] E. Cho. Inner Products of Random Vectors on Sn. Journal of Pure and Applied Mathematics: Advances and Applications, 9(1):63–68, 2013.

[19] M. Cuturi. Sinkhorn Distances: Lightspeed Computation of Optimal Transportation Dis- tances. In Advances in Neural Information Processing Systems, 2013.

[20] K. R. Davidson and S. J. Szarek. Local Operator Theory, Random Matrices and Banach Spaces. In W. B. Johnson and J. Lindenstrauss, editors, Handbook of the Geometry of Banach Spaces, chapter 8, pages 317–366. Elsevier B. V., 2011.

[21] R. A. DeVore and V. N. Temlyakov. Some Remarks on Greedy Algorithms. Advances in Computational Mathematics, 5(1):173–187, 1996.

[22] D. L. Donoho. Compressed Sensing. IEEE Transactions on Information Theory, 52(4):1289– 1306, 2006.

[23] D. L. Donoho. For Most Large Underdetermined Systems of Linear Equations the Minimal `1-norm Solution Is Also the Sparsest Solution. Communications on Pure and Applied Math-

ematics, 59(6):797–829, 2006.

[24] D. L. Donoho and X. Huo. Uncertainty Principles and Ideal Atomic Decomposition. IEEE Transactions on Information Theory, 47(7):2845–2862.

[25] M. Fazel. Matrix Rank Minimization with Applications. PhD thesis, Department of Electrical Engineering, Stanford University, 2002.

[26] M. Fazel, E. Cand`es, B. Recht, and P. Parrilo. Compressed Sensing and Robust Recovery of Low Rank Matrices. In 42nd IEEE Asilomar Conference on Signals, Systems and Computers, 2008.

[27] A. Garg, L. Gurvits, R. Oliveira, and A. Wigderson. A Deterministic Polynomial Time Algorithm for Non-Commutative Rational Identity Testing with Applications. CoRR, abs/1511.03730, 2015.

[28] R. Ge, J. D. Lee, and T. Ma. Matrix Completion has No Spurious Local Minimum. In Advances in Neural Information Processing Systems, 2016.

[29] D. Goldfarb and S. Ma. Convergence of Fixed-Point Continuation Algorithms for Matrix Rank Minimization. Foundations of Computational Mathematics, 11:183 – 210, 2011.

[30] W. M. Gorman. Estimating Trends in Leontief Matrices. Unplublished note, referenced in Bacharach (1970), 1963.

[31] J. Gouveia, P. A. Parrilo, and R. R. Thomas. Lifts of Convex Sets and Cone Factorizations. Mathematics of Operations Research, 38(2), 2013.

[32] L. Gurvits. Classical Complexity and Quantum Entanglement. Journal of Computer and Systems Sciences, 69(3):448 – 484, 2004.

[33] M. Idel. A Review of Matrix Scaling and Sinkhorn’s Normal Form for Matrices and Positive Maps. CoRR, abs/1609.06349, 2016.

[34] P. Jain, R. Meka, and I. S. Dhillon. Guaranteed Rank Minimization via Singular Value Projection. In Advances in Neural Information Processing Systems, 2009.

[35] L. K. Jones. A Simple Lemma on Greedy Approximation in Hilbert Space and Convergence Rates for Projection Pursuit Regression and Neural Network Training. The Annals of Statis- tics, 20(1):608–613, 1992.

[36] T. Kato. Perturbation Theory for Linear Operators. Springer-Verlag, 1966.

[37] L. Khachiyan and B. Kalantari. Diagonal Matrix Scaling and Linear Programming. SIAM Journal on Optimization, 2(4):668–672, 1991.

[38] N. Linial, A. Samorodnitsky, and A. Wigderson. A Deterministic Strongly Polynomial Algo- rithm for Matrix Scaling and Approximate Permanents. Combinatorica, 20(4):545–568, 2000. [39] M. Marcus and B. N. Moyls. Transformations on Tensor Product Spaces. Pacific Journal of

Mathematics, 9(4):1215–1221, 1959.

[40] N. Meinhausen and P. B¨uhlmann. High-Dimensional Graphs and Variable Selection with the Lasso. The Annals of Statistics, 34(3):1436–1462, 2006.

[41] B. K. Natarajan. Sparse Approximate Solutions to Linear Systems. SIAM Journal on Com- puting, 24(2):227–234, 1993.

[42] Y. Nesterov and A. Nemirovskii. Interior-Point Polynomial Algorithms in Convex Program- ming. SIAM Studies in Applied and Numerical Mathematics, 1994.

[43] B. A. Olshausen and D. J. Field. Emergence of Simple-Cell Receptive Field Properties by Learning a Sparse Code for Natural Images. Nature, 381, 1996.

[44] S. Oymak and B. Hassibi. Sharp MSE Bounds for Proximal Denoising. Foundations of Com- putational Mathematics, 16(4):965–1029, 2016.

[45] N. Parikh and S. Boyd. Proximal Algorithms. Foundations and Trends in Optimization, 1(3):127–239, 2014.

[46] G. Pisier. Remarques sur un résultat non publié de B. Maurey. Séminaire Analyse fonctionnelle (dit ”Maurey-Schwartz”), pages 1–12, 1981.

[47] B. Recht, M. Fazel, and P. A. Parrilo. Guaranteed Minimum-Rank Solutions of Linear Matrix Equations via Nuclear Norm Minimization. SIAM Review, 52(3):471–501, 2010.

[48] J. Renegar. A Mathematical View of Interior-Point Methods in Convex Optimization. MOS- SIAM Series on Optimization, 2001.

[49] P. Shah, B. N. Bhaskar, G. Tang, and B. Recht. Linear System Identification via Atomic Norm Regularization. In 51st IEEE Conference on Decisions and Control, 2012.

[50] R. Sinkhorn. A Relationship Between Arbitrary Positive Matrices and Doubly Stochastic Matrices. The Annals of Mathematical Statistics, 35(2):876–879, 1964.

[51] D. A. Spielman, H. Wang, and J. Wright. Exact Recovery of Sparsely-Used Dictionaries. Journal on Machine Learning and Research: Workshop and Conference Proceedings, 23(37), 2012.

[52] G. Stewart and J. Sun. Matrix Perturbation Theory. Academic Press, 1990.

[53] J. Sun, Q. Qu, and J. Wright. A Geometric Analysis of Phase Retrieval. CoRR, abs/1602.06664, 2016.

[54] J. Sun, Q. Qu, and J. Wright. Complete Dictionary Recovery over the Sphere I: Overview and the Geometric Picture. IEEE Transactions on Information Theory, (99), 2016.

[55] J. Sun, Q. Qu, and J. Wright. Complete Dictionary Recovery over the Sphere II: Recovery by Riemannian Trust-region Method. IEEE Transactions on Information Theory, (99), 2016. [56] R. Tibshirani. Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical

Society, Series B, 58:267–288, 1994.

[57] K. C. Toh, M. J. Todd, and R. H. Tütüncü. SDPT3 – a MATLAB Software Package for Semidefinite Programming. Optimization Methods and Software, 11:545–581, 1999.

[58] J. A. Tropp. User-Friendly Tail Bounds for Sums of Random Matrices. Foundations of Computational Mathematics, 12(4):389–434, 2012.

[59] L. Tun¸cel. Potential Reduction and Primal-Dual Methods. In H. Wolkowicz, R. Saigal, and L. Vandenberghe, editors, Handbook of Semidefinite Programming – Theory, Algorithms, and Applications, chapter 9. Kluwer’s International Series in Operations Research and Management Science, 2000.

[60] M. Yannakakis. Expressing Combinatorial Optimization Problems by Linear Programs. Jour- nal of Computer and System Sciences, 43, 1991.

In document A Matrix Factorization Approach for Learning Semidefinite-Representable Regularizers (Page 44-48)