Proof of Theorem 2.4 - Block Coordinate Descent for Regularized Multi-convex Optimization

If $\theta = 0$, we must have $F(x^{k_0}) = F(\bar{x})$ for some $k_0$. Otherwise, $F(x^k) > F(\bar{x})$ for all sufficiently large $k$. The Kurdyka-Łojasiewicz inequality gives $c \cdot \mathrm{dist}(0, \partial F(x^k)) \ge 1$ for all $k \ge 0$, which is impossible since $x^k \to \bar{x}$ and $0 \in \partial F(\bar{x})$. The finite convergence now follows from the fact that $F(x^{k_0}) = F(\bar{x})$ implies $x^k = x^{k_0} = \bar{x}$ for all $k \ge k_0$.

For $\theta \in (0, 1)$, we assume $F(x^k) > F(\bar{x}) = 0$ and use the same notation as in the proof of Lemma 3. Define

\[ S_k = \sum_{i=k}^{\infty} \|x^i - x^{i+1}\|. \]

Then (A.3) can be written as

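\[ S_k \le C_1 \phi(F_k) + \frac{2+\delta_\omega}{1-\delta_\omega}(S_{k-1} - S_k) + S_{k-2} - S_{k-1}, \quad \text{for } k \ge 2, \]

which implies

\[ S_k \le C_1 \phi(F_k) + \frac{2+\delta_\omega}{1-\delta_\omega}(S_{k-2} - S_k), \quad \text{for } k \ge 2, \tag{A.10} \]

since $S_{k-2} - S_{k-1} \ge 0$ and the coefficient $\frac{2+\delta_\omega}{1-\delta_\omega}$ is at least 1. Using $\phi(s) = c s^{1-\theta}$, we have from (A.7) for sufficiently large $k$ that

\[ c(1-\theta)(F_k)^{-\theta} \ge (L + sL_G)^{-1} \left( \|x^k - x^{k-1}\| + \|x^{k-1} - x^{k-2}\| \right)^{-1}, \]

or equivalently $(F_k)^{\theta} \le c(1-\theta)(L + sL_G)(S_{k-2} - S_k)$. Then,

\[ \phi(F_k) = c (F_k)^{1-\theta} \le c \left( c(1-\theta)(L + sL_G)(S_{k-2} - S_k) \right)^{\frac{1-\theta}{\theta}}. \tag{A.11} \]

Letting $C_3 = C_1 c \left( c(1-\theta)(L + sL_G) \right)^{\frac{1-\theta}{\theta}}$ and $C_4 = \frac{2+\delta_\omega}{1-\delta_\omega}$, we have from (A.10) and (A.11) that

\[ S_k \le C_3 (S_{k-2} - S_k)^{\frac{1-\theta}{\theta}} + C_4 (S_{k-2} - S_k). \tag{A.12} \]

When $\theta \in (0, \frac{1}{2}]$, i.e., $\frac{1-\theta}{\theta} \ge 1$, (A.12) implies that $S_k \le (C_3 + C_4)(S_{k-2} - S_k)$ for sufficiently large $k$ since $S_{k-2} - S_k \to 0$, and thus

\[ S_k \le \frac{C_3 + C_4}{1 + C_3 + C_4} S_{k-2}. \]

Note that $\|x^k - \bar{x}\| \le S_k$. Therefore, item 2 holds with $\tau = \sqrt{\frac{C_3 + C_4}{1 + C_3 + C_4}} < 1$ and sufficiently large $C$.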
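As a numerical illustration of the linear-rate case, the following Python sketch iterates the extreme case $S_k = \rho S_{k-2}$ with $\rho = \frac{C_3+C_4}{1+C_3+C_4}$ and compares it with the bound $C\tau^k$, $\tau = \sqrt{\rho}$. The function name and the values of `C3`, `C4`, `S0`, `S1` are hypothetical and chosen only for illustration; they do not come from the analysis above.

```python
import math


def linear_rate_sequence(C3, C4, S0, S1, n):
    """Iterate S_k = rho * S_{k-2}, the largest S_k allowed by the two-step recursion.

    Returns the sequence, the rate tau = sqrt(rho), and a constant C
    such that S_k <= C * tau**k for every k.
    """
    rho = (C3 + C4) / (1.0 + C3 + C4)   # contraction factor over two steps
    tau = math.sqrt(rho)                # per-step rate
    S = [S0, S1]
    for k in range(2, n):
        S.append(rho * S[k - 2])
    # Even indices give S_k = tau**k * S0, odd indices give S_k = tau**k * (S1 / tau).
    C = max(S0, S1 / tau)
    return S, tau, C


if __name__ == "__main__":
    # Hypothetical constants, for illustration only.
    S, tau, C = linear_rate_sequence(C3=2.0, C4=3.0, S0=1.0, S1=0.8, n=20)
    for k in (0, 5, 10, 19):
        print(k, S[k], C * tau ** k)
```

The choice $C = \max(S_0, S_1/\tau)$ covers the even and odd subsequences separately, which is why the rate is the square root of the two-step factor.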

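When $\theta \in (\frac{1}{2}, 1)$, i.e., $\frac{1-\theta}{\theta} < 1$, we can show

\[ S_N^{\nu} + S_{N-1}^{\nu} - S_{K+1}^{\nu} - S_K^{\nu} \ge \mu (N - K) \tag{A.13} \]

for $\nu = \frac{1-2\theta}{1-\theta} < 0$, some constant $\mu > 0$ and any $N > K$ with sufficiently large $K$ by the same argument as in the proof of Theorem 2 of [2]. Note $S_N \le S_{N-1}$ and $\nu < 0$, so $S_N^{\nu} \ge S_{N-1}^{\nu}$. Hence, (A.13) implies

\[ S_N \le \left( \frac{1}{2} \left( S_{K+1}^{\nu} + S_K^{\nu} + \mu (N - K) \right) \right)^{\frac{1}{\nu}} \le C N^{-\frac{1-\theta}{2\theta-1}}, \]

for sufficiently large $C$ and $N$. This completes the proof.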
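The sublinear rate can be observed numerically. The sketch below, written under stated assumptions, builds the largest sequence permitted by a recursion of the form $S_k^{\theta/(1-\theta)} \le C_5 (S_{k-2} - S_k)$ by solving it with equality through bisection, then scales $S_N$ by $N^{\frac{1-\theta}{2\theta-1}}$. The function name and the values of `theta`, `C5`, `S0`, `S1` are hypothetical and serve only as an illustration.

```python
def worst_case_sequence(theta, C5, S0, S1, n):
    """Largest sequence with S_k**p = C5 * (S_{k-2} - S_k), p = theta / (1 - theta).

    g(s) = s**p + C5 * s - C5 * S_{k-2} is increasing in s, negative at 0 and
    positive at S_{k-2}, so its unique root is found by bisection.
    """
    p = theta / (1.0 - theta)
    S = [S0, S1]
    for k in range(2, n):
        prev = S[k - 2]
        lo, hi = 0.0, prev
        for _ in range(100):
            mid = 0.5 * (lo + hi)
            if mid ** p + C5 * mid - C5 * prev < 0.0:
                lo = mid
            else:
                hi = mid
        S.append(0.5 * (lo + hi))
    return S


if __name__ == "__main__":
    theta = 0.75                                   # hypothetical exponent in (1/2, 1)
    rate = (1.0 - theta) / (2.0 * theta - 1.0)     # predicted decay exponent
    S = worst_case_sequence(theta, C5=1.0, S0=0.5, S1=0.5, n=2000)
    for N in (10, 100, 1000, 1999):
        # The scaled values should stay bounded if S_N = O(N**(-rate)).
        print(N, S[N], S[N] * N ** rate)
```

Solving with equality is a deliberate choice: any sequence satisfying the inequality is dominated step by step by this one, so a bounded scaled value here is consistent with the claimed rate.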

For completeness, I give the proof of (A.13) below by repeating the arguments in Theorem 2 of [2].

Proof of (A.13)

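Note that $S_k \to 0$ and $\frac{1-\theta}{\theta} < 1$. We have from (A.12) that

\[ S_k^{\frac{\theta}{1-\theta}} \le C_5 (S_{k-2} - S_k), \tag{A.14} \]

for $C_5 = \max(C_3, C_4) + 1$ and all $k \ge K$ with sufficiently large $K$ such that $S_{K-2} < 1$. Define

\[ h(s) = s^{-\frac{\theta}{1-\theta}} \]

and let $c \in (1, +\infty)$ be a fixed constant (distinct from the $c$ in $\phi$). Take $k \ge K$; we now show

\[ S_k^{\nu} - S_{k-2}^{\nu} \ge \mu > 0, \tag{A.15} \]

for some $\mu$ and all $k \ge K$.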

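Case 1: $h(S_k) \le c\,h(S_{k-2})$. Rewriting (A.14) as $1 \le C_5 (S_{k-2} - S_k) S_k^{-\frac{\theta}{1-\theta}}$, we have

\[ 1 \le C_5 (S_{k-2} - S_k) h(S_k) \le c C_5 (S_{k-2} - S_k) h(S_{k-2}) \le c C_5 \int_{S_k}^{S_{k-2}} h(s)\,ds = c C_5 \frac{1-\theta}{1-2\theta} \left( S_{k-2}^{\nu} - S_k^{\nu} \right), \]

where the third inequality holds because $h$ is decreasing. Since $\frac{1-\theta}{1-2\theta} = \frac{1}{\nu} < 0$, letting $\hat{\mu} = \frac{-\nu}{c C_5}$, we get $0 < \hat{\mu} \le S_k^{\nu} - S_{k-2}^{\nu}$.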

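Case 2: $h(S_k) > c\,h(S_{k-2})$. Set $q = c^{-\frac{1-\theta}{\theta}} \in (0, 1)$. Then $S_k \le q S_{k-2}$ and $S_k^{\nu} \ge q^{\nu} S_{k-2}^{\nu}$, or equivalently $S_k^{\nu} - S_{k-2}^{\nu} \ge (q^{\nu} - 1) S_{k-2}^{\nu}$, by noting $\nu < 0$. Since $S_{k-2} < 1$ for all $k > K$ and $q^{\nu} > 1$, there exists $\bar{\mu} > 0$ such that $(q^{\nu} - 1) S_{k-2}^{\nu} > \bar{\mu}$ for all $k > K$. Hence, $S_k^{\nu} - S_{k-2}^{\nu} > \bar{\mu}$.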

Taking $\mu = \min(\hat{\mu}, \bar{\mu})$, we get (A.15). Summing (A.15) from $K$ to some integer $N > K$ gives (A.13).
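The integral step in Case 1 relies on the antiderivative of $s^{-\theta/(1-\theta)}$ being $s^{\nu}/\nu$ with $\nu = \frac{1-2\theta}{1-\theta}$, and on the integrand being decreasing. The following Python sketch checks both facts numerically with a midpoint rule; the function names and the sample values of `theta`, `a`, `b` are hypothetical and used only for illustration.

```python
def h(s, theta):
    """h(s) = s**(-theta / (1 - theta)), decreasing on (0, infinity)."""
    return s ** (-theta / (1.0 - theta))


def integral_closed_form(a, b, theta):
    """Integral of h over [a, b] via the antiderivative s**nu / nu."""
    nu = (1.0 - 2.0 * theta) / (1.0 - theta)
    return (b ** nu - a ** nu) / nu


def integral_midpoint(a, b, theta, n=10000):
    """Midpoint-rule approximation of the same integral."""
    width = (b - a) / n
    return width * sum(h(a + (i + 0.5) * width, theta) for i in range(n))


if __name__ == "__main__":
    theta, a, b = 0.75, 0.2, 0.4   # a plays the role of S_k, b of S_{k-2}
    exact = integral_closed_form(a, b, theta)
    approx = integral_midpoint(a, b, theta)
    print(exact, approx)
    # Lower bound used in Case 1: (b - a) * h(b) <= integral of h over [a, b].
    print((b - a) * h(b, theta) <= exact)
```

Because $\nu < 0$ for $\theta \in (\frac{1}{2}, 1)$, the closed form is a negative factor times a negative difference, so the integral is positive as expected.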

Bibliography

[1] S. Amari, A. Cichocki, H.H. Yang, et al. A new learning algorithm for blind signal separation. Advances in neural information processing systems, pages 757–763, 1996.

[2] H. Attouch and J. Bolte. On the convergence of the proximal algorithm for nonsmooth functions involving analytic features. Mathematical Programming, 116(1):5–16, 2009.

[3] H. Attouch, J. Bolte, P. Redont, and A. Soubeyran. Proximal alternating minimization and projection methods for nonconvex problems: An approach based on the Kurdyka-Lojasiewicz inequality. Mathematics of Operations Research, 35(2):438–457, 2010.

[4] A. Auslender. Optimisation: méthodes numériques. Masson, 1976.

[5] A. Auslender. Asymptotic properties of the Fenchel dual functional and applications to decomposition problems. Journal of optimization theory and applications, 73(3):427–449, 1992.

[6] B. W. Bader, T. G. Kolda, et al. Matlab tensor toolbox version 2.5, January 2012.

[7] B.W. Bader and T.G. Kolda. Efficient matlab computations with sparse and factored tensors. SIAM Journal on Scientific Computing, 30(1):205–231, 2009.

[8] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.

[9] E. Benetos and C. Kotropoulos. Non-negative tensor factorization applied to music genre classification. Audio, Speech, and Language Processing, IEEE Transactions on, 18(8):1955–1967, 2010.

[10] M.W. Berry, M. Browne, A.N. Langville, V.P. Pauca, and R.J. Plemmons. Algorithms and applications for approximate nonnegative matrix factorization. Computational Statistics & Data Analysis, 52(1):155–173, 2007.

[11] D.P. Bertsekas and P. Tseng. Partial proximal minimization algorithms for convex programming. SIAM Journal on Optimization, 4(3):551–572, 1994.

[12] C.M. Bishop et al. Pattern recognition and machine learning, volume 4. Springer New York, 2006.

[13] J. Bobin, Y. Moudden, J.L. Starck, J. Fadili, and N. Aghanim. SZ and CMB reconstruction using generalized morphological component analysis. Statistical Methodology, 5(4):307–317, 2008.

[14] J. Bochnak, M. Coste, and M.F. Roy. Real algebraic geometry, volume 36. Springer Verlag, 1998.

[15] P. Bofill and M. Zibulevsky. Underdetermined blind source separation using sparse representations. Signal processing, 81(11):2353–2362, 2001.

[16] J. Bolte, A. Daniilidis, and A. Lewis. The Lojasiewicz inequality for nonsmooth subanalytic functions with applications to subgradient dynamical systems. SIAM Journal on Optimization, 17(4):1205–1223, 2007.

[17] J. Bolte, A. Daniilidis, A. Lewis, and M. Shiota. Clarke subgradients of stratifiable functions. SIAM Journal on Optimization, 18(2):556–572, 2007.

[18] J.D. Carroll and J.J. Chang. Analysis of individual differences in multidimensional scaling via an n-way generalization of Eckart-Young decomposition. Psychometrika, 35(3):283–319, 1970.

[19] S.S. Chen, D.L. Donoho, and M.A. Saunders. Atomic decomposition by basis pursuit. SIAM review, pages 129–159, 2001.

[20] Y. Chen, M. Rege, M. Dong, and J. Hua. Non-negative matrix factorization for semi-supervised data clustering. Knowledge and Information Systems, 17(3):355–379, 2008.

[21] E.C. Chi and T.G. Kolda. On tensors, sparsity, and nonnegative factorizations. Arxiv preprint arXiv:1112.2414, 2011.

[22] S. Choi, A. Cichocki, H.M. Park, and S.Y. Lee. Blind source separation and independent component analysis: A review. Neural Information Processing-Letters and Reviews, 6(1):1–57, 2005.

[23] A. Cichocki and A.H. Phan. Fast local algorithms for large scale nonnegative matrix and tensor factorizations. IEICE transactions on fundamentals of electronics, communications and computer science, 92(3):708–721, 2009.

[24] D. Donoho and V. Stodden. When does non-negative matrix factorization give a correct decomposition into parts. Advances in neural information processing systems, 16, 2003.

[25] Jonathan Eckstein and Dimitri P. Bertsekas. On the Douglas-Rachford splitting method and the proximal point algorithm for maximal monotone operators. Math. Programming, 55(3, Ser. A):293–318, 1992.

[26] M.P. Friedlander and K. Hatz. Computing non-negative tensor factorizations. Optimisation Methods and Software, 23(4):631–647, 2008.

[27] J. Friedman, T. Hastie, and R. Tibshirani. Additive logistic regression: a statistical view of boosting. The annals of statistics, 28(2):337–407, 2000.

[28] D. Gabay and B. Mercier. A dual algorithm for the solution of nonlinear variational problems via finite-element approximations. Computers and Mathematics with Applications, 2:17–40, 1976.

[29] R. Glowinski and A. Marrocco. Sur l'approximation par éléments finis d'ordre un, et la résolution par pénalisation-dualité d'une classe de problèmes de Dirichlet nonlinéaires. Rev. Française d'Aut. Inf. Rech. Opér., 2:41–76, 1975.

[30] L. Grippo and M. Sciandrone. On the convergence of the block nonlinear Gauss-Seidel method under convex constraints. Oper. Res. Lett., 26(3):127–136, 2000.

[31] S.P. Han. A successive projection method. Mathematical Programming, 40(1):1–14, 1988.

[32] C. Hildreth. A quadratic programming procedure. Naval Research Logistics Quarterly, 4(1):79–85, 1957.

[33] N.D. Ho, P. Van Dooren, and V.D. Blondel. Descent methods for nonnegative matrix factorization. Numerical Linear Algebra in Signals, Systems and Control, pages 251–293, 2011.

[34] P.O. Hoyer. Non-negative matrix factorization with sparseness constraints. The Journal of Machine Learning Research, 5:1457–1469, 2004.

[35] T.P. Jung, S. Makeig, C. Humphries, T.W. Lee, M.J. Mckeown, V. Iragui, and T.J. Sejnowski. Removing electroencephalographic artifacts by blind source separation. Psychophysiology, 37(02):163–178, 2000.

[36] C. Jutten and J. Herault. Blind separation of sources, part i: An adaptive algorithm based on neuromimetic architecture. Signal processing, 24(1):1–10, 1991.

[37] J. Karhunen, A. Hyvärinen, R. Vigário, J. Hurri, and E. Oja. Applications of neural blind separation to signal and image processing. In Acoustics, Speech, and Signal Processing, 1997. ICASSP-97., 1997 IEEE International Conference on, volume 1, pages 131–134. IEEE, 1997.

[38] H.A.L. Kiers. Towards a standardized notation and terminology in multiway analysis. Journal of Chemometrics, 14(3):105–122, 2000.

[39] H. Kim and H. Park. Non-negative matrix factorization based on alternating non-negativity constrained least squares and active set method. SIAM J. Matrix Anal. Appl, 30(2):713–730, 2008.

[40] H. Kim, H. Park, and L. Eldén. Non-negative tensor factorization based on alternating large-scale non-negativity-constrained least squares. In Bioinformatics and Bioengineering, 2007. BIBE 2007. Proceedings of the 7th IEEE International Conference on, pages 1147–1151. IEEE, 2007.

[41] J. Kim and H. Park. Toward faster nonnegative matrix factorization: A new algorithm and comparisons. In Data Mining, 2008. ICDM’08. Eighth IEEE International Conference on, pages 353–362. IEEE, 2008.

[42] J. Kim and H. Park. Fast nonnegative tensor factorization with an active-set-like method. High-Performance Scientific Computing, pages 311–326, 2012.

[43] Y.D. Kim and S. Choi. Nonnegative Tucker decomposition. In Computer Vision and Pattern Recognition, 2007. CVPR’07. IEEE Conference on, pages 1–8. IEEE, 2007.

[44] T.G. Kolda and B.W. Bader. Tensor decompositions and applications. SIAM review, 51(3):455, 2009.

[45] K. Kurdyka. On gradients of functions definable in o-minimal structures. In Annales de l’institut Fourier, volume 48, pages 769–784. Chartres: L’Institut, 1950-, 1998.

[46] M. Lai and Y. Wang. An unconstrained $\ell_q$ minimization with $0 < q < 1$ for sparse solution of under-determined linear systems. SIAM J. Optimization, 21:82–101, 2011.

[47] D.D. Lee and H.S. Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401(6755):788–791, 1999.

[48] D.D. Lee and H.S. Seung. Algorithms for Non-Negative Matrix Factorization. Advances in Neural Information Processing Systems, 13:556–562, 2001.

[49] H. Lee, A. Battle, R. Raina, and A.Y. Ng. Efficient sparse coding algorithms. Advances in neural information processing systems, 19:801–808, 2007.

[50] S.Z. Li, X.W. Hou, H.J. Zhang, and Q.S. Cheng. Learning spatially localized, parts-based representation. In Computer Vision and Pattern Recognition, 2001. CVPR 2001. Proceedings of the 2001 IEEE Computer Society Conference on, volume 1, pages 207–212. IEEE, 2001.

[51] C.J. Lin. Projected gradient methods for nonnegative matrix factorization. Neural Computation, 19(10):2756–2779, 2007.

[52] J.K. Lin, D.G. Grier, and J.D. Cowan. Feature extraction approach to blind source separation. In Neural Networks for Signal Processing [1997] VII. Proceedings of the 1997 IEEE Workshop, pages 398–405. IEEE, 1997.

[53] J. Liu, J. Liu, P. Wonka, and J. Ye. Sparse non-negative tensor factorization using columnwise coordinate descent. Pattern Recognition, 2011.

[54] J. Liu, P. Musialski, P. Wonka, and J. Ye. Tensor completion for estimating missing values in visual data. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2012.

[55] S. Lojasiewicz. Sur la géométrie semi-et sous-analytique. Ann. Inst. Fourier (Grenoble), 43(5):1575–1595, 1993.

[56] D.G. Luenberger. Introduction to linear and nonlinear programming. 1973.

[57] Z.Q. Luo and P. Tseng. On the convergence of the coordinate descent method for convex differentiable minimization. J. Optim. Theory Appl., 72(1):7–35, 1992.

[58] Z.Q. Luo and P. Tseng. Error bounds and convergence analysis of feasible descent methods: A general approach. Annals of Operations Research, 46(1):157–178, 1993.

[59] S. Ma, D. Goldfarb, and L. Chen. Fixed point and Bregman iterative methods for matrix rank minimization. Mathematical Programming, pages 1–33, 2009.

[60] J. Mairal, F. Bach, J. Ponce, and G. Sapiro. Online dictionary learning for sparse coding. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 689–696. ACM, 2009.

[61] M. Mørup, L.K. Hansen, and S.M. Arnfred. Algorithms for sparse nonnegative Tucker decompositions. Neural computation, 20(8):2112–2131, 2008.

[62] C. Navasca, L. De Lathauwer, and S. Kindermann. Swamp reducing technique for tensor decomposition. In Proc. of the 16th European Signal Processing Conference (EUSIPCO 2008), 2008.

[63] Y. Nesterov. Smooth minimization of non-smooth functions. Mathematical Programming, 103(1):127–152, 2005.

[64] Y. Nesterov. Gradient methods for minimizing composite objective function. CORE Discussion Papers, 2007.

[65] Y. Nesterov. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization, 22(2):341–362, 2012.

[66] J.M. Ortega and W.C. Rheinboldt. Iterative solution of nonlinear equations in several variables. Academic Press, 1970.

[67] P. Paatero and U. Tapper. Positive matrix factorization: A non-negative factor model with optimal utilization of error estimates of data values. Environmetrics, 5(2):111–126, 1994.

[68] V.P. Pauca, J. Piper, and R.J. Plemmons. Nonnegative matrix factorization for spectral data analysis. Linear Algebra and its Applications, 416(1):29–47, 2006.

[69] V.P. Pauca, F. Shahnaz, M.W. Berry, and R.J. Plemmons. Text mining using nonnegative matrix factorizations. In Proc. SIAM Inter. Conf. on Data Mining, Orlando, FL, 2004.

[70] E. Polak, R.W.H. Sargent, and D.J. Sebastian. On the convergence of sequential minimization algorithms. Journal of Optimization Theory and Applications, 12(6):567–575, 1973.

[71] M.J.D. Powell. On search directions for minimization algorithms. Mathematical Programming, 4(1):193–201, 1973.

[72] Z. Qin, K. Scheinberg, and D. Goldfarb. Efficient block-coordinate descent algorithms for the group lasso. Preprint, 2010.

[73] B. Recht, M. Fazel, and P.A. Parrilo. Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM review, 52(3):471–501, 2010.

[74] B. Recht and C. Ré. Parallel stochastic gradient algorithms for large-scale matrix completion. Optimization Online, 2011.

[75] P. Richtárik and M. Takáč. Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function. arXiv preprint arXiv:1107.2848, 2011.

[76] R.T. Rockafellar. Monotone operators and the proximal point algorithm. SIAM Journal on Control and Optimization, 14:877–898, 1976.

[77] R.T. Rockafellar and R.J.B. Wets. Variational analysis, volume 317. Springer Verlag, 1998.

[78] C. Serviere and P. Fabry. Principal component analysis and blind source separation of modulated sources for electro-mechanical systems diagnostic. Mechanical systems and signal processing, 19(6):1293–1311, 2005.

[79] A. Shashua and T. Hazan. Non-negative tensor factorization with applications to statistics and computer vision. In Proceedings of the 22nd international conference on Machine learning, pages 792–799. ACM, 2005.

[80] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), pages 267–288, 1996.

[81] P. Tseng. Dual coordinate ascent methods for non-strictly convex minimization. Mathematical Programming, 59(1):231–247, 1993.

[82] P. Tseng. Convergence of a block coordinate descent method for nondifferentiable minimization. Journal of Optimization Theory and Applications, 109(3):475–494, 2001.

[83] P. Tseng. Convergence of a block coordinate descent method for nondifferentiable minimization. J. Optim. Theory Appl., 109(3):475–494, 2001.

[84] Paul Tseng and Sangwoon Yun. A coordinate gradient descent method for nonsmooth separable minimization. Math. Program., 117(1-2, Ser. B):387–423, 2009.

[85] L.R. Tucker. Some mathematical notes on three-mode factor analysis. Psychometrika, 31(3):279–311, 1966.

[86] J. Warga. Minimizing certain convex functions. Journal of the Society for Industrial and Applied Mathematics, 11(3):588–593, 1963.

[87] M. Welling and M. Weber. Positive tensor factorization. Pattern Recognition Letters, 22(12):1255–1261, 2001.

[88] Z. Wen, W. Yin, and Y. Zhang. Solving a low-rank factorization model for matrix completion by a nonlinear successive over-relaxation algorithm. Mathematical Programming Computation, pages 1–29, 2012.

[89] W. Xu, X. Liu, and Y. Gong. Document clustering based on non-negative matrix factorization. In Proceedings of the 26th annual international ACM SIGIR conference on Research and development in informaion retrieval, pages 267–273. ACM, 2003.

[90] Y. Xu, W. Yin, Z. Wen, and Y. Zhang. An alternating direction algorithm for matrix completion with nonnegative factors. Journal of Frontiers of Mathematics in China, Special Issue on Computational Mathematics, 7(2):365–384, 2011.

[91] M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(1):49–67, 2006.

[92] N. Zadeh. A note on the cyclic coordinate ascent method. Management Science, 16(9):642–644, 1970.

[93] S. Zafeiriou. Algorithms for nonnegative tensor factorization. Tensors in Image Processing and Computer Vision, pages 105–124, 2009.

[94] Q. Zhang, H. Wang, R.J. Plemmons, and V. Pauca. Tensor methods for hyperspectral data analysis: a space object material identification study. JOSA A, 25(12):3001–3012, 2008.

[95] Y. Zhang. An alternating direction algorithm for nonnegative matrix factorization. Rice Technical Report, 2010.

[96] M. Zibulevsky and B.A. Pearlmutter. Blind source separation by sparse
