There is a close resemblance between SOFT-IMPUTEand Nesterov’s gradient method (Nesterov, 2007, Section 3). However, as mentioned earlier the original motivation of our algorithm is very different.
The techniques used in this proof are adapted from Nesterov (2007). Proof Plugging Zλk=Z in (11), we have˜
Qλ(Z|Zλk) = fλ(Zk
λ) +12kPΩ⊥(Zλk−Z)k2F (55)
≥ fλ(Zkλ).
Let Zλk(θ)denote a convex combination of the optimal solution (Zλ∞) and the kthiterate (Zλk):
Using the convexity of fλ(·)we get:
fλ(Zk
λ(θ))≤(1−θ)fλ(Zkλ) +θfλ(Zλ∞). (57)
Expanding Zλk(θ)using (56), and simplifying PΩ⊥(Zk
λ−Zλk(θ))we have:
kPΩ⊥(Zk
λ−Zλk(θ))k2F = θ2kPΩ⊥(Zλk−Zλ∞)k2F
≤ θ2kZλk−Zλ∞k2F (58)
≤ θ2kZλ0−Zλ∞k2F. (59) Line 59 follows from (58) by observing thatkZλm−Zλ∞k2
F ≤ kZλm−1−Zλ∞k2F,∀m—a consequence of
the inequalities (19) and (18), established in Theorem 1.
Using (55), the value of fλ(Z)at the(k+1)thiterate satisfies the following chain of inequalities: fλ(Zkλ+1) ≤ min Z fλ(Z) +1 2kP ⊥ Ω(Zkλ−Z)k2F ≤ min θ∈[0,1] fλ(Zλk(θ)) +1 2kP ⊥ Ω(Zλk−Zkλ(θ))k2F (60) ≤ min θ∈[0,1] fλ(Zλk) +θ(fλ(Zλ∞)−fλ(Zλk)) +1 2θ 2kZ0 λ−Z∞λk2F . (61) Line 61 follows from Line 60, by using (57) and (59).
The r.h.s. expression in (61), is minimized atbθ(k+1)given by
b θ(k+1) = min{1,θk} ∈[0,1], where, θk = fλ(Zk λ)−fλ(Zλ∞) kZλ0−Z∞λk2 F . IfkZλ0−Zλ∞k2 F =0,then we takeθk=∞.
Note thatθk is a decreasing sequence. This implies that ifθk0 ≤1 thenθm≤1 for all m≥k0. Suppose,θ0>1.Thenbθ(1) =1. Hence using (61) we have:
fλ(Zλ1)−fλ(Zλ∞)≤1 2kZ 0 λ−Z∞λk2F =⇒ θ1≤ 1 2. Thus we get back to the former case.
Henceθk≤1 for all k≥1.
In addition, observe the previous deductions show that, ifθ0>1 then (20) holds true for k=1.
Combining the above observations, plugging in the value ofbθin (61) and simplifying, we get:
fλ(Zk+1 λ )−fλ(Zλk)≤ − (fλ(Zk λ)−fλ(Zλ∞))2 2kZλ0−Zλ∞k2 F . (62)
For the sake of notational convenience, we define the sequenceαk = fλ(Zkλ)−fλ(Zλ∞).It is easily
Using this notation in (62) we get: αk ≥ α2 k 2kZλ0−Zλ∞k2 F +αk+1 (Sinceαk↓) ≥ αkαk+1 2kZλ0−Zλ∞k2 F +αk+1. (63)
Dividing both sides of the inequality in (63), byαkαk+1we have:
α−1 k+1≥ 1 2kZλ0−Zλ∞k2 F +α−k1. (64)
Summing both sides of (64) over 1≤k≤(k−1)we get: α−1 k ≥ k−1 2kZλ0−Zλ∞k2 F +α−11. (65)
Sinceθ1≤1,we observeα1/(2kZλ0−Zλ∞k2F)≤1/2—using this in (65) and rearranging terms we
get, the desired inequality (20)—completing the proof of the Theorem.
References
J. Abernethy, F. Bach, T. Evgeniou, and J.-P. Vert. A new approach to collaborative filtering: opera- tor estimation with spectral regularization. Journal of Machine Learning Research, 10:803–826, 2009.
A. Argyriou, T. Evgeniou, and M. Pontil. Multi-task feature learning. In Advances in Neural
Information Processing Systems 19. MIT Press, 2007.
A. Argyriou, T. Evgeniou, and M. Pontil. Convex multi-task feature learning. Machine Learning, 73(3):243–272, 2008.
F. Bach. Consistency of trace norm minimization. Journal of Machine Learning Research, 9: 1019–1048, 2008.
R. M. Bell and Y. Koren. Lessons from the Netflix prize challenge. Technical report, AT&T Bell Laboratories, 2007.
S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
S. Burer and R. D.C. Monteiro. Local minima and convergence in low-rank semidefinite program- ming. Mathematical Programming, 103(3):427–631, 2005.
J. Cai, E. J. Candes, and Z. Shen. A singular value thresholding algorithm for matrix completion, 2008. Available athttp://www.citebase.org/abstract?id=oai:arXiv.org:0810.3286. E. Cand`es and B. Recht. Exact matrix completion via convex optimization. Foundations of Com-
E. J. Cand`es and T. Tao. The power of convex relaxation: near-optimal matrix completion. IEEE
Transactions on Information Theory, 56(5):2053–2080, 2009.
D. DeCoste. Collaborative prediction using ensembles of maximum margin matrix factorizations. In
Proceedings of the 23rd International Conference on Machine Learning, pages 249–256. ACM,
2006.
A. Dempster, N.. Laird, and D. Rubin. Maximum likelihood from incomplete data via the EM algorithm (with discussion). Journal of the Royal Statistical Society Series B, 39:1–38, 1977. D. Donoho, I. Johnstone, G. Kerkyachairan, and D. Picard. Wavelet shrinkage; asymptopia? (with
discussion). Journal of the Royal Statistical Society: Series B, 57:201–337, 1995.
M. Fazel. Matrix Rank Minimization with Applications. PhD thesis, Stanford University, 2002. J. Friedman. Fast sparse regression and classification. Technical report, Department of Statistics,
Stanford University, 2008.
J. Friedman, T. Hastie, H. Hoefling, and R. Tibshirani. Pathwise coordinate optimization. Annals of
Applied Statistics, 2(1):302–332, 2007.
M. Grant and S. Boyd. CVX: Matlab software for disciplined convex programming, 2009. Web page and software available athttp://stanford.edu/∼boyd/cvx.
T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Prediction, Infer-
ence and Data Mining (Second Edition). Springer Verlag, New York, 2009.
S. Ji and J. Ye. An accelerated gradient method for trace norm minimization. In Proceedings of the
26th International Conference on Machine Learning, pages 457–464, 2009.
R. H. Keshavan, S. Oh, and A. Montanari. Matrix completion from a few entries. IEEE Transactions
on Information Theory, 56(6):2980–2998, 2009.
R. M. Larsen. Lanczos bidiagonalization with partial reorthogonalization. Technical Report DAIMI PB-357, Department of Computer Science, Aarhus University, 1998.
R.M. Larsen. Propack-software for large and sparse svd calculations, 2004. Available at
http://sun.stanford.edu/∼rmunk/PROPACK.
J. Liu, S. Ji, and J. Ye. SLEP: Sparse Learning with Efficient Projections. Arizona State University, 2009. Available athttp://www.public.asu.edu/∼jye02/Software/SLEP.
Z. Liu and L. Vandenberghe. Interior-point method for nuclear norm approximation with application to system identfication. SIAM Journal on Matrix Analysis and Applications, 31(3):1235–1256, 2009.
S. Ma, D. Goldfarb, and L. Chen. Fixed point and Bregman iterative methods for matrix rank minimization. Mathematical Programming Series A,forthcoming.
R. Mazumder, J. Friedman, and T. Hastie. Sparsenet: coordinate descent with non-convex penalties. Technical report, Stanford University, 2009.
Y. Nesterov. Introductory Lectures on Convex Optimization: Basic course. Kluwer, Boston, 2003. Y. Nesterov. Gradient methods for minimizing composite objective function. Technical Report 76,
Center for Operations Research and Econometrics (CORE), Catholic University of Louvain, 2007.
B. Recht, M. Fazel, and P. A. Parrilo. Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization, 2007. Available at
http://www.citebase.org/abstract?id=oai:arXiv.org:0706.4138.
J. Rennie and N. Srebro. Fast maximum margin matrix factorization for collaborative prediction. In
Proceedings of the 22nd International Conference on Machine Learning, pages 713–719. ACM,
2005.
R. Salakhutdinov, A. Mnih, and G. E. Hinton. Restricted Boltzmann machines for collaborative filtering. In Proceedings of the 24th International Conference on Machine Learning, pages 791– 798. AAAI Press, 2007.
ACM SIGKDD and Netflix. Soft modelling by latent variables: the nonlinear iterative partial least squares (NIPALS) approach. In Proceedings of KDD Cup and Workshop, 2007. Available at
http://www.cs.uic.edu/∼liub/KDD-cup-2007/proceedings.html.
N. Srebro and T. Jaakkola. Weighted low-rank approximations. In Proceedings of the 20th Interna-
tional Conference on Machine Learning, pages 720–727. AAAI Press, 2003.
N. Srebro, N. Alon, and T. Jaakkola. Generalization error bounds for collaborative prediction with low-rank matrices. In Advances in Neural Information Processing Systems 17, pages 5–27. MIT Press, 2005a.
N. Srebro, J. Rennie, and T. Jaakkola. Maximum-margin matrix factorization. In Advances in
Neural Information Processing Systems 17, pages 1329–1336. MIT Press, 2005b.
G. Takacs, I. Pilaszy, B. Nemeth, and D. Tikk. Scalable collaborative filtering approaches for large recommender systems. Journal of Machine Learning Research, 10:623–656, 2009.
R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical
Society, Series B, 58:267–288, 1996.
O. Troyanskaya, M. Cantor, G. Sherlock, P. Brown, T. Hastie, R. Tibshirani, D. Botstein, and R. B. Altman. Missing value estimation methods for DNA microarrays. Bioinformatics, 17(6):520– 525, 2001.
C. H. Zhang. Nearly unbiased variable selection under minimax concave penalty. Annals of Statis-