Proof of Theorem - Spectral Regularization Algorithms for Learning Large Incomplete Matrices

There is a close resemblance between SOFT-IMPUTEand Nesterov’s gradient method (Nesterov, 2007, Section 3). However, as mentioned earlier the original motivation of our algorithm is very different.

The techniques used in this proof are adapted from Nesterov (2007). Proof Plugging Z_λk=Z in (11), we have˜

Q_λ(Z|Z_λk) = f_λ(Zk

λ) +1₂kPΩ⊥(Zλk−Z)k2F (55)

≥ f_λ(Zk_λ).

Let Z_λk(θ)denote a convex combination of the optimal solution (Z_λ∞) and the kthiterate (Z_λk):

Using the convexity of f_λ(·)we get:

f_λ(Zk

λ(θ))≤(1−θ)fλ(Zkλ) +θfλ(Zλ∞). (57)

Expanding Z_λk(θ)using (56), and simplifying P_Ω⊥(Zk

λ−Zλk(θ))we have:

kP_Ω⊥(Zk

λ−Zλk(θ))k2F = θ2kPΩ⊥(Zλk−Zλ∞)k2F

≤ θ2kZ_λk−Z_λ∞k2_F (58)

≤ θ2kZ_λ0−Z_λ∞k2_F. (59) Line 59 follows from (58) by observing thatkZ_λm−Z_λ∞k2

F ≤ kZ_λm−1−Zλ∞k2F,∀m—a consequence of

the inequalities (19) and (18), established in Theorem 1.

Using (55), the value of f_λ(Z)at the(k+1)th_{iterate satisfies the following chain of inequalities:} f_λ(Zk_λ+1) ≤ min Z f_λ(Z) +1 2kP ⊥ Ω(Zkλ−Z)k2F ≤ min θ∈[0,1] f_λ(Z_λk(θ)) +1 2kP ⊥ Ω(Zλk−Zkλ(θ))k2F (60) ≤ min θ∈[0,1] f_λ(Z_λk) +θ(f_λ(Z_λ∞)−f_λ(Z_λk)) +1 2θ 2_k_Z0 λ−Z∞λk2F . (61) Line 61 follows from Line 60, by using (57) and (59).

The r.h.s. expression in (61), is minimized atbθ(k+1)given by

b θ(k+1) = min{1,θk} ∈[0,1], where, θk = f_λ(Zk λ)−fλ(Zλ∞) kZ_λ0−Z∞_λk2 F . IfkZ_λ0−Z_λ∞k2 F =0,then we takeθk=∞.

Note thatθk is a decreasing sequence. This implies that ifθk0 ≤1 thenθm≤1 for all m≥k0. Suppose,θ0>1.Thenbθ(1) =1. Hence using (61) we have:

f_λ(Z_λ1)−f_λ(Z_λ∞)≤1 2kZ 0 λ−Z∞λk2F =⇒ θ1≤ 1 2. Thus we get back to the former case.

Henceθk≤1 for all k≥1.

In addition, observe the previous deductions show that, ifθ0>1 then (20) holds true for k=1.

Combining the above observations, plugging in the value ofbθin (61) and simplifying, we get:

f_λ(Zk+1 λ )−fλ(Zλk)≤ − (f_λ(Zk λ)−fλ(Zλ∞))2 2kZ_λ0−Z_λ∞k2 F . (62)

For the sake of notational convenience, we define the sequenceαk = fλ(Zk_λ)−fλ(Z_λ∞).It is easily

Using this notation in (62) we get: αk ≥ α2 k 2kZ_λ0−Z_λ∞k2 F +αk+1 (Sinceαk↓) ≥ αkαk+1 2kZ_λ0−Z_λ∞k2 F +αk+1. (63)

Dividing both sides of the inequality in (63), byαkαk+1we have:

α−1 k+1≥ 1 2kZ_λ0−Z_λ∞k2 F +α−_k1. (64)

Summing both sides of (64) over 1≤k≤(k−1)we get: α−1 k ≥ k−1 2kZ_λ0−Z_λ∞k2 F +α−₁1. (65)

Sinceθ1≤1,we observeα1/(2kZ_λ0−Z_λ∞k2F)≤1/2—using this in (65) and rearranging terms we

get, the desired inequality (20)—completing the proof of the Theorem.

References

J. Abernethy, F. Bach, T. Evgeniou, and J.-P. Vert. A new approach to collaborative filtering: opera- tor estimation with spectral regularization. Journal of Machine Learning Research, 10:803–826, 2009.

A. Argyriou, T. Evgeniou, and M. Pontil. Multi-task feature learning. In Advances in Neural

Information Processing Systems 19. MIT Press, 2007.

A. Argyriou, T. Evgeniou, and M. Pontil. Convex multi-task feature learning. Machine Learning, 73(3):243–272, 2008.

F. Bach. Consistency of trace norm minimization. Journal of Machine Learning Research, 9: 1019–1048, 2008.

R. M. Bell and Y. Koren. Lessons from the Netflix prize challenge. Technical report, AT&T Bell Laboratories, 2007.

S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

S. Burer and R. D.C. Monteiro. Local minima and convergence in low-rank semidefinite program- ming. Mathematical Programming, 103(3):427–631, 2005.

J. Cai, E. J. Candes, and Z. Shen. A singular value thresholding algorithm for matrix completion, 2008. Available athttp://www.citebase.org/abstract?id=oai:arXiv.org:0810.3286. E. Cand`es and B. Recht. Exact matrix completion via convex optimization. Foundations of Com-

E. J. Cand`es and T. Tao. The power of convex relaxation: near-optimal matrix completion. IEEE

Transactions on Information Theory, 56(5):2053–2080, 2009.

D. DeCoste. Collaborative prediction using ensembles of maximum margin matrix factorizations. In

Proceedings of the 23rd International Conference on Machine Learning, pages 249–256. ACM,

2006.

A. Dempster, N.. Laird, and D. Rubin. Maximum likelihood from incomplete data via the EM algorithm (with discussion). Journal of the Royal Statistical Society Series B, 39:1–38, 1977. D. Donoho, I. Johnstone, G. Kerkyachairan, and D. Picard. Wavelet shrinkage; asymptopia? (with

discussion). Journal of the Royal Statistical Society: Series B, 57:201–337, 1995.

M. Fazel. Matrix Rank Minimization with Applications. PhD thesis, Stanford University, 2002. J. Friedman. Fast sparse regression and classification. Technical report, Department of Statistics,

Stanford University, 2008.

J. Friedman, T. Hastie, H. Hoefling, and R. Tibshirani. Pathwise coordinate optimization. Annals of

Applied Statistics, 2(1):302–332, 2007.

M. Grant and S. Boyd. CVX: Matlab software for disciplined convex programming, 2009. Web page and software available athttp://stanford.edu/∼boyd/cvx.

T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Prediction, Infer-

ence and Data Mining (Second Edition). Springer Verlag, New York, 2009.

S. Ji and J. Ye. An accelerated gradient method for trace norm minimization. In Proceedings of the

26th International Conference on Machine Learning, pages 457–464, 2009.

R. H. Keshavan, S. Oh, and A. Montanari. Matrix completion from a few entries. IEEE Transactions

on Information Theory, 56(6):2980–2998, 2009.

R. M. Larsen. Lanczos bidiagonalization with partial reorthogonalization. Technical Report DAIMI PB-357, Department of Computer Science, Aarhus University, 1998.

R.M. Larsen. Propack-software for large and sparse svd calculations, 2004. Available at

http://sun.stanford.edu/∼rmunk/PROPACK.

J. Liu, S. Ji, and J. Ye. SLEP: Sparse Learning with Efficient Projections. Arizona State University, 2009. Available athttp://www.public.asu.edu/∼jye02/Software/SLEP.

Z. Liu and L. Vandenberghe. Interior-point method for nuclear norm approximation with application to system identfication. SIAM Journal on Matrix Analysis and Applications, 31(3):1235–1256, 2009.

S. Ma, D. Goldfarb, and L. Chen. Fixed point and Bregman iterative methods for matrix rank minimization. Mathematical Programming Series A,forthcoming.

R. Mazumder, J. Friedman, and T. Hastie. Sparsenet: coordinate descent with non-convex penalties. Technical report, Stanford University, 2009.

Y. Nesterov. Introductory Lectures on Convex Optimization: Basic course. Kluwer, Boston, 2003. Y. Nesterov. Gradient methods for minimizing composite objective function. Technical Report 76,

Center for Operations Research and Econometrics (CORE), Catholic University of Louvain, 2007.

B. Recht, M. Fazel, and P. A. Parrilo. Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization, 2007. Available at

http://www.citebase.org/abstract?id=oai:arXiv.org:0706.4138.

J. Rennie and N. Srebro. Fast maximum margin matrix factorization for collaborative prediction. In

Proceedings of the 22nd International Conference on Machine Learning, pages 713–719. ACM,

2005.

R. Salakhutdinov, A. Mnih, and G. E. Hinton. Restricted Boltzmann machines for collaborative filtering. In Proceedings of the 24th International Conference on Machine Learning, pages 791– 798. AAAI Press, 2007.

ACM SIGKDD and Netflix. Soft modelling by latent variables: the nonlinear iterative partial least squares (NIPALS) approach. In Proceedings of KDD Cup and Workshop, 2007. Available at

http://www.cs.uic.edu/∼liub/KDD-cup-2007/proceedings.html.

N. Srebro and T. Jaakkola. Weighted low-rank approximations. In Proceedings of the 20th Interna-

tional Conference on Machine Learning, pages 720–727. AAAI Press, 2003.

N. Srebro, N. Alon, and T. Jaakkola. Generalization error bounds for collaborative prediction with low-rank matrices. In Advances in Neural Information Processing Systems 17, pages 5–27. MIT Press, 2005a.

N. Srebro, J. Rennie, and T. Jaakkola. Maximum-margin matrix factorization. In Advances in

Neural Information Processing Systems 17, pages 1329–1336. MIT Press, 2005b.

G. Takacs, I. Pilaszy, B. Nemeth, and D. Tikk. Scalable collaborative filtering approaches for large recommender systems. Journal of Machine Learning Research, 10:623–656, 2009.

R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical

Society, Series B, 58:267–288, 1996.

O. Troyanskaya, M. Cantor, G. Sherlock, P. Brown, T. Hastie, R. Tibshirani, D. Botstein, and R. B. Altman. Missing value estimation methods for DNA microarrays. Bioinformatics, 17(6):520– 525, 2001.

C. H. Zhang. Nearly unbiased variable selection under minimax concave penalty. Annals of Statis-

In document Spectral Regularization Algorithms for Learning Large Incomplete Matrices (Page 32-36)