Discussion and Future Work - Value Regularization and Fenchel Duality

We have introduced a new framework for thinking about optimization in learning theory by com- bining two key elements: Fenchel duality and value regularization. Fenchel duality allows us to analyze the optimality conditions for regularization and loss terms separately, and combine these optimality conditions to get complete optimality conditions for machine learning schemes. Value regularization makes it easy to reason about optimization problems over training and testing points simultaneously, and isolates the RKHS kernel to the regularization term. We have used the framework to gain new insights into several topics in machine learning, including the representer theorem, learning a kernel matrix, biased regularizations, and low-rank approximations to kernel matrices. There remain several interesting open questions.

In Sections 7 and 8, we showed that both supervised learning in an RKHS and learning the kernel matrix had a representer theorem that caused the inductive and natural transductive algorithms to make the same prediction. It would be interesting to explore this idea further, characterizing formally what sorts of conditions give rise to agreement between inductive and transductive algorithms.

The framework developed in this article is specialized to the case where the predicted outputs y are real-valued scalars. It is frequently of interest to make predictions on objects with more structure: examples include multiclass classification, ranking, and classification of sequences. The extension of the value regularization framework to these problems is in principle straightforward, but many details remain to be worked out.

Finally, although this article has focussed on the analysis of existing schemes, one may ask whether a value regularization perspective can lead to the development of new learning algorithms. It seems unlikely that value regularization in an RKHS will lead directly to improved algorithms (for example, a better SVM solver), because the inverse kernel matrix K−1_{appearing in the primal} is a computationally inconvenient object. However, we might imagine designing regularizers for which value-based optimization is computationally effective, such as a transductive algorithm with a

regularizer that directly imposes some non-RKHS smoothness over the input space. We are actively working to design and implement such algorithms.

Acknowledgments

We thank Ben Recht, Tomaso Poggio, Rob Freund, Adrian Lewis and Roger Wets for helpful dis- cussions and suggestions. We thank Gustavo Santos and Tim Sears for finding many technical errors in earlier versions of this work. We thank our anonymous reviewers for helpful suggestions.

References

Yasemin Altun and Alex Smola. Unifying divergence minimization and statistical inference via convex duality. In Proceedings of the 19th Annual Conference on Lwearning Theory, 2006. Yasemin Altun, David McAllester, and Mikhail Belkin. Maximum margin semi-supervised learning

for structured variables. In Neural Information Processing Systems, 2005.

Andreas Argyriou, Charles A. Micchelli, and Massimiliano Pontil. Learning convex combinations of continuously parametrized basic kernels. In Proceedings of the 18th Annual Conference on Learning Theory, 2005.

Nachman Aronszajn. Theory of reproducing kernels. Transactions of the American Mathematical Society, 68:337–404, 1950.

Christopher T. H. Baker. The Numerical Treatment of Integral Equations. Clarendon press, 1977. Mokhtar S. Bazaraa, Hanif D. Sherali, and C. M. Shetty. Nonlinear Programming: Theory and

Algorithms. Wiley-Interscience, 2nd edition, 1993.

Mikhail Belkin and Partha Niyogi. Semi-supervised learning on riemannian manifolds. Machine Learning, 56:209–239, 2003.

Mikhail Belkin, Partha Niyogi, and Vikas Sindhwani. Manifold regularization: A geometric framework for learning from examples. Technical Report TR-2004-06, University of Chicago, 2004. Yoshua Bengio, Jean-Francois Paiement, Pascal Vincent, Olivier Delalleau, Nicolas Le Roux, and

Marie Ouimet. Out-of-sample extensions for LLE, Isomap, MDS, eigenmaps and spectral clus- tering. In Neural Information Processing Systems, 2003.

Jonathan M. Borwein and Adrian S. Lewis. Convex Analysis and Nonlinear Optimization. CMS Books in Mathematics. Springer, 2000.

Scott Shaobing Chen, David L. Donoho, and Michael A. Saunders. Atomic decomposition by basis pursuit. Technical Report 479, Stanford University Department of Statistics, 1995.

Corinna Cortes and Vladimir Vapnik. Support vector networks. Machine Learning, 20:1–25, 1995. Nello Cristianini and John Shaw-Taylor. An Introduction To Support Vector Machines. Cambridge

Miroslav Dud´ık and Robert E. Schapire. Maximum entropy distribution estimation with generalized regularization. In Proceedings of the 19th Annual Conference on Lwearning Theory, pages 123– 138, 2006.

Theodoros Evgeniou, Massimiliano Pontil, and Tomaso Poggio. Regularization networks and sup- port vector machines. Advances In Computational Mathematics, 13(1):1–50, 2000.

Federico Girosi. An equivalence between sparse approximation and support vecto ˚machines. Neural Computation, 10:1455–1480, 1998.

Tommi S. Jaakkola and David Haussler. Probabilistic kernel regression models. In Proceedings of the 1999 Conference on AI and Statistics. Morgan Kaufmann, 1999.

Gert R. G. Lanckriet, Nello Cristianini, Peter Bartlett, Laurent El Ghaoui, and Michael I. Jordan. Learning the kernel matrix with semidefinite programming. Journal of Machine Learning Re- search, 5:24–72, 2004.

Zhen Luo and Grace Wahba. Hybrid adaptive splines. Journal of the American Statistical Society, 92:107–116, 1997.

Charles A. Micchelli. Interpolation of scattered data: distance matrices and conditionally positive functions. Constructive Approximation, 2(1):11–22, 1986.

Tomaso Poggio and Federico Girosi. Networks for approximation and learning. Proceedings of the IEEE, 78(9):1481–1497, September 1990.

Tomaso Poggio, Sayan Mukherjee, Ryan M. Rifkin, Alex Rakhlin, and Alessandro Verri. b. In Proceedings of the Conference on Uncertainty in Geometric Computations, 2001.

Ali Rahimi, Benjamin Recht, and Trevor Darrell. Learning appearance manifolds from video. In Computer Vision and Pattern Recognition, 2005.

Carl Edward Rasmusen and Chris Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.

Ben Recht. Unpublished phd thesis. 2006.

Ryan M. Rifkin. Everything Old Is New Again: A Fresh Look at Historical Approaches to Machine Learning. PhD thesis, Massachusetts Institute of Technology, 2002.

R. Tyrrell Rockafellar and Roger J. B. Wets. Variational Analysis. Springer, Berlin, 2004.

Bernhard Sch ¨olkopf, Ralf Herbrich, and Alex J. Smola. A generalized representer theorem. In 14th Annual Conference on Computational Learning Theory, pages 416–426, 2001.

Vikas Sindhwani, Partha Niyogi, and Mikhail Belkin. Beyond the point cloud: from transductive to semi-supervised learning. In Proceedings of the 22nd International Conference on Machine Learning, 2005.

Johan A. K. Suykens, Tony Van Gestel, Jos De Brabanter, Bart De Moor, and Joos Vandewalle. Least Squares Support Vector Machines. World Scientific, 2002.

Andrei N. Tikhonov and Vasilii Y. Arsenin. Solutions of Ill-posed problems. W. H. Winston, Wash- ington D.C., 1977.

Vladimir Vapnik. Statistical Learning Theory. John Wiley & Sons, 1998.

Grace Wahba. Spline Models for Observational Data, volume 59 of CBMS-NSF Regional Confer- ence Series in Applied Mathematics. Society for Industrial & Applied Mathematics, 1990. Christopher K. I. Willams and Matthias Seeger. Using the Nystr ¨om method to speed up kernel

machines. In Advances in Neural Information Processing Systems, 2000.

Ji Zhu, Saharon Rosset, Trevor Hastie, and Rob Tibshirani. 1-norm support vector machines. In Neural Information Processing Systems, 2003.

In document Value Regularization and Fenchel Duality (Page 36-39)