The hierarchical architecture of the neural system in- spires deep learning architectures for feature learning by stacking multiple layers of simple learning blocks.[17] These architectures are often designed based on the as- sumption ofdistributed representation: observed data is generated by the interactions of many different factors on multiple levels. In a deep learning architecture, the output of each intermediate layer can be viewed as a rep- resentation of the original input data. Each level uses the representation produced by previous level as input, and produces new representations as output, which is then fed to higher levels. The input of bottom layer is the raw data, and the output of the final layer is the final low- dimensional feature or representation.
9.3.1 Restricted Boltzmann machine
Restricted Boltzmann machines (RBMs) are often used as a building block for multilayer learning architectures.[3][18] An RBM can be represented by an undirected bipartite graph consisting of a group of binary hidden variables, a group of visible variables, and edgesconnecting the hidden and visible nodes. It is a special case of the more general Boltzmann machines with the constraint of no intra-node connections. Each edge in an RBM is associated with a weight. The weights together with the connections define an energy function, based on which a joint distribution of visible and hidden nodes can be devised. Based on the topology of the RBM, the hidden (visible) variables are independent conditioned on the visible (hidden) variables. Such conditional indepen- dence facilitates computations on RBM.
An RBM can be viewed as a single layer architecture for unsupervised feature learning. In particular, the visible variables correspond to input data, and the hidden vari- ables correspond to feature detectors. The weights can be trained by maximizing the probability of visible vari- ables using the contrastive divergence (CD) algorithm by Geoffrey Hinton.[18]
In general, the training of RBM by solving the above max- imization problem tends to result in non-sparse represen- tations. The sparse RBM,[19]a modification of the RBM, was proposed to enable sparse representations. The idea is to add a regularization term in the objective function of data likelihood, which penalizes the deviation of the expected hidden variables from a small constant p .
9.3.2 Autoencoder
An autoencoder consisting of encoder and decoder is a paradigm for deep learning architectures. An example is provided by Hinton and Salakhutdinov[18]where the en- coder uses raw data (e.g., image) as input and produces feature or representation as output, and the decoder uses the extracted feature from the encoder as input and recon- structs the original input raw data as output. The encoder and decoder are constructed by stacking multiple layers of RBMs. The parameters involved in the architecture are trained in a greedy layer-by-layer manner: after one layer of feature detectors is learned, they are fed to upper layers as visible variables for training the corresponding RBM. The process can be repeated until some stopping criteria is satisfied.
9.4 See also
• Basis function
• Deep learning
• Feature detection (computer vision)
• Feature extraction
• Kernel trick
9.5. REFERENCES 59
9.5 References
[1] Y. Bengio; A. Courville; P. Vincent (2013). “Represen- tation Learning: A Review and New Perspectives”. IEEE Trans. PAMI, special issue Learning Deep Architectures. [2] Nathan Srebro; Jason D. M. Rennie; Tommi S. Jaakkola
(2004). Maximum-Margin Matrix Factorization.NIPS. [3] Coates, Adam; Lee, Honglak; Ng, Andrew Y. (2011).An
analysis of single-layer networks in unsupervised feature learning(PDF). Int'l Conf. on AI and Statistics (AIS- TATS).
[4] Csurka, Gabriella; Dance, Christopher C.; Fan, Lixin; Willamowski, Jutta; Bray, Cédric (2004). Visual catego- rization with bags of keypoints(PDF). ECCV Workshop on Statistical Learning in Computer Vision.
[5] Daniel Jurafsky; James H. Martin (2009). Speech and Language Processing. Pearson Education International. pp. 145–146.
[6] Mairal, Julien; Bach, Francis; Ponce, Jean; Sapiro, Guillermo; Zisserman, Andrew (2009). “Supervised Dic- tionary Learning”. Advances in neural information pro- cessing systems.
[7] Percy Liang (2005).Semi-Supervised Learning for Natu- ral Language(PDF) (M. Eng.).MIT. pp. 44–52. [8] Joseph Turian; Lev Ratinov; Yoshua Bengio (2010).
Word representations: a simple and general method for semi-supervised learning(PDF). Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics.
[9] Schwenker, Friedhelm; Kestler, Hans A.; Palm, Gün- ther (2001). “Three learning phases for radial-basis- function networks”. Neural Networks 14: 439– 458. doi:10.1016/s0893-6080(01)00027-2. CiteSeerX:
10 .1 .1 .109 .312.
[10] Coates, Adam; Ng, Andrew Y. (2012). “Learning feature representations with k-means”. In G. Montavon, G. B. Orr and K.-R. Müller. Neural Networks: Tricks of the Trade. Springer.
[11] Dekang Lin; Xiaoyun Wu (2009). Phrase clustering for discriminative learning(PDF). Proc. J. Conf. of the ACL and 4th Int'l J. Conf. on Natural Language Processing of the AFNLP. pp. 1030–1038.
[12] Roweis, Sam T; Saul, Lawrence K (2000). “Nonlinear Dimensionality Reduction by Locally Linear Embed- ding”. Science, New Series 290 (5500): 2323–2326.
doi:10.1126/science.290.5500.2323.
[13] Saul, Lawrence K; Roweis, Sam T (2000).“An Introduc- tion to Locally Linear Embedding”.
[14] Hyvärinen, Aapo; Oja, Erkki (2000). “Independent Com- ponent Analysis: Algorithms and Applications”. Neural networks (4): 411–430.
[15] Lee, Honglak; Battle, Alexis; Raina, Rajat; Ng, Andrew Y (2007). “Efficient sparse coding algorithms”. Ad- vances in neural information processing systems.
[16] Aharon, Michal; Elad, Michael; Bruckstein, Alfred (2006). “K-SVD: An Algorithm for Designing Over- complete Dictionaries for Sparse Representation”. IEEE Trans. Signal Process. 54 (11): 4311–4322.
doi:10.1109/TSP.2006.881199.
[17] Bengio, Yoshua (2009). “Learning Deep Architectures for AI”. Foundations and Trends® in Machine Learning
2 (1): 1–127.doi:10.1561/2200000006.
[18] Hinton, G. E.; Salakhutdinov, R. R. (2006).
“Reducing the Dimensionality of Data with Neural Networks” (PDF). Science 313 (5786): 504–507.
doi:10.1126/science.1127647.PMID 16873662. [19] Lee, Honglak; Ekanadham, Chaitanya; Andrew, Ng
(2008). “Sparse deep belief net model for visual area V2”. Advances in neural information processing systems.
Online machine learning
Onlinemachine learningis used in the case where the databecomes available in a sequential fashion, in order to de- termine a mapping from the dataset to the corresponding labels. The key difference between online learning and batch learning (or“offline” learning) techniques, is that in online learning the mapping is updated after the arrival of every new datapoint in a scalable fashion, whereas batch techniques are used when one has access to the entire training dataset at once. Online learning could be used in the case of a process occurring in time, for example the value of a stock given its history and other external factors, in which case the mapping updates as time goes on and we get more and more samples.
Ideally in online learning, the memory needed to store the function remains constant even with added datapoints, since the solution computed at one step is updated when a new datapoint becomes available, after which that dat- apoint can then be discarded. For many formulations, for example nonlinearkernel methods, true online learning is not possible, though a form of hybrid online learning with recursive algorithms can be used. In this case, the space requirements are no longer guaranteed to be con- stant since it requires storing all previous datapoints, but the solution may take less time to compute with the ad- dition of a new datapoint, as compared to batch learning techniques.
As in all machine learning problems, the goal of the algo- rithm is to minimize some performance criteria using a loss function. For example, with stock market predic- tion the algorithm may attempt to minimize the mean squared errorbetween the predicted and true value of a stock. Another popular performance criterion is to min- imize the number of mistakes when dealing with classifi- cation problems. In addition to applications of a sequen- tial nature, online learning algorithms are also relevant in applications with huge amounts of data such that tradi- tional learning approaches that use the entire data set in aggregate are computationally infeasible.
10.1 A prototypical online super-
vised learning algorithm
In the setting ofsupervised learning, or learning from ex- amples, we are interested in learning a function f : X→ Y , where X is thought of as a space of inputs and Y as a space of outputs, that predicts well on instances that are drawn from a joint probability distribution p(x, y) on X × Y . In this setting, we are given a loss function
V : Y × Y → R , such that V (f(x), y) measures the difference between the predicted value f (x) and the true value y . The ideal goal is to select a function f ∈ H , whereH is a space of functions called a hypothesis space, so as to minimize the expected risk:
I[f ] =E[V (f(x), y)] = ∫
V (f (x), y) dp(x, y) .
In reality, the learner never knows the true distribution p(x, y)over instances. Instead, the learner usually has ac- cess to a training set of examples (x1, y1), . . . , (xn, yn) that are assumed to have been drawni.i.d. from the true distribution p(x, y) . A common paradigm in this situ- ation is to estimate a function ˆf through empirical risk minimizationor regularized empirical risk minimization (usually Tikhonov regularization). The choice of loss function here gives rise to several well-known learning algorithms such as regularizedleast squaresandsupport vector machines.
The above paradigm is not well-suited to the online learn- ing setting though, as it requires complete a priori knowl- edge of the entire training set. In the pure online learn- ing approach, the learning algorithm should update a se- quence of functions f1, f2, . . .in a way such that the func- tion ft+1depends only on the previous function ftand the next data point (xt, yt). This approach has low mem- ory requirements in the sense that it only requires storage of a representation of the current function ftand the next data point (xt, yt). A related approach that has larger memory requirements allows ft+1to depend on ftand all previous data points (x1, y1), . . . , (xt, yt). We fo- cus solely on the former approach here, and we consider both the case where the data is coming from an infinite 60