An Introduction to Deep Learning
Examining the Advantages of Hierarchical Learning
© 2015 SAP SE or an SAP affiliate company. All rights reserved.
Dr. Ying Wu, PhD, is a data scientist within the Advanced Analytics organization at SAP. With more than 10 years of research experience in artificial intelligence, Dr. Wu mainly focuses on designing and applying a wide range of machine learning techniques in data mining, as well as providing solutions for integrating predictive analytics into innovations from SAP. Before joining SAP, Dr. Wu served as a researcher at University College Cork (UCC), Ireland, for six years. During his tenure at UCC, he primarily researched artificial intelligence in data integration and data mining and was involved in projects funded by the European Union Framework Program (FP7), European Space Agency, Irish Environment Protection Agency, and Geological Survey of Ireland. Dr. Wu has published more than 15 research papers in the area of artificial intelligence. He also received a master’s degree with distinction in information technology from the University of Paisley and a PhD in artificial intelligence and data mining from the University of the West of Scotland, UK.
Dr. Rouzbeh Razavi, PhD, is a data scientist within the Advanced Analytics R&D organization at SAP. In his current role, Dr. Razavi is responsible for providing expertise in areas related to machine learning, data mining practices, and the design and implementation of advanced algorithms. Prior to joining SAP, he served as a senior scientist at Bell Laboratories, Alcatel-Lucent, for over five years. At Bell Labs, Dr. Razavi introduced a number of innovations with significant business and scientific impact. He has been the recipient of a number of awards including the prestigious Bell Labs Golden Pen Award. Before joining Bell Labs, he was a research fellow at the University of Essex for three years. Dr. Razavi has published more than 70 technical papers, filed more than 25 patent applications, and authored five books and book chapters. He received both a master’s degree in information systems and a PhD in computer science from the University of Essex, UK. He also received a second master’s degree in business analytics from University College Dublin.
The Emergence of Deep Learning
Applying Deep-Learning Techniques
Scaling Deep-Learning Algorithms
Conclusion
Deep learning is taking the academic community and business world by storm. This machine learning
approach is powering the latest generation of commodity computing and deriving significant value from Big Data.
But most important, it is radically changing how
computers recognize speech, identify objects in images, and recall and process information – three of the
fundamental building blocks for artificial intelligence.
Over the past decade, computer science has changed nearly every aspect of our lives. Once thought of as an unrealistic dream, artificial intelligence (AI) has finally come to fruition, enabling computers to understand and interact with us while carrying out their own reasoning.
Over the years, much research has been done on AI methods. For example, machine learning applies the concept of artificial neural networks (ANNs), a family of statistical learning algorithms inspired by biological neural networks such as the human brain. This approach is used to estimate or approximate functions that depend on a large number of inputs and that are generally unknown. ANNs are commonly represented as systems of interconnected “neurons” that can compute values from inputs and are capable of machine learning and pattern recognition, thanks to their adaptive nature.
Despite their popularity and diverse variety of applications, neural network models and other machine learning methods typically contain a shallow architecture of two or three levels.
Researchers reported positive results on a wide range of applications with two or three layers; however, training deeper networks yielded less-promising results. In addition, they found that using more than two hidden layers in a multilayer neural network brought only marginal improvement while requiring a significant increase in training time. Why? Many believe that the earlier hidden layers in a multilayer architecture are placed too far away from the output. As a result, when learning through backpropagation, where the source of learning is the error at the output, such layers are not trained very effectively and remain more influenced by their initial random settings.
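The weakening of the learning signal in early layers can be illustrated numerically. The following is a minimal sketch and is not taken from the studies discussed here. It assumes a small fully connected network of sigmoid units (three units per layer, weights drawn uniformly from [-1, 1], squared-error loss); these settings and the function name layer_delta_magnitudes are illustrative assumptions. The function backpropagates the output error and reports the largest error signal that reaches each layer.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def layer_delta_magnitudes(depth, width=3, seed=0):
    """Return the largest backpropagated error signal per layer, first layer first.

    Assumptions (illustrative only): sigmoid units, `width` units per layer,
    weights uniform in [-1, 1], squared-error loss against a target of all ones.
    """
    rng = random.Random(seed)
    # weights[l][k][j]: weight from unit j of the layer's input to unit k of layer l
    weights = [[[rng.uniform(-1.0, 1.0) for _ in range(width)]
                for _ in range(width)] for _ in range(depth)]

    # Forward pass, remembering the activations of every layer.
    activation = [rng.uniform(0.0, 1.0) for _ in range(width)]
    activations = []
    for layer in weights:
        activation = [sigmoid(sum(w * a for w, a in zip(row, activation)))
                      for row in layer]
        activations.append(activation)

    # Error signal at the output layer: (a - target) * sigmoid'(z).
    delta = [(a - 1.0) * a * (1.0 - a) for a in activations[-1]]
    magnitudes = [max(abs(d) for d in delta)]

    # Backward pass: propagate the error signal toward the first layer.
    for l in range(depth - 1, 0, -1):
        previous = activations[l - 1]
        delta = [previous[j] * (1.0 - previous[j])
                 * sum(weights[l][k][j] * delta[k] for k in range(width))
                 for j in range(width)]
        magnitudes.append(max(abs(d) for d in delta))

    magnitudes.reverse()
    return magnitudes


if __name__ == "__main__":
    for index, magnitude in enumerate(layer_delta_magnitudes(depth=6), start=1):
        print(f"layer {index}: largest error signal = {magnitude:.2e}")
```

In this setup the sigmoid derivative never exceeds 0.25 and each unit has three outgoing weights of magnitude at most 1, so the largest error signal shrinks by at least a factor of 0.75 with every layer it passes through. The earliest layers therefore receive the weakest learning signal.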
Yoshua Bengio and Yann LeCun observed that, in most classical machine learning methods with a large number of parameters to consider, optimal learning can only be achieved when some form of prior knowledge is available.[1]
Moreover, when the problem exhibits complex behavior, highly varying mathematical functions are usually required to model it. These mathematical functions are highly nonlinear in the data space and can display a very large number of variations. With a shallow architecture for the highly varying functions, the learning algorithms are greatly impacted by the number of dimensions in the problem and are very prone to suboptimal performance.
In recent years, deep architectures, motivated by biological and circuit complexity theories, have been reported to be more efficient than shallow architectures, especially when the problem is assumed to have complex behaviors with highly varying mathematical functions. These deep-learning networks usually have multiple hidden layers. The hidden layers are where the network stores its internal abstract representation of the training data. In deep learning, the hidden layers are computed in an entirely different fashion when compared to traditional neural networks.
More specifically, each layer in a deep network is pretrained with an unsupervised learning algorithm. Each layer thus learns a nonlinear transformation of its input (the raw data or the output of the previous layer) that captures more abstract features. Then, in the final training phase, the deep architecture is fine-tuned with respect to a supervised training criterion using gradient-based optimization.
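The unsupervised pretraining phase described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure of any particular study. It assumes each layer is a small autoencoder with a sigmoid encoder and a linear decoder, trained by stochastic gradient descent on squared reconstruction error. The function names, layer sizes, learning rate, and epoch count are hypothetical. The supervised fine-tuning phase is not shown.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def encode(layer, x):
    """Apply one pretrained layer (W, b) to input vector x."""
    W, b = layer
    return [sigmoid(sum(w * xi for w, xi in zip(row, x)) + bj)
            for row, bj in zip(W, b)]


def train_autoencoder(data, n_hidden, epochs=50, lr=0.1, rng=None):
    """Train one autoencoder layer and return its encoder (W, b)."""
    rng = rng or random.Random(0)
    n_in = len(data[0])
    W = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]
    b = [0.0] * n_hidden
    V = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden)] for _ in range(n_in)]
    c = [0.0] * n_in
    for _ in range(epochs):
        for x in data:
            h = encode((W, b), x)                      # hidden code
            r = [sum(v * hj for v, hj in zip(row, h)) + ci
                 for row, ci in zip(V, c)]             # linear reconstruction
            dr = [ri - xi for ri, xi in zip(r, x)]     # d(loss)/d(reconstruction)
            dh = [sum(V[i][j] * dr[i] for i in range(n_in))
                  for j in range(n_hidden)]            # error reaching the code
            dz = [dhj * hj * (1.0 - hj) for dhj, hj in zip(dh, h)]
            for i in range(n_in):                      # decoder update
                for j in range(n_hidden):
                    V[i][j] -= lr * dr[i] * h[j]
                c[i] -= lr * dr[i]
            for j in range(n_hidden):                  # encoder update
                for k in range(n_in):
                    W[j][k] -= lr * dz[j] * x[k]
                b[j] -= lr * dz[j]
    return W, b  # the decoder is only needed during pretraining


def pretrain_stack(data, hidden_sizes):
    """Greedy layer-wise pretraining: each layer learns from the codes of the one below."""
    rng = random.Random(0)
    layers, current = [], data
    for n_hidden in hidden_sizes:
        layer = train_autoencoder(current, n_hidden, rng=rng)
        layers.append(layer)
        current = [encode(layer, x) for x in current]
    return layers


if __name__ == "__main__":
    patterns = [[1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0],
                [1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]]
    stack = pretrain_stack(patterns, hidden_sizes=[3, 2])
    print(encode(stack[1], encode(stack[0], patterns[0])))
```

In a complete pipeline, the pretrained encoder weights would serve as the starting point for the supervised fine-tuning phase, typically with an output layer added on top.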
The Emergence of Deep Learning
1. Bengio, Y., and LeCun, Y., “Scaling Learning Algorithms Towards AI,” in Large-Scale Kernel Machines, MIT Press, 2007.
Deep learning is designed to learn features at higher levels as compositions of lower-level features. As Bengio and LeCun proposed, deep-learning networks can automatically discover abstractions from lower-level features to higher-level concepts through a series of processing stages.[2]
Lower-level abstractions are more tightly related to pieces of data, and higher-level abstractions are more directly tied to actual and meaningful concepts. One advantage of such a deep architecture is that each level of abstraction focuses on a small subset of a large number of features. The information to be learned is not located in a single layer of neurons; instead, it is distributed across multiple layers. Such a distributed representation gives deep-learning networks a stronger capacity for learning and can produce much better generalization than traditional machine learning methods. Furthermore, an architecture with multiple levels and based on a distributed representation of data allows deep-learning networks to learn intermediate representations, which can be shared across different problem areas.
This means that knowledge learned as intermediate representations is reusable: new high-level features can be learned by combining lower-level intermediate features from a common pool of information.
A large body of literature has focused on deep-learning methods. Almost 10 years ago, Geoffrey E. Hinton and his team presented the concept of deep belief networks (DBNs).[3]
In 2007, deep neural networks based on autoencoders were proposed in a study conducted by Bengio and his team.[4]
However, not all deep-learning methods were developed after 2006. For example, another neural network model with a deep architecture, the convolutional neural network (CNN), was introduced by LeCun in 1998. It is also important to note that much research has been done since 2006 to extend the CNN framework.
For instance, the convolutional structure of the CNN has been applied to restricted Boltzmann machines (RBMs) and DBNs.[5]
Conversely, the unsupervised pretraining step of deep learning has also been applied to the CNN.[6]
2. Ibid.
3. Hinton, G., and Salakhutdinov, R., “Reducing the Dimensionality of Data with Neural Networks,” Science, 2006.
4. Bengio, Y., Lamblin, P., Popovici, D., and Larochelle, H., “Greedy Layer-Wise Training of Deep Networks,” Advances in Neural Information Processing Systems, 2007.
5. Lee, H., Grosse, R., Ranganath, R., and Ng, A. Y., “Convolutional Deep Belief Networks for Scalable Unsupervised Learning of Hierarchical Representations,” ICML, 2009.
6. Kavukcuoglu, K., Ranzato, M. A., Fergus, R., and LeCun, Y., “Learning Invariant Features Through Topographic Filter Maps,” CVPR, 2009.
In summary, Bengio and LeCun have listed some advantages of using a deep architecture, such as the ability to:
• Learn complex, highly varying functions
• Analyze low-level, intermediate, and high-level abstractions, with little human input
• Process a very large set of examples
• Assess mostly unlabeled data
• Exploit synergies presented across a large number of tasks[7]
Figure 1: Evolution and popularity of machine learning algorithms from 1960 to the present day (horizontal axis: years 1960 to 2015; vertical axis: subjective popularity; labels: Deep learning; Neural networks; Decision tree, ID3; SVM (support vector machine); Random forests; Adaboost; Perceptron (large scale))
In terms of popularity, deep architectures have gained significant attention in recent years. Figure 1 illustrates the evolution and popularity of different machine learning algorithms over the years, including the emerging deep-learning methods.
7. Bengio, Y., Lamblin, P., Popovici, D., and Larochelle, H., “Greedy Layer-Wise Training of Deep Networks,” Advances in Neural Information Processing Systems, 2007.
Since 2006 deep architectures have been enabling state-of-the-art performance. And with success, this technology has been applied across a wide range of fields such as classification, dimensionality reduction, robotics, image recognition, image retrieval, information retrieval, language modeling, and natural language processing.
The DBN and the stacked autoencoder were originally demonstrated with success on an image-recognition task using the Modified National Institute of Standards and Technology (MNIST) data set.[8]
Recently, some image classification models based on deep architectures have reported state-of-the-art performance on this data set. In a study conducted by Dan Ciresan, Ueli Meier, and Jurgen Schmidhuber, a convolutional neural network was built and trained that achieved a very low error rate of 0.23%.[9]
In this section, we provide a brief overview of applications where deep learning has been successfully applied. For more information on this topic, we strongly encourage you to read “Deep Learning: Methods and Applications (Foundations and Trends® in Signal Processing)” by Li Deng and Dong Yu.[10]
MULTIMEDIA SIGNAL PROCESSING
Traditionally, multimedia signal processing has been an active area for applying machine learning algorithms. This includes areas related to image recognition, classification, and retrieval. The origin of applying deep learning to object-recognition tasks can be traced to CNNs in the early 1990s. More recently, however, the introduction of deep learning has resulted in a paradigm shift in the field of object recognition and classification.
The fundamental principle of deep learning is the ability to autonomously generate high-level representations from raw data sources. This makes deep learning a natural fit for image recognition and classification. The raw data is fed into the first layer, and progressively higher-level features are extracted and passed from each layer to the next until the eventual output (such as a prediction) is produced.
In a study where deep architectures were used along with convolutional structures for computer vision and image recognition, it was reported that the deep CNN approach achieved a considerably lower error rate than the previous state-of-the-art machine learning methods.[11]
This work used a data set with 1,000 unique image classes as the targets, 1.2 million high-resolution images in the training set, and 150,000 images in the test set.
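The basic building blocks of a convolutional layer can be sketched in a few lines. The following is a minimal illustration and is not the architecture of the study cited above. It assumes a single hand-written 2x2 filter, a rectified linear activation, and 2x2 max pooling on a tiny 5x5 image. The function names and the example image are hypothetical. As in most CNN implementations, the "convolution" is computed as a sliding dot product without flipping the filter.

```python
def conv2d_valid(image, kernel):
    """Slide the kernel over the image (no padding) and return the feature map."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(kernel[i][j] * image[r + i][c + j]
                 for i in range(kh) for j in range(kw))
             for c in range(out_w)]
            for r in range(out_h)]


def relu(feature_map):
    """Keep positive responses and set negative ones to zero."""
    return [[max(0.0, value) for value in row] for row in feature_map]


def max_pool(feature_map, size=2):
    """Keep the strongest response in each non-overlapping size x size window."""
    out_h = len(feature_map) // size
    out_w = len(feature_map[0]) // size
    return [[max(feature_map[r * size + i][c * size + j]
                 for i in range(size) for j in range(size))
             for c in range(out_w)]
            for r in range(out_h)]


# A 5x5 image that is dark on the left and bright on the right.
IMAGE = [[0.0, 0.0, 1.0, 1.0, 1.0] for _ in range(5)]
# A filter that responds where brightness increases from left to right.
VERTICAL_EDGE = [[-1.0, 1.0],
                 [-1.0, 1.0]]

if __name__ == "__main__":
    features = relu(conv2d_valid(IMAGE, VERTICAL_EDGE))
    print(features)            # 4x4 map with a response of 2.0 at the edge
    print(max_pool(features))  # 2x2 summary of where the edge was found
```

In a trained CNN the filter values are learned from data rather than written by hand, and many such filters are stacked in successive layers so that later layers respond to increasingly abstract patterns.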
Deep learning has also been successfully applied to speech and audio signal processing. In this context, the goal is to work directly from primitive spectral and waveform features, whereas such features were traditionally handcrafted.
Experimental results validate the superiority of deep-learning methods for speech recognition, especially in the presence of noise. Products and features such as Siri by Apple, Google Now, Microsoft Cortana, and Baidu Deep Speech are some examples of commercial products relying on such advancements in speech processing.
Applying Deep-Learning Techniques
8. MNIST data set, http://yann.lecun.com/exdb/mnist.
9. Ciresan, D., Meier, U., and Schmidhuber, J., “Multi-Column Deep Neural Networks for Image Classification,” arXiv preprint, 2012.
10. Deng, L., and Yu, D., “Deep Learning: Methods and Applications (Foundations and Trends® in Signal Processing),” Now Publishers Inc., 2014.
11. Krizhevsky, A., Sutskever, I., and Hinton, G., “ImageNet Classification with Deep Convolutional Neural Networks,” Advances in Neural Information Processing Systems, pages 1–9, 2012.
SEARCH ENGINES AND INFORMATION RETRIEVAL
During information retrieval, a user submits queries to a system that contains a collection of many documents and objects with the goal of obtaining a set of relevant documents and objects.
Web search engines are commonly held as the most important category of information retrieval service providers. Traditionally, search engines retrieve Web-based documents by matching terms in documents with those in a search query – a process called lexical matching. However, lexical matching can be suboptimal due to a language discrepancy between documents and queries.
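The limitation of lexical matching can be shown with a very small example. The following is a minimal sketch, not the ranking method of any actual search engine. It assumes whitespace tokenization and scores a document by the fraction of query terms it contains. The function name and the sample strings are hypothetical.

```python
def lexical_match_score(query, document):
    """Return the fraction of query terms that literally appear in the document."""
    query_terms = set(query.lower().split())
    document_terms = set(document.lower().split())
    if not query_terms:
        return 0.0
    return len(query_terms & document_terms) / len(query_terms)


if __name__ == "__main__":
    query = "cheap car rental"
    print(lexical_match_score(query, "Cheap car rental in Dublin"))
    print(lexical_match_score(query, "Inexpensive automobile hire in Dublin"))
```

The second document is about the same topic as the query but shares no terms with it, so a purely lexical score treats it as irrelevant. This is the kind of vocabulary gap between documents and queries that semantic matching aims to close.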
Many practitioners are looking at semantic matching as an approach to close this gap. Latent semantic analysis (LSA) and its extensions are
mature concepts that were introduced 25 years ago. However, a new trend has now started, which deploys deep neural networks to extract high-level semantic representations. This explains why search engine giants such as Google, Microsoft, Yahoo, and Baidu are significantly investing in this area.
As for image documents, content-based image retrieval searches for images according to their visual content.
Deep-learning techniques have been widely applied in this area in recent years. Ji Wan and colleagues proposed a deep-learning framework as shown in Figure 2.[12]
The model was successfully trained on the “ILSVRC-2012” data set from ImageNet, which has 1,000 categories and more than 1 million training images, and it achieved state-of-the-art performance.
Figure 2: A deep-learning framework for image retrieval
(Diagram summary: a convolutional neural network (CNN) is trained on a massive source image collection in various categories, for example, ImageNet ILSVRC2012. A raw RGB input image passes through repeated stages of convolutional filtering, local contrast normalization, and sample pooling, where normalization and pooling are optional, producing low-level, mid-level, and high-level features. These feed the fully connected layers FC1 and FC2 and the final output labels at FC3. The CNN model is then used on new image retrieval data sets 1 to n to produce the feature representation for content-based image retrieval (CBIR) under three schemes.)
Scheme I: Directly use the feature representations from layers FC1, FC2, and FC3 of the ImageNet-trained model
Scheme II: Adopt a metric learning technique to refine the feature representations obtained from Scheme I
Scheme III: Retrain a deep CNN model with a classification or similarity loss function, initialized from the ImageNet-trained model
12. Wan, J., Wang, D., et al., “Deep Learning for Content-Based Image Retrieval: A Comprehensive Study,” Proceedings of the ACM Multimedia Conference (MM 2014), 2014.
LANGUAGE MODELING AND NATURAL LANGUAGE PROCESSING
Research in language modeling and processing has recently gained significant popularity. The goal of a language model is to estimate the distribution of natural language as accurately as possible.
Natural language processing (NLP), or computational linguistics, also deals with word sequences; however, the tasks are much more diverse.
Deep learning has been shown to be very promising for both language modeling and NLP. For example, the long short-term memory (LSTM) architecture has been applied in machine translation.[13, 14]
The WMT’14 English-to-French data set was used to evaluate this architecture, and the models were trained on a subset of 12 million sentences consisting of 348 million French words and 304 million English words. The deep LSTM architecture was built with four layers, with 1,000 cells at each layer and 1,000-dimensional word embeddings.
The input vocabulary consists of 160,000 words, and there are 80,000 words in the output vocabulary.
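For readers unfamiliar with the LSTM, a single time step of one LSTM layer can be sketched as follows. This is a minimal illustration of the commonly used formulation with input, forget, and output gates. It is not necessarily the exact variant used in the translation study, and it is far smaller than the 1,000-cell layers described above. The function names and the parameter dictionary layout are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(weights, bias, vector):
    """Compute weights * vector + bias for a list-of-rows weight matrix."""
    return [sum(w * v for w, v in zip(row, vector)) + b
            for row, b in zip(weights, bias)]


def lstm_step(x, h_prev, c_prev, params):
    """Advance one LSTM layer by one time step and return (h, c)."""
    z = x + h_prev  # concatenate the current input and the previous hidden state
    i = [sigmoid(a) for a in affine(params["W_i"], params["b_i"], z)]    # input gate
    f = [sigmoid(a) for a in affine(params["W_f"], params["b_f"], z)]    # forget gate
    o = [sigmoid(a) for a in affine(params["W_o"], params["b_o"], z)]    # output gate
    g = [math.tanh(a) for a in affine(params["W_g"], params["b_g"], z)]  # candidate
    # The memory cell keeps part of its old content and adds gated new content.
    c = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c_prev, i, g)]
    h = [oj * math.tanh(cj) for oj, cj in zip(o, c)]
    return h, c


def zero_params(n_inputs, n_cells):
    """Build an all-zero parameter set, useful only for checking the arithmetic."""
    params = {}
    for gate in ("i", "f", "o", "g"):
        params["W_" + gate] = [[0.0] * (n_inputs + n_cells) for _ in range(n_cells)]
        params["b_" + gate] = [0.0] * n_cells
    return params


if __name__ == "__main__":
    h, c = lstm_step([0.3, -0.7], [0.0], [2.0], zero_params(2, 1))
    print(h, c)
```

With all-zero parameters every gate equals 0.5 and the candidate value is 0, so the cell simply halves its previous memory. In a trained network the gates depend on the input and the previous hidden state, which lets the cell decide what to keep over long word sequences. A deep LSTM stacks several such layers, feeding the hidden state of one layer as the input to the next.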
This study determined that deep LSTMs significantly outperform shallow LSTMs, with each additional layer reducing perplexity by nearly 10%.[15]
13. Sutskever, I., Vinyals, O., and Le, Q. V., “Sequence to Sequence Learning with Neural Networks,” e-print arXiv:1409.3215, 2014.
14. Hochreiter, S., and Schmidhuber, J., “Long Short-Term Memory,” Neural Computation, 1997.
15. Sutskever, I., Vinyals, O., and Le, Q. V., “Sequence to Sequence Learning with Neural Networks,” e-print arXiv:1409.3215, 2014.
While deep learning brings impressive advantages in many applications, training deep-learning models is not straightforward when the volume of data is very large. This is because the iterative computations inherent in most deep-learning methods are difficult to parallelize.
In recent years, there has been much research into effective and scalable parallel algorithms for training deep-learning models.
For instance, DistBelief is a software framework recently designed for distributed training and learning in deep networks with very large models and large-scale data sets.[16]
It uses large-scale clusters of machines to manage data and model parallelism through multithreading, message passing, synchronization, and communication between machines. The large network architecture is first partitioned into smaller parallel structures, whose nodes are assigned to and computed on several machines.
As shown in Figure 3, there are four blocks partitioned and each is assigned to a single machine. The boundary nodes require data transfers between the machines.
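The role of boundary nodes in a partitioned model can be sketched as follows. This is a simplified illustration and not DistBelief code. It assumes a one-dimensional locally connected layer in which each unit reads the inputs within a fixed radius of its own position, and it splits the units into contiguous blocks, one per machine. The function names, the 16-unit layer, and the radius are hypothetical.

```python
def partition(n_units, n_blocks):
    """Split unit indices 0..n_units-1 into contiguous blocks of near-equal size."""
    base, extra = divmod(n_units, n_blocks)
    blocks, start = [], 0
    for block_index in range(n_blocks):
        size = base + (1 if block_index < extra else 0)
        blocks.append(range(start, start + size))
        start += size
    return blocks


def remote_inputs(blocks, n_units, radius=1):
    """For each block, list the inputs it needs that are held by another block."""
    needed = []
    for block in blocks:
        own = set(block)
        wanted = set()
        for unit in block:
            low = max(0, unit - radius)
            high = min(n_units - 1, unit + radius)
            wanted.update(range(low, high + 1))
        needed.append(sorted(wanted - own))
    return needed


if __name__ == "__main__":
    blocks = partition(n_units=16, n_blocks=4)
    for machine, inputs in enumerate(remote_inputs(blocks, n_units=16), start=1):
        print(f"machine {machine} must fetch inputs {inputs}")
```

Only the units at the edges of each block depend on values held by a neighboring machine, so the communication cost grows with the size of the block boundaries rather than with the size of the whole model. In a fully connected layer, by contrast, every unit would need values from every block.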
In experiments, DistBelief was applied to two deep-learning models:
1. A fully connected network with 42 million model parameters and 1.1 billion examples
2. A locally connected convolutional neural network with 16 million images of 100x100 pixels, 21,000 categories, and as many as 1.7 billion parameters
The experimental results show that locally connected models benefit more from DistBelief; training was reported to be 12 times faster than on a single machine.
An alternative way to train deep-learning models is by using GPUs. As of August 2013, NVIDIA single-precision GPUs exceeded 4.5 TeraFLOP/s with a memory bandwidth of nearly 300 GB/s. This shows that GPUs are particularly suited for massively parallel computing, with more transistors devoted to data processing.[18]
Scaling Deep-Learning Algorithms
16. Chen, X.-W., and Lin, X., “Big Data Deep Learning: Challenges and Perspectives,” IEEE Access, 2014.
17. Dean, J., et al., “Large-Scale Distributed Deep Networks,” Advances in Neural Information Processing Systems, 2012.
18. Chen, X.-W., and Lin, X., “Big Data Deep Learning: Challenges and Perspectives,” IEEE Access, 2014.
Figure 3: Models partitioned into four blocks (Block 1 to Block 4) and assigned to four machines[17]
A couple of years ago, Adam Coates, Brody Huval, Tao Wang, David J. Wu, and Andrew Y. Ng proposed the commodity off-the-shelf, high-performance computing (COTS HPC) system for training deep network models. They reported that a network with 1 billion parameters can be trained on just three machines and that the system scales to more than 11 billion free parameters on 16 machines.[19]
The COTS HPC system comprises a cluster of 16 GPU servers with InfiniBand adapters for interconnects and MPI for data exchange within the cluster. Each server is equipped with four NVIDIA GTX 680 GPUs, each having 4 GB of memory. Refer to Figure 4 for a summary of research efforts dedicated to experimentation with GPUs.
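A rough calculation shows why a model of this size has to be spread across many GPUs. The following sketch uses the cluster figures given above. It assumes that each parameter is stored as a 4-byte single-precision value and that 1 GB is 10^9 bytes; both are assumptions, since the storage format is not stated here. The constant and function names are hypothetical.

```python
SERVERS = 16                    # GPU servers in the cluster (from the text)
GPUS_PER_SERVER = 4             # GPUs per server (from the text)
MEMORY_PER_GPU_GB = 4           # memory per GPU in GB (from the text)
PARAMETERS = 11_000_000_000     # "more than 11 billion" free parameters
BYTES_PER_PARAMETER = 4         # assumption: single-precision storage


def cluster_summary():
    """Return (total GPUs, total GPU memory in GB, parameter storage in GB)."""
    total_gpus = SERVERS * GPUS_PER_SERVER
    total_memory_gb = total_gpus * MEMORY_PER_GPU_GB
    parameter_memory_gb = PARAMETERS * BYTES_PER_PARAMETER / 1e9
    return total_gpus, total_memory_gb, parameter_memory_gb


if __name__ == "__main__":
    gpus, memory_gb, parameters_gb = cluster_summary()
    print(f"{gpus} GPUs with {memory_gb} GB of GPU memory in total")
    print(f"about {parameters_gb:.0f} GB needed for the parameters alone")
```

Under these assumptions the parameters alone need about 44 GB, roughly 11 times the memory of a single 4 GB GPU, while the cluster as a whole offers 256 GB. Training also needs memory for activations and gradients, which this estimate does not include.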
In addition, it is worth mentioning Deeplearning4j, the first commercial-grade, open-source, distributed deep-learning library written for Java and Scala.[21]
Integrated with Hadoop and Spark, Deeplearning4j is designed for business environments and includes a distributed multithreaded deep-learning framework and a single-threaded deep-learning framework. Training takes place in the cluster, which means it can process massive amounts of data. Networks are trained in parallel through iterative reduce, and they are equally compatible with Java, Scala, and Clojure.
Figure 4: Recent research progress in large-scale deep learning[20]
19. Coates, A., Huval, B., Wang, T., Wu, D., and Ng, A. Y., “Deep Learning with COTS HPC Systems,” Journal of Machine Learning Research, 2013.
20. Chen, X.-W., and Lin, X., “Big Data Deep Learning: Challenges and Perspectives,” IEEE Access, 2014.
21.