Deep Learning.pdf

(1)

Deep Learning

Yoshua Bengio Ian J. Goodfellow

Aaron Courville August 24, 2014

(2)

Acknowledgments

We would like to thank the following people who commented our proposal for the book and helped plan its contents and organization: Hugo Larochelle, Guillaume Alain, Kyunghyun Cho, Caglar Gulcehre (TODO diacritics), Razvan Pascanu, David Krueger and Thomas Roh´ee.

We would like to thank the following people who oﬀered feedback on the content of the book itself:

Math chapter: Ilya Sutskever Vincent Vanhoucke

Linear algebra: Guillaume Alain Dustin Webb David Warde-Farley Pierre Luc Car-rier Li Yao Thomas Roh´ee Colby Toland Amjad Almahairi

Probability: Rasmus Antti Stephan Gouws David Warde-Farley Vincent Dumoulin Li Yao

Convolutional nets: Guillaume Alain David Warde-Farley Mehdi Mirza Caglar Gul-cehre

Partition function: Sam Bowman

(9)

Chapter 1 Deep Learning for AI

Inventors have long dreamed of creating machines that think. Ancient Greek myths tell of intelligent objects, such as animated statues of human beings and tables that arrive full of food and drink when called . When programmable computers were first conceived, people wondered whether they might become intelligent, over a hundred years before one was built (Lovelace, 1842). Today, artificial intelligence (AI) is a thriving field with many practical applications and active research topics. We look to intelligent software to automate routine labor, make predictions in finance and diagnoses in medicine, and to support basic scientific research. This book is about deep learning, an approach to AI based on enabling computers to learn from experience and understand the world in terms of a hierarchy of concepts, with each concept defined in terms of of its relation to simpler concepts.

Many of the early successes of AI took place in relatively sterile and formal envi-ronments and did not require computers to have much knowledge about the world . For example, IBM’s Deep Blue chess-playing system defeated world champion Garry Kasparov in 1997 (Hsu, 2002). Chess is of course a very simple world, containing only sixty-four locations and thirty-two pieces that can move in only rigidly circumscribed ways. Devising a successful chess strategy is a tremendous intellectual accomplishment, but does not require much knowledge about the agent’s environment. The environment can be described by a very brief list of rules, easily provided ahead of time by the programmer.

Ironically, abstract and formal tasks such as chess that are among the most diﬃcult mental undertakings for a human being are among the easiest for a computer. A person’s everyday life requires an immense amount of knowledge about the world, and much of this knowledge is subjective and intuitive, and therefore diﬃcult to articulate in a formal way.

Several artiﬁcial intelligence projects have sought to hard-code knowledge about the world in formal languages. A computer can reason about statements in these formal languages automatically using logical inference rules. None of these projects has lead to a major success. One of the most famous such projects is Cyc. Cyc is an inference engine and a database of statements in a language called CycL. These statements are entered by a staﬀ of human supervisors. It is an unwieldy process. People struggle

(10)

to devise formal rules with enough complexity to accurately describe the world. For example, Cyc failed to understand a story about a person named Fred shaving in the morning (Linde, 1992). Its inference engine detected an inconsistency in the story—it knew that people do not have electrical parts, but because Fred was holding an electric razor, it believed the entity “FredWhileShaving” contained electrical parts. It therefore asked whether Fred was still a person while he was shaving.

The diﬃculties faced by systems relying on hard-coded knowledge suggest that AI systems need the ability to acquire their own knowledge, by extracting patterns from raw data. This capability is known as machine learning. The introduction of machine learning allowed computers to tackle problems involving knowledge of the real world and make decisions that appear subjective. A simple machine learning algorithm called

logistic regression can determine whether to recommend cesarean delivery (Mor-Yosef et al., 1990). A simple machine learning algorithm called naive Bayes can separate legitimate e-mail from spam e-mail.

The performance of these simple machine learning algorithms depends heavily on the representation of the data they are given. For example, when logistic regression is used to recommend cesarean delivery, the AI system does not examine the patient directly. Instead, the doctor tells the system several pieces of relevant information, such as the presence or absence of a uterine scar. Each piece of information included in the representation of the patient is known as a feature. Logistic regression learns how each of these features of the patient correlates with various outcomes. However, it cannot learn what features are useful, nor can it observe the features itself. If logistic regression was given a 3-D MRI image of the patient, rather than the doctor’s formalized report, it would not be able to make useful predictions. Individual voxels in an MRI scan have negligible correlation with any complications that might occur during delivery.

This dependence on representations is a general phenomenon that appears through-out computer science and even daily life. In computer science, operations such as search-ing a collection of data can proceed exponentially faster if the collection is structured and indexed intelligently. People can easily perform arithmetic on Arabic numerals, but ﬁnd arithmetic on Roman numerals much more time consuming. It is not surprising that the choice of representation has an enormous eﬀect on the performance of machine

(11)

learning algorithms.

Example of representation

Examples can be represented in diﬀerent ways, but some representations make it easier for machine learning algorithms to capture the knowledge they provide. For example, a number can be represented by its binary encoding (with bits),n

by a single real-valued scalar or by its one-hot encoding (with 2n_{bits of which}

only one is turned on). In many cases the compact binary representation is a poor choice for learning algorithms, because two very nearby values (like 3, encoded as binary 00000011, and 4, encoded as binary 00000100) have no digits in common while two values that are very different (like binary 10000001 = 129 and binary 00000001 = 1) only differ by one digit. This makes it difficult for the learner to generalize from examples that involve a particular numeric value to examples that involve just a small numerical change, whereas in many application we expect that what is true for input x is often true for input +x 

for a small . This is called the smoothness prior and is exploited in most

applications of machine learning that involve real numbers, and to some extent other data types in which some meaningful notion of similarity can be defined. Many artificial intelligence tasks can be solved by designing the right set of features to extract for that task, then providing these features to a simple machine learning algorithm. For example, a useful feature for speaker identification from sound is the pitch. The pitch can be formally specified—it is the lowest frequency major peak of the spectrogram. It is useful for speaker identification because it is determined by the size of the vocal tract, and therefore gives a strong clue as to whether the speaker is a man, woman, or child.

However, for many tasks, it is difficult to know what features should be extracted. For example, suppose that we would like to write a program to detect cars in photographs. We know that cars have wheels, so we might like to use the presence of a wheel as a feature. Unfortunately, it is difficult to describe exactly what a wheel looks like in terms of pixel values. A wheel has a simple geometric shape but its image may be complicated by shadows falling on the wheel, the sun glaring off the metal parts of the wheel, the fender of the car or an object in the foreground obscuring part of the wheel, and so on. One solution to this problem is to use machine learning to discover the tions as well. This approach is known as representation learning. Learned representa-tions often result in much better performance than can be obtained with hand-designed representations. They also allow AI systems to rapidly adapt to new tasks, with minimal human intervention. A representation learning algorithm can discover a good set of fea-tures for a simple task in minutes, or a complex task in months. Hand-designing feafea-tures for a complex task can take decades for an entire community of human researchers.

When designing features or algorithms for learning features, our goal is usually to separate thefactors of variationthat explain the observed data. In this context, we use

(12)

the word “factors” simply to refer to separate sources of inﬂuence; the factors are usually not combined by multiplication. Such factors are often not quantities that are directly observed but they exist in the minds of humans as explanations or inferred causes of the observed data. They can be thought of as concepts or abstractions that help us make sense of the rich variability in the data. When analyzing a speech recording, the factors of variation include the speaker’s age and sex, their accent, and the words that they are speaking. When analyzing an image of a car, the factors of variation include the position of the car, its color, and the angle and brightness of the sun.

A major source of difficulty in many real-world artificial intelligence applications is that most of the factors of variation influence every single piece of data we are able to observe. The individual pixels in an image of a red car might be very close to black at night. The shape of the car’s silhouette depends on the viewing angle. Most applications require us to disentanglethe factors of variation and discard the ones that we do not care about.

Of course, it can be very difficult to extract such high-level, abstract features from raw data. Many of these factors of variation, such as a speaker’s accent, also require sophisticated, nearly human-level understanding of the data. When it is nearly as diffi-cult to obtain a representation as to solve the original problem, representation learning does not, at first glance, seem to help us.

Deep learning solves this central problem in representation learning by introducing

representations that are expressed in terms of other, simpler representations. Deep learning allows the computer to build complex concepts out of simpler concepts. Fig. 1.1 shows how a deep learning system can represent the concept of an image of a person by combining simpler concepts, such as corners and contours, which are in turn deﬁned in terms of edges.

Another perspective on deep learning is that it allows the computer to learn a multi-step computer program. Each layer of the representation can be thought of as the state of the computer’s memory after executing another set of instructions in parallel. Networks with greater depth can execute more instructions in sequence. Being able to execute instructions sequentially oﬀers great power because later instructions can refer back to the results of earlier instructions. According to this view of deep learning, not all of the information in a layer’s representation of the input necessarily encodes factors of variation that explain the input. The representation is also used to store state information that helps to execute a program that can make sense of the input.

“Depth” is not a mathematically rigorous term in this context; there is no formal definition of deep learning. All approaches to deep learning share the idea of nested representations of data, but different approaches view depth in different ways. For some approaches, the depth of the system is the depth of the flowchart describing the com-putations needed to produce the final representation. The depth corresponds roughly to the number of times we update the representation. Other approaches consider depth to be the depth of the graph describing how concepts are related to each other. In this case, the depth of the flow-chart of the computations needed to compute the represen-tation of each concept may be much deeper than the graph of the concepts themselves. This is because the system’s understanding of the simpler concepts can be refined given

(13)

(b)

(c)

features without feature scale clipping. Note

). (c): Our 1st layer features. The

2012

e features and fewer “dead” features. (d):

Visualizatio n s of our 2nd layer featu res. These

(a)

_(b)

(c)

yer features without feature scale clipping.

,

). (c): Our 1st layer features.

et al. 2012

distin c t ive features and fewer “dead” features.

Visualization s of our 2nd layer features . These

(a)

_(b)

(c)

yer features without feature scale clipping.

,

). (c): Our 1st layer features.

al. 2012

distin c t ive features and fewer “dead” features.

Visualization s of our 2nd layer features . These

Visualizing and Understanding Convolu

Visualizing and Understanding Convolutional

Visible layer (input pixels) 1st hidden layer

(edges)

Figure 2. Visualization of features in a ful ly trained model. For layers 2-5 we show the top

of feature maps across the validation data, projected down to pixel space using our decon

Our reconstructions are

not

samples from the model: they are reconstructed patte rn s fro

high activations in a given feat u re map. For each feature map we also show the corresp

(i) the the strong grouping within each fe at u re map, (ii) greater invariance at higher

2nd hidden layer (corners and

contours) 3rd hidden layer

(object parts)

CAR PERSON ANIMAL Output

(object identity)

Figure 1.1: Illustration of a deep learning model. It is difficult for a computer to under-standing of the meaning of raw sensory input data, such as this image represented as a collection of pixel values. The function mapping from a set of pixels to an object identity is very complicated. Learning or evaluating this mapping seems insurmountable if tack-led directly. Deep learning resolves this difficulty by breaking the desired complicated mapping into a series of nested simple mappings, each described by a different layer of the model. The input is presented at the visible layer. Then a series ofhidden layers ex-tracts increasingly abstract features from the image. The images here are visualizations of the kind of feature represented by each hidden unit. Given the pixels, the first layer can easily identify edges, by comparing the brightness of neighboring pixels. Given the first hidden layer’s description of the edges, the second hidden layer can easily search for corners and extended contours, which are recognizable as collections of edges. Given the second hidden layer’s description of the image in terms of corners and contours, the third hidden layer can detect entire parts of specific objects, by finding specific collections of contours and corners. Finally, this description of the image in terms of the object parts it contains can be used to recognize the objects present in the image. Images provided by Zeiler and Fergus (2014).

(14)

AI

Machine learning

Representation learning

Deep learning

Example:

Knowledge

bases

Example:

Logistic

regression

Example:

Autoencoders

Example:

MLPs

Figure 1.2: A Venn diagram showing how deep learning is a kind of representation learning, which is in turn a kind of machine learning, which is used for many but not all approaches to AI. Each section of the Venn diagram includes an example of an AI technology.

information about the more complex concepts. For example, an AI system observing an image of a face with one eye in shadow may initially only see one eye. After detecting that a face is present, it can then infer that a second eye is probably present as well.

To summarize, deep learning, the subject of this book, is an approach to AI. Specif-ically, it is a type of machine learning, a technique that allows computer systems to improve with experience and data. Machine learning is the only viable approach to building AI systems that can operate in complicated, real-world environments. Deep learning is a particular kind of machine learning that achieves great power and flexibility by learning to represent the world as a nested hierarchy of concepts, with each concept defined in relation to simpler concepts. Fig. 1.2 illustrates the relationship between these different AI disciplines. Fig. 1.3 gives a high-level schematic of how each works.

Deep learning is the subject of this book. It involves learning multiple levels of

(15)

Input Hand-designed program Output Input Hand-designed features Mapping from features Output Input Features Mapping from features Output Input Simplest features Mapping from features Output Most complex features Rule-based systems Classic machine learning Representation learning Deep learning

Figure 1.3: Flow-charts showing how the diﬀerent parts of an AI system relate to each other within diﬀerent AI disciplines. Shaded boxes indicate components that are able to learn from data.

(16)

there has been a tremendous impact of research on deep learning, both at an academic level and in terms of industrial breakthroughs.

Representation learning algorithms can either be supervised, unsupervised, or a combination of both (semi-supervised). These notions are explained in more detail in Chapter 5, but we introduce them brieﬂy here. Supervised learning requires examples that include both an input and a target output, the latter being generally interpreted as what we would have liked the learner to produce as output, given that input. Such examples are called labeled examples because the target output often comes from a human providing that “right answer”. Manual labeling can be tedious and expensive, and a lot more unlabeled data than labeled data are available. Unsupervised learning allows a learner to capture statistical dependencies present in unlabeled data, while semi-supervised learning combines labeled examples and unlabeled examples. The fact that several deep learning algorithms can take advantage of unlabeled examples can be an important advantage, discussed much in this book, in particular in Chapter 10 as well as in Section 1.5 of this introductory chapter.

Deep learning has not only changed the field of machine learning and influenced our understanding of human perception, it has revolutionized application areas such as speech recognition and image understanding. Companies such as Google, Microsoft, Facebook, IBM, NEC, Baidu and others have all deployed products and services based on deep learning methods, and set up research groups to take advantage of deep learning. Deep learning has allowed to win international competitions in object recognition 1. Deep unsupervised learning (trying to capture the input distribution, and relying less on having many labeled examples) performs exceptionally well in a transfer context, when the model has to be applied on a distribution that is a bit different from the training distribution2. With large training sets, deep supervised learning has been the most impressive, as recently shown with the outstanding breakthrough achieved by Geoff Hinton’s team on the ImageNet object recognition 1000-class benchmark, bringing down the state-of-the-art error rate from 26.1% to 15.3% (Krizhevsky et al., 2012a). On another front, whereas speech recognition error rates kept decreasing in the 90’s (thanks mostly to better systems engineering, larger datasets, and larger HMM models), performance of speech recognition systems had stagnated in the 2000-2010 decade, until the advent of deep learning (Hintonet al., 2012). Since then, thanks to large and deep architectures, the error rates on the well-known Switchboard benchmark have dropped by

about half! As a consequence, most of the major speech recognition systems (Microsoft,

Google, IBM, Apple) have incorporated deep learning, which has become a de facto standard at conferences such as ICASSP.

1

J¨urgen Schmidhuber’s lab at IDSIA has won many such competitions, see http://www.idsia.ch/ ~juergen/deeplearning.html, and Hinton’s U. Toronto group and Fergus’ NYU group have respec-tively won the ImageNet competitions in 2012 and 2013(Krizhevsky et al., 2012a), see http://www. image-net.org/challenges/LSVRC/2012/andhttp://www.image-net.org/challenges/LSVRC/2013/. We have used deep learning to win an international competition in computer vision focused on the detection of emotional expression from videos and audio (Ebrahimiet al., 2013).

2_{Teams from Yoshua Bengio’s lab won the Transfer Learning Challenge (results at ICML 2011} work-shop) (Mesnil et al., 2011) and the NIPS’2011 workshop Transfer Learning Challenge (Goodfellow et al., 2011).

(17)

1.1 Who should read this book?

This book can be useful for a variety of readers, but the main target audiences are the university students (undergraduate or graduate) learning about machine learning and the engineers and practitioners of machine learning, artiﬁcial intelligence, data-mining and data science aiming to better understand and take advantage of deep learning. Ma-chine learning is successfully applied in many areas, including computer vision, natural language processing, robotics, speech and audio processing, bioinformatics, video-games, search engines, online advertising and many more. Deep learning has been most suc-cessful in traditional AI applications but is expanding in other areas, such as modeling molecules, customers or web pages.

Knowledge of basic concepts in machine learning will be very helpful to absorb the concepts in this book, although the book will attempt to explain these concepts intu-itively (and sometimes formally) when needed. Similarly, knowledge of basic concepts in probability, statistics, calculus, linear algebra, and optimization will be very useful, although the book will brieﬂy explain the required concepts as needed, in particular with science and programming will be mostly useful to understand and modify the code provided in the practical exercises associated with this book, in the Python language and based on the Pylearn2 machine learning and deep learning library.

Since much science remains to be done in deep learning, many practical aspects of these algorithms can be seen as tricks, practices that have been found to work (at least in some contexts) while we do not have complete explanatory theories for them. This book will also spell out these practical guidelines, although the reader is invited to question them and even to ﬁgure out the reasons of their successes and failures. A live online resource http://www.deeplearning.net/book/guidelines allows practioners and researchers to share their questions and experience and keep abreast of developments in the art of deep learning. Keep in mind that science is not frozen but evolving, because we dare question the received wisdom, and readers are invited to contribute to this exciting expansion and clariﬁcation of our knowledge.

1.2 Machine Learning

This section introduces a few machine learning concepts, while a deeper treatment of this branch of knowledge at the intersection of artiﬁcial intelligence and statistics can be found in Chapter 5. A machine learning algorithm (or learner), sees training examples, each of which can be thought of as an assignment of values to some variables (which we call random variables3_{). A learner uses these examples to build a function or a model}

from which it can answer questions about these random variables or perform some useful computation on their values. By analogy, human brains (whose learning algorithm we do not perfectly understand but would like to decipher) see as training examples the sequence of experiences of their life, and the observed random variables are what arrives

3_{Informally, a random variable is a variable which can take diﬀerent values, with some uncertainty} about which value will be taken, i.e., its value is not perfectly predictable.

(18)

at their senses as well as the internal reinforcement signals (such as pain or pleasure) that evolution has programmed in us to guide our learning process towards survival and reproduction. Human brains also observe their own actions, which inﬂuence the world around them, and it appears that human brains try to learn the statistical dependencies between these actions and their consequences, so as to maximize future rewards.

What we call a conﬁguration of variables is an assignment of values to the vari-ables. For example if we have 10 binary variables then there are 210 _possible

config-urations of their values. The crux of what a learner needs to do is to guess which configurations of the variables of interest4 are most likely to occur again. And the fun-damental challenge involved is the so-called curse of dimensionality, i.e., the fact that the number of possible configurations of these variables could be astronomical, i.e., exponential in the number of variables involved (which could be for example the thousands of pixels in an image). Indeed, each observed example only tells the learner about one such configuration. To make a prediction regarding never seen configurations, the learner can only make a guess. How could humans, animals or machines possibly have a preference for some of these unseen configurations, after seeing some positive examples, a very small fraction of all the possible configurations? If the only informa-tion available to make this guess comes from the examples, then not much could be said about new unseen configurations except that they should be less likely than the observed ones. However, every machine learning algorithm incorporates not just the information from examples but also somepriors, which can be understood as knowledge about the world that has been built up from previous experience, either personal experience of the learner, or from the prior experience of other learners, e.g., through biological or cultural evolution, in the case of humans. These priors can be combined with the given training examples to form better predictions. Bayesian machine learning attempts to formalize these priors as probability distributions and once this is done, Bayes theorem and the laws of probability (discussed in Chapter 3) dictates what the right predictions should be. However, the required calculations are generally intractable and it is not always clear how to formalize all our priors in this way. Many of the priors associated with different learning algorithms are implicit in the computations performed, which can sometimes be viewed as approximations aimed at reconciling the information in the priors and the information in the data, in a way that is computationally tractable. Indeed, machine learning needs to deal not just with priors but also with computation. A practical learning algorithm must not just make good predictions but also do it quickly enough, with reasonable computing resources spent during training and at the time of taking decisions.

The best studied machine learning task is the problem of supervised classifica-tion. The examples are configurations of input variables along with a target category. The learner’s objective is typically to classify new input configurations, i.e., to predict which of the categories is most likely to be associated with the given input values. A

4

At the lowest level, the variables of interest would be the externally observed variables (the sensor readings both from the outside world and from the body), actions and rewards, but we will see in this book that it can be advantageous to model the statistical structure by introducing internal variables or internal representations of the observed variables.

(19)

central concept in machine learning is generalization: how good are these guesses made on new configurations of the observed variables? How many classification errors would a learner make on new examples? A learner that generalizes well on a given dis-tribution makes good guesses about new configurations, after having seen some training examples. This concept is related to another basic concept: capacity. Capacity is a measure of the flexibility of the learner, essentially the number of training examples that it could always learn perfectly. Machine learning theory has traditionally focused on the relationship between generalization and capacity. Overfitting occurs when capacity is too large compared to the number of examples, so that the learner does a good job on the training examples (it correctly guesses that they are likely configurations) but a very poor one on new examples (it does not discriminate well between the likely config-urations and the unlikely one). Underfittingoccurs when instead the learner does not have enough capacity, so that even on the training examples it is not able to make good guesses: it does not manage to capture enough of the information present in the training examples, maybe because it does not have enough degrees of freedom to fit all the train-ing examples. Whereas theoretical analysis of overfitttrain-ing is often of a statistical nature (how many examples do I need to get a good generalization with a learner of a given capacity?), underfitting has been less studied because it often involves the computa-tional aspect of machine learning. The main reason we get underfitting (especially with deep learning) is not that we choose to have insufficient capacity but because obtaining high capacity in a learner that has strong priors often involves difficult numerical op-timization. Numerical optimization methods attempt to find a configuration of some variables (often called parameters, in machine learning) that minimizes or maximizes some given function of these parameters, which we call objective function or train-ing criterion. Durtrain-ing traintrain-ing, one typically iteratively modifies the parameters so as to gradually minimize the training criterion (for example the classification error). At each step of this adaptive process, the learner changes slightly its parameters so as to make better guesses on the training examples. This of course does not in general guarantee that future guesses on novel test exampleswill be good, i.e., we could be in an overfitting situation. In the case of most deep learning algorithms, this difficulty in optimizing the training criterion is related to the fact that it is not convex in the parameters of the model. It is even the case that for many models (such as most neural network models), obtaining the optimal parameters can be computationally intractable (in general requiring computation that grows exponentially with the number of parame-ters). It means that approximate optimization methods (which are often iterative) must be used, and such methods often get stuck in what appear to belocal minima of the training criterion5. We believe that the issue of underfitting is central in deep learning algorithms and deserves a lot more attention from researchers.

Another central concept in modern machine learning is probability theory. Be-cause the data give us information about random variables, probability theory provides a natural language to describe their uncertainty and many learning algorithms (in-5_{A local minimum is a conﬁguration of the parameters that cannot be improved by small changes} in the parameters, so if the optimization procedure is iterative and operates by small changes, training appears stuck and unable to progress.

(20)

cluding most described in this book) are formalized as means to capture a probability distribution over the observed variables. The probability assigned by a learner to a configuration quantifies how likely it is to encounter that configuration of variables. The classical means of training a probabilistic model involve the definition of the model as a family of probability functions indexed by some parameters, and the use of the maximum likelihood criterion6 (or a variant that incorporates some priors) to de-fine towards what objective to optimize these parameters. Unfortunately, for many probabilistic models of interest (that have enough capacity and expressive power), it is computationally intractable to maximize the likelihood exactly, and even computing its gradient7 _{is generally intractable. This has led to a variety of practical learning}

algorithms with diﬀerent ways to bypass these obstacles, many of which are described in this book.

Another important machine learning concept that turns out to be important to un-derstand many deep learning algorithms is that of manifold learning. The manifold learning hypothesis (Cayton, 2005; Narayanan and Mitter, 2010) states that probability is concentrated around regions called manifolds, i.e., that most configurations are un-likely and that probable configurations are neighbors of other probable configurations. We define the dimensionof a manifold as the number of orthogonal directions in which one can move and stay among probable configurations8_{. This hypothesis of probability}

concentration seems to hold for most AI tasks of interest, as can be verified by the fact that most configurations of input variables are unlikely (pick pixel values randomly and you will almost never obtain a natural-looking image). The manifold hypothesis also states that small changes (e.g. translating an input image) tend to leave unchanged categorical variables (e.g., object identity) and that there are much fewer such local degrees of fredom (manifold dimensions) than the overall input dimension (the number of observed variables). The associated natural clustering hypothesis assumes that different classes correspond to different manifolds, well-separated by vast zones of low probability. These ideas turn out to be very important to understand the basic concept of representation associated with deep learning algorithms, which may be understood as a way to specify a coordinate system along these manifolds, as well as telling to which manifold the example belongs. These manifold learning ideas turn out as well to be im-portant for understanding the mechanisms by which regularized auto-encoders capture both the unknown manifold structure of the data (Chapter 14), and the underlying data generating distribution (Chapter 17.9).

6_{which is simply the probability that the model assigns to the whole training set} 7

the direction in which parameters should be changed in order to slightly improve the training criterion

8_{The manifold hypothesis as stated here and elsewhere should be adapted to replace the discrete} notion of number of dimensions by something smoother, like the eigenspectrum, which talks about some dominant directions, and the degree (e.g. eigenvalue in the case of Principal Components Analysis) to which some directions matter in explaining the variations in the data.

(21)

1.3 Historical Perspective and Neural Networks

Modern deep learning research takes a lot of its inspiration from neural network research of previous decades. Other major intellectual sources of concepts found in deep learning research include works on probabilistic modeling and graphical models, as well as works on manifold learning.

The starting point of the story, though is the Perceptron and the Adaline, inspired by knowledge of the biological neuron: simple learning algorithms for artificial neu-ral networks were introduced around 1960 (Rosenblatt, 1958; Widrow and Hoff, 1960; Rosenblatt, 1962), leading to much research and excitement. However, after the initial enthusiasm, research progress reached a plateau due to the inability of these simple learning algorithms to learn representations (i.e., in the case of an artificial neural network, to learn what the intermediate layers of artificial neurons called hidden layers -should represent). This limitation to learning linear functions of fixed features led to a strong reaction (Minsky and Papert, 1969) and the dominance of symbolic computation and expert systems as the main approaches AI in the late 60’s, 70’s and early 80’s. In the mid-1980’s, a revival of neural networks research took place thanks to the back-propagation algorithm for learning one or more layers of non-linear features (Rumelhart

et al., 1986a; LeCun, 1987). This second bloom of neural network research took place

in the decade up to the mid-90’s, at which point many overly strong claims were made about neural nets, mostly with the objective to attract investors and funding. In the 90’s and 2000’s, other approaches to machine learning dominated the field, especially those based on kernel machines (Boseret al., 1992; Cortes and Vapnik, 1995; Schölkopf et al., 1999) and those based on probabilistic approaches, also known as graphical models (Jor-dan, 1998). In practical applications, simple linear models with labor-intensive design of hand-crafted features dominated the field. Kernel machines were found to performed as well or better while being more convenient to train (with less hyper-parameters, i.e., knobs to tune by hand) and affording easier mathematical analysis coming from con-vexity of the training criterion: by the end of 1990’s, the machine learning community had largely abandoned artificial neural networks in favor of these more limited methods despite the impressive performance of neural networks on some tasks (LeCun et al., 1998a; Bengioet al., 2001a).

In the early years of this century, the groups at Toronto, Montreal, and NYU (and shortly after, Stanford) worked together under a Canadian long-term research initiative (the Canadian Institute For Advanced Research – CIFAR) to break through two of the limitations of old-day neural networks: unsupervised learning and the difficulty of training deep networks. Their work initiated a new wave of interest in artificial neural networks by introducing a new way of learning multiple layers of non-linear features without requiring vast amounts of labeled data. Deep architectures had been proposed before (Fukushima, 1980; LeCun et al., 1989; Schmidhuber, 1992; Utgoff and Stracuzzi, 2002) but without major success to jointly train a deep neural network with many layers except in the case of convolutional architectures (LeCun et al., 1989, 1998b), covered in Chapter 12. The breakthrough came from a semi-supervised procedure: using unsupervised learning to learn one layer of features at a time and then fine-tuning

(22)

the whole system with labeled data (Hinton et al., 2006; Bengio et al., 2007; Ranzato

et al., 2007), described in Chapter 10. This initiated a lot of new research and other

ways of successfully training deep nets emerged. Even though unsupervised pre-training is sometimes unnecessary for datasets with a very large number of labels, it was the early success of unsupervised pre-training that led many new researchers to investigate deep neural networks. In particular, the use of rectiﬁers (Nair and Hinton, 2010b) as non-linearity and appropriate initialization allowing information to ﬂow well both forward (to produce predictions from input) and backward (to propagate error signals) were later shown to enable training very deep supervised networks (Glorot et al., 2011) without unsuperviseid pre-training.

1.4 Recent Impact of Deep Learning Research

Since 2010, deep learning has had spectacular practical successes. It has led to much better acoustic models that have dramatically improved the state of the art in speech recognition. Deep neural nets are now used in deployed speech recognition systems in-cluding voice search on the Android (Dahl et al., 2010; Deng et al., 2010; Seide et al., 2011; Hinton et al., 2012). Deep convolutional nets have led to major advances in the state of the art for recognizing large numbers of different types of objects in images (now deployed in Google+ photo search). They have also had spectacular successes for pedestrian detection and image segmentation (Sermanet et al., 2013; Farabet et al., 2013; Couprie et al., 2013) and yielded superhuman performance in traffic sign classi-fication (Ciresan et al., 2012). An organization called Kaggle runs machine learning competitions on the web and deep learning has had numerous successes in these com-petitions9 10.

The number of research groups doing deep learning has grown from just 3 in 2006 (Toronto, Montreal, NYU, all within NCAP) to 4 in 2007 (+Stanford), and to more than 26 in 2013. 11 _{Accordingly, the number of papers published in this area has}

skyrocketed. Before 2006, it had become very difficult to publish any paper having to do with artificial neural networks at NIPS or ICML (the leading machine learning conferences). In the last few years, deep learning has been added as a keyword or area for submitted papers and sessions at NIPS and ICML. At ICML 2013 there was almost a whole day devoted to deep learning papers. The first deep learning workshop was co-organized by Yoshua Bengio, Yann LeCun, Ruslan Salakhutdinov and Hugo Larochelle at NIPS’2007, in an unofficial session sponsored by CIFAR because the NIPS workshop organizers had rejected the workshop proposal. It turned out to be the most popular workshop that year (with around 200 participants). That popularity has continued year after year, with Yoshua Bengio co-organizing almost all of the deep learning workshops at NIPS or ICML since then. There are now even multiple workshops on deep learning subjects (such as specialized workshops on the application of deep learning to speech or to natural language).

9_{http://blog.kaggle.com/2012/11/01/deep-learning-how- i-did-it-merck-1st-place-interview} 10

http://www.nytimes.com/2012/11/24/science/scientists- see- advances- in- deeplearning- a- part- of-artificial- intelligence.html 11_{http://deeplearning.net/deep-learning-research- groups-and-labs}

(23)

This has led Yann LeCun and Yoshua Bengio to create a new conference on the

subject. They called it the International Conference on Learning

Representa-tions (ICLR) , to broaden the scope from just deep learning to the more general subject of representation learning (which includes topics such as sparse coding, that learns shallow representations, because shallow representation-learners can be used as building blocks for deep representation-learners). The ﬁrst ICLR was ICLR’2013, and was a frank success, attracting more than 110 participants (almost twice as much as the defunct conference which ICLR replaced, the Learning Conference). It was also an opportunity to experiment with a novel reviewing and publishing model based on open

reviews and openly visible submissions (as arXiv papers), with the objective of achieving a faster dissemination of information (not keeping papers hidden from view while being evaluated) and a more open and interactive discussion between authors, reviewers and non-anonymous spontaneous commenters.

The general media and media specializing in information technology have picked up on deep learning as an exciting new technology since 2012, and we have even seen it covered on television in 2013, (on NBC12). It started with two articles in the New York Times in 2012, both covering the work done at Stanford by Andrew Ng and his group (part of NCAP). In March 2013 it was announced (in particular in Wired13) that Google acqui-hired Geoﬀ Hinton, Ilya Sutskever, and Alex Krizhevsky to help them “supercharge” machine learning applications at Google. In April 2013, the MIT Technology Review published their annual list of 10 Breakthrough Technologies and they put deep learning ﬁrst on their list. This has stirred a lot of echoes and discussions around the web, including an interview of Yoshua Bengio for Wired14_{, and there have}

since been many pieces, not just in technology oriented media but also, for example, in business oriented media like Forbes15. The latest news as of this writing is that after the important investments from Google, Microsoft and Baidu, Facebook is creating a new research group devoted to deep learning 16_{, led by Yann LeCun.}

1.5 Challenges for Future Research

In spite of all the successes of deep learning to date and the obvious fact that deep learning already has a great industrial impact, there is still a huge gap between the in-formation processing and perception capabilities of even simple animals and our current technology. Understanding more of the principles behind such information processing

architectures will still require much more basic science, in the same way that the

dis-covery of the currently understood learning principles behind deep learning have. This kind of knowledge will probably lead to big changes in how we build and deploy infor-mation technology in the future. While the novel techniques that have been developed by deep learning researchers achieved impressive progress, training deep networks to

12_{http://video.cnbc.com/gallery/?play=1&video=3000192292} 13_{http://www.wired.com/wiredenterprise/2013/03/google_hinton/} 14_{http://www.wired.com/wiredenterprise/2013/06/yoshua-bengio}

15_{http://www.forbes.com/sites/netapp/2013/08/19/what-is-deep-learning/}

(24)

learn meaningful representations is not yet a solved issue. Moreover while we believe that deep learning is a crucial component, it is not a complete solution to AI. In this book, we review the current state of knowledge on deep learning, and we also present our ideas for how to move beyond current deep learning methods to help us move closer to achieving human-level AI. We identify some major qualitative deﬁciencies in existing deep learning systems, and propose general ideas for how to face them. Success on any of these fronts would represent a major conceptual leap forward on the path to AI.

In the examples of outstanding applications of deep learning described above, the impressive breakthroughs have mostly been achieved with supervised learning techniques for deep architectures17_{. We believe that some of the most important future progress}

in deep learning will hinge on achieving a similar impact in the unsupervised18 _and

semi-supervised19cases.

However, even in the supervised case, there are signs that the techniques used in the regime of neural networks with a few thousand neurons per layer encounter training diﬃculties when using them on networks that have 10 times more units per layer and 100 times more connections, making it diﬃcult to exploit the increased model size (Dauphin and Bengio, 2013a). Even though the scaling behavior of stochastic gradient descent is theoretically very good in terms of computations per update, these observations suggest a numerical optimization challenge that must be addressed.

In addition to these numerical optimization diﬃculties, scaling up large and deep neural networks as they currently stand would require a substantial increase in com-puting power, which remains a limiting factor of our research. To train much larger models with the current hardware (or the hardware likely to be available in the next few years) will require a change in design and/or the ability to eﬀectively exploit paral-lel computation. These raise non-obvious questions where fundamental research is also needed.

Furthermore, some of the biggest challenges remain in front of us regarding unsu-pervised deep learning. Powerful unsuunsu-pervised learning is important for many reasons:

• It allows a learner to take advantage of unlabeled data. Most of the data available

to machines (and to humans and animals) is unlabeled, i.e., without a precise and symbolic characterization of its semantics and of the outputs desired from a learner. Humans and animals are also motivated, and this guides research into learning algorithms based on a reinforcement signal, which is much weaker than the signal required for supervised learning.

• Unsupervised learning allows a learner to capture enough information about the

observed variables so as to be able to answer new questions about them in the future, that were not anticipated at the time of seeing the training examples.

• It has been shown to be a good regularizer for supervised learning (Erhan et al., 2010), meaning that it can help the learner generalize better, especially when the 17

where the training examples are (input,target output) pairs, with the target output often being a human-produced label.

18_{no labels at all, only unlabeled examples} 19_{a mix of labeled and unlabeled examples}

(25)

number of labeled examples is small. This advantage clearly shows up in practical applications (e.g., the transfer learning competitions won by NCAP members with unsupervised deep learning (Bengio, 2011; Mesnil et al., 2011; Goodfellow et al., 2011)) where the distribution changes or new classes or domains are considered (transfer learning, domain adaptation), when some classes are frequent while many others are rare (fat tail or Zipf distribution), or when new classes are shown with zero, one or very few examples (zero-shot and one-shot learning (Larochelle et al., 2008; Lake et al., 2013)).

• There is evidence suggesting that unsupervised learning can be successfully achieved

mostly from a local training signal (as indicated by the successes of the unsuper-vised layer-wise pre-training procedures (Bengio, 2009a), semi-superunsuper-vised embed-ding (Weston et al., 2008), and intermediate-level hints (Gulcehre and Bengio, 2013)), i.e., that it may suﬀer less from the diﬃculty of propagating credit across a large network, which has been observed for supervised learning. Local train-ing algorithms are also much more likely to be amenable to fast hardware-based implementations20

• A recent trend in machine learning research is to apply machine learning to

prob-lems where the output variable is very high-dimensional (structured output mod-els), instead of just a few numbers or classes. For example, the output could be a sentence or an image. In that case, the mathematical and computational is-sues involved in unsupervised learning also arise, because there is an exponentially large number of configurations of the output values that need to be considered (to sum over them, when computing probability gradients, or to find an optimal configuration, when taking a decision).

To summarize, some of the challenges we view as important for future breakthroughs in deep learning are the following:

• How should we deal with the fundamental challenges behind unsupervised learning,

such as intractable inference and sampling? See Chapters 16, 15, and 17.

• How can we build and train much larger and more adaptive and reconﬁgurable deep

architectures, thus maximizing the advantage one can draw from larger datasets? See Chapter 8.

• How can we improve the ability of deep learning algorithms to disentangle the un-derlying factors of variation, or put more simply, make sense of the world around

us? See Chapter 11 on this very basic question about what learning a good repre-sentation is about.

Many other challenges are not discussed in this book, such as the needed integration of deep learning with reinforcement learning, active learning, and reasoning.

20_{See for example DARPA’s UPSIDE program as evidence for recent interest in hardware-enabled} implementations of deep learning.

(26)

Chapter 2 Linear algebra

TODO–redo intro and section headings, haven’t compensated for splitting math chapter yet

In this chapter we present several of the basic mathematical concepts that are nec-essary for a good understanding of deep learning: linear algebra, probability, and nu-merical computation. You may already be familiar with many of these concepts. If so, feel free to skip to the next chapter, or skip individual sections depending on which you are familiar with. This chapter could also be useful as review if you have learned these mathematical concepts before but do not feel especially conﬁdent about some of them. If you are already familiar with these concepts but need a detailed reference sheet to review key formulas, we recommend The Matrix Cookbook (Petersen and Pedersen, 2006), which, despite the name, contains information on commonly used probability distributions as well as linear algebra.

We assume the reader is already familiar with basic algebra and single variable calculus (limits, derivative, and integrals). We also assume some familiarity with basic graph theory. We develop all more advanced concepts within this book.

2.1 Linear Algebra

Linear algebra is the study of systems of linear equations. Linear algebra is of funda-mental importance to most branches of science and engineering. Unfortunately, many software engineers have little exposure to linear algebra, since it is a form of continu-ous math and much of the computer science world emphasizes only discrete math. We therefore provide a brief introduction to linear algebra, highlighting the most important subjects necessary for understanding deep learning. Readers with little or no prior expo-sure to linear algebra are encouraged to also consult another resource focused exclusively on teaching linear algebra, such as (Shilov, 1977). This section of the book aims to present, without proof, as few results from linear algebra as are needed to understand the presentation of the algorithms in the remainder of the book. To read and understand this book, it should be suﬃcient to understand only the topics presented in this chapter. In order to developnew machine learning algorithms, a deeper understanding of linear

(27)

algebra is highly recommended.

2.1.1 Scalars, vectors, matrices and tensors

The study of linear algebra involves several types of mathematical objects:

• Scalars: A scalar is just a single number, in contrast to most of the other objects

studied in linear algebra, which are usually arrays of multiple numbers. We write scalars in italics. We usually give scalars lower-case variable names. When we introduce them, we specify what kind of number they are. For example, we might say “Let s _{∈ R} be the slope of the line,” while deﬁning a real-valued scalar, or “Let n_{∈ N} be the number of units,” while deﬁning a natural number scalar.

• Vectors: A vector is an array of numbers. The numbers have an order to them, and

we can identify each individual number by its index in that ordering. Typically we give vectors lower case names written in bold typeface, such as x. The elements of the vector are identiﬁed by writing its name in italic typeface, with a subscript. The ﬁrst element of x is x1, the second element is x2, and so on. We also need

to say what kind of numbers are stored in the vector. If each element is in R, and the vector has n elements, then the vector lies in the set formed by taking the Cartesian product of R_n times, denoted as Rn_{. When we need to explicitly} identify the elements of a vector, we write them as a column enclosed in square brackets: x =      x1 x2 .. . xn     .

We can think of vectors as identifying points in space, with each element giving the coordinate along a diﬀerent axis.

Sometimes we need to index a set of elements of a vector. In this case, we deﬁne a set containing the indices, and write the set as a subscript. For example, to access

x1, x3, and x6, we deﬁne the set S = 1 3 6 and write{ , , } xS. We use the − sign

to index the complement of a set. For example x₋₁ is the vector containing all elements of x except forx1, and x−S is the vector containing all of the elements

of xexcept forx₁, x₃, and x₆.

• Matrices: A matrix is a 2-D array of numbers, so each element is identiﬁed by two

indices instead of just one. We usually give matrices upper-case variable names with bold typeface, such as A. If a real-valued matrixA has a height of m and a width of , then we say thatn A ∈ Rm n× _{. We usually identify the elements of a}

matrix with its lower-case name, and the indices are listed with separating commas. For example, a1 1, is the upper left entry of of Aand am,nis the bottom right entry.

We can identify all of the numbers with vertical coordinate i by writing a “:” for the horizontal coordinate. For example, Ai,: denotes the horizontal cross section

(28)

BRIEF ARTICLE

THE AUTHOR

A =

2

4 a

a

1 1

_{2 1}

,

_,

a

1 2

_{2 2}

,

_,

a

_{3 1}

_,

a

_{3 2}

_,

3 5 ) A

>

₌



a

_{1 1}

_,

a

_{2 1}

_,

a

_{3 1}

_,

a

_{1 2}

_,

a

_{2 2}

_,

a

_{3 2}

_,



Figure 2.1: The transpose of the matrix can be thought of as a mirror image across the main diagonal.

of A with vertical coordinate . This is known as the -thi i row of A. Likewise,

A:,iis the -thi column of A. When we need to explicitly identify the elements of

a matrix, we write them as an array enclosed in square brackets: 

a1 1, a1 2,

a2 1, a2 2,



.

Sometimes we may need to index matrix-valued expressions that are not just a single letter. In this case, we use subscripts after the expression, but do not convert anything to lower case. For example, f A)( i,j gives element (i, j) of the

matrix computed by applying the function f toA.

• Tensors: In some cases we will need an array with more than two axes. In the

general case, an array of numbers arranged on a regular grid with a variable number of axes is known as a tensor.

One important operation on matrices is the transpose. The transpose of a matrix is the mirror image of the matrix across a diagonal line running down and to the right, starting from its upper left corner. See Fig. 2.1 for a graphical depiction of this operation. We denote the transpose of a matrixA as A, and it is deﬁned such that

(A)i,j= aj,i.

Vectors can be thought of as matrices that contain only one column. The transpose of a vector is therefore a matrix with only one row. Sometimes we deﬁne a vector by writing out its elements in the text inline as a row matrix, then using the transpose operator to turn it into a standard column vector, e.g. x= [x1, x2, x3].

2.1.2 Multiplying matrices and vectors

One of the most important operations involving matrices is multipying of two matrices. The matrix product of matricesA and B is a third matrix . In order for this productC

(29)

m_×n and B is of shape n_{× , then}p C is of shape m_{× . We can write the matrix}p

product just by placing two or more matrices together, e.g.

C = AB.

The product operation is deﬁned by

ci,j =



k

ai,kbk,j.

Note that the standard product of two matrices is not just a matrix containing the product of the individual elements. Such an operation exists and is called the

elementwise productor Hadamard product, and is denoted in this book1_as_{A B}_◦ _.

The dot product between two vectors x and y of the same dimensionality is the matrix product xy. We can think of the matrix product C =AB as computing ci,j

as the dot product between row ofi A and column j of B.

Matrix product operations have many useful properties that make mathematical analysis of matrices more convenient. For example, matrix multiplication is distributive:

A B( +C) = AB+AC.

It is also associative:

A BC( ) = (AB C) .

Matrix multiplication isnot commutative, unlike scalar multiplication. The transpose of a matrix product also has a simple form:

(AB)= BA.

Since the focus of this textbook is not linear algebra, we do not attempt to develop a comprehensive list of useful properties of the matrix product here, but the reader should be aware that many more exist.

We can also multiply matrices and vectors by scalars. To multiply by a scalar, we just multiply every element of the matrix or vector by that scalar:

sA =    sa1 1, . . . sa1,n .. . . . . ... sam,1 . . . sam,n   

We now know enough linear algebra notation to write down a system of linear equa-tions:

Ax=b (2.1)

where A ∈ Rm n× _{is a known matrix, b} _{∈ R}m _{is a known vector, and x} _{∈ R}n _{is a}

vector of unknown variables we would like to solve for. Each element xi of x is one

1

The elementwise product is used relatively rarely, so the notation for it is not as standardized as the other operations described in this section.

Deep Learning.pdf

Deep Learning

Table of Contents

Acknowledgments

Chapter 1

Deep Learning for AI

(b)

(c)

features without feature scale clipping. Note

). (c): Our 1st layer features. The

2012

e features and fewer “dead” features. (d):

Visualizatio n s of our 2nd layer featu res. These

(a)

(b)

(c)

yer features without feature scale clipping.

,

). (c): Our 1st layer features.

et al. 2012

distin c t ive features and fewer “dead” features.

Visualization s of our 2nd layer features . These

(a)

(b)

(c)

yer features without feature scale clipping.

,

). (c): Our 1st layer features.

al. 2012

distin c t ive features and fewer “dead” features.

Visualization s of our 2nd layer features . These

Visualizing and Understanding Convolu

Visualizing and Understanding Convolutional

Figure 2. Visualization of features in a ful ly trained model. For layers 2-5 we show the top

of feature maps across the validation data, projected down to pixel space using our decon

Our reconstructions are

not

samples from the model: they are reconstructed patte rn s fro

high activations in a given feat u re map. For each feature map we also show the corresp

(i) the the strong grouping within each fe at u re map, (ii) greater invariance at higher

AI

Machine learning

Representation learning

Deep learning

Example:

Knowledge

bases

Example:

Logistic

regression

Example:

Autoencoders

Example:

MLPs

1.1

Who should read this book?

1.2

Machine Learning

1.3

Historical Perspective and Neural Networks

1.4

Recent Impact of Deep Learning Research

1.5

Challenges for Future Research

Chapter 2

Linear algebra

2.1

Linear Algebra

BRIEF ARTICLE

THE AUTHOR

A =

2

4

a

a

1 1

2 1

,

,

a

_(b)

_(b)

_{2 1}

_,

_{2 2}

_,

_{3 1}

_,

_{3 2}

_,

₌

_{1 1}

_,

_{2 1}

_,

_{3 1}

_,

_{1 2}

_,

_{2 2}

_,

_{3 2}

_,