A Tutorial on Deep Learning
Part 2: Autoencoders, Convolutional Neural Networks and Recurrent Neural Networks
Quoc V. Le [email protected] Google Brain, Google Inc.
1600 Amphitheatre Pkwy, Mountain View, CA 94043 October 20, 2015
1 Introduction
In the previous tutorial, I discussed the use of deep networks to classify nonlinear data. In addition to their ability to handle nonlinear data, deep networks also have a special strength in their flexibility which sets them apart from other tranditional machine learning models: we can modify them in many ways to suit our tasks. In the following, I will discuss three most common modifications:
• Unsupervised learning and data compression via autoencoders which require modifications in the loss function,
• Translational invariance via convolutional neural networks which require modifications in the network architecture,
• Variable-sized sequence prediction via recurrent neural networks which require modifications in the network architecture.
The flexibility of neural networks is a very powerful property. In many cases, these changes lead to great improvements in accuracy compared to basic models that we discussed in the previous tutorial.
In the last part of the tutorial, I will also explain how to parallelize the training of neural networks.
This is also an important topic because parallelizing neural networks has played an important role in the current deep learning movement.
2 Autoencoders
One of the first important results in Deep Learning since early 2000 was the use of Deep Belief Networks [15]
to pretrain deep networks. This approach is based on the observation that random initialization is a
bad idea, and that pretraining each layer with an unsupervised learning algorithm can allow for better
initial weights. Examples of such unsupervised algorithms are Deep Belief Networks, which are based on
Restricted Boltzmann Machines, and Deep Autoencoders, which are based on Autoencoders. Although
the first breakthrough result is related to Deep Belief Networks, similar gains can also be obtained later
by Autoencoders [4]. In the following section, I will only describe the Autoencoder algorithm because it is
simpler to understand.
2.1 Data compression via autoencoders
Suppose that I would like to write a program to send some data from my cellphone to the cloud. Since I want to limit my network usage, I will optimize every bit of data that I am going to send. The data is a collection of data points, each has two dimensions. A sample of my data look like the following:
Here, the red crosses are my data points, the horizontal axis is the value of the first dimension and the vertical axis is the value of the second dimension.
Upon visualization, I notice is that the value of the second dimension is approximately twice as much as that of the first dimension. Given this observation, we can send only the first dimension of every data point to the cloud. Then at the cloud, we can compute the value of the second dimension by doubling the value of the first dimension. This requires some computation, and the compression is lossy, but it reduces the network traffic by 50%. And since network traffic is what I try to optimize, this idea seems reasonable.
So far this method is achieved via visualization, but the bigger question is can we do this more system- atically for high-dimensional data? More formally, suppose we have a set of data points {x (1) , x (2) , ..., x (m) } where each data point has many dimensions. The question becomes whether there is a general way to map them to another set of data points {z (1) , z (2) , ..., z (m) }, where z’s have lower dimensionality than x’s and z’s can faithfully reconstruct x’s.
To answer this, notice that in the above process of sending data from my cellphone to the cloud has three steps:
1. Encoding: in my cellphone, map my data x (i) to compressed data z (i) . 2. Sending: send z (i) to the cloud.
3. Decoding: in the cloud, map from my compressed data z (i) back to ˜ x (i) , which approximates the original data.
To map data back and forth more systematically, I propose that z and ˜ x are functions of their inputs, in the following manner:
z (i) = W 1 x (i) + b 1
˜
x (i) = W 2 z (i) + b 2
If x (i) is a two-dimensional vector, it may be possible to visualize the data to find W 1 , b 1 and W 2 , b 2 analytically as the experiment above suggested. Most often, it is difficult to find those matrices using visualization, so we will have to rely on gradient descent.
As our goal is to have ˜ x (i) to approximate x (i) , we can set up the following objective function, which is the sum of squared differences between ˜ x (i) and x (i) :
J (W 1 , b 1 , W 2 , b 2 ) =
m
X
i=1
˜
x (i) − x (i)
2
=
m
X
i=1
W 2 z (i) + b 2 − x (i)
2
=
m
X
i=1
W 2 W 1 x (i) + b 1 + b 2 − x (i)
2
which can be minimized using stochastic gradient descent.
This particular architecture is also known as a linear autoencoder, which is shown in the following network architecture:
In the above figure, we are trying to map data from 4 dimensions to 2 dimensions using a neural network with one hidden layer. The activation function of the hidden layer is linear and hence the name linear autoencoder.
The above network uses the linear activation function and works for the case that the data lie on a linear surface. If the data lie on a nonlinear surface, it makes more sense to use a nonlinear autoencoder, e.g., one that looks like following:
If the data is highly nonlinear, one could add more hidden layers to the network to have a deep autoencoder.
Autoencoders belong to a class of learning algorithms known as unsupervised learning. Unlike super-
vised algorithms as presented in the previous tutorial, unsupervised learning algorithms do not need labeled
information for the data. In other words, unlike in the previous tutorials, our data only have x’s but do
2.2 Autoencoders as an initialization method
Autoencoders have many interesting applications, such as data compression, visualization, etc. But around 2006-2007, researchers [4] observed that autoencoders could be used as a way to “pretrain” neural networks.
Why? The reason is that training very deep neural networks is difficult:
• The magnitudes of gradients in the lower layers and in higher layers are different,
• The landscape or curvature of the objective function is difficult for stochastic gradient descent to find a good local optimum,
• Deep networks have many parameters, which can remember training data and do not generalize well.
The goal of pretraining is to address the above problems. With pretraining, the process of training a deep network is divided in a sequence of steps:
• Pretraining step: train a sequence of shallow autoencoders, greedily one layer at a time, using unsupervised data,
• Fine-tuning step 1: train the last layer using supervised data,
• Fine-tuning step 2: use backpropagation to fine-tune the entire network using supervised data.
While the last two steps are quite clear, the first step needs needs some explanation, perhaps via an example. Suppose I would like to train a relatively deep network of two hidden layers to classify some data. The parameters of the first two hidden layers are W 1 and W 2 respectively. Such network can be pretrained by a sequence of two autoencoders, in the following manner:
More concretely, to train the red neurons, we will train an autoencoder that has parameters W 1 and W 1 0 .
After this, we will use W 1 to compute the values for the red neurons for all of our data, which will then
be used as input data to the subsequent autoencoder. The parameters of the decoding process W 1 0 will
be discarded. The subsequent autoencoder uses the values for the red neurons as inputs, and trains an
autoencoder to predict those values by adding a decoding layer with parameters W 2 0 .
Researchers have shown that this pretraining idea improves deep neural networks; perhaps because pretraining is done one layer at a time which means it does not suffer from the difficulty of full supervised learning. From 2006 to 2011, this approach gained much traction as a scientific endeavor because the brain is likely to unsupervised learning as well. Unsupervised learning is also more appealing because it makes use of inexpensive unlabeled data. Since 2012, this research direction however has gone through a relatively quiet period, because unsupervised learning is less relevant when a lot of labeled data are available. Nonetheless, there are a few recent research attempts to revive this area, for example, using variational methods for probabilistic autoencoders [24].
3 Convolutional neural networks
Since 2012, one of the most important results in Deep Learning is the use of convolutional neural networks to obtain a remarkable improvement in object recognition for ImageNet [25]. In the following sections, I will discuss this powerful architecture in detail.
3.1 Using local networks for high dimensional inputs
In all networks that you have seen so far, every neuron in the first hidden layer connects to all the neurons in the inputs:
This does not work when x is high-dimensional (the number of bubbles is large) because every neuron ends up with many connections. For example, when x is a small image of 100x100 pixels (i.e., input vector has 10,000 dimensions), every neuron has 10,000 parameters. To make this more efficient, we can force each neuron to have a small number of connections to the input. The connection patterns can be designed to fit some structure in the inputs. For example, in the case of images, the connection patterns are that neurons can only look at adjacent pixels in the input image:
We can extend this idea to force local connectivity in many layers, to obtain a deep locally connected network. Training with gradient descent is possible because we can modify the backpropagation algorithm to deal with local connectivity: in the forward pass, we can compute the values of neurons by assuming that the empty connections have weights of zeros; wheareas in the backward pass, we do not need to compute the gradients for the empty connections.
This kind of networks has many names: local networks, locally connected networks, local receptive field
networks. The last name is inspired by the fact that neurons in the brain are also mostly locally connected,
and the corresponding terminology in neuroscience/biology is “local receptive field.”
3.2 Translational invariance with convolutional neural networks
In the previous section, we see that using locality structures significantly reduces the number of connections.
Even further reduction can be achieved via another technique called “weight sharing.” In weight sharing, some of the parameters in the model are constrained to be equal to each other. For example, in the following layer of a network, we have the following constraints w 1 = w 4 = w 7 , w 2 = w 5 = w 8 and w 3 = w 6 = w 9
(edges that have the same color have the same weight):
With these constraints, the model can be quite compact in terms of number of actual parameters. Instead of storing all the weights from w 1 to w 9 , we only need to store w 1 , w 2 , w 3 . Values for other connections can then be derived from these three values.
This idea of sharing the weights resembles an important operation in signal processing known as con- volution. In convolution, we can apply a “filter” (a set of weights) to many positions in the input signals.
In practice, this type of networks also comes with another layer known as the “max-pooling layer.” The max-pooling layer computes the max value of a selected set of output neurons from the convolutional layer and uses these as inputs to higher layers:
An interesting property of this approach is that the output of the max-pooling neurons are invariant to shifts in the inputs. To see this, consider the following two examples: x 1 = [0, 1, 0, 0, 0, 0, 0, ...] and x 2 = [0, 0, 0, 1, 0, 0, 0, ...] which can be thought of as two 1D input images, each with one white dot. Two images are the same, except that in x 2 , the white dot gets shifted two pixels to the right compared to x 1 . We can see that as the dot moves 2 pixels to the right, the value of the first max pooling neuron changes from w 2 to w 5 . But as w 2 = w 5 , the value of the neuron is unchanged:
This property, that the outputs of the system are invariant to translation, is also known as translational
invariance. Translational invariant systems are typically effective for natural data (such as images, sounds
etc.) because a major source of distortions in natural data is typically translation.
This type of networks is also known as convolutional neural networks (sometimes called convnets). The max pooling layer in the convolutional neural networks is also known as the subsampling layer because it dramatically reduces the size of the input data. The max operation can sometimes be replaced by the average operation.
Many recent convolutional neural networks also have another type of layers, called Local Contrast Normalization (LCN) [19]. This layer operates on the outputs of the max-pooling layer. Its goal is to subtract the mean and divide the standard deviation of the incoming neurons (in the same manner with max-pooling). This operation allows brightness invariance, which is useful for image recognition.
We can also construct a deep convolutional neural networks by treating the outputs of a max-pooling layer (or LCN layer) as a new input vector, and adding a new convolutional layer and a new max-pooling layer (and maybe a LCN layer) on top of this vector.
Finally, it is possible to modify the backpropagation algorithm to work with these layers. For example, for the convolutional layer, we can do the following steps:
• In the forward pass, perform explicit computation with all the weights, w 1 , w 2 , ..., w 9 (some of them are the same),
• In the backward pass, compute the gradient ∂w ∂J
1
, ..., ∂w ∂J
9
, for all the weights,
• When updating the weights, use the average of the gradients from the shared weights, e.g., w 1 = w 1 − α( ∂w ∂J
1
+ ∂w ∂J
4
+ ∂w ∂J
7
), w 4 = w 4 − α( ∂w ∂J
1
+ ∂w ∂J
4
+ ∂w ∂J
7
) and w 7 = w 7 − α( ∂w ∂J
1
+ ∂w ∂J
4
+ ∂w ∂J
7