Convolutional Neural Networks and Applications
Feature detectors

[Figure: a net with an Input Layer, a Hidden Layer, and an Output Layer; one hidden unit's weights shown over a 63-pixel input grid, with strong +ve weights on the top row and low/zero weights everywhere else]

What does this unit detect?
• With strong positive weights on the top row and low/zero weights everywhere else, it will send a strong signal for a horizontal line in the top row, ignoring everywhere else.
What does this unit detect?

What features might you expect a good NN to learn, when trained with data like this?
• vertical lines
Handwritten images
Successive layers can learn higher-level features

[Figure: first-layer units detect lines in specific positions over the pixel grid; later units combine them]
• First layer: detect lines in specific positions
• Higher-level detectors: "horizontal line", "RHS vertical line", "upper loop", etc.
Weights determine what patterns (or mixture of patterns) a unit detects.
Motivation: Object Recognition in Vision
• Challenges
• Must deal with very high-dimensional inputs
• 1600 x 1200 pixels = 1.9M inputs, or 3 x 1.9M if RGB pixels
• Can we exploit the 2D topology of pixels (or 3D for video data)?
• Can we build invariance to certain variations we can expect?
• Translations, illumination, etc.
Motivation: Recognizing an Image
• Input is a 5x5 pixel array
• Simple backpropagation net
Motivation: Recognizing an Image with Unknown Location

[Figure: two sets of Hidden Units feeding the Output]
• The object can appear either in the top or in the bottom location of the image
Motivation: Recognizing an Image with Any Location

[Figure: several sets of Hidden Units feeding the Output]
• Each possible location the object can appear in has its own set of hidden units
• Each set detects the same features, except in a different location
• Locations can overlap
Drawbacks of previous neural networks

[Figure: shifting a digit left by 2 pixels changes 154 inputs (77 black to white, 77 white to black)]
• Little or no invariance to shifting, scaling, and other forms of distortion
• The topology of the input data is completely ignored
• Work with raw data
Convolutional Neural Network: Key Idea
Exploit:
1. Structure
2. Local Connectivity
3. Parameter Sharing
to give invariance with far fewer parameters.
Convolutional Neural Network: Local Connectivity
• Each hidden unit is connected only to a subregion (patch) of the input image
• It is connected to all channels
• 1 if grayscale image
• 3 (R, G, B) for a color image
Solves the following problems:
• A fully connected hidden layer would have an unmanageable number of parameters
• Computing the linear activations of the hidden units would be very expensive
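To make the savings concrete, here is a small illustrative count (the 32x32 input and 5x5 patch are assumptions for this example, not figures from the slides):

```python
# Illustrative parameter counts: fully connected vs. locally connected
# vs. shared. Assumed sizes: a 32x32 grayscale input and a hidden layer
# with one unit per pixel position.
H, W = 32, 32
n_inputs = H * W
n_hidden = H * W

fully_connected = n_inputs * n_hidden          # every unit sees every pixel
patch = 5                                      # each unit sees a 5x5 patch
locally_connected = n_hidden * patch * patch   # one 5x5 weight set per unit
shared = patch * patch                         # one 5x5 kernel shared by all units

print(fully_connected)    # 1048576
print(locally_connected)  # 25600
print(shared)             # 25
```

Local connectivity cuts the count by a factor of ~40 here, and parameter sharing (next slide) by another factor of 1024.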
Convolutional Neural Network: Parameter Sharing
• Units organized into the same "feature map/template" share parameters
• In this way all neurons detect the same feature at different positions in the input image
• Hidden units within a feature map cover different positions in the image (feature extraction)
• If a neuron in the feature map fires, this corresponds to a match with the feature at that position
Convolutional Neural Network: Parameter Sharing
How does it help?
• Reduces the number of parameters even further (compared to local connectivity alone)
• Will extract the same features at every position
Convolutional Neural Network: Parameter Sharing
• Each feature map forms a 2D grid of features
• It can be computed with a discrete convolution (∗) of a kernel matrix k_ij, which is the hidden weights matrix W_ij with its rows and columns flipped:

y_j = g_j Σ_i (k_ij ∗ x_i)

• x_i is the i-th channel of the input
• k_ij is the convolution kernel
• g_j is a learned scaling factor
• y_j is the hidden layer (without a bias)
Convolutional Neural Network: Parameter Sharing
• Stride (typically denoted by s in CNNs) decides the spacing between overlapping convolutions
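As a minimal pure-Python sketch of the operation (cross-correlation indexing; flipping the kernel's rows and columns, as described above, turns it into a true convolution; the 5x5 input and diagonal kernel below are made up for the demo):

```python
def conv2d(x, k, stride=1):
    """Valid 2-D cross-correlation with stride; flip k's rows and
    columns first to get a true convolution."""
    H, W = len(x), len(x[0])
    kh, kw = len(k), len(k[0])
    out = []
    for i in range(0, H - kh + 1, stride):
        row = []
        for j in range(0, W - kw + 1, stride):
            row.append(sum(x[i + a][j + b] * k[a][b]
                           for a in range(kh) for b in range(kw)))
        out.append(row)
    return out

# A 3x3 kernel slid over a 5x5 input: stride 1 gives a 3x3 map,
# stride 2 gives a 2x2 map.
x = [[1, 0, 0, 0, 0],
     [0, 1, 0, 0, 0],
     [0, 0, 1, 0, 0],
     [0, 0, 0, 1, 0],
     [0, 0, 0, 0, 1]]
k = [[1, 0, 0],
     [0, 1, 0],
     [0, 0, 1]]   # responds to the main diagonal
y1 = conv2d(x, k)
y2 = conv2d(x, k, stride=2)
print(len(y1), len(y1[0]))  # 3 3
print(y1[0][0])             # 3: full diagonal match in the top-left patch
```

The same 9 weights are applied at every position: that is the parameter sharing.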
Subsampling/Pooling layer
• The subsampling layers reduce the spatial resolution of each feature map
• By reducing the spatial resolution of the feature map, a certain degree of shift and distortion invariance is achieved
• Reduces the effect of noise and of shifts or distortions
• Weight sharing is also applied in subsampling layers
• when trainable weights are used, e.g., scaled average pooling
• Often no trainable weights are necessary, e.g., max pooling (more popular)
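A minimal sketch of max pooling over disjoint regions (the 4x4 input is made up for the demo):

```python
def max_pool(x, m=2):
    """Max pooling over disjoint m x m regions (no trainable weights)."""
    return [[max(x[i + a][j + b] for a in range(m) for b in range(m))
             for j in range(0, len(x[0]) - m + 1, m)]
            for i in range(0, len(x) - m + 1, m)]

x = [[1, 3, 2, 0],
     [4, 2, 0, 1],
     [0, 0, 5, 6],
     [1, 2, 7, 8]]
pooled = max_pool(x)
print(pooled)  # [[4, 2], [2, 8]]
```

Each 2x2 region is summarized by a single number, halving the resolution in both dimensions; small shifts of a strong activation within a region leave the output unchanged.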
Jargon
• Convolutional Neural Networks
• also called CNNs, ConvNets, etc.
• Each hidden unit channel
• also called map, feature, feature type, dimension
• Weights for each channel
• also called kernels or filters
• Input patch to a hidden unit at (x, y)
• also called the receptive field
Typical CNN
• Alternates convolutional and pooling layers. Why?

[Figure: Input Image 32x32 → C1: feature maps 6x28x28 (convolutions) → S2: feature maps 6x14x14 (subsampling/pooling) → C3: f. maps 16x10x10 (convolutions) → S4: f. maps 16x5x5 (subsampling/pooling) → C5: layer 120 → FC6: layer 84 (full connection) → Output: 10]
Typical CNN
• Output layer is a regular, fully connected layer with softmax non-linearity
• Output provides an estimate of the conditional probability of each class
• The network is trained by stochastic gradient descent
• Backpropagation is used just as in a fully connected network
• Gradients pass through element-wise activation functions in each layer
• Sigmoid, ReLU, etc.
[Figure: Input Image 28x28 → C1: 6x28x28 → S2: 6x14x14 → C3: 16x10x10 → S4: 16x5x5 → C5: 120 → FC6: 84 → RBF Output: 10]
LeNet5
• Introduced by LeCun et al. in 1998.
LeNet5
• C1, C3, C5: Convolutional layers
• 5 × 5 convolution filters
• S2, S4: Subsampling layers
• Subsampling by a factor of 2
• F6: Fully connected layer
• All the units of the layers up to FC6 have a sigmoidal activation function
• About 187,000 connections
• About 14,000 trainable weights
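The feature-map sizes in the figure follow directly from 5x5 valid convolutions and factor-2 subsampling; a quick arithmetic check:

```python
def valid_conv(n, k):   # output size of a valid k x k convolution
    return n - k + 1

def subsample(n, f):    # output size after subsampling by factor f
    return n // f

n = 32                         # input image is 32x32
n = valid_conv(n, 5); c1 = n   # C1: 28
n = subsample(n, 2);  s2 = n   # S2: 14
n = valid_conv(n, 5); c3 = n   # C3: 10
n = subsample(n, 2);  s4 = n   # S4: 5
n = valid_conv(n, 5); c5 = n   # C5: 1 (5x5 conv on a 5x5 map -> 120 units of size 1x1)
print(c1, s2, c3, s4, c5)  # 28 14 10 5 1
```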
[Figure: visualizations of Layer-1, Layer-3, and Layer-5 responses for an example input]
The brute force approach
• LeNet uses knowledge about the invariances to design:
• the local connectivity, the weight-sharing, and the pooling
• This achieves about 80 errors (out of 10,000 test samples)
• This can be reduced to about 40 errors by using many different transformations of the input and other tricks (Ranzato 2008)
• Ciresan et al. (2010) inject knowledge of invariances by creating a huge amount of carefully designed extra training data:
• For each training image, they produce many new training examples by applying many different transformations
• They can then train a large, deep, dumb net on a GPU without much overfitting
Image Classification
• A core task in computer vision (assume a given set of discrete labels) {dog, cat, truck, plane, ...}
The Problem: Semantic Gap
• Images are represented as 3D arrays of numbers, with integers in [0, 255]
• E.g. 300 x 100 x 3 (3 for the 3 color channels R, G, B)
ImageNet Challenge
• ImageNet (image-net.org)
• 15M high-resolution images
• over 22K categories
• labeled by Mechanical Turk workers
• ImageNet Large-Scale Visual Recognition Challenge (ILSVRC)
• 2010-2015
• 2010 competition
• 1.2M training images, 1000 categories (general and specific)
• 200K test images
AlexNet (2012)
• 5 convolutional layers
• 3 fully connected layers
AlexNet (2012): Key Ideas
• Downsampled images
• shorter dimension 256 pixels, longer dimension cropped about the center to 256 pixels
• R, G, B channels
AlexNet (2012): Key Ideas
• Data set augmentation
• Generate image translations by selecting random 224 x 224 sub-images
• Horizontal reflections (a standard trick in computer vision)
• When testing, extract 10 distinct 224x224 sub-images and average the predictions
• More data set augmentation
AlexNet (2012): Key Ideas
• ReLU instead of logistic or tanh units
• Normalize the output of a ReLU in a map at (x,y) based on the activity of features in adjacent maps at (x,y)
• Overlapping pooling
• pooling units spaced s pixels apart, each summarizing a z x z neighborhood, with s < z
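A 1-D sketch of overlapping vs. non-overlapping pooling (the toy input is made up; AlexNet's actual values were z = 3, s = 2):

```python
def pool1d(x, z, s):
    """Max pooling over windows of size z spaced s apart; s < z makes
    the windows overlap (AlexNet used z=3, s=2)."""
    return [max(x[i:i + z]) for i in range(0, len(x) - z + 1, s)]

x = [1, 5, 2, 8, 3, 0, 4]
over = pool1d(x, z=3, s=2)   # overlapping windows
non = pool1d(x, z=2, s=2)    # disjoint windows
print(over)  # [5, 8, 4]
print(non)   # [5, 8, 3]
```

With s < z, each input can contribute to more than one pooled output.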
Other Results that Followed
• ILSVRC 2012: AlexNet
• ILSVRC 2013: ZFNet. Proposed by Zeiler and Fergus; an improvement on AlexNet obtained by tweaking the architecture hyperparameters, in particular by expanding the size of the middle convolutional layers.
• ILSVRC 2014: GoogLeNet (Szegedy et al., Google). Development of an Inception Module that dramatically reduced the number of parameters in the network (4M, compared to AlexNet with 60M). Additionally, uses Average Pooling instead of Fully Connected layers at the top of the ConvNet, eliminating a large number of parameters that do not seem to matter much.
• ILSVRC 2014 runner-up: VGGNet (Simonyan and Zisserman). Showed that network depth is a critical component for good performance. Their final best network contains 16 CONV/FC layers and, appealingly, features an extremely homogeneous architecture that only performs 3x3 convolutions and 2x2 pooling from the beginning to the end. Outperformed GoogLeNet in multiple transfer learning tasks.
RCNNs

[Figure: region-based CNN detections labeled "Dog" and "Cat"]
Fast RCNN
Fully Convolutional Networks for Semantic Segmentation
• Using a Convolutional Net for Segmentation (Long, Shelhamer, Darrell, 2015)
• Determine what objects are where
• "what" depends on global information
• "where" depends on local information
Regular CNNs
• End-to-end learning: the network maps the image to a 1000-dimensional vector of class scores

How to do semantic segmentation?
• End-to-end learning
Fully Convolutional Networks for Semantic Segmentation
• Higher layers of CNN classifiers are nonspatial (no "where" info)
• If every layer remains spatial (as in conv. layers), then the output can specify where as well as what (albeit coarsely)
FCN Key Idea: Learnable Upsampling

[Figure: conv and pooling layers shrink the feature maps through H/4 x W/4, H/8 x W/8, H/16 x W/16, down to H/32 x W/32; a learnable upsampling stage then produces a pixelwise output fed to the loss]
• Bilinear interpolation uses "fixed" weights for interpolation
• The weights depend on the upsampling factor
• FCNs "learn" the weights in the upsampling
Bilinear Interpolation

Upsampling Layer
• Often referred to as "deconvolution" (a misnomer)
• Or transposed convolution
• Weighted combination of neighbors
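A 1-D sketch of transposed convolution as a "scatter" of the kernel (the triangular kernel below is the fixed bilinear choice for 2x upsampling; an FCN would learn these weights instead):

```python
def transposed_conv1d(y, w, stride=2):
    """1-D transposed convolution ("deconvolution"): each input value
    scatters a copy of the kernel w into the output, stride cells apart,
    and overlaps are summed."""
    out = [0.0] * ((len(y) - 1) * stride + len(w))
    for n, v in enumerate(y):
        for i, wi in enumerate(w):
            out[n * stride + i] += v * wi
    return out

# Fixed "bilinear" kernel for 2x upsampling (an assumption for the demo).
w = [0.5, 1.0, 0.5]
up = transposed_conv1d([1.0, 3.0], w)
print(up)  # [0.5, 1.0, 2.0, 3.0, 1.5]
```

Note that the interpolated midpoint between 1.0 and 3.0 comes out as 2.0, exactly what bilinear (here linear) interpolation would give.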
FCN Key Idea: Skip Connection

Figure caption (from the FCN paper): "Our DAG nets learn to combine coarse, high layer information with fine, low layer information. Pooling and prediction layers are shown as grids that reveal relative spatial coarseness, while intermediate layers are shown as vertical lines. First row (FCN-32s): Our single-stream net, described in Section 4.1, upsamples stride 32 predictions back to pixels in a single step. Second row (FCN-16s): Combining predictions from both the final layer and the pool4 layer, at stride 16, lets our net predict finer details, while retaining high-level semantic information. Third row (FCN-8s): Additional predictions from pool3, at stride 8, provide further precision."
Key Idea: Skip Connection
• The 32-pixel stride at the final prediction layer limits the scale of detail in the upsampled output
• This is resolved by adding skips that combine the final prediction layer with lower layers with finer strides. This turns a line topology into a DAG, with edges that skip ahead from lower layers to higher ones
• Combining fine layers and coarse layers lets the model make local predictions that respect global structure
CNN Applications
• We discussed
• Image classification
• Semantic segmentation
• Detection & localization
• Several others
• Depth estimation
• Stereo-based dense point correspondence
• Pose estimation
• Tracking (often with Recurrent NNs)
• Check out recent CVPR/ICCV/ECCV papers for numerous applications of CNNs
Backpropagation in General Cases
1. Decompose the operations in the layers of a neural network into function elements whose derivatives w.r.t. their inputs are known by symbolic computation.

h_Θ(x) = f^(last) ∘ … ∘ f^(l) ∘ … ∘ f^(2) ∘ f^(1)(x)

where a^(1) = x, a^(last) = h_Θ(x), and a^(l) = f^(l)(a^(l-1)) denotes the output of the l-th function element.
Backpropagation in General Cases
2. Backpropagate error signals corresponding to a differentiable cost function by numerical computation (starting from the cost function, plug in error signals backward).

δ^(l) = ∇_{a^(l)} J(Θ; x, y) = δ^(l+1) ∂f^(l+1)(a^(l)) / ∂a^(l)
Backpropagation in General Cases
3. Use the backpropagated error signals to compute gradients w.r.t. the parameters, only for the function elements with parameters, whose derivatives w.r.t. the parameters are known by symbolic computation.

∇_{θ^(l)} J(Θ; x, y) = (∂J/∂a^(l)) (∂a^(l)/∂θ^(l)) = δ^(l) ∂f^(l)(a^(l-1)) / ∂θ^(l)

where ∂f^(l)/∂θ^(l) is known symbolically.
Backpropagation in General Cases
4. Sum the gradients over all m examples to get the overall gradient:

∇_Θ J(Θ) = Σ_{i=1}^{m} ∇_Θ J(Θ; x^(i), y^(i))
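The four steps can be sketched end-to-end on a tiny one-parameter model (the model, data, and cost below are illustrative, not from the slides), checking the summed gradient against a numerical derivative:

```python
# Steps 1-4: compose function elements with known local derivatives,
# backpropagate error signals, then sum per-example gradients.
# Toy model: f(x) = sigmoid(theta * x), cost J = 0.5 * (f(x) - y)**2.
import math

def forward(theta, x):
    a1 = x                         # a^(1) = x
    a2 = theta * a1                # linear element
    a3 = 1 / (1 + math.exp(-a2))   # sigmoid element, a^(last)
    return a1, a2, a3

def grad_one(theta, x, y):
    a1, a2, a3 = forward(theta, x)
    d3 = a3 - y                    # error signal dJ/da3
    d2 = d3 * a3 * (1 - a3)        # through sigmoid: da3/da2 = a3(1 - a3)
    return d2 * a1                 # dJ/dtheta = delta * da2/dtheta

data = [(0.5, 1.0), (-1.0, 0.0)]
theta = 0.3
total = sum(grad_one(theta, x, y) for x, y in data)  # step 4

# Check against a numerical derivative of the summed cost.
def cost(t):
    return sum(0.5 * (forward(t, x)[2] - y) ** 2 for x, y in data)

eps = 1e-6
numeric = (cost(theta + eps) - cost(theta - eps)) / (2 * eps)
print(abs(total - numeric) < 1e-8)  # True
```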
Derivatives of Convolution
Discrete convolution parameterized by a feature w, and its derivatives:
• Let x be the input and y be the output of the convolution layer.
• Here we focus on only one feature vector w = (w_1, w_2, …, w_|w|), although a convolution layer usually has multiple features.
• n indexes x and y, where 1 ≤ n ≤ |x| for x_n and 1 ≤ n ≤ |y| = |x| - |w| + 1 for y_n. i indexes w, where 1 ≤ i ≤ |w|.
Derivatives of Convolution
y = x ∗ w = (y_n),  y_n = Σ_{i=1}^{|w|} x_{n+i-1} w_i = w^T x_{n : n+|w|-1}

∂y_{n-i+1} / ∂x_n = w_i  and  ∂y_n / ∂w_i = x_{n+i-1}  for 1 ≤ i ≤ |w|
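These derivative identities can be checked numerically with a small pure-Python valid convolution (0-based indices; the values are toy data):

```python
# Numerical check of dy_n/dw_i = x_{n+i} for the 1-D valid convolution
# y_n = sum_i x[n+i] * w[i] (0-based indexing).
def conv_valid(x, w):
    return [sum(x[n + i] * w[i] for i in range(len(w)))
            for n in range(len(x) - len(w) + 1)]

x = [1.0, 2.0, -1.0, 0.5, 3.0]
w = [0.2, -0.4, 0.1]
y = conv_valid(x, w)
print(len(y))  # 3, i.e. |y| = |x| - |w| + 1

# Perturb w_i and watch y_n move by x_{n+i} per unit of perturbation.
eps = 1e-6
n, i = 1, 2
w2 = list(w)
w2[i] += eps
numeric = (conv_valid(x, w2)[n] - y[n]) / eps
print(abs(numeric - x[n + i]) < 1e-6)  # True
```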
[Figure: connection pattern of the 1-D convolution, with the weights w_1, …, w_|w| shared across positions]
• From a fixed x_n standpoint, x_n has outgoing connections to y_{n-|w|+1 : n}
• y_n has incoming connections from x_{n : n+|w|-1}
• All of y_{n-|w|+1 : n} have derivatives w.r.t. x_n
• Note that y and w …
Backpropagation in Convolution Layer
Error signals and the gradient for each example are computed by convolution, using the commutative property of convolution and the multivariable chain rule. Let's focus on single elements of the error signals and of the gradient w.r.t. w (writing δ^(l+1) for the error signal ∂J/∂y at the layer output):

δ_n^(l) = ∂J/∂x_n = Σ_{i=1}^{|w|} (∂J/∂y_{n-i+1}) (∂y_{n-i+1}/∂x_n) = Σ_{i=1}^{|w|} δ_{n-i+1}^(l+1) w_i = (δ^(l+1) ∗ flip(w))_n
Backpropagation in Convolution Layer
∂J/∂w_i = (∂J/∂y)(∂y/∂w_i) = Σ_{n=1}^{|x|-|w|+1} (∂J/∂y_n)(∂y_n/∂w_i) = Σ_{n=1}^{|x|-|w|+1} δ_n^(l+1) x_{n+i-1} = (x ∗ δ^(l+1))_i

∂J/∂w = (∂J/∂w_1, …, ∂J/∂w_|w|) = x ∗ δ^(l+1)
Backpropagation in Convolution Layer
[Figure: forward, backward, and gradient computations in a 1-D convolution layer]
• Forward propagation (valid convolution): y = x ∗ w
• Backward propagation (full convolution): δ^(l) = flip(w) ∗ δ^(l+1)
• Gradient computation (valid convolution): ∇_w J = x ∗ δ^(l+1)
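The three computations can be verified together in 1-D with a hand-picked upstream error signal δ (toy values; 0-based indices):

```python
def conv_valid(x, w):
    return [sum(x[n + i] * w[i] for i in range(len(w)))
            for n in range(len(x) - len(w) + 1)]

def conv_full(x, w):
    """Full convolution: zero-pad x by |w|-1 on both sides, then valid."""
    p = [0.0] * (len(w) - 1)
    return conv_valid(p + list(x) + p, w)

x = [1.0, 2.0, -1.0, 0.5, 3.0]
w = [0.2, -0.4, 0.1]
delta = [1.0, -2.0, 0.5]            # dJ/dy_n, chosen by hand (|y| = 3)

dx = conv_full(delta, w[::-1])      # backward: full conv with flipped w
dw = conv_valid(x, delta)           # gradient: valid conv of x with delta

# Compare with the chain rule applied element by element.
dx_ref = [sum(delta[m] * w[n - m] for m in range(len(delta))
              if 0 <= n - m < len(w)) for n in range(len(x))]
dw_ref = [sum(delta[n] * x[n + i] for n in range(len(delta)))
          for i in range(len(w))]

ok_dx = all(abs(a - b) < 1e-12 for a, b in zip(dx, dx_ref))
ok_dw = all(abs(a - b) < 1e-12 for a, b in zip(dw, dw_ref))
print(ok_dx, ok_dw)  # True True
```

Note the shapes come out right automatically: dx has length |x| and dw has length |w|.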
Derivatives of Pooling
The pooling layer subsamples statistics to obtain summary statistics, with any aggregate function (or filter) g whose input is a vector and whose output is a scalar. Subsampling is an operation like convolution; however, g is applied to disjoint (non-overlapping) regions.

Definition: subsample (or downsample)
Let m be the size of the pooling region, x be the input, and y be the output of the pooling layer. subsample(x, m)[n] denotes the n-th element of subsample(x, m).

y_n = subsample(x, m)[n] = g(x_{(n-1)m+1 : nm})
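A direct sketch of this definition, with the aggregate g passed in as a function (the input is toy data):

```python
def subsample(x, m, g):
    """y_n = g(region n), where region n is the n-th disjoint block of
    size m (0-based indexing)."""
    return [g(x[n * m:(n + 1) * m]) for n in range(len(x) // m)]

x = [1, 4, 2, 2, 5, 1]
pooled_max = subsample(x, 2, max)
pooled_mean = subsample(x, 2, lambda r: sum(r) / len(r))
print(pooled_max)   # [4, 2, 5]
print(pooled_mean)  # [2.5, 2.0, 3.0]
```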
Subsampling Equations

[Figure: the input x is split into regions of size m; the aggregate g maps region n to the output y_n]

Mean pooling:  g(x) = (1/m) Σ_{i=1}^{m} x_i,  ∂g/∂x_i = 1/m

Max pooling:  g(x) = max(x),  ∂g/∂x_i = 1 if x_i = max(x), 0 otherwise

Lp pooling:  g(x) = ‖x‖_p = (Σ_{i=1}^{m} |x_i|^p)^{1/p},  ∂g/∂x_i = (Σ_{j=1}^{m} |x_j|^p)^{1/p - 1} |x_i|^{p-1} sign(x_i)
Backpropagation in the Pooling Layer
Error signals for each example are computed by upsampling. Upsampling is an operation which backpropagates the error signals over the aggregate function g using its derivatives

g'_n = ∂g / ∂x_{(n-1)m+1 : nm}

g'_n can change depending on the pooling region n (e.g., which element was the max).
Backpropagation in the Pooling Layer
δ^(l)_{(n-1)m+1 : nm} = ∂J/∂x_{(n-1)m+1 : nm} = (∂J/∂y_n)(∂y_n/∂x_{(n-1)m+1 : nm}) = δ_n^(l+1) g'_n
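A sketch for the max-pooling case, where g'_n is an indicator of the argmax, so upsampling routes each δ_n to the winning position (ties go to the first maximum in this sketch; the values are toy data):

```python
def max_pool_backward(x, delta, m):
    """Upsample the error signal through max pooling: delta_n is routed
    to the position that achieved the max in region n, since dg/dx_i is
    1 there and 0 elsewhere."""
    dx = [0.0] * len(x)
    for n, d in enumerate(delta):
        region = x[n * m:(n + 1) * m]
        winner = n * m + region.index(max(region))  # argmax within region n
        dx[winner] += d
    return dx

x = [1, 4, 2, 2, 5, 1]
delta = [0.5, -1.0, 2.0]   # dJ/dy_n for the three pooled outputs
dx = max_pool_backward(x, delta, 2)
print(dx)  # [0.0, 0.5, -1.0, 0.0, 2.0, 0.0]
```

All non-winning positions receive zero gradient, which is exactly the indicator derivative of max pooling given above.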
[Figure: the error signal δ_n^(l+1) at pooled output y_n is upsampled through ∂g/∂x back over the m inputs of region n]