Lec12-CNN-Applications
(1)

Convolutional Neural Networks and Applications

(2)

Feature detectors

[Diagram: a feedforward network with input layer, hidden layer, and output layer]

(3)

What is this unit doing?

[Diagram: input layer, hidden layer, and output layer, with one hidden unit singled out]

(4)

What does this unit detect?

[Diagram: input pixels numbered 1-63 feeding a single hidden unit]

It will send a strong signal for a horizontal line in the top row, ignoring everywhere else.

(5)

[Diagram: input pixels numbered 1-63 feeding a single hidden unit; strong +ve weights on some pixels, low/zero weights elsewhere]

What does this unit detect?

(6)

What features might you expect a good NN to learn, when trained with data like this?

(7)

vertical lines

(8)

Handwritten images

(9)

Handwritten images

(10)

Handwritten images

(11)

Successive layers can learn higher-level features

First layer: detect lines in specific positions.
Higher-level detectors: "horizontal line", "RHS vertical line", "upper loop", etc.

[Diagram: input pixels numbered 1-63 feeding position-specific line detectors, which in turn feed higher-level detectors]

(12)

Successive layers can learn higher-level features

First layer: detect lines in specific positions.
Higher-level detectors: "horizontal line", "RHS vertical line", "upper loop", etc.

What does this unit detect?

[Diagram: input pixels numbered 1-63 with higher-level detector units]

Weights determine what patterns (or mixtures of patterns) a unit detects.

(13)

Motivation: Object Recognition in Vision

• Challenges
• Must deal with very high-dimensional inputs
• 1600 x 1200 pixels = 1.9M inputs, or 3 x 1.9M with RGB pixels
• Can we exploit the 2D topology of pixels (or 3D for video data)?
• Can we build invariance to certain variations we can expect?
• Translations, illumination, etc.

(14)

Motivation: Recognizing an Image

• Input is a 5x5 pixel array
• Simple backpropagation net

(15)

Motivation: Recognizing an Image with Unknown Location

[Diagram: hidden units and output hidden units for the two candidate locations]

• The object can appear either in the top or in the bottom location of the image

(16)

Motivation: Recognizing an Image with Any Location

[Diagram: hidden units and output hidden units, one set per location]

• Each possible location the object can appear in has its own set of hidden units
• Each set detects the same features, except in a different location
• Locations can overlap

(17)
(18)

Drawbacks of previous neural networks

• Little or no invariance to shifting, scaling, and other forms of distortion

(19)

[Example: shifting the image left by 2 pixels changes 154 of the inputs: 77 from black to white and 77 from white to black]

(20)

• Little or no invariance to shifting, scaling, and other forms of distortion

(21)

• The topology of the input data is completely ignored
• Works with raw data

[Diagram: Feature 1 vs. Feature 2, with weights]

(22)

Convolutional Neural Network: Key Idea

Exploit:

1. Structure
2. Local connectivity
3. Parameter sharing

To give: invariance to distortions, with far fewer parameters.

(23)

Convolutional Neural Network: Local

Connectivity

• Each hidden unit is connected only to a subregion (patch) of the input image
• It is connected to all channels
• 1 if grayscale image
• 3 (R, G, B) for a color image

Solves the following problems:

• A fully connected hidden layer would have an unmanageable number of parameters
• Computing the linear activations of the hidden units would be very expensive
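To make the parameter blow-up concrete, here is a back-of-the-envelope sketch in Python (the image, patch, and layer sizes are illustrative assumptions, not numbers from the lecture):

```python
# Rough weight counts: fully connected vs. locally connected hidden layer.
image_pixels = 200 * 200                        # grayscale input (assumed size)
hidden_units = 10_000

fully_connected = image_pixels * hidden_units   # every unit sees every pixel
locally_connected = hidden_units * (10 * 10)    # every unit sees one 10x10 patch

print(f"fully connected:   {fully_connected:,} weights")    # 400,000,000
print(f"locally connected: {locally_connected:,} weights")  # 1,000,000
```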

(24)
(25)
(26)

Convolutional Neural Network: Parameter

Sharing

• Units organized into the same "feature map/template" share parameters
• In this way all neurons detect the same feature at different positions in the input image
• Hidden units within a feature map cover different positions in the image

(27)

Feature extraction

• If a neuron in the feature map fires, this corresponds to a match with the feature (template) at that position

(28)

Convolutional Neural Network: Parameter

Sharing

How does it help?

• Reduces the number of parameters even further (compared to local connectivity alone)
• Will extract the same features at every position

(29)

Convolutional Neural Network: Parameter

Sharing

• Each feature map forms a 2D grid of features
• It can be computed with a discrete convolution ($*$) of a kernel matrix $k_{ij}$, which is the hidden weights matrix $W_{ij}$ with its rows and columns flipped, where:
• $x_i$ is the $i$-th channel of the input
• $k_{ij}$ is the convolution kernel
• $g_j$ is a learned scaling factor
• $y_j$ is the hidden layer (without a bias)
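The equation itself did not survive extraction. Assembling the symbol definitions above, the standard form would be as below (a reconstruction, not a verbatim copy of the slide; no nonlinearity is shown because the slide lists none):

```latex
% Feature map j: sum of per-channel convolutions, scaled by the learned g_j
y_j = g_j \sum_i \left( k_{ij} * x_i \right),
\qquad k_{ij} = \operatorname{flip}\!\left(W_{ij}\right)
```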

(30)
(31)

Convolutional Neural Network: Parameter

Sharing

• Stride (typically denoted by s in CNNs) decides the spacing between overlapping convolutions
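As a quick sanity check on how stride sets that spacing, a tiny NumPy sketch (sizes are illustrative, not from the lecture):

```python
import numpy as np

def conv1d_strided(x, w, s):
    """Valid 1D cross-correlation with stride s."""
    out_len = (len(x) - len(w)) // s + 1
    return np.array([np.dot(x[n * s : n * s + len(w)], w) for n in range(out_len)])

x = np.arange(10.0)                    # input of length 10
w = np.ones(3)                         # filter of length 3
print(conv1d_strided(x, w, 1).shape)   # (8,) -> windows overlap heavily
print(conv1d_strided(x, w, 2).shape)   # (4,) -> windows spaced 2 apart
```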

(32)

Subsampling/Pooling layer

• The subsampling layers reduce the spatial resolution of each feature map
• By reducing the spatial resolution of the feature map, a certain degree of robustness to shift and distortion is achieved

(33)

• The subsampling layers reduce the spatial resolution of each feature map

(34)
(35)

• Reduces the effect of noise and of shifts or distortions
• Weight sharing is also applied in subsampling layers
• when trainable weights are used, e.g., scaled average pooling
• Often no trainable weights are necessary, e.g., max pooling (more popular)
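As a concrete illustration of (non-overlapping) pooling, a minimal NumPy sketch of 2x2 max pooling (illustrative, not the lecture's code):

```python
import numpy as np

def max_pool2d(fmap, m=2):
    """Non-overlapping m x m max pooling of a 2D feature map."""
    H, W = fmap.shape
    assert H % m == 0 and W % m == 0, "assumes the map divides evenly"
    # reshape into (H/m, m, W/m, m) blocks, then reduce over each block
    return fmap.reshape(H // m, m, W // m, m).max(axis=(1, 3))

fmap = np.arange(16.0).reshape(4, 4)
print(max_pool2d(fmap))   # [[ 5.  7.]
                          #  [13. 15.]]
```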

(36)

Subsampling/Pooling layer

(37)

Jargon

• Convolutional Neural Networks
• also called CNNs, Conv Nets, etc.
• Each hidden unit channel
• also called map, feature, feature type, dimension
• Weights for each channel
• also called kernels or filters
• Input patch to a hidden unit at (x,y)
• also called its receptive field

(38)

Typical CNN

• Alternates convolutional and pooling layers. Why?

[Architecture: Input image 32x32 → (convolutions) C1: feature maps 6@28x28 → (subsampling/pooling) S2: feature maps 6@14x14 → (convolutions) C3: feature maps 16@10x10 → (subsampling/pooling) S4: feature maps 16@5x5 → C5: layer of 120 → FC6: layer of 84 → (full connection) Output: 10]

(39)

Typical CNN

• Output layer is a regular, fully connected layer with softmax non-linearity
• Output provides an estimate of the conditional probability of each class
• The network is trained by stochastic gradient descent
• Backpropagation is used just as in a fully connected network
• Gradients pass through element-wise activation functions in each layer
• Sigmoid, ReLU, etc.

(40)

LeNet5

• Introduced by LeCun in 1998.

[Architecture: Input image 28x28 → C1: 6@28x28 → S2: 6@14x14 → C3: 16@10x10 → S4: 16@5x5 → C5: 120 → FC6: 84 → RBF Output: 10]

(41)

LeNet5

• C1, C3, C5: convolutional layers with 5 × 5 convolution filters
• S2, S4: subsampling layers, subsampling by a factor of 2
• F6: fully connected layer
• All the units of the layers up to FC6 have a sigmoidal activation function
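A minimal PyTorch sketch of this layer stack, assuming the 32x32 input of the earlier architecture slide, plain average pooling for S2/S4, and a linear output layer in place of the original RBF units (simplifications, not LeCun's exact model):

```python
import torch
import torch.nn as nn

class LeNet5(nn.Module):
    """Simplified LeNet-5 following the slide's layer sizes."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.c1 = nn.Conv2d(1, 6, kernel_size=5)     # 1@32x32 -> 6@28x28
        self.s2 = nn.AvgPool2d(2)                    # -> 6@14x14
        self.c3 = nn.Conv2d(6, 16, kernel_size=5)    # -> 16@10x10
        self.s4 = nn.AvgPool2d(2)                    # -> 16@5x5
        self.c5 = nn.Conv2d(16, 120, kernel_size=5)  # -> 120@1x1
        self.f6 = nn.Linear(120, 84)
        self.out = nn.Linear(84, num_classes)

    def forward(self, x):
        x = torch.sigmoid(self.c1(x))                # sigmoidal activations, per the slide
        x = self.s2(x)
        x = torch.sigmoid(self.c3(x))
        x = self.s4(x)
        x = torch.sigmoid(self.c5(x))
        x = torch.sigmoid(self.f6(x.flatten(1)))     # (N, 120) -> (N, 84)
        return self.out(x)                           # class scores; softmax lives in the loss

print(LeNet5()(torch.randn(1, 1, 32, 32)).shape)     # torch.Size([1, 10])
```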

(42)

• About 187,000 connections
• About 14,000 trainable weights

[Figure: feature maps at the input and at Layer-1, Layer-3, Layer-5]

(43)
(44)

The brute force approach

• LeNet uses knowledge about the invariances to design:
• the local connectivity, the weight-sharing, and the pooling
• This achieves about 80 errors (out of 10,000 test samples)
• This can be reduced to about 40 errors by using many different transformations of the input and other tricks (Ranzato 2008)
• Ciresan et al. (2010) inject knowledge of invariances by creating a huge amount of carefully designed extra training data:
• For each training image, they produce many new training examples by applying many different transformations
• They can then train a large, deep, dumb net on a GPU without much overfitting

(45)
(46)
(47)

Image Classification

• A core task in computer vision
• Assume a given set of discrete labels: {dog, cat, truck, plane, ...}

(48)

The Problem: Semantic Gap

• Images are represented as 3D arrays of numbers, with integers in [0, 255]
• E.g. 300 x 100 x 3 (3 for the three color channels)

(49)
(50)
(51)
(52)
(53)
(54)
(55)

ImageNet Challenge

• ImageNet (image-net.org)
• 15M high-resolution images
• over 22K categories
• labeled by Mechanical Turk workers
• ImageNet Large-Scale Visual Recognition Challenge (ILSVRC), 2010-2015
• 2010 competition
• 1.2M training images, 1000 categories (general and specific)
• 200K test images

(56)

AlexNet (2012)

• 5 convolutional layers
• 3 fully connected layers

(57)

AlexNet (2012): Key Ideas

• Downsampled images
• shorter dimension scaled to 256 pixels; longer dimension cropped about the center to 256 pixels
• R, G, B channels

(58)

AlexNet (2012): Key Ideas

• Data set augmentation
• Generate image translations by selecting random 224 x 224 sub-images
• Horizontal reflections (standard trick in computer vision)
• When testing, extract 10 distinct 224x224 sub-images and average predictions
• More data set augmentation
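A minimal NumPy sketch of the random-crop-plus-reflection augmentation described above (the function and sizes are illustrative, not AlexNet's actual code):

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(img, crop=224):
    """Random crop + random horizontal flip of an HxWx3 image array."""
    H, W, _ = img.shape
    top = rng.integers(0, H - crop + 1)
    left = rng.integers(0, W - crop + 1)
    patch = img[top:top + crop, left:left + crop]
    if rng.random() < 0.5:
        patch = patch[:, ::-1]        # horizontal reflection
    return patch

img = rng.integers(0, 256, size=(256, 256, 3), dtype=np.uint8)
print(augment(img).shape)             # (224, 224, 3)
```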

(59)

AlexNet (2012): Key Ideas

• ReLU instead of logistic or tanh units
• Normalize the ReLU output in a map at (x,y) based on the activity of features in adjacent maps at (x,y)
• Overlapping pooling
• pooling units spaced s pixels apart, each summarizing a z x z neighborhood, with s < z
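To see what s < z means, a tiny 1D sketch of overlapping pooling windows (values are illustrative):

```python
import numpy as np

x = np.arange(8.0)
s, z = 2, 3                       # stride < window size -> windows overlap
windows = [x[i:i + z] for i in range(0, len(x) - z + 1, s)]
pooled = np.array([w.max() for w in windows])
print(pooled)                     # [2. 4. 6.] -- each window shares an element with the next
```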

(60)
(61)
(62)
(63)
(64)

Other Results that Followed

• ILSVRC 2012: AlexNet
• ILSVRC 2013: ZFNet (Zeiler & Fergus). An improvement on AlexNet obtained by tweaking the architecture hyperparameters, in particular by expanding the size of the middle convolutional layers.
• ILSVRC 2014: GoogLeNet (Szegedy et al., Google). Development of an Inception Module that dramatically reduced the number of parameters in the network (4M, compared to AlexNet's 60M). Additionally, uses average pooling instead of fully connected layers at the top of the ConvNet, eliminating a large number of parameters that do not seem to matter much.
• ILSVRC 2014 runner-up: VGGNet (Simonyan & Zisserman). Showed that network depth is a critical component for good performance. Their final best network contains 16 CONV/FC layers and, appealingly, features an extremely homogeneous architecture that only performs 3x3 convolutions and 2x2 pooling from beginning to end. Outperformed GoogLeNet in multiple transfer learning tasks.

(65)
(66)

RCNNs

[Diagram: an image with region proposals classified as "Dog" and "Cat"]

(67)
(68)
(69)

Fast RCNN

(70)
(71)

Fully Convolutional Networks for Semantic Segmentation

• Using a convolutional net for segmentation (Long, Shelhamer, Darrell, 2015)
• Determine what objects are where
• "what" depends on global information
• "where" depends on local information

(72)

Regular CNNs

End-to-end learning
Output: a 1000-dimensional vector

(73)

How to do semantic segmentation?

End-to-end learning

(74)

Fully Convolutional Networks for Semantic Segmentation

• Higher layers of CNN classifiers are nonspatial (no 'where' info)
• If every layer remains spatial (as in conv. layers), then the output can specify where as well as what (albeit coarsely)

(75)

FCN Key Idea: Learnable Upsampling

[Diagram: conv + pool layers reduce an H x W input to H/4 x W/4, H/8 x W/8, H/16 x W/16, and H/32 x W/32 feature maps; upsampling restores H x W for a pixelwise output + loss]

(76)

FCN Key Idea: Learnable Upsampling

• Bilinear interpolation uses 'fixed' weights for interpolation
• The weights depend on the upsampling factor
• FCNs 'learn' the weights in the upsampling layer

(77)

Bilinear Interpolation

(78)

Upsampling Layer

• Often referred to as 'deconvolution' (a misnomer)
• Also called transposed convolution
• Output is a weighted combination of neighbors; see the sketch below
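A minimal PyTorch sketch of a learnable 2x upsampling layer initialized to bilinear interpolation, in the spirit of FCN (the initializer below is a common recipe, offered as an illustrative assumption rather than the paper's exact code):

```python
import torch
import torch.nn as nn

def bilinear_weight(channels, k=4):
    """Build a (channels, channels, k, k) weight performing 2x bilinear upsampling."""
    factor = (k + 1) // 2
    center = factor - 1 if k % 2 == 1 else factor - 0.5
    og = torch.arange(k, dtype=torch.float32)
    filt = 1 - (og - center).abs() / factor          # 1D triangle filter
    kernel = filt[:, None] * filt[None, :]           # separable 2D filter
    w = torch.zeros(channels, channels, k, k)
    for c in range(channels):
        w[c, c] = kernel                             # each channel upsampled independently
    return w

up = nn.ConvTranspose2d(21, 21, kernel_size=4, stride=2, padding=1, bias=False)
with torch.no_grad():
    up.weight.copy_(bilinear_weight(21))             # start as bilinear; then learn

x = torch.randn(1, 21, 16, 16)                       # e.g. 21 class-score maps
print(up(x).shape)                                   # torch.Size([1, 21, 32, 32])
```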

(79)

FCN Key Idea – Skip Connection

From Long et al. (2015): "Our DAG nets learn to combine coarse, high layer information with fine, low layer information. Pooling and prediction layers are shown as grids that reveal relative spatial coarseness, while intermediate layers are shown as vertical lines. First row (FCN-32s): Our single-stream net, described in Section 4.1, upsamples stride 32 predictions back to pixels in a single step. Second row (FCN-16s): Combining predictions from both the final layer and the pool4 layer, at stride 16, lets our net predict finer details, while retaining high-level semantic information. Third row (FCN-8s): Additional predictions from pool3, at stride 8, provide further precision."

(80)

Key Idea: Skip Connection

• The 32-pixel stride at the final prediction layer limits the scale of detail in the upsampled output
• This is resolved by adding skips that combine the final prediction layer with lower layers with finer strides. This turns a line topology into a DAG, with edges that skip ahead from lower layers to higher ones
• Combining fine layers and coarse layers lets the model make local predictions that respect global structure

(81)
(82)
(83)

CNN Applications

• We discussed
• Image classification
• Semantic segmentation
• Detection & localization
• Several others
• Depth estimation
• Stereo-based dense point correspondence
• Pose estimation
• Tracking (often with recurrent NNs)
• Check out recent CVPR/ICCV/ECCV papers for numerous applications of CNNs

(84)
(85)

Backpropagation in General Cases

1. Decompose operations in layers of a neural network into function elements whose derivatives w.r.t. inputs are known by symbolic computation.

β„Žπœƒπœƒ π‘₯π‘₯ = 𝑓𝑓(π‘™π‘™π‘šπ‘šπ‘šπ‘šπ‘šπ‘š)Β° … °𝑓𝑓

πœƒπœƒ 𝑙𝑙

𝑙𝑙 Β° … °𝑓𝑓 πœƒπœƒ 2

2 °𝑓𝑓 1 π‘₯π‘₯

where 𝑓𝑓(1) = π‘₯π‘₯, 𝑓𝑓(π‘™π‘™π‘šπ‘šπ‘šπ‘šπ‘šπ‘š) = β„Ž

πœƒπœƒ and βˆ€π‘™π‘™: πœ•πœ•π‘“π‘“

(𝑙𝑙)

(86)

Backpropagation in General Cases

2. Backpropagate error signals corresponding to a differentiable cost function by numerical computation (Starting from cost function, plug in error signals backward).

$$\delta^{(l)} = \frac{\partial J(\theta; x, y)}{\partial f^{(l)}} = \frac{\partial J}{\partial f^{(l+1)}}\,\frac{\partial f^{(l+1)}}{\partial f^{(l)}} = \delta^{(l+1)}\,\frac{\partial f^{(l+1)}}{\partial f^{(l)}}$$

(87)

Backpropagation in General Cases

3. Use backpropagated error signals to compute gradients w.r.t. parameters only for the function elements with parameters where their derivatives w.r.t. parameters are known by symbolic computation.

π›»π›»πœƒπœƒ 𝑙𝑙 𝐽𝐽 πœƒπœƒ; π‘₯π‘₯, 𝑦𝑦 = πœ•πœ•πœƒπœƒπœ•πœ•π‘™π‘™ 𝐽𝐽 πœƒπœƒ; π‘₯π‘₯, 𝑦𝑦 = πœ•πœ•π‘“π‘“πœ•πœ•πœ•πœ•(𝑙𝑙)

πœ•πœ•π‘“π‘“

πœƒπœƒ(𝑙𝑙)

(𝑙𝑙)

πœ•πœ•πœƒπœƒ(𝑙𝑙) = 𝛿𝛿(𝑙𝑙) πœ•πœ•π‘“π‘“

πœƒπœƒ(𝑙𝑙)

(𝑙𝑙)

πœ•πœ•πœƒπœƒ(𝑙𝑙)

where πœ•πœ•π‘“π‘“πœƒπœƒ(𝑙𝑙)

(𝑙𝑙)

(88)

Backpropagation in General Cases

4. Sum gradients over all examples to get the overall gradient.

$$\nabla_{\theta^{(l)}} J(\theta) = \sum_{i=1}^{m} \nabla_{\theta^{(l)}} J\left(\theta; x^{(i)}, y^{(i)}\right)$$

(89)

Derivatives of Convolution

Discrete convolution parameterized by a filter $w$, and its derivatives.

• Let $x$ be the input and $y$ the output of the convolution layer.
• Here we focus on only one filter $w = (w_1, w_2, \dots, w_{|w|})$, although a convolution layer usually has multiple filters.
• $n$ indexes $x$ and $y$: $1 \le n \le |x|$ for $x_n$, and $1 \le n \le |y| = |x| - |w| + 1$ for $y_n$; $i$ indexes $w$: $1 \le i \le |w|$.

(90)

Derivatives of Convolution

𝑦𝑦 = π‘₯π‘₯ βˆ— 𝑀𝑀 = 𝑦𝑦𝑛𝑛 , 𝑦𝑦𝑛𝑛 = οΏ½

𝑖𝑖=1

|π‘Šπ‘Š|

π‘₯π‘₯𝑛𝑛+π‘–π‘–βˆ’1𝑀𝑀𝑖𝑖 = 𝑀𝑀𝑇𝑇π‘₯π‘₯𝑛𝑛:𝑛𝑛+ π‘Šπ‘Š βˆ’1

πœ•πœ•πœ•πœ•π‘›π‘›βˆ’π‘–π‘–+1

πœ•πœ•πœ•πœ•π‘›π‘› = 𝑀𝑀𝑖𝑖 and

πœ•πœ•πœ•πœ•π‘›π‘›

πœ•πœ•πœ•πœ•π‘–π‘– = π‘₯π‘₯𝑛𝑛+π‘–π‘–βˆ’1 for 1 ≀ 𝑖𝑖 ≀ |π‘Šπ‘Š|

π‘₯π‘₯𝑛𝑛 𝑀𝑀

1

𝑀𝑀2 π‘¦π‘¦π‘›π‘›βˆ’1

𝑦𝑦𝑛𝑛 |π‘Šπ‘Š|

From a fixed π‘₯π‘₯𝑛𝑛 stand point, π‘₯π‘₯𝑛𝑛 has outgoing

connections to π‘¦π‘¦π‘›π‘›βˆ’ π‘Šπ‘Š +1:𝑛𝑛

π‘₯π‘₯𝑛𝑛 𝑀𝑀

1 𝑦𝑦𝑛𝑛

π‘₯π‘₯𝑛𝑛+1 𝑀𝑀

2

𝑦𝑦𝑛𝑛 has incoming

connections from

π‘₯π‘₯𝑛𝑛:𝑛𝑛+ π‘Šπ‘Š βˆ’1

All π‘¦π‘¦π‘›π‘›βˆ’ π‘Šπ‘Š +1:𝑛𝑛

have derivatives w.r.t. π‘₯π‘₯𝑛𝑛.

Note that 𝑦𝑦 and 𝑀𝑀

(91)

Backpropagation in Convolution Layer

Error signals and the gradient for each example are computed by convolution, using the commutativity property of convolution and the multivariable chain rule. Let's focus on single elements of the error signal and of the gradient w.r.t. $w$:

$$\delta^{(x)}_n = \frac{\partial J}{\partial x_n} = \sum_{i=1}^{|w|} \frac{\partial J}{\partial y_{n-i+1}}\,\frac{\partial y_{n-i+1}}{\partial x_n} = \sum_{i=1}^{|w|} \delta^{(y)}_{n-i+1}\, w_i = \left(\delta^{(y)} * \operatorname{flip}(w)\right)_n$$

(92)

Backpropagation in Convolution Layer

πœ•πœ•π½π½

πœ•πœ•π‘€π‘€π‘–π‘– =

πœ•πœ•π½π½ πœ•πœ•π‘¦π‘¦

πœ•πœ•π‘¦π‘¦

πœ•πœ•π‘€π‘€π‘–π‘– = 𝑛𝑛=1οΏ½

πœ•πœ• βˆ’ π‘Šπ‘Š +1

πœ•πœ•π½π½ πœ•πœ•π‘¦π‘¦π‘›π‘› πœ•πœ•π‘¦π‘¦π‘›π‘› πœ•πœ•π‘€π‘€π‘–π‘– = οΏ½ 𝑛𝑛=1 πœ•πœ• βˆ’ π‘Šπ‘Š +1

𝛿𝛿𝑛𝑛(πœ•πœ•)π‘₯π‘₯𝑛𝑛+π‘–π‘–βˆ’1 = 𝛿𝛿 πœ•πœ• βˆ— π‘₯π‘₯ 𝑖𝑖

πœ•πœ•π½π½

πœ•πœ•π‘€π‘€ =

πœ•πœ•π½π½

(93)

Backpropagation in Convolution Layer

π‘₯π‘₯𝑛𝑛

𝑀𝑀1 𝑦𝑦

𝑛𝑛

π‘₯π‘₯𝑛𝑛+1

𝑀𝑀2

𝛿𝛿𝑛𝑛(πœ•πœ•)

𝑀𝑀1

𝑀𝑀2 π›Ώπ›Ώπ‘›π‘›βˆ’1(πœ•πœ•)

𝛿𝛿𝑛𝑛(πœ•πœ•)

π‘₯π‘₯ βˆ— 𝑀𝑀 = 𝑦𝑦 𝛿𝛿(πœ•πœ•) = 𝑓𝑓𝑙𝑙𝑖𝑖𝑓𝑓 𝑀𝑀 βˆ— 𝛿𝛿(πœ•πœ•) π‘₯π‘₯ βˆ— 𝛿𝛿 πœ•πœ• = πœ•πœ•π‘Šπ‘Šπœ•πœ•π½π½

Forward propagation

(valid convolution) Backward propagation(full convolution) Gradient computation(valid convolution) οΏ½

πœ•πœ•π½π½

πœ•πœ•π‘€π‘€π‘–π‘– π‘₯π‘₯𝑛𝑛

π‘₯π‘₯𝑛𝑛+1

(94)

Derivatives of Pooling

The pooling layer subsamples statistics to obtain summary statistics, using any aggregate function (or filter) $g$ whose input is a vector and whose output is a scalar.

Subsampling is an operation like convolution; however, $g$ is applied to disjoint (non-overlapping) regions.

Definition: subsample (or downsample). Let $m$ be the size of the pooling region, $x$ the input, and $y$ the output of the pooling layer. $\mathrm{subsample}(x, g)[n]$ denotes the $n$-th element of $\mathrm{subsample}(x, g)$:

$$y_n = \mathrm{subsample}(x, g)[n] = g\!\left(x_{(n-1)m+1:nm}\right)$$

(95)

Subsampling Equations

[Diagram: pooling regions of size m in x, each mapped through g to one output y_n]

Mean pooling:
$$g(x) = \frac{1}{m}\sum_{k=1}^{m} x_k, \qquad \frac{\partial g}{\partial x_k} = \frac{1}{m}$$

Max pooling:
$$g(x) = \max(x), \qquad \frac{\partial g}{\partial x_i} = \begin{cases} 1 & \text{if } x_i = \max(x) \\ 0 & \text{otherwise} \end{cases}$$

$L_P$ pooling:
$$g(x) = \|x\|_P = \left(\sum_{k=1}^{m} |x_k|^P\right)^{1/P}, \qquad \frac{\partial g}{\partial x_i} = \left(\sum_{k=1}^{m} |x_k|^P\right)^{\frac{1}{P}-1} |x_i|^{P-1}\operatorname{sign}(x_i)$$

(96)

Backpropagation in the Pooling Layer

Error signals for each example are computed by upsampling. Upsampling is an operation which backpropagates the error signals over the aggregate function $g$ using its derivatives

$$g'_n = \frac{\partial g}{\partial x_{(n-1)m+1:nm}}$$

$g'_n$ can change depending on the pooling region $n$.

(97)

Backpropagation in the Pooling Layer

𝛿𝛿(π‘›π‘›βˆ’1 π‘šπ‘š+1πœ•πœ•) :π‘›π‘›π‘šπ‘š = πœ•πœ•π‘₯π‘₯ πœ•πœ•π½π½

π‘›π‘›βˆ’1 π‘šπ‘š+1:π‘›π‘›π‘šπ‘š =

πœ•πœ•π½π½ πœ•πœ•π‘¦π‘¦π‘›π‘›

πœ•πœ•π‘¦π‘¦π‘›π‘›

πœ•πœ•π‘₯π‘₯ π‘›π‘›βˆ’1 π‘šπ‘š+1:π‘›π‘›π‘šπ‘š

𝛿𝛿(π‘›π‘›βˆ’1 π‘šπ‘š+1πœ•πœ•) :π‘›π‘›π‘šπ‘š = 𝛿𝛿𝑛𝑛(πœ•πœ•) πœ•πœ•π‘₯π‘₯ πœ•πœ•π‘”π‘”

π‘›π‘›βˆ’1 π‘šπ‘š+1:π‘›π‘›π‘šπ‘š = 𝛿𝛿𝑛𝑛

(πœ•πœ•)𝑔𝑔

𝑛𝑛′

𝑠𝑠

π‘₯π‘₯

𝑔𝑔 𝑦𝑦𝑛𝑛 𝑠𝑠 𝛿𝛿𝑛𝑛(πœ•πœ•)

⁄ πœ•πœ•π‘”π‘” πœ•πœ•π‘₯π‘₯
