Convolutional Neural Networks and Applications
Feature detectors

[Figure: a net with an Input Layer, a Hidden Layer, and an Output Layer; one hidden unit's weights shown over a 63-pixel input grid, with strong +ve weights on the top row and low/zero weights everywhere else]

What does this unit detect?
• With strong positive weights on the top row and low/zero weights everywhere else, it will send a strong signal for a horizontal line in the top row, ignoring everywhere else.
What does this unit detect?

What features might you expect a good NN to learn, when trained with data like this?
• vertical lines
Handwritten images
Successive layers can learn higher-level features

[Figure: first-layer units detect lines in specific positions over the pixel grid; later units combine them]
• First layer: detect lines in specific positions
• Higher-level detectors: "horizontal line", "RHS vertical line", "upper loop", etc.
Weights determine what patterns (or mixture of patterns) a unit detects.
Motivation: Object Recognition in Vision
• Challenges
• Must deal with very high-dimensional inputs
• 1600 x 1200 pixels = 1.9M inputs, or 3 x 1.9M if RGB pixels
• Can we exploit the 2D topology of pixels (or 3D for video data)?
• Can we build invariance to certain variations we can expect?
• Translations, illumination, etc.
Motivation: Recognizing an Image
• Input is a 5x5 pixel array
• Simple backpropagation net
Motivation: Recognizing an Image with Unknown Location

[Figure: two sets of Hidden Units feeding the Output]
• The object can appear either in the top or in the bottom location of the image
Motivation: Recognizing an Image with Any Location

[Figure: several sets of Hidden Units feeding the Output]
• Each possible location the object can appear in has its own set of hidden units
• Each set detects the same features, except in a different location
• Locations can overlap
Drawbacks of previous neural networks

[Figure: shifting a digit left by 2 pixels changes 154 inputs (77 black to white, 77 white to black)]
• Little or no invariance to shifting, scaling, and other forms of distortion
• The topology of the input data is completely ignored
• Work with raw data
Convolutional Neural Network: Key Idea
Exploit:
1. Structure
2. Local Connectivity
3. Parameter Sharing
to give invariance with far fewer parameters.
Convolutional Neural Network: Local Connectivity
• Each hidden unit is connected only to a subregion (patch) of the input image
• It is connected to all channels
• 1 if grayscale image
• 3 (R, G, B) for a color image
Solves the following problems:
• A fully connected hidden layer would have an unmanageable number of parameters
• Computing the linear activations of the hidden units would be very expensive
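To make the savings concrete, here is a small illustrative count (the 32x32 input and 5x5 patch are assumptions for this example, not figures from the slides):

```python
# Illustrative parameter counts: fully connected vs. locally connected
# vs. shared. Assumed sizes: a 32x32 grayscale input and a hidden layer
# with one unit per pixel position.
H, W = 32, 32
n_inputs = H * W
n_hidden = H * W

fully_connected = n_inputs * n_hidden          # every unit sees every pixel
patch = 5                                      # each unit sees a 5x5 patch
locally_connected = n_hidden * patch * patch   # one 5x5 weight set per unit
shared = patch * patch                         # one 5x5 kernel shared by all units

print(fully_connected)    # 1048576
print(locally_connected)  # 25600
print(shared)             # 25
```

Local connectivity cuts the count by a factor of ~40 here, and parameter sharing (next slide) by another factor of 1024.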
Convolutional Neural Network: Parameter Sharing
• Units organized into the same "feature map/template" share parameters
• In this way all neurons detect the same feature at different positions in the input image
• Hidden units within a feature map cover different positions in the image (feature extraction)
• If a neuron in the feature map fires, this corresponds to a match with the feature at that position
Convolutional Neural Network: Parameter Sharing
How does it help?
• Reduces the number of parameters even further (compared to local connectivity alone)
• Will extract the same features at every position
Convolutional Neural Network: Parameter Sharing
• Each feature map forms a 2D grid of features
• It can be computed with a discrete convolution (∗) of a kernel matrix k_ij, which is the hidden weights matrix W_ij with its rows and columns flipped:

y_j = g_j Σ_i (k_ij ∗ x_i)

• x_i is the i-th channel of the input
• k_ij is the convolution kernel
• g_j is a learned scaling factor
• y_j is the hidden layer (without a bias)
Convolutional Neural Network: Parameter Sharing
• Stride (typically denoted by s in CNNs) decides the spacing between overlapping convolutions
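As a minimal pure-Python sketch of the operation (cross-correlation indexing; flipping the kernel's rows and columns, as described above, turns it into a true convolution; the 5x5 input and diagonal kernel below are made up for the demo):

```python
def conv2d(x, k, stride=1):
    """Valid 2-D cross-correlation with stride; flip k's rows and
    columns first to get a true convolution."""
    H, W = len(x), len(x[0])
    kh, kw = len(k), len(k[0])
    out = []
    for i in range(0, H - kh + 1, stride):
        row = []
        for j in range(0, W - kw + 1, stride):
            row.append(sum(x[i + a][j + b] * k[a][b]
                           for a in range(kh) for b in range(kw)))
        out.append(row)
    return out

# A 3x3 kernel slid over a 5x5 input: stride 1 gives a 3x3 map,
# stride 2 gives a 2x2 map.
x = [[1, 0, 0, 0, 0],
     [0, 1, 0, 0, 0],
     [0, 0, 1, 0, 0],
     [0, 0, 0, 1, 0],
     [0, 0, 0, 0, 1]]
k = [[1, 0, 0],
     [0, 1, 0],
     [0, 0, 1]]   # responds to the main diagonal
y1 = conv2d(x, k)
y2 = conv2d(x, k, stride=2)
print(len(y1), len(y1[0]))  # 3 3
print(y1[0][0])             # 3: full diagonal match in the top-left patch
```

The same 9 weights are applied at every position: that is the parameter sharing.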
Subsampling/Pooling layer
• The subsampling layers reduce the spatial resolution of each feature map
• By reducing the spatial resolution of the feature map, a certain degree of shift and distortion invariance is achieved
• Reduces the effect of noise and of shifts or distortions
• Weight sharing is also applied in subsampling layers
• when trainable weights are used, e.g., scaled average pooling
• Often no trainable weights are necessary, e.g., max pooling (more popular)
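A minimal sketch of max pooling over disjoint regions (the 4x4 input is made up for the demo):

```python
def max_pool(x, m=2):
    """Max pooling over disjoint m x m regions (no trainable weights)."""
    return [[max(x[i + a][j + b] for a in range(m) for b in range(m))
             for j in range(0, len(x[0]) - m + 1, m)]
            for i in range(0, len(x) - m + 1, m)]

x = [[1, 3, 2, 0],
     [4, 2, 0, 1],
     [0, 0, 5, 6],
     [1, 2, 7, 8]]
pooled = max_pool(x)
print(pooled)  # [[4, 2], [2, 8]]
```

Each 2x2 region is summarized by a single number, halving the resolution in both dimensions; small shifts of a strong activation within a region leave the output unchanged.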
Jargon
• Convolutional Neural Networks
• also called CNNs, ConvNets, etc.
• Each hidden unit channel
• also called map, feature, feature type, dimension
• Weights for each channel
• also called kernels or filters
• Input patch to a hidden unit at (x, y)
• also called the receptive field
Typical CNN
• Alternates convolutional and pooling layers. Why?

[Figure: Input Image 32x32 → C1: feature maps 6x28x28 (convolutions) → S2: feature maps 6x14x14 (subsampling/pooling) → C3: f. maps 16x10x10 (convolutions) → S4: f. maps 16x5x5 (subsampling/pooling) → C5: layer 120 → FC6: layer 84 (full connection) → Output: 10]
Typical CNN
• Output layer is a regular, fully connected layer with softmax non-linearity
• Output provides an estimate of the conditional probability of each class
• The network is trained by stochastic gradient descent
• Backpropagation is used just as in a fully connected network
• Gradients pass through element-wise activation functions in each layer
• Sigmoid, ReLU, etc.
[Figure: Input Image 28x28 → C1: 6x28x28 → S2: 6x14x14 → C3: 16x10x10 → S4: 16x5x5 → C5: 120 → FC6: 84 → RBF Output: 10]
LeNet5
• Introduced by LeCun et al. in 1998.
LeNet5
• C1, C3, C5: Convolutional layers
• 5 × 5 convolution filters
• S2, S4: Subsampling layers
• Subsampling by a factor of 2
• F6: Fully connected layer
• All the units of the layers up to FC6 have a sigmoidal activation function
• About 187,000 connections
• About 14,000 trainable weights
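The feature-map sizes in the figure follow directly from 5x5 valid convolutions and factor-2 subsampling; a quick arithmetic check:

```python
def valid_conv(n, k):   # output size of a valid k x k convolution
    return n - k + 1

def subsample(n, f):    # output size after subsampling by factor f
    return n // f

n = 32                         # input image is 32x32
n = valid_conv(n, 5); c1 = n   # C1: 28
n = subsample(n, 2);  s2 = n   # S2: 14
n = valid_conv(n, 5); c3 = n   # C3: 10
n = subsample(n, 2);  s4 = n   # S4: 5
n = valid_conv(n, 5); c5 = n   # C5: 1 (5x5 conv on a 5x5 map -> 120 units of size 1x1)
print(c1, s2, c3, s4, c5)  # 28 14 10 5 1
```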
[Figure: visualizations of Layer-1, Layer-3, and Layer-5 responses for an example input]
The brute force approach
• LeNet uses knowledge about the invariances to design:
• the local connectivity, the weight-sharing, and the pooling
• This achieves about 80 errors (out of 10,000 test samples)
• This can be reduced to about 40 errors by using many different transformations of the input and other tricks (Ranzato 2008)
• Ciresan et al. (2010) inject knowledge of invariances by creating a huge amount of carefully designed extra training data:
• For each training image, they produce many new training examples by applying many different transformations
• They can then train a large, deep, dumb net on a GPU without much overfitting
Image Classification
• A core task in computer vision (assume a given set of discrete labels) {dog, cat, truck, plane, ...}
The Problem: Semantic Gap
• Images are represented as 3D arrays of numbers, with integers in [0, 255]
• E.g. 300 x 100 x 3 (3 for the 3 color channels R, G, B)
ImageNet Challenge
• ImageNet (image-net.org)
• 15M high-resolution images
• over 22K categories
• labeled by Mechanical Turk workers
• ImageNet Large-Scale Visual Recognition Challenge (ILSVRC)
• 2010-2015
• 2010 competition
• 1.2M training images, 1000 categories (general and specific)
• 200K test images
AlexNet (2012)
• 5 convolutional layers
• 3 fully connected layers
AlexNet (2012): Key Ideas
• Downsampled images
• shorter dimension 256 pixels, longer dimension cropped about the center to 256 pixels
• R, G, B channels
AlexNet (2012): Key Ideas
• Data set augmentation
• Generate image translations by selecting random 224 x 224 sub-images
• Horizontal reflections (a standard trick in computer vision)
• When testing, extract 10 distinct 224x224 sub-images and average the predictions
• More data set augmentation
AlexNet (2012): Key Ideas
• ReLU instead of logistic or tanh units
• Normalize the output of a ReLU in a map at (x,y) based on the activity of features in adjacent maps at (x,y)
• Overlapping pooling
• pooling units spaced s pixels apart, each summarizing a z x z neighborhood, with s < z
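A 1-D sketch of overlapping vs. non-overlapping pooling (the toy input is made up; AlexNet's actual values were z = 3, s = 2):

```python
def pool1d(x, z, s):
    """Max pooling over windows of size z spaced s apart; s < z makes
    the windows overlap (AlexNet used z=3, s=2)."""
    return [max(x[i:i + z]) for i in range(0, len(x) - z + 1, s)]

x = [1, 5, 2, 8, 3, 0, 4]
over = pool1d(x, z=3, s=2)   # overlapping windows
non = pool1d(x, z=2, s=2)    # disjoint windows
print(over)  # [5, 8, 4]
print(non)   # [5, 8, 3]
```

With s < z, each input can contribute to more than one pooled output.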
Other Results that Followed
• ILSVRC 2012: AlexNet
• ILSVRC 2013: ZFNet. Proposed by Zeiler and Fergus; an improvement on AlexNet obtained by tweaking the architecture hyperparameters, in particular by expanding the size of the middle convolutional layers.
• ILSVRC 2014: GoogLeNet (Szegedy et al., Google). Development of an Inception Module that dramatically reduced the number of parameters in the network (4M, compared to AlexNet with 60M). Additionally, uses Average Pooling instead of Fully Connected layers at the top of the ConvNet, eliminating a large number of parameters that do not seem to matter much.
• ILSVRC 2014 runner-up: VGGNet (Simonyan and Zisserman). Showed that network depth is a critical component for good performance. Their final best network contains 16 CONV/FC layers and, appealingly, features an extremely homogeneous architecture that only performs 3x3 convolutions and 2x2 pooling from the beginning to the end. Outperformed GoogLeNet in multiple transfer learning tasks.
RCNNs

[Figure: region-based CNN detections labeled "Dog" and "Cat"]
Fast RCNN
Fully Convolutional Networks for Semantic Segmentation
• Using a Convolutional Net for Segmentation (Long, Shelhamer, Darrell, 2015)
• Determine what objects are where
• "what" depends on global information
• "where" depends on local information
Regular CNNs
• End-to-end learning: the network maps the image to a 1000-dimensional vector of class scores

How to do semantic segmentation?
• End-to-end learning
Fully Convolutional Networks for Semantic Segmentation
• Higher layers of CNN classifiers are nonspatial (no "where" info)
• If every layer remains spatial (as in conv. layers), then the output can specify where as well as what (albeit coarsely)
FCN Key Idea: Learnable Upsampling

[Figure: conv and pooling layers shrink the feature maps through H/4 x W/4, H/8 x W/8, H/16 x W/16, down to H/32 x W/32; a learnable upsampling stage then produces a pixelwise output fed to the loss]
• Bilinear interpolation uses "fixed" weights for interpolation
• The weights depend on the upsampling factor
• FCNs "learn" the weights in the upsampling
Bilinear Interpolation

Upsampling Layer
• Often referred to as "deconvolution" (a misnomer)
• Or transposed convolution
• Weighted combination of neighbors
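A 1-D sketch of transposed convolution as a "scatter" of the kernel (the triangular kernel below is the fixed bilinear choice for 2x upsampling; an FCN would learn these weights instead):

```python
def transposed_conv1d(y, w, stride=2):
    """1-D transposed convolution ("deconvolution"): each input value
    scatters a copy of the kernel w into the output, stride cells apart,
    and overlaps are summed."""
    out = [0.0] * ((len(y) - 1) * stride + len(w))
    for n, v in enumerate(y):
        for i, wi in enumerate(w):
            out[n * stride + i] += v * wi
    return out

# Fixed "bilinear" kernel for 2x upsampling (an assumption for the demo).
w = [0.5, 1.0, 0.5]
up = transposed_conv1d([1.0, 3.0], w)
print(up)  # [0.5, 1.0, 2.0, 3.0, 1.5]
```

Note that the interpolated midpoint between 1.0 and 3.0 comes out as 2.0, exactly what bilinear (here linear) interpolation would give.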
FCN Key Idea: Skip Connection

Figure caption (from the FCN paper): "Our DAG nets learn to combine coarse, high layer information with fine, low layer information. Pooling and prediction layers are shown as grids that reveal relative spatial coarseness, while intermediate layers are shown as vertical lines. First row (FCN-32s): Our single-stream net, described in Section 4.1, upsamples stride 32 predictions back to pixels in a single step. Second row (FCN-16s): Combining predictions from both the final layer and the pool4 layer, at stride 16, lets our net predict finer details, while retaining high-level semantic information. Third row (FCN-8s): Additional predictions from pool3, at stride 8, provide further precision."
Key Idea: Skip Connection
• The 32-pixel stride at the final prediction layer limits the scale of detail in the upsampled output
• This is resolved by adding skips that combine the final prediction layer with lower layers with finer strides. This turns a line topology into a DAG, with edges that skip ahead from lower layers to higher ones
• Combining fine layers and coarse layers lets the model make local predictions that respect global structure
CNN Applications
• We discussed
• Image classification
• Semantic segmentation
• Detection & localization
• Several others
• Depth estimation
• Stereo-based dense point correspondence
• Pose estimation
• Tracking (often with Recurrent NNs)
• Check out recent CVPR/ICCV/ECCV papers for numerous applications of CNNs
Backpropagation in General Cases
1. Decompose the operations in the layers of a neural network into function elements whose derivatives w.r.t. their inputs are known by symbolic computation.

h_Θ(x) = f^(last) ∘ … ∘ f^(l) ∘ … ∘ f^(2) ∘ f^(1)(x)

where a^(1) = x, a^(last) = h_Θ(x), and a^(l) = f^(l)(a^(l-1)) denotes the output of the l-th function element.
Backpropagation in General Cases
2. Backpropagate error signals corresponding to a differentiable cost function by numerical computation (starting from the cost function, plug in error signals backward).

δ^(l) = ∇_{a^(l)} J(Θ; x, y) = δ^(l+1) ∂f^(l+1)(a^(l)) / ∂a^(l)
Backpropagation in General Cases
3. Use the backpropagated error signals to compute gradients w.r.t. the parameters, only for the function elements with parameters, whose derivatives w.r.t. the parameters are known by symbolic computation.

∇_{θ^(l)} J(Θ; x, y) = (∂J/∂a^(l)) (∂a^(l)/∂θ^(l)) = δ^(l) ∂f^(l)(a^(l-1)) / ∂θ^(l)

where ∂f^(l)/∂θ^(l) is known symbolically.
Backpropagation in General Cases
4. Sum the gradients over all m examples to get the overall gradient:

∇_Θ J(Θ) = Σ_{i=1}^{m} ∇_Θ J(Θ; x^(i), y^(i))
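The four steps can be sketched end-to-end on a tiny one-parameter model (the model, data, and cost below are illustrative, not from the slides), checking the summed gradient against a numerical derivative:

```python
# Steps 1-4: compose function elements with known local derivatives,
# backpropagate error signals, then sum per-example gradients.
# Toy model: f(x) = sigmoid(theta * x), cost J = 0.5 * (f(x) - y)**2.
import math

def forward(theta, x):
    a1 = x                         # a^(1) = x
    a2 = theta * a1                # linear element
    a3 = 1 / (1 + math.exp(-a2))   # sigmoid element, a^(last)
    return a1, a2, a3

def grad_one(theta, x, y):
    a1, a2, a3 = forward(theta, x)
    d3 = a3 - y                    # error signal dJ/da3
    d2 = d3 * a3 * (1 - a3)        # through sigmoid: da3/da2 = a3(1 - a3)
    return d2 * a1                 # dJ/dtheta = delta * da2/dtheta

data = [(0.5, 1.0), (-1.0, 0.0)]
theta = 0.3
total = sum(grad_one(theta, x, y) for x, y in data)  # step 4

# Check against a numerical derivative of the summed cost.
def cost(t):
    return sum(0.5 * (forward(t, x)[2] - y) ** 2 for x, y in data)

eps = 1e-6
numeric = (cost(theta + eps) - cost(theta - eps)) / (2 * eps)
print(abs(total - numeric) < 1e-8)  # True
```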
Derivatives of Convolution
Discrete convolution parameterized by a feature w, and its derivatives:
• Let x be the input and y be the output of the convolution layer.
• Here we focus on only one feature vector w = (w_1, w_2, …, w_|w|), although a convolution layer usually has multiple features.
• n indexes x and y, where 1 ≤ n ≤ |x| for x_n and 1 ≤ n ≤ |y| = |x| - |w| + 1 for y_n. i indexes w, where 1 ≤ i ≤ |w|.
Derivatives of Convolution
y = x ∗ w = (y_n),  y_n = Σ_{i=1}^{|w|} x_{n+i-1} w_i = w^T x_{n : n+|w|-1}

∂y_{n-i+1} / ∂x_n = w_i  and  ∂y_n / ∂w_i = x_{n+i-1}  for 1 ≤ i ≤ |w|
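These derivative identities can be checked numerically with a small pure-Python valid convolution (0-based indices; the values are toy data):

```python
# Numerical check of dy_n/dw_i = x_{n+i} for the 1-D valid convolution
# y_n = sum_i x[n+i] * w[i] (0-based indexing).
def conv_valid(x, w):
    return [sum(x[n + i] * w[i] for i in range(len(w)))
            for n in range(len(x) - len(w) + 1)]

x = [1.0, 2.0, -1.0, 0.5, 3.0]
w = [0.2, -0.4, 0.1]
y = conv_valid(x, w)
print(len(y))  # 3, i.e. |y| = |x| - |w| + 1

# Perturb w_i and watch y_n move by x_{n+i} per unit of perturbation.
eps = 1e-6
n, i = 1, 2
w2 = list(w)
w2[i] += eps
numeric = (conv_valid(x, w2)[n] - y[n]) / eps
print(abs(numeric - x[n + i]) < 1e-6)  # True
```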
[Figure: connection pattern of the 1-D convolution, with the weights w_1, …, w_|w| shared across positions]
• From a fixed x_n standpoint, x_n has outgoing connections to y_{n-|w|+1 : n}
• y_n has incoming connections from x_{n : n+|w|-1}
• All of y_{n-|w|+1 : n} have derivatives w.r.t. x_n
• Note that y and w …
Backpropagation in Convolution Layer
Error signals and the gradient for each example are computed by convolution, using the commutative property of convolution and the multivariable chain rule. Let's focus on single elements of the error signals and of the gradient w.r.t. w (writing δ^(l+1) for the error signal ∂J/∂y at the layer output):

δ_n^(l) = ∂J/∂x_n = Σ_{i=1}^{|w|} (∂J/∂y_{n-i+1}) (∂y_{n-i+1}/∂x_n) = Σ_{i=1}^{|w|} δ_{n-i+1}^(l+1) w_i = (δ^(l+1) ∗ flip(w))_n
Backpropagation in Convolution Layer
∂J/∂w_i = (∂J/∂y)(∂y/∂w_i) = Σ_{n=1}^{|x|-|w|+1} (∂J/∂y_n)(∂y_n/∂w_i) = Σ_{n=1}^{|x|-|w|+1} δ_n^(l+1) x_{n+i-1} = (x ∗ δ^(l+1))_i

∂J/∂w = (∂J/∂w_1, …, ∂J/∂w_|w|) = x ∗ δ^(l+1)
Backpropagation in Convolution Layer
[Figure: forward, backward, and gradient computations in a 1-D convolution layer]
• Forward propagation (valid convolution): y = x ∗ w
• Backward propagation (full convolution): δ^(l) = flip(w) ∗ δ^(l+1)
• Gradient computation (valid convolution): ∇_w J = x ∗ δ^(l+1)
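The three computations can be verified together in 1-D with a hand-picked upstream error signal δ (toy values; 0-based indices):

```python
def conv_valid(x, w):
    return [sum(x[n + i] * w[i] for i in range(len(w)))
            for n in range(len(x) - len(w) + 1)]

def conv_full(x, w):
    """Full convolution: zero-pad x by |w|-1 on both sides, then valid."""
    p = [0.0] * (len(w) - 1)
    return conv_valid(p + list(x) + p, w)

x = [1.0, 2.0, -1.0, 0.5, 3.0]
w = [0.2, -0.4, 0.1]
delta = [1.0, -2.0, 0.5]            # dJ/dy_n, chosen by hand (|y| = 3)

dx = conv_full(delta, w[::-1])      # backward: full conv with flipped w
dw = conv_valid(x, delta)           # gradient: valid conv of x with delta

# Compare with the chain rule applied element by element.
dx_ref = [sum(delta[m] * w[n - m] for m in range(len(delta))
              if 0 <= n - m < len(w)) for n in range(len(x))]
dw_ref = [sum(delta[n] * x[n + i] for n in range(len(delta)))
          for i in range(len(w))]

ok_dx = all(abs(a - b) < 1e-12 for a, b in zip(dx, dx_ref))
ok_dw = all(abs(a - b) < 1e-12 for a, b in zip(dw, dw_ref))
print(ok_dx, ok_dw)  # True True
```

Note the shapes come out right automatically: dx has length |x| and dw has length |w|.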
Derivatives of Pooling
The pooling layer subsamples statistics to obtain summary statistics, with any aggregate function (or filter) g whose input is a vector and whose output is a scalar. Subsampling is an operation like convolution; however, g is applied to disjoint (non-overlapping) regions.

Definition: subsample (or downsample)
Let m be the size of the pooling region, x be the input, and y be the output of the pooling layer. subsample(x, m)[n] denotes the n-th element of subsample(x, m).

y_n = subsample(x, m)[n] = g(x_{(n-1)m+1 : nm})
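A direct sketch of this definition, with the aggregate g passed in as a function (the input is toy data):

```python
def subsample(x, m, g):
    """y_n = g(region n), where region n is the n-th disjoint block of
    size m (0-based indexing)."""
    return [g(x[n * m:(n + 1) * m]) for n in range(len(x) // m)]

x = [1, 4, 2, 2, 5, 1]
pooled_max = subsample(x, 2, max)
pooled_mean = subsample(x, 2, lambda r: sum(r) / len(r))
print(pooled_max)   # [4, 2, 5]
print(pooled_mean)  # [2.5, 2.0, 3.0]
```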
Subsampling Equations

[Figure: the input x is split into regions of size m; the aggregate g maps region n to the output y_n]

Mean pooling:  g(x) = (1/m) Σ_{i=1}^{m} x_i,  ∂g/∂x_i = 1/m

Max pooling:  g(x) = max(x),  ∂g/∂x_i = 1 if x_i = max(x), 0 otherwise

Lp pooling:  g(x) = ‖x‖_p = (Σ_{i=1}^{m} |x_i|^p)^{1/p},  ∂g/∂x_i = (Σ_{j=1}^{m} |x_j|^p)^{1/p - 1} |x_i|^{p-1} sign(x_i)
Backpropagation in the Pooling Layer
Error signals for each example are computed by upsampling. Upsampling is an operation which backpropagates the error signals over the aggregate function g using its derivatives

g'_n = ∂g / ∂x_{(n-1)m+1 : nm}

g'_n can change depending on the pooling region n (e.g., which element was the max).
Backpropagation in the Pooling Layer
δ^(l)_{(n-1)m+1 : nm} = ∂J/∂x_{(n-1)m+1 : nm} = (∂J/∂y_n)(∂y_n/∂x_{(n-1)m+1 : nm}) = δ_n^(l+1) g'_n
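A sketch for the max-pooling case, where g'_n is an indicator of the argmax, so upsampling routes each δ_n to the winning position (ties go to the first maximum in this sketch; the values are toy data):

```python
def max_pool_backward(x, delta, m):
    """Upsample the error signal through max pooling: delta_n is routed
    to the position that achieved the max in region n, since dg/dx_i is
    1 there and 0 elsewhere."""
    dx = [0.0] * len(x)
    for n, d in enumerate(delta):
        region = x[n * m:(n + 1) * m]
        winner = n * m + region.index(max(region))  # argmax within region n
        dx[winner] += d
    return dx

x = [1, 4, 2, 2, 5, 1]
delta = [0.5, -1.0, 2.0]   # dJ/dy_n for the three pooled outputs
dx = max_pool_backward(x, delta, 2)
print(dx)  # [0.0, 0.5, -1.0, 0.0, 2.0, 0.0]
```

All non-winning positions receive zero gradient, which is exactly the indicator derivative of max pooling given above.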
[Figure: the error signal δ_n^(l+1) at pooled output y_n is upsampled through ∂g/∂x back over the m inputs of region n]