Midterm Review

(1)

Fei-Fei Li & Justin Johnson & Serena Yeung Midterm Review Lecture 3 - May 4th, 2018 April 10, 2018

Midterm Review

(2)

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 3 - April 10, 2018

50 minutes is short!

This is just to help you get going with your studies.

Midterm Review May 4th, 2018

(3)

Overview of today’s session

Summary of Course Material:

● How we “power” neural networks:

○ Loss function

○ Optimization

● How we build complex network models

○ Nonlinear Activations

○ Convolutional Layers

● How we “rein in” complexity

○ Regularization

Practice Midterm Problems

Q&A, time permitting

(5)

Lecture 3: Loss Functions and Optimization

(6)

An optimization problem

At the end of the day, we want to train a model that performs a desired task well, and a proxy for achieving this is minimizing a loss function

(7)

SVM/Softmax Loss

- We have some dataset of (x,y)
- We have a score function:
- We have a loss function, e.g.:
  - Softmax
  - SVM
  - Full loss
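The formulas on this slide were images and did not survive extraction. As a reminder, the standard forms are sketched below, assuming a linear score function, a margin of 1 for the SVM, and a regularizer R(W) with strength λ (these symbols are assumptions, not recovered from the slide):

```latex
s = f(x_i; W) = W x_i

L_i^{\text{softmax}} = -\log\left( \frac{e^{s_{y_i}}}{\sum_j e^{s_j}} \right)

L_i^{\text{SVM}} = \sum_{j \neq y_i} \max(0,\; s_j - s_{y_i} + 1)

L = \frac{1}{N} \sum_{i=1}^{N} L_i + \lambda R(W)
```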

(8)

Know how to derive the SVM and Softmax gradients!
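One way to check a hand derivation is to compare it against a numerical gradient. The sketch below does this for a single example, taking gradients with respect to the scores only. The function names and the example scores are illustrative assumptions, not part of the slides.

```python
import numpy as np

def softmax_loss_and_grad(scores, y):
    """Softmax loss for one example and its gradient w.r.t. the scores."""
    shifted = scores - np.max(scores)            # shift for numerical stability
    probs = np.exp(shifted) / np.sum(np.exp(shifted))
    loss = -np.log(probs[y])
    dscores = probs.copy()
    dscores[y] -= 1.0                            # dL/ds_j = p_j - 1[j == y]
    return loss, dscores

def svm_loss_and_grad(scores, y, margin=1.0):
    """Multiclass SVM (hinge) loss for one example and its gradient w.r.t. the scores."""
    margins = np.maximum(0.0, scores - scores[y] + margin)
    margins[y] = 0.0                             # the correct class is excluded from the sum
    loss = np.sum(margins)
    dscores = (margins > 0).astype(float)        # +1 for each class that violates the margin
    dscores[y] = -np.sum(dscores)                # -(number of violating classes)
    return loss, dscores

def numerical_grad(f, s, eps=1e-5):
    """Centered finite-difference gradient of a scalar function f at s."""
    grad = np.zeros_like(s)
    for j in range(s.size):
        step = np.zeros_like(s)
        step[j] = eps
        grad[j] = (f(s + step) - f(s - step)) / (2 * eps)
    return grad

scores = np.array([3.2, 5.1, -1.7])              # hypothetical class scores
y = 0                                            # hypothetical correct class

softmax_loss, softmax_grad = softmax_loss_and_grad(scores, y)
svm_loss, svm_grad = svm_loss_and_grad(scores, y)
num_softmax = numerical_grad(lambda s: softmax_loss_and_grad(s, y)[0], scores)
num_svm = numerical_grad(lambda s: svm_loss_and_grad(s, y)[0], scores)
```

The gradient with respect to W then follows from the chain rule through s = Wx.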

(9)

Stochastic Gradient Descent (SGD)

Full sum is expensive when N is large!

Approximate the sum using a minibatch of examples

Minibatch sizes of 32 / 64 / 128 are common
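A minimal sketch of the minibatch idea follows. It uses a least-squares loss on synthetic data as a stand-in for the course's classification losses. The function name, data, learning rate, and step count are all assumptions chosen for illustration.

```python
import numpy as np

def sgd_linear_regression(X, y, lr=0.05, batch_size=32, steps=500, seed=0):
    """Minimize mean squared error with minibatch SGD (illustrative stand-in loss)."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        # Sample a minibatch instead of summing over all N examples.
        idx = rng.choice(X.shape[0], size=batch_size, replace=False)
        Xb, yb = X[idx], y[idx]
        grad = 2.0 * Xb.T @ (Xb @ w - yb) / batch_size   # minibatch estimate of the gradient
        w -= lr * grad
    return w

data_rng = np.random.default_rng(1)
X = data_rng.normal(size=(1000, 3))
true_w = np.array([1.0, -2.0, 0.5])      # hypothetical ground-truth weights
y = X @ true_w                           # noise-free targets
w_hat = sgd_linear_regression(X, y)
```

Each step touches only 32 of the 1000 examples, yet the iterates still approach the weights that generated the data.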

(10)

Learning Rate Loss Curves

(11)

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 7 - April 24, 2018

Optimization: Problems with SGD

What if the loss function has a local minimum or saddle point?

(12)

Optimization: Problems with SGD

What if the loss function has a local minimum or saddle point?

Zero gradient, gradient descent gets stuck

(13)

Optimization: Problems with SGD

What if loss changes quickly in one direction and slowly in another?

What does gradient descent do?

Very slow progress along shallow dimension, jitter along steep direction

Loss function has high condition number: ratio of largest to smallest singular value of the Hessian matrix is large

(14)

Optimization: Problems with SGD

Our gradients come from minibatches so they can be noisy!

(15)

Update Rules

SGD

Momentum

Nesterov Momentum

AdaGrad

RMSProp

Adam
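The update formulas on this slide were images. The sketch below writes out the commonly used forms of five of the rules; Nesterov momentum is omitted for brevity. Function names, argument order, and default hyperparameters are assumptions, and the exact variants shown in lecture may differ.

```python
import numpy as np

def sgd_step(w, dw, lr):
    return w - lr * dw

def momentum_step(w, dw, v, lr, rho=0.9):
    v = rho * v + dw                          # running "velocity" of gradients
    return w - lr * v, v

def adagrad_step(w, dw, cache, lr, eps=1e-7):
    cache = cache + dw * dw                   # accumulated squared gradients
    return w - lr * dw / (np.sqrt(cache) + eps), cache

def rmsprop_step(w, dw, cache, lr, decay=0.99, eps=1e-7):
    cache = decay * cache + (1 - decay) * dw * dw   # leaky version of the AdaGrad cache
    return w - lr * dw / (np.sqrt(cache) + eps), cache

def adam_step(w, dw, m, v, t, lr, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * dw          # first moment (momentum-like)
    v = beta2 * v + (1 - beta2) * dw * dw     # second moment (RMSProp-like)
    m_hat = m / (1 - beta1 ** t)              # bias correction, t starts at 1
    v_hat = v / (1 - beta2 ** t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

# One step of each rule on f(w) = w^2 starting from w = 1 (gradient is 2w).
w0 = np.array([1.0])
dw0 = 2.0 * w0
zeros = np.zeros_like(w0)
w_sgd = sgd_step(w0, dw0, lr=0.1)
w_mom, _ = momentum_step(w0, dw0, zeros, lr=0.1)
w_ada, _ = adagrad_step(w0, dw0, zeros, lr=0.1)
w_rms, _ = rmsprop_step(w0, dw0, zeros, lr=0.1)
w_adam, _, _ = adam_step(w0, dw0, zeros, zeros, t=1, lr=0.1)
```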

(17)

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 6 - April 19, 2018

Lecture 6: Training Neural Networks, Part I

(18)

Activation Functions

Sigmoid

tanh

ReLU

Leaky ReLU

Maxout

ELU

(19)

Activation Functions

Sigmoid

- Squashes numbers to range [0,1]
- Historically popular since it has a nice interpretation as a saturating “firing rate” of a neuron

3 problems:

1. Saturated neurons “kill” the gradients
2. Sigmoid outputs are not zero-centered
3. exp() is a bit compute expensive
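The first two problems can be seen numerically. In this small sketch (helper names are illustrative), the local gradient σ(x)(1 − σ(x)) peaks at 0.25 and collapses toward zero for large |x|, and every output is positive.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)       # local gradient d(sigmoid)/dx

xs = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])
outputs = sigmoid(xs)          # all strictly positive, so not zero-centered
grads = sigmoid_grad(xs)       # largest at x = 0, tiny in the saturated tails
```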

(20)

Consider what happens when the input to a neuron is always positive...

What can we say about the gradients on w?

Always all positive or all negative :(

(this is also why you want zero-mean data!)

[Figure: the allowed gradient update directions force a zig zag path toward the hypothetical optimal w vector]

(21)

Activation Functions

ReLU (Rectified Linear Unit)

- Computes f(x) = max(0,x)
- Does not saturate (in +region)
- Very computationally efficient
- Converges much faster than sigmoid/tanh in practice (e.g. 6x)
- Actually more biologically plausible than sigmoid
- Not zero-centered output
- An annoyance: (hint: what is the gradient when x < 0?)

(25)

[Figure: data cloud with an active ReLU and a dead ReLU]

dead ReLU will never activate => never update

h = WX + b

o = relu(h)

do / dh = 0

dL / dh = 0

dL / dh * dh / dW = 0
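The chain of zeros above can be reproduced in a few lines. This is a sketch under assumed values: the data, the weights, and the large negative bias (picked so that the unit sits entirely outside the data cloud) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 100))            # 100 data points with 3 features each
W = rng.normal(size=(1, 3))
b = np.array([[-100.0]])                 # hypothetical bias that pushes h below 0 for all data

h = W @ X + b                            # pre-activation, shape (1, 100)
o = np.maximum(0.0, h)                   # relu(h): zero everywhere for a dead unit

dL_do = np.ones_like(o)                  # stand-in upstream gradient
do_dh = (h > 0).astype(float)            # ReLU local gradient: 0 wherever h <= 0
dL_dh = dL_do * do_dh
dL_dW = dL_dh @ X.T                      # dL/dh * dh/dW
dL_db = dL_dh.sum(axis=1, keepdims=True)
```

Since both `dL_dW` and `dL_db` are exactly zero, no gradient step can move the unit back toward the data.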

(28)

Vanishing/Exploding Gradient

Vanishing Gradient:

- Gradient becomes too small
- Some causes:
  - Choice of activation function
  - Multiplying many small numbers together

Exploding Gradient:

- Gradient becomes too large

(29)

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 10 - May 3, 2018

Vanilla RNN Gradient Flow

[Figure: unrolled vanilla RNN with hidden states h₀ through h₄ and inputs x₁ through x₄]

Computing gradient of h₀ involves many factors of W (and repeated tanh)

Largest singular value > 1: Exploding gradients

Largest singular value < 1: Vanishing gradients

Gradient clipping: Scale gradient if its norm is too big

Bengio et al, “Learning long-term dependencies with gradient descent is difficult”, IEEE Transactions on Neural Networks, 1994

Pascanu et al, “On the difficulty of training recurrent neural networks”, ICML 2013
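A minimal sketch of both ideas from this slide follows. `backprop_norm` is a simplified illustration that ignores the tanh factors and uses a scaled identity for W, and `clip_by_norm` is one common way to implement clipping. All names and constants here are assumptions.

```python
import numpy as np

def backprop_norm(scale, steps, dim=4):
    """Norm of a gradient after passing back through `steps` copies of W = scale * I."""
    W = scale * np.eye(dim)              # every singular value equals `scale`
    g = np.ones(dim)
    for _ in range(steps):
        g = W.T @ g                      # one factor of W per time step (tanh ignored)
    return np.linalg.norm(g)

def clip_by_norm(grad, threshold):
    """Rescale grad so its L2 norm is at most threshold; direction is preserved."""
    norm = np.linalg.norm(grad)
    if norm > threshold:
        grad = grad * (threshold / norm)
    return grad

exploding = backprop_norm(1.1, steps=50)   # singular values > 1
vanishing = backprop_norm(0.9, steps=50)   # singular values < 1
clipped = clip_by_norm(np.array([3.0, 4.0]), threshold=1.0)
```

Clipping addresses only the exploding case; it does nothing for gradients that have already vanished.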

(31)

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 5 - April 17, 2018

Convolution Layer

32x32x3 image, 5x5x3 filter

convolve (slide) over all spatial locations

[Figure: the filter produces one 28x28x1 activation map]

(32)

Convolution Layer

For example, if we had 6 5x5 filters, we’ll get 6 separate activation maps:

[Figure: 32x32x3 input, six 28x28 activation maps]

We stack these up to get a “new image” of size 28x28x6!
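The 28x28x6 shape follows from the usual output-size arithmetic, (N − K + 2P) / S + 1 along each spatial dimension. A small sketch, with a hypothetical helper name and stride and padding defaults matching the slide's example:

```python
def conv_output_shape(h, w, k, num_filters, stride=1, pad=0):
    """Output (height, width, depth) of a conv layer with square k x k filters."""
    out_h = (h - k + 2 * pad) // stride + 1
    out_w = (w - k + 2 * pad) // stride + 1
    return out_h, out_w, num_filters

shape = conv_output_shape(32, 32, k=5, num_filters=6)
print(shape)  # (28, 28, 6)
```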

(36)

Convolution Layer

In contrast to a fully connected layer, each term in the output depends on spatially local ‘subregions’ of the input

Question: connection between an FC layer and a convolutional layer?

Answer: FC looks like convolution layer with filter size HxW
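This equivalence can be checked numerically. In the sketch below (shapes and names are illustrative assumptions), each HxW filter covers the whole input, so the "convolution" has exactly one valid position and reduces to one dot product per filter, which is what an FC layer without bias computes.

```python
import numpy as np

rng = np.random.default_rng(0)
H, W, C, K = 4, 5, 3, 7                       # input height, width, channels; number of filters
x = rng.normal(size=(H, W, C))
filters = rng.normal(size=(K, H, W, C))       # K filters, each as large as the input

# Fully connected view: flatten the input and use each filter as one row of the weight matrix.
fc_out = filters.reshape(K, -1) @ x.reshape(-1)

# Convolution view: the filter fits at a single location, giving a 1x1xK output.
conv_out = np.array([np.sum(f * x) for f in filters])
```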

(38)

Drawbacks of increased complexity: Overfitting (Bias vs Variance)

Source: Wikipedia

(40)

Combat overfitting

● Increase data quantity/quality

○ Data augmentation

● Impose extra constraints

○ On model parameters: L2 regularization

○ On layer outputs: Batchnorm

● Introduce randomness/uncertainty

○ Dropout

○ Batchnorm

○ Stochastic depth, drop connect

Practice Midterm Problems

(44)

Receptive field size

‘Input data seen/received’ in single activation layer ‘pixel’

[Figure: Input, Conv2d]

(50)

Summary

Note: Generally when we refer to ‘receptive field’, we mean with respect to input data/layer 0/original image, not with respect to direct input to the layer

(51)

Summary

(Need to compute recursively!)

(52)

Going back to activation dimensions...

Cumulative receptive field of layer output = layer input

Activation dimensions

(59)

Receptive field size

[Figure: Input, Conv2d (k=3, s=1), Conv2d (k=5, s=1)]

First Conv2d: k=3, s=1, m=1, n=3

Second Conv2d: k=5, s=1, m=3, n=7
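The recursion behind these numbers can be sketched as follows. Here m is read as the receptive field of a layer's input and n as that of its output, both measured on the original image. That reading, the helper name, and the `jump` bookkeeping for strides are assumptions, since the slide's formula did not survive extraction.

```python
def receptive_field(layers):
    """Cumulative receptive field after each layer.

    layers: list of (kernel_size, stride) pairs, ordered from first to last layer.
    """
    rf = 1        # a single input pixel "sees" only itself (m = 1 for the first layer)
    jump = 1      # distance in input pixels between adjacent outputs (product of earlier strides)
    sizes = []
    for k, s in layers:
        rf = rf + (k - 1) * jump     # n = m + (k - 1) * jump
        jump *= s
        sizes.append(rf)
    return sizes

sizes = receptive_field([(3, 1), (5, 1)])
print(sizes)  # [3, 7]
```

With all strides equal to 1 the rule reduces to n = m + (k − 1), which reproduces n=3 and n=7 above.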

(64)

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 9 - May 1, 2018

Case Study: VGGNet

[Simonyan and Zisserman, 2014]

Q: Why use smaller filters? (3x3 conv)

[Figure: layer-by-layer architecture diagrams of AlexNet, VGG16, and VGG19]

Stack of three 3x3 conv (stride 1) layers has same effective receptive field as one 7x7 conv layer

But deeper, more non-linearities

And fewer parameters: 3 * (3²C²) vs. 7²C² for C channels per layer
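The parameter comparison is quick to verify. In this sketch biases are ignored, as in the slide's count, and C = 64 is just an example value.

```python
def conv_params(k, c_in, c_out):
    """Weight count of a conv layer with k x k filters (biases ignored)."""
    return k * k * c_in * c_out

C = 64                                   # hypothetical channel count per layer
stacked = 3 * conv_params(3, C, C)       # three 3x3 layers: 3 * (3^2 * C^2) = 27 * C^2
single = conv_params(7, C, C)            # one 7x7 layer: 7^2 * C^2 = 49 * C^2
```

For any C the ratio is 27/49, so the stack needs roughly 55% of the weights of the single 7x7 layer.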

(68)

Chain Rule

(76)

[Figure: loss vs. time curve]

Bad initialization a prime suspect
