• No results found

Administrative - how is the assignment going? - btw, the notes get updated all the time based on your feedback - no lecture on Monday

N/A
N/A
Protected

Academic year: 2021

Share "Administrative - how is the assignment going? - btw, the notes get updated all the time based on your feedback - no lecture on Monday"

Copied!
66
0
0

Loading.... (view fulltext now)

Full text

(1)

Administrative

- how is the assignment going?

- btw, the notes get updated all the time based on your feedback

- no lecture on Monday

(2)

Lecture 4:

Optimization

(3)

Image Classification

cat

assume given set of discrete labels {dog, cat, truck, plane, ...}

(4)

Data-driven approach

(5)

1. Score function

(6)
(7)

1. Score function

2. Two loss functions

(8)
(9)

Three key components to training Neural Nets:

1. Score function

2. Loss function

3. Optimization

(10)

Brief aside: Image Features

- In practice, very rare to see Computer Vision applications

that train linear classifiers on pixel values

(11)

Brief aside: Image Features

- In practice, very rare to see Computer Vision applications

that train linear classifiers on pixel values

(12)

Example: Color (Hue) Histogram

hue bins

+1

(13)

Example: HOG features

8x8 pixel region, quantize the edge orientation into 9 bins

(images from vlfeat.org)

(14)

Example: Bag of Words

1. Resize patch to a fixed size (e.g. 32x32 pixels) 2. Extract HOG on the patch (get 144 numbers)

gives a matrix of size

[number_of_features x 144]

repeat for each detected feature

Problem: different images will have different numbers of

features. Need fixed-sized vectors for linear classification

(15)

Example: Bag of Words

1. Resize patch to a fixed size (e.g. 32x32 pixels) 2. Extract HOG on the patch (get 144 numbers)

gives a matrix of size

[number_of_features x 144]

repeat for each detected feature

(16)

Example: Bag of Words

144

visual word vectors

learn k-means centroids

“vocabulary of visual words

e.g. 1000 centroids

1000-d vector

1000-d vector

1000-d vector histogram of visual words

(17)

Brief aside: Image Features

(18)

Most recognition systems are build on the same Architecture

(slide from Yann LeCun)

(19)

Most recognition systems are build on the same Architecture

(slide from Yann LeCun)

CNNs:

end-to-end models

(20)

Visualizing the loss function

(21)

Visualizing the (SVM) loss function

(22)

Visualizing the (SVM) loss function

(23)

Visualizing the (SVM) loss function

(24)

Visualizing the (SVM) loss function

the full data loss:

(25)

Visualizing the (SVM) loss function

Suppose there are 3 examples with 3 classes (class 0, 1, 2 in sequence), then this becomes:

(26)

Visualizing the (SVM) loss function

Suppose there are 3 examples with 3 classes (class 0, 1, 2 in sequence), then this becomes:

(27)

Visualizing the (SVM) loss function

Suppose there are 3 examples with 3 classes (class 0, 1, 2 in sequence), then this becomes:

(28)

Visualizing the (SVM) loss function

Suppose there are 3 examples with 3 classes (class 0, 1, 2 in sequence), then this becomes:

Question: CIFAR-10 has 50,000 training images, 5,000 per class and 10 labels. How many

occurrences of one classifier row in the full data loss?

(29)

Optimization

(30)

Strategy #1: A first very bad idea solution: Random search

(31)

Strategy #1: A first very bad idea solution: Random search

(32)

Strategy #1: A first very bad idea solution: Random search

what’s up with

0.0001?

(33)

Lets see how well this works on the test set...

(34)

Fun aside:

When W = 0, what is the CIFAR-10 loss for SVM and Softmax?

(35)

Strategy #2: A better but still very bad idea solution:

Random local search

(36)

Strategy #2: A better but still very bad idea solution:

Random local search

gives 21.4%!

(37)
(38)
(39)

Strategy #3: Following the gradient

In 1-dimension, the derivative of a function:

In multiple dimension, the gradient is the vector of (partial derivatives).

(40)

Evaluation the

gradient numerically

(41)

Evaluation the

gradient numerically

“finite difference

approximation”

(42)

Evaluation the

gradient numerically

“centered difference formula”

in practice:

(43)

Evaluation the

gradient numerically

(44)

performing a

parameter update

(45)

performing a

parameter update

(46)

original W

negative gradient direction

(47)

The problems with

numerical gradient:

(48)

The problems with numerical gradient:

- approximate

- very slow to evaluate

(49)

We need something better...

(50)

We need something better...

Calculus

(51)
(52)
(53)
(54)

In summary:

- Numerical gradient: approximate, slow, easy to write - Analytic gradient: exact, fast, error-prone

=>

In practice: Always use analytic gradient, but check

implementation with numerical gradient. This is called a

gradient check.

(55)

Gradient check: Words of caution

(56)

Gradient Descent

(57)

Mini-batch Gradient Descent

- only use a small portion of the training set to compute the gradient.

Common mini-batch sizes are ~100 examples.

e.g. Krizhevsky ILSVRC ConvNet used 256 examples

(58)

Stochastic Gradient Descent (SGD)

- use a single example at a time

1

(also sometimes called on-line Gradient Descent)

(59)

Summary

- Always use mini-batch gradient descent

- Incorrectly refer to it as “doing SGD” as everyone else - The mini-batch size is a hyperparameter, but it is not

very common to cross-validate over it (usually based on practical concerns, e.g. space/time efficiency)

(or call it batch gradient descent)

(60)

Fun question: Suppose you were training with mini-batch size of 100, and now you switch to mini-batch of size 1. Your learning rate (step size) should:

- increase - decrease

- stay the same

- become zero

(61)

The dynamics of Gradient Descent

(62)

The dynamics of Gradient Descent

always pull the weights down

pull some weights up and some down

(63)

Momentum Update

gradient

momentum

update

(64)

Many other ways to perform optimization…

- Second order methods that use the Hessian (or its approximation): BFGS, LBFGS, etc.

- Currently, the lesson from the trenches is that well-tuned

SGD+Momentum is very hard to beat for CNNs.

(65)

Summary

- We looked at image features, and saw that CNNs can be thought of as learning the features in end-to-end manner

- We explored intuition about what the loss surfaces of linear classifiers look like

- We introduced gradient descent as a way of optimizing loss functions, as well as batch gradient descent and SGD.

- Numerical gradient: slow :(, approximate :(, easy to write :) - Analytic gradient: fast :), exact :), error-prone :(

- In practice: Gradient check (but be careful)

(66)

Next class:

Becoming a

backprop ninja

References

Related documents

The ideal of self-r eliance cer tainly does not exclude this other ideal of shar ing, of the g iv e-and-tak e, whether spontaneous or institutionalized, that

In this context, the present study is a report of the BRN- BD that aims to evaluate the prevalence and clinical correlates of cardiovascular risk factors among patients with BD from

What is the relationship between reported extent and effectiveness of interprofessional collaboration and: work setting, patient population, college degree, physical proximity

I find no evidence of any effects of statewide student achievement data on education policy preferences such as increasing overall spending levels, increasing teacher

[As described in the previous lecture.] – Stochastic gradient descent: faster than batch on large, redundant data sets.. [Whereas batch gradient descent walks downhill on one

• Introduced online gradient descent, which can solve online convex optimization with average regret approaching 0. • Introduced stochastic gradient descent, an

Stagecoach Island and Dell support in Second Life bring customers into the virtual worlds that have interests more in line with the business objectives of the

Nesterov (1988), “On an approach to the construction of optimal methods of minimization of smooth convex functions”.