CS 4803 / 7643: Deep Learning

Academic year: 2021

(1)

CS 4803 / 7643: Deep Learning

Dhruv Batra

Georgia Tech

Topics:

– Analytical Gradients

– Automatic Differentiation

– Computational Graphs

(2)

Administrativia

HW1 Reminder

Due: 09/09, 11:59pm

(3)

Recap from last time

(4)

Strategy: Follow the slope

(5)

Gradient Descent

(6)

Full sum is expensive when N is large!

Approximate the sum using a minibatch of examples (sizes of 32 / 64 / 128 are common)

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
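As a minimal sketch of the minibatch idea, the snippet below estimates the gradient from 32 randomly chosen examples instead of summing over all N. The least-squares loss, the toy data, the `minibatch_gradient` helper, the learning rate, and the step count are all illustrative assumptions, not taken from the slides.

```python
import random

def minibatch_gradient(w, data, batch_size, rng):
    """Estimate dL/dw for L(w) = mean((w*x - y)^2) from a random minibatch."""
    batch = rng.sample(data, batch_size)
    return sum(2.0 * (w * x - y) * x for x, y in batch) / batch_size

rng = random.Random(0)
# Hypothetical toy data generated from y = 3x, with N = 1000 examples.
data = [(x / 100.0, 3.0 * x / 100.0) for x in range(1, 1001)]

w = 0.0
learning_rate = 0.01
for step in range(2000):
    grad = minibatch_gradient(w, data, batch_size=32, rng=rng)
    w -= learning_rate * grad  # follow the negative slope

print(w)  # close to 3.0
```

Each step touches 32 examples rather than 1000, at the price of a noisier gradient estimate.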

(7)

How do we compute gradients?

(8)
(9)
(10)

current W: [0.34, -1.11, 0.78, 0.12, 0.55, 2.81, -3.1, -1.5, 0.33,…]

loss 1.25347

gradient dW: [-2.5, ?, ?, ?, ?, ?, ?, ?, ?,…]

(1.25322 - 1.25347)/0.0001 = -2.5

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
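The finite-difference computation on this slide can be sketched as a loop over the dimensions of W, perturbing one entry at a time by h. The `numerical_gradient` helper and the quadratic `toy_loss` are hypothetical stand-ins, since the slide does not show the actual loss function.

```python
def numerical_gradient(loss_fn, w, h=1e-4):
    """One-sided finite differences: perturb each dimension of w by h in turn."""
    base = loss_fn(w)
    grad = []
    for i in range(len(w)):
        w_plus = list(w)
        w_plus[i] += h
        grad.append((loss_fn(w_plus) - base) / h)
    return grad

# Hypothetical stand-in loss (sum of squares), chosen so the true gradient is 2*w.
def toy_loss(w):
    return sum(wi * wi for wi in w)

W = [0.34, -1.11, 0.78]
print(numerical_gradient(toy_loss, W))  # approximately [0.68, -2.22, 1.56]
```

Note that this needs one extra loss evaluation per dimension of W, which is why the approach is slow for large models.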

(11)

Numerical gradient: slow :(, approximate :(, easy to write :)

Analytic gradient: fast :), exact :), error-prone :(

In practice: Derive analytic gradient, check your implementation with numerical gradient.

This is called a gradient check.

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
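A gradient check might look like the following sketch: derive the gradient by hand, then compare it against centered finite differences using a relative error. The sigmoid-unit model, the function names, and the tolerance are assumptions made for illustration only.

```python
import math

def loss(w, x, y):
    """Squared error of a single sigmoid unit (hypothetical toy model)."""
    p = 1.0 / (1.0 + math.exp(-sum(wi * xi for wi, xi in zip(w, x))))
    return (p - y) ** 2

def analytic_grad(w, x, y):
    """Hand-derived gradient: d/dw (p - y)^2 = 2 (p - y) * p (1 - p) * x."""
    p = 1.0 / (1.0 + math.exp(-sum(wi * xi for wi, xi in zip(w, x))))
    coeff = 2.0 * (p - y) * p * (1.0 - p)
    return [coeff * xi for xi in x]

def numeric_grad(w, x, y, h=1e-5):
    """Centered differences, which are more accurate than one-sided ones."""
    grad = []
    for i in range(len(w)):
        w_plus, w_minus = list(w), list(w)
        w_plus[i] += h
        w_minus[i] -= h
        grad.append((loss(w_plus, x, y) - loss(w_minus, x, y)) / (2.0 * h))
    return grad

def max_relative_error(a, b):
    return max(abs(ai - bi) / max(abs(ai) + abs(bi), 1e-12) for ai, bi in zip(a, b))

w, x, y = [0.3, -0.7, 0.5], [1.0, 2.0, -1.5], 1.0
error = max_relative_error(analytic_grad(w, x, y), numeric_grad(w, x, y))
print(error < 1e-6)  # a small relative error suggests the analytic gradient is right
```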

(12)

Plan for Today

Analytical Gradients

Automatic Differentiation

Computational Graphs

Forward mode vs Reverse mode AD

(13)

Example: Logistic Regression

(14)

Vector/Matrix Derivatives Notation

(15)

Vector/Matrix Derivatives Notation

(16)

Vector Derivative Example

(17)

Vector Derivative Example

(18)

Extension to Tensors

(19)

Chain Rule: Composite Functions

(20)

Chain Rule: Scalar Case

(21)

Chain Rule: Vector Case

(22)

Chain Rule: Jacobian view
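The equations for this slide did not survive extraction. As a hedged reconstruction of the standard Jacobian form of the chain rule, assume x in R^n, y = g(x) in R^m, and z = f(y) in R^k (the symbols f, g, n, m, k are assumed notation, not necessarily the lecture's):

```latex
% Assumed setup: x \in \mathbb{R}^n, \quad y = g(x) \in \mathbb{R}^m, \quad z = f(y) \in \mathbb{R}^k
\underbrace{\frac{\partial z}{\partial x}}_{k \times n}
  = \underbrace{\frac{\partial z}{\partial y}}_{k \times m}\;
    \underbrace{\frac{\partial y}{\partial x}}_{m \times n}

% Entry-wise form of the same matrix product:
\left[\frac{\partial z}{\partial x}\right]_{ij}
  = \sum_{l=1}^{m} \frac{\partial z_i}{\partial y_l}\,\frac{\partial y_l}{\partial x_j}
```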

(23)

Chain Rule: Graphical view

(24)

Chain Rule: Cascaded

(25)

Chain Rule: How should we multiply?
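One way to see why the multiplication order matters is to count scalar multiplications for a cascade of Jacobians. The sketch below assumes a scalar loss L and three 1000-dimensional intermediate quantities; the shapes and helper names are illustrative assumptions.

```python
def matmul_cost(a_shape, b_shape):
    """Scalar multiplications for (m x k) @ (k x n), plus the result shape."""
    m, k = a_shape
    k2, n = b_shape
    assert k == k2, "inner dimensions must match"
    return m * k * n, (m, n)

def chain_cost_left_to_right(shapes):
    """Cost of ((A B) C): start from the output-side Jacobian."""
    total, acc = 0, shapes[0]
    for s in shapes[1:]:
        cost, acc = matmul_cost(acc, s)
        total += cost
    return total

def chain_cost_right_to_left(shapes):
    """Cost of (A (B C)): start from the input-side Jacobian."""
    total, acc = 0, shapes[-1]
    for s in reversed(shapes[:-1]):
        cost, acc = matmul_cost(s, acc)
        total += cost
    return total

# dL/dx = (dL/dz) (dz/dy) (dy/dx) with scalar L and 1000-dimensional x, y, z.
shapes = [(1, 1000), (1000, 1000), (1000, 1000)]
print(chain_cost_left_to_right(shapes))  # 2000000
print(chain_cost_right_to_left(shapes))  # 1001000000
```

With a scalar output, starting from the output side keeps every intermediate product a row vector, which is the ordering reverse mode AD uses.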

(26)

Example: Logistic Regression

(27)

Logistic Regression Derivatives

(28)

Logistic Regression Derivatives
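The derivation on these slides was lost in extraction. As a sketch under stated assumptions (labels y in {0, 1}, a single example, and the negative log-likelihood loss, none of which is confirmed by the surviving text), applying the chain rule gives the familiar gradient (p - y) x, where p is the sigmoid of the score:

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def nll(w, x, y):
    """Negative log-likelihood of one example, assuming a label y in {0, 1}."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

def nll_grad(w, x, y):
    """Chain rule: dL/dp * dp/ds * ds/dw collapses to (p - y) * x."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
    return [(p - y) * xi for xi in x]

# Hypothetical weights, input, and label for illustration.
w, x, y = [0.2, -0.4], [1.0, 2.0], 1
print(nll_grad(w, x, y))
```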

(29)

input image

loss

weights

Figure copyright Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton, 2012. Reproduced with permission.

Convolutional network (AlexNet)

(30)

Figure reproduced with permission from a Twitter post by Andrej Karpathy.

input image

loss

Neural Turing Machine

(31)

How do we compute gradients?

(32)

Deep Learning = Differentiable Programming

Computation = Graph

– Input = Data + Parameters

– Output = Loss

– Scheduling = Topological ordering

Auto-Diff

A family of algorithms for

implementing chain-rule on computation graphs

(33)

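[Figure: computational graph. Inputs x and W feed a * node producing s (scores); the scores feed the hinge loss, which is added (+) to the regularization term R to give the loss L.]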

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

(34)

(C) Dhruv Batra

Figure Credit: Andrea Vedaldi

(35)

Any DAG of differentiable modules is allowed!

Slide Credit: Marc'Aurelio Ranzato

(36)

Directed Acyclic Graphs (DAGs)

Exactly what the name suggests

– Directed edges

– No (directed) cycles

– Underlying undirected cycles okay

(37)

Directed Acyclic Graphs (DAGs)

Concept

– Topological Ordering
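A topological ordering lists the nodes so that every node appears after all of its inputs, which is the order in which a computational graph can be evaluated. The sketch below uses Kahn's algorithm on a small hypothetical graph; the node names and the `topological_order` helper are assumptions for illustration.

```python
from collections import deque

def topological_order(nodes, edges):
    """Kahn's algorithm: repeatedly emit a node whose inputs are all done."""
    indegree = {n: 0 for n in nodes}
    children = {n: [] for n in nodes}
    for src, dst in edges:
        children[src].append(dst)
        indegree[dst] += 1
    ready = deque(n for n in nodes if indegree[n] == 0)
    order = []
    while ready:
        n = ready.popleft()
        order.append(n)
        for c in children[n]:
            indegree[c] -= 1
            if indegree[c] == 0:
                ready.append(c)
    if len(order) != len(nodes):
        raise ValueError("graph has a directed cycle, so it is not a DAG")
    return order

# Hypothetical graph: x1 feeds both mul and sin; mul and sin feed add.
nodes = ["x1", "x2", "mul", "sin", "add"]
edges = [("x1", "mul"), ("x2", "mul"), ("x1", "sin"), ("mul", "add"), ("sin", "add")]
print(topological_order(nodes, edges))  # ['x1', 'x2', 'sin', 'mul', 'add']
```

The underlying undirected cycle x1, mul, add, sin, x1 causes no trouble, since only directed cycles make the ordering impossible.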

(38)

Directed Acyclic Graphs (DAGs)

(39)

Computational Graphs

Notation

(40)
(41)

HW0

(42)

(43)

Logistic Regression as a Cascade

Given a library of simple functions

Compose into a complicated function
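The cascade idea can be sketched by chaining small functions: a dot product, a sigmoid, and a negative log. The `compose` helper and the particular three-stage decomposition are illustrative assumptions; the slide's own decomposition is not recoverable from the text.

```python
import math
from functools import reduce

def compose(*fns):
    """compose(f, g, h)(x) == h(g(f(x))): apply the functions left to right."""
    return lambda x: reduce(lambda acc, fn: fn(acc), fns, x)

# A small "library" of simple functions.
def dot_with(w):
    return lambda x: sum(wi * xi for wi, xi in zip(w, x))

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def neg_log(p):
    return -math.log(p)

# Logistic regression loss for a positive example, built as a cascade.
w = [0.5, -0.25]  # hypothetical weights
loss_for_positive = compose(dot_with(w), sigmoid, neg_log)
value = loss_for_positive([2.0, 4.0])
print(value)  # dot = 0, sigmoid = 0.5, so the loss is log(2)
```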

(44)

Deep Learning = Differentiable Programming

Computation = Graph

– Input = Data + Parameters

– Output = Loss

– Scheduling = Topological ordering

Auto-Diff

A family of algorithms for

implementing chain-rule on computation graphs

(45)

Forward mode vs Reverse Mode

Key Computations

(46)

(47)

(48)
(49)
(50)
(51)

[Figure: computational graph with inputs x1, x2 and nodes *, sin( ), +]

(52)

[Figure: computational graph with inputs x1, x2 and nodes *, sin( ), +]

(53)

Example: Forward mode AD

[Figure: computational graph with inputs x1, x2 and nodes *, sin( ), +]

(54)

Example: Forward mode AD

[Figure: computational graph with inputs x1, x2 and nodes *, sin( ), +]

Q: What happens if there’s another input variable x3?
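A minimal forward mode sketch using dual numbers, where each value carries its derivative with respect to one chosen input. It assumes the graph computes f(x1, x2) = x1 * x2 + sin(x1), which matches the node labels on the slide but is not stated in the surviving text; the `Dual` class and helper names are hypothetical.

```python
import math

class Dual:
    """A value paired with its derivative w.r.t. one chosen input."""
    def __init__(self, value, deriv=0.0):
        self.value, self.deriv = value, deriv

    def __add__(self, other):
        return Dual(self.value + other.value, self.deriv + other.deriv)

    def __mul__(self, other):
        # Product rule.
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)

def sin(d):
    return Dual(math.sin(d.value), math.cos(d.value) * d.deriv)

def f(x1, x2):
    # Assumed function for the graph on the slide.
    return x1 * x2 + sin(x1)

def forward_mode_gradient(fn, inputs):
    """One forward pass per input: seed that input's derivative with 1."""
    grad = []
    for i in range(len(inputs)):
        duals = [Dual(v, 1.0 if j == i else 0.0) for j, v in enumerate(inputs)]
        grad.append(fn(*duals).deriv)
    return grad

print(forward_mode_gradient(f, [2.0, 3.0]))  # [x2 + cos(x1), x1] at (2, 3)
```

In this sketch, adding a third input x3 would mean a third full pass through the graph, since each pass propagates the derivative with respect to a single input.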

(55)

Example: Forward mode AD

[Figure: computational graph with inputs x1, x2 and nodes *, sin( ), +]

(56)

Example: Forward mode AD

[Figure: computational graph with inputs x1, x2 and nodes *, sin( ), +]

(57)
(58)
(59)

Example: Reverse mode AD

[Figure: computational graph with inputs x1, x2 and nodes sin( ), +]

(60)

+

Gradients add at branches

(61)

Example: Reverse mode AD

[Figure: computational graph with inputs x1, x2 and nodes *, sin( ), +]

(62)

Example: Reverse mode AD

[Figure: computational graph with inputs x1, x2 and nodes *, sin( ), +]

Q: What happens if there’s another input variable x3?
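A minimal reverse mode sketch: the forward pass records each node's inputs and local derivatives, and a single backward pass in reverse topological order accumulates gradients. It assumes the graph computes f(x1, x2) = x1 * x2 + sin(x1), consistent with the node labels but not stated in the surviving text; the `Var` class and `backward` function are hypothetical names.

```python
import math

class Var:
    """A graph node holding a value, its inputs, and an accumulated gradient."""
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents  # tuples of (input Var, local derivative)
        self.grad = 0.0

    def __add__(self, other):
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))

def sin(v):
    return Var(math.sin(v.value), ((v, math.cos(v.value)),))

def backward(output):
    """Propagate d(output)/d(node) from the output back to every input."""
    order, seen = [], set()

    def visit(node):  # depth-first search yields a topological order
        if id(node) in seen:
            return
        seen.add(id(node))
        for parent, _ in node.parents:
            visit(parent)
        order.append(node)

    visit(output)
    output.grad = 1.0
    for node in reversed(order):
        for parent, local in node.parents:
            parent.grad += node.grad * local  # gradients add at branches

x1, x2 = Var(2.0), Var(3.0)
y = x1 * x2 + sin(x1)  # assumed function for the graph on the slide
backward(y)
print(x1.grad, x2.grad)  # x2 + cos(x1) and x1, evaluated at (2, 3)
```

In this sketch, a third input x3 would need no extra pass: the same backward sweep fills in the gradient for every input at once.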

(63)

Example: Reverse mode AD

[Figure: computational graph with inputs x1, x2 and nodes *, sin( ), +]

(64)

Example: Reverse mode AD

[Figure: computational graph with inputs x1, x2 and nodes *, sin( ), +]

(65)

Forward mode vs Reverse Mode

x → Graph → L

Intuition of Jacobian

(66)

Forward Pass vs Forward mode AD vs Reverse Mode AD

(67)

Forward mode vs Reverse Mode

What are the differences?

(68)

Forward mode vs Reverse Mode

What are the differences?

Which one is faster to compute?

Forward or backward?

(69)

Forward mode vs Reverse Mode

What are the differences?

Which one is faster to compute?

Forward or backward?

Which one is more memory efficient (less storage)?

Forward or backward?
