CS 4803 / 7643: Deep Learning

(1)

Dhruv Batra

Georgia Tech

Topics:

– Analytical Gradients

– Automatic Differentiation

– Computational Graphs

(2)

Administrativia

• HW1 Reminder

– Due: 09/09, 11:59pm

(3)

Recap from last time

(4)

Strategy: Follow the slope

(5)

Gradient Descent

(6)

Computing the full sum is expensive when N is large!

Approximate the sum using a minibatch of examples; sizes of 32 / 64 / 128 are common.

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
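The minibatch idea can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the toy dataset (y = 3x plus noise), the squared-error loss, the learning rate, and the function names `full_gradient` and `minibatch_gradient` are hypothetical and not taken from the slides.

```python
import random

def full_gradient(w, data):
    """Gradient of the mean squared error (1/N) * sum (w*x - y)^2 w.r.t. the scalar w."""
    return sum(2.0 * (w * x - y) * x for x, y in data) / len(data)

def minibatch_gradient(w, data, batch_size, rng):
    """Estimate the same gradient from a random minibatch instead of all N examples."""
    batch = rng.sample(data, batch_size)
    return full_gradient(w, batch)

rng = random.Random(0)
# Hypothetical toy data: y = 3x plus a little Gaussian noise.
xs = [rng.uniform(-1.0, 1.0) for _ in range(10_000)]
data = [(x, 3.0 * x + rng.gauss(0.0, 0.1)) for x in xs]

w = 0.0
for step in range(200):
    # Each step touches 64 examples rather than all 10,000.
    w -= 0.1 * minibatch_gradient(w, data, batch_size=64, rng=rng)
```

After the loop, `w` should sit close to 3.0 even though no step ever looked at the full dataset. The estimate is noisy, which is the price paid for the cheaper step.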

(7)

How do we compute gradients?

(8)

(9)

(10)

current W: [0.34, -1.11, 0.78, 0.12, 0.55, 2.81, -3.1, -1.5, 0.33,…]

loss 1.25347

(1.25322 - 1.25347)/0.0001 = -2.5

gradient dW: [-2.5, ?, ?, ?, ?, ?, ?, ?, ?,…]

(11)

Numerical gradient: slow :(, approximate :(, easy to write :)

Analytic gradient: fast :), exact :), error-prone :(

In practice: Derive analytic gradient, check your implementation with numerical gradient.

This is called a gradient check.
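A gradient check might look like the following sketch. The toy loss, its hand-derived gradient, and the helper names are assumptions made for illustration. The numerical gradient here uses centered differences, a common variant of the one-sided difference shown earlier.

```python
import math

def loss(w):
    """Hypothetical toy loss: L(w) = sum_i w_i^2 + sin(w_0)."""
    return sum(wi * wi for wi in w) + math.sin(w[0])

def analytic_grad(w):
    """Hand-derived gradient of `loss` (the part that is easy to get wrong)."""
    g = [2.0 * wi for wi in w]
    g[0] += math.cos(w[0])
    return g

def numerical_grad(f, w, h=1e-5):
    """Centered finite differences, one coordinate at a time.
    Slow: two loss evaluations per coordinate."""
    grad = []
    for i in range(len(w)):
        w_plus = list(w)
        w_plus[i] += h
        w_minus = list(w)
        w_minus[i] -= h
        grad.append((f(w_plus) - f(w_minus)) / (2.0 * h))
    return grad

def max_relative_error(a, b, eps=1e-12):
    """Largest coordinate-wise relative difference between two gradients."""
    return max(abs(x - y) / max(abs(x) + abs(y), eps) for x, y in zip(a, b))

w = [0.34, -1.11, 0.78]
err = max_relative_error(analytic_grad(w), numerical_grad(loss, w))
```

A small `err` suggests the analytic gradient is implemented correctly. A large value usually points to a bug in the derivation or the code.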

(12)

Plan for Today

• Analytical Gradients

• Automatic Differentiation

– Computational Graphs

– Forward mode vs Reverse mode AD

(13)

Example: Logistic Regression

(14)

Vector/Matrix Derivatives Notation

(15)

Vector/Matrix Derivatives Notation

(16)

Vector Derivative Example

(17)

Vector Derivative Example

(18)

Extension to Tensors

(19)

Chain Rule: Composite Functions

(20)

Chain Rule: Scalar Case

(21)

Chain Rule: Vector Case

(22)

Chain Rule: Jacobian view

(23)

Chain Rule: Graphical view

(24)

Chain Rule: Cascaded

(25)

Chain Rule: How should we multiply?
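One way to see why the multiplication order matters is to count scalar multiplications for a chain of Jacobians. The sketch below assumes a scalar loss, two hidden layers of width n, and naive matrix multiplication. The shapes and function names are hypothetical.

```python
def matmul_cost(shape_a, shape_b):
    """Scalar multiplications for a naive (m x k) @ (k x n) product, plus the result shape."""
    m, k = shape_a
    k2, n = shape_b
    assert k == k2, "inner dimensions must match"
    return m * k * n, (m, n)

def chain_cost(shapes, from_output_side):
    """Total cost of multiplying a chain of Jacobians in one fixed order.
    `shapes` lists the Jacobians with the output-side factor first."""
    shapes = list(shapes)
    total = 0
    if from_output_side:
        # ((J1 J2) J3): start at the loss and move toward the input.
        acc = shapes[0]
        for s in shapes[1:]:
            cost, acc = matmul_cost(acc, s)
            total += cost
    else:
        # (J1 (J2 J3)): start at the input and move toward the loss.
        acc = shapes[-1]
        for s in reversed(shapes[:-1]):
            cost, acc = matmul_cost(s, acc)
            total += cost
    return total

n = 1000
# dL/dh2 is 1 x n, dh2/dh1 is n x n, dh1/dx is n x n.
jacobians = [(1, n), (n, n), (n, n)]
reverse_cost = chain_cost(jacobians, from_output_side=True)   # 2 * n**2
forward_cost = chain_cost(jacobians, from_output_side=False)  # n**3 + n**2
```

With a scalar output, starting from the output side keeps every intermediate product a row vector, so the cost stays quadratic in n. Starting from the input side forces a full matrix-matrix product.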

(26)

Example: Logistic Regression

(27)

Logistic Regression Derivatives

(28)

Logistic Regression Derivatives

(29)

[Figure: Convolutional network (AlexNet), mapping an input image and weights to a loss.]

Figure copyright Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton, 2012. Reproduced with permission.

(30)

[Figure: Neural Turing Machine, mapping an input image to a loss.]

Figure reproduced with permission from a Twitter post by Andrej Karpathy.

(31)

How do we compute gradients?

(32)

Deep Learning = Differentiable Programming

• Computation = Graph

– Input = Data + Parameters

– Output = Loss

– Scheduling = Topological ordering

• Auto-Diff

– A family of algorithms for implementing chain-rule on computation graphs

(33)

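[Figure: computational graph. x and W feed a * node that outputs s (scores); s feeds the hinge loss; W also feeds R; a + node combines the hinge loss and R into L.]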

(34)

(C) Dhruv Batra. Figure Credit: Andrea Vedaldi

(35)

Any DAG of differentiable modules is allowed!

Slide Credit: Marc'Aurelio Ranzato


(36)

Directed Acyclic Graphs (DAGs)

• Exactly what the name suggests

– Directed edges

– No (directed) cycles

– Underlying undirected cycles okay

(37)

• Concept

– Topological Ordering
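A topological ordering lists the nodes of a DAG so that every node appears after all the nodes it depends on. The sketch below uses `graphlib.TopologicalSorter` from the Python standard library (3.9+). The node names are hypothetical labels for a small scores, hinge loss, and regularizer graph.

```python
from graphlib import TopologicalSorter

# Each node maps to the set of nodes it depends on (its predecessors).
graph = {
    "s": {"x", "W"},       # scores depend on the data and the weights
    "hinge": {"s"},        # data loss depends on the scores
    "R": {"W"},            # regularizer depends on the weights
    "L": {"hinge", "R"},   # total loss depends on both terms
}

# static_order() yields nodes so that predecessors always come first.
order = list(TopologicalSorter(graph).static_order())
position = {node: i for i, node in enumerate(order)}
```

Evaluating nodes in `order` gives a valid forward-pass schedule, and walking it in reverse gives a valid schedule for propagating gradients back from the loss. Several valid orderings may exist, so the exact sequence is not unique.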

(38)

(39)

Computational Graphs

• Notation

(40)

(41)

HW0

(42)


(43)

Logistic Regression as a Cascade

Given a library of simple functions

Compose into a complicated function
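The cascade view can be sketched as three simple functions chained together, each with a known local derivative. This is an illustration under assumptions: a single positive example (label y = 1), the loss -log p, and the function names are all hypothetical choices rather than the slide's notation.

```python
import math

# The "library" of simple functions.
def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def neg_log(p):
    return -math.log(p)

def forward(w, x):
    """Compose the simple functions: w, x -> z -> p -> loss."""
    z = dot(w, x)       # linear score
    p = sigmoid(z)      # predicted probability of the positive class
    loss = neg_log(p)   # loss for a positive example (assumed y = 1)
    return z, p, loss

def backward(w, x):
    """Chain rule through the cascade, multiplying local derivatives from the loss back to w."""
    z, p, loss = forward(w, x)
    dloss_dp = -1.0 / p            # derivative of -log(p)
    dp_dz = p * (1.0 - p)          # derivative of the sigmoid
    dloss_dz = dloss_dp * dp_dz    # simplifies to p - 1
    return [dloss_dz * xi for xi in x]  # dz/dw_i = x_i

w, x = [0.5, -0.25], [1.0, 2.0]
grad = backward(w, x)
```

Each module only needs to know its own derivative. The composition handles the rest.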

(44)

Deep Learning = Differentiable Programming

• Computation = Graph

– Input = Data + Parameters

– Output = Loss

– Scheduling = Topological ordering

• Auto-Diff

– A family of algorithms for implementing chain-rule on computation graphs

(45)

Forward mode vs Reverse Mode

• Key Computations

(46)


(47)


(48)

(49)

(50)

(51)

[Figure: computational graph with inputs x₁ and x₂ and nodes *, sin( ), and +.]

(52)


(53)

Example: Forward mode AD

[Figure: computational graph with inputs x₁ and x₂ and nodes *, sin( ), and +.]

(54)


Q: What happens if there’s another input variable x₃?
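Forward mode AD can be sketched with dual numbers: each value carries its derivative along one chosen input direction. The example function f(x₁, x₂) = x₁·x₂ + sin(x₁) is an assumption inferred from the graph's nodes (*, sin, +); the `Dual` class and helper names are hypothetical.

```python
import math

class Dual:
    """A value together with its derivative along one chosen input direction."""
    def __init__(self, value, deriv=0.0):
        self.value = value
        self.deriv = deriv

    def __add__(self, other):
        # Sum rule: derivatives add.
        return Dual(self.value + other.value, self.deriv + other.deriv)

    def __mul__(self, other):
        # Product rule.
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)

def sin(a):
    # Chain rule through sin.
    return Dual(math.sin(a.value), math.cos(a.value) * a.deriv)

def f(x1, x2):
    # Assumed example function: f(x1, x2) = x1 * x2 + sin(x1).
    return x1 * x2 + sin(x1)

def forward_mode_gradient(func, point):
    """One forward pass per input: seed that input's derivative with 1, the others with 0."""
    grad = []
    for i in range(len(point)):
        args = [Dual(v, 1.0 if j == i else 0.0) for j, v in enumerate(point)]
        grad.append(func(*args).deriv)
    return grad

grad = forward_mode_gradient(f, [2.0, 3.0])
```

Here `grad` holds [x₂ + cos(x₁), x₁] evaluated at (2, 3). The loop runs once per input, so a third input x₃ would require a third forward pass to obtain its partial derivative.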

(55)


(56)


(57)

(58)

(59)

Example: Reverse mode AD

[Figure: computational graph with inputs x₁ and x₂ and nodes *, sin( ), and +.]

(60)

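Gradients add at branches: when a value feeds several downstream nodes, the gradients flowing back along each branch are summed.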
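A small hand-written backward pass makes the branching rule concrete. The function y = x² + sin(x) is a hypothetical example chosen because x feeds two paths that are later joined by +.

```python
import math

def f_and_grad(x):
    # Forward pass: x branches into two paths that are joined by +.
    a = x * x
    b = math.sin(x)
    y = a + b

    # Backward pass: start from dy/dy = 1 and walk back through each node.
    dy = 1.0
    da = dy * 1.0                  # + passes the gradient through unchanged
    db = dy * 1.0
    dx_via_a = da * 2.0 * x        # contribution through the x*x branch
    dx_via_b = db * math.cos(x)    # contribution through the sin branch
    dx = dx_via_a + dx_via_b       # gradients add where x branched
    return y, dx

y, dx = f_and_grad(0.7)
```

The result matches the derivative 2x + cos(x) obtained by differentiating the whole expression at once.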

(61)


(62)


Q: What happens if there’s another input variable x₃?
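Reverse mode AD can be sketched by recording, for each node, its parents and local derivatives, then sweeping from the output back to the inputs. The example function f(x₁, x₂) = x₁·x₂ + sin(x₁) is an assumption inferred from the graph's nodes; the `Var` class and `backward` helper are hypothetical, not a real library's API.

```python
import math

class Var:
    """A node in the computation graph: a value plus links to its parents."""
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents  # tuple of (parent Var, local derivative)
        self.grad = 0.0

    def __add__(self, other):
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))

def sin(a):
    return Var(math.sin(a.value), ((a, math.cos(a.value)),))

def backward(output):
    """Topologically order the graph, then propagate gradients from output to inputs."""
    order, seen = [], set()

    def visit(node):
        if id(node) in seen:
            return
        seen.add(id(node))
        for parent, _ in node.parents:
            visit(parent)
        order.append(node)  # parents are appended before their children

    visit(output)
    output.grad = 1.0
    for node in reversed(order):  # children are processed before their parents
        for parent, local in node.parents:
            parent.grad += node.grad * local  # gradients add at branches

x1, x2 = Var(2.0), Var(3.0)
y = x1 * x2 + sin(x1)  # assumed example function
backward(y)
```

A single backward sweep fills in `x1.grad` and `x2.grad` together. Adding a third input x₃ would not require another pass: the same sweep would also produce its gradient.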

(63)


(64)


(65)

• x → Graph → L

• Intuition of Jacobian

(66)

Forward Pass vs Forward mode AD vs Reverse mode AD

(67)

• What are the differences?

(68)

• Which one is faster to compute?

– Forward or backward?

(69)

• Which one is faster to compute?

• Which one is more memory efficient (less storage)?