Introduction to Deep Learning

(1)

Introduction to Deep Learning

(2)

The Rise of Deep Learning

(3)

What is Deep Learning?

A

^RTIFICIAL

I

NTELLIGENCE

M

^ACHINE

L

^EARNING

D

^EEP

L

^EARNING

Any technique that enables computers to mimic

human behavior

Ability to learn without

explicitly being programmed Extract patterns from data using neural networks

(4)

Why Deep Learning and Why Now?

(5)

Why Deep Learning?

Hand engineered features are time consuming, brittle and not scalable in practice Can we learn the underlying features directly from data?

Low Level Features

Lines & Edges Eyes & Nose & Ears Facial Structure Mid Level Features High Level Features

(6)

Why Now?

1952 Stochastic Gradient Descent

1958 Perceptron

• Learnable Weights

1995 Deep Convolutional NN

• Digit Recognition

1986 Backpropagation

• Multi-Layer Perceptron

1. Big Data

• Larger Datasets

• Easier Collection

& Storage

2. Hardware

• Graphics

Processing Units (GPUs)

• Massively Parallelizable

3. Software

• Improved Techniques

• New Models

• Toolboxes

Neural Networks date back decades, so why the resurgence?

(7)

The Perceptron

The structural building block of deep learning

(8)

The Perceptron: Forward Propagation

!_"

!_#

!_$

Σ

&_$

&_#

&_"

'( = * +

, - "

#

&_, !_,

Non-linear activation function

Output Linear combination of inputs

'(

Inputs Weights Sum Non-Linearity Output

(9)

Inputs Weights Sum Non-Linearity

!_"

!_#

!_$

!_%

&_$

&_# 1

&_" ^{() = +} ^!^%^{+ -}

. / "

#

&_. !_.

Non-linear activation function

Output Linear combination of inputs

Σ

⁽⁾

Output

Bias

The Perceptron: Forward Propagation

(10)

!_"

!_#

!_$

!_%

&_$

&_# 1

&_"

() = + !_%+ -

. / "

#

&_. !_.

Σ

⁽⁾

Output

The Perceptron: Forward Propagation

() = + !_%+ 1²3

where: 1 =

&_"

⋮

&_# and 3 =

!_"

⋮

!_#

(11)

!_"

!_#

!_$

!_%

&_$

&_# 1

&_"

Σ

^)*

Output

The Perceptron: Forward Propagation

)*

= , !

_%

+.

^/

0 Activation Functions

• Example: sigmoid function

, 1 = 2 1 = 1

1 + 3 ^{4 5}

1

(12)

Common Activation Functions

NOTE: All activation functions are non-linear

! " = 1 1 + &^{' (}

Sigmoid Function

! ′ " = !(") 1 − !(")

! " = & ⁽ − &^{' (}

&⁽ + & ^{' (}

Hyperbolic Tangent

! ′ " = 1 − !(")^-

! " = max ( 0 , " )

Rectified Linear Unit (ReLU)

! ′ ( " ) = 31 , " > 0 0 , otherwise

tf.nn.sigmoid(z) tf.nn.tanh(z) tf.nn.relu(z)

(13)

Importance of Activation Functions

The purpose of activation functions is to introduce non-linearities into the network

What if we wanted to build a Neural Network to distinguish green vs red points?

(14)

Importance of Activation Functions

Linear Activation functions produce linear decisions no matter the network size

(15)

Importance of Activation Functions

Linear Activation functions produce linear decisions no matter the network size

Non-linearities allow us to approximate arbitrarily complex functions

(16)

The Perceptron: Example

1

−2

3

Σ

&_'

&₍ 1

)*

We have: +_, = 1 and . = 3

− 2 )* = / +_,+ 1².

= / 1 + &_'

&₍

2 3

)* = / ( 1 + 3 &_' − 2 &− 2₍)

This is just a line in 2D!

(17)

The Perceptron: Example

1

−2

3

Σ

&_'

&₍ 1

)*

)* = , ( 1 + 3 &_' − 2 &₍)

1+3&^'

−2&⁽

=0

&_'

&₍

(18)

The Perceptron: Example

1

−2

3

Σ

&_'

&₍ 1

)*

)* = , ( 1 + 3 &_' − 2 &₍)

Assume we have input: 0 = − 12

−1 2

)* = , 1 + 3 ∗ −1 − 2 ∗ 2

= , −6 ≈ 0.002

1+3&^'

−2&⁽

=0

&_'

&₍

(19)

The Perceptron: Example

1

−2

3

Σ

&_'

&₍ 1

)*

)* = , ( 1 + 3 &_' − 2 &₍)

1+3&^'

−2&⁽

=0

&_'

&₍ 1 < 0

* < 0.5

1 > 0

* > 0.5

(20)

Building Neural Networks with Perceptrons

(21)

!_"

!_#

!_$

!_%

&_$

&_# 1

&_"

Σ

^)*

Output

The Perceptron: Simplified

(22)

The Perceptron: Simplified

!_"

!_#

!_$

% & = ( %

% = )_* + ,

-.$

# !_- )_-

(23)

Multi Output Perceptron

!_"

!_#

!_$

%_"

%_$ ^&^$ ^{= ( %}^$

&_" = ( %_"

%₎ = *_+,) + .

/0$

# !_/ *_/,)

(24)

Single Layer Neural Network

Inputs

!_"

!_#

!_$

Hidden

%_&

%_"

'(_"

'(_$

Final Output

%_$

%₎_*

%₊ = -_.,+^($) + 3

45$

# !₄ -_4,+^($) '(₊ = 6 -_.,+^(") + 3

45$

)_*

%₄ -_4,+^(")

6 %_$

6 %_"

6 %_&

6 %₎_*

7^($) 7^(")

(25)

Single Layer Neural Network

!_"

!_#

!_$

%_&

%_"

'(_"

'(_$

%_$

%₎_*

%_" = ,_-,"^($) + 2

34$

# !₃ ,_3,"^($)

= ,_-,"^($) + !_$ ,_$,"^($) + !_" ,","($) + !_# ,_#,"^($)

5_$,"^($) 5","($)

5_#,"^($)

(26)

Multi Output Perceptron

Inputs

!_"

!_#

!_$

Hidden

%_&

%_"

'(_"

'(_$

Output

%_$

%₎_*

from tf.keras.layers import * inputs = Inputs(m)

hidden = Dense(d₁)(inputs) outputs = Dense(2)(hidden) model = Model(inputs, outputs)

(27)

Deep Neural Network

Inputs

!_"

!_#

!_$

Hidden

%_&,(

%&,"

)*_"

)*_$

Output

%_&,$

%_&,+_,

%_&,- = /_0,-^(&) + 4

56$

+_,78

9(%_&:$,5) /_5,-^(&)

⋯ ⋯

(28)

Applying Neural Networks

(29)

Example Problem

Will I pass this class?

Let’s start with a simple two feature model

!_" = Number of lectures you attend

!_$ = Hours spent on the final project

(30)

Example Problem: Will I pass this class?

!_" = Hours

spent on the final project

!_$ = Number of lectures you attend

Pass Fail Legend

(31)

Example Problem: Will I pass this class?

!_" = Hours

spent on the final project

!_$ = Number of lectures you attend

Pass Fail Legend

? 4

5

(32)

Example Problem: Will I pass this class?

!_"

!_#

$_%

$_" &'_#

$_#

!

^#

= 4 ,5

Predicted: 0.1

(33)

Example Problem: Will I pass this class?

!_"

!_#

$_%

$_" &'_#

$_#

Predicted: 0.1 Actual: 1

!

^#

= 4 ,5

(34)

Quantifying Loss

!_"

!_#

$_%

$_" &'_#

$_#

Predicted: 0.1 Actual: 1

The loss of our network measures the cost incurred from incorrect predictions

ℒ , !

^(.)

; 1 , '

^(.)

Predicted Actual

!

^#

= 4 ,5

(35)

Empirical Loss

!_"

!_#

$_%

$_" &'_#

$_#

4, 2, 5,

⋮

The empirical loss measures the total loss over our entire dataset

5 1 8

⋮

) =

0.1 0.8 0.6

⋮

+(!)

1 0 1

⋮

'

. / = 1

1 2

34#

5

ℒ + !

⁽³⁾

; / , '

⁽³⁾

Predicted Actual

Also known as:

• Objective function

• Cost function

• Empirical Risk

(36)

Binary Cross Entropy Loss

!_"

!_#

$_%

$_" &'_#

$_#

4, 2, 5,

⋮

Cross entropy loss can be used with models that output a probability between 0 and 1

5 1 8

⋮

) =

0.1 0.8 0.6

⋮

+(!)

1 0 1

⋮

'

. / = 1

1 2

34#

5 '⁽³⁾ log + ! ³ ; / + (1 − '⁽³⁾) log 1 − + ! ³ ; /

Predicted Actual

(37)

Mean Squared Error Loss

!_"

!_#

$_%

$_" &'_#

$_#

4, 2, 5,

⋮

Mean squared error loss can be used with regression models that output continuous real numbers

5 1 8

⋮

) =

30 80 85

⋮

+(!)

90 20 95

⋮

'

. / = 1

1 2

34#

5 ' ³ − + ! ³ ; / ^"

Predicted Actual

loss = tf.reduce_mean( tf.square(tf.subtract(model.y, model.pred) )

Final Grades (percentage)

(38)

Training Neural Networks

(39)

Loss Optimization

We want to find the network weights that achieve the lowest loss

!^∗ = argmin

!

1

+ ,

-./

0 ℒ 2 3^(-); ! , 8^(-)

!^∗ = argmin

! 9(!)

(40)

Loss Optimization

We want to find the network weights that achieve the lowest loss

!^∗ = argmin

!

1

+ ,

-./

0 ℒ 2 3^(-); ! , 8^(-)

!^∗ = argmin

! 9(!)

Remember:

! = !^(:), !^(/), ⋯

(41)

Loss Optimization

!^∗ = argmin

! *(!)

*(-_., -₀)

-₀ -_.

Remember:

Our loss is a function of the network weights!

(42)

Loss Optimization

Randomly pick an initial ("_#, "_%)

'("_#, "_%)

"_%

"

(43)

Loss Optimization

Compute gradient, ^!"($)_!$

&('₍, '_*)

'_* '₍

(44)

Loss Optimization

Take small step in opposite direction of gradient

!(#_$, #_&)

#_&

#

(45)

Gradient Descent

Repeat until convergence

!(#_$, #_&)

#_&

#_$

(46)

Gradient Descent

Algorithm

1. Initialize weights randomly ~"(0, &^') 2. Loop until convergence:

3. Compute gradient, ^)*(+)₎₊

4. Update weights, + ← + − . ^)*(+)₎₊

5. Return weights

weights = tf.random_normal(shape, stddev=sigma)

grads = tf.gradients(ys=loss, xs=weights)

weights_new = weights.assign(weights – lr * grads)

(47)

Gradient Descent

Algorithm

4. Update weights, + ← + − . ^)*(+)₎₊

5. Return weights

weights = tf.random_normal(shape, stddev=sigma)

grads = tf.gradients(ys=loss, xs=weights)

weights_new = weights.assign(weights – lr * grads)

(48)

Computing Gradients: Backpropagation

How does a small change in one weight (ex. !_") affect the final loss #(%)?

' !₎ (₎ !_" *+

#(%)

(49)

Computing Gradients: Backpropagation

!"($)

!&_' =

) &₊ *₊ &_' ,-

"($)

Let’s use the chain rule!

(50)

Computing Gradients: Backpropagation

!"($)

!&_' = !"($)

! )* ∗ ! )*

!&_'

, &_. -_. &_' )*

"($)

(51)

Computing Gradients: Backpropagation

!"($)

!&_' = !"($)

! )* ∗ ! )*

!&_'

, &_' -_' &_. )*

"($)

Apply chain rule! Apply chain rule!

(52)

Computing Gradients: Backpropagation

!"($)

!&_' = !"($)

! )* ∗ ! )*

!,_'

- &_' ,_' &_. )*

"($)

∗ !,_'

!&_'

(53)

Computing Gradients: Backpropagation

!"($)

!&_' = !"($)

! )* ∗ ! )*

!,_'

- &_' ,_' &_. )*

"($)

∗ !,_'

!&_'

Repeat this for every weight in the network using gradients from later layers

(54)

Neural Networks in Practice:

Optimization

(55)

Training Neural Networks is Difficult

“Visualizing the loss landscape of neural nets”. Dec 2017.

(56)

Loss Functions Can Be Difficult to Optimize

Remember:

Optimization through gradient descent

! ← ! − $ %&(!)

%!

(57)

Remember:

Optimization through gradient descent

! ← ! − $ %&(!)

%!

How can we set the learning rate?

Loss Functions Can Be Difficult to Optimize

(58)

Setting the Learning Rate

Small learning rate converges slowly and gets stuck in false local minima

Initial guess

!

"(!)

(59)

Setting the Learning Rate

Large learning rates overshoot, become unstable and diverge

Initial guess

!

"(!)

(60)

Setting the Learning Rate

Stable learning rates converge smoothly and avoid local minima

Initial guess

!

"($)

(61)

How to deal with this?

Idea 1:

Try lots of different learning rates and see what works “just right”

(62)

How to deal with this?

Idea 1:

Try lots of different learning rates and see what works “just right”

Idea 2:

Do something smarter!

Design an adaptive learning rate that “adapts” to the landscape

(63)

Adaptive Learning Rates

• Learning rates are no longer fixed

• Can be made larger or smaller depending on:

• how large gradient is

• how fast learning is happening

• size of particular weights

• etc...

(64)

Adaptive Learning Rate Algorithms

• Momentum

• Adagrad

• Adadelta

• Adam

• RMSProp

tf.train.MomentumOptimizer tf.train.AdagradOptimizer tf.train.AdadeltaOptimizer

tf.train.AdamOptimizer tf.train.RMSPropOptimizer

Qian et al. “On the momentum term in gradient descent learning algorithms.” 1999.

Duchi et al. “Adaptive Subgradient Methods for Online Learning and Stochastic Optimization.” 2011.

Zeiler et al. “ADADELTA: An Adaptive Learning Rate Method.” 2012.

Kingma et al. “Adam: A Method for Stochastic Optimization.” 2014.

(65)

Neural Networks in Practice:

Mini-batches

(66)

Gradient Descent

Algorithm

4. Update weights, + ← + − . ^)*(+)₎₊

5. Return weights

(67)

Gradient Descent

Algorithm

4. Update weights, + ← + − . ^)*(+)₎₊

5. Return weights

Can be very computational to

compute!

(68)

Stochastic Gradient Descent

Algorithm

3. Pick single data point ) 4. Compute gradient, ^*+_*-^,^(-)

5. Update weights, - ← - − 0 ^*+(-)_*- 6. Return weights

(69)

Stochastic Gradient Descent

Algorithm

3. Pick single data point ) 4. Compute gradient, ^*+_*-^,^(-)

5. Update weights, - ← - − 0 ^*+(-)_*- 6. Return weights

Easy to compute but very noisy

(stochastic)!

(70)

Stochastic Gradient Descent

Algorithm

3. Pick batch of ) data points 4. Compute gradient, ^*+(,)_*, = ^.

/ ∑_12.^/ ^*+³^(,)

*,

5. Update weights, , ← , − 6 ^*+(,)_*, 6. Return weights

(71)

Stochastic Gradient Descent

Algorithm

3. Pick batch of ) data points 4. Compute gradient, ^*+(,)_*, = ^.

/ ∑_12.^/ ^*+³^(,)

*,

5. Update weights, , ← , − 6 ^*+(,)_*, 6. Return weights

Fast to compute and a much better estimate of the true gradient!

(72)

Mini-batches while training

More accurate estimation of gradient

Smoother convergence

Allows for larger learning rates

(73)

Mini-batches while training

More accurate estimation of gradient

Smoother convergence

Allows for larger learning rates

Mini-batches lead to fast training!

Can parallelize computation + achieve significant speed increases on GPU’s

(74)

Neural Networks in Practice:

Overfitting

(75)

The Problem of Overfitting

Underfitting

Model does not have capacity to fully learn the data

Ideal fit Overfitting

Too complex, extra parameters, does not generalize well

(76)

Regularization

What is it?

Technique that constrains our optimization problem to discourage complex models

(77)

Regularization

What is it?

Technique that constrains our optimization problem to discourage complex models

Why do we need it?

Improve generalization of our model on unseen data

(78)

Regularization 1: Dropout

!_"

!_#

!_$

%&_"

%&_$

'_$,#

'_$,"

'_$,$

'_$,)

'_",#

'","

'_",$

'_",)

• During training, randomly set some activations to 0

(79)

Regularization 1: Dropout

!_"

!_#

!_$

%&_"

%&_$

'_$,#

'_$,"

'_$,$

'_$,)

'_",#

'","

'_",$

'_",)

• During training, randomly set some activations to 0

• Typically ‘drop’ 50% of activations in layer

• Forces network to not rely on any 1 node tf.keras.layers.Dropout(p=0.5)

(80)

Regularization 1: Dropout

!_"

!_#

!_$

%&_"

%&_$

'_$,#

'_$,"

'_$,$

'_$,)

'_",#

'","

'_",$

'_",)

• During training, randomly set some activations to 0

• Typically ‘drop’ 50% of activations in layer

• Forces network to not rely on any 1 node tf.keras.layers.Dropout(p=0.5)

(81)

Regularization 2: Early Stopping

• Stop training before we have a chance to overfit

Training Iterations Loss

(82)

Regularization 2: Early Stopping

• Stop training before we have a chance to overfit

Training Iterations

Loss Testing

Training Legend

(83)

Regularization 2: Early Stopping

• Stop training before we have a chance to overfit

Training Legend Testing

(84)

Regularization 2: Early Stopping

• Stop training before we have a chance to overfit

(85)

Regularization 2: Early Stopping

• Stop training before we have a chance to overfit

(86)

Regularization 2: Early Stopping

• Stop training before we have a chance to overfit

(87)

Regularization 2: Early Stopping

Training Legend Stop training

here! Testing

• Stop training before we have a chance to overfit

(88)

Regularization 2: Early Stopping

Training Legend Stop training

here!

Over-fitting Under-fitting

Testing

• Stop training before we have a chance to overfit

(89)

Core Foundation Review

• Structural building blocks

• Nonlinear activation functions

The Perceptron Neural Networks Training in Practice

• Stacking Perceptrons to form neural networks

• Optimization through backpropagation

• Adaptive learning

• Batching

• Regularization

Σ

"_#

"_$

"_%

&' "#

"_$

"%

(_),+

(),#

&'_#

&'_%

(_),%

(),,_-

⋯ ⋯

(90)