Lecture 10: Recurrent Neural Networks

(1)

Lecture 10:

Recurrent Neural Networks

(2)

Fei-Fei Li & Justin Johnson & Serena Yeung

Lecture 10 - 2 May 4, 2017

Administrative

A1 grades will go out soon A2 is due today (11:59pm)

Midterm is in-class on Tuesday!

We will send out details on where to go soon

(3)

Extra Credit: Train Game

More details on Piazza by early next week

(4)

Lecture 10 - 4 May 4, 2017

Last Time: CNN Architectures

AlexNet

Figure copyright Kaiming He, 2016. Reproduced with permission.

(5)

Last Time: CNN Architectures

3x3 conv, 128 Pool 3x3 conv, 128

Pool 3x3 conv, 256 3x3 conv, 256

Pool 3x3 conv, 512 3x3 conv, 512

Pool FC 4096 FC 1000 Softmax

FC 4096

3x3 conv, 512

Pool Pool Pool Pool Pool Softmax

3x3 conv, 512

3x3 conv, 256 3x3 conv, 256

3x3 conv, 128 3x3 conv, 128 3x3 conv, 512 3x3 conv, 512

3x3 conv, 512

3x3 conv, 512 3x3 conv, 512

3x3 conv, 512 FC 4096 FC 1000

FC 4096

(6)

Lecture 10 - 6 May 4, 2017

Last Time: CNN Architectures

Figure copyright Kaiming He, 2016. Reproduced with permission.

Input Softmax

3x3 conv, 64

7x7 conv, 64 / 2 FC 1000

Pool 3x3 conv, 64 3x3 conv, 64 3x3 conv, 64 3x3 conv, 64 3x3 conv, 64 3x3 conv, 128 3x3 conv, 128 / 2

3x3 conv, 128 3x3 conv, 128 3x3 conv, 128 3x3 conv, 128

...

3x3 conv, 64 3x3 conv, 64 3x3 conv, 64 3x3 conv, 64 3x3 conv, 64 3x3 conv, 64

Pool

relu

Residual block conv

conv

X identity F(x) + x

F(x)

relu

X

(7)

Pool Conv Dense Block 1

Conv Conv Dense Block 2

Conv Pool Conv Dense Block 3

Softmax FC Pool

Conv Conv 1x1 conv, 64 1x1 conv, 64

Concat

DenseNet

FractalNet

(8)

Lecture 10 - 8 May 4, 2017

Last Time: CNN Architectures

(9)

Last Time: CNN Architectures

AlexNet and VGG have tons of parameters in the fully connected layers

AlexNet: ~62M parameters

FC6: 256x6x6 -> 4096: 38M params FC7: 4096 -> 4096: 17M params FC8: 4096 -> 1000: 4M params

~59M params in FC layers!

(10)

Lecture 10 - 10 May 4, 2017

Today: Recurrent Neural Networks

(11)

“Vanilla” Neural Network

(12)

Lecture 10 - 12 May 4, 2017

Recurrent Neural Networks: Process Sequences

e.g. Image Captioning

image -> sequence of words

(13)

Recurrent Neural Networks: Process Sequences

(14)

Lecture 10 - 14 May 4, 2017

Recurrent Neural Networks: Process Sequences

e.g. Machine Translation seq of words -> seq of words

(15)

Recurrent Neural Networks: Process Sequences

(16)

Lecture 10 - 16 May 4, 2017

Sequential Processing of Non-Sequence Data

Ba, Mnih, and Kavukcuoglu, “Multiple Object Recognition with Visual Attention”, ICLR 2015.

Gregor et al, “DRAW: A Recurrent Neural Network For Image Generation”, ICML 2015

Figure copyright Karol Gregor, Ivo Danihelka, Alex Graves, Danilo Jimenez Rezende, and Daan Wierstra, 2015. Reproduced with permission.

Classify images by taking a series of “glimpses”

(17)

Sequential Processing of Non-Sequence Data

Generate images one piece at a time!

(18)

Lecture 10 - May 4, 2017

Lecture 10 - 18 May 4, 2017

Recurrent Neural Network

x RNN

(19)

Recurrent Neural Network

x RNN

y

usually want to

predict a vector at

some time steps

(20)

Lecture 10 - May 4, 2017

Lecture 10 - 20 May 4, 2017

Recurrent Neural Network

x RNN

y We can process a sequence of vectors x by

applying a recurrence formula at every time step:

new state old state input vector at some time step some function

with parameters W

(21)

Recurrent Neural Network

x RNN

y We can process a sequence of vectors x by

applying a recurrence formula at every time step:

Notice: the same function and the same set

(22)

Lecture 10 - May 4, 2017

Lecture 10 - 22 May 4, 2017

(Vanilla) Recurrent Neural Network

x RNN

y

The state consists of a single “hidden” vector h:

(23)

h₀

f

_W ^h₁

x₁

RNN: Computational Graph

(24)

Lecture 10 - May 4, 2017

Lecture 10 - 24 May 4, 2017

h₀

f

_W ^h₁

f

_W ^h₂

x₂ x₁

RNN: Computational Graph

(25)

h₀

f

_W ^h₁

f

_W ^h₂

f

_W ^h₃

x₃

…

x₂ x₁

RNN: Computational Graph

h_T

(26)

Lecture 10 - May 4, 2017

Lecture 10 - 26 May 4, 2017

h₀

f

_W ^h₁

f

_W ^h₂

f

_W ^h₃

x₃

…

x₂ x₁

W

RNN: Computational Graph

Re-use the same weight matrix at every time-step

h_T

(27)

h₀

f

_W ^h₁

f

_W ^h₂

f

_W ^h₃

x₃

y_T

…

x₂ x₁

W

RNN: Computational Graph: Many to Many

h_T y₃

y₂ y₁

(28)

Lecture 10 - May 4, 2017

Lecture 10 - 28 May 4, 2017

h₀

f

_W ^h₁

f

_W ^h₂

f

_W ^h₃

x₃

y_T

…

x₂ x₁

W

RNN: Computational Graph: Many to Many

h_T y₃

y₂

y₁ L₁ L₂ L₃ L_T

(29)

h₀

f

_W ^h₁

f

_W ^h₂

f

_W ^h₃

x₃

y_T

…

x₂ x₁

W

RNN: Computational Graph: Many to Many

h_T y₃

y₂

y₁ L₁ L₂ L₃ L_T

L

(30)

Lecture 10 - May 4, 2017

Lecture 10 - 30 May 4, 2017

h₀

f

_W ^h₁

f

_W ^h₂

f

_W ^h₃

x₃

y

…

x₂ x₁

W

RNN: Computational Graph: Many to One

h_T

(31)

h₀

f

_W ^h₁

f

_W ^h₂

f

_W ^h₃

y_T

…

W

x

RNN: Computational Graph: One to Many

h_T y₃

y₃ y₃

(32)

Lecture 10 - May 4, 2017

Lecture 10 - 32 May 4, 2017

Sequence to Sequence: Many-to-one + one-to-many

h

0

f_W h

1

f_W h

2

f_W h

3

x

3

…

x

2

x

1

W

1

h

T

Many to one: Encode input sequence in a single vector

(33)

Sequence to Sequence: Many-to-one + one-to-many

h

0

f_W h

1

f_W h

2

f_W h

3

x

3

…

x

2

x

1

W

1

h

T

y

1

y

2

…

Many to one: Encode input sequence in a single vector

One to many: Produce output sequence from single input vector

f_W h

1

f_W h

2

f_W

W

2

(34)

Lecture 10 - May 4, 2017

Lecture 10 - 34 May 4, 2017

Example:

Character-level Language Model Vocabulary:

[h,e,l,o]

Example training sequence:

“hello”

(35)

Example:

Character-level Language Model Vocabulary:

[h,e,l,o]

Example training sequence:

“hello”

(36)

Lecture 10 - May 4, 2017

Lecture 10 - 36 May 4, 2017

Example:

Character-level Language Model Vocabulary:

[h,e,l,o]

Example training sequence:

“hello”

(37)

Example:

Character-level Language Model Sampling

Vocabulary:

[h,e,l,o]

At test-time sample

characters one at a time,

.03 .13 .00 .84

.25 .20 .05 .50

.11 .17 .68 .03

.11 .02 .08 .79

Softmax

“e” “l” “l” “o”

Sample

(38)

Lecture 10 - May 4, 2017

Lecture 10 - 38 May 4, 2017

.03 .13 .00 .84

.25 .20 .05 .50

.11 .17 .68 .03

.11 .02 .08 .79

Softmax

“e” “l” “l” “o”

Sample

Example:

Character-level Language Model Sampling

Vocabulary:

[h,e,l,o]

At test-time sample

characters one at a time,

feed back to model

(39)

.03 .13 .00 .84

.25 .20 .05 .50

.11 .17 .68 .03

.11 .02 .08 .79

Softmax

“e” “l” “l” “o”

Sample

Example:

Character-level Language Model Sampling

Vocabulary:

[h,e,l,o]

At test-time sample

characters one at a time,

(40)

Lecture 10 - May 4, 2017

Lecture 10 - 40 May 4, 2017

.03 .13 .00 .84

.25 .20 .05 .50

.11 .17 .68 .03

.11 .02 .08 .79

Softmax

“e” “l” “l” “o”

Sample

Example:

Character-level Language Model Sampling

Vocabulary:

[h,e,l,o]

At test-time sample

characters one at a time,

feed back to model

(41)

Backpropagation through time

Loss

Forward through entire sequence to compute loss, then backward through entire sequence to compute gradient

(42)

Lecture 10 - May 4, 2017

Lecture 10 - 42 May 4, 2017

Truncated Backpropagation through time

Loss

Run forward and backward through chunks of the

sequence instead of whole sequence

(43)

Truncated Backpropagation through time

Loss

Carry hidden states forward in time forever, but only backpropagate for some smaller

number of steps

(44)

Lecture 10 - May 4, 2017

Lecture 10 - 44 May 4, 2017

Truncated Backpropagation through time

Loss

(45)

min-char-rnn.py gist: 112 lines of Python

(46)

Lecture 10 - May 4, 2017

Lecture 10 - 46 May 4, 2017

x RNN

y

(47)

train more

train more at first:

(48)

Lecture 10 - May 4, 2017

Lecture 10 - 48 May 4, 2017

(49)

The Stacks Project: open source algebraic geometry textbook

(50)

Lecture 10 - May 4, 2017

Lecture 10 - 50 May 4, 2017

(51)

(52)

Lecture 10 - May 4, 2017

Lecture 10 - 52 May 4, 2017

(53)

Generated

C code

(54)

Lecture 10 - May 4, 2017

Lecture 10 - 54 May 4, 2017

(55)

(56)

Lecture 10 - May 4, 2017

Lecture 10 - 56 May 4, 2017

Searching for interpretable cells

Karpathy, Johnson, and Fei-Fei: Visualizing and Understanding Recurrent Networks, ICLR Workshop 2016

(57)

Searching for interpretable cells

(58)

Lecture 10 - May 4, 2017

Lecture 10 - 58 May 4, 2017

Searching for interpretable cells

Figures copyright Karpathy, Johnson, and Fei-Fei, 2015; reproduced with permission

quote detection cell

(59)

Searching for interpretable cells

line length tracking cell

(60)

Lecture 10 - May 4, 2017

Lecture 10 - 60 May 4, 2017

Searching for interpretable cells

if statement cell

(61)

Searching for interpretable cells

(62)

Lecture 10 - May 4, 2017

Lecture 10 - 62 May 4, 2017

Searching for interpretable cells

code depth cell

(63)

Explain Images with Multimodal Recurrent Neural Networks, Mao et al.

Deep Visual-Semantic Alignments for Generating Image Descriptions, Karpathy and Fei-Fei

Image Captioning

Figure from Karpathy et a, “Deep Visual-Semantic Alignments for Generating Image Descriptions”, CVPR 2015; figure copyright IEEE, 2015.

Reproduced for educational purposes.

(64)

Lecture 10 - May 4, 2017

Lecture 10 - 64 May 4, 2017

Convolutional Neural Network

Recurrent Neural Network

(65)

test image

This image is CC0 public domain

(66)

test image

(67)

test image

(68)

test image

x0

<START>

(69)

h0 y0

test image

before:

h = tanh(Wxh * x + Whh * h)

Wih now:

(70)

h0

x0

y0

<START>

test image

straw

sample!

(71)

h0 y0

test image

h1 y1

(72)

h0

x0

y0

<START>

test image

straw

h1 y1

hat

sample!

(73)

h0 y0

test image

h1 y1

h2 y2

(74)

h0

x0

y0

<START>

test image

straw

h1 y1

hat

h2 y2

sample

<END> token

=> finish.

(75)

A cat sitting on a suitcase on the floor

A cat is sitting on a tree branch

A dog is running in the grass with a frisbee

A white teddy bear sitting in the grass

Image Captioning: Example Results

Captions generated using neuraltalk2 All images are CC0 Public domain:

cat suitcase, cat tree, dog, bear, surfers, tennis, giraffe, motorcycle

(76)

Lecture 10 - May 4, 2017

Lecture 10 - 76 May 4, 2017

Image Captioning: Failure Cases

A woman is holding a cat in her hand

A woman standing on a beach holding a surfboard A person holding a

computer mouse on a desk

A bird is perched on a tree branch

A man in a baseball uniform throwing a ball

Captions generated using neuraltalk2 All images are CC0 Public domain: fur coat, handstand, spider web, baseball

(77)

Image Captioning with Attention

RNN focuses its attention at a different spatial location when generating each word

(78)

Lecture 10 - May 4, 2017

Lecture 10 - 78 May 4, 2017

Image Captioning with Attention

CNN

Image:

H x W x 3

Features:

L x D

h0

Xu et al, “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention”, ICML 2015

(79)

CNN

Image:

H x W x 3

Features:

L x D

h0 a1

Distribution over L locations

Image Captioning with Attention

(80)

Lecture 10 - May 4, 2017

Lecture 10 - 80 May 4, 2017 CNN

Image:

H x W x 3

Features:

L x D

h0 a1

Weighted combination of features

Distribution over L locations

Weighted z1 features: D

Image Captioning with Attention

(81)

CNN

Image:

H x W x 3

Features:

L x D

h0 a1

z1 h1 Distribution over

L locations

Weighted

features: D ^y1

Image Captioning with Attention

(82)

Lecture 10 - May 4, 2017

Lecture 10 - 82 May 4, 2017 CNN

Image:

H x W x 3

Features:

L x D

h0 a1

z1 Weighted

combination of features

y1 h1

First word Distribution over

L locations

a2 d1

Weighted features: D

Distribution over vocab

Image Captioning with Attention

(83)

CNN

Image:

H x W x 3

Features:

L x D

h0 a1

z1 y1

h1 Distribution over

L locations

a2 d1

h2

z2 y2

Image Captioning with Attention

(84)

Lecture 10 - May 4, 2017

Lecture 10 - 84 May 4, 2017 CNN

Image:

H x W x 3

Features:

L x D

h0 a1

z1 Weighted

combination of features

y1 h1

First word Distribution over

L locations

a2 d1

h2 a3 d2

z2 y2

Image Captioning with Attention

(85)

Soft attention

Hard attention

Image Captioning with Attention

(86)

Lecture 10 - May 4, 2017

Lecture 10 - 86 May 4, 2017

Image Captioning with Attention

Xu et al, “Show, Attend, and Tell: Neural Image Caption Generation with Visual Attention”, ICML 2015

Figure copyright Kelvin Xu, Jimmy Lei Ba, Jamie Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard S. Zemel, and Yoshua Benchio, 2015. Reproduced with permission.

(87)

Visual Question Answering

(88)

Lecture 10 - May 4, 2017

Lecture 10 - 88 May 4, 2017

Zhu et al, “Visual 7W: Grounded Question Answering in Images”, CVPR 2016

Figures from Zhu et al, copyright IEEE 2016. Reproduced for educational purposes.

Visual Question Answering: RNNs with Attention

(89)

depth

Multilayer RNNs

LSTM:

(90)

Lecture 10 - May 4, 2017

Lecture 10 - 90 May 4, 2017

h

_t-1

x

_t

W

stack

tanh

h

_t

Vanilla RNN Gradient Flow

Bengio et al, “Learning long-term dependencies with gradient descent is difficult”, IEEE Transactions on Neural Networks, 1994

Pascanu et al, “On the difficulty of training recurrent neural networks”, ICML 2013

(91)

h

_t-1

x

_t

W

stack

tanh

h

_t

Vanilla RNN Gradient Flow

Backpropagation from h_t to h_t-1 multiplies by W (actually W_hh^T)

(92)

Lecture 10 - May 4, 2017

Lecture 10 - 92 May 4, 2017

Vanilla RNN Gradient Flow

h₀ h₁ h₂ h₃ h₄

x₁ x₂ x₃ x₄

Computing gradient of h₀ involves many factors of W

(and repeated tanh)

(93)

Vanilla RNN Gradient Flow

h₀ h₁ h₂ h₃ h₄

x₁ x₂ x₃ x₄

Largest singular value > 1:

Exploding gradients Computing gradient

of h₀ involves many

(94)

Lecture 10 - May 4, 2017

Lecture 10 - 94 May 4, 2017

Vanilla RNN Gradient Flow

h₀ h₁ h₂ h₃ h₄

x₁ x₂ x₃ x₄

Exploding gradients

Largest singular value < 1:

Vanishing gradients

Gradient clipping: Scale gradient if its norm is too big Computing gradient

of h₀ involves many factors of W

(and repeated tanh)

(95)

Vanilla RNN Gradient Flow

h₀ h₁ h₂ h₃ h₄

x₁ x₂ x₃ x₄

Computing gradient of h₀ involves many

Exploding gradients

(96)

Lecture 10 - May 4, 2017

Lecture 10 - 96 May 4, 2017

Long Short Term Memory (LSTM)

Hochreiter and Schmidhuber, “Long Short Term Memory”, Neural Computation 1997

Vanilla RNN LSTM

(97)

Long Short Term Memory (LSTM)

[Hochreiter et al., 1997]

x

h

vector from before (h)

W

i f o g

vector from below (x)

sigmoid sigmoid

tanh sigmoid

f: Forget gate, Whether to erase cell i: Input gate, whether to write to cell

g: Gate gate (?), How much to write to cell o: Output gate, How much to reveal cell

(98)

Lecture 10 - May 4, 2017

☉

98 c

_t-1

h

_t-1

x

_t

f i g o

W

^☉

+

c

_t

tanh

☉

h

_t

Long Short Term Memory (LSTM)

stack

(99)

c

_t-1 ☉

h

_t-1

f i g o

W

^☉

+

c

_t

tanh

☉

h

_t

Long Short Term Memory (LSTM): Gradient Flow

stack

Backpropagation from c_t to c_t-1 only elementwise

multiplication by f, no matrix multiply by W

(100)

Lecture 10 - May 4, 2017

Lecture 10 - 100 May 4, 2017

Long Short Term Memory (LSTM): Gradient Flow

c

₀

c

₁

c

₂

c

₃

Uninterrupted gradient flow!

(101)

Long Short Term Memory (LSTM): Gradient Flow

c

₀

c

₁

c

₂

c

₃

Uninterrupted gradient flow!

3x3 conv, 64

7x7 conv, 64 / 2 3x3 conv, 64 3x3 conv, 643x3 conv, 64 3x3 conv, 643x3 conv, 64 3x3 conv, 1283x3 conv, 128 / 2 3x3 conv, 1283x3 conv, 128 3x3 conv, 1283x3 conv, 128 3x3 conv, 643x3 conv, 64 3x3 conv, 643x3 conv, 64 3x3 conv, 643x3 conv, 64

(102)

Lecture 10 - May 4, 2017

Lecture 10 - 102 May 4, 2017

Long Short Term Memory (LSTM): Gradient Flow

c

₀

c

₁

c

₂

c

₃

Uninterrupted gradient flow!

Input Softmax

3x3 conv, 64

7x7 conv, 64 / 2 FC 1000

Pool 3x3 conv, 64 3x3 conv, 643x3 conv, 64 3x3 conv, 643x3 conv, 64 3x3 conv, 1283x3 conv, 128 / 2 3x3 conv, 1283x3 conv, 128 3x3 conv, 1283x3 conv, 128 ... 3x3 conv, 643x3 conv, 64 3x3 conv, 643x3 conv, 64 3x3 conv, 643x3 conv, 64 Pool

Similar to ResNet!

In between:

Highway Networks

Srivastava et al, “Highway Networks”, ICML DL Workshop 2015

(103)

Other RNN Variants

[LSTM: A Search Space Odyssey,

[An Empirical Exploration of

Recurrent Network Architectures, Jozefowicz et al., 2015]

GRU [Learning phrase representations using rnn encoder-decoder for statistical machine translation, Cho et al. 2014]

(104)