Lecture 10:

(1)

Lecture 10 - 8 Feb 2016

Fei-Fei Li & Andrej Karpathy & Justin Johnson

Lecture 10 - 1 8 Feb 2016

Lecture 10:

Recurrent Neural Networks

(2)

Lecture 10 - 8 Feb 2016

Lecture 10 - 2 8 Feb 2016

Administrative

- Midterm this Wednesday! woohoo!

- A3 will be out ~Wednesday

(3)

Lecture 10 - 8 Feb 2016

Lecture 10 - 3 8 Feb 2016

(4)

Lecture 10 - 8 Feb 2016

Lecture 10 - 4 8 Feb 2016

http://mtyka.github.io/deepdream/2016/02/05/bilateral-class-vis.html

(5)

Lecture 10 - 8 Feb 2016

Lecture 10 - 5 8 Feb 2016

http://mtyka.github.io/deepdream/2016/02/05/bilateral-class-vis.html

(6)

Lecture 10 - 8 Feb 2016

Lecture 10 - 6 8 Feb 2016

Recurrent Networks offer a lot of flexibility:

Vanilla Neural Networks

(7)

Lecture 10 - 8 Feb 2016

Lecture 10 - 7 8 Feb 2016

Recurrent Networks offer a lot of flexibility:

e.g. Image Captioning

image -> sequence of words

(8)

Lecture 10 - 8 Feb 2016

Lecture 10 - 8 8 Feb 2016

Recurrent Networks offer a lot of flexibility:

e.g. Sentiment Classification sequence of words -> sentiment

(9)

Lecture 10 - 8 Feb 2016

Lecture 10 - 9 8 Feb 2016

Recurrent Networks offer a lot of flexibility:

e.g. Machine Translation seq of words -> seq of words

(10)

Lecture 10 - 8 Feb 2016

Lecture 10 - 10 8 Feb 2016

Recurrent Networks offer a lot of flexibility:

e.g. Video classification on frame level

(11)

Lecture 10 - 8 Feb 2016

Lecture 10 - 11 8 Feb 2016

Multiple Object Recognition with Visual Attention, Ba et al.

Sequential Processing of fixed

inputs

(12)

Lecture 10 - 8 Feb 2016

Lecture 10 - 12 8 Feb 2016

DRAW: A Recurrent Neural Network For Image Generation, Gregor et al.

Sequential Processing of fixed

outputs

(13)

Lecture 10 - 8 Feb 2016

Lecture 10 - 13 8 Feb 2016

Recurrent Neural Network

x RNN

(14)

Lecture 10 - 8 Feb 2016

Lecture 10 - 14 8 Feb 2016

Recurrent Neural Network

x RNN

y

usually want to

predict a vector at

some time steps

(15)

Lecture 10 - 8 Feb 2016

Lecture 10 - 15 8 Feb 2016

Recurrent Neural Network

x RNN

y We can process a sequence of vectors x by

applying a recurrence formula at every time step:

new state old state input vector at some time step some function

with parameters W

(16)

Lecture 10 - 8 Feb 2016

Lecture 10 - 16 8 Feb 2016

Recurrent Neural Network

x RNN

y We can process a sequence of vectors x by

applying a recurrence formula at every time step:

Notice: the same function and the same set

of parameters are used at every time step.

(17)

Lecture 10 - 8 Feb 2016

Lecture 10 - 17 8 Feb 2016

(Vanilla) Recurrent Neural Network

x RNN

y

The state consists of a single “hidden” vector h:

(18)

Lecture 10 - 8 Feb 2016

Lecture 10 - 18 8 Feb 2016

Character-level language model example

Vocabulary:

[h,e,l,o]

Example training sequence:

“hello”

x RNN

y

(19)

Lecture 10 - 8 Feb 2016

Lecture 10 - 19 8 Feb 2016

Character-level language model example

Vocabulary:

[h,e,l,o]

Example training sequence:

“hello”

(20)

Lecture 10 - 8 Feb 2016

Lecture 10 - 20 8 Feb 2016

Character-level language model example

Vocabulary:

[h,e,l,o]

Example training sequence:

“hello”

(21)

Lecture 10 - 8 Feb 2016

Lecture 10 - 21 8 Feb 2016

Character-level language model example

Vocabulary:

[h,e,l,o]

Example training sequence:

“hello”

(22)

Lecture 10 - 8 Feb 2016

Lecture 10 - 22 8 Feb 2016

min-char-rnn.py gist: 112 lines of Python

(https://gist.github.

com/karpathy/d4dee566867f8291f086)

(23)

min-char-rnn.py gist

Data I/O

(24)

Initializations

recall:

(25)

Main loop

(26)

Main loop

(27)

Main loop

(28)

Main loop

(29)

Main loop

(30)

Loss function

- forward pass (compute loss)

- backward pass (compute param gradient)

(31)

Softmax classifier

(32)

recall:

(33)

(34)

Lecture 10 - 8 Feb 2016

Lecture 10 - 34 8 Feb 2016

x RNN

y

(35)

Lecture 10 - 8 Feb 2016

Lecture 10 - 35 8 Feb 2016

(36)

Lecture 10 - 8 Feb 2016

Lecture 10 - 36 8 Feb 2016

train more

train more at first:

(37)

Lecture 10 - 8 Feb 2016

Lecture 10 - 37 8 Feb 2016

(38)

Lecture 10 - 8 Feb 2016

Lecture 10 - 38 8 Feb 2016

open source textbook on algebraic geometry

Latex source

(39)

Lecture 10 - 8 Feb 2016

Lecture 10 - 39 8 Feb 2016

(40)

Lecture 10 - 8 Feb 2016

Lecture 10 - 40 8 Feb 2016

(41)

Lecture 10 - 8 Feb 2016

Lecture 10 - 41 8 Feb 2016

(42)

Lecture 10 - 8 Feb 2016

Lecture 10 - 42 8 Feb 2016

Generated

C code

(43)

Lecture 10 - 8 Feb 2016

Lecture 10 - 43 8 Feb 2016

(44)

Lecture 10 - 8 Feb 2016

Lecture 10 - 44 8 Feb 2016

(45)

Lecture 10 - 8 Feb 2016

Lecture 10 - 45 8 Feb 2016

Searching for interpretable cells

[Visualizing and Understanding Recurrent Networks, Andrej Karpathy*, Justin Johnson*, Li Fei-Fei]

(46)

Lecture 10 - 8 Feb 2016

Lecture 10 - 46 8 Feb 2016

quote detection cell

Searching for interpretable cells

(47)

Lecture 10 - 8 Feb 2016

Lecture 10 - 47 8 Feb 2016

line length tracking cell

Searching for interpretable cells

(48)

Lecture 10 - 8 Feb 2016

Lecture 10 - 48 8 Feb 2016

if statement cell

Searching for interpretable cells

(49)

Lecture 10 - 8 Feb 2016

Lecture 10 - 49 8 Feb 2016

quote/comment cell

Searching for interpretable cells

(50)

Lecture 10 - 8 Feb 2016

Lecture 10 - 50 8 Feb 2016

code depth cell

Searching for interpretable cells

(51)

Lecture 10 - 8 Feb 2016

Lecture 10 - 51 8 Feb 2016

Explain Images with Multimodal Recurrent Neural Networks, Mao et al.

Deep Visual-Semantic Alignments for Generating Image Descriptions, Karpathy and Fei-Fei Show and Tell: A Neural Image Caption Generator, Vinyals et al.

Long-term Recurrent Convolutional Networks for Visual Recognition and Description, Donahue et al.

Learning a Recurrent Visual Representation for Image Caption Generation, Chen and Zitnick

Image Captioning

(52)

Lecture 10 - 8 Feb 2016

Lecture 10 - 52 8 Feb 2016

Convolutional Neural Network

Recurrent Neural Network

(53)

test image

(54)

test image

(55)

test image

X

(56)

test image

x0

<START>

(57)

h0

x0

y0

<START>

test image

before:

h = tanh(Wxh * x + Whh * h)

now:

h = tanh(Wxh * x + Whh * h + Wih * v)

v

Wih

(58)

h0

x0

y0

<START>

test image

straw

sample!

(59)

h0

x0

y0

<START>

test image

straw

h1 y1

(60)

h0

x0

y0

<START>

test image

straw

h1 y1

hat

sample!

(61)

h0

x0

y0

<START>

test image

straw

h1 y1

hat

h2 y2

(62)

h0

x0

y0

<START>

test image

straw

h1 y1

hat

h2 y2

sample

<END> token

=> finish.

(63)

Image Sentence Datasets

Microsoft COCO

[Tsung-Yi Lin et al. 2014]

mscoco.org

currently:

~120K images

~5 sentences each

(64)

(65)

(66)

Show Attend and Tell, Xu et al., 2015 66

Preview of fancier architectures

RNN attends spatially to different parts of images while generating each word of the sentence:

(67)

Lecture 10 - 8 Feb 2016

Lecture 10 - 67 8 Feb 2016

time depth

RNN:

(68)

Lecture 10 - 8 Feb 2016

Lecture 10 - 68 8 Feb 2016

time depth

RNN:

LSTM:

(69)

Lecture 10 - 8 Feb 2016

Lecture 10 - 69 8 Feb 2016

LSTM

(70)

Lecture 10 - 8 Feb 2016

Lecture 10 - 70 8 Feb 2016

Long Short Term Memory (LSTM)

[Hochreiter et al., 1997]

x

h

vector from before (h)

W

i f o g

vector from below (x)

sigmoid sigmoid

tanh sigmoid

4n x 2n 4n 4*n

(71)

Lecture 10 - 8 Feb 2016

Lecture 10 - 71 8 Feb 2016

Long Short Term Memory (LSTM)

cell state c

f

x

(72)

Lecture 10 - 8 Feb 2016

Lecture 10 - 72 8 Feb 2016

Long Short Term Memory (LSTM)

f

x

i g

x +

(73)

Lecture 10 - 8 Feb 2016

Lecture 10 - 73 8 Feb 2016

Long Short Term Memory (LSTM)

f

x +

tanh

o x

h

c

i g

x

(74)

Lecture 10 - 8 Feb 2016

Lecture 10 - 74 8 Feb 2016

Long Short Term Memory (LSTM)

f

x +

tanh

o x

h

c

i g

x

higher layer, or prediction

(75)

Lecture 10 - 8 Feb 2016

Lecture 10 - 75 8 Feb 2016

LSTM

f

x

i g

x +

tanh

o

x ^f

x

i g

x +

tanh

o

x

one timestep one timestep

h h

h x x

(76)

Lecture 10 - 8 Feb 2016

Lecture 10 - 76 8 Feb 2016

f f f

RNN

state

f

+

f

+

f

LSTM

+ (ignoring forget gates)

(77)

Lecture 10 - 8 Feb 2016

Lecture 10 - 77 8 Feb 2016

Recall:

“PlainNets” vs. ResNets

ResNet is to PlainNet what LSTM is to RNN, kind of.

(78)

Lecture 10 - 8 Feb 2016

Lecture 10 - 78 8 Feb 2016

Understanding gradient flow dynamics

Cute backprop signal video: http://imgur.com/gallery/vaNahKE

(79)

Lecture 10 - 8 Feb 2016

Lecture 10 - 79 8 Feb 2016

Understanding gradient flow dynamics

if the largest eigenvalue is > 1, gradient will explode if the largest eigenvalue is < 1, gradient will vanish

[On the difficulty of training Recurrent Neural Networks, Pascanu et al., 2013]

(80)

Lecture 10 - 8 Feb 2016

Lecture 10 - 80 8 Feb 2016

Understanding gradient flow dynamics

if the largest eigenvalue is > 1, gradient will explode if the largest eigenvalue is < 1, gradient will vanish

[On the difficulty of training Recurrent Neural Networks, Pascanu et al., 2013]

can control exploding with gradient clipping can control vanishing with LSTM

(81)

Lecture 10 - 8 Feb 2016

Lecture 10 - 81 8 Feb 2016

LSTM variants and friends

[LSTM: A Search Space Odyssey, Greff et al., 2015]

[An Empirical Exploration of

Recurrent Network Architectures, Jozefowicz et al., 2015]

GRU [Learning phrase

representations using rnn encoder- decoder for statistical machine translation, Cho et al. 2014]

(82)