Sequence Processing with Recurrent Networks

(1)

Sequence Processing with Recurrent Networks

Natalie Parde, Ph.D.

Department of Computer Science

University of Illinois at Chicago

CS 421: Natural Language Processing

Fall 2019

Many slides adapted from Jurafsky and Martin (https://web.stanford.edu/~jurafsky/slp3/).

(2)

What is sequence processing?

• Automated processing of sequential items (e.g., words in a sentence) while taking into account temporal information (e.g., w₁

occurs before w₂)

(3)

Language is inherently temporal.

• Continuous input streams of indefinite length

• These sequences unfold over time

I hope there are no more midterms between now and the end of the semester.

no I hope there are more midterms between now and the end of the semester.

≠

(4)

This is even evident in the way we talk

about language!

• Conversation flow

• News feed

• Twitter stream

(5)

We’ve already looked at a applications few of sequence processing….

• Syntactic parsing

• Part of speech tagging

• Viterbi algorithm

Natalie did not like social events so she politely declined the party invitation.

verb? noun? adjective?

(6)

…and many applications that do not incorporate temporal information.

Training

Document Class

Natalie was soooo thrilled that Usman had a famous new poem.

Sarcastic

She was totally 100% not annoyed that it had surpassed her poem on the bestseller list.

Sarcastic

Usman was happy that his poem about Thanksgiving was so successful.

Not Sarcastic He congratulated Natalie for getting #2 on the bestseller list. Not

Sarcastic Test

Document Class

Natalie told Usman she was soooo totally happy for him. ?

P(Sarcastic) = 0.5 P(Not Sarcastic) = 0.5

Natalie told Usman she was soooo totally

Word P(Word|Sarcastic) P(Word|Not Sarcastic)

Natalie 0.033 0.036

Usman 0.033 0.036

soooo 0.033 0.018

totally 0.033 0.018

happy 0.016 0.036

(7)

How can we adapt non-sequential models to make use of temporal information?

Natalie w_t-4 sat w_t-3 down w_t-2 to w_t-1 write w_t the w_t+1 midterm w_t+2

𝑃(𝑤_% = “write”|𝑤_%/0 = “to”, 𝑤_%/3 = “down”, 𝑤_%/6 = “sat”) h₁

h₂

y₁

…

“write”

…

y_|V|

softmax

distribution over all words in the vocabulary

(8)

How can we adapt non-sequential models to make use of temporal information?

Natalie w_t-5 sat w_t-4 down w_t-3 to w_t-2 write w_t-1

the w_t

midterm w_t+1

𝑃(𝑤 = “the”|𝑤 = “write”, 𝑤 = “to”, 𝑤 = “down”) h₁

h₂

y₁

…

“the”

…

y_|V|

softmax

(9)

How can we adapt non-sequential models to make use of temporal information?

Natalie w_t-6 sat w_t-5 down w_t-4 to w_t-3 write w_t-2 the w_t-1 midterm w_t

𝑃(𝑤_% = “midterm”|𝑤_%/0 = “the”, 𝑤_%/3 = “write”, 𝑤_%/6 = “to”) h₁

h₂

y₁

…

“midterm”

…

y_|V|

softmax

(10)

Disadvantages of the Sliding Window

Approach

• Items outside the predetermined context window cannot impact the model’s

decision

• What if a task requires information that can be arbitrarily distant from the point at which processing is occurring?

Limits the context from which information can be extracted

• Particularly problematic when trying to learn grammatical elements, e.g.,

constituent parses

Makes it difficult to learn systematic patterns

(11)

solution? The

• Recurrent Neural Networks

• Neural networks designed specifically to handle temporal information

• Can accept variable length inputs without the use of fixed-size windows

(12)

Recurrent Neural

Networks (RNNs)

• The value of a unit is dependent on outputs from previous time steps as input

Networks that contain cycles within their connections

• Long short-term memory network (LSTM)

• Bidirectional LSTM (BiLSTM)

• Gated recurrent unit (GRU) Many different variations

among RNNs

(13)

How do RNNs differ from standard feedforward neural networks?

• Memory!

• Loops in the network allow information to persist over time

• Information is stored between timesteps using an internal hidden state, and fed

back into the model the next time it reads an input

• Some type of output is also predicted at each timestep

• New hidden states are determined as a function of the existing hidden state and the new input at the current timestep

• This function remains the same across timesteps

(14)

Simple RNN

x_t h_t y_t

Current input Information from x_t Output for current input

Information from x_t-1 (activation value from previous input)

(15)

Thus, hidden layers in RNNs are more

complex than in

feedforward networks.

Capable of encoding outputs from earlier timesteps, which can thereby serve as additional context

Makes decisions based on both current input and outputs from prior timesteps

Does not impose a fixed-length limit on this prior context (can include

information extending back to the

beginning of the sequence)

(16)

However,

computation units still

perform the same core actions.

• Input vector

• (New!) activation values for the hidden layer for the previous time step

Given:

• Weighted sum of inputs

Compute:

(17)

Most Significant

Change

• New set of weights, U, that connect the

hidden layer from the previous time step to the current hidden layer

• These weights determine how the network should make use of prior context in

determining the output for the current input

• Just like with feedforward networks, weights are trained using backpropagation

(18)

Formal Definition

• Forward inference (mapping a

sequence of inputs to a sequence of outputs) is quite similar to what

we’ve seen with feedforward networks!

• Recall the basic set of equations for a feedforward neural network:

• h = 𝜎 𝑊x + 𝐛

• z = 𝑈h

• 𝑦 = softmax(z)

(19)

Formal Definition

• The only change we need to make to the original set of equations is to add the additional (weights X

activation values from previous timestep) product to the current (weights X inputs) product

• h = 𝜎 𝑊x_t + 𝑈h_t−1 + 𝐛

• z = 𝑉h_t

• 𝑦 = softmax(z)

• Note that the weight matrices W, U, and V (the renamed weight matrix for the output layer) are shared across all timesteps

(20)

How would this process look,

illustrated as a feedforward network?

x_t W h_t y_t

U

V x_t h_t y_t

h_t-1 U

W V

Recurrent View Feedforward View

(21)

Formal Algorithm

h₀ ← 0 # Initialize activations from the hidden layer to 0

for i ← 1 to length(x) do: # Iterate through each input element in temporal order

h_i ← g(Uh_i-1 + Wx_i + b) # Bias vector is optional y_i ← f(Vh_i)

New values for h and y are calculated with each time step!

(22)

Unrolling a simple RNN….

Natalie sat down to write the

midterm

y1

“write”…

…

y|V|

…

h₀ x₁

h₁

(23)

Unrolling a simple RNN….

midterm

y1

“write”…

…

y|V|

…

h₀ x₂

h₁

y1

“write”…

…

y|V|

…

h₂

(24)

Unrolling a simple RNN….

midterm x₃

y1

“write”…

…

y|V|

…

h₂

y1

“write”…

…

y|V|

…

h0

h₁

y1

“write”…

…

y|V|

…

h₃

(25)

Training RNNs

• Same core elements:

• Loss function

• Optimization function

• Backpropagation

• One extra set of weights to update

• Previous hidden layer to current hidden layer (U)

(26)

Additional

Considerations for Computing Loss

• To compute the loss at time t we need the hidden layer from time t-1

• To assess the overall error accruing to h_t, we need the current output as well as outputs that will follow

(27)

Backpropagation Through Time

• Two-pass algorithm for training RNNs

• First pass: Perform forward inference

• Compute h_t and y_t at each step in time

• Compute the loss at each step in time

• Second pass: Process the sequence in reverse

• Compute the required error gradients at each step

backward in time

(28)

Forward Pass

y₁ t₁

Networks

• Language modeling

• Part-of-speech tagging

• Sequence classification tasks

(36)

RNNs also form the basis for sequence- to-sequence approaches.

• Sequence-to-sequence (seq2seq): Model input and output are both sequences

• Useful for:

• Text summarization

• Machine translation

• Question answering

Chicago is colder than Antarctica today.

Chicago est plus froid que

l'Antarctique aujourd'hui.

(37)

Recurrent Neural

Language Models

Both of these attempt to predict the next word in a sequence given

a prior context of fixed length What we’ve seen so far:

N-gram language

models Feedforward networks with sliding windows

(38)

The problem with n-gram and feedforward language

models?

In both approaches, model quality is dependent on context size

Anything outside the fixed context window has no impact on the

model’s decision!

(39)

Recurrent Neural Language Models

• Recurrent neural language models process sequences one word at a time, as seen in the previous slides

• This means that they avoid constraining the context size

• The hidden state embodies information about all of the

preceding words, all the way back to the beginning of the sequence

(40)

Recurrent Neural Language

Models

• At each timestep:

1. Retrieve an embedding for the current input word

2. Combine the weighted sums of (a) the input embedding values and (b)

the activations of the hidden layer from the previous step, to compute a new set of activation values from the

hidden layer

3. Generate a set of outputs based on the activations from the hidden layer 4. Pass the outputs through a softmax

function to generate a probability

distribution over the entire vocabulary

(41)

Recurrent Neural Language

Models

• Recurrent models can be trained using the same data used to train n-gram and

feedforward language models

• Collection of representative text

• Correct class → still the word that actually comes next in the data

• Task: Predict the next word in a sequence given all previous words, rather than only those in a context window of size n

(42)

How can we generate text with neural

language models?

Model Completion (Machine-Written, 10 Tries): The scientist named the population, after their distinctive horn, Ovid’s Unicorn. These four-horned, silver-white unicorns were previously unknown to science.

Now, after almost two centuries, the mystery of what sparked this odd phenomenon is finally solved.

Dr. Jorge Pérez, an evolutionary biologist from the University of La Paz, and several companions, were exploring the Andes Mountains when they found a small valley, with no other animals or humans. Pérez noticed that the valley had what appeared to be a natural fountain, surrounded by two peaks of rock and silver snow.

Pérez and the others then ventured further into the valley. “By the time we reached the top of one peak, the water looked blue, with some crystals on top,” said Pérez.

Pérez and his friends were astonished to see the unicorn herd. These

creatures could be seen from the air without having to move too much to see them – they were so close they could touch their horns.

While examining these bizarre creatures the scientists discovered that the creatures also spoke some fairly regular English. Pérez stated, “We can see, for example, that they have a common ‘language,’ something like a dialect or dialectic.”

Dr. Pérez believes that the unicorns may have originated in Argentina, where the animals were believed to be descendants of a lost race of people who lived there before the arrival of humans in those parts of South America.

While their origins are still unclear, some believe that perhaps the creatures were created when a human and a unicorn met each other in a time before human civilization. According to Pérez, “In South America, such incidents seem to be quite common.”

However, Pérez also pointed out that it is likely that the only way of knowing for sure if unicorns are indeed the descendants of a lost alien race is through

(43)

Generation with Neural Language

Models

1. Sample the first word in the output from the softmax distribution that results from using the beginning of sentence marker (<s>) as input

2. Get the embedding for that word

3. Use it as input to the network at the next time step, and sample the following word as in (1)

4. Repeat until the end of sentence marker (</s>) is sampled, or a fixed length limit is reached

(44)

Autoregressive Generation

• This technique is referred to as autoregressive generation

• Word generated at each timestep is conditioned on the word

generated previously by the model

• Key to successful autoregressive generation?

• Prime the generation component with appropriate context (e.g., something more useful than <s>)

(45)

Autoregressive Generation

<s> RNN

softmax

pumpkin

(46)

Autoregressive Generation

<s> RNN

softmax

pumpkin

pumpkin RNN

softmax

spice

(47)

Autoregressive Generation

<s> RNN

softmax

pumpkin

pumpkin RNN

softmax

spice

spice RNN ^latte

(48)

Sequence Labeling

• Task: Given a fixed set of labels, assign a label to each element of a sequence

• Example: Part-of-speech tagging

• In an RNN:

• Inputs → word embeddings

• Outputs → label probabilities generated by the softmax (or

other activation) function over the set of all labels

(49)

Sequence Labeling

determiner t₁

delicious t₂

h₁

h₂

latte t₃

h₃ adjective

noun

(50)

Sequence Classification

• Task: Given an input sequence, assign the entire sequence to a class (rather than the individual tokens within it)

• Useful for:

• Document-level topic classification

• Spam detection

• Message routing

• Deception detection

• Any other applications for which sequences of text are classified as belonging to one of a small number of categories!

(51)

How to use RNNs for sequence classification?

1

Pass the sequence through an RNN one word at a time, as usual

2

Assume that the hidden layer for the final word, h_n, acts as a compressed

representation of the entire sequence

3

Use h_n as input to a subsequent

feedforward neural network

4

Choose a class via softmax over all the possible classes

(52)

Sequence Classification

pumpkin RNN

spice RNN

(53)

Sequence Classification

pumpkin RNN

spice RNN

latte RNN

h_n

FNN ^coffee

(54)

Additional Guidelines for

Sequence Classification

No intermediate outputs (e.g., “pumpkin” = coffee) for words in the sequence that precede the last word → no loss terms associated with those elements

Loss function is based entirely on final classification task!

However, errors are still propagated backward all the way through the RNN

The process of adjusting weights the entire way through the network based on the loss from a downstream

application is often referred to as end-to-end training

(55)

Deep Recurrent Neural Networks

RNN

h

_n

FNN

• As demonstrated with sequence classification, it is possible to combine neural networks to form more complex architectures

• RNN + Feedforward Neural Network

• Two RNNs

• Possibilities are (theoretically) endless!

(56)

Stacked RNNs

• Use the entire sequence of outputs from one RNN as the input sequence to another

• Capable of outperforming single-layer networks

• Why?

• Having more layers allows the network to learn representations at differing

levels of abstraction across layers

• Early layers → more fundamental properties

• Later layers → more meaningful groups of fundamental properties

(57)

Stacked RNNs

RNN

h_n1

RNN

h_n2

RNN

h_n3

• Optimal number of RNNs to stack together?

• Depends on application and training set

• More RNNs in the stack → increased training costs

(58)

Bidirectional RNNs

• Simple RNNs only consider the information in a sequence leading up to the current

timestep

• ℎ_%^I = 𝑅𝑁𝑁_ILMNOMP(𝑥₀^%)

• ℎ_%^I corresponds to the normal hidden state at time t

• This could be visualized as the context to the left of the current time

Natalie ran to TBH 180G

(59)

Bidirectional RNNs

• However, in many cases the context after the current timestep (to the right of the current time) could be useful as well!

• In many applications we have access to the entire input sequence at once anyway

Natalie ran her code again

(60)

Bidirectional RNNs

• How can we make use of

information from both sides of the current timestep?

• Simple solution:

• Train an RNN on an input sequence in reverse

• ℎ_%^R = 𝑅𝑁𝑁_ROSTNOMP(𝑥_%^U)

• ℎ_%^R corresponds to information from the current timestep to the end of the sequence

• Combine the forward and backward networks

(61)

Bidirectional RNNs

• Two independent RNNs

• One where the input is processed from start to end

• One where the input is processed from end to start

• Outputs combined into a single

representation that captures both the left and right contexts of an input at each timestep

• ℎ_% = ℎ_%^I⨁ℎ_%^R

• How to combine the contexts?

• Concatenation

• Element-wise addition, multiplication, etc.

(62)

Bidirectional RNNs

RNN

180G TBH to ran Natalie

+ ^ℎ^%

ℎ_%^I

ℎ_%^R

(63)

Bidirectional RNNs are particularly useful for sequence classification.

• Simple RNN → final state naturally reflects more information about the end of the

sentence

• Bidirectional RNN → final state is a

combination of the forward and backward passes

(64)

Sequence Classification with a Bidirectional RNN

pumpkin _RNN

spice _RNN

FNN

coffee

latte _RNN

spice

RNN

pumpkin

RNN

(65)

Managing Context in

RNNs

• In a simple RNN, the final state tends to

reflect more information about recent items than those at the beginning of the sequence

• Distant timesteps → less information

• However, long-distance information can be critical to many applications!

Natalie took a train to O’Hare and then a plane to L.A. and then a plane to Tokyo and then a plane to Miyazaki where she finally Ubered to her hotel

t₀ t₁ t₂ t₃ t₄ t₅ t₆ t₇ t₈ t₉ t₁₀ t₁₁ t₁₂ t₁₃ t₁₄ t₁₅ t₁₆ t₁₇ t₁₈ t₁₉ t₂₀ t₂₁ t₂₂ t₂₃ t₂₄ t₂₅ t₂₆ t₂₇ t₂₈ t₂₉ t₃₀

(66)

Why is it so hard to

maintain long-

distance context?

• Provide information useful for the current decision (input at t)

• Update and carry forward information required for future decisions (input at time t+1 and beyond)

Hidden layers must perform two tasks simultaneously

• When small derivatives are repeatedly multiplied together (as would happen when backpropagating through time for a long sequence), gradients can

become so close to zero at early layers that they are no longer effective for model training Vanishing gradients

(67)

How can we address this?

• Design more complex RNNs that learn to:

• Forget information that is no longer needed

• Remember information still required for future decisions

(68)

Long Short-Term Memory Networks (LSTMs)

• Remove information no longer needed from the context, and add information likely to be needed later

• Do this by:

• Adding an explicit context layer to the architecture

• Control the flow of information into and out of network layers using specialized neural units called gates

(69)

LSTM Gates

• Feedforward layer + sigmoid activation +

pointwise multiplication with the layer being gated

• Combination of sigmoid activation and pointwise multiplication essentially creates a binary mask

• Values near 1 in the mask are passed through nearly unchanged

• Values near 0 are nearly erased

(70)

LSTM Gates

• Forget gate: Should we erase this existing information from the context?

• Input gate: Should we write this new information to the context?

• Output gate: What information should be revealed as output for the current hidden state?

Three main gates:

(71)

Long Short- Term Memory Networks

(LSTMs)

• LSTMs thus accept as input:

• Context layer

• Hidden layer from previous timestep

• Current input vector

• The output of the hidden layer can be used as input to subsequent layers in a stacked RNN, or to the network’s output layer

(72)

Gated Recurrent Units (GRUs)

• Also manage the context that is passed through to the next timestep, but do so by utilizing a simpler architecture than LSTMs

• No separate context vector

• Only two gates

• Reset gate

• Update gate

• Gates still use a similar design to that seen in LSTMs

• Feedforward layer + sigmoid

activation + pointwise multiplication with the layer being gated, resulting in a binary-like mask

(73)

GRU Gates

• Reset: Which aspects of the previous hidden state are relevant to the current context?

• What can be ignored?

• Update: Based on the intermediate representation produced by the reset gate, which aspects will be used directly in the new hidden state?

• Which aspects of the previous state need to be preserved for future use?

• Recall that h_t = 𝜎 𝑊x_t + 𝑈h_t−1 + 𝐛

• Letting h_t’ be an intermediate representation learned by the reset gate, and z be an intermediate representation produced by the update gate:

• h_t = 1 − z_t h_t−1 + z_th_t′

(74)

Comparing Inputs and Outputs for Neural Units

x h

x_t h_t-1

h_t

x_t h_t-1

h_t c_t

c_t-1 h_t-1 x_t

h_t

Feedforward RNN LSTM GRU

(75)

When to

use LSTMs vs. GRUs?

• Computational efficiency:

Good for scenarios in which you need to train your model quickly and don’t have access to high- performance computing

resources

Why use GRUs instead of LSTMs?

• Performance: LSTMs generally outperform GRUs at the same tasks

Why use LSTMs instead

of GRUs?

(76)

Do inputs to RNNs need to be word embeddings?

• Nope

• In some cases, word embeddings might not be the best type of input:

• Lexicon is so large that it is impractical to represent each possible word as an embedding

• Dataset has many unknown words

• Morphological information is critical to the task

(77)

Alternatives to Word

Embeddings

Input character sequences directly to the Input character RNN

sequences

Use subword representations rather than full word embeddings

Use subword representations

Build input sequences based on information from linguistic analyses

Build input sequences

(78)

Input representations can also be combined….

• Particularly successful: Word embeddings enriched with character information

• Learn embeddings from a bidirectional RNN that accepts character sequences for each word as input

• Concatenate the learned character embeddings with ordinary word

embeddings

• Opportunities are endless …currently a very active research area!

(79)

Summary:

Sequence Processing with Recurrent

Networks

• Sequence processing makes use of temporal information from input sequences

• Recurrent neural networks (RNNs) are neural networks specifically designed for sequence processing

• Bonus: Can accept inputs of variable length

• RNNs base their decisions on both current input and activation values from the previous timestep

• RNNs are particularly useful for text generation, sequence labeling, and (when combined with a feedforward network) sequence classification

• More complex varieties of RNNs include:

• Stacked RNNs

• Bidirectional RNNs

• Long short-term memory networks (LSTMs)

• Gated recurrent units (GRUs)

• Although RNNs usually accept word embeddings as input, they can also use other types of input sequences (e.g., character sequences or subword embeddings)