• No results found

Sequence Processing with Recurrent Networks

N/A
N/A
Protected

Academic year: 2021

Share "Sequence Processing with Recurrent Networks"

Copied!
79
0
0

Loading.... (view fulltext now)

Full text

(1)

Sequence Processing with Recurrent Networks

Natalie Parde, Ph.D.

Department of Computer Science

University of Illinois at Chicago

CS 421: Natural Language Processing

Fall 2019

Many slides adapted from Jurafsky and Martin (https://web.stanford.edu/~jurafsky/slp3/).

(2)

What is sequence processing?

• Automated processing of sequential items (e.g., words in a sentence) while taking into account temporal information (e.g., w1

occurs before w2)

(3)

Language is inherently temporal.

• Continuous input streams of indefinite length

• These sequences unfold over time

I hope there are no more midterms between now and the end of the semester.

no I hope there are more midterms between now and the end of the semester.

(4)

This is even evident in the way we talk

about language!

• Conversation flow

• News feed

• Twitter stream

(5)

We’ve already looked at a applications few of sequence processing….

• Syntactic parsing

• Part of speech tagging

• Viterbi algorithm

Natalie did not like social events so she politely declined the party invitation.

verb? noun? adjective?

(6)

…and many applications that do not incorporate temporal information.

Training

Document Class

Natalie was soooo thrilled that Usman had a famous new poem.

Sarcastic

She was totally 100% not annoyed that it had surpassed her poem on the bestseller list.

Sarcastic

Usman was happy that his poem about Thanksgiving was so successful.

Not Sarcastic He congratulated Natalie for getting #2 on the bestseller list. Not

Sarcastic Test

Document Class

Natalie told Usman she was soooo totally happy for him. ?

P(Sarcastic) = 0.5 P(Not Sarcastic) = 0.5

Natalie told Usman she was soooo totally

Word P(Word|Sarcastic) P(Word|Not Sarcastic)

Natalie 0.033 0.036

Usman 0.033 0.036

soooo 0.033 0.018

totally 0.033 0.018

happy 0.016 0.036

(7)

How can we adapt non-sequential models to make use of temporal information?

Natalie wt-4 sat wt-3 down wt-2 to wt-1 write wt the wt+1 midterm wt+2

𝑃(𝑤% = “write”|𝑤%/0 = “to”, 𝑤%/3 = “down”, 𝑤%/6 = “sat”) h1

h2

y1

“write”

y|V|

softmax

distribution over all words in the vocabulary

(8)

How can we adapt non-sequential models to make use of temporal information?

Natalie wt-5 sat wt-4 down wt-3 to wt-2 write wt-1

the wt

midterm wt+1

𝑃(𝑤 = “the”|𝑤 = “write”, 𝑤 = “to”, 𝑤 = “down”) h1

h2

y1

“the”

y|V|

softmax

distribution over all words in the vocabulary

(9)

How can we adapt non-sequential models to make use of temporal information?

Natalie wt-6 sat wt-5 down wt-4 to wt-3 write wt-2 the wt-1 midterm wt

𝑃(𝑤% = “midterm”|𝑤%/0 = “the”, 𝑤%/3 = “write”, 𝑤%/6 = “to”) h1

h2

y1

“midterm”

y|V|

softmax

distribution over all words in the vocabulary

(10)

Disadvantages of the Sliding Window

Approach

• Items outside the predetermined context window cannot impact the model’s

decision

• What if a task requires information that can be arbitrarily distant from the point at which processing is occurring?

Limits the context from which information can be extracted

• Particularly problematic when trying to learn grammatical elements, e.g.,

constituent parses

Makes it difficult to learn systematic patterns

(11)

solution? The

• Recurrent Neural Networks

• Neural networks designed specifically to handle temporal information

• Can accept variable length inputs without the use of fixed-size windows

(12)

Recurrent Neural

Networks (RNNs)

• The value of a unit is dependent on outputs from previous time steps as input

Networks that contain cycles within their connections

• Long short-term memory network (LSTM)

• Bidirectional LSTM (BiLSTM)

• Gated recurrent unit (GRU) Many different variations

among RNNs

(13)

How do RNNs differ from standard feedforward neural networks?

• Memory!

• Loops in the network allow information to persist over time

• Information is stored between timesteps using an internal hidden state, and fed

back into the model the next time it reads an input

• Some type of output is also predicted at each timestep

• New hidden states are determined as a function of the existing hidden state and the new input at the current timestep

• This function remains the same across timesteps

(14)

Simple RNN

xt ht yt

Current input Information from xt Output for current input

Information from xt-1 (activation value from previous input)

(15)

Thus, hidden layers in RNNs are more

complex than in

feedforward networks.

Capable of encoding outputs from earlier timesteps, which can thereby serve as additional context

Makes decisions based on both current input and outputs from prior timesteps

Does not impose a fixed-length limit on this prior context (can include

information extending back to the

beginning of the sequence)

(16)

However,

computation units still

perform the same core actions.

• Input vector

• (New!) activation values for the hidden layer for the previous time step

Given:

• Weighted sum of inputs

Compute:

(17)

Most Significant

Change

• New set of weights, U, that connect the

hidden layer from the previous time step to the current hidden layer

• These weights determine how the network should make use of prior context in

determining the output for the current input

• Just like with feedforward networks, weights are trained using backpropagation

(18)

Formal Definition

• Forward inference (mapping a

sequence of inputs to a sequence of outputs) is quite similar to what

we’ve seen with feedforward networks!

• Recall the basic set of equations for a feedforward neural network:

• h = 𝜎 𝑊x + 𝐛

• z = 𝑈h

• 𝑦 = softmax(z)

(19)

Formal Definition

• The only change we need to make to the original set of equations is to add the additional (weights X

activation values from previous timestep) product to the current (weights X inputs) product

• h = 𝜎 𝑊xt + 𝑈ht−1 + 𝐛

• z = 𝑉ht

• 𝑦 = softmax(z)

• Note that the weight matrices W, U, and V (the renamed weight matrix for the output layer) are shared across all timesteps

(20)

How would this process look,

illustrated as a feedforward network?

xt W ht yt

U

V xt ht yt

ht-1 U

W V

Recurrent View Feedforward View

(21)

Formal Algorithm

h0 ← 0 # Initialize activations from the hidden layer to 0

for i ← 1 to length(x) do: # Iterate through each input element in temporal order

hi ← g(Uhi-1 + Wxi + b) # Bias vector is optional yi ← f(Vhi)

New values for h and y are calculated with each time step!

(22)

Unrolling a simple RNN….

Natalie sat down to write the

midterm

y1

“write”

y|V|

h0 x1

h1

(23)

Unrolling a simple RNN….

Natalie sat down to write the

midterm

y1

“write”

y|V|

h0 x2

h1

y1

“write”

y|V|

h2

(24)

Unrolling a simple RNN….

Natalie sat down to write the

midterm x3

y1

“write”

y|V|

h2

y1

“write”

y|V|

h0

h1

y1

“write”

y|V|

h3

(25)

Training RNNs

• Same core elements:

• Loss function

• Optimization function

• Backpropagation

• One extra set of weights to update

• Previous hidden layer to current hidden layer (U)

(26)

Additional

Considerations for Computing Loss

• To compute the loss at time t we need the hidden layer from time t-1

• To assess the overall error accruing to ht, we need the current output as well as outputs that will follow

(27)

Backpropagation Through Time

• Two-pass algorithm for training RNNs

• First pass: Perform forward inference

• Compute ht and yt at each step in time

• Compute the loss at each step in time

• Second pass: Process the sequence in reverse

• Compute the required error gradients at each step

backward in time

(28)

Forward Pass

y1 t1

h1

(29)

Forward Pass

y1 t1

x2 y2

t2

h2

h1

(30)

Forward Pass

y1 t1

x2 y2

t2

h1

h2

x3 y3

t3

h3

(31)

…and now…

y1 t1

x2 y2

t2

h1

h2

x3 y3

t3

h3

(32)

Backward Pass

y1 t1

x2 y2

t2

h1

h2

x3 y3

t3

h3

(33)

Backward Pass

y1 t1

x2 y2

t2

h1

h2

x3 y3

t3

h3

(34)

Backward Pass

y1 t1

x2 y2

t2

h1

h2

x3 y3

t3

h3

(35)

Applications of Recurrent Neural

Networks

• Language modeling

• Part-of-speech tagging

• Sequence classification tasks

(36)

RNNs also form the basis for sequence- to-sequence approaches.

• Sequence-to-sequence (seq2seq): Model input and output are both sequences

• Useful for:

• Text summarization

• Machine translation

• Question answering

Chicago is colder than Antarctica today.

Chicago est plus froid que

l'Antarctique aujourd'hui.

(37)

Recurrent Neural

Language Models

Both of these attempt to predict the next word in a sequence given

a prior context of fixed length What we’ve seen so far:

N-gram language

models Feedforward networks with sliding windows

(38)

The problem with n-gram and feedforward language

models?

In both approaches, model quality is dependent on context size

Anything outside the fixed context window has no impact on the

model’s decision!

(39)

Recurrent Neural Language Models

• Recurrent neural language models process sequences one word at a time, as seen in the previous slides

• This means that they avoid constraining the context size

• The hidden state embodies information about all of the

preceding words, all the way back to the beginning of the sequence

(40)

Recurrent Neural Language

Models

• At each timestep:

1. Retrieve an embedding for the current input word

2. Combine the weighted sums of (a) the input embedding values and (b)

the activations of the hidden layer from the previous step, to compute a new set of activation values from the

hidden layer

3. Generate a set of outputs based on the activations from the hidden layer 4. Pass the outputs through a softmax

function to generate a probability

distribution over the entire vocabulary

(41)

Recurrent Neural Language

Models

• Recurrent models can be trained using the same data used to train n-gram and

feedforward language models

• Collection of representative text

• Correct class → still the word that actually comes next in the data

• Task: Predict the next word in a sequence given all previous words, rather than only those in a context window of size n

(42)

How can we generate text with neural

language models?

Model Completion (Machine-Written, 10 Tries): The scientist named the population, after their distinctive horn, Ovid’s Unicorn. These four-horned, silver-white unicorns were previously unknown to science.

Now, after almost two centuries, the mystery of what sparked this odd phenomenon is finally solved.

Dr. Jorge Pérez, an evolutionary biologist from the University of La Paz, and several companions, were exploring the Andes Mountains when they found a small valley, with no other animals or humans. Pérez noticed that the valley had what appeared to be a natural fountain, surrounded by two peaks of rock and silver snow.

Pérez and the others then ventured further into the valley. “By the time we reached the top of one peak, the water looked blue, with some crystals on top,” said Pérez.

Pérez and his friends were astonished to see the unicorn herd. These

creatures could be seen from the air without having to move too much to see them – they were so close they could touch their horns.

While examining these bizarre creatures the scientists discovered that the creatures also spoke some fairly regular English. Pérez stated, “We can see, for example, that they have a common ‘language,’ something like a dialect or dialectic.”

Dr. Pérez believes that the unicorns may have originated in Argentina, where the animals were believed to be descendants of a lost race of people who lived there before the arrival of humans in those parts of South America.

While their origins are still unclear, some believe that perhaps the creatures were created when a human and a unicorn met each other in a time before human civilization. According to Pérez, “In South America, such incidents seem to be quite common.”

However, Pérez also pointed out that it is likely that the only way of knowing for sure if unicorns are indeed the descendants of a lost alien race is through

(43)

Generation with Neural Language

Models

1. Sample the first word in the output from the softmax distribution that results from using the beginning of sentence marker (<s>) as input

2. Get the embedding for that word

3. Use it as input to the network at the next time step, and sample the following word as in (1)

4. Repeat until the end of sentence marker (</s>) is sampled, or a fixed length limit is reached

(44)

Autoregressive Generation

• This technique is referred to as autoregressive generation

• Word generated at each timestep is conditioned on the word

generated previously by the model

• Key to successful autoregressive generation?

• Prime the generation component with appropriate context (e.g., something more useful than <s>)

(45)

Autoregressive Generation

<s> RNN

softmax

pumpkin

(46)

Autoregressive Generation

<s> RNN

softmax

pumpkin

pumpkin RNN

softmax

spice

(47)

Autoregressive Generation

<s> RNN

softmax

pumpkin

pumpkin RNN

softmax

spice

spice RNN latte

(48)

Sequence Labeling

• Task: Given a fixed set of labels, assign a label to each element of a sequence

• Example: Part-of-speech tagging

• In an RNN:

• Inputs → word embeddings

• Outputs → label probabilities generated by the softmax (or

other activation) function over the set of all labels

(49)

Sequence Labeling

determiner t1

delicious t2

h1

h2

latte t3

h3 adjective

noun

(50)

Sequence Classification

• Task: Given an input sequence, assign the entire sequence to a class (rather than the individual tokens within it)

• Useful for:

• Document-level topic classification

• Spam detection

• Message routing

• Deception detection

• Any other applications for which sequences of text are classified as belonging to one of a small number of categories!

(51)

How to use RNNs for sequence classification?

1

Pass the sequence through an RNN one word at a time, as usual

2

Assume that the hidden layer for the final word, hn, acts as a compressed

representation of the entire sequence

3

Use hn as input to a subsequent

feedforward neural network

4

Choose a class via softmax over all the possible classes

(52)

Sequence Classification

pumpkin RNN

spice RNN

(53)

Sequence Classification

pumpkin RNN

spice RNN

latte RNN

hn

FNN coffee

(54)

Additional Guidelines for

Sequence Classification

No intermediate outputs (e.g., “pumpkin” = coffee) for words in the sequence that precede the last word → no loss terms associated with those elements

Loss function is based entirely on final classification task!

However, errors are still propagated backward all the way through the RNN

The process of adjusting weights the entire way through the network based on the loss from a downstream

application is often referred to as end-to-end training

(55)

Deep Recurrent Neural Networks

RNN

h

n

FNN

• As demonstrated with sequence classification, it is possible to combine neural networks to form more complex architectures

• RNN + Feedforward Neural Network

• Two RNNs

• Possibilities are (theoretically) endless!

(56)

Stacked RNNs

• Use the entire sequence of outputs from one RNN as the input sequence to another

• Capable of outperforming single-layer networks

• Why?

• Having more layers allows the network to learn representations at differing

levels of abstraction across layers

• Early layers → more fundamental properties

• Later layers → more meaningful groups of fundamental properties

(57)

Stacked RNNs

RNN

hn1

RNN

hn2

RNN

hn3

• Optimal number of RNNs to stack together?

• Depends on application and training set

• More RNNs in the stack → increased training costs

(58)

Bidirectional RNNs

• Simple RNNs only consider the information in a sequence leading up to the current

timestep

• ℎ%I = 𝑅𝑁𝑁ILMNOMP(𝑥0%)

• ℎ%I corresponds to the normal hidden state at time t

• This could be visualized as the context to the left of the current time

Natalie ran to TBH 180G

(59)

Bidirectional RNNs

• However, in many cases the context after the current timestep (to the right of the current time) could be useful as well!

• In many applications we have access to the entire input sequence at once anyway

Natalie ran to TBH 180G

Natalie ran her code again

(60)

Bidirectional RNNs

• How can we make use of

information from both sides of the current timestep?

• Simple solution:

• Train an RNN on an input sequence in reverse

• ℎ%R = 𝑅𝑁𝑁ROSTNOMP(𝑥%U)

• ℎ%R corresponds to information from the current timestep to the end of the sequence

• Combine the forward and backward networks

(61)

Bidirectional RNNs

• Two independent RNNs

• One where the input is processed from start to end

• One where the input is processed from end to start

• Outputs combined into a single

representation that captures both the left and right contexts of an input at each timestep

% = ℎ%I⨁ℎ%R

• How to combine the contexts?

• Concatenation

• Element-wise addition, multiplication, etc.

(62)

Bidirectional RNNs

RNN

RNN

Natalie ran to TBH 180G

180G TBH to ran Natalie

+ %

%I

%R

(63)

Bidirectional RNNs are particularly useful for sequence classification.

• Simple RNN → final state naturally reflects more information about the end of the

sentence

• Bidirectional RNN → final state is a

combination of the forward and backward passes

(64)

Sequence Classification with a Bidirectional RNN

pumpkin RNN

spice RNN

FNN

coffee

latte RNN

spice

RNN

pumpkin

RNN

(65)

Managing Context in

RNNs

• In a simple RNN, the final state tends to

reflect more information about recent items than those at the beginning of the sequence

• Distant timesteps → less information

• However, long-distance information can be critical to many applications!

Natalie took a train to O’Hare and then a plane to L.A. and then a plane to Tokyo and then a plane to Miyazaki where she finally Ubered to her hotel

t0 t1 t2 t3 t4 t5 t6 t7 t8 t9 t10 t11 t12 t13 t14 t15 t16 t17 t18 t19 t20 t21 t22 t23 t24 t25 t26 t27 t28 t29 t30

(66)

Why is it so hard to

maintain long-

distance context?

• Provide information useful for the current decision (input at t)

• Update and carry forward information required for future decisions (input at time t+1 and beyond)

Hidden layers must perform two tasks simultaneously

• When small derivatives are repeatedly multiplied together (as would happen when backpropagating through time for a long sequence), gradients can

become so close to zero at early layers that they are no longer effective for model training Vanishing gradients

(67)

How can we address this?

• Design more complex RNNs that learn to:

• Forget information that is no longer needed

• Remember information still required for future decisions

(68)

Long Short-Term Memory Networks (LSTMs)

• Remove information no longer needed from the context, and add information likely to be needed later

• Do this by:

• Adding an explicit context layer to the architecture

• Control the flow of information into and out of network layers using specialized neural units called gates

(69)

LSTM Gates

• Feedforward layer + sigmoid activation +

pointwise multiplication with the layer being gated

• Combination of sigmoid activation and pointwise multiplication essentially creates a binary mask

• Values near 1 in the mask are passed through nearly unchanged

• Values near 0 are nearly erased

(70)

LSTM Gates

• Forget gate: Should we erase this existing information from the context?

• Input gate: Should we write this new information to the context?

• Output gate: What information should be revealed as output for the current hidden state?

Three main gates:

(71)

Long Short- Term Memory Networks

(LSTMs)

• LSTMs thus accept as input:

• Context layer

• Hidden layer from previous timestep

• Current input vector

• The output of the hidden layer can be used as input to subsequent layers in a stacked RNN, or to the network’s output layer

(72)

Gated Recurrent Units (GRUs)

• Also manage the context that is passed through to the next timestep, but do so by utilizing a simpler architecture than LSTMs

• No separate context vector

• Only two gates

• Reset gate

• Update gate

• Gates still use a similar design to that seen in LSTMs

• Feedforward layer + sigmoid

activation + pointwise multiplication with the layer being gated, resulting in a binary-like mask

(73)

GRU Gates

• Reset: Which aspects of the previous hidden state are relevant to the current context?

• What can be ignored?

• Update: Based on the intermediate representation produced by the reset gate, which aspects will be used directly in the new hidden state?

• Which aspects of the previous state need to be preserved for future use?

• Recall that ht = 𝜎 𝑊xt + 𝑈ht−1 + 𝐛

• Letting ht’ be an intermediate representation learned by the reset gate, and z be an intermediate representation produced by the update gate:

• ht = 1 − zt ht−1 + ztht

(74)

Comparing Inputs and Outputs for Neural Units

x h

xt ht-1

ht

xt ht-1

ht ct

ct-1 ht-1 xt

ht

Feedforward RNN LSTM GRU

(75)

When to

use LSTMs vs. GRUs?

• Computational efficiency:

Good for scenarios in which you need to train your model quickly and don’t have access to high- performance computing

resources

Why use GRUs instead of LSTMs?

• Performance: LSTMs generally outperform GRUs at the same tasks

Why use LSTMs instead

of GRUs?

(76)

Do inputs to RNNs need to be word embeddings?

• Nope

• In some cases, word embeddings might not be the best type of input:

• Lexicon is so large that it is impractical to represent each possible word as an embedding

• Dataset has many unknown words

• Morphological information is critical to the task

(77)

Alternatives to Word

Embeddings

Input character sequences directly to the Input character RNN

sequences

Use subword representations rather than full word embeddings

Use subword representations

Build input sequences based on information from linguistic analyses

Build input sequences

(78)

Input representations can also be combined….

• Particularly successful: Word embeddings enriched with character information

• Learn embeddings from a bidirectional RNN that accepts character sequences for each word as input

• Concatenate the learned character embeddings with ordinary word

embeddings

• Opportunities are endless …currently a very active research area!

(79)

Summary:

Sequence Processing with Recurrent

Networks

• Sequence processing makes use of temporal information from input sequences

• Recurrent neural networks (RNNs) are neural networks specifically designed for sequence processing

• Bonus: Can accept inputs of variable length

• RNNs base their decisions on both current input and activation values from the previous timestep

• RNNs are particularly useful for text generation, sequence labeling, and (when combined with a feedforward network) sequence classification

• More complex varieties of RNNs include:

• Stacked RNNs

• Bidirectional RNNs

• Long short-term memory networks (LSTMs)

• Gated recurrent units (GRUs)

• Although RNNs usually accept word embeddings as input, they can also use other types of input sequences (e.g., character sequences or subword embeddings)

References

Related documents

ABSTRACT : Surface Enhanced Raman Spectroscopy (SERS), an emerging technique of Raman spectroscopy, can be used for the characterization of molecular vibrations in drug

If capital markets were complete, the realization of such shocks would not affect children’s labor supply (and consumption), as they would be insured. The class of models

It has been found that the forward stepwise approach performs greedy optimisation in the sense that each time the best model term is greedily (where the least-squares estimation

Briefly, advancement genioplasty with bone augmentation by thin cortical bone was performed in the early stage of the preorthodontic treatment to establish the foundation for

sis of the global distribution of water isotopes using the NCAR atmospheric general circulation model, J. L.: Water isotopes during the Last Glacial Maximum: New GCM calculations,

Hence, we hypothesize that patients’ switching behavior across hospitals will be associated with an increase in duplicate tests due to lack of access to their

This paper investigates whether structural breaks and long memory are relevant features in modeling and forecasting the conditional volatility of oil spot and

From the above approaches the usual sample acf, the procedures using partial autocorre- lations, the acf of the robustly filtered values as well as the Gaussian rank autocorrelation