Deep Learning for Natural Language Processing

Academic year: 2022

(1)

Deep Learning for Natural Language Processing

Jindřich Libovický

April 29, 2021

NPFL124 Introduction to Natural Language Processing

Charles University

Faculty of Mathematics and Physics

Institute of Formal and Applied Linguistics unless otherwise stated

(2)

Outline

• Neural Networks Basics
• Representing Words
• Representing Sequences
  • Recurrent Networks
  • Convolutional Networks
  • Self-attentive Networks
• Classification and Labeling
• Pre-training Representations
  • Word2Vec & FastText
  • BERT

Neural Networks Basics Representing Words Representing Sequences Classification and Labeling Pre-training Representations 1/ 76

(3)

Deep Learning in NLP

• NLP tasks are learned end-to-end using deep learning, the number-one approach in current research

• State of the art in POS tagging, parsing, named-entity recognition, machine translation, …

Good news: training requires almost no linguistic insight (although relying on none is not always a good idea)

Bad news: it requires an enormous amount of training data and very large computational resources (although that changes with pre-trained models)

(4)

What is deep learning?

• Buzzword for machine learning using neural networks with many layers, trained by back-propagation

• Learning of a real-valued function with millions of parameters that solves a particular problem

• Learning more and more abstract representations of the input data until we reach a representation suitable for our problem

(5)

Neural Networks Basics

(7)

Single Neuron (Perceptron)

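[Figure: a single neuron. The inputs $x_1, x_2, \ldots, x_n$ are multiplied by the weights $w_1, w_2, \ldots, w_n$ and summed; the activation function tests $\sum_i x_i w_i > 0$ and produces the output $y$.]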

The old view: a network of artificial neurons

• simplistic model of a neuron from the 1940s

• a neuron has some (weighted) inputs; when the weighted input is high enough, it fires a signal

• focusing on single neurons does not allow thinking about layers as vector representations
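The single neuron described above can be sketched in a few lines of plain Python. This is only an illustration: the bias term and the AND-like weights are assumptions added for the example, the slide itself only shows the thresholded weighted sum.

```python
def perceptron(inputs, weights, bias=0.0):
    """Fire (return 1) when the weighted sum of the inputs exceeds zero."""
    weighted_sum = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1 if weighted_sum > 0 else 0


# Hypothetical weights and bias that make the neuron act like a logical AND.
and_weights = [1.0, 1.0]
and_bias = -1.5
and_outputs = [perceptron([a, b], and_weights, and_bias)
               for a in (0, 1) for b in (0, 1)]
print(and_outputs)
```

The neuron only fires when both inputs are active, because only then does the weighted sum exceed the threshold encoded in the bias.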

(8)

Neural Network

The current view: a network of layers

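input vector $x$

$n$ hidden layers:

$h_1 = f(W_1 x + b_1)$

$h_2 = f(W_2 h_1 + b_2)$

$\vdots$

$h_n = f(W_n h_{n-1} + b_n)$

output vector $o = g(W_o h_n + b_o)$

error $E = e(o, t)$

Activations flow forward from the input to the error; gradients flow backward, starting from $\frac{\partial E}{\partial o}$, for example $\frac{\partial E}{\partial W_o} = \frac{\partial E}{\partial o} \cdot \frac{\partial o}{\partial W_o}$.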
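A minimal sketch of the layered view in plain Python follows. It assumes $f = \tanh$, an identity output function $g$ and a squared error $e(o, t) = \frac{1}{2}(o - t)^2$; these choices and all numbers are illustrative assumptions, not taken from the slide.

```python
import math


def affine(W, b, v):
    """Compute W v + b for a matrix given as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]


def forward(x, hidden_params, output_params):
    """Run the hidden layers h_k = tanh(W_k h_{k-1} + b_k), then the output layer."""
    h = x
    for W, b in hidden_params:
        h = [math.tanh(z) for z in affine(W, b, h)]
    W_o, b_o = output_params
    o = affine(W_o, b_o, h)  # identity output function g (an assumption)
    return h, o


hidden_params = [([[0.5, -0.5], [1.0, 1.0]], [0.0, 0.0])]
output_params = ([[1.0, -1.0]], [0.0])
target = [0.0]

h_n, o = forward([1.0, 1.0], hidden_params, output_params)
error = 0.5 * (o[0] - target[0]) ** 2

# Chain rule for the output weights: dE/dW_o = dE/do * do/dW_o = (o - t) * h_n.
dE_do = o[0] - target[0]
dE_dWo = [dE_do * h for h in h_n]
```

Every hidden layer repeats the same pattern, which is why frameworks treat a layer, not a neuron, as the basic building block.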

(9)

Implementation: Computation graph

Logistic regression:

𝑦 = 𝜎 (𝑊 𝑥 + 𝑏) (1)

Computation graph:

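[Figure: the forward graph multiplies $x$ by $W$ (×), adds $b$ (+), applies $\sigma$ and feeds the output $y$ into the loss; the backward graph starts at the loss and visits the same operations ($\sigma$, +, ×) in reverse order to reach $b$ and $W$.]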
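The forward and backward graphs can be traced by hand for a tiny logistic regression. The sketch below assumes a binary cross-entropy loss (the slide only says "loss") and checks the hand-derived gradient against a finite difference; all values are made up for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, w, b, target):
    """Forward graph: multiply, add, sigmoid, loss."""
    product = sum(xi * wi for xi, wi in zip(x, w))
    pre_activation = product + b
    y = sigmoid(pre_activation)
    loss = -(target * math.log(y) + (1 - target) * math.log(1 - y))
    return y, loss


def backward(x, y, target):
    """Backward graph: visit the same nodes in reverse order."""
    d_pre = y - target              # loss and sigmoid nodes combined
    d_b = d_pre                     # through the addition node
    d_w = [d_pre * xi for xi in x]  # through the multiplication node
    return d_w, d_b


x, w, b, target = [1.0, 2.0], [0.3, -0.2], 0.1, 1.0
y, loss = forward(x, w, b, target)
d_w, d_b = backward(x, y, target)

# Finite-difference check of the gradient with respect to the bias.
eps = 1e-6
_, loss_shifted = forward(x, w, b + eps, target)
numeric_d_b = (loss_shifted - loss) / eps
```

Frameworks automate exactly this bookkeeping: every forward operation registers how to propagate a gradient backward through it.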

(10)

Frameworks for Deep Learning

research and prototyping in Python

TensorFlow:

• graph statically constructed, symbolic computation

• computation happens in a session

• allows graph export and running as a binary

PyTorch:

• computations written dynamically as normal procedural code

• easy debugging: inspecting variables at any time of the computation

(11)

Representing Words

(13)

The problem of representing words

Problem:

Words (and characters) are discrete, but inputs to neural nets must be continuous.

Spoiler: The solution is called embeddings.

Let's discuss it on the problem of language modeling.

(14)

Language Modeling

• estimate probability of a next word in a text

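$P(w_i \mid w_{i-1}, w_{i-2}, \ldots, w_1)$

• standard approach: $n$-gram models with Markov assumption

$\approx P(w_i \mid w_{i-1}, w_{i-2}, \ldots, w_{i-n}) \approx \sum_{j=0}^{n} \lambda_j \frac{c(w_i \mid w_{i-1}, \ldots, w_{i-j})}{c(w_i \mid w_{i-1}, \ldots, w_{i-j+1})}$

• Let's simulate it with a neural network:

$\ldots \approx F(w_{i-1}, \ldots, w_{i-n} \mid \theta)$, where $\theta$ is a set of trainable parameters.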
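To make the count-based baseline concrete, here is a small sketch of an interpolated model that mixes unigram and bigram relative frequencies. The toy corpus and the interpolation weights are assumptions chosen for illustration; real models tune the weights on held-out data.

```python
from collections import Counter


def train_counts(tokens):
    """Collect unigram and bigram counts from a token list."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def interpolated_prob(word, prev, unigrams, bigrams, lambdas=(0.3, 0.7)):
    """Mix the unigram and bigram estimates with weights lambda_0 and lambda_1."""
    total = sum(unigrams.values())
    p_unigram = unigrams[word] / total
    p_bigram = bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0
    return lambdas[0] * p_unigram + lambdas[1] * p_bigram


tokens = "the cat sat on the mat".split()
unigrams, bigrams = train_counts(tokens)
p_cat = interpolated_prob("cat", "the", unigrams, bigrams)
p_all = sum(interpolated_prob(w, "the", unigrams, bigrams) for w in unigrams)
```

The neural language model replaces these count tables with a parametrized function of the same context words.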

(15)

Simple Neural Language Model

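[Figure: the one-hot vectors $\mathbf{1}_{w_{n-3}}$, $\mathbf{1}_{w_{n-2}}$ and $\mathbf{1}_{w_{n-1}}$ are each multiplied by the shared embedding matrix $W_e$, projected by $V_3$, $V_2$ and $V_1$, summed with a bias $b$ and passed through $\tanh$; a final projection $\cdot W + b$ followed by softmax gives $P(w_n \mid w_{n-1}, w_{n-2}, w_{n-3})$.]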

Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. A neural probabilistic language model. The Journal of Machine Learning Research, 3 (Feb):1137–1155, 2003. ISSN 1532-4435

(16)

Neural LM: Word Representation

• limited vocabulary (hundreds of thousands of words): indexed set of words

• words are initially represented as one-hot vectors $\mathbf{1}_w = (0, \ldots, 0, 1, 0, \ldots, 0)$

• projection $\mathbf{1}_w \cdot V$ corresponds to selecting one row from matrix $V$

• $V$ is a table of learned word vector representations, so-called word embeddings

• dimension typically 100 to 300

The first hidden layer is then:

$h_1 = V_{w_{i-n}} \oplus V_{w_{i-n+1}} \oplus \ldots \oplus V_{w_{i-1}}$

Matrix $V$ is shared for all words.
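The equivalence between the one-hot projection and a table lookup can be checked directly. The tiny embedding table below is a made-up example.

```python
def one_hot(index, size):
    """Return a one-hot vector with a single 1 at the given index."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def vector_times_matrix(vector, matrix):
    """Multiply a row vector by a matrix given as a list of rows."""
    columns = len(matrix[0])
    return [sum(vector[r] * matrix[r][c] for r in range(len(matrix)))
            for c in range(columns)]


# Hypothetical embedding table: 3 words, embedding dimension 2.
V = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]

word_index = 2
projected = vector_times_matrix(one_hot(word_index, len(V)), V)
looked_up = V[word_index]

# First hidden layer: concatenation of the context word embeddings.
context = [0, 2]
h1 = [value for index in context for value in V[index]]
```

This is why implementations store word indices and use an embedding lookup instead of multiplying by vocabulary-sized one-hot vectors.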

(17)

Neural LM: Next Word Estimation

• optionally add extra hidden layer:

$h_2 = f(h_1 W_1 + b_1)$

• last layer: probability distribution over vocabulary

$y = \operatorname{softmax}(h_2 W_2 + b_2) = \frac{\exp(h_2 W_2 + b_2)}{\sum \exp(h_2 W_2 + b_2)}$

• training objective: cross-entropy between the true (i.e., one-hot) distribution and estimated distribution

$E = -\sum_i p_{\text{true}}(w_i) \log y(w_i) = \sum_i -\log y(w_i)$

• learned by error back-propagation
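A small numeric sketch of the last layer and the training objective: softmax turns logits into a distribution, and with a one-hot target the cross-entropy reduces to the negative log-probability of the correct word. The logits are arbitrary example values, and subtracting the maximum is a common stability trick rather than something the slide prescribes.

```python
import math


def softmax(logits):
    """Normalize logits into a probability distribution."""
    shift = max(logits)  # subtracting the maximum avoids overflow
    exps = [math.exp(l - shift) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, target_index):
    """Cross-entropy against a one-hot target distribution."""
    return -math.log(probs[target_index])


logits = [2.0, 1.0, 0.0]  # one score per vocabulary item
y = softmax(logits)
loss = cross_entropy(y, target_index=0)
```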

(18)

Learned Representations

• word embeddings from LMs have interesting properties

• cluster according to POS & meaning similarity

Table taken from Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. Natural language processing (almost) from scratch. The Journal of Machine Learning Research, 12(Aug):2493–2537, 2011. ISSN 1533-7928

• in IR: query expansion by nearest neighbors

• in deep learning models: embeddings initialization speeds up training / allows complex model with less data

(19)

Implementation in PyTorch

import torch
import torch.nn as nn


class LanguageModel(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.hidden_layer = nn.Linear(3 * embedding_dim, hidden_dim)
        self.output_layer = nn.Linear(hidden_dim, vocab_size)
        self.loss_function = nn.CrossEntropyLoss()

    def forward(self, word_1, word_2, word_3, target=None):
        embedded_1 = self.embedding(word_1)
        embedded_2 = self.embedding(word_2)
        embedded_3 = self.embedding(word_3)

        # Concatenate the three context embeddings along the feature dimension.
        hidden = torch.tanh(self.hidden_layer(
            torch.cat([embedded_1, embedded_2, embedded_3], dim=-1)))
        logits = self.output_layer(hidden)

        loss = None
        if target is not None:
            loss = self.loss_function(logits, target)
        return logits, loss

(21)

Representing Sequences

(23)

Contextual representation

Word embeddings represent words in isolation.

Meaning is context-dependent, so we need an encoder providing contextual representations.

(24)

Representing Sequences

Recurrent Networks

(25)

Recurrent Neural Networks: Pipe Metaphor

RNN = pipeline for information

In every step some information goes in and some information goes out.

(26)

Recurrent Networks (RNNs)

Image source: http://colah.github.io/posts/2015-08-Understanding-LSTMs

• inputs: $x_1, \ldots, x_T$ (in NLP word embeddings)

• initial state $h_0$: a zero vector, a result of a previous computation, or a trainable parameter

• recurrent computation:

$h_t = A(h_{t-1}, x_t)$

(27)

RNN as Imperative Code

def rnn(initial_state, inputs):
    prev_state = initial_state
    for x in inputs:
        new_state, output = rnn_cell(x, prev_state)
        prev_state = new_state
        yield output

(28)

RNN Unrolled in Time

Image source: http://colah.github.io/posts/2015-08-Understanding-LSTMs

At training time unrolled in time: technically just a very deep network

(29)

Vanilla RNN

$h_t = \tanh(W [h_{t-1}; x_t] + b)$

• cannot propagate long-distance relations

• vanishing gradient problem
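The vanilla RNN update can be written as a concrete cell plus a loop over the inputs. This is a sketch in plain Python; the weights are arbitrary illustrative values, and using the new state as the output is one common convention rather than something the formula fixes.

```python
import math


def rnn_cell(x, prev_state, W, b):
    """Compute h_t = tanh(W [h_{t-1}; x_t] + b); the state doubles as the output."""
    concatenated = prev_state + x  # list concatenation builds [h_{t-1}; x_t]
    new_state = [math.tanh(sum(w * v for w, v in zip(row, concatenated)) + bias)
                 for row, bias in zip(W, b)]
    return new_state, new_state


def rnn(initial_state, inputs, W, b):
    """Apply the cell step by step and collect the outputs."""
    state = initial_state
    outputs = []
    for x in inputs:
        state, output = rnn_cell(x, state, W, b)
        outputs.append(output)
    return outputs


# Two hidden units and one input dimension, so W has shape 2 x 3.
W = [[0.5, 0.0, 1.0], [0.0, 0.5, -1.0]]
b = [0.0, 0.0]
outputs = rnn([0.0, 0.0], [[1.0], [0.0]], W, b)
```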

Image source: http://colah.github.io/posts/2015-08-Understanding-LSTMs

(30)

Vanishing Gradient Problem (1)

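$\tanh x = \frac{1 - e^{-2x}}{1 + e^{-2x}}$

[Plot: $\tanh x$ for $x \in [-6, 6]$, with values between $-1$ and $1$]

$\frac{\mathrm{d} \tanh x}{\mathrm{d} x} = 1 - \tanh^2 x \in (0, 1]$

[Plot: the derivative of $\tanh x$ for $x \in [-6, 6]$, with values between $0$ and $1$]

Weights are initialized $\sim \mathcal{N}(0, 1)$ to have gradients further from zero.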

(31)

Vanishing Gradient Problem (2)

$\frac{\partial E_{t+1}}{\partial b} = \frac{\partial E_{t+1}}{\partial h_{t+1}} \cdot \frac{\partial h_{t+1}}{\partial b}$ (chain rule)

(32)

Vanishing Gradient Problem (3)

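$\frac{\partial h_t}{\partial b} = \frac{\partial \tanh(\overbrace{W_h h_{t-1} + W_x x_t + b}^{= z_t \text{ (activation)}})}{\partial b}$ ($\tanh'$ is the derivative of $\tanh$)

$= \tanh'(z_t) \cdot \left( \frac{\partial W_h h_{t-1}}{\partial b} + \underbrace{\frac{\partial W_x x_t}{\partial b}}_{=0} + \underbrace{\frac{\partial b}{\partial b}}_{=1} \right)$

$= \underbrace{W_h}_{\sim \mathcal{N}(0,1)} \underbrace{\tanh'(z_t)}_{\in (0;1]} \frac{\partial h_{t-1}}{\partial b} + \tanh'(z_t)$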
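The recurrence can be run numerically for a scalar RNN to see how quickly the influence of early steps decays. The sketch assumes a scalar state with $W_x = 1$, and the values of `w`, `b` and the inputs are illustrative assumptions.

```python
import math


def run(w, b, xs):
    """Scalar RNN h_t = tanh(w * h_{t-1} + x_t + b) with its gradient w.r.t. b."""
    h, dh_db, carry = 0.0, 0.0, 1.0
    for x in xs:
        h = math.tanh(w * h + x + b)
        dtanh = 1.0 - h * h                # tanh'(z_t), always in (0, 1]
        dh_db = dtanh * (w * dh_db + 1.0)  # the recurrence derived above
        carry *= w * dtanh                 # factor by which h_0 still affects h_t
    return h, dh_db, carry


xs = [0.5] * 30
h, dh_db, carry = run(w=0.9, b=0.1, xs=xs)

# Finite-difference check of the analytic gradient.
eps = 1e-6
h_shifted, _, _ = run(w=0.9, b=0.1 + eps, xs=xs)
numeric = (h_shifted - h) / eps
```

After 30 steps the `carry` factor is vanishingly small, so whatever happened at the start of the sequence barely influences the gradient at the end.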

(33)

Long Short-Term Memory Networks

LSTM = Long short-term memory

Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997. ISSN 0899-7667

Image source: http://colah.github.io/posts/2015-08-Understanding-LSTMs

Control the gradient flow by explicitly gating:

1. What to use from the input.

2. What to reuse from the hidden state.

3. What to put on the output.

(34)

LSTM: Hidden State

• two types of hidden states

  • $h_t$: "public" hidden state, used as output

  • $c_t$: "private" memory, no non-linearities on the way

• direct flow of gradients (without multiplying by ≤ 1 derivatives)

Image source: http://colah.github.io/posts/2015-08-Understanding-LSTMs

(35)

LSTM: Forget Gate

Image source: http://colah.github.io/posts/2015-08-Understanding-LSTMs

$f_t = \sigma(W_f [h_{t-1}; x_t] + b_f)$

Reminder, the sigmoid function: [Plot: $\sigma(x)$ for $x \in [-10, 10]$, with values between $0$ and $1$]

• based on input and previous state, decide what to forget from the memory

(36)

LSTM: Input Gate

Image source: http://colah.github.io/posts/2015-08-Understanding-LSTMs

$i_t = \sigma(W_i \cdot [h_{t-1}; x_t] + b_i)$

$\tilde{C}_t = \tanh(W_c \cdot [h_{t-1}; x_t] + b_C)$

• $\tilde{C}_t$: candidate for what we may want to add to the memory

• $i_t$: decides how much of the information we want to store

(37)

LSTM: Cell State Update

Image source: http://colah.github.io/posts/2015-08-Understanding-LSTMs

$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$

(38)

LSTM: Output Gate

Image source: http://colah.github.io/posts/2015-08-Understanding-LSTMs

$o_t = \sigma(W_o \cdot [h_{t-1}; x_t] + b_o)$

$h_t = o_t \odot \tanh C_t$

(39)

Here we are, LSTM!

$f_t = \sigma(W_f [h_{t-1}; x_t] + b_f)$

$i_t = \sigma(W_i \cdot [h_{t-1}; x_t] + b_i)$

$o_t = \sigma(W_o \cdot [h_{t-1}; x_t] + b_o)$

$\tilde{C}_t = \tanh(W_c \cdot [h_{t-1}; x_t] + b_C)$

$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$

$h_t = o_t \odot \tanh C_t$
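The six LSTM equations translate almost line by line into code. The following is a plain-Python sketch of a single step; the parameter dictionary and its key names are hypothetical, and the all-zero weights are chosen only so the result is easy to verify by hand.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(W, b, v):
    """Compute W v + b for a matrix given as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]


def lstm_step(x, h_prev, c_prev, params):
    """One LSTM step following the forget, input, output and cell equations."""
    hx = h_prev + x  # concatenation [h_{t-1}; x_t]
    f = [sigmoid(z) for z in affine(params["W_f"], params["b_f"], hx)]
    i = [sigmoid(z) for z in affine(params["W_i"], params["b_i"], hx)]
    o = [sigmoid(z) for z in affine(params["W_o"], params["b_o"], hx)]
    c_tilde = [math.tanh(z) for z in affine(params["W_c"], params["b_c"], hx)]
    c = [ft * cp + it * ct for ft, cp, it, ct in zip(f, c_prev, i, c_tilde)]
    h = [ot * math.tanh(ct) for ot, ct in zip(o, c)]
    return h, c


hidden_size, input_size = 2, 1
zero_matrix = [[0.0] * (hidden_size + input_size) for _ in range(hidden_size)]
zero_bias = [0.0] * hidden_size
params = {"W_f": zero_matrix, "W_i": zero_matrix, "W_o": zero_matrix,
          "W_c": zero_matrix, "b_f": zero_bias, "b_i": zero_bias,
          "b_o": zero_bias, "b_c": zero_bias}

h, c = lstm_step([1.0], [0.0, 0.0], [1.0, -1.0], params)
```

With zero weights every gate equals 0.5 and the candidate is 0, so the cell state is simply halved: the memory is scaled by the forget gate without passing through a non-linearity.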

(40)

Gated Recurrent Units

update gate: $z_t = \sigma(x_t W_z + h_{t-1} U_z + b_z) \in (0, 1)$

remember gate: $r_t = \sigma(x_t W_r + h_{t-1} U_r + b_r) \in (0, 1)$

candidate hidden state: $\tilde{h}_t = \tanh(x_t W + (r_t \odot h_{t-1}) U) \in (-1, 1)$

hidden state: $h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$

Image source: http://colah.github.io/posts/2015-08-Understanding-LSTMs

Kyunghyun Cho, Bart van Merrienboer, Dzmitry Bahdanau, and Yoshua Bengio. On the properties of neural machine translation: Encoder–decoder approaches. In Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, pages 103–111, Doha, Qatar, October 2014. Association for Computational Linguistics

(41)

LSTM vs. GRU

• GRU is smaller and therefore faster

• performance similar, task dependent

• theoretical limitation: GRU accepts only regular languages, while LSTM can simulate a counter machine

Junyoung Chung, Çaglar Gülçehre, Kyunghyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. CoRR, abs/1412.3555, 2014. ISSN 2331-8422

(42)

RNN in PyTorch

# dropout is applied only between stacked layers, so it has no effect with num_layers=1
rnn = nn.LSTM(input_size=input_dim, hidden_size=512, num_layers=1, bidirectional=True, dropout=0.1)

output, (hidden, cell) = rnn(x)

https://pytorch.org/docs/stable/nn.html?highlight=lstm#torch.nn.LSTM

(43)

Bidirectional Networks

• simple trick to improve performance

• run one RNN forward, second one backward and concatenate outputs

Image from: http://colah.github.io/posts/2015-09-NN-Types-FP/

• state of the art in tagging, crucial for neural machine translation

(44)

Representing Sequences

Convolutional Networks

(45)

Convolutional Networks: Tree-Structure metaphor

Processing input windows in a hierarchical fashion

Sort of like a hierarchy in tree-like data structures

(46)

1-D Convolution

≈ sliding window over the sequence

embeddings $\mathbf{x} = (x_1, \ldots, x_N)$, padded with $x_0 = \vec{0}$ and $x_{N+1} = \vec{0}$

$h_1 = f(W [x_0; x_1; x_2] + b)$

$h_i = f(W [x_{i-1}; x_i; x_{i+1}] + b)$

pad with 0s if we want to keep sequence length

(47)

1-D Convolution: Pseudocode

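import numpy as np

def conv1d(xs, W, b, kernel_size=3):
    # xs: input sequence of shape (batch, length, dim)
    # W: trained parameter of shape (kernel_size * dim, filters), e.g. filters = 300
    # b: trained parameter of shape (filters,)
    # The step size (stride) is 1; without padding the output is shorter than the input.
    window = kernel_size // 2
    outputs = []
    for i in range(window, xs.shape[1] - window):
        # Flatten the window [x_{i-1}; x_i; x_{i+1}] and project it.
        window_input = xs[:, i - window:i + window + 1].reshape(xs.shape[0], -1)
        outputs.append(window_input @ W + b)
    return np.stack(outputs, axis=1)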

(48)

1-D Convolution in PyTorch

conv = nn.Conv1d(in_channels, out_channels=300, kernel_size=3, stride=1, padding=0, dilation=1, groups=1, bias=True)

h = conv(x)

https://pytorch.org/docs/stable/nn.html#torch.nn.Conv1d

(49)

Rectified Linear Units

[Plot: ReLU for $x \in [-6, 6]$, with values between $0$ and $6$]

[Plot: derivative of ReLU for $x \in [-6, 6]$, with values between $0$ and $1$]

faster, suffers less from the vanishing gradient problem

Vinod Nair and Geoffrey E Hinton. Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning, pages 807–814, Haifa, Israel, June 2010. JMLR.org

(50)

Residual Connections

embeddings $\mathbf{x} = (x_1, \ldots, x_N)$, padded with $x_0 = \vec{0}$ and $x_{N+1} = \vec{0}$

$h_i = f(W [x_{i-1}; x_i; x_{i+1}] + b) + x_i$

Allows training deeper networks.

Better gradient flow – the same as in RNNs.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, Las Vegas, NV, USA, June 2016. IEEE Computer Society. ISBN 9781467388511

(51)

Residual Connections: Numerical Stability

Numerically unstable: we need activations to be on a similar scale ⇒ layer normalization.

Activation before non-linearity is normalized:

$\bar{a}_i = \frac{g_i}{\sigma_i} (a_i - \mu_i)$

$g$ is a trainable parameter; $\mu$ and $\sigma$ are estimated from data:

$\mu = \frac{1}{H} \sum_{i=1}^{H} a_i$

$\sigma = \sqrt{\frac{1}{H} \sum_{i=1}^{H} (a_i - \mu)^2}$
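A minimal sketch of layer normalization over one activation vector follows. The small `eps` added to the standard deviation is an assumption for numerical safety that is common in practice but not part of the formulas above, and the gain vector of ones is only an example.

```python
import math


def layer_norm(activations, gain, eps=1e-6):
    """Normalize one activation vector to zero mean and unit scale, then rescale."""
    H = len(activations)
    mu = sum(activations) / H
    sigma = math.sqrt(sum((a - mu) ** 2 for a in activations) / H)
    return [g * (a - mu) / (sigma + eps) for g, a in zip(gain, activations)]


normalized = layer_norm([1.0, 2.0, 3.0, 4.0], gain=[1.0] * 4)
```

The statistics are computed over the $H$ units of a single layer for a single input, so the result does not depend on the other examples in the batch.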

Lei Jimmy Ba, Ryan Kiros, and Geoffrey E Hinton. Layer normalization. CoRR, abs/1607.06450, 2016. ISSN 2331-8422

(52)

Receptive Field

[Figure: receptive field of stacked convolutions over the zero-padded embeddings $\mathbf{x} = (x_1, \ldots, x_N)$]

(53)

Convolutional architectures

+

• extremely computationally efficient

−

• limited context

• by default not aware of 𝑛-gram order

Currently mostly used in speech processing and character-level architectures

(54)

Representing Sequences

Self-attentive Networks

(55)

Self-attentive Networks: Complete graph metaphor

RNN = information pipeline

CNN = information in tree-like data structure

Self-attentive = Transformers = information flow in a weighted complete bipartite graph

[Figure: columns of states numbered 1 to 8 with connections between them]

(56)

Self-attentive Networks

• In some layers: states are linear combinations of previous layer states

• Originally from the Transformer model for machine translation

• attention weights = similarity matrix between all pairs of states

• $O(n^2)$ memory, $O(1)$ time (when parallelized)

• next layer: sum by rows

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems 30, pages 6000–6010, Long Beach, CA, USA, December 2017. Curran Associates, Inc

(57)

Multi-head scaled dot-product attention

Scaled dot-product attention

$Q = (q_1, \ldots, q_n)$: queries, $K$: keys, $V$: values

$\operatorname{Attn}(Q, K, V) = \operatorname{softmax}\left(\frac{Q K^\top}{\sqrt{d}}\right) V$, where $\frac{Q K^\top}{\sqrt{d}}$ is the similarity matrix

Multi-head setup

$\operatorname{Multihead}(Q, V) = (H_1 \oplus \cdots \oplus H_h) W^O$ (concatenate head outputs)

$H_i = \operatorname{Attn}(Q W_i^Q, V W_i^K, V W_i^V)$

$W_i^Q$, $W_i^K$, $W_i^V$: head-specific projections

[Figure: keys & values and queries pass through the linear projections $W^K$, $W^V$ and $W^Q$ and are split into heads; each head applies scaled dot-product attention; the head outputs are concatenated and projected by the linear layer $W^O$.]

(58)

Dot-Product Attention in PyTorch

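import math
import torch
import torch.nn.functional as F

def attention(query, key, value, mask=None):
    d_k = query.size(-1)
    scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        # Positions where the mask is 0 receive no attention weight.
        scores = scores.masked_fill(mask == 0, float("-inf"))
    p_attn = F.softmax(scores, dim=-1)
    return torch.matmul(p_attn, value), p_attn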
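For readers without PyTorch at hand, the same computation can be sketched with plain Python lists. The three two-dimensional states are made-up values, and using them as queries, keys and values at once mimics self-attention.

```python
import math


def softmax(scores):
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of vectors."""
    d = len(queries[0])
    outputs, weights = [], []
    for q in queries:
        # One row of the similarity matrix: this query against every key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        row = softmax(scores)
        weights.append(row)
        # The output is the attention-weighted sum of the value vectors.
        outputs.append([sum(w * v[j] for w, v in zip(row, values))
                        for j in range(len(values[0]))])
    return outputs, weights


states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = attention(states, states, states)
```

Each output state is a convex combination of the value vectors, with the mixing weights read off one row of the similarity matrix.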

(59)

Stacking self-attentive Layers

[Figure: input embeddings plus position encoding feed a stack of $N$ identical layers. Each layer has a self-attentive sublayer (multihead attention over keys & values and queries, followed by layer normalization) and a feed-forward sublayer (non-linear layer, linear layer, followed by layer normalization).]

• several layers (6 in the original paper)

• each layer has 2 sub-layers: self-attention and feed-forward layer

• everything inter-connected with residual connections

(60)

Position Encoding

Model is not aware of the position in the sequence.

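$\operatorname{pos}(i) = \begin{cases} \sin\left(t / 10^{4 \frac{i}{d}}\right), & \text{if } i \bmod 2 = 0 \\ \cos\left(t / 10^{4 \frac{i-1}{d}}\right), & \text{otherwise} \end{cases}$

where $t$ is the position in the text, $i$ the dimension index and $d$ the model dimension.

[Plot: position encoding values (between $-0.5$ and $1.0$) by text length (0 to 80) and dimension (0 to 300)]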
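The encoding can be computed directly from the formula. In this sketch `t` is the position in the text, `i` the dimension index and `d` the model dimension; the small dimension 8 is chosen only to keep the example readable.

```python
import math


def position_encoding(t, d):
    """Sinusoidal encoding of position t for a model of dimension d."""
    encoding = []
    for i in range(d):
        if i % 2 == 0:
            encoding.append(math.sin(t / 10 ** (4 * i / d)))
        else:
            encoding.append(math.cos(t / 10 ** (4 * (i - 1) / d)))
    return encoding


first = position_encoding(0, 8)
third = position_encoding(3, 8)
```

Low dimensions oscillate quickly and high dimensions slowly, so every position receives a distinct pattern that is added to the input embeddings.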

(61)

Architectures Comparison

[Figure: connectivity patterns between states 1 to 8 in recurrent, convolutional and self-attentive networks]

                 computation        sequential operations   memory
Recurrent        $O(n \cdot d^2)$           $O(n)$                  $O(n \cdot d)$
Convolutional    $O(k \cdot n \cdot d^2)$   $O(1)$                  $O(n \cdot d)$
Self-attentive   $O(n^2 \cdot d)$           $O(1)$                  $O(n^2 \cdot d)$

$d$ model dimension, $n$ sequence length, $k$ convolutional kernel

(62)

Classification and Labeling

(64)

Architecture Overview

1. Embed input words. Get a sequence of continuous vectors.

2. Contextualize input. Apply a sequence processing architecture and get contextual representation.

3. Get some output. Typically classification or labeling.

(65)

Text classification

Tasks such as: sentiment analysis, genre classification, fake news detection, spam detection.

Embeddings → Encoder (RNN/CNN/Transformer) → Hidden states $(h_1, \ldots, h_n)$ → get single vector → classifier

Classifier = feedforward neural net with softmax at the end

(66)

Getting a single vector: Two options

1. Pooling

• Squeeze the sequence into a vector

• Usable directly for embeddings and also encoder output

• Mean pooling: $\frac{1}{n} \sum_i h_i$

• Max pooling (intuitively existential quantifier): take max in each dimension

2. Choose one state

• In RNN the last state, i.e., after reading the whole input

• In Transformers any (typically the first) state, self-attention will learn to move relevant information there
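Both options are short enough to sketch directly. The three two-dimensional hidden states below are invented example values.

```python
def mean_pool(states):
    """Average the hidden states dimension by dimension."""
    n = len(states)
    return [sum(column) / n for column in zip(*states)]


def max_pool(states):
    """Take the maximum over the sequence in each dimension."""
    return [max(column) for column in zip(*states)]


# Hypothetical encoder output: three hidden states of dimension 2.
states = [[1.0, -2.0], [3.0, 0.0], [2.0, 5.0]]

mean_vector = mean_pool(states)
max_vector = max_pool(states)
last_state = states[-1]  # the "choose one state" option for an RNN
```

Whichever option is used, the classifier always receives a vector of the same size regardless of the input length.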

(67)

Softmax & Cross-Entropy

Output layer with softmax (with parameters $W$, $b$) gets a categorical distribution:

$P_y = \operatorname{softmax}(\mathbf{x}) = P(y = j \mid \mathbf{x}) = \frac{\exp(\mathbf{x} W + b)}{\sum \exp(\mathbf{x} W + b)}$

Network error = cross-entropy between estimated distribution and one-hot ground-truth distribution $T = \mathbf{1}(y) = (0, 0, \ldots, 1, 0, \ldots, 0)$:

$L(P_y, y) = H(P, T) = -\mathbb{E}_{i \sim T} \log P(i) = -\sum_i T(i) \log P(i) = -\log P(y)$

(68)

Derivative of Cross-Entropy

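Let $l = \mathbf{x} W + b$; $l_y$ corresponds to the correct class.

$\frac{\partial L(P_y, y)}{\partial l} = -\frac{\partial}{\partial l} \log \frac{\exp l_y}{\sum_j \exp l_j} = -\frac{\partial}{\partial l} \left( l_y - \log \sum \exp l \right)$

$= -\mathbf{1}_y + \frac{\partial}{\partial l} \log \sum \exp l = -\mathbf{1}_y + \frac{\exp l}{\sum \exp l} = P_y - \mathbf{1}_y$

A gradient-descent step therefore moves the logits along $-\frac{\partial L}{\partial l} = \mathbf{1}_y - P_y$.

Interpretation: Reinforce the correct logit, suppress the rest.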
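The result can be checked numerically: the gradient of the loss $-\log P(y)$ with respect to the logits equals the softmax output minus the one-hot target, and a gradient-descent step moves in the opposite direction, $\mathbf{1}_y - P_y$. The logits below are arbitrary example values.

```python
import math


def softmax(logits):
    shift = max(logits)
    exps = [math.exp(l - shift) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def loss(logits, y):
    """Cross-entropy with a one-hot target: minus log-probability of class y."""
    return -math.log(softmax(logits)[y])


def gradient(logits, y):
    """Analytic gradient of the loss w.r.t. the logits: softmax minus one-hot."""
    probs = softmax(logits)
    return [p - (1.0 if j == y else 0.0) for j, p in enumerate(probs)]


logits, y = [1.0, 2.0, 0.5], 1
analytic = gradient(logits, y)

# Finite-difference approximation, one logit at a time.
eps = 1e-6
numeric = []
for j in range(len(logits)):
    shifted = list(logits)
    shifted[j] += eps
    numeric.append((loss(shifted, y) - loss(logits, y)) / eps)
```

The entry for the correct class is negative and all others are positive, so descending the gradient raises the correct logit and lowers the rest.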

(69)

Sequence Labeling

• assign value / probability distribution to every token in a sequence

• morphological tagging, named-entity recognition, LM with unlimited history, answer span selection

[Figure: Embeddings → Encoder (RNN/CNN/Transformer) → Hidden states → the same classifier at every state]

(70)

Pre-training Representations

(72)

Pre-trained Representations

• representations that emerge in models seem to carry a lot of general information about the language

• representations pre-trained on large data can be re-used on tasks with smaller training data

(73)

Pre-training Representations

Word2Vec & FastText

(74)

Word2Vec

• way to learn word embeddings without training the complete LM

[Figure: CBOW predicts the middle word $w_3$ from the surrounding words $w_1$, $w_2$, $w_4$, $w_5$; skip-gram predicts the surrounding words $w_1$, $w_2$, $w_4$, $w_5$ from $w_3$.]

• CBOW: minimize cross-entropy of the middle word of a sliding window

• skip-gram: minimize cross-entropy of a bag of words around a word (LM other way round)

Tomáš Mikolov, Wen-tau Yih, and Geoffrey Zweig. Linguistic regularities in continuous space word representations. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 746–751, Atlanta, GA, USA, June 2013. Association for Computational Linguistics

(75)

Word2Vec: sampling

Sentence: All human beings are born free and equal in dignity …

1. center word "All": (All, human) (All, beings)

2. center word "human": (human, All) (human, beings) (human, are)

3. center word "beings": (beings, All) (beings, human) (beings, are) (beings, born)

4. center word "are": (are, human) (are, beings) (are, born) (are, free)
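The sampling of training pairs can be sketched as a sliding window over the tokens. The window size of 2 is inferred from the pairs listed above; the function name is hypothetical.

```python
def skipgram_pairs(tokens, window=2):
    """Generate (center word, context word) pairs from a sliding window."""
    pairs = []
    for center, word in enumerate(tokens):
        start = max(0, center - window)
        end = min(len(tokens), center + window + 1)
        for context in range(start, end):
            if context != center:
                pairs.append((word, tokens[context]))
    return pairs


tokens = "All human beings are born free".split()
pairs = skipgram_pairs(tokens)
```

Each pair becomes one training example in which the center word is the input and the context word is the prediction target.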

(76)

Word2Vec: Formulas

• Training objective:

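$\frac{1}{T} \sum_{t=1}^{T} \sum_{j \sim (-c, c)} \log p(w_{t+j} \mid w_t)$

• Probability estimation:

$p(w_O \mid w_I) = \frac{\exp\left(V'^{\top}_{w_O} V_{w_I}\right)}{\sum_w \exp\left(V'^{\top}_{w} V_{w_I}\right)}$

where $V$ is the input (embedding) matrix and $V'$ the output matrix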

Equations 1, 2. Tomáš Mikolov, Wen-tau Yih, and Geoffrey Zweig. Linguistic regularities in continuous space word representations. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 746–751, Atlanta, GA, USA, June 2013. Association for Computational Linguistics

(77)

Word2Vec: Training using Negative Sampling

The summation in denominator is slow, use noise contrastive estimation:

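$\log \sigma\left(V'^{\top}_{w_O} V_{w_I}\right) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)} \left[ \log \sigma\left(-V'^{\top}_{w_i} V_{w_I}\right) \right]$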

Main idea: classify independently by logistic regression the positive and few sampled negative examples.
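A sketch of the negative-sampling objective for one training pair follows. The expectation over the noise distribution is approximated by whatever negative vectors are passed in; the vectors are made-up values and no actual sampling from $P_n(w)$ is shown.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def negative_sampling_objective(input_vec, output_vec, negative_vecs):
    """Log-likelihood of one positive pair and k sampled negative pairs."""
    positive = math.log(sigmoid(dot(output_vec, input_vec)))
    negative = sum(math.log(sigmoid(-dot(neg, input_vec))) for neg in negative_vecs)
    return positive + negative


v_input = [1.0, 0.0]
# Output vector aligned with the input, negative sample pointing away: high score.
good = negative_sampling_objective(v_input, [2.0, 0.0], [[-2.0, 0.0]])
# The reverse configuration: low score.
bad = negative_sampling_objective(v_input, [-2.0, 0.0], [[2.0, 0.0]])
```

Each term is an independent logistic-regression decision, so the cost per training pair depends on the number of negative samples $k$ instead of the vocabulary size.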

Equations 1, 3. Tomáš Mikolov, Wen-tau Yih, and Geoffrey Zweig. Linguistic regularities in continuous space word representations. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 746–751, Atlanta, GA, USA, June 2013. Association for Computational Linguistics

(78)

Word2Vec: Vector Arithmetics

[Figure: vector offsets between the word pairs man/woman, uncle/aunt and king/queen, and between king/queen and kings/queens]

Image originally from Tomáš Mikolov, Wen-tau Yih, and Geoffrey Zweig. Linguistic regularities in continuous space word representations. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 746–751, Atlanta, GA, USA, June 2013. Association for Computational Linguistics

(79)

FastText

State-of-the-art pre-trained word embeddings

• Word2Vec training treats words as independent entities

• There are regularities in how words look; this is called morphology

FastText addresses this:

1. Represent each word as a bag of character 𝑛-grams

2. Keep a table of character 𝑛-gram embeddings instead of word embeddings

3. Word embedding = average of character 𝑛-gram embeddings
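
The three steps can be sketched as follows. This is an illustration under stated assumptions, not the FastText code: the boundary markers `<` and `>`, the 𝑛-gram range 3 to 6, and the hashing of 𝑛-grams into a fixed number of buckets follow my reading of the cited paper, while the class name `NgramEmbeddings`, the CRC32 hash, the table size, and the random (untrained) table are choices made only for this example.

```python
import zlib

import numpy as np


def char_ngrams(word, n_min=3, n_max=6):
    """Step 1: bag of character n-grams, with word-boundary markers."""
    marked = f"<{word}>"
    return [
        marked[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(marked) - n + 1)
    ]


class NgramEmbeddings:
    """Step 2: a table of n-gram embeddings (random here, i.e. untrained)."""

    def __init__(self, dim=16, buckets=1000, seed=0):
        rng = np.random.default_rng(seed)
        self.table = rng.normal(scale=0.1, size=(buckets, dim))
        self.buckets = buckets

    def _index(self, ngram):
        # Deterministic stand-in hash mapping an n-gram to a table row.
        return zlib.crc32(ngram.encode("utf-8")) % self.buckets

    def word_vector(self, word):
        """Step 3: word embedding = average of its n-gram embeddings."""
        rows = [self._index(g) for g in char_ngrams(word)]
        if not rows:
            return np.zeros(self.table.shape[1])
        return self.table[rows].mean(axis=0)


model = NgramEmbeddings()
shared = set(char_ngrams("walking")) & set(char_ngrams("walked"))
print(sorted(shared))
print(model.word_vector("walking").shape)
```

Because related word forms share 𝑛-grams, they share rows of the table, and a word never seen in training still gets a vector.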

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146, 2017. doi: 10.1162/tacl_a_00051. URL https://www.aclweb.org/anthology/Q17-1010


(80)

Training models

FastText: model implementation by Facebook

https://github.com/facebookresearch/fastText

Just throw in raw text.

./fasttext skipgram -input data.txt -output model

The tool allows training simple classifiers using the embeddings.


(81)

Pre-training Representations

BERT

(82)

What is BERT

Bidirectional Encoder Representations from Transformers

• pre-trained representation capturing context (contextual embeddings)

• Transformer with a masked language model objective

• originally by Google, published in November 2018

• since then, improved versions: RoBERTa by Facebook, language-specific variants, multilingual versions

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. doi: 10.18653/v1/N19-1423. URL https://www.aclweb.org/anthology/N19-1423


(83)

Architecture

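[Figure: one Transformer encoder layer, stacked 𝑁 ×. Input embeddings plus position encoding feed a self-attentive sublayer (multihead attention over queries, keys & values, followed by layer normalization) and then a feed-forward sublayer (non-linear layer, linear layer, followed by layer normalization).]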

• a stack of 12 Transformer layers

• trained as sequence labeling: some input tokens are changed and the labeler guesses the original tokens (masked language modeling)

• being able to predict missing words is a proxy for language understanding

Once trained, throw away the labeling layer and use the last layer as the representation.


(84)

Masked Language Model

All human beings are born [free → MASK / hairy / free] and equal in dignity and rights

1. Randomly sample a word → free

2. With 80% chance, replace it with the special MASK token.

3. With 10% chance, replace it with a random token → hairy

4. With 10% chance, keep it as is → free

Then a classifier should predict the missing/replaced word: free
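
The corruption procedure can be sketched as below. This is a simplified illustration on whitespace tokens, not BERT's preprocessing code: the function name `corrupt_tokens`, the toy vocabulary, and the `select_prob` parameter are assumptions for the example. The default of 15% selected tokens and the `[MASK]` spelling of the mask token are taken from the BERT paper, not from the slide.

```python
import random

MASK = "[MASK]"


def corrupt_tokens(tokens, vocab, select_prob=0.15, rng=None):
    """Return (corrupted tokens, prediction targets) for masked LM training.

    A target is the original token at a selected position and None at
    positions where the classifier is not asked to predict anything.
    """
    rng = rng or random.Random()
    corrupted, targets = [], []
    for token in tokens:
        if rng.random() >= select_prob:
            # Step 1 did not sample this position: leave it alone.
            corrupted.append(token)
            targets.append(None)
            continue
        targets.append(token)
        r = rng.random()
        if r < 0.8:
            corrupted.append(MASK)               # 80%: mask token
        elif r < 0.9:
            corrupted.append(rng.choice(vocab))  # 10%: random token
        else:
            corrupted.append(token)              # 10%: keep as is
    return corrupted, targets


tokens = "All human beings are born free and equal in dignity and rights".split()
vocab = ["hairy", "free", "equal", "rights", "born"]  # toy vocabulary
corrupted, targets = corrupt_tokens(tokens, vocab, rng=random.Random(0))
print(corrupted)
print(targets)
```

Keeping some selected tokens unchanged or replacing them with random ones means the model cannot rely on the mask token to know which positions it will be asked about.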


(85)

Pre-train and finetune paradigm

1. Pre-training (do once, large data, long time): unlabeled data → BERT encoder → masked LM

2. Finetuning (small data, fast): task-specific data → BERT encoder → task-specific objective


(86)

Implementation: Huggingface Transformers

https://github.com/huggingface/transformers

Implements most existing pre-trained BERT-like models for both PyTorch and TensorFlow.


(87)

Deep Learning for Natural Language Processing

Summary

1. Discrete symbols → continuous representation with trained embeddings

2. Architectures to get suitable representation: recurrent, convolutional, self-attentive

3. Output: classification, sequence labeling

4. Representations pre-trained on large data help on downstream tasks

http://ufal.cz/courses/npfl124
