Deep Learning for Natural Language Processing

(1)

Deep Learning for Natural Language Processing

Jindřich Libovický

April 29, 2021

NPFL124 Introduction to Natural Language Processing

Charles University

Faculty of Mathematics and Physics

Institute of Formal and Applied Linguistics unless otherwise stated

(2)

Outline

Neural Networks Basics Representing Words Representing Sequences

Recurrent Networks Convolutional Networks Self-attentive Networks Classification and Labeling Pre-training Representations

Word2Vec & FastText BERT

Neural Networks Basics Representing Words Representing Sequences Classification and Labeling Pre-training Representations 1/ 76

(3)

Deep Learning in NLP

• NLP tasks learn end-to-end using deep learning — the number-one approach in current research

• State of the art in POS tagging, parsing, named-entity recognition, machine translation,

…

, Good news: training without almost any linguistic insight

although it is not always a good idea

/ Bad news: requires enormous amount of training data and really big computational resources

although that changes with pre-trained models

(4)

What is deep learning?

• Buzzword for machine learning using neural networks with many layers using back-propagation

• Learning of a real-valued function with millions of parameters that solves a particular problem

• Learning more and more abstract representation of the input data until we reach such a suitable representation for our problem

(5)

Neural Networks Basics

(6)

Neural Networks Basics

(7)

Single Neuron (Perceptron)

activationfunction

input x weights w output

∑ x^⊤w > 0? 𝑦 𝑥_𝑖

⋅𝑤_𝑖 𝑥₁

⋅𝑤₁ 𝑥₂ ⋅𝑤₂

⋮

⋮ 𝑥_𝑛

⋅𝑤_𝑛

The old view:

a network of artificial neurons

• simplistic model of a neuron from the 1940’s

• a neuron has some (weighted) inputs, when the input is high enough, it fires a signal

• focus on single neurons does not allow thinking about layers as vector

representations

(8)

Neural Network

The current view: a network of layers

input vector •••••••••••• 𝑥

↓ ↑ ↓

𝑛 hidden layers

•••••••••••••••• ℎ₁= 𝑓(𝑊₁𝑥 + 𝑏₁)

↓ ↑ ↓ ↑

•••••••••••••••• ℎ₂= 𝑓(𝑊₂ℎ₁+ 𝑏₂)

↓ ↑ ↓ ↑

⋮ ⋮ ⋮

↓ ↑ ↓ ↑

•••••••••••••••• ℎ_𝑛= 𝑓(𝑊_𝑛ℎ_𝑛−1+ 𝑏_𝑛)

↓ ↑ ↓ ↑

output vector ^••• 𝑜 = 𝑔(𝑊_𝑜ℎ_𝑛+ 𝑏_𝑜) _∂𝑊^∂𝐸

𝑜 = ^∂𝐸_∂𝑜 ⋅_∂𝑊^∂𝑜

𝑜

↓ ↓ ↑

error ^• 𝐸 = 𝑒(𝑜, 𝑡) → ^∂𝐸_∂𝑜

(9)

Implementation: Computation graph

Logistic regression:

𝑦 = 𝜎 (𝑊 𝑥 + 𝑏) (1)

Computation graph:

𝑥 𝑊

× 𝑏

+ ℎ 𝜎

forward graph

loss 𝑦^∗

𝑜 𝑜^′ 𝜎^′

+

𝑏^′ ℎ^′

×

𝑊^′

backward graph

(10)

Frameworks for Deep Learning

research and prototyping in Python

• graph statically constructed, symbolic computation

• computation happens in a session

• allows graph export and running as a binary

• computations written dynamically as normal procedural code

• easy debugging: inspecting variables at any time of the computation

(11)

Representing Words

(12)

Representing Words

(13)

The problem of representing words

Problem:

Words (and characters)

are discrete × Inputs to neural nets must be continuous

Spoiler: The solution is called embeddings Let’s discuss it on the problem of language modeling

(14)

Language Modeling

• estimate probability of a next word in a text

P(𝑤_𝑖|𝑤_𝑖−1, 𝑤_𝑖−2, … , 𝑤₁)

• standard approach: 𝑛-gram models with Markov assumption

≈ P(𝑤_𝑖|𝑤_𝑖−1, 𝑤_𝑖−2, … , 𝑤_𝑖−𝑛) ≈

𝑛

∑

𝑗=0

𝜆_𝑗 𝑐(𝑤_𝑖|𝑤_𝑖−1, … , 𝑤_𝑖−𝑗) 𝑐(𝑤_𝑖|𝑤_𝑖−1, … , 𝑤_{𝑖−𝑗+1})

• Let’s simulate it with a neural network:

… ≈ 𝐹 (𝑤_𝑖−1, … , 𝑤_𝑖−𝑛|𝜃) 𝜃 is a set of trainable parameters.

(15)

Simple Neural Language Model

1_𝑤

𝑛−3

⋅𝑊_𝑒

1_𝑤

𝑛−2

⋅𝑊_𝑒

1_𝑤

𝑛−1

⋅𝑊_𝑒 tanh

⋅𝑉₃ ⋅𝑉₂ ⋅𝑉₁ + 𝑏_ℎ

softmax

⋅𝑊 + 𝑏 P(𝑤𝑛|𝑤𝑛−1, 𝑤𝑛−2, 𝑤𝑛−3)

Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. A neural probabilistic language model. The Journal of Machine Learning Research, 3 (Feb):1137–1155, 2003. ISSN 1532-4435

(16)

Neural LM: Word Representation

• limited vocabulary (hundred thousands words): indexed set of words

• words are initially represented as one-hot-vectors 1_𝑤= (0, … , 0, 1, 0, … 0)

• projection 1_𝑤⋅ 𝑉 corresponds to selecting one row from matrix 𝑉

• 𝑉 : is a table of learned word vector representations

so-calledword embeddings

• dimension typically 100 — 300

The first hidden layer is then:

ℎ₁= 𝑉_𝑤

𝑖−𝑛⊕ 𝑉_𝑤

𝑖−𝑛+1⊕ … ⊕ 𝑉_𝑤

𝑖−1

Matrix 𝑉 is shared for all words.

(17)

Neural LM: Next Word Estimation

• optionally add extra hidden layer:

ℎ₂= 𝑓(ℎ₁𝑊₁+ 𝑏₁)

• last layer: probability distribution over vocabulary

𝑦 = softmax(ℎ₂𝑊₂+ 𝑏₂) = exp(ℎ₂𝑊₂+ 𝑏₂)

∑ exp(ℎ₂𝑊₂+ 𝑏₂)

• training objective: cross-entropy between the true (i.e., one-hot) distribution and estimated distribution

𝐸 = − ∑

𝑖

𝑝_true(𝑤_𝑖) log 𝑦(𝑤_𝑖) = ∑

𝑖

− log 𝑦(𝑤_𝑖)

• learned by error back-propagation

(18)

Learned Representations

• word embeddings from LMs have interesting properties

• cluster according to POS & meaning similarity

Table taken from Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. Natural language processing (almost) from scratch. The Journal of Machine Learning Research, 12(Aug):2493–2537, 2011. ISSN 1533-7928

• in IR: query expansion by nearest neighbors

• in deep learning models: embeddings initialization speeds up training / allows complex model with less data

(19)

Implementation in PyTorch I

import torch

import torch.nn as nn

class LanguageModel(nn.Module):

def __init__(self, vocab_size, embedding_dim, hidden_dim):

super().__init__()

self.embedding = nn.Embedding(vocab_size, embedding_dim) self.hidden_layer = nn.Linear(3 * embedding_dim, hidden_dim) self.output_layer = nn.Linear(hidden_dim, vocab_size)

self.loss_function = nn.CrossEntropyLoss()

def forward(self, word_1, word_2, word_3, target=None):

embedded_1 = self.embedding(word_1) embedded_2 = self.embedding(word_2) embedded_3 = self.embedding(word_3)

(20)

Implementation in PyTorch II

hidden = torch.tanh(self.hidden_layer(

torch.cat(embedded_1, embedded_2, embedded_3))) logits = self.output_layer(hidden)

loss = None

if target is not None:

loss = self.loss_function(logits, targets) return logits, loss

(21)

Representing Sequences

(22)

Representing Sequences

(23)

Contextual representation

Word embeddings represent words in isolation.

Meaning is context-dependent —

We need an encoder providing contextual representation

(24)

Representing Sequences

Recurrent Networks

(25)

Recurrent Neural Networks: Pipe Metaphor

RNN = pipeline for information

In every step some information goes in and some information goes out.

(26)

Recurrent Networks (RNNs)

Image source:

http://colah.github.io/posts/2015-08-Understanding-LSTMs

• inputs: 𝑥_,… , 𝑥_𝑇 (in NLP word embeddings)

• initial state ℎ₀= 0, a result of previous computation, trainable parameter

• recurrent computation:

ℎ_𝑡 = 𝐴(ℎ_𝑡−1, 𝑥_𝑡)

(27)

RNN as Imperative Code

def rnn(initial_state, inputs):

prev_state = initial_state for x in inputs:

new_state, output = rnn_cell(x, prev_state) prev_state = new_state

yield output

(28)

RNN in Unrolled in Time

Image source: http://colah.github.io/posts/2015-08-Understanding-LSTMs

At training time unrolled in time: technically just a very deep network

(29)

Vanilla RNN

ℎ_𝑡 = tanh (𝑊 [ℎ_𝑡−1; 𝑥_𝑡] + 𝑏)

• cannot propagate long-distance relations

• vanishing gradient problem

(30)

Vanishing Gradient Problem (1)

tanh 𝑥 = 1 − 𝑒^−2𝑥 1 + 𝑒^−2𝑥

-1.0 -0.5 0.0 0.5 1.0

−6 −4 −2 0 2 4 6

𝑦

𝑥

dtanh 𝑥

d𝑥 = 1 − tanh²𝑥 ∈ (0, 1]

0.0 0.2 0.4 0.6 0.8 1.0

−6 −4 −2 0 2 4 6

𝑦

𝑥 Weight initialized ∼ 𝒩(0, 1) to have gradients further from zero.

(31)

Vanishing Gradient Problem (2)

∂𝐸_𝑡+1

∂𝑏 = ∂𝐸_𝑡+1

∂ℎ_𝑡+1 ⋅∂ℎ_𝑡+1

∂𝑏 (chain rule)

(32)

Vanishing Gradient Problem (3)

∂ℎ_𝑡

∂𝑏 = ∂ tanh

=𝑧_𝑡(activation)

⏞⏞⏞⏞⏞⏞⏞⏞⏞(𝑊_ℎℎ_𝑡−1+ 𝑊_𝑥𝑥_𝑡+ 𝑏)

∂𝑏 ^(tanh

′is derivative of tanh)

= tanh^′(𝑧_𝑡) ⋅⎛⎜⎜

⎝

∂𝑊_ℎℎ_𝑡−1

∂𝑏 + ∂𝑊_𝑥𝑥_𝑡

⏟∂𝑏

=0

+∂𝑏

∂𝑏⏟

=1

⎞⎟

⎟

⎠

= 𝑊⏟_ℎ

∼𝒩(0,1)

tanh⏟^′(𝑧_𝑡)

∈(0;1]

∂ℎ_𝑡−1

∂𝑏 + tanh^′(𝑧_𝑡)

(33)

Long Short-Term Memory Networks

LSTM = Long short-term memory

Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997. ISSN 0899-7667

Control the gradient flow by explicitly gating:

1.

What to use from

the input.

2.

What to reuse from

the hidden state.

3.

What to put on the output.

(34)

LSTM: Hidden State

• two types of hidden states

• ℎ_𝑡 — “public” hidden state, used an output

• 𝑐_𝑡 — “private” memory, no non-linearities on the way

• direct flow of gradients (without multiplying by ≤ 1 derivatives)

(35)

LSTM: Forget Gate

𝑓_𝑡= 𝜎 (𝑊_𝑓[ℎ_𝑡−1; 𝑥_𝑡] + 𝑏_𝑓) Reminder sigmoid function:

0.0 0.2 0.4 0.6 0.8 1.0

−10 −5 0 5 10

𝑦

𝑥

• based on input and previous state, decide what to forget from the memory

(36)

LSTM: Input Gate

Image source:

𝑖_𝑡 = 𝜎 (𝑊_𝑖⋅ [ℎ_𝑡−1; 𝑥_𝑡] + 𝑏_𝑖)

̃𝐶_𝑡 = tanh (𝑊_𝑐⋅ [ℎ_𝑡−1; 𝑥_𝑡] + 𝑏_𝐶)

• 𝐶 — candidate what may want to add to the memorỹ

• 𝑖_𝑡 — decide how much of the information we want to store

(37)

LSTM: Cell State Update

Image source:

𝐶_𝑡= 𝑓_𝑡⊙ 𝐶_𝑡−1+ 𝑖_𝑡⊙𝐶_𝑡̃

(38)

LSTM: Output Gate

Image source:

𝑜_𝑡= 𝜎 (𝑊_𝑜⋅ [ℎ_𝑡−1; 𝑥_𝑡] + 𝑏_𝑜) ℎ_𝑡= 𝑜_𝑡⊙ tanh 𝐶_𝑡

(39)

Here we are, LSTM!

𝑓

_𝑡

= 𝜎 (𝑊

_𝑓

[ℎ

_𝑡−1

; 𝑥

_𝑡

] + 𝑏

_𝑓

) 𝑖

_𝑡

= 𝜎 (𝑊

_𝑖

⋅ [ℎ

_𝑡−1

; 𝑥

_𝑡

] + 𝑏

_𝑖

) 𝑜

_𝑡

= 𝜎 (𝑊

_𝑜

⋅ [ℎ

_𝑡−1

; 𝑥

_𝑡

] + 𝑏

_𝑜

)

̃ 𝐶

_𝑡

= tanh (𝑊

_𝑐

⋅ [ℎ

_𝑡−1

; 𝑥

_𝑡

] + 𝑏

_𝐶

) 𝐶

_𝑡

= 𝑓

_𝑡

⊙ 𝐶

_𝑡−1

+ 𝑖

_𝑡

⊙ 𝐶

_𝑡

̃

ℎ

_𝑡

= 𝑜

_𝑡

⊙ tanh 𝐶

_𝑡

(40)

Gated Recurrent Units

update gate 𝑧_𝑡 = 𝜎(𝑥_𝑡𝑊_𝑧+ ℎ_𝑡−1𝑈_𝑧+ 𝑏_𝑧) ∈ (0, 1) remember gate 𝑟_𝑡= 𝜎(𝑥_𝑡𝑊_𝑟+ ℎ_𝑡−1𝑈_𝑟+ 𝑏_𝑟) ∈ (0, 1)

candidate hidden state ℎ_𝑡̃ = tanh (𝑥_𝑡𝑊_ℎ+ (𝑟_𝑡⊙ ℎ_𝑡−1)𝑈_ℎ) ∈ (−1, 1) hidden state ℎ_𝑡= (1 − 𝑧_𝑡) ⊙ ℎ_𝑡−1+ 𝑧_𝑡⋅ ̃ℎ_𝑡

Kyunghyun Cho, Bart van Merrienboer, Dzmitry Bahdanau, and Yoshua Bengio. On the properties of neural machine translation: Encoder–decoder approaches.

In Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, pages 103–111, Doha, Qatar, October 2014.

Association for Computational Linguistics

(41)

LSTM vs. GRU

• GRU is smaller and therefore faster

• performance similar, task dependent

• theoretical limitation: GRU accepts regular languages, LSTM can simulate counter machine

Junyoung Chung, Çaglar Gülçehre, Kyunghyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. CoRR, abs/1412.3555, 2014. ISSN 2331-8422;

(42)

RNN in PyTorch

rnn = nn.LSTM(input_dim, hidden_dim=512, num_layers=1, bidirectional=True, dropout=0.1)

output, (hidden, cell) = self.rnn(x)

https://pytorch.org/docs/stable/nn.html?highlight=lstm#torch.nn.LSTM

(43)

Bidirectional Networks

• simple trick to improve performance

• run one RNN forward, second one backward and concatenate outputs

Image from: http://colah.github.io/posts/2015-09-NN-Types-FP/

• state of the art in tagging, crucial for neural machine translation

(44)

Convolutional Networks

(45)

Convolutional Networks: Tree-Structure metaphor

Processing input windows in a hierarchical fashion Sort of like a hierarchy in tree-like data structures

(46)

1-D Convolution

≈ sliding window over the sequence

embeddings x = (𝑥₁, … , 𝑥_𝑁)

𝑥₀= ⃗0 𝑥_𝑁 = ⃗0

ℎ₁= 𝑓 (𝑊 [𝑥₀; 𝑥₁; 𝑥₂] + 𝑏)

ℎ_𝑖= 𝑓 (𝑊 [𝑥_𝑖−1; 𝑥_𝑖; 𝑥_𝑖+1] + 𝑏)

pad with 0s if we want to keep sequence length

(47)

1-D Convolution: Pseudocode

xs = ... # input sequnce

kernel_size = 3 # window size filters = 300 # output dimensions strides=1 # step size

W = trained_parameter(xs.shape[2] * kernel_size, filters) b = trained_parameter(filters)

window = kernel_size // 2

outputs = []

for i in range(window, xs.shape[1] - window):

h = np.mul(W, xs[i - window:i + window]) + b outputs.append(h)

return np.array(h)

(48)

1-D Convolution in Pytorch

PyTorch

conv = nn.Conv1d(in_channels, out_channels=300, kernel_size=3, stride=1, padding=0, dilation=1, groups=1, bias=True)

h = conv(x)

https://pytorch.org/docs/stable/nn.html#torch.nn.Conv1d

(49)

Rectified Linear Units

ReLU:

0.0 1.0 2.0 3.0 4.0 5.0 6.0

−6 −4 −2 0 2 4 6

𝑦

𝑥

Derivative of ReLU:

0.0 0.2 0.4 0.6 0.8 1.0

−6 −4 −2 0 2 4 6

𝑦

𝑥

faster, suffer less with vanishing gradient

Vinod Nair and Geoffrey E Hinton. Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning, pages 807–814, Haifa, Israel, June 2010. JMLR.org

(50)

Residual Connections

𝑥₀= ⃗0 𝑥_𝑁 = ⃗0

⊕ ⊕ ⊕ ⊕ ⊕ ⊕ ⊕ ⊕ ⊕ ⊕ ⊕ ⊕

ℎ_𝑖= 𝑓 (𝑊 [𝑥_𝑖−1; 𝑥_𝑖; 𝑥_𝑖+1] + 𝑏) + 𝑥_𝑖

Allows training deeper networks.

Better gradient flow – the same as in RNNs.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, Las Vegas, NV, USA, June 2016. IEEE Computer Society. ISBN 9781467388511

(51)

Residual Connections: Numerical Stability

Numerically unstable, we need activation to be in similar scale ⇒ layer normalization.

Activation before non-linearity is normalized:

𝑎_𝑖= 𝑔_𝑖

𝜎_𝑖(𝑎_𝑖− 𝜇_𝑖)

…𝑔 is a trainable parameter, 𝜇, 𝜎 estimated from data.

𝜇 = 1 𝐻

𝐻

∑

𝑖=1

𝑎_𝑖

𝜎 =

√√

√

⎷ 1 𝐻

𝐻

∑

𝑖=1

(𝑎_𝑖− 𝜇)²

Lei Jimmy Ba, Ryan Kiros, and Geoffrey E Hinton. Layer normalization. CoRR, abs/1607.06450, 2016. ISSN 2331-8422

(52)

Receptive Field

𝑥₀= ⃗0 𝑥_𝑁 = ⃗0

(53)

Convolutional architectures

+

• extremely computationally eﬀicient

–

• limited context

• by default no aware of 𝑛-gram order

Currently mostly used in speech processing and character-level architectures

(54)

Self-attentive Networks

(55)

Self-attentive Networks: Complete graph metaphor

RNN = information pipeline

CNN = information in tree-like data structure

1 1 1

2 2 2

3 3 3

4 4 4

5 5 5

6 6 6

7 7 7

8 8 8

Self-attentive = Transformers =

Information flow in a weighted complete bipartite graph

(56)

Self-attentive Networks

• In some layers: states are linear combination of previous layer states

• Originally for the Transformer model for machine translation

⇐ attention weights = similarity matrix between all pairs of states

• 𝑂(𝑛²) memory, 𝑂(1) time (when paralelized)

• next layer: sum by rows

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems 30, pages 6000–6010, Long Beach, CA, USA, December 2017. Curran Associates, Inc

(57)

Multi-head scaled dot-product attention

Scaled dot-production attention

𝑄 = (𝑞₁, … , 𝑞_𝑛): queries, 𝐾: keys, 𝑉 : values

Attn(𝑄, 𝐾, 𝑉 ) = softmax

similarity matrix

⏞⏞⏞⏞⏞

(𝑄𝐾^⊤

√𝑑 ) 𝑉 Multi-head setup

Multihead(𝑄, 𝑉 ) =

concatenate head outputs

⏞⏞⏞⏞⏞⏞⏞

(𝐻₁⊕ ⋯ ⊕ 𝐻_ℎ) 𝑊^𝑂 𝐻_𝑖= Attn(𝑄𝑊_𝑖^𝑄, 𝑉 𝑊_𝑖^𝐾, 𝑉 𝑊_𝑖^𝑉)

𝑊_𝑖^𝑄, 𝑊_𝑖^𝐾, 𝑊_𝑖𝑉 head-specific projections keys & values =^? queries linear 𝑊^𝐾 linear 𝑊^𝑉 linear 𝑊^𝑄

split split split

concat linear 𝑊^𝑂

scaled dot-product attention scaled dot-product attention scaled dot-product attention scaled dot-product attention scaled dot-product attention

(58)

Dot-Product Attention in PyTorch

def attention(query, key, value, mask=None):

d_k = query.size(-1)

scores = torch.matmul(query, key.transpose(-2, -1)) \ / math.sqrt(d_k)

p_attn = F.softmax(scores, dim = -1) return torch.matmul(p_attn, value), p_attn

(59)

Stacking self-attentive Layers

input embeddings position ⊕

encoding

self-attentivesublayer

multihead attention

keys &

values queries

⊕ layer normalization

feed-forwardsublayer

non-linear layer linear layer

𝑁 ×

• several layers (original paper 6)

• each layer: 2 sub-layers: self-attention and feed-forward layer

• everything inter-connected with residual connections

(60)

Position Encoding

Model is not aware of the position in the sequence.

pos(𝑖) =

⎧{

⎨{

⎩

sin (₁₀^𝑡4 𝑖

𝑑) , if 𝑖 mod 2 = 0 cos (₁₀^𝑡4

𝑖−1

𝑑 ) , otherwise

0 20 40 60 80

Text length 0

100

200

300

Dimension

−0.5 0.0 0.5 1.0

(61)

Architectures Comparison

Recurrent Convolutional

1 1 1

2 2 2

3 3 3

4 4 4

5 5 5

6 6 6

7 7 7

8 8 8

Self-attentive

computation sequential operations memory

Recurrent 𝑂(𝑛 ⋅ 𝑑²) 𝑂(𝑛) 𝑂(𝑛 ⋅ 𝑑)

Convolutional 𝑂(𝑘 ⋅ 𝑛 ⋅ 𝑑²) 𝑂(1) 𝑂(𝑛 ⋅ 𝑑) Self-attentive 𝑂(𝑛²⋅ 𝑑) 𝑂(1) 𝑂(𝑛²⋅ 𝑑)

𝑑 model dimension, 𝑛 sequence length, 𝑘 convolutional kernel

(62)

Classification and Labeling

(63)

Classification and Labeling

(64)

Architecture Overview

1.

Embed input words.

Get a sequence of

continuous vectors

2.

Contextualize input.

Apply a sequence processing architecture and get contextual representation.

3.

Get some output.

Typically classification or labeling.

(65)

Text classification

Tasks such as: sentiment analysis, genre classification, fake news detection, spam detection.

RNN/CNN/Transformer Embeddings

Encoder Hidden states

(ℎ₁, … , ℎ_𝑛)

get single

vector classifier

Classifier = feedforward neural net with softmax at the end

(66)

Getting a single vector: Two options

1. ^Pooling

• Squeeze the sequence into a vector

• Usable directly for embeddings and also encoder output

• Mean pooling: _𝑛¹∑ ℎ_𝑖

• Max pooling (intuitively existential quantifier): take max in each dimension

2. Choose one state

• In RNN the last state, i.e., after reading the whohle input

• In Transformers any (typically the first) state, self-attention will learn to move relevant information there

(67)

Softmax & Cross-Entropy

Output layer with softmax (with parameters 𝑊 , 𝑏) — gets categorical distribution:

𝑃_𝑦= softmax(x) = P(𝑦 = 𝑗 ∣ x) = exp x^⊤𝑊 + 𝑏

∑ exp x^⊤𝑊 + 𝑏

Network error = cross-entropy between estimated distribution and one-hot ground-truth distribution 𝑇 = 1(𝑦^∗) = (0, 0, … , 1, 0, … , 0):

𝐿(𝑃_𝑦, 𝑦^∗) = 𝐻(𝑃 , 𝑇 ) = −𝔼_𝑖∼𝑇log 𝑃 (𝑖)

= − ∑

𝑖

𝑇 (𝑖) log 𝑃 (𝑖)

= − log 𝑃 (𝑦^∗)

(68)

Derivative of Cross-Entropy

Let 𝑙 = x^⊤𝑊 + 𝑏, 𝑙_𝑦∗ corresponds to the correct one.

∂𝐿(𝑃_𝑦, 𝑦^∗)

∂𝑙 = −∂

∂𝑙log exp 𝑙_𝑦∗

∑_𝑗exp 𝑙_𝑗 = −∂

∂𝑙𝑙_𝑦∗ − log ∑ exp 𝑙

= 1_𝑦∗ + ∂

∂𝑙 − log ∑ exp 𝑙 = 1_𝑦∗− ∑ 1_𝑦∗exp 𝑙

∑ exp 𝑙 =

= 1_𝑦∗ − 𝑃_𝑦(𝑦^∗)

0 1

Interpretation: Reinforce the correct logit, suppress the rest.

(69)

Sequence Labeling

• assign value / probability distribution to every token in a sequence

• morphological tagging, named-entity recognition, LM with unlimited history, answer span selection

classifier classifier classifier classifier classifier classifier

RNN/CNN/Transformer Embeddings

Encoder Hidden states The same classifier

at every state

(70)

Pre-training Representations

(71)

Pre-training Representations

(72)

Pre-trained Representations

• representations that emerge in models seem to carry a lot of general information about the language

• representations pre-trained on large data can be re-used on tasks with smaller training data

(73)

Pre-training Representations

Word2Vec & FastText

(74)

Word2Vec

• way to learn word embeddings without training the complete LM

CBOW Skip-gram

∑ 𝑤₃

𝑤₁ 𝑤₂

𝑤₄ 𝑤₅

⋮

𝑤₃

𝑤₁ 𝑤₂

𝑤₄ 𝑤₅

⋮

• CBOW: minimize cross-entropy of the middle word of a sliding windows

• skip-gram: minimize cross-entropy of a bag of words around a word (LM other way round)

Tomáš Mikolov, Wen-tau Yih, and Geoffrey Zweig. Linguistic regularities in continuous space word representations. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 746–751, Atlanta, GA, USA, June 2013.

Association for Computational Linguistics

(75)

Word2Vec: sampling

1. All human beings are born free and equal in dignity … → (All, humans) (All, beings) 2. All human beings are born free and equal in dignity … →

(human, All) (human, beings)

(human, are) 3. All human beings are born free and equal in dignity … →

(beings, All) (beings, human)

(beings, are) (beings, born) 4. All human beings are born free and equal in dignity … →

(are, human) (are, beings) (are, born)

(are, free)

(76)

Word2Vec: Formulas

• Training objective:

1 𝑇

𝑇

∑

𝑡=1

∑

𝑗∼(−𝑐,𝑐)

log 𝑝(𝑤_𝑡+𝑐|𝑤_𝑡)

• Probability estimation:

𝑝(𝑤_𝑂|𝑤_𝐼) = exp (𝑉^′⊤_𝑤

𝑂𝑉_𝑤

𝐼)

∑_𝑤exp (𝑉^′⊤_𝑤𝑉_𝑤

𝑖) where 𝑉 is input (embedding) matrix, 𝑉^′ output matrix

Equations 1, 2. Tomáš Mikolov, Wen-tau Yih, and Geoffrey Zweig. Linguistic regularities in continuous space word representations. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 746–751, Atlanta, GA, USA, June 2013. Association for Computational Linguistics

(77)

Word2Vec: Training using Negative Sampling

The summation in denominator is slow, use noise contrastive estimation:

log 𝜎 (𝑉^′⊤_𝑤

𝑂𝑉_𝑤

𝐼) +

𝑘

∑

𝑖=1

𝐸_𝑤

𝑖∼𝑃_𝑛(𝑤)[log 𝜎 (−𝑉^′⊤_𝑤

𝑖𝑉_𝑤

𝐼)]

Main idea: classify independently by logistic regression the positive and few sampled negative examples.

Equations 1, 3. Tomáš Mikolov, Wen-tau Yih, and Geoffrey Zweig. Linguistic regularities in continuous space word representations. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 746–751, Atlanta, GA,

USA, June 2013. Association for Computational Linguistics

(78)

Word2Vec: Vector Arithmetics

man

woman

uncle

aunt

king

queen

kings

queens

king

queen

Image originally from Tomáš Mikolov, Wen-tau Yih, and Geoffrey Zweig. Linguistic regularities in continuous space word representations. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 746–751, Atlanta, GA, USA, June 2013. Association for Computational Linguistics

(79)

FastText

State-of-the-art pre-trained word embeddings

• Word2Vec training treats words as independent entities

• There are regularities in how words look like — it is called morphology

FastText approaches this

1. Represent each word as a bag of character 𝑛-grams

2. Keep a table of character 𝑛-grams embeddings instead of words 3. Word embedding = average of character 𝑛-gram embeeddings

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146, 2017. doi: 10.1162/tacl_a_00051. URL https://www.aclweb.org/anthology/Q17-1010

(80)

Training models

FastText — model implementation by Facebook https://github.com/facebookresearch/fastText Just throw in raw text.

./fasttext skipgram -input data.txt -output model

The tool allows training simple classifiers using the embeddings.

(81)

Pre-training Representations

BERT

(82)

What is BERT

Bidirectional Encoder Representations

from Transformers • pre-trained representation capturing context

contextual embeddings

• Transformer with a masked language model objective

• originally by Google, published in November 2018

• since then better version: RoBERTa by Facebook, language-specific variants, multilingual versions

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. doi: 10.18653/v1/N19-1423. URL https://www.aclweb.org/anthology/N19-1423

(83)

Achitecture

input embeddings position ⊕

encoding

self-attentivesublayer

multihead attention

keys &

values queries

feed-forwardsublayer

non-linear layer linear layer

𝑁 ×

• a stack of 12 Transformer layers

• trained as sequence labeling: change some input token, labeler guesses original tokens

masked language modeling

• being able to predict missing words: proxy for language understanding

When trained, throw away labeling, use last layer as the representation.

(84)

Masked Language Model

All human being are born free free MASK hairy free and equal in dignity and rights

1. Randomly sample a word → free

2. With 80% change replace with special MASK token.

3. With 10% change replace with random token → hairy 4. With 10% change keep as is → free

Then a classifier should predict the missing/replaced word free

(85)

Pre-train and finetune paradigm

BERT encoder 1. Pre-training

(do once, large data, long time) unlabeled

datal

masked LM

2. Finetuning

(small data, fast) tasks-pecific

data

task-specific objective

(86)

Implementation: Huggingface Transformers

https://github.com/huggingface/transformers

Implements most existign pre-trained BERT-like models for both PyTorch and TensorFlow.

(87)

Deep Learning for Natural Language Processing

Summary

1. Discrete symbols → continuous representation with trained embeddings

2. Architectures to get suitable representation: recurrent, convolutional, self-attentive

3. Output: classification, sequence labeling

4. Representations pretrained on large data helps on downstream tasks