Deep Learning for Natural Language Processing
Jindřich Libovický
April 29, 2021
NPFL124 Introduction to Natural Language Processing
Charles University
Faculty of Mathematics and Physics
Institute of Formal and Applied Linguistics unless otherwise stated
Outline
Neural Networks Basics Representing Words Representing Sequences
Recurrent Networks Convolutional Networks Self-attentive Networks Classification and Labeling Pre-training Representations
Word2Vec & FastText BERT
Neural Networks Basics Representing Words Representing Sequences Classification and Labeling Pre-training Representations 1/ 76
Deep Learning in NLP
• NLP tasks learn end-to-end using deep learning — the number-one approach in current research
• State of the art in POS tagging, parsing, named-entity recognition, machine translation,
…
, Good news: training without almost any linguistic insight
although it is not always a good idea
/ Bad news: requires enormous amount of training data and really big computational resources
although that changes with pre-trained models
Neural Networks Basics Representing Words Representing Sequences Classification and Labeling Pre-training Representations 2/ 76
What is deep learning?
• Buzzword for machine learning using neural networks with many layers using back-propagation
• Learning of a real-valued function with millions of parameters that solves a particular problem
• Learning more and more abstract representation of the input data until we reach such a suitable representation for our problem
Neural Networks Basics Representing Words Representing Sequences Classification and Labeling Pre-training Representations 3/ 76
Neural Networks Basics
Neural Networks Basics
Neural Networks Basics Representing Words Representing Sequences
Recurrent Networks Convolutional Networks Self-attentive Networks Classification and Labeling Pre-training Representations
Word2Vec & FastText BERT
Neural Networks Basics Representing Words Representing Sequences Classification and Labeling Pre-training Representations 4/ 76
Single Neuron (Perceptron)
activationfunction
input x weights w output
∑ x⊤w > 0? 𝑦 𝑥𝑖
⋅𝑤𝑖 𝑥1
⋅𝑤1 𝑥2 ⋅𝑤2
⋮
⋮ 𝑥𝑛
⋅𝑤𝑛
The old view:
a network of artificial neurons
• simplistic model of a neuron from the 1940’s
• a neuron has some (weighted) inputs, when the input is high enough, it fires a signal
• focus on single neurons does not allow thinking about layers as vector
representations
Neural Networks Basics Representing Words Representing Sequences Classification and Labeling Pre-training Representations 5/ 76
Neural Network
The current view: a network of layers
input vector •••••••••••• 𝑥
↓ ↑ ↓
𝑛 hidden layers
•••••••••••••••• ℎ1= 𝑓(𝑊1𝑥 + 𝑏1)
↓ ↑ ↓ ↑
•••••••••••••••• ℎ2= 𝑓(𝑊2ℎ1+ 𝑏2)
↓ ↑ ↓ ↑
⋮ ⋮ ⋮
↓ ↑ ↓ ↑
•••••••••••••••• ℎ𝑛= 𝑓(𝑊𝑛ℎ𝑛−1+ 𝑏𝑛)
↓ ↑ ↓ ↑
output vector ••• 𝑜 = 𝑔(𝑊𝑜ℎ𝑛+ 𝑏𝑜) ∂𝑊∂𝐸
𝑜 = ∂𝐸∂𝑜 ⋅∂𝑊∂𝑜
𝑜
↓ ↓ ↑
error • 𝐸 = 𝑒(𝑜, 𝑡) → ∂𝐸∂𝑜
Neural Networks Basics Representing Words Representing Sequences Classification and Labeling Pre-training Representations 6/ 76
Implementation: Computation graph
Logistic regression:
𝑦 = 𝜎 (𝑊 𝑥 + 𝑏) (1)
Computation graph:
𝑥 𝑊
× 𝑏
+ ℎ 𝜎
forward graph
loss 𝑦∗
𝑜 𝑜′ 𝜎′
+
𝑏′ ℎ′
×
𝑊′
backward graph
Neural Networks Basics Representing Words Representing Sequences Classification and Labeling Pre-training Representations 7/ 76
Frameworks for Deep Learning
research and prototyping in Python
• graph statically constructed, symbolic computation
• computation happens in a session
• allows graph export and running as a binary
• computations written dynamically as normal procedural code
• easy debugging: inspecting variables at any time of the computation
Neural Networks Basics Representing Words Representing Sequences Classification and Labeling Pre-training Representations 8/ 76
Representing Words
Representing Words
Neural Networks Basics Representing Words Representing Sequences
Recurrent Networks Convolutional Networks Self-attentive Networks Classification and Labeling Pre-training Representations
Word2Vec & FastText BERT
Neural Networks Basics Representing Words Representing Sequences Classification and Labeling Pre-training Representations 9/ 76
The problem of representing words
Problem:
Words (and characters)
are discrete × Inputs to neural nets must be continuous
Spoiler: The solution is called embeddings Let’s discuss it on the problem of language modeling
Neural Networks Basics Representing Words Representing Sequences Classification and Labeling Pre-training Representations 10/ 76
Language Modeling
• estimate probability of a next word in a text
P(𝑤𝑖|𝑤𝑖−1, 𝑤𝑖−2, … , 𝑤1)
• standard approach: 𝑛-gram models with Markov assumption
≈ P(𝑤𝑖|𝑤𝑖−1, 𝑤𝑖−2, … , 𝑤𝑖−𝑛) ≈
𝑛
∑
𝑗=0
𝜆𝑗 𝑐(𝑤𝑖|𝑤𝑖−1, … , 𝑤𝑖−𝑗) 𝑐(𝑤𝑖|𝑤𝑖−1, … , 𝑤𝑖−𝑗+1)
• Let’s simulate it with a neural network:
… ≈ 𝐹 (𝑤𝑖−1, … , 𝑤𝑖−𝑛|𝜃) 𝜃 is a set of trainable parameters.
Neural Networks Basics Representing Words Representing Sequences Classification and Labeling Pre-training Representations 11/ 76
Simple Neural Language Model
1𝑤
𝑛−3
⋅𝑊𝑒
1𝑤
𝑛−2
⋅𝑊𝑒
1𝑤
𝑛−1
⋅𝑊𝑒 tanh
⋅𝑉3 ⋅𝑉2 ⋅𝑉1 + 𝑏ℎ
softmax
⋅𝑊 + 𝑏 P(𝑤𝑛|𝑤𝑛−1, 𝑤𝑛−2, 𝑤𝑛−3)
Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. A neural probabilistic language model. The Journal of Machine Learning Research, 3 (Feb):1137–1155, 2003. ISSN 1532-4435
Neural Networks Basics Representing Words Representing Sequences Classification and Labeling Pre-training Representations 12/ 76
Neural LM: Word Representation
• limited vocabulary (hundred thousands words): indexed set of words
• words are initially represented as one-hot-vectors 1𝑤= (0, … , 0, 1, 0, … 0)
• projection 1𝑤⋅ 𝑉 corresponds to selecting one row from matrix 𝑉
• 𝑉 : is a table of learned word vector representations
so-calledword embeddings
• dimension typically 100 — 300
The first hidden layer is then:
ℎ1= 𝑉𝑤
𝑖−𝑛⊕ 𝑉𝑤
𝑖−𝑛+1⊕ … ⊕ 𝑉𝑤
𝑖−1
Matrix 𝑉 is shared for all words.
Neural Networks Basics Representing Words Representing Sequences Classification and Labeling Pre-training Representations 13/ 76
Neural LM: Next Word Estimation
• optionally add extra hidden layer:
ℎ2= 𝑓(ℎ1𝑊1+ 𝑏1)
• last layer: probability distribution over vocabulary
𝑦 = softmax(ℎ2𝑊2+ 𝑏2) = exp(ℎ2𝑊2+ 𝑏2)
∑ exp(ℎ2𝑊2+ 𝑏2)
• training objective: cross-entropy between the true (i.e., one-hot) distribution and estimated distribution
𝐸 = − ∑
𝑖
𝑝true(𝑤𝑖) log 𝑦(𝑤𝑖) = ∑
𝑖
− log 𝑦(𝑤𝑖)
• learned by error back-propagation
Neural Networks Basics Representing Words Representing Sequences Classification and Labeling Pre-training Representations 14/ 76
Learned Representations
• word embeddings from LMs have interesting properties
• cluster according to POS & meaning similarity
Table taken from Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. Natural language processing (almost) from scratch. The Journal of Machine Learning Research, 12(Aug):2493–2537, 2011. ISSN 1533-7928
• in IR: query expansion by nearest neighbors
• in deep learning models: embeddings initialization speeds up training / allows complex model with less data
Neural Networks Basics Representing Words Representing Sequences Classification and Labeling Pre-training Representations 15/ 76
Implementation in PyTorch I
import torch
import torch.nn as nn
class LanguageModel(nn.Module):
def __init__(self, vocab_size, embedding_dim, hidden_dim):
super().__init__()
self.embedding = nn.Embedding(vocab_size, embedding_dim) self.hidden_layer = nn.Linear(3 * embedding_dim, hidden_dim) self.output_layer = nn.Linear(hidden_dim, vocab_size)
self.loss_function = nn.CrossEntropyLoss()
def forward(self, word_1, word_2, word_3, target=None):
embedded_1 = self.embedding(word_1) embedded_2 = self.embedding(word_2) embedded_3 = self.embedding(word_3)
Neural Networks Basics Representing Words Representing Sequences Classification and Labeling Pre-training Representations 16/ 76
Implementation in PyTorch II
hidden = torch.tanh(self.hidden_layer(
torch.cat(embedded_1, embedded_2, embedded_3))) logits = self.output_layer(hidden)
loss = None
if target is not None:
loss = self.loss_function(logits, targets) return logits, loss
Neural Networks Basics Representing Words Representing Sequences Classification and Labeling Pre-training Representations 17/ 76
Representing Sequences
Representing Sequences
Neural Networks Basics Representing Words Representing Sequences
Recurrent Networks Convolutional Networks Self-attentive Networks Classification and Labeling Pre-training Representations
Word2Vec & FastText BERT
Neural Networks Basics Representing Words Representing Sequences Classification and Labeling Pre-training Representations 18/ 76
Contextual representation
Word embeddings represent words in isolation.
Meaning is context-dependent —
We need an encoder providing contextual representation
Neural Networks Basics Representing Words Representing Sequences Classification and Labeling Pre-training Representations 19/ 76
Representing Sequences
Recurrent Networks
Recurrent Neural Networks: Pipe Metaphor
RNN = pipeline for information
In every step some information goes in and some information goes out.
Neural Networks Basics Representing Words Representing Sequences Classification and Labeling Pre-training Representations 20/ 76
Recurrent Networks (RNNs)
Image source:
http://colah.github.io/posts/2015-08-Understanding-LSTMs
• inputs: 𝑥,… , 𝑥𝑇 (in NLP word embeddings)
• initial state ℎ0= 0, a result of previous computation, trainable parameter
• recurrent computation:
ℎ𝑡 = 𝐴(ℎ𝑡−1, 𝑥𝑡)
Neural Networks Basics Representing Words Representing Sequences Classification and Labeling Pre-training Representations 21/ 76
RNN as Imperative Code
def rnn(initial_state, inputs):
prev_state = initial_state for x in inputs:
new_state, output = rnn_cell(x, prev_state) prev_state = new_state
yield output
Neural Networks Basics Representing Words Representing Sequences Classification and Labeling Pre-training Representations 22/ 76
RNN in Unrolled in Time
Image source: http://colah.github.io/posts/2015-08-Understanding-LSTMs
At training time unrolled in time: technically just a very deep network
Neural Networks Basics Representing Words Representing Sequences Classification and Labeling Pre-training Representations 23/ 76
Vanilla RNN
ℎ𝑡 = tanh (𝑊 [ℎ𝑡−1; 𝑥𝑡] + 𝑏)
• cannot propagate long-distance relations
• vanishing gradient problem
Image source: http://colah.github.io/posts/2015-08-Understanding-LSTMs
Neural Networks Basics Representing Words Representing Sequences Classification and Labeling Pre-training Representations 24/ 76
Vanishing Gradient Problem (1)
tanh 𝑥 = 1 − 𝑒−2𝑥 1 + 𝑒−2𝑥
-1.0 -0.5 0.0 0.5 1.0
−6 −4 −2 0 2 4 6
𝑦
𝑥
dtanh 𝑥
d𝑥 = 1 − tanh2𝑥 ∈ (0, 1]
0.0 0.2 0.4 0.6 0.8 1.0
−6 −4 −2 0 2 4 6
𝑦
𝑥 Weight initialized ∼ 𝒩(0, 1) to have gradients further from zero.
Neural Networks Basics Representing Words Representing Sequences Classification and Labeling Pre-training Representations 25/ 76
Vanishing Gradient Problem (2)
∂𝐸𝑡+1
∂𝑏 = ∂𝐸𝑡+1
∂ℎ𝑡+1 ⋅∂ℎ𝑡+1
∂𝑏 (chain rule)
Neural Networks Basics Representing Words Representing Sequences Classification and Labeling Pre-training Representations 26/ 76
Vanishing Gradient Problem (3)
∂ℎ𝑡
∂𝑏 = ∂ tanh
=𝑧𝑡(activation)
⏞⏞⏞⏞⏞⏞⏞⏞⏞(𝑊ℎℎ𝑡−1+ 𝑊𝑥𝑥𝑡+ 𝑏)
∂𝑏 (tanh
′is derivative of tanh)
= tanh′(𝑧𝑡) ⋅⎛⎜⎜
⎝
∂𝑊ℎℎ𝑡−1
∂𝑏 + ∂𝑊𝑥𝑥𝑡
⏟∂𝑏
=0
+∂𝑏
∂𝑏⏟
=1
⎞⎟
⎟
⎠
= 𝑊⏟ℎ
∼𝒩(0,1)
tanh⏟′(𝑧𝑡)
∈(0;1]
∂ℎ𝑡−1
∂𝑏 + tanh′(𝑧𝑡)
Neural Networks Basics Representing Words Representing Sequences Classification and Labeling Pre-training Representations 27/ 76
Long Short-Term Memory Networks
LSTM = Long short-term memory
Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997. ISSN 0899-7667
Image source: http://colah.github.io/posts/2015-08-Understanding-LSTMs
Control the gradient flow by explicitly gating:
1.
What to use fromthe input.
2.
What to reuse fromthe hidden state.
3.
What to put on the output.Neural Networks Basics Representing Words Representing Sequences Classification and Labeling Pre-training Representations 28/ 76
LSTM: Hidden State
• two types of hidden states
• ℎ𝑡 — “public” hidden state, used an output
• 𝑐𝑡 — “private” memory, no non-linearities on the way
• direct flow of gradients (without multiplying by ≤ 1 derivatives)
Image source: http://colah.github.io/posts/2015-08-Understanding-LSTMs
Neural Networks Basics Representing Words Representing Sequences Classification and Labeling Pre-training Representations 29/ 76
LSTM: Forget Gate
Image source: http://colah.github.io/posts/2015-08-Understanding-LSTMs
𝑓𝑡= 𝜎 (𝑊𝑓[ℎ𝑡−1; 𝑥𝑡] + 𝑏𝑓) Reminder sigmoid function:
0.0 0.2 0.4 0.6 0.8 1.0
−10 −5 0 5 10
𝑦
𝑥
• based on input and previous state, decide what to forget from the memory
Neural Networks Basics Representing Words Representing Sequences Classification and Labeling Pre-training Representations 30/ 76
LSTM: Input Gate
Image source:
http://colah.github.io/posts/2015-08-Understanding-LSTMs
𝑖𝑡 = 𝜎 (𝑊𝑖⋅ [ℎ𝑡−1; 𝑥𝑡] + 𝑏𝑖)
̃𝐶𝑡 = tanh (𝑊𝑐⋅ [ℎ𝑡−1; 𝑥𝑡] + 𝑏𝐶)
• 𝐶 — candidate what may want to add to the memorỹ
• 𝑖𝑡 — decide how much of the information we want to store
Neural Networks Basics Representing Words Representing Sequences Classification and Labeling Pre-training Representations 31/ 76
LSTM: Cell State Update
Image source:
http://colah.github.io/posts/2015-08-Understanding-LSTMs
𝐶𝑡= 𝑓𝑡⊙ 𝐶𝑡−1+ 𝑖𝑡⊙𝐶𝑡̃
Neural Networks Basics Representing Words Representing Sequences Classification and Labeling Pre-training Representations 32/ 76
LSTM: Output Gate
Image source:
http://colah.github.io/posts/2015-08-Understanding-LSTMs
𝑜𝑡= 𝜎 (𝑊𝑜⋅ [ℎ𝑡−1; 𝑥𝑡] + 𝑏𝑜) ℎ𝑡= 𝑜𝑡⊙ tanh 𝐶𝑡
Neural Networks Basics Representing Words Representing Sequences Classification and Labeling Pre-training Representations 33/ 76
Here we are, LSTM!
𝑓
𝑡= 𝜎 (𝑊
𝑓[ℎ
𝑡−1; 𝑥
𝑡] + 𝑏
𝑓) 𝑖
𝑡= 𝜎 (𝑊
𝑖⋅ [ℎ
𝑡−1; 𝑥
𝑡] + 𝑏
𝑖) 𝑜
𝑡= 𝜎 (𝑊
𝑜⋅ [ℎ
𝑡−1; 𝑥
𝑡] + 𝑏
𝑜)
̃ 𝐶
𝑡= tanh (𝑊
𝑐⋅ [ℎ
𝑡−1; 𝑥
𝑡] + 𝑏
𝐶) 𝐶
𝑡= 𝑓
𝑡⊙ 𝐶
𝑡−1+ 𝑖
𝑡⊙ 𝐶
𝑡̃
ℎ
𝑡= 𝑜
𝑡⊙ tanh 𝐶
𝑡Neural Networks Basics Representing Words Representing Sequences Classification and Labeling Pre-training Representations 34/ 76
Gated Recurrent Units
update gate 𝑧𝑡 = 𝜎(𝑥𝑡𝑊𝑧+ ℎ𝑡−1𝑈𝑧+ 𝑏𝑧) ∈ (0, 1) remember gate 𝑟𝑡= 𝜎(𝑥𝑡𝑊𝑟+ ℎ𝑡−1𝑈𝑟+ 𝑏𝑟) ∈ (0, 1)
candidate hidden state ℎ𝑡̃ = tanh (𝑥𝑡𝑊ℎ+ (𝑟𝑡⊙ ℎ𝑡−1)𝑈ℎ) ∈ (−1, 1) hidden state ℎ𝑡= (1 − 𝑧𝑡) ⊙ ℎ𝑡−1+ 𝑧𝑡⋅ ̃ℎ𝑡
Image source: http://colah.github.io/posts/2015-08-Understanding-LSTMs
Kyunghyun Cho, Bart van Merrienboer, Dzmitry Bahdanau, and Yoshua Bengio. On the properties of neural machine translation: Encoder–decoder approaches.
In Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, pages 103–111, Doha, Qatar, October 2014.
Association for Computational Linguistics
Neural Networks Basics Representing Words Representing Sequences Classification and Labeling Pre-training Representations 35/ 76
LSTM vs. GRU
• GRU is smaller and therefore faster
• performance similar, task dependent
• theoretical limitation: GRU accepts regular languages, LSTM can simulate counter machine
Junyoung Chung, Çaglar Gülçehre, Kyunghyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. CoRR, abs/1412.3555, 2014. ISSN 2331-8422;
Neural Networks Basics Representing Words Representing Sequences Classification and Labeling Pre-training Representations 36/ 76
RNN in PyTorch
rnn = nn.LSTM(input_dim, hidden_dim=512, num_layers=1, bidirectional=True, dropout=0.1)
output, (hidden, cell) = self.rnn(x)
https://pytorch.org/docs/stable/nn.html?highlight=lstm#torch.nn.LSTM
Neural Networks Basics Representing Words Representing Sequences Classification and Labeling Pre-training Representations 37/ 76
Bidirectional Networks
• simple trick to improve performance
• run one RNN forward, second one backward and concatenate outputs
Image from: http://colah.github.io/posts/2015-09-NN-Types-FP/
• state of the art in tagging, crucial for neural machine translation
Neural Networks Basics Representing Words Representing Sequences Classification and Labeling Pre-training Representations 38/ 76
Representing Sequences
Convolutional Networks
Convolutional Networks: Tree-Structure metaphor
Processing input windows in a hierarchical fashion Sort of like a hierarchy in tree-like data structures
Neural Networks Basics Representing Words Representing Sequences Classification and Labeling Pre-training Representations 39/ 76
1-D Convolution
≈ sliding window over the sequence
embeddings x = (𝑥1, … , 𝑥𝑁)
𝑥0= ⃗0 𝑥𝑁 = ⃗0
ℎ1= 𝑓 (𝑊 [𝑥0; 𝑥1; 𝑥2] + 𝑏)
ℎ𝑖= 𝑓 (𝑊 [𝑥𝑖−1; 𝑥𝑖; 𝑥𝑖+1] + 𝑏)
pad with 0s if we want to keep sequence length
Neural Networks Basics Representing Words Representing Sequences Classification and Labeling Pre-training Representations 40/ 76
1-D Convolution: Pseudocode
xs = ... # input sequnce
kernel_size = 3 # window size filters = 300 # output dimensions strides=1 # step size
W = trained_parameter(xs.shape[2] * kernel_size, filters) b = trained_parameter(filters)
window = kernel_size // 2
outputs = []
for i in range(window, xs.shape[1] - window):
h = np.mul(W, xs[i - window:i + window]) + b outputs.append(h)
return np.array(h)
Neural Networks Basics Representing Words Representing Sequences Classification and Labeling Pre-training Representations 41/ 76
1-D Convolution in Pytorch
PyTorch
conv = nn.Conv1d(in_channels, out_channels=300, kernel_size=3, stride=1, padding=0, dilation=1, groups=1, bias=True)
h = conv(x)
https://pytorch.org/docs/stable/nn.html#torch.nn.Conv1d
Neural Networks Basics Representing Words Representing Sequences Classification and Labeling Pre-training Representations 42/ 76
Rectified Linear Units
ReLU:
0.0 1.0 2.0 3.0 4.0 5.0 6.0
−6 −4 −2 0 2 4 6
𝑦
𝑥
Derivative of ReLU:
0.0 0.2 0.4 0.6 0.8 1.0
−6 −4 −2 0 2 4 6
𝑦
𝑥
faster, suffer less with vanishing gradient
Vinod Nair and Geoffrey E Hinton. Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning, pages 807–814, Haifa, Israel, June 2010. JMLR.org
Neural Networks Basics Representing Words Representing Sequences Classification and Labeling Pre-training Representations 43/ 76
Residual Connections
embeddings x = (𝑥1, … , 𝑥𝑁)
𝑥0= ⃗0 𝑥𝑁 = ⃗0
⊕ ⊕ ⊕ ⊕ ⊕ ⊕ ⊕ ⊕ ⊕ ⊕ ⊕ ⊕
ℎ𝑖= 𝑓 (𝑊 [𝑥𝑖−1; 𝑥𝑖; 𝑥𝑖+1] + 𝑏) + 𝑥𝑖
Allows training deeper networks.
Better gradient flow – the same as in RNNs.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, Las Vegas, NV, USA, June 2016. IEEE Computer Society. ISBN 9781467388511
Neural Networks Basics Representing Words Representing Sequences Classification and Labeling Pre-training Representations 44/ 76
Residual Connections: Numerical Stability
Numerically unstable, we need activation to be in similar scale ⇒ layer normalization.
Activation before non-linearity is normalized:
𝑎𝑖= 𝑔𝑖
𝜎𝑖(𝑎𝑖− 𝜇𝑖)
…𝑔 is a trainable parameter, 𝜇, 𝜎 estimated from data.
𝜇 = 1 𝐻
𝐻
∑
𝑖=1
𝑎𝑖
𝜎 =
√√
√
⎷ 1 𝐻
𝐻
∑
𝑖=1
(𝑎𝑖− 𝜇)2
Lei Jimmy Ba, Ryan Kiros, and Geoffrey E Hinton. Layer normalization. CoRR, abs/1607.06450, 2016. ISSN 2331-8422
Neural Networks Basics Representing Words Representing Sequences Classification and Labeling Pre-training Representations 45/ 76
Receptive Field
embeddings x = (𝑥1, … , 𝑥𝑁)
𝑥0= ⃗0 𝑥𝑁 = ⃗0
Neural Networks Basics Representing Words Representing Sequences Classification and Labeling Pre-training Representations 46/ 76
Convolutional architectures
+
• extremely computationally efficient
–
• limited context
• by default no aware of 𝑛-gram order
Currently mostly used in speech processing and character-level architectures
Neural Networks Basics Representing Words Representing Sequences Classification and Labeling Pre-training Representations 47/ 76
Representing Sequences
Self-attentive Networks
Self-attentive Networks: Complete graph metaphor
RNN = information pipeline
CNN = information in tree-like data structure
1 1 1
2 2 2
3 3 3
4 4 4
5 5 5
6 6 6
7 7 7
8 8 8
Self-attentive = Transformers =
Information flow in a weighted complete bipartite graph
Neural Networks Basics Representing Words Representing Sequences Classification and Labeling Pre-training Representations 48/ 76
Self-attentive Networks
• In some layers: states are linear combination of previous layer states
• Originally for the Transformer model for machine translation
⇐ attention weights = similarity matrix between all pairs of states
• 𝑂(𝑛2) memory, 𝑂(1) time (when paralelized)
• next layer: sum by rows
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems 30, pages 6000–6010, Long Beach, CA, USA, December 2017. Curran Associates, Inc
Neural Networks Basics Representing Words Representing Sequences Classification and Labeling Pre-training Representations 49/ 76
Multi-head scaled dot-product attention
Scaled dot-production attention
𝑄 = (𝑞1, … , 𝑞𝑛): queries, 𝐾: keys, 𝑉 : values
Attn(𝑄, 𝐾, 𝑉 ) = softmax
similarity matrix
⏞⏞⏞⏞⏞
(𝑄𝐾⊤
√𝑑 ) 𝑉 Multi-head setup
Multihead(𝑄, 𝑉 ) =
concatenate head outputs
⏞⏞⏞⏞⏞⏞⏞
(𝐻1⊕ ⋯ ⊕ 𝐻ℎ) 𝑊𝑂 𝐻𝑖= Attn(𝑄𝑊𝑖𝑄, 𝑉 𝑊𝑖𝐾, 𝑉 𝑊𝑖𝑉)
𝑊𝑖𝑄, 𝑊𝑖𝐾, 𝑊𝑖𝑉 head-specific projections keys & values =? queries linear 𝑊𝐾 linear 𝑊𝑉 linear 𝑊𝑄
split split split
concat linear 𝑊𝑂
scaled dot-product attention scaled dot-product attention scaled dot-product attention scaled dot-product attention scaled dot-product attention
Neural Networks Basics Representing Words Representing Sequences Classification and Labeling Pre-training Representations 50/ 76
Dot-Product Attention in PyTorch
def attention(query, key, value, mask=None):
d_k = query.size(-1)
scores = torch.matmul(query, key.transpose(-2, -1)) \ / math.sqrt(d_k)
p_attn = F.softmax(scores, dim = -1) return torch.matmul(p_attn, value), p_attn
Neural Networks Basics Representing Words Representing Sequences Classification and Labeling Pre-training Representations 51/ 76
Stacking self-attentive Layers
input embeddings position ⊕
encoding
self-attentivesublayer
multihead attention
keys &
values queries
⊕ layer normalization
feed-forwardsublayer
non-linear layer linear layer
⊕ layer normalization
𝑁 ×
• several layers (original paper 6)
• each layer: 2 sub-layers: self-attention and feed-forward layer
• everything inter-connected with residual connections
Neural Networks Basics Representing Words Representing Sequences Classification and Labeling Pre-training Representations 52/ 76
Position Encoding
Model is not aware of the position in the sequence.
pos(𝑖) =
⎧{
⎨{
⎩
sin (10𝑡4 𝑖
𝑑) , if 𝑖 mod 2 = 0 cos (10𝑡4
𝑖−1
𝑑 ) , otherwise
0 20 40 60 80
Text length 0
100
200
300
Dimension
−0.5 0.0 0.5 1.0
Neural Networks Basics Representing Words Representing Sequences Classification and Labeling Pre-training Representations 53/ 76
Architectures Comparison
Recurrent Convolutional
1 1 1
2 2 2
3 3 3
4 4 4
5 5 5
6 6 6
7 7 7
8 8 8
Self-attentive
computation sequential operations memory
Recurrent 𝑂(𝑛 ⋅ 𝑑2) 𝑂(𝑛) 𝑂(𝑛 ⋅ 𝑑)
Convolutional 𝑂(𝑘 ⋅ 𝑛 ⋅ 𝑑2) 𝑂(1) 𝑂(𝑛 ⋅ 𝑑) Self-attentive 𝑂(𝑛2⋅ 𝑑) 𝑂(1) 𝑂(𝑛2⋅ 𝑑)
𝑑 model dimension, 𝑛 sequence length, 𝑘 convolutional kernel
Neural Networks Basics Representing Words Representing Sequences Classification and Labeling Pre-training Representations 54/ 76
Classification and Labeling
Classification and Labeling
Neural Networks Basics Representing Words Representing Sequences
Recurrent Networks Convolutional Networks Self-attentive Networks Classification and Labeling Pre-training Representations
Word2Vec & FastText BERT
Neural Networks Basics Representing Words Representing Sequences Classification and Labeling Pre-training Representations 55/ 76
Architecture Overview
1.
Embed input words.
Get a sequence of
continuous vectors
2.
Contextualize input.
Apply a sequence processing architecture and get contextual representation.
3.
Get some output.
Typically classification or labeling.
Neural Networks Basics Representing Words Representing Sequences Classification and Labeling Pre-training Representations 56/ 76
Text classification
Tasks such as: sentiment analysis, genre classification, fake news detection, spam detection.
RNN/CNN/Transformer Embeddings
Encoder Hidden states
(ℎ1, … , ℎ𝑛)
get single
vector classifier
Classifier = feedforward neural net with softmax at the end
Neural Networks Basics Representing Words Representing Sequences Classification and Labeling Pre-training Representations 57/ 76
Getting a single vector: Two options
1. Pooling
• Squeeze the sequence into a vector
• Usable directly for embeddings and also encoder output
• Mean pooling: 𝑛1∑ ℎ𝑖
• Max pooling (intuitively existential quantifier): take max in each dimension
2. Choose one state
• In RNN the last state, i.e., after reading the whohle input
• In Transformers any (typically the first) state, self-attention will learn to move relevant information there
Neural Networks Basics Representing Words Representing Sequences Classification and Labeling Pre-training Representations 58/ 76
Softmax & Cross-Entropy
Output layer with softmax (with parameters 𝑊 , 𝑏) — gets categorical distribution:
𝑃𝑦= softmax(x) = P(𝑦 = 𝑗 ∣ x) = exp x⊤𝑊 + 𝑏
∑ exp x⊤𝑊 + 𝑏
Network error = cross-entropy between estimated distribution and one-hot ground-truth distribution 𝑇 = 1(𝑦∗) = (0, 0, … , 1, 0, … , 0):
𝐿(𝑃𝑦, 𝑦∗) = 𝐻(𝑃 , 𝑇 ) = −𝔼𝑖∼𝑇log 𝑃 (𝑖)
= − ∑
𝑖
𝑇 (𝑖) log 𝑃 (𝑖)
= − log 𝑃 (𝑦∗)
Neural Networks Basics Representing Words Representing Sequences Classification and Labeling Pre-training Representations 59/ 76
Derivative of Cross-Entropy
Let 𝑙 = x⊤𝑊 + 𝑏, 𝑙𝑦∗ corresponds to the correct one.
∂𝐿(𝑃𝑦, 𝑦∗)
∂𝑙 = −∂
∂𝑙log exp 𝑙𝑦∗
∑𝑗exp 𝑙𝑗 = −∂
∂𝑙𝑙𝑦∗ − log ∑ exp 𝑙
= 1𝑦∗ + ∂
∂𝑙 − log ∑ exp 𝑙 = 1𝑦∗− ∑ 1𝑦∗exp 𝑙
∑ exp 𝑙 =
= 1𝑦∗ − 𝑃𝑦(𝑦∗)
0 1
0 1
0 1
Interpretation: Reinforce the correct logit, suppress the rest.
Neural Networks Basics Representing Words Representing Sequences Classification and Labeling Pre-training Representations 60/ 76
Sequence Labeling
• assign value / probability distribution to every token in a sequence
• morphological tagging, named-entity recognition, LM with unlimited history, answer span selection
classifier classifier classifier classifier classifier classifier
RNN/CNN/Transformer Embeddings
Encoder Hidden states The same classifier
at every state
Neural Networks Basics Representing Words Representing Sequences Classification and Labeling Pre-training Representations 61/ 76
Pre-training Representations
Pre-training Representations
Neural Networks Basics Representing Words Representing Sequences
Recurrent Networks Convolutional Networks Self-attentive Networks Classification and Labeling Pre-training Representations
Word2Vec & FastText BERT
Neural Networks Basics Representing Words Representing Sequences Classification and Labeling Pre-training Representations 62/ 76
Pre-trained Representations
• representations that emerge in models seem to carry a lot of general information about the language
• representations pre-trained on large data can be re-used on tasks with smaller training data
Neural Networks Basics Representing Words Representing Sequences Classification and Labeling Pre-training Representations 63/ 76
Pre-training Representations
Word2Vec & FastText
Word2Vec
• way to learn word embeddings without training the complete LM
CBOW Skip-gram
∑ 𝑤3
𝑤1 𝑤2
𝑤4 𝑤5
⋮
𝑤3
𝑤1 𝑤2
𝑤4 𝑤5
⋮
• CBOW: minimize cross-entropy of the middle word of a sliding windows
• skip-gram: minimize cross-entropy of a bag of words around a word (LM other way round)
Tomáš Mikolov, Wen-tau Yih, and Geoffrey Zweig. Linguistic regularities in continuous space word representations. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 746–751, Atlanta, GA, USA, June 2013.
Association for Computational Linguistics
Neural Networks Basics Representing Words Representing Sequences Classification and Labeling Pre-training Representations 64/ 76
Word2Vec: sampling
1. All human beings are born free and equal in dignity … → (All, humans) (All, beings) 2. All human beings are born free and equal in dignity … →
(human, All) (human, beings)
(human, are) 3. All human beings are born free and equal in dignity … →
(beings, All) (beings, human)
(beings, are) (beings, born) 4. All human beings are born free and equal in dignity … →
(are, human) (are, beings) (are, born)
(are, free)
Neural Networks Basics Representing Words Representing Sequences Classification and Labeling Pre-training Representations 65/ 76
Word2Vec: Formulas
• Training objective:
1 𝑇
𝑇
∑
𝑡=1
∑
𝑗∼(−𝑐,𝑐)
log 𝑝(𝑤𝑡+𝑐|𝑤𝑡)
• Probability estimation:
𝑝(𝑤𝑂|𝑤𝐼) = exp (𝑉′⊤𝑤
𝑂𝑉𝑤
𝐼)
∑𝑤exp (𝑉′⊤𝑤𝑉𝑤
𝑖) where 𝑉 is input (embedding) matrix, 𝑉′ output matrix
Equations 1, 2. Tomáš Mikolov, Wen-tau Yih, and Geoffrey Zweig. Linguistic regularities in continuous space word representations. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 746–751, Atlanta, GA, USA, June 2013. Association for Computational Linguistics
Neural Networks Basics Representing Words Representing Sequences Classification and Labeling Pre-training Representations 66/ 76
Word2Vec: Training using Negative Sampling
The summation in denominator is slow, use noise contrastive estimation:
log 𝜎 (𝑉′⊤𝑤
𝑂𝑉𝑤
𝐼) +
𝑘
∑
𝑖=1
𝐸𝑤
𝑖∼𝑃𝑛(𝑤)[log 𝜎 (−𝑉′⊤𝑤
𝑖𝑉𝑤
𝐼)]
Main idea: classify independently by logistic regression the positive and few sampled negative examples.
Equations 1, 3. Tomáš Mikolov, Wen-tau Yih, and Geoffrey Zweig. Linguistic regularities in continuous space word representations. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 746–751, Atlanta, GA,
USA, June 2013. Association for Computational Linguistics
Neural Networks Basics Representing Words Representing Sequences Classification and Labeling Pre-training Representations 67/ 76
Word2Vec: Vector Arithmetics
man
woman
uncle
aunt
king
queen
kings
queens
king
queen
Image originally from Tomáš Mikolov, Wen-tau Yih, and Geoffrey Zweig. Linguistic regularities in continuous space word representations. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 746–751, Atlanta, GA, USA, June 2013. Association for Computational Linguistics
Neural Networks Basics Representing Words Representing Sequences Classification and Labeling Pre-training Representations 68/ 76
FastText
State-of-the-art pre-trained word embeddings
• Word2Vec training treats words as independent entities
• There are regularities in how words look like — it is called morphology
FastText approaches this
1. Represent each word as a bag of character 𝑛-grams
2. Keep a table of character 𝑛-grams embeddings instead of words 3. Word embedding = average of character 𝑛-gram embeeddings
Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146, 2017. doi: 10.1162/tacl_a_00051. URL https://www.aclweb.org/anthology/Q17-1010
Neural Networks Basics Representing Words Representing Sequences Classification and Labeling Pre-training Representations 69/ 76
Training models
FastText — model implementation by Facebook https://github.com/facebookresearch/fastText Just throw in raw text.
./fasttext skipgram -input data.txt -output model
The tool allows training simple classifiers using the embeddings.
Neural Networks Basics Representing Words Representing Sequences Classification and Labeling Pre-training Representations 70/ 76
Pre-training Representations
BERT
What is BERT
Bidirectional Encoder Representations
from Transformers • pre-trained representation capturing context
contextual embeddings
• Transformer with a masked language model objective
• originally by Google, published in November 2018
• since then better version: RoBERTa by Facebook, language-specific variants, multilingual versions
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. doi: 10.18653/v1/N19-1423. URL https://www.aclweb.org/anthology/N19-1423
Neural Networks Basics Representing Words Representing Sequences Classification and Labeling Pre-training Representations 71/ 76
Achitecture
input embeddings position ⊕
encoding
self-attentivesublayer
multihead attention
keys &
values queries
⊕ layer normalization
feed-forwardsublayer
non-linear layer linear layer
⊕ layer normalization
𝑁 ×
• a stack of 12 Transformer layers
• trained as sequence labeling: change some input token, labeler guesses original tokens
masked language modeling
• being able to predict missing words: proxy for language understanding
When trained, throw away labeling, use last layer as the representation.
Neural Networks Basics Representing Words Representing Sequences Classification and Labeling Pre-training Representations 72/ 76
Masked Language Model
All human being are born free free MASK hairy free and equal in dignity and rights
1. Randomly sample a word → free
2. With 80% change replace with special MASK token.
3. With 10% change replace with random token → hairy 4. With 10% change keep as is → free
Then a classifier should predict the missing/replaced word free
Neural Networks Basics Representing Words Representing Sequences Classification and Labeling Pre-training Representations 73/ 76
Pre-train and finetune paradigm
BERT encoder 1. Pre-training
(do once, large data, long time) unlabeled
datal
masked LM
2. Finetuning
(small data, fast) tasks-pecific
data
task-specific objective
Neural Networks Basics Representing Words Representing Sequences Classification and Labeling Pre-training Representations 74/ 76
Implementation: Huggingface Transformers
https://github.com/huggingface/transformers
Implements most existign pre-trained BERT-like models for both PyTorch and TensorFlow.
Neural Networks Basics Representing Words Representing Sequences Classification and Labeling Pre-training Representations 75/ 76
Deep Learning for Natural Language Processing
Summary
1. Discrete symbols → continuous representation with trained embeddings
2. Architectures to get suitable representation: recurrent, convolutional, self-attentive
3. Output: classification, sequence labeling
4. Representations pretrained on large data helps on downstream tasks