Variational Autoencoders: From Theory to Implementation

(1)

Variational Autoencoders:

From Theory to Implementation

Petra Poklukar

(2)

Outline

• Introduction & intuition

• Math, math and more math

• Implementation: tricks & pitfalls

(3)

Introduction

(4)

AE: Intuition

ℒ 𝑥, 𝑥

_$%&'(

= 𝑚𝑖𝑛 𝑥 − 𝑥

_$%&'( ^.

Encoder

NN

Decoder NN Latent space

Low-dimensional

𝑥

_$%&'(

𝑥

(5)

V AE: Intuition

2019-11-28 5

Latent space

Low-dimensional

𝑥 𝑥

_$%&'(

Encoder NN

Decoder

NN

(6)

VAE: Intuition

1. Fix a parametrization for each point: Gaussian N(𝜇, 𝜎)

𝑥 𝑥

_$%&'(

Encoder NN

Decoder

𝜎(𝑥)

NN

𝜇(𝑥)

2. Encoder NN outputs the parameters of the Gaussian

distribution: 𝜇, 𝜎

𝑧

3. Sample 𝑧 ~Ν(𝜇(𝑥), 𝜎(𝑥))

(7)

VAE: Use cases

2019-11-28 7

𝑥 𝑥

_$%&'(

Encoder NN

Decoder 𝜎(𝑥) NN

𝜇(𝑥)

𝑧

Latent representation

learning

Data synthesis Density

modelling

(8)

Theory

(9)

The starting objective

2019-11-28 9

learn the true distribution of the data 𝑝_∗(𝑥)

Goal

𝑝_∗(𝑥) is intractable

(but we do have access to finitely many samples from 𝑝_∗)

Problem

approximate 𝑝_∗ x (with a parametrized 𝑝_𝜽 x )

(Naive) Idea

𝑥 observed random variables

(10)

How do we measure the quality of the approximation?

(Reverse) KL Divergence to the rescue:

𝐷_<= 𝑝_> 𝑥 ∥ 𝑝_∗ 𝑥 = @ 𝑝_> 𝑥 log𝑝_> 𝑥 𝑝_∗ 𝑥 𝑑𝑥

𝑝_> 𝑥 𝑝_∗ 𝑥

𝑝_∗ 𝑥 𝑝_> 𝑥

Not OK, reverse KL is

large

OK, reverse KL is

small

Introduce a latent variable 𝑧

Helpful with missing data A way to leverage prior knowledge

Increasing expressive power

Same problem:

𝑝_∗ 𝑥 is unknown

(11)

Latent variable 𝒛 sneaking in

2019-11-28 11

Let’s instead consider the joint density 𝑝 𝑥, 𝑧 :

𝑝(𝑥) = @ 𝑝 𝑥, 𝑧 𝑑𝑧 = @ 𝑝 𝑥|𝑧 𝑝(𝑧) 𝑑𝑧

Problem:

- 𝑧 is unobserved - No analytical solution - No efficient estimator

Variational principle Convert it to an

optimisation problem

(12)

= @ 𝑞 𝑧 𝑥 ln𝑝 𝑥 𝑧 𝑝(𝑧)𝑞 𝑧 𝑥 𝑝(𝑧|𝑥)𝑞 𝑧 𝑥 𝑑𝑧

Variational principle

Introduce a parametrised conditional probability distribution 𝑞(𝑧|𝑥):

= @ 𝑞 𝑧 𝑥 ln𝑞(𝑧|𝑥)

𝑝(𝑧|𝑥) 𝑑𝑧 + @ 𝑞 𝑧 𝑥 ln𝑝 𝑥, 𝑧 𝑞 𝑧 𝑥 𝑑𝑧 ln 𝑝 𝑥

= 𝐷_<= 𝑞 𝑧 𝑥 ∥ 𝑝 𝑧|𝑥 + @ 𝑞 𝑧 𝑥 ln𝑝 𝑥, 𝑧 𝑞 𝑧 𝑥 𝑑𝑧 Intractable but

always ≥ 0

Tractable after some math

= ln 𝑝(𝑥) @ 𝑞 𝑧 𝑥 𝑑𝑧

=1

= @ 𝑞 𝑧 𝑥 ln 𝑝(𝑥) 𝑑𝑧

= @ 𝑞 𝑧 𝑥 ln𝑝 𝑥 𝑧 𝑝(𝑧) 𝑝(𝑧|𝑥) 𝑑𝑧

∎ 𝑝 𝑧 𝑥 =^M(N|O)M(O)

M(N)

∎

(13)

Variational principle

2019-11-28 13

We get that

where 𝑞 𝑧 𝑥 should approximate p 𝑧 𝑥 .

≥ @ 𝑞 𝑧 𝑥 ln𝑝 𝑥, 𝑧

𝑞 𝑧 𝑥 𝑑𝑧

New objective:

Maximise the variational lower

bound 𝓛

ln 𝑝 𝑥 = 𝐷_<= 𝑞 𝑧 𝑥 ∥ 𝑝 𝑧|𝑥 + @ 𝑞 𝑧 𝑥 ln𝑝 𝑥, 𝑧 𝑞 𝑧 𝑥 𝑑𝑧

(14)

Keeping track of what we’re doing

learn the true distribution of the data 𝑝_∗(𝑥)

Goal

𝑝_∗(𝑥) is intractable

(but we do have access to finitely many samples from 𝑝_∗)

Problem

approximate 𝑝_∗ x with a parametrised 𝑝_𝜽 x Idea #1

Introduce a latent variable and leverage the joint density ln 𝑝(𝑥) = ln @ 𝑝 𝑥, 𝑧 𝑑𝑧 = ln @ 𝑝 𝑥|𝑧 𝑝(𝑧) 𝑑𝑧

Idea #2

Introduce 𝑞(𝑧|𝑥) and solve the optimization problem Idea #3

(15)

Variational principle

2019-11-28 15

We get that

where 𝑞 𝑧 𝑥 should approximate p 𝑧 𝑥 .

≥ @ 𝑞 𝑧 𝑥 ln𝑝 𝑥, 𝑧

𝑞 𝑧 𝑥 𝑑𝑧

New objective:

Maximise the variational lower

bound 𝓛

ln 𝑝 𝑥 = 𝐷_<= 𝑞 𝑧 𝑥 ∥ 𝑝 𝑧|𝑥 + @ 𝑞 𝑧 𝑥 ln𝑝 𝑥, 𝑧 𝑞 𝑧 𝑥 𝑑𝑧

(16)

Variational lower bound rewritten

𝓛 = @ 𝑞 𝑧 𝑥 ln𝑝 𝑥, 𝑧 𝑞 𝑧 𝑥 𝑑𝑧

= @ 𝑞 𝑧 𝑥 ln𝑝 𝑥|𝑧 𝑝(𝑧) 𝑞 𝑧 𝑥 𝑑𝑧

= @ 𝑞 𝑧 𝑥 ln 𝑝(𝑧)

𝑞 𝑧 𝑥 𝑑𝑧 + @ 𝑞 𝑧 𝑥 ln 𝑝(𝑥|𝑧) 𝑑𝑧

= − @ 𝑞 𝑧 𝑥 ln𝑞(𝑧|𝑥)

𝑝(𝑧) 𝑑𝑧 + @ 𝑞 𝑧 𝑥 ln 𝑝(𝑥|𝑧) 𝑑𝑧

= −𝐷_<= 𝑞 𝑧 𝑥 ∥ 𝑝 𝑧 + 𝔼_R(O|N)[ln 𝑝(𝑥|𝑧)]

Evidence Lower Bound (ELBO), mostly known as the

VAE training

objective J

(17)

What we have done so far

2019-11-28 17

ln 𝑝 𝑥 = 𝐷_<= 𝑞 𝑧 𝑥 ∥ 𝑝 𝑧|𝑥 + @ 𝑞 𝑧 𝑥 ln𝑝 𝑥, 𝑧 𝑞 𝑧 𝑥 𝑑𝑧 Intractable but

always ≥ 0

Variational lower bound 𝓛

≥ @ 𝑞 𝑧 𝑥 ln𝑝 𝑥, 𝑧 𝑞 𝑧 𝑥 𝑑𝑧

= −𝐷_<= 𝑞 𝑧 𝑥 ∥ 𝑝 𝑧 + 𝔼_R(O|N)[ln 𝑝(𝑥|𝑧)]

(18)

VAE training objective

We derived that

Let’s choose:

And use two NNs to learn the parameters 𝜇(𝑥), 𝜎 𝑥 , 𝜇(𝑧), 𝜎 𝑧 !

ln 𝑝 𝑥 ≥ − 𝐷

_<=

𝑞 𝑧 𝑥 ∥ 𝑝 𝑧 + 𝔼

_R(O|N)

[ln 𝑝(𝑥|𝑧)]

𝑝 𝑧 = 𝑁(0, 1) 𝑞 𝑧 𝑥 = 𝑁(𝜇(𝑥), 𝜎(𝑥))

𝑝 𝑥 𝑧 = 𝑁(𝜇(z), 𝜎(𝑧))

(19)

Recall how we started

2019-11-28 19

1. Fix a parametrization: Gaussian Ν(𝜇, 𝜎)

𝑥 𝑥

_$%&'(

Encoder NN

Decoder 𝜎(𝑥) NN

𝜇(𝑥)

2. Encoder NN outputs the parameters of the Gaussian

distribution: 𝜇, 𝜎

𝑧

3. Sample 𝑧 ~Ν(𝜇(𝑥), 𝜎(𝑥))

(20)

Everything falling together

Prior on the latent codes:

𝑝 𝑧 = 𝑁(0, 1)

𝑥 𝑥

_$%&'(

𝑞 𝑧 𝑥

_𝜎(𝑥)^𝜇(𝑥)

𝑝 𝑥 𝑧

Encoder NN outputs the parameters of the approximate posterior distrubtion 𝑞 𝑧 𝑥

which we chose to parameterize as a Gaussian distribution

𝑧

_𝜎(𝑧)^𝜇(𝑧)

Decoder NN outputs the parameters of the likelihood 𝑝 𝑥 𝑧

which we chose to parameterize as a Gaussian distribution

(21)

VAE training objective

2019-11-28 21

We derived that

Given finitely many observe samples 𝑥_Y, 𝑥_., … , 𝑥_[ (training data):

where 𝜃 and 𝜙 denote the parameters of the encoder NN and decoder NN, respectively.

ln 𝑝 𝑥 ≥ − 𝐷

_<=

𝑞 𝑧 𝑥 ∥ 𝑝 𝑧 + 𝔼

_R(O|N)

[ln 𝑝(𝑥|𝑧)]

ln 𝑝 𝑥

_^

≈ max

^

( −𝐷

_<=

𝑞

_b

𝑧 𝑥

_^

∥ 𝑝 𝑧 + 𝔼

_R_c_(O|N_d₎

ln 𝑝

_>

𝑥

_^

𝑧 )

= min

^

( 𝐷

_<=

𝑞

_b

𝑧 𝑥

_^

∥ 𝑝 𝑧 − 𝔼

_R_c_(O|N_d₎

ln 𝑝

_>

𝑥

_^

𝑧 ) ,

(22)

VAE training objective

min

^

𝐷

_<=

𝑞

_b

𝑧 𝑥

_^

∥ 𝑝 𝑧 − 𝔼

_R_c_(O|N_d₎

[ln 𝑝

_>

(𝑥

_^

|𝑧)] ⟺

min

^

𝐷

_<=

𝑁(𝜇

_b

𝑥 , 𝜎

_b

(𝑥)) ∥ 𝑁(0, 1) − 𝔼

_O~[(g_c _{N ,h}_c _{N )}

[ln 𝑁(𝜇

_>

𝑧 , 𝜎

_>

𝑧 )]

𝑞

_b

𝑧 𝑥 𝜇

_b

(𝑥)

𝜎

_b

(𝑥) 𝑧 𝑝

_>

𝑥 𝑧 𝜇

_>

(𝑧)

𝜎

_>

(𝑧) 𝑥

_$

𝑥

(23)

VAE training objective: intuition

2019-11-28 23

Image source: http://ruishu.io/2018/03/14/vae/

min

^

(𝐷

_<=

𝑁(𝑧|𝜇

_b

𝑥 , 𝜎

_b

(𝑥)) ∥ 𝑁(0, 1)

−𝔼

_{O~[ g}

c N ,h_c N

[ln 𝑁 𝑥|𝜇

_>

𝑧 , 𝜎

_>

𝑧 ])

≈ (𝑥 − 𝜇_> 𝑧 )^. 2𝜎_> 𝑧 ^.

Hence the reconstruction!

(24)

VAE training objective: intuition

min

^

𝐷

_<=

𝑁(𝑧|𝜇

_b

𝑥 , 𝜎

_b

(𝑥)) ∥ 𝑁(0, 1) − 𝔼

_O~[(g_c _{N ,h}_c _{N )}

[ln 𝑁(𝑥|𝜇

_>

𝑧 , 𝜎

_>

𝑧 )]

(25)

After the completed training…

2019-11-28 25

… we can evaluate the log likelihood of a new point 𝑥_(%j:

𝑝 𝑥

_(%j

= @ 𝑝 𝑥

_(%j

, 𝑧 𝑑𝑧 = @ 𝑝 𝑥

_(%j

|𝑧 𝑝 𝑧 𝑑𝑧

= 𝔼

_M(O)

[𝑝 𝑥

_(%j

𝑧)]

≈ 1 𝑁 k

lmY [

𝑝 𝑥

_(%j

𝑧

_l

),

= 1 𝑁 k

lmY [

𝑝 𝑥

_(%j

𝜇

_>

(𝑧

_l

), 𝜎

_>

(𝑧

_l

)), 𝑧 ~ 𝑝(𝑧)

𝑧 ~ 𝑝(𝑧)

(26)

Implementation tricks

(27)

But first….

… STAND UP!

(28)

Implementation tricks

• Objective function

• Encoder

• Decoder

– Variance of the decoder

• Design choices

– Architecture

– Latent space dimension – Modelling decoder variance

(29)

Objective function

(30)

VAE training objective

𝑞

_b

𝑧 𝑥 𝜇

_b

(𝑥)

𝜎

_b

(𝑥) 𝑧 𝑝

_>

𝑥 𝑧 𝜇

_>

(𝑧) 𝜎

_>

(𝑧) 𝑥

_$

𝑥

Optimize

using stochastic gradient descent (SGD).

min

>,b

𝐷

_<=

𝑞

_b

𝑧 𝑥

_^

∥ 𝑝 𝑧 − 𝔼

_R_c_(O|N_d₎

[ln 𝑝

_>

(𝑥

_^

|𝑧)]

(31)

KL divergence has a closed form

2019-11-28 31

For two Gaussian distributions

the KL divergence term has a closed form:

𝑝 𝑧 = 𝑁(0, 1)

𝑞_b 𝑧 𝑥 = 𝑁(𝜇_b(𝑥), 𝜎_b(𝑥))

𝐷_<= 𝑁(𝜇_b 𝑥 , 𝜎_b(𝑥)) ∥ 𝑁(0, 1) = −1

2 k

lmY nop%(p q^r

(1 + ln 𝜎_b(𝑥)_l^. − 𝜇_b 𝑥 _l^. − 𝜎_b(𝑥)_l^.)

(1) See the full derivation Auto-Encoding Variational Bayes, Appendix B

(1)

(32)

Computing the reconstruction error involves sampling L

min

>,b

𝐷

_<=

𝑞

_b

𝑧 𝑥 ∥ 𝑝 𝑧 − 𝔼

_R_c_(O|N)

[ln 𝑝

_>

(𝑥|𝑧)]

Problem: we cannot directly backpropagate gradients through the random variable 𝑧:

∇

_b

𝔼

_R_c_(O|N)

ln 𝑝

_>

𝑥, 𝑧 = ?

(33)

Express the latent random variable 𝑧

as a differentiable transformation of another random variable 𝜖

where ⨀ denotes the element-wise product.

Reparametrization trick

2019-11-28 33

𝑧 ~ 𝑞

_b

𝑧 𝑥 = 𝑁 𝜇

_b

𝑥 , 𝜎

_b

(𝑥)

𝑧 ~ 𝑇

_b

(𝑥, 𝜖) = 𝜇

_b

𝑥 + 𝜎

_b

𝑥 ⨀𝜖

𝜖 ~ 𝑝 𝜖 = 𝑁 0, 1

(34)

Reparametrization trick

Given the change of variable

we can write

so that the gradients are commutative:

𝑧 ~ 𝑇

_b

(𝑥, 𝜖) = 𝜇

_b

𝑥 + 𝜎

_b

𝑥 ⨀𝜖

𝔼

_R_c_(O|N)

ln 𝑝

_>

𝑥|𝑧 = 𝔼

_M(x)

ln 𝑝

_>

𝑥|𝑧

= 𝔼

_M(x)

[∇

_b

ln 𝑝

_>

𝑥|𝑧 ]

∇

_b

𝔼

_R_c_(O|N)

ln 𝑝

_>

𝑥|𝑧 = ∇

_b

𝔼

_M(x)

ln 𝑝

_>

𝑥|𝑧

(35)

Reparametrization trick

2019-11-28 35

(Image source:https://arxiv.org/pdf/1906.02691.pdf, Section 2.4)

(36)

Encoder NN

(37)

Encoder: model the variance directly

2019-11-28 37

(38)

Encoder: model the log variance instead

(39)

Decoder NN

(40)

Decoder: learn the likelihood variance

Depends on your data preprocessing!!!

ln 𝑝_> 𝑥|𝑧 = ln 1

2𝜋𝜎_>(𝑧)^. 𝑒^{

(N { g_|(O))^} .h_|(O)^}

(41)

Decoder: fix the likelihood variance to 1

2019-11-28 41

ln 𝑝_> 𝑥|𝑧 = ln 1

2𝜋𝑒^{^{(N { g}^|^(O))

}

.

≈ (𝑥 − 𝜇_>(𝑧))^.

Since constants don’t matter for optimisation ⇒ again the

reconstruction effect!

(42)

Design choices

(according to my experience)

(43)

Choosing the architecture

2019-11-28 43

• Multi layer perceptron (nn.Linear):

– simple data potentially without spatial-temporal dependencies (MNIST, simple robot trajectories,

…)

• Convolutional layers (nn.Conv, nn.TransposeConv):

– 2d: images (MNIST, CIFAR-10, SVHN, …)

– 1d: time-series

• Extra: residual layers

– Beneficial for more complex

datasets (CIFAR-10, ImageNet, …)

(44)

Choosing the latent dimension

• For simple data (e.g. MNIST):

– Up to 20, start with say 10.

– 2-dimensional latent space is practical for visualisation.

• For complex data (e.g. CIFAR- 10, ImageNet):

– More than 60, start with say 64.

– I used 128 when training ResNet on CIFAR-10.

(45)

Choosing the decoder variance

2019-11-28 45

• Fix it to 1 if you are dealing with

simple data (e.g. MNIST) • Learn it if you are dealing with more complex data (e.g. CIFAR-10)

(46)

KL annealing schedule

Introduce a hyperparameter 𝛽

And gradually increase it from 0 to 𝑛 (the higher the 𝑛 the better disentanglement of the latent space).

Benefits:

- Better reconstructions

- Preventing posterior collapse (i.e. KL term == 0)

Check out 𝛽-VAE for more details: https://openreview.net/references/pdf?id=Sy2fzU9gl

min

>,b

𝛽𝐷

_<=

𝑞

_b

𝑧 𝑥 ∥ 𝑝 𝑧 − 𝔼

_R_c

𝑧 𝑥 [ln 𝑝

^>

(𝑥|𝑧)]

(47)

What I left out

2019-11-28 47

• Variants of VAEs

– Disentagled representations: FactorVAE, 𝛽-VAE – Discrete latent space: VQ-VAE

– Adversarial: AAE, GAN-VAE hybrids

– LadderVAEs, Lossy VAE, CAE, TD-VAE, 2StageVAE, …

(48)

Summary

• The true goal: learn the true distribution of the data 𝑝_∗(𝑥)

• Derivation of ELBO

• Encoder and decoder NNs return the parameters of the approximate posterior 𝑞_b 𝑧 𝑥 and likelihood 𝑝_> 𝑥|𝑧 , respectively.

• Implementation:

– Reparametrization trick for propagating the gradients – Variance of the decoder

(49)

Literature

2019-11-28 49

• Papers:

– Auto-encoding Variational Bayes [Kingma&Welling, 2014]

– Stochastic Backpropagation and Approximate Inference in Deep Generative Models [Rezende&Wiestra, 2014]

– β-VAE: Learning Basic Visual Concepts With a Constrained Variational Framework [Higgins et al., 2017]

– An Introduction to Variational Autoencoders [Kigma&Welling, 2019]

• Blogs:

– Rui Shu, Density Estimation: Variational Autoencoders – Jeremy Jordan, Variational Autoencoders

– Lilian Weng, From Autoencoder to Beta-VAE

(50)