Approximate Inference

(1)

Henrik I. Christensen

Robotics & Intelligent Machines @ GT, Georgia Institute of Technology, Atlanta, GA 30332-0280

(2)

Outline

1 Introduction

2 Variational Inference

3 Variational Mixture of Gaussians

4 Exponential Family

5 Expectation Propagation

6 Summary

(3)

Introduction

We are often required to estimate a (conditional) posterior of the form p(Z|X)

The solution might be intractable

1 There might not be a closed-form solution

2 The integration over the latent variables Z or a parameter space θ might be computationally challenging

3 The set of possible outcomes might be significant/exponential

Two strategies

1 Deterministic Approximation Methods

2 Stochastic Sampling (Monte Carlo Techniques)

Today we will talk about deterministic techniques

(5)

Variational Inference

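In general we have a Bayesian model as seen earlier, i.e.

$$\ln p(X) = \ln p(X, Z) - \ln p(Z|X)$$

We can rewrite this as

$$\ln p(X) = \mathcal{L}(q) + \mathrm{KL}(q\|p)$$

where

$$\mathcal{L}(q) = \int q(Z) \ln \frac{p(X, Z)}{q(Z)} \, dZ$$

$$\mathrm{KL}(q\|p) = -\int q(Z) \ln \frac{p(Z|X)}{q(Z)} \, dZ$$

So L(q) is a lower bound on ln p(X), built from the joint distribution, and KL(q||p) is the Kullback-Leibler divergence between q(Z) and p(Z|X). Since KL(q||p) ≥ 0, maximizing L(q) over q is the same as minimizing the divergence.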
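The decomposition of ln p(X) can be checked numerically on a small discrete model. The sketch below is only an illustration: the joint probabilities and the choice of q are made-up numbers, and the integrals become sums because Z is assumed to take three values.

```python
import math

# Hypothetical toy model: Z takes 3 values, X is fixed at its observed value.
# joint[z] stores p(X = x_obs, Z = z); the numbers are invented for illustration.
joint = [0.10, 0.25, 0.05]

evidence = sum(joint)                        # p(X) = sum_z p(X, z)
posterior = [p / evidence for p in joint]    # p(Z | X)

# An arbitrary approximating distribution q(Z).
q = [0.5, 0.3, 0.2]

# L(q) = sum_z q(z) ln( p(X, z) / q(z) )
lower_bound = sum(qz * math.log(pz / qz) for qz, pz in zip(q, joint))

# KL(q || p) = - sum_z q(z) ln( p(z | X) / q(z) )
kl = -sum(qz * math.log(pz / qz) for qz, pz in zip(q, posterior))

log_evidence = math.log(evidence)
print(log_evidence, lower_bound + kl)   # the two values agree up to rounding
```

Setting q equal to the posterior drives the KL term to zero, at which point the bound equals ln p(X).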

(6)

Factorized Distributions

Assume for now that we can factorize Z into disjoint groups so that

$$q(Z) = \prod_{i=1}^{M} q_i(Z_i)$$

In physics a similar model has been adopted, termed mean field theory

We can then optimize L(q) through a component-wise optimization

$$\mathcal{L}(q) = \int \prod_i q_i \Big\{ \ln p(X, Z) - \sum_j \ln q_j \Big\} \, dZ = \int q_j \ln \tilde{p}(X, Z_j) \, dZ_j - \int q_j \ln q_j \, dZ_j + \text{const}$$

where

$$\ln \tilde{p}(X, Z_j) = E_{i \neq j}[\ln p(X, Z)] + c, \qquad E_{i \neq j}[\ln p(X, Z)] = \int \ln p(X, Z) \prod_{i \neq j} q_i \, dZ_i$$

(7)

Factorized distributions

The optimal solution is now

$$\ln q_j^*(Z_j) = E_{i \neq j}[\ln p(X, Z)] + c$$

I.e. each factor is chosen to maximize L(q) with the other factors held fixed; the constant c is set by normalizing q_j^*
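The factor-wise update can be tried on a model small enough to enumerate. The following is a minimal sketch, assuming two binary latent variables and an invented joint table; it alternates the update for each factor and records L(q), which should never decrease.

```python
import math

# Hypothetical joint p(X = x_obs, Z1, Z2) over two binary latent variables.
joint = [[0.30, 0.10],
         [0.05, 0.15]]
log_joint = [[math.log(p) for p in row] for row in joint]

def normalize_from_log(log_w):
    """Exponentiate and normalize a list of unnormalized log-weights."""
    m = max(log_w)
    w = [math.exp(v - m) for v in log_w]
    s = sum(w)
    return [v / s for v in w]

def lower_bound(q1, q2):
    """L(q) = sum_z q1(z1) q2(z2) [ln p(X, z) - ln q1(z1) - ln q2(z2)]."""
    total = 0.0
    for a in range(2):
        for b in range(2):
            total += q1[a] * q2[b] * (
                log_joint[a][b] - math.log(q1[a]) - math.log(q2[b]))
    return total

q1, q2 = [0.5, 0.5], [0.5, 0.5]
bounds = [lower_bound(q1, q2)]
for _ in range(20):
    # ln q1*(z1) = E_{q2}[ln p(X, z1, z2)] + const
    q1 = normalize_from_log(
        [sum(q2[b] * log_joint[a][b] for b in range(2)) for a in range(2)])
    # ln q2*(z2) = E_{q1}[ln p(X, z1, z2)] + const
    q2 = normalize_from_log(
        [sum(q1[a] * log_joint[a][b] for a in range(2)) for b in range(2)])
    bounds.append(lower_bound(q1, q2))

log_evidence = math.log(sum(sum(row) for row in joint))
print(q1, q2, bounds[-1], log_evidence)
```

The final bound stays below ln p(X) because a product of two factors cannot represent the correlation between Z1 and Z2 in this table.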

(9)

Variational Mixture of Gaussians

We encounter mixtures of Gaussians all the time

Examples are multi-wall modelling, ambiguous localization, ...

We have:

a set of observed data X,

a set of latent variables Z that describe the mixture

(10)

Mixture of Gaussians - Modelling

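We can model the mixture assignments as

$$p(Z|\pi) = \prod_{n=1}^{N} \prod_{k=1}^{K} \pi_k^{z_{nk}}$$

We can also derive the conditional distribution of the observations

$$p(X|Z, \mu, \Lambda) = \prod_{n=1}^{N} \prod_{k=1}^{K} \mathcal{N}(x_n|\mu_k, \Lambda_k^{-1})^{z_{nk}}$$

We will for now assume that the mixing coefficients have a Dirichlet prior

$$p(\pi) = \mathrm{Dir}(\pi|\alpha_0) = C(\alpha_0) \prod_{k=1}^{K} \pi_k^{\alpha_0 - 1}$$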

(11)

Mixture of Gaussians - Modelling

The component processes can be modelled as a Gaussian-Wishart

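$$p(\mu, \Lambda) = p(\mu|\Lambda) p(\Lambda) = \prod_{k=1}^{K} \mathcal{N}(\mu_k|m_0, (\beta_0 \Lambda_k)^{-1}) \, \mathcal{W}(\Lambda_k|W_0, \nu_0)$$

I.e. a total model of

[Figure: graphical model with nodes x_n, z_n, π, µ and Λ, and a plate of size N]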

(12)

Mixtures of Gaussians - Variational

The joint model can be written as

$$p(X, Z, \pi, \mu, \Lambda) = p(X|Z, \mu, \Lambda) p(Z|\pi) p(\pi) p(\mu|\Lambda) p(\Lambda)$$

Only X is observed

We can now consider the selection of a distribution

$$q(Z, \pi, \mu, \Lambda) = q(Z) q(\pi, \mu, \Lambda)$$

This is clearly an assumption of independence.

We can use the general result of component-wise optimization

$$\ln q^*(Z) = E_{\pi,\mu,\Lambda}[\ln p(X, Z, \pi, \mu, \Lambda)] + \text{const}$$

Decomposition gives us

$$\ln q^*(Z) = E_{\pi}[\ln p(Z|\pi)] + E_{\mu,\Lambda}[\ln p(X|Z, \mu, \Lambda)] + \text{const}$$

$$\ln q^*(Z) = \sum_{n=1}^{N} \sum_{k=1}^{K} z_{nk} \ln \rho_{nk} + \text{const}$$

(13)

Mixtures of Gaussians - Variational

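We can further achieve

$$\ln \rho_{nk} = E[\ln \pi_k] + \frac{1}{2} E[\ln |\Lambda_k|] - \frac{D}{2} \ln 2\pi - \frac{1}{2} E_{\mu_k,\Lambda_k}[(x_n - \mu_k)^T \Lambda_k (x_n - \mu_k)] + c$$

Taking the exponential we have

$$q^*(Z) \propto \prod_{k=1}^{K} \prod_{n=1}^{N} \rho_{nk}^{z_{nk}}$$

Using normalization we arrive at

$$q^*(Z) = \prod_{k=1}^{K} \prod_{n=1}^{N} r_{nk}^{z_{nk}}$$

where

$$r_{nk} = \frac{\rho_{nk}}{\sum_j \rho_{nj}}$$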
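In practice ln ρ_nk is computed first and the normalization is done in log space, since exponentiating large negative values underflows. A minimal sketch, assuming the ln ρ_nk values are already available (the numbers below are invented):

```python
import math

def responsibilities(log_rho_n):
    """Turn one row of ln(rho_nk) values into normalized r_nk values."""
    m = max(log_rho_n)                       # subtract the max before exp()
    weights = [math.exp(v - m) for v in log_rho_n]
    total = sum(weights)
    return [w / total for w in weights]

# Two data points, three components; the first row would underflow if
# exponentiated directly.
log_rho = [[-1050.0, -1052.0, -1060.0],
           [-3.2, -0.7, -5.1]]
r = [responsibilities(row) for row in log_rho]
print(r)
```

Subtracting the row maximum leaves the ratios ρ_nk / Σ_j ρ_nj unchanged, so the additive constant c in ln ρ_nk has no effect on r_nk.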

(14)

Mixtures of Gaussians - Variational

Just as we saw for EM we can define

$$N_k = \sum_{n=1}^{N} r_{nk}$$

$$\bar{x}_k = \frac{1}{N_k} \sum_{n=1}^{N} r_{nk} x_n$$

$$S_k = \frac{1}{N_k} \sum_{n=1}^{N} r_{nk} (x_n - \bar{x}_k)(x_n - \bar{x}_k)^T$$
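These three statistics are plain responsibility-weighted sums. The sketch below computes them with nested lists; the function name and the tiny data set are illustrative assumptions, and every component is assumed to have N_k > 0.

```python
def weighted_statistics(X, R):
    """Return N_k, xbar_k and S_k for data X (N x D) and responsibilities R (N x K)."""
    N, D, K = len(X), len(X[0]), len(R[0])
    Nk = [sum(R[n][k] for n in range(N)) for k in range(K)]
    xbar = [[sum(R[n][k] * X[n][d] for n in range(N)) / Nk[k] for d in range(D)]
            for k in range(K)]
    S = []
    for k in range(K):
        Sk = [[0.0] * D for _ in range(D)]
        for n in range(N):
            # Deviation from the mean of component k (not of data point n).
            diff = [X[n][d] - xbar[k][d] for d in range(D)]
            for i in range(D):
                for j in range(D):
                    Sk[i][j] += R[n][k] * diff[i] * diff[j] / Nk[k]
        S.append(Sk)
    return Nk, xbar, S

# Hard assignments make the result easy to verify by hand.
X = [[0.0, 0.0], [2.0, 0.0], [10.0, 10.0], [12.0, 10.0]]
R = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
Nk, xbar, S = weighted_statistics(X, R)
print(Nk, xbar, S)
```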

(15)

Mixtures of Gaussians - Parameters/Mixture

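Let us now consider q(π, µ, Λ) to arrive at

$$\ln q^*(\pi, \mu, \Lambda) = \ln p(\pi) + \sum_{k=1}^{K} \ln p(\mu_k, \Lambda_k) + E_Z[\ln p(Z|\pi)] + \sum_{k=1}^{K} \sum_{n=1}^{N} E[z_{nk}] \ln \mathcal{N}(x_n|\mu_k, \Lambda_k^{-1}) + c$$

We can partition the problem into

$$q(\pi, \mu, \Lambda) = q(\pi) \prod_{k=1}^{K} q(\mu_k, \Lambda_k)$$

We can derive

$$\ln q^*(\pi) = (\alpha_0 - 1) \sum_{k=1}^{K} \ln \pi_k + \sum_{k=1}^{K} \sum_{n=1}^{N} r_{nk} \ln \pi_k + c$$

We can now derive

$$q^*(\pi) = \mathrm{Dir}(\pi|\alpha)$$

where

$$\alpha_k = \alpha_0 + N_k$$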

(16)

Mixtures of Gaussians - Parameters/Mixture

We can then derive

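$$q^*(\mu_k, \Lambda_k) = \mathcal{N}(\mu_k|m_k, (\beta_k \Lambda_k)^{-1}) \, \mathcal{W}(\Lambda_k|W_k, \nu_k)$$

where

$$\beta_k = \beta_0 + N_k$$

$$m_k = \frac{1}{\beta_k} (\beta_0 m_0 + N_k \bar{x}_k)$$

$$W_k^{-1} = W_0^{-1} + N_k S_k + \frac{\beta_0 N_k}{\beta_0 + N_k} (\bar{x}_k - m_0)(\bar{x}_k - m_0)^T$$

$$\nu_k = \nu_0 + N_k$$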
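For intuition, the parameter updates can be written out for one component in one dimension (D = 1), where W_k and S_k are scalars. This is a sketch under that simplifying assumption; the function name and the input values are hypothetical, and the degrees of freedom are updated as ν_k = ν_0 + N_k.

```python
def update_component(Nk, xbar_k, Sk, alpha0, beta0, m0, W0, nu0):
    """Posterior hyperparameters for one mixture component, 1-D case."""
    alpha_k = alpha0 + Nk                       # Dirichlet pseudo-count
    beta_k = beta0 + Nk                         # precision scaling of the mean
    m_k = (beta0 * m0 + Nk * xbar_k) / beta_k   # weighted blend of prior and data mean
    W_inv = (1.0 / W0 + Nk * Sk
             + (beta0 * Nk / (beta0 + Nk)) * (xbar_k - m0) ** 2)
    nu_k = nu0 + Nk                             # degrees of freedom grow with N_k
    return alpha_k, beta_k, m_k, 1.0 / W_inv, nu_k

alpha_k, beta_k, m_k, W_k, nu_k = update_component(
    Nk=4.0, xbar_k=2.0, Sk=0.5, alpha0=1.0, beta0=1.0, m0=0.0, W0=1.0, nu0=1.0)
print(alpha_k, beta_k, m_k, W_k, nu_k)
```

With N_k = 0 every quantity falls back to its prior value, which is how unused components end up carrying only the prior.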

(17)

Mixtures of Gaussians - Parameters

We can now arrive at the parameters

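$$E_{\mu_k,\Lambda_k}[(x_n - \mu_k)^T \Lambda_k (x_n - \mu_k)] = D \beta_k^{-1} + \nu_k (x_n - m_k)^T W_k (x_n - m_k)$$

$$\ln \tilde{\Lambda}_k = E[\ln |\Lambda_k|] = \sum_{i=1}^{D} \psi\Big(\frac{\nu_k + 1 - i}{2}\Big) + D \ln 2 + \ln |W_k|$$

$$\ln \tilde{\pi}_k = E[\ln \pi_k] = \psi(\alpha_k) - \psi(\hat{\alpha}), \qquad \hat{\alpha} = \sum_k \alpha_k$$

Here ψ(·) is the digamma function, defined as ψ(a) = d/da ln Γ(a). The last two results follow from standard properties of the Wishart and Dirichlet distributions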
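The two expectations that involve ψ are easy to evaluate once a digamma routine is available. The sketch below uses only the standard library, so the digamma function is approximated by the usual recurrence plus asymptotic series; the helper names and the example values of α, ν_k and ln |W_k| are assumptions for illustration.

```python
import math

def digamma(x):
    """Approximate psi(x) for x > 0: shift x upward, then use the asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x        # psi(x) = psi(x + 1) - 1/x
        x += 1.0
    inv2 = 1.0 / (x * x)
    # psi(x) ~ ln x - 1/(2x) - 1/(12 x^2) + 1/(120 x^4) - 1/(252 x^6)
    result += math.log(x) - 0.5 / x - inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))
    return result

def expected_log_pi(alpha):
    """E[ln pi_k] = psi(alpha_k) - psi(sum_k alpha_k) for a Dirichlet."""
    psi_total = digamma(sum(alpha))
    return [digamma(a) - psi_total for a in alpha]

def expected_log_det(nu, D, log_det_W):
    """E[ln |Lambda|] for a Wishart with nu degrees of freedom and scale W."""
    return (sum(digamma((nu + 1 - i) / 2.0) for i in range(1, D + 1))
            + D * math.log(2.0) + log_det_W)

log_pi = expected_log_pi([2.0, 3.0, 5.0])
log_det = expected_log_det(nu=5.0, D=2, log_det_W=0.0)
print(log_pi, log_det)
```

Where SciPy is available, `scipy.special.digamma` can replace the hand-written approximation.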

(18)

Mixtures of Gaussians - Parameters

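We can finally find the responsibilities

$$r_{nk} \propto \tilde{\pi}_k \tilde{\Lambda}_k^{1/2} \exp\Big\{ -\frac{D}{2\beta_k} - \frac{\nu_k}{2} (x_n - m_k)^T W_k (x_n - m_k) \Big\}$$

which mirrors the EM form

$$r_{nk} \propto \pi_k |\Lambda_k|^{1/2} \exp\Big\{ -\frac{1}{2} (x_n - \mu_k)^T \Lambda_k (x_n - \mu_k) \Big\}$$

The optimization is stepwise

1 Estimate µ, Λ and then r_nk

2 Estimate π and Z

3 Check for convergence - return to 1 if not converged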

(19)

Mixture of Gaussians - Example

[Figure: four panels labelled 0, 15, 60 and 120]

(20)

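MoG - Variational Lower Bound

We can estimate the best fit / lower bound

$$\mathcal{L} = E[\ln p(X|Z, \mu, \Lambda)] + E[\ln p(Z|\pi)] + E[\ln p(\pi)] + E[\ln p(\mu, \Lambda)] - E[\ln q(Z)] - E[\ln q(\pi)] - E[\ln q(\mu, \Lambda)]$$

$$E[\ln p(X|Z, \mu, \Lambda)] = \frac{1}{2} \sum_k N_k \Big\{ \ln \tilde{\Lambda}_k - D \beta_k^{-1} - \nu_k \mathrm{Tr}(S_k W_k) - \nu_k (\bar{x}_k - m_k)^T W_k (\bar{x}_k - m_k) - D \ln 2\pi \Big\}$$

$$E[\ln p(Z|\pi)] = \sum_n \sum_k r_{nk} \ln \tilde{\pi}_k$$

$$E[\ln p(\pi)] = \ln C(\alpha_0) + (\alpha_0 - 1) \sum_k \ln \tilde{\pi}_k$$

... = ... (see book)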

(22)

Exponential Family Distribution

Recall from 3rd lecture:

Exponential family

$$p(x|\eta) = h(x) g(\eta) \exp\{\eta^T u(x)\}$$

where

η represents the “natural parameters”

g(η) is the normalization “factor”

u(x) is some general function of data

(23)

Exponential Family Distribution

The joint distribution for observed and latent variables is then

$$p(X, Z|\eta) = \prod_{n=1}^{N} h(x_n, z_n) g(\eta) \exp\{\eta^T u(x_n, z_n)\}$$

The conjugate prior for η is then

$$p(\eta|\nu_0, \chi_0) = f(\nu_0, \chi_0) g(\eta)^{\nu_0} \exp\{\nu_0 \eta^T \chi_0\}$$

where ν_0 is the prior number of observations and χ_0 is the prior value of the sufficient statistics (moments)

(24)

Exponential Family Distribution - Variational

As before we can compute

$$\ln q^*(Z) = E_{\eta}[\ln p(X, Z|\eta)] + \text{const} = \sum_n \Big\{ \ln h(x_n, z_n) + E[\eta^T] u(x_n, z_n) \Big\} + \text{const}$$

i.e. a sum of independent terms

Taking the exponential on both sides we have

$$q^*(z_n) = h(x_n, z_n) g(E[\eta]) \exp\{E[\eta^T] u(x_n, z_n)\}$$

(25)

Exponential Family Distribution - Variational

Similarly the natural parameters can be optimized by

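$$\ln q^*(\eta) = \ln p(\eta|\nu_0, \chi_0) + E_Z[\ln p(X, Z|\eta)] + \text{const}$$

Which expands to

$$\ln q^*(\eta) = \nu_0 \ln g(\eta) + \nu_0 \eta^T \chi_0 + \sum_n \Big\{ \ln g(\eta) + \eta^T E_{z_n}[u(x_n, z_n)] \Big\} + \text{const}$$

Using the trick of exponentials on both sides we have

$$q^*(\eta) = f(\nu_N, \chi_N) g(\eta)^{\nu_N} \exp\{\nu_N \eta^T \chi_N\}$$

where

$$\nu_N = \nu_0 + N$$

$$\nu_N \chi_N = \nu_0 \chi_0 + \sum_n E_{z_n}[u(x_n, z_n)]$$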
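The conjugate update amounts to adding expected sufficient statistics to the prior pseudo-observations. The sketch below assumes the prior is parameterized with the exponent ν_0 η^T χ_0, as in the conjugate prior given earlier, so that χ acts as a per-observation average; the function name and the Bernoulli-style data (u(x) = x, no latent variables) are illustrative choices.

```python
def conjugate_update(nu0, chi0, expected_stats):
    """Return (nu_N, chi_N) given prior (nu0, chi0) and a list of E[u(x_n, z_n)] vectors."""
    N = len(expected_stats)
    D = len(chi0)
    nu_N = nu0 + N
    # Sum the expected sufficient statistics over all data points.
    total = [sum(u[d] for u in expected_stats) for d in range(D)]
    # nu_N * chi_N = nu0 * chi0 + sum_n E[u(x_n, z_n)]
    chi_N = [(nu0 * chi0[d] + total[d]) / nu_N for d in range(D)]
    return nu_N, chi_N

# Two prior pseudo-observations with mean 0.5, then four observed values.
nu_N, chi_N = conjugate_update(2.0, [0.5], [[1.0], [1.0], [0.0], [1.0]])
print(nu_N, chi_N)
```

In the variational setting the entries of `expected_stats` would come from the current q(z_n) rather than from fully observed data.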

(26)

Exponential Family Distribution - Variational

As expected the solution is iterative: q^∗(z_n) and q^∗(η) are coupled.

In the E step, compute the expected sufficient statistics E[u(x_n, z_n)] under the current q(z_n) and use them to update q(η)

In the M step, use the updated q(η) to compute E[η^T] and from it the revised q(z_n)

(28)

Expectation Propagation

Fundamentally we are trying to match distributions to the data and match up the natural parameters, i.e. find the “best” family of distributions and at the same time fit the parameters.

In the end we are trying to minimize the Kullback-Leibler (KL) divergence with respect to q(z)

Consider for a minute KL(p||q) where p(z) is fixed and q(z) is a member of the exponential family

$$q(z) = h(z) g(\eta) \exp\{\eta^T u(z)\}$$

(29)

Expectation Propagation - Optimization

The Kullback-Leibler divergence is then

$$\mathrm{KL}(p\|q) = -\ln g(\eta) - \eta^T E_{p(z)}[u(z)] + \text{const}$$

The minimum is then given by

$$-\nabla \ln g(\eta) = E_{p(z)}[u(z)]$$

Since −∇ ln g(η) = E_{q(z)}[u(z)], the best estimate matches q(z) to p(z) by choosing the natural parameters so that the expected sufficient statistics agree (moment matching).

E.g. for q(z) = N(z|µ, Σ), µ and Σ are set to the mean and covariance of p(z)
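As a concrete case of moment matching, take p(z) to be a one-dimensional Gaussian mixture and q(z) a single Gaussian. The sketch below is an illustration with an assumed function name and made-up mixture parameters; it returns the mean and variance that minimize KL(p||q).

```python
def match_gaussian_to_mixture(weights, means, variances):
    """Mean and variance of a 1-D Gaussian mixture, i.e. the moment-matched Gaussian."""
    mean = sum(w * m for w, m in zip(weights, means))
    # E[z^2] of each component is its variance plus its squared mean.
    second = sum(w * (v + m * m) for w, m, v in zip(weights, means, variances))
    return mean, second - mean * mean

mu, var = match_gaussian_to_mixture([0.5, 0.5], [-1.0, 1.0], [1.0, 1.0])
print(mu, var)   # 0.0 2.0
```

The matched Gaussian is broader than either component because it has to cover both modes, which is the typical behaviour of minimizing KL(p||q).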

(30)

Expectation Propagation - Modelling

Consider a model with factorized probabilities

$$p(D, \theta) = \prod_i f_i(\theta)$$

where f_n(θ) = p(x_n|θ) and you might have a prior f_0(θ) = p(θ).

The posterior is then

$$p(\theta|D) = \frac{1}{p(D)} \prod_i f_i(\theta)$$

The model evidence is given by

$$p(D) = \int \prod_i f_i(\theta) \, d\theta$$

(31)

Expectation Propagation - Computing

The estimate is then

$$q(\theta) = \frac{1}{Z} \prod_i \tilde{f}_i(\theta)$$

q(θ) can be factorized so that each term is optimized

Through factor-by-factor optimization it is possible to generate an estimate: take one out and optimize

(32)

Expectation Propagation - Algorithm

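Initialize factor approximations $\tilde{f}_i(\theta)$

Initialize posterior estimate $q(\theta) \propto \prod_i \tilde{f}_i(\theta)$

Iterate

1 Choose a factor $\tilde{f}_j(\theta)$ to refine

2 Remove $\tilde{f}_j(\theta)$ from the current estimate: $q^{\setminus j}(\theta) = q(\theta) / \tilde{f}_j(\theta)$

3 Evaluate new posterior/sufficient statistics

4 Update factors

5 Evaluate approximation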

(33)

Expectation Propagation - Example

[Figure: example density over x for a given θ; horizontal axis from −5 to 10]

$$p(x|\theta) = (1 - w) \mathcal{N}(x|\theta, I) + w \mathcal{N}(x|0, aI)$$
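For this mixture likelihood the EP loop can be written in a few lines in one dimension. The sketch below is an assumption-laden illustration rather than a reference implementation: it takes a scalar θ with a Gaussian prior N(θ|0, b), uses invented values for w, a, b and the data, stores q and the approximate factors as natural parameters, and simply skips a factor whenever its cavity distribution would be improper.

```python
import math

def normal_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def ep_clutter(xs, w=0.3, a=10.0, b=100.0, sweeps=20):
    """EP for p(x|theta) = (1-w) N(x|theta,1) + w N(x|0,a) with prior N(theta|0,b)."""
    # q(theta) in natural parameters: precision tau and precision-times-mean nu.
    tau, nu = 1.0 / b, 0.0                  # q starts at the prior
    site_tau = [0.0] * len(xs)              # approximate factors start as 1
    site_nu = [0.0] * len(xs)
    for _ in range(sweeps):
        for n, x in enumerate(xs):
            # Remove factor n from q to obtain the cavity distribution.
            tau_c, nu_c = tau - site_tau[n], nu - site_nu[n]
            if tau_c <= 0.0:
                continue                    # improper cavity: skip this factor
            v_c, m_c = 1.0 / tau_c, nu_c / tau_c
            # Mean and variance of (cavity * exact factor) by moment matching.
            clutter = w * normal_pdf(x, 0.0, a)
            z = (1.0 - w) * normal_pdf(x, m_c, v_c + 1.0) + clutter
            rho = 1.0 - clutter / z         # probability that x is not clutter
            m_new = m_c + rho * v_c / (v_c + 1.0) * (x - m_c)
            v_new = (v_c - rho * v_c ** 2 / (v_c + 1.0)
                     + rho * (1.0 - rho) * v_c ** 2 * (x - m_c) ** 2 / (v_c + 1.0) ** 2)
            # New q, and refined factor n = new q divided by the cavity.
            tau, nu = 1.0 / v_new, m_new / v_new
            site_tau[n], site_nu[n] = tau - tau_c, nu - nu_c
    return nu / tau, 1.0 / tau

# Five points near 2 plus two outliers that the clutter term should absorb.
data = [1.8, 2.1, 2.4, 1.9, 2.2, -3.0, 6.5]
post_mean, post_var = ep_clutter(data)
print(post_mean, post_var)
```

Working in natural parameters makes the "remove a factor" and "update a factor" steps plain subtractions; approximate factors with negative precision are allowed, which is why the cavity check is needed.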

(34)

Expectation Propagation - Example

[Figure: two panels, each with horizontal axis θ from −5 to 10]

(36)

Summary

Often computation of the complete model is a challenge

Two ways to approximate computations

Deterministic Approximations

Sampling Based Methods

Many tricks for approximation

Factorization is typically a first strategy

Iterative optimization of factors

Next time we will talk about sampling based methods