Approximate Inference

Academic year: 2022

Henrik I. Christensen

Robotics & Intelligent Machines @ GT, Georgia Institute of Technology,

Atlanta, GA 30332-0280

Outline

1 Introduction

2 Variational Inference

3 Variational Mixture of Gaussians

4 Exponential Family

5 Expectation Propagation

6 Summary

Introduction

We are often required to estimate a (conditional) posterior of the form p(Z|X)

The solution might be intractable

1 There might not be a closed-form solution

2 The integration over X or a parameter space θ might be computationally challenging

3 The set of possible outcomes might be significant/exponential

Two strategies

1 Deterministic Approximation Methods

2 Stochastic Sampling (Monte Carlo Techniques)

Today we will talk about deterministic techniques

Variational Inference

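In general we have a Bayesian model as seen earlier, i.e.

$$\ln p(X) = \ln p(X, Z) - \ln p(Z|X)$$

We can rewrite this as

$$\ln p(X) = \mathcal{L}(q) + \mathrm{KL}(q\|p)$$

where

$$\mathcal{L}(q) = \int q(Z) \ln\left\{ \frac{p(X, Z)}{q(Z)} \right\} dZ$$

$$\mathrm{KL}(q\|p) = -\int q(Z) \ln\left\{ \frac{p(Z|X)}{q(Z)} \right\} dZ$$

So L(q) is a lower bound on ln p(X), because KL(q||p) ≥ 0, and KL(q||p) is the Kullback-Leibler divergence between q(Z) and the posterior p(Z|X). Maximizing L(q) over q is therefore equivalent to minimizing the KL divergence.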
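
The decomposition can be checked numerically. The sketch below assumes a discrete latent variable with three states, so the integrals become sums; the joint values and the choice of q are made up for illustration, and the function name `elbo_and_kl` is hypothetical.

```python
import math

def elbo_and_kl(joint, q):
    """Return (ln p(X), L(q), KL(q||p)) for a discrete latent variable.

    joint[z] = p(X = x_obs, Z = z) for the observed x_obs; q[z] = q(Z = z).
    """
    evidence = sum(joint)                      # p(X) = sum_z p(X, Z = z)
    posterior = [j / evidence for j in joint]  # p(Z | X)
    lower_bound = sum(qz * math.log(jz / qz) for jz, qz in zip(joint, q) if qz > 0)
    kl = -sum(qz * math.log(pz / qz) for pz, qz in zip(posterior, q) if qz > 0)
    return math.log(evidence), lower_bound, kl

joint = [0.10, 0.25, 0.05]   # assumed values of p(X = x_obs, Z = z)
q = [0.2, 0.5, 0.3]          # an arbitrary approximating distribution
log_evidence, lower_bound, kl = elbo_and_kl(joint, q)
```

For any valid q the two terms sum to ln p(X), and the KL term vanishes only when q equals the posterior.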

Factorized Distributions

Assume for now that we can factorize Z into disjoint groups so that

$$q(Z) = \prod_{i=1}^{M} q_i(Z_i)$$

In physics a similar model has been adopted, termed mean field theory.

We can then optimize L(q) through a component-wise optimization

$$\mathcal{L}(q) = \int \prod_i q_i \left\{ \ln p(X, Z) - \sum_i \ln q_i \right\} dZ$$

$$= \int q_j \ln \tilde{p}(X, Z_j)\, dZ_j - \int q_j \ln q_j\, dZ_j + \text{const}$$

where

$$\ln \tilde{p}(X, Z_j) = E_{i \neq j}[\ln p(X, Z)] + c, \qquad E_{i \neq j}[\ln p(X, Z)] = \int \ln p(X, Z) \prod_{i \neq j} q_i\, dZ_i$$

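Factorized distributions

The optimal solution is now

$$\ln q_j^{*}(Z_j) = E_{i \neq j}[\ln p(X, Z)] + c$$

I.e. each factor q_j is set to the distribution that maximizes L(q) while the other factors are held fixed. Because the right-hand side depends on the other factors, the updates are cycled over j until convergence.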
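
A minimal sketch of this component-wise update, assuming two discrete latent variables so that the expectations are finite sums. The joint table is made up, and the helper names (`softmax`, `lower_bound`, `mean_field`) are hypothetical.

```python
import math

def softmax(logits):
    """Normalize exp(logits) in a numerically stable way."""
    top = max(logits)
    weights = [math.exp(v - top) for v in logits]
    total = sum(weights)
    return [w / total for w in weights]

def lower_bound(joint, q1, q2):
    """L(q) = sum_{z1,z2} q1 q2 [ln p(X, z1, z2) - ln q1 - ln q2]."""
    total = 0.0
    for i, a in enumerate(q1):
        for j, b in enumerate(q2):
            if a > 0 and b > 0:
                total += a * b * (math.log(joint[i][j]) - math.log(a) - math.log(b))
    return total

def mean_field(joint, sweeps=20):
    """Coordinate ascent on q(z1, z2) = q1(z1) q2(z2); joint[i][j] = p(X, z1=i, z2=j)."""
    rows, cols = len(joint), len(joint[0])
    q1 = [1.0 / rows] * rows
    q2 = [1.0 / cols] * cols
    history = [lower_bound(joint, q1, q2)]
    for _ in range(sweeps):
        # ln q1(z1) = E_{q2}[ln p(X, z1, z2)] + c
        q1 = softmax([sum(q2[j] * math.log(joint[i][j]) for j in range(cols))
                      for i in range(rows)])
        # ln q2(z2) = E_{q1}[ln p(X, z1, z2)] + c
        q2 = softmax([sum(q1[i] * math.log(joint[i][j]) for i in range(rows))
                      for j in range(cols)])
        history.append(lower_bound(joint, q1, q2))
    return q1, q2, history

joint = [[0.30, 0.05], [0.05, 0.10]]   # assumed strictly positive table; p(X) = 0.5
q1, q2, history = mean_field(joint)
```

Each update maximizes L(q) over one factor, so the values in `history` never decrease and stay below ln p(X).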

Variational Mixture of Gaussians

We encounter mixtures of Gaussians all the time

Examples are multi-wall modelling, ambiguous localization, ...

We have:

a set of observed data X ,

a set of latent variables, Z that describe the mixture

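Mixture of Gaussians - Modelling

We can model the mixture assignments as

$$p(Z|\pi) = \prod_{n=1}^{N} \prod_{k=1}^{K} \pi_k^{z_{nk}}$$

We can also derive the observed conditional

$$p(X|Z, \mu, \Lambda) = \prod_{n=1}^{N} \prod_{k=1}^{K} \mathcal{N}(x_n|\mu_k, \Lambda_k^{-1})^{z_{nk}}$$

We will for now assume that the mixing coefficients are modelled with a Dirichlet prior

$$p(\pi) = \mathrm{Dir}(\pi|\alpha_0) = C(\alpha_0) \prod_{k=1}^{K} \pi_k^{\alpha_0 - 1}$$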

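Mixture of Gaussians - Modelling

The component processes can be modelled as a Gaussian-Wishart

$$p(\mu, \Lambda) = p(\mu|\Lambda)\, p(\Lambda) = \prod_{k=1}^{K} \mathcal{N}(\mu_k|m_0, (\beta_0 \Lambda_k)^{-1})\, \mathcal{W}(\Lambda_k|W_0, \nu_0)$$

I.e. a total model of

[Figure: graphical model over π, z_n, x_n, µ and Λ, with a plate over the N observations]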

Mixtures of Gaussians - Variational

The joint model can be written as

$$p(X, Z, \pi, \mu, \Lambda) = p(X|Z, \mu, \Lambda)\, p(Z|\pi)\, p(\pi)\, p(\mu|\Lambda)\, p(\Lambda)$$

Only X is observed

We can now consider the selection of a distribution

$$q(Z, \pi, \mu, \Lambda) = q(Z)\, q(\pi, \mu, \Lambda)$$

This is clearly an assumption of independence between the latent variables and the parameters.

We can use the general result of component-wise optimization

$$\ln q(Z) = E_{\pi,\mu,\Lambda}[\ln p(X, Z, \pi, \mu, \Lambda)] + \text{const}$$

Decomposition gives us

$$\ln q(Z) = E_{\pi}[\ln p(Z|\pi)] + E_{\mu,\Lambda}[\ln p(X|Z, \mu, \Lambda)] + \text{const}$$

$$\ln q(Z) = \sum_{n=1}^{N} \sum_{k=1}^{K} z_{nk} \ln \rho_{nk} + \text{const}$$

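Mixtures of Gaussians - Variational

We can further achieve

$$\ln \rho_{nk} = E[\ln \pi_k] + \frac{1}{2} E[\ln |\Lambda_k|] - \frac{D}{2} \ln 2\pi - \frac{1}{2} E_{\mu_k,\Lambda_k}[(x_n - \mu_k)^T \Lambda_k (x_n - \mu_k)] + c$$

Taking the exponential we have

$$q(Z) \propto \prod_{n=1}^{N} \prod_{k=1}^{K} \rho_{nk}^{z_{nk}}$$

Using normalization we arrive at

$$q(Z) = \prod_{n=1}^{N} \prod_{k=1}^{K} r_{nk}^{z_{nk}}$$

where

$$r_{nk} = \frac{\rho_{nk}}{\sum_{j} \rho_{nj}}$$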
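
In practice ln ρ_nk can be very negative, so exponentiating it directly may underflow. A common workaround, sketched here as an assumption about implementation rather than something the slides prescribe, is to subtract the row maximum before exponentiating (the log-sum-exp trick). The function name `responsibilities` is hypothetical.

```python
import math

def responsibilities(log_rho):
    """Row-normalize exp(log_rho).

    log_rho[n][k] = ln rho_nk; returns r[n][k] = rho_nk / sum_j rho_nj.
    """
    r = []
    for row in log_rho:
        top = max(row)                               # subtract the row maximum
        weights = [math.exp(v - top) for v in row]   # the largest weight is exactly 1
        total = sum(weights)
        r.append([w / total for w in weights])
    return r

r = responsibilities([[0.0, math.log(3.0)], [-1000.0, -1001.0]])
```

The second row would underflow to 0/0 without the shift; with it, the row still normalizes to one.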

Mixtures of Gaussians - Variational

Just as we saw for EM we can define

$$N_k = \sum_{n=1}^{N} r_{nk}$$

$$\bar{x}_k = \frac{1}{N_k} \sum_{n=1}^{N} r_{nk} x_n$$

$$S_k = \frac{1}{N_k} \sum_{n=1}^{N} r_{nk} (x_n - \bar{x}_k)(x_n - \bar{x}_k)^T$$
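
These statistics are plain responsibility-weighted sums. A small sketch using nested lists follows; the function name `weighted_stats` and the toy data are hypothetical, and every component is assumed to have N_k > 0.

```python
def weighted_stats(X, r):
    """Return (Nk, xbar, S) for data X[n][d] and responsibilities r[n][k]."""
    N, D, K = len(X), len(X[0]), len(r[0])
    Nk = [sum(r[n][k] for n in range(N)) for k in range(K)]
    xbar = [[sum(r[n][k] * X[n][d] for n in range(N)) / Nk[k] for d in range(D)]
            for k in range(K)]
    S = []
    for k in range(K):
        Sk = [[0.0] * D for _ in range(D)]
        for n in range(N):
            diff = [X[n][d] - xbar[k][d] for d in range(D)]
            for i in range(D):
                for j in range(D):
                    # r_nk (x_n - xbar_k)(x_n - xbar_k)^T / N_k, accumulated entry by entry
                    Sk[i][j] += r[n][k] * diff[i] * diff[j] / Nk[k]
        S.append(Sk)
    return Nk, xbar, S

X = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]
r = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]   # hard assignments for illustration
Nk, xbar, S = weighted_stats(X, r)
```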

Mixtures of Gaussians - Parameters/Mixture

Let's now consider q(π, µ, Λ) to arrive at

$$\ln q(\pi, \mu, \Lambda) = \ln p(\pi) + \sum_{k=1}^{K} \ln p(\mu_k, \Lambda_k) + E_Z[\ln p(Z|\pi)] + \sum_{k=1}^{K} \sum_{n=1}^{N} E[z_{nk}] \ln \mathcal{N}(x_n|\mu_k, \Lambda_k^{-1}) + c$$

We can partition the problem into

$$q(\pi, \mu, \Lambda) = q(\pi) \prod_{k=1}^{K} q(\mu_k, \Lambda_k)$$

We can derive

$$\ln q(\pi) = (\alpha_0 - 1) \sum_{k=1}^{K} \ln \pi_k + \sum_{k=1}^{K} \sum_{n=1}^{N} r_{nk} \ln \pi_k + c$$

We can now derive

$$q(\pi) = \mathrm{Dir}(\pi|\alpha)$$

where

$$\alpha_k = \alpha_0 + N_k$$

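Mixtures of Gaussians - Parameters/Mixture

We can then derive

$$q(\mu_k, \Lambda_k) = \mathcal{N}(\mu_k|m_k, (\beta_k \Lambda_k)^{-1})\, \mathcal{W}(\Lambda_k|W_k, \nu_k)$$

where

$$\beta_k = \beta_0 + N_k$$

$$m_k = \frac{1}{\beta_k} (\beta_0 m_0 + N_k \bar{x}_k)$$

$$W_k^{-1} = W_0^{-1} + N_k S_k + \frac{\beta_0 N_k}{\beta_0 + N_k} (\bar{x}_k - m_0)(\bar{x}_k - m_0)^T$$

$$\nu_k = \nu_0 + N_k$$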

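Mixtures of Gaussians - Parameters

We can now arrive at the expectations

$$E_{\mu_k,\Lambda_k}[(x_n - \mu_k)^T \Lambda_k (x_n - \mu_k)] = D \beta_k^{-1} + \nu_k (x_n - m_k)^T W_k (x_n - m_k)$$

$$\ln \tilde{\Lambda}_k = E[\ln |\Lambda_k|] = \sum_{i=1}^{D} \psi\left( \frac{\nu_k + 1 - i}{2} \right) + D \ln 2 + \ln |W_k|$$

$$\ln \tilde{\pi}_k = E[\ln \pi_k] = \psi(\alpha_k) - \psi(\hat{\alpha}), \qquad \hat{\alpha} = \sum_k \alpha_k$$

Here ψ(·) is the digamma function, defined as ψ(a) = d/da ln Γ(a). The last two results follow from standard properties of the Wishart and Dirichlet distributions.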
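
The Python standard library has no digamma function, so the sketch below evaluates ψ with a recurrence followed by an asymptotic series (a standard numerical approach, not taken from the slides) and uses it for the two log-expectations. All function names are hypothetical, and ln|W_k| is assumed to be supplied by the caller.

```python
import math

def digamma(x):
    """psi(x) for x > 0: shift x up to at least 6, then use an asymptotic series."""
    shift = 0.0
    while x < 6.0:
        shift -= 1.0 / x          # psi(x) = psi(x + 1) - 1/x
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = inv2 * (1.0 / 12.0 - inv2 * (1.0 / 120.0 - inv2 / 252.0))
    return shift + math.log(x) - 0.5 / x - series

def expected_log_pi(alpha):
    """E[ln pi_k] = psi(alpha_k) - psi(sum_k alpha_k) under Dir(pi | alpha)."""
    psi_total = digamma(sum(alpha))
    return [digamma(a) - psi_total for a in alpha]

def expected_log_det_precision(nu, log_det_W, D):
    """E[ln |Lambda|] under a Wishart with nu degrees of freedom and scale matrix W."""
    return (sum(digamma((nu + 1.0 - i) / 2.0) for i in range(1, D + 1))
            + D * math.log(2.0) + log_det_W)

ln_pi_tilde = expected_log_pi([1.0, 1.0])
```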

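Mixtures of Gaussians - Parameters

We can finally find the responsibilities

$$r_{nk} \propto \tilde{\pi}_k \tilde{\Lambda}_k^{1/2} \exp\left\{ -\frac{1}{2} E_{\mu_k,\Lambda_k}[(x_n - \mu_k)^T \Lambda_k (x_n - \mu_k)] \right\}$$

which has the same form as the EM responsibilities $r_{nk} \propto \pi_k |\Lambda_k|^{1/2} \exp\{ -\frac{1}{2} (x_n - \mu_k)^T \Lambda_k (x_n - \mu_k) \}$, with the point estimates replaced by expectations under q.

The optimization is stepwise

1 Estimate q(µ, Λ) and then r_nk

2 Estimate q(π) and q(Z)

3 Check for convergence; return to 1 if not converged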
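
The whole loop can be sketched for one-dimensional data (D = 1), where the Wishart reduces to a scalar and every matrix expression becomes a scalar one. This is a simplified illustration under stated assumptions, not the lecture's implementation: the hyperparameter defaults, the initialization by evenly spaced centres, the fixed iteration count in place of a convergence check, and all names are hypothetical. It uses ν_k = ν_0 + N_k, the standard conjugate update.

```python
import math

def digamma(x):
    """psi(x) for x > 0: recurrence up to x >= 6, then an asymptotic series."""
    shift = 0.0
    while x < 6.0:
        shift -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return shift + math.log(x) - 0.5 / x - inv2 * (1 / 12 - inv2 * (1 / 120 - inv2 / 252))

def vb_gmm_1d(xs, K, alpha0=1.0, beta0=0.01, m0=0.0, W0=1.0, nu0=1.0, iters=50):
    N = len(xs)
    lo, hi = min(xs), max(xs)
    # Assumed initialization: hard-assign each point to the nearest of K spaced centres.
    centres = [lo + (hi - lo) * (k + 0.5) / K for k in range(K)]
    r = []
    for x in xs:
        nearest = min(range(K), key=lambda k: abs(x - centres[k]))
        r.append([1.0 if k == nearest else 0.0 for k in range(K)])
    for _ in range(iters):
        # Update q(pi) and q(mu, Lambda) from the current responsibilities.
        Nk = [sum(r[n][k] for n in range(N)) + 1e-10 for k in range(K)]  # guard empty components
        xbar = [sum(r[n][k] * xs[n] for n in range(N)) / Nk[k] for k in range(K)]
        Sk = [sum(r[n][k] * (xs[n] - xbar[k]) ** 2 for n in range(N)) / Nk[k] for k in range(K)]
        alpha = [alpha0 + Nk[k] for k in range(K)]
        beta = [beta0 + Nk[k] for k in range(K)]
        m = [(beta0 * m0 + Nk[k] * xbar[k]) / beta[k] for k in range(K)]
        W = [1.0 / (1.0 / W0 + Nk[k] * Sk[k]
                    + beta0 * Nk[k] / (beta0 + Nk[k]) * (xbar[k] - m0) ** 2) for k in range(K)]
        nu = [nu0 + Nk[k] for k in range(K)]
        # Update q(Z): evaluate the expectations, then normalize the responsibilities.
        psi_total = digamma(sum(alpha))
        for n in range(N):
            log_rho = []
            for k in range(K):
                ln_pi = digamma(alpha[k]) - psi_total
                ln_lam = digamma(nu[k] / 2.0) + math.log(2.0) + math.log(W[k])
                quad = 1.0 / beta[k] + nu[k] * W[k] * (xs[n] - m[k]) ** 2
                log_rho.append(ln_pi + 0.5 * ln_lam - 0.5 * math.log(2.0 * math.pi) - 0.5 * quad)
            top = max(log_rho)
            weights = [math.exp(v - top) for v in log_rho]
            total = sum(weights)
            r[n] = [w / total for w in weights]
    return alpha, m, r

xs = [-5.2, -5.0, -4.8, -5.1, -4.9, 4.8, 5.0, 5.2, 5.1, 4.9]   # two well-separated groups
alpha, m, r = vb_gmm_1d(xs, K=2)
```

On this toy data the posterior means `m` settle close to the two group centres, shrunk very slightly towards `m0` by the prior.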

Mixture of Gaussians - Example

[Figure: example fit of the variational mixture of Gaussians; panels labelled 0, 15, 60 and 120]

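MoG - Variational Lower Bound

We can estimate the best fit / lower bound

$$\mathcal{L} = E[\ln p(X|Z, \mu, \Lambda)] + E[\ln p(Z|\pi)] + E[\ln p(\pi)] + E[\ln p(\mu, \Lambda)] - E[\ln q(Z)] - E[\ln q(\pi)] - E[\ln q(\mu, \Lambda)]$$

$$E[\ln p(X|Z, \mu, \Lambda)] = \frac{1}{2} \sum_k N_k \left\{ \ln \tilde{\Lambda}_k - D \beta_k^{-1} - \nu_k \mathrm{Tr}(S_k W_k) - \nu_k (\bar{x}_k - m_k)^T W_k (\bar{x}_k - m_k) - D \ln 2\pi \right\}$$

$$E[\ln p(Z|\pi)] = \sum_n \sum_k r_{nk} \ln \tilde{\pi}_k$$

$$E[\ln p(\pi)] = \ln C(\alpha_0) + (\alpha_0 - 1) \sum_k \ln \tilde{\pi}_k$$

... = ... (see book)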

Exponential Family Distribution

Recall from 3rd lecture:

Exponential family

$$p(x|\eta) = h(x)\, g(\eta) \exp\{ \eta^T u(x) \}$$

where

η represents the “natural parameters”

g(η) is the normalization “factor”

u(x) is some general function of data

Exponential Family Distribution

The joint distribution for observed and latent variables is then

$$p(X, Z|\eta) = \prod_{n=1}^{N} h(x_n, z_n)\, g(\eta) \exp\{ \eta^T u(x_n, z_n) \}$$

The conjugate prior for η is then

$$p(\eta|\nu_0, \chi_0) = f(\nu_0, \chi_0)\, g(\eta)^{\nu_0} \exp\{ \nu_0 \eta^T \chi_0 \}$$

where ν0 is the prior number of observations and χ0 is the prior value of the sufficient statistics (moments)

Exponential Family Distribution - Variational

As before we can compute

$$\ln q(Z) = E_{\eta}[\ln p(X, Z|\eta)] + \text{const} = \sum_{n} \left\{ \ln h(x_n, z_n) + E[\eta^T] u(x_n, z_n) \right\} + \text{const}$$

i.e. a sum of independent terms, so q(Z) factorizes over n

Taking the exponential on both sides we have

$$q(z_n) = h(x_n, z_n)\, g(E[\eta]) \exp\{ E[\eta^T] u(x_n, z_n) \}$$

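Exponential Family Distribution - Variational

Similarly the natural parameters can be optimized by

$$\ln q(\eta) = \ln p(\eta|\nu_0, \chi_0) + E_Z[\ln p(X, Z|\eta)] + \text{const}$$

Which expands to

$$\ln q(\eta) = \nu_0 \ln g(\eta) + \nu_0 \eta^T \chi_0 + \sum_{n} \left\{ \ln g(\eta) + \eta^T E_{z_n}[u(x_n, z_n)] \right\} + \text{const}$$

Using the trick of exponentials on both sides we have

$$q(\eta) = f(\nu_N, \chi_N)\, g(\eta)^{\nu_N} \exp\{ \nu_N \eta^T \chi_N \}$$

where

$$\nu_N = \nu_0 + N$$

$$\nu_N \chi_N = \nu_0 \chi_0 + \sum_{n} E_{z_n}[u(x_n, z_n)]$$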

Exponential Family Distribution - Variational

As expected the solution is iterative, since q(z_n) and q(η) are coupled.

In the E step, compute E[u(x_n, z_n)], the sufficient statistics, and use them to compute q(η)

In the M step, use q(η) to compute E[η^T] and re-estimate q(z_n)

Expectation Propagation

Fundamentally we are trying to match distributions to the data and match up the natural parameters, i.e. find the “best” family of distributions and at the same time fit the parameters.

In the end we are trying to minimize the Kullback-Leibler (KL) divergence with respect to q(z)

Consider for a minute KL(p||q) where p(z) is fixed and q(z) is a member of the exponential family

$$q(z) = h(z)\, g(\eta) \exp\{ \eta^T u(z) \}$$

Expectation Propagation - Optimization

The Kullback-Leibler divergence is then

$$\mathrm{KL}(p\|q) = -\ln g(\eta) - \eta^T E_{p(z)}[u(z)] + \text{const}$$

The extremum is then given by

$$-\nabla \ln g(\eta) = E_{p(z)}[u(z)]$$

Since −∇ ln g(η) = E_{q(z)}[u(z)] for the exponential family, the best estimate matches q(z) to p(z) by choosing the “natural parameters” so that the expected sufficient statistics agree (moment matching).

E.g. for q(z) = N(z|µ, Σ), µ and Σ are set to the mean and covariance of p(z)
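
As a small worked instance of moment matching, the sketch below takes p(z) to be a one-dimensional Gaussian mixture (an assumption made only for illustration) and returns the mean and variance of the single Gaussian q(z) that minimizes KL(p||q). The function name is hypothetical.

```python
def match_gaussian_to_mixture(weights, means, variances):
    """Mean and variance of p(z) = sum_i w_i N(z | mu_i, var_i).

    By moment matching these are the parameters of the Gaussian q(z)
    that minimizes KL(p||q).
    """
    mean = sum(w * mu for w, mu in zip(weights, means))
    second_moment = sum(w * (var + mu * mu)
                        for w, mu, var in zip(weights, means, variances))
    return mean, second_moment - mean * mean

mean, variance = match_gaussian_to_mixture([0.5, 0.5], [-1.0, 1.0], [1.0, 1.0])
```

For this symmetric mixture the matched Gaussian is centred at 0 with variance 2: it covers both modes rather than locking onto one, which is characteristic of minimizing KL(p||q).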

Expectation Propagation - Modelling

Consider a model with factorized probabilities

$$p(D, \theta) = \prod_i f_i(\theta)$$

where f_n(θ) = p(x_n|θ) and you might have a prior f_0(θ) = p(θ).

The posterior is then

$$p(\theta|D) = \frac{1}{p(D)} \prod_i f_i(\theta)$$

The model evidence is given by

$$p(D) = \int \prod_i f_i(\theta)\, d\theta$$

Expectation Propagation - Computing

The estimate is then

$$q(\theta) = \frac{1}{Z} \prod_i \tilde{f}_i(\theta)$$

q(θ) can be factorized so that each term is optimized

Through optimization factor-by-factor it is possible to generate an estimate (take-one-out-and-optimize)

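Expectation Propagation - Algorithm

Initialize factor approximation - $\tilde{f}_i(\theta)$

Initialize posterior estimate - $q(\theta) \propto \prod_i \tilde{f}_i(\theta)$

Iterate

1 Choose a factor $\tilde{f}_j(\theta)$ to refine

2 Remove $\tilde{f}_j(\theta)$ from the current estimate: $q^{\setminus j}(\theta) = q(\theta) / \tilde{f}_j(\theta)$

3 Evaluate new posterior/sufficient statistics

4 Update factors

5 Evaluate approximation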

Expectation Propagation - Example

[Figure: example plotted over θ and x, with an axis from −5 to 10]

$$p(x|\theta) = (1 - w)\, \mathcal{N}(x|\theta, I) + w\, \mathcal{N}(x|0, aI)$$
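
A one-dimensional sketch of expectation propagation for this mixture likelihood (often called the clutter problem) is given below. It assumes a Gaussian prior p(θ) = N(θ|0, b) and a Gaussian q(θ), neither of which is stated on the slide; the values of w, a, b, the data, and all names are made up. Each approximate factor is stored in natural parameters (precision and precision times mean), so removing a factor from q is a subtraction.

```python
import math

def normal_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def ep_clutter(xs, w=0.2, a=10.0, b=100.0, sweeps=20):
    """Return (mean, variance) of the Gaussian q(theta) found by EP."""
    tau = [0.0] * len(xs)          # factor precisions, initialized to flat factors
    nu = [0.0] * len(xs)           # factor precision * mean
    q_tau, q_nu = 1.0 / b, 0.0     # q starts as the assumed prior N(0, b)
    for _ in range(sweeps):
        for n, x in enumerate(xs):
            # Remove factor n from q to get the cavity distribution.
            cav_tau = q_tau - tau[n]
            if cav_tau <= 0.0:
                continue           # skip updates with an improper cavity
            cav_nu = q_nu - nu[n]
            v_cav, m_cav = 1.0 / cav_tau, cav_nu / cav_tau
            # Moments of cavity * exact factor (a two-component mixture in theta).
            clutter = w * normal_pdf(x, 0.0, a)
            Z = (1.0 - w) * normal_pdf(x, m_cav, v_cav + 1.0) + clutter
            rho = 1.0 - clutter / Z            # probability that x is not clutter
            m_new = m_cav + rho * v_cav / (v_cav + 1.0) * (x - m_cav)
            v_new = (v_cav - rho * v_cav ** 2 / (v_cav + 1.0)
                     + rho * (1.0 - rho) * v_cav ** 2 * (x - m_cav) ** 2 / (v_cav + 1.0) ** 2)
            # Moment-matched q, then the refined factor as q / cavity.
            q_tau, q_nu = 1.0 / v_new, m_new / v_new
            tau[n] = q_tau - cav_tau
            nu[n] = q_nu - cav_nu
    return q_nu / q_tau, 1.0 / q_tau

xs = [1.8, 2.1, 2.0, 2.2, 1.9, -4.0]   # five points near theta = 2 and one outlier
mean, var = ep_clutter(xs)
```

The outlier at −4 is explained almost entirely by the clutter component, so it barely moves the estimate of θ.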

Expectation Propagation - Example

[Figure: two panels plotted against θ, with axes from −5 to 10]

Summary

Often computation of the complete model is a challenge

Two ways to approximate computations

Deterministic Approximations

Sampling Based Methods

Many tricks for approximation

Factorization is typically a first strategy

Iterative optimization of factors

Next time we will talk about sampling based methods
