Approximate Inference
Henrik I. Christensen
Robotics & Intelligent Machines @ GT Georgia Institute of Technology,
Atlanta, GA 30332-0280 [email protected]
Outline
1 Introduction
2 Variational Inference
3 Variational Mixture of Gaussians
4 Exponential Family
5 Expectation Propagation
6 Summary
Introduction
We are often required to evaluate a (conditional) posterior of the form p(Z|X)
The solution might be intractable
1 There might not be a closed-form solution
2 The integration over the latent variables Z or a parameter space θ might be computationally challenging
3 The set of possible outcomes might be very large (exponential in the number of variables)
Two strategies
1 Deterministic Approximation Methods
2 Stochastic Sampling (Monte Carlo Techniques)
Today we will talk about deterministic techniques
Variational Inference
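In general we have a Bayesian model as seen earlier, i.e. for any value of Z
$$\ln p(X) = \ln p(X, Z) - \ln p(Z|X)$$
Taking the expectation of both sides under an arbitrary distribution q(Z), we can rewrite this as
$$\ln p(X) = \mathcal{L}(q) + \mathrm{KL}(q\|p)$$
where
$$\mathcal{L}(q) = \int q(Z) \ln \left\{ \frac{p(X, Z)}{q(Z)} \right\} dZ$$
$$\mathrm{KL}(q\|p) = -\int q(Z) \ln \left\{ \frac{p(Z|X)}{q(Z)} \right\} dZ$$
Since KL(q||p) ≥ 0, L(q) is a lower bound on ln p(X), and KL(q||p) is the Kullback-Leibler divergence between q(Z) and the posterior p(Z|X). Maximizing L(q) with respect to q is therefore equivalent to minimizing the divergence.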
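The decomposition of ln p(X) can be checked numerically on a toy problem. The sketch below assumes a latent variable Z with three discrete states; the joint probabilities and the choice of q are made-up illustrative numbers, not values from the lecture.

```python
import math

# Hypothetical joint p(X = x_obs, Z = z) for a latent Z with three states.
joint = [0.10, 0.25, 0.05]
evidence = sum(joint)                          # p(x_obs) = sum_z p(x_obs, z)
posterior = [j / evidence for j in joint]      # p(z | x_obs)

# An arbitrary approximating distribution q(Z); any strictly positive choice works.
q = [0.5, 0.3, 0.2]

# Lower bound L(q) = sum_z q(z) ln( p(x, z) / q(z) )
lower_bound = sum(qz * math.log(j / qz) for qz, j in zip(q, joint))

# KL(q || p) = - sum_z q(z) ln( p(z | x) / q(z) )
kl = -sum(qz * math.log(pz / qz) for qz, pz in zip(q, posterior))

log_evidence = math.log(evidence)
print(lower_bound + kl, log_evidence)          # the two numbers agree
```

Whatever q is chosen, the bound and the divergence add up to the same log evidence, so pushing the bound up necessarily pushes the divergence down.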
Factorized Distributions
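Assume for now that we can factorize Z into disjoint groups so that
$$q(Z) = \prod_{i=1}^{M} q_i(Z_i)$$
In physics a similar model has been adopted, termed mean field theory.
We can then optimize L(q) through a component-wise optimization. Isolating the terms that depend on a single factor q_j gives
$$\mathcal{L}(q) = \int \prod_i q_i \left\{ \ln p(X, Z) - \sum_i \ln q_i \right\} dZ$$
$$= \int q_j \ln \tilde{p}(X, Z_j) \, dZ_j - \int q_j \ln q_j \, dZ_j + \text{const}$$
where
$$\ln \tilde{p}(X, Z_j) = E_{i \neq j}[\ln p(X, Z)] + c, \qquad E_{i \neq j}[\ln p(X, Z)] = \int \ln p(X, Z) \prod_{i \neq j} q_i \, dZ_i$$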
Factorized distributions
The optimal solution is now
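$$\ln q_j^*(Z_j) = E_{i \neq j}[\ln p(X, Z)] + c$$
I.e. each factor is chosen to maximize L(q) with all other factors held fixed. Because the right-hand side depends on the other factors, the updates are cycled through the factors until convergence.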
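The component-wise update can be sketched on a case where it has a closed form: approximating a correlated two-dimensional Gaussian p(z) = N(z|µ, Λ⁻¹) by q(z1)q(z2). For this target each optimal factor is a Gaussian with variance 1/Λ_jj and a mean that depends on the current mean of the other factor. The target values and the function name `mean_field_gaussian` are illustrative assumptions.

```python
# Illustrative target: mean and (symmetric, positive definite) precision matrix.
mu = (1.0, -2.0)
lam11, lam12, lam22 = 2.0, 1.2, 1.5

def mean_field_gaussian(mu, lam11, lam12, lam22, iters=100):
    """Cycle the two factor updates; returns (mean, variance) for each factor."""
    m1, m2 = 0.0, 0.0                       # arbitrary starting means
    for _ in range(iters):
        # ln q1*(z1) = E_{z2}[ln p(z)] + c is quadratic in z1, hence Gaussian.
        m1 = mu[0] - lam12 / lam11 * (m2 - mu[1])
        # Same update for the second factor, using the fresh m1.
        m2 = mu[1] - lam12 / lam22 * (m1 - mu[0])
    return (m1, 1.0 / lam11), (m2, 1.0 / lam22)

(q1_mean, q1_var), (q2_mean, q2_var) = mean_field_gaussian(mu, lam11, lam12, lam22)

# True marginal variance of z1, from inverting the 2x2 precision matrix.
true_var1 = lam22 / (lam11 * lam22 - lam12 ** 2)
print(q1_mean, q2_mean, q1_var, true_var1)
```

In this example the factor means converge to the true mean, while the factor variance 1/Λ11 is smaller than the true marginal variance, which is the usual behaviour when minimizing KL(q||p) with a factorized q.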
Variational Mixture of Gaussians
We encounter mixtures of Gaussians all the time
Examples are multi-wall modelling, ambiguous localization, ...
We have:
a set of observed data X ,
a set of latent variables, Z that describe the mixture
Mixture of Gaussians - Modelling
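We can model the latent assignments given the mixing coefficients as
$$p(Z|\pi) = \prod_{n=1}^{N} \prod_{k=1}^{K} \pi_k^{z_{nk}}$$
We can also derive the observed conditional
$$p(X|Z, \mu, \Lambda) = \prod_{n=1}^{N} \prod_{k=1}^{K} \mathcal{N}(x_n|\mu_k, \Lambda_k^{-1})^{z_{nk}}$$
We will for now assume that the mixing coefficients have a Dirichlet prior
$$p(\pi) = \mathrm{Dir}(\pi|\alpha_0) = C(\alpha_0) \prod_{k=1}^{K} \pi_k^{\alpha_0 - 1}$$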
Mixture of Gaussians - Modelling
The component processes can be modelled as a Gaussian-Wishart
$$p(\mu, \Lambda) = p(\mu|\Lambda) p(\Lambda) = \prod_{k=1}^{K} \mathcal{N}(\mu_k|m_0, (\beta_0 \Lambda_k)^{-1}) \, \mathcal{W}(\Lambda_k|W_0, \nu_0)$$
I.e. a total model of
[Figure: directed graphical model over the nodes π, z_n, x_n, µ and Λ, with a plate over the N observations]
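One way to read the full model is as a recipe for generating data. The sketch below performs ancestral sampling for scalar observations (D = 1), where the Wishart prior reduces to a Gamma distribution with shape ν0/2 and scale 2·W0. The function name and all hyperparameter values are illustrative assumptions.

```python
import random

def sample_mixture(n_points, k=3, alpha0=1.0, m0=0.0, beta0=0.1,
                   w0=1.0, nu0=3.0, seed=0):
    """Draw pi, (mu_k, lambda_k), z_n and scalar x_n from the model, top down."""
    rng = random.Random(seed)
    # pi ~ Dir(alpha0): normalise independent Gamma(alpha0, 1) draws.
    gammas = [rng.gammavariate(alpha0, 1.0) for _ in range(k)]
    total = sum(gammas)
    pi = [g / total for g in gammas]
    # lambda_k ~ W(w0, nu0), which for D = 1 is Gamma(shape=nu0/2, scale=2*w0).
    lam = [rng.gammavariate(nu0 / 2.0, 2.0 * w0) for _ in range(k)]
    # mu_k | lambda_k ~ N(m0, (beta0 * lambda_k)^-1)
    mu = [rng.gauss(m0, (beta0 * l) ** -0.5) for l in lam]
    zs, xs = [], []
    for _ in range(n_points):
        z = rng.choices(range(k), weights=pi)[0]      # z_n ~ Categorical(pi)
        zs.append(z)
        xs.append(rng.gauss(mu[z], lam[z] ** -0.5))   # x_n ~ N(mu_z, 1/lambda_z)
    return pi, mu, lam, zs, xs

pi, mu, lam, zs, xs = sample_mixture(200)
```

Inference runs this recipe in reverse: only `xs` is available, and the remaining quantities have to be inferred.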
Mixtures of Gaussians - Variational
The joint model can be written as
$$p(X, Z, \pi, \mu, \Lambda) = p(X|Z, \mu, \Lambda) \, p(Z|\pi) \, p(\pi) \, p(\mu|\Lambda) \, p(\Lambda)$$
Only X is observed.
We can now consider the selection of a distribution
$$q(Z, \pi, \mu, \Lambda) = q(Z) \, q(\pi, \mu, \Lambda)$$
This is clearly an assumption of independence between the latent variables and the parameters.
We can use the general result of component-wise optimization
$$\ln q^*(Z) = E_{\pi,\mu,\Lambda}[\ln p(X, Z, \pi, \mu, \Lambda)] + \text{const}$$
Decomposition gives us
$$\ln q^*(Z) = E_{\pi}[\ln p(Z|\pi)] + E_{\mu,\Lambda}[\ln p(X|Z, \mu, \Lambda)] + \text{const}$$
$$\ln q^*(Z) = \sum_{n=1}^{N} \sum_{k=1}^{K} z_{nk} \ln \rho_{nk} + \text{const}$$
Mixtures of Gaussians - Variational
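We can further obtain
$$\ln \rho_{nk} = E[\ln \pi_k] + \frac{1}{2} E[\ln |\Lambda_k|] - \frac{D}{2} \ln 2\pi - \frac{1}{2} E_{\mu_k,\Lambda_k}\left[(x_n - \mu_k)^T \Lambda_k (x_n - \mu_k)\right] + c$$
Taking the exponential we have
$$q^*(Z) \propto \prod_{n=1}^{N} \prod_{k=1}^{K} \rho_{nk}^{z_{nk}}$$
Using normalization we arrive at
$$q^*(Z) = \prod_{n=1}^{N} \prod_{k=1}^{K} r_{nk}^{z_{nk}}$$
where
$$r_{nk} = \frac{\rho_{nk}}{\sum_j \rho_{nj}}$$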
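In practice ln ρ_nk is what gets computed, and the values can be very negative, so the normalization is usually carried out in the log domain. A minimal sketch, assuming the log values are supplied as a list of rows (one row per data point); the function name and the example numbers are hypothetical.

```python
import math

def responsibilities(log_rho):
    """Normalise rows of ln rho_nk into r_nk = rho_nk / sum_j rho_nj."""
    result = []
    for row in log_rho:
        shift = max(row)                              # subtract the max for stability
        weights = [math.exp(v - shift) for v in row]
        total = sum(weights)
        result.append([w / total for w in weights])
    return result

# Exponentiating -1000 directly would underflow to zero; the shifted version does not.
log_rho = [[-1000.0, -1001.0], [-3.0, -1.0]]
r = responsibilities(log_rho)
```

Subtracting the row maximum leaves the ratios unchanged, since any constant added to ln ρ_nk for all k cancels in the normalization.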
Mixtures of Gaussians - Variational
Just as we saw for EM we can define
$$N_k = \sum_{n=1}^{N} r_{nk}$$
$$\bar{x}_k = \frac{1}{N_k} \sum_{n=1}^{N} r_{nk} x_n$$
$$S_k = \frac{1}{N_k} \sum_{n=1}^{N} r_{nk} (x_n - \bar{x}_k)(x_n - \bar{x}_k)^T$$
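These statistics are responsibility-weighted counts, means and covariances. A small sketch for scalar data, where S_k reduces to a weighted variance; the function name and the toy inputs are illustrative assumptions.

```python
def weighted_statistics(xs, r):
    """Return (N_k, xbar_k, S_k) for scalar data xs and responsibilities r[n][k].

    Assumes every component has N_k > 0.
    """
    k_count = len(r[0])
    n_k = [sum(row[k] for row in r) for k in range(k_count)]
    xbar = [sum(row[k] * x for row, x in zip(r, xs)) / n_k[k]
            for k in range(k_count)]
    s = [sum(row[k] * (x - xbar[k]) ** 2 for row, x in zip(r, xs)) / n_k[k]
         for k in range(k_count)]
    return n_k, xbar, s

xs = [0.0, 1.0, 4.0]
r = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]   # the middle point is shared equally
n_k, xbar, s = weighted_statistics(xs, r)
```

Note that the deviations are taken around the component mean for component k, not around a per-point quantity.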
Mixtures of Gaussians - Parameters/Mixture
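Let us now consider q(π, µ, Λ) to arrive at
$$\ln q^*(\pi, \mu, \Lambda) = \ln p(\pi) + \sum_{k=1}^{K} \ln p(\mu_k, \Lambda_k) + E_Z[\ln p(Z|\pi)] + \sum_{k=1}^{K} \sum_{n=1}^{N} E[z_{nk}] \ln \mathcal{N}(x_n|\mu_k, \Lambda_k^{-1}) + c$$
We can partition the problem into
$$q(\pi, \mu, \Lambda) = q(\pi) \prod_{k=1}^{K} q(\mu_k, \Lambda_k)$$
We can derive
$$\ln q^*(\pi) = (\alpha_0 - 1) \sum_{k=1}^{K} \ln \pi_k + \sum_{k=1}^{K} \sum_{n=1}^{N} r_{nk} \ln \pi_k + c$$
We can now derive
$$q^*(\pi) = \mathrm{Dir}(\pi|\alpha)$$
where
$$\alpha_k = \alpha_0 + N_k$$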
Mixtures of Gaussians - Parameters/Mixture
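We can then derive
$$q^*(\mu_k, \Lambda_k) = \mathcal{N}(\mu_k|m_k, (\beta_k \Lambda_k)^{-1}) \, \mathcal{W}(\Lambda_k|W_k, \nu_k)$$
where
$$\beta_k = \beta_0 + N_k$$
$$m_k = \frac{1}{\beta_k} (\beta_0 m_0 + N_k \bar{x}_k)$$
$$W_k^{-1} = W_0^{-1} + N_k S_k + \frac{\beta_0 N_k}{\beta_0 + N_k} (\bar{x}_k - m_0)(\bar{x}_k - m_0)^T$$
$$\nu_k = \nu_0 + N_k$$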
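The hyperparameter updates are simple arithmetic once N_k, x̄_k and S_k are known. The sketch below covers scalar data, so W_k is a number, and takes the degrees of freedom as ν0 + N_k, the usual conjugate count update (an assumption of this sketch). The function name and prior values are illustrative.

```python
def update_hyperparameters(n_k, xbar_k, s_k, alpha0=1.0, beta0=1.0,
                           m0=0.0, w0=1.0, nu0=1.0):
    """Posterior hyperparameters of one component for scalar data (D = 1)."""
    alpha_k = alpha0 + n_k                         # Dirichlet pseudo-count
    beta_k = beta0 + n_k                           # precision scaling of the mean
    m_k = (beta0 * m0 + n_k * xbar_k) / beta_k     # shrinks xbar_k towards m0
    w_k_inv = (1.0 / w0 + n_k * s_k
               + beta0 * n_k / (beta0 + n_k) * (xbar_k - m0) ** 2)
    nu_k = nu0 + n_k                               # degrees of freedom
    return alpha_k, beta_k, m_k, 1.0 / w_k_inv, nu_k

alpha_k, beta_k, m_k, w_k, nu_k = update_hyperparameters(1.5, 3.0, 2.0)
```

With few effective observations the posterior mean stays close to the prior mean m0; as N_k grows it approaches x̄_k.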
Mixtures of Gaussians - Parameters
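We can now arrive at the required expectations
$$E_{\mu_k,\Lambda_k}\left[(x_n - \mu_k)^T \Lambda_k (x_n - \mu_k)\right] = D \beta_k^{-1} + \nu_k (x_n - m_k)^T W_k (x_n - m_k)$$
$$\ln \tilde{\Lambda}_k = E[\ln |\Lambda_k|] = \sum_{i=1}^{D} \psi\left(\frac{\nu_k + 1 - i}{2}\right) + D \ln 2 + \ln |W_k|$$
$$\ln \tilde{\pi}_k = E[\ln \pi_k] = \psi(\alpha_k) - \psi(\hat{\alpha}), \qquad \hat{\alpha} = \sum_k \alpha_k$$
Here ψ(·) is the digamma function, defined as ψ(a) = d/da ln Γ(a). The last two results follow from standard properties of the Wishart and Dirichlet distributions.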
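The two log expectations only need the digamma function. Python's standard library does not provide one, so the sketch below uses a common approximation (the recurrence ψ(x) = ψ(x+1) − 1/x followed by an asymptotic series); this helper and the function names are assumptions of the example, and the scalar case D = 1 is used for E[ln|Λ_k|].

```python
import math

def digamma(x):
    """Approximate psi(x) for x > 0 via recurrence plus an asymptotic series."""
    result = 0.0
    while x < 6.0:                  # shift x upwards until the series is accurate
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = inv2 * (1.0 / 12.0 - inv2 * (1.0 / 120.0 - inv2 / 252.0))
    return result + math.log(x) - 0.5 / x - series

def expected_log_pi(alpha):
    """E[ln pi_k] = psi(alpha_k) - psi(sum_k alpha_k) under Dir(pi | alpha)."""
    total = digamma(sum(alpha))
    return [digamma(a) - total for a in alpha]

def expected_log_precision(nu_k, w_k):
    """E[ln lambda_k] for D = 1: psi(nu_k / 2) + ln 2 + ln w_k."""
    return digamma(nu_k / 2.0) + math.log(2.0) + math.log(w_k)

log_pi_tilde = expected_log_pi([1.0, 1.0])
```

The values exp(E[ln π_k]) do not sum to one in general; by Jensen's inequality they sum to at most one.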
Mixtures of Gaussians - Parameters
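We can finally find the responsibilities
$$r_{nk} \propto \pi_k |\Lambda_k|^{1/2} \exp\left\{ -\frac{1}{2} (x_n - \mu_k)^T \Lambda_k (x_n - \mu_k) \right\}$$
In the variational update the point estimates are replaced by the expectations above, i.e. π̃_k, Λ̃_k and the expected quadratic form.
The optimization is stepwise
1 Evaluate the expectations over µ, Λ and π under the current q(π, µ, Λ), and then the responsibilities r_nk, which define q(Z)
2 Use r_nk to update q(π) and q(µ, Λ)
3 Check for convergence - return to 1 if not converged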
Mixture of Gaussians - Example
[Figure: variational mixture of Gaussians fitted to example data; four panels labelled 0, 15, 60 and 120]
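MoG - Variational Lower Bound
We can evaluate the lower bound to monitor the quality of the fit
$$\mathcal{L} = E[\ln p(X|Z, \mu, \Lambda)] + E[\ln p(Z|\pi)] + E[\ln p(\pi)] + E[\ln p(\mu, \Lambda)] - E[\ln q(Z)] - E[\ln q(\pi)] - E[\ln q(\mu, \Lambda)]$$
$$E[\ln p(X|Z, \mu, \Lambda)] = \frac{1}{2} \sum_k N_k \left\{ \ln \tilde{\Lambda}_k - D \beta_k^{-1} - \nu_k \mathrm{Tr}(S_k W_k) - \nu_k (\bar{x}_k - m_k)^T W_k (\bar{x}_k - m_k) - D \ln 2\pi \right\}$$
$$E[\ln p(Z|\pi)] = \sum_n \sum_k r_{nk} \ln \tilde{\pi}_k$$
$$E[\ln p(\pi)] = \ln C(\alpha_0) + (\alpha_0 - 1) \sum_k \ln \tilde{\pi}_k$$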
... = ... (see book)
Exponential Family Distribution
Recall from 3rd lecture:
Exponential family
$$p(x|\eta) = h(x) g(\eta) \exp\left\{ \eta^T u(x) \right\}$$
where
η represents the “natural parameters”
g (η) is the normalization “factor”
u(x ) is some general function of data
Exponential Family Distribution
The joint distribution for observed and latent variables is then
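$$p(X, Z|\eta) = \prod_{n=1}^{N} h(x_n, z_n) g(\eta) \exp\left\{ \eta^T u(x_n, z_n) \right\}$$
The conjugate prior for η is then
$$p(\eta|\nu_0, \chi_0) = f(\nu_0, \chi_0) g(\eta)^{\nu_0} \exp\left\{ \eta^T \chi_0 \right\}$$
where ν0 is the prior (effective) number of observations and χ0 is the corresponding total of the sufficient statistics (moments) for those prior observations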
Exponential Family Distribution - Variational
As before we can compute
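$$\ln q^*(Z) = E_{\eta}[\ln p(X, Z|\eta)] + \text{const} = \sum_{n} \left\{ \ln h(x_n, z_n) + E[\eta^T] u(x_n, z_n) \right\} + \text{const}$$
i.e. a sum of independent terms, so q*(Z) factorizes over the data points
Taking the exponential on both sides we have
$$q^*(z_n) = h(x_n, z_n) g(E[\eta]) \exp\left\{ E[\eta^T] u(x_n, z_n) \right\}$$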
Exponential Family Distribution - Variational
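Similarly the distribution over the natural parameters can be optimized by
$$\ln q^*(\eta) = \ln p(\eta|\nu_0, \chi_0) + E_Z[\ln p(X, Z|\eta)] + \text{const}$$
which expands to
$$\ln q^*(\eta) = \nu_0 \ln g(\eta) + \eta^T \chi_0 + \sum_{n} \left\{ \ln g(\eta) + \eta^T E_{z_n}[u(x_n, z_n)] \right\} + \text{const}$$
Taking the exponential on both sides we have
$$q^*(\eta) = f(\nu_N, \chi_N) g(\eta)^{\nu_N} \exp\left\{ \eta^T \chi_N \right\}$$
where
$$\nu_N = \nu_0 + N$$
$$\chi_N = \chi_0 + \sum_{n} E_{z_n}[u(x_n, z_n)]$$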
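The update for ν_N and χ_N is a count plus a running sum of expected sufficient statistics. A minimal sketch, assuming the expectations E[u(x_n, z_n)] have already been computed and are passed in as one list per data point; the function name and numbers are hypothetical.

```python
def update_conjugate(nu0, chi0, expected_stats):
    """Return (nu_N, chi_N) given E[u(x_n, z_n)] for each of the N data points."""
    nu_n = nu0 + len(expected_stats)          # nu_N = nu_0 + N
    chi_n = list(chi0)
    for stats in expected_stats:              # chi_N = chi_0 + sum_n E[u(x_n, z_n)]
        chi_n = [c + s for c, s in zip(chi_n, stats)]
    return nu_n, chi_n

nu_n, chi_n = update_conjugate(2.0, [0.0, 1.0],
                               [[1.0, 2.0], [3.0, 4.0], [0.5, 0.5]])
```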
Exponential Family Distribution - Variational
As expected the solution is iterative, since q∗(zn) and q∗(η) are coupled.
In the E step, compute the expected sufficient statistics E [u(xn, zn)] under the current q(zn) and use them to update q(η)
In the M step, use the updated q(η) to compute E [ηT], which in turn gives a revised q(zn)
Expectation Propagation
Fundamentally we are trying to match distributions to the data and match up the natural parameters, i.e. find the “best” family of distributions and at the same time fit the parameters.
In the end we are trying to minimize the Kullback-Leibler (KL) divergence with respect to q(z)
Consider for a minute KL(p||q) where p(z) is fixed and q(z) is a member of the exponential family
$$q(z) = h(z) g(\eta) \exp\left\{ \eta^T u(z) \right\}$$
Expectation Propagation - Optimization
The Kullback-Leibler divergence is then
$$\mathrm{KL}(p\|q) = -\ln g(\eta) - \eta^T E_{p(z)}[u(z)] + \text{const}$$
The extremum is then given by
$$-\nabla \ln g(\eta) = E_{p(z)}[u(z)]$$
For any member of the exponential family −∇ ln g(η) = E_{q(z)}[u(z)], so the best estimate matches the expected sufficient statistics of q(z) to those of p(z) (moment matching).
E.g. for q(z) = N(z|µ, Σ), set µ and Σ to the mean and covariance of p(z).
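Moment matching can be tried on a one-dimensional example: fit a single Gaussian q to a fixed two-component mixture p by copying its mean and variance, then compare KL(p||q) against nearby choices. The mixture parameters, the function names and the grid-based integration are assumptions made for this sketch.

```python
import math

def normal_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

# Hypothetical fixed target p(z): a two-component 1-D Gaussian mixture.
weights, means, variances = [0.3, 0.7], [-2.0, 1.0], [0.5, 1.0]

def p(z):
    return sum(w * normal_pdf(z, m, v) for w, m, v in zip(weights, means, variances))

# Moment matching: E_q[z] = E_p[z] and E_q[z^2] = E_p[z^2].
mean_q = sum(w * m for w, m in zip(weights, means))
second_moment = sum(w * (v + m * m) for w, m, v in zip(weights, means, variances))
var_q = second_moment - mean_q ** 2

def kl_p_q(mean, var, lo=-12.0, hi=12.0, steps=4800):
    """KL(p || N(mean, var)) approximated by a Riemann sum on a uniform grid."""
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps + 1):
        z = lo + i * h
        pz = p(z)
        log_q = -0.5 * math.log(2.0 * math.pi * var) - 0.5 * (z - mean) ** 2 / var
        total += pz * (math.log(pz) - log_q) * h
    return total

best = kl_p_q(mean_q, var_q)
```

The matched Gaussian is broad enough to cover both modes of p, which is typical of minimizing KL(p||q) rather than KL(q||p).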
Expectation Propagation - Modelling
Consider a model with factorized probabilities
$$p(D, \theta) = \prod_i f_i(\theta)$$
where f_n(θ) = p(x_n|θ) for each data point, and you might have a prior factor f_0(θ) = p(θ).
The posterior is then
$$p(\theta|D) = \frac{1}{p(D)} \prod_i f_i(\theta)$$
The model evidence is given by
$$p(D) = \int \prod_i f_i(\theta) \, d\theta$$
Expectation Propagation - Computing
The estimate is then
$$q(\theta) = \frac{1}{Z} \prod_i \tilde{f}_i(\theta)$$
q(θ) can be factorized so that each term is optimized
Through optimization factor-by-factor it is possible to generate an estimate - take-one-out-and-optimize
Expectation Propagation - Algorithm
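Initialize the factor approximations f̃_i(θ)
Initialize the posterior estimate q(θ) ∝ ∏_i f̃_i(θ)
Iterate
1 Choose a factor f̃_j(θ) to refine
2 Remove f̃_j(θ) from the current estimate: q^{\j}(θ) = q(θ) / f̃_j(θ)
3 Evaluate the new posterior q^new(θ) by matching its sufficient statistics (moments) to those of q^{\j}(θ) f_j(θ)
4 Update the factor: f̃_j(θ) ∝ q^new(θ) / q^{\j}(θ)
5 Evaluate the approximation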
Expectation Propagation - Example
$$p(x|\theta) = (1 - w) \mathcal{N}(x|\theta, I) + w \mathcal{N}(x|0, aI)$$
[Figure: example data x and the parameter θ on an axis from −5 to 10, followed by two panels plotted against θ over the same range]
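The algorithm can be sketched for the scalar version of this example. The code below assumes a Gaussian prior N(θ|0, b), a Gaussian approximation q(θ) = N(θ|m, v) and Gaussian site factors stored in natural form; the data, the values of w, a and b, the number of sweeps and the rule of skipping a site with a non-positive cavity precision are all assumptions of this sketch.

```python
import math

def normal_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def ep_clutter(xs, w=0.2, a=10.0, b=100.0, sweeps=50):
    """EP for p(x|theta) = (1-w) N(x|theta,1) + w N(x|0,a), prior N(theta|0,b).

    Returns the mean and variance of the Gaussian approximation q(theta).
    """
    # Site n is stored as a precision tau[n] and precision-times-mean nu[n];
    # zeros mean every approximate factor starts out flat.
    tau = [0.0] * len(xs)
    nu = [0.0] * len(xs)
    m, v = 0.0, b                                # q(theta) starts at the prior
    for _ in range(sweeps):
        for n, x in enumerate(xs):
            # Step 2: divide site n out of q to get the cavity distribution.
            cav_prec = 1.0 / v - tau[n]
            if cav_prec <= 0.0:                  # safeguard: skip improper cavities
                continue
            v_cav = 1.0 / cav_prec
            m_cav = v_cav * (m / v - nu[n])
            # Step 3: moments of cavity * exact factor (a two-component mixture).
            clutter = w * normal_pdf(x, 0.0, a)
            z_n = (1.0 - w) * normal_pdf(x, m_cav, v_cav + 1.0) + clutter
            rho = 1.0 - clutter / z_n            # probability x is not clutter
            gain = v_cav / (v_cav + 1.0)
            m_new = m_cav + rho * gain * (x - m_cav)
            v_new = (v_cav - rho * gain * v_cav
                     + rho * (1.0 - rho) * (gain * (x - m_cav)) ** 2)
            # Step 4: new site = new posterior / cavity, in natural parameters.
            tau[n] = 1.0 / v_new - 1.0 / v_cav
            nu[n] = m_new / v_new - m_cav / v_cav
            m, v = m_new, v_new
    return m, v

# Hypothetical data: eight points near theta = 2 plus two clutter points.
xs = [1.8, 2.3, 1.5, 2.9, 2.1, -4.0, 2.4, 1.2, 6.5, 2.0]
m, v = ep_clutter(xs)
```

Points far from the bulk of the data receive a small ρ and so barely move the estimate, which is how the approximation discounts clutter.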
Summary
Often computation of the complete model is a challenge
Two ways to approximate computations
Deterministic Approximations
Sampling Based Methods
Many tricks for approximation
Factorization is typically a first strategy
Iterative optimization of factors
Next time we will talk about sampling based methods