1 School of Computer Science
Approximate Inference:
Monte Carlo Inference
Probabilistic Graphical Models (10
Probabilistic Graphical Models (10
-
-
708)
708)
Lecture 18, Nov 19, 2007
Eric Xing
Eric Xing
Receptor A Kinase C TF F Gene G Gene H Kinase E Kinase D Receptor B X1 X2 X3 X4 X5 X6 X7 X8 Receptor A Kinase C TF F Gene G Gene H Kinase E Kinase D Receptor B X1 X2 X3 X4 X5 X6 X7 X8 X1 X2 X3 X4 X5 X6 X7 X8Reading: J-Chap. 1, KF-Chap. 11
Monte Carlo methods
z
Draw random samples from the desired distribution
z
Yield a stochastic representation of a complex distribution
z marginals and other expections can be approximated using
sample-based averages
z
Asymptotically
exact and easy to apply to arbitrary models
z
Challenges:
z how to draw samples from a given dist. (not all distributions can be
trivially sampled)?
z how to make better use of the samples (not all sample are useful, or
eqally useful, see an example later)?
∑
==
N t tx
f
N
x
f
11
)
(
)]
(
[
()E
Eric Xing 3
Example: naive sampling
z
Construct samples according to probabilities given in a BN.
Alarm example: (Choose the right sampling sequence) 1) Sampling:P(B)=<0.001, 0.999> suppose it is false, B0. Same for E0. P(A|B0, E0)=<0.001, 0.999> suppose it is false...
2) Frequency counting: In the samples right,
P(J|A0)=P(J,A0)/P(A0)=<1/9, 8/9>. J1 M1 A1 B0 E1 J0 M0 A0 B0 E0 J0 M0 A0 B0 E0 J0 M0 A0 B0 E0 J0 M0 A0 B0 E0 J0 M0 A0 B0 E0 J0 M0 A0 B0 E0 J0 M0 A0 B0 E0 J1 M0 A0 B0 E0 J0 M0 A0 B0 E0 J1 M1 A1 B0 E1 J0 M0 A0 B0 E0 J0 M0 A0 B0 E0 J0 M0 A0 B0 E0 J0 M0 A0 B0 E0 J0 M0 A0 B0 E0 J0 M0 A0 B0 E0 J0 M0 A0 B0 E0 J1 M0 A0 B0 E0 J0 M0 A0 B0 E0
Example: naive sampling
z
Construct samples according to probabilities given in a BN.
Alarm example: (Choose the right sampling
sequence)
3) what if we want to compute P(J|A1)?
we have only one sample ... P(J|A1)=P(J,A1)/P(A1)=<0, 1>.
4) what if we want to compute P(J|B1)?
No such sample available!
P(J|A1)=P(J,B1)/P(B1) can not be defined.
For a model with hundreds or more variables, rare events will be very hard to garner evough samples even after a long time or sampling ...
J1 M1 A1 B0 E1 J0 M0 A0 B0 E0 J0 M0 A0 B0 E0 J0 M0 A0 B0 E0 J0 M0 A0 B0 E0 J0 M0 A0 B0 E0 J0 M0 A0 B0 E0 J0 M0 A0 B0 E0 J1 M0 A0 B0 E0 J0 M0 A0 B0 E0
Eric Xing 5
Monte Carlo methods (cond.)
z
Direct Sampling
z We have seen it.
z Very difficult to populate a high-dimensional state space
z
Rejection Sampling
z Create samples like direct sampling, only count samples which is
consistent with given evidences.
z
Likelihood weighting, ...
z Sample variables and calculate evidence weight. Only create the
samples which support the evidences.
z
Markov chain Monte Carlo (MCMC)
z Metropolis-Hasting
z Gibbs
Rejection sampling
z
Suppose we wish to sample from dist. Π(
X
)=Π'(
X
)/Z.
z Π(X) is difficult to sample, but Π'(X) is easy to evaluate
z Sample from a simpler dist Q(X)
z Rejection sampling z Correctness: z Pitfall … ) ( / ) ( ' w.p. accept ), ( ~ * * * * Q X x x kQ x x Π
[
]
[
]
) ( ) ( ' ) ( ' ) ( ) ( / ) ( ' ) ( ) ( / ) ( ' ) ( x dx x x dx x Q x kQ x x Q x kQ x x p Π = Π Π = Π Π =∫
∫
Eric Xing 7
Rejection sampling
z
Pitfall:
z Using Q=N(µ,σq2I) to sample P=N(µ,σp2I)
z If σqexceeds σpby 1%, and dimensional=1000,
z The optimal acceptance rate k=(σq/σp)d≈1/20,000
z Big waste of samples!
z
Adaptive rejection sampling
z Using envelope functions to define Q
Unnormalized importance
sampling
z
Suppose sampling from
P
(·) is hard.
z
Suppose we can sample from a "simpler" proposal distribution
Q
(·) instead.
z
If
Q
dominates
P
(i.e.,
Q
(
x
) > 0 whenever
P
(
x
) > 0), we can
sample from
Q
and reweight:
∑
∑
∫
∫
= ≈ = = m m m m m m m m w x M X Q x x QP x x M dx x Q x QP x x dx x P x X ) ( ) ( ~ where ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( f f f f f 1 1Eric Xing 9
Normalized importance sampling
z
Suppose we can only evaluate
P'
(
x
) = α
P
(
x
) (e.g. for an
MRF).
z
We can get around the nasty normalization constant α as
follows:
z zNow
α = = = ⇒∫
Q x dx∫
P x dx x QP x X r Q ( ) '( ) ) ( ) ( ' ) ( ) ( ) ( ' ) ( Let x QP x X r =∑
∑
∑
∑
∫
∫
∫
∫
= = ≈ = = = m m m m m m m m m m m m m P r r w w x X Q x r r x dx x Q x r dx x Q x r x dx x Q x QP x x dx x P x X re whe ) ( ) ( ~ where ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ' ) ( ) ( ) ( ) ( f f f f f f α1Normalized vs unnormalized
importance sampling
z
Unormalized importance sampling is unbiased:
z
Normalized importance sampling is biased, eg for M = 1:
z
However, the variance of the normalized importance sampler is
usually lower in practice.
z
Also, it is common that we can evaluate P'(x) but not P(x), e.g.
P(x|e) = P'(x, e)/P(e) for Bayes net, or P(x) = P'(x)/Z for MRF.
[
f(X)w(X)]
= EQ = ⎥ ⎦ ⎤ ⎢ ⎣ ⎡ ) ( ) ( ) ( 1 1 1 x w x w x f EQEric Xing 11
Likelhood weighting
z
We now apply normalized importance sampling to a Bayes net.
zThe proposal Q is gotten from the mutilated BN where we clamp
evidence nodes, and cut their incoming arcs. Call this P
M.
z
The unnormalized posterior is P'(x) = P(x, e).
z
So for f(X
i) = δ(X
i= x
i), we get
where .
∑
∑
= = = m m m i m i m i i w x x w e x X Pˆ( | ) δ( ) ) ( / ) , ( m M m m P x e P x w = ′Eric Xing 13
Efficiency of likelihood weighting
z
The efficiency of importance sampling depends on how close
the proposal Q is to the target P.
z
Suppose all the evidence is at the roots. Then Q = P(X|e), and
all samples have weight 1.
z
Suppose all the evidence is at the leaves. Then Q is the prior,
so many samples might get small weight if the evidence is
unlikely.
z
We can use arc reversal to make some of the evidence nodes
be roots instead of leaves, but the resulting network can be
much more densely connected.
Weighted resampling
z
Problem of importance sampling: depends on how well
Q
matches
P
z If P(x)f(x) is strongly varying and has a significant proportion of its mass
concentrated in a small region, rmwill be dominated by a few samples
z Note that if the high-prob mass region of Qfalls into the low-prob mass
region of P, the variance of can be small even if the
samples come from low-prob region of P and potentially erroneous .
z
Solution
z Use heavy tail Q.
* ********* ) ( / ) ( m m m P x Q x r = m m m Qx r x P( )/ ( )
Eric Xing 15
Weighted resampling
z
Sampling importance resampling (SIR):
1. Draw Nsamples from Q: X1…XN
2. Constructing weights: w1…wN,
3. Sub-sample x from {X1…XN} w.p. (w1…wN)
z
Particular Filtering
z A special weighted resampler
z Yield samples from posterior p(Xt|Y1:t)
∑ ∑ = = m m m l l l m m m r r x Q x Px Qx P w ) ( / ) ( ) ( / ) (
A
A
A
Yt Yt+1 Y1 Xt Xt+1 X1...
Sketch of Particle Filters
z
The starting point
z Thus p(Xt|Y1:t) is represented by
z
A sequential weighted resampler
z Time update z Measurement update t t t t t t t p X X p X dX X p( +1|Y1:)=
∫
( +1| ) ( |Y1:)∫
− − − = = t t t t t t t t t t t t t t dX X Y p X p X Y p X p Y X p X p ) | ( ) | ( ) | ( ) | ( ) , | ( ) ( : : : : 1 1 1 1 1 1 1 Y Y Y Y ⎪⎭ ⎪ ⎬ ⎫ ⎪⎩ ⎪ ⎨ ⎧ = ∑ − = M m m t t m t t X Y p X Y p m t t t m t p X w X 1 1 1 ) | ( ) | ( : ), | ( ~ Y∫
+ + + + + + + + + = 1 1 1 1 1 1 1 1 1 1 1 1 t t t t t t t t t t t ppXX pYpY XXdX X p ) | ( ) | ( ) | ( ) | ( ) ( : : : Y Y Y∑
+ = m t t m t pX Xw ( 1| ) (sample from a mixture model)
⎪⎭ ⎪ ⎬ ⎫ ⎪⎩ ⎪ ⎨ ⎧ = ⇒ ∑ + + + M + +m m t t X Y p X Y p m t t t m t pX w X 1 1 1 1 1 1 ) | ( ) | ( :), | ( ~ Y (reweight) ) | (Xt t p +1 Y
Eric Xing 17
PF for switching SSM
z
Recall that the belief state has O(2t) Gaussian modes
PF for switching SSM
z
Key idea: if you knew the discrete states, you can apply the right
Kalman filter at each time step.
z
So for each old particle m, sample
from the prior, apply
the KF (usomg parameters for S
tm)
to the old belief state
to get an approximation to
z
Useful for online tracking,
fault diagnosis, etc.
) | ( ~ m t t m t PS S S −1 ) , ˆ ( | | m t t m t t P x−1−1 −1−1 ) , | ( : m: t t t y s X P 1 1
Eric Xing 19
Rao-Blackwellised sampling
z
Sampling in high dimensional spaces causes high variance in the
estimate.
z
RB idea: sample some variables X
p, and conditional on that,
compute expected value of rest X
danalytically:
z
This has lower variance, because of the identity:
[
]
[
]
[
( , )]
, ~ ( | ) ) , ( ) | ( ) , ( ) , | ( ) | ( ) , ( ) | , ( ) ( ) , | ( ) , | ( ) | ( e x p x X x f E dx X x f E e x p dx dx x x f e x x p e x p dx dx x x f e x x p X f E p m p m d m p e x X p M x p d p e x X p p x p x d d p p d p d p d p d p e X p m p d p p d p d∑
∫
∫
∫
∫
= = ⎟ ⎟ ⎠ ⎞ ⎜ ⎜ ⎝ ⎛ = = 1[
(Xp,Xd)]
var[
E[
(Xp,Xd)|Xp]
]
E[
var[
(Xp,Xd)|Xp]
]
varτ = τ + τEric Xing 21
Rao-Blackwellised sampling
z
Sampling in high dimensional spaces causes high variance in the
estimate.
z
RB idea: sample some variables X
p, and conditional on that,
compute expected value of rest X
danalytically:
z
This has lower variance, because of the identity:
z
Hence , so
is a lower variance estimator.
[
]
[
]
[
( , )]
, ~ ( | ) ) , ( ) | ( ) , ( ) , | ( ) | ( ) , ( ) | , ( ) ( ) , | ( ) , | ( ) | ( e x p x X x f E dx X x f E e x p dx dx x x f e x x p e x p dx dx x x f e x x p X f E p m p m d m p e x X p M x p d p e x X p p x p x d d p p d p d p d p d p e X p m p d p p d p d∑
∫
∫
∫
∫
= = ⎟ ⎟ ⎠ ⎞ ⎜ ⎜ ⎝ ⎛ = = 1[
(Xp,Xd)]
var[
E[
(Xp,Xd)|Xp]
]
E[
var[
(Xp,Xd)|Xp]
]
varτ = τ + τ[
]
[
( , )|]
var[
( , )]
varEτ Xp Xd Xp ≤ τ Xp Xd τ(Xp,Xd)=E[
f(Xp,Xd)|Xp]
Summary: Monte Carlo Methods
z
Direct Sampling
z Very difficult to populate a high-dimensional state space
z
Rejection Sampling
z Create samples like direct sampling, only count samples which is
consistent with given evidences.
z
Likelihood weighting, ...
z Sample variables and calculate evidence weight. Only create the
samples which support the evidences.
z
Markov chain Monte Carlo (MCMC)
z Metropolis-Hasting