Approximate Inference: Monte Carlo Inference. Probabilistic Graphical Models (10-708) Lecture 18, Nov 19, 2007


School of Computer Science


Eric Xing

[Figure: example graphical model with nodes Receptor A, Receptor B, Kinase C, Kinase D, Kinase E, TF F, Gene G and Gene H, also labeled X1 through X8]

Reading: J-Chap. 1, KF-Chap. 11

Monte Carlo methods

- Draw random samples from the desired distribution
- Yield a stochastic representation of a complex distribution
  - marginals and other expectations can be approximated using sample-based averages:

    $$E[f(x)] \approx \frac{1}{N} \sum_{t=1}^{N} f(x^{(t)})$$

- Asymptotically exact and easy to apply to arbitrary models
- Challenges:
  - how to draw samples from a given distribution (not all distributions can be trivially sampled)?
  - how to make better use of the samples (not all samples are useful, or equally useful; see an example later)?
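The sample-based average can be sketched in a few lines of Python. This is only an illustration: the target distribution (a standard normal), the function f(x) = x^2 and the helper name `monte_carlo_expectation` are assumptions chosen because the true answer, E[x^2] = 1, is known.

```python
import random

def monte_carlo_expectation(f, sampler, num_samples):
    """Approximate E[f(x)] by (1/N) * sum_t f(x_t), with x_t drawn by `sampler`."""
    total = 0.0
    for _ in range(num_samples):
        total += f(sampler())
    return total / num_samples

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Assumed example, not from the lecture: x ~ N(0, 1) and f(x) = x^2, so E[f(x)] = 1.
estimate = monte_carlo_expectation(
    lambda x: x * x,
    lambda: rng.gauss(0.0, 1.0),
    100_000,
)
print(f"estimate of E[x^2]: {estimate:.3f}")
```

The estimate gets closer to 1 as the number of samples grows, which is the "asymptotically exact" property in the list above.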


Example: naive sampling

- Construct samples according to the probabilities given in a BN.

Alarm example (choose the right sampling sequence):

1) Sampling: P(B) = <0.001, 0.999>; suppose the draw is false, B0. Same for E0. P(A|B0, E0) = <0.001, 0.999>; suppose it is false, and so on.

2) Frequency counting: in the samples below, P(J|A0) = P(J,A0)/P(A0) = <1/9, 8/9>.

    J   M   A   B   E
    J1  M1  A1  B0  E1
    J0  M0  A0  B0  E0
    J0  M0  A0  B0  E0
    J0  M0  A0  B0  E0
    J0  M0  A0  B0  E0
    J0  M0  A0  B0  E0
    J0  M0  A0  B0  E0
    J0  M0  A0  B0  E0
    J1  M0  A0  B0  E0
    J0  M0  A0  B0  E0


3) What if we want to compute P(J|A1)?

We have only one sample with A1, and it has J1, so P(J|A1) = P(J,A1)/P(A1) = <1, 0>.

4) What if we want to compute P(J|B1)?

No such sample is available!

P(J|B1) = P(J,B1)/P(B1) cannot be defined.

For a model with hundreds or more variables, it is very hard to gather enough samples of rare events, even after a long time of sampling ...
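The frequency counting in this example can be reproduced with a short Python sketch. The ten tuples are the samples listed on the slide; the helper name `conditional` and the choice to return `None` when no sample matches are assumptions made for illustration. Probabilities are reported for the "true" value first, matching the <true, false> ordering used for P(B).

```python
from fractions import Fraction

# The ten samples from the slide, as (J, M, A, B, E) tuples.
samples = [
    (1, 1, 1, 0, 1),
    (0, 0, 0, 0, 0),
    (0, 0, 0, 0, 0),
    (0, 0, 0, 0, 0),
    (0, 0, 0, 0, 0),
    (0, 0, 0, 0, 0),
    (0, 0, 0, 0, 0),
    (0, 0, 0, 0, 0),
    (1, 0, 0, 0, 0),
    (0, 0, 0, 0, 0),
]
NAMES = ("J", "M", "A", "B", "E")

def conditional(query, query_value, given, given_value):
    """Estimate P(query = query_value | given = given_value) by counting.

    Returns None when no sample satisfies the condition, in which case
    the ratio P(query, given) / P(given) is undefined.
    """
    q, g = NAMES.index(query), NAMES.index(given)
    matching = [s for s in samples if s[g] == given_value]
    if not matching:
        return None
    hits = sum(1 for s in matching if s[q] == query_value)
    return Fraction(hits, len(matching))

print(conditional("J", 1, "A", 0))  # P(J1 | A0): nine samples have A0
print(conditional("J", 1, "A", 1))  # P(J1 | A1): rests on a single sample
print(conditional("J", 1, "B", 1))  # P(J1 | B1): no sample has B1
```

The second estimate depends on one sample and the third cannot be formed at all, which is the rare-event problem described above.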


Monte Carlo methods (cont.)

- Direct sampling
  - We have seen it.
  - Very difficult to populate a high-dimensional state space
- Rejection sampling
  - Create samples as in direct sampling, but only count samples that are consistent with the given evidence.
- Likelihood weighting, ...
  - Sample the variables and calculate an evidence weight, so that only samples that support the evidence are created.
- Markov chain Monte Carlo (MCMC)
  - Metropolis-Hastings
  - Gibbs

Rejection sampling

- Suppose we wish to sample from the distribution $\Pi(X) = \Pi'(X)/Z$.
  - $\Pi(X)$ is difficult to sample, but $\Pi'(X)$ is easy to evaluate
  - Sample from a simpler distribution $Q(X)$
- Rejection sampling

  $$x^* \sim Q(X), \quad \text{accept } x^* \text{ w.p. } \frac{\Pi'(x^*)}{k\,Q(x^*)}$$

  where $k$ is a constant chosen so that $k\,Q(x) \ge \Pi'(x)$ for all $x$.
- Correctness:

  $$p(x) = \frac{[\Pi'(x)/kQ(x)]\,Q(x)}{\int [\Pi'(x)/kQ(x)]\,Q(x)\,dx} = \frac{\Pi'(x)}{\int \Pi'(x)\,dx} = \Pi(x)$$

- Pitfall …
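The accept/reject rule can be sketched as follows. The concrete choices are assumptions for illustration only: the unnormalized target is Π'(x) = exp(-x²/2) (a standard normal without its constant), the proposal Q is N(0, 2²), and k is set to the smallest value that keeps kQ(x) ≥ Π'(x) everywhere.

```python
import math
import random

SIGMA_Q = 2.0  # proposal standard deviation (assumed; must exceed 1 for this target)

def unnormalized_target(x):
    """Pi'(x) = exp(-x^2 / 2): a standard normal without its 1/sqrt(2*pi) constant."""
    return math.exp(-0.5 * x * x)

def proposal_pdf(x):
    """Density of the proposal Q = N(0, SIGMA_Q^2)."""
    return math.exp(-0.5 * (x / SIGMA_Q) ** 2) / (SIGMA_Q * math.sqrt(2.0 * math.pi))

# Smallest k with k * Q(x) >= Pi'(x) for all x: the ratio Pi'/Q peaks at x = 0.
K = SIGMA_Q * math.sqrt(2.0 * math.pi)

def rejection_sample(num_proposals, rng):
    accepted = []
    for _ in range(num_proposals):
        x = rng.gauss(0.0, SIGMA_Q)                                   # x* ~ Q(X)
        accept_prob = unnormalized_target(x) / (K * proposal_pdf(x))  # Pi'(x*) / (k Q(x*))
        if rng.random() < accept_prob:
            accepted.append(x)
    return accepted

rng = random.Random(0)
num_proposals = 20_000
accepted = rejection_sample(num_proposals, rng)
acceptance_rate = len(accepted) / num_proposals
mean = sum(accepted) / len(accepted)
std = math.sqrt(sum((x - mean) ** 2 for x in accepted) / len(accepted))
print(f"acceptance rate {acceptance_rate:.3f}, mean {mean:.3f}, std {std:.3f}")
```

For these choices the expected acceptance rate is Z/k = 1/SIGMA_Q = 0.5, and the accepted points should look like draws from a standard normal.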


Rejection sampling

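- Pitfall:
  - Using $Q = \mathcal{N}(\mu, \sigma_q^2 I)$ to sample $P = \mathcal{N}(\mu, \sigma_p^2 I)$
  - If $\sigma_q$ exceeds $\sigma_p$ by 1% and the dimension is $d = 1000$,
  - the optimal $k = (\sigma_q/\sigma_p)^d \approx 20{,}000$, so the acceptance rate is $1/k \approx 1/20{,}000$
  - Big waste of samples!
- Adaptive rejection sampling
  - Using envelope functions to define Q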
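The arithmetic behind the pitfall is easy to check. The sketch below just evaluates (σ_q/σ_p)^d for the 1% mismatch and d = 1000 quoted above; the variable names are illustrative.

```python
sigma_ratio = 1.01  # sigma_q / sigma_p: the proposal is 1% wider than the target
dimension = 1000

k = sigma_ratio ** dimension  # smallest valid k for these two Gaussians
acceptance_rate = 1.0 / k     # fraction of proposals that are accepted
print(f"k = {k:.0f}, acceptance rate = {acceptance_rate:.2e}")
```

The result is on the order of 2 x 10^4 for k, so only about one proposal in twenty thousand is accepted even though the two distributions differ by 1% per dimension.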

Unnormalized importance sampling

- Suppose sampling from P(·) is hard.
- Suppose we can sample from a "simpler" proposal distribution Q(·) instead.
- If Q dominates P (i.e., Q(x) > 0 whenever P(x) > 0), we can sample from Q and reweight:

$$\langle f(X) \rangle = \int f(x) P(x)\,dx = \int f(x) \frac{P(x)}{Q(x)} Q(x)\,dx \approx \frac{1}{M} \sum_m f(x^m) \frac{P(x^m)}{Q(x^m)} = \frac{1}{M} \sum_m f(x^m)\, w^m, \quad \text{where } x^m \sim Q(X)$$
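A minimal sketch of the reweighting, under assumed choices that are not from the lecture: the target P is N(0, 1), the proposal Q is N(0, 2²) (which dominates P), and f(x) = x², so the exact answer is 1.

```python
import math
import random

def normal_pdf(x, mean, std):
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def importance_estimate(f, num_samples, rng):
    """Unnormalized importance sampling: (1/M) * sum_m f(x_m) * P(x_m) / Q(x_m)."""
    total = 0.0
    for _ in range(num_samples):
        x = rng.gauss(0.0, 2.0)                                # x_m ~ Q = N(0, 2^2)
        w = normal_pdf(x, 0.0, 1.0) / normal_pdf(x, 0.0, 2.0)  # w_m = P(x_m) / Q(x_m)
        total += f(x) * w
    return total / num_samples

rng = random.Random(0)
estimate = importance_estimate(lambda x: x * x, 100_000, rng)
print(f"importance sampling estimate of E_P[x^2]: {estimate:.3f}")
```

Note that this version needs the normalized density P(x); the weights are not renormalized.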


Normalized importance sampling

- Suppose we can only evaluate $P'(x) = \alpha P(x)$ (e.g. for an MRF).
- We can get around the nasty normalization constant $\alpha$ as follows:
  - Let

    $$r(X) = \frac{P'(x)}{Q(x)} \;\Rightarrow\; \langle r(X) \rangle_Q = \int \frac{P'(x)}{Q(x)} Q(x)\,dx = \int P'(x)\,dx = \alpha$$

  - Now

    $$\langle f(X) \rangle_P = \int f(x) P(x)\,dx = \frac{1}{\alpha} \int f(x) \frac{P'(x)}{Q(x)} Q(x)\,dx = \frac{\int f(x)\, r(x)\, Q(x)\,dx}{\int r(x)\, Q(x)\,dx}$$

    $$\approx \frac{\sum_m f(x^m)\, r^m}{\sum_m r^m} = \sum_m f(x^m)\, w^m, \quad \text{where } x^m \sim Q(X) \text{ and } w^m = \frac{r^m}{\sum_l r^l}$$
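The normalized estimator can be sketched with an unnormalized target. The choices below are assumptions for illustration: P'(x) = exp(-x²/2), so that α = sqrt(2π) and P is a standard normal, with proposal Q = N(0, 2²) and f(x) = x². The function names are hypothetical.

```python
import math
import random

def p_tilde(x):
    """Unnormalized target P'(x) = alpha * P(x), with P = N(0, 1) and alpha = sqrt(2*pi)."""
    return math.exp(-0.5 * x * x)

def q_pdf(x):
    """Proposal density Q = N(0, 2^2)."""
    return math.exp(-0.5 * (x / 2.0) ** 2) / (2.0 * math.sqrt(2.0 * math.pi))

def normalized_importance_estimate(f, num_samples, rng):
    xs = [rng.gauss(0.0, 2.0) for _ in range(num_samples)]  # x_m ~ Q(X)
    rs = [p_tilde(x) / q_pdf(x) for x in xs]                 # r_m = P'(x_m) / Q(x_m)
    total_r = sum(rs)
    # sum_m f(x_m) w_m with w_m = r_m / sum_l r_l
    estimate = sum(f(x) * r for x, r in zip(xs, rs)) / total_r
    alpha_hat = total_r / num_samples                        # sample mean of r estimates alpha
    return estimate, alpha_hat

rng = random.Random(0)
estimate, alpha_hat = normalized_importance_estimate(lambda x: x * x, 100_000, rng)
print(f"E_P[x^2] ~ {estimate:.3f}, alpha ~ {alpha_hat:.3f}")
```

The same samples give both the expectation and, as a by-product, an estimate of the normalization constant α = ⟨r(X)⟩_Q.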

Normalized vs unnormalized importance sampling

- Unnormalized importance sampling is unbiased:

  $$E_Q[f(X)\, w(X)] = \int f(x)\, w(x)\, Q(x)\,dx = \int f(x)\, P(x)\,dx = E_P[f(X)]$$

- Normalized importance sampling is biased, e.g. for M = 1:

  $$E_Q\!\left[\frac{f(x^1)\, w(x^1)}{w(x^1)}\right] = E_Q[f(x^1)] \ne E_P[f(X)] \text{ in general}$$

- However, the variance of the normalized importance sampler is usually lower in practice.
- Also, it is common that we can evaluate P'(x) but not P(x), e.g. P(x|e) = P'(x, e)/P(e) for a Bayes net, or P(x) = P'(x)/Z for an MRF.
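The M = 1 case can be checked numerically. The sketch below is an assumed toy setting, not from the lecture: P = N(0, 1), Q = N(1, 1) and f(x) = x, so E_P[f] = 0 while E_Q[f] = 1. Each repetition uses a single draw, and the two estimators are averaged over many repetitions to expose their expected values.

```python
import math
import random

def normal_pdf(x, mean, std):
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

rng = random.Random(0)
repetitions = 200_000
unnormalized_sum = 0.0
normalized_sum = 0.0
for _ in range(repetitions):
    x = rng.gauss(1.0, 1.0)                                # single draw x^1 ~ Q = N(1, 1)
    w = normal_pdf(x, 0.0, 1.0) / normal_pdf(x, 1.0, 1.0)  # w(x^1) = P(x^1) / Q(x^1)
    unnormalized_sum += x * w                              # f(x^1) w(x^1)
    normalized_sum += x * w / w                            # f(x^1) w(x^1) / w(x^1) = f(x^1)

unnormalized_mean = unnormalized_sum / repetitions  # approaches E_P[f] = 0
normalized_mean = normalized_sum / repetitions      # approaches E_Q[f] = 1, not E_P[f]
print(f"unnormalized: {unnormalized_mean:.3f}, normalized (M=1): {normalized_mean:.3f}")
```

With one sample the normalized weight is always 1, so the estimator ignores P entirely; the bias shrinks as M grows.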


Likelihood weighting

- We now apply normalized importance sampling to a Bayes net.
- The proposal Q is obtained from the mutilated BN, in which we clamp the evidence nodes and cut their incoming arcs. Call this $P_M$.
- The unnormalized posterior is $P'(x) = P(x, e)$.
- So for $f(X_i) = \delta(X_i = x_i)$, we get

  $$\hat{P}(X_i = x_i \mid e) = \frac{\sum_m w^m\, \delta(x_i^m = x_i)}{\sum_m w^m}, \quad \text{where } w^m = \frac{P'(x^m, e)}{P_M(x^m)}$$
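Likelihood weighting can be sketched on a tiny network. Everything concrete here is hypothetical: a two-node Bayes net X -> E with made-up CPTs, evidence E = 1, and the query P(X = 1 | E = 1). Clamping E and cutting its incoming arc leaves X sampled from its prior, so the weight reduces to P(e | x).

```python
import random

# Hypothetical two-node Bayes net X -> E with made-up CPTs (not from the lecture).
P_X1 = 0.3                       # P(X = 1)
P_E1_GIVEN_X = {0: 0.2, 1: 0.9}  # P(E = 1 | X = x)

def likelihood_weighting(num_samples, rng):
    """Estimate P(X = 1 | E = 1) with the evidence node E clamped to 1."""
    weighted_hits = 0.0
    total_weight = 0.0
    for _ in range(num_samples):
        # Sample the non-evidence node from the mutilated net P_M. E has no
        # incoming arc there, so X is simply drawn from its prior.
        x = 1 if rng.random() < P_X1 else 0
        # w = P(x, e) / P_M(x) = P(x) P(e | x) / P(x) = P(e | x)
        w = P_E1_GIVEN_X[x]
        total_weight += w
        if x == 1:
            weighted_hits += w
    return weighted_hits / total_weight

rng = random.Random(0)
estimate = likelihood_weighting(50_000, rng)
exact = P_X1 * P_E1_GIVEN_X[1] / (
    P_X1 * P_E1_GIVEN_X[1] + (1 - P_X1) * P_E1_GIVEN_X[0]
)
print(f"estimate {estimate:.3f}, exact {exact:.3f}")
```

No sample is discarded: every draw is consistent with the evidence by construction and contributes in proportion to its weight.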


Efficiency of likelihood weighting

- The efficiency of importance sampling depends on how close the proposal Q is to the target P.
- Suppose all the evidence is at the roots. Then Q = P(X|e), and all samples have weight 1.
- Suppose all the evidence is at the leaves. Then Q is the prior, so many samples might get small weight if the evidence is unlikely.
- We can use arc reversal to make some of the evidence nodes roots instead of leaves, but the resulting network can be much more densely connected.

Weighted resampling

- Problem of importance sampling: its quality depends on how well Q matches P
  - If P(x)f(x) is strongly varying and has a significant proportion of its mass concentrated in a small region, the weights $r^m$ will be dominated by a few samples
  - Note that if the high-probability region of Q falls into a low-probability region of P, the variance of $r^m = P(x^m)/Q(x^m)$ can be small even though the samples come from a low-probability region of P, and the estimate can then be erroneous.
- Solution
  - Use a heavy-tailed Q.


Weighted resampling

- Sampling importance resampling (SIR):
  1. Draw N samples from Q: $X_1 \ldots X_N$
  2. Construct weights $w_1 \ldots w_N$:

     $$w^m = \frac{P(x^m)/Q(x^m)}{\sum_l P(x^l)/Q(x^l)} = \frac{r^m}{\sum_l r^l}$$

  3. Sub-sample x from $\{X_1 \ldots X_N\}$ w.p. $(w_1 \ldots w_N)$
- Particle filtering
  - A special weighted resampler
  - Yields samples from the posterior $p(X_t \mid Y_{1:t})$

[Figure: state-space model with hidden chain X_1, ..., X_t, X_{t+1} (transition labeled A) and observations Y_1, ..., Y_t, Y_{t+1}]
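The three SIR steps can be sketched as follows. The target P = N(0, 1) and proposal Q = N(0, 2²) are assumptions chosen so the result is easy to check; `random.Random.choices` performs the weighted sub-sampling with replacement.

```python
import math
import random

def normal_pdf(x, mean, std):
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def sir(num_draws, num_resampled, rng):
    # 1. Draw N samples from Q = N(0, 2^2).
    xs = [rng.gauss(0.0, 2.0) for _ in range(num_draws)]
    # 2. Construct normalized weights w_m = r_m / sum_l r_l with r_m = P(x_m) / Q(x_m).
    rs = [normal_pdf(x, 0.0, 1.0) / normal_pdf(x, 0.0, 2.0) for x in xs]
    total = sum(rs)
    ws = [r / total for r in rs]
    # 3. Sub-sample from {x_1, ..., x_N} with probabilities (w_1, ..., w_N).
    return rng.choices(xs, weights=ws, k=num_resampled), ws

rng = random.Random(0)
resampled, weights = sir(20_000, 5_000, rng)
mean = sum(resampled) / len(resampled)
std = math.sqrt(sum((x - mean) ** 2 for x in resampled) / len(resampled))
print(f"resampled mean {mean:.3f}, std {std:.3f}")
```

The resampled points are unweighted and approximately distributed as P, at the cost of some duplicated values.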

Sketch of Particle Filters

- The starting point

  $$p(X_t \mid Y_{1:t}) = p(X_t \mid Y_t, Y_{1:t-1}) = \frac{p(X_t \mid Y_{1:t-1})\, p(Y_t \mid X_t)}{\int p(X_t \mid Y_{1:t-1})\, p(Y_t \mid X_t)\, dX_t}$$

  - Thus $p(X_t \mid Y_{1:t})$ is represented by

    $$\left\{ X_t^m \sim p(X_t \mid Y_{1:t-1}),\quad w_t^m = \frac{p(Y_t \mid X_t^m)}{\sum_{m=1}^{M} p(Y_t \mid X_t^m)} \right\}$$

- A sequential weighted resampler
  - Time update (sample from a mixture model)

    $$p(X_{t+1} \mid Y_{1:t}) = \int p(X_{t+1} \mid X_t)\, p(X_t \mid Y_{1:t})\, dX_t \approx \sum_m w_t^m\, p(X_{t+1} \mid X_t^m)$$

  - Measurement update (reweight)

    $$p(X_{t+1} \mid Y_{1:t+1}) = \frac{p(X_{t+1} \mid Y_{1:t})\, p(Y_{t+1} \mid X_{t+1})}{\int p(X_{t+1} \mid Y_{1:t})\, p(Y_{t+1} \mid X_{t+1})\, dX_{t+1}}$$

    $$\Rightarrow \left\{ X_{t+1}^m \sim p(X_{t+1} \mid Y_{1:t}),\quad w_{t+1}^m = \frac{p(Y_{t+1} \mid X_{t+1}^m)}{\sum_{m=1}^{M} p(Y_{t+1} \mid X_{t+1}^m)} \right\}$$
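The two updates can be sketched as a simple particle filter. The model is hypothetical and chosen only for illustration: a 1D random walk X_{t+1} = X_t + N(0, q²) observed as Y_t = X_t + N(0, r²), with a N(0, 1) prior on the first state. The function names and default parameters are assumptions.

```python
import math
import random

def gaussian_pdf(y, mean, std):
    z = (y - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def simulate(num_steps, rng, q_std=1.0, r_std=1.0):
    """Simulate the assumed random-walk model; returns (states, observations)."""
    x = rng.gauss(0.0, 1.0)
    states, observations = [], []
    for _ in range(num_steps):
        states.append(x)
        observations.append(x + rng.gauss(0.0, r_std))
        x = x + rng.gauss(0.0, q_std)
    return states, observations

def particle_filter(observations, rng, num_particles=1000, q_std=1.0, r_std=1.0):
    """Return the filtered means E[X_t | Y_1:t] estimated from weighted particles."""
    # Particles from the prior play the role of p(X_t | Y_1:t-1) at the first step.
    particles = [rng.gauss(0.0, 1.0) for _ in range(num_particles)]
    filtered_means = []
    for y in observations:
        # Measurement update (reweight): w_m proportional to p(y_t | x_t^m).
        likelihoods = [gaussian_pdf(y, x, r_std) for x in particles]
        total = sum(likelihoods)
        weights = [lik / total for lik in likelihoods]
        filtered_means.append(sum(w * x for w, x in zip(weights, particles)))
        # Time update: sample from the mixture sum_m w_m p(x_{t+1} | x_t^m) by
        # picking an ancestor with probability w_m, then applying the transition.
        ancestors = rng.choices(particles, weights=weights, k=num_particles)
        particles = [x + rng.gauss(0.0, q_std) for x in ancestors]
    return filtered_means

rng = random.Random(0)
true_states, observations = simulate(100, rng)
filtered_means = particle_filter(observations, rng)
rmse = math.sqrt(
    sum((m - x) ** 2 for m, x in zip(filtered_means, true_states)) / len(true_states)
)
print(f"RMSE of filtered mean: {rmse:.3f}")
```

This sketch does not guard against all likelihoods underflowing to zero, which a more careful implementation would handle (for example by working with log-weights).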


PF for switching SSM

- Recall that the belief state has $O(2^t)$ Gaussian modes
- Key idea: if you knew the discrete states, you could apply the right Kalman filter at each time step.
- So for each old particle m, sample $S_t^m \sim P(S_t \mid S_{t-1}^m)$ from the prior, then apply the KF (using the parameters for $S_t^m$) to the old belief state $(\hat{x}_{t-1|t-1}^m, P_{t-1|t-1}^m)$ to get an approximation to $P(X_t \mid y_{1:t}, s_{1:t}^m)$
- Useful for online tracking, fault diagnosis, etc.



Rao-Blackwellised sampling

- Sampling in high dimensional spaces causes high variance in the estimate.
- RB idea: sample some variables $X_p$, and conditional on that, compute the expected value of the rest $X_d$ analytically:

  $$E_{p(X \mid e)}[f(X)] = \int p(x_p, x_d \mid e)\, f(x_p, x_d)\, dx_p\, dx_d$$

  $$= \int_{x_p} p(x_p \mid e) \left( \int_{x_d} p(x_d \mid x_p, e)\, f(x_p, x_d)\, dx_d \right) dx_p$$

  $$= \int_{x_p} p(x_p \mid e)\, E_{p(X_d \mid x_p, e)}[f(x_p, X_d)]\, dx_p$$

  $$\approx \frac{1}{M} \sum_m E_{p(X_d \mid x_p^m, e)}[f(x_p^m, X_d)], \quad x_p^m \sim p(x_p \mid e)$$

- This has lower variance, because of the identity:

  $$\mathrm{var}[\tau(X_p, X_d)] = \mathrm{var}\big[E[\tau(X_p, X_d) \mid X_p]\big] + E\big[\mathrm{var}[\tau(X_p, X_d) \mid X_p]\big]$$

- Hence $\mathrm{var}\big[E[\tau(X_p, X_d) \mid X_p]\big] \le \mathrm{var}[\tau(X_p, X_d)]$, so $\tau(X_p, X_d) = E[f(X_p, X_d) \mid X_p]$ is a lower variance estimator.
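The variance reduction can be seen in a small numerical sketch. The model is an assumed toy example, not from the lecture: X_p ~ N(0, 1), X_d | X_p ~ N(X_p, 1), and f(x_p, x_d) = x_d², for which E[f | X_p] = X_p² + 1 is available in closed form and the true expectation is 2.

```python
import random
import statistics

def draw_terms(num_samples, rng):
    """Return per-sample terms of the plain and Rao-Blackwellised estimators."""
    plain_terms, rb_terms = [], []
    for _ in range(num_samples):
        x_p = rng.gauss(0.0, 1.0)         # sampled part: X_p ~ N(0, 1)
        x_d = rng.gauss(x_p, 1.0)         # X_d | X_p ~ N(X_p, 1), needed only by the plain estimator
        plain_terms.append(x_d * x_d)     # f(x_p, x_d) = x_d^2
        rb_terms.append(x_p * x_p + 1.0)  # E[f | x_p] = x_p^2 + 1, computed analytically
    return plain_terms, rb_terms

rng = random.Random(0)
plain_terms, rb_terms = draw_terms(20_000, rng)

plain_estimate = statistics.fmean(plain_terms)
rb_estimate = statistics.fmean(rb_terms)
plain_var = statistics.variance(plain_terms)  # per-sample variance, about 8 in theory
rb_var = statistics.variance(rb_terms)        # per-sample variance, about 2 in theory
print(f"plain: {plain_estimate:.3f} (var {plain_var:.2f}); "
      f"Rao-Blackwellised: {rb_estimate:.3f} (var {rb_var:.2f})")
```

Both estimators target the same expectation, but the Rao-Blackwellised terms drop the E[var[f | X_p]] part of the variance, as the identity above predicts.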

Summary: Monte Carlo Methods

- Direct sampling
  - Very difficult to populate a high-dimensional state space
- Rejection sampling
  - Create samples as in direct sampling, but only count samples that are consistent with the given evidence.
- Likelihood weighting, ...
  - Sample the variables and calculate an evidence weight, so that only samples that support the evidence are created.
- Markov chain Monte Carlo (MCMC)
  - Metropolis-Hastings