• No results found

Approximate Inference: Monte Carlo Inference. Probabilistic Graphical Models (10-708) Lecture 18, Nov 19, 2007

N/A
N/A
Protected

Academic year: 2021

Share "Approximate Inference: Monte Carlo Inference. Probabilistic Graphical Models (10-708) Lecture 18, Nov 19, 2007"

Copied!
11
0
0

Loading.... (view fulltext now)

Full text

(1)

1 School of Computer Science

Approximate Inference:

Monte Carlo Inference

Probabilistic Graphical Models (10

Probabilistic Graphical Models (10

-

-

708)

708)

Lecture 18, Nov 19, 2007

Eric Xing

Eric Xing

Receptor A Kinase C TF F Gene G Gene H Kinase E Kinase D Receptor B X1 X2 X3 X4 X5 X6 X7 X8 Receptor A Kinase C TF F Gene G Gene H Kinase E Kinase D Receptor B X1 X2 X3 X4 X5 X6 X7 X8 X1 X2 X3 X4 X5 X6 X7 X8

Reading: J-Chap. 1, KF-Chap. 11

Monte Carlo methods

z

Draw random samples from the desired distribution

z

Yield a stochastic representation of a complex distribution

z marginals and other expections can be approximated using

sample-based averages

z

Asymptotically

exact and easy to apply to arbitrary models

z

Challenges:

z how to draw samples from a given dist. (not all distributions can be

trivially sampled)?

z how to make better use of the samples (not all sample are useful, or

eqally useful, see an example later)?

=

=

N t t

x

f

N

x

f

1

1

)

(

)]

(

[

()

E

(2)

Eric Xing 3

Example: naive sampling

z

Construct samples according to probabilities given in a BN.

Alarm example: (Choose the right sampling sequence) 1) Sampling:P(B)=<0.001, 0.999> suppose it is false, B0. Same for E0. P(A|B0, E0)=<0.001, 0.999> suppose it is false...

2) Frequency counting: In the samples right,

P(J|A0)=P(J,A0)/P(A0)=<1/9, 8/9>. J1 M1 A1 B0 E1 J0 M0 A0 B0 E0 J0 M0 A0 B0 E0 J0 M0 A0 B0 E0 J0 M0 A0 B0 E0 J0 M0 A0 B0 E0 J0 M0 A0 B0 E0 J0 M0 A0 B0 E0 J1 M0 A0 B0 E0 J0 M0 A0 B0 E0 J1 M1 A1 B0 E1 J0 M0 A0 B0 E0 J0 M0 A0 B0 E0 J0 M0 A0 B0 E0 J0 M0 A0 B0 E0 J0 M0 A0 B0 E0 J0 M0 A0 B0 E0 J0 M0 A0 B0 E0 J1 M0 A0 B0 E0 J0 M0 A0 B0 E0

Example: naive sampling

z

Construct samples according to probabilities given in a BN.

Alarm example: (Choose the right sampling

sequence)

3) what if we want to compute P(J|A1)?

we have only one sample ... P(J|A1)=P(J,A1)/P(A1)=<0, 1>.

4) what if we want to compute P(J|B1)?

No such sample available!

P(J|A1)=P(J,B1)/P(B1) can not be defined.

For a model with hundreds or more variables, rare events will be very hard to garner evough samples even after a long time or sampling ...

J1 M1 A1 B0 E1 J0 M0 A0 B0 E0 J0 M0 A0 B0 E0 J0 M0 A0 B0 E0 J0 M0 A0 B0 E0 J0 M0 A0 B0 E0 J0 M0 A0 B0 E0 J0 M0 A0 B0 E0 J1 M0 A0 B0 E0 J0 M0 A0 B0 E0

(3)

Eric Xing 5

Monte Carlo methods (cond.)

z

Direct Sampling

z We have seen it.

z Very difficult to populate a high-dimensional state space

z

Rejection Sampling

z Create samples like direct sampling, only count samples which is

consistent with given evidences.

z

Likelihood weighting, ...

z Sample variables and calculate evidence weight. Only create the

samples which support the evidences.

z

Markov chain Monte Carlo (MCMC)

z Metropolis-Hasting

z Gibbs

Rejection sampling

z

Suppose we wish to sample from dist. Π(

X

)=Π'(

X

)/Z.

z Π(X) is difficult to sample, but Π'(X) is easy to evaluate

z Sample from a simpler dist Q(X)

z Rejection sampling z Correctness: z Pitfall … ) ( / ) ( ' w.p. accept ), ( ~ * * * * Q X x x kQ x x Π

[

]

[

]

) ( ) ( ' ) ( ' ) ( ) ( / ) ( ' ) ( ) ( / ) ( ' ) ( x dx x x dx x Q x kQ x x Q x kQ x x p Π = Π Π = Π Π =

(4)

Eric Xing 7

Rejection sampling

z

Pitfall:

z Using Q=N(µ,σq2I) to sample P=N(µ,σp2I)

z If σqexceeds σpby 1%, and dimensional=1000,

z The optimal acceptance rate k=(σqp)d≈1/20,000

z Big waste of samples!

z

Adaptive rejection sampling

z Using envelope functions to define Q

Unnormalized importance

sampling

z

Suppose sampling from

P

(·) is hard.

z

Suppose we can sample from a "simpler" proposal distribution

Q

(·) instead.

z

If

Q

dominates

P

(i.e.,

Q

(

x

) > 0 whenever

P

(

x

) > 0), we can

sample from

Q

and reweight:

= ≈ = = m m m m m m m m w x M X Q x x QP x x M dx x Q x QP x x dx x P x X ) ( ) ( ~ where ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( f f f f f 1 1

(5)

Eric Xing 9

Normalized importance sampling

z

Suppose we can only evaluate

P'

(

x

) = α

P

(

x

) (e.g. for an

MRF).

z

We can get around the nasty normalization constant α as

follows:

z z

Now

α = = = ⇒

Q x dx

P x dx x QP x X r Q ( ) '( ) ) ( ) ( ' ) ( ) ( ) ( ' ) ( Let x QP x X r =

= = ≈ = = = m m m m m m m m m m m m m P r r w w x X Q x r r x dx x Q x r dx x Q x r x dx x Q x QP x x dx x P x X re whe ) ( ) ( ~ where ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ' ) ( ) ( ) ( ) ( f f f f f f α1

Normalized vs unnormalized

importance sampling

z

Unormalized importance sampling is unbiased:

z

Normalized importance sampling is biased, eg for M = 1:

z

However, the variance of the normalized importance sampler is

usually lower in practice.

z

Also, it is common that we can evaluate P'(x) but not P(x), e.g.

P(x|e) = P'(x, e)/P(e) for Bayes net, or P(x) = P'(x)/Z for MRF.

[

f(X)w(X)

]

= EQ = ⎥ ⎦ ⎤ ⎢ ⎣ ⎡ ) ( ) ( ) ( 1 1 1 x w x w x f EQ

(6)

Eric Xing 11

Likelhood weighting

z

We now apply normalized importance sampling to a Bayes net.

z

The proposal Q is gotten from the mutilated BN where we clamp

evidence nodes, and cut their incoming arcs. Call this P

M

.

z

The unnormalized posterior is P'(x) = P(x, e).

z

So for f(X

i

) = δ(X

i

= x

i

), we get

where .

= = = m m m i m i m i i w x x w e x X Pˆ( | ) δ( ) ) ( / ) , ( m M m m P x e P x w = ′

(7)

Eric Xing 13

Efficiency of likelihood weighting

z

The efficiency of importance sampling depends on how close

the proposal Q is to the target P.

z

Suppose all the evidence is at the roots. Then Q = P(X|e), and

all samples have weight 1.

z

Suppose all the evidence is at the leaves. Then Q is the prior,

so many samples might get small weight if the evidence is

unlikely.

z

We can use arc reversal to make some of the evidence nodes

be roots instead of leaves, but the resulting network can be

much more densely connected.

Weighted resampling

z

Problem of importance sampling: depends on how well

Q

matches

P

z If P(x)f(x) is strongly varying and has a significant proportion of its mass

concentrated in a small region, rmwill be dominated by a few samples

z Note that if the high-prob mass region of Qfalls into the low-prob mass

region of P, the variance of can be small even if the

samples come from low-prob region of P and potentially erroneous .

z

Solution

z Use heavy tail Q.

* ********* ) ( / ) ( m m m P x Q x r = m m m Qx r x P( )/ ( )

(8)

Eric Xing 15

Weighted resampling

z

Sampling importance resampling (SIR):

1. Draw Nsamples from Q: X1…XN

2. Constructing weights: w1…wN,

3. Sub-sample x from {X1…XN} w.p. (w1…wN)

z

Particular Filtering

z A special weighted resampler

z Yield samples from posterior p(Xt|Y1:t)

∑ ∑ = = m m m l l l m m m r r x Q x Px Qx P w ) ( / ) ( ) ( / ) (

A

A

A

Yt Yt+1 Y1 Xt Xt+1 X1

...

Sketch of Particle Filters

z

The starting point

z Thus p(Xt|Y1:t) is represented by

z

A sequential weighted resampler

z Time update z Measurement update t t t t t t t p X X p X dX X p( +1|Y1:)=

( +1| ) ( |Y1:)

− − − = = t t t t t t t t t t t t t t dX X Y p X p X Y p X p Y X p X p ) | ( ) | ( ) | ( ) | ( ) , | ( ) ( : : : : 1 1 1 1 1 1 1 Y Y Y Y ⎪⎭ ⎪ ⎬ ⎫ ⎪⎩ ⎪ ⎨ ⎧ = ∑ − = M m m t t m t t X Y p X Y p m t t t m t p X w X 1 1 1 ) | ( ) | ( : ), | ( ~ Y

+ + + + + + + + + = 1 1 1 1 1 1 1 1 1 1 1 1 t t t t t t t t t t t ppXX pYpY XXdX X p ) | ( ) | ( ) | ( ) | ( ) ( : : : Y Y Y

+ = m t t m t pX X

w ( 1| ) (sample from a mixture model)

⎪⎭ ⎪ ⎬ ⎫ ⎪⎩ ⎪ ⎨ ⎧ = ⇒ ∑ + + + M + +m m t t X Y p X Y p m t t t m t pX w X 1 1 1 1 1 1 ) | ( ) | ( :), | ( ~ Y (reweight) ) | (Xt t p +1 Y

(9)

Eric Xing 17

PF for switching SSM

z

Recall that the belief state has O(2t) Gaussian modes

PF for switching SSM

z

Key idea: if you knew the discrete states, you can apply the right

Kalman filter at each time step.

z

So for each old particle m, sample

from the prior, apply

the KF (usomg parameters for S

tm

)

to the old belief state

to get an approximation to

z

Useful for online tracking,

fault diagnosis, etc.

) | ( ~ m t t m t PS S S −1 ) , ˆ ( | | m t t m t t P x−1−1 −1−1 ) , | ( : m: t t t y s X P 1 1

(10)

Eric Xing 19

Rao-Blackwellised sampling

z

Sampling in high dimensional spaces causes high variance in the

estimate.

z

RB idea: sample some variables X

p

, and conditional on that,

compute expected value of rest X

d

analytically:

z

This has lower variance, because of the identity:

[

]

[

]

[

( , )

]

, ~ ( | ) ) , ( ) | ( ) , ( ) , | ( ) | ( ) , ( ) | , ( ) ( ) , | ( ) , | ( ) | ( e x p x X x f E dx X x f E e x p dx dx x x f e x x p e x p dx dx x x f e x x p X f E p m p m d m p e x X p M x p d p e x X p p x p x d d p p d p d p d p d p e X p m p d p p d p d

= = ⎟ ⎟ ⎠ ⎞ ⎜ ⎜ ⎝ ⎛ = = 1

[

(Xp,Xd)

]

var

[

E

[

(Xp,Xd)|Xp

]

]

E

[

var

[

(Xp,Xd)|Xp

]

]

varτ = τ + τ

(11)

Eric Xing 21

Rao-Blackwellised sampling

z

Sampling in high dimensional spaces causes high variance in the

estimate.

z

RB idea: sample some variables X

p

, and conditional on that,

compute expected value of rest X

d

analytically:

z

This has lower variance, because of the identity:

z

Hence , so

is a lower variance estimator.

[

]

[

]

[

( , )

]

, ~ ( | ) ) , ( ) | ( ) , ( ) , | ( ) | ( ) , ( ) | , ( ) ( ) , | ( ) , | ( ) | ( e x p x X x f E dx X x f E e x p dx dx x x f e x x p e x p dx dx x x f e x x p X f E p m p m d m p e x X p M x p d p e x X p p x p x d d p p d p d p d p d p e X p m p d p p d p d

= = ⎟ ⎟ ⎠ ⎞ ⎜ ⎜ ⎝ ⎛ = = 1

[

(Xp,Xd)

]

var

[

E

[

(Xp,Xd)|Xp

]

]

E

[

var

[

(Xp,Xd)|Xp

]

]

varτ = τ + τ

[

]

[

( , )|

]

var

[

( , )

]

varEτ Xp Xd Xp ≤ τ Xp Xd τ(Xp,Xd)=E

[

f(Xp,Xd)|Xp

]

Summary: Monte Carlo Methods

z

Direct Sampling

z Very difficult to populate a high-dimensional state space

z

Rejection Sampling

z Create samples like direct sampling, only count samples which is

consistent with given evidences.

z

Likelihood weighting, ...

z Sample variables and calculate evidence weight. Only create the

samples which support the evidences.

z

Markov chain Monte Carlo (MCMC)

z Metropolis-Hasting

References

Related documents

Unit costs at 2013/2014 prices 10 – 12 were applied to the resource use in the model to estimate the total NHS cost of patient management from the time a patient entered the data

(2015) found that trade openness and economic growth are significant factors in determining energy use in Malaysia, and Sannassee and Boopen (2016) found that a

Im, Lee &amp; Tieslau (2005) Panel Unit Root Test with Structural Breaks Results for Financial Stability Index. One break model Level

The objec- tives of this study were to determine how different seeding rates and application rates of mepiquat- type plant growth regulator compounds (PGR) affected cotton growth

In a study conducted in Egypt, pleuroscopy was performed in 40 patients with exudative pleural effusion; 95% of the procedures were conclusive among diagnosed

As presented by FHWA (1999) and illustrated in the diagram bellow, an Asset Management system has the following major elements, which are constrained by

The bill requires the executive director of a governmental agency or the president of an institution of higher education (institution), as applicable, that enters into a

By class of land, dryland cropland and grazing land values declined slightly for the year ending February 1, 2009, while irrigated land classes for the state as a whole