
The basic idea behind particle filtering (PF) is to approximate the belief state by a set of weighted particles or samples:

$$P(X_t \mid y_{1:t}) \approx \sum_{i=1}^{N_s} w_t^i \, \delta(X_t, X_t^i)$$

(In this chapter, $X_t^i$ means the $i$'th sample of $X_t$, and $X_t(i)$ means the $i$'th component of $X_t$.) Given a prior of this form, we can compute the posterior using importance sampling. In importance sampling, we assume the target distribution, $\pi(x)$, is hard to sample from; instead, we sample from a proposal or importance distribution $q(x)$, and weight the sample according to $w^i \propto \pi(x^i)/q(x^i)$. (After we have finished sampling, we can normalize all the weights so $\sum_i w^i = 1$.) We can use this to sample paths with weights

$$w_t^i \propto \frac{P(x_{1:t}^i \mid y_{1:t})}{q(x_{1:t}^i \mid y_{1:t})}$$
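The importance sampling step described above can be illustrated with a minimal sketch. The target $\pi = N(1, 0.5^2)$, the proposal $q = N(0, 2^2)$, and the function names are assumptions chosen for illustration only; they do not come from the text.

```python
import math
import random

def normal_pdf(x, mean, std):
    """Density of N(mean, std^2) at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def importance_estimate(num_samples, rng):
    """Estimate E_pi[x] for pi = N(1, 0.5^2) using samples from q = N(0, 2^2)."""
    samples = [rng.gauss(0.0, 2.0) for _ in range(num_samples)]
    # Unnormalized weights w^i = pi(x^i) / q(x^i).
    weights = [normal_pdf(x, 1.0, 0.5) / normal_pdf(x, 0.0, 2.0) for x in samples]
    total = sum(weights)
    weights = [w / total for w in weights]  # normalize so the weights sum to 1
    estimate = sum(w * x for w, x in zip(weights, samples))
    return estimate, weights

estimate, weights = importance_estimate(20000, random.Random(0))
```

With enough samples the weighted average should land close to the target mean of 1, even though no sample was drawn from $\pi$ itself.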

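$P(x_{1:t} \mid y_{1:t})$ can be computed recursively using Bayes rule. Typically we will want the proposal distribution to be recursive also, i.e., $q(x_{1:t} \mid y_{1:t}) = q(x_t \mid x_{1:t-1}, y_{1:t}) \, q(x_{1:t-1} \mid y_{1:t-1})$. In this case we have

$$w_t^i \propto \frac{P(y_t \mid x_t^i) \, P(x_t^i \mid x_{t-1}^i) \, P(x_{1:t-1}^i \mid y_{1:t-1})}{q(x_t^i \mid x_{1:t-1}^i, y_{1:t}) \, q(x_{1:t-1}^i \mid y_{1:t-1})} = \frac{P(y_t \mid x_t^i) \, P(x_t^i \mid x_{t-1}^i)}{q(x_t^i \mid x_{1:t-1}^i, y_{1:t})} \, w_{t-1}^i \stackrel{\mathrm{def}}{=} \hat{w}_t^i \times w_{t-1}^i$$

where we have defined $\hat{w}_t^i$ to be the incremental weight.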

For filtering, we only care about $P(X_t \mid y_{1:t})$, as opposed to the whole trajectory, so we use the following proposal, $q(x_t \mid x_{1:t-1}^i, y_{1:t}) = q(x_t \mid x_{t-1}^i, y_t)$, so we only need to store $x_{t-1}^i$ instead of the whole trajectory. In this case the weights simplify to

$$\hat{w}_t^i = \frac{P(y_t \mid x_t^i) \, P(x_t^i \mid x_{t-1}^i)}{q(x_t^i \mid x_{t-1}^i, y_t)} \qquad (5.1)$$

The most common proposal is to sample from the prior: $q(x_t \mid x_{t-1}^i, y_t) = P(x_t \mid x_{t-1}^i)$. In this case, the weights simplify to $\hat{w}_t^i = P(y_t \mid x_t^i)$. For predicting the future, sampling from the prior is adequate, since there is no evidence. (This technique can be used, e.g., to stochastically evaluate policies for (PO)MDPs [KMN99].) But for monitoring/filtering, it is not very efficient, since it amounts to “guess until you hit”. For example, if the transitions are highly stochastic, sampling from the prior will result in particles being proposed all over the space; if the observations are highly informative, most of the particles will get “killed off” (i.e., assigned low weight). In such a case, it makes more sense to first look at the evidence, $y_t$, and then propose:

$$q(x_t \mid x_{t-1}^i, y_t) = P(x_t \mid x_{t-1}^i, y_t) \propto P(y_t \mid x_t) \, P(x_t \mid x_{t-1}^i)$$

In fact, one can prove this is the optimal proposal distribution, in the sense of minimizing the variance of the weights (a balanced distribution being more economical, since particles with low weight are “wasted”). Unfortunately, it is often hard to sample from this distribution, and to compute the weights, which are given by the normalizing constant of the optimal proposal:

$$\hat{w}_t^i = P(y_t \mid x_{t-1}^i) = \int P(y_t \mid x_t) \, P(x_t \mid x_{t-1}^i) \, dx_t$$

In Section 5.2.1, we will discuss when it is tractable to use the optimal proposal distribution.

Applying importance sampling in this way is known as sequential importance sampling (SIS). A well known problem with SIS is that the number of particles with non-zero weight rapidly goes to zero, even if we use the optimal proposal distribution (this is called particle “impoverishment”). An estimate of the “effective” number of samples is given by

$$N_{\mathrm{eff}} = \frac{1}{\sum_{i=1}^{N_s} (w_t^i)^2} \qquad (5.2)$$

If this drops below some threshold, we can sample with replacement from the current belief state. Essentially this throws out particles with low weight and replicates those with high weight (hence the term “survival of the fittest”). This is called resampling, and can be done in $O(N_s)$ time. After resampling, the weights are reset to the uniform distribution: the past weights are reflected in the frequency with which particles are sampled, and do not need to be kept. Particle filtering is just sequential importance sampling with resampling (SISR). The resampling step was the key innovation in the ’90s; SIS itself has been around since at least the ’50s. The overall algorithm is sketched in Figure 5.1.
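As a rough illustration of Equation 5.2 and of linear-time resampling, the sketch below computes $N_{\mathrm{eff}}$ and draws indices with systematic resampling. Systematic resampling is one common $O(N_s)$ scheme and is an assumption here: it uses a single uniform offset rather than independent draws with replacement, and the text does not say which scheme is intended. The function names are hypothetical.

```python
import random

def effective_sample_size(weights):
    """N_eff = 1 / sum_i (w^i)^2 for normalized weights (Equation 5.2)."""
    return 1.0 / sum(w * w for w in weights)

def systematic_resample(weights, rng):
    """Return len(weights) indices in one O(N) pass over the cumulative weights."""
    n = len(weights)
    start = rng.random() / n            # single uniform offset in [0, 1/n)
    indices = []
    cumulative = weights[0]
    j = 0
    for k in range(n):
        u = start + k / n               # evenly spaced pointers
        while u > cumulative and j < n - 1:
            j += 1
            cumulative += weights[j]
        indices.append(j)
    return indices

weights = [0.7, 0.1, 0.1, 0.05, 0.05]
n_eff = effective_sample_size(weights)
indices = systematic_resample(weights, random.Random(1))
```

Here $N_{\mathrm{eff}}$ is about 1.94 out of 5 particles, so a threshold of, say, $N_s/2$ would trigger resampling, and the heavy particle at index 0 is copied several times.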

    function $[\{x_t^i, w_t^i\}_{i=1}^{N_s}] = \mathrm{PF}(\{x_{t-1}^i, w_{t-1}^i\}_{i=1}^{N_s}, y_t)$
    for $i = 1 : N_s$
        Sample $x_t^i \sim q(\cdot \mid x_{t-1}^i, y_t)$
        Compute $\hat{w}_t^i$ from Equation 5.1
        $w_t^i = \hat{w}_t^i \times w_{t-1}^i$
    Compute $w_t = \sum_{i=1}^{N_s} w_t^i$
    Normalize $w_t^i := w_t^i / w_t$
    Compute $N_{\mathrm{eff}}$ from Equation 5.2
    if $N_{\mathrm{eff}} <$ threshold
        $\pi = \mathrm{resample}(\{w_t^i\}_{i=1}^{N_s})$
        $x_t^{\cdot} = x_t^{\pi}$
        $w_t^i = 1/N_s$

Figure 5.1: Pseudo-code for a generic particle filter. The resample step samples indices with replacement according to their weight; the resulting set of sampled indices is called $\pi$. The line $x_t^{\cdot} = x_t^{\pi}$ simply duplicates or removes particles according to the chosen indices.
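A runnable sketch of one step of the generic filter might look as follows. It assumes the simplest proposal (sampling from the transition prior) and a toy one-dimensional random-walk model; the model, its noise levels, and the name `pf_step` are illustrative assumptions, not part of the text.

```python
import math
import random

def pf_step(particles, weights, y, rng, threshold):
    """One filter step with the transition prior as proposal.

    Assumed toy model: x_t = x_{t-1} + N(0, 1), y_t = x_t + N(0, 0.5^2).
    """
    n = len(particles)
    # Sample x_t^i ~ P(x_t | x_{t-1}^i); the incremental weight is then P(y_t | x_t^i).
    new_particles = [x + rng.gauss(0.0, 1.0) for x in particles]
    new_weights = [
        w * math.exp(-0.5 * ((y - x) / 0.5) ** 2)  # Gaussian likelihood up to a constant
        for w, x in zip(weights, new_particles)
    ]
    total = sum(new_weights)
    new_weights = [w / total for w in new_weights]
    n_eff = 1.0 / sum(w * w for w in new_weights)
    if n_eff < threshold:
        # Resample with replacement, then reset to uniform weights.
        new_particles = rng.choices(new_particles, weights=new_weights, k=n)
        new_weights = [1.0 / n] * n
    return new_particles, new_weights

rng = random.Random(0)
n = 500
particles = [rng.gauss(0.0, 1.0) for _ in range(n)]
weights = [1.0 / n] * n
true_x = 0.0
for _ in range(20):
    true_x += rng.gauss(0.0, 1.0)
    y = true_x + rng.gauss(0.0, 0.5)
    particles, weights = pf_step(particles, weights, y, rng, threshold=n / 2)
posterior_mean = sum(w * x for w, x in zip(weights, particles))
```

The threshold of $N_s/2$ is an arbitrary choice for the example; the text leaves the threshold unspecified.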

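    function $[x_t^i, \hat{w}_t^i] = \mathrm{LW}(x_{t-1}^i, y_t)$
    $\hat{w}_t^i = 1$
    $x_t^i$ = empty vector of length $N$
    for each node $j$ in topological order
        Let $u$ be the value of $\mathrm{Pa}(X_t(j))$ in $(x_{t-1}^i, x_t^i)$
        If $X_t(j)$ not in $y_t$
            Sample $x_t^i(j) \sim P(X_t(j) \mid \mathrm{Pa}(X_t(j)) = u)$
        else
            $x_t^i(j)$ = the value of $X_t(j)$ in $y_t$
            $\hat{w}_t^i = \hat{w}_t^i \times P(X_t(j) = x_t^i(j) \mid \mathrm{Pa}(X_t(j)) = u)$

Figure 5.2: Pseudo-code for likelihood weighting.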
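The likelihood weighting routine can be tried out on a very small example. The sketch below assumes a hypothetical slice with one binary hidden node `X` (whose parent is `X` in the previous slice) and one binary observed leaf `Y`; the CPD numbers and dictionary layout are made up for illustration.

```python
import random

# Hypothetical slice in topological order. Each CPD maps a tuple of parent
# values to P(node = 1 | parents).
NODES = ["X", "Y"]
PARENTS = {"X": ["X_prev"], "Y": ["X"]}
P_ONE = {
    "X": {(0,): 0.1, (1,): 0.8},
    "Y": {(0,): 0.2, (1,): 0.9},
}

def likelihood_weighting(prev_slice, evidence, rng):
    """Sample one slice given the previous slice; return (sample, incremental weight)."""
    values = {name + "_prev": v for name, v in prev_slice.items()}
    weight = 1.0
    for node in NODES:
        u = tuple(values[p] for p in PARENTS[node])
        p_one = P_ONE[node][u]
        if node in evidence:
            # Observed node: clamp it and multiply in its likelihood.
            values[node] = evidence[node]
            weight *= p_one if evidence[node] == 1 else 1.0 - p_one
        else:
            # Hidden node: sample from its CPD given the parent values.
            values[node] = 1 if rng.random() < p_one else 0
    return {node: values[node] for node in NODES}, weight

rng = random.Random(0)
draws = [likelihood_weighting({"X": 1}, {"Y": 1}, rng) for _ in range(5000)]
# Weighted estimate of P(X_t = 1 | X_{t-1} = 1, Y_t = 1).
estimate = sum(w for s, w in draws if s["X"] == 1) / sum(w for _, w in draws)
```

For these numbers the exact answer is $0.72 / 0.76 \approx 0.947$, which the weighted estimate should approach.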

Although resampling kills off unlikely particles, it also reduces the diversity of the population (which is why we don’t do it at every time step; if we did, then $w_t^i = \hat{w}_t^i$). This is a particularly severe problem if the system is highly deterministic (e.g., if the state space contains static parameters). A simple solution is to apply a kernel around each particle and then resample from the kernel. An alternative is to use an MCMC smoothing step; a particularly successful version of this is the resample-move algorithm [BG01].
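One way to read the kernel idea is sketched below: draw a particle in proportion to its weight, then perturb it with Gaussian noise, which amounts to sampling from a mixture of kernels centred on the particles. The Gaussian kernel, the `bandwidth` parameter, and the function name are assumptions for illustration.

```python
import random

def resample_with_kernel(particles, weights, bandwidth, rng):
    """Resample with replacement, then perturb each copy with N(0, bandwidth^2) noise."""
    n = len(particles)
    chosen = rng.choices(particles, weights=weights, k=n)
    return [x + rng.gauss(0.0, bandwidth) for x in chosen]

rng = random.Random(0)
particles = [0.0, 1.0, 2.0, 3.0]
weights = [0.97, 0.01, 0.01, 0.01]
# Plain resampling mostly duplicates the single heavy particle.
plain = rng.choices(particles, weights=weights, k=len(particles))
# Kernel resampling yields distinct values clustered around it.
smoothed = resample_with_kernel(particles, weights, bandwidth=0.1, rng=rng)
```

Choosing the bandwidth is a trade-off: too small and diversity is not restored, too large and the belief state is blurred.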

5.2.1 Particle filtering for DBNs

To apply PF to a DBN, we use the likelihood weighting (LW) routine [FC89, SP89] in Figure 5.2 to sample $x_t^i$ and compute $\hat{w}_t^i$. The proposal distribution that LW corresponds to depends on which nodes of the DBN are observed. In the simplest case of an HMM, where the observation is at a leaf node, LW samples from the prior, $P(X_t \mid x_{t-1}^i)$, and then computes the weight as $w = P(y_t \mid x_t^i)$. (We discuss how to improve this below. See also [CD00].)

In general, some of the evidence might occur at arbitrary locations within the DBN slice. In this case, the proposal is $q(x_t, y_t) = \prod_j P(x_t(j) \mid \mathrm{Pa}(X_t(j)))$, and the weight is $w(x_t, y_t) = \prod_j P(y_t(j) \mid \mathrm{Pa}(Y_t(j)))$, where $x_t(j)$ is the (value of the) $j$'th hidden node at time $t$, and $y_t(j)$ is the (value of the) $j$'th observed node at time $t$, and the parents of both $X_t(j)$ and $Y_t(j)$ may contain evidence. This is consistent, since (as observed in [RN02])

$$P(x_t, y_t) = \prod_j P(x_t(j) \mid \mathrm{Pa}(X_t(j))) \times \prod_j P(y_t(j) \mid \mathrm{Pa}(Y_t(j))) = q(x_t, y_t) \, w(x_t, y_t)$$

Optimal proposal distribution

Since the evidence usually occurs at the leaves, likelihood weighting effectively samples from the prior, without looking at the evidence. A general way to take the evidence into account while sampling, suggested in [FC89, KKR95], is called “evidence reversal”. This means applying the rules of “arc reversal” [Sha86] until all the evidence nodes become parents instead of leaves. To reverse an arc $X \to Y$, we must add $Y$’s unique parents, $Y_p$, to $X$, and add $X$’s unique parents, $X_p$, to $Y$ (both nodes may also share common parents, $C$): see Figure 5.3. The CPDs in the new network are given by

$$P(Y \mid X_p, C, Y_p) = \sum_x P(Y \mid x, C, Y_p) \, P(x \mid X_p, C)$$

$$P(X \mid Y, X_p, C, Y_p) = \frac{P(Y \mid X, C, Y_p) \, P(X \mid X_p, C)}{P(Y \mid X_p, C, Y_p)}$$

Note that $X_p$, $Y_p$ and $C$ could represent sets of variables. Hence the new CPDs could be much larger than before the arc reversal. (One way to ameliorate this effect, for tree-structured CPDs, is discussed in [CB97].) Of course, if there are multiple evidence nodes, not all of the arcs have to be reversed. In the case of DBNs, the operation is shown in Figure 5.4. The new CPDs are

$$P(Y_t \mid X_{t-1}) = \sum_{x_t} P(Y_t \mid x_t) \, P(x_t \mid X_{t-1})$$

$$P(X_t \mid X_{t-1}, Y_t) = \frac{P(Y_t \mid X_t) \, P(X_t \mid X_{t-1})}{P(Y_t \mid X_{t-1})}$$
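For a discrete model, the two reversed CPDs can be computed directly from the transition and observation tables. The sketch below uses a hypothetical two-state, two-observation model; the matrices `A` and `B` and the function name are assumptions for illustration.

```python
# Hypothetical model: A[i][j] = P(X_t = j | X_{t-1} = i),
#                     B[j][k] = P(Y_t = k | X_t = j).
A = [[0.9, 0.1],
     [0.3, 0.7]]
B = [[0.8, 0.2],
     [0.1, 0.9]]

def reverse_evidence(A, B):
    """Return (pred, post) where
    pred[i][k]    = P(Y_t = k | X_{t-1} = i), and
    post[i][k][j] = P(X_t = j | X_{t-1} = i, Y_t = k)."""
    n_states, n_obs = len(A), len(B[0])
    pred = [[sum(B[j][k] * A[i][j] for j in range(n_states))
             for k in range(n_obs)]
            for i in range(n_states)]
    post = [[[B[j][k] * A[i][j] / pred[i][k] for j in range(n_states)]
             for k in range(n_obs)]
            for i in range(n_states)]
    return pred, post

pred, post = reverse_evidence(A, B)
```

Sampling $x_t$ from `post[i][k]` and weighting by `pred[i][k]` is then the optimal proposal for this toy model, and `post[i][k][j] * pred[i][k]` reproduces the original joint `B[j][k] * A[i][j]`.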

Arc reversal was proposed by [Sha86] as a general means of inference in Bayesian networks. Since then, the junction tree (jtree) algorithm has come to dominate, since it is more efficient. It is possible to efficiently sample from $P(X \mid E)$ by first building a jtree, collecting evidence to the root, and then, in the distribute phase, drawing a random sample from $P(X_{C_i \setminus S_i} \mid x_{S_i}, E)$ for each clique $C_i$, where $S_i$ is the separator nearer to the root [Daw92]. This is the optimal method.

Unfortunately, for continuous valued variables (when PF is most useful), it is not always possible to compute the optimal proposal distribution, because the new CPDs required by arc reversal, or the potentials required by jtree, cannot be computed. An important exception is when the observation model, $P(Y_t \mid X_t)$, is (conditionally) linear-Gaussian, and the process noise is Gaussian (although the dynamics can be non-linear), i.e.,

$$P(X_t \mid x_{t-1}^i) = \mathcal{N}(X_t; f_t(x_{t-1}^i), Q_t)$$

$$P(y_t \mid X_t) = \mathcal{N}(y_t; H_t X_t, R_t)$$

In this case, one can use the standard Kalman filter rules to show that [AMGC02]

$$P(X_t \mid x_{t-1}^i, y_t) = \mathcal{N}(X_t; m_t, \Sigma_t)$$

$$\hat{w}_t^i = P(y_t \mid x_{t-1}^i) = \mathcal{N}(y_t; H_t f_t(x_{t-1}^i), R_t + H_t Q_t H_t')$$

where

$$\Sigma_t^{-1} = Q_t^{-1} + H_t' R_t^{-1} H_t$$

$$m_t = \Sigma_t \left( Q_t^{-1} f_t(x_{t-1}^i) + H_t' R_t^{-1} y_t \right)$$
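In the scalar case these formulas reduce to a few lines of arithmetic. The sketch below assumes scalar state and observation, with lower-case `q`, `r`, `h` standing for $Q_t$, $R_t$, $H_t$; the dynamics function `f` and all numeric values are made up for illustration.

```python
import math
import random

def optimal_proposal(x_prev, y, f, h, q, r):
    """Scalar case: x_t = f(x_prev) + N(0, q), y_t = h * x_t + N(0, r).

    Returns (m, sigma2, w_hat): the mean and variance of P(x_t | x_prev, y)
    and the incremental weight P(y | x_prev).
    """
    sigma2 = 1.0 / (1.0 / q + h * h / r)
    m = sigma2 * (f(x_prev) / q + h * y / r)
    pred_var = r + h * h * q                  # variance of y_t given x_prev
    resid = y - h * f(x_prev)
    w_hat = (math.exp(-0.5 * resid * resid / pred_var)
             / math.sqrt(2.0 * math.pi * pred_var))
    return m, sigma2, w_hat

def f(x):
    """Hypothetical non-linear dynamics."""
    return 0.5 * x + 2.0 * math.sin(x)

m, sigma2, w_hat = optimal_proposal(x_prev=1.0, y=3.0, f=f, h=2.0, q=0.5, r=0.1)
# Propose the new particle from the optimal proposal.
x_new = random.Random(0).gauss(m, math.sqrt(sigma2))
```

As a sanity check, the same mean and variance follow from the usual Kalman gain form: with $K = q h / (h^2 q + r)$, $m = f(x) + K (y - h f(x))$ and $\Sigma = (1 - K h) q$.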

If the model does not satisfy these requirements, one can still use a Gaussian approximation of the form $q(X_t \mid x_{t-1}^i, y_t)$ constructed, e.g., using the unscented transform [vdMDdFW00]. (This is the sense in which any heuristic can be converted into an optimal algorithm.) If the process noise is non-Gaussian, but the observation model is (conditionally) linear-Gaussian, one can propose from the likelihood and weight by the transition prior. (This observation was first made by Nando de Freitas, and has been exploited in [FTBD01].)

[Figure 5.3 image: two networks over the nodes $X$, $Y$, $X_p$, $C$, $Y_p$; panels (a) and (b).]

Figure 5.3: Arc reversal. (a) The original network. (b) The network after reversing the $X \to Y$ arc. The modified network encodes the same probability distribution as the original network.

[Figure 5.4 image: two networks over the nodes $X_{t-1}$, $X_t$, $Y_t$; panels (a) and (b).]

Figure 5.4: A DBN (a) before and (b) after evidence reversal.

Smoothing discrete state-spaces

When applying PF to (low dimensional) continuous state-spaces, it is easy to place a (Gaussian) kernel around each particle before resampling, to prevent “particle collapse”. However, most DBNs that have been studied have discrete state-spaces. In this case, we can use the smoothing technique proposed in [KL01]. Let $W_t = \sum_{i=1}^{N_s} w_t^i$. We add a certain fraction, $\alpha$, of $W_t$ to all entries in the state-space that are consistent with the evidence (i.e., which give it non-zero likelihood), and then renormalize. (This is like using a uniform Dirichlet prior.) That is, the smoothed approximate belief state is

$$\hat{P}(x \mid y_{1:t}) = \frac{\alpha + \sum_{i : x_t^i = x} w_t^i}{Z}$$

if $x$ is consistent with $y_t$, and $\hat{P}(x \mid y_{1:t}) = 0$ otherwise. The sum is over all particles that are equal to $x$; if there are no such particles, the numerator has value $\alpha$. The normalizing constant is $Z = W_t + \alpha M$, where $M$ is the total number of states consistent with $y_t$. (We discuss how to compute $M$ below.) We can sample from this smoothed belief state as follows. With probability $W_t / Z$, we select a particle as usual (with probability $w_t^i$), otherwise we select a state which is consistent with the evidence uniformly at random.
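The two-stage sampling procedure can be sketched as follows. The sketch assumes that the consistent states can be enumerated as a list (so $M$ is simply its length) and that every particle is itself consistent with the evidence; the state labels, weights, and function name are hypothetical.

```python
import random

def sample_smoothed(particles, weights, consistent_states, alpha, rng):
    """Draw one state from the smoothed belief (alpha + matching weight) / Z."""
    w_total = sum(weights)                          # W_t
    z = w_total + alpha * len(consistent_states)    # Z = W_t + alpha * M
    if rng.random() < w_total / z:
        # Pick an existing particle in proportion to its weight.
        return rng.choices(particles, weights=weights, k=1)[0]
    # Otherwise pick a consistent state uniformly at random.
    return rng.choice(consistent_states)

rng = random.Random(0)
states = ["a", "b", "c", "d"]      # hypothetical states consistent with y_t (M = 4)
particles = ["a", "a", "b"]
weights = [0.5, 0.3, 0.2]
draws = [sample_smoothed(particles, weights, states, alpha=0.05, rng=rng)
         for _ in range(20000)]
freq_a = draws.count("a") / len(draws)
freq_d = draws.count("d") / len(draws)
```

State `d` has no particle, yet it is drawn with probability $\alpha / Z = 0.05 / 1.2$, while state `a` is drawn with probability $(0.05 + 0.8) / 1.2$.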

Computing $M$ is equivalent to counting the number of satisfying assignments, which in general is #P-hard (worse than NP-hard). If all CPDs are stochastic, we can compute $M$ using techniques similar to Bayes net inference. Unfortunately, the cost of exactly computing $M$ may be as high as doing exact inference. A quick and dirty solution is to add probability mass of $\alpha$ to all states, whether or not they are consistent with the evidence.

Combining PF with BK

[NPP02] suggested combining particle filtering with the Boyen-Koller algorithm (see Section 4.2.1), i.e., approximating the belief state by

$$\hat{P}(X_t \mid y_{1:t}) \approx \prod_{c=1}^{C} \frac{1}{N_c} \sum_{i=1}^{N_c} \delta(X_{t,c}, x_{t,c}^i)$$

where $C$ is the number of clusters, and $N_c$ is the number of particles in each cluster (assumed uniformly weighted). By applying PF to a smaller state-space (each cluster), they reduce the variance, at a cost of increasing the bias by using a factored representation. They show that this method outperforms standard PF (sampling from the prior) on some small, discrete DBNs. However, they do not compare with BK. Furthermore, it might be difficult to extend the method to work with continuous state-spaces (when PF is most useful), since the two proposed methods of propagating the factored particles (using jtree and using an equi-join operator) only work well with discrete variables.