ppMCMC: A new population-based pMCMC method

This section presents a novel MCMC algorithm, which is a combination of two existing MCMC methods (popMCMC and pMCMC). The new algorithm is called Population-based Particle MCMC (ppMCMC) and its purpose is to address problems where the SSM posterior presented in Section 2.3.3, i.e. equation (4.5) of this chapter, is multi-modal. Multi-modality here is tackled only for the marginal posterior of $\theta$ (i.e. the posterior marginalised over the states). In these cases, the standard pMCMC algorithm faces the well-known issue of slow mixing, which is common to all single-chain MCMC methods. The sampler tends to “get stuck” in one of the modes of the posterior and only rarely manages to move to a different mode. Some of the modes might never be visited unless the sampler runs for a very long time, i.e. the runtime becomes impractical for real-world use by practitioners. Thus the user cannot be certain that the posterior has been explored completely. Of course, full exploration is never certain, even when using an algorithm with superior mixing. Nevertheless, improving mixing makes it more likely that the modes will be found in a reasonable time frame (i.e. days or weeks).

Algorithm 8 Particle MCMC

1: procedure PMCMC($P$, $T$, $Y_{1:T}$, $N$, $\theta_{\mathrm{init}}$) - Inputs: $P$ (number of particles), $T$ (number of SSM states), $Y_{1:T}$ (observations), $N$ (number of MCMC samples), $\theta_{\mathrm{init}}$ (initial MCMC sample)
2:   First iteration ($i = 1$):
3:   $\tilde{p}(Y_{1:T} \mid \theta_{\mathrm{init}}),\ X_{1:T}^{1:P} \leftarrow$ BootstrapPF($P$, $T$, $\theta_{\mathrm{init}}$, $Y_{1:T}$) // get likelihood and state samples
4:   Randomly select an index $p$ from $\{1, \dots, P\}$ and set $X_{1:T}^{\mathrm{init}} = X_{1:T}^{p}$
5:   Sample[1] = $(\theta_{\mathrm{init}}, X_{1:T}^{\mathrm{init}})$ // save initial sample
6:   Posterior[1] = $p(\theta_{\mathrm{init}})\,\tilde{p}(Y_{1:T} \mid \theta_{\mathrm{init}})$ // compute and save posterior
7:   $\theta = \theta_{\mathrm{init}}$ // temporary variable
8:   Remaining iterations:
9:   for $i = 2, \dots, N$ do
10:     $\theta^{*} \sim q(\theta^{*} \mid \theta)$ // propose new $\theta$
11:     $\tilde{p}(Y_{1:T} \mid \theta^{*}),\ X_{1:T}^{1:P} \leftarrow$ BootstrapPF($P$, $T$, $\theta^{*}$, $Y_{1:T}$) // get likelihood and state samples
12:     Randomly select an index $p$ from $\{1, \dots, P\}$ and set $X_{1:T}^{*} = X_{1:T}^{p}$
13:     Accept proposed sample $(\theta^{*}, X_{1:T}^{*})$ with probability $\min(1, \tilde{a})$
14:     if accepted then
15:       Sample[$i$] = $(\theta^{*}, X_{1:T}^{*})$ // save proposed sample
16:       Posterior[$i$] = $p(\theta^{*})\,\tilde{p}(Y_{1:T} \mid \theta^{*})$ // compute and save posterior
17:       $\theta = \theta^{*}$ // temporary variable
18:     else
19:       Sample[$i$] = Sample[$i - 1$] // replicate previous sample
20:       Posterior[$i$] = Posterior[$i - 1$] // replicate previous posterior
21:   return (Sample[$1:N$], Posterior[$1:N$]) ($N$ sets of MCMC samples and posterior values)
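The steps of Algorithm 8 can be illustrated with a minimal Python sketch. It is not the implementation used in this work: the toy SSM inside `bootstrap_pf` (an AR(1) state observed in Gaussian noise), the uniform prior in `log_prior`, the symmetric random-walk proposal and the `step` and `seed` parameters are all assumptions made for illustration. Because the proposal is symmetric, the proposal terms cancel in the acceptance ratio, and the computation is carried out in log space to avoid underflow of the likelihood estimate.

```python
import numpy as np

def bootstrap_pf(P, theta, y, rng):
    """Bootstrap PF for an assumed toy SSM:
    X_1 ~ N(0,1), X_t = theta * X_{t-1} + N(0,1), Y_t = X_t + N(0,1).
    Returns (log of the likelihood estimate, one sampled state trajectory)."""
    T = len(y)
    paths = np.empty((P, T))
    x = rng.normal(size=P)
    loglik = 0.0
    for t in range(T):
        if t > 0:
            x = theta * x + rng.normal(size=P)       # propagate through the transition
        paths[:, t] = x
        logw = -0.5 * (y[t] - x) ** 2 - 0.5 * np.log(2 * np.pi)
        w = np.exp(logw - logw.max())                # stabilised weights
        loglik += logw.max() + np.log(w.mean())      # log of the mean weight at time t
        idx = rng.choice(P, size=P, p=w / w.sum())   # multinomial resampling
        paths, x = paths[idx], x[idx]                # resample whole ancestries
    # after the final resampling all trajectories are equally weighted
    return loglik, paths[rng.integers(P)].copy()

def log_prior(theta):
    # assumed prior: Uniform(-1, 1)
    return 0.0 if -1.0 < theta < 1.0 else -np.inf

def pmcmc(P, y, N, theta_init, step=0.1, seed=0):
    rng = np.random.default_rng(seed)
    theta = theta_init
    loglik, x = bootstrap_pf(P, theta, y, rng)
    log_post = log_prior(theta) + loglik
    samples, log_posts = [(theta, x)], [log_post]    # first iteration
    for _ in range(1, N):                            # remaining iterations
        theta_star = theta + step * rng.normal()     # symmetric proposal q
        lp_star = log_prior(theta_star)
        if np.isfinite(lp_star):                     # zero prior density: reject without running the PF
            loglik_star, x_star = bootstrap_pf(P, theta_star, y, rng)
            log_post_star = lp_star + loglik_star
            if np.log(rng.uniform()) < log_post_star - log_post:
                theta, x, log_post = theta_star, x_star, log_post_star
        samples.append((theta, x))                   # replicated if rejected
        log_posts.append(log_post)
    return samples, log_posts
```

A call such as `pmcmc(P=100, y=observations, N=1000, theta_init=0.5)` returns the list of `(theta, trajectory)` samples together with the log of the unnormalised posterior estimates, mirroring the `Sample` and `Posterior` arrays of the pseudo-code.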

Although most applications of pMCMC lead to uni-modal marginal posteriors for $\theta$, there are modelling scenarios, like the one described in Section 4.5, where the posterior admits multiple modes. For example, the methylation profile of whole blood consists of multiple methylation profiles originating in the different cell types that exist in the blood [14]. In order to model this using an SSM, the transition density needs to be a mixture of densities with multiple unknown parameters. Inference on mixtures leads to multi-modal posteriors.

The ppMCMC algorithm is a typical population-based MCMC method: it employs a population of $M$ MCMC chains, where each chain samples from a different target distribution, i.e. a modified version of the SSM posterior in equation (4.5). The differences between the chains’ target distributions are due to the use of tempering (in a PT-like fashion, see Chapter 3); each chain uses a separate PF to approximate its likelihood (as in the pMCMC algorithm) but the likelihoods are then tempered. Also, ppMCMC uses exchange moves between the chains in a predefined order (again in the same way as PT). The combination of tempering and exchanges results in better mixing for multi-modal distributions but also introduces complications (which will be explained shortly). The two actions that take place during each ppMCMC iteration are the update moves and the exchange moves:

Update Moves

During the update moves stage, each chain proposes candidate MCMC samples $(X_{1:T}^{*}, \theta^{*})$ (for its own unique, modified SSM posterior) using the proposal:

$$q_j\big((X_{1:T}^{*}, \theta^{*}) \mid (X_{1:T}, \theta)\big) = q_j(\theta^{*} \mid \theta)\, p(X_{1:T}^{*} \mid Y_{1:T}, \theta^{*}), \quad j \in \{1, \dots, M\} \tag{4.8}$$

where $j$ is the chain index, $(X_{1:T}, \theta)$ is the previous sample of the chain, $q_j\big((X_{1:T}^{*}, \theta^{*}) \mid (X_{1:T}, \theta)\big)$ is the full MCMC proposal density (for states and $\theta$) for chain $j$ and $q_j(\theta^{*} \mid \theta)$ is the component of the proposal density which is related to $\theta$ for chain $j$. This component can be different between chains, in order to account for the different target densities of each chain, i.e. chains with more diffuse target densities mix faster when their proposal has larger variance. The component of the proposal density which is related to states, $p(X_{1:T}^{*} \mid Y_{1:T}, \theta^{*})$, is the same for all chains. The proposed states are generated by a PF exactly as in pMCMC. The only difference between the proposal schemes of pMCMC and ppMCMC is the use of different variances in the $\theta$ component of the proposal. Note that here, and in the remainder of the section (except the pseudo-code), the subscript $j$ is dropped from the candidate and previous samples of the chains to simplify notation.

After the candidate samples have been proposed, they are accepted or rejected based on the following M-H acceptance ratio:

$$\begin{aligned}
\tilde{a}_j &= \frac{p(\theta^{*})\, \tilde{p}(Y_{1:T} \mid \theta^{*})^{1/\mathrm{Temp}_j}\, p(X_{1:T}^{*} \mid Y_{1:T}, \theta^{*})\; q_j(\theta \mid \theta^{*})\, p(X_{1:T} \mid Y_{1:T}, \theta)}{p(\theta)\, \tilde{p}(Y_{1:T} \mid \theta)^{1/\mathrm{Temp}_j}\, p(X_{1:T} \mid Y_{1:T}, \theta)\; q_j(\theta^{*} \mid \theta)\, p(X_{1:T}^{*} \mid Y_{1:T}, \theta^{*})} \\
&= \frac{p(\theta^{*})\, \tilde{p}(Y_{1:T} \mid \theta^{*})^{1/\mathrm{Temp}_j}\, q_j(\theta \mid \theta^{*})}{p(\theta)\, \tilde{p}(Y_{1:T} \mid \theta)^{1/\mathrm{Temp}_j}\, q_j(\theta^{*} \mid \theta)}, \quad j \in \{1, \dots, M\}
\end{aligned} \tag{4.9}$$

where notation is the same as in the standard pMCMC method and $\mathrm{Temp}_j$ is the temperature of chain $j$ (with $1 = \mathrm{Temp}_1 < \mathrm{Temp}_2 < \dots < \mathrm{Temp}_M < \infty$). The above ratio is different from the standard pMCMC ratio in equation (4.7), since the estimated likelihood $\tilde{p}(Y_{1:T} \mid \theta)$ is tempered. This is done in order to achieve the posterior smoothing effect of all tempering schemes. The first chain of the population has $\mathrm{Temp}_1 = 1$, which means that it is the only chain that is not tempered. This chain samples from the exact same SSM posterior as a typical pMCMC algorithm (the “correct” posterior). The auxiliary chains of the population ($j \in \{2, \dots, M\}$) are tempered, which means that they sample from some smoothed (closer to uniform) version of the “correct” SSM posterior. As can be seen in the acceptance equation, the temperatures are applied only to the likelihood $\tilde{p}(Y_{1:T} \mid \theta)$ (and not to the other terms in the left-most part of the equation). This is a typical approach in tempering MCMC algorithms, since the likelihood is the component that usually contributes the most to the shape of the posterior. Moreover, it is crucial to apply the temperature only to this component in order to simplify the implementation of exchange steps (as will be shown below).

From examining equation (4.9), it is clear that no changes to the Bootstrap PF (compared to the pMCMC case) are needed in order to implement the ratio and temper the likelihoods; the PF of each chain generates a sample $X_{1:T}^{*}$ from $p(X_{1:T}^{*} \mid Y_{1:T}, \theta^{*})$ which is used in the proposal. It also generates an unbiased estimate $\tilde{p}(Y_{1:T} \mid \theta^{*})$ of the “correct” (non-tempered) likelihood. The temperature is applied after the termination of the PF. The terms $p(X_{1:T}^{*} \mid Y_{1:T}, \theta^{*})$ and $p(X_{1:T} \mid Y_{1:T}, \theta)$ cancel out in the acceptance ratio (as in basic pMCMC).
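Since the temperature is applied only after the PF has terminated, the simplified form of equation (4.9) reduces to a few arithmetic operations on quantities that are already available. The sketch below shows this in log space; the function and argument names are hypothetical and chosen only for this illustration.

```python
import math

def log_update_ratio(log_prior_star, loglik_hat_star, log_q_reverse,
                     log_prior_cur, loglik_hat_cur, log_q_forward, temp):
    """Log of the simplified update ratio (4.9) for a chain with temperature `temp`.

    loglik_hat_* are logs of the (non-tempered) PF likelihood estimates;
    dividing a log-likelihood by `temp` equals raising the likelihood to 1/temp.
    log_q_reverse = log q_j(theta | theta*), log_q_forward = log q_j(theta* | theta).
    """
    numerator = log_prior_star + loglik_hat_star / temp + log_q_reverse
    denominator = log_prior_cur + loglik_hat_cur / temp + log_q_forward
    return numerator - denominator

def accept(log_ratio, u):
    """Accept with probability min(1, exp(log_ratio)), given u drawn from Uniform(0, 1)."""
    return math.log(u) < min(0.0, log_ratio)
```

For a proposed move that lowers the estimated log-likelihood from -100 to -110 (with equal prior and proposal terms), the log ratio is -10 for `temp=1` but only -1 for `temp=10`, which is the smoothing effect of tempering: hotter chains accept downhill moves far more often.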

What are the target distributions?

Although applying a temperature to the likelihood is a well-known technique in population-based methods, in the case of ppMCMC it is not clear what the target distribution of each tempered chain is. The term $p(\theta)\, \tilde{p}(Y_{1:T} \mid \theta)^{1/\mathrm{Temp}_j}\, p(X_{1:T} \mid Y_{1:T}, \theta)$ (which contains the likelihood estimate) is used in the acceptance ratio of chain $j$ but this does not lead the chain to converge to the posterior $p(\theta)\, p(Y_{1:T} \mid \theta)^{1/\mathrm{Temp}_j}\, p(X_{1:T} \mid Y_{1:T}, \theta)$, as one would intuitively expect after comparing to the pMCMC case.

According to the theory presented in [45], a pMCMC chain converges to a target distribution provided that unbiased estimates of the distribution’s density are used in the numerator and denominator of the acceptance ratio. Nevertheless, the term $p(\theta)\, \tilde{p}(Y_{1:T} \mid \theta)^{1/\mathrm{Temp}_j}\, p(X_{1:T} \mid Y_{1:T}, \theta)$ is not an unbiased estimator of $p(\theta)\, p(Y_{1:T} \mid \theta)^{1/\mathrm{Temp}_j}\, p(X_{1:T} \mid Y_{1:T}, \theta)$. Running a PF on the given SSM produces an unbiased estimate $\tilde{p}(Y_{1:T} \mid \theta)$ of the likelihood (i.e. $E[\tilde{p}(Y_{1:T} \mid \theta)] = p(Y_{1:T} \mid \theta)$). However, applying the temperature after the likelihood estimate is generated (i.e. finding $\tilde{p}(Y_{1:T} \mid \theta)^{1/\mathrm{Temp}_j}$) does not maintain unbiasedness with respect to the “correct” tempered likelihood (i.e. $p(Y_{1:T} \mid \theta)^{1/\mathrm{Temp}_j}$).

In more detail, because the function $x \mapsto x^{1/\mathrm{Temp}_j}$ is concave for $\mathrm{Temp}_j \geq 1$, applying Jensen’s inequality [151] leads to the following:

$$E\big[\tilde{p}(Y_{1:T} \mid \theta)^{1/\mathrm{Temp}_j}\big] \leq \big(E[\tilde{p}(Y_{1:T} \mid \theta)]\big)^{1/\mathrm{Temp}_j} = p(Y_{1:T} \mid \theta)^{1/\mathrm{Temp}_j} \tag{4.10}$$

(for $\mathrm{Temp}_j < 1$ the function is convex and the inequality is reversed). Equality holds only for $\mathrm{Temp}_j = 1$. Therefore, unbiased estimates of the “correct” tempered likelihood densities $p(Y_{1:T} \mid \theta)^{1/\mathrm{Temp}_j}$ (and thus the respective posterior densities) can be acquired only when $\mathrm{Temp}_j = 1$ (i.e. only in the case of the first chain). Nevertheless, this does not mean that the tempered chains do not converge to any target distribution. In fact, chain $j$ converges to the distribution whose density is unbiasedly estimated by $p(\theta)\, \tilde{p}(Y_{1:T} \mid \theta)^{1/\mathrm{Temp}_j}\, p(X_{1:T} \mid Y_{1:T}, \theta)$. The densities of these distributions can be written as:

$$p_j(X_{1:T}, \theta \mid Y_{1:T}) = p(\theta)\, E\big[\tilde{p}(Y_{1:T} \mid \theta)^{1/\mathrm{Temp}_j}\big]\, p(X_{1:T} \mid Y_{1:T}, \theta), \quad j \in \{1, \dots, M\} \tag{4.11}$$

These are the actual target densities of the $M$ chains of the ppMCMC algorithm. Only for the first chain is the density equal to the “correct” tempered posterior (with $\mathrm{Temp}_1 = 1$) and also to the “correct” SSM posterior, i.e. $p_1(X_{1:T}, \theta \mid Y_{1:T}) = p(X_{1:T}, \theta \mid Y_{1:T})$.
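The bias described by equation (4.10) can be checked numerically. The sketch below does not run a PF; as a stand-in it assumes a likelihood estimator $Z$ with multiplicative log-normal noise and $E[Z] = 1$, for which the tempered expectation has the closed form $E[Z^{a}] = \exp(\sigma^2 a (a - 1) / 2)$ with $a = 1/\mathrm{Temp}_j$. The noise model, the function names and the parameter `sigma` are assumptions made for this illustration.

```python
import numpy as np

def tempered_estimator_mean(sigma, temp, n=200_000, seed=0):
    """Monte Carlo estimate of E[Z**(1/temp)] for a stand-in likelihood
    estimator Z = exp(N(-sigma^2/2, sigma^2)), which satisfies E[Z] = 1."""
    rng = np.random.default_rng(seed)
    z = np.exp(rng.normal(-0.5 * sigma**2, sigma, size=n))
    return float(np.mean(z ** (1.0 / temp)))

def tempered_estimator_mean_exact(sigma, temp):
    """Closed form of the same expectation for the log-normal stand-in."""
    a = 1.0 / temp
    return float(np.exp(0.5 * sigma**2 * a * (a - 1.0)))

if __name__ == "__main__":
    for temp in (1.0, 2.0, 8.0):
        print(temp,
              tempered_estimator_mean(1.0, temp),
              tempered_estimator_mean_exact(1.0, temp))
```

For `temp=1` the expectation equals the true value 1, while for any `temp > 1` it falls strictly below $1^{1/\mathrm{Temp}_j} = 1$, in agreement with the inequality. The size of the gap depends on the variance of the estimator, which in a PF is controlled by the number of particles.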

The key point here is that it is not necessary for the auxiliary chains to sample from the set of “correct” tempered posteriors, since their samples are not kept. Only the samples of the first chain are kept because they are the ones distributed according to the desired, “correct” SSM posterior. The auxiliary chains are only employed to help the first chain mix faster; they need to explore the distribution space quickly (and therefore their target distributions need to be closer to uniform) and occasionally feed the first chain with samples through exchange moves. These samples help the first chain escape from local modes. It is therefore enough for the auxiliary chains to sample from some set of tempered versions of the SSM posterior (and not necessarily from the “correct” set of tempered posteriors). The densities in equation (4.11) provide this tempering effect and therefore fulfil their purpose, i.e. the auxiliary chains move fast in the distribution space and help the first chain mix faster through exchange moves. In fact, the term “correct” is only used here for reasons of clarity; there is no reason to believe that the “correct” densities $p(Y_{1:T} \mid \theta)^{1/\mathrm{Temp}_j}$ are the best candidates for use in auxiliary chains (with respect to the mixing gains they offer). On the other hand, this does not mean that any density would serve as a good auxiliary density, e.g. uniform auxiliary densities would not help the mixing of the first chain because they are not concentrated around the true modes. In other words, some (not complete) smoothing must be applied to the true densities but the exact form of the optimal auxiliary densities is not known. In practical situations, the temperature set is tuned (using pre-runs) to improve mixing as much as possible.

The above approach is similar to the MPPT custom precision technique presented in Chapter 3 for the PT algorithm; in that case, custom precision approximations of the auxiliary chains’ target densities were used instead of the “correct” tempered densities. This did not affect the target distribution of the first chain and also proved effective for improving mixing (for most precisions).

Exchange Moves

In every ppMCMC iteration, after all chains have finished the update moves, the exchange step is performed. Exchange moves are attempted between chain pairs (1, 2), (3, 4), ... or chain pairs (2, 3), (4, 5), ... (neighbouring chains) in a rotating manner. As mentioned above, the exchange moves push MCMC samples from the high-temperature chains, which are closer to the uniform distribution, to the lower-temperature chains, which are closer to the “correct” target distribution. Eventually samples reach the first chain, which samples from the “correct” distribution, and help it escape from local modes. The exchange acceptance ratio between chains $(q, r)$ is:

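$$\begin{aligned}
\tilde{e}_q &= \frac{p(\theta^{r})\, \tilde{p}(Y_{1:T} \mid \theta^{r})^{1/\mathrm{Temp}_q}\, p(X_{1:T}^{r} \mid Y_{1:T}, \theta^{r})\; p(\theta^{q})\, \tilde{p}(Y_{1:T} \mid \theta^{q})^{1/\mathrm{Temp}_r}\, p(X_{1:T}^{q} \mid Y_{1:T}, \theta^{q})}{p(\theta^{q})\, \tilde{p}(Y_{1:T} \mid \theta^{q})^{1/\mathrm{Temp}_q}\, p(X_{1:T}^{q} \mid Y_{1:T}, \theta^{q})\; p(\theta^{r})\, \tilde{p}(Y_{1:T} \mid \theta^{r})^{1/\mathrm{Temp}_r}\, p(X_{1:T}^{r} \mid Y_{1:T}, \theta^{r})} \\
&= \frac{\tilde{p}(Y_{1:T} \mid \theta^{r})^{1/\mathrm{Temp}_q}\, \tilde{p}(Y_{1:T} \mid \theta^{q})^{1/\mathrm{Temp}_r}}{\tilde{p}(Y_{1:T} \mid \theta^{q})^{1/\mathrm{Temp}_q}\, \tilde{p}(Y_{1:T} \mid \theta^{r})^{1/\mathrm{Temp}_r}}
\end{aligned} \tag{4.12}$$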

where again the PF likelihood estimates are used, $q \in \{1, \dots, M - 1\}$, $r = q + 1$ and $(X_{1:T}^{q}, \theta^{q})$ and $(X_{1:T}^{r}, \theta^{r})$ are the current samples of chains $q$ and $r$ respectively. The above equation shows why it is important to apply the tempering technique only to the likelihood $p(Y_{1:T} \mid \theta)$ and not to the term $p(X_{1:T} \mid Y_{1:T}, \theta)$; it allows the latter to cancel out in the exchange acceptance ratio and leads to the simple form in the second line of the equation, which requires no additional PF runs (all the values are already known from the preceding update step).
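The simplified exchange ratio in equation (4.12) and the alternating pairing of neighbouring chains can be sketched as follows. The function names are hypothetical, chains are indexed from 0 rather than 1, and switching between the two pairings on even and odd iterations is one possible reading of “in a rotating manner”, assumed here for illustration.

```python
import math

def log_exchange_ratio(loglik_q, loglik_r, temp_q, temp_r):
    """Log of the second line of (4.12).

    loglik_q and loglik_r are the logs of the PF likelihood estimates attached
    to the current samples of chains q and r. No new PF run is needed: after a
    swap each estimate is simply tempered with the other chain's temperature.
    """
    numerator = loglik_r / temp_q + loglik_q / temp_r
    denominator = loglik_q / temp_q + loglik_r / temp_r
    return numerator - denominator

def exchange_pairs(iteration, M):
    """Neighbouring chain pairs (0-based) attempted at a given iteration:
    (0,1), (2,3), ... on even iterations and (1,2), (3,4), ... on odd ones."""
    start = iteration % 2
    return [(q, q + 1) for q in range(start, M - 1, 2)]
```

Algebraically the log ratio equals `(loglik_r - loglik_q) * (1/temp_q - 1/temp_r)`, so a swap that moves a sample with a higher estimated likelihood into the colder chain is always accepted. When a swap is accepted, the likelihood estimate has to travel with its sample, since it is part of the chain's state.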

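It is important to justify why the above exchange move fulfils the requirements of the theory of pMCMC [45] with regard to maintaining the correct target distributions of the two chains. The exchange step is equivalent to a Metropolis update where the updated state is the joint state of both chains (with indices $q$ and $r$). According to Andrieu and Roberts [45], a Metropolis update maintains the target distribution as long as the numerator and denominator of the acceptance ratio are unbiased estimates of the target density. In the case of the exchange step (and focusing only on the numerator for simplicity), this means that the product $p(\theta^{r})\, \tilde{p}(Y_{1:T} \mid \theta^{r})^{1/\mathrm{Temp}_q}\, p(X_{1:T}^{r} \mid Y_{1:T}, \theta^{r})\; p(\theta^{q})\, \tilde{p}(Y_{1:T} \mid \theta^{q})^{1/\mathrm{Temp}_r}\, p(X_{1:T}^{q} \mid Y_{1:T}, \theta^{q})$ must be an unbiased estimate of the joint target density of the two chains, i.e. the product of their individual target densities (which were given in equation (4.11)). It is easy to show that this is the case, since:

$$\begin{aligned}
&E\big[p(\theta^{r})\, \tilde{p}(Y_{1:T} \mid \theta^{r})^{1/\mathrm{Temp}_q}\, p(X_{1:T}^{r} \mid Y_{1:T}, \theta^{r})\; p(\theta^{q})\, \tilde{p}(Y_{1:T} \mid \theta^{q})^{1/\mathrm{Temp}_r}\, p(X_{1:T}^{q} \mid Y_{1:T}, \theta^{q})\big] \\
&\quad = p(\theta^{r})\, p(X_{1:T}^{r} \mid Y_{1:T}, \theta^{r})\; p(\theta^{q})\, p(X_{1:T}^{q} \mid Y_{1:T}, \theta^{q})\; E\big[\tilde{p}(Y_{1:T} \mid \theta^{r})^{1/\mathrm{Temp}_q}\, \tilde{p}(Y_{1:T} \mid \theta^{q})^{1/\mathrm{Temp}_r}\big] \\
&\quad = p(\theta^{r})\, p(X_{1:T}^{r} \mid Y_{1:T}, \theta^{r})\; p(\theta^{q})\, p(X_{1:T}^{q} \mid Y_{1:T}, \theta^{q})\; E\big[\tilde{p}(Y_{1:T} \mid \theta^{r})^{1/\mathrm{Temp}_q}\big]\, E\big[\tilde{p}(Y_{1:T} \mid \theta^{q})^{1/\mathrm{Temp}_r}\big] \\
&\quad = p_q(X_{1:T}^{r}, \theta^{r} \mid Y_{1:T})\; p_r(X_{1:T}^{q}, \theta^{q} \mid Y_{1:T})
\end{aligned} \tag{4.13}$$

The first equality is true because the first four terms in the second line of the equation are zero-variance estimators. The second equality is true because the two estimates $\tilde{p}(Y_{1:T} \mid \theta^{r})^{1/\mathrm{Temp}_q}$ and $\tilde{p}(Y_{1:T} \mid \theta^{q})^{1/\mathrm{Temp}_r}$ are independent estimators (since they are generated by two independent PFs, each assigned to its own MCMC chain) and therefore the expectation of their product is equal to the product of their expectations. The final equality is true due to equation (4.11).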

The ppMCMC algorithm

The pseudo-code of ppMCMC is shown in Algorithm 9. All the differences compared to Algorithm 8 are included, i.e. multiple chains, different proposals and updates, and exchange moves. They are all based on the equations presented previously. Note that the update and exchange ratios in the pseudo-code use slightly different notation compared to the previously derived equations for ease of presentation. Apart from the inputs of pMCMC, ppMCMC also requires the number of chains, the initial samples of each chain and the temperature of each chain. The output includes the samples and the posterior density values of the first chain.
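As a complement to the pseudo-code, the overall structure of a ppMCMC iteration (update moves for every chain followed by exchange moves between neighbouring chains) can be sketched in Python. This is an illustration under assumptions rather than a transcription of Algorithm 9: the toy SSM in `bootstrap_pf`, the uniform prior, the symmetric random-walk proposals with per-chain standard deviations `steps`, the even/odd alternation of exchange pairs and all function names are hypothetical choices made for this example.

```python
import numpy as np

def bootstrap_pf(P, theta, y, rng):
    """Bootstrap PF for an assumed toy SSM: X_t = theta * X_{t-1} + N(0,1),
    Y_t = X_t + N(0,1). Returns (log-likelihood estimate, one trajectory)."""
    paths = np.empty((P, len(y)))
    x = rng.normal(size=P)
    loglik = 0.0
    for t in range(len(y)):
        if t > 0:
            x = theta * x + rng.normal(size=P)
        paths[:, t] = x
        logw = -0.5 * (y[t] - x) ** 2 - 0.5 * np.log(2 * np.pi)
        w = np.exp(logw - logw.max())
        loglik += logw.max() + np.log(w.mean())
        idx = rng.choice(P, size=P, p=w / w.sum())   # multinomial resampling
        paths, x = paths[idx], x[idx]
    return loglik, paths[rng.integers(P)].copy()

def log_prior(theta):
    return 0.0 if -1.0 < theta < 1.0 else -np.inf    # assumed Uniform(-1, 1)

def ppmcmc(P, y, N, theta_init, temps, steps, seed=0):
    """temps[0] must be 1; theta_init, temps and steps hold one entry per chain."""
    rng = np.random.default_rng(seed)
    M = len(temps)
    theta = list(theta_init)
    runs = [bootstrap_pf(P, th, y, rng) for th in theta]   # one PF per chain
    loglik = [r[0] for r in runs]
    x = [r[1] for r in runs]
    samples = [(theta[0], x[0])]
    log_posts = [log_prior(theta[0]) + loglik[0]]
    for i in range(1, N):
        # Update moves: tempered acceptance ratio (4.9), symmetric proposal
        for j in range(M):
            th_star = theta[j] + steps[j] * rng.normal()
            if not np.isfinite(log_prior(th_star)):
                continue                                   # zero prior density: reject
            ll_star, x_star = bootstrap_pf(P, th_star, y, rng)
            log_a = (log_prior(th_star) - log_prior(theta[j])
                     + (ll_star - loglik[j]) / temps[j])
            if np.log(rng.uniform()) < log_a:
                theta[j], loglik[j], x[j] = th_star, ll_star, x_star
        # Exchange moves: ratio (4.12) between neighbouring chains
        for q in range(i % 2, M - 1, 2):
            r = q + 1
            log_e = (loglik[r] - loglik[q]) * (1 / temps[q] - 1 / temps[r])
            if np.log(rng.uniform()) < log_e:
                theta[q], theta[r] = theta[r], theta[q]
                loglik[q], loglik[r] = loglik[r], loglik[q]  # estimate travels with sample
                x[q], x[r] = x[r], x[q]
        samples.append((theta[0], x[0]))                    # keep the first chain only
        log_posts.append(log_prior(theta[0]) + loglik[0])
    return samples, log_posts
```

A call such as `ppmcmc(P=100, y=observations, N=1000, theta_init=[0.5, 0.0, -0.5], temps=[1.0, 2.0, 4.0], steps=[0.1, 0.2, 0.4])` runs three chains and returns only the samples and log posterior values of the first (non-tempered) chain, as described in the text. In this sketch the chains are updated sequentially; since each chain owns its PF, the update moves are independent across chains and could be run in parallel.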

In document Algorithms and architectures for MCMC acceleration in FPGAs (Page 143-149)