Kernel Density Metropolis-Hastings Algorithm

Chapter 3 MCMC

3.6 Kernel Density Metropolis-Hastings Algorithm

When θ has small dimension and the marginal posterior π(θ|y) is believed to be smooth, a new algorithm is proposed which overcomes the sticking of the GIMH and the bias of MCWM by exploiting the assumed smoothness. We note in passing that when the distribution of Y is considered, the exact marginal π(θ|Y) is a random function.

We again assume that unbiased but noisy point estimates ˜π(θ) of π(θ) are available, where π is often the posterior distribution of a high-dimensional model, and that we want to use MCMC to investigate it. The estimates could be obtained from an importance sampler, as used in the GIMH described above, or via other methods such as a particle filter.

The proposed algorithm is based on a sequence π̂_j(·) of kernel density estimates of π(θ); these are estimates of the whole density, in contrast to the underlying point estimates. A standard Metropolis-Hastings algorithm is run over θ, using π̂_j(·) to calculate the acceptance ratio. The algorithm is

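1. initialise θ

2. initialise the kernel estimate π̂_0(·)

3. for j = 1, ..., N_mcmc

(a) propose θ∗ from q(θ∗|θ)

(b) obtain ˜π(θ∗)

(c) compute the kernel estimate π̂_j(·)

(d) accept θ∗ with probability min(A, 1), where

$$ A = \frac{\hat{\pi}_j(\theta^*) \, q(\theta \mid \theta^*)}{\hat{\pi}_j(\theta) \, q(\theta^* \mid \theta)} $$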

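Conceptually at (c) we compute the kernel estimate for all θ; in practice we only need it at the points θ and θ∗. Note that we make use of all the estimates ˜π(θ∗) in computing π̂(θ). We use a standard kernel estimate

$$ \hat{\pi}_n(\theta) = \frac{\sum_{i=1}^{n} \tilde{\pi}(\theta_i) \, K\left(\frac{\theta - \theta_i}{h_n}\right)}{\sum_{i=1}^{n} K\left(\frac{\theta - \theta_i}{h_n}\right)} \tag{3.6.1} $$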

where h_n is a predefined non-increasing sequence of bandwidths and K(·) is a symmetric kernel. Computationally, the use of a kernel with bounded support gives several options for computing the KDE efficiently and sequentially. When the target is a Bayesian posterior, the initial estimate π̂_0(θ) can be taken from the prior on θ.
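The estimate (3.6.1) is a kernel-weighted average of the stored noisy point estimates. A minimal 1-d sketch follows. The function names and the Epanechnikov kernel are assumptions for illustration (any symmetric kernel with bounded support would do), and returning zero where no stored point falls inside the kernel's support is a convention chosen here, not something stated in the text.

```python
import numpy as np

def epanechnikov(u):
    """Symmetric kernel with bounded support [-1, 1] (illustrative choice)."""
    return np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u**2), 0.0)

def kernel_estimate(theta, thetas, pi_tilde, h, kernel=epanechnikov):
    """Kernel-weighted average of noisy point estimates, as in (3.6.1).

    theta    : scalar point at which the estimate is wanted
    thetas   : 1-d sequence of stored points theta_i
    pi_tilde : 1-d sequence of noisy estimates at those points
    h        : current bandwidth h_n
    """
    w = kernel((theta - np.asarray(thetas, dtype=float)) / h)
    total = w.sum()
    if total == 0.0:
        # Assumption: no stored point within the support, so report zero.
        return 0.0
    return float(np.dot(w, np.asarray(pi_tilde, dtype=float)) / total)
```

Because the weights vanish outside the support, only stored points within a distance h of θ contribute, which is what makes sequential updating cheap for bounded kernels.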

This algorithm generates a sequence of values of θ which, because of the dependence on past values, is no longer Markov. It is hoped that the sequence will converge to π(θ), subject to some conditions on the target and proposal distributions and on the sequence h_n. The initial experiments described below have used a constant bandwidth h; in this case the best that can be hoped for is that π̂_j(·) converges to the convolution of the target π(θ) and the kernel with bandwidth h.

3.6.1 Kernel MH Algorithm - Naive implementation

A naive implementation, which is computationally inefficient, has been used to investigate the behaviour on the example used in section 3.5.2 and on a more challenging 2-d example. At each iteration n the KDE is recomputed for the two values θ and θ∗ from the stored values (θ_j, ˜π(θ_j)), j = 1, ..., n, giving a run time of O(n²). A Gaussian kernel with a range of constant bandwidths (bw = 1, 0.1, 0.01, 0.001) gives the results shown below, which can be compared with the GIMH results in figure 3.5.2. With bw = 1 the estimate is over-smoothed, giving biased results; on this short run on a simple toy example, bw = 0.1 may be the best. The KDE was initialised from 100 observations from “a prior” of N(2,2); this initialisation is still visible in these short runs for all bandwidths < 1.
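The naive scheme can be sketched as below. This is an illustration under stated assumptions, not the implementation used for the figures. The name `naive_kmh`, the random-walk proposal with its step size, and the standard normal toy target are all hypothetical. Seeding the store by evaluating the noisy estimator at the prior draws is only one possible reading of the initialisation described above, and N(2,2) is read here as mean 2, standard deviation 2.

```python
import numpy as np

def naive_kmh(noisy_target, n_iter, bw, init_thetas, step=1.0, rng=None):
    """Naive kernel MH in 1-d: Gaussian kernel, constant bandwidth bw.

    noisy_target(theta, rng) returns a noisy, positive estimate of pi(theta).
    init_thetas seeds the store of (theta, estimate) pairs.
    """
    rng = np.random.default_rng() if rng is None else rng
    thetas = [float(t) for t in init_thetas]
    values = [float(noisy_target(t, rng)) for t in thetas]

    def kde(x):
        # Recomputed from every stored pair, so each call is O(n).
        t = np.asarray(thetas)
        w = np.exp(-0.5 * ((x - t) / bw) ** 2)
        return float(np.dot(w, values) / w.sum())

    theta = thetas[-1]            # start the chain at a stored point
    chain = np.empty(n_iter)
    for j in range(n_iter):
        prop = theta + step * rng.standard_normal()  # symmetric q cancels in A
        thetas.append(prop)                          # every estimate is kept,
        values.append(float(noisy_target(prop, rng)))  # accepted or not
        a = kde(prop) / kde(theta)                   # ratio of kernel estimates
        if rng.random() < min(a, 1.0):
            theta = prop
        chain[j] = theta
    return chain

# Usage with a hypothetical toy target: standard normal density times
# unit-mean log-normal noise.
def noisy_normal(theta, rng, sigma=0.5):
    exact = np.exp(-0.5 * theta**2) / np.sqrt(2.0 * np.pi)
    return exact * np.exp(sigma * rng.standard_normal() - 0.5 * sigma**2)

rng = np.random.default_rng(1)
prior_draws = rng.normal(2.0, 2.0, size=100)
chain = naive_kmh(noisy_normal, 2000, bw=0.1, init_thetas=prior_draws, rng=rng)
```

Both θ and θ∗ are always stored points, so the sum of Gaussian weights is at least 1 and the division is safe; the cost per iteration grows linearly with the number of stored pairs.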

Figure 3.6.1: Kernel MH example, bw=1, .1

Figure 3.6.2: Kernel MH example, bw=.01, .001

Himmelblau Example Distribution

The Kernel Metropolis-Hastings (KMH) algorithm has been investigated in higher dimensions, 2-d and 5-d, and appears to work well; further programming to improve efficiency is needed before any more extensive runs. In 5-d a N(µ, I_5) target is used and the run time is still O(n²). The KMH appears to work well where the SEMH would get stuck; running independent parallel chains circumvents the worst effects of the

In 2-d a challenging multimodal example with heavy tails, based on the Himmelblau function H(x, y) = (x² + y − 11)² + (x + y² − 7)², has been used. The target density is π(x, y) ∝ 1/(1 + H(x, y)); contours of its logarithm are shown in figure 3.6.3 alongside a perspective view.
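For reference, the unnormalised target can be written down directly from the definition above. The function names are hypothetical, and working on the log scale is a choice made here for numerical convenience.

```python
import numpy as np

def himmelblau(x, y):
    """Himmelblau function H(x, y)."""
    return (x**2 + y - 11.0) ** 2 + (x + y**2 - 7.0) ** 2

def log_target(x, y):
    """Log of the unnormalised density pi(x, y) = 1 / (1 + H(x, y))."""
    return -np.log1p(himmelblau(x, y))
```

H is zero at each of its four minima, for example at (3, 2), so the unnormalised density reaches its maximum value of 1 there; far from the origin it decays only polynomially, which gives the heavy tails.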

Figure 3.6.3: Himmelblau example target distribution; the log-likelihood is shown as contours and as a perspective view. The green dots indicate the positions of local maxima, the red dot a local minimum.

The marginal distributions are intractable, so an “exact” MH run of length 10⁸ was used to obtain them, along with the table below. Although the positions of the 4 modes are known exactly, the positions of the modes of the two marginals are not; they are close to the projections of the peaks. Comparisons have been made using the values in the table as exact probabilities (shown below as %).

              (−Inf,−10]   (−10,−5]    (−5,0]     (0,5]    (5,10]  (10,Inf]
(−Inf,−10]          0.12       0.08      0.08      0.07      0.06      0.10
(−10,−5]            0.07       0.39      1.05      0.69      0.26      0.06
(−5,0]              0.07       0.74     16.83     21.26      0.76      0.07
(0,5]               0.07       0.52     24.28     29.01      0.56      0.07
(5,10]              0.06       0.25      0.98      0.72      0.19      0.06
(10,Inf]            0.09       0.07      0.07      0.07      0.06      0.09

Results in 2-d

Figure 3.6.4: Himmelblau example, 1-d marginal densities of the 2-d target. The left hand plot shows the true densities obtained by an MCMC run of length 10⁸. The right hand plot shows an estimate obtained from a KMH run of ≈ 3×10⁶.

KMH runs have been compared with the SEMH and SAMH algorithms for a range of parameters. The “noise” is log-normal with parameter σ one of 2, 3, 4, 5. For σ = 2 the SEMH algorithm worked well, although a longer run would be necessary in practice; the SAMH was heavily biased, not identifying the modes and putting too much weight in the tails. The bias of SAMH is expected to increase with σ, so it was not considered for higher values of σ. For σ = 3 the SEMH algorithm showed significant sticking but was still acceptable (see left hand plot in figure 3.6.5). The KMH for σ = 3 at 3 bandwidths produced slightly better results, as measured by χ² and by identification of the 4 modes and tails, for a lower number of samples, but with the current implementation required more cpu time. The SEMH with σ = 4 got badly stuck, and longer runs are unlikely to improve this. The KMH for σ = 4 at 3 bandwidths produced significantly better results (see right hand plot in figure 3.6.5); although longer runs are necessary, the 4 modes are correctly identified. Even with σ = 5 the KMH gives useful results, although tuning of its parameters and/or a longer run is necessary to fully sample all 4 modes.
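One way to generate noisy estimates of the kind used in these comparisons is to multiply the exact density by unit-mean log-normal noise. The sketch below assumes that the log of the noise is N(−σ²/2, σ²), which keeps the estimate unbiased; the text only states that the noise is log-normal with parameter σ, so the mean correction and the function name are assumptions.

```python
import numpy as np

def noisy_estimate(pi_value, sigma, rng):
    """Exact density times log-normal noise with mean 1.

    Assumption: log(noise) ~ N(-sigma^2 / 2, sigma^2), so E[noise] = 1 and
    the returned estimate is unbiased for pi_value.
    """
    z = rng.standard_normal(np.shape(pi_value))
    return pi_value * np.exp(sigma * z - 0.5 * sigma**2)
```

Under this assumption the median of the noise is exp(−σ²/2), about 3.4×10⁻⁴ for σ = 4, while its mean stays at 1. Most estimates are therefore far too small and a few are very large, which is consistent with the sticking reported for the SEMH at larger σ.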

Programming details

Figure 3.6.5: KMH SEMH comparisons

Different approaches have been tried in the 2-d and 5-d examples; they must be integrated. Currently logarithms of densities are calculated; as the smoothing is linear, computation is dominated by the exp(.) function, and this should be changed. In 2-d an index of which observations lie in each of a grid of squares of side h is maintained, so that the kernel is not evaluated when it is known to be zero. In 5-d a decreasing kernel bandwidth h is used.
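The 2-d grid index can be sketched as follows. The class name and interface are hypothetical, and the sketch assumes a constant h and a kernel that is zero whenever a stored point differs from the query by more than h in either coordinate; under that assumption only the 3×3 block of squares around the query can contribute.

```python
import math
from collections import defaultdict

class GridIndex2D:
    """Bucket stored points into squares of side h for fast neighbour lookup."""

    def __init__(self, h):
        self.h = h
        self.cells = defaultdict(list)   # (ix, iy) -> list of (x, y, value)

    def _cell(self, x, y):
        # Integer coordinates of the square containing (x, y).
        return (math.floor(x / self.h), math.floor(y / self.h))

    def add(self, x, y, value):
        """Store a point and its noisy density estimate."""
        self.cells[self._cell(x, y)].append((x, y, value))

    def neighbours(self, x, y):
        """Yield stored points in the 3x3 block of squares around (x, y)."""
        ix, iy = self._cell(x, y)
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                # .get avoids creating empty buckets in the defaultdict.
                yield from self.cells.get((ix + dx, iy + dy), ())
```

The kernel sums in (3.6.1) would then run over `neighbours(x, y)` rather than over every stored observation. A decreasing bandwidth, as used in 5-d, would need a different structure, since the squares here have a fixed side.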

KMH on Indian Buffet Posteriors

Attempts have been made to use the KMH on Indian Buffet posteriors; these are not described in detail here or in the next chapter, as they have been unsuccessful. The reason appears to be the far greater variance of the log-likelihood estimates, which can be equivalent to a log-normal parameter σ of 100 or more. The result is that the samples become concentrated in an area centred on one large value but matching the proposal distribution.
