
Chapter 4 Weight Preserved Tempering

4.4 The HAT (Hessian Adjusted Tempering) Algorithm

4.4.2 One-dimensional Gaussian mixture example:

Consider a bi-modal Gaussian mixture target with target density given by:

\[
\pi(x) \propto \sum_{k=1}^{2} w_k \, \phi_{(\mu_k, \sigma_k^2)}(x) \tag{4.44}
\]

where $\phi_{(\mu,\sigma^2)}(\cdot)$ is the density function of a univariate Gaussian with mean $\mu$ and variance $\sigma^2$. The modal weights are $w_1 = 0.8$ and $w_2 = 0.2$, the means are $\mu_1 = -40$ and $\mu_2 = 40$, and the standard deviations are $\sigma_1 = 0.1$ and $\sigma_2 = 5$, so there is a large disparity between the modal variances.

Figure 4.6 illustrates the target distribution π and clearly shows the variance and height disparity between the modes.

Figure 4.6: The target density plotted for the example bi-modal target given in equation (4.44). Note that the second mode, located at 40, is widely dispersed yet is the dominant mode at the hotter temperatures under vanilla power tempering.
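The dominance of the wide mode at hotter temperatures can be checked with a short calculation. When the modes are well separated, integrating $(w_k \phi_k(x))^\beta$ over $x$ gives a mass proportional to $w_k^\beta \sigma_k^{1-\beta}$, so the narrow mode loses weight as $\beta$ decreases. The following is a minimal Python sketch under that well-separated assumption; the function name is illustrative and not taken from the source, and the approximation degrades at very hot temperatures where the tempered modes overlap.

```python
# Mixture parameters from equation (4.44).
WEIGHTS = (0.8, 0.2)
SIGMAS = (0.1, 5.0)


def tempered_mode_weights(beta, weights=WEIGHTS, sigmas=SIGMAS):
    """Approximate mass of each mode under the power-tempered target pi(x)**beta.

    Assumption: the modes are well separated, so the mass of mode k is
    proportional to w_k**beta * sigma_k**(1 - beta).
    """
    masses = [w ** beta * s ** (1.0 - beta) for w, s in zip(weights, sigmas)]
    total = sum(masses)
    return [m / total for m in masses]


if __name__ == "__main__":
    for beta in (1.0, 0.5, 0.05):
        w1, w2 = tempered_mode_weights(beta)
        print(f"beta = {beta:<5g} mode at -40: {w1:.3f}   mode at 40: {w2:.3f}")
```

At $\beta = 1$ the sketch recovers the target weights 0.8 and 0.2, while at $\beta = 0.05$ the mode at $-40$ retains only roughly 2.5% of the mass under this approximation.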

The performance of the new HAT algorithm will be compared with that of the PT algorithm. However, it is not obvious how to construct a fair comparison, since the “optimal temperature schedules” will differ between the two algorithms. For both algorithms the suggested rule is to choose the spacings so that the swap acceptance rate is 0.234 (see Chapter 5 for details of the optimal scaling of the HAT algorithm).

To illustrate the gains of the HAT algorithm in this case, both algorithms will use the same temperature schedule, chosen to be optimal for the HAT algorithm. This will highlight that the optimal spacings for the HAT algorithm in this example are too ambitious for the PT algorithm to work well.

The inverse temperature schedule used is geometrically spaced with common ratio 0.05; that is, the schedule is $\{1, 0.05, 0.05^2\}$. Over 10 repeated runs of the HAT algorithm, this schedule was verified to be (approximately) optimal for the HAT algorithm, according to Theorem 5.1.1 from Chapter 5. Furthermore, at each level there are three within-temperature moves before a temperature swap is proposed between a uniformly selected pair of consecutive temperatures. A run is finished once 20,000 swap moves have been proposed. All chains were started at the position 40 to highlight the lack of robustness of the PT algorithm to a feasible start point.
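The PT side of this setup can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the code used for the reported runs: the within-temperature moves are assumed to be Gaussian random-walk Metropolis steps, the proposal scales in `step_sizes` are hypothetical values chosen only for illustration, and one "iteration" is taken to be one swap proposal. The HAT algorithm itself is not sketched, since its definition lies outside this section.

```python
import math
import random

# Mixture parameters from equation (4.44) and the schedule described above.
WEIGHTS = (0.8, 0.2)
MEANS = (-40.0, 40.0)
SIGMAS = (0.1, 5.0)
BETAS = (1.0, 0.05, 0.05 ** 2)  # geometric schedule, common ratio 0.05


def log_target(x):
    """Log of the unnormalised mixture density, computed via log-sum-exp."""
    terms = [
        math.log(w) - math.log(s) - 0.5 * ((x - m) / s) ** 2
        for w, m, s in zip(WEIGHTS, MEANS, SIGMAS)
    ]
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))


def parallel_tempering(n_swaps=20_000, n_within=3, start=40.0,
                       step_sizes=(0.5, 10.0, 50.0), seed=0):
    """Vanilla power-tempered PT; step_sizes are hypothetical tuning values."""
    rng = random.Random(seed)
    n_levels = len(BETAS)
    xs = [start] * n_levels
    cold_trace = []
    attempts = [0] * (n_levels - 1)
    accepts = [0] * (n_levels - 1)
    for _ in range(n_swaps):
        # Within-temperature random-walk Metropolis moves at every level.
        for _ in range(n_within):
            for j, beta in enumerate(BETAS):
                prop = xs[j] + rng.gauss(0.0, step_sizes[j])
                log_ratio = beta * (log_target(prop) - log_target(xs[j]))
                if math.log(1.0 - rng.random()) < log_ratio:
                    xs[j] = prop
        # Swap proposal between a uniformly chosen pair of adjacent levels.
        j = rng.randrange(n_levels - 1)
        log_ratio = (BETAS[j] - BETAS[j + 1]) * (
            log_target(xs[j + 1]) - log_target(xs[j])
        )
        attempts[j] += 1
        if math.log(1.0 - rng.random()) < log_ratio:
            xs[j], xs[j + 1] = xs[j + 1], xs[j]
            accepts[j] += 1
        cold_trace.append(xs[0])  # state of the coldest chain
    return cold_trace, attempts, accepts
```

Calling `parallel_tempering()` returns the cold-chain trace together with per-pair swap counts, from which the swap acceptance rate between the coldest two levels is `accepts[0] / attempts[0]`. The rates obtained depend on the assumed proposal scales and need not match those reported in this section.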

Figure 4.7 shows three trace plots of the cold state chain for runs of the PT algorithm under the described setup. The target weight in the modal region centred on −40 is 0.8, so it is clear that the performance of the PT algorithm is inconsistent between runs, making modal weight estimates highly variable.

Figure 4.8 shows the trace plots of the cold state chain for three runs of the HAT algorithm. It is immediately obvious that the inter-modal mixing is far more regular than for the PT approach. The acceptance rates of the swap moves between the coldest and next coldest levels for these three runs are respectively {0.27, 0.29, 0.28}; for comparison, those of the PT runs in Figure 4.7 were {0.05, 0.12, 0.12}. In this case, which is simplistic due to the single dimension, the poor mixing at the target temperature in the PT algorithm would be picked up by the low swap acceptance rate, and hence in an optimal setup further intermediate temperatures would be needed. This illustrates the ability of the HAT algorithm to take larger steps through the temperature schedule than the PT algorithm is capable of. In the five-dimensional example below, there are illustrations of the swap acceptance rates failing to diagnose the poor mixing.

Figure 4.7: Three trace plots of the mixing of the coldest level chain in three separate runs of the PT algorithm targeting the distribution given in equation (4.44). The setup of the PT algorithm was the same in each case and the key observation is the infrequency of inter-modal jumps that would subsequently result in more variable estimates of modal weights.

Figure 4.9 shows the running modal weight approximation of $w_1$ after the $k$th iteration of the cold state chains, once a burn-in period of 10,000 iterations has been removed, for the ten examples of the PT and HAT runs respectively. The weight approximation after iteration $k$ is given by

\[
\hat{w}_1^{k} = \frac{1}{k - 10000} \sum_{i=10001}^{k} \mathbb{1}(X_i < 0) \tag{4.45}
\]

where $X_i$ is the location of the chain at the coldest temperature level after the $i$th iteration. Observe the jagged and volatile convergence of the running estimate $\hat{w}_1^{k}$ to the true value 0.8 for the PT algorithm.
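The estimator in (4.45) is a running mean of the indicator that the cold chain lies in the negative half-line, computed after discarding the burn-in. A minimal Python sketch follows; the function name and the representation of the trace as a plain sequence of chain locations are assumptions made for illustration.

```python
def running_weight_estimate(cold_trace, burn_in=10_000):
    """Running estimate of w_1 as in equation (4.45).

    Entry n - 1 of the result is the fraction of the first n post-burn-in
    cold-chain states that are negative, i.e. the estimate after
    iteration k = burn_in + n.
    """
    estimates = []
    count = 0
    for n, x in enumerate(cold_trace[burn_in:], start=1):
        count += x < 0  # indicator 1(X_i < 0)
        estimates.append(count / n)
    return estimates
```

For example, with a toy trace `[5, -1, -2, 3, -4]` and `burn_in=1`, the post-burn-in states are `-1, -2, 3, -4`, so the running estimates are 1, 1, 2/3 and 3/4.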

Figure 4.8: Three trace plots of the mixing of the coldest level chain in three separate runs of the HAT algorithm targeting the distribution given in equation (4.44). The setup of the HAT algorithm was the same in each case and the key observation is the relatively high frequency of inter-modal jumps, which one would hope would give a fast rate of convergence of an estimator of the modal weights.

To see how the performance of the HAT algorithm compares with that of the idealised algorithm in this Gaussian mixture setting, 10 runs of the Ideal algorithm were performed. All runs had the same setup with regard to the temperature schedule and within-level performance. The runs gave comparable performance to the HAT version. An example comparing a run from each of the three types is given in Figure 4.10. It is hard to differentiate between the trace plots for the HAT and idealised runs.

In document Towards optimality of the parallel tempering algorithm (Page 136-139)