In all of our experiments, we use a spherical, axis-aligned Gaussian for the proposal distribution, i.e.,
$$\theta' \sim q(\theta' \mid \theta) = \mathcal{N}(\theta' \mid \theta, \lambda^2 I_d), \tag{5.1}$$
where $\lambda \in \mathbb{R}_+$ is the standard deviation, $I_d$ is the $d$-dimensional identity matrix, and $d$ is the dimension of $\theta$. In our preliminary experiments, which we do not include in our evaluation here, we used a fixed proposal distribution. This was problematic because, as we discussed in Section 2.4.1, the behavior of MH is both sensitive to the proposal distribution and changes over the course of execution. As a result, it was difficult to tune the parameters of the proposal distribution to yield experiments satisfying the requirements stated at the beginning of this chapter: each MH simulation starts away from convergence, progresses through burn-in, and eventually converges, while achieving a meaningful acceptance rate and number of effective samples. Specifically, suppose we set the proposal distribution to achieve an acceptance rate of about 0.234 during the burn-in phase. MH advances until it is close to a mode of the target, but there the proposals tend to land far from the mode and thus have low probability under the target. This results in a high rate of rejection, and the algorithm becomes stuck. Alternatively, if we tune the proposal distribution to sample well around such a mode of the target distribution, then the characteristic step size tends to be much smaller than before, and progress during burn-in is artificially slow.
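The sensitivity to the proposal scale can be illustrated with a minimal sketch of random-walk MH using the proposal in Equation 5.1. The function names, the standard normal target, and the settings below are illustrative assumptions, not part of our system.

```python
import numpy as np

def propose(theta, lam, rng):
    """Draw theta' ~ N(theta, lam^2 I_d), the proposal in Equation 5.1."""
    return theta + lam * rng.standard_normal(theta.shape)

def mh_step(theta, log_target, lam, rng):
    """One random-walk MH step; returns (new_state, accepted)."""
    theta_prop = propose(theta, lam, rng)
    # The proposal is symmetric, so the acceptance ratio is the target ratio.
    log_ratio = log_target(theta_prop) - log_target(theta)
    if np.log(rng.uniform()) < log_ratio:
        return theta_prop, True
    return theta, False

def acceptance_rate(lam, d=10, n_steps=2000, seed=0):
    """Empirical acceptance rate on a standard normal target (an assumed
    toy target), with the chain started at the mode."""
    rng = np.random.default_rng(seed)
    log_target = lambda x: -0.5 * float(x @ x)
    theta = np.zeros(d)
    accepted = 0
    for _ in range(n_steps):
        theta, acc = mh_step(theta, log_target, lam, rng)
        accepted += acc
    return accepted / n_steps
```

In this toy setting a very small $\lambda$ accepts almost every proposal but moves slowly, while a very large $\lambda$ rejects almost every proposal near the mode, which mirrors the two failure cases described above.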
Our solution employs a simple adaptive scheme to set the parameters of the proposal distribution, improving convergence relative to standard MH. This approach falls within the class of provably convergent adaptive algorithms studied by Andrieu and Moulines (2006) and was easily incorporated into our framework.
The general idea behind adaptive MH is to improve performance by tuning the proposal distribution during execution, using information from the samples as they are generated, in a way that converges asymptotically. Often, it is desirable for the proposal distribution to be close to the target. This motivates adaptive schemes that fit a distribution to the observed samples and use this fitted model as the proposal distribution. For example, a simple online procedure can update the mean $\mu$ and covariance $\Sigma$ of a multidimensional Gaussian model as follows:
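$$\mu_{t+1} = \mu_t + \gamma_{t+1}(\theta_{t+1} - \mu_t), \qquad t \ge 0,$$
$$\Sigma_{t+1} = \Sigma_t + \gamma_{t+1}\bigl((\theta_{t+1} - \mu_t)(\theta_{t+1} - \mu_t)^\top - \Sigma_t\bigr),$$
where $t$ indexes the MH iterations and $\gamma_{t+1}$ controls the speed with which the adaptation vanishes. An appropriate choice is $\gamma_t = t^{-\alpha}$ for $\alpha \in [1/2, 1)$. The tutorial by Andrieu and Thoms (2008) provides a review of this and other, more sophisticated, adaptive MH algorithms.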
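The online update of $\mu$ and $\Sigma$ can be sketched in a few lines. The function name `adapt_gaussian`, the default $\alpha = 0.6$, and the initial values (zero mean, identity covariance) are assumptions chosen for illustration.

```python
import numpy as np

def adapt_gaussian(samples, alpha=0.6, mu0=None, sigma0=None):
    """Run the online mean/covariance recursion over a sequence of samples.

    samples: array of shape (n, d), playing the role of theta_1, ..., theta_n.
    alpha:   decay exponent, so that gamma_t = t**(-alpha).
    """
    samples = np.asarray(samples, dtype=float)
    d = samples.shape[1]
    mu = np.zeros(d) if mu0 is None else np.array(mu0, dtype=float)
    sigma = np.eye(d) if sigma0 is None else np.array(sigma0, dtype=float)
    for t, theta in enumerate(samples):
        gamma = (t + 1) ** (-alpha)   # gamma_{t+1}
        diff = theta - mu             # theta_{t+1} - mu_t (mean before update)
        mu = mu + gamma * diff
        sigma = sigma + gamma * (np.outer(diff, diff) - sigma)
    return mu, sigma
```

Note that both updates use the pre-update mean $\mu_t$, matching the recursion above, and that $\gamma_1 = 1$ means the first sample overwrites the initial values.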
Our adaptive scheme directly uses information about whether proposals are accepted or rejected to tune the proposal distribution to achieve an acceptance rate of approximately 0.234. Let $\rho$ be a node in the MH binary tree. Denote by $\mathbb{1}_\rho$ the indicator for whether $\theta_\rho$ corresponds to an accepted or rejected state, i.e.,
$$\mathbb{1}_\rho = \begin{cases} 1 & \text{if } \rho \text{ is a right child in the MH binary tree} \\ 0 & \text{if } \rho \text{ is a left child in the MH binary tree.} \end{cases}$$
Our strategy is to increase $\lambda$, the scale of our proposal distribution in Equation 5.1, if the acceptance rate is too high and decrease it if the acceptance rate is too low. Our adaptive rule achieves this by modifying $\ell = \log \lambda^2$, the log of the variance, as follows:
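One update of this kind, written here as a sketch in the standard Robbins-Monro form, is given next. The exact expression, including the indexing of $\gamma$ and the use of 0.234 as the target constant, is an assumption consistent with the surrounding description rather than a formula taken from the text; $\rho$ denotes the node reached at iteration $t+1$.

```latex
\ell_{t+1} = \ell_t + \gamma_{t+1}\,\bigl(\mathbb{1}_{\rho} - 0.234\bigr)
```

Under this assumed form, $\ell$ rises by $0.766\,\gamma_{t+1}$ after an accepted state and falls by $0.234\,\gamma_{t+1}$ after a rejected one, so the expected change is zero exactly when the acceptance rate equals 0.234.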
We set $\ell_0 = \log(2.38^2/d)$, which corresponds to the proposal distribution with the “optimal” acceptance rate of 0.234, derived for the case where the target is a standard $d$-dimensional normal distribution, in the limits where the chain has converged and $d \to \infty$ (Roberts et al., 1997). We empirically found $\gamma_t = t^{-1/2}$ to work well. Our adaptive approach can be generalized to more complicated proposal distributions, but we did not need any for our experiments.
To support this adaptive MH algorithm within our prefetching framework, we made a simple extension to our system. In general, adaptive MH depends on the history of the simulated chain. Our adaptive scheme depends on the sequence of accepted and rejected states, i.e., the chain’s path through the MH binary state tree. Given an initial value for $\ell_0$, the trajectory of $\ell_t$ is completely determined by this path. Thus, whenever we create a new node $\rho$ in the jobtree, we generate the corresponding value of $\ell_\rho$ and store it on the node. This information, which is stored on the master, is communicated to a worker via a HAVE-WORK message when that worker is called upon to generate the proposal at $\rho$.
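The following minimal sketch shows how a per-node value of $\ell$ can be derived from the path alone. The `Node` class and its fields are hypothetical and do not reflect our actual jobtree data structure, and the update in `next_ell` assumes a Robbins-Monro form, $\ell \leftarrow \ell + \gamma\,(\mathbb{1}_\rho - 0.234)$, chosen to match the description of the adaptive rule.

```python
import math

TARGET_RATE = 0.234  # target acceptance rate

def initial_ell(d):
    """ell_0 = log(2.38^2 / d), the log of the initial proposal variance."""
    return math.log(2.38 ** 2 / d)

def next_ell(ell, t, accepted, alpha=0.5):
    """Assumed update: ell_{t+1} = ell_t + gamma_{t+1} * (indicator - 0.234),
    with gamma_t = t**(-alpha)."""
    gamma = (t + 1) ** (-alpha)
    indicator = 1.0 if accepted else 0.0
    return ell + gamma * (indicator - TARGET_RATE)

class Node:
    """Hypothetical tree node that caches ell for the state it represents."""

    def __init__(self, ell, depth):
        self.ell = ell      # log proposal variance stored on the node
        self.depth = depth  # number of MH iterations from the root

    def child(self, accepted):
        """Right child (accepted=True) or left child (accepted=False)."""
        return Node(next_ell(self.ell, self.depth, accepted), self.depth + 1)

    @property
    def lam(self):
        """Proposal standard deviation lambda = exp(ell / 2)."""
        return math.exp(0.5 * self.ell)
```

Because `child` depends only on the parent's stored value and the accept/reject branch, two nodes reached by the same path always carry the same $\ell$, which is what allows the value to be computed once at node creation and shipped with the work request.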
For our mixture of Gaussians problem, we follow the standard convention of additionally permuting the dimension labels each time a proposal is generated. In the Bayesian Lasso problem, the first coordinate of $\theta$ is a standard deviation and must be positive, so we truncate this dimension of the proposal distribution accordingly.
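One way to realize such a truncation is sketched below. The text does not specify how the truncated coordinate is sampled or whether a correction term is applied, so both the rejection-sampling approach and the helper names here are assumptions. Since a truncated Gaussian proposal is no longer symmetric, the sketch also computes the log ratio of proposal densities that would enter the MH acceptance probability.

```python
import math
import numpy as np

def std_normal_cdf(x):
    """Standard normal CDF, via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def propose_truncated(theta, lam, rng):
    """Spherical Gaussian proposal whose first coordinate is restricted to
    (0, inf) by redrawing it until it is positive. Assumes theta[0] > 0."""
    prop = theta + lam * rng.standard_normal(theta.shape)
    while prop[0] <= 0.0:
        prop[0] = theta[0] + lam * rng.standard_normal()
    return prop

def log_hastings_correction(theta, prop, lam):
    """log q(theta | prop) - log q(prop | theta) for the truncated proposal.

    The Gaussian kernels cancel; only the truncation normalizers
    Phi(theta[0] / lam) and Phi(prop[0] / lam) remain.
    """
    return (math.log(std_normal_cdf(theta[0] / lam))
            - math.log(std_normal_cdf(prop[0] / lam)))
```

The correction vanishes when the current and proposed first coordinates are both many standard deviations above zero, so it matters mainly when the chain is near the boundary.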