Recycled elliptical slice sampling - Advances in Bayesian inference and stable optimization for

In this section we show how to sample $J > 1$ points in each ESS iteration without a significant increase in computational complexity. This idea is inspired by the work of Nishimura and Dunson [2015] on HMC, in which an HMC algorithm is devised that “recycles” the intermediate points as valid samples from the target distribution. We borrow the term “recycling” from them and call our method Recycled Elliptical Slice Sampling (Recycled ESS).

Recall from Section 2.2.1 that in every ESS iteration we propose points along an ellipse within an angle bracket, which is iteratively shrunk until a point is accepted. In Recycled ESS, we do not stop after accepting the first point but continue to propose points, starting from the last angle bracket used. This procedure is continued until $J$ points are accepted. One of the $J$ accepted points is then selected uniformly at random to propagate the Markov chain.

As we shrink the angle bracket $[\theta_{\min}, \theta_{\max}]$ towards $\theta = 0$ (corresponding to the current point), the probability of the next proposal point being accepted tends to increase. Hence the number of shrinkage steps required to accept later points is typically smaller than that for the first accepted point. Since the number of likelihood function evaluations is proportional to the number of shrinkage steps, Recycled ESS is able to sample more points with only a small increase in computational complexity, leading to improved run times per sample. This approach is formalized in Algorithm 2. Note that Recycled ESS is equivalent to standard ESS if $J = 1$.
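The procedure described above can be sketched in Python as follows. This is a minimal illustration rather than the author's implementation: the function names `recycled_ess_step` and `recycled_ess`, the use of a Cholesky factor to draw from the prior, and the returned count of likelihood evaluations are all assumptions made for the example.

```python
import numpy as np


def recycled_ess_step(x, log_lik, chol_sigma, J, rng):
    """One Recycled ESS iteration from state x (hypothetical helper).

    Returns the J accepted points in random order and the number of
    log-likelihood evaluations used.
    """
    d = x.shape[0]
    nu = chol_sigma @ rng.standard_normal(d)           # ellipse: nu ~ N(0, Sigma)
    # Slice threshold; 1 - U lies in (0, 1], which avoids log(0).
    log_y = log_lik(x) + np.log(1.0 - rng.uniform())
    n_evals = 1
    theta_max = rng.uniform(0.0, 2.0 * np.pi)          # initial bracket of width 2*pi
    theta_min = theta_max - 2.0 * np.pi
    accepted = np.empty((J, d))
    for j in range(J):
        while True:
            theta = rng.uniform(theta_min, theta_max)
            proposal = x * np.cos(theta) + nu * np.sin(theta)
            n_evals += 1
            if log_lik(proposal) >= log_y:
                break                                   # point lies on the slice
            # Rejected: shrink the bracket towards theta = 0 (the current point).
            if theta < 0.0:
                theta_min = theta
            else:
                theta_max = theta
        accepted[j] = proposal                          # bracket is kept for the next j
    return rng.permutation(accepted), n_evals           # shuffles the J rows


def recycled_ess(log_lik, x0, sigma, n_iters, J, seed=0):
    """Run n_iters Recycled ESS iterations; samples has shape (n_iters, J, d)."""
    rng = np.random.default_rng(seed)
    chol = np.linalg.cholesky(sigma)
    x = np.asarray(x0, dtype=float)
    samples = np.empty((n_iters, J, x.shape[0]))
    total_evals = 0
    for i in range(n_iters):
        samples[i], n = recycled_ess_step(x, log_lik, chol, J, rng)
        total_evals += n
        x = samples[i, 0]   # first point after the permutation propagates the chain
    return samples, total_evals
```

Calling `recycled_ess(log_lik, x0, sigma, n_iters, J=1)` reduces to the standard ESS update under this sketch. Comparing the returned evaluation count for different values of `J` gives a rough picture of the extra cost of recycling.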

It is implied in Algorithm 2 that we treat each sample $X_j^{(i)}$ as an element in a large Markov chain with state space $(X_1^{(i)}, \dots, X_J^{(i)})$. We prove in Theorem 2.3.2 that each element $X_j^{(i)}$ has $p^*$ as its stationary marginal distribution. In order to do so, we first show in Lemma 2.3.1 that the transition operator of accepting the $j$th point is reversible.

Algorithm 2: Recycled ESS

Input: log-likelihood function $\log L$, initial point $x_1^{(0)} \in \mathbb{R}^d$, prior $\mathcal{N}(0, \Sigma)$, number of iterations $N$, number of recycled points $J$
Output: samples from the Markov chain $((x_1^{(1)}, \dots, x_J^{(1)}), \dots, (x_1^{(N)}, \dots, x_J^{(N)}))$

for $i = 1$ to $N$ do
    Choose ellipse: $\nu \sim \mathcal{N}(0, \Sigma)$
    Log-likelihood threshold:
        $u \sim \mathrm{Uniform}[0, 1]$
        $\log y \leftarrow \log L(x_1^{(i-1)}) + \log u$
    Define initial bracket:
        $\theta_{\max} \sim \mathrm{Uniform}[0, 2\pi]$
        $\theta_{\min} \leftarrow \theta_{\max} - 2\pi$
    for $j = 1$ to $J$ do
        Draw initial proposal:
            $\theta \sim \mathrm{Uniform}[\theta_{\min}, \theta_{\max}]$
            $x' \leftarrow x_1^{(i-1)} \cos\theta + \nu \sin\theta$
        while $\log L(x') < \log y$ do
            Shrink bracket:
                if $\theta < 0$ then $\theta_{\min} \leftarrow \theta$
                else $\theta_{\max} \leftarrow \theta$
            Draw new proposal:
                $\theta \sim \mathrm{Uniform}[\theta_{\min}, \theta_{\max}]$
                $x' \leftarrow x_1^{(i-1)} \cos\theta + \nu \sin\theta$
        end
        Accept point: $\hat{x}_j^{(i)} \leftarrow x'$
    end
    $(x_1^{(i)}, \dots, x_J^{(i)}) \leftarrow$ random permutation of $(\hat{x}_1^{(i)}, \dots, \hat{x}_J^{(i)})$
end
return $((x_1^{(1)}, \dots, x_J^{(1)}), \dots, (x_1^{(N)}, \dots, x_J^{(N)}))$

Lemma 2.3.1. Let $T_j$ correspond to the transition operator from $X_1^{(i-1)} \to \hat{X}_j^{(i)}$. Then $T_j$ is invariant to $p^*$.

Proof. Our proof is similar to that of the original ESS algorithm [Murray et al., 2010, Sec. 2.3]. The approach is to show that $T_j$ is reversible, i.e.

$$p^*(X = x_1^{(i-1)}) \cdot p(\bar{X} = \hat{x}_j^{(i)} \mid X = x_1^{(i-1)}) = p^*(X = \hat{x}_j^{(i)}) \cdot p(\bar{X} = x_1^{(i-1)} \mid X = \hat{x}_j^{(i)}),$$

from which it follows that $T_j$ is invariant to $p^*$.

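Let $\{\theta_{j,k}\}$, $k = 1, 2, \dots, K_j$, be the sequence of angles sampled during $T_j$. The distribution of the current state $X = x_1^{(i-1)}$ with respect to $p^*$ (as defined in (2.1)) multiplied by the distribution of the random variables $Y, \nu, \{\theta_{j,k}\}$ generated to transition to $\bar{X} = \hat{x}_j^{(i)}$ is

\begin{align*}
& p^*(X = x_1^{(i-1)}) \cdot p(Y, \nu, \{\theta_{j,k}\} \mid X = x_1^{(i-1)}) \\
&\quad = p^*(X = x_1^{(i-1)}) \cdot p(Y \mid X = x_1^{(i-1)}) \cdot p(\nu) \cdot p(\{\theta_{j,k}\} \mid X = x_1^{(i-1)}, Y, \nu) \\
&\quad \propto \mathcal{N}(x_1^{(i-1)}; 0, \Sigma) \cdot \mathcal{N}(\nu; 0, \Sigma) \cdot p(\{\theta_{j,k}\} \mid X = x_1^{(i-1)}, Y, \nu)
\end{align*}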

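where $p(Y = y \mid X = x_1^{(i-1)}) = \mathbb{I}[0 \le y \le L(x_1^{(i-1)})] / L(x_1^{(i-1)})$. The key to proving reversibility is showing that¹

$$p^*(X = x_1^{(i-1)}) \cdot p(Y = y, \nu = \nu, \{\theta_{j,k}\} = \{\theta_{j,k}\} \mid X = x_1^{(i-1)}) = p^*(X = \hat{x}_j^{(i)}) \cdot p(Y = y, \nu = \hat{\nu}, \{\theta_{j,k}\} = \{\hat{\theta}_{j,k}\} \mid X = \hat{x}_j^{(i)}) \tag{2.4}$$

where

$$\hat{\nu} = \nu \cos(\theta_{j,K_j}) - x_1^{(i-1)} \sin(\theta_{j,K_j}), \qquad \hat{\theta}_{j,k} = \begin{cases} \theta_{j,k} - \theta_{j,K_j} & \text{if } k < K_j \\ -\theta_{j,K_j} & \text{if } k = K_j. \end{cases}$$

¹ We have overloaded our notation with $\nu$ and $\{\theta_{j,k}\}$. In the expression $\nu = \nu$ the left $\nu$ refers to the random variable and the right $\nu$ to its value. Likewise for $\{\theta_{j,k}\}$. The notation was chosen to be consistent with [Murray et al., 2010].

The values $\hat{\nu}$ and $\hat{\theta}_{j,k}$ are constructed such that

$$x_1^{(i-1)} \cos(\theta_{j,k}) + \nu \sin(\theta_{j,k}) = \hat{x}_j^{(i)} \cos(\hat{\theta}_{j,k}) + \hat{\nu} \sin(\hat{\theta}_{j,k})$$

for all $k < K_j$. The points proposed in the reverse direction (from $\hat{x}_j^{(i)}$ to $x_1^{(i-1)}$) are thus the same as in the forward direction (from $x_1^{(i-1)}$ to $\hat{x}_j^{(i)}$), except when $k = K_j$. To prove (2.4), we first show that: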

$$p(\{\theta_{j,k}\} = \{\theta_{j,k}\} \mid X = x_1^{(i-1)}, Y = y, \nu = \nu) = p(\{\theta_{j,k}\} = \{\hat{\theta}_{j,k}\} \mid X = \hat{x}_j^{(i)}, Y = y, \nu = \hat{\nu}) \tag{2.5}$$

The argument is as follows: the probability density for the first angle $\theta_{j,1}$ is always $1/2\pi$. The intermediate angles were drawn with probability densities $1/(\theta^{\max}_{j,k} - \theta^{\min}_{j,k})$, where $(\theta^{\min}_{j,k}, \theta^{\max}_{j,k})$ denotes the angle bracket for $\theta_{j,k}$. Whenever the bracket was shrunk, it was done so that $\hat{x}_j^{(i)}$ remained selectable. Now let us consider the reverse transitions starting from $\hat{x}_j^{(i)}$. The reverse transitions make the same intermediate proposals. Since angle brackets $(\hat{\theta}^{\min}_{j,k}, \hat{\theta}^{\max}_{j,k})$ of the same size are sampled, the probabilities of drawing the angles in the forward and reverse transitions are the same.
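The claim that the reverse transitions revisit the same proposals can be checked directly from the definitions of the hatted quantities. The following worked step is a sketch in which, purely as shorthand introduced here, $x$ stands for $x_1^{(i-1)}$, $\hat{x}$ for $\hat{x}_j^{(i)} = x\cos\theta_K + \nu\sin\theta_K$, $\theta_k$ for $\theta_{j,k}$ and $\theta_K$ for $\theta_{j,K_j}$.

```latex
\begin{align*}
\hat{x}\cos(\theta_k - \theta_K) + \hat{\nu}\sin(\theta_k - \theta_K)
  &= (x\cos\theta_K + \nu\sin\theta_K)\cos(\theta_k - \theta_K)
   + (\nu\cos\theta_K - x\sin\theta_K)\sin(\theta_k - \theta_K) \\
  &= x\,[\cos\theta_K\cos(\theta_k - \theta_K) - \sin\theta_K\sin(\theta_k - \theta_K)] \\
  &\quad + \nu\,[\sin\theta_K\cos(\theta_k - \theta_K) + \cos\theta_K\sin(\theta_k - \theta_K)] \\
  &= x\cos\theta_k + \nu\sin\theta_k
  && (k < K_j), \\[4pt]
\hat{x}\cos(-\theta_K) + \hat{\nu}\sin(-\theta_K)
  &= (x\cos\theta_K + \nu\sin\theta_K)\cos\theta_K
   - (\nu\cos\theta_K - x\sin\theta_K)\sin\theta_K \\
  &= x
  && (k = K_j).
\end{align*}
```

The first identity uses the angle-addition formulas; the second shows that the final reverse angle $-\theta_{j,K_j}$ returns the chain to its starting point.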

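Additionally, we have that

$$\mathcal{N}(x_1^{(i-1)}; 0, \Sigma) \cdot \mathcal{N}(\nu; 0, \Sigma) = \mathcal{N}(\hat{x}_j^{(i)}; 0, \Sigma) \cdot \mathcal{N}(\hat{\nu}; 0, \Sigma) \tag{2.6}$$

since after taking logs and cancelling constants in (2.6) we have

\begin{align*}
\hat{x}_j^{(i)\top} \Sigma^{-1} \hat{x}_j^{(i)} + \hat{\nu}^\top \Sigma^{-1} \hat{\nu}
&= (x_1^{(i-1)} \cos(\theta_{j,K_j}) + \nu \sin(\theta_{j,K_j}))^\top \Sigma^{-1} (x_1^{(i-1)} \cos(\theta_{j,K_j}) + \nu \sin(\theta_{j,K_j})) \\
&\quad + (\nu \cos(\theta_{j,K_j}) - x_1^{(i-1)} \sin(\theta_{j,K_j}))^\top \Sigma^{-1} (\nu \cos(\theta_{j,K_j}) - x_1^{(i-1)} \sin(\theta_{j,K_j})) \\
&= x_1^{(i-1)\top} \Sigma^{-1} x_1^{(i-1)} + \nu^\top \Sigma^{-1} \nu,
\end{align*}

where the cross terms in $x_1^{(i-1)\top} \Sigma^{-1} \nu$ cancel and $\cos^2(\theta_{j,K_j}) + \sin^2(\theta_{j,K_j}) = 1$.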
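The identity behind (2.6) can also be checked numerically. The sketch below is illustrative only: the dimension, the randomly generated covariance and the variable names are assumptions, and the quadratic form uses the precision matrix $\Sigma^{-1}$ as in the exponent of a Gaussian density (the identity itself holds for any symmetric matrix).

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
A = rng.standard_normal((d, d))
sigma = A @ A.T + d * np.eye(d)        # arbitrary positive-definite covariance
prec = np.linalg.inv(sigma)            # precision matrix Sigma^{-1}

x = rng.standard_normal(d)             # stands in for the current state
nu = rng.standard_normal(d)            # stands in for the ellipse vector
theta = rng.uniform(-np.pi, np.pi)     # plays the role of the accepted angle

# Forward map to the accepted point and the matching reverse ellipse vector.
x_hat = x * np.cos(theta) + nu * np.sin(theta)
nu_hat = nu * np.cos(theta) - x * np.sin(theta)


def quad(v):
    """Quadratic form v^T Sigma^{-1} v."""
    return v @ prec @ v


lhs = quad(x_hat) + quad(nu_hat)
rhs = quad(x) + quad(nu)

# Proposing from x_hat with nu_hat at angle -theta recovers the original state.
x_back = x_hat * np.cos(-theta) + nu_hat * np.sin(-theta)

print(np.isclose(lhs, rhs), np.allclose(x_back, x))
```

Both comparisons agree up to floating-point error because the map $(x, \nu) \mapsto (\hat{x}, \hat{\nu})$ is a rotation in the plane spanned by the two vectors.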

Equation (2.6) combined with the result in (2.5) proves (2.4). Integrating over $y$, $\nu$ and $\{\theta_{j,k}\}$ proves reversibility and shows that $T_j$ is invariant to $p^*$.

Theorem 2.3.2 easily follows:

Theorem 2.3.2. Each element in the Recycled ESS Markov chain has marginal stationary distribution $p^*$.

Proof. The sequence of points $\{X_1^{(i)}\}$ follows a Markov chain. At each step the transition operator is uniformly sampled from the set $\{T_j : j = 1, \dots, J\}$, with each $T_j$ being invariant to $p^*$ (Lemma 2.3.1). Therefore we have that $X_1^{(i)} \xrightarrow{\text{dist.}} X^*$ where $X^* \sim p^*$. Also, at any fixed iteration $i$, we have that all points in $\{X_j^{(i)} : j = 1, \dots, J\}$ are identically distributed. This follows from the random permutations:

$$p(X_j^{(i)} \mid (\hat{X}_1^{(i)}, \dots, \hat{X}_J^{(i)})) = \mathrm{Uniform}(\hat{X}_1^{(i)}, \dots, \hat{X}_J^{(i)}) = p(X_k^{(i)} \mid (\hat{X}_1^{(i)}, \dots, \hat{X}_J^{(i)})).$$

Since we have that $X_1^{(i)} \xrightarrow{\text{dist.}} X^*$, it follows that for all $j$: $X_j^{(i)} \xrightarrow{\text{dist.}} X^*$.

The downside of Recycled ESS is that the later accepted points (corresponding to $j \approx J$) are sampled from a very small angle bracket and so are highly correlated. On the other hand, these points require only a small number of likelihood function evaluations. Overall, the effect of recycling is a small increase in the effective number of samples at the cost of a small increase in computational complexity. Whether or not this is beneficial is investigated empirically in Section 2.5.

2.4 Analytic elliptical slice sampling

In document Advances in Bayesian inference and stable optimization for large-scale machine learning problems (Page 32-36)