Variance Reduction - Control Variates for SGLD Efficiency

3.3 Control Variates for SGLD Efficiency

3.3.2 Variance Reduction

The improvements of using the control variate gradient estimate (3.3.1) over the standard (3.2.2) become apparent when we calculate the variances of each. For our analysis, we make the assumption that the posterior is m-strongly-log-concave and M-smooth, formally defined in Assumption 3.3.1. These assumptions are common when analysing gradient based samplers that do not have an acceptance step (Durmus and Moulines, 2017a; Dalalyan and Karagulyan, 2017). The assumptions imply that $mI \preceq \nabla^2 f(\theta) \preceq MI$ for all $\theta \in \mathbb{R}^d$, where $I$ is the identity matrix, and for two matrices $A_1, A_2 \in \mathbb{R}^{d \times d}$, $A_1 \preceq A_2$ means that $A_2 - A_1$ is positive semi-definite. In all the following analysis we use $\|\cdot\|$ to denote the Euclidean norm.

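Assumption 3.3.1. Strongly log-concave and smooth posterior: there exist positive constants $m$ and $M$ such that the following conditions hold for the negative log-posterior $f$:

$$f(\theta) - f(\theta') - \nabla f(\theta')^\top (\theta - \theta') \ge \frac{m}{2} \|\theta - \theta'\|^2, \qquad (3.3.3)$$

$$\|\nabla f(\theta) - \nabla f(\theta')\| \le M \|\theta - \theta'\|, \qquad (3.3.4)$$

for all $\theta, \theta' \in \mathbb{R}^d$.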

We further need a smoothness condition for each of the likelihood terms in order to bound the variance of our control-variate estimator of the gradient.

Assumption 3.3.2. Smoothness: there exist constants $L_0, \dots, L_N$ such that

$$\|\nabla f_i(\theta) - \nabla f_i(\theta')\| \le L_i \|\theta - \theta'\|, \quad \text{for } i = 0, \dots, N.$$

Using Assumption 3.3.2 we are able to derive a bound on the variance of the gradient estimate of SGLD-CV. This bound is formally stated in Lemma 3.3.3.

Lemma 3.3.3. Under Assumption 3.3.2, let $\theta_k$ be the state of SGLD-CV at the $k$th iteration, with stepsize $h$ and centring value $\hat\theta$. Assume we estimate the gradient using the control variate estimator with $p_i = L_i / \sum_{j=1}^{N} L_j$ for $i = 1, \dots, N$. Define $\xi_k := \nabla \tilde f(\theta_k) - \nabla f(\theta_k)$, so that $\xi_k$ measures the noise in the gradient estimate $\nabla \tilde f$ and has mean 0. Then for all $\theta_k, \hat\theta \in \mathbb{R}^d$, and all $k = 1, \dots, K$, we have

$$\mathbb{E}\|\xi_k\|^2 \le \frac{\left(\sum_{i=1}^{N} L_i\right)^2}{n}\, \mathbb{E}\|\theta_k - \hat\theta\|^2. \qquad (3.3.5)$$

Here the expectation is over both the noise in the minibatch choice, as well as the distribution of $\theta_k$. All proofs are relegated to the Appendix. It is simple to show that Assumption 3.3.2 also implies that the log-posterior is $M$-smooth, with $M = \sum_{i=0}^{N} L_i$, i.e. condition (3.3.4) in Assumption 3.3.1 holds. This allows us to write

$$\mathbb{E}\|\xi_k\|^2 \le \frac{M^2}{n}\, \mathbb{E}\|\theta_k - \hat\theta\|^2.$$

We will use this form of the bound for the rest of the analysis. In many situations, it is easier to work with a global bound on the smoothness constants, as in Assumption 3.3.4 below, and it is natural to choose $p_i = 1/N$. We use $p_i = 1/N$ in all our implementations in the experiments of Section 3.5.
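The bound of Lemma 3.3.3 can be checked numerically at a fixed state. The following is a minimal sketch, not the implementation used in the experiments. It assumes the likelihood part of the control variate estimator has the form $\frac{1}{n}\sum_{i \in S} \frac{1}{p_i}[\nabla f_i(\theta) - \nabla f_i(\hat\theta)]$ added to $\nabla f(\hat\theta)$, with $S$ made of $n$ independent draws from $p$; the estimator (3.3.1) itself is defined outside this section, so this form is an assumption. The toy least-squares terms, the data `X`, and all variable names are hypothetical, and the prior term is left out.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, n = 200, 3, 10                  # data size, dimension, minibatch size
X = rng.normal(size=(N, d))           # hypothetical covariates

# Toy terms f_i(theta) = 0.5 * (y_i - x_i @ theta)**2, for which
# grad f_i(theta) - grad f_i(theta_hat) = x_i * (x_i @ (theta - theta_hat))
# and f_i is L_i-smooth with L_i = ||x_i||^2 (y_i cancels in the difference).
L = np.sum(X**2, axis=1)
p = L / L.sum()                       # p_i = L_i / sum_j L_j, as in the lemma

theta_hat = rng.normal(size=d)        # centring value
theta = theta_hat + 0.1 * rng.normal(size=d)
delta = theta - theta_hat

# Per-term gradient differences Delta_i, stacked into shape (N, d).
Delta = X * (X @ delta)[:, None]

# Exact noise variance for n i.i.d. draws from p (assumed sampling scheme):
# E||xi||^2 = (sum_i ||Delta_i||^2 / p_i - ||sum_i Delta_i||^2) / n.
second_moment = np.sum(np.sum(Delta**2, axis=1) / p)
var_cv = (second_moment - np.sum(Delta.sum(axis=0)**2)) / n

# Right-hand side of (3.3.5) at this fixed theta.
bound = L.sum()**2 / n * np.sum(delta**2)
print(var_cv, bound)
```

Because the variance is computed in closed form rather than by simulation, the comparison does not depend on Monte Carlo error; only the data `X` is random.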

In order to consider how SGLD-CV scales with $N$ we need to make assumptions on the properties of the posterior and how these change with $N$. To make discussions concrete we will focus on the following strong assumption: each likelihood term in the posterior is $L$-smooth and $l$-strongly-log-concave. As we discuss later, our results apply under weaker conditions.

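Assumption 3.3.4. Assume there exist positive constants $L$ and $l$ such that each $f_i$ satisfies the following conditions:

$$f_i(\theta) - f_i(\theta') - \nabla f_i(\theta')^\top (\theta - \theta') \ge \frac{l}{2} \|\theta - \theta'\|^2,$$

$$\|\nabla f_i(\theta) - \nabla f_i(\theta')\| \le L \|\theta - \theta'\|,$$

for all $i \in \{0, \dots, N\}$ and $\theta, \theta' \in \mathbb{R}^d$.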

Under this assumption the constants $m$ and $M$ of the posterior both increase linearly with $N$, as shown by the following lemma.

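Lemma 3.3.5. Suppose Assumption 3.3.4 holds. Then $f$ satisfies the following conditions:

$$f(\theta) - f(\theta') - \nabla f(\theta')^\top (\theta - \theta') \ge \frac{l(N+1)}{2} \|\theta - \theta'\|^2,$$

$$\|\nabla f(\theta) - \nabla f(\theta')\| \le L(N+1) \|\theta - \theta'\|.$$

Thus the log posterior is $M$-smooth and $m$-strongly-concave with parameters $M = (N+1)L$ and $m = (N+1)l$.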
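The reason the constants add up can be sketched in two lines. This is only an outline, assuming (as the index range $i = 0, \dots, N$ suggests) that the negative log-posterior decomposes as $f = \sum_{i=0}^{N} f_i$; the full proof is the one given in the Appendix.

```latex
\begin{align*}
f(\theta) - f(\theta') - \nabla f(\theta')^\top(\theta - \theta')
  &= \sum_{i=0}^{N} \left[ f_i(\theta) - f_i(\theta')
       - \nabla f_i(\theta')^\top(\theta - \theta') \right]
   \ge \frac{l(N+1)}{2}\,\|\theta - \theta'\|^2, \\
\|\nabla f(\theta) - \nabla f(\theta')\|
  &\le \sum_{i=0}^{N} \|\nabla f_i(\theta) - \nabla f_i(\theta')\|
   \le L(N+1)\,\|\theta - \theta'\|.
\end{align*}
```

The first line applies the strong-convexity inequality of Assumption 3.3.4 to each of the $N + 1$ terms; the second uses the triangle inequality followed by the per-term smoothness bound.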

We can see that the bound on the gradient estimate variance in (3.3.5) depends on the distance between $\theta_k$ and $\hat\theta$. Appealing to the Bernstein-von Mises theorem (see e.g. Le Cam, 2012), under standard asymptotics, and provided $h$ is small enough (we make this more formal in the analysis to follow, but it must be at most $O(1/N)$), we would expect the distance $\mathbb{E}\|\theta_k - \hat\theta\|^2$ to be $O(1/N)$, if $\hat\theta$ is within $O(N^{-1/2})$ of the posterior mean, once the MCMC algorithm has burnt in. As $M$ is $O(N)$, the factor $M^2$ in the bound is $O(N^2)$, and multiplying by the $O(1/N)$ distance suggests that $\mathbb{E}\|\xi_k\|^2$ will be $O(N)$.

To see the potential benefit of using control variates to estimate the gradient in situations where $N$ is large, we now compare this $O(N)$ result for SGLD-CV with a result on the variance of the simple estimator, $\nabla \hat f(\theta)$. If we randomly pick some data point index $I$ and fix some point $\theta = \vartheta$, then define $V_j(\vartheta)$ to be the empirical variance of $\partial_j f_I(\vartheta)$ over the dataset $x$; and set $V(\vartheta) = \sum_{j=1}^{d} V_j(\vartheta)$. Then, defining $\hat\xi(\theta) = \nabla \hat f(\theta) - \nabla f(\theta)$, if we assume we are sampling the minibatch without replacement then

$$\mathbb{E}\|\hat\xi(\vartheta)\|^2 = \frac{N(N-n)}{n}\, V(\vartheta).$$
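The without-replacement variance formula can be verified exactly on a small example by enumerating every possible minibatch. The sketch below assumes the likelihood part of the simple estimator is $\frac{N}{n}\sum_{i \in S} \nabla f_i(\vartheta)$ (the estimator (3.2.2) is defined outside this section, so this form is an assumption), that the prior term is computed exactly and so contributes no noise, and that the empirical variance uses the $N - 1$ denominator. The array `G` is a hypothetical stand-in for the per-datum gradients at $\vartheta$.

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
N, d, n = 6, 2, 3
G = rng.normal(size=(N, d))    # row i stands in for grad f_i(vartheta)

full_grad = G.sum(axis=0)      # likelihood part of grad f(vartheta)

# Enumerate every size-n minibatch drawn without replacement;
# all of them are equally likely, so a plain mean is the expectation.
sq_errors = []
for S in itertools.combinations(range(N), n):
    estimate = (N / n) * G[list(S)].sum(axis=0)
    sq_errors.append(np.sum((estimate - full_grad) ** 2))
exact = np.mean(sq_errors)

# V(vartheta): empirical variance of each coordinate (N - 1 denominator),
# summed over the d coordinates.
V = G.var(axis=0, ddof=1).sum()
formula = N * (N - n) / n * V
print(exact, formula)
```

The two printed values should agree up to floating-point rounding; with $n = N$ both are zero, since the estimator then uses the full dataset.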

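Now, suppose that as $N \to \infty$, the posterior converges to some point mass at $\theta_0 \in \mathbb{R}^d$. Then we would expect that, for $\vartheta$ close to $\theta_0$, $\mathbb{E}\|\hat\xi(\vartheta)\|^2 \approx \frac{N^2}{n} V(\theta_0)$, so that the variance of the estimator will be $O(N^2)$. More precisely, as $N \to \infty$, if we assume we can choose $\epsilon > 0$ such that $V(\vartheta) \ge \sigma^2 > 0$ for all $\vartheta$ in an $\epsilon$-ball around $\theta_0$, then there is a constant $c > 0$ such that

$$\frac{n}{N^2}\, \mathbb{E}\|\hat\xi(\theta)\|^2 \to c, \quad \text{as } N \to \infty. \qquad (3.3.6)$$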

This suggests using the estimate $\nabla \tilde f$, rather than $\nabla \hat f$, could give an $O(N)$ reduction in variance, and this plays a key part in the computational cost improvements we show in the next section.
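The $O(N)$ versus $O(N^2)$ scaling can be illustrated on a toy problem. The sketch below is an illustration under stated assumptions rather than a reproduction of the experiments: it uses a hypothetical linear-regression model with a standard Gaussian prior (whose individual terms need not satisfy Assumption 3.3.4), takes $\hat\theta$ to be the posterior mode, places $\theta$ at distance of order $N^{-1/2}$ from $\hat\theta$, and assumes the control variate minibatch is $n$ independent draws with $p_i = 1/N$. Both variances are computed in closed form; the function `variances` and its arguments are invented for this example.

```python
import numpy as np

def variances(N, d=3, n=10, seed=0):
    """Exact gradient-noise variances of the simple and CV estimators (toy model)."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(N, d))
    y = X @ np.ones(d) + rng.normal(size=N)
    # Posterior mode of f(theta) = 0.5*||theta||^2 + sum_i 0.5*(y_i - x_i @ theta)^2.
    theta_hat = np.linalg.solve(X.T @ X + np.eye(d), X.T @ y)
    theta = theta_hat + np.ones(d) / np.sqrt(N)   # O(N^{-1/2}) away from theta_hat

    # Simple estimator, minibatch without replacement: N (N - n) / n * V(theta).
    G = X * (X @ theta - y)[:, None]              # rows are grad f_i(theta)
    var_simple = N * (N - n) / n * G.var(axis=0, ddof=1).sum()

    # Control variate estimator, n i.i.d. draws with p_i = 1/N (assumed scheme):
    # variance = (N * sum_i ||Delta_i||^2 - ||sum_i Delta_i||^2) / n.
    Delta = X * (X @ (theta - theta_hat))[:, None]
    var_cv = (N * np.sum(Delta**2) - np.sum(Delta.sum(axis=0)**2)) / n
    return var_simple, var_cv

ratios = []
for N in (100, 1_000, 10_000):
    var_simple, var_cv = variances(N)
    ratios.append(var_simple / var_cv)
    print(N, var_simple, var_cv, ratios[-1])
```

Under these assumptions the printed ratio should grow roughly in proportion to $N$, which is the $O(N)$ variance reduction suggested by the comparison above.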

In document Large scale Bayesian computation using Stochastic Gradient Markov Chain Monte Carlo (Page 83-87)