Computational Statistics for Big Data


Author: Jack Baker¹

Supervisors: Paul Fearnhead¹, Emily Fox²

¹ Lancaster University

² The University of Washington

September 1, 2015

Abstract

The amount of data stored by organisations and individuals is growing at an astonishing rate. As statistical models grow in complexity and size, traditional machine learning algorithms are struggling to scale to the large datasets required for model fitting. Markov chain Monte Carlo (MCMC) is one algorithm that has been left behind, even though it has proven to be an invaluable tool for training complex statistical models. This report discusses a number of possible solutions that enable MCMC to scale more effectively to large datasets. We focus on two particular solutions to this problem: batch methods and stochastic gradient Monte Carlo methods.

Batch methods split the full dataset into disjoint subsets and run traditional MCMC on each subset. The difficulty with these methods lies in recombining the MCMC output from each subset so that the result is a close approximation to the posterior given the full dataset. Stochastic gradient Monte Carlo approximately samples from the full posterior but uses only a subsample of the data at each iteration. It does this by combining two key ideas: stochastic optimization, an algorithm that finds the mode of the posterior using only a subset of the data at each iteration; and Hamiltonian Monte Carlo, a method that provides efficient proposals with high acceptance rates for Metropolis-Hastings algorithms. After discussing the methods and important extensions, we perform a simulation study that compares the methods and examines how they are affected by various model properties.


1 Introduction

As the amount of data stored by individuals and organisations grows, statistical models have advanced in complexity and size. Much statistical methodology has historically focussed on fitting models with limited data. Now we face the opposite problem: we have so much data that traditional statistical methods struggle to cope and run exceptionally slowly.

These problems have led to a rapidly evolving area of statistics and machine learning, which develops algorithms which are scalable as the size of data increases. The ‘size’ of data is generally used to mean one of two things: the dimensionality of the data or the number of observations. In this report we focus on methods which have been designed to be scalable as the number of observations increases. Data with a large number of observations is often referred to as tall data.

Currently, large scale machine learning models are mainly trained using optimization methods such as stochastic optimization. These algorithms are used chiefly for their speed: they train models quickly even when a huge number of observations is available, because at each iteration they use only a subset of the available data. The downside is that these methods only find local maxima of the posterior distribution, meaning they only produce a point estimate, which can lead to overfitting.

A key appeal of Bayesian methods is that they produce a whole distribution of possible parameter values, which allows uncertainty to be quantified, reducing the risk of overfitting.

While parameter uncertainty can be approximated using stochastic optimization, for complex models this approximation can be very poor. Generally, samples from the Bayesian posterior distribution are simulated using a class of algorithms known as Markov chain Monte Carlo (MCMC). The problem is that these algorithms require calculations over the whole dataset at each iteration, so they are slow for large datasets. A new generation of MCMC algorithms that scale to large datasets therefore needs to be developed.

1.1 An overview of methods

We begin this section with a more formal statement of the problem. Suppose we wish to train a model with probability density $p(x|\theta)$, where $\theta$ is an unknown parameter vector and $x = (x_1, \dots, x_N)$ is the model data. Let the likelihood of the model be $p(x|\theta) = \prod_{i=1}^N p(x_i|\theta)$ and the prior for the parameter be $p(\theta)$. Our interest is in the posterior $p(\theta|x) \propto p(x|\theta)p(\theta)$, which quantifies the most likely values of $\theta$ given the data $x$.

Commonly we simulate from the posterior using the Metropolis-Hastings (MH) algorithm, arguably the most popular MCMC algorithm. At each iteration, given a current state $\theta$, the algorithm proposes some new state $\theta'$ from some proposal $q(\cdot)$. This new state is then accepted as part of the sample with probability

$$\alpha = \frac{q(\theta)\,p(\theta'|x)}{q(\theta')\,p(\theta|x)} = \frac{q(\theta)\,p(x|\theta')\,p(\theta')}{q(\theta')\,p(x|\theta)\,p(\theta)},$$

taken to be 1 whenever the ratio exceeds 1.

Notice that at each iteration, the MH algorithm requires calculation of the likelihood at the new state $\theta'$. This requires a computation over the whole dataset, which is infeasibly slow when $N$ is large. This is the key bottleneck in Metropolis-Hastings, and other MCMC algorithms, when they are used with large datasets.
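The bottleneck can be made concrete with a minimal random-walk Metropolis-Hastings sketch. The toy model (Normal observations with unit variance and a Normal(0, 10²) prior on the mean), the step size and the function names are assumptions chosen for illustration and are not taken from the report's own code. The point to notice is that every iteration calls `log_posterior`, which sums over all N observations.

```python
import numpy as np

def log_posterior(theta, x):
    # Assumed toy model: x_i ~ N(theta, 1), prior theta ~ N(0, 10^2).
    # The sum over x is the O(N) full-data computation.
    log_lik = -0.5 * np.sum((x - theta) ** 2)
    log_prior = -0.5 * theta ** 2 / 100.0
    return log_lik + log_prior

def random_walk_mh(x, n_iter=2000, step=0.05, seed=0):
    rng = np.random.default_rng(seed)
    theta = 0.0
    current = log_posterior(theta, x)
    samples = np.empty(n_iter)
    for t in range(n_iter):
        proposal = theta + step * rng.normal()
        candidate = log_posterior(proposal, x)  # full pass over the data
        # Symmetric proposal, so q cancels in the acceptance ratio;
        # comparing on the log scale implements min(1, ratio).
        if np.log(rng.uniform()) < candidate - current:
            theta, current = proposal, candidate
        samples[t] = theta
    return samples

rng = np.random.default_rng(1)
x = rng.normal(loc=2.0, scale=1.0, size=1000)
samples = random_walk_mh(x)
```

With N in the millions, the single line marked "full pass over the data" dominates the run time, which is the cost that the methods reviewed in this report try to avoid.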

A number of solutions have been proposed for this problem, and they can generally be divided into three categories. We refer to these categories as batch methods, stochastic gradient methods and subsampling methods. Batch methods aim to make use of recent hardware developments which make the parallelisation of computational work more accessible. They split the dataset $x$ into disjoint batches $x_{B_1}, \dots, x_{B_S}$. The structure of the posterior allows separate MCMC algorithms to be run on these batches in parallel in order to simulate from each subposterior $p(\theta|x_{B_s}) \propto p(\theta)^{1/S} p(x_{B_s}|\theta)$. These simulations must then be combined in order to generate a sample which approximates the full posterior $p(\theta|x)$. This is where the main challenge lies.

Stochastic gradient methods make use of sophisticated proposals that have been suggested for MCMC. These methods use gradients of the log posterior in order to suggest new states which have very high acceptance rates. When free constants of these proposals are tuned in a certain way these rates can be so high that we can get rid of the acceptance step and still sample from a good approximation to the posterior. However the gradient calculation still requires a computation over the whole dataset. Therefore the gradients of the log posterior need to be estimated using only a subsample of the data, which introduces extra noise.

Subsampling methods keep the MCMC algorithm largely as is but use only a subset of the data in the acceptance step at each iteration. Certain methods allow this to be done while still sampling from the true posterior distribution. However, this advantage often comes at the cost of poor mixing. Other methods achieve the result by introducing controlled biases; these methods often mix better.

1.2 Report outline

This report provides a review of batch methods and stochastic gradient methods outlined in Section 1.1. The reviewed methods are then implemented and compared under a variety of scenarios.

In Section 2 we discuss batch methods, including parametric contributions by Scott et al. (2013) and Neiswanger et al. (2013), nonparametric and semiparametric methods introduced by Neiswanger et al. (2013), as well as more recent developments. Section 3 reviews stochastic gradient methods, including the stochastic gradient Langevin dynamics (SGLD) algorithm of Welling and Teh (2011) and the stochastic gradient Hamiltonian Monte Carlo (SGHMC) algorithm of Chen et al. (2014). Stochastic optimization methods, which are currently employed to train models that rely on large datasets, are considered. An introduction to Hamiltonian Monte Carlo, which is used to produce proposals for the SGHMC algorithm, is provided. Finally we examine the literature which provides further theoretical results for the algorithms, as well as proposed improvements.

In Section 4 the algorithms reviewed in the report are compared; code for the implementations is available on GitHub: https://goo.gl/9ZGHP2. A relatively simple model, a multivariate t-distribution, is used for comparison. Therefore, in order to really test the methods, the number of observations is kept small. First the effect of bandwidth choice for nonparametric/semiparametric methods is investigated. The effects of the number of observations and the dimensionality of the target on performance are then compared for all the methods. The batch size for the batch methods and the subsample size for the stochastic gradient methods are considered too.

2 Batch methods

2.1 Introduction

In order to speed up MCMC, it is natural to consider parallelisation. Advances in hardware allow many jobs to be run in parallel over separate cores. These advances have been used to speed up many other computationally intensive algorithms. Parallelising MCMC has proven difficult, however, since MCMC is inherently sequential in nature and parallelisation requires minimal communication between machines. A natural way to parallelise MCMC is to split the data into different subsets. MCMC for each subset is then run separately on different machines. In this case the main problem is how to recombine the MCMC samples from each subset while ensuring the final sample is as close as possible to the true posterior. In this section, we discuss parametric and nonparametric methods suggested to do this.

2.2 Splitting the data

Suppose we have $N$ i.i.d. data points $x$. We wish to investigate a model with probability density $p(x|\theta)$, where $\theta$ is an unknown parameter vector. Let the likelihood be $p(x|\theta) = \prod_{i=1}^N p(x_i|\theta)$ and the prior we assign to $\theta$ be $p(\theta)$. Then the full posterior for the data $p(\theta|x)$ is given by

p(θ|x) ∝ p(θ)p(x|θ). (2.1)

Let $B_1, \dots, B_S$ be a partition of $\{1, \dots, N\}$, and let $x_{B_s} = \{x_i : i \in B_s\}$ be the corresponding set of data points. We refer to $x_{B_s}$ as the $s$-th batch of data. We can rewrite (2.1) as

$$p(\theta|x) \propto p(\theta) \prod_{s=1}^S p(x_{B_s}|\theta) = \prod_{s=1}^S p(\theta)^{1/S} p(x_{B_s}|\theta).$$

For brevity we will write the $S$ batches of data as $x_1, \dots, x_S$ from now on. Let us define the subposterior $p(\theta|x_s)$ by

$$p(\theta|x_s) \propto p(\theta)^{1/S} p(x_s|\theta).$$

Therefore we have that $p(\theta|x) \propto \prod_{s=1}^S p(\theta|x_s)$. The idea of batch methods for big data is to run MCMC separately to sample from each subposterior. These samples are then combined in some way so that the final sample follows the full posterior $p(\theta|x)$ as closely as possible.
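The factorisation above can be checked numerically. The sketch below assumes a toy model (Normal observations with unit variance and a Normal(0, 10²) prior), and the helper names are hypothetical; it verifies that the unnormalised log subposteriors, each using the prior raised to the power 1/S, sum to the unnormalised log full posterior.

```python
import numpy as np

def log_prior(theta):
    # Assumed prior: theta ~ N(0, 10^2), up to an additive constant.
    return -0.5 * theta ** 2 / 100.0

def log_lik(theta, x):
    # Assumed model: x_i ~ N(theta, 1), up to an additive constant.
    return -0.5 * np.sum((x - theta) ** 2)

def log_subposterior(theta, x_s, S):
    # log of p(theta)^(1/S) * p(x_s | theta)
    return log_prior(theta) / S + log_lik(theta, x_s)

rng = np.random.default_rng(0)
x = rng.normal(2.0, 1.0, size=1200)
S = 4
# Randomly partition the data into S disjoint batches.
batches = np.array_split(rng.permutation(x), S)

theta = 1.7
full = log_prior(theta) + log_lik(theta, x)
product = sum(log_subposterior(theta, x_s, S) for x_s in batches)
```

Here `full` and `product` agree up to floating point error for any value of `theta`, so each batch can be handled by a separate worker that only ever sees its own data.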

2.3 Efficiently sampling from products of Gaussian mixtures

Before we outline recombination methods in more detail, we discuss certain important properties of the multivariate Normal distribution which will prove useful later.


Suppose we have S multivariate Normal densities N (θ|µ_s, Σ_s) for s ∈ {1, . . . , S}, then Wu (2004) shows that their product can be written, up to a constant of proportionality, as

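$$\prod_{s=1}^S \mathcal{N}(\theta|\mu_s, \Sigma_s) \propto \mathcal{N}(\theta|\mu, \Sigma),$$

where

$$\Sigma = \left(\sum_{s=1}^S \Sigma_s^{-1}\right)^{-1}, \qquad \mu = \Sigma \left(\sum_{s=1}^S \Sigma_s^{-1} \mu_s\right). \tag{2.2}$$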
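As a quick check of (2.2), consider a one-dimensional case with $S = 2$; the numbers are chosen purely for illustration. Taking $\mathcal{N}(\theta|0, 1)$ and $\mathcal{N}(\theta|2, 1/3)$, the precisions are $1$ and $3$, so

$$\Sigma = (1 + 3)^{-1} = \tfrac{1}{4}, \qquad \mu = \tfrac{1}{4}\,(1 \cdot 0 + 3 \cdot 2) = \tfrac{3}{2}.$$

The product is therefore proportional to $\mathcal{N}(\theta|3/2, 1/4)$: its mean is a precision-weighted average pulled towards the more concentrated factor, and its variance is smaller than that of either factor.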

Now suppose we have a set of $S$ Gaussian mixtures $\{p_s(\theta)\}_{s=1}^S$,

$$p_s(\theta) = \sum_{m=1}^M \omega_{m,s}\, \mathcal{N}(\theta|\mu_{m,s}, \Sigma_s),$$

where $\omega_{m,s}$ denote the mixture weights. For simplicity we assume that the number of components in each mixture is the same and that each Gaussian component in the mixture shares a common variance which is diagonal.

We wish to sample from the product of these Gaussian mixtures,

$$p(\theta) \propto \prod_{s=1}^S p_s(\theta). \tag{2.3}$$

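It can be shown using induction that

$$\prod_{s=1}^S \sum_{m=1}^M \omega_{m,s}\, \mathcal{N}(\theta|\mu_{m,s}, \Sigma_s) = \sum_{l_1} \cdots \sum_{l_S} \prod_{s=1}^S \omega_{l_s,s}\, \mathcal{N}(\theta|\mu_{l_s,s}, \Sigma_s),$$

where we label each component of the sum using $L = (l_1, \dots, l_S)$, with $l_s \in \{1, \dots, M\}$.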

It follows from this, and from the results above about products of Gaussians, that (2.3) is equivalent to a Gaussian mixture with $M^S$ mixture components. Therefore sampling from this product can be performed exactly in two steps. First we select one of the $M^S$ components of the mixture according to its weight; then we draw a sample from the corresponding Gaussian component (Ihler et al., 2004).

The parameters of the $L$-th Gaussian component can be calculated using (2.2) and are given by

$$\Sigma_L = \left(\sum_{s=1}^S \Sigma_s^{-1}\right)^{-1}, \qquad \mu_L = \Sigma_L \left(\sum_{s=1}^S \Sigma_s^{-1} \mu_{l_s,s}\right).$$

The unnormalised weight of the $L$-th mixture component is given by (Ihler et al., 2004)

$$\omega_L \propto \frac{\prod_{s=1}^S \omega_{l_s,s}\, \mathcal{N}(\theta|\mu_{l_s,s}, \Sigma_s)}{\mathcal{N}(\theta|\mu_L, \Sigma_L)}.$$

In order to use this exact method we need to calculate the normalising constant for the weights, $Z = \sum_L \omega_L$. As $M$ and $S$ grow this exact sampling method becomes computationally infeasible, as both calculating $Z$ and drawing a sample from $p(\cdot)$ take $O(M^S)$ time. This fact, along with the memory requirements, means that sampling from $p(\theta)$ using the exact method quickly becomes impossible.
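For small $M$ and $S$ the exact two-step scheme can be written out directly. The following is a minimal one-dimensional sketch; the function names and the example mixtures are assumptions for illustration. It enumerates all $M^S$ labels, which is exactly the step that becomes infeasible as $M$ and $S$ grow. The weight ratio does not depend on $\theta$, so the sketch evaluates it at $\theta = 0$.

```python
import itertools
import numpy as np

def log_normal_pdf(theta, mean, var):
    return -0.5 * (np.log(2 * np.pi * var) + (theta - mean) ** 2 / var)

def product_mixture_components(weights, means, variances):
    """weights, means: shape (S, M); variances: shape (S,), shared within a mixture."""
    S, M = means.shape
    var_L = 1.0 / np.sum(1.0 / variances)      # identical for every label L
    rows = np.arange(S)
    mu_L, log_w = [], []
    for label in itertools.product(range(M), repeat=S):   # M**S labels
        cols = np.array(label)
        mu = means[rows, cols]
        m_L = var_L * np.sum(mu / variances)
        # Unnormalised log weight: product of factors over the combined Gaussian.
        lw = np.sum(np.log(weights[rows, cols]) + log_normal_pdf(0.0, mu, variances))
        lw -= log_normal_pdf(0.0, m_L, var_L)
        mu_L.append(m_L)
        log_w.append(lw)
    log_w = np.array(log_w)
    w = np.exp(log_w - log_w.max())            # subtract max for numerical stability
    return w / w.sum(), np.array(mu_L), var_L

def sample_product(weights, means, variances, size, seed=0):
    rng = np.random.default_rng(seed)
    w, mu_L, var_L = product_mixture_components(weights, means, variances)
    k = rng.choice(len(w), size=size, p=w)     # step 1: pick a component
    return rng.normal(mu_L[k], np.sqrt(var_L)) # step 2: draw from it

weights = np.array([[0.3, 0.7], [0.5, 0.5], [0.6, 0.4]])
means = np.array([[-1.0, 1.0], [0.0, 2.0], [-0.5, 1.5]])
variances = np.array([1.0, 0.5, 2.0])
w, mu_L, var_L = product_mixture_components(weights, means, variances)
draws = sample_product(weights, means, variances, size=1000)
```

With $S = 3$ and $M = 2$ there are only 8 components, but with, say, $M = 1000$ MCMC draws per batch and $S = 10$ batches the same loop would need $10^{30}$ iterations.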

In cases where exact sampling from the mixture is infeasible, a number of methods have been proposed; for a review, see Ihler et al. (2004). A common choice is a Gibbs sampling style approach. At each iteration, $S - 1$ of the labels $l_i$ are fixed, while one label, call it $l_j$, is sampled from the corresponding conditional density $p(\theta|l_{-j})$. The notation $l_{-j}$ refers to $\{l_i \mid i \in \{1, \dots, S\}, i \neq j\}$. After a fixed number of new label values have been drawn, a sample is drawn from the mixture component indicated by the current label values. While this approach often produces good results, it can require a large number of samples before it accurately represents the true mixture density due to multimodality. A number of suggestions have been made to improve this standard Gibbs sampling approach, for example using multiscale sampling (Ihler et al., 2004) and parallel tempering (Rudoy and Wolfe, 2007).

2.4 Parametric recombination methods

There are a number of methods proposed to recombine subposterior samples which exactly target the full posterior p(θ|x) when it is Normally distributed. We refer to these methods as parametric. Intuition for why this assumption might be valid for a large class of models comes from the Bernstein-von Mises Theorem (Le Cam, 2012), which is a central limit theorem for Bayesian statistics. Assuming suitable regularity conditions, and that the data is realised from a unique true parameter value θ₀, the theorem states that the posterior for the data tends to a Normal distribution centred around θ₀. In particular, for large N the posterior is found to be well approximated by N (θ0, I⁻¹(θ0)), where I(θ) is Fisher’s information matrix.

Since we are aiming to efficiently sample from models with large amounts of data, this approximation appears to be particularly relevant.

Neiswanger et al. (2013) propose to combine samples by approximating each subposterior using a Normal distribution, and then using results for products of Gaussians in order to combine these approximations. Let $\hat\mu_s$ and $\hat\Sigma_s$ denote the sample mean and sample variance of the MCMC output for batch $s$. Then we can approximate the distribution of each subposterior by $\mathcal{N}(\hat\mu_s, \hat\Sigma_s)$. Using (2.2), the full posterior can be estimated by simply multiplying these subposterior estimates together. It follows that the estimate will be multivariate Gaussian with mean $\hat\mu$ and variance $\hat\Sigma$ given by

$$\hat\Sigma = \left(\sum_{s=1}^S \hat\Sigma_s^{-1}\right)^{-1}, \qquad \hat\mu = \hat\Sigma \left(\sum_{s=1}^S \hat\Sigma_s^{-1} \hat\mu_s\right). \tag{2.4}$$
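A minimal sketch of this parametric combination step is given below. The function name `combine_gaussian` and the synthetic "subposterior samples" (drawn directly from known Gaussians rather than produced by MCMC) are assumptions made so that the result can be compared with the exact answer from (2.2).

```python
import numpy as np

def combine_gaussian(subposterior_samples):
    """subposterior_samples: list of (M, d) arrays, one per batch."""
    precisions, weighted_means = [], []
    for draws in subposterior_samples:
        mu_s = draws.mean(axis=0)
        # Sample covariance of batch s, inverted to give its precision.
        prec_s = np.linalg.inv(np.atleast_2d(np.cov(draws, rowvar=False)))
        precisions.append(prec_s)
        weighted_means.append(prec_s @ mu_s)
    cov = np.linalg.inv(np.sum(precisions, axis=0))
    mean = cov @ np.sum(weighted_means, axis=0)
    return mean, cov

rng = np.random.default_rng(0)
true_means = [np.array([0.0, 1.0]), np.array([0.5, 0.5]), np.array([1.0, 0.0])]
true_covs = [np.diag([1.0, 2.0]), np.diag([0.5, 0.5]), np.diag([2.0, 1.0])]
samples = [rng.multivariate_normal(m, c, size=20000)
           for m, c in zip(true_means, true_covs)]
mean_hat, cov_hat = combine_gaussian(samples)
```

For these inputs the exact product has precision 3.5 in each coordinate and mean (3/7, 3/7), so `mean_hat` and `cov_hat` should be close to those values. Only S means and S covariance matrices need to be communicated between machines, which is what makes the approach cheap.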

Scott et al. (2013) propose a similar method, known as consensus Monte Carlo, where samples are combined using averaging. Denote the $j$-th sample from subposterior $s$ by $\theta_{sj}$. Suppose each subposterior is assigned a weight $W_s$ (a matrix in the multivariate case). Then the $j$-th draw $\hat\theta_j$ from the consensus approximation to the full posterior is given by

$$\hat\theta_j = \left(\sum_{s=1}^S W_s\right)^{-1} \sum_{s=1}^S W_s \theta_{sj}.$$

When each subposterior is Normal, the full posterior is also Normal, and setting the weights to the subposterior precisions, $W_s = \mathrm{Var}(\theta|x_s)^{-1}$, makes $\hat\theta_j$ an exact draw from the full posterior; this agrees with the precision weighting in (2.4). The idea is that even when the subposteriors are non-Gaussian, the draw $\hat\theta_j$ will still be a close approximation to the posterior. Scott et al. (2013) suggest using the inverse of the sample variance of each batch as the weight in practice, motivated by the exact result in the Normal case.
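The averaging step can be sketched as follows. This sketch takes $W_s$ to be the inverse sample covariance of batch $s$, the weighting under which the Gaussian case is exact; that choice, the function name and the synthetic Gaussian inputs are assumptions of the example rather than a reproduction of any published implementation.

```python
import numpy as np

def consensus_combine(sub_samples):
    """sub_samples: array of shape (S, M, d); returns (M, d) consensus draws."""
    S = sub_samples.shape[0]
    # W_s: inverse sample covariance of batch s (assumed weighting).
    W = np.stack([
        np.linalg.inv(np.atleast_2d(np.cov(sub_samples[s], rowvar=False)))
        for s in range(S)
    ])
    W_total_inv = np.linalg.inv(W.sum(axis=0))
    # weighted[m] = sum_s W_s @ theta_{s,m}
    weighted = np.einsum('sij,smj->mi', W, sub_samples)
    # Apply (sum_s W_s)^{-1} to every row.
    return weighted @ W_total_inv.T

rng = np.random.default_rng(0)
true_means = [np.array([0.0, 1.0]), np.array([0.5, 0.5]), np.array([1.0, 0.0])]
true_covs = [np.diag([1.0, 2.0]), np.diag([0.5, 0.5]), np.diag([2.0, 1.0])]
sub_samples = np.stack([rng.multivariate_normal(m, c, size=20000)
                        for m, c in zip(true_means, true_covs)])
consensus = consensus_combine(sub_samples)
```

Unlike the parametric estimate, which returns only a mean and covariance, this produces an actual sample of M draws, one per index $j$, so non-Gaussian features of the subposteriors can partly survive the averaging.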

Key advantages of the two approximations outlined above are that they are fast and relatively quick to converge when models are close to Gaussian. However they only target the full posterior exactly if either each subposterior is Normally distributed, or the size of each batch tends to infinity. Therefore the methods’ performance on non-Gaussian targets should be explored, especially when they are multi-modal, since the methods may conceivably struggle in these cases.

Rabinovich et al. (2015) suggest extending the Consensus Monte Carlo algorithm of Scott et al. (2013) by relaxing the restriction of aggregation using averaging. Suppose we pick a draw from each subposterior, $\theta_1, \dots, \theta_S$. Then let us refer to the function used to aggregate these draws as $F(\theta_1, \dots, \theta_S)$, so in the case of Consensus Monte Carlo we have

$$F(\theta_1, \dots, \theta_S) = \left(\sum_{s=1}^S W_s\right)^{-1} \sum_{s=1}^S W_s \theta_s.$$

Rabinovich et al. (2015) suggest trying to adaptively choose the best aggregation function F (.). Motivation for this is that the averaging function used in Scott et al. (2013) is only known to be exact in the case of Gaussian posteriors. In order to adaptively choose F (.), Rabinovich et al. (2015) use variational Bayes. However the method requires the introduction of an optimization step, and it would be interesting to investigate the relative improvement in the approximation in using the method, versus the increase in computation time.

2.5 Nonparametric methods

While the methods outlined above work relatively well when subposteriors are approximately Gaussian, it is not clear how they behave when models are far from Gaussian, or when batch sizes are small. Neiswanger et al. (2013) therefore suggest an alternative method based on kernel density estimation which can be shown to target the full posterior asymptotically, as the number of samples drawn from each subposterior tends to infinity.

Let $x_1, \dots, x_N$ be a sample from a distribution of dimension $d$ with density $f$. Kernel density estimation is a method for providing an estimate $\hat f$ of the density. The kernel density estimate of $f$ at a point $x$ is

$$\hat f(x) = \frac{1}{N} \sum_{i=1}^N K_H(x - x_i),$$

where $H$ is a $d \times d$ symmetric, positive-definite matrix known as the bandwidth and $K$ is the unscaled kernel, which is a symmetric, $d$-dimensional density. $K_H$ is related to $K$ by $K_H(x) = |H|^{-1/2} K(H^{-1/2} x)$. Commonly the kernel function $K$ is chosen to be Gaussian since it leads to smooth density estimates and it simplifies mathematical analysis (Duong, 2004). The bandwidth is an important factor in determining the accuracy of a kernel density estimate as it controls the smoothing of the estimate.
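A minimal sketch of a Gaussian kernel density estimate with a general bandwidth matrix $H$ is shown below. The function name and the example data are assumptions for illustration; with a standard Gaussian $K$, the scaled kernel $K_H$ is simply the density of a zero-mean Normal with covariance $H$.

```python
import numpy as np

def gaussian_kde_density(x, data, H):
    """Evaluate the KDE at point x (shape (d,)) from data (shape (N, d))."""
    d = data.shape[1]
    H_inv = np.linalg.inv(H)
    diff = x - data                                    # (N, d)
    # Quadratic forms (x - x_i)^T H^{-1} (x - x_i) for every data point.
    quad = np.einsum('ni,ij,nj->n', diff, H_inv, diff)
    norm = (2 * np.pi) ** (-d / 2) * np.linalg.det(H) ** (-0.5)
    return norm * np.mean(np.exp(-0.5 * quad))

rng = np.random.default_rng(0)
data = rng.normal(size=(5000, 2))
h = 0.3
estimate = gaussian_kde_density(np.zeros(2), data, h ** 2 * np.eye(2))
```

For standard bivariate Normal data the true density at the origin is $1/(2\pi) \approx 0.159$, while the expected value of the estimate is $1/(2\pi(1 + h^2)) \approx 0.146$ for $h = 0.3$: a larger bandwidth smooths more and flattens the peak, which illustrates how the bandwidth controls the accuracy of the estimate.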

Suppose we have a sample $\{\theta_{m,s}\}_{m=1}^M$ from each subposterior $s \in \{1, \dots, S\}$. Neiswanger et al. (2013) suggest approximating each subposterior using a kernel density estimate with Gaussian kernel and diagonal bandwidth matrix $h^2 I$, where $I$ is the $d$-dimensional identity matrix. Denote this estimate by $\hat p_s(\theta)$; then we can write it as

$$\hat p_s(\theta) = \frac{1}{M} \sum_{m=1}^M \mathcal{N}(\theta|\theta_{m,s}, h^2 I),$$

where $\mathcal{N}(\cdot|\theta_{m,s}, h^2 I)$ denotes a $d$-dimensional Gaussian density with mean $\theta_{m,s}$ and variance $h^2 I$.

The estimate for the full posterior $\hat p(\theta|x)$ is then defined to be the product of the estimates for each batch

$$\hat p(\theta|x) = \prod_{s=1}^S \hat p_s(\theta) = \frac{1}{M^S} \prod_{s=1}^S \sum_{m=1}^M \mathcal{N}(\theta|\theta_{m,s}, h^2 I). \tag{2.5}$$

Therefore the estimate for the full posterior becomes a product of Gaussian mixtures as discussed in Section 2.3. By introducing a similar labelling system $L = (l_1, \dots, l_S)$ with $l_s \in \{1, \dots, M\}$, we can again derive an explicit expression for the resulting mixture. While Neiswanger et al. (2013) use a common variance $h^2 I$ for each kernel, we suggest it might be better to use a diagonal matrix $\Lambda$, since different parameters may differ considerably in variance. In either case, assuming a common, diagonal variance $\Lambda$ across the kernel estimates for each batch, the weights in the product (2.5) simplify to

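$$\omega_L \propto \prod_{s=1}^S \mathcal{N}(\theta_{l_s,s}|\bar\theta_L, \Lambda), \qquad \bar\theta_L = \frac{1}{S} \sum_{s=1}^S \theta_{l_s,s}. \tag{2.6}$$

The $L$-th component of the mixture simplifies to $\mathcal{N}(\theta|\bar\theta_L, \Lambda/S)$.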

Given that this method is designed for use with large datasets, the number of components of the resulting Gaussian mixture will be very large. Therefore efficiently sampling from it is an important issue to consider. Neiswanger et al. (2013) recommend sampling from the full posterior estimate using a method similar to the Gibbs sampling approach outlined in Section 2.3. In order to avoid calculating the conditional distribution of the weights, however, they use a Metropolis within Gibbs approach as follows. Keeping all labels except the current one, $l_s$, fixed, we randomly sample a new value for $l_s$. We then accept this new label with probability given by the ratio of the weight of the proposed label to the weight of the current label, capped at one. The full algorithm is detailed in Algorithm 1.

Algorithm 1: Combining Batches Using Kernel Density Estimation.

Data: Samples from each subposterior s ∈ {1, . . . , S}, {θ_m,s}, m = 1, . . . , M.
Result: Sample from an estimate of the full posterior p(θ|x).

Draw an initial label L by simulating l_s ∼ Unif({1, . . . , M}), s ∈ {1, . . . , S}.
for i = 1 to T do
    h ← h(i)
    for s = 1 to S do
        Create a new label C := (c₁, . . . , c_S) and set C ← L
        Draw a new value for index s in C, c_s ∼ Unif({1, . . . , M})
        Simulate u ∼ Unif(0, 1)
        if u < ω_C/ω_L then
            L ← C
        end
    end
    Simulate θ_i ∼ N(θ̄_L, (h²/S) I)
end

Notice that in the algorithm, $h$ is changed as a function of the iteration $i$. In particular Neiswanger et al. (2013) specify the function $h(i) = i^{-1/(4+d)}$. This causes the bandwidth to decrease at each iteration and is referred to as annealing. The properties of annealing are investigated further in Section 4. In their paper Neiswanger et al. (2013) assume that the number of iterations is the same as the size of the sample from each subposterior. However, this is not necessary. In fact, when we are trying to sample from a mixture with a large number of components we may need to simulate more times than this in order to ensure the sample accurately represents the true KDE approximation.
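Algorithm 1 with the annealed bandwidth can be sketched in a few lines. This is an illustrative implementation under stated assumptions, not the code used for the experiments in Section 4: the function names are hypothetical, the kernel variance is $h^2 I$, and the final draw uses variance $(h^2/S) I$, matching the component $\mathcal{N}(\theta|\bar\theta_L, \Lambda/S)$ with $\Lambda = h^2 I$. Because every label shares the same bandwidth, the Gaussian normalising constants cancel in the weight ratio and only the squared distances to $\bar\theta_L$ are needed.

```python
import numpy as np

def log_weight(theta_sel, h):
    """Unnormalised log weight (2.6) for the selected draws theta_sel, shape (S, d)."""
    centre = theta_sel.mean(axis=0)
    return -0.5 * np.sum((theta_sel - centre) ** 2) / h ** 2

def nonparametric_combine(sub_samples, n_iter, seed=0):
    """sub_samples: array of shape (S, M, d). Returns (n_iter, d) draws."""
    rng = np.random.default_rng(seed)
    S, M, d = sub_samples.shape
    rows = np.arange(S)
    labels = rng.integers(M, size=S)            # initial label L
    out = np.empty((n_iter, d))
    for i in range(1, n_iter + 1):
        h = i ** (-1.0 / (4 + d))               # annealed bandwidth h(i)
        for s in range(S):
            proposal = labels.copy()
            proposal[s] = rng.integers(M)       # redraw one label uniformly
            log_ratio = (log_weight(sub_samples[rows, proposal], h)
                         - log_weight(sub_samples[rows, labels], h))
            if np.log(rng.uniform()) < log_ratio:
                labels = proposal
        centre = sub_samples[rows, labels].mean(axis=0)
        out[i - 1] = centre + (h / np.sqrt(S)) * rng.normal(size=d)
    return out

rng = np.random.default_rng(1)
# Stand-ins for MCMC output: two subposteriors N(0, 1) and N(1, 1).
sub_samples = np.stack([rng.normal(0.0, 1.0, size=(1000, 1)),
                        rng.normal(1.0, 1.0, size=(1000, 1))])
draws = nonparametric_combine(sub_samples, n_iter=3000)
```

For this example the exact product of the two subposteriors is $\mathcal{N}(0.5, 0.5)$, so after discarding the early iterations (where the bandwidth is still large) the draws should have mean near 0.5 and variance slightly above 0.5, the excess being the smoothing introduced by the kernel.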

While this algorithm may improve results as models move away from Gaussianity, kernel density estimation is known to perform poorly in high dimensions, so the algorithm will deteriorate as the dimensionality of $\theta$ increases. The algorithm suffers from the curse of dimensionality in the number of batches and the size of the MCMC sample simulated from each subposterior. This suggests that as the number of batches increases the accuracy and mixing of the algorithm will be affected. The algorithm also requires the user to choose a bandwidth, so its sensitivity to different bandwidth choices would be interesting to investigate.

In the original paper by Neiswanger et al. (2013), it is suggested to use a Gaussian kernel with bandwidth $h^2 I$. However, as mentioned earlier, different parameters may have different variances. The algorithm would probably perform better with a more general diagonal matrix $\Lambda$, especially as this does not particularly increase the complexity of the algorithm. Using a common bandwidth parameter across batches eases computation; however, it may negatively affect the performance of the algorithm. Note that when discussing products of Gaussian mixtures in Section 2.3, the variances across different mixtures did not need to be assumed common. Therefore further improvements might be made by varying bandwidths across batches, though this would increase computational expense. Finally, improvements could be gained by using more sophisticated methods to sample from the product of kernel density estimates (Ihler et al., 2004; Rudoy and Wolfe, 2007).

A number of developments have been proposed for Algorithm 1. Wang and Dunson (2013) note that the algorithm performs poorly when samples from each subposterior do not overlap. To improve this they suggest smoothing each subposterior using a Weierstrass transform, which simply takes the convolution of the density with a Gaussian function. The transformed function can be seen as a smoothed version of the original, which tends to increase the overlap between subposteriors. They then approximate the full posterior as a product of the Weierstrass transforms of each subposterior. However, since in general the approximation to each subposterior will be empirical, its Weierstrass transform corresponds to a kernel density estimator. Therefore this method is, for all intents and purposes, the same as the original algorithm by Neiswanger et al. (2013), and so still suffers from many of the same problems.

An alternative method to improve overlap between the supports of each subposterior is to use heavier tailed kernels in the kernel density estimation. Implementing this however will require some work in order to be able to sample from the resulting product of mixtures, since nice properties for the product of these heavier tailed distributions may not hold. Therefore alternative methods for sampling will need to be developed.

Rather than using kernel density estimation, Wang et al. (2015) use space partitioning methods to partition the space into disjoint subsets, and produce counts of the number of points contained in each of these subsets. This produces an estimate of each subposterior akin to a multi-dimensional histogram. An estimate of the full posterior can then be made by multiplying subposterior estimates together and normalizing. This algorithm helps solve the explosion of mixture components that affects Algorithm 1. Despite this, the algorithm will still suffer when the supports of each subposterior do not overlap. Moreover the algorithm is more complicated to implement and will be affected by the choice of partitioning used.

Alternatively there have been suggestions to introduce suitable metrics which allow summaries of a set of probability measures to be defined. This allows batches to be recombined in terms of these summaries. For example Minsker et al. (2014) use a metric known as the Wasserstein distance measure in order to define the median posterior from a set of subposteriors. Similarly Srivastava et al. (2015) also use the Wasserstein distance to calculate a summary of the subposteriors known as the barycenter. This allows them to produce an estimate for the full posterior which they refer to as the Wasserstein posterior or WASP.

However, the statistical properties of these measures are unclear and need to be investigated further.

2.6 Semiparametric methods

In order to account for the fact that the nonparametric method Algorithm 1 is slow to converge, Neiswanger et al. (2013) suggest producing a semiparametric estimator (Hjort and Glad, 1995) of each subposterior. This estimator combines the parametric estimator characterised by (2.4) and the nonparametric estimator detailed by Algorithm 1. More specifically, each subposterior is estimated by (Hjort and Glad, 1995)

$$\hat p_s(\theta) = \hat f_s(\theta)\, \hat r(\theta),$$

where $\hat f_s(\theta) = \mathcal{N}(\theta|\hat\mu_s, \hat\Sigma_s)$ and $\hat r(\theta)$ is a nonparametric estimator of the correction function $r(\theta) = p_s(\theta)/\hat f_s(\theta)$.

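Assuming a Gaussian kernel for $\hat r(\theta)$, Neiswanger et al. (2013) write down an explicit expression for $\hat p_s(\theta)$:

$$\hat p_s(\theta) = \frac{1}{M} \sum_{m=1}^M \frac{\mathcal{N}(\theta|\theta_{m,s}, h^2 I)\, \mathcal{N}(\theta|\hat\mu_s, \hat\Sigma_s)}{\hat f_s(\theta_{m,s})} = \frac{1}{M} \sum_{m=1}^M \frac{\mathcal{N}(\theta|\theta_{m,s}, h^2 I)\, \mathcal{N}(\theta|\hat\mu_s, \hat\Sigma_s)}{\mathcal{N}(\theta_{m,s}|\hat\mu_s, \hat\Sigma_s)}.$$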

Similarly to the nonparametric method, we can produce an estimate for the full posterior $\hat p(\theta|x)$ as the product of estimates for each subposterior. Once again this results in a mixture of Gaussians with $M^S$ components. Using the label $L = (l_1, \dots, l_S)$, the $L$-th mixture weight $W_L$ and component $c_L$ are given by

$$W_L \propto \frac{\omega_L\, \mathcal{N}\!\left(\bar\theta_L \,\middle|\, \hat\mu, \hat\Sigma + \frac{h}{S} I\right)}{\prod_{s=1}^S \mathcal{N}(\theta_{l_s,s}|\hat\mu_s, \hat\Sigma_s)}, \qquad c_L = \mathcal{N}(\theta|\mu_L, \Sigma_L),$$

where $\omega_L$ and $\bar\theta_L$ are as defined in (2.6), and the parameters of the mixture component are

$$\Sigma_L = \left(\frac{S}{h} I + \hat\Sigma^{-1}\right)^{-1}, \qquad \mu_L = \Sigma_L \left(\frac{S}{h} I\, \bar\theta_L + \hat\Sigma^{-1} \hat\mu\right),$$

where $\hat\Sigma$ and $\hat\mu$ are as defined in (2.4). Sampling from this mixture can be performed by using Algorithm 1, replacing weights and parameters where appropriate.

As h → 0, the semiparametric component parameters Σ_L and µ_L approach the corresponding nonparametric component parameters. This motivates Neiswanger et al. (2013) to suggest an alternative semiparametric algorithm where the nonparametric component weights ω_L are used instead of W_L. Their reasoning is that the resulting algorithm may have a higher acceptance probability and is still asymptotically exact as the batch size tends to infinity. As in Section 2.5, a bandwidth matrix with identical diagonal elements hI will not necessarily be the best choice for the bandwidth if different dimensions of the parameters have different scales or variances. However the algorithm can easily be extended to using a diagonal bandwidth matrix Λ in a similar way to the nonparametric method.

While this method may solve the problem that the nonparametric method is slow to converge in high dimensions, the performance of the algorithm is not well understood. For example, as models move away from Gaussianity, how will the algorithm perform given that it includes this parametric term? Moreover the method still suffers from the curse of dimensionality in terms of the number of mixture components, and it will also be affected by bandwidth choice.

2.7 Conclusion

In this section we outlined batch methods. Batch methods split a large dataset up into smaller subsets, run parallel MCMC on these subsets, and then combine the MCMC output to obtain an approximation to the full posterior. A couple of methods appealed to the Bernstein-von Mises theorem in order to approximate each subposterior by a Normal distribution. The resulting approximation to the full posterior could be found using standard results for products of Gaussians. However these methods are only exact if each subposterior is Normal, or as the number of observations in each batch tends to infinity. Performance of the methods when these assumptions are violated needs to be investigated.

Alternative methods used kernel density estimation or a mixture of a Normal estimate and a kernel density estimate to approximate each subposterior. These estimates could then be combined by using results for the product of mixtures of Gaussians. However the resulting approximation was a mixture of M^S components, which is difficult to sample from efficiently.

Moreover kernel density estimation is known to deteriorate as dimensionality increases and requires the choice of a bandwidth.

To conclude, each of the batch methods has either undesirable qualities or properties which are not well understood. These issues need reviewing before the methods can be used with confidence in practice. Batch methods are particularly suited to models which exhibit structure, for example hierarchical models.

3 Stochastic gradient methods

3.1 Introduction

Methods currently employed in large scale machine learning are generally optimization based methods. One method employed frequently in training machine learning models is known as stochastic optimization (Robbins and Monro, 1951). This method is used to optimize a likelihood function in a similar way to traditional gradient ascent. The key difference is that at each iteration rather than using the whole dataset only a subset is used. While the method produces impressive results at low computational cost, it has a number of downsides.

Parameter uncertainty is not captured using this method, since it only produces a point estimate. Though uncertainty can be estimated using a Normal approximation, for more complex models this estimate may be poor. This means models fitted using stochastic optimization can suffer from overfitting. Since the method does not sample from the posterior as in traditional MCMC, the algorithm can get stuck in local maxima.

Methods outlined in this section aim to combine the subsampling approach of stochastic optimization, with posterior sampling, which helps capture uncertainty in parameter estimates. The section begins by outlining stochastic optimization, before introducing stochastic gradient Langevin dynamics (SGLD) and stochastic gradient Hamiltonian Monte Carlo (SGHMC), the two key algorithms for big data discussed in this section. Hamiltonian Monte Carlo (HMC), a technique used extensively by SGHMC, is reviewed.

3.2 Stochastic optimization

Let x₁, …, x_N be data observed from a model with probability density function p(x|θ), where θ denotes an unknown parameter vector. Assigning a prior p(θ) to θ, as usual our interest is in the posterior

$$p(\theta \mid x) \propto p(\theta) \prod_{i=1}^{N} p(x_i \mid \theta),$$

where we define p(x|θ) = ∏_{i=1}^{N} p(x_i|θ) to be the likelihood.

Stochastic optimization (Robbins and Monro, 1951) aims to find the mode θ^∗ of the posterior distribution, otherwise known as the MAP estimate of θ. The idea of finding the mode of the posterior rather than the likelihood is that the prior p(θ) regularizes the parameters, meaning it acts as a penalty for model complexity which helps prevent overfitting.

At each iteration t, stochastic optimization takes a subset of the data s_t and updates the parameters as follows (Welling and Teh, 2011)

$$\theta_{t+1} = \theta_t + \frac{\epsilon_t}{2}\left(\nabla \log p(\theta_t) + \frac{N}{n} \sum_{x_i \in s_t} \nabla \log p(x_i \mid \theta_t)\right),$$

where ε_t is the stepsize at iteration t and |s_t| = n. The idea is that over the long run the noise from using a subset of the data averages out, and the algorithm behaves like standard gradient ascent on the log posterior. When the number of observations N is large, using only a subset of the data is much less computationally expensive. This is a key advantage of stochastic optimization.

Provided that

$$\sum_{t=1}^{\infty} \epsilon_t = \infty, \qquad \sum_{t=1}^{\infty} \epsilon_t^2 < \infty, \tag{3.1}$$

and p(θ|x) satisfies certain technical conditions, this algorithm is guaranteed to converge to a local maximum.
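The update above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function name `sgd_map`, the polynomial stepsize ε_t = a(b + t)^(−γ) (which satisfies (3.1) for γ in (0.5, 1]), and the toy Normal model are all assumptions made for the example.

```python
import random

def sgd_map(data, grad_log_prior, grad_log_lik, theta0, n, a, b, gamma, T, seed=0):
    """Stochastic gradient ascent on the log posterior for a scalar theta."""
    rng = random.Random(seed)
    N = len(data)
    theta = theta0
    for t in range(1, T + 1):
        eps = a * (b + t) ** (-gamma)      # assumed stepsize schedule
        batch = rng.sample(data, n)        # subset s_t, drawn without replacement
        grad = grad_log_prior(theta) + (N / n) * sum(
            grad_log_lik(x, theta) for x in batch)
        theta += 0.5 * eps * grad          # the update preceding (3.1)
    return theta

# Assumed toy model: x_i ~ N(theta, 1) with prior theta ~ N(0, 10^2).
rng = random.Random(1)
data = [rng.gauss(2.0, 1.0) for _ in range(1000)]
theta_hat = sgd_map(
    data,
    grad_log_prior=lambda th: -th / 100.0,
    grad_log_lik=lambda x, th: x - th,
    theta0=0.0, n=50, a=0.002, b=10.0, gamma=0.6, T=5000)

# For this conjugate model the posterior mode is available in closed form.
exact_mode = sum(data) / (len(data) + 1.0 / 100.0)
print(theta_hat, exact_mode)
```

Each iteration touches only n = 50 of the N = 1000 observations, which is where the computational saving comes from. The constant a has to be small enough that the first steps do not overshoot, since the gradient scales with N.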

A common extension of stochastic optimization, which will be needed later, is known as stochastic optimization with momentum. This is commonly employed when the likelihood surface exhibits a particular structure; one example where the method is used extensively is the training of deep neural networks. In this case we introduce a variable ν, which is referred to as the velocity of the trajectory. The parameter updates then proceed as follows

$$\nu_{t+1} = (1 - \alpha)\nu_t + \eta \left(\sum_{x_i \in s_t} \nabla \log p(x_i \mid \theta_t)\right), \qquad \theta_{t+1} = \theta_t + \nu_{t+1}, \tag{3.2}$$

where α and η are free parameters to be tuned.
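As a rough sketch, the momentum updates (3.2) might be coded as below. The function name `sgd_momentum`, the toy Normal likelihood and the particular values of α and η are assumptions chosen for illustration only.

```python
import random

def sgd_momentum(data, grad_log_lik, theta0, n, alpha, eta, T, seed=0):
    """Stochastic optimization with momentum for a scalar theta, following (3.2)."""
    rng = random.Random(seed)
    theta, v = theta0, 0.0
    for _ in range(T):
        batch = rng.sample(data, n)                        # subset s_t
        grad = sum(grad_log_lik(x, theta) for x in batch)  # subsampled gradient
        v = (1.0 - alpha) * v + eta * grad                 # velocity update
        theta = theta + v                                  # position update
    return theta

# Assumed toy model: x_i ~ N(theta, 1), so the likelihood mode is the sample mean.
rng = random.Random(2)
data = [rng.gauss(2.0, 1.0) for _ in range(1000)]
theta_mom = sgd_momentum(data, lambda x, th: x - th,
                         theta0=0.0, n=20, alpha=0.1, eta=0.0005, T=2000)
sample_mean = sum(data) / len(data)
print(theta_mom, sample_mean)
```

Because η is held constant here, the iterate keeps fluctuating around the mode instead of converging to it exactly; a decreasing stepsize would be needed for convergence.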

While stochastic optimization is used frequently by large scale machine learning practitioners, it does not capture parameter uncertainty since it only produces a point estimate of θ. This means that models fitted using stochastic optimization can often suffer from overfitting and require some form of regularization. One common method to provide an approximation to the true posterior is to fit a Gaussian approximation at the point estimate.

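Suppose θ₀ is the true mode of the posterior p(θ|x). Then using a second-order Taylor expansion about θ₀, we find (Bishop, 2006)

$$\begin{aligned}
\log p(\theta \mid x) &\approx \log p(\theta_0 \mid x) + (\theta - \theta_0)^T \nabla \log p(\theta_0 \mid x) + \frac{1}{2}(\theta - \theta_0)^T H[\log p(\theta_0 \mid x)](\theta - \theta_0) \\
&= \log p(\theta_0 \mid x) + \frac{1}{2}(\theta - \theta_0)^T H[\log p(\theta_0 \mid x)](\theta - \theta_0),
\end{aligned}$$

where H[g(.)] is the Hessian matrix of the function g(.), and we have used the fact that the gradient of the log posterior at θ₀ is 0.

Since θ₀ is a maximum, the Hessian there is negative definite. Let us denote the negative Hessian by V⁻¹[θ₀] := −H[log p(θ₀|x)], then taking the exponential of both sides we find

$$p(\theta \mid x) \approx A \exp\left\{-\frac{1}{2}(\theta - \theta_0)^T V^{-1}[\theta_0](\theta - \theta_0)\right\},$$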

where A is some constant. This is the kernel of a Gaussian density, suggesting an approximation to the posterior of the form N (θ^∗, V [θ^∗]), where θ^∗ is an estimate of the mode to be found. This is often referred to as a Laplace approximation.
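To make the Laplace approximation concrete, the sketch below computes it for a one-dimensional example, taking V to be the inverse of the negative second derivative of the log posterior at the mode so that V is positive. The helper name `laplace_approx`, the finite-difference step and the Gamma-shaped log posterior are assumptions made for illustration.

```python
import math

def laplace_approx(log_post, theta_mode, h=1e-4):
    """Return (mean, variance) of a Laplace approximation for scalar theta."""
    # Central finite difference for the second derivative at the mode.
    second = (log_post(theta_mode + h) - 2.0 * log_post(theta_mode)
              + log_post(theta_mode - h)) / h ** 2
    return theta_mode, -1.0 / second   # variance = inverse of the negative curvature

# Assumed example: unnormalized log density of a Gamma(shape=5, rate=2) posterior.
shape, rate = 5.0, 2.0
log_post = lambda th: (shape - 1.0) * math.log(th) - rate * th

mode = (shape - 1.0) / rate            # analytic mode of the Gamma density
mean, var = laplace_approx(log_post, mode)
true_var = shape / rate ** 2           # exact Gamma variance, for comparison
print(mean, var, true_var)
```

Here the approximation is centred at the mode (2.0) with variance close to 1.0, whereas the exact Gamma distribution has mean 2.5 and variance 1.25. The gap reflects the skewness that a Gaussian fitted at a single point cannot capture.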

By the Bernstein-von Mises theorem, this approximation is expected to become increasingly accurate as the number of observations increases. However, since it is based only on properties of the distribution at a single point, it can miss important global properties (Bishop, 2006). Moreover, multimodal distributions will be approximated very poorly. Therefore, while the approximation may work well for less complex distributions when plenty of data is available, it may struggle for more complex models. This motivates us to consider methods which aim to combine the performance of stochastic optimization with the ability to account for parameter uncertainty.

3.3 Hamiltonian Monte Carlo

Hamiltonian dynamics was originally developed as an important reformulation of Newtonian dynamics, and serves as a vital tool in statistical physics. More recently, Hamiltonian dynamics has been used to produce proposals for the Metropolis-Hastings algorithm which explore the parameter space rapidly and have very high acceptance rates. The acceptance calculation in the Metropolis-Hastings algorithm is computationally intensive when a lot of data is available. However, as outlined later, by combining ideas from stochastic optimization and Hamiltonian dynamics, we are able to approximately simulate from the posterior distribution without using an acceptance calculation. In light of this, we review Hamiltonian Monte Carlo, a method which produces efficient proposals for the Metropolis-Hastings algorithm.

3.3.1 Hamiltonian dynamics

Hamiltonian dynamics was traditionally developed to describe the motion of objects under a system of forces. In two dimensions a common analogy used to visualise the dynamics is a frictionless puck sliding over a surface of varying height (Neal, 2010). The state of the system consists of the puck’s position θ and its momentum (mass times velocity) r, both of which are 2-dimensional vectors. The state of the system is governed by its potential energy U(θ) and its kinetic energy K(r). If the puck is moving on a flat part of the surface, then it will have constant velocity. However, as the puck begins to gain height, its kinetic energy decreases and its potential energy increases as it slows. If its kinetic energy reaches zero the puck moves back down the hill, and its potential energy decreases as its kinetic energy increases.

More formally Hamiltonian dynamics is described by a Hamiltonian function H(r, θ), where r and θ are both d-dimensional. The Hamiltonian determines how r and θ change over time as follows

$$\frac{d\theta_i}{dt} = \frac{\partial H}{\partial r_i}, \qquad \frac{dr_i}{dt} = -\frac{\partial H}{\partial \theta_i}. \tag{3.3}$$

Hamiltonian dynamics has a number of properties which are crucial for its use in constructing MCMC proposals. Firstly, Hamiltonian dynamics is reversible, meaning that the mapping from the state (r(t), θ(t)) at time t to the state (r(t + s), θ(t + s)) at time t + s is one-to-one.

A second property is that the dynamics keeps the Hamiltonian invariant or conserved. This can be easily shown using (3.3) as follows

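$$\frac{dH}{dt} = \sum_{i=1}^{d}\left[\frac{d\theta_i}{dt}\frac{\partial H}{\partial \theta_i} + \frac{dr_i}{dt}\frac{\partial H}{\partial r_i}\right] = \sum_{i=1}^{d}\left[\frac{\partial H}{\partial r_i}\frac{\partial H}{\partial \theta_i} - \frac{\partial H}{\partial \theta_i}\frac{\partial H}{\partial r_i}\right] = 0.$$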

In order to use Hamiltonian dynamics to simulate from a distribution we need to translate the density function to a potential energy function, and introduce artificial momentum variables to go with the position variables of interest. A Markov chain can then be simulated where at each iteration we resample the momentum variables, simulate Hamiltonian dynamics for a number of iterations, and then perform a Metropolis-Hastings acceptance step with the new variables obtained from the simulation.

In light of this, for Hamiltonian Monte Carlo we generally define the Hamiltonian H(r, θ) to be of the following form

H(r, θ) = U (θ) + K(r),

where θ is the vector we are simulating from and the momentum vector r is constructed artificially. Using the notation in Section 3.2 the potential energy is then defined to be

$$U(\theta) = -\log\left(p(\theta)\prod_{i=1}^{N} p(x_i \mid \theta)\right) = -\log p(\theta) - \sum_{i=1}^{N}\log p(x_i \mid \theta). \tag{3.4}$$

The kinetic energy is defined as

$$K(r) = \frac{1}{2} r^T M^{-1} r, \tag{3.5}$$

where M is a symmetric, positive definite mass matrix.

3.3.2 Using Hamiltonian dynamics in MCMC

In order to relate the potential and kinetic energy functions to the distribution of interest, we can use the concept of a canonical distribution. Given some energy function E(x), defined over the state of x, the canonical distribution over the states of x is defined to be

$$P(x) = \frac{1}{Z}\exp\{-E(x)/(k_B T)\}, \tag{3.6}$$

where Z is a normalizing constant, k_B is Boltzmann’s constant, and T is the temperature of the system. The Hamiltonian is an energy function defined over the joint state of r and θ, so we can write down the joint distribution it defines as

$$P(r, \theta) \propto \exp\{-H(r, \theta)/(k_B T)\}.$$

If we now assume the Hamiltonian is of the form described by (3.4) and (3.5), and that k_B T = 1, then we find that

$$P(r, \theta) \propto \exp\{-U(\theta)\}\exp\{-K(r)\} \propto p(\theta \mid x)\, N(r \mid 0, M),$$

so that r and θ are independent under the distribution defined by the Hamiltonian, and the marginal distribution of θ is its posterior distribution.
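The factorization follows directly from substituting the potential energy (3.4) and the kinetic energy (3.5) into the canonical distribution. Written out term by term, using only the definitions above:

```latex
\begin{align*}
\exp\{-U(\theta)\} &= \exp\Big\{\log p(\theta) + \sum_{i=1}^{N}\log p(x_i \mid \theta)\Big\}
                    = p(\theta)\prod_{i=1}^{N} p(x_i \mid \theta) \propto p(\theta \mid x), \\
\exp\{-K(r)\}      &= \exp\Big\{-\tfrac{1}{2}\, r^{T} M^{-1} r\Big\} \propto N(r \mid 0, M).
\end{align*}
```

The second line is the kernel of a zero-mean Gaussian with covariance M, which is why the momentum can be resampled directly from N(0, M).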

This relationship enables us to describe Hamiltonian Monte Carlo (HMC), which can be used to simulate from continuous distributions whose density can be evaluated up to a normalizing constant. A requirement of HMC is that we can calculate the derivatives of the log of the target density. HMC samples from the joint distribution for (θ, r). Therefore by discarding the samples for r we obtain a sample from the posterior p(θ|x). Generally we choose the components of r (r_i) to be independent, each with variance m_i. This allows us to write the kinetic energy as

$$K(r) = \sum_{i=1}^{d}\frac{r_i^2}{2m_i}.$$

In order to approximate Hamilton’s equations computationally, we need to discretize time using a small stepsize ε. There are a number of ways to do this; however, in practice the leapfrog method often produces good results. The method works as follows:

1. $r_i(t + \epsilon/2) = r_i(t) - \frac{\epsilon}{2}\frac{\partial U}{\partial \theta_i}(\theta(t))$,

2. $\theta_i(t + \epsilon) = \theta_i(t) + \epsilon\frac{\partial K}{\partial r_i}(r(t + \epsilon/2))$,

3. $r_i(t + \epsilon) = r_i(t + \epsilon/2) - \frac{\epsilon}{2}\frac{\partial U}{\partial \theta_i}(\theta(t + \epsilon))$.
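The three leapfrog steps translate almost line by line into code. The sketch below assumes a scalar θ, a scalar mass m, and a standard Normal target so that U(θ) = θ²/2; the function name `leapfrog` and all numerical values are illustrative choices, not part of the original method description.

```python
def leapfrog(theta, r, grad_U, eps, L, m=1.0):
    """Run L leapfrog steps for scalar theta with K(r) = r^2 / (2 m)."""
    for _ in range(L):
        r = r - 0.5 * eps * grad_U(theta)   # step 1: half step for the momentum
        theta = theta + eps * r / m         # step 2: full step for the position
        r = r - 0.5 * eps * grad_U(theta)   # step 3: second half step for the momentum
    return theta, r

# Assumed toy target: standard Normal, so U(theta) = theta^2 / 2.
U = lambda th: 0.5 * th * th
grad_U = lambda th: th
H = lambda th, r: U(th) + 0.5 * r * r

theta0, r0 = 1.0, 0.5
theta1, r1 = leapfrog(theta0, r0, grad_U, eps=0.1, L=20)
energy_error = abs(H(theta1, r1) - H(theta0, r0))

# Reversibility check: negate the momentum, integrate again, negate back.
theta_back, r_back = leapfrog(theta1, -r1, grad_U, eps=0.1, L=20)
r_back = -r_back
print(energy_error, theta_back, r_back)
```

The Hamiltonian is not conserved exactly, but the error stays small for a small stepsize, and running the integrator backwards recovers the starting state up to floating-point rounding.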

The leapfrog method has a number of desirable properties, including that it is reversible and volume preserving. An effect of this is that at the acceptance step, the proposal distributions cancel, so that the acceptance probability is simply a ratio of the canonical distributions at the proposed and current states. Since we must discretize the equations in order to simulate from them, the posterior p(θ|x) is not invariant under the approximate dynamics. This is why the acceptance step is required, as it corrects for this error. As the stepsize ε tends to zero, the acceptance rate of the leapfrog method tends to 1 as the approximation moves closer to true Hamiltonian dynamics.

Now that we have outlined how to approximate the Hamiltonian equations, we can outline Hamiltonian Monte Carlo. HMC is performed in two steps as follows:

1. Simulate new values for the momentum variables r ∼ N (0, M ).

2. Simulate Hamiltonian dynamics for L steps with stepsize ε using the leapfrog method.

The momentum variables are then negated, and the new state (θ^∗, r^∗) is accepted with probability

$$\min\{1, \exp\{H(\theta, r) - H(\theta^*, r^*)\}\}.$$

3.3.3 Developments in HMC and tuning

HMC allows the state space to be explored rapidly and has high acceptance rates. However, in order to gain these benefits, we need to ensure that L and ε are properly tuned. Generally it is recommended to use trial values for L and ε, and to use traceplots and autocorrelation plots to assess how quickly the resulting algorithm converges and how well it is exploring the state space. The presence of multiple modes can be an issue for HMC, and requires special treatment (Neal, 2010). Therefore it is recommended that the algorithm is run from different starting points to check for multimodality.
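To show where L and ε enter, the HMC procedure of Section 3.3.2 can be sketched as follows for a scalar parameter with unit mass. The function name `hmc`, the Normal target N(3, 2²) and the trial values ε = 0.5, L = 10 are assumptions for illustration; in practice they would be chosen from diagnostics as described above.

```python
import math
import random

def hmc(U, grad_U, theta0, eps, L, n_iter, seed=0):
    """Hamiltonian Monte Carlo for scalar theta with unit mass (M = 1)."""
    rng = random.Random(seed)
    theta, samples, accepted = theta0, [], 0
    for _ in range(n_iter):
        r = rng.gauss(0.0, 1.0)                    # step 1: resample the momentum
        th_new, r_new = theta, r
        for _ in range(L):                         # step 2: L leapfrog steps
            r_new -= 0.5 * eps * grad_U(th_new)
            th_new += eps * r_new
            r_new -= 0.5 * eps * grad_U(th_new)
        # Negating r_new does not change H because K(r) is symmetric in r.
        h_old = U(theta) + 0.5 * r * r
        h_new = U(th_new) + 0.5 * r_new * r_new
        if rng.random() < math.exp(min(0.0, h_old - h_new)):
            theta, accepted = th_new, accepted + 1
        samples.append(theta)
    return samples, accepted / n_iter

# Assumed toy target: N(3, 2^2), so U(theta) = (theta - 3)^2 / 8.
U = lambda th: 0.5 * ((th - 3.0) / 2.0) ** 2
grad_U = lambda th: (th - 3.0) / 4.0

samples, acc_rate = hmc(U, grad_U, theta0=0.0, eps=0.5, L=10, n_iter=5000)
kept = samples[500:]                               # discard an assumed burn-in
post_mean = sum(kept) / len(kept)
post_var = sum((s - post_mean) ** 2 for s in kept) / len(kept)
print(acc_rate, post_mean, post_var)
```

Rerunning with a much larger ε should show the acceptance rate falling, while a very small L should show strongly autocorrelated output; both are the tuning effects discussed in this section.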

Suppose we have an estimate of the variance matrix for θ. If the variables appear to be correlated then HMC may not explore the parameter space effectively. One way to improve the performance of HMC in this case is to set M = Σ̂⁻¹, where Σ̂ is our estimate of Var(θ|x).

The selection of the stepsize ε is very important in HMC, since selecting a size that is too big will result in a low acceptance rate, while selecting a size that is too small will result in slow exploration of the space. Selecting ε too large can be particularly problematic as it can cause instability in the Hamiltonian error, which leads to very low acceptance. In situations where the mass matrix M is the identity matrix, the stability limit for ε is given by the width of the distribution in its most constrained direction. For a Gaussian distribution, this is the square root of the smallest eigenvalue of the covariance matrix for θ.

The value of L is also an important quantity to choose when tuning the HMC algorithm. Selecting L too small will mean that HMC explores the space with inefficient random walk behaviour, as the next state will still be correlated with the previous state. On the other hand, selecting L too large will waste computation and lower acceptance rates.

There have been a number of important developments to HMC. Girolami and Calderhead (2011) introduced Riemannian Manifold Hamiltonian Monte Carlo, which simulates HMC in a Riemannian space rather than a Euclidean one. This effectively enables the use of position-dependent mass matrices M. As a result, the algorithm samples more efficiently from distributions where parameters of interest exhibit strong correlations. More recently, Hoffman and Gelman (2014) introduced the ‘No U-turn Sampler’. This enables the automatic and adaptive tuning of the stepsize ε and the trajectory length L. This is an important development since the tuning of HMC algorithms is a non-trivial task.

Alternative methods to the leapfrog method for simulating Hamiltonian dynamics have been developed. These enable us to handle constraints on the variables, or to exploit partially analytic solutions (Neal, 2010). As mentioned earlier, HMC can have considerable difficulty moving between the modes of a distribution. A number of schemes have been developed to solve this problem, including tempered transitions (Neal, 1996) and annealed importance sampling (Neal, 2001).

3.4 Stochastic gradient Langevin Monte Carlo

A special case of HMC, known as Langevin Monte Carlo, arises when we only use a single leapfrog step to propose a new state. Its name comes from its similarity to the theory of Langevin dynamics in physics. Welling and Teh (2011) noticed that the discretized form of Langevin Monte Carlo has a comparable structure to that of stochastic optimization, outlined in Section 3.2. This motivates them to develop an algorithm based on Langevin Monte Carlo, which only uses a subsample of the dataset to calculate the gradient of the potential energy ∇U. They show that by using a stepsize that decreases with time, the algorithm will smoothly transition from stochastic gradient descent to sampling approximately from the posterior distribution, without the need for an acceptance step. This result, along with the fact that only a subsample of the data is used at each iteration, means that the algorithm is scalable to large datasets.

3.4.1 Stochastic gradient Langevin Monte Carlo

Langevin Monte Carlo arises from HMC when we use only one leapfrog step in generating a new state (r, θ). In this case we can remove any explicit mention of momentum variables and propose a new value for θ as follows (Neal, 2010)

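$$\theta_{t+1} = \theta_t - \frac{a^2}{2}\frac{\partial U}{\partial \theta} + \eta,$$

where η ∼ N(0, a²) and a is some constant. Using our particular expression of the potential energy (3.4), we can write

$$\begin{aligned}
\theta_{t+1} &= \theta_t + \frac{\epsilon}{2}\left(\nabla \log p(\theta_t) + \sum_{i=1}^{N}\nabla \log p(x_i \mid \theta_t)\right) + \eta \\
&= \theta_t - \frac{\epsilon}{2}\nabla U(\theta_t) + \eta,
\end{aligned} \tag{3.7}$$

where ε = a².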

While being a special case of Hamiltonian Monte Carlo, the properties of Langevin dynamics are somewhat different. We cannot typically set a very large, so the state space is normally explored much more slowly than with HMC. The proposal for Langevin Monte Carlo is a particular discretization of a stochastic differential equation (SDE) known as Langevin dynamics. Writing this discretization as an SDE we obtain

$$d\theta = -\frac{1}{2}\nabla U(\theta)\,dt + dW = -\frac{1}{2}\nabla U(\theta)\,dt + N(0, dt), \tag{3.8}$$

where W is a Wiener process and we have informally written dW as N(0, dt). A Wiener process is a stochastic process with the following properties:

1. W (0) = 0 with probability 1;

2. W (t + h) − W (t) ∼ N (0, h) and is independent of W (τ ) for τ ≤ t.

It can be shown that, under certain conditions, the posterior distribution p(θ|x) is the stationary distribution of (3.8). This motivates the Metropolis-adjusted Langevin algorithm (MALA), which uses (3.7) as a proposal for the Metropolis-Hastings algorithm.
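A minimal sketch of MALA for a scalar parameter is given below. The proposal is the Langevin update (3.7) with η ∼ N(0, ε); because this proposal is not symmetric, the forward and reverse proposal densities both appear in the acceptance ratio. The function name `mala`, the standard Normal target and the value ε = 1 are assumptions chosen for illustration.

```python
import math
import random

def mala(U, grad_U, theta0, eps, n_iter, seed=0):
    """Metropolis-adjusted Langevin algorithm for scalar theta."""
    rng = random.Random(seed)

    def log_q(to, frm):
        # Log proposal density of moving frm -> to, up to an additive constant.
        mean = frm - 0.5 * eps * grad_U(frm)
        return -((to - mean) ** 2) / (2.0 * eps)

    theta, samples, accepted = theta0, [], 0
    for _ in range(n_iter):
        prop = theta - 0.5 * eps * grad_U(theta) + rng.gauss(0.0, math.sqrt(eps))
        log_ratio = (-U(prop) + log_q(theta, prop)) - (-U(theta) + log_q(prop, theta))
        if rng.random() < math.exp(min(0.0, log_ratio)):
            theta, accepted = prop, accepted + 1
        samples.append(theta)
    return samples, accepted / n_iter

# Assumed toy target: standard Normal, U(theta) = theta^2 / 2.
samples, acc_rate = mala(lambda th: 0.5 * th * th, lambda th: th,
                         theta0=0.0, eps=1.0, n_iter=20000)
mala_mean = sum(samples) / len(samples)
mala_var = sum((s - mala_mean) ** 2 for s in samples) / len(samples)
print(acc_rate, mala_mean, mala_var)
```

Note that every iteration evaluates U and ∇U on the full target, which is the cost that becomes prohibitive when the number of observations is large.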

When there are a large number of observations available, ∇U(θ) is expensive to calculate at each iteration, since it requires the evaluation of the log likelihood gradient for every observation. Welling and Teh (2011) therefore suggest introducing an unbiased estimator of ∇U(θ) which uses only a subset s_t of the data at each iteration. The estimator ∇Ũ(θ) is given as follows

$$\nabla \tilde{U}(\theta) = -\nabla \log p(\theta) - \frac{N}{n}\sum_{x_i \in s_t}\nabla \log p(x_i \mid \theta). \tag{3.9}$$

We can then write

$$\nabla \tilde{U}(\theta) = \nabla U(\theta) + \nu, \tag{3.10}$$

where ν is a zero-mean noise term which we refer to as the stochastic gradient noise.

Using this estimator in place of ∇U(θ) in a Langevin Monte Carlo update we obtain the following

$$\begin{aligned}
\theta_{t+1} &= \theta_t + \frac{\epsilon}{2}\left(\nabla \log p(\theta_t) + \frac{N}{n}\sum_{x_i \in s_t}\nabla \log p(x_i \mid \theta_t)\right) + \eta \\
&= \theta_t - \frac{\epsilon}{2}\nabla U(\theta_t) - \frac{\epsilon}{2}\nu_t + \eta.
\end{aligned} \tag{3.11}$$

If we assume that the stochastic gradient noise ν_t has variance V(θ_t), then the term (ε/2)ν_t has variance (ε/2)²V(θ_t). Therefore for small ε, the injected noise η, which has variance ε, will dominate. As we send ε → 0, (3.11) will approximate Langevin dynamics and sample approximately from p(θ|x), without the need for an acceptance step.

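This result motivates Welling and Teh (2011) to suggest an algorithm that uses (3.11) to update θ_t, but decreases the stepsize to 0 as the number of iterations t increases. This leads to the SGLD update

$$\theta_{t+1} = \theta_t + \frac{\epsilon_t}{2}\left(\nabla \log p(\theta_t) + \frac{N}{n}\sum_{x_i \in s_t}\nabla \log p(x_i \mid \theta_t)\right) + \eta_t, \qquad \eta_t \sim N(0, \epsilon_t). \tag{3.12}$$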

Noting the similarity between (3.12) and stochastic optimization, they suggest decreasing ε_t according to the conditions (3.1) to ensure that the noise in the stochastic gradients averages out. The result is an algorithm that transitions smoothly between stochastic gradient descent and approximately sampling from the posterior using an increasingly accurate discretization of Langevin dynamics. Since the stepsize must decrease to zero, the mixing rate of the algorithm will slow as the number of iterations increases. Putting this all together we outline the full SGLD procedure in Algorithm 2.

Algorithm 2: Stochastic gradient Langevin dynamics (SGLD).

Input: Initial estimate θ₁, stepsize function ε(t), subsample size |s_t| = n, likelihood and prior gradients ∇ log p(x|θ) and ∇ log p(θ).

Result: Approximate sample from the full posterior p(θ|x).

for t = 1 to T do
    ε ← ε(t)
    Sample s_t from full dataset x
    η ∼ N(0, ε)
    θ ← θ + (ε/2) ( ∇ log p(θ) + (N/n) Σ_{x_i ∈ s_t} ∇ log p(x_i|θ) ) + η
    if ε small enough then
        Store θ as part of the sample
    end
end
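Algorithm 2 might be implemented along the following lines for a scalar parameter. This is a sketch under stated assumptions: the function name `sgld`, the polynomial stepsize a(b + t)^(−decay), the storage threshold `eps_max` (standing in for "ε small enough") and the toy Normal model are all illustrative choices.

```python
import random

def sgld(data, grad_log_prior, grad_log_lik, theta0, n, a, b, decay, T,
         eps_max, seed=0):
    """Stochastic gradient Langevin dynamics for a scalar theta."""
    rng = random.Random(seed)
    N = len(data)
    theta, thetas, stepsizes = theta0, [], []
    for t in range(1, T + 1):
        eps = a * (b + t) ** (-decay)            # decreasing stepsize eps(t)
        batch = rng.sample(data, n)              # subsample s_t
        eta = rng.gauss(0.0, eps ** 0.5)         # injected noise with variance eps
        grad = grad_log_prior(theta) + (N / n) * sum(
            grad_log_lik(x, theta) for x in batch)
        theta = theta + 0.5 * eps * grad + eta
        if eps < eps_max:                        # store once the stepsize is small
            thetas.append(theta)
            stepsizes.append(eps)
    return thetas, stepsizes

# Assumed toy model: x_i ~ N(theta, 1) with prior theta ~ N(0, 10^2).
rng = random.Random(3)
data = [rng.gauss(2.0, 1.0) for _ in range(1000)]
thetas, stepsizes = sgld(
    data, lambda th: -th / 100.0, lambda x, th: x - th,
    theta0=0.0, n=50, a=1e-3, b=100.0, decay=1.0 / 3.0, T=10000, eps_max=1e-4)

post_mean = sum(thetas) / len(thetas)
post_var = sum((th - post_mean) ** 2 for th in thetas) / len(thetas)
exact_mean = sum(data) / (len(data) + 1.0 / 100.0)   # conjugate posterior mean
exact_var = 1.0 / (len(data) + 1.0 / 100.0)          # conjugate posterior variance
print(post_mean, exact_mean, post_var, exact_var)
```

With these settings the sample variance is expected to be of the same order as the exact posterior variance but somewhat larger, since the stochastic gradient noise is not yet negligible relative to the injected noise.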

3.4.2 Discussion and tuning

Teh et al. (2014) study SGLD theoretically and show that, given regularity conditions, estimators derived from an SGLD sample are consistent and satisfy a central limit theorem. They reveal that for polynomial stepsizes of the form ε_t = a(b + t)^(−α), the optimal choice of α is 1/3. The rate of convergence of SGLD is shown to be T^(−1/3), where T is the number of iterations of SGLD. This is slower than the traditional Monte Carlo rate of T^(−1/2), and is due to the decreasing stepsizes.

In tuning the algorithm the key constants that need to be chosen are those used in the stepsize, a and b, and the subsample size n. To avoid divergence it is important to keep the stochastic gradient noise under control, especially as N gets large. This can be done in two ways. One is to increase the subsample size n; another is to keep the stepsize small. However, in order to keep the algorithm efficient the subsample size needs to be kept relatively small; Welling and Teh (2011) suggest keeping it in the hundreds. Therefore the main constant that needs to be considered in tuning is a. Set a too large and the stochastic gradient noise dominates for too long, so the algorithm never moves to posterior sampling. Set a too small, however, and the parameter space is not explored efficiently enough.

One problem with this method is that the stepsizes must decrease to zero so that an acceptance step is not needed. However, this means the mixing rate of the algorithm will slow down as the number of iterations increases. There are a few ways around this. One is to stop decreasing the stepsize once it falls below a threshold at which the rejection rate would be negligible; however, in this case the posterior will still be explored slowly. The other is to use this algorithm initially for burn-in, then switch to an alternative, more efficient MCMC method later. However, both these solutions require significant hand-tuning beforehand. The decelerating mixing rate makes it less clear how the algorithm compares to other samplers: while it requires only a fraction of the dataset per iteration, this is offset by the fact that more iterations are required to reach the accuracy of other samplers (Bardenet et al., 2015).

Another problem with the method is that it often explores the state space inefficiently. This is because Langevin dynamics explores the state space less efficiently than more general HMC. This is motivation for stochastic gradient HMC (Chen et al., 2014) which is discussed in Section 3.5.

Note that similar to HMC, certain parameters may have a much higher variance than others. In this case we can use a preconditioning matrix M to bring all the parameters onto a similar scale, allowing the algorithm to explore the space more efficiently. The algorithm including preconditioning can simply be written as

$$\theta_{t+1} = \theta_t + \frac{\epsilon_t}{2} M\left(\nabla \log p(\theta_t) + \frac{N}{n}\sum_{x_i \in s_t}\nabla \log p(x_i \mid \theta_t)\right) + \eta_t,$$

where η_t ∼ N(0, ε_t M).

Provided the size of the subset n is large enough, we can use the central limit theorem to approximate V(θ_t) by its empirical covariance

$$V(\theta_t) \approx \frac{N^2}{n^2}\sum_{x_i \in s_t}\left(y(x_i, \theta_t) - \bar{y}(\theta_t)\right)\left(y(x_i, \theta_t) - \bar{y}(\theta_t)\right)^T = \frac{N^2}{n}V_s, \tag{3.13}$$

where y(x_i, θ_t) = ∇ log p(x_i|θ_t) + (1/N)∇ log p(θ_t) and ȳ(θ_t) = (1/n) Σ_{x_i ∈ s_t} y(x_i, θ_t). From (3.13) we determine that the variance of a stochastic gradient step can be estimated by (ε_t² N² / 4n) M V_s M (Welling and Teh, 2011). The injected noise has variance ε_t M, so for the injected noise to dominate, denoting the largest eigenvalue of M^(1/2) V_s M^(1/2) by λ, we require

$$\alpha = \frac{\epsilon_t N^2}{4n}\lambda \ll 1.$$

Therefore, using the fact that the Fisher information I ≈ N V_s, and that the posterior variance Σ_θ ≈ I⁻¹ for large N, we can find the approximate stepsize at which the injected noise will dominate. Denoting the smallest eigenvalue of Σ_θ by λ_θ, the stepsize is given by ε_t ≈ (4αn/N) λ_θ. This stepsize is generally small.

Suppose we have a sample θ₁, …, θ_T which is output from the algorithm. Since the mixing of the algorithm decelerates, standard Monte Carlo estimates will overemphasize parts of the sample where the stepsize is small. This increases the variance of the estimate, though it remains consistent. Therefore Welling and Teh (2011) suggest instead using the estimate

$$E(f(\theta)) \approx \frac{\sum_{t=1}^{T}\epsilon_t f(\theta_t)}{\sum_{t=1}^{T}\epsilon_t},$$

which is also consistent.
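The stepsize-weighted estimate is straightforward to compute from stored samples and stepsizes. The short sketch below uses a hypothetical helper `weighted_estimate` and made-up numbers purely to show the arithmetic.

```python
def weighted_estimate(thetas, stepsizes, f=lambda th: th):
    """Stepsize-weighted average of f over an SGLD sample."""
    numerator = sum(eps * f(th) for th, eps in zip(thetas, stepsizes))
    return numerator / sum(stepsizes)

# Made-up sample and stepsizes, for illustration only.
thetas = [1.0, 2.0, 4.0]
stepsizes = [0.5, 0.25, 0.25]

est_mean = weighted_estimate(thetas, stepsizes)                        # 2.0
est_second = weighted_estimate(thetas, stepsizes, lambda th: th ** 2)  # 5.5
plain_mean = sum(thetas) / len(thetas)                                 # 7/3
print(est_mean, est_second, plain_mean)
```

Samples drawn with a larger stepsize receive proportionally more weight, which compensates for the slower mixing of the later, small-stepsize part of the chain.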

3.4.3 Further developments

A number of extensions to the original SGLD algorithm by Welling and Teh (2011) have been suggested. Ahn et al. (2012) aim to improve the mixing of the algorithm by appealing
