Computational Statistics for Big Data
Author: Jack Baker (1)
Supervisors: Paul Fearnhead (1), Emily Fox (2)
(1) Lancaster University
(2) The University of Washington
September 1, 2015
Abstract
The amount of data stored by organisations and individuals is growing at an astonishing rate. As statistical models grow in complexity and size, traditional machine learning algorithms are struggling to scale to the large datasets required for model fitting. Markov chain Monte Carlo (MCMC) is one algorithm that has been left behind.
However, the algorithm has proven to be an invaluable tool for training complex statistical models. This report discusses a number of possible solutions that enable MCMC to scale more effectively to large datasets. We focus on two particular solutions to this problem: batch methods and stochastic gradient Monte Carlo methods.
Batch methods split the full dataset into disjoint subsets, and run traditional MCMC on each subset. The difficulty of these methods is in recombining the MCMC output from each subset so that the result is a close approximation to the posterior using the full dataset. Stochastic gradient Monte Carlo approximately samples from the full posterior but uses only a subsample of data at each iteration. It does this by combining two key ideas: stochastic optimization, an algorithm that finds the mode of the posterior using only a subset of the data at each iteration; and Hamiltonian Monte Carlo, a method that provides efficient proposals for Metropolis-Hastings algorithms with high acceptance rates. After discussing the methods and important extensions, we perform a simulation study, which compares the methods and how they are affected by various model properties.
Contents
1 Introduction
1.1 An overview of methods
1.2 Report outline
2 Batch methods
2.1 Introduction
2.2 Splitting the data
2.3 Efficiently sampling from products of Gaussian mixtures
2.4 Parametric recombination methods
2.5 Nonparametric methods
2.6 Semiparametric methods
2.7 Conclusion
3 Stochastic gradient methods
3.1 Introduction
3.2 Stochastic optimization
3.3 Hamiltonian Monte Carlo
3.4 Stochastic gradient Langevin Monte Carlo
3.5 Stochastic gradient Hamiltonian Monte Carlo
3.6 Conclusion
4 Simulation study
4.1 Introduction
4.2 Batch methods
4.3 Stochastic gradient methods
4.4 Conclusion
5 Future Work
5.1 Introduction
5.2 Further comparison of batch methods
5.3 Tuning guidance for stochastic gradient methods
5.4 Using batch methods to analyse complex hierarchical models
1 Introduction
As the amount of data stored by individuals and organisations grows, statistical models have advanced in complexity and size. Much statistical methodology has historically focussed on fitting models with limited data. Now we are faced with the opposite problem: we have so much data that traditional statistical methods struggle to cope and run exceptionally slowly.
These problems have led to a rapidly evolving area of statistics and machine learning, which develops algorithms which are scalable as the size of data increases. The ‘size’ of data is generally used to mean one of two things: the dimensionality of the data or the number of observations. In this report we focus on methods which have been designed to be scalable as the number of observations increases. Data with a large number of observations is often referred to as tall data.
Currently, large scale machine learning models are trained mainly using optimization methods such as stochastic optimization. These algorithms are popular mainly for their speed: they train models quickly even when a huge number of observations is available. This speed comes from using only a subset of the available data at each iteration. The downside is that these methods only find local maxima of the posterior distribution, so they produce only a point estimate, which can lead to overfitting.
A key appeal of Bayesian methods is that they produce a whole distribution of possible parameter values, which allows uncertainty to be quantified, reducing the risk of overfitting.
While parameter uncertainty can be approximated when using stochastic optimization, for complex models this approximation can be very poor. Samples from the Bayesian posterior distribution are generally simulated using a class of algorithms known as Markov chain Monte Carlo (MCMC). The problem is that these algorithms require calculations over the whole dataset at each iteration, making them slow for large datasets. A next generation of MCMC algorithms which scale to large datasets therefore needs to be developed.
1.1 An overview of methods
We begin this section with a more formal statement of the problem. Suppose we wish to train a model with probability density p(x|θ), where θ is an unknown parameter vector and x = {x_1, . . . , x_N} is the observed data. Let the likelihood of the model be $p(x|\theta) = \prod_{i=1}^N p(x_i|\theta)$ and the prior for the parameter be p(θ). Our interest is in the posterior p(θ|x) ∝ p(x|θ)p(θ), which describes how plausible each value of θ is given the data x.
Commonly we simulate from the posterior using the Metropolis-Hastings (MH) algorithm, arguably the most popular MCMC algorithm. At each iteration, given a current state θ, the algorithm proposes a new state θ′ from a proposal distribution q(·). This new state is then accepted as part of the sample with probability
\[ \alpha = \min\left\{1, \frac{q(\theta)\,p(\theta'|x)}{q(\theta')\,p(\theta|x)}\right\} = \min\left\{1, \frac{q(\theta)\,p(x|\theta')\,p(\theta')}{q(\theta')\,p(x|\theta)\,p(\theta)}\right\}. \]
Notice that at each iteration, the MH algorithm requires calculation of the likelihood at the new state θ′. This requires a computation over the whole dataset, which is infeasibly slow when N is large. This is the key bottleneck in Metropolis-Hastings, and other MCMC algorithms, when they are being used with large datasets.
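To make the bottleneck concrete, the following is a minimal sketch of a random-walk Metropolis-Hastings sampler. The toy model (x_i ∼ N(θ, 1) with a N(0, 10²) prior), the step size and the function names are illustrative assumptions, not taken from the report. Because the random-walk proposal is symmetric, the q terms cancel in the acceptance ratio. The point to notice is that every iteration calls `log_posterior`, which sums over all N observations.

```python
import numpy as np

def log_posterior(theta, x):
    """Unnormalised log posterior for an assumed toy model:
    x_i ~ N(theta, 1) with prior theta ~ N(0, 10^2)."""
    log_prior = -0.5 * theta**2 / 100.0
    log_lik = -0.5 * np.sum((x - theta) ** 2)  # O(N): touches every observation
    return log_prior + log_lik

def metropolis_hastings(x, n_iter=5000, step=0.05, seed=0):
    """Random-walk MH; the symmetric proposal cancels in the acceptance ratio."""
    rng = np.random.default_rng(seed)
    theta = 0.0
    current = log_posterior(theta, x)
    samples = np.empty(n_iter)
    for t in range(n_iter):
        proposal = theta + step * rng.normal()
        candidate = log_posterior(proposal, x)  # full pass over the data
        # Accept with probability min(1, posterior ratio), done on the log scale.
        if np.log(rng.uniform()) < candidate - current:
            theta, current = proposal, candidate
        samples[t] = theta
    return samples

rng = np.random.default_rng(1)
x = rng.normal(2.0, 1.0, size=1000)  # simulated data, purely illustrative
samples = metropolis_hastings(x)
```

With N = 1000 the cost per iteration is negligible, but it grows linearly in N, which is what the methods reviewed in this report try to avoid.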
A number of solutions have been proposed for this problem, and they can generally be divided into three categories. We refer to these categories as batch methods, stochastic gradient methods and subsampling methods. Batch methods aim to make use of recent hardware developments which make the parallelisation of computational work more accessible. They split the dataset x into disjoint batches $x_{B_1}, \dots, x_{B_S}$. The structure of the posterior allows separate MCMC algorithms to be run on these batches in parallel in order to simulate from each subposterior $p(\theta|x_{B_s}) \propto p(\theta)^{1/S} p(x_{B_s}|\theta)$. These simulations must then be combined in order to generate a sample which approximates the full posterior p(θ|x). This is where the main challenge lies.
Stochastic gradient methods make use of sophisticated proposals that have been suggested for MCMC. These methods use gradients of the log posterior in order to suggest new states which have very high acceptance rates. When free constants of these proposals are tuned in a certain way these rates can be so high that we can get rid of the acceptance step and still sample from a good approximation to the posterior. However the gradient calculation still requires a computation over the whole dataset. Therefore the gradients of the log posterior need to be estimated using only a subsample of the data, which introduces extra noise.
Subsampling methods propose various methods to keep the MCMC algorithm largely as is but use only a subset of the data in the acceptance step at each iteration. Certain methods exist which allow this to be done while still sampling from the true posterior distribution.
However, this advantage often comes at the cost of poor mixing. Other methods achieve the result by introducing controlled biases; these methods often mix better.
1.2 Report outline
This report provides a review of batch methods and stochastic gradient methods outlined in Section 1.1. The reviewed methods are then implemented and compared under a variety of scenarios.
In Section 2 we discuss batch methods, including parametric contributions by Scott et al. (2013) and Neiswanger et al. (2013), nonparametric and semiparametric methods introduced by Neiswanger et al. (2013), as well as more recent developments. Section 3 reviews stochastic gradient methods, including the stochastic gradient Langevin dynamics (SGLD) algorithm of Welling and Teh (2011) and the stochastic gradient Hamiltonian Monte Carlo (SGHMC) algorithm of Chen et al. (2014). We first consider the stochastic optimization methods currently employed to train models which rely on large datasets. We then introduce Hamiltonian Monte Carlo, which is used to produce proposals for the SGHMC algorithm. Finally we examine the literature which provides further theoretical results for the algorithms, as well as proposed improvements.
In Section 4 the algorithms reviewed in the report are compared. Code for the implementations is available on GitHub: https://goo.gl/9ZGHP2. A relatively simple model, a multivariate t-distribution, is used for comparison. Because the model is simple, the number of observations is kept small so that the methods are genuinely tested. First the effect of bandwidth choice for nonparametric/semiparametric methods is investigated. The effects of the number of observations and the dimensionality of the target on performance are then compared for all the methods. The batch size for the batch methods, and the subsample size for the stochastic gradient methods, are considered too.
2 Batch methods
2.1 Introduction
In order to speed up MCMC, it is natural to consider parallelisation. Advances in hardware allow many jobs to be run in parallel over separate cores. These advances have been used to speed up many other computationally intensive algorithms. Parallelising MCMC has proven difficult, however, since MCMC is inherently sequential in nature and parallelisation requires minimal communication between machines. A natural way to parallelise MCMC is to split the data into different subsets. MCMC for each subset is then run separately on different machines. In this case the main problem is how to recombine the MCMC samples from each subset while ensuring the final sample is as close as possible to the true posterior. In this section, we discuss parametric and nonparametric methods suggested to do this.
2.2 Splitting the data
Suppose we have N i.i.d. data points x = {x_1, . . . , x_N}. We wish to investigate a model with probability density p(x|θ), where θ is an unknown parameter vector. Let the likelihood be $p(x|\theta) = \prod_{i=1}^N p(x_i|\theta)$ and the prior we assign to θ be p(θ). Then the full posterior for the data p(θ|x) is given by
\[ p(\theta|x) \propto p(\theta)\,p(x|\theta). \tag{2.1} \]
Let $B_1, \dots, B_S$ be a partition of {1, . . . , N}, and $x_{B_s}$ be the corresponding set of data points $x_{B_s} = \{x_i : i \in B_s\}$. We refer to $x_{B_s}$ as the sth batch of data. We can rewrite (2.1) as
\[ p(\theta|x) \propto p(\theta) \prod_{s=1}^{S} p(x_{B_s}|\theta) = \prod_{s=1}^{S} p(\theta)^{1/S}\, p(x_{B_s}|\theta). \]
For brevity we will write the S batches of data as $x_1, \dots, x_S$ from now on. Let us define the subposterior $p(\theta|x_s)$ by
\[ p(\theta|x_s) \propto p(\theta)^{1/S}\, p(x_s|\theta). \]
Therefore we have that $p(\theta|x) \propto \prod_{s=1}^{S} p(\theta|x_s)$. The idea of batch methods for big data is to run MCMC separately to sample from each subposterior. These samples are then combined in some way so that the final sample follows the full posterior p(θ|x) as closely as possible.
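The factorisation can be checked numerically. The sketch below assumes a toy model (x_i ∼ N(θ, 1), prior θ ∼ N(0, 10²)) and hypothetical helper names. It splits the data into S random batches, defines each unnormalised log subposterior with the prior raised to the power 1/S, and confirms that the log subposteriors sum to the full unnormalised log posterior.

```python
import numpy as np

def log_prior(theta):
    # Assumed prior theta ~ N(0, 10^2), up to an additive constant.
    return -0.5 * theta**2 / 100.0

def log_lik(theta, x):
    # Assumed model x_i ~ N(theta, 1), up to an additive constant.
    return -0.5 * np.sum((x - theta) ** 2)

def split_into_batches(x, S, seed=0):
    """Randomly partition the observations into S disjoint batches."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(x))
    return [x[part] for part in np.array_split(idx, S)]

def log_subposterior(theta, x_s, S):
    """Unnormalised log subposterior: prior to the power 1/S times batch likelihood."""
    return log_prior(theta) / S + log_lik(theta, x_s)

rng = np.random.default_rng(1)
x = rng.normal(2.0, 1.0, size=1000)
S = 4
batches = split_into_batches(x, S)

theta = 1.7  # any test value
full = log_prior(theta) + log_lik(theta, x)
combined = sum(log_subposterior(theta, b, S) for b in batches)
```

Here `full` and `combined` agree up to floating point error, which is the identity that batch methods exploit. Each `log_subposterior` could be handed to a separate MCMC run on its own machine.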
2.3 Efficiently sampling from products of Gaussian mixtures
Before we outline recombination methods in more detail, we discuss certain important properties of the multivariate Normal distribution which will prove useful later.
Suppose we have S multivariate Normal densities $\mathcal{N}(\theta|\mu_s, \Sigma_s)$ for s ∈ {1, . . . , S}, then Wu (2004) shows that their product can be written, up to a constant of proportionality, as
\[ \prod_{s=1}^{S} \mathcal{N}(\theta|\mu_s, \Sigma_s) \propto \mathcal{N}(\theta|\mu, \Sigma), \]
where
\[ \Sigma = \left( \sum_{s=1}^{S} \Sigma_s^{-1} \right)^{-1}, \qquad \mu = \Sigma \left( \sum_{s=1}^{S} \Sigma_s^{-1} \mu_s \right). \tag{2.2} \]
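As a small illustration of (2.2), the following sketch computes the mean and covariance of a product of Gaussian densities. The function name `gaussian_product` is hypothetical, and explicit matrix inverses are used for readability rather than numerical robustness.

```python
import numpy as np

def gaussian_product(mus, Sigmas):
    """Mean and covariance of the (normalised) product of Gaussian densities, as in (2.2)."""
    precisions = [np.linalg.inv(S) for S in Sigmas]
    Sigma = np.linalg.inv(sum(precisions))                     # inverse of summed precisions
    mu = Sigma @ sum(P @ m for P, m in zip(precisions, mus))   # precision-weighted mean
    return mu, Sigma

mus = [np.array([0.0, 0.0]), np.array([2.0, 4.0])]
Sigmas = [np.eye(2), np.eye(2)]
mu, Sigma = gaussian_product(mus, Sigmas)
```

For two unit-variance components the precisions add to 2I, so the product has covariance 0.5·I and a mean halfway between the two component means, here (1, 2).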
Now suppose we have a set of S Gaussian mixtures $\{p_s(\theta)\}_{s=1}^{S}$,
\[ p_s(\theta) = \sum_{m=1}^{M} \omega_{m,s}\, \mathcal{N}(\theta|\mu_{m,s}, \Sigma_s), \]
where $\omega_{m,s}$ denote the mixture weights. For simplicity we assume that the number of components in each mixture is the same and that each Gaussian component in the mixture shares a common variance which is diagonal.
We wish to sample from the product of these Gaussian mixtures,
\[ p(\theta) \propto \prod_{s=1}^{S} p_s(\theta). \tag{2.3} \]
It can be shown using induction that
\[ \prod_{s=1}^{S} \sum_{m=1}^{M} \omega_{m,s}\, \mathcal{N}(\theta|\mu_{m,s}, \Sigma_s) = \sum_{l_1} \cdots \sum_{l_S} \prod_{s=1}^{S} \omega_{l_s,s}\, \mathcal{N}(\theta|\mu_{l_s,s}, \Sigma_s), \]
where we label each component of the sum using $L = (l_1, \dots, l_S)$, where $l_s \in \{1, \dots, M\}$.
It follows from this, and from the results above about products of Gaussians, that (2.3) is equivalent to a Gaussian mixture with $M^S$ mixture components. Therefore sampling from this product can be performed exactly in two steps. Firstly we sample one of the $M^S$ components of the mixture according to its weight, then we draw a sample from the corresponding Gaussian component (Ihler et al., 2004).
The parameters of the Lth Gaussian component can be calculated using (2.2) and are given by
\[ \Sigma_L = \left( \sum_{s=1}^{S} \Sigma_s^{-1} \right)^{-1}, \qquad \mu_L = \Sigma_L \left( \sum_{s=1}^{S} \Sigma_s^{-1} \mu_{l_s,s} \right). \]
The unnormalised weight of the Lth mixture component is given by (Ihler et al., 2004)
\[ \omega_L \propto \frac{\prod_{s=1}^{S} \omega_{l_s,s}\, \mathcal{N}(\theta|\mu_{l_s,s}, \Sigma_s)}{\mathcal{N}(\theta|\mu_L, \Sigma_L)}. \]
In order to use this exact method we need to calculate the normalising constant for the weights, $Z = \sum_L \omega_L$. As M and S grow this exact sampling method becomes computationally infeasible, as both the calculation of Z and the drawing of a sample from p(·) take $O(M^S)$ time. This fact, along with memory requirements, means that sampling from p(θ) using the exact method quickly becomes impossible.
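The exact two-step method can be sketched as follows for the univariate case. All function names and the example mixtures are illustrative assumptions. The code enumerates all M^S labels, computes each component's mean and weight from the formulas above (evaluating the weight ratio at θ = μ_L, which is valid because the ratio does not depend on θ), and then samples a component followed by a draw from it. The explicit loop over `itertools.product` makes the O(M^S) cost visible.

```python
import itertools
import numpy as np

def normal_logpdf(theta, mu, var):
    """Log density of a univariate N(mu, var) evaluated at theta."""
    return -0.5 * (np.log(2.0 * np.pi * var) + (theta - mu) ** 2 / var)

def product_mixture(weights, means, variances):
    """Enumerate the M**S components of a product of S univariate mixtures.

    weights, means: arrays of shape (S, M); variances: shape (S,), one shared
    variance per mixture. Returns component means, the common component
    variance, and normalised component weights.
    """
    S, M = means.shape
    rows = np.arange(S)
    var_L = 1.0 / np.sum(1.0 / variances)   # identical for every label
    mu_L, log_w = [], []
    for label in itertools.product(range(M), repeat=S):   # M**S labels
        cols = np.array(label)
        mu_s, w_s = means[rows, cols], weights[rows, cols]
        m = var_L * np.sum(mu_s / variances)
        # The weight ratio is constant in theta, so evaluate it at theta = m.
        lw = (np.sum(np.log(w_s) + normal_logpdf(m, mu_s, variances))
              - normal_logpdf(m, m, var_L))
        mu_L.append(m)
        log_w.append(lw)
    log_w = np.array(log_w)
    w = np.exp(log_w - log_w.max())          # subtract max for numerical stability
    return np.array(mu_L), var_L, w / w.sum()  # division by Z

def sample_product(weights, means, variances, size, seed=0):
    rng = np.random.default_rng(seed)
    mu_L, var_L, w = product_mixture(weights, means, variances)
    chosen = rng.choice(len(w), size=size, p=w)        # step 1: pick a component
    return rng.normal(mu_L[chosen], np.sqrt(var_L))    # step 2: draw from it

# Illustrative example with S = 3 mixtures of M = 2 components each.
weights = np.array([[0.3, 0.7], [0.5, 0.5], [0.6, 0.4]])
means = np.array([[-1.0, 1.0], [0.0, 2.0], [-0.5, 0.5]])
variances = np.array([1.0, 0.5, 2.0])
draws = sample_product(weights, means, variances, size=1000)
```

With S = 3 and M = 2 there are only 8 components, but with, say, S = 10 batches of M = 1000 MCMC samples the enumeration is clearly out of reach, which motivates the approximate samplers discussed next.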
In cases where exact sampling from the mixture is infeasible, a number of methods have been proposed. For a review we refer the reader to Ihler et al. (2004). A common choice is a Gibbs sampling style approach. At each iteration, S − 1 of the labels $l_i$ are fixed, while one label, call it $l_j$, is sampled from the corresponding conditional density $p(\theta|l_{-j})$. The notation $l_{-j}$ refers to $\{l_i \mid i \in \{1, \dots, S\}, i \neq j\}$. After a fixed number of new label values have been drawn, a sample is drawn from the mixture component indicated by the current label values. While this approach often produces good results, it can require a large number of samples before it accurately represents the true mixture density due to multimodality. A number of suggestions have been made to improve this standard Gibbs sampling approach, for example using multiscale sampling (Ihler et al., 2004) and parallel tempering (Rudoy and Wolfe, 2007).
2.4 Parametric recombination methods
There are a number of methods proposed to recombine subposterior samples which exactly target the full posterior p(θ|x) when it is Normally distributed. We refer to these methods as parametric. Intuition for why this assumption might be valid for a large class of models comes from the Bernstein-von Mises Theorem (Le Cam, 2012), which is a central limit theorem for Bayesian statistics. Assuming suitable regularity conditions, and that the data is realised from a unique true parameter value θ0, the theorem states that the posterior for the data tends to a Normal distribution centred around θ0. In particular, for large N the posterior is found to be well approximated by N (θ0, I−1(θ0)), where I(θ) is Fisher’s information matrix.
Since we are aiming to efficiently sample from models with large amounts of data, this approximation appears to be particularly relevant.
Neiswanger et al. (2013) propose to combine samples by approximating each subposterior using a Normal distribution, and then using results for products of Gaussians in order to combine these approximations. Let $\hat\mu_s$ and $\hat\Sigma_s$ denote the sample mean and sample variance of the MCMC output for batch s. Then we can approximate the distribution of each subposterior by $\mathcal{N}(\hat\mu_s, \hat\Sigma_s)$. Using (2.2), the full posterior can be estimated by simply multiplying these subposterior estimates together. It follows that the estimate will be multivariate Gaussian with mean $\hat\mu$ and variance $\hat\Sigma$ given by
\[ \hat\Sigma = \left( \sum_{s=1}^{S} \hat\Sigma_s^{-1} \right)^{-1}, \qquad \hat\mu = \hat\Sigma \left( \sum_{s=1}^{S} \hat\Sigma_s^{-1} \hat\mu_s \right). \tag{2.4} \]
Scott et al. (2013) propose a similar method, where samples are combined using averaging. Their method is known as consensus Monte Carlo. Denote the jth sample from subposterior s by $\theta_{sj}$. Suppose each subposterior is assigned a weight denoted by $W_s$ (this is a matrix in the multivariate case). Then the jth draw $\hat\theta_j$ from the consensus approximation to the full posterior is given by
\[ \hat\theta_j = \left( \sum_{s=1}^{S} W_s \right)^{-1} \sum_{s=1}^{S} W_s \theta_{sj}. \]
When each subposterior is Normal, the full posterior is also Normal, and when we set the weights to be the inverse subposterior variances, $W_s = \mathrm{Var}(\theta|x_s)^{-1}$, the draws $\hat\theta_j$ will be exact draws from the full posterior.
The idea is that even when the subposteriors are non-Gaussian, the draw $\hat\theta_j$ will still be a close approximation to the posterior. Scott et al. (2013) suggest using the inverse of the sample variance of each batch as the weights in practice, motivated by the exact result in the Normal case.
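A minimal sketch of the consensus averaging step is given below, assuming the weights are taken to be the inverse sample covariance of each subposterior sample (so that the combination is exact in the Normal case). The function name and the simulated subposterior draws are hypothetical; in practice the draws would come from MCMC runs on each batch.

```python
import numpy as np

def consensus_combine(sub_samples):
    """Combine subposterior draws by weighted averaging.

    sub_samples: list of S arrays of shape (J, d); row j of each array is the
    j-th draw from that subposterior. Weights are assumed to be the inverse
    sample covariance of each subposterior sample.
    """
    W = [np.linalg.inv(np.atleast_2d(np.cov(s, rowvar=False))) for s in sub_samples]
    W_sum_inv = np.linalg.inv(sum(W))
    # Row j of `weighted` holds sum_s W_s theta_sj (as a row vector).
    weighted = sum(s @ Ws.T for s, Ws in zip(sub_samples, W))
    return weighted @ W_sum_inv.T

# Illustrative check with two Gaussian "subposteriors" simulated directly.
rng = np.random.default_rng(0)
sub_samples = [
    rng.normal([0.0, 0.0], 1.0, size=(20000, 2)),
    rng.normal([2.0, 4.0], 1.0, size=(20000, 2)),
]
combined = consensus_combine(sub_samples)
```

In this example the product of the two subposteriors is N((1, 2), 0.5·I), and the combined draws have approximately that mean and covariance. Note that the combination uses one draw from each subposterior per output draw, so the subposterior samples must all have the same length J.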
Key advantages of the two approximations outlined above are that they are fast and relatively quick to converge when models are close to Gaussian. However they only target the full posterior exactly if either each subposterior is Normally distributed, or the size of each batch tends to infinity. Therefore the methods’ performance on non-Gaussian targets should be explored, especially when they are multi-modal, since the methods may conceivably struggle in these cases.
Rabinovich et al. (2015) suggest extending the Consensus Monte Carlo algorithm of Scott et al. (2013) by relaxing the restriction of aggregation using averaging. Suppose we pick a draw from each subposterior, θ1, . . . , θS. Then let us refer to the function used to aggregate these draws as F (θ1, . . . , θS), so in the case of Consensus Monte Carlo we have
\[ F(\theta_1, \dots, \theta_S) = \left( \sum_{s=1}^{S} W_s \right)^{-1} \sum_{s=1}^{S} W_s \theta_s. \]
Rabinovich et al. (2015) suggest trying to adaptively choose the best aggregation function F (.). Motivation for this is that the averaging function used in Scott et al. (2013) is only known to be exact in the case of Gaussian posteriors. In order to adaptively choose F (.), Rabinovich et al. (2015) use variational Bayes. However the method requires the introduction of an optimization step, and it would be interesting to investigate the relative improvement in the approximation in using the method, versus the increase in computation time.
2.5 Nonparametric methods
While the methods outlined above work relatively well when subposteriors are approximately Gaussian, it is not clear how they behave when models are far from Gaussian, or when batch sizes are small. Neiswanger et al. (2013) therefore suggest an alternative method based on kernel density estimation which can be shown to target the full posterior asymptotically, as the number of samples drawn from each subposterior tends to infinity.
Let $x_1, \dots, x_N$ be a sample from a distribution of dimension d with density f. Kernel density estimation is a method for providing an estimate $\hat f$ of the density. The kernel density estimate of f at a point x is
\[ \hat f(x) = \frac{1}{N} \sum_{i=1}^{N} K_H(x - x_i), \]
where H is a d × d symmetric, positive-definite matrix known as the bandwidth and K is the unscaled kernel, which is a symmetric, d-dimensional density. $K_H$ is related to K by $K_H(x) = |H|^{-1/2} K(H^{-1/2} x)$. Commonly the kernel function K is chosen to be Gaussian since it leads to smooth density estimates and it simplifies mathematical analysis (Duong, 2004). The bandwidth is an important factor in determining the accuracy of a kernel density estimate as it controls the smoothing of the estimate.
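The estimator can be written in a few lines. The sketch below assumes a Gaussian kernel, for which K_H is simply the N(0, H) density. The function name `kde_gaussian` and the example bandwidth are illustrative choices, not recommendations.

```python
import numpy as np

def kde_gaussian(points, data, H):
    """Gaussian kernel density estimate with bandwidth matrix H.

    points: (P, d) evaluation locations; data: (N, d) sample; H: (d, d)
    symmetric positive-definite bandwidth. Returns the (P,) density estimates.
    """
    d = data.shape[1]
    H_inv = np.linalg.inv(H)
    norm = 1.0 / np.sqrt((2.0 * np.pi) ** d * np.linalg.det(H))
    diffs = points[:, None, :] - data[None, :, :]               # (P, N, d)
    quad = np.einsum('pnd,de,pne->pn', diffs, H_inv, diffs)     # Mahalanobis terms
    return norm * np.exp(-0.5 * quad).mean(axis=1)              # average of K_H(x - x_i)

rng = np.random.default_rng(0)
data = rng.normal(0.0, 1.0, size=(5000, 1))   # sample from N(0, 1)
H = np.array([[0.04]])                        # bandwidth matrix (kernel std 0.2)
grid = np.linspace(-6.0, 6.0, 241)[:, None]
density = kde_gaussian(grid, data, H)
```

A larger H smooths the estimate more heavily. In this example the estimate at zero is slightly below the true N(0, 1) density there, because the kernel inflates the variance of the estimate from 1 to 1 + 0.04.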
Suppose we have a sample $\{\theta_{m,s}\}_{m=1}^{M}$ from each subposterior s ∈ {1, . . . , S}. Neiswanger et al. (2013) suggest approximating each subposterior using a kernel density estimate with Gaussian kernel and diagonal bandwidth matrix $h^2 I$, where I is the d-dimensional identity matrix. Denote this estimate by $\hat p_s(\theta)$, then we can write it as
\[ \hat p_s(\theta) = \frac{1}{M} \sum_{m=1}^{M} \mathcal{N}(\theta|\theta_{m,s}, h^2 I), \]
where $\mathcal{N}(\cdot|\theta_{m,s}, h^2 I)$ denotes a d-dimensional Gaussian density with mean $\theta_{m,s}$ and variance $h^2 I$.
The estimate for the full posterior $\hat p(\theta|x)$ is then defined to be the product of the estimates for each batch
\[ \hat p(\theta|x) = \prod_{s=1}^{S} \hat p_s(\theta) = \frac{1}{M^S} \prod_{s=1}^{S} \sum_{m=1}^{M} \mathcal{N}(\theta|\theta_{m,s}, h^2 I). \tag{2.5} \]
Therefore the estimate for the full posterior becomes a product of Gaussian mixtures as discussed in Section 2.3. By introducing a similar labelling system $L = (l_1, \dots, l_S)$ with $l_s \in \{1, \dots, M\}$, we can again derive an explicit expression for the resulting mixture. While Neiswanger et al. (2013) use a common variance $h^2 I$ for each kernel, we suggest it might be better to use a diagonal matrix Λ since different parameters may differ considerably in variance. In either case, assuming a common, diagonal variance Λ across the kernel estimates for each batch, the weights in the product (2.5) simplify to
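\[ \omega_L \propto \prod_{s=1}^{S} \mathcal{N}(\theta_{l_s,s}|\bar\theta_L, \Lambda), \qquad \bar\theta_L = \frac{1}{S} \sum_{s=1}^{S} \theta_{l_s,s}. \tag{2.6} \]
The Lth component of the mixture simplifies to $\mathcal{N}(\theta|\bar\theta_L, \Lambda/S)$.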
Given that this method is designed for use with large datasets, the number of components of the resulting Gaussian mixture will be very large. Therefore efficiently sampling from it is an important issue to consider. Neiswanger et al. (2013) recommend sampling from the full posterior estimate using a method similar to the Gibbs sampling approach outlined in Section 2.3. In order to avoid calculating the conditional distribution of the weights, however, they use a Metropolis within Gibbs approach as follows. Holding all labels except the current one, $l_s$, fixed, we propose a new value for $l_s$ uniformly at random. We then accept this new label with probability given by the ratio of the weight of the proposed label to the weight of the current label (or with probability one if this ratio exceeds one). The full algorithm is detailed in Algorithm 1.
Algorithm 1: Combining Batches Using Kernel Density Estimation.
Data: Samples from each subposterior s ∈ {1, . . . , S}, {θ_{m,s}}_{m=1}^{M}.
Result: Sample from an estimate of the full posterior p(θ|x).
Draw an initial label L by simulating l_s ∼ Unif({1, . . . , M}), s ∈ {1, . . . , S}.
for i = 1 to T do
    h ← h(i)
    for s = 1 to S do
        Create a new label C := (c_1, . . . , c_S) and set C ← L
        Draw a new value for index s in C, c_s ∼ Unif({1, . . . , M})
        Simulate u ∼ Unif(0, 1)
        if u < ω_C/ω_L then
            L ← C
        end
    end
    Simulate θ_i ∼ N(θ̄_L, (h²/S) I)
end
Notice that in the algorithm, h is changed as a function of the iteration i. In particular Neiswanger et al. (2013) specify the function $h(i) = i^{-1/(4+d)}$. This causes the bandwidth to decrease at each iteration and is referred to as annealing. The properties of annealing are investigated further in Section 4. In their paper Neiswanger et al. (2013) assume that the number of iterations is the same as the size of the sample from each subposterior. However, this is not necessary. In fact, when we are trying to sample from a mixture with a large number of components, we may need to simulate more times than this in order to ensure the sample accurately represents the true KDE approximation.
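The following is a rough Python sketch of Algorithm 1, written under some assumptions: the bandwidth is h²I with the annealing schedule above, the weights are those of (2.6) computed on the log scale, and the final draw uses the component variance Λ/S = (h²/S)I from Section 2.5. The function names and the simulated subposterior samples are hypothetical and this is not the code used in the simulation study.

```python
import numpy as np

def log_weight(theta_sel, h):
    """Unnormalised log weight (2.6) with Lambda = h^2 I.

    theta_sel: (S, d) array holding the currently selected draw from each
    subposterior. Terms that do not depend on the label are dropped.
    """
    centre = theta_sel.mean(axis=0)
    return -0.5 * np.sum((theta_sel - centre) ** 2) / h**2

def combine_kde(sub_samples, T, seed=0):
    """Metropolis-within-Gibbs sampler over labels; sub_samples has shape (S, M, d)."""
    rng = np.random.default_rng(seed)
    S, M, d = sub_samples.shape
    rows = np.arange(S)
    L = rng.integers(M, size=S)                 # initial label
    out = np.empty((T, d))
    for i in range(1, T + 1):
        h = i ** (-1.0 / (4 + d))               # annealed bandwidth
        current = log_weight(sub_samples[rows, L], h)
        for s in range(S):
            C = L.copy()
            C[s] = rng.integers(M)              # propose a new value for label s
            cand = log_weight(sub_samples[rows, C], h)
            if np.log(rng.uniform()) < cand - current:   # u < w_C / w_L
                L, current = C, cand
        centre = sub_samples[rows, L].mean(axis=0)
        out[i - 1] = centre + (h / np.sqrt(S)) * rng.normal(size=d)
    return out

# Illustrative run: two Gaussian "subposteriors" simulated directly.
rng = np.random.default_rng(1)
subs = np.stack([
    rng.normal(0.5, 1.0, size=(2000, 1)),
    rng.normal(1.5, 1.0, size=(2000, 1)),
])
draws = combine_kde(subs, T=5000)
```

In this example the product of the two subposteriors is N(1, 0.5), and the output draws should be roughly centred at 1 with variance slightly above 0.5, the excess reflecting the smoothing introduced by the kernel. Here T is deliberately larger than M, in line with the remark above that more iterations than subposterior samples may be needed.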
While this algorithm may improve results as models move away from Gaussianity, kernel density estimation is known to perform poorly in high dimensions, so the algorithm will deteriorate as the dimensionality of θ increases. The algorithm suffers from the curse of dimensionality in the number of batches and the size of the MCMC sample simulated from each subposterior. This suggests that as the number of batches increases the accuracy and mixing of the algorithm will be affected. The algorithm also requires the user to choose a bandwidth, so the sensitivity of its performance to different bandwidth choices would be interesting to investigate.
In the original paper by Neiswanger et al. (2013), it is suggested to use a Gaussian kernel with bandwidth $h^2 I$. However, as mentioned earlier, different parameters may have different variances. The algorithm would probably perform better using a more general diagonal matrix Λ, especially as this does not particularly increase the complexity of the algorithm. Using a common bandwidth parameter across batches eases computation; however, it may negatively affect the performance of the algorithm. Note that when discussing products of Gaussian mixtures in Section 2.3, the variances across different mixtures did not need to be assumed common. Therefore further improvements might be made by varying bandwidths across batches, though this would increase computational expense. Finally, improvements could be gained by using more sophisticated methods to sample from the product of kernel density estimates (Ihler et al., 2004; Rudoy and Wolfe, 2007).
A number of developments have been proposed for Algorithm 1. Wang and Dunson (2013) note that the algorithm performs poorly when samples from each subposterior do not overlap. In order to improve this they suggest to smooth each subposterior using a Weierstrass transform, which simply takes the convolution of the density with a Gaussian function. The transformed function can be seen as a smoothed version of the original which tends to increase the overlap between subposteriors. They then approximate the full posterior as a product of the Weierstrass transform of each subposterior. However, since in general the approximation to each subposterior will be empirical, its Weierstrass transform corresponds to a kernel density estimator. Therefore this method, for all intents and purposes, is the same as the original algorithm by Neiswanger et al. (2013), so still suffers from many of the same problems.
An alternative method to improve overlap between the supports of each subposterior is to use heavier tailed kernels in the kernel density estimation. Implementing this however will require some work in order to be able to sample from the resulting product of mixtures, since nice properties for the product of these heavier tailed distributions may not hold. Therefore alternative methods for sampling will need to be developed.
Rather than using kernel density estimation, Wang et al. (2015) use space partitioning methods to partition the space into disjoint subsets, and produce counts of the number of points contained in each of these subsets. This produces an estimate of each subposterior akin to a multi-dimensional histogram. An estimate of the full posterior can then be made by multiplying subposterior estimates together and normalizing. This algorithm helps solve the explosion of mixture components that affects Algorithm 1. Despite this, the algorithm will still suffer when the supports of each subposterior do not overlap. Moreover the algorithm is more complicated to implement and will be affected by the choice of partitioning used.
Alternatively there have been suggestions to introduce suitable metrics which allow summaries of a set of probability measures to be defined. This allows batches to be recombined in terms of these summaries. For example Minsker et al. (2014) use a metric known as the Wasserstein distance in order to define the median posterior from a set of subposteriors. Similarly Srivastava et al. (2015) also use the Wasserstein distance to calculate a summary of the subposteriors known as the barycenter. This allows them to produce an estimate for the full posterior which they refer to as the Wasserstein posterior or WASP.
However, the statistical properties of these measures are unclear and need to be investigated further.
2.6 Semiparametric methods
In order to account for the fact that the nonparametric method Algorithm 1 is slow to converge, Neiswanger et al. (2013) suggest producing a semiparametric estimator (Hjort and Glad, 1995) of each subposterior. This estimator combines the parametric estimator characterised by (2.4) and the nonparametric estimator detailed by Algorithm 1. More specifically, each subposterior is estimated by (Hjort and Glad, 1995)
\[ \hat p_s(\theta) = \hat f_s(\theta)\, \hat r(\theta), \]
where $\hat f_s(\theta) = \mathcal{N}(\theta|\hat\mu_s, \hat\Sigma_s)$ and $\hat r(\theta)$ is a nonparametric estimator of the correction function $r(\theta) = p_s(\theta)/\hat f_s(\theta)$.
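Assuming a Gaussian kernel for $\hat r(\theta)$, Neiswanger et al. (2013) write down an explicit expression for $\hat p_s(\theta)$
\[ \hat p_s(\theta) = \frac{1}{M} \sum_{m=1}^{M} \frac{\mathcal{N}(\theta|\theta_{m,s}, h^2 I)\, \mathcal{N}(\theta|\hat\mu_s, \hat\Sigma_s)}{\hat f_s(\theta_{m,s})} = \frac{1}{M} \sum_{m=1}^{M} \frac{\mathcal{N}(\theta|\theta_{m,s}, h^2 I)\, \mathcal{N}(\theta|\hat\mu_s, \hat\Sigma_s)}{\mathcal{N}(\theta_{m,s}|\hat\mu_s, \hat\Sigma_s)}. \]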
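Similarly to the nonparametric method, we can produce an estimate for the full posterior $\hat p(\theta|x)$ as the product of estimates for each subposterior. Once again this results in a mixture of Gaussians with $M^S$ components. Using the label $L = (l_1, \dots, l_S)$, the Lth mixture weight $W_L$ and component $c_L$ are given by
\[ W_L \propto \frac{\omega_L\, \mathcal{N}\!\left(\bar\theta_L \,\middle|\, \hat\mu, \hat\Sigma + \frac{h}{S} I\right)}{\prod_{s=1}^{S} \mathcal{N}(\theta_{l_s,s}|\hat\mu_s, \hat\Sigma_s)}, \qquad c_L = \mathcal{N}(\theta|\mu_L, \Sigma_L), \]
where $\omega_L$ and $\bar\theta_L$ are as defined in (2.6), and the parameters of the mixture component are
\[ \Sigma_L = \left( \frac{S}{h} I + \hat\Sigma^{-1} \right)^{-1}, \qquad \mu_L = \Sigma_L \left( \frac{S}{h} I\, \bar\theta_L + \hat\Sigma^{-1} \hat\mu \right), \]
where $\hat\Sigma$ and $\hat\mu$ are as defined in (2.4). Sampling from this mixture can be performed by using Algorithm 1, replacing weights and parameters where appropriate.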
As h → 0, the semiparametric component parameters $\Sigma_L$ and $\mu_L$ approach the corresponding nonparametric component parameters. This motivates Neiswanger et al. (2013) to suggest an alternative semiparametric algorithm where the nonparametric component weights $\omega_L$ are used instead of $W_L$. Their reasoning is that the resulting algorithm may have a higher acceptance probability and is still asymptotically exact as the batch size tends to infinity. As in Section 2.5, a bandwidth matrix with identical diagonal elements hI will not necessarily be the best choice for the bandwidth if different dimensions of the parameters have different scales or variances. However the algorithm can easily be extended to using a diagonal bandwidth matrix Λ in a similar way to the nonparametric method.
While this method may solve the problem that the nonparametric method is slow to converge in high dimensions, the performance of the algorithm is not well understood. For example, it is unclear how the algorithm will perform as models move away from Gaussianity, given that it includes this parametric term. Moreover the method still suffers from the curse of dimensionality in terms of the number of mixture components. The method will also be affected by bandwidth choice.
2.7 Conclusion
In this section we outlined batch methods. Batch methods split a large dataset up into smaller subsets, run parallel MCMC on these subsets, and then combine the MCMC output to obtain an approximation to the full posterior. A couple of methods appealed to the Bernstein-von Mises theorem in order to approximate each subposterior by a Normal distribution. The resulting approximation to the full posterior could be found using standard results for products of Gaussians. However these methods are only exact if each subposterior is Normal, or as the number of observations in each batch tends to infinity. Performance of the methods when these assumptions are violated needs to be investigated.
Alternative methods used kernel density estimation or a mixture of a Normal estimate and a kernel density estimate to approximate each subposterior. These estimates could then be combined by using results for the product of mixtures of Gaussians. However the resulting approximation was a mixture of $M^S$ components, which is difficult to sample from efficiently. Moreover kernel density estimation is known to deteriorate as dimensionality increases and requires the choice of a bandwidth.
To conclude, each of the batch methods has either undesirable qualities or properties which are not well understood. These issues need reviewing before the methods can be used with confidence in practice. Batch methods are particularly suited to models which exhibit structure, for example hierarchical models.
3 Stochastic gradient methods
3.1 Introduction
Methods currently employed in large scale machine learning are generally optimization based methods. One method employed frequently in training machine learning models is known as stochastic optimization (Robbins and Monro, 1951). This method is used to optimize a likelihood function in a similar way to traditional gradient ascent. The key difference is that at each iteration rather than using the whole dataset only a subset is used. While the method produces impressive results at low computational cost, it has a number of downsides.
Parameter uncertainty is not captured using this method, since it only produces a point estimate. Though uncertainty can be estimated using a Normal approximation, for more complex models this estimate may be poor. This means models fitted using stochastic optimization can suffer from overfitting. Since the method does not sample from the posterior as in traditional MCMC, the algorithm can get stuck in local maxima.
Methods outlined in this section aim to combine the subsampling approach of stochastic optimization with posterior sampling, which helps capture uncertainty in parameter estimates. The section begins by outlining stochastic optimization, before introducing stochastic gradient Langevin dynamics (SGLD) and stochastic gradient Hamiltonian Monte Carlo (SGHMC), the two key algorithms for big data discussed in this section. Hamiltonian Monte Carlo (HMC), a technique used extensively by SGHMC, is reviewed.
3.2 Stochastic optimization
Let x1, . . . , xN be data observed from a model with probability density function p(x|θ) where θ denotes an unknown parameter vector. Assigning a prior p(θ) to θ, as usual our interest is in the posterior
$$p(\theta|x) \propto p(\theta) \prod_{i=1}^{N} p(x_i|\theta),$$
where we define $p(x|\theta) = \prod_{i=1}^{N} p(x_i|\theta)$ to be the likelihood.
Stochastic optimization (Robbins and Monro, 1951) aims to find the mode θ∗ of the posterior distribution, otherwise known as the MAP estimate of θ. The idea of finding the mode of the posterior rather than the likelihood is that the prior p(θ) regularizes the parameters, meaning it acts as a penalty for model complexity which helps prevent overfitting.
At each iteration t, stochastic optimization takes a subset of the data st and updates the parameters as follows (Welling and Teh, 2011)
$$\theta_{t+1} = \theta_t + \frac{\epsilon_t}{2} \left( \nabla \log p(\theta_t) + \frac{N}{n} \sum_{x_i \in s_t} \nabla \log p(x_i|\theta_t) \right),$$
where εt is the stepsize at iteration t and |st| = n. The idea is that over the long run the noise from using a subset of the data averages out, and the algorithm tends towards standard gradient ascent. Clearly when the number of observations N is large, using only a subset of the data is much less computationally expensive. This is a key advantage of stochastic optimization.
Provided that
$$\sum_{t=1}^{\infty} \epsilon_t = \infty, \qquad \sum_{t=1}^{\infty} \epsilon_t^2 < \infty, \tag{3.1}$$
and p(θ|x) satisfies certain technical conditions, this algorithm is guaranteed to converge to a local maximum.
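As a minimal sketch of the update above, the following Python function runs stochastic gradient ascent on a toy model that is an assumption of this example rather than part of the report: xi ∼ N(θ, 1) with prior θ ∼ N(0, prior_var). The function name `sga_map`, the polynomial stepsize a(b + t)^(−γ) with γ = 0.55 (one choice satisfying (3.1), which holds for 0.5 < γ ≤ 1), and all default values are hypothetical.

```python
import random


def sga_map(data, n, n_iter, a, b=1.0, gamma=0.55, prior_var=100.0, seed=0):
    """Stochastic gradient ascent on the log posterior of a Normal mean.

    Assumed toy model: x_i ~ N(theta, 1), theta ~ N(0, prior_var).
    """
    rng = random.Random(seed)
    N = len(data)
    theta = 0.0
    for t in range(1, n_iter + 1):
        eps = a * (b + t) ** (-gamma)        # decreasing stepsize epsilon_t
        batch = rng.sample(data, n)          # subset s_t of size n
        grad_prior = -theta / prior_var      # gradient of log p(theta)
        grad_lik = (N / n) * sum(x - theta for x in batch)  # rescaled batch gradient
        theta += 0.5 * eps * (grad_prior + grad_lik)
    return theta
```

For this conjugate model the MAP estimate is available in closed form, Σ xi / (N + 1/prior_var), so the output of `sga_map` can be checked directly. Taking a of order 1/N keeps the first few steps stable, since the rescaled gradient grows linearly in N.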
A common extension of stochastic optimization which will be needed later is known as stochastic optimization with momentum. This is commonly employed when the likelihood surface exhibits a particular structure. One example where the method is employed extensively is in the training of deep neural networks. In this case we introduce a variable ν, which is referred to as the velocity of the trajectory. The parameter updates then proceed as follows
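$$\nu_{t+1} = (1 - \alpha)\nu_t + \frac{\eta_t}{2} \left( \nabla \log p(\theta_t) + \frac{N}{n} \sum_{x_i \in s_t} \nabla \log p(x_i|\theta_t) \right),$$
$$\theta_{t+1} = \nu_{t+1} + \theta_t, \tag{3.2}$$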
where α and η are free parameters to be tuned.
While stochastic optimization is used frequently by large scale machine learning practitioners, it does not capture parameter uncertainty since it only produces a point estimate of θ. This means that models fitted using stochastic optimization can often suffer from overfitting and require some form of regularization. One common method to provide an approximation to the true posterior is to fit a Gaussian approximation at the point estimate.
Suppose θ0 is the true mode of the posterior p(θ|x). Then using a second-order Taylor expansion about θ0, we find (Bishop, 2006)
$$\begin{aligned} \log p(\theta|x) &\approx \log p(\theta_0|x) + (\theta - \theta_0)^T \nabla \log p(\theta_0|x) - \frac{1}{2}(\theta - \theta_0)^T H[-\log p(\theta_0|x)](\theta - \theta_0) \\ &= \log p(\theta_0|x) - \frac{1}{2}(\theta - \theta_0)^T H[-\log p(\theta_0|x)](\theta - \theta_0), \end{aligned}$$
where H[g(.)] is the Hessian matrix of the function g(.), and we have used the fact that the gradient of the log posterior at θ0 is 0. The Hessian is taken of the negative log posterior, so that the quadratic term enters with a minus sign.
Let us denote the Hessian H[− log p(θ|x)] := V−1[θ], then taking the exponential of both sides we find
$$p(\theta|x) \approx A \exp\left\{ -\frac{1}{2}(\theta - \theta_0)^T V^{-1}[\theta_0](\theta - \theta_0) \right\},$$
where A is some constant. This is the kernel of a Gaussian density, suggesting an approximation to the posterior of the form N (θ∗, V [θ∗]), where θ∗ is an estimate of the mode to be found. This is often referred to as a Laplace approximation.
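A small worked example may help fix ideas. The sketch below computes a Laplace approximation for a case chosen purely for illustration and not taken from the report: a Beta(a, b) posterior, such as arises from Bernoulli data with a uniform prior, for which the mode and the second derivative of the negative log density are available in closed form. The function name `laplace_beta` is hypothetical.

```python
def laplace_beta(a, b):
    """Laplace approximation N(theta0, V) to a Beta(a, b) density, for a, b > 1.

    log p(theta) = (a - 1) log(theta) + (b - 1) log(1 - theta) + const.
    """
    theta0 = (a - 1.0) / (a + b - 2.0)  # posterior mode
    # Second derivative of the negative log density, evaluated at the mode.
    neg_hessian = (a - 1.0) / theta0 ** 2 + (b - 1.0) / (1.0 - theta0) ** 2
    return theta0, 1.0 / neg_hessian    # mode and approximate variance V
```

With 30 successes and 70 failures under a uniform prior the posterior is Beta(31, 71); the approximate variance returned by `laplace_beta(31, 71)` can then be compared with the exact Beta variance ab / ((a + b)^2 (a + b + 1)) to see how close the Gaussian approximation is at this sample size.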
By the Bernstein-von Mises theorem, this approximation is expected to become increasingly accurate as the number of observations increases. However since the approximation is based only on distributional aspects at one point, it can miss important properties of the distribution (Bishop, 2006). Moreover multimodal distributions will be approximated very poorly. Therefore while the approximation may work well for less complex distributions when plenty of data is available, it may struggle for more complex models. This motivates us to consider methods which aim to combine the computational efficiency of stochastic optimization with the ability to account for parameter uncertainty.
3.3 Hamiltonian Monte Carlo
Hamiltonian dynamics was originally developed as an important reformulation of Newtonian dynamics, and serves as a vital tool in statistical physics. More recently though, Hamiltonian dynamics has been used to produce proposals for the Metropolis-Hastings algorithm which explore the parameter space rapidly and have very high acceptance rates. The acceptance calculations in the Metropolis-Hastings algorithm are computationally intensive when a lot of data is available. However as outlined later, by combining ideas from stochastic optimization and Hamiltonian dynamics, we are able to approximately simulate from the posterior distribution without using an acceptance calculation. In light of this, we review Hamiltonian Monte Carlo, a method which produces efficient proposals for the Metropolis-Hastings algorithm.
3.3.1 Hamiltonian dynamics
Hamiltonian dynamics was traditionally developed to describe the motion of objects under a system of forces. In two dimensions a common analogy used to visualise the dynamics is a frictionless puck sliding over a surface of varying height (Neal, 2010). The state of the system consists of the puck's position θ and its momentum (mass times velocity) r, both of which are 2-dimensional vectors. The state of the system is governed by its potential energy U (θ) and its kinetic energy K(r). If the puck is moving on a flat part of the space, then it will have constant velocity. However as the puck begins to gain height, its kinetic energy decreases and its potential energy increases as it slows. If its kinetic energy reaches zero the puck moves back down the hill, and its potential energy decreases as its kinetic energy increases.
More formally Hamiltonian dynamics is described by a Hamiltonian function H(r, θ), where r and θ are both d-dimensional. The Hamiltonian determines how r and θ change over time as follows
$$\frac{d\theta_i}{dt} = \frac{\partial H}{\partial r_i}, \qquad \frac{dr_i}{dt} = -\frac{\partial H}{\partial \theta_i}. \tag{3.3}$$
Hamiltonian dynamics has a number of properties which are crucial for its use in constructing MCMC proposals. Firstly, Hamiltonian dynamics is reversible, meaning that the mapping from the state (r(t), θ(t)) at time t to the state (r(t + s), θ(t + s)) at time t + s is one-to-one.
A second property is that the dynamics keeps the Hamiltonian invariant or conserved. This can be easily shown using (3.3) as follows
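$$\frac{dH}{dt} = \sum_{i=1}^{d} \left[ \frac{d\theta_i}{dt} \frac{\partial H}{\partial \theta_i} + \frac{dr_i}{dt} \frac{\partial H}{\partial r_i} \right] = \sum_{i=1}^{d} \left[ \frac{\partial H}{\partial r_i} \frac{\partial H}{\partial \theta_i} - \frac{\partial H}{\partial \theta_i} \frac{\partial H}{\partial r_i} \right] = 0.$$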
In order to use Hamiltonian dynamics to simulate from a distribution we need to translate the density function to a potential energy function, and introduce artificial momentum variables to go with these position variables of interest. A Markov chain can then be simulated where at each iteration we resample the momentum variables, simulate Hamiltonian dynamics for a number of iterations, and then perform a Metropolis-Hastings acceptance step with the new variables obtained from the simulation.
In light of this, for Hamiltonian Monte Carlo we generally define the Hamiltonian H(r, θ) to be of the following form
H(r, θ) = U (θ) + K(r),
where θ is the vector we are simulating from and the momentum vector r is constructed artificially. Using the notation in Section 3.2 the potential energy is then defined to be
$$U(\theta) = -\log\left( p(\theta) \prod_{i=1}^{N} p(x_i|\theta) \right) = -\log p(\theta) - \sum_{i=1}^{N} \log p(x_i|\theta). \tag{3.4}$$
The kinetic energy is defined as
$$K(r) = \frac{1}{2} r^T M^{-1} r, \tag{3.5}$$
where M is a symmetric, positive definite mass matrix.
3.3.2 Using Hamiltonian dynamics in MCMC
In order to relate the potential and kinetic energy functions to the distribution of interest, we can use the concept of a canonical distribution. Given some energy function E(x), defined over the state of x, the canonical distribution over the states of x is defined to be
$$P(x) = \frac{1}{Z} \exp\{-E(x)/(k_B T)\}, \tag{3.6}$$
where Z is a normalizing constant, kB is Boltzmann’s constant, and T is defined to be the temperature of the system. The Hamiltonian is an energy function defined over the joint state of r and θ, so that we can write down the joint distribution defined by the function as
P (r, θ) ∝ exp{−H(r, θ)/(kBT )}.
If we now assume the Hamiltonian is of the form described by (3.4), (3.5), and that kBT = 1, then we find that
P (r, θ) ∝ exp{−U (θ)} exp{−K(r)}
∝ p(θ|x)N (r|0, M ).
Hence r and θ are independent under the joint distribution defined by the Hamiltonian, and the marginal distribution of θ is its posterior distribution.
This relationship enables us to describe Hamiltonian Monte Carlo (HMC), which can be used to simulate from continuous distributions whose density can be evaluated up to a normalizing constant. A requirement of HMC is that we can calculate the derivatives of the log of the target density. HMC samples from the joint distribution for (θ, r). Therefore by discarding the samples for r we obtain a sample from the posterior p(θ|x). Generally we choose the components of r (ri) to be independent, each with variance mi. This allows us to write the kinetic energy as
$$K(r) = \sum_{i=1}^{d} \frac{r_i^2}{2 m_i}.$$
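In order to approximate Hamilton's equations computationally, we need to discretize time using a small stepsize ε. There are a number of ways to do this, however in practice the leapfrog method often produces good results. The method works as follows:
1. $r_i(t + \epsilon/2) = r_i(t) - \frac{\epsilon}{2} \frac{\partial U}{\partial \theta_i}(\theta(t))$,
2. $\theta_i(t + \epsilon) = \theta_i(t) + \epsilon \frac{\partial K}{\partial r_i}(r(t + \epsilon/2))$,
3. $r_i(t + \epsilon) = r_i(t + \epsilon/2) - \frac{\epsilon}{2} \frac{\partial U}{\partial \theta_i}(\theta(t + \epsilon))$.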
The leapfrog method has a number of desirable properties, including that it is reversible and volume preserving. An effect of this is that at the acceptance step, the proposal distributions cancel, so that the acceptance probability is simply a ratio of the canonical distributions at the proposed and current states. Since we must discretize the equations in order to simulate from them, the posterior p(θ|x) is not invariant under the approximate dynamics. This is why the acceptance step is required, as it corrects for this error. As the stepsize ε tends to zero, the acceptance rate of the leapfrog method tends to 1 as the approximation moves closer to true Hamiltonian dynamics.
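The three leapfrog updates can be sketched in a few lines of Python. This is an illustrative one-dimensional version under assumptions of our own: a scalar mass m, so that ∂K/∂r = r/m, at least one step (L ≥ 1), and the two half-step momentum updates between consecutive steps merged into a single full step. The function name `leapfrog` and its signature are hypothetical.

```python
def leapfrog(theta, r, grad_U, eps, L, m=1.0):
    """Run L >= 1 leapfrog steps of size eps from the state (theta, r).

    grad_U is a function returning dU/dtheta; m is a scalar mass.
    """
    r = r - 0.5 * eps * grad_U(theta)      # initial half step for momentum
    for step in range(L):
        theta = theta + eps * r / m        # full step for position
        if step < L - 1:
            r = r - eps * grad_U(theta)    # two merged half steps for momentum
    r = r - 0.5 * eps * grad_U(theta)      # final half step for momentum
    return theta, r
```

Two properties mentioned above can be checked numerically with this sketch: for a small stepsize the Hamiltonian changes very little along the trajectory, and negating the momentum and integrating again returns the starting state up to rounding error.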
Now that we have outlined how to approximate the Hamiltonian equations, we can outline Hamiltonian Monte Carlo. HMC is performed in two steps as follows:
1. Simulate new values for the momentum variables r ∼ N (0, M ).
2. Simulate Hamiltonian dynamics for L steps with stepsize ε using the leapfrog method.
The momentum variables are then negated, and the new state (θ∗, r∗) is accepted with probability
$$\min\left\{1, \exp\{H(\theta, r) - H(\theta^*, r^*)\}\right\}.$$
3.3.3 Developments in HMC and tuning
HMC allows the state space to be explored rapidly and has high acceptance rates. However in order to gain these benefits, we need to ensure that L and ε are properly tuned. Generally it is recommended to try trial values for L and ε, and to use traceplots and autocorrelation plots to assess how quickly the resulting algorithm converges and how well it explores the state space. The presence of multiple modes can be an issue for HMC, and requires special treatment (Neal, 2010). Therefore it is recommended that the algorithm is run from different starting points to check whether multimodality is present.
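To make the roles of L and the stepsize concrete, the two-step HMC procedure of Section 3.3.2 can be sketched as below. This is a one-dimensional illustration under assumptions of our own (a scalar mass m, user-supplied functions `U` and `grad_U`, and a returned acceptance rate for use in tuning); the function name `hmc` is hypothetical and the sketch is not intended as a production sampler.

```python
import math
import random


def hmc(U, grad_U, theta0, n_samples, eps, L, m=1.0, seed=0):
    """One-dimensional HMC with a leapfrog integrator (illustrative sketch)."""
    rng = random.Random(seed)
    theta, samples, n_accept = theta0, [], 0
    for _ in range(n_samples):
        # Step 1: resample the momentum, r ~ N(0, m).
        r = rng.gauss(0.0, math.sqrt(m))
        # Step 2: L leapfrog steps with stepsize eps.
        th, rr = theta, r - 0.5 * eps * grad_U(theta)
        for step in range(L):
            th += eps * rr / m
            if step < L - 1:
                rr -= eps * grad_U(th)
        rr -= 0.5 * eps * grad_U(th)
        rr = -rr  # negate the momentum so that the proposal is symmetric
        # Accept with probability min{1, exp(H(theta, r) - H(theta*, r*))}.
        h_old = U(theta) + 0.5 * r * r / m
        h_new = U(th) + 0.5 * rr * rr / m
        if rng.random() < math.exp(min(0.0, h_old - h_new)):
            theta, n_accept = th, n_accept + 1
        samples.append(theta)
    return samples, n_accept / n_samples
```

For a standard Normal target one would pass `U = lambda th: 0.5 * th * th` and `grad_U = lambda th: th`. Rerunning with different values of `eps` and `L` and comparing the returned acceptance rate and the autocorrelation of the samples is one simple way to carry out the trial-and-error tuning described above.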
Suppose we have an estimate of the variance matrix for θ. If the variables appear to be correlated then HMC may not explore the parameter space effectively. One way to improve the performance of HMC in this case is to set $M = \hat{\Sigma}^{-1}$, where $\hat{\Sigma}$ is our estimate of Var(θ|x).
The selection of the stepsize ε is very important in HMC, since selecting a size that is too big will result in a low acceptance rate, while selecting a size that is too small will result in slow exploration of the space. Selecting ε too large can be particularly problematic as it can cause instability in the Hamiltonian error, which leads to very low acceptance. In situations where the mass matrix M is the identity matrix, the stability limit for ε is determined by the width of the distribution in its most constrained direction. For a Gaussian distribution, this is the square root of the smallest eigenvalue of the covariance matrix for θ.
The value of L is also an important quantity to choose when tuning the HMC algorithm.
Selecting L too small will mean the HMC explores the space with inefficient random walk behaviour as the next state will still be correlated with the previous state. On the other hand selecting L too large will waste computation and lower acceptance rates.
There have been a number of important developments to HMC. Girolami and Calderhead (2011) introduced Riemannian Manifold Hamiltonian Monte Carlo, which simulates HMC in a Riemannian space rather than a Euclidean one. This effectively enables the use of position-dependent mass matrices M. As a result, the algorithm samples more efficiently from distributions where parameters of interest exhibit strong correlations. More recently, Hoffman and Gelman (2014) developed the 'No U-turn Sampler', which enables the automatic and adaptive tuning of the stepsize ε and the trajectory length L. This is an important development since the tuning of HMC algorithms is a non-trivial task.
Alternative methods to the leapfrog method for simulating Hamiltonian dynamics have been developed. These enable us to handle constraints on the variables, or to exploit partially analytic solutions (Neal, 2010). As mentioned earlier, HMC can have considerable difficulty moving between the modes of a distribution. A number of schemes have been developed to solve this problem including tempered transitions (Neal, 1996) and annealed importance sampling (Neal, 2001).
3.4 Stochastic gradient Langevin Monte Carlo
A special case of HMC, known as Langevin Monte Carlo, arises when we use only a single leapfrog step to propose a new state. Its name comes from its similarity to the theory of Langevin dynamics in physics. Welling and Teh (2011) noticed that the discretized form of Langevin Monte Carlo has a comparable structure to that of stochastic optimization, outlined in Section 3.2. This motivated them to develop an algorithm based on Langevin Monte Carlo, which uses only a subsample of the dataset to calculate the gradient of the potential energy ∇U. They show that by using a stepsize that decreases with time, the algorithm will smoothly transition from stochastic gradient descent to sampling approximately from the posterior distribution, without the need for an acceptance step. This result, along with the fact that only a subsample of the data is used at each iteration, means that the algorithm is scalable to large datasets.
3.4.1 Stochastic gradient Langevin Monte Carlo
Langevin Monte Carlo arises from HMC when we use only one leapfrog step in generating a new state (r, θ). In this case we can remove any explicit mention of momentum variables and propose a new value for θ as follows (Neal, 2010)
$$\theta_{t+1} = \theta_t - \frac{a^2}{2} \frac{\partial U}{\partial \theta} + \eta,$$
where η ∼ N (0, a2) and a is some constant. Using our particular expression of the potential energy (3.4), we can write
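$$\begin{aligned} \theta_{t+1} &= \theta_t + \frac{\epsilon}{2} \left( \nabla \log p(\theta_t) + \sum_{i=1}^{N} \nabla \log p(x_i|\theta_t) \right) + \eta \\ &= \theta_t - \frac{\epsilon}{2} \nabla U(\theta_t) + \eta, \end{aligned} \tag{3.7}$$
where ε = a2.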
While being a special case of Hamiltonian Monte Carlo, the properties of Langevin dynamics are somewhat different. We cannot typically set a very large, so the state space is normally explored much more slowly than when using HMC. The proposal for Langevin Monte Carlo is a particular discretization of a stochastic differential equation (SDE) known as Langevin dynamics. Writing this discretization as an SDE we obtain
$$d\theta = -\frac{1}{2} \nabla U(\theta)\,dt + dW = -\frac{1}{2} \nabla U(\theta)\,dt + N(0, dt), \tag{3.8}$$
where W is a Wiener process and we have informally written dW as N (0, dt). A Wiener process is a stochastic process with the following properties:
1. W (0) = 0 with probability 1;
2. W (t + h) − W (t) ∼ N (0, h) and is independent of W (τ ) for τ ≤ t.
It can be shown that, under certain conditions, the posterior distribution p(θ|x) is the stationary distribution of (3.8). This motivates the Metropolis-adjusted Langevin algorithm (MALA), which uses (3.7) as a proposal for the Metropolis-Hastings algorithm.
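A minimal one-dimensional sketch of MALA is given below, assuming the Langevin proposal θ − (ε/2)∇U(θ) + η with η ∼ N(0, ε). Unlike in HMC, the proposal here is not symmetric, so the Metropolis-Hastings ratio includes the forward and reverse proposal densities. The function name `mala` and its arguments are hypothetical, and `U` and `grad_U` are assumed to be supplied by the user.

```python
import math
import random


def mala(U, grad_U, theta0, n_samples, eps, seed=0):
    """Metropolis-adjusted Langevin algorithm in one dimension (sketch)."""
    rng = random.Random(seed)

    def log_q(to, frm):
        # Log proposal density (up to a constant) of moving from frm to `to`.
        mean = frm - 0.5 * eps * grad_U(frm)
        return -((to - mean) ** 2) / (2.0 * eps)

    theta, samples = theta0, []
    for _ in range(n_samples):
        noise = rng.gauss(0.0, math.sqrt(eps))            # eta ~ N(0, eps)
        prop = theta - 0.5 * eps * grad_U(theta) + noise  # Langevin proposal
        log_alpha = (U(theta) - U(prop)
                     + log_q(theta, prop) - log_q(prop, theta))
        if rng.random() < math.exp(min(0.0, log_alpha)):
            theta = prop
        samples.append(theta)
    return samples
```

The acceptance step is what removes the discretization bias; dropping it and shrinking the stepsize instead is the route taken by the stochastic gradient variant discussed in this section.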
When there are a large number of observations available, ∇U (θ) is expensive to calculate at each iteration, since it requires the evaluation of the log likelihood gradient. Welling and Teh (2011) therefore suggest introducing an unbiased estimator of ∇U (θ) which uses only a subset st of the data at each iteration. The estimator ∇ ˜U (θ) is given as follows
$$\nabla \tilde{U}(\theta) = -\nabla \log p(\theta) - \frac{N}{n} \sum_{x_i \in s_t} \nabla \log p(x_i|\theta). \tag{3.9}$$
We can then write
$$\nabla \tilde{U}(\theta) = \nabla U(\theta) + \nu, \tag{3.10}$$
where ν is a noise term, with mean zero since the estimator is unbiased, which we refer to as the stochastic gradient noise.
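The unbiasedness of the subsampled gradient can be checked numerically. The sketch below uses a toy model assumed only for this example, xi ∼ N(θ, 1) with prior θ ∼ N(0, prior_var), and compares the full-data gradient of U with the subsample estimator in (3.9). The function names `grad_U` and `grad_U_tilde` are hypothetical.

```python
import random


def grad_U(theta, data, prior_var=100.0):
    """Full-data gradient of the potential energy U for the toy Normal model."""
    return theta / prior_var - sum(x - theta for x in data)


def grad_U_tilde(theta, data, n, rng, prior_var=100.0):
    """Subsample estimator of grad U based on a random batch of size n."""
    N = len(data)
    batch = rng.sample(data, n)  # subset s_t drawn without replacement
    return theta / prior_var - (N / n) * sum(x - theta for x in batch)
```

Averaging `grad_U_tilde` over many independent batches at a fixed θ should approach `grad_U`, while the spread of the individual estimates around that average is the stochastic gradient noise ν.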
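Using this estimator in place of ∇U (θ) in a Langevin Monte Carlo update we obtain the following
$$\begin{aligned} \theta_{t+1} &= \theta_t + \frac{\epsilon}{2} \left( \nabla \log p(\theta_t) + \frac{N}{n} \sum_{x_i \in s_t} \nabla \log p(x_i|\theta_t) \right) + \eta \\ &= \theta_t - \frac{\epsilon}{2} \nabla U(\theta_t) - \frac{\epsilon}{2} \nu_t + \eta. \end{aligned} \tag{3.11}$$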
If we assume that the stochastic gradient noise νt has variance V (θt), then the term (ε/2)νt has variance (ε²/4)V (θt). Therefore for small ε, the injected noise η, which has variance ε, will dominate. As we send ε → 0, (3.11) will approximate Langevin dynamics and sample approximately from p(θ|x), without the need for an acceptance step.
This result motivates Welling and Teh (2011) to suggest an algorithm that uses (3.11) to update θt, but decreases the stepsize to 0 as the number of iterations t increases. This leads to the SGLD update
$$\theta_{t+1} = \theta_t + \frac{\epsilon_t}{2} \left( \nabla \log p(\theta_t) + \frac{N}{n} \sum_{x_i \in s_t} \nabla \log p(x_i|\theta_t) \right) + \eta_t, \tag{3.12}$$
where ηt ∼ N (0, εt).
Noting the similarity between (3.12) and stochastic optimization, they suggest decreasing εt according to the conditions (3.1) to ensure that the noise in the stochastic gradients averages out. The result is an algorithm that transitions smoothly between stochastic gradient descent and approximately sampling from the posterior using an increasingly accurate discretization of Langevin dynamics. Since the stepsize must decrease to zero, the mixing rate of the algorithm will slow as the number of iterations increases. Putting this all together we outline the full SGLD procedure in Algorithm 2.
Algorithm 2: Stochastic gradient Langevin dynamics (SGLD).
Input: Initial estimate θ1, stepsize function ε(t), subsample size |st| = n, likelihood and prior gradients ∇ log p(x|θ) and ∇ log p(θ).
Result: Approximate sample from the full posterior p(θ|x).
for t = 1 to T do
    ε ← ε(t)
    Sample st from full dataset x
    η ∼ N (0, ε)
    θ ← θ + (ε/2) ( ∇ log p(θ) + (N/n) Σ_{xi∈st} ∇ log p(xi|θ) ) + η
    if ε small enough then
        Store θ as part of the sample
    end
end
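Algorithm 2 can be sketched in Python as follows. The example again assumes a toy model that is not part of the report, xi ∼ N(θ, 1) with prior θ ∼ N(0, prior_var), a polynomial stepsize a(b + t)^(−γ), and a hypothetical threshold `eps_store` standing in for the "ε small enough" condition. The function name `sgld` and every default value are illustrative choices only.

```python
import math
import random


def sgld(data, n, T, a, b, gamma, eps_store,
         theta0=0.0, prior_var=100.0, seed=0):
    """SGLD for the mean of a Normal model (illustrative sketch).

    Returns the stored samples and the stepsizes at which they were drawn.
    """
    rng = random.Random(seed)
    N = len(data)
    theta = theta0
    samples, stepsizes = [], []
    for t in range(1, T + 1):
        eps = a * (b + t) ** (-gamma)             # stepsize function eps(t)
        batch = rng.sample(data, n)               # subsample s_t
        grad = (-theta / prior_var                # gradient of log prior
                + (N / n) * sum(x - theta for x in batch))
        eta = rng.gauss(0.0, math.sqrt(eps))      # injected noise, variance eps
        theta = theta + 0.5 * eps * grad + eta
        if eps < eps_store:                       # "eps small enough"
            samples.append(theta)
            stepsizes.append(eps)
    return samples, stepsizes
```

Because the toy posterior is available in closed form, the mean of the stored samples can be compared with the exact posterior mean. The stepsizes are returned alongside the samples so that stepsize-weighted averages can be formed afterwards.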
3.4.2 Discussion and tuning
Teh et al. (2014) study SGLD theoretically and show that, given regularity conditions, estimators derived from an SGLD sample are consistent and satisfy a central limit theorem. They reveal that for polynomial stepsizes of the form $\epsilon_t = a(b + t)^{-\alpha}$, the optimal choice of α is 1/3. The rate of convergence of SGLD is shown to be $T^{-1/3}$, where T is the number of iterations of SGLD. This is slower than the traditional Monte Carlo rate of $T^{-1/2}$, and is due to the decreasing stepsizes.
In tuning the algorithm the key constants that need to be chosen are those used in the stepsize, a and b, and the subsample size n. To avoid divergence it is important to keep the stochastic gradient noise under control, especially as N gets large. This can be done in two ways. One is to increase the subsample size n, another is to keep the stepsize small. However in order to keep the algorithm efficient the subsample size needs to be kept relatively small; Welling and Teh (2011) suggest keeping it in the hundreds. Therefore the main constant that needs to be considered in tuning is a. If a is set too large, the stochastic gradient noise dominates for too long and the algorithm never moves to posterior sampling. If a is set too small, however, the parameter space is not explored efficiently enough.
Problems with this method include that the stepsizes must decrease to zero so that the acceptance step is not needed. However this means the mixing rate of the algorithm will slow down as the number of iterations increases. There are a few ways around this. One is to stop decreasing the stepsize once it falls below a threshold at which the rejection rate is negligible, however in this case the posterior will still be explored slowly. The other is to use this algorithm initially for burn-in, then switch later to an alternative MCMC method which is more efficient. However both these solutions require significant hand-tuning beforehand. The decelerating mixing rate also makes it less clear how the algorithm compares to other samplers: while it requires only a fraction of the dataset per iteration, this is offset by the fact that more iterations are required to reach the accuracy of other samplers (Bardenet et al., 2015).
Another problem with the method is that it often explores the state space inefficiently.
This is because Langevin dynamics explores the state space less efficiently than more general HMC. This is motivation for stochastic gradient HMC (Chen et al., 2014) which is discussed in Section 3.5.
Note that similar to HMC, certain parameters may have a much higher variance than others. In this case we can use a preconditioning matrix M to bring all the parameters onto a similar scale, allowing the algorithm to explore the space more efficiently. The algorithm including preconditioning can simply be written as
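$$\theta_{t+1} = \theta_t + \frac{\epsilon_t}{2} M \left( \nabla \log p(\theta_t) + \frac{N}{n} \sum_{x_i \in s_t} \nabla \log p(x_i|\theta_t) \right) + \eta_t,$$
where ηt ∼ N (0, εtM ).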
Provided the size of the subset n is large enough, we can use the central limit theorem to approximate V (θt) by its empirical covariance
$$V(\theta_t) \approx \frac{N^2}{n^2} \sum_{x_i \in s_t} (y(x_i, \theta_t) - \bar{y}(\theta_t))(y(x_i, \theta_t) - \bar{y}(\theta_t))^T = \frac{N^2}{n} V_s, \tag{3.13}$$
where $y(x_i, \theta_t) = \nabla \log p(x_i|\theta_t) + \frac{1}{N} \nabla \log p(\theta_t)$ and $\bar{y}(\theta_t) = \frac{1}{n} \sum_{x_i \in s_t} y(x_i, \theta_t)$. From (3.13) we determine that the variance of a stochastic gradient step can be estimated by $\frac{\epsilon_t^2 N^2}{4n} M V_s M$ (Welling and Teh, 2011), so that for the injected noise to dominate, denoting the largest eigenvalue of M VsM by λ, we require
$$\alpha = \frac{\epsilon_t N^2}{4n} \lambda \ll 1.$$
Therefore using the fact that the Fisher information I ≈ N Vs, and that the posterior variance Σθ ≈ I−1 for large N, we can find the approximate stepsize at which the injected noise will dominate. Denoting the smallest eigenvalue of Σθ by λθ, the stepsize can be given by $\epsilon_t \approx \frac{4 \alpha n}{N} \lambda_\theta$. This stepsize is generally small.
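As a rough worked example, under assumptions that are not part of the report: take the scalar model xi ∼ N(θ, 1) with M = 1, so that the per-observation score has variance Vs ≈ 1 and the posterior variance is approximately 1/N, giving λθ ≈ 1/N. The values α = 0.1, n = 100 and N = 10^4 are arbitrary illustrative choices.

```latex
\epsilon_t \approx \frac{4 \alpha n}{N} \lambda_\theta
           \approx \frac{4 \alpha n}{N^2}
           = \frac{4 \times 0.1 \times 100}{(10^4)^2}
           = 4 \times 10^{-7}.
```

The same value follows from solving α = εt N² λ / (4n) for εt with λ ≈ 1, which illustrates how quickly the admissible stepsize shrinks as N grows in this simple setting.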
Suppose we have a sample θ1, . . . , θT which is output from the algorithm. Since the mixing of the algorithm decelerates, standard Monte Carlo estimates will overemphasize parts of the sample where the stepsize is small. This increases the variance of the estimate, though it remains consistent. Therefore Welling and Teh (2011) suggest instead to use the estimate
$$E(f(\theta)) \approx \frac{\sum_{t=1}^{T} \epsilon_t f(\theta_t)}{\sum_{t=1}^{T} \epsilon_t},$$
which is also consistent.
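The stepsize-weighted estimate is straightforward to compute once the stepsizes have been stored alongside the samples. The helper below is a minimal sketch; the name `weighted_estimate` and its signature are hypothetical.

```python
def weighted_estimate(samples, stepsizes, f=lambda theta: theta):
    """Stepsize-weighted Monte Carlo estimate of E[f(theta)].

    samples and stepsizes are equal-length sequences holding theta_t
    and the corresponding stepsize epsilon_t.
    """
    numerator = sum(eps * f(theta) for theta, eps in zip(samples, stepsizes))
    return numerator / sum(stepsizes)
```

When all stepsizes are equal the weights cancel and the estimate reduces to the ordinary sample average, so the weighting only matters for the decreasing-stepsize schedules discussed in this section.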
3.4.3 Further developments
A number of extensions to the original SGLD algorithm by Welling and Teh (2011) have been suggested. Ahn et al. (2012) aim to improve the mixing of the algorithm by appealing