Chapter 5 Inference for Models with Cubic Drift and Linear Diffu-
5.5 GPU Computing
In this section we describe the implementation of Algorithms 4.2 and 4.1 for sampling parameters and missing data on a Graphics Processing Unit (GPU) using NVIDIA's CUDA parallel programming environment. Statisticians have recently realised that massively parallel computation on GPUs can substantially reduce computational time. The GPU, which developed in response to the increasing computational demands of graphics rendering in the gaming and video editing industries, is specialised for Single Instruction Multiple Data (SIMD) tasks. Compared with a conventional CPU it devotes more transistors to arithmetic instructions and fewer to data caching and flow control, and so is suited to algorithms with high arithmetic intensity and few branching statements or memory accesses.
[Figure 5.16: eight density panels, one for each of the parameters A1,7, A1,8, A1,9, A1,10, A2,7, A2,8, A2,9 and A2,10; vertical axis: Density.]
Figure 5.16: Marginal distributions of the cubic parameters inferred for a two-dimensional model of the form of Eq. 5.1. A data set with N = 1,000 observations at interval ∆ = 0.1 was used. The diffusion parameters were fixed and there was no missing data. The blue histogram shows the parameters that gave stable solutions to the SDE, while the mauve histogram shows those that gave unstable solutions. The purple shows the overlap between the two regions of the marginal distributions. The true values are given by the red lines.
The GPU consists of a number of Streaming Multiprocessors (SMs), each of which has a limited amount of fast on-chip shared memory. An SM can perform an operation on 32 threads simultaneously whilst holding the remaining threads in memory. For compatibility with different GPU configurations, the CUDA programming model groups threads into thread blocks. Currently a single thread block can contain up to 1024 threads. Each block is sent to an SM, which executes instructions on groups of 32 threads simultaneously. Switching execution between these groups of threads hides the latency associated with memory request operations.
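The block and 32-thread group arithmetic described above can be sketched as follows. This is plain Python rather than CUDA, and the function name `launch_shape` and its default values are illustrative assumptions, not part of the thesis code.

```python
def launch_shape(n_threads, threads_per_block=1024, warp_size=32):
    """Return (blocks, groups_per_block) needed to cover n_threads threads.

    threads_per_block and warp_size default to the figures quoted in the
    text (1024 threads per block, 32 threads executed together).
    """
    if n_threads <= 0:
        raise ValueError("n_threads must be positive")
    # Ceiling division: the last block may be only partly used.
    blocks = (n_threads + threads_per_block - 1) // threads_per_block
    # Number of threads in the fullest block of the launch.
    threads_in_block = min(n_threads, threads_per_block)
    groups_per_block = (threads_in_block + warp_size - 1) // warp_size
    return blocks, groups_per_block


shape_small = launch_shape(100)    # fits in a single block
shape_large = launch_shape(5000)   # needs several blocks
print(shape_small, shape_large)
```

A launch that fits in one block keeps all of its threads within reach of the same shared memory, which is the situation the rest of this section relies on.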
Thread blocks are organised into a grid. Whilst threads within a block share a limited amount of fast on-chip memory, blocks within a grid share access only to global device memory, which is relatively slow, with latency of the order of 100 clock cycles. Threads within a block can be synchronised and communication between them is fast. Blocks communicate by transferring data through the CPU, which is slow.
Statisticians considering implementing their algorithms on a GPU should take these hardware factors into account when designing their parallel code. Another consideration is the use of single- or double-precision floating-point arithmetic. GPUs were originally designed for single precision, but with the recent demand for general-purpose GPU computation more recent models also support double precision. However, single precision remains at least 3-4 times faster, although this gap may narrow in the future.
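To illustrate what is at stake in the choice of precision, the sketch below accumulates the same sum in double precision and in emulated single precision. It is an assumption-laden toy: single precision is emulated by rounding every partial sum through Python's standard `struct` module, and the helper names `to_single` and `accumulate` are hypothetical.

```python
import struct


def to_single(x):
    """Round a double-precision float to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def accumulate(value, count, single=False):
    """Add `value` to a running total `count` times.

    With single=True the term and every partial sum are rounded to single
    precision, mimicking arithmetic carried out in 32-bit floats.
    """
    term = to_single(value) if single else value
    total = 0.0
    for _ in range(count):
        total = total + term
        if single:
            total = to_single(total)
    return total


exact = 10000.0
err_double = abs(accumulate(0.1, 100_000) - exact)
err_single = abs(accumulate(0.1, 100_000, single=True) - exact)
print(err_double, err_single)
```

Long running sums, such as a log likelihood accumulated over many intervals, are where the loss of accuracy in single precision is most likely to show.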
When converting a statistical algorithm for massively parallel computation one should consider how to decompose the problem into identical operations that can be performed with little dependence between them. Many data-intensive applications in statistics are amenable to this sort of alteration. For example, Suchard et al. [2010] demonstrate the gains of using a GPU on a Bayesian mixture problem. Given a Gaussian mixture density they estimate the mean, variance and weight of each component. The inference algorithm is simplified by a data augmentation strategy which structures the problem so that it can be solved by Gibbs sampling. Each data point is assigned a configuration variable, and at each stage of the algorithm the posterior configuration probabilities are computed. With a lot of data and many mixture components the number of configuration probabilities becomes very large. They implement a fine-grained parallelisation strategy in which each data point-configuration pair is assigned a dedicated thread. They describe their choice of execution plan, which optimises the use of shared memory and minimises the latency associated with transfers between global and shared memory. Given that the amount of shared memory is only 16KB, they describe an efficient method for transfers to global memory that coalesces transactions into multiples of 16. After considering these hardware details they report a 120-fold speed-up over the standard algorithm implemented on a single CPU.
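The fine-grained decomposition described above can be sketched serially as follows. This is not the implementation of Suchard et al.; the function names and the toy data are assumptions, and each term in the inner list stands for the work that one dedicated thread would do.

```python
import math


def normal_pdf(x, mean, var):
    """Density of N(mean, var) evaluated at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def configuration_probabilities(data, weights, means, variances):
    """Posterior probability that each data point belongs to each component."""
    table = []
    for x in data:
        # One independent term per (data point, component) pair:
        # on a GPU each of these would be computed by its own thread.
        terms = [w * normal_pdf(x, mu, v)
                 for w, mu, v in zip(weights, means, variances)]
        norm = sum(terms)  # small reduction over the components of one point
        table.append([t / norm for t in terms])
    return table


# Hypothetical two-component mixture and three data points.
probs = configuration_probabilities(
    data=[-2.1, 0.1, 3.9],
    weights=[0.5, 0.5],
    means=[-2.0, 4.0],
    variances=[1.0, 1.0],
)
print(probs)
```

The number of independent terms is the number of data points times the number of components, which is why this problem offers so much parallelism.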
Lee et al. [2010] describe various Monte Carlo methods in which the sampling itself is parallelised, rather than the computation over the data as in the previous example. They show how easy it is to implement an importance sampling algorithm on the GPU, with the importance weight of each sample computed by a single thread. They note that the standard Metropolis-Hastings algorithm gains little from parallelisation as it is inherently serial, although population MCMC and particle samplers work well. They split a population MCMC algorithm so that each thread samples a different distribution, with reversible swaps between chains. They have thousands of chains simulating from tempered distributions, with only a single chain sampling the target density. Applied to a mixture model, they show improved mixing of the chain between widely separated modes. They also demonstrate a Sequential Monte Carlo algorithm where, as in the importance sampling example, each thread updates the weight of a single particle. The authors note that there was little reduction in accuracy from using only single precision. For large numbers of Monte Carlo samples they report a speed increase of approximately 280-fold over the CPU implementation.
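The importance sampling case can be sketched as follows, with one weight per sample standing in for one thread per sample. The target N(1, 1) and proposal N(0, 2^2) are hypothetical choices made only for illustration, and the code is a serial Python stand-in, not the code of Lee et al.

```python
import math
import random


def log_normal_pdf(x, mean, sd):
    """Log density of N(mean, sd^2) evaluated at x."""
    return -0.5 * math.log(2.0 * math.pi * sd * sd) - 0.5 * ((x - mean) / sd) ** 2


def log_weight(x):
    """Work of one 'thread': the log importance weight of a single sample."""
    # Hypothetical target N(1, 1) and proposal N(0, 2^2).
    return log_normal_pdf(x, 1.0, 1.0) - log_normal_pdf(x, 0.0, 2.0)


def importance_mean(samples):
    """Self-normalised importance sampling estimate of the target mean."""
    log_w = [log_weight(x) for x in samples]  # independent per sample
    shift = max(log_w)                        # stabilise the exponentials
    w = [math.exp(lw - shift) for lw in log_w]
    # The two sums are the only steps that need communication (a reduction).
    return sum(wi * xi for wi, xi in zip(w, samples)) / sum(w)


random.seed(1)
samples = [random.gauss(0.0, 2.0) for _ in range(20000)]
estimate = importance_mean(samples)
print(estimate)
```

The weights need no communication between samples, which is what makes the method map so directly onto threads.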
The inference procedure in this thesis transfers naturally to a GPU implementation. The Markov nature of SDE data implies that the data set can be divided into blocks that are independent given the observations at their end points. In our implementation each thread is responsible for a single observation interval. The imputed data within that interval are sampled by a single thread using the independence sampler proposal. Each thread has an ID and uses this to reference its section of the data.
The algorithm is split into two steps: first the update of the missing data and second the sampling of the parameters. With perfect observation of the process, each thread in the first step can run without communicating with the threads responsible for neighbouring data intervals. If there is observation error then the data at the observation times need to be passed between threads, causing a potential bottleneck in this step of the algorithm. We only consider the case of perfect observation.
The sampling of parameters is a global operation as it involves all of the data in the likelihood function. However, again due to the Markov property, each thread can compute the likelihood for a single data block. When this is done the threads need to synchronise before all the values can be added to form a single likelihood value. This is an example of a parallel reduction algorithm and is computed using a tree structure. Each even-numbered thread receives a value from its neighbouring thread and adds it to its own. Then every fourth thread adds the value held two threads along, and so on until there is just a single likelihood value. This is then added to the prior, which can be computed by a single thread. The pseudo code for the update of missing data and parameters using the innovation scheme is given in Algorithm 5.1.
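The tree-structured sum described above can be simulated serially as in the sketch below. It is written in Python rather than CUDA, the name `tree_reduce` and the toy values are assumptions, and the length is assumed to be a power of two so that every thread has a partner at each level.

```python
def tree_reduce(values):
    """Serial simulation of a block-level tree reduction."""
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a positive power of two")
    partial = list(values)
    stride = 1
    while stride < n:
        # On a GPU every active thread performs its addition at once,
        # followed by a synchronisation barrier before the stride doubles.
        for tid in range(0, n, 2 * stride):
            partial[tid] += partial[tid + stride]
        stride *= 2
    return partial[0]  # the total ends up with thread 0


# Hypothetical per-interval log likelihood contributions, one per thread.
block_loglik = [-1.5, -0.25, -2.0, -0.75, -1.0, -0.5, -3.0, -1.0]
total = tree_reduce(block_loglik)
print(total)
```

With n values the loop runs log2(n) times, compared with n - 1 additions for a single thread summing serially.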
Algorithm 5.1 Parallel SDE inference with perfect observations. For each step, Y has m + 1 components and is stored in local memory, unique to each thread. For the second step, σ∗ is stored in shared memory so it is accessible to all threads.
Step 1: update of missing data
  τ = blockDim.x × blockIdx.x + threadIdx.x
  Y_0 = X_{τm}, Y_m = X_{τm+m}
  α = 0
  for i = 0 to m − 2 do
    Y_{i+1} ∼ q(Y_{i+1} | Y_i, X_{τm+m}, σ), where q(·|·) is one of the bridge distributions discussed in Table 5.1, Section 5.2
    α = α + L(Y_{[0:m]} | σ) − L(X_{τm:τm+m} | σ), where L(Y_{[0:m]} | σ) is the log likelihood function
  end for
  Accept Y_{[0:m]} with probability min(1, exp(α)).
Step 2: update of parameters
  if τ = 0 then
    σ∗ = σ + ε, where ε ∼ N(0, η) and η is a tuning parameter
  end if
  τ = blockDim.x × blockIdx.x + threadIdx.x
  B_0 = 0, Y_0 = X_{τm}
  for i = 0 to m − 2 do
    B_{i+1} = f^{−1}(Y_{i+1}, σ), where f(·) is one of the transformations for the innovation scheme discussed in Section 5.3
    Y_{i+1} = f(B_{i+1}, σ∗)
  end for
  α_τ = L(Y_{0:m} | σ∗) − L(X_{τm:τm+m} | σ) + |J(f(X, σ))| − |J(f(Y, σ))|, where J(·) is the Jacobian of f(·)
  SYNCTHREADS
  for i = 1, 2, 4, ... while i < blockDim.x do
    if τ mod 2i = 0 then
      α_τ = α_τ + α_{τ+i}
    end if
    SYNCTHREADS
  end for
  if τ = 0 then
    α_0 = α_0 + π(σ∗) − π(σ), where π(·) is the prior distribution on the log scale
    Set σ = σ∗ and X = Y with probability min(1, exp(α_0))
  end if
[Figure 5.17: two density panels, for β and σ^2, each showing curves for m = 1, 2, 4, 8, 16 and 32; vertical axis: Density.]
Figure 5.17: Posterior distributions for parameters of the O-U process Eq. (4.9), output from the GPU implementation of Algorithm 5.1 (solid lines), compared with the exact posterior distributions (histograms). Parameters were estimated using a data set with N = 100 observations and interobservation time ∆ = 0.1. A single long run of 10^5 MCMC samples was used to compute the posteriors.
At present the algorithm only applies to univariate processes but could easily be generalised. Each thread of the algorithm proposes new data Y_{[0:m]} for a sequence of missing data X_{τm+[0:m]} indexed by a parameter τ. This is calculated as τ = blockDim.x × blockIdx.x + threadIdx.x, where blockIdx.x indexes the thread block of the data, blockDim.x is the size of the thread block and threadIdx.x is the thread identifier within the block. The thread with τ = 0 is the master thread and performs global operations that need to be computed only once.
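The index arithmetic in this paragraph can be checked with a short serial sketch. The helper names `global_thread_id` and `segment_bounds` and the values of m and the block size are illustrative assumptions; only the formula for τ comes from the text.

```python
def global_thread_id(block_dim, block_idx, thread_idx):
    """tau = blockDim.x * blockIdx.x + threadIdx.x"""
    return block_dim * block_idx + thread_idx


def segment_bounds(tau, m):
    """First and last index of the m + 1 path points X[tau*m], ..., X[tau*m + m]."""
    return tau * m, tau * m + m


m = 4          # imputed sub-intervals per observation interval (hypothetical)
block_dim = 8  # threads per block (hypothetical)
bounds = [
    segment_bounds(global_thread_id(block_dim, b, t), m)
    for b in range(2)            # two thread blocks
    for t in range(block_dim)    # threads within each block
]
print(bounds[:3])
```

Neighbouring segments share exactly one point, the observation at their common end, which is held fixed under perfect observation, so no thread writes to a value owned by another.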
The first part of the algorithm, for imputing missing data, divides into independent threads, so computational efficiency increases almost linearly with the number of threads for any given value of m. However, this is limited by the number of threads per block. The second stage, updating parameters, is slower, as each thread requires access to shared memory to read the updated parameters and there is a reduction step to calculate the global likelihood value.
Initially we tested our implementation of Algorithm 5.1 on a GPU by applying it to the one-dimensional OU process model of Eq. (4.9). We used a data set with N = 100 observations and interobservation time ∆ = 0.1. We used the Modified Bridge proposal of Table 5.1 to impute the missing data. We compared the parameter estimates with those of the exact posterior distributions. The results, shown in Figure 5.17, demonstrate that the estimated posteriors converge to the true distributions for increasing m.
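As a rough illustration of the imputation carried out by one thread, the sketch below draws a path between two observations using the modified bridge in its commonly quoted form for a constant diffusion coefficient. This form is an assumption: the definition in Table 5.1 may differ in detail, and the function name and argument values are hypothetical.

```python
import math
import random


def modified_bridge(x_start, x_end, m, delta, sigma, rng):
    """Propose Y[0..m] with Y[0] = x_start and Y[m] = x_end held fixed."""
    dt = delta / m                  # imputation time step
    path = [x_start]
    for i in range(m - 1):          # i = 0, ..., m - 2
        remaining = m - i           # sub-steps left until the next observation
        # Mean moves linearly towards the end point; the variance shrinks
        # as the end point approaches.
        mean = path[i] + (x_end - path[i]) / remaining
        var = sigma ** 2 * dt * (remaining - 1) / remaining
        path.append(rng.gauss(mean, math.sqrt(var)))
    path.append(x_end)
    return path


rng = random.Random(0)
proposal = modified_bridge(0.2, -0.1, m=8, delta=0.1, sigma=1.0, rng=rng)
print(len(proposal), proposal[0], proposal[-1])
```

Each interval only needs its own two end points and the current parameter values, so one call of this kind per thread requires no communication between threads.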
Figure 5.18 compares the wall-clock time of the GPU and CPU implementations. Each plot shows how the running time increases with the amount of imputed data m. Notice that for small amounts of data, N < 65, the CPU implementation remains competitive, since the parallelism does not compensate for the increased overheads and reduced clock speed of the GPU implementation. The potential of the parallel algorithm is demonstrated for values N > 65. Here, although both algorithms are linear in m, the GPU implementation is much faster. This is particularly true for large m, with speed increases of a factor of 5 or more. On this particular GPU (in a standard laptop) the speed increases are not realised for N > 257. This is because, as mentioned previously, the algorithm then has to use multiple thread blocks, so the threads do not all have access to the same shared memory. As scientific computing expands its use of GPUs, the number of threads per block should rise and the amount of shared memory increase.