
Chapter 5 Inference for Models with Cubic Drift and Linear Diffu-

5.5 GPU Computing

In this section we describe the implementation of Algorithms 4.1 and 4.2 for sampling parameters and missing data on a Graphics Processing Unit (GPU) using NVIDIA’s CUDA parallel programming environment. Statisticians have recently realised that massively parallel computation on GPUs can greatly reduce computation time. The GPU, which developed in response to the increasing computational demands of graphics rendering in the gaming and video editing industries, is specialised for Single Instruction Multiple Data (SIMD) tasks. Compared with a conventional CPU it devotes more transistors to arithmetic instructions and fewer to data caching and flow control, and so is suited to algorithms with high arithmetic intensity and few branching statements or memory accesses.

The GPU consists of a number of Streaming Multiprocessors (SMs), each of which has a limited amount of fast on-chip shared memory. An SM is capable of performing an operation on 32 threads simultaneously whilst holding the remaining threads in memory. For the purpose of compatibility with different GPU configurations the CUDA programming model groups threads into thread blocks. Currently up to 1024 threads can be contained in a single thread block. Each block is sent to an SM, where instructions are executed on groups of 32 threads simultaneously. Switching between these groups of threads hides the latency associated with memory request operations.

[Figure 5.16 panels: A1,7; A1,8; A1,9; A1,10; A2,7; A2,8; A2,9; A2,10. Vertical axis: Density.]

Figure 5.16: Marginal distributions of cubic parameters inferred for a two dimensional model of the form of Eq. 5.1. A data set with N = 1,000 observations at interval ∆ = 0.1 was used. The diffusion parameters were fixed and there was no missing data. The blue histogram shows the parameters that gave stable solutions to the SDE, while the mauve is for those that gave unstable solutions. The purple shows the overlap between the two regions of the marginal distributions. The true values are given by the red lines.

Thread blocks are organised into a grid. Whilst threads within a block share a limited amount of fast on-chip memory, blocks within a grid share access only to global device memory, which is relatively slow, with latency of the order of 100 clock cycles. Threads within a block can be synchronised and communication between them is fast. Blocks communicate by transferring data through the CPU, which is slow.

Statisticians considering implementing their algorithm on a GPU should take these hardware factors into account when designing their parallel code. Another consideration is the choice between single and double precision floating point arithmetic. GPUs were originally designed for single precision, but with the recent demand for general purpose GPU computation more recent models have double precision capability. However, single precision remains at least 3-4 times faster, although this gap may narrow in the future.
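To give a feel for what is at stake in the choice of precision, the following is a minimal sketch in Python (not the CUDA code used in this work) that accumulates many small terms in a single and in a double precision accumulator. The helper names `to_float32` and `running_sum` and the number of terms are hypothetical and chosen only for illustration. Single precision is emulated by rounding the accumulator to a 32-bit float after every addition.

```python
import struct


def to_float32(x):
    """Round a double precision Python float to the nearest single precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def running_sum(term, n, single_precision):
    """Add `term` to an accumulator n times, optionally rounding to single precision."""
    if single_precision:
        term = to_float32(term)
    total = 0.0
    for _ in range(n):
        total += term
        if single_precision:
            # Rounding after every addition mimics a single precision accumulator.
            total = to_float32(total)
    return total


n_terms = 100_000        # hypothetical number of terms to accumulate
exact = n_terms / 10     # the value the sum of 0.1 would take in exact arithmetic

err_single = abs(running_sum(0.1, n_terms, True) - exact)
err_double = abs(running_sum(0.1, n_terms, False) - exact)
print("single precision error:", err_single)
print("double precision error:", err_double)
```

The sequential accumulation used here is a deliberately unfavourable case. Whether the loss of accuracy matters depends on the quantity being computed, so a comparison of this kind is best repeated on the actual likelihood values of interest.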

When converting a statistical algorithm for massively parallel computation one should consider how to decompose the problem into identical operations that can be performed with little dependence between them. Many data intensive applications in statistics are amenable to this sort of alteration. For example, Suchard et al. [2010] demonstrate the gains of using a GPU on a Bayesian mixture problem. Given a Gaussian mixture density they estimate the mean, variance and weight of each component. The inference algorithm is simplified by a data augmentation strategy which structures the problem so that it can be solved by Gibbs sampling. Each data point is assigned a configuration variable, and at each stage of the algorithm the posterior configuration probabilities are computed. For large data sets with many mixture components the number of configuration probabilities becomes very large. They implement a fine grained parallelisation strategy in which each pair of data point and configuration is assigned a dedicated thread. They describe their choice of execution plan, which optimises the use of shared memory and minimises the latency associated with transfers between global and shared memory. Given that only 16KB of shared memory is available, they describe an efficient method of transferring data to and from global memory by coalescing transactions into multiples of 16. After considering these hardware details they report a speed up of a factor of 120 over the standard algorithm implemented on a single CPU.

A further example is given by Lee et al. [2010]. They describe various Monte Carlo methods in which the sampler itself is parallelised, instead of the data as in the previous example. They show how easy it is to implement an importance sampling algorithm on the GPU, with each thread computing the importance weight of a single sample. They note that the standard Metropolis-Hastings algorithm does not gain much from parallelisation, as it is an inherently serial algorithm, although population MCMC and particle samplers work well. They split a population MCMC algorithm so that each thread samples a different distribution, with reversible swaps between chains. They have thousands of chains simulating from tempered distributions, with only a single chain sampling the target density. Applied to a mixture model, they show the improved mixing of the chain between widely separated modes. They also demonstrate a Sequential Monte Carlo algorithm where, as in the importance sampling example, each thread updates the weight of a single particle. The authors note that there was little reduction in accuracy from using only single precision. For large numbers of Monte Carlo samples they report a speed increase of a factor of approximately 280 over the CPU implementation.
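The reason importance sampling parallelises so readily is that each weight depends on its own sample only. The following minimal Python sketch illustrates this with a toy problem that is our own assumption and is not taken from Lee et al. [2010]: the target is N(0, 1), the proposal is N(0, 2^2) and the quantity estimated is the second moment of the target. The function names are hypothetical. The list comprehension that evaluates the weights is the part that would be given one thread per sample on a GPU.

```python
import math
import random


def normal_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and standard deviation."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def importance_weight(x):
    """Work done by one thread: target N(0, 1) density over proposal N(0, 2^2) density."""
    return normal_pdf(x, 0.0, 1.0) / normal_pdf(x, 0.0, 2.0)


def estimate_second_moment(n_samples, seed=0):
    """Self-normalised importance sampling estimate of E[X^2] under the target."""
    rng = random.Random(seed)
    samples = [rng.gauss(0.0, 2.0) for _ in range(n_samples)]
    # Each weight uses only its own sample, so these evaluations are independent
    # and could each be carried out by a separate thread.
    weights = [importance_weight(x) for x in samples]
    weighted_sum = sum(w * x * x for w, x in zip(weights, samples))
    return weighted_sum / sum(weights)


print(estimate_second_moment(20_000))  # should be close to 1, the variance of N(0, 1)
```

Only the two sums at the end require the threads to combine their results, which is why the method maps onto the GPU with so little communication.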

The inference procedure in this thesis transfers naturally to a GPU implementation. The Markov nature of SDE data implies that the data set can be divided into blocks that are conditionally independent given the observations at their end points. In our implementation each thread is responsible for a single observation interval. The imputed data within that interval are sampled by a single thread using the independence sampler proposal. Each thread has an ID and uses this to reference its section of data.

The algorithm is split into two steps: first the update of the missing data and second the sampling of the parameters. With perfect observation of the process, each thread in the first step can run without communicating with the threads responsible for neighbouring data intervals. If there is observation error then the data at the observation times need to be passed between threads, causing a potential bottleneck in this step of the algorithm. We only consider the case of perfect observation.

The sampling of parameters is a global operation as it involves all of the data in the likelihood function. However, again due to the Markov property, each thread can compute the likelihood for a single data block. When this is done the threads need to synchronise before all the values can be added to form a single likelihood value. This is an example of a parallel reduction algorithm and is computed using a tree structure. Each evenly numbered thread receives a value from its neighbouring thread and adds it to its own. Then every four threads sum their values and so on until there is just a single likelihood value. This is then added to the prior which can be computed by a single thread. The pseudo code for the update of missing data and parameters using the innovation scheme is shown below.

Algorithm 5.1 Parallel SDE inference with perfect observations. For each step Y has m + 1 components and is stored in local memory, unique to each thread. For the second step σ∗ is stored in shared memory so is accessible to all threads.

  τ = blockDim.x × blockIdx.x + threadIdx.x
  Y_0 = X_{τm}, Y_m = X_{τm+m}
  α = 0
  for i = 0 to m − 2 do
    Y_{i+1} ∼ q(Y_{i+1} | Y_i, X_{τm+m}, σ), where q(·|·) is one of the bridge distributions discussed in Table 5.1, Section 5.2
    α = α + L(Y_{[0:m]} | σ) − L(X_{τm:τm+m} | σ)
  end for
  where L(Y_{[0:m]} | σ) is the log likelihood function.
  Y_{[0:m]} is accepted with probability exp(α).
  if τ = 0 then
    σ∗ = σ + ε, where ε ∼ N(0, η) and η is a tuning parameter.
  end if
  τ = blockDim.x × blockIdx.x + threadIdx.x
  B_0 = 0, Y_0 = X_{τm}
  for i = 0 to m − 2 do
    B_{i+1} = f^{−1}(Y_{i+1}, σ), where f(·) is one of the transformations for the innovation scheme discussed in Section 5.3.
  α_0 = α_0 + π(σ∗) − π(σ), the prior distributions. Set σ = σ∗ and X = Y with

[Figure 5.17 panels: β and σ^2. Vertical axis: Density. Legend: m = 1, 2, 4, 8, 16, 32.]

Figure 5.17: Posterior distributions for parameters from the O-U process Eq. (4.9) output from the GPU implementation of Algorithm 5.1 (solid lines) compared with the exact posterior distributions (histograms). Parameters were estimated using a data set with N = 100 observations and interobservation time ∆ = 0.1. A single long run of 10^5 MCMC samples was used to compute the posteriors.

At present the algorithm only applies to univariate processes but could easily be generalised. Each thread of the algorithm proposes new data Y_{[0:m]} for a sequence of missing data X_{τm+[0:m]} indexed by a parameter τ. This is calculated as τ = blockDim.x × blockIdx.x + threadIdx.x, where blockIdx.x indexes the thread block of the data, blockDim.x is the size of the thread block and threadIdx.x is the thread identifier within the block. The thread with τ = 0 is the master thread and performs global operations that need to be computed only once.
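The indexing described above can be illustrated with a small serial emulation in Python. This is a sketch under assumed sizes (two blocks of four threads and m = 3), not the CUDA kernel itself, and the function names are hypothetical. Each value of τ selects the m + 1 points X_{τm}, ..., X_{τm+m} that one thread would work on.

```python
def global_thread_index(block_dim, block_idx, thread_idx):
    """Emulate tau = blockDim.x * blockIdx.x + threadIdx.x."""
    return block_dim * block_idx + thread_idx


def interval_slice(tau, m):
    """Positions tau*m, ..., tau*m + m of the path handled by thread tau (m + 1 points)."""
    return slice(tau * m, tau * m + m + 1)


# Hypothetical sizes: 2 blocks of 4 threads, m = 3 sub-intervals per observation interval.
block_dim, n_blocks, m = 4, 2, 3
X = list(range(block_dim * n_blocks * m + 1))  # stand-in for the augmented path

for block_idx in range(n_blocks):
    for thread_idx in range(block_dim):
        tau = global_thread_index(block_dim, block_idx, thread_idx)
        Y = X[interval_slice(tau, m)]
        # The end points Y[0] and Y[-1] are the observations that stay fixed.
        print(tau, Y[0], Y[-1])
```

Neighbouring threads share one end point, the observation between their intervals. With perfect observation this value is only read and never updated, which is why the threads need not communicate during the update of the missing data.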

The first part of the algorithm, for imputing missing data, divides into independent threads, so computational efficiency increases almost linearly with the number of threads for any given value of m. However, this is limited by the number of threads per block. The second stage, updating parameters, is slower, as each thread requires access to shared memory to read the updated parameters and there is a reduction step to calculate the global likelihood value.
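The reduction step can be sketched as follows. This Python function is a serial emulation of a tree reduction of the kind described earlier in this section, in which the number of active threads halves at each level. It is an illustrative sketch rather than the kernel used here, and the name `tree_reduce` and the example values are hypothetical.

```python
def tree_reduce(values):
    """Sum a non-empty sequence of per-thread values by pairwise tree reduction."""
    vals = list(values)
    n = len(vals)
    stride = 1
    while stride < n:
        # On a GPU all iterations of this inner loop would run in parallel,
        # followed by a synchronisation of the threads in the block.
        for tid in range(0, n - stride, 2 * stride):
            vals[tid] += vals[tid + stride]
        stride *= 2
    return vals[0]


# Hypothetical log likelihood contributions, one per thread.
contributions = [-1.2, -0.7, -2.1, -0.4, -1.9, -0.3, -1.1]
log_likelihood = tree_reduce(contributions)
print(log_likelihood)
```

For n threads the reduction takes about log2(n) levels instead of n − 1 sequential additions, but every level requires a synchronisation, which contributes to this stage being slower than the update of the missing data.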

Initially we tested our implementation of Algorithm 5.1 on a GPU by applying it to the one dimensional OU-process model of Eq. (4.9). We used a data set with N = 100 observations and interobservation time ∆ = 0.1. We used the Modified Bridge proposal of Table 5.1 to impute the missing data. We compared the parameter estimates with those of the exact posterior distributions. The results, shown in Figure 5.17, demonstrate that the estimated posteriors converge to the true distributions for increasing m.

Figure 5.18 compares the real computational time of the GPU and CPU implementations. Each plot shows how the running time increases with the amount of imputed data m. Notice that for small amounts of data, N < 65, the CPU implementation is faster, since the parallelism does not compensate for the increased overheads and reduced clock speed of the GPU implementation. The potential of the parallel algorithm is demonstrated for values N > 65. Here, although both algorithms are linear in m, the GPU implementation is much faster. This is particularly true for large m, with speed increases of a factor of 5 or more. On this particular GPU (in a standard laptop) the speed increases are not realised for N > 257. This is because, as mentioned previously, the algorithm has to use multiple thread blocks, so the threads do not all have access to the same shared memory. As scientific computing expands its use of GPUs the number of threads per block should rise and the amount of shared memory increase.

In document Methods of likelihood based inference for constructing stochastic climate models (Page 149-155)