Parallelization Strategies for Multicore Data Analysis
Wei-Chen Chen1 Russell Zaretzki2
1University of Tennessee, Dept of EEB
2University of Tennessee, Dept. Statistics, Operations, and Management Science
Computing in the Cloud, April 6-8, 2014
Outline
1 Introduction Basic Strategy
Data Analysis Algorithms
2 Data and Analysis Techniques
3 Examples
Example 1: Multicore M-H Samplers Example 2: Multicore Bootstrap Sampling Example 3: Multicore Methods for fitting GLMM
Basic Strategy
Multicore
A core is an individual processor
CPUs used to have a single core, and the terms were interchangeable.
Modern CPU’s have several cores on a single CPU chip.
Processors on the same chip share memory allowing much easier implementation of parallel algorithms.
Basic Strategy
Multi Node
Multi-node machines typically have many interconnected CPU’s.
Each CPU may have a number of cores which can share memory.
Utilizing a multi node machine usually involves explicitly moving data between nodes.
This is a significant complication and a high level of expertise is required to successfully and efficiently use these machines.
Basic Strategy
Points of Parallelism
In order to efficiently make use of multicore resources we need to understand our data and modelling procedure. The basic question is where is the independence?
Data: Statistical Independence
If our data is such that statistical independence exists between observations or groups of observations we may be able to take advantage of this special structure to divide and conquer.
Parallelism of the algorithm
Do parts of the algorithm allow for parallelism. Can the problem be divided into independent working pieces that can be
computed separately and then recombined?
Data Analysis Algorithms
Inputs and Outputs
Most data based (statistical) analyses in the life sciences follow a basic functional structure.
Inputs
D- Data in the form of a vector, list, or other structure.
Λ- Parameters of interest usually summarized as a vector, matrix, or list.
Outputs
φ- A scaler, vector, matrix or combination of these things.
Data Analysis Algorithms
Key Algorithms
Simulation
Parallel chains running at the same time may improve efficiency.
Cost of any burn in or discarded samples needs to be considered.
Useful if we want to run many chains with different data or parameter values.
Cluster Computers highly effective. Use job schedulers.
Optimization
Inherently serial operation controlled by a master process.
Parallel implementation is most likely to occur within the function call.
Sample Data
Vicente et al. (2006) looked at the distribution and faecal shedding patterns of the first-stage larvae (L1) of
Elaphostrongylus cervi (Nematoda: Protostrongylidae) in red deer across Spain.
n = 826 deer sampled.
Deer were grouped among 351 farms.
Sex of deer and length are explanatory variables.
For the response variables, define Yisas 1 if the parasite E. cervi L1 is found in animal j at farm i, and 0 otherwise.
Logistic Regression
Our goal is to relate presence/absence of the parasite to the size of the host animal and its gender which are known. We assume a binomial distribution for Yisand use the logistic link function to relate the mean pisto the explanatory variables.
That is,
Yis∼ Bin(1, pis(xisβ))
pis(xisβ) = exp β0+ β1xs+ β2xlen+ β3xlen∗ xs 1 + exp β0+ β1xs+ β2xlen+ β3xlen∗ xs
We are allowing each gender to have its own intercept and slope.
Likelihood Function
Whether we take a Bayesian or MLE approach, we will need the log likelihood.
l(β|y , X ) =
I
Y
i=1 Si
Y
s=1
exp(xisTβ) 1 + exp(xisTβ)
!yis
1 1 + exp(xisTβ)
!1−yis
= expP
i
P
syisxisTβ Q
i
Q
s[1 + exp(xisTβ)]
Likelihood Function
Whether we take a Bayesian or MLE approach, we will need the log likelihood.
l(β|y , X ) =
I
Y
i=1 Si
Y
s=1
exp(xisTβ) 1 + exp(xisTβ)
!yis
1 1 + exp(xisTβ)
!1−yis
= expP
i
P
syisxisTβ Q
i
Q
s[1 + exp(xisTβ)]
Example 1: Multicore M-H Samplers
Bayesian Inference in Logistic Regression
We’ll keep things simple here and assume an improper unit prior for β because of the lack of available conjugate priors. As a proposal distribution we will use a normal random walk sampler.
The Prior: β ∼ c The posterior
π(β|y , X ) ∝ l(β|y , X )π(β) ∝ l(β|y , X ) The proposal distribution
q(βi|βi−1) ∼N(βi−1,I−1(βi−1))
Example 1: Multicore M-H Samplers
Simulation with the Metropolis Hastings Step
Lack of conjugate priors and the form of the posterior requires that we simulate the posterior using the MH algorithm.
Random Walk M-H Algorithm for Logistic Regression Initialization: Choose an arbitrary starting value β0 Iteration t (t ≥ 1):
1 Given β(t−1), generate ˜β ∼q(β(t−1), β).
2 Compute
ρ(β(t−1), ˜β) = min 1, π( ˜β)q( ˜β, β(t−1)) π(β(t−1))q(β(t−1), ˜β)
!
= min(1, π( ˜β)/π(β(t−1)))
3 With probability ρ(β(t−1), ˜β), accept ˜βand set βt = ˜β;
otherwise reject ˜βand set βt= β(t−1)
Example 1: Multicore M-H Samplers
Opportunities for Parallelism.
MCMC, the accuracy of estimates and inferences improves with greater sampling. We would like to use parallelism to increase the speed at which we sample.
Where are the opportunities for parallelism in this example?
Two possibilities:
1 Multichain- We can run multiple independent chains each starting from a different initial value β0. Very Easy to do but we need to allow each chain to burn in.
2 Faster Function- We could use parallelism to speed up the calculation of the likelihood function, particularly if we had very large samples. For example if we had thousands of observations per farm we could break up the data,
compute the likelihood separately for each farm, and finally bring the results together to get a final value. This may be slightly more work than our first idea but will probably only help if the data is very large.
3 Finally, if we have enough resources we could try to do both.
Example 1: Multicore M-H Samplers
Example: Random Walk M-H in R
Let’s try simulating the posterior for our deer parasite example.
1 Method 1 is simply a serial implementation. Run file 21-mcmc-glm.R.
2 Method 2 accesses multiple cores through the mclapply function. Run file22-mcmc-glm-mclapply.R.
3 Method 3 uses the pBDR package. This allows you to work in a multinode environment and will be discussed more tomorrow. Run file23-mcmc-glm-pbdR.R
Example 1: Multicore M-H Samplers
Ex 1: Questions
1 Can you modify the code to change the number of cores/resources that you are using?
2 How can you create 95% credible intervals from the output?
3 Can you time your results to see if there are any improvements?
Example 2: Multicore Bootstrap Sampling
Variance Components
Previous example ignored the variation in the data due to the farms. Farms may be an important source of variation.
Introduce a "random intercept" into our model to take this into account.
Yis∼ Bin(1, pis(xisTβ))
pis(xisβ) = exp β0+ αi+ β1xs+ β2xlen 1 + exp β0+ αi+ β1xs+ β2xlen where αi ∼ N(0, σ2α).
1 GLMM- generalized linear mixed model.
2 Can be fit byPQL- Penalized Quasi-Likelihood method.
3 This method is known to produce biased estimates of both β and σα2.
4 Confidence intervals for σ2αalso biased.
Example 2: Multicore Bootstrap Sampling
Bootstrap to the Rescue
Use the bootstrap percentile method to simulate the distribution of the of the estimate and create a confidence interval. Both parametric and nonparametric approaches exist.
Non-Parametric Bootstrap Percentile Method
Initialization: Fit the PQL Model to the original data.
1 Sample with replacement the subset of observations from each farm and combine to create a new data set.
2 Compute the PQL estimate of the resampled data set.
3 Collect the estimates of σα2 and produce a confidence interval.
4 Create prediction intervals for the individual αi.
Example 2: Multicore Bootstrap Sampling
Opportunities for Parallelism.
As before, the accuracy of estimates and inferences improves with greater sampling. We would like to use parallelism to increase the speed at which we sample.
Where are the opportunities for parallelism in this example?
Two possibilities:
1 Multichain- Again, run multiple chains since the bootstrap simulation is totally independent.
2 Faster Function- The re-sampling step is a very simple task and can be computed in one step. Most of the work is involved in refitting the PQL model on the resampled data.
A multicore PQL may make sense, however, the data set may again be too small to have this be of much benefit.
3 Take advantage of gains by using vectorization and avoiding loops.
Example 2: Multicore Bootstrap Sampling
Example: Nonparametric Bootstrap
Let’s try bootstrapping the farm effect for our deer parasite example.
1 First run01-max_pql.Rto fit the initial model.
2 Method 1 is simply a serial implementation with a for loop.
Run file11-npbs_for.R.
3 Method 2 uses lapply to eliminate the for loop. Run file 12-npbs_lapply.R.
4 Method 3 uses the mclapply package. Run file 13-npbs_mclapply.R
5 Method 4 again uses the pbdr package. Run file 14-npbs_pbdR.R
Example 2: Multicore Bootstrap Sampling
Ex 2: Questions
1 Can you modify the code to change the number of cores/resources that you are using?
2 Can you time your results to see if there are any improvements?
3 Estimate mean and the median of variation for the bootstrapped samples?
4 Find a C.I. for beta.
5 More appropriate way to bootstrap?
Example 3: Multicore Methods for fitting GLMM
The GLMM Likelihood
The generalized linear mixed model likelihood requires us to integrate over the αi with respect to their densities.
l(β|y , X ) =Y
i
Z ∞
−∞
expP
s(yisxisTβ + αi) Q
s[1 + exp(xisTβ + αi)]
!
p(αi|σα2)dsi
where p(αi|σα2) =N(0, σ2α). PQL approximates this integral using a quadratic approximation. What can we do to improve the quality of the estimates?
Example 3: Multicore Methods for fitting GLMM
Approach 1: Maximizing the Likelihood
Outer Layer
Optimization Level: Inherently Serial.
Master Process chooses new parameter values to pass to the function (β, σ2α).
Function returns a value to the optimization algorithm.
Function Evaluation
Numerical integration or Monte Carlo integration.
Compute the product/sum of the integrals.
Example 3: Multicore Methods for fitting GLMM
Opportunities for Parallelism.
Where are the opportunities for parallelism in this example?
Two possibilities:
1 Multichain- Not viable at the outer level. Could try multiple optimizations to check convergence.
2 Faster Function- Break the function up by doing integrations for each group(farm) separately.
Example 3: Multicore Methods for fitting GLMM
Bayesian Approach to GLMM
p(β, α, σ2α|y ) ∝ p(y |β, α, σα2)p(β)p(α|σα2)p(σ2α) Full Conditionals
p(β|·) ∝
I
Y
i=1 Si
Y
s=1
p(yij|β, αi)p(β)
p(β|·) ∝
Si
Y
s=1
p(yij|β, αi)p(β)
p(σα2|·) ∝
Si
Y
s=1
p(αi|σα2)p(σα2)
Example 3: Multicore Methods for fitting GLMM
Example: Sampling the Posterior Distribution of the GLMM
1 Method 1 is simply a serial implementation with a for loop.
31-mcmc_glmm.R.
2 Method 2 uses mclapply to eliminate the need to loop through all of the random effects. Run file
41-mcmc_glmm_mclapply.R.
3 Method 3 like 2 but uses the pbdr package.
42-mcmc_glmm_pbdR.R.
Example 3: Multicore Methods for fitting GLMM
Ex 3: Questions
1 Exercise: Find 95% creditable intervals for sd.random.
2 Other ideas.