Parallelization Strategies for Multicore Data Analysis

(1)

Parallelization Strategies for Multicore Data Analysis

Wei-Chen Chen¹ Russell Zaretzki²

1University of Tennessee, Dept of EEB

2University of Tennessee, Dept. Statistics, Operations, and Management Science

Computing in the Cloud, April 6-8, 2014

(2)

Outline

1 Introduction Basic Strategy

Data Analysis Algorithms

2 Data and Analysis Techniques

3 Examples

Example 1: Multicore M-H Samplers Example 2: Multicore Bootstrap Sampling Example 3: Multicore Methods for fitting GLMM

(3)

Basic Strategy

Multicore

A core is an individual processor

CPUs used to have a single core, and the terms were interchangeable.

Modern CPU’s have several cores on a single CPU chip.

Processors on the same chip share memory allowing much easier implementation of parallel algorithms.

(4)

Basic Strategy

Multi Node

Multi-node machines typically have many interconnected CPU’s.

Each CPU may have a number of cores which can share memory.

Utilizing a multi node machine usually involves explicitly moving data between nodes.

This is a significant complication and a high level of expertise is required to successfully and efficiently use these machines.

(5)

Basic Strategy

Points of Parallelism

In order to efficiently make use of multicore resources we need to understand our data and modelling procedure. The basic question is where is the independence?

Data: Statistical Independence

If our data is such that statistical independence exists between observations or groups of observations we may be able to take advantage of this special structure to divide and conquer.

Parallelism of the algorithm

Do parts of the algorithm allow for parallelism. Can the problem be divided into independent working pieces that can be

computed separately and then recombined?

(6)

Inputs and Outputs

Most data based (statistical) analyses in the life sciences follow a basic functional structure.

Inputs

D- Data in the form of a vector, list, or other structure.

Λ- Parameters of interest usually summarized as a vector, matrix, or list.

Outputs

φ- A scaler, vector, matrix or combination of these things.

(7)

Key Algorithms

Simulation

Parallel chains running at the same time may improve efficiency.

Cost of any burn in or discarded samples needs to be considered.

Useful if we want to run many chains with different data or parameter values.

Cluster Computers highly effective. Use job schedulers.

Optimization

Inherently serial operation controlled by a master process.

Parallel implementation is most likely to occur within the function call.

(8)

Sample Data

Vicente et al. (2006) looked at the distribution and faecal shedding patterns of the first-stage larvae (L1) of

Elaphostrongylus cervi (Nematoda: Protostrongylidae) in red deer across Spain.

n = 826 deer sampled.

Deer were grouped among 351 farms.

Sex of deer and length are explanatory variables.

For the response variables, define Y_isas 1 if the parasite E. cervi L1 is found in animal j at farm i, and 0 otherwise.

(9)

Logistic Regression

Our goal is to relate presence/absence of the parasite to the size of the host animal and its gender which are known. We assume a binomial distribution for Y_isand use the logistic link function to relate the mean p_isto the explanatory variables.

That is,

Y_is∼ Bin(1, pis(x_isβ))

p_is(x_isβ) = exp β₀+ β₁x_s+ β₂x_len+ β₃x_len∗ x_s 1 + exp β₀+ β1x_s+ β2x_len+ β3x_len∗ xs

We are allowing each gender to have its own intercept and slope.

(10)

Likelihood Function

Whether we take a Bayesian or MLE approach, we will need the log likelihood.

l(β|y , X ) =

I

Y

i=1 Si

Y

s=1

exp(x_is^Tβ) 1 + exp(x_is^Tβ)

!yis

1 1 + exp(x_is^Tβ)

!1−yis

= expP

i

P

sy_isx_is^Tβ Q

i

Q

s[1 + exp(x_is^Tβ)]

(11)

Likelihood Function

Whether we take a Bayesian or MLE approach, we will need the log likelihood.

l(β|y , X ) =

I

Y

i=1 Si

Y

s=1

exp(x_is^Tβ) 1 + exp(x_is^Tβ)

!yis

1 1 + exp(x_is^Tβ)

!1−yis

= expP

i

P

sy_isx_is^Tβ Q

i

Q

s[1 + exp(x_is^Tβ)]

(12)

Example 1: Multicore M-H Samplers

Bayesian Inference in Logistic Regression

We’ll keep things simple here and assume an improper unit prior for β because of the lack of available conjugate priors. As a proposal distribution we will use a normal random walk sampler.

The Prior: β ∼ c The posterior

π(β|y , X ) ∝ l(β|y , X )π(β) ∝ l(β|y , X ) The proposal distribution

q(β_i|β_i−1) ∼N(β_i−1,I⁻¹(β_i−1))

(13)

Simulation with the Metropolis Hastings Step

Lack of conjugate priors and the form of the posterior requires that we simulate the posterior using the MH algorithm.

Random Walk M-H Algorithm for Logistic Regression Initialization: Choose an arbitrary starting value β⁰ Iteration t (t ≥ 1):

1 Given β^(t−1), generate ˜β ∼q(β^(t−1), β).

2 Compute

ρ(β^(t−1), ˜β) = min 1, π( ˜β)q( ˜β, β^(t−1)) π(β^(t−1))q(β^(t−1), ˜β)

!

= min(1, π( ˜β)/π(β^(t−1)))

3 With probability ρ(β^(t−1), ˜β), accept ˜βand set β^t = ˜β;

otherwise reject ˜βand set β^t= β^(t−1)

(14)

Opportunities for Parallelism.

MCMC, the accuracy of estimates and inferences improves with greater sampling. We would like to use parallelism to increase the speed at which we sample.

Where are the opportunities for parallelism in this example?

Two possibilities:

1 Multichain- We can run multiple independent chains each starting from a different initial value β⁰. Very Easy to do but we need to allow each chain to burn in.

2 Faster Function- We could use parallelism to speed up the calculation of the likelihood function, particularly if we had very large samples. For example if we had thousands of observations per farm we could break up the data,

compute the likelihood separately for each farm, and finally bring the results together to get a final value. This may be slightly more work than our first idea but will probably only help if the data is very large.

3 Finally, if we have enough resources we could try to do both.

(15)

Example: Random Walk M-H in R

Let’s try simulating the posterior for our deer parasite example.

1 Method 1 is simply a serial implementation. Run file 21-mcmc-glm.R.

2 Method 2 accesses multiple cores through the mclapply function. Run file22-mcmc-glm-mclapply.R.

3 Method 3 uses the pBDR package. This allows you to work in a multinode environment and will be discussed more tomorrow. Run file23-mcmc-glm-pbdR.R

(16)

Ex 1: Questions

1 Can you modify the code to change the number of cores/resources that you are using?

2 How can you create 95% credible intervals from the output?

3 Can you time your results to see if there are any improvements?

(17)

Example 2: Multicore Bootstrap Sampling

Variance Components

Previous example ignored the variation in the data due to the farms. Farms may be an important source of variation.

Introduce a "random intercept" into our model to take this into account.

Y_is∼ Bin(1, p_is(x_is^Tβ))

p_is(x_isβ) = exp β₀+ α_i+ β₁xs+ β₂x_len 1 + exp β₀+ α_i+ β₁x_s+ β₂x_len where α_i ∼ N(0, σ²_α).

1 GLMM- generalized linear mixed model.

2 Can be fit byPQL- Penalized Quasi-Likelihood method.

3 This method is known to produce biased estimates of both β and σ_α².

4 Confidence intervals for σ²_αalso biased.

(18)

Bootstrap to the Rescue

Use the bootstrap percentile method to simulate the distribution of the of the estimate and create a confidence interval. Both parametric and nonparametric approaches exist.

Non-Parametric Bootstrap Percentile Method

Initialization: Fit the PQL Model to the original data.

1 Sample with replacement the subset of observations from each farm and combine to create a new data set.

2 Compute the PQL estimate of the resampled data set.

3 Collect the estimates of σ_α² and produce a confidence interval.

4 Create prediction intervals for the individual αi.

(19)

As before, the accuracy of estimates and inferences improves with greater sampling. We would like to use parallelism to increase the speed at which we sample.

Two possibilities:

1 Multichain- Again, run multiple chains since the bootstrap simulation is totally independent.

2 Faster Function- The re-sampling step is a very simple task and can be computed in one step. Most of the work is involved in refitting the PQL model on the resampled data.

A multicore PQL may make sense, however, the data set may again be too small to have this be of much benefit.

3 Take advantage of gains by using vectorization and avoiding loops.

(20)

Example: Nonparametric Bootstrap

Let’s try bootstrapping the farm effect for our deer parasite example.

1 First run01-max_pql.Rto fit the initial model.

2 Method 1 is simply a serial implementation with a for loop.

Run file11-npbs_for.R.

3 Method 2 uses lapply to eliminate the for loop. Run file 12-npbs_lapply.R.

4 Method 3 uses the mclapply package. Run file 13-npbs_mclapply.R

5 Method 4 again uses the pbdr package. Run file 14-npbs_pbdR.R

(21)

Ex 2: Questions

1 Can you modify the code to change the number of cores/resources that you are using?

2 Can you time your results to see if there are any improvements?

3 Estimate mean and the median of variation for the bootstrapped samples?

4 Find a C.I. for beta.

5 More appropriate way to bootstrap?

(22)

Example 3: Multicore Methods for fitting GLMM

The GLMM Likelihood

The generalized linear mixed model likelihood requires us to integrate over the α_i with respect to their densities.

l(β|y , X ) =Y

i

Z ∞

−∞

expP

s(y_isx_is^Tβ + α_i) Q

s[1 + exp(x_is^Tβ + α_i)]

!

p(α_i|σ_α²)ds_i

where p(α_i|σ_α²) =N(0, σ²_α). PQL approximates this integral using a quadratic approximation. What can we do to improve the quality of the estimates?

(23)

Approach 1: Maximizing the Likelihood

Outer Layer

Optimization Level: Inherently Serial.

Master Process chooses new parameter values to pass to the function (β, σ²_α).

Function returns a value to the optimization algorithm.

Function Evaluation

Numerical integration or Monte Carlo integration.

Compute the product/sum of the integrals.

(24)

Two possibilities:

1 Multichain- Not viable at the outer level. Could try multiple optimizations to check convergence.

2 Faster Function- Break the function up by doing integrations for each group(farm) separately.

(25)

Bayesian Approach to GLMM

p(β, α, σ²_α|y ) ∝ p(y |β, α, σ_α²)p(β)p(α|σ_α²)p(σ²_α) Full Conditionals

p(β|·) ∝

I

Y

i=1 Si

Y

s=1

p(y_ij|β, α_i)p(β)

p(β|·) ∝

Si

Y

s=1

p(y_ij|β, α_i)p(β)

p(σ_α²|·) ∝

Si

Y

s=1

p(α_i|σ_α²)p(σ_α²)

(26)

Example: Sampling the Posterior Distribution of the GLMM

1 Method 1 is simply a serial implementation with a for loop.

31-mcmc_glmm.R.

2 Method 2 uses mclapply to eliminate the need to loop through all of the random effects. Run file

41-mcmc_glmm_mclapply.R.

3 Method 3 like 2 but uses the pbdr package.

42-mcmc_glmm_pbdR.R.

(27)

Ex 3: Questions

1 Exercise: Find 95% creditable intervals for sd.random.

2 Other ideas.