• No results found

Parallelization Strategies for Multicore Data Analysis

N/A
N/A
Protected

Academic year: 2022

Share "Parallelization Strategies for Multicore Data Analysis"

Copied!
27
0
0

Loading.... (view fulltext now)

Full text

(1)

Parallelization Strategies for Multicore Data Analysis

Wei-Chen Chen1 Russell Zaretzki2

1University of Tennessee, Dept of EEB

2University of Tennessee, Dept. Statistics, Operations, and Management Science

Computing in the Cloud, April 6-8, 2014

(2)

Outline

1 Introduction Basic Strategy

Data Analysis Algorithms

2 Data and Analysis Techniques

3 Examples

Example 1: Multicore M-H Samplers Example 2: Multicore Bootstrap Sampling Example 3: Multicore Methods for fitting GLMM

(3)

Basic Strategy

Multicore

A core is an individual processor

CPUs used to have a single core, and the terms were interchangeable.

Modern CPU’s have several cores on a single CPU chip.

Processors on the same chip share memory allowing much easier implementation of parallel algorithms.

(4)

Basic Strategy

Multi Node

Multi-node machines typically have many interconnected CPU’s.

Each CPU may have a number of cores which can share memory.

Utilizing a multi node machine usually involves explicitly moving data between nodes.

This is a significant complication and a high level of expertise is required to successfully and efficiently use these machines.

(5)

Basic Strategy

Points of Parallelism

In order to efficiently make use of multicore resources we need to understand our data and modelling procedure. The basic question is where is the independence?

Data: Statistical Independence

If our data is such that statistical independence exists between observations or groups of observations we may be able to take advantage of this special structure to divide and conquer.

Parallelism of the algorithm

Do parts of the algorithm allow for parallelism. Can the problem be divided into independent working pieces that can be

computed separately and then recombined?

(6)

Data Analysis Algorithms

Inputs and Outputs

Most data based (statistical) analyses in the life sciences follow a basic functional structure.

Inputs

D- Data in the form of a vector, list, or other structure.

Λ- Parameters of interest usually summarized as a vector, matrix, or list.

Outputs

φ- A scaler, vector, matrix or combination of these things.

(7)

Data Analysis Algorithms

Key Algorithms

Simulation

Parallel chains running at the same time may improve efficiency.

Cost of any burn in or discarded samples needs to be considered.

Useful if we want to run many chains with different data or parameter values.

Cluster Computers highly effective. Use job schedulers.

Optimization

Inherently serial operation controlled by a master process.

Parallel implementation is most likely to occur within the function call.

(8)

Sample Data

Vicente et al. (2006) looked at the distribution and faecal shedding patterns of the first-stage larvae (L1) of

Elaphostrongylus cervi (Nematoda: Protostrongylidae) in red deer across Spain.

n = 826 deer sampled.

Deer were grouped among 351 farms.

Sex of deer and length are explanatory variables.

For the response variables, define Yisas 1 if the parasite E. cervi L1 is found in animal j at farm i, and 0 otherwise.

(9)

Logistic Regression

Our goal is to relate presence/absence of the parasite to the size of the host animal and its gender which are known. We assume a binomial distribution for Yisand use the logistic link function to relate the mean pisto the explanatory variables.

That is,

Yis∼ Bin(1, pis(xisβ))

pis(xisβ) = exp β0+ β1xs+ β2xlen+ β3xlen∗ xs 1 + exp β0+ β1xs+ β2xlen+ β3xlen∗ xs

We are allowing each gender to have its own intercept and slope.

(10)

Likelihood Function

Whether we take a Bayesian or MLE approach, we will need the log likelihood.

l(β|y , X ) =

I

Y

i=1 Si

Y

s=1

exp(xisTβ) 1 + exp(xisTβ)

!yis

1 1 + exp(xisTβ)

!1−yis

= expP

i

P

syisxisTβ Q

i

Q

s[1 + exp(xisTβ)]

(11)

Likelihood Function

Whether we take a Bayesian or MLE approach, we will need the log likelihood.

l(β|y , X ) =

I

Y

i=1 Si

Y

s=1

exp(xisTβ) 1 + exp(xisTβ)

!yis

1 1 + exp(xisTβ)

!1−yis

= expP

i

P

syisxisTβ Q

i

Q

s[1 + exp(xisTβ)]

(12)

Example 1: Multicore M-H Samplers

Bayesian Inference in Logistic Regression

We’ll keep things simple here and assume an improper unit prior for β because of the lack of available conjugate priors. As a proposal distribution we will use a normal random walk sampler.

The Prior: β ∼ c The posterior

π(β|y , X ) ∝ l(β|y , X )π(β) ∝ l(β|y , X ) The proposal distribution

q(βii−1) ∼N(βi−1,I−1i−1))

(13)

Example 1: Multicore M-H Samplers

Simulation with the Metropolis Hastings Step

Lack of conjugate priors and the form of the posterior requires that we simulate the posterior using the MH algorithm.

Random Walk M-H Algorithm for Logistic Regression Initialization: Choose an arbitrary starting value β0 Iteration t (t ≥ 1):

1 Given β(t−1), generate ˜β ∼q(β(t−1), β).

2 Compute

ρ(β(t−1), ˜β) = min 1, π( ˜β)q( ˜β, β(t−1)) π(β(t−1))q(β(t−1), ˜β)

!

= min(1, π( ˜β)/π(β(t−1)))

3 With probability ρ(β(t−1), ˜β), accept ˜βand set βt = ˜β;

otherwise reject ˜βand set βt= β(t−1)

(14)

Example 1: Multicore M-H Samplers

Opportunities for Parallelism.

MCMC, the accuracy of estimates and inferences improves with greater sampling. We would like to use parallelism to increase the speed at which we sample.

Where are the opportunities for parallelism in this example?

Two possibilities:

1 Multichain- We can run multiple independent chains each starting from a different initial value β0. Very Easy to do but we need to allow each chain to burn in.

2 Faster Function- We could use parallelism to speed up the calculation of the likelihood function, particularly if we had very large samples. For example if we had thousands of observations per farm we could break up the data,

compute the likelihood separately for each farm, and finally bring the results together to get a final value. This may be slightly more work than our first idea but will probably only help if the data is very large.

3 Finally, if we have enough resources we could try to do both.

(15)

Example 1: Multicore M-H Samplers

Example: Random Walk M-H in R

Let’s try simulating the posterior for our deer parasite example.

1 Method 1 is simply a serial implementation. Run file 21-mcmc-glm.R.

2 Method 2 accesses multiple cores through the mclapply function. Run file22-mcmc-glm-mclapply.R.

3 Method 3 uses the pBDR package. This allows you to work in a multinode environment and will be discussed more tomorrow. Run file23-mcmc-glm-pbdR.R

(16)

Example 1: Multicore M-H Samplers

Ex 1: Questions

1 Can you modify the code to change the number of cores/resources that you are using?

2 How can you create 95% credible intervals from the output?

3 Can you time your results to see if there are any improvements?

(17)

Example 2: Multicore Bootstrap Sampling

Variance Components

Previous example ignored the variation in the data due to the farms. Farms may be an important source of variation.

Introduce a "random intercept" into our model to take this into account.

Yis∼ Bin(1, pis(xisTβ))

pis(xisβ) = exp β0+ αi+ β1xs+ β2xlen 1 + exp β0+ αi+ β1xs+ β2xlen where αi ∼ N(0, σ2α).

1 GLMM- generalized linear mixed model.

2 Can be fit byPQL- Penalized Quasi-Likelihood method.

3 This method is known to produce biased estimates of both β and σα2.

4 Confidence intervals for σ2αalso biased.

(18)

Example 2: Multicore Bootstrap Sampling

Bootstrap to the Rescue

Use the bootstrap percentile method to simulate the distribution of the of the estimate and create a confidence interval. Both parametric and nonparametric approaches exist.

Non-Parametric Bootstrap Percentile Method

Initialization: Fit the PQL Model to the original data.

1 Sample with replacement the subset of observations from each farm and combine to create a new data set.

2 Compute the PQL estimate of the resampled data set.

3 Collect the estimates of σα2 and produce a confidence interval.

4 Create prediction intervals for the individual αi.

(19)

Example 2: Multicore Bootstrap Sampling

Opportunities for Parallelism.

As before, the accuracy of estimates and inferences improves with greater sampling. We would like to use parallelism to increase the speed at which we sample.

Where are the opportunities for parallelism in this example?

Two possibilities:

1 Multichain- Again, run multiple chains since the bootstrap simulation is totally independent.

2 Faster Function- The re-sampling step is a very simple task and can be computed in one step. Most of the work is involved in refitting the PQL model on the resampled data.

A multicore PQL may make sense, however, the data set may again be too small to have this be of much benefit.

3 Take advantage of gains by using vectorization and avoiding loops.

(20)

Example 2: Multicore Bootstrap Sampling

Example: Nonparametric Bootstrap

Let’s try bootstrapping the farm effect for our deer parasite example.

1 First run01-max_pql.Rto fit the initial model.

2 Method 1 is simply a serial implementation with a for loop.

Run file11-npbs_for.R.

3 Method 2 uses lapply to eliminate the for loop. Run file 12-npbs_lapply.R.

4 Method 3 uses the mclapply package. Run file 13-npbs_mclapply.R

5 Method 4 again uses the pbdr package. Run file 14-npbs_pbdR.R

(21)

Example 2: Multicore Bootstrap Sampling

Ex 2: Questions

1 Can you modify the code to change the number of cores/resources that you are using?

2 Can you time your results to see if there are any improvements?

3 Estimate mean and the median of variation for the bootstrapped samples?

4 Find a C.I. for beta.

5 More appropriate way to bootstrap?

(22)

Example 3: Multicore Methods for fitting GLMM

The GLMM Likelihood

The generalized linear mixed model likelihood requires us to integrate over the αi with respect to their densities.

l(β|y , X ) =Y

i

Z

−∞

expP

s(yisxisTβ + αi) Q

s[1 + exp(xisTβ + αi)]

!

p(αiα2)dsi

where p(αiα2) =N(0, σ2α). PQL approximates this integral using a quadratic approximation. What can we do to improve the quality of the estimates?

(23)

Example 3: Multicore Methods for fitting GLMM

Approach 1: Maximizing the Likelihood

Outer Layer

Optimization Level: Inherently Serial.

Master Process chooses new parameter values to pass to the function (β, σ2α).

Function returns a value to the optimization algorithm.

Function Evaluation

Numerical integration or Monte Carlo integration.

Compute the product/sum of the integrals.

(24)

Example 3: Multicore Methods for fitting GLMM

Opportunities for Parallelism.

Where are the opportunities for parallelism in this example?

Two possibilities:

1 Multichain- Not viable at the outer level. Could try multiple optimizations to check convergence.

2 Faster Function- Break the function up by doing integrations for each group(farm) separately.

(25)

Example 3: Multicore Methods for fitting GLMM

Bayesian Approach to GLMM

p(β, α, σ2α|y ) ∝ p(y |β, α, σα2)p(β)p(α|σα2)p(σ2α) Full Conditionals

p(β|·) ∝

I

Y

i=1 Si

Y

s=1

p(yij|β, αi)p(β)

p(β|·) ∝

Si

Y

s=1

p(yij|β, αi)p(β)

p(σα2|·) ∝

Si

Y

s=1

p(αiα2)p(σα2)

(26)

Example 3: Multicore Methods for fitting GLMM

Example: Sampling the Posterior Distribution of the GLMM

1 Method 1 is simply a serial implementation with a for loop.

31-mcmc_glmm.R.

2 Method 2 uses mclapply to eliminate the need to loop through all of the random effects. Run file

41-mcmc_glmm_mclapply.R.

3 Method 3 like 2 but uses the pbdr package.

42-mcmc_glmm_pbdR.R.

(27)

Example 3: Multicore Methods for fitting GLMM

Ex 3: Questions

1 Exercise: Find 95% creditable intervals for sd.random.

2 Other ideas.

References

Related documents

See also specifi c types

Oleh karena itu diperlukan informasi mengenai pendugaan emisi karbondioksida akibat kebakaran hutan dan lahan pada berbagai tipe tutupan lahan di Kalimantan Barat

-Basic Website: A simple website built in Dreamweaver utilizing the students’ custom logos.. -Flash Animation: A 1-3 minute custom animation made

Challenges from the programme team perspective is to facilitate sufficient peer support between students in order to nurture a community of ePortfolio students who can solve

LEGOLAND parks in 2005 for €375 million, while also acquiring a 22% equity stake in the new theme park operator, Merlin Entertainment. A lot of decisions were made regarding

We bench- mark the operator-based machine learning model on a num- ber of existing data sets that account for forces, energies, and dipole moments across chemical space and show how

The results for wages and productivity hold in the years before the export start, which indicates self-selection into exporting of more productive services firms that pay

Plume phase activity at Redoubt Volcano was much more intense than what was observed at Augustine Volcano (the explosive eruptions of Redoubt Volcano were also much larger