Programming with Big Data in R

(1)

Drew Schmidt and George Ostrouchov

useR! 2014

(2)

The

p

pb

p

b

d

R

Core Team

Wei-Chen Chen

1 George Ostrouchov

2 ,

3 Pragneshkumar Patel

3 Drew Schmidt

3 Support

This work used resources ofNational Institute for Computational Sciencesat the University of Tennessee, Knoxville, which is supported by the Office of Cyberinfrastructure of the U.S. National Science Foundation under Award No.

ARRA-NSF-OCI-0906324 for NICS-RDAV center. This work also used resources of theOak Ridge Leadership Computing Facility

at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725.

1_{Department of Ecology and Evolutionary Biology}

University of Tennessee, Knoxville TN, USA

2_{Computer Science and Mathematics Division}

Oak Ridge National Laboratory, Oak Ridge TN, USA

3_{Joint Institute for Computational Sciences}

University of Tennessee, Knoxville TN, USA

(3)

Downloads

This presentation is available at:

http://r-pbd.org/tutorial

(4)

About This Presentation

Installation Instructions

Installation instructions for setting up a p

p

b

d

R

environment are available:

http://r-pbd.org/install.html

This includes instructions for installing R, MPI, and p

p

b

d

R

R.

(5)

1

Introduction

2

Profiling and Benchmarking

3

The pbdR Project

4

Introduction to pbdMPI

5

The Generalized Block Distribution

6

Basic Statistics Examples

7

Data Input

8

Introduction to pbdDMAT and the ddmatrix Structure

9

Examples Using pbdDMAT

10

MPI Profiling

11

Wrapup

(6)

Introduction

1 Introduction

A Concise Introduction to Parallelism

A Quick Overview of Parallel Hardware

A Quick Overview of Parallel Software

Summary

(7)

1 Introduction

A Concise Introduction to Parallelism

A Quick Overview of Parallel Hardware

A Quick Overview of Parallel Software

Summary

(8)

Introduction A Concise Introduction to Parallelism

Parallelism

Serial Programming

Parallel Programming

(9)

Difficulty in Parallelism

1

Implicit parallelism

: Parallel details hidden from user

2

Explicit parallelism

: Some assembly required. . .

3

Embarrassingly Parallel

: Also called

loosely coupled

. Obvious how to

make parallel; lots of independence in computations.

4

Tightly Coupled

: Opposite of embarrassingly parallel; lots of

dependence in computations.

(10)

Speedup

Wallclock Time

: Time of the clock on the wall from start to finish

Speedup

: unitless measure of improvement; more is better.

Sn

1

,

n

2

=

Run time for

n

1 cores

Run time for

n

2 cores

n

1 is often taken to be 1

In this case, comparing parallel algorithm to serial algorithm

(11)

Good Speedup

● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 2 4 6 8 2 4 6 8 Cores Speedup group Application Optimal

Bad Speedup

● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 2 4 6 8 2 4 6 8 Cores Speedup group Application Optimal

(12)

Scalability and Benchmarking

1

Strong

: Fixed

total

problem size.

Less work per core as more cores are added.

2

Weak

: Fixed

local

(per core) problem size.

Same work per core as more cores are added.

(13)

Good Strong Scaling

●● ● ● ● ● ● 0 10 20 30 40 50 0 20 40 60 Cores Speedup

Good Weak Scaling

● ● ● ● ● 2950 3000 3050 3100 4000 8000 12000 16000 Cores Time

(14)

Shared and Distributed Memory Machines

Shared Memory

Direct access to read/change

memory (one node)

Distributed

No direct access to read/change

memory (many nodes); requires

communication

(15)

Shared Memory Machines

Thousands of cores

Nautilus, University of Tennessee 1024 cores 4 TB RAM

Distributed Memory Machines

Hundreds of thousands of cores

Kraken, University of Tennessee 112,896 cores 147 TB RAM

(16)

Introduction A Quick Overview of Parallel Hardware

1 Introduction

A Concise Introduction to Parallelism

A Quick Overview of Parallel Hardware

A Quick Overview of Parallel Software

Summary

(17)

Interconnection Network PROC + cache PROC + cache PROC + cache PROC + cache

Mem Mem Mem Mem

Distributed Memory

Memory

CORE

+ cache + cacheCORE + cacheCORE + cacheCORE

Network

Shared Memory Local Memory

GPU or MIC

Co-Processor

GPU: Graphical Processing Unit MIC: Many Integrated Core

(18)

Your Laptop or Desktop

Mem Mem Mem Mem

Distributed Memory

Memory

CORE

Network

GPU or MIC

Co-Processor

(19)

Mem Mem Mem Mem

Distributed Memory

Memory

CORE

Network

GPU or MIC

Co-Processor

(20)

Server to Supercomputer

Mem Mem Mem Mem

Distributed Memory

Memory

CORE

Network

GPU or MIC

Co-Processor

(21)

Mem Mem Mem Mem

Distributed Memory

Memory

CORE

Network

GPU or MIC

Co-Processor

Clus

ter

Mult

icor

e

GPU

or M

any

core

“Dis

tribu

ting

”

‘Mul

tithr

ead

ing”

“Off

load

ing”

(22)

Introduction A Quick Overview of Parallel Software

1 Introduction

A Concise Introduction to Parallelism

A Quick Overview of Parallel Hardware

A Quick Overview of Parallel Software

Summary

(23)

Mem Mem Mem Mem Distributed Memory

Memory CORE

Network

GPU or MIC Co-Processor

Focus on who owns what data and what communication is needed

Focus on which tasks can be parallel

Same Task on Blocks of data

Sockets MPI Hadoop CUDA OpenCL OpenACC OpenMP OpenACC OpenMP Pthreads fork

(24)

30+ Years of Parallel Computing Research

Sockets MPI Hadoop Interconnection Network PROC + cache PROC + cache PROC + cache PROC + cache

Memory CORE

Network

CUDA OpenCL OpenACC OpenMP OpenACC OpenMP Pthreads fork

(25)

Memory CORE

Network

(26)

Putting It All Together Challenge

Memory CORE

Network

(27)

Memory CORE

Network

Sockets MPI Hadoop snow Rmpi pbdMPI

snow + multicore = parallel

CUDA OpenCL OpenACC OpenMP OpenACC OpenMP Pthreads fork multicore RHadoop Foreign Langauge Interfaces: .C .Call Rcpp OpenCL inline . . .

(28)

Introduction Summary

1 Introduction

A Concise Introduction to Parallelism

A Quick Overview of Parallel Hardware

A Quick Overview of Parallel Software

Summary

(29)

Summary

Three flavors of hardware

Distributed is stable

Multicore and co-processor are evolving

Two memory models

Distributed works in multicore

Parallelism hierarchy

Medium to big machines have all three

(30)

Profiling and Benchmarking

2 Profiling and Benchmarking

Why Profile?

Profiling R Code

Advanced R Profiling

Summary

(31)

2 Profiling and Benchmarking

Why Profile?

Profiling R Code

Advanced R Profiling

Summary

(32)

Profiling and Benchmarking Why Profile?

Performance and Accuracy

Sometimes

π

= 3

.

14 is (a) infinitely faster

than the “correct” answer and (b) the

differ-ence between the “correct” and the “wrong”

answer is meaningless. . . . The thing is, some

specious value of “correctness” is often

irrel-evant because it doesn’t matter. While

per-formance almost always matters. And I

ab-solutely detest the fact that people so often

dismiss performance concerns so readily.

— Linus Torvalds, August 8, 2008

(33)

Why Profile?

Because performance matters.

Bad practices scale up!

Your bottlenecks may surprise you.

Because R is dumb.

R users claim to be data people. . . so act like it!

(34)

Compilers often correct bad behavior. . .

A Really Dumb Loop

int

m a i n () {

int

x , i ;

for

( i =0; i < 1 0 ; i ++)

x = 1;

r e t u r n

0;

}

clang -O3 example.c

m a i n :

. cfi

_

s t a r t p r o c

# BB #0:

x o r l

% eax ,

% eax

ret

clang example.c

m a i n :

. cfi

_

s t a r t p r o c

# BB #0:

m o v l

$

0 , -4(% rsp )

m o v l

$

0 ,

-12(% rsp )

. L B B 0

_

1:

c m p l

$

10 ,

-12(% rsp )

jge

. L B B 0

_

4 # BB #2:

m o v l

$

1 , -8(% rsp )

# BB #3:

m o v l

-12(% rsp ) , % eax

a d d l

$

1 , % eax

m o v l

% eax ,

-12(% rsp )

jmp

. L B B 0

_

1 . L B B 0

_

4:

m o v l

$

0 , % eax

ret

(35)

Dumb Loop

1

for

( i in 1: n ) {

2

tA <- t(A)

3

Y

< -

tA %

* % Q

4

Q < - qr

.

Q

(

qr

( Y ) )

5

Y

< -

A %

* % Q

6

Q < - qr

.

Q

(

qr

( Y ) )

7

}

8 9

Q

Better Loop

1

tA <- t(A)

2 3

for

( i in 1: n ) {

4

Y

< -

tA %

* % Q

5

Q < - qr

.

Q

(

qr

( Y ) )

6

Y

< -

A %

* % Q

7

Q < - qr

.

Q

(

qr

( Y ) )

8

}

9 10

Q

(36)

Example from a Real R Package

Exerpt from Original function

1

w h i l e

( i <= N ) {

2

for

( j in 1: i ) {

3

d.k <- as.matrix(x)[l==j,l==j]

4

...

Exerpt from Modified function

1

x.mat <- as.matrix(x)

2 3

w h i l e

( i <= N ) {

4

for

( j in 1: i ) {

5

d.k <- x.mat[l==j,l==j]

6

...

By changing just 1 line of

code, performance of the

main method improved by

over 350%!

(37)

Some Thoughts

R is slow.

Bad programmers are slower.

R isn’t very clever (compared to a compiler).

The Bytecode compiler helps, but not nearly as much as a compiler.

(38)

Profiling and Benchmarking Profiling R Code

2 Profiling and Benchmarking

Why Profile?

Profiling R Code

Advanced R Profiling

Summary

(39)

Timings

Getting simple timings as a basic measure of performance is easy, and

valuable.

system.time()

— timing blocks of code.

Rprof()

— timing execution of R functions.

Rprofmem()

— reporting memory allocation in R .

tracemem()

— detect when a copy of an R object is created.

The

rbenchmark

package — Benchmark comparisons.

(40)

Profiling and Benchmarking Advanced R Profiling

2 Profiling and Benchmarking

Why Profile?

Profiling R Code

Advanced R Profiling

Summary

(41)

Other Profiling Tools

perf (Linux)

PAPI

MPI profiling: fpmpi, mpiP, TAU

(42)

Profiling and Benchmarking Advanced R Profiling

Profiling MPI Codes with

pbdPROF

1. Rebuild p

p

b

d

R

packages

R CMD I N S T A L L p b d M P I

_

0.2 -1. tar . gz \

- - c o n f i g u r e - a r g s = \

" - - enable - p b d P R O F "

2. Run code

m p i r u n - np 64 R s c r i p t my

_

s c r i p t . R

3. Analyze results

1

l i b r a r y

( p b d P R O F )

2

p r o f

< - r e a d

. p r o f (

" o u t p u t . m p i P "

)

3

p l o t

( prof ,

p l o t

. t y p e =

" m e s s a g e s 2 "

)

(43)

Profiling with

pbdPAPI

Performance Application Programming

Interface

High and low level interfaces

Linux only :(

Function

Description of Measurement

system.flips()

Time, floating point instructions, and Mflips

system.flops()

Time, floating point operations, and Mflops

system.cache()

Cache misses, hits, accesses, and reads

system.epc()

Events per cycle

system.idle()

Idle cycles

system.cpuormem()

CPU or RAM bound

∗

system.utilization()

CPU utilization

∗

(44)

Profiling and Benchmarking Summary

2 Profiling and Benchmarking

Why Profile?

Profiling R Code

Advanced R Profiling

Summary

(45)

Summary

Profile, profile, profile.

Use

system.time()

to get a general sense of a method.

Use

rbenchmark’s

benchmark()

to compare 2 methods.

Use

Rprof()

for more detailed profiling.

Other tools exist for more hardcore applications (pbdPAPI

and

pbdPROF).

(46)

The pbdR Project

3 The pbdR Project

The pbdR Project

pbdR Connects R to HPC Libraries

Using pbdR

Summary

(47)

3 The pbdR Project

The pbdR Project

pbdR Connects R to HPC Libraries

Using pbdR

Summary

(48)

The pbdR Project The pbdR Project

Programming with Big Data in R (pbdR)

Striving for

Productivity, Portability, Performance

Free

a

R packages.

Bridging high-performance compiled code

with high-productivity of R

Scalable, big data analytics.

Offers implicit and explicit parallelism.

Methods have syntax

identical

to R.

a

MPL, BSD, and GPL licensed

(49)

pbdR Packages

(50)

pbdR Motivation

Why HPC libraries (MPI, ScaLAPACK, PETSc, . . . )?

The HPC community has been at this for decades.

They’re tested. They work. They’re fast.

You’re not going to beat Jack Dongarra at dense linear algebra.

(51)

Simple Interface for MPI Operations with

pbdMPI

Rmpi

1

# int

2

mpi . a l l r e d u c e ( x , t y p e =1)

3

# d o u b l e

4

mpi . a l l r e d u c e ( x , t y p e =2)

pbdMPI

1

a l l r e d u c e ( x )

Types in R

1

>

is

.

i n t e g e r

(1)

2

[1] F A L S E

3

>

is

.

i n t e g e r

(2)

4

[1] F A L S E

5

>

is

.

i n t e g e r

( 1 : 2 )

6

[1] T R U E

(52)

Distributed Matrices and Statistics with

pbdDMAT

Matrix Exponentiation

●● ● ● ● ● ● ● ● ● 1 24 8 12 16 24 32 48 64 Cores Nodes ●1●2●3●4

Speedup (Relative to Serial Code)

1

l i b r a r y

( p b d D M A T )

2

3

dx

< -

d d m a t r i x (

" r n o r m "

, 5000 , 5 0 0 0 )

4

e x p m ( dx )

(53)

Distributed Matrices and Statistics with

pbdDMAT

Least Squares Benchmark

21.3442.6885.35 170.7 341.41 1016.09 21.3442.68 85.35 170.7 341.41 1016.09 21.3442.6885.35 170.7 341.41 1016.09 25 50 75 100 125 5041008 2016 4032 8064 24000 5041008 2016 4032 8064 24000 5041008 2016 4032 8064 24000 Cores

Run Time (Seconds)

Predictors 500 1000 2000

Fitting y~x With Fixed Local Size of ~43.4 MiB

x

<

−

d d m a t r i x (

” r n o r m ”

,

nrow

=m,

n c o l

=n )

y

<

−

d d m a t r i x (

” r n o r m ”

,

nrow

=m,

n c o l

=1)

mdl

<

−

lm

. f i t ( x=x , y=y )

(54)

Getting Started with HPC for R Users:

pbdDEMO

140 page, textbook-style vignette.

Over 30 demos, utilizing all

∗

packages.

Not just a hello world!

Demos include:

PCA

Regression

Parallel data input

Model-based clustering

Simple Monte Carlo simulation

Bayesian MCMC

(55)

3 The pbdR Project

The pbdR Project

pbdR Connects R to HPC Libraries

Using pbdR

Summary

(56)

The pbdR Project pbdR Connects R to HPC Libraries

R and

p

b

d

R

Interfaces to HPC Libraries

Interconnection Network PROC

+ cache + cachePROC + cachePROC + cachePROC

Memory CORE

Network

Trilinos PETSc PLASMA DPLASMA MKL LibSci ScaLAPACK PBLAS BLACS cuBLAS MAGMA PAPI Tau MPI mpiP fpmpi NetCDF4 ADIOS magma HiPLAR HiPLARM pbdMPI pbdPAPI pbdNCDF4 pbdADIOS ppbbddPROFPROF pbdPROF pbdDEMO ACML pbdDMAT pbdDMAT pbdDMAT pbdDMAT

(57)

3 The pbdR Project

The pbdR Project

pbdR Connects R to HPC Libraries

Using pbdR

Summary

(58)

The pbdR Project Using pbdR

pbdR Paradigms

p

b

d

R

programs are R programs!

Differences:

Batch execution (non-interactive).

Parallel code utilizes Single Program/Multiple Data (SPMD) style

Emphasizes data parallelism.

(59)

Batch Execution

Running a serial R program in batch:

1

R s c r i p t my

_

s c r i p t . r

or

1

R CMD B A T C H my

_

s c r i p t . r

Running a parallel (with MPI) R program in batch:

1

m p i r u n - np 2 R s c r i p t my

_

par

_

s c r i p t . r

(60)

The pbdR Project Using pbdR

Single Program/Multiple Data (SPMD)

SPMD is a programming

paradigm

.

Not to be confused with SIMD.

Paradigms

Programming models

OOP, Functional, SPMD, . . .

SIMD

Hardware instructions

MMX, SSE, . . .

(61)

Single Program/Multiple Data (SPMD)

SPMD is arguably the simplest extension of serial programming.

Only one program is written, executed in batch on all processors.

Different processors are autonomous; there is no manager.

Dominant programming model for large machines for 30 years.

(62)

The pbdR Project Summary

Summary

p

b

d

R

connects R to scalable HPC libraries.

The

pbdDEMO

package offers numerous examples and explanations

for getting started with distributed R programming.

p

b

d

R

programs are R programs.

(63)

4 Introduction to pbdMPI

Managing a Communicator

Reduce, Gather, Broadcast, and Barrier

Other pbdMPI Tools

Summary

(64)

Introduction to pbdMPI Managing a Communicator

4 Introduction to pbdMPI

Managing a Communicator

Reduce, Gather, Broadcast, and Barrier

Other pbdMPI Tools

Summary

(65)

Message Passing Interface (MPI)

MPI

: Standard for managing communications (data and instructions)

between different nodes/computers.

Implementations

: OpenMPI, MPICH2, Cray MPT, . . .

Enables parallelism (via communication) on distributed machines.

Communicator

: manages communications between processors.

(66)

MPI Operations (1 of 2)

Managing a Communicator: Create and destroy communicators.

init()

— initialize communicator

finalize()

— shut down communicator(s)

Rank query: determine the processor’s position in the communicator.

comm.rank()

— “who am I?”

comm.size()

— “how many of us are there?”

Printing: Printing output from various ranks.

comm.print(x)

comm.cat(x)

WARNING: only use these functions on

results

, never on

yet-to-be-computed things.

(67)

Quick Example 1

Rank Query: 1 rank.r

1

l i b r a r y

( pbdMPI , q u i e t l y = T R U E )

2

i n i t ()

3 4

my .

r a n k < -

c o m m .

r a n k

()

5

c o m m .

p r i n t

( my .

rank

,

all

.

r a n k

= T R U E )

6 7

f i n a l i z e ()

Execute this script via:

1

m p i r u n - np 2 R s c r i p t 1 _ r a n k . r

Sample Output:

1

C O M M . R A N K = 0

2

[1] 0

3

C O M M . R A N K = 1

4

[1] 1

(68)

Quick Example 2

Hello World: 2 hello.r

1

l i b r a r y

( pbdMPI , q u i e t l y = T R U E )

2

i n i t ()

3 4

c o m m .

p r i n t

(

" Hello , w o r l d "

)

5 6

c o m m .

p r i n t

(

" H e l l o a g a i n "

,

all

.

r a n k

= TRUE , q u i e t l y = T R U E )

7 8

f i n a l i z e ()

Execute this script via:

1

m p i r u n - np 2 R s c r i p t 2 _ h e l l o . r

Sample Output:

1

C O M M . R A N K = 0

2

[1]

" Hello , w o r l d "

3

[1]

" H e l l o a g a i n "

4

[1]

" H e l l o a g a i n "

(69)

4 Introduction to pbdMPI

Managing a Communicator

Reduce, Gather, Broadcast, and Barrier

Other pbdMPI Tools

Summary

(70)

Introduction to pbdMPI Reduce, Gather, Broadcast, and Barrier

MPI Operations

1

Reduce

2

Gather

3

_Broadcast

4

Barrier

(71)

Reductions — Combine results into single result

(72)

Gather — Many-to-one

(73)

Broadcast — One-to-many

(74)

Barrier — Synchronization

(75)

MPI Operations (2 of 2)

Reduction: each processor has a number

x; add all of them up, find

the largest/smallest, . . . .

reduce(x, op=’sum’)

— reduce to one

allreduce(x, op=’sum’)

— reduce to all

Gather: each processor has a number; create a new object on some

processor containing all of those numbers.

gather(x)

— gather to one

allgather(x)

— gather to all

Broadcast: one processor has a number

x

that every other processor

should also have.

bcast(x)

Barrier: “computation wall”; no processor can proceed until

all

processors can proceed.

barrier()

(76)

Quick Example 3

Reduce and Gather: 3 gt.r

1

l i b r a r y

( pbdMPI , q u i e t l y = T R U E )

2

i n i t ()

3 4

c o m m .

set

. s e e d (

d i f f

= T R U E )

5 6

n

< - s a m p l e

(1:10 , s i z e =1)

7 8

gt

< -

g a t h e r ( n )

9

c o m m .

p r i n t

(

u n l i s t

( gt ) )

10 11

sm

< -

a l l r e d u c e ( n , op =

’ sum ’

)

12

c o m m .

p r i n t

( sm ,

all

.

r a n k

= T )

13 14

f i n a l i z e ()

Execute this script via:

1

m p i r u n - np 2 R s c r i p t 3 _ gt . r

Sample Output:

1

C O M M . R A N K = 0

2

[1] 2 8

3

C O M M . R A N K = 0

4

[1] 10

5

C O M M . R A N K = 1

6

[1] 10

(77)

Broadcast: 4 bcast.r

1

l i b r a r y

( pbdMPI , q u i e t l y = T )

2

i n i t ()

3 4

if

( c o m m .

r a n k

() = = 0 ) {

5

x

< - m a t r i x

(1:4 ,

n r o w

=2)

6

}

e l s e

{

7

x

< -

N U L L

8

}

9 10

y

< -

b c a s t ( x ,

r a n k

.

s o u r c e

=0)

11 12

c o m m .

p r i n t

( y ,

r a n k

=1)

13 14

f i n a l i z e ()

Execute this script via:

1

m p i r u n - np 2 R s c r i p t 4 _ b c a s t . r

Sample Output:

1

C O M M . R A N K = 1

2

[ ,1] [ ,2]

3

[1 ,]

1

3

4

[2 ,]

2

4

(78)

Introduction to pbdMPI Other pbdMPI Tools

4 Introduction to pbdMPI

Managing a Communicator

Reduce, Gather, Broadcast, and Barrier

Other pbdMPI Tools

Summary

(79)

Random Seeds

pbdMPI

offers a simple interface for managing random seeds:

comm.set.seed(seed=1234, diff=FALSE)

— All processors use

the same seed.

comm.set.seed(seed=1234, diff=FALSE)

— All processors use

the same seed.

(80)

Introduction to pbdMPI Other pbdMPI Tools

Other Helper Tools

pbdMPI

Also contains useful tools for Manager/Worker and task

parallelism codes:

Task Subsetting: Distributing a list of jobs/tasks

get.jid(n)

ply: Functions in the ply family.

pbdApply(X, MARGIN, FUN, ...)

— analogue of

apply()

pbdLapply(X, FUN, ...)

— analogue of

lapply()

pbdSapply(X, FUN, ...)

— analogue of

sapply()

(81)

4 Introduction to pbdMPI

Managing a Communicator

Reduce, Gather, Broadcast, and Barrier

Other pbdMPI Tools

Summary

(82)

Introduction to pbdMPI Summary

Summary

Start by loading the package:

1

l i b r a r y

( pbdMPI , q u i e t = T R U E )

Always initialize before starting and finalize when finished:

1

i n i t ()

2 3

# ...

4

5

f i n a l i z e ()

(83)

5 The Generalized Block Distribution

GBD: a Way to Distribute Your Data

Example GBD Distributions

Summary

(84)

GBD GBD: a Way to Distribute Your Data

5 The Generalized Block Distribution

GBD: a Way to Distribute Your Data

Example GBD Distributions

Summary

(85)

Distributing Data

Problem:

How to distribute the data

x

=







x

1 ,

1 x

1 ,

2 x

1 ,

3 x

2 ,

1 x

2 ,

2 x

2 ,

3 x

3 ,

1 x

3 ,

2 x

3 ,

3 x

4 ,

1 x

4 ,

2 x

4 ,

3 x

5 ,

1 x

5 ,

2 x

5 ,

3 x

6 ,

1 x

6 ,

2 x

6 ,

3 x

7 ,

1 x

7 ,

2 x

7 ,

3 x

8 ,

1 x

8 ,

2 x

8 ,

3 x

9 ,

1 x

9 ,

2 x

9 ,

3 x

10 ,

1 x

10 ,

2 x

10 ,

3 





10 ×

3

?

(86)

Distributing a Matrix Across 4 Processors: Block Distribution

Data

x

=







x1

,

1 x1

,

2 x1

,

3 x

2 ,

1 x

2 ,

2 x

2 ,

3 x

3 ,

1 x

3 ,

2 x

3 ,

3 x

4 ,

1 x

4 ,

2 x

4 ,

3 x

5 ,

1 x

5 ,

2 x

5 ,

3 x

6 ,

1 x

6 ,

2 x

6 ,

3 x

7 ,

1 x

7 ,

2 x

7 ,

3 x

8 ,

1 x

8 ,

2 x

8 ,

3 x

9 ,

1 x

9 ,

2 x

9 ,

3 x

10 ,

1 x

10 ,

2 x

10 ,

3 





10 ×

3 Processors

0

1

2

3

(87)

Distributing a Matrix Across 4 Processors: Local Load Balance

Data

x

=







x1

,

1 x1

,

2 x1

,

3 x

2 ,

1 x

2 ,

2 x

2 ,

3 x

3 ,

1 x

3 ,

2 x

3 ,

3 x

4 ,

1 x

4 ,

2 x

4 ,

3 x

5 ,

1 x

5 ,

2 x

5 ,

3 x

6 ,

1 x

6 ,

2 x

6 ,

3 x

7 ,

1 x

7 ,

2 x

7 ,

3 x

8 ,

1 x

8 ,

2 x

8 ,

3 x

9 ,

1 x

9 ,

2 x

9 ,

3 x

10 ,

1 x

10 ,

2 x

10 ,

3 





10 ×

3 Processors

0

1

2

3

(88)

The

GBD

Data Structure

Throughout the examples, we will make use of the Generalized Block Distribution, or

GBD

distributed matrix structure.

1

GBD

is

distributed

. No processor owns all the data.

2

GBD

is

non-overlapping

. Rows uniquely assigned to

processors.

3

GBD

is

row-contiguous

. If a processor owns one element

of a row, it owns the entire row.

4

GBD

is globally

row-major

, locally

column-major

.

5

GBD

is often

locally balanced

, where each processor

owns (almost) the same amount of data. But this is

not required.







x

1,1

x

1,2

x

1,3

x

2,1

x

2,2

x

2,3

x

3,1

x

3,2

x

3,3

x

4,1

x

4,2

x

4,3

x

5,1

x

5,2

x

5,3

x

6,1

x

6,2

x

6,3

x

7,1

x

7,2

x

7,3

x

8,1

x

8,2

x

8,3

x9

,1

x9

,2

x9

,3

x10

,1

x10

,2

x10

,3







6

The last row of the local storage of a processor is adjacent (by global row) to the first row

of the local storage of next processor (by communicator number) that owns data.

7

GBD

is (relatively) easy to understand, but can lead to bottlenecks if you have many more

columns than rows.

(89)

5 The Generalized Block Distribution

GBD: a Way to Distribute Your Data

Example GBD Distributions

Summary

(90)

GBD Example GBD Distributions

Understanding GBD: Global Matrix

x

=







x

11 x

12 x

13 x

14 x

15 x

16 x

17 x

18 x

19 x

21 x

22 x

23 x

24 x

25 x

26 x

27 x

28 x

29 x

31 x

32 x

33 x

34 x

35 x

36 x

37 x

38 x

39 x

41 x

42 x

43 x

44 x

45 x

46 x

47 x

48 x

49 x

51 x

52 x

53 x

54 x

55 x

56 x

57 x

58 x

59 x

61 x

62 x

63 x

64 x

65 x

66 x

67 x

68 x

69 x

71 x

72 x

73 x

74 x

75 x

76 x

77 x

78 x

79 x

81 x

82 x

83 x

84 x

85 x

86 x

87 x

88 x

89 x

91 x

92 x

93 x

94 x

95 x

96 x

97 x

98 x

99 





9 ×

9 Processors =

0

1

2

3

4

5

(91)

Understanding GBD: Load Balanced GBD

x

=







x11

x12

x13

x14

x15

x16

x17

x18

x19

x21

x22

x23

x24

x25

x26

x27

x28

x29

x

31 x

32 x

33 x

34 x

35 x

36 x

37 x

38 x

39 x

41 x

42 x

43 x

44 x

45 x

46 x

47 x

48 x

49 x

51 x

52 x

53 x

54 x

55 x

56 x

57 x

58 x

59 x

61 x

62 x

63 x

64 x

65 x

66 x

67 x

68 x

69 x

71 x

72 x

73 x

74 x

75 x

76 x

77 x

78 x

79 x

81 x

82 x

83 x

84 x

85 x

86 x

87 x

88 x

89 x

91 x

92 x

93 x

94 x

95 x

96 x

97 x

98 x

99 





9 ×

9 Processors =

0

1

2

3

4

5

(92)

Understanding GBD: Local View

x11

x12

x13

x14

x15

x16

x17

x18

x19

x21

x22

x23

x24

x25

x26

x27

x28

x29

2 ×

9 x

31 x

32 x

33 x

34 x

35 x

36 x

37 x

38 x

39 x

41 x

42 x

43 x

44 x

45 x

46 x

47 x

48 x

49

2 ×

9 x

51 x

52 x

53 x

54 x

55 x

56 x

57 x

58 x

59 x

61 x

62 x

63 x

64 x

65 x

66 x

67 x

68 x

69

2 ×

9 x

71 x

72 x

73 x

74 x

75 x

76 x

77 x

78 x

79

1 ×

9 x81

x82

x83

x84

x85

x86

x87

x88

x89

₁

_×

₉

x

91 x

92 x

93 x

94 x

95 x

96 x

97 x

98 x

99

1 ×

9 Processors =

0

1

2

3

4

5

(93)

Understanding GBD: Non-Balanced GBD

x

=







x

11 x

12 x

13 x

14 x

15 x

16 x

17 x

18 x

19 x

21 x

22 x

23 x

24 x

25 x

26 x

27 x

28 x

29 x

31 x

32 x

33 x

34 x

35 x

36 x

37 x

38 x

39 x

41 x

42 x

43 x

44 x

45 x

46 x

47 x

48 x

49 x

51 x

52 x

53 x

54 x

55 x

56 x

57 x

58 x

59 x

61 x

62 x

63 x

64 x

65 x

66 x

67 x

68 x

69 x

71 x

72 x

73 x

74 x

75 x

76 x

77 x

78 x

79 x

81 x

82 x

83 x

84 x

85 x

86 x

87 x

88 x

89 x

91 x

92 x

93 x

94 x

95 x

96 x

97 x

98 x

99 





9 ×

9 Processors =

0

1

2

3

4

5

(94)

Understanding GBD: Local View

0 ×

9 





x

11 x

12 x

13 x

14 x

15 x

16 x

17 x

18 x

19 x

21 x

22 x

23 x

24 x

25 x

26 x

27 x

28 x

29 x

31 x

32 x

33 x

34 x

35 x

36 x

37 x

38 x

39 x

41 x

42 x

43 x

44 x

45 x

46 x

47 x

48 x

49 





4 ×

9 x

51 x

52 x

53 x

54 x

55 x

56 x

57 x

58 x

59 x

61 x

62 x

63 x

64 x

65 x

66 x

67 x

68 x

69

2 ×

9 x

71 x

72 x

73 x

74 x

75 x

76 x

77 x

78 x

79

1 ×

9

0 ×

9 x

81 x

82 x

83 x

84 x

85 x

86 x

87 x

88 x

89 x91

x92

x93

x94

x95

x96

x97

x98

x99

2 ×

9 Processors =

0

1

2

3

4

5

(95)

5 The Generalized Block Distribution

GBD: a Way to Distribute Your Data

Example GBD Distributions

Summary

(96)

GBD Summary

Summary

Need to distribute your data? Try splitting by row.

May not work well if your data is square (or longer than tall).

(97)

6 Basic Statistics Examples

pbdMPI Example: Monte Carlo Simulation

pbdMPI Example: Sample Covariance

pbdMPI Example: Linear Regression

Summary

(98)

Basic Statistics Examples pbdMPI Example: Monte Carlo Simulation

6 Basic Statistics Examples

pbdMPI Example: Monte Carlo Simulation

pbdMPI Example: Sample Covariance

pbdMPI Example: Linear Regression

Summary

(99)

Example 1: Monte Carlo Simulation

Sample

N

uniform observations (

x

i

,

y

i

) in the unit square [0

,

1]

×

[0

,

1].

Then

π

≈

4 #

Inside Circle

#

Total

= 4

# Blue

+

# Red

0.0 0.2 0.4 0.6 0.8 1.0 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0.0 0.2 0.4 0.6 0.8 1.0

(100)

Example 1: Monte Carlo Simulation GBD Algorithm

1

Let

n

be big-ish; we’ll take

n

= 50

,

000.

2

_{Generate an}

_n

×

_{2 matrix}

_x

_{of standard uniform observations.}

3

Count the number of rows satisfying

x

2 +

y

2 ≤

1

4

Ask everyone else what their answer is; sum it all up.

5

Take this new answer, multiply by 4 and divide by

n

6

If my rank is 0, print the result.

(101)

Example 1: Monte Carlo Simulation Code

Serial Code

1

N

< -

5 0 0 0 0

2

X

< - m a t r i x

(

r u n i f

( N

*

2) ,

n c o l

=2)

3

r

< - sum

( r o w S u m s ( X ^2) <= 1)

4

PI

< -

4 *

r

/

N

5

p r i n t

( PI )

Parallel Code

1

l i b r a r y

( pbdMPI , q u i e t = T R U E )

2

i n i t ()

3

c o m m .

set

. s e e d ( s e e d = 1 2 3 4 5 6 7 ,

d i f f

= T R U E )

4 5

N . gbd

< -

5 0 0 0 0

/

c o m m . s i z e ()

6

X . gbd

< - m a t r i x

(

r u n i f

( N . gbd

*

2) ,

n c o l

= 2)

7

r . gbd

< - sum

( r o w S u m s ( X . gbd ^2) <= 1)

8

r

< -

a l l r e d u c e ( r . gbd )

9

PI

< -

4 *

r

/

( N . gbd

*

c o m m . s i z e () )

10

c o m m .

p r i n t

( PI )

11 12

f i n a l i z e ()

(102)

Note

For the remainder, we will exclude loading, init, and finalize calls.

(103)

6 Basic Statistics Examples

pbdMPI Example: Monte Carlo Simulation

pbdMPI Example: Sample Covariance

pbdMPI Example: Linear Regression

Summary

(104)

Basic Statistics Examples pbdMPI Example: Sample Covariance

Example 2: Sample Covariance

cov

(

x

n

×

p

) =

1 n

₋

1 n

X

i

=1

(

x

i

−

µ

x

) (

x

i

−

µ

x

)

T

(105)

Example 2: Sample Covariance GBD Algorithm

1

_{Determine the total number of rows}

_N

_.

2

Compute the vector of column means of the full matrix.

3

Subtract each column’s mean from that column’s entries in each local

matrix.

4

Compute the crossproduct locally and reduce.

5

Divide by

N

−

1.

(106)

Basic Statistics Examples pbdMPI Example: Sample Covariance

Example 2: Sample Covariance Code

Serial Code

1

N

< - n r o w

( X )

2

mu

< -

c o l S u m s ( X )

/

N

3 4

X

< - s w e e p

( X , S T A T S = mu , M A R G I N =2)

5

Cov . X

< - c r o s s p r o d

( X )

/

( N -1)

6 7

p r i n t

( Cov . X )

Parallel Code

1

N

< -

a l l r e d u c e (

n r o w

( X . gbd ) , op =

" sum "

)

2

mu

< -

a l l r e d u c e ( c o l S u m s ( X . gbd )

/

N , op =

" sum "

)

3 4

X . gbd

< - s w e e p

( X . gbd , S T A T S = mu , M A R G I N =2)

5

Cov . X

< -

a l l r e d u c e (

c r o s s p r o d

( X . gbd ) , op =

" sum "

)

/

( N -1)

6 7

c o m m .

p r i n t

( Cov . X )

(107)

6 Basic Statistics Examples

pbdMPI Example: Monte Carlo Simulation

pbdMPI Example: Sample Covariance

pbdMPI Example: Linear Regression

Summary

(108)

Basic Statistics Examples pbdMPI Example: Linear Regression

Example 3: Linear Regression

Find

β

such that

y

=

Xβ

+

When

X

is full rank,

ˆ

β

= (X

T

X)

−

1 X

T

y