Drew Schmidt and George Ostrouchov
useR! 2014
The
p
p
pb
p
p
p
b
b
b
b
b
d
d
d
d
d
d
R
R
R
R
R
R
Core Team
Wei-Chen Chen
1
George Ostrouchov
2
,
3
Pragneshkumar Patel
3
Drew Schmidt
3
Support
This work used resources ofNational Institute for Computational Sciencesat the University of Tennessee, Knoxville, which is supported by the Office of Cyberinfrastructure of the U.S. National Science Foundation under Award No.
ARRA-NSF-OCI-0906324 for NICS-RDAV center. This work also used resources of theOak Ridge Leadership Computing Facility
at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725.
1Department of Ecology and Evolutionary Biology
University of Tennessee, Knoxville TN, USA
2Computer Science and Mathematics Division
Oak Ridge National Laboratory, Oak Ridge TN, USA
3Joint Institute for Computational Sciences
University of Tennessee, Knoxville TN, USA
Downloads
This presentation is available at:
http://r-pbd.org/tutorial
About This Presentation
Installation Instructions
Installation instructions for setting up a p
p
p
p
p
p
b
b
b
b
b
b
d
d
d
d
d
d
R
R
R
R
R
R
environment are available:
http://r-pbd.org/install.html
This includes instructions for installing R, MPI, and p
p
p
p
p
p
b
b
b
b
b
b
d
d
d
d
d
d
R
R
R
R
R
R.
1
Introduction
2
Profiling and Benchmarking
3
The pbdR Project
4
Introduction to pbdMPI
5
The Generalized Block Distribution
6
Basic Statistics Examples
7
Data Input
8
Introduction to pbdDMAT and the ddmatrix Structure
9
Examples Using pbdDMAT
10
MPI Profiling
11
Wrapup
Introduction
Contents
1
Introduction
A Concise Introduction to Parallelism
A Quick Overview of Parallel Hardware
A Quick Overview of Parallel Software
Summary
1
Introduction
A Concise Introduction to Parallelism
A Quick Overview of Parallel Hardware
A Quick Overview of Parallel Software
Summary
Introduction A Concise Introduction to Parallelism
Parallelism
Serial Programming
Parallel Programming
Difficulty in Parallelism
1
Implicit parallelism
: Parallel details hidden from user
2
Explicit parallelism
: Some assembly required. . .
3
Embarrassingly Parallel
: Also called
loosely coupled
. Obvious how to
make parallel; lots of independence in computations.
4
Tightly Coupled
: Opposite of embarrassingly parallel; lots of
dependence in computations.
Introduction A Concise Introduction to Parallelism
Speedup
Wallclock Time
: Time of the clock on the wall from start to finish
Speedup
: unitless measure of improvement; more is better.
Sn
1,
n
2=
Run time for
n
1
cores
Run time for
n
2
cores
n
1
is often taken to be 1
In this case, comparing parallel algorithm to serial algorithm
Good Speedup
● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 2 4 6 8 2 4 6 8 Cores Speedup group Application OptimalBad Speedup
● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 2 4 6 8 2 4 6 8 Cores Speedup group Application OptimalIntroduction A Concise Introduction to Parallelism
Scalability and Benchmarking
1
Strong
: Fixed
total
problem size.
Less work per core as more cores are added.
2
Weak
: Fixed
local
(per core) problem size.
Same work per core as more cores are added.
Good Strong Scaling
●● ● ● ● ● ● 0 10 20 30 40 50 0 20 40 60 Cores SpeedupGood Weak Scaling
● ● ● ● ● 2950 3000 3050 3100 4000 8000 12000 16000 Cores Time
Introduction A Concise Introduction to Parallelism
Shared and Distributed Memory Machines
Shared Memory
Direct access to read/change
memory (one node)
Distributed
No direct access to read/change
memory (many nodes); requires
communication
Shared Memory Machines
Thousands of cores
Nautilus, University of Tennessee 1024 cores 4 TB RAM
Distributed Memory Machines
Hundreds of thousands of cores
Kraken, University of Tennessee 112,896 cores 147 TB RAM
Introduction A Quick Overview of Parallel Hardware
1
Introduction
A Concise Introduction to Parallelism
A Quick Overview of Parallel Hardware
A Quick Overview of Parallel Software
Summary
Interconnection Network PROC + cache PROC + cache PROC + cache PROC + cache
Mem Mem Mem Mem
Distributed Memory
Memory
CORE
+ cache + cacheCORE + cacheCORE + cacheCORE
Network
Shared Memory Local Memory
GPU or MIC
Co-Processor
GPU: Graphical Processing Unit MIC: Many Integrated Core
Introduction A Quick Overview of Parallel Hardware
Your Laptop or Desktop
Interconnection Network PROC + cache PROC + cache PROC + cache PROC + cache
Mem Mem Mem Mem
Distributed Memory
Memory
CORE
+ cache + cacheCORE + cacheCORE + cacheCORE
Network
Shared Memory Local Memory
GPU or MIC
Co-Processor
GPU: Graphical Processing Unit MIC: Many Integrated Core
Interconnection Network PROC + cache PROC + cache PROC + cache PROC + cache
Mem Mem Mem Mem
Distributed Memory
Memory
CORE
+ cache + cacheCORE + cacheCORE + cacheCORE
Network
Shared Memory Local Memory
GPU or MIC
Co-Processor
GPU: Graphical Processing Unit MIC: Many Integrated Core
Introduction A Quick Overview of Parallel Hardware
Server to Supercomputer
Interconnection Network PROC + cache PROC + cache PROC + cache PROC + cacheMem Mem Mem Mem
Distributed Memory
Memory
CORE
+ cache + cacheCORE + cacheCORE + cacheCORE
Network
Shared Memory Local Memory
GPU or MIC
Co-Processor
GPU: Graphical Processing Unit MIC: Many Integrated Core
Interconnection Network PROC + cache PROC + cache PROC + cache PROC + cache
Mem Mem Mem Mem
Distributed Memory
Memory
CORE
+ cache + cacheCORE + cacheCORE + cacheCORE
Network
Shared Memory Local Memory
GPU or MIC
Co-Processor
GPU: Graphical Processing Unit MIC: Many Integrated Core
Clus
ter
Mult
icor
e
GPU
or M
any
core
“Dis
tribu
ting
”
‘Mul
tithr
ead
ing”
“Off
load
ing”
Introduction A Quick Overview of Parallel Software
1
Introduction
A Concise Introduction to Parallelism
A Quick Overview of Parallel Hardware
A Quick Overview of Parallel Software
Summary
Interconnection Network PROC + cache PROC + cache PROC + cache PROC + cache
Mem Mem Mem Mem Distributed Memory
Memory CORE
+ cache + cacheCORE + cacheCORE + cacheCORE
Network
Shared Memory Local Memory
GPU or MIC Co-Processor
GPU: Graphical Processing Unit MIC: Many Integrated Core
Focus on who owns what data and what communication is needed
Focus on which tasks can be parallel
Same Task on Blocks of data
Sockets MPI Hadoop CUDA OpenCL OpenACC OpenMP OpenACC OpenMP Pthreads fork
Introduction A Quick Overview of Parallel Software
30+ Years of Parallel Computing Research
Focus on who owns what data and what communication is needed
Focus on which tasks can be parallel
Same Task on Blocks of data
Sockets MPI Hadoop Interconnection Network PROC + cache PROC + cache PROC + cache PROC + cache
Mem Mem Mem Mem Distributed Memory
Memory CORE
+ cache + cacheCORE + cacheCORE + cacheCORE
Network
Shared Memory Local Memory
GPU or MIC Co-Processor
GPU: Graphical Processing Unit MIC: Many Integrated Core
CUDA OpenCL OpenACC OpenMP OpenACC OpenMP Pthreads fork
Focus on who owns what data and what communication is needed
Focus on which tasks can be parallel
Same Task on Blocks of data
Sockets MPI Hadoop Interconnection Network PROC + cache PROC + cache PROC + cache PROC + cache
Mem Mem Mem Mem Distributed Memory
Memory CORE
+ cache + cacheCORE + cacheCORE + cacheCORE
Network
Shared Memory Local Memory
GPU or MIC Co-Processor
GPU: Graphical Processing Unit MIC: Many Integrated Core
CUDA OpenCL OpenACC OpenMP OpenACC OpenMP Pthreads fork
Introduction A Quick Overview of Parallel Software
Putting It All Together Challenge
Focus on who owns what data and what communication is needed
Focus on which tasks can be parallel
Same Task on Blocks of data
Sockets MPI Hadoop Interconnection Network PROC + cache PROC + cache PROC + cache PROC + cache
Mem Mem Mem Mem Distributed Memory
Memory CORE
+ cache + cacheCORE + cacheCORE + cacheCORE
Network
Shared Memory Local Memory
GPU or MIC Co-Processor
GPU: Graphical Processing Unit MIC: Many Integrated Core
CUDA OpenCL OpenACC OpenMP OpenACC OpenMP Pthreads fork
Interconnection Network PROC + cache PROC + cache PROC + cache PROC + cache
Mem Mem Mem Mem Distributed Memory
Memory CORE
+ cache + cacheCORE + cacheCORE + cacheCORE
Network
Shared Memory Local Memory
GPU or MIC Co-Processor
GPU: Graphical Processing Unit MIC: Many Integrated Core
Focus on who owns what data and what communication is needed
Focus on which tasks can be parallel
Same Task on Blocks of data
Sockets MPI Hadoop snow Rmpi pbdMPI
snow + multicore = parallel
CUDA OpenCL OpenACC OpenMP OpenACC OpenMP Pthreads fork multicore RHadoop Foreign Langauge Interfaces: .C .Call Rcpp OpenCL inline . . .
Introduction Summary
1
Introduction
A Concise Introduction to Parallelism
A Quick Overview of Parallel Hardware
A Quick Overview of Parallel Software
Summary
Summary
Three flavors of hardware
Distributed is stable
Multicore and co-processor are evolving
Two memory models
Distributed works in multicore
Parallelism hierarchy
Medium to big machines have all three
Profiling and Benchmarking
Contents
2
Profiling and Benchmarking
Why Profile?
Profiling R Code
Advanced R Profiling
Summary
2
Profiling and Benchmarking
Why Profile?
Profiling R Code
Advanced R Profiling
Summary
Profiling and Benchmarking Why Profile?
Performance and Accuracy
Sometimes
π
= 3
.
14
is (a) infinitely faster
than the “correct” answer and (b) the
differ-ence between the “correct” and the “wrong”
answer is meaningless. . . . The thing is, some
specious value of “correctness” is often
irrel-evant because it doesn’t matter. While
per-formance almost always matters. And I
ab-solutely detest the fact that people so often
dismiss performance concerns so readily.
— Linus Torvalds, August 8, 2008
Why Profile?
Because performance matters.
Bad practices scale up!
Your bottlenecks may surprise you.
Because R is dumb.
R users claim to be data people. . . so act like it!
Profiling and Benchmarking Why Profile?
Compilers often correct bad behavior. . .
A Really Dumb Loop
int
m a i n () {
int
x , i ;
for
( i =0; i < 1 0 ; i ++)
x = 1;
r e t u r n
0;
}
clang -O3 example.c
m a i n :
. cfi
_
s t a r t p r o c
# BB #0:
x o r l
% eax ,
% eax
ret
clang example.c
m a i n :
. cfi
_
s t a r t p r o c
# BB #0:
m o v l
$
0 , -4(% rsp )
m o v l
$
0 ,
-12(% rsp )
. L B B 0
_
1:
c m p l
$
10 ,
-12(% rsp )
jge
. L B B 0
_
4
# BB #2:
m o v l
$
1 , -8(% rsp )
# BB #3:
m o v l
-12(% rsp ) , % eax
a d d l
$
1 , % eax
m o v l
% eax ,
-12(% rsp )
jmp
. L B B 0
_
1
. L B B 0
_
4:
m o v l
$
0 , % eax
ret
Dumb Loop
1for
( i in 1: n ) {
2tA <- t(A)
3Y
< -
tA %
* % Q
4Q < - qr
.
Q
(
qr
( Y ) )
5Y
< -
A %
* % Q
6Q < - qr
.
Q
(
qr
( Y ) )
7}
8 9Q
Better Loop
1tA <- t(A)
2 3for
( i in 1: n ) {
4Y
< -
tA %
* % Q
5Q < - qr
.
Q
(
qr
( Y ) )
6Y
< -
A %
* % Q
7Q < - qr
.
Q
(
qr
( Y ) )
8}
9 10Q
Profiling and Benchmarking Why Profile?
Example from a Real R Package
Exerpt from Original function
1
w h i l e
( i <= N ) {
2
for
( j in 1: i ) {
3
d.k <- as.matrix(x)[l==j,l==j]
4
...
Exerpt from Modified function
1
x.mat <- as.matrix(x)
2 3w h i l e
( i <= N ) {
4for
( j in 1: i ) {
5d.k <- x.mat[l==j,l==j]
6...
By changing just 1 line of
code, performance of the
main method improved by
over 350%!
Some Thoughts
R is slow.
Bad programmers are slower.
R isn’t very clever (compared to a compiler).
The Bytecode compiler helps, but not nearly as much as a compiler.
Profiling and Benchmarking Profiling R Code
2
Profiling and Benchmarking
Why Profile?
Profiling R Code
Advanced R Profiling
Summary
Timings
Getting simple timings as a basic measure of performance is easy, and
valuable.
system.time()
— timing blocks of code.
Rprof()
— timing execution of R functions.
Rprofmem()
— reporting memory allocation in R .
tracemem()
— detect when a copy of an R object is created.
The
rbenchmark
package — Benchmark comparisons.
Profiling and Benchmarking Advanced R Profiling
2
Profiling and Benchmarking
Why Profile?
Profiling R Code
Advanced R Profiling
Summary
Other Profiling Tools
perf (Linux)
PAPI
MPI profiling: fpmpi, mpiP, TAU
Profiling and Benchmarking Advanced R Profiling
Profiling MPI Codes with
pbdPROF
1. Rebuild p
p
p
p
p
p
b
b
b
b
b
b
d
d
d
d
d
d
R
R
R
R
R
R
packages
R CMD I N S T A L L p b d M P I
_
0.2 -1. tar . gz \
- - c o n f i g u r e - a r g s = \
" - - enable - p b d P R O F "
2. Run code
m p i r u n - np 64 R s c r i p t my
_
s c r i p t . R
3. Analyze results
1l i b r a r y
( p b d P R O F )
2p r o f
< - r e a d
. p r o f (
" o u t p u t . m p i P "
)
3p l o t
( prof ,
p l o t
. t y p e =
" m e s s a g e s 2 "
)
Profiling with
pbdPAPI
Performance Application Programming
Interface
High and low level interfaces
Linux only :(
Function
Description of Measurement
system.flips()
Time, floating point instructions, and Mflips
system.flops()
Time, floating point operations, and Mflops
system.cache()
Cache misses, hits, accesses, and reads
system.epc()
Events per cycle
system.idle()
Idle cycles
system.cpuormem()
CPU or RAM bound
∗
system.utilization()
CPU utilization
∗
Profiling and Benchmarking Summary
2
Profiling and Benchmarking
Why Profile?
Profiling R Code
Advanced R Profiling
Summary
Summary
Profile, profile, profile.
Use
system.time()
to get a general sense of a method.
Use
rbenchmark’s
benchmark()
to compare 2 methods.
Use
Rprof()
for more detailed profiling.
Other tools exist for more hardcore applications (pbdPAPI
and
pbdPROF).
The pbdR Project
Contents
3
The pbdR Project
The pbdR Project
pbdR Connects R to HPC Libraries
Using pbdR
Summary
3
The pbdR Project
The pbdR Project
pbdR Connects R to HPC Libraries
Using pbdR
Summary
The pbdR Project The pbdR Project
Programming with Big Data in R (pbdR)
Striving for
Productivity, Portability, Performance
Free
a
R packages.
Bridging high-performance compiled code
with high-productivity of R
Scalable, big data analytics.
Offers implicit and explicit parallelism.
Methods have syntax
identical
to R.
a
MPL, BSD, and GPL licensed
pbdR Packages
The pbdR Project The pbdR Project
pbdR Motivation
Why HPC libraries (MPI, ScaLAPACK, PETSc, . . . )?
The HPC community has been at this for decades.
They’re tested. They work. They’re fast.
You’re not going to beat Jack Dongarra at dense linear algebra.
Simple Interface for MPI Operations with
pbdMPI
Rmpi
1# int
2mpi . a l l r e d u c e ( x , t y p e =1)
3# d o u b l e
4mpi . a l l r e d u c e ( x , t y p e =2)
pbdMPI
1a l l r e d u c e ( x )
Types in R
1>
is
.
i n t e g e r
(1)
2[1] F A L S E
3>
is
.
i n t e g e r
(2)
4[1] F A L S E
5>
is
.
i n t e g e r
( 1 : 2 )
6[1] T R U E
The pbdR Project The pbdR Project
Distributed Matrices and Statistics with
pbdDMAT
Matrix Exponentiation
●● ● ● ● ● ● ● ● ● 1 24 8 12 16 24 32 48 64 Cores Nodes ●1●2●3●4Speedup (Relative to Serial Code)
1
l i b r a r y
( p b d D M A T )
2
3
dx
< -
d d m a t r i x (
" r n o r m "
, 5000 , 5 0 0 0 )
4
e x p m ( dx )
Distributed Matrices and Statistics with
pbdDMAT
Least Squares Benchmark
21.3442.6885.35 170.7 341.41 1016.09 21.3442.68 85.35 170.7 341.41 1016.09 21.3442.6885.35 170.7 341.41 1016.09 25 50 75 100 125 5041008 2016 4032 8064 24000 5041008 2016 4032 8064 24000 5041008 2016 4032 8064 24000 Cores
Run Time (Seconds)
Predictors 500 1000 2000
Fitting y~x With Fixed Local Size of ~43.4 MiB
x
<
−
d d m a t r i x (
” r n o r m ”
,
nrow
=m,
n c o l
=n )
y
<
−
d d m a t r i x (
” r n o r m ”
,
nrow
=m,
n c o l
=1)
mdl
<
−
lm
. f i t ( x=x , y=y )
The pbdR Project The pbdR Project
Getting Started with HPC for R Users:
pbdDEMO
140 page, textbook-style vignette.
Over 30 demos, utilizing all
∗
packages.
Not just a hello world!
Demos include:
PCA
Regression
Parallel data input
Model-based clustering
Simple Monte Carlo simulation
Bayesian MCMC
3
The pbdR Project
The pbdR Project
pbdR Connects R to HPC Libraries
Using pbdR
Summary
The pbdR Project pbdR Connects R to HPC Libraries
R and
p
p
p
p
p
p
b
b
b
b
b
b
d
d
d
d
d
d
R
R
R
R
R
R
Interfaces to HPC Libraries
Interconnection Network PROC
+ cache + cachePROC + cachePROC + cachePROC
Mem Mem Mem Mem Distributed Memory
Memory CORE
+ cache + cacheCORE + cacheCORE + cacheCORE
Network
Shared Memory Local Memory
GPU or MIC Co-Processor
GPU: Graphical Processing Unit MIC: Many Integrated Core
Trilinos PETSc PLASMA DPLASMA MKL LibSci ScaLAPACK PBLAS BLACS cuBLAS MAGMA PAPI Tau MPI mpiP fpmpi NetCDF4 ADIOS magma HiPLAR HiPLARM pbdMPI pbdPAPI pbdNCDF4 pbdADIOS ppbbddPROFPROF pbdPROF pbdDEMO ACML pbdDMAT pbdDMAT pbdDMAT pbdDMAT
3
The pbdR Project
The pbdR Project
pbdR Connects R to HPC Libraries
Using pbdR
Summary
The pbdR Project Using pbdR
pbdR Paradigms
p
p
p
p
p
p
b
b
b
b
b
b
d
d
d
d
d
d
R
R
R
R
R
R
programs are R programs!
Differences:
Batch execution (non-interactive).
Parallel code utilizes Single Program/Multiple Data (SPMD) style
Emphasizes data parallelism.
Batch Execution
Running a serial R program in batch:
1R s c r i p t my
_
s c r i p t . r
or
1
R CMD B A T C H my
_
s c r i p t . r
Running a parallel (with MPI) R program in batch:
1m p i r u n - np 2 R s c r i p t my
_
par
_
s c r i p t . r
The pbdR Project Using pbdR
Single Program/Multiple Data (SPMD)
SPMD is a programming
paradigm
.
Not to be confused with SIMD.
Paradigms
Programming models
OOP, Functional, SPMD, . . .
SIMD
Hardware instructions
MMX, SSE, . . .
Single Program/Multiple Data (SPMD)
SPMD is arguably the simplest extension of serial programming.
Only one program is written, executed in batch on all processors.
Different processors are autonomous; there is no manager.
Dominant programming model for large machines for 30 years.
The pbdR Project Summary
Summary
p
p
p
p
p
p
b
b
b
b
b
b
d
d
d
d
d
d
R
R
R
R
R
R
connects R to scalable HPC libraries.
The
pbdDEMO
package offers numerous examples and explanations
for getting started with distributed R programming.
p
p
p
p
p
p
b
b
b
b
b
b
d
d
d
d
d
d
R
R
R
R
R
R
programs are R programs.
4
Introduction to pbdMPI
Managing a Communicator
Reduce, Gather, Broadcast, and Barrier
Other pbdMPI Tools
Summary
Introduction to pbdMPI Managing a Communicator
4
Introduction to pbdMPI
Managing a Communicator
Reduce, Gather, Broadcast, and Barrier
Other pbdMPI Tools
Summary
Message Passing Interface (MPI)
MPI
: Standard for managing communications (data and instructions)
between different nodes/computers.
Implementations
: OpenMPI, MPICH2, Cray MPT, . . .
Enables parallelism (via communication) on distributed machines.
Communicator
: manages communications between processors.
Introduction to pbdMPI Managing a Communicator
MPI Operations (1 of 2)
Managing a Communicator: Create and destroy communicators.
init()
— initialize communicator
finalize()
— shut down communicator(s)
Rank query: determine the processor’s position in the communicator.
comm.rank()
— “who am I?”
comm.size()
— “how many of us are there?”
Printing: Printing output from various ranks.
comm.print(x)
comm.cat(x)
WARNING: only use these functions on
results
, never on
yet-to-be-computed things.
Quick Example 1
Rank Query: 1 rank.r
1
l i b r a r y
( pbdMPI , q u i e t l y = T R U E )
2i n i t ()
3 4my .
r a n k < -
c o m m .
r a n k
()
5c o m m .
p r i n t
( my .
rank
,
all
.
r a n k
= T R U E )
6 7f i n a l i z e ()
Execute this script via:
1
m p i r u n - np 2 R s c r i p t 1 _ r a n k . r
Sample Output:
1C O M M . R A N K = 0
2[1] 0
3C O M M . R A N K = 1
4[1] 1
Introduction to pbdMPI Managing a Communicator
Quick Example 2
Hello World: 2 hello.r
1
l i b r a r y
( pbdMPI , q u i e t l y = T R U E )
2i n i t ()
3 4c o m m .
p r i n t
(
" Hello , w o r l d "
)
5 6c o m m .
p r i n t
(
" H e l l o a g a i n "
,
all
.
r a n k
= TRUE , q u i e t l y = T R U E )
7 8f i n a l i z e ()
Execute this script via:
1
m p i r u n - np 2 R s c r i p t 2 _ h e l l o . r
Sample Output:
1C O M M . R A N K = 0
2[1]
" Hello , w o r l d "
3[1]
" H e l l o a g a i n "
4[1]
" H e l l o a g a i n "
4
Introduction to pbdMPI
Managing a Communicator
Reduce, Gather, Broadcast, and Barrier
Other pbdMPI Tools
Summary
Introduction to pbdMPI Reduce, Gather, Broadcast, and Barrier
MPI Operations
1Reduce
2Gather
3Broadcast
4Barrier
Reductions — Combine results into single result
Introduction to pbdMPI Reduce, Gather, Broadcast, and Barrier
Gather — Many-to-one
Broadcast — One-to-many
Introduction to pbdMPI Reduce, Gather, Broadcast, and Barrier
Barrier — Synchronization
MPI Operations (2 of 2)
Reduction: each processor has a number
x; add all of them up, find
the largest/smallest, . . . .
reduce(x, op=’sum’)
— reduce to one
allreduce(x, op=’sum’)
— reduce to all
Gather: each processor has a number; create a new object on some
processor containing all of those numbers.
gather(x)
— gather to one
allgather(x)
— gather to all
Broadcast: one processor has a number
x
that every other processor
should also have.
bcast(x)
Barrier: “computation wall”; no processor can proceed until
all
processors can proceed.
barrier()
Introduction to pbdMPI Reduce, Gather, Broadcast, and Barrier
Quick Example 3
Reduce and Gather: 3 gt.r
1
l i b r a r y
( pbdMPI , q u i e t l y = T R U E )
2i n i t ()
3 4c o m m .
set
. s e e d (
d i f f
= T R U E )
5 6n
< - s a m p l e
(1:10 , s i z e =1)
7 8gt
< -
g a t h e r ( n )
9c o m m .
p r i n t
(
u n l i s t
( gt ) )
10 11sm
< -
a l l r e d u c e ( n , op =
’ sum ’
)
12c o m m .
p r i n t
( sm ,
all
.
r a n k
= T )
13 14f i n a l i z e ()
Execute this script via:
1
m p i r u n - np 2 R s c r i p t 3 _ gt . r
Sample Output:
1C O M M . R A N K = 0
2[1] 2 8
3C O M M . R A N K = 0
4[1] 10
5C O M M . R A N K = 1
6[1] 10
Broadcast: 4 bcast.r
1l i b r a r y
( pbdMPI , q u i e t l y = T )
2i n i t ()
3 4if
( c o m m .
r a n k
() = = 0 ) {
5x
< - m a t r i x
(1:4 ,
n r o w
=2)
6}
e l s e
{
7x
< -
N U L L
8}
9 10y
< -
b c a s t ( x ,
r a n k
.
s o u r c e
=0)
11 12c o m m .
p r i n t
( y ,
r a n k
=1)
13 14f i n a l i z e ()
Execute this script via:
1m p i r u n - np 2 R s c r i p t 4 _ b c a s t . r
Sample Output:
1C O M M . R A N K = 1
2[ ,1] [ ,2]
3[1 ,]
1
3
4[2 ,]
2
4
Introduction to pbdMPI Other pbdMPI Tools
4
Introduction to pbdMPI
Managing a Communicator
Reduce, Gather, Broadcast, and Barrier
Other pbdMPI Tools
Summary
Random Seeds
pbdMPI
offers a simple interface for managing random seeds:
comm.set.seed(seed=1234, diff=FALSE)
— All processors use
the same seed.
comm.set.seed(seed=1234, diff=FALSE)
— All processors use
the same seed.
Introduction to pbdMPI Other pbdMPI Tools
Other Helper Tools
pbdMPI
Also contains useful tools for Manager/Worker and task
parallelism codes:
Task Subsetting: Distributing a list of jobs/tasks
get.jid(n)
*ply: Functions in the *ply family.
pbdApply(X, MARGIN, FUN, ...)
— analogue of
apply()
pbdLapply(X, FUN, ...)
— analogue of
lapply()
pbdSapply(X, FUN, ...)
— analogue of
sapply()
4
Introduction to pbdMPI
Managing a Communicator
Reduce, Gather, Broadcast, and Barrier
Other pbdMPI Tools
Summary
Introduction to pbdMPI Summary
Summary
Start by loading the package:
1l i b r a r y
( pbdMPI , q u i e t = T R U E )
Always initialize before starting and finalize when finished:
1i n i t ()
2 3
# ...
45
f i n a l i z e ()
5
The Generalized Block Distribution
GBD: a Way to Distribute Your Data
Example GBD Distributions
Summary
GBD GBD: a Way to Distribute Your Data
5
The Generalized Block Distribution
GBD: a Way to Distribute Your Data
Example GBD Distributions
Summary
Distributing Data
Problem:
How to distribute the data
x
=
x
1
,
1
x
1
,
2
x
1
,
3
x
2
,
1
x
2
,
2
x
2
,
3
x
3
,
1
x
3
,
2
x
3
,
3
x
4
,
1
x
4
,
2
x
4
,
3
x
5
,
1
x
5
,
2
x
5
,
3
x
6
,
1
x
6
,
2
x
6
,
3
x
7
,
1
x
7
,
2
x
7
,
3
x
8
,
1
x
8
,
2
x
8
,
3
x
9
,
1
x
9
,
2
x
9
,
3
x
10
,
1
x
10
,
2
x
10
,
3
10
×
3
?
GBD GBD: a Way to Distribute Your Data
Distributing a Matrix Across 4 Processors: Block Distribution
Data
x
=
x1
,
1
x1
,
2
x1
,
3
x
2
,
1
x
2
,
2
x
2
,
3
x
3
,
1
x
3
,
2
x
3
,
3
x
4
,
1
x
4
,
2
x
4
,
3
x
5
,
1
x
5
,
2
x
5
,
3
x
6
,
1
x
6
,
2
x
6
,
3
x
7
,
1
x
7
,
2
x
7
,
3
x
8
,
1
x
8
,
2
x
8
,
3
x
9
,
1
x
9
,
2
x
9
,
3
x
10
,
1
x
10
,
2
x
10
,
3
10
×
3
Processors
0
1
2
3
Distributing a Matrix Across 4 Processors: Local Load Balance
Data
x
=
x1
,
1
x1
,
2
x1
,
3
x
2
,
1
x
2
,
2
x
2
,
3
x
3
,
1
x
3
,
2
x
3
,
3
x
4
,
1
x
4
,
2
x
4
,
3
x
5
,
1
x
5
,
2
x
5
,
3
x
6
,
1
x
6
,
2
x
6
,
3
x
7
,
1
x
7
,
2
x
7
,
3
x
8
,
1
x
8
,
2
x
8
,
3
x
9
,
1
x
9
,
2
x
9
,
3
x
10
,
1
x
10
,
2
x
10
,
3
10
×
3
Processors
0
1
2
3
GBD GBD: a Way to Distribute Your Data
The
GBD
Data Structure
Throughout the examples, we will make use of the Generalized Block Distribution, or
GBD
distributed matrix structure.
1
GBD
is
distributed
. No processor owns all the data.
2GBD
is
non-overlapping
. Rows uniquely assigned to
processors.
3
GBD
is
row-contiguous
. If a processor owns one element
of a row, it owns the entire row.
4
GBD
is globally
row-major
, locally
column-major
.
5
GBD
is often
locally balanced
, where each processor
owns (almost) the same amount of data. But this is
not required.
x
1,1x
1,2x
1,3x
2,1x
2,2x
2,3x
3,1x
3,2x
3,3x
4,1x
4,2x
4,3x
5,1x
5,2x
5,3x
6,1x
6,2x
6,3x
7,1x
7,2x
7,3x
8,1x
8,2x
8,3x9
,1x9
,2x9
,3x10
,1x10
,2x10
,3
6
The last row of the local storage of a processor is adjacent (by global row) to the first row
of the local storage of next processor (by communicator number) that owns data.
7
GBD
is (relatively) easy to understand, but can lead to bottlenecks if you have many more
columns than rows.
5
The Generalized Block Distribution
GBD: a Way to Distribute Your Data
Example GBD Distributions
Summary
GBD Example GBD Distributions
Understanding GBD: Global Matrix
x
=
x
11
x
12
x
13
x
14
x
15
x
16
x
17
x
18
x
19
x
21
x
22
x
23
x
24
x
25
x
26
x
27
x
28
x
29
x
31
x
32
x
33
x
34
x
35
x
36
x
37
x
38
x
39
x
41
x
42
x
43
x
44
x
45
x
46
x
47
x
48
x
49
x
51
x
52
x
53
x
54
x
55
x
56
x
57
x
58
x
59
x
61
x
62
x
63
x
64
x
65
x
66
x
67
x
68
x
69
x
71
x
72
x
73
x
74
x
75
x
76
x
77
x
78
x
79
x
81
x
82
x
83
x
84
x
85
x
86
x
87
x
88
x
89
x
91
x
92
x
93
x
94
x
95
x
96
x
97
x
98
x
99
9
×
9
Processors =
0
1
2
3
4
5
Understanding GBD: Load Balanced GBD
x
=
x11
x12
x13
x14
x15
x16
x17
x18
x19
x21
x22
x23
x24
x25
x26
x27
x28
x29
x
31
x
32
x
33
x
34
x
35
x
36
x
37
x
38
x
39
x
41
x
42
x
43
x
44
x
45
x
46
x
47
x
48
x
49
x
51
x
52
x
53
x
54
x
55
x
56
x
57
x
58
x
59
x
61
x
62
x
63
x
64
x
65
x
66
x
67
x
68
x
69
x
71
x
72
x
73
x
74
x
75
x
76
x
77
x
78
x
79
x
81
x
82
x
83
x
84
x
85
x
86
x
87
x
88
x
89
x
91
x
92
x
93
x
94
x
95
x
96
x
97
x
98
x
99
9
×
9
Processors =
0
1
2
3
4
5
GBD Example GBD Distributions
Understanding GBD: Local View
x11
x12
x13
x14
x15
x16
x17
x18
x19
x21
x22
x23
x24
x25
x26
x27
x28
x29
2
×
9
x
31
x
32
x
33
x
34
x
35
x
36
x
37
x
38
x
39
x
41
x
42
x
43
x
44
x
45
x
46
x
47
x
48
x
49
2
×
9
x
51
x
52
x
53
x
54
x
55
x
56
x
57
x
58
x
59
x
61
x
62
x
63
x
64
x
65
x
66
x
67
x
68
x
69
2
×
9
x
71
x
72
x
73
x
74
x
75
x
76
x
77
x
78
x
79
1
×
9
x81
x82
x83
x84
x85
x86
x87
x88
x89
1
×
9
x
91
x
92
x
93
x
94
x
95
x
96
x
97
x
98
x
99
1
×
9
Processors =
0
1
2
3
4
5
Understanding GBD: Non-Balanced GBD
x
=
x
11
x
12
x
13
x
14
x
15
x
16
x
17
x
18
x
19
x
21
x
22
x
23
x
24
x
25
x
26
x
27
x
28
x
29
x
31
x
32
x
33
x
34
x
35
x
36
x
37
x
38
x
39
x
41
x
42
x
43
x
44
x
45
x
46
x
47
x
48
x
49
x
51
x
52
x
53
x
54
x
55
x
56
x
57
x
58
x
59
x
61
x
62
x
63
x
64
x
65
x
66
x
67
x
68
x
69
x
71
x
72
x
73
x
74
x
75
x
76
x
77
x
78
x
79
x
81
x
82
x
83
x
84
x
85
x
86
x
87
x
88
x
89
x
91
x
92
x
93
x
94
x
95
x
96
x
97
x
98
x
99
9
×
9
Processors =
0
1
2
3
4
5
GBD Example GBD Distributions
Understanding GBD: Local View
0
×
9
x
11
x
12
x
13
x
14
x
15
x
16
x
17
x
18
x
19
x
21
x
22
x
23
x
24
x
25
x
26
x
27
x
28
x
29
x
31
x
32
x
33
x
34
x
35
x
36
x
37
x
38
x
39
x
41
x
42
x
43
x
44
x
45
x
46
x
47
x
48
x
49
4
×
9
x
51
x
52
x
53
x
54
x
55
x
56
x
57
x
58
x
59
x
61
x
62
x
63
x
64
x
65
x
66
x
67
x
68
x
69
2
×
9
x
71
x
72
x
73
x
74
x
75
x
76
x
77
x
78
x
79
1
×
9
0
×
9
x
81
x
82
x
83
x
84
x
85
x
86
x
87
x
88
x
89
x91
x92
x93
x94
x95
x96
x97
x98
x99
2
×
9
Processors =
0
1
2
3
4
5
5
The Generalized Block Distribution
GBD: a Way to Distribute Your Data
Example GBD Distributions
Summary
GBD Summary
Summary
Need to distribute your data? Try splitting by row.
May not work well if your data is square (or longer than tall).
6
Basic Statistics Examples
pbdMPI Example: Monte Carlo Simulation
pbdMPI Example: Sample Covariance
pbdMPI Example: Linear Regression
Summary
Basic Statistics Examples pbdMPI Example: Monte Carlo Simulation
6
Basic Statistics Examples
pbdMPI Example: Monte Carlo Simulation
pbdMPI Example: Sample Covariance
pbdMPI Example: Linear Regression
Summary
Example 1: Monte Carlo Simulation
Sample
N
uniform observations (
x
i
,
y
i
) in the unit square [0
,
1]
×
[0
,
1].
Then
π
≈
4
#
Inside Circle
#
Total
= 4
# Blue
# Blue
+
# Red
0.0 0.2 0.4 0.6 0.8 1.0 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0.0 0.2 0.4 0.6 0.8 1.0Basic Statistics Examples pbdMPI Example: Monte Carlo Simulation
Example 1: Monte Carlo Simulation GBD Algorithm
1
Let
n
be big-ish; we’ll take
n
= 50
,
000.
2
Generate an
n
×
2 matrix
x
of standard uniform observations.
3Count the number of rows satisfying
x
2
+
y
2
≤
1
4
Ask everyone else what their answer is; sum it all up.
5Take this new answer, multiply by 4 and divide by
n
6If my rank is 0, print the result.
Example 1: Monte Carlo Simulation Code
Serial Code
1N
< -
5 0 0 0 0
2X
< - m a t r i x
(
r u n i f
( N
*
2) ,
n c o l
=2)
3r
< - sum
( r o w S u m s ( X ^2) <= 1)
4PI
< -
4
*
r
/
N
5p r i n t
( PI )
Parallel Code
1l i b r a r y
( pbdMPI , q u i e t = T R U E )
2i n i t ()
3c o m m .
set
. s e e d ( s e e d = 1 2 3 4 5 6 7 ,
d i f f
= T R U E )
4 5N . gbd
< -
5 0 0 0 0
/
c o m m . s i z e ()
6X . gbd
< - m a t r i x
(
r u n i f
( N . gbd
*
2) ,
n c o l
= 2)
7r . gbd
< - sum
( r o w S u m s ( X . gbd ^2) <= 1)
8r
< -
a l l r e d u c e ( r . gbd )
9PI
< -
4
*
r
/
( N . gbd
*
c o m m . s i z e () )
10c o m m .
p r i n t
( PI )
11 12f i n a l i z e ()
Basic Statistics Examples pbdMPI Example: Monte Carlo Simulation
Note
For the remainder, we will exclude loading, init, and finalize calls.
6
Basic Statistics Examples
pbdMPI Example: Monte Carlo Simulation
pbdMPI Example: Sample Covariance
pbdMPI Example: Linear Regression
Summary
Basic Statistics Examples pbdMPI Example: Sample Covariance
Example 2: Sample Covariance
cov
(
x
n
×
p
) =
1
n
−
1
n
X
i
=1
(
x
i
−
µ
x
) (
x
i
−
µ
x
)
T
Example 2: Sample Covariance GBD Algorithm
1
Determine the total number of rows
N
.
2
Compute the vector of column means of the full matrix.
3
Subtract each column’s mean from that column’s entries in each local
matrix.
4
Compute the crossproduct locally and reduce.
5Divide by
N
−
1.
Basic Statistics Examples pbdMPI Example: Sample Covariance
Example 2: Sample Covariance Code
Serial Code
1N
< - n r o w
( X )
2mu
< -
c o l S u m s ( X )
/
N
3 4X
< - s w e e p
( X , S T A T S = mu , M A R G I N =2)
5Cov . X
< - c r o s s p r o d
( X )
/
( N -1)
6 7p r i n t
( Cov . X )
Parallel Code
1N
< -
a l l r e d u c e (
n r o w
( X . gbd ) , op =
" sum "
)
2mu
< -
a l l r e d u c e ( c o l S u m s ( X . gbd )
/
N , op =
" sum "
)
3 4X . gbd
< - s w e e p
( X . gbd , S T A T S = mu , M A R G I N =2)
5Cov . X
< -
a l l r e d u c e (
c r o s s p r o d
( X . gbd ) , op =
" sum "
)
/
( N -1)
6 7c o m m .
p r i n t
( Cov . X )
6
Basic Statistics Examples
pbdMPI Example: Monte Carlo Simulation
pbdMPI Example: Sample Covariance
pbdMPI Example: Linear Regression
Summary
Basic Statistics Examples pbdMPI Example: Linear Regression