High Performance Computing with R
Drew Schmidt
April 6, 2014
Contents
1 Introduction
2 Profiling and Benchmarking
3 Writing Better R Code
4 All About Compilers and R
5 An Overview of Parallelism
6 Shared Memory Parallelism in R
7 Distributed Memory Parallelism with R
8 Exercises
1 Introduction
Topics: Compute Resources; HPC Myths
Compute Resources
Your laptop. Your own server. The cloud. NSF resources.
XSEDE
(+) Free∗!!!
(+) Access to massive compute resources.
(+) OS image and software managed by others.
(-) 1-3 month turnaround for new applications.
HPC Myths
What is High Performance Computing (HPC)?
High Performance Computing most generally refers to the practice of aggregating computing power in a way that delivers much higher performance than one could get out of a typical desktop computer or workstation in order to solve large problems in science, engineering, or business.
HPC Myths
1 We don’t need to worry about it.
2 HPC requires a supercomputer.
3 HPC is only for academics.
4 HPC is too expensive.
5 You can’t do HPC with R.
6 There’s no need for HPC in biology.
Programming with Big Data in R (pbdR)
[Figure: run time (seconds) for 500, 1000, and 2000 predictors.]
HPC in the Biological Sciences
According to Manuel Peitsch, co-founder of the Swiss Institute of Bioinformatics, HPC is “essential” and plays a critical role in the life sciences in 4 ways:
1 Massive amounts of data generated by modern 'omics' and genome sequencing technologies.
2 Modeling increasingly large biomolecular systems using quantum mechanics/molecular mechanics and molecular dynamics.
3 Modeling biological networks and simulating how network perturbations lead to adverse outcomes and disease.
2 Profiling and Benchmarking
Topics: Why Profile?; Profiling R Code; Other Ways to Profile; Summary
Why Profile?
Because performance matters. Your bottlenecks may surprise you. Because R is dumb.
Compilers often correct bad behavior...
A Really Dumb Loop
    int main() {
      int x, i;
      for (i = 0; i < 10; i++)
        x = 1;
      return 0;
    }
clang -O3 example.c
    main:
        .cfi_startproc
clang example.c
    main:
        .cfi_startproc
    # BB#0:
        movl $0, -4(%rsp)
        movl $0, -12(%rsp)
    .LBB0_1:
        cmpl $10, -12(%rsp)
        jge .LBB0_4
    # BB#2:
        movl $1, -8(%rsp)
    # BB#3:
        movl -12(%rsp), %eax
        addl $1, %eax
R will not!
Dumb Loop
    for (i in 1:n) {
      tA <- t(A)
      Y <- tA %*% Q
      Q <- qr.Q(qr(Y))
      Y <- A %*% Q
      Q <- qr.Q(qr(Y))
    }

    Q
Better Loop
    tA <- t(A)

    for (i in 1:n) {
      Y <- tA %*% Q
      Q <- qr.Q(qr(Y))
      Y <- A %*% Q
      Q <- qr.Q(qr(Y))
    }

    Q
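To check that moving the transpose out of the loop changes only the cost and not the answer, here is a minimal sketch. The function names `loop_dumb` and `loop_better`, the matrix sizes, and the iteration count are illustrative assumptions, not part of the original slides.

```r
# Illustrative sketch: names and sizes are assumptions, not from the slides.
loop_dumb <- function(A, Q, n) {
  for (i in 1:n) {
    tA <- t(A)              # recomputed on every iteration
    Y <- tA %*% Q
    Q <- qr.Q(qr(Y))
    Y <- A %*% Q
    Q <- qr.Q(qr(Y))
  }
  Q
}

loop_better <- function(A, Q, n) {
  tA <- t(A)                # loop invariant, computed once
  for (i in 1:n) {
    Y <- tA %*% Q
    Q <- qr.Q(qr(Y))
    Y <- A %*% Q
    Q <- qr.Q(qr(Y))
  }
  Q
}

set.seed(1234)
A  <- matrix(rnorm(400 * 50), nrow = 400, ncol = 50)
Q0 <- matrix(rnorm(400 * 5), nrow = 400, ncol = 5)

# Time both versions and keep their results for comparison
t_dumb   <- system.time(res_dumb   <- loop_dumb(A, Q0, 10))[3]
t_better <- system.time(res_better <- loop_better(A, Q0, 10))[3]

all.equal(res_dumb, res_better)
```

How much time the hoisting saves depends on the size of `A` and on the BLAS in use, so the timings are best read as relative rather than absolute.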
Example from the clusterGenomics Package
Excerpt from Original findW function
    n <- nrow(as.matrix(dX))

    while (k <= K) {
      for (i in 1:k) {
        # Sum of within-cluster dispersion:
        d.k <- as.matrix(dX)[labX==i, labX==i]
        D.k <- sum(d.k)
        ...
Excerpt from Modified findW function
    dX.mat <- as.matrix(dX)
    n <- nrow(dX.mat)

    while (k <= K) {
By changing just 2 lines of code, I was able to improve the speed of his method by over 350%!
Profiling R Code
Runtime Tools
Getting simple timings as a basic measure of performance is easy, and valuable.
system.time()
rbenchmark
Rprof()
Performance Profiling Tools: system.time()
system.time() is a basic R utility for giving run times of expressions
    > x <- matrix(rnorm(10000*500), nrow=10000, ncol=500)
    > system.time(t(x) %*% x)
       user  system elapsed
      0.459   0.028   0.488
    > system.time(crossprod(x))
       user  system elapsed
      0.234   0.000   0.234
    > system.time(cov(x))[3]
    elapsed
Performance Profiling Tools: system.time()
Improving the rexpokit Package
    library(rexpokitold)
    system.time(expokit_dgpadm_Qmat(x))[3]
    # 5.496

    library(rexpokit)
    system.time(expokit_dgpadm_Qmat(x))[3]
    # 4.164

    5.496 / 4.164
    # 1.319885
Performance Profiling Tools: rbenchmark
rbenchmark is a simple package that easily benchmarks different functions:
    x <- matrix(rnorm(10000*500), nrow=10000, ncol=500)

    f <- function(x) t(x) %*% x
    g <- function(x) crossprod(x)

    library(rbenchmark)
    benchmark(f(x), g(x))
    #   test replications elapsed relative
    # 1 f(x)          100  64.153    2.063
Rprof()
A very useful tool for profiling R code
    Rprof(filename="Rprof.out", append=FALSE, interval=0.02,
          memory.profiling=FALSE, gc.profiling=FALSE,
Rprof()
    data(iris)
    myris <- iris[1:100, ]
    fam <- binomial(logit)

    Rprof()
    mymdl <- replicate(1, myglm <- glm(Species ~ ., family=fam, data=myris))
    Rprof(NULL)
    summaryRprof()

    Rprof()
    mymdl <- replicate(10, myglm <- glm(Species ~ ., family=fam, data=myris))
    Rprof(NULL)
    summaryRprof()
Other Ways to Profile
Other Profiling Tools
Rprofmem()
tracemem()
perf (Linux)
PAPI, TAU, ...
Profiling with pbdPROF
1. Rebuild pbdR packages
    R CMD INSTALL pbdMPI_0.2-1.tar.gz \
      --configure-args="--enable-pbdPROF"
2. Run code
    mpirun -np 64 Rscript my_script.R
3. Analyze results
    library(pbdPROF)
Publication-quality graphs
Summary
Profile, profile, profile.
Use system.time() to get a general sense of a method.
Use rbenchmark's benchmark() to compare 2 methods.
Use Rprof() for more detailed profiling.
3 Writing Better R Code
Topics: Functions; Loops, Ply Functions, and Vectorization; Summary
Serial R Improvements
Functions
Function Evaluation
Function calls are comparatively expensive (≈ 10x slower than C)
In absolute terms, the abstraction is worth the price.
Recursion sucks. Avoid at all costs.
Recursion
    fib1 <- function(n)
    {
      if (n == 0 || n == 1)
        return(1L)
      else
        return(fib1(n-1) + fib1(n-2))
    }


    fib2 <- function(n)
    {
      if (n == 0 || n == 1)
        return(1L)

      f0 <- 1L
      f1 <- 1L

      i <- 1L
      fib <- 0L
      while (i < n)
      {
        fib <- f0 + f1
        f0 <- f1
        f1 <- fib
        i <- i+1
      }

      return(fib)
    }

    library(rbenchmark)

    n <- 20
    benchmark(fib1(n), fib2(n))
    #      test replications elapsed relative
    # 1 fib1(n)          100   2.047   1023.5
    # 2 fib2(n)          100   0.002      1.0

    system.time(fib1(45))[3]
    # 2934.146
    system.time(fib2(45))[3]
    # 3.0616e-5

    # Relative performance
    2934.146 / 3.0616e-5
    # 95837013
Loops, Ply Functions, and Vectorization
Loops, Plys, and Vectorization
Loops are slow.
apply(), Reduce() are just for loops.
Map(), lapply(), sapply(), mapply() (and most other core ones) are not for loops.
Vectorization is the fastest of these options, but tends to be much more memory wasteful.
Loops: Best Practices
Profile, profile, profile.
Evaluate how practical it is to rewrite as an lapply(), vectorize, or push to compiled code.
Loops
    f1 <- function(n) {
      x <- c()
      for (i in 1:n) {
        x <- c(x, i^2)   # grows the vector on every iteration
      }

      x
    }


    f2 <- function(n) {
      x <- integer(n)    # pre-allocated
      for (i in 1:n) {
        x[i] <- i^2
      }

      x
    }

    library(rbenchmark)

    n <- 1000

    benchmark(f1(n), f2(n))
    #    test replications elapsed relative
    # 1 f1(n)          100   0.234    2.412
    # 2 f2(n)          100   0.097    1.000
Ply’s: Best Practices
Most ply's are just shorthand/higher-level expressions of loops.
Generally not much faster (if at all), especially with the compiler.
Vectorization
x+y
x[, 1] <- 0
rnorm(1000)
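To make the contrast concrete, the following minimal sketch writes the same element-wise addition as an explicit loop and as a vectorized expression, and shows a vectorized column assignment. The helper names `add_loop` and `add_vec` and the data are illustrative assumptions.

```r
set.seed(1)
x <- rnorm(1000)
y <- rnorm(1000)

# Loop version: one element at a time, with pre-allocated output
add_loop <- function(x, y) {
  z <- numeric(length(x))
  for (i in seq_along(x))
    z[i] <- x[i] + y[i]
  z
}

# Vectorized version: the loop happens in compiled code inside `+`
add_vec <- function(x, y) x + y

z_loop <- add_loop(x, y)
z_vec  <- add_vec(x, y)
all.equal(z_loop, z_vec)

# Vectorized assignment: set a whole column at once
M <- matrix(1, nrow = 3, ncol = 2)
M[, 1] <- 0
M
```

Both versions compute the same values; the vectorized one simply hands the iteration to compiled code.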
Plys and Vectorization
    f3 <- function(n) {
      sapply(1:n, function(i) i^2)
    }

    f4 <- function(n) {
      (1:n)*(1:n)
    }

    library(rbenchmark)

    n <- 1000

    benchmark(f1(n), f2(n), f3(n), f4(n))
    #   test replications elapsed relative
Loops, Plys, and Vectorization
    f1 <- function(ind) {
      sum <- 0
      for (i in ind) {
        if (i%%2 == 0)
          sum <- sum - log(i)
        else
          sum <- sum + log(i)
      }
      sum
    }

    f2 <- function(ind) {
      sum(sapply(X=ind, FUN=function(i) if (i%%2==0) -log(i) else log(i)))
    }

    f3 <- function(ind) {
      sign <- replicate(length(ind), 1L)
      # The next line is missing from the source; it is reconstructed here
      # (an assumption) so that f3 matches f1 and f2.
      sign[ind %% 2 == 0] <- -1L

      sum(sign * log(ind))
    }


    library(rbenchmark)
    ind <- 1:50000
    benchmark(f1(ind), f2(ind), f3(ind))
Summary
Avoid recursion at all costs.
Vectorize when you can.
Pre-allocate your data in loops.
4 All About Compilers and R
Topics: Building R with a Different Compiler; The Bytecode Compiler; Bringing Compiled C/C++/Fortran to R; Summary
Building R with a Different Compiler
Better Compiler
GNU (gcc/gfortran) and clang/gfortran are free and will compile anything, but don’t produce the fastest binaries. Don’t even bother with anything from Microsoft.
Intel icc is very fast on Intel hardware (≈ 20% faster than GNU).
Compiling R with icc and ifort
Faster, but not painless.
Requires Intel Composer suite license ($$$)
Improvements are most visible on Intel hardware.
The Bytecode Compiler
The Compiler Package
Released in 2011 (Tierney)
Bytecode: sort of like machine code for interpreters...
Improves R code speed 2-5% generally.
Bytecode Compilation
By default, packages are not (bytecode) compiled (true as of this 2014 talk; R 3.5.0 and later byte-compile packages on installation by default).
Exceptions: base (base, stats, ...) and recommended (MASS, Matrix, ...) packages.
Downsides to package compilation: (1) bigger install size, (2) longer install process.
Compiling a Function
    test <- function(x) x+1
    test
    # function(x) x+1

    library(compiler)

    test <- cmpfun(test)
    test
    # function(x) x+1
    # <bytecode: 0x38c86c8>

    disassemble(test)
    # list(.Code, list(7L, GETFUN.OP, 1L, MAKEPROM.OP, 2L, PUSHCONSTARG.OP,
Compiling Packages
From R
    install.packages("my_package", type="source",
                     INSTALL_opts="--byte-compile")
From The Shell
    export R_COMPILE_PKGS=1
The Bytecode Compiler
Compiling YOUR Package
In the DESCRIPTION file, you can set ByteCompile: yes to require bytecode compilation (overridden by --no-byte-compile).
Not recommended during development.
CRAN may yell at you.
Bringing Compiled C/C++/Fortran to R
(Machine Code) Compiled Code
Moving to compiled code can be difficult. But performance is very compelling.
Extra Credit
Compare the bytecode of these two functions:
Wasteful
    f <- function(A, Q) {
      n <- ncol(A)
      for (i in 1:n) {
        tA <- t(A)
        Y <- tA %*% Q
        Q <- qr.Q(qr(Y))
        Y <- A %*% Q
        Q <- qr.Q(qr(Y))
      }
      Q
    }
Less Wasteful
    g <- function(A, Q) {
      n <- ncol(A)
      tA <- t(A)
      for (i in 1:n) {
        Y <- tA %*% Q
        Q <- qr.Q(qr(Y))
        Y <- A %*% Q
        Q <- qr.Q(qr(Y))
      }
      Q
    }
Summary
Compiling R itself with a different compiler can improve performance, but is non-trivial.
The compiler package offers small, but free speedup.
The (bytecode) compiler works best on loops.
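A small sketch of the point about loops: it compiles a loop-heavy function with cmpfun() and times it against the interpreted version. The function `sum_squares` and the problem size are hypothetical choices for illustration. Because R 3.4.0 and later enable the JIT compiler by default, the sketch switches it off with enableJIT(0) so that the uncompiled function really is interpreted.

```r
library(compiler)

# Hypothetical loop-heavy function (an assumption for illustration)
sum_squares <- function(n) {
  total <- 0
  for (i in seq_len(n))
    total <- total + i^2
  total
}

old_jit <- enableJIT(0)     # disable the JIT; returns the previous level
sum_squares_cmp <- cmpfun(sum_squares)

t_interp <- system.time(r1 <- sum_squares(1e6))[3]
t_cmp    <- system.time(r2 <- sum_squares_cmp(1e6))[3]

enableJIT(old_jit)          # restore the previous JIT level

all.equal(r1, r2)
c(interpreted = unname(t_interp), compiled = unname(t_cmp))
```

The size of the gap varies by R version and machine, so no particular ratio is claimed here.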
5 An Overview of Parallelism
Topics: Terminology: Parallelism; Choice of BLAS Library; Guidelines
Terminology: Parallelism
Parallel Programming Vocabulary: Difficulty in Parallelism
1 Implicit parallelism: Parallel details hidden from user.
    Example: Using multi-threaded BLAS
2 Explicit parallelism: Some assembly required...
    Example: Using the mclapply() from the parallel package
3 Embarrassingly Parallel or loosely coupled: Obvious how to make parallel; lots of independence in computations.
    Example: Fit two independent models in parallel.
4 Tightly Coupled: Opposite of embarrassingly parallel; lots of dependence in computations.
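As an illustration of the embarrassingly parallel case, the sketch below fits two independent linear models with mclapply() from the parallel package. The choice of the built-in `mtcars` data and of the two formulas is an assumption made for the example.

```r
library(parallel)

# mclapply() relies on forking, which is not available on Windows,
# so fall back to a single core there.
ncores <- if (.Platform$OS.type == "windows") 1L else 2L

# Two models that share nothing with each other (illustrative choice)
formulas <- list(mpg ~ wt, mpg ~ hp)

fits <- mclapply(formulas, function(f) lm(f, data = mtcars),
                 mc.cores = ncores)

lapply(fits, coef)
```

Neither fit needs anything from the other, which is what makes the split trivial.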
Speedup
Wallclock Time: Time of the clock on the wall from start to finish
Speedup: unitless measure of improvement; more is better.
$$S_{n_1,n_2} = \frac{\text{Time for } n_1 \text{ cores}}{\text{Time for } n_2 \text{ cores}}$$
$n_1$ is often taken to be 1
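A short worked example of the speedup definition, taking the reference core count to be 1. The wallclock times below are made up for illustration and do not come from any benchmark in these slides.

```r
# Speedup: time on n1 cores divided by time on n2 cores.
speedup <- function(time_n1, time_n2) time_n1 / time_n2

# Hypothetical wallclock times in seconds (made up for illustration)
cores <- c(1, 2, 4, 8)
times <- c(120, 64, 36, 24)

s <- speedup(times[1], times)   # n1 = 1
data.frame(cores = cores, time = times, speedup = s, optimal = cores)
```

With these numbers, 8 cores give a speedup of 120 / 24 = 5, short of the optimal value of 8.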
[Figure: two panels, "Good Speedup" and "Bad Speedup", plotting speedup against cores (2 to 8) for the application and for the optimal case.]
Shared and Distributed Memory Machines
Shared Memory
Direct access to read/change memory (one node)
Examples: laptop, GPU, MIC
Distributed
No direct access to read/change memory (many nodes); requires communication
Shared Memory Machines
Thousands of cores
Nautilus, University of Tennessee: 1024 cores, 4 TB RAM
Distributed Memory Machines
Hundreds of thousands of cores
Kraken, University of Tennessee: 112,896 cores, 147 TB RAM
Shared and Distributed Programming from R
Shared Memory
Examples: parallel, snow, foreach, gputools, HiPLARM
Distributed
Examples: pbdR, Rmpi, RHadoop, RHIPE
CRAN HPC Task View
Choice of BLAS Library
The BLAS
Basic Linear Algebra Subprograms.
Simple vector-vector (level 1), matrix-vector (level 2), and matrix-matrix (level 3).
R uses BLAS (and LAPACK) for most linear algebra operations.
There are different implementations available, with massively different performance.
Benchmark
    set.seed(1234)
    m <- 2000
    n <- 2000
    x <- matrix(
      rnorm(m*n),
      m, n)

    object.size(x)

    library(rbenchmark)

    benchmark(x %*% x)
    benchmark(svd(x))
[Figure: Comparison of Different BLAS Implementations for Matrix-Matrix Multiplication and SVD. Panels: x%*%x on 2000x2000 matrix (~31 MiB); x%*%x on 4000x4000 matrix (~122 MiB); svd(x) on 1000x1000 matrix (~8 MiB); svd(x) on 2000x2000 matrix (~31 MiB). Axis: average wall clock run time (10 runs).]
Using openblas
On Debian and derivatives:
    sudo apt-get install libopenblas-dev
    sudo update-alternatives --config libblas.so.3
Guidelines
Independence
Parallelism requires independence.
Separate evaluations of R functions are embarrassingly parallel.
For bio applications, this may mean splitting calculations by gene.
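A minimal sketch of splitting a calculation by gene: each row of a simulated expression matrix gets its own t-test, and the rows are farmed out with mclapply(). The matrix `expr`, the grouping `group`, and the helper `gene_stat` are hypothetical and exist only for this illustration.

```r
library(parallel)

# Hypothetical expression matrix: 200 genes (rows) by 12 samples (columns)
set.seed(42)
expr  <- matrix(rnorm(200 * 12), nrow = 200, ncol = 12)
group <- factor(rep(c("a", "b"), each = 6))

# Per-gene calculation; each call depends only on its own row
gene_stat <- function(g) unname(t.test(expr[g, ] ~ group)$statistic)

# Forking is unavailable on Windows, so use one core there
ncores <- if (.Platform$OS.type == "windows") 1L else 2L

stats_par <- unlist(mclapply(seq_len(nrow(expr)), gene_stat, mc.cores = ncores))
stats_ser <- vapply(seq_len(nrow(expr)), gene_stat, numeric(1))

all.equal(stats_par, stats_ser)
```

Because no gene's statistic depends on another's, the parallel and serial results agree and the work can be divided in any way that is convenient.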
Portability
Not all packages (or methods within a package) support all OS’s.
In the HPC world, that usually means “doesn’t work on Windows”.
RNG’s in Parallel
Be careful!
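One way to be careful, sketched here with the parallel package: clusterSetRNGStream() gives each worker its own L'Ecuyer-CMRG random number stream, so the workers do not share a stream and a run can be reproduced from a single seed. The helper `draw`, the seed, and the two-worker cluster are assumptions chosen for the example.

```r
library(parallel)

cl <- makeCluster(2)

# Seed the workers, then draw one uniform number per task
draw <- function(seed) {
  clusterSetRNGStream(cl, iseed = seed)   # separate stream for each worker
  parSapply(cl, 1:4, function(i) runif(1))
}

run1 <- draw(1234)
run2 <- draw(1234)

stopCluster(cl)

identical(run1, run2)
```

Reproducibility in this sketch relies on the same number of workers and the same task split in both runs.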
Summary
Many kinds of parallelism available to R.
Better/parallel BLAS is free speedup for linear algebra, but takes some work.
6 Shared Memory Parallelism in R
Topics: The parallel Package
The parallel Package
Comes with R ≥ 2.14.0
Includes multicore + most of snow.
The parallel Package: multicore
(+) Data copied to child on write (handled by OS)
(+) Very efficient.
(-) No Windows support.
(-) Not as efficient as threads.
    mclapply(X, FUN, ...,
             mc.preschedule=TRUE, mc.set.seed=TRUE,
             mc.silent=FALSE, mc.cores=getOption("mc.cores", 2L),
             mc.cleanup=TRUE, mc.allow.recursive=TRUE)

    x <- lapply(1:10, sqrt)

    library(parallel)
    x.mc <- mclapply(1:10, sqrt)
    simplify2array(mclapply(1:10, function(i) Sys.getpid(), mc.cores=4))
    # [1] 27452 27453 27454 27455 27452 27453 27454 27455 27452 27453

    simplify2array(mclapply(1:2, function(i) Sys.getpid(), mc.cores=4))
    # [1] 27457 2745
The parallel Package: snow
Uses sockets.
(+) Works on all platforms.
(-) More fiddly than mclapply().
    library(parallel)

    ### Set up the worker processes
    my.cl <- makeCluster(detectCores())
    my.cl
    # socket cluster with 4 nodes on host localhost

    parSapply(my.cl, 1:5, sqrt)

    stopCluster(my.cl)
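Part of what makes socket clusters more fiddly is that the workers are fresh R sessions: objects in the master session are not visible to them unless they are sent over. The sketch below shows this with clusterExport(); the variable `offset` and the function `add_offset` are hypothetical names used only for illustration.

```r
library(parallel)

offset <- 10                       # lives only in the master R session
add_offset <- function(i) i + offset

cl <- makeCluster(2)

# Send `offset` to the workers; without this they cannot find it
clusterExport(cl, "offset", envir = environment())

res <- parSapply(cl, 1:5, add_offset)
stopCluster(cl)

res
```

With mclapply() this step is unnecessary, because forked children start with a copy of the parent's memory.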
The parallel Package: Summary
All: detectCores(), splitIndices()
multicore: mclapply(), mcmapply(), mcparallel()
snow: makeCluster(), stopCluster(), parLapply()
The foreach Package
On CRAN (Revolution Analytics).
Main package is foreach, which is a single interface for a number of “backend” packages.
Backends: doMC, doMPI, doParallel, doRedis, doRNG,
(+) Works on all platforms (with correct backend).
(+) Can even work serial with minor notational change.
(+) Write the code once, use whichever backend you prefer.
(-) Really bizarre, non-R-ish syntax.
Efficiency Issues ???
[Figure: Coin Flipping with 24 Cores. Log run time in seconds against length of iterating set (10 to 1e+06) for lapply, mclapply, and foreach.]
    ### Bad performance
    foreach(i=1:len) %dopar% tinyfun(i)

    ### Expected performance
    foreach(i=1:ncores) %dopar% {
      out <- numeric(len/ncores)
      for (j in 1:(len/ncores))
        out[j] <- tinyfun(j)
      out
    }
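The same chunking idea can be sketched without foreach, using splitIndices() and mclapply() from the parallel package instead: each worker receives one block of indices and runs an ordinary loop over it, so the per-task overhead is paid once per core rather than once per element. This swaps in a different package from the one on the slide, and `tinyfun`, `len`, and the core count are hypothetical stand-ins.

```r
library(parallel)

tinyfun <- function(i) i %% 2      # hypothetical stand-in for a very cheap task
len <- 1000

# Forking is unavailable on Windows, so use one core there
ncores <- if (.Platform$OS.type == "windows") 1L else 2L

# One block of indices per core
chunks <- splitIndices(len, ncores)

res <- mclapply(chunks, function(idx) {
  out <- numeric(length(idx))
  for (j in seq_along(idx))
    out[j] <- tinyfun(idx[j])      # apply the task to this worker's own indices
  out
}, mc.cores = ncores)

out <- unlist(res)
```

Note that each worker indexes into its own block (`idx[j]`), so together the workers cover every index exactly once.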
The foreach Package: General Procedure
Load foreach and your backend package.
Register your backend.
Using foreach: serial
    library(foreach)

    ### Example 1
    foreach(i=1:3) %do% sqrt(i)

    ### Example 2
    n <- 50
    reps <- 100

    x <- foreach(i=1:reps) %do% {
      sum(rnorm(n, mean=i)) / (n*reps)
    }
Using foreach: Parallel
    library(foreach)
    library(<mybackend>)

    register<MyBackend>()

    ### Example 1
    foreach(i=1:3) %dopar% sqrt(i)

    ### Example 2
    n <- 50
    reps <- 100

    x <- foreach(i=1:reps) %dopar% {
foreach backends
multicore
    library(doParallel)
    registerDoParallel(cores=ncores)
    foreach(i=1:2) %dopar% Sys.getpid()
snow
    library(doParallel)
    cl <- makeCluster(ncores)
    registerDoParallel(cl=cl)

    foreach(i=1:2) %dopar% Sys.getpid()
    stopCluster(cl)
foreach Summary
Make sure to register your backend.
Different backends may have different performance.
Use %dopar% for parallel foreach.
%do% and %dopar% must appear on the same line as the foreach() call.
7 Distributed Memory Parallelism with R
Topics: Distributed Memory Parallelism; Rmpi; pbdMPI vs Rmpi; Summary
Distributed Memory Parallelism
Why Distribute?
Nodes only hold so much RAM.
Commodity hardware: ≈ 32-64 GiB.
With a few exceptions (ff, bigmemory), R does computations in memory.
Packages for Distributed Memory Parallelism in R
Rmpi, and snow via Rmpi
RHIPE and RHadoop ecosystem
Hasty Explanation of MPI
We will return to this on Monday.
MPI = Message Passing Interface
Recall: Distributed machines can't directly manipulate memory of other nodes.
Can indirectly manipulate them, however...
Rmpi
Rmpi Hello World
    library(Rmpi)

    mpi.spawn.Rslaves(nslaves=2)
    # 2 slaves are spawned successfully. 0 failed.
    # master (rank 0, comm 1) of size 3 is running on: wootabega
    # slave1 (rank 1, comm 1) of size 3 is running on: wootabega
    # slave2 (rank 2, comm 1) of size 3 is running on: wootabega

    mpi.remote.exec(paste("I am", mpi.comm.rank(), "of", mpi.comm.size()))
    # $slave1
    # [1] "I am 1 of 3"
    #
    # $slave2
    # [1] "I am 2 of 3"

    mpi.exit()
Using Rmpi from snow
    library(snow)
    library(Rmpi)

    cl <- makeCluster(2, type="MPI")
    clusterCall(cl, function() Sys.getpid())
    clusterCall(cl, runif, 2)
    stopCluster(cl)
    mpi.quit()
Rmpi Resources
Rmpi tutorial: http://math.acadiau.ca/ACMMaC/Rmpi/
Rmpi manual: http:
pbdMPI vs Rmpi
Rmpi is interactive; pbdMPI is exclusively batch.
pbdMPI is easier to install.
pbdMPI has a simpler interface.
Example Syntax
Rmpi
    # int
    mpi.allreduce(x, type=1)
    # double
    mpi.allreduce(x, type=2)
pbdMPI
    allreduce(x)
Types in R
    > is.integer(1)
    [1] FALSE
    > is.integer(2)
    [1] FALSE
    > is.integer(1:2)
    [1] TRUE
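The type distinction matters when a type has to be passed by hand, as with the `type` argument in the Rmpi calls shown on the slide. The following base R sketch shows how to check and control whether a value is stored as an integer or a double; the variable names are illustrative.

```r
x_double <- 1        # numeric literals are doubles by default
x_int    <- 1L       # the L suffix requests an integer

typeof(x_double)
typeof(x_int)
typeof(1:2)          # the colon operator yields integers here

# Conversions are explicit when a particular type is needed
typeof(as.integer(x_double))
typeof(as.double(x_int))
```

Checking with typeof() or is.integer() before choosing a type code avoids describing a double vector as integers, or the reverse.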
Summary
Distributed parallelism is necessary when computations no longer fit in RAM.
Several options available; most go beyond the scope of this talk.
Exercises 1
1 Suppose we wish to store the square root of all integers from 1 to 10000 in a vector. Do this in each of the following ways, and compare them with rbenchmark:
    for loop without initialization
    for loop with initialization
    Ply function
    vectorization
2 Revisit the previous example, evaluating the different implementations with Rprof().
3 Count the number of integer multiples of 5 or 17 which are less than 10,000,000.
    Solve this with lapply().
    Solve this with vectorization.
Exercises 2
4 The Monty Hall game is a well known “paradox” from elementary probability. From Wikipedia:
Suppose you’re on a game show, and you’re given the choice of three doors: Behind one door is a car; behind the others, goats. You pick a door, say No. 1, and the host, who knows what’s behind the doors, opens another door, say No. 3, which has a goat. He then says to you, “Do you want to pick door No. 2?” Is it to your advantage to switch your choice?
Exercises 3
5 The following is a more substantive example that utilizes multiple cores to perform a real analysis task. Run and evaluate this example. The example is modified from an example of Wei-Chen’s from the pbdDEMO package.
There are 148 EIAV sequences in the data set. They were sequenced from multiple blood serum samples collected longitudinally from one EIA horse over periodic fever cycles after EIAV infection. The virus population evolved within the sick horse over time. Some subtypes can break the horse’s immune system. The goal was to identify how many subtypes were evolving within the horse, which type is associated with early disease onset, and from which sample or time point to isolate that subtype virus. A further question was which mutated region of the sequence was critical for that subtype to break the immune system.
Exercises 4
    library(phyclust, quietly = TRUE)
    library(parallel)

    ### Load data
    data.path <- paste(.libPaths()[1], "/phyclust/data/pony524.phy", sep = "")
    pony.524 <- read.phylip(data.path)
    X <- pony.524$org
    K0 <- 1
    Ka <- 2

    ### Find MLEs
    ret.K0 <- find.best(X, K0)
    ### (Listing lines 13-17 are missing from the source.)
      X.b <- bootstrap.seq.data(ret.K0)$org

      ret.K0 <- phyclust(X.b, K0)
      repeat{
        ret.Ka <- phyclust(X.b, Ka)
        if (ret.Ka$logL > ret.K0$logL) {
          break
        }
      }

      LRT.b <- -2 * (ret.Ka$logL - ret.K0$logL)
      LRT.b
    }

    ### Task pull and summary
    ret <- mclapply(1:100, FUN)
    LRT.B <- unlist(ret)
    cat("K0: ", K0, "\n",
        "Ka: ", Ka, "\n",
        "logL K0: ", ret.K0$logL, "\n",
        "logL Ka: ", ret.Ka$logL, "\n",
        "LRT: ", LRT, "\n",
Important Topics Not Discussed Here
Distributed computing (for real): pbdR on Monday.
Utilizing compiled code: Rcpp on Tuesday.
Multithreading.
GPU's and MIC's.
R+Hadoop.
Thanks for coming!