High Performance Computing with R

(1)

Drew Schmidt

April 6, 2014

(2)

Compilers often correct bad behavior. . .

A Really Dumb Loop

    int main() {
      int x, i;
      for (i = 0; i < 10; i++)
        x = 1;
      return 0;
    }

clang -O3 example.c

    main:
        .cfi_startproc

clang example.c

    main:
        .cfi_startproc
    # BB#0:
        movl $0, -4(%rsp)
        movl $0, -12(%rsp)
    .LBB0_1:
        cmpl $10, -12(%rsp)
        jge .LBB0_4
    # BB#2:
        movl $1, -8(%rsp)
    # BB#3:
        movl -12(%rsp), %eax
        addl $1, %eax

(14)

Why Profile?

R will not!

Dumb Loop

    for (i in 1:n) {
      tA <- t(A)
      Y <- tA %*% Q
      Q <- qr.Q(qr(Y))
      Y <- A %*% Q
      Q <- qr.Q(qr(Y))
    }

    Q

Better Loop

    tA <- t(A)

    for (i in 1:n) {
      Y <- tA %*% Q
      Q <- qr.Q(qr(Y))
      Y <- A %*% Q
      Q <- qr.Q(qr(Y))
    }

    Q
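The two loops can be timed against each other with base R alone. The following is a minimal sketch, not the author's benchmark: the matrix sizes, the starting Q, the iteration count, and the function names `dumb_loop` and `better_loop` are all assumptions made for illustration.

```r
set.seed(1)
m <- 400; n <- 200; k <- 5                       # assumed sizes
A  <- matrix(rnorm(m * n), m, n)
Q0 <- qr.Q(qr(matrix(rnorm(m * k), m, k)))       # assumed starting Q

# Transpose recomputed on every pass, as in the dumb loop
dumb_loop <- function(A, Q, iters) {
  for (i in 1:iters) {
    tA <- t(A)
    Y <- tA %*% Q
    Q <- qr.Q(qr(Y))
    Y <- A %*% Q
    Q <- qr.Q(qr(Y))
  }
  Q
}

# Transpose computed once, before the loop
better_loop <- function(A, Q, iters) {
  tA <- t(A)
  for (i in 1:iters) {
    Y <- tA %*% Q
    Q <- qr.Q(qr(Y))
    Y <- A %*% Q
    Q <- qr.Q(qr(Y))
  }
  Q
}

Q_dumb   <- dumb_loop(A, Q0, 50)
Q_better <- better_loop(A, Q0, 50)
same_result <- isTRUE(all.equal(Q_dumb, Q_better))

system.time(dumb_loop(A, Q0, 50))[3]
system.time(better_loop(A, Q0, 50))[3]
```

Both versions return the same Q; only the amount of repeated work differs. How large the gap is depends on the matrix sizes and the machine.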

(15)

Why Profile?

Example from the clusterGenomics Package

Excerpt from Original findW function

    n <- nrow(as.matrix(dX))

    while (k <= K) {
      for (i in 1:k) {
        # Sum of within-cluster dispersion:
        d.k <- as.matrix(dX)[labX==i, labX==i]
        D.k <- sum(d.k)
        ...

Excerpt from Modified findW function

    dX.mat <- as.matrix(dX)
    n <- nrow(dX.mat)

    while (k <= K) {

By changing just 2 lines of code, I was able to improve the speed of his method by over 350%!
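The idea behind the change, converting the dist object once instead of inside the loop, can be sketched on toy data. This is an illustration under stated assumptions, not the clusterGenomics code: the data, the labels `labX`, and the helper names `within_original` and `within_modified` are hypothetical.

```r
set.seed(1)
pts  <- matrix(rnorm(300 * 4), nrow = 300)       # hypothetical data
dX   <- dist(pts)                                # a "dist" object
labX <- sample(1:3, 300, replace = TRUE)         # hypothetical cluster labels
K    <- 3

# as.matrix() is called on every iteration
within_original <- function(dX, labX, K) {
  D <- numeric(K)
  for (i in 1:K) {
    d.k  <- as.matrix(dX)[labX == i, labX == i]
    D[i] <- sum(d.k)
  }
  D
}

# as.matrix() is called once, outside the loop
within_modified <- function(dX, labX, K) {
  dX.mat <- as.matrix(dX)
  D <- numeric(K)
  for (i in 1:K) {
    d.k  <- dX.mat[labX == i, labX == i]
    D[i] <- sum(d.k)
  }
  D
}

D_orig <- within_original(dX, labX, K)
D_mod  <- within_modified(dX, labX, K)

system.time(for (r in 1:20) within_original(dX, labX, K))[3]
system.time(for (r in 1:20) within_modified(dX, labX, K))[3]
```

The two helpers produce identical dispersions; the modified one simply avoids rebuilding the full distance matrix for every cluster.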

(16)

Profiling R Code

Runtime Tools

Getting simple timings as a basic measure of performance is easy, and valuable.

system.time()
rbenchmark
Rprof()

(17)

Profiling R Code

Performance Profiling Tools: system.time()

system.time() is a basic R utility for giving run times of expressions:

    > x <- matrix(rnorm(10000*500), nrow=10000, ncol=500)
    > system.time(t(x) %*% x)
       user  system elapsed
      0.459   0.028   0.488
    > system.time(crossprod(x))
       user  system elapsed
      0.234   0.000   0.234
    > system.time(cov(x))[3]
    elapsed
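A single system.time() call can be noisy, so it may help to repeat the measurement and summarize. The sketch below assumes a smaller matrix than the slide uses and a hypothetical helper, `time_elapsed`, that is not part of base R.

```r
set.seed(1)
x <- matrix(rnorm(2000 * 100), nrow = 2000, ncol = 100)   # assumed size

# Hypothetical helper: median elapsed time over several runs of a function
time_elapsed <- function(expr_fun, reps = 5) {
  times <- vapply(seq_len(reps),
                  function(r) system.time(expr_fun())[["elapsed"]],
                  numeric(1))
  median(times)
}

t_naive <- time_elapsed(function() t(x) %*% x)
t_cross <- time_elapsed(function() crossprod(x))

# Both expressions compute the same matrix
same_answer <- isTRUE(all.equal(t(x) %*% x, crossprod(x)))

c(naive = t_naive, crossprod = t_cross)
```

Taking the median keeps one slow run (for example, one that triggers garbage collection) from dominating the comparison.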

(18)

Profiling R Code

Performance Profiling Tools: system.time()

Improving the rexpokit Package

    library(rexpokitold)
    system.time(expokit_dgpadm_Qmat(x))[3]
    # 5.496

    library(rexpokit)
    system.time(expokit_dgpadm_Qmat(x))[3]
    # 4.164

    5.496/4.164
    # 1.319885

(19)

Profiling R Code

Performance Profiling Tools: rbenchmark

rbenchmark is a simple package that easily benchmarks different functions:

    x <- matrix(rnorm(10000*500), nrow=10000, ncol=500)

    f <- function(x) t(x) %*% x
    g <- function(x) crossprod(x)

    library(rbenchmark)
    benchmark(f(x), g(x))
    #   test replications elapsed relative
    # 1 f(x)          100  64.153    2.063

(20)

Profiling R Code

Rprof()

A very useful tool for profiling R code

    Rprof(filename="Rprof.out", append=FALSE, interval=0.02,
          memory.profiling=FALSE, gc.profiling=FALSE,

(21)

Profiling R Code

Rprof()

    data(iris)
    myris <- iris[1:100, ]
    fam <- binomial(logit)

    Rprof()
    mymdl <- replicate(1, myglm <- glm(Species ~ ., family=fam, data=myris))
    Rprof(NULL)
    summaryRprof()

    Rprof()
    mymdl <- replicate(10, myglm <- glm(Species ~ ., family=fam, data=myris))
    Rprof(NULL)
    summaryRprof()
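The same start/stop/summarize pattern works on any code. The following sketch writes the profile to a temporary file rather than the default Rprof.out; the workload `grow_vector`, the sampling interval, and the repetition count are assumptions chosen so that the profiler has enough samples to report.

```r
# Hypothetical workload: grows a vector one element at a time
grow_vector <- function(n) {
  x <- c()
  for (i in 1:n) x <- c(x, i^2)
  x
}

prof_file <- tempfile(fileext = ".out")

Rprof(prof_file, interval = 0.01)     # start sampling
for (r in 1:10) res <- grow_vector(10000)
Rprof(NULL)                           # stop sampling

prof <- summaryRprof(prof_file)
head(prof$by.self)                    # time spent in each function itself
prof$sampling.time                    # total time covered by the samples
```

The by.self table is usually the first place to look: it shows which functions consume time directly rather than through the functions they call.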

(22)

Other Ways to Profile

Other Profiling Tools

Rprofmem()
tracemem()
perf (Linux)
PAPI, TAU, ...

(23)

Other Ways to Profile

Profiling with pbdPROF

1. Rebuild pbdR packages

    R CMD INSTALL pbdMPI_0.2-1.tar.gz \
      --configure-args=\
      "--enable-pbdPROF"

2. Run code

    mpirun -np 64 Rscript my_script.R

3. Analyze results

    library(pbdPROF)

Publication-quality graphs

(24)

Summary

Profile, profile, profile.

Use system.time() to get a general sense of a method.

Use rbenchmark's benchmark() to compare 2 methods.

Use Rprof() for more detailed profiling.

(25)

3 Writing Better R Code

Functions

Loops, Ply Functions, and Vectorization

Summary

(26)

Serial R Improvements

(27)

Functions

Function Evaluation

Function calls are comparatively expensive (≈10x slower than C).

In absolute terms, the abstraction is worth the price.

Recursion sucks. Avoid at all costs.

(28)

Functions

Recursion 1

    fib1 <- function(n)
    {
      if (n == 0 || n == 1)
        return(1L)
      else
        return(fib1(n-1) + fib1(n-2))
    }


    fib2 <- function(n)
    {
      if (n == 0 || n == 1)
        return(1L)

      f0 <- 1L
      f1 <- 1L

      i <- 1L
      fib <- 0L

(29)

Functions

Recursion 2

      while (i < n)
      {
        fib <- f0 + f1
        f0 <- f1
        f1 <- fib
        i <- i+1
      }

      return(fib)
    }

(30)

Functions

Recursion 3

    library(rbenchmark)
    n <- 20
    benchmark(fib1(n), fib2(n))
    #      test replications elapsed relative
    # 1 fib1(n)          100   2.047   1023.5
    # 2 fib2(n)          100   0.002      1.0

    system.time(fib1(45))[3]
    2934.146
    system.time(fib2(45))[3]
    3.0616e-5

    # Relative performance
    2934.146/3.0616e-5
    # 95837013

(31)

Loops, Ply Functions, and Vectorization

Loops, Plys, and Vectorization

Loops are slow.

apply(), Reduce() are just for loops.

Map(), lapply(), sapply(), mapply() (and most other core ones) are not for loops.

Vectorization is the fastest of these options, but tends to be much more memory wasteful.

(32)

Loops: Best Practices

Profile, profile, profile.

Evaluate how practical it is to rewrite as an lapply(), vectorize, or push to compiled code.

(33)

Loops 1

    f1 <- function(n) {
      x <- c()
      for (i in 1:n) {
        x <- c(x, i^2)
      }

      x
    }


    f2 <- function(n) {
      x <- integer(n)
      for (i in 1:n) {
        x[i] <- i^2
      }

      x
    }

(34)

Loops 2

    library(rbenchmark)
    n <- 1000
    benchmark(f1(n), f2(n))
       test replications elapsed relative
    1 f1(n)          100   0.234    2.412
    2 f2(n)          100   0.097    1.000

(35)

Ply’s: Best Practices

Most plys are just shorthand/higher-level expressions of loops.

Generally not much faster (if at all), especially with the compiler.

(36)

Vectorization

x+y

x[, 1] <- 0

rnorm(1000)
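Each of the expressions above operates on a whole vector or matrix column in one call. As a rough sketch of what that buys, the code below compares an element-by-element loop with the vectorized `x + y`; the vector length and the helper name `add_loop` are assumptions for illustration.

```r
set.seed(1)
x <- rnorm(1e5)
y <- rnorm(1e5)

# Hypothetical loop version of x + y
add_loop <- function(x, y) {
  z <- numeric(length(x))
  for (i in seq_along(x)) z[i] <- x[i] + y[i]
  z
}

z_loop <- add_loop(x, y)
z_vec  <- x + y                      # vectorized: one call, no R-level loop

system.time(add_loop(x, y))[3]
system.time(x + y)[3]

# Assigning to a whole column is vectorized in the same way
m <- matrix(1, nrow = 3, ncol = 2)
m[, 1] <- 0
```

The results are the same either way; the vectorized form moves the loop into compiled code.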

(37)

Plys and Vectorization

    f3 <- function(n) {
      sapply(1:n, function(i) i^2)
    }

    f4 <- function(n) {
      (1:n)*(1:n)
    }

    library(rbenchmark)
    n <- 1000
    benchmark(f1(n), f2(n), f3(n), f4(n))
    test replications elapsed relative

(38)

Loops, Plys, and Vectorization 1

    f1 <- function(ind) {
      sum <- 0
      for (i in ind) {
        if (i%%2 == 0)
          sum <- sum - log(i)
        else
          sum <- sum + log(i)
      }
      sum
    }

    f2 <- function(ind) {
      sum(sapply(X=ind, FUN=function(i) if (i%%2 == 0) -log(i) else log(i)))
    }

    f3 <- function(ind) {
      sign <- replicate(length(ind), 1L)

(39)

Loops, Plys, and Vectorization 2


      sum(sign*log(ind))
    }


    library(rbenchmark)
    ind <- 1:50000
    benchmark(f1(ind), f2(ind), f3(ind))
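The slides do not show how f3 fills in its sign vector. One way a fully vectorized version might look is sketched below; the use of ifelse() and the name `f3_sketch` are assumptions, not the author's code. The loop version f1 is repeated so the block runs on its own.

```r
# Loop version, as on the slide
f1 <- function(ind) {
  sum <- 0
  for (i in ind) {
    if (i %% 2 == 0)
      sum <- sum - log(i)
    else
      sum <- sum + log(i)
  }
  sum
}

# Assumed vectorized version: build all signs at once, then one sum()
f3_sketch <- function(ind) {
  sign <- ifelse(ind %% 2 == 0, -1, 1)
  sum(sign * log(ind))
}

ind <- 1:50000
agree <- isTRUE(all.equal(f1(ind), f3_sketch(ind)))

system.time(f1(ind))[3]
system.time(f3_sketch(ind))[3]
```

The two results agree up to floating-point rounding, since sum() accumulates in a different order than the explicit loop.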

(40)

(41)

Summary

Avoid recursion at all costs. Vectorize when you can. Pre-allocate your data in loops.

(42)

4 All About Compilers and R

Building R with a Different Compiler

The Bytecode Compiler

Bringing Compiled C/C++/Fortran to R

Summary

(43)

Building R with a Different Compiler

Better Compiler

GNU (gcc/gfortran) and clang/gfortran are free and will compile anything, but don’t produce the fastest binaries. Don’t even bother with anything from Microsoft.

Intel icc is very fast on Intel hardware (≈20% over GNU).

(44)

Building R with a Different Compiler

Compiling R with icc and ifort

Faster, but not painless.

Requires Intel Composer suite license ($$$).

Improvements are most visible on Intel hardware.

(45)

The Bytecode Compiler

The Compiler Package

Released in 2011 (Tierney)

Bytecode: sort of like machine code for interpreters...

Improves R code speed 2-5% generally.

(46)

Bytecode Compilation

By default, packages are not (bytecode) compiled.

Exceptions: base (base, stats, ...) and recommended (MASS, Matrix, ...) packages.

Downsides to package compilation: (1) bigger install size, (2) longer install process.

(47)

The Bytecode Compiler

Compiling a Function

    test <- function(x) x+1
    test
    # function(x) x+1

    library(compiler)

    test <- cmpfun(test)
    test
    # function(x) x+1
    # <bytecode: 0x38c86c8>

    disassemble(test)
    # list(.Code, list(7L, GETFUN.OP, 1L, MAKEPROM.OP, 2L, PUSHCONSTARG.OP,
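To see whether compilation changes anything for a given function, the compiled and uncompiled versions can be timed side by side. The sketch below uses a hypothetical loop-heavy function, `sum_squares`. Since R 3.4.0 the just-in-time compiler is on by default and compiles closures automatically, so the example switches it off with enableJIT(0) for the comparison and restores it afterwards.

```r
library(compiler)

old_jit <- enableJIT(0)          # disable JIT so the plain version stays interpreted

# Hypothetical loop-heavy function
sum_squares <- function(n) {
  total <- 0
  for (i in 1:n) total <- total + i^2
  total
}

sum_squares_cmp <- cmpfun(sum_squares)   # explicit bytecode compilation

same <- identical(sum_squares(1000), sum_squares_cmp(1000))

system.time(for (r in 1:100) sum_squares(10000))[3]
system.time(for (r in 1:100) sum_squares_cmp(10000))[3]

enableJIT(old_jit)               # restore the previous JIT level
```

The size of the difference depends on the R version and on how much of the function's time is spent in interpreted loop overhead.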

(48)

Compiling Packages

From R

    install.packages("my_package", type="source",
                     INSTALL_opts="--byte-compile")

From The Shell

    export R_COMPILE_PKGS=1

(49)

Compiling YOUR Package

In the DESCRIPTION file, you can set ByteCompile: yes to require bytecode compilation (overridden by --no-byte-compile).

Not recommended during development. CRAN may yell at you.

(50)

Bringing Compiled C/C++/Fortran to R

(Machine Code) Compiled Code

Moving to compiled code can be difficult. But performance is very compelling.

(51)

Bringing Compiled C/C++/Fortran to R

Extra Credit

Compare the bytecode of these two functions:

Wasteful

    f <- function(A, Q) {
      n <- ncol(A)
      for (i in 1:n) {
        tA <- t(A)
        Y <- tA %*% Q
        Q <- qr.Q(qr(Y))
        Y <- A %*% Q
        Q <- qr.Q(qr(Y))
      }
      Q
    }

Less Wasteful

    g <- function(A, Q) {
      n <- ncol(A)
      tA <- t(A)
      for (i in 1:n) {
        Y <- tA %*% Q
        Q <- qr.Q(qr(Y))
        Y <- A %*% Q
        Q <- qr.Q(qr(Y))
      }
      Q
    }

(52)

Summary

Compiling R itself with a different compiler can improve performance, but is non-trivial.

The compiler package offers small, but free speedup. The (bytecode) compiler works best on loops.

(53)

5 An Overview of Parallelism

Terminology: Parallelism

Choice of BLAS Library

Guidelines

(54)

Terminology: Parallelism

Parallelism

(55)

Parallelism

(56)

Parallel Programming Vocabulary: Difficulty in Parallelism

1 Implicit parallelism: Parallel details hidden from user.
  Example: Using multi-threaded BLAS.

2 Explicit parallelism: Some assembly required...
  Example: Using mclapply() from the parallel package.

3 Embarrassingly parallel or loosely coupled: Obvious how to make parallel; lots of independence in computations.
  Example: Fit two independent models in parallel.

4 Tightly coupled: Opposite of embarrassingly parallel; lots of dependence in computations.

(57)

Speedup

Wallclock Time: Time of the clock on the wall from start to finish

Speedup: unitless measure of improvement; more is better.

$$S_{n_1,n_2} = \frac{\text{Time for } n_1 \text{ cores}}{\text{Time for } n_2 \text{ cores}}$$

$n_1$ is often taken to be 1.
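As a small worked example of the definition, the sketch below computes the speedup from two wall-clock times. The timings are made-up numbers and the function name `speedup` is hypothetical.

```r
# Hypothetical helper implementing the ratio defined above
speedup <- function(time_n1, time_n2) time_n1 / time_n2

t_1core <- 120    # made-up wall-clock seconds on 1 core
t_4core <- 40     # made-up wall-clock seconds on 4 cores

s <- speedup(t_1core, t_4core)
s
```

Here the job runs 3 times faster on 4 cores than on 1. A speedup equal to the number of cores (4 in this case) would be the optimal case.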

(58)

Speedup

[Figure: two panels, "Good Speedup" and "Bad Speedup". Each plots Speedup against Cores (2 to 8) for two groups: Application and Optimal.]

(59)

Shared and Distributed Memory Machines

Shared Memory

Direct access to read/change memory (one node)

Examples: laptop, GPU, MIC

Distributed

No direct access to read/change memory (many nodes); requires communication

(60)

Shared and Distributed Memory Machines

Shared Memory Machines

Thousands of cores

Nautilus, University of Tennessee 1024 cores 4 TB RAM

Distributed Memory Machines

Hundreds of thousands of cores

Kraken, University of Tennessee 112,896 cores 147 TB RAM

(61)

Shared and Distributed Programming from R

Shared Memory

Examples: parallel, snow, foreach, gputools, HiPLARM

Distributed

Examples: pbdR, Rmpi, RHadoop, RHIPE

CRAN HPC Task View

(62)

Choice of BLAS Library

The BLAS

Basic Linear Algebra Subprograms.

Simple vector-vector (level 1), matrix-vector (level 2), and matrix-matrix (level 3).

R uses BLAS (and LAPACK) for most linear algebra operations.

There are different implementations available, with massively different performance.

(63)

Choice of BLAS Library

Benchmark

    set.seed(1234)
    m <- 2000
    n <- 2000
    x <- matrix(
      rnorm(m*n),
      m, n)

    object.size(x)

    library(rbenchmark)

    benchmark(x %*% x)
    benchmark(svd(x))

[Figure: Comparison of Different BLAS Implementations for Matrix-Matrix Multiplication and SVD. Panels: x%*%x on 2000x2000 matrix (~31 MiB); x%*%x on 4000x4000 matrix (~122 MiB); svd(x) on 1000x1000 matrix (~8 MiB); svd(x) on 2000x2000 matrix (~31 MiB). Axis: Average Wall Clock Run Time (10 Runs).]

(64)

Choice of BLAS Library

Using openblas

On Debian and derivatives:

    sudo apt-get install libopenblas-dev
    sudo update-alternatives --config libblas.so.3

(65)

Guidelines

Independence

Parallelism requires independence.

Separate evaluations of R functions are embarrassingly parallel. For bio applications, this may mean splitting calculations by gene.

(66)

Guidelines

Portability

Not all packages (or methods within a package) support all OS’s.

In the HPC world, that usually means “doesn’t work on Windows”.

(67)

Guidelines

RNG’s in Parallel

Be careful!

(68)

Summary

Many kinds of parallelism available to R.

Better/parallel BLAS is free speedup for linear algebra, but takes some work.

(69)

6 Shared Memory Parallelism in R

The parallel Package

(70)

The parallel Package

Comes with R ≥2.14.0

Includes multicore + most of snow.

(71)

The parallel Package: multicore

(72)

(+) Data copied to child on write (handled by OS)
(+) Very efficient.
(-) No Windows support.
(-) Not as efficient as threads.

(73)

    mclapply(X, FUN, ...,
             mc.preschedule=TRUE, mc.set.seed=TRUE,
             mc.silent=FALSE, mc.cores=getOption("mc.cores", 2L),
             mc.cleanup=TRUE, mc.allow.recursive=TRUE)

    x <- lapply(1:10, sqrt)

    library(parallel)
    x.mc <- mclapply(1:10, sqrt)

(74)

    simplify2array(mclapply(1:10, function(i) Sys.getpid(), mc.cores=4))
    # [1] 27452 27453 27454 27455 27452 27453 27454 27455 27452 27453

    simplify2array(mclapply(1:2, function(i) Sys.getpid(), mc.cores=4))
    # [1] 27457 2745

(75)

The parallel Package: snow

Uses sockets.

(+) Works on all platforms.

(-) More fiddly than mclapply().

(76)

    ### Set up the worker processes
    my.cl <- makeCluster(detectCores())
    my.cl
    # socket cluster with 4 nodes on host localhost

    parSapply(my.cl, 1:5, sqrt)

    stopCluster(my.cl)

(77)

The parallel Package: Summary

All:        detectCores(), splitIndices()
multicore:  mclapply(), mcmapply(), mcparallel()
snow:       makeCluster(), stopCluster(), parLapply()

(78)

The foreach Package

On CRAN (Revolution Analytics).

Main package is foreach, which is a single interface for a number of “backend” packages.

Backends: doMC, doMPI, doParallel, doRedis, doRNG,

(79)

The foreach Package

(+) Works on all platforms (with correct backend).
(+) Can even work serial with minor notational change.
(+) Write the code once, use whichever backend you prefer.
(-) Really bizarre, non-R-ish syntax.

(80)

The foreach Package

Efficiency Issues ???

[Figure: Coin Flipping with 24 Cores. Log Run Time in Seconds against Length of Iterating Set (10 to 1e+06) for three functions: lapply, mclapply, and foreach.]

    ### Bad performance
    foreach(i=1:len) %dopar% tinyfun(i)

    ### Expected performance
    foreach(i=1:ncores) %dopar% {
      out <- numeric(len/ncores)
      for (j in 1:(len/ncores))
        out[j] <- tinyfun(j)   # fill the j-th slot of this worker's chunk
      out
    }
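The same "one task per core, loop inside the task" idea can be sketched without foreach, using splitIndices() and mclapply() from the parallel package. This is a swapped-in illustration rather than the foreach code above: it assumes a Unix-alike system, two cores, and a hypothetical cheap function `tinyfun`.

```r
library(parallel)

tinyfun <- function(i) sqrt(i)     # hypothetical stand-in for a very cheap task
len     <- 10000
ncores  <- 2                       # assumed number of workers

# Split 1:len into ncores contiguous chunks of indices
chunks <- splitIndices(len, ncores)

# One parallel task per chunk; each task loops over its own indices
res <- mclapply(chunks,
                function(idx) vapply(idx, tinyfun, numeric(1)),
                mc.cores = ncores)
out <- unlist(res)

# Serial reference for comparison
out_serial <- vapply(1:len, tinyfun, numeric(1))
```

Because each worker receives one large chunk instead of thousands of tiny tasks, the per-task dispatch overhead is paid only once per core.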

(81)

The foreach Package

The foreach Package: General Procedure

Load foreach and your backend package.

Register your backend.

(82)

The foreach Package

Using foreach: serial

    library(foreach)

    ### Example 1
    foreach(i=1:3) %do% sqrt(i)

    ### Example 2
    n <- 50
    reps <- 100

    x <- foreach(i=1:reps) %do% {
      sum(rnorm(n, mean=i)) / (n*reps)
    }

(83)

The foreach Package

Using foreach: Parallel

    library(foreach)
    library(<mybackend>)

    register<MyBackend>()

    ### Example 1
    foreach(i=1:3) %dopar% sqrt(i)

    ### Example 2
    n <- 50
    reps <- 100

    x <- foreach(i=1:reps) %dopar% {

(84)

The foreach Package

foreach backends

multicore

    library(doParallel)
    registerDoParallel(cores=ncores)
    foreach(i=1:2) %dopar% Sys.getpid()

snow

    library(doParallel)
    cl <- makeCluster(ncores)
    registerDoParallel(cl=cl)

    foreach(i=1:2) %dopar% Sys.getpid()
    stopCluster(cl)

(85)

The foreach Package

foreach Summary

Make sure to register your backend.

Different backends may have different performance.

Use %dopar% for parallel foreach.

%do% and %dopar% must appear on the same line as the

(86)

7 Distributed Memory Parallelism with R

Distributed Memory Parallelism

Rmpi

pbdMPI vs Rmpi

Summary

(87)

Distributed Memory Parallelism

Why Distribute?

Nodes only hold so much RAM.

Commodity hardware: ≈32-64 GiB.

With a few exceptions (ff,bigmemory), R does computations in memory.

(88)

Packages for Distributed Memory Parallelism in R

Rmpi, and snow via Rmpi

RHIPE and RHadoop ecosystem

(89)

Hasty Explanation of MPI

We will return to this on Monday. MPI = Message Passing Interface

Recall: Distributed machines can’t directly manipulate memory of other nodes.

Can indirectly manipulate them, however...

(90)

Rmpi

Rmpi Hello World

    library(Rmpi)

    mpi.spawn.Rslaves(nslaves=2)
    # 2 slaves are spawned successfully. 0 failed.
    # master (rank 0, comm 1) of size 3 is running on: wootabega
    # slave1 (rank 1, comm 1) of size 3 is running on: wootabega
    # slave2 (rank 2, comm 1) of size 3 is running on: wootabega

    mpi.remote.exec(paste("I am", mpi.comm.rank(), "of", mpi.comm.size()))
    # $slave1
    # [1] "I am 1 of 3"
    #
    # $slave2
    # [1] "I am 2 of 3"

    mpi.exit()

(91)

Rmpi

Using Rmpi from snow

    library(snow)
    library(Rmpi)

    cl <- makeCluster(2, type="MPI")
    clusterCall(cl, function() Sys.getpid())
    clusterCall(cl, runif, 2)
    stopCluster(cl)
    mpi.quit()

(92)

Rmpi

Rmpi Resources

Rmpi tutorial: http://math.acadiau.ca/ACMMaC/Rmpi/

Rmpi manual: http:

(93)

pbdMPI vs Rmpi

Rmpi is interactive; pbdMPI is exclusively batch.

pbdMPI is easier to install.

pbdMPI has a simpler interface.

(94)

pbdMPI vs Rmpi

Example Syntax

Rmpi

    # int
    mpi.allreduce(x, type=1)
    # double
    mpi.allreduce(x, type=2)

pbdMPI

    allreduce(x)

Types in R

    > is.integer(1)
    [1] FALSE
    > is.integer(2)
    [1] FALSE
    > is.integer(1:2)
    [1] TRUE

(95)

Summary

Distributed parallelism is necessary when computations no longer fit in RAM.

Several options available; most go beyond the scope of this talk.

(96)

Exercises 1

1 Suppose we wish to store the square root of all integers from 1 to 10000 in a vector. Do this in each of the following ways, and compare them with rbenchmark:
  - for loop without initialization
  - for loop with initialization
  - Ply function
  - vectorization

2 Revisit the previous example, evaluating the different implementations with Rprof().

3 Count the number of integer multiples of 5 or 17 which are less than 10,000,000.
  - Solve this with lapply().
  - Solve this with vectorization.

(97)

Exercises 2

4 The Monty Hall game is a well known “paradox” from elementary probability. From Wikipedia:

Suppose you’re on a game show, and you’re given the choice of three doors: Behind one door is a car; behind the others, goats. You pick a door, say No. 1, and the host, who knows what’s behind the doors, opens another door, say No. 3, which has a goat. He then says to you, “Do you want to pick door No. 2?” Is it to your advantage to switch your choice?

(98)

Exercises 3

5 The following is a more substantive example that utilizes multiple cores to perform a real analysis task. Run and evaluate this example. The example is modified from an example of Wei-Chen’s from the pbdDEMO package.

There are 148 EIAV sequences in the data set. They are sequencing from multiple blood serum collected longitudinally from one EIA horse over periodical fever cycles after EIAV infection. The virus population evolved within the sick horse over time. Some subtype can break the horse’s immune system. It was to identify how many subtypes were evolving within the horse, which type is associated with disease early onset, where/which sample/time point to isolate that subtype virus. Moreover, which mutated region of sequence was critical for that subtype in order to break the immune system.

(99)

Exercises 4

    library(phyclust, quietly=TRUE)
    library(parallel)

    ### Load data
    data.path <- paste(.libPaths()[1], "/phyclust/data/pony524.phy", sep="")
    pony.524 <- read.phylip(data.path)
    X <- pony.524$org
    K0 <- 1
    Ka <- 2

    ### Find MLEs
    ret.K0 <- find.best(X, K0)

(100)

Exercises 5

      X.b <- bootstrap.seq.data(ret.K0)$org

      ret.K0 <- phyclust(X.b, K0)
      repeat{
        ret.Ka <- phyclust(X.b, Ka)
        if(ret.Ka$logL > ret.K0$logL) {
          break
        }
      }

      LRT.b <- -2 * (ret.Ka$logL - ret.K0$logL)
      LRT.b
    }

    ### Task pull and summary
    ret <- mclapply(1:100, FUN)
    LRT.B <- unlist(ret)
    cat("K0: ", K0, "\n",
        "Ka: ", Ka, "\n",

(101)

Exercises 6

        "logL K0: ", ret.K0$logL, "\n",
        "logL Ka: ", ret.Ka$logL, "\n",
        "LRT: ", LRT, "\n",

(102)

(103)

Important Topics Not Discussed Here

Distributed computing (for real) — pbdR on Monday. Utilizing compiled code — Rcpp on Tuesday.

Multithreading. GPU’s and MIC’s. R+Hadoop.

(104)

Thanks for coming!
