High Performance Computing with R

Drew Schmidt

April 6, 2014


Contents

1 Introduction

2 Profiling and Benchmarking

3 Writing Better R Code

4 All About Compilers and R

5 An Overview of Parallelism

6 Shared Memory Parallelism in R

7 Distributed Memory Parallelism with R

8 Exercises

1 Introduction

Compute Resources

Your laptop.
Your own server.
The cloud.
NSF resources.

XSEDE

(+) Free∗!!!
(+) Access to massive compute resources.
(+) OS image and software managed by others.
(-) 1-3 month turnaround for new applications.


HPC Myths

What is High Performance Computing (HPC)?

High Performance Computing most generally refers to the practice of aggregating computing power in a way that delivers much higher performance than one could get out of a typical desktop computer or workstation in order to solve large problems in science, engineering, or business.


HPC Myths

1 We don’t need to worry about it.

2 HPC requires a supercomputer.

3 HPC is only for academics.

4 HPC is too expensive.

5 You can’t do HPC with R.

6 There’s no need for HPC in biology.

Programming with Big Data in R (pbdR)

[Figure: run time (seconds), grouped by number of predictors (500, 1000, 2000).]


HPC in the Biological Sciences

According to Manuel Peitsch, co-founder of the Swiss Institute of Bioinformatics, HPC is “essential” and plays a critical role in the life sciences in 4 ways:

1 Massive amounts of data generated by modern ’omics’ and genome sequencing technologies.

2 Modeling increasingly large biomolecular systems using quantum mechanics/molecular mechanics and molecular dynamics.

3 Modeling biological networks and simulating how network perturbations lead to adverse outcomes and disease.

2 Profiling and Benchmarking

Why Profile?

Because performance matters.
Your bottlenecks may surprise you.
Because R is dumb.

Compilers often correct bad behavior...

A Really Dumb Loop

    int main() {
      int x, i;
      for (i=0; i<10; i++)
        x = 1;
      return 0;
    }

clang -O3 example.c

    main:
      .cfi_startproc

clang example.c

    main:
      .cfi_startproc
    # BB#0:
      movl $0, -4(%rsp)
      movl $0, -12(%rsp)
    .LBB0_1:
      cmpl $10, -12(%rsp)
      jge .LBB0_4
    # BB#2:
      movl $1, -8(%rsp)
    # BB#3:
      movl -12(%rsp), %eax
      addl $1, %eax

R will not!

Dumb Loop

    for (i in 1:n) {
      tA <- t(A)
      Y <- tA %*% Q
      Q <- qr.Q(qr(Y))
      Y <- A %*% Q
      Q <- qr.Q(qr(Y))
    }

    Q

Better Loop

    tA <- t(A)

    for (i in 1:n) {
      Y <- tA %*% Q
      Q <- qr.Q(qr(Y))
      Y <- A %*% Q
      Q <- qr.Q(qr(Y))
    }

    Q
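The two loops can be wrapped in functions and compared directly. The following is a minimal sketch, not a benchmark from the talk: the matrix sizes, the starting Q0, and the function names dumb() and better() are assumptions chosen for illustration.

```r
# Hypothetical inputs: a random square matrix and an orthonormal starting block.
set.seed(1)
A <- matrix(rnorm(200 * 200), 200, 200)
Q0 <- qr.Q(qr(matrix(rnorm(200 * 5), 200, 5)))

dumb <- function(A, Q, n) {
  for (i in seq_len(n)) {
    tA <- t(A)            # transpose recomputed on every pass
    Y <- tA %*% Q
    Q <- qr.Q(qr(Y))
    Y <- A %*% Q
    Q <- qr.Q(qr(Y))
  }
  Q
}

better <- function(A, Q, n) {
  tA <- t(A)              # transpose computed once, outside the loop
  for (i in seq_len(n)) {
    Y <- tA %*% Q
    Q <- qr.Q(qr(Y))
    Y <- A %*% Q
    Q <- qr.Q(qr(Y))
  }
  Q
}

Q_dumb <- dumb(A, Q0, 10)
Q_better <- better(A, Q0, 10)
same <- isTRUE(all.equal(Q_dumb, Q_better))

# Elapsed times depend on the machine, so none are quoted here.
t_dumb <- system.time(dumb(A, Q0, 10))[["elapsed"]]
t_better <- system.time(better(A, Q0, 10))[["elapsed"]]
```

Both versions perform the same arithmetic, so the results should agree; only the number of calls to t() differs.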

Example from the clusterGenomics Package

Excerpt from Original findW function

    n <- nrow(as.matrix(dX))

    while (k <= K) {
      for (i in 1:k) {
        # Sum of within-cluster dispersion:
        d.k <- as.matrix(dX)[labX==i, labX==i]
        D.k <- sum(d.k)
        ...

Excerpt from Modified findW function

    dX.mat <- as.matrix(dX)
    n <- nrow(dX.mat)

    while (k <= K) {

By changing just 2 lines of code, I was able to improve the speed of his method by over 350%!

Profiling R Code

Runtime Tools

Getting simple timings as a basic measure of performance is easy, and valuable.

system.time()
rbenchmark
Rprof()

Performance Profiling Tools: system.time()

system.time() is a basic R utility for giving run times of expressions

    > x <- matrix(rnorm(10000*500), nrow=10000, ncol=500)
    > system.time(t(x) %*% x)
       user  system elapsed
      0.459   0.028   0.488
    > system.time(crossprod(x))
       user  system elapsed
      0.234   0.000   0.234
    > system.time(cov(x))[3]
    elapsed

Performance Profiling Tools: system.time()

Improving the rexpokit Package

    library(rexpokitold)
    system.time(expokit_dgpadm_Qmat(x))[3]
    # 5.496

    library(rexpokit)
    system.time(expokit_dgpadm_Qmat(x))[3]
    # 4.164

    5.496/4.164
    # 1.319885

Performance Profiling Tools: rbenchmark

rbenchmark is a simple package that easily benchmarks different functions:

    x <- matrix(rnorm(10000*500), nrow=10000, ncol=500)

    f <- function(x) t(x) %*% x
    g <- function(x) crossprod(x)

    library(rbenchmark)
    benchmark(f(x), g(x))
    #   test replications elapsed relative
    # 1 f(x)          100  64.153    2.063

Rprof()

A very useful tool for profiling R code

    Rprof(filename="Rprof.out", append=FALSE, interval=0.02,
          memory.profiling=FALSE, gc.profiling=FALSE,

Rprof()

    data(iris)
    myris <- iris[1:100, ]
    fam <- binomial(logit)

    Rprof()
    mymdl <- replicate(1, myglm <- glm(Species ~ ., family=fam, data=myris))
    Rprof(NULL)
    summaryRprof()

    Rprof()
    mymdl <- replicate(10, myglm <- glm(Species ~ ., family=fam, data=myris))
    Rprof(NULL)
    summaryRprof()

Other Ways to Profile

Other Profiling Tools

Rprofmem()
tracemem()
perf (Linux)
PAPI, TAU, ...

Profiling with pbdPROF

1. Rebuild pbdR packages

    R CMD INSTALL pbdMPI_0.2-1.tar.gz \
      --configure-args="--enable-pbdPROF"

2. Run code

    mpirun -np 64 Rscript my_script.R

3. Analyze results

    library(pbdPROF)

Publication-quality graphs

Summary

Profile, profile, profile.
Use system.time() to get a general sense of a method.
Use rbenchmark’s benchmark() to compare 2 methods.
Use Rprof() for more detailed profiling.
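As a rough illustration of the first tip, the sketch below times two equivalent ways of forming a cross product using only base R. The helper time_it() and the matrix size are assumptions made for this example; rbenchmark's benchmark() would report similar information with less code.

```r
# Hypothetical test matrix, much smaller than the one used in the slides.
x <- matrix(rnorm(2000 * 100), nrow = 2000, ncol = 100)

# Hypothetical helper: median elapsed time of a zero-argument function.
time_it <- function(fun, reps = 5) {
  median(replicate(reps, system.time(fun())[["elapsed"]]))
}

t_naive <- time_it(function() t(x) %*% x)
t_cross <- time_it(function() crossprod(x))

# Check that both approaches compute the same matrix before comparing speed.
agree <- isTRUE(all.equal(t(x) %*% x, crossprod(x)))
```

Verifying that the candidates agree before timing them avoids comparing a fast wrong answer with a slow right one.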

3 Writing Better R Code

Serial R Improvements

Functions

Function Evaluation

Function calls are comparatively expensive (≈10x slower than C)

In absolute terms, the abstraction is worth the price.
Recursion sucks. Avoid at all costs.

Recursion

    fib1 <- function(n)
    {
      if (n == 0 || n == 1)
        return(1L)
      else
        return(fib1(n-1) + fib1(n-2))
    }


    fib2 <- function(n)
    {
      if (n == 0 || n == 1)
        return(1L)

      f0 <- 1L
      f1 <- 1L

      i <- 1L
      fib <- 0L

      while (i < n)
      {
        fib <- f0 + f1
        f0 <- f1
        f1 <- fib
        i <- i+1
      }

      return(fib)
    }

Recursion (benchmark)

    library(rbenchmark)

    n <- 20
    benchmark(fib1(n), fib2(n))
    #      test replications elapsed relative
    # 1 fib1(n)          100   2.047   1023.5
    # 2 fib2(n)          100   0.002      1.0

    system.time(fib1(45))[3]
    # 2934.146
    system.time(fib2(45))[3]
    # 3.0616e-5

    # Relative performance
    2934.146/3.0616e-5
    # 95837013

Loops, Ply Functions, and Vectorization

Loops, Plys, and Vectorization

Loops are slow.

apply(), Reduce() are just for loops.

Map(), lapply(), sapply(), mapply() (and most other core ones) are not for loops.

Vectorization is the fastest of these options, but tends to be much more memory wasteful.
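The memory trade-off can be made concrete with a small sketch. The function names and the choice of a sum of squares are assumptions for illustration: the loop keeps only a running total, while the vectorized version first materializes a temporary vector of n squares.

```r
n <- 1e5

# Loop: one scalar accumulator, no large temporaries.
loop_sumsq <- function(n) {
  s <- 0
  for (i in seq_len(n)) s <- s + i^2
  s
}

# Vectorized: allocates a length-n double vector before summing it.
vec_sumsq <- function(n) {
  sq <- as.numeric(seq_len(n))^2
  sum(sq)
}

# Size in bytes of the temporary that the vectorized version creates.
temp_bytes <- as.numeric(object.size(as.numeric(seq_len(n))^2))

results_match <- loop_sumsq(n) == vec_sumsq(n)
```

For n this small the temporary is negligible, but it grows linearly with n, which is where the "memory wasteful" caveat starts to matter.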


Loops: Best Practices

Profile, profile, profile.

Evaluate how practical it is to rewrite as an lapply(), vectorize, or push to compiled code.

Loops 1

    f1 <- function(n) {
      x <- c()
      for (i in 1:n) {
        x <- c(x, i^2)
      }

      x
    }


    f2 <- function(n) {
      x <- integer(n)
      for (i in 1:n) {
        x[i] <- i^2
      }

      x
    }

Loops 2

    library(rbenchmark)

    n <- 1000
    benchmark(f1(n), f2(n))

    #    test replications elapsed relative
    # 1 f1(n)          100   0.234    2.412
    # 2 f2(n)          100   0.097    1.000

Ply’s: Best Practices

Most ply’s are just shorthand/higher-level expressions of loops.
Generally not much faster (if at all), especially with the compiler.

Vectorization

x+y
x[, 1] <- 0
rnorm(1000)
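Each of the three expressions above replaces an explicit element-by-element loop. A small sketch with made-up inputs (the vectors x and y and the matrix m are assumptions for this example) shows the equivalence:

```r
x <- c(1, 2, 3)
y <- c(10, 20, 30)

# Explicit loop version of x + y.
z_loop <- numeric(length(x))
for (i in seq_along(x)) z_loop[i] <- x[i] + y[i]

# Vectorized version: one call, with the loop running in compiled code.
z_vec <- x + y

# Assigning a whole column at once instead of looping over rows.
m <- matrix(1, nrow = 3, ncol = 2)
m[, 1] <- 0

# Drawing many random numbers in one call instead of one at a time.
draws <- rnorm(1000)
```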

Plys and Vectorization

    f3 <- function(n) {
      sapply(1:n, function(i) i^2)
    }

    f4 <- function(n) {
      (1:n)*(1:n)
    }

    library(rbenchmark)

    n <- 1000
    benchmark(f1(n), f2(n), f3(n), f4(n))
    #    test replications elapsed relative

Loops, Plys, and Vectorization

    f1 <- function(ind) {
      sum <- 0
      for (i in ind) {
        if (i%%2 == 0)
          sum <- sum - log(i)
        else
          sum <- sum + log(i)
      }
      sum
    }

    f2 <- function(ind) {
      sum(sapply(X=ind, FUN=function(i) if (i%%2==0) -log(i) else log(i)))
    }

    f3 <- function(ind) {
      sign <- replicate(length(ind), 1L)
      ...

      sum(sign * log(ind))
    }


    library(rbenchmark)
    ind <- 1:50000
    benchmark(f1(ind), f2(ind), f3(ind))

Summary

Avoid recursion at all costs.
Vectorize when you can.
Pre-allocate your data in loops.

4 All About Compilers and R

Building R with a Different Compiler

Better Compiler

GNU (gcc/gfortran) and clang/gfortran are free and will compile anything, but don’t produce the fastest binaries.
Don’t even bother with anything from Microsoft.

Intel icc is very fast on Intel hardware. (≈ 20% over GNU)

Compiling R with icc and ifort

Faster, but not painless.

Requires Intel Composer suite license ($$$)
Improvements are most visible on Intel hardware.

The Bytecode Compiler

The Compiler Package

Released in 2011 (Tierney)

Bytecode: sort of like machine code for interpreters...
Improves R code speed 2-5% generally.

Bytecode Compilation

By default, packages are not (bytecode) compiled.
Exceptions: base (base, stats, ...) and recommended (MASS, Matrix, ...) packages.

Downsides to package compilation: (1) bigger install size, (2) longer install process.

Compiling a Function

    test <- function(x) x+1
    test
    # function(x) x+1

    library(compiler)

    test <- cmpfun(test)
    test
    # function(x) x+1
    # <bytecode: 0x38c86c8>

    disassemble(test)
    # list(.Code, list(7L, GETFUN.OP, 1L, MAKEPROM.OP, 2L, PUSHCONSTARG.OP,

Compiling Packages

From R

    install.packages("my_package", type="source",
                     INSTALL_opts="--byte-compile")

From The Shell

    export R_COMPILE_PKGS=1

Compiling YOUR Package

In the DESCRIPTION file, you can set ByteCompile: yes to require bytecode compilation (overridden by --no-byte-compile).

Not recommended during development.
CRAN may yell at you.

Bringing Compiled C/C++/Fortran to R

(Machine Code) Compiled Code

Moving to compiled code can be difficult.
But performance is very compelling.

Extra Credit

Compare the bytecode of these two functions:

Wasteful

    f <- function(A, Q) {
      n <- ncol(A)
      for (i in 1:n) {
        tA <- t(A)
        Y <- tA %*% Q
        Q <- qr.Q(qr(Y))
        Y <- A %*% Q
        Q <- qr.Q(qr(Y))
      }
      Q
    }

Less Wasteful

    g <- function(A, Q) {
      n <- ncol(A)
      tA <- t(A)
      for (i in 1:n) {
        Y <- tA %*% Q
        Q <- qr.Q(qr(Y))
        Y <- A %*% Q
        Q <- qr.Q(qr(Y))
      }
      Q
    }

Summary

Compiling R itself with a different compiler can improve performance, but is non-trivial.

The compiler package offers small, but free speedup.
The (bytecode) compiler works best on loops.

5 An Overview of Parallelism

Terminology: Parallelism

Parallel Programming Vocabulary: Difficulty in Parallelism

1 Implicit parallelism: Parallel details hidden from user.
  Example: Using multi-threaded BLAS

2 Explicit parallelism: Some assembly required...
  Example: Using the mclapply() from the parallel package

3 Embarrassingly Parallel or loosely coupled: Obvious how to make parallel; lots of independence in computations.
  Example: Fit two independent models in parallel.

4 Tightly Coupled: Opposite of embarrassingly parallel; lots of dependence in computations.

Speedup

Wallclock Time: Time of the clock on the wall from start to finish

Speedup: unitless measure of improvement; more is better.

\[ S_{n_1,n_2} = \frac{\text{Time for } n_1 \text{ cores}}{\text{Time for } n_2 \text{ cores}} \]

$n_1$ is often taken to be 1
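The definition translates directly into a one-line function. The wallclock times below are invented purely to show the arithmetic; they are not measurements from the talk.

```r
# Speedup of a run on n2 cores relative to a run on n1 cores.
speedup <- function(time_n1, time_n2) time_n1 / time_n2

# Hypothetical wallclock times in seconds, indexed by number of cores.
cores <- c(1, 2, 4, 8)
times <- c(120, 63, 34, 21)

# Take n1 = 1 as the baseline, as is common.
s <- speedup(times[1], times)

# Optimal (linear) speedup equals the number of cores.
efficiency <- s / cores
```

With these made-up numbers the 8-core run is roughly 5.7 times faster than the serial run, short of the optimal factor of 8.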

Speedup

[Figure: two panels, "Good Speedup" and "Bad Speedup", each plotting speedup against cores (2 to 8) for the application and for the optimal case.]

Shared and Distributed Memory Machines

Shared Memory
Direct access to read/change memory (one node)
Examples: laptop, GPU, MIC

Distributed
No direct access to read/change memory (many nodes); requires communication

Shared and Distributed Memory Machines

Shared Memory Machines
Thousands of cores
Nautilus, University of Tennessee: 1024 cores, 4 TB RAM

Distributed Memory Machines
Hundreds of thousands of cores
Kraken, University of Tennessee: 112,896 cores, 147 TB RAM

Shared and Distributed Programming from R

Shared Memory
Examples: parallel, snow, foreach, gputools, HiPLARM

Distributed
Examples: pbdR, Rmpi, RHadoop, RHIPE

CRAN HPC Task View


Choice of BLAS Library

The BLAS

Basic Linear Algebra Subprograms.

Simple vector-vector (level 1), matrix-vector (level 2), and matrix-matrix (level 3).

R uses BLAS (and LAPACK) for most linear algebra operations.

There are different implementations available, with massively different performance.

Benchmark

    set.seed(1234)
    m <- 2000
    n <- 2000
    x <- matrix(
      rnorm(m*n),
      m, n)

    object.size(x)

    library(rbenchmark)

    benchmark(x%*%x)
    benchmark(svd(x))

[Figure: Comparison of Different BLAS Implementations for Matrix-Matrix Multiplication and SVD. Average wall clock run time (10 runs) for x%*%x on 2000x2000 (~31 MiB) and 4000x4000 (~122 MiB) matrices, and for svd(x) on 1000x1000 (~8 MiB) and 2000x2000 (~31 MiB) matrices.]

Using openblas

On Debian and derivatives:

    sudo apt-get install libopenblas-dev
    sudo update-alternatives --config libblas.so.3

Guidelines

Independence

Parallelism requires independence.

Separate evaluations of R functions are embarrassingly parallel.
For bio applications, this may mean splitting calculations by gene.
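Splitting by gene might look like the sketch below. Everything in it is an assumption made for illustration: the simulated expression matrix, the two-group design, and the helper fit_gene(). It uses mclapply() from the parallel package and falls back to one core on Windows, where forking is unavailable.

```r
library(parallel)

# Simulated data: 20 genes (rows) measured on 6 samples (columns).
set.seed(42)
expr <- matrix(rnorm(20 * 6), nrow = 20,
               dimnames = list(paste0("gene", 1:20), NULL))
group <- rep(c(0, 1), each = 3)   # hypothetical two-group design

# One independent model per gene: slope of expression on group.
fit_gene <- function(y) unname(coef(lm(y ~ group))[2])

# One list element per gene, so each fit is a separate task.
genes <- lapply(seq_len(nrow(expr)), function(g) expr[g, ])
names(genes) <- rownames(expr)

serial <- vapply(genes, fit_gene, numeric(1))

ncores <- if (.Platform$OS.type == "windows") 1L else 2L
par_fit <- unlist(mclapply(genes, fit_gene, mc.cores = ncores))
```

Because no fit depends on any other, the serial and parallel results should be the same; only the wallclock time changes.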


Portability

Not all packages (or methods within a package) support all OS’s.

In the HPC world, that usually means “doesn’t work on Windows”.


RNG’s in Parallel

Be careful!
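One common way to be careful, sketched here as an assumption rather than as the talk's recommendation, is to give each worker its own stream of the L'Ecuyer-CMRG generator via nextRNGStream() from the parallel package. The helper draw() is hypothetical and only demonstrates that streams are reproducible and distinct.

```r
library(parallel)

# Switch to a generator that supports multiple independent streams.
RNGkind("L'Ecuyer-CMRG")
set.seed(1234)

# Stream for worker 1 is the current seed; worker 2 gets the next stream.
stream1 <- get(".Random.seed", envir = globalenv())
stream2 <- nextRNGStream(stream1)

# Hypothetical helper: install a stream's seed, then draw n uniforms.
draw <- function(seed, n) {
  assign(".Random.seed", seed, envir = globalenv())
  runif(n)
}

a <- draw(stream1, 3)
b <- draw(stream2, 3)
a_again <- draw(stream1, 3)   # same stream, same numbers
```

In practice mclapply() (with mc.set.seed = TRUE) and clusterSetRNGStream() handle this bookkeeping when the L'Ecuyer-CMRG generator is selected.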


Summary

Many kinds of parallelism available to R.

Better/parallel BLAS is free speedup for linear algebra, but takes some work.

6 Shared Memory Parallelism in R

The parallel Package

Comes with R ≥ 2.14.0

Includes multicore + most of snow.

The parallel Package: multicore

(+) Data copied to child on write (handled by OS)
(+) Very efficient.
(-) No Windows support.
(-) Not as efficient as threads.

The parallel Package: multicore

    mclapply(X, FUN, ...,
             mc.preschedule=TRUE, mc.set.seed=TRUE,
             mc.silent=FALSE, mc.cores=getOption("mc.cores", 2L),
             mc.cleanup=TRUE, mc.allow.recursive=TRUE)

    x <- lapply(1:10, sqrt)

    library(parallel)
    x.mc <- mclapply(1:10, sqrt)

The parallel Package: multicore

    simplify2array(mclapply(1:10, function(i) Sys.getpid(), mc.cores=4))
    # [1] 27452 27453 27454 27455 27452 27453 27454 27455 27452 27453

    simplify2array(mclapply(1:2, function(i) Sys.getpid(), mc.cores=4))
    # [1] 27457 2745

The parallel Package: snow

Uses sockets.

(+) Works on all platforms.

(-) More fiddly than mclapply().

The parallel Package: snow

    ### Set up the worker processes
    my.cl <- makeCluster(detectCores())
    my.cl
    # socket cluster with 4 nodes on host localhost

    # Pass the cluster object created above
    parSapply(my.cl, 1:5, sqrt)

    stopCluster(my.cl)

The parallel Package: Summary

All: detectCores(), splitIndices()
multicore: mclapply(), mcmapply(), mcparallel()
snow: makeCluster(), stopCluster(), parLapply()

The foreach Package

On CRAN (Revolution Analytics).

Main package is foreach, which is a single interface for a number of “backend” packages.

Backends: doMC, doMPI, doParallel, doRedis, doRNG,

(+) Works on all platforms (with correct backend).
(+) Can even work serial with minor notational change.
(+) Write the code once, use whichever backend you prefer.
(-) Really bizarre, non-R-ish syntax.

Efficiency Issues ???

[Figure: Coin Flipping with 24 Cores. Log run time in seconds against length of iterating set (10 to 1e+06) for lapply, mclapply, and foreach.]

    ### Bad performance
    foreach(i=1:len) %dopar% tinyfun(i)

    ### Expected performance
    foreach(i=1:ncores) %dopar% {
      out <- numeric(len/ncores)
      for (j in 1:(len/ncores))
        out[j] <- tinyfun(j)   # index by j, the position within this chunk
      out
    }
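The same chunking idea, handing each worker one block of iterations instead of one tiny task per iteration, can be sketched with the parallel package alone. The function tinyfun() here is a stand-in (the original is not shown in the slides), and the values of len and ncores are assumptions.

```r
library(parallel)

tinyfun <- function(i) sqrt(i)   # hypothetical cheap per-iteration task
len <- 1000L
ncores <- 2L

# Split 1:len into ncores contiguous blocks of indices.
chunks <- splitIndices(len, ncores)

# Fork only where the platform supports it.
use_cores <- if (.Platform$OS.type == "windows") 1L else ncores

# Each worker loops over its whole block and returns one vector.
res <- mclapply(chunks,
                function(idx) vapply(idx, tinyfun, numeric(1)),
                mc.cores = use_cores)
out <- unlist(res)

reference <- vapply(seq_len(len), tinyfun, numeric(1))
```

Dispatch overhead is then paid once per chunk rather than once per call to tinyfun().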


The foreach Package: General Procedure

Load foreach and your backend package.

Register your backend.

Using foreach: serial

    library(foreach)

    ### Example 1
    foreach(i=1:3) %do% sqrt(i)

    ### Example 2
    n <- 50
    reps <- 100

    x <- foreach(i=1:reps) %do% {
      sum(rnorm(n, mean=i)) / (n*reps)
    }

Using foreach: Parallel

    library(foreach)
    library(<mybackend>)

    register<MyBackend>()

    ### Example 1
    foreach(i=1:3) %dopar% sqrt(i)

    ### Example 2
    n <- 50
    reps <- 100

    x <- foreach(i=1:reps) %dopar% {
      sum(rnorm(n, mean=i)) / (n*reps)
    }

foreach backends

multicore

    library(doParallel)
    registerDoParallel(cores=ncores)
    foreach(i=1:2) %dopar% Sys.getpid()

snow

    library(doParallel)
    cl <- makeCluster(ncores)
    registerDoParallel(cl=cl)

    foreach(i=1:2) %dopar% Sys.getpid()
    stopCluster(cl)

foreach Summary

Make sure to register your backend.

Different backends may have different performance.

Use %dopar% for parallel foreach.

%do% and %dopar% must appear on the same line as the foreach() call.

7 Distributed Memory Parallelism with R

Distributed Memory Parallelism

Why Distribute?

Nodes only hold so much RAM.

Commodity hardware: ≈ 32-64 GiB.

With a few exceptions (ff, bigmemory), R does computations in memory.
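A quick back-of-the-envelope check shows how fast a dense matrix outgrows one node. The helper dense_gib() is a hypothetical name, and the only assumption is that each double takes 8 bytes.

```r
# Approximate size in GiB of a dense nrow x ncol matrix of doubles.
dense_gib <- function(nrow, ncol) nrow * ncol * 8 / 2^30

# A 100,000 x 100,000 matrix needs about 74.5 GiB, more than the
# 32-64 GiB of a typical commodity node.
big <- dense_gib(1e5, 1e5)

# Cross-check the formula against an object that does fit in memory.
x <- matrix(0, 2000, 2000)
measured_mib <- as.numeric(object.size(x)) / 2^20
predicted_mib <- dense_gib(2000, 2000) * 1024
```

The measured size is slightly larger than the prediction because object.size() also counts the object's header.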

Packages for Distributed Memory Parallelism in R

Rmpi, and snow via Rmpi

RHIPE and RHadoop ecosystem

Hasty Explanation of MPI

We will return to this on Monday.

MPI = Message Passing Interface

Recall: Distributed machines can’t directly manipulate memory of other nodes.

Can indirectly manipulate them, however...

Rmpi

Rmpi Hello World

    mpi.spawn.Rslaves(nslaves=2)
    # 2 slaves are spawned successfully. 0 failed.
    # master (rank 0, comm 1) of size 3 is running on: wootabega
    # slave1 (rank 1, comm 1) of size 3 is running on: wootabega
    # slave2 (rank 2, comm 1) of size 3 is running on: wootabega

    mpi.remote.exec(paste("I am", mpi.comm.rank(), "of", mpi.comm.size()))
    # $slave1
    # [1] "I am 1 of 3"
    #
    # $slave2
    # [1] "I am 2 of 3"

    mpi.exit()

Using Rmpi from snow

    library(snow)
    library(Rmpi)

    cl <- makeCluster(2, type="MPI")
    clusterCall(cl, function() Sys.getpid())
    clusterCall(cl, runif, 2)
    stopCluster(cl)
    mpi.quit()


Rmpi Resources

Rmpi tutorial: http://math.acadiau.ca/ACMMaC/Rmpi/

Rmpi manual: http:

pbdMPI vs Rmpi

Rmpi is interactive; pbdMPI is exclusively batch.

pbdMPI is easier to install.

pbdMPI has a simpler interface.

Example Syntax

Rmpi

    # int
    mpi.allreduce(x, type=1)
    # double
    mpi.allreduce(x, type=2)

pbdMPI

    allreduce(x)

Types in R

    > is.integer(1)
    [1] FALSE
    > is.integer(2)
    [1] FALSE
    > is.integer(1:2)
    [1] TRUE
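The type distinction that Rmpi asks the caller to spell out can be inspected in base R. This short sketch uses only typeof(), is.integer(), and as.integer(); the variable names are arbitrary.

```r
# A bare numeric literal is a double; the L suffix makes an integer.
literal_type <- typeof(1)        # "double"
suffixed_type <- typeof(1L)      # "integer"

# The colon operator returns integers when its endpoints are whole numbers.
range_type <- typeof(1:2)        # "integer"

# Arithmetic with a double silently promotes integers to double.
promoted_type <- typeof(1:2 + 0.5)   # "double"

# Explicit conversion when an integer buffer is really wanted.
converted <- as.integer(c(1, 2))
```

This is why an interface that requires type=1 or type=2 is easy to get wrong: values that look like integers are often stored as doubles.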

Summary

Distributed parallelism is necessary when computations no longer fit in RAM.

Several options available; most go beyond the scope of this talk.

8 Exercises

1 Suppose we wish to store the square root of all integers from 1 to 10000 in a vector. Do this in each of the following ways, and compare them with rbenchmark:
    for loop without initialization
    for loop with initialization
    Ply function
    vectorization

2 Revisit the previous example, evaluating the different implementations with Rprof().

3 Count the number of integer multiples of 5 or 17 which are less than 10,000,000.
    Solve this with lapply().
    Solve this with vectorization.

4 The Monty Hall game is a well known “paradox” from elementary probability. From Wikipedia:

Suppose you’re on a game show, and you’re given the choice of three doors: Behind one door is a car; behind the others, goats. You pick a door, say No. 1, and the host, who knows what’s behind the doors, opens another door, say No. 3, which has a goat. He then says to you, “Do you want to pick door No. 2?” Is it to your advantage to switch your choice?

5 The following is a more substantive example that utilizes multiple cores to perform a real analysis task. Run and evaluate this example. The example is modified from an example of Wei-Chen’s from the pbdDEMO package.

There are 148 EIAV sequences in the data set. They were sequenced from multiple blood serum samples collected longitudinally from one EIA horse over periodic fever cycles after EIAV infection. The virus population evolved within the sick horse over time. Some subtypes can break the horse’s immune system. The goal was to identify how many subtypes were evolving within the horse, which type is associated with early disease onset, and at which sample or time point to isolate that subtype of the virus. A further question is which mutated region of the sequence was critical for that subtype to break the immune system.

    library(phyclust, quietly=TRUE)
    library(parallel)

    ### Load data
    data.path <- paste(.libPaths()[1], "/phyclust/data/pony524.phy", sep="")
    pony.524 <- read.phylip(data.path)
    X <- pony.524$org
    K0 <- 1
    Ka <- 2

    ### Find MLEs
    ret.K0 <- find.best(X, K0)
    ...
      X.b <- bootstrap.seq.data(ret.K0)$org

      ret.K0 <- phyclust(X.b, K0)
      repeat{
        ret.Ka <- phyclust(X.b, Ka)
        if (ret.Ka$logL > ret.K0$logL) {
          break
        }
      }

      LRT.b <- -2 * (ret.Ka$logL - ret.K0$logL)
      LRT.b
    }

    ### Task pull and summary
    ret <- mclapply(1:100, FUN)
    LRT.B <- unlist(ret)
    cat("K0: ", K0, "\n",
        "Ka: ", Ka, "\n",
        "logL K0: ", ret.K0$logL, "\n",
        "logL Ka: ", ret.Ka$logL, "\n",
        "LRT: ", LRT, "\n",
Important Topics Not Discussed Here

Distributed computing (for real): pbdR on Monday.
Utilizing compiled code: Rcpp on Tuesday.
Multithreading.
GPU’s and MIC’s.
R+Hadoop.


Thanks for coming!
