Academic year: 2021

(1)

Why is R slow?

How to run R programs faster?

Tomas Kalibera


(2)

My Background

Virtual machines, runtimes for programming languages

Real-time Java

Automatic memory management

Evaluating software performance

R User

Benchmarks

(3)

Currently working on: FastR

A new, experimental virtual machine for (a subset of) the R language.

Discovering optimizations that can speed up R.


Core team

Jan Vitek

Tomas Kalibera

Petr Maj

Floreal Morandat

(4)

Community: Dynamic Languages for Scalable Data Analytics

Use one dynamic, high-level language for data analytics tasks running on platforms from a tablet to the cloud.

R, Matlab, Python, Julia

Large software companies interested in R

NSF Funded Workshop at SPLASH 2013

(5)
(6)

Source code

    int main(int argc, char **argv) {
      if (argc != 2) {
        fprintf(stderr, "tm n\n");
        return 1;
      }
      int n = atoi(argv[1]);
      printf("n = %d\n", n);

[Figure: parsing turns the source code into a parse tree with nodes main, if, decl, call, !=, argc, 2, call, ret]

(7)

Interpreter

[Figure: parse tree with nodes main, if, decl, call, !=, argc, 2, call, ret]

Parse tree executed directly by (AST) Interpreter

    class If {
      Node Condition, TrueBranch, FalseBranch;

      Result execute() {
        if (Condition.execute() == TRUE) {
          TrueBranch.execute();
        } else {
          FalseBranch.execute();
        }
        return NULL;
      }
    }

GNU R works like this.
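To make the idea of executing the parse tree directly more concrete, here is a minimal sketch of a tree-walking evaluator written in R itself, operating on R's own quoted expressions. The function name `evalNode` and the tiny set of supported operators are assumptions made for illustration; this is not how GNU R's evaluator is implemented, only the same general technique.

```r
# Toy AST interpreter: walks a quoted R expression node by node.
# Supports constants, variables, `(`, `if`, and the binary operators + - !=.
evalNode <- function(node, env) {
  if (is.numeric(node) || is.logical(node)) {
    return(node)                                  # constant leaf
  }
  if (is.name(node)) {
    return(get(as.character(node), envir = env))  # variable look-up
  }
  stopifnot(is.call(node))
  op <- as.character(node[[1]])
  if (op == "if") {
    # Same shape as the If node on the slide: evaluate the condition,
    # then exactly one of the two branches.
    if (isTRUE(evalNode(node[[2]], env))) {
      return(evalNode(node[[3]], env))
    } else if (length(node) >= 4) {
      return(evalNode(node[[4]], env))
    }
    return(NULL)
  }
  if (op == "(") {
    return(evalNode(node[[2]], env))
  }
  # Binary operators: evaluate both children first, then combine.
  args <- lapply(as.list(node)[-1], evalNode, env = env)
  switch(op,
    "+"  = args[[1]] + args[[2]],
    "-"  = args[[1]] - args[[2]],
    "!=" = args[[1]] != args[[2]],
    stop("unsupported operator: ", op))
}

env <- new.env()
assign("argc", 3, envir = env)
tree <- quote(if (argc != 2) argc + 1 else argc - 1)
result <- evalNode(tree, env)
print(result)
```

Every evaluation step here is a recursive function call plus a dispatch on the node type, which is the per-node overhead an AST interpreter pays.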

(8)

Compiler

Ahead of time: C/C++/Fortran

Just-in-time: Java/C#

[Figure: the parse tree (main, if, decl, call, !=, argc, 2, call, ret) goes through compilation and linking to produce machine code]

Machine code

    0000000000400580 <main>:
      400580: 41 54                  push   %r12
      400582: 83 ff 02               cmp    $0x2,%edi
      400585: 55                     push   %rbp
      400586: 53                     push   %rbx
      400587: 74 25                  je     4005ae <main+0x2e>
      400589: 48 8b 0d c8 0a 20 00   mov    0x200ac8(%rip),%rcx
      400590: ba 05 00 00 00         mov    $0x5,%edx
      400595: be 01 00 00 00         mov    $0x1,%esi
      40059a: bf 04 08 40 00         mov    $0x400804,%edi
      40059f: e8 cc ff ff ff         callq  400570 <fwrite@plt>
      4005a4: b8 01 00 00 00         mov    $0x1,%eax
      4005a9: 5b                     pop    %rbx
      4005aa: 5d                     pop    %rbp
      4005ab: 41 5c                  pop    %r12
      4005ad: c3                     retq

Fast.

(9)

FastR

Self-optimizing AST interpreter

– Aims to be still easy to develop, maintain

– But fast

The AST (tree) rewrites as the program executes

– Speculative rewrites, recovery

Runs on a JVM

– High-performance garbage collector

– Just-in-Time compilation improves speed

(10)
(11)
(12)

Understanding why GNU-R is slow

Speeding-up R programs

(13)

Toeplitz Matrix

In AT&T R Benchmarks 2.5 (Simon Urbanek)

Initializing a square matrix

$a_{i,j} = |i - j| + 1$

    1 2 3 4 5
    2 1 2 3 4
    3 2 1 2 3
    4 3 2 1 2
    5 4 3 2 1
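As a small worked check of the definition, the sketch below fills a 5 x 5 matrix entry by entry from $a_{i,j} = |i - j| + 1$ and compares it with `stats::toeplitz(1:n)`, which places `x[|i - j| + 1]` at position (i, j). The variable names `a` and `ref` are chosen here for illustration only.

```r
# Build the matrix straight from the definition a[i, j] = |i - j| + 1.
n <- 5
a <- matrix(0, nrow = n, ncol = n)
for (i in 1:n) {
  for (j in 1:n) {
    a[i, j] <- abs(i - j) + 1
  }
}

# toeplitz(x) puts x[|i - j| + 1] at (i, j); with x = 1:n that is |i - j| + 1.
ref <- toeplitz(1:n)

print(a)
print(all(a == ref))
```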

(14)

TM using For Loop

(as included in AT&T R Benchmarks 2.5)

    tmFor <- function(n) {
      b <- matrix(nrow = n, ncol = n)
      for (j in 1:n) {
        for (k in 1:n) {
          b[k,j] <- abs(j - k) + 1
        }
      }
      b
    }

$a_{i,j} = |i - j| + 1$

(15)

TM using For Loop

    N = 500     650 ms
    N = 1000   2610 ms
    N = 1500   5910 ms
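The timings grow roughly with the square of N (2610 / 650 is about 4.0 for twice the size, 5910 / 650 about 9.1 for three times the size), as expected for a loop over all n x n entries. One possible way to collect such numbers is sketched below with `system.time()`; the helper name `timeTm` and the choice of a median over three runs are assumptions, not the measurement method used for the slides.

```r
tmFor <- function(n) {
  b <- matrix(nrow = n, ncol = n)
  for (j in 1:n) {
    for (k in 1:n) {
      b[k, j] <- abs(j - k) + 1
    }
  }
  b
}

# Hypothetical helper: median elapsed time of f(n) in milliseconds.
timeTm <- function(f, n, reps = 3) {
  times <- vapply(seq_len(reps),
                  function(i) system.time(f(n))[["elapsed"]],
                  numeric(1))
  median(times) * 1000
}

for (n in c(100, 200, 400)) {
  cat("N =", n, ":", timeTm(tmFor, n), "ms\n")
}
```

Absolute numbers depend on the machine and the R version, so only the growth pattern is comparable with the table above.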

(16)

TM in C

    int *b = (int *)malloc(n * n * sizeof(int));
    for(j = 1; j <= n; j++) {
      for(k = 1; k <= n; k++) {
        b[(k - 1) + (j - 1) * n] = abs(j - k) + 1;
      }
    }

                In R       In C
    N = 500     650 ms     0.2 ms
    N = 1000   2610 ms     0.9 ms
    N = 1500   5910 ms     2.1 ms
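The C version stores the matrix in a flat array and computes the offset `(k - 1) + (j - 1) * n` by hand, which is the column-major layout R uses for matrices. The sketch below checks the 1-based equivalent of that formula from R; `linearIndex` is a name introduced here for illustration.

```r
n <- 4
b <- matrix(seq_len(n * n), nrow = n, ncol = n)

# 1-based counterpart of the C offset (k - 1) + (j - 1) * n.
linearIndex <- function(k, j, n) k + (j - 1) * n

# Indexing with a single subscript addresses the underlying flat vector.
ok <- TRUE
for (j in 1:n) {
  for (k in 1:n) {
    ok <- ok && (b[k, j] == b[linearIndex(k, j, n)])
  }
}
print(ok)
```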

(17)

Toeplitz Matrix

(18)

TM: Checking with a profiler

    > Rprof()
    > dummy <- tmFor(5000)
    > Rprof(NULL)
    > summaryRprof()
    $by.self
             self.time self.pct total.time total.pct
    "tmFor"      51.42    86.36      59.54    100.00
    "abs"         2.80     4.70       2.80      4.70
    "-"           2.76     4.64       2.76      4.64
    "+"           2.42     4.06       2.42      4.06
    "matrix"      0.12     0.20       0.12      0.20
    ":"           0.02     0.03       0.02      0.03

    $by.total
             total.time total.pct self.time self.pct
    "tmFor"       59.54    100.00     51.42    86.36
    "abs"          2.80      4.70      2.80     4.70
    "-"            2.76      4.64      2.76     4.64
    "+"            2.42      4.06      2.42     4.06
    "matrix"       0.12      0.20      0.12     0.20
    ":"            0.02      0.03      0.02     0.03
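For readers who want to reproduce a profile like this non-interactively, here is a minimal sketch that writes the samples to a temporary file and summarizes them. The file name, the matrix size, and the half-second busy loop are assumptions chosen so that the example finishes quickly while still collecting samples; it also assumes an R build with profiling support.

```r
tmFor <- function(n) {
  b <- matrix(nrow = n, ncol = n)
  for (j in 1:n) {
    for (k in 1:n) {
      b[k, j] <- abs(j - k) + 1
    }
  }
  b
}

profFile <- tempfile(fileext = ".out")   # hypothetical output location
Rprof(profFile)                          # start sampling
t0 <- proc.time()[["elapsed"]]
while (proc.time()[["elapsed"]] - t0 < 0.5) {
  dummy <- tmFor(200)                    # keep the CPU busy for about 0.5 s
}
Rprof(NULL)                              # stop sampling

prof <- summaryRprof(profFile)
print(head(prof$by.self))
unlink(profFile)
```

The summary attributes time to R-level functions only, so work done inside the evaluator itself shows up under the calling function.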

(19)

TM: R profiler does not help

    tmFor <- function(n) {
      b <- matrix(nrow = n, ncol = n)
      for (j in 1:n) {
        for (k in 1:n) {
          b[k,j] <- abs(j - k) + 1
        }
      }
      b
    }

Performance critical part.

(20)

TM: Checking with a system profiler

    env CFLAGS=-g ./configure --with-blas --with-lapack --enable-R-static-lib --disable-BLAS-shlib
    make

    source("tm.r")
    dummy <- tmFor(5000)

    perf record -g -- ~/work/R/R-3.0.2/R-3.0.2-dbg/bin/R --slave < runtm.r
    perf report -g

    + 1.08% R R [.] real_binary
    + 0.75% R R [.] integer_binary
    + 0.74% R R [.] do_abs
    + 9.91% R R [.] Rf_eval
    + 9.53% R R [.] Rf_cons
    - 6.67% R R [.] Rf_findVarInFrame3
       - Rf_findVarInFrame3
          + 29.17% Rf_findVar
          + 7.84% EnsureLocal
          + 2.21% Rf_eval

(21)

TM: Checking with a system profiler

    + 9.91% R R [.] Rf_eval
    + 9.53% R R [.] Rf_cons
    - 6.67% R R [.] Rf_findVarInFrame3
       - Rf_findVarInFrame3
          + 29.17% Rf_findVar
          + 7.84% EnsureLocal

(22)

R built-in functions can be changed

    for (j in 1:n) {
      for (k in 1:n) {
        b[k,j] <- abs(j - k) + 1
      }
    }

abs is a built-in function

abs can be changed at any time

    > abs <- function(x) { x * x }
    > abs(-10)
    [1] 100

    > for(i in 11:13) { if (i==12) { abs <- sqrt } ; print(abs(i)) }
    [1] 11
    [1] 3.464102
    [1] 3.605551

Variable look-up
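The same effect can be shown without touching the global workspace. In the sketch below a local definition shadows `abs` inside one function only, while `base::abs` still names the original; `squareInsteadOfAbs` is a made-up name for illustration.

```r
squareInsteadOfAbs <- function(x) {
  abs <- function(v) v * v   # local binding shadows the built-in
  abs(x)                     # look-up finds the local function first
}

r1 <- squareInsteadOfAbs(-10)   # 100: the shadowing definition was used
r2 <- abs(-10)                  # 10: outside, look-up reaches the built-in
r3 <- base::abs(-10)            # 10: explicit reference to the base function

print(c(r1, r2, r3))
```

Because any enclosing scope may hold such a binding, the interpreter cannot assume that the name `abs` refers to the primitive without looking it up.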

(23)

R built-in functions can be changed

    tmFor <- function(n) {
      b <- matrix(nrow = n, ncol = n)
      for (j in 1:n) {
        for (k in 1:n) {
          b[k,j] <- abs(j - k) + 1
        }
      }
      b
    }

[Figure: variable look-up chain for abs: tmFor frame (b, j, k) -> GlobalEnv (tmFor) -> BaseNamespaceEnv (abs = .Primitive("abs"))]

Variable look-up
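The chain of environments searched for `abs` can be walked explicitly. The following is only a sketch: `findBinding` and `probe` are hypothetical helpers, and in an ordinary session the chain also passes through the attached packages between the global environment and base, so the hop count is larger than in the simplified picture.

```r
# Walk the enclosing environments until `name` is found; count the hops.
findBinding <- function(name, env) {
  hops <- 0
  while (!identical(env, emptyenv())) {
    if (exists(name, envir = env, inherits = FALSE)) {
      return(list(hops = hops, env = env))
    }
    env <- parent.env(env)
    hops <- hops + 1
  }
  NULL   # not found anywhere on the chain
}

probe <- function(n) {
  b <- NULL; j <- 1; k <- 1            # locals, like those of tmFor
  findBinding("abs", environment())    # start from this call's frame
}

res <- probe(5)
print(res$hops)
print(environmentName(res$env))
```

A naive interpreter repeats this walk every time the loop body mentions `abs`, `+`, `-` or `[`.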

(24)

R built-in functions can be changed

    for (j in 1:n) {
      for (k in 1:n) {
        b[k,j] <- abs(j - k) + 1
      }
    }

abs is a built-in function

+ - ( [ { <- for : are all built-in functions

    > `:` <- sum
    > 1:10
    [1] 11

    > `<-` <- function(x,val) { eval.parent( assign(deparse(substitute(x)), 100)) }
    > z <- 10
    [1] 100

(25)

Variables can be deleted

    for (j in 1:n) {
      for (k in 1:n) {
        b[k,j] <- abs(j - k) + 1
      }
    }

    > x <- 10
    > rm(x)
    > x
    Error: object 'x' not found

    > for(i in 1:3) { if (i==2) { rm(i) } else print(i) }
    [1] 1
    [1] 3

    > for(i in 1:3) { if (i==2) { rm(i) } ; print(i) }
    [1] 1
    Error in print(i) : object 'i' not found

Variable look-up is needed

Loop control variable can be deleted

(26)

TM: Checking with a system profiler

Linked-list allocation and use

    + 9.91% R R [.] Rf_eval
    - 9.53% R R [.] Rf_cons
       - Rf_cons
          + 29.87% Rf_allocList
          + 24.96% Rf_evalList
          + 14.35% Rf_evalListKeepMissing
          + 6.04% Rf_lcons
          + 5.90% Rf_DispatchOrEval
          + 5.29% Rf_list2
          + 3.85% evalseq
          + 3.26% Rf_defineVar
          + 3.04% Rf_list1
          + 1.18% Rf_eval
          + 0.75% replaceCall
          + 0.52% evalArgs
    + 6.67% R R [.] Rf_findVarInFrame3

(27)

Arguments passed as linked-list

Linked-list allocation and use

    for (j in 1:n) {
      for (k in 1:n) {
        b[k,j] <- abs(j - k) + 1

Converted to a general replacement call of form F(X) ← Y

The replacement call is then transformed

    F(X) ← Y
    TMP ← X
    X ← "F<-"( TMP, value = Y )

    b[k,j] ← Y
    TMP ← b

(28)

Replacement call is expensive

Linked-list allocation and use

    b[k,j] ← Y
    TMP ← b
    b ← "[<-"( TMP, k, j, value = Y )

[Figure: the call b <- "[<-"( TMP, k, j, value = Y ) drawn as a linked list of cells]

This linked list allocated in each iteration
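The transformation can be observed from R code, because the replacement function `[<-` can also be called directly. The sketch below compares an ordinary indexed assignment with the explicit two-step form and then defines a small user-level replacement function; `tmp` and `second<-` are names invented for this example.

```r
b <- matrix(0, nrow = 2, ncol = 2)

# Ordinary indexed assignment.
b1 <- b
b1[1, 2] <- 5

# The same update written as the explicit replacement call.
tmp <- b
b2 <- `[<-`(tmp, 1, 2, value = 5)

same <- identical(b1, b2)
print(same)

# Any function named `name<-` can be used on the left-hand side.
`second<-` <- function(x, value) {
  x[2] <- value
  x
}
v <- c(10, 20, 30)
second(v) <- 99    # evaluated as v <- `second<-`(v, value = 99)
print(v)
```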

(29)

Toeplitz Matrix

(30)

R Byte-code compiler

    env R_ENABLE_JIT=3 R

                AST        Bytecode
    N = 500     650 ms     130 ms
    N = 1000   2610 ms     530 ms
    N = 1500   5910 ms    1150 ms

Always use byte-code compiler!

    > require(compiler)
    Loading required package: compiler
    > help(cmpfun)
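A minimal sketch of the two ways to get byte-code from within a session: compiling a single function with `cmpfun()`, or switching the JIT level with `enableJIT()`. The name `tmForCompiled` is chosen here for illustration. Note that from R 3.4.0 onward the JIT is enabled by default, so on current versions the uncompiled baseline from the table may not be reproducible without first calling `enableJIT(0)`.

```r
library(compiler)

tmFor <- function(n) {
  b <- matrix(nrow = n, ncol = n)
  for (j in 1:n) {
    for (k in 1:n) {
      b[k, j] <- abs(j - k) + 1
    }
  }
  b
}

# Compile one function explicitly; the result is used like the original.
tmForCompiled <- cmpfun(tmFor)
print(identical(tmFor(50), tmForCompiled(50)))

# Alternatively set the JIT level for the whole session.
# enableJIT() returns the previous level, which is restored afterwards.
oldLevel <- enableJIT(3)
invisible(tmFor(50))
enableJIT(oldLevel)
```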

(31)

TM: Sapply

    tmSapply <- function(n) {
      sapply(1:n, function(j) {
        sapply(1:n, function(k) {
          abs(j - k) + 1
        })
      })
    }

(32)

TM: Sapply

                For        Sapply
    N = 500     130 ms     320 ms
    N = 1000    530 ms    1300 ms
    N = 1500   1150 ms    2960 ms

Using sapply instead of for sometimes helps. Not now...

(33)

TM: Rows Algo

    tmRows <- function(n) {
      b <- matrix(nrow = n, ncol = n)
      b[1,] <- 1:n
      if (n >= 2) {
        for(r in 2:n) {
          b[r,] <- c(r, b[r-1,-n])
        }
      }
      b
    }

    1 2 3 4 5
    2 1 2 3 4
    3 2 1 2 3
    4 3 2 1 2
    5 4 3 2 1

(34)

TM: Rows Algo

                For        Rows
    N = 500     130 ms     13 ms
    N = 1000    530 ms     59 ms
    N = 1500   1150 ms    169 ms

(35)

TM: Cols Algo

    tmCols <- function(n) {
      b <- matrix(nrow = n, ncol = n)
      b[,1] <- 1:n
      if (n >= 2) {
        for(col in 2:n) {
          b[,col] <- c(col, b[-n, col-1])
        }
      }
      b
    }

    1 2 3 4 5
    2 1 2 3 4
    3 2 1 2 3
    4 3 2 1 2
    5 4 3 2 1

(36)

TM: Cols2 Algo

    tmByCols <- function(n) {
      if (n >= 2) {
        sapply(1:n, function(col) {
          if (col < n) {
            c( col:1, 2:(n-col+1) )
          } else {
            n:1
          }
        })
      } else {
        1
      }
    }

    1 2 3 4 5
    2 1 2 3 4
    3 2 1 2 3
    4 3 2 1 2
    5 4 3 2 1

(37)

TM: Cols2 Algo

                Rows       Cols2
    N = 500      13 ms      5 ms
    N = 1000     59 ms     39 ms
    N = 1500    169 ms     58 ms

(38)

TM: Outer Algo

    tmOuter <- function(n) {
      outer(X = 1:n, Y = 1:n, FUN = function(j,k) {
        abs(j - k) + 1
      })
    }

    1 2 3 4 5
    2 1 2 3 4
    3 2 1 2 3
    4 3 2 1 2
    5 4 3 2 1

(39)

TM: Outer Algo

                Cols2      Outer      C
    N = 500      5 ms       2 ms      0.2 ms
    N = 1000    39 ms      27 ms      0.9 ms
    N = 1500    58 ms      47 ms      2.1 ms

Yet faster. Vectorized.

Also easy to read.
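When swapping one variant for another it is worth confirming that they produce the same matrix. The sketch below does that for the loop, row-wise, and outer versions, and adds one more closure-free vectorized form, `tmVec`, which is an illustration added here and not one of the variants timed in the slides.

```r
tmFor <- function(n) {
  b <- matrix(nrow = n, ncol = n)
  for (j in 1:n) {
    for (k in 1:n) {
      b[k, j] <- abs(j - k) + 1
    }
  }
  b
}

tmRows <- function(n) {
  b <- matrix(nrow = n, ncol = n)
  b[1, ] <- 1:n
  if (n >= 2) {
    for (r in 2:n) {
      b[r, ] <- c(r, b[r - 1, -n])   # shift the previous row right by one
    }
  }
  b
}

tmOuter <- function(n) {
  outer(X = 1:n, Y = 1:n, FUN = function(j, k) abs(j - k) + 1)
}

# Illustrative extra variant: outer difference first, then vectorized abs and +.
tmVec <- function(n) abs(outer(1:n, 1:n, "-")) + 1

n <- 50
ref <- tmFor(n)
agree <- c(rows  = all(tmRows(n) == ref),
           outer = all(tmOuter(n) == ref),
           vec   = all(tmVec(n) == ref))
print(agree)
```

The comparison uses `==` rather than `identical()` because the variants may differ in storage type (integer versus double) while holding the same values.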

(40)

TM: Summary

                For        Outer      C         For-FastR
    N = 500     130 ms      2 ms      0.2 ms    13 ms
    N = 1000    530 ms     27 ms      0.9 ms    47 ms

(41)

Summary

Use byte-code compiler

Vectorize

Use built-ins (sum, prod, cumsum, outer)

Use simplest data structure possible

– Matrix instead of data.frame

– Avoid data.frame indexing

Save and re-use intermediate results

Please consider donating your code/data in form of benchmarks.
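Two of these recommendations can be sketched in a few lines: replacing a hand-written loop with a built-in, and indexing a matrix rather than a data.frame inside a loop. All function names below are hypothetical, and the sketch only checks that the alternatives give the same results; how much faster they are depends on the R version and the data.

```r
# Use built-ins: cumsum() instead of an explicit accumulation loop.
loopCumsum <- function(x) {
  out <- numeric(length(x))
  acc <- 0
  for (i in seq_along(x)) {
    acc <- acc + x[i]
    out[i] <- acc
  }
  out
}
x <- as.numeric(1:10000)
sameCumsum <- isTRUE(all.equal(loopCumsum(x), cumsum(x)))

# Use the simplest data structure: same element-wise loop, matrix vs data.frame.
sumFirstColumn <- function(tab) {
  s <- 0
  for (i in seq_len(nrow(tab))) {
    s <- s + tab[i, 1]    # on a data.frame each access goes through `[.data.frame`
  }
  s
}
m  <- matrix(runif(200), ncol = 2)
df <- as.data.frame(m)
sameSum <- isTRUE(all.equal(sumFirstColumn(m), sumFirstColumn(df)))

print(c(sameCumsum, sameSum))
print(system.time(for (r in 1:20) sumFirstColumn(m))[["elapsed"]])
print(system.time(for (r in 1:20) sumFirstColumn(df))[["elapsed"]])
```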
