Why is R slow?
How to run R programs faster?
Tomas Kalibera
My Background
Virtual machines, runtimes for programming languages
Real-time Java
Automatic memory management
Evaluating software performance
R User
Benchmarks
Currently working on: FastR
A new, experimental virtual machine for (a subset of) the R language.
Discovering optimizations that can speed up R.
Core team
Jan Vitek
Tomas Kalibera
Petr Maj
Floreal Morandat
Community: Dynamic Languages
for Scalable Data Analytics
Use one dynamic, high level language for data
analytics tasks running on platforms from a tablet to the cloud.
R, Matlab, Python, Julia
Large software companies interested in R
NSF Funded Workshop at SPLASH 2013
Source code
int main(int argc, char **argv) {
  if (argc != 2) {
    fprintf(stderr, "tm n\n");
    return 1;
  }
  int n = atoi(argv[1]);
  printf("n = %d\n", n);
parsing
Parse tree
[Diagram: parse tree with nodes main, if, decl, call, !=, argc, 2, call, ret]
Parse tree executed directly by (AST) Interpreter
class If {
  Node Condition, TrueBranch, FalseBranch;
  Result execute() {
    if (Condition.execute() == TRUE) {
      TrueBranch.execute();
    } else {
      FalseBranch.execute();
    }
    return NULL;
  }
}
GNU R works like this.
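To make the idea concrete, here is a minimal sketch of a tree-walking evaluator written in R itself, operating on the parse trees that `quote()` returns. The helper name `eval_ast` and the tiny set of supported nodes are assumptions made for illustration; GNU R's real evaluator is written in C and handles far more cases.

```r
# Minimal tree-walking evaluator over R parse trees (illustrative only).
# eval_ast is a hypothetical helper, not part of GNU R.
eval_ast <- function(node, env) {
  if (is.numeric(node)) {
    return(node)                                  # constant leaf
  }
  if (is.name(node)) {
    return(get(as.character(node), envir = env))  # variable look-up
  }
  op <- as.character(node[[1]])                   # call node: operator comes first
  switch(op,
    "+"  = eval_ast(node[[2]], env) + eval_ast(node[[3]], env),
    "-"  = eval_ast(node[[2]], env) - eval_ast(node[[3]], env),
    "!=" = eval_ast(node[[2]], env) != eval_ast(node[[3]], env),
    "if" = if (eval_ast(node[[2]], env)) {
             eval_ast(node[[3]], env)             # true branch
           } else {
             eval_ast(node[[4]], env)             # false branch
           },
    stop("unsupported node: ", op)
  )
}

env <- new.env()
assign("argc", 2, envir = env)
tree <- quote(if (argc != 2) 1 else argc + 40)
result <- eval_ast(tree, env)
```

Every node is re-examined each time it is executed, which is the source of the overhead discussed in the rest of the talk.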
Interpreter
Compiler: compilation, linking
Ahead of time: C/C++/Fortran
Just-in-time: Java/C#
Machine code
0000000000400580 <main>:
  400580: 41 54                 push %r12
  400582: 83 ff 02              cmp $0x2,%edi
  400585: 55                    push %rbp
  400586: 53                    push %rbx
  400587: 74 25                 je 4005ae <main+0x2e>
  400589: 48 8b 0d c8 0a 20 00  mov 0x200ac8(%rip),%rcx
  400590: ba 05 00 00 00        mov $0x5,%edx
  400595: be 01 00 00 00        mov $0x1,%esi
  40059a: bf 04 08 40 00        mov $0x400804,%edi
  40059f: e8 cc ff ff ff        callq 400570 <fwrite@plt>
  4005a4: b8 01 00 00 00        mov $0x1,%eax
  4005a9: 5b                    pop %rbx
  4005aa: 5d                    pop %rbp
  4005ab: 41 5c                 pop %r12
  4005ad: c3                    retq
[Diagram: the parse tree (main, if, decl, call, !=, argc, 2, call, ret) is compiled into machine code]
Fast.
FastR
● Self-optimizing AST interpreter
  – Aims to be still easy to develop, maintain
  – But fast
● The AST (tree) rewrites as the program executes
  – Speculative rewrites, recovery
● Runs on a JVM
  – High-performance garbage collector
  – Just-in-Time compilation improves speed
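The control flow of a speculative rewrite with recovery can be sketched in a few lines of R. This is only an illustration of the idea, not FastR's implementation (which runs on a JVM); `make_add_node` and its fields are hypothetical names, and in plain R the "specialized" path is of course not actually faster.

```r
# Hypothetical sketch of a self-rewriting "add" node; not FastR's actual code.
make_add_node <- function() {
  node <- new.env()
  node$state <- "uninitialized"

  generic <- function(a, b) a + b                 # always correct, slow path

  int_specialized <- function(a, b) {
    if (is.integer(a) && is.integer(b)) {
      a + b                                       # speculation holds
    } else {
      node$state <- "generic"                     # recovery: rewrite the node
      node$execute <- generic
      generic(a, b)
    }
  }

  node$execute <- function(a, b) {                # first execution picks a version
    if (is.integer(a) && is.integer(b)) {
      node$state <- "integer"
      node$execute <- int_specialized
    } else {
      node$state <- "generic"
      node$execute <- generic
    }
    node$execute(a, b)
  }
  node
}

add <- make_add_node()
r1 <- add$execute(1L, 2L)     # rewrites the node to the integer version
s1 <- add$state
r2 <- add$execute(1.5, 2L)    # speculation fails, node falls back to generic
s2 <- add$state
```

The node starts out knowing nothing, specializes on what it observes, and gives up the specialization once an observation contradicts it.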
Understanding why GNU-R is slow
Speeding-up R programs
Toeplitz Matrix
In AT&T R Benchmarks 2.5 (Simon Urbanek)
Initializing a square matrix
$a_{i,j} = |i - j| + 1$
1 2 3 4 5
2 1 2 3 4
3 2 1 2 3
4 3 2 1 2
5 4 3 2 1
TM using For Loop
(as included in AT&T R Benchmarks 2.5)
tmFor <- function(n) {
  b <- matrix(nrow = n, ncol = n)
  for (j in 1:n) {
    for (k in 1:n) {
      b[k,j] <- abs(j - k) + 1
    }
  }
  b
}
N = 500     650 ms
N = 1000    2610 ms
N = 1500    5910 ms
TM in C
int *b = (int *)malloc(n * n * sizeof(int));
for(j = 1; j <= n; j++) {
  for(k = 1; k <= n; k++) {
    b[(k - 1) + (j - 1) * n] = abs(j - k) + 1;
  }
}
            In R       In C
N = 500     650 ms     0.2 ms
N = 1000    2610 ms    0.9 ms
N = 1500    5910 ms    2.1 ms
Toeplitz Matrix
TM: Checking with a profiler
> Rprof()
> dummy <- tmFor(5000)
> Rprof(NULL)
> summaryRprof()
$by.self
         self.time self.pct total.time total.pct
"tmFor"      51.42    86.36      59.54    100.00
"abs"         2.80     4.70       2.80      4.70
"-"           2.76     4.64       2.76      4.64
"+"           2.42     4.06       2.42      4.06
"matrix"      0.12     0.20       0.12      0.20
":"           0.02     0.03       0.02      0.03
$by.total
         total.time total.pct self.time self.pct
"tmFor"       59.54    100.00     51.42    86.36
"abs"          2.80      4.70      2.80     4.70
"-"            2.76      4.64      2.76     4.64
"+"            2.42      4.06      2.42     4.06
"matrix"       0.12      0.20      0.12     0.20
":"            0.02      0.03      0.02     0.03
TM: R profiler does not help
tmFor <- function(n) {
  b <- matrix(nrow = n, ncol = n)
  for (j in 1:n) {
    for (k in 1:n) {
      b[k,j] <- abs(j - k) + 1
    }
  }
  b
}
Performance critical part.
TM: Checking with a system profiler
env CFLAGS=-g ./configure --with-blas --with-lapack --enable-R-static-lib --disable-BLAS-shlib
make
source("tm.r")
dummy <- tmFor(5000)
perf record -g -- ~/work/R/R-3.0.2/R-3.0.2-dbg/bin/R --slave < runtm.r
perf report -g
+ 1.08% R R [.] real_binary
+ 0.75% R R [.] integer_binary
+ 0.74% R R [.] do_abs
+ 9.91% R R [.] Rf_eval
+ 9.53% R R [.] Rf_cons
- 6.67% R R [.] Rf_findVarInFrame3
   - Rf_findVarInFrame3
      + 29.17% Rf_findVar
      + 7.84% EnsureLocal
      + 2.21% Rf_eval
TM: Checking with a system profiler
+ 9.91% R R [.] Rf_eval
+ 9.53% R R [.] Rf_cons
- 6.67% R R [.] Rf_findVarInFrame3
   - Rf_findVarInFrame3
      + 29.17% Rf_findVar
      + 7.84% EnsureLocal
R built-in functions can be changed
for (j in 1:n) {
  for (k in 1:n) {
    b[k,j] <- abs(j - k) + 1
  }
}
abs is a built-in function
abs can be changed at any time
> abs <- function(x) { x * x }
> abs(-10)
[1] 100
> for(i in 11:13) { if (i==12) { abs <- sqrt } ; print(abs(i)) }
[1] 11
[1] 3.464102
[1] 3.605551
Variable look-up
R built-in functions can be changed
tmFor <- function(n) {
  b <- matrix(nrow = n, ncol = n)
  for (j in 1:n) {
    for (k in 1:n) {
      b[k,j] <- abs(j - k) + 1
    }
  }
  b
}
[Diagram: variable look-up follows the chain of environments: tmFor frame (b, j, k), GlobalEnv (tmFor), BaseNamespaceEnv (abs, bound to .Primitive("abs"))]
Variable look-up
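The cost of finding `abs` can be pictured as a walk along a chain of environments. The sketch below mimics that walk in plain R; `lookup_depth` is a hypothetical helper, and the three small environments are stand-ins for the function frame, the global environment and the base namespace, not the real ones.

```r
# Count how many environments are searched before a name is found.
# lookup_depth is a hypothetical helper; it mimics the search, not R's C code.
lookup_depth <- function(name, env) {
  depth <- 0
  while (!identical(env, emptyenv())) {
    if (exists(name, envir = env, inherits = FALSE)) {
      return(depth)
    }
    env <- parent.env(env)       # not found here: move to the enclosing environment
    depth <- depth + 1
  }
  NA_integer_                    # ran off the end of the chain
}

# Stand-in chain: function frame -> global -> base
base_like   <- new.env(parent = emptyenv())
global_like <- new.env(parent = base_like)
frame_like  <- new.env(parent = global_like)
assign("abs", abs, envir = base_like)
assign("tmFor", "closure placeholder", envir = global_like)
assign("b", 0, envir = frame_like)

d_b   <- lookup_depth("b", frame_like)       # local variable: found at once
d_abs <- lookup_depth("abs", frame_like)     # built-in: found two hops away
d_no  <- lookup_depth("nosuch", frame_like)  # missing name: whole chain searched
```

In a real session the distance from the global environment to the base namespace depends on how many packages are attached, so `abs` is typically much further away than in this toy chain.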
R built-in functions can be changed
for (j in 1:n) {
  for (k in 1:n) {
    b[k,j] <- abs(j - k) + 1
  }
}
abs is a built-in function
+ - ( [ { ← for :
are all built-in functions
> `:` <- sum
> 1:10
[1] 11
> `<-` <- function(x,val) { eval.parent( assign(deparse(substitute(x)), 100)) }
> z <- 10
[1] 100
Variables can be deleted
for (j in 1:n) {
  for (k in 1:n) {
    b[k,j] <- abs(j - k) + 1
  }
}
> x <- 10
> rm(x)
> x
Error: object 'x' not found
> for(i in 1:3) { if (i==2) { rm(i) } else print(i) }
[1] 1
[1] 3
> for(i in 1:3) { if (i==2) { rm(i) } ; print(i) }
[1] 1
Error in print(i) : object 'i' not found
variable look-up is needed
Loop control variable can be deleted
TM: Checking with a system profiler
Linked-list allocation and use
+ 9.91% R R [.] Rf_eval
- 9.53% R R [.] Rf_cons
   - Rf_cons
      + 29.87% Rf_allocList
      + 24.96% Rf_evalList
      + 14.35% Rf_evalListKeepMissing
      + 6.04% Rf_lcons
      + 5.90% Rf_DispatchOrEval
      + 5.29% Rf_list2
      + 3.85% evalseq
      + 3.26% Rf_defineVar
      + 3.04% Rf_list1
      + 1.18% Rf_eval
      + 0.75% replaceCall
      + 0.52% evalArgs
+ 6.67% R R [.] Rf_findVarInFrame3
Arguments passed as linked-list
Linked-list allocation and use
for (j in 1:n) {
  for (k in 1:n) {
    b[k,j] <- abs(j - k) + 1
Converted to a general replacement call of form F(X) ← Y
The replacement call is then transformed
F(X) ← Y
  TMP ← X
  X ← "F<-"( TMP, value = Y )
b[k,j] ← Y
  TMP ← b
Replacement call is expensive
Linked-list allocation and use
b[k,j] ← Y
  TMP ← b
  b ← "[<-"( TMP, k, j, value = Y )
[Diagram: linked list of call nodes holding <-, b, [<-, TMP, k, j, Y]
This linked list is allocated in each iteration
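The transformation can be checked by hand: calling the replacement function `[<-` explicitly gives the same result as the usual assignment syntax. This is a small sketch with made-up values; the variable name `TMP` follows the slide, whereas GNU R itself uses an internal temporary.

```r
b1 <- matrix(0, nrow = 3, ncol = 3)
b2 <- b1
k <- 2; j <- 3; Y <- 42

b1[k, j] <- Y                        # the form written in the loop

TMP <- b2                            # the expanded form shown above
b2 <- `[<-`(TMP, k, j, value = Y)    # explicit call of the replacement function

same <- identical(b1, b2)
```

Each such call needs its argument list built at run time, which is where the allocation seen in the profile comes from.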
Toeplitz Matrix
R Byte-code compiler
env R_ENABLE_JIT=3 R
            AST        Bytecode
N = 500     650 ms     130 ms
N = 1000    2610 ms    530 ms
N = 1500    5910 ms    1150 ms
Always use byte-code compiler!
> require(compiler)
Loading required package: compiler
> help(cmpfun)
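As an alternative to enabling the JIT for the whole session, a single function can be compiled explicitly with `cmpfun` from the `compiler` package. A minimal sketch follows; the name `tmForCompiled` is chosen here for illustration. Note that R 3.4.0 and later enable the JIT by default, so the explicit step matters mainly on older versions such as the R 3.0.2 used in these measurements.

```r
library(compiler)

tmFor <- function(n) {
  b <- matrix(nrow = n, ncol = n)
  for (j in 1:n) {
    for (k in 1:n) {
      b[k, j] <- abs(j - k) + 1
    }
  }
  b
}

# Compile the closure to byte-code; the result is called like the original.
tmForCompiled <- cmpfun(tmFor)

same <- identical(tmFor(50), tmForCompiled(50))
```

Timing both versions with `system.time()` on your own machine is the simplest way to see whether compilation pays off for a given function.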
TM: Sapply
tmSapply <- function(n) {
  sapply(1:n, function(j) {
    sapply(1:n, function(k) {
      abs(j - k) + 1
    })
  })
}
            For        Sapply
N = 500     130 ms     320 ms
N = 1000    530 ms     1300 ms
N = 1500    1150 ms    2960 ms
Using sapply instead of for sometimes helps. Not now...
TM: Rows Algo
tmRows <- function(n) {
  b <- matrix(nrow = n, ncol = n)
  b[1,] <- 1:n
  if (n >= 2) {
    for(r in 2:n) {
      b[r,] <- c(r, b[r-1,-n])
    }
  }
  b
}
1 2 3 4 5
2 1 2 3 4
3 2 1 2 3
4 3 2 1 2
5 4 3 2 1
            For        Rows
N = 500     130 ms     13 ms
N = 1000    530 ms     59 ms
N = 1500    1150 ms    169 ms
TM: Cols Algo
tmCols <- function(n) {
  b <- matrix(nrow = n, ncol = n)
  b[,1] <- 1:n
  if (n >= 2) {
    for(col in 2:n) {
      b[,col] <- c(col, b[-n, col-1])
    }
  }
  b
}
1 2 3 4 5
2 1 2 3 4
3 2 1 2 3
4 3 2 1 2
5 4 3 2 1
TM: Cols2 Algo
tmByCols <- function(n) {
  if (n >= 2) {
    sapply(1:n, function(col) {
      if (col < n) {
        c( col:1, 2:(n-col+1) )
      } else {
        n:1
      }
    })
  } else {
    1
  }
}
1 2 3 4 5
2 1 2 3 4
3 2 1 2 3
4 3 2 1 2
5 4 3 2 1
            Rows       Cols2
N = 500     13 ms      5 ms
N = 1000    59 ms      39 ms
N = 1500    169 ms     58 ms
TM: Outer Algo
tmOuter <- function(n) {
  outer(X = 1:n, Y = 1:n, FUN = function(j,k) {
    abs(j - k) + 1
  })
}
1 2 3 4 5
2 1 2 3 4
3 2 1 2 3
4 3 2 1 2
5 4 3 2 1
            Cols2      Outer      C
N = 500     5 ms       2 ms       0.2 ms
N = 1000    39 ms      27 ms      0.9 ms
N = 1500    58 ms      47 ms      2.1 ms
Yet faster. Vectorized.
Also easy to read.
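When rewriting an algorithm for speed, it is worth checking that each variant still produces the same matrix. The sketch below compares the loop and `outer` versions, and adds one more vectorized formulation based on `row()` and `col()`. That last variant is not from the benchmark; `tmRowCol` and `same_matrix` are hypothetical names, and no timing is claimed for it.

```r
tmFor <- function(n) {
  b <- matrix(nrow = n, ncol = n)
  for (j in 1:n) {
    for (k in 1:n) {
      b[k, j] <- abs(j - k) + 1
    }
  }
  b
}

tmOuter <- function(n) {
  outer(X = 1:n, Y = 1:n, FUN = function(j, k) abs(j - k) + 1)
}

# Assumed extra variant: whole-matrix arithmetic on row and column indices.
tmRowCol <- function(n) {
  b <- matrix(nrow = n, ncol = n)
  abs(row(b) - col(b)) + 1
}

# Compare by shape and values, ignoring integer/double storage differences.
same_matrix <- function(a, b) identical(dim(a), dim(b)) && all(a == b)

n <- 20
ok <- same_matrix(tmFor(n), tmOuter(n)) && same_matrix(tmFor(n), tmRowCol(n))
```

Comparing values rather than using `identical()` on the matrices avoids false alarms when one variant returns integers and another returns doubles.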
TM: Summary
            For        Outer      C          For-FastR
N = 500     130 ms     2 ms       0.2 ms     13 ms
N = 1000    530 ms     27 ms      0.9 ms     47 ms
Summary
● Use byte-code compiler
● Vectorize
● Use built-ins (sum, prod, cumsum, outer)
● Use simplest data structure possible
  – Matrix instead of data.frame
  – Avoid data.frame indexing
● Save and re-use intermediate results
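As a small illustration of the advice to use built-ins, the sketch below computes a running sum twice: once with an explicit loop and once with `cumsum`. The function name `running_sum_loop` and the input vector are made up for this example; both versions return the same values, but the built-in does the work in compiled code with a single call.

```r
# Hand-written running sum: one interpreted iteration per element.
running_sum_loop <- function(x) {
  out <- numeric(length(x))
  total <- 0
  for (i in seq_along(x)) {
    total <- total + x[i]      # re-use the intermediate total instead of re-summing
    out[i] <- total
  }
  out
}

x <- c(1, 2, 3, 4, 5)
loop_result    <- running_sum_loop(x)
builtin_result <- cumsum(x)    # same computation as a single built-in call
```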
Please consider donating your code/data in the form of benchmarks.