Academic year: 2021

Why is R slow?

How to run R programs faster?

Tomas Kalibera

My Background

Virtual machines, runtimes for programming languages

Real-time Java

Automatic memory management

Evaluating software performance

R User



Currently working on: FastR

A new, experimental virtual machine for (a subset of) the R language. Discovering optimizations that can speed up R.

Core team

Jan Vitek

Tomas Kalibera

Petr Maj

Floreal Morandat


Community: Dynamic Languages for Scalable Data Analytics

Use one dynamic, high-level language for data analytics tasks running on platforms from a tablet to the cloud.

R, Matlab, Python, Julia

Large software companies interested in R

NSF Funded Workshop at SPLASH 2013


Source code

    int main(int argc, char **argv) {
      if (argc != 2) {
        fprintf(stderr, "tm n\n");
        return 1;
      }
      int n = atoi(argv[1]);
      printf("n = %d\n", n);

[Figure: parsing turns the source code into a parse tree with nodes main, if, decl, call, !=, argc, 2, call, ret]


[Figure: parse tree with nodes main, if, decl, call, !=, argc, 2, call, ret]

Parse tree executed directly by an (AST) interpreter

    class If {
      Node Condition, TrueBranch, FalseBranch;

      Result execute() {
        if (Condition.execute() == TRUE) {
          TrueBranch.execute();
        } else {
          FalseBranch.execute();
        }
        return NULL;
      }
    }

GNU R works like this.
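The execute() scheme above can be tried out in R itself. The following is only a minimal sketch and not GNU R's implementation (GNU R's evaluator is written in C): a tree-walking evaluator for a tiny expression language, working on parse trees obtained with quote(). The name evalNode and the small set of supported operators are illustrative assumptions.

```r
# Toy AST interpreter: each node type is evaluated by walking the tree.
evalNode <- function(node, env) {
  if (is.numeric(node) || is.logical(node)) {
    return(node)                                   # constant node
  }
  if (is.name(node)) {
    return(get(as.character(node), envir = env))   # variable node: look-up
  }
  op <- as.character(node[[1]])
  if (op == "if") {
    # Same shape as the If class: evaluate the condition, then one branch.
    if (isTRUE(evalNode(node[[2]], env))) {
      return(evalNode(node[[3]], env))
    } else if (length(node) >= 4) {
      return(evalNode(node[[4]], env))
    }
    return(NULL)
  }
  if (op == "(") {
    return(evalNode(node[[2]], env))
  }
  # Binary operators: evaluate both children first, then combine.
  args <- lapply(as.list(node)[-1], evalNode, env = env)
  switch(op,
    "+"  = args[[1]] + args[[2]],
    "-"  = args[[1]] - args[[2]],
    "!=" = args[[1]] != args[[2]],
    stop("unsupported operation: ", op))
}

env <- new.env()
assign("argc", 3, envir = env)
print(evalNode(quote(if (argc != 2) 1 else 0), env))
```

Every evaluation re-walks the tree and re-dispatches on the node type, which is the cost an AST interpreter pays on each loop iteration.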



[Figure: compilation and linking turn the parse tree (main, if, decl, call, !=, argc, 2, call, ret) into machine code]

Ahead of time: C/C++/Fortran

Just-in-time: Java/C#

Machine code

    0000000000400580 <main>:
      400580:  41 54                 push   %r12
      400582:  83 ff 02              cmp    $0x2,%edi
      400585:  55                    push   %rbp
      400586:  53                    push   %rbx
      400587:  74 25                 je     4005ae <main+0x2e>
      400589:  48 8b 0d c8 0a 20 00  mov    0x200ac8(%rip),%rcx
      400590:  ba 05 00 00 00        mov    $0x5,%edx
      400595:  be 01 00 00 00        mov    $0x1,%esi
      40059a:  bf 04 08 40 00        mov    $0x400804,%edi
      40059f:  e8 cc ff ff ff        callq  400570 <fwrite@plt>
      4005a4:  b8 01 00 00 00        mov    $0x1,%eax
      4005a9:  5b                    pop    %rbx
      4005aa:  5d                    pop    %rbp
      4005ab:  41 5c                 pop    %r12
      4005ad:  c3                    retq




Self-optimizing AST interpreter

– Aims to remain easy to develop and maintain
– But fast

The AST (tree) rewrites as the program runs

– Speculative rewrites, recovery

Runs on a JVM

– High-performance garbage collector

– Just-in-Time compilation improves speed


Understanding why GNU-R is slow

Speeding up R programs


Toeplitz Matrix

In AT&T R Benchmarks 2.5 (Simon Urbanek)

Initializing a square matrix

    b_{i,j} = |i - j| + 1

    1 2 3 4 5
    2 1 2 3 4
    3 2 1 2 3
    4 3 2 1 2
    5 4 3 2 1


TM using For Loop

(as included in AT&T R Benchmarks 2.5)

    tmFor <- function(n) {
      b <- matrix(nrow = n, ncol = n)
      for (j in 1:n) {
        for (k in 1:n) {
          b[k,j] <- abs(j - k) + 1
        }
      }
      b
    }


    N = 500      650 ms
    N = 1000    2610 ms
    N = 1500    5910 ms


TM in C

    int *b = (int *)malloc(n * n * sizeof(int));
    for(j = 1; j <= n; j++) {
      for(k = 1; k <= n; k++) {
        b[(k - 1) + (j - 1) * n] = abs(j - k) + 1;
      }
    }

                 In R        In C
    N = 500      650 ms      0.2 ms
    N = 1000    2610 ms      0.9 ms
    N = 1500    5910 ms      2.1 ms




TM: Checking with a profiler

    > Rprof()
    > dummy <- tmFor(5000)
    > Rprof(NULL)
    > summaryRprof()
    $by.self
             self.time self.pct total.time total.pct
    "tmFor"      51.42    86.36      59.54    100.00
    "abs"         2.80     4.70       2.80      4.70
    "-"           2.76     4.64       2.76      4.64
    "+"           2.42     4.06       2.42      4.06
    "matrix"      0.12     0.20       0.12      0.20
    ":"           0.02     0.03       0.02      0.03

    $by.total
             total.time total.pct self.time self.pct
    "tmFor"       59.54    100.00     51.42    86.36
    "abs"          2.80      4.70      2.80     4.70
    "-"            2.76      4.64      2.76     4.64
    "+"            2.42      4.06      2.42     4.06
    "matrix"       0.12      0.20      0.12     0.20
    ":"            0.02      0.03      0.02     0.03


TM: R profiler does not help

    tmFor <- function(n) {
      b <- matrix(nrow = n, ncol = n)
      for (j in 1:n) {
        for (k in 1:n) {
          b[k,j] <- abs(j - k) + 1
        }
      }
      b
    }





TM: Checking with a system profiler

    env CFLAGS=-g ./configure --with-blas --with-lapack --enable-R-static-lib --disable-BLAS-shlib



    dummy <- tmFor(5000)

    perf record -g -- ~/work/R/R-3.0.2/R-3.0.2-dbg/bin/R --slave < runtm.r
    perf report -g

    +  1.08%  R  R  [.] real_binary
    +  0.75%  R  R  [.] integer_binary
    +  0.74%  R  R  [.] do_abs
    +  9.91%  R  R  [.] Rf_eval
    +  9.53%  R  R  [.] Rf_cons
    -  6.67%  R  R  [.] Rf_findVarInFrame3
       - Rf_findVarInFrame3
          + 29.17% Rf_findVar
          +  7.84% EnsureLocal
          +  2.21% Rf_eval




R built-in functions can be changed

    for (j in 1:n) {
      for (k in 1:n) {
        b[k,j] <- abs(j - k) + 1
      }
    }

abs is a built-in function

abs can be changed at any time

    > abs <- function(x) { x * x }
    > abs(-10)
    [1] 100

    > for(i in 11:13) { if (i==12) { abs <- sqrt } ; print(abs(i)) }
    [1] 11
    [1] 3.464102
    [1] 3.605551

Variable look-up


R built-in functions can be changed

    tmFor <- function(n) {
      b <- matrix(nrow = n, ncol = n)
      for (j in 1:n) {
        for (k in 1:n) {
          b[k,j] <- abs(j - k) + 1
        }
      }
      b
    }

[Figure: variable look-up for abs walks the environment chain: the tmFor call frame (n, b, j, k), then GlobalEnv (tmFor), then BaseNamespaceEnv, where abs is .Primitive("abs")]

Variable look-up
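The look-up chain can be observed from R code. The sketch below is a simplification under stated assumptions: whereDefined is a hypothetical helper (not part of R) that walks enclosing environments and counts how many it had to search. The real look-up in GNU R is done in C and, for calls, also skips bindings that are not functions.

```r
# Walk the chain of enclosing environments, the way a variable look-up does,
# and report which environment defines `name` and how many were searched.
whereDefined <- function(name, env = parent.frame()) {
  steps <- 1
  while (!identical(env, emptyenv())) {
    if (exists(name, envir = env, inherits = FALSE)) {
      return(list(env = env, steps = steps))
    }
    env <- parent.env(env)
    steps <- steps + 1
  }
  NULL  # not found anywhere on the chain
}

# Inside a function, abs is not local, so the search continues outwards.
tmLookup <- function() {
  whereDefined("abs", environment())
}
found <- tmLookup()
print(environmentName(found$env))
print(found$steps)

# A local definition shadows the built-in and is found immediately.
shadow <- function() {
  abs <- sqrt
  whereDefined("abs", environment())$steps
}
print(shadow())
```

Because any of the environments on the chain may gain or lose a binding for abs at run time, the interpreter cannot simply cache the result of the first search.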


R built-in functions can be changed

    for (j in 1:n) {
      for (k in 1:n) {
        b[k,j] <- abs(j - k) + 1
      }
    }

abs is a built-in function

+ - ( [ { <- for : are all built-in functions

    > `:` <- sum
    > 1:10
    [1] 11

    > `<-` <- function(x,val) { eval.parent( assign(deparse(substitute(x)), 100)) }
    > z <- 10
    [1] 100


Variables can be deleted

    for (j in 1:n) {
      for (k in 1:n) {
        b[k,j] <- abs(j - k) + 1
      }
    }

    > x <- 10
    > rm(x)
    > x
    Error: object 'x' not found

    > for(i in 1:3) { if (i==2) { rm(i) } else print(i) }
    [1] 1
    [1] 3

    > for(i in 1:3) { if (i==2) { rm(i) } ; print(i) }
    [1] 1
    Error in print(i) : object 'i' not found

Variable look-up is needed

Loop control variable can be deleted


TM: Checking with a system profiler


Linked-list allocation and use

    +  9.91%  R  R  [.] Rf_eval
    -  9.53%  R  R  [.] Rf_cons
       - Rf_cons
          + 29.87% Rf_allocList
          + 24.96% Rf_evalList
          + 14.35% Rf_evalListKeepMissing
          +  6.04% Rf_lcons
          +  5.90% Rf_DispatchOrEval
          +  5.29% Rf_list2
          +  3.85% evalseq
          +  3.26% Rf_defineVar
          +  3.04% Rf_list1
          +  1.18% Rf_eval
          +  0.75% replaceCall
          +  0.52% evalArgs
    +  6.67%  R  R  [.] Rf_findVarInFrame3


Arguments passed as linked-list

Linked-list allocation and use

    for (j in 1:n) {
      for (k in 1:n) {
        b[k,j] <- abs(j - k) + 1
      }
    }

Converted to a general replacement call of form F(X) ← Y

The replacement call is then transformed

    F(X) ← Y        becomes    TMP ← X
                               X ← "F<-"( TMP, value = Y )

    b[k,j] ← Y      becomes    TMP ← b
                               b ← "[<-"( TMP, k, j, value = Y )

Replacement call is expensive

[Figure: the call "[<-"( TMP, k, j, value = Y ) with its arguments TMP, k, j and Y held in a linked list]

This linked list is allocated in each iteration
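The transformation can be checked at the R prompt. This is a small sketch with made-up values (a 3 x 3 matrix, k = 2, j = 3, Y = 42) that follows the TMP notation of the slide; the R Language Definition names the temporary `*tmp*`.

```r
b <- matrix(0, nrow = 3, ncol = 3)
k <- 2
j <- 3
Y <- 42

# Ordinary replacement syntax.
b1 <- b
b1[k, j] <- Y

# The same update written out as the two-step transformation:
# copy the target to a temporary, then call the replacement function "[<-".
TMP <- b
b2 <- `[<-`(TMP, k, j, value = Y)

print(identical(b1, b2))
```

In the loop of tmFor this call, together with its argument list, is built and evaluated once per element, which is where the Rf_cons time in the profile comes from.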





R Byte-code compiler

    env R_ENABLE_JIT=3 R

                 AST         Bytecode
    N = 500      650 ms      130 ms
    N = 1000    2610 ms      530 ms
    N = 1500    5910 ms     1150 ms

Always use the byte-code compiler!

    > require(compiler)
    Loading required package: compiler
    > help(cmpfun)
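Besides setting R_ENABLE_JIT, a single function can be compiled explicitly with cmpfun from the compiler package. The sketch below compiles tmFor and checks that both versions return the same matrix; the sizes 50 and 300 are arbitrary choices. Note that since R 3.4.0 the JIT is enabled by default, so on a recent R the two timings may be close.

```r
library(compiler)

tmFor <- function(n) {
  b <- matrix(nrow = n, ncol = n)
  for (j in 1:n) {
    for (k in 1:n) {
      b[k, j] <- abs(j - k) + 1
    }
  }
  b
}

# Compile the closure to byte-code; the result is called like the original.
tmForCompiled <- cmpfun(tmFor)

# Both versions must produce the same matrix.
stopifnot(identical(tmFor(50), tmForCompiled(50)))

# Timings depend on the machine and the R version.
print(system.time(tmFor(300)))
print(system.time(tmForCompiled(300)))
```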


TM: Sapply

    tmSapply <- function(n) {
      sapply(1:n, function(j) {
        sapply(1:n, function(k) {
          abs(j - k) + 1
        })
      })
    }


                 For         Sapply
    N = 500      130 ms      320 ms
    N = 1000     530 ms     1300 ms
    N = 1500    1150 ms     2960 ms

Using sapply instead of for sometimes helps, but not here.


TM: Rows Algo

    tmRows <- function(n) {
      b <- matrix(nrow = n, ncol = n)
      b[1,] <- 1:n
      if (n >= 2) {
        for(r in 2:n) {
          b[r,] <- c(r, b[r-1,-n])
        }
      }
      b
    }

    1 2 3 4 5
    2 1 2 3 4
    3 2 1 2 3
    4 3 2 1 2
    5 4 3 2 1


                 For         Rows
    N = 500      130 ms      13 ms
    N = 1000     530 ms      59 ms
    N = 1500    1150 ms     169 ms


TM: Cols Algo

    tmCols <- function(n) {
      b <- matrix(nrow = n, ncol = n)
      b[,1] <- 1:n
      if (n >= 2) {
        for(col in 2:n) {
          b[,col] <- c(col, b[-n, col-1])
        }
      }
      b
    }

    1 2 3 4 5
    2 1 2 3 4
    3 2 1 2 3
    4 3 2 1 2
    5 4 3 2 1


TM: Cols2 Algo

    tmByCols <- function(n) {
      if (n >= 2) {
        sapply(1:n, function(col) {
          if (col < n) {
            c( col:1, 2:(n-col+1) )
          } else {
            n:1
          }
        })
      } else {
        1
      }
    }

    1 2 3 4 5
    2 1 2 3 4
    3 2 1 2 3
    4 3 2 1 2
    5 4 3 2 1


                 Rows        Cols2
    N = 500      13 ms       5 ms
    N = 1000     59 ms       39 ms
    N = 1500     169 ms      58 ms


TM: Outer Algo

    tmOuter <- function(n) {
      outer(X = 1:n, Y = 1:n, FUN = function(j,k) {
        abs(j - k) + 1
      })
    }

    1 2 3 4 5
    2 1 2 3 4
    3 2 1 2 3
    4 3 2 1 2
    5 4 3 2 1


                 Cols2       Outer       C
    N = 500      5 ms        2 ms        0.2 ms
    N = 1000     39 ms       27 ms       0.9 ms
    N = 1500     58 ms       47 ms       2.1 ms

Yet faster. Vectorized.

Also easy to read.
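When trying faster variants it helps to check that they still compute the same matrix. The following sketch compares tmFor with tmOuter and, as an addition not in the slides, with toeplitz() from R's stats package, which builds this matrix directly from its first row. The helper sameMatrix and the size n = 200 are illustrative assumptions.

```r
tmFor <- function(n) {
  b <- matrix(nrow = n, ncol = n)
  for (j in 1:n) {
    for (k in 1:n) {
      b[k, j] <- abs(j - k) + 1
    }
  }
  b
}

tmOuter <- function(n) {
  outer(X = 1:n, Y = 1:n, FUN = function(j, k) {
    abs(j - k) + 1
  })
}

# Compare shape and values only; the variants may differ in storage type
# (integer versus double), which identical() would report as a difference.
sameMatrix <- function(a, b) {
  identical(dim(a), dim(b)) && all(a == b)
}

n <- 200
reference <- tmFor(n)
print(sameMatrix(reference, tmOuter(n)))
print(sameMatrix(reference, toeplitz(1:n)))

# Timings depend on the machine and the R version.
print(system.time(tmFor(n)))
print(system.time(tmOuter(n)))
```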


TM: Summary

                 For         Outer       C           For-FastR
    N = 500      130 ms      2 ms        0.2 ms      13 ms
    N = 1000     530 ms      27 ms       0.9 ms      47 ms



Use byte-code compiler


Use built-ins (sum, prod, cumsum, outer)

Use simplest data structure possible

– Matrix instead of data.frame
– Avoid data.frame indexing

Save and re-use intermediate results
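As a small illustration of the advice on built-ins, the sketch below computes running totals twice: with an explicit loop and with the built-in cumsum. The function name runningTotalLoop and the input size are made up for the example; the timings will vary by machine and R version.

```r
# Running totals with an explicit loop: one interpreted iteration per element.
runningTotalLoop <- function(x) {
  out <- numeric(length(x))
  total <- 0
  for (i in seq_along(x)) {
    total <- total + x[i]
    out[i] <- total
  }
  out
}

x <- as.numeric(1:100000)

# The built-in does the same work in a single call.
stopifnot(isTRUE(all.equal(runningTotalLoop(x), cumsum(x))))

print(system.time(runningTotalLoop(x)))
print(system.time(cumsum(x)))
```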

Please consider donating your code/data in the form of benchmarks.


