Big Data, R, and HPC
A Survey of Advanced Computing with R
Drew Schmidt
April 16, 2015
About Me
@wrathematics
http://librestats.com
https://github.com/wrathematics
http://wrathematics.info
Introduction
1. Introduction
   XSEDE
   Big Data and Bio
   Why is my code slow?
2. Profiling and Benchmarking
3. A Hasty Introduction to Advanced Computing with R
4. Wrapup
Introduction XSEDE
• Extreme Science and Engineering Discovery Environment
• Follow-on NSF project to TeraGrid in 2012
• Centers operate machines, and XSEDE provides seamless infrastructure for allocations, access, and training
• Researchers propose resource use through XRAS
• Supports thousands of scientists in fields such as:
  – Chemistry
  – Bioinformatics
  – Materials Science
  – Data Sciences
XSEDE Allocations
• Want to use XSEDE resources to teach a class?
  – https://portal.xsede.org/allocations-overview#types-education
• Just looking to try out a larger resource or a special resource your campus doesn’t have?
  – https://portal.xsede.org/allocations-overview#types-startup
XSEDE Allocations
• See a Campus Champion
  – https://www.xsede.org/current-champions
• Ready to scale up your research?
  – https://portal.xsede.org/allocations-overview#types-research
More “helpful” resources
xsede.org → User Services
• Resources available at each Service Provider
• User Guides describing memory, number of CPUs, file systems, etc.
• Storage facilities
• Software (Comprehensive Search)
• Training: portal.xsede.org → Training
• Course Calendar
• On-line training
• Certifications
• Get face-to-face help from XSEDE experts at your institution; contact your local Campus Champions.
• Extended Collaborative Support (formerly known as Advanced User Support (AUSS))
Introduction Big Data and Bio
Big Data
Volume, Velocity, Variety
Volume — Sequencers
Velocity — sensors ???
Variety — fasta, csv, databases, images, unstructured, . . .
Complexity — complicated models
Common Computational Problems in Bio
p > n.
Many small tasks (workflow).
Parallelization often difficult.
Introduction Why is my code slow?
Why is my code slow?
Bad code.
Abstraction.
Serial.
Bad Code
R isn’t very smart.
R is slow, but bad programmers are slower!
Bad parallel code may be slower than good serial code.
Abstraction
Never completely free!
Could cost a microsecond (not worth worrying about!) or much,
much more . . .
But abstraction is A Good Thing™.
Have to find the right balance.
Serial
https://csgillespie.wordpress.com/2011/01/25/cpu-and-gpu-trends-over-time/
Profiling and Benchmarking
Profiling and Benchmarking Why Profile?
Performance and Accuracy
Sometimes π = 3.14 is (a) infinitely faster
than the “correct” answer and (b) the
difference between the “correct” and the “wrong”
answer is meaningless. . . . The thing is, some
specious value of “correctness” is often
irrelevant because it doesn’t matter. While
performance almost always matters. And I
absolutely detest the fact that people so often
dismiss performance concerns so readily.
— Linus Torvalds, August 8, 2008
Why Profile?
Because performance matters.
Bad practices scale up!
Your bottlenecks may surprise you.
Because R is dumb.
R users claim to be data people. . . so act like it!
Compilers often correct bad behavior. . .
A Really Dumb Loop
    int main() {
      int x, i;
      for (i=0; i<10; i++)
        x = 1;
      return 0;
    }
clang -O3 -S example.c
    main:
        .cfi_startproc
    # BB#0:
        xorl %eax, %eax
        ret
clang -S example.c
    main:
        .cfi_startproc
    # BB#0:
        movl $0, -4(%rsp)
        movl $0, -12(%rsp)
    .LBB0_1:
        cmpl $10, -12(%rsp)
        jge .LBB0_4
    # BB#2:
        movl $1, -8(%rsp)
    # BB#3:
        movl -12(%rsp), %eax
        addl $1, %eax
        movl %eax, -12(%rsp)
        jmp .LBB0_1
    .LBB0_4:
        movl $0, %eax
        ret
R will not!
Dumb Loop
    for (i in 1:n) {
      tA <- t(A)
      Y <- tA %*% Q
      Q <- qr.Q(qr(Y))
      Y <- A %*% Q
      Q <- qr.Q(qr(Y))
    }

    Q
Better Loop
    tA <- t(A)

    for (i in 1:n) {
      Y <- tA %*% Q
      Q <- qr.Q(qr(Y))
      Y <- A %*% Q
      Q <- qr.Q(qr(Y))
    }

    Q
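The two loops can be compared directly. The following is a minimal sketch, assuming small made-up dimensions and a random starting Q, since the slide gives no values for A, Q, or n. The names dumb_loop, better_loop, Q0, and iters are illustrative only.

```r
# Hypothetical sizes; the slide does not specify A, Q, or n.
set.seed(1234)
A  <- matrix(rnorm(2000 * 200), nrow = 2000, ncol = 200)
Q0 <- qr.Q(qr(matrix(rnorm(2000 * 10), nrow = 2000, ncol = 10)))

# Recomputes t(A) on every pass, even though A never changes.
dumb_loop <- function(A, Q, n) {
  for (i in 1:n) {
    tA <- t(A)
    Y <- tA %*% Q
    Q <- qr.Q(qr(Y))
    Y <- A %*% Q
    Q <- qr.Q(qr(Y))
  }
  Q
}

# Hoists the loop-invariant transpose out of the loop.
better_loop <- function(A, Q, n) {
  tA <- t(A)
  for (i in 1:n) {
    Y <- tA %*% Q
    Q <- qr.Q(qr(Y))
    Y <- A %*% Q
    Q <- qr.Q(qr(Y))
  }
  Q
}

iters <- 20
time_dumb   <- system.time(q_dumb   <- dumb_loop(A, Q0, iters))
time_better <- system.time(q_better <- better_loop(A, Q0, iters))

# Same arithmetic in the same order, so the answers should agree.
all.equal(q_dumb, q_better)
c(dumb = time_dumb[["elapsed"]], better = time_better[["elapsed"]])
```

How much time the hoisting saves depends on the size of A relative to the rest of the work in the loop, so the timings are best read as indicative.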
Example from a Real R Package
Excerpt from Original function
    while (i <= N) {
      for (j in 1:i) {
        d.k <- as.matrix(x)[l==j, l==j]
        ...
Excerpt from Modified function
    x.mat <- as.matrix(x)

    while (i <= N) {
      for (j in 1:i) {
        d.k <- x.mat[l==j, l==j]
        ...
By changing just 1 line of code, performance of the main method improved by over 350%!
Some Thoughts
R is slow.
Bad programmers are slower.
R can’t fix bad programming.
Profiling and Benchmarking Profiling R Code
Timings
Getting simple timings as a basic measure of performance is easy, and valuable.
system.time(): timing blocks of code.
Rprof(): timing execution of R functions.
Rprofmem(): reporting memory allocation in R.
tracemem(): detect when a copy of an R object is created.
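Of these four tools, tracemem() is the only one the later slides do not demonstrate. Here is a minimal sketch, assuming an R build with memory profiling support (checked with capabilities("profmem")). The variable names are illustrative.

```r
x <- runif(1e5)

# tracemem() only works if R was built with memory profiling support.
can_trace <- capabilities("profmem")
if (can_trace) tracemem(x)

y <- x        # no copy yet: y and x refer to the same memory
y[1] <- -1    # modifying y forces a copy; tracemem() reports it at this point

if (can_trace) untracemem(x)

# x is untouched, y holds the modified copy.
c(x_first = x[1], y_first = y[1])
```

When tracing is available, a line beginning with tracemem[ is printed at the assignment into y, which makes hidden copies inside loops easy to spot.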
Performance Profiling Tools: system.time()
system.time() is a basic R utility for timing expressions
    x <- matrix(rnorm(20000*750), nrow=20000, ncol=750)

    system.time(t(x) %*% x)
    #    user  system elapsed
    #   2.187   0.032   2.324

    system.time(crossprod(x))
    #    user  system elapsed
    #   1.009   0.003   1.019

    system.time(cov(x))
    #    user  system elapsed
    #   6.264   0.026   6.338
Performance Profiling Tools: system.time()
Put more complicated expressions inside of brackets:
    x <- matrix(rnorm(20000*750), nrow=20000, ncol=750)

    system.time({
      y <- x+1
      z <- y*2
    })
    #    user  system elapsed
    #   0.057   0.032   0.089
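The value returned by system.time() can also be stored and used programmatically. This is a small sketch with a deliberately smaller matrix than the slide uses. The names timing and xtx are illustrative.

```r
x <- matrix(rnorm(2000 * 100), nrow = 2000, ncol = 100)

# Assignments inside the timed expression happen in the calling environment,
# so the result of the computation is still available afterwards.
timing <- system.time({
  xtx <- crossprod(x)
})

class(timing)          # "proc_time"
names(timing)          # includes "elapsed"
timing[["elapsed"]]    # wall-clock seconds as a plain number
dim(xtx)
```

Pulling out the elapsed entry this way is handy when timings need to be collected into a vector or table rather than read off the console.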
Performance Profiling Tools: Rprof()
    Rprof(filename="Rprof.out", append=FALSE, interval=0.02,
          memory.profiling=FALSE, gc.profiling=FALSE,
          line.profiling=FALSE, numfiles=100L, bufsize=10000L)
Performance Profiling Tools: Rprof()
    x <- matrix(rnorm(10000*250), nrow=10000, ncol=250)

    Rprof()
    invisible(prcomp(x))
    Rprof(NULL)

    summaryRprof()
Performance Profiling Tools: Rprof()
    $by.self
                    self.time self.pct total.time total.pct
    "La.svd"             0.68    69.39       0.72     73.47
    "%*%"                0.12    12.24       0.12     12.24
    "aperm.default"      0.04     4.08       0.04      4.08
    "array"              0.04     4.08       0.04      4.08
    "matrix"             0.04     4.08       0.04      4.08
    "sweep"              0.02     2.04       0.10     10.20
    ### output truncated by presenter

    $by.total
                     total.time total.pct self.time self.pct
    "prcomp"               0.98    100.00      0.00     0.00
    "prcomp.default"       0.98    100.00      0.00     0.00
    "svd"                  0.76     77.55      0.00     0.00
    "La.svd"               0.72     73.47      0.68    69.39
    ### output truncated by presenter

    $sample.interval
    [1] 0.02

    $sampling.time
    [1] 0.98
Performance Profiling Tools: Rprof()
    Rprof(interval=.99)
    invisible(prcomp(x))
    Rprof(NULL)

    summaryRprof()
Performance Profiling Tools: Rprof()
    $by.self
    [1] self.time  self.pct   total.time total.pct
    <0 rows> (or 0-length row.names)

    $by.total
    [1] total.time total.pct  self.time  self.pct
    <0 rows> (or 0-length row.names)

    $sample.interval
    [1] 0.99

    $sampling.time
    [1] 0
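The empty result above appears to come from a sampling interval (0.99 s) that is about as long as the whole computation, so the sampler never fires. Below is a minimal sketch of a profiling run that avoids this, assuming a standard R build with profiling enabled. The temporary file and the one-second repeat loop are illustrative choices, not part of the original slides.

```r
x <- matrix(rnorm(2000 * 50), nrow = 2000, ncol = 50)

# Write to a temporary file rather than overwriting an existing Rprof.out.
prof_file <- tempfile(fileext = ".out")

Rprof(prof_file, interval = 0.02)
start <- proc.time()[["elapsed"]]
# Repeat the work for about one second so the sampler fires many times.
while (proc.time()[["elapsed"]] - start < 1) {
  invisible(prcomp(x))
}
Rprof(NULL)

prof <- summaryRprof(prof_file)
prof$sample.interval
prof$sampling.time
head(prof$by.self)

unlink(prof_file)
```

As a rule of thumb, the interval should be much shorter than the code being profiled. If the code is very fast, repeating it as above gives the sampler something to measure.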
Profiling and Benchmarking Advanced R Profiling
Other Profiling Tools
perf, PAPI
fpmpi, mpiP, TAU
pbdPROF
pbdPAPI
Profiling MPI Codes with pbdPROF
1. Rebuild pbdR packages
    R CMD INSTALL pbdMPI_0.2-1.tar.gz \
      --configure-args="--enable-pbdPROF"
2. Run code
    mpirun -np 64 Rscript my_script.R
3. Analyze results
    library(pbdPROF)
    prof <- read.prof("output.mpiP")
    plot(prof, plot.type="messages2")
Profiling with pbdPAPI
Bindings for Performance Application Programming Interface (PAPI)
Gathers detailed hardware counter data.
High and low level interfaces

Function              Description of Measurement
system.flips()        Time, floating point instructions, and Mflips
system.flops()        Time, floating point operations, and Mflops
system.cache()        Cache misses, hits, accesses, and reads
system.epc()          Events per cycle
system.idle()         Idle cycles
system.cpuormem()     CPU or RAM bound*
system.utilization()  CPU utilization*
Profiling and Benchmarking Benchmarking
Benchmarking
R functions are complicated!
Symbol lookup, creating the abstract syntax tree, creating promises
for arguments, argument checking, creating environments, . . .
Executing a second time can have dramatically different performance
over the first execution.
Benchmarking several methods fairly requires some care.
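A base-R sketch of what "some care" can look like: check that the candidates agree, run one untimed warm-up call, then time repeated calls and summarize. The helper time_reps and the choice of 20 repetitions are assumptions for illustration. The packages on the following slides do this more thoroughly.

```r
x <- matrix(rnorm(2000 * 100), nrow = 2000, ncol = 100)

f <- function(x) t(x) %*% x
g <- function(x) crossprod(x)

# Check that the candidates agree before comparing their speed.
stopifnot(isTRUE(all.equal(f(x), g(x))))

# Hypothetical helper: one untimed warm-up call, then `reps` timed calls.
time_reps <- function(fun, arg, reps = 20) {
  fun(arg)
  vapply(seq_len(reps),
         function(i) system.time(fun(arg))[["elapsed"]],
         numeric(1))
}

elapsed <- list(f = time_reps(f, x), g = time_reps(g, x))

# The median is less sensitive than the mean to a single slow run.
sapply(elapsed, median)
```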
Benchmarking tools: rbenchmark
rbenchmark is a simple package that easily benchmarks different functions:
    x <- matrix(rnorm(10000*500), nrow=10000, ncol=500)

    f <- function(x) t(x) %*% x
    g <- function(x) crossprod(x)

    library(rbenchmark)
    benchmark(f(x), g(x), columns=c("test", "replications", "elapsed", "relative"))

    #   test replications elapsed relative
    # 1 f(x)          100  13.679    3.588
    # 2 g(x)          100   3.812    1.000
Benchmarking tools: microbenchmark
microbenchmark is a separate package with a slightly different philosophy:
    x <- matrix(rnorm(10000*500), nrow=10000, ncol=500)

    f <- function(x) t(x) %*% x
    g <- function(x) crossprod(x)

    library(microbenchmark)
    microbenchmark(f(x), g(x), unit="s")

    # Unit: seconds
    #  expr        min         lq       mean     median         uq        max neval
    #  f(x) 0.11418617 0.11647517 0.12258556 0.11754302 0.12058145 0.17292507   100
    #  g(x) 0.03542552 0.03613772 0.03884497 0.03668231 0.03740173 0.07478309   100
Benchmarking tools: microbenchmark
I generally prefer rbenchmark, but the built-in plots for microbenchmark are nice:
    bench <- microbenchmark(f(x), g(x), unit="s")

    boxplot(bench)
[Figure: boxplots of log(time) by Expression for f(x) and g(x)]
A Hasty Introduction to Advanced Computing with R
Types of Improvements
Free.
Better code.
Compiled code.
Parallelism.
A Hasty Introduction to Advanced Computing with R Free
Build R with a Better Compiler
Better compiler ⇒ Faster R
Not entirely painless.
Can cost $$$.
R Installation and Administration: http://cran.r-project.org/doc/manuals/R-admin.html
The Bytecode Compiler
    f <- function(n) for (i in 1:n) 2*(3+4)

    library(compiler)
    f_comp <- cmpfun(f)

    library(rbenchmark)

    n <- 100000
    benchmark(f(n), f_comp(n), columns=c("test", "replications", "elapsed",
                                         "relative"),
              order="relative")
    #        test replications elapsed relative
    # 2 f_comp(n)          100   2.604    1.000
    # 1      f(n)          100   2.845    1.093
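These slides date from 2015. Since R 3.4.0 the just-in-time byte compiler is enabled by default, so on a current R the uncompiled f() above is likely to be compiled automatically and the gap may disappear. A minimal sketch of how to reproduce the comparison today, assuming the JIT is switched off first; sum_squares is a made-up example function.

```r
library(compiler)

# Hypothetical example function with an explicit loop.
sum_squares <- function(n) {
  total <- 0
  for (i in seq_len(n)) total <- total + i^2
  total
}

old_jit <- enableJIT(0)   # turn the JIT off; returns the previous level
t_interp <- system.time(r1 <- sum_squares(1e6))

sum_squares_cmp <- cmpfun(sum_squares)   # explicit byte-compilation
t_comp <- system.time(r2 <- sum_squares_cmp(1e6))

enableJIT(old_jit)        # restore whatever level was in effect before

all.equal(r1, r2)
c(interpreted = t_interp[["elapsed"]], compiled = t_comp[["elapsed"]])
```

The size of the speedup varies with the R version and with the loop body, so the numbers on the slide should not be expected to reproduce exactly.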
Choice of BLAS Library
    set.seed(1234)
    m <- 2000
    n <- 2000
    x <- matrix(
      rnorm(m*n),
      m, n)

    object.size(x)

    library(rbenchmark)

    benchmark(x%*%x)
    benchmark(svd(x))
[Figure: Comparison of Different BLAS Implementations for Matrix-Matrix Multiplication and SVD. Panels: x%*%x on 2000x2000 matrix (~31 MiB); x%*%x on 4000x4000 matrix (~122 MiB); svd(x) on 1000x1000 matrix (~8 MiB); svd(x) on 2000x2000 matrix (~31 MiB). Axes: BLAS Implementation (reference, atlas, openblas1, openblas2); Average Wall Clock Run Time (10 Runs).]
A Hasty Introduction to Advanced Computing with R Better Code
Loops, Plys, and Vectorization
Loops are slow.
apply(), Reduce() are just for loops.
Map(), lapply(), sapply(), mapply() (and most other core ones) are not for loops.
Ply functions are not vectorized.
Vectorization is fastest, but consumes lots of memory.
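A small sketch of the three styles on the same task, using an element-wise square root as a stand-in. The function names are illustrative, and relative timings will differ between machines and R versions.

```r
n <- 1e5
x <- runif(n)

# Explicit for loop with a preallocated result.
loop_sqrt <- function(x) {
  out <- numeric(length(x))
  for (i in seq_along(x)) out[i] <- sqrt(x[i])
  out
}

# Ply version: one function call per element.
ply_sqrt <- function(x) vapply(x, sqrt, numeric(1))

# Vectorized version: a single call that loops in compiled code.
vec_sqrt <- function(x) sqrt(x)

# All three must give the same answer before speed is worth comparing.
stopifnot(isTRUE(all.equal(loop_sqrt(x), ply_sqrt(x))),
          isTRUE(all.equal(ply_sqrt(x), vec_sqrt(x))))

c(loop = system.time(loop_sqrt(x))[["elapsed"]],
  ply  = system.time(ply_sqrt(x))[["elapsed"]],
  vec  = system.time(vec_sqrt(x))[["elapsed"]])
```

The memory cost mentioned above shows up when a vectorized expression has to materialize large intermediate vectors that a loop would handle one element at a time.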
A Hasty Introduction to Advanced Computing with R Compiled code
Rcpp
What Rcpp is
R interface to compiled code.
Package ecosystem (Rcpp, RcppArmadillo, RcppEigen, . . . ).
Utilities to make writing C++ more convenient for R users.
A tool which requires C++ knowledge to effectively utilize.
What Rcpp is not
Magic.
Automatic R-to-C++ converter.
A way around having to learn C++.
As easy to use as R.
Quickly Getting Started
    code <- '
    #include <Rcpp.h>

    // [[Rcpp::export]]
    int plustwo(int n)
    {
      return n+2;
    }
    '

    library(Rcpp)
    sourceCpp(code=code)

    plustwo(1)
    # [1] 3
A Hasty Introduction to Advanced Computing with R Parallelism
Parallelism
Serial Programming
Parallel Programming
Parallel Programming: In Theory
Parallel Programming: In Practice
Shared and Distributed Memory Machines
Shared Memory Machines
Thousands of cores
Nautilus, University of Tennessee
1024 cores 4 TB RAM
Distributed Memory Machines
Hundreds of thousands of cores
Titan, Oak Ridge National Lab
299,008 cores 584 TB RAM
Parallel Programming Packages for R
Shared Memory
Examples: parallel, snow, foreach, gputools, HiPLARM
Distributed
Examples: pbdR, Rmpi, RHadoop, RHIPE
CRAN HPC Task View
For more examples, see: http://cran.r-project.org/web/views/HighPerformanceComputing.html
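As a taste of the shared-memory side, here is a minimal sketch using the parallel package that ships with R. The task function slow_task and the choice of two cores are assumptions for illustration. mclapply() relies on forking, which is unavailable on Windows, so the sketch falls back to one core there.

```r
library(parallel)

# Hypothetical stand-in for an expensive, independent task.
slow_task <- function(i) {
  set.seed(i)          # seed per task so results do not depend on the worker
  mean(rnorm(1e5))
}

tasks <- 1:8

serial_result <- lapply(tasks, slow_task)

# Forking is not available on Windows, where mc.cores must be 1.
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L
parallel_result <- mclapply(tasks, slow_task, mc.cores = n_cores)

# Each task is independent and seeded, so both approaches should agree.
identical(serial_result, parallel_result)
```

This pattern fits the "many small tasks" workloads mentioned earlier. Problems that need communication between tasks are where the distributed packages come in.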
Parallel Programming Packages for R
[Figure: diagram relating parallel hardware to programming models, libraries, and R packages.
Hardware: Distributed Memory (PROC + cache nodes, each with its own Mem, joined by an Interconnection Network); Shared Memory (CORE + cache units sharing one Memory); Co-Processor (GPU or MIC with Local Memory, attached over a Network). GPU: Graphical Processing Unit. MIC: Many Integrated Core.
Annotations: "Focus on who owns what data and what communication is needed"; "Focus on which tasks can be parallel"; "Same Task on Blocks of data".
Programming models: Sockets, MPI, Hadoop, OpenMP, Threads, fork, CUDA, OpenCL, OpenACC.
Libraries: ScaLAPACK, PBLAS, BLACS, PETSc, Trilinos, DPLASMA, PLASMA, MAGMA, CUBLAS, MKL, ACML, LibSci, LAPACK, BLAS.
R interfaces and packages: .C, .Call, Rcpp, OpenCL, inline, snow, multicore (fork), parallel (snow + multicore), Rmpi, pbdMPI, RHIPE, pbdDMAT, HiPLAR, HiPLARM, magma.]
pbdR Packages
Wrapup
Performance-Centered Development Model
1. Just get it working.
2. Profile vigorously.
3. Weigh your options.
   Improve R code? (lapply(), vectorization, a package, . . . )
   Incorporate C/C++?
   Go parallel?
   Some combination of these. . .
4. Don’t forget the free stuff (BLAS, bytecode compiler, . . . ).
5. Repeat 2 to 4 until performance is acceptable.