Big Data Analytics Using R

Outline: Big Data Definition, Parallelization Principles, Tools, Summary

Eddie Aronovich

October 23, 2014

Characteristics

Volume
Velocity
Variety
Veracity (integrity)

Scale Up vs. Scale Out

Scale Up: more ... in one box
  expensive (cost grows exponentially)
  "closed box" approach
  technical complexity

Scale Out: more boxes
  cheap (on average)
  best of breed
  usage is not trivial

"You want Scale Out? Well, Scale Out costs. And right here is where you start paying ... in sweat."

How long does it take to read / write data?

Generating a vector of 1e7 values with runif() and writing it to disk: 10-16 sec.
Reading the same (96 MB) file: 1.5-6 sec.
Generating a vector of 1e9 values was not possible on the host I used
Calculating mean() and var() on all the data was not possible
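
A measurement of this kind can be repeated with a sketch like the one below. The file format (plain text, one value per line, via write() and scan()) and the reduced size n are assumptions: the slides do not say how the file was written, so the timings will differ from the ones quoted.

```r
# Sketch: time generating, writing and reading a random vector.
# Assumptions: plain-text file with one value per line; n is reduced from
# the 1e7 used on the slide so that the sketch runs quickly.
n <- 1e5
path <- tempfile(fileext = ".txt")

# Generate the vector and write it to disk
write_time <- system.time({
  x <- runif(n)
  write(x, file = path, ncolumns = 1)
})

# Read the same file back
read_time <- system.time(
  y <- scan(path, quiet = TRUE)
)

print(write_time["elapsed"])
print(read_time["elapsed"])
cat("file size (MB):", file.size(path) / 2^20, "\n")
```

Raising n towards 1e9 is where a single host is likely to run out of memory, which is the point of the slide.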

And in parallel?

Create 100 files, each holding 1e7 values, and write them to disk
For each file: read it and compute its mean() and var()
BUT: reading 100 files takes 100 times as long as reading one file :-(
Compute the mean and var of all the files and ... compute the total VAR
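
The last step, getting the total variance from per-file results, can be sketched as follows. The function name combine_stats and the small in-memory chunks (standing in for the 100 files) are assumptions made for illustration.

```r
# Sketch: combine per-chunk counts, means and variances into the overall
# mean and (sample) variance, without holding all the data at once.
combine_stats <- function(n, m, v) {
  N <- sum(n)
  grand_mean <- sum(n * m) / N
  ss_within  <- sum((n - 1) * v)              # spread inside each chunk
  ss_between <- sum(n * (m - grand_mean)^2)   # spread of the chunk means
  list(mean = grand_mean, var = (ss_within + ss_between) / (N - 1))
}

# Hypothetical stand-in for the files: 10 small chunks kept in memory
set.seed(1)
chunks <- lapply(1:10, function(i) runif(1000))

n <- vapply(chunks, length, integer(1))
m <- vapply(chunks, mean, numeric(1))
v <- vapply(chunks, var, numeric(1))

total <- combine_stats(n, m, v)

# Check against the direct computation on all the data
all_data <- unlist(chunks)
print(all.equal(total$mean, mean(all_data)))
print(all.equal(total$var, var(all_data)))
```

Only three numbers per file have to be moved to the combining step, so the per-file work can run anywhere.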

Now you know why you are here at this time!

One is not enough

Data
  Data cannot be gathered on one node
  One computer cannot process all the data
  Data transfer is not feasible

Processors
  One process, one thread (old and simple)
  Multiple processes, one host
  Multiple hosts

Reference
  High-Performance and Parallel Computing with R, by Dirk Eddelbuettel

Parallelization principles

Divide and Conquer

Splitting tasks: how?

Functional: requires synchronization
Data: requires pre/post-processing
Combination ← Hadoop made it easy!
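
The two ways of splitting can be illustrated with a toy sketch. The data and the task list are made up for illustration, and lapply/vapply run sequentially here where a parallel apply would be used in practice.

```r
# Sketch: data split vs. functional split on a toy vector.
x <- seq_len(12)

# Data split: the same function on different pieces of the data.
# Pre-processing = cutting the data; post-processing = merging the results.
chunks <- split(x, rep(1:3, each = 4))
partial_sums <- vapply(chunks, sum, numeric(1))
data_result <- sum(partial_sums)

# Functional split: different functions on the same data.
# The tasks must be synchronized before their results are used together.
tasks <- list(mean = mean, max = max)
functional_result <- lapply(tasks, function(f) f(x))

print(data_result)
print(functional_result)
```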

Amdahl’s and Gustafson’s Law

Can we parallelize forever?

Gustafson’s Law

$$S(P) = P - \alpha (P - 1)$$

S: scaled speed-up, P: number of processors, α: proportion that cannot be parallelized

Amdahl’s Law

$$T(P) = T(1)\left(\alpha + \frac{1}{P}(1 - \alpha)\right)$$

T: processing time
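
To get a feeling for the two laws, they can be evaluated numerically. The function names and the value α = 0.1 below are assumptions chosen for illustration; the Amdahl speed-up is derived from the formula above as T(1)/T(P).

```r
# Sketch: evaluate Amdahl's and Gustafson's laws for a serial fraction alpha.
amdahl_time <- function(P, alpha, T1 = 1) T1 * (alpha + (1 - alpha) / P)

# Speed-up under Amdahl: T(1) / T(P); it can never exceed 1 / alpha
amdahl_speedup <- function(P, alpha) 1 / (alpha + (1 - alpha) / P)

# Scaled speed-up under Gustafson
gustafson_speedup <- function(P, alpha) P - alpha * (P - 1)

alpha <- 0.1                      # assumed: 10% of the work is serial
P <- c(1, 2, 4, 8, 16, 1024)

print(data.frame(
  P         = P,
  amdahl    = amdahl_speedup(P, alpha),
  gustafson = gustafson_speedup(P, alpha)
))
```

With a fixed problem size (Amdahl) the speed-up flattens out below 1/α, while with a problem that grows with P (Gustafson) it keeps growing almost linearly.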

Life experience

Rules of thumb

Programs are small, data is big ⇒ move the program to the data!
Parallel processing requires conversion of results ← do it wisely!
Queuing-system overhead is proportional to the number of jobs and hosts
Multiple hosts might require an authentication process
Checklist: pre-processing, synchronization and post-processing

Where to parallelize?

Little supercomputing center

Local cluster
Hosting services
Cloud
  Storage (beware of outgoing traffic costs)
  CPUs

Nice tool: segue, which allows running R jobs on AWS

Storage issues

The data won’t fit in ...

Loading the data is the first stage, but ...
Mapping instead of loading
The data is large and the memory is small ...
I/O is a performance killer
NAS / SAN is usually better than a simple disk
Mounting remote file systems, e.g. sshfs (degrades performance)
Distributed file systems (w/o parallel processing) ← usually cheap, but require maintenance
Move the program, not the data!
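
When the data does not fit in memory, one simple alternative to loading it all is to stream it in chunks. The sketch below uses base R only and is not memory mapping proper; the function name chunked_mean, the chunk size and the one-value-per-line text format are assumptions.

```r
# Sketch: compute a mean over a file without loading it all at once.
# Assumption: plain-text file with whitespace-separated numeric values.
chunked_mean <- function(path, chunk_size = 1e5) {
  con <- file(path, open = "r")
  on.exit(close(con))
  total <- 0
  count <- 0
  repeat {
    # scan() on an open connection continues where the last read stopped
    chunk <- scan(con, what = numeric(), nmax = chunk_size, quiet = TRUE)
    if (length(chunk) == 0) break
    total <- total + sum(chunk)
    count <- count + length(chunk)
  }
  total / count
}

# Small demonstration file (stands in for a file too big for memory)
demo_path <- tempfile(fileext = ".txt")
write(runif(1000), file = demo_path, ncolumns = 1)
print(chunked_mean(demo_path, chunk_size = 128))
```

Only one chunk is held in memory at a time, at the price of reading the file sequentially.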

Tools

Explicit Parallelism

Snow and Snowfall

MPI tools (beyond the scope of this talk)

Snow and Snowfall

https://www.stat.berkeley.edu/classes/s244/snowfall.pdf

biopara
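
A minimal sketch of explicit parallelism on one host is shown below. It uses R's bundled parallel package, which provides snow-style functions (makeCluster, parSapply, stopCluster), rather than snow or snowfall themselves; the worker count and the toy task are assumptions.

```r
# Sketch: snow-style explicit parallelism with the bundled parallel package.
library(parallel)

# Assumption: two local worker processes
cl <- makeCluster(2)

# Each task generates its own data on the worker and returns one number
chunk_means <- parSapply(cl, 1:4, function(i) {
  set.seed(i)
  mean(runif(1e4))
})

# Always release the workers when done
stopCluster(cl)

print(chunk_means)
```

Only the task index goes to the workers and only one number comes back, so the program moves rather than the data.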

Clusters and Grids

Queuing systems:
  Can run almost any program
  Very simple to use (for command-line users)

Grids:
  Raise authentication issues
  Huge amount of resources
  Ubiquitous, mostly in academic environments

Example: Condor queuing system

    Executable = /usr/local/bin/R
    arguments = --vanilla --file=path/file.r
    when_to_transfer_output = ON_EXIT_OR_EVICT
    Log = eddiea-ed2.log
    Error = del.eddiea-ed2.err.$(Process)
    Output = del.eddiea-ed2.out.$(Process)
    Requirements = ARCH == "X86_64" && \
                   (target.FreeMemoryMB > 6000)
    notification = never
    Universe = VANILLA
    Queue 3000

Clouds

Use resources as a service (Ben-Yehuda et al.)
Very cheap for peak usage or low-frequency usage
Wrong usage might be very expensive
Look for the service you need (it will usually be better and cheaper)
In case of a problem, Google it.

Hadoop - overview

An open-source, reliable, scalable, distributed computing system
Designed to scale up from single servers to thousands of machines, each offering local computation and storage
Designed to detect and handle failures at the application layer, rather than relying on hardware

Hadoop - components

Hadoop Common
HDFS
YARN, MR
Many other tools (incl. R packages)

Using the Hadoop file system (HDFS)

    library(rhdfs)
    hdfs.init()
    hdfs.ls('/')

    library(rJava)
    .jinit(parameters="-Xmx8g")

Writing an R object (1e8) to HDFS

    > f = hdfs.file("/a.hdfs", "w", buffersize=104857600)
    > system.time(hdfs.write(a, f, hsync=FALSE))
       user  system elapsed
      7.392   2.192  15.072
    > hdfs.close(f)
    [1] TRUE

Writing an R object (1e8) to tmpfs

    > a <- runif(1e8)
    > system.time(save(a, file='a1'))
       user  system elapsed
     99.648   0.316  99.893

Hadoop - MapReduce

Splits each process into Map and Reduce
  map: classifies the data that will be processed together
  reduce: processes the data
Hadoop takes care of everything "behind the scenes"

MR parameters

    library(rmr2)
    Sys.setenv(HADOOP_STREAMING=".../hadoop-streaming-2.5.1.jar")
    Sys.setenv(HADOOP_HOME=".../hadoop")
    Sys.setenv(HADOOP_CMD=".../bin/hadoop")

Simple mapper

    pi.map.fn <- function(n) {
      X <- runif(n)
      Y <- runif(n)
      # number of random points that fall inside the quarter circle
      S = sum((X^2 + Y^2) < 1)
      # the fraction S/n estimates pi/4
      return(keyval(1, S/n))
    }

Simple reducer

    pi.reduce.fn <- function(p) {
      keyval(4*mean(p))
    }

The MR process

    PI = function(n, output=NULL) {
      mapreduce(input=n,
                output=output,
                input.format="text",
                map=pi.map.fn,
                reduce=pi.reduce.fn,  # the reducer defined for this job
                verbose=TRUE)
    }
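
The same map and reduce logic can be tried without a Hadoop installation. The sketch below is plain base R and does not use rmr2; the names map_fn and reduce_fn, the seed and the input sizes are assumptions for illustration.

```r
# Sketch: Monte Carlo estimate of pi in map/reduce style, base R only.
set.seed(42)

# map: each task draws n random points in the unit square and returns the
# fraction that falls inside the quarter circle (an estimate of pi/4)
map_fn <- function(n) {
  x <- runif(n)
  y <- runif(n)
  sum(x^2 + y^2 < 1) / n
}

# reduce: average the per-task fractions and scale by 4
reduce_fn <- function(p) 4 * mean(p)

inputs <- rep(1e5, 10)            # ten map tasks of 1e5 points each
pi_hat <- reduce_fn(vapply(inputs, map_fn, numeric(1)))

print(pi_hat)
```

Each map task is independent of the others, which is what lets Hadoop spread them over many hosts.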

From the command line

    hadoop jar {PATH}/hadoop-streaming-2.5.1.jar \
      -file map.R -mapper map.R \
      -file reduce.R -reducer reduce.R \
      -input input_file -output out

map.R and reduce.R in this case have to be Rscript scripts
Files should be located in HDFS
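
A streaming mapper reads input lines on standard input and writes tab-separated key/value lines on standard output. The sketch below shows what the body of a hypothetical map.R for the pi example could look like; the function name map_lines and the input format (one sample size per line) are assumptions, not the script used in the talk.

```r
# Sketch of a hypothetical map.R body for Hadoop Streaming.
# Assumption: each input line holds one integer, the number of points to draw.
map_lines <- function(lines) {
  n <- suppressWarnings(as.integer(lines))
  n <- n[!is.na(n) & n > 0]       # skip lines that are not positive integers
  vapply(n, function(k) {
    x <- runif(k)
    y <- runif(k)
    # key 1, value = fraction of points inside the quarter circle
    sprintf("1\t%.10f", sum(x^2 + y^2 < 1) / k)
  }, character(1))
}

# In the real script, started with "#!/usr/bin/env Rscript", the last lines
# would read from standard input:
#   con <- file("stdin", open = "r")
#   writeLines(map_lines(readLines(con)))
#   close(con)

# Local demonstration with in-memory lines instead of stdin
demo_out <- map_lines(c("1000", "not a number", "2000"))
writeLines(demo_out)
```

Keeping the logic in a function makes it possible to test the mapper locally before submitting the job.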

Summary

The bottom line

Moving to the cloud increases agility
Moving to the cloud reduces costs
Moving to the cloud improves operations
Moving to the cloud offers new business opportunities
You can’t refuse!

Thank you
