BIG DATA DEFINITION Parallelization principles Tools Summary

Big Data Analytics Using R

Eddie Aronovich

October 23, 2014

Table of contents

1 BIG DATA DEFINITION
    Definition
    Characteristics
    Scaling
    Challenge

2 Parallelization principles
    Divide and Conquer
    Amdahl’s and Gustafson’s Law
    Life experience
    Where to parallelize?
    Storage issues

3 Tools
    Explicit Parallelism
    Hadoop
    Hadoop - HDFS
    Hadoop - MR

4 Summary

Gartner’s term (probably)

Simple (personal) definition

Too much data to be easily processed.
Used mainly for marketing and FUD.

Characteristics

Volume
Velocity
Variety
Veracity (integrity)

ScaleUp vs. ScaleOut

Scale Up - more ... in one box

expensive (cost grows exponentially)
“closed box” approach
technical complexity

Scale Out - more boxes

cheap (on average)
best of breed
usage is not trivial

“You want Scale Out? Well, Scale Out costs. And right here is where you start paying ... in sweat.”

How long does it take to read / write data?

Generating a vector of 1e7 values using runif() and writing it to disk: 10-16 sec.
Reading the same (96MB) file: 1.5-6 sec.
Generating a vector of 1e9 values was not possible on the host I used.
Calculating mean() and var() on all the data was not possible.
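
A measurement of this kind can be reproduced roughly as sketched below. The slide does not say how the file was written, so saveRDS() without compression is an assumption, the vector is smaller than 1e7 so that the sketch runs quickly, and the scratch file name is hypothetical. Timings depend on the host and will not match the numbers above.

```r
# Time generating + writing a numeric vector, then reading it back.
n <- 1e6                             # smaller than the 1e7 used on the slide
path <- tempfile(fileext = ".rds")   # hypothetical scratch file

t_write <- system.time({
  x <- runif(n)
  saveRDS(x, path, compress = FALSE) # assumption: uncompressed binary format
})
t_read <- system.time(y <- readRDS(path))

print(t_write["elapsed"])
print(t_read["elapsed"])

file_mb <- file.size(path) / 2^20    # about 8 bytes per value plus a small header
print(file_mb)
```

Raising n towards 1e9 shows the limit mentioned above: the vector alone needs about 8 GB of memory.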

And in parallel?

Create 100 files, each holding 1e7 values, and write them to disk.
For each file: read it and compute its mean() and var().
BUT - reading 100 files takes 100 times as long as reading one file :-(
Compute the mean and var of all the files and ... compute the total VAR.
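
The last step, turning per-file means and variances into the variance of all the data, can be sketched as follows. This is a minimal illustration, not the code used in the talk: the helper name combine_stats is hypothetical, and ten small in-memory chunks stand in for the 100 files. It relies on the standard decomposition of the total sum of squares into a within-chunk part and a between-chunk part.

```r
# Combine per-chunk summaries (size, mean, variance) into overall statistics.
combine_stats <- function(n, m, v) {
  N <- sum(n)
  total_mean <- sum(n * m) / N
  # within-chunk sum of squares + between-chunk sum of squares
  ss <- sum((n - 1) * v) + sum(n * (m - total_mean)^2)
  list(n = N, mean = total_mean, var = ss / (N - 1))
}

set.seed(1)
chunks <- replicate(10, runif(1e4), simplify = FALSE)  # stand-ins for the files
n <- sapply(chunks, length)
m <- sapply(chunks, mean)
v <- sapply(chunks, var)
total <- combine_stats(n, m, v)

# Check against the direct computation, which needs all the data in memory.
all_data <- unlist(chunks)
print(c(combined = total$mean, direct = mean(all_data)))
print(c(combined = total$var, direct = var(all_data)))
```

Only three numbers per file have to be kept, so the full data set never needs to be in memory at once.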

Now you know why you are here at this time !

One is not enough

Data

Data cannot be gathered on one node
One computer cannot process all the data
Data transfer is not feasible

Processors

One process, one thread (old and simple)
Multiple processes, one host
Multiple hosts

Reference

High-Performance and Parallel Computing with R by Dirk Eddelbuettel

Splitting tasks - how to?

Functional - requires sync
Data - requires pre/post processing
Combination ← Hadoop made it easy!
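
Splitting by data, with its pre- and post-processing steps, might look like the sketch below. It assumes the parallel package that ships with R and a Unix-like host, since mclapply() forks worker processes (on Windows mc.cores has to be 1). The sum-of-squares task and the variable names are only illustrative.

```r
library(parallel)

x <- seq_len(1e5)

# Pre-processing: split the data into one chunk per worker.
n_workers <- 2
chunks <- split(x, cut(seq_along(x), n_workers, labels = FALSE))

# Processing: each worker sums the squares of its own chunk.
partial <- mclapply(chunks, function(chunk) sum(chunk^2), mc.cores = n_workers)

# Post-processing: combine the partial results.
total <- Reduce(`+`, partial)
print(total)
```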

Can we parallelize forever ?

Gustafson’s Law

$$S(P) = P - \alpha (P - 1)$$

S - scale-up, P - # of processors,
α - proportion that can not be parallelized

Amdahl’s Law

$$T(P) = T(1)\left(\alpha + \frac{1}{P}(1 - \alpha)\right)$$

T - processing time
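
The two laws can be compared numerically. The sketch below evaluates both formulas for a few processor counts; the function names and the choice of α = 0.05 are illustrative assumptions. The Amdahl speed-up is taken as T(1)/T(P).

```r
# alpha: proportion that cannot be parallelized; P: number of processors.
amdahl_time <- function(P, alpha, T1 = 1) T1 * (alpha + (1 - alpha) / P)
amdahl_speedup <- function(P, alpha) amdahl_time(1, alpha) / amdahl_time(P, alpha)
gustafson_speedup <- function(P, alpha) P - alpha * (P - 1)

P <- c(1, 2, 8, 64, 1024)
alpha <- 0.05
print(data.frame(P = P,
                 amdahl = amdahl_speedup(P, alpha),
                 gustafson = gustafson_speedup(P, alpha)))
```

With a fixed problem size (Amdahl) the speed-up can never exceed 1/α, here 20, however many processors are added. When the problem grows with the machine (Gustafson) the speed-up keeps growing linearly in P.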

Rules of thumb

Programs are small, data is big ⇒ move the program to the data!
Parallel processing requires conversion of results ← do it wisely!
Queuing-system overhead is proportional to the # of jobs & hosts.
Multiple hosts might require an authentication process.
Checklist: pre-processing, synchronization and post-processing

Where to parallelize?

Little supercomputing center
Local cluster
Hosting services
Cloud
Storage (beware of outgoing traffic costs)
CPUs
Nice tool: segue - allows running R jobs on AWS

The data won’t fit in ...

Loading the data is the first stage, but ...

Mapping instead of loading
The data is large and the memory is small ...

I/O is a performance killer

NAS / SAN is usually better than a simple disk
Mounting remote file systems, e.g. sshfs (degrades performance)
Distributed file systems (w/o parallel processing) ← usually cheap, maintenance

Move the program, not the data!
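
Base R has no memory-mapping function of its own; add-on packages such as ff or bigmemory provide file-backed objects. As a substitute technique that needs no extra package, the sketch below reads a binary file in fixed-size chunks and keeps only running totals, so memory use stays bounded by the chunk size. The file, the chunk size and the function name chunked_mean are illustrative assumptions.

```r
# Write a sample file of doubles, then compute its mean chunk by chunk.
path <- tempfile(fileext = ".bin")   # hypothetical data file
set.seed(42)
x <- runif(1e5)
writeBin(x, path)

chunked_mean <- function(path, chunk_size = 1e4) {
  con <- file(path, "rb")
  on.exit(close(con))
  total <- 0
  count <- 0
  repeat {
    chunk <- readBin(con, what = "double", n = chunk_size)
    if (length(chunk) == 0) break    # end of file
    total <- total + sum(chunk)
    count <- count + length(chunk)
  }
  total / count
}

print(chunked_mean(path))
```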

Snow and Snowfall

MPI tools (beyond the scope of this talk)

Snow and Snowfall

https://www.stat.berkeley.edu/classes/s244/snowfall.pdf

biopara
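
For a first impression of the snow programming style, the sketch below uses the parallel package bundled with R, which offers snow-style cluster functions (makeCluster, clusterExport, parSapply, stopCluster). Using parallel instead of snow or snowfall themselves is a substitution made here so that no extra package is needed. The two local workers and the toy computation are assumptions.

```r
library(parallel)  # ships with R; provides snow-style cluster functions

cl <- makeCluster(2)                                   # two local worker processes
offset <- 10
clusterExport(cl, "offset", envir = environment())     # copy a variable to every worker
res <- parSapply(cl, 1:8, function(i) i^2 + offset)    # run the function on the workers
stopCluster(cl)                                        # always release the workers

print(res)
```

The same calls work with workers on several hosts when makeCluster() is given host names instead of a count, which is where authentication between hosts starts to matter.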

Clusters and Grids

Queuing systems:

Can run almost any program
Very simple to use (for command-line users)

Grids:

Require dealing with authentication
Huge amount of resources
Ubiquitous, mostly in the academic environment

Example: Condor queuing system

    Executable = /usr/local/bin/R
    arguments = --vanilla --file=path/file.r
    when_to_transfer_output = ON_EXIT_OR_EVICT
    Log = eddiea-ed2.log
    Error = del.eddiea-ed2.err.$(Process)
    Output = del.eddiea-ed2.out.$(Process)
    Requirements = ARCH == "X86_64" && \
        (target.FreeMemoryMB > 6000)
    notification = never
    Universe = VANILLA
    Queue 3000

Clouds

Use resources as a service (Ben-Yehuda et al.)
Very cheap for peak usage or low-frequency usage
Wrong usage might be very expensive
Look for the service you need (it would usually be better and cheaper)
In case of a problem - Google it.

Hadoop - overview

An open source reliable, scalable, distributed computing system

Designed to scale up from single servers to thousands of machines

Each offering local computation and storage

Designed to detect and handle failures at the application layer rather than relying on the hardware

Hadoop - components

Hadoop Common
HDFS
YARN, MR

Many other tools (incl. R package)

Using hadoop file system (HDFS)

    library(rhdfs)
    hdfs.init()
    hdfs.ls('/')

Using hadoop file system (HDFS)

    library(rJava)
    .jinit(parameters="-Xmx8g")

Writing an R object (1e8) to hdfs

    > f = hdfs.file("/a.hdfs", "w", buffersize=104857600)
    > system.time(hdfs.write(a, f, hsync=FALSE))
       user  system elapsed
      7.392   2.192  15.072
    > hdfs.close(f)
    [1] TRUE

Writing an R object (1e8) to tmpfs

    > a <- runif(1e8)
    > system.time(save(a, file='a1'))
       user  system elapsed
     99.648   0.316  99.893

Hadoop - map reduce

Splits each process into Map and Reduce

map - classifies the data that will be processed together
reduce - processes the data

Hadoop takes care of everything “behind the scenes”
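
The idea can be tried out without Hadoop. The sketch below imitates the two phases in plain R on a toy data set: split() plays the part of the map and grouping step, and a per-key function plays the part of reduce. It is an analogy with made-up records, not the Hadoop API.

```r
# Toy input: (key, value) records.
records <- data.frame(key = c("a", "b", "a", "c", "b", "a"),
                      value = c(1, 2, 3, 4, 5, 6))

# Map + grouping: values that belong together end up under the same key.
mapped <- split(records$value, records$key)

# Reduce: process all the values that share a key.
reduced <- vapply(mapped, sum, numeric(1))
print(reduced)
```

On a cluster the grouped values for different keys can live on different hosts, which is why the reduce step parallelizes naturally.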

MR parameters

    library(rmr2)
    Sys.setenv(HADOOP_STREAMING=".../hadoop-streaming-2.5.1.jar")
    Sys.setenv(HADOOP_HOME=".../hadoop")
    Sys.setenv(HADOOP_CMD=".../bin/hadoop")

Simple mapper

    pi.map.fn <- function(n) {
      X <- runif(n)
      Y <- runif(n)
      S <- sum((X^2 + Y^2) < 1)  # points inside the quarter circle
      return(keyval(1, S / n))   # S / n estimates pi / 4
    }

Simple reducer

    pi.reduce.fn <- function(p) {
      keyval(4 * mean(p))
    }

The MR process

    PI = function(n, output = NULL) {
      mapreduce(input = n,
                output = output,
                input.format = "text",
                map = pi.map.fn,
                reduce = pi.reduce.fn,
                verbose = TRUE)
    }
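
The same Monte Carlo estimate can be checked locally, without a Hadoop installation. In the sketch below plain R functions stand in for the rmr2 map and reduce steps: each map task returns the fraction of random points that fall inside the quarter circle, which estimates π/4, and the reduce step averages the fractions and multiplies by 4. The function names, the seed and the batch sizes are illustrative assumptions.

```r
set.seed(2014)

# Map step: one Monte Carlo batch of n points in the unit square.
pi_map <- function(n) {
  x <- runif(n)
  y <- runif(n)
  sum(x^2 + y^2 < 1) / n     # fraction inside the quarter circle
}

# Reduce step: average the fractions and scale by 4.
pi_reduce <- function(fractions) 4 * mean(fractions)

batches <- rep(1e5, 20)      # 20 map tasks of 1e5 points each
estimate <- pi_reduce(vapply(batches, pi_map, numeric(1)))
print(estimate)
```

Because the batches are independent, the map tasks need no synchronization, and only one number per task travels to the reducer.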

From the command line

    hadoop jar {PATH}/hadoop-streaming-2.5.1.jar \
        -file map.R -mapper map.R \
        -file reduce.R -reducer reduce.R \
        -input input_file -output out

map.R and reduce.R must in this case be Rscript scripts.
Files should be located in HDFS.
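
A streaming mapper reads lines from standard input and writes tab-separated key/value lines to standard output. The sketch below shows what the core of such a map.R could look like for a word count. The function name map_lines and the word-count task are hypothetical, and a fixed sample replaces standard input so that the sketch runs anywhere.

```r
# Hypothetical core of a streaming mapper: one "<word>\t1" line per input word.
map_lines <- function(lines) {
  words <- unlist(strsplit(lines, "[[:space:]]+"))
  words <- words[nzchar(words)]        # drop empty strings from leading blanks
  paste(words, 1, sep = "\t")
}

# Under Hadoop streaming the input would come from standard input, e.g.
#   writeLines(map_lines(readLines(file("stdin"))))
# Here a fixed sample is used instead.
out <- map_lines(c("big data", "big  cluster"))
writeLines(out)
```

A script used this way typically starts with a `#!/usr/bin/env Rscript` line and is marked executable so that Hadoop can launch it directly.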

The bottom line

Moving to the cloud increases agility
Moving to the cloud reduces costs
Moving to the cloud improves operations
Moving to the cloud offers new business
You can’t refuse!

Thank you
