Introduction to RHadoop. Master's Degree in Informatics Engineering. Master's Programme in ICT Innovation: Data Science (EIT ICT Labs Master School)

(1)

Introduction to RHadoop

Master’s Degree in Informatics Engineering

Master’s Programme in ICT Innovation: Data Science

(EIT ICT Labs Master School)

(2)

Contents

•  Introduction to…

    •  MapReduce

    •  HDFS

    •  Hadoop

(3)

MapReduce & DQ

•  Divide and Conquer (DQ)

•  General idea

    •  Divide a problem into (smaller) sub-problems

    •  Solve each sub-problem (independently)

(4)

DQ: pseudo-code

Function DQ(X: problem data)
    if small(X) then
        S = easy(X)
    else
        divide(X) => (X_1, ..., X_k)
        for i = 1 to k do
            S_i = DQ(X_i)
        S = combine(S_1, ..., S_k)
    return S
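The pseudo-code can be made concrete with a minimal R sketch. The function name `dq.sum`, the threshold value, and the choice of summing a vector are illustrative assumptions, not part of the slides; the comments map each step back to the pseudo-code.

```r
# Divide-and-conquer sum of a numeric vector (illustrative only).
# 'threshold' plays the role of small(X); its value is an assumption.
dq.sum <- function(x, threshold = 4) {
  # small(X): the problem is small enough to solve directly
  if (length(x) <= threshold) {
    return(sum(x))                      # easy(X)
  }
  # divide(X) => (X_1, X_2): split the vector into two halves
  mid   <- length(x) %/% 2
  parts <- list(x[1:mid], x[(mid + 1):length(x)])
  # S_i = DQ(X_i): solve each sub-problem independently
  partial <- lapply(parts, dq.sum, threshold = threshold)
  # S = combine(S_1, ..., S_k): add the partial sums
  Reduce(`+`, partial)
}

total <- dq.sum(1:100)
print(total)
```

Because the sub-problems are independent, the `lapply` call is the step that a parallel system could distribute across workers.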

(5)

DQ: efficiency

•  Efficiency of this approach

    •  An appropriate threshold must be selected to apply easy(X)

    •  Decomposition and combining functions must be efficient

(6)

DQ: Remarks

•  It cannot be applied to every type of problem

•  Sometimes, it might not be obvious how to divide a large problem into sub-problems

    •  If the division is uneven, the system will be unbalanced, which has an important impact on the overall performance of the algorithm

•  The sub-problems must be significantly smaller than the original one, so that a massively parallel supercomputer can be used and the communication overhead is compensated

(7)

MapReduce: general scheme

(8)

MapReduce: more detail

(9)

MapReduce: example

(10)

Hadoop Distributed File System (HDFS)

•  Distributed file system derived from Google's implementation (GFS)

•  Fault-tolerant: files are divided into chunks, which are distributed and replicated across the cluster

    •  Normally, the replication factor is 3

•  There is a Master Node that stores the meta-data: which files exist, into how many chunks they are divided, and where those chunks are stored
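The bookkeeping described above can be pictured with a toy R simulation. It is only a sketch: the file size, block size, and node names are made-up assumptions, and replicas are placed at random here, which is not how HDFS actually chooses nodes.

```r
# Toy model of chunking and replication (not real HDFS behaviour).
set.seed(42)
file.size.mb  <- 300                    # assumed file size
block.size.mb <- 128                    # assumed chunk size
replication   <- 3                      # replication factor from the slide
data.nodes    <- paste0("node", 1:5)    # hypothetical cluster nodes

# Number of chunks needed to hold the file
n.blocks <- ceiling(file.size.mb / block.size.mb)

# For each chunk, pick 'replication' distinct nodes (sampling without replacement)
placement <- lapply(seq_len(n.blocks),
                    function(b) sample(data.nodes, replication))

# The "meta-data" a master node would keep: which nodes hold each chunk
metadata <- data.frame(
  block = rep(seq_len(n.blocks), each = replication),
  node  = unlist(placement),
  stringsAsFactors = FALSE
)
print(metadata)
```

Losing any single node in this model still leaves at least two copies of every chunk, which is the point of replicating.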

(11)

(12)

•  In HDFS, blocks should be read from beginning to end (this favors the MapReduce approach)

•  Files in HDFS ARE NOT stored along with the host system files

    •  HDFS is normally an abstraction OVER an existing file system (ext3, ext4, etc.)

    •  Thus, there are specific commands to manipulate the HDFS file system

•  To open a file stored in HDFS, the client must contact the NameNode to retrieve the location of each block of the file (at the DataNodes)

(13)

•  Data locality: normally, when a job is launched, it runs on the same node that stores the data it must manipulate

•  The meta-data stored in the NameNode is not automatically

(14)

HDFS from the command line

•  Each user of the HDFS has a personal directory

•  No security directives are implemented, so users can write anywhere

•  Access to HDFS through the hdfs command

    hdfs dfs command

•  Important commands

    •  -copyFromLocal vs. -copyToLocal

    •  -mkdir

    •  -cp, -mv

(15)

Hadoop MRv1 vs YARN (MRv2)

•  Hadoop MRv1

    •  Resource management and task scheduling and monitoring are done by a single process (bottleneck): Job Tracker

    •  Each sub-problem is run by an independent process: Task Tracker

•  Hadoop MRv2

    •  Resource management and task scheduling and monitoring are split into different processes

        •  Resource Manager (RM): overall resource management

        •  Application Master (AM): per-job task scheduling and monitoring

(16)

(17)

Example: wordcount

•  Input: document made up of words

•  Output: a set of (Word, count(Word))

•  Two functions: map and reduce

•  map(k1, v1):
       for each word w in v1
           emit(w, 1)

•  reduce(k2, v2_list):
       int result = 0;
       for each v in v2_list
           result += v;
       emit(k2, result)
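The same map and reduce steps can be tried locally in base R, without Hadoop. This is a sketch under stated assumptions: the names `wc.map.local` and `wc.reduce.local` and the sample lines are invented for illustration, and the grouping that the framework performs between the two phases is imitated with `split`.

```r
# Map: emit one (word, 1) pair per word in a line
wc.map.local <- function(line) {
  words <- unlist(strsplit(line, " ", fixed = TRUE))
  words <- words[nzchar(words)]         # drop empty strings
  data.frame(key = words, value = rep(1, length(words)),
             stringsAsFactors = FALSE)
}

# Reduce: sum all the values collected for one key
wc.reduce.local <- function(key, values) {
  data.frame(key = key, value = sum(values), stringsAsFactors = FALSE)
}

lines <- c("to be or not to be", "to do is to be")

# Map phase: apply the mapper to every input line
mapped <- do.call(rbind, lapply(lines, wc.map.local))

# Shuffle phase (done by the framework in Hadoop): group values by key
grouped <- split(mapped$value, mapped$key)

# Reduce phase: one call per distinct key
reduced <- do.call(rbind, Map(wc.reduce.local, names(grouped), grouped))
counts  <- setNames(reduced$value, reduced$key)
print(counts)
```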

(18)

(19)

(20)

RHadoop

•  Developed by Revolution Analytics (acquired by Microsoft)

•  Three main components

    •  rhdfs: R + HDFS

    •  rmr2: R + MapReduce

    •  rhbase: R + HBase

•  Can be downloaded from:

    https://github.com/RevolutionAnalytics/RHadoop/wiki/Downloads

(21)

RHadoop: interacting with HDFS

# Load rhdfs library
library(rhdfs)
# Start rhdfs
hdfs.init()

# Basic "ls", path is mandatory
hdfs.ls("/user/hadoop")
# Create directory
work.dir <- "/user/hadoop/aux/"
hdfs.mkdir(work.dir)
# And delete
hdfs.delete(work.dir)
# Create again
hdfs.mkdir(work.dir)

(22)

RHadoop: wordcount example

•  Library loading and initialization

# Loading the RHadoop libraries
library('rhdfs')
library('rmr2')

# Initializing RHadoop
hdfs.init()

(23)

wordcount = function(input,
                     # The output can be an HDFS path but
                     # if it is NULL some temporary file will
                     # be generated and wrapped in a big data
                     # object, like the ones generated by to.dfs
                     output = NULL,
                     pattern = " ") {

  # Defining wordcount Map function
  wc.map = function(., lines) {
    keyval(unlist(strsplit(x = lines, split = pattern)), 1)
  }

  # Defining wordcount Reduce function
  wc.reduce = function(word, counts) {
    keyval(word, sum(counts))
  }

(24)

RHadoop: wordcount example

  # Defining MapReduce parameters by calling mapreduce function
  mapreduce(input = input,
            output = output,
            # You can specify your own input and output formats
            # and produce binary formats with the functions
            # make.input.format and make.output.format
            input.format = "text",
            map = wc.map,
            reduce = wc.reduce,
            # With combiner
            combine = TRUE)

}

(25)

# Running MapReduce Job by passing the Hadoop
# input directory location as parameter
wordcount('/user/hadoop/wordcount/quijote.txt')

# Retrieving the RHadoop MapReduce output
# data by passing output
# directory location as parameter
from.dfs("/tmp/file1b0817a5bcd0")

•  El Quijote can be downloaded from:

(26)

RHadoop: airline example

•  We will analyze the commercial data of an airline

•  The input data file is a CSV

•  We will need to use a custom input formatter to ease the task of processing the file

•  Data can be downloaded from:

    http://stat-computing.org/dataexpo/2009/1987.csv.bz2

(27)

RHadoop: airline example

library(rmr2)
library('rhdfs')
hdfs.init()

# Put data in HDFS
hdfs.data.root = '/user/hadoop/rhadoop/airline'
hdfs.data = file.path(hdfs.data.root, 'data')
hdfs.mkdir(hdfs.data)
hdfs.put("/home/hadoop/Downloads/1987.csv", hdfs.data)

(28)

RHadoop: airline example (input format)

#
# asa.csv.input.format() - read CSV data files and label field names
# for better code readability (especially in the mapper)
#
asa.csv.input.format = make.input.format(
  format = 'csv', mode = 'text', streaming.format = NULL, sep = ',',
  col.names = c('Year', 'Month', 'DayofMonth', 'DayOfWeek',
                'DepTime', 'CRSDepTime', 'ArrTime', 'CRSArrTime',
                'UniqueCarrier', 'FlightNum', 'TailNum',
                'ActualElapsedTime', 'CRSElapsedTime', 'AirTime',
                'ArrDelay', 'DepDelay', 'Origin', 'Dest', 'Distance',
                'TaxiIn', 'TaxiOut', 'Cancelled', 'CancellationCode',
                'Diverted', 'CarrierDelay', 'WeatherDelay',
                'NASDelay', 'SecurityDelay', 'LateAircraftDelay'),
  stringsAsFactors = FALSE)

(29)

RHadoop: airline example (mapper 1/2)

#
# the mapper gets keys and values from the input formatter
# in our case, the key is NULL and the value is a data.frame from read.table()
#
mapper.year.market.enroute_time = function(key, val.df) {

  # Remove header lines, cancellations, and diversions:
  val.df = subset(val.df, Year != 'Year' & Cancelled == 0 & Diverted == 0)

  # We don't care about direction of travel, so construct a new 'market' vector
  # with airports ordered alphabetically (e.g., LAX to JFK becomes 'JFK-LAX')

(30)

RHadoop: airline example (mapper 2/2)

  # key consists of year, market
  output.key = data.frame(year = as.numeric(val.df$Year), market = market,
                          stringsAsFactors = FALSE)

  # emit data.frame of gate-to-gate elapsed times (CRS and actual) + time in air
  output.val = val.df[, c('CRSElapsedTime', 'ActualElapsedTime', 'AirTime')]
  colnames(output.val) = c('scheduled', 'actual', 'inflight')

  # and finally, make sure they're numeric while we're at it
  output.val = transform(output.val,
                         scheduled = as.numeric(scheduled),
                         actual = as.numeric(actual),
                         inflight = as.numeric(inflight))

  return( keyval(output.key, output.val) )
}
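The mapper uses a `market` vector that its comments describe but whose construction does not appear in the code. One possible way to build it, shown here on a small made-up data frame, is to compare the two airport codes and paste them in alphabetical order. The sample rows are invented, and this sketch is an assumption about what the missing statement could look like, not the original code.

```r
# Toy stand-in for the data.frame the mapper receives (values are made up)
val.df <- data.frame(Origin = c("LAX", "JFK", "SFO"),
                     Dest   = c("JFK", "LAX", "BOS"),
                     stringsAsFactors = FALSE)

# Order each airport pair alphabetically so both directions share one key
market <- with(val.df,
               ifelse(Origin < Dest,
                      paste(Origin, Dest, sep = "-"),
                      paste(Dest, Origin, sep = "-")))
print(market)
```

With this keying, the LAX to JFK and JFK to LAX rows end up in the same group at the reducer.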

(31)

RHadoop: airline example (reducer)

#
# the reducer gets all the values for a given key
# the values (which may be multi-valued as here) come in the form of a data.frame
#
reducer.year.market.enroute_time = function(key, val.df) {
  output.key = key
  output.val = data.frame(flights = nrow(val.df),
                          scheduled = mean(val.df$scheduled, na.rm = TRUE),
                          actual = mean(val.df$actual, na.rm = TRUE),
                          inflight = mean(val.df$inflight, na.rm = TRUE))

  return( keyval(output.key, output.val) )
}
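What the reducer computes per key can be checked locally in base R, without rmr2. In this sketch the grouping by (year, market) that Hadoop performs is imitated with `split`; the helper name `summarise.group` and the sample rows are assumptions made up for illustration.

```r
# Made-up mapper output: key columns (year, market) plus the three timings
mapped <- data.frame(
  year      = c(1987, 1987, 1987),
  market    = c("JFK-LAX", "JFK-LAX", "BOS-SFO"),
  scheduled = c(300, 310, 360),
  actual    = c(320, NA, 355),
  inflight  = c(280, 290, 330),
  stringsAsFactors = FALSE
)

# Same summary as the reducer body: flight count and mean times, ignoring NAs
summarise.group <- function(val.df) {
  data.frame(flights   = nrow(val.df),
             scheduled = mean(val.df$scheduled, na.rm = TRUE),
             actual    = mean(val.df$actual, na.rm = TRUE),
             inflight  = mean(val.df$inflight, na.rm = TRUE))
}

# Imitate the shuffle: one data.frame per (year, market) key
groups  <- split(mapped, list(mapped$year, mapped$market), drop = TRUE)
results <- do.call(rbind, lapply(groups, summarise.group))
print(results)
```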

(32)

RHadoop: final configuration and execution

mr.year.market.enroute_time = function(input, output) {
  mapreduce(input = input,
            output = output,
            input.format = asa.csv.input.format,
            map = mapper.year.market.enroute_time,
            reduce = reducer.year.market.enroute_time,
            backend.parameters = list(
              hadoop = list(D = "mapred.reduce.tasks=2")
            ),
            verbose = TRUE)
}

(33)

RHadoop: gathering results

# 'out' is assumed to hold the value returned by mr.year.market.enroute_time()
results = from.dfs(out)
results.df = as.data.frame(results, stringsAsFactors = FALSE)
colnames(results.df) = c('year', 'market', 'flights', 'scheduled', 'actual',
                         'inflight')
print(head(results.df))
