• No results found

Lecture 10 - Functional programming: Hadoop and MapReduce

N/A
N/A
Protected

Academic year: 2021

Share "Lecture 10 - Functional programming: Hadoop and MapReduce"

Copied!
41
0
0

Loading.... (view fulltext now)

Full text

(1)

Lecture 10 - Functional programming:

Hadoop and MapReduce

Sohan Dharmaraja

Sohan Dharmaraja Lecture 10 - Functional programming: Hadoop and MapReduce 1 / 41

(2)

For today

Big Data and Text analytics Functional programming concepts MapReduce

Apache Hadoop

(3)

What is “Big Data”?

“A concept referring to data, whose size is beyond the ability of commonly used software to handle in acceptable time limits.1

1Snijders, C., Matzat, U., & Reips, U. (2012). “Big Data”: Big gaps of knowledge in the field of Internet science. International Journal of Internet Science

Sohan Dharmaraja Lecture 10 - Functional programming: Hadoop and MapReduce 3 / 41

(4)

Problems

Volume: increasingamounts become difficult to handle

Velocity: processingspeed is key. Fast insight gives you an edge Variety: big data is usuallyunstructuredand very diverse in type

(5)

How big is “Big Data”?

90% of data available today was created in thelast 2 years 12 TB (12,000,000 MB) of Tweets are generated every day Data sources are growing: healthcare, weather, stocks, etc.

Sohan Dharmaraja Lecture 10 - Functional programming: Hadoop and MapReduce 5 / 41

(6)

How is “Big Data” analyzed?

Parallel computing (CUDA)

Distributed computing (Hadoop, MongoDB, etc)

(7)

Big data scenarios

Sohan Dharmaraja Lecture 10 - Functional programming: Hadoop and MapReduce 7 / 41

(8)

Example 1

Your client’s stores are crowded at peak hours.

So crowded, that customers walk away in frustration.

At other times, the stores arenearly empty.

They are selling below potential due to cart abandonmentandfailure to attract customersthroughout the day.

(9)

Data scientist scenario: Example 1

Your client’s stores are crowded at peak hours.

So crowded, that customers walk away in frustration.

At other times, the stores arenearly empty.

They are selling below potential due to cart abandonmentandfailure to attract customersthroughout the day.

The tools:

You have a marketing budget, authority to send email advertisements, and to make special offers promotion schemes. You also have control over staff

scheduling and checkout procedures.

Sohan Dharmaraja Lecture 10 - Functional programming: Hadoop and MapReduce 9 / 41

(10)

Example 2

You are asked to build a “recommendation system” for an online merchant.

i.e. present information that is likely of interest to the user

(11)

Amazon?

Sohan Dharmaraja Lecture 10 - Functional programming: Hadoop and MapReduce 11 / 41

(12)

Google?

(13)

Brainstorming session

— Focus on recommendation for Amazon shoppers.

What would you say to the client to get the contract?

How would you approach the problem?

What do you need from me?

Sohan Dharmaraja Lecture 10 - Functional programming: Hadoop and MapReduce 13 / 41

(14)

Text analytics

“Data cleansing” in very important.

tokenization (phrase? words? . . . ) spelling normalization: color, colour

orthographic normalization: In Danish, søster = “sister”

morphological normalization: verbs, adverbs, adverbial participles Zipf’s law: inverse frequency law: the∼ 7%,of ∼ 3.5%

Can also can be considered part of data representation

(15)

Distributed Computing is hard

Need to befast: lots of data to churn

Need to bescalable: varying amounts of data to churn

Need to befault-tolerant: intensive processes ⇒ hardware failures

Sohan Dharmaraja Lecture 10 - Functional programming: Hadoop and MapReduce 15 / 41

(16)

Parallel design patterns

so far: low level kernel programming: thread mapping, etc think at a higher level than individual CUDA kernels specify what to compute, not how to compute it let programmer worry about algorithm

“Functional” approaches are motivated from above

(17)

Functional programming?

Languages: Lisp, ML, Haskell, etc ...

Different (often useful) perspective:

I Recursion (no loops allowed)

I Function pointers (Map, Reduce, etc...)

Finds uses in

I programming language theory (logic, proof systems, etc)

I design of compilers

I concurrent programming

Sohan Dharmaraja Lecture 10 - Functional programming: Hadoop and MapReduce 17 / 41

(18)

Roadmap for Map Reduce

Map: applies a process to data

Reduce: combines results into answers

ie. Divide and conquer forbig data, introduced by Google in 2004

(19)

Passing functions as arguments

// square a number int f(int x) {

return x*x;

}

// apply a function pointer to a number int g(int (*f)(int), int x) {

return (*f)(x);

}

// passing f as a function pointer to g void main() {

int res = g(f, 4);

}

Sohan Dharmaraja Lecture 10 - Functional programming: Hadoop and MapReduce 19 / 41

(20)

Map: applying a function to all elements

int inc(int x){

return x + 1;

}

// apply function pointer f to every element of an array void map(int* ary, int n, int (*f)(int)){

if ( n == 0 ) return;

*ary = (*f)(*ary);

map(ary+1, n-1, f);

}

void main() {

int ary[5] = {1,2,3,4,5};

(21)

Reduce: combines elements of array

int add(int x, int y) { return x + y; } int mul(int x, int y) { return x * y; }

// reduce every element of array with function pointer f int reduce(int* ary, int n, int (*f)(int, int), int b) {

if ( n == 0 ) return b;

return (*f)(reduce(ary+1, n-1, f, b), *ary);

}

void main() {

int ary[5] = {1,2,3,4,5};

int sum = reduce(ary, 5, add, 0);

int fac = reduce(ary, 5, mul, 1);

}

Sohan Dharmaraja Lecture 10 - Functional programming: Hadoop and MapReduce 21 / 41

(22)

Roadmap for Map Reduce

Map: assigns processes to machines

Reduce: combines machines’ results into answers These interactions have consequences

I Sometimes parallelization is obvious, sometimes not

I Recursion/map/reduce often help to decide how

I Again: “divide and conquer” mentality

(23)

MapReduce - Big Picture

A programming model for processing large data setsin batch Designed to execute on a cluster ofcommodity hardware Let tasks fail and beable to retry

Brings code to data, rather than data to code

Limit communication by allowing only certain operations in a flow

All inspired by functional programming ...

Sohan Dharmaraja Lecture 10 - Functional programming: Hadoop and MapReduce 23 / 41

(24)

Central MapReduce Ideas

Operate on key-value pairs

Data scientist provides map and reduce (input)

< k1, v1 > −−→map < k2, v2 >

< k2, v2 > combine,sort

−−−−−−−−→ < k2, v2 >

< k2, v2 > −−−−→reduce < k3, v3 >

(output) Efficient Sort provide by MapReduce library

(25)

MapReduce Example - Word Count

Example: Two text files file1:

Hello World Bye World file2:

Hello Hadoop Goodbye Hadoop

Sohan Dharmaraja Lecture 10 - Functional programming: Hadoop and MapReduce 25 / 41

(26)

MapReduce Example - Word Count: Map step

First map emits:

< Hello, 1>

< World, 1>

< Bye, 1>

< World, 1>

Second map emits:

< Hello, 1>

< Hadoop, 1>

< Goodbye, 1>

< Hadoop, 1>

(27)

MapReduce Example - Word Count: Sort & Combine step

Sorted output of first map:

< Bye, 1>

< Hello, 1>

< World, 2>

Sorted output of second map:

< Goodbye, 1>

< Hadoop, 2>

< Hello, 1>

Sohan Dharmaraja Lecture 10 - Functional programming: Hadoop and MapReduce 27 / 41

(28)

MapReduce Example - Word Count: Reduce step

Reduce method sums the values for each key.

Output of the job is:

< Bye, 1>

< Goodbye, 1>

< Hadoop, 2>

< Hello, 2>

< World, 2>

(29)

Caveats

All problems do not fit well (or at all) within this model This model is not suitable for real-time processing of data May not linearly scale in relation to your input data

Easier than scratch: but can be tricky to fit your problem to it

Sohan Dharmaraja Lecture 10 - Functional programming: Hadoop and MapReduce 29 / 41

(30)

What is MapReduce good at solving?

Identify, transform, aggregate, filter, count, sort. . .

Discovery tasks (vs. high repetition of similar tasks, many reads) Unstructured data (vs. tabular, indexes)

Continuously updated data (indexing cost) Many, many, many machines (fault tolerance)

(31)

Painfully Parallel Problems

Sohan Dharmaraja Lecture 10 - Functional programming: Hadoop and MapReduce 31 / 41

(32)

Painfully Parallel Problems

Given a function that is both commutative and associative (e.g., + or ∗)

Commutative: x + y = y + x

Associative: (x + y) + z = x + (y + z)

Partition: 5 6 4 1 4 1 2 5 6 2 7 6 3 4 6 1

Map: 16 12 21 14

Reduce: 63

(33)

Hadoop as a product

Cloud computing: sell time to make profitable use of excess capacity.

Google, Yahoo! and Amazon offer cloud computing services Customer submit jobs to vendor, who run them it in parallel Hadoop is written in Java

Sohan Dharmaraja Lecture 10 - Functional programming: Hadoop and MapReduce 33 / 41

(34)

Counting example - Java code

mapper:

void map(String name, String document){

// name: document name

// document: document contents for word in document:

EmitIntermediate(word, "1");

reducer:

void reduce(String word, Iterator partialCounts){

// word: a word

// partialCounts: a list of aggregated partial counts int sum = 0;

for pc in partialCounts:

sum += ParseInt(pc);

(35)

The Components

HDFS: Hadoop Distributed File System.

NameNode - tracks where in the cluster HDFS data is kept DataNode - Duplicates data in the HDFS (see later)

JobTracker - Assigns MapReduce tasks to nodes in the cluster TaskTracker - Node that tracks MapReduce tasks from JobTracker

Sohan Dharmaraja Lecture 10 - Functional programming: Hadoop and MapReduce 35 / 41

(36)

The role of disk files

Hadoop has its own file system, HDFS, built on top of the native OS.

Very large files are possible: some span more than one disk/machine.

This raises serious reliability issues. The HDFS is replicated, existing in at least 3 copies, i.e. on at least 3 separate disks.

Note: having IO files in HDFS minimizes communications costsin shipping data.

Slogan: “Moving computation is cheaper than moving data.”

(37)

Abstraction with Pig

MapReduce: A programming model for parallel processing, introduced in Google’s 2004 paper. Hadoop provides a framework for running MapReduce jobs written in Java.

Pig: A high level data flow language for processing data. Pig describe steps to be executed by running one or more MapReduce jobs.

Sohan Dharmaraja Lecture 10 - Functional programming: Hadoop and MapReduce 37 / 41

(38)

Pig

Pig is an SQL like language, and it is an abstraction layer over the MapReduce framework.

Pig is nothing but a higher level abstraction of MapReduce which can ease the problem solving and programming process

it shares the advantages and disadvantages of MapReduce

(39)

Word count example using Pig

Example: Word Count

A = LOAD ’/raw_data/’ USING TextLoader();

B = FOREACH A GENERATE FLATTEN(TOKENIZE(*));

C = GROUP B BY $0;

D = FOREACH C GENERATE group, COUNT(B);

STORE D INTO ’/myvolume/wordcount’;

Sohan Dharmaraja Lecture 10 - Functional programming: Hadoop and MapReduce 39 / 41

(40)

Hadoop demo

(41)

Resources

Java

http://download.oracle.com/javase/7/docs/api/java/lang/

Thread.html

http://download.oracle.com/javase/tutorial/essential/

concurrency/

MapReduce

http://labs.google.com/papers/mapreduce.html http://hadoop.apache.org/

Sohan Dharmaraja Lecture 10 - Functional programming: Hadoop and MapReduce 41 / 41

References

Related documents

Artinya bahwa pada wilayah yang dipengaruhi ENSO, akurasi model ramalan produksi padi (dengan variabel prediktor SST Nino 3.4, DMI, rasio LT/LB) akan tinggi, dan sebaliknya

Correlation between Breast Self-Examination Practices and Demographic Characteristics, Risk Factors and Clinical Stage of Breast Cancer among Iraqi Patients..

The nine independent (9) factors or variables that were perceived to inhibit the realisation of the objectives of the GAX are: unimportance of GAX by owners

Two components are important to achieve the multi-party risk efficiency: recalling a sense of shame through straightforward presentation for honest listening and reconsidering

counseling was feasible to implement in outpatient commu- nity-based substance abuse treatment settings, was effective in producing modest abstinence rates and strong reductions

It is also important to note Somali women reported experiences of positive aspects of childbirth, for example women reported an appreciation for care received, support from

90’ı 5 µm altı olan gezegen bilyalı değirmende 4 saat boyunca öğütülen tozla (C4 tozu) devam edilmesi uygun görülmüş olup bu tozun detaylı tane boyut analizi Şekil

The dynamic model of the grid consists of turbine governors (TG), automatic voltage regulators (AVR) as well as wind turbines, solar power units and energy storage units1.