Lecture 10 - Functional programming:
Hadoop and MapReduce
Sohan Dharmaraja
Sohan Dharmaraja Lecture 10 - Functional programming: Hadoop and MapReduce 1 / 41
For today
Big Data and Text analytics Functional programming concepts MapReduce
Apache Hadoop
What is “Big Data”?
“A concept referring to data, whose size is beyond the ability of commonly used software to handle in acceptable time limits.1”
1Snijders, C., Matzat, U., & Reips, U. (2012). “Big Data”: Big gaps of knowledge in the field of Internet science. International Journal of Internet Science
Sohan Dharmaraja Lecture 10 - Functional programming: Hadoop and MapReduce 3 / 41
Problems
Volume: increasingamounts become difficult to handle
Velocity: processingspeed is key. Fast insight gives you an edge Variety: big data is usuallyunstructuredand very diverse in type
How big is “Big Data”?
90% of data available today was created in thelast 2 years 12 TB (12,000,000 MB) of Tweets are generated every day Data sources are growing: healthcare, weather, stocks, etc.
Sohan Dharmaraja Lecture 10 - Functional programming: Hadoop and MapReduce 5 / 41
How is “Big Data” analyzed?
Parallel computing (CUDA)
Distributed computing (Hadoop, MongoDB, etc)
Big data scenarios
Sohan Dharmaraja Lecture 10 - Functional programming: Hadoop and MapReduce 7 / 41
Example 1
Your client’s stores are crowded at peak hours.
So crowded, that customers walk away in frustration.
At other times, the stores arenearly empty.
They are selling below potential due to cart abandonmentandfailure to attract customersthroughout the day.
Data scientist scenario: Example 1
Your client’s stores are crowded at peak hours.
So crowded, that customers walk away in frustration.
At other times, the stores arenearly empty.
They are selling below potential due to cart abandonmentandfailure to attract customersthroughout the day.
The tools:
You have a marketing budget, authority to send email advertisements, and to make special offers promotion schemes. You also have control over staff
scheduling and checkout procedures.
Sohan Dharmaraja Lecture 10 - Functional programming: Hadoop and MapReduce 9 / 41
Example 2
You are asked to build a “recommendation system” for an online merchant.
i.e. present information that is likely of interest to the user
Amazon?
Sohan Dharmaraja Lecture 10 - Functional programming: Hadoop and MapReduce 11 / 41
Google?
Brainstorming session
— Focus on recommendation for Amazon shoppers.
What would you say to the client to get the contract?
How would you approach the problem?
What do you need from me?
Sohan Dharmaraja Lecture 10 - Functional programming: Hadoop and MapReduce 13 / 41
Text analytics
“Data cleansing” in very important.
tokenization (phrase? words? . . . ) spelling normalization: color, colour
orthographic normalization: In Danish, søster = “sister”
morphological normalization: verbs, adverbs, adverbial participles Zipf’s law: inverse frequency law: the∼ 7%,of ∼ 3.5%
Can also can be considered part of data representation
Distributed Computing is hard
Need to befast: lots of data to churn
Need to bescalable: varying amounts of data to churn
Need to befault-tolerant: intensive processes ⇒ hardware failures
Sohan Dharmaraja Lecture 10 - Functional programming: Hadoop and MapReduce 15 / 41
Parallel design patterns
so far: low level kernel programming: thread mapping, etc think at a higher level than individual CUDA kernels specify what to compute, not how to compute it let programmer worry about algorithm
“Functional” approaches are motivated from above
Functional programming?
Languages: Lisp, ML, Haskell, etc ...
Different (often useful) perspective:
I Recursion (no loops allowed)
I Function pointers (Map, Reduce, etc...)
Finds uses in
I programming language theory (logic, proof systems, etc)
I design of compilers
I concurrent programming
Sohan Dharmaraja Lecture 10 - Functional programming: Hadoop and MapReduce 17 / 41
Roadmap for Map Reduce
Map: applies a process to data
Reduce: combines results into answers
ie. Divide and conquer forbig data, introduced by Google in 2004
Passing functions as arguments
// square a number int f(int x) {
return x*x;
}
// apply a function pointer to a number int g(int (*f)(int), int x) {
return (*f)(x);
}
// passing f as a function pointer to g void main() {
int res = g(f, 4);
}
Sohan Dharmaraja Lecture 10 - Functional programming: Hadoop and MapReduce 19 / 41
Map: applying a function to all elements
int inc(int x){
return x + 1;
}
// apply function pointer f to every element of an array void map(int* ary, int n, int (*f)(int)){
if ( n == 0 ) return;
*ary = (*f)(*ary);
map(ary+1, n-1, f);
}
void main() {
int ary[5] = {1,2,3,4,5};
Reduce: combines elements of array
int add(int x, int y) { return x + y; } int mul(int x, int y) { return x * y; }
// reduce every element of array with function pointer f int reduce(int* ary, int n, int (*f)(int, int), int b) {
if ( n == 0 ) return b;
return (*f)(reduce(ary+1, n-1, f, b), *ary);
}
void main() {
int ary[5] = {1,2,3,4,5};
int sum = reduce(ary, 5, add, 0);
int fac = reduce(ary, 5, mul, 1);
}
Sohan Dharmaraja Lecture 10 - Functional programming: Hadoop and MapReduce 21 / 41
Roadmap for Map Reduce
Map: assigns processes to machines
Reduce: combines machines’ results into answers These interactions have consequences
I Sometimes parallelization is obvious, sometimes not
I Recursion/map/reduce often help to decide how
I Again: “divide and conquer” mentality
MapReduce - Big Picture
A programming model for processing large data setsin batch Designed to execute on a cluster ofcommodity hardware Let tasks fail and beable to retry
Brings code to data, rather than data to code
Limit communication by allowing only certain operations in a flow
All inspired by functional programming ...
Sohan Dharmaraja Lecture 10 - Functional programming: Hadoop and MapReduce 23 / 41
Central MapReduce Ideas
Operate on key-value pairs
Data scientist provides map and reduce (input)
< k1, v1 > −−→map < k2, v2 >
< k2, v2 > combine,sort
−−−−−−−−→ < k2, v2 >
< k2, v2 > −−−−→reduce < k3, v3 >
(output) Efficient Sort provide by MapReduce library
MapReduce Example - Word Count
Example: Two text files file1:
Hello World Bye World file2:
Hello Hadoop Goodbye Hadoop
Sohan Dharmaraja Lecture 10 - Functional programming: Hadoop and MapReduce 25 / 41
MapReduce Example - Word Count: Map step
First map emits:
< Hello, 1>
< World, 1>
< Bye, 1>
< World, 1>
Second map emits:
< Hello, 1>
< Hadoop, 1>
< Goodbye, 1>
< Hadoop, 1>
MapReduce Example - Word Count: Sort & Combine step
Sorted output of first map:
< Bye, 1>
< Hello, 1>
< World, 2>
Sorted output of second map:
< Goodbye, 1>
< Hadoop, 2>
< Hello, 1>
Sohan Dharmaraja Lecture 10 - Functional programming: Hadoop and MapReduce 27 / 41
MapReduce Example - Word Count: Reduce step
Reduce method sums the values for each key.
Output of the job is:
< Bye, 1>
< Goodbye, 1>
< Hadoop, 2>
< Hello, 2>
< World, 2>
Caveats
All problems do not fit well (or at all) within this model This model is not suitable for real-time processing of data May not linearly scale in relation to your input data
Easier than scratch: but can be tricky to fit your problem to it
Sohan Dharmaraja Lecture 10 - Functional programming: Hadoop and MapReduce 29 / 41
What is MapReduce good at solving?
Identify, transform, aggregate, filter, count, sort. . .
Discovery tasks (vs. high repetition of similar tasks, many reads) Unstructured data (vs. tabular, indexes)
Continuously updated data (indexing cost) Many, many, many machines (fault tolerance)
Painfully Parallel Problems
Sohan Dharmaraja Lecture 10 - Functional programming: Hadoop and MapReduce 31 / 41
Painfully Parallel Problems
Given a function that is both commutative and associative (e.g., + or ∗)
Commutative: x + y = y + x
Associative: (x + y) + z = x + (y + z)
Partition: 5 6 4 1 4 1 2 5 6 2 7 6 3 4 6 1
Map: 16 12 21 14
Reduce: 63
Hadoop as a product
Cloud computing: sell time to make profitable use of excess capacity.
Google, Yahoo! and Amazon offer cloud computing services Customer submit jobs to vendor, who run them it in parallel Hadoop is written in Java
Sohan Dharmaraja Lecture 10 - Functional programming: Hadoop and MapReduce 33 / 41
Counting example - Java code
mapper:
void map(String name, String document){
// name: document name
// document: document contents for word in document:
EmitIntermediate(word, "1");
reducer:
void reduce(String word, Iterator partialCounts){
// word: a word
// partialCounts: a list of aggregated partial counts int sum = 0;
for pc in partialCounts:
sum += ParseInt(pc);
The Components
HDFS: Hadoop Distributed File System.
NameNode - tracks where in the cluster HDFS data is kept DataNode - Duplicates data in the HDFS (see later)
JobTracker - Assigns MapReduce tasks to nodes in the cluster TaskTracker - Node that tracks MapReduce tasks from JobTracker
Sohan Dharmaraja Lecture 10 - Functional programming: Hadoop and MapReduce 35 / 41
The role of disk files
Hadoop has its own file system, HDFS, built on top of the native OS.
Very large files are possible: some span more than one disk/machine.
This raises serious reliability issues. The HDFS is replicated, existing in at least 3 copies, i.e. on at least 3 separate disks.
Note: having IO files in HDFS minimizes communications costsin shipping data.
Slogan: “Moving computation is cheaper than moving data.”
Abstraction with Pig
MapReduce: A programming model for parallel processing, introduced in Google’s 2004 paper. Hadoop provides a framework for running MapReduce jobs written in Java.
Pig: A high level data flow language for processing data. Pig describe steps to be executed by running one or more MapReduce jobs.
Sohan Dharmaraja Lecture 10 - Functional programming: Hadoop and MapReduce 37 / 41
Pig
Pig is an SQL like language, and it is an abstraction layer over the MapReduce framework.
Pig is nothing but a higher level abstraction of MapReduce which can ease the problem solving and programming process
it shares the advantages and disadvantages of MapReduce
Word count example using Pig
Example: Word Count
A = LOAD ’/raw_data/’ USING TextLoader();
B = FOREACH A GENERATE FLATTEN(TOKENIZE(*));
C = GROUP B BY $0;
D = FOREACH C GENERATE group, COUNT(B);
STORE D INTO ’/myvolume/wordcount’;
Sohan Dharmaraja Lecture 10 - Functional programming: Hadoop and MapReduce 39 / 41
Hadoop demo
Resources
Java
http://download.oracle.com/javase/7/docs/api/java/lang/
Thread.html
http://download.oracle.com/javase/tutorial/essential/
concurrency/
MapReduce
http://labs.google.com/papers/mapreduce.html http://hadoop.apache.org/
Sohan Dharmaraja Lecture 10 - Functional programming: Hadoop and MapReduce 41 / 41