WA2341 Hadoop Programming EVALUATION ONLY


WA2341 Hadoop Programming

Web Age Solutions Inc. USA: 1-877-517-6540 Canada: 1-866-206-4644

Web: http://www.webagesolutions.com


The following terms are trademarks of other companies:

Java and all Java-based trademarks and logos are trademarks or registered trademarks of Sun Microsystems, Inc. in the United States, other countries, or both.

Microsoft, Windows, Windows NT, and the Windows logo are trademarks of Microsoft Corporation in the United States, other countries, or both.

UNIX is a registered trademark of The Open Group in the United States and other countries.

Linux is a trademark of Linus Torvalds in the United States, other countries, or both. IBM, WebSphere, DB2 and Tivoli are trademarks of the International Business Machines Corporation in the United States, other countries, or both.

Other company, product, and service names may be trademarks or service marks of others.

For customizations of this book or other sales inquiries, please contact us at: USA: 1-877-517-6540, email: [email protected]

Canada: 1-866-206-4644 toll free, email: [email protected]

This publication is protected by the copyright laws of Canada, United States and any other country where this book is sold. Unauthorized use of this material, including but not limited to, reproduction of the whole or part of the content, re-sale or transmission through fax, photocopy or e-mail is prohibited. To obtain authorization for any such activities, please write to:

Web Age Solutions Inc. 439 University Ave Suite 820

Toronto

Ontario, M5G 1Y8


Chapter 1 - MapReduce Overview

Objectives

In this chapter, participants will learn about:



MapReduce Programming Model



Main MapReduce design principles

1.1 MapReduce Defined



There are different definitions of what MapReduce (single word) is:

◊ a programming model

◊ a parallel processing framework

◊ a computational paradigm

◊ a batch query processor



This technique (model, etc.) was influenced by functional programming languages that have the map and reduce functions

1.2 Google's MapReduce



MapReduce was first introduced by Google back in 2004



Google applied for and was granted US Patent 7,650,331 on MapReduce called "System and method for efficient large-scale data processing"



The patent lists Jeffrey Dean and Sanjay Ghemawat as its inventors



The value proposition of the MapReduce framework is its scalability and fault-tolerance achieved by its architecture

Notes:

ABSTRACT

A large-scale data processing system and method includes one or more application-independent map modules configured to read input data and to apply at least one application-specific map operation to the input data to produce intermediate data values, wherein the map operation is automatically parallelized across multiple processors in the parallel processing environment. A plurality of intermediate data structures are used to store the intermediate data values. One or more independent reduce modules are configured to retrieve the intermediate data values and to apply at least one application-specific reduce operation to the intermediate data values to provide output data.

Source: http://www.google.ca/patents/US7650331

1.3 MapReduce Explained



MapReduce works by breaking the data processing task into two phases:

◊ The map phase and the reduce phase are executed sequentially, with the output of the map operation piped (emitted) as input into the reduce operation

◊ The map phase is backed by a Map() procedure that breaks up the chunks of original data into a list of key/value pairs

There is a data split operation feeding data into the Map() procedure

◊ The reduce phase is backed by a Reduce() procedure that performs some sort of aggregation operation by key (e.g. counting, finding the maximum of elements in the data set, etc.) on data received from the Map() procedure

◊ There is also an additional step between the map and reduce phases, called "Shuffle", that prepares (sorts, etc.) and directs the output of the map phase to the reduce phase


1.4 MapReduce Explained

(Source: Wikipedia)


1.5 MapReduce Word Count Job

Notes:

The key in the Word Count MapReduce operation is the word itself. Each word's associated value is the number of occurrences of this word in the input data set. The map phase emits the key/value pairs as word,1 (each word gets counted as occurring only once) with potentially multiple duplications which will be combined in the shuffle step and then aggregated in the reduce phase.
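
The Word Count flow described in these notes can be sketched in a few lines of JavaScript. This is a single-process illustration only, not Hadoop code: the input strings and the function names (mapWords, shuffle, reduceCounts, wordCount) are made up for the example, and each input string stands in for one data split.

```javascript
// Hypothetical input: each string stands for one input split handed to a mapper.
const splits = ["the quick brown fox", "the lazy dog", "the fox"];

// Map: emit a (word, 1) pair for every word in a split.
function mapWords(split) {
  return split.split(/\s+/).filter(Boolean).map((word) => [word, 1]);
}

// Shuffle: group the emitted values by key so each word's 1s end up together.
function shuffle(pairs) {
  const groups = new Map();
  for (const [key, value] of pairs) {
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(value);
  }
  return groups;
}

// Reduce: sum the grouped values for one key.
function reduceCounts(values) {
  return values.reduce((total, v) => total + v, 0);
}

function wordCount(inputSplits) {
  const emitted = inputSplits.flatMap(mapWords);
  const counts = {};
  for (const [word, ones] of shuffle(emitted)) {
    counts[word] = reduceCounts(ones);
  }
  return counts;
}

console.log(wordCount(splits));
```

With the sample input, "the" is counted 3 times and "fox" 2 times; every other word is counted once.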

1.6 MapReduce Shared-Nothing Architecture



MapReduce is designed around the shared-nothing architecture which leads to computational efficiencies in distributed computing environments

Shared-nothing means "independent of others"

◊ A mapper process is independent from other mapper processes, and so are reducer processes

This architecture allows the map and reduce operations to be executed in parallel (in their respective phases)




The MapReduce programming model is linearly scalable

1.7 Similarity with SQL Aggregation Operations



It may help to compare MapReduce with aggregation operations used in SQL



In SQL, aggregation is achieved by using COUNT(), AVG(), MIN(), and other such functions (which act as some sort of reducers) with the GROUP BY clause, e.g.

SELECT MONTH, SUM(SALES) FROM YEAR_END_REPORT GROUP BY MONTH



MapReduce offers more fine-grained programmatic control over data processing in multi-node computing environments using parallel programming algorithms
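
To make the comparison concrete, the GROUP BY query above can be expressed as a map step followed by a grouped sum. The sketch below is illustrative only: the rows are invented sample data standing in for the YEAR_END_REPORT table, and sumByKey is a hypothetical helper name.

```javascript
// Hypothetical rows standing in for the YEAR_END_REPORT table.
const yearEndReport = [
  { month: "JAN", sales: 100 },
  { month: "FEB", sales: 250 },
  { month: "JAN", sales: 50 },
];

// Map: project each row to a (MONTH, SALES) key/value pair.
const pairs = yearEndReport.map((row) => [row.month, row.sales]);

// Shuffle + reduce: group by key and sum the values,
// which is what GROUP BY MONTH with SUM(SALES) does.
function sumByKey(keyValuePairs) {
  const totals = {};
  for (const [key, value] of keyValuePairs) {
    totals[key] = (totals[key] || 0) + value;
  }
  return totals;
}

const salesByMonth = sumByKey(pairs);
console.log(salesByMonth);
```

For the sample rows, the JAN total is 150 and the FEB total is 250.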

1.8 Example of Map & Reduce Operations using JavaScript



JavaScript supports some elements of functional programming in the form of the map() and reduce() functions

Problem: Find the sum of the elements of the [1,2,3] array after all its elements have been increased by 10%

◊ Note: An alternative solution is to find the sum of the array elements before the increase and then apply the 10% increase to the total

1.9 Example of Map & Reduce Operations using JavaScript



Solution:

◊ The Map phase (apply the function of a 10% value increase to each element of the input array of [1,2,3]):

[1,2,3].map(function(x){return x + x/10});

Result: [1.1, 2.2, 3.3]

◊ The Reduce phase (use the result array of the Map operation as input and sum up all its elements):

[1.1, 2.2, 3.3].reduce(function(x,y){return x + y});

Result: 6.6



Note: While the map() and reduce() functions here can also be used independently of each other, MapReduce is always a single operation
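
As a rough illustration, the two steps can be chained into one expression and checked against the alternative solution mentioned earlier (sum first, then apply the increase once). The variable names are chosen for this sketch only.

```javascript
const values = [1, 2, 3];

// Approach 1: map (increase each element by 10%), then reduce (sum).
const increasedThenSummed = values
  .map((x) => x + x / 10)
  .reduce((x, y) => x + y);

// Approach 2 (the alternative solution): sum first, then apply the increase once.
const total = values.reduce((x, y) => x + y);
const summedThenIncreased = total + total / 10;

// Floating-point rounding can differ between the two orders of operations,
// so compare within a small tolerance rather than with ===.
const agree = Math.abs(increasedThenSummed - summedThenIncreased) < 1e-9;
console.log(increasedThenSummed, summedThenIncreased, agree);
```

Both approaches give the same answer here because the increase is a linear operation; that shortcut does not exist for arbitrary map functions.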

1.10 Problems Suitable for Solving with MapReduce



The following criteria can help you decide whether the problem at hand can be efficiently solved by MapReduce:

◊ The problem can be split into smaller problems with no shared state that can be solved in parallel

Those smaller problems are independent of one another and do not require interactions

◊ The problem can be decomposed into the map and reduce operations

The map operation: execute the same operation on all data

The reduce operation: execute the same operation on each group of data produced by the map operation

◊ Basically, you should see the generic "divide and conquer" pattern

1.11 Typical MapReduce Jobs



Counting tokens (words, URLs, etc.)



Finding aggregate values in the target data set, e.g. the average value



Processing geographical data

◊ Google Maps uses MapReduce to find the nearest feature (a coffee shop, museum, etc.) to a given address



etc...

1.12 Fault-tolerance of MapReduce



MapReduce operations have a degree of fault-tolerance and built-in recoverability from certain types of run-time failures, which is leveraged in production-ready systems (e.g. Hadoop)

MapReduce enjoys these qualities of service due to its architecture based on process parallelism

Failed units of work are rescheduled and resubmitted for execution should some mappers or reducers fail (provided the source data is still available)

Note: High Performance (e.g. Grid) Computing systems that use the Message Passing Interface communication for check-points are more difficult to program for failure recovery
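
The rescheduling idea can be illustrated with a toy scheduler. This is a hedged sketch, not how Hadoop implements it: runWithRetries, the task objects, and the simulated "node lost" failure are all hypothetical. The point is that only the failed unit of work is resubmitted, which is possible because tasks share no state.

```javascript
// Hypothetical scheduler sketch; this is not Hadoop's API.
function runWithRetries(tasks, maxAttempts) {
  const results = {};
  const attempts = {};
  const queue = tasks.slice();

  while (queue.length > 0) {
    const task = queue.shift();
    attempts[task.id] = (attempts[task.id] || 0) + 1;
    try {
      results[task.id] = task.run(attempts[task.id]);
    } catch (err) {
      if (attempts[task.id] >= maxAttempts) {
        throw new Error(`task ${task.id} failed ${maxAttempts} times: ${err.message}`);
      }
      queue.push(task); // reschedule only the failed unit of work
    }
  }
  return { results, attempts };
}

// Task "m2" simulates a mapper that crashes on its first attempt.
const mapTasks = [
  { id: "m1", run: () => 10 },
  {
    id: "m2",
    run: (attempt) => {
      if (attempt === 1) throw new Error("node lost");
      return 20;
    },
  },
  { id: "m3", run: () => 30 },
];

const outcome = runWithRetries(mapTasks, 3);
console.log(outcome.results, outcome.attempts);
```

In this run, m1 and m3 complete on their first attempt, while m2 is resubmitted once and succeeds on its second attempt.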

1.13 Distributed Computing Economics



MapReduce splits the workload into units of work (independent computation tasks) and intelligently distributes them across available worker nodes for parallel processing

To improve system performance, MapReduce tries to start a computation task on the node where the data to be worked on is stored

◊ This helps avoid unnecessary network operations

"Data locality" (collocation of data with the compute node) underlies the principles of Distributed Computing Economics

Notes:

Distributed Computing Economics was a topic covered by Jim Gray of Microsoft Research in his paper published back in 2003 (http://research.microsoft.com/pubs/70001/tr-2003-24.pdf).

His main conclusion was: "Put the computation near the data" or "One puts computing as close to the data as possible in order to avoid expensive network traffic".
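
A data-local scheduling decision might look like the sketch below. It is a simplified assumption-laden illustration: the block and node names, the blockLocations table, and the assignTask function are hypothetical, and a real scheduler weighs more factors than this.

```javascript
// Hypothetical block-to-node placement; node and block names are illustrative.
const blockLocations = {
  "block-1": ["node-a", "node-b"],
  "block-2": ["node-b", "node-c"],
  "block-3": ["node-c"],
};

// Pick a worker for the task that processes blockId.
// Prefer an idle node that already stores the block; otherwise fall back
// to any idle node, which implies reading the block over the network.
function assignTask(blockId, idleNodes) {
  const holders = blockLocations[blockId] || [];
  const local = holders.find((node) => idleNodes.includes(node));
  if (local !== undefined) return { node: local, dataLocal: true };
  if (idleNodes.length === 0) return null;
  return { node: idleNodes[0], dataLocal: false };
}

console.log(assignTask("block-1", ["node-b", "node-c"])); // data-local on node-b
console.log(assignTask("block-3", ["node-a"]));           // remote read from node-a
```

The first call finds an idle node that already holds the block, so no data moves; the second has to settle for a node that must fetch the block remotely.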

1.14 MapReduce Systems



Many systems leverage the MapReduce programming model:

◊ Hadoop has a built-in MapReduce engine for executing both regular and streaming MapReduce jobs

◊ Amazon Web Services offers its clients the Elastic MapReduce (EMR) service (running on a Hadoop cluster)


Notes:

Amazon Elastic MapReduce works as a Job flow where the client needs to follow pre-defined steps:

1. Provide the map and reduce applications/scripts to the Hadoop Framework via Amazon S3 bucket upload or use pre-installed ones

2. Specify the S3 bucket containing data input file(s) and the output S3 bucket to receive the result of the MapReduce workflow

3. Allocate the needed compute (EC2) instances

4. Execute the Job and pick up the output from the output S3 bucket

1.15 Summary



The MapReduce programming model was influenced by functional programming languages

MapReduce (one word) has the map and reduce components working together in a distributed computational environment

Production MapReduce systems offer a number of qualities of service, such as fault-tolerance


Chapter 2 - Hadoop Overview

Objectives

In this chapter, participants will learn about:



Apache Hadoop and its core components



Hadoop's main design considerations

2.1 Apache Hadoop



Apache Hadoop is a distributed fault-tolerant computing platform written in Java

Designed as a massively parallel processing (MPP) system based on a distributed master-slave architecture

Hadoop allows for the distributed processing of large data sets across clusters of computers using simple programming models, e.g. MapReduce

Hadoop is an open source project

Notes:

Hadoop is the name the son of Doug Cutting (the project's creator) gave to his yellow elephant toy. Doug Cutting is presently Chairman of the Apache Software Foundation and Cloudera's Chief Architect.

Massively parallel processing (MPP) is a system configuration that enlists hundreds or thousands of CPUs simultaneously to solve a particular computational task. Each computer in an MPP system controls its own resources and coordinates task execution with other nodes in the system. The coordination aspect sets Hadoop apart from MPP systems, as Hadoop is designed on the shared-nothing architecture and delegates the coordination activities to the dispatcher layer (made up of the JobTracker and TaskTrackers).

According to the Wikipedia article (http://en.wikipedia.org/wiki/Shared-nothing), "A shared nothing architecture is a distributed computing architecture in which each node is independent and self-sufficient, and there is no single point of contention across the system. More specifically, none of the nodes share memory or disk storage."

Yahoo provided funds to make Hadoop a Web-scale technology; the initial use case for Hadoop at Yahoo was to create and analyze a WebMap graph that consists of about one trillion (10^12) Web links and 100 billion distinct URLs.


2.2 Apache Hadoop Logo

2.3 Typical Hadoop Applications



Log and/or clickstream analysis



Web crawling results processing



Marketing analytics



Machine learning and data mining



Data archiving (e.g. for regulatory compliance, etc.)



See http://wiki.apache.org/hadoop/PoweredBy for the list of educational and production uses of Hadoop

2.4 Hadoop Clusters



The first versions of Hadoop were only able to handle 20 machines; newer versions are capable of running Hadoop clusters comprising thousands of nodes



Hadoop clusters run on moderately high-end commodity hardware ($2-5K per machine)

Hadoop clusters can be used as a data hub, data warehouse or a business analytics platform

2.5 Hadoop Design Principles



Hadoop's design was influenced by ideas published in Google File System (GFS) and MapReduce white papers



Hadoop's core component, the Hadoop Distributed File System (HDFS), is the counterpart of GFS

Hadoop uses a data processing system that is functionally equivalent to Google's MapReduce and is also called MapReduce (a term coined by Google's engineers)



One of the main principles of Hadoop's architecture is "design for failure"

◊ To deliver high-availability quality of service, Hadoop detects and handles failures at the application layer (rather than relying on hardware)

2.6 Hadoop's Core Components



The Hadoop project is made up of the following main components:

◊ Common

Contains Hadoop infrastructure elements (interfaces with HDFS, system libraries, RPC connectors, Hadoop admin scripts, etc.)

◊ Hadoop Distributed File System (HDFS)

Hadoop's persistence component designed to run on clusters of commodity hardware built around the "load once and read many times" concept

◊ MapReduce

A distributed data processing framework used as a data analysis system

2.7 Hadoop Simple Definition



In a nutshell, Hadoop is a distributed computing framework that consists of:

◊ Reliable data storage (provided via HDFS)

◊ Analysis system (provided by MapReduce)


2.8 High-Level Hadoop Architecture

Notes:

The HDFS NameNode acts as a meta-data server for all information stored in HDFS on data nodes. The MapReduce master is responsible for allocating the map and reduce jobs in a distributed environment in the most efficient way.

2.9 Hadoop-based Systems for Data Analysis



Hadoop (via HDFS) can host the following systems for data analysis:

◊ The MapReduce engine (the major data analytics component of the Hadoop project)

◊ Apache Pig

◊ HBase database

◊ Apache Hive data warehouse system

◊ Apache Mahout machine learning system

◊ etc.


2.10 Hadoop Caveats



Hadoop is a batch-oriented processing system



Many Hadoop-centric business analytics systems have high processing latency which comes from their dependencies on the MapReduce sub-system

◊ Querying even small data sets (under a gigabyte in size) may take up to several minutes

◊ This is in sharp contrast with the querying speed of relational databases such as MySQL, Oracle, DB2, etc., where functionally similar work can be done several orders of magnitude faster by applying indexes and other techniques

Systems that bypass MapReduce when building queries have near real-time query response times (e.g. HBase and Cloudera Impala)

2.11 Summary



Apache Hadoop is a distributed fault-tolerant computing platform written in Java used for processing large data sets across clusters of computers

Hadoop clusters may comprise thousands of nodes

One of the main principles of Hadoop's architecture is "design for failure"


Chapter 3 - Hadoop Distributed File System Overview

Objectives

In this chapter, participants will learn about:



Hadoop Distributed File System (HDFS)



Ways to access HDFS

3.1 Hadoop Distributed File System



The Hadoop Distributed File System (HDFS) is a distributed, scalable, fault-tolerant and portable file system written in Java

HDFS's architecture is based on the master/slave design pattern

An HDFS cluster consists of:

◊ A single NameNode (metadata server) holding directory information about files on DataNodes

◊ A number of DataNodes, usually one per machine in the cluster

HDFS design is "rack-aware" to minimize network latency

Notes:

HDFS includes a server called a secondary NameNode, which is not a fail-over NameNode. Rather, the secondary NameNode connects to the primary NameNode at specified intervals and pulls the primary NameNode's directory information to build current snapshots of the file system. These snapshots can be used to restart a failed primary NameNode from the most recent checkpoint without having to replay the entire journal of file-system actions.

The secondary NameNode has been deprecated in favor of the newly introduced Checkpoint node which takes over the former's tasks and adds more functionality.

Work is under way to provide automatic fail-over for the NameNode to prevent a single point of failure of a cluster.

Rack-awareness means taking into account a machine's physical location (rack) while scheduling tasks and allocating storage. Basically, HDFS is aware of the fact that network bandwidth between machines sitting in the same server rack is greater than that between machines in different racks.

The creators of HDFS made a design decision in favor of using machines with internal hard drives as it helps ensure data locality (data is stored on the hard drive of the machine whose CPU is used for processing it, e.g. with MapReduce). Consequently, Storage Area Network (SAN) or similar storage technologies are not recommended, for performance reasons.


3.2 Hadoop Distributed File System



HDFS is designed for efficient implementation of the following data processing pattern:

◊ Write once (normally, just load the data set on the file system)

◊ Read many times (for data analysis, etc.)

HDFS functionality (and that of Hadoop) is geared towards batch-oriented rather than real-time scenarios

HDFS does not have a built-in cache

Processing of data-intensive jobs can be done in parallel

3.3 Data Blocks



A data file in HDFS is split into blocks (a typical block size used by HDFS is 64 MB)

◊ For large files, bigger block sizes will help reduce the amount of metadata stored in the NameNode (metadata server)

Data reliability is achieved by replicating data blocks across multiple machines

◊ By default, data blocks get replicated to three nodes: two on the same server rack, and one on a different rack for redundancy
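
The arithmetic behind block splitting and replication can be worked through with a small helper. The function name blockFootprint and the 1 GB file are assumptions made for illustration; the raw storage figure assumes the last block of a file is not padded to the full block size.

```javascript
const MB = 1024 * 1024;

// Number of blocks a file occupies and the raw storage its replicas consume.
function blockFootprint(fileSizeBytes, blockSizeBytes, replicationFactor) {
  const blocks = Math.ceil(fileSizeBytes / blockSizeBytes);
  return {
    blocks,                                   // entries the NameNode must track
    replicas: blocks * replicationFactor,     // block copies spread over DataNodes
    rawBytes: fileSizeBytes * replicationFactor,
  };
}

// A hypothetical 1 GB file with a 64 MB block size and three-way replication.
const footprint = blockFootprint(1024 * MB, 64 * MB, 3);
console.log(footprint);

// Doubling the block size halves the number of blocks the NameNode tracks.
console.log(blockFootprint(1024 * MB, 128 * MB, 3).blocks);
```

For the 1 GB file, a 64 MB block size yields 16 blocks and 48 block replicas, while a 128 MB block size yields 8 blocks, which is why bigger blocks reduce NameNode metadata for large files.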


3.4 Data Block Replication Example

Example of a three-way (default) replication of a single data block for redundancy and achieving high data availability (the block size is usually a multiple of 64 MB)

3.5 HDFS NameNode Directory Diagram

Notes:

The HDFS NameNode maintains the directory metadata about all data nodes and data blocks stored for each file registered on HDFS.


3.6 Accessing HDFS



HDFS is not a Portable Operating System Interface (POSIX) compliant file system

It is modeled after traditional hierarchical file systems containing directories and files

File-system commands (copy, move, delete, etc.) against HDFS can be performed in a number of ways:

◊ Through the HDFS Command-Line Interface (CLI) which supports Unix-like commands: cat, chown, ls, mkdir, mv, etc.

◊ Using the Java API for HDFS

◊ Via the C-language wrapper around the Java API

◊ Using a regular HTTP browser for file-system and file content viewing

This is made possible through a Web server (Jetty) embedded in the NameNode and DataNodes

3.7 Examples of HDFS Commands



The common way to invoke the HDFS CLI:

◊ hadoop fs {HDFS commands}

Copying a file from the local file system over to HDFS (with the same name):

◊ hadoop fs -put <filename_on_local_system>

Copying the file from HDFS to the local file system (with the same name):

◊ hadoop fs -get <filename_on_HDFS>

Recursively listing files in a directory:

◊ hadoop fs -ls -R <directory_on_HDFS>

Creating a directory under the current user's home directory (e.g. /user/userid/) on HDFS:

◊ hadoop fs -mkdir REPORT


Notes:

Instead of the put command, you can use the functionally equivalent but more descriptive copyFromLocal command, and the get command also has its more descriptive equivalent: copyToLocal.

3.8 Client Interactions with HDFS for the Read Operation



Whenever a Hadoop client needs to read a file on HDFS, it first contacts the NameNode

The NameNode locates the block ids associated with the file as well as the IP addresses of the DataNodes storing those blocks

The NameNode returns the related information to the client

The client contacts the related DataNodes and supplies the block ids for DataNodes to locate the blocks on their local HDFS storage

The DataNodes serve the blocks of data back to the client
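
The read sequence above can be modeled with plain in-memory objects. This is only a sketch of the message flow, not the HDFS client API: the nameNode and dataNodes objects, the file path, block ids, IP addresses and block contents are all invented for the example.

```javascript
// In-memory stand-ins for a NameNode and DataNodes (illustrative only).
const nameNode = {
  files: {
    "/user/userid/report.txt": [
      { blockId: "blk_1", dataNodes: ["10.0.0.1", "10.0.0.2"] },
      { blockId: "blk_2", dataNodes: ["10.0.0.2", "10.0.0.3"] },
    ],
  },
  // The client asks for a file's block ids and the DataNodes storing them.
  getBlockLocations(path) {
    const blocks = this.files[path];
    if (!blocks) throw new Error(`no such file: ${path}`);
    return blocks;
  },
};

// Each DataNode maps block ids to the block contents it stores locally.
const dataNodes = {
  "10.0.0.1": { blk_1: "Hello, " },
  "10.0.0.2": { blk_1: "Hello, ", blk_2: "HDFS" },
  "10.0.0.3": { blk_2: "HDFS" },
};

// The client contacts one DataNode per block, supplies the block id,
// and assembles the file from the blocks served back.
function readFile(path) {
  return nameNode
    .getBlockLocations(path)
    .map(({ blockId, dataNodes: addresses }) => dataNodes[addresses[0]][blockId])
    .join("");
}

console.log(readFile("/user/userid/report.txt"));
```

Note that the NameNode only hands out metadata in this model; the file content itself flows directly between the client and the DataNodes.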

3.9 Read Operation Sequence Diagram


3.10 Client Interactions with HDFS for the Write Operation



Whenever a Hadoop client needs to write a file to HDFS, it first contacts the NameNode

The NameNode generates and registers meta-data about the file and allocates suitable DataNodes to keep the file's replicas

Information about the allocated DataNodes is sent back to the client

The client uses HDFS I/O facilities to stream content to the first DataNode in the list (the "primary" DataNode)

The "primary" DataNode makes a local copy of related data and engages other DataNodes in a peer-to-peer data sharing communication for data replication

Each DataNode sends acknowledgments of receiving their data back to the "primary" DataNode

The client gets an acknowledgment from the "primary" DataNode as a confirmation of the HDFS write operation

The client notifies the NameNode on completion of the operation

Notes:

Once all DataNodes receive their replicas of the file, they cache the content in their memory (on Java Heap) and send acknowledgments to the "primary" DataNode. The actual data persistence to the local file system is done by DataNodes asynchronously at a later time.
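
The write sequence can be modeled in the same spirit. The sketch below is a deliberate simplification under stated assumptions: all names (createDataNode, writeToPrimary, writeFile, the dn1..dn3 nodes, the file path) are hypothetical, the "primary" DataNode forwards data to the other DataNodes directly instead of through a peer-to-peer exchange, and everything runs synchronously in memory.

```javascript
// Illustrative in-memory model; names are hypothetical and not the HDFS API.
function createDataNode(name) {
  return { name, blocks: {} };
}

const nodes = ["dn1", "dn2", "dn3"].map(createDataNode);

const nameNode = {
  metadata: {},
  // Registers the file and picks DataNodes for its replicas (here: simply the first N).
  allocate(path, replication) {
    const targets = nodes.slice(0, replication);
    this.metadata[path] = { targets: targets.map((n) => n.name), complete: false };
    return targets;
  },
  complete(path) {
    this.metadata[path].complete = true;
  },
};

// The "primary" DataNode stores its copy, forwards the data to the remaining
// DataNodes, and collects their acknowledgments.
function writeToPrimary(primary, others, path, content) {
  primary.blocks[path] = content;
  const acks = others.map((node) => {
    node.blocks[path] = content;
    return node.name;
  });
  return { primary: primary.name, acks };
}

function writeFile(path, content) {
  const [primary, ...others] = nameNode.allocate(path, 3);
  const ack = writeToPrimary(primary, others, path, content);
  nameNode.complete(path); // the client notifies the NameNode on completion
  return ack;
}

console.log(writeFile("/user/userid/REPORT/q4.txt", "sales data"));
```

The client in this model only ever talks to the NameNode and the "primary" DataNode; replication to the other two nodes happens behind the primary.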

3.11 Communication inside HDFS



HDFS is designed for batch processing rather than for interactive use

This affected the design of communication protocols inside HDFS

Communication inside HDFS is modeled after the Remote Procedure Call (RPC) application protocol layered on top of the TCP/IP protocol

3.12 Summary



File-system commands against HDFS can be performed in a number of ways:

◊ Through the fs command-line interface (which supports Unix-like commands: cat, chown, ls, mkdir, mv, etc.)

◊ Using Java API for HDFS

◊ Via the C-language wrapper around the Java API

