WA2341 Hadoop Programming
Web Age Solutions Inc. USA: 1-877-517-6540 Canada: 1-866-206-4644
Web: http://www.webagesolutions.com
The following terms are trademarks of other companies:
Java and all Java-based trademarks and logos are trademarks or registered trademarks of Sun Microsystems, Inc. in the United States, other countries, or both.
Microsoft, Windows, Windows NT, and the Windows logo are trademarks of Microsoft Corporation in the United States, other countries, or both.
UNIX is a registered trademark of The Open Group in the United States and other countries.
Linux is a trademark of Linus Torvalds in the United States, other countries, or both. IBM, WebSphere, DB2 and Tivoli are trademarks of the International Business Machines Corporation in the United States, other countries, or both.
Other company, product, and service names may be trademarks or service marks of others.
For customizations of this book or other sales inquiries, please contact us at: USA: 1-877-517-6540, email: [email protected]
Canada: 1-866-206-4644 toll free, email: [email protected]
Copyright © 2014 Web Age Solutions Inc.
This publication is protected by the copyright laws of Canada, United States and any other country where this book is sold. Unauthorized use of this material, including but not limited to, reproduction of the whole or part of the content, re-sale or transmission through fax, photocopy or e-mail is prohibited. To obtain authorization for any such activities, please write to:
Web Age Solutions Inc. 439 University Ave Suite 820
Toronto
Ontario, M5G 1Y8
Table of Contents
Chapter 1 - MapReduce Overview
1.1 MapReduce Defined
1.2 Google's MapReduce
1.3 MapReduce Explained
1.4 MapReduce Explained
1.5 MapReduce Word Count Job
1.6 MapReduce Shared-Nothing Architecture
1.7 Similarity with SQL Aggregation Operations
1.8 Example of Map & Reduce Operations using JavaScript
1.9 Example of Map & Reduce Operations using JavaScript
1.10 Problems Suitable for Solving with MapReduce
1.11 Typical MapReduce Jobs
1.12 Fault-tolerance of MapReduce
1.13 Distributed Computing Economics
1.14 MapReduce Systems
1.15 Summary
Chapter 2 - Hadoop Overview
2.1 Apache Hadoop
2.2 Apache Hadoop Logo
2.3 Typical Hadoop Applications
2.4 Hadoop Clusters
2.5 Hadoop Design Principles
2.6 Hadoop's Core Components
2.7 Hadoop Simple Definition
2.8 High-Level Hadoop Architecture
2.9 Hadoop-based Systems for Data Analysis
2.10 Hadoop Caveats
2.11 Summary
Chapter 3 - Hadoop Distributed File System Overview
3.1 Hadoop Distributed File System
3.2 Hadoop Distributed File System
3.3 Data Blocks
3.4 Data Block Replication Example
3.5 HDFS NameNode Directory Diagram
3.6 Accessing HDFS
3.7 Examples of HDFS Commands
3.8 Client Interactions with HDFS for the Read Operation
3.9 Read Operation Sequence Diagram
3.10 Client Interactions with HDFS for the Write Operation
3.11 Communication inside HDFS
3.12 Summary
Chapter 4 - MapReduce with Hadoop
4.1 Hadoop's MapReduce
4.2 JobTracker and TaskTracker
4.3 MapReduce Programming Options
4.4 Java MapReduce API
4.5 The Structure of a Java MapReduce Program
4.6 The Mapper Class
4.7 The Reducer Class
4.8 The Driver Class
4.9 Compiling Classes
4.10 Running the MapReduce Job
4.11 The Structure of a Single MapReduce Program
4.12 Combiner Pass (Optional)
4.13 Hadoop's Streaming MapReduce
4.14 Python Word Count Mapper Program Example
4.15 Python Word Count Reducer Program Example
4.16 Setting up Java Classpath for Streaming Support
4.17 Streaming Use Cases
4.18 The Streaming API vs Java MapReduce API
4.19 Amazon Elastic MapReduce
4.20 Amazon Elastic MapReduce
4.21 Amazon Elastic MapReduce
4.22 Summary
Chapter 5 - Apache Pig Scripting Platform
5.1 What is Pig?
5.2 Pig Latin
5.3 Apache Pig Logo
5.4 Pig Execution Modes
5.5 Local Execution Mode
5.6 MapReduce Execution Mode
5.7 Running Pig
5.8 Running Pig in Batch Mode
5.9 What is Grunt?
5.10 Pig Latin Statements
5.11 Pig Programs
5.12 Pig Latin Script Example
5.13 SQL Equivalent
5.14 Differences between Pig and SQL
5.15 Statement Processing in Pig
5.16 Comments in Pig
5.17 Supported Simple Data Types
5.18 Supported Complex Data Types
5.19 Arrays
5.20 Defining Relation's Schema
5.21 The bytearray Generic Type
5.22 Using Field Delimiters
5.23 Referencing Fields in Relations
5.24 Summary
Chapter 6 - Apache Pig HDFS Interface
6.1 The HDFS Interface
6.2 FSShell Commands (Short List)
6.3 Grunt's Old File System Commands
6.4 Summary
Chapter 7 - Apache Pig Relational and Eval Operators
7.1 Pig Relational Operators
7.2 Example of Using the JOIN Operator
7.3 Example of Using the JOIN Operator
7.4 Example of Using the Order By Operator
7.5 Caveats of Using Relational Operators
7.6 Pig Eval Functions
7.7 Caveats of Using Eval Functions (Operators)
7.8 Example of Using Single-column Eval Operations
7.9 Example of Using Eval Operators For Global Operations
7.10 Summary
Chapter 8 - Apache Pig Miscellaneous Topics
8.1 Utility Commands
8.2 Handling Compression
8.3 User-Defined Functions
8.4 Filter UDF Skeleton Code
8.5 Summary
Chapter 9 - Apache Pig Performance
9.1 Apache Pig Performance
9.2 Performance Enhancer - Use the Right Schema Type
9.3 Performance Enhancer - Apply Data Filters
9.4 Use the PARALLEL Clause
9.5 Examples of the PARALLEL Clause
9.6 Performance Enhancer - Limiting the Data Sets
9.7 Displaying Execution Plan
9.8 Summary
Chapter 10 - Hive
10.1 What is Hive?
10.2 Apache Hive Logo
10.3 Hive's Value Proposition
10.4 Who uses Hive?
10.5 Hive's Main Systems
10.6 Hive Features
10.7 Hive Architecture
10.8 HiveQL
10.9 Where are the Hive Tables Located?
10.10 Hive Command-line Interface (CLI)
10.11 Summary
Chapter 11 - Hive Command-line Interface
11.1 Hive Command-line Interface (CLI)
11.2 The Hive Interactive Shell
11.3 Running Host OS Commands from the Hive Shell
11.4 Interfacing with HDFS from the Hive Shell
11.5 The Hive in Unattended Mode
11.6 The Hive CLI Integration with the OS Shell
11.7 Executing HiveQL Scripts
11.8 Comments in Hive Scripts
11.9 Variables and Properties in Hive CLI
11.10 Setting Properties in CLI
11.11 Example of Setting Properties in CLI
11.12 Hive Namespaces
11.13 Using the SET Command
11.14 Setting Properties in the Shell
11.15 Setting Properties for the New Shell Session
11.16 Summary
Chapter 12 - Hive Data Definition Language
12.1 Hive Data Definition Language
12.2 Creating Databases in Hive
12.3 Using Databases
12.4 Creating Tables in Hive
12.5 Supported Data Type Categories
12.6 Common Primitive Types
12.7 Example of the CREATE TABLE Statement
12.8 The STRUCT Type
12.9 Table Partitioning
12.10 Table Partitioning
12.11 Table Partitioning on Multiple Columns
12.12 Viewing Table Partitions
12.13 Row Format
12.14 Data Serializers / Deserializers
12.15 File Format Storage
12.16 More on File Formats
12.17 The EXTERNAL DDL Parameter
12.18 Example of Using EXTERNAL
12.19 Creating an Empty Table
12.20 Dropping a Table
12.21 Table / Partition(s) Truncation
12.22 Alter Table/Partition/Column
12.23 Views
12.24 Create View Statement
12.25 Why Use Views?
12.26 Restricting Amount of Viewable Data
12.27 Examples of Restricting Amount of Viewable Data
12.28 Creating and Dropping Indexes
12.29 Describing Data
12.30 Summary
Chapter 13 - Hive Data Manipulation Language
13.1 Hive Data Manipulation Language (DML)
13.2 Using the LOAD DATA statement
13.3 Example of Loading Data into a Hive Table
13.4 Loading Data with the INSERT Statement
13.5 Appending and Replacing Data with the INSERT Statement
13.6 Examples of Using the INSERT Statement
13.7 Multi Table Inserts
13.8 Multi Table Inserts Syntax
13.9 Multi Table Inserts Example
13.10 Summary
Chapter 14 - Hive Select Statement
14.1 HiveQL
14.2 The SELECT Statement Syntax
14.3 The WHERE Clause
14.4 Examples of the WHERE Statement
14.5 Partition-based Queries
14.6 Example of an Efficient SELECT Statement
14.7 The DISTINCT Clause
14.8 Supported Numeric Operators
14.9 Built-in Mathematical Functions
14.10 Built-in Aggregate Functions
14.11 Built-in Statistical Functions
14.12 Other Useful Built-in Functions
14.13 The GROUP BY Clause
14.14 The HAVING Clause
14.15 The LIMIT Clause
14.16 The ORDER BY Clause
14.17 The JOIN Clause
14.18 The CASE … Clause
14.19 Example of CASE … Clause
14.20 Summary
Chapter 15 - Apache Sqoop
15.1 What is Sqoop?
15.2 Apache Sqoop Logo
15.3 Sqoop Import / Export
15.4 Sqoop Help
15.5 Examples of Using Sqoop Commands
15.6 Data Import Example
15.7 Fine-tuning Data Import
15.8 Controlling the Number of Import Processes
15.9 Data Splitting
15.10 Helping Sqoop Out
15.11 Example of Executing Sqoop Load in Parallel
15.12 A Word of Caution: Avoid Complex Free-Form Queries
15.13 Using Direct Export from Databases
15.14 Example of Using Direct Export from MySQL
15.15 More on Direct Mode Import
15.16 Changing Data Types
15.17 Example of Default Types Overriding
15.18 File Formats
15.19 The Apache Avro Serialization System
15.20 Binary vs Text
15.21 More on the SequenceFile Binary Format
15.22 Generating the Java Table Record Source Code
15.23 Data Export from HDFS
15.24 Export Tool Common Arguments
15.25 Data Export Control Arguments
15.26 Data Export Example
15.27 Using a Staging Table
15.28 INSERT and UPDATE Statements
15.29 INSERT Operations
15.30 UPDATE Operations
15.31 Example of the Update Operation
15.32 Failed Exports
15.33 Summary
Chapter 16 - Cloudera Impala
16.1 What is Cloudera Impala?
16.2 Impala's Logo
16.3 Benefits of Using Impala
16.4 Key Features
16.5 How Impala Handles SQL Queries
16.6 Impala Programming Interfaces
16.7 Impala SQL Language Reference
16.8 Differences Between Impala and HiveQL
16.9 Impala Shell
16.10 Impala Shell Main Options
16.11 Impala Shell Commands
16.12 Impala Common Shell Commands
16.13 Cloudera Web Admin UI
16.14 Impala Browse-based Query Editor
16.15 Summary
Chapter 17 - Apache HBase
17.1 What is HBase?
17.2 HBase Design
17.3 HBase Features
17.4 The Write-Ahead Log (WAL) and MemStore
17.5 HBase vs RDBS
17.6 HBase vs Apache Cassandra
17.7 Interfacing with HBase
17.8 HBase Thrift And REST Gateway
17.9 HBase Table Design
17.10 Column Families
17.11 A Cell's Value Versioning
17.12 Timestamps
17.13 Accessing Cells
17.14 HBase Table Design Digest
17.15 Table Horizontal Partitioning with Regions
17.16 HBase Compaction
17.17 Loading Data in HBase
17.18 HBase Shell
17.19 HBase Shell Command Groups
17.20 Creating and Populating a Table in HBase Shell
17.21 Getting a Cell's Value
17.22 Counting Rows in an HBase Table
17.23 Summary
Chapter 18 - Apache HBase Java API
18.1 HBase Java Client
18.2 HBase Scanners
18.3 Using ResultScanner Efficiently
18.4 The Scan Class
18.5 The KeyValue Class
18.6 The Result Class
18.7 Getting Versions of Cell Values Example
18.8 The Cell Interface
18.9 HBase Java Client Example
18.10 Scanning the Table Rows
18.11 Dropping a Table
18.12 The Bytes Utility Class
18.13 Summary
Chapter 1 - MapReduce Overview
Objectives
In this chapter, participants will learn about:
MapReduce Programming Model
Main MapReduce design principles
1.1 MapReduce Defined
There are different definitions of what MapReduce (single word) is:
◊ a programming model
◊ a parallel processing framework
◊ a computational paradigm
◊ a batch query processor
This technique (model, etc.) was influenced by functional programming languages that have the map and reduce functions
1.2 Google's MapReduce
MapReduce was first introduced by Google back in 2004
Google applied for and was granted US Patent 7,650,331 on MapReduce called "System and method for efficient large-scale data processing"
The patent lists Jeffrey Dean and Sanjay Ghemawat as its inventors
The value proposition of the MapReduce framework is its scalability and fault-tolerance achieved by its architecture
Notes:
ABSTRACT
A large-scale data processing system and method includes one or more application-independent map modules configured to read input data and to apply at least one application-specific map operation to the input data to produce intermediate data values, wherein the map operation is automatically parallelized across multiple processors in the parallel processing environment. A plurality of intermediate data structures are used to store the intermediate data values. One or more independent reduce modules are configured to retrieve the intermediate data values and to apply at least one application-specific reduce operation to the intermediate data values to provide output data.
Source: http://www.google.ca/patents/US7650331
1.3 MapReduce Explained
MapReduce works by breaking the data processing task into two phases:
◊ The map phase and the reduce phase are executed sequentially, with the output of the map operation piped (emitted) as input into the reduce operation
◊ The map phase is backed by a Map() procedure that breaks up the chunks of original data into a list of key/value pairs
    There is a data split operation feeding data into the Map() procedure
◊ The reduce phase is backed by a Reduce() procedure that performs some sort of aggregation operation by key (e.g. counting, finding the maximum of elements in the data set, etc.) on data received from the Map() procedure
◊ There is also an additional step between the map and reduce phases, called "Shuffle", that prepares (sorts, etc.) and directs the output of the map phase to the reduce phase
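The map, shuffle and reduce steps described above can be sketched in a few lines of plain Python. This is only an in-memory illustration, not Hadoop code: the function names (map_fn, shuffle, reduce_fn) and the sample "year,temperature" records are assumptions made for this sketch.

```python
from collections import defaultdict

# Hypothetical input: each record is a "year,temperature" line.
records = ["2012,31", "2013,28", "2012,35", "2013,30", "2014,27"]

def map_fn(record):
    """Map(): turn one input record into a (key, value) pair."""
    year, temperature = record.split(",")
    return year, int(temperature)

def shuffle(pairs):
    """Shuffle: group all values by key and sort the keys."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reduce_fn(key, values):
    """Reduce(): aggregate the values of one key (here: the maximum)."""
    return key, max(values)

mapped = [map_fn(r) for r in records]             # map phase
grouped = shuffle(mapped)                         # shuffle step
result = [reduce_fn(k, v) for k, v in grouped]    # reduce phase
print(result)
```

Swapping max() for sum() or len() in reduce_fn would turn the same skeleton into a totalling or counting job.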
1.4 MapReduce Explained
(Source: Wikipedia)
1.5 MapReduce Word Count Job
Notes:
The key in the Word Count MapReduce operation is the word itself. Each word's associated value is the number of occurrences of this word in the input data set. The map phase emits the key/value pairs as word,1 (each word gets counted as occurring only once) with potentially multiple duplications which will be combined in the shuffle step and then aggregated in the reduce phase.
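A minimal sketch of the Word Count job in plain Python may help make the note above concrete. It runs in a single process and only mimics the three steps; the function names and the sample lines are assumptions for illustration.

```python
from collections import defaultdict

def word_count_map(line):
    """Emit a (word, 1) pair for every word occurrence in the line."""
    return [(word.lower(), 1) for word in line.split()]

def word_count_reduce(word, counts):
    """Sum the 1s collected for a single word."""
    return word, sum(counts)

lines = ["the quick brown fox", "the lazy dog", "the fox"]

# Map phase: duplicates such as ("the", 1) are emitted several times.
pairs = [pair for line in lines for pair in word_count_map(line)]

# Shuffle step: bring together all values that share a key.
grouped = defaultdict(list)
for word, one in pairs:
    grouped[word].append(one)

# Reduce phase: one call per distinct word.
counts = dict(word_count_reduce(w, ones) for w, ones in grouped.items())
print(counts)
```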
1.6 MapReduce Shared-Nothing Architecture
MapReduce is designed around the shared-nothing architecture which leads to computational efficiencies in distributed computing environments
Shared-nothing means "independent of others"
◊ A mapper process is independent from other mapper processes, and so are reducer processes
This architecture allows the map and reduce operations to be executed in parallel (in their respective phases)
The MapReduce programming model is linearly scalable
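Because mappers share no state, they can run side by side. The sketch below illustrates this with Python threads standing in for worker nodes; the mapper function and the input splits are hypothetical, and a real cluster would distribute the splits across machines rather than threads.

```python
from concurrent.futures import ThreadPoolExecutor

def mapper(split):
    """Process one input split using only its own data (no shared state)."""
    return sum(len(line.split()) for line in split)

# Hypothetical input splits, one per mapper.
splits = [
    ["a b c", "d e"],
    ["f g"],
    ["h i j k", "l"],
]

# Threads stand in for worker nodes here; each mapper call is independent,
# so the calls can run in any order or at the same time.
with ThreadPoolExecutor(max_workers=3) as pool:
    partial_counts = list(pool.map(mapper, splits))

total_words = sum(partial_counts)   # the "reduce" side combines the results
print(partial_counts, total_words)
```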
1.7 Similarity with SQL Aggregation Operations
It may help to compare MapReduce with the aggregation operations used in SQL
In SQL, aggregation is achieved by using COUNT(), AVG(), MIN(), and other such functions (which act as a kind of reducer) together with the GROUP BY clause, e.g.
SELECT MONTH, SUM(SALES) FROM YEAR_END_REPORT GROUP BY MONTH
MapReduce offers more fine-grained programmatic control over data processing in multi-node computing environments using parallel programming algorithms
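The SQL statement above can be mirrored step by step in Python to show where GROUP BY and SUM() fit in the MapReduce picture. The sample rows are invented for this sketch; only the MONTH and SALES column names come from the query.

```python
from collections import defaultdict

# Hypothetical rows of YEAR_END_REPORT as (MONTH, SALES) tuples.
rows = [("JAN", 100), ("FEB", 250), ("JAN", 50), ("MAR", 75), ("FEB", 25)]

# "Map": emit the grouping column as the key and SALES as the value.
pairs = [(month, sales) for month, sales in rows]

# "Shuffle": collect the values of each key (what GROUP BY MONTH does).
groups = defaultdict(list)
for month, sales in pairs:
    groups[month].append(sales)

# "Reduce": apply the aggregate function to every group (what SUM(SALES) does).
sales_by_month = {month: sum(values) for month, values in groups.items()}
print(sales_by_month)
```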
1.8 Example of Map & Reduce Operations using JavaScript
JavaScript supports some elements of functional programming in the form of the map() and reduce() functions
Problem: Find out the sum of elements of the [1,2,3] array after all its elements have been increased by 10%
◊ Note: An alternative solution is to find the sum of the array elements before the increase and then apply the 10% increase to the total
1.9 Example of Map & Reduce Operations using JavaScript
Solution:
◊ The Map phase (apply the function of a 10% value increase to each element of the input array of [1,2,3]):
[1,2,3].map(function(x){return x + x/10});
Result: [1.1, 2.2, 3.3]
◊ The Reduce phase (use the result array of the Map operation as input and sum up all its elements):
[1.1, 2.2, 3.3].reduce(function(x,y){return x + y});
Result: 6.6 (subject to floating-point rounding)
Note: While the map() and reduce() functions here can be used independently from each other, MapReduce is always a single operation
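For comparison, here is a Python sketch of the same problem that also checks the alternative solution (sum first, then apply the 10% increase once). The variable names are illustrative; the comparison uses a tolerance because floating-point sums may differ in their last digits.

```python
from functools import reduce
import math

values = [1, 2, 3]

# Map, then reduce: increase every element by 10%, then add them up.
increased = list(map(lambda x: x + x / 10, values))
total = reduce(lambda x, y: x + y, increased)

# Alternative solution: add up first, then apply the 10% increase once.
alternative = reduce(lambda x, y: x + y, values) * 1.1

# Compare with a tolerance instead of == because of floating-point rounding.
print(math.isclose(total, alternative))
```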
1.10 Problems Suitable for Solving with MapReduce
The following are the criteria that you can use to see if the problem at hand can be efficiently solved by MapReduce:
◊ The problem can be split into smaller problems with no shared state that can be solved in parallel
    Those smaller problems are independent from one another and do not require interactions
◊ The problem can be decomposed into the map and reduce operations
    The map operation: execute the same operation on all data
    The reduce operation: execute the same operation on each group of data produced by the map operation
◊ Basically, you should see the generic "divide and conquer" pattern
1.11 Typical MapReduce Jobs
Counting tokens (words, URLs, etc.)
Finding aggregate values in the target data set, e.g. the average value
Processing geographical data
◊ Google Maps uses MapReduce to find the nearest feature (coffee shop, museum, etc.) to a given address
etc.
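Finding an average is a useful illustration of one of the job types listed above, because an average of averages is generally not the overall average. One common approach, sketched here in plain Python under assumed names and sample data, is to have the map step emit (sum, count) pairs and to divide only once in the reduce step.

```python
from collections import defaultdict

# Hypothetical input: (sensor, reading) records.
records = [("s1", 10.0), ("s2", 4.0), ("s1", 20.0), ("s2", 6.0), ("s1", 30.0)]

def map_fn(record):
    """Emit (key, (sum, count)) so that partial results can be merged."""
    sensor, reading = record
    return sensor, (reading, 1)

def reduce_fn(values):
    """Merge (sum, count) pairs, then divide once at the very end."""
    total = sum(s for s, _ in values)
    count = sum(c for _, c in values)
    return total / count

groups = defaultdict(list)
for key, value in map(map_fn, records):
    groups[key].append(value)

averages = {key: reduce_fn(values) for key, values in groups.items()}
print(averages)
```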
1.12 Fault-tolerance of MapReduce
MapReduce operations have a degree of fault-tolerance and built-in recoverability from certain types of run-time failures, which is leveraged in production-ready systems (e.g. Hadoop)
MapReduce enjoys these qualities of service due to its architecture based on process parallelism
Failed units of work are rescheduled and resubmitted for execution should some mappers or reducers fail (provided the source data is still available)
Note: High Performance (e.g. Grid) Computing systems that use Message Passing Interface (MPI) communication for check-points are more difficult to program for failure recovery
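The rescheduling of failed units of work can be pictured with a small retry loop. This is a simplified, single-process sketch: run_with_retries and flaky_mapper are hypothetical names, and the simulated crash stands in for a real worker failure.

```python
def run_with_retries(task, unit, max_attempts=3):
    """Run one unit of work, resubmitting it when the worker fails."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task(unit, attempt)
        except RuntimeError:
            # A real framework would reschedule the unit on another node.
            continue
    raise RuntimeError(f"unit {unit!r} failed {max_attempts} times")

def flaky_mapper(unit, attempt):
    """Hypothetical mapper that fails on its first attempt for unit 'b'."""
    if unit == "b" and attempt == 1:
        raise RuntimeError("simulated worker crash")
    return unit.upper()

units = ["a", "b", "c"]
results = [run_with_retries(flaky_mapper, u) for u in units]
print(results)
```

Retrying is safe here because each unit depends only on its own input, which is the same independence property that makes parallel execution possible.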
1.13 Distributed Computing Economics
MapReduce splits the workload into units of work (independent computation tasks) and intelligently distributes them across available worker nodes for parallel processing
To improve system performance, MapReduce tries to start a computation task on the node where the data to be worked on is stored
◊ This helps avoid unnecessary network operations
"Data locality" (collocation of data with the compute node) underlies the principles of Distributed Computing Economics
Notes:
Distributed Computing Economics was a topic covered by Jim Gray of Microsoft Research in his paper published back in 2003 (http://research.microsoft.com/pubs/70001/tr-2003-24.pdf).
His main conclusion was: "Put the computation near the data" or "One puts computing as close to the data as possible in order to avoid expensive network traffic".
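A toy scheduler shows what "put the computation near the data" means in practice. The cluster layout, node names and the pick_node function are assumptions for this sketch, and for simplicity a node may accept more than one task.

```python
# Hypothetical cluster state: which nodes store a replica of each block,
# and which nodes currently have a free task slot.
block_locations = {
    "block-1": ["node-a", "node-b"],
    "block-2": ["node-c"],
    "block-3": ["node-b", "node-d"],
}
free_nodes = ["node-b", "node-d", "node-e"]

def pick_node(block, free):
    """Prefer a free node that already holds the block (data-local task)."""
    for node in block_locations[block]:
        if node in free:
            return node, "data-local"
    # No local replica is available: the data must travel over the network.
    return free[0], "remote"

assignments = {block: pick_node(block, free_nodes) for block in block_locations}
print(assignments)
```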
1.14 MapReduce Systems
Many systems leverage the MapReduce programming model:
◊ Hadoop has a built-in MapReduce engine for executing both regular and streaming MapReduce jobs
◊ Amazon Web Services (AWS) offers its clients the Elastic MapReduce (EMR) service (running on a Hadoop cluster)
Notes:
Amazon Elastic MapReduce works as a job flow where the client needs to follow pre-defined steps:
1. Provide the map and reduce applications/scripts to the Hadoop framework via an Amazon S3 bucket upload or use pre-installed ones
2. Specify the S3 bucket containing the data input file(s) and the output S3 bucket to receive the result of the MapReduce workflow
3. Allocate the needed compute (EC2) instances
4. Execute the job and pick up the output from the output S3 bucket
1.15 Summary
The MapReduce programming model was influenced by functional programming languages
MapReduce (one word) has the map and reduce components working together in a distributed computational environment
Production MapReduce systems offer a number of qualities of service, such as fault-tolerance
Chapter 2 - Hadoop Overview
Objectives
In this chapter, participants will learn about:
Apache Hadoop and its core components
Hadoop's main design considerations
2.1 Apache Hadoop
Apache Hadoop is a distributed fault-tolerant computing platform written in Java
Designed as a massively parallel processing (MPP) system based on a distributed master-slave architecture
Hadoop allows for the distributed processing of large data sets across clusters of computers using simple programming models, e.g. MapReduce
Hadoop is an open source project
Notes:
Hadoop is the name the son of Doug Cutting (the project's creator) gave to his yellow elephant toy. Doug Cutting is presently Chairman of the Apache Software Foundation and Cloudera's Chief Architect.
Massively parallel processing (MPP) is a system configuration that enlists hundreds or thousands of CPUs simultaneously to solve a particular computational task. Each computer in an MPP system controls its own resources and coordinates task execution with the other nodes in the system. The coordination aspect sets Hadoop apart from MPP systems: Hadoop is designed on the shared-nothing architecture and delegates the coordination activities to the dispatcher layer (made up of the JobTracker and TaskTrackers).
According to the Wikipedia article (http://en.wikipedia.org/wiki/Shared-nothing), "A shared nothing architecture is a distributed computing architecture in which each node is independent and self-sufficient, and there is no single point of contention across the system. More specifically, none of the nodes share memory or disk storage."
Yahoo provided funds to make Hadoop a Web-scale technology; the initial use case for Hadoop at Yahoo was to create and analyze a WebMap graph that consists of about one trillion (10^12) Web links and 100 billion distinct URLs.
2.2 Apache Hadoop Logo
2.3 Typical Hadoop Applications
Log and/or clickstream analysis
Web crawling results processing
Marketing analytics
Machine learning and data mining
Data archiving (e.g. for regulatory compliance, etc.)
See http://wiki.apache.org/hadoop/PoweredBy for the list of educational and production uses of Hadoop
2.4 Hadoop Clusters
The first versions of Hadoop were only able to handle 20 machines; newer versions are capable of running Hadoop clusters comprising thousands of nodes
Hadoop clusters run on moderately high-end commodity hardware ($2-5K per machine)
Hadoop clusters can be used as a data hub, data warehouse or a business analytics platform
2.5 Hadoop Design Principles
Hadoop's design was influenced by ideas published in the Google File System (GFS) and MapReduce white papers
Hadoop's core component, the Hadoop Distributed File System (HDFS), is the counterpart of GFS
Hadoop uses a data processing system that is functionally equivalent to Google's MapReduce and is also called MapReduce (a term coined by Google's engineers)
One of the main principles of Hadoop's architecture is "design for failure"
◊ To deliver a high-availability quality of service, Hadoop detects and handles failures at the application layer (rather than relying on hardware)
2.6 Hadoop's Core Components
The Hadoop project is made up of the following main components:
◊ Common
    Contains Hadoop infrastructure elements (interfaces with HDFS, system libraries, RPC connectors, Hadoop admin scripts, etc.)
◊ Hadoop Distributed File System (HDFS)
    Hadoop's persistence component designed to run on clusters of commodity hardware built around the "load once and read many times" concept
◊ MapReduce
    A distributed data processing framework used as a data analysis system
2.7 Hadoop Simple Definition
In a nutshell, Hadoop is a distributed computing framework that consists of:
◊ Reliable data storage (provided via HDFS)
◊ Analysis system (provided by MapReduce)
2.8 High-Level Hadoop Architecture
Notes:
The HDFS NameNode acts as a metadata server for all information stored in HDFS on DataNodes. The MapReduce master is responsible for allocating the map and reduce jobs in a distributed environment in the most efficient way.
2.9 Hadoop-based Systems for Data Analysis
Hadoop (via HDFS) can host the following systems for data analysis:
◊ The MapReduce engine (the major data analytics component of the Hadoop project)
◊ Apache Pig
◊ HBase database
◊ Apache Hive data warehouse system
◊ Apache Mahout machine learning system
◊ etc.
2.10 Hadoop Caveats
Hadoop is a batch-oriented processing system
Many Hadoop-centric business analytics systems have high processing latency which comes from their dependency on the MapReduce sub-system
◊ Querying even small data sets (under a gigabyte in size) may take up to several minutes
◊ This is in sharp contrast with the querying speed of relational databases such as MySQL, Oracle, DB2, etc., where functionally similar work can be done several orders of magnitude faster by applying indexes and other techniques
Systems that bypass MapReduce when processing queries have near real-time query response times (e.g. HBase and Cloudera Impala)
2.11 Summary
Apache Hadoop is a distributed fault-tolerant computing platform written in Java and used for processing large data sets across clusters of computers
Hadoop clusters may comprise thousands of nodes
One of the main principles of Hadoop's architecture is "design for failure"
Chapter 3 - Hadoop Distributed File System Overview
Objectives
In this chapter, participants will learn about:
Hadoop Distributed File System (HDFS)
Ways to access HDFS
3.1 Hadoop Distributed File System
The Hadoop Distributed File System (HDFS) is a distributed, scalable, fault-tolerant and portable file system written in Java
HDFS's architecture is based on the master/slave design pattern
An HDFS cluster consists of:
◊ A single NameNode (metadata server) holding directory information about files on DataNodes
◊ A number of DataNodes, usually one per machine in the cluster
HDFS design is "rack-aware" to minimize network latency
Notes:
HDFS includes a server called the secondary NameNode, which is not a fail-over NameNode. Rather, the secondary NameNode connects to the primary NameNode at specified intervals and pulls the primary NameNode's directory information to build current snapshots of the file system. These snapshots can be used to restart a failed primary NameNode from the most recent checkpoint without having to replay the entire journal of file-system actions.
The secondary NameNode has been deprecated in favor of the newly introduced Checkpoint node which takes over the former's tasks and adds more functionality.
Work is under way to provide automatic fail-over for the NameNode to prevent a single point of failure of a cluster.
Rack-awareness means taking into account a machine's physical location (rack) while scheduling tasks and allocating storage. Basically, HDFS is aware of the fact that network bandwidth between machines sitting in the same server rack is greater than that between machines in different racks.
The creators of HDFS made a design decision in favor of using machines with internal hard drives as it helps ensure data locality (data is stored on the hard drive of the machine whose CPU is used for processing it, e.g. with MapReduce). For performance reasons, therefore, Storage Area Network (SAN) or similar storage technologies are not recommended.
3.2 Hadoop Distributed File System
HDFS is designed for efficient implementation of the following data processing pattern:
◊ Write once (normally, just load the data set on the file system)
◊ Read many times (for data analysis, etc.)
HDFS functionality (and that of Hadoop) is geared towards batch-oriented rather than real-time scenarios
HDFS does not have a built-in cache
Processing of data-intensive jobs can be done in parallel
3.3 Data Blocks
A data file in HDFS is split into blocks (a typical block size used by HDFS is 64 MB)
◊ For large files, bigger block sizes will help reduce the amount of metadata stored in the NameNode (metadata server)
Data reliability is achieved by replicating data blocks across multiple machines
◊ By default, data blocks get replicated to three nodes: two on the same server rack, and one on a different rack for redundancy
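The arithmetic behind block splitting and replication can be worked through in a short Python sketch. The 64 MB block size and the replication factor of three come from the text above; the 1000 MB file size and the 128 MB comparison are hypothetical values chosen for illustration.

```python
import math

BLOCK_SIZE_MB = 64      # typical block size quoted in the text
REPLICATION = 3         # default replication factor quoted in the text

def block_count(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Number of blocks needed for a file (the last block may be partial)."""
    return math.ceil(file_size_mb / block_size_mb)

file_size_mb = 1000                       # hypothetical 1000 MB file
blocks = block_count(file_size_mb)        # 15 full blocks plus one 40 MB block
replicas = blocks * REPLICATION           # block replicas stored cluster-wide
raw_storage_mb = file_size_mb * REPLICATION

print(blocks, replicas, raw_storage_mb)
print(block_count(file_size_mb, block_size_mb=128))
```

Doubling the block size halves the number of block entries the NameNode has to track for this file, which is the metadata saving mentioned above.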
3.4 Data Block Replication Example
Example of a three-way (default) replication of a single data block for redundancy and achieving high data availability (the block size is usually a multiple of 64 MB)
3.5 HDFS NameNode Directory Diagram
Notes:
The HDFS NameNode maintains the directory metadata about all data nodes and data blocks stored for each file registered on HDFS.
3.6 Accessing HDFS
HDFS is not a Portable Operating System Interface (POSIX) compliant file system
It is modeled after traditional hierarchical file systems containing directories and files
File-system commands (copy, move, delete, etc.) against HDFS can be performed in a number of ways:
◊ Through the HDFS Command-Line Interface (CLI) which supports Unix-like commands: cat, chown, ls, mkdir, mv, etc.
◊ Using the Java API for HDFS
◊ Via the C-language wrapper around the Java API
◊ Using a regular web browser for file-system and file content viewing
    This is made possible through the web server (Jetty) embedded in the NameNode and DataNodes
3.7 Examples of HDFS Commands
The common way to invoke the HDFS CLI:
◊ hadoop fs {HDFS commands}
Copying a file from the local file system over to HDFS (with the same name):
◊ hadoop fs -put <filename_on_local_system>
Copying the file from HDFS to the local file system (with the same name):
◊ hadoop fs -get <filename_on_HDFS>
Recursively listing files in a directory:
◊ hadoop fs -ls -R <directory_on_HDFS>
Creating a directory under the current user's home directory (e.g. /user/userid/) on HDFS:
◊ hadoop fs -mkdir REPORT
Notes:
Instead of the put command, you can use the functionally equivalent but more descriptive copyFromLocal command, and the get command also has its more descriptive equivalent: copyToLocal.
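When these commands need to be issued from a script, one option is a thin wrapper that assembles the hadoop fs command line. The sketch below is a hypothetical helper, not part of Hadoop: hadoop_fs and the file names are assumptions, and by default it only builds the command so it can be inspected without a Hadoop installation.

```python
import subprocess

def hadoop_fs(*args, run=False):
    """Build (and optionally run) a `hadoop fs` command line."""
    command = ["hadoop", "fs", *args]
    if run:
        # Requires a working Hadoop client on the PATH.
        subprocess.run(command, check=True)
    return command

# Hypothetical file and directory names.
put_cmd = hadoop_fs("-put", "sales.csv")
get_cmd = hadoop_fs("-copyToLocal", "sales.csv")
ls_cmd = hadoop_fs("-ls", "-R", "REPORT")
mkdir_cmd = hadoop_fs("-mkdir", "REPORT")

for cmd in (put_cmd, get_cmd, ls_cmd, mkdir_cmd):
    print(" ".join(cmd))
```

Passing run=True would execute the command on a machine where the Hadoop client is installed and configured.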
3.8 Client Interactions with HDFS for the Read Operation
Whenever a Hadoop client needs to read a file on HDFS, it first contacts the NameNode
The NameNode locates the block IDs associated with the file as well as the IP addresses of the DataNodes storing those blocks
The NameNode returns the related information to the client
The client contacts the related DataNodes and supplies the block IDs for the DataNodes to locate the blocks on their local HDFS storage
The DataNodes serve the blocks of data back to the client
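The read sequence above can be simulated with two toy classes. This is an in-memory model for illustration only: the class and method names, the file path, the block IDs and the IP addresses are all assumptions and do not reflect the real HDFS client API.

```python
class NameNode:
    """Keeps metadata only: which blocks form a file and where they live."""
    def __init__(self):
        self.files = {"/logs/app.log": ["blk_1", "blk_2"]}
        self.locations = {"blk_1": ["10.0.0.1"], "blk_2": ["10.0.0.2"]}

    def get_block_locations(self, path):
        return [(block_id, self.locations[block_id]) for block_id in self.files[path]]

class DataNode:
    """Stores the actual block contents."""
    def __init__(self, blocks):
        self.blocks = blocks

    def read_block(self, block_id):
        return self.blocks[block_id]

name_node = NameNode()
data_nodes = {
    "10.0.0.1": DataNode({"blk_1": b"first half, "}),
    "10.0.0.2": DataNode({"blk_2": b"second half"}),
}

def read_file(path):
    content = b""
    # Ask the NameNode for the block IDs and the DataNode addresses.
    for block_id, addresses in name_node.get_block_locations(path):
        # Fetch each block directly from a DataNode that stores it.
        content += data_nodes[addresses[0]].read_block(block_id)
    return content

print(read_file("/logs/app.log"))
```

Note that the file content never passes through the NameNode in this model; it only hands out metadata.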
3.9 Read Operation Sequence Diagram
3.10 Client Interactions with HDFS for the Write Operation
Whenever a Hadoop client needs to write a file to HDFS, it first contacts the NameNode
The NameNode generates and registers metadata about the file and allocates suitable DataNodes to keep the file's replicas
Information about the allocated DataNodes is sent back to the client
The client uses HDFS I/O facilities to stream content to the first DataNode in the list (the "primary" DataNode)
The "primary" DataNode makes a local copy of the related data and engages the other DataNodes in peer-to-peer data sharing communication for data replication
Each DataNode sends an acknowledgment of receiving its data back to the "primary" DataNode
The client gets an acknowledgment from the "primary" DataNode as a confirmation of the HDFS write operation
The client notifies the NameNode on completion of the operation
Notes:
Once all DataNodes receive their replicas of the file, they cache the content in memory (on the Java heap) and send acknowledgments to the "primary" DataNode. The DataNodes persist the data to the local file system asynchronously at a later time.
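The write sequence can be modeled in the same toy style. The sketch below is a simplification under stated assumptions: all class and method names are hypothetical, the file holds a single block, and replicas are forwarded along a chain of DataNodes with the acknowledgments collected on the way back to the "primary" DataNode.

```python
class DataNode:
    def __init__(self, name):
        self.name = name
        self.stored = {}

    def write(self, block_id, data, downstream):
        """Keep a local copy, forward to the next DataNode, return the acks."""
        self.stored[block_id] = data
        acks = [self.name]
        if downstream:
            acks += downstream[0].write(block_id, data, downstream[1:])
        return acks

class NameNode:
    def __init__(self, data_nodes, replication=3):
        self.data_nodes = data_nodes
        self.replication = replication
        self.files = {}

    def allocate(self, path, block_id):
        """Register the file's metadata and choose DataNodes for the replicas."""
        targets = self.data_nodes[: self.replication]
        self.files[path] = {"block": block_id, "nodes": [n.name for n in targets], "complete": False}
        return targets

    def complete(self, path):
        self.files[path]["complete"] = True

nodes = [DataNode(f"dn{i}") for i in range(1, 5)]
name_node = NameNode(nodes)

targets = name_node.allocate("/data/report.csv", "blk_1")     # client asks the NameNode
primary, others = targets[0], targets[1:]
acks = primary.write("blk_1", b"a,b,c\n", others)             # client streams to the primary
if len(acks) == len(targets):
    name_node.complete("/data/report.csv")                    # client notifies the NameNode

print(acks, name_node.files["/data/report.csv"]["complete"])
```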
3.11 Communication inside HDFS
HDFS is designed for batch processing rather than for interactive use
This affected the design of communication protocols inside HDFS
Communication inside HDFS is modeled after the Remote Procedure Call (RPC) application protocol layered on top of the TCP/IP protocol
3.12 Summary
File-system commands against HDFS can be performed in a number of ways:
◊ Through the fs command-line interface (which supports Unix-like commands: cat, chown, ls, mkdir, mv, etc.)
◊ Using the Java API for HDFS
◊ Via the C-language wrapper around the Java API