The 4th China Cloud Computing Conference – May 25th, 2012.
Apache Hadoop: Past, Present, and Future
Dr. Amr Awadallah | Founder, Chief Technical Officer
[email protected], twitter: @awadallah
Hadoop Past
• Fastest sort of 1 TB: 62 seconds on 1,460 nodes
• Sorted 1 PB in 16.25 hours on 3,658 nodes
So What Is Apache Hadoop?
• A scalable fault-tolerant distributed system for data storage
and processing (open source under the Apache license).
• Core Hadoop has two main systems:
– Hadoop Distributed File System: self-healing high-bandwidth clustered storage.
– MapReduce: distributed fault-tolerant resource management and scheduling coupled with a scalable data programming abstraction.
• Key business values:
– Flexibility – Store any data, Run any analysis.
– Scalability – Start at 1 TB on 3 nodes, grow to petabytes on 1000s of nodes.
– Economics – Cost per TB at a fraction of traditional options.
Hadoop Present (CDH3)
• Coordination: Apache ZooKeeper
• Data Integration: Apache Flume, Apache Sqoop
• Fast Read/Write Access: Apache HBase
• Analytical Languages: Apache Pig, Apache Hive
• BI Connectivity: ODBC/JDBC
• Job Workflow: Apache Oozie
• Metadata: Apache Hive
• File System Mount: FUSE-NFS
• Web Console: Hue
• Data Mining: Apache Mahout
• Apache Whirr
• Build/Test: Apache Bigtop
• Cloudera Manager Free Edition (Installation Wizard)
HDFS: Hadoop Distributed File System
Block Replicas are distributed across servers and racks.
Block Replication for:
• Durability
• Availability
• Throughput
Optimized for:
• Throughput
• Put/Get/Delete
• Appends
A given file is broken down into blocks (default = 64 MB), then the blocks are
replicated across the cluster (default = 3 replicas).
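The block and replica arithmetic above can be sketched in a few lines of Python. This is an illustration only, not HDFS code: the function name `block_layout` is hypothetical, and the two constants are the defaults quoted on the slide.

```python
BLOCK_SIZE = 64 * 1024 * 1024   # default block size quoted on the slide (64 MB)
REPLICATION = 3                 # default replication factor quoted on the slide


def block_layout(file_size_bytes, block_size=BLOCK_SIZE, replication=REPLICATION):
    """Return (number of blocks, number of stored replicas, raw bytes stored).

    Hypothetical helper for illustration; it does not call any Hadoop API.
    """
    # Integer ceiling division: a partially filled last block still counts as a block.
    num_blocks = -(-file_size_bytes // block_size)
    # The last block only holds the remaining data, so raw usage is
    # simply the file size multiplied by the replication factor.
    return num_blocks, num_blocks * replication, file_size_bytes * replication


if __name__ == "__main__":
    one_gb = 1024 ** 3
    blocks, replicas, raw = block_layout(one_gb)
    print(blocks, replicas, raw // 1024 ** 2)   # prints: 16 48 3072
```

With the defaults, a 1 GB file becomes 16 blocks stored as 48 replicas, using 3 GB of raw disk across the cluster.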
MapReduce: Resource Manager / Scheduler
A given job is broken down into tasks, then tasks are scheduled to be as
close to data as possible.
Three levels of data locality:
• Same server as data (local disk)
• Same rack as data (rack/leaf switch)
• Wherever there is a free slot (cross rack)
System detects laggard tasks and speculatively executes parallel tasks on the same slice of data.
Optimized for:
• Batch Processing
• Failure Recovery
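The three locality levels can be sketched as a simple preference order. This is a toy model under stated assumptions, not the actual JobTracker scheduler: the names `locality_level`, `pick_task`, `rack_of`, and `pending` are hypothetical, and speculative execution is not modelled.

```python
def locality_level(replica_hosts, node, rack_of):
    """Return 0 for same server, 1 for same rack, 2 for cross rack."""
    if node in replica_hosts:
        return 0
    if rack_of[node] in {rack_of[host] for host in replica_hosts}:
        return 1
    return 2


def pick_task(free_node, pending, rack_of):
    """Pick the pending task whose input data is closest to the free slot.

    `pending` maps a task id to the set of hosts holding replicas of its input.
    """
    if not pending:
        return None
    return min(pending, key=lambda task: locality_level(pending[task], free_node, rack_of))


if __name__ == "__main__":
    rack_of = {"n1": "r1", "n2": "r1", "n3": "r2", "n4": "r2"}
    pending = {"t1": {"n3"}, "t2": {"n2"}, "t3": {"n1"}}
    print(pick_task("n1", pending, rack_of))   # prints: t3 (data is on n1 itself)
    print(pick_task("n4", pending, rack_of))   # prints: t1 (data is on the same rack)
```

A real scheduler also weighs fairness, queue capacity, and waiting time; the sketch only shows the local, then rack, then anywhere ordering.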
MapReduce: Computational Framework
[Diagram: input splits (Split 1, Split i, Split N) feed map tasks (Map 1, Map i, Map M); a shuffle routes map output to reduce tasks (Reduce 1, Reduce i, Reduce R), each of which writes its own output file (Output File 1, Output File i, Output File R).]
• Map input: (docid, text)
• Map output: (words, counts)
• After the shuffle: (sorted words, counts)
• Reduce output: (sorted words, sum of counts)
Map(in_key, in_value) => list of (out_key, intermediate_value)
Reduce(out_key, list of intermediate_values) => out_value(s)
Example (“To Be Or Not To Be?”): the map outputs Be, 5; Be, 12; Be, 7; Be, 6 are combined by a reducer into Be, 30.
cat *.txt | mapper.pl | sort | reducer.pl > out.txt
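The same map, sort, reduce pipeline can be simulated in a single Python process for word counting. This is a minimal sketch, not the Hadoop API: `map_fn`, `shuffle`, `reduce_fn`, and `word_count` are hypothetical names, and everything runs in memory rather than across a cluster.

```python
import re
from collections import defaultdict


def map_fn(doc_id, text):
    """Map(in_key, in_value) => list of (out_key, intermediate_value)."""
    return [(word, 1) for word in re.findall(r"[a-z']+", text.lower())]


def shuffle(pairs):
    """Group intermediate values by key and order the keys, like the `sort` stage."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_fn(word, counts):
    """Reduce(out_key, list of intermediate_values) => out_value."""
    return word, sum(counts)


def word_count(docs):
    """Run map over every (docid, text) pair, then shuffle, then reduce."""
    intermediate = [pair for doc_id, text in docs.items() for pair in map_fn(doc_id, text)]
    return [reduce_fn(word, counts) for word, counts in shuffle(intermediate)]


if __name__ == "__main__":
    print(word_count({"doc1": "To Be Or Not To Be?"}))
    # prints: [('be', 2), ('not', 1), ('or', 1), ('to', 2)]
```

In a real cluster each map call runs on a different split and the shuffle moves data between machines, but the function signatures keep the same shape.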
Hadoop High-Level Architecture
• Name Node: maintains the mapping of file names to blocks to Data Node slaves.
• Job Tracker: tracks resources and schedules jobs across Task Tracker slaves.
• Data Node: stores and serves blocks of data.
• Task Tracker: runs tasks (work units) within a job.
• Hadoop Client: contacts the Name Node for data or the Job Tracker to submit jobs.
• The Data Node and Task Tracker share a physical node.
Changes for Better Availability/Scalability
• Name Node: Federation partitions the name space; High Availability via an Active Standby.
• Job Tracker: each job has its own Application Manager; the Resource Manager is decoupled from MapReduce.
MapReduce Next Gen
Main idea is to split up the JobTracker functions:
• Cluster resource management (for tracking and allocating nodes)
• Application life-cycle management (for MapReduce scheduling and execution)
Enables:
• High Availability
• Better Scalability
• Efficient Slot Allocation
• Rolling Upgrades
• Non-MapReduce Apps
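The split of responsibilities can be sketched with two toy classes. This is an illustrative model only, not the MapReduce Next Gen code or its API: the class and method names are hypothetical, slots are plain counters, and tasks are ordinary Python callables.

```python
class ResourceManager:
    """Tracks free slots per node and hands them out; knows nothing about MapReduce."""

    def __init__(self, slots_per_node):
        self.free = dict(slots_per_node)

    def allocate(self, count):
        """Grant up to `count` slots and return the nodes they live on."""
        granted = []
        for node in self.free:
            while self.free[node] > 0 and len(granted) < count:
                self.free[node] -= 1
                granted.append(node)
        return granted

    def release(self, nodes):
        for node in nodes:
            self.free[node] += 1


class ApplicationManager:
    """Owns the life cycle of one job: requests slots, runs tasks, releases slots."""

    def __init__(self, resource_manager, tasks):
        self.rm = resource_manager
        self.tasks = list(tasks)
        self.results = []

    def run(self):
        while self.tasks:
            nodes = self.rm.allocate(len(self.tasks))
            if not nodes:
                raise RuntimeError("no free slots in the cluster")
            batch, self.tasks = self.tasks[:len(nodes)], self.tasks[len(nodes):]
            self.results.extend(task() for task in batch)
            self.rm.release(nodes)
        return self.results


if __name__ == "__main__":
    rm = ResourceManager({"n1": 2, "n2": 1})
    job = ApplicationManager(rm, [lambda i=i: i * i for i in range(5)])
    print(job.run())   # prints: [0, 1, 4, 9, 16]
```

Because the resource manager in this sketch only deals in slots, the tasks need not be map or reduce tasks, which is the point behind the "Non-MapReduce Apps" bullet.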
HBase versus HDFS
HDFS:
Use For:
• Fact tables that are mostly append only and require sequential full table scans.
Optimized For:
• Large Files
• Sequential Access (High Throughput)
• Append Only
HBase:
Use For:
• Dimension tables which are updated frequently and require random low-latency lookups.
Optimized For:
• Small Records
• Random Access (Low Latency)
• Atomic Record Updates
Not Suitable For:
• Low Latency Interactive OLAP.
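The difference in access patterns can be sketched with two toy in-memory classes. Neither uses the HDFS or HBase client APIs; `AppendOnlyFile` and `KeyedTable` are hypothetical names chosen only to contrast sequential scans with keyed lookups.

```python
class AppendOnlyFile:
    """Toy stand-in for an HDFS-style file: append records, read by full scan."""

    def __init__(self):
        self._records = []

    def append(self, record):
        self._records.append(record)

    def scan(self):
        # Sequential pass over every record; there is no lookup by key.
        yield from self._records


class KeyedTable:
    """Toy stand-in for an HBase-style table: random reads and updates by row key."""

    def __init__(self):
        self._rows = {}

    def put(self, row_key, value):
        # Overwrites the whole row in one step.
        self._rows[row_key] = value

    def get(self, row_key):
        return self._rows.get(row_key)


if __name__ == "__main__":
    facts = AppendOnlyFile()              # fact table: append, then scan everything
    facts.append(("store-1", 20))
    facts.append(("store-2", 5))
    print(sum(amount for _, amount in facts.scan()))   # prints: 25

    stores = KeyedTable()                 # dimension table: update and look up by key
    stores.put("store-1", {"city": "Beijing"})
    stores.put("store-1", {"city": "Shanghai"})
    print(stores.get("store-1"))          # prints: {'city': 'Shanghai'}
```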
©2012 Cloudera, Inc. All Rights Reserved.
What Next: The Long Term Vision
A fully functional Big Data Operating System
that allows you to:
1. Collect Data (in real-time).
2. Store Data (very reliably, securely).
3. Process Data (workload management).
4. Analyze Data (metadata management).
5. Serve Data (interactively, low-latency).
Hadoop in the Enterprise Data Stack
[Diagram: Hadoop in the enterprise data stack]
• Data sources: Logs, Files, Web Data, Relational Databases
• Data movement: Flume, Sqoop
• Connected systems: Enterprise Data Warehouse, Online Serving Systems
• Tools and interfaces: IDEs, BI/Analytics, Enterprise Reporting, Modeling Tools, Meta Data/ETL Tools, Cloudera Manager; ODBC, JDBC, NFS
• Users: System Operators, Engineers, Analysts, Business Users, Data Scientists, Data Architects, Customers (Web/Mobile Applications)
Three Deployment Options
Solution Components:
• Cloudera Enterprise: Cloudera Manager + Cloudera Support (proprietary software + support)
• CDH (Cloudera Distribution including Apache Hadoop): 100% open source solution, Apache Hadoop + selected components
• Cloudera Manager (Free Edition): up to 50 nodes
Option 1: The Complete Solution
• Cloudera Enterprise (Cloudera Manager + Support) + CDH
• Fastest rollout
• Easiest to deploy, manage and configure
• Fully supported
• Annual Subscription
Option 2: CDH Free Download
• Cloudera Manager Free Edition + CDH
• No software cost
• No support included
• Manage up to 50 nodes
• Limited functionality in Cloudera Manager
Option 3: Raw Packages
• CDH Only
• Self install and configure
• No software cost
• No support included
• More management & installation complexity
• Longer rollout for most customers