Apache Hadoop: Past, Present, and Future

(1)

The 4^th China Cloud Computing Conference – May 25^th, 2012.

Apache Hadoop: Past, Present, and Future

Dr. Amr Awadallah | Founder, Chief Technical Officer

[email protected], twitter: @awadallah

(2)

Hadoop Past

2

• Fastest sort of a TB, 62secs over 1,460 nodes

• Sorted a PB in 16.25hours over 3,658 nodes

(3)

So What is Apache Hadoop ?

• A scalable fault-tolerant distributed system for data storage

and processing (open source under the Apache license).

• Core Hadoop has two main systems:

– Hadoop Distributed File System: self-healing high-bandwidth clustered storage.

– MapReduce: distributed fault-tolerant resource management and scheduling coupled with a scalable data programming abstraction.

• Key business values:

– Flexibility – Store any data, Run any analysis.

– Scalability – Start at 1TB/3-nodes grow to petabytes/1000s of nodes.

– Economics – Cost per TB at a fraction of traditional options.

3

(4)

Hadoop Present (CDH3)

Coordination Data

Integration

Fast Read/Write

Access Analytical Languages

BI Connectivity Job Workflow Metadata

APACHE ZOOKEEPER APACHE FLUME,

APACHE SQOOP APACHE HBASE

APACHE PIG, APACHE HIVE

ODBC/JDBC APACHE OOZIE APACHE HIVE File System Mount Web Console Data Mining

FUSE-NFS HUE APACHE MAHOUT

APACHE WHIRR

4

Build/Test:APACHE BIGTOP

Cloudera Manager Free Edition (Installation Wizard)

(5)

HDFS: Hadoop Distributed File System

Block Replicas are distributed across servers and racks.

Block Replication for:

• Durability

• Availability

• Throughput Optimized for:

• Throughput

• Put/Get/Delete

• Appends

A given file is broken down into blocks (default=64MB), then blocks are

replicated across cluster (default=3).

5

(6)

MapReduce: Resource Manager / Scheduler

A given job is broken down into tasks, then tasks are scheduled to be as

close to data as possible.

Three levels of data locality:

• Same server as data (local disk)

• Same rack as data (rack/leaf switch)

• Wherever there is a free slot (cross rack)

System detects laggard tasks and speculatively executes parallel tasks on the same slice of data.

Optimized for:

• Batch Processing

• Failure Recovery

6

(7)

MapReduce: Computational Framework

7

Split 1

Split i

Split N

Map 1

(docid, text)

(docid,

text) Map i

(docid,

text) Map M

Reduce 1

Output File 1

(sorted words, sum of

counts)

Reduce i

Output File i

counts)

Reduce R

Output File R

counts)

(words, counts)

(sorted words, counts)

Map(in_key, in_value) => list of (out_key, intermediate_value) Reduce(out_key, list of intermediate_values) => out_value(s)

Shuffle

(words, counts) (sorted words, counts)

“To Be Or Not To Be?”

Be, 5

Be, 12

Be, 7 Be, 6

Be, 30

cat *.txt | mapper.pl | sort | reducer.pl > out.txt

(8)

Hadoop High-Level Architecture

8

Name Node

Maintains mapping of file names to blocks to data node slaves.

Job Tracker

Tracks resources and schedules jobs across task tracker slaves.

Data Node

Stores and serves blocks of data

Hadoop Client

Contacts Name Node for data or Job Tracker to submit jobs

Task Tracker

Runs tasks (work units) within a job

Share Physical Node

(9)

Changes for Better Availability/Scalability

9

Data Node

Stores and serves blocks of data

Hadoop Client

Contacts Name Node for data or Job Tracker to submit jobs

Task Tracker

Runs tasks (work units) within a job

Share Physical Node

Name Node Job Tracker

Federation partitions out the name space, High Availability via an Active Standby.

Each job has its own Application Manager, Resource Manager is decoupled from MR.

(10)

MapReduce Next Gen

Main idea is to split up the JobTracker functions:

• Cluster resource management (for tracking and allocating nodes)

• Application life-cycle management (for MapReduce scheduling and execution)

Enables:

• High Availability

• Better Scalability

• Efficient Slot Allocation

• Rolling Upgrades

• Non-MapReduce Apps

10

(11)

HBase versus HDFS

HDFS: HBase:

Use For:

• Dimension tables which are updated frequently and require random low-latency lookups.

Use For:

• Fact tables that are mostly append only and require sequential full table scans.

Optimized For:

• Large Files

• Sequential Access (Hi Throughput)

• Append Only

Optimized For:

• Small Records

• Random Access (Lo Latency)

• Atomic Record Updates

Not Suitable For:

• Low Latency Interactive OLAP.

11

(12)

What Next: The Long Term Vision

A fully functional Big Data Operating System

that allows you to:

1. Collect Data (in real-time).

2. Store Data (very reliably, securely).

3. Process Data (workload management).

4. Analyze Data (metadata management).

5. Serve Data (interactively, low-latency).

12

(13)

(14)

Hadoop in the Enterprise Data Stack

Logs Files Web Data Relational Databases

IDEs BI /

Analytics

Enterprise Reporting

Enterprise Data Warehouse

Online Serving Systems

Cloudera Manager

SYSTEM OPERATORS

ENGINEERS ANALYSTS BUSINESS USERS

Web/Mobile Applications

CUSTOMERS

Sqoop Sqoop

Sqoop

Flume Flume

Flume

Modeling Tools

DATA SCIENTISTS

DATA ARCHITECTS

Meta Data/

ETL Tools

ODBC, JDBC, NFS

14

(15)

Three Deployment Options

Cloudera Enterprise

Cloudera Manager

CDH

(Cloudera Distribution including Apache Hadoop)

Cloudera Support Option 1:

The Complete Solution

• Cloudera Enterprise (Cloudera Manager + Support) + CDH

• Fastest rollout

• Easiest to deploy, manage and configure

• Fully supported

• Annual Subscription

Proprietary Software + Support

100% Open Source Solution:

Apache Hadoop + selected components

CDH

(Cloudera Distribution including Apache Hadoop Option 2:

CDH Free Download

• Cloudera Manager Free Edition + CDH

• No software cost

• No support included

• Manage up to 50 nodes

• Limited functionality in Cloudera Manager

CDH

(Cloudera Distribution including Apache Hadoop Option 3:

Raw Packages

• CDH Only

• Self install and configure

• No software cost

• No support included

• More management &

installation complexity

• Longer rollout for most customers

Solution Components

Cloudera Manager

(Free Edition)

Up to 50 Nodes

15

(16)

Books

16