(1)

The 4th China Cloud Computing Conference – May 25th, 2012.

Apache Hadoop: Past, Present, and Future

Dr. Amr Awadallah | Founder, Chief Technical Officer

[email protected], twitter: @awadallah

(2)

Hadoop Past


• Fastest sort of a TB: 62 seconds over 1,460 nodes

• Sorted a PB in 16.25 hours over 3,658 nodes

(3)

So What Is Apache Hadoop?

• A scalable fault-tolerant distributed system for data storage and processing (open source under the Apache license).

• Core Hadoop has two main systems:

– Hadoop Distributed File System: self-healing high-bandwidth clustered storage.

– MapReduce: distributed fault-tolerant resource management and scheduling coupled with a scalable data programming abstraction.

• Key business values:

– Flexibility – Store any data, Run any analysis.

– Scalability – Start at 1 TB/3 nodes, grow to petabytes/1000s of nodes.

– Economics – Cost per TB at a fraction of traditional options.


(4)

Hadoop Present (CDH3)

• Coordination: Apache ZooKeeper
• Data Integration: Apache Flume, Apache Sqoop
• Fast Read/Write Access: Apache HBase
• Analytical Languages: Apache Pig, Apache Hive
• BI Connectivity: ODBC/JDBC
• Job Workflow: Apache Oozie
• Metadata: Apache Hive
• File System Mount: FUSE-NFS
• Web Console: Hue
• Data Mining: Apache Mahout
• Apache Whirr
• Build/Test: Apache Bigtop
• Cloudera Manager Free Edition (Installation Wizard)

(5)

HDFS: Hadoop Distributed File System

Block Replicas are distributed across servers and racks.

Block Replication for:

• Durability

• Availability

• Throughput

Optimized for:

• Throughput

• Put/Get/Delete

• Appends

A given file is broken down into blocks (default=64MB), then blocks are replicated across cluster (default=3).
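The block and replication defaults above translate into simple arithmetic. The following is a minimal sketch, assuming the 64 MB block size and replication factor of 3 quoted on the slide; the function name `hdfs_footprint` and its return keys are hypothetical and are not part of any Hadoop API.

```python
import math

BLOCK_SIZE_MB = 64   # default block size quoted on the slide
REPLICATION = 3      # default replication factor quoted on the slide


def hdfs_footprint(file_size_mb, block_size_mb=BLOCK_SIZE_MB, replication=REPLICATION):
    """Estimate how one file is laid out: blocks, block replicas, raw storage.

    Illustrative helper only (hypothetical name, not a Hadoop call).
    """
    # A file is cut into fixed-size blocks; the last block may be partial.
    blocks = math.ceil(file_size_mb / block_size_mb) if file_size_mb > 0 else 0
    return {
        "blocks": blocks,
        # Every block is stored `replication` times across the cluster.
        "block_replicas": blocks * replication,
        # A partial last block only uses the space of the data it holds,
        # so raw storage is the file size times the replication factor.
        "raw_storage_mb": file_size_mb * replication,
    }


if __name__ == "__main__":
    # A 1 GB (1024 MB) file: 16 blocks, 48 block replicas, 3072 MB of raw disk.
    print(hdfs_footprint(1024))
```

Under these assumptions, the threefold raw-storage cost is the price paid for the durability, availability, and read throughput listed above.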


(6)

MapReduce: Resource Manager / Scheduler

A given job is broken down into tasks, then tasks are scheduled to be as close to data as possible.

Three levels of data locality:

• Same server as data (local disk)

• Same rack as data (rack/leaf switch)

• Wherever there is a free slot (cross rack)

System detects laggard tasks and speculatively executes parallel tasks on the same slice of data.

Optimized for:

• Batch Processing

• Failure Recovery
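The three levels of data locality can be sketched as a ranking over free slots. This is an illustration of the idea only, not the actual JobTracker scheduling code; the `Slot` type and the `locality_level` and `pick_slot` helpers are hypothetical names introduced here.

```python
from typing import List, NamedTuple, Optional


class Slot(NamedTuple):
    """A server and the rack it sits in (hypothetical type for this sketch)."""
    host: str
    rack: str


def locality_level(replicas: List[Slot], slot: Slot) -> int:
    """Rank a free slot against the servers holding the task's data.

    0 = same server as data, 1 = same rack as data, 2 = cross rack.
    """
    if any(r.host == slot.host for r in replicas):
        return 0
    if any(r.rack == slot.rack for r in replicas):
        return 1
    return 2


def pick_slot(replicas: List[Slot], free_slots: List[Slot]) -> Optional[Slot]:
    """Choose the free slot closest to the data, or None if no slot is free."""
    return min(free_slots, key=lambda s: locality_level(replicas, s), default=None)


if __name__ == "__main__":
    replicas = [Slot("n1", "r1"), Slot("n5", "r2")]
    free = [Slot("n9", "r3"), Slot("n2", "r1"), Slot("n5", "r2")]
    # n5 holds a replica, so it wins over the rack-local n2 and the cross-rack n9.
    print(pick_slot(replicas, free))
```

Because blocks are replicated, each task usually has several candidate servers, which raises the chance that a server-local or rack-local slot is free.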


(7)

MapReduce: Computational Framework

[Figure: Split 1 … Split i … Split N → Map 1 … Map i … Map M → Shuffle → Reduce 1 … Reduce i … Reduce R → Output File 1 … Output File i … Output File R]

• Each Map reads (docid, text) and emits (words, counts).

• Shuffle delivers (sorted words, counts) to the reducers.

• Each Reduce writes (sorted words, sum of counts) to its output file.

Map(in_key, in_value) => list of (out_key, intermediate_value)

Reduce(out_key, list of intermediate_values) => out_value(s)

Example (word count over text such as “To Be Or Not To Be?”): the intermediate pairs (Be, 5), (Be, 12), (Be, 7), and (Be, 6) reduce to (Be, 30).

cat *.txt | mapper.pl | sort | reducer.pl > out.txt
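The same map, sort, and reduce stages can be sketched in a single process. This is a toy word count under stated assumptions, not Hadoop code: the names `map_fn`, `shuffle`, `reduce_fn`, and `word_count` are hypothetical, and words are lowercased here purely for illustration.

```python
import re
from collections import defaultdict


def map_fn(docid, text):
    """Map(in_key, in_value) => list of (out_key, intermediate_value)."""
    for word in re.findall(r"[A-Za-z']+", text):
        yield word.lower(), 1


def shuffle(pairs):
    """Group intermediate values by key and sort the keys (the `sort` stage)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_fn(word, counts):
    """Reduce(out_key, list of intermediate_values) => out_value."""
    return word, sum(counts)


def word_count(docs):
    """Run map, shuffle, and reduce over a dict of {docid: text}."""
    intermediate = [pair for docid, text in docs.items() for pair in map_fn(docid, text)]
    return [reduce_fn(word, counts) for word, counts in shuffle(intermediate)]


if __name__ == "__main__":
    print(word_count({"doc1": "To Be Or Not To Be?"}))
    # [('be', 2), ('not', 1), ('or', 1), ('to', 2)]
```

In a real cluster the map and reduce calls run as separate tasks on different servers and the shuffle moves data over the network; the sketch only mirrors the data flow.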

(8)

Hadoop High-Level Architecture

• Name Node: Maintains mapping of file names to blocks to data node slaves.

• Job Tracker: Tracks resources and schedules jobs across task tracker slaves.

• Data Node: Stores and serves blocks of data.

• Task Tracker: Runs tasks (work units) within a job.

• Hadoop Client: Contacts Name Node for data or Job Tracker to submit jobs.

Data Node and Task Tracker share a physical node.
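The Name Node's role, mapping file names to blocks and blocks to data nodes, can be pictured as two lookup tables. The following is a toy sketch only; the `ToyNameNode` class, its methods, and the block and node identifiers are hypothetical and do not reflect the real Name Node implementation.

```python
from typing import Dict, List, Tuple


class ToyNameNode:
    """Keeps file -> blocks and block -> data nodes mappings in memory."""

    def __init__(self) -> None:
        self.file_blocks: Dict[str, List[str]] = {}
        self.block_locations: Dict[str, List[str]] = {}

    def add_block(self, path: str, block_id: str, data_nodes: List[str]) -> None:
        """Record a new block of `path` and the data nodes holding its replicas."""
        self.file_blocks.setdefault(path, []).append(block_id)
        self.block_locations[block_id] = list(data_nodes)

    def locate(self, path: str) -> List[Tuple[str, List[str]]]:
        """Return (block, data nodes) pairs in file order; empty if unknown."""
        return [(b, self.block_locations[b]) for b in self.file_blocks.get(path, [])]


if __name__ == "__main__":
    nn = ToyNameNode()
    nn.add_block("/logs/a", "blk_1", ["dn1", "dn2", "dn3"])
    nn.add_block("/logs/a", "blk_2", ["dn2", "dn4", "dn5"])
    # A client asks where the blocks live, then reads them from the data nodes.
    print(nn.locate("/logs/a"))
```

The sketch holds metadata only: in this picture the client fetches the block contents from the data nodes themselves.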

(9)

Changes for Better Availability/Scalability

• Name Node: Federation partitions out the name space; High Availability via an Active Standby.

• Job Tracker: Each job has its own Application Manager; Resource Manager is decoupled from MR.

• Data Node: Stores and serves blocks of data.

• Task Tracker: Runs tasks (work units) within a job.

• Hadoop Client: Contacts Name Node for data or Job Tracker to submit jobs.

Data Node and Task Tracker share a physical node.

(10)

MapReduce Next Gen

Main idea is to split up the JobTracker functions:

• Cluster resource management (for tracking and allocating nodes)

• Application life-cycle management (for MapReduce scheduling and execution)

Enables:

• High Availability

• Better Scalability

• Efficient Slot Allocation

• Rolling Upgrades

• Non-MapReduce Apps


(11)

HBase versus HDFS

HDFS:

• Optimized For: Large Files, Sequential Access (High Throughput), Append Only.

• Use For: Fact tables that are mostly append only and require sequential full table scans.

HBase:

• Optimized For: Small Records, Random Access (Low Latency), Atomic Record Updates.

• Use For: Dimension tables which are updated frequently and require random low-latency lookups.

Not Suitable For:

• Low Latency Interactive OLAP.

©2012 Cloudera, Inc. All Rights Reserved.


(12)

What Next: The Long Term Vision

A fully functional Big Data Operating System that allows you to:

1. Collect Data (in real-time).

2. Store Data (very reliably, securely).

3. Process Data (workload management).

4. Analyze Data (metadata management).

5. Serve Data (interactively, low-latency).


(13)
(14)

Hadoop in the Enterprise Data Stack

[Figure: Hadoop in the enterprise data stack. Labels recoverable from the diagram:]

• Data sources: Logs, Files, Web Data, Relational Databases

• Data movement: Flume, Sqoop

• Connectivity: ODBC, JDBC, NFS

• Tools and systems: IDEs, BI / Analytics, Enterprise Reporting, Modeling Tools, Meta Data / ETL Tools, Enterprise Data Warehouse, Online Serving Systems, Web/Mobile Applications, Cloudera Manager

• Roles: System Operators, Engineers, Analysts, Business Users, Data Scientists, Data Architects, Customers

(15)

Three Deployment Options

Option 1: The Complete Solution

Cloudera Enterprise (Cloudera Manager + Support) + CDH

• Fastest rollout
• Easiest to deploy, manage and configure
• Fully supported
• Solution components: Cloudera Enterprise (Cloudera Manager, Cloudera Support), CDH

Option 2: CDH Free Download

Cloudera Manager Free Edition + CDH

• No software cost
• No support included
• Manage up to 50 nodes
• Limited functionality in Cloudera Manager
• Solution components: Cloudera Manager (Free Edition, up to 50 nodes), CDH

Option 3: Raw Packages

CDH Only

• Self install and configure
• No software cost
• No support included
• More management & installation complexity
• Longer rollout for most customers
• Solution components: CDH

Legend:

• Cloudera Enterprise: Annual Subscription, Proprietary Software + Support
• CDH (Cloudera Distribution including Apache Hadoop): 100% Open Source Solution, Apache Hadoop + selected components

(16)

Books

