Hadoop Ecosystem Overview. CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

(1)

Hadoop Ecosystem Overview

CMSC 491

Hadoop-Based Distributed Computing Spring 2015

Adam Shook

(2)

Agenda

• Introduce Hadoop projects to prepare you for your group work

– Intimate detail will be provided in future lectures

• Discuss potential use cases for each project

(3)

Topics

• HDFS

• MapReduce

• YARN

• Sqoop

• Flume

• NiFi

• Pig

• Hive

• Streaming

• HBase

• Accumulo

• Avro

• Parquet

• Mahout

• Oozie

• Storm

• ZooKeeper

• Spark

• SQL-on-Hadoop

• In-Memory Stores

• Cassandra

• Kafka

• Crunch

• Azkaban

(4)

HDFS

• Hadoop Distributed File System

– High-performance file system for storing data

• We’ve talked about this enough

(5)

Hadoop MapReduce

• High-performance, fault-tolerant data processing system

• We’ve also talked about this enough

(6)

YARN

• Abstract framework for distributed application development

• Split functionality of JobTracker into two components

– ResourceManager

– ApplicationMaster

• TaskTracker becomes NodeManager

– Containers instead of map and reduce slots

• Configurable amount of memory per NodeManager
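A rough sketch of the container idea: instead of a fixed number of map and reduce slots, a NodeManager advertises a memory budget and containers are carved out of it. The function name and the figures (an 8192 MB node with 1024 MB or 2048 MB containers) are illustrative assumptions, not YARN defaults or YARN APIs.

```python
def containers_that_fit(node_memory_mb, container_memory_mb):
    """Return how many equally sized containers fit in one node's memory budget."""
    if container_memory_mb <= 0:
        raise ValueError("container size must be positive")
    # Integer division: a partially filled container cannot be allocated.
    return node_memory_mb // container_memory_mb


# Hypothetical numbers: the same node hosts more small containers than large ones.
small = containers_that_fit(8192, 1024)
large = containers_that_fit(8192, 2048)
```

The point is that the split between "map" and "reduce" capacity is no longer fixed ahead of time; any application can request containers of the size it needs.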

(7)

MapReduce 2.x on YARN

• MapReduce API has not changed

– Binary-level backwards compatible (no recompile)

• Application Master launches and monitors job via YARN

• MapReduce History Server to store… history

• Enabled Yahoo! to scale beyond 4,000 nodes

(8)

Hadoop Ecosystem

• Core Technologies

– Hadoop Distributed File System

– Hadoop MapReduce

• Many other tools…

– Which we will be discussing… now

(9)

Apache Sqoop

• Apache project designed for efficient transfer between Apache Hadoop and structured data stores

• Used through a CLI; extensible

• Use cases?

(10)

Apache Flume

• Distributed, reliable, available service for collecting, aggregating, and moving large amounts of log data

• Agents are configured with simple files; extensible

• Use cases?

(11)

Apache NiFi

• A service to reliably move and manipulate files between clusters using a web front-end

• Uses a GUI to drop processors and connect them to build workflows

• Use cases?

(12)

Apache Pig

• Platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs

• Infrastructure compiles the language into a sequence of MapReduce programs

• Use cases?

(13)

Apache Hive

• Data warehouse facilitating querying and managing large datasets

• Compiles SQL-like queries into MapReduce programs

• Use cases?
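To make "compiles SQL-like queries into MapReduce programs" concrete, here is how a query such as SELECT page, COUNT(*) FROM visits GROUP BY page could be expressed as map, shuffle, and reduce steps. This is a plain-Python illustration only: the table, column, and function names are made up, and it does not reflect Hive's actual query planner.

```python
from collections import defaultdict


def map_phase(rows):
    """Emit (group-by key, 1) for every input row."""
    for row in rows:
        yield row["page"], 1


def shuffle(pairs):
    """Group emitted values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Apply the aggregate (COUNT) to each group."""
    return {key: sum(values) for key, values in groups.items()}


# Hypothetical "visits" table with a single column.
visits = [{"page": "/home"}, {"page": "/cart"}, {"page": "/home"}]
counts = reduce_phase(shuffle(map_phase(visits)))
```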

(14)

Hadoop Streaming

• Utility to create and run MapReduce jobs with any executable or script as the mapper or reducer

• Just a jar file, not a real project

• Use cases?
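Hadoop Streaming passes input to the mapper and reducer as lines on stdin and reads results from stdout, with key and value separated by a tab by default. A minimal word-count sketch under that contract might look like the following; the function names and the "map"/"reduce" command-line argument are choices made for this example.

```python
import sys
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' line per word."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(sorted_lines):
    """Sum counts per word; assumes input lines are already sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical invocation: "wordcount.py map" or "wordcount.py reduce".
    step = mapper if sys.argv[1] == "map" else reducer
    for out in step(sys.stdin):
        print(out)
```

The same script would be handed to the streaming jar through its -mapper and -reducer options; locally, piping the mapper output through sort and into the reducer approximates what the framework does.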

(15)

Which high-level API is for you?

• What are you comfortable with?

• What are you being told to use?

(16)

Apache HBase

• Distributed, scalable, big data store

• Data stored as sorted key/value pairs, with the key consisting of a row and column

• Use cases?
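The "sorted key/value pairs" point is what makes range scans cheap: rows that sort next to each other are stored next to each other. The class below is a small in-memory sketch of that idea, not the HBase client API; the class and method names are hypothetical.

```python
import bisect


class SortedCellStore:
    """Toy store keeping (row, column) keys in sorted order."""

    def __init__(self):
        self._keys = []    # sorted list of (row, column) tuples
        self._values = {}  # (row, column) -> value

    def put(self, row, column, value):
        key = (row, column)
        if key not in self._values:
            bisect.insort(self._keys, key)  # keep keys sorted on insert
        self._values[key] = value

    def scan(self, start_row, stop_row):
        """Yield cells with start_row <= row < stop_row, in key order."""
        # A 1-tuple (row,) sorts before every (row, column) with the same row.
        lo = bisect.bisect_left(self._keys, (start_row,))
        hi = bisect.bisect_left(self._keys, (stop_row,))
        for key in self._keys[lo:hi]:
            yield key, self._values[key]


store = SortedCellStore()
store.put("row2", "cf:a", "x")
store.put("row1", "cf:b", "y")
store.put("row1", "cf:a", "z")
store.put("row3", "cf:a", "w")
```

Because of the sort order, choosing the row key is the main design decision: data that should be read together needs keys that sort together.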

(17)

Apache Accumulo

• Robust, scalable, high-performance data storage and retrieval key/value store

• Cell-based access controls

– i.e. cell-level security

• Use cases?
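Cell-level security means each stored cell carries its own visibility label, and a reader only sees cells its authorizations satisfy. The sketch below simplifies this to "the reader must hold every label on the cell"; Accumulo's real visibility expressions are richer (they also allow OR and grouping), and the function and label names here are made up.

```python
def visible_cells(cells, authorizations):
    """Return (key, value) pairs whose required labels are all held by the reader.

    Each cell is a (key, value, required_labels) tuple.
    """
    auths = set(authorizations)
    return [
        (key, value)
        for key, value, required in cells
        if set(required) <= auths  # subset test: every required label is held
    ]


# Hypothetical patient record with per-cell labels.
cells = [
    ("patient1:name", "Alice", {"staff"}),
    ("patient1:diagnosis", "flu", {"staff", "doctor"}),
    ("patient1:room", "12", set()),
]
```

Two readers scanning the same row can therefore get different results without the application filtering anything itself.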

(18)

Apache Avro

• Data serialization system for the Hadoop ecosystem

• Use cases?

(19)

Apache Parquet

• Columnar storage format for Hadoop

• Use cases?
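"Columnar" means values of the same column are stored together rather than values of the same record. The sketch below only shows that layout change in plain Python; it says nothing about Parquet's actual file structure or encodings, and it assumes every row has the same fields.

```python
def to_columns(rows):
    """Pivot a list of row dicts into a dict of column lists."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


# Hypothetical records in row-oriented form.
rows = [
    {"id": 1, "city": "Baltimore"},
    {"id": 2, "city": "Columbia"},
]
columns = to_columns(rows)
```

A query that needs only "city" can now read one list and skip the rest, which is the main appeal of columnar formats for analytics.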

(20)

Apache Mahout

• Machine learning library to build scalable machine learning algorithms implemented on top of Hadoop MapReduce

• Use cases?

(21)

Apache Oozie

• Workflow scheduler system to manage Apache Hadoop jobs

• Use cases?
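At its core, a workflow scheduler runs jobs in an order that respects their dependencies. The sketch below shows that ordering step with Python's standard graphlib module (Python 3.9+). It is not Oozie's workflow format, which is defined in XML, and the job names are invented.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each job maps to the set of jobs it depends on.
workflow = {
    "ingest": set(),
    "clean": {"ingest"},
    "aggregate": {"clean"},
    "export": {"aggregate"},
}


def run_order(dependencies):
    """Return the jobs in an order where every dependency runs first."""
    return list(TopologicalSorter(dependencies).static_order())


order = run_order(workflow)
```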

(22)

Apache Storm

• Distributed real-time computation system

• Didn’t have a logo until June 2014

• How is this different from MapReduce?

• Use cases?
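One way to frame the difference from MapReduce: a batch job sees the whole input and produces an answer at the end, while a stream processor updates its answer as each record arrives. The sketch below contrasts the two in plain Python; it does not use Storm's API, and the function names are hypothetical.

```python
from collections import Counter


def batch_count(words):
    """Batch style: consume the full data set, then return the result."""
    return Counter(words)


def streaming_count(word_stream):
    """Streaming style: emit an updated count after every incoming word."""
    counts = Counter()
    for word in word_stream:
        counts[word] += 1
        yield word, counts[word]


updates = list(streaming_count(["a", "b", "a"]))
totals = batch_count(["a", "b", "a"])
```

The streaming version never needs the input to end, which is what makes it suitable for unbounded sources.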

(23)

Apache ZooKeeper

• Effort to develop and maintain an open-source server enabling highly reliable distributed coordination

• Use cases?

(24)

Apache Spark

• Fast and general engine for large-scale data processing

• Write applications in Java, Scala, or Python

• Use cases?

(25)

SQL on Hadoop

• Apache Drill, Cloudera Impala, Facebook’s Presto, Hortonworks’s Hive Stinger, Pivotal HAWQ, etc.

• SQL-like or ANSI SQL compliant MPP execution engines using HDFS as a data store

• Use cases? Non use cases?

(26)

Sample Architecture

[Architecture diagram. Labeled components: Website, Webserver, Flume Agent, HDFS, MapReduce, Pig, HBase, Storm, Oozie, Sales, Call Center, SQL (twice)]

(27)

OTHER HADOOP PROJECTS

We [maybe] won’t be covering these in detail later on

(28)

Redis, Memcached, etc.

• Open-source in-memory key/value stores

• Use cases?

(29)

Apache Cassandra

• NoSQL database for managing large amounts of structured, semi-structured, and unstructured data

• Support for clusters spanning multiple datacenters

• Unlike HBase and Accumulo, data is not stored on HDFS

• Use cases? Non use cases?

(30)

Apache Crunch

• Java framework for writing, testing, and running MapReduce pipelines with a simple API

• Same code executes as a local job, as a MapReduce job, or as a streaming Spark job

• Use cases? *

*Not the real logo, but truly fantastic

(31)

Apache Kafka

• High-throughput distributed publish-subscribe message service

• Use cases?
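A publish-subscribe log can be pictured as an append-only list of messages, with each consumer tracking its own read position (offset). The class below is an in-memory sketch of that model only; it is not the Kafka client API, and all names are hypothetical.

```python
class TopicLog:
    """Toy append-only topic with independent per-consumer offsets."""

    def __init__(self):
        self._messages = []
        self._offsets = {}  # consumer name -> index of next message to read

    def publish(self, message):
        """Append a message and return its offset."""
        self._messages.append(message)
        return len(self._messages) - 1

    def poll(self, consumer, max_messages=10):
        """Return the next unread messages for this consumer and advance its offset."""
        start = self._offsets.get(consumer, 0)
        batch = self._messages[start:start + max_messages]
        self._offsets[consumer] = start + len(batch)
        return batch


topic = TopicLog()
topic.publish("click:/home")
topic.publish("click:/cart")
```

Because reading does not remove messages, a new consumer can start from the beginning while existing consumers continue from where they left off.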

(32)

Azkaban

• Batch workflow job scheduler to run Hadoop jobs

• Use cases?

(33)

Review

• A lot of projects are available to you for your group project

• Think of a problem you are interested in, then choose the appropriate projects to solve it

• Keep in mind data ingest, storage, processing, and egress

• Feel free to explore and use projects other than the ones I have listed here

– Get permission if you plan on using one as part of your project quota

(34)

References

• All those logos are the property of their owners

• *.apache.org

• redis.io