BIG DATA TECHNOLOGY. Hadoop Ecosystem

(1)

BIG DATA TECHNOLOGY

Hadoop Ecosystem

(2)

Agenda

•  Background

•  What is Big Data

•  Solution Objective

•  Introduction to Hadoop

•  Hadoop Ecosystem

•  Hybrid EDW Model

•  Predictive Analysis using Hadoop

•  Conclusion

(3)

What is Big Data?

“DATA EVERYWHERE BUT NOWHERE…”

Dr. Michio Kaku

(4)

Data Everywhere…

(5)

Data growth

•  An estimated 90% of the world’s data has been created over the past two years

•  Data is doubling every two years, and global annual data creation is set to leap from 1.2 zettabytes in 2012 to 35 zettabytes in 2020

•  Every day, we create 2.5 quintillion bytes of data

•  Unstructured information is growing at 15 times the rate of structured information

•  Operational data is extremely small compared to the other data sources around it

(6)

Definition of Big Data

Volume

•  Processing many terabytes to petabytes of data

•  Data arrives in large bursts

•  Sift through the noise to identify the right data to improve business insight

Velocity

•  Analyze more data in less time to enable faster, more responsive business decisions

•  New data acquisition and very rapid creation of data

•  Batch, near-real-time and real-time data feeds

Variety

•  Data comes in many formats, including unstructured, semi-structured, complex documents and rich media

•  Data format is constantly changing

•  Changing data context

(7)

Every Day Examples of Big Data

Category     | Data                              | Big Data
Descriptive  | Age, Gender, Income, Demographics | Attitudes, Psychographics
Social       | User Defined                      | Influence, Peers
Location     | Home Address                      | Real Time
Interaction  | Who is available next             | Who is best to serve the personality of the consumer
Relationship | Transactional                     | Patterns, experience, internal and external data

(8)

Big Data Solution Objectives

•  Enable scalable, accurate and powerful analysis

•  Process data quickly and cost-effectively

•  Connect high-volume, volatile data to enable organizations to make effective business decisions

•  Plan future success by using insights from big data to increase the value of predictive analytics

•  Apply data mining using a variety of techniques

•  Build the ability to experiment, discover and rationalize

(9)

Processing Challenge

•  Storage capacities of hard drives have increased but transfer rates have not kept up

•  Hardware Failure

•  Most analysis tasks need to be able to combine the data in some way.
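
The first challenge can be put in rough numbers. The sketch below is a minimal back-of-the-envelope calculation in Python; the drive size (1 TB), the transfer rate (100 MB/s) and the number of parallel drives (100) are illustrative assumptions, not figures from this deck.

```python
def read_time_seconds(data_bytes: float, rate_bytes_per_s: float, drives: int = 1) -> float:
    """Time to read data that is split evenly across `drives` disks read in parallel."""
    if drives < 1:
        raise ValueError("drives must be >= 1")
    return data_bytes / (rate_bytes_per_s * drives)


TB = 10**12  # decimal terabyte, in bytes
MB = 10**6   # decimal megabyte, in bytes

# Assumed figures: a 1 TB data set and a sustained transfer rate of 100 MB/s per drive.
single = read_time_seconds(1 * TB, 100 * MB)
parallel = read_time_seconds(1 * TB, 100 * MB, drives=100)

print(f"one drive:  {single / 3600:.1f} hours")
print(f"100 drives: {parallel / 60:.1f} minutes")
```

Under these assumptions a full scan drops from hours to minutes once the data is spread over many disks, which is why parallel reads matter more than raw capacity.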

(10)

Why Hadoop

•  The ability to read and write data in parallel to or from multiple disks

•  Enables applications to work with thousands of nodes and petabytes of data with automatic failover

•  A reliable shared storage and analysis system

•  Open source architecture for large-scale computation and data processing on a network of commodity hardware

•  Ability to work with a variety of data mining capabilities

•  Allows innovation and brings top talent to the core

(11)

Hadoop Concept

•  Distribute the data in its original form

•  Process the data where it is stored

•  Combine the results from different nodes

•  HDFS: distributed file system that scales to accommodate data of any size and tolerates failures through built-in replication and regeneration

•  MapReduce: processes multiple data sources into structured data (Map), then performs optional aggregation on the results (Reduce)
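
The Map and Reduce steps can be illustrated with a word count. This is a minimal single-process sketch in plain Python, not Hadoop's Java API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical, and on a real cluster the map and reduce calls would run on different nodes.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map: turn one raw input line into (key, value) pairs."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all values that share a key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: aggregate the values collected for one key."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["big data big insight", "data everywhere"])
print(counts)
```

The map step only looks at one line at a time, so it can run wherever that line is stored; only the grouped keys need to travel between nodes.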

(12)

History of Hadoop

•  Created by Doug Cutting

•  2002 – Apache Nutch, open source web search engine

•  2003 – Google publishes a paper describing the architecture of their distributed filesystem, GFS.

•  2004 – Nutch Distributed Filesystem (NDFS)

•  2004 – Google publishes a paper on MapReduce

•  2005 – Nutch MapReduce implementation

•  2006 – Hadoop is created; Cutting joins Yahoo!

•  2008 – Yahoo! demonstrates Hadoop capabilities

•  2008 – Hadoop breaks the world record for the fastest sort of a terabyte of data

•  2013 – Continued Innovation and Adoption

(13)

Hadoop Ecosystem

(14)

Hadoop Ecosystem - Hadoop Base Platform

(15)

Hadoop Ecosystem - HDFS

•  Hadoop Distributed File System (HDFS)

•  Files split into 128 MB blocks

•  Blocks replicated across several DataNodes (usually 3)

•  Single NameNode stores metadata (file names, block locations, etc.)

•  Optimized for large files
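
A minimal sketch of the block and replica idea in Python. The block size (128 MB) and replication factor (3) come from the slide; the round-robin placement and the node names (`dn1` to `dn4`) are simplifying assumptions for illustration and do not reproduce HDFS's actual rack-aware placement policy.

```python
from typing import Dict, List

BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB per block
REPLICATION = 3                 # copies kept of each block


def split_into_blocks(file_size: int, block_size: int = BLOCK_SIZE) -> List[int]:
    """Return the size in bytes of each block; the last block may be smaller."""
    if file_size < 0:
        raise ValueError("file_size must be non-negative")
    full, rest = divmod(file_size, block_size)
    return [block_size] * full + ([rest] if rest else [])


def place_replicas(num_blocks: int, datanodes: List[str],
                   replication: int = REPLICATION) -> Dict[int, List[str]]:
    """Assign each block to `replication` distinct nodes (simple round-robin)."""
    if replication > len(datanodes):
        raise ValueError("not enough DataNodes for the requested replication")
    return {
        block: [datanodes[(block + r) % len(datanodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


# A 300 MB file becomes two full blocks plus one 44 MB block.
blocks = split_into_blocks(300 * 1024 * 1024)
# This mapping is the kind of metadata the NameNode keeps.
placement = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"])
print(len(blocks), placement)
```

Because every block lives on three different nodes in this sketch, losing any single node still leaves two copies of each block available.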

(16)

Hadoop Map Reduce Example

(17)

Hadoop Ecosystem - HBase

•  HBase is a distributed column-oriented database built on top of HDFS

•  NoSQL highly available Database used as input and/or output with Hadoop

•  Used when you require real-time read/write random-access to very large datasets
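
The column-oriented model can be pictured as a sorted map from row key to a sparse set of `family:qualifier` columns. The class below is a toy in-memory sketch in Python and is not the HBase client API; `TinyColumnStore` and the sample row keys are hypothetical, and cell versions and timestamps are left out.

```python
from typing import Dict, Iterator, Optional, Tuple


class TinyColumnStore:
    """Toy model: row key -> {"family:qualifier": value}, rows need not share columns."""

    def __init__(self) -> None:
        self._rows: Dict[str, Dict[str, str]] = {}

    def put(self, row_key: str, column: str, value: str) -> None:
        """Random-access write of a single cell."""
        self._rows.setdefault(row_key, {})[column] = value

    def get(self, row_key: str, column: str) -> Optional[str]:
        """Random-access read of a single cell; None if the cell was never written."""
        return self._rows.get(row_key, {}).get(column)

    def scan(self, start: str, stop: str) -> Iterator[Tuple[str, Dict[str, str]]]:
        """Yield rows in row-key order for keys in [start, stop)."""
        for key in sorted(self._rows):
            if start <= key < stop:
                yield key, dict(self._rows[key])


store = TinyColumnStore()
store.put("user#001", "info:name", "Ada")
store.put("user#001", "stats:visits", "3")
store.put("user#002", "info:name", "Lin")

print(store.get("user#001", "info:name"))
print([key for key, _ in store.scan("user#001", "user#002")])
```

Rows are addressed by key and may carry different columns, which is what makes single-row reads and writes cheap even when the table is very large.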

(18)

Hadoop Ecosystem - Hive

•  Developed at Facebook

•  Maintains a list of table schemas

•  SQL-like query language (HQL)

•  Can call Hadoop Streaming scripts from HQL

•  Supports table partitioning, clustering, complex data types, some optimizations

•  Translates SQL into MapReduce jobs
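
To make the last point concrete, here is a conceptual sketch in Python of how a grouped query could be expressed as a map step and a reduce step. The table `sales`, its columns and the query are hypothetical, and this is not Hive's actual query plan.

```python
from itertools import groupby
from operator import itemgetter
from typing import Any, Dict, Iterable, Iterator, Tuple

# Query being mimicked (hypothetical table and columns):
#   SELECT region, SUM(amount) FROM sales WHERE amount > 0 GROUP BY region


def map_rows(rows: Iterable[Dict[str, Any]]) -> Iterator[Tuple[str, float]]:
    """Map: apply the WHERE filter and emit (GROUP BY key, value) pairs."""
    for row in rows:
        if row["amount"] > 0:
            yield row["region"], row["amount"]


def reduce_groups(pairs: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """Sort by key (standing in for the shuffle), then reduce each group with SUM."""
    ordered = sorted(pairs, key=itemgetter(0))
    return {
        region: sum(amount for _, amount in group)
        for region, group in groupby(ordered, key=itemgetter(0))
    }


sales = [
    {"region": "east", "amount": 10.0},
    {"region": "west", "amount": 5.0},
    {"region": "east", "amount": 2.5},
    {"region": "west", "amount": -1.0},  # dropped by the WHERE filter
]
totals = reduce_groups(map_rows(sales))
print(totals)
```

The filter and projection fit in the map step because they need only one row at a time, while the aggregate has to wait until all rows for a key are grouped.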

(19)

Hadoop Ecosystem - Additional Components

•  ZooKeeper: configuration storage and synchronization system for Hadoop

•  Pig: data flow language (Pig Latin) for creating MapReduce jobs

•  Sqoop: data transfer tool to bring structured data from an RDBMS into Hadoop (HDFS, Hive or HBase)

•  Avro: data serialization framework for persistent data and communication between Hadoop nodes

•  Apache Oozie: an open-source workflow/coordination service to manage data processing jobs for Hadoop, developed and then open-sourced by Yahoo!

(20)

Traditional & Big Data Approaches

Traditional Approach: Structured & Repeatable Analysis

•  Business users determine what question to ask

•  IT structures the data to answer that question

•  Examples: monthly sales reports, profitability analysis, customer surveys

Big Data Approach: Iterative & Exploratory Analysis

•  IT delivers a platform to enable creative discovery

•  Business explores what questions could be asked

•  Examples: brand sentiment, predictive analytics, maximum asset utilization

(21)

Bring the Balance

(22)

Analytics Platform – Differentiator

Conventional + Big Data

(23)

Hybrid Enterprise Data Warehouse

[Diagram components: Data Sources (emerging and raw data streams; existing and operational data sources); Cleansing, Modeling; Hybrid Platform; Presentation Tier; Cleansing Tools, Metadata, Legal, Compliance]

(24)

Big Data and Analytics

(25)

Analytics

Descriptive analytics: using historical data to describe the business. This is usually associated with Business Intelligence or visibility systems. In the supply chain, you use descriptive analytics to understand historical demand patterns, how product flows through your supply chain, and when a shipment might be late.

Predictive analytics: using data to predict trends and patterns. This is commonly associated with statistics. In the supply chain, you use predictive analytics to forecast future demand or the price of fuel.

Prescriptive analytics: using data to suggest the optimal solution. This is commonly associated with optimization. In the supply chain, you use prescriptive analytics to set your inventory levels, schedule your plants, or route your trucks.
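
The three kinds of analytics can be contrasted on one small demand series. The sketch below is a minimal Python illustration; the demand figures, candidate order quantities and cost parameters are made-up assumptions, and the straight-line trend and the cost function are deliberately simple stand-ins for real statistical and optimization models.

```python
from typing import Sequence


def describe(history: Sequence[float]) -> float:
    """Descriptive: summarize what happened (average historical demand)."""
    return sum(history) / len(history)


def forecast_next(history: Sequence[float]) -> float:
    """Predictive: fit a least-squares line through the history, extend it one period."""
    n = len(history)
    if n < 2:
        raise ValueError("need at least two observations to fit a trend")
    x_bar = (n - 1) / 2
    y_bar = sum(history) / n
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in enumerate(history))
             / sum((x - x_bar) ** 2 for x in range(n)))
    return y_bar + slope * (n - x_bar)


def best_order_quantity(forecast: float, candidates: Sequence[float],
                        holding_cost: float = 1.0, shortage_cost: float = 4.0) -> float:
    """Prescriptive: pick the candidate quantity with the lowest expected cost."""
    def cost(quantity: float) -> float:
        return (holding_cost * max(quantity - forecast, 0)
                + shortage_cost * max(forecast - quantity, 0))
    return min(candidates, key=cost)


demand = [100, 110, 120, 130]  # assumed units sold in the last four periods
average = describe(demand)
next_demand = forecast_next(demand)
order = best_order_quantity(next_demand, [120, 140, 160])
print(average, next_demand, order)
```

Descriptive output summarizes the past, predictive output is an estimate of the next period, and prescriptive output is a decision derived from that estimate.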

(26)

Predictive analytics and Crime Mitigation

•  Look at past data

•  Create patterns based on

    •  Where crime events happened

    •  Situations

    •  Seasons

    •  Socio-economic conditions

•  Sniff data from multiple sources

    •  Travel data

    •  Changes in socio-economic data

    •  Crime hot spots

    •  Social media data – sensitive comments

    •  Financial data – USPA, AML

    •  Images and documents

•  Why it is a Big Data problem

    •  Data is coming from a variety of sources

    •  In different forms

    •  At a very dynamic rate and in different volumes

•  Solution – Identify and process data from identified sources → create a mathematical model → process the model offline and in real time

(27)

Business Intelligence & Predictive Analysis

•  Transportation: identifying traffic patterns and predicting traffic conditions

•  Big Data in Health Care

•  Conservation of Natural Resources

•  Waste Management

•  Operationally and economically efficient education system

•  Resident services

•  Global Warming

•  Revenue Opportunities – Tourism and Tax patterns

(28)

Conclusion

•  Identify the Problem statement

•  Align Hadoop with Business strategy

•  Hadoop is not Big Data but part of the ecosystem.

•  Big Data Solution is both Art and Science

•  Rationalizing the approach is extremely critical

•  Hadoop and conventional EDW need to co-exist

•  Attract top talent by using the innovative and latest technology.

•  Operational, Execution and Business efficiency should be the success criteria

•  Data privacy, legal and access regulations have to be an integral part of design and metadata

(29)

Appendix

(30)

Storage Capacity Terms

(31)

Big Data Example – by Type

(32)

Variety: varying types of data, used in combination.

Structured, Semi-Structured, Unstructured

... Time=…&customer=…&product=… ...
