BIG DATA
TECHNOLOGY
Hadoop Ecosystem
Agenda
• Background
• What is Big Data
• Solution Objective
• Introduction to Hadoop
• Hadoop Ecosystem
• Hybrid EDW Model
• Predictive Analysis using Hadoop
• Conclusion
What is Big Data?
“DATA EVERYWHERE BUT NOWHERE…”
Dr. Michio Kaku
Data Everywhere…
Data growth
• An estimated 90% of the world’s data has been created over the past two years
• Data is doubling every two years, and global annual data creation is set to leap from 1.2 zettabytes in 2012 to 35 zettabytes in 2020
• Every day, we create 2.5 quintillion bytes of data
• Unstructured information is growing at 15 times the rate of structured information
• Operational data is extremely small compared with other data sources
Definition of Big Data
Volume
• Processing many terabytes to petabytes of data
• Data arrives in large bursts
• Sift through the noise to identify the right data to improve business insight
Velocity
• Analyze more data in less time to facilitate faster and more responsive business decision making
• New data acquisition and very rapid creation of data
• Batch, near-time, and real-time data feeds
Variety
• Data is in many formats, including unstructured, semi-structured, complex documents, and rich media
• Data format is constantly changing
• Changing data context
Every Day Examples of Big Data
Category | Data | Big Data
Descriptive | Age, Gender, Income, Demographics | Attitudes, Psychographics
Social | User Defined | Influence, Peers
Location | Home Address | Real Time
Interaction | Who is available next | Who is best to serve the personality of the consumer
Relationship | Transactional | Patterns, experience, internal and external data
Big Data Solution Objectives
• Enable scalable, accurate, and powerful analysis
• Process data fast and cost-effectively
• Connect high-volume, volatile data to enable organizations to make effective business decisions
• Plan future success: use insights from big data to increase the value of predictive analytics
• Support data mining with a variety of techniques
• Build the ability to experiment, discover, and rationalize
Processing Challenge
• Storage capacities of hard drives have increased but transfer rates have not kept up
• Hardware Failure
• Most analysis tasks need to be able to combine the data in some way.
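The transfer-rate bottleneck can be made concrete with rough arithmetic. The sketch below assumes a 1 TB dataset and a sustained read rate of 100 MB/s per drive; both figures are illustrative assumptions rather than measurements, and the helper name is hypothetical.

```python
# Illustrative assumptions (not from the slides): 1 TB of data,
# 100 MB/s sustained read rate per drive.
DATA_BYTES = 1 * 10**12
RATE_BYTES_PER_SEC = 100 * 10**6


def read_time_seconds(data_bytes, rate_bytes_per_sec, num_disks=1):
    """Time to read data split evenly across num_disks read in parallel."""
    return data_bytes / (rate_bytes_per_sec * num_disks)


single_disk = read_time_seconds(DATA_BYTES, RATE_BYTES_PER_SEC)
hundred_disks = read_time_seconds(DATA_BYTES, RATE_BYTES_PER_SEC, num_disks=100)

print(f"1 disk:    {single_disk / 3600:.1f} hours")
print(f"100 disks: {hundred_disks / 60:.1f} minutes")
```

Under these assumptions, spreading the data over many disks turns hours into minutes, but more disks also means a higher chance that one of them fails, which is why hardware failure appears on the same list.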
Why Hadoop
• The ability to read and write data in parallel to or from multiple disks
• Enables applications to work with thousands of nodes and petabytes of data with automatic failover
• A reliable shared storage and analysis system
• Open source architecture for large-scale computation and data processing on a network of commodity hardware
• Ability to work with a variety of data mining capabilities
• Allows innovation and brings top talent to the core
Hadoop Concept
• Distribute the data in its original form
• Process the data where it is stored
• Combine the results from different nodes
• HDFS: distributed file system, scalable to accommodate any size of data and tolerant of failures thanks to built-in replication and regeneration
• MapReduce: processes multiple data sources into structured data (Map) and performs optional aggregation on the results (Reduce)
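The first three steps (distribute, process locally, combine) can be sketched in plain Python. This is a single-process toy, not Hadoop code: the `nodes` list stands in for data blocks stored on different machines, the numbers are made up, and the function names are hypothetical.

```python
# Toy model: each inner list stands in for the data block held by one node.
nodes = [
    [3, 5, 8],
    [13, 21],
    [34, 55, 89, 144],
]


def process_locally(block):
    """Runs "where the data is stored": returns a small partial result."""
    return sum(block), len(block)


def combine(partials):
    """Merges the partial results from all nodes into a global mean."""
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count


partials = [process_locally(block) for block in nodes]
mean = combine(partials)
print(partials, mean)
```

Only the small `(sum, count)` pairs need to travel between nodes; the raw data never leaves the machine that stores it.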
History of Hadoop
• Created by Doug Cutting
• 2002 – Apache Nutch, open source web search engine
• 2003 – Google publishes a paper describing the architecture of their distributed filesystem, GFS.
• 2004 – Nutch Distributed Filesystem (NDFS)
• 2004 – Google publishes a paper on MapReduce
• 2005 – Nutch MapReduce implementation
• 2006 – Hadoop is created; Cutting joins Yahoo!
• 2008 – Yahoo! demonstrates Hadoop capabilities
• 2008 – Hadoop breaks the world record for the fastest sort of a terabyte of data
• 2013 – Continued Innovation and Adoption
Hadoop Ecosystem
Hadoop Ecosystem - Hadoop Base Platform
Hadoop Ecosystem - HDFS
• Hadoop Distributed File System
• Files split into 128 MB blocks
• Blocks replicated across several DataNodes (usually 3)
• Single NameNode stores metadata (file names, block locations, etc.)
• Optimized for large files
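A quick calculation shows what the HDFS block and replication settings imply for storage. The sketch uses the 128 MB block size and the replication factor of 3 from the slide; the 1 GiB file is an arbitrary example, and `hdfs_footprint` is a hypothetical helper, not part of any Hadoop API.

```python
BLOCK_SIZE = 128 * 1024**2   # 128 MB block size, as stated on the slide
REPLICATION = 3              # usual replication factor, as stated on the slide


def hdfs_footprint(file_bytes, block_size=BLOCK_SIZE, replication=REPLICATION):
    """Return (number of blocks, number of block replicas, raw bytes stored)."""
    blocks = (file_bytes + block_size - 1) // block_size  # ceiling division
    # A partly filled final block only occupies as much disk as the data it
    # holds, so raw usage is the file size times the replication factor.
    return blocks, blocks * replication, file_bytes * replication


blocks, replicas, raw_bytes = hdfs_footprint(1 * 1024**3)  # example: 1 GiB file
print(blocks, replicas, raw_bytes / 1024**3)
```

Every block is also a metadata entry on the single NameNode, which is one reason the design favors a few large files over many small ones.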
Hadoop Map Reduce Example
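A common way to illustrate MapReduce is word count. The sketch below mimics the map, shuffle, and reduce data flow in a single Python process; it is not the Hadoop API, the input lines are made up, and the function names are hypothetical.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: aggregate the values collected for one key."""
    return key, sum(values)


lines = ["big data big insight", "data everywhere"]  # made-up input
mapped = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts)
```

In a real cluster the map calls run in parallel on the nodes that hold each block, and the shuffle moves the grouped pairs across the network to the reducers.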
Hadoop Ecosystem - HBase
• HBase is a distributed column-oriented database built on top of HDFS
• NoSQL highly available Database used as input and/or output with Hadoop
• Used when you require real-time read/write random-access to very large datasets
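The access pattern described for HBase can be pictured with a toy in-memory model: rows are kept sorted by row key, each row holds a sparse set of `family:qualifier` columns, and reads are either single-row lookups or key-range scans. `ToyTable` and the sample keys are hypothetical and only illustrate the data model; this is not the HBase client API.

```python
from bisect import bisect_left, insort


class ToyTable:
    """In-memory stand-in for a table keyed by row key (not the HBase API)."""

    def __init__(self):
        self._keys = []   # row keys kept sorted, enabling range scans
        self._rows = {}   # row key -> {"family:qualifier": value}

    def put(self, row_key, column, value):
        """Random-access write of one cell."""
        if row_key not in self._rows:
            insort(self._keys, row_key)
            self._rows[row_key] = {}
        self._rows[row_key][column] = value

    def get(self, row_key):
        """Random-access read of a single row."""
        return self._rows.get(row_key, {})

    def scan(self, start, stop):
        """Yield rows with start <= row key < stop, in key order."""
        i = bisect_left(self._keys, start)
        while i < len(self._keys) and self._keys[i] < stop:
            yield self._keys[i], self._rows[self._keys[i]]
            i += 1


table = ToyTable()
table.put("user#002", "info:name", "Bob")
table.put("user#001", "info:name", "Alice")
table.put("user#001", "stats:visits", 7)
print(table.get("user#001"))
print([key for key, _ in table.scan("user#001", "user#002")])
```

Rows are sparse: `user#002` simply has no `stats:visits` cell, and nothing is stored for it.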
Hadoop Ecosystem - Hive
• Developed at Facebook
• Maintains a list of table schemas
• SQL-like query language (HQL)
• Can call Hadoop Streaming scripts from HQL
• Supports table partitioning, clustering, complex data types, and some optimizations
• Translates HQL queries into MapReduce jobs
Hadoop Ecosystem - Additional Components
• ZooKeeper: configuration storage and synchronization service for Hadoop
• Pig: data flow language (Pig Latin) for creating MapReduce jobs
• Sqoop: data import tool to bring structured data from an RDBMS into HDFS, Hive, or HBase
• Avro: data serialization framework for persistent data and for communication between Hadoop nodes
• Apache Oozie: open-source workflow/coordination service to manage data processing jobs for Hadoop, developed and then open-sourced by Yahoo!
Traditional & Big Data Approaches
Traditional Approach: Structured & Repeatable Analysis
• Business users determine what question to ask
• IT structures the data to answer that question
• Examples: monthly sales reports, profitability analysis, customer surveys
Big Data Approach: Iterative & Exploratory Analysis
• IT delivers a platform to enable creative discovery
• Business explores what questions could be asked
• Examples: brand sentiment, predictive analytics, maximum asset utilization
Bring the Balance
Analytics Platform – Differentiator
Conventional + Big Data
Hybrid Enterprise Data Warehouse
[Diagram labels: Data Sources (emerging and raw data streams; existing and operational data sources); Cleansing, Modeling; Hybrid Platform; Presentation Tier; Cleansing Tools, Metadata, Legal, Compliance]
Big Data and Analytics
Analytics
Descriptive analytics: Using historical data to describe the business. This is usually associated with Business Intelligence or visibility systems. In the supply chain, you use descriptive analytics to understand historical demand patterns, how product flows through your supply chain, and when a shipment might be late.
Predictive analytics: Using data to predict trends and patterns. This is commonly associated with statistics. In the supply chain, you use predictive analytics to forecast future demand or the price of fuel.
Prescriptive analytics: Using data to suggest the optimal solution. This is commonly associated with optimization. In the supply chain, you use prescriptive analytics to set your inventory levels, schedule your plants, or route your trucks.
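The three kinds of analytics can be contrasted on one tiny supply-chain example. Everything in the sketch is an assumption made for illustration: the monthly demand figures, the two cost constants, and the candidate stock levels are made up, and a straight-line trend is only one of many possible forecasting methods.

```python
# Made-up monthly demand figures, for illustration only.
demand = [100, 110, 120, 130]

# Descriptive: summarize what already happened.
average = sum(demand) / len(demand)

# Predictive: fit a least-squares line and extrapolate one month ahead.
n = len(demand)
xs = range(n)
mean_x = sum(xs) / n
slope = sum((x - mean_x) * (y - average) for x, y in zip(xs, demand)) / sum(
    (x - mean_x) ** 2 for x in xs
)
intercept = average - slope * mean_x
forecast = intercept + slope * n

# Prescriptive: choose the stock level that minimizes an assumed cost.
HOLDING_COST = 1   # hypothetical cost per unit of excess stock
SHORTAGE_COST = 4  # hypothetical cost per unit of unmet demand


def cost(stock):
    """Cost of holding `stock` units if demand equals the forecast."""
    excess = max(stock - forecast, 0)
    shortfall = max(forecast - stock, 0)
    return HOLDING_COST * excess + SHORTAGE_COST * shortfall


best_stock = min([100, 150, 200], key=cost)
print(average, forecast, best_stock)
```

The descriptive step only reports the past, the predictive step extrapolates it, and the prescriptive step turns the forecast into a decision by optimizing over the available choices.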
Predictive Analytics and Crime Mitigation
• Look at past data
• Create patterns based on
  • Where crime events happened
  • Situations
  • Seasons
  • Socioeconomic factors
• Sniff data from multiple sources
  • Travel data
  • Changes in socioeconomic data
  • Crime hot spots
  • Social media data – sensitive comments
  • Financial data – USPA, AML
  • Images and documents
• Why it is a Big Data problem
  • Data is coming from a variety of sources
  • In different forms
  • At a very dynamic rate and in different volumes
• Solution – Identify and process data from identified sources → create a mathematical model → process the model offline and in real time
Business Intelligence & Predictive Analysis
• Transportation: identifying traffic patterns and predicting traffic conditions
• Big Data in Health Care
• Conservation of Natural Resources
• Waste Management
• Operationally and economically efficient education system
• Resident services
• Global Warming
• Revenue Opportunities – Tourism and Tax patterns
Conclusion
• Identify the Problem statement
• Align Hadoop with Business strategy
• Hadoop is not Big Data but part of the ecosystem.
• Big Data Solution is both Art and Science
• Rationalizing the approach is extremely critical
• Hadoop and conventional EDW need to co-exist
• Attract top talent by using the innovative and latest technology.
• Operational, Execution and Business efficiency should be the success criteria
• Data privacy, legal, and access regulations have to be an integral part of the design and metadata
Appendix
Storage Capacity Terms
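The data-growth figures quoted earlier (quintillions of bytes, zettabytes) are easier to compare once they are expressed in named units. The sketch below assumes decimal (SI) units, where each unit is 1000 times the previous one; `humanize` is a hypothetical helper.

```python
# Decimal (SI) storage units, each 1000 times the previous one.
UNITS = ["bytes", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]


def humanize(num_bytes):
    """Format a byte count using the largest unit that keeps the value >= 1."""
    value = float(num_bytes)
    for unit in UNITS:
        if value < 1000 or unit == UNITS[-1]:
            return f"{value:g} {unit}"
        value /= 1000


print(humanize(2.5 * 10**18))   # the "2.5 quintillion bytes" per day figure
print(humanize(35 * 10**21))    # the 35 zettabyte projection for 2020
```

A quintillion is 10^18, so the daily figure corresponds to exabytes, three orders of magnitude below the zettabyte projections.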
Big Data Example – by Type
Variety: varying types of data, used in combination.
Structured | Semi-Structured | Unstructured
... Time=…&customer=…&product=… ...