Agenda
• Big Data Overview
• Business Cases and Benefits
• Hadoop Technology Architecture
• Big Data Development Process
• Summary
Big Data Overview
What is Big data
• Big data analytics is concerned with the analysis of large volumes of
transaction/event data and behavioral analysis of human/human and
human/system interactions. (Gartner)
• Big data represents the collection of technologies that handle large
data volumes well beyond that inflection point and for which, at least
in theory, hardware resources required to manage data by volume
track to an almost straight line rather than a curve. (IDC)
Structured … Non-Structured
Level              Example
Structured         Relational database
Semi-structured    XML data files
Quasi-structured   Text documents
Unstructured       Images and video
A new class of problems has emerged which demands an ability to accept and
manage data without advance knowledge of its structure or format.
Unstructured Data Growth Trends
Big Data, An Integrated Architecture
Stages shown in the diagram: Capture, Store/Process, Integrate, Organize, Analyze, Govern
Components shown in the diagram:
• Structured sources: Transaction Data, Master & Ref Data
• Semi-structured sources: Real-Time, Machine Generated
• Unstructured sources: Social Media; Text, Image, Video, Audio
• Data stores and processing: DBMS (OLTP), Key-Value Data Store, Hadoop Cluster (MapReduce)
• Integration: Message-Based, DB Replication, ETL/ELT, Change Data Capture (ChangeDC)
• Warehousing and streaming: Data Warehouse, ODS, Data Marts, Streaming (CEP Engine)
• Analytics and delivery: Advanced Analytics, Visual Discovery, Text Analytics and Search, Reporting & Dashboards, Alerting, In-Database Analytics, EPM, BI Applications
• Across all stages: Management, Security, Governance
What is Big data
• Volume
• Velocity
• Variety
• Value
Example sources: Social, Blog, Smart Meter
Source: oracle.com
Business Cases and Benefits
Hadoop Use Cases
• CRM, Customer Analysis, Social Marketing
• Telco: Network Analysis, Quality of Service
• Public Service: Crime Analysis, Flooding Alert, City Planning
• Healthcare: Patient Safety, EMR, Next Best Action (NBA)
• Retail: Everyday Low Price, Offer better quality products, Next Best
Action (NBA)
• Finance: Risk Management, Loan Origination, Credit Line, Wealth
Management
• HCM, Talent Management, Social Analytics, etc.
What does a Big Data World look like?
Utilities
What they collect
• Smart Metering: Monitors power usage
How they use it
• Better demand planning
• Better targeted marketing
• Better targeted products based on individuals' power needs
• The ability to predict demand at household level
• Reduce exposure to spot market
Big Data means…
1. Blood pressure, heart rate, sleep, and coach tracking (Integrated Medical Devices)
2. Government creates/revises National Health Program
3. Public/Private Hospital executes Health Program integrated with EHR/EMR Systems
4. Health Tracking: health checkup records
Big Data Cloud
• Users: Government Officer, Doctor, Nurse, Officer
• Personal Health Improvement: Suggested Health Improvement (Secured Personal Access)
• Public Healthcare: Management of Outbreak Through Early Detection of Clusters
Hadoop Technology
What is Hadoop?
• A scalable fault-tolerant distributed system for data
storage and processing
• Its scalability comes from the marriage of:
• HDFS: Self-Healing High-Bandwidth Clustered Storage
• MapReduce: Fault-Tolerant Distributed Processing
• Operates on structured and complex data
• A large and active ecosystem (many developers and
additions like HBase, Hive, Pig, …)
• Open source under the Apache License
• http://wiki.apache.org/hadoop/
Hadoop History
• 2002-2004: Doug Cutting and Mike Cafarella started working on Nutch
• 2003-2004: Google publishes GFS and MapReduce papers
• 2004: Cutting adds DFS & MapReduce support to Nutch
• 2006: Yahoo! hires Cutting, Hadoop spins out of Nutch
• 2007: NY Times converts 4 TB of archives using 100 EC2 instances
• 2008: Web-scale deployments at Yahoo!, Facebook, Last.fm
• April 2008: Yahoo! does fastest sort of 1 TB, 3.5 minutes over 910 nodes
• May 2009:
• Yahoo! does fastest sort of 1 TB, 62 seconds over 1,460 nodes
• Yahoo! sorts 1 PB in 16.25 hours over 3,658 nodes
• June 2009, Oct 2009: Hadoop Summit, Hadoop World
• September 2009: Doug Cutting joins Cloudera
apache.org/hadoop/
Basic Architecture
• Client
• Master services: Name Node (HDFS), Job Tracker (MapReduce)
• Worker nodes: each runs a Data Node (HDFS) and a Task Tracker (MapReduce)
HDFS: Hadoop Distributed File System
• Block Size = 64MB
• Replication Factor = 3
• Cost/GB is a few ¢/month vs $/month
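To make the block size and replication factor concrete, the following is a minimal sketch of how a file's footprint could be estimated from those two settings. The function name `hdfs_footprint` and the 1000 MB file are hypothetical, chosen only for illustration; the 64 MB and 3 values are the ones listed on the slide.

```python
import math

BLOCK_SIZE_MB = 64       # block size from the slide
REPLICATION_FACTOR = 3   # replication factor from the slide


def hdfs_footprint(file_size_mb):
    """Return (number of blocks, raw storage in MB across all replicas)."""
    # A file is split into fixed-size blocks; the last one may be partial.
    blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    # A partial last block only occupies the bytes it holds, so raw usage
    # is the file size times the replication factor.
    raw_mb = file_size_mb * REPLICATION_FACTOR
    return blocks, raw_mb


# Hypothetical 1000 MB file, for illustration only.
blocks, raw_mb = hdfs_footprint(1000)
print(blocks, raw_mb)
```

Under these assumptions, every block is stored three times on different Data Nodes, which is why raw capacity planning starts at roughly three times the logical data size.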
MapReduce: Distributed Processing
Hadoop
• MapReduce (Job Scheduling/Execution System)
• HDFS (Hadoop Distributed File System)
Hadoop Ecosystem
• MapReduce (Job Scheduling/Execution System)
• HDFS (Hadoop Distributed File System)
• HBase
• Hive
• Sqoop
• ZooKeeper
• Flume
Use The Right Tool For The Right Job
Hadoop: When to use?
• Affordable Storage/Compute
• Structured or Not (Agility)
• Resilient Auto Scalability
Relational Databases: When to use?
• Interactive Reporting (<1sec)
• Multistep Transactions
• Lots of Inserts/Updates/Deletes
Example – Thai Content Analytics
Thai WordCount
• Input Content
• The results
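The deck does not show the WordCount code itself, so the following is only a sketch of the general map, shuffle, and reduce steps behind a word count, run in a single Python process rather than on a Hadoop cluster. The function names and the sample lines are hypothetical. Thai text is not separated by spaces, so a real job would need a word-segmentation step first; this sketch assumes the input is already segmented into space-separated words.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every whitespace-separated token."""
    return [(word, 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce over an iterable of text lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


# Hypothetical, pre-segmented sample input; not the deck's actual content.
sample = ["ข้าว น้ำ ข้าว", "น้ำ ปลา"]
print(word_count(sample))
```

On a cluster, the map and reduce functions would run as separate tasks on the Task Trackers, and the grouping step would be handled by the framework between the two phases.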
Example - Hive
Hive lets non-Java developers query data in Hadoop with SQL-like statements (HiveQL)
hive (default)> select * from test_tbl;
OK
1 USA
62 Indonesia
63 Philippines
65 Singapore
66 Thailand
Time taken: 0.287 seconds
hive (default)>
Note: Supports only text content with a column separator
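As a rough illustration of what "text content with a column separator" means, the sketch below reads delimited text into rows, similar in spirit to how the query above returns two columns per line. The tab separator, the `read_rows` helper, and the in-memory sample are assumptions; the deck does not show the underlying file or the table definition.

```python
import csv
import io

# Hypothetical tab-separated contents matching the rows in the query output.
RAW = "1\tUSA\n62\tIndonesia\n63\tPhilippines\n65\tSingapore\n66\tThailand\n"


def read_rows(text, separator="\t"):
    """Split each line on the separator and cast the first column to int."""
    reader = csv.reader(io.StringIO(text), delimiter=separator)
    return [(int(code), name) for code, name in reader]


for code, name in read_rows(RAW):
    print(code, name)
```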
Big Data Development Process
Big Data Development Process Guideline
Planning
• Targeted Users
• Target Opportunities
• Data Scientist
• Data Source/Type
• Data Capturing Approach
• Data Processing and Visualization Planning
Architecture
• Technology Architecture
• Big Data Ecosystem (Hadoop Ecosystem)
• Sizing
• Integration
• Security