Big Data Big Data/Data Analytics & Software Development

Academic year: 2021

(1)

Big Data

Big Data/Data Analytics & Software Development

Danairat T.

[email protected], 081-559-1446

(2)

Agenda

Big Data Overview

Business Cases and Benefits

Hadoop Technology Architecture

Big Data Development Process

Summary

(3)

Big Data Overview

(4)

What is Big data

• Big data analytics is concerned with the analysis of large volumes of transaction/event data and behavioral analysis of human/human and human/system interactions. (Gartner)

• Big data represents the collection of technologies that handle large data volumes well beyond that inflection point and for which, at least in theory, hardware resources required to manage data by volume track to an almost straight line rather than a curve. (IDC)

(5)

Structured ….. Non Structured

Level              Example
Structured         Relational database
Semi-structured    XML data files
Quasi-structured   Text documents
Unstructured       Images and video

A new class of problems has emerged which demands an ability to accept and manage data without advance knowledge of its structure or format.

(6)

Unstructured Data Growth Trends

(7)

Big Data

Big Data, An Integrated Architecture

[Diagram labels]

Stages: Capture, Store/Process, Integrate, Organize, Analyze, Govern

Data categories: Structured, Semi-structured, Unstructured

Data sources: Transaction Data; DBMS (OLTP); Master & Ref Data; Real-Time; Machine Generated; Social Media; Text, Image, Video, Audio

Components: Key-Value Data Store; Hadoop Cluster MapReduce; Message-Based; DB Replic; ETL/ELT; ChangeDC; Data Warehouse; ODS; Data Marts; Streaming (CEP Engine); In-Database Analytics; Advanced Analytics; Visual Discovery; Text Analytics and Search; Reporting & Dashboards; Alerting; EPM; BI Applications

Side label: Management, Security, Governance

(8)

What is Big data

VOLUME VELOCITY VARIETY VALUE

Social, Blog, Smart Meter

Source: oracle.com

(9)

Business Cases and Benefits

(10)

Hadoop Use Cases

CRM, Customer Analysis, Social Marketing

Telco: Network Analysis, Quality of Service

Public Service: Crime Analysis, Flooding Alert, City Planning

Healthcare: Patient Safety, EMR, Next Best Action (NBA)

Retail: Everyday Low Price, Offer better quality products, Next Best Action (NBA)

Finance: Risk Management, Loan Origination, Credit Line, Wealth Management

HCM, Talent Management, Social Analytics, Etc.

(11)

What does a Big Data World look like?

Utilities

What they collect

• Smart Metering: monitors power usage

How they use it

• Better demand planning
• Better targeted marketing
• Better targeted products based on individuals' power needs

Big Data means…

• The ability to predict demand at household level
• Reduce exposure to spot market

(12)

Personal Health Improvement

[Diagram labels]

1. Blood Pressure, Heart rate Tracking, Sleep Tracking, Coach Tracking (Integrated Medical Device)

2. Government creates/revises National Health Program

3. Public/Private Hospital executes Health Program integrated EHR/EMR Systems

4. Health Tracking, Health checkup records

Big Data Cloud

Suggested Health Improvement (Secured Personal Access)

Roles: Government Officer, Doctor, Nurse, Officer

(13)

Public Healthcare

Management of Outbreak Through Early Detection of Clusters

(14)

Hadoop Technology

(15)

What is Hadoop?

A scalable fault-tolerant distributed system for data storage and processing

• Its scalability comes from the marriage of:

HDFS: Self-Healing High-Bandwidth Clustered Storage

MapReduce: Fault-Tolerant Distributed Processing

• Operates on structured and complex data

• A large and active ecosystem (many developers and additions like HBase, Hive, Pig, …)

Open source under the Apache License

http://wiki.apache.org/hadoop/

(16)

Hadoop History

2002-2004: Doug Cutting and Mike Cafarella started working on Nutch

2003-2004: Google publishes GFS and MapReduce papers

2004: Cutting adds DFS & MapReduce support to Nutch

2006: Yahoo! hires Cutting, Hadoop spins out of Nutch

2007: NY Times converts 4TB of archives over 100 EC2s

2008: Web-scale deployments at Y!, Facebook, Last.fm

April 2008: Yahoo does fastest sort of a TB, 3.5 minutes over 910 nodes

May 2009:

• Yahoo does fastest sort of a TB, 62 seconds over 1,460 nodes

• Yahoo sorts a PB in 16.25 hours over 3,658 nodes

June 2009, Oct 2009: Hadoop Summit, Hadoop World

September 2009: Doug Cutting joins Cloudera

apache.org/hadoop/

(17)

Basic Architecture

Client

(18)

Basic Architecture

Client

Name Node

HDFS

(19)

Basic Architecture

Client

Name Node

Data Node Data Node Data Node

HDFS

(20)

Basic Architecture

Client

Name Node Job Tracker

Data Node

Task Tracker

Data Node

Task Tracker

Data Node

Task Tracker

HDFS Map Reduce

(21)

Basic Architecture

Client

HDFS: Name Node, Data Nodes

Map Reduce: Job Tracker, Task Trackers

Nine worker nodes shown, each pairing a Data Node with a Task Tracker

(22)

Block Size = 64MB

Replication Factor = 3

HDFS: Hadoop Distributed File System

Cost/GB is a few

¢/month vs $/month
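
The block size and replication factor on this slide can be turned into a quick sizing calculation. The sketch below is in Python and assumes the slide's values (64 MB blocks, replication factor 3). The function name `hdfs_footprint` is illustrative and not part of Hadoop. It relies on the fact that the last block of a file occupies only as much disk space as the data it holds.

```python
import math

BLOCK_SIZE_MB = 64        # block size from the slide
REPLICATION_FACTOR = 3    # replication factor from the slide


def hdfs_footprint(file_size_mb):
    """Return (blocks, block_replicas, raw_storage_mb) for one file.

    Hypothetical helper for back-of-the-envelope sizing only.
    """
    # A file is split into fixed-size blocks; the last one may be partial.
    blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    # Every block is stored REPLICATION_FACTOR times across Data Nodes.
    block_replicas = blocks * REPLICATION_FACTOR
    # A partial last block is not padded, so raw storage scales with file size.
    raw_storage_mb = file_size_mb * REPLICATION_FACTOR
    return blocks, block_replicas, raw_storage_mb


if __name__ == "__main__":
    print(hdfs_footprint(1024))  # a 1 GB file
```

Under these assumptions a 1 GB file needs 16 blocks, 48 block replicas, and about 3 GB of raw disk, which is the kind of figure the capacity-planning step needs.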

apache.org/hadoop/

(23)

MapReduce: Distributed Processing

(24)

MapReduce

(Job Scheduling/Execution System)

Hadoop

HDFS

(Hadoop Distributed File System)

(25)

MapReduce

(Job Scheduling/Execution System)

Hadoop Ecosystem

HDFS

(Hadoop Distributed File System)

Hive, Sqoop, ZooKeeper, Flume, HBase

(26)

Use The Right Tool For The Right Job

Hadoop: when to use?

• Affordable Storage/Compute
• Structured or Not (Agility)
• Resilient Auto Scalability

Relational Databases: when to use?

• Interactive Reporting (<1sec)
• Multistep Transactions
• Lots of Inserts/Updates/Deletes

apache.org/hadoop/

(27)

Example – Thai Content Analytics

Thai WordCount

Input Content

(28)

Example – Thai Content Analytics

Thai WordCount

The results
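
The slides do not show the WordCount job itself, so the following is only a minimal Python sketch of the MapReduce pattern behind it, run in a single process rather than on a Hadoop cluster. All function names are illustrative. Thai text is written without spaces between words, so a real job would need a word segmenter. This sketch assumes the input has already been segmented into space-separated tokens.

```python
from collections import defaultdict


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every token in the line."""
    for word in line.split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(word, counts):
    """Reducer: sum the counts collected for one word."""
    return word, sum(counts)


def word_count(lines):
    """Run map, shuffle and reduce in sequence over an iterable of lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(pairs)
    return dict(reduce_phase(word, counts) for word, counts in grouped.items())


if __name__ == "__main__":
    # Assumed sample input: Thai tokens already separated by spaces.
    sample = ["ข้อมูล ขนาด ใหญ่", "ข้อมูล"]
    for word, count in word_count(sample).items():
        print(word, count)
```

On a cluster, the map and reduce functions would run in parallel on the Task Trackers, and the shuffle step would be handled by the framework.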

(29)

Example - Hive

Hive lets non-Java developers query data in Hadoop using SQL

hive (default)> select * from test_tbl;
OK
1 USA
62 Indonesia
63 Philippines
65 Singapore
66 Thailand
Time taken: 0.287 seconds
hive (default)>

Note: Supports only text content with a column separator
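
To illustrate what "text content with a column separator" means, here is a small Python sketch that reads delimited text lines into rows like those returned by the query above. It is not Hive code. The tab separator, the two-column layout (integer code, name), and the function name `parse_rows` are assumptions made for the example, since the slide does not show the table definition.

```python
def parse_rows(lines, separator="\t"):
    """Split each delimited text line into a (code, name) row.

    Assumes two columns: an integer code and a name (hypothetical schema).
    """
    rows = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        code, name = line.split(separator, 1)
        rows.append((int(code), name))
    return rows


if __name__ == "__main__":
    # Sample lines mirroring the query output, with an assumed tab separator.
    sample = ["1\tUSA", "62\tIndonesia", "63\tPhilippines",
              "65\tSingapore", "66\tThailand"]
    for row in parse_rows(sample):
        print(row)
```

The point of the sketch is that the structure is applied when the file is read: the stored data stays plain text, and the separator tells the reader where one column ends and the next begins.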

(30)

Big Data Development Process

(31)

Big Data Development Process Guideline

Architecture Planning

• Targeted Users
• Target Opportunities
• Data Scientist
• Data Source/Type
• Data Capturing Approach
• Data Processing and Visualize Planning
• Technology Architecture
• Big Data EcoSystem (Hadoop Ecosystem)
• Sizing
• Integration
• Security

Big Data Development

• Develop Use Cases
• Set up Big Data Pseudo-distribution Mode
• Set up HDFS
• Develop Data Capturing System
• Develop Data Analytic
• Map Reduce
• Hive
• R
• Etc.
• Integrate result to Enterprise Analytic System

• Set up Big Data

Operation

and Support

• Monitor HDFS utilization and capacity planning
• Monitor Job Tracker availability
• Monitor Data Capturing System
• Upgrade or Patch Big Data Hadoop ecosystem

• System admin.

Training

• Helpdesk Training

• End-User Training (Analytic Results)

System Evaluation

• Adoption Rates for each analytics results
• No. of Missing Analytic Results
• No. of Missing Data
• Lost hours per month
• Avg. of each Analytic Result Response Time
• No. of Technology System Failure per month

(32)

Summary

(33)

Big Data

Big Data, An Integrated Architecture

[Diagram repeated from slide (7)]

(34)

Thank you very much
