• No results found

Business Intelligence for Big Data

N/A
N/A
Protected

Academic year: 2021

Share "Business Intelligence for Big Data"

Copied!
49
0
0

Loading.... (view fulltext now)

Full text

(1)

Business Intelligence

for Big Data

(2)

Business Intelligence =

reports, dashboards, analysis,

visualization, alerts, auditing,

data transformation

(3)
(4)

Example Hadoop BI Use Cases Today

Transactional

•Fraud detection

•Financial services/stock markets

Sub-Transactional

•Weblogs

(5)

Example Hadoop BI Use Cases Today

Non-Transactional

•Web pages, blogs etc •Documents

•Physical events

•Application events •Machine events

(6)
(7)

Data Lake

• Single source • Large volume • Not distilled

(8)

Data Lakes

• 0-2 data lakes per company

• Known and unknown questions • $1-10k questions, not $1m ones • Multiple user communities

(9)

Data Lake Requirements

• Store all the data

• Satisfy routine reporting and analysis

(10)
(11)

Big Data Architecture

Data Mart(s) Ad-Hoc

Data Lake(s)

Data Warehouse

(12)

Does Hadoop Replace Data Marts?

• If it behaves like database

• If it has low latency (sub-second) Hadoop (to date)

(13)

Structured

BI Tools Need...

(14)

Why Hadoop and BI?

• Distributed, scalable file system • Distributed processing

• Commodity hardware

• Scales out beyond technology and/or economy of a RDBMS

(15)

“The working conditions

within Hadoop are shocking”

ETL Developer

(16)

Hadoop and BI?

(17)

Hadoop and BI?

public void map( Text key, Text value,

OutputCollector output, Reporter reporter)

public void reduce( Text key,

Iterator values,

OutputCollector output, Reporter reporter)

(18)

MapReduce Limitations

Doing everything with MapReduce is like doing everything with recursion.

(19)

Google’s Use Case

• Needed to index the internet • Huge set of unstructured data • Predetermined input

• Predetermined output (the index) • Predetermined questions

• Single user community

(20)

Yahoo’s Use Case

• Needed to index the internet • Huge set of unstructured data • Predetermined input

• Predetermined output (the index) • Predetermined questions

• Single user community

(21)

Current Use Cases

• Not indexing the internet

• Huge set of semi/structured data • Different input source and format • Different outputs

• Different questions

• Multiple user communities

• Need parallel processing and storage

(22)

Unfortunately Hadoop

wasn’t designed

(23)

Hadoop’s Strengths and Weaknesses

• Distributed processing • Distributed file system • Commodity hardware

• Scales out beyond technology and/or economy of a RDBMS

But...

(24)

BI and Hadoop Architecture

Until Hadoop behaves and performs like a database a hybrid architecture is needed

• Data sources • Hadoop

(25)

Optimize Visualize Load Files / HDFS Hive DM & DW

Applications & Systems

App Tier RDBMS Hadoop

(26)
(27)
(28)

If only we had a

Java, embeddable,

(29)

Pentaho Data Integration

Hadoop PDI Engine

(30)

Hadoop Architecture

Clients Java/

Python

Job

Tracker TrackerTask TrackerTask TrackerTask

Name

Node NodeData NodeData NodeData

Map/ Reduce

File System

(31)

Pentaho/Hadoop Architecture - External

Pentaho Data Integration

• Move files

• Read HDFS files • Write HDFS files

• Execute MapReduce jobs

• Other ETL operations

Job

Tracker TrackerTask TrackerTask TrackerTask

Name

Node NodeData NodeData NodeData

Map/ Reduce

(32)

Pentaho/Hadoop Architecture - Internal

Job

Tracker TrackerTask TrackerTask TrackerTask

Name

Node NodeData NodeData NodeData

Map/ Reduce

File System Client

• Exec ETL in parallel

(33)

Task Tracker

Pentaho/Hadoop Architecture - Internal

Task Tracker Data Node PDI Engine PDI Map Class Reader

Class CollectorOutput Reducer

Inject Listen

Output Class

(34)

Files / HDFS Hive DM & DW App Tier RDBMS Hadoop M e t a d a t a

Reporting / Dashboards / Analysis

(35)

Applications & Systems

Reporting / Dashboards / Analysis

App Tier RDBMS Hadoop

(36)
(37)

FAQ

1. Will Pentaho contribute to Apache’s

Hadoop projects? Yes

2. Will Pentaho distribute Hadoop as part of

their product? Unlikely

3. What version of Hadoop will be

supported? Initially 20.2

4. Will Pentaho’s APIs allow existing open

(38)

FAQ

5. Will Pentaho provide support or services

to help setup Hadoop? Yes, no, maybe

6. What are the requirements to be in the Pentaho Hadoop beta program?

(39)
(40)

Structured

BI Tools Need...

(41)

For Modeling...

• Data rich

• Metadata poor

• Sample = table scan

(42)

BI Tools Don’t Need

• CREATE / INSERT • UPDATE

• DELETE

(43)

Mondrian (OLAP) Needs Required: •SELECT •FROM •WHERE •GROUP BY •ORDER BY Nice to have: •HAVING

•ORDER BY ... NULLS COLLATE •COUNT(DISTINCT x,y)

(44)

Side Topic:

(45)

Can I Use Hadoop

as a Data Warehouse?

(46)

No, probably not*

(47)

What is a Data Warehouse?

Data Mart

•Data structured for query and reporting

Data Warehouse

•What you get if you create data marts for

(48)
(49)

More information

www.pentaho.com/hadoop

References

Related documents

Sketched  out  specific  action  plans  and  ideas  associated  with  those  objectives  and  determined   accountability  and  evaluation  measures.. Challenges

 First  I  analyze  the  leadership  structures  governing  AmeriCorps  volunteers...  This  represents  strong

It was also noted that the schemas like mistrust/abuse, social self-alienation, emotional inhibition, and abandonment had the highest correlation with the love

FSFI ) دش هتفرگ.. داعبا یتخانش و یراتفر لماوع و تسا یطابترا و یناور ،یتسیز فلتخم :تسا هدناجنگ یشزومآ هتسب رد ار تراهم شزومآ ،یطابترا یاه اب هطبار رد طلغ یلو عیاش

Amitai Etzioni, beroemd vanwege zijn voorkeur voor het communitarisme, stelt: “privacy is one good amongst other goods, and should be weighed as such” (Etzioni, 2007, p. Het

The results of the experiments show that the proposed approaches are highly effective in the detection of email-born phishing attacks as well as in the identification

Distal Dendritic Enrichment of HCN1 Is Present in Cultured Organotypic Hippocampal Slices but Not Dissociated Pyrami- dal Neurons —To study the basic mechanism controlling h

This paper is organized as follows: In Section II, we explore previous work on feature map compression in neural networks and the exploitation of feature map sparsity, placing a