Data Warehouse design


Design of Enterprise Systems University of Pavia


Big Data Overview

Big Data DW & BI

Big Data Market

Hadoop & Mahout


BIG DATA OVERVIEW


Big Data Overview: Table of Contents

– Data Growth
– Definition
– Big Data vs. Relational Data
– Its Value
– Big Data


Big Data Overview: Data Growth

• Storage capacity increases by 23% per year on average
• The ability to store all the available information is coming to an end
• A decade of exponential growth starts from 2010

[Figure: Data Storage Growth (exabytes by year)]
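To get a feel for what a 23% average annual increase implies, the short sketch below compounds that rate over several years and derives the doubling time. It assumes a constant rate, which the slide does not claim; the function names are illustrative only.

```python
import math

ANNUAL_GROWTH = 0.23  # 23% average annual increase, taken from the slide


def capacity_multiplier(years: int, rate: float = ANNUAL_GROWTH) -> float:
    """Factor by which capacity grows after `years` of compounding."""
    return (1.0 + rate) ** years


def doubling_time(rate: float = ANNUAL_GROWTH) -> float:
    """Years needed for capacity to double at a constant annual rate."""
    return math.log(2.0) / math.log(1.0 + rate)


if __name__ == "__main__":
    for years in (1, 5, 10):
        print(f"after {years:2d} years: x{capacity_multiplier(years):.2f}")
    print(f"doubling time: {doubling_time():.2f} years")
```

Under this assumption capacity roughly doubles every three to four years, which is the gap the slide points at when data volumes grow faster than that.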


Big Data Overview: Definition

• Gartner definition (2012): "Big data is high volume, high velocity, and/or high variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization."


Big Data Overview: Big Data V.S. Relational Data

Application | Relation-Based Data | Big Data
Data processing | Single-computer platform that scales with better CPUs, centralized processing. | Cluster platforms that scale to thousands of nodes, distributed processing.
Data management | Relational database (SQL), centralized storage. | Non-relational databases that manage varied data types and formats (NoSQL), distributed storage.
Analytics | Batched, descriptive, centralized. | Real-time, predictive and prescriptive, distributed analytics.


Big Data Overview: Its Value 1/3

Several classes of company head the revenue chart ($11.59 billion):
• broad-portfolio tech giants (IBM, HP, Oracle, EMC)
• leading software houses (Teradata, SAP, Microsoft)
• professional services companies (PwC, Accenture)

Source: Wikibon, Big Data Vendor Revenue and Market Forecast 2012-2017


Big Data Overview: Its Value 2/3

• Pure play: vendors who derive 100 percent of their revenue from this market

Source: Wikibon, Big Data Vendor Revenue and Market Forecast 2012-2017


Big Data Overview: Its Value 3/3

Source: Worldwide Big Data Technologies and Services: 2012-2015 Forecast (IDC, 2012)

• IDC: Big data will become a $17 billion business by 2015 ($23.8 billion by 2016)
• Big data storage will account for 6.8% of the entire worldwide storage market by 2015


Big Data Overview: Big Data Benefits

Business benefits received by implementing an effective Big Data


Big Data Overview: Big Data Usage 1/2

• E-Commerce and Market Intelligence
– Recommender system
– Social media monitoring and analysis
– Crowd-sourcing systems
– Social and virtual games
• E-Government and Politics 2.0
– Ubiquitous government services
– Equal access and public services
– Citizen engagement
• Science & Technology
– S&T innovation
– Hypothesis testing
– Knowledge discovery
• Smart Health and Wellbeing
– Human and plant genomics
– Healthcare decision support
– Patient community analysis
• Security and Public Safety
– Crime analysis
– Computational criminology
– Terrorism informatics
– Open-source intelligence
– Cyber security


Big Data Overview: Big Data Usage 2/2


Big Data Overview: Challenges 1/2

Main challenges companies face with Big Data. The survey is based on 1,153 responses from 325 respondents.


Big Data Overview: Challenges 2/2

A survey of European companies from Steria's Business Intelligence Maturity Audit (biMA)
• Technical
– 38% have data quality problems
– A lack of data governance; no master data management system (38%)
• Organizational
– 72% have no BI strategy; 70% have no BI governance
– Only 7% rate big data as very relevant

Source: http://www.steria.com/uk/media-centre/press-releases/press-releases/article/survey-suggests-only-7-of-european-companies-rate-big-data-as-very-relevant-to-their-business/


BIG DATA, DW & BI


Big Data, DW & BI: Table of Contents


BI Evolution

BI&A 1.0
Key characteristics: DBMS-based, structured content
– RDBMS & data warehousing
– ETL & OLAP
– Dashboards & scorecards
– Data mining & statistical analysis
Gartner BI platforms core capabilities:
– Ad hoc query & search-based BI
– Reporting, dashboards & scorecards
– OLAP
– Interactive visualization
– Predictive modeling & data mining
Gartner Hype Cycle:
– Column-based DBMS
– In-memory DBMS
– Real-time decision
– Data mining workbenches

BI&A 2.0
Key characteristics: Web-based, unstructured content
– Information retrieval and extraction
– Opinion mining
– Question answering
– Web analytics and web intelligence
– Social media analytics
– Social network analysis
– Spatial-temporal analysis
Gartner Hype Cycle:
– Information semantic services
– Natural language question answering
– Content & text analytics

BI&A 3.0
Key characteristics: Mobile and sensor-based content
– Location-aware analysis
– Person-centered analysis
– Context-relevant analysis
– Mobile visualization & HCI
Gartner Hype Cycle:
– Mobile BI


Big Data Overview: Techniques 1/2

In 2011 the McKinsey Global Institute provided a list of the top 10 common techniques applicable across a range of industries, particularly in response to the need to analyze new volumes of data and their combinations.

A/B Testing
A technique in which a control group is compared with a variety of test groups in order to determine what treatments will improve a given objective. An example application is determining what copy text, layouts, images, or colors will improve conversion rates on an e-commerce Web site. Big Data enables huge numbers of tests to be executed and analyzed.

Cluster Analysis
A statistical method for classifying a huge data set and, in particular, for identifying common behavior.

Classification
A set of techniques to identify the categories in which new data points belong, based on a training set containing data points that have already been categorized.

Data Mining
A set of techniques and technologies for extracting patterns from large datasets by combining statistical and algorithmic methods. These techniques include association rule learning, cluster analysis, classification, and regression.
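As an illustration of how an A/B test result might be evaluated, the sketch below compares the conversion rates of a control and a test group with a pooled two-proportion z-test. The choice of this particular test and the sample figures are assumptions made for the example; the slides do not prescribe a method.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates.

    conv_a / n_a: conversions and visitors in the control group
    conv_b / n_b: conversions and visitors in the test group
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n_a + 1.0 / n_b))
    z = (p_b - p_a) / std_err
    # Two-sided p-value under the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


if __name__ == "__main__":
    # Hypothetical figures: 200 of 10,000 visitors convert on the control
    # page, 260 of 10,000 on the variant.
    z, p = two_proportion_z_test(200, 10_000, 260, 10_000)
    print(f"z = {z:.2f}, p = {p:.4f}")
```

With many variants tested at once, as the slide suggests Big Data makes possible, the p-values would also need a multiple-comparison correction, which this sketch leaves out.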


Big Data Overview: Techniques 2/2


List of the top 10 techniques that require Big Data (2/2)

Network analysis

A set of techniques used to characterize relationships among discrete nodes in a graph or a network. In social network analysis, connections between individuals in a community or organization are analyzed.

Predictive modeling

A set of techniques in which a mathematical model is created or chosen to best predict the probability of an outcome.

Sentiment analysis

Application of natural language processing and other analytic techniques to identify and extract subjective information from source text material.

Statistics

The science of the collection, organization, and interpretation of data, including the design of surveys and experiments. Statistical techniques are often used to understand the relationships between all the variables.


Big Data: Cost 1/2

• ESG (Enterprise Strategy Group) provides an analysis of the costs of Big Data, in particular a comparison between a "build" and a "buy" solution.

Item | Cost | Notes
Servers | $400,000 | @ $22k each; enterprise class with dual power supplies, 36 TB of serial attached SCSI (SAS) storage, 48-64 GB memory, 1 rack
Server support | $60,000 | @ 15% of server cost
Switches | $15,000 | 3 @ $5k for InfiniBand; older network switches will run at least 3x the cost of InfiniBand
Distribution/systems management software | $90,000 | Cloudera: 18 nodes @ $5k each
Integration | $100,000 | Licenses and dedicated hardware
Information Management Tools | $20,000 | 320 hours @ $100/hour human cost
Node Configuration and Implementation | $16,000 | 8 hours/node, 20 nodes = 160 hours, $100/hour
Build Project Costs | $733,000 | Those project items where a "buy" option exists
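The arithmetic in the Notes column can be re-derived with a few lines of Python. The sketch below recomputes only the rows whose note gives an explicit formula and then sums the rows as they appear on this slide; the variable names are illustrative.

```python
# Cost rows as listed on the slide (USD).
listed_costs = {
    "Servers": 400_000,
    "Server support": 60_000,
    "Switches": 15_000,
    "Distribution/systems management software": 90_000,
    "Integration": 100_000,
    "Information Management Tools": 20_000,
    "Node Configuration and Implementation": 16_000,
}
STATED_TOTAL = 733_000  # "Build Project Costs" row

# Amounts that can be recomputed from the Notes column.
derived = {
    "Server support": listed_costs["Servers"] * 15 // 100,   # 15% of server cost
    "Switches": 3 * 5_000,                                   # 3 switches @ $5k
    "Distribution/systems management software": 18 * 5_000,  # 18 nodes @ $5k
    "Node Configuration and Implementation": 8 * 20 * 100,   # 8 h x 20 nodes x $100
}

for item, amount in derived.items():
    status = "ok" if amount == listed_costs[item] else "mismatch"
    print(f"{item}: listed {listed_costs[item]:,}, derived {amount:,} ({status})")

listed_sum = sum(listed_costs.values())
gap = STATED_TOTAL - listed_sum
print(f"sum of listed rows: {listed_sum:,}; stated total: {STATED_TOTAL:,}; gap: {gap:,}")
```

The four derivable rows agree with their notes. The rows as listed add up to $701,000, so the stated $733,000 presumably includes an amount not shown as its own row here; the $32,000 difference happens to equal 320 hours at $100/hour, but that reading is a guess rather than something the slide states.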


Big Data: Cost 2/2


Build Versus Buy Elements (Using Buy Pricing)

Item | Cost | Notes
Build Total | $733,000 |
Buy (Oracle Big Data Appliance) | $450,000 | List cost of Oracle Big Data Appliance for the same infrastructure and tasks
Buy (Oracle Big Data Appliance) Savings | $283,000 | Not lifecycle costs, just for initial project
ESG Estimated Savings | ~39% | Oracle Big Data Appliance lowers costs versus do-it-yourself
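The savings figures follow directly from the two totals. A minimal check, using only the numbers given in the comparison above:

```python
BUILD_TOTAL = 733_000  # build project costs (USD)
BUY_TOTAL = 450_000    # Oracle Big Data Appliance list cost (USD)

savings = BUILD_TOTAL - BUY_TOTAL
savings_pct = 100.0 * savings / BUILD_TOTAL

print(f"savings: ${savings:,} ({savings_pct:.1f}% of the build total)")
```

The percentage is taken relative to the build total, which is what reproduces the roughly 39% quoted by ESG; it covers the initial project only, not lifecycle costs.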


Big Data: Best Practices 1/3

First, we need to consider when it is suitable to use Big Data technologies:

• Analyzing a huge quantity of data, not only structured but also semi-structured and unstructured, from a wide variety of sources;
• All of the data gathered must be analyzed rather than a sample, or sampling is not as effective as analysis of a large amount of data;
• Iterative and exploratory analysis, when business measures on the data are not determined a priori;
• Solving information and business challenges that are not properly addressed by a traditional relational database approach.


Big Data: Best Practices 2/3

The best practices described here cover management, organizational, and technological aspects.

• Muting the HiPPOs: the most important decisions on how to retrieve and analyze data depend on the highest-paid person's opinions (HiPPOs). Today these people rely too much on intuition and experience rather than on the pure rationality of data, so this behavior needs to change;
• Start with initiatives that lead to customer-centric outcomes. For customer-oriented organizations it is very important to begin with customer analytics, which enable better services through a deep understanding of customers' needs and future behaviors;
• Develop an enterprise schema that includes the vision, the strategies, and the requirements for Big Data and helps align business users' needs with the IT implementation roadmap;
• To achieve near-term results, it is crucial to adopt a pragmatic approach, starting from the most logical and cost-effective place to look for insight, which is within the enterprise;


Big Data: Best Practices 3/3

• The effectiveness of Big Data analytics depends strictly on analytical skills and analytics tools, so enterprises should invest in acquiring both;
• The Big Data strategy and the business analytics should encompass an evaluation of the organization's decision-making processes as well as of the groups and types of decision makers;
• Try to uncover new metrics, key performance indicators, and analytics techniques to look at new and existing data in a different way in order to find new opportunities. This could require setting up a separate Big Data team whose purpose is to experiment and innovate;
• The final goal of a Big Data project is not to collect as much data as possible but to support concrete business needs and provide new, reliable information to decision makers;
• No single technology can meet all Big Data requirements. Different workloads, data types, and user types should each be served by the most suitable technology. For example, Hadoop could be the best choice for large-scale Web log analysis but is not suitable for real-time streaming at all. Multiple Big Data technologies must coexist and address the use cases for which they are optimized.


BIG DATA MARKET


Big Data Market Definition

IDC (2012) defines the big data market as an aggregation of storage, server, networking, software, and services market segments, each with several sub-segments.


Big Data Market Segments

• Services
– Business consulting, business process outsourcing, IT project-based services, IT outsourcing, IT support, and training services related to Big Data implementations
• Infrastructure
– External storage systems
– Servers (including internal storage, memory, network cards) and supporting system software, as well as spending for self-built servers by large cloud service providers
– Datacenter networking infrastructure used in support of Big Data server and storage infrastructure
• Software
– Data organization and management software, including parallel and distributed file systems and others
– Analytics and discovery software, including search engines used for Big Data applications, data mining, text mining, rich media analysis, data visualization, and others


Big Data Market Analysis

MarketsandMarkets – Big Data Market By Types (Hardware; Software; Services; BDaaS - HaaS; Analytics; Visualization as Service); By Software (Hadoop, Big Data Analytics and Databases, System Software (IMDB, IMC):


HADOOP & MAHOUT


Hadoop & Mahout: Table of Contents

Hadoop Overview
HDFS
– Structure
– File Write
– File Read
Map Reduce
– Structure
– Job Submission
– Job Execution
Hadoop Ecosystem
– HBase
– Pig
– Hive
Mahout
– Overview
– Algorithms


Hadoop: Overview

[Figure: Hadoop overview. A master node and slave nodes 1..N; each slave node provides storage (HDFS) and computing (Map-Reduce).]

• The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models
– Open source
– Scalable
– Distributed


Hadoop: HDFS Structure

[Figure: HDFS structure. The name node holds the metadata; a file is split into chunks 1, 2, 3, which are stored on data nodes 1..N.]

• The name node controls almost everything about storage
• Large files are partitioned into chunks and stored across multiple nodes
• File chunks are replicated to mitigate node failures
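The chunking and replication idea can be sketched in a few lines. The code below is a toy model, not HDFS code: the round-robin placement is a simplification (real HDFS placement is rack-aware), and the chunk size, replication factor, and node names are values chosen for illustration.

```python
def split_into_chunks(file_size, chunk_size):
    """Return (chunk_id, size) pairs covering the whole file."""
    chunks = []
    offset = 0
    while offset < file_size:
        size = min(chunk_size, file_size - offset)
        chunks.append((len(chunks), size))
        offset += size
    return chunks


def place_replicas(chunks, data_nodes, replication):
    """Assign each chunk to `replication` distinct nodes, round-robin.

    Toy policy for illustration only.
    """
    if replication > len(data_nodes):
        raise ValueError("replication factor exceeds number of data nodes")
    placement = {}
    for chunk_id, _size in chunks:
        placement[chunk_id] = [
            data_nodes[(chunk_id + r) % len(data_nodes)]
            for r in range(replication)
        ]
    return placement


def surviving_chunks(placement, failed_node):
    """Chunk ids that still have at least one replica after a node fails."""
    return [
        chunk_id
        for chunk_id, nodes in placement.items()
        if any(node != failed_node for node in nodes)
    ]


if __name__ == "__main__":
    chunks = split_into_chunks(file_size=300, chunk_size=128)  # sizes in MB
    placement = place_replicas(chunks, ["dn1", "dn2", "dn3", "dn4"], replication=3)
    print(chunks)
    print(placement)
    print(surviving_chunks(placement, failed_node="dn3"))
```

In this model the dictionary returned by `place_replicas` plays the role of the name node's metadata: it records which data nodes hold each chunk, so a read can fall back to another replica when one node is down.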


Hadoop: HDFS write


Hadoop: HDFS Read


Hadoop: Map-Reduce Structure

• The job tracker controls almost everything about computing
• Key concepts of Map-Reduce
– Computation goes with data

[Figure: Map-Reduce structure. A job tracker coordinates task trackers 1..N; each task tracker runs mappers and reducers.]
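The mapper and reducer roles can be illustrated with the classic word-count example. This is a single-process sketch in plain Python, not Hadoop code: the `shuffle` function stands in for the grouping the framework performs between the two phases, and all function names are illustrative.

```python
from collections import defaultdict


def map_phase(document):
    """Mapper: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reducer: combine all values emitted for one key."""
    return key, sum(values)


def word_count(splits):
    """Run map, shuffle, and reduce over a list of input splits."""
    mapped = [pair for split in splits for pair in map_phase(split)]
    return dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())


if __name__ == "__main__":
    print(word_count(["big data", "Big Data big"]))
```

On a cluster each split would be mapped on the node that already stores it, which is the sense of "computation goes with data".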


Hadoop: Job submission

• The initialization takes some time


Hadoop: Map-Reduce Execution


Hadoop Ecosystem: HBase

• HDFS
– Structured/semi-structured/unstructured data
– Write once, read many
• HBase is an open-source, distributed, versioned, column-oriented store modeled after Google's Bigtable
• Column-based database. It supports
– Insert
– Delete
– Update
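To make "versioned, column-oriented" more concrete, the sketch below models a table in which every write adds a new version of a cell and a delete is recorded as a marker rather than an in-place erase. It is an in-memory toy that only loosely mirrors how HBase behaves; the class, its methods, and the counter used in place of timestamps are all assumptions for illustration, not the HBase API.

```python
import itertools


class ToyColumnStore:
    """In-memory sketch of a versioned, column-oriented table (not HBase)."""

    _TOMBSTONE = object()  # marker written by delete()

    def __init__(self):
        # row key -> column name -> list of (version, value), oldest first
        self._rows = {}
        self._clock = itertools.count(1)  # stand-in for timestamps

    def put(self, row, column, value):
        """Insert or update: both append a new version of the cell."""
        version = next(self._clock)
        self._rows.setdefault(row, {}).setdefault(column, []).append((version, value))
        return version

    def delete(self, row, column):
        """Delete by writing a tombstone as the newest version."""
        return self.put(row, column, self._TOMBSTONE)

    def get(self, row, column):
        """Return the newest value, or None if absent or deleted."""
        versions = self._rows.get(row, {}).get(column, [])
        if not versions or versions[-1][1] is self._TOMBSTONE:
            return None
        return versions[-1][1]

    def history(self, row, column):
        """All non-deleted versions of a cell, oldest first."""
        versions = self._rows.get(row, {}).get(column, [])
        return [(v, val) for v, val in versions if val is not self._TOMBSTONE]


if __name__ == "__main__":
    store = ToyColumnStore()
    store.put("user1", "info:city", "Pavia")   # insert
    store.put("user1", "info:city", "Milan")   # update = new version
    print(store.get("user1", "info:city"))
    print(store.history("user1", "info:city"))
    store.delete("user1", "info:city")
    print(store.get("user1", "info:city"))
```

Because every operation is an append, this style of store fits on top of a write-once file system such as HDFS while still offering insert, update, and delete to the user.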


Hadoop Ecosystem: Hbase Storage model 1/3

Hadoop Ecosystem: Hbase Storage model 2/3

Hadoop Ecosystem: Hbase Storage model 3/3

Hadoop Ecosystem: Pig

• Hadoop
– Analysis requires a lot of Java code
– No scripting
• Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs
• Pig generates and compiles Map-Reduce programs on the fly


Hadoop Ecosystem: Pig Sample Scripts

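RawInput = LOAD '$INPUT' USING com.contextweb.pig.CWHeaderLoader('$RESOURCES/schema/wide.xml');
-- Keep only the fields needed for the aggregation ("input" is a reserved word in Pig, so the alias is renamed).
projected = FOREACH RawInput GENERATE ContextCategoryId AS Category, DefLevelId, TagId, URL, Impressions;
-- Keep rows whose DefLevelId is 8 or 12.
defFilter = FILTER projected BY (DefLevelId == 8) OR (DefLevelId == 12);
GroupedInput = GROUP defFilter BY (Category, TagId, URL);
-- After GROUP, the bag of rows is named after the grouped alias (defFilter).
result = FOREACH GroupedInput GENERATE group, SUM(defFilter.Impressions) AS Impressions;
STORE result INTO '$OUTPUT' USING com.contextweb.pig.CWHeaderStore();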
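For readers who do not know Pig Latin, the same filter, group, and sum steps can be written in plain Python. This is a single-machine sketch of what the script computes, not of how Pig executes it; the row dictionaries and sample values are hypothetical, and only the field names come from the script.

```python
from collections import defaultdict


def impressions_by_group(rows):
    """Filter on DefLevelId, then sum Impressions per (Category, TagId, URL)."""
    totals = defaultdict(int)
    for row in rows:
        if row["DefLevelId"] in (8, 12):  # FILTER ... BY
            key = (row["ContextCategoryId"], row["TagId"], row["URL"])  # GROUP ... BY
            totals[key] += row["Impressions"]  # SUM(...)
    return dict(totals)


if __name__ == "__main__":
    # Hypothetical sample rows.
    rows = [
        {"ContextCategoryId": 1, "DefLevelId": 8, "TagId": 7, "URL": "a.example", "Impressions": 10},
        {"ContextCategoryId": 1, "DefLevelId": 12, "TagId": 7, "URL": "a.example", "Impressions": 5},
        {"ContextCategoryId": 1, "DefLevelId": 3, "TagId": 7, "URL": "a.example", "Impressions": 99},
        {"ContextCategoryId": 2, "DefLevelId": 8, "TagId": 9, "URL": "b.example", "Impressions": 4},
    ]
    print(impressions_by_group(rows))
```

The Pig version expresses the same logic declaratively and lets the platform spread the work over the cluster.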


Hadoop Ecosystem: Hive

• Hive is a data warehouse infrastructure built on top of Hadoop
• Supports analysis of large datasets stored in Hadoop-compatible file systems such as HDFS and the Amazon S3 file system
• Provides an SQL-like query language called HiveQL
• Provides indexes to accelerate queries


Hadoop Ecosystem: HiveQL

• DML
– SELECT
• DDL
– SHOW TABLES
– CREATE TABLE
– ALTER TABLE
– DROP TABLE
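The statement types listed above can be tried without a Hadoop cluster by using SQLite through Python's standard `sqlite3` module as a stand-in. This is an assumption made purely for illustration: SQLite is not Hive, its syntax differs in places (for instance it has no SHOW TABLES, so the catalog table is queried instead), and the table and sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# DDL: CREATE TABLE and ALTER TABLE
cur.execute("CREATE TABLE page_views (url TEXT, impressions INTEGER)")
cur.execute("ALTER TABLE page_views ADD COLUMN category TEXT")

# SQLite has no SHOW TABLES; querying sqlite_master is the closest equivalent.
tables = [row[0] for row in cur.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'")]

# Hypothetical sample rows.
cur.executemany(
    "INSERT INTO page_views (url, impressions, category) VALUES (?, ?, ?)",
    [("a.example/1", 10, "news"), ("a.example/2", 5, "news"), ("b.example/1", 7, "sport")],
)

# DML: SELECT with aggregation
totals = cur.execute(
    "SELECT category, SUM(impressions) FROM page_views "
    "GROUP BY category ORDER BY category"
).fetchall()

# DDL: DROP TABLE
cur.execute("DROP TABLE page_views")
remaining = cur.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'").fetchall()
conn.close()

print(tables, totals, remaining)
```

In Hive the same kinds of statements are compiled into jobs that run over files in HDFS rather than against a local database file.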


Mahout


Mahout: Overview

• A scalable machine learning library built on Hadoop, written in Java
• Driven by Ng et al.'s paper "MapReduce for Machine Learning on Multicore"


Mahout: Algorithms

• Classification
– Logistic Regression
– Bayesian
– SVM
– NN
– Hidden Markov Models
• Clustering
– K-means
– Mean Shift Clustering
– Spectral Clustering
– Top Down Clustering
• Pattern Mining
– Parallel FP Growth Algorithm
• Regression
– Locally Weighted Linear Regression
• Dimension reduction
– SVD
– PCA
– GDA
• Collaborative filtering
– Non-distributed recommenders
– Distributed Item-Based Collaborative Filtering
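As a concrete instance of one listed algorithm, the sketch below implements k-means (Lloyd's algorithm) in plain Python. It runs on a single machine and is not Mahout code; Mahout distributes the same assignment and update steps over Hadoop. The initial centroids are passed in by the caller, and the sample points are hypothetical.

```python
import math


def kmeans(points, centroids, iterations=10):
    """Plain Lloyd's algorithm with caller-supplied initial centroids.

    Returns (centroids, assignment), where assignment[i] is the index of
    the centroid nearest to points[i].
    """
    centroids = list(centroids)
    assignment = []
    for _ in range(iterations):
        # Assignment step: nearest centroid for every point.
        assignment = [
            min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its points.
        new_centroids = []
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                new_centroids.append(
                    tuple(sum(coords) / len(members) for coords in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster in place
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return centroids, assignment


if __name__ == "__main__":
    points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    centroids, assignment = kmeans(points, centroids=[(0, 0), (10, 10)])
    print(centroids)
    print(assignment)
```

The assignment step treats every point independently, which is what makes the algorithm a natural fit for a map phase, with the centroid update acting as the reduce phase.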


EXERCISE


Mobility Analyzer: A Show Case

[Figure: Mobility Analyzer pipeline]
Data flow: HANA DB -> CSV files -> sequence files -> Mahout -> clusterdump -> cluster info -> HANA DB
Modules: CSVConverter, ImportClusterInfo, ExportTweetsInfo, Run.sh
Site: Local, Hadoop, Local