• No results found

Data Warehouse Designs for Big Data Performance

N/A
N/A
Protected

Academic year: 2021

Share "Data Warehouse Designs for Big Data Performance"

Copied!
39
0
0

Loading.... (view fulltext now)

Full text

(1)

Data Warehouse Designs for Big Data Performance

Dave Beulke

Dave Beulke and Associates

Platform: Cross Platform

E10-Wednesday 16-October 2013 9:45-10:45

(2)

[email protected]

Member of the inaugural IBM DB2 Information Champions One of 45 IBM DB2 Gold Consultant Worldwide

Past President of International DB2 Users Group - IDUG Best speaker at CMG conference & former TDWI instructor

Former Co-Author of certification tests DB2 DBA Certification test

IBM Business Intelligence certification test Former Columnist for IBM Data Management Magazine

Consulting

CPU Demand Reduction Guaranteed!

DB2 Performance Review

DW & Database Design Review Security Audit & Compliance DB2 10 Migration Assistance

DB2 10 Performance IBM White Paper & Redbook

Extensive experience in VLDB databases, DW design and performance Working with DB2 on z/OS since V1.2

Working with DB2 on LUW since OS/2 Extended Edition

Designed/implemented first data warehouse in 1988 for E.F. Hutton Teaching Educational Seminars

DB2 Version 10 Transition

DB2 Performance for Java Developers

Data Warehousing Designs for Performance How to Do a Performance Review

Data Studio and pureQuery

Weekly Performance Tips:

www.DaveBeulke.com

(3)

Data Warehousing - New and old demands

Direct Marketing

Analysis of the customer’s demographic, location, to tailor advertising and marketing campaigns to predict profit variability of campaign

purchases

Cross Selling Analysis of customer’s purchases and behavior to predict their future products desired in a product category

Customer Retention

Analysis of customer history, company interaction, and services performed to predict customer satisfaction and retention

Customer Risk Quantitative analytics to calculate the probabilities of various good and bad events and calculate their business profits/costs

Health

Treatments

Analysis of the different drug and physical treatments for conditions, illnesses and diseases against quality, costs, and outcomes

Fraud Detection Transaction analytics to calculate transaction fraud risk for non payment, stolen credit card, location dependencies etc…

Financial Analytics

Market type, company category and financial statements analytics related to stock pricing, trends and profit probabilities and risk

(4)

Many Solution Architectures

Fact and dimension tables

Ralph Kimball: Centralize DW

One DW serving diverse needs and users

Bill Immon: Information Factory

Wheel Hub and Spoke concept – Central to Mart

Hybrid Operational DW BI Solutions

Customized for your business

(5)

Data Warehousing – Rise of DW Machines

• Massively Parallel Machines

Grid

No SQL like Hadoop, MongoDB, GreenPlum etc…

• Free databases

Open source

Vendors redefining themselves

• Follow the money

Power Electricity

Cooling

Copies of the data

(6)

Partitioning parallelism reduces time

Determine I/O requirements per year/month/week/day

Formula = CPU ms + 2-20 ms per call

5B per year = rows per week = 400,000,000 rows = 400,000 CPU seconds + 800,000- 8,000,000 I/O secs

Elapsed time 222 to 2,222 hours processing each week

• 10 parallel schedules

(2,222 / 4) / 10 = 55.55 hours

Now we have 100’s of CPU Cores available

• SQL Queries

• Partitioning encourages query parallelism

(7)

SQL Features

OLAP functions

RANK

DENSE_RANK

ROW_NUMBER

ORDER BY clause

OVER clause

PARTITION BY clause

RANGE clause

ROW clause

ROLLUP

CUBE

Group By or Grouping Sets

DB2 Cube Views

Virtual cube backed by real structures

XML and Metadata usage

RANK Example:

SELECT WORKDEPT, AVG(SALARY+BONUS)

AS AVG_TOTAL_SALARY, RANK() OVER (ORDER BY

AVG(SALARY+BONUS) DESC)

AS RANK_AVG_SAL FROM BEULKE.EMPLOYEE GROUP BY WORKDEPT ORDER BY RANK_AVG_SAL

DENSE RANK Example:

SELECT WORKDEPT, EMPNO, LASTNAME, FIRSTNME, EDLEVEL

DENSE_RANK()

OVER (PARTITION BY WORKDEPT

ORDER BY EDLEVEL DESC) AS RANK_EDLEVEL FROM EMPLOYEE

ORDER BY WORKDEPT, LASTNAME

ROW NUMBER Example:

SELECT ROW_NUMBER()

OVER (ORDER BY WORKDEPT, LASTNAME) AS NUMBER,

LASTNAME, SALARY

FROM EMPLOYEE

ORDER BY WORKDEPT, LASTNAME

(8)

New Indexes Opportunities

• Separate clustering and partitioning indexes

Clustering is not defined through partitioning index

Partitioning can be done in table definition DDL

PARTITION ENDING AT clause

• Cluster for biggest workload

Data load/inserts/maintenance

SQL activity usually ?10-25%? scanned

Compliment MQT aggregates

Clustering for sort elimination

Partition for parallelism/recovery CUST_ORDERS

By Date

ORDER_DEPT_NBR_IX

(9)

Capture the environment characteristics

• Number of CPUs per LPAR available

• Virtualization of CPUs VM Ware/Cloud

• Amount of LPAR memory available for workload

• Amount of paging that is happening

• Number of disk drives

• Amount of I/O to individual drives

• Only get 30%-50% of optimum speed

(10)

Materialized Query Tables

• Available on z/OS and LUW

• Improved data refresh options

• Aggregate via multiple tables

• Design and aggregate for users

Fact-Month MQT

Fact-1Q MQT Fact-Yearly

MQT

Fact-Daily MQT

MQT or View

Fact-Week MQT

(11)

MQTs - Requirements & Options

• Find all totals, sum and SQL functions used in workload

Analyze base tables and their columns - NULLs

Analyze the types of functions used

SUM, AVG, temporary totals, Counts etc….

• Know data change frequency

SQL UPDATE, INSERT, DELETE

What is the schedule of change activity

Do you need a staging table

(12)

MQT Parameters - Optimization

• Setting the optimization level

Three different ways to achieve SQL optimization

• System level

DFT_QUERYOPT configuration parameter

• BIND level

QUERYOPT optimization-level bind option parameter

• Statement level – SQL statement

SET CURRENT QUERY OPTIMIZATION = host variable or number

(13)

MQT Parameters - Optimization

• Query rewrite considerations

• MQT column definitions

• Isolation Level

• Special Registers

REFRESH AGE

CURRENT MAINTAINED TABLE TYPES FOR OPTIMIZATION

System, User, All, or None

• Only dynamic queries or during BIND

Query rewrite at the query block level

Columns, Predicates, IN-List, GROUP BY items, Derived columns

EXPLAIN: Table_Type = M

(14)

MQT Example

• Daily sales figures feed MQTs

Create additive MQTs

MQTs created for all analysis comparison points

• MQTs can be built from other MQTs

Define Quarterly Sales from Monthly Sales

• Combine MQTs through Views

Views over region, territory, store id etc…

Daily

Sales

Weekly Totals Monthly Totals Quarterly Totals Totals Y-T-D

OLTP Detail

Trans from today

View

(15)

• 5B rows per year–10 per 4k page= ½B pages

• MQT aggregates save large amounts of everything

• Create aggregates for every possibility

• “On Demand” information

• Sales by department

• Sales by zip code

• Sales by time period – day/week/month/quarter/AP

• All reporting and analysis areas

• Trace usage to create/eliminate aggregates

• Total by month ½B I/Os versus 12 I/Os

Fact-Month MQT

Fact-Daily MQT Fact-Week

MQT Fact-1Q

MQT Fact-Yearly

MQT

MQT – 10 to 1000 times improvement!

Y-T-D View

(16)

Temporal Data Designs

• Example: Create a table, policy_info, that uses a BUSINESS_TIME period

• The grain for this temporal table is a DATE

Through the bus_start and bus_end definitions

CREATE TABLE beulke.policy_info(

policy_id CHAR(4) NOT NULL, coverage INT NOT NULL,

bus_start DATE NOT NULL, bus_end DATE NOT NULL,

PERIOD BUSINESS_TIME(bus_start, bus_end));

© Copyright Dave Beulke & Associates [email protected] Page 19

(17)

Bi-Temporal Data Designs

• Inserts data along with both time constraints

UPDATE or DELETE can result in many more data rows

Must consider two dimensions!

BUSINESS_TIME

SYSTEM_TIME

CREATE TABLE beulke.policy_info(

policy_id CHAR(4) NOT NULL, coverage INT NOT NULL,

bus_start DATE NOT NULL, bus_end DATE NOT NULL,

sys__starttime TIMESTAMP(12) NOT NULL GENERATED ALWAYS AS ROW START,

sys_endtime TIMESTAMP(12) NOT NULL GENERATED ALWAYS AS ROW END,

PERIOD BUSINESS_TIME(bus_start, bus_end)

PERIOD SYSTEM_TIME sys_starttime, sys_endtime);

(18)

Temporal tables business_time

• Can contain status of the business on a certain day!

• Separate business time from system time

Just because you can have bi-temporal does not mean it fits your application

Sometimes fits for DW applications and transactional systems

• Complexity with the SQL & processes to make sure the business situation is valid

Can also use SQL for future business_time

(19)

Temporal Tables system_time

• SYSTEM_TIME relative to the world

Constant relative across all the processing

• Reflects the latest status of the processing

Transactions can be unique within the overall system

TIMESTAMP (12) WITHOUT OVERLAP definition

Granularity can be associated with any derivative of TIMESTAMP

DATE, DATE & TIME

SQL processes can manipulate both SYSTEM_TIME and BUSINESS_TIME

Use single service or easily understood interfaces to these period columns

(20)

Hadoop

• Reliable, Scalable, distributed computing

Apache Open Source Community by Doug Cutting

• Why and who uses it

Commodity hardware with petabytes of data

Google MapReduce and Google File System

Yahoo, IBM, Facebook, Amazon

Nothing special container/DASD placement

• Name node with many data nodes

Distributed file system over HTTP

Multiple copies of the data

Billions of data node files

(21)

Hadoop

• One Name node w/many data nodes

Petabytes of data

Typically 3 copies of data

3 copies stored – 2 local & 1 remote

• Directory structure of all files

Works at file level

• Name Node works

Job Tracker and Task Tracker

MapReduce Engine -2 steps

Map –chops up problem to worker node

Nodes sub-divide work further

Reduce brings all answers together Distributed computing with Linux and Hadoop http://www.ibm.com/developerworks/linux/library/l-hadoop/

(22)

Hadoop

• Hadoop is like Teradata?

Hadoop does not use B-trees or Hash partitioning

• Hadoop programming is primitive

MapReduce Matched pairs of data

Map (Key1, Value1)  List (Key2, Value2)

Reduce (Key2, list (Value2))  List answer(Values)

Done on massive parallel scale

• Job tracker & Task Tracker are vital

Schedule & Restart processes

• Hadoop is version 0.21 Distributed computing with Linux and Hadoop http://www.ibm.com/developerworks/linux/library/l-hadoop/

(23)

NoSQL Access Considerations

• Pig Hive R considerations

SQL is no longer your access language

No more common access for cross platform

Programmer productivity

System integration difficulties

• NoSQL Database are most scanning system

All this data find hash tags/friends/related

Map Reduce considerations

• SQL is the better interface

• New everything?

(24)
(25)

DW through the Cloud

• Ability to use computing resources through Internet

No ownership of resources, tools, or infrastructure

Only can access the information process

• Cloud through ……

Amazon - Elastic Cloud Computing (EC2)

Simple Storage Services (S3)

IBM, HP, Microsoft, Google

• AKA

Platform as a Service

Software as a Service SaaS

• Why - Money!!!!

An example The New York Times used 100 Amazon EC2

instances and a Hadoop application to process 4TB of raw image TIFF data (stored in S3) into 11 million finished PDFs in the space of 24 hours at a computation cost of about $240 (not

including bandwidth).

(26)

Common Questions

• Separate along access patterns

Time period comparisons

Product

Location or Store

• Partition or separate table

Combine partitioning, multiple tables with UNION ALL view

Fact- Clothes

Fact- 2012

Fact- Europe Fact-

USA Fact-

2011 Fact-

2010 Fact-

Furniture

Sales Total View

(27)

Big Data Flow

• Design encourages concurrent loads and queries

Either through bulk, trickle or custom loads

Automatically aggregates data within hierarchy

Massively parallel processing of all workload aspects

Fact-Week MQT Fact-Daily

MQT

Fact-Month MQT

Fact-Yearly MQT Fact-1Q

MQT

Leverage daily processing aggregates

(28)

zIIP/zAAP engine usage

G

zIIP zIIP zIIP zIIP

G G

G G G

G G G

zAAP

zAAP

zAAP

• z/OS has Specialty CPUs

No software license costs

Significantly reduces z/OS costs

• zIIP – Java workloads

Java Batch as fast as COBOL

Web & App Server activity

DB2 Java connections

Utilities

• zAAP – UNIX workloads

LINUX workloads

(29)

Daily processing and REFRESH

Fact-Daily Claim_Month MQT

Detail Fact Table

Process or REFRESH

Daily MQT only has 7 days of data

Fact-Month MQT

Fact-Yearly MQT

Fact-Week MQT Daily

Daily

Monthly Monthly

Weekly

Monthly

Fact-1Q MQT

 Data is processed and refreshed within the Weekly MQT

 MQTs build from other MQTs

 The data cascades into the downstream MQTs

(30)

MQTs series for each year

Fact-Week MQT Fact-Daily

MQT

Fact-Month MQT

Fact-Yearly MQT Fact-1Q

MQT

Fact-Week MQT Fact-Daily

MQT

Fact-Month MQT

Fact-Yearly MQT Fact-1Q

MQT

Fact-Week MQT Fact-Daily

MQT

Fact-Month MQT

Fact-Yearly MQT Fact-1Q

MQT

MQTs for 2010 MQTs for 2011 MQTs for 2012

End–User View

10 years

22 Billion rows

(31)

BLU - Column Databases

MySQL Article Example go to http://www.infinidb.org/downloads

(32)

Data Warehousing - New and old demands

Direct Marketing

Analysis of the customer’s demographic, location, to tailor advertising and marketing campaigns to predict profit variability of campaign

purchases

Cross Selling Analysis of customer’s purchases and behavior to predict their future products desired in a product category

Customer Retention

Analysis of customer history, company interaction, and services performed to predict customer satisfaction and retention

Customer Risk Quantitative analytics to calculate the probabilities of various good and bad events and calculate their business profits/costs

Health

Treatments

Analysis of the different drug and physical treatments for conditions, illnesses and diseases against quality, costs, and outcomes

Fraud Detection Transaction analytics to calculate transaction fraud risk for non payment, stolen credit card, location dependencies etc…

Financial Analytics

Market type, company category and financial statements analytics related to stock pricing, trends and profit probabilities and risk Direct

Marketing

Rules based analytics, Association Models, Decision Trees, Neural, Naïve, Bayesian Networks to determine sales campaigns and tactics

Cross Selling Association Rules based are most common methods for cross selling to influence customer’s behavior and purchases

Customer Retention

Associations, Rules, decision trees logistic choice models, Regression based analysis of customer and their company history to predict customer satisfaction and retention

Customer Risk Classification through Neural, Naïve, Bayesian Networks, Modeling time series and exponential exploration to incorporate oddities for adverse situations

Health

Treatments

Rules based analytics, Association Models, Decision Trees, Neural, Naïve,

Bayesian Networks to evaluate components of personell, location, quality, costs, and outcomes

Fraud Detection Identify transaction anomalies and calculate transaction fraud risk for non payment, stolen credit card, location dependencies etc…

Financial Analytics

Time based linear modeling to predict risk, value, stock price trends and profit probabilities

(33)

IBM - DB2 V 10.5 BLU

• New BLU Acceleration Technology from IBM

• Rethink your Big Data design points

• New Level of Compression

• Dramatic Data Skipping improvements in I/O

• Leveraging Microprocessor improvements

• More In-Memory capabilities

(34)

IBM DB2 BLU Version 10.5

• New Columnar “BLU” Data store

• Adaptive Compression + Compressed ordering

• Improved HADR

• Rethink Indexing

• SIMD – Leveraging Microprocessor Technology

• No more need for Complex SQL

• Easy Faster training or requirements

• Data Skipping

(35)

Hybrid DW Architectures – Rethink Everything

• Share Centralized – transaction

• Share Nothing – distributed model

• Operational BI – specialized customize

Reduce the copies of the data

Remove restrictions for research

Customized configuration for computing power for analytics

• Specialized Processors one step further

Specialized customized for your business needs

Fraud Detection and Health Care analytics or ???

© Copyright Dave Beulke & Associates [email protected] Page 39

(36)

Cloud

Hybrid DW Architecture

Smart Connections

IBM compatible

Server

IBM compatible

IDAA Or Kepler

Web XML OLAP Interface

Customized Selective

Storage

• Hybrid architecture for matching business needs

Directed Workloads

Time Dimension Product Dimension

Product Group Product

Quarter Month

Day Store Dimension

Area Region

Store

Year

Week Inventory/sales-ratio

Inventory Analysis

Operation DW BI

(37)

Checklist for Performance

DB2 SQL continues to lead the industry

Performance advantages of new SQL OLAP

Temporal & MDC offer some opportunities

Indexing continues to get better

Partitioning & parallelism for performance

MQTs offer HUGE opportunities

Must be substantial CPU and I/O savings

Can be used independently

Data refresh is convenient

(38)

Checklist for Performance – New Thinking

Leverage new DB2 SQL Columnar Capabilities

Compression, In Memory, SIMD, Skipping

No more designs – Load and Go!

SIMD,

No more indexes

Partitioning & parallelism for performance

Use UID frequency or Time based partitioning

Include all filtering within SQL

I/O is most important factor

© Copyright Dave Beulke & Associates [email protected] Page 42

(39)

Data Warehouse Designs for Big Data Performance

Dave Beulke and Associates [email protected]

E10-Wednesday 16-October 2013 9:45-10:45

Evaluate my session online:

www.idug.org/eu2013/eval

BLOG:

www.DaveBeulke.com

References

Related documents

The variational method is easy to formulate if the blurring function is a Gaussian, therefore a simple model for the thermal response of a flat bottom hole based on

The Japanese steel industry is an ideal case for examining the effects of export subsidies: after World War II, the steel industry experienced unprecedented growth in

145 Accord- ingly, conventional at-the-money (or out-of-the-money) stock options never produce a corporate earnings charge. Indexed options, however, lack a fixed exercise price

Thomas, Mieke van de Sandt-Koenderman, Ineke van der Meulen, Evy Visch- Brink, Linda Worrall & Heather Harris Wright (2019): RELEASE: a protocol for a systematic review

A pot experiment was conducted to study the effect of bio-fertilizers on growth, yield and quality of buckwheat (Fagopyrum esculentum Moench) at K.R.C.C.H college of Arabhavi

Central trade-off: offloading computation to cloud may save CPU energy on phone, but sending current state of app to cloud costs radio energy on phone.. Energy Consumption

Validity of students' evaluations of teaching: A comparison with instructor self evaluations by teaching assistants, undergraduate faculty and graduate faculty.. Los Angeles:

In accordance with the Hong Kong Civil Aviation (Investigation of Accidents) Regulations (Laws of Hong Kong, Chapter 448 subsidiary legislation B), the Civil Aviation