Data Warehouse Designs for Big Data Performance

(1)

Data Warehouse Designs for Big Data Performance

Dave Beulke

Dave Beulke and Associates

Platform: Cross Platform

E10-Wednesday 16-October 2013 9:45-10:45

(2)

[email protected]

Member of the inaugural IBM DB2 Information Champions One of 45 IBM DB2 Gold Consultant Worldwide

Past President of International DB2 Users Group - IDUG Best speaker at CMG conference & former TDWI instructor

Former Co-Author of certification tests DB2 DBA Certification test

IBM Business Intelligence certification test Former Columnist for IBM Data Management Magazine

Consulting

CPU Demand Reduction Guaranteed!

DB2 Performance Review

DW & Database Design Review Security Audit & Compliance DB2 10 Migration Assistance

DB2 10 Performance IBM White Paper & Redbook

Extensive experience in VLDB databases, DW design and performance Working with DB2 on z/OS since V1.2

Working with DB2 on LUW since OS/2 Extended Edition

Designed/implemented first data warehouse in 1988 for E.F. Hutton Teaching Educational Seminars

DB2 Version 10 Transition

DB2 Performance for Java Developers

Data Warehousing Designs for Performance How to Do a Performance Review

Data Studio and pureQuery

Weekly Performance Tips:

www.DaveBeulke.com

(3)

Data Warehousing - New and old demands

Direct Marketing

Analysis of the customer’s demographic, location, to tailor advertising and marketing campaigns to predict profit variability of campaign

purchases

Cross Selling Analysis of customer’s purchases and behavior to predict their future products desired in a product category

Customer Retention

Analysis of customer history, company interaction, and services performed to predict customer satisfaction and retention

Customer Risk Quantitative analytics to calculate the probabilities of various good and bad events and calculate their business profits/costs

Health

Treatments

Analysis of the different drug and physical treatments for conditions, illnesses and diseases against quality, costs, and outcomes

Fraud Detection Transaction analytics to calculate transaction fraud risk for non payment, stolen credit card, location dependencies etc…

Financial Analytics

Market type, company category and financial statements analytics related to stock pricing, trends and profit probabilities and risk

(4)

Many Solution Architectures

• Fact and dimension tables

• Ralph Kimball: Centralize DW

• One DW serving diverse needs and users

• Bill Immon: Information Factory

• Wheel Hub and Spoke concept – Central to Mart

• Hybrid Operational DW BI Solutions

• Customized for your business

(5)

Data Warehousing – Rise of DW Machines

• Massively Parallel Machines

• Grid

• No SQL like Hadoop, MongoDB, GreenPlum etc…

• Free databases

• Open source

• Vendors redefining themselves

• Follow the money

• Power Electricity

• Cooling

• Copies of the data

(6)

Partitioning parallelism reduces time

Determine I/O requirements per year/month/week/day

• Formula = CPU ms + 2-20 ms per call

5B per year = rows per week = 400,000,000 rows = 400,000 CPU seconds + 800,000- 8,000,000 I/O secs

• Elapsed time 222 to 2,222 hours processing each week

• 10 parallel schedules

• (2,222 / 4) / 10 = 55.55 hours

• Now we have 100’s of CPU Cores available

• SQL Queries

• Partitioning encourages query parallelism

(7)

SQL Features

• OLAP functions

• RANK

• DENSE_RANK

• ROW_NUMBER

• ORDER BY clause

• OVER clause

• PARTITION BY clause

• RANGE clause

• ROW clause

• ROLLUP

• CUBE

• Group By or Grouping Sets

• DB2 Cube Views

• Virtual cube backed by real structures

• XML and Metadata usage

RANK Example:

SELECT WORKDEPT, AVG(SALARY+BONUS)

AS AVG_TOTAL_SALARY, RANK() OVER (ORDER BY

AVG(SALARY+BONUS) DESC)

AS RANK_AVG_SAL FROM BEULKE.EMPLOYEE GROUP BY WORKDEPT ORDER BY RANK_AVG_SAL

DENSE RANK Example:

SELECT WORKDEPT, EMPNO, LASTNAME, FIRSTNME, EDLEVEL

DENSE_RANK()

OVER (PARTITION BY WORKDEPT

ORDER BY EDLEVEL DESC) AS RANK_EDLEVEL FROM EMPLOYEE

ORDER BY WORKDEPT, LASTNAME

ROW NUMBER Example:

SELECT ROW_NUMBER()

OVER (ORDER BY WORKDEPT, LASTNAME) AS NUMBER,

LASTNAME, SALARY

FROM EMPLOYEE

ORDER BY WORKDEPT, LASTNAME

(8)

New Indexes Opportunities

• Separate clustering and partitioning indexes

• Clustering is not defined through partitioning index

• Partitioning can be done in table definition DDL

• PARTITION ENDING AT clause

• Cluster for biggest workload

• Data load/inserts/maintenance

• SQL activity usually ?10-25%? scanned

• Compliment MQT aggregates

•

• Clustering for sort elimination

• Partition for parallelism/recovery CUST_ORDERS

By Date

ORDER_DEPT_NBR_IX

(9)

Capture the environment characteristics

• Number of CPUs per LPAR available

• Virtualization of CPUs VM Ware/Cloud

• Amount of LPAR memory available for workload

• Amount of paging that is happening

• Number of disk drives

• Amount of I/O to individual drives

• Only get 30%-50% of optimum speed

(10)

Materialized Query Tables

• Available on z/OS and LUW

• Improved data refresh options

• Aggregate via multiple tables

• Design and aggregate for users

Fact-Month MQT

Fact-1Q MQT Fact-Yearly

MQT

Fact-Daily MQT

MQT or View

Fact-Week MQT

(11)

MQTs - Requirements & Options

• Find all totals, sum and SQL functions used in workload

• Analyze base tables and their columns - NULLs

• Analyze the types of functions used

• SUM, AVG, temporary totals, Counts etc….

• Know data change frequency

• SQL UPDATE, INSERT, DELETE

• What is the schedule of change activity

• Do you need a staging table

(12)

MQT Parameters - Optimization

• Setting the optimization level

• Three different ways to achieve SQL optimization

• System level

• DFT_QUERYOPT configuration parameter

• BIND level

• QUERYOPT optimization-level bind option parameter

• Statement level – SQL statement

• SET CURRENT QUERY OPTIMIZATION = host variable or number

(13)

MQT Parameters - Optimization

• Query rewrite considerations

• MQT column definitions

• Isolation Level

• Special Registers

• REFRESH AGE

• CURRENT MAINTAINED TABLE TYPES FOR OPTIMIZATION

• System, User, All, or None

• Only dynamic queries or during BIND

• Query rewrite at the query block level

• Columns, Predicates, IN-List, GROUP BY items, Derived columns

• EXPLAIN: Table_Type = M

(14)

MQT Example

• Daily sales figures feed MQTs

• Create additive MQTs

• MQTs created for all analysis comparison points

• MQTs can be built from other MQTs

• Define Quarterly Sales from Monthly Sales

• Combine MQTs through Views

• Views over region, territory, store id etc…

Daily

Sales

^Weekly^Totals ^Monthly^Totals ^Quarterly^Totals ^Totals^Y-T-D

OLTP Detail

Trans from today

View

(15)

• 5B rows per year–10 per 4k page= ½B pages

• MQT aggregates save large amounts of everything

• Create aggregates for every possibility

• “On Demand” information

• Sales by department

• Sales by zip code

• Sales by time period – day/week/month/quarter/AP

• All reporting and analysis areas

• Trace usage to create/eliminate aggregates

• Total by month ½B I/Os versus 12 I/Os

Fact-Month MQT

Fact-Daily MQT Fact-Week

MQT Fact-1Q

MQT Fact-Yearly

MQT

MQT – 10 to 1000 times improvement!

Y-T-D View

(16)

Temporal Data Designs

• Example: Create a table, policy_info, that uses a BUSINESS_TIME period

• The grain for this temporal table is a DATE

• Through the bus_start and bus_end definitions

CREATE TABLE beulke.policy_info(

policy_id CHAR(4) NOT NULL, coverage INT NOT NULL,

bus_start DATE NOT NULL, bus_end DATE NOT NULL,

PERIOD BUSINESS_TIME(bus_start, bus_end));

(17)

Bi-Temporal Data Designs

• Inserts data along with both time constraints

• UPDATE or DELETE can result in many more data rows

• Must consider two dimensions!

BUSINESS_TIME

SYSTEM_TIME

CREATE TABLE beulke.policy_info(

policy_id CHAR(4) NOT NULL, coverage INT NOT NULL,

bus_start DATE NOT NULL, bus_end DATE NOT NULL,

sys__starttime TIMESTAMP(12) NOT NULL GENERATED ALWAYS AS ROW START,

sys_endtime TIMESTAMP(12) NOT NULL GENERATED ALWAYS AS ROW END,

PERIOD BUSINESS_TIME(bus_start, bus_end)

PERIOD SYSTEM_TIME sys_starttime, sys_endtime);

(18)

Temporal tables business_time

• Can contain status of the business on a certain day!

• Separate business time from system time

• Just because you can have bi-temporal does not mean it fits your application

• Sometimes fits for DW applications and transactional systems

• Complexity with the SQL & processes to make sure the business situation is valid

• Can also use SQL for future business_time

(19)

Temporal Tables system_time

• SYSTEM_TIME relative to the world

• Constant relative across all the processing

• Reflects the latest status of the processing

• Transactions can be unique within the overall system

• TIMESTAMP (12) WITHOUT OVERLAP definition

• Granularity can be associated with any derivative of TIMESTAMP

• DATE, DATE & TIME

• SQL processes can manipulate both SYSTEM_TIME and BUSINESS_TIME

• Use single service or easily understood interfaces to these period columns

(20)

Hadoop

• Reliable, Scalable, distributed computing

• Apache Open Source Community by Doug Cutting

• Why and who uses it

• Commodity hardware with petabytes of data

• Google MapReduce and Google File System

• Yahoo, IBM, Facebook, Amazon

• Nothing special container/DASD placement

• Name node with many data nodes

• Distributed file system over HTTP

• Multiple copies of the data

• Billions of data node files

(21)

Hadoop

• One Name node w/many data nodes

• Petabytes of data

• Typically 3 copies of data

• 3 copies stored – 2 local & 1 remote

• Directory structure of all files

• Works at file level

• Name Node works

• Job Tracker and Task Tracker

• MapReduce Engine -2 steps

• Map –chops up problem to worker node

• Nodes sub-divide work further

• Reduce brings all answers together Distributed computing with Linux and Hadoop http://www.ibm.com/developerworks/linux/library/l-hadoop/

(22)

Hadoop

• Hadoop is like Teradata?

• Hadoop does not use B-trees or Hash partitioning

• Hadoop programming is primitive

• MapReduce Matched pairs of data

• Map (Key1, Value1)  List (Key2, Value2)

• Reduce (Key2, list (Value2))  List answer(Values)

• Done on massive parallel scale

• Job tracker & Task Tracker are vital

• Schedule & Restart processes

• Hadoop is version 0.21 Distributed computing with Linux and Hadoop http://www.ibm.com/developerworks/linux/library/l-hadoop/

(23)

NoSQL Access Considerations

• Pig Hive R considerations

• SQL is no longer your access language

• No more common access for cross platform

• Programmer productivity

• System integration difficulties

• NoSQL Database are most scanning system

• All this data find hash tags/friends/related

• Map Reduce considerations

• SQL is the better interface

• New everything?

(24)

(25)

DW through the Cloud

• Ability to use computing resources through Internet

• No ownership of resources, tools, or infrastructure

• Only can access the information process

• Cloud through ……

• Amazon - Elastic Cloud Computing (EC2)

• Simple Storage Services (S3)

• IBM, HP, Microsoft, Google

• AKA

• Platform as a Service

• Software as a Service SaaS

• Why - Money!!!!

An example The New York Times used 100 Amazon EC2

instances and a Hadoop application to process 4TB of raw image TIFF data (stored in S3) into 11 million finished PDFs in the space of 24 hours at a computation cost of about $240 (not

including bandwidth).

(26)

Common Questions

• Separate along access patterns

• Time period comparisons

• Product

• Location or Store

• Partition or separate table

• Combine partitioning, multiple tables with UNION ALL view

Fact- Clothes

Fact- 2012

Fact- Europe Fact-

USA Fact-

2011 Fact-

2010 Fact-

Furniture

Sales Total View

(27)

Big Data Flow

• Design encourages concurrent loads and queries

• Either through bulk, trickle or custom loads

• Automatically aggregates data within hierarchy

• Massively parallel processing of all workload aspects

Fact-Week MQT Fact-Daily

MQT

Fact-Month MQT

Fact-Yearly MQT Fact-1Q

MQT

Leverage daily processing aggregates

(28)

zIIP/zAAP engine usage

G

zIIP zIIP zIIP zIIP

G G

G G G

zAAP

• z/OS has Specialty CPUs

• No software license costs

• Significantly reduces z/OS costs

• zIIP – Java workloads

• Java Batch as fast as COBOL

• Web & App Server activity

• DB2 Java connections

• Utilities

• zAAP – UNIX workloads

• LINUX workloads

(29)

Daily processing and REFRESH

Fact-Daily Claim_Month MQT

Detail Fact Table

Process or REFRESH

Daily MQT only has 7 days of data

Fact-Month MQT

Fact-Yearly MQT

Fact-Week MQT Daily

Daily

Monthly Monthly

Weekly

Monthly

Fact-1Q MQT

 Data is processed and refreshed within the Weekly MQT

 MQTs build from other MQTs

 The data cascades into the downstream MQTs

(30)

MQTs series for each year

MQT

Fact-Month MQT

MQT

Fact-Month MQT

MQT

Fact-Month MQT

MQT

MQTs for 2010 MQTs for 2011 MQTs for 2012

End–User View

10 years

22 Billion rows

(31)

BLU - Column Databases

MySQL Article Example go to http://www.infinidb.org/downloads

(32)

Data Warehousing - New and old demands

Direct Marketing

Analysis of the customer’s demographic, location, to tailor advertising and marketing campaigns to predict profit variability of campaign

purchases

Cross Selling Analysis of customer’s purchases and behavior to predict their future products desired in a product category

Analysis of customer history, company interaction, and services performed to predict customer satisfaction and retention

Customer Risk Quantitative analytics to calculate the probabilities of various good and bad events and calculate their business profits/costs

Health

Treatments

Analysis of the different drug and physical treatments for conditions, illnesses and diseases against quality, costs, and outcomes

Fraud Detection Transaction analytics to calculate transaction fraud risk for non payment, stolen credit card, location dependencies etc…

Market type, company category and financial statements analytics related to stock pricing, trends and profit probabilities and risk Direct

Marketing

Rules based analytics, Association Models, Decision Trees, Neural, Naïve, Bayesian Networks to determine sales campaigns and tactics

Cross Selling Association Rules based are most common methods for cross selling to influence customer’s behavior and purchases

Associations, Rules, decision trees logistic choice models, Regression based analysis of customer and their company history to predict customer satisfaction and retention

Customer Risk Classification through Neural, Naïve, Bayesian Networks, Modeling time series and exponential exploration to incorporate oddities for adverse situations

Health

Treatments

Rules based analytics, Association Models, Decision Trees, Neural, Naïve,

Bayesian Networks to evaluate components of personell, location, quality, costs, and outcomes

Fraud Detection Identify transaction anomalies and calculate transaction fraud risk for non payment, stolen credit card, location dependencies etc…

Time based linear modeling to predict risk, value, stock price trends and profit probabilities

(33)

IBM - DB2 V 10.5 BLU

• New BLU Acceleration Technology from IBM

• Rethink your Big Data design points

• New Level of Compression

• Dramatic Data Skipping improvements in I/O

• Leveraging Microprocessor improvements

• More In-Memory capabilities

(34)

IBM DB2 BLU Version 10.5

• New Columnar “BLU” Data store

• Adaptive Compression + Compressed ordering

• Improved HADR

• Rethink Indexing

• SIMD – Leveraging Microprocessor Technology

• No more need for Complex SQL

• Easy Faster training or requirements

• Data Skipping

(35)

Hybrid DW Architectures – Rethink Everything

• Share Centralized – transaction

• Share Nothing – distributed model

• Operational BI – specialized customize

• Reduce the copies of the data

• Remove restrictions for research

• Customized configuration for computing power for analytics

• Specialized Processors one step further

• Specialized customized for your business needs

• Fraud Detection and Health Care analytics or ???

(36)

Cloud

Hybrid DW Architecture

Smart Connections

IBM compatible

Server

IBM compatible

IDAA Or Kepler

Web XML OLAP Interface

Customized Selective

Storage

• Hybrid architecture for matching business needs

Directed Workloads

Time Dimension Product Dimension

Product Group Product

Quarter Month

Day Store Dimension

Area Region

Store

Year

Week Inventory/sales-ratio

Inventory Analysis

Operation DW BI

(37)

Checklist for Performance

• DB2 SQL continues to lead the industry

• Performance advantages of new SQL OLAP

• Temporal & MDC offer some opportunities

• Indexing continues to get better

• Partitioning & parallelism for performance

• MQTs offer HUGE opportunities

• Must be substantial CPU and I/O savings

• Can be used independently

• Data refresh is convenient

(38)

Checklist for Performance – New Thinking

• Leverage new DB2 SQL Columnar Capabilities

• Compression, In Memory, SIMD, Skipping

• No more designs – Load and Go!

• SIMD,

• No more indexes

• Partitioning & parallelism for performance

• Use UID frequency or Time based partitioning

• Include all filtering within SQL

• I/O is most important factor

(39)

Data Warehouse Designs for Big Data Performance

Dave Beulke and Associates [email protected]

E10-Wednesday 16-October 2013 9:45-10:45

Evaluate my session online:

www.idug.org/eu2013/eval

BLOG:

www.DaveBeulke.com