Data Warehouse Designs for Big Data Performance
Dave Beulke
Dave Beulke and Associates
Platform: Cross Platform
E10-Wednesday 16-October 2013 9:45-10:45
Member of the inaugural IBM DB2 Information Champions One of 45 IBM DB2 Gold Consultant Worldwide
Past President of International DB2 Users Group - IDUG Best speaker at CMG conference & former TDWI instructor
Former Co-Author of certification tests DB2 DBA Certification test
IBM Business Intelligence certification test Former Columnist for IBM Data Management Magazine
Consulting
CPU Demand Reduction Guaranteed!
DB2 Performance Review
DW & Database Design Review Security Audit & Compliance DB2 10 Migration Assistance
DB2 10 Performance IBM White Paper & Redbook
Extensive experience in VLDB databases, DW design and performance Working with DB2 on z/OS since V1.2
Working with DB2 on LUW since OS/2 Extended Edition
Designed/implemented first data warehouse in 1988 for E.F. Hutton Teaching Educational Seminars
DB2 Version 10 Transition
DB2 Performance for Java Developers
Data Warehousing Designs for Performance How to Do a Performance Review
Data Studio and pureQuery
Weekly Performance Tips:
www.DaveBeulke.com
Data Warehousing - New and old demands
Direct Marketing
Analysis of the customer’s demographic, location, to tailor advertising and marketing campaigns to predict profit variability of campaign
purchases
Cross Selling Analysis of customer’s purchases and behavior to predict their future products desired in a product category
Customer Retention
Analysis of customer history, company interaction, and services performed to predict customer satisfaction and retention
Customer Risk Quantitative analytics to calculate the probabilities of various good and bad events and calculate their business profits/costs
Health
Treatments
Analysis of the different drug and physical treatments for conditions, illnesses and diseases against quality, costs, and outcomes
Fraud Detection Transaction analytics to calculate transaction fraud risk for non payment, stolen credit card, location dependencies etc…
Financial Analytics
Market type, company category and financial statements analytics related to stock pricing, trends and profit probabilities and risk
Many Solution Architectures
• Fact and dimension tables
• Ralph Kimball: Centralize DW
• One DW serving diverse needs and users
• Bill Immon: Information Factory
• Wheel Hub and Spoke concept – Central to Mart
• Hybrid Operational DW BI Solutions
• Customized for your business
Data Warehousing – Rise of DW Machines
• Massively Parallel Machines
• Grid
• No SQL like Hadoop, MongoDB, GreenPlum etc…
• Free databases
• Open source
• Vendors redefining themselves
• Follow the money
• Power Electricity
• Cooling
• Copies of the data
Partitioning parallelism reduces time
Determine I/O requirements per year/month/week/day
• Formula = CPU ms + 2-20 ms per call
5B per year = rows per week = 400,000,000 rows = 400,000 CPU seconds + 800,000- 8,000,000 I/O secs
• Elapsed time 222 to 2,222 hours processing each week
• 10 parallel schedules
• (2,222 / 4) / 10 = 55.55 hours
• Now we have 100’s of CPU Cores available
• SQL Queries
• Partitioning encourages query parallelism
SQL Features
• OLAP functions
• RANK
• DENSE_RANK
• ROW_NUMBER
• ORDER BY clause
• OVER clause
• PARTITION BY clause
• RANGE clause
• ROW clause
• ROLLUP
• CUBE
• Group By or Grouping Sets
• DB2 Cube Views
• Virtual cube backed by real structures
• XML and Metadata usage
RANK Example:
SELECT WORKDEPT, AVG(SALARY+BONUS)
AS AVG_TOTAL_SALARY, RANK() OVER (ORDER BY
AVG(SALARY+BONUS) DESC)
AS RANK_AVG_SAL FROM BEULKE.EMPLOYEE GROUP BY WORKDEPT ORDER BY RANK_AVG_SAL
DENSE RANK Example:
SELECT WORKDEPT, EMPNO, LASTNAME, FIRSTNME, EDLEVEL
DENSE_RANK()
OVER (PARTITION BY WORKDEPT
ORDER BY EDLEVEL DESC) AS RANK_EDLEVEL FROM EMPLOYEE
ORDER BY WORKDEPT, LASTNAME
ROW NUMBER Example:
SELECT ROW_NUMBER()
OVER (ORDER BY WORKDEPT, LASTNAME) AS NUMBER,
LASTNAME, SALARY
FROM EMPLOYEE
ORDER BY WORKDEPT, LASTNAME
New Indexes Opportunities
• Separate clustering and partitioning indexes
• Clustering is not defined through partitioning index
• Partitioning can be done in table definition DDL
• PARTITION ENDING AT clause
• Cluster for biggest workload
• Data load/inserts/maintenance
• SQL activity usually ?10-25%? scanned
• Compliment MQT aggregates
•
• Clustering for sort elimination
• Partition for parallelism/recovery CUST_ORDERS
By Date
ORDER_DEPT_NBR_IX
Capture the environment characteristics
• Number of CPUs per LPAR available
• Virtualization of CPUs VM Ware/Cloud
• Amount of LPAR memory available for workload
• Amount of paging that is happening
• Number of disk drives
• Amount of I/O to individual drives
• Only get 30%-50% of optimum speed
Materialized Query Tables
• Available on z/OS and LUW
• Improved data refresh options
• Aggregate via multiple tables
• Design and aggregate for users
Fact-Month MQT
Fact-1Q MQT Fact-Yearly
MQT
Fact-Daily MQT
MQT or View
Fact-Week MQT
MQTs - Requirements & Options
• Find all totals, sum and SQL functions used in workload
• Analyze base tables and their columns - NULLs
• Analyze the types of functions used
• SUM, AVG, temporary totals, Counts etc….
• Know data change frequency
• SQL UPDATE, INSERT, DELETE
• What is the schedule of change activity
• Do you need a staging table
MQT Parameters - Optimization
• Setting the optimization level
• Three different ways to achieve SQL optimization
• System level
• DFT_QUERYOPT configuration parameter
• BIND level
• QUERYOPT optimization-level bind option parameter
• Statement level – SQL statement
• SET CURRENT QUERY OPTIMIZATION = host variable or number
MQT Parameters - Optimization
• Query rewrite considerations
• MQT column definitions
• Isolation Level
• Special Registers
• REFRESH AGE
• CURRENT MAINTAINED TABLE TYPES FOR OPTIMIZATION
• System, User, All, or None
• Only dynamic queries or during BIND
• Query rewrite at the query block level
• Columns, Predicates, IN-List, GROUP BY items, Derived columns
• EXPLAIN: Table_Type = M
MQT Example
• Daily sales figures feed MQTs
• Create additive MQTs
• MQTs created for all analysis comparison points
• MQTs can be built from other MQTs
• Define Quarterly Sales from Monthly Sales
• Combine MQTs through Views
• Views over region, territory, store id etc…
Daily
Sales
Weekly Totals Monthly Totals Quarterly Totals Totals Y-T-DOLTP Detail
Trans from today
View
• 5B rows per year–10 per 4k page= ½B pages
• MQT aggregates save large amounts of everything
• Create aggregates for every possibility
• “On Demand” information
• Sales by department
• Sales by zip code
• Sales by time period – day/week/month/quarter/AP
• All reporting and analysis areas
• Trace usage to create/eliminate aggregates
• Total by month ½B I/Os versus 12 I/Os
Fact-Month MQT
Fact-Daily MQT Fact-Week
MQT Fact-1Q
MQT Fact-Yearly
MQT
MQT – 10 to 1000 times improvement!
Y-T-D View
Temporal Data Designs
• Example: Create a table, policy_info, that uses a BUSINESS_TIME period
• The grain for this temporal table is a DATE
• Through the bus_start and bus_end definitions
CREATE TABLE beulke.policy_info(
policy_id CHAR(4) NOT NULL, coverage INT NOT NULL,
bus_start DATE NOT NULL, bus_end DATE NOT NULL,
PERIOD BUSINESS_TIME(bus_start, bus_end));
© Copyright Dave Beulke & Associates [email protected] Page 19
Bi-Temporal Data Designs
• Inserts data along with both time constraints
• UPDATE or DELETE can result in many more data rows
• Must consider two dimensions!
BUSINESS_TIME
SYSTEM_TIME
CREATE TABLE beulke.policy_info(
policy_id CHAR(4) NOT NULL, coverage INT NOT NULL,
bus_start DATE NOT NULL, bus_end DATE NOT NULL,
sys__starttime TIMESTAMP(12) NOT NULL GENERATED ALWAYS AS ROW START,
sys_endtime TIMESTAMP(12) NOT NULL GENERATED ALWAYS AS ROW END,
PERIOD BUSINESS_TIME(bus_start, bus_end)
PERIOD SYSTEM_TIME sys_starttime, sys_endtime);
Temporal tables business_time
• Can contain status of the business on a certain day!
• Separate business time from system time
• Just because you can have bi-temporal does not mean it fits your application
• Sometimes fits for DW applications and transactional systems
• Complexity with the SQL & processes to make sure the business situation is valid
• Can also use SQL for future business_time
Temporal Tables system_time
• SYSTEM_TIME relative to the world
• Constant relative across all the processing
• Reflects the latest status of the processing
• Transactions can be unique within the overall system
• TIMESTAMP (12) WITHOUT OVERLAP definition
• Granularity can be associated with any derivative of TIMESTAMP
• DATE, DATE & TIME
• SQL processes can manipulate both SYSTEM_TIME and BUSINESS_TIME
• Use single service or easily understood interfaces to these period columns
Hadoop
• Reliable, Scalable, distributed computing
• Apache Open Source Community by Doug Cutting
• Why and who uses it
• Commodity hardware with petabytes of data
• Google MapReduce and Google File System
• Yahoo, IBM, Facebook, Amazon
• Nothing special container/DASD placement
• Name node with many data nodes
• Distributed file system over HTTP
• Multiple copies of the data
• Billions of data node files
Hadoop
• One Name node w/many data nodes
• Petabytes of data
• Typically 3 copies of data
• 3 copies stored – 2 local & 1 remote
• Directory structure of all files
• Works at file level
• Name Node works
• Job Tracker and Task Tracker
• MapReduce Engine -2 steps
• Map –chops up problem to worker node
• Nodes sub-divide work further
• Reduce brings all answers together Distributed computing with Linux and Hadoop http://www.ibm.com/developerworks/linux/library/l-hadoop/
Hadoop
• Hadoop is like Teradata?
• Hadoop does not use B-trees or Hash partitioning
• Hadoop programming is primitive
• MapReduce Matched pairs of data
• Map (Key1, Value1) List (Key2, Value2)
• Reduce (Key2, list (Value2)) List answer(Values)
• Done on massive parallel scale
• Job tracker & Task Tracker are vital
• Schedule & Restart processes
• Hadoop is version 0.21 Distributed computing with Linux and Hadoop http://www.ibm.com/developerworks/linux/library/l-hadoop/
NoSQL Access Considerations
• Pig Hive R considerations
• SQL is no longer your access language
• No more common access for cross platform
• Programmer productivity
• System integration difficulties
• NoSQL Database are most scanning system
• All this data find hash tags/friends/related
• Map Reduce considerations
• SQL is the better interface
• New everything?
DW through the Cloud
• Ability to use computing resources through Internet
• No ownership of resources, tools, or infrastructure
• Only can access the information process
• Cloud through ……
• Amazon - Elastic Cloud Computing (EC2)
• Simple Storage Services (S3)
• IBM, HP, Microsoft, Google
• AKA
• Platform as a Service
• Software as a Service SaaS
• Why - Money!!!!
An example The New York Times used 100 Amazon EC2
instances and a Hadoop application to process 4TB of raw image TIFF data (stored in S3) into 11 million finished PDFs in the space of 24 hours at a computation cost of about $240 (not
including bandwidth).
Common Questions
• Separate along access patterns
• Time period comparisons
• Product
• Location or Store
• Partition or separate table
• Combine partitioning, multiple tables with UNION ALL view
Fact- Clothes
Fact- 2012
Fact- Europe Fact-
USA Fact-
2011 Fact-
2010 Fact-
Furniture
Sales Total View
Big Data Flow
• Design encourages concurrent loads and queries
• Either through bulk, trickle or custom loads
• Automatically aggregates data within hierarchy
• Massively parallel processing of all workload aspects
Fact-Week MQT Fact-Daily
MQT
Fact-Month MQT
Fact-Yearly MQT Fact-1Q
MQT
Leverage daily processing aggregates
zIIP/zAAP engine usage
G
zIIP zIIP zIIP zIIP
G G
G G G
G G G
zAAP
zAAP
zAAP
• z/OS has Specialty CPUs
• No software license costs
• Significantly reduces z/OS costs
• zIIP – Java workloads
• Java Batch as fast as COBOL
• Web & App Server activity
• DB2 Java connections
• Utilities
• zAAP – UNIX workloads
• LINUX workloads
Daily processing and REFRESH
Fact-Daily Claim_Month MQT
Detail Fact Table
Process or REFRESH
Daily MQT only has 7 days of data
Fact-Month MQT
Fact-Yearly MQT
Fact-Week MQT Daily
Daily
Monthly Monthly
Weekly
Monthly
Fact-1Q MQT
Data is processed and refreshed within the Weekly MQT
MQTs build from other MQTs
The data cascades into the downstream MQTs
MQTs series for each year
Fact-Week MQT Fact-Daily
MQT
Fact-Month MQT
Fact-Yearly MQT Fact-1Q
MQT
Fact-Week MQT Fact-Daily
MQT
Fact-Month MQT
Fact-Yearly MQT Fact-1Q
MQT
Fact-Week MQT Fact-Daily
MQT
Fact-Month MQT
Fact-Yearly MQT Fact-1Q
MQT
MQTs for 2010 MQTs for 2011 MQTs for 2012
End–User View
10 years
22 Billion rows
BLU - Column Databases
MySQL Article Example go to http://www.infinidb.org/downloads
Data Warehousing - New and old demands
Direct Marketing
Analysis of the customer’s demographic, location, to tailor advertising and marketing campaigns to predict profit variability of campaign
purchases
Cross Selling Analysis of customer’s purchases and behavior to predict their future products desired in a product category
Customer Retention
Analysis of customer history, company interaction, and services performed to predict customer satisfaction and retention
Customer Risk Quantitative analytics to calculate the probabilities of various good and bad events and calculate their business profits/costs
Health
Treatments
Analysis of the different drug and physical treatments for conditions, illnesses and diseases against quality, costs, and outcomes
Fraud Detection Transaction analytics to calculate transaction fraud risk for non payment, stolen credit card, location dependencies etc…
Financial Analytics
Market type, company category and financial statements analytics related to stock pricing, trends and profit probabilities and risk Direct
Marketing
Rules based analytics, Association Models, Decision Trees, Neural, Naïve, Bayesian Networks to determine sales campaigns and tactics
Cross Selling Association Rules based are most common methods for cross selling to influence customer’s behavior and purchases
Customer Retention
Associations, Rules, decision trees logistic choice models, Regression based analysis of customer and their company history to predict customer satisfaction and retention
Customer Risk Classification through Neural, Naïve, Bayesian Networks, Modeling time series and exponential exploration to incorporate oddities for adverse situations
Health
Treatments
Rules based analytics, Association Models, Decision Trees, Neural, Naïve,
Bayesian Networks to evaluate components of personell, location, quality, costs, and outcomes
Fraud Detection Identify transaction anomalies and calculate transaction fraud risk for non payment, stolen credit card, location dependencies etc…
Financial Analytics
Time based linear modeling to predict risk, value, stock price trends and profit probabilities
IBM - DB2 V 10.5 BLU
• New BLU Acceleration Technology from IBM
• Rethink your Big Data design points
• New Level of Compression
• Dramatic Data Skipping improvements in I/O
• Leveraging Microprocessor improvements
• More In-Memory capabilities
IBM DB2 BLU Version 10.5
• New Columnar “BLU” Data store
• Adaptive Compression + Compressed ordering
• Improved HADR
• Rethink Indexing
• SIMD – Leveraging Microprocessor Technology
• No more need for Complex SQL
• Easy Faster training or requirements
• Data Skipping
Hybrid DW Architectures – Rethink Everything
• Share Centralized – transaction
• Share Nothing – distributed model
• Operational BI – specialized customize
• Reduce the copies of the data
• Remove restrictions for research
• Customized configuration for computing power for analytics
• Specialized Processors one step further
• Specialized customized for your business needs
• Fraud Detection and Health Care analytics or ???
© Copyright Dave Beulke & Associates [email protected] Page 39
Cloud
Hybrid DW Architecture
Smart Connections
IBM compatible
Server
IBM compatible
IDAA Or Kepler
Web XML OLAP Interface
Customized Selective
Storage
• Hybrid architecture for matching business needs
Directed Workloads
Time Dimension Product Dimension
Product Group Product
Quarter Month
Day Store Dimension
Area Region
Store
Year
Week Inventory/sales-ratio
Inventory Analysis
Operation DW BI
Checklist for Performance
• DB2 SQL continues to lead the industry
• Performance advantages of new SQL OLAP
• Temporal & MDC offer some opportunities
• Indexing continues to get better
• Partitioning & parallelism for performance
• MQTs offer HUGE opportunities
• Must be substantial CPU and I/O savings
• Can be used independently
• Data refresh is convenient
Checklist for Performance – New Thinking
• Leverage new DB2 SQL Columnar Capabilities
• Compression, In Memory, SIMD, Skipping
• No more designs – Load and Go!
• SIMD,
• No more indexes
• Partitioning & parallelism for performance
• Use UID frequency or Time based partitioning
• Include all filtering within SQL
• I/O is most important factor
© Copyright Dave Beulke & Associates [email protected] Page 42
Data Warehouse Designs for Big Data Performance
Dave Beulke and Associates [email protected]
E10-Wednesday 16-October 2013 9:45-10:45
Evaluate my session online:
www.idug.org/eu2013/eval
BLOG:
www.DaveBeulke.com