Advanced Querying
OLAP
Data Warehousing
Database Applications
• Transaction processing
– Online setting– Supports day-to-day operation of business
• Decision support
– Offline setting– Strategic planning (statistics)
Transaction Processing
Transaction processing
• Operational setting
• Up-to-date = critical
• Simple data
• Simple queries
Flight reservations
• ticket sales
• do not sell a seat twice
• reservation, date, name
• Give flight details of X List flights to Y
Transaction Processing
• Database must support
– simple data• tables – simple queries
• select from where …
– consistency & integrity CRITICAL – concurrency
• Relational databases, Object-Oriented,
Object-Relational
Decision Support
Decision support
• Off-line setting
• « Historical » data
• Summarized data
• Different databases
• Statistical queries
Flight company
• Evaluate ROI flights
• Flights of last year
• # passengers on line L
• Passengers, fuel costs, maintenance info
• Average % of seats sold/month/destination
A decision support DB that is maintained separatelyfrom the organization’s operational databases.
Why Separate Data Warehouse?
• High performance for both systems – DBMS— tuned for OLTP
• access methods, indexing, concurrency control, recovery – Warehouse—tuned for OLAP
• complex OLAP queries, multidimensional view, consolidation.
• Different functions and different data
– Missing data: Decision support requires historical data which operational DBs do not typically maintain
– Data consolidation: DS requires consolidation (aggregation, summarization) of data from heterogeneous sources – Data quality: different sources typically use inconsistent data
representations, codes and formats which have to be reconciled
Data Warehouse
Three-Tier Architecture
Data Warehouse Extract
Transform Load Refresh
OLAP Engine Monitor
&
Integrator Metadata
Data Sources Front-End Tools
Serve
Data Marts Operational
DBs other sources
Data Storage
OLAP
Server Analysis
Query/Reporting
Data Mining
ROLAP Server
OLAP
• OLAP = OnLine Analytical Processing
– Online = no waiting for answers• OLAP system = system that supports
analytical queriesthat are dimensional
in nature.
This Lecture
• Examples of decision support queries
• Data Cubes
– Conceptual data model – Typical operations
• Implementation
– ROLAP vs MOLAP – Indexing structures• SQL:1999 support for OLAP
Examples of Queries
• Flight company: evaluate ticket sales
– give total, average, minimal, maximal amount – per date: week, month, year– by destination/source port/country/continent – by ticket type
– by # of connections – …
Characteristics
• One special attribute: amount
measure
• Other attributes: select relevant regions
dimensions
Different levels of generality (month, year, …)
hierarchies
• Measure data is summarized: sum, min,
max, average
aggregations
Supermarket example
• Evaluate the sales of products
– Product cost in $– Customer: ID, city, state, country, – Store: chain, size, location, – Product: brand, type, … – …
• What are the measure and dimensional
attributes, where are the hierarchies?
measure
Dim.
hierarchies
Why dimensions?
customer store
product
Cost in $
• Multidimensional view on the data
Cross Tabulation
• Cross-tabulations are highly useful – Sales of clothes JuneAugust ‘06
138 51 22 65 August
570 329 67 174 Total
198 120 20 58 July
234 158 25 51 June
Total Orange Red
Blue
Product: color
Date:month, JuneAugust 2006
Data cubes
• Extension of Cross-Tables to multiple
dimensions
• Conceptual notion
138 51 22 65 August
570 329 67 174 Total
198 120 20 58 July
234 158 25 51 June
Total Orange Red
Blue
Dimensions
Data Points/
1stlevel of aggregation Aggregated w.r.t. X-dim
Aggregated w.r.t. Y-dim
Aggregated w.r.t. X and Y
Data Cubes
Date Product
Country
sum TV sum
VCR PC
1Qtr 2Qtr 3Qtr 4Qtr
Ireland
France
Germany
sum
Data Cubes
• Base cuboid
= n-dimensional cube with n
number of dimensions
• The top most 0-D cuboid, which holds the
highest-level of summarization, is called
the apex cuboid
• The lattice of cuboids forms a data cube
Lattice of Cuboids
all
product, date, country
product date country
product, date product, country
date, country
Operations with Data Cubes
Scenario:
• Before starting the analysis task:
– what data?
• select a few relevant dimensions
• define hierarchy
• aggregation functions of interest – Pre-materialize
• load data
• compute counts/max, min, avg, … on beforehand
Operations with Data Cubes
• What operations can you think of an
analyst might find useful? (e.g., store)
Operations with Data Cubes
• What operations can you think of that an
analyst might find useful? (e.g., store)
– only look at stores in the Netherlands – look at cities instead of individual stores – look at the cross-table for product-date – restrict analysis to 2006, product O1
– go back to a finer granularity at the store level
Roll-Up
• Move in one dimension from a lower
granularity to a higher one
– store city – cities country – product product type
Drill-down
• Move in one dimension from a higher
granularity to a lower one
– city store – country cities – product type product
• Drill-through:
– go back to the original, individual data records
Pivoting
• Change the dimensions that are
“displayed”; select a cross-tab.
– look at the cross-table for product-date – display cross-table for date-customer
Slice & dice
• Select a part of the cube by restricting one
or more dimensions
– restrict analysis to “city = Eindhoven”
Summary of Concepts
• Cube: Multidimensional view on data
– dimensional attributes– measure attribute
• Operations:
– roll-up/drill-down – pivoting – slice and dice
Implementation
• To make query answering more efficient:
consolidate (materialize) aggregations
• Obvious implementation: multidimensional
array.
– Fast lookup: cell(prod. p, date d, prom. pr):
• look up index of p1, index of d, index of pr:
index = (p x D x PR) + (d x PR) + pr
Implementation
• Multidimensional array
– obvious problem: sparse datacan easily be solved, though.
Example:
binary search tree, key on index hash table.
Implementation
• However: very quickly people were
confronted with the Data Explosion
ProblemConsolidating the summaries blows up the data enormously !
Reasons are often misunderstood and confusing.
Data Explosion Problem
• Why?
Suppose:
– n dimensions, every dimension has d values – dnpossible tuples.
– Number of cells in the cube: (d+1)n – So, this is not the problem
Data Explosion Problem
• Why?
Suppose
– n dimensions, every dimension has d values – every dimension has a hierarchy
– most extreme case: binary tree
2d possibilities/dimension
Data Explosion Problem
• Why?
Suppose
– n dimensions, every dimension has d values – every dimension has a hierarchy
– most extreme case: binary tree
2d possibilities/dimension 2n x dncells
Only partial explanation (factor 2
ncomes
from an extremely pathological case)
Data Explosion Problem
• Why?
– The problem is that most data is not dense, but sparse.
– Hence, not all dncombinations are possible.
Example: 10 dimensions with 10 values – 10 000 000 000 possibilities
Suppose « only » 1 000 000 are present
Data Explosion Problem
Example: 10 dimensions with 10 values – 10 000 000 000 possibilities
Suppose « only » 1 000 000 are present
Every tuple increases count of 210cells !With hierarchies: effect even worse!
If every hierarchy has 5 items:
5
10= 9 765 625 cells!
View Selection Problem
• Suffices to precompute some aggregates, and compute others on demand.
– aggregate on (item-name, color) from an aggregate on (item-name, color, size)
– For all but a few “non-decomposable” aggregates such as median
• Several optimizations for computing multiple aggregates
– Compute aggregate on (item-name, color) from an aggregate on
(item-name, color, size)
– Compute aggregates on (item-name, color, size), (item-name, color) and (item-name) in single DB sort
View Selection Problem
all
product, date, country
product date country
product, date product, country
date, country
View Selection Problem
all
product, date, country
product date country
product, date product, country date, country
Which views to select:
hard research problem !
Implementation
Nowadays systems can be divided in three
categories:
– ROLAP (Relational OLAP)
• OLAP supported on top of a relational database – MOLAP (Multi-Dimensional OLAP)
• Use of special multi-dimensional data structures – HOLAP: (Hybrid)
• combination of previous two
ROLAP
• Cubes can easily be represented in
relational tables: special value “all”
Month Prod. Cust. Price
Jan p1 c1 10
Jan p2 c1 8
Jan p1 c2 10
Feb p1 c1 9
…
all p1 c1 102
Jan all c1 18
Jan p1 all 1 230
all all c1 4 235
…
all all all 1 253 458
ROLAP
• Typical database scheme:
– star schema
• fact table is central
• links to dimensional tables – Extensions:
• snowflake schema
– dimensions have hierarchy/extra information attached
• Star constellation
– multiple star schemas sharing dimensions
Example of a Star Schema
Order No Order No Order Date Order Date
Customer No Customer No Customer Name Customer Name Customer Customer Address Address City City
SalespersonID SalespersonID SalespersonName SalespersonName City City Quota Quota
OrderNO OrderNO SalespersonID SalespersonID CustomerNO CustomerNO ProdNo ProdNo DateKey DateKey CityName CityName Quantity Quantity Total Price
ProductNO ProductNO ProdName ProdName ProdDescr ProdDescr Category Category CategoryDescription CategoryDescription UnitPrice UnitPrice
DateKey DateKey DateDate
CityName CityName State State Country Country Order
Order
Customer Customer
Salesperson Salesperson
CityCity Date Date Product Product
Fact Table Fact Table
Example of a Snowflake
Schema
Order No Order No Order Date Order Date
Customer No Customer No Customer Name Customer Name Customer Customer Address Address City City
SalespersonID SalespersonID SalespersonName SalespersonName CityCity Quota Quota
OrderNO OrderNO SalespersonID SalespersonID CustomerNO CustomerNO ProdNo ProdNo DateKey DateKey CityName CityName Quantity Quantity Total Price
ProductNO ProductNO ProdName ProdName ProdDescr ProdDescr Category Category Category Category UnitPrice UnitPrice
DateKey DateKey Date Date Month Month
CityName CityName State State Country Country Order
Order
Customer Customer
Salesperson Salesperson
CityCity Date Date Product Product
Fact Table Fact Table
CategoryName CategoryName CategoryDescr CategoryDescr
Month Month Year Year YearYear
StateName StateName Country Country Category Category
State State Month Month
Year Year
branch_key branch_name branch_type time_key day day_of_the_week month quarter year
Measures Branch
Time
item_key item_name brand type supplier_key
Item
location_key street city Province/street country
Location Sales Fact Table
Avg_sales Euros_sold Unit_sold Location_key
Branch_key Item_key Time_key
shipper_key shipper_name location_key shipper_type shipper unit_shipped
Euros_sold to_location from_location
shipper_key Item_key Time_key Shipping Fact Table
Multiple fact tables share dimension tables