Database Applications. Advanced Querying. Transaction Processing. Transaction Processing. Data Warehouse. Decision Support. Transaction processing

(1)

Advanced Querying

OLAP

Data Warehousing

Database Applications

• Transaction processing

– Online setting

– Supports day-to-day operation of business

• Decision support

– Offline setting

– Strategic planning (statistics)

Transaction Processing

Transaction processing

• Operational setting

• Up-to-date = critical

• Simple data

• Simple queries

Flight reservations

• ticket sales

• do not sell a seat twice

• reservation, date, name

• Give flight details of X List flights to Y

Transaction Processing

• Database must support

– simple data

• tables – simple queries

• select from where …

– consistency & integrity CRITICAL – concurrency

• Relational databases, Object-Oriented,

Object-Relational

Decision Support

Decision support

• Off-line setting

• « Historical » data

• Summarized data

• Different databases

• Statistical queries

Flight company

• Evaluate ROI flights

• Flights of last year

• # passengers on line L

• Passengers, fuel costs, maintenance info

• Average % of seats sold/month/destination

A decision support DB that is maintained separatelyfrom the organization’s operational databases.

Why Separate Data Warehouse?

• High performance for both systems – DBMS— tuned for OLTP

• access methods, indexing, concurrency control, recovery – Warehouse—tuned for OLAP

• complex OLAP queries, multidimensional view, consolidation.

• Different functions and different data

– Missing data: Decision support requires historical data which operational DBs do not typically maintain

– Data consolidation: DS requires consolidation (aggregation, summarization) of data from heterogeneous sources – Data quality: different sources typically use inconsistent data

representations, codes and formats which have to be reconciled

Data Warehouse

(2)

Three-Tier Architecture

Data Warehouse Extract

Transform Load Refresh

OLAP Engine Monitor

&

Integrator Metadata

Data Sources Front-End Tools

Serve

Data Marts Operational

DBs other sources

Data Storage

OLAP

Server Analysis

Query/Reporting

Data Mining

ROLAP Server

OLAP

• OLAP = OnLine Analytical Processing

– Online = no waiting for answers

• OLAP system = system that supports

analytical queries

that are dimensional

in nature.

This Lecture

• Examples of decision support queries

• Data Cubes

– Conceptual data model – Typical operations

• Implementation

– ROLAP vs MOLAP – Indexing structures

• SQL:1999 support for OLAP

Examples of Queries

• Flight company: evaluate ticket sales

– give total, average, minimal, maximal amount – per date: week, month, year

– by destination/source port/country/continent – by ticket type

– by # of connections – …

Characteristics

• One special attribute: amount

measure

• Other attributes: select relevant regions

dimensions

Different levels of generality (month, year, …)

hierarchies

• Measure data is summarized: sum, min,

max, average

aggregations

Supermarket example

• Evaluate the sales of products

– Product cost in $

– Customer: ID, city, state, country, – Store: chain, size, location, – Product: brand, type, … – …

• What are the measure and dimensional

attributes, where are the hierarchies?

measure

Dim.

hierarchies

(3)

Why dimensions?

customer store

product

Cost in $

• Multidimensional view on the data

Cross Tabulation

• Cross-tabulations are highly useful – Sales of clothes JuneAugust ‘06

138 51 22 65 August

570 329 67 174 Total

198 120 20 58 July

234 158 25 51 June

Total Orange Red

Blue

Product: color

Date:month, JuneAugust 2006

Data cubes

• Extension of Cross-Tables to multiple

dimensions

• Conceptual notion

138 51 22 65 August

570 329 67 174 Total

198 120 20 58 July

234 158 25 51 June

Total Orange Red

Blue

Dimensions

Data Points/

1^stlevel of aggregation Aggregated w.r.t. X-dim

Aggregated w.r.t. Y-dim

Aggregated w.r.t. X and Y

Data Cubes

Date Product

Country

sum TV sum

VCR PC

1Qtr 2Qtr 3Qtr 4Qtr

Ireland

France

Germany

sum

Data Cubes

• Base cuboid

= n-dimensional cube with n

number of dimensions

• The top most 0-D cuboid, which holds the

highest-level of summarization, is called

the apex cuboid

• The lattice of cuboids forms a data cube

Lattice of Cuboids

all

product, date, country

product date country

product, date product, country

date, country

(4)

Operations with Data Cubes

Scenario:

• Before starting the analysis task:

– what data?

• select a few relevant dimensions

• define hierarchy

• aggregation functions of interest – Pre-materialize

• load data

• compute counts/max, min, avg, … on beforehand

Operations with Data Cubes

• What operations can you think of an

analyst might find useful? (e.g., store)

Operations with Data Cubes

• What operations can you think of that an

analyst might find useful? (e.g., store)

– only look at stores in the Netherlands – look at cities instead of individual stores – look at the cross-table for product-date – restrict analysis to 2006, product O1

– go back to a finer granularity at the store level

Roll-Up

• Move in one dimension from a lower

granularity to a higher one

– store city – cities country – product product type

Drill-down

• Move in one dimension from a higher

granularity to a lower one

– city store – country cities – product type product

• Drill-through:

– go back to the original, individual data records

Pivoting

• Change the dimensions that are

“displayed”; select a cross-tab.

– look at the cross-table for product-date – display cross-table for date-customer

(5)

Slice & dice

• Select a part of the cube by restricting one

or more dimensions

– restrict analysis to “city = Eindhoven”

Summary of Concepts

• Cube: Multidimensional view on data

– dimensional attributes

– measure attribute

• Operations:

– roll-up/drill-down – pivoting – slice and dice

Implementation

• To make query answering more efficient:

consolidate (materialize) aggregations

• Obvious implementation: multidimensional

array.

– Fast lookup: cell(prod. p, date d, prom. pr):

• look up index of p1, index of d, index of pr:

index = (p x D x PR) + (d x PR) + pr

Implementation

• Multidimensional array

– obvious problem: sparse data

can easily be solved, though.

Example:

binary search tree, key on index hash table.

Implementation

• However: very quickly people were

confronted with the Data Explosion

Problem

Consolidating the summaries blows up the data enormously !

Reasons are often misunderstood and confusing.

Data Explosion Problem

• Why?

Suppose:

– n dimensions, every dimension has d values – dⁿpossible tuples.

– Number of cells in the cube: (d+1)ⁿ – So, this is not the problem

(6)

Data Explosion Problem

• Why?

Suppose

– n dimensions, every dimension has d values – every dimension has a hierarchy

– most extreme case: binary tree

2d possibilities/dimension

Data Explosion Problem

• Why?

Suppose

– n dimensions, every dimension has d values – every dimension has a hierarchy

– most extreme case: binary tree

2d possibilities/dimension 2ⁿx dⁿcells

Only partial explanation (factor 2

ⁿ

comes

from an extremely pathological case)

Data Explosion Problem

• Why?

– The problem is that most data is not dense, but sparse.

– Hence, not all dⁿcombinations are possible.

Example: 10 dimensions with 10 values – 10 000 000 000 possibilities

Suppose « only » 1 000 000 are present

Data Explosion Problem

Example: 10 dimensions with 10 values – 10 000 000 000 possibilities

Suppose « only » 1 000 000 are present

Every tuple increases count of 2¹⁰cells !

With hierarchies: effect even worse!

If every hierarchy has 5 items:

5

¹⁰

= 9 765 625 cells!

View Selection Problem

• Suffices to precompute some aggregates, and compute others on demand.

– aggregate on (item-name, color) from an aggregate on (item-name, color, size)

– For all but a few “non-decomposable” aggregates such as median

• Several optimizations for computing multiple aggregates

– Compute aggregate on (item-name, color) from an aggregate on

(item-name, color, size)

– Compute aggregates on (item-name, color, size), (item-name, color) and (item-name) in single DB sort

View Selection Problem

all

product, date product, country

date, country

(7)

View Selection Problem

all

product, date product, country date, country

Which views to select:

hard research problem !

Implementation

Nowadays systems can be divided in three

categories:

– ROLAP (Relational OLAP)

• OLAP supported on top of a relational database – MOLAP (Multi-Dimensional OLAP)

• Use of special multi-dimensional data structures – HOLAP: (Hybrid)

• combination of previous two

ROLAP

• Cubes can easily be represented in

relational tables: special value “all”

Month Prod. Cust. Price

Jan p1 c1 10

Jan p2 c1 8

Jan p1 c2 10

Feb p1 c1 9

…

all p1 c1 102

Jan all c1 18

Jan p1 all 1 230

all all c1 4 235

…

all all all 1 253 458

ROLAP

• Typical database scheme:

– star schema

• fact table is central

• links to dimensional tables – Extensions:

• snowflake schema

– dimensions have hierarchy/extra information attached

• Star constellation

– multiple star schemas sharing dimensions

Example of a Star Schema

Order No Order No Order Date Order Date

Customer No Customer No Customer Name Customer Name Customer Customer Address Address City City

SalespersonID SalespersonID SalespersonName SalespersonName City City Quota Quota

OrderNO OrderNO SalespersonID SalespersonID CustomerNO CustomerNO ProdNo ProdNo DateKey DateKey CityName CityName Quantity Quantity Total Price

ProductNO ProductNO ProdName ProdName ProdDescr ProdDescr Category Category CategoryDescription CategoryDescription UnitPrice UnitPrice

DateKey DateKey DateDate

CityName CityName State State Country Country Order

Order

Customer Customer

Salesperson Salesperson

CityCity Date Date Product Product

Fact Table Fact Table

Example of a Snowflake

Schema

Order No Order No Order Date Order Date

Customer No Customer No Customer Name Customer Name Customer Customer Address Address City City

SalespersonID SalespersonID SalespersonName SalespersonName CityCity Quota Quota

OrderNO OrderNO SalespersonID SalespersonID CustomerNO CustomerNO ProdNo ProdNo DateKey DateKey CityName CityName Quantity Quantity Total Price

ProductNO ProductNO ProdName ProdName ProdDescr ProdDescr Category Category Category Category UnitPrice UnitPrice

DateKey DateKey Date Date Month Month

CityName CityName State State Country Country Order

Order

Customer Customer

Salesperson Salesperson

CityCity Date Date Product Product

Fact Table Fact Table

CategoryName CategoryName CategoryDescr CategoryDescr

Month Month Year Year YearYear

StateName StateName Country Country Category Category

State State Month Month

Year Year

(8)

branch_key branch_name branch_type time_key day day_of_the_week month quarter year

Measures Branch

Time

item_key item_name brand type supplier_key

Item

location_key street city Province/street country

Location Sales Fact Table

Avg_sales Euros_sold Unit_sold Location_key

Branch_key Item_key Time_key

shipper_key shipper_name location_key shipper_type shipper unit_shipped

Euros_sold to_location from_location

shipper_key Item_key Time_key Shipping Fact Table

Multiple fact tables share dimension tables