• No results found

Data Warehousing and OLAP.

N/A
N/A
Protected

Academic year: 2021

Share "Data Warehousing and OLAP."

Copied!
138
0
0

Loading.... (view fulltext now)

Full text

(1)

Data Warehousing and OLAP

Toon Calders Toon Calders t.calders@tue.nl

(2)

Motivation

« Traditional » relational databases are geared towards online transaction processing:

• bank terminal

• flight reservations

• student administration • student administration

Decision support systems have different requirements

(3)

Transaction Processing

Transaction processing

Operational setting

Up-to-date = critical

Flight reservations

ticket sales

do not sell a seat

twice

Simple data

Simple queries; only

« touch » a small part

of the database

twice

reservation, date,

name

Give flight details of

X, List flights to Y

(4)

Transaction Processing

Database must support

• simple data − tables

• simple queries

− select from where … − select from where …

• consistency & integrity CRITICAL • concurrency

Relational databases, Oriented, Object-Relational

(5)

Decision Support

Decision support

Off-line setting

« Historical » data

Summarized data

Flight company

Evaluate ROI flights

Flights of last year

# passengers per

Summarized data

Integrate different

databases

Statistical queries

# passengers per

carrier for destination X

Passengers, fuel costs,

maintenance info

Average % of seats

(6)

PART I

Concepts

(7)

Outline

Online Analytical Processing

Data Warehouses

Conceptual model: Data Cubes

Query languages for supporting OLAP

• SQL extensions • SQL extensions • MDX

(8)

A decision support DB maintained

separately

from the operational databases.

Why Separate Data Warehouse?

• Different functions

Data Warehouse

• Different functions

– DBMS— tuned for OLTP

– Warehouse—tuned for OLAP

• Different data

– Decision support requires historical data

• Integration of data from

heterogeneous sources

(9)

Three-Tier Architecture

Data Extract Transform Monitor & Integrator Metadata Serve Operational DBs other sources OLAP Server Analysis Query/Reporting Data Warehouse Transform Load Refresh OLAP Engine

Data Sources Front-End Tools

Serve Data Marts DBs Data Storage Data Mining ROLAP Server

(10)

Three-Tier Architecture

Extract-Transform-Load

• Get data from different sources; data integration • Cleaning the data

• Takes significant part of the effort (up to 80%!)

Refresh

Refresh

• Keep the data warehouse up to date when source data

(11)

Three-Tier Architecture

Data storage

• Optimized for OLAP

• Specialized data structures

• Specialized indexing structures

Data marts

• common term to refer to “less ambitious data

warehouses”

(12)

OLAP

OLAP = OnLine Analytical Processing

• Online = no waiting for answers

OLAP system = system that supports analytical queries that are dimensional in nature.

(13)

Outline

Online Analytical Processing

Data Warehouses

Conceptual model: Data cubes

Query languages for supporting OLAP

• SQL extensions • SQL extensions • MDX

(14)

Examples of Queries

Flight company: evaluate ticket sales

• give total, average, minimal, maximal amount • per date: week, month, year

• by destination/source port/country/continent • by ticket type

• by ticket type

• by # of connections • …

(15)

Common Characteristics

One (or few) special attribute(s): amount

 

 measure

Other attributes: select relevant regions

 

 dimensions

Different levels of generality (month, year)

Different levels of generality (month, year)

 

 hierarchies

Measurement data is summarized: sum, min, max, average

 

(16)

Supermarket Example

Evaluate the sales of products

• Product cost in $

• Customer: ID, city, state, country, • Store: chain, size, location,

• Product: brand, type, …

measure Dim.

hierarchies

• Product: brand, type, … • …

What are the measure and dimensional attributes, where are the hierarchies?

(17)

Why Dimensions?

store Cost in $

• Multidimensional view on the data

customer

(18)

Cross Tabulation

Cross-tabulations are highly useful

• Sales of clothes JuneAugust ‘06

Product: color

Blue Red Orange Total

June 51 25 158 234 July 58 20 120 198 August 65 22 51 138 Total 174 67 329 570 Date:month, JuneAugust 2006

(19)

Data Cubes

Extension of Cross-Tables to multiple dimensions

Conceptual notion

Blue Red Orange Total

Dimensions June 51 25 158 234 July 58 20 120 198 August 65 22 51 138 Total 174 67 329 570 Data Points/

1st level of aggregation Aggregated

w.r.t. X-dim

Aggregated w.r.t. Y-dim

Aggregated w.r.t. X and Y

(20)

Data Cubes

Date C ou n tr y sum sum TV VCR PC 1Qtr 2Qtr 3Qtr 4Qtr Ireland France C ou n tr y France Germany sum

(21)

Data Cubes

1) #TV’s sold in the 2Qtr? 2) Total sales in 3Qtr?

3) #products sold in France this year 4) #PC’s in Ireland in 2Qtr? sum 1Qtr 2Qtr 3Qtr 4Qtr Ireland France sum TV VCR PC France Germany sum

(22)

Data Cubes

sum sum 4 TV VCR PC 1Qtr 2Qtr 3Qtr 4Qtr Ireland France 1) #TV’s sold in the 2Qtr? 2) Total sales in 3Qtr?

3) #products sold in France this year

4) #PC’s in Ireland in 2Qtr?

In the back, at the

bottom, second column

2

3

France Germany

(23)

Outline

Online Analytical Processing

Data Warehouses

Conceptual model: Data cubes

Query languages for supporting OLAP • SQL extensions

• SQL extensions • MDX

(24)

Operations with Data Cubes

What operations can you think of that an analyst might find useful? (e.g., store)

(25)

Operations with Data Cubes

What operations can you think of that an analyst might find useful? (e.g., store)

• only look at stores in the Netherlands • look at cities instead of individual stores • look at the cross-table for product-date • look at the cross-table for product-date • restrict analysis to 2006, product O1

(26)

Roll-Up

Move in one dimension from a lower granularity to a higher one

• store  city

• cities  country

• product  product type • product  product type • Quarter  Semester

(27)

Roll-Up

C ou n tr y sum sum TV VCR PC 1Qtr 2Qtr 3Qtr 4Qtr Ireland France Date C ou n tr y France Germany sum

(28)

Roll-Up

C ou n tr y sum sum TV VCR PC 1st semester Ireland France Date 2nd semester C ou n tr y France Germany sum

(29)

Drill-down

Inverse operation

Move in one dimension from a higher granularity to a lower one

• city  store

• country  cities

• product type  product

Drill-through:

(30)

Pivoting

Change the dimensions that are “displayed”; select a cross-tab.

• look at the cross-table for product-date • display cross-table for date-customer

(31)

Pivoting

Change the dimensions that are “displayed”; select a cross-tab.

• look at the cross-table for product-date • display cross-table for date-customer

Sales Date

1st sem 2nd sem Total

Ireland 20 23 43

France 126 138 264

Country Germany 56 48 104

(32)

Slice & dice

(33)

Select

C ou n tr y sum sum TV VCR PC 1Qtr 2Qtr 3Qtr 4Qtr Ireland France C ou n tr y France Germany sum

(34)

Select

Select a part of the cube by restricting one or more dimensions

• restrict analysis to Ireland and VCR

sum

(35)

Outline

Online Analytical Processing

Data Warehouses

Conceptual model: Data cubes

Query languages for supporting OLAP

SQL extensions

SQL extensions

• MDX

(36)

Extended Aggregation

SQL-92 aggregation quite limited

• Many useful aggregates are either very hard or

impossible to specify

− Data cube

− Complex aggregates (median, variance) − Complex aggregates (median, variance)

− binary aggregates (correlation, regression curves) − ranking queries (“assign each student a rank based

on the total marks”)

(37)

Representing the Cube

Special value « null » is used:

Sales Date

1st sem 2nd sem Total

Ireland 20 23 43

Ireland 20 23 43

France 126 138 264

Country Germany 56 48 104

(38)

Representing the Cube

Special value « null » is used:

Date Country Sales

1st semester Ireland 20 1st semester France 126 1st semester Germany 56 1st semester Germany 56 1st semester null 202 2nd semester Ireland 23 2nd semester France 138 2nd semester Germany 48 2nd semester null 209 null Ireland 43 null France 264 null Germany 104 null null 411

(39)

Group by Cube

group by cube:

select item-name, color, size, sum(number)

from sales

group by cube(item-name, color, size)

Computes the union of eight different groupings of the Computes the union of eight different groupings of the

sales relation:

{ (item-name, color, size), (item-name, color),

(item-name, size),(color, size),(item-name),

(color),(size), ( ) }

(40)

Group by Cube

Relational representation of the date-country-sales cube can be computed as follows:

select semester as date, country, sum(sales) from sales

group by cube(semester,country) group by cube(semester,country)

• grouping() and decode() can be applied to replace “null”

by other constant:

(41)

Group by Rollup

rollup construct generates union on every prefix of specified list of attributes

select item-name, color, size,sum(number)

from sales

from sales

group by rollup(item-name, color, size)

Generates union of four groupings:

{ (item-name, color, size), (item-name, color), (item-name), ( ) }

(42)

Group by Rollup

Rollup can be used to generate aggregates at multiple levels.

E.g., suppose itemcategory(item-name, category) gives category of each item.

select category, item-name, sum(number) select category, item-name, sum(number)

from sales, itemcategory

where sales.item-name = itemcategory.item-name group by rollup(category, item-name)

gives a hierarchical summary by item-name and by

(43)

Group by Cube & Rollup

Multiple rollups and cubes can be used in a single group by clause

• Each generates set of group by lists, cross product of

(44)

Example

select item-name, color, size, sum(number)

from sales

group by rollup(item-name), rollup(color, size)

generates the groupings generates the groupings

{item-name, ()} X {(color, size), (color), ()}

=

{ (item-name, color, size), (item-name, color),

(item-name),(color, size), (color), ( ) }

(45)

MDX

Multidimensional Expressions (MDX) is a query language for cubes

• Supported by many data warehouses • Input and output are cubes

SELECT { [Measures].[Store Sales] } ON

COLUMNS, { [Date].[2002], [Date].[2003] } ON

ROWS FROM Sales

(46)

Outline

Online Analytical Processing

Data Warehouses

Conceptual model: Data cubes

Query languages for supporting OLAP

• SQL extensions • SQL extensions • MDX

(47)

Implementation

To make query answering more efficient: consolidate (materialize) all aggregations

Early implementations used a multidimensional array.

• Fast lookup: cell(prod. p, date d, prom. pr): • Fast lookup: cell(prod. p, date d, prom. pr):

− look up index of p1, index of d, index of pr:

(48)

Implementation

Multidimensional array

• obvious problem: sparse data

can easily be solved, though. Example:

Example:

binary search tree, key on index hash table.

(49)

Implementation

However: very quickly people were confronted with the

Data Explosion Problem

Consolidating the summaries blows the data enormously ! Reasons are often misunderstood and confusing.

(50)

Data Explosion Problem

Why? Suppose:

• n dimensions, every dimension has d values • dn possible tuples.

• Number of cells in the cube: (d+1)n

(51)

Data Explosion Problem

Why? Suppose

• n dimensions, every dimension has d values • every dimension has a hierarchy

• most extreme case: binary tree 



(52)

Data Explosion Problem

Why? Suppose

• n dimensions, every dimension has d values • every dimension has a hierarchy

• most extreme case: binary tree • most extreme case: binary tree

 

 2d possibilities/dimension  2 n x dn cells

Only partial explanation (factor 2n comes from an

(53)

Data Explosion Problem

Why?

• The problem is that most data is not dense, but sparse. • Hence, not all dn combinations are possible.

Example: 10 dimensions with 10 values Example: 10 dimensions with 10 values

• 10 000 000 000 possibilities

(54)

Data Explosion Problem

Example: 10 dimensions with 10 values

• 10 000 000 000 possibilities

Suppose « only » 1 000 000 are present

Every tuple increases count of 210 cells !

With hierarchies: effect even worse! If every hierarchy has 5 items:

(55)

Summary Part I

Datawarehouses supporting OLAP for decision support

Data Cubes as a conceptual model

• Measurement, dimensions, hierarchy, aggregation

Queries

• Roll-up, Drill-down, Slice and dice, pivoting… • Roll-up, Drill-down, Slice and dice, pivoting… • SQL:1999 extensions for supporting OLAP

(56)

PART II Technology

II.1 View materialization

(57)

View materialization

Is it possible to get performance gains when only partially materializing the cube?

Which parts should we materialize in order to get the best performance?

best performance?

V. Harinarayan, A. Rajaraman, and J. D. Ullman: Implementing data cubes efficiently. In: ACM SIGMOD 1996.

(58)

Overview

Problem statement

Formal model of computation

• Partial order on Queries • Cost model

Greedy solution

Greedy solution

Performance guarantee

(59)

The problem: informally

Relation Sales

• Dimensions: part, supplier, customer • Measure: sales

• Aggregation: sum

The data cube contains all aggregations on part,

The data cube contains all aggregations on part, supplier, customer; e.g.:

• (p1,s1,<all>,165) • (p1,s2,<all>,145)

(60)

Queries

SELECT part, supplier, sum(sales) FROM Sales

SELECT part, sum(sales) FROM Sales

We are interested in answering the following type of queries efficiently:

( part, supplier ) ( part ) FROM Sales

GROUP BY part, supplier

FROM Sales GROUP BY part

SELECT supplier, sum(sales) FROM Sales

GROUP BY supplier

SELECT customer, part, sum(sales) FROM Sales

GROUP BY customer, part ( part, customer ) ( supplier )

(61)

Queries

The queries are quantified by their grouping attributes

These queries select disjoint parts of the cube.

(62)

Queries

supplier

part

(63)

Queries

supplier

(part,supplier,customer)

part

(64)

Queries

supplier

(supplier,customer)

part

(65)

Queries

supplier

part

customer

(66)

Materialization

Materializing everything is impossible

Cost of evaluating following query directly:

SELECT D1, …, Dk, sum(M) FROM R

• in practice roughly linear in the size of R

( worst case: |R| log(|R|) )

Materializing some queries can help the other queries too

FROM R

(67)

Materialization

Example:

SELECT customer, part, sum(sales) FROM Sales

GROUP BY customer, part ( part, customer )

| Sales | GROUP BY customer, part

SELECT part, sum(sales) FROM Sales

GROUP BY part ( part )

(68)

Materialization

Example:

SELECT customer, part, sum(sales) FROM Sales

GROUP BY customer, part ( part, customer )

| Sales ||PC| GROUP BY customer, part

SELECT part, sum(sales) FROM Sales GROUP BY part ( part ) materialized as PC | Sales | PC |PC|

(69)

Example

Query

Answer

(part,supplier,customer)

6M

(part,customer)

6M

(part,supplier)

0.8M

(supplier,customer)

6M

(part)

0.2M

(supplier)

0.01M

(customer)

0.1M

()

1

(70)

Example: nothing materialized

Query

Answer

Cost

(part,supplier,customer)

6M

6M

(part,customer)

6M

6M

(part,supplier)

0.8M

6M

(supplier,customer)

6M

6M

(part)

0.2M

6M

(supplier)

0.01M

6M

(customer)

0.1M

6M

()

1

6M

Total cost: 48M

(71)

Example: some materialized

Query

Answer

Cost

(part,supplier,customer)

6M

(part,customer)

6M

(part,supplier)

0.8M

(supplier,customer)

6M

(part)

0.2M

(supplier)

0.01M

(customer)

0.1M

()

1

(72)

Example: some materialized

Query

Answer

Cost

(part,supplier,customer)

6M

6M

(part,customer)

6M

6M

(part,supplier)

0.8M

0.8M

(supplier,customer)

6M

6M

(part)

0.2M

0.8M

(supplier)

0.01M

0.8M

(customer)

0.1M

0.1M

()

1

0.1M

Total cost: 20.6M

(73)

Research Questions

What is the optimal set of views to materialize?

How many views must we materialize to get reasonable performance?

Given bounded space S, what is the optimal choice

(74)

Central Question

Given:

• for every view its size • a number k

Find:

• which k materialized views give the most gain

• which k materialized views give the most gain

This problem is NP-complete

(75)

Overview

Problem statement

Formal model of computationPartial order on Queries

Cost model

Greedy solution

Greedy solution

Performance guarantee

(76)

Partial order on queries

Q1 ≤ ≤ ≤ ≤ Q2 :

• Q1 can be answered using the query results of Q2 • A materialized view of Q2 can be used to answer Q1

• ≤≤≤≤ is a partial order on the views

(77)

Partial order on queries

(part, supplier, customer)

(part, supplier) (part, customer) (supplier, customer) (part, supplier)

(part) (supplier) (customer)

(78)

Partial order on queries

Can be generalized to hierarchies:

(product, country, -) ≤≤≤≤ (product, city,-)

(-, country, -) ≤≤≤≤ (product, city, -)

(-, -, -) ≤≤≤≤ (product, city, -)

(-, country, year) ≤≤≤≤ (-, city, month)

(-, country, year) ≤≤≤≤ (-, city, month)

(a1, …, an) ≤≤≤≤ (b

1, …, bn) if for all i = 1…n,

ai higher than or equal to bi in the hierarchy of the ith dimension (ai more general than bi)

(79)

Cost Model

Given a set of materialized views S = {V1,…,Vn}, the cost of query Q is

costS(Q) := min( { |Vi| : i = 1…n, Q ≤≤≤≤ V

i } )

The benefit of view W for computing Q w.r.t. S

The benefit of view W for computing Q w.r.t. S BW (Q | S) = costSSSS∪{W}{W}{W}{W}(Q) - costSSSS(Q)

Total benefit of W w.r.t. S BW(S) =

Σ

Σ

Σ

Σ

Q BW (Q | S)

(80)

Cost Model: Example

Q CS CSU{W} BW psc 6M 6M 0 Ps 6M 0.8M 5.2M psc (6M) ps (0.8M) pc (6M) sc (6M) Pc 6M 6M 0 p 6M 0.8M 5.2M s 6M 0.8M 5.2M c 0.1 0.1 0 () 0.1M 0.1M 0 p (0.2M) s (0.01M) c (0.1M) (1)

(81)

Overview

Problem statement

Formal model of computation

• Partial order on Queries • Cost model

Greedy solution

Greedy solution

Performance guarantee

(82)

Greedy algorithm

• Input:

• Size for every view and constant k

• In each step

• In each step

• Select the view with the most benefit

• Add it to the result

(83)

Greedy Algorithm

Algorithm

S := { top view }; for i:=1 to k {

select view W such that BW(S) is maximal select view W such that BW(S) is maximal S := S ∪∪∪∪ {W}

}

(84)

Example

a

100

50

75

E.g. The cost of

constructing view b given the view A is 100If we choose b to

b

c

d

e

f

g

h

50

75

20

40

30

1

10

If we choose b to materialize, the new cost of constructing view b is 50.

(85)

First round

a

100

50

75

Notice that not only b, but also d, e, g and h can be calculated from b

So the total benefit is

b

c

d

e

f

g

h

50

75

20

40

30

1

10

So the total benefit is (100 – 50) x 5 = 250

(86)

Continue…

Similarly, the benefit

of materializing c is

(100 – 75) x 5 = 125

a

100

50

75

b

c

d

e

f

g

h

50

75

20

40

30

1

10

Benefit

b

250

c

125

(87)

Not yet finish…

For e, Benefit =

(100-30) x 3 = 210

a

100

50

75

Benefit

b

c

d

e

f

g

h

50

75

20

40

30

1

10

b

250

c

125

e

210

(88)

Let’s choose b!

a

100

50

75

• For d and f ,

Benefit =

(100-20) x 2= 160

and

(100-40) x 2 =120

b

c

d

e

f

g

h

50

75

20

40

30

1

10

(100-40) x 2 =120

Benefit b 250 c 125 d 160 e 210 f 120

(89)

Next round?

Seems we should choose e, as it has the second largest benefit.

Let’s see what will happen in the second round.

Benefit Benefit b 250 c 125 d 160 e 210 f 120

(90)

Second round!

a

100

50

75

Now, only c and f get

benefit if we materialize

c (since e, g and h can

be more efficiently

calculated by using b)

b

c

d

e

f

g

h

50

75

20

40

30

1

10

calculated by using b)

Benefit

= (100 – 75) x 2 = 50

Benefit

c

50

(91)

How about choosing f?

a

100

50

75

If we choose f, we

found that h can be

effectively calculated

by using f instead of b.

Benefit

b

c

d

e

f

g

h

50

75

40

30

1

10

Benefit

= (100 – 40) + (50 – 40)

Benefit

c

50

f

70

20

(92)

Easy to work out others

Benefit of d

= (50 – 20) x 2 = 60

Benefit of e

= (50 – 30) x 3 = 60

a

100

50

75

Benefit of g

= 50 – 1 = 49

Benefit of h

= 50 – 10 = 40

b

c

d

e

f

g

h

50

75

20

40

30

1

10

(93)

Observation

In the first round, the benefit of choosing f (only 120) is far from the best choice (250)

But in second round, choosing f gives the maximum benefit!

1st round Benefit 2nd round Benefit

1st round Benefit b 250 c 125 d 160 e 210 f 120 2nd round Benefit c 50 d 60 e 70 f 70 g 49

(94)

Overview

Problem statement

Formal model of computation

• Partial order on Queries • Cost model

Greedy solution

Greedy solution

Performance guarantee

(95)

Simple? Optimal?

Trade off! This simple algorithm is not optimal in all cases!

(96)

Bad example

a

b

c

d

200

100

100

b

c

d

100

100

20 nodes Total 1000

99

(97)

Bad example

a

200

Choose cBenefit = (200-99) x (1 + 20 + 20) = 4141 = maximum

b

c

d

100

100

20 nodes Total 1000

99

(98)

Bad example

a

200

Now choose either 1 of b and d (same benefit)

b

c

d

100

100

20 nodes Total 1000

99

(99)

Bad example

a

200

How about these?

Very expensive!!!

b

c

d

100

100

20 nodes Total 1000

99

(100)

Optimal solution should be…

a

200

Only c is a little bit

expensive.

b

c

d

100

100

20 nodes Total 1000

99

(101)

Some theoretical results

It can be proved that we can get at least (e – 1 ) / e (which is about 63%) of the benefit of the optimal algorithm.

When selecting k views, bound is:

When selecting k views, bound is:

Bgreedy / Bopt ≥≥≥≥

There are lattices for which this ratio is arbitrarily close to this bound.

k

k

k

 −

1

1

(102)

Extensions (1)

Problem

• The views in a lattice are unlikely to have the same

probability of being requested in a query.

Solution:

Solution:

(103)

Extensions (2)

Problem

• Instead of asking for some fixed number (k) of views to

materialize, we might instead allocate a fixed amount of space to views.

Solution

Solution

• We can consider the “benefit of each view per unit

(104)

Conclusions Cube Materialization

Materialization of views is an essential query optimization strategy for decision-support applications.

Reason to materialize some part of the data cube but not all of the cube.

not all of the cube.

A lattice framework that models multidimensional analysis very well.

(105)

Conclusions Cube Materialization

Finding optimal solution is NP-hard.

Introduction of greedy algorithm

Greedy algorithm works on this lattice and picks the almost right views to materialize.

There exists cases which greedy algorithm fails to

There exists cases which greedy algorithm fails to produce optimal solution.

But greedy algorithm has guaranteed performance

(106)

II.2 Data storage and indexing

How is the data stored?

relational database (ROLAP)

Specialized structures (MOLAP)

How can we speed up computation?

How can we speed up computation?

• Indexing structures − bitmap index

(107)

Implementation

Nowadays systems can be divided in three categories:

• ROLAP (Relational OLAP)

− OLAP supported on top of a relational database • MOLAP (Multi-Dimensional OLAP)

− Use of special multi-dimensional data structures − Use of special multi-dimensional data structures • HOLAP: (Hybrid)

(108)

ROLAP

Typical database scheme:

• star schema

− fact table is central

− links to dimensional tables • Extensions:

• Extensions:

− snowflake schema

− dimensions have hierarchy/extra information

attached

− Star constellation

(109)

Example of a Star Schema

Order No Order No Order Date Order Date Customer No Customer No Customer Name Customer Name OrderNO OrderNO SalespersonID SalespersonID CustomerNO CustomerNO ProductNO ProductNO ProdName ProdName ProdDescr ProdDescr Category Category CategoryDescription CategoryDescription UnitPrice UnitPrice Order Order Customer Customer Product Product Fact Table Fact Table Customer Customer Address Address City City SalespersonID SalespersonID SalespersonName SalespersonName City City Quota Quota CustomerNO CustomerNO ProdNo ProdNo DateKey DateKey CityName CityName Quantity Quantity Total Price UnitPrice UnitPrice DateKey DateKey Date Date CityName CityName State State Country Country Salesperson Salesperson City City Date Date

(110)

Example of a Snowflake Schema

Order No Order No Order Date Order Date Customer No Customer No Customer Name Customer Name OrderNO OrderNO SalespersonID SalespersonID ProductNO ProductNO ProdName ProdName ProdDescr ProdDescr CategoryName CategoryName UnitPrice UnitPrice Order Order Customer Customer Product Product Fact Table Fact Table CategoryName CategoryName CategoryDescr CategoryDescr Category Category Customer Name Customer Name Customer Customer Address Address City City SalespersonID SalespersonID SalespersonName SalespersonName City City Quota Quota CustomerNO CustomerNO ProdNo ProdNo DateKey DateKey CityName CityName Quantity Quantity Total Price DateKey DateKey Date Date Month Month CityName CityName StateName StateName Salesperson Salesperson City City Date Date Month Month Year Year StateName StateName Country Country State State Month Month

(111)

time_key day day_of_the_week month Time item_key item_name brand Item

Sales Fact Table

Time_key from_location

shipper_key Item_key Time_key

Shipping Fact Table

Multiple fact tables share dimension tables

Example of Fact Constellation

branch_key branch_name branch_type month quarter year Measures Branch brand type supplier_key location_key street city Province/street country Location Avg_sales Euros_sold Unit_sold Location_key Branch_key Item_key Time_key shipper_key shipper_name location_key shipper_type shipper unit_shipped Euros_sold to_location from_location

(112)

This Lecture

How is the data stored?

• Relational database (ROLAP)Specialized structures (MOLAP)

How can we speed up computation?

How can we speed up computation?

• Indexing structures − bitmap index

(113)

MOLAP

Not on top of relational database

• most popular design

• specialized data structures − Multicubes vs Hypercubes

• Not all subcubes are materialized • Not all subcubes are materialized

(114)

Storing the cube

User identifies set of sparse attributes S, and a set of

dense attributes D.

Index tree is constructed on sparse dimensions.

Each leaf points to a multidimensional array indexed by D.

(115)

Example

product, store are sparse dimensions

date and customer-type are dense

1time ret reg Total 1/1/07 51 25 158 234 2/1/07 58 20 120 198 prod. p store s1 … … … 65 22 51 138 Total 174 67 329 570

1time ret reg Total 1/1/07 51 25 158 234 2/1/07 58 20 120 198 … 65 22 51 138 Total 174 67 329 570 prod. p store s2 prod. p

(116)

Example

product, store are sparse dimensions

date and customer-type are dense

1time ret reg Total 1/1/07 51 25 158 234 2/1/07 58 20 120 198 prod. p store s1 … … … 65 22 51 138 Total 174 67 329 570

1time ret reg Total 1/1/07 51 25 158 234 2/1/07 58 20 120 198 … 65 22 51 138 Total 174 67 329 570 prod. p store s2 prod. p

(117)

Example

product, store are sparse dimensions

date and customer-type are dense

1time ret reg Total 1/1/07 51 25 158 234 2/1/07 58 20 120 198 prod. p store s1 2D array Direct access … … … 65 22 51 138 Total 174 67 329 570

1time ret reg Total 1/1/07 51 25 158 234 2/1/07 58 20 120 198 … 65 22 51 138 Total 174 67 329 570 prod. p store s2 prod. p

E.g., B-tree, R-tree, …

(118)

Example

product, store are sparse dimensions

date and customer-type are dense

1time ret reg Total 1/1/07 51 25 158 234 2/1/07 58 20 120 198 prod. p store s1 2D array Direct access … … … 65 22 51 138 Total 174 67 329 570

1time ret reg Total 1/1/07 51 25 158 234 2/1/07 58 20 120 198 … 65 22 51 138 Total 174 67 329 570 prod. p store s2 prod. p

E.g., B-tree, R-tree, …

Direct access

Linked list

(119)

Queries

Efficiency depends on:

• does index on sparse dimensions fit into memory?

• Type of queries:

− Restrictions on all dimensions − Restrictions on all dimensions − Restrictions only on dense

(120)

Queries

Selection on all attributes: (p,s1,ret,all)

1time ret reg Total 1/1/07 51 25 158 234 2/1/07 58 20 120 198 prod. p store s1 … … … 65 22 51 138 Total 174 67 329 570

1time ret reg Total 1/1/07 51 25 158 234 2/1/07 58 20 120 198 … 65 22 51 138 Total 174 67 329 570 prod. p store s2 prod. p

(121)

Queries

Only on dense attributes: (-,-,ret,”2/1/07”)

1time ret reg Total 1/1/07 51 25 158 234 2/1/07 58 20 120 198 prod. p store s1 … … … 65 22 51 138 Total 174 67 329 570

1time ret reg Total 1/1/07 51 25 158 234 2/1/07 58 20 120 198 … 65 22 51 138 Total 174 67 329 570 prod. p store s2 prod. p

(122)

Queries

Only some sparse and dense attributes: (-,s1,ret,”2/1/07”)

1time ret reg Total 1/1/07 51 25 158 234 2/1/07 58 20 120 198 prod. p store s1 … … 65 22 51 138 Total 174 67 329 570

1time ret reg Total 1/1/07 51 25 158 234 2/1/07 58 20 120 198 … 65 22 51 138 Total 174 67 329 570 prod. p store s2 prod. p prod. p2 store s1

(123)

Queries

Only some sparse and dense attributes: (p,-,-,”2/1/07”)

1time ret reg Total 1/1/07 51 25 158 234 2/1/07 58 20 120 198 prod. p store s1 … … 65 22 51 138 Total 174 67 329 570

1time ret reg Total 1/1/07 51 25 158 234 2/1/07 58 20 120 198 … 65 22 51 138 Total 174 67 329 570 prod. p store s2 prod. p prod. p store s1

(124)

Storing the Cube

Dense combinations of dimensions can be stored in multi-dimensional arrays

For every combination of sparse dimensions

• one sub-cube

Sub-cubes indexed by sparse dimensions

Sub-cubes indexed by sparse dimensions

• E.g., B-tree

(125)

This Lecture

How is the data stored?

• relational database (ROLAP)

• Multi-dimensional structure (MOLAP)

How can we speed up computation?

How can we speed up computation?Indexing structures

− bitmap index − join index

(126)

Specialized Indexing Structures

B-trees, (covered in other courses)

Bitmapped indices,

Join indices,

(127)

Indexing principle:

mapping key values to records for associative direct access

Most popular indexing techniques in relational

Index Structures

Most popular indexing techniques in relational

database: B+-trees

For multi-dimensional data, a large number of

(128)

Bitmap index: indexing technique that has attracted attention in multi-dimensional DB implementation

table Customer City Car

Bitmap Indexes

Customer City Car

c1 Detroit Ford c2 Chicago Honda c3 Detroit Honda c4 Poznan Ford c5 Paris BMW c6 Paris Nissan

(129)

The index consists of bitmaps:

ec1 Chicago Detroit Paris Poznan

1 0 1 0 0

2 1 0 0 0

3 0 1 0 0

4 0 0 0 1

5 0 0 1 0

ec1 BMW Ford Honda Nissan

1 0 1 0 0 2 1 0 1 0 3 0 0 1 0 4 0 1 0 0

Bitmap Indexes

bitmaps 5 0 0 1 0 6 0 0 1 0 4 0 1 0 0 5 1 0 0 0 6 0 0 0 1 bitmaps •Index on a particular column

•Index consists of a number of bit vectors - bitmaps

•Each value in the indexed column has a bit vector (bitmaps)

•The length of the bit vector is the number of records in the base table •The i-th bit is set if the i-th row of the base table has the value for the indexed column

(130)

Index on a particular column

Index consists of a number of bit vectors

-bitmaps

Each value in the indexed column has a bit vector

Bitmap Indexes

(bitmaps)

The length of the bit vector is the number of

records in the base table

The i-th bit is set if the i-th row of the base table

(131)

20 23 18 19 20 21 id name age 1 joe 20 2 fred 20 3 sally 21 4 nancy 20 5 tom 20 1 1 0 1 1 0 0 00 Query:

Get people with age = 20 and name = “fred” List for age = 20:

Bitmap Indexes

21 22 23 25 26 5 tom 20 6 pat 25 7 dave 21 8 jeff 26 . . . age index bit maps data records 0 0 0 0 1 0 0 0 1 0 1 1

List for age = 20: 1101100000

List for name = “fred”: 0100000001

Answer is intersection: 0100000000

Suited well for domains with small cardinality

(132)

Bitmap Index

Size of bitmaps can be further reduced

• use run-length encoding

1111000111100000001111000 is encoded as 4x1;3x0;4x1;7x0;4x1;3x0

4x1;3x0;4x1;7x0;4x1;3x0

• Can reduce the storage space significantly

(133)

With efficient hardware support for bitmap

operations (AND, OR, XOR, NOT), bitmap index offers better access methods for certain queries

e.g., selection on two attributes

Bitmap Index – Summary

Some commercial products have implemented

bitmap index

Works poorly for high cardinality domains since

the number of bitmaps increases

Difficult to maintain - need reorganization when

(134)

This Lecture

How is the data stored?

• relational database (ROLAP)

• Specialized structures (MOLAP)

How can we speed up computation?

How can we speed up computation?Indexing structures

− bitmap indexjoin index

(135)

Traditional indexes: value 





 rids.

Join indices: tuples in the join 





 to rids in

the source tables.

Data warehouse:

Join Indexes

Data warehouse:

values of dimensions of star schema  rows in fact table.

Join indexes can span multiple

(136)

“Combine” SALE, PRODUCT relations

In SQL:

SELECT * FROM SALE, PRODUCT

sale prodId storeId date amt

p1 c1 1 12

p2 c1 1 11

product id name price

p1 bolt 10 p2 nut 5

Join

p1 c3 1 50 p2 c2 1 8 p1 c1 2 44 p1 c2 2 4 p2 nut 5

joinTb prodId name price storeId date amt

p1 bolt 10 c1 1 12 p2 nut 5 c1 1 11 p1 bolt 10 c3 1 50 p2 nut 5 c2 1 8 p1 bolt 10 c1 2 44 p1 bolt 10 c2 2 4

(137)

product id nam e price jIndex

p1 bolt 10 r1,r3,r5,r6

p2 nut 5 r2,r4

join index

Join Indexes

sale rId prodId storeId date am t

r1 p1 c1 1 12 r2 p2 c1 1 11 r3 p1 c3 1 50 r4 p2 c2 1 8 r5 p1 c1 2 44 r6 p1 c2 2 4

(138)

Summary

Data warehouse is a specialized database to support analytical queries = OLAP queries

Data cube as conceptual model

Implementation of Data Cube

• View selection problem • Explosion problem

• ROLAP vs. MOLAP • Indexing structures

References

Related documents

In this study, we intended to detect Quantitative trait loci (QTL) for mid-parent heterosis under infestation with the Mediterranean corn borer (MCB, Sesamia nonagrioides Lef.) using

The most promising candidate genes for mood disorder are brain derived neurotrophic factor ( BDNF ), D-amino acid oxidase activator ( DAOA ), disrupted in schizophrenia 1 ( DISC1

Se ha observado que agonistas de este receptor nuclear regulan la secreción de hormonas esenciales para el mantenimiento de la gestación, como la progesterona y la gonadotrofina

Figure 2: Effect of apocynin on oxidative stress parameters MDA, NO and APOP level in plasma and liver tissue of carbon tetra chloride induced rats. Data are presented as mean±SEM,

In 2002/03, similar proportions of OB recipients and non-recipients reported that they had received such help (20 per cent and 27 per cent respectively); however, as in 2001/02,

Rhyme, according to Low, is often a problem in song translation if the translator tries to imitate the exact rhyme scheme as the source text (Translating Songs, 6)..

As well as tabulat- ing specific heats at constant pressure and constant volume, specific heats are given as heat capacity per unit mass, heat capacity per mole, or heat capacity