OLAP Systems and Multidimensional Expressions I

(1)

OLAP Systems and Multidimensional Expressions I

Krzysztof Dembczy´ nski

Intelligent Decision Support Systems Laboratory (IDSS) Pozna´n University of Technology, Poland

Software Development Technologies Master studies, first semester

(2)

Review of the previous lectures

•

Mining of massive datasets

•

Evolution of database systems: operational vs. analytical systems.

•

Dimensional modeling.

•

Operational vs. analytical systems.

•

Extraction, transformation and load of data.

(3)

Outline

1

Motivation

2

OLAP Servers

3

SQL

4

Summary

(4)

Outline

1

Motivation

2

OLAP Servers

3

SQL

4

Summary

(5)

OLAP systems

•

The next step is to provide solutions for querying and reporting

multidimensional analytical data.

•

The goal is to provide efficient solutions for physical representation

and processing of these data.

(6)

Multidimensional reports

•

OLAP servers provide an effective solution for accessing and

processing large volumes of high dimensional data.

•

OLAP systems provide tools for multidimensional reporting.

(7)

Outline

1

Motivation

2

OLAP Servers

3

SQL

4

Summary

(8)

Multidimensional cube

•

The proper data model for multidimensional reporting is the

multidimensional one.

(9)

Operators in multidimensional data model

• Roll up – summarize data along a dimension hierarchy.

• Drill down – go from higher level summary to lower level summary or detailed data.

• Slice and dice – corresponds to selection and projection.

• Pivot – reorient cube.

• Raking, Time functions, etc..

(10)

Lattice of cuboids

•

Different degrees of summarizations are presented as a lattice of

cuboids.

Example for the dimensions: time, product, location, supplier.

(11)

Total number of cuboids

•

For an n-dimensional data cube, the total number of cuboids that can

be generated is:

T = Y

i=1,...,n

(L

_i

+ 1),

where L

_i

is the number of levels associated with dimension i

(excluding the virtual top level ”all” since generalizing to ”all” is

equivalent to the removal of a dimension).

•

For example, if the cube has 10 dimensions and each dimension has 4

levels, the total number of cuboids that can be generated will be:

l = 5

¹⁰

= 9, 8 × 10

⁶

.

(12)

Three types of aggregate functions

•

distributive: count(), sum(), max(), min(),

•

algebraic: ave(), std dev(),

•

holistic: median(), mode(), rank().

(13)

OLAP servers

•

Relational OLAP (ROLAP),

•

Multidimensional OLAP (MOLAP),

•

Hybrid OLAP (HOLAP).

(14)

ROLAP

•

ROLAP servers use a relational or post-relational database

management system to store and manage warehouse data.

•

Optimization techniques:

I Denormalization,

I Materialized views,

I Partitioning,

I Joins,

I Indexes,

I Query processing.

(15)

ROLAP

•

Advantages of ROLAP Servers:

I Scalable with respect to the number of dimensions,

I Scalable with respect to the size of data,

I Sparsity is not a problem (fact tables contain only facts),

I Mature and well-developed technology.

•

Disadvantage of ROLAP Servers:

I Worse performance than MOLAP,

I Additional data structures and optimization techniques used to improve the performance.

(16)

MOLAP

•

MOLAP Servers use array-based multidimensional storage engines.

•

Optimization techniques:

I Two-level storage representation: dense cubes are identified and stored as array structures, sparse cubes employ compression techniques,

I Materialized cubes.

(17)

MOLAP

•

Advantages of MOLAP Servers:

I Multidimensional views are directly mapped to data cube array structures – efficient access to data,

I Can easily store subaggregates.

•

Disadvantages of MOLAP Servers:

I Scalability problem in the case of larger number of dimensions,

I Not tailored for sparse data,

I Young technology,

I There are no existing standards.

(18)

MOLAP

Example

•

Logical model consists of four dimensions: customer, product,

location, and day

•

In case of 100 000 customers, 10 000 products, 1 000 locations and 1

000 days, the data cube will contain 1 000 000 000 000 000 cells!

•

Huge number of cells is empty: a customer is not able to buy all

products in all locations . . ..

(19)

HOLAP

•

HOLAP Servers are a hybrid approach that combines ROLAP and

MOLAP technology.

•

HOLAP benefits from the greater scalability of ROLAP and the faster

computation of MOLAP.

(20)

Querying data warehouses

•

There are two main approaches for querying data warehouses:

I SQL (ROLAP)

I MDX (MOLAP)

(21)

Outline

1

Motivation

2

OLAP Servers

3

SQL

4

Summary

(22)

SQL queries

•

Querying the star schema

(23)

SQL queries

SQL – group by

SELECT Name, AVG(Grade)

FROM Students grades G, Student S

WHERE G.Student = S.ID

GROUP BY Name;

Name AVG(Grade) Inmon 4.8

Kimball 4.7 Gates 4.0 Todman 4.5

(24)

SQL queries

SQL– group by

SELECT Academic year, Name, AVG(Grade)

FROM Students grades G, Academic year A, Professor P WHERE G.Professor = P.ID and G.Academic year = A.ID GROUP BY Academic year, Name;

Academic year Name AVG(Grade) 2001/2 Stefanowski 4.2

2002/3 Stefanowski 4.0 2003/4 Stefanowski 3.9 2001/2 S lowiński 4.1 2002/3 S lowiński 3.8 2003/4 S lowiński 3.6

(25)

SQL queries

•

OLAP extensions in SQL:

I GROUP BY ROLLUP,

I GROUP BY CUBE,

I GROUP BY GROUPING SETS

I GROUPING and DECODE/CASE

I OVER

I Ranking functions

(26)

SQL queries

•

GROUP BY CUBE

SELECT Time, Product, Location, Supplier, SUM(Gain) FROM Sales

GROUP BY CUBE (Time, Product, Location, Supplier);

(27)

SQL queries

•

GROUP BY CUBE

GROUP BY Time, Product, Location, Supplier UNION ALL

SELECT Time, Product, Location, ’’*’’, SUM(Gain) FROM Sales

GROUP BY Time, Product, Location UNION ALL

SELECT Time, Product, ’’*’’, Location, SUM(Gain) FROM Sales

. . . UNION ALL

(28)

SQL queries

•

GROUP BY CUBE

FROM Students grades GROUP BY CUBE(Academic year, Name);

2001/2 S lowiński 4.1 2002/3 Stefanowski 4.0 2002/3 S lowiński 3.8 2003/4 Stefanowski 3.9 2003/4 S lowiński 3.6 2003/4 Dembczyński 4.8

2001/2 NULL 4.15

2002/3 NULL 3.85

(29)

SQL queries

•

GROUP BY ROLLUP

GROUP BY ROLLUP (Time, Product, Location, Supplier);

(30)

SQL queries

•

GROUP BY ROLLUP

GROUP BY Time, Product, Location, Supplier UNION ALL

SELECT Time, Product, Location, ’’*’’, SUM(Gain) FROM Sales

SELECT Time, Product, ’’*’’, ’’*’’, SUM(Gain) FROM Sales

GROUP BY Time, Product UNION ALL

SELECT Time, ’’*’’, ’’*’’, ’’*’’, SUM(Gain)

(31)

SQL queries

•

GROUP BY ROLLUP

SELECT Academic year, Name, AVG(Grade) FROM Students grades G

GROUP BY ROLLUP(Academic year, Name);

2001/2 S lowiński 4.1 2002/3 Stefanowski 4.0 2002/3 S lowiński 3.8 2003/4 Stefanowski 3.9 2003/4 S lowiński 3.6 2003/4 Dembczyński 4.8

2001/2 NULL 4.15

2002/3 NULL 3.85

2003/4 NULL 3.8

(32)

SQL queries

•

GROUP BY GROUPING SETS

GROUP BY GROUPING SETS ((Time), (Product), (Location), (Supplier));

(33)

SQL queries

•

GROUP BY GROUPING SETS

SELECT Time, ’’*’’, ’’*’’, ’’*’’, SUM(Gain) FROM Sales

GROUP BY Time UNION ALL

SELECT ’’*’’, Product, ’’*’’, ’’*’’, SUM(Gain) FROM Sales

GROUP BY Product UNION ALL

SELECT ’’*’’, ’’*’’, Location, ’’*’’, SUM(Gain) FROM Sales

GROUP BY Location UNION ALL

SELECT ’*’, ’*’, ’*’, Supplier, SUM(Gain) FROM Sales GROUP BY Supplier;

(34)

SQL queries

•

GROUP BY GROUPING SETS

FROM Students grades GROUP BY GROUPING SETS ((Academic year), (Name),());

Academic year Name AVG(Grade)

2001/2 NULL 4.15

2002/3 NULL 3.85

2003/4 NULL 3.8

NULL Stefanowski 3.9

NULL S lowi´nski 3.6 NULL Dembczy´nski 4.8

NULL NULL 3.95

(35)

SQL queries

•

GROUPING(<column expression>)

I Returns a value of 1 if the value of expression in the row is a null representing the set of all values.

I <column expression> is a column or an expression that contains a column in a GROUP BY clause.

I GROUPING is used to distinguish the null values that are returned by ROLLUP, CUBE or GROUPING SETS from standard null values.

I The NULL returned as the result of a ROLLUP, CUBE or GROUPING SETS operation is a special use of NULL.

(36)

SQL queries

•

GROUPING(<column expression>)

SELECT Extra scholarship, AVG(Grade), GROUPING(Extra scholarship) as Grouping

FROM Students grades

GROUP BY ROLL UP(Extra scholarship);

Extra scholarship AVG(Grade) Grouping

Yes 4.15 0

No 3.61 0

NULL 4.03 0

NULL 3.89 1

(37)

SQL queries

•

DECODE(expression , search , result [, search ,

result]... [, default] )

I If the value of expression is equal to search, then result is returned, otherwise default is returned.

I The functionality is similar to CASE expression,

I The results of GROUPING() can be passed into a DECODE function or the CASE expression.

(38)

SQL queries

•

DECODE(expression , search , result [, search ,

result]... [, default] )

SELECT DECODE(GROUPING(Extra scholarship), 1, "Total Average", Extra scholarship) as Extra scholarship, AVG(Grade)

GROUP BY ROLL UP(Extra scholarship);

Extra scholarship AVG(Grade)

Yes 4.15

No 3.61

NULL 4.03

(39)

SQL queries

•

OVER():

I Determines the partitioning and ordering of a rowset before the associated window function is applied.

I The OVER clause defines a window or user-specified set of rows within a query result set.

I A window function then computes a value for each row in the window.

I The OVER clause can be used with functions to compute aggregated values such as moving averages, cumulative aggregates, running totals, or a top N per group results.

I Syntax:

OVER (

[ <PARTITION BY clause> ] [ <ORDER BY clause> ] [ <ROW or RANGE clause> ] )

(40)

SQL queries

•

OVER():

I PARTITION BY:

• Divides the query result set into partitions. The window function is applied to each partition separately and computation restarts for each partition.

I ORDER BY:

• Defines the logical order of the rows within each partition of the result set, i.e., it specifies the logical order in which the window function calculation is performed.

I ROW | RANGE:

• Further limits the rows within the partition by specifying start and end points within the partition.

• This is done by specifying a range of rows with respect to the current row either by logical association or physical association.

• The ROWS clause limits the rows within a partition by specifying a fixed

(41)

SQL queries

•

Ranking functions:

I RANK () OVER:

• Returns the rank of each row within the partition of a result set. The rank of a row is one plus the number of ranks that come before the row in question.

I DENSE RANK () OVER:

• Returns the rank of rows within the partition of a result set, without any gaps in the ranking. The rank of a row is one plus the number of distinct ranks that come before the row in question.

I NTILE (integer expression) OVER:

• Distributes the rows in an ordered partition into a specified number of groups. The groups are numbered, starting at one. For each row, NTILE returns the number of the group to which the row belongs.

I ROW NUMBER () OVER:

• Returns the sequential number of a row within a partition of a result

(42)

SQL queries

•

Examples:

I Ranking of the students

SELECT Student, Avg(Grade), RANK () OVER (ORDER BY Avg(Grade) DESC) FROM Students grades GROUP BY Student;

I To sort according to rank, we need to order the resulting relation:

SELECT Student, Avg(Grade), RANK () OVER (ORDER BY Avg(Grade) DESC) AS rank of grades

FROM Students grades GROUP BY Student ORDER BY rank of grades;

(43)

SQL queries

•

Examples:

I Ranking of students partitioned by instructors.

SELECT Instructor Name, Student, Avg(Grade), RANK () OVER (PARTITION BY Student ORDER BY Avg(Grade) DESC) AS rank 1

GROUP BY Student,Instructor Name ORDER BY Instructor Name, rank 1;

I Moving average of a student:

SELECT Student, Academic year, AVG (grades) OVER (PARTITION BY Student ORDER BY Academic year DESC ROWS UNBOUNDED PRECEDING)

ORDER BY Student, Academic year;

(44)

Outline

1

Motivation

2

OLAP Servers

3

SQL

4

Summary

(45)

Summary

•

Three types of OLAP servers: ROLAP, MOLAP, and HOLAP.

•

Several approaches for querying data warehouses.

•

ROLAP servers: SQL and its OLAP extensions.

•

MOLAP servers: . . .

(46)