OLAP Systems and Multidimensional Expressions I
Krzysztof Dembczy´ nski
Intelligent Decision Support Systems Laboratory (IDSS) Pozna´n University of Technology, Poland
Software Development Technologies Master studies, first semester
Review of the previous lectures
•
Mining of massive datasets
•
Evolution of database systems: operational vs. analytical systems.
•
Dimensional modeling.
•
Operational vs. analytical systems.
•
Extraction, transformation and load of data.
Outline
1
Motivation
2
OLAP Servers
3
SQL
4
Summary
Outline
1
Motivation
2
OLAP Servers
3
SQL
4
Summary
OLAP systems
•
The next step is to provide solutions for querying and reporting
multidimensional analytical data.
•
The goal is to provide efficient solutions for physical representation
and processing of these data.
Multidimensional reports
•
OLAP servers provide an effective solution for accessing and
processing large volumes of high dimensional data.
•
OLAP systems provide tools for multidimensional reporting.
Outline
1
Motivation
2
OLAP Servers
3
SQL
4
Summary
Multidimensional cube
•
The proper data model for multidimensional reporting is the
multidimensional one.
Operators in multidimensional data model
• Roll up – summarize data along a dimension hierarchy.
• Drill down – go from higher level summary to lower level summary or detailed data.
• Slice and dice – corresponds to selection and projection.
• Pivot – reorient cube.
• Raking, Time functions, etc..
Lattice of cuboids
•
Different degrees of summarizations are presented as a lattice of
cuboids.
Example for the dimensions: time, product, location, supplier.
Total number of cuboids
•
For an n-dimensional data cube, the total number of cuboids that can
be generated is:
T = Y
i=1,...,n
(L
i+ 1),
where L
iis the number of levels associated with dimension i
(excluding the virtual top level ”all” since generalizing to ”all” is
equivalent to the removal of a dimension).
•
For example, if the cube has 10 dimensions and each dimension has 4
levels, the total number of cuboids that can be generated will be:
l = 5
10= 9, 8 × 10
6.
Three types of aggregate functions
•
distributive: count(), sum(), max(), min(),
•
algebraic: ave(), std dev(),
•
holistic: median(), mode(), rank().
OLAP servers
•
Relational OLAP (ROLAP),
•
Multidimensional OLAP (MOLAP),
•
Hybrid OLAP (HOLAP).
ROLAP
•
ROLAP servers use a relational or post-relational database
management system to store and manage warehouse data.
•
Optimization techniques:
I Denormalization,
I Materialized views,
I Partitioning,
I Joins,
I Indexes,
I Query processing.
ROLAP
•
Advantages of ROLAP Servers:
I Scalable with respect to the number of dimensions,
I Scalable with respect to the size of data,
I Sparsity is not a problem (fact tables contain only facts),
I Mature and well-developed technology.
•
Disadvantage of ROLAP Servers:
I Worse performance than MOLAP,
I Additional data structures and optimization techniques used to improve the performance.
MOLAP
•
MOLAP Servers use array-based multidimensional storage engines.
•
Optimization techniques:
I Two-level storage representation: dense cubes are identified and stored as array structures, sparse cubes employ compression techniques,
I Materialized cubes.
MOLAP
•
Advantages of MOLAP Servers:
I Multidimensional views are directly mapped to data cube array structures – efficient access to data,
I Can easily store subaggregates.
•
Disadvantages of MOLAP Servers:
I Scalability problem in the case of larger number of dimensions,
I Not tailored for sparse data,
I Young technology,
I There are no existing standards.
MOLAP
Example
•
Logical model consists of four dimensions: customer, product,
location, and day
•
In case of 100 000 customers, 10 000 products, 1 000 locations and 1
000 days, the data cube will contain 1 000 000 000 000 000 cells!
•
Huge number of cells is empty: a customer is not able to buy all
products in all locations . . ..
HOLAP
•
HOLAP Servers are a hybrid approach that combines ROLAP and
MOLAP technology.
•
HOLAP benefits from the greater scalability of ROLAP and the faster
computation of MOLAP.
Querying data warehouses
•
There are two main approaches for querying data warehouses:
I SQL (ROLAP)
I MDX (MOLAP)
Outline
1
Motivation
2
OLAP Servers
3
SQL
4
Summary
SQL queries
•
Querying the star schema
SQL queries
SQL – group by
SELECT Name, AVG(Grade)
FROM Students grades G, Student S
WHERE G.Student = S.ID
GROUP BY Name;
Name AVG(Grade) Inmon 4.8
Kimball 4.7 Gates 4.0 Todman 4.5
SQL queries
SQL– group by
SELECT Academic year, Name, AVG(Grade)
FROM Students grades G, Academic year A, Professor P WHERE G.Professor = P.ID and G.Academic year = A.ID GROUP BY Academic year, Name;
Academic year Name AVG(Grade) 2001/2 Stefanowski 4.2
2002/3 Stefanowski 4.0 2003/4 Stefanowski 3.9 2001/2 S lowi´nski 4.1 2002/3 S lowi´nski 3.8 2003/4 S lowi´nski 3.6
SQL queries
•
OLAP extensions in SQL:
I GROUP BY ROLLUP,
I GROUP BY CUBE,
I GROUP BY GROUPING SETS
I GROUPING and DECODE/CASE
I OVER
I Ranking functions
SQL queries
•
GROUP BY CUBE
SELECT Time, Product, Location, Supplier, SUM(Gain) FROM Sales
GROUP BY CUBE (Time, Product, Location, Supplier);
SQL queries
•
GROUP BY CUBE
SELECT Time, Product, Location, Supplier, SUM(Gain) FROM Sales
GROUP BY Time, Product, Location, Supplier UNION ALL
SELECT Time, Product, Location, ’’*’’, SUM(Gain) FROM Sales
GROUP BY Time, Product, Location UNION ALL
SELECT Time, Product, ’’*’’, Location, SUM(Gain) FROM Sales
GROUP BY Time, Product, Location UNION ALL
. . . UNION ALL
SQL queries
•
GROUP BY CUBE
SELECT Academic year, Name, AVG(Grade)
FROM Students grades GROUP BY CUBE(Academic year, Name);
Academic year Name AVG(Grade) 2001/2 Stefanowski 4.2
2001/2 S lowi´nski 4.1 2002/3 Stefanowski 4.0 2002/3 S lowi´nski 3.8 2003/4 Stefanowski 3.9 2003/4 S lowi´nski 3.6 2003/4 Dembczy´nski 4.8
2001/2 NULL 4.15
2002/3 NULL 3.85
SQL queries
•
GROUP BY ROLLUP
SELECT Time, Product, Location, Supplier, SUM(Gain) FROM Sales
GROUP BY ROLLUP (Time, Product, Location, Supplier);
SQL queries
•
GROUP BY ROLLUP
SELECT Time, Product, Location, Supplier, SUM(Gain) FROM Sales
GROUP BY Time, Product, Location, Supplier UNION ALL
SELECT Time, Product, Location, ’’*’’, SUM(Gain) FROM Sales
GROUP BY Time, Product, Location UNION ALL
SELECT Time, Product, ’’*’’, ’’*’’, SUM(Gain) FROM Sales
GROUP BY Time, Product UNION ALL
SELECT Time, ’’*’’, ’’*’’, ’’*’’, SUM(Gain)
SQL queries
•
GROUP BY ROLLUP
SELECT Academic year, Name, AVG(Grade) FROM Students grades G
GROUP BY ROLLUP(Academic year, Name);
Academic year Name AVG(Grade) 2001/2 Stefanowski 4.2
2001/2 S lowi´nski 4.1 2002/3 Stefanowski 4.0 2002/3 S lowi´nski 3.8 2003/4 Stefanowski 3.9 2003/4 S lowi´nski 3.6 2003/4 Dembczy´nski 4.8
2001/2 NULL 4.15
2002/3 NULL 3.85
2003/4 NULL 3.8
SQL queries
•
GROUP BY GROUPING SETS
SELECT Time, Product, Location, Supplier, SUM(Gain) FROM Sales
GROUP BY GROUPING SETS ((Time), (Product), (Location), (Supplier));
SQL queries
•
GROUP BY GROUPING SETS
SELECT Time, ’’*’’, ’’*’’, ’’*’’, SUM(Gain) FROM Sales
GROUP BY Time UNION ALL
SELECT ’’*’’, Product, ’’*’’, ’’*’’, SUM(Gain) FROM Sales
GROUP BY Product UNION ALL
SELECT ’’*’’, ’’*’’, Location, ’’*’’, SUM(Gain) FROM Sales
GROUP BY Location UNION ALL
SELECT ’*’, ’*’, ’*’, Supplier, SUM(Gain) FROM Sales GROUP BY Supplier;
SQL queries
•
GROUP BY GROUPING SETS
SELECT Academic year, Name, AVG(Grade)
FROM Students grades GROUP BY GROUPING SETS ((Academic year), (Name),());
Academic year Name AVG(Grade)
2001/2 NULL 4.15
2002/3 NULL 3.85
2003/4 NULL 3.8
NULL Stefanowski 3.9
NULL S lowi´nski 3.6 NULL Dembczy´nski 4.8
NULL NULL 3.95
SQL queries
•
GROUPING(<column expression>)
I Returns a value of 1 if the value of expression in the row is a null representing the set of all values.
I <column expression> is a column or an expression that contains a column in a GROUP BY clause.
I GROUPING is used to distinguish the null values that are returned by ROLLUP, CUBE or GROUPING SETS from standard null values.
I The NULL returned as the result of a ROLLUP, CUBE or GROUPING SETS operation is a special use of NULL.
SQL queries
•
GROUPING(<column expression>)
SELECT Extra scholarship, AVG(Grade), GROUPING(Extra scholarship) as Grouping
FROM Students grades
GROUP BY ROLL UP(Extra scholarship);
Extra scholarship AVG(Grade) Grouping
Yes 4.15 0
No 3.61 0
NULL 4.03 0
NULL 3.89 1
SQL queries
•
DECODE(expression , search , result [, search ,
result]... [, default] )
I If the value of expression is equal to search, then result is returned, otherwise default is returned.
I The functionality is similar to CASE expression,
I The results of GROUPING() can be passed into a DECODE function or the CASE expression.
SQL queries
•
DECODE(expression , search , result [, search ,
result]... [, default] )
SELECT DECODE(GROUPING(Extra scholarship), 1, "Total Average", Extra scholarship) as Extra scholarship, AVG(Grade)
FROM Students grades
GROUP BY ROLL UP(Extra scholarship);
Extra scholarship AVG(Grade)
Yes 4.15
No 3.61
NULL 4.03
SQL queries
•
OVER():
I Determines the partitioning and ordering of a rowset before the associated window function is applied.
I The OVER clause defines a window or user-specified set of rows within a query result set.
I A window function then computes a value for each row in the window.
I The OVER clause can be used with functions to compute aggregated values such as moving averages, cumulative aggregates, running totals, or a top N per group results.
I Syntax:
OVER (
[ <PARTITION BY clause> ] [ <ORDER BY clause> ] [ <ROW or RANGE clause> ] )
SQL queries
•
OVER():
I PARTITION BY:
• Divides the query result set into partitions. The window function is applied to each partition separately and computation restarts for each partition.
I ORDER BY:
• Defines the logical order of the rows within each partition of the result set, i.e., it specifies the logical order in which the window function calculation is performed.
I ROW | RANGE:
• Further limits the rows within the partition by specifying start and end points within the partition.
• This is done by specifying a range of rows with respect to the current row either by logical association or physical association.
• The ROWS clause limits the rows within a partition by specifying a fixed
SQL queries
•
Ranking functions:
I RANK () OVER:
• Returns the rank of each row within the partition of a result set. The rank of a row is one plus the number of ranks that come before the row in question.
I DENSE RANK () OVER:
• Returns the rank of rows within the partition of a result set, without any gaps in the ranking. The rank of a row is one plus the number of distinct ranks that come before the row in question.
I NTILE (integer expression) OVER:
• Distributes the rows in an ordered partition into a specified number of groups. The groups are numbered, starting at one. For each row, NTILE returns the number of the group to which the row belongs.
I ROW NUMBER () OVER:
• Returns the sequential number of a row within a partition of a result
SQL queries
•
Examples:
I Ranking of the students
SELECT Student, Avg(Grade), RANK () OVER (ORDER BY Avg(Grade) DESC) FROM Students grades GROUP BY Student;
I To sort according to rank, we need to order the resulting relation:
SELECT Student, Avg(Grade), RANK () OVER (ORDER BY Avg(Grade) DESC) AS rank of grades
FROM Students grades GROUP BY Student ORDER BY rank of grades;
SQL queries
•
Examples:
I Ranking of students partitioned by instructors.
SELECT Instructor Name, Student, Avg(Grade), RANK () OVER (PARTITION BY Student ORDER BY Avg(Grade) DESC) AS rank 1
FROM Students grades
GROUP BY Student,Instructor Name ORDER BY Instructor Name, rank 1;
I Moving average of a student:
SELECT Student, Academic year, AVG (grades) OVER (PARTITION BY Student ORDER BY Academic year DESC ROWS UNBOUNDED PRECEDING)
FROM Students grades
ORDER BY Student, Academic year;
Outline
1
Motivation
2
OLAP Servers
3
SQL
4
Summary
Summary
•
Three types of OLAP servers: ROLAP, MOLAP, and HOLAP.
•
Several approaches for querying data warehouses.
•
ROLAP servers: SQL and its OLAP extensions.
•