• No results found

OLAP Systems and Multidimensional Expressions I

N/A
N/A
Protected

Academic year: 2021

Share "OLAP Systems and Multidimensional Expressions I"

Copied!
46
0
0

Loading.... (view fulltext now)

Full text

(1)

OLAP Systems and Multidimensional Expressions I

Krzysztof Dembczy´ nski

Intelligent Decision Support Systems Laboratory (IDSS) Pozna´n University of Technology, Poland

Software Development Technologies Master studies, first semester

(2)

Review of the previous lectures

Mining of massive datasets

Evolution of database systems: operational vs. analytical systems.

Dimensional modeling.

Operational vs. analytical systems.

Extraction, transformation and load of data.

(3)

Outline

1

Motivation

2

OLAP Servers

3

SQL

4

Summary

(4)

Outline

1

Motivation

2

OLAP Servers

3

SQL

4

Summary

(5)

OLAP systems

The next step is to provide solutions for querying and reporting

multidimensional analytical data.

The goal is to provide efficient solutions for physical representation

and processing of these data.

(6)

Multidimensional reports

OLAP servers provide an effective solution for accessing and

processing large volumes of high dimensional data.

OLAP systems provide tools for multidimensional reporting.

(7)

Outline

1

Motivation

2

OLAP Servers

3

SQL

4

Summary

(8)

Multidimensional cube

The proper data model for multidimensional reporting is the

multidimensional one.

(9)

Operators in multidimensional data model

• Roll up – summarize data along a dimension hierarchy.

• Drill down – go from higher level summary to lower level summary or detailed data.

• Slice and dice – corresponds to selection and projection.

• Pivot – reorient cube.

• Raking, Time functions, etc..

(10)

Lattice of cuboids

Different degrees of summarizations are presented as a lattice of

cuboids.

Example for the dimensions: time, product, location, supplier.

(11)

Total number of cuboids

For an n-dimensional data cube, the total number of cuboids that can

be generated is:

T = Y

i=1,...,n

(L

i

+ 1),

where L

i

is the number of levels associated with dimension i

(excluding the virtual top level ”all” since generalizing to ”all” is

equivalent to the removal of a dimension).

For example, if the cube has 10 dimensions and each dimension has 4

levels, the total number of cuboids that can be generated will be:

l = 5

10

= 9, 8 × 10

6

.

(12)

Three types of aggregate functions

distributive: count(), sum(), max(), min(),

algebraic: ave(), std dev(),

holistic: median(), mode(), rank().

(13)

OLAP servers

Relational OLAP (ROLAP),

Multidimensional OLAP (MOLAP),

Hybrid OLAP (HOLAP).

(14)

ROLAP

ROLAP servers use a relational or post-relational database

management system to store and manage warehouse data.

Optimization techniques:

I Denormalization,

I Materialized views,

I Partitioning,

I Joins,

I Indexes,

I Query processing.

(15)

ROLAP

Advantages of ROLAP Servers:

I Scalable with respect to the number of dimensions,

I Scalable with respect to the size of data,

I Sparsity is not a problem (fact tables contain only facts),

I Mature and well-developed technology.

Disadvantage of ROLAP Servers:

I Worse performance than MOLAP,

I Additional data structures and optimization techniques used to improve the performance.

(16)

MOLAP

MOLAP Servers use array-based multidimensional storage engines.

Optimization techniques:

I Two-level storage representation: dense cubes are identified and stored as array structures, sparse cubes employ compression techniques,

I Materialized cubes.

(17)

MOLAP

Advantages of MOLAP Servers:

I Multidimensional views are directly mapped to data cube array structures – efficient access to data,

I Can easily store subaggregates.

Disadvantages of MOLAP Servers:

I Scalability problem in the case of larger number of dimensions,

I Not tailored for sparse data,

I Young technology,

I There are no existing standards.

(18)

MOLAP

Example

Logical model consists of four dimensions: customer, product,

location, and day

In case of 100 000 customers, 10 000 products, 1 000 locations and 1

000 days, the data cube will contain 1 000 000 000 000 000 cells!

Huge number of cells is empty: a customer is not able to buy all

products in all locations . . ..

(19)

HOLAP

HOLAP Servers are a hybrid approach that combines ROLAP and

MOLAP technology.

HOLAP benefits from the greater scalability of ROLAP and the faster

computation of MOLAP.

(20)

Querying data warehouses

There are two main approaches for querying data warehouses:

I SQL (ROLAP)

I MDX (MOLAP)

(21)

Outline

1

Motivation

2

OLAP Servers

3

SQL

4

Summary

(22)

SQL queries

Querying the star schema

(23)

SQL queries

SQL – group by

SELECT Name, AVG(Grade)

FROM Students grades G, Student S

WHERE G.Student = S.ID

GROUP BY Name;

Name AVG(Grade) Inmon 4.8

Kimball 4.7 Gates 4.0 Todman 4.5

(24)

SQL queries

SQL– group by

SELECT Academic year, Name, AVG(Grade)

FROM Students grades G, Academic year A, Professor P WHERE G.Professor = P.ID and G.Academic year = A.ID GROUP BY Academic year, Name;

Academic year Name AVG(Grade) 2001/2 Stefanowski 4.2

2002/3 Stefanowski 4.0 2003/4 Stefanowski 3.9 2001/2 S lowi´nski 4.1 2002/3 S lowi´nski 3.8 2003/4 S lowi´nski 3.6

(25)

SQL queries

OLAP extensions in SQL:

I GROUP BY ROLLUP,

I GROUP BY CUBE,

I GROUP BY GROUPING SETS

I GROUPING and DECODE/CASE

I OVER

I Ranking functions

(26)

SQL queries

GROUP BY CUBE

SELECT Time, Product, Location, Supplier, SUM(Gain) FROM Sales

GROUP BY CUBE (Time, Product, Location, Supplier);

(27)

SQL queries

GROUP BY CUBE

SELECT Time, Product, Location, Supplier, SUM(Gain) FROM Sales

GROUP BY Time, Product, Location, Supplier UNION ALL

SELECT Time, Product, Location, ’’*’’, SUM(Gain) FROM Sales

GROUP BY Time, Product, Location UNION ALL

SELECT Time, Product, ’’*’’, Location, SUM(Gain) FROM Sales

GROUP BY Time, Product, Location UNION ALL

. . . UNION ALL

(28)

SQL queries

GROUP BY CUBE

SELECT Academic year, Name, AVG(Grade)

FROM Students grades GROUP BY CUBE(Academic year, Name);

Academic year Name AVG(Grade) 2001/2 Stefanowski 4.2

2001/2 S lowi´nski 4.1 2002/3 Stefanowski 4.0 2002/3 S lowi´nski 3.8 2003/4 Stefanowski 3.9 2003/4 S lowi´nski 3.6 2003/4 Dembczy´nski 4.8

2001/2 NULL 4.15

2002/3 NULL 3.85

(29)

SQL queries

GROUP BY ROLLUP

SELECT Time, Product, Location, Supplier, SUM(Gain) FROM Sales

GROUP BY ROLLUP (Time, Product, Location, Supplier);

(30)

SQL queries

GROUP BY ROLLUP

SELECT Time, Product, Location, Supplier, SUM(Gain) FROM Sales

GROUP BY Time, Product, Location, Supplier UNION ALL

SELECT Time, Product, Location, ’’*’’, SUM(Gain) FROM Sales

GROUP BY Time, Product, Location UNION ALL

SELECT Time, Product, ’’*’’, ’’*’’, SUM(Gain) FROM Sales

GROUP BY Time, Product UNION ALL

SELECT Time, ’’*’’, ’’*’’, ’’*’’, SUM(Gain)

(31)

SQL queries

GROUP BY ROLLUP

SELECT Academic year, Name, AVG(Grade) FROM Students grades G

GROUP BY ROLLUP(Academic year, Name);

Academic year Name AVG(Grade) 2001/2 Stefanowski 4.2

2001/2 S lowi´nski 4.1 2002/3 Stefanowski 4.0 2002/3 S lowi´nski 3.8 2003/4 Stefanowski 3.9 2003/4 S lowi´nski 3.6 2003/4 Dembczy´nski 4.8

2001/2 NULL 4.15

2002/3 NULL 3.85

2003/4 NULL 3.8

(32)

SQL queries

GROUP BY GROUPING SETS

SELECT Time, Product, Location, Supplier, SUM(Gain) FROM Sales

GROUP BY GROUPING SETS ((Time), (Product), (Location), (Supplier));

(33)

SQL queries

GROUP BY GROUPING SETS

SELECT Time, ’’*’’, ’’*’’, ’’*’’, SUM(Gain) FROM Sales

GROUP BY Time UNION ALL

SELECT ’’*’’, Product, ’’*’’, ’’*’’, SUM(Gain) FROM Sales

GROUP BY Product UNION ALL

SELECT ’’*’’, ’’*’’, Location, ’’*’’, SUM(Gain) FROM Sales

GROUP BY Location UNION ALL

SELECT ’*’, ’*’, ’*’, Supplier, SUM(Gain) FROM Sales GROUP BY Supplier;

(34)

SQL queries

GROUP BY GROUPING SETS

SELECT Academic year, Name, AVG(Grade)

FROM Students grades GROUP BY GROUPING SETS ((Academic year), (Name),());

Academic year Name AVG(Grade)

2001/2 NULL 4.15

2002/3 NULL 3.85

2003/4 NULL 3.8

NULL Stefanowski 3.9

NULL S lowi´nski 3.6 NULL Dembczy´nski 4.8

NULL NULL 3.95

(35)

SQL queries

GROUPING(<column expression>)

I Returns a value of 1 if the value of expression in the row is a null representing the set of all values.

I <column expression> is a column or an expression that contains a column in a GROUP BY clause.

I GROUPING is used to distinguish the null values that are returned by ROLLUP, CUBE or GROUPING SETS from standard null values.

I The NULL returned as the result of a ROLLUP, CUBE or GROUPING SETS operation is a special use of NULL.

(36)

SQL queries

GROUPING(<column expression>)

SELECT Extra scholarship, AVG(Grade), GROUPING(Extra scholarship) as Grouping

FROM Students grades

GROUP BY ROLL UP(Extra scholarship);

Extra scholarship AVG(Grade) Grouping

Yes 4.15 0

No 3.61 0

NULL 4.03 0

NULL 3.89 1

(37)

SQL queries

DECODE(expression , search , result [, search ,

result]... [, default] )

I If the value of expression is equal to search, then result is returned, otherwise default is returned.

I The functionality is similar to CASE expression,

I The results of GROUPING() can be passed into a DECODE function or the CASE expression.

(38)

SQL queries

DECODE(expression , search , result [, search ,

result]... [, default] )

SELECT DECODE(GROUPING(Extra scholarship), 1, "Total Average", Extra scholarship) as Extra scholarship, AVG(Grade)

FROM Students grades

GROUP BY ROLL UP(Extra scholarship);

Extra scholarship AVG(Grade)

Yes 4.15

No 3.61

NULL 4.03

(39)

SQL queries

OVER():

I Determines the partitioning and ordering of a rowset before the associated window function is applied.

I The OVER clause defines a window or user-specified set of rows within a query result set.

I A window function then computes a value for each row in the window.

I The OVER clause can be used with functions to compute aggregated values such as moving averages, cumulative aggregates, running totals, or a top N per group results.

I Syntax:

OVER (

[ <PARTITION BY clause> ] [ <ORDER BY clause> ] [ <ROW or RANGE clause> ] )

(40)

SQL queries

OVER():

I PARTITION BY:

• Divides the query result set into partitions. The window function is applied to each partition separately and computation restarts for each partition.

I ORDER BY:

• Defines the logical order of the rows within each partition of the result set, i.e., it specifies the logical order in which the window function calculation is performed.

I ROW | RANGE:

• Further limits the rows within the partition by specifying start and end points within the partition.

• This is done by specifying a range of rows with respect to the current row either by logical association or physical association.

• The ROWS clause limits the rows within a partition by specifying a fixed

(41)

SQL queries

Ranking functions:

I RANK () OVER:

• Returns the rank of each row within the partition of a result set. The rank of a row is one plus the number of ranks that come before the row in question.

I DENSE RANK () OVER:

• Returns the rank of rows within the partition of a result set, without any gaps in the ranking. The rank of a row is one plus the number of distinct ranks that come before the row in question.

I NTILE (integer expression) OVER:

• Distributes the rows in an ordered partition into a specified number of groups. The groups are numbered, starting at one. For each row, NTILE returns the number of the group to which the row belongs.

I ROW NUMBER () OVER:

• Returns the sequential number of a row within a partition of a result

(42)

SQL queries

Examples:

I Ranking of the students

SELECT Student, Avg(Grade), RANK () OVER (ORDER BY Avg(Grade) DESC) FROM Students grades GROUP BY Student;

I To sort according to rank, we need to order the resulting relation:

SELECT Student, Avg(Grade), RANK () OVER (ORDER BY Avg(Grade) DESC) AS rank of grades

FROM Students grades GROUP BY Student ORDER BY rank of grades;

(43)

SQL queries

Examples:

I Ranking of students partitioned by instructors.

SELECT Instructor Name, Student, Avg(Grade), RANK () OVER (PARTITION BY Student ORDER BY Avg(Grade) DESC) AS rank 1

FROM Students grades

GROUP BY Student,Instructor Name ORDER BY Instructor Name, rank 1;

I Moving average of a student:

SELECT Student, Academic year, AVG (grades) OVER (PARTITION BY Student ORDER BY Academic year DESC ROWS UNBOUNDED PRECEDING)

FROM Students grades

ORDER BY Student, Academic year;

(44)

Outline

1

Motivation

2

OLAP Servers

3

SQL

4

Summary

(45)

Summary

Three types of OLAP servers: ROLAP, MOLAP, and HOLAP.

Several approaches for querying data warehouses.

ROLAP servers: SQL and its OLAP extensions.

MOLAP servers: . . .

(46)

Bibliography

A. Pelikant, Hurtownie danych. Od przetwarzania analitycznego do

raportowania, Helion 2011

SAS 9.1.3 OLAP Server MDX Guide Third Edition, 2006

MSDN: SQL Server Developer Center, 2008

References

Related documents

will be acceptable subject to agreement with the LBMA, but when the bars are check- weighed in London in troy ounces (using the procedure set out in Annex G) the weights

Spontaneous, unassisted notoriety degree of private medical clinics (Top of Mind, Sum of Mentions) Spontaneous, unassisted appreciation degree of private medical clinics (Top

As normas do Sistema Integrado de Diagnose e Recomendação (DRIS), desenvolvido por Beaufils (1973), ainda não foram estabelecidas para os cafeeiros do Alto Vale do Jequitinhonha,

Having now arranged our room and secured our choir, let us make a start with the boys and teach them something of voice

Mean AU/mL and standard deviations (N=25) are shown in parentheses.. Cross-reactivity of the Diazyme SARS-CoV-2 IgM/IgG serology assay towards serum or plasma from 55 patients

In this paper, we will present two more human-like scheduling approaches, by using a cutting-edge Bayesian Optimization Algorithm (BOA) and an Adapted Classifier

LAZ Ultimate Hospitality was formed in 2008 when Ultimate Parking, a 20-year hospitality parking industry leader, and Sunset Parking, a leader in the industry in