Decision Support Systems

(1)

Data warehouse (part 2)

section 3

Decision Support Systems

(2)

A Data Mining Query Language, DMQL:

Language Primitives

2

Cube Definition (Fact Table)

define cube <cube_name> [<dimension_list>]:

<measure_list>

Dimension Definition ( Dimension Table )

define dimension <dimension_name> as

(<attribute_or_subdimension_list>)

Special Case (Shared Dimension Tables)

First time as “cube definition”

define dimension <dimension_name> as

(3)

Example 1: Define Star Schema in DMQL

define cube

sales_star [time, item, branch, location]:

dollars_sold =sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*)

define dimension time as

(time_key, day, day_of_week, month, quarter, year)

define dimension item as

(item_key, item_name, brand, type, supplier_type)

define dimension branch as

(branch_key, branch_name, branch_type)

define dimension location as

(location_key, street, city, province_or_state,

(4)

Defining Snowflake Schema in DMQL

define cube sales_snowflake [time, item, branch, location]:

dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*)

define dimension time as

(time_key, day, day_of_week, month, quarter, year)

define dimension item as

(item_key, item_name, brand, type, supplier(supplier_key, supplier_type))

define dimension branch as

(branch_key, branch_name, branch_type)

define dimension location as

(location_key, street, city(city_key, city,province_or_state, country))

(5)

Defining Fact Constellation in DMQL

define cube sales [time, item, branch, location]:

dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*)

define dimension time as (time_key, day, day_of_week, month, quarter, year)

define dimension item as (item_key, item_name, brand, type, supplier_type)

define dimension branch as (branch_key, branch_name, branch_type) define dimension location as (location_key, street, city, province_or_state,

country)

define cube shipping [time, item, shipper, from_location, to_location]:

dollar_cost = sum(cost_in_dollars), unit_shipped = count(*)

define dimension time as time in cube sales define dimension item as item in cube sales

define dimension shipper as (shipper_key, shipper_name, location_key as location in cube sales, shipper_type)

define dimension from_location as location in cube sales define dimension to_location as location in cube sales

(6)

Data warehouse architecture

(7)

Data

Warehouse

: A

Multi-Tiered

Architecture

Data Warehouse Extract Transform Load Refresh OLAP Engine Analysis Query Reports Data mining Monitor & Integrato r Metadata

Data Sources _Front-End

Tools Serve Data Marts Operational DBs Other sources Data Storage OLAP Server 7

(8)

OLAP Server Architectures

• Relational OLAP (ROLAP)

•Use relational or extended-relational DBMS to store and manage warehouse data and OLAP middle ware.

•Maps operations on multidimensional data to standard

relational operations.

• Multidimensional OLAP (MOLAP)

• Sparse array-based multidimensional storage engine that directly implements multidimensional data and operations

• Fast indexing to pre-computed summarized data

(9)

OLAP Server Architectures

• Hybrid OLAP (HOLAP) (e.g., Microsoft SQLServer)

•The hybrid OLAP approach combines ROLAP and MOLAP technology. •benefiting from the greater scalability of ROLAP and the faster

computation of MOLAP.

•For example, a HOLAP server may allow large volumes of detail data to be stored in a relational database, while aggregations are kept in a separate MOLAP store.

(10)

Example : A ROLAP data store

Table shows a summary fact table that contains both base fact data and

aggregated data.

The schema of the table is “record identifier (RID), item, ..., day,

month, quarter, year, dollars sold”, where day, month, quarter, and year define the date of sales, and dollars sold is the sales amount.

RID Item ... day month quarter year dollars sold

1001 TV ... 15 10 Q4 2003 250.60

1002 TV ... 23 10 Q4 2003 175.00

... ... ... ... ... ... ... ...

5001 TV ... all 10 Q4 2003 45,786.08

(11)

Example : A ROLAP data store

• Consider the tuples with an RID of 1001 and 1002, respectively.

• The data of these tuples are at the base fact level, where the date of sales is October 15, 2003, and October 23, 2003, respectively.

• Consider the tuple with an RID of 5001. This tuple is at a more general level of abstraction than the tuples 1001and 1002.

• The day value has been generalized to all, so that the corresponding

time value is October 2003.

•That is, the dollars sold amount shown is an aggregation representing the entire month of October 2003, rather than just October 15 or 23, 2003. The special value all is used to represent subtotals in

summarized data

(12)

Typical OLAP Operations/Queries

•Two major operations:

•Change resolution: roll up/drill down (zoom out/in) from one

dimension

•Get a sub-matrix: slice, dice.

•Roll up (drill-up): summarize data

•by climbing up hierarchy or by dimension reduction

•E.g., rather than grouping the data by city, the resulting cube groups the data by country.

•Drill down (roll down): reverse of roll-up

•from higher level summary to lower level summary or detailed data, or introducing new dimensions.

(13)

Roll-up and Drill Down

•Region •Country •State •Location Address 13 Roll Up Higher Level of Aggregation Low-level Details Drill-Down

(14)

Typical OLAP Operations/Queries

•Slice: performs a selection on one dimension of the cube, resulting in a subcube.

•E.g., time = “Q1”;

•Another explanation: picking a rectangular subset of a cube by choosing a single value for one of its dimensions, creating a new cube with one fewer dimension.

•Dice: defines a subcube by performing a selection on two or

more dimensions.

• E.g., (location=“Toronto” or “Vancouver) and (time=“Q1” or “Q2”) and (item=“home entertainment” or “computer”)

•Anohter explanation: produces a subcube by allowing the analyst to pick specific values of multiple dimensions.

(15)

“Slicing and Dicing”

15

Product

Sales Channel Regions

Retail Direct Special Household Telecomm Video Audio India Far East Europe

(16)

Typical OLAP Operations/Queries

• Pivot (rotate): change the dimensional orientation

•i.e., rotates the data axes in view in order to provide an alternative presentation of the data

• Another explanation: allows an analyst to rotate the cube in space to see its various faces.

• reorient the cube

(17)

(18)

Data warehouse implementation Issues

(19)

The

CUBE

Operator

•Based on the syntax of DMQL , the data cube in example could be defined as define cube sales cube [city, item, year]: sum(sales in dollars)

•For a cube with n dimensions, there are a total of 2n cuboids, including the base cuboid.

•A statement such as COMPUTE CUBE sales cube.

•would explicitly instruct the system to compute the sales aggregate cuboids for all of the eight sub sets of the set{city,item,year},

including the empty subset.

•On line analytical processing may need to access different cuboids for different queries.

• Therefore, it may seem like a good idea to compute all or at least some of the cuboids in a data cube in advance.

• Pre-computation leads to fast response time and avoids some

redundant computation. 19 (item ) (ci ty) ( ) _(ye ar) (city, item) (city,

year) (item, year) (city, item, year)

(20)

The curse of dimensionality

•A major challenge, is that the required storage space may explode if all of the cuboids in a data cube are precomputed, especially when the cube has many dimensions.

•The storage requirements are even more excessive when many of the dimensions have associated concept hierarchies, each with multiple levels.

•This problem is referred to as the curse of dimensionality

•Total number of cuboids

•where L_iis the number of levels associated with dimension i. • One is added to L_i to include the virtual top level, all.

•As the number of dimensions, number of conceptual hierarchies, or cardinality increases, the storage space required for many of the group-by’s will grossly exceed the (fixed) size of the input relation.

(21)

Materialization of data cube

A view whose tuples are stored in the database is said to be

materialized.

•No materialization: do not precompute any of the non-base cuboids.

• Leads to expensive multidimensional aggregates on the fly, which can be extremely slow.

•Full materialization: precompute every (cuboid)

• Due to huge number of cuboids, unrealistic.

•Partial materialization

• Selection of which cuboids to materialize, based on size, sharing, access frequency etc

• A popular approach is to materialize the set of cuboids on which other frequently referenced cuboids are based.Or alternatively, compute an iceberg cube.

(22)

Iceberg Cube

■ is a data cube that stores only those cube cells whose

aggregate value(e.g., count) is above some minimum support threshold

HAVING COUNT(*) >= min_sup

■ Motivation

■ Only a small portion of cube cells may be “above the water’’ in a sparse

cube

■ Only calculate “interesting” cells—data above certain threshold ■ Avoid explosive growth of the cube.

(23)

Efficient Processing OLAP Queries

• Determine which operations should be performed on the available cuboids

• Transform drill, roll, etc. into corresponding SQL and/or OLAP operations, e.g., dice = selection + projection

• Determine which materialized cuboid(s) should be selected for OLAP operation.

• Let the query to be processed be on {brand, province_or_state} with the condition “year = 2004”, and there are 4 materialized cuboids available:

1) {year, item_name, city} 2) {year, brand, country}

3) {year, brand, province_or_state}

4) {item_name, province_or_state} where year = 2004

(24)

Which should be selected to process the

query?

• Dimensional hierarchy : item_name < brand

city < province_or_state < country P.s: Finer granularity data cannot be generated from

coarser-granularity data.

•Therefore, cuboid 2 cannot be used because country is a more general concept than province or state.

•Cuboids 1, 3, and 4 can be used to process the query because:

•they have the same set or a superset of the dimensions in the query

• the selection clause in the query can imply the selection in the cuboid, and

• the abstraction levels for the item and location dimensions in these cuboids are at a finer level than brand and province or state,

(25)

How would the costs of each cuboid

compare if used to process the query ?

•It is likely that using cuboid 1 would cost the most because both item name and city are at a lower level than the brand and province or state concepts specified in the query.

• If there are not many year values associated with items in the cube, but there are several item names for each brand, then cuboid 3 will be smaller than cuboid 4, and thus cuboid 3 should be chosen to process the query.

•However, if efficient indices are available for cuboid 4, then cuboid 4 may be a better choice.

• Therefore, some cost-based estimation is required in order to decide which set of cuboids should be selected for query processing.

(26)

Indexing OLAP Data: Bitmap Index

how to index OLAP data by bitmap indexing and join indexing.

•Indexing facilitates efficient data accessing. •Index on a particular column

•Each value in the column has a bit vector: bit-operation is fast •The length of the bit vector: # of records in the base table

•The i-th bit is set if the i-th row of the base table has the value for the indexed column

•not suitable for high cardinality domains

Base table _{Index on Region} _{Index on Type}

(27)

Example

: (

Bitmap

Index

)

the attribute gender has values M and F.

A table of 100 million people needs 2 lists of 100 million bits

27

Customer Query : select * from customer where

gender = ‘F’ and vote = ‘Y’ 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 M F F F F M Y Y Y N N N M 1 0 1 0 0 F 1 1 1 0 1 0 gender index 0