Data warehouse (part 2)
section 3
Decision Support Systems
A Data Mining Query Language, DMQL:
Language Primitives
2
Cube Definition (Fact Table)
define cube <cube_name> [<dimension_list>]:
<measure_list>
Dimension Definition ( Dimension Table )
define dimension <dimension_name> as
(<attribute_or_subdimension_list>)
Special Case (Shared Dimension Tables)
First time as “cube definition”
define dimension <dimension_name> as
Example 1: Define Star Schema in DMQL
define cubesales_star [time, item, branch, location]:
dollars_sold =sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*)
define dimension time as
(time_key, day, day_of_week, month, quarter, year)
define dimension item as
(item_key, item_name, brand, type, supplier_type)
define dimension branch as
(branch_key, branch_name, branch_type)
define dimension location as
(location_key, street, city, province_or_state,
Defining Snowflake Schema in DMQL
define cube sales_snowflake [time, item, branch, location]:
dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*)
define dimension time as
(time_key, day, day_of_week, month, quarter, year)
define dimension item as
(item_key, item_name, brand, type, supplier(supplier_key, supplier_type))
define dimension branch as
(branch_key, branch_name, branch_type)
define dimension location as
(location_key, street, city(city_key, city,province_or_state, country))
Defining Fact Constellation in DMQL
define cube sales [time, item, branch, location]:
dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month, quarter, year)
define dimension item as (item_key, item_name, brand, type, supplier_type)
define dimension branch as (branch_key, branch_name, branch_type) define dimension location as (location_key, street, city, province_or_state,
country)
define cube shipping [time, item, shipper, from_location, to_location]:
dollar_cost = sum(cost_in_dollars), unit_shipped = count(*)
define dimension time as time in cube sales define dimension item as item in cube sales
define dimension shipper as (shipper_key, shipper_name, location_key as location in cube sales, shipper_type)
define dimension from_location as location in cube sales define dimension to_location as location in cube sales
Data warehouse architecture
Data
Warehouse
: A
Multi-Tiered
Architecture
Data Warehouse Extract Transform Load Refresh OLAP Engine Analysis Query Reports Data mining Monitor & Integrato r MetadataData Sources Front-End
Tools Serve Data Marts Operational DBs Other sources Data Storage OLAP Server 7
OLAP Server Architectures
• Relational OLAP (ROLAP)
•Use relational or extended-relational DBMS to store and manage warehouse data and OLAP middle ware.
•Maps operations on multidimensional data to standard
relational operations.
• Multidimensional OLAP (MOLAP)
• Sparse array-based multidimensional storage engine that directly implements multidimensional data and operations
• Fast indexing to pre-computed summarized data
OLAP Server Architectures
• Hybrid OLAP (HOLAP) (e.g., Microsoft SQLServer)
•The hybrid OLAP approach combines ROLAP and MOLAP technology. •benefiting from the greater scalability of ROLAP and the faster
computation of MOLAP.
•For example, a HOLAP server may allow large volumes of detail data to be stored in a relational database, while aggregations are kept in a separate MOLAP store.
Example : A ROLAP data store
Table shows a summary fact table that contains both base fact data and
aggregated data.
The schema of the table is “record identifier (RID), item, ..., day,
month, quarter, year, dollars sold”, where day, month, quarter, and year define the date of sales, and dollars sold is the sales amount.
RID Item ... day month quarter year dollars sold
1001 TV ... 15 10 Q4 2003 250.60
1002 TV ... 23 10 Q4 2003 175.00
... ... ... ... ... ... ... ...
5001 TV ... all 10 Q4 2003 45,786.08
Example : A ROLAP data store
• Consider the tuples with an RID of 1001 and 1002, respectively.
• The data of these tuples are at the base fact level, where the date of sales is October 15, 2003, and October 23, 2003, respectively.
• Consider the tuple with an RID of 5001. This tuple is at a more general level of abstraction than the tuples 1001and 1002.
• The day value has been generalized to all, so that the corresponding
time value is October 2003.
•That is, the dollars sold amount shown is an aggregation representing the entire month of October 2003, rather than just October 15 or 23, 2003. The special value all is used to represent subtotals in
summarized data
Typical OLAP Operations/Queries
•Two major operations:
•Change resolution: roll up/drill down (zoom out/in) from one
dimension
•Get a sub-matrix: slice, dice.
•Roll up (drill-up): summarize data
•by climbing up hierarchy or by dimension reduction
•E.g., rather than grouping the data by city, the resulting cube groups the data by country.
•Drill down (roll down): reverse of roll-up
•from higher level summary to lower level summary or detailed data, or introducing new dimensions.
Roll-up and Drill Down
•Region •Country •State •Location Address 13 Roll Up Higher Level of Aggregation Low-level Details Drill-DownTypical OLAP Operations/Queries
•Slice: performs a selection on one dimension of the cube, resulting in a subcube.
•E.g., time = “Q1”;
•Another explanation: picking a rectangular subset of a cube by choosing a single value for one of its dimensions, creating a new cube with one fewer dimension.
•Dice: defines a subcube by performing a selection on two or
more dimensions.
• E.g., (location=“Toronto” or “Vancouver) and (time=“Q1” or “Q2”) and (item=“home entertainment” or “computer”)
•Anohter explanation: produces a subcube by allowing the analyst to pick specific values of multiple dimensions.
“Slicing and Dicing”
15
Product
Sales Channel Regions
Retail Direct Special Household Telecomm Video Audio India Far East Europe
Typical OLAP Operations/Queries
• Pivot (rotate): change the dimensional orientation•i.e., rotates the data axes in view in order to provide an alternative presentation of the data
• Another explanation: allows an analyst to rotate the cube in space to see its various faces.
• reorient the cube
Data warehouse implementation Issues
The
CUBE
Operator
•Based on the syntax of DMQL , the data cube in example could be defined as define cube sales cube [city, item, year]: sum(sales in dollars)
•For a cube with n dimensions, there are a total of 2n cuboids, including the base cuboid.
•A statement such as COMPUTE CUBE sales cube.
•would explicitly instruct the system to compute the sales aggregate cuboids for all of the eight sub sets of the set{city,item,year},
including the empty subset.
•On line analytical processing may need to access different cuboids for different queries.
• Therefore, it may seem like a good idea to compute all or at least some of the cuboids in a data cube in advance.
• Pre-computation leads to fast response time and avoids some
redundant computation. 19 (item ) (ci ty) ( ) (ye ar) (city, item) (city,
year) (item, year) (city, item, year)
The curse of dimensionality
•A major challenge, is that the required storage space may explode if all of the cuboids in a data cube are precomputed, especially when the cube has many dimensions.
•The storage requirements are even more excessive when many of the dimensions have associated concept hierarchies, each with multiple levels.
•This problem is referred to as the curse of dimensionality
•Total number of cuboids
•where Li is the number of levels associated with dimension i. • One is added to Li to include the virtual top level, all.
•As the number of dimensions, number of conceptual hierarchies, or cardinality increases, the storage space required for many of the group-by’s will grossly exceed the (fixed) size of the input relation.
Materialization of data cube
A view whose tuples are stored in the database is said to be
materialized.
•No materialization: do not precompute any of the non-base cuboids.
• Leads to expensive multidimensional aggregates on the fly, which can be extremely slow.
•Full materialization: precompute every (cuboid)
• Due to huge number of cuboids, unrealistic.
•Partial materialization
• Selection of which cuboids to materialize, based on size, sharing, access frequency etc
• A popular approach is to materialize the set of cuboids on which other frequently referenced cuboids are based.Or alternatively, compute an iceberg cube.
Iceberg Cube
■ is a data cube that stores only those cube cells whose
aggregate value(e.g., count) is above some minimum support threshold
HAVING COUNT(*) >= min_sup
■ Motivation
■ Only a small portion of cube cells may be “above the water’’ in a sparse
cube
■ Only calculate “interesting” cells—data above certain threshold ■ Avoid explosive growth of the cube.
Efficient Processing OLAP Queries
• Determine which operations should be performed on the available cuboids
• Transform drill, roll, etc. into corresponding SQL and/or OLAP operations, e.g., dice = selection + projection
• Determine which materialized cuboid(s) should be selected for OLAP operation.
• Let the query to be processed be on {brand, province_or_state} with the condition “year = 2004”, and there are 4 materialized cuboids available:
1) {year, item_name, city} 2) {year, brand, country}
3) {year, brand, province_or_state}
4) {item_name, province_or_state} where year = 2004
Which should be selected to process the
query?
• Dimensional hierarchy : item_name < brand
city < province_or_state < country P.s: Finer granularity data cannot be generated from
coarser-granularity data.
•Therefore, cuboid 2 cannot be used because country is a more general concept than province or state.
•Cuboids 1, 3, and 4 can be used to process the query because:
•they have the same set or a superset of the dimensions in the query
• the selection clause in the query can imply the selection in the cuboid, and
• the abstraction levels for the item and location dimensions in these cuboids are at a finer level than brand and province or state,
How would the costs of each cuboid
compare if used to process the query ?
•It is likely that using cuboid 1 would cost the most because both item name and city are at a lower level than the brand and province or state concepts specified in the query.
• If there are not many year values associated with items in the cube, but there are several item names for each brand, then cuboid 3 will be smaller than cuboid 4, and thus cuboid 3 should be chosen to process the query.
•However, if efficient indices are available for cuboid 4, then cuboid 4 may be a better choice.
• Therefore, some cost-based estimation is required in order to decide which set of cuboids should be selected for query processing.
Indexing OLAP Data: Bitmap Index
how to index OLAP data by bitmap indexing and join indexing.
•Indexing facilitates efficient data accessing. •Index on a particular column
•Each value in the column has a bit vector: bit-operation is fast •The length of the bit vector: # of records in the base table
•The i-th bit is set if the i-th row of the base table has the value for the indexed column
•not suitable for high cardinality domains
Base table Index on Region Index on Type
Example
: (
Bitmap
Index
)
the attribute gender has values M and F.
A table of 100 million people needs 2 lists of 100 million bits
27
Customer Query : select * from customer where
gender = ‘F’ and vote = ‘Y’ 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 M F F F F M Y Y Y N N N M 1 0 1 0 0 F 1 1 1 0 1 0 gender index 0