Chapter 3, Data Warehouse and OLAP Operations

(1)

Chapter 3, Data Warehouse and

OLAP Operations

Young-Rae Cho

Associate Professor

Department of Computer Science

Baylor University

CSI 4352, Introduction to Data Mining

Lecture 3, Data Warehouse & OLAP Operations

CSI 4352, Introduction to Data Mining

 Basic Concept of Data Warehouse

 Data Warehouse Modeling

 Data Warehouse Architecture

 Data Warehouse Implementation

 From Data Warehousing to Data Mining

(2)

What is Data Warehouse?

Data Warehouse ( defined in many different ways )

 A decision support database that is maintained separately from the organization’s operational database

 The support of information processing by providing a solid platform of consolidated, historical data for analysis

 “A data warehouse is a (1) subject-oriented, (2) integrated, (3) time-variant, and (4) nonvolatile collection of data in support of management’s decision-making process.” — W. H. Inmon

Data Warehousing

 The process of constructing and using data warehouses

Data Warehouse – Subject-Oriented

 Organized around major subjects

 e.g., customers, products, sales

Focusing on the modeling and analysis of data for decision makers, not on daily operations or transaction processing

 Provide a simple and concise view around particular subject issues

 Excluding data that are not useful in the decision support process

(3)

Data Warehouse – Integrated

Integrating multiple, heterogeneous data sources

 Relational databases, flat files, on-line transaction records

Apply data cleaning and data integration techniques

 Ensures consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources

Data Warehouse – Time Variant

The time horizon of data warehouses is significantly longer than that of operational systems

 Operational databases have current data values

 Data warehouses provide information from a historical perspective (e.g., 10-20 years)

 Time is a key structure in data warehouses

 Contain the attribute of time (explicitly or implicitly)

(4)

Data Warehouse – Nonvolatile

 A physically separate storage of data transformed from operational databases

 Operational update of data does not occur

 Not require transaction processing, recovery, and concurrency control mechanisms

 Require only two operations, initial loading of data and access of data

Data Integration Methods

 Methods

(1) Process to provide uniform interface to multiple data sources

→ Tradition Database Integration

(2) Process to combine multiple data sources into coherent storage

→ Data warehousing

 Traditional DB Integration

 A query-driven approach

 Wrappers / mediators on top of heterogeneous data sources

 Data Warehousing

 An update-driven approach

 Combined the heterogeneous data sources in advance

 Stored them in a warehouse for direct query and analysis

(5)

OLTP vs. OLAP

 OLTP (on-line transaction processing)

 Major task of traditional relational DBMS

 Day-to-day operations: e.g., purchasing, inventory, manufacturing, banking, payroll, registration, accounting, etc.

 OLAP (on-line analytical processing)

 Major task of data warehouse system

 Data analysis and decision making

 Distinct Features (OLTP vs. OLAP)

 User and system orientation (customers vs. market analysts)

 Data contents (current, detailed vs. historical, consolidated)

 Database design (ER + application vs. star + subject)

 View (current, local vs. evolutionary, integrated)

OLTP vs. OLAP

Feature OLTP OLAP

Characteristic operational processing information processing

Orientation transaction analysis

Users clerk, DBA knowledge worker (CEO, analyst)

Function day-to-day operations decision support DB Design application-oriented subject-oriented

Data current, up-to-date historical, integrated, summarized Unit of work short, simple transaction complex query

Access read/write/update read-only (lots of scans)

(6)

Why Data Warehouse?

 Performance Issue

 DBMS: tuned for OLTP

e.g., access methods, indexing, concurrency control, recovery

 Data warehouse: tuned for OLAP

e.g., complex queries, multidimensional view, consolidation

 Data Issue

 Decision support requires historical data, consolidated and summarized data, consistent data

Lecture 3, Data Warehouse & OLAP Operations

CSI 4352, Introduction to Data Mining

 Basic Concept of Data Warehouse

 Data Warehouse Modeling

 Data Warehouse Architecture

 Data Warehouse Implementation

 From Data Warehousing to Data Mining

(7)

Data Format for Warehouse

 Dimensions

 Multi-dimensional data model

 Data are stored in the form of a data cube

 Data Cube

 A view of multi-dimensions

 Dimension tables, such as item (item_name, brand, type), or time (day, week, month, quarter, year)

 Fact table contains measures (such as dollars_sold) and keys to each of the related dimension tables

 Cuboid

 Each combination of dimensional spaces in a data cube

 0-D cuboid, 1-D cuboid, 2-D cuboid, … , n-D cuboid

The Lattice of Cuboids

all

time item location supplier

time,location

time,supplier

item,location

item,supplier

location,supplier

time,item,supplier

time,location,supplier

item,location,supplier

0-D (apex) cuboid

1-D cuboids

2-D cuboids

3-D cuboids

4-D (base) cuboid

time,item

time,item,location

time, item, location, supplier

(8)

Conceptual Modeling

 Key of Modeling Data Warehouses

 Handling dimensions & measures

 Examples

 Star schema: A fact table in the middle connected to a set of dimension tables

 Snowflake schema: A refinement of star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to snowflake

 Fact constellations: Multiple fact tables share dimension tables, viewed as a collection of stars, therefore called galaxy schema or fact constellation

Example of Star Schema

time_key

day

day_of_the_week

month

quarter

year

time

location_key

street

city

state

country

location

Sales Fact Table

time_key

item_key

branch_key

location_key

units_sold

dollars_sold

avg_sales

Measures

item_key

item_name

brand

type

supplier_type

item

branch_key

branch_name

branch_type

branch

(9)

Example of Snowflake Schema

supplier_key supplier_type

supplier

city_key city state country

city

time_key

day

day_of_the_week

month

quarter

year

time

location_key

street

city_key

location

Sales Fact Table

time_key

item_key

branch_key

location_key

units_sold

dollars_sold

avg_sales

Measures

item_key

item_name

brand

type

supplier_key

item

branch_key

branch_name

branch_type

branch

Example of Fact Constellation

time_key

day

day_of_the_week

month

quarter

year

time

location_key

street

city

state

country

location

Sales Fact Table

time_key

item_key

branch_key

location_key

units_sold

dollars_sold

avg_sales

Measures

item_key

item_name

brand

type

supplier_type

item

branch_key

branch_name

branch_type

branch

time_key

item_key

shipper_key

from_location

to_location

dollars_cost

units_shipped

shipper_key shipper_name location_key shipper_type

shipper

Shipping Fact Table

(10)

Cube Definition in DMQL

 Cube Definition (Fact Table)

 define cube<cube_name> [<dimension_list>]: <measure_list>

 Dimension Definition (Dimension Table)

 define dimension<dimension_name> as (<attribute_or_dimension_list>)

 Special Case (Shared Dimension Table)

 define dimension<dimension_name> as<dimension_name_first>

in cube<cube_name_first>

Star Schema Definition in DMQL

 Example

 define cubesales [time, item, branch, location]:

dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*)

 define dimensiontime as (time_key, day, day_of_week, month, quarter, year)

 define dimension item as (item_key, item_name, brand, type, supplier_type)

 define dimension branch as(branch_key, branch_name, branch_type)

 define dimensionlocation as(location_key, street, city, state, country)

(11)

Snowflake Schema Definition in DMQL

 Example

 define dimension item as (item_key, item_name, brand, type, supplier(supplier_key, supplier_type))

 define dimensionlocation as(location_key, street, city(city_key, province_or_state, country))

Fact Constellation Schema Definition in DMQL

 Example

 define dimension item as (item_key, item_name, brand, type, supplier_type)

 define dimensionlocation as(location_key, street, city, state, country)

 define cubeshipping [time, item, shipper, from_location, to_location]:

dollar_cost = sum(cost_in_dollars), unit_shipped = count(*)

 define dimensiontime as time in cubesales

 define dimension item as item in cubesales

 define dimension shipper as(shipper_key, shipper_name, locationaslocation in cubesales, shipper_type)

 define dimensionfrom_location aslocation in cubesales

 define dimensionto_location aslocation in cubesales

(12)

Measures

 Distributive

 If the result derived by applying the function to n aggregate values is the same as that derived by applying the function on all the data without partitioning

 e.g., count(), sum(), min(), max()

 Algebraic

 If it can be computed by an algebraic function with m arguments, each of which is obtained by applying a distributive aggregate function

 e.g., avg(), standard_deviation()

 Holistic

 If there is no constant bound on the storage size needed to describe a subaggregate

 e.g., median(), mode(), rank()

Concept Hierarchy

 Schema Hierarchy

 e.g., street < city < state < country

 e.g., day < { month < quarter ; week } < year

 Set-group hierarchy

 e.g., { (0..100] ; (100..200] } < (0..200]

 e.g., { (0..10] < lowPrice ; { (10..100] ; (100..200] } < highPrice }

< allProducts

Year Quarter

| Week Month

Day

(0..200]

(0..100] (100..200]

allProducts

lowPrice highPrice

(0..10] (10..100] (100..200]

(13)

Three Components of Data Cube

 Example

 Measures: data values as a function of products, locations and time

 Dimensions:

 Hierarchies:

Company

| Category

| Product

locations

time

Country

| City

| Office

Year Quarter

| Week Month

Day product location time

Example of Cuboid Cells

all

product quarter country

product, quarter

product, country

quarter, country

product, quarter, country

0-D (apex) cuboid

1-D cuboids

2-D cuboids

3-D (base) cuboid

product dimension

time dimension

location dimension

(14)

Example of Data Cube

Total annual sales

of TV in U.S.A.

Quarter

Country

sum

TV sum

VCR PC

1Qtr 2Qtr 3Qtr 4Qtr

U.S.A

Canada

Mexico

sum

OLAP Operations

 Roll-up (drill-up)

 Summarizes (aggregates) data

 by climbing up hierarchy or by dimension reduction

 Drill-down (roll-down)

 Reverse of roll-up

 by stepping down to lower-level data or introducing new dimensions

Slice

 Selecting data on one dimension

Dice

 Selecting data on multi-dimensions

Pivot (rotate)

 Reorienting the cube, or transforming 3-D data to a series of 2-D spaces

(15)

Roll-UP & Drill-Down

country

quarter

city

quarter

country

month

location

time

Slice, Dice & Pivot

country

quarter

country

quarter

location

time Q1 Q2 Q3 Q4 USA

canada mexico

pclaptop supercom

country ^USA

canada

Q1 Q2 pclaptop

Q1 Q2 Q3 Q4 USA

canada mexico

(16)

Starnet Query Model

Shipping Method

air

truck order

Orders

contracts

Customer

Product category item

sales-person

division

Organization Promotion

city country

region

Location

daily quarterly annually

Time

Each circle is called a footprint

customerID

Lecture 3, Data Warehouse & OLAP Operations

CSI 4352, Introduction to Data Mining

 Data Warehouse Modeling

 Data Warehouse Architecture

 Data Warehouse Implementation

 From Data Warehousing to Data Mining

(17)

Data Warehouse Design

 Top-Down View

 Allows the selection of the relevant information necessary for the data warehouse

 Data Source View

 Exposes the information being captured, stored, and managed by operational systems

 Data Warehouse View

 Consists of fact tables and dimension tables

 Business Query View

 Shows the perspectives of data in the warehouse to end-users

Data Warehouse Design Process

 Categories by Process Direction

 Top-down: Starts with overall design and planning (mature)

 Bottom-up: Starts with experiments and prototypes (rapid)

 Categories by Software Engineering View

 Waterfall: structured, systematic analysis at each step before proceeding to the next

 Spiral: rapid generation of functional systems, short turn around time

Typical Data Warehouse Design Process

 Choose business processes for modeling, e.g., orders, invoices, etc

 Choose the grain (atomic level of data) of the business processes

 Choose the dimensions that will apply to each fact table

 Choose the measures that will populate each fact table

(18)

Data Warehouse Architecture

Data Warehouse Extract

Transform Load Refresh

OLAP Engine

Query Analysis

Reports Monitor &

Integrator Metadata

Data Sources Front-End Tools

Serve

Data Marts Operational

DBs

Other sources

Data Storage

OLAP Server

Three Data Warehouse Models

 Enterprise Warehouse

 A global view with all the information about subjects spanning the entire organization

 Data Mart

 A subset of corporate-wide data that is of value to a specific group of users

 Its scope is confined to specific, selected groups

 Independent vs. dependent (directly from warehouse) data mart

 Virtual Warehouse

 A set of views over operational databases

 Only some of possible summary views may be materialized

(19)

Development of Data Warehouse

Define a high-level corporate data model Model Refinement

Enterprise Data Warehouse

Multi-Tier Data Warehouse

Distributed Data Marts

Data

Mart Data

Mart

Model Refinement

Utilities of Back-End Tools

 Data Extraction

 Get data from multiple, heterogeneous, and external sources

 Data Cleaning

 Detect errors in the data and rectify them when possible

Data Transformation

 Convert data from the original format to the warehouse format

Loading

 Sort, summarize, consolidate, compute views, check integrity, and build indices and partitions

Refresh

 Propagate the updates from data sources to the warehouse

(20)

OLAP Server Architecture

 Relational OLAP (ROLAP)

 Uses relational or extended-relational DBMS to store and manage warehouse data and OLAP middle ware

 Includes optimization of DBMS back-end, implementation of aggregation navigation logic, and additional tools and services

 High scalability

 Multidimensional OLAP (MOLAP)

 Sparse array-based multidimensional storage engine

 Fast indexing to pre-computed summarized data

Hybrid OLAP (HOLAP)

 Low-level: relational / high-level: array

 High flexibility

Metadata Repository

 Definition of Metadata

 The data defining data warehouse objects

 Examples

 Description of the structure of the data warehouse, e.g., schema, view, dimensions, hierarchies, data definitions, data mart locations and contents

 Operational meta-data, e.g., history of migrated data, currency of data, warehouse usage statistics, error reports

 Algorithms used for summarization

 Mapping from operational environment to data warehouse

 Data related to system performance

 Business data, e.g., business terms and definitions, ownership of data, charging policies

(21)

Lecture 3, Data Warehouse & OLAP Operations

CSI 4352, Introduction to Data Mining

 Data Warehouse Modeling

 Data Warehouse Architecture

 Data Warehouse Implementation

 From Data Warehousing to Data Mining

Data Cube Computation

 View as a Lattice of Cuboids

 How many cuboids in an n-dimensional cube?

 How many cuboid cells in an n-dimensional cube with Lilevels?

 Materialization of Data Cube

 Full materialization (all cuboids), Partial materialization (some cuboids), No materialization (only base cuboid)

 Selection of cuboids to materialize

•Based on the size, sharing, access frequency, etc.

) 1( 1

 n

i Li

2

n

all

product time location

product, time

product, location time, location product, time, location

(22)

Cube Operation

 Cube Definition and Computation in DMQL

 define cube sales [item, city, year]: sum (sales_in_dollars)

 compute cubesales

 Cube Definition and Computation in SQL

 select item, city, year, SUM (amount)

 from sales

 cube by item, city, year

Internal Operations

 group by (item, city, year)

 group by (item, city), (item, year), (city, year)

 group by (item), (city), (year)

 group by ()

Iceberg Cube

 Iceberg Cube Computation

 Computing only the cuboid cells whose count or other aggregates satisfying the condition like HAVING COUNT(*) >= min_sup

 Motivation

 Only a small portion of cube cells may be “above the water’’ in a sparse cube

 Only calculate “interesting” cells—data above certain threshold

 Avoid explosive growth of the cube

(23)

Indexing OLAP Data

Bitmap Indexing

 Index on a particular column

 Each value in the column has a bit vector

 The length of bit vector is the number of records in the base table

 The i^thbit is set if the i^throw of the base table has the value

 Not suitable for high cardinality domains

Join Indexing

 Link the value of dimensions to rows in the fact table CusID Region Type

C1 Asia Retail C2 Europe Dealer C3 Asia Dealer

C4 US Retail

C5 Europe Dealer

RecID Retail Dealer

1 1 0

2 0 1

3 0 1

4 1 0

5 0 1

RecID Asia Europe US

1 1 0 0

2 0 1 0

3 1 0 0

4 0 0 1

5 0 1 0

Base Table Index on Region Index on Type

Lecture 3, Data Warehouse & OLAP Operations

CSI 4352, Introduction to Data Mining

 Data Warehouse Modeling

 Data Warehouse Architecture

 Data Warehouse Implementation

 From Data Warehousing to Data Mining

(24)

Data Warehouse Usage

 Information Processing

 Supports querying, basic statistical analysis, and reporting using crosstabs, tables, charts and graphs

 Analytical Processing

 Supports OLAP operations in multi-dimensional space

 Data Mining

 Supports pattern discovery from warehouse data, and presenting the mining results using visualization tools

From OLAP To OLAM

 On-Line Analytical Mining (OLAM)

 ( OLAP + Data Mining ) in data warehouse

 Why OLAM ?

 High quality data

• Data warehouse contains integrated, consistent and cleaned data

 Information processing infrastructure

• ODBC/OLE DB connections, web accessing, service facilities

 OLAP-based exploratory data analysis

• Mining with drilling, dicing, pivoting, etc.

 On-line selection of data mining functions

• Integration and swapping of multiple data mining functions

(25)

OLAM System Architecture

Data Warehouse OLAM

Engine

OLAP Engine User GUI API

Data Cube API

Database API data cleaning data integration

Layer3 OLAP/OLAM

Layer2 MDDB

Layer1 Data Repository

Layer4 User Interface

mining query mining result

Meta Data

Databases

Questions?

 References

 Gray, J., et al., “Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab and Sub-Totals”, Data Mining and Knowledge Discovery, Vol. 1 (1997)

 Lecture Slides are found on the Course Website, www.ecs.baylor.edu/faculty/cho/4352