• No results found

Chapter 3, Data Warehouse and OLAP Operations

N/A
N/A
Protected

Academic year: 2021

Share "Chapter 3, Data Warehouse and OLAP Operations"

Copied!
25
0
0

Loading.... (view fulltext now)

Full text

(1)

Chapter 3, Data Warehouse and

OLAP Operations

Young-Rae Cho

Associate Professor

Department of Computer Science

Baylor University

CSI 4352, Introduction to Data Mining

Lecture 3, Data Warehouse & OLAP Operations

CSI 4352, Introduction to Data Mining

 Basic Concept of Data Warehouse

 Data Warehouse Modeling

 Data Warehouse Architecture

 Data Warehouse Implementation

 From Data Warehousing to Data Mining

(2)

What is Data Warehouse?

Data Warehouse ( defined in many different ways )

A decision support database that is maintained separately from the organization’s operational database

The support of information processing by providing a solid platform of consolidated, historical data for analysis

“A data warehouse is a (1) subject-oriented, (2) integrated, (3) time-variant, and (4) nonvolatile collection of data in support of management’s decision-making process.” — W. H. Inmon

Data Warehousing

 The process of constructing and using data warehouses

Data Warehouse – Subject-Oriented

 Organized around major subjects

 e.g., customers, products, sales

Focusing on the modeling and analysis of data for decision makers, not on daily operations or transaction processing

 Provide a simple and concise view around particular subject issues

 Excluding data that are not useful in the decision support process

(3)

Data Warehouse – Integrated

Integrating multiple, heterogeneous data sources

 Relational databases, flat files, on-line transaction records

Apply data cleaning and data integration techniques

 Ensures consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources

Data Warehouse – Time Variant

The time horizon of data warehouses is significantly longer than that of operational systems

 Operational databases have current data values

 Data warehouses provide information from a historical perspective (e.g., 10-20 years)

 Time is a key structure in data warehouses

 Contain the attribute of time (explicitly or implicitly)

(4)

Data Warehouse – Nonvolatile

 A physically separate storage of data transformed from operational databases

 Operational update of data does not occur

 Not require transaction processing, recovery, and concurrency control mechanisms

 Require only two operations, initial loading of data and access of data

Data Integration Methods

Methods

(1) Process to provide uniform interface to multiple data sources

→ Tradition Database Integration

(2) Process to combine multiple data sources into coherent storage

→ Data warehousing

 Traditional DB Integration

 A query-driven approach

 Wrappers / mediators on top of heterogeneous data sources

 Data Warehousing

 An update-driven approach

 Combined the heterogeneous data sources in advance

 Stored them in a warehouse for direct query and analysis

(5)

OLTP vs. OLAP

 OLTP (on-line transaction processing)

 Major task of traditional relational DBMS

 Day-to-day operations: e.g., purchasing, inventory, manufacturing, banking, payroll, registration, accounting, etc.

 OLAP (on-line analytical processing)

 Major task of data warehouse system

 Data analysis and decision making

 Distinct Features (OLTP vs. OLAP)

 User and system orientation (customers vs. market analysts)

 Data contents (current, detailed vs. historical, consolidated)

 Database design (ER + application vs. star + subject)

 View (current, local vs. evolutionary, integrated)

OLTP vs. OLAP

Feature OLTP OLAP

Characteristic operational processing information processing

Orientation transaction analysis

Users clerk, DBA knowledge worker (CEO, analyst)

Function day-to-day operations decision support DB Design application-oriented subject-oriented

Data current, up-to-date historical, integrated, summarized Unit of work short, simple transaction complex query

Access read/write/update read-only (lots of scans)

(6)

Why Data Warehouse?

 Performance Issue

 DBMS: tuned for OLTP

e.g., access methods, indexing, concurrency control, recovery

 Data warehouse: tuned for OLAP

e.g., complex queries, multidimensional view, consolidation

 Data Issue

 Decision support requires historical data, consolidated and summarized data, consistent data

Lecture 3, Data Warehouse & OLAP Operations

CSI 4352, Introduction to Data Mining

Basic Concept of Data Warehouse

 Data Warehouse Modeling

 Data Warehouse Architecture

 Data Warehouse Implementation

 From Data Warehousing to Data Mining

(7)

Data Format for Warehouse

Dimensions

 Multi-dimensional data model

 Data are stored in the form of a data cube

Data Cube

 A view of multi-dimensions

 Dimension tables, such as item (item_name, brand, type), or time (day, week, month, quarter, year)

 Fact table contains measures (such as dollars_sold) and keys to each of the related dimension tables

Cuboid

 Each combination of dimensional spaces in a data cube

 0-D cuboid, 1-D cuboid, 2-D cuboid, … , n-D cuboid

The Lattice of Cuboids

all

time item location supplier

time,location

time,supplier

item,location

item,supplier

location,supplier

time,item,supplier

time,location,supplier

item,location,supplier

0-D (apex) cuboid

1-D cuboids

2-D cuboids

3-D cuboids

4-D (base) cuboid

time,item

time,item,location

time, item, location, supplier

(8)

Conceptual Modeling

Key of Modeling Data Warehouses

 Handling dimensions & measures

 Examples

 Star schema: A fact table in the middle connected to a set of dimension tables

 Snowflake schema: A refinement of star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to snowflake

 Fact constellations: Multiple fact tables share dimension tables, viewed as a collection of stars, therefore called galaxy schema or fact constellation

Example of Star Schema

time_key

day

day_of_the_week

month

quarter

year

time

location_key

street

city

state

country

location

Sales Fact Table

time_key

item_key

branch_key

location_key

units_sold

dollars_sold

avg_sales

Measures

item_key

item_name

brand

type

supplier_type

item

branch_key

branch_name

branch_type

branch

(9)

Example of Snowflake Schema

supplier_key supplier_type

supplier

city_key city state country

city

time_key

day

day_of_the_week

month

quarter

year

time

location_key

street

city_key

location

Sales Fact Table

time_key

item_key

branch_key

location_key

units_sold

dollars_sold

avg_sales

Measures

item_key

item_name

brand

type

supplier_key

item

branch_key

branch_name

branch_type

branch

Example of Fact Constellation

time_key

day

day_of_the_week

month

quarter

year

time

location_key

street

city

state

country

location

Sales Fact Table

time_key

item_key

branch_key

location_key

units_sold

dollars_sold

avg_sales

Measures

item_key

item_name

brand

type

supplier_type

item

branch_key

branch_name

branch_type

branch

time_key

item_key

shipper_key

from_location

to_location

dollars_cost

units_shipped

shipper_key shipper_name location_key shipper_type

shipper

Shipping Fact Table

(10)

Cube Definition in DMQL

Cube Definition (Fact Table)

 define cube<cube_name> [<dimension_list>]: <measure_list>

 Dimension Definition (Dimension Table)

 define dimension<dimension_name> as (<attribute_or_dimension_list>)

 Special Case (Shared Dimension Table)

 define dimension<dimension_name> as<dimension_name_first>

in cube<cube_name_first>

Star Schema Definition in DMQL

Example

 define cubesales [time, item, branch, location]:

dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*)

 define dimensiontime as (time_key, day, day_of_week, month, quarter, year)

 define dimension item as (item_key, item_name, brand, type, supplier_type)

 define dimension branch as(branch_key, branch_name, branch_type)

 define dimensionlocation as(location_key, street, city, state, country)

(11)

Snowflake Schema Definition in DMQL

Example

 define cubesales [time, item, branch, location]:

dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*)

 define dimensiontime as (time_key, day, day_of_week, month, quarter, year)

 define dimension item as (item_key, item_name, brand, type, supplier(supplier_key, supplier_type))

 define dimension branch as(branch_key, branch_name, branch_type)

 define dimensionlocation as(location_key, street, city(city_key, province_or_state, country))

Fact Constellation Schema Definition in DMQL

Example

 define cubesales [time, item, branch, location]:

dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*)

 define dimensiontime as (time_key, day, day_of_week, month, quarter, year)

 define dimension item as (item_key, item_name, brand, type, supplier_type)

 define dimension branch as(branch_key, branch_name, branch_type)

 define dimensionlocation as(location_key, street, city, state, country)

 define cubeshipping [time, item, shipper, from_location, to_location]:

dollar_cost = sum(cost_in_dollars), unit_shipped = count(*)

 define dimensiontime as time in cubesales

 define dimension item as item in cubesales

 define dimension shipper as(shipper_key, shipper_name, locationaslocation in cubesales, shipper_type)

 define dimensionfrom_location aslocation in cubesales

 define dimensionto_location aslocation in cubesales

(12)

Measures

Distributive

If the result derived by applying the function to n aggregate values is the same as that derived by applying the function on all the data without partitioning

 e.g., count(), sum(), min(), max()

Algebraic

If it can be computed by an algebraic function with m arguments, each of which is obtained by applying a distributive aggregate function

 e.g., avg(), standard_deviation()

Holistic

 If there is no constant bound on the storage size needed to describe a subaggregate

 e.g., median(), mode(), rank()

Concept Hierarchy

Schema Hierarchy

 e.g., street < city < state < country

 e.g., day < { month < quarter ; week } < year

Set-group hierarchy

 e.g., { (0..100] ; (100..200] } < (0..200]

 e.g., { (0..10] < lowPrice ; { (10..100] ; (100..200] } < highPrice }

< allProducts

Year Quarter

| Week Month

Day

(0..200]

(0..100] (100..200]

allProducts

lowPrice highPrice

(0..10] (10..100] (100..200]

(13)

Three Components of Data Cube

Example

 Measures: data values as a function of products, locations and time

 Dimensions:

 Hierarchies:

Company

| Category

| Product

locations

time

Country

| City

| Office

Year Quarter

| Week Month

Day product location time

Example of Cuboid Cells

all

product quarter country

product, quarter

product, country

quarter, country

product, quarter, country

0-D (apex) cuboid

1-D cuboids

2-D cuboids

3-D (base) cuboid

product dimension

time dimension

location dimension

(14)

Example of Data Cube

Total annual sales

of TV in U.S.A.

Quarter

Country

sum

TV sum

VCR PC

1Qtr 2Qtr 3Qtr 4Qtr

U.S.A

Canada

Mexico

sum

OLAP Operations

Roll-up (drill-up)

 Summarizes (aggregates) data

 by climbing up hierarchy or by dimension reduction

Drill-down (roll-down)

 Reverse of roll-up

 by stepping down to lower-level data or introducing new dimensions

Slice

 Selecting data on one dimension

Dice

 Selecting data on multi-dimensions

Pivot (rotate)

 Reorienting the cube, or transforming 3-D data to a series of 2-D spaces

(15)

Roll-UP & Drill-Down

country

quarter

city

quarter

country

month

location

time

Slice, Dice & Pivot

country

quarter

quarter

country

quarter

location

time Q1 Q2 Q3 Q4 USA

canada mexico

pclaptop supercom

country USA

canada

Q1 Q2 pclaptop

Q1 Q2 Q3 Q4 USA

canada mexico

(16)

Starnet Query Model

Shipping Method

air

truck order

Orders

contracts

Customer

Product category item

sales-person

division

division

Organization Promotion

city country

region

Location

daily quarterly annually

Time

Each circle is called a footprint

customerID

Lecture 3, Data Warehouse & OLAP Operations

CSI 4352, Introduction to Data Mining

Basic Concept of Data Warehouse

 Data Warehouse Modeling

 Data Warehouse Architecture

 Data Warehouse Implementation

 From Data Warehousing to Data Mining

(17)

Data Warehouse Design

Top-Down View

 Allows the selection of the relevant information necessary for the data warehouse

Data Source View

 Exposes the information being captured, stored, and managed by operational systems

Data Warehouse View

 Consists of fact tables and dimension tables

Business Query View

 Shows the perspectives of data in the warehouse to end-users

Data Warehouse Design Process

Categories by Process Direction

 Top-down: Starts with overall design and planning (mature)

 Bottom-up: Starts with experiments and prototypes (rapid)

Categories by Software Engineering View

 Waterfall: structured, systematic analysis at each step before proceeding to the next

 Spiral: rapid generation of functional systems, short turn around time

Typical Data Warehouse Design Process

Choose business processes for modeling, e.g., orders, invoices, etc

Choose the grain (atomic level of data) of the business processes

Choose the dimensions that will apply to each fact table

Choose the measures that will populate each fact table

(18)

Data Warehouse Architecture

Data Warehouse Extract

Transform Load Refresh

OLAP Engine

Query Analysis

Reports Monitor &

Integrator Metadata

Data Sources Front-End Tools

Serve

Data Marts Operational

DBs

Other sources

Data Storage

OLAP Server

Three Data Warehouse Models

Enterprise Warehouse

 A global view with all the information about subjects spanning the entire organization

Data Mart

 A subset of corporate-wide data that is of value to a specific group of users

 Its scope is confined to specific, selected groups

 Independent vs. dependent (directly from warehouse) data mart

Virtual Warehouse

 A set of views over operational databases

 Only some of possible summary views may be materialized

(19)

Development of Data Warehouse

Define a high-level corporate data model Model Refinement

Enterprise Data Warehouse

Multi-Tier Data Warehouse

Distributed Data Marts

Data

Mart Data

Mart

Model Refinement

Utilities of Back-End Tools

Data Extraction

 Get data from multiple, heterogeneous, and external sources

Data Cleaning

 Detect errors in the data and rectify them when possible

Data Transformation

 Convert data from the original format to the warehouse format

Loading

 Sort, summarize, consolidate, compute views, check integrity, and build indices and partitions

Refresh

 Propagate the updates from data sources to the warehouse

(20)

OLAP Server Architecture

Relational OLAP (ROLAP)

 Uses relational or extended-relational DBMS to store and manage warehouse data and OLAP middle ware

 Includes optimization of DBMS back-end, implementation of aggregation navigation logic, and additional tools and services

 High scalability

Multidimensional OLAP (MOLAP)

 Sparse array-based multidimensional storage engine

 Fast indexing to pre-computed summarized data

Hybrid OLAP (HOLAP)

 Low-level: relational / high-level: array

 High flexibility

Metadata Repository

Definition of Metadata

 The data defining data warehouse objects

Examples

 Description of the structure of the data warehouse, e.g., schema, view, dimensions, hierarchies, data definitions, data mart locations and contents

 Operational meta-data, e.g., history of migrated data, currency of data, warehouse usage statistics, error reports

 Algorithms used for summarization

 Mapping from operational environment to data warehouse

 Data related to system performance

 Business data, e.g., business terms and definitions, ownership of data, charging policies

(21)

Lecture 3, Data Warehouse & OLAP Operations

CSI 4352, Introduction to Data Mining

Basic Concept of Data Warehouse

 Data Warehouse Modeling

 Data Warehouse Architecture

 Data Warehouse Implementation

 From Data Warehousing to Data Mining

Data Cube Computation

View as a Lattice of Cuboids

How many cuboids in an n-dimensional cube?

How many cuboid cells in an n-dimensional cube with Lilevels?

Materialization of Data Cube

 Full materialization (all cuboids), Partial materialization (some cuboids), No materialization (only base cuboid)

 Selection of cuboids to materialize

•Based on the size, sharing, access frequency, etc.

) 1( 1

 n

i Li

2

n

all

product time location

product, time

product, location time, location product, time, location

(22)

Cube Operation

Cube Definition and Computation in DMQL

 define cube sales [item, city, year]: sum (sales_in_dollars)

 compute cubesales

Cube Definition and Computation in SQL

 select item, city, year, SUM (amount)

 from sales

 cube by item, city, year

Internal Operations

 group by (item, city, year)

 group by (item, city), (item, year), (city, year)

 group by (item), (city), (year)

 group by ()

Iceberg Cube

Iceberg Cube Computation

 Computing only the cuboid cells whose count or other aggregates satisfying the condition like HAVING COUNT(*) >= min_sup

Motivation

 Only a small portion of cube cells may be “above the water’’ in a sparse cube

 Only calculate “interesting” cells—data above certain threshold

 Avoid explosive growth of the cube

(23)

Indexing OLAP Data

Bitmap Indexing

 Index on a particular column

 Each value in the column has a bit vector

 The length of bit vector is the number of records in the base table

The ithbit is set if the ithrow of the base table has the value

 Not suitable for high cardinality domains

Join Indexing

 Link the value of dimensions to rows in the fact table CusID Region Type

C1 Asia Retail C2 Europe Dealer C3 Asia Dealer

C4 US Retail

C5 Europe Dealer

RecID Retail Dealer

1 1 0

2 0 1

3 0 1

4 1 0

5 0 1

RecID Asia Europe US

1 1 0 0

2 0 1 0

3 1 0 0

4 0 0 1

5 0 1 0

Base Table Index on Region Index on Type

Lecture 3, Data Warehouse & OLAP Operations

CSI 4352, Introduction to Data Mining

Basic Concept of Data Warehouse

 Data Warehouse Modeling

 Data Warehouse Architecture

 Data Warehouse Implementation

 From Data Warehousing to Data Mining

(24)

Data Warehouse Usage

Information Processing

 Supports querying, basic statistical analysis, and reporting using crosstabs, tables, charts and graphs

Analytical Processing

 Supports OLAP operations in multi-dimensional space

Data Mining

 Supports pattern discovery from warehouse data, and presenting the mining results using visualization tools

From OLAP To OLAM

On-Line Analytical Mining (OLAM)

 ( OLAP + Data Mining ) in data warehouse

Why OLAM ?

 High quality data

• Data warehouse contains integrated, consistent and cleaned data

 Information processing infrastructure

• ODBC/OLE DB connections, web accessing, service facilities

 OLAP-based exploratory data analysis

• Mining with drilling, dicing, pivoting, etc.

 On-line selection of data mining functions

• Integration and swapping of multiple data mining functions

(25)

OLAM System Architecture

Data Warehouse OLAM

Engine

OLAP Engine User GUI API

Data Cube API

Database API data cleaning data integration

Layer3 OLAP/OLAM

Layer2 MDDB

Layer1 Data Repository

Layer4 User Interface

mining query mining result

Meta Data

Databases

Questions?

 References

 Gray, J., et al., “Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab and Sub-Totals”, Data Mining and Knowledge Discovery, Vol. 1 (1997)

 Lecture Slides are found on the Course Website, www.ecs.baylor.edu/faculty/cho/4352

References

Related documents

SSL VPN solutions provide secure, granular access controls, ensuring that users gain access only to the designated resources or applications specific to their needs and according

When I disconnect the cable connecting the PC to the camera and double- click the removable disk icon, a message appears and the window does not open.. FinePixViewer automatic

Last but not least, el duque de Montalto-cardenal Moncada, cuñado tanto del marqués de Aytona - se había casado en segundas nupcias con su hermana Caterina - como del

time_key item_key branch_key location_key units_sold dollars_sold avg_sales Measures. item_key

© Bharati Vidyapeeth’s Institute of Computer Applications and Management, New Delhi-63, by Dr.. 7) The ROLAP model that treats data as if they were stored in. a)

A developer who is using the results of the OLAP cube would not understand the physical or logical model of the data warehouse because the structure is very different. We should

Populate repositories Relational DBMS Data warehouse Non-relational DBMS (HDFS, Hbase,...) Non-relational DBMS (content mgt systems) Data mart (OLAP cube) Real-time

Studies on SGLT2 KO mice and selective SGLT2 inhibitors described the renal transport capacity of SGLT1, showing that the SGLT1-mediated glucose reabsorption is maintained at 40 –50%