Decision Support, Data Warehousing, and OLAP

24 

Loading....

Loading....

Loading....

Loading....

Loading....

Full text

(1)

Decision Support, Data

Decision Support, Data

Warehousing, and OLAP

Warehousing, and OLAP

Anindya Datta

Anindya Datta

Director,

Director,

iXL Center for E

iXL

Center for E-

-Commerce

Commerce

Georgia Institute of Technology

Georgia Institute of Technology

adatta

adatta

@cc.gatech

@cc.

gatech.

.

edu

edu

Outline

Outline

I

I

Terminology: OLAP vs

Terminology: OLAP

vs

. OLTP

. OLTP

I

I

Data Warehousing Architecture

Data Warehousing Architecture

I

I

Technologies

Technologies

I

I

Products

Products

I

I

Research Issues

Research Issues

I

(2)

Decision Support and OLAP

Decision Support and OLAP

I

I

Information technology to help the knowledge worker

Information technology to help the knowledge worker

(executive, manager, analyst) make faster and better

(executive, manager, analyst) make faster and better

decisions.

decisions.

What were the sales volumes by region and product category What were the sales volumes by region and product category for the last year?

for the last year?

How did the share price of computer manufacturers correlate How did the share price of computer manufacturers correlate with quarterly profits over the past 10 years?

with quarterly profits over the past 10 years?

Which orders should we fill to maximize revenues?Which orders should we fill to maximize revenues?

Will a 10% discount increase sales volume sufficiently?Will a 10% discount increase sales volume sufficiently?

Which of two new medications will result in the best Which of two new medications will result in the best outcome: higher recovery rate & shorter hospital stay? outcome: higher recovery rate & shorter hospital stay?

I

I

On

On-

-Line Analytical Processing (OLAP) is an element of

Line Analytical Processing (OLAP) is an element of

decision support

decision support systmes

systmes (DSS).

(DSS).

Evolution

Evolution

I

I

60’s: Batch reports

60’s: Batch reports

• hard to find and analyze informationhard to find and analyze information •

• inflexible and expensive, reprogram every new requestinflexible and expensive, reprogram every new request

I

I

70’s: Terminal

70’s: Terminal-

-based DSS and EIS (executive information

based DSS and EIS (executive information

systems)

systems)

• still inflexible, not integrated with desktop toolsstill inflexible, not integrated with desktop tools

I

I

80’s: Desktop data access and analysis tools

80’s: Desktop data access and analysis tools

• query tools, spreadsheets, GUIsquery tools, spreadsheets, GUIs •

• easier to use, but only access operational databaseseasier to use, but only access operational databases

I

I

90’s: Data warehousing with integrated OLAP engines

90’s: Data warehousing with integrated OLAP engines

and tools

and tools

(3)

OLTP

OLTP

vs. OLAP

vs

. OLAP

I

I Clerk, IT ProfessionalClerk, IT Professional

I

I Day to day operationsDay to day operations

I

I Application-Application-oriented (Eoriented (E--R R based)

based)

I

I Current, IsolatedCurrent, Isolated

I

I Detailed, Flat relationalDetailed, Flat relational

I

I Structured, RepetitiveStructured, Repetitive

I

I Short, Simple transactionShort, Simple transaction

I

I Read/writeRead/write

I

I Index/hash on prim. KeyIndex/hash on prim. Key

I I TensTens I I ThousandsThousands I I 100 MB-100 MB-GBGB I

I Trans. throughputTrans. throughput

I

I Knowledge workerKnowledge worker

I

I Decision supportDecision support

I

I Subject-Subject-oriented (Star, oriented (Star, snowflake)

snowflake)

I

I Historical, ConsolidatedHistorical, Consolidated

I

I Summarized, MultidimensionalSummarized, Multidimensional

I

I Ad hocAd hoc

I

I Complex queryComplex query

I

I Read MostlyRead Mostly

I

I Lots of ScansLots of Scans

I I MillionsMillions I I HundredsHundreds I I 100GB-100GB-TBTB I

I Query throughput, responseQuery throughput, response

User Function DB Design Data View Usage Unit of work Access Operations # Records accessed #Users Db size Metric

OLTP

OLTP

OLAP

OLAP

Data Warehouse

Data Warehouse

I

I

A decision support database that is maintained

A decision support database that is maintained

separately from the organization’s operational

separately from the organization’s operational

databases.

databases.

I

I

A data warehouse is a

A data warehouse is a

subject

subject

-

-

oriented,

oriented,

integrated,

integrated,

time

time

-

-

varying,

varying,

non

non

-

-

volatile

volatile

collection of data that is used primarily in

collection of data that is used primarily in

organizational decision making

(4)

Why Separate Data Warehouse?

Why Separate Data Warehouse?

I

I

Performance

Performance

• Op Op dbs dbs designed & tuned for known designed & tuned for known txs txs & workloads.& workloads.

• Complex OLAP queries would degrade Complex OLAP queries would degrade perfperf. For op . For op txstxs.. •

• Special data organization, access & implementation methods Special data organization, access & implementation methods

needed for multidimensional views & queries.

needed for multidimensional views & queries.

I

I

Function

Function

• Missing data: Decision support requires historical data, which Missing data: Decision support requires historical data, which opop

dbs

dbsdo not typically maintain.do not typically maintain.

• Data consolidation: Decision support requires consolidation Data consolidation: Decision support requires consolidation (aggregation, summarization) of data from many heterogeneous

(aggregation, summarization) of data from many heterogeneous

sources: op

sources: op dbsdbs, external sources. , external sources. •

• Data quality: Different sources typically use inconsistent dataData quality: Different sources typically use inconsistent data representations, codes, and formats which have to be reconciled.

representations, codes, and formats which have to be reconciled.

Data Warehousing Market

Data Warehousing Market

I

I

Hardware: servers, storage, clients

Hardware: servers, storage, clients

I

I

Warehouse

Warehouse DBMs

DBMs

I

I

Tools

Tools

I

I

Market growing from

Market growing from

• $2B in 1995 to $8 B in 1998 ($2B in 1995 to $8 B in 1998 (Meta Meta Group)Group)

• 1.5B today to $6.9B in 1999 (1.5B today to $6.9B in 1999 (Gartner Gartner Group)Group)

I

I

Systems integration & Consulting

Systems integration & Consulting

I

I

Already deployed in many industries: manufacturing,

Already deployed in many industries: manufacturing,

retail, financial, insurance, transportation, telecom.,

retail, financial, insurance, transportation, telecom.,

utilities, healthcare.

(5)

Data Warehousing Architecture

Data Warehousing Architecture

Monitoring & Administration Monitoring & Administration

Metadata Metadata Repository Repository Extract Extract Transform Transform Load Load Refresh Refresh

Data Marts

Data Marts

External External Sources Sources Operational Operational dbs dbs Serve Serve OLAP servers OLAP servers

Analysis

Analysis

Query/

Query/

Reporting

Reporting

Data

Data

Mining

Mining

Three

Three

-Tier Architecture

-

Tier Architecture

I

I

Warehouse database server

Warehouse database server

• Almost always a relational DBMS; rarely flat filesAlmost always a relational DBMS; rarely flat files

I

I

OLAP servers

OLAP servers

• Relational OLAP (ROLAP): extended relational DBMS that maps Relational OLAP (ROLAP): extended relational DBMS that maps operations on multidimensional data to standard relational

operations on multidimensional data to standard relational

operations.

operations.

• Multidimensional OLAP (MOLAP): special purpose server that Multidimensional OLAP (MOLAP): special purpose server that directly implements multidimensional data and operations.

directly implements multidimensional data and operations.

I

I

Clients

Clients

• Query and reporting tools.Query and reporting tools. •

• Analysis toolsAnalysis tools •

(6)

Data Warehouse

Data Warehouse

vs. Data Marts

vs

. Data Marts

I

I Enterprise warehouse: collects all information about subjects Enterprise warehouse: collects all information about subjects

(customers, products, sales, assets, personnel) that span the en

(customers, products, sales, assets, personnel) that span the entire tire organization.

organization.

• Requires extensive business modelingRequires extensive business modeling •

• May take years to design and buildMay take years to design and build

I

I Data Marts: Departmental subsets that focus on selected subjectData Marts: Departmental subsets that focus on selected subjects: s:

Marketing data mart: customer, products, sales.

Marketing data mart: customer, products, sales.

• Faster roll out, but complex integration in the long run.Faster roll out, but complex integration in the long run.

I

I Virtual warehouse: views over operational Virtual warehouse: views over operational dbsdbs

• Materialize some summary views for efficient query processingMaterialize some summary views for efficient query processing •

• Easier to buildEasier to build •

• Requisite excess capcaity Requisite excess capcaity on operational on operational db db serversservers

Design & Operational Process

Design & Operational Process

I

I

Define architecture. Do capacity planning.

Define architecture. Do capacity planning.

I

I

Integrate

Integrate db

db and OLAP servers, storage and client tools.

and OLAP servers, storage and client tools.

I

I

Design warehouse schema, views.

Design warehouse schema, views.

I

I

Design physical warehouse organization: data placement,

Design physical warehouse organization: data placement,

partitioning, access methods.

partitioning, access methods.

I

I

Connect sources: gateways, ODBC drivers, wrappers.

Connect sources: gateways, ODBC drivers, wrappers.

I

I

Design & implement scripts for data extract, load refresh.

Design & implement scripts for data extract, load refresh.

I

I

Define

Define metadata

metadata and populate repository.

and populate repository.

I

I

Design & implement end

Design & implement end-

-user applications.

user applications.

I

I

Roll out warehouse and applications.

Roll out warehouse and applications.

I

(7)

OLAP for Decision Support

OLAP for Decision Support

I

I

Goal of OLAP is to support ad

Goal of OLAP is to support ad-

-hoc querying for the

hoc querying for the

business analyst

business analyst

I

I

Business analysts are familiar with spreadsheets

Business analysts are familiar with spreadsheets

I

I

Extend spreadsheet analysis model to work with

Extend spreadsheet analysis model to work with

warehouse data

warehouse data

• Large data setLarge data set •

• Semantically enriched to understand business terms (e.g., time, Semantically enriched to understand business terms (e.g., time, geography)

geography)

• Combined with reporting featuresCombined with reporting features

I

I

Multidimensional

Multidimensional

view of data is the foundation of OLAP

view of data is the foundation of OLAP

Multidimensional Data Model

Multidimensional Data Model

I

I

Database is a set of

Database is a set of

facts

facts

(points) in a multidimensional

(points) in a multidimensional

space

space

I

I

A fact has a

A fact has a

measure

measure

dimension

dimension

• quantity that is analyzed, e.g., sale, budgetquantity that is analyzed, e.g., sale, budget

I

I

A set of

A set of

dimensions

dimensions

on which data is analyzed

on which data is analyzed

• e.g. , store, product, date associated with a sale amounte.g. , store, product, date associated with a sale amount

I

I

Dimensions form a sparsely populated coordinate system

Dimensions form a sparsely populated coordinate system

I

I

Each dimension has a set of

Each dimension has a set of

attributes

attributes

• e.g., owner city and county of storee.g., owner city and county of store

I

I

Attributes of a dimension may be related by partial order

Attributes of a dimension may be related by partial order

HierarchyHierarchy: e.g., street > county >city: e.g., street > county >city •

(8)

Multidimensional Data

Multidimensional Data

10 10 47 47 30 30 12 12

Juice

Juice

Cola

Cola

Milk

Milk

Cream

Cream

NY NY LA LA SF SF

Sales

Sales

Volume

Volume

as a

as a

function

function

of time,

of time,

city and

city and

product

product

3/1 3/2 3/3 3/4

3/1 3/2 3/3 3/4

Date

Date

Operations in Multidimensional

Operations in Multidimensional

Data Model

Data Model

I

I

Aggregation (roll

Aggregation (

roll-

-up

up)

)

dimension reduction: e.g., total sales by city

dimension reduction: e.g., total sales by city

summarization over aggregate hierarchy: e.g., total

summarization over aggregate hierarchy: e.g., total

sales by city and year

sales by city and year -

-> total sales by region and by

> total sales by region and by

year

year

I

I

Selection (slice

Selection (

slice

) defines a subcube

) defines a

subcube

e.g., sales where city = Palo Alto and date = 1/15/96

e.g., sales where city = Palo Alto and date = 1/15/96

I

I

Navigation to detailed data (drill

Navigation to detailed data (

drill-

-down

down)

)

e.g., (sales -

e.g., (sales

-

expense) by city, top 3% of cities by

expense) by city, top 3% of cities by

average income

average income

I

(9)

A Visual Operation: Pivot (Rotate)

A Visual Operation: Pivot (Rotate)

10 10 47 47 30 30 12 12

Juice

Juice

Cola

Cola

Milk

Milk

Cream

Cream

NY NY LA LA SF SF

3/1 3/2 3/3 3/4

3/1 3/2 3/3 3/4

Date

Date

Month Month RegionRegion Product Product

Approaches to OLAP Servers

Approaches to OLAP Servers

I

I

Relational OLAP (ROLAP)

Relational OLAP (ROLAP)

• Relational and Specialized Relational DBMS to store and manage Relational and Specialized Relational DBMS to store and manage warehouse data

warehouse data

• OLAP OLAP middleware middleware to support missing piecesto support missing pieces

– Optimize for each DBMS backendOptimize for each DBMS backend –

– Aggregation Navigation LogicAggregation Navigation Logic –

– Additional tools and servicesAdditional tools and services •

• Example: Example: MicrostrategyMicrostrategy, , MetaCube MetaCube (Informix)(Informix)

I

I

Multidimensional OLAP (MOLAP)

Multidimensional OLAP (MOLAP)

• ArrayArray--based storage structuresbased storage structures •

• Direct access to array data structuresDirect access to array data structures •

• Example: Example: Essbase Essbase (Arbor), (Arbor), Accumate Accumate ((KenanKenan))

I

(10)

Relational DBMS as Warehouse

Relational DBMS as Warehouse

Server

Server

I

I

Schema design

Schema design

I

I

Specialized scan, indexing and join techniques

Specialized scan, indexing and join techniques

I

I

Handling of aggregate views (querying and

Handling of aggregate views (querying and

materialization)

materialization)

I

I

Supporting query language extensions beyond

Supporting query language extensions beyond

SQL

SQL

I

I

Complex query processing and optimization

Complex query processing and optimization

I

I

Data partitioning and parallelism

Data partitioning and parallelism

Warehouse Database Schema

Warehouse Database Schema

I

I

ER design techniques not appropriate

ER design techniques not appropriate

I

I

Design should reflect multidimensional

Design should reflect multidimensional

view

view

Star Schema

Star Schema

Snowflake Schema

Snowflake Schema

(11)

Example of a Star Schema

Example of a Star Schema

Order No Order No Order Date Order Date Customer No Customer No Customer Name Customer Name Customer Customer Address Address City City SalespersonID SalespersonID SalespersonName SalespersonName City City Quota Quota OrderNO OrderNO SalespersonID SalespersonID CustomerNO CustomerNO ProdNo ProdNo DateKey DateKey CityName CityName Quantity Quantity Total Price ProductNO ProductNO ProdName ProdName ProdDescr ProdDescr Category Category CategoryDescription CategoryDescription UnitPrice UnitPrice DateKey DateKey Date Date CityName CityName State State Country Country Order Order Customer Customer Salesperson Salesperson City City Date Date Product Product Fact Table Fact Table

Star Schema

Star Schema

I

I

A single fact table and a single table for each dimension

A single fact table and a single table for each dimension

I

I

Every fact points to one

Every fact points to one tuple

tuple

in each of the dimensions

in each of the dimensions

and has additional attributes

and has additional attributes

I

I

Does not capture hierarchies directly

Does not capture hierarchies directly

I

I

Generated keys are used for performance and maintenance

Generated keys are used for performance and maintenance

reasons

reasons

I

I

Fact constellation: Multiple Fact tables that share many

Fact constellation: Multiple Fact tables that share many

dimension tables

dimension tables

• Example: Projected expense and the actual expense may share Example: Projected expense and the actual expense may share dimensional tables

(12)

Example of a Snowflake Schema

Example of a Snowflake Schema

Order No Order No Order Date Order Date Customer No Customer No Customer Name Customer Name Customer Customer Address Address City City SalespersonID SalespersonID SalespersonName SalespersonName City City Quota Quota OrderNO OrderNO SalespersonID SalespersonID CustomerNO CustomerNO ProdNo ProdNo DateKey DateKey CityName CityName Quantity Quantity Total Price ProductNO ProductNO ProdName ProdName ProdDescr ProdDescr Category Category Category Category UnitPrice UnitPrice DateKey DateKey Date Date Month Month CityName CityName State State Country Country Order Order Customer Customer Salesperson Salesperson City City Date Date Product Product Fact Table

Fact Table CategoryNameCategoryDescrCategoryNameCategoryDescr

Month

Month

Year

Year YearYear

StateName StateName Country Country Category Category State State Month Month Year Year

Snowflake Schema

Snowflake Schema

I

I

Represent dimensional hierarchy directly by

Represent dimensional hierarchy directly by

normalizing the dimension tables

normalizing the dimension tables

I

I

Easy to maintain

Easy to maintain

I

I

Saves storage, but is alleged that it reduces

Saves storage, but is alleged that it reduces

effectiveness of browsing (Kimball)

effectiveness of browsing (Kimball)

(13)

Indexing Techniques

Indexing Techniques

I

I

Exploiting indexes to reduce scanning of

Exploiting indexes to reduce scanning of

data is of crucial importance

data is of crucial importance

I

I

Bitmap Indexes

Bitmap Indexes

I

I

Join Indexes

Join Indexes

I

I

Other Issues

Other Issues

Text indexing

Text indexing

Parallelizing

Parallelizing

and sequencing of index builds

and sequencing of index builds

and incremental updates

and incremental updates

BitMap

BitMap

Indexes

Indexes

I

I

An alternative representation of RID

An alternative representation of RID-

-list

list

I

I

Specially advantageous for low

Specially advantageous for low-

-cardinality

cardinality

domains

domains

I

I

Represent each row of a table by a bit and the

Represent each row of a table by a bit and the

table as a bit vector

table as a bit vector

I

I

There is a distinct bit vector

There is a distinct bit vector Bv

Bv for each value v

for each value v

for the domain

for the domain

I

I

Example: the attribute sex has values M and F. A

Example: the attribute sex has values M and F. A

table of 100 million people needs 2 lists of 100

table of 100 million people needs 2 lists of 100

million bits

(14)

Bit Map Index

Bit Map Index

Cust Region Rating

C1 N H C2 S M C3 W L C4 W H C5 S L C6 W L C7 N H Base Table Base Table Row ID N S E W 1 1 0 0 0 2 0 1 0 0 3 0 0 0 1 4 0 0 0 1 5 0 1 0 0 6 0 0 0 1 7 1 0 0 0 Row ID H M L 1 1 0 0 2 0 1 0 3 0 0 0 4 0 0 0 5 0 1 0 6 0 0 0 7 1 0 0 Rating Index Rating Index Region Index Region Index

Customers where

Customers where

Region = W

Region = W

And

And

Rating = 1

Rating = 1

BitMap

BitMap

Indexes

Indexes

I

I

Comparison, join and aggregation operations are

Comparison, join and aggregation operations are

reduced to bit arithmetic with dramatic

reduced to bit arithmetic with dramatic

improvement in processing time

improvement in processing time

I

I

Significant reduction in space and I/O (30:1)

Significant reduction in space and I/O (30:1)

I

I

Adapted for higher cardinality domains as well.

Adapted for higher cardinality domains as well.

I

I

Compression (e.g., run

Compression (e.g., run-

-length encoding) exploited

length encoding) exploited

I

I

Products that support bitmaps: Model 204,

Products that support bitmaps: Model 204,

TargetIndex

(15)

Issues in Handling of Aggregate

Issues in Handling of Aggregate

Views

Views

I

I

Important component for ROLAP Servers

Important component for ROLAP Servers

I

I

Representation in the context of star schema

Representation in the context of star schema

Query Expressions

Query Expressions

Materialized Views

Materialized Views

I

I

Logic for Aggregation Navigation

Logic for Aggregation Navigation

make optimum use of materialized aggregates to

make optimum use of materialized aggregates to

answer a query

answer a query

I

I

Choice of aggregate views to materialize

Choice of aggregate views to materialize

I

I

HP Intelligent Warehouse pioneered some of the

HP Intelligent Warehouse pioneered some of the

techniques

techniques

SQL Extensions for Front End

SQL Extensions for Front End

Tools

Tools

I

I

Extended Family of Aggregate functions

Extended Family of Aggregate functions

• rank (top 10) and Nrank (top 10) and N--Tile (“top 30%” of all products)Tile (“top 30%” of all products)

• Median, mode…..Median, mode…..

I

I

Reporting Features

Reporting Features

• running total, cumulative totalsrunning total, cumulative totals

I

I

Results of multiple group by:

Results of multiple group by:

• total sales by month and total sales by producttotal sales by month and total sales by product

I

I

SQL comes in the way of sequential processing and

SQL comes in the way of sequential processing and

columnar aggregations

columnar aggregations

(16)

Query Processing in MOLAP

Query Processing in MOLAP

Servers

Servers

I

I

The storage model is an n-

The storage model is an n

-dimensional

dimensional

array

array

I

I

Front end multidimensional queries map to

Front end multidimensional queries map to

server capabilities in a straightforward way

server capabilities in a straightforward way

I

I

Direct Addressing abilities

Direct Addressing abilities

I

I

A straightforward array representation has

A straightforward array representation has

good indexing properties but very poor

good indexing properties but very poor

storage utilization when the data is sparse

storage utilization when the data is sparse

Query Processing in MOLAP

Query Processing in MOLAP

Servers

Servers

2

2

-

-

dimensional dense arrays indexed by B

dimensional dense arrays indexed by B

-

-

Trees

Trees

Traditional Traditional indexing indexing structure structure 2 2--dimensional dimensional dense arrays dense arrays

(17)

Population & Refreshing the

Population & Refreshing the

Warehouse

Warehouse

I

I

Data extraction

Data extraction

I

I

Data cleaning

Data cleaning

I

I

Data transformation

Data transformation

Convert from legacy/host format to warehouse format

Convert from legacy/host format to warehouse format

I

I

Load

Load

Sort, summarize, consolidate, compute views, check

Sort, summarize, consolidate, compute views, check

integrity, build indexes, partition

integrity, build indexes, partition

I

I

Refresh

Refresh

Propogate

Propogate

updates from sources to the warehouse

updates from sources to the warehouse

Data Cleaning

Data Cleaning

I I

Why?

Why?

• Data warehouse contains data that is analyzed for business Data warehouse contains data that is analyzed for business decisions

decisions

• More data and multiple sources could mean more errors in the datMore data and multiple sources could mean more errors in the data a

and harder to trace such errors

and harder to trace such errors

• Results in incorrect analysisResults in incorrect analysis

I

I

Detecting data anomalies and rectifying them early has

Detecting data anomalies and rectifying them early has

huge payoffs

huge payoffs

I

I

Important to identify tools that work together well

Important to identify tools that work together well

I

I

Long Term Solution

Long Term Solution

• Change business practices and data entry toolsChange business practices and data entry tools •

(18)

Data Cleaning Techniques

Data Cleaning Techniques

I

I

Transformation Rules

Transformation Rules

• Example: translate “gender” to “sex”Example: translate “gender” to “sex” •

• Products: Warehouse Manger (Prism), Extract (ETI)Products: Warehouse Manger (Prism), Extract (ETI)

I

I

Uses domain

Uses domain-

-specific knowledge to do scrubbing

specific knowledge to do scrubbing

I

I

Parsing and fuzzy matching

Parsing and fuzzy matching

• Multiple data sources (can designate a preferred source)Multiple data sources (can designate a preferred source) •

• Products: Integrity (Products: Integrity (ValityVality), ), TrillumTrillum

I

I

Discover facts that flag unusual patterns (auditing)

Discover facts that flag unusual patterns (auditing)

• Some dealer has never received a single complaintSome dealer has never received a single complaint •

• Products: QDB, Products: QDB, SBStarSBStar, , WizRuleWizRule

Load

Load

I

I

Issues:

Issues:

• huge volumes of data to be loadedhuge volumes of data to be loaded •

• small time window (usually at night) when the warehouse can be small time window (usually at night) when the warehouse can be taken off

taken off--lineline •

• When to build indexes and summary tablesWhen to build indexes and summary tables •

• allow system administrator to monitor status, cancel suspend, allow system administrator to monitor status, cancel suspend, resume load, or change load rate

resume load, or change load rate

• restart after failure with no loss of data integrityrestart after failure with no loss of data integrity

I

I

Techniques:

Techniques:

• batch load utility: sort input records on clustering key and usbatch load utility: sort input records on clustering key and use e sequential I/O; build indexes and derived tables

sequential I/O; build indexes and derived tables

• sequential loads still too long (~100 days for TB)sequential loads still too long (~100 days for TB) •

(19)

Refresh

Refresh

I

I

Issues:

Issues:

when to refresh

when to refresh

on every update: too expensive, only necessary if

on every update: too expensive, only necessary if

OLAP queries need current data

OLAP queries need current data

(e.g., up

(e.g., up

-

-

the

the

-

-

minute

minute

stock quotes)

stock quotes)

periodically (e.g., every 24 hours, every week) or

periodically (e.g., every 24 hours, every week) or

after “significant” events

after “significant” events

refresh policy set by administrator based on user

refresh policy set by administrator based on user

needs and traffic

needs and traffic

possibly different policies for different sources

possibly different policies for different sources

how to refresh

how to refresh

Refresh Techniques

Refresh Techniques

I

I

Full extract from base tables

Full extract from base tables

• read entire source table or database: expensiveread entire source table or database: expensive •

• may be the only choice for legacy databases or files.may be the only choice for legacy databases or files.

I

I

Incremental techniques (related to work on active

Incremental techniques (related to work on active dbs

dbs)

)

• detect & propagate changes on base tables: replication servers detect & propagate changes on base tables: replication servers (e.g., Sybase, Oracle, IBM Data Propagator)

(e.g., Sybase, Oracle, IBM Data Propagator)

– snapshots & triggers (Oracle)snapshots & triggers (Oracle) –

– transaction shipping (Sybase)transaction shipping (Sybase) •

• Logical correctnessLogical correctness –

– computing changes to star tablescomputing changes to star tables –

– computing changes to derived and summary tablescomputing changes to derived and summary tables –

– optimization: only significant changesoptimization: only significant changes •

(20)

Metadata

Metadata

Repository

Repository

I

I

Administrative

Administrative metadata

metadata

• source databases and their contentssource databases and their contents •

• gateway descriptionsgateway descriptions •

• warehouse schema, view & derived data definitionswarehouse schema, view & derived data definitions •

• dimensions, hierarchiesdimensions, hierarchies •

• prepre--defined queries and reportsdefined queries and reports •

• data mart locations and contentsdata mart locations and contents •

• data partitionsdata partitions •

• data extraction, cleansing, transformation rules, defaultsdata extraction, cleansing, transformation rules, defaults •

• data refresh and purging rulesdata refresh and purging rules •

• user profiles, user groupsuser profiles, user groups •

• security: user authorization, access controlsecurity: user authorization, access control

Metdata

Metdata Repository .. 2

Repository .. 2

I

I

Business data

Business data

business terms and definitions

business terms and definitions

ownership of data

ownership of data

charging policies

charging policies

I

I

operational

operational metadata

metadata

data lineage: history of migrated data and sequence of

data lineage: history of migrated data and sequence of

transformations applied

transformations applied

currency of data: active, archived, purged

currency of data: active, archived, purged

monitoring information: warehouse usage statistics,

monitoring information: warehouse usage statistics,

error reports, audit trails.

(21)

Warehouse Design Tools

Warehouse Design Tools

I

I

Creating and managing a warehouse is hard.

Creating and managing a warehouse is hard.

I

I

Development tools

Development tools

• defining & editing defining & editing metadata metadata repository contents (schemas, scripts, repository contents (schemas, scripts, rules).

rules).

• Queries and reportsQueries and reports •

• Shipping Shipping metadata metadata to and from RDBMS catalogue (e.g., Prism to and from RDBMS catalogue (e.g., Prism Warehouse Manager).

Warehouse Manager).

I

I

Planning & analysis tools

Planning & analysis tools

• impact of schema changesimpact of schema changes •

• capacity planningcapacity planning •

• refresh performance: changing refresh rates or time windowsrefresh performance: changing refresh rates or time windows

Warehouse Management Tools

Warehouse Management Tools

I

I

Monitoring and reporting tools (e.g., HP Intelligent

Monitoring and reporting tools (e.g., HP Intelligent

Warehouse Advisor)

Warehouse Advisor)

• which partitions, summary tables, columns are used which partitions, summary tables, columns are used •

• query execution timesquery execution times •

• for summary tables, types & frequencies of roll downsfor summary tables, types & frequencies of roll downs •

• warehouse usage over time (detect peak periods)warehouse usage over time (detect peak periods)

I

I

Systems and network management tools (e.g., HP

Systems and network management tools (e.g., HP

OpenView

OpenView, IBM

, IBM NetView

NetView,

, Tivoli

Tivoli): traffic, utilization

): traffic, utilization

I

I

Exception reporting/alerting tools 9e.g., DB2 Event

Exception reporting/alerting tools 9e.g., DB2 Event

Alerters

Alerters, Information Advantage

, Information Advantage InfoAgents

InfoAgents &

& InfoAlert

InfoAlert)

)

• runaway queriesrunaway queries

I

(22)

State of Commercial Practice

State of Commercial Practice

I

I Products and Vendors [DatamationProducts and Vendors [Datamation, May 15, 1996; R.C. , May 15, 1996; R.C. BarquinBarquin, H.A. , H.A. EdelsteinEdelstein: : Planning and

Planning and Designin gthe Designin gthe Data Data WarehousWarehous. Prentice Hall. 1997]. Prentice Hall. 1997] •

• Connectivity to sourcesConnectivity to sources –

– ApertusApertus CACA--Ingres Ingres GatewayGateway –

– Information Builders EDA/SQLInformation Builders EDA/SQL IBM Data IBM Data JionerJioner –

– Informix Enterprise GatewayInformix Enterprise Gateway Microsoft ODBCMicrosoft ODBC –

– Oracle Open ConnectOracle Open Connect Platinum Platinum InfohubInfohub –

– SAS ConnectSAS Connect Software AG EntireSoftware AG Entire –

– Sybase Enterprise ConnectSybase Enterprise Connect Trinzic InfoHubTrinzic InfoHub •

• Data extract, clean, Data extract, clean, transfomrtransfomr, refresh, refresh –

– CA-CA-Ingres ReplicatorIngres Replicator Carleton PassportCarleton Passport –

– Evolutionary Tech Inc. ETI-Evolutionary Tech Inc. ETI-ExtractExtract HarteHarte--Hanks TrilliumHanks Trillium –

– IBM Data Joiner, Data PropagatorIBM Data Joiner, Data Propagator Oracle 7Oracle 7 –

– Platinum InfoRefinerPlatinum InfoRefiner, , InfroPumpInfroPump Praxis Praxis OmniReplicatorOmniReplicator –

– Prism Warehouse ManagerPrism Warehouse Manager Redbrick TMURedbrick TMU –

– SAS AccessSAS Access Software AG Software AG SouorcepointSouorcepoint –

– Sybase Replication ServerSybase Replication Server TrinzicTrinzicInfoPumpInfoPump

State of Commercial Practice..2

State of Commercial Practice..2

I

I Multidimensional Database EnginesMultidimensional Database Engines

Arbor

Arbor EssbaseEssbase ComshareComshareCommander OLAPCommander OLAP Oracle IRI Express

Oracle IRI Express SAS SystemSAS System

I

I Warehouse Data ServersWarehouse Data Servers

• CA-CA-IngresIngres IBM DB2IBM DB2 •

• Information Builders FocusInformation Builders Focus InformixInformix •

• OracleOracle Praxiz Model 204Praxiz Model 204 •

• RedbrickRedbrick Software AG ADABASSoftware AG ADABAS •

• Sybase MPPSybase MPP TandemTandem •

• TerdataTerdata

I

I ROLAP ServersROLAP Servers

• HP Intelligent WarehouseHP Intelligent Warehouse Information Advantage AsxysInformation Advantage Asxys •

(23)

State of Commercial Practice..3

State of Commercial Practice..3

I

I Query/Reporting EnvironmentsQuery/Reporting Environments

Brio

Brio/Query/Query Business ObjectsBusiness Objects Cognos

Cognos ImpromptuImpromptu CA Visual ExpressCA Visual Express IBM

IBM DataGuideDataGuide Information Builders Focus SixInformation Builders Focus Six Informix

Informix ViewPointViewPoint Platinum Forest & TreesPlatinum Forest & Trees SAS Access

SAS Access Software AG Software AG EsperantEsperant I

I Multidimensional AnalysisMultidimensional Analysis

Andydne

Andydne PabloPablo Arbor Arbor EssbaseEssbaseAnalysis ServerAnalysis Server

Business Objects

Business Objects Cognos PowerPlayCognos PowerPlay Dimensional Insight Cross Target

Dimensional Insight Cross Target Holistic Systems HOLOSHolistic Systems HOLOS Information Advantage Decision Suite

Information Advantage Decision Suite IQ Software IQ/VisionIQ Software IQ/Vision Kenan

Kenan System System AcumateAcumate Lotus 123Lotus 123 Microsoft Excel

Microsoft Excel Microstrategy Microstrategy DSSDSS

Pilot Lightship

Pilot Lightship Platinum Forest & TreesPlatinum Forest & Trees Prodea

Prodea BeaconBeacon SAS OLAP ++SAS OLAP ++ Stanford Technology Group

Stanford Technology Group MetacubeMetacube

State of Commercial Practice..4

State of Commercial Practice..4

I

I Metadata Metadata ManagementManagement

• HP Intelligent WarehouseHP Intelligent Warehouse IBM Data GuideIBM Data Guide •

• Platinum Repository Platinum Repository Prism Directory ManagerPrism Directory Manager I

I System ManagementSystem Management

• CA UnicenterCA Unicenter HP OpenViewHP OpenView •

• IBM DataHubIBM DataHub, , NetViewNetView Information Builder Site AnalyzerInformation Builder Site Analyzer •

• Prism Warehouse ManagerPrism Warehouse Manager SAS CPESAS CPE •

• TivoliTivoli Software AG Source PointSoftware AG Source Point •

• Redbrick Enterprise Control and CoordinationRedbrick Enterprise Control and Coordination I

I Process Management Process Management

• At& T TOPENDAt& T TOPEND HP Intelligent WarehouseHP Intelligent Warehouse •

• IBM FlowMarkIBM FlowMark Platinum RepositoryPlatinum Repository •

• Prism Warehouse ManagerPrism Warehouse Manager Software AG Source PointSoftware AG Source Point I

(24)

Research Issues

Research Issues

I

I Data cleaningData cleaning

• focus on data inconsistencies, not schema differencesfocus on data inconsistencies, not schema differences •

• data mining techniquesdata mining techniques

I

I Physical DesignPhysical Design

• design of summary tables, partitions, indexesdesign of summary tables, partitions, indexes •

• tradeoffs in use of different indexestradeoffs in use of different indexes

I

I Query processingQuery processing

• selecting appropriate summary tablesselecting appropriate summary tables •

• dynamic optimization with feedbackdynamic optimization with feedback •

• acid test for query optimization: cost estimation, use of transformations, acid test for query optimization: cost estimation, use of transformations, search strategies

search strategies •

• partitioning query processing between OLAP server and backend server.partitioning query processing between OLAP server and backend server.

Research Issues .. 2

Research Issues .. 2

I

I

Warehouse Management

Warehouse Management

detecting runaway queries

detecting runaway queries

resource management

resource management

incremental refresh techniques

incremental refresh techniques

computing summary tables during load

computing summary tables during load

failure recovery during load and refresh

failure recovery during load and refresh

process management: scheduling queries, load and

process management: scheduling queries, load and

refresh

refresh

Figure

Updating...

References

Updating...

Related subjects :