Decision Support, Data
Decision Support, Data
Warehousing, and OLAP
Warehousing, and OLAP
Anindya Datta
Anindya Datta
Director,
Director,
iXL Center for E
iXL
Center for E-
-Commerce
Commerce
Georgia Institute of Technology
Georgia Institute of Technology
adatta
adatta
@cc.gatech
@cc.
gatech.
.
edu
edu
Outline
Outline
I
I
Terminology: OLAP vs
Terminology: OLAP
vs
. OLTP
. OLTP
I
I
Data Warehousing Architecture
Data Warehousing Architecture
I
I
Technologies
Technologies
I
I
Products
Products
I
I
Research Issues
Research Issues
I
Decision Support and OLAP
Decision Support and OLAP
I
I
Information technology to help the knowledge worker
Information technology to help the knowledge worker
(executive, manager, analyst) make faster and better
(executive, manager, analyst) make faster and better
decisions.
decisions.
•
• What were the sales volumes by region and product category What were the sales volumes by region and product category for the last year?
for the last year? •
• How did the share price of computer manufacturers correlate How did the share price of computer manufacturers correlate with quarterly profits over the past 10 years?
with quarterly profits over the past 10 years? •
• Which orders should we fill to maximize revenues?Which orders should we fill to maximize revenues? •
• Will a 10% discount increase sales volume sufficiently?Will a 10% discount increase sales volume sufficiently? •
• Which of two new medications will result in the best Which of two new medications will result in the best outcome: higher recovery rate & shorter hospital stay? outcome: higher recovery rate & shorter hospital stay?
I
I
On
On-
-Line Analytical Processing (OLAP) is an element of
Line Analytical Processing (OLAP) is an element of
decision support
decision support systmes
systmes (DSS).
(DSS).
Evolution
Evolution
I
I
60’s: Batch reports
60’s: Batch reports
•
• hard to find and analyze informationhard to find and analyze information •
• inflexible and expensive, reprogram every new requestinflexible and expensive, reprogram every new request
I
I
70’s: Terminal
70’s: Terminal-
-based DSS and EIS (executive information
based DSS and EIS (executive information
systems)
systems)
•
• still inflexible, not integrated with desktop toolsstill inflexible, not integrated with desktop tools
I
I
80’s: Desktop data access and analysis tools
80’s: Desktop data access and analysis tools
•
• query tools, spreadsheets, GUIsquery tools, spreadsheets, GUIs •
• easier to use, but only access operational databaseseasier to use, but only access operational databases
I
I
90’s: Data warehousing with integrated OLAP engines
90’s: Data warehousing with integrated OLAP engines
and tools
and tools
OLTP
OLTP
vs. OLAP
vs
. OLAP
II Clerk, IT ProfessionalClerk, IT Professional
I
I Day to day operationsDay to day operations
I
I Application-Application-oriented (Eoriented (E--R R based)
based)
I
I Current, IsolatedCurrent, Isolated
I
I Detailed, Flat relationalDetailed, Flat relational
I
I Structured, RepetitiveStructured, Repetitive
I
I Short, Simple transactionShort, Simple transaction
I
I Read/writeRead/write
I
I Index/hash on prim. KeyIndex/hash on prim. Key
I I TensTens I I ThousandsThousands I I 100 MB-100 MB-GBGB I
I Trans. throughputTrans. throughput
I
I Knowledge workerKnowledge worker
I
I Decision supportDecision support
I
I Subject-Subject-oriented (Star, oriented (Star, snowflake)
snowflake)
I
I Historical, ConsolidatedHistorical, Consolidated
I
I Summarized, MultidimensionalSummarized, Multidimensional
I
I Ad hocAd hoc
I
I Complex queryComplex query
I
I Read MostlyRead Mostly
I
I Lots of ScansLots of Scans
I I MillionsMillions I I HundredsHundreds I I 100GB-100GB-TBTB I
I Query throughput, responseQuery throughput, response
User Function DB Design Data View Usage Unit of work Access Operations # Records accessed #Users Db size Metric
OLTP
OLTP
OLAP
OLAP
Data Warehouse
Data Warehouse
II
A decision support database that is maintained
A decision support database that is maintained
separately from the organization’s operational
separately from the organization’s operational
databases.
databases.
I
I
A data warehouse is a
A data warehouse is a
•
•
subject
subject
-
-
oriented,
oriented,
•
•
integrated,
integrated,
•
•
time
time
-
-
varying,
varying,
•
•
non
non
-
-
volatile
volatile
collection of data that is used primarily in
collection of data that is used primarily in
organizational decision making
Why Separate Data Warehouse?
Why Separate Data Warehouse?
I
I
Performance
Performance
•
• Op Op dbs dbs designed & tuned for known designed & tuned for known txs txs & workloads.& workloads.
•
• Complex OLAP queries would degrade Complex OLAP queries would degrade perfperf. For op . For op txstxs.. •
• Special data organization, access & implementation methods Special data organization, access & implementation methods
needed for multidimensional views & queries.
needed for multidimensional views & queries.
I
I
Function
Function
•
• Missing data: Decision support requires historical data, which Missing data: Decision support requires historical data, which opop
dbs
dbsdo not typically maintain.do not typically maintain.
•
• Data consolidation: Decision support requires consolidation Data consolidation: Decision support requires consolidation (aggregation, summarization) of data from many heterogeneous
(aggregation, summarization) of data from many heterogeneous
sources: op
sources: op dbsdbs, external sources. , external sources. •
• Data quality: Different sources typically use inconsistent dataData quality: Different sources typically use inconsistent data representations, codes, and formats which have to be reconciled.
representations, codes, and formats which have to be reconciled.
Data Warehousing Market
Data Warehousing Market
I
I
Hardware: servers, storage, clients
Hardware: servers, storage, clients
II
Warehouse
Warehouse DBMs
DBMs
II
Tools
Tools
II
Market growing from
Market growing from
•
• $2B in 1995 to $8 B in 1998 ($2B in 1995 to $8 B in 1998 (Meta Meta Group)Group)
•
• 1.5B today to $6.9B in 1999 (1.5B today to $6.9B in 1999 (Gartner Gartner Group)Group)
I
I
Systems integration & Consulting
Systems integration & Consulting
II
Already deployed in many industries: manufacturing,
Already deployed in many industries: manufacturing,
retail, financial, insurance, transportation, telecom.,
retail, financial, insurance, transportation, telecom.,
utilities, healthcare.
Data Warehousing Architecture
Data Warehousing Architecture
Monitoring & Administration Monitoring & Administration
Metadata Metadata Repository Repository Extract Extract Transform Transform Load Load Refresh Refresh
Data Marts
Data Marts
External External Sources Sources Operational Operational dbs dbs Serve Serve OLAP servers OLAP serversAnalysis
Analysis
Query/
Query/
Reporting
Reporting
Data
Data
Mining
Mining
Three
Three
-Tier Architecture
-
Tier Architecture
I
I
Warehouse database server
Warehouse database server
•
• Almost always a relational DBMS; rarely flat filesAlmost always a relational DBMS; rarely flat files
I
I
OLAP servers
OLAP servers
•
• Relational OLAP (ROLAP): extended relational DBMS that maps Relational OLAP (ROLAP): extended relational DBMS that maps operations on multidimensional data to standard relational
operations on multidimensional data to standard relational
operations.
operations.
•
• Multidimensional OLAP (MOLAP): special purpose server that Multidimensional OLAP (MOLAP): special purpose server that directly implements multidimensional data and operations.
directly implements multidimensional data and operations.
I
I
Clients
Clients
•
• Query and reporting tools.Query and reporting tools. •
• Analysis toolsAnalysis tools •
Data Warehouse
Data Warehouse
vs. Data Marts
vs
. Data Marts
II Enterprise warehouse: collects all information about subjects Enterprise warehouse: collects all information about subjects
(customers, products, sales, assets, personnel) that span the en
(customers, products, sales, assets, personnel) that span the entire tire organization.
organization.
•
• Requires extensive business modelingRequires extensive business modeling •
• May take years to design and buildMay take years to design and build
I
I Data Marts: Departmental subsets that focus on selected subjectData Marts: Departmental subsets that focus on selected subjects: s:
Marketing data mart: customer, products, sales.
Marketing data mart: customer, products, sales.
•
• Faster roll out, but complex integration in the long run.Faster roll out, but complex integration in the long run.
I
I Virtual warehouse: views over operational Virtual warehouse: views over operational dbsdbs
•
• Materialize some summary views for efficient query processingMaterialize some summary views for efficient query processing •
• Easier to buildEasier to build •
• Requisite excess capcaity Requisite excess capcaity on operational on operational db db serversservers
Design & Operational Process
Design & Operational Process
I
I
Define architecture. Do capacity planning.
Define architecture. Do capacity planning.
II
Integrate
Integrate db
db and OLAP servers, storage and client tools.
and OLAP servers, storage and client tools.
II
Design warehouse schema, views.
Design warehouse schema, views.
II
Design physical warehouse organization: data placement,
Design physical warehouse organization: data placement,
partitioning, access methods.
partitioning, access methods.
I
I
Connect sources: gateways, ODBC drivers, wrappers.
Connect sources: gateways, ODBC drivers, wrappers.
II
Design & implement scripts for data extract, load refresh.
Design & implement scripts for data extract, load refresh.
II
Define
Define metadata
metadata and populate repository.
and populate repository.
II
Design & implement end
Design & implement end-
-user applications.
user applications.
II
Roll out warehouse and applications.
Roll out warehouse and applications.
IOLAP for Decision Support
OLAP for Decision Support
I
I
Goal of OLAP is to support ad
Goal of OLAP is to support ad-
-hoc querying for the
hoc querying for the
business analyst
business analyst
I
I
Business analysts are familiar with spreadsheets
Business analysts are familiar with spreadsheets
II
Extend spreadsheet analysis model to work with
Extend spreadsheet analysis model to work with
warehouse data
warehouse data
•
• Large data setLarge data set •
• Semantically enriched to understand business terms (e.g., time, Semantically enriched to understand business terms (e.g., time, geography)
geography)
•
• Combined with reporting featuresCombined with reporting features
I
I
Multidimensional
Multidimensional
view of data is the foundation of OLAP
view of data is the foundation of OLAP
Multidimensional Data Model
Multidimensional Data Model
I
I
Database is a set of
Database is a set of
facts
facts
(points) in a multidimensional
(points) in a multidimensional
space
space
I
I
A fact has a
A fact has a
measure
measure
dimension
dimension
•
• quantity that is analyzed, e.g., sale, budgetquantity that is analyzed, e.g., sale, budget
I
I
A set of
A set of
dimensions
dimensions
on which data is analyzed
on which data is analyzed
•
• e.g. , store, product, date associated with a sale amounte.g. , store, product, date associated with a sale amount
I
I
Dimensions form a sparsely populated coordinate system
Dimensions form a sparsely populated coordinate system
II
Each dimension has a set of
Each dimension has a set of
attributes
attributes
•
• e.g., owner city and county of storee.g., owner city and county of store
I
I
Attributes of a dimension may be related by partial order
Attributes of a dimension may be related by partial order
•
• HierarchyHierarchy: e.g., street > county >city: e.g., street > county >city •
Multidimensional Data
Multidimensional Data
10 10 47 47 30 30 12 12Juice
Juice
Cola
Cola
Milk
Milk
Cream
Cream
NY NY LA LA SF SFSales
Sales
Volume
Volume
as a
as a
function
function
of time,
of time,
city and
city and
product
product
3/1 3/2 3/3 3/4
3/1 3/2 3/3 3/4
Date
Date
Operations in Multidimensional
Operations in Multidimensional
Data Model
Data Model
II
Aggregation (roll
Aggregation (
roll-
-up
up)
)
•
•
dimension reduction: e.g., total sales by city
dimension reduction: e.g., total sales by city
•
•
summarization over aggregate hierarchy: e.g., total
summarization over aggregate hierarchy: e.g., total
sales by city and year
sales by city and year -
-> total sales by region and by
> total sales by region and by
year
year
I
I
Selection (slice
Selection (
slice
) defines a subcube
) defines a
subcube
•
•
e.g., sales where city = Palo Alto and date = 1/15/96
e.g., sales where city = Palo Alto and date = 1/15/96
I
I
Navigation to detailed data (drill
Navigation to detailed data (
drill-
-down
down)
)
•
•
e.g., (sales -
e.g., (sales
-
expense) by city, top 3% of cities by
expense) by city, top 3% of cities by
average income
average income
I
A Visual Operation: Pivot (Rotate)
A Visual Operation: Pivot (Rotate)
10 10 47 47 30 30 12 12
Juice
Juice
Cola
Cola
Milk
Milk
Cream
Cream
NY NY LA LA SF SF3/1 3/2 3/3 3/4
3/1 3/2 3/3 3/4
Date
Date
Month Month RegionRegion Product ProductApproaches to OLAP Servers
Approaches to OLAP Servers
I
I
Relational OLAP (ROLAP)
Relational OLAP (ROLAP)
•
• Relational and Specialized Relational DBMS to store and manage Relational and Specialized Relational DBMS to store and manage warehouse data
warehouse data
•
• OLAP OLAP middleware middleware to support missing piecesto support missing pieces
–
– Optimize for each DBMS backendOptimize for each DBMS backend –
– Aggregation Navigation LogicAggregation Navigation Logic –
– Additional tools and servicesAdditional tools and services •
• Example: Example: MicrostrategyMicrostrategy, , MetaCube MetaCube (Informix)(Informix)
I
I
Multidimensional OLAP (MOLAP)
Multidimensional OLAP (MOLAP)
•
• ArrayArray--based storage structuresbased storage structures •
• Direct access to array data structuresDirect access to array data structures •
• Example: Example: Essbase Essbase (Arbor), (Arbor), Accumate Accumate ((KenanKenan))
I
Relational DBMS as Warehouse
Relational DBMS as Warehouse
Server
Server
II
Schema design
Schema design
I
I
Specialized scan, indexing and join techniques
Specialized scan, indexing and join techniques
I
I
Handling of aggregate views (querying and
Handling of aggregate views (querying and
materialization)
materialization)
II
Supporting query language extensions beyond
Supporting query language extensions beyond
SQL
SQL
II
Complex query processing and optimization
Complex query processing and optimization
I
I
Data partitioning and parallelism
Data partitioning and parallelism
Warehouse Database Schema
Warehouse Database Schema
I
I
ER design techniques not appropriate
ER design techniques not appropriate
I
I
Design should reflect multidimensional
Design should reflect multidimensional
view
view
•
•
Star Schema
Star Schema
•
•
Snowflake Schema
Snowflake Schema
•
Example of a Star Schema
Example of a Star Schema
Order No Order No Order Date Order Date Customer No Customer No Customer Name Customer Name Customer Customer Address Address City City SalespersonID SalespersonID SalespersonName SalespersonName City City Quota Quota OrderNO OrderNO SalespersonID SalespersonID CustomerNO CustomerNO ProdNo ProdNo DateKey DateKey CityName CityName Quantity Quantity Total Price ProductNO ProductNO ProdName ProdName ProdDescr ProdDescr Category Category CategoryDescription CategoryDescription UnitPrice UnitPrice DateKey DateKey Date Date CityName CityName State State Country Country Order Order Customer Customer Salesperson Salesperson City City Date Date Product Product Fact Table Fact Table
Star Schema
Star Schema
II
A single fact table and a single table for each dimension
A single fact table and a single table for each dimension
II
Every fact points to one
Every fact points to one tuple
tuple
in each of the dimensions
in each of the dimensions
and has additional attributes
and has additional attributes
I
I
Does not capture hierarchies directly
Does not capture hierarchies directly
II
Generated keys are used for performance and maintenance
Generated keys are used for performance and maintenance
reasons
reasons
I
I
Fact constellation: Multiple Fact tables that share many
Fact constellation: Multiple Fact tables that share many
dimension tables
dimension tables
•
• Example: Projected expense and the actual expense may share Example: Projected expense and the actual expense may share dimensional tables
Example of a Snowflake Schema
Example of a Snowflake Schema
Order No Order No Order Date Order Date Customer No Customer No Customer Name Customer Name Customer Customer Address Address City City SalespersonID SalespersonID SalespersonName SalespersonName City City Quota Quota OrderNO OrderNO SalespersonID SalespersonID CustomerNO CustomerNO ProdNo ProdNo DateKey DateKey CityName CityName Quantity Quantity Total Price ProductNO ProductNO ProdName ProdName ProdDescr ProdDescr Category Category Category Category UnitPrice UnitPrice DateKey DateKey Date Date Month Month CityName CityName State State Country Country Order Order Customer Customer Salesperson Salesperson City City Date Date Product Product Fact Table
Fact Table CategoryNameCategoryDescrCategoryNameCategoryDescr
Month
Month
Year
Year YearYear
StateName StateName Country Country Category Category State State Month Month Year Year
Snowflake Schema
Snowflake Schema
I
I
Represent dimensional hierarchy directly by
Represent dimensional hierarchy directly by
normalizing the dimension tables
normalizing the dimension tables
I
I
Easy to maintain
Easy to maintain
I
I
Saves storage, but is alleged that it reduces
Saves storage, but is alleged that it reduces
effectiveness of browsing (Kimball)
effectiveness of browsing (Kimball)
Indexing Techniques
Indexing Techniques
I
I
Exploiting indexes to reduce scanning of
Exploiting indexes to reduce scanning of
data is of crucial importance
data is of crucial importance
I
I
Bitmap Indexes
Bitmap Indexes
I
I
Join Indexes
Join Indexes
I
I
Other Issues
Other Issues
•
•
Text indexing
Text indexing
•
•
Parallelizing
Parallelizing
and sequencing of index builds
and sequencing of index builds
and incremental updates
and incremental updates
BitMap
BitMap
Indexes
Indexes
II
An alternative representation of RID
An alternative representation of RID-
-list
list
I
I
Specially advantageous for low
Specially advantageous for low-
-cardinality
cardinality
domains
domains
II
Represent each row of a table by a bit and the
Represent each row of a table by a bit and the
table as a bit vector
table as a bit vector
II
There is a distinct bit vector
There is a distinct bit vector Bv
Bv for each value v
for each value v
for the domain
for the domain
II
Example: the attribute sex has values M and F. A
Example: the attribute sex has values M and F. A
table of 100 million people needs 2 lists of 100
table of 100 million people needs 2 lists of 100
million bits
Bit Map Index
Bit Map Index
Cust Region Rating
C1 N H C2 S M C3 W L C4 W H C5 S L C6 W L C7 N H Base Table Base Table Row ID N S E W 1 1 0 0 0 2 0 1 0 0 3 0 0 0 1 4 0 0 0 1 5 0 1 0 0 6 0 0 0 1 7 1 0 0 0 Row ID H M L 1 1 0 0 2 0 1 0 3 0 0 0 4 0 0 0 5 0 1 0 6 0 0 0 7 1 0 0 Rating Index Rating Index Region Index Region Index
Customers where
Customers where
Region = W
Region = W
And
And
Rating = 1
Rating = 1
BitMap
BitMap
Indexes
Indexes
II
Comparison, join and aggregation operations are
Comparison, join and aggregation operations are
reduced to bit arithmetic with dramatic
reduced to bit arithmetic with dramatic
improvement in processing time
improvement in processing time
II
Significant reduction in space and I/O (30:1)
Significant reduction in space and I/O (30:1)
I
I
Adapted for higher cardinality domains as well.
Adapted for higher cardinality domains as well.
I
I
Compression (e.g., run
Compression (e.g., run-
-length encoding) exploited
length encoding) exploited
I
I
Products that support bitmaps: Model 204,
Products that support bitmaps: Model 204,
TargetIndex
Issues in Handling of Aggregate
Issues in Handling of Aggregate
Views
Views
II
Important component for ROLAP Servers
Important component for ROLAP Servers
I
I
Representation in the context of star schema
Representation in the context of star schema
•
•
Query Expressions
Query Expressions
•
•
Materialized Views
Materialized Views
I
I
Logic for Aggregation Navigation
Logic for Aggregation Navigation
•
•
make optimum use of materialized aggregates to
make optimum use of materialized aggregates to
answer a query
answer a query
I
I
Choice of aggregate views to materialize
Choice of aggregate views to materialize
I
I
HP Intelligent Warehouse pioneered some of the
HP Intelligent Warehouse pioneered some of the
techniques
techniques
SQL Extensions for Front End
SQL Extensions for Front End
Tools
Tools
I
I
Extended Family of Aggregate functions
Extended Family of Aggregate functions
•
• rank (top 10) and Nrank (top 10) and N--Tile (“top 30%” of all products)Tile (“top 30%” of all products)
•
• Median, mode…..Median, mode…..
I
I
Reporting Features
Reporting Features
•
• running total, cumulative totalsrunning total, cumulative totals
I
I
Results of multiple group by:
Results of multiple group by:
•
• total sales by month and total sales by producttotal sales by month and total sales by product
I
I
SQL comes in the way of sequential processing and
SQL comes in the way of sequential processing and
columnar aggregations
columnar aggregations
•
Query Processing in MOLAP
Query Processing in MOLAP
Servers
Servers
I
I
The storage model is an n-
The storage model is an n
-dimensional
dimensional
array
array
I
I
Front end multidimensional queries map to
Front end multidimensional queries map to
server capabilities in a straightforward way
server capabilities in a straightforward way
I
I
Direct Addressing abilities
Direct Addressing abilities
I
I
A straightforward array representation has
A straightforward array representation has
good indexing properties but very poor
good indexing properties but very poor
storage utilization when the data is sparse
storage utilization when the data is sparse
Query Processing in MOLAP
Query Processing in MOLAP
Servers
Servers
2
2
-
-
dimensional dense arrays indexed by B
dimensional dense arrays indexed by B
-
-
Trees
Trees
Traditional Traditional indexing indexing structure structure 2 2--dimensional dimensional dense arrays dense arrays
Population & Refreshing the
Population & Refreshing the
Warehouse
Warehouse
II
Data extraction
Data extraction
I
I
Data cleaning
Data cleaning
I
I
Data transformation
Data transformation
•
•
Convert from legacy/host format to warehouse format
Convert from legacy/host format to warehouse format
I
I
Load
Load
•
•
Sort, summarize, consolidate, compute views, check
Sort, summarize, consolidate, compute views, check
integrity, build indexes, partition
integrity, build indexes, partition
I
I
Refresh
Refresh
•
•
Propogate
Propogate
updates from sources to the warehouse
updates from sources to the warehouse
Data Cleaning
Data Cleaning
I I
Why?
Why?
•
• Data warehouse contains data that is analyzed for business Data warehouse contains data that is analyzed for business decisions
decisions
•
• More data and multiple sources could mean more errors in the datMore data and multiple sources could mean more errors in the data a
and harder to trace such errors
and harder to trace such errors
•
• Results in incorrect analysisResults in incorrect analysis
I
I
Detecting data anomalies and rectifying them early has
Detecting data anomalies and rectifying them early has
huge payoffs
huge payoffs
I
I
Important to identify tools that work together well
Important to identify tools that work together well
II
Long Term Solution
Long Term Solution
•
• Change business practices and data entry toolsChange business practices and data entry tools •
Data Cleaning Techniques
Data Cleaning Techniques
I
I
Transformation Rules
Transformation Rules
•
• Example: translate “gender” to “sex”Example: translate “gender” to “sex” •
• Products: Warehouse Manger (Prism), Extract (ETI)Products: Warehouse Manger (Prism), Extract (ETI)
I
I
Uses domain
Uses domain-
-specific knowledge to do scrubbing
specific knowledge to do scrubbing
II
Parsing and fuzzy matching
Parsing and fuzzy matching
•
• Multiple data sources (can designate a preferred source)Multiple data sources (can designate a preferred source) •
• Products: Integrity (Products: Integrity (ValityVality), ), TrillumTrillum
I
I
Discover facts that flag unusual patterns (auditing)
Discover facts that flag unusual patterns (auditing)
•
• Some dealer has never received a single complaintSome dealer has never received a single complaint •
• Products: QDB, Products: QDB, SBStarSBStar, , WizRuleWizRule
Load
Load
I
I
Issues:
Issues:
•
• huge volumes of data to be loadedhuge volumes of data to be loaded •
• small time window (usually at night) when the warehouse can be small time window (usually at night) when the warehouse can be taken off
taken off--lineline •
• When to build indexes and summary tablesWhen to build indexes and summary tables •
• allow system administrator to monitor status, cancel suspend, allow system administrator to monitor status, cancel suspend, resume load, or change load rate
resume load, or change load rate
•
• restart after failure with no loss of data integrityrestart after failure with no loss of data integrity
I
I
Techniques:
Techniques:
•
• batch load utility: sort input records on clustering key and usbatch load utility: sort input records on clustering key and use e sequential I/O; build indexes and derived tables
sequential I/O; build indexes and derived tables
•
• sequential loads still too long (~100 days for TB)sequential loads still too long (~100 days for TB) •
Refresh
Refresh
I
I
Issues:
Issues:
•
•
when to refresh
when to refresh
–
–
on every update: too expensive, only necessary if
on every update: too expensive, only necessary if
OLAP queries need current data
OLAP queries need current data
(e.g., up
(e.g., up
-
-
the
the
-
-
minute
minute
stock quotes)
stock quotes)
–
–
periodically (e.g., every 24 hours, every week) or
periodically (e.g., every 24 hours, every week) or
after “significant” events
after “significant” events
–
–
refresh policy set by administrator based on user
refresh policy set by administrator based on user
needs and traffic
needs and traffic
–
–
possibly different policies for different sources
possibly different policies for different sources
•
•
how to refresh
how to refresh
Refresh Techniques
Refresh Techniques
I
I
Full extract from base tables
Full extract from base tables
•
• read entire source table or database: expensiveread entire source table or database: expensive •
• may be the only choice for legacy databases or files.may be the only choice for legacy databases or files.
I
I
Incremental techniques (related to work on active
Incremental techniques (related to work on active dbs
dbs)
)
•
• detect & propagate changes on base tables: replication servers detect & propagate changes on base tables: replication servers (e.g., Sybase, Oracle, IBM Data Propagator)
(e.g., Sybase, Oracle, IBM Data Propagator)
–
– snapshots & triggers (Oracle)snapshots & triggers (Oracle) –
– transaction shipping (Sybase)transaction shipping (Sybase) •
• Logical correctnessLogical correctness –
– computing changes to star tablescomputing changes to star tables –
– computing changes to derived and summary tablescomputing changes to derived and summary tables –
– optimization: only significant changesoptimization: only significant changes •
Metadata
Metadata
Repository
Repository
I
I
Administrative
Administrative metadata
metadata
•
• source databases and their contentssource databases and their contents •
• gateway descriptionsgateway descriptions •
• warehouse schema, view & derived data definitionswarehouse schema, view & derived data definitions •
• dimensions, hierarchiesdimensions, hierarchies •
• prepre--defined queries and reportsdefined queries and reports •
• data mart locations and contentsdata mart locations and contents •
• data partitionsdata partitions •
• data extraction, cleansing, transformation rules, defaultsdata extraction, cleansing, transformation rules, defaults •
• data refresh and purging rulesdata refresh and purging rules •
• user profiles, user groupsuser profiles, user groups •
• security: user authorization, access controlsecurity: user authorization, access control
Metdata
Metdata Repository .. 2
Repository .. 2
II
Business data
Business data
•
•
business terms and definitions
business terms and definitions
•
•
ownership of data
ownership of data
•
•
charging policies
charging policies
I
I
operational
operational metadata
metadata
•
•
data lineage: history of migrated data and sequence of
data lineage: history of migrated data and sequence of
transformations applied
transformations applied
•
•
currency of data: active, archived, purged
currency of data: active, archived, purged
•
•
monitoring information: warehouse usage statistics,
monitoring information: warehouse usage statistics,
error reports, audit trails.
Warehouse Design Tools
Warehouse Design Tools
I
I
Creating and managing a warehouse is hard.
Creating and managing a warehouse is hard.
II
Development tools
Development tools
•
• defining & editing defining & editing metadata metadata repository contents (schemas, scripts, repository contents (schemas, scripts, rules).
rules).
•
• Queries and reportsQueries and reports •
• Shipping Shipping metadata metadata to and from RDBMS catalogue (e.g., Prism to and from RDBMS catalogue (e.g., Prism Warehouse Manager).
Warehouse Manager).
I
I
Planning & analysis tools
Planning & analysis tools
•
• impact of schema changesimpact of schema changes •
• capacity planningcapacity planning •
• refresh performance: changing refresh rates or time windowsrefresh performance: changing refresh rates or time windows
Warehouse Management Tools
Warehouse Management Tools
I
I
Monitoring and reporting tools (e.g., HP Intelligent
Monitoring and reporting tools (e.g., HP Intelligent
Warehouse Advisor)
Warehouse Advisor)
•
• which partitions, summary tables, columns are used which partitions, summary tables, columns are used •
• query execution timesquery execution times •
• for summary tables, types & frequencies of roll downsfor summary tables, types & frequencies of roll downs •
• warehouse usage over time (detect peak periods)warehouse usage over time (detect peak periods)
I
I
Systems and network management tools (e.g., HP
Systems and network management tools (e.g., HP
OpenView
OpenView, IBM
, IBM NetView
NetView,
, Tivoli
Tivoli): traffic, utilization
): traffic, utilization
I
I
Exception reporting/alerting tools 9e.g., DB2 Event
Exception reporting/alerting tools 9e.g., DB2 Event
Alerters
Alerters, Information Advantage
, Information Advantage InfoAgents
InfoAgents &
& InfoAlert
InfoAlert)
)
•
• runaway queriesrunaway queries
I
State of Commercial Practice
State of Commercial Practice
II Products and Vendors [DatamationProducts and Vendors [Datamation, May 15, 1996; R.C. , May 15, 1996; R.C. BarquinBarquin, H.A. , H.A. EdelsteinEdelstein: : Planning and
Planning and Designin gthe Designin gthe Data Data WarehousWarehous. Prentice Hall. 1997]. Prentice Hall. 1997] •
• Connectivity to sourcesConnectivity to sources –
– ApertusApertus CACA--Ingres Ingres GatewayGateway –
– Information Builders EDA/SQLInformation Builders EDA/SQL IBM Data IBM Data JionerJioner –
– Informix Enterprise GatewayInformix Enterprise Gateway Microsoft ODBCMicrosoft ODBC –
– Oracle Open ConnectOracle Open Connect Platinum Platinum InfohubInfohub –
– SAS ConnectSAS Connect Software AG EntireSoftware AG Entire –
– Sybase Enterprise ConnectSybase Enterprise Connect Trinzic InfoHubTrinzic InfoHub •
• Data extract, clean, Data extract, clean, transfomrtransfomr, refresh, refresh –
– CA-CA-Ingres ReplicatorIngres Replicator Carleton PassportCarleton Passport –
– Evolutionary Tech Inc. ETI-Evolutionary Tech Inc. ETI-ExtractExtract HarteHarte--Hanks TrilliumHanks Trillium –
– IBM Data Joiner, Data PropagatorIBM Data Joiner, Data Propagator Oracle 7Oracle 7 –
– Platinum InfoRefinerPlatinum InfoRefiner, , InfroPumpInfroPump Praxis Praxis OmniReplicatorOmniReplicator –
– Prism Warehouse ManagerPrism Warehouse Manager Redbrick TMURedbrick TMU –
– SAS AccessSAS Access Software AG Software AG SouorcepointSouorcepoint –
– Sybase Replication ServerSybase Replication Server TrinzicTrinzicInfoPumpInfoPump
State of Commercial Practice..2
State of Commercial Practice..2
II Multidimensional Database EnginesMultidimensional Database Engines
Arbor
Arbor EssbaseEssbase ComshareComshareCommander OLAPCommander OLAP Oracle IRI Express
Oracle IRI Express SAS SystemSAS System
I
I Warehouse Data ServersWarehouse Data Servers
•
• CA-CA-IngresIngres IBM DB2IBM DB2 •
• Information Builders FocusInformation Builders Focus InformixInformix •
• OracleOracle Praxiz Model 204Praxiz Model 204 •
• RedbrickRedbrick Software AG ADABASSoftware AG ADABAS •
• Sybase MPPSybase MPP TandemTandem •
• TerdataTerdata
I
I ROLAP ServersROLAP Servers
•
• HP Intelligent WarehouseHP Intelligent Warehouse Information Advantage AsxysInformation Advantage Asxys •
State of Commercial Practice..3
State of Commercial Practice..3
II Query/Reporting EnvironmentsQuery/Reporting Environments
Brio
Brio/Query/Query Business ObjectsBusiness Objects Cognos
Cognos ImpromptuImpromptu CA Visual ExpressCA Visual Express IBM
IBM DataGuideDataGuide Information Builders Focus SixInformation Builders Focus Six Informix
Informix ViewPointViewPoint Platinum Forest & TreesPlatinum Forest & Trees SAS Access
SAS Access Software AG Software AG EsperantEsperant I
I Multidimensional AnalysisMultidimensional Analysis
Andydne
Andydne PabloPablo Arbor Arbor EssbaseEssbaseAnalysis ServerAnalysis Server
Business Objects
Business Objects Cognos PowerPlayCognos PowerPlay Dimensional Insight Cross Target
Dimensional Insight Cross Target Holistic Systems HOLOSHolistic Systems HOLOS Information Advantage Decision Suite
Information Advantage Decision Suite IQ Software IQ/VisionIQ Software IQ/Vision Kenan
Kenan System System AcumateAcumate Lotus 123Lotus 123 Microsoft Excel
Microsoft Excel Microstrategy Microstrategy DSSDSS
Pilot Lightship
Pilot Lightship Platinum Forest & TreesPlatinum Forest & Trees Prodea
Prodea BeaconBeacon SAS OLAP ++SAS OLAP ++ Stanford Technology Group
Stanford Technology Group MetacubeMetacube
State of Commercial Practice..4
State of Commercial Practice..4
I
I Metadata Metadata ManagementManagement
•
• HP Intelligent WarehouseHP Intelligent Warehouse IBM Data GuideIBM Data Guide •
• Platinum Repository Platinum Repository Prism Directory ManagerPrism Directory Manager I
I System ManagementSystem Management
•
• CA UnicenterCA Unicenter HP OpenViewHP OpenView •
• IBM DataHubIBM DataHub, , NetViewNetView Information Builder Site AnalyzerInformation Builder Site Analyzer •
• Prism Warehouse ManagerPrism Warehouse Manager SAS CPESAS CPE •
• TivoliTivoli Software AG Source PointSoftware AG Source Point •
• Redbrick Enterprise Control and CoordinationRedbrick Enterprise Control and Coordination I
I Process Management Process Management
•
• At& T TOPENDAt& T TOPEND HP Intelligent WarehouseHP Intelligent Warehouse •
• IBM FlowMarkIBM FlowMark Platinum RepositoryPlatinum Repository •
• Prism Warehouse ManagerPrism Warehouse Manager Software AG Source PointSoftware AG Source Point I
Research Issues
Research Issues
II Data cleaningData cleaning
•
• focus on data inconsistencies, not schema differencesfocus on data inconsistencies, not schema differences •
• data mining techniquesdata mining techniques
I
I Physical DesignPhysical Design
•
• design of summary tables, partitions, indexesdesign of summary tables, partitions, indexes •
• tradeoffs in use of different indexestradeoffs in use of different indexes
I
I Query processingQuery processing
•
• selecting appropriate summary tablesselecting appropriate summary tables •
• dynamic optimization with feedbackdynamic optimization with feedback •
• acid test for query optimization: cost estimation, use of transformations, acid test for query optimization: cost estimation, use of transformations, search strategies
search strategies •
• partitioning query processing between OLAP server and backend server.partitioning query processing between OLAP server and backend server.
Research Issues .. 2
Research Issues .. 2
II