DW Source Integration, Tools,
and Architecture
Original slides were written by Torben Bach Pedersen
Aalborg University 2007 - DWML course 2
Overview
•
DW Front End Tools
•
Source Integration
•
DW architecture
End User Applications (EUA)
•
The business impact of the DW!
•
Canned reports
End user application templates Provide answers to common questions
Can be used as (quality-assured) building blocks for other reports
•
Two extremes
Ad hoc strategic analysis, power users, DIY query tools
Fixed operational analysis, report consumers, operational reporting
•
EUA fills the gap
“Tactical” analysis, push-button knowledge workers
EUA Concepts
•
Templates
Layout/structure + parameters
Compare sales per product in <area> for <period1> and <period2>
•
Parameters - chosen at run-time
Come from any level of the given dimension – drill-down Time (All time, 2002, 2002 4Q, 2002 Dec, 2002 Dec 1) possible Many different
•
Identify report candidates
Produce a list of candidates
•
Consolidate candidate list
Aalborg University 2007 - DWML course 5
What Templates to Choose?
•
“Analytical Cycle” Steps (repeats)
1) How’s business? – current performance 2) What are the trends? – performance over time
3) What’s unusual? – quick identification of exceptions (+/-) 4) What is driving the exceptions? – find causes for exceptions 5) What if …? – play around with parameters and see effect 6) Make a business decision – small as well as big decisions 7) Implement the decision – feed analysis results into op. systems
•
Prioritize template list
Rank or group templates – implement 15 most important at first
Aalborg University 2007 - DWML course 6
Overview
•
DW Front End Tools
•
Source Integration
•
DW architecture
Data Integration Research Projects
•
Focus on source integration and update propagation
•
Wrapper: convert source data into a standard format
•
Information Manifold
Sources: databases, SGML docs, unstructured files,… Relational integration data model
•
TSIMMIS
Wrapper/mediator
Semi-structured OEM integration data model
•
Squirrel
Powerful “integration mediator”
•
WHIPS
Wrapper/mediator
Relational integration data model
Views on DW Metadata
• Most DW projects: DW architecture as a “stepwise flow” of information from source to analyst
• No conceptual domain model used for integration
Some questions cannot be answered • DWQ project: extended metamodel to
Aalborg University 2007 - DWML course 9
Using DW Metadata in the Enterprise
(1)
(2)
(3) (4) (5)
• Analyst wants to analyze data
Gather data from operational departments through OLTP Question travels through (1)-(5) • Traditional DW (previous slide)
only describes step (3)-(4)
Cannot solve problems like “why can’t I answer quest. X?” • Conceptual relationships between
enterprise model, operational models + DW must be captured
Everything is a view on the enterprise model ! (“local as view”) – unlike previous slide
Aalborg University 2007 - DWML course 10
Analyst: “Why can’t I answer question X?”
•
Possible reasons
Certain measures not included in fact table
Granularity of facts too coarse
Particular dimensions not in DW
Descriptive attributes missing from dimensions
Meaning of attributes/measures deviate from the
analyst’s expectation
……
DWQ Metadata
• Three metadata perspectivesmust be captured
Conceptual (enterprise) Logical (data model) Physical (data flow) • Framework instantiated by
conceptual, logical, and physical information models
• DW quality heavily depends on DW processes rather than schemas
• A process meta model is needed to capture process definitions, and the relationships to DW quality
Source integration practice
•
Focus on information integration in
databases
(schema and data)
•
Two main approaches
Constructing integrated enterprise model Focus on mappingsbetween sources and DW
•
Tools for DW management
Schema integration Metadata management Based on modeling tools
•
Tools for data integration
Mapping specification ETL tools – like last lecture
Aalborg University 2007 - DWML course 13
Schema Integration
• Producing one global schema (one-shot or incremental) • Pre-integration
Analyzing and annotating source schemata
Semantic enrichment of schema, often in richer data model • Schema comparison
Determine correlations/conflicts among schema concepts Heterogeneity conflicts – different source data models Naming conflicts – homonyms and synonyms Semantic conflicts –different abstraction levels Structural – different constructs
• Schema conforming
Conform/align schemas to make them compatible Typically semi-automatic process
• Schema merging and restructuring
“Superimpose” conformed schemas
Quality: completeness, correctness, minimality, understandability
Aalborg University 2007 - DWML course 14
Virtual Data Integration
• Only data definitionis integratedData only in sources, queries on views, queries shipped to sources Not suited for DW?
• Carnot
Individual schemata mapped onto rich GCL ontology (1. order logic) Articulation axioms specify mappings, queries mapped to GCL • SIMS
Creates common class-based domain model to describe sources Sources are dynamicallychosen and integrated at query time Query reformulation, access planning, optimization, execution • Information Manifold
Relational world view + information source description + correspondences Metamodel enriched using description logic/Datalog rules
Datalog queries, optimized by choosing ”minimal” sources • TSIMMIS
Wrapperswrap sources using semi-structured OEM model
Mediator performs its own integration – no global integration (global as view)
Materialized Data Integration
•
Views on source data are
materialized
in integrated DB
•
Squirrel
Integration mediators incrementally maintain materialized views Cooperation of sources required
•
WHIPS
Relational SPJ + aggregation views specified in view tree View manager computes view and handles updates Integrator ensures view maintainability
Global query processorqueries sources using wrappers/mediators
•
In combination with virtual integration?
DWQ Source Integration
•
Current DW tools cannot fully support DW quality
No support for validation of interschema assertions and other
specified relationships, i.e., the DW design process
•
Conceptual perspective
Domain model = enterprise model + source models
Consolidated and reconciled description of important concepts ◆ Not all enterprise data captured (at first, incremental approach) Logic-based formalism allows reasoning over metadata Intermodel assertions capture interdependencies
•
Logical perspective
Source schemata + DW schema in logical data model (relational) Defined as queries over the corresponding conceptual component
•
Physical perspective
Aalborg University 2007 - DWML course 17
DWQ Source Integration Architecture
Note explicit mappings!
Aalborg University 2007 - DWML course 18
DWQ Source Integration Methodology
•
Source-driven integration
Enterprise and source model construction Source model integration (into the domain model) Source and DW schema specification (+ mappings) Data integration and reconciliation
Quality analysis steps in all phases above
•
Client-driven integration
New client query considered
Reasoning determines whether query can be answered by
materialized views already in DW
◆ Query containment reasoning
If DW not sufficient, materialize new concepts in domain model? Otherwise, new sources must be added using source-driven integr.
Overview
•
DW Front End Tools
•
Source Integration
•
DW architecture
Technical Architecture
Design
Technical Architecture
Design
Product Selection& Installation
Product Selection& Installation
End-User Application Specification
End-User Application Specification
End-User Application Development
End-User Application Development
Business Requirements
Definition
Business Requirements
Definition Deployment
Deployment Maintenance
and Growth
Maintenance and Growth
Project Management
Project Management
Dimensional Modeling
Dimensional
Modeling PhysicalDesign
Physical Design
Data Staging Design & Development
Data Staging Design & Development
Project Planning
Project Planning
Aalborg University 2007 - DWML course 21 DB DB DB DB DB Appl. Appl. Appl. ETL DW DM DM DM OLAP Visua-lization Appl. Appl. Data mining Data Marts Data Warehouse Existing databases
and systems (OLTP) New databases and systems (OLAP)
Technical DW Architecture
…
How to organize DW and DMs?
Clients
Aalborg University 2007 - DWML course 22
Central DW
• All data in one, central DW
• All client queries directly on the central DW
• Pros
Simplicity Easy to manage • Cons
Bad performance due to no redundancy/ workload distribution
Central
DW
Source
Source
ClientsFederated DW
• Data stored in separate data marts, aimed at special departments
• Logical DW (i.e., virtual)
• Data marts contain detail data
• Pros
Performance due to distribution • Cons More complex
Logical
DW
Source
Source
ClientsFinance
mart
Mrktng
mart
Distr.
mart
Tiered Architecture
• Central DW is materialized• Data is distributed to data marts in one or more tiers
• Only aggregated data in cube tiers
• Data is aggregated/reduced as it moves through tiers
• Pros
Best performance due to redundancy+distribution • Cons
Most complex Hard to manage
2000 2001 Aalborg Copenhagen Milk Bread 123 127 57 45 56 67 211 2000 2001 Aalborg Copenhagen Milk Bread 123 127 57 45 56 67 211 2000 2001 Aalborg Copenhagen Milk Bread 123 127 57 45 56 67 211
Central
DW
2000 2001 Aalborg Copenhagen Milk Bread 123 127 57 45 56 67 211 2000 2001 Aalborg Copenhagen Milk Bread 123 127 57 45 56 67 211Aalborg University 2007 - DWML course 25
Coordination w. Development Strategy
•
Different development strategies pose different demands
to the architecture elements
•
Example: Kimball Dimensional Modeling
Centralized design of (conforming) dimensions First, design of a single-source data mart Later, design of multi-source data marts
Integration of existing data marts into new data marts The DW is just the union of the marts it is composed of Entails top-down (“Bus Architecture”) and bottom-up elements
•
Consequences
No initial design of DW, from which data marts are extracted Data is extracted directly from sources to data marts Allows distribution of data marts and computation on them
Aalborg University 2007 - DWML course 26
Operational Data Store (ODS)
DB DB DB DB DB Appl.
Appl.
Appl.
ETL
DW
DM
DM
DM
OLAP
Visua-lization Appl.
Appl.
Data mining Existing databases
and systems (OLTP)
…
ODS New DB
Operational Data Store I
a “
subject oriented, integrated, volatile, current valued data
store containing only corporate detailed data
”
(Inmon et al.)
•
A database which integrates and accumulates operational
data in a subject-oriented structure
•
Not dimensional, but ordinary relational
•
An extra level between operational systems and
dimensional structures
•
Two benefits sought
Integration of operational systems Basis for data warehouse
Operational Data Store II
•
ODS - pros
More modeling choices
◆ The dimensional “straightjacket” can force sub-optimal design decisions hiding the true semantics of data
◆ No need to choose a granularity, and no need to exclude data ◆ In summary, no need to make design decisions that cannot be
changed subsequently This means extra flexibility
•
ODS – cons
Not feasible to do analysis directly on ODS extra complexity
Aalborg University 2007 - DWML course 29
MS Analysis Services
•
Cheap
•
Easy to use
•
(R/M/H)OLAP technology
Data placement as desired
•
Intelligent pre-aggregation
•
Server and client parts
Reporting Services a separate tool
•
Built-in data mining
Decision trees Clustering
•
MS OLE DB for OLAP interface
Aalborg University 2007 - DWML course 30
IBM DB2 OLAP Server
•
“Light” version of Hyperion Essbase (OLAP market leader)
Extra product “on top of” DB2
•
(R/M/H)OLAP
Data in DB2 or in multidimensional structures
•
Interfaces
Hyperion Essbase API OLE DB for OLAP (promised)
•
DB2 can also handle aggregates
Automatic summary tables Used by DB2 optimizer
Automatic maintenance by DB2
Oracle 10g BI
•
Based on Express OLAP product
On the market since 1970!
•
(R/M/H)OLAP
Flexible data placement
Integrates ROLAP strategy and Express OLAP
•
Total integration with Oracle 10g RDBMS
Storage, security, management,… Best integration compared to MS and IBM
•
Add-on data mining (10g Data Mining)
Associations, classification, prediction, clustering
Architecture Alternatives
•
Cubes are smart
Intuitive model Better overview
Better suited for data analysis
•
But
logical
cubes suffice
Implementation hidden from user
•
Architecture alternatives
MS, IBM, Oracle
Virtual cubes, physical cubes ROLAP, MOLAP
Separate relational DW, cubes directly on source data Client tools
Aalborg University 2007 - DWML course 33
MS vs. IBM vs. Oracle
•
All
Good scalability Good analysis facilities
Flexible storage (MOLAP, ROLAP, HOLAP) Incremental update
Many client tools
•
MS Analysis Server
Built-in mining + good integration with MS SQL Server
•
DB2 OLAP Server
Good integration with DB2
•
Oracle
Best RDBMS/MOLAP integration of the three
•
All three products are good
Dependent on the other choices + existing technical architecture
Aalborg University 2007 - DWML course 34
Virtual vs. Physical Cubes
•
Virtual cubes
Logical cube specification directly on source data ROLAP implementation without aggregates + flexible, design can be changed quickly - performance, constant load on source DB
•
Physical cubes
Data for cube extracted and stored on OLAP server Several implementation choices possible
+ good performance, only source DB load at creation/update - harder to change design
MOLAP vs. ROLAP
•
MOLAP
Data in specialized data structure, optimized for OLAP + best performance, least space consumption
- changing design requires rebuilding, scalability at detail level?,
detail data stored several times
•
ROLAP
Data in RDBMS
+ more flexible change of design, scalable for detail data - not as good performance, larger space consumption
•
HOLAP
Detail data in RDBMS (can be source DB) Aggregates in multidimensional structure
+ good performance for higher-level queries, detail data only stored
once
- handling design changes, operational complexity
Separate Data Warehouse?
•
Separate DW
Integration of source data in DW Cubes built from DW
Sometimes the only solution
+ better integration and cleansing, less load on existing servers - larger complexity, design changes, updating DW
•
Cubes directly on source data
Cubes built directly from source data Cannot handle all cases
+ less complexity, easier to change design, no update of DW - cannot handle all forms of integration and cleansing, more load
Aalborg University 2007 - DWML course 37
Choosing Client Tools
•
Many OLAP clients on the market, e.g.,
Hyperion, Targit, Oracle MS Reporting Services
•
Client and server communicates via an API
•
MS OLE DB for OLAP
De facto standard
Supported by almost all client tools
•
Hyperion Essbase API
Supported by many client tools
•
Some criteria
Functionality (web distribution, analysis, reporting, …) Support
Price
Aalborg University 2007 - DWML course 38
Architecture Alternatives - Conclusion
•
Architecture alternatives, their pros and cons
•
No simple general choices
•
Choices dependent on the concrete situation
Look at books
Look at requirements specs Look at the latest products Think about prototyping
Summary
•
DW Front End Tools
•
Source Integration
•
DW architecture
Mini Project
•
New subtask
Build a few reports in Reporting Services to answer important
business questions you proposed in part (1a)
Discuss the architecture of your DW system Discuss source integration in your system