F4: DW Architecture and Lifecycle
Erik Perjons, DSV, SU/KTH perjons@dsv.su.se
The data warehouse architecture
Query/Reporting Extract Transform Load Serve External sources Data warehouse Data marts Analysis/OLAP Falö aöldf flaöd aklöd falö alksdf Data mining
Productt Time1 Value1 Value11 Product2 Time2 Value2 Value21 Product3 Time3 Value3 Value31 Product4 Time4 Value4 Value41
Operational source systems
Data access tools (RK) End user applications Business Intelligence tools Data staging
area (RK) Back end tools
Data presentation area (RK)
”The data warehouse” Presentation (OLAP) servers Operational source
systems (RK) Legacy systems OLTP/TP systems
Operational source systems
characteristics:
• the source data often in OLTP (Online Transaction Processing) systems, also called TPS (Transaction Processing Systems) • high level of performance and availability
• often one-record-at-a time queries
• already occupied by the normal operations of the organisation OLTP vs. DSS (Decision Support Systems)
OLTP vs. OLAP (Online analytical processing)
Operational Source Systems
Operational source systems
More operational source systems
characteristics:
• a OLTP system may be reliable and consistent, but there are often inconsistencies between different OLTP systems • different types of data format and data structures in
different OLTP systems AND DIFFERENT SEMANTICS
Operational Source Systems
Operational source systems
Operational Source Systems
Kimball et al´s assumptions (p 7):
•Source systems are not queried in the broad
and unexpected ways
•Maintain little historical data
•Each source systems is often a natural
stovepipe application
Operational source systems
DW architecture: Data staging area
Query/Reporting Extract Transform Load Serve External sources Data warehouse Data marts Analysis/OLAP Falö aöldf flaöd aklöd falö alksdf Data mining
Productt Time1 Value1 Value11 Product2Time2 Value2 Value21 Product3Time3 Value3 Value31 Product4Time4 Value4 Value41
Operational source systems
Data access tools
Data staging area Data presentation area Operational
The Data Staging Area
Often the most complex part in
the architecture, and involves...
• Extraction (E)
• Transformation (T)
• Load (L)
• indexing
ETL-tools can be used
Scripts for extraction, transformation and load are
implemented
Extract Transform Load
Extraction
means reading and understanding the source data and
copying the data needed for the data warehouse into
staging area for further manipulation, i.e.
transformation
Data staging area
Extract Transform Load
Transformation involves…
• data conversion/transformation
(specify transformation rules to convert to a common data format and common terms/semantics)
• data cleaning/cleansing
– data scrubbing (use domain-specific knowledge (e.g postal adresses) to check the data)
– data auditing (discover suspicious pattern, discover violation of stated rules)
• combining data from multiple sources • assigning warehouse (surrogate) keys • data aggregation
Data staging area
Extract Transform Load
A debate questions:
Should the data in the data staging area be stored in a
3NF relational database and loaded into the presentation
area for querying and reporting?
Kimball (p 8-9): a 3NF relational database in data staging area requires more time and resources for development, periodic loading and updating and more capacity of storing the multiple copies of the data
Data staging area
Extract Transform Load
Flat file C DB2 table(s) D’ DB2Connect
Staging area for checking, analysing, cleaning, complementing etc transaction data SQL, C++ ?? DB2 Preliminary target DW E Fees (manually adjusted to individual agreements) I Start balance H Customer data G Customer data F
Three star/join schemas comprising altogether 8 tables Fact tables:
- transactions (10 attributes) - fees (7 attributes) - start balance (4 attributes) Dimensional tables: - time (7 attr) - customer (> 40 attr) - company (> 90 attr) - product (13 attr) - ”Service charged” (2 attr) Various source files
DB2 Final target DW E’ +aggregation (new program)
E complemented with some aggregated tables
Some cleansing and scrubbing may be needed here
A Real World Example
DW architecture: Data presentation area
Query/Reporting Extract Transform Load Serve External sources Data warehouse Data marts Analysis/OLAP Falö aöldf flaöd aklöd falö alksdf Data mining
Productt Time1 Value1 Value11 Product2Time2 Value2 Value21 Product3Time3 Value3 Value31 Product4Time4 Value4 Value41
Operational source systems
Data access tools Data staging area Data presentation area
Operational source systems
Data presentation area
Data warehouse Data marts OLAP servers• What is OLAP?
• Dimensional modelling vs. 3 NF modelling
• Data Marts
• ROLAP/MOLAP servers
• Acronym for “On-line analytical processing”
• A decision support system (DSS) that support ad-hoc querying, i.e. enables managers and analysts to interactively manipulate data. The idea is to allow the users to easy and quickly manipulate and visualise the data through multidimensional views, i.e. different perspectives.
What is OLAP?
q ua rt er offic e product Service Facts Office QuarterDate/
Key Month Quarter Year
991011 9910 4 - 99 99
991012 9910 4 - 99 99
Key Customer Address Region
Income group
C210 Anna N Stockholm Stockholm B
C211 Lars S Malmö Skåne B
C212 Erik P Rättvik Dalarna C
C213 Danny B Stockholm Stockholm A
C214 Åsa S Stockholm Stockholm A
Key Service Servicegroup
S1 Local call Group A
S2 Intern. call Group A
S3 SMS Group B
S4 WAP Group C
Key Seller Office
F11 Anders C Sundsvall
F12 Lisa B Sundsvall
F13 Janis B Kista
Service Dimension Time Dimension
Sales Dimension Customer Dimension Fact table - Transactions
Sum Numberof calls
C210 S1 F11 991011 25:00 3 C210 S3 F11 991011 05:00 1 C212 S2 F13 991011 89:00 1 C213 S1 F13 991011 12:00 1 C214 S4 F13 991012 08:00 1
Dimensional modelling
1 1 1 1 0..* 0..* 0..* 0..* Date/Key Month Quarter Year
991011 9910 4 - 99 99
991012 9910 4 - 99 99
Key Customer Address Region
Income group
C210 Anna N Stockholm Stockholm B
C211 Lars S Malmö Skåne B
C212 Erik P Rättvik Dalarna C
C213 Danny B Stockholm Stockholm A
C214 Åsa S Stockholm Stockholm A
Key Service Servicegroup
S1 Local call Group A
S2 Intern. call Group A
S3 SMS Group B
S4 WAP Group C
Key Seller Office
F11 Anders C Sundsvall
F12 Lisa B Sundsvall
F13 Janis B Kista
Service Dimension Time Dimension
Sales Dimension Customer Dimension Fact table - Transactions
Sum Numberof calls
C210 S1 F11 991011 25:00 3 C210 S3 F11 991011 05:00 1 C212 S2 F13 991011 89:00 1 C213 S1 F13 991011 12:00 1 C214 S4 F13 991012 08:00 1 Query: For how much
did customers in Sthlm use service “Local call” in october 1999? Σ=37:00
Key difference between 3NF and Dimensional modelling:
- the degree of normalisation
3 NF modelling
- a logical design technique to eliminate data redundancy to keep consistency and storage efficiency, and makes transaction simple and deterministic
- ER models for enterprise are usually complex, e.g. they often have hundreds, or even thousands, of entities/tables
Dimensional modelling
- a logical design technique that present data in a intuitive, i.e. easier to navigate for the user
- allow high performance access/queries (the complexity of 3NF models overwhelms the database systems optimizer, which means bad performance)
- aims at model decision support data
3 NF modelling vs. Dimensional modelling
[Kimball et al, p 10-11]
Kimball et al (p.10-12 and 396)
“we refer to the presentation area as a series of integrated data marts”
“a data mart is a flexible set of data, ideally based on the most atomic (granular) data possible to extract from operational source, and presented in a symmetric (dimensional) model that is resilient when faced with unexpected user queries”
“in its most simplistic form a data mart represent data from a single business process” (business process=purchase order, store inventory and so on)
Data marts
Service Calls Office Quarter Service Subscription orders Office Quarter Service Calls Office Quarter Subscription ordersThe data warehouse bus architecture
Orders Production Dimensions Time Sales Rep Customer Promotion Product Plant Distr. Center [Kimball et al, p 78-79] A data mart A data mart
• A dimensional model for a large data warehouse
consists of between 10 and 25 similar-looking data
marts. Each data marts will have 5 to 15 dimensional
tables
.
Data marts
Kimball et al’s strong opinions (p.10-12)
• all data in the presentation area should be presented,
stored and accesses in dimensional models
• the data marts must contain detailed, atomic data (it
is unacceptable that the detailed data should be
locked up in 3 NF models for drill-down)
• the data marts dimensions should be conformed for
drill-across techniques, which tie the data marts
together in the data warehouse bus architecture
More about data marts:
• far smaller data volumes, fewer data sources
• easier data cleaning process, faster roll-out
• allows a “piecemeal” approach to some of the enormous
integration problems involved in creating an enterprise
wide data model, but complex integration in the long
term
The Data marts
Dependent vs. Independent Data marts
Data warehouse
Dependent Data marts Data warehouse
Extended Relational DBMS (ROLAP servers)
– data stored in RDB – star-join schemas – support SQL extensions – index structures
Multidimensional DBMS (MOLAP servers)
– data stored in arrays (n-dimensional array) – direct access to array data structure – excellent indexing properties
– poor storage utilisation, especially when the data is sparse.
The presentation/OLAP servers
Data warehouse
Data marts
OLAP servers
• Index structures (bit map indexes, join indexes)
• SQL extensions (operators like Cube, Crossjoin)
• Materialised views (pre-aggregations)
More about presentation servers
What is characteristics regarding data warehouse, according to Chaudhiri&Dayal :
DW architechture: Metadata repository
Query/Reporting Extract Transform Load Refresh Operational source systems Serve External sources Data warehouse Metadata repositoryMonitoring & Administration
Data marts OLAP servers Analysis Falö aöldf flaöd aklöd falö alksdf Data mining
Productt Time1 Value1 Value11 Product2Time2 Value2 Value21 Product3Time3 Value3 Value31 Product4Time4 Value4 Value41
Data access tools Data staging area Data presentation area
Operational source systems
What is metadata?
Main functions are to give...
• data definitions
• the origin of data
• the structure of data
• rules for the selection and transfer of data
• qualitative and quantitative data about data
Contained in metadata repository
Æ
The metadata repository
An integrated complete source of metadata
• is at the heart of the data warehouse architecture • supports the information needs of...
– system developers – data administrators – system administrators – users
– applications on the data warehouse
• very complex data structure • must contain full version history • must always be up to date
Metadata life cycle activities
• Collection
• identify and capture metadata in a central
repository
• Maintenance
• establish processes to synchronise metadata with
the changing data structure
• Deployment
• provide metadata to users in the right form and
with the right tools
Different types of metadata
•
Administrative metadata
(includes all information necessary for setting up and using a DW, e.g. Information about source databases, dw schemas,
dimensions, hierachies, predefined queries, physical
organisation, rules and script for extraction, transformation and load, back-end and front end tools)
•
Business metadata
(business terms and definitions, ownership of data)
•
Operational metadata
(information collected during the operations of the DW, e. g. usage statistics, error reports)
DW architecture: End user applications
Query/Reporting Extract Transform Load Refresh Operational DBs Serve External sources Data warehouse Metadata repository Monitoring & Administration
Data marts OLAP servers Analysis Falö aöldf flaöd aklöd falö alksdf Data mining Productt Time1 Value1 Value11 Product2Time2 Value2 Value21 Product3Time3 Value3 Value31 Product4Time4 Value4 Value41
Data access tools
Data staging area Data presentation area Operational
Query/Reporting Analysis Falö aöldf flaöd aklöd falö alksdf Data mining
Productt Time1 Value1 Value11 Product2Time2 Value2 Value21 Product3Time3 Value3 Value31 Product4Time4 Value4 Value41
• OLAP tools, BI apps, DSS
• Query/Reporting tools
• Data mining
End user applications
product product group
mounth
quarter officeregion
Product Group Region First Quarter - 1997
Group A ABC 1245
Group A XYZ 34534
Group B ABC 45543
Group B XYZ 34533
Column headers
(join constraints) Column header(application constraint) Answer set representing
focal event
Row headers
Graphical output of OLAP tool
• Drill-down - decreasing the level of aggregation
• Drill-up/Roll-up/Consolidation - increasing the level of aggregation • Drill-across - move between different star-join schemas using
conformed dimensions and joins
• Slicing and dicing – ability to look at the database from different views, e.g. one slice shows all sales of product type within regions, another slice shows all sales by sales channel within each product type
• Pivoting - e.g. change columns to rows, rows to columns • Ranking - sorting
“Think of an OLAP data structure as a Rubik´s Cube of data that users can twist and twirl in different ways to work through what-if an what-happend scenarios” [Lee Thé]
Strategic
Who: strategic leaders
What: formulate strategy and monitor corporate performance Examples: Balance scorecard, Strategic Planning
Operational
Who: operational managers
What: execution of strategy againts objectives Examples: Budgeting, Sales forcasting
Analytical
Who: analysts, knowledge worker, controller What: ad-hoc analysis
Examples: Financial and Sales Analysis, Customer Segmentation, Clickstream analysis
Business Intelligence (BI) apps
• Complexity of integration
– Hidden problems with source systems – Data homogenisation
– Underestimation of resources for data loading
• Required data not captured
• High maintenance
• Long duration projects
• Why not integrating the legacy applications
(OLTP systems) instead?
Operational Data Store (ODS)
No singel universal defintion...
ODS definition 1: Implemented to deliver operational reporting, especially when neither the legacy nor the modern OLTP systems provide adequate operational reports – fixed queries and for tactical decision making
ODS definition 2: Built to support real-time interactions, especially in Customer Relationsship Management applications – the tradtional data warehouse typically is not in a position to support the demand for near-real-time data
OMG’s standards
Model Metamodel Meta metamodel Instances Invoice no 34 Helen Nagy M0 layer M1 layer M2 layer M3 layerCWM Metamodel
Meta Object Facility (MOF)
Common Warehouse Metamodel (CWM)
Data Source Data Source Data Source OperationalData Store ETL
Data Warehouse Data Mart Analysis Reporting Visualization Data Mining Data Mart Data Mart
The collection of metamodels by CWM can be
used to model the whole data warehousing
environment i.e from data sources to end use
analysis, and data warehouse management
Common Warehouse Metamodel
• Common Warehouse Metamodel (CWM) is a
language specifically design to model data
warehousing and data mining applications, i.e.
integrating data warehousing and business
analysis (business intelligence) tools
• CWM has a lot in common with the UML metamodel
but has a number of special metamodels
(metaclasses), e.g modelling relational databases,
multidimensional databases, OLAP, schema
transformations, XML
X Check material on stock Capture ordered items Check material on stock Ordered item [captured] Ordered item captured Material on stock [checked] Capture ordered items Material is on stock Material is not on stock Order recieved State Precedes Succedes Activity Precedes/ Succedes Succedes Precedes State Event consists of Transform-ation consists of Function Event Succedes Precedes
Why
meta-modelling?
Meta metamodel level or Reference model Meta-model level Model level [Rosemann, Green, 2002]CWM packages
Management Warehouse Process Warehouse Operation
Analysis Transformation OLAP Data Mining VisualizationInformation NomenclatureBusiness
Resource Relational Record Multi-Dimensional XML
Foundation InformationBusiness Data Types Expressions Keys andIndexes DeploymentSoftware Type Mapping Object
Model Core Behavioral Relationships Instance
CWM packages layers
[Poole et al, p.36-40 (2002)]
•
Object layer - base metamodels/packages, which are (re)used by the other metamodels/packages•
Foundation layer - extends the object layer with services required which are (re)used by the other metamodels/packages, e.g “unique key” in the Key Indexes metamodel/package is used by relational databases, OO-databases and record-oriented• Resource layer - defines metamodels/packages for various types of data resouces
•
Analysis layer - analysis-oriented metadata•
Management layer - describing the data warehousing process as a wholeCWM packages relations
Relational package ModelElement Class Classifier Core package Datatype package ClassifierFea ture Element Namespace Feature StructuralFeature Expression ProcedureExpression QueryExpression Attribute ColumnSet NamedColumnSet Table View QueryColumnSet ColumnCWM classifyer equality
Object Package Classifier
(Klass) (Attribut)Feature
Schema Table Column
Relational
Record Record
file RecordDef Field
Multi Dimensional
XML
Schema Dimenson Dimension
ed Objct Schema Element Type Attribute
More about CWM
Common Representation <<metamodels>> CWM Packages Tool Y Metamodel Tool XBusiness Dimensional Lifecycle
Technical Architecture Design Technical Architecture Design Product Selection & Installation Product Selection & Installation End-User Application Specification End-User Application Specification End-User Application Development End-User Application Development Project Planning Project Planning Business Requirement Definition Business Requirement Definition DeploymentDeployment Maintenanceand Growth Maintenance and Growth Project Management Project Management Dimensional Modeling Dimensional
Modeling PhysicalDesign
Physical Design Data Staging Design & Development Data Staging Design & Development
The Data Warehouse Architecture
Framework
Level of ARCHITECTURE AREA
detail Data Back room Front room Infrastructure
Business reqs and audit Architecture models and documents Detailed models and specs Implemen-tation Info needed for better decisions
Enterprise models How get, transform, make available data Capabilities needed to get and
transform data Major data stores
User’s needs Major classes of analyses Priorities Where is data coming from Calc and storage
reqs HW/SW capabilities needed vs what we have Major business issues. How measure How analyse Focal events, facts, dimensions Dimensional models
Install, test infra-structure. Connect sourcesto targets to desktop Standards, prods to provide capabilities How hook together
Report layouts, derivation For whom, when
How interact with capabilities System utilties, calls, APIs ... Write extracts, loads Automate process Implement report and analysis env
Build rpt Train users Logical and physical models Domains, derivation rules DB, indexes backup ...