• No results found

F4: DW Architecture and Lifecycle. Erik Perjons, DSV, SU/KTH Data warehouse

N/A
N/A
Protected

Academic year: 2021

Share "F4: DW Architecture and Lifecycle. Erik Perjons, DSV, SU/KTH Data warehouse"

Copied!
25
0
0

Loading.... (view fulltext now)

Full text

(1)

F4: DW Architecture and Lifecycle

Erik Perjons, DSV, SU/KTH perjons@dsv.su.se

The data warehouse architecture

Query/Reporting Extract Transform Load Serve External sources Data warehouse Data marts Analysis/OLAP Falö aöldf flaöd aklöd falö alksdf Data mining

Productt Time1 Value1 Value11 Product2 Time2 Value2 Value21 Product3 Time3 Value3 Value31 Product4 Time4 Value4 Value41

Operational source systems

Data access tools (RK) End user applications Business Intelligence tools Data staging

area (RK) Back end tools

Data presentation area (RK)

”The data warehouse” Presentation (OLAP) servers Operational source

systems (RK) Legacy systems OLTP/TP systems

(2)

Operational source systems

characteristics:

• the source data often in OLTP (Online Transaction Processing) systems, also called TPS (Transaction Processing Systems) • high level of performance and availability

• often one-record-at-a time queries

• already occupied by the normal operations of the organisation OLTP vs. DSS (Decision Support Systems)

OLTP vs. OLAP (Online analytical processing)

Operational Source Systems

Operational source systems

More operational source systems

characteristics:

• a OLTP system may be reliable and consistent, but there are often inconsistencies between different OLTP systems • different types of data format and data structures in

different OLTP systems AND DIFFERENT SEMANTICS

Operational Source Systems

Operational source systems

(3)

Operational Source Systems

Kimball et al´s assumptions (p 7):

•Source systems are not queried in the broad

and unexpected ways

•Maintain little historical data

•Each source systems is often a natural

stovepipe application

Operational source systems

DW architecture: Data staging area

Query/Reporting Extract Transform Load Serve External sources Data warehouse Data marts Analysis/OLAP Falö aöldf flaöd aklöd falö alksdf Data mining

Productt Time1 Value1 Value11 Product2Time2 Value2 Value21 Product3Time3 Value3 Value31 Product4Time4 Value4 Value41

Operational source systems

Data access tools

Data staging area Data presentation area Operational

(4)

The Data Staging Area

Often the most complex part in

the architecture, and involves...

• Extraction (E)

• Transformation (T)

• Load (L)

• indexing

ETL-tools can be used

Scripts for extraction, transformation and load are

implemented

Extract Transform Load

Extraction

means reading and understanding the source data and

copying the data needed for the data warehouse into

staging area for further manipulation, i.e.

transformation

Data staging area

Extract Transform Load

(5)

Transformation involves…

• data conversion/transformation

(specify transformation rules to convert to a common data format and common terms/semantics)

• data cleaning/cleansing

– data scrubbing (use domain-specific knowledge (e.g postal adresses) to check the data)

– data auditing (discover suspicious pattern, discover violation of stated rules)

• combining data from multiple sources • assigning warehouse (surrogate) keys • data aggregation

Data staging area

Extract Transform Load

A debate questions:

Should the data in the data staging area be stored in a

3NF relational database and loaded into the presentation

area for querying and reporting?

Kimball (p 8-9): a 3NF relational database in data staging area requires more time and resources for development, periodic loading and updating and more capacity of storing the multiple copies of the data

Data staging area

Extract Transform Load

(6)

Flat file C DB2 table(s) D’ DB2Connect

Staging area for checking, analysing, cleaning, complementing etc transaction data SQL, C++ ?? DB2 Preliminary target DW E Fees (manually adjusted to individual agreements) I Start balance H Customer data G Customer data F

Three star/join schemas comprising altogether 8 tables Fact tables:

- transactions (10 attributes) - fees (7 attributes) - start balance (4 attributes) Dimensional tables: - time (7 attr) - customer (> 40 attr) - company (> 90 attr) - product (13 attr) - ”Service charged” (2 attr) Various source files

DB2 Final target DW E’ +aggregation (new program)

E complemented with some aggregated tables

Some cleansing and scrubbing may be needed here

A Real World Example

DW architecture: Data presentation area

Query/Reporting Extract Transform Load Serve External sources Data warehouse Data marts Analysis/OLAP Falö aöldf flaöd aklöd falö alksdf Data mining

Productt Time1 Value1 Value11 Product2Time2 Value2 Value21 Product3Time3 Value3 Value31 Product4Time4 Value4 Value41

Operational source systems

Data access tools Data staging area Data presentation area

Operational source systems

(7)

Data presentation area

Data warehouse Data marts OLAP servers

• What is OLAP?

• Dimensional modelling vs. 3 NF modelling

• Data Marts

• ROLAP/MOLAP servers

• Acronym for “On-line analytical processing”

• A decision support system (DSS) that support ad-hoc querying, i.e. enables managers and analysts to interactively manipulate data. The idea is to allow the users to easy and quickly manipulate and visualise the data through multidimensional views, i.e. different perspectives.

What is OLAP?

q ua rt er offic e product Service Facts Office Quarter

(8)

Date/

Key Month Quarter Year

991011 9910 4 - 99 99

991012 9910 4 - 99 99

Key Customer Address Region

Income group

C210 Anna N Stockholm Stockholm B

C211 Lars S Malmö Skåne B

C212 Erik P Rättvik Dalarna C

C213 Danny B Stockholm Stockholm A

C214 Åsa S Stockholm Stockholm A

Key Service Servicegroup

S1 Local call Group A

S2 Intern. call Group A

S3 SMS Group B

S4 WAP Group C

Key Seller Office

F11 Anders C Sundsvall

F12 Lisa B Sundsvall

F13 Janis B Kista

Service Dimension Time Dimension

Sales Dimension Customer Dimension Fact table - Transactions

Sum Numberof calls

C210 S1 F11 991011 25:00 3 C210 S3 F11 991011 05:00 1 C212 S2 F13 991011 89:00 1 C213 S1 F13 991011 12:00 1 C214 S4 F13 991012 08:00 1

Dimensional modelling

1 1 1 1 0..* 0..* 0..* 0..* Date/

Key Month Quarter Year

991011 9910 4 - 99 99

991012 9910 4 - 99 99

Key Customer Address Region

Income group

C210 Anna N Stockholm Stockholm B

C211 Lars S Malmö Skåne B

C212 Erik P Rättvik Dalarna C

C213 Danny B Stockholm Stockholm A

C214 Åsa S Stockholm Stockholm A

Key Service Servicegroup

S1 Local call Group A

S2 Intern. call Group A

S3 SMS Group B

S4 WAP Group C

Key Seller Office

F11 Anders C Sundsvall

F12 Lisa B Sundsvall

F13 Janis B Kista

Service Dimension Time Dimension

Sales Dimension Customer Dimension Fact table - Transactions

Sum Numberof calls

C210 S1 F11 991011 25:00 3 C210 S3 F11 991011 05:00 1 C212 S2 F13 991011 89:00 1 C213 S1 F13 991011 12:00 1 C214 S4 F13 991012 08:00 1 Query: For how much

did customers in Sthlm use service “Local call” in october 1999? Σ=37:00

(9)

Key difference between 3NF and Dimensional modelling:

- the degree of normalisation

3 NF modelling

- a logical design technique to eliminate data redundancy to keep consistency and storage efficiency, and makes transaction simple and deterministic

- ER models for enterprise are usually complex, e.g. they often have hundreds, or even thousands, of entities/tables

Dimensional modelling

- a logical design technique that present data in a intuitive, i.e. easier to navigate for the user

- allow high performance access/queries (the complexity of 3NF models overwhelms the database systems optimizer, which means bad performance)

- aims at model decision support data

3 NF modelling vs. Dimensional modelling

[Kimball et al, p 10-11]

Kimball et al (p.10-12 and 396)

“we refer to the presentation area as a series of integrated data marts”

“a data mart is a flexible set of data, ideally based on the most atomic (granular) data possible to extract from operational source, and presented in a symmetric (dimensional) model that is resilient when faced with unexpected user queries”

“in its most simplistic form a data mart represent data from a single business process” (business process=purchase order, store inventory and so on)

(10)

Data marts

Service Calls Office Quarter Service Subscription orders Office Quarter Service Calls Office Quarter Subscription orders

The data warehouse bus architecture

Orders Production Dimensions Time Sales Rep Customer Promotion Product Plant Distr. Center [Kimball et al, p 78-79] A data mart A data mart

(11)

• A dimensional model for a large data warehouse

consists of between 10 and 25 similar-looking data

marts. Each data marts will have 5 to 15 dimensional

tables

.

Data marts

Kimball et al’s strong opinions (p.10-12)

• all data in the presentation area should be presented,

stored and accesses in dimensional models

• the data marts must contain detailed, atomic data (it

is unacceptable that the detailed data should be

locked up in 3 NF models for drill-down)

• the data marts dimensions should be conformed for

drill-across techniques, which tie the data marts

together in the data warehouse bus architecture

(12)

More about data marts:

• far smaller data volumes, fewer data sources

• easier data cleaning process, faster roll-out

• allows a “piecemeal” approach to some of the enormous

integration problems involved in creating an enterprise

wide data model, but complex integration in the long

term

The Data marts

Dependent vs. Independent Data marts

Data warehouse

Dependent Data marts Data warehouse

(13)

Extended Relational DBMS (ROLAP servers)

– data stored in RDB – star-join schemas – support SQL extensions – index structures

Multidimensional DBMS (MOLAP servers)

– data stored in arrays (n-dimensional array) – direct access to array data structure – excellent indexing properties

– poor storage utilisation, especially when the data is sparse.

The presentation/OLAP servers

Data warehouse

Data marts

OLAP servers

• Index structures (bit map indexes, join indexes)

• SQL extensions (operators like Cube, Crossjoin)

• Materialised views (pre-aggregations)

More about presentation servers

What is characteristics regarding data warehouse, according to Chaudhiri&Dayal :

(14)

DW architechture: Metadata repository

Query/Reporting Extract Transform Load Refresh Operational source systems Serve External sources Data warehouse Metadata repository

Monitoring & Administration

Data marts OLAP servers Analysis Falö aöldf flaöd aklöd falö alksdf Data mining

Productt Time1 Value1 Value11 Product2Time2 Value2 Value21 Product3Time3 Value3 Value31 Product4Time4 Value4 Value41

Data access tools Data staging area Data presentation area

Operational source systems

What is metadata?

Main functions are to give...

• data definitions

• the origin of data

• the structure of data

• rules for the selection and transfer of data

• qualitative and quantitative data about data

Contained in metadata repository

Æ

(15)

The metadata repository

An integrated complete source of metadata

• is at the heart of the data warehouse architecture • supports the information needs of...

– system developers – data administrators – system administrators – users

– applications on the data warehouse

• very complex data structure • must contain full version history • must always be up to date

Metadata life cycle activities

• Collection

• identify and capture metadata in a central

repository

• Maintenance

• establish processes to synchronise metadata with

the changing data structure

• Deployment

• provide metadata to users in the right form and

with the right tools

(16)

Different types of metadata

Administrative metadata

(includes all information necessary for setting up and using a DW, e.g. Information about source databases, dw schemas,

dimensions, hierachies, predefined queries, physical

organisation, rules and script for extraction, transformation and load, back-end and front end tools)

Business metadata

(business terms and definitions, ownership of data)

Operational metadata

(information collected during the operations of the DW, e. g. usage statistics, error reports)

DW architecture: End user applications

Query/Reporting Extract Transform Load Refresh Operational DBs Serve External sources Data warehouse Metadata repository Monitoring & Administration

Data marts OLAP servers Analysis Falö aöldf flaöd aklöd falö alksdf Data mining Productt Time1 Value1 Value11 Product2Time2 Value2 Value21 Product3Time3 Value3 Value31 Product4Time4 Value4 Value41

Data access tools

Data staging area Data presentation area Operational

(17)

Query/Reporting Analysis Falö aöldf flaöd aklöd falö alksdf Data mining

Productt Time1 Value1 Value11 Product2Time2 Value2 Value21 Product3Time3 Value3 Value31 Product4Time4 Value4 Value41

• OLAP tools, BI apps, DSS

• Query/Reporting tools

• Data mining

End user applications

product product group

mounth

quarter officeregion

Product Group Region First Quarter - 1997

Group A ABC 1245

Group A XYZ 34534

Group B ABC 45543

Group B XYZ 34533

Column headers

(join constraints) Column header(application constraint) Answer set representing

focal event

Row headers

(18)

Graphical output of OLAP tool

Drill-down - decreasing the level of aggregation

Drill-up/Roll-up/Consolidation - increasing the level of aggregation • Drill-across - move between different star-join schemas using

conformed dimensions and joins

Slicing and dicing – ability to look at the database from different views, e.g. one slice shows all sales of product type within regions, another slice shows all sales by sales channel within each product type

Pivoting - e.g. change columns to rows, rows to columns • Ranking - sorting

“Think of an OLAP data structure as a Rubik´s Cube of data that users can twist and twirl in different ways to work through what-if an what-happend scenarios” [Lee Thé]

(19)

Strategic

Who: strategic leaders

What: formulate strategy and monitor corporate performance Examples: Balance scorecard, Strategic Planning

Operational

Who: operational managers

What: execution of strategy againts objectives Examples: Budgeting, Sales forcasting

Analytical

Who: analysts, knowledge worker, controller What: ad-hoc analysis

Examples: Financial and Sales Analysis, Customer Segmentation, Clickstream analysis

Business Intelligence (BI) apps

• Complexity of integration

– Hidden problems with source systems – Data homogenisation

– Underestimation of resources for data loading

• Required data not captured

• High maintenance

• Long duration projects

• Why not integrating the legacy applications

(OLTP systems) instead?

(20)

Operational Data Store (ODS)

No singel universal defintion...

ODS definition 1: Implemented to deliver operational reporting, especially when neither the legacy nor the modern OLTP systems provide adequate operational reports – fixed queries and for tactical decision making

ODS definition 2: Built to support real-time interactions, especially in Customer Relationsship Management applications – the tradtional data warehouse typically is not in a position to support the demand for near-real-time data

OMG’s standards

Model Metamodel Meta metamodel Instances Invoice no 34 Helen Nagy M0 layer M1 layer M2 layer M3 layer

CWM Metamodel

Meta Object Facility (MOF)

(21)

Common Warehouse Metamodel (CWM)

Data Source Data Source Data Source Operational

Data Store ETL

Data Warehouse Data Mart Analysis Reporting Visualization Data Mining Data Mart Data Mart

The collection of metamodels by CWM can be

used to model the whole data warehousing

environment i.e from data sources to end use

analysis, and data warehouse management

Common Warehouse Metamodel

• Common Warehouse Metamodel (CWM) is a

language specifically design to model data

warehousing and data mining applications, i.e.

integrating data warehousing and business

analysis (business intelligence) tools

• CWM has a lot in common with the UML metamodel

but has a number of special metamodels

(metaclasses), e.g modelling relational databases,

multidimensional databases, OLAP, schema

transformations, XML

(22)

X Check material on stock Capture ordered items Check material on stock Ordered item [captured] Ordered item captured Material on stock [checked] Capture ordered items Material is on stock Material is not on stock Order recieved State Precedes Succedes Activity Precedes/ Succedes Succedes Precedes State Event consists of Transform-ation consists of Function Event Succedes Precedes

Why

meta-modelling?

Meta metamodel level or Reference model Meta-model level Model level [Rosemann, Green, 2002]

CWM packages

Management Warehouse Process Warehouse Operation

Analysis Transformation OLAP Data Mining VisualizationInformation NomenclatureBusiness

Resource Relational Record Multi-Dimensional XML

Foundation InformationBusiness Data Types Expressions Keys andIndexes DeploymentSoftware Type Mapping Object

Model Core Behavioral Relationships Instance

(23)

CWM packages layers

[Poole et al, p.36-40 (2002)]

Object layer - base metamodels/packages, which are (re)used by the other metamodels/packages

Foundation layer - extends the object layer with services required which are (re)used by the other metamodels/packages, e.g “unique key” in the Key Indexes metamodel/package is used by relational databases, OO-databases and record-oriented

• Resource layer - defines metamodels/packages for various types of data resouces

Analysis layer - analysis-oriented metadata

Management layer - describing the data warehousing process as a whole

CWM packages relations

Relational package ModelElement Class Classifier Core package Datatype package ClassifierFea ture Element Namespace Feature StructuralFeature Expression ProcedureExpression QueryExpression Attribute ColumnSet NamedColumnSet Table View QueryColumnSet Column

(24)

CWM classifyer equality

Object Package Classifier

(Klass) (Attribut)Feature

Schema Table Column

Relational

Record Record

file RecordDef Field

Multi Dimensional

XML

Schema Dimenson Dimension

ed Objct Schema Element Type Attribute

More about CWM

Common Representation <<metamodels>> CWM Packages Tool Y Metamodel Tool X

(25)

Business Dimensional Lifecycle

Technical Architecture Design Technical Architecture Design Product Selection & Installation Product Selection & Installation End-User Application Specification End-User Application Specification End-User Application Development End-User Application Development Project Planning Project Planning Business Requirement Definition Business Requirement Definition Deployment

Deployment Maintenanceand Growth Maintenance and Growth Project Management Project Management Dimensional Modeling Dimensional

Modeling PhysicalDesign

Physical Design Data Staging Design & Development Data Staging Design & Development

The Data Warehouse Architecture

Framework

Level of ARCHITECTURE AREA

detail Data Back room Front room Infrastructure

Business reqs and audit Architecture models and documents Detailed models and specs Implemen-tation Info needed for better decisions

Enterprise models How get, transform, make available data Capabilities needed to get and

transform data Major data stores

User’s needs Major classes of analyses Priorities Where is data coming from Calc and storage

reqs HW/SW capabilities needed vs what we have Major business issues. How measure How analyse Focal events, facts, dimensions Dimensional models

Install, test infra-structure. Connect sourcesto targets to desktop Standards, prods to provide capabilities How hook together

Report layouts, derivation For whom, when

How interact with capabilities System utilties, calls, APIs ... Write extracts, loads Automate process Implement report and analysis env

Build rpt Train users Logical and physical models Domains, derivation rules DB, indexes backup ...

References

Related documents

This linking allows statistical model checking of several opportunistic network properties and

The last four minerals observed for the first time in Kenya (phillipsite, hydroxyapophyllite, celestina and halite) have all been noted in samples taken from the Kitum Cave on

Players can create characters and participate in any adventure allowed as a part of the D&amp;D Adventurers League.. As they adventure, players track their characters’

Data Cleaning &amp; Integration Databases Data Mining Data Warehouse Task-relevant Data Selection &amp; transformation. }   Data mining plays

Data Cleaning &amp; Integration Databases Data Mining Data Warehouse Task-relevant Data Selection &amp; transformation. ` Data mining plays

Whether grown as freestanding trees or wall- trained fans, established figs should be lightly pruned twice a year: once in spring to thin out old or damaged wood and to maintain

In the first legal setback of the year for the ARM industry, an appeals court Thursday reversed a lower court ruling and found that a collection agency violated the FDCPA when

Potential explanations for the large and seemingly random price variation are: (i) different cost pricing methods used by hospitals, (ii) uncertainty due to frequent changes in