• No results found

Data-Warehouse-, Data-Mining- und OLAP-Technologien

N/A
N/A
Protected

Academic year: 2021

Share "Data-Warehouse-, Data-Mining- und OLAP-Technologien"

Copied!
51
0
0

Loading.... (view fulltext now)

Full text

(1)

Anwendersoftware

aa

Anwendungssoftware

ss

Data-Warehouse-, Data-Mining- und

OLAP-Technologien

Chapter 2: Data Warehouse Architecture

Bernhard Mitschang

Universität Stuttgart

Winter Term 2014/2015

(2)

Data Warehouse Architecture Anwendersoftware

aa

Anwendungssoftware

ss

Overview

Data Warehouse Architecture

 Data Sources and Data Quality

 Data Mart

 Operational Data Store

Information Integration

 materialized vs. virtual

 Federated Information Systems

Metadata

 Metadata Repository

 Metadata in Data Warehousing

 CWM Metamodel

(3)

Data Warehouse Architecture Anwendersoftware

aa

Anwendungssoftware

ss

Architecture

3 Load Data Staging Area Extraction Transformation Monitor Data Warehouse Manager Data

Warehouse

Metadata Manager Metadata Repository End User

Data Access

Data Sources

Data Warehouse System

Data Staging Area

data flow control flow

(4)

Data Warehouse Architecture Anwendersoftware

aa

Anwendungssoftware

ss

Data Sources

• Characteristics of source systems:

 narrow, "account-based" queries  no queries in a broad and

unexpected way, like DW

 maintain little historical data  no conformed dimensions

(product, customer, geography, …) with other legacy systems

 use keys (production keys) to

make certain things unique (product, customer, …)

• Important issues in selecting data

sources:

 Purpose of the data warehouse  Quality of data sources

 Availability of data sources

(organizational prerequisites, technical prerequisites)

 Costs

(internal data, external data)

Data Sources

(5)

Data Warehouse Architecture Anwendersoftware

aa

Anwendungssoftware

ss

Data Quality

• Are there contradictions in data and/or metadata?

consistency

• Do data and metadata provide an exact picture of the reality?

correctness

• Are there missing attributes or values?

completeness

• Are exact numeric values available?

• Are different objects identifiable? Homonyms?

exactness

• Is there a Standard Operating Procedure (SOP) that describes the provision of source data?

reliability

• Does a description for the data and coded values exist?

understandability

• Does the data contribute to the purpose of the data warehouse?

relevance

(6)

Data Warehouse Architecture Anwendersoftware

aa

Anwendungssoftware

ss

Dimensions of Data Sources

origin time usage type character set orientation confidentiality

• internal vs. external data

• current vs. historic data

• data vs. metadata

• number, string, time, graphic, audio, video, …

numeric, alphanumeric, boolean, binary, …

• ASCII, EBCDIC, UNICODE, …

• left to right, right to left, top-down

• strictly confidential, confidential, public, …

(7)

Data Warehouse Architecture Anwendersoftware

aa

Anwendungssoftware

ss

Monitoring

Goal: Discover changes in data source incrementally

Approaches:

Based on … Changes identified by …

Trigger triggers defined in

source DBMS trigger writes a copy of changed data to files

Replica replication support of

source DBMS replication provides changed rows in a separate table

Timestamp timestamp assigned to

each row use timestamp to identify changes (supported by temporal DBMS)

Log log of source DBMS read log

Snapshot periodic snapshot of data

source compare snapshots

(8)

Data Warehouse Architecture Anwendersoftware

aa

Anwendungssoftware

ss

Data Staging Area (DSA)

Data Staging Area:

A storage area and a set of

processes that clean, transform, combine, deduplicate, household, archive, and prepare source data for use in the data warehouse.

• Data is temporarily stored in the

data staging area before it is loaded into the data warehouse.

• All transformations are performed

in the DSA.

 Preprocessing does not influence

data sources or data warehouse

• DSA is the central repository for

ETL (Extraction - Transformation - Load) processing. Load Data Staging Area Extraction Transformation Monitor

(R. Kimball et al: The Data Warehouse Lifecycle Toolkit, 1998)

(9)

Data Warehouse Architecture Anwendersoftware

aa

Anwendungssoftware

ss

Extraction

Transfer data from data source into the data staging area.

Extracted subset of data sources and schedule of the extraction

depends on the kind of analysis that should be supported.

Method depends on the monitoring strategy used:

 Read data from a file written by triggers.

 Read data from replication tables.

 Select data based on the timestamp.

 Read data from log.

 Read output of snapshot comparison.

Multiple extract types:

 periodic

 started by the admin/user

 event-driven

 immediate after changes in data sources

Data Staging Area

Extraction

Data Sources

(10)

Data Warehouse Architecture Anwendersoftware

aa

Anwendungssoftware

ss

Transformation

Convert the data into something representable to the users and

valuable to the business.

Transformation of structure and content

Typical transformations:

 denormalization, normalization

 data type conversion

 calculation, aggregation

 standardization of strings and date values

 conversion of measures

 cleansing (missing, wrong, and inconsistent values)

Data Staging

Area Transformation

(11)

Data Warehouse Architecture Anwendersoftware

aa

Anwendungssoftware

ss

Load

• Transfer data from the data

staging area into the data warehouse.

• Data in the warehouse is rarely

replaced. The history of

values/changes is stored instead.

• Mainly based on bulk load tools

of the DBMS.

• Offline vs. online load.

• Parallel load may be required.

Load Data Staging Area Data Warehouse 11

(12)

Data Warehouse Architecture Anwendersoftware

aa

Anwendungssoftware

ss

Data Warehouse Manager

Controls all components of the data warehouse system:

Monitor: Discover changes in data sources

Extraction: Select and transfer data from data sources to the data

staging area

Transformation: Consolidate data

Load: Transfer data from data staging area to the data warehouse

End User Data Access: Analysis of data in the data warehouse

(13)

Data Warehouse Architecture Anwendersoftware

aa

Anwendungssoftware

ss

Basic Elements of the Data Warehouse

Storage: Flat files; RDBMS; other Processing: clean; prune; combine; remove duplicates; household; standardize; conform dimensions; store awaiting replication; archive;

export to data marts;

No user query services

Source Systems Data Staging Area "The Data Warehouse"

Presentation Servers Data Access End User

extract

extract

extract

Data Mart #1:

OLAP query services;

dimensional!

subject oriented; locally implemented; user group driven; may store atomic data; may be frequently refreshed; conforms to DW Bus Data Mart #3: Data Mart #2: populate, replicate, recover populate, replicate, recover populate, replicate, recover feed End User Applications Ad Hoc Query Tools Models forecasting; scoring; allocating; data mining; other downstream systems; other parameters; special UI Report Writers feed feed feed DW BUS

DW BUS Conformed dimensions

Conformed facts

Conformed dimensions Conformed facts

upload cleaned dimensions upload model results

(14)

Data Warehouse Architecture Anwendersoftware

aa

Anwendungssoftware

ss

Architecture

Logical Data

Warehouse Warehouse Data

Central architecture

• only one data model • performance bottleneck • complex to build • easy to maintain (Source: [JL+02]) Federal architecture • logically consolidated • separate physical databases that store detailed data

• faster response time

Tiered architecture

• physical central data warehouse

• separate physical databases that store summarized data • faster response time

Data Warehouse

Clients

(15)

Data Warehouse Architecture Anwendersoftware

aa

Anwendungssoftware

ss

Data Marts

Load Data Warehouse Load Data Warehouse Data

Mart Data Mart Data Mart Data Mart Transformation Load Load Data Mart End User Data Access Data Mart End User Data Access End User Data Access Data Mart End User Data Access End User

Data Access End User Data Access

End User

Data Access End User Data Access

dependent data marts independent data marts

(16)

Data Warehouse Architecture Anwendersoftware

aa

Anwendungssoftware

ss

dependent data marts

(tiered architecture) independent data marts (federated architecture)

Data Marts

• Central data warehouse (DW) is

build first

• Extracts of the data warehouse

are provided as data marts (materialized views)

• Establish ETL process for DW

only

• Consistent analysis on DW and

DM

• Several data marts (DM) are build

first

• Data marts are integrated by

means of a second transformation step

• Establish ETL process for each

DM and the central DW

• Inconsistent analysis is possible

• Virtual data warehouse possible

(federated architecture)

(17)

Data Warehouse Architecture Anwendersoftware

aa

Anwendungssoftware

ss

Operational Data Store (ODS)

Term has taken many definitions. For example:

Point of integration for operational systems

- refreshed within a few seconds after the operational data sources are updated

- very little transformations are performed

- Example: Banking environment where data sources keep individual

accounts of a large multinational customer, and the ODS stores the total balance for this customer.

true operational system separated from the data warehouse

Decision support access to operational data

- integrated and transformed data are first accumulated and then periodically forwarded to the ODS

- involves more integration and transformation processing

- Example: Bank that stores in the ODS an integrated individual bank account on a weekly basis

part of the data warehouse or separate system?

(18)

Data Warehouse Architecture Anwendersoftware

aa

Anwendungssoftware

ss

Classes of Operational Data Stores

• Tables are copied from the operational environment

• Transactions are moved to the ODS in an immediate manner

(range of one to two seconds)

• Activities in the operational

environment are stored, integrated, and forwarded to the ODS

• ODS is fed aggregated analytical data from the data warehouse

• Combination of integrated data from the operational environment and aggregated data from the analytical environment DWH ODS appl ic at io n s DWH ODS appl ic at io n s DWH ODS appl ic at io n s DWH ODS appl ic at io n s DWH ODS appl ic at io n s cl ass 0 cl ass 1 cl ass 2 cl ass 3 cl ass 4

(19)

Data Warehouse Architecture Anwendersoftware

aa

Anwendungssoftware

ss

Overview

Data Warehouse Architecture

 Data Sources and Data Quality

 Data Mart

 Operational Data Store

Information Integration

 materialized vs. virtual

 Federated Information Systems

Metadata

 Metadata Repository

 Metadata in Data Warehousing

 CWM Metamodel

(20)

Data Warehouse Architecture Anwendersoftware

aa

Anwendungssoftware

ss

Dimensions of Information Systems

Definition

: Federated Information System

A federated information system consists of a set of distinct and autonomous

system components, the participants of this federation. Participants in first place operate independently, but have given up some autonomy in order to participate in the federation. (Source: [BK+99]) distribution autonomy heterogeneity Centralized Information Systems Heterogeneous Information Systems Distributed Information Systems Federated Information Systems 20

(21)

Data Warehouse Architecture Anwendersoftware

aa

Anwendungssoftware

ss

Autonomy of Information Systems

Design autonomy

 The design of one source system is independent from the design of

other systems, e.g., in the universe of discourse, the data model, naming concepts, ...

Communication autonomy

 Source systems decide independently with which other systems they

communicate.

 Source systems can leave and enter a federation at any time.

Execution autonomy

 Source systems independently decide on execution and scheduling of

incoming requests.

(22)

Data Warehouse Architecture Anwendersoftware

aa

Anwendungssoftware

ss

Heterogeneity in Information Systems

Heterogeneity Syntactical Data model Logical Technical heterogeneity Structural heterogeneity Schematic heterogeneity Semantic heterogeneity Data model heterogeneity Binding restrictions Query restrictions Language heterogeneity Interface

(23)

Data Warehouse Architecture Anwendersoftware

aa

Anwendungssoftware

ss

Federated Information Systems

(Source: [BK+99])

Federation layer

e.g. uniform access language, uniform access schema, uniform metadata set

Foundation layer (data sources) Local Applications Presentation layer Wrapper layer 23

(24)

Data Warehouse Architecture Anwendersoftware

aa

Anwendungssoftware

ss

Types of Federated Information Systems

Loosely coupled

Information Systems Federated Databases Mediator-based Information Systems Types of

heterogeneity addressed

Only technical and

language heterogeneity All, except query restriction heterogeneity;

schema integration difficult for schematic heterogeneity

All

Loss of autonomy? Execution autonomy Execution autonomy; notification of schema changes

Execution autonomy

Transparency Language Location, schema and

partly language Location, schema and language

Kind of components Structured Structured Any

Access methods Query language Query language Any

Access restrictions No No Yes

Write access? Yes Yes No

(25)

Data Warehouse Architecture Anwendersoftware

aa

Anwendungssoftware

ss

Types of Federated Information Systems

Loosely coupled

Information Systems Federated Databases Mediator-based Information Systems Tight vs. loose

integration Loose Tight Tight Kinds of semantic

integration Collection Collection and fusions Collection, fusion, sometimes abstraction

Necessary metadata Technical, infrastructure Logic, technical, semantic Logic, technical, semantic

Bottom-up vs.

top-down n.a. Bottom-up Top-down Virtual vs.

materialized Virtual Virtual Virtual Evolvability High Low High

(26)

Data Warehouse Architecture Anwendersoftware

aa

Anwendungssoftware

ss

Mediator-based Information Systems

(Source: [BK+99]) Mediation layer Foundation layer (data sources) Local Applications Presentation layer Wrapper layer Mediator Mediator 26

(27)

Data Warehouse Architecture Anwendersoftware

aa

Anwendungssoftware

ss

Federated Database Systems

• Federated DBMS (FDBS)

 Transparent access to a

collection of heterogeneous and semi-automonomous data

sources.

 Complete, extensible database

engine

- Function compensation

- Powerful (global) query optimizer (pushdown analysis, cost-based

optimization, query rewrite)

Federation layer

e.g. uniform access language, uniform access schema, uniform metadata set

Foundation layer (data sources) Local Applications Presentation layer 27

(28)

Data Warehouse Architecture Anwendersoftware

aa

Anwendungssoftware

ss

Architecture for a FDBS

Catalog Data Data Data Back-end Data Source Back-end Data Source Federated Database Server SQL API Client Wrapper 28

(29)

Data Warehouse Architecture Anwendersoftware

aa

Anwendungssoftware

ss

29

Information Integration

(Source: [LN07])

Data Sources Data Sources Data Sources Data Sources

ETL ETL Wrapper Wrapper

Data Warehouse Mediator

Integrated

Schema Schema Global Mappings Schema End User

Data Access Data Access End User

Query Execution

(30)

Data Warehouse Architecture Anwendersoftware

aa

Anwendungssoftware

ss

Overview

Data Warehouse Architecture

 Data Sources and Data Quality

 Data Mart

 Operational Data Store

Information Integration

 materialized vs. virtual

 Federated Information Systems

Metadata

 Metadata Repository

 Metadata in Data Warehousing

 CWM Metamodel

(31)

Data Warehouse Architecture Anwendersoftware

aa

Anwendungssoftware

ss

Metadata Repository

A repository is a shared database of information about engineering

artifacts, such as software, documents, maps, information

systems, and manufactured components and systems.

Functions of a repository:

 Object management  Dynamic extensibility  Relationship management  Notification  Version management  Configuration management

Metadata Manager Metadata Repository

(32)

Data Warehouse Architecture Anwendersoftware

aa

Anwendungssoftware

ss

Metadata in Data Warehousing

Metadata

Data dictionary: Definitions of the databases and relationship between data

elements Data flow:

Direction and frequency of data feed Data transformation: Transformations required when data is moved Version control: Changes to metadata are stored Data usage statistics: A profile of data in the warehouse Alias information: Alias names for

a field Security: Who is allowed to access the data Need for a standard interchange format Stored in a metadata repository 32

(33)

Data Warehouse Architecture Anwendersoftware

aa

Anwendungssoftware

ss

Metadata in Data Warehousing

• Criteria to identify important

classes of metadata in data warehousing:  Type of data  Abstraction  User  Origins  Time

• Usage of metadata in data

warehousing:

 passive  active

 semi-active

• Main goals:

 Support development and

operation of a data warehouse

- system integration

- processes for DW administration

- flexible application development

- access rights

 Provide information for data

warehouse users

- quality of data

- consistent terminology

- support for data analysis

(34)

Data Warehouse Architecture Anwendersoftware

aa

Anwendungssoftware

ss

Centralized Metadata Management

Metadata Manager

data flow control flow Metadata Repository

Monitoring

Tool Tool ETL Data Warehouse Manager Data Access Tool

(35)

Data Warehouse Architecture Anwendersoftware

aa

Anwendungssoftware

ss

Decentralized Metadata Management

MD Manager Metadata Repository data flow control flow (Source: [PC+03]) Metadata

Repository Repository Metadata Repository Metadata Monitoring

Tool Tool ETL Data Warehouse Manager Data Access Tool

MD Manager MD Manager MD Manager meta data integration via point-to-point bridges

(36)

Data Warehouse Architecture Anwendersoftware

aa

Anwendungssoftware

ss

Federated Metadata Management

Metadata Manager Metadata Repository data flow control flow (Source: [BG04]) Metadata

Repository Repository Metadata Repository Metadata

federated metadata management Monitoring

Tool Tool ETL Data Warehouse Manager Data Access Tool

Local Tool

(37)

Data Warehouse Architecture Anwendersoftware

aa

Anwendungssoftware

ss

Metadata Interchange

Standardization Approaches:

 Case Data Interchange Format (CDIF)

- Electronic Industries Association (EIA)

- Focus on CASE Tools

- Interchange based on files

 Open Information Model (OIM)

- Meta Data Coalition's Metadata Model

- Interchange based on XML

- Merged with CWM since 1999

 Common Warehouse Model (CWM)

- Metadata Model published by the Object Management Group (OMG)

- Interchange based on XML

(38)

Data Warehouse Architecture Anwendersoftware

aa

Anwendungssoftware

ss

CWM Metamodel

The main purpose of

CWM

is to enable easy interchange of

warehouse and business intelligence metadata between warehouse

tools, warehouse platforms and warehouse metadata repositories

in distributed heterogeneous environments.

Design goals: The CWM model should …

 reuse concepts in the UML metamodel where applicable.

 be subdivided into packages allowing implementation of relevant

subsets of the model.

 be independent of any specific data warehouse implementation. Yet it

should contain features that are effective in, and mappable to, a broad range of representative warehouse configurations based on specific tools.

(39)

Data Warehouse Architecture Anwendersoftware

aa

Anwendungssoftware

ss

Model-based Metadata Management

MD Manager Metadata Repository data flow control flow (Source: [PC+03]) Metadata

Repository Repository Metadata Repository Metadata Monitoring

Tool Tool ETL Data Warehouse Manager Data Access Tool

MD Manager MD Manager MD Manager CWM Metamodel

(40)

Data Warehouse Architecture Anwendersoftware

aa

Anwendungssoftware

ss

CWM Metamodel

CWM is based on three key industry standards:

 UML - MOF - XMI

CWM needs a formal language capable of

representing meta data in terms of shared

platform-independent models.

CWM is based on a framework that supports

any kind of meta data and allows new kinds

of meta data to be added as required.

Meta data integration based on CWM requires

a common interchange format for exchanging

instances of shared meta data.

UML

MOF

XMI

(41)

Data Warehouse Architecture Anwendersoftware

aa

Anwendungssoftware

ss

UML

• Class Model • Inheritance • Composition

• The UML notation is used in the

diagrammatic representations of the CWM metamodel.

"The Unified Modeling Language (UML) is a graphical language for visualizing, specifying, constructing, and documenting the artifacts of a software-intensive system. The UML offers a standard way to write a system's blueprints, including conceptual things such as business processes and system functions as well as concrete things such as programming language statements, database schemas, and reusable software components." (OMG)

Point x: Real y: Real rotate(angle:Real) Shape Spline Elipse Polygon Window Panel Header Slider 1 1 1 2 1 1 scrollbar title body

(42)

Data Warehouse Architecture Anwendersoftware

aa

Anwendungssoftware

ss

Meta Object Facility (MOF)

The MOF model can be used as a model for defining information

models.

In this context, the MOF Model is referred to as a meta-metamodel

because it is being used to define metamodels (UML, CWM, …).

The main modeling concepts are:

 Classes, which model MOF meta-objects.

 Associations, which model binary relationships between meta-objects.  DataTypes, which model other data (e.g. primitive types, external

types, etc.).

 Packages, which modularize the models.

(43)

Data Warehouse Architecture Anwendersoftware

aa

Anwendungssoftware

ss

Metadata Architecture

3 The "MOF Model" meta-

metamodell The "MOF Model"

2 UML Metamodel metamodel,

meta-metadata CWM Metamodel

1 UML Model model,

metadata CWM Model

0 modelled

system object, data warehouse data

(44)

Data Warehouse Architecture Anwendersoftware

aa

Anwendungssoftware

ss

XML Metadata Interchange (XMI)

• Supports interchange of MOF-based meta-data.

• Applicable to a wide variety of

objects: UML, Java, C++, EJB, IDL, CWM.

• Production of XML Schemas starting from an object model.

• Production of XML documents starting from objects.

(Source: Object Management Group: MOF 2.0/XMI Mapping, Version 2.1.1, www.omg.org) 44 M3 level M2 level M1 level The MOF model CWM meta model CWM model

XML schema production rules schema XML

XML document production rules

XML document

(45)

Data Warehouse Architecture Anwendersoftware

aa

Anwendungssoftware

ss

CWM Metamodel

Management Warehouse Process Warehouse Operation

Analysis Transformation OLAP Data Mining Information Visualization Nomenclature Business

Resource Object Model Relational Record Multidimensional XML

Foundation Information Business Types Data Expression Keys and Indexes Mapping Type Deployment Software Object Model

(46)

Data Warehouse Architecture Anwendersoftware

aa

Anwendungssoftware

ss

CWM Relational Package

(47)

Data Warehouse Architecture Anwendersoftware

aa

Anwendungssoftware

ss

CWM Relational Package

(48)

Data Warehouse Architecture Anwendersoftware

aa

Anwendungssoftware

ss

CWM Warehouse Process Package

(49)

Data Warehouse Architecture Anwendersoftware

aa

Anwendungssoftware

ss

CWM OLAP Package

(50)

Data Warehouse Architecture Anwendersoftware

aa

Anwendungssoftware

ss

Summary

Basic Components:

 Data Staging Area: Extraction, Transformation, Load

 Data Warehouse Database

 Data Warehouse Manager

 Metadata Repositories and Metadata Manager

Data Marts: Distributed Data Warehouse

Materialized vs. virtual information integration

Metadata is important to:

 Support development and operation of a data warehouse

 Provide information for data warehouse users

Metadata standards are important to interchange metadata

between warehouse tools, warehouse platforms and warehouse

metadata repositories.

(51)

Data Warehouse Architecture Anwendersoftware

aa

Anwendungssoftware

ss

Papers & Books

[Ber98] P. A. Bernstein: Repositories and Object Oriented Databases.

SIGMOD Record 27(1): 88-96 (1998).

[BK+99] S. Busse, R.-D. Kutsche, U. Leser, H. Weber: Federated

Information Systems: Concepts, Terminology and Architectures. Forschungsberichte des Fachbereichs Informatik Nr. 99-9, TU Berlin April 1999.

[PC+03] J. Poole, D. Chang, D. Tolbert, D. Mellor: Common Warehouse

Metamodel: Developer's Guide. OMG Press, 2003.

[SL90] A. P. Sheth, J. A. Larson: Federated Database Systems for

Managing Distributed, Heterogeneous, and Autonomous Databases. ACM Comput. Surv. 22(3): 183-236(1990).

References

Related documents

The character can add his dots in this Fighting Style as a bonus to Attack with any Short Blade, Long Blade, or Medium Blunt weapon up to his effective

Und, wenn man den App-Begriff weiter fasst und sagt, dass muss jetzt nicht unbedingt eine native App sein, sondern es kann auch eine Web-App sein, dann kann das durch aus sein,

Our aim is to provide a local service enabling our customer base to gain access to the highest standards of quality fresh frozen and chilled produce.

Elisabeth Gerarda de Waal, daughter of Nicolaas de Waal and Christina Kaljee, was born op 12 september 1891 te Amersfoort and was baptized.. She is waschmeisje,

The basic idea of web 3.0 is to define structure data and link them in order to more effective discovery, automation, integration, and reuse across various applications [5].Web

Data Warehouse Analyst Business Analyst Software Quality Assurance Engineeer Web Development Graphics Designer Multimedia Specialist Digital Visualization User Support Other

A research model is developed that proposes both a direct effect of IT infrastructure on organizational performance and an indirect effect mediated by the effects of IT

Course content ideas are presented to help these students develop competencies in four areas: self-understanding, knowledge of entrepreneurial careers, a realistic sense of