• No results found

Introduction. Introduction to Data Warehousing

N/A
N/A
Protected

Academic year: 2021

Share "Introduction. Introduction to Data Warehousing"

Copied!
32
0
0

Loading.... (view fulltext now)

Full text

(1)

Introduction to

Data Warehousing

Pasquale LOPS

Gestione della Conoscenza d’Impresa

A.A. 2003-2004

Introduction to

Data Warehousing

Pasquale LOPS

Gestione della Conoscenza d’Impresa

A.A. 2003-2004

Introduction

Introduction

Data warehousing and decision support have

given rise to a new class of databases.

Design strategies for OLAP databases differ

significantly from OLTP systems.

Today’s decision support systems must deliver

multidimensional analysis capabilities.

(2)

Terminology – What is a Data Warehouse?

Terminology – What is a Data Warehouse?

A database

- typically read-only

- Data stored in relational or multidimensional format - Multidimensional db often populated from relational db

Populated from existing source systems

- Secondary sources of data

- Populated from existing internal or external data sources - It is possible to build DSS on top of operational systems

Used for reporting purposes

- not transaction-based

- Used primarily for reporting purposes - Must be designed for analysis purposes - OLAP; not OLTP

Terminology – Decision Support and

Multidimensional Analysis

Terminology – Decision Support and

Multidimensional Analysis

Decision Support Systems (DSS)

Facilitate business analysis

Support business decision makers by providing various types of analysis: trend, comparison and ad hoc

reporting

(3)

Terminology – OLTP and OLAP

Terminology – OLTP and OLAP

OLTP

On-Line Transaction Processing

- support specific application - Maintain integrity of data

OLAP

On-Line Analytical Processing

- support business analysis

Points of Difference

‹ Orientation or alignment of data ‹ Integration

‹ History—time horizon of data ‹ Data access and manipulation ‹ Usage patterns

OLTP vs. OLAP – Orientation or Alignment of Data

OLTP vs. OLAP – Orientation or Alignment of Data

Organized around Applications

‹Different systems hold different types of data.

‹Data is inherently organized by application.

‹Different information in a different system.

Organized for Business Dimension

‹All types of data are integrated into one system.

‹Data is organized by defined dimensions of the business. ‹Information from different

systems stored in a single database.

(4)

OLTP vs. OLAP – Integration

OLTP vs. OLAP – Integration

Typically Not Integrated

‹ Different key structures ‹ Different naming conventions ‹ Different file formats

‹ Different hardware platforms

Must Be Integrated

‹ Standard key structures ‹ Standard naming conventions ‹ Standard file format

‹ One warehouse server – Logical server

OLTP OLAP

OLTP vs. OLAP – History

OLTP vs. OLAP – History

Recent or Current Data

‹ 60-90 days ‹ Current values only ‹ No time key

‹ No time series analysis

Historical Data

‹ 2 or more years

‹ Historical snapshots of OLTP data ‹ Time key

‹ Time series analysis

(5)

OLTP vs. OLAP – Data Access and Manipulation

OLTP vs. OLAP – Data Access and Manipulation

Transactions

‹ Inserts, Updates, Deletes, Selects

‹ Small amount of data involved in each transaction

‹ Highly ‘indexable’ ‹ RDBMS focus

– Locking – Concurrency – Logical Unit of Work

Bulk Processes

‹ Selects only

‹ Large amount of data involved in each process

‹ Not always ‘indexable’ ‹ RDBMS focus

– Parallel Loader, Query – Star Join

– Bit mapped Indexes

OLTP OLAP

OLTP vs. OLAP – Usage Patterns

OLTP vs. OLAP – Usage Patterns

Fairly Consistent Maintain a constant system

utilization pattern

OLTP OLAP

Spiked or Uneven Large period of light use and

spiked usage pattern

(6)

OLTP vs. OLAP – Summary

OLTP vs. OLAP – Summary

Aligned by Application

Typically Not Integrated

Recent or Current Data

Transactions Fairly Consistent Aligned by Dimension Must Be Integrated Historical Data Bulk Processes Spiked or Uneven OLTP OLAP Alignment: Integration: History: Data Access: Usage: Warehouse Headaches Batch Maintenance Tuning

Intro to ERM and ERD

Intro to ERM and ERD

(7)

ENTITIES

ENTITIES

ERM Terminology

ERM Terminology

ERM - Entity Relationship Model (design) ERD - Entity Relationship Diagram (graphical)

Entity - things of interest to the business, represented by boxes and implemented as tables

Attributes - things to know about an entity, implemented as columns in tables

Relationships - how entities relate, represented by lines on ERD and implemented as foreign keys

(8)

Entity Paradigms

Entity Paradigms

Rounded corners - ERD,

Square corners - Relational

Naming

Should be singular in nature

Consistency, communication, compatibility

RELATIONSHIPS

(9)

Relationships and Business Rules

Relationships and Business Rules

RX Transaction RX

Relationship - Line and “crow’s foot” represent a foreign key relationship from RX Transaction and RX

Cardinality - crow’s foot means “one or more”, absence means “one”

Relationships and Business Rules

Relationships and Business Rules

RX Transaction RX

allows Is allowed by

Optionality

Solid bar means that the relationship MUST exist Circle means that relationship MAY exist

Use words near entities with optionality symbols to complete sentences for definition

RX may allow RX Transactions

(10)

Relationships

Relationships

Vice President Department Be managed by manage One-to-One relationship:

Each department must be managed by one VP. Each VP may manage one department.

Relationships

Relationships

Vice President Management Team contains Is contained in One-to-Many relationship:

Each management team must contain many VP’s. Each VP may be contained in one management team.

(11)

Relationships

Relationships

Degree Employee Is held by hold Many-to-Many relationship:

Each Employee must hold one or more degrees. Each Degree may be held by one or more employees.

ATTRIBUTES

(12)

Attributes - Terminology

Attributes - Terminology

Attributes are the information we wish to keep

about a particular entity

Example: Inventory Inventory Store_id Item_id Amt Units

Attributes - Implementation

Attributes - Implementation

Entities and their attributes are shown as:

ENTITY NAME (Attribute1, Attribute2, Attribute3)

Inventory (Store_id, Item_id,Amount, Units)

To specify a primary key for an entity/table, underline the appropriate Attribute(s)

(13)

Warehouse Architecture Overview

Warehouse Architecture Overview

Warehouse Overview

Warehouse Overview

Basic components Warehouse Server Warehouse Access Tool Source Systems Design Strategies

(14)

Warehouse Overview

Warehouse Overview

Designers must consider and understand

unique characteristics and requirements of all

three previous components

Ideally, a project team should pick the

best-of-class tools for storing and accessing data.

In reality, all three pieces should be selected

with regard to the others, to ensure that each

component will complement the others.

Source Systems

Source Systems

One or more operational systems will be the source(s) of the data stored in the data warehouse.

Source System A Source System B User Group A User Group B

(15)

Source Systems (cont’d)

Source Systems (cont’d)

Source systems are typically not integrated.

9Have unique key structures and unique naming conventions

9Possess overlapping data

Source systems hold current value data.

Source systems will indirectly define the scope of a warehouse.

9Only data found in source systems can be included in data warehouse; no “new” data can be created.

9Each operational system will have unique characteristics (levels of detail or granularity of data, types of data or metrics available)

Warehouse Server

Warehouse Server

DWH RDBMS HW Platform (typically UNIX-based) DWH RDBMS DWH RDBMS Distributed Architecture A DWH RDBMS DWH Gateway Distributed Architecture B

(16)

Warehouse Access Tool / Architecture

Warehouse Access Tool / Architecture

DWH RDBMS HW Platfom MDDB HW Platform Client HW Platform App Server SQL SQL SQL Messaging MDDB Calls MOLAP ROLAP (2-3 tier)

Design Overview

Design Overview

(17)

The Warehouse Trade-Off Triangle

The Warehouse Trade-Off Triangle

Query Performance User Requirements Data Warehouse Maintenance

Schema

The ETL Process

ETL = Extraction,

Transformation

and Loading

The ETL Process

ETL = Extraction,

Transformation

and Loading

(18)

Batch Process – Overview

Batch Process – Overview

Source System Extract Program Extract File

Source System Server Warehouse Server

Load File DWH RDBMS File Transfer Landing Space

Batch Process – Extracts

Batch Process – Extracts

Source System Extract Program Extract File

Extracts are programs that generate data files.

Perform data transformations, data cleaning.

Perform key conversions.

Reformat data to the standards of the warehouse.

Must produce data in a file format suitable for loading into the data

(19)

Batch Process – Extracts

Batch Process – Extracts

Source System Extract Program Extract File

Source System Server

Basic Types of Extracts

1) Facts tables

Must provide load files for the following tables:

–Base tables –Historical tables –Aggregate tables

2) Lookup tables

Must provide data to populate the following tables:

–Lookup tables –Relationship tables

Batch Process – Extracts (cont’d)

Batch Process – Extracts (cont’d)

Static Extraction for the first loading of the DWH Incremental Extraction for the update of the DWH

(20)

Batch Process – File Transfers

Batch Process – File Transfers

Source System Extract Program Extract File

Source System Server Warehouse Server

Load File File Transfer Landing Space

Batch Process – File Transfers (cont’d)

Batch Process – File Transfers (cont’d)

File Transfer: Process of moving data files to data

warehouse server.

After Extracts, generated files must be moved from source systems to data warehouse server.

Design Considerations

9Transfer method and network impact ¾Usually transferred via FTP

(21)

Batch Process – File Transfers (cont’d)

Batch Process – File Transfers (cont’d)

Design Considerations

9Scheduling routines

¾If there is not enough landing space, scheduling routines must be designed.

¾Routines must coordinate file transfers and database loads, transferring a new data file only after an existing file has been loaded and is no longer needed.

Batch Process – Data Loads

Batch Process – Data Loads

Source System Extract Program Extract File Load File DWH RDBMS File Transfer Landing Space

(22)

Batch Process – Data Loads (cont’d)

Batch Process – Data Loads (cont’d)

Data Load: data loaded from extract file into database.

9Post-load processes

9Aggregation routines

Basic Types of Load Procedures

9Append new records to existing table.

9Drop table and reload updated data file.

9Update existing records.

Batch Process – Data Loads (cont’d)

Batch Process – Data Loads (cont’d)

Post Load Processes

9Must update table indexes after data loads.

9For statistics-based optimizers, must update table and index statistics after loads.

Aggregation Routines

9Must run aggregation routines if aggregate data

(23)

Batch Jobs – Overview

Batch Jobs – Overview

Basic Refresh Jobs: necessary to update tables in

warehouse with current information

9Lookup Table

9Fact Table

9Aggregate Table

Maintenance Jobs: necessary to maintain tables in

warehouse

9Updating fact data

9Re-organizing data

Basic Refresh Jobs – Lookup Tables

Basic Refresh Jobs – Lookup Tables

Purpose

9

Apply changes in existing “organizational” systems to lookup data in data warehouse.

9

Changes include addition of new items or changes to descriptive information.

9

No changes to attribute keys or attribute relationships.

(24)

Basic Refresh Jobs – Lookup Tables (cont’d)

Basic Refresh Jobs – Lookup Tables (cont’d)

Basic Methods

9

No refresh

¾Extract is run once to populate DWH

¾Often used in pilot or prototype systems

9

Drop and reload

¾Existing table is dropped or emptied

¾Extract is re-run to capture current information

¾Table is loaded with new extract

9

Append to existing table

¾Extract is re-run to capture current information

¾New extract and “old” or “master” lookup file are compared. ¾New “Delta” file is generated.

¾Delta is applied to master lookup file and lookup table in warehouse. ¾Delta file may be loaded into warehouse directly, OR

¾Delta may be applied to master lookup file and then use Drop and Reload method, loading the master lookup file.

¾Sophisticated batch routines, normally used in production

Basic Refresh Jobs – Lookup Tables (cont’d)

Basic Refresh Jobs – Lookup Tables (cont’d)

Org Source System Extract Program Extract File 1/96 Lookup Table

(25)

Basic Refresh Jobs – Lookup Tables (cont’d)

Basic Refresh Jobs – Lookup Tables (cont’d)

Org Source System Extract Program Master Lookup DWH RDBMS Extract File 2/96 Lookup Table Compare Program Delta File

Basic Refresh Jobs – Fact Tables

Basic Refresh Jobs – Fact Tables

Purpose

9

Refresh or update fact data in DWH with the new data from source systems.

Basic methods

9

Bulk or historical insert

¾Extract is run to capture all data existing in source systems ¾Data is bulk-loaded into data warehouse fact tables ¾Simple batch routine

¾Used to “start” warehouse or provide initial data sets ¾Often doesn’t perform any cleansing or integration

9

Drop and reload

(26)

Basic Refresh Jobs – Fact Tables (cont’d)

Basic Refresh Jobs – Fact Tables (cont’d)

Fact Source System Extract Program DWH RDBMS Extract File for 9/95 thru 1/96 Fact Table

Basic Refresh Jobs – Fact Tables (cont’d)

Basic Refresh Jobs – Fact Tables (cont’d)

Basic methods

9

Drop and reload

¾Historical or Bulk extract is re-run to capture all available data ¾Existing warehouse table is emptied or truncated

¾File is inserted into empty fact table ¾Simple batch routine

¾Used in prototypes or pilot ¾Not feasible for large data sets

9

Append to existing table

(27)

Basic Refresh Jobs – Fact Tables (cont’d)

Basic Refresh Jobs – Fact Tables (cont’d)

Fact Source System Extract Program DWH RDBMS Fact Table 9/95 - 1/96 Extract File 2/96 Append 2/96

Basic Refresh Jobs – Aggregate tables

Basic Refresh Jobs – Aggregate tables

Purpose

9

Refresh or aggregate tables Basic methods

9

Aggregate in warehouse RDBMS

¾Produce atomic extract

¾Transfer and load atomic extract into atomic fact table

¾Produce aggregate values using SQL accessing atomic fact table ¾Insert aggregate values into aggregate fact table

9

Aggregate in batch (on source systems or warehouse server)

¾Produce atomic extract

¾Transfer and load atomic extract into atomic fact table ¾Produce aggregate extract from atomic extract

(28)

Basic Refresh Jobs – Aggregate Tables (cont’d)

Basic Refresh Jobs – Aggregate Tables (cont’d)

Source System Extract Program Atomic Extract

Source System Server Warehouse RDBMS Aggregate Extract Aggregate Program Atomic Fact Table Aggregate Fact Table Aggregate SQL Routines

Maintenance Jobs – Updating Fact Data

Maintenance Jobs – Updating Fact Data

Purpose

9As changes are made to source system data, they should be reflected in the data warehouse.

Basic Methods

9Ignore changes.

9Wait until audited data is available.

(29)

Maintenance Jobs – Updating Fact Data (cont’d)

Maintenance Jobs – Updating Fact Data (cont’d)

Sun Mon Tue Wed Thu Fri Sat

Sun Pre Audit Data Wed Pre Audit Data Sun Post Audit Data Scenario

Audit Process produces clean data set 3 days after initial set is posted.

Maintenance Jobs – Updating Fact Data (cont’d)

Maintenance Jobs – Updating Fact Data (cont’d)

Fact Source System Extract Program Sunday DWH RDBMS Fact Table Compare Program Delta File Sun Pre Data Wed Sun Post Data

(30)

Maintenance Jobs – Re-Organizing Data

Maintenance Jobs – Re-Organizing Data

Store

Region Lookup Store

Store_id Store_desc Region_id Lookup Region Region_id Region_desc Relationship between Region and Store changes Must update foreign key in Store Lookup Region_id 2 1 2 1 1 1 2 Store_id 13 21 24 27 35 57 Store_desc San Fran Boston Dallas Philly DC Las Vegas Dallas is moved to the East Region

Maintenance Jobs – Re-Organizing Data (cont’d)

Maintenance Jobs – Re-Organizing Data (cont’d)

Lookup Store Region_id Store_id Store_desc Lookup Region Region_id

Region_desc Must update key

for Store in all tables

Best Case Worst Case

Fact Sales Region_id Store_id Item_id Week_id Sales_dollars Sales_units

(31)

Maintenance Jobs – Re-Organizing Data (cont’d)

Maintenance Jobs – Re-Organizing Data (cont’d)

Store Sales Store_id Item_id Date Sales_Dollars Sales_Units Region Sales Region_id Item_id Date Sales_Dollars Sales_Units Lookup Store Store_id Store_desc Region_id Lookup Region Region_id Region_desc Region_id 2 1 2 1 1 1 2 Store_id 13 21 24 27 35 57 Store_desc San Fran Boston Dallas Philly DC Las Vegas It is necessary to Re-aggregate table values

Batch Process – Frequency

Batch Process – Frequency

9

Batch job frequencies differ with data sources and level of detail

9

Typically there will be a set of batch routines dedicated to each level of time detail (daily, weekly and monthly batch job)

9

Frequency is often but not necessarily tied to the level of time detail included in the data files to be loaded during that batch routine

(32)

Batch Process – Frequency (cont’d)

Batch Process – Frequency (cont’d)

Daily Daily Weekly Weekly Level of Detail Frequency

Frequency vs. Detail Chart

A A B C?? Weekly Tables Daily Tables Current Week Current week problem

References

References

Golfarelli, M., Rizzi, S., Data Warehouse: Teoria e pratica della progettazione, McGraw-Hill, 2002.

References

Related documents

Documentation files like the methodology description, project plan, interview guidelines and consent form templates can be imported into the NVivo project file and stored in

In Oruta, we utilize ring signatures [4], [5] to construct homomorphic authenticators [2], [6], so that the third party auditor is able to verify the integrity of shared data for

changes touchscreen gestures to have different effects and adds additional gestures that allow users to move around the screen and control their iPads. If VoiceOver is not disabled

Describe the options available for data warehousing with SQL Server Describe the key features and functionality of Parallel Data Warehouse Module 8: Introduction to SQL Azure.

The few studies that include both observed incivilities and reports of crime deserve special review, given their relevance to neighborhood policies on incivilities and

For example, an archer recoiling away from a melee combat will be allowed to shoot or cast a spell in its player’s turn (since the model is no longer consid- ered in melee combat,

In sql server database vs where aggregate functions ignore null object_ids from clause where vs having in sql server keeps the aggregate functions operate very likely have shown

Essentially, SPR immunosensors detect antigens that bind to antibodies on the gold sensor by measuring the angle of light reflection (Kretschmann & Raether,