Introduction to
Data Warehousing
Pasquale LOPS
Gestione della Conoscenza d’Impresa
A.A. 2003-2004
Introduction to
Data Warehousing
Pasquale LOPS
Gestione della Conoscenza d’Impresa
A.A. 2003-2004
Introduction
Introduction
Data warehousing and decision support have
given rise to a new class of databases.
Design strategies for OLAP databases differ
significantly from OLTP systems.
Today’s decision support systems must deliver
multidimensional analysis capabilities.
Terminology – What is a Data Warehouse?
Terminology – What is a Data Warehouse?
A database
- typically read-only
- Data stored in relational or multidimensional format - Multidimensional db often populated from relational db
Populated from existing source systems
- Secondary sources of data
- Populated from existing internal or external data sources - It is possible to build DSS on top of operational systems
Used for reporting purposes
- not transaction-based
- Used primarily for reporting purposes - Must be designed for analysis purposes - OLAP; not OLTP
Terminology – Decision Support and
Multidimensional Analysis
Terminology – Decision Support and
Multidimensional Analysis
Decision Support Systems (DSS)
Facilitate business analysisSupport business decision makers by providing various types of analysis: trend, comparison and ad hoc
reporting
Terminology – OLTP and OLAP
Terminology – OLTP and OLAP
OLTP
On-Line Transaction Processing
- support specific application - Maintain integrity of data
OLAP
On-Line Analytical Processing
- support business analysis
Points of Difference
Orientation or alignment of data Integration
History—time horizon of data Data access and manipulation Usage patterns
OLTP vs. OLAP – Orientation or Alignment of Data
OLTP vs. OLAP – Orientation or Alignment of Data
Organized around Applications
Different systems hold different types of data.
Data is inherently organized by application.
Different information in a different system.
Organized for Business Dimension
All types of data are integrated into one system.
Data is organized by defined dimensions of the business. Information from different
systems stored in a single database.
OLTP vs. OLAP – Integration
OLTP vs. OLAP – Integration
Typically Not Integrated
Different key structures Different naming conventions Different file formats
Different hardware platforms
Must Be Integrated
Standard key structures Standard naming conventions Standard file format
One warehouse server – Logical server
OLTP OLAP
OLTP vs. OLAP – History
OLTP vs. OLAP – History
Recent or Current Data
60-90 days Current values only No time key
No time series analysis
Historical Data
2 or more years
Historical snapshots of OLTP data Time key
Time series analysis
OLTP vs. OLAP – Data Access and Manipulation
OLTP vs. OLAP – Data Access and Manipulation
Transactions
Inserts, Updates, Deletes, Selects
Small amount of data involved in each transaction
Highly ‘indexable’ RDBMS focus
– Locking – Concurrency – Logical Unit of Work
Bulk Processes
Selects only
Large amount of data involved in each process
Not always ‘indexable’ RDBMS focus
– Parallel Loader, Query – Star Join
– Bit mapped Indexes
OLTP OLAP
OLTP vs. OLAP – Usage Patterns
OLTP vs. OLAP – Usage Patterns
Fairly Consistent Maintain a constant system
utilization pattern
OLTP OLAP
Spiked or Uneven Large period of light use and
spiked usage pattern
OLTP vs. OLAP – Summary
OLTP vs. OLAP – Summary
Aligned by Application
Typically Not Integrated
Recent or Current Data
Transactions Fairly Consistent Aligned by Dimension Must Be Integrated Historical Data Bulk Processes Spiked or Uneven OLTP OLAP Alignment: Integration: History: Data Access: Usage: Warehouse Headaches Batch Maintenance Tuning
Intro to ERM and ERD
Intro to ERM and ERD
ENTITIES
ENTITIES
ERM Terminology
ERM Terminology
ERM - Entity Relationship Model (design) ERD - Entity Relationship Diagram (graphical)
Entity - things of interest to the business, represented by boxes and implemented as tables
Attributes - things to know about an entity, implemented as columns in tables
Relationships - how entities relate, represented by lines on ERD and implemented as foreign keys
Entity Paradigms
Entity Paradigms
Rounded corners - ERD,
Square corners - Relational
Naming
Should be singular in nature
Consistency, communication, compatibility
RELATIONSHIPS
Relationships and Business Rules
Relationships and Business Rules
RX Transaction RX
Relationship - Line and “crow’s foot” represent a foreign key relationship from RX Transaction and RX
Cardinality - crow’s foot means “one or more”, absence means “one”
Relationships and Business Rules
Relationships and Business Rules
RX Transaction RX
allows Is allowed by
Optionality
Solid bar means that the relationship MUST exist Circle means that relationship MAY exist
Use words near entities with optionality symbols to complete sentences for definition
RX may allow RX Transactions
Relationships
Relationships
Vice President Department Be managed by manage One-to-One relationship:Each department must be managed by one VP. Each VP may manage one department.
Relationships
Relationships
Vice President Management Team contains Is contained in One-to-Many relationship:Each management team must contain many VP’s. Each VP may be contained in one management team.
Relationships
Relationships
Degree Employee Is held by hold Many-to-Many relationship:Each Employee must hold one or more degrees. Each Degree may be held by one or more employees.
ATTRIBUTES
Attributes - Terminology
Attributes - Terminology
Attributes are the information we wish to keep
about a particular entity
Example: Inventory Inventory Store_id Item_id Amt Units
Attributes - Implementation
Attributes - Implementation
Entities and their attributes are shown as:
ENTITY NAME (Attribute1, Attribute2, Attribute3)
Inventory (Store_id, Item_id,Amount, Units)
To specify a primary key for an entity/table, underline the appropriate Attribute(s)
Warehouse Architecture Overview
Warehouse Architecture Overview
Warehouse Overview
Warehouse Overview
Basic components Warehouse Server Warehouse Access Tool Source Systems Design StrategiesWarehouse Overview
Warehouse Overview
Designers must consider and understand
unique characteristics and requirements of all
three previous components
Ideally, a project team should pick the
best-of-class tools for storing and accessing data.
In reality, all three pieces should be selected
with regard to the others, to ensure that each
component will complement the others.
Source Systems
Source Systems
One or more operational systems will be the source(s) of the data stored in the data warehouse.
Source System A Source System B User Group A User Group B
Source Systems (cont’d)
Source Systems (cont’d)
Source systems are typically not integrated.
9Have unique key structures and unique naming conventions
9Possess overlapping data
Source systems hold current value data.
Source systems will indirectly define the scope of a warehouse.
9Only data found in source systems can be included in data warehouse; no “new” data can be created.
9Each operational system will have unique characteristics (levels of detail or granularity of data, types of data or metrics available)
Warehouse Server
Warehouse Server
DWH RDBMS HW Platform (typically UNIX-based) DWH RDBMS DWH RDBMS Distributed Architecture A DWH RDBMS DWH Gateway Distributed Architecture BWarehouse Access Tool / Architecture
Warehouse Access Tool / Architecture
DWH RDBMS HW Platfom MDDB HW Platform Client HW Platform App Server SQL SQL SQL Messaging MDDB Calls MOLAP ROLAP (2-3 tier)
Design Overview
Design Overview
The Warehouse Trade-Off Triangle
The Warehouse Trade-Off Triangle
Query Performance User Requirements Data Warehouse Maintenance
Schema
The ETL Process
ETL = Extraction,
Transformation
and Loading
The ETL Process
ETL = Extraction,
Transformation
and Loading
Batch Process – Overview
Batch Process – Overview
Source System Extract Program Extract File
Source System Server Warehouse Server
Load File DWH RDBMS File Transfer Landing Space
Batch Process – Extracts
Batch Process – Extracts
Source System Extract Program Extract File
Extracts are programs that generate data files.
Perform data transformations, data cleaning.
Perform key conversions.
Reformat data to the standards of the warehouse.
Must produce data in a file format suitable for loading into the data
Batch Process – Extracts
Batch Process – Extracts
Source System Extract Program Extract File
Source System Server
Basic Types of Extracts
1) Facts tables
Must provide load files for the following tables:
–Base tables –Historical tables –Aggregate tables
2) Lookup tables
Must provide data to populate the following tables:
–Lookup tables –Relationship tables
Batch Process – Extracts (cont’d)
Batch Process – Extracts (cont’d)
Static Extraction for the first loading of the DWH Incremental Extraction for the update of the DWH
Batch Process – File Transfers
Batch Process – File Transfers
Source System Extract Program Extract File
Source System Server Warehouse Server
Load File File Transfer Landing Space
Batch Process – File Transfers (cont’d)
Batch Process – File Transfers (cont’d)
File Transfer: Process of moving data files to data
warehouse server.
After Extracts, generated files must be moved from source systems to data warehouse server.
Design Considerations
9Transfer method and network impact ¾Usually transferred via FTP
Batch Process – File Transfers (cont’d)
Batch Process – File Transfers (cont’d)
Design Considerations
9Scheduling routines
¾If there is not enough landing space, scheduling routines must be designed.
¾Routines must coordinate file transfers and database loads, transferring a new data file only after an existing file has been loaded and is no longer needed.
Batch Process – Data Loads
Batch Process – Data Loads
Source System Extract Program Extract File Load File DWH RDBMS File Transfer Landing Space
Batch Process – Data Loads (cont’d)
Batch Process – Data Loads (cont’d)
Data Load: data loaded from extract file into database.
9Post-load processes
9Aggregation routines
Basic Types of Load Procedures
9Append new records to existing table.
9Drop table and reload updated data file.
9Update existing records.
Batch Process – Data Loads (cont’d)
Batch Process – Data Loads (cont’d)
Post Load Processes
9Must update table indexes after data loads.
9For statistics-based optimizers, must update table and index statistics after loads.
Aggregation Routines
9Must run aggregation routines if aggregate data
Batch Jobs – Overview
Batch Jobs – Overview
Basic Refresh Jobs: necessary to update tables in
warehouse with current information9Lookup Table
9Fact Table
9Aggregate Table
Maintenance Jobs: necessary to maintain tables in
warehouse9Updating fact data
9Re-organizing data
Basic Refresh Jobs – Lookup Tables
Basic Refresh Jobs – Lookup Tables
Purpose9
Apply changes in existing “organizational” systems to lookup data in data warehouse.9
Changes include addition of new items or changes to descriptive information.9
No changes to attribute keys or attribute relationships.Basic Refresh Jobs – Lookup Tables (cont’d)
Basic Refresh Jobs – Lookup Tables (cont’d)
Basic Methods9
No refresh¾Extract is run once to populate DWH
¾Often used in pilot or prototype systems
9
Drop and reload¾Existing table is dropped or emptied
¾Extract is re-run to capture current information
¾Table is loaded with new extract
9
Append to existing table¾Extract is re-run to capture current information
¾New extract and “old” or “master” lookup file are compared. ¾New “Delta” file is generated.
¾Delta is applied to master lookup file and lookup table in warehouse. ¾Delta file may be loaded into warehouse directly, OR
¾Delta may be applied to master lookup file and then use Drop and Reload method, loading the master lookup file.
¾Sophisticated batch routines, normally used in production
Basic Refresh Jobs – Lookup Tables (cont’d)
Basic Refresh Jobs – Lookup Tables (cont’d)
Org Source System Extract Program Extract File 1/96 Lookup Table
Basic Refresh Jobs – Lookup Tables (cont’d)
Basic Refresh Jobs – Lookup Tables (cont’d)
Org Source System Extract Program Master Lookup DWH RDBMS Extract File 2/96 Lookup Table Compare Program Delta File
Basic Refresh Jobs – Fact Tables
Basic Refresh Jobs – Fact Tables
Purpose9
Refresh or update fact data in DWH with the new data from source systems.Basic methods
9
Bulk or historical insert¾Extract is run to capture all data existing in source systems ¾Data is bulk-loaded into data warehouse fact tables ¾Simple batch routine
¾Used to “start” warehouse or provide initial data sets ¾Often doesn’t perform any cleansing or integration
9
Drop and reloadBasic Refresh Jobs – Fact Tables (cont’d)
Basic Refresh Jobs – Fact Tables (cont’d)
Fact Source System Extract Program DWH RDBMS Extract File for 9/95 thru 1/96 Fact Table
Basic Refresh Jobs – Fact Tables (cont’d)
Basic Refresh Jobs – Fact Tables (cont’d)
Basic methods9
Drop and reload¾Historical or Bulk extract is re-run to capture all available data ¾Existing warehouse table is emptied or truncated
¾File is inserted into empty fact table ¾Simple batch routine
¾Used in prototypes or pilot ¾Not feasible for large data sets
9
Append to existing tableBasic Refresh Jobs – Fact Tables (cont’d)
Basic Refresh Jobs – Fact Tables (cont’d)
Fact Source System Extract Program DWH RDBMS Fact Table 9/95 - 1/96 Extract File 2/96 Append 2/96
Basic Refresh Jobs – Aggregate tables
Basic Refresh Jobs – Aggregate tables
Purpose9
Refresh or aggregate tables Basic methods9
Aggregate in warehouse RDBMS¾Produce atomic extract
¾Transfer and load atomic extract into atomic fact table
¾Produce aggregate values using SQL accessing atomic fact table ¾Insert aggregate values into aggregate fact table
9
Aggregate in batch (on source systems or warehouse server)¾Produce atomic extract
¾Transfer and load atomic extract into atomic fact table ¾Produce aggregate extract from atomic extract
Basic Refresh Jobs – Aggregate Tables (cont’d)
Basic Refresh Jobs – Aggregate Tables (cont’d)
Source System Extract Program Atomic Extract
Source System Server Warehouse RDBMS Aggregate Extract Aggregate Program Atomic Fact Table Aggregate Fact Table Aggregate SQL Routines
Maintenance Jobs – Updating Fact Data
Maintenance Jobs – Updating Fact Data
Purpose9As changes are made to source system data, they should be reflected in the data warehouse.
Basic Methods
9Ignore changes.
9Wait until audited data is available.
Maintenance Jobs – Updating Fact Data (cont’d)
Maintenance Jobs – Updating Fact Data (cont’d)
Sun Mon Tue Wed Thu Fri Sat
Sun Pre Audit Data Wed Pre Audit Data Sun Post Audit Data Scenario
Audit Process produces clean data set 3 days after initial set is posted.
Maintenance Jobs – Updating Fact Data (cont’d)
Maintenance Jobs – Updating Fact Data (cont’d)
Fact Source System Extract Program Sunday DWH RDBMS Fact Table Compare Program Delta File Sun Pre Data Wed Sun Post Data
Maintenance Jobs – Re-Organizing Data
Maintenance Jobs – Re-Organizing Data
Store
Region Lookup Store
Store_id Store_desc Region_id Lookup Region Region_id Region_desc Relationship between Region and Store changes Must update foreign key in Store Lookup Region_id 2 1 2 1 1 1 2 Store_id 13 21 24 27 35 57 Store_desc San Fran Boston Dallas Philly DC Las Vegas Dallas is moved to the East Region
Maintenance Jobs – Re-Organizing Data (cont’d)
Maintenance Jobs – Re-Organizing Data (cont’d)
Lookup Store Region_id Store_id Store_desc Lookup Region Region_id
Region_desc Must update key
for Store in all tables
Best Case Worst Case
Fact Sales Region_id Store_id Item_id Week_id Sales_dollars Sales_units
Maintenance Jobs – Re-Organizing Data (cont’d)
Maintenance Jobs – Re-Organizing Data (cont’d)
Store Sales Store_id Item_id Date Sales_Dollars Sales_Units Region Sales Region_id Item_id Date Sales_Dollars Sales_Units Lookup Store Store_id Store_desc Region_id Lookup Region Region_id Region_desc Region_id 2 1 2 1 1 1 2 Store_id 13 21 24 27 35 57 Store_desc San Fran Boston Dallas Philly DC Las Vegas It is necessary to Re-aggregate table values
Batch Process – Frequency
Batch Process – Frequency
9
Batch job frequencies differ with data sources and level of detail9
Typically there will be a set of batch routines dedicated to each level of time detail (daily, weekly and monthly batch job)9
Frequency is often but not necessarily tied to the level of time detail included in the data files to be loaded during that batch routineBatch Process – Frequency (cont’d)
Batch Process – Frequency (cont’d)
Daily Daily Weekly Weekly Level of Detail Frequency
Frequency vs. Detail Chart
A A B C?? Weekly Tables Daily Tables Current Week Current week problem
References
References
Golfarelli, M., Rizzi, S., Data Warehouse: Teoria e pratica della progettazione, McGraw-Hill, 2002.