- 1-
Data Warehouse design
Design of Enterprise Systems University of Pavia
DATA MODELLING
- 3-
Data Modelling
Important premise
– Data warehouses typically reside on a RDBMS (Relational Database Management System)
Open questions
– Data Model: How will the data be organized within relational tables? What are the subject areas? What are the entities? How will they relate? What will they mean?
– Data Architecture: How will the relational tables be organized? Will they reside in a single central data warehouse, or will they reside in distributed data
Data Modeling Methodology
Data modeling methodology includes three phases, created in following sequence:
– Conceptual Data Model:
• How will the enterprise be organized within the data warehouse?
• The Conceptual Data Model defines the subject areas and major entities. – Logical Data Model:
• Which entities exist within each subject area? • How will those entities relate to each other?
• The relationships extend beyond just primary key/foreign key
relationships.
– Physical Data Model:
• How will that data be assigned to columns of a data warehouse?
• Though not quite the Data Definition Language (DDL) of the tables and databases, the Physical Data Model is very close to the DDL of the tables and databases.
- 5-
Conceptual data model
The Conceptual Data Model identifies the main subject areas of a data warehouse.
Implicitly, the Conceptual Data Model also identifies the boundary of a data
warehouse because any subject area not included in the Conceptual Data Model
is not included in the data warehouse.
The Conceptual Data Model provides an overall map to a data warehouse. The applicable subject area can identify the location of data within a data warehouse. The definition of a subject area within the Conceptual Data Model is not a
binding statement that the data, tables, and possibly database for that subject
area will be created.
Data warehouse development best occurs in iterative slices of development, not all at once.
Conceptual data model
A Conceptual Data Model will require numerous iterations of brainstorming, model, review, model, brainstorming, model, etc.
A Conceptual Data Model is the first foundation of a data warehouse and
provides the roadmap to a data warehouse via the subject areas of the enterprise. Details are added as it is transformed into a Logical Data Model.
- 7-
Logical Data Model
Physical Data Model
A Physical Data Model is a representation of the data structures that will hold the data in a data warehouse, and is directly based on a Logical Data Model.
Physical Data Model can be derived by Model-to-Model (M2M) transformation The structures of a Physical Data Model are databases, tables, and views. A Physical Data Model indicates the physical data types of all fields, and
continues the display of relations that originally appeared in the prerequisite Logical Data Model.
We discuss three primary varieties of data warehouse: – Third normal form
– Star schema
- 9-
Third normal form
A Third Normal Form data warehouse uses data structures that are normalized to the Third Normal Form.
The result is many small tables rather than a few large tables, and many table
joins rather than a few table joins.
From third normal form to dimensional data models
In the early days of data
warehousing, early developers
of decision support systems
used the best methods they
had available to them.
The best, and arguably only,
method of modeling data in a
RDBMS was the Third Normal
Form.20
The developers’ knowledge
and methods of Third Normal
Form were based on their
experience, which occurred in
operational relational
databases.
Kimball introduced the
Dimensional Data Model.
Kimball used the Dimensional
Data Model to solve
shortcomings of Third Normal
Form data warehouses:
1. Third Normal Form data
warehouses do not model a
business or subject area, they
model relationships between
data elements instead.
2. Third Normal Form data
warehouse structures are too
erratic and scattered to be
- 11-
Facts
A Fact table incorporates entities and measures that were identified in the Logical Data Model. Typically, those entities include:
– Time: An event happens at a moment in time – Place: An event happens in a place or space – Person: People are usually involved in events
– Thing: Events are often focused on or around an object
– Equipment: Participants in an event often use a tool or equipment – How: The action that was performed
– Why: Sometimes, but not always, a reason for the action is provided
A Fact table joins to the surrounding Dimension tables using Primary Key/Foreign Key relations.
The Primary Key of a single row from a Dimension Row is embedded into a row in a Fact table. That way, a Fact row will join to one and only one row within each Dimension table.
- 13-
Dimensions
The hierarchical grain at which a Dimension table joins to a Fact table is the lowest relevant level of that Dimension’s hierarchy.
No Structured Query Language (SQL) can summarize lower than the hierarchical grain of a Fact table.
– For instance, if the Fact table has a Product Key and the Dimension table has a Product Key and a Subproduct Key, the Dimension’s Subproduct Key will never be used.
So, the lowest relevant hierarchical grain of a Dimension table is the hierarchical
Dimensions
Ideally, a Dimension table contains in each row the entire hierarchy for a given Primary Key. For example:
– A Date Dimension row for the date August 27, 1993 should include every hierarchical level (e.g., Week, Month, Quarter, Year, etc.) for that Date. – A Facility Dimension row for Warehouse #253 should include every
hierarchical level (e.g., Warehouse, Warehouse Group, Warehouse Class,
Warehouse Type, Warehouse Region, Warehouse District, Warehouse Division, etc.) for that warehouse.
By including all hierarchical levels in a single row, a data warehouse customer can summarize data at the Week and Warehouse Region, or Month and Warehouse
Type, or Quarter and Warehouse Group without adding any new tables to the SQL. Once a Fact table can join to a Warehouse Dimension and a Date Dimension
table, all the hierarchical information necessary to summarize at a higher level is already present.
- 15-
Dimensions and join strategies
Dimension tables are very flexible.
Attributes, hierarchies, and hierarchical levels can be added, removed, or changed by simply modifying the columns in a Dimension table.
– Rather than update five normalized tables, a single update to a single table can add a new hierarchy and the Dimension joins to that hierarchy.
– Immediately, the new hierarchy is available to any Fact table joined with that Dimension table.
The Primary Key/Foreign Key joining between a Fact table and a Dimension table can occur by various permutations of
– Source Native Key: the identifier for an entity provided by a source system – Surrogate Key: a key generated by and for a data warehouse
Source Native Key
The simplest method by which a
Fact table can join with a
Dimension table is by using the
Source Native Key, in this example
labeled “Thing.”
Apparently, the enterprise has two
unique identifiers: Chair and
Table.
As long as the Dimension table
has only one row for the entity
labeled Chair and only one row for
the entity labeled Table, this
method will work.
The SQL WHERE clause joins only
on the Thing field:
- 17-
Source Native Key with Dates
Time Variance can be added to
the relation between the Fact and
Dimension tables. In this example,
each instance of Chair and Table
are captured with their Begin and
End Dates.
The join between the Fact and
Dimension tables must include
the Source Native Key and Dates.
The Dimension table can have
only one instance of Chair and
Table on a single Date.
The SQL WHERE clause joins on
the Thing and Date fields:
– Fact.Thing = Dimension.Thing And Fact.Date between
Dimension.Begin_Date and Dimension.End_Date
Surrogate Key
In a data warehouse that must
represent disparate source
systems, Surrogate Keys are used
to combine and conform entity
keys that are not coordinated in
the enterprise.
If so, the Fact table can join the
Dimension table using the
generated Surrogate Key.
This approach requires the
Dimension table have only one
row for each Surrogate Key.
The SQL WHERE clause joins on
the Thing Key field:
– Fact.Thing_Key =- 19-
Conformed Dimensions
Fact tables alone are designed to capture all business events.
Ideally, all business events in an enterprise will be able to share the
same dimension tables.
In this example, both Fact tables can share the Product, Place, and Date Dimension tables. These shared tables are called Conformed Dimensions. The Manufacture Fact table is not able
to share the Store table because the manufacturing plant is not a store. This practice of conforming
dimensions so multiple Fact tables can be shared is relevant to the discussion about data warehouse as a
Snowflake Schema
Sometimes a Dimension, or part
of a Dimension, is too complex or
volatile to work well in a single
Dimension row. In such situations,
part of the Dimension is
normalized out of the Dimension
yielding a Dimension of a
Dimension, or a Sub-dimension.
Splitting a Dimension in a Star
Schema yields a Snowflake
Schema.
Example shows the progression
from a completely de-normalized
Facility Dimension table to a fully
normalized Facility Hierarchy.
– In the fully normalized Facility Hierarchy, the Facility Dimension table is no longer necessary and is
- 21-
DATA STORAGE
Enterprise Data Warehouse
An Enterprise Data Warehouse (EDW) is a single centralized database or set of databases on one platform. Typically, an EDW is the core of a data warehouse. Example shows an EDW containing six subject areas.
EDW Data Architecture locates all the subject areas of a data warehouse inside the EDW. By locating all the subject areas in one RDBMS, a data warehouse
- 23-
EDW performances
A centralized EDW concentrates all the data volume and throughput into one RDBMS and RDBMS platform.
– The data volumes and throughput, therefore, are a major consideration in the design of an EDW.
– Data model interaction with a RDBMS and RDBMS platform are part of that consideration.
Knowing that the hardware will be pushed to its maximum capacity and
throughput, a data warehouse designer must do the homework necessary to
optimize the databases, tables, and views on a specific RDBMS on a platform.
When all the data in a data warehouse is first integrated into an EDW, the data warehouse team is able to apply the rigor and discipline of Data Quality measurements and communication of metadata to that data.
Data Marts
A Data Mart is a separate database or set of databases, each with a specific focus.
– That focus can be either a subject area or a decision support need (e.g., auditing, loss prevention, or profitability).
A Data Mart is created when an EDW cannot provide data in the manner
required by data customers, and the business need for data in that form
justifies the expense and overhead of a Data Mart.
Another common justification for a Data Mart is data segregation.
– A business area needs to include sensitive data (e.g., proprietary, financial, medical, etc.), which cannot be available to anyone outside the business area. – A business area needs to interact with an external business or government
- 25-
Data Marts
A Data Mart is physically or logically separate from the EDW from which it receives data.
A Data Mart is a subset of an EDW and receives at least some of its data from an EDW.
The load cycle of a Data Mart, therefore, is no faster than the load cycle of the EDW that feeds it.
Data Marts
A Data Mart can be achieved by two basic methods.
The first method is to create a physical set of databases and tables, which are located on a platform separate and removed from the data warehouse platform.
– This method provides the maximum possible isolation of the Data Mart. The data must be physically transported from the data warehouse platform to the Data Mart platform.
– This method is also the most expensive, including the cost of a separate platform, data transport applications, and the maintenance of the separate platform and transport applications.
The second method is to define a set of views that draw their data from the data warehouse: a View Data Mart.
– A Data Mart based on views must, of course, be located on the data warehouse platform.
– While this method does not incur the overhead and cost of a separate platform, a View Data Mart does not have the independence of a separate
- 27-
Operational Data Store
An Operational Data Store (ODS) reflects the data of a single subject area as it exists in the operational environment.
The value of an ODS is that it leverages the strategic technologies of a data warehouse to answer tactical questions.
ODS customers can see the data from their business unit without interrupting
or interfering with their operational business applications, or incurring the
Performance improvement
Data warehouse customers always have a common complaint: performance. Technology-based solutions – Database tuning – SQL tuning – Indexing – Optimizer implementation Architecture-based solutions – Summaries – Aggregates
Here we focus only on architecture-based solutions.
The point of architecture-based solutions is to incur the high system resource consumption once during off-peak hours to avoid multiple consumptions of
- 29-
Summaries
A Summary is a table that stores the results of a SQL arithmetic SUM statement that has been applied to a Fact table.
The arithmetic portion of a Fact table is summed, while simultaneously one or
more hierarchical levels of detail are removed from the data in a Fact table.
For example:
– Intraday Fact data is summed at the Day level. The resulting data is stored in a Daily Summary table. For that data, the lowest grain is the Day.
– Store Fact data is summed at the Region level. The resulting data is stored in a Region Summary table. For that data, the lowest grain is the Region.
The intention of a Summary table is to perform the summation of arithmetic
Fact data only once, rather than many times.
By incurring the resource consumption necessary to summarize a Fact table, data warehouse customers will receive the previously summarized data they want
Aggregates
An Aggregate is a table that stores the results of SQL JOIN statements, which have been applied to a set of Dimension tables.
The hierarchies and attributes above an entity are pre-joined and stored in a
table.
For example:
– The Product entity, its levels of hierarchy and management area pre-joined into a single table that stores the result set. The grain of this result set is the Product.
– The Facility entity, its levels of geographic and management hierarchy are pre-joined into a single table that store the result set. The grain of this result set is the Facility.
The intention of an Aggregate table is to perform the joins of large sets of
Dimension data only once.
- 31-
ASSIGNMENT
Guidelines
Choose one among the KPIs
in next page
Design:
– Logical model
– Physical model
Use Dimensional Fact
Models (DFMs) considering:
– all applicable dimensions
– data profile (from previous
deliverable)
Install SQL Power Architect
and practice designing the
physical model
- 33-
KPIs selection
Quality
Service
Efficiency
A g en t Su pe rvi so rs Cu st om er T op m an ag em en t Error Rate Call quality
First call resolution
Service Level Speed of Answer Delay in queue Staff shrinkage Staffing efficiency Absenteism Availability Call cost
Agent training level Protocol adherence Error Rate
Call quality
First call resolution
Occlusion Abandon rate Self-service availability Service Level Speed of Answer Delay in queue
Average talking time Average back-office time Occupancy Staff shrinkage Staffing efficiency Absenteism Availability Traffic Upselling Error Rate Call quality Service Level Speed of Answer Delay in queue Cost to customer
Law 626 Delay in queue System error rate
Execution time Learning time Staff shrinkage
KPIs to be implemented
All the groups: Traffic
– Number of calls One KPI per group: Abandon rate
– Number of abandoned calls (with no answer) Speed of answer (or average waiting time)
– ∑ waiting time duration / number of calls Average talking time
– ∑ talking time duration / number of calls Average back-office time
- 35-
Physical model implementation
Open PgAdmin III
Open your LOCAL postgreSQL server
Physical model implementation
Open SQL Power Architect
- 37-