2
Need for DM: Un-answered Qs
• Lets have a look at a typical ER data model first.
• Some Observations:
– All tables look-alike, as a consequence it is difficult to identify:
• Which table is more important ?
• Which is the largest?
• Which tables contain numerical measurements of the
business?
3
Need for DM: Complexity of Representation
– Many topologies for the same ER diagram, all
appearing different.
• Very hard to visualize and remember.
• A large number of possible connections to any
two (or more) tables
4
Need for DM: The Paradox
• The Paradox: Trying to make information accessible using tables resulted in an inability to query them!
• ER and Normalization result in large number of tables which are: – Hard to understand by the users (DB programmers)
– Hard to navigate optimally by DBMS software
• Real value of ER is in using tables individually or in pairs
• Too complex for queries that span multiple tables with a large
5
ER vs. DM
ER DM
Constituted to optimize OLTP
performance. Constituted to optimize DSS query performance.
Models the micro relationships among data elements.
Models the macro
relationships among data elements with an overall
deterministic strategy.
A wild variability of the structure of ER models.
All dimensions serve as equal entry points to the
fact table. Very vulnerable to changes in
the user's querying habits, because such schemas are
asymmetrical.
Changes in users' querying habits can be
accommodated by
6
How to simplify a ER data model?
•
Two general methods:
–
De-Normalization
7
What is DM?…
• A simpler logical model optimized for decision
support.
• Inherently dimensional in nature, with a single
central fact table and a set of smaller dimensional tables.
• Multi-part key for the fact table
• Dimensional tables with a single-part PK.
8
What is DM?...
• Results in a star like structure, called star
schema or star join.
– All relationships mandatory M-1.
– Single path between any two levels.
9
Dimensions have Hierarchies
Items
Books Cloths
Fiction Text Men Women Medical
Engg
Physical Model
Physical implementation of the multidimensional
paradigm model can be:
• – Relational
• • Snowflake-schema • • Star-schema
Relational Model
• Going from multidimensional to relational
• Représentations for cubes, dimensions, classification
hierarchies and attributes
• Implementation of cubes without the classification
hierarchies is easy
• A table can be seen as a cube
• A column of a table can be considered as a dimension
mapping
• A tuple in the table represents a cell in the cube.
• If we interpret only a part of the columns as dimensions can
use the rest as measures
13
The two Schemas
Star
14
“Simplified” 3NF (Retail)
CITY DISTRICT
1
ZONE CITY DISTRICT DIVISION
MONTH QTR STORE # STREET ZONE ...
WEEK MONTH
DATE WEEK
RECEIPT # STORE # DATE ...
ITEM #
RECEIPT # ... $
ITEM # CATEGORY
1-Star schema
• – Basic idea: use a de-normalized schema for
all the dimensions
• A star schema can be obtained from the
snowflake schema through the
16
Vastly Simplified Star Schema
1-Star Schema
• Advantages
• – Improves query performance for often-used
data
• – Less tables and simple structure
• – Efficient query processing with regard to
dimensions
• Disadvantages
• – In some cases, high overhead of redundant
19
The Benefit of Simplicity
Beauty lies in close correspondence with
the business, evident even to business
20
Features of Star Schema
Dimensional hierarchies are collapsed into a single table for each dimension. Loss of Information?
A single fact table created with a single header from the detail records, resulting in:
– A vastly simplified physical data model!
– Fewer tables (thousands of tables in some ERP systems).
– Fewer joins resulting in high performance.
2- Snow-Flake Schema
Simple idea: use a table for each classification level • This table includes the ID of the classification level and other attributes
• 2 neighbor classification levels are connected by 1:n
connections e.g., from n Days to 1 Month
• The measures of a cube are maintained in a fact table
2- Snowflake Schema
• Snowflake?
• – The facts/measures are in the center
• – The dimensions spread out in each direction
Snowflake schema – Advantages
• –With a snowflake schema the size of the
dimension
• tables will be reduced and queries will run
faster
• If a dimension is very sparse (most measures • corresponding to the dimension have no data) • And/or a dimension has long list of attributes
Snowflake Schema Disadvantages
• – Fact tables are responsible for 90% of the storage
requirements
• Thus, normalizing the dimensions usually lead to insignificant
improvements
• – Normalization of the dimension tables can reduce the
• performance of the DW because it leads to a large number of
tables
• E.g., when connecting dimensions with coarse granularity
these tables are joined with each other during queries
• A query which connects Product category with Year and
28
29
The Process of Dimensional Modeling
Four Step Method from ER to DM
1. Choose the Business Process
2. Choose the Grain
3. Choose the Facts
30
Step-1: Choose the Business Process
• A business process is a major operational
process in an organization.
• Typically supported by a legacy system
(database) or an OLTP.
– Examples: Orders, Invoices, Inventory etc.
• Business Processes are often termed as Data
31
Star-1
Star-2 Snow-flake
32
Step-2: Choosing the Grain
• Grain is the fundamental, atomic level of data to be represented.
• Grain is also termed as the unit of analyses.
• Example grain statements
• Typical grains
– Individual Transactions
– Daily aggregates (snapshots) – Monthly aggregates
• Relationship between grain and expressiveness.
33
Step-2: Relationship b/w Grain
Daily aggregates 6 x 4 = 24 values Four aggregates per week
4 x 4 = 16 values
Two aggregates per week 2 x 4 = 8 values
34
The case FOR data aggregation
• Works well for repetitive queries.
• Follows the known thought process.
• Justifiable if used for max number of queries.
• Provides a “big picture” or macroscopic view.
• Application dependent, usually inflexible to
35
The case AGAINST data aggregation
• Aggregation is irreversible.
– Can create monthly sales data from weekly sales data,
but the reverse is not possible.
• Aggregation limits the questions that can be
answered.
36
The case AGAINST data aggregation
• Aggregation can hide crucial facts.
37
Aggregation hides crucial facts Example
Week-1 Week-2 Week-3 Week-4 Average
Zone-1 100 100 100 100 100
Zone-2 50 100 150 100 100
Zone-3 50 100 100 150 100
Zone-4 200 100 50 50 100
Average 100 100 100 100
38
Aggregation hides crucial facts chart
0 50 100 150 200 250
Week-1 Week-2 Week-3 Week-4
Z1 Z2 Z3 Z4
Z1: Sale is constant (need to work on it)
Z2: Sale went up, then fell (need of concern)
Z3: Sale is on the rise, why?
Z4: Sale dropped sharply, need to look deeply.
39
“We need monthly sales volume and Rs. by
week, product and Zone” Facts
Dimensions
40
•
Choose the facts that will populate each
fact table record.
– Remember that best Facts are Numeric,
Continuously Valued and Additive.
– Example: Quantity Sold, Amount etc.
41
•
Choose the dimensions that apply to each
fact in the fact table.
– Typical dimensions: time, product, geography
etc.
– Identify the descriptive attributes that explain
each dimension.
– Determine hierarchies within each dimension.
42
Step-4: How to Identify a Dimension?
• The single valued attributes during recording of a
transaction are dimensions.
Calendar_Date Time_of_Day Account _No ATM_Location Transaction_Type
Transaction_Rs
Fact Table
Dim
Time_of_day: Morning, Mid Morning, Lunch Break etc.
43
Step-4: Can Dimensions be Multi-valued?
• Are dimensions ALWYS single?– Not really
– What are the problems? And how to handle them
Calendar_Date (of inspection) Reg_No
Technician Workshop
Maintenance_Operation
How many maintenance operations are possible?
Few
44
Step-4: Dimensions & Grain
• Several grains are possible as per business
requirement.
– For some aggregations certain descriptions do not remain
atomic.
– Example: Time_of_Day may change several times during
daily aggregate, but not during a transaction
• Choose the dimensions that are applicable within the
45
Quantifying space requirement
Quantifying use of additional space using star schema
There are about 10 million mobile phone users in Pakistan. Say the top company has half of them = 500,000
Number of days in 1 year = 365
Number of calls recorded each day = 250,000 (assumed) Maximum number of records in fact table = 91 billion rows Assuming a relatively small header size = 128 bytes
Fact table storage used = 11 Tera bytes
Average length of city name = 8 characters 8 bytes
Total number of cities with telephone access = 170 (1 byte)
Space used for city name in fact table using Star = 8 x 0.091 = 0.728 TB Space used for city code using snow-flake = 1x 0.091= 0.091 TB