Lecture 7 - Dimensional Modeling (DM)

(1)

(2)

2

Need for DM: Un-answered Qs

• _{Lets have a look at a typical ER data model first.}

• _{Some Observations:}

– _{All tables look-alike, as a consequence it is difficult to identify:}

• _{Which table is more important ?}

• _{Which is the largest?}

• _{Which tables contain numerical measurements of the}

business?

(3)

3

Need for DM: Complexity of Representation

– _{Many topologies for the same ER diagram, all}

appearing different.

• _{Very hard to visualize and remember.}

• _{A large number of possible connections to any}

two (or more) tables

(4)

4

Need for DM: The Paradox

• The Paradox: Trying to make information accessible using tables resulted in an inability to query them!

• _{ER and Normalization result in large number of tables which are:} – Hard to understand by the users (DB programmers)

– _{Hard to navigate optimally by DBMS software}

• Real value of ER is in using tables individually or in pairs

• _{Too complex for queries that span multiple tables with a large}

(5)

5

ER vs. DM

ER DM

Constituted to optimize OLTP

performance. Constituted to optimize DSS query performance.

Models the micro relationships among data elements.

Models the macro

relationships among data elements with an overall

deterministic strategy.

A wild variability of the structure of ER models.

All dimensions serve as equal entry points to the

fact table. Very vulnerable to changes in

the user's querying habits, because such schemas are

asymmetrical.

Changes in users' querying habits can be

accommodated by

(6)

6

How to simplify a ER data model?

• _{Two general methods:}

–

_{De-Normalization}

(7)

7

What is DM?…

• _{A simpler logical model optimized for decision}

support.

• _{Inherently dimensional in nature, with a single}

central fact table and a set of smaller dimensional tables.

• _{Multi-part key for the fact table}

• _{Dimensional tables with a single-part PK.}

(8)

8

What is DM?...

• _{Results in a star like structure, called star}

schema or star join.

– _{All relationships mandatory M-1.}

– _{Single path between any two levels.}

(9)

9

Dimensions have Hierarchies

Items

Books Cloths

Fiction Text Men Women Medical

Engg

(10)

Physical Model

Physical implementation of the multidimensional

paradigm model can be:

• _{– Relational}

• _{• Snowflake-schema} • _{• Star-schema}

(11)

Relational Model

• _{Going from multidimensional to relational}

• _{Représentations for cubes, dimensions, classification}

hierarchies and attributes

• _{Implementation of cubes without the classification}

hierarchies is easy

• _{A table can be seen as a cube}

• _{A column of a table can be considered as a dimension}

mapping

• _{A tuple in the table represents a cell in the cube.}

• _{If we interpret only a part of the columns as dimensions can}

use the rest as measures

(12)

(13)

13

The two Schemas

Star

(14)

14

“Simplified” 3NF (Retail)

CITY DISTRICT

1

ZONE CITY DISTRICT DIVISION

MONTH QTR STORE # STREET ZONE ...

WEEK MONTH

DATE WEEK

RECEIPT # STORE # DATE ...

ITEM #

RECEIPT # ... $

ITEM # CATEGORY

(15)

1-Star schema

• _{– Basic idea: use a de-normalized schema for}

all the dimensions

• _{A star schema can be obtained from the}

snowflake schema through the

(16)

16

Vastly Simplified Star Schema

(17)

(18)

1-Star Schema

• _Advantages

• _{– Improves query performance for often-used}

data

• _{– Less tables and simple structure}

• _{– Efficient query processing with regard to}

dimensions

• _{Disadvantages}

• _{– In some cases, high overhead of redundant}

(19)

19

The Benefit of Simplicity

Beauty lies in close correspondence with

the business, evident even to business

(20)

20

Features of Star Schema

Dimensional hierarchies are collapsed into a single table for each dimension. Loss of Information?

A single fact table created with a single header from the detail records, resulting in:

– _{A vastly simplified physical data model!}

– _{Fewer tables}(thousands of tables in some ERP systems).

– _{Fewer joins resulting in high performance.}

(21)

2- Snow-Flake Schema

Simple idea: use a table for each classification level • This table includes the ID of the classification level and other attributes

• 2 neighbor classification levels are connected by 1:n

connections e.g., from n Days to 1 Month

• The measures of a cube are maintained in a fact table

(22)

2- Snowflake Schema

• _Snowflake?

• _{– The facts/measures are in the center}

• _{– The dimensions spread out in each direction}

(23)

(24)

Snowflake schema – Advantages

• _{–With a snowflake schema the size of the}

dimension

• _{tables will be reduced and queries will run}

faster

• _{If a dimension is very sparse (most measures} • _{corresponding to the dimension have no data)} • _{And/or a dimension has long list of attributes}

(25)

Snowflake Schema Disadvantages

• _{– Fact tables are responsible for 90% of the storage}

requirements

• _{Thus, normalizing the dimensions usually lead to insignificant}

improvements

• _{– Normalization of the dimension tables can reduce the}

• _{performance of the DW because it leads to a large number of}

tables

• _{E.g., when connecting dimensions with coarse granularity}

these tables are joined with each other during queries

• _{A query which connects Product category with Year and}

(26)

(27)

(28)

28

(29)

29

The Process of Dimensional Modeling

Four Step Method from ER to DM

1. Choose the Business Process

2. Choose the Grain

3. Choose the Facts

(30)

30

Step-1: Choose the Business Process

• _{A business process is a major operational}

process in an organization.

• _{Typically supported by a legacy system}

(database) or an OLTP.

– _{Examples: Orders, Invoices, Inventory etc.}

• _{Business Processes are often termed as Data}

(31)

31

Star-1

Star-2 Snow-flake

(32)

32

Step-2: Choosing the Grain

• Grain is the fundamental, atomic level of data to be represented.

• _{Grain is also termed as the unit of analyses.}

• _{Example grain statements}

• _{Typical grains}

– Individual Transactions

– Daily aggregates (snapshots) – Monthly aggregates

• _{Relationship between grain and expressiveness.}

(33)

33

Step-2: Relationship b/w Grain

Daily aggregates 6 x 4 = 24 values Four aggregates per week

4 x 4 = 16 values

Two aggregates per week 2 x 4 = 8 values

(34)

34

The case FOR data aggregation

• _{Works well for repetitive queries.}

• _{Follows the known thought process.}

• _{Justifiable if used for max number of queries.}

• _{Provides a “big picture” or macroscopic view.}

• _{Application dependent, usually inflexible to}

(35)

35

The case AGAINST data aggregation

• _{Aggregation is irreversible.}

–_{Can create monthly sales data from weekly sales data,}

but the reverse is not possible.

• _{Aggregation limits the questions that can be}

answered.

(36)

36

The case AGAINST data aggregation

• _{Aggregation can hide crucial facts.}

(37)

37

Aggregation hides crucial facts Example

Week-1 Week-2 Week-3 Week-4 Average

Zone-1 100 100 100 100 100

Zone-2 50 100 150 100 100

Zone-3 50 100 100 150 100

Zone-4 200 100 50 50 100

Average 100 100 100 100

(38)

38

Aggregation hides crucial facts chart

0 50 100 150 200 250

Week-1 Week-2 Week-3 Week-4

Z1 Z2 Z3 Z4

Z1: Sale is constant (need to work on it)

Z2: Sale went up, then fell (need of concern)

Z3: Sale is on the rise, why?

Z4: Sale dropped sharply, need to look deeply.

(39)

39

“We need monthly sales volume and Rs. by

week, product and Zone” Facts

Dimensions

(40)

40

• _{Choose the facts that will populate each}

fact table record.

– _{Remember that best Facts are Numeric,}

Continuously Valued and Additive.

– _{Example: Quantity Sold, Amount etc.}

(41)

41

• _{Choose the dimensions that apply to each}

fact in the fact table.

– _{Typical dimensions: time, product, geography}

etc.

– _{Identify the descriptive attributes that explain}

each dimension.

– _{Determine hierarchies within each dimension.}

(42)

42

Step-4: How to Identify a Dimension?

• _{The single valued attributes during recording of a}

transaction are dimensions.

Calendar_Date Time_of_Day Account _No ATM_Location Transaction_Type

Transaction_Rs

Fact Table

Dim

Time_of_day: Morning, Mid Morning, Lunch Break etc.

(43)

43

Step-4: Can Dimensions be Multi-valued?

• _{Are dimensions ALWYS single?}

– _{Not really}

– _{What are the problems? And how to handle them}

_{Calendar_Date (of inspection)} _{Reg_No}

_Technician _Workshop

_{Maintenance_Operation}

 How many maintenance operations are possible?

 _Few

(44)

44

Step-4: Dimensions & Grain

• _{Several grains are possible as per business}

requirement.

– _{For some aggregations certain descriptions do not remain}

atomic.

– _{Example: Time_of_Day may change several times during}

daily aggregate, but not during a transaction

• _{Choose the dimensions that are applicable within the}

(45)

45

Quantifying space requirement

Quantifying use of additional space using star schema

There are about 10 million mobile phone users in Pakistan. Say the top company has half of them = 500,000

Number of days in 1 year = 365

Number of calls recorded each day = 250,000 (assumed) Maximum number of records in fact table = 91 billion rows Assuming a relatively small header size = 128 bytes

Fact table storage used = 11 Tera bytes

Average length of city name = 8 characters  8 bytes

Total number of cities with telephone access = 170 (1 byte)

Space used for city name in fact table using Star = 8 x 0.091 = 0.728 TB Space used for city code using snow-flake = 1x 0.091= 0.091 TB