Chapter 6. Data Modeling for a Data Warehouse
6.4 Dimensional Modeling
In some respects, dimensional modeling is simpler, more expressive, and easier to understand than ER modeling. But, dimensional modeling is a relatively new concept and not firmly defined yet in details, especially when compared to ER modeling techniques.
This section presents the terminology that we use in this book as we discuss dimensional modeling. For more detailed techniques, methodologies, and hints, refer to Chapter 8, “Data Warehouse Modeling Techniques” on page 81.
6.4.1 Basic Concepts
Dimensional modeling is a technique for conceptualizing and visualizing data models as a set of measures that are described by common aspects of the business. It is especially useful for summarizing and rearranging the data and presenting views of the data to support data analysis. Dimensional modeling focuses on numeric data, such as values, counts, weights, balances, and occurrences.
Dimensional modeling has several basic concepts:
• Facts
• Dimensions
• Measures (variables)
6.4.1.1 Fact
A fact is a collection of related data items, consisting of measures and context data. Each fact typically represents a business item, a business transaction, or an event that can be used in analyzing the business or business processes.
In a data warehouse, facts are implemented in the core tables in which all of the numeric data is stored.
6.4.1.2 Dimension
A dimension is a collection of members or units of the same type of views. In a diagram, a dimension is usually represented by an axis. In a dimensional model, every data point in the fact table is associated with one and only one member from each of the multiple dimensions. That is, dimensions determine the contextual background for the facts. Many analytical processes are used to quantify the impact of dimensions on the facts.
Dimensions are the parameters over which we want to perform Online Analytical Processing (OLAP). For example, in a database for analyzing all sales of
products, common dimensions could be:
• Time
• Location/region
• Customers
• Salesperson
• Scenarios such as actual, budgeted, or estimated numbers
Dimensions can usually be mapped to nonnumeric, informative entities such as branch or employee.
Dimension Members: A dimension contains many dimension members. A dimension member is a distinct name or identifier used to determine a data item′s position. For example, all months, quarters, and years make up a time dimension, and all cities, regions, and countries make up a geography
dimension.
Dimension Hierarchies: We can arrange the members of a dimension into one or more hierarchies. Each hierarchy can also have multiple hierarchy levels.
Every member of a dimension does not locate on one hierarchy structure.
A good example to consider is the time dimension hierarchy as shown in Figure 14. The reason we define two hierarchies for time dimension is because a week can span two months, quarters, and higher levels. Therefore, weeks cannot be added up to equal a month, and so forth. If there is no practical benefit in analyzing the data on a weekly basis, you would not need to define another hierarchy for week.
Figure 14. Multiple Hierarchies i n a Time Dimension
6.4.1.3 Measure
A measure is a numeric attribute of a fact, representing the performance or behavior of the business relative to the dimensions. The actual numbers are called as variables. For example, measures are the sales in money, the sales volume, the quantity supplied, the supply cost, the transaction amount, and so forth. A measure is determined by combinations of the members of the dimensions and is located on facts.
6.4.2 Visualization of a Dimensional Model
The most popular way of visualizing a dimensional model is to draw a cube. We can represent a three-dimensional model using a cube. Usually a dimensional model consists of more than three dimensions and is referred to as a hypercube.
However, a hypercube is difficult to visualize, so a cube is the more commonly used term.
In Figure 15 on page 44, the measurement is the volume of production, which is determined by the combination of three dimensions: location, product, and time.
The location dimension and product dimension have their own two levels of hierarchy. For example, the location dimension has the region level and plant
Chapter 6. Data Modeling for a Data Warehouse 43
level. In each dimension, there are members such as the east region and west region of the location dimension. Although not shown in the figure, the time dimension has its numbers, such as 1996 and 1997. Each subcube has its own numbers, which represent the volume of production as a measurement. For example, in a specific time period (not expressed in the figure), the Armonk plant in East region has produced 11,000 CellPhones, of model number 1001.
Figure 15. The Cube: A Metaphor for a Dimensional M o d e l
6.4.3 Basic Operations for OLAP
Dimensional modeling is primarily to support OLAP and decision making. Let′s review some of the basic concepts of OLAP to get a little better grasp of OLAP business requirements so that we can model the data warehouse more
effectively.
Four types of operations are used in OLAP to analyze data. As we consider granularity, we can perform the operations of drill down and roll up. To browse along the dimensions, we use the operations slice and dice. Let′s explore what those terms really mean.
6.4.3.1 Drill Down and Roll Up
Drill down and roll up are the operations for moving the view down and up along the dimensional hierarchy levels. With drill-down capability, users can navigate to higher levels of detail. With roll-up capability, users can zoom out to see a summarized level of data. The navigation path is determined by the hierarchies within dimensions. As an example, look at Figure 16 on page 45. While you analyze the monthly production report of the west region plants, you might like to review the recent trends by looking at past performance by quarter. You would be performing a roll-up operation by looking at the quarterly data. You may then wonder why the San Jose plant produced less than Boulder and would need more detailed information. You could then use the drill down-operation on the report by Team within a Plant to understand how the productivity of Team 2 (which is lower in all cases than the productivity for Team 1) can be improved.
Figure 16. Example of D r i l l D o w n and Rol l Up
6.4.3.2 Slice and Dice
Slice and dice are the operations for browsing the data through the visualized cube. Slicing cuts through the cube so that users can focus on some specific perspectives. Dicing rotates the cube to another perspective so that users can be more specific with the data analysis. Let′s look at another example, using Figure 17 on page 46. You may be analyzing the production report of a specific month by plant and product, so you get the quarterly view of gross production by plant. You can then change the dimension from product to time, which is dicing.
Now, you want to focus on the CellPhone only, rather than gross production. To do this, you can cut off the cube only for the CellPhone for the same dimensions, which is slicing.
Those are some of the key operations used in data analysis. To enable those types of operations requires that the data be stored in a specific way, and that is in a dimensional model.
6.4.4 Star and Snowflake Models
There are two basic models that can be used in dimensional modeling:
• Star model
• Snowflake model
Sometimes, the constellation model or multistar model is introduced as an extension of star and snowflake, but we will confine our discussion to the two basic structures. That is sufficient to explain the issues in dimensional modeling. This section presents only a basic introduction to the dimensional modeling techniques. For a detailed description, refer to Chapter 8, “Data Warehouse Modeling Techniques” on page 81.
Chapter 6. Data Modeling for a Data Warehouse 45
Figure 17. Example of Slice and Dice
6.4.4.1 Star Model
Star schema has become a common term used to connote a dimensional model.
Database designers have long used the term star schema to describe dimensional models because the resulting structure looks like a star and the logical diagram looks like the physical schema. Business users feel
uncomfortable with the term schema, so they have embraced the more simple sounding term of star model. In this book, we will also use the term star model.
The star model is the basic structure for a dimensional model. It typically has one large central table (called the fact table) and a set of smaller tables (called the dimension tables) arranged in a radial pattern around the fact table.
Figure 18 on page 47 shows an example of a star schema. It depicts sales as a fact table in the center. Arranged around the fact table are the dimension tables of time, customer, seller, manufacturing location, and product.
Whereas the traditional ER model has an even and balanced style of entities and complex relationships among entities, the dimensional model is very
asymmetric. Even though the fact table in the dimensional model is joined with all the other dimension tables, there is only a single join line connecting the fact table to the dimension tables.
6.4.4.2 Snowflake Model
Dimensional modeling typically begins by identifying facts and dimensions, after the business requirements have been gathered. The initial dimensional model is usually starlike in appearance, with one fact in the center and one level of several dimensions around it.
The snowflake model is the result of decomposing one or more of the
dimensions, which sometimes have hierarchies themselves. We can define the many-to-one relationships among members within a dimension table as a
Figure 18. Star Model.
separate dimension table, forming a hierarchy. For example, the seller dimension in Figure 18 on page 47 is decomposed into subdimensions outlet, region, and outlet type in Figure 19 on page 48. This type of model is derived from the star schema and, as can be seen, looks like a snowflake.
The decomposed snowflake structure visualizes the hierarchical structure of dimensions very well. The snowflake model is easy for data modelers to understand and for database designers to use for the analysis of dimensions.
However, the snowflake structure seems more complex and could tend to make the business users feel more uncomfortable working with it than with the simpler star model. Developers can also elect the snowflake because it typically saves data storage. Consider a banking application where there is a very large account table for one of the dimensions. You can easily expect to save quite a bit of space in a table of that size by not storing the very frequently repeated text fields, but rather putting them once in a subdimension table. Although the snowflake model does save space, it is generally not significant when compared to the fact table. Most database designers do not consider the savings in space to be a major decision criterion in the selection of a modeling technique.
6.4.5 Data Consolidation
Another major criterion for the use of OLAP is the fast response time for ad hoc queries. However, there could still be performance issues depending on the structure and volume of data. For a consistently fast response time, data consolidation (precalculation or preaggregation) is required. By precalculating and storing all subtotals before the query is issued, you can reduce the number of records to be retrieved for the query and maintain consistent and fast
performance. The trade-off is that you will have to know how the users typically make their queries to understand how to consolidate. When users drill down to details, they typically move along the levels of a dimension hierarchy.
Therefore, that provides the paths to consolidate or precalculate the data.