While dimension tables represent how users want to see data, fact tables represent the data users want to see. Facts are almost always numeric values such as dollar sales, unit sales, employee turnover, inventory levels, portfolio churn rate, and so forth. Often these values are fully additive, meaning that a value such as dollar sales
4 0
B u s i n e s s I n t e l l i g e n c e w i t h M i c r o s o f t O f f i c e P e r f o r m a n c e P o i n t S e r v e r 2 0 0 7 Torrey Donald Raghu VP Human Resources Molly VP Sales Hannah Director Director Director Yoshi Mario Team Leader Benefits Admin Payroll Clerk Patty Benny Hailey Admin Assistant CEOFigure 3-3 A simplified organizational chart shows that the hierarchical structure of a parent-child dimension is not predictable as far as the number or consistency of levels.
EmployeeID Name Title ManagerID
1 Torrey CEO <null>
2 Hailey Admin Assistant 1
3 Molly VP Sales 1
4 Donald VP Human Resources 1
5 Hannah Director 3
6 Yoshi Director 3
7 Raghu Director 4
8 Mario Team Leader 7
9 Benny Benefits Admin 8
10 Patty Payroll Clerk 8
Table 3-2 The Organizational Chart in a Relational Table, Showing that the ManagerID Is Tied to the EmployeeID.
C h a p t e r 3 : D a t a W a r e h o u s i n g a n d B u s i n e s s I n t e l l i g e n c e
4 1
can be summed by any dimension and provide a valid value. Other facts, such as inventory levels, are semi-additive, meaning they cannot simply be summed across all dimensions. The ending inventory for the first quarter cannot be added to the ending inventory for the second quarter to give a valid inventory level for the first half of the year. Instead, an average can be used, or simply the ending inventory for the second quarter.
Facts, also called measures, reside in a fact table. The fact table is actually fully normalized by nature, unlike the denormalized dimension tables. Typically, dimension tables have single-field primary keys, often of an integer data type. Each fact represents a unique combination of items for the lowest level of detail for each dimension.
As an example, assume three simple dimensions: Time, Product, and Customer. The Time dimension goes to the Day level, the Product dimension goes to the SKU level, and the Customer dimension goes to the Customer level. This means that the level of detail, or granularity, of the fact table is by customer, day, and product. Therefore, a single record contains a particular item sold to a particular customer on a particular day. If the customer ordered the same item on the same day in two or more different orders, the products will be summed to a single record.
Each fact record will contain a foreign key back to the dimension tables, and these foreign keys make up the primary key of the fact table; after the key fields are the measures themselves. Thus, the fact table is often made up entirely, or nearly so, of numeric fields. This means that the fact table is relatively narrow, since storing numbers requires less space than storing descriptive strings. While the fact table may be narrow, it is often extremely large as far as the number of records being held. In many cases, the fact table makes up ninety eight percent or more of the total storage used by the relational warehouse.
To illustrate the sizing of a warehouse, consider the same three dimension tables: Time, Product, and Customer. If tracking 10 years of data with a daily grain, this means that the Time table will hold approximately 3652 records. The company might have 10,000 products and sell to 100,000 customers. While not every customer will buy every product every day, imagine that they did: the fact table will hold 3652 × 10,000 × 100,000 records, or 3,652,000,000,000, or 3.65 trillion records. Even at one tenth of one percent of this value, the fact table would still hold almost four billion records. These are indeed impressive numbers.
Because fact tables are often so large, most organizations estimate the storage requirements for their relational data warehouse simply by estimating the number of rows in the fact table, multiplying it by the size of the record, and then adding a percentage for the indexes. This typically works well, because the dimension tables are relatively small compared to the fact table, and most warehouse architects tend to overestimate the number of records in order to build in a cushion. Strategies for backing up and restoring large fact tables are critical, but are beyond the scope of this book.
4 2
B u s i n e s s I n t e l l i g e n c e w i t h M i c r o s o f t O f f i c e P e r f o r m a n c e P o i n t S e r v e r 2 0 0 7A relational warehouse is often called a star schema, simply because of the look of the schema. In the middle sits a fact table with one or more measures. The fact table has a join back to each dimension table, which often surround the fact table in diagrams. The practice of placing the fact table at the center of the schema diagram is merely tradition, but it helps those familiar with relational warehousing quickly grasp the structure of the warehouse. In some cases a dimension table is normalized somewhat, so that a Product dimension is broken into Product, Product Category, and Product Subcategory tables. This structure is called a snowflake schema because the dimension tables branch out, much like the ice crystals of a snowflake. Figure 3-4 shows a simple star schema.
In many data warehouses, there are multiple fact tables. This may be due to several factors, two of which are relatively common. First, different fact tables may contain different dimensions. A fact table that supports manufacturing will have a product dimension and an assembly line dimension. A fact table that supports sales will have a product dimension and a customer dimension. Assembly line has nothing to do with
C h a p t e r 3 : D a t a W a r e h o u s i n g a n d B u s i n e s s I n t e l l i g e n c e
4 3
sales, and the customer has nothing to do with manufacturing. Note that these fact tables should, however, have a conformed product dimension table joined to them.
Second, fact tables may be at differently levels of granularity. Fact tables that contain actual data are typically at lower levels of detail, such as Day. Fact tables designed to hold forecast or budget values are often at higher levels of detail, such as Month or Quarter. Therefore, while both the sales and forecast fact tables contain a Time dimension, they have a different grain and therefore cannot be a single fact table.
One aspect of fact tables that breaks with traditional normalization techniques is that calculated fields are often included as some of the measures. Typically, calculated fields are not stored in a relational table, but there is nothing wrong with storing calculated fields in a fact table. In some cases this can improve performance by calculating the values ahead of time instead of requiring the cube engine to calculate the values on the fly. Calculations in the cube engine will be mentioned later in this chapter.