Chapter 10. Populating the Data Warehouse
A.6 CelDial Metadata - Proposed Solution
No model is complete without its metadata. We include here a sample of metadata that could be used for our proposed solution. It is not complete, but it provides much of the needed metadata. It is left as an exercise for the reader to analyze the sample and try to determine additional metadata required for a complete solution.
• MODEL METADATA
Name: Inventory
Definition: This model contains inventory data for each product model in each manufacturing plant, on a daily basis.
Purpose: The purpose of this model is to facilitate the analysis of inventory levels.
Contact Person: Plant Manager
Dimensions: Manufacturing, Product, and Time
Facts: Inventory
Measures: Quantity on hand, Reorder level, Total cost, and Total Revenue
Name: Sales
Definition: This model contains sales data for each product model, on each order, on a daily basis.
Purpose: The purpose of this model is to facilitate the analysis of product sales.
Contact Person: Regional Sales Manager
Dimensions: Customer, Manufacturing, Product, Seller, and Time
Facts: Sale
Measures: Total cost, Total revenue, Total quantity sold, and Discount amount
• FACT METADATA
Name: Inventory
Definition: This fact contains inventory data for each product model in each manufacturing plant, on a daily basis.
Alias: None
Load Frequency: Daily Load Statistics:
• Last Load Date: N/A
• Number of Rows Loaded: N/A Usage Statistics:
• Average Number of Queries/Day: N/A
• Average Rows Returned/Query: N/A
• Average Query Runtime: N/A
• Maximum Number of Queries/Day: N/A
• Maximum Rows Returned/Query: N/A
• Maximum Query Runtime: N/A
Archive Rules: Data will be archived after 36 months, on a monthly basis.
Archive Statistics:
• Last Archive Date: N/A
• Date Archived to: N/A
Purge Rules: Data will be purged after 48 months, on a monthly basis.
Purge Statistics:
• Last Purge Date: N/A
• Date Purged to: N/A
Data Quality: Inventory levels may fluctuate throughout the day as more stock is received into inventory from production and stock is shipped out to retail stores and customers. The
measures for this fact are collected once per day and thus reflect the state of inventory at that point in time, which is the end of the working day for a plant.
Data Accuracy: The measures of this fact are 97.5% accurate at the point in time they represent. This is based on the results of physical inventories matched to recorded inventory levels.
No inference can be made from these measures as to values at points in time not recorded.
Grain of Time: The measures of this fact represent inventory levels on a daily basis.
Key: The key to an inventory fact is the combination of the keys of its dimensions: Manufacturing, Product, and Time.
Appendix A. The CelDial Case Study 171
Key Generation Method: The time portion of the key is simply the date the inventory level is being recorded for. The product key is retrieved from the product translation table using the product ID, model ID, and the inventory level date. The manufacturing key is retrieved from the manufacturing translation table using the region ID and plant ID.
Source:
• Name: Inventory Table
• Conversion Rules: Each row in each inventory table is copied into the inventory fact on a daily basis.
• Selection Logic: All rows are selected from the inventory table in each plant.
• Name: Sale Fact
• Conversion Rules: The rows representing sales of an individual model produce from an individual plant are summarized for each day and the result appended to the relevant inventory fact.
• Selection Logic: All rows for the day the inventory fact is being loaded with are selected from the sale fact.
Measures: Quantity on hand, Reorder level, Total cost, Total revenue Dimensions: Manufacturing, Product, and Time
Subsidiary Facts: None
Contact Person: Plant Manager
Name: Sale
Definition: This fact contains sale data for each order which has been recorded in the sales systems at each retail store and corporate sales office.
Alias: None
Load Frequency: Daily
Load Statistics:
• Last Load Date: N/A
• Number of Rows Loaded: N/A Usage Statistics:
• Average Number of Queries/Day: N/A
• Average Rows Returned/Query: N/A
• Average Query Runtime: N/A
• Maximum Number of Queries/Day: N/A
• Maximum Rows Returned/Query: N/A
• Maximum Query Runtime: N/A
Archive Rules: Data will be archived after 36 months, on a monthly basis.
Archive Statistics:
• Last Archive Date: N/A
• Date Archived to: N/A
Purge Rules: Data will be purged after 48 months, on a monthly basis.
Purge Statistics:
• Last Purge Date: N/A
• Date Purged to: N/A
Data Quality: It is possible for errors to be made by staff completing an order. However, the numbers recorded represent what is actually contracted with the customer and must be honored.
Data Accuracy: The measures of this fact are 100% accurate in that they represent what was actually sold.
Grain of Time: The measures of this fact represent sales of a given product on a given order.
Key: The key to a sale fact is the combination of the keys of its dimensions: Customer, Manufacturing, Order, Product, Seller, and Time. Note that Order does not have any additional attributes and thus does not physically exist as a dimension. Its primary function is to logically group product sales for analysis.
Key Generation Method: The time portion of the key is simply the date the sale takes place. The product key is retrieved from the product translation table using the product ID, model ID, and the sale date. The manufacturing key is retrieved from the manufacturing translation table using the region ID and plant ID. The seller key is retrieved from the seller translation table using the region ID, outlet ID, and salesperson ID. The customer key is retrieved from the customer translation table using the customer ID and customer location ID. The order ID from the original order is used as the order key for the sale fact.
Source:
• Name: Order Table
• Conversion Rules: Rows in each order table are copied into the sale fact on a daily basis.
• Selection Logic: Only rows for the current transaction date are selected.
• Name: Product Dimension
• Conversion Rules: The product dimension is used to calculate the cost for the product model on an order.
As well, the negotiated unit price on the order is compared to the suggested retail or wholesale price (based on the outlet type) to determine if a discount was given. If so, the discount amount is calculated. If a discount is given for a product not eligible for
discounting, a message is printed on an error report.
• Selection Logic: For each row being inserted into the sale fact, the product data is accessed.
Measures: Total cost, Total revenue, Total quantity sold, and Discount amount
Dimensions: Customer, Manufacturing, Order, Product, Seller, and Time Subsidiary Facts: Inventory: the inventory fact contains daily aggregates for
cost and revenue at the product and plant level of granularity.
Contact Person: Plant Manager
• DIMENSION METADATA
Name: Customer
Definition: A customer is any person or organization who purchases goods from CelDial. A customer may be associated with many business locations.
Alias: None
Appendix A. The CelDial Case Study 173
Hierarchy: Data can be summarized at two levels for a customer. The lowest level of summarization is the customer ship-to address. Data from each location (ship-to address) can be further rolled up to summarize for an entire customer.
Change Rules: New customer locations are inserted as new rows into the dimension. Changes to existing locations are updated in place.
Load Frequency: Daily Load Statistics:
• Last Load Date: N/A
• Number of Rows Loaded: N/A Usage Statistics:
• Average Number of Queries/Day: N/A
• Average Rows Returned/Query: N/A
• Average Query Runtime: N/A
• Maximum Number of Queries/Day: N/A
• Maximum Rows Returned/Query: N/A
• Maximum Query Runtime: N/A Archive Rules: Customer data is not archived.
Archive Statistics:
• Last Archive Date: N/A
• Date Archived to: N/A
Purge Rules: Customers who have not purchased any goods from CelDial in the past 48 months will be purged on a monthly basis.
Purge Statistics:
• Last Purge Date: N/A
• Date Purged to: N/A
Data Quality: When a new customer is added a search is done to determine if we already do business with another location.
In rare cases separate branches of a customer are recorded as separate customers because this check fails.
Until such time as the customer notices separate locations dealing with us such occurrences remain as originally recorded.
Data Accuracy: Incorrect association of locations of a common customer occur in less than .5% of our customer data.
Key: The key to the customer dimension consists of a system generated number.
Key Generation Method: When a customer is copied from the operational system, the translation table is checked to determine if the
customer already exists in the warehouse. If not, a new key is generated and the key along with the customer ID and location ID are added to the translation table. If the customer and location already exist, the key from the translation table is used to determine which customer in the warehouse to update.
Source:
• Name: Customer Table
• Conversion Rules: Rows in each customer table are copied on a daily basis. For existing customers, the name is updated. For new customers, once a location is determined, the key is generated and a row
inserted. Before the update/insert takes place a check is performed for a duplicate customer name. If a
duplicate is detected, a sequence number is appended to the name. This check is repeated until the name and sequence number combination are determined to be unique. Once uniqueness has been confirmed, the update/insert takes place.
• Selection Logic: Only new or changed rows are selected.
• Name: Customer Location Table
• Conversion Rules: Rows in each customer location table are copied on a daily basis. For existing
customer locations, the ship-to address is updated. For new customer locations, the key is generated and a row inserted.
• Selection Logic: Only new or changed rows are selected.
Attributes:
• Name: Customer Key
• Definition: This is an arbitrary value assigned to guarantee uniqueness for each customer and location.
• Alias: None
• Change Rules: Once assigned, the values of this attribute never change.
• Data Type: Numeric
• Domain: 1 - 999,999,999
• Derivation Rules: A system generated key of the highest used customer key +1 is assigned when creating a new customer and location entry.
• Source: System Generated
• Name: Name
• Definition: This is the name by which a customer is known to CelDial.
• Alias: None
• Change Rules: When a customer name changes it is updated in place in this dimension.
• Data Type: Character(30)
• Domain:
• Derivation Rules: To ensure the separation of data for customers who have the same name but are not part of the same organization, a number will be appended to names where duplicates exist.
• Source: Name in Customer Table
• Name: Ship-to Address
• Definition: This is an address where CelDial ships goods to a corporate customer. It is possible for a corporate customer to have multiple ship-to locations.
For retail customers no ship-to address is kept.
Therefore, there can only be one entry in the customer dimension for a retail customer.
• Alias: None
• Change Rules: When a ship-to address changes it is updated in place in this dimension.
• Data Type: Character(60)
Appendix A. The CelDial Case Study 175
• Domain: All valid addresses within CelDial′s service area.
• Derivation Rules: The ship-to address is a direct copy of the source.
• Source: Ship-to Address in Customer Location Table
Facts: Sale
Measures: Total cost, Total revenue, Total quantity sold, and Discount amount
Subsidiary Dimensions: None
Contact Person: Vice-president of Sales and Marketing
Name: Manufacturing
Definition: The manufacturing dimension represents the
manufacturing plants owned and operated by CelDial.
Plants are grouped into geographic regions.
Alias: None
Hierarchy: Data can be summarized at two levels for manufacturing.
The lowest level of summarization is the manufacturing plant. Data from each plant can be further rolled up to summarize for an entire geographic region.
Change Rules: New plants are inserted as new rows into the dimension.
Changes to existing plants are updated in place.
Load Frequency: Daily Load Statistics:
• Last Load Date: N/A
• Number of Rows Loaded: N/A Usage Statistics:
• Average Number of Queries/Day: N/A
• Average Rows Returned/Query: N/A
• Average Query Runtime: N/A
• Maximum Number of Queries/Day: N/A
• Maximum Rows Returned/Query: N/A
• Maximum Query Runtime: N/A Archive Rules: Manufacturing plant data is not archived.
Archive Statistics:
• Last Archive Date: N/A
• Date Archived to: N/A
Purge Rules: Manufacturing plants that have been closed for at least 48 months will be purged on a monthly basis.
Purge Statistics:
• Last Purge Date: N/A
• Date Purged to: N/A
Data Quality: There are no opportunities for error or misinterpretation of manufacturing plant data.
Data Accuracy: Manufacturing plant data is 100% accurate.
Key: The key to the manufacturing plant dimension consists of a system generated number.
Key Generation Method: When a manufacturing plant is copied from the operational system, the translation table is checked to determine if the plant already exists in the warehouse. If not, a new key is generated and the key along with the plant ID and region ID are added to the translation table. If the plant and region already exist, the key from the
translation table is used to determine which plant in the warehouse to update.
Source:
• Name: Manufacturing Plant Table
• Conversion Rules: rows in each plant table are copied on a daily basis. For existing plants, the plant name is updated. For new plants, once a region is determined, the key is generated and a row inserted.
• Selection Logic: Only new or changed rows are selected.
• Name: Manufacturing Region Table
• Conversion Rules: Rows in each region table are copied on a daily basis. For existing regions, the region name is updated for all plants in the region.
For new regions, the key is generated and a row inserted.
• Selection Logic: Only new or changed rows are selected.
Attributes:
• Name: Manufacturing Key
• Definition: This is an arbitrary value assigned to guarantee uniqueness for each plant and region.
• Alias: None
• Change Rules: Once assigned, the values of this attribute never change.
• Data Type: Numeric
• Domain: 1 - 999,999,999
• Derivation Rules: system generated key of the highest used manufacturing key + 1 is assigned when
creating a new plant and region entry.
• Source: System Generated
• Name: Region Name
• Definition: This is the name CelDial uses to identify a geographic region for the purpose of grouping manufacturing plants.
• Alias: None
• Change Rules: When a region name changes it is updated in place in this dimension.
• Data Type: Character(30)
• Domain:
• Derivation Rules: The region name is a direct copy of the source
• Source: Name in Manufacturing Region Table
• Name: Plant Name
• Definition: This is the name CelDial uses to identify an individual manufacturing plant.
• Alias: None
• Change Rules: When a plant name changes it is updated in place in this dimension.
• Data Type: Character(30)
• Domain:
• Derivation Rules: The plant name is a direct copy of the source
Appendix A. The CelDial Case Study 177
• Source: Name in Manufacturing Plant Table
Facts: Inventory and Sale
Measures: Quantity on hand, Reorder level, Total cost, Total revenue, Total quantity sold, and Discount amount
Subsidiary Dimensions: None
Contact Person: Plant Manager
Name: Time
Definition: The time dimension represents the time frames used by CelDial for reporting purposes.
Alias: None
Hierarchy: The lowest level of summarization is a day. Data for a given day can be rolled up into either weeks or months.
Weeks cannot be rolled up into months.
Change Rules: Once a year the following year′s dates are inserted as new rows into the dimension. There are no updates to this dimension.
Load Frequency: Annually Load Statistics:
• Last Load Date: N/A
• Number of Rows Loaded: N/A Usage Statistics:
• Average Number of Queries/Day: N/A
• Average Rows Returned/Query: N/A
• Average Query Runtime: N/A
• Maximum Number of Queries/Day: N/A
• Maximum Rows Returned/Query: N/A
• Maximum Query Runtime: N/A Archive Rules: Time data is not archived.
Archive Statistics:
• Last Archive Date: N/A
• Date Archived to: N/A
Purge Rules: Time data more than 4 years old will be purged on a yearly basis.
Purge Statistics:
• Last Purge Date: N/A
• Date Purged to: N/A
Data Quality: There are no opportunities for error or misinterpretation of time data.
Data Accuracy: Time data is 100% accurate.
Key: The key to the time dimension is a date in YYYYMMDD (year-month-day) format.
Key Generation Method: The date in a row is used as the key.
Source:
• Name: Calendar spreadsheet maintained by database administrator.
• Conversion Rules: Rows in the calendar spreadsheet represent one calendar year. All the rows in the spreadsheet are loaded into the dimension annually.
• Selection Logic: All rows are selected.
Attributes:
• Name: Time Key
• Definition: This is the date in YYYYMMDD format.
• Alias: None
• Change Rules: Once assigned, the values of this attribute never change.
• Data Type: Numeric
• Domain: valid dates
• Derivation Rules: This date is a direct copy from the source.
• Source: Numeric Date in Calendar spreadsheet
• Name: Date
• Definition: This is the descriptive date equivalent to the numeric date used as the key to this dimension. It is the date used on reports and to limit what data appears on a report. It is in the format MMM DD, YYYY.
• Alias: None
• Change Rules: Once assigned, the values of this attribute never change.
• Data Type: Character(12)
• Domain: valid dates in descriptive format
• Derivation Rules: This date is a direct copy from the source.
• Source: Descriptive Date in Calendar spreadsheet
• Name: Week of Year
• Definition: Each day of the year is assigned to a week for reporting purposes. Because years don′t divide evenly into weeks it is possible for a given day near the beginning or end of a calendar year to fall into a different year for weekly reporting purposes. The format is WW-YYYY.
• Alias: None
• Change Rules: Once assigned, the values of this attribute never change.
• Data Type: Character(7)
• Domain: WW is 1-52. YYYY is any valid year.
• Derivation Rules: This date is a direct copy from the source.
• Source: Week of Year in Calendar spreadsheet
Facts: Inventory and Sale
Measures: Quantity on hand, Reorder level, Total cost, Total revenue, Total quantity sold, and Discount amount
Subsidiary Dimensions: None
Contact Person: Data Warehouse Administrator
• MEASURE METADATA
Name: Total Cost
Definition: This is the cost of all components used to create product models that have been sold.
Alias: None
Data Type: Numeric (9,2) Domain: $0.01 - $9,999,999.99.
Derivation Rules: The total cost is the product of the unit cost of a product model and quantity of the product model sold.
Appendix A. The CelDial Case Study 179
Usage Statistics:
• Average Number of Queries/Day: N/A
• Maximum Number of Queries/Day: N/A
Data Quality: This figure only represents the cost of components. No attempt is made to record labor or overhead costs. As well, cost is calculated using the current cost at the time a product model is sold. No attempt is made to determine when the model was produced and the cost at that time.
Data Accuracy: We estimate that the cost reported for a product model is accurate to within +/- .5%.
Facts: Inventory and Sale
Dimensions: Customer, Manufacturing, Product, Seller, and Time
Name: Total Revenue
Definition: This is the amount billed to customers for product models that have been sold.
Alias: None
Data Type: Numeric (9,2)
Data Type: Numeric (9,2)