Data Warehouse design
Design of Enterprise Systems University of Pavia
DATA QUALITY
Data Quality
The quality of the data in a data warehouse determines the reputation and value of that data warehouse.
If customers perceive the data in a data warehouse to be misleading or just plain wrong, they won’t use the data warehouse.
If customers can find or create a superior source of data elsewhere, they will abandon a data warehouse altogether.
Data Quality, the perceived reputation and value, of a data warehouse, therefore, is vital to the success of a data warehouse.
Deming’s definition of Quality
Deming’s definition of Quality begins with a distinction between Features and
Quality.
– Features are the buzzers and whistles included with a product that do not
specifically address the core purpose of a product.
Typically, people see Features first. For example: – Leather seats in a car
– Mother of Pearl inlays in a guitar
– Graphical user interface (GUI) point-and-click buttons and scroll bars in a spreadsheet
Deming’s point is not that Features have no value. But their value is subjective and, therefore, impossible to measure.
So, what is Quality?
– Deming’s answer is based on expectations and whether or not those expectations are achieved.
– Deming defined Quality as the success or failure to achieve a customer’s expectations.
Data Quality Service Level Agreement
Customer expectations of a data warehouse, therefore, must be expressed in
well-defined and quantifiable terms. For example:
– Completeness
• Not Quantifiable: Most of the stores should be present in the data warehouse.
• Quantifiable: 95 percent of the stores should be present in the data warehouse.
– Latency
• Not Quantifiable: The warehouse data should be in before the staff arrives. • Quantifiable: The warehouse data should be in by 6:30 a.m.
– Accuracy
• Not Quantifiable: The warehouse data should equal what actually happens in the warehouse.
• Quantifiable: The total inventory movement of each product in a warehouse reported by the data warehouse should be within 2 percent (+/−) the total movement of each product as reported by the warehouse end-of-day
Deming’s Statistical Process Control
The processes of the data warehouse are the focus of Deming’s Process Control method, which measures the process by measuring the product — by
measuring its data.
– A data warehouse process that is in control will generate data that meets expectations all day long.
– A data warehouse process that is not in control will generate nonsense data. By measuring each individual ETL process, the data warehouse team can
monitor the health of the ETL application.
– If a single process is out of control, the ETL process can stop, allowing time
to adjust the errant ETL process.
– The time and effort spent to adjust an out-of-control ETL process costs less
than the rework that would happen had the process not been stopped.
– When all ETL processes are back in control, the data in the data warehouse will meet the expectations of the customer.
Process Measurement Methods
The options listed below should be chosen based on the corporate
management and culture surrounding the data warehouse.
Data Quality methods that work well in one setting may fail in another setting. So, involve members of the business to gain their feedback, approval, and buy-in. Methods and strategies:
– Data Stewardship
– Post-Load Audit and Report Errant Data
– Plug in a Default Value and Report Errant Data – Reject a Record and Report the Errant Record – Recycle the Data: In Place and Report Errant Data
Data Stewardship
Data Stewardship is a strategy.
A member of each business area included in the data warehouse is engaged to participate in the data warehouse as a Data Steward.
– This person must have a complete understanding of the business area, its
data, processes, and people.
– A Data Steward must also be able to understand the concepts of the data warehouse.
– Finally, a Data Steward must be able and willing to engage in the general oversight of the data warehouse.
In general, a Data Steward is not only a liaison between the business and the data warehouse, but a champion of both the business area and the data
warehouse.
A Data Steward participates in all communications between the business area and the data warehouse.
Post-Load Audit and Report Errant Data
After data has been loaded into a data warehouse, programmatically query the data in the data warehouse.
The SQL should test and validate that the data in a data warehouse conforms to
the expectations of the data warehouse, which are outlined in the Target
System Analysis.
Errant data should be reported to members of the business area who are interested in, or using, that subject area from which the errant data originated, and the data warehouse team.
The individual data elements are not the focus.
Plug in a Default Value and Report Errant Data
When an errant data element is encountered, the portion of the data element that does not conform to its expectations can be replaced by a default value
The default value may have a specific meaning (e.g., no known value, rejected value, etc.).
The data is allowed to proceed toward the data warehouse with the defaulted values in place of the errant values.
This method is also known as a Soft Reject.
Errant data should be reported to members of the business area who are interested in, or using, that subject area from which the errant data originated, and the data warehouse team.
The individual data elements are not the focus.
Reject a Record and Report the Errant Record
When an errant data element is encountered, the entire record or row may be
discarded altogether.
The remainder of the data is allowed to proceed toward the data warehouse, but without the rejected record or row.
This method is also known as a Hard Reject.
This method can be applied also to a whole dataset
Errant data should be reported to members of the business area who are interested in, or using, that subject area from which the errant data originated, and the data warehouse team.
The individual data elements are not the focus.
Data Recycle: In Place and Report Errant Data
A data element that is subject to remediation can be recorded into two fields, rather than just one.
– The first field is the original errant data value. – The second field is the defaulted data value.
– When a correct data value becomes available, that correct data value will
overwrite the second defaulted data field.
– The first field, containing the original data value, is used only to find the correct data value.
– The second data field, containing the correct data value, is the data field that is visible to the data warehouse customers.
This strategy requires communication between the data warehouse team and the Data Stewards.
When the business area management ceases to participate in the remediation of errant data
– the remainder of the data is still allowed to add value to the data warehouse
Data Recycle: Recycle Wheel and Report Errant Data
A data element that is subject to remediation can be held in a separate table, away from the data warehouse.
– This separate table is often called a Recycle Wheel
– When a correct data value becomes available, that correct data value will
overwrite the errant data field in the Recycle Wheel.
– Then, the corrected record or row of data in the Recycle Wheel is forwarded
to the ETL application to be included in the next iteration of data going to
the data warehouse.
This strategy requires communication between the data warehouse team and the Data Stewards.
The business person who is assigned the task of finding the correct data value must be included in the data warehouse team and its meetings and discussions. The cost of a Recycle Wheel is the storage and maintenance cost of the table
DATA QUALITY TOOLS
Market definition
The data quality tools market includes vendors that offer stand-alone software products to address the core functional requirements of the discipline, which are:
– Profiling: The analysis of data to capture statistics (metadata) that provide insight into the quality of data and help to identify data quality issues.
– Parsing and standardization: The decomposition of text fields into component parts and the formatting of values into consistent layouts based on industry standards, local standards (for example, postal authority standards for address data), user-defined business rules, and knowledge bases of values and patterns. – Generalized "cleansing": The modification of data values to meet domain
restrictions, integrity constraints or other business rules that define when the quality of data is sufficient for an organization.
– Matching: Identifying, linking or merging related entries within or across sets of data.
– Monitoring: Deploying controls to ensure that data continues to conform to business rules that define data quality for the organization.
– Enrichment: Enhancing the value of internally held data by appending related attributes from external sources (for example, consumer demographic attributes and geographic descriptors).
– Connectivity/adapters: The ability to interact with a range of different data structure types.
Market analysis
Leaders exhibit a clear understanding and vision of where the market is headed,
including recognition of non-customer data quality issues and delivery of enterprise-level data quality implementations.
Challengers have an established presence, credibility and viability, but may demonstrate strength only in a specific domain (e.g. only customer name and address cleansing). Visionaries demonstrate a strong
understanding of current and future market trends (e.g. the importance of ongoing
monitoring of data quality). They lack market presence, brand recognition, customer base. Niche Players often lack strength in rapidly
evolving functional areas. In addition, they may focus solely on a specific market
segment (such as midsize businesses), limited geographic areas or a single domain (e.g. customer data).