• No results found

Data Quality Processes

Data quality processes are those data integration processes that qualify and cleanse the data, based upon technical and business process rules. These rules or data quality criteria are built in to the data integration jobs as data quality criteria or “checks.”

You will find that data quality is a common architectural “thread” that is discussed in sev-eral different chapters of this book in terms of its impact on data integration processes and the data governance processes that are needed for a robust data integration environment.

What Is Data Quality?

Data quality is the commonly understood business and technical definition of data within defined ranges. It is measured by how effectively the data supports the transactions and decisions needed to meet an organization’s strategic goals and objectives, as embodied in its ability to manage its assets and conduct its core operations.

The level of data quality required to effectively support operations will vary by information system or business unit, depending upon the information needs to conduct that business unit’s oper-ations. For example, financial systems require a high degree of quality data due to the importance and usage of the data, but a marketing system may have the latitude to operate with a lower level of data quality without significantly impacting the use of the information in measuring marketing suc-cess. Because the purpose varies, so does the bar that is used to measure fitness to purpose.

Causes of Poor Data Quality

Causes for bad data quality can be categorized as business-process and technology-defined data quality issues, as demonstrated in Figure 2.12.

Data Quality Processes 31

Cust No Cust Name Product Cost 10 Ms.John Smith Seats $1,200

Sam Reilly Chairs $2,300 11 Jack Jones Stools $1,750

13 Charles Nelson Tables $A,AA 1. Invalid Data 2. Missing Data 3. Inaccurate Data

4. Inconsistent Bad Business--Process Data Quality

Bad Technology--Defined Data Quality Definition

Figure 2.12 Examples of bad data quality types

Technology-driven poor data qualities are those types that are caused by not applying tech-nology constraints either database or data integration. These types include the following:

ptg

• Missing data—Data that is missing in that field. For example, by not applying key con-straints in the database, a not-null field has been left null.

Business-driven bad data qualities are those types that are caused by end users inaccurately creating or defining data. Examples include the following:

• Inaccurate data—Invalid data due to incorrect input by business users. For example, by inaccurately creating a record for “Ms. Anthony Jones,” rather than “Mr. Anthony Jones,” poor data quality is created. Inaccurate data is also demonstrated by the “dupli-cate data” phenomenon. For example, an organization has a customer record for both

“Anthony Jones” and Tony Jones,” both the same person.

• Inconsistent definitions—Where stakeholders have different definitions of the data. By having disparate views on what the definition of poor data quality is, perceived bad qual-ity is created, for example when the Sales Department has a different definition of cus-tomer profitability than the Accounting Department.

Data Quality Check Points

Poor data quality can be prevented by determining key data quality criteria and building those rules into data quality “checks.” There are two types of data quality checks:

• Technical data quality checkpoints—Technical data quality checkpoints define the data quality criteria often found in both the entity integrity and referential integrity rela-tional rules found in logical data modeling. They address the invalid and missing data quality anomalies. Technical data quality criteria are usually defined by IT and Informa-tion Management subject matter experts. An example includes the primary key null data quality checkpoint.

• Business data quality checkpoints—The business data quality checkpoints confirm the understanding of the key data quality elements in terms of what the business defini-tion and ranges for a data quality element are and what business rules are associated with that element. Business data quality checkpoints address the inaccurate and incon-sistent data quality anomalies. The classic example of a business data quality check is gender. A potential list of valid ranges for gender is “Male,” “Female,” or “Unknown.”

This is a business definition, not an IT definition; the range is defined by the business.

Although many organizations find the three values for gender sufficient, the U.S. Postal Service has seven types of gender, so their business definition is broader than others.

Where to Perform a Data Quality Check

One of the best practices for data quality is that it should be checked before any transformation processing because there is usually no reason to process bad data. However, there are data inte-gration environments that check data quality after transformations are complete due to business

ptg rules and legitimate technical reasons. An example is check total ranges, where a check occurs

after a total sales calculation, ensuring that the amount is within a business-defined range.

Pragmatism dictates that the location of data quality checks in the architecture should be based on the data integration requirements, especially when there is a high degree of cycling logic, as portrayed in Figure 2.13.

Data Quality Processes 33

Iteratively Processing Records in a Subject Area File

Figure 2.13 Iterative transform and data quality checkpoint processing scenario

Regardless of where the data quality rules and the data quality layer are executed, the following data quality checkpoint processing functionality should be provided, as shown in Figure 2.14:

• Cleansed data files—Using the data quality criteria, the good records are filtered into the “clean” file.

• Reject data files—Data records that fail are logged in the “reject” file.

• Reject reports—Data records that fail are listed in a tabular report with reason codes for review and renovation.

Data Quality Processes Data Quality Processes

Business Data Quality Checks Technical

Data Quality Checks

Error Handling

Bad Transactions 0101 3443434 Missing Fields 0304 535355 Referential Integrity 0101 3443434 Missing Fields 0304 535355 Referential Integrity

Clean Data

Reject Data

Reject Report File from the

Initial Staging Landing Zone

Clean Staging Landing Zone

Figure 2.14 Data quality checkpoint processing architecture best practice

ptg A discussion on whether records that fail the data quality process should stop the data

inte-gration process or whether the records should be flagged and passed or fail (absolute versus optional data quality) is located in Chapter 11, “Data Integration Development Cycle.”