• No results found

Review/Assess Source Data Quality

Data Integration Analysis Phase

Step 3: Review/Assess Source Data Quality

This step further refines the Wheeler Core Data Element List for data quality issues and develops the first-cut set of data quality checkpoints.

Although it appears that there are redundancies in the source system profiling and data quality assessment tasks, profiling gathers the information and provides a first set of data quality issues. The review\assess source data quality task confirms those findings, performs further root cause analysis, and, finally, develops the first-cut technical and business data quality checkpoints for the data quality process layer in the data integration environment, as shown in Figure 6.16.

Wheeler Source System Core Data Element List Source File/

Table Name Data Element Name

Subject Area

Domain

Not Null Key Ranges

System 1 Customer File

CUST_# Customer Varchar(04) Y Y

ORG Customer Varchar(40) N N

CUST_NAME Customer Varchar(40) N N

ADDRESS Customer Varchar(20) N N

CITY Customer Varchar(20) N N

STATE Customer Varchar(20) N N

ZIP Customer Varchar(09) N N

System 2 Customer File

ID Customer Decimal(10) Y Y

O_NAME Customer Char(15) Y N

F_NAME Customer Char(15) Y N

L_NAME Customer Char(15) Y N

ADDRSS 1 Customer Char(20) Y N

ADDRSS 2 Customer Char(20) N N

CITY Customer Char(15) N N

STATE Customer Char(02) N N

ZIP Customer Decimal(09) N N

System 3 Customer File

CUST_ID Customer Decimal(10) Y Y

ORGANIZATION Customer Varchar(20) Y N

FRST Customer Varchar(20) Y N

LAST Customer Varchar(20) Y N

ADDR 1 Customer Char(20) Y N

ADDR 2 Customer Char(20) N N

ADDR 3 Customer Char(20) N N

CITY Customer Char(15) N N

STATE Customer Varchar(2) N N

ZIP Customer Integer(05) N N

EXT Customer Integer(04) N N

Data Quality Criteria

Source System Data Profiling

Assess Source Data

Quality

Figure 6.16 The iterative nature of source system analysis

1. Perform validation checks to assess the data—Using the Wheeler source system Core Data Element List, review and determine the types of checks that would be needed in the data quality layer of the proposed data integration process:

ptg

Data Integration Analysis Phase 131

• Data format checks—A secondary review of the data elements does not reveal any errors in terms of format, for example VarChar in Integer.

• Date format checks—Not only does it appear that there are no inconsistencies in the date formats of each of the order systems, they are also in the same format of two-digit month, two-two-digit day, four-two-digit year (e.g., 03122010.) It would be wise to work with the data modeler, ensure that the target Wheeler data warehouse data model has the same format, and reduce an unnecessary data format transformation unless there is a desire to standardize to the relational DATE format.

• Numeric value range check—Review the source data for numeric upper and lower limits in the numeric fields in the Wheeler order system source data. For example, a rule could be placed on the order numeric fields, such as cost and price that prevents them from being negative, thereby preventing downstream incorrect calculations.

N

OTE

Before such a business data quality rule is created, it is important to verify with an appropriate business user that this is an appropriate rule and there are not legitimate reasons for negatives in such columns.

• Null checks—When performing a secondary check for null values in mandatory columns/fields, the null key field in System 3 was captured in the prior analysis. It is good to double-check that a rule had been put in place in ensuring key rules are enforced.

• Duplicate key/field checks—When reviewing the Wheeler data for the prevention of the accidental loading of duplicate records, business-defined critical data ele-ments, and key columns (primary, foreign, unique), we should review and ensure that the duplication error found between the Wheeler System 2 Product File and Sys-tem 3 Product File has been communicated to prevent any future issues in the online systems.

2. Review any other observed anomalies—In this secondary review, we find that order file 3 does not contain a Terms field, as illustrated in Figure 6.17. This can cause signifi-cant data governance issues and merits further research with both the source system IT and business users.

ptg

System 1 Order File

ORDER_NO STATUS DATE CUST_# TERMS_CD ITEM_NO PROD_PRI

CE

AMNT_OR DR

10001Shipped 03032010 410 Fixd 1302 $14 2,000

10002Ordered 03112010 520 Open 1303 $15 5,000

10003Ordered 03122010 660 Open 1303 $15 3,000

10004Shipped 03122010 200 Fixd 1301 $12 20,000

System 2 Order File

ORD _NUM STATUS DATE CUST_# LINE_1 TERMS_CD ITEM_ID PROD_PRI

CE AMNT_O RDR

LINE_2 TERMS_CD ITEM_ID PROD_

PRICE AMNT_

ORDR

22221Shipped 03042010 11100011 1 02/10, net 30 1101 $125 100 2 02/10, net 30 1111 $135 550

22222Ordered 03222010 11100012 1 02/10, net 30 1101 $147 230 2 02/10, net 30 1103 $175 400

22223Ordered 03142010 <null> 1 02/10, net 30 1111 $135 1,000 2 02/10, net 30 <null> $135 400

22224Shipped 03212010 1110001A 1 02/10, net 30 1113 $89 2,000 2 02/10, net 30 1101 $125 200

System 3 Order File

ORD _# STS DTE CUST_# LN_1 ID_NUMBER PROD

_PRICE AMNT _ORDR

LN_2 ID_NUMBER PROD _PRICE

AMNT _ORDR

LN_3 ID_NUMBPROD _PRICE

AMNT _ORDR

30010Ordered 03302010 310001 1 1201 $30 500 2 1204 $25 3,500

30020Ordered 03152010 310002 1 1101 $32 320

30030Ordered 03222010 310003 1 1203 <null> 2,000 2 1204 $25 5,000 3 1201 $30 300

30040Ordered 03232010 310004 1 1204 $25 4,000 2 1101 $32 500

Existing Terms Column

Existing Terms Column

Missing Terms

Figure 6.17 Missing columns

Often in the first review of the data, the focus is so intense on the anomalies within a col-umn, broader data anomalies or missing critical data is overlooked. Only after the “pic-ture” data and its structure becomes clearer will less obvious issues be observed, which is another reason for a secondary review task.

We have now reviewed and documented the actual structure of the source data, the data itself, and the anomalies within the data.

The source system discovery tasks have provided a good understanding of the source sys-tem data in terms of its structure, its data, and its anomalies. With this body of knowledge, we can move on to the next task of data mapping.

Figure 6.18 provides the completed version of the Wheeler Core Data Element List that will be used for the complex task of data mapping.

ptg

Data Integration Analysis Phase 133

Wheeler Source System Core Data Element List

Source File/ Table

Data Element Name

Subject Area

Domain

Not Null Key Ranges/Rules

System 1 Customer File

CUST_# Customer Varchar(04) Y Y Should be Primary Key

ORG Customer Varchar(40) N N

CUST_NAME Customer Varchar(40) N N

ADDRESS Customer Varchar(20) N N

CITY Customer Varchar(20) N N

STATE Customer Varchar(20) N N

ZIP Customer Varchar(09) N N

System 2 Customer File

ID Customer Decimal(10) Y Y Should be Primary Key

O_NAME Customer Char(15) Y N

F_NAME Customer Char(15) Y N

L_NAME Customer Char(15) Y N

ADDRSS 1 Customer Char(20) Y N

ADDRSS 2 Customer Char(20) N N

CITY Customer Char(15) N N

STATE Customer Char(02) N N

ZIP Customer Decimal(09) N N

System 3 Customer File

CUST_ID Customer Decimal(10) Y Y Should be Primary Key

ORGANIZATION Customer Varchar(20) Y N

FRST Customer Varchar(20) Y N

LAST Customer Varchar(20) Y N

ADDR 1 Customer Char(20) Y N

ADDR 2 Customer Char(20) N N

ADDR 3 Customer Char(20) N N

CITY Customer Char(15) N N

STATE Customer Varchar(2) N N

ZIP Customer Integer(05) N N

EXT Customer Integer(04) N N

Source File/ Table

Data Element Name

Subject Area

Domain

Not Null Key Ranges/Rules

System 1 Rubber Product File

Item Number Product Varchar(04) Y Y Should be Primary Key

Description Product Char(30) Y N Non Repeating

Cost Product Decimal(12,2) N N Cannot be negative

Price Product Decimal(12,2) N N Cannot be negative

Inventory Product Decimal(12,2) N N

System 2 Wheels Product File

Item ID Product Integer(06) N N Should be Primary Key

Inventory Name Product Char(30) N N

Cost Product Decimal(12,2) N N Cannot be negative

Price Product Decimal(12,2) N N Cannot be negative

Inventory Product Decimal(12,2) N N

System 3 Bearing Product File

ID Number Product Integer(06) N N Should be Primary Key

Name Product Char(30) Y N

Cost Product Decimal(12,2) N N Cannot be negative

Price Product Decimal(12,2) N N Cannot be negative

Inventory Product Decimal(12,2) N N

Data Quality Criteria

Data Quality Criteria

Additional fields from the data quality exercise task. Need to

be verified with the business.

ptg Wheeler Source System Core Data Element List

Source File/ Table Name

Data Element Name

Subject Area

Domain

Not Null Key Ranges/Rules

System 1 Order File

ORDER_NO Order Decimal(05,2) Y Y Should be Primary Key

STATUS Order Char(11) N N

DATE Order Integer(08) N N

CUST_# Order Varchar(04) Y N Should be Foreign Key

TERMS_CD Order Char(05) Y N

ITEM_NO Order Varchar(04) Y Y Should be Foreign Key

PROD_PRICE Order Decimal(05,2) Y N

AMNT_ORDR Order Decimal(08,2) Y N

System 2 Order File

ORD _NUM Order Decimal(05,2) Y Y Should be Primary Key

STATUS Order Char(08) N N

DATE Order Integer(08) N N

CUST_# Order Varchar(04) Y N Should be Foreign Key

LINE_1 Order Decimal(2,2) Y N

TERMS_CD Order Char(05) Y Y

ITEM_ID Order Integer(06) Y N Should be Foreign Key

PROD_PRICE Order Decimal(05,2) Y N

AMNT_ORDR Order Decimal(08,2) N N

LINE_2 Order Decimal(2,2) N N

TERMS_CD Order Char(05) N N

ITEM_ID Order Integer(06) Y N Should be Foreign Key

PROD_PRICE Order Decimal(05,2) N N

AMNT_ORDR Order Decimal(08,2) N N

System 3 Order File

ORD _# Order Decimal(05,2) Y Y Should be Primary Key

STS Order Char(07) N N

DTE Order Integer(08) N N

CUST_# Order Varchar(04) Y Y Should be Foreign Key

LN_1 Order Decimal(2,2) Y N

ID_NUMBER Order Integer(06) N N Should be Foreign Key

PROD_PRICE Order Decimal(05,2) Y N

AMNT_ORDR Order Decimal(08,2) Y N

LN_2 Order Decimal(2,2) Y N

ID_NUMBER Order Integer(06) N N Should be Foreign Key

PROD_PRICE Order Decimal(05,2) Y N

AMNT_ORDR Order Decimal(08,2) Y N

LN_3 Order Decimal(2,2) Y N

ID_NUMBER Order Integer(06) N N Should be Foreign Key

PROD_PRICE Order Decimal(05,2) Y N

AMNT_ORDR Order Decimal(08,2) Y N

Data Quality Criteria

Figure 6.18 The completed Wheeler source system Core Data Element List

ptg

Data Integration Analysis Phase 135