Data Integration Analysis Phase
Step 3: Review/Assess Source Data Quality
This step further refines the Wheeler Core Data Element List for data quality issues and develops the first-cut set of data quality checkpoints.
Although it appears that there are redundancies in the source system profiling and data quality assessment tasks, profiling gathers the information and provides a first set of data quality issues. The review\assess source data quality task confirms those findings, performs further root cause analysis, and, finally, develops the first-cut technical and business data quality checkpoints for the data quality process layer in the data integration environment, as shown in Figure 6.16.
Wheeler Source System Core Data Element List Source File/
Table Name Data Element Name
Subject Area
Domain
Not Null Key Ranges
System 1 Customer File
CUST_# Customer Varchar(04) Y Y
ORG Customer Varchar(40) N N
CUST_NAME Customer Varchar(40) N N
ADDRESS Customer Varchar(20) N N
CITY Customer Varchar(20) N N
STATE Customer Varchar(20) N N
ZIP Customer Varchar(09) N N
System 2 Customer File
ID Customer Decimal(10) Y Y
O_NAME Customer Char(15) Y N
F_NAME Customer Char(15) Y N
L_NAME Customer Char(15) Y N
ADDRSS 1 Customer Char(20) Y N
ADDRSS 2 Customer Char(20) N N
CITY Customer Char(15) N N
STATE Customer Char(02) N N
ZIP Customer Decimal(09) N N
System 3 Customer File
CUST_ID Customer Decimal(10) Y Y
ORGANIZATION Customer Varchar(20) Y N
FRST Customer Varchar(20) Y N
LAST Customer Varchar(20) Y N
ADDR 1 Customer Char(20) Y N
ADDR 2 Customer Char(20) N N
ADDR 3 Customer Char(20) N N
CITY Customer Char(15) N N
STATE Customer Varchar(2) N N
ZIP Customer Integer(05) N N
EXT Customer Integer(04) N N
Data Quality Criteria
Source System Data Profiling
Assess Source Data
Quality
Figure 6.16 The iterative nature of source system analysis
1. Perform validation checks to assess the data—Using the Wheeler source system Core Data Element List, review and determine the types of checks that would be needed in the data quality layer of the proposed data integration process:
ptg
Data Integration Analysis Phase 131
• Data format checks—A secondary review of the data elements does not reveal any errors in terms of format, for example VarChar in Integer.
• Date format checks—Not only does it appear that there are no inconsistencies in the date formats of each of the order systems, they are also in the same format of two-digit month, two-two-digit day, four-two-digit year (e.g., 03122010.) It would be wise to work with the data modeler, ensure that the target Wheeler data warehouse data model has the same format, and reduce an unnecessary data format transformation unless there is a desire to standardize to the relational DATE format.
• Numeric value range check—Review the source data for numeric upper and lower limits in the numeric fields in the Wheeler order system source data. For example, a rule could be placed on the order numeric fields, such as cost and price that prevents them from being negative, thereby preventing downstream incorrect calculations.
N
OTEBefore such a business data quality rule is created, it is important to verify with an appropriate business user that this is an appropriate rule and there are not legitimate reasons for negatives in such columns.
• Null checks—When performing a secondary check for null values in mandatory columns/fields, the null key field in System 3 was captured in the prior analysis. It is good to double-check that a rule had been put in place in ensuring key rules are enforced.
• Duplicate key/field checks—When reviewing the Wheeler data for the prevention of the accidental loading of duplicate records, business-defined critical data ele-ments, and key columns (primary, foreign, unique), we should review and ensure that the duplication error found between the Wheeler System 2 Product File and Sys-tem 3 Product File has been communicated to prevent any future issues in the online systems.
2. Review any other observed anomalies—In this secondary review, we find that order file 3 does not contain a Terms field, as illustrated in Figure 6.17. This can cause signifi-cant data governance issues and merits further research with both the source system IT and business users.
ptg
System 1 Order File
ORDER_NO STATUS DATE CUST_# TERMS_CD ITEM_NO PROD_PRI
CE
AMNT_OR DR
10001Shipped 03032010 410 Fixd 1302 $14 2,000
10002Ordered 03112010 520 Open 1303 $15 5,000
10003Ordered 03122010 660 Open 1303 $15 3,000
10004Shipped 03122010 200 Fixd 1301 $12 20,000
System 2 Order File
ORD _NUM STATUS DATE CUST_# LINE_1 TERMS_CD ITEM_ID PROD_PRI
CE AMNT_O RDR
LINE_2 TERMS_CD ITEM_ID PROD_
PRICE AMNT_
ORDR
22221Shipped 03042010 11100011 1 02/10, net 30 1101 $125 100 2 02/10, net 30 1111 $135 550
22222Ordered 03222010 11100012 1 02/10, net 30 1101 $147 230 2 02/10, net 30 1103 $175 400
22223Ordered 03142010 <null> 1 02/10, net 30 1111 $135 1,000 2 02/10, net 30 <null> $135 400
22224Shipped 03212010 1110001A 1 02/10, net 30 1113 $89 2,000 2 02/10, net 30 1101 $125 200
System 3 Order File
ORD _# STS DTE CUST_# LN_1 ID_NUMBER PROD
_PRICE AMNT _ORDR
LN_2 ID_NUMBER PROD _PRICE
AMNT _ORDR
LN_3 ID_NUMBPROD _PRICE
AMNT _ORDR
30010Ordered 03302010 310001 1 1201 $30 500 2 1204 $25 3,500
30020Ordered 03152010 310002 1 1101 $32 320
30030Ordered 03222010 310003 1 1203 <null> 2,000 2 1204 $25 5,000 3 1201 $30 300
30040Ordered 03232010 310004 1 1204 $25 4,000 2 1101 $32 500
Existing Terms Column
Existing Terms Column
Missing Terms
Figure 6.17 Missing columns
Often in the first review of the data, the focus is so intense on the anomalies within a col-umn, broader data anomalies or missing critical data is overlooked. Only after the “pic-ture” data and its structure becomes clearer will less obvious issues be observed, which is another reason for a secondary review task.
We have now reviewed and documented the actual structure of the source data, the data itself, and the anomalies within the data.
The source system discovery tasks have provided a good understanding of the source sys-tem data in terms of its structure, its data, and its anomalies. With this body of knowledge, we can move on to the next task of data mapping.
Figure 6.18 provides the completed version of the Wheeler Core Data Element List that will be used for the complex task of data mapping.
ptg
Data Integration Analysis Phase 133
Wheeler Source System Core Data Element List
Source File/ Table
Data Element Name
Subject Area
Domain
Not Null Key Ranges/Rules
System 1 Customer File
CUST_# Customer Varchar(04) Y Y Should be Primary Key
ORG Customer Varchar(40) N N
CUST_NAME Customer Varchar(40) N N
ADDRESS Customer Varchar(20) N N
CITY Customer Varchar(20) N N
STATE Customer Varchar(20) N N
ZIP Customer Varchar(09) N N
System 2 Customer File
ID Customer Decimal(10) Y Y Should be Primary Key
O_NAME Customer Char(15) Y N
F_NAME Customer Char(15) Y N
L_NAME Customer Char(15) Y N
ADDRSS 1 Customer Char(20) Y N
ADDRSS 2 Customer Char(20) N N
CITY Customer Char(15) N N
STATE Customer Char(02) N N
ZIP Customer Decimal(09) N N
System 3 Customer File
CUST_ID Customer Decimal(10) Y Y Should be Primary Key
ORGANIZATION Customer Varchar(20) Y N
FRST Customer Varchar(20) Y N
LAST Customer Varchar(20) Y N
ADDR 1 Customer Char(20) Y N
ADDR 2 Customer Char(20) N N
ADDR 3 Customer Char(20) N N
CITY Customer Char(15) N N
STATE Customer Varchar(2) N N
ZIP Customer Integer(05) N N
EXT Customer Integer(04) N N
Source File/ Table
Data Element Name
Subject Area
Domain
Not Null Key Ranges/Rules
System 1 Rubber Product File
Item Number Product Varchar(04) Y Y Should be Primary Key
Description Product Char(30) Y N Non Repeating
Cost Product Decimal(12,2) N N Cannot be negative
Price Product Decimal(12,2) N N Cannot be negative
Inventory Product Decimal(12,2) N N
System 2 Wheels Product File
Item ID Product Integer(06) N N Should be Primary Key
Inventory Name Product Char(30) N N
Cost Product Decimal(12,2) N N Cannot be negative
Price Product Decimal(12,2) N N Cannot be negative
Inventory Product Decimal(12,2) N N
System 3 Bearing Product File
ID Number Product Integer(06) N N Should be Primary Key
Name Product Char(30) Y N
Cost Product Decimal(12,2) N N Cannot be negative
Price Product Decimal(12,2) N N Cannot be negative
Inventory Product Decimal(12,2) N N
Data Quality Criteria
Data Quality Criteria
Additional fields from the data quality exercise task. Need to
be verified with the business.
ptg Wheeler Source System Core Data Element List
Source File/ Table Name
Data Element Name
Subject Area
Domain
Not Null Key Ranges/Rules
System 1 Order File
ORDER_NO Order Decimal(05,2) Y Y Should be Primary Key
STATUS Order Char(11) N N
DATE Order Integer(08) N N
CUST_# Order Varchar(04) Y N Should be Foreign Key
TERMS_CD Order Char(05) Y N
ITEM_NO Order Varchar(04) Y Y Should be Foreign Key
PROD_PRICE Order Decimal(05,2) Y N
AMNT_ORDR Order Decimal(08,2) Y N
System 2 Order File
ORD _NUM Order Decimal(05,2) Y Y Should be Primary Key
STATUS Order Char(08) N N
DATE Order Integer(08) N N
CUST_# Order Varchar(04) Y N Should be Foreign Key
LINE_1 Order Decimal(2,2) Y N
TERMS_CD Order Char(05) Y Y
ITEM_ID Order Integer(06) Y N Should be Foreign Key
PROD_PRICE Order Decimal(05,2) Y N
AMNT_ORDR Order Decimal(08,2) N N
LINE_2 Order Decimal(2,2) N N
TERMS_CD Order Char(05) N N
ITEM_ID Order Integer(06) Y N Should be Foreign Key
PROD_PRICE Order Decimal(05,2) N N
AMNT_ORDR Order Decimal(08,2) N N
System 3 Order File
ORD _# Order Decimal(05,2) Y Y Should be Primary Key
STS Order Char(07) N N
DTE Order Integer(08) N N
CUST_# Order Varchar(04) Y Y Should be Foreign Key
LN_1 Order Decimal(2,2) Y N
ID_NUMBER Order Integer(06) N N Should be Foreign Key
PROD_PRICE Order Decimal(05,2) Y N
AMNT_ORDR Order Decimal(08,2) Y N
LN_2 Order Decimal(2,2) Y N
ID_NUMBER Order Integer(06) N N Should be Foreign Key
PROD_PRICE Order Decimal(05,2) Y N
AMNT_ORDR Order Decimal(08,2) Y N
LN_3 Order Decimal(2,2) Y N
ID_NUMBER Order Integer(06) N N Should be Foreign Key
PROD_PRICE Order Decimal(05,2) Y N
AMNT_ORDR Order Decimal(08,2) Y N
Data Quality Criteria
Figure 6.18 The completed Wheeler source system Core Data Element List
ptg
Data Integration Analysis Phase 135