Academic year: 2021

Data Quality Assessment


Prepared By:

Sanjay Seth



Data quality is crucial to the success of Business Intelligence initiatives. Unless the data in source systems is accurate and reliable, BI users will spend more effort on manual activities and rework than on business-related activities.

In order to improve the quality of data, many companies initiate data quality assessment programs and form data stewardship groups. Yet in the absence of a comprehensive methodology, measuring data quality remains elusive: it proves easier to produce hundreds or thousands of data error reports than to make any sense of them.

The purpose of this document is to provide a methodology for managing data quality issues. The process is described in the order the events should occur, from the initial capture of data quality issues to presenting the findings to the data owners for further action.

Data quality can be defined as the state of completeness, consistency, timeliness and accuracy which makes the data appropriate for a specific use.

The dimensions of data quality are:

• Accuracy – loading of facts/dimensions correctly

• Completeness - having all relevant data stored

• Consistency - format uniformity for storing the data

• Timeliness - data storage in the required time frame

Thus data quality should ensure that the data loaded into the target destination is timely, accurate, complete and consistent.

Data quality issues result from:

• Incorrect manual entry of data in the source system

• Lack of common data standards across the various business divisions when integrating data from multiple data sources while loading a data warehouse

• Lack of a proper business process; in some cases, root cause analysis of a data quality issue may point to a business process that needs to be re-designed to mitigate the issue


The main sources of data quality issues, as reported in a TDWI survey, are categorized and ranked below.

Figure 1: Sources of Data Quality Problems

Benefits with Improved Data Quality

Improved information quality delivers benefits that can be classified as either hard or soft.

Soft benefits are those that are evident, clearly have an effect on productivity, yet are difficult to measure. These include:

• Build user confidence and trust in the data disseminated by the Data Warehouse solution – good data quality promotes use of the Data Warehouse.

• Improve throughput for volume processing – by reducing the delays associated with detecting and correcting data errors, and the rework associated with that correction, more transactions can be processed, resulting in greater volume processing and lower cost per transaction.

• Improve customer profiling – having more compliant customer information allows the business intelligence process to provide more accurate customer profiling, which in turn can lead to increased sales, better customer service, and increased retention of valued customers.

• Decrease resource requirements – redundant data, correction, and rework put an unnecessary strain on an organization’s resource pool. Eliminating redundant data and reducing the amount of rework reduces that strain and provides better resource allocation and utilization.

Hard benefits are those which can be estimated and/or measured. These include:

• Customer attrition, which occurs when a customer’s reaction to poor data quality results in the customer’s complete cessation of business


• Costs of error detection and correction

• Detection costs, which are incurred when a system error or processing failure occurs and a process is invoked to track down the problem, including the extra time it takes to isolate data problems

• Correction costs, which are associated with the actual correction of a problem as well as the restarting of any failed processes or activities. The amount of time associated with the activity that failed, along with extraneous employee activity, are all rolled up into correction costs.

• Costs of data maintenance i.e. maintaining spreadsheets to meet information requirements

• Extra resources needed to correct data problems

• Time and effort required to re-run jobs that abend

• Time wasted arguing over inconsistent reports

• Lost business opportunities due to unavailable data

• Fines paid for noncompliance with government regulations

• Shipping products to the wrong customers

• Bad public relations with customers, leading to alienated and lost customers

The Data Warehousing Institute (TDWI) estimates that poor quality customer data costs U.S. businesses a staggering $611 billion a year in postage, printing, and staff overhead. Organizations can frustrate and alienate loyal customers by incorrectly addressing letters or failing to recognize them when they call or visit a store or Web site. Once a company loses its loyal customers, it loses its base of sales and referrals, and future revenue potential.

These benefits can be realized by following the Data Quality Assessment Methodology described in the following section.

Data Quality Assessment Methodology

The Data Quality Assessment methodology consists of a five-stage process for assessing and improving the data quality of the solution being addressed:

1. Extract – Identify and analyze the source data being cleansed

2. Discover – Identify and understand the current data issues

3. Cleanse – Implement checks to rectify data errors

4. Monitor – Establish processes for regularly validating the data

5. Prevention – Fix the processes by which data errors are introduced


Figure 2: Data Quality Methodology

1. Extract

Data profiling is the first step towards ensuring data quality. Data Profiling is the assessment of data to understand its content, structure, quality and dependencies.

Data profiling looks at standard column analysis such as frequency, NULL checks, cardinality, etc. This process will usually expose the most glaring data quality problems. Some of the common methods in data profiling are listed below:

• Structure discovery

Check whether the different patterns of data are valid.

e.g. the format of zip codes, phone numbers, addresses, etc. can be checked.

• Data discovery

Is the data value correct, error free and valid?

e.g. check whether mandatory attributes contain incorrect data.

• Relationship discovery

Verify that all key relationships are maintained and that end-to-end linkage is possible, e.g. the link between a child and a parent table.

• Data redundancy

Check whether the same data is represented multiple times.

The results of the profiling should be discussed with the client to determine which of the issues are related to business problems.
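A minimal profiling routine can cover the null, cardinality, and pattern checks described above. The following Python sketch is illustrative only; the sample values and the five-digit zip-code pattern are assumptions, not a prescribed implementation:

```python
import re

def profile_column(values, pattern=None):
    """Basic column profiling: row count, null rate, cardinality, and an
    optional structure check against a regular-expression pattern."""
    total = len(values)
    nulls = sum(1 for v in values if v is None or v == "")
    distinct = len(set(values))          # cardinality, counting null as a value
    violations = 0
    if pattern:
        rx = re.compile(pattern)
        # Only non-null values are tested against the expected pattern
        violations = sum(1 for v in values if v and not rx.fullmatch(v))
    return {"rows": total, "null_count": nulls,
            "cardinality": distinct, "pattern_violations": violations}
```

For example, profiling a hypothetical zip-code column with `profile_column(["12345", None, "1234", "12345"], r"\d{5}")` flags one null and one structure violation, the kind of finding that would then be discussed with the client.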

The five steps and their activities, as depicted in Figure 2, are:

• Extract – deploy automated profiling tools for accelerated data analysis

• Discover – understand current data issues; investigate data errors and identify the root cause of failures; quantify the gap between ‘current’ and ‘desired’ quality levels; prioritize data quality levels to meet objectives

• Cleanse – implement validation checks and business rules to detect data errors and ensure logical consistency of data; rectify erroneous data using ETL rules; manage data exceptions manually; communicate errors and their impact to data providers

• Monitor and Prevention – monitor and identify data gaps and plan for maintaining and enhancing the quality of data

Some examples of data profiling analysis are shown in Figure 3.

Figure 3: Data Profiling Analysis

2. Discover

After the problems have been identified, the various errors are analyzed and corrected.

• Data standardization

Same data is represented in different formats. Identify them and use common standard values.

• Pattern standardization

A particular attribute may have data following different patterns. Generate a common pattern for the data, e.g. standardize phone numbers stored as 999-9999999, 999(9999999), or (999) 999-9999 into a single pattern.

• Data verification

Verify the correctness of data and reduce ambiguities.

e.g.: Customer address data can be checked to see if it is correct.
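Pattern standardization of the kind described above can be sketched as a small normalization function. The target pattern (999) 999-9999 and the assumption of 10-digit numbers are illustrative choices, not part of the methodology itself:

```python
import re

def standardize_phone(raw):
    """Normalize differently patterned phone numbers to (999) 999-9999.
    Returns None when the value cannot be reduced to 10 digits, so the
    record can be routed to manual exception handling."""
    digits = re.sub(r"\D", "", raw or "")   # strip everything but digits
    if len(digits) != 10:
        return None
    return f"({digits[:3]}) {digits[3:6]}-{digits[6:]}"
```

With this sketch, the variants 999-9999999 and 999(9999999) both standardize to the single pattern (999) 999-9999.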

3. Cleanse

Through cleansing, data quality defects are corrected using the appropriate data cleansing action.

The common data cleansing actions are:

• Filter data to remove rule violations

• Correct data to repair rule violations

The profiling analyses in Figure 3 map data quality dimensions to questions and checks:

Dimension | Question | Profiling Analysis
Completeness | What data is missing or unusable? | Null checks, Uniqueness checks
Conformity | What data is stored in a non-standard format? | Standard code sets, Rules
Consistency | What data values give conflicting information? | Relationship Analysis
Accuracy | What data is incorrect or out of date? | Domain validation, Range validation
Duplicates | What data records or attributes are repeated? | Redundancy Evaluation
Integrity | What data is missing or not referenced? | Referential Integrity, Cardinality Analysis


Filter data:

The purpose of filtering is to remove problematic data; this action is typically applied when data is considered defective to a degree that makes it unusable.

Correct data:

The purpose of correction is to fix defective data. Correction alters the values of individual fields.

Determining the replacement value may use the following techniques:

1) Identifying errors while integrating data

• Inserting a default value that indicates the absence of reliable data

• Removing redundant information, e.g. when the same data is available in two different source systems, or when the same data is represented in two different formats in two source systems (address information, for example, can be represented in different formats)

2) Searching alternative sources to find a replacement value, e.g. incorporating additional external data to add value to existing records and increase understanding: when customer data is appended with more business details, a better understanding of the customer can be obtained.
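The filter and correct actions can be sketched together. The `customer_id` key, the `country` field, and the `UNKNOWN` default below are hypothetical illustrations of filtering unusable records and inserting a default value that indicates the absence of reliable data:

```python
DEFAULT_COUNTRY = "UNKNOWN"  # default signalling absence of reliable data

def cleanse(records):
    """Apply the two cleansing actions:
    - filter: drop records defective beyond repair (missing their key)
    - correct: replace a missing field value with a default"""
    cleansed, rejected = [], []
    for rec in records:
        if not rec.get("customer_id"):          # unusable: filter it out
            rejected.append(rec)
            continue
        if not rec.get("country"):              # repairable: correct it
            rec = {**rec, "country": DEFAULT_COUNTRY}
        cleansed.append(rec)
    return cleansed, rejected
```

Rejected records would typically be reported back to the data providers rather than silently discarded, in line with the communication step of the methodology.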

Data quality rules are implemented in the following order:

Figure 4: Data Quality Rules Implementation


The implementation steps depicted in Figure 4 are:

• Assess data quality objectives

• Identify and define data cleansing rules for the various subject areas (customer, products, etc.)

• Define functional/program specifications for the cleansing rules

• Implement data validation routines in the source feed file, e.g. for missing data values or data elements

• Implement the validation and cleansing rules


Data Warehouse solutions have a role to play in ensuring the quality of data: they ensure that only data that is fit for use is loaded and made available to consumers. These solutions can also use dedicated data cleansing tools to conduct activities such as address cleansing, standardization and de-duping.

Data Warehouse solutions have various ways of validating that the source data is of good quality and conditioning the data where appropriate, prior to loading it into the final target tables. These solutions can also provide for automated proactive alerting.

4. Monitor

The monitoring step involves measuring data quality and tracking errors. It ensures that all reported anomalies are corrected and monitored.

Data quality monitoring is a process that focuses on improving the quality of data: it ensures that the data is valid, that proper standardization is attained, and that redundant data is identified and eliminated.

Once data is corrected, regular monitoring of the data is necessary to avoid errors and ambiguities. It can be done by:

• Creating reports on a regular basis

• Creating rules to validate the data

• Generating events to correct the data

Data monitoring will include creating a list of critical data quality problems as well as the business problems to which they relate.

Data monitoring should not only measure information compliance to defined business rules, but also measure the actual costs associated with noncompliance.

Data quality monitoring can be implemented either as an interim solution or as a long-term solution.

In the interim solution, data from various sources is profiled and analyzed for anomalies; once the data quality rules have been determined from the profiling analysis, they are deployed to overcome these anomalies.

For a long-term solution, data quality metrics and audit reports can be created to measure the overall data quality improvement.

Audit report structure details are provided in the Appendix.

Quality metrics provide the means to quantify data quality. Metrics are needed to measure the current state of data quality and to evaluate the progress made towards improving it.

Various attributes (format, range, domain, etc.) of the data elements can be measured. The measurements can be rolled up or aggregated into metrics, e.g. the number of defective addresses, invalid phone numbers, and incorrectly formatted email addresses can all be measured and rolled up into one metric that represents the quality of just the contact data.
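The roll-up just described can be sketched as a simple aggregation. The attribute names and defect counts below are invented for illustration; any real metric would come from the profiling and validation checks:

```python
def rollup_metric(measurements):
    """Aggregate attribute-level defect counts into one quality metric,
    expressed as the percentage of clean values.
    `measurements` maps attribute name -> (defective_count, total_count)."""
    defective = sum(d for d, _ in measurements.values())
    total = sum(t for _, t in measurements.values())
    return round(100 * (1 - defective / total), 2) if total else None

# Hypothetical contact-data measurements rolled into one metric
contact_quality = rollup_metric({
    "address": (120, 10_000),   # defective addresses
    "phone":   (45, 10_000),    # invalid phone numbers
    "email":   (80, 10_000),    # incorrectly formatted email addresses
})
```

Here the three attribute-level measurements roll up into a single contact-data quality figure of 99.18 percent, the kind of number a scorecard would track over time.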

Figure 5: Data Quality Monitoring

5. Prevention

The purpose of prevention is to remove causes of defective data to fix the processes by which defects are introduced. Prevention determines the root causes of defective data and takes steps to eliminate them.

By providing error reports and audit and reconciliation reports to the source system providers for correction of data, data quality issues can be reduced, and hence prevented, over a period of time. These reports give the data owners visibility into the errors, their causes, and the corrective action that needs to be taken.

Reconciliation is a process through which data from both source and target systems are compared and analyzed. Validation scripts are run on both source and target data for comparison.
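A reconciliation check of this kind can be sketched as a key-based comparison of source and target extracts. The `id` key and the row shapes are illustrative assumptions; real validation scripts would also compare aggregates and checksums:

```python
def reconcile(source_rows, target_rows, key):
    """Compare source and target extracts on a key column: report rows
    missing from the target and rows whose non-key values differ."""
    src = {r[key]: r for r in source_rows}
    tgt = {r[key]: r for r in target_rows}
    missing = sorted(set(src) - set(tgt))                       # not loaded
    mismatched = sorted(k for k in set(src) & set(tgt)          # loaded but
                        if src[k] != tgt[k])                    # altered
    return {"source_count": len(src), "target_count": len(tgt),
            "missing_in_target": missing, "mismatched": mismatched}
```

The resulting counts and key lists feed directly into the audit and reconciliation reports sent back to the source system providers.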

Data quality issues should ideally be addressed and resolved in the source systems themselves. This helps ensure that data in the Data Warehouse is always in sync with data in the source systems.



Strong frameworks and processes are required for controlling data quality and for managing data. Additional validation procedures such as data-level reconciliation help ensure success in delivering high data quality solutions.


Case Study:

For one of the Life Insurance clients, the objective of the data quality assessment was to provide an approach for increasing the data accuracy of the data warehouse and thus building business confidence in using it.

The client had multiple source systems loading customer and product data into the data warehouse. The requirement was to provide an approach by which data quality within the warehouse could be improved, addressing business issues that were having an impact on the business and on overall cost.

Figure 6: Data Quality Relationship between Impact and Cost

The data quality assessment was conducted using the Data Quality methodology, and a solution was provided for improving data quality for data access. The solution provided the following benefits:

• Provided a data warehouse environment that enabled:

o A 360-degree view of the customer

o Integrated product information from various source systems

o The ability to satisfy users' reporting needs

• Improved operational efficiencies by making data available to users when they need it, via a single standard framework, so that they can effectively make informed decisions

• Gave analysts the opportunity to spend more quality time on analysis and less time on data quality issue resolution

The solution improvements provided for recommendations in the following domains


Data Architecture

• Created source to target mappings

• Enabled single version of entities and metrics

• Structured Standards (Ex. Naming Standards)

ETL Processes

The solution helped define robust data validation, rejection and reconciliation mechanisms built into the ETL processes.

Processes that need to be defined in the ETL:

• Data Integration Rules

• Data Standardization

• Data Rejection

• Data Reconciliation

Recommended Data Steward Participation during functional testing

The Data Steward, along with the test team, needs to be involved in the following data quality aspects of functional testing of the ETL and reporting applications:

• Ensure that the sample data used for testing represents all kinds of irregularities and peculiarities of the source data

• Ensure that the ETL is able to identify, handle and notify all types of defined data issues

• Ensure that the reports and queries used for testing cover the required data samples

The activities carried out by a Data Steward are provided in the Appendix.

Data Quality Monitoring

The recommended data quality monitoring program was to check on data purity levels. This program would involve:

• A data quality scorecard to measure purity levels of the data warehouse, identify issues proactively, and plan projects to address these issues

• Proactive analysis of the quality of source system data to identify new corruption issues and modify the ETL to handle them

• Periodic assessment of the effectiveness of the ETL error notification process in getting the source system to resolve these issues



Data Quality Audit

The purpose of auditing is to understand the degree to which data quality problems exist i.e. the extent and severity of data defects.

Audit procedures examine content, structure, completeness and other factors to detect and report integrity and correctness rules violations.

Data auditing is a process for identifying errors and checking the health of the overall system with regard to the quality of its data. As the amount of data and the number of processes escalate over time, the amount of inaccurate data also increases, decreasing data quality. Audit reports can be created to measure progress in achieving data quality goals and complying with service level agreements.

It is very important to understand the error prone areas in the BI solution.

A data Quality scorecard can show the overall quality status. A template for Audit Summary is shown below.

The audit summary shows the number of occurrences of the different types of errors, as per the error stack, at various stages in the system. On one axis the type of error is shown; on the other, the stage where the error occurred. This gives an overall picture of error occurrence; for example, it might show that 500 data entry errors occurred in the source system, giving a fair idea of the areas where data quality is poor or strong.
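The audit summary cross-tabulation can be sketched as follows; the stage and error-type labels are illustrative, standing in for whatever error stack a real solution defines:

```python
from collections import Counter

def audit_summary(errors):
    """Tabulate error occurrences on the two audit-summary axes:
    stage where the error occurred vs. type of error."""
    counts = Counter((e["stage"], e["type"]) for e in errors)
    stages = sorted({s for s, _ in counts})
    types = sorted({t for _, t in counts})
    # One row per stage, one column per error type, zero-filled
    return {s: {t: counts.get((s, t), 0) for t in types} for s in stages}
```

Feeding it a log of, say, three data entry errors in the source system and one format error in the ETL stage yields the stage-by-type grid that a data quality scorecard would display.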


Data Steward

The data steward is responsible for driving organizational agreement on data definitions, business rules, and domain values for the data warehouse, and for publishing and reinforcing these definitions and rules.

The data steward is responsible for the following

• Primary Guardian of data while it is being created or maintained

• Responsible to create standards and procedures to ensure that policies and business rules are known and followed

• Should enforce adherence to policies and business rules that govern the data while the data is in their custody

• Should periodically monitor (audit) the quality of the data in their custody

Creating a data governance body, processes and tools for managing data quality will help to establish a robust framework for development.

Figure 7: Data Governance Framework


Data Quality and Data Profiling tools

Some of the Leading Data Quality vendors/products are provided below

Sanjay Seth

Sanjay Seth, a Senior Architect with the Business Intelligence Practice of a leading IT consulting firm, has 8 years of extensive experience in the data warehousing/business intelligence space.

