Methodologies for Data Quality Measurement and Improvement
7.1 Basics on Data Quality Methodologies
7.1.3 Comparison among Data-driven and Process-driven Strategies
In this section we compare data-driven and process-driven strategies. For simplicity, we distinguish three major data-driven strategies, each based on a distinct data quality activity:
1. New data acquisition from the real world. When data representing a certain reality of interest are inaccurate, incomplete, or out-of-date, a possible way to improve their quality is to observe the reality of interest again, performing the activity called new data acquisition in Chapter 4. E.g., if in a registry of employees the DateOfBirth is known in only 30% of the cases, we may request the missing data from the employees concerned. Intuitively, if the data acquisition campaign is performed effectively, this strategy immediately improves certain quality dimensions such as completeness, accuracy, and currency, since the data represent exactly the most recent reality of interest; we note, however, that errors can be introduced by the measurement activity.
2. Record matching or, more generally, the comparison of data whose quality dimensions have to be improved with other data in which the quality is known to be good. As an example, let us consider a database of addresses of clients that have been collected for a long period of time in a supermarket through forms, in order to provide clients with a fidelity card. After a while, certain quality dimensions, such as accuracy of residence addresses, tend to worsen. We may decide to perform a record matching activity to compare client records with an administrative database, known to be updated with the most recent data.
3. Use of data edits/integrity constraints, in which: (i) we define a set of integrity constraints against which data have to be checked, (ii) we discover inconsistencies among data, and (iii) we correct the inconsistent data by means of error localization and correction activities.
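Steps (i) and (ii) of the third strategy can be illustrated with a minimal Python sketch. The record fields, the three edits, and the age threshold of 16 are hypothetical choices made only for illustration; step (iii), error localization and correction, is deliberately left out.

```python
from datetime import date

# (i) A set of edits. Each edit is a (name, predicate) pair; the predicate
# returns True when the record satisfies the constraint.
# All edits and field names here are illustrative assumptions.
EDITS = [
    ("birth_before_hire", lambda r: r["date_of_birth"] < r["hire_date"]),
    ("adult_at_hire", lambda r: r["hire_date"].year - r["date_of_birth"].year >= 16),
    ("known_status", lambda r: r["status"] in {"active", "retired"}),
]


def find_violations(records, edits=EDITS):
    """(ii) Return {record id: [names of violated edits]} for inconsistent records."""
    violations = {}
    for rec in records:
        failed = [name for name, check in edits if not check(rec)]
        if failed:
            violations[rec["id"]] = failed
    return violations


records = [
    {"id": 1, "date_of_birth": date(1980, 5, 17), "hire_date": date(2005, 3, 1), "status": "active"},
    {"id": 2, "date_of_birth": date(2010, 1, 1), "hire_date": date(2008, 6, 1), "status": "active"},
    {"id": 3, "date_of_birth": date(1975, 9, 30), "hire_date": date(2000, 1, 10), "status": "unknwon"},
]

violations = find_violations(records)
for rec_id, failed in violations.items():
    print(rec_id, failed)
```

Record 2 violates two edits at once, which is the situation in which error localization has to decide which field is actually wrong.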
Process-driven strategies focus on processes. Consequently, they need to acquire knowledge from databases and data flows in input only to a limited extent. Conversely, they focus mainly on measuring the quality of processes and formulating proposals for process improvement. Two main phases characterize process-driven strategies:
• Process control, which inserts checks and control procedures into the data production process when (i) new data is inserted from internal or external sources, (ii) data sources accessed by the process are updated, or (iii) new data sources are involved in the process. In this way, a reactive strategy is applied to data modification events, to avoid data degradation and error propagation.
• Process redesign, where, rather than adjusting the existing process, we redesign the production processes in order to remove the causes of bad quality and introduce new activities that produce data of better quality. When the change in the process is radical, this strategy corresponds to the activity called business process reengineering (see [91] and [181] for a comprehensive introduction to this issue).
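The idea of process control, placing a check at the point where new data enter the production process, can be sketched as follows. The store class, the record fields, and the two checks are hypothetical and serve only to show where the control sits; real checks would be derived from the process at hand.

```python
def check_record(record):
    """Return a list of problems found in an incoming record (empty if none)."""
    problems = []
    if not (record.get("name") or "").strip():
        problems.append("missing name")
    if "@" not in (record.get("email") or ""):
        problems.append("malformed email")
    return problems


class ControlledStore:
    """Toy data store that runs the control checks on every insertion."""

    def __init__(self):
        self.accepted = []
        self.rejected = []  # (record, problems) pairs kept for later review

    def insert(self, record):
        problems = check_record(record)
        if problems:
            # The faulty record is held back so the error does not propagate.
            self.rejected.append((record, problems))
            return False
        self.accepted.append(record)
        return True


store = ControlledStore()
ok_first = store.insert({"name": "Ada", "email": "ada@example.org"})
ok_second = store.insert({"name": "", "email": "bob"})
print(len(store.accepted), len(store.rejected))
```

Keeping rejected records together with the reasons for rejection, rather than silently dropping them, leaves a trace that can later support proposals for process improvement.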
We now compare data-driven and process-driven strategies according to two coordinates of analysis: (i) the improvement the strategy is potentially able to produce on quality dimensions and (ii) the cost of its implementation. This comparison can be performed both in the short term and in the long term. In the following (see Figure 7.3), we compare improvement and costs in the long term; optimal target objectives are high improvement and low cost.
[Figure 7.3: diagram with axes Cost (Low to Very High) and Improvement (Low to High), positioning the strategies Do nothing, New data acquisition, Use of integrity constraints, Record matching and comparison, Process control, and Process re-design.]
Fig. 7.3. Improvement and cost of data/process-driven strategies: comparison in the long term
The simplest and most trivial strategy is to do nothing. In this case, data are neglected and abandoned; certain quality dimensions, such as completeness and currency, tend to worsen in the long term. The consequence is that data progressively deteriorate the quality of business processes and the cost of lost quality increases over time.
A better strategy is new data acquisition; in the short term, the improvement is significant, since data are current, complete, and accurate. However, as time goes by we are obliged to periodically repeat the process, and the cost becomes intolerable.
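The registry example given earlier for new data acquisition can be made concrete with a small sketch that measures the completeness of an attribute and lists the employees who would have to be contacted. The data are made up to reproduce the 30% figure, and treating None as "missing" is an assumption of this sketch.

```python
def completeness(records, attribute):
    """Fraction of records with a non-missing value for the given attribute."""
    if not records:
        return 0.0
    known = sum(1 for r in records if r.get(attribute) is not None)
    return known / len(records)


def to_contact(records, attribute):
    """Ids of the records whose value for the attribute is missing."""
    return [r["id"] for r in records if r.get(attribute) is None]


# Hypothetical registry: DateOfBirth is known for 3 employees out of 10.
employees = [
    {"id": i, "DateOfBirth": "1980-01-01" if i < 3 else None}
    for i in range(10)
]

ratio = completeness(employees, "DateOfBirth")
missing_ids = to_contact(employees, "DateOfBirth")
print(ratio, missing_ids)
```

Since the strategy has to be repeated periodically, the size of the contact list at each round gives a rough indication of how much of the acquisition cost recurs.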
The strategy that uses integrity constraints leads to much lower costs, but at the same time it is less effective, since only the errors related to constraints can be checked. The errors can be corrected only to a certain extent, as we have seen in Chapter 4.
The strategy performing record matching has even lower costs and yields even greater improvement, since many techniques have been developed and implemented, as we saw in Chapter 5. A relevant part of the work can be done automatically. Furthermore, once the records corresponding to the same object have been identified, high quality values can be chosen for the different attributes from the higher quality source.
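As a rough illustration of this strategy, the sketch below matches client records against a reference source by name similarity and, when a match is found, takes the address from the reference source. It is not one of the techniques of Chapter 5: the similarity measure (Python's difflib.SequenceMatcher), the 0.85 threshold, and all field names and data are assumptions chosen for brevity, and the reference list is assumed to be non-empty.

```python
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase the text and collapse runs of whitespace."""
    return " ".join(text.lower().split())


def similarity(a, b):
    """Similarity in [0, 1] between two strings after normalization."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def match_and_update(clients, reference, threshold=0.85):
    """Copy the address from the best-matching reference record, if similar enough."""
    updated = []
    for client in clients:
        best = max(reference, key=lambda ref: similarity(client["name"], ref["name"]))
        if similarity(client["name"], best["name"]) >= threshold:
            # Same real-world object: prefer the value from the higher quality source.
            updated.append({**client, "address": best["address"]})
        else:
            updated.append(dict(client))
    return updated


clients = [
    {"name": "Maria  Rossi", "address": "Via Roma 1"},
    {"name": "John Smith", "address": "12 Old St"},
]
reference = [
    {"name": "maria rossi", "address": "Via Milano 22"},
    {"name": "Luca Bianchi", "address": "Corso Italia 5"},
]

updated = match_and_update(clients, reference)
for rec in updated:
    print(rec)
```

Comparing every client with every reference record is quadratic in the number of records, so a sketch of this kind is only practical for small inputs.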
In order to be effective, the preceding strategies, which all belong to the class of data-driven strategies, have to be repeated, leading to costs that increase in the long term. Only when we move to process-driven methods can we optimize