The CDQM methodology - Methodologies for Data Quality Measurement and Improvement

Methodologies for Data Quality Measurement and Improvement

7.4 The CDQM methodology

3, in the model proposed to describe the data production process. TQdM typically uses very simple formalisms, e.g., charts. The Istat methodology provides a signiﬁcant number of statistical techniques. With regard to the consolidation, TDQM and TQdM have been widely applied since the 1990s in U.S. and, to some extent, in other countries, while the Istat methodology is very recent, with a limited number case studies documented.

7.4 The CDQM methodology

Now we discuss an original methodology, characterized by a reasonable bal-ance between completeness and the practical feasibility of the data quality improvement process. The methodology deals with all types of knowledge described in Figure 7.1; for this reason, we will call it Complete Data Qual-ity Methodology (CDQM ) methodology. The phases and steps of CDQM are shown in Figure 7.15.

Phase 1: State reconstruction

1. Reconstruct the state and meaning of most relevant databases and data flows exchanged between organizations, and build the database + dataflow/organization matrixes.

2. Reconstruct most relevant business processes performed by organizations, and build the processes /organizations matrix.

3.For each process or group of processes related in a macroprocess, reconstruct the norms and organizational rules that discipline the macroprocess and the service provided.

Phase 2: Assessment

4. Check the major problems related with the services provided with the internal and final users. Fix these drawbacks in terms of process and service qualities, and identify the causes of the drawbacks due to low data quality.

5. Identify relevant DQ dimensions and metrics, measure data quality of databases and data flows, and identify their critical areas.

Phase 3: Choice of the optimal improvement process

6. For each database and data flow, fix the new DQ levels that improve process quality and reduce costs under a required threshold.

7. Conceive process re-engineering activities and choose DQ activities, that may lead to DQ improvement targets set in step 6, relating them in the data/activity matrix to clusters of databases and data flows involved in DQ improvement targets.

8. Choose optimal techniques for the DQ activities.

9. Connect crossings in the data/activity matrix in reasonable candidate improvement processes

10. For each improvement process defined in the previous step, compute approximate costs and benefits, and choose the optimal one, checking that the overall cost-benefit balance meets the targets of step 6.

Phase 1: State reconstruction

1. Reconstruct the state and meaning of most relevant databases and data flows exchanged between organizations, and build the database + dataflow/organization matrixes.

2. Reconstruct most relevant business processes performed by organizations, and build the processes /organizations matrix.

3.For each process or group of processes related in a macroprocess, reconstruct the norms and organizational rules that discipline the macroprocess and the service provided.

Phase 2: Assessment

5. Identify relevant DQ dimensions and metrics, measure data quality of databases and data flows, and identify their critical areas.

Phase 3: Choice of the optimal improvement process

6. For each database and data flow, fix the new DQ levels that improve process quality and reduce costs under a required threshold.

8. Choose optimal techniques for the DQ activities.

9. Connect crossings in the data/activity matrix in reasonable candidate improvement processes

Fig. 7.15. Phases and steps of CDQM

The overall strategy of CDQM sees the measurement and improvement activities as being deeply related to the business processes and to the costs of the organization. In phase 1 all the most important relationships between or-ganization units, processes, services, and data, if not known are reconstructed.

Phase 2 sets new target quality dimensions which are needed to improve pro-cess qualities, and evaluates reduced costs and new benefits. Phase 3 finds the optimal improvement process, i.e., the sequence of activities that has the optimal cost-effectiveness. In this section we examine the specific steps. The next section will provide a detailed case study.

7.4.1 Reconstruct the State of Data

Similarly to what happens in information system planning methodologies, at the beginning of the DQ process we reconstruct a model of the most relevant relationships between organizations or organizational units and data used and exchanged. This information is important, since it provides a picture of the main uses of data, of providers, and of consumers of data ﬂows. We can rep-resent these relationships with two matrixes:

1. the database/organization matrix (see Figure 7.16), where, for the most relevant databases, we represent organizations that create data and orga-nizations that use data. This matrix could be reﬁned, representing single entities (or tables), but in order to make its size reasonable, we set the granularity at the database level; and

Creates

Fig. 7.16. The database/organization matrix

2. the dataﬂow/organization matrix (see Figure 7.17), similar to the previous one, in which we represent the provider and consumer organizations of the most relevant data ﬂows.

Consumer

Fig. 7.17. The dataﬂow/organization matrix

7.4 The CDQM methodology 183 7.4.2 Reconstruct Business Processes

In this step we focus on processes and their relationships with organizational units. Processes are units of work performed in the organization and related to the production of goods or services. For every process we have to ﬁnd the organizational unit that is its owner, and the units that participate in the execution of the process: the whole set of cross-relationships is represented in the process/organization matrix , an example of which is given in Figure 7.18.

Distinguishing the owner of the process is important in DQ issues, since we can assign precise responsibilities in data-driven and process-driven improvement activities.

………

Owner Participates

Organization 2

Participates Owner

Participates Organziation m

…………

Participates Owner

Organization 1

Process n Process 2

Process 1 Process/

Organization

………

Owner Participates

Organization 2

Participates Owner

Participates Organziation m

…………

Participates Owner

Organization 1

Process n Process 2

Process 1 Process/

Organization

Fig. 7.18. The process/organization matrix

7.4.3 Reconstruct Macroprocesses and Rules

In this step we analyze two aspects in depth: the structure and the ﬁnal ob-jectives of the processes in the organization, i.e., how they are related and linked in the production of goods/services (denoted in the following for sim-plicity as services), and the legal and organizational rules that discipline and specify this structure. The relevant characteristics of processes are described in the macroprocess/norm-service-process matrix (see Figure 7.19), where the following aspects are represented:

• the macroprocess, i.e., the set of processes that are all together involved in service provision;

• services provided, identiﬁed by a name and, possibly, by the class of users of the service, their characteristics, and the organization responsible for service provision;

• norms that discipline the high-level speciﬁcation of the process.

Reconstructing the macroprocesses is an important activity, since model-ing processes independently provides only a fragmented view of the activities

of the organization. On the contrary, we need an integrated view to make decisions related to the possible restructuring of processes and information ﬂows. At the same time, especially in public organizations, the knowledge of norms related to the macroprocesses is relevant to precisely understand (i) the area at our disposal for “maneuvers” in process-driven activities, (ii) the extent to which we are free to restructure processes, and (iii) the norms or organizational rules to be repealed, changed, or modiﬁed.

Notice in Figure 7.19 that macroprocesses are represented as a set of pro-cesses. This model is very simple, and could be enriched using a process spec-iﬁcation language (see example in [193]).

………

Norm3 and Norm4 Norm 2

Norm 1

Norm/organiza-tional rule

S3 and S4 S2 and S5

S1 and S5 Service(s)

X Process 1

X Process 2

X Process 3

X Process n

...

X Process 4

Macroprocess m Macroprocess2

Macroprocess1

Macroprocess ………

Norm3 and Norm4 Norm 2

Norm 1

Norm/organiza-tional rule

S3 and S4 S2 and S5

S1 and S5 Service(s)

X Process 1

X Process 2

X Process 3

X Process n

...

X Process 4

Macroprocess m Macroprocess2

Macroprocess1 Macroprocess

Fig. 7.19. The macroprocess/norm-service-process matrix

7.4.4 Check Problems with Users

The goal of this step is to identify the most relevant problems, in terms of causes of poor data quality. Focusing initially on services, they can be identi-fied by interviewing internal and final users, and by understanding the major burdens and negative effects of poor data quality on the activities of internal users and on the satisfaction of final users. Then, the analysis goes back to processes to find the causes, in terms of quality and the nature of processes, that produce such burdens and negative effects. As an example, taxpayers of a district are bothered, if they receive erroneous notices of assessment from the revenue agency. It may be discovered that tax files for that district are not accurate, due to delayed or incorrect updates.

7.4.5 Measure Data Quality

In previous steps we have identiﬁed main problems that lead to poor data quality; here, we have to select, among the set of dimensions and metrics

7.4 The CDQM methodology 185 discussed in Chapter 2, the most relevant ones for our problem; for such dimensions, we have to choose metrics to provide a quantitative evaluation of the state of the system. For example, if the major burden perceived by ﬁnal users is the time delay between an information service request and service provision, we have to focus on the currency dimension, and organize a process to measure it.

Another relevant aspect of this step is locating critical areas, mentioned in the discussion on the Istat methodology. Since the improvement activities are complex and costly, it is advisable to focus on the parts of databases and data ﬂows that reveal major problems. This activity can be performed in two ways:

• Analyzing problems and causes, and trying to identify the data whose poor quality is more negatively inﬂuenced by them. In the taxpayer example, we focus on one speciﬁc district, since complaints come prevalently from that area.

• Analyzing statistics on data quality metrics selected according to diﬀerent properties of data, and determining where poor quality is located. We have seen this case in the example on names of streets discussed in Section 7.3.4.

7.4.6 Set New Target DQ Levels

In this step we set new target DQ levels, evaluating the economic impact of the improvement in terms of (hopefully) reduced costs and improved beneﬁts.

We have discussed in Chapter 4 some classifications of costs and benefits, and proposed a new one. The idea in this step is to use such classifications as a checklist; for each item in the classification, or in a subset of it, we collect data that allow some approximate estimate of the costs, savings, and other benefits associated with the item. Some items are easily calculated, such as, e.g., the cost of equipment involved in data cleaning activities. Other items need an estimate. For example, we may have perceived that a significant cost item is related to the time spent by clerks in looking for unmatched citizens, or for missing businesses in a registry. In the former case, we (i) estimate the number of clerks involved in the activity in terms of person-months a year; (ii) multiply this number by the average of the gross salary. Some cost items are difficult or even impossible to estimate. In this case, we identify a proxy cost item that provides an indirect valuation of the item that cannot be estimated.

Other aspects to be addressed concern the so called intangible beneﬁts, which are diﬃcult to express in monetary terms, and have to be eventually considered on a qualitative basis. Finally, the calculation of return on invest-ment is useful to help senior manageinvest-ment make a decision about the level of commitment to the data quality program.

The last issue to be dealt with in this step is the establishment of a rela-tionship between costs, beneﬁts, and quality levels. For instance, we assume

that presently 10% of customer addresses are not correct, and such poor qual-ity reduces potential revenues of sales campaigns by 5%. We have to identify, at least qualitatively, the functions that relate(i) costs of processes, (ii) sav-ings, and (iii) the cost of the improvement program for accurate addresses.

Then, we have to superimpose the three functions, to ﬁnd the optimal bal-ance between cost and savings and the corresponding target quality level to be achieved.

7.4.7 Choose Improvement Activities

This step is perhaps the most critical one for the success of the methodology.

The goal here is to understand which process-driven activities and which data-driven activities lead to the most effective results for quality improvement of databases and data flows. In this choice we may group databases and data flows or split them, in order to examine only critical areas or specific parts that are relevant in an activity.

With regard to process-driven activities, business process reengineering activity (see [91], [181], and [137] for a comprehensive discussion) is composed of the following steps:

• Map and analyze the as-is process, in which the objective is typically to describe the actual process.

• Design the to-be process, producing one or more alternatives to the current process.

• Implement a reengineered process and improve continuously.

Data-driven activities have been described in great detail in previous chap-ters. To choose from them, we have to start the analysis from causes and problems, discovered in step 4. We discuss a few cases.

1. If a relational table has low accuracy, and another source represents the same objects and common attributes with higher accuracy, we perform an object identiﬁcation activity on the table and the source. Then, we select the second source for values of common attributes.

2. Assume that a table exists used mainly for statistical applications, and characterized by low completeness. We perform an error correction activ-ity that changes null values to valid values, keeping the statistical distri-bution of values unchanged.

3. Assume that a certain data ﬂow is of very poor quality; in this case, we perform a source selection activity on data conveyed by the data ﬂow.

The goal of a source selection activity is to change the actual source, selecting one or more data sources that together provide the requested data with better quality. Source selection can be seen as a particular case of quality-driven query processing, discussed in Chapter 6.

At the end of the step, we should be able to produce a data/activity matrix like the one shown in Figure 7.20, where we put a cross for every pair of (i) activity and (ii) groups of databases or data ﬂows to which it applies.

7.4 The CDQM methodology 187

Fig. 7.20. The data/activity matrix

7.4.8 Choose Techniques for Data Activities

In this step we have to choose the best technique and tool for each data ac-tivity in the data/acac-tivity matrix . To choose the technique, starting from the available knowledge domain, we use all the arguments and comparative anal-ysis dealt with in Chapters 4, 5, and 6. Here, we need to look at the market to check which techniques, among the chosen ones, are implemented in commer-cial DQ tools. We have to compare their costs and technical characteristics;

therefore, the choice of the technique is influenced by the market availability of the tools. With reference to the object identification activity, many com-mercial tools or open source tools include probabilistic techniques, while tools adopting empirical and knowledge-based techniques are less widespread. If the tool is extendible, it can be chosen and then adapted to specific requirements.

For instance, assume that we have performed in the past a deduplication ac-tivity on citizens of a country, in which last names are typically very long; now we have to perform the same activity on citizens of another country where last names are shorter. If in the past we have used a probabilistic technique with given distance functions for the attributes Name, LastName, and Address, we could modify the technique, adapting the decision procedure to the changed context, by changing, for instance, for the attribute LastName the distance function and weights as discussed in Chapter 5.

7.4.9 Find Improvement Processes

We now have to link crosses in the data/activity matrix in order to produce possible candidate improvement processes, with the objective of achieving completeness, i.e., all databases and data ﬂows involved in the improvement program are covered. Linking of crossings in the data/activity matrix can be performed in several ways, and gives rise to several candidate processes, two or three of them usually suﬃcient to cover all possible relevant choices. In Figure 7.21 we see one of them, in a context in which we have chosen object

identiﬁcation, error correction, and data integration as data-driven activities, and business process reengineering as process-driven.

X Process re-design

X X

Data integration

X Error localization

And correction

X X

Object identification

BD1/2/7 BD1/5/6

BD3 BD1 e BD2 Data/Activity

2 3

X Process re-design

X X

Data integration

X Error localization

And correction

X X

Object identification

BD1/2/7 BD1/5/6

BD3 BD1 e BD2 Data/Activity

2 3

Fig. 7.21. An example of improvement process

7.4.10 Choose the Optimal Improvement Process

We are close to the solution; we now have to compare the candidate improve-ment processes from the point of view of the cost of the improveimprove-ment program.

For instance, anticipating a business process reengineering activity may lead to a more efficient object identification activity, and anticipating an object identification activity results in simpler error correction.

Items to be considered in cost evaluation include cost of equipment, cost of personnel, cost of licenses for tools and techniques, and cost of new custom software to be realized for ad hoc problems. Once the costs are evaluated and compared, we choose the most eﬀective improvement process. At this point, it is important to compare again the costs of the selected improvement process with net savings (hopefully) resulting from the set new DQ levels step; the net ﬁnal balance should be positive; otherwise, it is better to do nothing!

In document 6 Data Quality Issues in Data Integration Systems (Page 49-56)