

Methodologies for Data Quality Measurement and Improvement


at the same time effectiveness and costs: process control activities and, above all, process re-design activities can get to the root of the problem and solve it once and for all. Their costs are mainly the fixed costs of the one-shot control or re-design activity, plus variable process maintenance costs distributed over time.

The above considerations are valid for the long term. In the short term, process re-design is well known to be very costly; as a consequence, data-driven strategies become more competitive. We refer the reader to [167] for a complete discussion of these issues.

7.2 Assessment Methodologies

The goal of assessment methodologies is to provide a precise evaluation and diagnosis of the state of the information system with regard to DQ issues.

Therefore, the principal outputs of assessment methodologies are (i) measurements of the quality of databases and data flows, (ii) costs to the organization due to the present low quality, and (iii) a comparison with DQ levels considered acceptable from experience, or else a benchmarking with best practices, together with suggestions for improvements. The usual process followed in assessment methodologies has three main activities:

1. relevant dimensions and metrics are initially chosen, classified, and measured;

2. subjective judgments of experts are performed; and

3. objective measurements and subjective judgments are compared.

Some examples of methodologies for the choice of dimensions and measures and for the objective vs. subjective evaluation are given by Lee et al. [114], Kahn et al. [107], Pipino et al. [161], Su et al. [185], and De Amicis et al. [56]. With regard to dimension classification, dimensions are classified in [114] (see Figure 7.4) into sound, useful, dependable, and usable, according to their positioning in quadrants related to “product quality/service quality” and “conforms to specifications/meets or exceeds consumer expectations” coordinates. The goal of the classification is to provide a context for each individual DQ dimension and metric, and for consequent evaluation. In the following we describe in detail the methodology proposed in [56], which was tailored for the financial domain (see the main phases in Figure 7.5). For an example of benchmarking in the financial domain, see [127]. Here, we adopt the statistical term variable for attributes whose quality is to be measured.

Phase 1, variables selection, concerns the identification, description, and classification of the primary variables of financial registries, which correspond to the main data attributes to be assessed. The most relevant variables in financial databases are identified. Then, they are characterized according to their meaning and role. The possible characterizations are qualitative/categorical, quantitative/numerical, and date/time.

[Figure 7.4: grid of quadrants; recoverable labels: “Conforms to specifications”, “Meets or exceeds consumer expectations”, “Usable”]

Fig. 7.4. Classification of dimensions in [114] for assessment purposes

[Figure 7.5: phase diagram; recoverable labels: “Assessment”, “of errors”, “Quantitative objective assessment”, “Qualitative subjective assessment”, “Business & data quality expertise”]

Fig. 7.5. The main phases of the assessment methodology described in [56]

In phase 2, analysis, data dimensions and integrity constraints to be measured are identified. Simple statistical techniques are used for the inspection of financial data. Selection and inspection of dimensions is related to process analysis. It has the final goal of discovering the main causes of erroneous data, such as unstructured and uncontrolled data loading and data updating processes. The result of the analysis on selected dimensions leads to a report with the identification of the errors.
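The kind of inspection described for phase 2 can be illustrated with a minimal Python sketch. It checks two hypothetical integrity constraints (a value must be present, and it must belong to an allowed domain) and produces a simple error report. The field names `currency` and `rating`, the reference domains, and the function `find_errors` are assumptions made for illustration; they are not taken from [56].

```python
from collections import Counter

# Hypothetical reference domains (assumptions for this sketch only);
# a real registry would use the complete official code lists.
VALID_CURRENCIES = {"EUR", "USD", "GBP"}
VALID_RATINGS = {"Aaa", "Aa", "A", "Baa", "Ba", "B", "Caa", "Ca", "C"}

CONSTRAINTS = (("currency", VALID_CURRENCIES), ("rating", VALID_RATINGS))


def find_errors(records):
    """Return (record index, field, problem) for every violated constraint."""
    errors = []
    for index, record in enumerate(records):
        for field, domain in CONSTRAINTS:
            value = record.get(field)
            if value is None or value == "":
                errors.append((index, field, "missing"))
            elif value not in domain:
                errors.append((index, field, "out of domain"))
    return errors


records = [
    {"currency": "EUR", "rating": "Aaa"},
    {"currency": "EURO", "rating": "Baa"},  # currency code not in the domain
    {"currency": "USD", "rating": ""},      # rating missing
]

errors = find_errors(records)
# Count errors per (field, problem) pair, as a basis for the error report.
error_counts = Counter((field, problem) for _, field, problem in errors)
```

Grouping the counts by field and problem type gives a first indication of where errors concentrate, which is the starting point for tracing them back to a loading or updating process.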

In phase 3, objective/quantitative assessment, appropriate indices are defined for the evaluation and quantification of the global data quality level. The number of erroneous observations for the different dimensions and the different data attributes is first evaluated with statistical and/or empirical methods, and subsequently normalized and summarized. An example of quantitative assessment is shown in Figure 7.6, where the three variables considered, typical of the financial domain, are:

1. Moody’s rating. Moody’s Investors Service is a leading provider of risk analysis, offering a system of ratings of the relative creditworthiness of securities.

2. Standard and Poor’s rating, from another leading provider.

3. Market currency code, e.g. EUR.

The values associated with quality dimensions represent the percentages of erroneous data by data quality dimension. Internal consistency refers to the consistency of a data value item within the same set of financial data; external consistency refers to the consistency of a data value item in different data sets.
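The normalization step of phase 3 can be sketched as follows: raw counts of erroneous observations are converted into percentages per variable and quality dimension, and then summarized per variable. This is only one possible reading of "normalized and summarized"; the averaging rule, the variable and dimension names, and the counts are hypothetical and are not the indices defined in [56].

```python
def error_percentages(error_counts, total_observations):
    """Convert raw error counts into percentages of the observations checked."""
    if total_observations <= 0:
        raise ValueError("total_observations must be positive")
    return {
        key: 100.0 * count / total_observations
        for key, count in error_counts.items()
    }


def summarize_by_variable(percentages):
    """Average the per-dimension percentages of each variable.

    Averaging is an assumption of this sketch; other summaries
    (maximum, weighted mean) are equally plausible.
    """
    grouped = {}
    for (variable, _dimension), pct in percentages.items():
        grouped.setdefault(variable, []).append(pct)
    return {variable: sum(v) / len(v) for variable, v in grouped.items()}


# Hypothetical counts: (variable, dimension) -> number of erroneous observations
counts = {
    ("moodys_rating", "internal_consistency"): 12,
    ("moodys_rating", "external_consistency"): 3,
    ("currency_code", "internal_consistency"): 0,
}

percentages = error_percentages(counts, total_observations=400)
summary = summarize_by_variable(percentages)
```

With 400 observations, 12 errors correspond to 3.0% and 3 errors to 0.75%, so the averaged figure for the hypothetical `moodys_rating` variable is 1.875%.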

[Figure 7.6: chart of percentages of erroneous data; axes: “Variables” (including Moody’s rating and Standard & Poor’s rating) and “Quality dimensions”]

Fig. 7.6. Example of objective quantitative assessment

Phase 4 deals with subjective/qualitative assessment. The qualitative assessment is obtained by merging three independent evaluations from (i) a business expert, who analyzes data from a business process point of view; (ii) a financial operator (e.g., a trader), who uses daily financial data; and (iii) a data quality expert, who has the role of analyzing data and examining its quality. See Figure 7.7 for a possible result of this phase, where domain values are High, Medium, and Low.
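The text does not state how the three evaluations are merged. As a sketch under that caveat, one simple rule is to take the median of the three judgments on the ordered scale Low < Medium < High, so that a single dissenting expert does not decide the outcome. The function name `merge_judgments` and the median rule are assumptions of this example, not the procedure of [56].

```python
LEVELS = ["Low", "Medium", "High"]  # ordered from worst to best quality


def merge_judgments(business_expert, financial_operator, dq_expert):
    """Merge three High/Medium/Low judgments by taking their median.

    The median rule is an assumption made for illustration only.
    """
    ranks = sorted(
        LEVELS.index(judgment)
        for judgment in (business_expert, financial_operator, dq_expert)
    )
    return LEVELS[ranks[1]]  # middle element of three sorted ranks


merged_split = merge_judgments("High", "Medium", "Low")
merged_majority = merge_judgments("High", "High", "Low")
```

An unknown label such as "Very high" raises a `ValueError` from `list.index`, which keeps malformed judgments from silently entering the assessment.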

Fig. 7.7. Example of subjective qualitative assessment

Finally, a comparison between objective and subjective assessment is performed. For each variable and quality dimension, we calculate the distance between the percentages of erroneous observations obtained from quantitative analysis, mapped in the discrete domain [High, Medium, Low], and the quality level defined by the judgment of the three experts. Discrepancies are analyzed by the data quality expert, to detect causes of errors and to find alternative solutions to correct them.
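The comparison step can be sketched as follows. Error percentages are first mapped into the discrete domain, and the distance is then taken as the number of steps between the objective and subjective levels on the ordered scale. The thresholds (at most 1% of errors for High, at most 5% for Medium) and the step-count distance are hypothetical choices for this example; the source does not give the actual mapping or distance function.

```python
LEVELS = ["Low", "Medium", "High"]  # ordered from worst to best quality


def level_from_error_percentage(pct, high_max=1.0, medium_max=5.0):
    """Map a percentage of erroneous observations to a quality level.

    The thresholds are illustrative assumptions, not values from the source.
    """
    if not 0.0 <= pct <= 100.0:
        raise ValueError("percentage must be between 0 and 100")
    if pct <= high_max:
        return "High"
    if pct <= medium_max:
        return "Medium"
    return "Low"


def distance(objective_level, subjective_level):
    """Number of steps between two levels on the ordered scale."""
    return abs(LEVELS.index(objective_level) - LEVELS.index(subjective_level))


def discrepancies(objective_pct, subjective_levels):
    """Return the non-zero distances, keyed by (variable, dimension)."""
    result = {}
    for key, pct in objective_pct.items():
        d = distance(level_from_error_percentage(pct), subjective_levels[key])
        if d > 0:
            result[key] = d
    return result


# Hypothetical inputs for two (variable, dimension) pairs
objective = {
    ("moodys_rating", "internal_consistency"): 3.0,
    ("currency_code", "internal_consistency"): 0.0,
}
subjective = {
    ("moodys_rating", "internal_consistency"): "High",
    ("currency_code", "internal_consistency"): "High",
}

to_review = discrepancies(objective, subjective)
```

Here the hypothetical `moodys_rating` entry maps to Medium objectively while the experts judged it High, so it is flagged with distance 1 for the data quality expert to examine; the `currency_code` entry agrees and is not reported.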

7.3 Comparative Analysis of General-purpose