Toolboxes to Compare Tools - Tools for Data Quality

Tools for Data Quality

8.4 Toolboxes to Compare Tools

Toolboxes proposed to compare tools focus on the object identification problem. [145] adopts a theoretical approach, while [65] describes a practical tool based on experiments, called Tailor. The two toolboxes are described in the following subsections.

8.4.1 Theoretical Approach

Neiling et al. [145] present a theoretical framework for comparing techniques.

Two aspects are addressed: the complexity of object identification problems and the quality of object identification techniques.

With regard to the first aspect, a reference indicator called hardness is introduced. It characterizes the difficulty of an object identification problem; for example, it is intuitive that it is more complex to perform record linkage over two files with low accuracy than over two correct files. As remarked in Chapter 5, the different techniques adopt very specific decision models, characterized in terms of inputs, outputs, and objectives. Therefore, each of the techniques can be more suitable for one class of problems and less suitable for another class of problems. The hardness measures how good a technique is for a specific class of problems. The hardness depends on several factors, such as (i) a set of semantic constraints valid in the domain of interest, (ii) the number of pairs to be identified, and (iii) the selectivity of the attribute set that contains identifying information used in the object identification problem.
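Two of the factors listed above, the number of pairs and the selectivity of the identifying attribute set, can be illustrated with a small Python sketch. This is not the hardness indicator of [145], whose formal definition is not reproduced here. The function names are hypothetical, and selectivity is assumed to mean the fraction of distinct value combinations among the records, which is only one plausible reading.

```python
from typing import Dict, Sequence

Record = Dict[str, str]


def candidate_pair_count(file_a: Sequence[Record], file_b: Sequence[Record]) -> int:
    """Number of record pairs in the full cross product of the two files."""
    return len(file_a) * len(file_b)


def attribute_selectivity(records: Sequence[Record], attributes: Sequence[str]) -> float:
    """Fraction of distinct value combinations over the given attributes.

    Assumed definition (not taken from [145]): 1.0 means the attribute set
    distinguishes every record; lower values mean many records share values.
    """
    if not records:
        return 0.0
    distinct = {tuple(rec.get(attr, "") for attr in attributes) for rec in records}
    return len(distinct) / len(records)


# Toy data, invented for illustration only.
people = [
    {"name": "Ann Rossi", "city": "Rome"},
    {"name": "Ann Rossi", "city": "Milan"},
    {"name": "Bob Smith", "city": "Rome"},
    {"name": "Bob Smith", "city": "Rome"},
]

pairs = candidate_pair_count(people, people)
name_only = attribute_selectivity(people, ["name"])
name_and_city = attribute_selectivity(people, ["name", "city"])
```

In this toy file, adding the city attribute raises the selectivity, which intuitively makes the identification problem easier because fewer records are indistinguishable.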

The second issue addressed in [145] concerns a test framework for the comparison of techniques. The framework consists of a test database, its characteristics (e.g., the existence of semantic keys), several quality criteria for the evaluation of the quality of a solution, and a test specification. The quality criteria, inspired by database benchmarks (see [85]), are of two types: quantitative criteria and qualitative criteria. Quantitative criteria are:

1. correctness, the estimation of misclassification rates for test runs;

2. scalability with respect to the size of the input;

3. performance in terms of computational effort;

4. cost, i.e., expenses for the running operations, e.g., hardware and software licenses.

The most important among the above criteria is correctness, which is measured by false negative percentage and false positive percentage, as defined in Chapter 5, Section 5.9.1.
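As an illustration of the correctness criterion, the following Python sketch computes the two percentages for a test run whose true matches are known. The normalization is an assumption: false negatives are divided by the number of true matches and false positives by the number of true non-matches. The definitions in Section 5.9.1 should be taken as authoritative where they differ. The function name and the toy data are hypothetical.

```python
from itertools import product
from typing import Set, Tuple

Pair = Tuple[str, str]


def error_percentages(
    true_matches: Set[Pair],
    declared_matches: Set[Pair],
    all_pairs: Set[Pair],
) -> Tuple[float, float]:
    """Return (false negative %, false positive %) for one test run.

    Assumed normalization: false negatives over true matches,
    false positives over true non-matches.
    """
    false_negatives = len(true_matches - declared_matches)
    false_positives = len(declared_matches - true_matches)
    true_non_matches = len(all_pairs - true_matches)
    fn_pct = 100.0 * false_negatives / len(true_matches) if true_matches else 0.0
    fp_pct = 100.0 * false_positives / true_non_matches if true_non_matches else 0.0
    return fn_pct, fp_pct


# Toy test database: two records in file A, three in file B.
all_pairs = set(product(["a1", "a2"], ["b1", "b2", "b3"]))
true_matches = {("a1", "b1"), ("a2", "b2")}
declared_matches = {("a1", "b1"), ("a2", "b3")}  # one miss, one wrong match

fn_pct, fp_pct = error_percentages(true_matches, declared_matches, all_pairs)
```

Here one of the two true matches is missed and one of the four true non-matches is wrongly declared a match.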

Qualitative criteria include usability, integrability, reliability, completeness, robustness, transparency, adaptability and flexibility. From these we define three: usability is defined as the need for specialized experts and the possibility of automated or incremental updates; integrability is considered in the light of existing software architecture functionalities, such as interfaces, data/object exchange, remote control; transparency concerns understandability and non-proprietariness of algorithms and results. For definitions of the remaining criteria, see [145].

Similar to the benchmarks available for database management systems, the above set of qualities provides the general criteria for comparing object identification techniques.

8.4.2 Tailor

Tailor [65] is a toolbox for comparing object identification techniques and tools through experiments. The corresponding benchmarking process can be built by tuning a few parameters and plugging in tools that have been developed in-house or are publicly available.

Tailor has four main functionalities (see Figure 8.11), called layers in [65], corresponding to (i) the three main record linkage steps discussed in Chapter 5, i.e., searching method, comparison function, and decision model, and (ii) measurement. Figure 8.11 shows the information flow between the four functionalities, and how the record linkage process operates. The flow is coherent with the general procedure discussed in Chapter 5. At a final stage, a measurement step is performed, to estimate the performance of the decision model.

Fig. 8.11. Architecture of Tailor
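The four-layer flow can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not Tailor's actual implementation or API: all function names are hypothetical, blocking on an exact key stands in for the searching method, a string similarity from the standard library stands in for the comparison function, and a fixed threshold stands in for the decision model.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from typing import Callable, Dict, List, Sequence, Set, Tuple

Record = Dict[str, str]
Pair = Tuple[str, str]


def search_by_blocking(file_a: Sequence[Record], file_b: Sequence[Record],
                       key: Callable[[Record], str]) -> List[Tuple[Record, Record]]:
    """Searching layer: pair only the records that share a blocking key."""
    blocks = defaultdict(list)
    for rec in file_b:
        blocks[key(rec)].append(rec)
    return [(a, b) for a in file_a for b in blocks.get(key(a), [])]


def compare(a: Record, b: Record, attributes: Sequence[str]) -> List[float]:
    """Comparison layer: one similarity score in [0, 1] per attribute."""
    return [SequenceMatcher(None, a[attr], b[attr]).ratio() for attr in attributes]


def decide(scores: Sequence[float], threshold: float = 0.8) -> bool:
    """Decision layer: declare a match when the mean similarity reaches the threshold."""
    return sum(scores) / len(scores) >= threshold


def measure(declared: Set[Pair], truth: Set[Pair]) -> Dict[str, int]:
    """Measurement layer: count missed and spurious matches against known truth."""
    return {"false_negatives": len(truth - declared),
            "false_positives": len(declared - truth)}


# Toy files, invented for illustration; "zip" is the blocking key.
file_a = [
    {"id": "a1", "name": "john smith", "zip": "00185"},
    {"id": "a2", "name": "mary jones", "zip": "20121"},
    {"id": "a3", "name": "luca bianchi", "zip": "10121"},
]
file_b = [
    {"id": "b1", "name": "jon smith", "zip": "00185"},
    {"id": "b2", "name": "mary jones", "zip": "20121"},
    {"id": "b3", "name": "peter brown", "zip": "00185"},
    {"id": "b4", "name": "luca bianchi", "zip": "10122"},  # zip differs from a3
]
truth = {("a1", "b1"), ("a2", "b2"), ("a3", "b4")}

candidates = search_by_blocking(file_a, file_b, key=lambda rec: rec["zip"])
declared = {(a["id"], b["id"]) for a, b in candidates
            if decide(compare(a, b, ["name"]))}
report = measure(declared, truth)
```

In this toy run the pair (a3, b4) is never compared because the two zip codes differ, so the measurement layer reports it as a false negative. This shows how a choice made in the searching layer affects the figures produced in the final step.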

[Figure 8.12: table with the columns "Layer" and "Techniques, models and metrics implemented in Tailor". Surviving entries: probabilistic models (Fellegi & Sunter family, cost based).]

Fig. 8.12. Tailor list of implemented techniques

Figure 8.12 provides a complete list of the various techniques, models and metrics implemented in each of the three record linkage steps. All searching methods and comparison functions mentioned in the figure have been introduced and discussed in Chapter 5. For decision models, the reader may refer to [65] for the clustering model and the hybrid model.

8.5 Summary

Tools and frameworks are crucial for making the techniques and methodologies effective. A comparative analysis of commercial tools is out of the scope of this book. In this chapter we have discussed a specific group of tools and frameworks that closely implement research results. These tools cover various functionalities related to data quality activities, while commercial tools are more focused on specific issues.

In the area of data quality, as in many other areas, there is a temporal gap between research results and their implementation in tools. Furthermore, research groups tend to develop prototypes, characterized by uncertain compatibility and scarce documentation, due to the high investment needed for engineering and selling products. A researcher who aims at using tools in his/her research activity has three choices: (i) use commercial tools, trying to obtain academic licenses, (ii) use public domain tools, extending them with new functionalities, or (iii) develop his/her own tools. The third choice has to be encouraged every time a new technique is conceived, in order to experiment and compare results. A theoretical or even qualitative comparison, especially in the data quality area, is seldom possible, even when similar paradigms are adopted; only the richness of experimental results can provide evidence of the superiority of a tool with respect to others. Another challenging issue is the production of highly specialized, integrated tools, as an evolution of present tools.

With regard to frameworks, the development process is at an early stage, despite the need for many DQ functionalities in distributed and cooperative information systems. Finally, we notice that the tool is not the solution. In the spirit of this book, this means that the DQ measurement and improvement process has to be carefully planned, using the methodologies discussed in Chapter 7, and the choice of tools has to be addressed only when the relationships between organizations, processes, databases, data flows, external sources, dimensions, and activities to be performed have been deeply understood.
