Tools for Data Quality
8.4 Toolboxes to Compare Tools
The toolboxes proposed for comparing tools focus on the object identification problem. Neiling et al. [145] adopt a theoretical approach, while [65] describes Tailor, a practical toolbox based on experiments. The two toolboxes are described in the following subsections.
8.4.1 Theoretical Approach
Neiling et al. [145] present a theoretical framework for comparing techniques.
Two aspects are addressed: the complexity of object identification problems and the quality of object identification techniques.
With regard to the first aspect, a reference indicator called hardness is introduced. It characterizes the difficulty of an object identification problem;
for example, it is intuitive that it is more complex to perform record linkage over two files with low accuracy than over two correct files. As remarked in Chapter 5, the different techniques adopt very specific decision models, characterized in terms of inputs, outputs, and objectives. Therefore, each of the techniques can be more suitable for one class of problems and less suitable
for another class of problems. The hardness measures how good a technique is for a specific class of problems. The hardness depends on several factors, such as (i) a set of semantic constraints valid in the domain of interest, (ii) the number of pairs to be identified, and (iii) the selectivity of the attribute set that contains identifying information used in the object identification problem.
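To make factor (iii) concrete, the following minimal sketch computes one simple proxy for the selectivity of an attribute set: the fraction of distinct value combinations among the records. This proxy, the function name `attribute_selectivity`, and the sample records are assumptions made for illustration; they are not the definition of hardness or selectivity given in [145].

```python
from typing import Dict, List, Sequence


def attribute_selectivity(records: List[Dict[str, object]],
                          attributes: Sequence[str]) -> float:
    """Fraction of distinct value combinations over the chosen attributes.

    A value of 1.0 means the attribute set is unique for every record;
    values near 0 mean that many records share the same values.
    This is an illustrative proxy, not the measure defined in [145].
    """
    if not records:
        raise ValueError("records must be non-empty")
    distinct = {tuple(rec[a] for a in attributes) for rec in records}
    return len(distinct) / len(records)


# Hypothetical sample data.
people = [
    {"name": "Ann Lee", "city": "Rome", "birth_year": 1970},
    {"name": "Ann Lee", "city": "Milan", "birth_year": 1970},
    {"name": "Bob Ray", "city": "Rome", "birth_year": 1982},
    {"name": "Bob Ray", "city": "Rome", "birth_year": 1982},
]

city_only = attribute_selectivity(people, ["city"])               # 2 distinct / 4 records
name_and_city = attribute_selectivity(people, ["name", "city"])   # 3 distinct / 4 records
```

Under this proxy, an attribute set with higher selectivity discriminates better between records, so one would expect the corresponding object identification problem to be easier.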
The second issue addressed in [145] concerns a test framework for the comparison of techniques. The framework consists of a test database, its characteristics (e.g., the existence of semantic keys), several quality criteria for evaluating the quality of a solution, and a test specification. The quality criteria, inspired by database benchmarks (see [85]), are of two types: quantitative and qualitative. Quantitative criteria are:
1. correctness, the estimation of misclassification rates for test runs;
2. scalability with respect to the size of the input;
3. performance in terms of computational effort;
4. cost, i.e., expenses for the running operations, e.g., hardware and software licenses.
The most important among the above criteria is correctness, which is measured by false negative percentage and false positive percentage, as defined in Chapter 5, Section 5.9.1.
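As an illustration of the correctness criterion, the sketch below computes false negative and false positive percentages from a set of true matching pairs and a set of pairs declared as matches by a technique. The denominators used here (true matches for false negatives, declared matches for false positives) are an assumption; the definitions that apply in this book are those of Section 5.9.1. The function name and sample pairs are hypothetical.

```python
from typing import Dict, Set, Tuple

Pair = Tuple[int, int]


def error_percentages(true_matches: Set[Pair],
                      declared_matches: Set[Pair]) -> Dict[str, float]:
    """Misclassification rates for one test run.

    Assumed conventions (not taken from Section 5.9.1):
      false negative % = missed true matches / all true matches
      false positive % = wrongly declared matches / all declared matches
    Pairs are expected in a canonical order, e.g. (smaller_id, larger_id).
    """
    false_negatives = len(true_matches - declared_matches)
    false_positives = len(declared_matches - true_matches)
    fn_pct = 100.0 * false_negatives / len(true_matches) if true_matches else 0.0
    fp_pct = 100.0 * false_positives / len(declared_matches) if declared_matches else 0.0
    return {"false_negative_pct": fn_pct, "false_positive_pct": fp_pct}


# Hypothetical test run: four true matches, three declared matches.
truth = {(1, 2), (3, 4), (5, 6), (7, 8)}
declared = {(1, 2), (3, 4), (5, 9)}

rates = error_percentages(truth, declared)
```

In this run two of the four true matches are missed and one of the three declared matches is wrong, so the two percentages differ; keeping the denominators explicit makes it easy to swap in another convention.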
Qualitative criteria include usability, integrability, reliability, completeness, robustness, transparency, adaptability and flexibility. From these we define three: usability is defined as the need for specialized experts and the possibility of automated or incremental updates; integrability is considered in the light of existing software architecture functionalities, such as interfaces, data/object exchange, remote control; transparency concerns understandability and non-proprietariness of algorithms and results. For definitions of the remaining criteria, see [145].
Similar to the benchmarks available for database management systems, the above set of qualities provides the general criteria for comparing object identification techniques.
8.4.2 Tailor
Tailor [65] is a toolbox for comparing object identification techniques and tools through experiments. The corresponding benchmarking process can be built by tuning a few parameters and plugging in tools that have been developed in-house or are publicly available.
Tailor has four main functionalities (see Figure 8.11), called layers in [65], corresponding to (i) the three main record linkage steps discussed in Chapter 5, i.e., searching method, comparison function, and decision model, and (ii) measurement. Figure 8.11 shows the information flow between the four functionalities, and how the record linkage process operates. The flow is coherent with the general procedure discussed in Chapter 5. At a final stage, a measurement step is performed, to estimate the performance of the decision model.

Fig. 8.11. Architecture of Tailor
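The four-layer flow can be sketched as a small pipeline. The code below is an illustration only, not Tailor's implementation or interface: the function names, the blocking key, the use of Python's `difflib.SequenceMatcher` as comparison function, the fixed threshold used as decision model, and the sample records are all assumptions.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def search_by_blocking(records, key):
    """Searching layer: pair only records that share a blocking key."""
    blocks = defaultdict(list)
    for rec_id, rec in records.items():
        blocks[key(rec)].append(rec_id)
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


def compare(a, b):
    """Comparison layer: name similarity in [0, 1]."""
    return SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()


def decide(similarity, threshold=0.8):
    """Decision layer: a simple threshold rule (hypothetical value)."""
    return similarity >= threshold


def measure(declared, truth):
    """Measurement layer: count correct and incorrect decisions."""
    return {
        "true_positives": len(declared & truth),
        "false_positives": len(declared - truth),
        "false_negatives": len(truth - declared),
    }


# Hypothetical data: records 1, 2, and 4 describe the same person.
records = {
    1: {"name": "John Smith", "zip": "00100"},
    2: {"name": "Jon Smith", "zip": "00100"},
    3: {"name": "Mary Jones", "zip": "00100"},
    4: {"name": "John Smith", "zip": "20100"},
}
truth = {(1, 2), (1, 4), (2, 4)}

candidates = search_by_blocking(records, key=lambda r: r["zip"])
declared = {(i, j) for i, j in candidates
            if decide(compare(records[i], records[j]))}
report = measure(declared, truth)
```

In this example, blocking on the zip code never proposes the pairs involving record 4, so those matches are lost before any comparison takes place. Effects of this kind, where one layer limits what the later layers can achieve, are what the measurement layer is meant to expose when alternative methods are plugged in.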
[Figure: techniques, models, and metrics implemented in Tailor, organized by layer. Recoverable entries: probabilistic models (Fellegi and Sunter family, cost-based).]

Fig. 8.12. Tailor list of implemented techniques
Figure 8.12 provides a complete list of the various techniques, models and metrics implemented in each of the three record linkage steps. All searching methods and comparison functions mentioned in the figure have been introduced and discussed in Chapter 5. For decision models, the reader may refer to [65] for the clustering model and the hybrid model.
8.5 Summary
Tools and frameworks are crucial for making the techniques and methodologies effective. A comparative analysis of commercial tools is out of the scope of
this book. In this chapter we have discussed a specific group of tools and frameworks that closely implement research results. These tools cover various functionalities related to data quality activities, while commercial tools are more focused on specific issues.
In the area of data quality, as in many other areas, there is a temporal gap between research results and their implementation in tools. Furthermore, research groups tend to develop prototypes, characterized by uncertain compatibility and scarce documentation, due to the high investment needed for engineering and selling products. Researchers who want to use tools in their research activity have three choices: (i) use commercial tools, trying to obtain academic licenses, (ii) use public domain tools, extending them with new functionalities, or (iii) develop their own tools. The third choice should be encouraged every time a new technique is conceived, in order to experiment and compare results. A theoretical or even qualitative comparison, especially in the data quality area, is seldom possible, even when similar paradigms are adopted; only the richness of experimental results can provide evidence of the superiority of one tool with respect to others. Another challenging issue is the production of highly specialized, integrated tools, as an evolution of present tools.
With regard to frameworks, the development process is at an early stage, despite the need for many DQ functionalities in distributed and cooperative information systems. Finally, we notice that the tool is not the solution. In the spirit of this book, this means that the DQ measurement and improvement process has to be carefully planned, using the methodologies discussed in Chapter 7, and the choice of tools has to be addressed only when the relationships between organizations, processes, databases, data flows, external sources, dimensions, and activities to be performed have been deeply understood.