Data Quality enhancement with Data Mining

3.3 Euclid Data Quality

3.3.4 Data Quality enhancement with Data Mining

In the traditional DQ methodology, briefly touched on in the previous section, a statistical approach is usually employed to measure the quality of data, in many common cases with good results (for example in financial, enterprise and medical warehouses). In much more complex cases, however, and especially in data warehouses designed as repositories of high-precision scientific experiment results (as in the Euclid case), the traditional approach appears to be insufficient.

The major limit of statistical methods, when applied directly to data quality control, is that traditional DQ modifies the data themselves, which must be avoided for scientific data. Data Mining (DM), by contrast, is a methodology for measuring the quality of data while preserving their intrinsic nature. DM algorithms extract knowledge that can be used to measure the quality of data, with particular reference to the quality of input transactions, and then, if necessary, to flag the data of poor quality. A typical procedure for measuring the DQ of data transactions is based on three steps:

1. Extract all association rules that depend on the input transactions;

2. Select the compatible association rules;

3. Add the confidence factors of the compatible rules as a criterion for the data quality of the transaction.
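The three steps can be sketched in Python on a toy example. This is only an illustrative sketch under several assumptions that are not in the text: the records, the attribute names, the thresholds `min_support` and `min_confidence`, and the function names are all hypothetical; "compatible" is read here as "both the antecedent and the consequent of the rule are contained in the transaction"; and the rule extraction is a brute-force enumeration rather than an optimized algorithm. The normalized score is an added convenience, not part of the procedure described above.

```python
from itertools import combinations

# Hypothetical toy records: each transaction is a set of "attribute=value" items.
REC_A = frozenset({"detector=A", "mode=wide", "gain=high"})
REC_B = frozenset({"detector=B", "mode=deep", "gain=low"})
TRANSACTIONS = [REC_A] * 3 + [REC_B] * 3


def mine_rules(transactions, min_support=0.3, min_confidence=0.6, max_len=3):
    """Step 1: brute-force extraction of rules (antecedent, consequent, confidence)."""
    n = len(transactions)
    items = sorted(set().union(*transactions))
    # Count how many transactions contain each candidate itemset.
    counts = {}
    for size in range(1, max_len + 1):
        for candidate in combinations(items, size):
            itemset = frozenset(candidate)
            count = sum(1 for t in transactions if itemset <= t)
            if count / n >= min_support:
                counts[itemset] = count
    # Split every frequent itemset into antecedent -> consequent.
    rules = []
    for itemset, count in counts.items():
        for k in range(1, len(itemset)):
            for ante in combinations(sorted(itemset), k):
                ante = frozenset(ante)
                # Any subset of a frequent itemset is frequent, so it is in counts.
                confidence = count / counts[ante]
                if confidence >= min_confidence:
                    rules.append((ante, itemset - ante, confidence))
    return rules


def applicable_rules(transaction, rules):
    """Rules whose antecedent is contained in the transaction."""
    return [r for r in rules if r[0] <= transaction]


def compatible_rules(transaction, rules):
    """Step 2 (assumed reading): applicable rules whose consequent also holds."""
    return [r for r in applicable_rules(transaction, rules) if r[1] <= transaction]


def confidence_sum(transaction, rules):
    """Step 3: add the confidence factors of the compatible rules."""
    return sum(conf for _, _, conf in compatible_rules(transaction, rules))


def normalized_score(transaction, rules):
    """Assumed variant: fraction of the applicable confidence that is satisfied."""
    possible = sum(conf for _, _, conf in applicable_rules(transaction, rules))
    if possible == 0:
        return None  # no rule says anything about this transaction
    return confidence_sum(transaction, rules) / possible


RULES = mine_rules(TRANSACTIONS)
CLEAN = frozenset({"detector=A", "mode=wide", "gain=high"})
SUSPECT = frozenset({"detector=A", "mode=deep", "gain=high"})

for name, record in (("clean", CLEAN), ("suspect", SUSPECT)):
    score = normalized_score(record, RULES)
    # The record is only flagged, never modified; 0.5 is an arbitrary threshold.
    print(name, confidence_sum(record, RULES), score, "FLAG" if score < 0.5 else "ok")
```

In this sketch the record that mixes values from the two groups satisfies few of the rules that apply to it, so it receives a low score and is flagged, while the data themselves are left untouched.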

Two issues are challenging. First, extracting all association rules takes a lot of time; second, in most cases there is no exact mathematical formula for measuring data quality.
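To give a sense of scale for the first issue, a standard counting argument can be written down. The symbols $d$ (number of distinct items) and $R$ (number of possible rules) are our notation, not the source's. Each item is placed in the antecedent, in the consequent, or in neither, and the assignments with an empty antecedent or an empty consequent are then removed:

```latex
R \;=\; 3^{d} \;-\; 2\cdot 2^{d} \;+\; 1 \;=\; 3^{d} - 2^{d+1} + 1,
\qquad \text{e.g. } d = 6 \;\Rightarrow\; R = 729 - 128 + 1 = 602 .
```

The count grows exponentially with the number of items, which is why exhaustive extraction becomes slow on realistic data.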

A more effective DM approach to DQ should therefore be an alternative to the search for exact deterministic or statistical formulas. For us, the answer lies in employing methodologies derived from Machine Learning (ML) paradigms, such as (a) active on-line learning, which addresses the issue of optimizing the combination and trade-off of losses incurred during data acquisition; and (b) associative reinforcement learning, connected with the predictive quality of the final hypothesis. Moreover, one of the guidelines of our proposed approach is to combine these machine learning paradigms with features coming from biological adaptive systems.

The key principles are to process information using a connectionist approach to computation, in order to emulate the powerful correlation ability at the base of the cognitive learning engine of the human brain, together with the optimization process at the base of biological evolution (Darwin’s law).

Our experience with this methodology has produced the DAME Program, which includes several projects, mostly connected with Astrophysics, although spread across various of its scientific branches and sub-domains. Data Mining is usually conceived as an application (a deterministic or stochastic algorithm) to extract unknown information from noisy data. This is basically true, but it is too reductive with respect to the wide range covered by the mining concept. More precisely, in DAME, data mining is intended as a set of techniques for exploring data, based on the combination of parameter space filtering, machine learning and soft computing techniques, associated with a functional domain. In the data mining scenario, the choice of machine learning model should always be accompanied by the functionality domain. To be more precise, several machine learning models can be used in the same functionality domain, because the domain represents the functional context in which the exploration of data is performed.

Examples of such domains are: dimensional reduction, classification, regression, clustering, segmentation, statistical data analysis, forecasting, and data mining model filtering.

From the technological point of view, the use of state-of-the-art web 2.0 technology puts the end user (i.e. the data centres) in the best condition to interact with the DQ process through a simple web browser.

The approach outlined above has three immediate advantages:

• DQ controls can be approached remotely, through homogeneous and interoperable interfaces, federated wherever possible under VO standards;

• Different DQ models and algorithms are available through remote web applications, and the SDC does not need to be particularly skilled in DM methodologies to create and configure workflows on data;

• DM applications could be executed by remote cloud/grid frameworks, embedding all the complex management issues of the distributed computing infrastructure.

A further, indirect benefit of our approach arises from considering that, in a massive data-centric project like the EDW, one of the unavoidable constraints is to minimize data flow traffic and download/upload operations from remote sites. DQ tools should therefore be installed and maintained at the SDC.

It is worth stressing that this approach fits perfectly within the recently emerging area of interest named DQM (Data Quality Mining). DQ uses information attributes as a tool for assessing the quality of data products. The goal of DQM is to employ data mining methods to detect, quantify, explain and correct DQ deficiencies in very large databases. For this reason there is a reciprocal advantage between the two application fields: DQ is crucial for many applications of KDD, which in turn can improve DQ results.

In document Data-rich astronomy: mining synoptic sky surveys (Page 110-112)