Summary and Discussion - Learning In Presence of Ontology Mapping Errors

CHAPTER 5. Learning In Presence of Ontology Mapping Errors

5.4 Summary and Discussion

5.4.1 Significance

The rapid proliferation of autonomous, distributed data sources in many emerging data-rich domains (e.g., bioinformatics, social informatics, security informatics), coupled with the growing use of ontologies to associate semantics with the data, has led to increasing interest in the problem of learning predictive models from semantically disparate data sources. Many practical approaches to this problem rely on mapping the instance descriptions used by the individual data sources into instance descriptions expressed in a common representation assumed by the learner (for example, G02 (2009) lists mappings from 20 different ontologies to the Gene Ontology). Establishing such mappings is a complex and inevitably error-prone process. Hence, there is a need for approaches to learning from such data in the presence of mapping errors.

In this paper we have established that the problem of learning from semantically disparate data sources in the presence of mapping errors can be reduced to the problem of learning from a single data source in the presence of nasty classification noise within a PAC-like framework. It should be noted that the reduction does not hold for an arbitrary noise model. For example, learning in the presence of mapping errors cannot, in general, be reduced to learning in the presence of random classification noise. In the random classification noise model, the label of each sampled instance is flipped with a fixed probability η, so the same instance can receive different labels on different draws. In contrast, under the k-delegating oracle model, a given instance is always assigned the same label, because the mappings, whether correct or not, are fixed prior to sampling. Consequently, a dataset D generated by an oracle with random classification noise can include two examples of the form ⟨x, 0⟩ and ⟨x, 1⟩ (i.e., D contains the same instance with two different labels), whereas such a dataset can never be generated by a k-delegating oracle, which always labels the instance x in the same way (even if x occurs multiple times in D).
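The contrast between the two noise models can be illustrated with a small sketch. It is not the formal k-delegating oracle defined earlier in the chapter. It is a simplified stand-in in which a fixed set of instances (the hypothetical `wrongly_mapped` set) is mislabeled because of mapping errors chosen before sampling. The target concept, function names, and parameters are all assumptions made for illustration.

```python
import random


def target(x):
    # Hypothetical target concept: label 1 for even integers, 0 otherwise.
    return 1 if x % 2 == 0 else 0


def random_noise_oracle(xs, eta, rng):
    """Random classification noise: every draw is flipped with probability eta."""
    examples = []
    for x in xs:
        label = target(x)
        if rng.random() < eta:
            label = 1 - label
        examples.append((x, label))
    return examples


def fixed_mapping_oracle(xs, wrongly_mapped):
    """Simplified stand-in for an oracle with mapping errors (an assumption).

    The set of mislabeled instances is fixed before sampling, so an instance
    receives the same label every time it is drawn.
    """
    examples = []
    for x in xs:
        label = target(x)
        if x in wrongly_mapped:
            label = 1 - label
        examples.append((x, label))
    return examples


def has_conflict(dataset):
    """Return True if some instance occurs with two different labels."""
    seen = {}
    for x, y in dataset:
        if seen.setdefault(x, y) != y:
            return True
    return False


rng = random.Random(0)
draws = [rng.randrange(5) for _ in range(200)]  # repeated draws of few instances

noisy = random_noise_oracle(draws, eta=0.3, rng=rng)
mapped = fixed_mapping_oracle(draws, wrongly_mapped={1, 4})

print("random classification noise, conflicting labels:", has_conflict(noisy))
print("fixed mapping errors, conflicting labels:", has_conflict(mapped))
```

Under these assumptions the fixed-mapping oracle can never yield both ⟨x, 0⟩ and ⟨x, 1⟩, however many times x is drawn, while the random-noise oracle can whenever an instance is drawn more than once and 0 < η < 1.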

The reduction of learning in the presence of mapping errors to learning in the presence of nasty classification noise opens up the possibility of applying existing results and approaches for learning in the presence of classification noise to the problem of learning in the presence of mapping errors. In our opinion, this reduction is important because it provides a theoretical basis for solving practical issues that arise when learning in the semantically disparate setting. Based on this reduction, we outlined some of the techniques that can be used to cope with errors in mappings in this setting. We believe these techniques will prove to be very useful in practice as the use of ontologies becomes even more widespread. On the theoretical side, we also presented an algorithm that can be used to learn in the presence of mapping errors in a PAC-like setting.

5.4.2 Related Work

There is growing interest in the problem of learning predictive models from distributed data sources [Park and Kargupta (2003)]. Caragea et al. (2005) have described algorithms that, given correct mappings, provide rigorous performance guarantees (relative to their single data source counterparts) for learning from distributed, semantically disparate data sources when the mappings are semantics-preserving. Crammer et al. (2008) have examined the problem of learning predictors from a set of related data sources. Ben-David et al. (2002) have analyzed the sample complexity of learning from semantically disparate data sources in a setting where classifiers trained on data sources D1, ..., Dn−1 are used to predict the class labels of instances from a data source Dn. However, none of these works has considered the effect of errors in mappings between the representations used by the individual data sources.

The problem of learning predictive models in the presence of noise in the data has received considerable attention in the literature. A variant of PAC learning that models learning in the presence of random classification noise was introduced in [Angluin and Laird (1998)]. Other variants of PAC learning that have been considered to model learning from noisy data include PAC learning in the presence of malicious errors [Kearns and Li (1993)], learning in the presence of attribute noise (but not classification noise) [Shackelford and Volper (1988), Goldman and Sloan (1995)], and learning under the nasty (or adversarial) noise model [Bshouty et al. (1999)]. Several different types of noise in data have been examined in the context of the PAC learning framework in Sloan (1995). A quantitative study of classification noise and attribute noise is given in Zhu and Wu (2004). Wilson and Martinez (2004) provide an overview of approaches to cope with noise in data. Karmaker and Kwek (2005) have described a boosting-based approach to detecting outliers in data, a problem closely related to that of detecting mislabeled examples in a noisy dataset.

There has been very little work on detecting mapping errors in the setting of learning from heterogeneous data sources. Of related interest is the work in the field of ontology mapping [Kalfoglou and Schorlemmer (2005), Euzenat and Shvaiko (2007)]. However, the primary focus in this area is on aligning ontologies (through the use of mappings), merging related ontologies, or detecting logical inconsistencies in mappings [Meilicke et al. (2007)]. A consistent mapping need not be correct in the sense described in this paper; moreover, the focus of this paper is on learning in the presence of mapping errors rather than on detecting them.

5.4.3 Future Work

There are several interesting directions along which the analysis presented in this paper can be extended, including, in particular, consideration of the effect of mapping errors in multi-relational learning, multiple-instance learning, multi-label learning, and structured-label learning, among others. Also of interest are theoretical and experimental studies of alternative approaches to learning from semantically disparate data in the presence of mapping errors.

CHAPTER 6. RELATIONSHIP BETWEEN LEARNING CLASSIFIERS

In document Learning predictive models from massive, semantically disparate data (Page 88-92)