KD2R: A Key Discovery approach for Data Linking
3.1 Problem Statement
OWL2 keys express sets of properties that uniquely identify each instance of a class. However, if complete knowledge of owl:sameAs and owl:differentFrom is not provided in a dataset, it is not possible to distinguish the following two cases:
1. common property values describing instances that refer to the same real-world entity;
2. common property values describing instances that refer to two distinct real-world entities.
Dataset D1:
Person(p1), firstName(p1, "Wendy"), lastName(p1, "Johnson"), hasFriend(p1, p2), hasFriend(p1, p3), bornIn(p1, "USA"),
Person(p2), firstName(p2, "Wendy"), lastName(p2, "Miller"), bornIn(p2, "UK"),
Person(p3), firstName(p3, "Madalina"), lastName(p3, "David"), hasFriend(p3, p2), hasFriend(p3, p4),
Person(p4), firstName(p4, "Jane"), lastName(p4, "Clark"), bornIn(p4, "Ireland")
Fig. 3.1 RDF dataset D1
Figure 3.1 presents the dataset D1, which contains instances of the class Person. A person is described by his or her first name, last name, friends and country of birth. In D1, if we know that the persons p1 and p2 are the same, the property firstName can be considered as a key.
In the RDF datasets that are available on the Web, owl:sameAs and owl:differentFrom links are rarely declared. Across the Linked Open Data (LOD) cloud as a whole, some datasets contain duplicate instances. Nevertheless, datasets that fulfill the Unique Name Assumption (UNA), i.e., datasets in which all the instances are distinct, are not uncommon. Indeed, datasets generated from relational databases are in most cases under the UNA. Furthermore, some RDF datasets are constructed so as to avoid duplicates, like the YAGO knowledge base [SKW07]. Thus, we are interested in discovering keys in datasets where the UNA is fulfilled. For example, considering that the dataset D1 of Figure 3.1 is under the UNA, the property firstName cannot be a key, since there exist two distinct persons (p1 and p2) having the same first name.
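The firstName example can be checked mechanically. The following is a minimal sketch, assuming that a set of properties is a key when no two distinct instances share at least one value for every property of the set, and that only declared values are compared. The dictionary encoding of D1 and the function name `is_key_under_una` are illustrative choices, not part of KD2R.

```python
from itertools import combinations

# Dataset D1 of Fig. 3.1, encoded as: instance -> property -> set of values.
D1 = {
    "p1": {"firstName": {"Wendy"}, "lastName": {"Johnson"},
           "hasFriend": {"p2", "p3"}, "bornIn": {"USA"}},
    "p2": {"firstName": {"Wendy"}, "lastName": {"Miller"}, "bornIn": {"UK"}},
    "p3": {"firstName": {"Madalina"}, "lastName": {"David"},
           "hasFriend": {"p2", "p4"}},
    "p4": {"firstName": {"Jane"}, "lastName": {"Clark"}, "bornIn": {"Ireland"}},
}


def is_key_under_una(dataset, properties):
    """Return True if no two distinct instances share a value for every property.

    Assumption of this sketch: a property that is not instantiated for an
    instance contributes no shared value.
    """
    for a, b in combinations(dataset, 2):
        shares_all = all(
            dataset[a].get(p, set()) & dataset[b].get(p, set())
            for p in properties
        )
        if shares_all:
            return False  # two distinct instances collide on all properties
    return True


print(is_key_under_una(D1, ["firstName"]))  # False: p1 and p2 are both "Wendy"
print(is_key_under_una(D1, ["lastName"]))   # True: all last names differ
```

Under the UNA, p1 and p2 are known to be distinct, so their shared first name is enough to rule out firstName as a key.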
When literals are heterogeneously described, the key discovery problem becomes much more complex. Indeed, syntactic variations or errors in literal values may cause valid keys to be missed or erroneous ones to be discovered. For example, in the dataset D1, if one person is born in “USA” and another in “United States of America”, the two values are treated as different and bornIn can wrongly be found to be a key. In this work, we assume that the data described in one dataset are either homogeneous or have been normalized.
Furthermore, in the Semantic Web context, RDF data may be incomplete and asserting the Closed World Assumption (CWA), i.e., what is not currently known to be true is false, may not be meaningful. For example, the fact that in the dataset D1, the person p2 has no friends does not mean that hasFriend(p2, p1) is false. Axioms such as functionality or maximum cardinality of properties could be taken into account to exploit the completeness of some properties. Since these axioms are rarely given, discovering keys in RDF data requires the use of heuristics to interpret the possible absence of information. We consider two different heuristics, the optimistic heuristic and the pessimistic heuristic:
• Pessimistic heuristic: when a property is not instantiated for an instance, all the values in the dataset are possible, whereas for instantiated properties we consider that the information is complete. For example, in D1, hasFriend(p2, p1), hasFriend(p2, p3) and hasFriend(p2, p4) are possible, while hasFriend(p1, p4) is not.
• Optimistic heuristic: only the values that are declared in the dataset are taken into account in the discovery of keys. In other words, we consider that if there exist other values that are not contained in the dataset, they are different from all the existing ones. For example, hasFriend(p2, p3) is not possible.
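The effect of the two heuristics on key validity can be illustrated on D1. This is a sketch of one possible reading of the heuristics, not the KD2R algorithm itself: under the pessimistic heuristic a non-instantiated property may take any value and therefore may coincide with any other instance, while under the optimistic heuristic it never does. The names `may_share_value` and `is_key` are hypothetical.

```python
from itertools import combinations

# Dataset D1 of Fig. 3.1, encoded as: instance -> property -> set of values.
D1 = {
    "p1": {"firstName": {"Wendy"}, "lastName": {"Johnson"},
           "hasFriend": {"p2", "p3"}, "bornIn": {"USA"}},
    "p2": {"firstName": {"Wendy"}, "lastName": {"Miller"}, "bornIn": {"UK"}},
    "p3": {"firstName": {"Madalina"}, "lastName": {"David"},
           "hasFriend": {"p2", "p4"}},
    "p4": {"firstName": {"Jane"}, "lastName": {"Clark"}, "bornIn": {"Ireland"}},
}


def may_share_value(dataset, a, b, prop, heuristic):
    """Can instances a and b have a common value for prop under the heuristic?"""
    values_a = dataset[a].get(prop)
    values_b = dataset[b].get(prop)
    if values_a is None or values_b is None:
        # Property missing for at least one instance:
        # pessimistic -> any value is possible, so a common value may exist;
        # optimistic  -> unknown values differ from all declared ones.
        return heuristic == "pessimistic"
    # Property instantiated on both sides: declared values are taken as complete.
    return bool(values_a & values_b)


def is_key(dataset, properties, heuristic):
    """True if no pair of distinct instances may agree on every property."""
    for a, b in combinations(dataset, 2):
        if all(may_share_value(dataset, a, b, p, heuristic) for p in properties):
            return False
    return True


print(is_key(D1, ["bornIn"], "optimistic"))   # True
print(is_key(D1, ["bornIn"], "pessimistic"))  # False: bornIn is unknown for p3
```

In this reading, bornIn is accepted as a key only under the optimistic heuristic, because the missing birth country of p3 could coincide with that of another person under the pessimistic one.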
The quality of the discovered keys improves when more data, coming from different datasets, are exploited. Thus, we are interested in discovering keys that are valid in several datasets. The datasets may not be described using the same ontology. Hence, we assume that equivalence mappings between classes and properties are declared or computed by an ontology alignment tool. However, we do not merge all the datasets into a single dataset (under an integrated ontology), since in that case the UNA would no longer be guaranteed. Therefore, keys are first discovered in each dataset and then merged according to the given mapping set.
Let $D_1$ and $D_2$ be two RDF datasets that conform to two OWL ontologies $O_1$ and $O_2$, respectively. We assume that OWL entailment rules [PSHH04] are applied on both RDF datasets.
We consider in each dataset $D_i$ the set of instantiated properties $P_i = \{p_{i1}, p_{i2}, \ldots, p_{iN}\}$. To discover keys that involve property expressions, we assume that for each object property $p$, an inverse property (inv-$p$) is created. Let $C_i = \{c_{i1}, c_{i2}, \ldots, c_{iL}\}$ be a set of classes of the ontology $O_i$. Let $M$ be the set of equivalence mappings between the elements (properties or classes) of the ontologies $O_1$ and $O_2$. Let $P_{1c}$ (resp. $P_{2c}$) be the set of properties of $P_1$ (resp. of $P_2$) such that there exists an equivalence mapping with a property of $P_2$ (resp. of $P_1$). The problem of key discovery that we address in this chapter is defined as follows:
1. for each dataset $D_i$ and each class $c_{ij} \in C_i$ of the ontology $O_i$ such that there exists a mapping between $c_{ij}$ and a class $c_{ks}$ of the other ontology $O_k$, discover the subsets of $P_i$ that are keys in the dataset $D_i$;
2. find all the subsets of $P_{ic}$ that are keys for equivalent classes in the two datasets $D_1$ and $D_2$ with respect to the property mappings in $M$.
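The second part of the problem can be illustrated with a brute-force sketch. It enumerates the subsets of mapped properties and keeps the minimal ones that are keys in both datasets, comparing declared values only. This is not the merge procedure of KD2R, only a naive reference formulation of the expected result; the two toy datasets, the property names of the second dataset and the function names are all hypothetical.

```python
from itertools import combinations


def is_key(dataset, props):
    """True if no two distinct instances share a value for every property."""
    return not any(
        all(dataset[a].get(p, set()) & dataset[b].get(p, set()) for p in props)
        for a, b in combinations(dataset, 2)
    )


def common_minimal_keys(d1, d2, mapping):
    """Minimal sets of mapped properties that are keys in both datasets.

    mapping: property of d1 -> equivalent property of d2 (plays the role of M).
    Keys are reported using the property names of d1.
    """
    found = []
    mapped = sorted(mapping)
    for size in range(1, len(mapped) + 1):
        for props in combinations(mapped, size):
            if any(set(k) <= set(props) for k in found):
                continue  # contains a smaller key already found: not minimal
            if is_key(d1, props) and is_key(d2, [mapping[p] for p in props]):
                found.append(props)
    return found


# Hypothetical toy datasets described with two different vocabularies.
toy_d1 = {
    "a1": {"firstName": {"Wendy"}, "lastName": {"Johnson"}},
    "a2": {"firstName": {"Wendy"}, "lastName": {"Miller"}},
}
toy_d2 = {
    "b1": {"givenName": {"Jane"}, "familyName": {"Clark"}},
    "b2": {"givenName": {"Ann"}, "familyName": {"Clark"}},
}
toy_mapping = {"firstName": "givenName", "lastName": "familyName"}

print(common_minimal_keys(toy_d1, toy_d2, toy_mapping))
# [('firstName', 'lastName')]
```

Here lastName is a key only in the first toy dataset and givenName only in the second, so the only key valid in both is the pair of properties. Each dataset is tested separately, which preserves the UNA within each of them.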