correspondences. Thus, the link production method of this paper may spend considerable computation on building indexes for tokens that appear only in the ranges of properties that have no corresponding properties. These tokens in turn help create many incorrect sample links across two corresponding classes. Such sample links are useless for constructing and improving the interlinking pattern, because they do not contain any attribute correspondence.
Ngonga Ngomo et al. [Ngonga Ngomo 2012b, Ngonga Ngomo 2013] generate sample links by comparing the instances' values of the property “rdfs:label” or by using a predefined interlinking pattern. Such a sample link generation process cannot be applied to all kinds of interlinking tasks, because not all data sets have the property “rdfs:label” or a predefined interlinking pattern.
Therefore, we need a sample link generation process that produces fewer irrelevant sample links, that is, sample links that contain no attribute correspondences, and that can be applied to different interlinking tasks without external information.
3.4
Constructing and Improving the Interlinking Pattern with Assessed Links
After attribute correspondences are obtained, they need to be organized into an interlinking pattern that generates links. This pattern is learned from assessed links with a learning method. This section therefore explains how an interlinking pattern is built.
It is not straightforward to construct an interlinking pattern that produces correct links. [Song 2013, Ngonga Ngomo 2011c] build an interlinking pattern as a conjunction of all attribute correspondences. Yet such a pattern is not universal: it cannot generate correct links for all interlinking tasks. Consider the following example. Assume that we interlink two data sets D and D′, and that there are two correct links l1 and l2. Link l1 has three attribute correspondences AC1, AC2 and AC3. This means that, for each of these attribute correspondences, the attribute values of the two instances linked by l1 are the same. Link l2 has only the attribute correspondence AC3. If the interlinking pattern is the conjunction of AC1, AC2 and AC3, the link l2 cannot be generated, because l2 does not have all the attribute correspondences in the interlinking pattern. Since it is hard to estimate which sets of attribute correspondences, and how many of them, are required for constructing an interlinking pattern, we need some assessed links from which to learn these sets.
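The example above can be made concrete with a minimal sketch. It assumes that each link is described only by the set of attribute correspondences whose values agree, and that a pattern is either one conjunction or a disjunction of several conjunctions. The function names `conjunction_generates` and `disjunction_generates` are hypothetical and serve illustration only.

```python
# Illustrative sketch: each correct link is represented by the set of
# attribute correspondences (ACs) that hold for its two instances.
links = {
    "l1": {"AC1", "AC2", "AC3"},
    "l2": {"AC3"},
}


def conjunction_generates(pattern, satisfied):
    """A conjunctive pattern generates a link only if every AC in it holds."""
    return pattern <= satisfied


def disjunction_generates(pattern_sets, satisfied):
    """A disjunction of conjunctions generates a link if any one set holds."""
    return any(acs <= satisfied for acs in pattern_sets)


# Pattern built as a conjunction of all attribute correspondences.
conjunctive_pattern = {"AC1", "AC2", "AC3"}
covered_by_conjunction = [
    name for name, acs in links.items()
    if conjunction_generates(conjunctive_pattern, acs)
]

# Pattern that keeps one set of ACs per correct link (assumed to be learned).
learned_pattern = [{"AC1", "AC2", "AC3"}, {"AC3"}]
covered_by_disjunction = [
    name for name, acs in links.items()
    if disjunction_generates(learned_pattern, acs)
]

print(covered_by_conjunction)   # l2 is missing
print(covered_by_disjunction)   # both links are covered
```

In this toy setting the set {AC3} alone already covers both links. Which sets are actually needed in practice is what the assessed links are meant to reveal.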
A reasonable way of constructing an interlinking pattern is to improve the pattern iteratively with assessed links, which is a supervised learning process. Two elements are therefore needed for constructing an interlinking pattern: a set of assessed links and a supervised learning method. The related work on producing assessed links is discussed in Section 3.3, and the related work on constructing and improving the interlinking pattern is introduced in Section 3.1.5. According to the analysis in Section 3.1.5, the related work on constructing and improving the interlinking pattern by learning has several weak points. Some interlinking methods cannot be applied to all kinds of interlinking tasks, and some have long running times. Therefore, constructing and improving the interlinking pattern requires a learning method that has a shorter running time and is able to generate interlinking patterns for all interlinking tasks.
3.5
Conclusion
From the analysis above, Record Linkage, Trust Propagation, Statistical Techniques, Ontology Matching and Machine Learning are all used to interlink RDF data sets. Ontology Matching and Machine Learning are the two primary techniques for interlinking, for the following reasons.
• First, Ontology Matching is needed to discover attribute correspondences across two corresponding classes. Interlinking is the process of finding links by comparing instances. The URIs of instances are difficult to compare because of naming differences, so links are generated by comparing the instances' attribute values. However, two compared instances give rise to many comparisons of attribute values. Most of these comparisons are useless for evaluating the similarity of the two instances, because they are executed on attributes that do not correspond to each other. Accordingly, discovering attribute correspondences is a necessary step.
• Second, Machine Learning is used to improve the interlinking pattern. Comparing instances with attribute correspondences is not enough to find correct links, because different correct links usually have different attribute correspondences. The interlinking pattern needs to be constructed and improved with assessed links by a supervised machine learning method, so as to cover more correct links.
The related work on interlinking with Ontology Matching and Machine Learning still leaves room for improvement.
• As for the work on Ontology Matching, instance-based ontology matching requires external information to find attribute correspondences. Other related work on interlinking relies on available assessed links provided by users to extract attribute correspondences. Both kinds of related work take a long time to produce attribute correspondences, because external information and available assessed links take time to obtain. Thus, an ontology matching method is needed that discovers attribute correspondences at runtime without external information.
• As for the machine learning techniques that are used for interlinking, most related work uses Genetic Programming. Genetic Programming has a long running time because it evaluates many candidate interlinking patterns when improving the interlinking pattern. Hence, the interlinking process requires a supervised learning method that can improve the interlinking pattern in a short running time and is able to generate interlinking patterns for all interlinking tasks. Furthermore, in order to improve the interlinking pattern, a set of sample links should be generated for the user to assess. The related work on generating sample links either produces many irrelevant sample links that are useless for constructing and improving the interlinking pattern, or cannot be applied to all interlinking tasks. Therefore, the required link production method should produce fewer irrelevant sample links and be applicable to all interlinking tasks.
The next chapter introduces the solutions to the three tasks of the interlinking problem of this thesis. They are:
• Discovering attribute correspondences by classifying attributes of each class according to attributes’ value features with the K-medoids clustering method and matching attributes with regard to the clustered groups that share similar value features.
• Generating links by constructing a sample interlinking pattern with a disjunction of discovered potential attribute correspondences and sending sample links to users for assessing.
• Constructing and improving the interlinking pattern of two RDF data sets with assessed links and the Version Space learning method.
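As a rough illustration of the first task, the sketch below clusters attributes by value features with a small K-medoids routine and pairs attributes from different data sets that fall into the same cluster. It rests on assumptions that go beyond the text: the attribute names, the two-dimensional feature vectors (for instance a digit ratio and a normalized value length), the deterministic initialization, and the pooling of both classes into one clustering run are all hypothetical simplifications, not the method of the thesis.

```python
import math


def k_medoids(points, k, max_iter=100):
    """Cluster points around k medoids; returns {medoid index: member indices}."""
    medoids = list(range(k))  # assumption: start from the first k points
    clusters = {}
    for _ in range(max_iter):
        # Assignment step: attach every point to its nearest medoid.
        clusters = {m: [] for m in medoids}
        for i, p in enumerate(points):
            nearest = min(medoids, key=lambda m: math.dist(p, points[m]))
            clusters[nearest].append(i)
        # Update step: the new medoid minimizes the distance to its members.
        new_medoids = [
            min(members,
                key=lambda c: sum(math.dist(points[c], points[j]) for j in members))
            for members in clusters.values()
        ]
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return clusters


# Hypothetical attributes: (data set, attribute name, value feature vector).
attributes = [
    ("D", "name", (0.00, 0.60)),
    ("D", "birthDate", (0.80, 0.50)),
    ("D'", "fullName", (0.05, 0.65)),
    ("D'", "born", (0.75, 0.50)),
]

clusters = k_medoids([features for _, _, features in attributes], k=2)

# Attributes of different data sets in one cluster become candidate pairs.
candidates = []
for members in clusters.values():
    for i in members:
        for j in members:
            if attributes[i][0] == "D" and attributes[j][0] == "D'":
                candidates.append((attributes[i][1], attributes[j][1]))

print(candidates)
```

The candidate pairs are only potential attribute correspondences. In the approach outlined above they would still have to be confirmed through sample links assessed by users.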
Chapter 4
Proposed Solution
Contents
4.1 The Interlinking Process
4.2 Discovering Attribute Correspondences
4.3 Generating Links
4.4 Constructing and Improving the Interlinking Pattern with Assessed Links
4.5 System Overview
4.6 Conclusion
To overcome the weak points of the related work introduced in Chapter 3, an interlinking method is proposed for the interlinking problem of this thesis. It addresses all three tasks, which are discussed briefly in Section 4.2, Section 4.3 and Section 4.4 respectively.
4.1
The Interlinking Process
The interlinking process can be carried out by running an interlinking script, translated from an interlinking pattern, with a semi-automatic interlinking tool. As introduced in Chapter 2, an interlinking pattern can be constructed to help compare instances' attribute values and generate links between two RDF data sets. The interlinking pattern is composed of the set of attribute correspondences of each correct link, and it aggregates all these sets together. With the interlinking pattern, we could design a Java program that compares instances' attribute values according to the attribute correspondences in the interlinking pattern and aggregates the similarities of all compared attribute values into one final value. The final value represents the similarity of the two compared instances, and it is used to decide whether the two instances are the same or not. However, some tools already provide the functionality of such a program, such as the semi-automatic interlinking tools Silk (http://www4.wiwiss.fu-berlin.de/bizer/silk/) and LIMES. These tools require users to specify an interlinking script that expresses the interlinking pattern in a specific syntax, as well as a group of comparison methods for different property data types. In this thesis, we choose Silk to generate links since it is an open-source tool. Furthermore, Silk provides a rich set of comparison methods and aggregation methods.
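The comparison program described above can be sketched as follows. This is a minimal illustration, not the Silk or LIMES API: the string similarity from Python's `difflib`, the plain average used as aggregation, the threshold of 0.9, and the property names are all assumptions chosen for the example.

```python
from difflib import SequenceMatcher


def value_similarity(a, b):
    """Similarity of two attribute values in [0, 1] (case-insensitive)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def instance_similarity(inst1, inst2, correspondences):
    """Aggregate the similarities of corresponding attributes into one value.

    Assumption: a plain average is used as the aggregation method.
    """
    scores = [
        value_similarity(inst1[p], inst2[q])
        for p, q in correspondences
        if p in inst1 and q in inst2
    ]
    return sum(scores) / len(scores) if scores else 0.0


def is_same(inst1, inst2, correspondences, threshold=0.9):
    """Decide whether two instances are linked, given a hypothetical threshold."""
    return instance_similarity(inst1, inst2, correspondences) >= threshold


# Hypothetical attribute correspondences between two classes.
correspondences = [("name", "label"), ("country", "nation")]

a = {"name": "Paris", "country": "France"}
b = {"label": "paris", "nation": "France"}
c = {"label": "Lyon", "nation": "Italy"}

print(is_same(a, b, correspondences))  # values agree on both correspondences
print(is_same(a, c, correspondences))  # values differ on both correspondences
```

A tool such as Silk replaces this hand-written logic with an interlinking script in which the comparison and aggregation methods are chosen declaratively.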
To conclude, in order to find links across two RDF data sets, we should first find all attribute correspondences across two corresponding classes. Second, we should aggregate the attribute correspondences into an interlinking pattern that distinguishes correct links from incorrect links. Finally, we should translate the interlinking pattern into an executable interlinking script, so that Silk can generate a link set for the two RDF data sets.