

6.3 Implementation and Evaluation

6.3.3 Comparison with Previous Interlinking Systems

Table 6.4 compares LinkD with popular interlinking systems. The table clearly shows the extent of the improvements that LinkD introduced to SERIMI. The other approaches treat D1 and D2 as a single dataset; LinkD, however, separates them into two domains: movies and people (actors, writers, directors, etc.).

Although Table 6.4 shows that SLINT achieves a better F1 score than the other systems, including LinkD, on all the IM@OAEI2011 datasets considered, the scale of the data targeted by LinkD is significantly larger. The nature of the problem addressed is therefore not the same. This is, however, the only way to compare LinkD numerically, and it shows that, despite the difference in the scale of the data targeted, the performance of LinkD is relatively good.

Table 6.5 reports the time LinkD took to process the datasets D1-D5 in comparison with SLINT (more information about the implementation environment can be found in Appendix III). This is not a direct comparison, so the difference in the number of target pairs is highlighted. For instance, under the assumption that SLINT's running time grows in direct proportion to the size of the target dataset, its running time for D1 would be 474 million divided by 10108, multiplied by 67, which is approximately 3,141,867 seconds.

ID    LinkD (seconds) for 474M    SLINT (seconds)
D1    51 007                      67 (for 13758)
D2    2 241
D3    1 774                       3.55 (for 2083)
D4    2 247                       12.74 (for 4588)
D5    1 519                       4.29 (for 1274)

Table 6.5: Performance evaluation of LinkD against SLINT [Author, 2017].
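The extrapolation described for D1 can be checked with a few lines of Python. This is only a sketch of the arithmetic under the linear-scaling presumption stated in the text; the function name `extrapolate_seconds` is illustrative and is not part of LinkD or SLINT.

```python
def extrapolate_seconds(measured_seconds, measured_pairs, target_pairs):
    """Scale a measured running time linearly to a larger number of pairs.

    Assumes running time is directly proportional to the number of pairs,
    which is the presumption made in the text, not a measured property.
    """
    return measured_seconds * target_pairs / measured_pairs


# Figures quoted in the text: 67 seconds for 10108 pairs, scaled to 474 million.
slint_d1_estimate = extrapolate_seconds(67, 10108, 474_000_000)

print(f"{slint_d1_estimate:,.0f} seconds")
print(f"{slint_d1_estimate / 86_400:.1f} days")
```

Under this presumption the estimate is a little over 36 days, against the 51 007 seconds (about 14 hours) reported for LinkD on D1.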

6.4 Summary

This chapter proposed a new interlinking approach, called LinkD, that provides identity links between a single source dataset and the Linked Data cloud, using the domain as a reference for applying variable weights in the similarity measurement. The approach goes through two stages to achieve this aim: blocking (see Section 6.2.1) and data interlinking (see Section 5.3). A variety of distance measures and algorithms were used to calculate the similarity between the labels describing the resources, including UMBC EBIQUITY-CORE [Han et al., 2013] (to measure semantic distance) and Jaro-Winkler (to measure the similarity between two sets or strings; see Section 6.2.2.4). Neither the structure nor the ontology of the dataset was considered in the proposed system, so that it remains feasible to target large-scale data. The major challenges faced are the high computational cost and the dynamic allocation of the weights according to the domain and the number of matched properties. The evaluation of the different components (see Section 6.3) showed that LinkD is able to target significantly larger datasets whilst maintaining high quality measures (precision, recall and F1 score) (see Section 6.3.3).
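As a rough illustration of how domain-dependent weights can be combined with a string measure such as Jaro-Winkler, consider the sketch below. It is not the LinkD implementation: the weight table `DOMAIN_WEIGHTS`, the property names and the function `weighted_similarity` are hypothetical, the semantic measure is omitted, and the renormalisation over matched properties is just one simple way to let the weights depend on how many properties match.

```python
def jaro(s1, s2):
    """Standard Jaro similarity between two strings, in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1, matched2 = [False] * len1, [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order // 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1, s2, prefix_scale=0.1):
    """Jaro similarity boosted for a common prefix of up to four characters."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1 - base)


# Hypothetical per-domain property weights (illustrative values only).
DOMAIN_WEIGHTS = {
    "movies": {"title": 0.6, "director": 0.3, "year": 0.1},
    "people": {"name": 0.7, "birthDate": 0.3},
}


def weighted_similarity(source, target, domain):
    """Weighted mean of per-property scores over the properties both share."""
    weights = DOMAIN_WEIGHTS[domain]
    shared = [p for p in weights if p in source and p in target]
    if not shared:
        return 0.0
    total = sum(weights[p] for p in shared)
    score = sum(weights[p] * jaro_winkler(source[p], target[p]) for p in shared)
    return score / total
```

Dividing by the total weight of the shared properties keeps the score in [0, 1] when a resource lacks some of the weighted properties.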

A new data integration approach, along with its prototypes, is presented in the next chapter. It aims at "consuming" two important data models available on the Web: semi-structured data and Linked Data. This approach is the short-term solution that the author proposes in this thesis to bridge semi-structured data and Linked Data.

Chapter 7

SemiLD: Keyword Search over Semi-Structured and Linked Data

With data collection, 'The sooner the better' is always the best answer.

Marissa Mayer

7.1 Introduction

The distributed and autonomous nature of Linked Data sources makes it unlikely that a single model can be sustained for representing data in a particular domain. Each source has its specificities, conditions, and a different vision of how to expand. The internal links (pointers to data within the local Linked Data source) can be relatively consistent and easily maintained, as the data publishers are aware of the changes occurring in their data repositories. The external links, however, are challenging to maintain, given that they connect two vocabularies, models or views that are managed and situated in separate locations and are regularly changing. This dynamism of the relations between the integration system and the sources, and the continuous expansion of the Web of Data, along with data freshness requirements, suggest that a solution would need to integrate data virtually, on-the-fly [Kettouch et al., 2015b]. SemiLD combines the use of ontologies, to obtain high precision, with property matching, to achieve a high degree of automation while targeting large-scale data.

The distributed and autonomous nature of Semantic Web sources, as explained in Section 3.5, imposes more challenges and leads to heterogeneous terminologies. Multiple ontologies and vocabularies can be utilised to represent similar information in a particular domain. On the other hand, as stated in Section 5.2.1 (and confirmed in Section 5.2.2), although different dispersed RDF datasets describing data in the same domain may not be exactly identical, they overlap in the semantics of their properties. This statement is as valid for semi-structured data sources as it is for Linked Data sources. Semi-structured data are frequently described using XML or JSON, where tags (or keys, in JSON) play the same role as properties in RDF.
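The correspondence between tags, keys and RDF properties can be made concrete with a small sketch. The record, the `http://example.org/` namespace and the three helper functions below are invented for illustration; they only show that flat JSON, flat XML and a set of triples about one resource can all be reduced to the same property-value pairs, and they do not handle nesting or repeated properties.

```python
import json
import re
import xml.etree.ElementTree as ET


def pairs_from_json(text):
    """Treat each key of a flat JSON object as a property."""
    return {key: str(value) for key, value in json.loads(text).items()}


def pairs_from_xml(text):
    """Treat each child tag of the root element as a property."""
    root = ET.fromstring(text)
    return {child.tag: (child.text or "").strip() for child in root}


def pairs_from_triples(triples):
    """Use the local name of each predicate URI as the property."""
    return {re.split(r"[/#]", pred)[-1]: obj for _, pred, obj in triples}


# The same hypothetical record in three representations.
json_record = '{"title": "Example Movie", "year": 2017}'
xml_record = "<movie><title>Example Movie</title><year>2017</year></movie>"
rdf_record = [
    ("http://example.org/movie/1", "http://example.org/ontology/title", "Example Movie"),
    ("http://example.org/movie/1", "http://example.org/ontology#year", "2017"),
]

for pairs in (
    pairs_from_json(json_record),
    pairs_from_xml(xml_record),
    pairs_from_triples(rdf_record),
):
    print(sorted(pairs.items()))
```

Once every source is reduced to such pairs, property matching can be applied uniformly regardless of whether the data came from a Web API or a SPARQL endpoint.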

This chapter presents SemiLD, a novel approach that uses a mediator-based modular architecture to integrate heterogeneous semi-structured and Linked Data on-the-fly. SimiMatch and LinkD are adapted and included as modules in SemiLD, as Figure 7.1 shows. Two prototypes of SemiLD are provided. The first is a highly automated keyword search system that retrieves its input from various SPARQL endpoints and Web APIs. The second is a movie collection manager, provided to highlight the adaptability of the proposed approach as well as to present another working scenario and an evaluation method. The evaluation of the system illustrates the high performance, usability and efficiency of the contributed approach.
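The general shape of a mediator-based keyword search can be sketched as follows. This is a generic illustration of the mediator/wrapper pattern, not the SemiLD architecture itself: the classes `InMemoryWrapper` and `Mediator` are hypothetical, the sources are in-memory stand-ins for real SPARQL endpoints and Web APIs, and the matching and interlinking modules are left out.

```python
from typing import Dict, List, Protocol


class Wrapper(Protocol):
    """Anything that can answer a keyword query with property-value records."""

    def search(self, keyword: str) -> List[Dict[str, str]]: ...


class InMemoryWrapper:
    """Stand-in for a wrapper around a SPARQL endpoint or a Web API."""

    def __init__(self, name: str, records: List[Dict[str, str]]):
        self.name = name
        self.records = records

    def search(self, keyword: str) -> List[Dict[str, str]]:
        needle = keyword.lower()
        return [
            dict(record, source=self.name)  # tag each hit with its origin
            for record in self.records
            if any(needle in value.lower() for value in record.values())
        ]


class Mediator:
    """Forwards a query to every wrapper at query time and merges the answers."""

    def __init__(self, wrappers: List[Wrapper]):
        self.wrappers = list(wrappers)

    def search(self, keyword: str) -> List[Dict[str, str]]:
        results: List[Dict[str, str]] = []
        for wrapper in self.wrappers:
            results.extend(wrapper.search(keyword))
        return results


# Two hypothetical sources describing the same movie with different vocabularies.
linked_data = InMemoryWrapper("sparql-endpoint", [{"label": "Example Movie"}])
web_api = InMemoryWrapper("web-api", [{"title": "Example Movie"}, {"title": "Other"}])

mediator = Mediator([linked_data, web_api])
for hit in mediator.search("example"):
    print(hit)
```

Because nothing is materialised in advance, each query sees the current state of every source, which is the property that on-the-fly integration relies on.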

Figure 7.1: Relation of SemiLD approach with reference to other systems of this thesis [Author, 2017].

Figure 7.2: General architecture of SemiLD [Author, 2017].