Data Interlinking - A new approach for interlinking and integrating semi-structured and linked

Many data interlinking approaches have been proposed over the years, either independently or as part of the yearly OAEI event. This section reviews the most popular and most closely related solutions.

4.4.1 SERIMI

SERIMI [Araujo et al., 2011] was the second-best system at the OAEI Instance Matching 2011 [Nguyen et al., 2012a]. It does not require any upfront ontology alignment or prior knowledge of the data or the schema. It is the tool on which the interlinking approach proposed in this thesis is based (see Section 6.3). It consists of two phases: the selection phase and the disambiguation phase.

The selection phase is based on what its authors, Araujo et al. [2011], described as existing traditional information retrieval and string matching algorithms. More specifically, it begins by extracting the entity label properties of the source dataset, which are the properties describing the labels that best represent the resource being interlinked (all RDF predicates that have a literal value with fewer than 200 characters). Only discriminative predicates with an entropy higher than a certain threshold are considered. Then, the labels of these properties are used to search for resource candidates with the same or similar labels. The resulting resource candidates are called a pseudo-homonym set. The entity label properties of each of these resource candidates are extracted in the same way as for the source dataset. The entity labels of the properties common to the source and target entity label properties are then normalised, tokenised and compared using the RWSA algorithm [Branting, 2003]. The resources with a similarity score below 70% are discarded from the pseudo-homonym set.
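The selection phase can be illustrated with a minimal sketch. It is not SERIMI's implementation: the function names, the entropy threshold of 1.0 bit and the triple layout are assumptions made for illustration, and a plain token-overlap (Jaccard) score stands in for RWSA, which is not reproduced here. Only the 200-character limit and the 70% cut-off come from the description above.

```python
import math
from collections import Counter


def predicate_entropy(values):
    """Shannon entropy (in bits) of the literal values of one predicate."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def select_label_predicates(triples, min_entropy=1.0, max_len=200):
    """Keep predicates with short literals whose entropy exceeds min_entropy.

    triples: iterable of (subject, predicate, literal) tuples (assumed layout).
    min_entropy: hypothetical threshold; the source does not give a value.
    """
    by_predicate = {}
    for _subject, predicate, literal in triples:
        if len(literal) < max_len:
            by_predicate.setdefault(predicate, []).append(literal)
    return {p for p, vals in by_predicate.items()
            if predicate_entropy(vals) > min_entropy}


def token_similarity(a, b):
    """Jaccard overlap of lower-cased tokens; a stand-in for RWSA."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def filter_candidates(source_label, candidates, threshold=0.7):
    """Discard candidates whose label similarity is below the threshold."""
    return [(uri, label) for uri, label in candidates
            if token_similarity(source_label, label) >= threshold]
```

In this sketch, a predicate whose values are all identical has zero entropy and is never selected, which mirrors the idea that only discriminative predicates are useful as entity labels.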

Once a pseudo-homonym set has been built for each source resource, the disambiguation phase takes place to separate false positive matches from true positive matches. The authors define false positive matches as resources whose instances share the same label but belong to different classes; for example, Algiers can be a street, a hotel or a city. This problem is addressed in SERIMI via a model called Resource Description Similarity (RDS). RDS identifies the class of interest by finding the set of resources that are the most similar among pseudo-homonym sets.

The limitation of SERIMI is that it is restricted to a single property (or only a few properties) for the matching [Nentwig et al., 2017]. Additionally, the similarity threshold and other parameters have to be specified manually.

4.4.2 SLINT

SLINT is a domain-independent Linked Data interlinking system. It uses coverage and discriminability to select the important predicates. These predicates are then aligned based on their confidence. The confidence in this context is high when corresponding predicates describe instances sharing the same type and characteristics. Using a three-step process (indexing, accumulating and candidate selection), SLINT generates the pairs of instances with a high possibility of being homogeneous. The score of each generated candidate pair is calculated by taking into account the confidence of their predicates and their similarities. The similarity is calculated differently according to the data type. For objects of type date, for instance, the similarity is 1 if the two values are equal and 0 otherwise. For string-type objects and URIs, they utilise TF-IDF, which gives an advantage to instances sharing more common tokens.
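The following sketch illustrates predicate selection and type-dependent similarity. The text above does not define coverage or discriminability, so the definitions used here are assumptions: coverage is taken as the fraction of instances that use a predicate, and discriminability as the ratio of distinct objects to all triples of that predicate. A plain token overlap replaces TF-IDF for strings; all names are hypothetical and this is not SLINT's code.

```python
from datetime import date


def coverage(predicate, triples, instances):
    """Assumed definition: share of instances that use the predicate."""
    subjects = {s for s, p, _o in triples if p == predicate}
    return len(subjects) / len(instances)


def discriminability(predicate, triples):
    """Assumed definition: distinct objects divided by triples of the predicate."""
    objects = [o for _s, p, o in triples if p == predicate]
    return len(set(objects)) / len(objects) if objects else 0.0


def value_similarity(a, b):
    """Dates: 1 if equal, 0 otherwise. Other values: token overlap (not TF-IDF)."""
    if isinstance(a, date) and isinstance(b, date):
        return 1.0 if a == b else 0.0
    ta, tb = set(str(a).lower().split()), set(str(b).lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0
```

Under these assumed definitions, a predicate such as a type assertion that carries the same object for every instance has high coverage but low discriminability, so it would contribute little to telling instances apart.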

SLINT was published as part of the yearly ontology matching event in 2012 (OM-2012).

Other versions and extensions have been published since then, including SLINT+ [Nguyen and Ichise, 2013] and ScSLINT [Nguyen and Ichise, 2015]. SLINT+ applied the same principles as SLINT to the OAEI 2013 benchmarks. ScSLINT identifies the lack of scalability of its predecessors and tries to address the performance. ScSLINT, however, does not consider balancing performance with precision or recall. Nguyen and Ichise [2015] (the authors who proposed ScSLINT) also described the weighted matrix structure used to compute the similarity in the candidate generation stage of SLINT+ (and SLINT) as not scalable and as inaccurate on ambiguous data. The main features introduced in ScSLINT and later versions have been: i) normalising the data format when calculating the confidence, ii) considering only those objects of a target property that overlap the objects of its source counterpart property, and iii) enabling the user to install new similarity measures. These modifications naturally enhance the performance relative to previous releases, but they are also expected to affect other, non-performance measures (such as precision and recall), something that is not elaborated in ScSLINT.

4.4.3 RiMOM

Risk Minimization based Ontology Mapping (RiMOM) [Zhang et al., 2016] was first developed in 2006 [Li et al., 2006a] and was originally a multi-strategy ontology matching and property matching approach. It is based on the combination of three lexical strategies: EditDistance, Vector-Distance and WordNet [Niu et al., 2011]. An adaptive variation of similarity flooding is also used for the structural matching.

In 2010, RiMOM's focus shifted, to some extent, to instance matching. As described in Shvaiko et al. [2010], the approach consists of four stages: Preprocessing, Information Complementation, Matching and Spread Similarity, which respectively aim at:

• classifying individuals by their classes;

• completing information of each individual;

• running the matching algorithm for each class respectively;

• computing the similarity of two candidates based on the weighted mean of their properties, each of which is assigned a specified weight.

RiMOM2013 [Zheng et al., 2013] is an extension of RiMOM. It was presented as part of the annual ontology matching event OM-2013. In general, the new characteristics contributed by this version, in contrast to the 2010 version, were a new interface and control layers that allow the user to customise the matching procedure. This included selecting preferred components, setting the parameters for the system, and choosing whether or not to use a translator tool. For the instance matching track in particular, a new algorithm inspired by Wang et al. [2012], called the Link Flooding Algorithm, was used. It consists of three modules. The first module performs simple pre-processing and normalisation of the data, such as unifying the language and data format and removing special characters. The second module is described using examples, but it is mainly logical matching whereby the subjects are aligned. The third module is for object alignment (another term for instance matching, described in Section 3.7.2). A weighted average score of the similarity of the instances on specific properties is calculated and compared against a threshold to decide whether two instances are aligned. The similarity measure used is EditDistance [Navarro, 2001].
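The final decision step, a weighted average of edit-distance similarities compared against a threshold, can be sketched as follows. This is an illustration under stated assumptions rather than RiMOM2013's implementation: the normalisation of the edit distance by the longer string, the property weights, the 0.8 threshold and all function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def edit_similarity(a, b):
    """Map the distance to [0, 1]; normalising by the longer string is an assumption."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def instances_aligned(source, target, weights, threshold=0.8):
    """Weighted average similarity over shared properties, compared to a threshold.

    source, target: dicts mapping property name to string value.
    weights: dict mapping property name to a weight (hypothetical values).
    """
    score, total_weight = 0.0, 0.0
    for prop, weight in weights.items():
        if prop in source and prop in target:
            score += weight * edit_similarity(source[prop], target[prop])
            total_weight += weight
    if total_weight == 0.0:
        return False
    return score / total_weight >= threshold
```

Properties missing from either instance are skipped in this sketch, so the average is taken only over the properties both instances share.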

RiMOM is a popular tool that produced promising results as an ontology matching solution [Li et al., 2006a]. As an instance matching approach, RiMOM's similarity metrics have been evaluated in Rong et al. [2012] against existing learning approaches. The results suggested that the combination of the three strategies is not accurate enough for instance matching. RiMOM2013 showed good results, but it targeted specific properties (comments, mottos, birthDates and almaMaters) that were rather tailored to the benchmark of the OM-2013 event.

4.4.4 SILK

The Link Discovery Framework (SILK) [Volz et al., 2009] is a link discovery system that supports a data publisher in setting explicit links between two datasets. It has its own declarative language, the Silk Link Specification Language (Silk-LSL), which data publishers can use to specify which types of RDF links ought to be discovered between data sources and which conditions data items must fulfil in order to be interlinked. Various similarity metrics can be applied in these link conditions, and the graph around a data item can be taken into account using an RDF path language.

The authors of SILK outlined four main advantages: i) the flexibility that Silk-LSL offers in defining link conditions; ii) the generation of not only identity links but also other types of RDF links; iii) the ability to be applied in distributed environments without replicating the data locally; iv) the implementation of multiple caching, indexing and pre-selection methods to improve the performance.

On the other hand, SILK has not been evaluated and tested in the same way as other data interlinking approaches. No benchmark was used, and the primary focus was the number of links that can be discovered; precision and recall were not considered. This may be due to two reasons. First, SILK was published in 2009, when there were not many systems to compare against and OAEI had only just started publishing instance matching benchmarks, in that same year. Second, SILK is a link discovery system; other types of RDF links can therefore also be discovered, which makes it challenging to evaluate in the same way as identity-link (interlinking) approaches. It is used to assist with linking data to existing resources in the Web of Data. Although the evaluation is not as revealing as it could be, Nguyen and Ichise [2015] stated that the limitation of SILK becomes apparent when addressing a large-scale dataset.

4.4.5 LIMES

LIMES [Ngomo and Auer, 2011] is a link discovery tool that, similarly to ScSLINT, focuses on improving the processing time when mapping large knowledge bases. It views the problem of data interlinking from a metric space perspective. It uses mathematical characteristics of metric spaces, such as the triangle inequality, to compute pessimistic approximations of distances and to estimate the similarity between instances [Symeonidou, 2014]. Based on these approximations, LIMES identifies and excludes a large number of computations without losing links.
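The pruning idea can be made concrete with the triangle inequality: for any exemplar point e, d(s, t) >= |d(s, e) - d(e, t)|, so a pair whose lower bound already exceeds the distance threshold cannot be a link and need not be compared. The sketch below is a simplification under stated assumptions, not LIMES itself: it uses a single exemplar, and the function names and the one-dimensional example data are hypothetical.

```python
def link_with_pruning(sources, targets, exemplar, distance, threshold):
    """Return (links, skipped): pairs within threshold and comparisons avoided.

    distance must be a metric (in particular, satisfy the triangle inequality),
    otherwise the lower bound below is not valid and links could be lost.
    """
    # Distances from the exemplar to every target are computed once.
    d_exemplar_target = {t: distance(exemplar, t) for t in targets}
    links, skipped = [], 0
    for s in sources:
        d_source_exemplar = distance(s, exemplar)
        for t in targets:
            # Pessimistic (lower-bound) estimate of d(s, t).
            lower_bound = abs(d_source_exemplar - d_exemplar_target[t])
            if lower_bound > threshold:
                skipped += 1  # cannot be a link; skip the real comparison
                continue
            if distance(s, t) <= threshold:
                links.append((s, t))
    return links, skipped


def brute_force(sources, targets, distance, threshold):
    """Reference result without pruning, used to check that no link is lost."""
    return [(s, t) for s in sources for t in targets
            if distance(s, t) <= threshold]
```

For example, with numbers as instances and the absolute difference as the metric, calling `link_with_pruning([1.0, 10.0], [1.5, 9.8, 50.0], 0.0, lambda a, b: abs(a - b), 1.0)` yields the same links as `brute_force` while skipping four of the six comparisons.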

LIMES was shown to be more time-efficient than SILK [Rajabi et al., 2015]. Similarly to many record linkage and link discovery tools, it concentrates much of its effort on filtering out non-matches before going through the more time-consuming comparisons.

4.5 Analysis and Discussion of Data Interlinking Approaches
