Identifying Tools and Algorithms Suitable for Interlinking

5.3 Interlinking

5.3.2 Identifying Tools and Algorithms Suitable for Interlinking

Identifying Interlinking Tools: A number of tools create links between datasets semi-automatically by applying matching techniques. These linking tools identify similarities between entities and generate links (e.g., owl:sameAs) that connect source and target entities. Rajabi et al. conducted a study suggesting that data publishers can trust interlinking tools to interlink their data with other datasets; according to that study, LIMES and Silk are the most promising frameworks [221]. Simperl et al. compared various linking tools with respect to aspects such as required input, resulting output, considered domain and matching techniques used [246]. This allowed for a comparison from several perspectives, including degree of automation (to what extent the tool needs human input) and human contribution (the way in which users are required to do the interlinking).

In summary, these comparisons point to the two well-known open-source interlinking frameworks that we also used: LIMES^38 (Link Discovery Framework for Metric Spaces) and Silk^39 (Link Discovery Framework for the Web of Data). In an evaluation of the two frameworks, the LIMES developers showed that LIMES considerably outperforms Silk in terms of running time, with a comparable quality of the output. Moreover, LIMES can be downloaded as a standalone tool for carrying out link discovery locally and consists of modules that can be extended easily to accommodate new or improved functionality.

Our own comparative evaluation of Silk and LIMES finally made us choose LIMES, based on the quality of its output.

^37 https://datahub.io/
^38 http://aksw.org/Projects/LIMES.html
^39 http://silkframework.org/

Metric | Description
Trigrams | uses the number of matching trigrams in both strings as $s = \frac{2 \times m}{a + b}$, where m is the number of matching trigrams, a is the number of trigrams in string 1, and b is the number of trigrams in string 2 [227].
Levenshtein | is based on the minimum number of insertion, deletion or replacement operations needed to transform string 1 into string 2.
Jaro | measures the characters two strings have in common, where matching characters may be no more than half the length of the longer string apart, with consideration for transpositions; it is best suited for short strings such as person names [255].
Jaro-Winkler | is an optimized version of Jaro, designed for and best suited to short strings such as person names.
Cosine | is the cosine of the angle between string vectors; for equal strings the angle between them will be 0 and the cosine will be 1 [227].

Table 5.7: String matching algorithms. The string matching algorithms are shown with a description of their characteristics.

Identifying String Matching Algorithms: One of the most important factors in discovering links effectively is choosing the right string matching algorithm. Our heuristic experiments showed that both tools support string matching based on trigrams, Levenshtein^40, Jaro, Jaro-Winkler and cosine (all of them normalized); cf. Table 5.7, which gives a detailed definition of each algorithm. In our initial experiments, Jaro and Levenshtein proved most reliable for identifying equivalent names and titles. Thus, we chose Levenshtein for long string values, i.e., publication titles, and Jaro for short string values, i.e., person names.
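To make the definitions in Table 5.7 concrete, the following Python sketch computes normalized trigram, Levenshtein and cosine similarities. It is an illustration under stated assumptions, not the code used by LIMES or Silk: the trigram score assumes the Dice-style form 2m / (a + b) without padding, the Levenshtein distance is normalized by the length of the longer string, and the cosine is taken over word-count vectors. All function names are hypothetical.

```python
import math
from collections import Counter


def trigrams(text):
    """Character trigrams of a string; no padding is applied (an assumption)."""
    return [text[i:i + 3] for i in range(len(text) - 2)]


def trigram_similarity(s1, s2):
    """Dice-style score s = 2m / (a + b) over trigram multisets."""
    t1, t2 = Counter(trigrams(s1)), Counter(trigrams(s2))
    a, b = sum(t1.values()), sum(t2.values())
    if a + b == 0:
        # Strings shorter than three characters have no trigrams.
        return 1.0 if s1 == s2 else 0.0
    m = sum((t1 & t2).values())  # trigrams shared by both strings
    return 2 * m / (a + b)


def levenshtein_distance(s1, s2):
    """Minimum number of insertions, deletions or replacements."""
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        current = [i]
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # replacement
        previous = current
    return previous[-1]


def levenshtein_similarity(s1, s2):
    """Distance normalized to [0, 1] by the longer string (an assumption)."""
    longest = max(len(s1), len(s2))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein_distance(s1, s2) / longest


def cosine_similarity(s1, s2):
    """Cosine of the angle between word-count vectors (an assumption)."""
    v1, v2 = Counter(s1.lower().split()), Counter(s2.lower().split())
    dot = sum(count * v2[token] for token, count in v1.items())
    norm = (math.sqrt(sum(c * c for c in v1.values()))
            * math.sqrt(sum(c * c for c in v2.values())))
    return dot / norm if norm else 0.0


if __name__ == "__main__":
    print(trigram_similarity("abcd", "abce"))         # 0.5
    print(levenshtein_distance("kitten", "sitting"))  # 3
```

Because every score lies in [0, 1] and equals 1 (up to floating-point rounding for the cosine) for identical strings, the same threshold can be applied regardless of which metric is chosen.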

We constructed the configuration files with the metrics defined above and performed a test interlinking, based on author name and publication title matching, between the publication resources published in 2008, along with their author resources, of the OA and SWDF datasets. The test interlinking was performed with each of the mentioned metrics. In the results, Jaro and Levenshtein were the most reliable at identifying equivalent names and titles, which supports the choice made above: the Levenshtein metric for long string values, i.e., publication titles, and the Jaro metric for short string values, i.e., person names, since Jaro is best suited to short strings. An example of a metric definition in LIMES is shown below.

Listing 5.11: Metric definition in LIMES. LIMES takes certain metrics as input and combines them to match instances in the source and target datasets.
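Since Jaro was chosen for person names, the following Python sketch shows how Jaro and Jaro-Winkler similarities are commonly computed. It follows the textbook formulation and is not taken from LIMES; the function names, the prefix scale of 0.1 and the maximum prefix length of 4 are the usual defaults and are assumptions here.

```python
def jaro_similarity(s1, s2):
    """Textbook Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters match only if they are at most this many positions apart.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        start = max(0, i - window)
        end = min(len2, i + window + 1)
        for j in range(start, end):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that appear in a different order count as
    # half a transposition each.
    seq1 = [c for c, hit in zip(s1, matched1) if hit]
    seq2 = [c for c, hit in zip(s2, matched2) if hit]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (matches / len1
            + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler_similarity(s1, s2, prefix_scale=0.1, max_prefix=4):
    """Jaro score boosted for a shared prefix (assumed default parameters)."""
    jaro = jaro_similarity(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return jaro + prefix * prefix_scale * (1 - jaro)


if __name__ == "__main__":
    print(round(jaro_similarity("MARTHA", "MARHTA"), 4))          # 0.9444
    print(round(jaro_winkler_similarity("MARTHA", "MARHTA"), 4))  # 0.9611
```

The matching window grows with string length, which is why the measure tolerates small transpositions in short strings such as names.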

Setting a threshold makes it possible to restrict the result to exact matches, since our aim is to identify links correctly. To this end, the acceptance threshold in the workflow was set to 0.95, which means that two concepts are considered matched if their syntactic similarity is more than 95%. An example of an acceptance definition in LIMES is shown below.

<ACCEPTANCE>
  <THRESHOLD>0.95</THRESHOLD>
  <FILE>openaire_dblp_accept.nt</FILE>
  <RELATION>owl:sameAs</RELATION>
</ACCEPTANCE>

Listing 5.12: Acceptance definition in LIMES. The dump of the datasets given for interlinking is accepted through a certain rule.
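The acceptance definition names an N-Triples file and the owl:sameAs relation. As a rough illustration of what such a file contains, the Python sketch below serializes accepted links as N-Triples statements. The function name and the example.org URIs are hypothetical, and the exact output written by LIMES may differ in detail.

```python
# Full URI of the owl:sameAs property.
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def to_ntriples(links):
    """Serialize (source, target) URI pairs as owl:sameAs N-Triples lines."""
    return "".join(
        f"<{source}> <{OWL_SAME_AS}> <{target}> .\n"
        for source, target in links
    )


# Hypothetical accepted links between two datasets.
example_links = [
    ("http://example.org/oa/person/1", "http://example.org/dblp/person/a"),
]

if __name__ == "__main__":
    print(to_ntriples(example_links), end="")
```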


We chose 0.75 as the threshold value for the review condition. Links that fall under this condition should be verified by manual evaluation. We deliberately chose a low review threshold in order to observe how well the linking specification works. We observed that similar but distinct entities appeared as links in the review files. A correct link can also end up in the review file, for example because of spelling differences. An example of a review definition in LIMES is shown below.

<REVIEW>
  <THRESHOLD>0.75</THRESHOLD>
  <FILE>openaire_dblp_review.nt</FILE>
  <RELATION>owl:sameAs</RELATION>
</REVIEW>

Listing 5.13: Review definition in LIMES. To find the exact matches, a review of the metrics is carried out, taking a certain threshold into account.
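The interplay of the two thresholds can be sketched in Python as follows. The values 0.95 and 0.75 come from the text; the function names, the example scores and the treatment of scores exactly at a threshold (counted here as reaching it) are assumptions made for illustration.

```python
ACCEPT_THRESHOLD = 0.95  # acceptance threshold from the text
REVIEW_THRESHOLD = 0.75  # review threshold from the text


def route_link(similarity):
    """Decide what happens to a candidate link with the given similarity."""
    # Assumption: a score equal to a threshold counts as reaching it.
    if similarity >= ACCEPT_THRESHOLD:
        return "accept"
    if similarity >= REVIEW_THRESHOLD:
        return "review"
    return "reject"


def partition_links(candidates):
    """Group (source, target, similarity) candidates by routing decision."""
    groups = {"accept": [], "review": [], "reject": []}
    for source, target, similarity in candidates:
        groups[route_link(similarity)].append((source, target))
    return groups


# Hypothetical candidate pairs with made-up similarity scores.
candidates = [
    ("oa:p1", "dblp:p1", 0.98),  # near-identical names
    ("oa:p2", "dblp:p2", 0.83),  # spelling variant, needs manual review
    ("oa:p3", "dblp:p9", 0.40),  # unrelated entities
]

if __name__ == "__main__":
    for decision, links in partition_links(candidates).items():
        print(decision, links)
```

A correct link with a spelling difference, like the second candidate, lands in the review group rather than being lost, which is the reason for keeping the review threshold low.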

In document Collaborative Integration, Publishing and Analysis of Distributed Scholarly Metadata (Page 155-157)