
6.1 EDOAL and Silk Link Scripting Language

6.1.2 Linkage Specifications

An interlinking specification is a script that tells an interlinking tool how to compare the attributes of two given classes in order to decide whether two instances of those classes should be linked. It is written in the syntax of the interlinking tool. In brief, the tools require the following information to generate links.

1 Where to get the data sets
2 From which classes to get the instances
3 Which attributes to compare
4 With which comparison method to compare attribute values
5 How to aggregate the similarities
6 How to store the generated links
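As an illustration only, the six pieces of information can be pictured as one record. The sketch below is not part of any interlinking tool: the class names, field names, and example values (including the example.org URLs and the ex: prefix) are all hypothetical and serve only to show how the six items fit together.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Comparison:
    """One attribute pair and the method used to compare it (items 3 and 4)."""
    source_path: str  # attribute of the source instance
    target_path: str  # attribute of the target instance
    metric: str       # name of the comparison method


@dataclass(frozen=True)
class InterlinkingSpec:
    """Hypothetical container for the six items listed above."""
    source_endpoint: str                 # 1: where to get the source data set
    target_endpoint: str                 # 1: where to get the target data set
    source_class: str                    # 2: class of the source instances
    target_class: str                    # 2: class of the target instances
    comparisons: Tuple[Comparison, ...]  # 3 and 4: what to compare, and how
    aggregation: str                     # 5: how to combine the similarities
    output_uri: str                      # 6: where to store the generated links


# Placeholder values, not taken from any real data set.
spec = InterlinkingSpec(
    source_endpoint="http://example.org/source/sparql",
    target_endpoint="http://example.org/target/sparql",
    source_class="ex:SourceClass",
    target_class="ex:TargetClass",
    comparisons=(Comparison("ex:label", "ex:name", "levenshtein"),),
    aggregation="average",
    output_uri="http://example.org/links",
)
```

Every field corresponds to exactly one of the numbered items, so a specification that leaves a field empty is missing information the tool needs.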

These tools differ little from one another. In this thesis, Silk is chosen to execute the interlinking and generate the link set because it is an open source tool. Silk is therefore the only semi-automatic interlinking tool introduced here.

Silk provides a declarative language, LSL, for specifying the conditions that two instances must fulfill in order to be interlinked [Jentzsch 2012]. Assume that we would like to find links between the departments of two geographical data sets, INSEE³ and EUROSTAT. INSEE describes geographical data in France; EUROSTAT describes geographical data in Europe. insee and eurostat are the prefixes that name the resources of the two data sets, respectively. Both data sets describe departments in France, so the interlinking specification can be written in LSL as follows.

<?xml version="1.0" encoding="utf-8" ?>
<Silk>
  <Prefixes>
    <Prefix id="rdf" namespace="http://www.w3.org/1999/02/22-rdf-syntax-ns#" />
    <Prefix id="rdfs" namespace="http://www.w3.org/2000/01/rdf-schema#" />
    <!-- owl is declared because the link type below uses owl:sameAs -->
    <Prefix id="owl" namespace="http://www.w3.org/2002/07/owl#" />
    <Prefix id="insee" namespace="http://rdf.insee.fr/geo/" />
    <Prefix id="eurostat" namespace="http://ec.europa.eu/eurostat/ramon/ontologies/geographic.rdf#" />
  </Prefixes>

  <!-- Both data sets are read from the same Datalift SPARQL endpoint -->
  <DataSources>
    <DataSource id="insee" type="sparqlEndpoint">
      <Param name="endpointURI" value="http://localhost:8080/datalift/sparql" />
    </DataSource>
    <DataSource id="eurostat" type="sparqlEndpoint">
      <Param name="endpointURI" value="http://localhost:8080/datalift/sparql" />
    </DataSource>
  </DataSources>

  <Interlinks>
    <Interlink id="departement">
      <LinkType>owl:sameAs</LinkType>

      <!-- Classes from which the instances ?a and ?b are taken -->
      <SourceDataset dataSource="insee" var="a">
        <RestrictTo>?a rdf:type insee:Departement .</RestrictTo>
      </SourceDataset>
      <TargetDataset dataSource="eurostat" var="b">
        <RestrictTo>?b rdf:type eurostat:NUTSRegion .</RestrictTo>
      </TargetDataset>

      <LinkageRule>
        <Aggregate type="average">
          <!-- Compare the lower-cased names of the two instances -->
          <Compare metric="Levenshtein">
            <TransformInput function="lowerCase">
              <Input path="?a/insee:nom" />
            </TransformInput>
            <TransformInput function="lowerCase">
              <Input path="?b/eurostat:name" />
            </TransformInput>
          </Compare>
          <!-- Compare the lower-cased names of the enclosing regions -->
          <Compare metric="Levenshtein">
            <TransformInput function="lowerCase">
              <Input path="?a\insee:subdivision/insee:nom" />
            </TransformInput>
            <TransformInput function="lowerCase">
              <Input path="?b/eurostat:hasParentRegion/eurostat:name" />
            </TransformInput>
          </Compare>
        </Aggregate>
      </LinkageRule>

      <Filter />

      <!-- Write the generated links to the repository through SPARQL/Update -->
      <Outputs>
        <Output type="sparul">
          <Param name="uri" value="http://localhost:8080/openrdf-sesame/repositories/lifted/statements" />
          <Param name="parameter" value="update" />
        </Output>
      </Outputs>
    </Interlink>
  </Interlinks>
</Silk>

³ http://rdf.insee.fr/geo/index.html

In the above script, the information for interlinking is specified as follows:

1 Where to get the data sets
Both data sets are queried through Datalift’s SPARQL endpoint, http://localhost:8080/datalift/sparql.

2 From which classes to get the instances
Instances in INSEE come from the class insee:Departement. Instances in EUROSTAT come from the class eurostat:NUTSRegion.

3 Which attributes to compare
The property insee:nom is compared with the property eurostat:name. In addition, the insee:nom of the resource that has the relation insee:subdivision to the department is compared with the eurostat:name of the resource that is the object of the relation eurostat:hasParentRegion.

4 With which comparison method to compare attribute values
Property values are compared with the Levenshtein method, a string metric that counts the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into the other. It is used here because all values of the properties insee:nom and eurostat:name are strings. Both values are converted to lower case before the comparison, so differences in capitalization are ignored.

5 How to aggregate the similarities
The similarities of the two attribute pairs are aggregated into a single similarity value by the average method.

6 How to store the generated links
The links are stored in the public SPARQL endpoint of the Datalift platform, http://localhost:8080/openrdf-sesame/repositories/lifted/statements.

As an expressive language, LSL provides several comparison methods for attribute values. For strings, the comparison methods are levenshteinDistance, levenshtein, jaro, jaroWinkler, equality, inequality, jaccard, dice, and softjaccard. For numbers, there is a comparison method named num. For time values, the comparison methods are date and dateTime. For geographical coordinates, the comparison method is wgs84.
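To make the string comparison used in the script more concrete, the following is a minimal sketch of a Levenshtein comparison preceded by lower-casing. It is not Silk's implementation: the function names are hypothetical, and dividing the distance by the length of the longer string to obtain a similarity in [0, 1] is an assumption made for illustration.

```python
def levenshtein_distance(s: str, t: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn s into t."""
    # previous[j] holds the distance between the processed prefix of s
    # and the first j characters of t.
    previous = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        current = [i]  # distance from s[:i] to the empty string
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            current.append(min(
                previous[j] + 1,         # delete cs
                current[j - 1] + 1,      # insert ct
                previous[j - 1] + cost,  # substitute cs by ct (or keep it)
            ))
        previous = current
    return previous[-1]


def levenshtein_similarity(s: str, t: str) -> float:
    """Similarity in [0, 1]; 1.0 means equal after lower-casing.

    Assumption: the distance is normalised by the longer string's length.
    """
    s, t = s.lower(), t.lower()  # mirrors the lowerCase transformation
    longest = max(len(s), len(t))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - levenshtein_distance(s, t) / longest
```

Under these assumptions, levenshtein_similarity("Ain", "AIN") yields 1.0, because the two names differ only in capitalization.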

LSL also provides several aggregation methods that combine the similarities of the attribute values into a single similarity value: average, max, min, quadraticMean, and geometricMean.
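The five aggregation methods correspond to standard ways of combining a list of numbers. The sketch below shows their unweighted textbook definitions. It is an illustration under that assumption, not Silk's code: the function aggregate is hypothetical, and any weighting or thresholding a tool may apply is ignored.

```python
import math
from typing import Iterable


def aggregate(similarities: Iterable[float], method: str = "average") -> float:
    """Combine several similarity values into one (unweighted sketch)."""
    values = list(similarities)
    if not values:
        raise ValueError("at least one similarity value is required")
    n = len(values)
    if method == "average":
        return sum(values) / n
    if method == "max":
        return max(values)
    if method == "min":
        return min(values)
    if method == "quadraticMean":
        # square root of the mean of the squared values
        return math.sqrt(sum(v * v for v in values) / n)
    if method == "geometricMean":
        # n-th root of the product of the values
        return math.prod(values) ** (1.0 / n)
    raise ValueError(f"unknown aggregation method: {method}")
```

For the two similarities 0.5 and 1.0, average gives 0.75, min gives 0.5, and max gives 1.0. The choice matters: min links two instances only if every attribute pair is similar, whereas max is satisfied by a single good match.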