
Semantic Web: A Pilot Study on the Billion Triples Challenge

7.1 A Value-based Property Matching Scheme

The first step toward linking the BTC dataset is an automatic property matching scheme that determines which predicates are comparable. The literature describes several types of approaches to matching properties. Stoilos et al. [114] proposed a property matching algorithm that computes the string similarity between tokens extracted from the URIs of the properties. Although many string matching algorithms already exist, such as Levenshtein [111], Needleman-Wunsch (which assigns different weights to different edit operations) [115], and Jaro-Winkler [116], a novel string matching metric that combines Jaro-Winkler with string overlaps was proposed. Furthermore, reasoning-based approaches have also been adopted for ontology alignment [117, 118].

Such systems typically work in two steps. First, they compute the syntactic similarity between the labels or extracted tokens of the properties and generate initial mappings. A reasoner is then usually used to check for semantic inconsistencies based on subsumption and disjointness. Schema-level information is not the only source for ontology matching; instance data can also help align properties from heterogeneous schemas. Instance-level data can give important insight into the contents and meaning of schema elements, particularly when useful schema information is limited. In general, such approaches [119, 120, 121] examine the values of the properties from two data sources and treat property pairs that share similar values as matches.

Previous algorithms have been shown to be effective; however, they also have some limitations. First, computing only the string similarity between tokens extracted from property URIs or labels may not be sufficient. For example, the two properties “rdfs:label” and “foaf:name” may not have highly similar strings in their URIs or even their labels; however, the former is sometimes used to represent person names. A property matching system that looks only at URIs or labels would therefore miss this pair, which would lower the recall of the final coreference results. Logic-based approaches have the advantage of being able to filter out property pairs that share similar value spaces but are actually logically disjoint. They may also complement property mappings generated by string matching techniques when the tokens extracted from URIs are not sufficiently similar. However, one potential drawback is that such approaches rely heavily on the correctness of the ontological axioms they use. Errors in the logical axioms or ontologies could result in incorrect property mappings.

In our work, we developed a value-based property matching mechanism that is similar to some of the previous approaches [121]. Unlike previous algorithms, however, we propose a different similarity metric to help filter out some false positives. In general, given two datatype properties that we want to match, we extract tokens from the objects (i.e., literals) of the triples that use these properties. For object properties, we treat each URI as a single token, i.e., we do not tokenize URIs. Once we have the tokens, we calculate the similarity of the two properties by examining their tokens and treat the properties as a match if the computed similarity exceeds a given threshold. Our similarity measures are given in Equations 7.1 and 7.2. Here, p1 and p2 represent the two properties that we want to match; G is an RDF graph in which these two properties are used; and token_set(p1, G) is a function that extracts the tokens of property p1 in graph G and forms a token set.

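$$
\mathrm{Sim}(p_1, p_2, G) = \frac{\left|\mathrm{token\_set}(p_1, G) \cap \mathrm{token\_set}(p_2, G)\right|}{\min\left(\left|\mathrm{token\_set}(p_1, G)\right|,\ \left|\mathrm{token\_set}(p_2, G)\right|\right)} \tag{7.1}
$$

$$
\mathrm{Ratio}(p_1, p_2, G) = \frac{\min\left(\left|\mathrm{token\_set}(p_1, G)\right|,\ \left|\mathrm{token\_set}(p_2, G)\right|\right)}{\max\left(\left|\mathrm{token\_set}(p_1, G)\right|,\ \left|\mathrm{token\_set}(p_2, G)\right|\right)} \tag{7.2}
$$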
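The two measures can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in this work: the representation of a graph as a list of `(subject, predicate, object, is_literal)` tuples, the lowercasing and word-level tokenization of literals, the helper name `is_match`, and the default threshold values are all assumptions made for the example.

```python
import re

def token_set(prop, graph):
    """Collect the tokens of `prop` from the objects of its triples in `graph`.

    `graph` is assumed to be an iterable of (subject, predicate, object,
    is_literal) tuples; this representation is hypothetical.
    """
    tokens = set()
    for _subject, predicate, obj, is_literal in graph:
        if predicate != prop:
            continue
        if is_literal:
            # Datatype property: split the literal into lowercase word tokens
            # (the exact tokenization is an assumption).
            tokens.update(re.findall(r"\w+", obj.lower()))
        else:
            # Object property: the whole URI is one token, never tokenized.
            tokens.add(obj)
    return tokens

def sim(p1, p2, graph):
    """Equation 7.1: intersection size over the size of the smaller token set."""
    t1, t2 = token_set(p1, graph), token_set(p2, graph)
    smaller = min(len(t1), len(t2))
    return len(t1 & t2) / smaller if smaller else 0.0

def ratio(p1, p2, graph):
    """Equation 7.2: size of the smaller token set over the size of the larger."""
    t1, t2 = token_set(p1, graph), token_set(p2, graph)
    larger = max(len(t1), len(t2))
    return min(len(t1), len(t2)) / larger if larger else 0.0

def is_match(p1, p2, graph, sim_threshold=0.5, ratio_threshold=0.1):
    """Match when Sim exceeds its threshold and Ratio is not below its own.

    Both default thresholds are placeholders, not values from the evaluation.
    """
    return (sim(p1, p2, graph) > sim_threshold
            and ratio(p1, p2, graph) >= ratio_threshold)

# Toy graph: two person names and one unrelated label.
graph = [
    ("ex:person1", "foaf:name", "Alice Smith", True),
    ("ex:person2", "rdfs:label", "alice smith", True),
    ("ex:doc1", "rdfs:label", "Graph Matching", True),
]
print(sim("foaf:name", "rdfs:label", graph))    # 2 shared tokens / min(2, 4)
print(ratio("foaf:name", "rdfs:label", graph))  # min(2, 4) / max(2, 4)
```

Empty token sets are mapped to a score of 0.0 here to avoid division by zero; the text does not say how that case is handled.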

To determine whether two properties match, we use two metrics: Sim and Ratio. Sim calculates the similarity between the token sets of two properties. Our Sim function is similar to the traditional Jaccard similarity measure, but instead of using the size of the union in the denominator, we use the size of the smaller of the two token sets. This ensures that subproperties can match their superproperties: Sim stays high as long as most tokens of the smaller set also appear in the larger one. Take the property pair “full name” and “label” as a concrete example. For person instances, the “full name” predicate is typically used to represent people’s names; however, in some datasets, data publishers may choose to use the “label” property to represent names, which is a legitimate use of that property. In this example, if we adopted the Jaccard similarity measure, “full name” might not be aligned with “label”, because the union of their token sets could be much larger than their intersection. Without this property pair, the overall coreference results, particularly recall, could suffer significantly.
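The difference between the two denominators can be seen on a small made-up example. The token sets below are invented for illustration only (they are not drawn from the BTC dataset), and the function names `jaccard` and `sim_min` are hypothetical.

```python
def jaccard(a, b):
    """Traditional Jaccard similarity: |a ∩ b| / |a ∪ b|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def sim_min(a, b):
    """Sim as in Equation 7.1: |a ∩ b| / min(|a|, |b|)."""
    smaller = min(len(a), len(b))
    return len(a & b) / smaller if smaller else 0.0

# "full name" is only used for person names; "label" is used for names and
# for many other things, so its token set is a strict superset here.
full_name_tokens = {"alice", "smith", "bob", "jones"}
label_tokens = full_name_tokens | {
    "graph", "matching", "semantic", "web", "ontology", "alignment",
}

print(jaccard(full_name_tokens, label_tokens))  # 4 / 10 = 0.4
print(sim_min(full_name_tokens, label_tokens))  # 4 / 4 = 1.0
```

With these sets, the Jaccard score is pulled down by the tokens that only “label” contributes, while the min-based Sim reflects that every “full name” token also occurs under “label”.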

In addition to Sim, we employ a filtering metric called Ratio. Given two properties, this metric computes the ratio between the sizes of their token sets. If the Ratio value of two properties is below a threshold, we do not consider the properties comparable. The intuition is that, although using the min size in the Sim function lets us cover as many appropriate property mappings as possible, we would like to filter out property pairs that happen to share common tokens but differ significantly in the sizes of their token sets. Consider the “title” and “keyword” properties. Intuitively, these two properties are not comparable, since they have different semantics. However, if we used only the Sim function defined in Equation 7.1, they could appear highly similar, because publication titles cover the majority of the possible keywords. If we also apply the Ratio metric, this pair is filtered out, since the token set of “title” is much larger than that of “keyword”. We will study the impact of setting different thresholds for both Sim and Ratio in our evaluation.
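The filtering effect of Ratio can be illustrated with synthetic token sets. Everything in this sketch is an assumption made for the example: the set contents, the function names `sim_min`, `size_ratio`, and `comparable`, and the two threshold values, which are placeholders rather than the settings studied in the evaluation.

```python
def sim_min(a, b):
    """Sim as in Equation 7.1, computed directly on two token sets."""
    smaller = min(len(a), len(b))
    return len(a & b) / smaller if smaller else 0.0

def size_ratio(a, b):
    """Ratio as in Equation 7.2, computed directly on two token sets."""
    larger = max(len(a), len(b))
    return min(len(a), len(b)) / larger if larger else 0.0

def comparable(a, b, sim_threshold, ratio_threshold):
    """Keep a pair only if Sim exceeds its threshold and Ratio is not below its own."""
    return sim_min(a, b) > sim_threshold and size_ratio(a, b) >= ratio_threshold

# A small "keyword" vocabulary that is fully covered by a much larger
# "title" vocabulary (97 filler tokens stand in for other title words).
keyword_tokens = {"ontology", "matching", "rdf"}
title_tokens = keyword_tokens | {f"word{i}" for i in range(97)}

print(sim_min(keyword_tokens, title_tokens))     # 3 / 3 = 1.0
print(size_ratio(keyword_tokens, title_tokens))  # 3 / 100 = 0.03

# Hypothetical thresholds: Sim alone would accept the pair, Ratio rejects it.
print(comparable(keyword_tokens, title_tokens, 0.5, 0.0))  # True
print(comparable(keyword_tokens, title_tokens, 0.5, 0.1))  # False
```

Setting the Ratio threshold to 0.0 disables the filter, which makes it easy to compare results with and without it.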

7.2 A Modified Graph Matching Algorithm for Detecting