Data linking with n-almost keys - C-SAKey: Conditional key discovery


4.6.4 Data linking with n-almost keys

In this section, we evaluate the quality of the identity links that can be found using n-almost keys. We use two datasets provided by the instance matching track of OAEI 2010 and OAEI 2013.

OAEI 2010. In the first experiment, we evaluate the quality of the identity links found between the datasets D3 and D4 provided by OAEI 2010, first introduced in Section 3.5.1.1. As a reminder, both datasets contain instances of the classes Restaurant and Address. Each restaurant is described using the datatype properties name, phoneNumber, hasCategory and the object property hasAddress. An address is described using the datatype properties street, city and the object property hasAddress. As we observe in Table 4.10, when at most one exception is allowed, the almost key {phoneNumber, category} is found, while when 2 exceptions are allowed, SAKey discovers {phoneNumber} alone. Note that the set of n-almost keys remains the same even when n reaches 100. Thus, we link the instances of the class Restaurant using the 1-almost keys and the 2-almost keys. In order to use the key {hasAddress}, found in both cases, we first link the addresses using the 1-almost key of D3 {street, city}.

As we notice in Table 4.11, when the 2-almost keys are applied, the recall increases by about 12% and reaches 99.1%, while the precision stays almost the same. Even though the property phoneNumber is not a key, it has a high linking power. Indeed, in this dataset, there exist two distinct restaurants, located in the same place, that share a phone number. In Table 4.12, comparing the linking power of the 1-almost key {phoneNumber, category} with that of the 2-almost key {phoneNumber}, we see that using the phone number alone as a key increases the recall by 74%, to 94.6%, while the precision is reduced only by 1.4%.
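The F-measure values quoted for Table 4.12 can be checked with a few lines of Python. This is only a sketch: it assumes that the F-measure used here is the usual harmonic mean of precision and recall, which the text does not state explicitly. The function name `f_measure` and the dictionary `table_4_12` are illustrative names, not part of SAKey or of the linking tool.

```python
def f_measure(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall, both given as fractions in [0, 1]."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (recall, precision) pairs copied from Table 4.12, expressed as fractions.
table_4_12 = {
    "{name}": (0.758, 0.944),
    "{address}": (0.348, 0.75),
    "{phoneNumber, category}": (0.205, 0.958),
    "{phoneNumber}": (0.946, 0.946),
}

# F-measure in percent, rounded to one decimal as in the table.
f_scores = {
    key: round(100 * f_measure(precision, recall), 1)
    for key, (recall, precision) in table_4_12.items()
}

for key, score in f_scores.items():
    print(f"{key}: {score}%")
```

Under this assumption, the recomputed values agree with the F-measure column of Table 4.12 for all four almost keys.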

n              n-almost keys
0, 1           {{name}, {hasAddress}, {phoneNumber, category}}
2, ..., 100    {{name}, {hasAddress}, {phoneNumber}}

Table 4.10 n-almost keys for the class Restaurant of OAEI 2010

# exceptions   Recall   Precision   F-measure
0, 1           87.5%    85.9%       86.7%
2, ..., 100    99.1%    86%         92.11%

Table 4.11 Data linking for the class Restaurant of OAEI 2010 using equality

Almost keys               Recall   Precision   F-measure
{name}                    75.8%    94.4%       84.1%
{address}                 34.8%    75%         47.5%
{phoneNumber, category}   20.5%    95.8%       33.8%
{phoneNumber}             94.6%    94.6%       94.6%

Table 4.12 Linking power of almost keys found in D3

OAEI 2013. In the second experiment, the benchmark contains one original file and five test cases. Each test case contains a file that should be linked with the original one. This experiment is conducted using the file from the first test case. The original file contains DBpedia descriptions of persons and locations, while the test case file contains the same instances but with modified descriptions. More precisely, the values of 5 properties have been changed by randomly deleting/adding characters, by changing the date format, and/or by randomly changing integer values. Each file contains 1744 triples describing 430 instances. Each person can be described by the datatype properties birthName, birthDate, comment, label and the object properties almaMater, award, birthPlace and doctoralAdvisor. The property almaMater is used to describe a high school or a university from which an individual has graduated. Each location can be described by the datatype properties populationTotal, label, motto and the object property isPartOf. The property motto represents a short sentence that encapsulates the ideals of a location.

In total, the descriptions use 11 properties. We have applied SAKey to discover n-almost keys in the test case file, after eliminating stop words such as "of", "the", "at" and "restaurant". The discovered n-almost keys have then been used to link the data described in the two files. Two different scenarios have been executed. In the first scenario, only string equality has been used by the linking tool, in order to evaluate the quality of the keys without considering the data heterogeneity. Thus, in this scenario, two resources are linked when they have common values for all the properties of an n-almost key. The recall, precision and F-measure of our linking results have been computed using the gold standard provided by OAEI.

# exceptions   Recall   Precision   F-measure
0, 1           25.6%    100%        41%
2, 3           47.6%    98.1%       64.2%
4, 5           47.9%    96.3%       63.9%
6, ..., 16     48.1%    96.3%       64.1%
17             49.3%    82.8%       61.8%

Table 4.13 Data Linking in OAEI 2013 using equality

# exceptions   Recall   Precision   F-measure
0, 1           64.4%    92.3%       75.8%
2, 3           73.7%    90.8%       81.3%
4, 5           73.7%    90.8%       81.3%
6, ..., 16     73.7%    90.8%       81.3%
17             74.4%    82.4%       78.2%

Table 4.14 Data Linking in OAEI 2013 using similarity measures
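The equality-based scenario, in which two resources are linked when they share values for all the properties of an almost key, can be sketched as follows. This is a minimal illustration and not the linking tool used in the experiments: the data model (each instance as a dictionary from property names to sets of values), the function names `agree_on`, `link` and `evaluate`, and the toy instances are all assumptions made for the example. "Common values" is interpreted here as at least one shared value per property.

```python
from itertools import product

# Hypothetical data model: an instance maps each property name to its set of values.
Instance = dict[str, set[str]]


def agree_on(a: Instance, b: Instance, key: frozenset[str]) -> bool:
    """True if a and b share at least one value for every property of the key."""
    return all(a.get(p, set()) & b.get(p, set()) for p in key)


def link(source: dict[str, Instance], target: dict[str, Instance],
         keys: list[frozenset[str]]) -> set[tuple[str, str]]:
    """Link every (source, target) pair that agrees on at least one of the keys."""
    return {
        (s_id, t_id)
        for (s_id, s), (t_id, t) in product(source.items(), target.items())
        if any(agree_on(s, t, key) for key in keys)
    }


def evaluate(found: set[tuple[str, str]],
             gold: set[tuple[str, str]]) -> tuple[float, float, float]:
    """Return (recall, precision, F-measure) of the found links w.r.t. the gold standard."""
    correct = len(found & gold)
    precision = correct / len(found) if found else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return recall, precision, f_measure


# Toy data (invented): t2 has a misspelled label, t3 a reformatted date.
source = {
    "s1": {"label": {"Alice"}, "birthDate": {"1950-01-01"}},
    "s2": {"label": {"Bob"}, "birthDate": {"1960-02-02"}},
    "s3": {"label": {"Carol"}, "birthDate": {"1960-02-02"}},
}
target = {
    "t1": {"label": {"Alice"}, "birthDate": {"1950-01-01"}},
    "t2": {"label": {"Bbo"}, "birthDate": {"1960-02-02"}},
    "t3": {"label": {"Carol"}, "birthDate": {"02/02/1960"}},
}
gold = {("s1", "t1"), ("s2", "t2"), ("s3", "t3")}

strict = link(source, target, [frozenset({"label"})])
relaxed = link(source, target, [frozenset({"label"}), frozenset({"birthDate"})])

print(evaluate(strict, gold))   # recall 2/3, precision 1.0
print(evaluate(relaxed, gold))  # recall 1.0, precision 0.75
```

In this toy setting, adding {birthDate}, which has two exceptions in the source, recovers the link missed because of the misspelled label but also introduces one wrong link. This is the same recall/precision trade-off that the evaluation below reports on the real data.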

Table 4.13 shows the evaluation of the linking process, in terms of recall, precision and F-measure, when all the discovered n-almost keys are applied and n varies from 0 to 18. Unsurprisingly, the more exceptions are allowed, the more the recall increases and the precision decreases. Nevertheless, we observe that when two exceptions are allowed (i.e., 2-almost keys), the recall increases by 22%, while the precision decreases only by 1.9%. Moreover, we notice that in this case the highest F-measure is obtained (64.2%, see Table 4.13). Indeed, by allowing two exceptions, SAKey discovers the properties motto and birthDate as 2-almost keys. Both properties have a high precision even if they are not keys in every case (they have few exceptions).

Although the F-measure increases significantly when n-almost keys are applied, we notice that even in the best case the recall is not very high (less than 50%). For this reason, in the second scenario, the linking tool uses similarity measures to link the data. Table 4.14 presents the results of data linking in terms of recall, precision and F-measure when similarity measures are applied. In this case, the recall reaches almost 75%. We notice that the precision is, in general, lower when similarity measures are used than when only equality is used, since two instances are more easily considered as equal.
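The effect of replacing equality with a similarity measure can be illustrated with a small sketch. The text does not name the similarity measure or the threshold used by the linking tool, so both are assumptions here: the sketch uses the `SequenceMatcher` ratio from Python's standard `difflib` module and an arbitrary threshold of 0.8. The function name `similar` and the example strings are invented for illustration.

```python
from difflib import SequenceMatcher


def similar(a: str, b: str, threshold: float = 0.8) -> bool:
    """True if the SequenceMatcher ratio of a and b reaches the (assumed) threshold."""
    return SequenceMatcher(None, a, b).ratio() >= threshold


original = "Albert Einstein"
modified = "Albrt Einstein"   # one character randomly deleted, as in the test case file

# Equality misses the pair, similarity recovers it: recall goes up.
print(original == modified)          # False
print(similar(original, modified))   # True

# Two different (invented) names can also pass the threshold: precision goes down.
print(similar("John Smith", "Joan Smith"))       # True
print(similar("Albert Einstein", "Marie Curie")) # False
```

The second pair shows why precision tends to drop: values that differ by a single character are treated as matching whether the difference comes from noise or from two genuinely distinct resources.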

1-almost keys:   {label}, {birthName}, {populationTotal}, {award, doctoralAdvisor}, {doctoralAdvisor, birthPlace}, {motto, isPartOf}, {award, birthDate}, {almaMater, birthDate}, {doctoralAdvisor, birthDate}

3-almost keys:   {label}, {birthName}, {populationTotal}, {award, doctoralAdvisor}, {doctoralAdvisor, birthPlace}, {motto}, {birthDate}

5-almost keys:   {label}, {birthName}, {doctoralAdvisor}, {motto}, {birthDate}

16-almost keys:  {label}, {birthName}, {populationTotal}, {doctoralAdvisor}, {motto}, {birthDate}, {award, almaMater, birthPlace}

17-almost keys:  {label}, {birthName}, {populationTotal}, {doctoralAdvisor}, {motto}, {birthDate}, {award, birthPlace}, {award, almaMater}

Table 4.15 Set of n-almost keys when n varies from 1 to 17

Table 4.15 presents the sets of n-almost keys discovered for different values of n. When the same set of n-almost keys is obtained for several values of n, the table contains only the n-almost keys for the biggest n. We observe that the number of minimal n-almost keys varies with n for two reasons. First, the more n increases, the more general the n-almost keys become, since sets of properties considered as n-non keys may be considered as n-almost keys afterwards. For example, the 1-almost key {award, birthDate} becomes {birthDate} when 3 exceptions are allowed. Second, new n-almost keys can be added. For example, {award, almaMater, birthPlace} is introduced when n is set to 16.
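The way a composite key generalizes to one of its subsets as n grows can be illustrated with a naive exception count. The sketch below rests on an assumed reading of the definitions: an instance is an exception for a set of properties P if some other instance shares at least one value with it on every property of P, and P is an n-almost key if it has at most n exceptions. The quadratic pairwise comparison is for illustration only and is not the SAKey algorithm. The function names and the toy data are hypothetical.

```python
from itertools import combinations

# Hypothetical data model: an instance maps each property name to its set of values.
Instance = dict[str, set[str]]


def exceptions(instances: dict[str, Instance], props: set[str]) -> set[str]:
    """Ids of instances that share a value with another instance on every property in props."""
    found: set[str] = set()
    for (i, a), (j, b) in combinations(instances.items(), 2):
        if all(a.get(p, set()) & b.get(p, set()) for p in props):
            found.update((i, j))
    return found


def is_n_almost_key(instances: dict[str, Instance], props: set[str], n: int) -> bool:
    """True if props has at most n exceptions (assumed definition of an n-almost key)."""
    return len(exceptions(instances, props)) <= n


# Toy persons (invented): p1 and p2 share a birth date, p1 and p3 share an award.
persons = {
    "p1": {"award": {"A1"}, "birthDate": {"1950-01-01"}},
    "p2": {"award": {"A2"}, "birthDate": {"1950-01-01"}},
    "p3": {"award": {"A1"}, "birthDate": {"1960-02-02"}},
    "p4": {"award": {"A3"}, "birthDate": {"1970-03-03"}},
}

print(is_n_almost_key(persons, {"award", "birthDate"}, 0))  # True: no pair shares both
print(is_n_almost_key(persons, {"birthDate"}, 1))           # False: p1 and p2 are exceptions
print(is_n_almost_key(persons, {"birthDate"}, 2))           # True once 2 exceptions are allowed
```

In this toy data, {award, birthDate} is the minimal key when no exception is allowed, and it is superseded by the more general {birthDate} as soon as two exceptions are tolerated. This mirrors the {award, birthDate} example given above.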

The linking power of each n-almost key is presented in Table 4.16, in terms of recall, precision and F-measure. We observe that the 1-almost key {birthName} has 0% recall, which means that no link can be found using this key. This occurs because only 8 descriptions in the dataset have this property. Comparing the two keys with no exceptions, {award, birthDate} and {almaMater, birthDate}, with the 4-almost key {birthDate}, we notice that the recall of birthDate alone exceeds 32%, while for each of the two composite keys it was at most about 15%. The precision of {birthDate} remains high (98.6%). Similarly, comparing {motto, isPartOf}, which is a key with no exceptions, with the 4-almost key {motto}, we notice that the recall using motto alone increases by almost 10%, while the precision falls by less than 5%.

Almost keys                       Recall   Precision   F-measure
{label}                           8.3%     100%        15.4%
{birthName}                       0%       -           -
{populationTotal}                 0.2%     100%        0.4%
{award, doctoralAdvisor}          0.4%     100%        0.9%
{doctoralAdvisor, birthPlace}     5.1%     100%        9.7%
{doctoralAdvisor}                 5.3%     85.1%       10%
{motto, isPartOf}                 1.1%     100%        2.2%
{motto}                           10.4%    95.7%       18.8%
{award, birthDate}                9.3%     100%        17%
{almaMater, birthDate}            15.3%    100%        26.6%
{doctoralAdvisor, birthDate}      4.8%     100%        9.3%
{birthDate}                       32.5%    98.6%       49%
{award, almaMater, birthPlace}    48.1%    96.3%       64.2%
{award, birthPlace}               0.5%     100%        0.9%
{award, almaMater}                8.1%     49.3%       13.97%

Table 4.16 Linking power of almost keys for the OAEI 2013

In document Automatic key discovery for Data Linking (Page 152-156)