
URI Identity Management for Semantic Web Data Integration and Linkage


Academic year: 2020


(1)
(2)

1. Linked Data

2. URI Multiplicity

3. The Problem of Coreference

4. URI Identity Management Approaches

5. The Problem with owl:sameAs

6. The Consistent Reference Service (CRS)

7. CRS Architecture

(3)

• DBpedia has URIs for approximately 2 million entities

• Linked datasets contain many overlapping entities

• A single entity can have a number of URIs

• Entities are linked using owl:sameAs

Example

(4)

http://www.rkbexplorer.com

Contains URIs for more than 10 million entities

Data relating to people, projects, papers and institutions

A single entity has a number of URIs (even within the same repository)

Entities are linked using CRSs

(5)

URIs for ‘Spain’:

http://dbpedia.org/resource/Spain

http://www4.wiwiss.fu-berlin.de/factbook/resource/Spain

http://sws.geonames.org/2510769

http://www4.wiwiss.fu-berlin.de/eurostat/resource/countries/Espa%C3%B1a

URIs for ‘Hugh Glaser’:

http://acm.rkbexplorer.com/rdf/resource-P112732

http://citeseer.rkbexplorer.com/rdf/resource-CSP109020

http://citeseer.rkbexplorer.com/rdf/resource-CSP109013

http://citeseer.rkbexplorer.com/rdf/resource-CSP109011

http://citeseer.rkbexplorer.com/rdf/resource-CSP109002

http://dblp.rkbexplorer.com/rdf/resource-27de9959

http://europa.eu/People/#person-0ff816fa

(6)

Tom Anderson – http://www4.wiwiss.fu-berlin.de/dblp/resource/person/109074

is dc:creator of <http://www4.wiwiss.fu-berlin.de/dblp/resource/record/conf/dac/MorettiHNCKABDF01>
is dc:creator of <http://www4.wiwiss.fu-berlin.de/dblp/resource/record/conf/ftcs/SaeedLA91>
is dc:creator of <http://www4.wiwiss.fu-berlin.de/dblp/resource/record/conf/ftrtft/LemosSA92>
is dc:creator of <http://www4.wiwiss.fu-berlin.de/dblp/resource/record/conf/hybrid/AndersonLFS92>
is dc:creator of <http://www4.wiwiss.fu-berlin.de/dblp/resource/record/conf/iccbss/AndersonFRR03>
is dc:creator of <http://www4.wiwiss.fu-berlin.de/dblp/resource/record/conf/iciap/TruccoARI05>
is dc:creator of <http://www4.wiwiss.fu-berlin.de/dblp/resource/record/conf/icnp/ElySWSA01>
is dc:creator of <http://www4.wiwiss.fu-berlin.de/dblp/resource/record/conf/ifip/AndersonRR04>
is dc:creator of <http://www4.wiwiss.fu-berlin.de/dblp/resource/record/conf/sc/BorchersASW95>
is dc:creator of <http://www4.wiwiss.fu-berlin.de/dblp/resource/record/conf/seaai/AndersonH98>
is dc:creator of <http://www4.wiwiss.fu-berlin.de/dblp/resource/record/conf/srds/Anderson86>
is dc:creator of <http://www4.wiwiss.fu-berlin.de/dblp/resource/record/conf/words/AndersonFRR05>
is dc:creator of <http://www4.wiwiss.fu-berlin.de/dblp/resource/record/journals/bell/LiuBFSRA04>
is dc:creator of <http://www4.wiwiss.fu-berlin.de/dblp/resource/record/journals/cj/LemosSA92>
is dc:creator of <http://www4.wiwiss.fu-berlin.de/dblp/resource/record/journals/dt/Anderson01>
is dc:creator of <http://www4.wiwiss.fu-berlin.de/dblp/resource/record/journals/dt/Anderson03>
is dc:creator of <http://www4.wiwiss.fu-berlin.de/dblp/resource/record/journals/dt/ZorianASTI96>
is dc:creator of <http://www4.wiwiss.fu-berlin.de/dblp/resource/record/journals/software/LemosSA95>
is dc:creator of <http://www4.wiwiss.fu-berlin.de/dblp/resource/record/journals/ton/SavageWKA01>
is dc:creator of <http://www4.wiwiss.fu-berlin.de/dblp/resource/record/journals/tse/AndersonBHM85>
is dblp:editor of <http://www4.wiwiss.fu-berlin.de/dblp/resource/record/conf/sigcomm/2006>

Vice President, O-in Design Automation Inc., USA
Professor, University of Newcastle
Professor, Heriot Watt University
University of Washington

(7)

The problem of coreference has existed for many years

Physical libraries disambiguate authors through date of birth

Digital libraries still have the problem of author disambiguation

Problems are caused by variations in naming schemes, e.g. ‘Glaser, H.’ vs. ‘H. Glaser’
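The naming-variation problem above can be illustrated with a minimal Python sketch. The function `name_key` is hypothetical (not part of the presentation) and assumes only two layouts, 'Surname, Initials' and 'Initials Surname':

```python
import re

def name_key(name):
    """Reduce a name in 'Surname, F.' or 'F. Surname' form to (surname, first initial)."""
    name = name.strip()
    if "," in name:
        # 'Glaser, H.' style: surname comes first
        surname, given = (part.strip() for part in name.split(",", 1))
    else:
        # 'H. Glaser' style: assume the last token is the surname
        parts = name.split()
        given, surname = " ".join(parts[:-1]), parts[-1]
    initial = re.sub(r"[^A-Za-z]", "", given)[:1]
    return surname.lower(), initial.lower()

# Both spellings collapse to the same key.
print(name_key("Glaser, H.") == name_key("H. Glaser"))
```

A shared key only marks two records as candidates for the same person. As the Tom Anderson example shows, identical names can still belong to different people.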

(8)

The coreference problem is also referred to as ‘Record Linkage’

Matching entities between records is similar to matching entities between datasets

Database linkage is easier due to the imposed schema

A formal theory of Record Linkage was proposed by Fellegi & Sunter (1969)

It uses coded agreements between each field (property) to give the probability of record (instance) equivalence
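A minimal sketch of field-agreement scoring in the Fellegi & Sunter style, in Python. The field names, the m/u probabilities and the two thresholds are all illustrative assumptions; in practice they would be estimated from data. The score is a sum of log-likelihood ratios, which is then compared against thresholds to decide link, possible link or non-link:

```python
import math

# Assumed values for illustration only:
# m = P(field agrees | records refer to the same entity)
# u = P(field agrees | records refer to different entities)
FIELD_PROBS = {
    "surname": (0.95, 0.01),
    "initial": (0.90, 0.05),
    "affiliation": (0.70, 0.02),
}

def match_weight(agreements):
    """Sum the log2 likelihood ratios over the compared fields."""
    total = 0.0
    for field_name, agrees in agreements.items():
        m, u = FIELD_PROBS[field_name]
        if agrees:
            total += math.log2(m / u)
        else:
            total += math.log2((1 - m) / (1 - u))
    return total

def classify(weight, lower=0.0, upper=10.0):
    """Three-way decision; the thresholds here are arbitrary placeholders."""
    if weight >= upper:
        return "link"
    if weight <= lower:
        return "non-link"
    return "possible link"

pair = {"surname": True, "initial": True, "affiliation": False}
print(classify(match_weight(pair)))
```

Fields that rarely agree by chance (low u) contribute most to the score when they do agree, which is why a matching surname counts for more than a matching initial.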

(9)

• Coreference on the Semantic Web is defined as the situation where two or more URIs are used for a single non-information resource

• URI usage can change with context

• Non-Information resources are hard to define precisely

Examples

‘Hugh Glaser’ at Southampton vs. ‘Hugh Glaser’ at Imperial

‘Harry Potter and the Order of the Phoenix’ in Hardback vs. Softback

(10)

Use a centralised naming authority to issue URIs for every entity in the world

Let everyone create their own URIs and link them to ‘official’ URIs (using owl:sameAs)

Let everyone create their own URIs and register them at a centralised repository

Let everyone create their own URIs and let them be managed by many decentralised repositories

In all of the above encourage reuse and linking as

(11)

owl:sameAs was designed for a specific purpose

Resources linked with owl:sameAs have the same identity, i.e. the subject and object are exactly the same resource

owl:sameAs has been misused for Linking Open Data

Linking can occur between two very different resources, e.g. Tom Anderson

Reasoning with LOD will have unintended

(12)

<rdf:Description rdf:about="#URI-1">
  <vcard:FN>Hugh Glaser</vcard:FN>
  <vcard:EMAIL>[email protected]</vcard:EMAIL>
  <vcard:ROLE>Reader</vcard:ROLE>
</rdf:Description>

<rdf:Description rdf:about="#URI-2">
  <vcard:FN>Hugh Glaser</vcard:FN>
  <vcard:EMAIL>[email protected]</vcard:EMAIL>
  <vcard:ROLE>Lecturer</vcard:ROLE>
</rdf:Description>

Assert <URI-1> <owl:sameAs> <URI-2>

SELECT ?x WHERE {<URI-1> vcard:EMAIL ?x}

Returns [email protected] [email protected]

Which email belongs to which role?
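The loss of context can be reproduced with a small Python sketch over plain tuples. It models owl:sameAs as substitution of one URI for the other, which is a simplification of the OWL semantics, and the email addresses are placeholders because the real ones are obscured in the source:

```python
# Triples as (subject, predicate, object); email values are placeholders.
TRIPLES = [
    ("URI-1", "vcard:FN", "Hugh Glaser"),
    ("URI-1", "vcard:EMAIL", "reader@example.org"),
    ("URI-1", "vcard:ROLE", "Reader"),
    ("URI-2", "vcard:FN", "Hugh Glaser"),
    ("URI-2", "vcard:EMAIL", "lecturer@example.org"),
    ("URI-2", "vcard:ROLE", "Lecturer"),
]

def merge_same_as(triples, keep, drop):
    """Rewrite every occurrence of `drop` as `keep`, as if keep owl:sameAs drop."""
    return {
        (keep if s == drop else s, p, keep if o == drop else o)
        for s, p, o in triples
    }

def values(triples, subject, predicate):
    """Sorted objects for a subject/predicate pair, like a one-pattern SELECT."""
    return sorted(o for s, p, o in triples if s == subject and p == predicate)

merged = merge_same_as(TRIPLES, "URI-1", "URI-2")
print(values(merged, "URI-1", "vcard:EMAIL"))  # both addresses
print(values(merged, "URI-1", "vcard:ROLE"))   # both roles, pairing lost
```

Before the merge each URI carries exactly one email and one role. Afterwards both emails and both roles hang off a single resource and nothing in the graph records which goes with which.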

(13)

• Data (Knowledge) providers publish data (knowledge)

• Resources from one provider cannot be guaranteed to be the same as resources from another provider

• Knowledge will be published and made dereferenceable at the domain that the publisher has control over

• URIs will be constructed from the domain name of the publisher’s site

• An intermediate service groups URIs of resources that may be the same

(14)

The CRS can be seen as a conventional knowledge base

It contains knowledge about the URIs in a repository

URIs referring to the same resource are grouped together in ‘Bundles’

A Bundle has properties:

• coref:hasEquivalentReference – The URIs in a bundle are grouped together using this predicate

• coref:hasCanonicalReference – One URI in a bundle can be made the canonical representation, i.e. the preferred URI
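A bundle can be sketched as a small data structure in Python. The class and method names below are hypothetical and are not the CRS implementation. The URIs come from the ‘Hugh Glaser’ list earlier in the slides, but which of them actually sit in the same bundle is an assumption made for illustration:

```python
from dataclasses import dataclass, field

@dataclass
class Bundle:
    """A set of URIs believed to co-refer, with one preferred (canonical) URI."""
    canonical: str
    equivalents: set = field(default_factory=set)

    def __post_init__(self):
        # The canonical URI is itself a member of the bundle.
        self.equivalents.add(self.canonical)

class ConsistentReferenceService:
    """Hypothetical in-memory index from URI to the bundle that contains it."""

    def __init__(self):
        self._bundle_of = {}

    def add_bundle(self, bundle):
        for uri in bundle.equivalents:
            self._bundle_of[uri] = bundle

    def canonical(self, uri):
        """Return the preferred URI, or the URI itself if it is in no bundle."""
        bundle = self._bundle_of.get(uri)
        return bundle.canonical if bundle else uri

BASE = "http://citeseer.rkbexplorer.com/rdf/resource-"
crs = ConsistentReferenceService()
crs.add_bundle(Bundle(
    canonical=BASE + "CSP109002",
    equivalents={BASE + "CSP109020", BASE + "CSP109013"},  # assumed membership
))
print(crs.canonical(BASE + "CSP109020"))
```

Keeping the equivalences in a separate index means the published data itself stays untouched. A consumer decides per query whether to rewrite a URI to its canonical form.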

(15)

@prefix coref: <http://www.resist.ecs.soton.ac.uk/ontology/coref#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .

<http://citeseer.rkbexplorer.com/crs/coref#bundle1> a coref:Bundle ;
    coref:hasCanonicalReference
        <http://citeseer.rkbexplorer.com/rdf/resource-CSP109002> ;
    coref:hasEquivalentReference

(16)
(17)

Finding all equivalences (bundles) is up to the application

A separate activity from coreferencing a single data source

Services such as Sindice can perform this function for free

To perform the equivalence closure just follow the crs:hasCRS links

Scalability is ensured by not including all possible

(18)

The Resilience Knowledge Base Explorer displays communities of practice for people, projects and publications from the RKB

Uses multiple CRSs to disambiguate people and publications

One CRS per knowledge base ensures scalability

Multiple SPARQL queries

Look yourself up!

(19)

Equivalence mining is a difficult task that requires multiple algorithms

Adding policies to determine the trust level of a CRS

Establishing the authority of a CRS over a KB

Establishing performance metrics

Collaborating with the LOD community for wide-scale deployment

(20)

Coreference exists in many disciplines and will exist on the Semantic Web

The equivalence of non-information resources depends on context

The semantics of owl:sameAs do not fit with the current usage in Linked Data

The CRS is a solution that is being deployed on a large knowledge-based infrastructure

(21)
