1. Linked Data
2. URI Multiplicity
3. The Problem of Coreference
4. URI Identity Management Approaches
5. The Problem with owl:sameAs
6. The Consistent Reference Service (CRS)
7. CRS Architecture
• DBpedia has URIs for approximately 2 million entities
• Linked datasets contain many overlapping entities
• A single entity can have several URIs
• Entities are linked using owl:sameAs
Example
http://www.rkbexplorer.com
• Contains URIs for more than 10 million entities
• Data relating to people, projects, papers and
institutions
• A single entity has a number of URIs (even within
the same repository)
• Entities are linked using CRSs
URIs for ‘Spain’:
http://dbpedia.org/resource/Spain
http://www4.wiwiss.fu-berlin.de/factbook/resource/Spain
http://sws.geonames.org/2510769
http://www4.wiwiss.fu-berlin.de/eurostat/resource/countries/Espa%C3%B1a
URIs for ‘Hugh Glaser’:
http://acm.rkbexplorer.com/rdf/resource-P112732
http://citeseer.rkbexplorer.com/rdf/resource-CSP109020
http://citeseer.rkbexplorer.com/rdf/resource-CSP109013
http://citeseer.rkbexplorer.com/rdf/resource-CSP109011
http://citeseer.rkbexplorer.com/rdf/resource-CSP109002
http://dblp.rkbexplorer.com/rdf/resource-27de9959
http://europa.eu/People/#person-0ff816fa
Tom Anderson – http://www4.wiwiss.fu-berlin.de/dblp/resource/person/109074
is dc:creator of <http://www4.wiwiss.fu-berlin.de/dblp/resource/record/conf/dac/MorettiHNCKABDF01>
is dc:creator of <http://www4.wiwiss.fu-berlin.de/dblp/resource/record/conf/ftcs/SaeedLA91>
is dc:creator of <http://www4.wiwiss.fu-berlin.de/dblp/resource/record/conf/ftrtft/LemosSA92>
is dc:creator of <http://www4.wiwiss.fu-berlin.de/dblp/resource/record/conf/hybrid/AndersonLFS92>
is dc:creator of <http://www4.wiwiss.fu-berlin.de/dblp/resource/record/conf/iccbss/AndersonFRR03>
is dc:creator of <http://www4.wiwiss.fu-berlin.de/dblp/resource/record/conf/iciap/TruccoARI05>
is dc:creator of <http://www4.wiwiss.fu-berlin.de/dblp/resource/record/conf/icnp/ElySWSA01>
is dc:creator of <http://www4.wiwiss.fu-berlin.de/dblp/resource/record/conf/ifip/AndersonRR04>
is dc:creator of <http://www4.wiwiss.fu-berlin.de/dblp/resource/record/conf/sc/BorchersASW95>
is dc:creator of <http://www4.wiwiss.fu-berlin.de/dblp/resource/record/conf/seaai/AndersonH98>
is dc:creator of <http://www4.wiwiss.fu-berlin.de/dblp/resource/record/conf/srds/Anderson86>
is dc:creator of <http://www4.wiwiss.fu-berlin.de/dblp/resource/record/conf/words/AndersonFRR05>
is dc:creator of <http://www4.wiwiss.fu-berlin.de/dblp/resource/record/journals/bell/LiuBFSRA04>
is dc:creator of <http://www4.wiwiss.fu-berlin.de/dblp/resource/record/journals/cj/LemosSA92>
is dc:creator of <http://www4.wiwiss.fu-berlin.de/dblp/resource/record/journals/dt/Anderson01>
is dc:creator of <http://www4.wiwiss.fu-berlin.de/dblp/resource/record/journals/dt/Anderson03>
is dc:creator of <http://www4.wiwiss.fu-berlin.de/dblp/resource/record/journals/dt/ZorianASTI96>
is dc:creator of <http://www4.wiwiss.fu-berlin.de/dblp/resource/record/journals/software/LemosSA95>
is dc:creator of <http://www4.wiwiss.fu-berlin.de/dblp/resource/record/journals/ton/SavageWKA01>
is dc:creator of <http://www4.wiwiss.fu-berlin.de/dblp/resource/record/journals/tse/AndersonBHM85>
is dblp:editor of <http://www4.wiwiss.fu-berlin.de/dblp/resource/record/conf/sigcomm/2006>
Vice President O-in Design Automation inc. USA
Professor, University of Newcastle
Professor, Heriot Watt University
University of Washington
• The problem of coreference has existed for many
years
• Physical libraries disambiguate authors by date
of birth
• Digital libraries still have the problem of author
disambiguation
• Problems caused by variations in naming
schemes
e.g. ‘Glaser, H.’
‘H. Glaser’
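As a rough illustration of the naming-scheme problem, the sketch below reduces both forms of a name to a comparable key. The function name `name_key` and the two supported layouts ("Surname, Given" and "Given Surname") are assumptions made for this example only. A matching key shows that two strings could denote the same author, not that they do.

```python
def name_key(name: str) -> tuple[str, str]:
    """Reduce a personal name to (surname, first initial), both lower-cased.

    Only two layouts are handled: 'Surname, Given' and 'Given Surname'.
    """
    name = name.strip()
    if "," in name:
        # 'Glaser, H.' style: the surname comes before the comma
        surname, given = (part.strip() for part in name.split(",", 1))
    else:
        # 'H. Glaser' style: the surname is the last token
        parts = name.split()
        surname, given = parts[-1], " ".join(parts[:-1])
    return surname.lower(), given[:1].lower()


variants = ["Glaser, H.", "H. Glaser", "Hugh Glaser"]
keys = {name_key(v) for v in variants}  # all three collapse to one key
```

Two different people called H. Glaser would collapse to the same key as well, which is why name matching alone cannot resolve coreference.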
• The coreference problem is also referred to as
‘Record Linkage’
• Matching entities between records similar to
matching entities between datasets
• Database linkage is easier due to imposed
schema
• Formal theory of Record Linkage proposed by
Fellegi & Sunter (1969)
• Uses coded agreements between each field
(property) to give the probability of record (instance) equivalence
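A minimal sketch of the Fellegi & Sunter scoring idea follows. For each field, m is the probability that the field agrees when two records truly match, and u is the probability that it agrees when they do not. Each field contributes a log-likelihood-ratio weight, and the summed weight is compared against two thresholds. The field names, the m and u values, and the thresholds are hypothetical, chosen only to make the example run.

```python
import math

# Hypothetical (m, u) probabilities per field; in practice these are
# estimated from data, not hand-picked.
FIELDS = {
    "surname": (0.95, 0.01),
    "initial": (0.90, 0.05),
    "affiliation": (0.70, 0.02),
}


def match_weight(agreements: dict[str, bool]) -> float:
    """Sum the per-field agreement or disagreement weights (log base 2)."""
    total = 0.0
    for field, agrees in agreements.items():
        m, u = FIELDS[field]
        if agrees:
            total += math.log2(m / u)              # agreement raises the score
        else:
            total += math.log2((1 - m) / (1 - u))  # disagreement lowers it
    return total


def classify(weight: float, upper: float = 6.0, lower: float = 0.0) -> str:
    """Three-way decision: link, non-link, or possible link (manual review)."""
    if weight >= upper:
        return "link"
    if weight <= lower:
        return "non-link"
    return "possible link"


pair = {"surname": True, "initial": True, "affiliation": False}
decision = classify(match_weight(pair))
```

The middle band between the two thresholds is where pairs are left for clerical review; widening it trades automation for fewer false links.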
• Coreference on the Semantic Web is the situation where two or more URIs are used for a single non-information resource
• URI usage can change with context
• Non-information resources are hard to define precisely
Examples
‘Hugh Glaser’ at Southampton vs. ‘Hugh Glaser’ at Imperial
‘Harry Potter and the Order of the Phoenix’ in Hardback vs. Softback
• Use a centralised naming authority to issue URIs
for every entity in the world
• Let everyone create their own URIs and link them
to ‘official’ URIs (using owl:sameAs)
• Let everyone create their own URIs and register
them at a centralised repository
• Let everyone create their own URIs and let them
be managed by many decentralised repositories
• In all of the above encourage reuse and linking as
• owl:sameAs was designed for a specific purpose
• Resources linked with owl:sameAs have the same
identity, i.e. the subject and object are exactly the same resource
• owl:sameAs has been misused for Linking Open
Data
• Linking can occur between two very different
resources, e.g. Tom Anderson
• Reasoning with LOD will have unintended
<rdf:Description rdf:about="#URI-1">
  <vcard:FN>Hugh Glaser</vcard:FN>
  <vcard:EMAIL>[email protected]</vcard:EMAIL>
  <vcard:ROLE>Reader</vcard:ROLE>
</rdf:Description>
<rdf:Description rdf:about="#URI-2">
  <vcard:FN>Hugh Glaser</vcard:FN>
  <vcard:EMAIL>[email protected]</vcard:EMAIL>
  <vcard:ROLE>Lecturer</vcard:ROLE>
</rdf:Description>
Assert <URI-1> <owl:sameAs> <URI-2>
SELECT ?x WHERE {<URI-1> vcard:EMAIL ?x}
Returns [email protected] [email protected]
Which email belongs to which role?
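The effect can be reproduced with a small in-memory sketch. It uses plain Python tuples as triples instead of an RDF library, and the two email addresses are placeholders invented for the example. The helper `apply_same_as` is hypothetical and models only one consequence of owl:sameAs: every statement about one URI also holds for the other (subject position only).

```python
# Triples as (subject, predicate, object); email addresses are placeholders.
triples = {
    ("URI-1", "vcard:FN", "Hugh Glaser"),
    ("URI-1", "vcard:EMAIL", "reader@example.org"),
    ("URI-1", "vcard:ROLE", "Reader"),
    ("URI-2", "vcard:FN", "Hugh Glaser"),
    ("URI-2", "vcard:EMAIL", "lecturer@example.org"),
    ("URI-2", "vcard:ROLE", "Lecturer"),
}


def apply_same_as(triples, a, b):
    """Copy every statement about `a` onto `b` and vice versa."""
    merged = set(triples)
    for s, p, o in triples:
        if s == a:
            merged.add((b, p, o))
        elif s == b:
            merged.add((a, p, o))
    return merged


def objects(triples, subject, predicate):
    """Equivalent of SELECT ?x WHERE { subject predicate ?x }."""
    return sorted(o for s, p, o in triples if s == subject and p == predicate)


merged = apply_same_as(triples, "URI-1", "URI-2")
emails = objects(merged, "URI-1", "vcard:EMAIL")  # both addresses
roles = objects(merged, "URI-1", "vcard:ROLE")    # both roles
```

After the merge, URI-1 carries two emails and two roles, and nothing in the graph records which email went with which role.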
• Data (Knowledge) providers publish data (knowledge)
• Resources from one provider cannot be guaranteed to be the same as resources from another provider
• Knowledge will be published and made
dereferenceable at the domain that the publisher has control over
• URIs will be constructed from the domain name of the publisher’s site
• An intermediate service groups URIs of resources that may be the same
• Can be seen as a conventional Knowledge Base
• Contains knowledge about the URIs in a
repository
• URIs referring to the same resource are grouped
together in ‘Bundles’
• A Bundle has properties:
• coref:hasEquivalentReference – The URIs in a bundle are
grouped together using this predicate
• coref:hasCanonicalReference – One URI in a bundle can be
made the canonical representation, i.e. the preferred URI
@prefix coref: <http://www.resist.ecs.soton.ac.uk/ontology/coref#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
<http://citeseer.rkbexplorer.com/crs/coref#bundle1> a coref:Bundle ;
coref:hasCanonicalReference
<http://citeseer.rkbexplorer.com/rdf/resource-CSP109002> ;
coref:hasEquivalentReference
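The bundle idea can be sketched as a small in-memory structure. This illustrates the grouping only and is not the CRS implementation. The class and method names are hypothetical, and placing the two citeseer URIs below in the same bundle as the canonical one is an assumption made for the example.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Bundle:
    """URIs believed to denote the same resource, with an optional preferred URI."""
    canonical: Optional[str] = None
    equivalents: set = field(default_factory=set)


class CRS:
    """Toy consistent reference service: maps each known URI to its bundle."""

    def __init__(self):
        self._bundle_of = {}

    def add_equivalence(self, a: str, b: str) -> Bundle:
        bundle_a, bundle_b = self._bundle_of.get(a), self._bundle_of.get(b)
        if bundle_a is not None and bundle_b is not None and bundle_a is not bundle_b:
            # Both URIs already sit in different bundles: merge b's into a's
            bundle_a.equivalents |= bundle_b.equivalents
            bundle_a.canonical = bundle_a.canonical or bundle_b.canonical
        bundle = bundle_a or bundle_b or Bundle()
        bundle.equivalents |= {a, b}
        for uri in bundle.equivalents:
            self._bundle_of[uri] = bundle
        return bundle

    def set_canonical(self, uri: str) -> None:
        # Raises KeyError if the URI is not in any bundle
        self._bundle_of[uri].canonical = uri

    def equivalents(self, uri: str) -> set:
        bundle = self._bundle_of.get(uri)
        return set(bundle.equivalents) if bundle else {uri}

    def canonical(self, uri: str) -> str:
        bundle = self._bundle_of.get(uri)
        return bundle.canonical if bundle and bundle.canonical else uri


base = "http://citeseer.rkbexplorer.com/rdf/resource-"
crs = CRS()
crs.add_equivalence(base + "CSP109002", base + "CSP109020")
crs.add_equivalence(base + "CSP109020", base + "CSP109013")
crs.set_canonical(base + "CSP109002")
```

Unlike owl:sameAs, membership in a bundle asserts nothing about the resources themselves, so an application can decide per context whether to treat the members as one.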
• Finding all equivalences (bundles) is up to the
application
• A separate activity from coreferencing a single
data source
• Services such as Sindice can perform this
function for free
• To perform the equivalence closure just follow the
crs:hasCRS links
• Scalability is ensured by not including all possible
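One way an application might compute such a closure is a breadth-first walk: look up a URI in the CRS responsible for it, collect the bundle members, then repeat for each newly found URI. In the sketch below the step from a URI to its CRS is simulated by hostname lookup in a dictionary. That lookup, the example.org hosts, and the data are all assumptions standing in for following the links to real services.

```python
from collections import deque
from urllib.parse import urlparse

# Hypothetical stand-in for several CRSs: hostname -> {uri: bundle members}
CRS_BY_HOST = {
    "a.example.org": {
        "http://a.example.org/1": {"http://a.example.org/1", "http://b.example.org/7"},
    },
    "b.example.org": {
        "http://b.example.org/7": {"http://b.example.org/7", "http://c.example.org/3"},
    },
}


def equivalence_closure(start: str, crs_by_host: dict) -> set:
    """Collect every URI reachable from `start` through CRS bundles."""
    seen = {start}
    queue = deque([start])
    while queue:
        uri = queue.popleft()
        # Find the CRS responsible for this URI (empty if none is known)
        crs = crs_by_host.get(urlparse(uri).hostname, {})
        for other in crs.get(uri, ()):
            if other not in seen:
                seen.add(other)
                queue.append(other)
    return seen


closure = equivalence_closure("http://a.example.org/1", CRS_BY_HOST)
```

Each CRS only needs to know its own bundles; the application pays the cost of the traversal and can stop early if it does not trust a particular service.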
• The Resilience Knowledge Base Explorer displays
communities of practice for people, projects and publications from the RKB
• Uses multiple CRSs to disambiguate people and
publications
• One CRS per knowledge base ensures scalability
• Multiple SPARQL queries
• Look yourself up!
• Equivalence Mining is a difficult task that requires
multiple algorithms
• Adding policies to determine the trust level of a
CRS
• Establishing the authority of a CRS over a KB
• Establishing performance metrics
• Collaborating with the LOD community for wide-scale
deployment
• Coreference exists in many disciplines and will
exist on the Semantic Web
• The equivalence of non-information resources
depends on context
• The semantics of owl:sameAs do not fit with the
current usage in Linked Data
• The CRS is a solution that is being deployed on a
large knowledge-based infrastructure