Knowledge Base and RKBExplorer
Hugh Glaser and Ian Millard
School of Electronics & Computer Science University of Southampton, UK
Abstract. In order to provide a knowledge-enabled infrastructure for the ReSIST EU-funded Network of Excellence, a knowledge base, associated utilities and a user interface were built and have been delivered. We discuss the RKBExplorer, which is the interface to the system, as well as the components and utilities that support it. These include the acquisition tools, co-reference analysis tools, co-reference management and publishing facilities, Linked Data publishing infrastructure, community of practice analysis, and the use of Fresnel lenses for formatting.
While the consortium was being formed for the proposal for the ReSIST Project on Resilience for Survivability in IST, Prof Brian Randell of Newcastle University suggested that the CS AKTive Space [1], which he had recently seen, was an interesting prototype of how the project might have an effective knowledge-enabled infrastructure for the 15 partners and the outputs they would generate. The University of Southampton joined the consortium to provide such an infrastructure. During the three years of the project (January 2006 – December 2008) the work progressed, informed by the project requirements, and we present the result of that work, and the work of other collaborations.
In the course of building the required system, it was necessary to confront many of the challenges currently being faced by Linked Data researchers. In the subsequent sections of this overview we provide brief descriptions and figures covering some of the major subsystems.
We begin by presenting the most obvious component, the user interface. We go on to show outputs of probably the most challenging part, the co-reference subsystem, and then discuss the sources and methods for data acquisition.
Many other aspects are exposed and can be freely used by interested third parties, and these will be shown on demand during the poster/demonstration session.
Although user interface work was not intended to be a major part of the project, it was necessary to provide a suitable interface for the intended users of the RKB. A browser was developed, and is now in its third realisation, following previous versions and evaluations.
Fig. 1. The RKBExplorer, showing a focus on Prof. Randell
Figure 1 shows a screenshot of this latest version of the RKBExplorer. It is a faceted browser, showing the community of practice of the resource in focus (in this case Prof Randell) on the top left, and details of the resource on the top right. Below these are panes which show resources that have been identified by the system as being related to the resource in focus, in descending order of importance. The ordering is informed by the ontological relationships in the underlying knowledge base. The type of resource in the lower panes can be changed (for example to “Organisations”) by clicking a “change” link.
The detail pane on the top right shows a unified view of the knowledge about the resource from all the known knowledge bases, while also taking cognisance of the multiple URIs of the resource. These resources are then formatted for the pane using Fresnel lenses that have been defined as appropriate for the RKB system, or by the ontology providers. In this case, the pane shows information from a number of rkbexplorer.com sites, but has also brought in, by URI resolution on demand, the information on this resource from dbpedia.org, and shown it as “Description”.
Co-reference identification, management and reuse
The problem of ascertaining when two URIs identify the same non-information resource was a serious concern in CS AKTive Space. As the world of Linked Data has grown, this has been seen as an increasingly important problem, and with the many sources used in the RKB, it presented the RKB builders with a significant challenge. Although some of the sources purported to have made such identifications, it rarely proved possible to accept their judgement, as there were too many false positive cases. In viewing a source that has few connections, if any, to other sources, it is often possible for a user to recognise that two or more resources have been identified as one, and reject the incorrect knowledge by using known context, almost without being conscious of it. However, once a resource is being used as only one of many inputs to a system that is providing aggregation, as the RKB does, such errors are less obvious and less amenable to being discounted by context, and therefore more serious. For example, we found that the excellent and widely-respected DBLP Computer Science Bibliography had many such false positives. Until we made the decision to reject all identifications from DBLP, one consequence was that one of our project members, Prof Tom Anderson of Newcastle University, was identified as being in receipt of major funds from the US NSF (as well as all the other knowledge about publications and EU funding), when in fact the individual we were considering had never been funded by that organisation. Another example was that the same individual was identified as working with many colleagues in Australia, as the information provided by one of the sources had incorrectly identified the University of Newcastle, Australia as being the same institution as Newcastle University, UK.
To solve this problem, almost all strings that referred to resources were given new, unique, URIs on acquisition. A variety of tools were then run to carry out the co-reference analysis in a conservative manner, matching strings with varying levels of precision, depending on type and context, as well as performing pattern matching on the RDF graphs of the different resources.
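The conservative style of matching described above might be sketched as follows. This is a minimal illustration only, not the tools used in the project: the record fields, the normalisation rules, and the requirement for corroborating context (a shared affiliation or a shared co-author) are all assumptions made for the example.

```python
import unicodedata
from dataclasses import dataclass


@dataclass(frozen=True)
class PersonRecord:
    """Hypothetical record for a person minted from one source."""
    uri: str
    name: str
    affiliation: str
    coauthors: frozenset = frozenset()


def normalise(text: str) -> str:
    """Lower-case, strip accents and punctuation, collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", text)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    cleaned = "".join(ch if ch.isalnum() else " " for ch in no_accents.lower())
    return " ".join(cleaned.split())


def likely_same(a: PersonRecord, b: PersonRecord) -> bool:
    """Accept a match only if the names agree AND some context corroborates it."""
    if normalise(a.name) != normalise(b.name):
        return False
    same_affiliation = normalise(a.affiliation) == normalise(b.affiliation)
    shared_coauthors = ({normalise(n) for n in a.coauthors}
                        & {normalise(n) for n in b.coauthors})
    return same_affiliation or bool(shared_coauthors)


# Illustrative records; the URIs are invented for the example.
uk = PersonRecord("http://example.org/a#p1", "Tom Anderson",
                  "Newcastle University", frozenset({"Brian Randell"}))
uk_again = PersonRecord("http://example.org/b#p7", "Tom  ANDERSON",
                        "Newcastle University, UK", frozenset({"Brian Randell"}))
australia = PersonRecord("http://example.org/c#p3", "Tom Anderson",
                         "University of Newcastle, Australia")

print(likely_same(uk, uk_again))   # names agree and a co-author is shared
print(likely_same(uk, australia))  # names agree but nothing corroborates
```

A name match on its own is deliberately not enough here, which reflects the preference above for missing a true co-reference over asserting a false one.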
Having deduced this co-reference information, we have devised a system to aid in storing, maintaining and modifying URI equivalences. The Co-reference Resolution Service (CRS) enables groups of URIs to be represented in ‘bundles’ which identify resources which are deemed to be equivalent within a given context. Methods and interfaces are provided to query for equivalent URIs, to modify, merge, and undo bundles, and to deprecate URIs. We have currently instantiated a CRS for each of the rkbexplorer.com repositories, maintaining knowledge of co-reference both internally within that ‘local’ dataset and of equivalent resources in other ‘foreign’ datasets. However, one can use as many or few CRSes as required, for example utilising multiple instances to represent a different notion or context of equivalence, such as very strict heuristics for use in a citation analysis scenario compared to a more liberal sense of similarity suitable for general use.
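The bundle operations described above can be illustrated with a small in-memory structure. The sketch below is an assumption-laden stand-in rather than the CRS implementation: the class name, method names and the choice to hide deprecated URIs from query results are all hypothetical.

```python
class CoreferenceStore:
    """Toy in-memory stand-in for one CRS instance (one notion of equivalence)."""

    def __init__(self):
        self._bundle_of = {}      # URI -> bundle id
        self._members = {}        # bundle id -> set of URIs
        self._deprecated = set()  # URIs that should no longer be offered
        self._next_id = 0

    def _new_bundle(self, uri):
        bundle_id = self._next_id
        self._next_id += 1
        self._bundle_of[uri] = bundle_id
        self._members[bundle_id] = {uri}
        return bundle_id

    def bundle_id(self, uri):
        """Return the bundle holding uri, creating a singleton bundle if needed."""
        if uri in self._bundle_of:
            return self._bundle_of[uri]
        return self._new_bundle(uri)

    def merge(self, uri_a, uri_b):
        """Declare two URIs equivalent by merging their bundles."""
        keep, drop = self.bundle_id(uri_a), self.bundle_id(uri_b)
        if keep != drop:
            for uri in self._members.pop(drop):
                self._bundle_of[uri] = keep
                self._members[keep].add(uri)
        return keep

    def separate(self, uri):
        """Undo an equivalence by moving uri into a fresh singleton bundle."""
        old = self._bundle_of.get(uri)
        if old is not None and len(self._members[old]) > 1:
            self._members[old].discard(uri)
            self._new_bundle(uri)

    def deprecate(self, uri):
        """Keep uri in its bundle but stop returning it from queries."""
        self._deprecated.add(uri)

    def equivalents(self, uri):
        """Return the non-deprecated URIs bundled with uri."""
        bundle = self._bundle_of.get(uri)
        if bundle is None:
            return {uri}
        return {u for u in self._members[bundle] if u not in self._deprecated}


# Two instances can hold different notions of equivalence for the same URIs.
strict, liberal = CoreferenceStore(), CoreferenceStore()
liberal.merge("http://a.example/id/p1", "http://b.example/id/p9")
print(liberal.equivalents("http://a.example/id/p1"))
print(strict.equivalents("http://a.example/id/p1"))
```

Keeping each notion of equivalence in its own instance means a strict consumer never sees the looser identifications at all, rather than having to filter them out.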
To facilitate reuse of the co-reference data we have collected, a service is provided at http://www.rkbexplorer.com/sameAs/ which can return all equivalences for any given URI. This service iteratively interrogates those CRSes as required to expand the equivalence set until a global closure is achieved, enabling applications such as RKBExplorer to easily discover alternative URIs which should be considered.
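The iterative expansion to a global closure can be sketched as a simple worklist algorithm. In this illustration each CRS is modelled as an in-memory lookup function; the function names and the example URIs are assumptions, and the real service's interface and network calls are not reproduced here.

```python
from collections import deque


def make_service(bundles):
    """Build a lookup function from a list of URI bundles (stand-in for one CRS)."""
    index = {}
    for bundle in bundles:
        group = frozenset(bundle)
        for uri in group:
            index[uri] = group

    def lookup(uri):
        # An unknown URI simply has no known equivalents in this service.
        return index.get(uri, frozenset())

    return lookup


def global_closure(uri, services):
    """Ask every service about every newly found URI until nothing new appears."""
    seen = {uri}
    queue = deque([uri])
    while queue:
        current = queue.popleft()
        for lookup in services:
            for other in lookup(current):
                if other not in seen:
                    seen.add(other)
                    queue.append(other)
    return seen


# Three hypothetical services, each knowing only part of the chain.
crs_one = make_service([{"http://a.example/id/1", "http://b.example/id/7"}])
crs_two = make_service([{"http://b.example/id/7", "http://c.example/id/3"}])
crs_three = make_service([{"http://c.example/id/3", "http://d.example/id/9"}])

closure = global_closure("http://a.example/id/1", [crs_one, crs_two, crs_three])
print(sorted(closure))
```

No single service links the first URI to the last, so the full set only emerges once each newly discovered URI has itself been looked up in every service.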
Figure 2, below, shows an overview of the co-reference inter-linkage between rkbexplorer.com repositories (circles) and external data sources (rectangles). The weight of each line is representative of the number of co-references between the two sources it connects, while the size of each rkbexplorer.com repository is proportional to the number of triples in that dataset. Full details are available at http://www.rkbexplorer.com/linkage/.
Fig. 2. Co-reference links between RKBExplorer repositories and external sources
As many Semantic Web and Linked Data application builders have found, the project was faced with the problem that there was very little of the required data available in RDF. It was therefore necessary to harvest our own data from suitable resources, and convert it to RDF; a number of different technologies were used for this purpose, depending on the characteristics of the source being harvested.
However, rather than put all the harvested RDF in a single store, we embraced the idea that a Linked Data web will ultimately require a multitude of disparate, distributed sources to be queried on demand, and each of the sources was harvested into a separate 3store triplestore. Each of these stores has been exposed as Linked Data on a separate subdomain of rkbexplorer.com, with the required URI resolution with 303 redirect, SPARQL endpoint, semantic sitemap, and VoID description. In due course, this will enable us to use RDF provided by the places from which we have harvested, should they provide it in the future, and indeed we may be able to push the RDF harvesting we are doing to the original data providers themselves. Further details and open access to each of these resources are available via http://www.rkbexplorer.com/sources/.
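The 303 redirect mentioned above follows the common Linked Data pattern in which a request for the URI of a non-information resource is answered with "303 See Other", pointing at a document about that resource. A minimal sketch of the decision logic follows; the /id/, /data/ and /description/ path layout is assumed purely for illustration, and the content negotiation is deliberately naive (it ignores quality values in the Accept header).

```python
RDF_TYPES = ("application/rdf+xml", "text/turtle")


def resolve(path, accept_header):
    """Return (status, headers) for a request to this toy server."""
    if not path.startswith("/id/"):
        return 404, {}
    local_name = path[len("/id/"):]
    # Send RDF-aware clients to the data document, everyone else to a page.
    wants_rdf = any(media_type in accept_header for media_type in RDF_TYPES)
    document = ("/data/" if wants_rdf else "/description/") + local_name
    return 303, {"Location": document, "Vary": "Accept"}


print(resolve("/id/person-1", "application/rdf+xml"))
print(resolve("/id/person-1", "text/html"))
```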
A consequence of this segregation of information has been that the system software has to be capable of providing a service to query all the knowledge bases as if they were a single store, while respecting the co-reference knowledge, as described above. Utilities have been developed to field queries to appropriate SPARQL endpoints where known, or to utilise Linked Data dereferencing to obtain RDF from other sources. As a result, when co-references were established with dbpedia.org URIs the system was able to seamlessly include that information without any change.
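The idea of querying segregated stores as if they were one, while respecting co-reference, can be illustrated as follows. This sketch is an assumption throughout: the stores are plain in-memory lists of triples rather than SPARQL endpoints, and the store names, URIs and the `equivalents` callback are hypothetical.

```python
def unified_view(uri, equivalents, stores):
    """Collect triples about uri and all its known aliases from every store.

    Subjects are rewritten to the requested URI so that the result reads as
    if it came from a single store.
    """
    aliases = set(equivalents(uri)) | {uri}
    merged = set()
    for triples in stores.values():
        for subject, predicate, obj in triples:
            if subject in aliases:
                merged.add((uri, predicate, obj))
    return merged


# Hypothetical data: two separate stores describing the same person.
stores = {
    "store-one": [("http://one.example/id/p1", "ex:name", "Brian Randell")],
    "store-two": [("http://two.example/id/p9", "ex:affiliation", "Newcastle University"),
                  ("http://two.example/id/p2", "ex:name", "Someone Else")],
}
same_as = {
    "http://one.example/id/p1": {"http://one.example/id/p1", "http://two.example/id/p9"},
}


def equivalents(uri):
    return same_as.get(uri, {uri})


for triple in sorted(unified_view("http://one.example/id/p1", equivalents, stores)):
    print(triple)
```

Because the alias set is computed before any store is consulted, adding a new store or a new equivalence does not require any change to the querying code.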
The system that was built to support the ReSIST Project is extensive, and it confronted and overcame many of the challenges associated with utilising Linked Data and the Semantic Web. This short description has highlighted some of the major ones, but space constraints have meant that other aspects have been omitted, such as the use of a project Semantic MediaWiki, Natural Language Processing for text classification, and ontology development and acquisition for the application domain.
We shall be delighted to discuss these issues during the poster/demonstration session, and will additionally be able to demonstrate the system in use.
Many people have contributed to the ideas and software of such an extensive system. We are pleased to acknowledge in particular the help of many members of the AKT and ReSIST (IST Contract No: 026764) Projects, and many students, both postgraduate and undergraduate. Of the software described here, Marcus Cobden and Chris Franklin provided parts of the code that is currently in use.
1. N.R. Shadbolt, N. Gibbins, H. Glaser, S. Harris, and m.c. schraefel, “CS AKTive Space or how we stopped worrying and learned to love the Semantic Web”, IEEE Intelligent Systems, 19(3) pp. 41-47 (2004).
2. H. Glaser, I. Millard, T. Anderson, and B. Randell, ReSIST Project Deliverable D10: Prototype knowledge base, Technical Report, University of Southampton, http://eprints.ecs.soton.ac.uk/14304/ (2007).
3. A. Jaffri, H. Glaser, and I. Millard, “URI Disambiguation in the Context of Linked Data”, Linked Data on the Web, http://CEUR-WS.org/Vol-369/paper19.pdf (2008).