Epimorphics Linked Data Publishing Platform

Epimorphics’ Services for G-Cloud

Version 1.2

15th December 2014

Authors: Andy Seaborne, Martin Merry

Contributors: Dave Reynolds

Review:

1 Overview

The Epimorphics Linked Data Publishing Platform is a resilient, scalable, cloud-based solution for publishing linked data.

It is widely used for publishing linked data on data.gov.uk, including data at environment.data.gov.uk, location.data.gov.uk, landregistry.data.gov.uk, and many others. We offer the platform as a fully hosted and managed service for publishing linked data; in addition we can install the platform on a client’s own infrastructure. The prices quoted in this document assume we are providing a hosted service on top of Amazon Web Services. An instance of the platform runs on a cluster of dedicated machines for each client.

The platform includes:

• A Linked Data API engine, providing access to the data in a number of developer-friendly formats as well as human-readable web pages

• Customisable text search

• A triple store for storing data as RDF

• A fully SPARQL 1.1-compliant endpoint

• A scale-out, fault-tolerant runtime platform

• An upload manager, to enable clients to load their own data

Optionally we can provide additional upload mechanisms which will integrate with clients’ existing workflows to support “business as usual” publication of linked data.

The platform is customisable and can also be used to host applications running on top of the data. We offer consultancy and application development services to support the development of such applications; see our G-Cloud services “Linked Data Modelling and Consultancy” and “Linked Data Application Development” for further details. We also offer training courses for people wishing to develop their own skills in linked data publishing; see our G-Cloud service “Linked Data Training”.

In addition to the full platform we also offer an entry-level system for people wishing to start linked data publication. The entry-level platform runs on a single dedicated machine, so it is neither fault-tolerant nor scalable, and it will need to be taken out of service during scheduled maintenance. It provides only limited management information.

2 Platform Details

The Epimorphics Linked Data Platform is used by the Environment Agency and Land Registry, as well as by commercial customers, for linked data publication.

It consists of:

• A Linked Data API engine, provided by ELDA, Epimorphics’ implementation of the Linked Data API (LDA)

• Text search, provided by Apache Solr 4

• A fully compliant SPARQL 1.1 endpoint, provided by Apache Jena ARQ

• A scale-out, fault-tolerant runtime platform, hosted in Amazon Web Services

• An update controller for managing coordinated updates to replicated services

The platform architecture has three main tiers: load balancing and routing, application services, and storage.

[Figure: Platform Architecture]

Linked Data API Engine

The Linked Data API is a specification commissioned by the Cabinet Office and co-developed by Epimorphics to provide web-developer-friendly access to linked data (http://code.google.com/p/linked-data-api/wiki/Specification). It enables developers to consume linked data in a variety of formats without having to learn the details of SPARQL and RDF.
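As an illustration of the kind of request a web developer might make, the following Python sketch builds a Linked Data API style URL. The base URL, the `stations` path and the `label` filter are hypothetical and chosen only for illustration; `_page` and `_pageSize` are the paging parameters described in the Linked Data API specification, and the `.json` suffix selects the output format. It is a sketch under those assumptions, not a description of any particular deployment.

```python
from urllib.parse import urlencode


def lda_request_url(base, path, fmt="json", page=0, page_size=10, **filters):
    """Build a Linked Data API style request URL.

    base, path and any filters are hypothetical examples; _page and
    _pageSize are the paging parameters from the LDA specification.
    """
    params = {"_page": page, "_pageSize": page_size}
    params.update(filters)  # simple property=value filters
    return f"{base.rstrip('/')}/{path.strip('/')}.{fmt}?{urlencode(params)}"


# Hypothetical endpoint: first page of five items whose label is "Thames".
url = lda_request_url("http://example.org/api", "stations", page_size=5, label="Thames")
print(url)
```

The caller never writes SPARQL; the API engine is responsible for translating the path and parameters into a query against the triple store.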

Our platform uses ELDA, our own widely-used open source implementation of the Linked Data API. ELDA can also combine text search as an additional facility in defining web-developer APIs to access the data. ELDA optionally uses SPARQL 1.1 (particularly sub-queries) in order to improve

Text search

The text search indexing is provided by Apache Solr. This can be accessed via the Linked Data API or directly within SPARQL queries.

The indexed data model is based on the conceptual entities within the data, rather than raw indexing of triples.

SPARQL 1.1 Engine

Our platform is based on Apache Jena, including TDB and Fuseki. This includes the ARQ query engine, which passes the complete SPARQL 1.1 test suite for query, update and protocol.

In addition, the engine is capable of combining free text search with SPARQL queries.
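A query that combines free text search with a graph pattern might look like the one built below. This sketch assumes the text index is exposed through the `text:query` property function of the Apache Jena text extension; the source does not state which mechanism the platform uses, and the endpoint URL and the choice of `rdfs:label` as the indexed property are hypothetical.

```python
from urllib.parse import urlencode


def text_search_query(term, limit=10):
    """Return a SPARQL query that uses text:query (assumed mechanism)."""
    # Escape backslashes and double quotes for a SPARQL string literal.
    escaped = term.replace("\\", "\\\\").replace('"', '\\"')
    return (
        "PREFIX text: <http://jena.apache.org/text#>\n"
        "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"
        "SELECT ?item ?label WHERE {\n"
        f'  ?item text:query (rdfs:label "{escaped}") ;\n'
        "        rdfs:label ?label .\n"
        "}\n"
        f"LIMIT {int(limit)}"
    )


def sparql_get_url(endpoint, query):
    """SPARQL 1.1 Protocol: a query sent via GET uses the 'query' parameter."""
    return endpoint + "?" + urlencode({"query": query})


query = text_search_query("flood warning")
print(sparql_get_url("http://example.org/sparql", query))  # hypothetical endpoint
```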

Runtime Platform

The runtime platform can be deployed within a number of different cloud service providers, as well as on a client’s own infrastructure. In this document we assume that the deployment will be within AWS.

It achieves scalability and fault-tolerance by having a number of identical replicas across different AWS availability zones. Data is kept within the EU for data protection jurisdiction.

Each replica combines the application services with a local copy of the SPARQL database; the Solr text index is replicated separately.

An Amazon load balancer tracks active nodes and routes traffic based on current load and availability of service machines. The number of service machines is adjustable to meet the expected load on the system and desired responsiveness within the available budget.
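The routing idea described above can be sketched in a few lines of Python. This is only an illustration of load- and availability-based routing, not the algorithm Amazon's load balancer uses; the `Replica` fields, the replica names and the zone names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    name: str
    zone: str             # availability zone (illustrative labels)
    healthy: bool         # result of the most recent health check
    active_requests: int  # requests currently in flight


def choose_replica(replicas):
    """Pick the healthy replica with the fewest in-flight requests."""
    candidates = [r for r in replicas if r.healthy]
    if not candidates:
        raise RuntimeError("no healthy replicas available")
    return min(candidates, key=lambda r: r.active_requests)


replicas = [
    Replica("a", "zone-1", healthy=True, active_requests=12),
    Replica("b", "zone-2", healthy=True, active_requests=3),
    Replica("c", "zone-3", healthy=False, active_requests=0),  # failed node is skipped
]
print(choose_replica(replicas).name)
```

Adding a replica to the list increases capacity, and a replica that fails its health check simply stops receiving traffic, which is the behaviour the replicated design relies on.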

The ELDA and SPARQL services reside on the same machine because the ELDA implementation uses the triple store for all its data. The text search may have different scalability requirements and is scaled independently of the triple store.

The platform logs all incoming requests, including originating IP address, enabling clients to understand and mine the log information to determine usage patterns as desired.
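As one example of mining such logs, the Python sketch below counts requests per originating IP address and per requested path. It assumes the log lines follow the Common Log Format; the actual log format of the platform is not specified in this document, and the sample lines are invented for illustration.

```python
import re
from collections import Counter

# Assumed Common Log Format: ip ident user [time] "METHOD path PROTOCOL" status size
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3})'
)


def summarise(lines):
    """Count requests per client IP and per path (query string removed)."""
    by_ip, by_path = Counter(), Counter()
    for line in lines:
        m = LOG_LINE.match(line)
        if m is None:
            continue  # skip lines that do not match the assumed format
        by_ip[m["ip"]] += 1
        by_path[m["path"].split("?", 1)[0]] += 1
    return by_ip, by_path


sample = [
    '192.0.2.1 - - [15/Dec/2014:10:00:00 +0000] "GET /doc/stations.json?_page=0 HTTP/1.1" 200 512',
    '192.0.2.1 - - [15/Dec/2014:10:00:05 +0000] "GET /doc/stations.json?_page=1 HTTP/1.1" 200 498',
    '198.51.100.7 - - [15/Dec/2014:10:01:00 +0000] "GET /sparql?query=ASK HTTP/1.1" 200 40',
]
by_ip, by_path = summarise(sample)
print(by_ip.most_common(1), by_path.most_common(1))
```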

[Figure: Deployment View]

Update Controller

Changes to the published data are performed by a secured controller. The controller is responsible for determining the necessary changes to the replicated triple store and replicated text index. The controller can be used both through a user interface and by scripted processes.

The controller also provides SPARQL Update for management of the triple stores, such as corrections to published data. The public interface exposed to the data consumer does not include the SPARQL Update service, which is only available via the secured controller.
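A correction of the kind mentioned above could be expressed as a SPARQL 1.1 Update request. The Python sketch below builds the update text and the form-encoded request body defined by the SPARQL 1.1 Protocol (the `update` parameter); the subject URI and the values are hypothetical, and how the secured controller authenticates and distributes the request to the replicas is not covered here.

```python
from urllib.parse import urlencode


def literal(text):
    """Quote a single-line string as a SPARQL literal (escapes \\ and ")."""
    return '"' + text.replace("\\", "\\\\").replace('"', '\\"') + '"'


def correction_update(subject, predicate, old_value, new_value):
    """Build a SPARQL 1.1 Update that swaps one literal value for another."""
    return (
        f"DELETE DATA {{ <{subject}> <{predicate}> {literal(old_value)} }} ;\n"
        f"INSERT DATA {{ <{subject}> <{predicate}> {literal(new_value)} }}"
    )


def update_form_body(update):
    """SPARQL 1.1 Protocol: a URL-encoded POST carries the 'update' parameter."""
    return urlencode({"update": update}).encode("utf-8")


update = correction_update(
    "http://example.org/id/station/1",             # hypothetical resource
    "http://www.w3.org/2000/01/rdf-schema#label",
    "Tames",                                        # misspelt published value
    "Thames",
)
body = update_form_body(update)  # send with Content-Type: application/x-www-form-urlencoded
print(update)
```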

Entry-level platform

For the entry-level system the runtime platform is limited to a single dedicated machine (there is no replication and no load balancing). There is no direct access to the logs of incoming requests. Apart from this, the details are the same as those described under “Runtime Platform” above.

3 Service Details

As a hosted service, our platform is accredited to store and process IL0 information only.

All data loaded onto the platform is backed up at the time the data is loaded, so the backup is always an accurate reflection of the data in the system. The replicated nature of the platform means that a hardware failure will not cause data from the running system to be lost. In the event of a catastrophic infrastructure failure that takes out all the replicated instances, the data will be restored from backup as quickly as possible.

On-boarding: if no customisation of the web interfaces etc. is required, then we will provide the client access to the upload manager so that they are able to have data loaded onto and published by the platform within 5 business days after contracts have been signed. We can provide expedited on-boarding at extra cost if desired.

Off-boarding: no user data is collected by the system – the only data stored on the publishing platform is data supplied by the client. On termination of the contract all client data will be securely deleted. During the life of the contract clients can request access to a copy of all the data stored on the system.

As the system is fully replicated, routine maintenance can be carried out without taking the system off-line; there is no need for scheduled maintenance windows when the system is out of service. We aim for the availability of the system to be 100%. Details of our support services are given in the next section.

We do not offer a trial service, though we do offer an entry level offering for fewer than 10M triples – see our separate pricing document for details.

4 Support

Our hosting support for the full system includes all regular maintenance, monitoring and backups. We will provide reports to the clients on the usage of the system; the precise details of the data reported will be agreed with the client during the setup phase. We also provide an incident reporting service. The basic service is available during normal business hours (09:00 to 17:30, Monday to Friday, excluding public holidays). We provide an email address for incident reporting and will respond to any notification within 4 hours. If an incident results in loss of service we will restore the service within 1 business day; in all other cases we will use reasonable efforts to resolve the incident as quickly as possible.

Additional support options are available at extra cost, including telephone support and faster response times. For such additional support services we offer service credits in the event of failing to meet targets.

We note that the replicated nature of our architecture is such that we do not need to take the system down in order to perform regular maintenance and system updates. The production system we run for the Environment Agency went live in April 2012 and since then has been available for 100% of the time.

For the entry-level system, running on a single dedicated machine, we will still provide an email address for incident reporting and will respond to any notification within 4 hours; however, if an incident results in a loss of service we will use reasonable efforts to restore the service as quickly as possible, but will not offer a guarantee that we will restore the service within 1 business day.

5 Use of Open Source Software

Our platform is based on open source software, notably:

• Apache Jena, including ARQ, TDB and Fuseki

• Apache Web Server

• Apache Solr

• ELDA, Epimorphics’ open source implementation of the Linked Data API

• Apache Tomcat

• Apache Lucene

Linked data is crucially dependent on the correct implementation of the relevant open standards. Our platform is fully compliant with all the relevant standards, notably

• RDF syntaxes: RDF/XML, Turtle, N-Triples

• RDF 1.1 Turtle

• SPARQL 1.1 Query

• SPARQL 1.1 result set formats (XML, JSON, CSV, TSV)

• SPARQL 1.1 Update
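Because the result set formats are standardised, client code does not depend on the server implementation. As a small illustration, the Python sketch below reads a result set in the SPARQL 1.1 Query Results JSON Format; the sample document and its URIs are invented for the example.

```python
import json


def rows(result_json):
    """Yield one dict per solution, mapping each variable to its value.

    Variables left unbound in a solution are reported as None.
    """
    doc = json.loads(result_json)
    variables = doc["head"]["vars"]
    for binding in doc["results"]["bindings"]:
        yield {v: binding[v]["value"] if v in binding else None for v in variables}


# Invented sample in the standard JSON results layout.
SAMPLE = """
{
  "head": {"vars": ["item", "label"]},
  "results": {"bindings": [
    {"item": {"type": "uri", "value": "http://example.org/id/1"},
     "label": {"type": "literal", "value": "Thames"}},
    {"item": {"type": "uri", "value": "http://example.org/id/2"}}
  ]}
}
"""

for row in rows(SAMPLE):
    print(row["item"], row["label"])
```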