Epimorphics Linked Data Publishing Platform
Epimorphics’ Services for G-Cloud
Version 1.2
15th December 2014
Authors: Andy Seaborne, Martin Merry
Contributors: Dave Reynolds
Review:
1 Overview
The Epimorphics Linked Data Publishing Platform is a resilient, scalable, cloud-based solution for publishing linked data.
It is widely used for publishing linked data on data.gov.uk, including data at environment.data.gov.uk, location.data.gov.uk, landregistry.data.gov.uk, and many others. We offer the platform as a fully hosted and managed service for publishing linked data; in addition, we can install the platform on a client’s own infrastructure. The prices quoted in this document assume we are providing a hosted service on top of Amazon Web Services. An instance of the platform runs on a cluster of dedicated machines for each client.
The platform includes:
A Linked Data API engine, providing access to the data in a number of developer-friendly formats as well as human-readable web pages
Customisable text search
A triple store for storing data as RDF
A fully SPARQL 1.1-compliant endpoint
A scale-out, fault tolerant runtime platform
An upload manager, to enable clients to load their own data
Optionally we can provide additional upload mechanisms which will integrate with clients’ existing workflows to support “business as usual” publication of linked data.
The platform is customisable and can also be used to host applications running on top of the data. We offer consultancy and application development services to support the development of such applications; see our G-Cloud services “Linked Data Modelling and Consultancy” and “Linked Data Application Development” for further details. We also offer training courses for people wishing to develop their own skills in linked data publishing – see our G-Cloud service “Linked Data Training”.
In addition to the full platform we also offer an entry-level system for people wishing to start linked data publication. The entry-level platform runs on a single dedicated machine, so it is neither fault-tolerant nor scalable, and it will need to be taken out of service during scheduled maintenance. It provides only limited management information.
2 Platform Details
The Epimorphics Linked Data Platform is used by the Environment Agency and Land Registry, as well as by commercial customers, for linked data publication.
It consists of:
A Linked Data API engine, provided by ELDA, Epimorphics’ implementation of the Linked Data API (LDA)
Text search, provided by Apache Solr 4
A fully compliant SPARQL 1.1 endpoint, provided by Apache Jena ARQ
A scale-out, fault-tolerant runtime platform, hosted in Amazon Web Services
An update controller for managing coordinated updates to replicated services
The platform architecture has three main tiers: load balancing and routing, application services, and storage.
Platform Architecture
Linked Data API Engine
The Linked Data API is a specification commissioned by the Cabinet Office and co-developed by Epimorphics to provide web developer-friendly access to linked data (http://code.google.com/p/linked-data-api/wiki/Specification). It enables developers to consume linked data in a variety of formats without having to learn the details of SPARQL and RDF.
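As a rough illustration of what this looks like to a web developer, the sketch below builds a request URL for a list endpoint. The `_page` and `_pageSize` parameters, the format suffix, and filtering by property short name follow the Linked Data API specification; the endpoint `http://example.org/api/bathing-waters`, the `label` filter and the helper name `lda_list_url` are hypothetical and not taken from any real deployment.

```python
from urllib.parse import urlencode, urlsplit, urlunsplit


def lda_list_url(base, fmt="json", page=0, page_size=10, **filters):
    """Build a Linked Data API list URL (illustrative helper, not part of ELDA)."""
    parts = urlsplit(base)
    # The requested format is selected with a suffix on the path, e.g. ".json".
    path = parts.path.rstrip("/") + "." + fmt
    # Reserved paging parameters start with an underscore.
    params = {"_page": page, "_pageSize": page_size}
    # Remaining parameters filter on property short names defined by the API.
    params.update(filters)
    return urlunsplit((parts.scheme, parts.netloc, path, urlencode(params), ""))


# Hypothetical endpoint and filter, for illustration only.
url = lda_list_url("http://example.org/api/bathing-waters",
                   page=1, page_size=25, label="Beach")
print(url)
```

The same resource could be requested in another format by changing only the suffix, which is the main convenience the API offers over writing SPARQL directly.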
Our platform uses ELDA, our own widely-used open source implementation of the Linked Data API. ELDA can also combine text search as an additional facility in defining web-developer APIs to access the data. ELDA optionally uses SPARQL 1.1 (particularly sub-queries) in order to improve
Text search
The text search indexing is provided by Apache Solr. It can be accessed via the Linked Data API or directly within SPARQL queries.
The indexed data model is based on the conceptual entities within the data, rather than raw indexing of triples.
SPARQL 1.1 Engine
Our platform is based on Apache Jena, including TDB and Fuseki. This includes the ARQ query engine, which passes the complete SPARQL 1.1 test suite for query, update and protocol.
In addition, the engine is capable of combining free text search with SPARQL queries.
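A minimal sketch of such a combined query is given below. It assumes the text index is exposed through Apache Jena's `text:query` property function; this document does not state which vocabulary the platform uses, so treat the prefix, the indexed property `rdfs:label` and the helper name `text_search_query` as assumptions.

```python
def text_search_query(term, limit=10):
    """Return a SPARQL query mixing a free-text match with a graph pattern.

    Assumes Jena's text index vocabulary (text:query); the platform's actual
    configuration may differ.
    """
    # Escape characters that would break out of a SPARQL string literal.
    escaped = (term.replace("\\", "\\\\")
                   .replace('"', '\\"')
                   .replace("\n", "\\n"))
    return f'''PREFIX text: <http://jena.apache.org/text#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?entity ?label WHERE {{
  ?entity text:query (rdfs:label "{escaped}" {int(limit)}) .
  ?entity rdfs:label ?label .
}}
LIMIT {int(limit)}'''


print(text_search_query('river "Thames"'))
```

The first triple pattern asks the text index for matching entities; the second is ordinary SPARQL that retrieves their labels from the triple store.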
Runtime Platform
The runtime platform can be deployed within a number of different cloud service providers, as well as on a client’s own infrastructure. In this document we assume that the deployment will be within AWS.
It achieves scalability and fault-tolerance by having a number of identical replicas across different AWS availability zones. Data is kept within the EU for data protection jurisdiction.
Each replica combines the application services with a local copy of the SPARQL database; the Solr text index is replicated separately.
An Amazon load balancer tracks active nodes and routes traffic based on current load and availability of service machines. The number of service machines is adjustable to meet the expected load on the system and the desired responsiveness within the available budget.
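The routing idea can be sketched as follows. This is not Amazon's algorithm, which is not described here; it is a simple "least busy healthy replica" rule chosen for illustration, and the `Replica` fields and zone names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    name: str
    zone: str              # availability zone the replica runs in
    healthy: bool          # result of the most recent health check
    active_requests: int   # requests currently being served


def choose_replica(replicas):
    """Pick the healthy replica with the fewest requests in flight."""
    candidates = [r for r in replicas if r.healthy]
    if not candidates:
        raise RuntimeError("no healthy replicas available")
    return min(candidates, key=lambda r: r.active_requests)


pool = [
    Replica("a", "eu-west-1a", healthy=True, active_requests=7),
    Replica("b", "eu-west-1b", healthy=True, active_requests=2),
    Replica("c", "eu-west-1c", healthy=False, active_requests=0),
]
print(choose_replica(pool).name)
```

Replica "c" is skipped despite being idle because it failed its health check, which is how a replicated deployment keeps serving requests when one machine or zone is lost.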
The ELDA and SPARQL services reside on the same machine because the ELDA implementation uses the triple store for all its data. The text search may have different scalability requirements and is scaled independently of the triple store.
The platform logs all incoming requests, including originating IP address, enabling clients to understand and mine the log information to determine usage patterns as desired.
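As an example of the kind of analysis this enables, the sketch below counts requests per client address and per path. It assumes the logs are in the Apache common or combined log format; the actual log format supplied to clients is not specified in this document, and the sample lines are invented.

```python
import re
from collections import Counter

# Matches the leading fields of the Apache common/combined log format (assumed).
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) \S+'
)


def summarise(lines):
    """Return (requests per IP address, requests per path without query string)."""
    by_ip, by_path = Counter(), Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue  # skip lines that do not fit the assumed format
        by_ip[match["ip"]] += 1
        by_path[match["path"].split("?", 1)[0]] += 1
    return by_ip, by_path


# Invented sample data using documentation-reserved IP addresses.
sample = [
    '192.0.2.1 - - [15/Dec/2014:10:00:00 +0000] "GET /doc/site/1.json HTTP/1.1" 200 512',
    '192.0.2.1 - - [15/Dec/2014:10:00:05 +0000] "GET /sparql?query=ASK%7B%7D HTTP/1.1" 200 64',
    '198.51.100.7 - - [15/Dec/2014:10:01:00 +0000] "GET /doc/site/1.json HTTP/1.1" 200 512',
]
by_ip, by_path = summarise(sample)
print(by_ip.most_common(1), by_path.most_common(1))
```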
Deployment View
Update Controller
Changes to the published data are performed by a secured controller. The controller is responsible for determining the necessary changes to the replicated triple store and the replicated text index. The controller can be used both through a user interface and by scripted processes.
The controller also provides SPARQL Update for management of the triple stores, such as corrections to published data. The public interface exposed to the data consumer does not include the SPARQL Update service, which is only available via the secured controller.
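The division of responsibility can be illustrated with a small in-memory model. The controller's real protocol is not described in this document, so everything here is an assumption made for illustration: the class names, the API-key check, and the rule of refusing an update unless every replica is reachable.

```python
class Replica:
    """Stand-in for one replicated triple store (hypothetical)."""

    def __init__(self, name):
        self.name = name
        self.online = True
        self.applied = []  # SPARQL Update requests applied so far

    def apply(self, update):
        self.applied.append(update)


class UpdateController:
    """Accepts authorised updates and applies them to every replica."""

    def __init__(self, replicas, api_keys):
        self.replicas = list(replicas)
        self.api_keys = set(api_keys)

    def submit(self, api_key, update):
        # Only authenticated callers may change published data.
        if api_key not in self.api_keys:
            raise PermissionError("caller is not authorised to update data")
        offline = [r.name for r in self.replicas if not r.online]
        if offline:
            # Refuse the whole update rather than let replicas diverge.
            raise RuntimeError("replicas unavailable: " + ", ".join(offline))
        for replica in self.replicas:
            replica.apply(update)
        return len(self.replicas)


replicas = [Replica("a"), Replica("b")]
controller = UpdateController(replicas, {"secret-key"})
update = 'INSERT DATA { <http://example.org/s> <http://example.org/p> "o" }'
print(controller.submit("secret-key", update))
```

Public consumers never hold a key, so in this model they can only read; all writes pass through the one component that knows about every replica.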
Entry level platform
For the entry level system the runtime platform is limited to a single dedicated machine (there is no replication and no load balancing). There is no direct access to the logs of incoming requests. Apart from this the details are the same as those described under “Runtime platform” above.
3 Service Details
As a hosted service, our platform is accredited to store and process IL0 information only.
All data loaded onto the platform is backed up at the time the data is loaded, so the backup is always an accurate reflection of the data in the system. The replicated nature of the platform means that a hardware failure will not cause data from the running system to be lost. In the event of a catastrophic infrastructure failure which takes out all the replicated instances, the data will be restored from backup as quickly as possible.
On-boarding: if no customisation of the web interfaces etc. is required, then we will provide the client access to the upload manager so that they are able to have data loaded onto and published by the platform within 5 business days after contracts have been signed. We can provide expedited on-boarding at extra cost if desired.
Off-boarding: no user data is collected by the system – the only data stored on the publishing platform is data supplied by the client. On termination of the contract all client data will be securely deleted. During the life of the contract clients can request access to a copy of all the data stored on the system.
As the system is fully replicated, routine maintenance can be carried out without taking the system off-line; there is no need for scheduled maintenance windows during which the system is out of service. We aim for the availability of the system to be 100%. Details of our support services are given in the next section.
We do not offer a trial service, though we do offer an entry level offering for fewer than 10M triples – see our separate pricing document for details.
4 Support
Our hosting support for the full system includes all regular maintenance, monitoring and backups. We will provide reports to the clients on the usage of the system – the precise details of the data reported will be agreed with the client during the setup phase. We also provide an incident reporting service. The basic service is available during normal business hours (09.00 – 17.30, Mondays – Fridays, excluding public holidays). We provide an email address for incident reporting and will respond to any notification within 4 hours. If an incident results in loss of service we will restore the service within 1 business day; in all other cases we will use reasonable efforts to resolve the incident as quickly as possible.
Additional support options are available at extra cost, including telephone support and faster response times. For such additional support services we offer service credits in the event of failing to meet targets.
We note that the replicated nature of our architecture is such that we do not need to take the system down in order to perform regular maintenance and system updates. The production system we run for the Environment Agency went live in April 2012 and since then has been available for 100% of the time.
For the entry-level system, running on a single dedicated machine, we will still provide an email address for incident reporting and will respond to any notification within 4 hours; however, if an incident results in a loss of service we will use reasonable efforts to restore the service as quickly as possible, but will not offer a guarantee that we will restore the service within 1 business day.
5 Use of Open Source Software
Our platform is based on open source software, notably
Apache Jena, including ARQ, TDB and Fuseki
Apache Web Server
Apache Solr
ELDA, Epimorphics’ open source implementation of the Linked Data API
Apache Tomcat
Apache Lucene
Linked data is crucially dependent on the correct implementation of the relevant open standards. Our platform is fully compliant with all the relevant standards, notably
RDF syntaxes: RDF/XML, Turtle, N-Triples
RDF 1.1 Turtle
SPARQL 1.1 Query
SPARQL 1.1 result set formats (XML, JSON, CSV, TSV)
SPARQL 1.1 Update
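Because these formats are standardised, client code does not depend on any particular endpoint. As a short illustration, the sketch below reads a result set in the SPARQL 1.1 Query Results JSON Format; the sample document and the helper name `rows` are made up for the example.

```python
import json


def rows(result_json):
    """Yield one dict per solution from a SPARQL JSON result set."""
    doc = json.loads(result_json)
    variables = doc["head"]["vars"]
    for binding in doc["results"]["bindings"]:
        # A variable may be unbound in a given solution, so it can be absent.
        yield {v: binding[v]["value"] for v in variables if v in binding}


# Invented sample in the standard JSON results layout.
sample_results = json.dumps({
    "head": {"vars": ["entity", "label"]},
    "results": {"bindings": [
        {"entity": {"type": "uri", "value": "http://example.org/id/1"},
         "label": {"type": "literal", "value": "First"}},
        {"entity": {"type": "uri", "value": "http://example.org/id/2"}},
    ]},
})
parsed = list(rows(sample_results))
print(parsed)
```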