Consuming Linked Data

Consumption in this context is the ability to access and retrieve data. The consumption methods presented in this section are classified elsewhere as publishing methods, for example in [Rietveld, 2016]. This resembles the “chicken or the egg” problem: to make a given consumption method applicable, Linked Data providers usually have to follow certain procedures on top of the four principles, and they follow these procedures precisely so that their data can be “consumed” in these ways. In this thesis, they are treated as consumption mechanisms because their variations are more noticeable at this stage of the process.

Before introducing data integration in the next chapter, it is necessary to understand how to access each kind of data source separately; in the context of this thesis, these are semi-structured and Linked Data sources. This also helps explain the choices made in designing the approaches used in this thesis, along with their advantages and limitations. Therefore, this section presents some of the popular methods and technologies available to retrieve or query Linked Data sources.

2.4.1 Crawling Pattern

The crawling pattern is the most straightforward method: users download, parse and process the RDF files in order to obtain results. From a data provider's point of view, it amounts to hosting serialised RDF files of a Linked Dataset on the Web. Although this seems a simple way of publishing Linked Data, Rietveld [2016] stated that the majority of RDF files published through this mechanism fail to follow the standards and best practices of the Linked Data paradigm. He also listed common errors found in such files: incorrect HTTP headers, publication in a corrupt compressed archive, duplicate triples, and serialisation errors. The crawling pattern nevertheless has two advantages: it is relatively easy to implement on top of the downloaded RDF files, and it does not depend on the status or performance of any remote server.
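The consumer side of the crawling pattern can be sketched as follows. This is a minimal illustration, not the method used in the thesis: it assumes the downloaded file is in N-Triples, uses a deliberately simplified regular expression rather than a full parser, and the names `load_ntriples`, `match_pattern` and `SAMPLE` are hypothetical. It also counts two of the errors listed by Rietveld, duplicate triples and serialisation errors.

```python
import re

# Simplified N-Triples line pattern (an assumption for illustration; it does
# not cover the full N-Triples grammar): subject, predicate, object, full stop.
TRIPLE = re.compile(
    r'^\s*(<[^>]*>|_:\S+)\s+(<[^>]*>)\s+'
    r'(<[^>]*>|_:\S+|"(?:[^"\\]|\\.)*"(?:@[A-Za-z-]+|\^\^<[^>]*>)?)'
    r'\s*\.\s*$'
)

def load_ntriples(text):
    """Return (unique triples, duplicate count, numbers of lines that failed to parse)."""
    triples, duplicates, bad_lines = set(), 0, []
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip() or line.lstrip().startswith("#"):
            continue  # skip blank lines and comments
        match = TRIPLE.match(line)
        if match is None:
            bad_lines.append(number)  # serialisation error
            continue
        triple = match.groups()
        if triple in triples:
            duplicates += 1
        else:
            triples.add(triple)
    return triples, duplicates, bad_lines

def match_pattern(triples, s=None, p=None, o=None):
    """Answer a triple pattern locally; None acts as a wildcard."""
    return sorted(t for t in triples
                  if s in (None, t[0]) and p in (None, t[1]) and o in (None, t[2]))

# Hypothetical downloaded content: one duplicate and one malformed line.
SAMPLE = """<http://example.org/a> <http://example.org/knows> <http://example.org/b> .
<http://example.org/a> <http://example.org/knows> <http://example.org/b> .
<http://example.org/a> <http://example.org/name> "Alice"@en .
<http://example.org/b> <http://example.org/name "Bob" .
"""

triples, duplicates, bad_lines = load_ntriples(SAMPLE)
names = match_pattern(triples, p="<http://example.org/name>")
```

Once the file is loaded, every lookup runs locally, which is why this pattern is independent of the availability of a remote server.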

2.4.2 On-The-Fly Dereferencing

The on-the-fly dereferencing pattern is comparable to the functioning of the Web of documents. It conceptualises the Web as a graph of documents made up of dereferenceable URIs [Göçebe et al., 2015]. As introduced in the discussion of dereferenceable URIs (see Section 2.2.4), dereferencing means that a description of the resource identified by a URI is retrieved with an HTTP GET request in a machine-readable format (an RDF file, for example) and optionally in a human-friendly format.

In this pattern, a query is executed by dereferencing a URI in order to access the RDF file, then following the URI links found by parsing the received file on the fly [Hartig et al., 2009]. This can be relatively easy and fast for the server to process when it serves subject pages (a simple index lookup) [Verborgh et al., 2014b]. It can, however, be complex and slow when thousands of URIs have to be dereferenced in the background [Göçebe et al., 2015; Heath and Bizer, 2011]. This pattern is implemented by Linked Data browsers such as Marbles.
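The dereference-then-follow loop described above can be sketched as a breadth-first traversal. This is an illustrative sketch under stated assumptions: `fetch` is a hypothetical caller-supplied function that dereferences a URI and returns the triples of the retrieved document, the `text/turtle` media type is only one possible choice, and following URIs in subject and object position (but not predicates) is a simplification. The cap `max_uris` reflects the cost of dereferencing many URIs in the background.

```python
from collections import deque
from urllib.request import Request

def build_dereference_request(uri, accept="text/turtle"):
    """Prepare (but do not send) an HTTP GET that asks for an RDF serialisation."""
    return Request(uri, headers={"Accept": accept}, method="GET")

def traverse(seed_uri, fetch, max_uris=100):
    """Breadth-first link traversal starting from seed_uri.

    fetch(uri) is assumed to return an iterable of (s, p, o) string tuples
    describing the document obtained by dereferencing uri.
    """
    dereferenced, queue, collected = set(), deque([seed_uri]), set()
    while queue and len(dereferenced) < max_uris:
        uri = queue.popleft()
        if uri in dereferenced:
            continue
        dereferenced.add(uri)
        for s, p, o in fetch(uri):
            collected.add((s, p, o))
            # Follow any HTTP(S) URI found in subject or object position.
            for term in (s, o):
                if term.startswith(("http://", "https://")) and term not in dereferenced:
                    queue.append(term)
    return collected, dereferenced

# A hypothetical in-memory stand-in for the Web, so the sketch runs offline.
WEB = {
    "http://example.org/a": [
        ("http://example.org/a", "http://example.org/knows", "http://example.org/b"),
    ],
    "http://example.org/b": [
        ("http://example.org/b", "http://example.org/name", "Bob"),
    ],
}

collected, visited = traverse("http://example.org/a", lambda uri: WEB.get(uri, []))
```

In a real client, `fetch` would send the request built by `build_dereference_request` and parse the response; that part is left out because it depends on the RDF library in use.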

2.4.3 Using SPARQL to query Linked Data

A SPARQL endpoint is a protocol service and a popular means of querying Linked Data sources. It enables users to query a knowledge base via the SPARQL language. A SPARQL endpoint is regarded as a machine-friendly interface, since it usually returns results in one or more machine-processable formats. Many triple stores offer a SPARQL interface, such as Jena TDB [Grobe, 2009] and Virtuoso [Erling and Mikhailov, 2010]. A human-readable presentation can also be implemented.

Although SPARQL endpoints are capable of expressing and running complex and federated queries, they are often criticised for their performance and availability. At least two scenarios reveal the limitations of SPARQL endpoints. The first is when a query asks for a considerable number of results, or when multiple queries are sent to the same data source. The second is when federated queries are run over multiple sources. Because SPARQL endpoints concentrate all query processing on the server side [Beek et al., 2016], the execution of queries can be slow [Heath and Bizer, 2011] or can be interrupted by restrictions that the data source sets to prevent the service from overloading or collapsing. This is confirmed by a study by Verborgh et al. [2014a], who estimated the average downtime of SPARQL endpoint servers at one and a half (1.5) days each month. Such limitations are one impetus for systems like Linked Data Fragments (LDF).
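How a client talks to a SPARQL endpoint can be sketched with the standard library alone. The sketch assumes the query-via-GET operation of the SPARQL Protocol and the SPARQL JSON results format; the endpoint URL, the sample payload and the function names are hypothetical, and the request is built but never sent so that the example runs offline.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

def build_sparql_request(endpoint, query):
    """Prepare a query-via-GET request that asks for JSON results.

    Assumes the endpoint URL has no query string of its own.
    """
    url = endpoint + "?" + urlencode({"query": query})
    return Request(url, headers={"Accept": "application/sparql-results+json"})

def parse_bindings(payload):
    """Flatten a SPARQL JSON results document into a list of {variable: value} rows."""
    document = json.loads(payload)
    variables = document["head"]["vars"]
    return [
        # A variable may be unbound in a given row, so it is skipped if absent.
        {var: row[var]["value"] for var in variables if var in row}
        for row in document["results"]["bindings"]
    ]

# Hypothetical endpoint and response, for illustration only.
request = build_sparql_request(
    "http://example.org/sparql",
    "SELECT ?s ?name WHERE { ?s <http://example.org/name> ?name } LIMIT 10",
)
SAMPLE_RESPONSE = json.dumps({
    "head": {"vars": ["s", "name"]},
    "results": {"bindings": [
        {"s": {"type": "uri", "value": "http://example.org/a"},
         "name": {"type": "literal", "value": "Alice"}},
        {"s": {"type": "uri", "value": "http://example.org/b"}},
    ]},
})
rows = parse_bindings(SAMPLE_RESPONSE)
```

The client only sends the query text and reads back bindings; all of the query evaluation happens on the server, which is the source of the load problems discussed above.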

2.4.4 Querying through Linked Data Fragments (LDF)

Although it can be argued that the LDF approach relates more to the publishing stage, its benefits are seen on the consuming side; hence, it is covered in this section. LDF is a publishing method “that allows efficient offloading of query execution from servers to clients through a lightweight partitioning strategy” [Verborgh et al., 2014b, p. 1]. It can be described as a compromise between the limited subject-based Linked Data dereferencing and the difficulty of scaling server-side SPARQL execution [Verborgh et al., 2014b].
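The offloading idea can be illustrated with a toy model. The sketch below assumes a triple-pattern style of fragment: the server only answers single triple patterns, one page at a time, and reports the total number of matches, while the client pages through the fragments and performs the join itself. The in-memory `DATA` list, the page size and all function names are hypothetical, and the nested-loop join is a simplification rather than the algorithm of any particular LDF client.

```python
PAGE_SIZE = 2  # illustrative page size, not taken from the source

def fragment(dataset, pattern, page=0):
    """Server side: one page of triples matching a single triple pattern.

    pattern is (s, p, o) with None as a wildcard. The total number of matches
    is returned as metadata alongside the page.
    """
    matches = [t for t in dataset
               if all(want is None or want == got for want, got in zip(pattern, t))]
    start = page * PAGE_SIZE
    return matches[start:start + PAGE_SIZE], len(matches)

def all_matches(dataset, pattern):
    """Client side: page through a fragment until every match is retrieved."""
    results, page = [], 0
    while True:
        triples, total = fragment(dataset, pattern, page)
        results.extend(triples)
        page += 1
        if len(results) >= total or not triples:
            return results

def join_on_object(dataset, first_pattern, second_predicate):
    """Client side: evaluate { first_pattern . ?o second_predicate ?x } by nested loops."""
    rows = []
    for s, _, o in all_matches(dataset, first_pattern):
        for _, _, x in all_matches(dataset, (o, second_predicate, None)):
            rows.append((s, o, x))
    return rows

# Hypothetical dataset held by the server.
DATA = [
    ("a", "knows", "b"),
    ("a", "knows", "c"),
    ("a", "knows", "d"),
    ("b", "name", "Bob"),
    ("c", "name", "Carol"),
]

first_page = fragment(DATA, ("a", "knows", None), page=0)
joined = join_on_object(DATA, ("a", "knows", None), "name")
```

The server's work per request stays close to an index lookup, while the cost of the join moves to the client, which typically issues more requests than it would against a SPARQL endpoint.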

2.4.5 Linked Search Engines

Linked Data search engines are not a method of consuming Linked Data, but rather a category of applications, generally built upon the crawling pattern, that facilitate to some extent the exploitation of data in this paradigm. They crawl RDF data on the Web and aggregate it. The retrieved data can be queried by following links or by keyword search. The results can be presented in various forms depending on the application. Examples include Swoogle [Ding et al., 2004] and Falcons [Cheng et al., 2008].
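Keyword search over aggregated triples can be sketched with a small inverted index. This is a generic illustration and makes no claim about how Swoogle or Falcons are implemented; it assumes that crawled triples are available as string tuples and that any object not starting with `http://` or `https://` is a literal. All names and data are hypothetical.

```python
import re
from collections import defaultdict

def build_index(triples):
    """Map each lower-cased word found in a literal object to the subjects it describes."""
    index = defaultdict(set)
    for s, _, o in triples:
        if o.startswith(("http://", "https://")):
            continue  # only literal values are indexed for keyword search
        for word in re.findall(r"\w+", o.lower()):
            index[word].add(s)
    return index

def search(index, keywords):
    """Return the subjects whose indexed literals contain every keyword."""
    sets = [index.get(word.lower(), set()) for word in keywords.split()]
    return set.intersection(*sets) if sets else set()

# Hypothetical triples aggregated by a crawler.
CRAWLED = [
    ("http://example.org/a", "http://example.org/name", "Alice Smith"),
    ("http://example.org/b", "http://example.org/name", "Bob Smith"),
    ("http://example.org/a", "http://example.org/knows", "http://example.org/b"),
]

index = build_index(CRAWLED)
hits = search(index, "alice smith")
```

From a hit such as `http://example.org/a`, an application can then follow the `knows` link to reach related resources, which corresponds to the link-based way of querying mentioned above.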

In document A new approach for interlinking and integrating semi-structured and linked data (Page 52-54)