Data Source Description and Source Selection Approaches

2.3 Federated Query Processing Systems

3.2.1 Data Source Description and Source Selection Approaches

Federated query processing engines employ a catalog of data source descriptions to select the data sources relevant to a given query. Such descriptions can be extracted at query time or computed beforehand. FedX [52] does not require a precomputed catalog of source descriptions; instead, it sends triple pattern-wise ASK queries to the data sources at query time. A triple pattern-wise ASK query is a SPARQL ASK query whose graph pattern contains exactly one triple pattern of the given query. Lusail [53], like FedX, uses an on-the-fly catalog solution for source selection and decomposition. Unlike FedX, Lusail takes an additional step to check whether pairs of triple patterns can be evaluated as one subquery over a specific endpoint; it exploits this knowledge during query decomposition and optimization. However, posting too many SPARQL ASK queries can burden data sources that have limited compute resources and may result in denial of service (DoS).
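To make the idea of triple pattern-wise ASK source selection concrete, the following is a minimal sketch and not the implementation of FedX or Lusail. The function names, the endpoint URLs, the in-memory stand-in for remote endpoints, and the small request cache are all assumptions introduced for illustration.

```python
from typing import Callable, Dict, List, Tuple

TriplePattern = Tuple[str, str, str]  # (subject, predicate, object)


def ask_query(tp: TriplePattern) -> str:
    """Build a SPARQL ASK query whose graph pattern is a single triple pattern."""
    s, p, o = tp
    return f"ASK {{ {s} {p} {o} }}"


def select_sources(
    patterns: List[TriplePattern],
    endpoints: List[str],
    ask: Callable[[str, str], bool],
) -> Dict[TriplePattern, List[str]]:
    """Map each triple pattern to the endpoints whose ASK answer is true."""
    # Illustrative cache: avoids re-sending an identical request to an endpoint.
    cache: Dict[Tuple[str, str], bool] = {}
    selection: Dict[TriplePattern, List[str]] = {}
    for tp in patterns:
        query = ask_query(tp)
        relevant = []
        for endpoint in endpoints:
            key = (endpoint, query)
            if key not in cache:
                cache[key] = ask(endpoint, query)
            if cache[key]:
                relevant.append(endpoint)
        selection[tp] = relevant
    return selection


# Hypothetical in-memory stand-in for remote endpoints: URL -> predicates held.
FAKE_ENDPOINTS = {
    "http://a.example/sparql": {"foaf:name", "foaf:knows"},
    "http://b.example/sparql": {"foaf:knows", "dc:title"},
}


def fake_ask(endpoint: str, query: str) -> bool:
    """Pretend to evaluate the ASK query: true if the endpoint holds the predicate."""
    predicate = query.split()[3]  # tokens are: ASK { s p o }
    return predicate in FAKE_ENDPOINTS[endpoint]


if __name__ == "__main__":
    bgp = [("?s", "foaf:name", "?n"), ("?s", "foaf:knows", "?f")]
    result = select_sources(bgp, list(FAKE_ENDPOINTS), fake_ask)
    for tp, sources in result.items():
        print(tp, "->", sources)
```

Note that the number of requests grows with the product of triple patterns and endpoints, which is the load concern raised above. In a real deployment, `ask` would issue an HTTP request to the endpoint instead of consulting a dictionary.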

A pre-computed catalog of data source descriptions can be used to reduce the number of requests sent to the data sources. ANAPSID [34] is a federated query processing engine that employs a hybrid solution: it collects, for each data source, the list of RDF predicates of the triple patterns that the source can answer, and sends ASK queries when required at query time. During source selection, ANAPSID parses the SPARQL query into star-shaped subqueries and identifies the SPARQL endpoints for each subquery using the predicate lists computed beforehand. For triple patterns that are not found in the catalog, ANAPSID sends SPARQL ASK queries to the data sources to check whether any source in the federation can answer them. Similarly, HiBISCuS [55], a source selection approach, uses a hybrid solution that combines service descriptions computed beforehand with triple pattern-wise ASK queries. The data source descriptions of HiBISCuS include additional information on the subject and object values of predicates, based on the authority fragments of the URIs gathered for each endpoint. HiBISCuS discards sources that are irrelevant to a particular query by modeling SPARQL queries as hypergraphs.
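The hybrid strategy described above (consult a predicate catalog first, fall back to ASK requests only on a catalog miss) can be sketched as follows. This is an illustrative simplification and not ANAPSID's actual code; `hybrid_select`, `CATALOG`, `stub_ask`, and the endpoint URLs are hypothetical names.

```python
from typing import Callable, Dict, List, Set


def hybrid_select(
    predicate: str,
    catalog: Dict[str, Set[str]],
    ask: Callable[[str, str], bool],
) -> List[str]:
    """Select endpoints for a triple pattern, identified here by its predicate."""
    if predicate.startswith("?"):
        # A variable predicate cannot be resolved from a predicate list,
        # so every source stays a candidate in this sketch.
        return sorted(catalog)
    hits = sorted(ep for ep, preds in catalog.items() if predicate in preds)
    if hits:
        return hits  # answered from the precomputed catalog, no request sent
    # Catalog miss: fall back to asking every endpoint at query time.
    return sorted(ep for ep in catalog if ask(ep, predicate))


# Hypothetical precomputed catalog: endpoint URL -> predicates it can answer.
CATALOG = {
    "http://a.example/sparql": {"foaf:name"},
    "http://b.example/sparql": {"dc:title"},
}

asked = []  # records the requests that the fallback would have sent


def stub_ask(endpoint: str, predicate: str) -> bool:
    """Stand-in for a remote ASK request (assumption for the example)."""
    asked.append((endpoint, predicate))
    return endpoint == "http://b.example/sparql" and predicate == "rdfs:label"


if __name__ == "__main__":
    print(hybrid_select("foaf:name", CATALOG, stub_ask))
    print(hybrid_select("rdfs:label", CATALOG, stub_ask))
    print("requests sent:", len(asked))
```

The design choice to illustrate is that requests are only issued for predicates missing from the catalog, so a complete catalog results in no query-time requests at all.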

Some federated query processing engines use publicly available dataset metadata as their catalog of source descriptions. SPLENDID [51] relies on instance-level metadata available as Vocabulary of Interlinked Datasets (VoID) [56] descriptions of the sources in a federation. SPLENDID provides a hybrid solution that combines VoID descriptions for data source selection with SPARQL ASK queries submitted to each dataset at run time for verification. Statistical information for each predicate and type in the dataset is organized as inverted indices, which are used for data source selection and join order optimization. Similarly, Semagrow [49] implements a hybrid, triple pattern-wise source selection method that uses VoID descriptions (if available) and SPARQL ASK queries. Avalanche [48] is a federated query engine that also identifies relevant sources and plans the query based on online statistical information published as VoID descriptions. Although VoID allows dataset statistics to be described, this description is limited and lacks details necessary for efficient

Approach    Catalog                    ASK    Privacy-aware
FedX        None                       Yes    No
ANAPSID     Predicates list            Yes    No
SPLENDID    VoID Desc.                 Yes    No
Lusail      None                       Yes    No
Semagrow    VoID Desc.                 Yes    No
HiBISCuS    Service Desc. + URI auth   Yes    No
Avalanche   VoID Desc.                 No     No
Odyssey     FCP                        No     No
DAW         Service Desc. + MIPs       Yes    No
DARQ        Service Desc.              No     No
FEDRA       FEDRA index                No     No
SAFE        Data Cubes                 Yes    Yes

Table 3.1: Overview of data source description approaches supported by state-of-the-art federated query processing engines. While some federated query processing engines rely only on ASK queries sent at query time to check whether triple pattern(s) can be answered, most of them use this method only when a description related to a triple pattern cannot be found in their catalog. Only one federated query processing engine, SAFE, supports privacy and access control specifications in its descriptions.

query optimization. For instance, although VoID descriptions indicate the existence of links between datasets via a linking property, it is not clear to which class(es) this property belongs. In addition, VoID descriptions can become out of date if the dataset is updated very frequently. Odyssey [50] collects detailed statistics on datasets that enable cost estimation, which may lead to low-cost execution plans. The optimization is based on a cost model that uses statistical methods developed for centralized triple stores, i.e., Characteristic Sets (CSs) [57] and Characteristic Pairs (CPs) [57, 58]. Odyssey identifies CSs and sources using the predicates of each star-shaped subquery. Then, it prunes non-relevant sources based on links between star-shaped subqueries, by finding Federated Characteristic Pairs (FCPs). However, unexpected changes and misestimated statistics may lead to poor query performance.
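As an illustration of the statistics mentioned above, the sketch below computes characteristic sets over a small list of triples and uses them to count the distinct subjects that match a star-shaped subquery. It follows the general idea of characteristic sets (the set of predicates emitted by a subject) and is not Odyssey's implementation; the function names and the sample data are assumptions.

```python
from collections import Counter, defaultdict
from typing import FrozenSet, Iterable, List, Tuple

Triple = Tuple[str, str, str]


def characteristic_sets(triples: Iterable[Triple]) -> "Counter[FrozenSet[str]]":
    """Count, for each distinct predicate set, how many subjects emit exactly it."""
    predicates_by_subject = defaultdict(set)
    for subject, predicate, _ in triples:
        predicates_by_subject[subject].add(predicate)
    return Counter(frozenset(preds) for preds in predicates_by_subject.values())


def star_subject_count(cs_counts, star_predicates: List[str]) -> int:
    """Number of distinct subjects that have all predicates of the star.

    A subject matches if its characteristic set is a superset of the
    predicates that appear in the star-shaped subquery.
    """
    star = frozenset(star_predicates)
    return sum(count for cs, count in cs_counts.items() if star <= cs)


if __name__ == "__main__":
    # Hypothetical sample data.
    data = [
        ("s1", "name", "a"), ("s1", "age", "1"),
        ("s2", "name", "b"), ("s2", "age", "2"),
        ("s3", "name", "c"),
    ]
    cs = characteristic_sets(data)
    print(star_subject_count(cs, ["name"]))
    print(star_subject_count(cs, ["name", "age"]))
```

Such counts are cheap to keep per source, which is what makes them attractive for cost estimation; as noted above, they become misleading when the underlying data changes without the statistics being refreshed.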

Different data sources in a federation can contain duplicated data or be replicas of the same dataset. DAW [59] is a duplicate-aware hybrid solution for triple pattern-wise source selection; it uses the DAW index to identify sources that lead to duplicated results and skips those sources. After triple pattern-wise source selection, the selected sources are ranked by the number of new triples they provide, and sources that fall below a threshold are skipped. Duplicates are detected using Min-Wise Independent Permutations (MIPs), which are stored in the DAW index for the triples of each predicate. FEDRA [60] is a source selection strategy for sources with a high replication degree. FEDRA relies on schema-level fragment definitions and fragment containment to detect replication; it exploits replication information to minimize data redundancy and data transfer by reducing the number of unions, i.e., by minimizing the number of endpoints selected.
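The following sketch shows how min-wise hashing can estimate the overlap between sources and skip those that contribute too few new triples. It is a simplified, greedy variant written for illustration and should not be read as the DAW algorithm: the seeded SHA-1 hash functions, the size-based visiting order, the overlap formula derived from the resemblance estimate, and all names are assumptions.

```python
import hashlib
from typing import Dict, Iterable, List, Tuple


def _hash(seed: int, item: str) -> int:
    """Seeded 64-bit hash; each seed simulates one random permutation."""
    digest = hashlib.sha1(f"{seed}:{item}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def mips_sketch(items: Iterable[str], k: int = 64) -> List[int]:
    """Keep the minimum hash value of the set under each of k hash functions."""
    items = list(items)
    if not items:
        raise ValueError("cannot sketch an empty set")
    return [min(_hash(seed, item) for item in items) for seed in range(k)]


def resemblance(a: List[int], b: List[int]) -> float:
    """Estimate the Jaccard resemblance as the fraction of matching minima."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def select_by_novelty(
    sources: Dict[str, Tuple[List[int], int]], threshold: float
) -> List[str]:
    """Greedily keep sources whose estimated number of new triples >= threshold.

    `sources` maps a source name to (sketch, number of triples).
    """
    order = sorted(sources, key=lambda name: sources[name][1], reverse=True)
    first = order[0]
    union_sketch = list(sources[first][0])
    union_size = float(sources[first][1])
    selected = [first]
    for name in order[1:]:
        sketch, size = sources[name]
        r = resemblance(union_sketch, sketch)
        overlap = r / (1 + r) * (union_size + size)  # |A ∩ B| from resemblance
        new_triples = max(0.0, size - overlap)
        if new_triples >= threshold:
            selected.append(name)
            # The sketch of a union is the element-wise minimum of the sketches.
            union_sketch = [min(x, y) for x, y in zip(union_sketch, sketch)]
            union_size += new_triples
    return selected


if __name__ == "__main__":
    a = [f"triple-{i}" for i in range(200)]
    c = [f"other-{i}" for i in range(100)]
    srcs = {
        "A": (mips_sketch(a), len(a)),
        "B": (mips_sketch(a), len(a)),  # exact replica of A
        "C": (mips_sketch(c), len(c)),
    }
    print(select_by_novelty(srcs, threshold=10))
```

In this example the replica B is expected to be skipped because its sketch is identical to that of A, whereas C shares no triples with A and is kept. The sketches are small and can be computed ahead of time, which is why this kind of summary suits an index.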

In this thesis, we propose RDF Molecule Template-based source descriptions that leverage the semantics encoded in data sources. We formally define RDF Molecule Templates (RDF-MTs) and devise techniques for exploiting RDF-MTs during source selection, query decomposition, and planning. Unlike FedX and Lusail, our approach collects RDF-MTs beforehand, reducing the number of requests sent to a data source at query time. Our approach describes each source as a set of RDF-MTs, where each RDF-MT describes a set of RDF molecules that share the same characteristics, such as rdf:type values and the properties that may be associated with them. An RDF-MT also contains a set of links to RDF-MTs within the same source and a set of links to RDF-MTs in other data sources.
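To give an intuition for what such a description might look like, the sketch below derives simplified RDF-MTs from the triples of a single source by grouping subjects by their rdf:type value. It is only an approximation of the idea described above, not the formal definition given in the thesis: the `RDFMT` class, its field names, and the sample data are assumptions, and links to RDF-MTs in other sources are left out.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, Iterable, Set, Tuple

RDF_TYPE = "rdf:type"
Triple = Tuple[str, str, str]


@dataclass
class RDFMT:
    """Simplified template: one class, its properties, and intra-source links."""
    source: str
    rdf_class: str
    predicates: Set[str] = field(default_factory=set)
    links: Set[Tuple[str, str]] = field(default_factory=set)  # (property, target class)


def extract_rdf_mts(source: str, triples: Iterable[Triple]) -> Dict[str, RDFMT]:
    """Group the molecules of one source by rdf:type and collect their properties."""
    triples = list(triples)
    types = defaultdict(set)  # entity -> its classes
    for s, p, o in triples:
        if p == RDF_TYPE:
            types[s].add(o)
    templates = {
        cls: RDFMT(source, cls) for classes in types.values() for cls in classes
    }
    for s, p, o in triples:
        if p == RDF_TYPE:
            continue
        for cls in types.get(s, ()):
            template = templates[cls]
            template.predicates.add(p)
            # If the object is itself a typed entity, record a link between templates.
            for target_cls in types.get(o, ()):
                template.links.add((p, target_cls))
    return templates


if __name__ == "__main__":
    # Hypothetical sample data.
    data = [
        ("g1", RDF_TYPE, "Gene"), ("g1", "name", "BRCA1"),
        ("g1", "associatedWith", "d1"),
        ("d1", RDF_TYPE, "Disease"), ("d1", "label", "Breast cancer"),
    ]
    for mt in extract_rdf_mts("http://a.example/sparql", data).values():
        print(mt.rdf_class, sorted(mt.predicates), sorted(mt.links))
```

Because the templates are computed once per source, they can be stored in the catalog and reused across queries, in line with the goal of reducing query-time requests.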


Given a SPARQL query, our decomposition and source selection approach parses it into star-shaped subqueries and creates a query graph in which nodes are star-shaped subqueries and edges are join variables. Using RDF-MT-based source descriptions, our approach selects, for each node in the query graph, the RDF-MT(s) that contain all or a subset of the predicates of the star-shaped subquery. A source is then selected for a star-shaped subquery if it is described by an RDF-MT with properties that appear in the triple patterns of the subquery. Once the RDF-MT(s) are selected for the subqueries of a query, information about links between RDF-MTs is used to prune the selected RDF-MT(s) and keep only the relevant sources, thus reducing execution time without affecting query completeness.
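The steps above can be sketched roughly as follows. This is a hedged illustration under simplifying assumptions and not the thesis implementation: stars are formed by grouping triple patterns on their subject, a template is a candidate only if it covers all predicates of the star, pruning is a single pass over the join edges, and the `RDFMT` class, function names, and sample catalog are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple

TriplePattern = Tuple[str, str, str]


@dataclass
class RDFMT:
    source: str
    rdf_class: str
    predicates: Set[str]
    links: Set[Tuple[str, str]] = field(default_factory=set)  # (property, target class)


def star_subqueries(patterns: List[TriplePattern]) -> Dict[str, List[TriplePattern]]:
    """Group triple patterns that share the same subject into one star."""
    stars = defaultdict(list)
    for tp in patterns:
        stars[tp[0]].append(tp)
    return dict(stars)


def candidate_mts(stars, catalog: List[RDFMT]) -> Dict[str, List[RDFMT]]:
    """For each star, keep the templates that contain all of its predicates."""
    result = {}
    for subject, tps in stars.items():
        preds = {p for _, p, _ in tps if not p.startswith("?")}
        result[subject] = [mt for mt in catalog if preds <= mt.predicates]
    return result


def prune_by_links(stars, candidates) -> Dict[str, List[RDFMT]]:
    """Drop templates that are not connected through the join edges of the query."""
    pruned = {subject: list(mts) for subject, mts in candidates.items()}
    for subject, tps in stars.items():
        for _, p, o in tps:
            if o in stars and o != subject:  # the object joins with another star
                left = [a for a in pruned[subject]
                        if any((p, b.rdf_class) in a.links for b in pruned[o])]
                right = [b for b in pruned[o]
                         if any((p, b.rdf_class) in a.links for a in left)]
                pruned[subject], pruned[o] = left, right
    return pruned


# Hypothetical catalog spanning three sources.
CATALOG = [
    RDFMT("http://a.example/sparql", "Gene", {"name", "associatedWith"},
          {("associatedWith", "Disease")}),
    RDFMT("http://b.example/sparql", "Disease", {"label"}),
    RDFMT("http://c.example/sparql", "Drug", {"label"}),
]

if __name__ == "__main__":
    query = [("?g", "name", "?n"), ("?g", "associatedWith", "?d"),
             ("?d", "label", "?l")]
    stars = star_subqueries(query)
    selected = prune_by_links(stars, candidate_mts(stars, CATALOG))
    for subject, mts in selected.items():
        print(subject, "->", sorted({mt.source for mt in mts}))
```

In this example, the star on `?d` initially matches both the Disease and the Drug templates because both have the `label` property; the link information of the Gene template removes the Drug template, so its source does not need to be contacted.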

In document Federated Query Processing over Heterogeneous Data Sources in a Semantic Data Lake (Page 41-43)