

3.4 Operator Implementation: The Importance of the Join in RDF Query

3.4.3 Join Mechanisms in RDF Stores

As described in Section 3.2.2.1, the standard physical storage schema in RDF stores is a three-index layout using SPO, POS, and OSP indexes, effectively covering all access paths into the system. Some dedicated stores, such as Jena TDB, join data exclusively with index nested loops (INL), while systems built on top of RDBMSs use whatever join mechanism the underlying system chooses at query time.

Some newer stores, specifically Hexastore (Weiss et al., 2008) and RDF-3X (Neumann and Weikum, 2008), have chosen a different approach. They create all six possible index permutations, allowing them to make the greatest possible use of merge joins, and where merging is not possible employ sort-merge or hash join strategies.
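As a rough illustration of why the extra permutations matter, the sketch below enumerates the six orderings and asks which of them can return a triple pattern's matches already sorted on a chosen position. The helper `usable_indexes` and its selection rule are assumptions made for this example, not the logic of Hexastore or RDF-3X; it also assumes the pattern has at least one unbound position.

```python
from itertools import permutations

# The six possible orderings of subject (S), predicate (P), and object (O).
ALL_ORDERS = ["".join(p) for p in permutations("SPO")]

def usable_indexes(bound, sort_on, available=ALL_ORDERS):
    """Return the indexes whose leading positions are exactly the bound
    positions of a pattern, immediately followed by the position to sort on.

    Hypothetical helper: assumes at least one position is unbound.
    """
    n = len(bound)
    return [o for o in available
            if set(o[:n]) == set(bound) and o[n] == sort_on]

three_index_layout = ["SPO", "POS", "OSP"]

# A pattern such as "?x ex:likes-meal ?meal" binds only P. Reading it
# sorted on the subject needs a PSO index, which the three-index layout lacks.
print(usable_indexes("P", "S"))                      # six-index layout
print(usable_indexes("P", "S", three_index_layout))  # three-index layout
```

Under these assumptions, the six-index layout can always supply the sort order a merge join needs, whereas the three-index layout sometimes forces a sort or a different join strategy.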

Each of the different join strategies can be illustrated using the query specified in Figure 3.12, a query that one might run over an RDF database of students studying in the UK.

PREFIX ex: <http://www.example.com/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?mealname WHERE {
  1  ?x ex:attends-university ex:Southampton .
  2  ?x ex:nationality ex:bangladesh .
  3  ?x ex:has-gender ex:male .
  4  ?x ex:likes-meal ?meal .
  5  ?meal rdfs:label ?mealname .
}

Figure 3.12: SPARQL query to determine the meals enjoyed by Bangladeshi students at the University of Southampton. Triple patterns are numbered for reference purposes.

To answer this query using INL, one might use the process outlined below:

1. Look up pattern 1 in the POS-ordered index, receiving results for ?x back. Bind the next value of ?x. If bindings are exhausted, the query is complete.

2. All values for pattern 2 are now fixed. Look up in any index. Receive at most one result back. If no results are returned, return to pattern 1 and bind the next value of ?x.

3. All values for pattern 3 are now fixed. Look up in any index. Receive at most one result back. If no results are returned, return to pattern 1 and bind the next value of ?x.

4. Look up, in an SPO-ordered index, which meals the currently bound value of ?x enjoys, receiving results for ?meal back. If no results are returned, or results are exhausted, return to pattern 1 and bind the next value of ?x; otherwise, bind the next value of ?meal.

5. Look up pattern 5, in an SPO-ordered index, for the currently bound value of ?meal, receiving results for ?mealname back. If no results are returned, bind the next value of ?meal from pattern 4. Otherwise, output a result.
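The procedure above can be sketched over a toy in-memory store. Everything here is an assumption made for illustration: the `TripleStore` class, its two dictionary-based indexes, the lookup counter, and the sample triples are hypothetical and do not reflect how Jena TDB or any other store is implemented.

```python
from collections import defaultdict

class TripleStore:
    """Toy store with SPO and POS indexes (hypothetical, illustration only)."""

    def __init__(self, triples):
        self.spo = defaultdict(lambda: defaultdict(set))
        self.pos = defaultdict(lambda: defaultdict(set))
        self.lookups = 0  # counts index lookups, as in the text's cost model
        for s, p, o in triples:
            self.spo[s][p].add(o)
            self.pos[p][o].add(s)

    def subjects(self, p, o):
        """POS lookup: all subjects with the given predicate and object."""
        self.lookups += 1
        return sorted(self.pos.get(p, {}).get(o, ()))

    def objects(self, s, p):
        """SPO lookup: all objects for the given subject and predicate."""
        self.lookups += 1
        return sorted(self.spo.get(s, {}).get(p, ()))

    def contains(self, s, p, o):
        """Fully bound lookup: returns at most one match."""
        self.lookups += 1
        return o in self.spo.get(s, {}).get(p, ())

def inl_meal_names(store):
    """Index nested loops evaluation of the five patterns of Figure 3.12."""
    for x in store.subjects("ex:attends-university", "ex:Southampton"):  # 1
        if not store.contains(x, "ex:nationality", "ex:bangladesh"):     # 2
            continue
        if not store.contains(x, "ex:has-gender", "ex:male"):            # 3
            continue
        for meal in store.objects(x, "ex:likes-meal"):                   # 4
            for name in store.objects(meal, "rdfs:label"):               # 5
                yield name

# Invented sample data: one matching student and one non-matching student.
triples = [
    ("ex:amit", "ex:attends-university", "ex:Southampton"),
    ("ex:amit", "ex:nationality", "ex:bangladesh"),
    ("ex:amit", "ex:has-gender", "ex:male"),
    ("ex:amit", "ex:likes-meal", "ex:biryani"),
    ("ex:biryani", "rdfs:label", "Biryani"),
    ("ex:beth", "ex:attends-university", "ex:Southampton"),
    ("ex:beth", "ex:nationality", "ex:british"),
    ("ex:beth", "ex:likes-meal", "ex:pasta"),
    ("ex:pasta", "rdfs:label", "Pasta"),
]
store = TripleStore(triples)
results = list(inl_meal_names(store))
print(results, store.lookups)
```

Each failed check abandons the current binding of ?x immediately, so the non-matching student costs a single extra lookup and none of that student's other triples are touched.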

An alternative merge and sort-merge strategy might take a variety of approaches, depending on how the query was optimised. If the store attempts to minimise the number of sorts performed, it might merge join patterns 1 and 3. It would then merge join 2 and 4. These two outputs would then be merge joined together, and sorted on ?meal. Finally, the working set would be joined against the output of query pattern 5.

The characteristics of each approach vary substantially. Consider the following (fabricated) figures:

1. There are 10,000 students at the University of Southampton.

2. There are 15,000 Bangladeshi students in the UK, of which Southampton has 500.

3. There are 1,000,000 male students in the UK.

5. There are 20,000,000 different labelled ‘things’ in the database.

If the query were run using INL joins, triple pattern 1 returns 10,000 results. Pattern 2 returns a result 1/20th of the time, while pattern 3 will return a result 1/2 of the time. Pattern 4 returns four results on average, and pattern 5 then returns one result at most. This query, then, will run pattern 1 once, pattern 2 10,000 times, pattern 3 500 times, pattern 4 250 times, and pattern 5 1,000 times, giving a total of 11,751 lookups.
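The lookup total can be checked with a few lines of arithmetic. The variable names below are illustrative only; the figures are the fabricated ones given in the text.

```python
# Fabricated figures from the text; names are chosen for this example.
southampton_students = 10_000
meals_per_student = 4  # average number of results from pattern 4

p1_lookups = 1                               # one scan of the POS index
p2_lookups = southampton_students            # one lookup per binding of ?x
p3_lookups = p2_lookups // 20                # pattern 2 matches 1/20th of the time
p4_lookups = p3_lookups // 2                 # pattern 3 matches 1/2 of the time
p5_lookups = p4_lookups * meals_per_student  # one lookup per binding of ?meal

total_lookups = (p1_lookups + p2_lookups + p3_lookups
                 + p4_lookups + p5_lookups)
print(total_lookups)  # 11751
```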

If this query were performed using merge and sort/merge joins, the join might involve the following steps (although one might find an alternative ordering to cut down result sets further):

1. Merge join patterns 1 and 3: iterate over 10,000 items on the left side, and up to 1,000,000 on the right, producing an output of 5,000 elements. This output is already sorted on ?x, and does not need sorting again.

2. Merge join patterns 2 and 4: iterate over 15,000 elements on the left side, and 2,000,000 on the right, producing an output of 60,000 elements. This output is already sorted on ?x, and does not need sorting again.

3. Merge join the output of steps 1 and 2, outputting around 1,000 results (four meals for each of around 250 students). Sort this output on ?meal.

4. Merge join the output of step 3 against the 20,000,000 items produced by triple pattern 5.

5. Output results.
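The merge steps above rely on both inputs arriving sorted on the join variable. A minimal sketch of such a join follows, assuming rows are (key, value) tuples held in Python lists; the function `merge_join` and the sample rows are hypothetical and are not taken from Hexastore or RDF-3X.

```python
def merge_join(left, right):
    """Join two lists of (key, value) rows, each already sorted on key."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1          # left key has no partner yet, advance left
        elif lk > rk:
            j += 1          # right key has no partner yet, advance right
        else:
            # Find the run of equal keys on each side, emit the cross product.
            i_end = i
            while i_end < len(left) and left[i_end][0] == lk:
                i_end += 1
            j_end = j
            while j_end < len(right) and right[j_end][0] == lk:
                j_end += 1
            for _, lv in left[i:i_end]:
                for _, rv in right[j:j_end]:
                    out.append((lk, lv, rv))
            i, j = i_end, j_end
    return out

# Invented rows, both sorted on ?x (patterns 2 and 4).
nationality = [("amit", "ex:bangladesh"), ("kamal", "ex:bangladesh")]
likes = [("amit", "biryani"), ("amit", "dal"), ("beth", "pasta")]
students_meals = merge_join(nationality, likes)

# The working set is sorted on ?x, so it must be re-sorted on ?meal
# before it can be merge joined against pattern 5.
by_meal = sorted((meal, x) for x, _, meal in students_meals)
labels = [("biryani", "Biryani"), ("dal", "Dal"), ("pasta", "Pasta")]
named = merge_join(by_meal, labels)
print(named)
```

Both inputs are read once from start to finish, which is why the merge strategy favours sequential access but must iterate over rows that never contribute to the result.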

The advantage of the many-index approach used by Hexastore and RDF-3X is that it avoids many expensive sorts: whatever order the working set is sorted in, an index with a matching sort order is available. The fact that the query described in Figure 3.12 has very few variables is helpful in this regard: most of the time, the working set remains sorted on ?x. Overall, this approach has been conclusively shown to provide significantly better performance than using fewer indexes with a merge/sort-merge strategy (Weiss et al., 2008).

No comparisons, however, have been made between the merge join strategy and the use of index nested loops. It’s possible to see from this example that INL does a better job of eliminating irrelevant data, thanks to its superior selectivity: the merge join example has to iterate over many millions of items, which INL ignores by virtue of substituting additional bindings into index retrievals. The factor working against INL is that while it touches much less data, it performs a lot more random access. If that random access is cheap, INL will be much faster. If the random access is very expensive, merge and sort/merge will become faster.

Overall, then, memory-based or heavily cached disk-based systems are likely to benefit from an INL approach (particularly given its low memory usage), while systems without such benefits are likely to do better with merge/sort-merge or merge/hash join strategies. In future, for disk-based stores using INL joins, it may be worth investigating data structures like compact, in-memory Bloom filters (Bloom, 1970) that would indicate whether a disk query could return any results, eliminating the large majority of disk seeks that would return none. In this example, assuming a 1% error rate and no caching of information, the number of lookups for the INL approach would drop from 11,751 to 1,976. In practice, that number of seeks would usually be further reduced by the influence of cached data.
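A minimal sketch of the Bloom filter idea follows. The `BloomFilter` class, the SHA-256 based hashing, the `guarded_lookup` helper, and the Python set standing in for the on-disk index are all assumptions made for illustration. The sizing function uses the standard Bloom filter formulas, which for a 1% false-positive rate give roughly 9.6 bits per stored item and 7 hash functions.

```python
import hashlib
import math

def bloom_parameters(expected_items, false_positive_rate):
    """Standard sizing: m = -n ln(p) / (ln 2)^2 bits, k = (m / n) ln 2 hashes."""
    bits = math.ceil(-expected_items * math.log(false_positive_rate)
                     / (math.log(2) ** 2))
    hashes = max(1, round(bits / expected_items * math.log(2)))
    return bits, hashes

class BloomFilter:
    """Compact membership filter: may report false positives, never false negatives."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive one bit position per hash by salting SHA-256 with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

# Stand-in for the on-disk index; a real store would perform a seek here.
on_disk = {
    ("ex:amit", "ex:nationality", "ex:bangladesh"),
    ("ex:amit", "ex:has-gender", "ex:male"),
}
num_bits, num_hashes = bloom_parameters(1_000, 0.01)
bloom = BloomFilter(num_bits, num_hashes)
for triple in on_disk:
    bloom.add(triple)

def guarded_lookup(triple):
    """Consult the in-memory filter first; only seek on a possible hit."""
    if not bloom.might_contain(triple):
        return False          # definite miss, no disk seek needed
    return triple in on_disk  # possible hit, confirm against the index

print(guarded_lookup(("ex:amit", "ex:nationality", "ex:bangladesh")))
print(guarded_lookup(("ex:beth", "ex:nationality", "ex:bangladesh")))
```

Because the filter never produces false negatives, a negative answer can safely skip the seek; only lookups that pass the filter, including the occasional false positive, reach the disk.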