
In this section, we describe the RDF benchmark we have developed for evaluating the performance of our three RDF databases. Our benchmark is based on publicly available library data and a collection of queries generated from a web-based user interface for browsing RDF content.

8.5.1 Barton Data

The dataset we work with is taken from the publicly available Barton Libraries dataset [1]. This data is provided by the Simile Project [5], which develops tools for library data management and interoperability. The data contains records acquired from an RDF-formatted dump of the MIT Libraries Barton catalog, converted from raw data stored in an old library format standard called MARC (Machine Readable Catalog). Because the data was derived from multiple sources and the cataloged material is diverse, the structure of the data is quite irregular. We converted the Barton data from RDF/XML syntax to triples using the Redland parser [4] and then eliminated duplicate triples. We then performed some minor cleaning, eliminating triples with particularly long literal values or with subject URIs that were obviously overloaded to correspond to several real-world entities (more than 99% of the data remained). This left a total of 50,255,599 triples in our dataset, with a total of 221 unique properties, of which the vast majority appear infrequently. Of these properties, 82 (37%) are multi-valued, meaning that they appear more than once for a given subject; these properties nevertheless account for a disproportionate share of the data (77% of the triples have a multi-valued property). The dataset provides a good demonstration of the relatively unstructured nature of Semantic Web data.
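The two statistics quoted above (the share of properties that are multi-valued and the share of triples that carry such a property) can be illustrated with a minimal sketch. The triples below are made up for illustration and are not taken from the Barton data; the variable names are likewise assumptions.

```python
from collections import Counter

# Toy triples (subject, property, object); all values are hypothetical.
raw_triples = [
    ("s1", "type", "Text"),
    ("s1", "type", "Text"),        # exact duplicate, dropped below
    ("s1", "subject", "History"),
    ("s1", "subject", "Europe"),   # second "subject" value for s1
    ("s2", "type", "Text"),
    ("s2", "language", "fre"),
    ("s3", "subject", "Music"),
]

# Eliminate duplicate triples while preserving first-seen order.
triples = list(dict.fromkeys(raw_triples))

# Count how many triples each (subject, property) pair has.
pair_counts = Counter((s, p) for s, p, _ in triples)

# A property is multi-valued if some subject carries it more than once.
multi_valued = {p for (s, p), n in pair_counts.items() if n > 1}

all_props = {p for _, p, _ in triples}
share_of_props = len(multi_valued) / len(all_props)
share_of_triples = sum(p in multi_valued for _, p, _ in triples) / len(triples)
```

In this toy data only one of three properties is multi-valued, yet it covers half of the triples, which mirrors on a small scale the skew reported for the real dataset.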

8.5.2 Longwell Overview

Longwell [2] is a tool developed by the Simile Project, which provides a graphical user interface for generic RDF data exploration in a web browser. It begins by presenting the user with a list of the values the type property can take (such as Text or Notated Music in the library dataset) and the number of times each type occurs in the data. The user can then click on a type to explore it further. Longwell shows the list of currently filtered resources (RDF subjects) in the main portion of the screen, and a list of filters in panels along the side. Each panel represents a property that is defined on the current filter, and contains popular object values for that property along with their corresponding frequencies. If the user selects an object value inside one of these property panels, this filters the working set of resources to those that have that property-object pair defined, updating the other panels with the new frequency counts for this narrower set of resources.
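The filter-and-recount behavior described above can be sketched in a few lines. This is not Longwell's implementation; the function names `working_set` and `panels` and the toy triples are assumptions made for illustration.

```python
from collections import Counter, defaultdict

# Toy triples (subject, property, object); all values are hypothetical.
triples = [
    ("b1", "type", "Text"), ("b1", "language", "fre"), ("b1", "genre", "Fiction"),
    ("b2", "type", "Text"), ("b2", "language", "ger"), ("b2", "genre", "Fiction"),
    ("b3", "type", "Text"), ("b3", "language", "fre"), ("b3", "genre", "Poetry"),
    ("m1", "type", "NotatedMusic"), ("m1", "language", "fre"),
]

def working_set(triples, filters):
    """Subjects that carry every (property, object) pair in `filters`."""
    by_subject = defaultdict(set)
    for s, p, o in triples:
        by_subject[s].add((p, o))
    return {s for s, pairs in by_subject.items() if set(filters) <= pairs}

def panels(triples, filters):
    """Per-property Counter of object values over the filtered subjects."""
    subjects = working_set(triples, filters)
    counts = defaultdict(Counter)
    for s, p, o in triples:
        if s in subjects:
            counts[p][o] += 1
    return counts

# Panels after selecting type = Text, then after also selecting language = fre.
text_panels = panels(triples, [("type", "Text")])
french_panels = panels(triples, [("type", "Text"), ("language", "fre")])
```

Each additional selection appends one property-object pair to the filter list, so the working set can only shrink and every panel is recomputed over the narrower set.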

We will now describe a sample browsing session through the Longwell interface, along with screenshots from this sample session. The opening screen with the list of different types is shown in Figure 8-3. Note the list of different types is shown at the top of the screen and their frequencies in the upper-right-hand side of the screen.

The path starts when the user selects Text from the type panel (at the upper-right-hand side of the screen), which filters the data into a list of text entities. This is shown in Figure 8-4.

On the right side of the screen, we find the popular properties defined on these entities. As the user scrolls down (see Figure 8-5), we can see that they include “subject,” “creator,” “genre,” and “publisher.” Within each property there is a list of the counts of the popular objects within this property. For example, as the user scrolls down even more (see Figure 8-6), we find out that the German object value appears 122 times and the French object value appears 131 times under the language property. By clicking on “fre” (French language), information about the 131 French texts in the database is presented, along with the revised set of popular properties and property values defined on these French texts. This is shown in Figure 8-7.

Currently, Longwell only runs on a small fraction of the Barton data – 9375 records, as its RDF triple store cannot scale to support the full 50 million triple dataset while still allowing interactive response time to queries (we will show this scalability limitation in our experiments).

Our experiments use Longwell-style queries to provide a realistic benchmark for testing the designs proposed. Our goal is to explore architectures and schemas which can provide interactive performance on the full dataset.

8.5.3 Longwell Queries

Our experiments feature seven queries that need to be executed on a typical Longwell path through the data. These queries are based on a typical browsing session, where the user selects a few specific entities to focus on and where the aggregate results summarizing the contents of the RDF store are updated.

The queries are described at a high level here and are provided in full in Appendix C as SQL queries against a triple-store. We will discuss later how we rewrote the queries for each schema.

Query 1 (Q1). Calculate the opening panel displaying the counts of the different types of data in the RDF store. This requires a search for the objects and counts of those objects with property Type.

There are 30 such objects. For example: Type: Text has a count of 1,542,280, and Type: NotatedMusic has a count of 36,441.
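As a rough illustration of Q1, the sketch below runs a group-and-count query over a three-column triple table in SQLite. The table and column names (`triples`, `subj`, `prop`, `obj`) and the toy rows are assumptions for this example and may differ from the queries in Appendix C.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (subj TEXT, prop TEXT, obj TEXT)")
conn.executemany("INSERT INTO triples VALUES (?, ?, ?)", [
    ("b1", "type", "Text"),
    ("b2", "type", "Text"),
    ("m1", "type", "NotatedMusic"),
    ("b1", "language", "fre"),   # not a type triple, ignored by the query
])

# Q1-style query: one row per distinct type value with its frequency.
q1_rows = conn.execute("""
    SELECT obj, COUNT(*) AS n
    FROM triples
    WHERE prop = 'type'
    GROUP BY obj
    ORDER BY n DESC, obj
""").fetchall()
```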

Query 2 (Q2). The user selects Type: Text from the previous panel. Longwell must then display a list of other defined properties for resources of Type: Text. It must also calculate the frequency of these properties. For example, the Language property is defined 1,028,826 times for resources that are of Type: Text.
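One way to express Q2 over a triple table is a self-join on subject: one copy of the table selects the Type: Text subjects and the other contributes their properties. The sketch below assumes hypothetical table and column names and toy data, and it omits any restriction to an administrator-chosen property list.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (subj TEXT, prop TEXT, obj TEXT)")
conn.executemany("INSERT INTO triples VALUES (?, ?, ?)", [
    ("b1", "type", "Text"), ("b1", "language", "fre"),
    ("b1", "subject", "History"), ("b1", "subject", "Europe"),
    ("b2", "type", "Text"), ("b2", "language", "ger"),
    ("b3", "type", "Text"),
    ("m1", "type", "NotatedMusic"), ("m1", "language", "fre"),
])

# Q2-style query: a picks the Text subjects, b enumerates their triples.
q2_rows = conn.execute("""
    SELECT b.prop, COUNT(*) AS n
    FROM triples AS a
    JOIN triples AS b ON a.subj = b.subj
    WHERE a.prop = 'type' AND a.obj = 'Text'
    GROUP BY b.prop
    ORDER BY n DESC, b.prop
""").fetchall()
```

Note that the count is per triple, not per subject, so a multi-valued property such as `subject` on `b1` contributes twice.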

Query 3 (Q3). For each property defined on items of Type: Text, populate the property panel with the counts of popular object values for that property (where popular means that an object value appears more than once). For example, the property Edition has 8 items with value “[1st ed. reprinted].”
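Q3 differs from Q2 in that it groups by the property-object pair and keeps only pairs that occur more than once. A minimal sketch follows, with hypothetical table and column names and made-up rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (subj TEXT, prop TEXT, obj TEXT)")
conn.executemany("INSERT INTO triples VALUES (?, ?, ?)", [
    ("b1", "type", "Text"), ("b1", "edition", "1st ed."), ("b1", "language", "fre"),
    ("b2", "type", "Text"), ("b2", "edition", "1st ed."), ("b2", "language", "ger"),
    ("b3", "type", "Text"), ("b3", "edition", "2nd ed."), ("b3", "language", "fre"),
    ("m1", "type", "NotatedMusic"), ("m1", "language", "fre"),
])

# Q3-style query: popular (count > 1) object values per property,
# restricted to subjects whose type is Text.
q3_rows = conn.execute("""
    SELECT b.prop, b.obj, COUNT(*) AS n
    FROM triples AS a
    JOIN triples AS b ON a.subj = b.subj
    WHERE a.prop = 'type' AND a.obj = 'Text'
    GROUP BY b.prop, b.obj
    HAVING COUNT(*) > 1
    ORDER BY b.prop, b.obj
""").fetchall()
```

The values `2nd ed.` and `ger` each occur once among the Text subjects, so the `HAVING` clause filters them out of the panels.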

Query 4 (Q4). This query recalculates all of the property-object counts from Q3 if the user clicks on the “French” value in the “Language” property panel. Essentially, this narrows the working set of subjects to those whose Type is Text and Language is French. The query is thus similar to Q3 but is much more selective.
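In a triple table, the extra Language restriction can be expressed as a third copy of the table in the join. The sketch below is one possible formulation under assumed table and column names, with toy rows; the language code `fre` follows the value mentioned in the browsing session.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (subj TEXT, prop TEXT, obj TEXT)")
conn.executemany("INSERT INTO triples VALUES (?, ?, ?)", [
    ("b1", "type", "Text"), ("b1", "language", "fre"), ("b1", "genre", "Fiction"),
    ("b2", "type", "Text"), ("b2", "language", "fre"), ("b2", "genre", "Fiction"),
    ("b3", "type", "Text"), ("b3", "language", "ger"), ("b3", "genre", "Fiction"),
    ("m1", "type", "NotatedMusic"), ("m1", "language", "fre"), ("m1", "genre", "Fiction"),
])

# Q4-style query: a selects Text subjects, c selects French subjects,
# b enumerates the triples of subjects that satisfy both.
q4_rows = conn.execute("""
    SELECT b.prop, b.obj, COUNT(*) AS n
    FROM triples AS a
    JOIN triples AS c ON a.subj = c.subj
    JOIN triples AS b ON a.subj = b.subj
    WHERE a.prop = 'type' AND a.obj = 'Text'
      AND c.prop = 'language' AND c.obj = 'fre'
    GROUP BY b.prop, b.obj
    HAVING COUNT(*) > 1
    ORDER BY b.prop, b.obj
""").fetchall()
```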

Query 5 (Q5). Here we perform a type of inference. If there are triples of the form (X Records Y) and (Y Type Z), then we can infer that X is of type Z. Here X Records Y means that X records information about Y (for example, X might be a web page with information on Y). For this query, we want to find the inferred type of all subjects that have this Records property defined and that also originated in the US Library of Congress (i.e., contain triples of the form (X origin “DLC”)). The subject and inferred type are returned for all non-Text entities.
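The inference in Q5 amounts to a join in which the object of a Records triple is matched against the subject of a Type triple. The sketch below shows one such formulation; the table and column names, the plain string `DLC`, and the toy rows are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (subj TEXT, prop TEXT, obj TEXT)")
conn.executemany("INSERT INTO triples VALUES (?, ?, ?)", [
    # w1 originates at DLC and records a Date: reported.
    ("w1", "origin", "DLC"), ("w1", "records", "d1"), ("d1", "type", "Date"),
    # w2 originates at DLC but records a Text: excluded (non-Text only).
    ("w2", "origin", "DLC"), ("w2", "records", "t1"), ("t1", "type", "Text"),
    # w3 records a Date but has a different origin: excluded.
    ("w3", "origin", "OTHER"), ("w3", "records", "d2"), ("d2", "type", "Date"),
])

# Q5-style query: (X origin DLC), (X records Y), (Y type Z), Z != Text.
q5_rows = conn.execute("""
    SELECT b.subj, c.obj
    FROM triples AS a
    JOIN triples AS b ON a.subj = b.subj
    JOIN triples AS c ON b.obj = c.subj
    WHERE a.prop = 'origin' AND a.obj = 'DLC'
      AND b.prop = 'records'
      AND c.prop = 'type' AND c.obj != 'Text'
    ORDER BY b.subj
""").fetchall()
```

Unlike Q1 through Q4, one of the join conditions here is object-to-subject rather than subject-to-subject.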

Figure 8-6: Longwell Screen Shot After Clicking on “Text” in the Type Property Panel and Scrolling Down to the Language Property Panel

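Query 6 (Q6). This query combines the inference step of Q5 with the property frequency calculation of Q2 to extract information in aggregate about items that are either directly known to be of Type: Text (as in Q2) or inferred to be of Type: Text through the Q5 Records inference.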
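A hedged sketch of this combination: the subjects that are directly of Type: Text are unioned with the subjects inferred to be Text through Records, and a Q2-style property count is then computed over the union. The table and column names and the toy rows are assumptions, and the restriction to a selected property list is omitted here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (subj TEXT, prop TEXT, obj TEXT)")
conn.executemany("INSERT INTO triples VALUES (?, ?, ?)", [
    ("b1", "type", "Text"), ("b1", "language", "fre"),          # directly Text
    ("w1", "records", "b1"), ("w1", "language", "eng"),
    ("w1", "origin", "DLC"),                                    # inferred Text
    ("m1", "type", "NotatedMusic"), ("m1", "language", "fre"),  # not Text
    ("w2", "records", "m1"), ("w2", "language", "ger"),         # inferred non-Text
])

# UNION removes duplicates, so a subject that is both directly and
# inferentially Text is counted once.
q6_rows = conn.execute("""
    SELECT a.prop, COUNT(*) AS n
    FROM triples AS a
    JOIN (
        SELECT subj FROM triples WHERE prop = 'type' AND obj = 'Text'
        UNION
        SELECT c.subj
        FROM triples AS c
        JOIN triples AS d ON c.obj = d.subj
        WHERE c.prop = 'records' AND d.prop = 'type' AND d.obj = 'Text'
    ) AS u ON a.subj = u.subj
    GROUP BY a.prop
    ORDER BY a.prop
""").fetchall()
```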

Query 7 (Q7). Finally, we include a simple triple selection query with no aggregation or inference. The user tries to learn what a particular property (in this case Point) actually means by selecting other properties that are defined along with a particular value of this property. The user wishes to retrieve subject, Encoding, and Type of all resources with a Point value of “end.” The result set indicates that all such resources are of the type Date. This explains why these resources can have “start” and “end” values: each of these resources represents a start or end date, depending on the value of Point.
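Q7 can be sketched as a three-way self-join on subject with no grouping. The table and column names are assumptions, and the Encoding value `w3cdtf` is a made-up placeholder rather than a value taken from the dataset.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (subj TEXT, prop TEXT, obj TEXT)")
conn.executemany("INSERT INTO triples VALUES (?, ?, ?)", [
    ("d1", "Point", "end"), ("d1", "Encoding", "w3cdtf"), ("d1", "type", "Date"),
    ("d2", "Point", "start"), ("d2", "Encoding", "w3cdtf"), ("d2", "type", "Date"),
    ("b1", "type", "Text"),
])

# Q7-style query: subject, Encoding, and type of resources whose Point is "end".
q7_rows = conn.execute("""
    SELECT a.subj, b.obj AS encoding, c.obj AS type
    FROM triples AS a
    JOIN triples AS b ON a.subj = b.subj
    JOIN triples AS c ON a.subj = c.subj
    WHERE a.prop = 'Point' AND a.obj = 'end'
      AND b.prop = 'Encoding'
      AND c.prop = 'type'
    ORDER BY a.subj
""").fetchall()
```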

We make the assumption that the Longwell administrator has selected a set of 28 interesting properties over which queries will be run (they are listed in Appendix D). There are 26,761,389 triples for these properties. For queries Q2, Q3, Q4, and Q6, only these 28 properties are considered for aggregation.
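One way to apply this restriction in a triple store is to keep the selected properties in a small side table and join against it during aggregation. The sketch below applies that idea to a Q2-style count; the `properties` table name, its single `prop` column, and the toy rows are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (subj TEXT, prop TEXT, obj TEXT)")
conn.execute("CREATE TABLE properties (prop TEXT PRIMARY KEY)")
conn.executemany("INSERT INTO triples VALUES (?, ?, ?)", [
    ("b1", "type", "Text"), ("b1", "language", "fre"), ("b1", "note", "x"),
    ("b2", "type", "Text"), ("b2", "language", "ger"),
])
# Hypothetical administrator selection: "note" is deliberately left out.
conn.executemany("INSERT INTO properties VALUES (?)", [("type",), ("language",)])

# Q2-style count, restricted to the selected properties via the join on p.
restricted_rows = conn.execute("""
    SELECT b.prop, COUNT(*) AS n
    FROM triples AS a
    JOIN triples AS b ON a.subj = b.subj
    JOIN properties AS p ON p.prop = b.prop
    WHERE a.prop = 'type' AND a.obj = 'Text'
    GROUP BY b.prop
    ORDER BY n DESC, b.prop
""").fetchall()
```

Keeping the selection in a table rather than in the query text means the set of interesting properties can change without rewriting the queries.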