4.4 Profiling an In-Memory, Java-based RDF Store
4.4.2 CPU Profiling
This section considers the CPU efficiency of the Jena Memory Model, with particular reference to the process of answering queries rather than data assertion. The profiling tools use sampling to attribute time spent to methods: figures may thus not be perfectly accurate. To enable correct attribution of profiling data to methods, method inlining was turned off using the -XX:-Inline JVM option.
4.4.2.1 Indexes and Cache Efficiency
Jena uses a hash-based index scheme that does not support composite indexing. As a result, it spent a large proportion of its time iterating over indexes: iteration operations (next() and hasNext()) consumed 26% of CPU runtime. Jena’s indexes offer excellent performance when matching against a single node (subject, predicate, or object): the underlying hash maps offer O(1) lookup, allowing the Bunch associated with a node to be retrieved quickly. However, if one wishes to restrict by a second node (for example, a predicate as well as a subject), it is necessary to iterate over all the elements in the retrieved Bunch to find the matches. This scales poorly as the Bunch expands.
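The access pattern described above can be illustrated with a minimal sketch. It is not Jena's implementation: the class and method names are hypothetical, plain strings stand in for nodes, and an ArrayList stands in for a Bunch. It only shows why a single-node match is one hash lookup while a two-node match requires scanning the retrieved bunch.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical subject-keyed index; an illustration, not Jena's classes. */
final class SubjectIndexSketch {

    /** A triple of plain strings, standing in for RDF nodes. */
    static final class Triple {
        final String subject;
        final String predicate;
        final String object;

        Triple(String subject, String predicate, String object) {
            this.subject = subject;
            this.predicate = predicate;
            this.object = object;
        }
    }

    // One "bunch" of triples per subject, reachable in O(1) expected time.
    private final Map<String, List<Triple>> bySubject = new HashMap<>();

    void add(String subject, String predicate, String object) {
        bySubject.computeIfAbsent(subject, k -> new ArrayList<>())
                 .add(new Triple(subject, predicate, object));
    }

    /** Single-node match: one hash lookup returns the whole bunch. */
    List<Triple> findBySubject(String subject) {
        List<Triple> bunch = bySubject.get(subject);
        return bunch == null ? Collections.<Triple>emptyList() : bunch;
    }

    /** Two-node match: hash lookup, then a linear scan of the bunch. */
    List<Triple> findBySubjectAndPredicate(String subject, String predicate) {
        List<Triple> matches = new ArrayList<>();
        for (Triple t : findBySubject(subject)) { // cost grows with bunch size
            if (t.predicate.equals(predicate)) {
                matches.add(t);
            }
        }
        return matches;
    }
}
```

With no composite (subject, predicate) key, the second method must visit every triple in the bunch, which is the iteration cost that dominates when bunches grow large.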
Jena dedicated six times more storage to Hash Bunches than Array Bunches, with the Hash Bunches allocating space for an average of 1088 triples each, with some larger than 300,000. A consequence of the amount of data stored in large Hash Bunches is that Jena spends a lot of time iterating over the arrays backing those structures to retrieve matches. One result of this is that Jena’s data cache efficiency is quite high: its access patterns are very predictable. Profiling indicated that over the course of a BSBM query session the proportion of data cache accesses that missed both L1 and L2 was just 0.45%, much lower than typical DBMSs (Ailamaki et al., 1999). It is likely that this percentage would increase with a better, more selective mode of access.
4.4.2.2 Node Comparisons
In any query, particularly those that deal with large working sets of data, a lot of node comparisons are necessary: joins must be performed, and triples must be matched within Bunches. Jena’s design, however, does not preclude multiple node objects being created to represent the same actual node: it simply uses a node cache to try to reduce the number of duplicates that get created. This approach means that Jena does not have to maintain a separate explicit index to find nodes, but has a variety of disadvantages. Since there may be more than one instance of logically equivalent nodes, it is not possible to use a simple referential comparison to determine node equality, and a String equality test is required. This is not a large performance hit when comparing strings of different lengths, since the inequality can be trivially discovered, but requires a computationally expensive character-by-character comparison in the case of equal length Strings. This issue is exposed by the BSBM dataset: BSBM’s automatically generated URIs have relatively little variation in length, and as a result Jena spent as much as 13.5% of its time performing String comparisons.
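The difference between the two styles of comparison can be sketched as follows. Node and NodeTable are hypothetical stand-ins rather than Jena classes, and the URI used in the usage note is an invented example. The sketch assumes a fully normalised table that holds exactly one Node per URI, which is the explicit index that Jena's cache-based design avoids maintaining.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical illustration of value-based versus reference-based node equality. */
final class NodeEqualitySketch {

    /** A node identified by its URI string. */
    static final class Node {
        final String uri;

        Node(String uri) {
            this.uri = uri;
        }

        /**
         * Value comparison, needed when duplicates may exist.
         * String.equals can reject strings of different lengths cheaply,
         * but equal-length strings need a character-by-character check.
         */
        boolean sameAs(Node other) {
            return uri.equals(other.uri);
        }
    }

    /** Fully normalised table: at most one Node instance per URI. */
    static final class NodeTable {
        private final Map<String, Node> nodes = new HashMap<>();

        Node get(String uri) {
            return nodes.computeIfAbsent(uri, Node::new);
        }
    }
}
```

Two nodes created independently for the same URI are distinct objects, so only sameAs() reports them as equal. Two nodes obtained from the same NodeTable for that URI are the same object, so a reference comparison (==) is sufficient. The price is the table itself, which must be consulted whenever a node is created.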
4.4.2.3 Garbage Collection
Tests indicated that Jena spent an insignificant amount of time (less than 0.05%) in garbage collection. This is due to the fact that the dataset is, in this case, static. The only garbage generated is short-lived objects related to handling queries, which are removed efficiently during collections of the young generation. In this case, generational garbage collection is an ideal match. It should be noted, however, that manually triggering a garbage collection caused a 15 second pause, indicating that in the event of a collection, the overhead is substantial.
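One way to obtain comparable figures from inside a running JVM is the standard java.lang.management API. The sketch below is not the profiling tool used for the measurements above, and the class and method names are hypothetical. It reports accumulated collector time as a share of JVM uptime, and times a manually requested collection.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

/** Hypothetical helper for estimating time spent in garbage collection. */
final class GcShareSketch {

    /** Accumulated collection time across all collectors, in milliseconds. */
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long millis = gc.getCollectionTime(); // -1 if the collector does not report it
            if (millis > 0) {
                total += millis;
            }
        }
        return total;
    }

    /** Collector time as a percentage of JVM uptime so far. */
    static double gcPercentOfUptime() {
        long uptimeMillis = ManagementFactory.getRuntimeMXBean().getUptime();
        return uptimeMillis <= 0 ? 0.0 : 100.0 * totalGcMillis() / uptimeMillis;
    }

    /**
     * Wall-clock time of a manually requested collection, in milliseconds.
     * System.gc() is only a request; the JVM may ignore or defer it.
     */
    static long timedGcRequestMillis() {
        long start = System.nanoTime();
        System.gc();
        return (System.nanoTime() - start) / 1_000_000L;
    }
}
```

Sampling gcPercentOfUptime() before and after a query session gives a rough view of collector activity during that session, although the value covers the whole JVM lifetime rather than the session alone.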
4.4.3 Memory Profiling
The Jena Memory Model (JMM) was profiled after assertion of the dataset, with a full garbage collection triggered to remove any unused objects. The JMM used a total of 639.27MB after assertion, spread across 6,253,509 objects. This is an average of 466.4 bytes per triple. At nearly twice the size of the original data file, despite the normalising of repeated nodes, this footprint is clearly undesirable. It should be noted that this footprint is also dependent upon the dataset being in sorted order on the subject field, so that node cache utilisation can be maximised: using the same dataset in shuffled order increased storage requirements to 1.18GB. The space used by the sorted dataset broke down as follows: 377.71MB of the space was dedicated to node storage: nodes, String objects, their underlying character arrays, and so on. 257.86MB was dedicated to indexes: Triple objects, Bunches, and their underlying arrays. The remainder was used by instances of rarely used classes in the system.
There are several culprits for wasted space in the memory model. The most interesting is overhead associated with small objects: assuming even the bare minimum per-object overhead of 16 bytes, this amounts to 95.4MB. It can amount to substantially more when considering extra space required for alignment.
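The 95.4MB figure follows directly from the object count reported by the profile. The short sketch below reproduces the arithmetic, taking a megabyte as 1024 × 1024 bytes, which is the convention that matches the quoted value; the class and constant names are illustrative only.

```java
/** Reproduces the per-object overhead estimate from the profiled object count. */
final class ObjectOverheadSketch {
    static final long OBJECT_COUNT = 6_253_509L; // objects reported after assertion
    static final long HEADER_BYTES = 16L;        // minimum per-object overhead assumed in the text
    static final double TOTAL_MB = 639.27;       // total footprint reported after assertion

    /** Header overhead in megabytes (1 MB = 1024 * 1024 bytes). */
    static double overheadMegabytes() {
        return OBJECT_COUNT * HEADER_BYTES / (1024.0 * 1024.0);
    }

    /** Header overhead as a percentage of the total footprint. */
    static double overheadPercent() {
        return 100.0 * overheadMegabytes() / TOTAL_MB;
    }

    public static void main(String[] args) {
        System.out.printf("%.1f MB, %.1f%% of footprint%n",
                overheadMegabytes(), overheadPercent());
    }
}
```

On these assumptions, object headers alone account for roughly 15% of the total footprint, before any alignment padding is considered.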
Secondly, the fact that nodes are not guaranteed to be normalised, combined with a relatively small, fixed node cache size, means that a lot of space is used storing duplicate nodes. Figure 4.4 indicates the number of nodes generated compared to those that exist in the dataset. The amount of memory used for storing nodes, even in this cached environment, validates the fully-normalised model.
Finally, there is the issue of empty space in the arrays that back both Hash and Array Bunches. Clearly, the fixed-size arrays of 4 or 9 elements will regularly be only partially full, and the Hash Bunches use at most 50% of their underlying array’s capacity. This overhead exists for a reason: over-filling hash maps or resizing arrays each time they are added to is costly.
[Figure 4.4: bar chart comparing Unique Node Count and Generated Node Count for URIs, Literals, and All nodes; count axis from 0 to 900,000]
Figure 4.4: Nodes generated in the Memory Model versus unique nodes that exist in the dataset
4.5 Summary
This chapter has shown that modern JVMs are entirely suitable hosts for high-performance DBMSs: they offer excellent performance with respect to CPU time, and have the potential to be compact. Within this statement, however, there are caveats: programs that are likely to generate large quantities of small objects, in particular, are poor candidates for JVMs. This is due to the cost in terms of memory space, and the strain placed on garbage collection, as shown in Section 4.4.
If attention is paid to the weaknesses described in this chapter, however, there are few barriers to the creation of a performant Java-based RDF store. This analysis informs the use of Java as the implementation language for the prototypes described in Chapter 6.
5 Examination of RDF Datasets
As previously emphasised in this document, the structure (or lack thereof) of RDF data remains a particular problem for efficient storage and retrieval. The commonalities that can be found in existing RDF datasets are not well understood, and it follows that understanding them in more depth would provide a substantial benefit to the development of high quality RDF storage systems. In order to inform such development, it was decided to create a tool to produce statistics on RDF documents.
This chapter describes the design and development of ExamineRDF, a tool created to produce detailed statistics over arbitrarily-sized RDF files. It is designed to require relatively little memory, scales linearly with the amount of data being processed, performs fast, append-only writes to disk, and reads from disk in large, contiguous chunks. Its only requirement is sufficient disk space to store its results during processing.
The chapter goes on to provide an explanation of the output of the ExamineRDF tool, and the use to which this output can be put. Finally, new statistics on a variety of popular RDF datasets are presented and analysed. This information offers insights into the compressibility of both triple data and the string sets found in RDF datasets, and provides much of the basis for the development of the new RDF data structure described in the following chapters.
5.1 ExamineRDF Design
ExamineRDF was created out of a desire to analyse popular RDF datasets such as DBpedia (Auer et al., 2007) and UniProt (Apweiler et al., 2004). DBpedia amounts to over 200 million triples, while UniProt is over three billion. Simply loading these datasets into an RDF store and extracting statistics using SPARQL queries would be impractical: it was found that just loading a 200 million triple set would take several hours on modern stores, and analytics would take much longer. Scaling this to UniProt, or even larger datasets, would not be practical. Loading the data into a store and querying it is nonetheless the approach taken by RDFStats (Langegger and Wöß, 2009), the only alternative RDF statistics generation system that the authors are aware of. While it produces very detailed information, RDFStats does not scale effectively to very large datasets, and does not support visualisation of results for human readers. As a result, the decision was made to build a custom system, the design of which is described in this section.