Performance Evaluation - Scalable Quality Assessment of Linked Data

The aim of this experiment is to assess the scalability of the Luzzu framework. Runtime is measured for the Stream, SPARK and SPARQL processors against datasets ranging from 10K to 125M triples. Since the main goal of the experiment is to measure the processors’ performance, having datasets with different quality problems is considered irrelevant at this stage. Furthermore, this evaluation avoids external time-affecting factors, so that the measured values relate solely to the performance of the implemented processors. For example, the Dereferenceability metric [49] requires consistent online access, which would inflate the overall computation time because of possible network latencies.

We generated synthetic datasets of different sizes using the Berlin SPARQL Benchmark (BSBM) V3.1 data generator⁹. We generated datasets with a scale factor of 24, 56, 128, 199, 256, 666, 1369, 2089, 2785, 28453, 70812, 284826, 357431, which translates into approximately 10K, 25K, 50K, 75K, 100K, 250K, 500K, 750K, 1M, 10M, 25M, 50M, 100M, and 125M triples respectively. Since these generated datasets have no relevant quality problems, it was not necessary to initialise any metric processors or to use them during the performance evaluation. The processing capability is not affected by the introduction of quality metrics in the framework, although once metrics are introduced, the overall computation time increases according to the time complexity of the initialised metric processors. Furthermore, the function of the dataset processors is to stream data from datasets (Stream and SPARK processors) or from an endpoint (SPARQL processor) to the metrics. The Stream and SPARQL processor tests were performed on a Unix-based machine with an Intel Core i5 2.4GHz and 4GB of RAM, whilst for the SPARK processor three worker clusters were set up.

How quality metrics do not affect the processors’ time complexity

In terms of software engineering, Luzzu employs a low degree of coupling between the modules. The framework’s modules pass data to each other but do not care about the inner workings of other modules. This ensures that third parties can define quality metrics and add them to the framework as needed. The dataset processors do not know what the processors for these metrics do, but the important aspect is that the dataset processors know that these metric processors exist and that data (i.e. triples) should be passed through them for assessment. Therefore, any initialised quality metric will not affect the dataset processor’s time complexity, as the dataset processors just read the triples and pass them to the metric processors, whereas the main method implemented by every metric processor (QualityMetric.compute(Quad), corresponding to Definition 4.2 in Section 4.3.2) merely receives triples/quads without caring where they have been obtained from (e.g. a file stream, a resilient distributed dataset (RDD), a SPARQL endpoint or an in-memory dataset). Therefore, the performance of the dataset processors themselves will be the same for any number of simple or complex metric processors. Nonetheless, quality metrics will affect the overall quality assessment running time. We show such results in Chapter 7, where we assess the performance of a number of probabilistic-based quality metrics implemented for Luzzu. However, this is out of the scope of this performance evaluation of the Luzzu framework.

⁹ BSBM is primarily used as a benchmark to measure the performance of SPARQL queries against large datasets; cf. http://wifo5-03.informatik.uni-mannheim.de/bizer/berlinsparqlbenchmark/spec/. Date Accessed

Figure 4.9: Time vs. Dataset Size in triples – Comparing Stream, SPARQL and SPARK dataset processors.
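The low coupling between dataset processors and metric processors can be illustrated with a minimal sketch. It is written in Python for brevity and is not Luzzu's actual code: the names `Triple`, `Metric`, `CountTriples`, `DistinctSubjects` and `stream_process` are hypothetical, and only the idea of a single `compute` entry point per metric is taken from the text.

```python
from typing import Iterable, List, Protocol, Set, Tuple

# Hypothetical stand-in for a triple/quad; a real framework would use an RDF library type.
Triple = Tuple[str, str, str]


class Metric(Protocol):
    """Hypothetical interface: the only thing a dataset processor knows about a metric."""

    def compute(self, triple: Triple) -> None: ...


class CountTriples:
    """Toy metric that counts the triples it receives."""

    def __init__(self) -> None:
        self.count = 0

    def compute(self, triple: Triple) -> None:
        self.count += 1


class DistinctSubjects:
    """Toy metric that collects the distinct subjects it has seen."""

    def __init__(self) -> None:
        self.subjects: Set[str] = set()

    def compute(self, triple: Triple) -> None:
        self.subjects.add(triple[0])


def stream_process(triples: Iterable[Triple], metrics: List[Metric]) -> int:
    """Read each triple once and hand it to every initialised metric.

    The processor never inspects what a metric does with the triple.
    Returns the number of triples read.
    """
    read = 0
    for triple in triples:
        read += 1
        for metric in metrics:
            metric.compute(triple)
    return read


data = [
    ("ex:a", "ex:p", "ex:b"),
    ("ex:a", "ex:q", "ex:c"),
    ("ex:b", "ex:p", "ex:c"),
]

# With no metrics initialised, the processor still makes exactly one pass over the data.
read_without_metrics = stream_process(data, [])

# With metrics initialised, the reading work is unchanged; only the metric work is added.
counter, subjects = CountTriples(), DistinctSubjects()
read_with_metrics = stream_process(iter(data), [counter, subjects])
```

In this sketch the processor's own work is one pass over the input regardless of which metrics are attached, while the total running time grows with the cost of each metric's `compute` call, which mirrors the distinction drawn above.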

Results

Figure 4.9 shows the time taken (in ms) to process datasets of different sizes. We normalised the values with a log (base 10) function on both axes to improve readability. All dataset processors scale linearly as the number of triples grows. The SPARQL processor did not respond in an acceptable time for datasets larger than 25M triples. The results also confirm the assumption that Big Data technologies such as Spark are not beneficial for smaller datasets, whilst processing data from a SPARQL endpoint takes more time than the other two processing approaches.
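When reading a plot with logarithmic axes, note that any power law t = c·n^k appears as a straight line with slope k, so linear scaling corresponds to a slope close to 1 rather than simply to a straight line. The following sketch estimates that slope with an ordinary least-squares fit. The timing values are made up for illustration and are not the measurements reported here; `loglog_slope` is a hypothetical helper name.

```python
import math
from typing import Sequence


def loglog_slope(sizes: Sequence[float], times_ms: Sequence[float]) -> float:
    """Least-squares slope of log10(time) against log10(size)."""
    xs = [math.log10(n) for n in sizes]
    ys = [math.log10(t) for t in times_ms]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    return covariance / variance


# Illustrative dataset sizes (in triples) and synthetic timings, NOT measured results.
sizes = [10_000, 100_000, 1_000_000, 10_000_000]
linear_times = [0.02 * n for n in sizes]        # t proportional to n
quadratic_times = [1e-7 * n * n for n in sizes]  # t proportional to n^2

linear_slope = loglog_slope(sizes, linear_times)
quadratic_slope = loglog_slope(sizes, quadratic_times)
```

A slope near 1 is consistent with linear scaling, whereas a slope near 2 would indicate quadratic growth even though both curves look straight on log-log axes.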

From these results, we also conclude that for up to 125M triples, the stream processor performs better than the SPARK processor. A cause for this difference is that the SPARK processor has to deal with the extra overhead needed to enqueue and dequeue triples on an external queue; however, as the number of triples increases, the performance of both processors converges. With an increasing number of metrics to be computed, the execution time of both processors increases but remains linear with regard to the size of the dataset.

Figure 4.10: Time vs. Dataset Size in triples – Comparing In-Memory processor against Stream processor. (Axes: Number of Triples against Time in Milliseconds; series: Stream Processor, Memory Processor.)

Following this experiment, we implemented an in-memory processor, which, rather than streaming triples directly from a dataset, loads the dataset under assessment into the available memory¹⁰. In Figure 4.10 we compare the two processors and see that, although faster, the in-memory processor does not significantly improve performance. When the dataset was larger than 250K triples, the stream processor fared better. One reason why this occurred is that the in-memory processor has the extra overhead of loading the dataset into memory. However, the main drawback of the in-memory processor is that the memory space has to be large enough to fit the dataset.
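The difference between the two strategies can be sketched as follows. This is an illustrative Python sketch under simplifying assumptions, not the Luzzu implementation: `parse_line` handles only a simplified one-triple-per-line format (not full N-Triples), and all function names are hypothetical.

```python
import io
from typing import Iterator, List, TextIO, Tuple

Triple = Tuple[str, str, str]  # hypothetical stand-in for a parsed triple


def parse_line(line: str) -> Triple:
    """Naive parser for lines of the form '<s> <p> <o> .' (simplified format)."""
    subject, predicate, obj = line.rstrip(" .\n").split(" ", 2)
    return (subject, predicate, obj)


def stream_triples(source: TextIO) -> Iterator[Triple]:
    """Streaming strategy: yield one triple at a time, holding only the current line."""
    for line in source:
        if line.strip():
            yield parse_line(line)


def load_triples(source: TextIO) -> List[Triple]:
    """In-memory strategy: materialise every triple before any assessment starts."""
    return list(stream_triples(source))


DATA = "<a> <p> <b> .\n<a> <q> <c> .\n"

# Streaming: triples are consumed as they are read.
streamed_count = sum(1 for _ in stream_triples(io.StringIO(DATA)))

# In-memory: the whole dataset must fit in memory, and loading happens up front.
in_memory = load_triples(io.StringIO(DATA))
```

Both strategies deliver the same triples; the in-memory variant pays the loading cost before assessment begins and needs space proportional to the dataset size, which matches the drawback noted above.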
