Dereferenceability Reservoir Sampling vs Stratified Sampling

7.3 Metric Analysis and Experiments

7.3.2 Dereferenceability Reservoir Sampling vs Stratified Sampling

In Table 7.2 we also compare the reservoir sampling approach and the stratified sampling approach with regard to running time and metric value. For the stratified sampling, we use the same parameters as in the first approach. For both dataset sizes, the time taken by the second approach is significantly lower than that of the first approach. Furthermore, the values of the second approach are closer to the actual value. These two datasets therefore show the benefits of the stratified approach, namely a lower running time and more accurate metric values.

The improvement in running time can be attributed to the total number of HTTP GET requests performed. In the first approach, if we take P2, the maximum number of HTTP GET requests is 50000, since each PLD (at most 50 in the case of P2) can have at most 1000 data objects to dereference. The second approach, on the other hand, performs at most 1000 HTTP GET requests, where these 1000 requests represent a proportional sample of all data objects. To explain this further, consider the first dataset evaluated, the Learning Analytics and Knowledge (LAK) dataset. Table 7.6 shows the occurrences of each PLD in the subject and/or object of the LAK dataset's triples.

Domain                                     Occurrence
http://data.linkededucation.org            129832
http://dbpedia.org                         2602
http://purl.org                            1
http://dblp.l3s.de                         668
http://data.semanticweb.org                216
http://lak12.sites.olt.ubc.ca              1
http://lakconference2013.wordpress.com     1
http://www.ifets.info                      1
https://tekri.athabascau.ca                1
http://lak14indy.wordpress.com             2
http://ceur-ws.org                         2

Table 7.6: PLDs and the corresponding occurrences in the subject and/or object of the LAK dataset's triples.
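The request bounds discussed above can be checked with a few lines of arithmetic. The sketch below is illustrative only and is not the implementation used in the experiments. It assumes one HTTP GET request per sampled resource (followed redirects would add more), a global reservoir of 50 PLDs with a per-PLD reservoir of 1000 as in P2, and it treats the occurrence counts of Table 7.6 as the number of resources per PLD. The function names are hypothetical.

```python
# Occurrences per PLD, copied from Table 7.6 (LAK dataset).
LAK_PLD_COUNTS = {
    "http://data.linkededucation.org": 129832,
    "http://dbpedia.org": 2602,
    "http://purl.org": 1,
    "http://dblp.l3s.de": 668,
    "http://data.semanticweb.org": 216,
    "http://lak12.sites.olt.ubc.ca": 1,
    "http://lakconference2013.wordpress.com": 1,
    "http://www.ifets.info": 1,
    "https://tekri.athabascau.ca": 1,
    "http://lak14indy.wordpress.com": 2,
    "http://ceur-ws.org": 2,
}


def max_requests_reservoir(pld_counts, global_size=50, pld_size=1000):
    """Upper bound for approach one (assumed model).

    At most `global_size` PLDs are kept and each kept PLD contributes
    at most `pld_size` resources. Taking the largest PLDs first yields
    the worst case, whichever PLDs the sampler actually keeps.
    """
    kept = sorted(pld_counts.values(), reverse=True)[:global_size]
    return sum(min(count, pld_size) for count in kept)


def max_requests_stratified(pld_counts, sample_size=1000):
    """Nominal upper bound for approach two: one proportional sample."""
    return min(sample_size, sum(pld_counts.values()))


print(50 * 1000)                                # 50000, theoretical maximum for P2
print(max_requests_reservoir(LAK_PLD_COUNTS))   # 2893
print(max_requests_stratified(LAK_PLD_COUNTS))  # 1000
```

Under these assumptions the LAK dataset stays far below the theoretical maximum of 50000 requests for the first approach, because only two of its eleven PLDs exceed the per-PLD cap of 1000.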


SOTON     Approach One             Approach Two
          Time (s)     Value       Time (s)     Value
P1        242.056      0.02092     129.180      0.02697
P2        242.530      0.02335     163.603      0.02597
P3        638.067      0.01618     327.909      0.02170
P4        647.717      0.01354     333.741      0.02537
P5        1043.133     0.01392     526.649      0.02385
P6        1039.038     0.01452     528.570      0.02546

WN        Approach One             Approach Two
          Time (s)     Value       Time (s)     Value
P1        146.576      0.915       85.711       0.913
P2        137.895      0.905       141.599      0.916
P3        347.668      0.92        339.658      0.91166
P4        351.158      0.78633     335.209      0.914
P5        *            *           *            *
P6        *            *           *            *

Sweto     Approach One             Approach Two
          Time (s)     Value       Time (s)     Value
P1        5345.065     0.00229     113.744      0.0
P2        7544.902     0.00720     282.019      0.0
P3        5225.544     0.00801     237.847      0.0
P4        14739.160    0.03423     254.679      0.0
P5        7704.807     0.00371     430.79       0.0
P6        6882.000     0.00514     687.682      0.0

XBRL      Approach One             Approach Two
          Time (s)     Value       Time (s)     Value
P1        1883.181     0.04595     1912.185     0.034
P2        2102.561     0.03396     2538.540     0.034
P3        2083.078     0.03465     2711.345     0.032
P4        2288.651     0.02965     2714.614     0.033
P5        2295.527     0.02919     2344.056     0.0274
P6        2529.421     0.03059     2897.722     0.0248

Table 7.7: Dereferenceability values with different parameter settings to compare reservoir sampling (approach one) and stratified sampling (approach two).

data.linkededucation.org and dbpedia.org PLDs, and all other resources with the different PLDs. This approach does not take into consideration the difference in proportion between data.linkededucation.org and dbpedia.org. Hence, if the dbpedia.org resources were not dereferenceable, the quality value would have been affected, as the “statistical weight” given to dbpedia.org is the same as that given to data.linkededucation.org.

On the other hand, the proportional sample obtained with the stratified sampler looks as follows (the numbers denote the total number of resources with that PLD in the final sample):

• http://data.linkededucation.org - 974
• http://dblp.l3s.de - 5
• http://data.semanticweb.org - 2
• http://dbpedia.org - 20

We can observe that based on a sample of 1000, data.linkededucation.org will take the largest chunk, followed by a considerably smaller dbpedia.org. The second approach dereferences just the resources in the final proportionate sample.
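The proportional sample listed above can be reproduced with a simple proportionate allocation. The following is a minimal sketch rather than the implementation used in the experiments: it assumes that each PLD's share of the sample is its fraction of all occurrences in Table 7.6, rounded to the nearest integer. The names `proportional_allocation` and `LAK_PLD_COUNTS` are hypothetical.

```python
# Occurrences per PLD, copied from Table 7.6 (LAK dataset).
LAK_PLD_COUNTS = {
    "http://data.linkededucation.org": 129832,
    "http://dbpedia.org": 2602,
    "http://purl.org": 1,
    "http://dblp.l3s.de": 668,
    "http://data.semanticweb.org": 216,
    "http://lak12.sites.olt.ubc.ca": 1,
    "http://lakconference2013.wordpress.com": 1,
    "http://www.ifets.info": 1,
    "https://tekri.athabascau.ca": 1,
    "http://lak14indy.wordpress.com": 2,
    "http://ceur-ws.org": 2,
}


def proportional_allocation(pld_counts, sample_size=1000):
    """Return the number of sampled resources per PLD (stratum).

    Assumption: each share is sample_size * (count / total), rounded
    to the nearest integer; PLDs whose share rounds to 0 are dropped.
    """
    total = sum(pld_counts.values())
    allocation = {}
    for pld, count in pld_counts.items():
        share = round(sample_size * count / total)
        if share > 0:
            allocation[pld] = share
    return allocation


for pld, share in proportional_allocation(LAK_PLD_COUNTS).items():
    print(pld, share)
# http://data.linkededucation.org 974
# http://dbpedia.org 20
# http://dblp.l3s.de 5
# http://data.semanticweb.org 2
```

Because every share is rounded independently, the shares add up to 1001 rather than exactly 1000, which agrees with the four figures listed above. PLDs whose share rounds to zero contribute no resources to the sample.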

In Table 7.7 we compare the time taken and quality values of both approaches. The stratified sampler fares better with regard to time for the Southampton ECS E-Prints and the Sweto DBLP datasets, whilst the reservoir sampler was slightly better for the XBRL dataset. For the WordNet dataset there was no difference between the two approaches. This is because the dataset had only one PLD (www.w3.org) used in all resources, so there is no advantage in using the second approach over the first one in such cases, and the second method yields results comparable to those of the first. However, the host (W3C) blacklisted our IP address during the assessment, after we tried to dereference a large number of URIs pertaining to the www.w3.org PLD.

In the Sweto DBLP dataset we notice that the stratified sampler approach gave a quality value of 0 in all cases, whilst for the reservoir sampler we got values that are less than 0.009 (and in one case 0.03). Upon further investigation, we found that the dataset contained around 5044 unique PLDs extracted from the subject and object of the dataset’s triples. In this regard, the global reservoir parameters set for both approaches were very low, and therefore both approaches miss the majority of PLDs. Among the 5044 unique PLDs, a large concentration of URIs had the same PLD, namely dblp.uni-trier.de (≈ 8.7M URIs) and www.informatik.uni-trier.de (≈ 2.8M URIs). The rest, ≈ 39K, is split between the other 5042 PLDs. The third highest PLD, www.springer.de, had 5030 resources in a 100M triple dataset, followed by www.computer.org with 1392 resources and www.acm.org with 890 resources.

The reservoir sampler takes into consideration a sample of resources from all PLDs. Therefore, the increase of the global reservoir parameter to cover all PLDs (e.g. in P2 - 50 ↦ 5050) might change the overall quality value. On the other hand, this increase does not necessarily mean improvement on the quality value for the alternative approach.

For the second approach, the ≈ 39K data objects are relatively few compared to the objects pertaining to the two “largest” PLDs and to the total number of data objects. Therefore, even if we consider a larger global reservoir of 5050 (to cover all unique PLDs in this dataset), the PLD reservoir sampler of 1000 still represents a relatively small portion of PLDs. With a sampler of 1000, only the two largest populations (i.e. dblp.uni-trier.de and www.informatik.uni-trier.de) are represented. Enlarging the sampler would make the population sample more representative: the “larger” PLDs would have more resources represented, together with some resources from smaller PLDs, thus maintaining the ratio of the population sample. With a reservoir sampler of 10000 we expect resources from 5 PLDs to be represented, with 50000 from 21 PLDs, with 100000 from 60 PLDs, and so on. The size of the reservoir sampler is the main trade-off of this approach: a larger reservoir sampler means more accurate results at the expense of more time. However, when we increased the value of both parameters and re-ran the experiments, the quality value was still 0. Upon further inspection, we manually checked a number (around 30) of resources for their dereferenceability. The top two PLDs had no resource that was dereferenceable (Listing 7.1 shows how a manual dereferenceability check is done using just a terminal). This may imply that around 99.65% of the resources in this dataset are non-dereferenceable. However, this can only be confirmed by assessing the dataset using the non-approximate metric.
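The effect of the sampler size on the number of represented PLDs can be illustrated with the approximate figures reported for the Sweto DBLP dataset. This is a rough sketch under stated assumptions: only the five PLDs named in the text are known, the total is taken as 8.7M + 2.8M + 39K URIs, and a PLD counts as represented when its proportional share rounds to at least one resource. The resulting count is therefore a lower bound, and the names used are hypothetical.

```python
# Approximate URI counts per PLD as reported in the text (Sweto DBLP).
KNOWN_PLDS = {
    "dblp.uni-trier.de": 8_700_000,
    "www.informatik.uni-trier.de": 2_800_000,
    "www.springer.de": 5030,
    "www.computer.org": 1392,
    "www.acm.org": 890,
}

# Assumed total: the two largest PLDs plus the ~39K URIs of all other PLDs.
TOTAL_URIS = 8_700_000 + 2_800_000 + 39_000


def represented_plds(sample_size):
    """List the known PLDs whose proportional share rounds to >= 1."""
    return [
        pld
        for pld, count in KNOWN_PLDS.items()
        if round(sample_size * count / TOTAL_URIS) >= 1
    ]


for size in (1000, 10000):
    print(size, len(represented_plds(size)))
# 1000 2
# 10000 5
```

With these approximate counts, a sampler of 1000 covers only the two largest PLDs, while a sampler of 10000 also reaches www.springer.de, www.computer.org and www.acm.org.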

$ curl -H "Accept: application/rdf+xml" -I -L "http://www.informatik.uni-trier.de/~ley/db/books/collections/kim95.html#DittrichD95"
HTTP/1.1 301 Moved Permanently
Date: Wed, 10 Aug 2016 08:28:08 GMT
Server: Apache
Location: http://dblp.uni-trier.de/db/books/collections/kim95.html
Content-Type: text/html; charset=iso-8859-1

HTTP/1.1 200 OK
Date: Wed, 10 Aug 2016 08:25:12 GMT
Server: Apache/2.4.7 (Ubuntu)
Set-Cookie: dblp-view=y; path=/; expires=Fri, 09 Sep 2016 10:25:12 +0200
Vary: Accept-Encoding
Cache-Control: max-age=28800
Expires: Wed, 10 Aug 2016 16:25:12 GMT
Content-Type: text/html; charset=utf-8

Listing 7.1: A manual cURL request and response to check if a resource is dereferenceable.
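The manual check in Listing 7.1 can also be scripted. The sketch below is an assumption-laden illustration, not the metric implementation evaluated in this chapter: it treats a resource as dereferenceable only if the final response (after redirects) is 200 OK with an RDF media type, and the set of accepted media types as well as the function names are hypothetical choices.

```python
import urllib.request

# Assumed set of media types that count as RDF for this check.
RDF_MEDIA_TYPES = {
    "application/rdf+xml",
    "text/turtle",
    "application/n-triples",
    "application/ld+json",
}


def looks_dereferenceable(status_code, content_type):
    """Classify a final HTTP response (assumed criterion, see lead-in)."""
    media_type = content_type.split(";", 1)[0].strip().lower()
    return status_code == 200 and media_type in RDF_MEDIA_TYPES


def check_uri(uri, timeout=10):
    """Request `uri` asking for RDF/XML and classify the final response.

    urllib follows redirects on its own, so the status and headers
    inspected here belong to the last response in the chain.
    """
    request = urllib.request.Request(
        uri, headers={"Accept": "application/rdf+xml"}, method="HEAD"
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            content_type = response.headers.get("Content-Type", "")
            return looks_dereferenceable(response.status, content_type)
    except OSError:
        # Covers HTTP errors (4xx/5xx), unreachable hosts and timeouts.
        return False


# Final response of Listing 7.1: 200 OK, but HTML instead of RDF.
print(looks_dereferenceable(200, "text/html; charset=utf-8"))  # False
print(looks_dereferenceable(200, "application/rdf+xml"))       # True
```

Applied to the response in Listing 7.1, this criterion classifies the resource as non-dereferenceable, since the server answers with text/html despite the RDF/XML Accept header.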

To conclude, with the stratified sampler we can reduce the runtime even further than with the reservoir sampler, whilst still getting acceptable (in some cases even better) estimated values. We observed that this approach might suffer when a dataset has a large set of unique PLDs and a large number of URIs in a small set of these PLDs (as in the case of the Sweto DBLP dataset). In such cases, a larger PLD sampler (i.e. the unique sampler holding URIs with the same authoritative domain name) is required in order to have a more accurate proportionate allocation.

[Figure: logarithmic time in seconds (10^x, from 10^0 to 10^5) per dataset (LAK 75K, LSAO 270K, S'OTON 1M, WN 2M, SWETO 15M, S-XBLR 100M) for the metrics Extensional Conciseness, Approximate Extensional Conciseness (P3), Clustering Coefficiency, Approximate Clustering Coefficiency (P1), Link External Data Providers, and Approximate Link External Data Providers (P2).]

Figure 7.2: Runtime of metrics vs. datasets.

In document Scalable Quality Assessment of Linked Data (Page 112-115)