Evaluation Setup and Results - Assessment Tests for Statistical Linked Data

5.2 Assessment Tests for Statistical Linked Data

5.2.5 Evaluation Setup and Results

We evaluated the assessment tests on several Statistical Linked Data sets. We chose data sets from a range of providers in order to obtain meaningful results and to test the general applicability of our approach.

Evaluation Setup

We evaluated whether the assessment tests work with Statistical Linked Data and how well they perform. To this end, we ran each test on several Statistical Linked Data sets and computed precision, recall and F-measure scores. For Stage A, we evaluated whether the data elements of the packages were identified correctly and completely. For Stage B, we evaluated whether the dimension characteristics and the overlaps between instance values were found. We did not evaluate the data sets themselves.

For the evaluation, we use real-world Statistical Linked Data sets from the web (see also Section 2.2). We chose representative data sets from different data providers in order to obtain results that hold for statistical data in general. For each data provider with a large number of published data sets, we randomly chose ten data sets; from data providers with a smaller Linked Data offering, we took single data sets. Additionally, we ensured that the data sets are not all modelled and represented in RDF in the same way. This was necessary because the assessment tests rely heavily on the structure of the data.

Table 5.2 presents an overview of the chosen data sets. All data sets are available in either the OWL or RDF formats and are published as Linked Data. For each test, domain experts labelled the correct data elements. For Stage B, domain experts additionally created a reference alignment consisting of the corresponding pairs of schema elements. This information serves as the gold standard in our evaluation and is the basis for the computation of precision, recall and F-measure.

5 Data Matching for Published Linked Open Social Science Data

In accordance with [vR79], we define precision as the ratio of correctly detected data elements to all detected data elements. Recall is defined as the ratio of correctly detected data elements to all correct data elements (see Section 2.3.4 for more details).
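Written out as formulas, the definitions read as follows. The symbols D (the set of detected data elements) and G (the set of correct data elements in the gold standard) are introduced here for illustration only. The F-measure is given in its common balanced form, the harmonic mean of precision and recall; this is an assumption, since the exact variant used is specified in Section 2.3.4.

```latex
\begin{align*}
P &= \frac{|D \cap G|}{|D|}, &
R &= \frac{|D \cap G|}{|G|}, &
F &= \frac{2\,P\,R}{P + R}
\end{align*}
```

As a worked example under this assumption, P = 1 and R = 0.605 give F = 1.21 / 1.605 ≈ 0.754.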

Results

The following tables show the detailed results for all packages of Stages A and B on all data sets. Tables 5.3 and 5.4 show the results for the packages of Stage A (footnote 10). Since we evaluated ten data sets for each of the data providers Eurostat, World Bank Climates, data.gov, and OECD, Table 5.3 reports only the mean precision and recall. We observed, however, that in most cases precision and recall are either high, from about 0.75 to 1, or very low, from 0 to 0.25. In this evaluation, a low precision or recall means that the test did not retrieve satisfactory results, i.e. data elements were missed or incorrect data elements were detected. The primary reason lies in the prototypical implementation of the packages. Another reason may be the modelling and structure of the data set, i.e. missing information or insufficient annotation of data elements. In particular, the results of Packages A3.1 and A3.2, where the geographical and temporal dimensions had to be identified, show that the data can be modelled very differently. Although we defined rules based on a standardized way of representing these dimensions, this is not the only permitted way of modelling this part of the data. This impression is supported by the results for all data sets in Table 5.6, which lists the mean values for each stage.

Precision and recall values of 1 mean either that all data elements were identified correctly or that the element in question did not occur in the data set, which was the case for some data sets in the evaluation of Package A2. In the latter case, the assessment test was still successful: the element was not included in the data set and therefore, correctly, was not detected.
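The scoring convention described above, including the case where an element is absent and correctly not reported, can be sketched as follows. This is a minimal illustration, not the implementation used in the evaluation: the function names and the element names are hypothetical, and scoring an undefined ratio as 0 is an assumption.

```python
from statistics import mean


def precision_recall(detected, gold):
    """Return (precision, recall) for one test on one data set."""
    detected, gold = set(detected), set(gold)
    if not detected and not gold:
        # Convention from the text: nothing to find and nothing found counts as success.
        return 1.0, 1.0
    correct = len(detected & gold)
    # Assumption: a ratio with an empty denominator is scored as 0.
    precision = correct / len(detected) if detected else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


def provider_means(runs):
    """Average precision and recall over a list of (detected, gold) pairs."""
    scores = [precision_recall(detected, gold) for detected, gold in runs]
    return mean(p for p, _ in scores), mean(r for _, r in scores)


# Hypothetical element names, for illustration only.
runs = [
    ({"unit"}, {"unit"}),           # everything found
    (set(), set()),                 # element absent and not reported
    ({"unit", "title"}, {"unit"}),  # one false positive
]
mean_precision, mean_recall = provider_means(runs)
print(round(mean_precision, 3), round(mean_recall, 3))
```

Averaging per data set, as sketched here, corresponds to reporting only the means per data provider.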

Package    Eurostat              World Bank Climates   data.gov              OECD
           Prec.      Rec.       Prec.      Rec.       Prec.      Rec.       Prec.      Rec.
A1         0.75       1          1          1          0.695      0.833      1          0.931
A2         1          1          1          1          1          1          1          1
A3         0.794      1          0.724      1          0.898      1          0.712      1
A3.1       0.633      1          0.217      0.5        0.5        0.5        0.369      1
A3.2       0.5        0.5        1          1          1          0.972      0.95       1
A4         1          1          1          1          0.695      0.833      1          0.931
Stage A    0.779      0.917      0.824      0.917      0.798      0.856      0.839      0.977

Table 5.3: Results of Stage A.

Footnote 10: Prec. = Precision, Rec. = Recall.


Package    Global Hunger Index   EnAKTing Energy       ISTAT                 data.gov.uk
           Prec.      Rec.       Prec.      Rec.       Prec.      Rec.       Prec.      Rec.
A1         0.333      1          1          1          1          1          0.5        1
A2         0.2        1          1          1          1          1          1          1
A3         0.778      1          0.625      1          0.833      1          0.875      1
A3.1       0          0          0          0          0          0          0.333      1
A3.2       0.25       1          0          0          0          0          0.5        1
A4         1          1          1          1          1          1          1          0.5
Stage A    0.427      0.833      0.604      0.667      0.639      0.667      0.701      0.917

Table 5.4: Results of Stage A (continued).

Table 5.5 shows the results of the evaluation of Stage B. Since this stage always analyses two data sets together, and the pairwise combination of all data sets used in this evaluation would lead to a large number of tests, we restricted the pairs examined to the data sets of the four larger data providers: Eurostat, World Bank Climates, data.gov, and OECD. The results of Stage B are independent of those of Stage A, i.e. even if no similar dimensions were identified in Stage A, potential pairs of dimensions between two data sets may still be identified in Stage B on the basis of their instance values.

Between data sets of a single data provider, precision and recall are always 1 because the data is modelled in the same way, i.e. the same instance values are always represented identically. Precision is 1 in all tests, since all detected similar instance values were correct. However, the low recall for some tests shows that in a considerable number of cases not all correct overlapping instance values (B1) and not all cases of no overlap between instance values (B2) were detected. We observed that it is easy for the algorithm to detect similar or different instance values when the instances are represented using the same code or pattern. If this is not the case, correct instance overlaps may be missed. The most prominent cases in the evaluation involved different country lists and, in the case of World Bank Climates, a completely different coded representation of dates. The same observation holds for the detection of no overlaps between instance values: these were missed when the instance values were coded differently. The results in Table 5.5 show that Packages B1 and B2 yield similar results and that it is reasonable to combine them into a single task.
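The effect of differently coded instance values on recall can be illustrated with a small sketch. It assumes a purely exact-match comparison, which is a simplification and not necessarily how Packages B1 and B2 are implemented; the function names and the sample values are hypothetical. The country codes follow ISO 3166-1 alpha-2 and alpha-3.

```python
def exact_overlap(values_a, values_b):
    """Instance values that occur in both data sets with identical coding."""
    return set(values_a) & set(values_b)


def overlap_scores(values_a, values_b, gold_pairs):
    """Score exact-match overlaps against a reference alignment.

    gold_pairs is a set of (value_in_a, value_in_b) pairs that denote
    the same real-world instance according to the gold standard.
    """
    detected = {(value, value) for value in exact_overlap(values_a, values_b)}
    correct = detected & gold_pairs
    # Assumption: an empty denominator is scored as 1 (nothing to get wrong).
    precision = len(correct) / len(detected) if detected else 1.0
    recall = len(correct) / len(gold_pairs) if gold_pairs else 1.0
    return precision, recall


# Data set A codes countries with two letters, data set B with three.
values_a = {"DE", "FR", "2010"}
values_b = {"DEU", "FRA", "2010"}
gold_pairs = {("DE", "DEU"), ("FR", "FRA"), ("2010", "2010")}

precision, recall = overlap_scores(values_a, values_b, gold_pairs)
print(precision, round(recall, 3))
```

Every overlap that is found is correct, so precision stays at 1, while the two country pairs with different codes are missed and recall drops to one third.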

Package               Eurostat              World Bank Climates   data.gov              OECD
                      Prec.      Rec.       Prec.      Rec.       Prec.      Rec.       Prec.      Rec.
B1
Eurostat              1          1          1          0.5        1          0.4        1          0.5
World Bank Climates   -          -          1          1          1          0.3        1          0.1
data.gov              -          -          -          -          1          1          1          0.25
OECD                  -          -          -          -          -          -          1          1
B2
Eurostat              1          1          1          0.5        1          0.4        1          0.5
World Bank Climates   -          -          1          1          1          0.3        1          0.1
data.gov              -          -          -          -          1          1          1          0.25
OECD                  -          -          -          -          -          -          1          1
Stage B
Eurostat              1          1          1          0.5        1          0.4        1          0.5
World Bank Climates   -          -          1          1          1          0.3        1          0.1
data.gov              -          -          -          -          1          1          1          0.25
OECD                  -          -          -          -          -          -          1          1

Table 5.5: Results of Stage B.

Table 5.6 summarizes the results of the evaluation. We calculated the mean values for each stage and package, which highlights the strengths and weaknesses of the assessment tests. As already discussed, the identification and extraction of information on data elements generally works well. However, depending on the structure and semantic richness of the individual data sets, the tests may not achieve the desired results. While dimensions can be identified in general, the detection of geographical and temporal dimensions in particular (Packages A3.1 and A3.2) can be improved, since the underlying rules cannot be defined clearly. The use of several heterogeneous data sets in the evaluation shows that the approach can be generalized and that it can be executed on different Linked Data sets. We discuss the observed limitations in detail in Section 5.2.6.

5.2.6 Discussion and Limitations

The evaluation has shown that detecting the data elements in Statistical Linked Data is challenging because data elements are not labelled consistently and instance values do not follow consistent patterns. This complicates the definition of assessment rules. However, the results are promising.

The complexity and extent of the data modelling vary considerably. Some providers deliver additional information about units, populations, provenance, etc., but this is not always the case. In most cases, this is not a problem of the RDF representation of the statistical data. It is often due to the originally published data format, which frequently does not include such information directly. All examined data sets are, more or less accurately, modelled according to the Linked Data principles. Therefore, a lot of additional information about dimensions, etc. is encoded in the URIs. Currently, the implementation does not query the URIs in a data set in order to retrieve more information. This hinders the full identification of data characteristics as intended in Stage B and thus complicates the data comparison in general.


Stage / Package   All Data Sets
                  Precision   Recall   F-Measure
A1                0.785       0.971    0.864
A2                0.9         1        0.947
A3                0.779       1        0.876
A3.1              0.257       0.5      0.339
A3.2              0.525       0.684    0.594
A4                0.961       0.908    0.934
Stage A           0.701       0.844    0.766
B1                1           0.605    0.754
B2                1           0.605    0.803
Stage B           1           0.605    0.803

Table 5.6: Summarized results for Stages A and B.

The results also revealed that a single observation sometimes contains more than one date. For example, data about schools from data.gov.uk includes various dates such as the ‘opening date’ or the ‘date of the last welfare visit’. This complicates the automatic detection of temporal dimensions because there may be more than one correct solution and because research interests differ. Although in a well-formed data set all of these dates are accompanied by specific XML datatype properties, the semantics behind the dates, e.g. as defined in the underlying schema, still have to be taken into account.
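The ambiguity can be made concrete with a small sketch that selects temporal candidates purely by datatype. It assumes that date literals are typed with XML Schema datatypes; the observation layout, the property URIs under example.org and the function name are hypothetical and do not reflect the actual data.gov.uk schema.

```python
XSD = "http://www.w3.org/2001/XMLSchema#"
# XML Schema datatypes that are treated as temporal in this sketch.
TEMPORAL_TYPES = {XSD + "date", XSD + "dateTime", XSD + "gYear", XSD + "gYearMonth"}


def temporal_candidates(observation):
    """Return all properties of an observation whose literal is date-typed.

    observation maps a property URI to a (lexical value, datatype URI) pair.
    """
    return sorted(
        prop
        for prop, (_, datatype) in observation.items()
        if datatype in TEMPORAL_TYPES
    )


# Hypothetical observation about a school.
EX = "http://example.org/def/"
observation = {
    EX + "openingDate": ("1998-09-01", XSD + "date"),
    EX + "lastWelfareVisit": ("2011-03-14", XSD + "date"),
    EX + "capacity": ("420", XSD + "integer"),
}

candidates = temporal_candidates(observation)
print(candidates)
```

The datatype check yields two equally plausible candidates, so choosing the temporal dimension still requires the semantics of the properties.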

In order to suggest possible ways of making data sets comparable, the information on dimensions must be very detailed and include, for example, whether a dimension has a hierarchical structure. The structure of NUTS levels, for instance, may be useful for aggregating data between different levels. This can be a solution if one data set is available at NUTS level 2 and the other at NUTS level 1. From a scientific point of view, this may entail a loss in data quality, but it can at least help researchers gain an initial insight into the data.
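Such an aggregation can be sketched as follows. The sketch relies on the hierarchical structure of NUTS codes, in which a NUTS level 1 code is the first three characters of each NUTS level 2 code below it. It assumes an additive measure such as a count, for which summing is meaningful; the function name and the values are invented for illustration.

```python
from collections import defaultdict


def aggregate_nuts2_to_nuts1(values_by_nuts2):
    """Sum values given per NUTS level 2 region up to NUTS level 1.

    Assumes an additive measure (e.g. a count); rates or averages
    would need a weighted aggregation instead.
    """
    totals = defaultdict(float)
    for code, value in values_by_nuts2.items():
        if len(code) != 4:
            raise ValueError(f"not a NUTS level 2 code: {code!r}")
        totals[code[:3]] += value  # e.g. DE11 -> DE1
    return dict(totals)


# Invented values for three German NUTS level 2 regions.
nuts2_values = {"DE11": 10.0, "DE12": 5.0, "DE21": 7.0}
nuts1_values = aggregate_nuts2_to_nuts1(nuts2_values)
print(nuts1_values)
```

After the aggregation, both data sets are available at NUTS level 1 and can be compared region by region, at the cost of the finer regional detail.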

The naming of the property and class types in a data set is also important for the detection of values and dimensions. The more standardized vocabularies (e.g. Data Cube vocabulary [CRT14], SCOVO [HHR+09], Dublin Core [Ini]) are used, or the more generic and machine-interpretable the naming conventions of the URIs are, the easier automatic detection becomes. A promising approach, especially as a preprocessing step for Packages B1 and B2, is the use of link discovery tools (e.g. Silk [VBGK09], SERIMI [AHSdV11]) and ontology/schema matching systems that also match instances, such as PARIS [SAS11] or COMA++ [EM07]. Such tools can detect links between dimensions or between the precise values of their instances. More powerful techniques for this purpose are discussed in Sections 5.3 and 5.4.
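How a standardized vocabulary simplifies detection can be sketched with the Data Cube vocabulary and the SDMX dimension properties refPeriod and refArea. The rule below is a hypothetical illustration and not the rule set of Packages A3.1 and A3.2; the triples are plain tuples, and the example.org properties and the function name are invented.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_SUBPROPERTY_OF = "http://www.w3.org/2000/01/rdf-schema#subPropertyOf"
QB = "http://purl.org/linked-data/cube#"
SDMX_DIMENSION = "http://purl.org/linked-data/sdmx/2009/dimension#"


def classify_dimensions(triples):
    """Map each qb:DimensionProperty to 'temporal', 'geographical' or 'other'.

    Hypothetical rule: a dimension is classified by the SDMX property it is,
    or is declared a sub-property of.
    """
    dimensions = {
        s for s, p, o in triples
        if p == RDF_TYPE and o == QB + "DimensionProperty"
    }
    parents = {s: o for s, p, o in triples if p == RDFS_SUBPROPERTY_OF}
    kinds = {}
    for dimension in dimensions:
        target = parents.get(dimension, dimension)
        if target == SDMX_DIMENSION + "refPeriod":
            kinds[dimension] = "temporal"
        elif target == SDMX_DIMENSION + "refArea":
            kinds[dimension] = "geographical"
        else:
            kinds[dimension] = "other"
    return kinds


# Invented schema triples for illustration.
EX = "http://example.org/def/"
triples = [
    (EX + "year", RDF_TYPE, QB + "DimensionProperty"),
    (EX + "year", RDFS_SUBPROPERTY_OF, SDMX_DIMENSION + "refPeriod"),
    (EX + "country", RDF_TYPE, QB + "DimensionProperty"),
    (EX + "country", RDFS_SUBPROPERTY_OF, SDMX_DIMENSION + "refArea"),
    (EX + "sex", RDF_TYPE, QB + "DimensionProperty"),
    (EX + "population", RDF_TYPE, QB + "MeasureProperty"),
]
kinds = classify_dimensions(triples)
print(kinds)
```

A data set that declares neither the dimension type nor a link to a standard property gives such a rule nothing to work with, which is consistent with the low scores reported for some data sets.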

With the execution of Packages A1 to B2, all of our assessment tests are complete. After a successful execution, we are able to determine the special characteristics and dimensional coverage of a data set with little effort. We are also able to detect whether two data sets are suitable for a comparative analysis or in which dimensions problems may occur. However, the detection and extraction of the necessary information from data sets is not trivial and can be improved. Additionally, the results of the evaluation of Stage B have shown that it matters how the instance values are coded, i.e. whether they are coded differently in the two data sets. Since information on dimensions lies in the properties of Linked Data sets (i.e. the schema elements of the data sets that are to be matched), we focus on matching these properties in the following sections.
