3.2 Prototypical Semantic Data Library for the Social Sciences
3.2.3 Framework for a Semantic Data Library
In this section, we introduce a generic framework for a Semantic Data Library that addresses the key challenges for semantic library services providing survey and statistical data in the social sciences. The framework is oriented towards the goals of Semantic Digital Libraries [KM09a], but focuses on research data. Because we concentrate on the retrieval, integration and analysis processes, our approach differs from the concept of Semantic Digital Archives: we do not address typical archiving services such as long-term preservation or data curation. The framework is composed of modules for identifying and exchanging, searching and integrating, and evaluating and publishing data. In this way, we address the main obstacles to reusing statistical or survey data in the social sciences. Figure 3.1 depicts an overview of the architecture of the framework.
The framework consists of seven key modules, each of which covers specific aspects of the path from publishing data as Linked Data to the graphical preview, statistical analysis and export of data. These modules are: (1) Common Identifier Format, (2) Common Data Exchange Format, (3) Retrieval of Data, (4) Linking and Integration, (5) Graphical Data Preview, (6) Data Analysis and (7) Export and Referencing. The first two modules are contained in the depicted LOD wrapper, which exposes data as Linked Data. The data can then be queried via SPARQL. Based on the dimensions and measures exposed by the SPARQL queries, users can link and integrate data according to their research intentions. This means that multiple data sets are not linked automatically during retrieval with SPARQL queries. Instead, it is left to the user to decide which data sets and which of their dimensions should be linked and integrated. This decision was made together with the domain experts from the research data centres working on this prototype, because research questions can vary massively. The integrated data can be previewed graphically (e.g. as a line diagram), analysed with simple statistical calculations, and exported in common formats for use in additional tools. Users can access the Semantic Data Library from a web browser, but a connection to other tools capable of data analysis, such as statistical tools or the Vizgr [HZSM12] tool, is also possible. The different modules are described in detail in the following paragraphs.
¹² http://zacat.gesis.org/
¹³ http://panel.gsoep.de/soepinfo/
¹⁴ http://www.diw.de/en/soep
¹⁵ http://www.graphpad.com/quickcalcs/index.cfm
3 Linked Data Consumption in the Social Sciences
Figure 3.1: Framework for a Semantic Data Library.
Common Identifier Format. The identification of data sets, measurements or dimensions is important for a variety of reasons. On the data level, a unique identifier enables referencing the data set itself. Referencing is crucial for making data sets citable in scientific publications, thereby providing valuable metadata about the scientific work. For this purpose, identifiers such as DOIs, URNs or Handles are typically used. According to the Linked Data principles, however, the use of URIs, a core ingredient of Semantic Web technologies, is preferred because they can be resolved via HTTP without the need for an extra resolver. After discussions regarding the use of DOIs as identifiers for Linked Data¹⁶, CrossRef¹⁷, the official DOI link registration agency for scholarly and professional publications, announced that DOIs can be used as HTTP URIs with content negotiation¹⁸.
¹⁶ Exemplary discussions can be found at http://www.crossref.org/CrossTech/2010/03/dois_and_linked_data_some_conc.html and http://bitwacker.com/2010/01/19/the-doi-datacite-and-linked-data-made-for-each-other/
¹⁷ http://crossref.org/
Within the data, the Linked Data identifiers provide a way to identify the semantics of dimensions, measures and observations as well as detailed metadata information. URIs fulfil this requirement. With respect to the integration and aggregation of data sets, particularly the semantics of the dimensions is of interest.
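The use of a DOI as an HTTP URI with content negotiation, as mentioned above, can be illustrated with a minimal sketch. It only builds the HTTP request and does not send it. The DOI `10.1234/example-dataset` is hypothetical, and the choice of `text/turtle` as the requested media type is an assumption; which RDF serialisations are actually served depends on the registration agency.

```python
from urllib.request import Request


def build_doi_request(doi: str, media_type: str = "text/turtle") -> Request:
    """Build (but do not send) a request that asks the DOI resolver for a
    machine-readable representation via the HTTP Accept header."""
    url = f"https://doi.org/{doi}"
    return Request(url, headers={"Accept": media_type})


# Hypothetical DOI, used only to show the shape of the request.
request = build_doi_request("10.1234/example-dataset")
print(request.full_url)              # the DOI expressed as an HTTP URI
print(request.get_header("Accept"))  # the requested serialisation
```

Passing the request to `urllib.request.urlopen` would perform the actual lookup; this is left out here so that the sketch runs without network access.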
Common Data Exchange Format. There are a few well-established and proven formats for statistical calculations. Among others, Excel spreadsheets and the native formats of SPSS, SAS, Stata and R are used to exchange data, including the respective formulas. Unfortunately, these formats are proprietary (locked) and/or binary, which makes it difficult to transform data seamlessly from one format to another. Moreover, these well-known formats do not describe their data expressively enough to deliver self-explanatory data via metadata. For the purpose of a data library for the social sciences, it is necessary to integrate various heterogeneous data sources and to perform calculations directly on data or on aggregated items coming from these sources. To enable such direct calculations, we are interested in self-explanatory or self-descriptive data sources that deliver generic structures which can be processed further semantically. Thus, we aim for annotated or metadata-enriched data formats that promote easy exchange, integration and annotation of data from many heterogeneous sources. These requirements are well met by the Data Cube format [CRT14], since it (a) is an open, non-proprietary metadata model in the RDF format, (b) is largely based on the established SDMX information model [SDM02] and also includes other vocabularies, and (c) provides a semantic and self-descriptive annotation of the data. Given these advantages, it is likely that this metadata model will be supported by established statistics packages or that converter programs will be developed. The advantages of Data Cube foster adoption by practitioners and facilitate easy deployment and publication of statistical and survey data. A further advantage is that, thanks to its flexibility and simplicity, existing data is easy to convert. In our prototype implementation presented below, we use efficient wrapper modules to convert proprietary or other non-semantic formats to the Data Cube vocabulary on the fly.
Modelling examples using the Data Cube vocabulary are presented in Section 3.2.4 through the implementation of a prototype.
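To give a first impression of what self-descriptive data in the Data Cube vocabulary looks like, the following sketch serialises a single observation as Turtle. It is not the wrapper of the prototype. The `ex:` namespace, the identifiers and the values are hypothetical, and the function name `observation_to_turtle` is introduced here for illustration only. The `qb:`, `sdmx-dimension:` and `sdmx-measure:` terms are taken from the Data Cube and SDMX-RDF vocabularies.

```python
PREFIXES = """\
@prefix qb: <http://purl.org/linked-data/cube#> .
@prefix sdmx-dimension: <http://purl.org/linked-data/sdmx/2009/dimension#> .
@prefix sdmx-measure: <http://purl.org/linked-data/sdmx/2009/measure#> .
@prefix ex: <http://example.org/data/> .
"""


def observation_to_turtle(obs_id, dataset_id, area, period, value):
    """Serialise one observation as a qb:Observation in Turtle.

    Two dimensions (reference area and period) and one measure are
    attached; all identifiers live in the hypothetical ex: namespace."""
    return (
        f"ex:{obs_id} a qb:Observation ;\n"
        f"    qb:dataSet ex:{dataset_id} ;\n"
        f"    sdmx-dimension:refArea ex:{area} ;\n"
        f'    sdmx-dimension:refPeriod "{period}" ;\n'
        f"    sdmx-measure:obsValue {value} .\n"
    )


# Illustrative values only, not taken from any real data set.
turtle = PREFIXES + "\n" + observation_to_turtle(
    "obs1", "exampleDataset", "regionA", "2010", 7.5
)
print(turtle)
```

Because every dimension and measure is named by a URI, a consumer can tell from the data alone what each value refers to.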
Retrieval of Data. The ability to find relevant data sets is a key factor in enabling social scientists to make use of existing data sets. An efficient retrieval module is therefore necessary to ensure that the search for data suits the respective research topic. Semantically annotated, highly expressive metadata delivers more processable input for traditional information retrieval algorithms. Thus, more details about the requested data become evident during the retrieval process, e.g. the granularity of specific dimensions or the frequency of observations. To provide researchers with useful information about a data set, extensive metadata must be available. Metadata not only supports the retrieval process, but must also be considered afterwards in order to evaluate the relevance, quality and suitability of the data for the subsequent analysis process. For comparative research, the descriptions and attributes of, for example, different indicators, sample designs and populations have to allow for comparisons with those of other data sets. Eventually, the retrieval module should provide the underlying data itself.
¹⁸ See the announcement at http://www.doi.org/news/DOINewsApr11.html#2 and a detailed description of the implementation at
The semantic description of the data also enables more complex search tasks. For instance, if a researcher is interested in the GDPs of European countries, the available data may provide these figures in the currency of the corresponding countries, so that not all of the data is given in euro. If a second source can deliver the conversion rates, it is possible to combine the data sets and produce the requested information. Beyond the actual retrieval of the data sets, the module needs to provide a simple interaction component for defining the common dimensions, e.g. temporal or geographical ones, by which data sets should be merged and integrated. Therefore, the task of the retrieval module is twofold: to retrieve (a) metadata about the data sets (e.g. using taxonomies such as those expressed in SKOS [MB09b], as is common in libraries) and (b) the data sets themselves.
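The GDP example can be made concrete with a small sketch that joins two sources on their shared currency dimension. The function name `gdp_in_euro`, the country and currency codes and all figures are hypothetical placeholders; the sketch only shows the principle of combining a data set with a second source of conversion rates.

```python
def gdp_in_euro(gdp_local, rates_to_euro):
    """Combine GDP figures with conversion rates on the currency dimension.

    gdp_local:     {country: (amount, currency)}
    rates_to_euro: {currency: euro per one unit of that currency}
    Returns {country: amount in euro}."""
    result = {}
    for country, (amount, currency) in gdp_local.items():
        if currency == "EUR":
            result[country] = amount
        elif currency in rates_to_euro:
            result[country] = amount * rates_to_euro[currency]
        # Countries without a known rate are left out rather than guessed.
    return result


# Placeholder data: "XXA" and "XXB" stand for arbitrary national currencies.
gdp = {"A": (100.0, "EUR"), "B": (200.0, "XXA"), "C": (50.0, "XXB")}
rates = {"XXA": 0.5}
converted = gdp_in_euro(gdp, rates)
print(converted)
```

Country `C` is missing from the result because no rate for its currency is available, which mirrors the situation in which a second source covers only part of the required information.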
Semantic technologies can aid in the combined querying of data. Both the descriptions of a data set (e.g. author, publication date, responsible organisation) and the data itself (the individual observations) are required, and both can be encoded and interpreted by machines. Once data has been published in a uniform base format (e.g. RDF), machine-supported integration is possible. Several services are possible on integrated data, e.g. keyword search [LT10] or faceted browsing [WLT11]. VisiNav, in particular, offers navigation functionality over data integrated from the web [Har12]. Combined querying of distributed and heterogeneous data is also discussed in detail in [FCOO12, HL10] and in [HBF09], which traverses RDF links in order to discover potentially relevant data during query execution.
Linking and Integration. The semantic representation and annotation of data allows for services far beyond the simple retrieval and provisioning of data sets. As the semantics of dimensions, values and metrics is explicitly modelled in the data, automatic linking and integration of data is at a researcher's fingertips.
To correctly join and merge two data sets, it is necessary to identify common dimensions, align and map the corresponding values, and possibly aggregate some of the data entries. Based on the dimension concept in Data Cube and the possibility of semantic annotation, the identification step can be carried out easily. Alignment of the values requires more insight and may be achieved by a more detailed model and description of the data. In data with a temporal dimension, for instance, it is necessary to define its format and to distinguish between frequencies or between values in percentages and values in absolute numbers. Aggregation becomes necessary when there is no comparable representation and the data values need to be summed up or averaged. Again, the semantic description of the dimension has to provide the exact information needed to decide which aggregation function to apply. This is relevant not only for further calculations, but also for combined visualizations, e.g. the combined representation of different data sets in a line graph, where the axes have to be aligned with each other.
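The three steps described above (identifying common dimensions, aggregating to a comparable level, and joining) can be sketched as follows. The row layout, the dimension names and all values are hypothetical, and the helper names `common_dimensions`, `aggregate` and `join` are introduced only for this illustration. In the sketch, the aggregation function is passed in explicitly, standing in for the information that the semantic description of a dimension would have to supply.

```python
from collections import defaultdict


def common_dimensions(dims_a, dims_b):
    """Identification step: dimensions shared by both data sets."""
    return sorted(set(dims_a) & set(dims_b))


def aggregate(rows, keep, measure, func=sum):
    """Aggregation step: group rows by the kept dimensions and reduce
    the measure values of each group with func (e.g. sum or mean)."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[d] for d in keep)].append(row[measure])
    return {key: func(values) for key, values in groups.items()}


def join(agg_a, agg_b):
    """Integration step: keep only keys present in both data sets."""
    return {key: (agg_a[key], agg_b[key]) for key in agg_a if key in agg_b}


# Hypothetical monthly counts and yearly rates.
monthly = [
    {"year": 2010, "month": 1, "region": "R1", "count": 10},
    {"year": 2010, "month": 2, "region": "R1", "count": 15},
    {"year": 2011, "month": 1, "region": "R1", "count": 20},
]
yearly = [
    {"year": 2010, "region": "R1", "rate": 0.5},
    {"year": 2012, "region": "R1", "rate": 0.25},
]

shared = common_dimensions(["year", "month", "region"], ["year", "region"])
counts = aggregate(monthly, shared, "count", func=sum)
rates = aggregate(yearly, shared, "rate", func=lambda v: sum(v) / len(v))
merged = join(counts, rates)
print(shared, merged)
```

Counts are summed while rates are averaged; choosing the wrong function (e.g. summing percentages) would silently produce meaningless values, which is why this choice should be derived from the description of the data.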
Graphical Data Preview. For any existing or newly created data set (the latter being the result of linking and integration), a researcher's first step is typically to examine some key characteristics of the data. Therefore, together with the data itself, the library presents some results of a simple statistical analysis. For existing data sets, key characteristics can be pre-computed; for freshly integrated data, an overview is generated on the fly. We benefit from a semantic representation of the data, which gives a better notion of which characteristics are of interest and which dimensions need to be looked at.
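Which key characteristics the library reports is not specified here. As an assumption, the following sketch computes a few common descriptive statistics for the values of one measure; the function name `key_characteristics` and the sample values are hypothetical.

```python
import statistics


def key_characteristics(values):
    """Return simple descriptive statistics for one measure."""
    return {
        "n": len(values),
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
    }


# Illustrative values only.
summary = key_characteristics([2, 4, 4, 10])
print(summary)
```

For existing data sets such a summary could be stored alongside the metadata, whereas for freshly integrated data the same function would be called on demand.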
Figure 3.2: Visualization of a combined query on different data sets [ZHM11].
To make a first-glance analysis easier, data sets can be presented in graphical form, plotting key indicators over the main or common dimensions of integrated data sets. Figure 3.2 depicts a combined visualization of two heterogeneous data sets [ZHM11]. Such visualizations illustrate the need to align values according to the axes of the diagram. While the election votes from the first data set range from zero to approximately 10,000,000, the values of the other data set, the survey results of a study with a sample of 1,000 persons, always lie below 1,000. This impairs the legibility of the graph. Furthermore, the use of flexible queries on the data allows easy adjustment of the graphs. In Figure 3.3, the first visualisation has been filtered to show only one party and one specific type of answer from the chosen data sets.
Figure 3.3: Filtered Visualization with one party and answer type.
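How the values of differently scaled data sets are aligned for a common axis is not prescribed above. One possible approach, shown here purely as an assumption, is min-max rescaling of each series to the interval [0, 1] before plotting; a secondary axis would be an alternative. The function name `rescale` and the series are illustrative and only mimic the orders of magnitude mentioned in the text.

```python
def rescale(values):
    """Map a series linearly onto [0, 1] so that series with very
    different magnitudes can share one axis."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant series carries no variation to display.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Illustrative series: votes in the millions, survey counts below 1,000.
votes = [0, 5_000_000, 10_000_000]
answers = [200, 400, 600]
print(rescale(votes), rescale(answers))
```

After rescaling, both series use the full height of the diagram; the original values would then have to be shown in the axis labels or tooltips so that no information is lost.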
Data Analysis. Information infrastructures for working with research data are often too strictly tailored to traditionally used data sources or to specific domains or purposes. Considering the LOD movement, a method for performing mostly secondary research tasks is needed that allows for statistical queries and calculations on standardised data sets. The calculations for weighting and transforming the data, as well as the statistical methods applied afterwards (e.g. regression analysis), are performed on the level of the integrated data corpus. Since the use case consists of precise tasks and we do not claim to replace statistical tools, our prototype provides the possibility of performing basic secondary analyses based on a small set of implemented functions. For more complex statistical calculations, existing sources from the R Project for Statistical Computing can be reused. An alternative is to extend the SPARQL query language by query rewriting in order to transform SPARQL queries into particular statistical functions. To allow more comprehensive analyses, the system provides export capabilities for standard tools such as SPSS and Stata. Our scenario focuses on the processing of data at the same hierarchical (aggregation) level. The data integration layer is virtual, i.e. it provides access to data that remains at its original source.
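As an example of the kind of basic function such a module might offer, the following sketch fits a simple linear regression by ordinary least squares on two aligned series of an integrated data set. It is not the implementation of the prototype; the function name `linear_regression` and the data points are hypothetical.

```python
def linear_regression(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two series of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Illustrative points lying exactly on y = 2x + 1.
slope, intercept = linear_regression([1, 2, 3, 4], [3, 5, 7, 9])
print(slope, intercept)
```

Anything beyond such basic functions, for instance multiple regression or significance tests, is better delegated to a dedicated statistics environment, in line with the export route described above.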
Recently, there have been various approaches to analysing statistical Linked Data sets. Most similar to our approach regarding the retrieval and manipulation of data are the SPARQL plugin¹⁹ for the R Project and the OLAP-based approach in [KOH12]. Both use SPARQL queries to load the data into the respective system and allow, in a subsequent step, various operations on the data. The LiDDM system [NKIV11] is also similar to our approach and offers a complete framework for retrieving, integrating, filtering and mining Linked Data. Statistical analysis is enabled by mining the integrated data, e.g. for patterns, using the Weka tool [HFH+09]. The user is involved in the full process and can decide what data is retrieved, how it is integrated and according to which constraints it is filtered and mined. User interaction during these processes is very important in order to support the wide range of information needs. A different approach, which uses Linked Data as background knowledge, is presented in [Pau12]. This approach, Explain-a-LOD, aims to explain statistical information by enriching it with Linked Data. Hypotheses explaining a particular statistical fact are then generated based on a combination of statistics and Linked Data.
¹⁹ http://cran.r-project.org/web/packages/SPARQL/index.html
Export and Referencing. While the preview and basic analysis can provide first insights into the data, they neither can nor are supposed to replace an analysis based on a full statistics application. Therefore, the system needs to allow for exporting the data to enable downstream processing. An export service providing data sets in a selection of common formats (such as CSV, RDF or Excel) is crucial for feeding into the individual scientific processing pipelines of research groups. Exporters are needed particularly as long as the RDF Data Cube format itself is not supported by all major statistics tools. As each data set is compiled based on user-defined parameters and needs, it can be reproduced at any time. The parameters can also be used in a unique identifier for a data set, so that data sets can be referenced and cited.
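The idea of deriving an identifier from the parameters that define a compiled data set, together with a CSV export, can be sketched as follows. How the parameters enter the identifier is not specified above; hashing a canonical JSON form is an assumption made for this illustration, as are the base URI `http://example.org/dataset/`, the parameter names and the function names.

```python
import csv
import hashlib
import io
import json


def dataset_identifier(params, base="http://example.org/dataset/"):
    """Derive a reproducible identifier from user-defined parameters.

    The parameters are serialised in a canonical form (sorted keys), so
    the same selection always yields the same identifier."""
    canonical = json.dumps(params, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]
    return base + digest


def export_csv(rows, fieldnames):
    """Export rows (a list of dicts) as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical parameters of an integrated data set.
id_a = dataset_identifier({"sources": ["ds1", "ds2"], "dimension": "year"})
id_b = dataset_identifier({"dimension": "year", "sources": ["ds1", "ds2"]})
csv_text = export_csv([{"year": 2010, "value": 1}], ["year", "value"])
print(id_a == id_b)
print(csv_text)
```

Because the identifier depends only on the parameters, citing it is sufficient to recompile the same data set later, provided that the underlying sources have not changed.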