

during the Semantic Web Deathmatch panel discussion33. In Section 3.2, we present a prototype application for the consumption of Linked Data, which has been inspired by a real-world use case.

The publication and consumption of Linked Data creates new challenges and open issues and intensifies existing ones, all of which accompany a decentralized, open and self-organizing web. Issues of privacy and licensing, of trust, relevance and quality of Linked Data, as well as of maintaining links, have gained in importance in recent years. As these topics are beyond the scope of this thesis, we will not discuss them further at this point. Relevant work regarding privacy can be found in, e.g. [HP09, WSRH10, SP11]. Issues of trust, relevance and the quality of Linked Data are discussed in, e.g. [BFF08, GGSL12, HZ09, MMB12]. The problem of maintaining links is addressed in [VBGK09, VHC10, PH10].

2.2 Statistical Linked Data

In this section, we define the term Statistical Linked Data. An overview of the typical structure, semantics and modelling issues is given. This section builds upon the foundations of LOD as given in Section 2.1. The focus of this section is on the currently available Statistical Linked Data sets and their typical characteristics, attributes and patterns. Since Statistical Linked Data may be used for scientific research in the social sciences, it is seen as a part of Linked Open Social Science Data as introduced in Section 1.1.

Statistical Data

Statistical data has a long tradition, starting from the late 1800s as a means for kings to keep track of the economic development of their country. Towards the end of the 19th century, the very first automatic data processing systems, e.g. Hollerith's punched cards (a technology later commercialised by IBM34), were developed to enable large-scale statistical operations such as the US Census of 1890. Since then, statistical data may have lost its position as a front runner in the technological race towards better data management, but it has never lagged too far behind either.

Statistical data is periodically collected by administrative sources [fECoD02] and attempts to describe the state of a nation in numbers, typically by collecting demographic and economic data. Commonly known examples include population figures and unemployment rates, but also soft measures such as general well-being. Once collected, the data is usually stored in table-like structures, such as Excel spreadsheets or the various formats of current statistical programs like SPSS35, STATA36 and R37. For larger-scale processing, these are

33 see http://videolectures.net/iswc2011_panel/ and http://semanticweb.com/semantic-web-death-match-at-iswc_b24249 for a summarisation
34 http://www-03.ibm.com/ibm/history/ibm100/us/en/icons/tabulator/
35 http://www-01.ibm.com/software/uk/analytics/spss/
36 http://www.stata.com/
37 http://www.r-project.org/

2 Background & Related Work

Dimensional Table: Geo

| Geo | Name | Code |
|---|---|---|
| geo_DE | "Germany" | "DE" |
| geo_FR | "France" | "FR" |
| geo_ES | "Spain" | "ES" |

Dimensional Table: Marital

| Marital | Name |
|---|---|
| marital_3 | "married" |
| marital_4 | "single" |

Dimensional Table: Gender

| Gender | Name | Code |
|---|---|---|
| gender_M | "Male" | "M" |
| gender_F | "Female" | "F" |
| gender_T | "Total" | "T" |

Dimensional Table: Age

| Age | Name | Code |
|---|---|---|
| age_20 | "Ages 20 to 29" | "A2029" |
| age_30 | "Ages 30 to 39" | "A3039" |
| age_40 | "Ages 40 to 49" | "A4049" |

Fact Table

| Obs # | Geo | Gender | Age | Marital | Time | Value |
|---|---|---|---|---|---|---|
| 1 | geo_DE | gender_M | age_20 | marital_3 | "2004" | "173429" |
| 2 | geo_DE | gender_F | age_20 | marital_3 | "2004" | "179908" |
| 3 | geo_FR | gender_M | age_30 | marital_4 | "2003" | "158233" |

Figure 2.2: Statistical Data organized in a Star Schema.

then transferred to relational databases or data warehouses. The structure of statistical data can be compared with multidimensional models in data warehouses [Inm05], where a fact table determines the centre of the model in a star join data structure.

‘A fact table is a structure that contains many occurrences of data. Surrounding the fact table are dimensions, which describe one important aspect of the fact table’ [Inm05]

This structure is also reflected in the SDMX information model [SDM09], a multidimensional standard model for describing statistical data. In terms of statistical data, an occurrence of data refers to a statistical observation. In general, statistical data consists of multiple observations [SDM09], each of which has been made at a particular point in time. Observations can be organized along specific dimensions (e.g. temporal or geographical dimensions), which describe the logical space to which the observations apply (e.g. time and geographical area). This information is often coded [SDM09], i.e. its values are taken from classifications or code lists. For example, the dimension "reporting country" may refer to values of a classification consisting of country names. The values of the fact table, together with their surrounding dimensions, can be organized as data cubes [CRT14]. Figure 2.2 depicts an example of statistical data organized in a star schema.
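The star schema of Figure 2.2 can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the tables are held as in-memory dictionaries rather than in a data warehouse, values are stored as integers, and the helper names `resolve` and `slice_cube` are hypothetical and not part of SDMX or any library.

```python
# Dimension tables from Figure 2.2, keyed by their identifiers.
geo = {
    "geo_DE": {"name": "Germany", "code": "DE"},
    "geo_FR": {"name": "France", "code": "FR"},
    "geo_ES": {"name": "Spain", "code": "ES"},
}
gender = {
    "gender_M": {"name": "Male", "code": "M"},
    "gender_F": {"name": "Female", "code": "F"},
    "gender_T": {"name": "Total", "code": "T"},
}
age = {
    "age_20": {"name": "Ages 20 to 29", "code": "A2029"},
    "age_30": {"name": "Ages 30 to 39", "code": "A3039"},
    "age_40": {"name": "Ages 40 to 49", "code": "A4049"},
}
marital = {
    "marital_3": {"name": "married"},
    "marital_4": {"name": "single"},
}
DIMENSIONS = {"geo": geo, "gender": gender, "age": age, "marital": marital}

# Fact table: one row per observation, referencing dimension keys.
observations = [
    {"obs": 1, "geo": "geo_DE", "gender": "gender_M", "age": "age_20",
     "marital": "marital_3", "time": "2004", "value": 173429},
    {"obs": 2, "geo": "geo_DE", "gender": "gender_F", "age": "age_20",
     "marital": "marital_3", "time": "2004", "value": 179908},
    {"obs": 3, "geo": "geo_FR", "gender": "gender_M", "age": "age_30",
     "marital": "marital_4", "time": "2003", "value": 158233},
]


def resolve(observation):
    """Star join: replace every dimension key by the name from its table."""
    row = {key: observation[key] for key in ("obs", "time", "value")}
    for dim_name, table in DIMENSIONS.items():
        row[dim_name] = table[observation[dim_name]]["name"]
    return row


def slice_cube(facts, **fixed):
    """Keep only the observations whose dimensions match the fixed keys."""
    return [f for f in facts if all(f[d] == k for d, k in fixed.items())]


resolved = [resolve(o) for o in observations]
german = slice_cube(observations, geo="geo_DE")
german_total = sum(o["value"] for o in german)
```

Fixing one dimension, as in `slice_cube(observations, geo="geo_DE")`, corresponds to taking a slice of the data cube; summing the remaining values aggregates over the other dimensions.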

Code lists are used for encoding information on geographical concepts, gender, age groups and others [SDM09]. Especially for geographical codes, there exist several code lists, which are traditionally used for statistical data. These are widely reused for Statistical Linked Data. The ISO norms38 3166-1 for codes of countries and dependent territories

38 http://www.iso.org/iso/home/standards/country_codes.htm


and 3166-2 for codes of subdivisions (e.g. states or provinces) of countries are well known and used internationally. The Nomenclature of Territorial Units for Statistics (NUTS)39 denotes a common standard for referencing regional areas in the member states of the EU, where the three levels stand for increasingly fine subdivisions of the countries. For Germany, for example, NUTS level 1 denotes the federal states, level 2 the government regions and level 3 the smallest subdivision, the districts. However, various agencies maintain their own code lists, which increases heterogeneity.
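The hierarchical nature of NUTS is visible in the codes themselves: a code starts with a two-letter country prefix, and each further character adds one level of subdivision. The following minimal sketch derives the level and the enclosing regions from a code alone. The function names are hypothetical, the validation pattern is a deliberately loose assumption, and the code `DEA23` serves purely as an example string.

```python
import re

# Loose structural check: two letters for the country, then up to three
# further characters, one per NUTS level (assumption for this sketch).
NUTS_PATTERN = re.compile(r"[A-Z]{2}[A-Z0-9]{0,3}")


def nuts_level(code):
    """Return the NUTS level of a code (0 = country, 3 = finest level)."""
    if not NUTS_PATTERN.fullmatch(code):
        raise ValueError(f"not a NUTS-like code: {code!r}")
    return len(code) - 2


def nuts_ancestors(code):
    """Return the enclosing regions of a code, from country level down."""
    nuts_level(code)  # validates the code
    return [code[:n] for n in range(2, len(code))]


level = nuts_level("DEA23")
ancestors = nuts_ancestors("DEA23")
```

Because every ancestor is a prefix of the code, values reported at level 3 can be rolled up to level 1 or 2 by grouping on the truncated code, without consulting an external lookup table.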

Statistical Linked Data

Due to governmental pressure and other effects, the number of statistical data sets available as LOD has recently seen a considerable increase. This is a welcome step towards governmental transparency, as professionals from many domains rely on the analysis of such raw data as opposed to the often graphic-based representations that are preferred by laypersons.

In this thesis, we define Statistical Linked Data as statistical data that is published according to the Linked Data principles. Statistical Linked Data includes both aggregated and micro data, and can be generalized to survey data. However, the latter are so far rarely found as Linked Data. In this thesis, the term does not include statistical data from other domains, such as experimental data from clinical trials or laboratory experiments of any kind. Represented as Linked Data, a statistical data set consists of several instances of data entries, each of which determines a particular data value, e.g. 548215. The data values are supplemented by additional objects which provide further information, e.g. in which country or at which time the data value has been collected. This puts the data values into context. Such objects are referenced in the data value instances by object properties. However, the objects themselves are classes or individuals of other external or separate data sets (e.g. classifications or code lists). Figure 2.3 depicts an example of Statistical Linked Data.
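The pattern described above, a data value whose context is given by references to separate code lists, can be illustrated with plain Python tuples acting as RDF triples. This is a sketch only: the namespace URIs `http://example.org/dataset1/` and `http://example.org/concepts/` are invented placeholders for the `ex1:` and `concepts:` prefixes of Figure 2.3, the helper functions are hypothetical, and literals and URIs are both represented as strings for brevity.

```python
EX1 = "http://example.org/dataset1/"       # placeholder for prefix ex1:
CONCEPTS = "http://example.org/concepts/"  # placeholder for prefix concepts:
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"

OBS_1 = EX1 + "obs_1"
GEO_DE = CONCEPTS + "dic/countries#geo_DE"
GENDER_M = CONCEPTS + "dic/gender#gender_M"

# (subject, predicate, object) statements.
triples = {
    # The data entry: a value plus references that put it into context.
    (OBS_1, RDF_TYPE, EX1 + "Observation"),
    (OBS_1, EX1 + "geo", GEO_DE),
    (OBS_1, EX1 + "gender", GENDER_M),
    (OBS_1, EX1 + "date", "2004"),
    (OBS_1, EX1 + "value", "173429"),
    # Entries of the separate code lists that the observation points to.
    (GEO_DE, RDFS_LABEL, "Germany"),
    (GENDER_M, RDFS_LABEL, "Male"),
}


def objects(subject, predicate):
    """Return all objects stated for a subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def describe(observation):
    """Follow the links of an observation and collect readable labels."""
    context = {}
    for s, p, o in triples:
        if s != observation or p == RDF_TYPE:
            continue
        labels = objects(o, RDFS_LABEL)
        # Use the label of a referenced concept, else the value itself.
        context[p.rsplit("/", 1)[-1]] = labels[0] if labels else o
    return context


description = describe(OBS_1)
```

The observation itself stores only references; the human-readable labels come from the code lists, which is why the same code list can be shared by many data sets.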

A vocabulary designed for representing Statistical Linked Data is the RDF Data Cube vocabulary [CRT14]. It is based on the SDMX information model [SDM02] and is capable of modelling observations, dimensions and measures for multi-dimensional data sets. However, there are other vocabularies for other types of research data, e.g. the DDI-RDF Discovery vocabulary (see Section 4.3 and [BCWZ12, BZWG13]), which aims to represent micro data (i.e. person-level data) as Linked Data. An overview of other vocabularies for representing statistical data can be found in Section 4.3.1.
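As a rough illustration of how a single observation might look in the RDF Data Cube vocabulary, the sketch below serialises one observation as Turtle. The terms `qb:Observation` and `qb:dataSet` as well as the SDMX-derived properties `sdmx-dimension:refArea`, `sdmx-dimension:sex`, `sdmx-dimension:refPeriod` and `sdmx-measure:obsValue` belong to the published vocabularies; the `ex:` namespace, all identifiers below it and the function `observation_turtle` are assumptions made for this example. A complete data set would additionally need a data structure definition, which is omitted here.

```python
PREFIXES = """\
@prefix qb: <http://purl.org/linked-data/cube#> .
@prefix sdmx-dimension: <http://purl.org/linked-data/sdmx/2009/dimension#> .
@prefix sdmx-measure: <http://purl.org/linked-data/sdmx/2009/measure#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix ex: <http://example.org/> .
"""


def observation_turtle(obs_id, dataset, area, sex, period, value):
    """Serialise one observation as Turtle (identifiers are hypothetical)."""
    return (
        f"ex:{obs_id} a qb:Observation ;\n"
        f"    qb:dataSet ex:{dataset} ;\n"
        f"    sdmx-dimension:refArea ex:{area} ;\n"
        f"    sdmx-dimension:sex ex:{sex} ;\n"
        f'    sdmx-dimension:refPeriod "{period}"^^xsd:gYear ;\n'
        f"    sdmx-measure:obsValue {value} .\n"
    )


document = PREFIXES + "\n" + observation_turtle(
    "obs_1", "dataset", "geo_DE", "gender_M", "2004", 173429
)
```

Dimensions (area, sex, period) and the measure (observed value) are attached to the observation as separate properties, which mirrors the fact table and dimension tables of the star schema.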

Statistical Linked Data Sets Used in this Thesis

Table 2.1 depicts the Statistical Linked Data that is used in this thesis. Additional overviews regarding available Statistical Linked Data sets can be found at the Data

[Figure 2.3: RDF graph. The observation ex1:obs_1 (rdf:type ex1:Observation) carries the values 2004 (ex1:date) and 173429 (ex1:value) and references the concepts concepts:dic/countries#geo_DE ("Germany", via ex1:geo), concepts:dic/agegroups#ages_2029 ("Ages 20 to 29", via ex1:ages), concepts:dic/gender#gender_M ("Male", via ex1:gender) and concepts:dic/cl_mar#marital_3 ("Married", via ex1:cl_marital). The code list concepts:dic/gender links via concepts:hasConcept to gender_M, gender_F and gender_T, which carry the rdfs:label values Male, Female and Total and the concepts:code values M, F and T. Further node labels: ex1:dataset, ex1:observations.]

Figure 2.3: Example of Statistical Linked Data.

Hub40, a data repository that currently contains 9,864 data sets including the 570 data sets of the LOD cloud diagram [SBJC14], and the wiki of Planet Data41, which collects data sets published in the RDF Data Cube vocabulary [CRT14]. It lists 21 data sets. In this thesis, all implementations and evaluations that use Statistical Linked Data have been carried out with these data sets. In Section 3.2, two additional data sets have been used. Their RDF generation is presented in Section 3.2.4.

German General Social Survey – ALLBUS

The German General Social Survey ALLBUS42, which collects up-to-date data on attitudes, behaviour and social structure in Germany, is archived at GESIS – Leibniz Institute for the Social Sciences43. Due to data privacy restrictions, we use a specially edited version of a subset of ALLBUS/GGSS 1980-2008 (Cumulated German General Social Survey 1980 - 2008) [ALL10], which includes only a few variables that are relevant for our use case (‘Current Economic Situation in Germany’ and ‘Resp. own Current Financial Situation’). Additionally, only the data of participants from North Rhine-Westphalia has been included in the subset in order to make it comparable to the election statistics from North Rhine-Westphalia on a

40 http://thedatahub.io/
41 http://wiki.planet-data.eu/web/Datasets
42 http://www.gesis.org/en/allbus/allbus-home/
43 http://www.gesis.org/


geographical level. Because the subset omits much relevant information and many variables, it has been created explicitly for technical feasibility experiments only.

| Data Provider | Description | URL |
|---|---|---|
| data.gov | US governmental data from diverse agencies like energy, environment, veterans affairs, housing and urban development, commerce, healthcare, etc. | http://data-gov.tw.rpi.edu/wiki/Data.gov_Catalog |
| data.gov.uk | Data about education, transport, environment, etc. from the British government. | http://data.gov.uk/linked-data |
| Eurostat | Data publicly available from the statistical office of the EU. Covers various topics ranging from population, economics and industry to education. | http://estatwrap.ontologycentral.com/ |
| ISTAT | Immigration statistics from Italy. | http://www.linkedopendata.it/datasets/istat-immigration |
| OECD | Data of the Organisation for Economic Co-operation and Development, published as Linked Data. | http://oecd.270a.info/ |
| World Bank Climate Change | Data about climate change, including the response of the global climate system to increasing greenhouse gas concentrations. | http://worldbank.270a.info/dataset/world-bank-climates.html |
| Global Hunger Index | The Global Hunger Index (GHI) offers a multidimensional overview of global hunger, recording the state of global, regional and national hunger. | http://datahub.io/dataset/global-hunger-index-2011 |
| EnAKTing Energy | Data extracted from the statistics for road transport consumption compiled by the UK Department for Business, Enterprise and Regulatory Reform. | http://energy.psi.enakting.org/ |


Election Statistics of North Rhine-Westphalia

The election statistics from the German federal state of North Rhine-Westphalia are provided by IT.NRW44, the statistical office and IT service provider of the federal state. They are published as tables on HTML pages and are accessible as CSV via a web service. The statistics contain votes and results for the elections to the German parliament, the parliament of North Rhine-Westphalia and the European parliament. Both votes and results can be retrieved at different administrative levels, from the federal state itself through administrative districts down to single electoral districts.