Enriching integrated statistical open city data by combining equational knowledge and missing value imputation

ePubWU Institutional Repository

Stefan Bischof and Andreas Harth and Benedikt Kämpgen and Axel Polleres and Patrik Schneider

Article (Published) (Refereed)

Original Citation:

Bischof, Stefan and Harth, Andreas and Kämpgen, Benedikt and Polleres, Axel and Schneider, Patrik (2017) Enriching integrated statistical open city data by combining equational knowledge and missing value imputation. Journal of Web Semantics, 48. pp. 22-47. ISSN 1570-8268

This version is available at: http://epub.wu.ac.at/6141/

Available in ePubWU: March 2018

ePubWU, the institutional repository of the WU Vienna University of Economics and Business, is provided by the University Library and the IT-Services. The aim is to enable open access to the scholarly output of the WU.

This document is the publisher-created published version. It is a verbatim copy of the publisher version.

Enriching Integrated Statistical Open City Data by Combining Equational Knowledge and Missing Value Imputation

Stefan Bischof (a,∗), Andreas Harth (b), Benedikt Kämpgen (c), Axel Polleres (d,e), Patrik Schneider (a)

(a) Siemens AG Österreich, Siemensstrasse 90, 1210 Vienna, Austria
(b) Karlsruhe Institute of Technology, Karlsruhe, Germany
(c) FZI Research Center for Information Technology, Karlsruhe, Germany
(d) Vienna University of Economics and Business, Vienna, Austria
(e) Complexity Science Hub Vienna, Austria

Abstract

Several institutions collect statistical data about cities, regions, and countries for various purposes. Yet, while access to recent, high-quality data of this kind is both crucial for decision makers and a means of achieving transparency to the public, all too often such collections of data remain isolated and not re-usable, let alone comparable or properly integrated. In this paper we present the Open City Data Pipeline, a focused attempt to collect, integrate, and enrich statistical data collected at city level worldwide, and to re-publish the resulting dataset in a re-usable manner as Linked Data. The main features of the Open City Data Pipeline are: (i) we integrate and cleanse data from several sources in a modular and extensible, always up-to-date fashion; (ii) we use both Machine Learning techniques and reasoning over equational background knowledge to enrich the data by imputing missing values; (iii) we assess the estimated accuracy of such imputations per indicator. Additionally, (iv) we make the integrated and enriched data, including links to external data sources, such as DBpedia, available both in a web browser interface and as machine-readable Linked Data, using standard vocabularies such as QB and PROV.

Apart from providing a contribution to the growing collection of data available as Linked Data, our enrichment process for missing values also contributes a novel methodology for combining rule-based inference about equational knowledge with inferences obtained from statistical Machine Learning approaches. While most existing works about inference in Linked Data have focused on ontological reasoning in RDFS and OWL, we believe that these complementary methods and particularly their combination could be fruitfully applied also in many other domains for integrating Statistical Linked Data, independently of our concrete use case of integrating city data.

Keywords: open data, Linked Data, data cleaning, data integration

1. Introduction

The public sector collects large amounts of statistical data. For example, the United Nations Statistics Division¹ provides regularly updated statistics about the economy, demographics and social indicators, environment and energy, and gender on a global level. The statistical office of the European Commission, Eurostat², provides statistical data mainly about EU member countries. Some of the data in Eurostat has been aggregated from the statistical offices of the member countries of the EU. Several larger cities even provide data on their own open data portals, e.g., Amsterdam, Berlin, London, or Vienna³. Increasingly, such data can be downloaded free of charge and used under liberal licences.

∗ Corresponding author

¹ http://unstats.un.org/unsd/
² http://ec.europa.eu/eurostat/
³ http://data.amsterdam.nl/, http://daten.berlin.de/, http://data.london.gov.uk/, and http://data.wien.gv.at/

Such open data can benefit public administrations, citizens and enterprises. The public administration can use the data to support decision-making and back policy decisions in a transparent manner. Citizens can be better informed about government decisions, as publicly available data can help to raise awareness and underpin public discussions. Finally, companies could develop new business models and offer tailored solutions to their customers based on such open data. As an example of making use of such data, consider Siemens’ Green City Index (GCI) [1], which assesses and compares the environmental performance of cities. In order to compute the KPIs used to rank cities’ sustainability, the GCI used qualitative and also quantitative indicators about city performance, such as CO2 emissions or energy consumption per capita. Although many of these quantitative indicators had been openly available, the respective datasets had to be collected, integrated, and checked for integrity violations mostly manually, for the following reasons: (i) heterogeneity: ambiguous data published by different Open Data sources in different formats, (ii) missing data, which needed to be added manually through additional research in text documents or estimated by experts, and, last but not least, (iii) outdated data: soon after the GCI had been published in 2012, its results were likely already obsolete.

Inspired by this concrete use case of the GCI, the focus of the present work is on collecting, integrating, and enriching quantitative indicator data about cities, including basic statistical data about demographics, socio-economic factors, or environmental data, in a more automated and integrated fashion to alleviate these problems.

Even though there are many relevant data sources which publish such quantitative indicators as open data, it is still cumbersome to use data from multiple sources in combination and to keep this data up-to-date. The system we present in this paper, the Open City Data Pipeline, thus contributes by addressing all three of the above challenges (i)–(iii) in a holistic manner:

(i) Heterogeneity: All this data is published in different formats such as CSV, JSON, XML, proprietary formats such as XLS, as plain HTML tables, or even worse within PDF files – and so far to a much lesser degree as RDF or even as Linked Data [2]. Also, the specifications of the individual data fields – (a) how indicators are defined and (b) how they have been collected – are often implicit in textual descriptions only and have to be processed manually to understand whether seemingly identical indicators published by different sources are indeed comparable.

Our contribution: we present a systematic approach to integrate statistical data about cities from different sources as Statistical Linked Data [3], a standardised format to publish both the data and the metadata. We build a small ontology of core city indicators, around which we can grow a statistical Linked Data cube: we use standard Linked Data vocabularies such as the RDF Data Cube (QB) [4] vocabulary to represent data of statistical data cubes, as well as the PROV [5] vocabulary to track the original sources of the data, and we create an extensible pipeline of crawlers and Linked Data wrappers to collect this data from the sources.

(ii) Missing values: Data sources like Eurostat Urban Audit cover many cities and indicators. However, for reasons such as cities providing values on a voluntary basis, the published datasets show a large ratio of missing values. The impact of missing values is aggravated when combining different data sets, due to either covering different cities or using different, non-overlapping sets of indicators.

Our contribution: our assumption – inspired also by works that suspect the existence of quantitative models behind the working, growth, and scaling of cities [6] – is that most indicators in such a scoped domain as cities have their own structure and dependencies, from which we can build statistical prediction models and ontological background knowledge in the form of equations.⁴ We have developed and combined integrated methods to compute missing values, on the one hand using statistical inference, such as different standard regression methods, and on the other hand using rule-based inference based on background knowledge in the form of equations that express knowledge about how certain numerical indicators can be computed from others.⁵ While this new method is inspired by our own prior work on using statistical regression methods [7] and equational knowledge [8] in isolation, the combination of both methods outperforms either method used alone, as we demonstrate in our evaluation. We publish the imputed/estimated values, adding respective PROV records and including error estimates, as Linked Data.

(iii) Updates and changes: Studies like the GCI are typically outdated soon after publication since reusing or analysing the evolution of their underlying data is difficult. To improve this situation, we need regularly updated, integrated data stores which provide a consolidated, up-to-date view on data from relevant sources.

Our contribution: the extensible single data source wrappers (based on the work on rule-based Linked Data wrappers by Stadtmüller et al. [9]) in our pipeline architecture crawl each integrated source regularly (once a day) for new data, thus keeping the information as up-to-date as possible, while at the same time re-triggering the missing value enrichment methods and thereby continuously improving the quality of our estimations for missing data: indeed we can show in our evaluations that the more data we collect in our pipeline over time, the better our prediction models for missing values get.

In summary, our work’s contribution is twofold, both in terms of building a practically deployed, concrete system to integrate and enrich statistical data about cities in a uniform, coherent and re-usable manner, and in terms of contributing novel methods to enrich and assess the quality of Statistical Linked Data:

1. as for the former, we present the Open City Data Pipeline, which is based on a generic, extensible architecture, and show how we integrate data from multiple data sources that publish numerical data about cities in a modular and extensible way, which we re-publish as Statistical Linked Data.

2. as for the latter, we describe the combination of statistical regression methods with equational background knowledge, which we call QB equations, in order to impute and estimate missing values.

⁴ We sometimes refer to “predicting” instead of “imputing” values when we mean finding suitable approximation models to estimate indicator values for cities and temporal contexts where they are not (yet) available. These predictions may or may not be confirmed if additional data becomes available.

⁵ Such equational knowledge could also be understood as “mapping” between indicators, which together with manually crafted equality mappings between indicators published by different data sources can be exploited for enrichment, e.g. if one source publishes the population and area of a city, but not the population density, then this missing value, available for other cities directly from other sources, could be computed by an equation.


We also evaluate our approach in terms of measuring the errors (by evaluating the estimated root mean square error (RMSE) per indicator) of such estimates, demonstrating that firstly, the combination of statistical inference with equations indeed pays off, and secondly, the regular update and collection of additional data through our pipeline contributes to improving the accuracy of our estimations for missing values. Note that the method of enrichment by QB equations can be used not only for imputing missing values, but also for assessing the quality of ambiguous values from different data sources: by “rating” different observed values for the same indicator and city from different sources against their distance to our estimation, we have a means to express confidence in different sources in such an integrated system.

The remainder of the paper is organised as follows. Section 2 introduces the necessary preliminaries in terms of Statistical Linked Data and other technical background, such as an overview of the machine learning methods used for missing value imputation. Section 3 gives an overview of the City Data Pipeline architecture, including a description of data sources and a description of how the resulting data set is made available in a re-usable and sustainable manner via a web interface, a Linked Data interface and a public SPARQL endpoint. Section 4 describes the data gathering as well as the main challenges in this context. Section 5 explains the missing data prediction process in more detail. Section 6 refines this process by introducing and applying QB equations. Both the basic value imputation mechanism and the refinement by QB equations are evaluated in Section 7. Section 8 puts our approach in the context of related work. Section 9 gives conclusions, provides lessons learnt and summarises directions for future research.

2. Preliminaries

In the following, we briefly introduce some core terms and notations used throughout the paper. We start with how Statistical Linked Data allows for modelling, wrapping, crawling, and querying of numerical data. We continue with provenance annotations to allow linking and integration as well as tracking the origin of numerical data. Then, we explain how equational background knowledge allows inferencing of numeric information. Also, we explain the basics of missing value prediction using machine learning methods.

Statistical Linked Data. Our focus is on data integration using web technologies. As such, we use technologies such as RDF [10], RDFS, OWL, and SPARQL 1.1 (both query language [11] and update language [12]) to represent, query, and integrate statistical data. We assume the reader is familiar with these standards. Statistical Linked Data refers to statistics published according to the Linked Data principles [13] reusing the RDF Data Cube Vocabulary (QB) [4] as a basis for representing both the individual data points and the metadata. QB is a widely-used vocabulary to describe statistical datasets as so-called data cubes using a multidimensional data model [14].

In the following, we illustrate how the metadata of a population dataset can be modelled using QB. If not stated otherwise, we use (abbreviated) Turtle notation for RDF⁶. The following is an excerpt from Turtle documents containing data from Eurostat:

    </id/urb_cpop1#ds> a qb:DataSet ;
        rdfs:label "Population on 1 January by age groups and sex - cities and greater cities" ;
        qb:structure </dsd/urb_cpop1#dsd> .

    </dsd/urb_cpop1#dsd> a qb:DataStructureDefinition ;
        qb:component [ qb:dimension dcterms:date ] ;
        qb:component [ qb:dimension estatwrap:cities ] ;
        qb:component [ qb:dimension estatwrap:indic_ur ] ;
        qb:component [ qb:measure sdmx-measure:obsValue ] .

In the example, a data structure definition (DSD) defines the independent, categorical properties of the dataset, so-called dimensions: date, city, and indicator. Also, the DSD defines one dependent numeric property, the so-called measure. The data structure definition could also include all valid dimension values, such as all city URIs for dimension estatwrap:cities.

Now, we give an example of how one data point can be modelled using QB:

    _:obs1 a qb:Observation ;
        qb:dataSet </id/urb_cpop1#ds> ;
        estatwrap:cities </dic/cities#AT001C1> ;
        estatwrap:indic_ur </dic/indic_ur#DE1001V> ;
        dcterms:date "2013" ;
        sdmx-measure:obsValue "1741246" .

The example describes an observation of 1,741,246 inhabitants of Vienna in 2013 in the population dataset of Eurostat. Since any individual data point within a dataset is uniquely described by its dimension-value combinations, observations are usually modelled using blank nodes.
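The uniqueness property just described can be checked mechanically. The following is a minimal sketch, assuming observations are held as plain Python dictionaries keyed by the dimension and measure names of the example; the helper names `observation_key` and `duplicate_keys` are illustrative and not part of the pipeline.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

# Dimensions of the example DSD; the measure is deliberately not part of the key.
DIMENSIONS = ("dcterms:date", "estatwrap:cities", "estatwrap:indic_ur")


def observation_key(obs: Dict[str, str]) -> Tuple[str, ...]:
    """Return the dimension values that identify an observation in one dataset."""
    return tuple(obs[d] for d in DIMENSIONS)


def duplicate_keys(observations: Iterable[Dict[str, str]]) -> List[Tuple[str, ...]]:
    """Return every dimension-value combination that occurs more than once."""
    counts = Counter(observation_key(o) for o in observations)
    return [key for key, n in counts.items() if n > 1]


# The Vienna observation from the text, written as a dictionary (assumed layout).
vienna_2013 = {
    "dcterms:date": "2013",
    "estatwrap:cities": "/dic/cities#AT001C1",
    "estatwrap:indic_ur": "/dic/indic_ur#DE1001V",
    "sdmx-measure:obsValue": "1741246",
}

print(duplicate_keys([vienna_2013]))                     # no duplicates
print(duplicate_keys([vienna_2013, dict(vienna_2013)]))  # the shared key is reported
```

Keying on the dimensions alone is what makes blank nodes sufficient: two observations with the same key in the same dataset would be indistinguishable regardless of their node identifiers.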

In the remainder of the paper we use the terms (statistical) dataset, QB dataset and (data) cube synonymously. The QB specification defines the notion of “well-formed cubes”⁷ based on constraints that need to hold on a dataset. When generating and publishing QB datasets, we ensure that these constraints are fulfilled. For instance, when we later generate new observations via predictions and computations we also generate new datasets containing these values.

For publishing statistics in arbitrary formats as Statistical Linked Data, wrappers can access data from the original source, either in real-time or in batch mode, from the original format, e.g., CSV, and provide the data as Statistical Linked Data to the consumer. To collect data from different sources in one place, Linked Data crawlers can start with a seed list of dataset URIs and follow links to access the connected RDF documents. Observations from a dataset can be queried (e.g., filtered and aggregated) using SPARQL [3,15].

Linking and Provenance Annotations. In our scenario, integration means building a unified view that allows querying observations from several datasets as if they resided in a single dataset (we will refer later to this unified view as the global cube). To allow for querying the unified view, the query processor requires mappings and the means to use these mappings during query evaluation.

⁶ Use http://prefix.cc/ to look up prefix declarations.
⁷ https://www.w3.org/TR/vocab-data-cube/#wf

Consider a query to return all values of the indicator “population” of the area “Vienna”, in the year “2010” simultaneously over two datasets. The two datasets may use different identifiers for the same dimensions, e.g., estatwrap:geo vs. sdmx-dimension:refArea, and dimension values, e.g., http://estatwrap.ontologycentral.com/dic/cities#AT001C1 vs. dbpedia:Vienna. Hence, the query would need to take these different URIs into account, or the query processor requires additional data to be able to resolve the differences.

The following RDF snippet contains example links for a dimension and a dimension value:

    estatwrap:cities rdfs:subPropertyOf sdmx-dimension:refArea .
    <http://estatwrap.ontologycentral.com/dic/cities#AT001C1> owl:sameAs dbpedia:Vienna .

When the semantics of such links is taken into account, observations of two or more datasets with identical dimensions can be queried simultaneously as if they resided in a single dataset [16].
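To make the effect of such links concrete, the following minimal sketch rewrites dimension and member URIs to a common form before filtering. It is an illustration under stated assumptions, not the pipeline's implementation (which works on RDF with SPARQL): the two links are those of the RDF snippet above, encoded as plain dictionaries, and `canonical` and `query_area` are hypothetical helper names.

```python
from typing import Dict, List

# Links from the RDF snippet, as plain mappings. owl:sameAs is symmetric;
# here one URI is simply chosen as the canonical one (an assumption).
SUB_PROPERTY_OF = {"estatwrap:cities": "sdmx-dimension:refArea"}
SAME_AS = {
    "http://estatwrap.ontologycentral.com/dic/cities#AT001C1": "dbpedia:Vienna",
}


def canonical(observation: Dict[str, str]) -> Dict[str, str]:
    """Rewrite dimension and member URIs to their linked canonical form."""
    return {
        SUB_PROPERTY_OF.get(dim, dim): SAME_AS.get(val, val)
        for dim, val in observation.items()
    }


def query_area(observations: List[Dict[str, str]], area: str) -> List[Dict[str, str]]:
    """Return all observations about `area`, regardless of the source vocabulary."""
    return [
        obs for obs in map(canonical, observations)
        if obs.get("sdmx-dimension:refArea") == area
    ]


eurostat_obs = {
    "estatwrap:cities": "http://estatwrap.ontologycentral.com/dic/cities#AT001C1",
    "dcterms:date": "2013",
    "sdmx-measure:obsValue": "1741246",
}
# Hypothetical observation from a second source; the measure is omitted
# because no real figure is assumed here.
other_obs = {"sdmx-dimension:refArea": "dbpedia:Vienna", "dcterms:date": "2013"}

matches = query_area([eurostat_obs, other_obs], "dbpedia:Vienna")
print(len(matches))
```

Both observations are returned by the same query although they use different identifiers for the dimension and for the city.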

To make observations more traceable and allow judging the trustworthiness of data, we go beyond the lightweight approach of using Dublin Core properties such as dc:publisher to refer from a dataset to its publisher. We use the PROV ontology [5] to add provenance annotations, such as the agents and activities that were involved in generating observations from other observations (e.g., predicting, inferencing). The following RDF fragment shows a PROV example of two observations, where a QB observation ex:obs123 was derived from another observation ex:obs789 via an activity ex:activity456 on the 15th of January 2017 at 12:37. This derivation was executed according to the rule ex:rule937 with an agent ex:fred being responsible.

    ex:obs123 prov:generatedAtTime "2017-01-15T12:37:00" ;
        prov:wasDerivedFrom ex:obs789 ;
        prov:wasGeneratedBy ex:activity456 .

    ex:activity456 prov:qualifiedAssociation [ prov:wasAssociatedWith ex:fred ] ;
        prov:hadPlan ex:rule937 .

Equational Background Knowledge. In the Semantic Web, information sometimes can be inferred deductively by applying ontological reasoning over suitably formalised background knowledge that often can be evaluated based on rules (e.g. reasoning about subproperties and subclasses, or entity consolidation using owl:sameAs inferences) [17]. Less common is equational knowledge defining functional dependencies among certain attributes of a resource.

For example, if we know that the city Bolzano has 54,031 female residents and a value of 109.0 for the Eurostat indicator “Women per 100 men”, then we can compute a value of 49,570 male residents from the following equation:

$$\text{women per 100 men} = \frac{\text{population female} \cdot 100}{\text{population male}}$$
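As a worked check of the Bolzano example, the equation can be solved for whichever of its three variables is missing, which also illustrates that such equations are undirected. The following is a minimal sketch; the function name `solve_women_per_100_men` and its keyword arguments are illustrative assumptions and do not appear in the pipeline.

```python
from typing import Optional


def solve_women_per_100_men(
    women_per_100_men: Optional[float] = None,
    population_female: Optional[float] = None,
    population_male: Optional[float] = None,
) -> float:
    """Return the single missing variable of
    women_per_100_men = population_female * 100 / population_male."""
    values = [women_per_100_men, population_female, population_male]
    if sum(v is None for v in values) != 1:
        raise ValueError("exactly one variable must be missing")
    if women_per_100_men is None:
        return population_female * 100 / population_male
    if population_female is None:
        return women_per_100_men * population_male / 100
    # Remaining case: the male population is the unknown.
    return population_female * 100 / women_per_100_men


# Values for Bolzano as given in the text.
male = solve_women_per_100_men(women_per_100_men=109.0, population_female=54031)
print(round(male))  # 49570
```

The unrounded result is about 49,569.7, which rounds to the 49,570 male residents stated above.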

Figure 1: Open City Data Pipeline workflow. [Diagram labels: (1) Linked Data Wrapping (one wrapper per source), (2) Linked Data Crawling, (3) Data Integration, (4) Data Storage, (5) Statistical Missing-Values-Prediction, (6) QB Equations, (7) Data Publication; Enrichment.]

Two of our previous works [8,16] have shown that it is possible to define rules to execute such equations, also considering 1) that equations are undirected, 2) that numeric datasets may be modelled with an arbitrary number of dimensions, and 3) that both forward-chaining inference and query rewriting are suitable approaches. In case no computation up to a fixpoint is needed, the executions were realised with simple SPARQL CONSTRUCT or INSERT queries (or, in off-the-shelf SPARQL engines, by iteratively applying such queries).
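The idea of iterating rule application up to a fixpoint can be sketched outside of SPARQL as well. The following Python sketch is an illustration only: rules are hypothetical functions over a dictionary of indicator values, the second equation (total = female + male) is an assumed example that is not taken from the text, and the policy of never overwriting existing values is likewise an assumption.

```python
from typing import Callable, Dict, List, Optional, Tuple

Facts = Dict[str, float]
# A rule returns (indicator, value) if it can derive something, else None.
Rule = Callable[[Facts], Optional[Tuple[str, float]]]


def total_population(facts: Facts) -> Optional[Tuple[str, float]]:
    # Assumed illustrative equation: total = female + male.
    if "population_female" in facts and "population_male" in facts:
        return "population_total", facts["population_female"] + facts["population_male"]
    return None


def male_population(facts: Facts) -> Optional[Tuple[str, float]]:
    # Equation from the text, solved for the male population.
    if "population_female" in facts and "women_per_100_men" in facts:
        return "population_male", facts["population_female"] * 100 / facts["women_per_100_men"]
    return None


def forward_chain(facts: Facts, rules: List[Rule]) -> Facts:
    """Apply all rules repeatedly until a full pass derives nothing new."""
    derived = dict(facts)
    changed = True
    while changed:
        changed = False
        for rule in rules:
            result = rule(derived)
            # Assumption: existing values are never overwritten.
            if result is not None and result[0] not in derived:
                derived[result[0]] = result[1]
                changed = True
    return derived


bolzano = {"population_female": 54031, "women_per_100_men": 109.0}
closure = forward_chain(bolzano, [total_population, male_population])
print(sorted(closure))
```

With the rules listed in this order, the total can only be derived in the second pass, after the male population has become available; a single application of each rule would miss it, which is why iteration is needed.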

Missing Value Prediction. In our attempt to impute (predict) missing values for certain indicators and cities, our assumption is that every such indicator has its own distribution (e.g., normal, Poisson) and relationship to other indicators. Hence, we aim to evaluate different regression methods and choose the best fitting model to predict the missing values. In the field of Data Mining [18,19] various regression methods for prediction have been developed. We focus on well-established methods such as K-Nearest-Neighbour Regression, Multiple Linear Regression and Random Forest Decision Trees, since these methods are straightforward to apply and show a robust behaviour. In case the data is very sparse, these methods are not applicable since they require complete (subsets of) data. To this end, a common and robust method is the regularised iterative PCA algorithm: first perform a Principal Component Analysis (PCA) to reduce the number of dimensions of the data set, then use the new compressed dimensions, called principal components (PCs), as predictors [19,20]. For measuring the quality of predictions (possibly used in equations), we use the root mean squared error (RMSE) and the normalised root mean squared error in % (RMSE%) [18].
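The following minimal sketch shows how a k-nearest-neighbour regression and the two error measures fit together. It is written in plain Python rather than in the R scripts used by the pipeline, the toy rows are invented for illustration, and normalising the RMSE by the range of the observed values is an assumption, since the normalisation is not spelled out here.

```python
import math
from typing import List, Sequence, Tuple

Row = Tuple[Sequence[float], float]  # (predictor values, target indicator value)


def knn_predict(train: Sequence[Row], query: Sequence[float], k: int = 2) -> float:
    """Predict the target as the mean target of the k nearest training rows."""
    nearest = sorted(train, key=lambda row: math.dist(row[0], query))[:k]
    return sum(target for _, target in nearest) / len(nearest)


def rmse(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Root mean squared error between observed and predicted values."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def rmse_percent(actual: Sequence[float], predicted: Sequence[float]) -> float:
    # Assumption: normalise by the range (max - min) of the observed values.
    return 100 * rmse(actual, predicted) / (max(actual) - min(actual))


# Toy data, invented for illustration only.
rows: List[Row] = [
    ((1.0, 2.0), 10.0),
    ((2.0, 1.0), 12.0),
    ((3.0, 3.0), 20.0),
    ((4.0, 5.0), 28.0),
]

# Leave-one-out evaluation: predict each row from all the others.
actual, predicted = [], []
for i, (features, target) in enumerate(rows):
    rest = rows[:i] + rows[i + 1:]
    actual.append(target)
    predicted.append(knn_predict(rest, features, k=2))

print(rmse(actual, predicted), rmse_percent(actual, predicted))
```

Computing the error per indicator in this way allows different regression methods to be compared on the same held-out values before one of them is chosen for imputation.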

3. Overview and System Architecture

The workflow of the Open City Data Pipeline (OCDP) is illustrated in Figure 1 and consists of several steps:

1. Data is provided as Statistical Linked Data via wrappers which have to be created once per source in the Wrapping step.

2. A crawler collects data regularly (currently, weekly) from different sources in the Crawling step through the wrappers.

3. In the Data Integration step the data is integrated into the global cube, where data is enriched by links and heterogeneities are resolved.

4. In the Data Storage step, the data is loaded into a SPARQL endpoint.

5. One further enrichment step applies statistical methods for missing value prediction.

6. Another enrichment step exploits equational background knowledge in the form of QB equations.

7. Finally, in the Data Publication step, the resulting enriched Linked Data is made accessible.

Figure 2: Open City Data Pipeline architecture. [Diagram labels: Linked Data Sources, Linked Data Wrappers, Crawler, Triple Store, Enrichment, Data Integration, Statistical Missing-Values-Prediction, QB Equations, SPARQL, Web UI.]

In order to realise these steps, the architecture of the OCDP system implements several components. Figure 2 gives a high-level overview of the architecture with a triple store being the central part. The data quality improvement workflow uses various methods to improve data quality and enrich the data.

We start with surveying data sources that serve as input to the pipeline in Section 3.1. We introduce the different components, their inputs, outputs, and interfaces in Section 3.2 and explain how we make the resulting data available in Section 3.3.

3.1. Data Sources

Many interesting statistical data sources are available nowadays. Many indicators in these data sources are provided on a country level and only a subset of indicators are available on the city level. We have identified the following potential providers of statistical data concerning cities:

• DBpedia⁸;
• Wikidata⁹;
• Eurostat with Urban Audit;
• United Nations Statistics Division (UNSD) statistics;
• U.S. Census Bureau statistics;
• Carbon Disclosure Project¹⁰;
• individual city data portals.

⁸ http://dbpedia.org/
⁹ http://wikidata.org/
¹⁰ https://www.cdp.net

In particular, we use statistical data from the United Nations and from Eurostat, which are integrated and enriched by the OCDP. The data sources contain data ranging from the years 1990 to 2016, but most of the data concerns the years after 2000. Further, not every indicator is covered over all years; the highest coverage of indicators is between 2004 and 2015 (see Tables 1 and 2). Most European cities are contained in the Eurostat datasets. The UNSD contains the capital cities and cities with a population over 100 000, all listed in the United Nations Demographic Yearbook¹¹.

The previous OCDP of ISWC 2015 [7] contained data from 1990 to 2013 with 638,934 values from the Eurostat data source and 69,772 values from the U.N. data source. Due to some reorganisation in the Eurostat and U.N. datasets, Eurostat now contains 506,854 values and the U.N. provides 40,532 values. Regarding indicators, we now have 209 instead of 215 Eurostat and 64 instead of 154 U.N. indicators. The drop in indicators is due to the fact that the U.N. publishes fewer datasets. The number of cities changed as well: we have 966 instead of 943 Eurostat and 3,381 instead of 4,319 U.N. cities. Due to the smaller size of the datasets (see Tables 1 and 2), we now have an improved missing values ratio of 81.7% (before 86.3%) for Eurostat, and 94.4% (before 99.5%) for the U.N. dataset.

We now describe each of the data sources in detail.

Eurostat. Eurostat¹² offers various datasets concerning E.U. statistics. The data collection is conducted by the national statistical institutes and Eurostat itself. Of particular interest is the Urban Audit (UA) collection, which started as an initiative to assess the quality of life in European cities. UA aims to provide an extensive look at the cities under investigation, since it is a policy tool for the European Commission: “The projects’ ultimate goal is to contribute towards the improvement of the quality of urban life” [21]. Currently, data collection takes place every three years (last survey in 2015) and is published via Eurostat Urban Audit. All data is provided on a voluntary basis, which leads to varying data availability and missing values in the collected datasets. At the city level, Urban Audit contains over 200 indicators divided into the categories Demography, Social Aspects, Economic Aspects, and Civic Involvement. Currently, we extract the datasets that include the following topics:

• Population by structure, age groups, sex, citizenship, and country of birth
• Fertility and mortality
• Living conditions and education
• Culture and tourism
• Labour market, economy, and finance
• Transport, environment, and crime.

¹¹ http://unstats.un.org/unsd/demographic/products/dyb/dyb2012.htm

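Table 1: Values of the Eurostat Dataset

| Year(s)         | Cities | Indicators | Available | Missing   | Missing Ratio (%) |
|-----------------|--------|------------|-----------|-----------|-------------------|
| 1990            | 131    | 88         | 1 799     | 9 641     | 84.27             |
| 2000            | 433    | 163        | 6 420     | 63 996    | 90.88             |
| 2005            | 598    | 168        | 20 460    | 79 836    | 79.60             |
| 2010            | 869    | 193        | 56 528    | 110 996   | 66.26             |
| 2015            | 310    | 69         | 2 030     | 19 291    | 90.48             |
| 2004–2016       | 879    | 207        | 437 565   | 1 331 250 | 75.26             |
| All (1990–2016) | 966    | 209        | 506 854   | 2 257 171 | 81.66             |

Table 2: Values of the United Nations Dataset

| Year(s)         | Cities | Indicators | Available | Missing | Missing Ratio (%) |
|-----------------|--------|------------|-----------|---------|-------------------|
| 1990            | 5      | 3          | 8         | 7       | 46.67             |
| 2000            | 1 078  | 61         | 3 861     | 61 836  | 94.12             |
| 2005            | 777    | 61         | 2 110     | 45 226  | 95.54             |
| 2010            | 1 525  | 64         | 5 866     | 91 670  | 93.99             |
| 2015            | 216    | 3          | 568       | 77      | 11.94             |
| 2004–2016       | 2 095  | 64         | 28 849    | 511 759 | 94.66             |
| All (1990–2016) | 3 381  | 64         | 40 532    | 685 548 | 94.42             |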
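The Missing Ratio column in both tables is consistent with dividing the number of missing values by the sum of available and missing values. A small sketch that recomputes the ratio from the “All (1990–2016)” rows follows; the function name `missing_ratio` is illustrative.

```python
def missing_ratio(available: int, missing: int) -> float:
    """Share of missing values in percent, rounded to two decimals."""
    return round(100 * missing / (available + missing), 2)


# "All (1990–2016)" rows of Tables 1 and 2.
print(missing_ratio(506_854, 2_257_171))  # Eurostat: 81.66
print(missing_ratio(40_532, 685_548))     # United Nations: 94.42
```

These are the values quoted in the text as 81.7% and 94.4% after rounding to one decimal.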

United Nations Statistics Division (UNSD). The UNSD offers data on a wide range of topics such as education, environment, health, technology and tourism. The focus of the UNSD is usually on the country level, but there are some datasets on cities available as well. Our main source is the UNSD Demographic and Social Statistics, which is based on the data collected annually (since 1948) by questionnaires to national statistical offices¹³. Currently we use the datasets on the city level that include the following topics:

• Population by age distribution, sex, and housing
• Households by different criteria (e.g., type of housing)
• Occupants of housing units / dwellings by broad types (e.g., size, lighting, etc.)
• Occupied housing units by different criteria (e.g., walls, waste, etc.)

The full UNSD Demographic and Social Statistics data has over 650 indicators, of which we kept a set of 64 coarse-grained indicators and dropped the most fine-grained indicator level. For example, we keep “housing units total” but drop “housing units 1 room”. We prefer more coarse-grained indicators to avoid large groups of similar indicators which are highly correlated.

3.2. Pipeline Components

We now give an overview of each of the components of the OCDP system.

Statistical Linked Data Wrappers. Currently, none of the mentioned data sources publishes statistical data as Statistical Linked Data upfront. Thus, we use a set of wrappers which publish the data from these sources according to the principles listed in Section 2. Section 4.1 below explains these wrappers in more detail.

¹³ http://unstats.un.org/unsd/demographic/

Linked Data Crawler. A Linked Data crawler starts with a seed list of URIs and crawls relevant connected Linked Data. The resulting RDF data is collected in one big RDF file and eventually loaded into the triple store. Section 4.4 explains the Linked Data crawler in more detail.
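The crawling strategy of starting from a seed list and following links can be sketched as a breadth-first traversal. The sketch below is an illustration under stated assumptions: the `LINKS` dictionary is a hypothetical in-memory stand-in for dereferencing documents on the Web (the URIs are borrowed from the Eurostat example, the link structure is invented), and `crawl` is not the crawler used in the pipeline.

```python
from collections import deque
from typing import Dict, Iterable, List, Set

# Hypothetical in-memory stand-in for the Web: document URI -> linked URIs.
LINKS: Dict[str, List[str]] = {
    "/id/urb_cpop1#ds": ["/dsd/urb_cpop1#dsd"],
    "/dsd/urb_cpop1#dsd": ["/dic/cities", "/dic/indic_ur"],
    "/dic/cities": [],
    "/dic/indic_ur": ["/dsd/urb_cpop1#dsd"],  # cycle, to show de-duplication
}


def crawl(seeds: Iterable[str], max_documents: int = 100) -> List[str]:
    """Breadth-first traversal from the seed list, visiting each URI once."""
    queue = deque(seeds)
    seen: Set[str] = set(queue)
    visited: List[str] = []
    while queue and len(visited) < max_documents:
        uri = queue.popleft()
        visited.append(uri)
        # A real crawler would dereference `uri` and extract links from the RDF.
        for target in LINKS.get(uri, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return visited


print(crawl(["/id/urb_cpop1#ds"]))
```

Tracking the set of already seen URIs keeps the traversal finite even when documents link back to each other, and the `max_documents` bound is an assumed safeguard against unbounded crawls.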

Triple Store. We use a standard Virtuoso 7 triple store as a central component to store data at different processing stages. For data loading we use the Virtuoso SQL console, which allows faster data loading. For all other data access we rely on Virtuoso’s SPARQL 1.1 interface, which allows not only querying data but, with SPARQL Update, also inserting new triples.

Enrichment Component: Data quality improvement workflow. In an iterative approach we improve the data quality of the crawled raw data. This component covers steps (3)–(6) in the workflow shown in Figure 1: in this configurable workflow we use several different sub-components corresponding to these steps consecutively. Each workflow component first reads input data (= observations) from the triple store via SPARQL queries, processes the data accordingly and inserts new triples into the triple store either via SPARQL INSERT queries or the Virtuoso bulk loader facility (the first option is more flexible, as it allows the execution of the workflow on a different machine, while the second usually allows faster data loading).

The workflow currently uses three different subcomponents corresponding to steps (4), (5) and (6), respectively:

• The Data Integration sub-component, corresponding to step (4), performs some linking and data integration steps and materialises the global cube in a separate named graph. This linking and materialisation effectively resolves different types of heterogeneity found in the raw data: (i) different URIs for members, (ii) different URIs for dimensions, (iii) different DSDs (although the DSDs must be compatible to some extent for the integration to make sense). Eventually the global cube provides a unified view over many datasets from several sources. This component is implemented with SPARQL Update queries and supplied background knowledge for the integration. The exact process of linking statistical data and the materialisation will be described in more detail in Section 4.3 and Section 4.2 below.

• The Statistical Missing Values Prediction sub-component, corresponding to step (5), extracts the whole global cube generated by the materialisation as one big data matrix, which is then used for applying different standard statistical regression methods to train models for missing value prediction. This component is implemented as a set of R scripts which extract the data with SPARQL queries. We then train and evaluate the models for each of the indicators. If the selected model delivers predictions of satisfactory quality, we apply the model and get estimates for the indicators. Finally, the component exports the statistical data together with error estimates to one RDF file, which is then loaded into the triple store with the Virtuoso bulk load feature and added to the global cube. Section 5 below explains this component in more detail.

• The QB Equations sub-component, corresponding to step (6), uses equations from different sources to infer even more data. To this end, we introduce QB equations. These QB equations provide an RDF representation format for equational knowledge and a semantics as well as a forward chaining implementation to infer new values. QB equations are implemented in a naive rule engine which directly executes SPARQL INSERT queries on the triple store. Section 6 introduces the concept of QB equations with syntax, semantics and implementation.

Lastly, in Section 6.3, we also explain the interplay between the Data Enrichment sub-components in more detail. Roughly, after cleansing and linking in component (4), we first run the QB equations component (6) once, to compute any values that can be derived by equations from the raw factual data alone; we then approximate the remaining missing values with the statistical missing values prediction component (5); finally, we run the QB equations component (6) again to improve the predictions from (5) iteratively. As we will see in a detailed evaluation in Section 7, this iterative combination indeed performs better than using either (5) or (6) alone.

3.3. Data Publication

Eventually after the data is crawled and loaded into the triple store, improved and enriched by our workflow, the resulting global cube is available for consumption.

We provide a SPARQL endpoint^14 based on Virtuoso, where the global cube is stored in a named graph^15. The prefix names used in the examples above are already set in Virtuoso, thus no prefix declarations are necessary for SPARQL queries.

We also provide a simple user interface^16 to query values for a selected indicator and city in the global cube. Queries are directly executed on the triple store during loading of the website using a JavaScript library called Spark; thus one can have a look at the SPARQL queries in the source code. We show all predicted values for transparency reasons. We simply order by the error value, i.e., the most trustworthy value per year is always shown first.
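The ordering by error value can be illustrated with a minimal Python sketch. It is not the code of the user interface (which runs SPARQL from JavaScript); the record layout, the field names `year`, `value` and `error`, and the numbers are made up for illustration.

```python
from itertools import groupby


def order_by_error(observations):
    """Group observations by year; within a year, lowest error comes first."""
    ranked = sorted(observations, key=lambda o: (o["year"], o["error"]))
    return {
        year: list(group)
        for year, group in groupby(ranked, key=lambda o: o["year"])
    }


# Made-up values for one city and one indicator (not taken from the OCDP data).
observations = [
    {"year": 2010, "value": 1700000.0, "error": 0.8},
    {"year": 2010, "value": 1690000.0, "error": 0.1},
    {"year": 2011, "value": 1710000.0, "error": 0.3},
]

by_year = order_by_error(observations)
print(by_year[2010][0]["value"])  # the value with the smallest error for 2010
```

All values for a year stay visible; only their order reflects the estimated error.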

4. Data Conversion, Linking, and Integration

We now explain our approach for data conversion, linking, and integration. The approach is modular and extensible in the sense that every new data source can be prepared for consideration separately and independently from other sources. The data integration pipeline can be re-run at any time and thus allows for up-to-date data.

The approach consists of the following components:

• Linked Data wrappers that publish numerical data from various data sources as Statistical Linked Data (Section 4.1);

• the definition of a unified view over all relevant Statistical Linked Data (Section 4.2);

• semi-automatically generated links between Statistical Linked Data from different sources (Section 4.3);

• a rule-based Linked Data crawler that collects the relevant data and creates the unified view (Section 4.4).

^14 http://citydata.wu.ac.at/ocdp/sparql

^15 http://citydata.wu.ac.at/qb-materialised-global-cube

^16 http://kalmar32.fzi.de/indicator-city-query.php

4.1. Wrappers

We use Linked Data as an interface to access and represent relevant data sources (e.g., Eurostat or UNSD), which are originally published in tabular form. The uniform Linked Data interface hides the specialities and structure of the original data source. When the wrapper receives an HTTP request for a particular dataset, it retrieves the data on-the-fly from the original source, transforms the tabular representation to RDF, using the RDF Data Cube vocabulary, and returns the RDF representation of the original tabular data.

The wrappers provide a table of contents with links to all available datasets (as a collection of qb:DataSet triples), including the data structure definition of the datasets (as qb:DataStructureDefinition). The individual data points are modelled as observations (as qb:Observation). The data structure definition includes the available dimensions (as qb:dimension) and concept schemes (as skos:ConceptScheme). We require a list of dataset and data structure definitions to be able to crawl the data.

Each wrapper coins URIs for identifying the relevant resources, for example, indicators or locations. We use URIs as unique identifiers for datasets, dimensions, and dimension values from different data sources.

The data sources identify indicators differently. For example, UNSD provides population numbers in dataset “240”, while Eurostat provides population numbers in dataset “urb_cpop1”. We use the City Data Ontology to unify the various indicator identifiers. Similarly, locations have varying identifiers and sometimes varying names in the different data sources. For a relatively clear-cut example consider the city of Vienna: UNSD uses city code “001170” and label “WIEN”, whereas Eurostat uses code “AT001C1” and label “Wien”. The wrappers generate a Uniform Resource Identifier (URI) for every city out of the unique identifiers in the original tabular data.

We use the following wrappers that provide access to the underlying data source via a Linked Data interface:

Eurostat Wrapper. The Eurostat wrapper^17 makes the Eurostat datasets, originally available in tabular form at the Eurostat website, available as Linked Data. Eurostat provides several dictionary files in SDMX format; these files are used to construct a list of dimension values in the data structure definition and to generate URIs for relevant entities (such as cities). All files are accessed from the original Eurostat server once the wrapper receives an HTTP request on the particular URI, ensuring that the provided RDF data is up-to-date. Population data in the Eurostat wrapper^18 uses http://estatwrap.ontologycentral.com/dic/indic_ur#DE1001V to identify “Population on the 1st of January, total”. The indicator URI is mapped to indicator URIs from the City Data Ontology in a subsequent step.

UNSD Wrapper. The UNSD wrapper^19 makes the UNSD datasets, originally available in tabular form at the UNSD website, available as Linked Data. The UNSD wrapper provides a simple data structure definition describing the available dimensions and measure. In total, we cover 14 datasets ranging from population to housing data. Most indicators, e.g., population of the “240” dataset,^20 are directly mapped to an indicator URI from the City Data Ontology, namely http://citydata.wu.ac.at/ns#population.

4.2. Unified View over Statistical Linked Data

As the different data sources use different identifiers (and the wrappers use different URIs), we need to link the varying URIs before we can do an integrated querying of the data. As the foundation for efficiently querying Statistical Linked Data (and in turn enriching the data as described in Section 5 and Section 6) we define a unified view of all crawled datasets about cities in a simplified version of the global cube [16]. In the following, we describe the structure of the global cube.

We define the unified view as the basis for querying as follows. The qb:Observations (consisting of dimensions and measures) have the following structure, starting with the dimensions:

• For the time dimension we use dcterms:date.

• For the time dimension values we use single years represented as String values such as "2015".

• For the geospatial dimension we use sdmx-dimension:refArea, which is recommended by the QB standard.

• For the geospatial dimension values we use instances of dbpedia:City, such as dbpedia:Vienna.

• For the indicator dimension we use cd:hasIndicator.

• For the indicator dimension values we use instances of cd:Indicator, such as cd:population_female. For the indicator dimension values, we defined the CDP ontology as the main hub of indicator URIs to link to since there was no list with common indicator values.

Most data sources follow the practice of using an unspecific measure sdmx-measure:obsValue and a dimension indicating the measured variable, e.g., estatwrap:indic_na. For the unified view, we thus also assume data cubes to have only one general measure, sdmx-measure:obsValue. Please note that there are different equivalent alternative representations of the same information. Specifically for measure properties, in QB there is a choice for the structuring of the observations: either use a single observation value property and a dedicated indicator dimension, or encode the indicator in the measure property. To sum up: in line with established usage, we use a single measure property, but that structure contains all the information that would also be present in the alternative representation.

^18 http://estatwrap.ontologycentral.com/id/urb_cpop1

^19 http://citydata.wu.ac.at/Linked-UNData/

^20 http://citydata.wu.ac.at/Linked-UNData/data/240

If we want to pose queries over the two datasets, we have two options: 1) specifically write the query to consider possibly different identifiers (i.e., we need to know all identifiers), or 2) assume existing links and reasoning. With the second option, if we query for values for the canonical identifiers (as for any other identifier in the equivalence class), we also get the values for the respective other identifiers. In the paper, we assume reasoning to allow for flexible addition of new sources without the need to change the queries for each new data source.

As an example, suppose we want to query all values of the indicator “population” of the area “Vienna” in the year “2010” over data from both datasets. The indicator would be expressed as a dimension, with a URI representing “population” as dimension value. The area would be expressed as a dimension, with a URI representing “Vienna” as dimension value. The query looks like the following:

    SELECT ?city ?year ?value
    WHERE { ?obs cd:hasIndicator cd:population ;
                 sdmx-dimension:refArea dbpedia:Vienna ;
                 dcterms:date ?year ;
                 sdmx-measure:obsValue ?value . }

Our unified view uses the basic modelling features of the QB vocabulary. In particular, we model indicators in a way that includes what otherwise might be encoded as separate dimensions. In the more complex modelling, we would need to use the union of all dimensions of the source datasets, which would lead to introducing an “ALL” dimension value for those dimensions that are not distinguished by the particular dataset (see [16] for details on this more normalised representation). However, all the newly introduced dimensions per dataset would need to be considered in querying, which complicates the queries. Rather than adding a dimension “sex” to encode gender, we create separate indicator URIs, for example for population, population male and population female. A benefit of the relatively simple structure is that queries and rules operating on the unified view are also simple.

We have published the data structure definition of the global cube using the QB vocabulary. Besides the general measure (sdmx-measure:obsValue), the qb:DataStructureDefinition of the global cube uses the mentioned dimensions dcterms:date, sdmx-dimension:refArea, and cd:hasIndicator. Also, we have defined instances of qb:AttributeProperty for cd:estimatedRMSE (for describing the error), cd:preferredObservation (for linking to more reliable values), prov:wasGeneratedBy (for describing provenance information) and prov:generatedAtTime (for the time of generation) that help to interpret and evaluate the trustworthiness of values.


dimensions and dimension URIs. In the global cube, we use canonical URIs to represent resources from different data sources.

4.3. Linking and Mapping Data

We start by explaining the required mappings for dimension URIs, followed by explaining the required mappings for dimension value URIs. In general, the data from the UNSD wrapper, due to the simpler representation in the original data source, requires fewer mappings than the data from the Eurostat wrapper.

The two data sources exhibit the three dimensions dcterms:date (year), sdmx-dimension:refArea (city) and cd:hasIndicator (indicator). We map the following dimension URIs of the global cube using rdfs:subPropertyOf:

• For the time dimension our wrappers directly use dcterms:date. The time dimension hence does not require any further mapping.

• For the geospatial dimension the UNSD wrapper uses sdmx-dimension:refArea. The Eurostat wrapper uses different representations for the geospatial dimension, such as eurostat:geo, eurostat:cities and eurostat:metroreg, which we link to sdmx-dimension:refArea.

• For the indicator dimension we use cd:hasIndicator. Again, the UNSD wrapper directly uses that URI, while the data from the Eurostat wrapper requires links from eurostat:indic_na and eurostat:indic_ur to cd:hasIndicator.

The Eurostat site provides a quite elaborate modelling of dimensions, code lists and so on in SDMX files. The datasets from the Eurostat wrapper use various units such as :THS denoting "Thousand" and :COUNT denoting that the number was computed from a count operation. However, besides the three canonical dimensions of the global cube, all other dimensions of the Eurostat datasets we consider in the pipeline exhibit only a single possible dimension value (e.g., :THS). Hence, we can assume that all other dimensions and their values are part of the indicator.

The UNSD site has a simpler structure than Eurostat. The modelling of different dimensions and code lists is less elaborate. Thus, for the UNSD wrapper, we have ensured on the level of the published RDF that each dataset only provides the canonical dimensions.

The two wrappers use different URIs for the same dimensions, e.g., eurostat:geo and sdmx-dimension:refArea. The wrappers also use different URIs for the same dimension values, e.g.,

• For the time dimension values we use single years represented as String values such as "2015".

• For the geospatial dimension values we link to DBpedia URIs from other representations such as http://estatwrap.ontologycentral.com/dic/cities#AT001C1 and http://citydata.wu.ac.at/resource/40/001170#000001.

• For the indicator dimension values we link to instances of cd:Indicator, such as cd:population and cd:population_male. The UNSD wrapper directly uses these values. For the URIs used in the data from the Eurostat wrapper, we link to instances of cd:Indicator.

We now describe how we generated these links to map data from different sources to the canonical representation, starting with the dimension and dimension value URIs. We manually created the rdfs:subPropertyOf triples connecting the Eurostat dimension URIs with our canonical URIs, and semi-automatically generated the indicator URIs from an Excel sheet provided by Eurostat. We then created an RDF document with links from the newly generated URIs to the URIs of the Eurostat wrapper. We manually adapted the UNSD wrapper to use the newly generated URIs as indicator URIs.

We choose to have a one-to-one (functional) mapping of every city from our namespace to the English DBpedia URI, which in our re-published data is encoded by owl:sameAs relations. We identify the matching DBpedia URIs for multilingual city names and apply basic entity recognition, similar to Paulheim et al. [22], with three steps using the city names from UNSD data:

• Accessing the DBpedia resource directly and following possible redirects.

• Using the Geonames API^21 to identify the resource.

• For the remaining cities, we manually looked up the URI on DBpedia.

The mappings of geospatial URIs from the Eurostat wrapper were done in a similar fashion. All the mappings are published online as RDF documents that are accessed during the crawling step.

4.4. Data Crawling and Integration

The overall RDF graph can be published and partitioned in different documents. Thus, to access the relevant RDF documents, the system has to resolve the URIs of entities related to the dataset. Related entities are all instances of QB-defined concepts that can be reached from the dataset URI via QB-defined properties. For example, from the URI of a qb:DataSet instance, the instance of qb:DataStructureDefinition can be reached via qb:structure. Similarly, instances of qb:ComponentProperty (dimensions/measures) and skos:Concept (members) can be reached via links.

Once all numeric data is available as Linked Data, we need to make sure to collect all relevant data and metadata starting from a list of initial URIs. First, the input to the crawling is a seed list of URIs of instances of qb:DataSet. One example of a “registry” or “seed list” of dataset URIs is provided by the PlanetData wiki^22. A seed list of such datasets is published as RDF and considered as input to the crawling. We use two such seed lists: one with links to the relevant instances of qb:DataSet from the UNSD wrapper, and another one with links to the relevant instances of qb:DataSet from the Eurostat wrapper.

^21 http://api.geonames.org/

Then, Linked Data crawlers deploy crawling strategies for RDF data where they resolve the URIs in the seed list to collect further RDF and in turn resolve a specific (sub-)set of contained URIs. An example Linked Data crawler is LDSpider [23], which uses a depth-first or breadth-first crawling strategy for RDF data. Linked Data crawlers typically follow links without considering the type.

A more directed approach would apply a crawling strategy that starts with resolving and loading the URIs of qb:DataSets relevant for the task, and then in turn resolves and loads instances of QB concepts that can be reached from the dataset URIs.

To specify how to collect Linked Data, we use the Linked Data-Fu language [9] in which rule-based link traversal can be specified. For instance, to retrieve data from all qb:DataSets, we define the following rule:

    { ?ds rdf:type qb:DataSet . } => { [] http:mthd httpm:GET ; http:requestURI ?ds . } .

The head of a rule corresponds to an update function of an internal graph representation in that it describes an HTTP method that is to be applied to a resource. In our example, the head of the rule applies an HTTP GET method to the resource ?ds. The body of a rule corresponds to the condition in terms of triple patterns that have to hold in the internal graph representation. In our example, ?ds is defined as an instance of qb:DataSet.

Similarly, we retrieve instances of qb:DataStructureDefinition, qb:ComponentSpecification, qb:DimensionProperty, qb:AttributeProperty, qb:MeasureProperty, qb:Slice, qb:SliceKey, and qb:ObservationGroup. Also, we access the list of possible dimension values (based on qb:codeList in data structure definitions) as well as each single dimension value. The only instances we do not resolve are observations, since these are usually either modelled as blank nodes or provided together with other relevant information in the RDF document containing qb:DataSet or qb:Slice.

Crawling may include further information, e.g., rdfs:seeAlso links from relevant entities or owl:sameAs links to equivalent URIs. Assuming that the number of related instances of QB concepts starting from a QB dataset is limited and that links such as rdfs:seeAlso for further information are not crawled without restriction (e.g., only from instances of QB concepts), the directed crawling strategy terminates after a finite number of steps.
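The directed strategy and its termination argument can be sketched in a few lines of Python. This is only an illustration, not the Linked Data-Fu programs used in the pipeline: HTTP lookups are replaced by a dictionary (`WEB`), all `ex:` URIs are hypothetical, and the set of followed properties is an assumed subset of the QB vocabulary.

```python
from collections import deque

# QB properties followed by the directed crawl (assumed subset for the sketch).
QB_LINKS = {
    "qb:structure", "qb:component", "qb:dimension",
    "qb:measure", "qb:attribute", "qb:codeList",
}


def directed_crawl(seeds, fetch, follow=QB_LINKS):
    """Resolve each URI at most once and follow only the given properties."""
    visited, collected = set(), []
    queue = deque(seeds)
    while queue:
        uri = queue.popleft()
        if uri in visited:
            continue  # every URI is dereferenced once, so the crawl is finite
        visited.add(uri)
        for s, p, o in fetch(uri):
            collected.append((s, p, o))
            if p in follow and o not in visited:
                queue.append(o)
    return visited, collected


# Hypothetical in-memory "web": URI -> triples of the document it resolves to.
WEB = {
    "ex:ds": [("ex:ds", "rdf:type", "qb:DataSet"),
              ("ex:ds", "qb:structure", "ex:dsd"),
              ("ex:ds", "rdfs:seeAlso", "ex:elsewhere")],
    "ex:dsd": [("ex:dsd", "qb:component", "ex:comp")],
    "ex:comp": [("ex:comp", "qb:dimension", "ex:dim")],
    "ex:dim": [("ex:dim", "qb:codeList", "ex:codes")],
    "ex:codes": [],
    "ex:elsewhere": [("ex:elsewhere", "rdfs:seeAlso", "ex:ds")],
}

visited, triples = directed_crawl(["ex:ds"], lambda uri: WEB.get(uri, []))
```

In this toy graph the rdfs:seeAlso link is recorded as a triple but never dereferenced, so `ex:elsewhere` is not visited.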

Besides all the relevant data and metadata of qb:DataSets, we collect the following further information:

• The City Data Ontology^23 (CDP ontology) that contains lists of common statistical indicators about cities.

• The QB Equations Ontology^24 that contains the vocabulary to describe QB equations and is further detailed in Section 6.

• The Eurostat QB equations^25, a set of QB equations generated from formulas published by Eurostat, as further detailed in Section 6.

• Background information^26 that links indicators of Estatwrap to the CDP ontology as further described in Section 4.3.

• Background information providing additional owl:equivalentProperty links^27 between common dimensions not already provided by the wrappers, such as between the different indicator dimension URIs estatwrap:indic_ur, cd:hasIndicator and eurostat:indic_na.

^23 http://citydata.wu.ac.at/ns

Besides explicit information available in the RDF sources, we also materialise implicit information to 1) make querying over the triple store easier and 2) automatically evaluate relevant QB and OWL semantics. We execute the QB normalisation algorithm^28 in case the datasets are abbreviated. Also, we execute entailment rules^29 for OWL and RDFS. However, we only enable those normalisation and entailment rules that we expect to be evaluated quickly and to provide sufficient benefit for querying.

For instance, we evaluate rules about the semantics of equality, e.g., symmetry and transitivity of owl:sameAs. We again describe the semantics of such axioms using Linked Data-Fu. However, because we do not need the full materialisation of the equality, but only the canonical URIs, we define custom rules that only generate the triples involving the canonical URIs. Thus, the resulting dataset contains all triples required to integrate and query the canonical representation, but not more.
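The effect of materialising only the canonical URIs, rather than the full equality closure, can be sketched as follows. The sketch uses a union-find structure in Python instead of the custom Linked Data-Fu rules; the function names, the choice of DBpedia resources as canonical, and the specific owl:sameAs pairs are assumptions for illustration.

```python
def canonical_map(same_as_pairs, is_canonical):
    """Map every URI in an owl:sameAs equivalence class to one canonical URI."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in same_as_pairs:  # symmetry and transitivity via union
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    classes = {}
    for uri in list(parent):
        classes.setdefault(find(uri), []).append(uri)

    mapping = {}
    for members in classes.values():
        preferred = [u for u in members if is_canonical(u)]
        canon = min(preferred) if preferred else min(members)
        for u in members:
            mapping[u] = canon
    return mapping


def rewrite(triples, mapping):
    """Keep one triple per statement, expressed with canonical URIs only."""
    return [(mapping.get(s, s), p, mapping.get(o, o)) for s, p, o in triples]


ESTAT = "http://estatwrap.ontologycentral.com/dic/cities#AT001C1"
UNSD = "http://citydata.wu.ac.at/resource/40/001170#000001"
DBP = "http://dbpedia.org/resource/Vienna"

# Illustrative links; DBpedia resources are assumed to be the canonical URIs.
mapping = canonical_map(
    [(ESTAT, DBP), (UNSD, DBP)],
    lambda uri: uri.startswith("http://dbpedia.org/resource/"),
)
canonical = rewrite([("_:obs", "sdmx-dimension:refArea", ESTAT)], mapping)
```

A class of k equivalent URIs thus contributes one rewritten triple per statement instead of k copies.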

The crawling and integration is specified in several Linked Data-Fu programs. The programs are executed periodically using the Linked Data-Fu interpreter^30 in version 0.9.12. The interpreter issues HTTP requests to access the seed list, follows references to linked URIs, and applies the derivation rules to materialise the inferences. The crawled and integrated data is then made available for loading into a triple store. Before loading the observations into the triple store we ensure for each observation that the correct dimension URIs and member URIs are used, filter out non-numeric observation values and mint a new observation URI if a blank node is used. Finally, the filtered and skolemised observations are loaded into an OpenLink Virtuoso triple store (v07) using the standard RDF bulk loading feature^31. Thus, the global cube can be queried in subsequent imputation and calculation steps.
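The filtering and skolemisation step can be pictured with the following Python sketch. It is not the pipeline's implementation: the record layout, the `SKOLEM_BASE` namespace and the hash-based URI scheme are hypothetical, and the sample values are made up.

```python
import hashlib
import math

# Hypothetical namespace for minted observation URIs (an assumption).
SKOLEM_BASE = "http://example.org/.well-known/genid/"


def parse_number(text):
    """Return a finite float, or None if the value is not numeric."""
    try:
        number = float(text)
    except (TypeError, ValueError):
        return None
    return number if math.isfinite(number) else None


def clean_observations(observations):
    """Drop non-numeric values and replace blank nodes by minted URIs."""
    cleaned = []
    for obs in observations:
        value = parse_number(obs["value"])
        if value is None:
            continue
        node = obs["node"]
        if node.startswith("_:"):
            # Deterministic URI derived from the dimension values.
            key = "|".join((obs["city"], obs["indicator"], obs["year"]))
            node = SKOLEM_BASE + hashlib.sha1(key.encode("utf-8")).hexdigest()
        cleaned.append({**obs, "node": node, "value": value})
    return cleaned


# Made-up sample records; ":" stands for a non-numeric "not available" marker.
sample = [
    {"node": "_:b0", "city": "dbpedia:Vienna", "indicator": "cd:population",
     "year": "2010", "value": "1690000"},
    {"node": "http://example.org/obs/1", "city": "dbpedia:Vienna",
     "indicator": "cd:population", "year": "2011", "value": ":"},
]
cleaned = clean_observations(sample)
```

Deriving the minted URI from the dimension values keeps it stable across repeated runs of the pipeline, which is a design choice of this sketch rather than a documented property of the OCDP.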

^24 http://citydata.wu.ac.at/ocdp/qb-equations

^25 http://citydata.wu.ac.at/ocdp/eurostat-equations

^26 http://kalmar32.fzi.de/triples/indicator-eurostat-links.nt

^27 http://kalmar32.fzi.de/triples/dimension-property-links.nt

^28 https://www.w3.org/TR/vocab-data-cube/#normalize-algorithm

^29 http://semanticweb.org/OWLLD/

^30 https://linked-data-fu.github.io/

^31 See http://citydata.wu.ac.at/ocdp/import for a collection of


Figure 3: Prediction Workflow

5. Imputation: Predicting Missing Values

As discussed in Section 1 and Section 3.1, the filling in of missing values by reasonable predictions is a central requirement for the OCDP, since we discovered a large number of missing values in our datasets (see Tables 1 and 2).

The prediction workflow is given in Figure 3. The initial step regards the loading, transposing, and cleansing of the observations taken from the global cube. Then, for each indicator, we impute all the missing values with neutral values for the principal components analysis (PCA), and perform the PCA on the new matrix, which creates the principal components (PCs) that are used as predictors. Next, the predictors are used for the model building step using a basket of statistical machine learning methods such as multiple linear regression. Finally, we use the best model from the basket to fill in all missing values in the original matrix and publish them using the Missing Values wrapper.

In our earlier work [7], we have evaluated two approaches to choose the predictors, one based on applying the base methods to complete subsets in the data and the other based on PCA. In the present paper, we only use the PCA-based approach, since, although it delivers slightly lower prediction accuracy, it allows us to cope more robustly with the partially very sparse data, such that we can also predict values for indicators that do not provide sufficiently large subsets of complete, reliable predictors.

Base Methods. Our assumption is that every indicator has its own statistical distribution (e.g., normal, exponential, or Poisson distribution), sparsity, and relationship to other indicators. Hence, we aim to evaluate different regression methods and choose the best fitting method/model to predict the missing values per indicator. In order to find this best fitting method, we measure the prediction accuracy by comparing the normalised root mean squared error in % (RMSE%) [18] of every tested regression method. While in the field of Data Mining [18,19] (DM) numerous regression methods for missing value prediction were developed, we chose the following three “standard” methods for our evaluation due to their robustness and general performance:

K-Nearest-Neighbour Regression (KNN), models denoted as M_KNN, is a wide-spread DM technique based on using a distance function to a vector of predictors to determine the target values from the training instance space. As stated in [19], the algorithm is simple, easily understandable and reasonably scalable. KNN can be used in variants for clustering as well as regression.

Multiple Linear Regression (MLR), models denoted as M_MLR, has the goal to find a linear relationship between a target and several predictor variables. The linear relationship can be expressed as a regression line through the data points. The most common approach is ordinary least squares to measure and minimise the cumulated distances [19].

Random Forest Decision Trees (RFD), models denoted as M_RFD, involve the top-down segmentation of the data into multiple smaller regions represented by a tree with decision and leaf nodes. Each segmentation is based on splitting rules, which are tested on a predictor. Decision nodes have branches for each value of the tested attribute and leaf nodes represent a decision on the numerical target. A random forest is generated by a large number of trees, which are built according to a random selection of attributes at each node. We use the algorithm introduced by Breiman [24].

Principal Component Analysis. All three of the above-described base methods need a complete data matrix as a basis for calculating predictions for the respective target indicator column. Hence, we need for each target indicator (to be predicted) a complete training data subset of predictor indicators. However, as discussed in [7], when dealing with very sparse data, such complete subsets are very small and would allow us to predict missing values only for a few indicators and cities. Therefore, we omit the direct use of indicators as predictors. Instead, we first perform a PCA to reduce the number of dimensions of the data set and use the new compressed dimensions, called principal components (PCs), as predictors for the three base methods described above. As stated in [19], PCA is a common technique for finding patterns in data of high dimensions (in our case, many different indicators for many different cities and years). We use PCA to compress the large number of indicators to a smaller set of principal components which can later be used as predictors. The second main advantage of PCA is in terms of dealing with sparse data: as described in [20], all the missing values in the raw data matrix can be replaced by neutral values for the PCA, created according to the so-called regularised iterative PCA algorithm. This step allows us to perform PCA on the entire data matrix, even if only a few complete subsets exist.

5.1. Preprocessing

Before we can apply the PCA and subsequently the base regression methods, we need to pre-process and prepare the data from the global cube to bring it into the form of a two-dimensional data matrix. This preprocessing starts with the extraction of the observations from the global cube. Since the described standard DM methods cannot deal with the hierarchical, multi-dimensional data of the global cube, we need to “temporarily flatten” the data back to tuples. For this, we pose the following SPARQL query, with an increasing year range that is currently 2004–2017.

    SELECT DISTINCT ?city ?indicator ?year ?value
    FROM <http://citydata.wu.ac.at/qb-materialised-global-cube>
    WHERE {
      ?obs dcterms:date ?year .
      ?obs sdmx-dimension:refArea ?city .
      ?obs cd:hasIndicator ?indicator .
      ?obs sdmx-measure:obsValue ?value .
      { ?obs a cd:CrawledObservation } UNION { ?obs a cd:factualQBeObservation } .
      FILTER(xsd:integer(?year) >= 2004)
    } ORDER BY ?indicator ?city ?year

The SPARQL query flattens the multidimensional data to an input data table with tuples of the form:

⟨City, Indicator, Year, Value⟩.

Based on the initial table, we perform a simple preprocessing as follows:

• Removing nominal columns and encoding boolean values;

• Merging the dimensions year and city to one, resulting in:

⟨City Year, Indicator, Value⟩;

that is, we further flatten the consideration of city per year to city/year “pairs”;

• Finally, we transpose the initial table to a two-dimensional data matrix with one row per city/year pair and one column per indicator, resulting in tuples of the form:

⟨City Year, Indicator_1 Value_1, ..., Indicator_n Value_n⟩;

• From this large matrix, we delete columns and rows which have a missing values ratio larger than 99%, that is, we remove city/year pairs or indicators that have too many missing values to make reasonable predictions, even when using PCA.

Our initial data set from merging Eurostat and UNSD contains 1 961 cities with 875 indicators. By merging city and year and transposing the matrix we create 12 008 city/year rows. After deleting the city/year pairs and indicators with a missing values ratio larger than 99%, we have the final matrix of 6 298 rows (city/year) with 212 columns (indicators).

Note that the flattening approach and deletion of too sparse rows/columns are generic and could obviously still be applied if we added more data sources, but our experiments herein focus on the Eurostat and UNSD data.
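The flattening and the removal of overly sparse rows and columns described in this subsection can be sketched in Python as follows. The actual preprocessing is done in R scripts; the function names and the toy tuples are hypothetical, and evaluating rows and columns against the same unfiltered matrix is an assumption, since the order of deletion is not specified above.

```python
def pivot(tuples):
    """Turn (city, indicator, year, value) tuples into a city/year x indicator matrix."""
    rows = sorted({(city, year) for city, _, year, _ in tuples})
    cols = sorted({indicator for _, indicator, _, _ in tuples})
    cells = {((city, year), ind): value for city, ind, year, value in tuples}
    matrix = [[cells.get((r, c)) for c in cols] for r in rows]  # None = missing
    return rows, cols, matrix


def missing_ratio(values):
    """Share of missing entries; an empty sequence counts as fully missing."""
    if not values:
        return 1.0
    return sum(v is None for v in values) / len(values)


def drop_sparse(rows, cols, matrix, max_missing=0.99):
    """Keep rows and columns whose missing ratio does not exceed max_missing."""
    keep_cols = [j for j in range(len(cols))
                 if missing_ratio([row[j] for row in matrix]) <= max_missing]
    keep_rows = [i for i, row in enumerate(matrix)
                 if missing_ratio(row) <= max_missing]
    return ([rows[i] for i in keep_rows],
            [cols[j] for j in keep_cols],
            [[matrix[i][j] for j in keep_cols] for i in keep_rows])


# Toy tuples with made-up values.
tuples = [
    ("Vienna", "population", "2010", 1.0),
    ("Vienna", "area", "2010", 2.0),
    ("Graz", "population", "2010", 3.0),
    ("Graz", "rare", "2011", 4.0),
]
rows, cols, matrix = pivot(tuples)
# A 50% threshold is used here only because the toy matrix is tiny.
kept_rows, kept_cols, kept = drop_sparse(rows, cols, matrix, max_missing=0.5)
```

With the 99% threshold from the text the toy matrix would remain unchanged; the lower threshold only serves to make the filtering visible.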

5.2. Prediction using PCA and the Base Regression Methods

Next, we are ready to perform PCA on the data matrix created in the previous subsection. That is, we impute all the missing values with neutral values for the PCA, according to the above-mentioned regularised iterative PCA algorithm described in [20]. In more detail, the following steps are evaluated having an initial data set $A_1$ as a matrix and a predefined number of predictors $n$ (we test this approach also on different $n$'s):

1. Select the target indicator $I_T$;

2. Impute the missing values in $A_1$ using the regularised iterative PCA algorithm resulting in matrix $A_2$ and remove the column with $I_T$;

Figure 4: Prediction results using PCA (RMSE% against the number of predictors; series: Knn, Rforest, Linreg, Best)

3. Perform the PCA on $A_2$ resulting in a matrix $A_3$ of a maximum of 80 PCs;

4. Append the column of $I_T$ to $A_3$ creating $A_4$ and calculate the correlation matrix $A_C$ of $A_4$ between $I_T$ and the PCs;

5. Create the submatrix $A_5$ of $A_4$ on the selection of the PCs with the highest absolute correlation coefficients and limit them by $n$;

6. Create submatrix $A_6$ of $A_5$ for validation by deleting rows with missing values for $I_T$;

7. Apply stratified tenfold cross-validation on $A_6$, which results in the best performing model $M_{Best}$;

8. Use the method for $M_{Best}$ to build a new model on $A_5$ (not $A_6$) for predicting the missing values of $I_T$.
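Steps 4 and 5, the selection of the $n$ PCs that correlate most strongly with the target indicator, can be illustrated with a short sketch in plain Python. The helper names `pearson` and `select_predictors` are hypothetical, the toy matrix is invented, and the sketch assumes that correlations are computed only over rows where $I_T$ is observed; the authors' R implementation may differ.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y) if sd_x and sd_y else 0.0


def select_predictors(pcs, target, n):
    """Return the column indices of the n PCs with the highest absolute
    correlation with the target, using only rows with an observed target."""
    observed = [i for i, t in enumerate(target) if t is not None]
    target_observed = [target[i] for i in observed]
    scores = []
    for j in range(len(pcs[0])):
        column = [pcs[i][j] for i in observed]
        scores.append((abs(pearson(column, target_observed)), j))
    # Highest absolute correlation first; ties broken by the lower index.
    scores.sort(key=lambda s: (-s[0], s[1]))
    return [j for _, j in scores[:n]]


# Toy data: four city/year rows, three PCs, one missing target value.
pcs = [[1, 5, 2], [2, 1, 4], [3, 4, 5], [4, 2, 9]]
target = [2.0, 4.0, 6.0, None]
selected = select_predictors(pcs, target, n=2)
```

The rows with a missing target are skipped here only for scoring; as in step 8, they remain part of the matrix on which the final model predicts.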

5.3. Evaluation and Publishing

Figure 4 shows the results for the median RMSE% with an increasing number of predictors (selected from the 80 PCs) and compares the performance of KNN, RFD, MLR, and the selection of the best method. Clearly, for 80 predictors MLR performs best with a median RMSE% of 0.56%, whereas KNN (resp. RFD) has a median RMSE% of 4.36% (resp. 5.27%). MLR is the only method that improves steadily up to 80 predictors. KNN provides good results for a lower number of predictors, but starts flattening at 20 predictors. Contrary to MLR, the parameters of KNN and RFD have to be adjusted according to the number of predictors; hence, optimising the number of clusters for KNN could improve the result. The red line in Figure 4 shows the median RMSE% with the best regression method chosen. Up to 60 predictors, the overall result improves by selecting the best performing method (for each indicator). The best median RMSE% of 0.55% is reached with 80 predictors, where MLR is predominant and only 5 out of 232 indicators are predicted by KNN. We emphasise that, compared to the result of our earlier experiments in [7], the median RMSE% improved from 1.36% to 0.55%, which is mainly related to the lower sparsity of the datasets.

Finally, we note again why we added PCA, as opposed to attempting predictions based on complete subsets: in our preliminary evaluations, based on the comparison of the two approaches in [7], picking the best performing regression method per indicator with ten predictors from the raw data based on complete subsets yielded a median RMSE% of 0.25%. However, due to the low occurrence of complete subsets of reasonable size for ten predictors, only one third of the missing values could be imputed compared to using PCA. We acknowledge that this comes at a cost, as the median RMSE% when using PCA goes up to 0.55% with 80 predictors (see above). However, due to the sparsity in the data we decided to trade some prediction accuracy for better completeness.

We publish the predicted values created by the combination of PCA and selecting the best regression method per indicator, where we apply an RMSE% threshold of 20% as a cut-off. In the current evaluation, this threshold does not lead to the removal of any indicator. Following our strategy of using statistical linked data wrappers, we publish the predicted values using the Missing Values wrapper,^{32} which provides a table of contents, a structure definition, and datasets that are created for each prediction execution.
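The per-indicator selection of the best regression method and the 20% cut-off can be sketched as follows. This is a hedged illustration only: it assumes that RMSE% is the RMSE normalised by the range of the observed values and expressed in percent, which is not restated in this section, and the names `rmse_percent` and `pick_best_method` as well as the toy numbers are hypothetical.

```python
import math


def rmse_percent(actual, predicted):
    """RMSE divided by the range of the actual values, in percent.
    Assumption: this normalisation matches the RMSE% used in the text."""
    mse = sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    value_range = max(actual) - min(actual)
    if value_range == 0:
        raise ValueError("RMSE% is undefined for constant actual values")
    return 100.0 * math.sqrt(mse) / value_range


def pick_best_method(actual, predictions_by_method, cutoff=20.0):
    """Return (method, RMSE%) for the method with the lowest RMSE%, or
    (None, RMSE%) if even the best method exceeds the cut-off."""
    scores = {
        method: rmse_percent(actual, predicted)
        for method, predicted in predictions_by_method.items()
    }
    best = min(scores, key=scores.get)
    if scores[best] > cutoff:
        return None, scores[best]
    return best, scores[best]


# Toy validation data for one indicator (hypothetical values).
actual = [0.0, 10.0, 20.0]
candidates = {
    "KNN": [2.0, 10.0, 18.0],
    "MLR": [0.0, 11.0, 20.0],
}
best_method, best_score = pick_best_method(actual, candidates)
```

Returning `None` for indicators above the cut-off mirrors the idea that such predictions would not be published.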

5.4. Workflow and Provenance

The full prediction workflow of our statistical prediction for missing values is shown in Figure 3 and is based on all observations but the old predicted values in the global cube. The data preprocessing and transposing for the input data matrix is written in Python, but all other steps such as PCA, model building, and model evaluation are developed in R [25] using its readily available “standard” packages (another advantage of relying on standard regression methods). All the scripts and their description are available on the website of the Missing Values wrapper. We conducted an evaluation of the execution time on our Ubuntu Linux server with 2 cores, 2.6 GHz, and 16 GB of RAM. A single prediction run requires approx. 10 min for each indicator (approx. 3 min for each method), resulting in a total time of about 35 hours for all indicators, which is still reasonably doable for re-running wrappers and recomputing models and predictions in a weekly batch job.
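As a rough consistency check of the reported total, and assuming that a full run covers the 212 indicator columns of the final matrix from Section 5.1, the per-indicator time adds up as follows:

```latex
212 \ \text{indicators} \times 10 \ \text{min} = 2120 \ \text{min} \approx 35.3 \ \text{h}
```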

Looking back to Figure 3, one can see that the workflow branches after four steps, where we distinguish two cases. In the case of no previous executions, we perform the full prediction steps as described in the previous section. In the case of previous executions, we already have provenance information available in our triple store, which describes the last execution and the related model provenance information (for each indicator). The model provenance includes for each indicator the number of PCs, the number of predictors used from these PCs, the chosen prediction base method, method parameters (e.g., the number of clusters in the KNN), and the RMSE%.

To sum up, we keep provenance for our predictions on three levels:

• For each execution, we publish the median RMSE% over all indicators, number of predictors, creation date, and the creation agent;

• For each indicator, we publish the above mentioned model provenance data;

^{32} http://citydata.ai.wu.ac.at/MV-Predictions/

• For each predicted value published as a qb:Observation, we provide the overall absolute estimated RMSE (using the predicate cd:estimatedRMSE) and the estimated RMSE% (using the predicate cd:estimatedNormalizedRMSE). Further, we point to better observations (published with a lower RMSE%) using the predicate cd:preferredObservation, which might occur if another approach such as a different base method or QB Equations (discussed in Section 6 below) improves the predicted values.
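The idea behind cd:preferredObservation can be sketched as a small selection routine: among all observations for the same city, year, and indicator, each one points to the observation with the lowest estimated RMSE%. The dictionary layout, the function name `assign_preferred`, and the toy values are assumptions for illustration and do not reflect how the pipeline stores its data.

```python
def assign_preferred(observations):
    """Map each observation id to the id of the observation with the lowest
    estimated RMSE% for the same (city, year, indicator)."""
    best = {}
    for obs in observations:
        key = (obs["city"], obs["year"], obs["indicator"])
        if key not in best or obs["rmse_percent"] < best[key]["rmse_percent"]:
            best[key] = obs
    return {
        obs["id"]: best[(obs["city"], obs["year"], obs["indicator"])]["id"]
        for obs in observations
    }


# Toy observations (hypothetical ids and error estimates).
observations = [
    {"id": "obs1", "city": "X", "year": 2010, "indicator": "i1", "rmse_percent": 1.8},
    {"id": "obs2", "city": "X", "year": 2010, "indicator": "i1", "rmse_percent": 0.9},
    {"id": "obs3", "city": "Y", "year": 2010, "indicator": "i1", "rmse_percent": 2.5},
]
preferred = assign_preferred(observations)
```

An observation that is already the best one for its city, year, and indicator points to itself.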

For describing the model provenance, we use the MEX vocabulary, which, compared to other vocabularies (e.g., DMOP [26]), is lightweight and designed for exchanging machine learning metadata [27]. We use the MEX Algorithm layer to describe our prediction method and its parameters and the MEX Performance layer to describe the RMSE%. Further, we describe each execution using attributes of MEX Execution.

Example 5.1. The following example gives an intuition into reading the data about missing value predictions.

@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix cdmv: <http://citydata.wu.ac.at/MV-Predictions/> .
@prefix mexp: <http://mex.aksw.org/mex-perf/> .
@prefix mexc: <http://mex.aksw.org/mex-core#> .
@prefix mexa: <http://mex.aksw.org/mex-algo#> .

cvmv:predDS1 rdf:type qb:DataSet .
cvmv:predDS1 prov:wasGeneratedBy cvmv:runP1 .
cvmv:predDS1 dc:title "A3_2004-2016_ncp80_seed_100_pred_80" .
cvmv:runP1 rdf:type mexc:Execution , prov:Activity .
cvmv:runP1 cdmv:predictionPCs 80 .
cvmv:runP1 mexp:rootMeanSquaredError 1.0705 .
cvmv:runP1 mexc:endsAt "2017-07-31T10:52:02Z"^^xsd:dateTime .
cvmv:runP1 cvmv:hasPredicted cvmv:runP1_1 .
cvmv:runP1_1 mexc:datasetColumn cd:no_bed-places_in_tourist__accommodation_establishments .
cvmv:runP1_1 mexc:hasAlgorithmConfig mexa:Regression .
cvmv:runP1_1 cd:estimatedAbsoluteRMSE 3228.8726 .
cvmv:runP1_1 cd:estimatedNormalizedRMSE 1.78259 .
cvmv:runP1_1 cdmv:size 2737 .
cdmv:obs1 rdf:type cd:PredictedObservation .
cdmv:obs1 cd:hasIndicator cd:no_bed-places_in_tourist__accommodation_establishments .
cdmv:obs1 sdmx-dimension:refArea dbpedia:Bolzano .
cdmv:obs1 dcterms:date "2010" .
cdmv:obs1 sdmx-measure:obsValue 1490.4485 .
cdmv:obs1 cd:estimatedAbsoluteRMSE 3228.8726 .
cdmv:obs1 cd:estimatedNormalizedRMSE 1.78259 .
cdmv:obs1 cd:preferredObservation cdmv:obs1 .
cdmv:obs1 qb:dataSet cvmv:predDS1 .

The example shows a qb:DataSet of predicted values generated by a run on 2017-07-31 using our PCA-based approach. We show one predicted value and its RMSEs for the indicator no_bed-places_in_tourist__accommodation_establishments of the city of Bolzano in the year 2010. The best method for this indicator was MLR, which is indicated by the triple:

cvmv:runP1_1 mexc:hasAlgorithmConfig mexa:Regression.

The triple cdmv:obs1 cd:preferredObservation cdmv:obs1 states that currently there is no better prediction available, i.e., that this observation is itself the most preferred (i.e., best) for the respective indicator for this city/year.

In summary, while through the availability of more and new raw data we could improve the prediction quality compared to [7], this is – essentially, apart from the more structured workflow