Data cataloguing on the Web - Knowledge Components and Methods for Policy Propagation in Data F

ASK query that can be used to perform consistency checks.

3. Declares the property spin:rule that can be used to link a class to a SPARQL CONSTRUCT query that can be used to produce inferred triples from the members of the class.

4. Defines a spin:Template type useful to store parametrized SPARQL queries.

While OWL is a modelling language, SPIN is a technique for making queries and data interact in an automated way; it therefore does not have the properties of a formal language. However, it provides great flexibility, allowing the definition of both Horn-style rules and production rules (which create terms that did not previously exist in the data).
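To make the idea of a rule attached to a class more concrete, the following is a minimal sketch in plain Python rather than in SPIN itself. Triples are tuples, a function plays the role of a CONSTRUCT query, and a dictionary stands in for the spin:rule link between a class and its rules. All identifiers (`ex:Person`, `ex:parentOf`, `ex:childOf`, `child_of_rule`, `apply_rules`) are hypothetical and only illustrate the Horn-style pattern described above.

```python
# Illustrative only: a hand-rolled analogue of a class-attached rule, not the SPIN API.
RDF_TYPE = "rdf:type"

# A tiny graph represented as a set of (subject, predicate, object) triples.
graph = {
    ("ex:alice", RDF_TYPE, "ex:Person"),
    ("ex:bob", RDF_TYPE, "ex:Person"),
    ("ex:alice", "ex:parentOf", "ex:bob"),
}


def child_of_rule(member, triples):
    """Plays the role of a CONSTRUCT query: for each `member parentOf ?c`,
    produce the inferred triple `?c childOf member`."""
    return {
        (o, "ex:childOf", member)
        for (s, p, o) in triples
        if s == member and p == "ex:parentOf"
    }


# Analogue of spin:rule: link a class to the rules that apply to its members.
rules_by_class = {"ex:Person": [child_of_rule]}


def apply_rules(triples, rules):
    """Run every rule on every member of its class and return only the new triples."""
    inferred = set()
    for cls, cls_rules in rules.items():
        members = {s for (s, p, o) in triples if p == RDF_TYPE and o == cls}
        for member in members:
            for rule in cls_rules:
                inferred |= rule(member, triples)
    return inferred - triples


inferred = apply_rules(graph, rules_by_class)
print(inferred)
```

In this sketch the rule only adds triples over existing terms; a production-style rule would differ in that its body could also mint identifiers that do not yet occur in the graph.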

2.5 Data cataloguing on the Web

In this section we summarise the main approaches to the cataloguing of datasets, focusing on the way they support terms and conditions in the context of the Web of Data. Moreover, we give an overview of data cataloguing platforms and how they support the reference to licences and terms of use.

2.5.1 Approaches to data cataloguing

Technologies and standards for data representation and annotation have a long tradition in library cataloguing and interoperability (for example, in the context of the Open Archives Initiative[11]), and in large part they contributed to the current standards of the Semantic Web (the Dublin Core Metadata Initiative being the most prominent case[12]).

Metadata is data that provides information about the context of an asset, supporting a set of functions associated with it. In the context of data cataloguing, metadata enables dataset discovery, understanding, integration, and maintenance [Assaf et al. (2015)].

[11] http://www.openarchives.org/

The concept is now widely used and embraces a broad range of structured information about digital assets, which can be characterised as (a) identification, (b) administration, (c) terms and conditions, (d) content ratings, (e) structural metadata, (f) provenance, and (g) linkage [Greenberg (2005)]. In its early stages, the Web was often described as a library, although its characteristics differ in many respects. However, “something very much like traditional library services will be needed to organise, access and preserve networked information” [Lynch (1997)]. The Dublin Core Initiative was started as a means to standardise the way object metadata could be represented, inheriting fundamental notions from the library domain, including authorship, attribution, and copyright [Weibel (1999)]. Most cataloguing approaches that publish dataset metadata in a machine-readable way include the capability of referring to a licence or rights statement, using mechanisms that rely upon Dublin Core (DC) properties such as dc:license and dc:rights.

The Semantic Web programme promotes approaches and techniques to publish and exchange metadata on the Web, through the definition of more specialised vocabularies. VoID, the "Vocabulary of Interlinked Datasets", has been proposed to describe Linked Data. VoID covers several aspects of RDF datasets, from general metadata (inheriting Dublin Core terms) to access methods, internal organisation, partitioning, and linking [Alexander and Hausenblas (2009)]. The RDF Data Cube Vocabulary is the W3C recommendation for the publication of multidimensional data tables, mainly in the domain of statistical data. Its scope is to provide a schema for publishing data, not a method for describing datasets [Reynolds and Cyganiak (2014b)]. VOAF [Vandenbussche and Vatant (2011)] and VANN[13] [Davis (2005)] have been designed to describe vocabularies in the Linked Data cloud[14]. The Asset Description Metadata Schema (ADMS) is an RDF vocabulary proposed as a common representation and exchange format for e-Government repositories [Shukair et al. (2013)]. All these vocabularies inherit the Dublin Core properties to express licences and terms of use. The Data Documentation Initiative (DDI) has developed a metadata specification for the social and behavioural sciences, intended to describe data sets resulting from social science research (such as survey data). An RDF version of the vocabulary has recently been developed[15] to foster publishing on the Web of Linked Data. DDI provides reusable datasets as well as reusable metadata models and standards, including a LicenseMandate term to include citations to legal documents providing details about the terms of use [Vardigan et al. (2008)].

[13] http://purl.org/vocab/vann/

[14] See the Linked Open Vocabularies activity: http://lov.okfn.org/dataset/lov/.

[15] http://rdf-vocabulary.ddialliance.org/discovery.html.

[16] http://data.gov.uk

The Data Catalog Vocabulary (DCAT) is a W3C recommendation for the description of data catalogues [Erickson and Maali (2014)]. DCAT defines a dataset as a collection that is published and curated by an organisation that provides access to it in one or more formats. Examples of data catalogues using DCAT are data.gov.uk[16] and the Open Government Dataset Catalog [Erickson et al. (2011)]. The DC subschema for rights and licences is incorporated in the W3C DCAT standard for the representation of the catalogue meta-level[17] [Erickson and Maali (2014)]. DCAT introduces a further level of concretisation via the dcat:Distribution class, which accommodates bespoke rights statements, typically using the URIs of licence descriptions. The HyperCat specification follows a similar notion; however, it enforces the use of URIs for values and contemplates machine-readable content as a possible form to which they dereference[18].
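As a rough illustration of how a licence reference can sit at the distribution level, the sketch below models a catalogue entry as plain Python tuples and looks up the terms that apply to each distribution. The `ex:` identifiers and the `terms_for` helper are hypothetical. The `dct:` prefix is assumed here to stand for the Dublin Core Terms namespace (http://purl.org/dc/terms/). Falling back to the dataset-level statement when a distribution has none is a design choice of this sketch, not something the vocabularies discussed above are claimed to mandate.

```python
# Illustrative only: triples as tuples; every ex: identifier is made up.
DCT_LICENSE = "dct:license"  # assumed to denote http://purl.org/dc/terms/license
DCT_RIGHTS = "dct:rights"    # assumed to denote http://purl.org/dc/terms/rights

catalogue = [
    ("ex:dataset1", "rdf:type", "dcat:Dataset"),
    ("ex:dataset1", DCT_RIGHTS, "ex:rights-statement"),
    ("ex:dataset1", "dcat:distribution", "ex:dataset1-csv"),
    ("ex:dataset1", "dcat:distribution", "ex:dataset1-rdf"),
    ("ex:dataset1-csv", "rdf:type", "dcat:Distribution"),
    ("ex:dataset1-csv", DCT_LICENSE, "https://creativecommons.org/licenses/by/4.0/"),
    ("ex:dataset1-rdf", "rdf:type", "dcat:Distribution"),
]


def objects(triples, subject, predicate):
    """Return all objects of the triples matching subject and predicate."""
    return [o for (s, p, o) in triples if s == subject and p == predicate]


def terms_for(triples, distribution):
    """Return the licence URIs set on a distribution; if there are none,
    fall back to licence or rights statements of the dataset(s) it belongs to."""
    direct = objects(triples, distribution, DCT_LICENSE)
    if direct:
        return direct
    datasets = [
        s for (s, p, o) in triples
        if p == "dcat:distribution" and o == distribution
    ]
    inherited = []
    for dataset in datasets:
        inherited += objects(triples, dataset, DCT_LICENSE)
        inherited += objects(triples, dataset, DCT_RIGHTS)
    return inherited


csv_terms = terms_for(catalogue, "ex:dataset1-csv")
rdf_terms = terms_for(catalogue, "ex:dataset1-rdf")
print(csv_terms, rdf_terms)
```

Because the values are URIs, a consumer can at least compare them for identity; whether they dereference to machine-readable content is a separate question that this sketch does not address.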

Even in e-Government, where policy transparency is of the utmost importance, a fairly recent study revealed a certain degree of heterogeneity in how licences are expressed in government data catalogues [Maali et al. (2010)]. Such a survey could be expected to deliver slightly more encouraging results if carried out today, if only because of the standardisation efforts that have since been promoted by organisations such as the Open Data Institute (ODI) [Attard et al. (2015)].

Although the scientific literature recognises the importance of licence information, including a machine-readable representation of the licence, in making businesses aware of the legal implications of (open) data reuse, existing work is still preliminary on the modelling side [Assaf et al. (2015)].

2.5.2 Tools for data cataloguing

Systems like CKAN, one of the best-known data cataloguing platforms, and Dataverse[19] support the attachment of a licence to datasets. Socrata[20] supports the specification of roles and permissions for the management workflow, while data terms of use are exclusively in human-readable form[21].

[18] HyperCat specification, http://www.hypercat.io/standard.html

[19] Dataverse, http://dataverse.org

[20] Socrata, http://www.socrata.com

[21] Example at the time of writing: https://opendata.camden.gov.uk/api/views/6ikd-ep2e.json

[22] Comprehensive Knowledge Archive Network, http://ckan.org

[23] Dataverse, http://dataverse.org

CKAN adopts a package manager paradigm to implement dataset management[22]. A package, i.e. the basic unit on which licences are set in CKAN, is the dataset itself. A similar argument can be made for Dataverse[23], which adopts a method similar to that of CKAN. A survey of existing Socrata-based open data catalogues[24], accessed through the Socrata API, has brought to the surface metadata about owner descriptions and roles, permissions (mostly related to the management platform), and grant inheritance policies, all using an in-house (presumably controlled) vocabulary. Custom metadata are also used for the specification of licences, though their instances are for the most part in human-readable form[25]. DKAN[26] is a dataset management system based on Drupal with a set of cataloguing, publishing and visualisation features, whose data model is akin to that of CKAN, covering information to describe datasets, resources, groups and tags [Assaf et al. (2015)].
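The package-level granularity described above can be sketched with a small, hand-written record. The JSON below is hypothetical and no platform API is called. The field names `license_id`, `license_title` and `license_url` are assumed to resemble those exposed in CKAN package descriptions, and `licence_summary` is an illustrative helper, not part of any platform.

```python
import json

# A hypothetical record shaped like a package description (not fetched from any API).
record_json = """
{
  "name": "example-dataset",
  "license_id": "cc-by",
  "license_title": "Creative Commons Attribution",
  "license_url": "https://creativecommons.org/licenses/by/4.0/",
  "resources": [
    {"name": "data.csv", "format": "CSV"},
    {"name": "data.json", "format": "JSON"}
  ]
}
"""


def licence_summary(package):
    """Classify how the licence of a package is expressed."""
    url = package.get("license_url")
    if url:
        return ("URI reference", url)
    label = package.get("license_title") or package.get("license_id")
    if label:
        return ("human-readable label", label)
    return ("unspecified", None)


package = json.loads(record_json)
summary = licence_summary(package)

# The licence is set once on the package, so every resource shares it.
resource_licences = {r["name"]: summary[1] for r in package["resources"]}
print(summary, resource_licences)
```

A record carrying only a free-text label falls into the second case, which corresponds to the human-readable terms of use observed in the surveyed catalogues.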

In recent years, data repositories and registries have been growing, ranging from data cataloguing services (Datahub[27]) and data collections (Wikidata[28], Europeana[29]) to platforms that manage the collection and redistribution of data (Socrata[30]). Research Data Management includes the cataloguing of the assets involved in research activities, and this area has recently been developing, partly through a push from libraries [Cox and Pinfield (2014)]. Although [Lyon (2012)] maps potential roles of the library to a research lifecycle model, including the aspect of research data licensing, data licensing in research data management infrastructures is still at an early stage (with the small exception that we will mention in Section 2.6.3). Services such as Zenodo and Figshare offer the possibility of preserving and publishing assets produced by scientific research. Zenodo encourages sharing data openly and allows a variety of licences[31]. Figshare recommends the use of Creative Commons 4.0 licences and provides guidance about a set of popular licences for open data and software[32]. An emerging category of systems is City Data Hubs, whose role is to collect data, mainly from sensor networks, and publish them in order to support novel IoT applications. City Data Hubs need to support developers not only in obtaining data but also in assessing the policies associated with data resulting from complex pipelines [Bischof et al. (2015); Bohli et al. (2015); d’Aquin et al. (2014)]. It is therefore necessary for these systems to implement technologies that allow the policies associated with derived datasets to be assessed.

2.5.3 Towards supporting an exploitability assessment

Data cataloguing systems are often designed to support discovery and indexing of data sources, leaving out an important dimension in data reuse, which is the support for early analysis. An important part of the early analysis is the applicability of the data sources to the use case at hand

[25] Example at the time of writing: https://opendata.camden.gov.uk/api/views/6ikd-ep2e.json

[26] DKAN: http://getdkan.com/

[27] Datahub. https://old.datahub.io/. Accessed: May, 2018.

[28] Wikidata. https://www.wikidata.org. Accessed: May, 2018.

[29] Europeana. http://labs.europeana.eu/. Accessed: May, 2018.

[30] Socrata. https://www.socrata.com/. Accessed: May, 2018.

[31] Zenodo, http://help.zenodo.org/features/, Accessed: May 2018.

[32] Figshare, https://knowledge.figshare.com/articles/item/what-is-the-most-appropriate-license-for-my-data,


In document Knowledge Components and Methods for Policy Propagation in Data Flows (Page 74-78)