Open Science – Open Data - GRDI2020 Final Roadmap Report

There is an emerging consensus among members of the academic research community that “e-science” practices should be congruent with “open science”. The essence of e-science is “global collaboration” in key areas of science, and the next generation of data infrastructures must enable it. Global scientific collaboration takes many forms, but from the various initiatives around the world a consensus is emerging that collaboration should aim to be “open”, or at least that there should be a substantial measure of “open access” to the data and information underlying published research, and to communication tools [81].

The concept of Open Data is not new; but although the term is currently in frequent use, there are no commonly agreed definitions (unlike, for example, Open Access where several formal declarations have been made and signed).

The Organization for Economic Co-operation and Development (OECD), an international organization that helps governments tackle the economic, social, and governance challenges of a globalized economy, has produced a report, “Recommendations of the Council concerning Access to Research Data from Public Funding” [82], which gives the following definition of openness: “Openness means access on equal terms for the international research community at the lowest possible cost, preferably at no more than the marginal cost of dissemination. Open access to research data from public funding should be easy, user-friendly and preferably Internet-based.”

Open Data is focused on data from scientific research. Problems often arise because these data are commercially valuable or can be aggregated into works of value. Access to, or re-use of, the data is controlled by organisations, both public and private. Control may be exercised through access restrictions, licenses, copyright, patents and charges for access or re-use.

Open Data has a similar ethos to a number of other "Open" movements and communities, such as open source and open access. However, these are not logically linked and many combinations of practice are found. The practice and ideology are well established, but the term "open data" itself is recent.

An essential prerequisite for successful adoption of the Open Data principle is the willingness of scientific communities to share data. Researchers acknowledge that data sharing increases the impact, utility, and profile of their work. On the other hand, research is highly competitive, and publications depend on an individual's ability to produce novel data, which can be a disincentive for collaboration.

There are also major ethical considerations in sharing data between researchers and between countries, and in making data available for open access. Acceptance of the open data principle entails a cultural shift in science regarding the importance of data sharing and mining.

Much data is made available through scholarly publication, which now attracts intense debate under "Open Access". The Budapest Open Access Initiative (2001) coined this term:

By "open access" to this literature, we mean its free availability on the public internet, permitting any users to read, download, copy, distribute, print, search, or link to the full texts of these articles, crawl them for indexing, pass them as data to software, or use them for any other lawful purpose, without financial, legal, or technical barriers other than those inseparable from gaining access to the internet itself. The only constraint on reproduction and distribution, and the only role for copyright in this domain, should be to give authors control over the integrity of their work and the right to be properly acknowledged and cited.

The logic of the declaration permits re-use of the data, although the term "literature" has connotations of human-readable text and can imply a scholarly publication process. In Open Access discourse the term "full-text" is often used, which does not emphasize the data contained within or accompanying the publication. While “open data” will enhance and accelerate scientific advance, there is also a need for “open science”, in which not only data but also analyses and methods are preserved, providing better transparency and reproducibility of results.

The extent of openness is an important issue that must be regulated by rules and norms governing the disclosure of data and information about research methods and results [81]:

• How fully and quickly is information about research procedures and data released?

• How completely is it documented and annotated so as to be not only accessible but also usable by those outside the immediate research group?

• On what terms and with what delays are external researchers able to access materials, data and project results?

• Are findings held back, rather than disclosed, in order first to obtain IPRs on a scientific project’s research results, and if so, how long is publication usually delayed?

The Open Data Principle has three dimensions: policy, legal, and technological. Technology must render physical and semantic barriers irrelevant, while policies and laws must make it possible to overcome the boundaries between legal jurisdictions.

Policy Dimension

Several political arguments are advanced in favour of open access to data:

• “Data belong to the human race”. Typical examples are genomes, environmental data, medical science data, etc.

• Public money was used to fund the work and so it should be universally available.

• In scientific research, the rate of discovery is accelerated by better access to data.

• Facts cannot legally be copyrighted.

• Sponsors of research do not get full value unless the resulting data are freely available.

There are also important economic and social reasons for openness in the pursuit of knowledge, which make explicit the supportive role played by norms that tend to reinforce cooperative behaviors among scientists. In brief, rapid disclosure:

• abets rapid validation of findings,

• reduces excess duplication of research effort,

• enlarges the domain of complementarities, and

• yields beneficial “spill-overs” among research programs.

Advocates of open data argue that access restrictions are against the communal good and that these data should be made available without restriction or fee. In addition, it is important that the data are re-usable without requiring further permission, though the types of re-use (such as the creation of derivative works) may be controlled by license.

As the term Open Data is relatively new, it is difficult to collect arguments against it. Unlike Open Access, where groups of publishers have stated their concerns, Open Data is normally challenged by individual institutions.

Their arguments may include:

• the institution is a non-profit organisation and the revenue is necessary to support other activities (e.g. learned society publishing supports the society)

• the government gives certain organizations specific legitimacy to recover costs (NIST in the US, Ordnance Survey in the UK)

• government funding may not be used to duplicate or challenge the activities of the private sector (e.g. PubChem)

It may be noted that it is the difficulty of monitoring research effort that makes it necessary for both the open science system and the intellectual property regime to tie researchers’ rewards, in one way or another, to priority in the production of observable “research outputs” that can be subjected to “validity tests and valorization”, whether directly by peer assessment or indirectly through their application in the markets for goods and services.

An acceptable compromise between open and closed data could be the introduction of a data-disclosure norm that defines a time-limited embargo period. During this period the novel data produced by a researcher remain closed, allowing him or her to produce publications based on these data without running the risk of fraudulent publications. After the embargo period has elapsed, the data become open (publicly available). The length of the embargo period must be agreed among the scientific communities and could be discipline dependent. The issues around consent and ownership are yet more complex within networked science environments.

Indeed, we appear to be in the midst of a massive collision between unprecedented increases in data production and availability and the privacy rights of human beings worldwide.
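The embargo-based disclosure norm proposed above can be made concrete with a small sketch. The following Python fragment is purely illustrative: the `Dataset` class, the discipline names and the embargo lengths are hypothetical assumptions, not values taken from the report, and any real lengths would have to be agreed by the communities concerned.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Dataset:
    """Hypothetical description of a dataset subject to an embargo."""
    name: str
    discipline: str
    produced_on: date


# Illustrative, made-up embargo lengths in days per discipline.
EMBARGO_DAYS = {"genomics": 180, "astronomy": 365}
DEFAULT_EMBARGO_DAYS = 365  # assumed fallback for unlisted disciplines


def embargo_end(dataset: Dataset) -> date:
    """Return the first day on which the dataset is publicly available."""
    days = EMBARGO_DAYS.get(dataset.discipline, DEFAULT_EMBARGO_DAYS)
    return dataset.produced_on + timedelta(days=days)


def is_open(dataset: Dataset, today: date) -> bool:
    """Data stay closed during the embargo and become open once it elapses."""
    return today >= embargo_end(dataset)
```

A data infrastructure could evaluate such a rule at access time, so that data become open automatically when the embargo elapses without any further action by the producer.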

Common frameworks and defined principles need to be agreed first if an Open Science Data space is to be established, particularly with regard to ethical and privacy issues.

The key to ensuring that these principles are effectively acted on in practice lies in the development of a coherent policy and legal framework at a national level. The national framework must support the international principles for data access and sharing but also be clear and practical enough for researchers to follow at a research project level [83].

The development of a national framework for data management based on principles promoting data access and sharing (such as the OECD recommendations) would help to incorporate international policy statements and protocols such as the Antarctic Treaty and the GEOSS Principles into domestic law.

Legal Dimension

It is generally held that factual data cannot be copyrighted. However, publishers frequently add their copyright statements (often forbidding re-use) to scientific data accompanying (supporting, supplementing) a publication. It is also usually unclear whether the factual data embedded in the full text are covered by the copyright.

While the human abstraction of facts from paper publications is normally accepted as legal, there is often an implied restriction on machine extraction by robots.

Some Open Access publishers do not require authors to assign copyright, and the data associated with these publications can normally be regarded as Open Data. Other publishers have Open Access strategies in which the publisher requires assignment of the copyright, and it is unclear whether the data in these publications can truly be regarded as Open Data.

The ALPSP and STM publishers have issued a statement about the desirability of making data freely available:

“Publishers recognise that in many disciplines data itself, in various forms, is now a key output of research. Data searching and mining tools permit increasingly sophisticated use of raw data. Of course, journal articles provide one ‘view’ of the significance and interpretation of that data – and conference presentations and informal exchanges may provide other ‘views’ – but data itself is an increasingly important community resource. Science is best advanced by allowing as many scientists as possible to have access to as much prior data as possible; this avoids costly repetition of work, and allows creative new integration and reworking of existing data.”

and

“We believe that, as a general principle, data sets, the raw data outputs of research, and sets or sub-sets of that data which are submitted with a paper to a journal, should wherever possible be made freely accessible to other scholars. We believe that the best practice for scholarly journal publishers is to separate supporting data from the article itself, and not to require any transfer of or ownership in such data or data sets as a condition of publication of the article in question.”

However, this statement has had no effect on the open availability of primary data related to publications in the journals of ALPSP and STM members. Data tables that authors provide as supplements to a paper are still available to subscribers only.

The importance of building scientific data infrastructures in which research findings can be readily made available to and used by other researchers has long been recognized in international scientific collaborations. However, creating and maintaining conditions of openness might mean not simply putting data on-line but making them sufficiently robust and well-documented to be widely utilized.

There are two main options in making data openly accessible and sharable:

• Open access to data/metadata with re-use restrictions

• Open access to data/metadata without re-use restrictions

To implement the policy of open access to data/metadata with re-use restrictions, the data producer/provider must make the data understandable to researchers belonging to the same or different scientific disciplines.

Therefore, the data must be endowed with auxiliary information that enriches their semantics. Such auxiliary information could include contextual information as well as open access community ontologies, terminologies and taxonomies.

In addition, the adoption of annotation practices based on standardized terminologies will greatly increase the understandability of the published data.

To implement the policy of open access to data/metadata without re-use restrictions, the data producer/provider must make the data not only understandable but also usable.

Therefore, the data must be endowed with auxiliary information that helps to make them usable. Such auxiliary information could include provenance, quality, and uncertainty information.
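The distinction between understandable and usable data can be illustrated with a minimal sketch of a metadata record. The `DataRecord` class and its field names below are hypothetical and chosen only for illustration; they do not correspond to any metadata standard named in the report.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional


@dataclass
class DataRecord:
    """Hypothetical wrapper pairing data with its auxiliary information."""
    values: list[float]
    # Information that enriches semantics (needed for understandability).
    context: Optional[str] = None
    vocabulary_terms: list[str] = field(default_factory=list)
    # Information needed for re-use (usability).
    provenance: Optional[str] = None
    quality: Optional[str] = None
    uncertainty: Optional[float] = None


def is_understandable(record: DataRecord) -> bool:
    """True if contextual information and standardized terms are present."""
    return record.context is not None and bool(record.vocabulary_terms)


def is_usable(record: DataRecord) -> bool:
    """True if the record is understandable and carries provenance,
    quality, and uncertainty information."""
    extras = (record.provenance, record.quality, record.uncertainty)
    return is_understandable(record) and all(x is not None for x in extras)
```

Under these assumptions, a record published with re-use restrictions would need to pass only `is_understandable`, while a record published without re-use restrictions would need to pass `is_usable` as well.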

The challenge for the next generation of global scientific data infrastructures is to support automated agents and search engines able to crawl the science data space in order to discover, mine, relate and interpret data from datasets as well as the literature.

We envision that the future research data infrastructures will constitute infrastructures for open scientific research.

The principles of open science data and open science can be widely accepted only if realized within an Integrated Science Policy Framework to be implemented and enforced by global research data infrastructures.
