D4.2 - Analysis of Data Infrastructure and Data repositories


www.chain-project.eu [email protected] Grant Agreement n. 306819

Project acronym : CHAIN-REDS

Project full title : Co-ordination & Harmonisation of Advanced and e-INfrastructures for Research Education Data Sharing

Grant agreement : 306819

Start date : December 1, 2012

Duration : 30 months

Programme : 7th Framework Programme (FP7)

Theme : Capacities specific program

Thematic area : Research Infrastructures

Funding scheme : Support action

Call identifier : FP7–INFRASTRUCTURES–2012-1

Project coordinator : Federico Ruggieri (INFN)

D4.2 - Analysis of Data Infrastructure and Data repositories

Deliverable Status : Draft or Final

File Name : CHAIN-REDS-D4.2_V05

Due Date : November 2013 (M12)

Submission Date : November 2013 (M12)

Dissemination Level : Public

Author : CIEMAT ([email protected])

INFN Istituto Nazionale di Fisica Nucleare - Italy

CIEMAT Centro de Investigaciones Energéticas, Medioambientales y Tecnológicas - Spain

GRNET Greek Research and Technology Network S.A. - Greece

CESNET Zajmove Sdruzeni Pravnickych Osob - Czech Republic

UBUNTUNET The UbuntuNet Alliance for Research and Education Networking - Malawi

CLARA Cooperación Latinoamericana de Redes Avanzadas - Uruguay

IHEP Institute of High Energy Physics Chinese Academy of Sciences - China

ASREN Arab States Research and Education Network - Jordan

SIGMA Sigma Orionis - France


Disclaimer

More details on the copyright holders can be found at www.chain-project.eu. CHAIN-REDS (“Co-ordination & Harmonisation of Advanced e-Infrastructures for Research and Education Data Sharing”) is a project co-funded by the European Union in the framework of the 7th FP for Research and Technological Development, as part of the “Capacities specific program - Research Infrastructures FP7–INFRASTRUCTURES–2012-1”. For more information on the project, its partners and contributors visit www.chain-project.eu. You are permitted to copy and distribute verbatim copies of this document containing this copyright notice, but modifying this document is not allowed. You are permitted to copy this document in whole or in part into other documents if you attach the following reference to the copied elements: "Copyright (C) 2013 - CHAIN-REDS Consortium - www.chain-project.eu". The information contained in this document represents the views of the CHAIN-REDS Consortium as of the date they are published. The CHAIN-REDS Consortium does not guarantee that any information contained herein is error-free, or up to date. THE CHAIN CONSORTIUM MAKES NO WARRANTIES, EXPRESS, IMPLIED, OR STATUTORY, BY PUBLISHING THIS DOCUMENT.

Revision Control

Issue | Date | Comment | Author

v01 | 06/11/2013 | First version | Rafael Mayo-García

v02 | 08/11/2013 | New info added | Margaret Ngwira, Bruce Becker and Rafael Mayo-García

V03 | 20/11/2013 | Comments and suggestions | Ognjen Prjnat, Rafael Mayo-García

V04 | 29/11/2013 | New information added | Roberto Barbera, Rafael Mayo-García

V05 | 03/12/2013 | Comments and edition | Ognjen Prjnat, Roberto Barbera,


Abstract

This document reports on the analysis of Data Infrastructure and Data repositories carried out by WP4 ‘Data Infrastructure’ during the first year of the CHAIN-REDS project. Bearing in mind the increasing importance of data management and data analytics in big data issues, it describes the current status of the project objectives as well as of the collaborative Virtual Research Communities of CHAIN-REDS: eIFL, agINFRA, ENGAGE and EarthServer. Furthermore, information about other initiatives is also presented, in particular the one related to the main European Data initiative EUDAT.

Based on this, the document analyses the adoption by these communities of the standards promoted by CHAIN-REDS and the transcontinental coverage that the current computational infrastructures offer to end users. This information is complemented with the actions that the project has carried out to enhance this transcontinental impact.

The document then describes the CHAIN-REDS tools that are being implemented to propose a methodology for computing and data trust building, i.e. the CHAIN-REDS Knowledge Base and the Semantic Search Engine. Finally, it details the first results of their use, as well as the plans and the road map that the project will follow to extend this data trust building.


Abstract
Table of Contents
Purpose
Glossary
1. Introduction
2. Updated Information of Transcontinental Data Infrastructure and Data Repositories
2.1 WP4 objectives
2.2 CHAIN-REDS Collection of Data
2.3 General data related initiatives – EUDAT, EIFL and documents repositories
2.4 Agriculture – agINFRA
2.5 e-Government – ENGAGE and H3Africa
2.6 Earth Sciences – EarthServer and SAEON
2.7 Cultural Heritage – DCH-RP and the University of Cape Town
2.8 Astrophysics – IVOA and SKA
2.9 e-Infrastructure Data – iMENTORS
2.10 WP4 Dissemination actions
3. Analysis on Data Infrastructures under the CHAIN-REDS Perspective
3.1. Standards
3.2. Transcontinental coverage
4. The CHAIN-REDS Tools
4.1. The CHAIN-REDS Knowledge Base
4.2. The CHAIN-REDS Semantic Search Engine
5. The CHAIN-REDS Work Plan on Data Infrastructures
5.1. The worldwide interoperability demo
5.2. Adding data functionalities to the CHAIN-REDS demo


Purpose

CHAIN-REDS is an FP7 project co-funded by the European Commission (DG CONNECT). It started on 1 December 2012 and aims at promoting and supporting technological and scientific collaboration across different e-Infrastructures established and operated in various continents, in order to define a path towards a global e-Infrastructure ecosystem that will allow Virtual Research Communities (VRCs), research groups and even single researchers to access and efficiently use worldwide distributed resources (i.e., computing, storage, data, services, tools, applications).

The purpose of this deliverable is to provide a study of the commonalities, differences, requirements and future challenges that the identified Data Infrastructure and Data Repositories have. This is a first version delivered at the end of the first year, which includes a first updated edition of D4.1 ‘Trans-continental Data Infrastructures and Data repositories’. Future updated versions of D4.2 ‘Analysis of Data Infrastructures and Data repositories’ will be part of D4.3 and D4.4, due in months M18 and M24 respectively.

In addition to this updated information on Data Infrastructures and their characteristics, the document describes the CHAIN-REDS tools that manage and exploit data and the project road map for demonstrating data trust building.

Glossary

API Application Programming Interface

CDMI Cloud Data Management Interface

CHAIN Co-ordination and Harmonisation of Advanced e-INfrastructures

CHAIN-REDS Co-ordination and Harmonisation of Advanced e-Infrastructures for Research Education Data Sharing

CNRI Corporation for National Research Initiatives

DCI Distributed Computing Infrastructure

DCMI Dublin Core Metadata Initiative

DoW Description of Work – Annex I to the GA

DR Data Repository

EC European Commission

EGI European Grid Initiative

FOAF Friend Of A Friend – machine readable ontology

FP7 European Commission’s Framework Programme Seven

GA Grant Agreement

ICT Information and Communication Technology(ies)

IVOA International Virtual Observatory Alliance

KB Knowledge Base

MoU Memorandum of Understanding

OADR Open Access Document Repository

OAI-PMH Open Archives Initiative Protocol for Metadata Harvesting

OCCI Open Cloud Computing Interface


OWL Web Ontology Language

PID Persistent IDentifier

RDF Resource Description Framework

ROC Regional Operation Centre

SKA Square Kilometre Array

SPARQL SPARQL Protocol and RDF Query Language

VRC Virtual Research Community

VRE Virtual Research Environment

WP Work Package


1. Introduction

CHAIN-REDS started on 1 December 2012 and aims at promoting and supporting technological and scientific collaboration across different e-Infrastructures established and operated in various continents. It is an FP7 project co-funded by the European Commission (DG CONNECT), and its ultimate goal is to define a path towards a global e-Infrastructure ecosystem that will allow Virtual Research Communities (VRCs), research groups and even single researchers to access and efficiently use worldwide distributed resources (i.e., computing, storage, data, services, tools, applications).

To do so, the project is structured in several Work Packages that address these different scenarios. Specifically, WP4 ‘Data Infrastructure’ deals with the promotion of trust building towards open scientific data infrastructures across the world regions, including organisational, operational and technical aspects, with a strong liaison with the WP3 ‘Interoperation and coordination of e-Infrastructures’ and WP5 ‘Support to small groups and emerging communities’ activities.

Whatever one thinks of the ubiquity of the "Big Data" meme, it is clear that the growing size and diversity of datasets are changing the way we approach the world around us. This is true in fields from industry to government to media to academia and virtually everywhere in between. Our increasing ability to gather, process, visualise, and learn from large datasets is helping to push the boundaries of our knowledge.

Nowadays, open access to data is becoming a must and is being promoted by many entities. In October 2010 the European Commission published the report entitled ‘Riding the wave. How Europe can gain from the rising tide of scientific data’, and it has more recently issued the Communication IP/12/790 [1], which considers open access a fundamental requirement for improving the flow of knowledge and, with it, innovation in Europe. Thus, open access will be required for all scientific publications produced with funding from Horizon 2020, the next EU programme for funding research and innovation during the 2014-2020 period. The Communication recommended that the Member States adopt in their national programmes an approach similar to that of the Commission. In addition, a European Commission study, which covered the EU and neighbouring countries as well as Brazil, Canada, Japan and the United States of America, states that over 40% of the peer-reviewed scientific articles published worldwide between 2004 and 2011 are now available online under an open access regime.

Nevertheless, this action is mainly focused on document repositories, and it must be widened to actual data that can be exploited by as many researchers and users as possible. Furthermore, such use and management of data should be carried out in accordance with the advances being made in e-Infrastructures, which is the topic of CHAIN-REDS WP3.

In order to study the opportunities of data sharing across different e-Infrastructures and continents, two main actions have been taken so far: collecting information from Data Repositories worldwide and widening the scope of the previous CHAIN Knowledge Base (KB) [2] to Data Infrastructures.

At the end of the first year of CHAIN-REDS, when this report is being delivered, extensive information about Data Repositories (DR) and Open Access Document Repositories (OADR) is available to users through the CHAIN-REDS KB, and the project manages information about the commonalities, differences, requirements and future challenges of the identified data communities. All of these are detailed in Sections 2 and 3 of this document.

[1] IP/12/790 Communication, available at http://europa.eu/rapid/press-release_IP-12-790_en.htm

[2] CHAIN Knowledge Base (KB), available at http://www.chain-project.eu/knowledge-base

Furthermore, the CHAIN-REDS Consortium considered it necessary to extend the KB capabilities to deal with data methodologies: semantic search; semantic-web-based metadata enrichment; download and upload of data; etc. These implementations are documented in Section 4.

Finally, since the main objective of WP4 is to provide proof-of-principle use-cases for Data sharing across continents, the road map described in D4.1 ‘Trans-continental Data Infrastructures and Data repositories’ has been redefined in order to reach this ultimate goal. It is worth mentioning that this redefinition has actually implied an extension of activities, whose results are summarised in Section 5.


2. Updated Information of Transcontinental Data Infrastructure and Data Repositories

Huge quantities of observational data and simulations are becoming available to researchers at an ever-accelerating rate. They are transforming science and affecting the Scientific Method, which for four centuries has been the iterative procedure used by scientists and researchers to follow the so-called “knowledge path”. For example, new astronomical observatories anticipate delivering combined data volumes of over 100 PB by 2020, yet even the current data volume of 1 PB is beginning to strain archives [3]. At the same time, simulations are increasing in complexity and scope. Thus, there is clear growth not only in volume but also in the complexity of data products, which are often derived by integrating existing data sets and confronting them with simulations within distributed, often international, collaborations.

The sciences will then witness a breakdown of their current computing model if no intervention is made. At present, data are discovered and downloaded through web-based services offered by archives and data centres, and then analysed and integrated on local machines; the very scale of the new data sets will transform data discovery, access, and computation in many disciplines.

Given that maximum science return will involve the federation of data sets, data or simulations will be discovered through queries to distributed archives. These queries will aim to locate, and even store, data having particular properties, and the data will then be processed either in situ, by archives running users’ software, or after being shipped to remote processing facilities.

In other words, as required by the U.S. National Science Foundation (NSF), proposals must include a data management plan, whose provisions include sharing “the primary data, samples, physical collections and other supporting materials created or gathered in the course of work under NSF grants”. Proposals submitted to other funding bodies, such as the European Commission (see the Communication mentioned above), have similar requirements.

As a consequence, it is necessary, first, to know which data are being stored, where and how, and, second, to implement a mechanism that allows researchers to access and further exploit such data for producing new science. At the same time, the new data and even the publications produced should later be stored under the same conditions as the input data used.

As described in D4.1 [4], the first step taken by CHAIN-REDS was to identify best practices and to approach the main stakeholders of regional e-Infrastructures and of data provisioning/use in order to propose and even define a path towards a global e-Infrastructure ecosystem that will allow VRCs, research groups and even single researchers to access and efficiently use worldwide distributed resources, i.e. computing, storage, data, services, tools, applications.

A survey about DRs and OADRs for feeding the CHAIN-REDS KB was performed, and commonalities in the strategies followed by some VRCs were identified. As a result, some communities were approached and official collaborations were even agreed. The updated status of these collaborations is explained in what follows.

[3] G. Bruce Berriman, “The role of the Virtual Astronomical Observatory in the era of massive data sets”

[4] http://www.chain-project.eu/deliverables

2.1 WP4 objectives

For the sake of completeness, this subsection summarises the most important information about the main objectives of WP4 and the actions taken to achieve them up to M12. The five objectives identified in the DoW and their current status are listed in Table I.

Objective | Current status (as of Nov 2013)

Extend the CHAIN-REDS KB with Data Infrastructures | The CHAIN-REDS KB already contains a large number of links to OADRs and DRs (almost 3,000 in total). New data-related capabilities have also been added to the Semantic Search, Applications and Science Gateway links.

Support the study of data infrastructures for a few VRCs | CHAIN-REDS has established official MoUs with several VRCs and some conversations are underway with others.

Promote trust building towards open scientific data infrastructures across the world regions | WP4 jointly with WP3 has set up a road map to demonstrate such trust building, which includes (data) infrastructures worldwide.

Study the opportunities of data sharing across different e-Infrastructures and continents | The project, jointly with the identified VRCs, is looking for datasets already stored worldwide which enhance the impact of the aforementioned demo.

Provide proof-of-principle use-cases for Data sharing across the continents | The demo mentioned in the third objective is expected to be shown during the last months of the project lifetime.

Table I. CHAIN-REDS objectives and their current status

In D4.1 the reader can find a list of actions (Action1-Action5) that were established as the first steps towards accomplishing the objectives in Table I. They have largely been carried out; only a deeper analysis of the DRs and OADRs of interest to CHAIN-REDS and to the collaborating VRCs with a worldwide impact is still under way. To perform this analysis, specific questions have been put to the VRCs (see Section 3).

2.2 CHAIN-REDS Collection of Data

The CHAIN project, the precursor of CHAIN-REDS, already promoted interoperability as one of its main objectives. A worldwide multi-middleware interoperability demonstration [5] was given in September 2012 at the EGI Technical Forum, and the project carried out several actions that went a step further in facilitating access to information coming from different regions. Thus, the KB provided dynamically updated information about the deployment of e-Infrastructure-related topics per country and even about specific Distributed Computing Infrastructures (DCIs) by means of a Site or a Table view. All these concepts led to a validation model for VRCs that was successfully tested by the end of CHAIN, relying on the aforementioned worldwide demo and on the road map of services requested from the DCIs by the VRCs.

For CHAIN-REDS, new data-oriented capabilities have been implemented (see Section 4). The first of them was to include in the KB information about DRs and OADRs. Such a compilation was obtained on a three-fold basis:

5


- By a multi-layer structure where a metadata harvester, running either on Grid or Cloud, fetches metadata from OAI-PMH end-points of many OADRs and DRs;

- By direct integration from repositories that CHAIN-REDS was aware of by means of the survey described in D4.1; and,

- By direct contacts between the project and other initiatives mainly devoted to data storage.
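The harvesting step in the first item can be sketched as follows. This is a minimal illustration, not the CHAIN-REDS harvester itself: it only shows how a standard OAI-PMH `ListRecords` request is formed and how Dublin Core titles can be read from a response. The function names, the end-point `http://repo.example.org/oai` and the sample record are assumptions made for the example; the verb, the `metadataPrefix=oai_dc` argument, the `resumptionToken` mechanism and the XML namespaces come from the OAI-PMH 2.0 and Dublin Core specifications.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# XML namespaces defined by the OAI-PMH 2.0 and Dublin Core specifications.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url, resumption_token=None):
    """Build a ListRecords request URL for an OAI-PMH end-point."""
    if resumption_token:
        # Follow-up requests carry only the verb and the token.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": "oai_dc"}
    return base_url + "?" + urlencode(params)


def parse_records(xml_text):
    """Return (records, resumption_token) from a ListRecords response."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iterfind(".//oai:record", NS):
        records.append({
            "identifier": rec.findtext(
                "oai:header/oai:identifier", default="", namespaces=NS),
            "titles": [t.text for t in rec.iterfind(".//dc:title", NS)],
        })
    token = root.findtext(".//oai:resumptionToken", default="", namespaces=NS)
    return records, (token or None)


# Hypothetical response from a hypothetical repository, used instead of a
# live network call so that the sketch stays self-contained.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:repo.example.org:42</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>A sample dataset</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
    <resumptionToken>page-2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

records, token = parse_records(SAMPLE)
next_url = list_records_url("http://repo.example.org/oai", token)
```

A real harvester would fetch each URL, call `parse_records` on the body and repeat with the returned token until no token is left.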

Figure 1. A snapshot of the CHAIN-REDS Knowledge Base - OADR Site view

Most of the available links to open access repositories were obtained using the first item. Details of the data included as a result of the CHAIN-REDS survey, and of the specific agreements with data initiatives such as DRIVER [6], OpenAIRE [7], OpenDOAR [8], Databib [9] or DataCite [10], are given in D4.1. At present, the KB contains links to 2,579 entries from OADRs and 596 from DRs (as of 24 Oct 2013), i.e. 91 OADRs and 89 DRs have been incorporated since the information reported in D4.1. In order to visualise and access the repositories, users can employ both geo- and tab-views. In the former, red markers refer to data currently taken from the almost 2,500 OADRs of DRIVER, OpenAIRE, and OpenDOAR, which currently refer to more than 30 million documents. Yellow markers refer to other OADRs, i.e. those integrated by means of the outreach activities of the project (second and third items above), such as those belonging to La Referencia [11] in Latin America and those pointed out by EIFL in Europe and Africa (see next sub-section).

[6] http://www.driver-support.eu

[7] https://www.openaire.eu/

[8] http://www.opendoar.org/

[9] http://databib.org/

[10] http://www.datacite.org/

For each OADR/DR, the following information is provided: the country where the data is stored; the name of the repository (with a direct link to its home page); the scientific domain it belongs to; and the organisation maintaining it.
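As an illustration only, the four pieces of information listed above could be held in a record such as the one below. The class name, the field names and the two sample entries are hypothetical and do not reflect the actual KB schema; the sketch merely shows how such records support the per-country grouping used by a geographic view.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RepositoryEntry:
    """One OADR/DR entry; all field names are assumptions for this sketch."""
    kind: str          # "OADR" or "DR"
    country: str       # country where the data is stored
    name: str          # name of the repository
    home_page: str     # direct link to the repository home page
    domain: str        # scientific domain the repository belongs to
    organisation: str  # organisation maintaining the repository


def names_by_country(entries):
    """Group repository names by country, as a map view would need."""
    grouped = {}
    for entry in entries:
        grouped.setdefault(entry.country, []).append(entry.name)
    return grouped


# Hypothetical sample entries, not taken from the KB.
ENTRIES = [
    RepositoryEntry("OADR", "Italy", "Example Papers", "http://papers.example.org",
                    "Physics", "Example University"),
    RepositoryEntry("DR", "Italy", "Example Data", "http://data.example.org",
                    "Earth Sciences", "Example Institute"),
]

grouped = names_by_country(ENTRIES)
```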

2.3 General data related initiatives – EUDAT, EIFL and documents repositories

As mentioned above, the first kind of repositories that CHAIN-REDS has been working with are document repositories, that is, articles, papers and proceedings, which have been included in the project KB classified either by OADRs or by DRs. This action has been of utmost importance because it has allowed the project to develop semantic methodologies for improving the retrieval of data and related information and, also, to plan a major challenge such as the extraction of raw data from articles (see Sections 4 and 5).

In addition, CHAIN-REDS has identified some general data related initiatives that represent major actors in the European landscape. Among them, the main candidate is EUDAT [12]. This initiative was created for working on data management and aims at providing European researchers from all fields with state-of-the-art instruments and services that support the deployment of new research facilities on a pan-European level. Since the first contacts, further interaction between EUDAT and CHAIN-REDS has been established, in which the commonalities of both projects have been detailed and the CHAIN-REDS KB has been proposed as an example of data management and standards adoption by means of the OADRs and DRs views.

In addition, specific meetings have been held between representative personnel from both projects, where specific actions have been scheduled. Thus, the EUDAT technical coordinator, Prof. Peter Wittenburg, has participated in the CHAIN-REDS workshop [13] held as part of the IEEE e-Science Conference (Beijing, 22 Oct 2013), and the CHAIN-REDS Project Coordinator, Technical Coordinator, WP4 Manager and T3.1 Leader have participated in the 2nd EUDAT conference [14] (Rome, 28-30 Oct 2013).

From this point on, a MoU between the two initiatives has been internally discussed and agreed within the EUDAT consortium and further collaboration by combining both initiatives’ developments (KB and Persistent Identifiers in the case of CHAIN-REDS) is expected.

Additional joint work has been set up between EIFL [15] and CHAIN-REDS. Working in collaboration with libraries in more than 60 developing and transition countries in Africa, Asia, Europe, and Latin America, EIFL enables access to knowledge for education, learning, research and sustainable community development. In accordance with the road map proposed by CHAIN-REDS and described in Section 5, and because of the regions of interest common to both initiatives, joint actions are expected for 2014. In this sense, CHAIN-REDS will propose its current technical developments as a proof-of-principle methodology for making better use of data. In the last six months, a MoU [16] has been signed between EIFL and CHAIN-REDS, and EIFL has provided the OAI-PMH end-points of several OADRs, which have been added to the KB.

[11] http://lareferencia.redclara.net/

[12] http://www.eudat.eu/

[13] http://agenda.ct.infn.it/conferenceOtherViews.py?view=standard&confId=919

[14] http://www.eudat.eu/2nd-conference

[15] http://www.eifl.net/

2.4 Agriculture – agINFRA

One of the most promising fields in data and metadata management is agricultural science. A metadata framework has even been introduced by the Food and Agriculture Organization of the United Nations (FAO), which recognised a strong need for statistical metadata that would provide a better understanding of all the data items and of the way they are obtained within the national systems of agricultural statistics.

The idea in the agricultural field is to establish metadata databases for food and agriculture as key components for improving data quality and statistical development. The concept of metadata here covers all aspects of the national systems of agricultural statistics: how, when, where, why, and by whom the data are collected. The challenge for metadata management at the international level is how to design a framework that countries can use to collect relevant and succinct information in a manageable and comparable way.

To accomplish this task in an affordable way, CHAIN-REDS has established a collaboration with the FP7 agINFRA [17] project, in which an international institution (FAO) and partners from Europe, Asia and Latin America participate. agINFRA aims to set up a data infrastructure to support agricultural scientific communities, promoting data sharing and the development of trust in agricultural sciences. It also plans to improve service deployment for data by transferring scientific and technological results from the agricultural field into real outcomes. In addition to the fact that agriculture is a resource in almost every country in the world, a key point from the e-Infrastructure point of view is that agINFRA relies for its computing power and administration on the same infrastructures targeted by CHAIN-REDS in WP3 and has also adopted the Science Gateway paradigm.

In the last six months, a MoU has been signed between agINFRA and CHAIN-REDS, which contributes to the fulfilment of the WP4 milestone ‘MS6 - MoUs signed with at least 2 VRCs’.

The first common action has been an exchange of information about agINFRA’s use of the standards adopted by CHAIN-REDS. This is important in order to propose to agINFRA and the agriculture community the CHAIN-REDS road map of data trust building, which is initially being tested with generic applications.

Furthermore, in the context of the MoU between the two projects, the Semantic Search part of the agINFRA website [18] has been restructured according to the new capabilities developed within the CHAIN-REDS consortium. It is now possible to perform a parallel semantic search, i.e. a new functionality that allows users to search simultaneously across the millions of resources contained in the CHAIN-REDS Knowledge Base and in the FAO OpenAgris repository [19].
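The idea of a parallel search can be sketched as below: the same keyword is sent to several sources at once and the answers are collected per source. This is a generic illustration under stated assumptions, not the implementation behind the agINFRA or CHAIN-REDS portals. The two backends are hypothetical in-memory stand-ins for remote search services, and the names `parallel_search`, `make_backend` and `BACKENDS` are invented for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def parallel_search(keyword, backends):
    """Send `keyword` to every backend at once and collect the results.

    `backends` maps a source name to a callable that takes the keyword and
    returns a list of hits; each callable runs in its own thread.
    """
    with ThreadPoolExecutor(max_workers=max(1, len(backends))) as pool:
        futures = {name: pool.submit(search, keyword)
                   for name, search in backends.items()}
        # result() blocks until the corresponding backend has answered.
        return {name: future.result() for name, future in futures.items()}


def make_backend(titles):
    """Hypothetical stand-in for a remote search service (in-memory list)."""
    def search(keyword):
        return [t for t in titles if keyword.lower() in t.lower()]
    return search


BACKENDS = {
    "knowledge_base": make_backend(["Maize yield dataset", "Grid middleware survey"]),
    "second_repository": make_backend(["Maize genetics review", "Soil moisture maps"]),
}

hits = parallel_search("maize", BACKENDS)
```

Threads are a reasonable choice here because each backend call would spend most of its time waiting on the network, so the total latency approaches that of the slowest source rather than the sum of all of them.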

[16] CHAIN-REDS MoUs available at http://documents.ct.infn.it/collection/CHAIN-Reds%20MoUs?ln=en

[17] http://aginfra.eu/

[18] http://aginfra-sg.ct.infn.it/semantic-search

[19] http://aims.fao.org/openagris


2.5 e-Government – ENGAGE and H3Africa

In e-Government, public agencies are responsible for providing access to information and services for everyone living within a country or a region, all of whom will have varying levels of IT skills, including individuals with lower incomes and disabilities. This is why a fundamental activity in e-Government is organising government collections on the internet in a way that helps users to search for and locate government information without needing details of government structure, or to find government services without knowing which agency delivers them. Thus, metadata is a valuable tool in e-Government applications for enabling a seamless flow of information and services across government and for helping citizens to find government information and services more easily.

CHAIN-REDS has identified the FP7 ENGAGE project as an ideal initiative to collaborate with. ENGAGE [20] is an infrastructure for open, linked governmental data provision both for research communities and citizens. The ENGAGE e-Infrastructure is envisaged to promote a highly synergetic approach to governance research by providing the ground for experimentation to actors from both ICT and non‐ICT related disciplines and scientific communities, as well as by ensuring that the scientific outcomes are made accessible to the citizens, so that they can monitor public service delivery and influence the decision making process.

ENGAGE will provide enhanced services in the data e-Infrastructure layer while also building a community that can exploit the e-Infrastructure services. The project has developed a platform (currently in a beta version), which has already been a good basis for collaboration between this initiative and both the CHAIN-REDS consortium and the groups interested in e-Government approached worldwide by means of the CHAIN-REDS WP4 survey. Thus, a first common action has been an in-depth analysis of the ENGAGE platform by the CHAIN-REDS management. ENGAGE coordinators contacted CHAIN-REDS, and feedback on the platform functionality was submitted. Within its field of expertise, the ENGAGE platform provides access to 14,379 datasets (as of 2 Nov 2013), already searchable by SPARQL queries.
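To give an idea of what searching datasets with SPARQL looks like, the sketch below builds a keyword query and encodes it as a GET request following the SPARQL 1.1 Protocol, in which the query text travels in the `query` parameter. Everything specific in it is an assumption: the end-point `http://sparql.example.org/query` is a placeholder, and the use of `dcterms:title` to describe datasets is a guess at a plausible vocabulary, not the documented ENGAGE data model.

```python
from urllib.parse import urlencode


def build_dataset_query(keyword, limit=10):
    """Return a SPARQL query for resources whose title contains `keyword`.

    Assumes, for illustration, that titles are stored as dcterms:title.
    """
    # Escape backslashes and double quotes so the keyword stays a valid
    # SPARQL string literal.
    safe = keyword.replace("\\", "\\\\").replace('"', '\\"')
    return (
        "PREFIX dcterms: <http://purl.org/dc/terms/>\n"
        "SELECT ?dataset ?title WHERE {\n"
        "  ?dataset dcterms:title ?title .\n"
        f'  FILTER(CONTAINS(LCASE(STR(?title)), LCASE("{safe}")))\n'
        "}\n"
        f"LIMIT {int(limit)}"
    )


def sparql_get_url(endpoint, query):
    """Encode a query as a SPARQL 1.1 Protocol GET request URL."""
    return endpoint + "?" + urlencode({"query": query})


query = build_dataset_query('budget "2013"', limit=5)
# Placeholder end-point; a real service URL would go here.
url = sparql_get_url("http://sparql.example.org/query", query)
```

The resulting URL can be fetched with any HTTP client; asking for `application/sparql-results+json` in the `Accept` header is the usual way to obtain machine-readable results from end-points that support it.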

Later on, there will be an exchange of information about the use of the standards adopted by CHAIN-REDS and ENGAGE. This is of importance in order to propose to ENGAGE and the e-Government community the CHAIN-REDS road map of data trust building that is initially being tested with generic applications.

In the last six months, a MoU has been signed between ENGAGE and CHAIN-REDS, which is part of the WP4 milestone ‘MS6- MoUs signed with at least 2 VRCs’ fulfilment.

In the context of that MoU, the Semantic Search part of the CHAIN-REDS website has been restructured. It is now possible to perform a semantic search in two modes: Single, i.e. the usual semantic search service described later in this document [21]; and Parallel, a new functionality that allows users to search in parallel across the millions of resources contained in the CHAIN-REDS Knowledge Base and in the ENGAGE Platform [22].

[20] http://www.engagedata.eu/

[21] http://www.chain-project.eu/semantic


CHAIN-REDS has also approached H3Africa [23]. The Human Heredity and Health in Africa initiative aims to facilitate a contemporary research approach to the study of genomics and environmental determinants of common diseases with the goal of improving the health of African populations. To accomplish this, the H3Africa Initiative aims to contribute to the development of the necessary expertise among African scientists, and to establish networks of African investigators. Furthermore, data generated from this effort will inform strategies to address health inequity; the final goal is then to develop a scientific case for a pilot study of a specific disease or diseases and to produce some general principles/guidelines for collaboration, data sharing and addressing Ethical, Legal and Social Issues.

2.6 Earth Sciences – EarthServer and SAEON

With increasing volumes of satellite and remote sensing, models and other Earth Science data available and the popularity of the Internet, Earth scientists are now facing challenges to publish and to find interesting data sets effectively and efficiently. One of the main barriers to exploiting the great wealth of global Earth science data available today is that researchers are unable to rapidly search and find data relevant to their studies.

To explore the CHAIN-REDS objectives in this field, the project is collaborating with the FP7 initiative EarthServer [24], which is working on establishing open access and ad-hoc analytics on several extreme-size Earth Science data sets related to cryospheric, airborne, atmospheric and planetary sciences and also to geology and oceanography. EarthServer relies on a Science Gateway based on the one developed by the CHAIN-REDS project, so a close collaboration has been set up between the two projects in order to build data trust. In support of that goal, it is also worth mentioning that both projects share the same e-Infrastructures and that the data collected in EarthServer are of interest in all regions of the world.

In the last six months, a MoU has been signed between EarthServer and CHAIN-REDS, which contributes to the fulfilment of the WP4 milestone ‘MS6-MoUs signed with at least 2 VRCs’. An in-depth exchange of information about EarthServer's use of the standards adopted by CHAIN-REDS has already taken place. This is of importance in order to propose to EarthServer and the agriculture community the CHAIN-REDS road map of data trust building that is initially being tested with generic applications.

Besides, a discussion about data analytics has been held by the two projects. Of the two tools considered, SciDB [25] and rasdaman [26], the latter has been initially selected, and further joint developments are expected in the near future in accordance with the goals of the CHAIN-REDS road map.

[23] http://www.h3africa.org/
[24] http://www.earthserver.eu/
[25] http://scidb.org/


CHAIN-REDS has also established contacts with the SAEON consortium [27]. Its vision is a sustained, coordinated, responsive and comprehensive in situ South African Earth Observation Network that delivers long-term reliable data for scientific research and informs decision-making for a knowledge society and improved quality of life. Progress in sustainable development is constrained by the lack of reliable long-term data at scales that are relevant to policy, and by the lack of integration between the various systems that provide information on the environmental, social and economic elements of sustainability. To address this critical gap, SAEON will collect, store and assess appropriate longitudinal social, economic and environmental data to inform relevant research, policy, reporting and action.

2.7 Cultural Heritage – DCH-RP and the University of Cape Town

The use of computational methods in the humanities is rapidly growing, with the increasing quantities of born-digital primary sources (such as emails, social media) and the large-scale digitisation programmes applied to libraries, museums and archives. This has resulted in a range of interesting applications and case studies highlighting at the same time the interpretative issues raised by applying such “hard” methods for answering subjective questions in the humanities.

Moreover, the questions and concerns raised by the humanities themselves have consequences for the interpretation in general of big data and the challenges of producing quality (meaning, knowledge and value) from quantity. Several points are then of interest: text- and data-mining of historical and archival material; social media analysis; crowd-sourcing; archival practices; big data in Heritage; metadata schema, etc. As part of the Dublin Core Metadata Initiative (DCMI), a Cultural Heritage Metadata Task Group is currently active.

The Digital Cultural Heritage Roadmap for Preservation [28] (DCH-RP) is a coordination action supported by the European Commission under the e-Infrastructure Capacities Programme of the Seventh Framework Programme for Research (FP7). The project was launched in October 2012 to look at best practices for preservation standards in use. DCH-RP aims to: (i) harmonize data storage and preservation policies in the digital cultural heritage sector at European and international level, dealing with the storage phase, which includes both long-term and short-term preservation; (ii) progress a dialogue among DCH-RP institutions, e-Infrastructures, research and private organisations and integrate these efforts in a common work; and, (iii) identify more suitable models for the governance, maintenance and sustainability of such infrastructure.

The main outcome of DCH-RP will be a roadmap for the implementation of a federated preservation e-infrastructure, supplemented by practical tools for decision makers. Since it will be validated through a range of proofs of concept, a close collaboration with CHAIN-REDS is expected, as both initiatives share the same vision and technological approach. In the last six months, a MoU has been signed between DCH-RP and CHAIN-REDS, which contributes to the fulfilment of the WP4 milestone ‘MS6-MoUs signed with at least 2 VRCs’, and further joint actions are under way.

[27] http://www.saeon.ac.za/
[28] http://www.dch-rp.eu/


At the beginning of October 2013, the CHAIN-REDS Knowledge Base and the Semantic Search Engine were showcased at the eResearch Africa 2013 Conference [29] held in Cape Town (South Africa). This has triggered collaboration with the Metadata Working Group initiative and the Information and Communication Technology Services at the University of Cape Town (UCT) to tailor and adapt the CHAIN-REDS services to the needs of UCT and the South African strategy on open access document and data repositories.

2.8 Astrophysics – IVOA and SKA

In this community, neither the international collaborations supporting big facilities nor the bureaux or societies dictate how a data centre handles its own archive. However, a ‘Virtual Observatory-layer’ is needed to translate any locally stored data to an agreed standard. Data providers are then advised to systematically collect metadata about the curation process, assign unique identifiers, describe the general content of a collection, and provide interface and capability parameters of services. In the context of Astrophysics, CHAIN-REDS has identified the International Virtual Observatory Alliance [30] (IVOA) as an ideal collaborator. The Virtual Observatory (VO) is the vision that astronomical datasets and other resources should work as a seamless whole. Many projects and data centres worldwide are working towards this goal. IVOA is an organisation that debates and agrees the technical standards that are needed to make the VO possible. It also acts as a focus for VO aspirations, a framework for discussing and sharing VO ideas and technology, and a body for promoting and publicising the VO.

Since its formation in 2002, the IVOA has been working on reaching truly world-wide cohesion in debating and agreeing key astronomical standards, establishing a forum for discussing and debating astronomical data technology in general, as well as VO standards in particular, and achieving rapid agreement on an initial set of basic standards (a table exchange format, a specification for simple catalogue and image query services, the definition of metadata describing resources, a dictionary for standardised column names, and a suite of standards allowing the construction of VO registries). The IVOA is also pursuing the provision of further standards, including those needed for virtual storage addressing, single sign-on, semantic reasoning, and grid and web service modularisation. It has a Grid & Web Services Working Group, which has developed an interface to access both the PRACE [31] and EGI [32] infrastructures for scientific computations; both are targeted e-Infrastructures in CHAIN-REDS.

It is also worth mentioning, within the scope of the CHAIN-REDS regions, the contacts that have been established in principle with the Square Kilometre Array [33] (SKA) in Africa. Nevertheless, it should be pointed out that SKA is a major consortium formed by organisations from ten countries (Australia, Canada, China, Germany, Italy, New Zealand, South Africa, Sweden, the Netherlands and the United Kingdom) and one Associate Member (India). In this way, it addresses most of the regions of interest to CHAIN-REDS.

[29] http://eresearch.ac.za/
[30] http://www.ivoa.net/
[31] http://www.prace-ri.eu/
[32] http://www.egi.eu/
[33] http://www.ska.ac.za/


The SKA will use hundreds of thousands of radio telescopes, in three unique configurations, which will enable astronomers to monitor the sky in unprecedented detail and survey the entire sky thousands of times faster than any system currently in existence. The SKA telescopes will be co-located in Africa and in Australia. South Africa’s Karoo desert will host the core of the high- and mid-frequency part of the radio spectrum, with telescopes spread all over the continent, while Australia’s Murchison region will cover the low-frequency range and host the survey instrument. As can easily be inferred, the huge amount of data that these radio telescopes collect will have to be properly managed, and CHAIN-REDS aims to provide SKA with its perspective and solutions.

2.9 e-Infrastructure Data – iMENTORS

iMENTORS [34] (e-Infrastructure monitoring evaluation and tracking support system) is a project co-funded by the European Commission's DG CONNECT under the 7th Framework Programme, which aims to build a one-stop-shop data warehouse on all e-infrastructure development projects of Sub-Saharan Africa. By mapping e-infrastructure initiatives, iMENTORS' goal is to help scientists, universities, research and education networks as well as policy-makers and international donors gain valuable insights on the gaps and progress made in the region, and to enhance the coordination of international actors involved in ICT initiatives in this part of the world.

iMENTORS is equipped with advanced Geographic Information and Visualisation Systems along with a robust decision-support system that draws public data from many online databases to provide policy support and assist programme planning and implementation. The ultimate objective of iMENTORS is to form a vibrant online community of practice made up of international actors and practitioners who exchange up-to-date knowledge and information through online social interactions and dedicated spaces for online collaboration, and to encourage the community to adopt and update the platform on its own.

Sharing the same “raising awareness and providing information” approach and addressing a crucial region of the world such as Sub-Saharan Africa, the collaboration between CHAIN-REDS and iMENTORS was deemed very important by both projects and has been framed in the context of a Memorandum of Understanding.

iMENTORS will give CHAIN-REDS access to its Data Warehouse, and CHAIN-REDS will give iMENTORS access to its Knowledge Base and its Semantic Search Engine; the two projects will collaborate to explore how these can be integrated in the iMENTORS platform.

2.10 WP4 Dissemination actions

During the first twelve months of CHAIN-REDS, several dissemination actions have been carried out. One of them has been the implementation of a WP4 wiki page [35] where information about the use of the Science Gateway, the Parallel Semantic Search Engine, and the PID service is displayed.

[34] http://www.imentors.eu/

It is also worth mentioning the presentations (see Table II) about the WP4 developments at international conferences and outreach activities, which have been specific dissemination actions beyond the ones already shown by the consortium as a whole. To those, other dissemination activities carried out as part of CHAIN-REDS events must be added.

| Event | Date | Location | Type of contribution |
|---|---|---|---|
| e-AGE 2012 | Dec 2012 | Dubai (UAE) | Presentation “Data Infrastructures in CHAIN-REDS” |
| SCALAC 2013 | Feb 2013 | Bucaramanga (Colombia) | Presentation “CIEMAT” |
| ISGC 2013 | Mar 2013 | Taipei (Taiwan) | Presentation “Data Infrastructures in CHAIN-REDS” |
| EGI CF 2013 | Apr 2013 | Manchester (UK) | Presentation “Support to Data Infrastructures in CHAIN-REDS” |
| IST-Africa 2013 | May 2013 | Nairobi (Kenya) | Presentation “The Knowledge Base of Open Access Document Repositories (OADRs) and How African Libraries can Contribute to it” |
| EGI TF 2013 | Sep 2013 | Madrid (Spain) | Presentation “Support for VRCs outside of Europe - services by the CHAIN-REDS project” |
| eResearch Africa 2013 | Oct 2013 | Cape Town (South Africa) | Presentation “Data Infrastructures for e-Science (the CHAIN-REDS perspective)” |
| UbuntuNet Connect 2013 | Nov 2013 | Kigali (Rwanda) | Presentation “Virtual Research Communities: Knowledge and Data” |
| RedCLARA Virtual day | Nov 2013 | Latin America | Presentation “Virtual Research Communities: Knowledge and Data” |

Table II. WP4 dissemination and outreach activities in non-CHAIN-REDS events.

To these presentations, the attendance of the WP4 coordinator at the 2nd EUDAT conference held in Rome (Italy) in Oct 2013 should be added, since new plans for collaboration were set up there, as previously mentioned.


In addition, some papers have been accepted for publication through this first year of the project. They are listed in Table III.

| Title | Reference |
|---|---|
| The CHAIN-REDS Semantic Search Engine | Proceedings of the UbuntuNet Connect 2013 Conference, in press. |
| A CHAIN-REDS Perspective about Data Access and Metadata Management | Proceedings of the e-AGE 2013 Conference, in press. |

Table III. Papers accepted for publication through the first year of CHAIN-REDS.


3. Analysis on Data Infrastructures under the CHAIN-REDS perspective

Once the several communities had been approached and the WP4 survey on DRs and OADRs had been completed, an analysis was carried out of the commonalities, differences, requirements and future challenges that these communities have regarding the computing and data infrastructure they use. Of course, such an analysis must bear in mind the specificities of the regions of interest to CHAIN-REDS, and its conclusions must be aligned with the European vision about data curation and management.

3.1. Standards

In order to set up data trust building, it is mandatory to define a set of standards that govern data management and so facilitate further use of the data. During the first six months of its lifetime, CHAIN-REDS identified some of them. The project has been working with them, as stated in Section 5, and has analysed new standards that could be of interest, i.e., Std5. The current list of standards is the following:

Std1. OAI-PMH [36] for metadata retrieval. The Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) is a low-barrier mechanism for repository interoperability. Data providers are repositories that expose structured metadata via OAI-PMH, and service providers then make OAI-PMH service requests to harvest that metadata. OAI-PMH is a set of six verbs or services that are invoked within HTTP.

Std2. Dublin Core [37] as metadata schema. Specifically, the DCMI is an open organization that has defined a set of vocabulary terms that can be used to describe resources for the purposes of discovery. The terms can be used to describe a full range of web resources, physical resources and objects. The original set of classic metadata terms, known as the Dublin Core Metadata Element Set, comprises 15 elements.

Std3. SPARQL [38] for semantic web search. Resource Description Framework (RDF) is a standard model for data interchange on the Web. It is a directed, labelled graph data format for representing information in the Web. The SPARQL specification defines the syntax and semantics of the SPARQL query language for RDF. SPARQL can be used to express queries across diverse data sources, whether the data is stored natively as RDF or viewed as RDF via middleware. SPARQL contains capabilities for querying required and optional graph patterns along with their conjunctions and disjunctions. SPARQL also supports extensible value testing and constraining queries by source RDF graph. The results of SPARQL queries can be result sets or RDF graphs.

Std4. XML [39] as potential standard for the interchange of data represented as a set of tables. Extensible Markup Language (XML) is a simple, very flexible text format derived from SGML (ISO 8879). Originally designed to meet the challenges of large-scale electronic publishing, XML is also playing an increasingly important role in the exchange of a wide variety of data on the Web and elsewhere.

Std5. Persistent Identifier (PID) is defined as a long-lasting reference to a digital object, be it either a single file or a set of files. Noted persistent identifier systems include: Archival Resource Keys (ARKs), Digital Object Identifiers (DOIs), Persistent Uniform Resource Locators (PURLs), Uniform Resource Names (URNs), and Extensible Resource Identifiers (XRIs). The use of PIDs is of utmost importance for identifying data on a long-term basis and facilitating their discovery; as a consequence, the European Persistent Identifier Consortium [40] (EPIC) has already been established as an initiative devoted to providing PID Services for the European Research Community, based on the handle system™ [41], for the allocation and resolution of persistent identifiers.

[36] http://www.openarchives.org/pmh/
[37] http://dublincore.org/
[38] http://www.w3.org/2001/sw/wiki/SPARQL
[39] http://www.w3.org/XML/
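Returning to Std1, the request side of OAI-PMH can be illustrated with a minimal sketch. It only builds the HTTP GET URL for one of the six protocol verbs; the endpoint `https://repository.example.org/oai` and the helper name `build_oai_request` are hypothetical and serve only to show the shape of a request.

```python
from urllib.parse import urlencode

# The six verbs defined by OAI-PMH.
OAI_PMH_VERBS = (
    "Identify",
    "ListMetadataFormats",
    "ListSets",
    "ListIdentifiers",
    "ListRecords",
    "GetRecord",
)


def build_oai_request(base_url, verb, **arguments):
    """Return the HTTP GET URL for a single OAI-PMH request."""
    if verb not in OAI_PMH_VERBS:
        raise ValueError(f"unknown OAI-PMH verb: {verb}")
    # Each request carries the verb plus its verb-specific arguments,
    # e.g. metadataPrefix=oai_dc to ask for Dublin Core records.
    query = urlencode({"verb": verb, **arguments})
    return f"{base_url}?{query}"


# Hypothetical endpoint, used only for illustration.
url = build_oai_request(
    "https://repository.example.org/oai",
    "ListRecords",
    metadataPrefix="oai_dc",
)
print(url)
```

A service provider would issue such a request over HTTP and receive an XML response containing the Dublin Core records of Std2.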

| | agINFRA | ENGAGE | EarthServer |
|---|---|---|---|
| Raw data description | Mainly articles and “referatories”, i.e., they do not hold the content, but aggregate metadata from many sources (more information can be seen in agINFRA deliverable D2.3) | Metadata are stored in a PostgreSQL database. The actual raw data linked to the datasets are either direct URLs to external files (of other websites), or local files stored in a cloud storage. | Large part of Earth Observations metadata is stored in XML in practice. Versatile retrieval methods are on the way |
| OAI-PMH | Yes | No | Some repositories have OAI-PMH endpoints and a catalogue of services, supporting OAI-PMH queries, is being built. |
| Dublin Core | Yes | Yes | No |
| SPARQL | Yes | Yes | No |
| XML | N/A | XML is used for the RDF files (metadata & data) stored in Virtuoso. For the rest of the communication formats, JSON (API, visualizations, etc.) is usually employed. | Yes |
| PID | No | No | No |
| Intellectual property issues | | Majority of open data sets | Some data sets |
| Are you interested in adopting the missing standards (if any) with the CHAIN-REDS support? | Yes | Yes | Yes |

Table IV. Standards adopted by CHAIN-REDS and their status in the collaborative VRCs (Nov 2013).

[40] http://www.pidconsortium.eu/
[41] http://www.handle.net/


Bearing them in mind, CHAIN-REDS has started to build a data challenge that will be used as a proof-of-concept for data trust building, so the first action has been to ask the identified VRCs about the adoption of these standards. A brief survey of the current status has been sent to three of the communities, the result of which is summarised in Table IV. With these results, some preliminary conclusions can be drawn:

- Raw data is usually kept in its own format and storage allocation, so (big) related initiatives basically collect the links to these primitive repositories.

- Some of the standards promoted by CHAIN-REDS have already been largely adopted. OAI-PMH is not used in the ENGAGE project, and a closer collaboration has been agreed between the two projects for its further implementation. Data retrieval and reuse therefore seem feasible.

- PIDs are not generally used, although their value is acknowledged, and CHAIN-REDS will work towards their promotion in these communities.

- Although it cannot be applied to the whole set of repositories, the collaborative initiatives work with datasets that are not under intellectual property limitations.

Last, three more points should be raised. The first two of them refer to scheduling: the Astrophysics (IVOA) and Cultural Heritage (DCH-RP) communities are expected to be added to this round of contacts during 2014, and the CHAIN-REDS support for promoting the proposed standards is expected to be provided during the scientific data challenge, whose code name has been set to DART (Data Accessibility, Reproducibility and Trustworthiness).
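The adoption gaps behind these conclusions can be read directly from the survey answers in Table IV. The following sketch encodes those answers and lists, per VRC, the standards that are not yet fully adopted. The status labels (`"partial"`, `"n/a"`) and the decision to count a partial answer as a gap are assumptions made here for illustration, not part of the survey.

```python
# Survey answers per VRC, transcribed from Table IV (Nov 2013).
# "partial" stands for EarthServer's answer that only some repositories
# expose OAI-PMH endpoints; "n/a" stands for agINFRA's answer on XML.
ADOPTION = {
    "agINFRA": {
        "OAI-PMH": "yes", "Dublin Core": "yes", "SPARQL": "yes",
        "XML": "n/a", "PID": "no",
    },
    "ENGAGE": {
        "OAI-PMH": "no", "Dublin Core": "yes", "SPARQL": "yes",
        "XML": "yes", "PID": "no",
    },
    "EarthServer": {
        "OAI-PMH": "partial", "Dublin Core": "no", "SPARQL": "no",
        "XML": "yes", "PID": "no",
    },
}

# Assumption: a partially adopted standard still counts as a gap.
NOT_ADOPTED = {"no", "partial"}


def missing_standards(adoption):
    """Map each VRC to the standards it has not (fully) adopted."""
    return {
        vrc: [std for std, status in answers.items() if status in NOT_ADOPTED]
        for vrc, answers in adoption.items()
    }


gaps = missing_standards(ADOPTION)
for vrc, standards in gaps.items():
    print(f"{vrc}: {', '.join(standards)}")
```

PID appears in every list, which matches the observation that PIDs are not generally used in these communities.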

3.2. Transcontinental coverage

CHAIN-REDS supports the interoperation of Grids in Europe and other world regions through the support for Regional Operations Centres in terms of functionality, requirements and structure.

In this regard, WP3 ‘Interoperation and coordination of eInfrastructures’ has described in D3.1 ‘Interoperation model and plan’ [4] the current operations structure of the European Grid Infrastructure and a model for Grid interoperations between Europe and the other world regions involved in the project. In addition, there is also an update of the interoperation model, to be implemented through the execution of a concrete action plan drafted in cooperation with the representatives of each region.

This document does not aim to repeat either the interoperation model between Europe and the regions participating in this project or the blueprint for the implementation of Regional Operation Centres depending on two possible scenarios for interoperation with Europe. Nevertheless, some hints about the current status of the different regions are given in order to better assess on which infrastructures the DART challenge should run:

- The Africa & Arabia Region has a ROC and 13 sites (May 2013), although new ones were about to be incorporated. They are using the middleware and tools promoted by EMI and EGI, although some services are not yet installed.

- The Asia-Pacific Region has a ROC and 27 sites (May 2013) running, also EMI compliant. The services promoted by EGI have been adopted and only the signature of a MoU is advisable.

- China has two Resource Infrastructure Providers, China ROC and CNGrid. The former has 1 Resource Centre and still lacks some EGI services; the latter is a federation of 14 High Performance Computing Resource Centres using Grid Operating System (GOS) middleware, where a set of higher-level middleware services has been developed.

- The Latin American Region has 4 sites running operationally within the ROC LA with the EMI middleware. New sites coming from the old IGALC ROC are expected to be incorporated very soon. Management services are implemented and comply with the EGI model.

With regard to clouds, it is clear that they require significant programming and system administration support, and many gaps and challenges exist in current open-source virtualized cloud software stacks for both production science and education use. In addition, interoperation between commercial and scientific clouds ought to be addressed. Based on the EGI Federated Cloud Task Force [42], which is entering a “Pre-Production” phase and is inviting other cloud infrastructure providers to join its federated infrastructure, CHAIN-REDS will start surveying the regions of interest to collect information about Clouds for research & education purposes. In addition, CHAIN-REDS is also organising a set of round tables to get valuable feedback from the different regions; the first of them was held in the CHAIN-REDS workshop co-located with the IEEE e-Science Conference 2013 in Beijing (Oct 2013). Relying on the collected information, suggestions towards cloud interoperability will be proposed.

WP3 and WP4 are working together to design a data trust building challenge that will profit from the different computing facilities distributed worldwide and will provide a legacy for the future use of both data and computational infrastructure. The first action has been the action plan per region described in D3.1; its recommendations are designed to improve the interoperation of the different ROCs around the world with the European strategy, widening the current computing power. Once this is achieved, it will be easier to extend the previously demonstrated interoperability demo (EGI TF 2013) to more sites located in different world regions.

The second action has been to accelerate the access to the different infrastructures by means of Identity Federations. D5.14 also takes this feature into account and reports about the current status worldwide. One of the potential management solutions identified for Identity Federation issues is the Perun service. The Perun service [43] is a user, resource and service management system developed by the CHAIN-REDS partner CESNET and provided as a service. Since Perun could help research communities to overcome initial barriers when they want to start using federated services or simply want to manage access to shared services, CHAIN-REDS is considering its adoption as a support tool.

In addition, CHAIN-REDS is also working on PID services. The project partner GRNET is deeply involved in this development and already provides a PID service [44]. The service uses the Handle System for PID resolution and assignment, and it has been assigned a prefix by the Corporation for National Research Initiatives (CNRI). Furthermore, GRNET provides a REST web interface in front of the Handle service that eases the integration with software repositories. As is evident from Table IV, many VRCs do not have experience in using PID services as part of their data management practices. CHAIN-REDS is committed to promoting the adoption of PIDs in chosen use cases and the wider community, and will provide them with the necessary support.
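To make the structure of a Handle-based PID concrete, the sketch below splits a handle into the prefix assigned to the naming authority and the locally chosen suffix, and builds a resolution URL through the public Handle System proxy at hdl.handle.net. The handle `12345/dataset-001` is a placeholder, and the helper names are hypothetical; the GRNET REST interface itself is not described in this document and is not modelled here.

```python
from urllib.parse import quote

# Public proxy of the Handle System, used for resolution.
HANDLE_PROXY = "https://hdl.handle.net/"


def split_handle(handle):
    """Split 'prefix/suffix' into its two parts.

    The prefix identifies the naming authority; the suffix is chosen
    locally and may itself contain further '/' characters.
    """
    prefix, separator, suffix = handle.partition("/")
    if not separator or not prefix or not suffix:
        raise ValueError(f"not a valid handle: {handle!r}")
    return prefix, suffix


def resolver_url(handle):
    """Return the proxy URL that resolves the given handle."""
    prefix, suffix = split_handle(handle)
    # Percent-encode characters in the suffix that are unsafe in a URL path.
    return f"{HANDLE_PROXY}{prefix}/{quote(suffix, safe='/')}"


# Placeholder handle, used only for illustration.
example = "12345/dataset-001"
print(split_handle(example))
print(resolver_url(example))
```

Because the URL stays stable while the location registered behind the handle can be updated, a citation that uses it keeps working when the data move.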

Bearing all these aspects in mind, it will be feasible to achieve the final data trust building challenge that is described in the next Section.

[42] https://wiki.egi.eu/wiki/Fedcloud-tf:FederatedCloudsTaskForce
[43] http://perun.metacentrum.cz


4. The CHAIN-REDS tools

This Section briefly describes the CHAIN-REDS tools that are available at the project website and are of interest to Data Infrastructure. They form the backbone that will be used by the future data challenge called DART, which aims to provide a proof-of-principle of data trust building.

4.1. The CHAIN-REDS Knowledge Base

The CHAIN-REDS Knowledge Base is one of the largest existing e-Infrastructure-related digital information systems. It currently contains information, gathered both from dedicated surveys and from other web and documentary sources, for well over half of the countries in the world.

Information is presented to visitors through geographic maps and tables. The “country view” is shown in Fig. 2. Users can choose a continent in the map and, for each country where a marker is displayed, get information about the Regional Research & Education Network(s) and the Grid Regional Operation Centre(s) the country belongs to, as well as the National Research & Education Network, the National Grid Initiative, the Certification Authority, and the Identity Federation available in the country, down to the Grid site(s) running in the country and the scientific application(s) developed by its researchers and running on those sites. Besides e-Infrastructure sites, services and applications, the CHAIN-REDS KB publishes information about Open Access Document Repositories and Data Repositories, as described in Section 2. A more detailed description of the KB characteristics can also be found in D4.1.


Although it is quite useful to have a central access point to thousands of repositories and millions of documents and datasets, with both geographic and tabular information, the OADR and DR part of the CHAIN-REDS KB is only a demonstrator with limited impact on scientists’ day-to-day life. In order to find a document or a dataset, users must know beforehand what they are looking for, and there is no way to correlate documents and data, which would actually be one of the most important facilitators. In order to overcome these limitations and turn the KB into a powerful research tool, the CHAIN-REDS consortium has decided to semantically enrich the OADRs and DRs gathered in the KB and build a search engine on the related linked data. The CHAIN-REDS Semantic Search Engine is the result of this effort, led by INFN.

4.2. The CHAIN-REDS Semantic Search Engine

The multi-layered architecture of the search engine is sketched in Figure 3 where both the official and de facto Semantic Web standards and technologies adopted are described by small logos. Starting from the bottom of Figure 3, the first two components of the service are described below.

Figure 3. Architecture of the Semantic Search Engine

The metadata harvester is a process able to run both on Grid and Cloud infrastructures which consists of the following parts:

- Get the address of each repository publishing an OAI-PMH standard endpoint;

- Retrieve, using the OAI-PMH repository address, the related Dublin Core encoded metadata in XML format;

- Get the records from the XML files and, using the Apache Jena API, transform the metadata into RDF format;

- Save the RDF files into a Virtuoso triple store according to an OWL-compliant ontology built using Protégé.
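The middle steps of the harvester can be sketched as follows. The project's harvester relies on the Apache Jena API and a Virtuoso triple store; this illustration swaps in the Python standard library only, parses an embedded sample OAI-PMH response instead of contacting a repository, and stops at N-Triples text rather than loading a store. The sample record and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# XML namespaces used by OAI-PMH responses carrying Dublin Core records.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hypothetical ListRecords response with a single record.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:repository.example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Sample open access paper</dc:title>
          <dc:creator>Doe, Jane</dc:creator>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""


def records_to_triples(xml_text):
    """Turn Dublin Core records into (subject, predicate, literal) triples."""
    root = ET.fromstring(xml_text)
    triples = []
    for record in root.iterfind(".//oai:record", NS):
        identifier = record.findtext("oai:header/oai:identifier", namespaces=NS)
        dc_root = record.find("oai:metadata/oai_dc:dc", NS)
        if identifier is None or dc_root is None:
            continue  # e.g. a record with a header but no metadata
        for element in dc_root:
            if element.text and element.tag.startswith("{" + NS["dc"] + "}"):
                # Tags look like "{namespace}name"; join them into one IRI.
                predicate = element.tag[1:].replace("}", "", 1)
                triples.append((identifier, predicate, element.text.strip()))
    return triples


def to_ntriples(triples):
    """Serialise the triples as N-Triples lines with literal objects."""
    lines = []
    for subject, predicate, value in triples:
        escaped = (
            value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
        )
        lines.append(f'<{subject}> <{predicate}> "{escaped}" .')
    return "\n".join(lines)


triples = records_to_triples(SAMPLE_RESPONSE)
print(to_ntriples(triples))
```

In the actual service the resulting RDF is stored according to the project ontology rather than as flat Dublin Core statements.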

Each Resource Description Framework (RDF) file retrieved and saved in a Virtuoso [45]-enabled triple store is mapped onto a Virtuoso Graph that contains the ontology expressly developed for the search engine, shown in Figure 4 for the sake of completeness. The ontology, built using Dublin Core and FOAF standards, consists of:


- Classes that describe the general concepts of the domain: Resource, Author, Organisation, Repository and Dataset (where Resource is a given open access document);

- Object properties that describe the relationships among the ontology classes; the ontology developed for the service described in this paper has several specific properties such as hasAuthor (i.e., the relation between Resources and Authors) and hasDataSet (i.e., the relation between Resources and Datasets);

- Data properties (or attributes) that contain the characteristics or parameters of the classes.

Figure 4. Schema of the ontology used for the Semantic Search Engine.
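The relation between classes and object properties can be made concrete with a small data-structure sketch. It models only three of the classes named above (Resource, Author, Dataset) and the two object properties hasAuthor and hasDataSet; the attribute names and the use of the title as subject are illustrative assumptions, and the Organisation and Repository classes are left out for brevity.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass(frozen=True)
class Author:
    name: str  # data property (illustrative)


@dataclass(frozen=True)
class Dataset:
    url: str  # data property (illustrative)


@dataclass
class Resource:
    """An open access document, as in the Resource class of the ontology."""
    title: str
    authors: List[Author] = field(default_factory=list)
    datasets: List[Dataset] = field(default_factory=list)


def object_property_triples(resource: Resource) -> List[Tuple[str, str, str]]:
    """List the hasAuthor and hasDataSet relations of one resource."""
    triples = [(resource.title, "hasAuthor", a.name) for a in resource.authors]
    triples += [(resource.title, "hasDataSet", d.url) for d in resource.datasets]
    return triples


# Hypothetical document with one author and one linked dataset.
paper = Resource(
    title="Sample paper",
    authors=[Author("Doe, Jane")],
    datasets=[Dataset("https://data.example.org/1")],
)
print(object_property_triples(paper))
```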

The third, and highest-level, component is the Search Engine itself. Using it, visitors can either enter a keyword and submit a SPARQL query to the Virtuoso triple store or select a language and get, on the left side of the page, the list of subjects available in that language with the indication, between parentheses, of the number of records available for that particular subject (see Figure 5).


Figure 5. Schema of the ontology used for the Semantic Search Engine.
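As an illustration of the keyword search described above, the sketch below builds a SPARQL query string from a user-supplied keyword. The predicate actually queried by the Search Engine is not stated in this document; matching on the Dublin Core `dc:title` property, the function name `keyword_query` and the default limit are assumptions made for the example.

```python
def keyword_query(keyword, limit=20):
    """Build a SPARQL SELECT query matching a keyword in resource titles."""
    # Escape backslashes, quotes and newlines so the keyword cannot
    # break out of the SPARQL string literal.
    safe = (
        keyword.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    )
    return f"""PREFIX dc: <http://purl.org/dc/elements/1.1/>
SELECT ?resource ?title
WHERE {{
  ?resource dc:title ?title .
  FILTER(CONTAINS(LCASE(STR(?title)), LCASE("{safe}")))
}}
LIMIT {int(limit)}"""


print(keyword_query("grid computing"))
```

The resulting text would be sent to the SPARQL endpoint of the triple store; the case-insensitive `CONTAINS` filter is one simple matching strategy among several.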

The results of a given query are listed in a summary view directly displayed on the webpage. For each record found, the title, the author(s) and a short description of the corresponding resource are provided. Clicking on the “More Info” link, visitors can access the detailed view of the resource.

In the “Dataset information” panel users get the link to the open access document and, if existing, to the corresponding dataset. Clicking on the “Graphs” tab, which appears at the top of the summary view, users can select one or more of the resources found and get a graphic view of the semantic connections among Authors, Subjects and Publishers, as shown in Figure 6. In this way, if new links appear, connecting different resources (as shown in the lower left corner of the figure), users can infer new relations among resources, thus discovering new knowledge. It is worth mentioning that this part of the Semantic Search Engine is at a prototype stage and is subject to changes and improvements in the coming months.

Figure 6. Graphic connections among records found by the Semantic Search Engine.

Further to these capabilities, a very important one has been recently implemented: the ability to perform either single [46] or parallel semantic searching [22]. By passing the mouse over the "Semantic Search" link of the CHAIN-REDS webpage, any user can see a sub-menu with two items:

- Single: the usual semantic search service described above; and,

- Parallel: the new parallel semantic search service that allows users to search in parallel (i.e., at the same time) across the millions of resources contained in the CHAIN-REDS Knowledge Base and in the ENGAGE Platform.

Parallel semantic search engines have been made available also in the Science Gateways of some (collaborating) projects, enhancing and extending in this way the solutions proposed by CHAIN-REDS. This parallel semantic search can be found at:

- agINFRA [18], where the user can search in parallel across the millions of resources contained in the CHAIN-REDS Knowledge Base and in the OpenAgris repository; and,

- DCH-RP [47], where the user can search in parallel across the tens of millions of resources contained in the CHAIN-REDS Knowledge Base and in the Europeana [48], Cultura Italia [49] and Isidore [50] repositories.

A snapshot of these parallel semantic search webpages is depicted in Figure 7, which clearly shows the user which repositories are included in the parallel semantic search.
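The idea of searching several sources at the same time can be sketched with a thread pool that submits the same keyword to every backend and collects the answers per source. How the CHAIN-REDS service implements parallelism is not described here; the two backend functions below are stand-ins that return canned results instead of querying the Knowledge Base or the ENGAGE Platform.

```python
from concurrent.futures import ThreadPoolExecutor


def parallel_search(keyword, backends):
    """Query every backend concurrently; return {backend name: results}."""
    with ThreadPoolExecutor(max_workers=max(1, len(backends))) as pool:
        # Submit all searches first so that they run at the same time.
        futures = {
            name: pool.submit(search, keyword)
            for name, search in backends.items()
        }
        # Then wait for each answer and keep it under its source name.
        return {name: future.result() for name, future in futures.items()}


# Stand-in backends (hypothetical): real ones would call remote services.
def search_knowledge_base(keyword):
    return [f"KB record about {keyword}"]


def search_engage(keyword):
    return [f"ENGAGE dataset about {keyword}"]


results = parallel_search(
    "water",
    {
        "CHAIN-REDS Knowledge Base": search_knowledge_base,
        "ENGAGE Platform": search_engage,
    },
)
print(results)
```

Keeping the results grouped by source mirrors the webpages, which show the user which repositories took part in the search.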

[47] http://ecsg.dch-rp.eu/semantic-search
[48] http://www.europeana.eu/
[49] http://www.culturaitalia.it/
[50] http://www.rechercheisidore.fr/
