
4.5 An Alternative Approach: Bootstrapping a Value Chain

4.5.4 Lessons Learned from Submitted Solutions

In this section we discuss lessons learned from the participants’ solutions. We start with an overview of the solutions; next, we group the lessons into four categories, even though these aspects overlap to some extent: lessons on the submitted tools, the ontologies used, the submitted data and the evaluation process.

Sol 2.1 Sol 2.2 Sol 2.3 Sol 2.4 Sol 2.5 Sol 2.6 Sol 2.7 Sol 2.8

year 2015 2015 2016 2016 2015 2015 2015 2016 2016

dataset size 2.6M 1.5M 285 184K 3.6M 2.4M 17M 152 235

# triples 21,681 10,730 2,143 1,628 15,242 12,375 98,961 1,126 1,816

# entities 4,581 1,300 334 257 3,249 2,978 19,487 659 829

# properties 12 23 23 15 19 21 36 571 23

Table 4.13: Statistics about the produced datasets (Task 2 – 2015 and 2016 editions)

Lessons learned from the tools

L5.1. There are both generic and ad hoc solutions. All solutions were methodologically different from each other. For Task 1, for instance, two solutions (1.1 and 1.3) primarily consisted of a tool developed specifically for this task, whereas the other two solutions wrote task-specific templates in otherwise generic implementations (adaptable to other domains). In the latter case, Solution 1.2 abstracted the case-specific aspects from the implementation, whereas Solution 1.4 kept them inline with the implementation. It therefore becomes clear that there are alternative approaches that can be used to produce RDF datasets.

L5.2. There are HTML-code-based and content-based approaches to information extraction. Even though the solutions were methodologically different, two main approaches for dealing with the HTML pages prevailed: HTML-code-based and content-based.
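The difference between the two approaches can be illustrated with a minimal sketch. The HTML fragment, the class names and the author-name pattern below are assumptions made for illustration only; they do not reproduce the actual CEUR-WS markup or any submitted solution. The first function relies on the tag structure (HTML-code-based), while the second discards the markup and matches patterns in the remaining text (content-based).

```python
import re
from html.parser import HTMLParser

# Hypothetical proceedings-index fragment; markup and class names are
# illustrative assumptions, not the real CEUR-WS page structure.
PAGE = """
<ul>
  <li><a href="paper1.pdf"><span class="title">Linking Scholarly Data</span></a>
      <span class="author">A. Author</span></li>
  <li><a href="paper2.pdf"><span class="title">Events as Linked Data</span></a>
      <span class="author">B. Writer</span></li>
</ul>
"""


class TitleParser(HTMLParser):
    """HTML-code-based extraction: rely on tags and attributes."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "span" and ("class", "title") in attrs:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "span":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.titles.append(data.strip())


def titles_from_markup(html):
    parser = TitleParser()
    parser.feed(html)
    return parser.titles


def authors_from_content(html):
    """Content-based extraction: strip the markup, then match text patterns."""
    text = re.sub(r"<[^>]+>", " ", html)
    # Assumed convention: one initial, a period, and a capitalised surname.
    return re.findall(r"\b[A-Z]\. [A-Z][a-z]+", text)


print(titles_from_markup(PAGE))
print(authors_from_content(PAGE))
```

The markup-based function breaks when the page template changes, whereas the content-based function breaks when the text deviates from the expected pattern, which is one reason both approaches can coexist.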

Lessons learned from models and ontologies

L6.1. All solutions used almost the same data model (Task 1). All solutions of Task 1 tend to converge regarding the model of the data. The same occurs, but at a higher level, in the case of Task 2. In particular for Task 1, the domain modeling of Solution 1.4 was inspired by the model used in Solution 1.1, with some simplifications. Note also that Solution 1.2 was the winning solution in 2014. Based on the aforementioned, we observe a trend of convergence regarding the model the CEUR-WS dataset should have, as most of the solutions converge on the main identified concepts in the data (Conference, Workshop, Proceedings, Paper and Person).

L6.2. All solutions used almost the same vocabularies for the same data (Task 1). A wide range of vocabularies and ontologies can be used to annotate scholarly data. However, most of the solutions preferred to (re)use almost the same existing ontologies and vocabularies (see table 6 for Task 1). This is good evidence that the spirit of vocabulary reuse is gaining traction. It is interesting, however, that different solutions used the same ontologies to annotate the same data differently.

5 Publishing Linked Open Scholarly Metadata

Scholarly metadata on the Web have been published by different sources and data providers. In addition to their huge volume, such datasets are represented in various formats in terms of data type and schema. Therefore, as in other domains, the Web contains heterogeneous and disconnected scholarly datasets. Homogeneous data are required in order to integrate with other sources and to use semantic-based technologies. Consequently, a (semi-)automated procedure for data transformation is needed immediately after (or simultaneously with) the acquisition of data from different resources. As the aim of this research is to build services based on semantic technologies, the uniform data type considered here is the Resource Description Framework (RDF) (explained in subsection 3.4.1), a W3C standard language that organizes data into a set of triples. Having data in this format enables interlinking with other datasets. This chapter addresses three steps of the metadata life cycle (section 3.4): Extraction, Transformation and Interlinking. The following sections of this chapter are based on research contributions related to these steps that have previously been published as research articles.¹
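As a minimal illustration of what organizing data into triples means, the following sketch writes a few (subject, predicate, object) statements in N-Triples syntax using only the Python standard library. The `example.org` namespaces, the event resource and its label are placeholders invented for this example; only the `rdf:type`, `rdfs:label` and `owl:sameAs` IRIs are the standard W3C ones. The last triple hints at how a shared format makes interlinking with another dataset possible.

```python
# Placeholder namespace and resource names, chosen only for illustration.
EX = "http://example.org/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def iri(value):
    """Wrap an IRI in angle brackets, as N-Triples requires."""
    return f"<{value}>"


def literal(value):
    """Quote a plain string literal, escaping backslashes and quotes."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


# Each triple is (subject, predicate, object).
TRIPLES = [
    (iri(EX + "event/1"), iri(RDF_TYPE), iri(EX + "Conference")),
    (iri(EX + "event/1"), iri(RDFS_LABEL), literal("Example Conference 2018")),
    # A link into another (hypothetical) dataset: this is what interlinking adds.
    (iri(EX + "event/1"), iri(OWL_SAME_AS), iri("http://other.example.org/conf/42")),
]


def to_ntriples(triples):
    """Serialize triples one per line, each terminated by ' .'."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)


print(to_ntriples(TRIPLES))
```

In practice an RDF library would handle serialization and validation; the hand-written version is used here only to keep the triple structure visible.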

Several data gathering methods have been implemented to mature OpenResearch.org (mainly introduced in chapter 6). One of the main resources used to gather event-related metadata has been the emails of calls for papers distributed through certain mailing lists. The corresponding work is introduced in the following publication, which is explained in section 5.1, the data gathering section of this chapter. As explained in chapter 1, this has been teamwork led by the author. Rebaz Omar modeled the metadata distributed in mailing lists and implemented the approach proposed by the author, namely SAANSET. The integration of SAANSET with the OpenResearch.org ontology and the collected data was also the contribution of the author.

Rebaz Omar, Sahar Vahdati, Christoph Lange, Maria-Esther Vidal, and Andreas Behrend. SAANSET: Semi-Automated Acquisition of Scholarly Metadata using OpenResearch.org Platform, ICSC 2018.

Section 5.2 describes the work done on the transformation of different metadata formats to machine-readable RDF.

¹ Own Manuscript Contributions: The author contributed to the conception and design of the research work, the transformation of CSV to RDF, the comparison of the results and, finally, the RDFization based on the winning approach, CSV to RDF, in the context of the OpenAIRE project. The work co-authored by Alexiou et al. is joint work of the OpenAIRE LOD team. The University of Bonn is coordinating the effort of publishing the OpenAIRE data as Linked Open Data (LOD), and the effort is further supported by the Athena Research and Innovation Center and CNR-ISTI (Alexiou and Papastefanatos). Vahdati mainly contributed to generating, constructing and improving the interlinking patterns with assessed links, and Alexiou aligned them with the OA infrastructure. The work by Ameri et al. was a master thesis mainly supervised by Vahdati. In both articles, the author had the main role in drafting and in the final approval of the published versions.

• Sahar Vahdati, Farah Karim, Jyun-Yao Huang, Christoph Lange. Mapping Large Scale Research Metadata to Linked Data: A Performance Comparison of HBase, CSV and XML. In Metadata and Semantics Research Conference 2015.

Representation of data in such formats makes it interoperable and easily reusable. More details will be discussed in section 5.3. In the following publications we discuss the design and implementation of scholarly metadata interlinking.

• Giorgos Alexiou, Sahar Vahdati, Christoph Lange, George Papastefanatos, Steffen Lohmann. LOD services: Scholarly Communication Data as Linked Data, SAVE-SD Workshop of WWW2016, LNCS post-proceedings 2016;

• Shirin Ameri, Sahar Vahdati, Christoph Lange. Interlinking OpenAIRE LOD and related Datasets, Theory and Practice of Digital Libraries 2017.

5.1 Extraction

In our era, open access to scientific literature has become widespread. The overall process of scientific communication, e.g., the preparation of manuscripts, the organization of conferences, and the peer review process, has become considerably more efficient. This results in an enormous amount of research output and information about research activities. Researchers spend a lot of time finding information about other researchers, scientific events, journals, scientific papers and research topics related to their interests. Although many services exist to assist researchers, such as data and content repositories, digital libraries or metadata catalogues, it is often time-consuming to find information such as:

• Which scientific events covering topic X and including a PhD Consortium will be held near location Y during the next Z months? (a community calendar)

• Where does the next event of an event series X take place?

• Which countries have the research groups that have been most active in organizing events (considering roles in events, e.g., PC membership) over a period of X years?

• What upcoming events on topic X have a high networking potential in terms of interesting participants (e.g., keynote speakers) and their schedule (e.g., social events)?
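To illustrate why such questions become tractable once event metadata is available in structured form, the first question can be phrased as a simple filter. The sketch below is hypothetical: the `Event` record, its fields and the sample events are assumptions for illustration, "near location Y" is simplified to "in the same country", and a month is approximated as 30 days.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Event:
    # Hypothetical record layout; not the OpenResearch.org schema.
    name: str
    topics: frozenset
    country: str
    start: date
    has_phd_consortium: bool


def upcoming_events(events, topic, country, months, today):
    """Events on `topic` with a PhD Consortium in `country` within `months`."""
    horizon = today + timedelta(days=30 * months)  # a month approximated as 30 days
    return [
        e.name
        for e in events
        if topic in e.topics
        and e.country == country
        and e.has_phd_consortium
        and today <= e.start <= horizon
    ]


# Invented sample data, used only to exercise the filter.
EVENTS = [
    Event("Workshop A", frozenset({"semantic web"}), "Germany", date(2018, 5, 10), True),
    Event("Conference B", frozenset({"semantic web"}), "Germany", date(2019, 5, 10), True),
    Event("Conference C", frozenset({"semantic web"}), "Germany", date(2018, 4, 2), False),
]

print(upcoming_events(EVENTS, "semantic web", "Germany", 6, date(2018, 1, 1)))
```

Without structured metadata, answering the same question means reading and comparing the announcements one by one.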

Mailing lists are a popular way [250] of exchanging announcements or spreading discussions easily among researchers. They form one of the most reliable sources of information about upcoming events because of the large coverage of events by Calls for Papers (CfPs) disseminated in those mailing lists. The principal reasons for using email as a scientific communication channel are the known target group and the speed and immediacy it offers. However, the sheer amount of emails sent through those mailing lists makes it difficult for one individual to keep track of them.

Although data from mailing lists is a reliable source of information about upcoming events, it is hard for one individual to extract specific information from them. To obtain the information they are interested in, subscribers are required to first filter a huge amount of emails by relevance, and then, in the worst case, read the full text of the relevant ones. In this section, we present a semi-automatic approach for relevance filtering and metadata extraction from CfPs and expose the extracted data in a useful way in the OR information portal.

Motivating Example

We motivate the problem of filtering and extracting metadata about scientific events from CfP emails of mailing lists with the following scenario. Our focus is on mailing lists, i.e., a communication medium often used by research communities as a specific channel for distributing, e.g., announcements of releases of software packages or datasets, CfPs of upcoming scientific events, and research-related opinions and questions. Active researchers receive a vast amount of emails about conferences and scientific progress every day. Subscribing to such mailing lists further increases the already enormous number of announcements received every day. Suppose a researcher who has subscribed to such a mailing list needs to identify upcoming related scientific events. Figure ?? depicts a pipeline that can be followed to achieve this goal using mailing lists. The upper part of the figure shows researchers in the role of event organizers, who are concerned with preparing CfPs and are seeking ways and channels to distribute them to the relevant communities. A researcher in our scenario has to trace the emails on a list and decide which ones to have a closer look at. Although this process looks straightforward and mailing lists are one of the favorite communication channels of researchers, a lot of relevant information might either be overlooked or overwhelm recipients. We therefore present SAANSET (Semi-Automated AcquisitioN of Scholarly mETadata), a method to support researchers with these tasks; the proposed method is able not only to filter emails but also to capture knowledge encoded in CfP emails and to represent this metadata as structured data in OR for further reuse.

[Figure: The SAANSET Architecture. Labels: Input: RSS feed of mailing lists (E); Import; Ontology Development; Output; Wiki Pages: semantically represented data (D∗)]

Figure 5.1: The Architecture of SAANSET. SAANSET receives as input a set E of emails and a keyword query Q and outputs an RDF (Resource Description Framework) dataset D∗. A keyword query Q is used to select a set E∗ of relevant emails containing CfPs. The RDF dataset D∗ is composed of the RDF triples that describe the scientific events described in E∗.
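The data flow stated in the caption (E and Q in, E∗ selected, D∗ out) can be sketched as follows. This is an illustrative approximation under stated assumptions, not the SAANSET implementation: the matching rule (an email is relevant if it mentions at least one keyword), the deadline pattern, the `example.org` namespace and the property names `title` and `submissionDeadline` are all hypothetical, and the manual review step that makes the method semi-automated is omitted.

```python
import re

EX = "http://example.org/"  # placeholder namespace, not the OpenResearch.org ontology

# Assumed phrasing of a deadline line in a CfP; real CfPs vary widely.
DEADLINE = re.compile(r"submission deadline:\s*(\d{4}-\d{2}-\d{2})", re.IGNORECASE)


def select_relevant(emails, query):
    """E, Q -> E*: keep emails that mention at least one query keyword."""
    keywords = [k.lower() for k in query]
    relevant = []
    for email in emails:
        text = (email["subject"] + " " + email["body"]).lower()
        if any(k in text for k in keywords):
            relevant.append(email)
    return relevant


def describe_event(email, index):
    """One email in E* -> triples for D* (property names are hypothetical)."""
    event = f"<{EX}event/{index}>"
    title = email["subject"].replace('"', "'")
    triples = [(event, f"<{EX}title>", f'"{title}"')]
    match = DEADLINE.search(email["body"])
    if match:
        deadline = match.group(1)
        triples.append((event, f"<{EX}submissionDeadline>", f'"{deadline}"'))
    return triples


def build_dataset(emails, query):
    """E, Q -> D*: filter first, then extract triples from each relevant email."""
    dataset = []
    for index, email in enumerate(select_relevant(emails, query), start=1):
        dataset.extend(describe_event(email, index))
    return dataset


# Invented sample emails, used only to exercise the sketch.
EMAILS = [
    {"subject": "Call for Papers: Example Workshop 2018",
     "body": "Submission deadline: 2018-03-01"},
    {"subject": "New dataset release", "body": "Version 2 is available."},
]
D_STAR = build_dataset(EMAILS, ["call for papers", "cfp"])
for triple in D_STAR:
    print(" ".join(triple), ".")
```

Separating the selection step from the extraction step mirrors the caption: the keyword query only decides which emails enter E∗, and the triples in D∗ are derived solely from those emails.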
