computational aspects of reproducibility
Scientific data management plays a key role in knowledge discovery, data integration, and reuse. The FAIR principles [Wilkinson et al., 2016] provide the prerequisite for good data management. Humans understand semantics, which makes it easier for us to identify and interpret data, but we cannot act at high speed on large and complex datasets. Machines, in contrast, can handle data at a larger scale and a faster pace, but they cannot understand the semantics of the data. The FAIR guidelines are therefore proposed for both machines and humans. One of the FAIR principles is to make data interoperable by making it machine-readable. Data objects are interoperable “only if the data is machine-actionable, utilizes shared vocabularies or ontologies and the data within the object should be syntactically parseable and semantically machine-accessible” [Wilkinson et al., 2016]. One of the benefits of machine-readable data is that provenance records can be tracked.
Digital preservation helps to ensure long-term access to data in the present era of ever-changing technologies and research. The preservation of digital objects has long been studied in the digital preservation community. Some works give more importance to the conservation of software and business processes [Mayer et al., 2012], while others focus on the preservation of scientific workflows [Belhajjame et al., 2015]. There are also works which provide the infrastructure to support the execution of workflows. Packaging tools like ReproZip [Chirigati et al., 2013] and Docker [Boettiger, 2015] help users to create packages that include all the dependencies required to reproduce a computational experiment or a workflow. ReproZip records the workflow of command-line executions and creates packages which can be used to rerun and verify the results. However, ReproZip does not capture the evolution of workflows and uses a proprietary language for workflow descriptions.
We focus our approach on data management solutions for scientific data, including images. The paper [Eliceiri et al., 2012] provides a list of biological imaging software tools and presents two open-source image databases. We reviewed these two imaging database management platforms: BisQue [Kvilekval et al., 2010] and OMERO [Allan et al., 2012]. The Bio-Image Semantic Query User Environment (BisQue) is an open-source, server-based software system that can store, display, and analyze images. The stored images can be accessed through a web interface or through an API. It is developed and maintained by a small team at UCSB, with a schedule of two releases per year. The platform uses Bio-Formats (footnote 14), OpenSlide (footnote 15), and ImarisConvert (footnote 16) to support over 240 file formats.
OMERO [Allan et al., 2012] is another open-source data management platform for imaging metadata, primarily for experimental biology. The OMERO software platform is developed by the Open Microscopy Environment (OME), a collaborative consortium responsible for producing open specifications and tools that enable open access to image data. Its plugin architecture provides a rich set of features, including analyzing and modifying images. It supports over 140 image file formats using Bio-Formats [Linkert et al., 2010]. OMERO has a very active development community, which ensures a continued effort to improve the system, and everybody is able to contribute. It also has a well-documented API for writing custom tools, and its web interface can be extended with plugins. In addition, it benefits from a faster release cycle. OMERO's framework, which is based on ICE (ZeroC's Internet Communications Engine, footnote 17), has been demonstrated to scale to very large multi-terabyte datasets across applications. Performance and scalability in handling large heterogeneous data are important criteria in biological applications.
RIKEN [Kobayashi et al., 2018] is a meta-database platform for the life sciences. It provides datasets of genomes and phenomes of different species as well as sequence and image data. It also provides a SPARQL endpoint, a web interface for data input, and an RDF converter tool.
A general approach to documenting experimental metadata is provided by the CEDAR workbench [Gonçalves et al., 2017]. It is a metadata repository with a web-based tool which helps users to create metadata templates and to fill in metadata using those templates. The metadata is available in JSON, JSON-LD, and RDF formats. The main features of the CEDAR workbench include the Template Designer, the BioPortal Lookup Service, Intelligent Authoring, and Collaboration. The BioPortal Lookup Service module helps the user to annotate a template with ontology terms. The Intelligent Authoring module reduces the metadata authoring time by recommending values based on context-sensitive suggestions. CEDAR also provides a REST API to export the metadata and the templates to other systems. This work was developed in parallel to this thesis, and it overlaps with the part of our work that provides a metadata editor. However, it lacks the ability to query and visualize the end-to-end provenance of scientific experiments.
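The idea of filling a metadata template and exporting the result as JSON-LD can be illustrated with a minimal sketch. It does not reproduce CEDAR's actual template schema or REST API: the `TEMPLATE` structure, the `fill_template` helper, and the `http://example.org/` identifiers are assumptions made purely for illustration. Only the Dublin Core term IRIs and the JSON-LD keywords `@context`, `@id`, and `@type` are taken from existing standards.

```python
import json

# Hypothetical template: field name -> ontology term IRI.
# This is an illustration, not CEDAR's actual template format.
TEMPLATE = {
    "name": "MicroscopyExperiment",
    "fields": {
        "title": "http://purl.org/dc/terms/title",
        "creator": "http://purl.org/dc/terms/creator",
        "instrument": "http://example.org/vocab#instrument",  # assumed term
    },
}


def fill_template(template, instance_id, values):
    """Build a JSON-LD document from values that match the template fields."""
    unknown = set(values) - set(template["fields"])
    if unknown:
        raise ValueError(f"fields not in template: {sorted(unknown)}")
    document = {
        # The context maps each plain field name to its ontology term.
        "@context": dict(template["fields"]),
        "@id": instance_id,
        "@type": "http://example.org/vocab#" + template["name"],
    }
    document.update(values)
    return document


if __name__ == "__main__":
    doc = fill_template(
        TEMPLATE,
        "http://example.org/experiment/1",
        {"title": "Time-lapse imaging", "creator": "A. Scientist"},
    )
    print(json.dumps(doc, indent=2))
```

Because the field names are bound to ontology terms in the context, a consumer that understands JSON-LD can interpret the values without knowing the template itself, which is the property that makes such metadata machine-actionable.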
myExperiment [Goble et al., 2010] is a social networking environment for sharing bioinformatics workflows. Since its release in 2007, it has collected around 3900 workflows, mainly Taverna workflows. The workflows and the supporting files can be bundled together as
14 https://www.openmicroscopy.org/bio-formats/
15 https://openslide.org/
16 http://www.bitplane.com/
17 http://www.zeroc.com
Solution                                   | Category                       | Purpose
OPM [Moreau et al., 2011]                  | Provenance Model               | Model Scientific Workflows
PROV-O [Lebo et al., 2013]                 | Provenance Model               | General-Purpose ontology to model Entities, Activities and Agents
P-Plan [Garijo and Gil, 2012]              | Provenance Model               | Model Scientific Workflows with plans and their execution
Provenir [Sahoo, 2010]                     | Provenance Model               | Model Scientific Workflows
OPMW [Garijo and Gil, 2011]                | Provenance Model               | Model Scientific Workflows
D-PROV [Missier et al., 2013]              | Provenance Model               | Model Scientific Workflows
Research Objects [Belhajjame et al., 2015] | Provenance Model               | Model Scientific Workflows with the aggregation of resources
EXPO [Soldatova and King, 2006]            | Provenance Model               | Model Scientific Experiments
OMERO [Allan et al., 2012]                 | Experimental Data Preservation | Image Database
BisQue [Kvilekval et al., 2010]            | Experimental Data Preservation | Image Database

Table 3.1: Overview of the solutions for describing scientific experiments
packs so that other users can download them together. It also provides collaborative support by allowing users to create and join groups. However, the difficulty of reusing other scientists' workflows has been a major concern [Zhao et al., 2012] in this environment.
3.4 Discussion
In Section 3.2, the need for describing scientific experiments with Semantic Web technologies has been discussed. Table 3.1 provides an overview of the solutions in the context of the non-computational aspect of reproducibility. Several provenance models described using ontologies have been presented to suit different application domains. After many discussions and provenance challenges, the Open Provenance Model (OPM) was developed as a standard to capture provenance information irrespective of the domain. In parallel to the development of OPM, several other ontologies like Provenir were developed with the same aim. The W3C Provenance Working Group developed a family of documents, PROV, which became the standard model for provenance information. Since PROV provides an upper-level ontology, it is essential to capture provenance information in more detail depending on the application. Several approaches have been introduced to describe the provenance information of scientific workflows. The workflow-centric Research Objects have been widely used to support the aggregation of resources. Ontologies like ProvONE and D-PROV are used to represent the computational processes of scientific workflows. Only a few models capture the execution environment of workflows. The WICUS ontology targets the conservation of scientific workflows. However, only a few ontologies, such as EXPO and SWAN, capture the provenance information of scientific experiments. These have the limitation that they do not extend the PROV model, which results in a lack of interoperability.
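To make the upper-level character of PROV concrete, the following minimal sketch expresses one step of an imaging experiment with the core PROV-O classes (Entity, Activity, Agent) and a few of its properties, and serializes the statements as N-Triples. The PROV and RDF namespace IRIs and the term names are those of the W3C vocabularies. The `http://example.org/experiment#` namespace, the resource names, and the helper functions are hypothetical and serve only as an illustration; a real application would normally use an RDF library instead of string formatting.

```python
PROV = "http://www.w3.org/ns/prov#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
EX = "http://example.org/experiment#"  # hypothetical namespace


def describe_acquisition():
    """Return PROV-O triples for one hypothetical image acquisition step."""
    image, acquisition = EX + "image_001", EX + "acquisition_001"
    sample, scientist = EX + "sample_A", EX + "scientist_1"
    return [
        # Typing: which resources are entities, activities, and agents.
        (image, RDF_TYPE, PROV + "Entity"),
        (sample, RDF_TYPE, PROV + "Entity"),
        (acquisition, RDF_TYPE, PROV + "Activity"),
        (scientist, RDF_TYPE, PROV + "Agent"),
        # Relations: how the image came into existence.
        (acquisition, PROV + "used", sample),
        (image, PROV + "wasGeneratedBy", acquisition),
        (acquisition, PROV + "wasAssociatedWith", scientist),
        (image, PROV + "wasDerivedFrom", sample),
    ]


def to_ntriples(triples):
    """Serialize IRI-only triples as N-Triples, one statement per line."""
    return "\n".join(f"<{s}> <{p}> <{o}> ." for s, p, o in triples)


if __name__ == "__main__":
    print(to_ntriples(describe_acquisition()))
```

The sketch also shows why domain-specific extensions are needed: PROV can state that an activity used a sample, but details such as instrument settings or experimental protocols require additional vocabulary that extends these core terms.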
Several approaches for the provenance management of scientific experiments have also been discussed. One of our requirements is the data management of microscopy images. OMERO and BisQue are the two approaches that come closest to meeting our requirements. We also reviewed other solutions in the context of scientific data management. These solutions focus either on data management support for non-computational processes or on management support for scientific workflows, and they do not directly provide the features needed to support our goals. In particular, they do not fully capture, represent, and visualize the complete path of a scientific experiment. Hence, it is important to extend them to support our goals while reusing their rich features.
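What it means to follow the complete path of an experiment can be sketched as a backward traversal over provenance relations. The example below is an illustration under stated assumptions and not the approach of any of the reviewed systems: the `EDGES` list, the node names, and the `trace_back` helper are hypothetical, and the relation names merely mirror PROV properties.

```python
from collections import deque

# Hypothetical provenance edges, each pointing from a node to what it came from.
EDGES = [
    ("figure_1", "wasGeneratedBy", "analysis"),
    ("analysis", "used", "image_001"),
    ("image_001", "wasGeneratedBy", "acquisition"),
    ("acquisition", "used", "sample_A"),
    ("acquisition", "wasAssociatedWith", "scientist_1"),
]


def trace_back(edges, start):
    """Breadth-first walk over provenance edges; returns nodes in visit order."""
    outgoing = {}
    for subject, _relation, obj in edges:
        outgoing.setdefault(subject, []).append(obj)
    seen, order, queue = {start}, [], deque([start])
    while queue:
        node = queue.popleft()
        order.append(node)
        for parent in outgoing.get(node, []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return order


if __name__ == "__main__":
    # Everything that contributed to the final figure, nearest steps first.
    print(trace_back(EDGES, "figure_1"))
```

Starting from a result such as a figure, the traversal reaches the analysis, the image, the acquisition step, the sample, and the responsible agent. This covers both the computational and the non-computational steps, which is the kind of end-to-end view that the reviewed solutions do not provide on their own.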