
Chapter 2 State of the Art

2.4 Sharing omic data

2.4.7 Collaboration, coordination and harmonisation in the develop-

Since the development of MIAME, many groups from different omic disciplines have started to develop omic data reporting standards [128], often modelled on MIAME, such as reporting standards for metabolomics [148], proteomics [147], in situ hybridisation, immunohistochemistry [216], flow cytometry [150], and experiments centred around a specific technology [155].

Until recently, these standards were developed in an uncoordinated, isolated fashion. The resulting duplication of effort has led to a proliferation of MI specifications, associated data exchange formats and ontologies that are incompatible and overlapping. Differences in scope, structure and vocabulary make it difficult to re-use data and to integrate data for analysis [135, 156, 217, 157, 155, 129, 128].

In fact, many parts of a scientific investigation are shared by many areas, such as sample collection and study design. This has created overlaps between standards. For example, the scope of MIAPE overlaps partially or completely with minimum reporting requirements of other scientific domains, such as those in transcriptomics [18] and metabolomics [148] in study design, collection procedures for biosamples, statistical analysis used and in the use of specific analytical technologies [143]. Furthermore, many techniques are being used in other scientific domains, for example, both proteomics and metabolomics use chromatography and mass spectrometry [20, 147, 143]. The Core Information for Metabolomics Reporting (CIMR), MIAME and MIGS specifications overlap in the description of the biosample, and others overlap in specific technologies, e.g. mass spectrometry is covered both by CIMR and MIAPE [156].

With the trend towards multi-omics approaches, which may apply several omic technologies (such as proteomics, transcriptomics and metabolomics) to the same source sample, integrating the different types of omic data generated is becoming increasingly important for effective data analysis [156, 135, 217, 138, 147, 129, 130, 155]. For example, nutrigenomics, environmental genomics and toxicogenomics investigations cannot be described sufficiently by any single existing reporting structure [155]. Furthermore, integration of data is a prerequisite for enabling its reuse [157]. The number of data-driven rather than hypothesis-driven studies has been growing over recent years, making coordination and harmonisation between the groups involved in developing data standards and exchange formats increasingly important [218, 157]. Standards development should be coordinated to avoid the creation of multiple, competing standards and to avoid overlaps, but also to avoid incompatibility where there is no overlap in scope, e.g. with regard to structure, granularity of the information recorded and terminology, which can hinder data integration and the combination of diverse sets of data [156, 157, 155, 148, 149].

There have been efforts by individual standards development organisations to avoid overlap. These include coordinating, in collaboration between standards bodies, the development of reporting requirements for technologies, protocols and entities that are common to many areas by creating shared modules, as for MIAPE [143, 20]. In the development of MIAPE, the PSI concentrated on techniques specific to proteomics rather than on very general areas (such as the description of the biosample used in the study), in order to avoid overlap with other standards such as MIAME and to facilitate the integration of different data sets [20]. Furthermore, individual standards development organisations, such as the PSI [147], the Genomic Standards Consortium [130] and the Metabolomics Society [148, 147, 219], are making efforts to collaborate and coordinate with other organisations in the development of standards.

More recently, several overarching collaboration and coordination efforts to harmonise standards have been initiated at all levels: at the level of MI specifications (MIBBI), at the level of data exchange formats, representing syntax (FuGE, ISA-TAB), and at the level of ontologies, representing semantics (OBI, OBO Foundry) [156, 157]. Each of these efforts is open to the public and collaborates with diverse biomedical domains, and the participants of each effort interact to foster coordination between the efforts [156]. These efforts are described below.

Coordination at the level of the specification

In the following sections, the initiatives MIBBI, the RSBI, the GSC, and their efforts in coordination among MI specifications are described.

MIBBI: Coordination for MI Specifications

Due to the proliferation of minimum information checklists, the Proteomics Standards Initiative (PSI), the Reporting Structure for Biological Investigations (RSBI) working group and the GSC [157] have initiated the Minimum Information for Biological and Biomedical Investigations (MIBBI)50 project [157, 143, 20]. MIBBI aims to coordinate the creation and harmonisation of minimum information guidelines and checklists across the different bioscience domains, to avoid duplication of and overlap between standards [157]. Over 30 projects developing reporting requirements are registered under MIBBI. Figure 2.16 illustrates the current overlap between checklists.

Figure 2.16: Interaction graph for MIBBI projects (line thickness and colour saturation show similarity) (Source: http://www.docstoc.com/docs/document-preview.aspx?doc_id=118652775)

MIBBI has two key parts [157]:

• The MIBBI Portal: MIBBI provides a freely accessible web-based resource acting as a portal or ‘one-stop shop’ for exploring existing minimum checklist projects, checklists and their associated data formats, ontologies and controlled vocabularies, databases and tools [157].

• The MIBBI Foundry: an initiative to foster the joint harmonisation of checklists, with the aim of ultimately developing a number of checklists that are integratable, clearly bounded, self-consistent and orthogonal [157]. The MIBBI Foundry is the MI equivalent of the OBO Foundry [157].

MGED: RSBI Working Group

In collaboration with the PSI, the MGED Society initiated the Reporting Structure for Biological Investigations (RSBI) working group51 in 2004 to develop standards and standard-compliant software and databases [155, 147, 151]. The RSBI working group was formed to unite and harmonise the independent standard development efforts of the nutrigenomics, environmental genomics and toxicogenomics communities (exemplified by the MIAME extensions MIAME/Tox [165], MIAME/Env [166] and MIAME/Nut [167]), so as to avoid duplication and incompatibility and to facilitate data integration in functional genomics studies that use a number of omic technologies [155, 151, 135]. The most important output of the RSBI has been the definition of “high-level concepts” for a reporting standard that are applicable to any technical or biological application domain and which can be extended by the communities into their respective domains [155]. The use of these concepts avoids duplication of core information that is common to different biological and technical domains, including study design, sampling, sample processing and instruments [155]. The high-level concepts defined by the RSBI are as follows (adapted from [155]):

• An Investigation is defined as a self-contained entity of scientific enquiry, containing a holistic objective or hypothesis.

• A Design is defined by the relationships between one or more Study(s) and Assay(s).

• A Study is defined as an experiment comprised of specific Phase(s), where Action(s) are applied to Subject(s) or Group(s) of Subject(s); it thus contains the description of the experimental procedures performed on subjects.

• A Subject is defined as the biological sample that is being studied (e.g., a tissue slice, a population, an individual, cells).

• An Assay describes the experiment performed on the Subject(s), and thus contains the test(s) and data produced by omics or other technologies.
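The relationships between the high-level concepts listed above can be made concrete with a small data-structure sketch. The following Python classes are purely illustrative and are not taken from the RSBI documents: the class and field names (such as `technology`, `data_files` and `actions`) are assumptions chosen for readability, and Phases and Groups of Subjects are omitted for brevity.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Subject:
    """The biological sample being studied (e.g. a tissue slice or an individual)."""
    name: str


@dataclass
class Assay:
    """A test performed on one or more Subjects, with the data it produced."""
    technology: str
    subjects: List[Subject]
    data_files: List[str] = field(default_factory=list)


@dataclass
class Study:
    """Experimental procedures (Actions) applied to Subjects."""
    title: str
    subjects: List[Subject]
    actions: List[str] = field(default_factory=list)


@dataclass
class Design:
    """Relates one or more Studies to the Assays performed on their Subjects."""
    studies: List[Study]
    assays: List[Assay]


@dataclass
class Investigation:
    """A self-contained unit of enquiry with an overall objective or hypothesis."""
    objective: str
    design: Design

    def technologies(self) -> List[str]:
        # Distinct technologies used across all assays, in first-seen order.
        seen: List[str] = []
        for assay in self.design.assays:
            if assay.technology not in seen:
                seen.append(assay.technology)
        return seen


def build_example() -> Investigation:
    # Hypothetical multi-omics investigation: two assays on the same subject.
    liver = Subject("liver biopsy")
    study = Study("dietary intervention", [liver], actions=["high-fat diet"])
    assays = [
        Assay("transcriptomics", [liver], ["arrays.txt"]),
        Assay("metabolomics", [liver], ["spectra.txt"]),
    ]
    return Investigation("effect of diet on liver", Design([study], assays))
```

In this sketch the same `Subject` object is shared by the Study and by both Assays, so the core information about the sample is recorded once, whichever omic technology is applied to it.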

These high-level concepts and their relationships are shown graphically in Figure 2.17.

Figure 2.17: Some of the top-level concepts defined by the RSBI and their relationships (Source: [155])

The GSC: Effort to harmonise the sample concept

Another group, which later became the GSC, worked on harmonising the sample concept amongst omic domains to support multi-omics studies on the same sample [130]. Using ontological concepts, they argue that a sample may be described as an entity or “continuant” (“cellular component” in the Gene Ontology (GO) [220]), for example a mouse, a cell, a urine sample, a liver biopsy, a yeast culture or a plasmid, that exists and continues to exist through time and may change or have “processes” (“biological process” in GO) applied to it [221], such as RNA extraction, ageing, toxicological perturbation and acclimatisation [130]. They propose that a sample description should contain the following components (from [130]):

• Characteristics: Characteristics are observable, assumed or estimated features of the entity. Following Gkoutos et al. [222], the information about a characteristic can be divided into five sections: an entity descriptor, a description of its attributes, the associated values and units, and the assay used for measuring the attributes.

• Processes: Processes can influence sample characteristics, e.g. the process “transportation” can change the location characteristic of a sample. In computational terms an entity on which any process has acted is regarded as a new entity, even though most of its attributes stayed the same.

• Temporal component: A temporal component, t, is necessary for continuants to exist. The application of a process leads to a change in characteristics at a new point in time, t_i. The temporal component can be captured in relative terms, e.g. in terms of duration since the start of an experiment, or in absolute terms, such as GMT or UST, depending on what is more appropriate.

• Environmental history: Environmental history describes the processes that have acted upon the sample before it has been studied. This is the most loosely defined component.

The GSC also propose the use of the inheritance paradigm (cf. object-oriented programming) in the description of a sample over time: characteristics of the source sample would be inherited by the destination sample and any subsequent samples, so that the same data would not need to be entered repeatedly and a mistake would only need to be corrected once, at the source sample [130].
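The sample description and the inheritance idea described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the model proposed in [130]: the `Sample` class, its field names and the lookup strategy (walk back through source samples until a characteristic is found) are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Sample:
    """A continuant described by characteristics, a time point and its history."""
    name: str
    time: float = 0.0                                   # relative time, e.g. hours since start
    own: Dict[str, str] = field(default_factory=dict)   # characteristics set on this entity
    parent: Optional["Sample"] = None                   # source sample, if derived
    process: Optional[str] = None                       # process that produced this entity

    def characteristic(self, key: str) -> Optional[str]:
        # Look locally first, then fall back to the source samples (inheritance).
        node = self
        while node is not None:
            if key in node.own:
                return node.own[key]
            node = node.parent
        return None

    def apply(self, process: str, at: float, **changes: str) -> "Sample":
        # Applying a process yields a new entity at a new time point; only the
        # changed characteristics are stored on the derived sample.
        return Sample(f"{self.name}/{process}", at, dict(changes), self, process)

    def history(self) -> List[str]:
        # Processes that have acted on this sample, oldest first.
        steps: List[str] = []
        node = self
        while node is not None and node.process is not None:
            steps.append(node.process)
            node = node.parent
        return list(reversed(steps))


# Hypothetical example: a mouse is transported, then RNA is extracted.
mouse = Sample("mouse-1", own={"species": "Mus musculus", "location": "lab A"})
moved = mouse.apply("transportation", at=2.0, location="lab B")
rna = moved.apply("RNA extraction", at=5.0, material="RNA")

# A correction made once at the source is seen by every derived sample.
mouse.own["species"] = "Mus musculus domesticus"
```

Each call to `apply` returns a new entity, mirroring the statement that an entity on which a process has acted is regarded as a new entity even though most of its attributes stay the same.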

Coordination at the level of data exchange formats

Data formats are the containers used to transmit information from the point of data entry, or as output from an instrument, into visualisation or analysis software or databases (which may in fact be as simple as an archive of indexed files) [156]. The majority of modern data exchange formats are based on XML [156]. There has been a proliferation of standard data formats in the omic domain from the main standards developing groups [129]. The proliferation of independently created data standards makes the integration of data produced by different experimental techniques difficult, because the models differ in their representation of information common to all functional genomics experiments, e.g. in detail and terminology, so that semantically equivalent information is described in syntactically incoherent ways [129, 138].

In the next sections, the FuGE and ISA-TAB projects to define unified omic data formats are described.

FuGE

Since its initiation by MGED in 2004 (now FGED, to reflect its focus on functional genomics standards development), the FuGE project has become a major cross-community collaboration, with developers from various organisations and representatives from industry and academia [129]. The aim of the FuGE project is to unify functional genomics data exchange formats for high-throughput experiments to create consistently structured data formats [129, 138].

The FuGE data model covers the description of biospecimens, protocols, instruments and software used [129]. FuGE was created essentially by removing the microarray-specific parts of MAGE-OM and modifying it to enable the creation of technology-specific extensions [129, 138]. The FuGE object model (FuGE-OM, [129]) models components that are universal to all types of functional genomics experiments across all technologies, e.g. preparation of sample, instruments, protocols and multidimensional data [129, 138]. FuGE comprises 10 packages divided into two categories [129, 138], as shown in Figure 2.18. The “Common” package and its sub-packages are used for general data standards functionality [129, 138]. The “Bio” package and its sub-packages contain modules for describing experimental procedures, such as experimental design and sample annotation [129, 138]. With the “Protocol” package, FuGE can be used to encode biological workflows, such as experimental procedures or methods, SOPs, and instrumental or software analysis procedures [138]. The “Collection” package defines the root structure of the XML format and is used in an instance document to group all objects within a FuGE instance [223].

The FuGE-OM provides a data capture format (FuGE-ML) that can be used in a stand-alone way to describe any workflow in biomedical science at an arbitrary level of detail (depending on the existence of appropriate ontologies), covering processes, data and materials. It also comes with the facility to describe and reference other data sources, such as files and online data [147, 138].

Figure 2.18: The hierarchy of base classes in FuGE. In addition to Common and Bio, a third package exists as a child of FuGE, Collection, which contains only classes and no child packages (Source: http://fuge.sourceforge.net/)

FuGE has been designed so that it can be adapted for specific techniques (from [129, 138]):

1. through extension, either by extending the core FuGE model, with the specific model inheriting the functionality (such as auditing) and structure of FuGE, or by developing ontologies for annotation

2. by a reference mechanism that can refer to external data formats (and as such embed them), such as other open or proprietary formats, while providing an accompanying metadata description
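The two adaptation routes listed above can be illustrated with a short Python sketch. It is not derived from the FuGE-OM itself: all class names (`Auditable`, `Protocol`, `GelProtocol`, `ExternalDataRef`) and their fields are hypothetical stand-ins, chosen only to show how a technique-specific model might inherit core functionality and how an external file might be referenced together with metadata.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Auditable:
    """Stand-in for core functionality (here: an audit trail) shared by all objects."""
    identifier: str
    audit_trail: List[str] = field(default_factory=list)

    def record(self, who: str, action: str) -> None:
        # Append one audit entry to this object's trail.
        self.audit_trail.append(f"{who}: {action}")


@dataclass
class Protocol(Auditable):
    """Generic, technology-independent protocol description."""
    name: str = ""


@dataclass
class GelProtocol(Protocol):
    """Route 1: a technique-specific extension that inherits the core structure."""
    gel_percentage: float = 0.0


@dataclass
class ExternalDataRef(Auditable):
    """Route 2: reference a file in another format and attach metadata to it."""
    location: str = ""
    file_format: str = ""
    produced_by: str = ""   # identifier of the protocol that generated the file


# Hypothetical usage: the extension gains auditing without re-implementing it,
# and the external file is described without being converted.
gel = GelProtocol("prot-1", name="2D gel", gel_percentage=12.5)
gel.record("alice", "created")
raw = ExternalDataRef("data-1", location="run42.mzXML",
                      file_format="mzXML", produced_by=gel.identifier)
```

The design choice shown here is that technique-specific detail lives only in the subclass, while anything written against the generic `Protocol` type continues to work with the extension.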

FuGE is intended to be used together with ontologies, such as FuGO [219] (later becoming the Ontology for Biomedical Investigations (OBI)52), for the detailed annotation of data [129, 138]. FuGE can also be used to provide context to the technology-specific data recorded in established or proprietary data formats in an experiment, as these formats often lack the metadata that describes the context of their creation [138]. The Computational Proteomics Analysis System (CPAS) [224] project embeds established mass-spectrometry formats (such as mzXML [183]) and supporting information into FuGE to achieve the description of a complete workflow [138]. The PRIDE database supports PSI formats that are FuGE extensions [134]. Both MGED and PSI have used FuGE as the basis for the development of data exchange formats [138] (from [129]), e.g. in

• MAGE-ML version 2 by the Microarray Gene Expression Data society (MGED), a complete workflow format developed in an attempt to simplify the much more complex MAGE version 1 [129, 138].

• GelML (for gel electrophoresis), spML (for sample processing) and the gel informatics format, created by the HUPO Proteomics Standards Initiative (PSI) [129, 147]. A format for the analysis of mass spectra (analysisXML) is planned [225, 138, 147], and there are plans to “retrofit” mzData to the FuGE-OM to make it compatible [147]. mzIdentML is also based on the FuGE object model [202].

Many omic communities are considering harmonising their data formats with FuGE, such as the PSI [153], or using FuGE to develop domain-specific data models, such as the Metabolomics Society; other standards development organisations are planning further FuGE-OM extensions [129, 138, 149]. FuGE has already been adopted by several omic communities in their development of data formats: by the Genetical Genomics consortium in XGAP, within the MOLGENIS framework [226], and by the Flow Informatics and Computational Cytometry Society [227] in FuGEFlow, an extension of the FuGE-OM specifically for flow cytometry data and metadata that supports the Minimum Information about a Flow Cytometry Experiment (MIFlowCyt) standard [207]. The mark-up language Flow-ML, which can implement MIFlowCyt, was generated from this data model [227].

ISA-TAB

The ISA-TAB format was proposed by the MGED RSBI as a common data format for capturing data and metadata from biological, biomedical and environmental investigations that use various omics technologies (such as genomics, transcriptomics, proteomics and metabolomics) together with more traditional methodologies [228]. ISA-TAB is a simple tab-delimited format for exchanging experimental metadata, built around the Investigation, Study and Assay entities [155, 228]. ISA-TAB is modelled on the successful MAGE-TAB format [168, 228]. Like MAGE-TAB, ISA-TAB files can be created using specific software or through spreadsheets, allowing easy creation, editing and viewing by researchers with little or no bioinformatics support [228]. ISA-TAB can be used as a simpler alternative to XML-based formats and as such competes with and duplicates efforts such as FuGE-ML [228].
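To illustrate why a tab-delimited format is easy to create and read with minimal tooling, the following sketch writes and re-reads a small study table using only the Python standard library. It is a simplified illustration and not a compliant ISA-TAB file: the column headers are modelled loosely on ISA-TAB-style headers and have not been checked against the specification, and the function names and sample values are hypothetical.

```python
import csv
import io

# Illustrative column headers only; a real file should follow the published
# ISA-TAB specification, which defines many more fields than shown here.
STUDY_COLUMNS = [
    "Source Name",
    "Characteristics[organism]",
    "Protocol REF",
    "Sample Name",
]


def write_study_table(rows):
    """Serialise a list of dicts as a tab-delimited table with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=STUDY_COLUMNS, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def read_study_table(text):
    """Parse the tab-delimited text back into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


# Hypothetical rows: two samples derived from two source animals.
rows = [
    {"Source Name": "mouse-1", "Characteristics[organism]": "Mus musculus",
     "Protocol REF": "liver dissection", "Sample Name": "liver-1"},
    {"Source Name": "mouse-2", "Characteristics[organism]": "Mus musculus",
     "Protocol REF": "liver dissection", "Sample Name": "liver-2"},
]

table = write_study_table(rows)
parsed = read_study_table(table)
```

Because the result is plain tab-separated text, the same table can be opened and edited in a spreadsheet application, which is the property the format relies on.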

Coordination at the level of Ontologies/CVs

The use of ontologies to annotate data has allowed complex data to be “semantically integrated” [155, 98]. CVs and ontologies facilitate query functions in database searches, disambiguation and the definition of synonyms [156]. Below, the FuGO/OBI project and the OBO initiative are presented, both of which are attempts at unifying and harmonising biomedical ontologies.

FuGO/OBI

The efforts to harmonise ontology development started with the development of the Functional Genomics Investigation Ontology (FuGO). FuGO is an international collaborative project to provide terms for the unambiguous annotation of functional genomics studies, including the annotation of study design, materials, experimental protocols and instruments, and the data created and its statistical analysis [219, 147]. FuGO was developed by extending the microarray-specific MGED ontology (MO) [162] to become more generally applicable, so that it could also be used to annotate data in the other omic domains, such as proteomics and metabolomics [219]. FuGO provides both terms common to all functional genomics studies and domain-specific terms, and refers to existing established ontologies in annotations to avoid duplication [219]. As such, FuGO acts as “semantic glue” to ensure that data from the various functional genomic sources are understood the same way everywhere [219]. FuGO was later expanded to cover epidemiological and clinical research, biomedical imaging and other experimental research domains, and eventually resulted in the Ontology for Biomedical Investigations (OBI)53 [98]. The OBI is an ontology with a broad scope for annotating a large variety of experimental data in clinical and biological investigations [98]. The OBI ontology covers experimental designs, protocols, instruments and materials used in the investigation, processes, data and analysis types in all domains of biological and biomedical research [98]. Figure 2.19 shows how a biomedical investigation may be represented using OBI Core, which contains a number of key terms from OBI and related ontologies.

Twenty-five groups are participating in the creation of OBI54 under the umbrella of the OBO Foundry [98].

The OBO Foundry

The Open Biomedical Ontologies (OBO) Foundry is a community effort that promotes the coordinated reform of existing OBO ontologies, such as the GO, and coordination in the development of new ontologies, to support data integration in the biomedical domain [98]. Over 60 groups have registered with the OBO Foundry [98, 229]. The OBO website55 also acts as a public repository for other existing biomedical ontologies of different domains, such as anatomy, experiments and phenotype.

53 OBI, http://obi.sourceforge.net/, http://obi-ontology.org/page/Main_Page

54 OBI, http://obi.sf.net/community

Figure 2.19: Representation of a biomedical investigation using OBI Core (Source: http://obi-ontology.org/page/Core)
