
Chapter 2 State of the Art

2.4 Sharing omic data

2.4.3 Reporting standards in Transcriptomics

DNA microarray analysis is a widely used high-throughput technique that measures the locations and levels of gene expression on a genome-wide rather than single-gene scale by generating gene expression profiles [145, 132, 131, 133]. Microarray technology was the first high-throughput technology that could investigate different cell types and different biological conditions of different individuals in a single experiment [152].

Microarray experiments are characterised by the huge volumes of data that they produce [144]. Microarray analysis data is complex and is only meaningful if it is viewed in context, i.e. with regard to the state of the organism from which the biomaterial was obtained, the experimental conditions used to generate the data and the data analysis technique used; it therefore requires a large amount of accompanying metadata [18, 152]. In addition, to facilitate comparison, the raw data of microarray gene expression experiments needs to be provided, so that variation introduced by different data normalisation techniques (not to be confused with database normalisation) can be ruled out [144, 146, 137].

³⁵ BBSRC, http://www.bbsrc.ac.uk/

³⁶ BBSRC data sharing policy, http://www.bbsrc.ac.uk/organisation/policies/position/policy/

However, before the development of reporting standards, gene expression data from microarray analysis was largely inaccessible to the research community. It was scattered over the internet on the websites of authors of publications on gene expression levels [133], in different, non-standardised formats, where it might be lost in future or might not be accessible at all [18, 144, 137]. In addition, annotation levels of microarray data were generally insufficient to allow reuse of the data. Many publications on microarray experiments did not provide supporting data and contained insufficient information about experimental procedures [144], and other important information, such as evidence on the quality, reliability and possible error levels of the generated data, may have been missing [18]. Furthermore, the existence of many different formats, units, normalisations, microarray platforms and experimental designs made the integration and comparison of microarray data difficult and error-prone [158, 159, 151]. It also limited the development of scalable, automated high-throughput approaches for data mining and analysis, severely limiting the speed at which biomedical discoveries could be made from microarray analysis [18]. This problem did not exist for genome sequence data, for which many different formats existed in databases and analysis tools [18]. While the Human Genome Project and other DNA sequencing efforts also produced large amounts of data, genome sequence data is much simpler and more robust than microarray gene expression data, since an organism has only one genome but many different transcriptomes, depending on different physiological conditions [152, 18].

MGED and the MIAME standard

The Microarray Gene Expression Data (MGED)³⁷ Society originated in 1999 as a grassroots movement by the microarray community (users, developers, vendors) [159, 144, 151], concerned with the establishment of standards to support the sharing of microarray gene expression data, including proteomics array data [151].

MGED created the MIAME standard, the earliest reporting standard developed in the omic domain, the first version of which was published in 2001 [18, 160]. MIAME is now on version 2.0³⁸.

MIAME was designed with two principles in mind [18]. Firstly, it should specify the minimum information needed to ensure that the results of a microarray experiment can be interpreted, compared with the results of other microarray experiments and independently verified, e.g. by replicating the experiment [18]. Secondly, MIAME was to be structured so that it would be amenable to automated data analysis, mining and other useful querying [18].

A microarray experiment consists of hybridisations between samples and arrays. The resulting hybridised arrays are scanned, producing an image. This image is analysed further, resulting in a set of measurements for each array element. A normalisation is carried out on the measurements and the data sets of replicate hybridisations are combined.
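The last two steps of this workflow, normalisation and the combination of replicates, can be sketched in a few lines of Python. This is a minimal illustration only: the use of median-centred log2 ratios as the normalisation, the averaging of replicates, the function names and all intensity values are assumptions made for the example and are not prescribed by MIAME or taken from [18].

```python
import math
from statistics import mean, median


def log_ratios(red, green):
    """Log2 ratio per array element for one two-channel hybridisation."""
    return [math.log2(r / g) for r, g in zip(red, green)]


def median_centre(values):
    """Shift the values so that their median is zero (one simple normalisation)."""
    m = median(values)
    return [v - m for v in values]


def combine_replicates(replicates):
    """Average the normalised value of each array element across replicates."""
    return [mean(element) for element in zip(*replicates)]


# Hypothetical raw intensities for three array elements on two replicate arrays.
# The second array is uniformly twice as bright in the red channel.
rep1 = log_ratios(red=[800, 200, 400], green=[200, 200, 400])
rep2 = log_ratios(red=[1600, 400, 800], green=[200, 200, 400])

normalised = [median_centre(rep1), median_centre(rep2)]
combined = combine_replicates(normalised)
print(combined)
```

In this sketch the two replicates differ in their raw log ratios but agree after median-centring, which is why the normalisation that was applied has to be reported alongside the measurements.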

Figure 2.12 shows a schematic representation of the six components of a microarray experiment.

Figure 2.12: Schematic representation of the six components of a microarray experiment (Source: [18]).

These six components of a microarray experiment are used as the basis for the MIAME standard, so that MIAME is divided into the following six parts (from [18]):

1. Experimental design: the set of hybridisation experiments as a whole

2. Array design: each array used and each element (spot, feature) on the array

3. Samples: samples used, extract preparation and labelling

4. Hybridizations: procedures and parameters

5. Measurements: images, quantification and specifications (the raw data)

6. Normalization controls: types, values and specifications
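The six parts listed above can be read as a simple checklist. The following Python sketch shows one possible way to test a record for completeness; the key names, the record layout and all values are hypothetical and are not part of MIAME, which prescribes content rather than format.

```python
# The six MIAME parts; the key names are hypothetical labels chosen for this sketch.
MIAME_SECTIONS = (
    "experimental_design",
    "array_design",
    "samples",
    "hybridisations",
    "measurements",
    "normalisation_controls",
)


def missing_sections(record):
    """Return the MIAME parts that are absent or empty in a record."""
    return [section for section in MIAME_SECTIONS if not record.get(section)]


# Invented example record with one empty and one absent part.
record = {
    "experimental_design": "time course, three replicates per time point",
    "array_design": "hypothetical 10k-element cDNA array",
    "samples": "liver tissue, total RNA, two-colour labelling",
    "hybridisations": "",
    "measurements": "raw image quantification files",
}

print(missing_sections(record))
```

A check of this kind can only detect that a part is missing; it cannot judge whether the content supplied for a part is accurate or sufficient.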

MIAME only specifies the content that needs to be recorded, not the format in which it should be recorded, i.e. it does not specify how it should be implemented technically and does not impose a requirement to use a particular platform, software or method of data analysis [18]. However, together with the Life Sciences Research Task Force of the Object Management Group (OMG)³⁹, MGED also developed a data model that forms the basis of a standard data-exchange format called MAGE-ML. MAGE-ML implements the MIAME requirements to allow communication of MIAME data between laboratories, archives and analysis packages, and supporting software has also been developed by participating organisations [18].

In addition, the MGED Ontology Working Group have produced an ontology (the MGED ontology, MO) and rules for the unambiguous annotation of MIAME-compliant microarray gene expression data, to be used in conjunction with MAGE-ML [161, 162, 151, 18]. The MO covers all areas of microarray experiments, including experimental design, array layout, sample preparation, hybridisation protocols and data analysis [161, 162]. MGED also created a short set of guidelines and a checklist that authors, reviewers and journal editors could use to achieve and check for MIAME compliance [163, 136, 160, 164].

Three extensions to MIAME have been developed so far [155]:

• MIAME/Tox [165] was created by the MGED Toxicogenomics working group, to specifically capture toxicology data in an experiment (e.g. histopathology, clinical chemistry)

• MIAME/Env [166] has been developed to capture the data of environmentally relevant organisms in functional genomics investigations

• MIAME/Nut [167] has been created to capture nutritional components of functional genomics investigations

MIAME data formats

To facilitate the standardised storage and exchange of MIAME-compliant microarray gene expression data, MGED have developed several technical artefacts: MAGE-OM (microarray gene expression object model), a UML-based object model representing MIAME concepts and their relationships; MAGE-ML, an XML-based data exchange format derived from the MAGE-OM to support the exchange of MIAME-compliant gene expression data, which has subsequently been implemented in microarray data capturing software; and MAGE-STK, a Perl, Java and C++ based software toolkit that provides APIs to facilitate the creation of MAGE-ML documents [19].

While MAGE was adopted reasonably successfully in microarray repositories at first, the MAGE-OM was considered too complex and MAGE-ML too difficult to use in laboratories without bioinformatics support [152, 168], and too cumbersome because of the increased size of XML-based data files [151, 168]. MAGE-ML has also been criticised for being too flexible, allowing the same information to be described in different, inconsistent ways [128, 135, 151]. Also, MAGE does not in itself guarantee compliance with MIAME, as valid MAGE documents can be created that contain less or more than the information required by MIAME, depending on the individual’s interpretation of MIAME [135]. To solve these problems, MGED later developed MAGE-TAB (MicroArray and Gene Expression TABular), a simple spreadsheet-based, tab-delimited format for reporting MIAME-compliant data [168]. The MAGE-TAB format became more popular, as it could be edited directly without special tools and because software support existed for data import, e.g. into the Bioconductor analysis tool [152]. However, a 2009 review of MIAME showed that MAGE-TAB adoption was low in comparison with adoption of the MIAME checklist [152]. MGED later revised MAGE to eliminate existing ambiguities [151] and in 2006, MAGE-ML v2⁴⁰ was published, based on FuGE (see Section on Standard harmonisation). However, not a single implementation of MAGE v2 is known, and MAGE-TAB is the universally used format, having completely replaced the more complex MAGE-ML (v1)⁴¹.
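Because MAGE-TAB is plain tab-delimited text, it can be read with general-purpose tools. The following Python sketch reads a small table in the style of the sample and data relationship (SDRF) component of MAGE-TAB using only the standard library. The table is heavily abridged, and the sample names, file names and the choice of columns shown are invented for illustration; a real submission contains further columns and files.

```python
import csv
import io

# Hypothetical, heavily abridged SDRF-style table; all values are invented.
sdrf_text = (
    "Source Name\tCharacteristics[Organism]\tHybridization Name\tArray Data File\n"
    "liver_1\tMus musculus\thyb_1\thyb_1_raw.txt\n"
    "liver_2\tMus musculus\thyb_2\thyb_2_raw.txt\n"
)

# Each row becomes a dictionary keyed by the column headings.
reader = csv.DictReader(io.StringIO(sdrf_text), delimiter="\t")
rows = list(reader)

# Map each hybridisation to the raw data file that it produced.
raw_files = {row["Hybridization Name"]: row["Array Data File"] for row in rows}
print(raw_files)
```

The same text could equally be opened and edited in a spreadsheet program, which is the property that made the tabular format easier to use in laboratories without bioinformatics support.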

MIAME adoption in journals and databases

Adoption of the MIAME guidelines for journal publications received a boost in 2002, when the MGED Society wrote an open letter to major journals, such as Nature and the Nature research journals, Lancet, Science and Bioinformatics, asking them to require, as a prerequisite for publication, that authors of microarray papers provide MIAME-compliant metadata either as a database submission, e.g. to ArrayExpress or GEO, or as supplementary data. The journals agreed to follow this recommendation [163, 136, 160, 164]. Major scientific journals now require the submission of MIAME-compliant data to support publications of microarray experiments [151]. Moreover, several funding agencies (e.g. the UK’s Natural Environment Research Council (NERC)) require adherence to MIAME guidelines for all microarray data produced under funded programmes [151].

⁴⁰ MAGE-ML v2, http://www.mged.org/Workgroups/MAGE/magev2.html

⁴¹ Personal communication with Ugis Sarkans from FGED

MIAME has been implemented in major microarray databases, e.g. ArrayExpress, GEO and the Center for Information Biology gene EXpression (CIBEX) database⁴² [169]. ArrayExpress supports submissions in MAGE-ML format and via MIAMExpress, an online submission tool aimed mainly at users who cannot provide submissions in MAGE-ML format [144, 170]. The ArrayExpress relational database schema is modelled on the MAGE-OM [171, 172, 173]. ArrayExpress also supports MAGE-TAB submissions [152].

MIAME is a set of guidelines only, and thus the level of detail required to satisfy MIAME is subjective and may be interpreted differently. For example, even if all mandatory fields are provided, there is no way to check automatically whether the data is accurate and sufficient, and thus it is not possible to enforce the guidelines automatically and validate submissions for compliance [174, 175]. Both ArrayExpress and GEO provide manual validation of MIAME compliance [175, 174, 156]. The Stanford Microarray Database provides software that allows MAGE-ML-formatted documents from companies producing microarray chips to be uploaded and that facilitates the submission of MIAME-compliant, MAGE-ML-formatted data to the GEO and ArrayExpress databases [176].

Other transcriptomic standards

With decreasing costs, next-generation sequencing (NGS) techniques have become a viable alternative to traditional microarray applications, such as gene expression profiling and gene copy number and epigenetic variation analyses [152]. The MGED Society has responded with the development of new guidelines for Minimum Information about a high-throughput SeQuencing Experiment (MINSEQE)⁴³, which are already being supported by GEO and ArrayExpress [152]. The Minimum information specification for in situ hybridisation and immunohistochemistry experiments (MISFISHIE) provides guidelines for the reporting of experiments that investigate the localisation of gene expression in tissues through visual interpretation, e.g. IHC, ISH, lectin affinity histochemistry and experiments using reporter gene constructs, such as green fluorescent protein (GFP) and β-galactosidase [149]. MISFISHIE was inspired by MIAME, and can be used in conjunction with other MI specifications, such as MIAPE and MIAME, to support integrative functional genomics [149]. Figure 2.13 shows a schematic representation of the six parts of MISFISHIE. The authors of MISFISHIE also developed a MISFISHIE-compliant XML-based data exchange format and a document type definition (DTD) for several gene expression repositories [149].

⁴² CIBEX, http://cibex.nig.ac.jp/

Figure 2.13: Schematic representation of the six components of MISFISHIE (Source: http://www.emouseatlas.org/emage/info/misfishie.html)