
The importance of public databases in spreading knowledge and providing raw material for data mining has already become apparent in the previous sections. What often started as a simple collection of data on the research topic of a single group has today become an indispensable tool for biological research. The wealth and diversity of freely available information would have been difficult to conceive only a decade ago. In fact, it is hard to imagine any contemporary research effort that does not make use of some type of database to a greater or lesser extent.

The importance of biological databases is reflected in their popularity and in the rate at which established databases have grown and new ones have emerged in recent years. For example, the 2005 release of the Nucleic Acids Research online Molecular Biology Database Collection (NAR 2011) includes 719 databases, an increase of 171 over the previous year (Galperin 2005). In comparison, the 2010 release of the same collection contains 1230 carefully selected databases covering various aspects of molecular and cell biology, an increase of 5% over the previous year (Cochrane & Galperin 2010).
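The two releases above are reported in different units (an absolute increase for 2005, a percentage for 2010). A minimal sketch of the arithmetic needed to compare them is given below; the function names are illustrative, and the derived 2009 count is only an estimate because the 5% figure is rounded.

```python
def growth_percent(current: int, previous: int) -> float:
    """Year-on-year growth as a percentage of the previous count."""
    return 100.0 * (current - previous) / previous


def previous_from_percent(current: int, percent: float) -> float:
    """Estimate the previous count from the current count and a growth rate."""
    return current / (1.0 + percent / 100.0)


# 2005 release: 719 databases, 171 more than the previous year.
count_2004 = 719 - 171
growth_2005 = growth_percent(719, count_2004)

# 2010 release: 1230 databases, reported as a 5% increase.
estimated_count_2009 = previous_from_percent(1230, 5.0)

print(f"2004 count: {count_2004}, growth into 2005: {growth_2005:.1f}%")
print(f"estimated 2009 count: {estimated_count_2009:.0f}")
```

Expressed in the same unit, the collection grew by roughly a third in the year before the 2005 release, against the 5% reported for 2010.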

Databases can be roughly categorised according to the type of information they store. Table 2.2 provides an overview of the categories of biological databases used by the Nucleic Acids Research online Molecular Biology Database Collection. This categorisation is only a rough guide, as a number of databases store data that can be assigned to more than one of these categories. For the scope of this work, metabolic and signalling pathway databases and microarray data databases are the most relevant and are briefly discussed.

2: Background

Table 2.2 2011 NAR Database Summary Paper Category List

Nucleotide Sequence Databases
RNA sequence databases
Protein sequence databases
Structure Databases
Genomics Databases (non-vertebrate)
Metabolic and Signaling Pathways
Human and other Vertebrate Genomes
Human Genes and Diseases
Microarray Data and other Gene Expression Databases
Proteomics Resources
Other Molecular Biology Databases
Organelle databases
Plant databases
Immunological databases

2.7.1 Metabolic and Signalling Pathway Databases

Pathway databases provide collections of metabolic and regulatory pathways, which can be seen as the wiring diagrams of genes and molecules, together with the genes and proteins involved and the chemical compounds participating in the respective reactions. The KEGG database (Kanehisa et al. 2008), which plays a prominent role in this research, is a characteristic example of a database whose content can be assigned to more than one of the categories in Table 2.2. It is a general genomics database, storing information about individual genes and a number of completed genomes in its GENES section, and at the same time a pathway database, with graphical representations of cellular processes in its PATHWAY section. Importantly, these sections are linked, showing how genomic information relates to higher-order functional information, that is, pathways. Figure 2.12 provides a snapshot of the KEGG database home page.

Figure 2.12 KEGG homepage (KEGG 2011).
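The link between the GENES and PATHWAY sections can be pictured as a table of gene and pathway identifier pairs. The sketch below parses such a table into a lookup structure. It assumes the two-column, tab-separated text returned by the "link" operation of the KEGG REST interface, and the URL pattern shown is likewise an assumption not taken from this text. The identifiers in the sample are made-up placeholders, not real KEGG entries.

```python
from collections import defaultdict
from typing import Dict, Set

# Assumed URL pattern for the KEGG REST "link" operation (not from the text).
KEGG_LINK_URL = "https://rest.kegg.jp/link/pathway/{gene_id}"


def build_link_url(gene_id: str) -> str:
    """Build the URL that would list the pathways linked to one gene."""
    return KEGG_LINK_URL.format(gene_id=gene_id)


def parse_gene_pathway_links(text: str) -> Dict[str, Set[str]]:
    """Parse 'gene<TAB>pathway' lines into a gene -> set of pathways mapping."""
    links: Dict[str, Set[str]] = defaultdict(set)
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        gene, pathway = line.split("\t")
        links[gene].add(pathway)
    return dict(links)


# Placeholder response text; identifiers are hypothetical.
SAMPLE_RESPONSE = (
    "org:GENE_A\tpath:org00001\n"
    "org:GENE_A\tpath:org00002\n"
    "org:GENE_B\tpath:org00001\n"
)

links = parse_gene_pathway_links(SAMPLE_RESPONSE)
for gene, pathways in sorted(links.items()):
    print(gene, sorted(pathways))
```

Keeping the parser separate from any network call means the mapping logic can be checked offline against saved responses.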

MetaCyc (Caspi et al. 2010) is another popular metabolic pathway database, describing more than 1000 pathways. An important characteristic of MetaCyc is that it only covers pathways that have been determined experimentally through wet-lab research. While this approach limits the amount of available data, it lends accuracy and reliability to the information that is available. Notably, MetaCyc provides a graphical user interface with a wide range of options and a number of applications, including a pathway analysis tool.

In fact, MetaCyc is part of a larger database collection named BioCyc (Karp et al. 2005), which uses the Pathway Tools software, with MetaCyc as a reference, to construct predicted metabolic networks. It holds a collection of 653 organism-specific Pathway/Genome Databases (14/07/10), each containing the full genome and the predicted metabolic network of one organism.

Finally, Reactome is another pathway database relevant to this work, dealing with human pathways and processes (Croft et al. 2011). The database is manually curated and peer-reviewed by an expert team of biologists and has gained widespread popularity. The core unit of the Reactome data model is the reaction, hence the name. Reactions are grouped into pathways representing a network of interconnected processes. The data model generalises the concept of a reaction to include, besides the classical biochemical transformations, the transport of a molecule from one compartment to another and the formation of complexes. Hence, pathways in Reactome cover classic metabolism as well as signalling, transcriptional regulation, apoptosis and so on.
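The reaction-centred model described above can be illustrated with a small data structure. This is a simplified sketch for illustration only, not the actual Reactome schema: the class names, fields and the toy glucose example are all assumptions introduced here.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Set


class ReactionKind(Enum):
    """Generalised reaction types mentioned in the text."""
    BIOCHEMICAL = "biochemical transformation"
    TRANSPORT = "transport between compartments"
    COMPLEX_FORMATION = "complex formation"


@dataclass(frozen=True)
class Entity:
    """A molecule located in a specific cellular compartment."""
    name: str
    compartment: str


@dataclass
class Reaction:
    """Core unit of the model: converts input entities into output entities."""
    name: str
    kind: ReactionKind
    inputs: List[Entity]
    outputs: List[Entity]


@dataclass
class Pathway:
    """A named group of reactions."""
    name: str
    reactions: List[Reaction] = field(default_factory=list)

    def entities(self) -> Set[Entity]:
        """All entities taking part in any reaction of the pathway."""
        return {e for r in self.reactions for e in r.inputs + r.outputs}


# Toy example (hypothetical, not copied from Reactome).
glucose_out = Entity("glucose", "extracellular")
glucose_in = Entity("glucose", "cytosol")
g6p = Entity("glucose 6-phosphate", "cytosol")

uptake = Reaction("glucose uptake", ReactionKind.TRANSPORT,
                  [glucose_out], [glucose_in])
phosphorylation = Reaction("glucose phosphorylation", ReactionKind.BIOCHEMICAL,
                           [glucose_in], [g6p])

pathway = Pathway("toy glucose entry", [uptake, phosphorylation])
print(sorted(e.name + " @ " + e.compartment for e in pathway.entities()))
```

Treating transport as a reaction whose input and output differ only in compartment is what lets a single model cover both metabolism and non-metabolic processes.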

All pathways are cross-referenced to proteins, genes and small chemical compounds in relevant databases, to the primary research literature and to GO controlled vocabularies. Besides an intuitive visualisation of these pathways, which allows navigation and zooming in and out of processes, the database provides tools for pathway-based analysis of microarray and other datasets. The user can supply a list of entities, such as genes together with expression data, to identify over-expression in pathways.
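One common statistical approach to pathway-based analysis of a gene list is an over-representation test based on the hypergeometric distribution. The sketch below illustrates that general technique; it is not claimed to be the method Reactome implements, and the gene names, pathway names and function names are hypothetical.

```python
from math import comb
from typing import Dict, Iterable, Set


def hypergeom_tail(k: int, N: int, K: int, n: int) -> float:
    """P(X >= k) when n genes are drawn from N, of which K are in the pathway."""
    total = comb(N, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(K, n) + 1))
    return hits / total


def pathway_enrichment(gene_list: Iterable[str],
                       pathways: Dict[str, Set[str]],
                       background: Set[str]) -> Dict[str, float]:
    """Return a p-value per pathway for over-representation of the gene list."""
    genes = set(gene_list) & background
    results = {}
    for name, members in pathways.items():
        members = members & background  # ignore genes outside the background
        overlap = len(genes & members)
        results[name] = hypergeom_tail(overlap, len(background),
                                       len(members), len(genes))
    return results


# Hypothetical data: 20 background genes and two pathways of 5 genes each.
background = {f"g{i}" for i in range(20)}
pathways = {
    "pathway_A": {f"g{i}" for i in range(0, 5)},
    "pathway_B": {f"g{i}" for i in range(10, 15)},
}
user_genes = ["g0", "g1", "g2", "g10"]

p_values = pathway_enrichment(user_genes, pathways, background)
for name, p in sorted(p_values.items(), key=lambda item: item[1]):
    print(f"{name}: p = {p:.4f}")
```

In practice the p-values would be corrected for multiple testing across all pathways, which this sketch omits for brevity.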

2.7.2 Microarray Data Databases

Since its introduction, microarray technology has become a widely used tool for generating gene expression data. This has been accompanied by a growing expectation that any publication make the analysed dataset available to the wider research community. A number of databases have been created to satisfy this need, with 69 listed in the Nucleic Acids Research online Molecular Biology Database Collection (20/02/11). They include the National Center for Biotechnology Information (NCBI) Gene Expression Omnibus (GEO) database (Barrett et al. 2009) (Figure 2.13) and ArrayExpress (Parkinson et al. 2008), which have emerged as the main public repositories.

Figure 2.13 GEO homepage (GEO 2011).

Today, a variety of journals require authors who use microarray data analysis in their research to submit a complete dataset to a public repository as a condition of publication. As of 2011, GEO stores over half a million distinct microarray samples, that is, results of distinct gene expression experiments for a wide range of organisms from yeast to humans. The growing demand for publicly available microarray data has also created the need for general standards on the format of each dataset and the information accompanying it, to allow subsequent analysis by different researchers. For example, the Microarray Gene Expression Data Society (MGED 2011) advocates open access to genomic datasets and works towards developing standards for data quality, annotation and exchange. MIAME, which stands for Minimum Information About a Microarray Experiment (Brazma et al. 2001), is designed to help authors who submit microarray data ensure that the data meet some minimum requirements, allowing other researchers to interpret the results of the experiment unambiguously and potentially to reproduce it.
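The idea of a minimum-information standard can be sketched as a completeness check on a submission. The six elements listed below follow the commonly cited MIAME checklist; the dictionary keys, the function name and the example submission are hypothetical and do not correspond to the submission format of any particular repository.

```python
from typing import Dict, List

# The six MIAME elements; the keys are illustrative field names.
MIAME_ELEMENTS = {
    "raw_data": "raw data for each hybridisation",
    "processed_data": "final processed (normalised) data",
    "sample_annotation": "sample annotation and experimental variables",
    "experimental_design": "experimental design and sample-data relationships",
    "array_annotation": "annotation of the array",
    "protocols": "laboratory and data processing protocols",
}


def missing_miame_elements(submission: Dict[str, object]) -> List[str]:
    """Return the MIAME elements that are absent or empty in a submission."""
    return [key for key in MIAME_ELEMENTS if not submission.get(key)]


# Hypothetical submission that lacks protocol descriptions.
submission = {
    "raw_data": ["sample1.CEL", "sample2.CEL"],
    "processed_data": "normalised_matrix.txt",
    "sample_annotation": {"sample1": "control", "sample2": "treated"},
    "experimental_design": "two-condition comparison",
    "array_annotation": "platform annotation file",
    "protocols": "",
}

for key in missing_miame_elements(submission):
    print(f"missing: {MIAME_ELEMENTS[key]}")
```

A check of this kind only confirms that each element is present, not that its content is sufficient to interpret or reproduce the experiment.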
