On the effects of large-scale transcriptomics datasets on gene functional analyses

(1)

On the Effects of Large-scale

Transcriptomics Datasets on Gene

Functional Analyses

A Thesis

Submitted For the Degree Of

DOCTOR OF PHILOSOPHY

By

Prajwal Krishna Bhat

September 2011

(2)

Declaration of Authorship

I, Prajwal K. Bhat, hereby declare that this thesis and the work presented in it is entirely my own. Where I have consulted the work of others, this is always clearly stated. Signed: Date: 04th_Feb, 2012

(3)

Acknowledgements

Guru brahma, guru vishnu, gurudevo maheshawaraha Guru saakshaath parabrahma, thasmai shrigurave namaha ____ The teacher is the creator, he is the preserver, he is the destroyer He is the source of the Absolute, I offer all my efforts to him It is indeed a rare honour to be writing this section. I have benefitted from so many individuals who have unconditionally and wholeheartedly extended their support. I will never be able to adequately express my gratitude to them.

Firstly, I would like to thank my supervisors Dr. Alessandra Devoto, Prof. Dr. Laszlo Bogre and Prof. Dr. Peter Bramley for giving me the invaluable opportunity to pursue a PhD. I would like to thank Dr. Alessandra Devoto for her immense patience and unconditional support throughout the course of this project. Her words of encouragement especially made a world of difference to me. I would like to thank my advisor Dr. Alberto Paccanaro for being a mentor for which I will forever be in his debt. His energy and acumen will always be a benchmark for me. The lessons he taught have changed my science and life forever.

I would like to thank Dr. Haixuan Yang and Dr. Tamas Nepusz for their technical inputs and discussions which have enriched my work. They have been truly inspiring and I consider myself fortunate to have worked with them. I would like to specially thank Dr. Yang for his invaluable help with MATLAB throughout the project, and especially while implementing the experiment selection algorithm. I would also like to extend my thanks to Dr.

(4)

Alfonso Romero, Dr. Haixuan Yang and all the past and current members of the lab for making the lab an electrifying place to do science.

On a personal note, I would like to thank my parents for their invaluable support. I am grateful to them for stoking my scientific temper throughout my life. The long discussions with my father on science, technology and commerce were perhaps the seeds for this journey. Finally, I would like to thank my wife Akhila for her unconditional support and companionship without which this thesis would have been infinitely harder to write. I dedicate this thesis to my family.

(5)

Abstract

The Guilt‐by‐Association (GBA) principle, according to which genes with similar expression profiles are functionally associated, is widely applied for functional analyses using large heterogeneous collections of transcriptomics data. In this thesis we show that using such large collections could hamper GBA functional analysis for genes whose expression is condition specific. In these cases a smaller set of condition related experiments should instead be used, but identifying such functionally relevant experiments from large collections based on literature knowledge alone is an impractical task.

The study begins by discussing the basic principles underlying the definition of gene function and the use of large microarray collections for GBA based gene function analyses. We look at the effects of condition specific gene expression on GBA analyses and provide a mathematical and biological perspective. We show that using large microarray collections to calculate correlation can mask the effectiveness of the GBA principle. We suggest that using only those experiments that are relevant to the biological function under analysis can significantly improve GBA based gene functional analyses.

We then present a semi‐supervised algorithm that can select functionally relevant experiments from large collections of transcriptomics experiments. The algorithm is able to select experiments relevant to a given GO term, MIPS FunCat term or even KEGG pathways. We extensively test our algorithm on large dataset collections for Yeast and Arabidopsis. We demonstrate that: (i) using the selected experiments there is a statistically significant improvement both in correlation between genes in the functional

(6)

category of interest and in GBA based function predictions; (ii) the effectiveness of the selected experiments increases with annotation specificity; (iii) our algorithm can be successfully applied to GBA based pathway reconstruction.

We conclude by discussing the potential applications of our technique. We outline several developments that could be implemented in the future to improve the efficiency of the experiment selection procedure.

(7)

Keywords

Guilt by Association, Condition specific gene expression, Experiment selection, Transcriptomics, Co‐expression, Gene Ontology, Gene functional analyses

(8)

Abbreviations

GBA Guilt By Association GO Gene Ontology MIPS The Munich Information Center for Protein Sequences PPI Protein‐Protein Interaction TAF Transcription Associated Factors SGD Saccharomyces Genome Database SNOMED Systematized Nomenclature of Medicine MIPSFunCat MIPS Functional Catalogue EC Nomenclature Enzyme Commission Nomenclature OBO Open Biological Ontologies DAG Directed Acyclic Graph BP Biological Process MF Molecular Function MJ Methyl Jasmonate CC Cellular Component TAIR The Arabidopsis Information Resource NCBI National Center for Biotechnology Information GEO Gene Expression Omnibus EBI European Bioinformatics Institute JA Jasmonic Acid SOM Self‐organizing maps PCC Pearson Correlation Coefficient M3D Many Microbes Database MAS MicroArray Suite ROC Receiver Operating Characteristic LOO Leave One Out TPR True Positive Rate FPR False Positive Rate AUC Area Under the Curve

(9)

List of Figures

1. Distribution of correlation coefficients in a functional category 4 2. Structure of the Gene Ontology 19 3. Euclidean distance in 2D space 27 4. Comparing Euclidean distance and Pearson correlation on gene expression profiles 29 5. Schematic representation of Microarray dataset and microarray compendia 35 6. Distribution of correlation coefficients in a functional category 40 7. Cross‐talk can lead to poor correlation 42 8. Effect of technical noise on correlation 43 9. Poor correlation dilutes overall correlation 46 10. Comparison of distribution of correlation coefficients in a functional category 48 11. Heatmap of correlation matrix for GO category pairs 52 12. Comparison of biclustering strategies 59 13. Scheme of correlation matrix for functional category of interest and background 63 14. Composition of the background set in GO 65 15. Psuedocode of the experiment selection algorithm 67 16. Comparison of correlation distribution 73 17. Example of a ROC curve 79 18. Evaluation of performance of selected set of experiments 81 19. Performance improves with annotation specificity 83 20. Selected experiments improve KEGG classification 85 21. Choosing the right feature in a classification problem 89 22. QT‐clustering of MKK7 gene expression profiles 117

(10)

List of Tables

1. Major functional categories in MIPS FunCat 16 2. Number of microarrays available for model organism in GEO 33 3. Co‐expression tools available for gene expression analysis 37 4. Comparison of average correlations and t‐test p‐ values in GO category pairs 53 5. Improvement in the number of correlations in GO categories 54 6. List of microarray experiments in Arabidopsis microarray collection 70 7. Confusion matrix 77 8. GO categories and corresponding set of selected experiments 88 9. Selected overrepresented GO terms characterizing the major down‐ and up regulated clusters generated by the QT algorithm 118

(11)

Publication Based On This Thesis

Prajwal Bhat, Haixuan Yang, Laszlo Bogre, Alessandra Devoto and Alberto Paccanaro (2011) “Computational selection of Transcriptomics experiments improves Guilt‐by‐Association analyses”, Under review.

(12)

Acknowledgements

II

Abstract

IV

Keywords

VI

Abbreviations

VII

1. Introduction

1 1.1. Context of the thesis 1 1.2. Problem statement 2 1.3. Objectives of the study 6 1.4. Overview of the thesis 7

2. Understanding Gene Function

9 2.1. What is Gene Function? 10 2.2. Organizing Functional Information: Ontologies 13 2.2.1. MIPS FunCat 15 2.2.2. Gene Ontology project 18

3. Functional Analyses Using Gene Expression

Data

21 3.1. Guilt‐by‐Association 22 3.2. Similarity Metrics in GBA Analyses 25 3.2.1. Euclidean Distance 26 3.2.2. Pearson Correlation Coefficient 29

4. Limitations of Compendium Based Correlation

Analyses

32 4.1. Emergence of Microarray Repositories 32 4.2. Correlation Analyses Tools for Microarray Compendia 33 4.3. Poor Correlation Between Functionally Related Genes 39 4.4. Causes of Poor Correlation 41 4.5. Poor Correlation Dilutes Overall Correlation 46

(13)

5. A Method For Selecting Relevant Experiments

57 5.1. State of the Art 57 5.1.1. Bi‐clustering 58 5.2. Motivation for a Novel Experiment Selection Technique 60 5.3. A Novel Algorithm for the Selection of Functionally Relevant Experiments 62 5.3.1. Defining Functional Relevance 62 5.3.2. The Experiment Selection Algorithm 65 5.3.3. Materials 69 5.4. Selected Experiments Improve Correlation 72 5.5. Quantifying the Effectiveness of the Selected Experiments for Function Prediction 74 5.5.1. Evaluation Strategies for Measuring Classifier Performance 75 5.5.2. Measuring Performance using ROC curves 77 5.5.3. Results from the evaluation and measurement of Classifier Performance 80 5.6. Effectiveness of the Selected Experiments Increases With Annotation Specificity 82 5.7. Implications on Pathway Reconstruction 84 5.8. Selected Experiments Generally Reflect Literature 86 5.9. Discussion 91

6. Conclusions and Outlook

95 6.1. Conclusions 95 6.2. Future Perspectives 99 6.2.1. Improvements to the Experiment Selection Algorithm 99 6.2.2. Selecting RNA‐Seq Datasets for GBA Analyses 99 6.2.3. A Novel Functional Analyses Approach by Mapping Gene Expression 100 6.2.4. Applications in Modelling Regulatory Pathways 101

Appendix I

Microarray Data Analysis

102

Appendix II

MATLAB Implementation of the

Selection Algorithm

119

Bibliography

131

(14)

1

Introduction

1.1 Context of the thesis

Proteins are widely recognised as the building blocks and functional components of the cell. Proteins are omnipresent; structural proteins form tissues of organisms, enzymes and other proteins regulate biochemical reactions and trans‐membrane proteins act as transporters that maintain the cellular environment. Therefore, the knowledge of functions of proteins and the properties of genes which transcribe them are essential for understanding the mechanisms of biology. Gaining an insight into gene or protein function is essential for the development of new drugs, improving crop yield and development of essential biochemicals such as insulin and enzymes.

The early efforts to elucidate gene or protein function were mostly experimental and generally involved focussing on a small set of genes or a protein complex. These experiments were low‐throughput in nature due to the enormous experimental resources, time and human effort required. However, the arrival of high‐throughput experimental techniques such as rapid DNA Sequencing has changed the outlook of modern biology. In the

(15)

subsequent phase now known as the post‐genomic era, numerous high‐ throughput techniques have been developed, each offering a new perspective of the mechanisms by which a gene or protein executes its function. These high‐throughput techniques provide us with an unprecedented opportunity to understand complex biological systems by providing insights into myriad facets of a gene’s function and its dynamic interaction with other molecular components. These insights enable us to build increasingly complex models of regulatory interactions, helping us understand gene function.

1.2 Problem statement

Among the various high‐throughput data available, microarrays for transcription profiling are currently the most abundant due to their accessibility, interpretability and decreasing experimental costs (Yeung, Medvedovic, & Bumgarner, 2004; Zien, Fluck, Zimmer, & Lengauer, 2003). The large quantities of data generated by microarray experiments have made manual analysis of the data impractical and a large number of computational tools and protocols have been developed to extract functional information from microarrays. Generally, most of the techniques developed to extract functional information from microarrays are based on the principle of Guilt‐ by‐Association (GBA). The GBA principle suggests that genes that have a similar expression profile are suggested to share similar functions. This is largely based on the fact that genes which encode proteins that participate in a pathway are found to be co‐regulated. GBA driven microarray analysis approaches include various clustering techniques such as hierarchical clustering, k‐means clustering, biclustering and various co‐expression based network analyses techniques. Generally, these techniques aim to group genes

(16)

based on their expression profiles. The ways by which the genes relate to each other or the membership in a group of genes determine the function. GBA‐based microarray analyses approaches calculate similarity between gene expression profiles using a similarity metric such as Pearson correlation. Often, this has been done over a large heterogeneous collection of datasets (Manfield et al., 2006; Obayashi, Hayashi, Saeki, Ohta, & Kinoshita, 2009a; Zimmermann, Hirsch‐Hoffmann, Hennig, & Gruissem, 2004). One reason behind this approach is that a large number of data points would result in a more robust correlation by combining weak expression signatures over many datasets. Formally, the significance of the correlation between the vectors is likely to increase with the size of the vector.

Generally, co‐expression studies over large heterogeneous collections of datasets were aimed to reveal a global transcriptional response (Wu et al. 2002). Studies such as Zien et al. (2003) and Yeung et. al (2004) have shown that the size of the dataset has an upper limit of approx. 80 data points above which there is no discernable improvement in the quality of the analysis. We hypothesized that working with such large datasets may not always be beneficial to the functional analyses at hand.

For the principle of GBA to be effective for elucidating gene function or to perform any other functional analyses, genes which belong to the same functional category would be expected to have similar expression profiles. Consequently, the genes belonging to the same functional category are expected to be highly correlated. This property remains the cornerstone of GBA‐based microarray data analyses. To verify this property, we looked at the distribution of correlation coefficients for genes which belong to the same

(17)

functional category1_{such as genes annotated to the Gene Ontology}

Biological Process category “Response to Jasmonic Acid Stimulus”. We prepared a microarray data collection containing 756 arrays from 44 individual experiments based on wild‐type Arabidopsis thaliana. The data was sourced from NASCarrays (Craigon et al., 2004) and represented various experimental backgrounds such as stresses and developmental stages. Only Affymetrix ATH1 GeneChip data with MAS5.0 normalization was used. The processed data was normalized to a mean of zero and a standard deviation of 1. Further details regarding the experiments and the normalization pipeline can be found in Chapter 5 (Section 5.3.3) and Appendix I.

Figure 1: Distribution of correlation coefficients among genes in the GO category “Response to Jasmonic Acid Stimulus” obtained when a large collection of 44 microarray experiments (756 microarrays) were used to calculate the correlation. This result was found to be typical for most GO Biological Process categories.

1_{Only experimentally annotated genes were considered to ensure the reliability of the annotation.}

(18)

Surprisingly, although the genes belonged to the same functional category we observed very poor correlation between the genes, with 81.4% of the correlation closer to zero (Fig.1). Such a distribution of correlation coefficients would limit the effectiveness of the GBA principle for functional analyses. If the correlation among genes in the same functional category is near zero, then putative genes that may belong to that functional category cannot be inferred based on correlation among genes already annotated to that category. We obtained similar results for most of the GO Biological Process categories and this was also replicated across functional classification systems such as MIPS and in other organisms such as Yeast. We wanted to understand the causes of such poor correlation and investigate ways of limiting its effects on GBA‐based analyses. We hypothesized that there could be two causes for observing such poor correlation. The first source of poor correlation could be an experimental artefact due to the noise inherent in the measurement. The second source of poor correlation could be biological phenomena such as cross‐talk in the biological pathway of interest. We examine this in detail in Chapter 6. Gene function and gene expression are acknowledged to be condition‐specific. Therefore, genes are also expected to be correlated based on the experimental conditions. In experiments where the pathways of interest are not activated, the genes in the pathway could show poor correlation. In cases where a large heterogeneous collection of microarrays is used for functional analyses, the poor correlations from functionally unrelated experiments could significantly dilute the high correlations observed in experiments where the pathways are sufficiently activated.

To limit the dilution of correlation, it is important that the sources of poor correlation are minimized. This would mean that in a functional analysis, the

(19)

experiments where the genes of interest are poorly correlated are eliminated. However, identifying microarray experiments relevant to a functional category of interest is a non‐trivial task as literature knowledge relevant to a biological process is seldom exhaustive. Further, the relevance of an experiment to a biological process may not be obvious and experiments that are deemed irrelevant by a researcher could in fact withhold significant information regarding the biological process of interest.

1.3 Objectives of the study

From the problem statement presented earlier in Section 1.2, it is clear that the using large collections of microarrays in GBA‐based analyses can result in poor correlation among genes that are deemed functionally related. This poor correlation among functionally related genes could undermine the effectiveness of GBA‐based functional analyses because it would be difficult to determine the function of a gene based on its correlation with genes with known functions. In this thesis, we aim to present a comprehensive investigation of the problem of poor correlation among functionally related genes and also present a methodology for improving the correlation. The objectives of this study are summarized as follows:

1. We want to investigate the limitations of performing GBA‐based functional analyses using large collections of microarrays. Specifically, we want to discuss some of the causes of poor correlation among genes involved in the same biological process.

2. We want to show that correlation among genes involved in the same process can be improved by using only functionally relevant sets of experiments. This is in contrast to the traditional approach where large numbers of experiments are used.

(20)

3. We want to show that selecting such functionally relevant experiments is a non‐trivial task and we underline the need for an automated technique for identifying experiments that are relevant to a GBA‐based functional analysis.

4. To address this need, we want to develop a method that is able to select from a large collection of experiments, a set of functionally relevant experiments.

5. We want to show that experiments selected using such a method would be able to improve GBA‐based functional analyses. We would like to illustrate this by showing that,

a. The selected experiments improve correlation among genes in the same functional category.

b. The correlation resulting from the selected experiments is a better feature for classifying genes into a functional category. c. The effectiveness of the feature varies with the specificity of

annotation.

d. The selected experiments improve transcriptomics‐based pathway reconstruction.

1.4 Overview of the thesis

Chapter 2: We discuss the concept of gene function and its interpretation in

the context of post‐genomic high‐throughput biology. We briefly look at a few of the high‐throughput experimental data types available for extracting functional information. We then briefly discuss the need for machine‐ friendly ontologies for organizing functional information. We outline the salient features of two widely used functional classification systems: Gene Ontology and MIPS FunCat.

(21)

Chapter 3: We investigate the principles employed for extracting functional

information from microarrays. Primarily, we discuss the principle of Guilt‐ by‐Association and its application in microarray‐based functional analyses. We then highlight the importance of similarity metrics such as Pearson correlation, Mutual Information and Euclidean distance in GBA‐based functional analyses. Lastly, we discuss the major functional analyses techniques that are based on the principle of GBA such as clustering, network‐based approaches and basic co‐expression based analytical tools.

Chapter 4: In this chapter, we investigate the reasons for observing poor

correlation among genes which belong to the same functional category. We outline potential sources of noise in microarray data and suggest that these could have a far‐reaching effect on any GBA‐based functional analyses. We illustrate how in GBA analyses based on large heterogeneous datasets, poor correlation between genes in the same functional category can be limited by using only those experiments which are found to be relevant to the functional category of interest. We suggest that the identification of relevant datasets based on literature knowledge alone may not be efficient and propose the development of a computationally‐driven method for the identification of functionally relevant experiments.

Chapter 5: In this chapter, we illustrate our experiment selection algorithm

and prove its effectiveness for GBA analyses. We demonstrate that the algorithm is able to select experiments for a group of genes independent of their functional classification. We also show that the selection performance is replicable across various organisms. We demonstrate that the selected experiments lead to substantially improved correlation between genes in a functional category compared to using a large compendium of data. As a consequence, we show that using correlation obtained from the selected set

(22)

of experiments leads to substantial improvements in GBA‐based functional prediction.

Chapter 6: We outline the conclusions that can be drawn from the study and

look at the future prospects and possible improvements of the experiment selection technique.

(23)

2

Understanding gene

function

The concept of the “gene” as a discrete unit of heredity was first put forward by Gregor Mendel in 1866. The word “gene”, the etymology of which can be traced to the Greek word “genos” (origin), was first used by Wilhem Johansson in the 1900s. The early concept of the gene considered it to be an abstract entity which was responsible for transmitting one or many phenotypes over generations. Subsequently a gene was seen as a physical molecule which is a blueprint for a protein. However, with the ever‐ increasing complexity of the insights into molecular biology, there is a need for a comprehensive framework to define a gene. Recently, Gerstein et al. (2007) defined a gene as “a union of genomic sequences encoding a coherent set of potentially overlapping functional products”. Although the definition of the gene has continuously evolved, the intrinsic idea that a gene is responsible for a phenotype has remained constant. At the molecular level, this could mean that the DNA sequence of the gene determines the sequence and therefore the structure of the functional molecules that implement a specific phenotype. Regardless of the perspective, the identity of a gene is coupled to the corresponding functional products or phenotype.

(24)

Understanding the function of all known gene products has become the primary goal of modern molecular biology. It is widely recognized that elucidating the functions of various genes and gene products and the complex interactions between them is the key to understanding a biological system. The approaches for elucidating gene function have evolved dramatically over the past decades. Traditional approaches for elucidating gene function focussed on individual genes and were largely a single gene approach. These approaches were highly resource intensive and hence not very scalable. However, in the post‐genomic era, there is a deluge of functional data from high‐throughput experimentation techniques such as gene expression microarrays, protein‐protein interaction experiments and genome‐wide phenotype screens. The high‐throughput data has necessitated novel ideas for functional analyses often adapted from computer sciences and statistics.

In this chapter, we look at gene function in the context of the changing experimental paradigms. We look at two state‐of‐the‐art frameworks for organizing functional information, MIPS FunCat and Gene Ontology, and discuss their properties. Finally, we briefly discuss some of the few high‐ throughput data types available that can be exploited for linking a gene to its function.

2.1 What is Gene Function?

Traditionally, gene function elucidation techniques focussed on a single or a very small set of genes at a time. The notion of gene function was very conservative with every gene or protein having specific biological and

(25)

molecular identities. For example, the molecular function of an enzyme would be its specific role in the catalysis of the reaction. Its biological function would be the pathway in which it participated. Therefore, the characterization of a gene or a protein would not be complete without elucidating the biological and molecular aspects. However, with the rapidly accumulating high‐throughput experimental data, there is an unprecedented opportunity to automate gene function elucidation.

For instance, bioinformatic techniques have been successfully applied to annotate and characterize genes (Eddy, 1998; Mulder, 2003). For a gene whose functions cannot be predicted using sequence alone, a wide array of high‐throughput data such as protein‐protein interaction (PPI) data and co‐ expression data are available that can be exploited for functional analyses (Huynen, Snel, von Mering, & Bork, 2003; Vazquez, Flammini, Maritan, & Vespignani, 2003). The amount of PPI data available for the various model organisms is growing exponentially due to advances in techniques such yeast‐2‐hybrid (Y2H), Tandem Affinity Propagation (TAP) and Mass Spec Protein Complex Identification (HMS‐PCI) (Gavin et al., 2002; Ito et al., 2001; H. Wang et al., 2007). However, data produced by these techniques suffer from high levels of false positives and false negatives. Typically, the false positive rates for Y2H are as high as 64% and TAP the false positive rate is as high as 77%. Similarly, the false negative rates could be as high as 71% and 50% respectively (Edwards et al., 2002).

Importantly, genes or proteins are no longer viewed in isolation but are recognized to be part of a complex network of interactions. A single gene can take part in multiple biological processes by having multiple collaborators and this has been recognized as the fundamental mechanism by which single

(26)

genes control multiple traits. Thus the function of a gene is being defined by its interaction partners in a large network of interactions. Unlike the traditional view of gene function this new notion of gene function is fuzzy and encompasses a wide range of phenomena such as physical interaction, regulator‐target interaction and co‐expression. This has been termed the “probabilistic view” of gene function (Fraser & Marcotte, 2004; I. Lee, 2011) as opposed to the more deterministic traditional approach.

Although a rigorous definition of gene function has yet to emerge, it is appreciated that a general framework is necessary which is sufficiently robust to contain all the features of gene function (Fraser & Marcotte, 2004). One such feature is the promiscuous nature of gene function. Gene function is highly context sensitive as genes are known to be involved in multiple roles depending upon the biological process to be executed. For example, Transcription Associated Factors (TAF) have a role in DNA repair and transcriptional initiation; RAS protein regulates both mitogenesis and cytoskeletal rearrangement. Additionally, the framework also needs to keep up with the rapid pace of high‐throughput analyses and accommodate new gene functions as they are discovered. With the development of computational approaches for elucidating gene function there has been a need for organizing gene functional information in a machine‐friendly hierarchical ontology. Fraser and Marcotte (2004) outlined the two perspectives for organizing gene function, called the “top down” and the “bottom up” approaches. The top‐down approach involves organizing all known functional information into a standardized vocabulary and organizing them into a hierarchical tree. This is similar to the current ontological projects such as the Gene Ontology (Ashburner et al., 2000a). In

(27)

the top‐down view, the function of a gene is defined by all the terms it is associated with in the ontology. However, in the bottom‐up approach, genes or proteins are first organized into networks based on their interaction with each other. The interactions are determined using high‐throughput data such as co‐expression and PPI data. Here, the function of a gene is determined by its collaborators(C. v. Mering et al., 2003).Thus, the concept of gene function is fluid depending upon the nature of the investigation.

2.2 Organizing functional information: Ontologies

The need for building ontologies for describing gene function was realised at the beginning of the post‐genomic era when functional analyses became increasingly driven by high‐throughput data. Traditionally, functional information for genes in the various model organisms was maintained by the organism‐specific databases such as Flybase (Gelbart et al. 1997) for Drosophila and the Saccharomyces Genome Database (SGD) (J. Cherry, 1998a) for Yeast. Although, such databases were very successful in their respective communities they were independent with no co‐ordination between them. The potential of large scale functional analyses enabled by technologies such as microarrays revealed the potential for meta‐analyses or interconnecting biological information on a global scale. A global functional ontology would enable easy access to functional information integrated in an unambiguous way.

The early approaches to describing function of a gene or a protein was based upon natural language where the functional label depended on the discretion of the investigator. Generally the functional annotation tended to be simple phrases, that are non‐standard with no organizational structure

(28)

(Lan, Montelione, & Gerstein, 2003). Typical examples included gene names such as redtape, roadblock, radish and turnip. In addition to the obvious lack of organizational rule for naming, with thousands of new genes being discovered on a regular basis, not all genes could be named inventively. There is indeed a practical limitation to the number of gene names one can come up with. Evidently, due to the large variability, such a naming system was not amenable to analysis by a computer or even humans. The need for a machine‐friendly, standardized, functional naming system was paramount. Some of the desirable characteristics of a potential functional naming scheme are listed below (Pandey, 2006):

1. Wide coverage: The functional scheme should cover the entire gamut of functions across as many organisms as possible. This is possibly the most desired characteristic.

2. Standardized structure: Adopting a standard data structure for the functional labels is imperative for minimizing variability between labels. This also results in easy readability and makes the label relatively computer friendly.

3. Hierarchical structure: Arranging the functional labels in a hierarchical arrangement starting from a general functional category leading to a specific function allows the researcher to select the relevant functional granularity for the analysis.

4. Multiple functions: As discussed earlier, it is well acknowledged that gene function is highly context‐specific. To reflect the condition‐ specific nature of gene function, any functional labelling scheme would have to allow multiple labels for the same gene.

(29)

5. Future proof: The functional labelling scheme would be expected to be amenable to future additions. This would allow users to append new functional information as and when available.

The idea of a standard system for organizing biological knowledge was first proposed in the early 1990s with the introduction of the Enzyme Classification (E.C) numbers (Bairoch, 2000). Since then several functional classification schemes have been proposed such as EcoCyc (Keseler et al., 2005) and SNOMED (Spackman, 1997)with a majority of them being organism‐specific systems. One of the early functional labelling schemes was the MIPSFunCat (Mewes et al., 2004). MIPS was one of the first classifications schemes to develop a machine‐readable, standardized vocabulary for organizing functional information which was organism‐ independent. MIPS is widely used in bioinformatics‐driven functional analyses due to its wide coverage and standardized hierarchical structure. However, one of the largest and the most comprehensive efforts in organizing functional information is the Gene Ontology Project (Ashburner et al., 2000b). The GO project was founded on strong ontological principles and is currently the most widely used functional labelling scheme with more than 7000 citations. In this thesis, generally, all the functional analyses and results are reported based on GO and MIPS functional classifications. In the following sections, we discuss the salient features of the two functional classification schemes.

2.2.1 MIPS FunCat

MIPS Functional Catalogue (H.W. Mewes et al. 2004; Andreas Ruepp et al. 2004) was one of the early attempts to generate a standardized, functional vocabulary. Unlike other functional schemes at the time such as EC (contd...)

(30)

Metabolism 01 Metabolism 02 Energy 04 Storage Protein Information pathways 10 Cell cycle and DNA processing 11 Transcription 12 Protein synthesis 14 Protein fate (folding, modification and destination) 16 Protein with binding function or cofactor requirement (structural or catalytic) 18 Protein activity regulation Transport 20 Cellular transport, transport facilitation and transport routes Perception and response to stimuli 30 Cellular communication/signal transduction mechanism 32 Cell rescue, defence and virulence 34 Interaction with the cellular environment 36 Interaction with the environment (systemic) 38 Transposable elements, viral and plasmid proteins Developmental processes 40 Cell fate 41 Development (systemic) 42 Biogenesis of cellular components 43 Cell type differentiation 45 Tissue differentiation 47 Organ differentiation Localization 70 Subcellular localization 73 Cell type localization 75 Tissue localization 77 Organ localization 78 Ubiquitous expression Experimentally uncharacterized proteins 98 Classification not yet clear‐cut 99 Unclassified proteins

Table 1: Major functional categories in MIPS FunCat and the corresponding category numbers. The numbers represent the main category identifiers.

(31)

nomenclature and SWISS‐PROT (Boeckmann, 2003), MIPSFunCat focussed solely on associating gene products and functional information. Here, hierarchically structured keywords or a controlled vocabulary is used to describe gene function. The MIPSFunCat scheme was initially designed for Saccharomyces cerevisiae and was later extended to cover 11 other organisms including Arabidopsis thaliana, Neurosporacrassa and Bacillus subtilis. To account for the broad spectrum of biological processes found in the various organisms, the FunCat annotation scheme consists of 28 main functional categories that cover general functions such as cellular transport, metabolism and protein activity regulation (Andreas Ruepp et al., 2004). Each of the main functional categories is organized as a hierarchical tree‐like structure. The main functional categories of the FunCat are listed in Table 1.

As discussed earlier, an important consideration for a new ontology or annotation scheme is machine readability and human usability. The FunCat scheme was aimed to find a balance between the two contrasting requirements. By design, the FunCat scheme is compact with a limited number of terms. The FunCat terms generally offer a broad classification of the gene of interest compared to similar vocabularies such as the Gene Ontology Project. Each of the functional categories is assigned a unique two digit number, as indicated in Table 1. The hierarchy between the functional categories is depicted by using a dot between the two digit category numbers e.g. 10 Cell cycle and DNA processingÆ10.01 DNA Processing Æ10.01.09 DNA restriction and modificationÆ10.01.09.05 DNA conformation and modification. The FunCat scheme allows for assigning a gene to multiple categories to accommodate for the condition‐specific nature of gene functions.

(32)

2.2.2 The Gene Ontology Project

The Gene Ontology (GO)(Ashburner et al. 2000a) is the most widely used biological ontology for describing gene function and covers over 28 different organisms. The current version of the GO has over 30,000 terms and over 50,000 relationships (The Gene Ontology Consortium, 2010). The GO was developed in collaboration with a variety of biological databases such as SwissPROT (Boeckmann, 2003), GenBank (Benson, Karsch‐Mizrachi, Lipman, Ostell, & Wheeler, 2008), MIPS (H W Mewes et al., 2004) and Pfam (Bateman et al., 2004). The GO consists of two discrete parts; firstly, the annotation of genes with GO vocabulary and secondly the vocabulary and the relationships between the terms in the vocabulary. The annotation of the genes to the terms is maintained by the individual organism‐specific databases such as TAIR (Rhee et al., 2003) for Arabidopsis. The GO terms are created, organized and maintained exclusively by the GO consortium.

The GO is based on the Open Biological Ontologies (OBO) (Camon et al., 2004) framework and consists of three discrete classification systems called Biological Process, Cellular Component and Molecular Function. Each classification system addresses a different aspect of a gene’s function.

Cellular Component describes the location of the gene products in the cell. It

is designed to describe the physical structure with which a gene or a gene product is associated e.g. extra‐cellular matrix, golgi apparatus. Molecular

Function is defined by GO as “the biochemical action characteristic of a gene

product”. The Molecular Function ontology describes the action without specifying the locality of action or the time of action e.g. transporter, protein stabilization. Biological Process is defined as “A phenomenon marked by changes that lead to a particular result, mediated by one or more gene

(33)

gene product contributes. The process may involve physical or chemical transformation i.e. the nature of the products before the event can be very different from the end product. Typical Biological Process terms include response to stress, translation and cell cycle.

Figure 2: Structure of the Gene Ontology Biological Process Tree. The example shown above is the GO structure for the term “cytokinesis after meiosis I”. Only two (is_a and part_of) of the 5 types of relationships are depicted in the figure.

The terms are arranged as a hierarchically arranged Directed Acyclic Graph (DAG) with increasing levels of granularity or specificity. The broadest term sits at the root of the DAG and the specificity increases with the distance from the root. The relation between the terms can be one of the 5 types: is_a,

(34)

part_of, regulates, negatively regulates and positively regulates. Although in the earlier versions of GO, the Biological Process (BP), Molecular Function (MF) and Cellular Component (CC) trees were discrete with no links between them, the recent versions of the GO can accommodate links between the trees e.g. A gene can be annotated to a MF term and can have a “regulates” relationship to a BP term (The Gene Ontology Consortium, 2010).

The GO was designed to possess all the desirable properties for a biological ontology discussed earlier in the beginning of this chapter. The GO has a very wide coverage with nearly 28 databases such as SGD (J. Cherry, 1998b), EcoCyc (Keseler et al., 2005) and TAIR (Rhee et al., 2003). The terms in the GO follow a standardized format with each node having a unique GO id of the form GO:XXXXXXX e.g. GO:0009560. The terms are arranged in a DAG with well‐defined relationship between them e.g. is_a. This allows the gene products to have multiple parents and multiple children; this design allows for denoting multiple functions for the same gene. The well‐defined structure makes GO relatively better suited for computational applications and also for human readability. The three disjoint ontologies provide a multi‐dimensional view of gene function. Finally, the GO is designed to be

dynamic. As new research uncovers novel functions, the curators can easily

incorporate the knowledge into the GO. These features have heavily contributed in making the GO the de facto standard for functional annotation.

(35)

3

Functional analyses using

gene expression data

Functional analyses of genes and elucidation of gene function is central to the aims of Systems Biology. This includes understanding biological systems by uncovering novel gene interactions, reconstructing regulatory and metabolic pathways and characterizing novel genes or proteins that may be involved in these biological processes (Kitano, 2002).

Systems biology has adopted two main ideologies for functional analyses; “top‐down” and “bottom‐up” approaches2_{(Bruggeman & Westerhoff, 2007;}

Noble, 2002). The bottom‐up approach is largely based on prior knowledge. Prior knowledge about various biological components such as a gene regulatory pathway or reaction kinetics of enzymes is integrated and used to model the behaviour of the biological systems of interest. In contrast, the

2_{Not to be confused with the “top‐down” and “bottom‐up” principles of organizing gene function}

(36)

Top‐down approach is based on sampling data concerning the various biological components in various experimental conditions. This is a generic approach and does not require any prior knowledge regarding the biological systems under investigation. Responses to experimental conditions can be sampled at various levels such as the transcriptome using microarrays, the proteome using the yeast 2‐hybrid technique (Ito et al., 2001) or the metabolome using mass‐spectrometry (Fiehn, 2002). The choice of data source depends entirely on the nature of the biological phenomena under investigation. Large quantities of data obtained from these techniques are mined using analytical techniques which are designed to look for patterns in the data that may support a prior formed hypothesis.

Gene expression data‐based functional analyses are based on the top‐down perspective of systems biology. The primary aim is to mine gene expression data to characterize novel genes and uncover novel relationships between genes. These interactions are summarized to form biological processes and pathways. The Guilt‐by‐Association (GBA) principle has been the most successful approach for mining functional information from gene expression data. In the following sections, we will take a closer look at the GBA principle and its general applicability in microarray data analyses. Subsequently, we investigate the role of similarity metrics in GBA‐based analyses. These concepts are crucial for understanding the problem presented in this thesis.

3.1 Guilt‐by‐association in microarray data analyses

Microarray experiments focus on identifying patterns of gene expression in response to a treatment or stimulus such as chemical treatment or stress time course or simply a comparison between two or more tissue types such as

(37)

wild‐type and mutant. The gene expression profiles from these microarray experiments can be exploited for uncovering cellular pathways and the interaction between the myriad pathways in a biological system. Appendix I provides an overview of widely used microarray technologies and discusses the strategies used in extracting functional information from raw microarray data.

From the very beginning of the age of microarrays, it was evident that the biological functions of genes could be uncovered by applying the principle of guilt‐by‐association (GBA) (Quackenbush, 2003; Wolfe, Kohane, & Butte, 2005). In the context of gene expression data analysis, the GBA principle states that genes with similar expression profiles may share similar functions. This is based on the observation that genes encoding proteins that participate in a metabolic pathway are generally found to be co‐regulated. Also clusters of genes with related functions often exhibit expression profiles that are correlated under several experimental conditions (Eisen, 1998; Ihmels, Bergmann, & Barkai, 2004; Stuart, Segal, Koller, & Kim, 2003). Therefore, the principle of GBA is based on the idea that a co‐ordinated gene expression profile across several experimental conditions suggests the presence of a functional linkage.

Although the GBA approach, in the context of gene expression, holds great promise in the quest for uncovering gene function, its application is not without criticisms. It is important to note that the ideal end point for the description of a biological system would involve measuring protein levels and their respective activities rather than being limited to mRNA expression measurements alone. Studies by Gygi et al. (1999) have found that the correlation between mRNA and protein levels is insufficient to predict protein expression levels from quantitative mRNA measurements. In other

(38)

words, the protein and mRNA abundance were found to be poorly correlated. Studies such as the ones by Clare & King (2002) clustered yeast microarrays and found poor correlation between the genes in the clusters and the functional annotations. Based on such reports, it was argued that GBA may be an unsatisfactory approach for understanding biological systems. Also, it can be observed that genes that are co‐regulated may not necessarily be co‐expressed and genes which are co‐expressed are not necessarily functionally related. For example, it is well known that gene regulation is dependent on the presence of sequence motifs such as cis elements. However, cis‐regulatory motifs have been shown to occur by chance in the genome leading to unexpected gene regulatory events. Events such as these would be hard to detect when analyzing gene expression data from single organisms.

Additionally, phenomena such as post‐translational modifications of proteins heavily influence protein structure and functions. Such modifications cannot be detected by measuring gene expression alone. In such instances, GBA would not be effective for elucidating the functional roles of the genes.

Regardless of the criticisms, GBA has proven to be the most effective approach for the functional analyses of microarrays. Allocco, Kohane, & Butte (2004) show that there is a high degree of agreement between clusters of gene expression profiles and GO‐based functional categories. This result is in contradiction to the results obtained by Clare & King (2002). Allocco et al. (2004) explain the discrepancy by suggesting that their more comprehensive approach is better suited for analyzing subtle similarities in gene expression profiles. They also attribute their superior results to the use of a larger and

(39)

large corpus of microarray‐based functional analyses techniques have been leveraged based on the idea of GBA. The techniques vary in their complexity ranging from simple correlation‐based analyses (Manfield et al. 2006; Obayashi et al. 2009; Steinhauser et al. 2004; Usadel et al. 2009), gene expression profile clustering approaches (Andreopoulos, An, Wang, & Schroeder, 2009)to the recent network‐based approaches (Babu, Luscombe, Aravind, Gerstein, & Teichmann, 2004; Long, Brady, & Benfey, 2008; Y. Wang, Joshi, Zhang, Xu, & Chen, 2006; Wolfe et al., 2005; Zhou et al., 2005). These techniques are briefly outlined later in section 4.2.

3.2 Similarity metrics for GBA analyses

At the core of the GBA principle lays the question of assessing the similarity between gene expression profiles found in microarray experiments. Microarray experiments take the form of a series of measurements taken at various points in time, response to treatments, as a comparison between biological samples or as a combination of the above. The vector containing the set of measurements is called the gene expression vector or gene expression profile. The primary step in any microarray‐based functional analysis is to calculate the distance or the similarity between every pair of genes in the dataset. The resulting matrix of distance or similarity metrics is the starting point for nearly all data mining methods such as clustering and network construction.

The commonly used similarity measures can be divided into two main classes; Distance metrics and Similarity metrics. Distance metrics include Euclidean distance and City Block distance. Often used similarity metrics include Pearson correlation coefficient, Spearman’s rank correlation coefficient and Mutual Information. Among the various similarity and

(40)

distance metrics available, Pearson correlation, Euclidean distance is more widely used in microarray literature (Wen 1998; Khan et al. 2001; S K Kim et al. 2001; Lein et al. 2007; Eisen et al. 1998) although recently Mutual Information has also been applied in gene expression analysis (Margolin et al., 2006; Priness, Maimon, & Ben‐Gal, 2007). Each of the similarity metrics has unique characteristics and the choice of the metric solely depends on the nature of the analysis. One commonly used approach in gene expression analyses is to select the distance measure which yields the best proportion of functionally related genes (Gibbons & Roth, 2002). In following sections, we present the properties of Pearson correlation and Euclidean distance.

3.2.1 Euclidean Distance

Euclidean distance is one of the most widely used distance measures in gene expression analyses. It is simply the straight line distance between two points (Fig.3).

In two dimensions, the distance is calculated using the Pythagorean Theorem (Eq. 1).

,

(41)

Figure 3: The Euclidean distance between two points x and y in a two dimensional space.

This concept is easily extended to higher dimensions found in gene expression profiles. Considering two gene expression profiles X and Y containing n measurements or dimensions, the distance between X and Y can be calculated using the formula illustrated in Equation 2.

,

(Eq.2) Together with most distance measures, Euclidean distance follows a common set of ground rules (Stekel, 2003) which are as outlined below: Given two vectors x and y,

(42)

, 0 2. Distance between x and y must be equal to the distance between y and x. , , 3. Distance between x and y cannot be negative. , 0 , 4. Given another vector z, the distance between x and y must be less than the sum of the distances between x and z and y and z. This is also called the law of Triangle Inequality.

, , ,

Euclidean distance measures the similarity between two expression profiles based on the intensity of expression or the magnitude of the curve. In the context of gene functional analyses where the aim is to identify genes with similar expression profiles without considering the magnitude of expression, this property of Euclidean distance can be viewed as a limitation. The problem is illustrated in Fig.4, where we have three gene expression profiles measured over 5 data points. In this example, Euclidean distance would classify Profile B and Profile C as similar and Profile A would be considered as dissimilar. Although Profile A and Profile B have similar expression dynamics, they would be considered dissimilar.

(43)

Figure 4: Hypothetical expression profiles of three genes over 5 data points. The choice of similarity metric dictates the similarity between the three expression profiles. Profile A and B would be considered similar according to Pearson correlation whereas Profiles B and C would be consider similar according to Euclidean distance.

3.2.2 Pearson correlation coefficient

Pearson correlation quantifies the similarity between two sets of gene expression measurements. If we denote the two sets of measurements with the notation (xi) and (yi), where iis an index from 1 to the total number of

measurements (denoted by n), then the correlation coefficient r is given by the formula (Eq. 3):

,

∑

(44)

The Pearson correlation coefficient is bound between ‐1 and +1. A correlation coefficient of ‐1 between two gene expression profiles indicates that the shape of one expression profile is the exact mirror of the other i.e. in cases where one gene is activated and the other gene is correspondingly repressed. A coefficient of +1 indicates that the shapes of the two expression profiles are very similar to each other i.e. when two genes are similarly activated or repressed. A correlation coefficient of zero indicates that there is no similarity between the two expression profiles. Generally, in clustering applications where distance metrics are used, the absolute value of the Pearson correlation coefficient is subtracted from 1 to obtain a correlation distance (Eq.4).

1 | | where:

c is the correlation coefficient

(Eq.4) In Pearson correlation, the significance of the correlation between two vectors or expression profiles can be calculated to obtain a p‐value. The significance refers to the likelihood of having observed the correlation simply by chance alone. The p‐values are obtained by transforming the correlation values to follow a t distribution using the formula in Eq. 5: 2 1 where: r is the correlation coefficient n is the number of pairs of data

(45)

source: (Neter, Wasserman, Kutner, & Li, 1996) (Eq. 5) Unlike Euclidean distance, Pearson correlation depends on the shape of the curve rather than the intensity or magnitude. In Fig.4, profile A and profile B would be classified as similar and profile C would be relatively dissimilar to profile A and B. This key property makes Pearson correlation well suited for clustering gene expression profiles. For this reason, we use Pearson correlation coefficient for our analyses.

It is reasonable to believe that genes that belong to a biological pathway or process can be positively or negatively correlated due to phenomena such as negative feedback loops in biological pathways. One such example would be the negative feedback loop found in the plant circadian clock machinery where (Alabadí et al., 2001) have shown that two MYB transcription factors called LHY and CCA1 repress the activation of their activator TOC1. For these core circadian genes, we retrieved microarray data from a time‐series experiment conducted on wild‐type Arabidopsis (NASCArray Database ID 137). Here, gene expression was measured at 30 min, 1h, 3h, 6h, 12h and 24h. We found that the LHY was positively correlated with CCA1 (r = 0.92) and TOC1 was negatively correlated with LHY (r = ‐0.85) and CCA1 (r = ‐0.78). For this reason, in this thesis, when considering the correlation between genes belonging to the same functional category we use the absolute value of the correlation coefficient (r = |c|).

On the effects of large-scale transcriptomics datasets on gene functional analyses