Exploiting network-based approaches for understanding gene regulation and function

Sarath Chandra Janga

A dissertation submitted to the University of Cambridge in candidature for the degree of Doctorate of Philosophy

April 2010

Darwin College, University of Cambridge

MRC Laboratory of Molecular Biology

Previous page: A portrait of the transcriptional regulatory network of the budding yeast, Saccharomyces cerevisiae. Each circle represents the transcriptional interconnections between one chromosome and all other chromosomes. Evidently, every chromosome is transcriptionally controlled by factors encoded on many of the 16 chromosomes of this organism, which are marked by the letters ‘a’ through ‘p’.

Declaration of originality

This dissertation describes work I carried out at the Medical Research Council Laboratory of Molecular Biology in Cambridge between January 2008 and April 2010. The contents are my original work, although much has been influenced by the collaborations in which I took part. I have not submitted the work in this dissertation for any other degree or qualification at any other university.

Sarath Chandra Janga

April 2010

Acknowledgements

First of all, I would like to express my gratitude to Dr. Madan Babu, without whose continuous support throughout my doctoral work it would have remained just a dream for me to carry out my thesis work at the MRC Laboratory of Molecular Biology. Madan has been not only an excellent supervisor but also a good friend who was always supportive of my research interests, allowing me to work independently on a wide range of problems during my stay here. He has been a source of great inspiration on various occasions and a great scientific colleague to work with. In short, I probably could not have had a more understanding and motivating supervisor.

I am also very grateful to Dr. Sarah Teichmann, whose equally supportive words from time to time have been a motivation to finish my doctoral work in a short time. I have learnt from her the art of venturing into uncharted territories of molecular biology without fear.

I am also thankful for the kind support and warm welcome that I received from Dr. Cyrus Chothia from the first day that I came to LMB.

I consider myself very fortunate to be in a wonderful lab with a lot of energetic and highly motivating people working on fundamental problems of molecular biology. Indeed, I must admit that I have learnt at least as much from my colleagues and seminars at LMB, as I have learnt from reading books and papers, not to mention the fun that I had during numerous lunch and dinner breaks with various members of the lab and TCB group in particular. I especially would like to thank A Wuster, B Lang, AJ Venkatakrishnan, D Hebenstreit, D Wilson, E Levy, G Chalancon, J Su, N Mittal, P Kota, R Janky, S De, T Perica, V Charoensawan and J Gsponer for making my stay at LMB a memorable experience.

I am also greatly indebted to all my scientific friends, collaborators and mentors, both in the past and during my PhD, for having helped me learn and explore diverse areas of molecular biology. In no defined order, I would like to sincerely thank Agustino Martinez-Antonio (Irapuato, Mexico) for his confidence in my abilities, Ernesto Perez-Rueda (Cuernavaca, Mexico) for his kind hospitality during my visits to Mexico, Gabriel Moreno-Hagelsieb (Waterloo, Canada) for being a great mentor and an excellent scientific friend, Heladia Salgado (Cuernavaca, Mexico) for her energy and her patience with my requests for data, Andrew Emili (Toronto, Canada) for giving me the opportunity to work on an unsolved mystery, Denis Thieffry (Marseille, France) for teaching me to focus on important ideas, and many other colleagues for scientific discussions over the years which made me a mature and independent scientist. I would also like to take this opportunity to offer my gratitude to all colleagues, administrative staff and heads of division, Venki Ramakrishnan and Kiyoshi Nagai, at LMB, whose continuous support has made it possible for me to develop a career in science.

I am also grateful for the financial support that I received from the Cambridge Commonwealth Trust (CCT) and the Medical Research Council during my PhD. Last but not least, I am most indebted to my family (my parents and sister) as well as my near and dear ones, who have been continuously supportive of my adventures in science and understanding of my reasons for being silent for months. My very presence on this planet would not have been possible if not for my mother, who expired long before I knew what maths and science are all about. I dedicate this thesis in her name.

Abbreviations

3C Chromosome Conformation Capture

ArcA Aerobic respiration control protein A

BDBH Bi-Directional Best Hits

BLAST Basic Local Alignment Search Tool

cAMP cyclic Adenosine MonoPhosphate

ChIP Chromatin immunoprecipitation

CLIP Cross Linking and Immuno-Precipitation

COGs Clusters of Orthologous Groups

CRP cAMP Receptor Protein

CT Chromosomal Territory

DBTBS DataBase of Transcriptional regulation in Bacillus Subtilis

DNA DeoxyriboNucleic Acid

EC Enzyme Commission

FDR False Discovery Rate

FIS Factor for Inversion Stimulation

FISH Fluorescent In Situ Hybridization

FFL Feed Forward Loop

FNR regulator of Fumarate and Nitrate Reduction

GBA Guilt By Association

GC Genomic Context

GO Gene Ontology

GR Global Regulator

GRN Gene Regulatory Network

HMM Hidden Markov model

hnRNP heterogeneous nuclear RiboNucleoProtein

HNS Histone-like Nucleoid Structuring protein

HU Heat Unstable protein

IHF Integration Host Factor

LAD Lamina Associated Domain

LCMS Liquid Chromatography-Mass Spectrometry

LCR Locus Control Region

MALDI Matrix-Assisted Laser Desorption/Ionization

MCL Markov CLuster algorithm

mRNA Messenger RNA

NAP Nucleoid Associated Protein

PAB PolyAdenylate-Binding protein

PI/PPI Protein Interactions

PTM Post-Translational Modification

PTN Post-Transcriptional Network

PTS PhosphoTransferase System

RBD RNA Binding Domain

RBP RNA Binding Protein

RIP RNP ImmunoPrecipitation

RNA RiboNucleic Acid

RNP RiboNucleo Protein complex

RRM RNA Recognition Motif

TAP Tandem Affinity Purification

TF Transcription Factor

TG Target Gene

TPI Target Proximity Index

Summary

It is becoming increasingly clear in the post-genomic era that proteins in a cell do not work in isolation but rather work in the context of other proteins and cellular entities during their lifetime. This has led to the notion that cellular components can be visualized as wiring diagrams composed of different molecules like proteins, DNA, RNA and metabolites. These systems approaches for quantitatively and qualitatively studying dynamic biological systems have provided us with unprecedented insights, at varying levels of detail, into cellular organization and the interplay between different processes. The work in this thesis attempts to use these systems or network-based approaches to understand the design principles governing different cellular processes and to elucidate the functional and evolutionary consequences of the observed principles.

Chapter 1 is an introduction to the concepts of networks and graph theory. It summarizes the various properties which are frequently studied in biological networks and gives an overview of the different kinds of cellular networks that are amenable to graph-theoretical analysis, with particular emphasis on transcriptional, post-transcriptional and functional networks.

In Chapter 2, I address how and why genes are organized in a particular fashion on bacterial genomes and what constraints bacterial transcriptional regulatory networks impose on their genomic organization. I then extend this one step further to unravel the constraints imposed on the network of TF-TF interactions and relate them to the numerous phenotypes they can impart to growing bacterial populations.

Chapter 3 presents an overview of our current understanding of eukaryotic gene regulation at different levels and then shows evidence for the existence of a higher-order organization of genes across and within chromosomes that is constrained by transcriptional regulation. The results emphasize that a specific organization of genes across and within chromosomes, one that allows efficient control of transcription within the nuclear space, has been selected during evolution.

Chapter 4 first summarizes different computational approaches for inferring the function of uncharacterized genes and then discusses network-based approaches currently employed for predicting function. I then present an overview of a recent high-throughput study performed to provide a ‘systems-wide’ functional blueprint of the bacterial model, Escherichia coli K-12, with insights into the biological and evolutionary significance of previously uncharacterized proteins.

In Chapter 5, I focus on post-transcriptional regulatory networks formed by RBPs. I discuss the sequence attributes and functional processes associated with RBPs and the methods used to construct the networks formed by them, and finally examine the structure and dynamics of these networks based on recent publicly available data. The results obtained here show that RBPs exhibit distinct gene expression dynamics compared to other classes of proteins in a eukaryotic cell.

Chapter 6 provides a summary of the important aspects of the findings presented in this thesis and their practical implications.

Overall, this dissertation presents a framework which can be exploited for the investigation of interactions between different cellular entities to understand biological processes at different levels of resolution.

1.2.4 Methods to construct other classes of cellular and biological networks

1.3 OUTLINE OF THE THESIS

REFERENCES

Chapter 2: Functional, structural and dynamic constraints on bacterial regulatory networks

OUTLINE

CONTRIBUTION TO THE WORK IN THIS CHAPTER

2.1 INTRODUCTION

2.2 RESULTS

2.2.1 Constraints imposed on the network of transcription factors in bacteria

2.2.1.1 Topology of Escherichia coli cross-regulatory transcriptional network

2.2.1.2 Multiple parallel feed-forward loops regulate the use of different carbon sources

2.2.1.3 Long hierarchical cascades regulate developmental processes

2.2.2 Constraints imposed on bacterial genome organization by transcriptional network

2.2.2.1 Genomic co-localization of TFs and target genes is observed in small regulons

2.2.2.2 Transcriptional regulatory flow in the network of TFs

2.2.2.3 Absolute and average mRNA abundance of TFs suggests correlation with regulon size and network hierarchy in E. coli

2.2.2.4 A conceptual model for the structuring of regulatory networks in bacteria

2.3 DISCUSSION & CONCLUSION

2.4 METHODS

2.4.1 Identification of regulon groups

2.4.2 Estimating the statistical significance of the regulon groups

REFERENCES

Chapter 3: Transcriptional regulation constrains the organization of genes on eukaryotic chromosomes

OUTLINE

3.2 RESULTS

3.2.1 Eukaryotic genome organization and transcriptional regulation

3.2.1.1 Long-range interactions involving distal regulatory elements

3.2.1.2 Inter-chromosomal interactions

3.2.1.3 Chromosomal territories, movement and nuclear organization

3.2.1.4 Association of the genomic loci with the nuclear periphery

3.2.2 Transcriptional regulation constrains genome organization

3.2.2.1 The majority of TFs show a strong preference to regulate genes on specific chromosomes

3.2.2.2 A significant fraction of the TFs tend to have targets on specific regions of the chromosomal arm

3.2.2.3 Most TFs show a strong preference to positionally cluster their targets within a chromosome

3.3 DISCUSSION & CONCLUSION

3.4 MATERIALS AND METHODS

3.4.1 Dataset of Transcription factors in S. cerevisiae and their regulatory interactions

3.4.2 Estimation of statistical significance

3.4.3 Calculation of chromosomal preference

3.4.4 Calculation of regional preference

3.4.5 Calculation of target proximity

REFERENCES

Chapter 4: Uncovering the functional architecture of uncharacterized proteins in Escherichia coli

OUTLINE

4.2 RESULTS

4.2.1 Overview of network-based function prediction

4.2.1.1 Methods and databases for constructing functional association networks

4.2.1.2 Computational methods for predicting function from network context

4.2.2 Uncovering the cellular roles of functional orphans in E. coli

4.2.2.1 The extent of existing functional annotation for E. coli proteins

4.2.2.2 Properties of the functional orphans of E. coli

4.2.2.3 A systematic approach to elucidate biological function

4.2.2.4 Experimental definition of the physical interaction network of the soluble proteome

4.2.2.5 Orphan membership within multiple protein complexes

4.2.2.6 Functional interactions predicted by genomic-context methods

4.2.2.7 Defining the participation of orphans as the components of functional modules

4.2.2.8 Improved functional inference within an integrated network framework

4.2.2.9 Functional neighborhoods

4.4.1 PI network generation

4.4.2 GC network generation

4.4.3 Clustering

4.4.4 Network-based function prediction and benchmarking

REFERENCES

Chapter 5: Structure and dynamics of post-transcriptional regulatory networks directed by RNA-binding proteins

OUTLINE

5.2 RESULTS

5.2.1 RNA binding proteins and post-transcriptional regulation

5.2.2 Methods to Identify RBPs and their targets

5.2.3 RBPs and post-transcriptional operons

5.2.4 Post-transcriptional network formed by RBPs

5.2.5 Expression dynamics of RBPs in post-transcriptional networks

5.2.5.1 RBPs show high abundance and tight regulation at the protein level

5.2.5.2 The number of distinct targets bound by a RBP is correlated with its cellular abundance

5.2.5.3 RBPs bound to many RNA targets are less frequently degraded and tightly controlled at protein level

5.4.1 Data on RNA-binding proteins in S. cerevisiae and their interactions

5.4.2 Analysis of the structure and properties of post-transcriptional regulatory network

5.4.3 Data for comparative analysis of expression dynamics

5.4.4 Comparison of the regulatory properties of RBPs with other protein coding genes

5.4.5 Analysis of the relationship between the number of targets of a RBP and its dynamic properties

REFERENCES

Chapter 6: Conclusions and Perspectives

6.1 Outline

6.2 Major Findings

6.2.1 Constraints imposed by transcriptional regulation on genome organization and regulatory network

6.2.2 Uncovering the functional landscape of a bacterial genome

6.2.3 Structure and dynamics of post-transcriptional networks controlled by RNA binding proteins

Implications and Future Directions

REFERENCES

Appendix

A.1 LIST OF PUBLICATIONS

Publications during PhD (January 2008-April 2010)

Publications under review, revision and in preparation

Publications prior to starting PhD

A.2 REPRINTS

CONTENTS OF CHAPTER 1

PREAMBLE

OUTLINE OF THE INTRODUCTION

1.1 BASICS OF GRAPH THEORY AND NETWORKS

1.1.1 Local level

1.1.2 Modular level

1.1.3 Global level

1.2 NETWORKS IN MOLECULAR BIOLOGY

1.2.1 Methods to construct transcriptional regulatory networks

1.2.2 Methods to construct functional linkage networks

1.2.3 Methods to construct post-transcriptional regulatory networks

1.2.4 Methods to construct other classes of cellular and biological networks

1.3 OUTLINE OF THE THESIS

PREAMBLE

Reductionism, which has been the paradigm in biological research for more than a century, has provided us with a wealth of knowledge about individual cellular components, their functions and mechanisms. Despite the huge success of this approach in the last century, post-genomic biology has made it increasingly clear that a discrete biological function can only rarely be attributed to an individual molecule. Instead, most biological outcomes in a cell arise from a complex interplay between different cellular entities such as proteins, DNA, RNA and metabolites. Therefore, a key challenge for biology in the twenty-first century is to understand the structure and dynamics of the complex web of interactions in a cell that contribute to its proper functioning. Although we cannot address this challenge in full, the analyses, concepts and frameworks outlined in this thesis will help the scientific community to interpret and better understand the logic behind the several layers of interactions occurring in the cell.

In the last few years, the rapid development of various high-throughput technologies has led to the accumulation of a large amount of data from different areas of molecular and cellular biology. These developments, together with increasing interest in the community in gaining a systems-wide understanding of the cellular machinery, have provided us with unprecedented insights into the structure, organization and dynamics of major cellular processes such as transcription, translation and degradation. Likewise, efforts to understand the interaction of the cell with its external environment have generated global phenotypic maps such as those due to small-molecule perturbations. Despite the growing amount of data representing each of these processes, it should be recognized that none of them works in isolation; rather, they form an integrated network of different wiring diagrams which is responsible for the observed behavior of the cell. In this thesis, I provide evidence that the network of associations underlying a particular cellular process can be studied in detail to provide meaningful insights into how it contributes to the functioning of the cell, the factors that constrain its structure and how it influences the genome on which it is encoded. Nevertheless, an open challenge in contemporary biology is to integrate these diverse cellular programs, first to understand and model in quantitative terms the topological and dynamic properties of such a unified cellular network and then to exploit it for the therapeutic benefit of mankind.

OUTLINE OF THE INTRODUCTION

An emerging notion in post-genomic biology is that cellular components can be visualized as a network of associations between different molecules like proteins, DNA, RNA and metabolites. This has led to the application of network theory and network-based approaches to a wide range of biological problems, from understanding the regulation of gene expression to predicting a gene’s function and phenotype to drug discovery. In this chapter, I first introduce the notion of networks and the basic principles of network biology, together with an overview of the different kinds of networks that are widely studied in the biological sciences at the systems level. In particular, I introduce transcriptional and post-transcriptional networks, in which trans-acting elements like TFs, RBPs and sigma factors form one set of nodes and the target genes or RNAs whose activity they control form the other set of nodes. The links between them, which run from the trans-acting elements to their target genes and are mediated by cis-regulatory elements, form a complex directed network of interactions. In contrast, the functional linkage networks constructed in function prediction pipelines are typically undirected: all nodes are treated essentially the same and the links carry no directionality. These networks aim to uncover the broad functional role of uncharacterized genes using the annotations of the already characterized genes to which they are connected. I then give a brief overview of other classes of networks, such as small-molecule protein interaction networks, which are also referred to as drug-target networks, to illustrate the generality and applicability of network-guided approaches in understanding biological systems.

1.1 BASICS OF GRAPH THEORY AND NETWORKS

Complex networks describe a wide range of dynamical systems in nature and society. In simple terms, a network comprises a set of nodes with connections between them called edges. Most real-world systems can be visualized in the form of networks, also called graphs in the mathematical literature. Examples include the internet, the World Wide Web (WWW), social networks of acquaintances between individuals, food webs, metabolic networks, transcriptional networks, signaling networks, neural networks and many others. Although the study of networks, in the form of mathematical graph theory, is one of the fundamental areas of discrete mathematics, much of our understanding of their underlying organizational principles has come to light only recently. While most complex networks have traditionally been modeled as random graphs, it is increasingly recognized that the topology and evolution of real networks are governed by robust design principles.

A number of biological systems, from metabolic and neuronal networks to food webs and ecosystems, can be usefully represented as networks. More generally, the behavior of most complex systems emerges from the orchestrated activity of many components that interact with each other through pairwise interactions. At a highly abstract level, the components can therefore be reduced to a series of nodes that are connected to each other by edges, with each edge representing the interaction between two components. The nodes and links together form a network or, in more formal mathematical language, a graph, and these definitions can be extended to any sub-system of a complex system under study. Understanding the network of cellular interactions as a whole is impractical at the moment for at least two major reasons: the incompleteness of the data representing the wide variety of interactions that are possible in a cell, and variations in the mode as well as the type of interactions. Theoreticians have therefore been studying networks by dissecting biological processes into different levels, the most commonly studied being the physical interactions between molecules, such as protein-protein, protein-nucleic acid and protein-metabolite interactions, all of which can be conceptualized using the node-link nomenclature. Nevertheless, more complex functional interactions can also be considered within this representation. A classic example is the network of metabolic pathways, in which a metabolic substrate and a product are joined by a directed edge if a known metabolic reaction acts on that substrate and produces that product.

Depending on the nature of the interactions, networks can be directed or undirected. In directed networks, the interaction between any two nodes has a well-defined direction, which represents, for example, the direction of material flow from a substrate to a product in a metabolic reaction or the direction of information flow from a transcription factor to the gene that it regulates. In undirected networks, the links do not have an assigned direction. For example, in protein interaction networks a link represents a mutual binding relationship and hence has no directionality.
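As a minimal sketch of the distinction between directed and undirected networks, the example below stores a directed network as a dictionary mapping each node to the set of nodes it points to, and derives an undirected version by storing every edge in both directions. The node names (TF1, geneA, geneB) and the function names are hypothetical and serve only as an illustration.

```python
def build_directed(edges):
    """Map each source node to the set of nodes it points to."""
    graph = {}
    for source, target in edges:
        graph.setdefault(source, set()).add(target)
        graph.setdefault(target, set())  # register nodes without outgoing edges too
    return graph


def to_undirected(directed):
    """Drop directionality: every edge is stored in both directions."""
    graph = {node: set() for node in directed}
    for source, targets in directed.items():
        for target in targets:
            graph[source].add(target)
            graph[target].add(source)
    return graph


# Hypothetical toy edges: a transcription factor regulating two genes,
# one of which also regulates the other.
regulatory_edges = [("TF1", "geneA"), ("TF1", "geneB"), ("geneA", "geneB")]
directed = build_directed(regulatory_edges)
undirected = to_undirected(directed)

print(sorted(directed["geneA"]))    # nodes that geneA points to
print(sorted(undirected["geneA"]))  # all partners of geneA, direction ignored
```

In the directed version geneA has a single successor, whereas in the undirected version it is linked to both of the other nodes.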

Another important class of biological networks is the genetic regulatory network. The expression of a gene, i.e., the production by transcription and translation of the protein that the gene encodes, can be controlled by other proteins called transcription factors (TFs), which can regulate the expression of the gene either positively or negatively. In the former case TFs are considered to act as activators and in the latter as repressors. It is through the regulatory network that the genome coordinates its response to both external and internal stimuli, by controlling the expression of thousands of genes in appropriate amounts, under appropriate conditions and at appropriate times. Genetic regulatory networks were in fact one of the first networked dynamical systems for which large-scale modeling attempts were made. The early work on random Boolean nets by Kauffman (Kauffman, 1969; Kauffman, 1971; Kauffman, 1993) is a classic in this field, although substantial advances have come only more recently. The structure of transcriptional regulatory networks has been the focus of several recent studies (Babu et al., 2004; Farkas, 2003; Guelzim et al., 2002; Janga and Collado-Vides, 2007; Thieffry et al., 1998).

1.1.1 Local level

A number of properties can be defined for a network representation, and these properties can be grouped into three major classes, namely the local, modular and global levels. In the following sections, I summarize the major quantitative properties which can be used to define the structure of complex networks at each of these three levels. The first is the local level which, as the name suggests, refers to the local properties of a node (see Table 1-1 for a list of local properties of networks). For instance, as discussed above, networks can be directed or undirected depending on the nature of the interactions; nodes in directed networks therefore have both an out-going and an in-coming degree, while nodes in undirected networks have a single degree. The degree or connectivity of a node corresponds to the total number of connections it has with other nodes in the network. In directed networks, the degree of a node is the sum of its in-coming and out-going degrees. Highly connected nodes, i.e., nodes with high degree, are often referred to as hubs in biological networks. The degree distribution, P(k), is another property derived from the degree of nodes in a network; it gives the probability that a selected node has exactly k links. P(k) is obtained by counting the number of nodes N(k) with k = 1, 2, ... links and dividing by the total number of nodes N. The degree distribution allows us to distinguish between different classes of networks. For example, random networks show a Poissonian degree distribution when P(k) is plotted against k, indicating that most nodes have roughly the same number of links, with little deviation from the average degree of a node in the network. By contrast, a power-law degree distribution indicates that a few nodes interact with numerous other nodes while most interact with only a few (see Global Level).

Table 1-1. Different local properties which can be defined for a node in complex networks.

Property

Definition

Indegree or incoming degree

In directed networks, where the directionality of an interaction is taken into account, indegree refers to the number of incoming connections to a node of interest. In other words, indegree is the number of arrows that flow into the node under investigation.

Outdegree or outgoing degree

Outdegree refers to the number of edges which start from a node of interest and point to other nodes in the network. It is defined for directed networks, where each edge has an associated direction.

Degree or Connectivity

Degree or connectivity of a node refers to the total number of interactions it has in a network; the higher the connectivity (i.e., hub nodes), the greater the number of targets it interacts with. In directed networks, degree simply corresponds to the sum of the in- and out-degrees of a node.

Clustering coefficient

The clustering coefficient of a node reflects the extent to which the neighbors of a given node are interconnected among themselves relative to what is theoretically possible, and indicates the cohesiveness or local modularity of the network. An extension of this metric to the complete network, defined as the average clustering coefficient of all nodes, tells whether the network is modular or sparsely connected.

Betweenness

Betweenness centrality of a node measures the number of shortest paths between all pairs of nodes in the network that pass through a node of interest; the higher the number of paths that pass through a node, the more important it is.

Average path length

Average length of the shortest paths between all pairs of nodes in the network.

Closeness

Closeness centrality is defined as the inverse of the average length of all the shortest paths from a node of interest to all other nodes in the network. Note that closeness centrality defined this way implies that the higher the closeness value, the higher the importance (centrality) of a node.

Diameter

The diameter of a network is the length of the longest path among all the shortest paths defined between two nodes. It gives an estimate of the distance between the farthest nodes in the network.

Graph density

The density of a network is the ratio of the number of edges to the total number of possible edges.

Power law fit (exponent-alpha)

A power-law distribution function is fitted to the degree distribution of the network to assess whether the network is likely to exhibit a scale-free structure.
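The definitions of indegree, outdegree, degree and degree distribution given in this section can be illustrated with a short sketch. It assumes a directed network supplied as a list of (source, target) pairs; the toy edge list and the function names are hypothetical, and P(k) is computed exactly as described in the text, as N(k) divided by N.

```python
from collections import Counter


def degrees(edges):
    """Return indegree, outdegree and total degree dictionaries for a directed edge list."""
    indeg, outdeg = Counter(), Counter()
    nodes = set()
    for source, target in edges:
        outdeg[source] += 1
        indeg[target] += 1
        nodes.update((source, target))
    in_degree = {n: indeg[n] for n in nodes}
    out_degree = {n: outdeg[n] for n in nodes}
    total = {n: indeg[n] + outdeg[n] for n in nodes}  # degree = in + out
    return in_degree, out_degree, total


def degree_distribution(total_degree):
    """P(k) = N(k) / N: the fraction of nodes having exactly k links."""
    n_nodes = len(total_degree)
    counts = Counter(total_degree.values())  # N(k) for each observed k
    return {k: counts[k] / n_nodes for k in sorted(counts)}


# Hypothetical toy network in which node A acts as a small hub.
edges = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C")]
in_degree, out_degree, total_degree = degrees(edges)
p_k = degree_distribution(total_degree)

print(total_degree["A"])  # 3
print(p_k)                # {1: 0.25, 2: 0.5, 3: 0.25}
```

For real networks, P(k) computed this way would typically be plotted against k (often on logarithmic axes) to judge whether it resembles a Poissonian or a power-law distribution.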

Another important property at the local level is the clustering coefficient of a node, which describes how interconnected the neighbors of a given node are relative to what would be expected if all the neighbors were fully connected. Mathematically, it is defined as the ratio of the number of observed links between the neighbors of a node of interest to the total number of feasible links between all its immediate neighbors. The average clustering coefficient of a network, calculated as the mean of the clustering coefficients of all its nodes, gives a measure of cohesiveness in the network, which is also commonly referred to as the extent of modularity. The higher the clustering coefficient, the more modular the network. To assess the extent of cohesiveness, the clustering coefficients of real networks are often compared with those of random networks of similar size and degree distribution.
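The ratio described above can be sketched for an undirected network stored as a dictionary of neighbor sets. This is an illustration only: the toy graph and the function names are hypothetical, and the choice to report 0 for nodes with fewer than two neighbors (where the ratio is undefined) is an assumption of this sketch.

```python
from itertools import combinations


def clustering_coefficient(graph, node):
    """Observed links among the neighbors of `node` divided by the maximum possible."""
    neighbors = graph[node]
    k = len(neighbors)
    if k < 2:
        return 0.0  # assumed convention: undefined cases are reported as 0
    observed = sum(1 for u, v in combinations(neighbors, 2) if v in graph[u])
    feasible = k * (k - 1) / 2  # number of neighbor pairs
    return observed / feasible


def average_clustering(graph):
    """Mean of the clustering coefficients of all nodes in the network."""
    return sum(clustering_coefficient(graph, n) for n in graph) / len(graph)


# Hypothetical undirected toy network: a triangle A-B-C with a pendant node D.
graph = {
    "A": {"B", "C", "D"},
    "B": {"A", "C"},
    "C": {"A", "B"},
    "D": {"A"},
}

print(clustering_coefficient(graph, "A"))  # one of three neighbor pairs is linked
print(average_clustering(graph))
```

In this toy network only one of the three possible pairs of neighbors of A (B and C) is linked, so its coefficient is 1/3, whereas B and C each have a coefficient of 1.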

All the properties discussed so far concern the nodes in the network; however, a number of properties have also been defined for edges in a network. The most important of these is the path length between two nodes, which refers to the number of edges that one needs to traverse between two nodes of interest. Since there can be many alternative paths between two nodes, the shortest path, i.e., the path with the smallest number of links between the selected nodes, is usually taken as the path length. In directed networks, the path length from node A to node B may not be the same as that from B to A, reflecting the directionality in the network. An important global property which stems from path length is the average or mean path length of a network, which is the average of the shortest path lengths between all pairs of nodes and offers a measure of a network’s overall reach.
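Shortest path lengths in an unweighted network can be obtained with a breadth-first search, as in the sketch below. The toy directed cycle, the function names and the decision to average only over reachable ordered pairs of distinct nodes are assumptions made for illustration.

```python
from collections import deque


def shortest_path_lengths(graph, source):
    """Breadth-first search: number of edges on the shortest path from `source`."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def all_pair_lengths(graph):
    """Shortest-path lengths over all ordered, reachable pairs of distinct nodes."""
    return [
        d
        for s in graph
        for t, d in shortest_path_lengths(graph, s).items()
        if t != s
    ]


def average_path_length(graph):
    lengths = all_pair_lengths(graph)
    return sum(lengths) / len(lengths)


def diameter(graph):
    """Longest of all the shortest paths in the network."""
    return max(all_pair_lengths(graph))


# Hypothetical directed toy network: a three-node cycle A -> B -> C -> A.
cycle = {"A": {"B"}, "B": {"C"}, "C": {"A"}}

print(shortest_path_lengths(cycle, "A")["B"])  # 1
print(shortest_path_lengths(cycle, "B")["A"])  # 2
print(average_path_length(cycle))              # 1.5
print(diameter(cycle))                         # 2
```

The example also shows the asymmetry mentioned in the text: the path from A to B has length 1, while the path from B to A has length 2.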

In addition to the degree of a node, which tells how central or important a node in a network is, a number of other centrality measures have been defined in the literature. These include betweenness and closeness centrality, among other less popular definitions (Junker et al., 2006). Betweenness centrality, which is the number of shortest paths going through a node, is typically calculated using the Brandes algorithm (Brandes, 2001). Closeness is measured as the inverse of the average length of the shortest paths from a node of interest to all other nodes in the network. Since betweenness and closeness both use the shortest path lengths between all pairs of nodes in a graph, for cases where no path exists between a particular pair of nodes the shortest path length is usually taken as one less than the maximum number of nodes in the graph.
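The two centrality measures can be sketched for an unweighted network as follows. The betweenness function follows the general accumulation scheme of Brandes (2001) in a simplified form written for this illustration; it is not the implementation used in the thesis. Two assumptions should be noted: every ordered pair of nodes is counted, so values for an undirected network are twice those obtained when each pair is counted once, and unreachable nodes are simply ignored in the closeness calculation rather than being assigned a substitute path length.

```python
from collections import deque


def betweenness_centrality(graph):
    """Brandes-style betweenness for an unweighted graph {node: set(neighbors)}."""
    centrality = {v: 0.0 for v in graph}
    for s in graph:
        order = []                              # nodes in order of increasing distance
        predecessors = {v: [] for v in graph}
        sigma = {v: 0 for v in graph}           # number of shortest paths from s
        sigma[s] = 1
        dist = {s: 0}
        queue = deque([s])
        while queue:                            # breadth-first search from s
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:      # v lies on a shortest path to w
                    sigma[w] += sigma[v]
                    predecessors[w].append(v)
        delta = {v: 0.0 for v in graph}
        for w in reversed(order):               # accumulate dependencies backwards
            for v in predecessors[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                centrality[w] += delta[w]
    return centrality


def closeness_centrality(graph, node):
    """Inverse of the average shortest-path length from `node` to reachable nodes."""
    dist = {node: 0}
    queue = deque([node])
    while queue:
        v = queue.popleft()
        for w in graph[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    lengths = [d for t, d in dist.items() if t != node]
    return len(lengths) / sum(lengths) if lengths else 0.0


# Hypothetical undirected toy network: a path A - B - C.
path = {"A": {"B"}, "B": {"A", "C"}, "C": {"B"}}

print(betweenness_centrality(path))     # B lies on the only path between A and C
print(closeness_centrality(path, "B"))  # 1.0
```

In the toy path network, B receives a betweenness of 2 (the A to C and C to A paths both pass through it) and the highest closeness, which agrees with the intuition that it is the most central node.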

While a number of these properties have been studied in diverse kinds of cellular networks, and will be discussed in the respective chapters as appropriate, I summarize below some of the observations to give a flavor of their importance in understanding complex networks. Studies on the statistical properties of metabolic networks revealed that the distributions of the outgoing and incoming degrees follow a power law (Jeong et al., 2000). It was also shown, using undirected versions of these metabolic graphs, that they have a short average path length and a large clustering coefficient (Fell and Wagner, 2000). In protein-protein interaction networks it was shown that the degree distribution follows a power law and that the deletion of highly connected proteins is more likely to be lethal than that of poorly connected ones (Jeong et al., 2001). It was further shown that links between highly connected proteins tend to be suppressed while those between highly connected and poorly connected proteins are abundant, which was proposed as an attribute that allows cellular networks to attain robustness and decrease cross talk between different functional modules (Maslov and Sneppen, 2002). This tendency of highly connected proteins to avoid interactions with other highly connected proteins in a network has been referred to as the disassortative property. On the other hand, the observation that most real world networks have extremely small average path lengths is referred to as the small world effect (Watts and Strogatz, 1998).

1.1.2 Modular level

Another important level at which network organization is often studied is that of modules. Modules are seen in all kinds of complex systems, from groups of friends in social networks and websites dedicated to similar topics on the internet to groups of organisms that survive in a similar niche in an ecological food web. Modules are also evident in several engineered systems, from a simple computer chip to a more sophisticated supercomputer, wherein they are employed to create order and to organize the tasks dedicated to each of these fundamental units. Likewise, cellular processes have been proposed to be carried out in a highly modular manner (Hartwell et al., 1999). More generally, modules in biological networks refer to groups of genes, proteins or other cellular entities that work together to achieve a common task for the proper functioning of the cell (Alon, 2003; Hartwell et al., 1999; Ravasz and Barabasi, 2003; Ravasz et al., 2002). In fact, there are numerous examples of modules in a cellular context, such as protein-protein and protein-RNA complexes, which form physical modules; co-expressed gene clusters, which work together in a given biochemical process; and signaling modules, which gather extracellular cues to prepare an organism for variations in the environment. Evidence for the existence of modularity in cellular networks has mostly come from the calculation of the average clustering coefficient (see Table 1-1) of a wide variety of networks, which indicates the occurrence of a high number of interconnections between the neighbors of a node of interest. The average clustering coefficient, which is the mean of the clustering coefficients of all the nodes, is considered a proxy for modularity in networks. In the absence of modularity, the clustering coefficients of the real and the randomized network are comparable. The average clustering coefficient of most real networks is, however, significantly larger than that of a random network of equivalent size and degree distribution. For instance, the existence of modularity defined in this fashion has been convincingly shown for a number of biological networks, including metabolic, protein-protein interaction and transcriptional networks (Guelzim et al., 2002; Ravasz et al., 2002; Wagner, 2001; Wuchty, 2001). Although there is no definitive agreement on how modules in cellular networks can best be identified and what set of genes would constitute a module (Wolf and Arkin, 2003), it is now common knowledge that most biological systems can be divided into groups of genes that carry out discrete biological functions. Part of the difficulty in precisely determining the components of a module in cellular networks is that biological networks are hierarchical and scale-free structures (Ravasz and Barabasi, 2003; Ravasz et al., 2002) (see below). Modularity in these settings therefore means that the network can be split either into many modules, each containing only a few genes, or into a few modules, each harboring many genes. The hierarchical modular nature of cellular systems thus permits the definition of a module to be plastic, depending on the granularity at which one wishes to dissect a system.

The high clustering in cellular networks indicates that they are generally locally grouped, with various subgraphs of highly interconnected nodes forming the core, which is evidence supporting the occurrence of isolated functional modules. Subgraphs capture specific patterns of interconnections that characterize a given network at the modular level. However, not all subgraphs are equally significant in real networks, as indicated by a series of recent observations (Milo et al., 2002; Shen-Orr et al., 2002). Some subgraphs, or patterns of interconnections between nodes, appear in a network more often than expected by chance in random networks of the same size and degree distribution, and these are referred to as network motifs. Network motifs are analogous to sequence motifs in a set of homologous sequences, which are defined as patterns of amino acids or nucleotides that are more conserved than expected by chance. Different networks have been shown to be enriched for different motifs (Milo et al., 2002). For instance, transcriptional networks have been shown to harbor Feed-Forward Loops (FFLs) as the most abundant motif, while protein interaction networks have been shown to be rich in fully connected cliques, i.e., subgraphs in which all the nodes are connected to each other (Shen-Orr et al., 2002; Wuchty et al., 2003). The identification of motifs not only provides information about the type of local interconnections in the network but also allows one to understand their interplay with the rest of the network. Several lines of evidence support the biological relevance of motifs in networks. For example, the high degree of evolutionary conservation of motif constituents within the yeast protein interaction network and the convergent evolution of motifs observed in the transcriptional regulatory networks of diverse species both support their biological relevance (Conant and Wagner, 2003; Madan Babu et al., 2006; Wuchty et al., 2003).
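
To make the notion of a feed-forward loop concrete, the sketch below enumerates all triplets in which a regulator X controls a second regulator Y and both control a common target Z. The function name, the network representation and the node labels are hypothetical. The sketch only counts occurrences; assessing whether a motif is over-represented would additionally require comparison against randomized networks, which is not shown.

```python
def feed_forward_loops(succ):
    """Return all (X, Y, Z) with edges X->Y, Y->Z and X->Z, where X, Y, Z are distinct."""
    loops = []
    for x, x_targets in succ.items():
        for y in x_targets:
            if y == x:
                continue  # ignore autoregulation
            for z in succ.get(y, ()):
                if z not in (x, y) and z in x_targets:
                    loops.append((x, y, z))
    return sorted(loops)


# Hypothetical toy regulatory network: regulator -> set of regulated nodes.
toy = {
    "TF1": {"TF2", "geneA", "geneB"},
    "TF2": {"geneA"},
    "geneA": set(),
    "geneB": set(),
}
```

In this toy network the only feed-forward loop is formed by TF1, TF2 and geneA.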

In the case of a transcriptional regulatory network, a module is typically defined as a set of genes that are regulated by a common set of Transcription Factors (TFs). Under this definition, it is intuitive to expect that various cellular processes can be conveniently regulated by discrete and separable modules, which can coordinate the activities of many genes and carry out complex functions. Identifying transcriptional modules is therefore useful for understanding cellular responses to internal and external signals under different cellular conditions. Datasets of genome-wide gene expression and location analysis (ChIP-chip) are frequently used to identify transcriptional modules controlling a variety of cellular processes (Bar-Joseph et al., 2003; Ihmels et al., 2002; Segal et al., 2003; Stuart et al., 2003; Wu et al., 2006). Several of these studies have focused on yeast and other model organisms owing to the availability of extensive datasets on gene expression and transcriptional regulatory interactions, together with binding site information. From a computational perspective, typical approaches for module discovery have applied clustering and motif-discovery algorithms to gene expression data to find sets of co-regulated genes, with variations that incorporate previously known information on cellular functions or promoter sequences. Some studies have also used model-based approaches such as Bayesian networks to infer modules and understand regulatory network architectures (Segal et al., 2003).
Although several methods have been developed to identify regulatory modules from expression data, the most frequently used implementations assume that genes co-expressed under similar conditions are likely to belong to the same regulatory module (Ihmels et al., 2004; Ihmels et al., 2002; Segal et al., 2003), while more sophisticated approaches integrate additional data sources such as TF binding data, motif information or functional annotation (Bar-Joseph et al., 2003; Ihmels et al., 2002; Pilpel et al., 2001).
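
Under the simplest reading of the definition given above, a module can be obtained by grouping genes that share exactly the same set of regulators. The sketch below illustrates only this strict definition; the function name and the toy data are hypothetical, and it is not a reimplementation of any of the expression-based methods cited.

```python
from collections import defaultdict


def modules_by_regulators(regulators_of):
    """Group target genes by their exact set of regulating TFs."""
    modules = defaultdict(set)
    for gene, tfs in regulators_of.items():
        if tfs:  # genes without known regulators are not assigned to a module
            modules[frozenset(tfs)].add(gene)
    return dict(modules)


# Hypothetical toy data: target gene -> set of TFs known to regulate it.
toy = {
    "g1": {"TF1", "TF2"},
    "g2": {"TF2", "TF1"},
    "g3": {"TF3"},
    "g4": set(),
}
```

Here g1 and g2 form one module because both are regulated by TF1 and TF2, while g3 forms a module of its own. Real datasets rarely partition this cleanly, which is consistent with the nested organization discussed in the following paragraphs.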

Although the different approaches to identifying modules have produced distinct outcomes in terms of the number and size of the resulting modules, the general consensus has been that regulatory networks are highly interconnected and that very few modules are entirely separable from the rest of the network. The major conclusion has therefore been that modules are frequently nested within each other in a hierarchical fashion at different levels. In fact, an analysis of the distribution of the commonly seen motifs across the identified modules in transcriptional networks suggests that network motifs themselves do not exist in isolation but rather integrate to form part of the modules by sharing some of their edges (Dobrin et al., 2004; Resendis-Antonio et al., 2005). Thus, many small, highly connected motifs group into a few larger modules, which in turn integrate into even larger ones. These nested modules are interconnected through local regulatory hubs. Such an organization not only explains the hierarchical organization that is seen in other cellular networks (Ravasz and Barabasi, 2003) but also intuitively suggests a capacity for rapid regulatory changes through regulatory hubs, with integration and fine tuning of the regulatory processes by downstream TFs, thereby linking several modules in a hierarchical manner.

As the components of a specific motif often interact with nodes outside the motif, it is important to understand how different motifs interact with each other and with the rest of the network in different kinds of networks. While recent work shows that different motifs aggregate to form large motif clusters in transcriptional networks, the generality of these findings is still under debate. However, since motifs are present in all kinds of biological networks that have been examined to date (Milo et al., 2002), it is likely that the aggregation of motifs into motif clusters and modules is a generic property of most biological and real world networks.

1.1.3 Global level

One of the most important developments in our understanding of complex systems is the observation that, despite the remarkable diversity of complex networks in nature, their architecture is governed by a few simple principles. For example, most complex networks were long believed to follow degree distributions like that proposed by the Erdos-Renyi model, according to which the degree distribution, P(k), plotted against the degree k of a complex network should follow a Poisson distribution. However, it is now clear that most real world complex systems, including biological networks, follow a scale-free topology with a power-law degree distribution, wherein the degree distribution, P(k), plotted against the degree k on a log-log plot shows a straight line with slope -γ, where the exponent γ typically lies between 2 and 3. It has also been shown that in both the Erdos-Renyi model and the scale-free model proposed by Barabasi and Albert (Barabasi and Albert, 1999), the clustering coefficient is independent of the degree (Barabasi and Oltvai, 2004). Nevertheless, a major difference between the two network models is that in the former most nodes have an approximately equal number of links, all close to the average degree in the network, indicative of a Gaussian/Poissonian degree distribution, while the latter is characterized by the presence of a large number of nodes that are poorly connected and a relatively small number of nodes that are highly connected (also referred to as hubs). Owing to the scaling nature of its degree distribution, the Barabasi-Albert or scale-free model exhibits a straight line on a log-log plot of the degree distribution against the degree of a node.
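
The straight-line behaviour on a log-log plot can be illustrated with the sketch below, which tabulates P(k) from a list of node degrees and estimates the slope by ordinary least squares on the log-transformed values. The function names are hypothetical, and a least-squares fit on log-log data is used here only as a simple illustration; it is known to be a crude estimator of power-law exponents and is not necessarily the procedure used in the studies cited.

```python
import math
from collections import Counter


def degree_distribution(degrees):
    """Fraction of nodes P(k) having each observed degree k."""
    counts = Counter(degrees)
    n = len(degrees)
    return {k: c / n for k, c in counts.items()}


def loglog_slope(pk):
    """Least-squares slope of log P(k) against log k (zero entries are skipped)."""
    points = [(math.log(k), math.log(p)) for k, p in pk.items() if k > 0 and p > 0]
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return sxy / sxx


# Idealized (unnormalized) power law with exponent 2.5, for illustration only.
ideal = {k: k ** -2.5 for k in range(1, 51)}
```

For the idealized distribution above the estimated slope equals -2.5 up to floating-point error, i.e., the negative of the exponent γ.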

Yet another class of networks proposed in the literature is that of hierarchical scale-free networks, which have all the properties of scale-free networks and in addition exhibit a slope of -1 when the clustering coefficient is plotted against the degree of a node on a log-log scale. This indicates an organization wherein sparsely connected nodes are part of highly clustered areas, with communication between the different highly clustered neighborhoods being maintained by a few hubs. It is increasingly believed that most real world complex networks obey this hierarchical scale-free modular structure (Ravasz and Barabasi, 2003; Ravasz et al., 2002; Yu and Gerstein, 2006).
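
The three topologies described in this section can be summarized compactly as follows. The expressions restate the text in standard notation; the symbol ⟨k⟩ for the mean degree is introduced here for illustration, and the Poisson form holds for the Erdos-Renyi model in the limit of large networks.

```latex
\begin{align}
P(k) &= \frac{e^{-\langle k \rangle}\,\langle k \rangle^{k}}{k!}
  && \text{(Erdos-Renyi random network)} \\
P(k) &\sim k^{-\gamma}, \quad 2 < \gamma < 3
  && \text{(scale-free network)} \\
C(k) &\sim k^{-1}
  && \text{(hierarchical scale-free network)}
\end{align}
```

Taking logarithms of the second and third relations gives log P(k) = -γ log k + const and log C(k) = -log k + const, which is why both appear as straight lines, with slopes -γ and -1 respectively, on a log-log plot.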

Although the hierarchical nature of networks has not been extensively explored for all cellular networks, there is extensive evidence that most of them, from protein-protein interaction and transcriptional regulatory networks to metabolic networks, at least exhibit a scale-free topology (Giot et al., 2003; Guelzim et al., 2002; Jeong et al., 2001; Wagner, 2001). In such networks, most proteins or cellular entities participate in only a few interactions while a few participate in a disproportionately large number of interactions, a signature of scale-free networks with an inherent power-law degree distribution. Although a large number of cellular networks have been shown in recent years to have a scale-free topology, not all of them are scale-free graphs. For instance, in transcriptional regulatory networks the incoming connectivity, defined as the number of transcription factors regulating a target gene and thus quantifying the combinatorial effect of gene regulation, was observed to follow an exponential distribution in both Escherichia coli and Saccharomyces cerevisiae (Guelzim et al., 2002; Thieffry et al., 1998). The exponential behaviour indicates that most target genes are regulated by a similar number of factors and could reflect limits on the number of transcription factors that can affect a target gene, owing to constraints on the available intergenic space and on the number of proteins that can simultaneously act on a promoter region. On the other hand, the outgoing connectivity, which is the number of target genes regulated by each transcription factor, was found to be distributed according to a power law, in contrast to the incoming connectivity. This is indicative of a hub-containing network structure, in which a select set of transcription factors participates in the regulation of a disproportionately large number of target genes.
These hubs can be viewed as ‘global regulators’, as opposed to the remaining transcription factors, which can be considered ‘fine tuners’.
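
The incoming and outgoing connectivities defined above can be tabulated directly from a list of regulatory interactions, as in the sketch below. The function name and the toy edge list are hypothetical, and each interaction is assumed to be listed only once.

```python
from collections import Counter


def connectivity(edges):
    """Return (incoming, outgoing) connectivity from (TF, target) pairs.

    Incoming: number of TFs regulating each target gene.
    Outgoing: number of target genes regulated by each TF.
    """
    in_degree = Counter(target for _, target in edges)
    out_degree = Counter(tf for tf, _ in edges)
    return in_degree, out_degree


# Hypothetical toy interactions in which TF1 acts as a hub.
edges = [
    ("TF1", "g1"),
    ("TF1", "g2"),
    ("TF1", "g3"),
    ("TF1", "TF2"),
    ("TF2", "g3"),
]
in_degree, out_degree = connectivity(edges)
```

In this example TF1 has an outgoing connectivity of four, whereas no gene has an incoming connectivity above two, a pattern that loosely mirrors the contrast between the two distributions described above.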

In the case of transcriptional regulatory networks, it has been shown by both top-down and bottom-up approaches for determining hierarchy that they possess a multi-layer hierarchical modular structure (Ma et al., 2004; Yu and Gerstein, 2006). Interestingly, transcription networks do not seem to possess extensive feedback regulation at the level of transcription, meaning that transcriptional regulation of TFs at the top by TFs at the bottom of this hierarchical structure is infrequent, which points to the prevalence of alternative forms of feedback control of transcription. Typically such feedback occurs through protein-protein interactions at the post-translational level or through a complex interplay of cellular entities that control the activity of TFs by changing their conformation in response to continuously varying intra- and extra-cellular conditions (Martinez-Antonio et al., 2006; Yu and Gerstein, 2006). It has also been observed that the TFs in the middle of this hierarchy (often from levels 2 and 3, measured from the bottom) regulate more direct targets than those at the top, suggesting that these middle-level TFs act as managers and are indeed control bottlenecks for the cellular transcriptional response (Yu and Gerstein, 2006).
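
One simple way to assign hierarchy levels from the bottom up is sketched below. It is a simplified illustration under stated assumptions rather than the exact procedure of the cited studies: TFs that regulate no other TF (ignoring autoregulation) are placed at level 1, and every other TF is placed one level above the lowest-level TF it regulates. The function name and toy data are hypothetical.

```python
def hierarchy_levels(tf_targets):
    """Assign bottom-up levels to TFs.

    `tf_targets` maps each TF to the set of TFs it regulates
    (non-TF target genes are assumed to have been removed).
    TFs that cannot reach any bottom-level TF remain unassigned.
    """
    levels = {tf: 1 for tf, targets in tf_targets.items() if not (targets - {tf})}
    frontier = set(levels)
    level = 1
    while frontier:
        level += 1
        # TFs not yet placed that directly regulate a TF placed in the previous round.
        frontier = {
            tf
            for tf, targets in tf_targets.items()
            if tf not in levels and targets & frontier
        }
        for tf in frontier:
            levels[tf] = level
    return levels


# Hypothetical toy network; C regulates only itself and is therefore at the bottom.
toy = {"A": {"B", "C"}, "B": {"C"}, "C": {"C"}, "D": {"B"}}
```

With this toy input, C is at level 1, A and B are at level 2, and D is at level 3.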

While a number of other network properties, such as diameter and graph density, have also been defined in network biology (see Table 1-1), they are not of immediate relevance to the work presented in this thesis and hence are not discussed in detail.

1.2 NETWORKS IN MOLECULAR BIOLOGY

1.2.1 Methods to construct transcriptional regulatory networks

At an abstract level, the regulatory interactions linking TFs to their transcriptionally controlled target genes (TGs) in an organism can be viewed as a directed graph, in which the TFs and TGs represent the nodes and the regulatory interactions that connect them represent the edges. Typically the resulting network is a complex, hierarchical, multilayered graph that can be studied at several levels of detail. However, at a more fundamental level, the organization of the transcriptional regulatory machinery and the principles involved are considerably different in the two major kingdoms of life, bacteria and eukarya. In bacteria, transcription and translation happen in the same compartment, i.e., the cytoplasm, and transcriptional control can be considered to occur mostly at the DNA sequence level through the use of cis-regulatory elements and the organization of contiguous genes on the same strand of DNA into operons. In eukaryotic genomes, however, the process of transcriptional regulation is highly complex and is co-ordinated at three major hierarchical levels. The first is the DNA sequence level, i.e., the linear organization of transcription units and regulatory sequences. Co-regulated genes organized into clusters in the genome constitute part of these individual functional units. The second is the chromatin level, which allows switching between different functional states, i.e., between a state that suppresses transcription and one that is permissive for gene activity. This level involves changes in chromatin structure that are controlled by the interplay between histone modification, DNA methylation, and a variety of repressive and activating mechanisms. This regulatory level is linked with the control mechanisms from level one that switch individual genes in the cluster on and off, depending on the properties of the promoter. The third level is the nuclear level, which includes the dynamic 3D spatial organization of the genome inside the cell nucleus. The nucleus is structurally and functionally compartmentalized, and epigenetic regulation of gene expression may involve repositioning of loci in the nucleus through changes in large-scale chromatin structure. All these differences add a layer of complexity and sophisticated control to the inherent structure, functionality and dynamics of transcriptional networks in eukarya in comparison to their bacterial counterparts. Despite these fundamental differences, several basic principles of their organization and structure from a network perspective have been shown to be similar in both kingdoms (Guelzim et al., 2002; Lee et al., 2002; Milo et al., 2002; Shen-Orr et al., 2002; Thieffry et al., 1998; Yu and Gerstein, 2006).

Despite enormous interest in understanding transcriptional networks across organisms, our knowledge of genome-scale transcriptional interaction graphs has been very limited and is mostly restricted to model organisms such as Escherichia coli and Saccharomyces cerevisiae, for which extensive information is available (Gama-Castro et al., 2008; Lee et al., 2002). Transcriptional interactions in an organism have traditionally been identified from small-scale assays, which are documented in regulatory network databases through extensive manual curation efforts (Baumbach et al., 2007; Gama-Castro et al., 2008; Makita et al., 2004; Matys et al., 2006), or are obtained from high-throughput screens such as ChIP-chip or ChIP-seq, which allow the identification of regulatory interactions for a large set of TFs in an organism (Grainger et al., 2005; Lee et al., 2002). Yet another, lower-resolution, high-throughput approach to screening the whole genome for targets of a TF is to knock out the TF gene and perform a whole-genome microarray expression analysis (Devaux et al., 2001). Table 1-2 summarizes these frequently employed low- and high-throughput experimental techniques for the unambiguous identification of regulatory interactions in an organism.

Table 1-2. Different low- and high-throughput strategies for studying and probing protein-DNA interactions. High-throughput technologies such as ChIP-chip, ChIP-seq and PBMs are frequently employed to elucidate regulatory networks on a genome-wide scale.

Method Description

Band shift

Since DNA molecules are more flexible than proteins, they tend to exhibit much higher mobility in a polyacrylamide gel. Thus, under favourable conditions, free DNA can be distinguished from DNA bound to proteins owing to the difference in molecular weight (Garner and Revzin, 1981).

DNA footprinting

In DNA footprinting, a 5’ end-labeled double-stranded DNA is partially degraded by DNase both in the presence and in the absence of the TF. The degraded fragments are then loaded onto a gel and visualized by autoradiography. Since the region where the protein is bound to the DNA will be protected from DNase, no fragments are seen for those regions. Therefore, by comparing lanes, one can identify the binding site (Galas and Schmitz, 1978).

FRET based binding site identification

In this method libraries of double-stranded DNA, each with one of the two fluorophores attached to its end, are used. Protein binding to two pieces of DNA, one from each library and each comprising half of the binding site’s sequence, induces a FRET signal, which can then be used to detect protein bound to DNA (Heyduk and Heyduk, 2002).

Binding site detection using unnatural base analog

In this approach a library of DNA sequences with an unnatural base analog (one for each base) is used. Following selection for protein-bound DNA molecules, the DNA is cleaved specifically at the modified base. The site of incorporation can be identified by gel electrophoresis, by running fragments generated from the unbound sample next to the fragments generated from the bound sample. Since the presence of an analog in the binding site impedes protein binding, this results in a depletion of the protein-bound pool (Storek et al., 2002).

ChIP-chip and ChIP-seq techniques

The DNA binding protein is tagged with an epitope and is expressed in a cell. The bound protein is covalently linked to DNA by using an in vivo cross-linking agent such as formaldehyde. After cross-linking, the DNA is sheared and the protein–DNA complex is pulled down using an antibody for the tag. Reversal of the cross-link releases the bound DNA, allowing the sequence of the fragments to be determined by hybridization to a microarray (ChIP-chip) or by sequencing (ChIP-seq). In ChIP-chip experiments, intergenic regions are spotted onto a microarray chip. Following a chromatin immunoprecipitation step, the bound fragments are reverse cross-linked and hybridized onto the microarray chip (Lee et al., 2006). In ChIP-seq experiments, the bound fragments are directly sequenced using 454/Solexa/Illumina sequencing technology. The sequences are then computationally mapped back to the genome sequence (Johnson et al., 2007).

DNA adenine methyl transferase Identification (DamID)

In the DamID technique, the protein of interest is fused to an E. coli protein, DNA adenine methyl transferase (Dam). Dam methylates the N6 position of the adenine in the sequence GATC, which occurs at reasonably high frequency in any genome (1 site in 256 bases). Upon binding DNA, the Dam protein preferentially methylates adenine in the vicinity of binding. Subsequently, the genomic DNA is digested by the DpnI and DpnII restriction enzymes that cleave within the non-methylated GATC sequence, and remove fragments that are not methylated. The remaining methylated fragments are amplified by selective PCR and quantified using a microarray (Greil et al., 2006).

Protein binding universal DNA microarrays (PBMs)

This is an in vitro method to probe protein–DNA interactions. A DNA binding protein of interest is epitope tagged, purified and bound directly to a double-stranded DNA microarray spotted with a large number of potential binding sites. Labeling with a fluorophore-conjugated antibody for the tag allows detection of binding sites from the significantly bound spots (Bulyk et al., 2004).

1.2.2 Methods to construct functional linkage networks

Traditionally, the function of a protein was defined using a number of low-throughput approaches, such as mutagenesis of residues or whole proteins, which allowed the identification of phenotypes for follow-up analysis. However, it is increasingly becoming clear that this rationale is limited in its ability to infer the function of proteins, failing for those that exhibit a mild phenotype or are not expressed under standard experimental conditions. In addition, since most proteins associate dynamically with a number of other cellular entities during their lifetime, the traditional notion of identifying the function of a protein by isolating it from the rest of the cellular machinery can be misleading for a majority of them. This realization, together with the availability of experimentally determined protein-protein interaction maps for diverse model organisms, has given rise to the use of these datasets for delineating the biological processes, pathways and complexes that proteins take part in (Aranda et al., ; Bader et al., 2003; Breitkreutz et al., 2008). Indeed, there is now observable overlap and informative variation between different types of low- and high-throughput experiments (Shoemaker and Panchenko, 2007), which provides a convincing reason for exploiting them as complementary approaches in unraveling the functions of proteins. Recent years have also seen an explosion in the number of methods and databases that provide functional associations (both direct physical and indirect contextual interactions) between proteins using both experimental and computational means. I present an extensive list of these resources in Table 4-2 of Chapter 4, where I also provide a more in-depth discussion of network-based approaches for function prediction.

Briefly, experimental approaches employed for constructing functional association networks mostly comprise data from protein-protein interaction screens, followed by co-expression networks, which consist of gene pairs showing significant correlation in their expression profiles across conditions, derived from microarray datasets (Luo et al., 2007; Ruan et al., ; Wang et al., 2009). More recently, genetic interactions, which measure the fitness defects of double mutants compared to those of the individual mutants, are also being employed for constructing these functional linkage networks (Butland et al., 2008; Costanzo et al.). These high-throughput experimental approaches not only increase the confidence of an association but also give the cellular context of the protein, providing a complementary view to the traditional function prediction paradigm.
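
The construction of a co-expression network can be sketched as follows: the Pearson correlation is computed for every pair of expression profiles, and pairs above a cutoff are linked. The function names, the toy profiles and the fixed cutoff of 0.9 are assumptions made for illustration; a real analysis would assess the statistical significance of each correlation, and profiles with no variation across conditions would have to be excluded beforehand.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equally long expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def coexpression_edges(profiles, threshold=0.9):
    """Link gene pairs whose profiles correlate at or above `threshold`."""
    return sorted(
        (g, h)
        for g, h in combinations(sorted(profiles), 2)
        if pearson(profiles[g], profiles[h]) >= threshold
    )


# Hypothetical expression profiles measured across four conditions.
profiles = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [2.0, 4.0, 6.0, 8.5],
    "g3": [4.0, 3.0, 2.0, 1.0],
}
```

In this toy dataset only g1 and g2 are linked, whereas g3 is perfectly anti-correlated with g1 and remains unconnected under a positive-correlation cutoff.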

In addition to the experimental methods, several computational methods have been proposed for constructing protein-protein associations from sequence data alone. These include the genome context methods, namely gene fusion, gene cluster or gene order conservation, operon rearrangement and protein phylogenetic profiles. The gene fusion approach detects the fusion of two genes into a single protein-coding gene in one of the sequenced genomes and takes this as evidence of a strong functional association (Enright et al., 1999; Marcotte et al., 1999a). The method of gene order conservation aims to identify pairs of genes that consistently cluster in the immediate vicinity of each other in a number of genomes, suggesting a strong functional link in prokaryotic genomes, which are rich in operons (Dandekar et al., 1998; Overbeek et al., 1999). The method of operon rearrangement links any pair of genes in a genome as long as their orthologs are predicted with high confidence to be organized in an operon in at least one sequenced genome (Janga et al., 2005; Rogozin et al., 2002; Snel et al., 2002). The power of this approach depends on the quality of operon prediction methods, which have been shown to reach ~90% accuracy in most sequenced genomes (Brouwer et al., 2008; Moreno-Hagelsieb and Collado-Vides, 2002). Yet another approach, not based on genomic proximity, is that of phylogenetic profiles. In this method a vector describing the presence or absence of a gene across all the analyzed genomes is constructed, and the vectors are compared to identify genes with the most correlated profiles, as a measure of functional linkage. The rationale here is that two proteins showing similar profiles, i.e., coordinated in their evolutionary gain and loss, are expected to be functionally related (Gaasterland and Ragan, 1998; Pellegrini et al., 1999). Modified versions of this approach take into account the phylogenetic signal of the genomes employed and/or the redundancy in the genome sequence information (Barker and Pagel, 2005; Date and Marcotte, 2003; Moreno-Hagelsieb and Janga, 2008).
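
The comparison of phylogenetic profiles can be illustrated with the sketch below, in which each gene is represented by a presence/absence vector across genomes. The Jaccard index is used here as the similarity measure purely as an assumption for illustration; the cited studies employ other measures, and the function name and toy profiles are hypothetical.

```python
def jaccard(profile_a, profile_b):
    """Genomes containing both genes divided by genomes containing at least one."""
    both = sum(1 for a, b in zip(profile_a, profile_b) if a and b)
    either = sum(1 for a, b in zip(profile_a, profile_b) if a or b)
    return both / either if either else 0.0


# Hypothetical presence (1) / absence (0) profiles across six genomes.
profiles = {
    "geneA": [1, 1, 0, 1, 0, 0],
    "geneB": [1, 1, 0, 1, 0, 1],
    "geneC": [0, 0, 1, 0, 1, 0],
}
```

Here geneA and geneB co-occur in three of the four genomes in which either is present, giving a similarity of 0.75, whereas geneA and geneC never co-occur and score zero.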

Recently, the integration of different types of interaction data into genome-wide functional linkage maps has gained much popularity for functional inference, as these integrated maps boost not only the coverage but also the confidence of an association when assessing protein function. One of the first studies to demonstrate the power of integrating different types of interaction data was by Marcotte and colleagues, who put together diverse kinds of computational genome context inferences (Marcotte et al., 1999b). This was followed by a number of other methods, such as those implemented in the STRING and PROLINKS databases, among other focused studies (Bowers et al., 2004; Hu et al., 2009; Jensen et al., 2009; Massjouni et al., 2006). Typically, edge weights in these networks correspond to integrated interaction probabilities, obtained by first scoring each of the methods independently against a set of gold standard interactions and then combining the scores in a Bayesian fashion, assuming that the scores obtained by each method are independent of each other. More complex methods take into account the dependence and correlation between methods to develop a regression model for scoring the integrated interactome (Linghu et al., 2008; Zhao et al., 2008). Nevertheless, all of them boil down to constructing a network with either weighted or unweighted edges, which is then used for propagating annotations to uncharacterized members using the network-based approaches discussed in Chapter 4.
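
Under the independence assumption mentioned above, one common way to combine the per-method probabilities of an association is to compute the probability that at least one line of evidence is correct. The sketch below shows this rule; it is an illustrative assumption and not necessarily the exact scoring scheme of any of the databases cited, and the function name is hypothetical.

```python
def combine_independent(probabilities):
    """Probability that at least one of several independent evidence sources is correct."""
    miss = 1.0  # probability that every source is wrong
    for p in probabilities:
        if not 0.0 <= p <= 1.0:
            raise ValueError("each probability must lie between 0 and 1")
        miss *= 1.0 - p
    return 1.0 - miss
```

For example, two methods supporting the same association with probabilities 0.6 and 0.5 yield a combined probability of 1 - (0.4 × 0.5) = 0.8, which is higher than either score alone.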

1.2.3 Methods to construct post-transcriptional regulatory networks

Gene expression is a highly controlled process that is known to be regulated at several levels in eukaryotic organisms. Although messenger RNAs have traditionally been viewed as passive molecules in the pathway from transcription to translation, there is increasing evidence that their metabolism is controlled by a class of proteins called RNA-binding proteins (RBPs) (Glisovic et al., 2008; Keene, 2007; Mata et al., 2005). In eukaryotes, the fact that transcription and translation occur in different compartments allows for a plethora of options to control RNA at the post-transcriptional level, including splicing, polyadenylation, transport, mRNA stability, localization and translational control (Glisovic et al., 2008; Keene, 2007). Although some early studies revealed the involvement of RBPs in the transport of mRNA from the nucleus to the site of translation, increasing evidence now suggests that RBPs regulate almost all of the post-transcriptional steps.

The development of several high-throughput approaches has increased the amount of data on the targets of RBPs in diverse organisms (see Table 5-3 in Chapter 5 for a detailed overview of these methods and techniques). These techniques are not discussed here to avoid redundancy. Data on RBPs and their targets can be used to construct an RBP-RNA interaction network, which is also typically referred to as a post-transcriptional regulatory network. This network is represented as a directed network in which each edge corresponds to a regulatory link between two nodes (an RBP and its target RNA), similar to the directed networks discussed above for transcriptional regulation. In this directed network, one set of nodes comprises the RBPs, which act as the regulatory proteins, while the other set of nodes comprises the target RNAs, which are encoded by either protein-coding or non-protein-coding genes. A regulator node and a target node are joined by an arrow that starts at the regulator and points towards the target. The target RNA may encode proteins of diverse function, including other RBPs. The network can also contain loops, in which a link starts from an RBP and targets the same RBP; this is typically referred to as autoregulation of an RBP. Such a loop suggests that the RBP can bind to its own RNA and control its metabolism at the transcript level. There are several examples suggesting the autoregulation of RBPs at the post-transcriptional level. For instance, in humans, RBPs such as AUF1, HuR, KSRP, NF90, TIA-1 and TIAR were reported to associate with their own mRNAs and with those encoding other RBPs (Pullmann et al., 2007).
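
The directed representation described above can be sketched in a few lines of Python. This is a minimal illustration and not the data structure used in this thesis: the class and method names are assumptions, the node labels (`RBP_A`, `RBP_B`, `gene_x`) are hypothetical, and a target RNA is identified here by the name of the gene that encodes it, so that an RBP binding its own transcript appears as a self-loop.

```python
from collections import defaultdict


class PostTranscriptionalNetwork:
    """Directed RBP -> target RNA network stored as adjacency sets."""

    def __init__(self):
        # Maps each RBP to the set of RNAs it binds (outgoing edges).
        self.targets = defaultdict(set)

    def add_interaction(self, rbp, rna):
        """Add a directed edge from the regulator (RBP) to the target RNA."""
        self.targets[rbp].add(rna)

    def out_degree(self, rbp):
        """Number of distinct RNAs bound by the given RBP."""
        return len(self.targets.get(rbp, ()))

    def in_degree(self, rna):
        """Number of distinct RBPs that bind the given RNA."""
        return sum(1 for bound in self.targets.values() if rna in bound)

    def autoregulators(self):
        """RBPs that bind their own transcript (self-loops), sorted by name."""
        return sorted(rbp for rbp, bound in self.targets.items() if rbp in bound)


# Hypothetical interactions, for illustration only.
network = PostTranscriptionalNetwork()
network.add_interaction("RBP_A", "RBP_A")   # autoregulation
network.add_interaction("RBP_A", "gene_x")  # RBP_A binds the gene_x transcript
network.add_interaction("RBP_B", "RBP_A")   # RBP_B binds the RBP_A transcript
```

Storing only outgoing edges keeps the sketch short; a larger network would usually also keep a reverse index so that the in-degree of a target does not require a scan over all regulators.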

Due to the availability of the network of post-transcriptional interactions for a considerable fraction of RBPs in model systems such as S. cerevisiae (Hogan et al., 2008), it has become possible to address several questions concerning the structure and organization of post-transcriptional networks directed by RBPs. Chapter 5 focuses on studying these properties by directly analyzing the currently available post-transcriptional regulatory network in the budding yeast.

1.2.4 Methods to construct other classes of cellular and biological networks

The development of several high-throughput approaches in the last decade has not only increased the amount of information that can be gathered on the transcriptional, post-transcriptional and functional organization of an organism, but has also made it possible to begin uncovering the principles that hold these levels of organization together. This is mainly because of the amount of information that can now be collected by interrogating the cell at different levels of detail. For instance, modern techniques enable us to identify protein-protein interactions, genetic interactions, metabolic maps and small molecule interactions at a whole-organism level. While a complete discussion of all the methods and techniques used to identify the respective interactomes is beyond the scope of this thesis, I outline below some of the commonly employed approaches for identifying the interaction graphs for each of these types of interactions occurring in the cell.

Perhaps the most common interaction graphs, studied since the early days of genome sequencing, are protein interaction networks. A number of approaches for studying them have been reported in the literature, including yeast two-hybrid (Y2H) (Fields and Song, 1989), the protein fragment complementation assay (PCA) (Pelletier et al., 1998), affinity purification coupled with mass spectrometry (AP-MS) (Babu et al., 2009a; Babu et al., 2009b; Gavin et al., 2002), protein chips (Fasolo and Snyder, 2009; Kung and Snyder, 2006), phage display (McCafferty et al., 1990), fluorescence resonance energy transfer (FRET) (Jares-Erijman and Jovin, 2003) and surface plasmon resonance (SPR) (Slavik and Homola, 2006). For a more extensive discussion of the protocols and methods for identifying protein interactions, as well as of new developments in this area, the reader is referred to recent reviews (Levy and Pereira-Leal, 2008; Shoemaker and Panchenko, 2007).

Another commonly studied class of networks is that of metabolic networks. In these networks, the metabolites and the enzymes catalyzing metabolic reactions are represented as the nodes and edges of a directed network. Most of the work on understanding metabolic networks relies on either manually curated or semi-automated metabolic databases such as the Kyoto Encyclopedia of Genes and Genomes (KEGG) and MetaCyc, which are available for a wide range of model organisms (Caspi et al., 2008; Grossetete et al., ; Kanehisa et al., 2008). In addition to the metabolic maps available for diverse organisms, several groups also study and compile the metabolic reactions of a model organism of interest, which are then used for follow-up analysis of the metabolic circuitry (Duarte et al., 2007; Durot et al., 2009; Ma et al., 2007).
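
One way to picture this representation is the following minimal Python sketch, in which metabolites are nodes and each enzyme labels the directed edges from the substrates to the products of the reaction it catalyzes. The function name and the input format are assumptions made for illustration and do not correspond to the KEGG or MetaCyc data models; the two reactions are the first steps of glycolysis, treated here as irreversible for simplicity.

```python
def build_metabolite_graph(reactions):
    """Build a directed, enzyme-labelled metabolite graph.

    reactions: iterable of (enzyme, substrates, products) tuples.
    Returns a dict mapping each (substrate, product) edge to the set of
    enzymes catalyzing a reaction that converts one into the other.
    """
    edges = {}
    for enzyme, substrates, products in reactions:
        for substrate in substrates:
            for product in products:
                edges.setdefault((substrate, product), set()).add(enzyme)
    return edges


# First two steps of glycolysis, used purely as an illustration.
reactions = [
    ("hexokinase", ["glucose", "ATP"], ["glucose-6-phosphate", "ADP"]),
    ("glucose-6-phosphate isomerase", ["glucose-6-phosphate"], ["fructose-6-phosphate"]),
]
edges = build_metabolite_graph(reactions)
```

Note that this simple construction also links cofactors such as ATP and ADP to every other metabolite in the same reaction, so analyses of network structure would typically need to decide how such highly connected metabolites should be handled.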

Organisms respond to continuous variations in internal and external cellular conditions by orchestrating their responses according to the environmental challenges they face. This involves a complex network of interactions among proteins, RNAs, metabolites and several other cellular entities, which undergoes rewiring when perturbed by small molecules such as chemicals or drugs. The interactions between different chemicals and cellular entities can be represented in the form of a network, the so-called Drug-Target network. Recent years have seen the development of a number of approaches, both computational and experimental, for the identification and elucidation of the molecular targets of a drug on a genomic scale (Apsel et al., 2008; Brewerton, 2008; Fabian et al., 2005; Hillenmeyer et al., 2008; Ho et al., 2009; Jacob and Vert, 2008; Kuhn et al., 2008; Paolini et al., 2006; Whitehurst et al., 2007; Yamanishi et al., 2008). The cellular target space, which contains the targets of drugs, can be considered to comprise predominantly three components, namely the protein-protein, metabolic and transcriptional interaction networks. While the vast majority of drugs target the protein-protein and metabolic components, only a limited number of targets have been identified to date in the transcriptional pool (Brennan et al., 2008; Goh et al., 2007; Lage et al., 2007; Lee et al., 2008; Yildirim et al., 2007). Indeed, the most common therapeutic targets of established drugs belong to either the protein kinase or the receptor families, with enzymes and ion channels forming the second most predominant class of targets (Wishart et al., 2008). This explains the increased attention towards understanding the biophysics of protein-protein contacts in the context of drug targets, as these protein classes are major players in protein-protein interactions (Archakov et al., 2003). Table 1-3 shows different methods which are used for the construction of Drug-Targ
