Computational Methods for Identifying Conserved Protein Complexes between Species from Protein Interaction Data

COMPUTATIONAL METHODS FOR IDENTIFYING

CONSERVED PROTEIN COMPLEXES BETWEEN SPECIES

FROM PROTEIN INTERACTION DATA

NGUYEN PHI VU

(B.Sc (Hons), Vietnam National University - HCMC)

A THESIS SUBMITTED

FOR THE DEGREE OF MASTER OF SCIENCE

DEPARTMENT OF COMPUTER SCIENCE

NATIONAL UNIVERSITY OF SINGAPORE

Acknowledgements

First and foremost, I would like to extend my deep gratitude to my supervisor, Professor Leong Hon Wai. He taught me not only the skills of scientific research but also the courage to pursue a career in science. Many of his lessons were eye-opening and unforgettable to me, in particular the habit of backing every scientific claim with evidence and the positive attitude to take when listening to critiques and comments. My sincere thanks also go to Dr. Sriganesh Srihari for his co-authorship, suggestions and discussions during my work on this thesis. Without this support from Professor Leong and Dr. Srihari, the thesis would not have been possible.

The RAS Group at School of Computing – NUS has been a source of friendship as well as colleagueship. I have learnt so many things via discussions, coffee chats and activities from the group, especially from Nam Ninh Nguyen, Dr. Ket Fah Chong and Dr. Melvin Zhang.

I am very grateful to the Computational Biology Group at SoC – NUS for all the seminars, lectures and activities, which greatly enhanced my background knowledge in the area.

Finally, I would like to thank my parents for their unbounded love and belief in me during my overseas study.

Summary

Protein complexes conserved across species indicate processes that are core to cellular machinery. While numerous computational methods have been devised to identify complexes from the protein interaction (PPI) networks of individual species, these are severely limited by noise and errors (false positives) in currently available datasets. Our analysis using human and yeast PPI networks revealed that these methods missed several important complexes including those conserved between the two species.

In this thesis we first present a definition of the problem of identifying conserved protein complexes between species from protein interaction data. We then review the existing computational methods for this problem and its related issues. After that, we propose a new and effective method for identifying conserved complexes by constructing interolog networks (IN). Our experiments were performed on human and yeast data. Here, we note that much of the functionality of yeast complexes has been conserved in human complexes not only through sequence conservation of proteins but also through conservation of critical functional domains. Therefore, our method leverages the functional conservation of proteins between species through domain conservation in addition to sequence similarity. Our analysis revealed that the IN construction removes several non-conserved interactions, many of which are false positives, thereby increasing the number of conserved protein complexes detected compared to direct complex prediction from the PPI networks. These additional complexes included the mismatch repair complex, MLH1-MSH2-PMS2-PCNA, and other important ones, namely the RNA polymerase-II, EIF3 and MCM complexes, all of which constitute core cellular processes known to be conserved across the two species.

Our method, which integrates domain conservation and sequence similarity to construct interolog networks, also produces a better-quality interolog network between human and yeast than other methods based on local network alignment. Integrating domain conservation information might therefore throw further light on conservation patterns between yeast and human complexes.

We observe from our experiments that protein complexes are not conserved from yeast to human in a straightforward way; that is, it is not the case that a yeast complex is a (proper) sub-set of a human complex with a few additional proteins present in the human complex. Instead, complexes have evolved multifold, with considerable re-organization of proteins and re-distribution of their functions across complexes. This finding can have significant implications for attempts to extrapolate other kinds of relationships, such as synthetic lethality, from yeast to human, for example in the identification of novel cancer targets.

Content

Acknowledgements ... i

Summary ... ii

Content ... iv

List of Figures ... vi

List of Tables ... viii

Chapter 1 - Introduction ... 1

1.1. Background and Motivation... 1

1.1.1. Protein-protein interaction networks ... 1

1.1.2. Protein complex and predicting protein complexes from PPI networks. ... 2

1.1.3. Why do we need comparative interactomics and conserved protein complexes? ... 3

1.2. Research objectives ... 4

1.3. Contributions of the thesis ... 5

1.4. Organization of the thesis ... 6

Chapter 2 - The problem of identifying conserved protein complexes from PPI data ... 7

2.1. Problem definition ... 7

2.2. The computational pipeline ... 8

2.2.1. Experimental data ... 8

2.2.2. Ortholog assignment ... 9

2.2.3. Protein complex detection from PPI networks ... 11

2.2.4. Result evaluation for conserved protein complexes ... 12

Chapter 3 – Computational methods for identifying conserved protein complexes ... 13

3.1. Local network alignment approach ... 13

3.1.1. Problem definition and general solution framework ... 14

3.1.2. NetworkBLAST ... 15

3.1.3. Other local network alignment based methods ... 21

3.2. Network querying approach ... 21

3.2.1. Problem definition... 21

3.2.2. Torque – Topology-free network querying ... 22

3.2.3. Other network querying based methods ... 26

Chapter 4 – COCIN: Conserved protein complex detection from Interolog Networks ... 29

4.1. Overview ... 29

4.2. Method ... 33

4.2.1. Constructing the interolog network ... 33

4.2.2. Clustering the interolog network and detection of conserved complexes ... 34

4.2.3. Building a benchmark dataset for conserved protein complexes ... 35

4.3. Results ... 36

4.3.1. Preparation of experimental data ... 36

4.3.2. Results of complex detection using interolog network (IN) ... 38

4.3.3. The result of complex detection in the conserved subnetworks ... 45

4.3.4. Comparisons with other complex detection methods in PPI networks ... 46

4.3.5. Integrating domain information significantly enhances interolog construction ... 48

Chapter 5 – Conclusion ... 53

5.1. Main contributions ... 53

5.2. Limitations ... 54

5.3. Recommendations for further research ... 54

List of Figures

Figure 1.1 – (a) protein-protein interaction, (b) protein-protein interaction network. ... 1

Figure 1.2 – (a) a picture of protein complex, (b) a graph representation of a protein complex.(c) core-attachment structure of protein complexes. ... 2

Figure 2.1 – An example about human (right) and yeast (left) Eukaryotic initiation factor (eIF3) complex. ... 7

Figure 2.2 – The computational pipeline for identifying conserved protein complexes. ... 12

Figure 3.1 - A simple example for pair-wise network alignment, in which nodes having the same shape are considered as sequence-similar. Conserved sub-networks have thick edges. ... 14

Figure 3.2 – A general solution framework for identifying conserved protein complexes using network alignment. ... 15

Figure 3.3 – An illustration of two nodes and their edge in the orthology graph. ... 19

Figure 3.4 – An illustration for the query set of proteins (a) and its matched connected subgraph (b) in the target network, each number label represents a color. The multisets of colors, which represent multisets of biological protein function, in (a) and (b) are equal. ... 23

Figure 4.1 - Conservation of complexes between yeast and human ... 31

Figure 4.2 - Construction of the interolog network – a simplified example ... 33

Figure 4.3 - Conservation scores for building benchmark complex datasets ... 36

Figure 4.4 - An illustration of a predicted complex from the IN ... 41

(a) A predicted complex in the IN. ... 41

(b) The corresponding complex in the human PPI network. ... 41

(c) The corresponding complex in the yeast PPI network. ... 41

Figure 4.6 - Some examples of additional conserved complexes found in IN ... 46

Figure 4.7 - COCIN compared to HACO ... 47

Figure 4.8 - COCIN compared to MCL ... 48

Figure 4.9 - Assessment of Ensembl and OrthoMCL based homology for IN construction and conserved-complex detection... 49

Figure 4.10 – Some examples of the one-to-many and many-to-many relationships of complex conservation between human and yeast ... 50

Figure 4.11 – Comparison between using Ensembl and OrthoMCL in constructing the

List of Tables

Table 4.1 – Properties of yeast physical PPI datasets ... 37

Table 4.2 - Properties of human physical PPI datasets ... 37

Table 4.3 - Properties of manually curated protein complex datasets ... 37

Table 4.4 - Properties of the interolog network constructed from yeast and human PPIs ... 38

Table 4.5 - Comparisons of different methods on yeast data ... 39

Table 4.6 - Comparisons of different methods on human data ... 40

Table 4.7 – Additional conserved complexes found in yeast ... 43

Table 4.8 – Additional conserved complexes found in human ... 44

Table 4.9 – Details of gold standard testing dataset for conserved protein complexes between human and yeast ... 49

Chapter 1 - Introduction

1.1. Background and Motivation

1.1.1. Protein-protein interaction networks

Protein interactions play a central role in most biological processes. In order to carry out biological functions as catalysts, signaling molecules, or building blocks in cells, proteins need to bind together via domain interfaces to make the corresponding chemical reactions happen. Thus, a critical step towards understanding the inner workings of cellular machinery is to build a complete map of protein-to-protein physical interactions, which is called the interactome.

A protein-protein interaction network (PPI network) is a mathematical model of the interactome in which the nodes represent proteins and the edges represent the physical interactions between them. Edges may also carry weights that reflect the reliability of the interactions. Figure 1.1b is a picture of the yeast PPI network [Jeong et al., 2001], one of the first eukaryotic interactomes to be studied.
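As a minimal sketch of this model, a PPI network can be stored as an undirected weighted graph. The function name `build_ppi_network`, the protein identifiers and the reliability weights below are hypothetical and serve only as an illustration; they are not taken from any dataset used in this thesis.

```python
from collections import defaultdict


def build_ppi_network(interactions):
    """Build an undirected weighted graph: protein -> {neighbour: reliability}.

    `interactions` is an iterable of (protein_a, protein_b, weight) triples.
    """
    network = defaultdict(dict)
    for a, b, weight in interactions:
        if a == b:
            continue  # self-interactions are ignored in this simple model
        # A physical interaction is symmetric, so store both directions.
        network[a][b] = weight
        network[b][a] = weight
    return dict(network)


# Hypothetical interactions with made-up reliability weights.
interactions = [("P1", "P2", 0.9), ("P2", "P3", 0.4), ("P1", "P3", 0.7)]
ppi = build_ppi_network(interactions)
```

Storing each edge in both directions keeps neighbour look-ups simple, at the cost of holding every weight twice.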

In an effort to obtain a complete picture of the interactome, many high-throughput techniques have been developed over the last decade to detect protein interactions on a genome-wide level, and not only in yeast. Two typical techniques are yeast two-hybrid (Y2H) [Uetz et al., 2000; Ito et al., 2001] and tandem affinity purification combined with mass spectrometry (TAP-MS) [Gavin et al., 2006; Krogan et al., 2006] (see Section 2.2.1 for details).

1.1.2. Protein complex and predicting protein complexes from PPI networks.

Many proteins perform their functions together with other proteins, forming protein complexes that are responsible for specific processes in a cell. Understanding how, why and when proteins associate into protein complexes is a critical part of understanding cellular life. Therefore, identifying protein complexes and protein pathways, which together can be referred to as the cellular machinery, is regarded as one of the fundamental problems in molecular biology.

Figure 1.2 – (a) a picture of protein complex, (b) a graph representation of a protein complex, (c) core-attachment structure of protein complexes.

One of the biggest difficulties for computational methods that detect protein complexes from PPI networks is that there is no mathematical definition of a protein complex, only the observation that proteins within a complex interact closely with each other (figure 1.2a).

Hence, computational biologists usually adopt an early, widely accepted model of protein complexes as dense (or clique-like) subgraphs (figure 1.2b) and search for dense regions in the PPI network as protein complex candidates. Typical complex detection methods based on graph clustering are MCODE [Bader et al., 2003], MCL [van Dongen et al., 2000], CMC [Liu et al., 2009], and HACO [Wang et al., 2009].
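To make the notion of a dense subgraph concrete, the sketch below computes the standard edge density of an induced subgraph, i.e. the fraction of all possible protein pairs that actually interact (1.0 for a clique). This is only an illustration of the model, not the scoring function of any of the methods cited above; the function name `density` and the toy edges are assumptions.

```python
from itertools import combinations


def density(edges, proteins):
    """Edge density of the subgraph induced by `proteins`.

    `edges` is an iterable of undirected (protein_a, protein_b) pairs.
    Returns a value between 0.0 and 1.0; a clique has density 1.0.
    """
    proteins = set(proteins)
    n = len(proteins)
    if n < 2:
        return 0.0  # density is undefined for fewer than two nodes
    edge_set = {frozenset(edge) for edge in edges}
    present = sum(
        1 for pair in combinations(sorted(proteins), 2)
        if frozenset(pair) in edge_set
    )
    return present / (n * (n - 1) / 2)


# Toy network: A, B, C form a triangle; D is attached only to C.
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")]
triangle_density = density(edges, ["A", "B", "C"])       # all 3 pairs interact
with_d_density = density(edges, ["A", "B", "C", "D"])    # 4 of 6 pairs interact
```

In this toy example the triangle is a better complex candidate than the four-protein set, because adding D lowers the density.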

It is also known that protein complexes have a core-attachment structure [Gavin et al., 2006], in which cores are the stable parts of complexes and recruit attachment proteins to help perform specific functions. Among attachment proteins, there are instances where two or more proteins always occur together; these are called ‘modules’ (figure 1.2c). Attachment proteins were also seen to be shared between two or more complexes, which exemplifies the view that the same protein may participate in multiple complexes [Pu et al., 2007; Wang et al., 2009]. Typical complex detection methods incorporating the core-attachment structure are CORE [Leung et al., 2009], COACH [Wu et al., 2009], and MCL-CAw [Srihari et al., 2010]. For a complete literature survey on computational methods for predicting protein complexes from PPI networks, please refer to the recent papers [Li et al., 2010] and [Srihari et al., 2013].

Existing complex prediction methods have to deal with highly noisy interaction data (high false positive and false negative rates) and with low overlap between different data sources. As a result, they still cannot achieve complete coverage of the known protein complexes. Proteins shared between multiple complexes in PPI networks also hinder complex detection methods based on graph clustering.

Current protein complex detection methods (all approaches) also rarely achieve a 100% match for each detected complex, which hinders the comparison of detected complexes from two species to identify the conserved pairs. Due to the above obstacles, protein complex detection from the original PPI networks is still not an optimal approach for identifying conserved protein complexes among species.

1.1.3. Why do we need comparative interactomics and conserved protein complexes?

One of the most important reasons for searching for conserved biological entities between species is that conservation implies functional significance. This accounts for the birth of comparative genomics, which identifies proteins whose functions are conserved among species. While sequence-conserved proteins form the basis of comparative genomics, it is also very important to consider the conserved patterns of interactions between the proteins themselves, which can be referred to as comparative interactomics [Kiemer et al., 2007]. The reason is that comparing interactomes among different species helps to transfer biological knowledge and function annotation at a higher level than comparing only protein sequences.

Conserved protein complexes and functional modules are among the main outcomes of solving comparative interactomics problems. Identifying conserved complexes between species is a fundamental step towards identifying mechanisms conserved from model organisms to higher-level organisms, such as protein translation, DNA transcription and the cell cycle. These mechanisms are, at the same time, considered the backbone of the cell as a unit of life. Therefore, conserved protein complexes are closely related to core cellular processes and deserve careful study.

Another advantage of the comparative interactomics approach is that, despite the noise in the data, comparative analysis lets us use cross-species conservation as a criterion to focus on the more reliable parts of protein interaction networks and to infer likely functional components. As the number of well-studied species increases, we can use this approach to guide the search for protein complexes in newly sequenced species, thereby increasing the precision of current computational methods for predicting protein complexes.

Identifying conserved protein complexes can also help to understand the evolutionary mechanisms of protein complexes and protein interaction networks between multiple species, such as deriving evolutionary rate and age measures for protein complexes [Yosef et al., 2009].

In summary, the generalization from finding orthologous proteins to orthologous protein complexes [Yosef et al., 2009] is a significant extension.

1.2. Research objectives

Given the significance of detecting conserved protein complexes between species, and the fact that current protein complex detection methods still cannot undertake this task, an effective method for this purpose is needed. Methods specialized for detecting conserved protein complexes do exist, but most of them use only the BLAST score over the whole protein sequence to decide which pairs of proteins between two species are considered conserved (see Chapter 3 for details). This can severely limit the number of protein pairs identified among those that are actually conserved in function. Identifying function-conserved proteins is important here because it serves as the cornerstone for predicting conserved protein complexes. For species separated by a large evolutionary distance, this limitation is serious: their proteins have evolved many-fold in complexity, so simple BLAST scores for whole-sequence similarity may not be able to capture these complicated evolutionary processes. Hence, an effective method is needed in this respect as well. In line with these research objectives, the key contributions of this thesis are as follows.

1.3. Contributions of the thesis

1. A survey on computational methods for identifying conserved protein complexes between species: in this survey, computational methods for identifying conserved protein complexes are grouped into two classes, each using a different approach. For each approach, a typical method is described in detail, and the other methods are described briefly. Connections between methods and comparisons between the two approaches are also shown. Furthermore, a short summary of ortholog assignment methods is presented because of its significance in the computational pipeline for identifying conserved protein complexes.

2. A novel method for identifying conserved protein complexes by constructing interolog networks: This method is novel in terms of: (i) employing an innovative and effective framework for detecting conserved protein complexes; (ii) hypothesizing an evolutionary mechanism among protein complexes that integrates protein domain information. Our experiments on yeast and human datasets revealed that our method can identify considerably more conserved complexes than plain clustering of the original PPI networks. Furthermore, we demonstrated that integrating domain information generates many-to-many ortholog relationships, which significantly enhance the interolog network quality and throw further light on the conservation of mechanisms between yeast and human.

3. A gold standard dataset for conserved protein complexes between human and yeast: By proposing a score to measure the conservation level between protein complexes, a collection of conserved complex pairs between yeast and human is built and used as a gold standard dataset in this work. As there is currently no benchmark dataset for conserved protein complexes between human and yeast in the literature, the author hopes that this dataset could be useful for reference. Furthermore, this step also gives us a detailed examination of the conservation level between manually curated protein complexes of human and yeast.

1.4. Organization of the thesis

This chapter has briefly described the background and motivation, and outlined the research objectives of this work. The remainder of this thesis is organized as follows. Chapter 2 first defines the problem of identifying conserved protein complexes between species from protein interaction data, then presents the general computational pipeline for solving this problem. This pipeline includes the preparation of experimental data; a brief survey of ortholog assignment methods for defining conserved proteins; and protein complex detection from all the input data. Chapter 3 surveys existing methods specialized for detecting conserved protein complexes and functional modules from protein interaction data. The two main approaches presented are network alignment and network querying, which have interesting computational properties. Chapter 4 presents the main contribution of this thesis, a novel method for mining conserved protein complexes from the interolog network built from the two species’ PPI networks. Chapter 5 concludes the work by summarizing the main contributions, limitations and recommendations for further research.

Chapter 2 - The problem of identifying conserved protein complexes from PPI data

2.1. Problem definition

The problem of identifying conserved protein complexes can be described as follows. We are given a PPI network and a collection of manually curated protein complexes of a well-studied species, a PPI network of a new species (the interaction data of this species might be far from complete, and both networks can contain many noisy interactions), and the homology information between the two species. How can we predict the protein complexes in the new species that are conserved in the well-studied species? Conservation of protein interaction sub-networks is measured in terms of similarity in protein function (node similarity) and similarity in interaction patterns (network topology similarity).

Figure 2.1 illustrates a pair of conserved protein complexes between a well-studied species such as yeast and a newly sequenced species such as human. For species separated by a large evolutionary distance, such as human and yeast, many cellular mechanisms, though conserved in function, have in fact evolved many-fold in complexity. Consequently, the similarity in composition of the conserved protein complexes between these species is not expected to be very high; on the contrary, a large portion of the proteins in such a pair of complexes might differ (through insertions/deletions of proteins). Therefore, an effective method for predicting conserved protein complexes from PPI networks needs to be able to recognize the evolutionary mechanisms responsible for the differing parts of the two conserved protein complexes.

Figure 2.1 – An example about human (right) and yeast (left) Eukaryotic initiation factor (eIF3) complex.

2.2. The computational pipeline

To identify conserved protein complexes between species from PPI data, we first need to gather the physical protein interactions of the two species from various datasets and experiments, in order to increase the coverage of true positive interactions. Manually curated protein complexes (if available) of the well-studied species are also collected to aid the prediction of conserved complexes in the other species. The second key step in this computational pipeline is to define the correspondence of function similarity between the two sets of proteins, one from each species. This step is usually deemed identical to the task of ortholog assignment. Finally, once the input data are available, we need a method to detect conserved protein complexes from these data, followed by an evaluation of the resulting complexes.

2.2.1. Experimental data

Many high-throughput techniques have been developed over the last decade to detect protein interactions on a genome-wide level, and not only in yeast. The following are two typical techniques:

Yeast two hybrid (Y2H) [Uetz et al., 2000; Ito et al., 2001]: a screening technique for physical protein-protein and protein-DNA interactions that takes place in a living yeast cell (in vivo). The two proteins of interest are introduced into a genetically engineered strain of yeast. If they physically interact, a reporter is transcriptionally activated and a colour reaction appears on specific media. This technique is low-cost, but its results are degraded by a high number of false positive (as well as false negative) detections (a false positive rate of about 70% is reported in [Deane et al., 2002]) and by a low overlap rate between the two experiments (only 20% as in [Shoemaker, 2007]).

Tandem affinity purification combined with mass spectrometry (TAP-MS) [Gavin et al., 2006; Krogan et al., 2006]: an in vitro technique with two steps. In the TAP stage, the protein of interest acts as a bait in a cell lysate, to which its interacting proteins (prey) bind; after the contaminants are washed out, the bound proteins are identified together by mass spectrometry. Although TAP-MS, like Y2H, still yields a large number of false positive interactions and misses many known interactions, it can report higher-order interactions such as protein complexes, whereas Y2H has the advantage of detecting transient interactions [Shoemaker et al., 2007].

As an inherent weakness of high-throughput techniques, the protein interaction data they generate contain a large number of false positives. For this reason, PPI scoring methods have been developed to assess the reliability of each interaction in the PPI network. Typical PPI scoring methods include FSweight [Chua et al., 2006] and Iterative-CD [Liu et al., 2008], which use solely the PPI network topology to evaluate the reliability of PPIs and predict new interactions, and TCSS [Jain et al., 2010], which uses the semantic similarity of proteins within the gene ontology to score PPIs.

For manually curated protein complexes, two well-known databases backed by wet-lab experiments and verification are Wodak Lab CYC2008 [Pu et al., 2007, 2008] for yeast and CORUM [Ruepp et al., 2008, 2009] for mammalian species. Other typical databases of manually curated protein complexes include MIPS [Mewes et al., 2006] and Aloy [Aloy et al., 2004] for yeast, and Emililab [Havugimana et al., 2012] for human.

2.2.2. Ortholog assignment

Ortholog assignment plays a key role in this work because it defines the correspondence of function similarity between the two sets of proteins of the two species, which is the cornerstone for identifying protein complexes with function similarity. Orthology prediction methods can be grouped into three main classes: “graph-based”, “phylogenetic tree-based” and “synteny based”. Ortholog identification is a large topic in its own right. Within the scope of this thesis, only a brief summary of the most popular orthology inference methods is given, some of which were used throughout this work.

Graph-based methods perform pair-wise gene/protein sequence comparisons between whole genomes, typically using all-versus-all BLAST. A weighted graph is then constructed with genes as nodes and sequence similarity scores as edge weights. Finally, various graph clustering techniques are used to identify homolog groups. COGs [Tatusov et al., 2003], Inparanoid [O’Brien et al., 2005] and OrthoMCL [Li et al., 2003] belong to this class.
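The graph-based idea can be sketched as follows. Note that this is a deliberately simplified stand-in: it groups genes by thresholding the similarity scores and taking connected components, whereas the cited tools use more elaborate clustering procedures. The function name `homolog_groups`, the score threshold and the gene identifiers are all hypothetical.

```python
def homolog_groups(similarities, min_score):
    """Group genes into putative homolog groups.

    `similarities` is an iterable of (gene_a, gene_b, score) triples, for
    example from an all-versus-all sequence comparison. Two genes are linked
    when their score is at least `min_score`; groups are the connected
    components of the resulting graph (found with union-find).
    """
    parent = {}

    def find(gene):
        parent.setdefault(gene, gene)
        while parent[gene] != gene:
            parent[gene] = parent[parent[gene]]  # path halving
            gene = parent[gene]
        return gene

    for a, b, score in similarities:
        root_a, root_b = find(a), find(b)  # also registers unseen genes
        if score >= min_score and root_a != root_b:
            parent[root_a] = root_b

    groups = {}
    for gene in list(parent):
        groups.setdefault(find(gene), set()).add(gene)
    return sorted(sorted(group) for group in groups.values())


# Hypothetical similarity scores between yeast (y*) and human (h*) genes.
hits = [("y1", "h1", 250.0), ("y1", "h2", 180.0), ("y2", "h3", 40.0)]
groups = homolog_groups(hits, min_score=100.0)
```

With this toy input, `y1` is grouped with both `h1` and `h2`, while the weak `y2`-`h3` hit is discarded and each of those genes stays in its own singleton group.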

Phylogenetic tree-based methods have a first stage similar to that of graph-based methods, in which homolog groups are identified. For each of these homolog groups, a gene tree is built from multiple sequence alignments of the homologs. These gene trees are then analyzed and reconciled with a trusted species tree to localize speciation and duplication events, which is the basis for differentiating orthologs from paralogs. Because of this detailed analysis, many studies have shown that phylogenetic methods have greater precision than graph-based methods [Chen et al., 2007]. Typical examples of phylogenetic methods are EnsemblCompara [Vilella et al., 2009] and PHOG [Datta et al., 2009].

Synteny based methods use the information of synteny blocks. They rely on the property that an ortholog pair is usually surrounded by many other ortholog pairs; that is, ortholog pairs tend to be located close to each other on the two genomes so as to collaborate in specific conserved functions. Typical examples of this are operons in prokaryotes and conserved gene clusters in eukaryotes. Some methods in this class are MSOAR2 [Shi et al., 2009] and BBHLS [Zhang et al., 2012], in which sequence similarity is combined with gene context similarity.

In many existing methods for identifying conserved protein complexes, function similarity between proteins was measured using the BLAST score only ([Sharan et al., 2005], [Flannick et al., 2006], [Sharon et al., 2009]). This severely restricts the number of function-conserved proteins that can be found. The following is one approach that can overcome this weakness.

Orthology prediction considering protein domain similarity:

There are circumstances under which a domain-based phylogeny may be preferable to one based on whole-sequence similarity. First, the requirement that orthologs align well over their entire lengths, being neither much longer nor much shorter than each other, might be overly restrictive. When species are separated by a large evolutionary distance, their orthologs may have evolved many-fold in complexity, so that only their functional and structural domains, which are the parts that directly perform functions, remain similar to each other. Secondly, existing methods for ortholog identification are usually based on BLAST, a local alignment protocol, which is not designed to distinguish between sequences sharing a common domain architecture and those having only local matches. This may increase the potential for annotation errors.

For these reasons, some ortholog assignment methods consider protein domain similarity in the process of inferring functional similarity. These include Ensembl orthology [Vilella et al., 2009] and PHOG [Datta et al., 2009].

2.2.3. Protein complex detection from PPI networks

Protein complex detection is the final stage in the computational pipeline for identifying conserved protein complexes, once all input data (PPI data of the two species, manually curated protein complexes, homology information) are ready. Recent literature surveys of computational methods for protein complex prediction are given in [Li et al., 2010] and [Srihari et al., 2013].

This part focuses on standard complex detection methods that are based on graph clustering. These methods provide effective frameworks for mining protein complexes from protein interaction data, and some of them have reached state-of-the-art performance compared to other approaches. However, modeling protein complexes as dense sub-graphs makes it difficult to detect complexes thoroughly from the original PPI networks, for the following reasons. First, protein interaction datasets, especially for newly sequenced species such as human, still contain a substantial number of noisy interactions, which violate the dense sub-graph model of protein complexes. Secondly, in a PPI network, especially that of a multi-cellular species, a protein does not necessarily participate in all its known interactions simultaneously (as shown in [Liu et al., 2011]). In other words, a protein can participate in many different complexes (shared attachment proteins are an example [Gavin et al., 2006]), so from the PPI network alone it is difficult to know which subset of interactions takes place together in the same complex. These factors can cause graph-clustering based methods to miss many true complexes, many of which are involved in core cellular processes that are conserved among species [Nguyen et al., 2013]. Some typical methods in this class are MCODE [Bader et al., 2003], MCL [van Dongen et al., 2000], CMC [Liu et al., 2009], and HACO [Wang et al., 2009].

The resulting complexes are matched against manually curated protein complexes for evaluation. Current protein complex detection methods (all approaches) rarely achieve a 100% match for each detected complex, which also hinders the comparison of detected complexes from two species to identify the conserved pairs. Due to the above obstacles, protein complex detection from the original PPI networks is still not an optimal approach for identifying conserved protein complexes among species.

Figure 2.2 – The computational pipeline for identifying conserved protein complexes.

2.2.4. Result evaluation for conserved protein complexes

Detected conserved protein complexes need to be matched against a benchmark dataset. If no such dataset exists in the literature, we have to build one. Usually, building a test dataset for conserved protein complexes requires devising a model for protein complex conservation, or a score that measures the conservation level of two given protein complexes. We then apply this score to every pair of complexes that we need to check for conservation.

[Figure 2.2 stages: Collecting experimental data (PPIs, manually curated complexes); Ortholog assignment; Protein complex detection]

Chapter 3 – Computational methods for identifying conserved protein complexes

In general, there are two approaches to identifying conserved protein complexes from PPI networks. The first compares the whole PPI networks of the two species by aligning similar nodes and edges and then searching the alignment network for regions that could be conserved; this is called the local network alignment approach. The second uses known protein complexes of a well-studied species and matches them against the PPI network of a new species to identify subnetworks that resemble the query complexes; this approach is therefore called network querying. Detailed descriptions of the two approaches are given in the following sections.

3.1. Local network alignment approach

Analogous to sequence alignment, network alignment measures the similarity between two networks by finding the best way to fit one network into the other. As with sequence alignment, there are local and global network alignments. Global network alignment searches for a single alignment that maps every node in the smaller network to exactly one node in the larger network, even though this may lead to suboptimal matchings in some local regions. Global network alignment is therefore aimed at discovering the topological properties that are preserved between the two networks. Several formulations of the global network alignment problem have been proposed [Flannick et al., 2008; Liao et al., 2009; Zaslavskiy et al., 2009]. Local alignment, on the other hand, looks for small similar subnetworks of the two networks and thus aims to identify pathways or protein complexes conserved in the PPI networks of different species. Consequently, a node (or a subnetwork) from one network can be mapped to many nodes (or many subnetworks) in the other network. For this reason, this section is dedicated to local network alignment.


3.1.1. Problem definition and general solution framework

If a PPI network is represented by an undirected graph G(V, E), where V denotes the set of proteins and (u, v) ∈ E denotes an interaction between proteins u, v ∈ V, then the local network alignment problem can be informally stated as follows:

Local network alignment problem: given k different PPI networks of k different species, how can we find conserved sub-networks between these networks?

In other words, a local network alignment is defined as a set of sub-networks chosen from the interaction networks of different species, together with a (label) mapping between corresponding (or aligned) proteins. For an alignment to be uniquely specified, we require that the mapping be a mathematical equivalence relation. Consequently, the groups of aligned proteins are disjoint, and we refer to them as equivalence classes. Each of these classes can be called a protein family (usually referred to as a homology group), which represents a particular protein function. The biological interpretation of an alignment is thus a collection of protein families whose interactions are conserved across a given set of species.

Generally, in order to find these conserved sub-networks, we have to build an alignment graph (or orthology graph), in which each of its nodes represents k sequence-similar (homologous) proteins (each protein belongs to a different species), and each edge represents a conserved interaction between k species.

When the number of species is 2 (k = 2), this problem is called pair-wise network alignment. For simplicity, we will henceforth use the term network alignment to mean pair-wise network alignment. Figure 3.1 below gives a simple example of pair-wise network alignment.

Figure 3.1 - A simple example of pair-wise network alignment, in which nodes having the same shape are considered sequence-similar. Conserved sub-networks have thick edges.

To apply network alignment to finding conserved protein complexes in PPI networks, the network alignment problem is extended to allow a limited number of mismatches with respect to nodes and edges in the resulting subgraphs, as well as a limited number of insertions/deletions of nodes.

General solution framework: a general framework for applying network alignment to identify conserved protein complexes is illustrated in Figure 3.2. The first stage defines a protein complex model such that every sub-network satisfying the model has a high chance of being a true protein complex. The accuracy of the model depends heavily on the quality of the knowledge (represented in terms of graphs) used to define a protein complex. The second stage devises a definition of protein complex conservation using the protein complex model of each species. This stage takes into account the homology information between the protein sets of the two species to build a so-called alignment graph (or orthology graph), which is used in the subsequent search stage.

Figure 3.2 – A general solution framework for identifying conserved protein complexes using network alignment.

When the alignment graph is built, the problem of identifying conserved protein complexes will be equivalent to finding heavy subgraphs (in terms of node weight and edge weight) in the alignment graph. Moreover, the problem of searching for induced heavy subgraphs in a graph is NP-hard even when considering a single species where all edge weights are 1 or -1 and all vertex weights are 0 [Shamir et al., 2004]. Thus a heuristic is employed for searching the alignment graph for conserved protein complexes.

In this section, we look at NetworkBLAST [Sharan et al., 2005a; Sharan et al., 2005b] as a typical method based on the above solution framework for network alignment; other methods are usually variants of it.

3.1.2. NetworkBLAST [Sharan et al., 2005a; Sharan et al., 2005b]

This method finds conserved protein complexes by comparative analysis of two PPI networks. It assumes that the proteins in a complex are highly connected among themselves, which helps them act as a single unit. A protein complex can thus be represented as a dense (clique-like) subgraph. To evaluate how likely a subset of proteins is to form a protein complex, and how statistically significant it is, a probabilistic model for protein complexes is devised as follows.

A probabilistic model for protein complexes:

From a top-down view, the complete protein complex model is a log likelihood ratio, defined for each subset U of proteins, that measures how likely they are to form a true complex (let us call it the complex likelihood):

$$L(U) = \log \frac{\Pr(O_U \mid M_c)}{\Pr(O_U \mid M_n)} \tag{3.1}$$

In this formula, $O_U$ is the observation of all interactions within U, and $\Pr(O_U \mid M_c)$ is the likelihood of observing $O_U$ given the complex model $M_c$ ($M_c$ represents the hypothesis that U is within a complex). The complex model $M_c$ assumes that every two proteins in a complex interact with a high probability β (0.95 is used in this work). In terms of the graph, the assumption is that two vertices that belong to the same complex are connected by an edge with probability β, independently of all other pair-wise interactions and all other information.

To have a high chance of being a true protein complex, a subset of proteins U with its observed interactions $O_U$ also needs to be statistically significant, and $\Pr(O_U \mid M_n)$ accounts for this: it is the likelihood of $O_U$ under the null model $M_n$. The null (random) model $M_n$ assumes that each edge is present with the probability that one would expect if the edges of G (the graph that represents the PPI network) were randomly distributed while respecting the degrees of the vertices, which means that edges incident to vertices of higher degree have higher probability. More precisely, let $F_G$ denote the family of all graphs having the same vertex set as G and the same degree sequence. The probability of observing the edge (u, v) is defined to be the fraction of graphs in $F_G$ that include this edge.
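The null-model edge probability described here could be approximated by sampling, as in the following minimal sketch. It assumes degree-preserving double-edge swaps as the sampler, which only approximates uniform sampling from the family of graphs with the same degree sequence; the function name `estimate_null_edge_probs` and the parameters `n_samples` and `swaps_per_sample` are hypothetical and are not taken from NetworkBLAST. The input is assumed to be a simple graph (no self-loops, no duplicate edges) with sortable node labels.

```python
import random
from collections import Counter

def estimate_null_edge_probs(edges, n_samples=200, swaps_per_sample=None, seed=0):
    """Estimate, for each node pair, the fraction of sampled
    degree-preserving random graphs that contain that pair as an edge."""
    rng = random.Random(seed)
    edge_list = [tuple(sorted(e)) for e in edges]
    edge_set = set(edge_list)
    swaps = swaps_per_sample or 10 * len(edge_list)  # assumed mixing budget
    counts = Counter()
    for _ in range(n_samples):
        for _ in range(swaps):
            i = rng.randrange(len(edge_list))
            j = rng.randrange(len(edge_list))
            (a, b), (c, d) = edge_list[i], edge_list[j]
            if rng.random() < 0.5:          # pick one of the two rewirings
                c, d = d, c
            new1 = tuple(sorted((a, d)))
            new2 = tuple(sorted((c, b)))
            # Reject swaps that would create self-loops or duplicate edges.
            if len({a, b, c, d}) < 4 or new1 in edge_set or new2 in edge_set:
                continue
            edge_set -= {edge_list[i], edge_list[j]}
            edge_set |= {new1, new2}
            edge_list[i], edge_list[j] = new1, new2
        counts.update(edge_set)             # record the current random graph
    return {e: c / n_samples for e, c in counts.items()}
```

Because every swap keeps each vertex degree unchanged, the estimated probabilities of the pairs incident to a vertex sum to that vertex's degree, which is a quick sanity check on the output.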

Given the assumption that all pair-wise interactions are independent, the log likelihood ratio in (3.1) can be decomposed into a sum of log likelihood ratios for individual protein pairs:

$$L(U) = \sum_{(u, v) \in U \times U} \log \frac{\Pr(O_{uv} \mid M_c)}{\Pr(O_{uv} \mid M_n)} \tag{3.2}$$


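where

$$\begin{aligned}
\Pr(O_{uv} \mid M_c) &= \Pr(O_{uv}, T_{uv} \mid M_c) + \Pr(O_{uv}, F_{uv} \mid M_c) && \text{(law of total probability)} \\
&= \Pr(O_{uv} \mid T_{uv}, M_c)\,\Pr(T_{uv} \mid M_c) + \Pr(O_{uv} \mid F_{uv}, M_c)\,\Pr(F_{uv} \mid M_c) \\
&= \beta \Pr(O_{uv} \mid T_{uv}) + (1 - \beta)\Pr(O_{uv} \mid F_{uv})
\end{aligned} \tag{3.3}$$

(given $T_{uv}$ or $F_{uv}$, $O_{uv}$ and $M_c$ are conditionally independent; $\beta = \Pr(T_{uv} \mid M_c)$)

$T_{uv}$ (respectively $F_{uv}$) is the event that protein u truly interacts (respectively does not interact) with protein v; β is the probability that any two proteins u and v interact with each other in the complex model $M_c$.

Similarly,

$$\Pr(O_{uv} \mid M_n) = p_{uv}\Pr(O_{uv} \mid T_{uv}) + (1 - p_{uv})\Pr(O_{uv} \mid F_{uv}) \tag{3.4}$$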

where, as mentioned in the description of the null model $M_n$ above, $p_{uv} = \Pr(T_{uv} \mid M_n)$ depends on the degrees of u and v. Hence, from (3.3) and (3.4), the log likelihood ratio in (3.2) can be rewritten as follows:

$$\begin{aligned}
L(U) &= \sum_{(u, v) \in U \times U} \log \frac{\beta \Pr(O_{uv} \mid T_{uv}) + (1 - \beta)\Pr(O_{uv} \mid F_{uv})}{p_{uv}\Pr(O_{uv} \mid T_{uv}) + (1 - p_{uv})\Pr(O_{uv} \mid F_{uv})} \\
&= \sum_{(u, v) \in U \times U} \log \frac{\beta \Pr(T_{uv} \mid O_{uv})\,(1 - \Pr(T_{uv})) + (1 - \beta)(1 - \Pr(T_{uv} \mid O_{uv}))\Pr(T_{uv})}{p_{uv}\Pr(T_{uv} \mid O_{uv})\,(1 - \Pr(T_{uv})) + (1 - p_{uv})(1 - \Pr(T_{uv} \mid O_{uv}))\Pr(T_{uv})}
\end{aligned} \tag{3.5}$$

(after applying Bayes's rule and cancelling common terms in the numerator and denominator)
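As a numerical illustration of the second line of (3.5), the per-pair term and the resulting complex likelihood could be evaluated as in the sketch below. The function names, the dictionary-based inputs and the default prior of 0.001 are assumptions made for illustration only; the sketch also sums over unordered protein pairs, which is one reading of the sum over U × U, and treats pairs without an observed interaction as having reliability 0.

```python
import math
from itertools import combinations

def pair_llr(p_t_given_o, p_uv, beta=0.95, prior=0.001):
    """One summand of (3.5).

    p_t_given_o: Pr(T_uv | O_uv), the reliability of the interaction
    p_uv:        Pr(T_uv | M_n), the null-model edge probability
    beta:        Pr(T_uv | M_c), the complex-model edge probability
    prior:       Pr(T_uv), prior that two random proteins interact (assumed)
    """
    num = beta * p_t_given_o * (1 - prior) + (1 - beta) * (1 - p_t_given_o) * prior
    den = p_uv * p_t_given_o * (1 - prior) + (1 - p_uv) * (1 - p_t_given_o) * prior
    return math.log(num / den)

def complex_likelihood(U, reliability, null_prob, beta=0.95, prior=0.001):
    """L(U): sum of per-pair terms over unordered pairs of proteins in U.

    reliability and null_prob are dicts keyed by sorted (u, v) tuples;
    pairs missing from reliability are treated as unobserved (0.0).
    """
    total = 0.0
    for pair in combinations(sorted(U), 2):
        total += pair_llr(reliability.get(pair, 0.0), null_prob[pair], beta, prior)
    return total
```

With these definitions, a fully reliable interaction contributes log(β / p_uv), so pairs that are unlikely under the null model but observed with high reliability raise the score most.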

So far, the log likelihood ratio can be calculated from four quantities: $\Pr(T_{uv} \mid M_c)$, or β, the probability of a true interaction in the complex model, which is set manually to 0.95 in this work; $\Pr(T_{uv} \mid M_n)$, or $p_{uv}$, the probability of an interaction if the edges are randomly distributed while respecting the degrees of the vertices, which can be estimated by Monte Carlo estimation; $\Pr(T_{uv} \mid O_{uv})$, the reliability of the interaction between u and v, estimated using a PPI network scoring method; and $\Pr(T_{uv})$, the prior probability that two random proteins interact.

Two-species protein complex conservation model:

Consider two subsets of proteins, U1 from species 1 and V2 from species 2, and a many-to-many mapping θ: U1 → V2 between them. The likelihood score that measures how likely the two subsets of proteins are to be complexes (let us call it the concurrent complex likelihood) can be computed as follows:

$$L(U_1, V_2) = \log \frac{\Pr(O_{U_1} \mid M_c^{1})}{\Pr(O_{U_1} \mid M_n^{1})} + \log \frac{\Pr(O_{V_2} \mid M_c^{2})}{\Pr(O_{V_2} \mid M_n^{2})} \tag{3.6}$$

which is the sum of the two corresponding complex likelihoods, one per species. To obtain a conservation score for these two subsets of proteins, we have to take into account the sequence conservation among the pairs of proteins defined by θ, which assigns orthologous pairs between U1 and V2. We therefore need to define a so-called homolog likelihood, which measures how likely two proteins u and v are to be homologs. This is also a log likelihood ratio, between the likelihoods under the conserved complex model and the null model:

$$H(u, v) = \log \frac{\Pr(E_{uv} \mid M_c)}{\Pr(E_{uv} \mid M_n)}$$

($E_{uv}$ and $M_n$ are conditionally independent.)

Using Bayes’s rule, a simpler formula for the homolog likelihood can be derived as:

$$H(u, v) = \log \frac{\Pr(h_{uv} \mid E_{uv})}{\Pr(h_{uv})} \tag{3.7}$$

where $E_{uv}$ denotes the BLAST E-value between u and v, $h_{uv}$ is the event that u and v are homologs, and $\Pr(h_{uv} \mid E_{uv})$ is the probability that u and v are homologs given their BLAST E-value; this probability was calculated as in [Kelly et al., 2003].

Finally, the complete complex conservation score is formed as the sum of the concurrent complex likelihood L(U1, V2) and the sum of the homolog likelihoods over all homolog pairs between U1 and V2. The first term measures how likely the two subsets of proteins U1 and V2 are to be true complexes in the two corresponding species, while the second term measures how likely all homolog pairs assigned by θ are to be true homologs.

$$S(U_1, V_2) = L(U_1, V_2) + \sum_{u \in U_1} \sum_{v \in \theta(u)} H(u, v) \tag{3.8}$$
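The way (3.7) and (3.8) combine can be illustrated with a short sketch. It assumes that the two single-species complex likelihoods have already been computed, that the mapping θ is given as a dictionary from each protein of species 1 to its assigned homologs in species 2, and that a single prior Pr(h) is shared by all pairs; the function names and these input conventions are hypothetical.

```python
import math

def homolog_likelihood(p_h_given_e, prior_h):
    """H(u, v) = log( Pr(h_uv | E_uv) / Pr(h_uv) ), as in (3.7)."""
    return math.log(p_h_given_e / prior_h)

def conservation_score(L_species1, L_species2, theta, p_h_given_e, prior_h):
    """S(U1, V2) as in (3.8).

    L_species1, L_species2: complex likelihoods of U1 and V2
    theta:       dict mapping each u in U1 to an iterable of homologs v in V2
    p_h_given_e: dict mapping (u, v) to Pr(h_uv | E_uv)
    prior_h:     assumed prior probability that two proteins are homologs
    """
    score = L_species1 + L_species2          # concurrent complex likelihood (3.6)
    for u, partners in theta.items():
        for v in partners:
            score += homolog_likelihood(p_h_given_e[(u, v)], prior_h)
    return score
```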


Searching for conserved protein complexes:

After the complex model and the complex conservation model are built, the problem of identifying conserved protein complexes reduces to the problem of identifying a subset of proteins in each species, and a correspondence between them, such that the complex conservation score S exceeds a threshold. To facilitate the search over all possible pairs of protein subsets U and V (one from each species) and test whether they are conserved complexes, the concept of an orthology graph (or alignment graph) is introduced.

Let G1(V1, E1) and G2(V2, E2) be the PPI networks of the two corresponding species. The orthology graph OG(VOG, EOG) is then built as follows:

Each node in VOG is a pair (u, v) of proteins where u ∈ V1 and v ∈ V2.

Edges in OG connect all possible pairs of nodes. In other words, OG is a complete graph. Each edge that connects two nodes (u1, v1) and (u2, v2) in OG has two weights, w1 = L1({u1, u2}) and w2 = L2({v1, v2}), where L is the complex likelihood in (3.1); here it measures how likely (u1, u2) and (v1, v2) are to form co-complex relationships in the two corresponding species.

Each node (u, v) in OG has a weight that is the homolog likelihood between them, w(u, v) = H(u, v).

Figure 3.3 illustrates a node and an edge with two weights in the orthology graph. Enumerating all possible subsets of nodes in OG therefore enumerates all possible pairs of subsets U, V of proteins (one from each species).
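The construction of the orthology graph can be sketched as follows. The function name and the convention of passing H, L1 and L2 as callables are assumptions for illustration; the sketch also assigns weight 0.0 to the degenerate case in which two nodes share the same protein in one species, a case the description above does not specify.

```python
from itertools import combinations

def build_orthology_graph(proteins1, proteins2, H, L1, L2):
    """Build the complete orthology graph.

    H(u, v):   homolog likelihood of the cross-species pair (u, v)
    L1(a, b):  pairwise complex likelihood in species 1
    L2(a, b):  pairwise complex likelihood in species 2
    Returns (nodes, edges): nodes maps (u, v) to its weight, and edges
    maps frozenset({node_a, node_b}) to the weight pair (w1, w2).
    """
    nodes = {(u, v): H(u, v) for u in proteins1 for v in proteins2}
    edges = {}
    for (u1, v1), (u2, v2) in combinations(nodes, 2):
        # Assumption: a protein paired with itself contributes 0.0.
        w1 = L1(u1, u2) if u1 != u2 else 0.0
        w2 = L2(v1, v2) if v1 != v2 else 0.0
        edges[frozenset({(u1, v1), (u2, v2)})] = (w1, w2)
    return nodes, edges
```

Since the graph is complete, it has |V1|·|V2| nodes and a quadratic number of edges, which is one reason the pruning step described later in this section matters in practice.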


Based on the orthology graph, the problem of identifying a subset of proteins in each species, and a correspondence between them, such that the complex conservation score is high, is equivalent to finding heavy subgraphs in the orthology graph. This is an NP-hard problem, by reduction from the maximum clique problem. A search heuristic was therefore proposed as follows:

Compute a seed around each node v, which consists of v and all its neighbors u such that (u, v) is a strong edge.

If the size of this set is above a threshold (e.g. 10), iteratively remove from it the node whose contribution to the subgraph score is minimum, until we reach the desired size.

Enumerate all subsets of the seed that have size at least 3 and contain v. Each such subset is a refined seed on which a local search heuristic is applied.

Local search: Iteratively add a node, whose contribution to the current seed is maximum, or remove a node, whose contribution to the current seed is minimum, as long as this operation increases the overall score of the seed. Throughout the process, the original refined seed is preserved and nodes are not deleted from it.

For each node in the alignment graph, record up to k (e.g. 5) heaviest subgraphs that were discovered around that node.
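The local search step in the list above can be sketched as follows. This is a simplified illustration, not the NetworkBLAST implementation: it scores a subgraph as the sum of its node weights and of both weights on every internal edge, and at each iteration applies the single addition or removal that increases the score the most. The names `subgraph_score`, `local_search` and the size cap `max_size` are hypothetical; edge weights are assumed to be keyed by `frozenset` pairs of nodes.

```python
def subgraph_score(selected, node_w, edge_w):
    """Sum of node weights plus both weights of every internal edge."""
    selected = list(selected)
    score = sum(node_w[n] for n in selected)
    for i, a in enumerate(selected):
        for b in selected[i + 1:]:
            score += sum(edge_w[frozenset((a, b))])
    return score

def local_search(seed, node_w, edge_w, max_size=15):
    """Greedily grow or shrink a subgraph around a refined seed.

    Nodes of the original seed are never removed. The search stops
    when no single addition or removal improves the score.
    """
    seed = frozenset(seed)
    current = set(seed)
    best = subgraph_score(current, node_w, edge_w)
    while True:
        moves = [current - {n} for n in current if n not in seed]
        if len(current) < max_size:
            moves += [current | {n} for n in node_w if n not in current]
        scored = [(subgraph_score(m, node_w, edge_w), m) for m in moves]
        if not scored:
            break
        top_score, top_set = max(scored, key=lambda x: x[0])
        if top_score <= best:
            break
        best, current = top_score, top_set
    return current, best
```

The score strictly increases at every accepted move, so the loop always terminates, although only at a local optimum.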

Note that because the orthology graph is a complete graph, at any time, a constructed subgraph is also a clique. The resulting subgraphs may overlap considerably, thus a greedy algorithm is used to filter subgraphs whose percentage of intersection is above a threshold as follows:

Iteratively find the highest-weight subgraph, add it to the final output list, and remove all other subgraphs that intersect it highly.
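A minimal sketch of this greedy filter is given below. The overlap measure (size of the intersection divided by the size of the smaller subgraph) and the default threshold of 0.8 are assumptions, since the exact definition of the percentage of intersection is not given here; the function name is hypothetical and all subgraphs are assumed to be non-empty.

```python
def filter_overlapping(subgraphs, max_overlap=0.8):
    """Keep the heaviest subgraphs, dropping those that overlap a kept one.

    subgraphs: iterable of (node_collection, score) pairs
    Returns a list of (frozenset_of_nodes, score), heaviest first.
    """
    ranked = sorted(subgraphs, key=lambda s: s[1], reverse=True)
    kept = []
    for nodes, score in ranked:
        nodes = frozenset(nodes)
        overlaps = (
            len(nodes & other) / min(len(nodes), len(other))
            for other, _ in kept
        )
        if all(o <= max_overlap for o in overlaps):
            kept.append((nodes, score))
    return kept
```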

Pruning the orthology graph:

To reduce the complexity of the graph and focus on potential conserved complexes, nodes with low homolog likelihood are removed from the graph. Let S denote the set of remaining nodes. A removed node is added back only if it satisfies the following condition: for every node (p, y) ∉ S, we check whether there exist two nodes (p1, y1), (p2, y2) ∈ S such that p interacts with p1 and p2, and y interacts with y1 and y2. In this case, (p, y) serves as a “bridge” in the orthology graph between protein pairs whose members in each species are not known to interact directly.

Experimental results:

This method was tested on yeast and bacterial data. It found 11 correct conserved protein complexes between these two species, with the evaluation based on complex functional annotation. However, there was no benchmark data for estimating the sensitivity of the results.

3.1.3. Other local network alignment based methods

The MaWIsh local network alignment method [Koyuturk et al., 2006] is based on duplication/divergence models that focus on understanding the evolution of protein interactions. It constructs a weighted global alignment graph and tries to find a maximum induced sub-graph in it. The Graemlin algorithm [Flannick et al., 2006] scores a possibly conserved module between different networks by computing the log-ratio of the probability that the module is subject to evolutionary constraints and the probability that it is under no constraints, taking into account the phylogenetic relationships of the species whose networks are being aligned. [Hirsh et al., 2007] developed their own protein complex evolution model based on protein interaction attachment/detachment and gene duplication events, and employed it to identify conserved protein complexes between yeast and fly. [Zhenping Li et al., 2007] formulate local network alignment as an integer quadratic programming problem and then transform it into a quadratic programming problem that almost always yields an integer solution, thereby making the local network alignment problem tractable without any approximation.

3.2. Network querying approach

3.2.1. Problem definition

If we already have a list of known protein complexes, it is natural to match these complexes against the PPI network of a new species to predict conserved protein complexes, rather than aligning the two whole PPI networks and making no use of the known protein complex information from the well-studied species. The network querying problem can be stated as follows:

Network querying problem: given a query subnetwork GQ and a target network GT, how can we find subnetworks in GT that are similar to GQ? Similarity here is in terms of both node label and network topology.

In addition, to make the formulation more general and better suited to identifying conserved protein complexes, insertions of proteins into the matched subnetwork and deletions of vertices from the query subnetwork, as well as a limited number of mismatches, are allowed.

In this section, we will describe a typical method of network querying for identifying conserved protein complexes, Torque (TOpology-free netwoRk QUErying) [Bruckner et al., 2010].

3.2.2. Torque – Topology-free network querying [Bruckner et al., 2010]

“Topology-free” here means that we use only the set of proteins involved in each query subnetwork and disregard its topological information. The motivation for this work is that most protein complexes reported in the literature come without any information about their interaction patterns. Torque therefore aims to find a connected component of proteins in the target network that matches the query set of proteins. The work first gives a formulation of topology-free network querying and then devises three solutions to the problem: randomized dynamic programming, an integer linear programming (ILP) solver (after formulating the network querying problem as an ILP problem), and a shortest-path-based heuristic. To present the formulation of the problem, we first need to define a concept called colorful.

Let G = (V, E) be a PPI network where vertices represent proteins and edges correspond to PPIs. Given a set of colors C = {1, 2, …, k}, a coloring constraint function Γ: V → 2^C assigns each vertex v ∈ V a subset of the colors in C (we call this the color set of v). For any subset S of C, we define a subset of vertices H of V to be S-colorful if |H| = |S| and each vertex v in H can select one color of S from its color set that is distinct from the selections of the other vertices in H.
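Whether a given vertex set is S-colorful amounts to finding a perfect matching between vertices and colors, as the following sketch illustrates. It uses a standard augmenting-path bipartite matching, which is a choice made here for illustration and is not necessarily how Torque checks the property; `is_colorful` and the dictionary `gamma` (vertex to color set) are hypothetical names.

```python
def is_colorful(H, S, gamma):
    """Return True if the vertex collection H is S-colorful.

    gamma maps each vertex to its color set. H is S-colorful when
    |H| == |S| and every vertex can be given a distinct color of S
    drawn from its own color set.
    """
    H, S = list(H), set(S)
    if len(H) != len(S):
        return False
    owner = {}  # color -> vertex currently holding it

    def try_assign(v, seen):
        # Try to give v a color, re-assigning earlier vertices if needed.
        for c in gamma[v] & S:
            if c in seen:
                continue
            seen.add(c)
            if c not in owner or try_assign(owner[c], seen):
                owner[c] = v
                return True
        return False

    return all(try_assign(v, set()) for v in H)
```

For example, with color sets a: {1, 2}, b: {1} and c: {2, 3}, the set {a, b, c} is {1, 2, 3}-colorful because b must take color 1, which forces a to take 2 and c to take 3.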

The topology-free network querying problem can then be formulated as a C-colorful connected subgraph problem, based on the colorful concept, as follows.


C-colorful connected subgraph problem: Given a graph G = (V, E), a color set C, and a coloring constraint function Γ: V → 2^C, is there a connected subgraph of G that is C-colorful?

This problem corresponds to the topology-free network querying problem as follows. Suppose we have a query complex with |C| = k proteins. If we assign each protein in this complex a distinct color (even if the protein has paralogs in the complex), we obtain the color set C. If a protein in the target network G is orthologous to a protein in the complex, the color of that query protein is added to its color set. Thus, a protein in G can have multiple colors in its color set when it is orthologous to more than one protein of the complex. Therefore, if there is a connected subgraph of G that is C-colorful, its node set has the same set of protein families (or homolog groups) as the complex, and each family has the same number of paralogs as in the complex. This subgraph is considered a conserved counterpart of the query protein complex.

We can also give another formulation of this problem that is somewhat simpler to visualize:

Let the query complex be a multiset M of colors in which each color represents a biological protein function. Thus, paralogs in this complex have the same color. The problem is then: does G have a connected subset of vertices whose multiset of colors equals M? (Note: two multisets are defined to be equal if each element has the same multiplicity (number of occurrences) in both.)

Figure 3.4 – An illustration of the query set of proteins (a) and its matched connected subgraph (b) in the target network; each number label represents a color. The multisets of colors, which represent multisets of biological protein functions, in (a) and (b) are equal.

For the topology-free network querying problem defined above, Torque provides three solution approaches:


Randomized dynamic programming approach:

This approach first considers only coloring constraint functions that associate each vertex v ∈ V with a single color. The problem is then to find a connected subgraph that has exactly one vertex of each color in the query protein complex. Since every connected subgraph has a spanning tree, this approach looks for colorful trees. A dynamic programming table B is constructed with rows corresponding to vertices and columns corresponding to subsets of colors: B(v, S) = true if there exists in G a subtree rooted at v that is S-colorful, and B(v, S) = false otherwise. For initialization, when S consists of a single color c and v ∈ V, we set B(v, {c}) = true iff the color set associated with v contains only c. The other entries of B can be computed using the following recurrence:

$$B(v, S) = \bigvee_{\substack{u \in N(v) \\ S_1 \uplus S_2 = S \\ \Gamma(v) \in S_1,\ \Gamma(u) \in S_2}} B(v, S_1) \wedge B(u, S_2)$$

(N(v) is the set of neighbors of v)
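The Boolean recurrence for the single-color case can be sketched with color subsets encoded as bitmasks. This is an illustrative implementation under stated assumptions rather than Torque's own code: colors are assumed to be integers 0..k-1, vertices without a color are given `None` and ignored, and the names `colorful_tree_table`, `adj` and `color` are hypothetical.

```python
def colorful_tree_table(adj, color, k):
    """Fill the table B(v, S) for the single-color case.

    adj:   dict mapping each vertex to an iterable of its neighbors
    color: dict mapping each vertex to a color in 0..k-1 (or None)
    Returns B, where B[v][mask] is True iff some subtree rooted at v
    uses exactly the colors in mask, each once.
    """
    full = (1 << k) - 1
    B = {v: [False] * (full + 1) for v in adj if color.get(v) is not None}
    for v in B:
        B[v][1 << color[v]] = True           # single-vertex trees
    # Proper submasks are numerically smaller, so increasing order is safe.
    for S in range(1, full + 1):
        for v in B:
            cv = 1 << color[v]
            if not S & cv or S == cv:
                continue
            S1 = (S - 1) & S                 # enumerate proper submasks of S
            while S1 and not B[v][S]:
                S2 = S ^ S1
                if S1 & cv and B[v][S1]:
                    # Attach a subtree rooted at a neighbor u that covers S2.
                    if any(u in B and B[u][S2] for u in adj[v]):
                        B[v][S] = True
                S1 = (S1 - 1) & S
    return B
```

A C-colorful tree exists exactly when `B[v][(1 << k) - 1]` is true for some vertex v; enumerating all submasks of all masks is what gives the 3^k factor in the running time.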

This algorithm runs in O(3^k m) time and can be generalized to weighted graphs by searching for the heaviest colorful subtree rooted at each vertex, in which case B(v, S) is a real number instead of a Boolean value. The weight of an optimum match is given by max_v B(v, C), and the recurrence is modified as:

$$B(v, S) = \max_{\substack{u \in N(v) \\ S_1 \uplus S_2 = S \\ \Gamma(v) \in S_1,\ \Gamma(u) \in S_2}} \left[ B(v, S_1) + B(u, S_2) + w(u, v) \right]$$

After the single-colored case is solved, the approach is extended to allow a limited number of insertions and deletions in the resulting subgraphs. An S-colorful solution allowing j special insertions is defined as a connected subgraph H ⊆ G with a subgraph H' ⊆ H such that V(H') is S-colorful and all other vertices of H are non-colored. Finding a C-colorful connected subgraph with up to Nins special insertions can then be solved in O(3^k m Nins) time. Deletions can be handled directly by the dynamic programming algorithm: if no C-colorful solution was found, then B(v, C) = false for all v. Allowing up to Ndel deletions can be done by scanning the entries of B: a match exists if there is a subset Ĉ ⊆ C that omits at most Ndel colors and for which B(v, Ĉ) = true for some vertex v.

Finally, the approach is generalized to multiple color constraints, where the color constraint function can associate each vertex with a set of colors rather than a single color as above. This situation arises when a protein in the network is homologous to more than one protein in the query complex. The basic idea is to reduce the problem to the single-color case by randomly choosing a single valid color (distinct from those of the other vertices) for every vertex. To do this, a coloring graph is defined as a bipartite graph B = (V, C, E), where V is the set of target network vertices, C is the set of colors, and (v, c) ∈ E iff vertex v has color c in its color set. For a possible match to the query, the probability that a subset of vertices of size k becomes colorful in a random coloring is at least 1/(k!).

Integer linear programming:

An integer linear programming (ILP) formulation is also given for the C-colorful connected subgraph problem, so that ILP solvers can be employed. This method allows exactly Nins arbitrary insertions and exactly Ndel arbitrary deletions. Specifically, we are given edge weights ω: E → Q and wish to find a vertex subset K ⊆ V of size t = k + Nins - Ndel that maximizes the total edge weight

$$\sum_{(v, w) \in E;\ v, w \in K} \omega(vw).$$

To express the connectivity of the C-colorful subgraph, the problem is formulated as finding a flow with t-1 selected vertices as sources of flow 1 each and a selected sink r that drains a flow of t-1, while disallowing flow between non-selected vertices. For details of this formulation, please refer to [Bruckner et al., 2010].

Shortest-path based heuristic:

A heuristic based on a shortest-path algorithm is designed to quickly find C-colorful subgraphs in the target network. This heuristic is suitable for cases where the number of colored vertices is small, and it does not allow insertions/deletions (indels) in the resulting subgraphs. It is also used as a preliminary step: when it fails to return a solution, or when indels are required, the dynamic programming or integer linear programming approach above is run.

The heuristic aims to partition the vertex set V of the target network into two subsets: Vin, which is the final solution (the connected component that is C-colorful), and Vout, which contains the remaining vertices. To reach this final result, it maintains a partition of V into three sets, Vin, Vout and Vopen. Starting with Vopen = V, vertices are then greedily moved from Vopen to Vin or Vout, that is, accepted or rejected. Shortest paths are used in this heuristic as the criterion for moving colored nodes from Vopen to Vin.

Experimental results:

Torque was applied to six collections of protein complexes from yeast, fly and human, using the complexes of one species as queries against the target PPI networks of the other species. The comparison of results showed that it outperformed QNet (which was considered a state-of-the-art method for finding conserved protein complexes and pathways at that time) in all cases.

3.2.3. Other network querying based methods

QPath [Shlomi et al., 2006] is a technique for querying PPI networks with path-structured queries. QNet [Dost et al., 2008] is an extension of QPath to queries shaped as trees and as graphs with bounded treewidth (though its implementation handles only tree-shaped queries). Both QPath and QNet are based on the color coding technique [Alon et al., 1995], a randomized technique for finding simple paths and simple cycles of a specified length k within a graph (the basic idea is to randomly assign k colors to the vertices of the graph and then search for colorful paths in which each color is used exactly once). In both methods, the total numbers of node insertions and deletions in the potential solutions are bounded by two thresholds, Nins and Ndel.
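The color coding idea can be illustrated with a small sketch that tests whether a graph contains a simple path on k vertices. It is a generic textbook-style version, not the QPath or QNet implementation; the function names, the number of random trials and the fixed seed are assumptions chosen for illustration.

```python
import random

def has_colorful_path(adj, coloring, k):
    """Check for a path on k vertices whose colors are all distinct.

    adj:      dict mapping each vertex to an iterable of its neighbors
    coloring: dict mapping each vertex to a color in 0..k-1
    """
    full = (1 << k) - 1
    # reach[v] holds the color sets (bitmasks) of colorful paths ending at v.
    reach = {v: {1 << coloring[v]} for v in adj}
    for _ in range(k - 1):
        nxt = {v: set() for v in adj}
        for v in adj:
            bit = 1 << coloring[v]
            for u in adj[v]:
                for mask in reach[u]:
                    if not mask & bit:       # color of v not used yet
                        nxt[v].add(mask | bit)
        reach = nxt
    return any(full in masks for masks in reach.values())

def simple_path_exists(adj, k, trials=200, seed=0):
    """Randomized test for a simple path on k vertices.

    A True answer is always correct; a False answer may be wrong with
    a probability that shrinks as the number of trials grows.
    """
    rng = random.Random(seed)
    for _ in range(trials):
        coloring = {v: rng.randrange(k) for v in adj}
        if has_colorful_path(adj, coloring, k):
            return True
    return False
```

A colorful path is necessarily simple because no vertex can appear twice without repeating its color, which is why the dynamic program only needs to track color sets rather than visited vertices.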

3.3. Comparison between the approaches

Local network alignment has a sound theoretical framework for modeling complex conservation and identifying conserved protein complexes, so methods based on this framework can easily incorporate their own definitions of protein complex evolution [Sharan et al., 2005; Koyuturk et al., 2006; Flannick et al., 2006; Hirsh et al., 2007; Nguyen et al., 2013]. Because network alignment is based on the co-occurrence of protein interactions across multiple species, it helps complex detection focus on the more reliable parts of the PPI networks, thereby increasing the precision of the task.

Network querying employs known protein complexes from well-studied species to query the PPI networks of other species. This can help to compensate for the incompleteness of the PPI networks of some newly sequenced species. On the other hand, this approach is restricted by the collections of known protein complexes and cannot be extended to detect novel complexes, which in turn highlights that advantage of the network alignment approach. There are still no methods that combine the two approaches to make the best use of the available information. Topology-free querying is flexible and robust to noise in protein interaction data, but at the same time it misses the important information of interaction pattern similarity. Table 3.1 below summarizes the comparison between methods of the local network alignment approach and the network querying approach.

| Approach / method | Advantages | Disadvantages |
|---|---|---|
| Local network alignment approach | Sound theoretical framework and ease in incorporating protein complex evolution models. Reducing noise in data by focusing on co-occurring PPIs, which are more reliable PPIs. Can detect novel protein complexes. | Not using the information of known protein complexes. |
| NetworkBLAST [Sharan et al., 2005a&b] | Using a simple probabilistic protein complex conservation model based on dense subgraphs and protein sequence similarity. | Using only whole-sequence similarity (BLAST score) for aligning proteins. |
| MaWIsh [Koyuturk et al., 2006] | Using the duplication/divergence models for protein interaction evolution. | Using only whole-sequence similarity (BLAST score) for aligning proteins. |
| Graemlin [Flannick et al., 2006] | Combining phylogenetic relationships of proteins in different species and the evolutionary history of interactions. | Using only whole-sequence similarity (BLAST score) for aligning proteins. |
| [Hirsh et al., 2007] | Using a protein complex evolution model based on protein interaction attachment/detachment and gene duplication events. | Using only whole-sequence similarity (BLAST score) for aligning proteins. |
| COCIN [Nguyen et al., 2013] (our method) | Considering protein domains in identifying functional conserved proteins. | |
| Network querying approach | Using the information of known protein complexes to compensate for incompleteness in the queried PPI networks, and as a good guide for searching for conserved complexes. | Not able to detect novel protein complexes because it is restricted by the querying protein complexes. |
| Topology-free querying [Bruckner et al., 2010] | Flexible and robust to noise in protein interaction data. | |
| QPath [Shlomi et al., 2006] | Simple and fast | Only allows path-structured queries |
| QNet [Dost et al., 2008] | Can allow both