
Chapter 5 ESSENTIAL PROTEIN DISCOVERY IN A PPI NETWORK US-

5.3 Methods

5.3.1 Algorithms

Existing centrality algorithms are unstable in incomplete networks, and they are derived from structural properties only. We therefore develop a robust and biologically meaningful centrality algorithm using network motifs and gene ontology terms. To show how the algorithms differ and why a new one is needed, we first review existing centrality algorithms and then introduce the MC and MCGO algorithms.

5.3.1.1 Centrality Algorithms

Centrality algorithms are used to identify the more influential individuals in a social group represented as a social network [13]. Many researchers have applied centrality measures to the analysis of biological networks, for example to predict essential proteins in PPI networks [254,265,266] and to detect global gene regulators in gene regulation networks [267]. However, the term ‘centrality’ is ambiguous, as the notion depends on the context. For example, as shown in Figure 5.1, a vertex can be considered central if its removal separates the network into two or more components. On the other hand, a vertex can also be considered central if the network is scattered when the vertex is removed. Various centrality algorithms have therefore been developed with different purposes and interpretations.

Wang et al. [254] introduced a new centrality algorithm, named SoECC (sum of edge clustering coefficient centrality), for identifying essential proteins in PPI networks, and compared its performance with that of other existing centrality algorithms. The study showed that, while none of the other algorithms is dominantly better, SoECC outperforms all other centrality methods on several validations. Similarly, Li et al. [263] introduced LAC, defined as local average connectivity, and showed its superior performance compared with other methods. In this chapter, we introduce MC (Motif Centrality) and MCGO (Motif Centrality in a GO-pruned network) and compare their performance with that of DC, BC, CC, SC, EC, SoECC and LAC.

For notation, a PPI network is regarded as an undirected graph $G = (V, E)$, where $V$ is the set of vertices (proteins) and $E$ is the set of edges (interactions). The number of vertices in $G$ is $N$, and $A$ is the adjacency matrix of $G$. Each node $u \in V$ is ranked differently by the different centrality algorithms, as follows:

Figure 5.1: The top graph is the original network. If we remove A, the graph is separated into two subgraphs, as shown at the bottom left. However, if we remove B or C, the graph is nearly scattered, as shown at the bottom right. Therefore, the central node is not uniquely determined. The figure is taken from [13].

1. Degree Centrality (DC) ranks each node by its degree.

$$\mathrm{DC}(u) = d_u \tag{5.1}$$

where $d_u$ is the degree of $u$ in $G$.

2. Betweenness Centrality (BC) scores a node $u$ as the sum, over all pairs of other nodes, of the fraction of shortest paths between the pair that pass through $u$.

$$\mathrm{BC}(u) = \sum_{s} \sum_{t} \frac{\rho(s, u, t)}{\rho(s, t)}, \quad s \neq t \neq u \tag{5.2}$$

where $\rho(s, t)$ is the total number of shortest paths between $s$ and $t$, and $\rho(s, u, t)$ is the number of those shortest paths that include $u$.

3. Closeness Centrality (CC) of a node $u$ is the inverse of the average graph-theoretic distance from $u$ to all other nodes in $G$.

$$\mathrm{CC}(u) = \frac{N - 1}{\sum_{v} \mathrm{dist}(u, v)} \tag{5.3}$$

where $\mathrm{dist}(u, v)$ is the distance between $u$ and $v$, that is, the number of links in the shortest path between $u$ and $v$.

4. Subgraph Centrality (SC) of a node $u$ is defined as the number of subgraphs in $G$ in which $u$ participates, where smaller subgraphs are given more weight.

$$\mathrm{SC}(u) = \sum_{l=0}^{\infty} \frac{\mu_l(u)}{l!} = \sum_{v=1}^{N} [\alpha_v(u)]^2 e^{\lambda_v} \tag{5.4}$$

where $\mu_l(u)$ is the number of closed loops of length $l$ at $u$, the vectors $\alpha_v$ ($1 \le v \le N$) form an orthonormal basis of $\mathbb{R}^N$ composed of eigenvectors of $A$ associated with the eigenvalues $\lambda_v$ ($1 \le v \le N$), and $\alpha_v(u)$ is the $u$th component of $\alpha_v$.

5. Eigenvector Centrality (EC) of a node $u$ is the $u$th component of the principal eigenvector of $A$.

$$\mathrm{EC}(u) = \alpha_{\max}(u) \tag{5.5}$$

where $\alpha_{\max}$ is the eigenvector corresponding to the largest eigenvalue of $A$, and $\alpha_{\max}(u)$ is the $u$th component of $\alpha_{\max}$.

6. Sum of edge clustering coefficient (SoECC) is the sum of the edge clustering coefficients of all edges between $u$ and its neighbors.

$$\mathrm{SoECC}(u) = \sum_{v \in N_u} \mathrm{ECC}(u, v) \tag{5.6}$$

$$= \sum_{v \in N_u} \frac{z_{u,v}}{\min(d_u - 1, d_v - 1)} \tag{5.7}$$

where $N_u$ is the set of all neighbors of $u$, $z_{u,v}$ is the number of triangles that include the edge $(u, v)$, and $d_u$ and $d_v$ are the degrees of $u$ and $v$ in $G$, respectively.

7. Local Average Connectivity (LAC) of a node $u$ is defined as the average local connectivity of its neighbors.

$$\mathrm{LAC}(u) = \frac{\sum_{w \in N_u} \deg^{C_u}(w)}{|N_u|} \tag{5.8}$$

where $N_u$ is the set of neighbors of node $u$, $C_u$ is the subgraph induced by $N_u$, and $\deg^{C_u}(w)$ is the degree of $w$ in the graph $C_u$.
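To make the local measures concrete, the following is a minimal sketch of DC, CC, SoECC and LAC on a small toy graph. It assumes the network is stored as a dictionary mapping each node to the set of its neighbors. The function names and the toy graph are illustrative only, and the handling of degree-1 end points in SoECC (their edges contribute zero) is an assumption rather than something stated in the definitions above.

```python
from collections import deque

def degree_centrality(adj):
    """DC(u) = d_u, the number of neighbors of u."""
    return {u: len(nbrs) for u, nbrs in adj.items()}

def closeness_centrality(adj):
    """CC(u) = (N - 1) / sum_v dist(u, v); assumes a connected graph."""
    n = len(adj)
    cc = {}
    for u in adj:
        dist = {u: 0}
        queue = deque([u])
        while queue:  # breadth-first search gives shortest-path lengths
            x = queue.popleft()
            for y in adj[x]:
                if y not in dist:
                    dist[y] = dist[x] + 1
                    queue.append(y)
        total = sum(dist.values())
        cc[u] = (n - 1) / total if total else 0.0
    return cc

def soecc(adj):
    """SoECC(u) = sum over neighbors v of z_uv / min(d_u - 1, d_v - 1)."""
    score = {}
    for u in adj:
        s = 0.0
        for v in adj[u]:
            denom = min(len(adj[u]) - 1, len(adj[v]) - 1)
            if denom > 0:  # assumption: a degree-1 end point contributes 0
                z = len(adj[u] & adj[v])  # triangles containing edge (u, v)
                s += z / denom
        score[u] = s
    return score

def lac(adj):
    """LAC(u) = mean degree of u's neighbors inside the subgraph induced by N_u."""
    score = {}
    for u, nbrs in adj.items():
        if not nbrs:
            score[u] = 0.0
            continue
        score[u] = sum(len(adj[w] & nbrs) for w in nbrs) / len(nbrs)
    return score

# Toy network: a triangle a-b-c with a pendant node d attached to c.
toy = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}

for name, fn in [("DC", degree_centrality), ("CC", closeness_centrality),
                 ("SoECC", soecc), ("LAC", lac)]:
    print(name, fn(toy))
```

On this toy graph the measures already disagree: DC and CC rank c highest, while LAC ranks a and b above c because the pendant neighbor d lowers the local connectivity around c.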

5.3.1.2 Motif Centrality (MC) and MCGO

Network motifs are defined as frequent and unique subgraph patterns in a network, and they are used in many biological applications. Like a protein sequence motif, a network motif is an over-represented pattern, but its detection is far more costly, as it involves NP-hard isomorphism testing and repeated processing to determine uniqueness. Definition 4.3.1 defines a network motif $m$ formally.

We define the Motif Centrality of a node $u$, MC($u$), as the number of motifs in which $u$ is contained, divided by a weight $w_k$, as defined in Equation (5.9). MCGO($u$) in Equation (5.10) is MC($u$) in a reduced graph $G'$, which is the result of EDGEGO($G$). The EDGEGO algorithm is described in the next section.

$$\mathrm{MC}(u) = \sum_{i=1}^{n} \frac{m_i(u)}{w_k}, \quad u \in G \tag{5.9}$$

$$\mathrm{MCGO}(u) = \mathrm{MC}(u), \quad u \in G' \tag{5.10}$$

In the above equations, $n$ is the number of all network motifs in $G$, $k$ is the motif size, $w_k$ is $|V|^k$, and $m_i(u) = 1$ if the node $u$ is a member of the motif $m_i$ and $m_i(u) = 0$ otherwise. The motif size $k$ is currently set to 3 or 4 for practical use.
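As an illustration of Equation (5.9), the sketch below counts, for each node, the motif instances that contain it and divides by the weight. It assumes the motif instances have already been detected and are supplied as tuples of nodes; motif detection itself is not shown. The function name `motif_centrality`, the toy instances and the weight value used in the example are hypothetical, and the weight is passed in explicitly so the sketch does not depend on its exact form.

```python
def motif_centrality(nodes, motif_instances, weight):
    """MC(u) = (number of motif instances containing u) / weight.

    nodes           -- iterable of all nodes in the graph
    motif_instances -- iterable of node tuples, one per detected motif instance
    weight          -- the normalizing weight w_k (supplied by the caller)
    """
    counts = {u: 0 for u in nodes}
    for instance in motif_instances:
        for u in set(instance):  # m_i(u) = 1 if u is a member of motif i
            counts[u] += 1
    return {u: c / weight for u, c in counts.items()}

# Hypothetical example: two size-3 motif instances in a four-node graph.
nodes = ["a", "b", "c", "d"]
instances = [("a", "b", "c"), ("b", "c", "d")]
mc = motif_centrality(nodes, instances, weight=2.0)  # weight value is illustrative
print(mc)
```

Under the same assumptions, MCGO would be obtained by calling the same function with the motif instances detected in the reduced graph rather than in the original graph.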

We should note that MC is closely related to DC and SC. In most cases, a node with a high degree has a higher chance of being involved in more network motifs. However, MC is more complex than DC, because MC is affected not only by direct neighbors but also by neighbors several hops away. MC is also more robust than SC, as network motifs are frequent and unique subgraphs, while SC involves all subgraphs regardless of their frequency or uniqueness.

5.3.1.3 EDGEGO algorithm

Most centrality algorithms are based on the structure of the network only, such as degree, distance or edge clustering coefficient. In this work, we introduce an algorithm, EDGEGO, which removes a number of ‘biologically insignificant’ edges from the network as a way to incorporate biological information. The EDGEGO algorithm is similar to EDGEGO-BNM in Algorithm 1, which was used to detect biological network motifs in Chapter 4. The only difference is that EDGEGO in Algorithm 5 stops after it has removed edges based on their GO terms and returns a reduced network, while EDGEGO-BNM continues on to detect network motifs.

In the EDGEGO algorithm, Gene Ontology (GO) [231] terms for the proteins in a network determine the biologically insignificant edges to be removed. We specifically use GO in a PPI network, as GO terms provide annotations of gene and gene product attributes across species and databases. GO consists of three independent domains: biological process (BP), molecular function (MF) and cellular component (CC). A BP refers to a series of events accomplished by multiple molecular functions, such as cellular physiological process and pyrimidine metabolic process. An MF is an activity at the molecular level, such as catalytic activity or binding. A CC is a component of a cell that is part of a larger object; examples are the nucleus, ribosome and proteasome. GO is represented as a directed acyclic graph (DAG), as shown in Figure 4.2, so each GO term has an informative depth in the GO DAG. Every protein in a data network is annotated with multiple GO terms, because if a gene $g_e$ is annotated with a GO term $p_e$, then $g_e$ is also annotated with all ancestor GO terms of $p_e$.
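The ancestor rule above can be sketched as a closure over the GO DAG. The example assumes the DAG is given as a dictionary mapping each term to the set of its parent terms. The term names (`root`, `T1`, `T2`, `T3`), the gene names and both function names are hypothetical placeholders, not real GO identifiers or an existing API.

```python
def ancestors(term, parents):
    """Return all ancestors of a term in a DAG given as term -> set of parents."""
    seen = set()
    stack = [term]
    while stack:
        t = stack.pop()
        for p in parents.get(t, ()):
            if p not in seen:  # a term can be reached along several paths
                seen.add(p)
                stack.append(p)
    return seen

def propagate(direct_annotations, parents):
    """Extend each gene's direct annotations with all ancestor terms."""
    full = {}
    for gene, terms in direct_annotations.items():
        closed = set(terms)
        for t in terms:
            closed |= ancestors(t, parents)
        full[gene] = closed
    return full

# Hypothetical DAG: T3 has two parents, T1 and T2, which both descend from root.
parents = {"T1": {"root"}, "T2": {"root"}, "T3": {"T1", "T2"}}
full = propagate({"g1": {"T3"}, "g2": {"T1"}}, parents)
print(full)
```

Here a gene annotated only with the deepest term T3 ends up annotated with every term on the paths to the root, which is why each protein carries multiple GO terms.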

We define the EdgeGO of an edge $e$ as the set of all GO terms associated with both end points of $e$, and the EdgeGO depth of $e$ as the maximum depth of the GO terms in the EdgeGO, as given in Definition 4.3.2.1.

Algorithm 5: EDGEGO

input : Graph G = (V, E); d: a GO depth threshold

output : Reduced graph G′ = (V, E′)

E′ ← E
for all e ∈ E do
    let p, q be the end nodes of e
    GO(p) ← the set of GO terms of p
    GO(q) ← the set of GO terms of q
    EdgeGO(e) ← GO(p) ∩ GO(q)
    D ← EdgeGOdepth(e)
    if D < d then
        E′ ← E′ − {e}
output G′ = (V, E′)

Algorithm 5 describes the detailed steps of EDGEGO. A threshold $d$ is given as a parameter, and an edge $e$ is removed if EdgeGOdepth($e$) is less than $d$. Different values of $d$ remove different numbers of edges, and we determine $d$ experimentally for the best results. This work is motivated by Lee et al. [28], which reveals that different levels of GO terms lead to different motif modes. EDGEGO is deterministic, and its running time is linear in the number of edges in the graph.
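The pruning step can be sketched in a few lines. The example assumes each protein's GO annotations (already including ancestor terms) and each term's depth are available as dictionaries. The names `edge_go_depth` and `edgego`, the toy proteins and the term names are hypothetical. Treating an edge whose end points share no GO term as having depth -1, so that it is always removed for a non-negative threshold, is an assumption of this sketch.

```python
def edge_go_depth(p, q, go_terms, depth):
    """Maximum depth among the GO terms shared by both end points of an edge."""
    shared = go_terms.get(p, set()) & go_terms.get(q, set())
    # Assumption: an edge with no shared GO term gets depth -1.
    return max((depth[t] for t in shared), default=-1)

def edgego(vertices, edges, go_terms, depth, d):
    """Return the reduced graph (V, E') keeping edges with EdgeGO depth >= d."""
    kept = [(p, q) for (p, q) in edges
            if edge_go_depth(p, q, go_terms, depth) >= d]
    return vertices, kept

# Hypothetical toy data: three proteins, two interactions.
vertices = ["p1", "p2", "p3"]
edges = [("p1", "p2"), ("p2", "p3")]
go_terms = {
    "p1": {"root", "T1", "T3"},
    "p2": {"root", "T1", "T3"},
    "p3": {"root", "T2"},
}
depth = {"root": 0, "T1": 1, "T2": 1, "T3": 2}

reduced_vertices, reduced_edges = edgego(vertices, edges, go_terms, depth, d=2)
print(reduced_edges)
```

With threshold 2, the edge between p2 and p3 is removed because the two proteins share only the root term. Each edge is examined once, in line with the linear running time stated above, provided the set intersection per edge is treated as bounded by the annotation size.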

In document Innovative Algorithms and Evaluation Methods for Biological Motif Finding (Page 122-128)