Noise Tolerance - Evaluation of Directed Graphlet-based Methods for Network Comparison

3.2 Evaluation of Directed Graphlet-based Methods for Network Comparison

3.2.3 Noise Tolerance

Real-world network data are noisy. This is especially true for biological networks, which are still incomplete and contain false edges. Hence, we evaluate the robustness to noise of all directed graphlet-based measures and contrast them with the spectral distance measure and the total degree distribution distance (sum of in- and out-degree). We evaluate clustering performance for the following types of noise in the networks:

• Networks with missing edges, which correspond to real-world scenarios of incomplete networks. We repeat the model clustering evaluation for different percentages of missing edges in the networks described in Section 3.2.2 as follows. For each of the 5 network models, 3 network sizes (500, 1000 and 2000 nodes) and 2 network densities (1% and 0.5%), we generate 10 networks, resulting in 5×3×2×10 = 300 networks to cluster. We remove a random 10% of edges from each network and evaluate the clustering performance by measuring AUC. To account for the variability introduced by the randomisation, we repeat the removal of 10% of edges from the original networks and the AUC measurement 20 times, and calculate the mean, maximum and minimum value of AUC for clustering networks with 10% missing edges. Following the same approach, we evaluate model clustering when 20%, 30%, 40%, 50%, 60% and 70% of edges are removed, again calculating the mean, maximum and minimum value of AUC. We consider removing up to 70% of edges, as the quality of clustering drops significantly for noise levels higher than 70%. As in Section 3.2.2, we evaluate the model clustering performance of the different measures for two cases: (1) we compare networks of the same density and size, and (2) we compare all networks. Note that we used 10 instead of 30 instances of each network model, size and density, as was the case for the original settings, to reduce the time required for computing DGDV for the large number of networks that results from repeating the clustering 20 times for each level of noise.

• Rewired networks, which correspond to noisy real-world networks. Following the approach presented above, we calculate the mean, maximum and minimum value of AUC for different percentages of rewired edges: from 10% to 70% in increments of 10%. As in Section 3.2.2, we evaluate the model clustering performance of the different measures for two cases: (1) we compare networks of the same density and size, and (2) we compare all networks.

• Networks with added edges, which correspond to networks with falsely identified edges. Following the approach presented above, we calculate the mean, maximum and minimum value of AUC for different percentages of added edges: from 10% to 70% in increments of 10%. As in Section 3.2.2, we evaluate the model clustering performance of the different measures for two cases: (1) we compare networks of the same density and size, and (2) we compare all networks.
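As a rough illustration of the three noise types listed above, the following Python sketch perturbs a directed edge list. It is not the implementation used in this study: the function names are hypothetical, the input is assumed to be a list of unique (source, target) pairs without self-loops, and the rewiring rule (remove a random fraction of edges, then insert the same number of new random edges) is an assumption, since the text does not specify how edges are rewired.

```python
import random


def remove_edges(edges, fraction, rng):
    """Return the edge list with a random `fraction` of its edges removed."""
    edges = list(edges)
    k = round(fraction * len(edges))
    removed = set(rng.sample(range(len(edges)), k))
    return [e for i, e in enumerate(edges) if i not in removed]


def _new_random_edges(forbidden, nodes, k, rng):
    """Draw k distinct directed edges (no self-loops) that are not in `forbidden`.

    Rejection sampling: adequate for sparse networks, slow for dense ones.
    """
    nodes = list(nodes)
    added = set()
    while len(added) < k:
        u, v = rng.sample(nodes, 2)  # two distinct nodes, so no self-loop
        if (u, v) not in forbidden:
            added.add((u, v))
    return sorted(added)


def add_edges(edges, nodes, fraction, rng):
    """Add round(fraction * |E|) random edges that were not present before."""
    edges = list(edges)
    k = round(fraction * len(edges))
    return edges + _new_random_edges(set(edges), nodes, k, rng)


def rewire_edges(edges, nodes, fraction, rng):
    """Assumed rewiring rule: remove a fraction of edges, then add as many new ones.

    New edges are drawn outside the original edge set, so the edge count is
    preserved and exactly round(fraction * |E|) edges differ from the original.
    """
    edges = list(edges)
    kept = remove_edges(edges, fraction, rng)
    k = len(edges) - len(kept)
    return kept + _new_random_edges(set(edges), nodes, k, rng)
```

In the setting described above, each function would be applied at noise levels 0.1 to 0.7 in steps of 0.1, with 20 independent repetitions per level (for example with 20 differently seeded `random.Random` instances), before recomputing the distance measures and the AUC.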

Figure 3.7-a presents the minimum, mean and maximum AUC values for network clustering, when comparing all-to-all networks, against growing percentages of missing edges. Figure 3.7-b presents the minimum, mean and maximum AUC values for network clustering, when comparing same size networks, against growing percentages of missing edges.

Figure 3.8-a presents the minimum, mean and maximum AUC values for network clustering, when comparing all-to-all networks, against growing percentages of rewired edges. Figure 3.8-b presents the minimum, mean and maximum AUC values for network clustering, when comparing same size networks, against growing percentages of rewired edges.

Figure 3.9-a presents the minimum, mean and maximum AUC values for network clustering, when comparing all-to-all networks, against growing percentages of added edges. Figure 3.9-b presents the minimum, mean and maximum AUC values for network clustering, when comparing same size networks, against growing percentages of added edges.

[Figure 3.7: two panels, (a) and (b). Horizontal axis: Missing edges (%); vertical axis: AUC. Legend: DGCD-129, DGCD-13, DGDDA, RDGF-3, RDGF-2, IN,OUT degree, Spectr. distance.]

Figure 3.7. Effects of missing network edges on model clustering performance of different network distance measures. The vertical axis represents the mean, maximum and minimum value of AUC scores for the 20 randomised experiments that are performed at each of the noise levels that are presented by the horizontal axis independently. (a) AUC scores obtained by comparing all pairs of the 300 networks. (b) AUC scores obtained by comparing only the same size and density networks.

[Figure 3.8: two panels, (a) and (b). Horizontal axis: Rewired edges (%); vertical axis: AUC. Legend: DGCD-129, DGCD-13, DGDDA, RDGF-3, RDGF-2, IN,OUT degree, Spectr. distance.]

Figure 3.8. Effects of rewiring networks on model clustering performance of different network distance measures. The vertical axis represents the mean, maximum and minimum value of AUC scores for the 20 randomised experiments that are performed at each of the noise levels that are presented by the horizontal axis independently. (a) AUC scores obtained by comparing all pairs of the 300 networks. (b) AUC scores obtained by comparing only the same size and density networks.

[Figure 3.9: two panels, (a) and (b). Horizontal axis: Added edges (%); vertical axis: AUC. Legend: DGCD-129, DGCD-13, DGDDA, RDGF-3, RDGF-2, IN,OUT degree, Spectr. distance.]

Figure 3.9. Effects of adding network edges on model clustering performance of different network distance measures. The vertical axis represents the mean, maximum and minimum value of AUC scores for the 20 randomised experiments that are performed at each of the noise levels that are presented by the horizontal axis independently. (a) AUC scores obtained by comparing all pairs of the 300 networks. (b) AUC scores obtained by comparing only the same size and density networks.

The evaluation of clustering of noisy networks shows that the DGCD-13 measure outperforms all other measures, except when noise is modelled by random addition of edges and only networks of the same size and density are compared (Figure 3.9-b), in which case it is outperformed by DGCD-129. Also, in all observed cases, the two DGCD and the two RDGF measures outperform the other network comparison measures (Figures 3.7, 3.8, 3.9). The degree distribution distance measure and DGDDA compete for the fifth position: DGDDA performs better in clustering model networks when all networks are compared regardless of their size and density (Figures 3.7-a, 3.8-a, 3.9-a), while degree distribution distance clusters the model networks better when only networks of the same size and density are compared (Figures 3.7-b, 3.8-b, 3.9-b).

Overall, the results emphasise the significance of graphlet-based measures for directed network comparison. The best results are obtained by using the DGCD measures, followed by the RDGF distances. The RDGF-2 distance, which also takes into account the two-node orbits, slightly outperforms RDGF-3, where these orbits are omitted.

3.3 Conclusions

In this chapter we introduced directed graphlets and orbits on up to four nodes, and defined and implemented directed graphlet-based heuristics. We identified orbit dependencies and accounted for them when defining directed graphlet degree vector similarity between two nodes in a network. We also derived 23 equations that describe relationships between directed graphlet orbits in networks without anti-parallel pairs of arcs. We implemented a directed graphlet and orbit counting algorithm and used it on synthetic model networks when evaluating our new measures for network comparison: relative directed graphlet frequency distance, directed graphlet degree distribution similarity and directed graphlet correlation distance. We compared these measures to other common directed network comparison measures by evaluating their performance on model network clustering, and found that directed graphlet-based measures outperform the others. The directed graphlet correlation distance performed the best in model clustering and showed the highest tolerance to noise, regardless of the type of noise in the networks: random addition of edges, random removal of edges or random edge rewiring.

3.4 Author’s Contributions

Section 3.1 Anida Sarajlić defined 40 directed graphlets and 129 orbits, implemented the graphlet and orbit counting algorithm, generalised the existing graphlet measures to the directed case, identified the redundancies between directed graphlet orbits and derived all orbit redundancy equations.

Section 3.2 Anida Sarajlić generated the directed random model networks, implemented and performed experiments for evaluation of model clustering for all analysed distance measures, generated noisy networks, implemented and performed experiments for evaluation of model clustering in the presence of noise and analysed results.

Anida Sarajlić was supervised on the work presented in this chapter by Dr. Noël Malod-Dognin and Dr. Nataša Pržulj, who defined the research topic and assigned it to Anida Sarajlić.

Anida Sarajlić wrote the first draft for the paper: Anida Sarajlić, Noël Malod-Dognin, Ömer Nebil Yaveroğlu and Nataša Pržulj: “Directed Graphlets Uncover Topology–Function Relationships in Directed Metabolic Networks of Eukaryotes” in August 2015. This paper draft contained the work presented in Chapters 3 and 4. Currently (December 2015), the results presented in that paper draft are being merged with the results of application of directed graphlets to directed world trade networks, aiming for a publication with a wider range of applications (Note: Anida Sarajlić provided the directed orbit and graphlet counts for the directed world trade networks, while further experiments and analyses on world trade networks were performed by Noël Malod-Dognin and Ömer Nebil Yaveroğlu).

4 Application of Directed Graphlet-based Methods to Metabolic Networks

In Chapter 3 we used synthetic data to show that our new graphlet-based measures for directed network comparison outperform non-graphlet-based measures. In this chapter we apply our methodology to biological data, in particular to directed metabolic networks.

First, we contrast our new directed graphlet-based measures with other similarity measures, exploring if the topology-based clustering of directed metabolic networks of eukaryotic species agrees with their taxonomic classification. We then use directed graphlet degree vector (DGDV) similarity to show that similar local topology around enzymes in metabolic networks is an indicator of their shared biological functions. To further explore this, we use a canonical correlation analysis [182] to quantify the relationships between the local topology around the enzymes (described using DGDV) and their biological functions. We then use these relationships to predict novel functional annotations based solely on the network topology. Finally, we look for conserved topology–function relationships across metabolic networks of different eukaryotic species.

4.1 Topology-based Clustering of Metabolic Networks of Eukaryotes Agrees with Taxonomic Classification

The principal goal of evolutionary biology is to understand the evolutionary relationships between different species, which can be quantified by constructing phylogenetic trees [183]. Traditionally, phylogenetic similarities among species have been studied based on phenotypical similarities or sequence similarities [184, 185]. Phylogenetic trees can be reconstructed from sequence alignments using maximum likelihood methods [186, 187] or Bayesian methods [188, 189]. A more extensive review of phylogeny reconstruction methods is beyond the scope of this dissertation and can be found in [190].

Since the topology around molecules in biological networks is shown to be related to similar biological functions, as discussed in Section 1.1, it is expected that phylogenetically similar species have similar biological network topologies. Hence, the topological properties of networks have already contributed to constructing phylogenetic trees. Examples are the similarities between metabolic pathways, obtained by combining global network properties (diameter, clustering coefficient) and similarities of neighbourhoods around nodes [191, 192]. Another approach relies on topological properties such as network size and connectivities of common metabolites, to quantify the similarities between undirected metabolic networks [193] and to use them for phylogenetic reconstruction. Also, a graphlet-based alignment algorithm GRAAL [194] was applied to PPI networks to demonstrate that species phylogeny can be extracted from purely topological alignments. This foregrounds network topology as a new source of phylogenetic information, complementing the sequence information.

Here, we explore the correspondence between the topological similarity of directed metabolic networks of eukaryotic species and their known phylogenetic classification. For network comparison we use our directed graphlet-based measures and contrast their performance with other similarity measures for directed networks. We use the taxonomic classification of species, which is based on the evolutionary relationships between organisms and is directly related to phylogenetic trees.

4.1.1 Methods

4.1.1.1 Data Sets

Metabolic networks. We parsed the organism-specific pathway data of all eukaryotes (299 species) which were available from the KEGG/PATHWAY database [67] in December 2014 and reconstructed the metabolic networks as follows. The KEGG/PATHWAY database maintains the molecular interaction and reaction relations for each organism-specific pathway and provides information such as: (1) pathways with the reactions (links) between enzyme-coding genes and metabolites (enzymes are given in Entrez gene notation) and (2) hierarchical classification that groups enzymes into families which catalyse similar reactions. Links can be directed or undirected, depending on the chemical reversibility of a specific reaction. We consider only the directed links. A directed link between two enzymes in a metabolic network denotes that one enzyme catalyses a reaction whose product is a substrate for a reaction catalysed by the other enzyme. We construct the directed metabolic network where nodes correspond to enzyme-coding genes. Note that we use the terms gene and enzyme interchangeably when we refer to nodes, as the nodes in the network correspond to genes coding for enzymes. The sizes of our metabolic networks vary, with the number of nodes being mainly between 500 and 2000, and edge densities in the 0.5%–1% range.
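The link definition above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the parser used for the KEGG data: the function name `build_enzyme_network` and the input format (a list of irreversible reactions, each given as a triple of catalysing enzymes, substrates and products) are hypothetical, and self-loops are dropped as an additional assumption.

```python
from collections import defaultdict


def build_enzyme_network(reactions):
    """Build directed enzyme-enzyme edges from irreversible reactions.

    `reactions` is an iterable of (enzymes, substrates, products) triples
    (hypothetical format). An edge (u, v) is created when a product of a
    reaction catalysed by u is a substrate of a reaction catalysed by v.
    """
    produced_by = defaultdict(set)  # metabolite -> enzymes producing it
    consumed_by = defaultdict(set)  # metabolite -> enzymes consuming it
    for enzymes, substrates, products in reactions:
        for metabolite in products:
            produced_by[metabolite].update(enzymes)
        for metabolite in substrates:
            consumed_by[metabolite].update(enzymes)

    edges = set()
    for metabolite, producers in produced_by.items():
        for u in producers:
            for v in consumed_by.get(metabolite, ()):
                if u != v:  # assumption: self-loops are not kept
                    edges.add((u, v))
    return edges


# Toy example with made-up enzyme and metabolite names.
toy_reactions = [
    (["E1"], ["A"], ["B"]),
    (["E2"], ["B"], ["C"]),
    (["E3", "E1"], ["C"], ["D"]),
]
toy_edges = build_enzyme_network(toy_reactions)
```

In the toy example, E1 produces B, which E2 consumes, and E2 produces C, which both E3 and E1 consume, so the resulting edges are E1→E2, E2→E3 and E2→E1.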

Taxonomic classification. We downloaded the taxonomic classification of eukaryotes from the NCBI database in February 2015. This database provides a classification of species according to: domain, kingdom, phylum, class, order, family and genus. We evaluate the clustering of eukaryotic species from KEGG that were identified in the NCBI database files (297 out of 299 of them) according to six levels of taxonomic classification: kingdom, phylum, class, order, family and genus, where kingdom corresponds to the most general and genus corresponds to the most specific level of classification. Note that not all 297 species have every taxonomy level specified. Specifically: (1) 297 species are related to a specific genus, but 181 of them are the only member (species) of their genus cluster, leaving only 116 species to cluster based on genus, (2) 274 species are related to a specific family, but 112 of them are the only member of their family cluster, leaving 162 species to cluster based on family, (3) 273 species are related to a specific order, but 60 of them are the only member of their order cluster, leaving 213 species to cluster based on order, (4) 237 species are related to a specific class, but 20 of them are the only member of their class cluster, leaving 217 species to cluster based on class, (5) 271 species are related to a specific phylum, but 7 of them are the only member of their phylum cluster, leaving 264 species to cluster based on phylum, (6) 251 species are related to a specific kingdom.
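The filtering described above (drop species with an unspecified level, then drop species that are alone in their cluster) can be sketched as below. The function name and the input format, a mapping from species to a taxon name or `None`, are assumptions made for illustration; the `reported` dictionary simply restates the counts quoted in the text and checks that each "remaining" figure equals the total minus the singletons.

```python
from collections import Counter


def clusterable_species(labels):
    """Species with a specified label whose cluster has at least two members.

    `labels` maps species -> taxon name at one taxonomic level, or None if
    that level is not specified for the species (hypothetical format).
    """
    counts = Counter(t for t in labels.values() if t is not None)
    return {s for s, t in labels.items() if t is not None and counts[t] > 1}


# Counts quoted in the text: (species with the level specified,
# species alone in their cluster, species left to cluster).
reported = {
    "genus": (297, 181, 116),
    "family": (274, 112, 162),
    "order": (273, 60, 213),
    "class": (237, 20, 217),
    "phylum": (271, 7, 264),
}
consistent = all(
    total - singletons == remaining
    for total, singletons, remaining in reported.values()
)
```

For instance, with the made-up mapping `{"a": "X", "b": "X", "c": "Y", "d": None}`, only species a and b would be kept, because c is alone in its cluster and d has no label at that level.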

4.1.1.2 Clustering Evaluation

We assess the ability of directed network distance measures to cluster directed metabolic networks according to the six levels of the NCBI taxonomic classification. We evaluate the following measures for comparison of directed networks: RDGF-2, RDGF-3, DGDDA, DGCD-13, DGCD-129, in-degree distribution distance, out-degree distribution distance, sum of in- and out-degree distribution distance, and spectral distance (all described in detail in Sections 1.3.1 and 3.1.2). We evaluate the quality of clustering using ROC and Precision-Recall curves and compare AUPR and AUC scores (described in Section 3.2) for all analysed distance measures.
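The exact evaluation procedure is the one described in Section 3.2; the sketch below only illustrates one common way to obtain an AUC score from a distance matrix, and its details are assumptions. Every pair of networks is treated as one sample, a pair is "positive" when both networks carry the same class label, and the AUC is computed as the probability that a randomly chosen same-class pair has a smaller distance than a randomly chosen different-class pair (ties count one half), which equals the area under the ROC curve. The function name `pairwise_auc` is hypothetical.

```python
from itertools import combinations


def pairwise_auc(distances, labels):
    """AUC of a distance matrix with respect to class labels.

    `distances[i][j]` is the distance between networks i and j, and
    `labels[i]` is the class of network i. The double loop is quadratic in
    the number of pairs, so this is meant for small illustrative inputs.
    """
    same, different = [], []
    for i, j in combinations(range(len(labels)), 2):
        target = same if labels[i] == labels[j] else different
        target.append(distances[i][j])
    if not same or not different:
        raise ValueError("need at least one same-class and one different-class pair")

    wins = 0.0
    for d_same in same:
        for d_diff in different:
            if d_same < d_diff:
                wins += 1.0
            elif d_same == d_diff:
                wins += 0.5
    return wins / (len(same) * len(different))
```

A distance measure that always places same-class networks closer together than different-class networks would score 1.0 under this definition, whereas a measure that is uninformative about the classes would score about 0.5.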

Notice that there is an observational bias in the data, because the interactions in metabolic networks of less explored species are inferred from experimentally more explored species, based on their phylogenetic similarity. Hence, it is expected that the similarity of the topologies of metabolic networks will agree with the taxonomic classification of the corresponding species. We are not aiming to confirm or refute this, but to show that directed graphlet-based measures can be used to correctly group the species according to their taxonomic classification and that they outperform other commonly used measures for directed network comparison.
