Chapter 7 CONCLUSIONS
7.1 Conclusions
In this dissertation, we provide new insights into algorithms and evaluation methods for finding biological motifs, including sequence motifs and network motifs. Sequence motifs are substrings in DNA or protein sequences that encode structural motifs or carry functional significance. Network motifs are small connected subgraphs in biological networks; they were first introduced as functional building blocks in gene regulatory networks. In the past, sequence motifs were found through biological or chemical experiments, but computationally derived motifs are now more common. Network motifs, on the other hand, are discovered solely through computational methods. Sequence motifs, which are discovered through multiple sequence alignments of a functionally related set of sequences, remain ‘candidate motifs’ until a significant biological function or structural information is verified. Network motifs are defined only by their structural frequency and uniqueness in networks, and there has been no comprehensive evaluation to validate which motifs are biologically important.
We design algorithms based on clustering analysis and biological knowledge so that the discovered motifs are more useful for bioinformatics applications. For protein sequence motifs, we develop an algorithm that combines the sparse nonnegative matrix factorization (SNMF) method with granular computing and the inclusion of statistical structure to discover high-quality protein motifs that are universally conserved across protein family boundaries. Such motifs have been applied, for example, to the prediction of local tertiary structure [211]. Previous algorithms [30–32, 210, 211] used K-means clustering and repeated pruning steps, relying on supervised filtering processes to obtain better results. Their initialization process used the secondary structure of the data itself, which should be used only for evaluation. We use an SNMF clustering method to obtain more consistent and efficient results than the previous methods. Additionally, we incorporated biological knowledge into the data features using Chou-Fasman parameters. To determine their biological roles, we evaluate the candidate motifs by their secondary structure similarity, and we additionally propose a new measure, sDBI, which evaluates the overall grouping quality based on the inferred secondary structures and the primary sequences.
For network motifs, we provide new detection approaches. We propose finding biologically meaningful network motifs instead of purely structural network motifs. As a start, we define a biological network motif as a biologically meaningful k-node subgraph. We then develop efficient algorithms for the detection of biological network motifs and introduce new evaluation measures to assess their biological significance. The algorithms use clustering methods such as betweenness clustering and SNMF. Moreover, some of the algorithms are based on biological knowledge, which increases the chance of detecting biological network motifs. All the algorithms introduced in this study also improve on existing algorithms for high-quality structural network motif detection. The evaluation measures we introduce quantify the biological significance of each subgraph. We ran the algorithms on two PPI networks of S. cerevisiae and compared them using the new measures. An existing exhaustive search and two existing approximation algorithms are also included for comparison with our algorithms. To the best of our knowledge, this is the first time systematic evaluation measures have been introduced for network motifs.
We applied biological network motifs to a practical problem: detecting essential proteins in a PPI network. Essential proteins are indispensable for supporting cellular life, and they form the minimal set required for a living cell. They not only help us understand the cellular life of an organism, but are also useful for practical purposes such as drug design. A number of centrality algorithms have been used to discover essential proteins, including degree centrality (DC), betweenness centrality (BC), closeness centrality (CC), subgraph centrality (SC) and eigenvector centrality (EC). However, all of these centrality algorithms depend only on the structural properties of a network. In this work, we show that combining network motifs with biological annotation greatly improves the detection rates, by proposing a new centrality algorithm, MCGO. We first develop a new centrality algorithm, motif centrality (MC), which counts the number of network motifs for each vertex. Since network motifs are determined by their statistical significance, MC is a more reliable algorithm than the others for ranking vertices in a network. MCGO is MC computed on a network whose edges have been pruned by EDGEGO, which trims edges based on GO terms. We also provide three evaluation measures to compare the performance of MCGO with that of other centrality algorithms: top-ranked true positive rates, statistical measurements and precision-recall curves. We additionally show that incorporating gene ontology (GO) annotations further improves the performance of the other centrality algorithms.
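The idea of counting motifs per vertex on a GO-pruned network can be illustrated with a minimal sketch. It rests on two simplifying assumptions that are not taken from this dissertation: the only motif considered is the triangle (3-node clique), and an edge is kept only if its two endpoints share at least one GO term outside a given set of uninformative terms. The function names, the pruning rule and the toy GO labels are hypothetical; the actual EDGEGO and MC algorithms are those defined in the earlier chapters.

```python
from itertools import combinations

def prune_edges_by_go(edges, go_terms, uninformative):
    """Keep an edge only if its endpoints share an informative GO term.

    Hypothetical pruning rule, used for illustration only.
    """
    kept = []
    for u, v in edges:
        shared = go_terms.get(u, set()) & go_terms.get(v, set())
        if shared - uninformative:
            kept.append((u, v))
    return kept

def triangle_motif_centrality(edges):
    """For each vertex, count the triangles it takes part in.

    The triangle stands in for a network motif (an assumption).
    """
    adj = {}
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    score = {v: 0 for v in adj}
    for v, nbrs in adj.items():
        # Every adjacent pair of neighbours closes one triangle at v.
        for a, b in combinations(sorted(nbrs), 2):
            if b in adj[a]:
                score[v] += 1
    return score

# Toy network with placeholder GO labels (not real GO identifiers).
edges = [("A", "B"), ("B", "C"), ("A", "C"), ("C", "D")]
go_terms = {
    "A": {"GO:x", "GO:root"},
    "B": {"GO:x"},
    "C": {"GO:x", "GO:root"},
    "D": {"GO:root"},
}
pruned = prune_edges_by_go(edges, go_terms, uninformative={"GO:root"})
scores = triangle_motif_centrality(pruned)
print(pruned)
print(scores)
```

In this toy example the edge C-D is removed because its endpoints share only the uninformative term, so D drops out of the ranking while A, B and C each take part in one triangle.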
However, relying on only one centrality algorithm for the prediction of essential proteins might not be enough. Rio et al. [292] showed that no single centrality algorithm is dominant in the prediction of essential proteins, but that a combination of two or more can increase the detection rates. Hence, we make full use of existing centrality measures, including MCGO, together with biological information to predict essential proteins with machine learning techniques in the yeast Saccharomyces cerevisiae PPI network. The network is first pruned by the EDGEGO algorithm, which removes some interactions associated with relatively uninformative GO terms. From the GO-pruned PPI network, we compute eight centrality measures, namely DCGO, BCGO, CCGO, ECGO, SCGO, SoECCGO, LACGO and MCGO, where the suffix ‘GO’ is attached to the DC, BC, CC, EC, SC, SoECC, LAC and MC algorithms because they are computed on a GO-pruned network. With these eight centrality features, called CENT-GO, we construct ten balanced data sets in which the number of essential proteins equals the number of non-essential proteins, to avoid performance biased toward the majority class. As evaluation measures, we used the area under the ROC curve (AUC-ROC), the area under the PR curve (AUC-PR), the accuracy at an optimal threshold (ACC) and the computational time (T). We first confirmed with the Mann-Whitney U test that the ten balanced data sets are statistically similar, so that a single data set could be chosen for further experiments. The performance is compared with that of the 23-feature data set obtained from [35], named ING-GO (Acencio and Lemke, 2009). With only eight features, CENT-GO (Kim et al. 2012) performs better than the 23 features of ING-GO (Acencio and Lemke, 2009) on ACC and T, although it does not beat ING-GO on the AUC values. When all the features are integrated, however, the prediction performance improves significantly on all three evaluation methods. The improvement is also confirmed as statistically significant by the Mann-Whitney U test.
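The construction of balanced data sets and the statistic behind the Mann-Whitney U test can be sketched as follows. This is an illustration under stated assumptions rather than the procedure used in this work: the sampling scheme (random undersampling of the larger class with a fixed seed), the function names and the toy protein identifiers are all hypothetical, and only the U statistic itself is computed, not its p-value.

```python
import random

def balanced_datasets(essential, non_essential, n_sets=10, seed=0):
    """Build n_sets data sets with equal numbers of both classes.

    Assumption: the larger class is randomly undersampled.
    Each item is a (protein_id, label) pair, label 1 = essential.
    """
    rng = random.Random(seed)
    k = min(len(essential), len(non_essential))
    datasets = []
    for _ in range(n_sets):
        pos = rng.sample(essential, k)
        neg = rng.sample(non_essential, k)
        datasets.append([(p, 1) for p in pos] + [(q, 0) for q in neg])
    return datasets

def mann_whitney_u(xs, ys):
    """U statistic of sample xs against ys; ties count one half."""
    u = 0.0
    for x in xs:
        for y in ys:
            if x > y:
                u += 1.0
            elif x == y:
                u += 0.5
    return u

# Toy identifiers, invented for the example.
essential = ["e%d" % i for i in range(5)]
non_essential = ["n%d" % i for i in range(20)]
sets = balanced_datasets(essential, non_essential)
print(len(sets), len(sets[0]))

# Two identical samples give U = n1 * n2 / 2, the value expected
# when neither sample tends to be larger than the other.
print(mann_whitney_u([1, 2, 3], [1, 2, 3]))  # 4.5
```

A value of U close to half the product of the two sample sizes is what one would expect when two samples of a feature are statistically similar; a full test would convert U into a p-value.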
We also analyzed the individual features to see the impact of each measure compared with all the integrated features. When we apply the same classifier to each individual measure, DCGO produces relatively better results than the others, although the integration of all eight features performs significantly better. A further analysis derives general rules using a decision tree algorithm. Most of the decision trees built on the balanced data sets have MCGO or DCGO as the root node, indicating the strong influence of these features on the general rules. Indeed, the good quality of MCGO and DCGO was demonstrated in Chapter 5.