Discussion - Knowledge extraction from biomedical data using machine learning

This chapter introduced FuNeL, a protocol for inferring functional networks based on the co-prediction paradigm, in which the structure of a rule-based machine learning model is used to identify functional relationships between genes. FuNeL generates functional networks with a different approach from the state-of-the-art methods, which are commonly based on a similarity paradigm. Machine learning is at the core of the FuNeL protocol: the networks are generated by mining machine learning models (rule-based models in this instance) inferred to solve a classification task. The options in the FuNeL protocol provide a total of four configurations, each generating different networks. These were contrasted with networks generated using the co-expression paradigm, the most widely adopted similarity-based approach.

Before the comparison with co-expression methods, synthetic data were used to evaluate the ability of FuNeL to retrieve known biological associations. It was fundamental to assess whether the information obtained from mining machine learning models was indeed relevant, that is, whether the relationships inferred with FuNeL were meaningful. When applied to synthetic datasets, FuNeL was able to identify existing pairwise relationships between genetic attributes (SNPs). The results were in line with, and in some cases superior to, those of a recently proposed approach based on permuted random forest [70]. These findings confirm, for the tested data, the assumption on which the co-prediction approach, and FuNeL in general, is based: attributes (SNPs) that appear together in BioHEL’s classification rules more frequently than expected by chance are also associated within the synthetic data.
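
The co-prediction assumption can be illustrated with a minimal sketch. It treats each classification rule as a plain list of attribute names, counts how often each pair of attributes shares a rule, and keeps the pairs whose count is rarely matched under a null model. The function names, the rule representation and the null model (random attribute draws that preserve each rule's size) are assumptions made for illustration; they are not FuNeL's or BioHEL's actual implementation or statistical test.

```python
from collections import Counter
from itertools import combinations
import random


def co_occurrence_counts(rules):
    """Count how many rules contain each unordered attribute pair."""
    counts = Counter()
    for rule in rules:
        for pair in combinations(sorted(set(rule)), 2):
            counts[pair] += 1
    return counts


def shuffle_rules(rules, attributes, rng):
    """Hypothetical null model: keep each rule's size, draw its attributes at random."""
    return [rng.sample(attributes, len(set(rule))) for rule in rules]


def significant_pairs(rules, n_permutations=1000, alpha=0.05, seed=0):
    """Return pairs that co-occur more often than expected under the null model."""
    rng = random.Random(seed)
    attributes = sorted({a for rule in rules for a in rule})
    observed = co_occurrence_counts(rules)
    exceed = Counter()
    for _ in range(n_permutations):
        null = co_occurrence_counts(shuffle_rules(rules, attributes, rng))
        for pair, count in observed.items():
            if null[pair] >= count:
                exceed[pair] += 1
    # Empirical p-value with the usual +1 correction.
    return {
        pair
        for pair in observed
        if (exceed[pair] + 1) / (n_permutations + 1) <= alpha
    }
```

Each returned pair would become an edge of the inferred network. A different null model, for example one built from models trained on data with shuffled class labels, would change which pairs pass the threshold.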

Encouraged by those findings, the next step was to check whether a rule-based machine learning model, with its complex knowledge representation, could be used to identify biologically meaningful relationships that escape the standard inference methods. This analysis was performed using eight real-world cancer-related transcriptomics datasets. FuNeL was compared with three co-expression inference methods (ARACNE, Pearson and MIC) using networks of matching size generated from the same data. The differences between co-prediction and co-expression were examined from three points of view: basic topological properties, enriched biological terms and relationships between known disease-associated genes.

The comparison of the topological properties revealed the influence of the protocol options. Not surprisingly, both the feature selection and the second training phase reduce the size of the networks but, at the same time, increase the clustering coefficient and the number of connections. The clustering coefficient was found to be lower in almost all the ARACNE networks, probably due to the pruning procedure. It was also lower in many MIC networks. Moreover, when feature selection was applied, the resulting networks had a higher clustering coefficient than Pearson co-expression networks with the same number of edges. Interestingly, all of the co-expression networks were less compact (higher diameter) than the FuNeL networks. This is probably because many attributes appear together in the same classification rules, which reduces the distance between them in the FuNeL network.
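
For reference, the two topological measures discussed above can be computed as in the following minimal sketch, which assumes an undirected, unweighted and connected network given as a list of edges. The helper names are illustrative and are not taken from the FuNeL code.

```python
from collections import deque
from itertools import combinations


def build_adjacency(edges):
    """Build an undirected adjacency map {node: set(neighbours)}."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def average_clustering(adj):
    """Mean over nodes of (links among neighbours) / (possible links among neighbours)."""
    total = 0.0
    for nbrs in adj.values():
        k = len(nbrs)
        if k < 2:
            continue  # nodes with fewer than two neighbours contribute 0
        links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
        total += 2.0 * links / (k * (k - 1))
    return total / len(adj)


def bfs_distances(adj, source):
    """Shortest-path length from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nbr in adj[node]:
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return dist


def diameter(adj):
    """Longest shortest path in the network (assumes a connected graph)."""
    return max(max(bfs_distances(adj, node).values()) for node in adj)
```

A network in which many attributes share rules gains edges among those attributes, which raises the clustering coefficient and shortens the paths that enter the diameter.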

The differences in network topology translated into differences in the biological information the networks contained. The overlap between enriched GO terms and pathways across protocol configurations was generally low, indicating that different configurations infer networks that capture different biological knowledge. The term overlaps between the co-prediction networks and their equivalent co-expression counterparts were even lower. This can be interpreted as evidence that the biological knowledge captured by the two paradigms is not entirely redundant but, in large part, complementary. Associations defined by FuNeL are not limited to attributes that show similar expression patterns but extend to pairs of attributes that participate in the same classification rule.
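
One simple way to quantify such a term overlap is the Jaccard index between the sets of enriched terms of two networks. The sketch below assumes this measure for illustration; the overlap measure actually used in the analysis is not restated in this section, and the function and network names are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard index of two term collections: |A & B| / |A | B|."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pairwise_overlap(term_sets):
    """Jaccard index for every pair of networks in {network_name: enriched_terms}."""
    return {
        (m, n): jaccard(term_sets[m], term_sets[n])
        for m, n in combinations(sorted(term_sets), 2)
    }
```

A value close to 0 for a co-prediction network and its co-expression counterpart would correspond to the largely complementary knowledge described above.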

Differences between the networks were also observed in the analysis of the connections between genes known to be related to a specific disease. The disease-associated genes were more closely connected (higher proximity) in the co-prediction networks, which means that the disease-related nodes of the network were closer to its core. In addition, the number of functional units (triangle motifs), which can identify new gene-disease associations, was found to be higher in the co-prediction networks. Therefore, it can be concluded that the co-prediction approach better captures the abstract concept of functional relationship. The superior performance of FuNeL networks in identifying the disease-associated genes is likely a result of the effective use of the class labels of the samples, which the similarity-based methods ignore. Although it would be tempting to attribute this performance difference entirely to the use of supervised learning in FuNeL, that would be an overstatement, as knowledge of explicit links between genes and diseases is not available to it during training. The hypothesis is that the difference is rather a result of differences in the expression values of the disease-associated genes, which taken together are able to discriminate between sample phenotypes.
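
The two measures in this analysis can be sketched as follows. The sketch assumes that proximity is summarised by the mean shortest-path length between pairs of known disease genes, and that a triangle motif of interest consists of two connected disease genes plus a shared neighbour that is not yet associated with the disease. Both definitions, and all function names, are assumptions made for illustration rather than the exact definitions used in the analysis.

```python
from collections import deque
from itertools import combinations


def build_adjacency(edges):
    """Build an undirected adjacency map {node: set(neighbours)}."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def shortest_path_length(adj, source, target):
    """Breadth-first search distance, or None if target is unreachable."""
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        for nbr in adj.get(node, ()):
            if nbr == target:
                return dist + 1
            if nbr not in seen:
                seen.add(nbr)
                queue.append((nbr, dist + 1))
    return None


def mean_disease_distance(adj, disease_genes):
    """Mean shortest-path length over reachable pairs of disease genes."""
    pairs = combinations(sorted(disease_genes), 2)
    dists = [shortest_path_length(adj, a, b) for a, b in pairs]
    reachable = [d for d in dists if d is not None]
    return sum(reachable) / len(reachable) if reachable else None


def candidate_triangles(adj, disease_genes):
    """Triangles (a, b, c): a and b are linked disease genes, c is a shared non-disease neighbour."""
    disease = set(disease_genes)
    in_network = [g for g in sorted(disease) if g in adj]
    found = set()
    for a, b in combinations(in_network, 2):
        if b in adj[a]:
            for c in adj[a] & adj[b]:
                if c not in disease:
                    found.add((a, b, c))
    return found
```

Under these assumptions, a lower mean distance corresponds to higher proximity, and the third node of each returned triangle is a candidate for a new gene-disease association.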

To further analyse and compare the two paradigms, a case study on the prostate cancer dataset [170] was performed. FuNeL generated networks that were enriched with knowledge completely missed by all the co-expression networks. Topologically important genes (nodes) in the co-prediction networks were found (1) to be altered in a high percentage of tumour samples in an independent cancer transcriptomic study, and (2) to be already associated with prostate cancer according to the specialised literature. Therefore, the co-prediction networks not only capture biological knowledge complementary to the co-expression networks but also better highlight the important genes involved in the disease process. The key nodes (hubs and central nodes in this instance) of the FuNeL networks could be considered candidate biomarkers for the disease represented in the data. This is directly linked to their relevance in BioHEL’s rules. Attributes appear frequently in the classification rules, and so become hubs, if their expression can be used to discriminate between the samples of the data. Thus, they are likely to have a major role in the condition or disease.
