Introduction - Knowledge extraction from biomedical data using machine learning

3.2.1 The co-prediction paradigm
3.2.2 The FuNeL protocol
3.2.3 Datasets
3.2.4 Co-expression networks
3.2.5 Enrichment analysis
3.2.6 Disease association analysis
3.3 Results
3.3.1 Identification of predefined relationships in synthetic datasets
3.3.2 Topological comparison of the inferred networks
3.3.3 Complementarity of enriched terms
3.3.4 Quantifying the amount of captured biological knowledge
3.3.5 Evaluation of the networks in a disease context
3.3.6 Prostate cancer case study: enriched terms
3.3.7 Prostate cancer case study: disease associations
3.4 Discussion
3.5 Future work

Abstract

This chapter presents FuNeL, a protocol for the inference of functional networks based on the analysis of rule-based machine learning models. Associations are generated from attributes that collaborate in solving a classification problem. FuNeL is one of the main contributions of this thesis and represents the first example of how a smart exploitation of machine learning models, such as classification rules, can generate new knowledge, in this instance in the form of functional networks.

3.1 Introduction

The behaviour of complex biological systems arises from the cooperation of a large number of components. Understanding how biological events occur at the molecular level is one of the main goals of Systems Biology, and considerable effort has been devoted to determining the chain of interactions that controls and mediates biological processes. Networks are the main tool used to characterise and study these complex processes and systems. A biological network is a graph in which nodes represent biological entities, such as genes or proteins, and a connection between two nodes indicates a biological relationship, e.g. regulation or a common function. The inference of these networks from biomedical data, and especially from high-throughput (-omics) data, is an area of intense research.

Most network inference methods focus on the definition of gene regulatory networks, in which edges represent direct regulatory interactions between genes [71,83,148]. Far less effort has been put into the design of methods to build functional networks, where a connection indicates a functional relationship, e.g. membership in the same pathway or protein complex, or sharing the same function. One of the primary uses of functional networks is the identification of functional modules based on node connectivity: subsets of genes with many internal connections and few connections to genes outside the module, which describe, explain or predict a biological process or phenotype [131]. Functional networks are also often employed to identify genes that play a major role in a specific biomedical context, such as a disease, based on their position in the network (e.g. hubs).

A commonly adopted approach is to generate functional networks based on the “co-expression” principle [78]. A functional relation (via a direct or indirect interaction) is assumed between two genes when they have similar expression profiles across the samples (hence the name similarity-based methods). It has been demonstrated that co-expression networks can effectively identify pathways and candidate biomarkers [149], or reveal gene modules representing a biological process perturbed in a disease [150], to name just a few examples. The similarity-based approach remains the dominant method of functional network inference today, with many recent examples of successful applications [151–154].
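The similarity-based principle can be illustrated with a minimal sketch. It assumes Pearson correlation as the similarity measure and an arbitrary threshold of 0.8; neither choice is taken from this chapter, and the gene names and expression values are made up for illustration.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: a constant profile is treated as uncorrelated
    return cov / sqrt(var_x * var_y)


def coexpression_edges(profiles, threshold=0.8):
    """Return the gene pairs whose profile similarity exceeds the threshold.

    profiles: dict mapping gene name -> list of expression values,
              one value per sample (same sample order for every gene).
    """
    edges = set()
    for g1, g2 in combinations(sorted(profiles), 2):
        if pearson(profiles[g1], profiles[g2]) > threshold:
            edges.add((g1, g2))
    return edges


# Toy data (hypothetical): A and B rise together across samples, C does not.
toy_profiles = {
    "A": [1.0, 2.0, 3.0, 4.0],
    "B": [2.1, 3.9, 6.2, 8.0],
    "C": [5.0, 1.0, 4.0, 2.0],
}
toy_coexpression = coexpression_edges(toy_profiles, threshold=0.8)
```

With these toy profiles only the pair (A, B) is connected, since C does not follow the same pattern as either of the other two genes.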

Although similarity-based inference methods have been extensively and successfully used, they detect relationships among genes only when similar expression patterns emerge. This limits the range of functional relationships that they can reveal [4, 155]. A different approach to inferring biological networks, which has recently been gaining popularity, involves the use of machine learning techniques. Owing to the wide range of knowledge representations used within machine learning methods (e.g. classification rules, decision trees, artificial neural networks, SVM kernels), they can discover more complex and diverse relationships and overcome the limitations of the similarity-based methods. This is possible because, within machine learning models, attributes are associated not because they are similar (e.g. have similar expression profiles), but because together they detect strong patterns. Moreover, if the learning is supervised, it can take advantage of the additional phenotype information (class labels of the samples, such as case and control) available with the data.

Machine learning can be employed in different ways to solve the task of network inference. One approach is to train machine learning models that directly predict network edges [156]. However, this process requires an experimentally verified “ground truth” of known interactions and suitable controls, and obtaining these is a challenging task in its own right. A different approach is to generate machine learning models from the biological data and then mine the structure of the models to infer networks. Attributes co-operate in machine learning models not only when they are “similar”, but whenever meaningful patterns can be extracted from them together. Therefore, an approach based on the mining of complex machine learning models could uncover new and different (biological) knowledge that is likely to escape the traditional similarity-based approaches. Figure 3.1 illustrates the differences between these two approaches: on the one hand the similarity-based methods, on the other the methods founded on the extraction of knowledge from machine learning models. The figure highlights how the two approaches analyse the same data differently and how the relationships between the entities are extracted.

[Figure 3.1. Input: -omics data (samples × genes) with phenotype information (Cancer / Normal). Network inference, two routes: profile similarity (edge when similarity(X, Y) > threshold), or a machine learning model followed by knowledge extraction (classification rules, e.g. "If X > 0.23 and Y > 0.55: Cancer"). Output: edge inference between X and Y.]

Fig 3.1: Two approaches to functional network inference: one based on expression profile similarity and the other based on the extraction of knowledge from machine learning models. The similarity-based methods construct a network edge X ↔ Y when the similarity between the expression of genes X and Y across the samples is above a threshold. Methods based on machine learning first build a predictive model, in this example a rule-based model, using the phenotype (class label) information of the samples, and then construct a network edge X ↔ Y when genes X and Y are used together within that model to classify the samples. As these two approaches lead to different functional networks, it is possible that they capture complementary knowledge.
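The model-based route in Figure 3.1 can be sketched in a few lines. This is only an illustration of the idea that genes used together in a rule become connected: the rule representation is hypothetical, the first rule is the one shown in the figure, and the remaining rules are invented for the example. It is not the FuNeL implementation.

```python
from collections import Counter
from itertools import combinations

# Hypothetical rule representation:
# (list of (gene, operator, threshold) conditions, predicted class).
toy_rules = [
    ([("X", ">", 0.23), ("Y", ">", 0.55)], "Cancer"),  # rule from Figure 3.1
    ([("X", ">", 0.40), ("Z", "<", 0.10)], "Cancer"),  # invented
    ([("Y", "<", 0.20)], "Normal"),                    # invented, single gene
]


def coprediction_edges(rules):
    """Count how often each pair of genes appears together in the same rule."""
    counts = Counter()
    for conditions, _predicted_class in rules:
        # A gene may appear in several conditions of a rule; count it once.
        genes = sorted({gene for gene, _op, _value in conditions})
        for pair in combinations(genes, 2):
            counts[pair] += 1
    return counts


toy_coprediction = coprediction_edges(toy_rules)
```

Here X is linked to both Y and Z, while the single-gene rule contributes no edge. Note that no similarity between expression profiles is computed at any point: an edge reflects only that two genes were used jointly to classify samples.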

As described in Section 2.3, several types of machine learning have been successfully applied to accomplish this task: unsupervised learning in the form of association rules [65], supervised learning using regression (model trees [67, 68]) or classification (random forest [69, 70]).

This chapter proposes, describes and analyses a protocol, called FuNeL, for the inference of functional networks based on rule-based machine learning models. FuNeL generates functional networks using: an optional feature selection process to control the size of the networks, a statistical filtering of the predicted associations between genes using permutation tests, and a multi-stage rule-based network inference. The different options available within FuNeL, illustrated in Figure 3.2, generate a total of four protocol configurations.

[Figure 3.2. Flowchart organised into stages 1 to 4 with two yes/no decision points (Option 1 and Option 2). Boxes: Original dataset, Feature selection, Reduced dataset, Rule-based network generation, Permutation test, Second training, Rule-based network generation, Co-prediction network.]

Fig 3.2: The stages of the FuNeL (Functional Network Learning) inference protocol.
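The statistical filtering step mentioned above relies on permutation tests. This section does not specify the statistic that FuNeL permutes, so the sketch below only shows the general mechanism, under stated assumptions: class labels are shuffled, a stand-in score (absolute difference between class means) is recomputed on each shuffle, and an empirical p-value is estimated. The function names, the score, the number of permutations and the data are all hypothetical.

```python
import random


def mean_difference(values, labels):
    """Stand-in score: absolute difference between the two class means."""
    group0 = [v for v, lab in zip(values, labels) if lab == 0]
    group1 = [v for v, lab in zip(values, labels) if lab == 1]
    return abs(sum(group1) / len(group1) - sum(group0) / len(group0))


def permutation_p_value(values, labels, score=mean_difference,
                        n_permutations=1000, seed=0):
    """Empirical p-value of the observed score under shuffled class labels."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = score(values, labels)
    shuffled = list(labels)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # breaks any link between values and classes
        if score(values, shuffled) >= observed:
            at_least_as_extreme += 1
    # Add-one correction so the estimate is never exactly zero.
    return (at_least_as_extreme + 1) / (n_permutations + 1)


# Toy data (hypothetical): expression clearly separates the two classes.
toy_values = [1.0, 2.0, 3.0, 4.0, 10.0, 11.0, 12.0, 13.0]
toy_labels = [0, 0, 0, 0, 1, 1, 1, 1]
toy_p = permutation_p_value(toy_values, toy_labels)
```

In a filtering setting, an association would be retained only if its p-value fell below a chosen significance level; that level, like the score itself, is a design decision not fixed by this sketch.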

In the following sections, FuNeL’s ability to correctly identify existing relationships is first tested using a set of synthetic datasets. FuNeL is then evaluated using eight real-world transcriptomics datasets related to different types of cancer. For each dataset, the four configurations of the protocol are used to create functional networks. The inferred networks are tested and compared with co-expression networks of equivalent size. To provide an extensive evaluation of FuNeL, three different state-of-the-art methods for generating co-expression networks are used. The differences between FuNeL and co-expression networks are assessed from two points of view: (1) the enriched biological terms and (2) the relationships between genes known to be associated with a particular type of cancer. Finally, using a prostate cancer dataset as a case study, a more detailed biological analysis of the enriched terms and the disease-related genes is performed. The largest hubs and the most central nodes in the prostate cancer co-prediction networks are studied for their involvement in the disease. Literature support is found for the association between these topologically important genes and prostate cancer. This is further confirmed by an independent transcriptomics dataset (not used as a source in the inference process). Overall, the FuNeL inferred networks are shown to (1) capture relevant biological knowledge that is complementary to the knowledge associated with different co-expression networks, and (2) more adequately represent the relationships between genes associated with the disease targeted by each dataset.
