The FuNeL protocol - Material and Methods

3.2 Material and Methods

3.2.2 The FuNeL protocol

SCoPNet showed that the co-prediction paradigm generates functional networks that incorporate meaningful biological knowledge and can be employed to formulate new research hypothesis. However, from a methodological perspective, many questions remained unanswered about co-prediction:

• Can the co-prediction approach identify known genetic relationships?

• Can the biological significance of the co-prediction networks be quantified?

• What is the impact of data pre-processing on the generated networks?

• Is this methodology able to capture knowledge that escapes other methods?

• Are the discovered functional relationships meaningful in the human disease con- text?

To address these questions a new network inference protocol called FuNeL (Func- tional Network Learning) is proposed. FuNeL aims to provide a general framework for the inference of functional networks based on the co-prediction paradigm by using rule-based machine learning models. The FuNeL protocol substantially extends the previous work of [161] by incorporating: (1) statistical filtering of inferred functional relationships via permutation tests, (2) a multi-stage network generation to maximise the knowledge extraction, and (3) a configurable feature selection stage to control the size of the generated networks.

Stages of FuNeL

FuNeL is defined by a total of four stages, as illustrated in Figure 3.2. Two of these stages are optional (1 and 4), they lead to a total of four di↵erent protocol configurations. If the first optional stage (feature selection) is performed, the original dataset is reduced to the most relevant attributes. In Stage 2, a rule-based machine learning is used to infer a network. This network is statistically refined in Stage 3, in which a permutation test is used to filter out non-significant nodes. The final step, in which

the network generation is repeated for the second time, is again optional. A time complexity analysis of the protocol is available in the Section A.4 of Appendix A.

Feature selection (stage 1) When datasets contain a large number of attributes, a common situation when dealing with biomedical and high-throughput data, some of them might be irrelevant to the prediction of the target. Discard those attributes helps the classification algorithm to focus its learning e↵ort only on the ones that matters. Therefore, the feature selection is the first stage of the inference process. To pick the relevant attributes, FuNeL employs SVM-RFE: a recursive feature elimination method based on Support Vector Machines [111]. The choice fell on SVM-RFE as this algorithm was initially designed to cope with cancer-related transcriptomics data and over the years it showed its potential in identifying relevant features from biomedical data. In FuNeL, SVM-RFE uses an SVM algorithm with a linear kernel as preliminary studies suggested that it can eliminate as much as 90% of the original dataset attributes, without losing much of the classification accuracy. In Figure 3.4 is illustrated the di↵erence in accuracy, calculated using a standard 10-fold cross-validation, when applying SVM-RFE to the datasets (see Section 3.2.3 for a description of the datasets employed) and removing 90% of the original set of attributes. When consider- ing only 10% of the original attributes, the accuracy increased for two datasets, slightly decreased for three datasets and remained unaltered for the remaining datasets.

Rule-based network inference (stage 2) To infer the rule-based classification models, BioHEL was used as in [161]. BioHEL generates sets of classification rules using a genetic algorithm. Figure 3.5 shows an example of a rule set created using BioHEL from a cancer-related dataset. Each rule is sequentially applied to the test set samples, if none of them classifies (“fire”) the sample, the default rule is applied. Due to the stochastic nature of BioHEL’s learning process, each of its runs generates a di↵erent rule set. This fact is leveraged by creating a large number of alternative hypotheses of functional relationships via multiple runs of the algorithm. FuNeL runs BioHEL 10 000 times and infers the network from the consensus of all the generated rule sets. To do that, all the pairs of attributes that appear together in the same classification rule are used as the network edges (co-prediction paradigm). Then, each

Fig 3.4: Changes in the accuracy (calcualated using a 10-fold cross-validation) when retaining 10% of the original attributes using SVM-RFE.

network node (attribute) receives a score that corresponds to the number of times it has been used in the rules (node score). datasets

Permutation test (stage 3) Given a list of edges (attribute-attribute associations) extracted from the rule sets, FuNeL aims to filter out the non-significant nodes. To determine the node significance, a statistical analysis procedure based on a permutation test, similar to the one described in [162], is followed. 100 permutated datasets are generated by randomly shu✏ing the class labels. Next, the co-prediction networks are inferred (as in Stage 2) from these permutated datasets. Then, for each node, the distribution of scores across the 100 networks generated from the permutated datasets

If 35742_at is > 29.71 and 32545_r_at is < 1.50 then

Tumor

If 38322_at is > 81.32 and 35769_at is > 26.24 thenTumor

If 34097_at is > 25.40 and < 88.84 then

Tumor

If 38602_at is < 35.01 thenTumor

If 40853_at is > 11.11 thenTumor

Default rule: Normal

is calculated. Using a one-tailed permutation test, a p-value is assigned to each node, to estimate how likely it is to draw its score from the calculated distribution. With this process it is sure that the nodes with high scores are really tied to the classes present in the data, and that the network truly represents functional relationships. To decide if a node is statistically significant a typical ↵= 0.05 threshold is employed. Preliminary experiments showed that using significant nodes alone leads to small and dense networks. By having networks that are highly dense, where each node tends to be connected to every other node, the reliability on the meaning of the functional relationships is lost. To counter that, the node pruning is relaxed to keep all direct neighbours of the significant nodes, so that larger and less dense networks can be created.

Network construction (stage 4) There are two ways to interpret the result of the statistical test (option 2 in Figure 3.2). The first is to use the significant nodes as a filter for the inferred relationships (edges) and remove all the edges between two non-significant nodes. The second approach is to use the permutation test as a further feature selection and build a new rule-based machine learning model using only the significant nodes. This second run of the learning algorithm is then focused only on the statistically important genes and creates the final network.

As a result of two independent optional stages in the FuNeL protocol, there are four di↵erent configurations that it can run with (see Table 3.1).

Config. Description

C1 reduced dataset + 1 stage of network generation

C2 original dataset + 1 stage of network generation

C3 reduced dataset + 2 stages of network generation

C4 original dataset + 2 stages of network generation

In document Knowledge extraction from biomedical data using machine learning (Page 79-83)