
4.2.1 The RGIFE heuristic

Algorithm 1 gives detailed pseudo-code for the RGIFE heuristic, while Figure 4.2 illustrates its generic iterative nature.

RGIFE can analyse and extract biomarker signatures from any type of biomedical dataset. The only requirement is that the samples are associated with a finite set of categories or classes (e.g. control vs. case), that is, that they can be used for a classification problem. As briefly mentioned in the introduction, RGIFE removes attributes whose role in the predictive model is irrelevant. Therefore, the first step of the heuristic is to estimate the performance of the classifier using the original set of attributes and to assess their importance (line 29). Any classifier that ranks the attributes by their relevance to the classification task can be used in the heuristic. The original version of the heuristic employed BioHEL [39] as the base classifier to generate the predictive models and the attribute rankings. In this new version of RGIFE, BioHEL has been replaced with a random forest classifier [50]. This choice was made primarily to reduce the overall computational cost, as will be shown later (Figure 4.5).

The function RUN ITERATION( ) splits the dataset into training and test data by implementing a k-fold cross-validation process (by default k = 10) to assess the performance of the current set of attributes. A k-fold cross-validation scheme was preferred to the leave-one-out scheme used in the previous RGIFE version because of its better results in model selection [30]. Here, to describe the RGIFE heuristic, the generic term performance refers to how well the model predicts the class of the test samples. In practice, many different measures can be employed within RGIFE to estimate the model performance (accuracy, F-measure, AUC, etc.).
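The k-fold partition described above can be illustrated with a minimal sketch in plain Python. The function names `k_fold_indices` and `train_test_splits` and the `seed` parameter are hypothetical, and the folds here are not stratified by class; the text does not state whether the RGIFE implementation stratifies its folds.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Split the sample indices 0..n_samples-1 into k disjoint test folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index starting at position i,
    # so fold sizes differ by at most one sample.
    return [indices[i::k] for i in range(k)]


def train_test_splits(n_samples, k=10, seed=0):
    """Yield one (train, test) pair of index lists per fold."""
    folds = k_fold_indices(n_samples, k, seed)
    for i, test in enumerate(folds):
        # The training set is every sample that is not in the current test fold.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test
```

Changing `seed` gives a different partition of the same samples, which is what repeating the cross-validation with different training/test partitions relies on.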

Algorithm 1 RGIFE: Rank Guided Iterative Feature Elimination
Input: dataset data, cross-validation repetitions N
Output: selected attributes

 1:
 2: function REDUCE DATA(data)
 3:     numberOfAttributes ← current number of attributes from data
 4:     ▷ If blockSize is larger than the attributes reduce it (and check for soft-fail)
 5:     if (startingIndex + blockSize) > numberOfAttributes then
 6:         blockRatio = blockRatio × 0.25
 7:         blockSize = blockRatio × numberOfAttributes
 8:     end if
 9:     attributesToRemove ← attributesRanking[startingIndex : (startingIndex + blockSize)]
10:     reducedData ← remove attributesToRemove from data
11:     startingIndex = startingIndex + blockSize
12:     return reducedData
13: end function
14:
15: function RUN ITERATION(data)
16:     for N times do
17:         ▷ generate training and test set folds from data
18:         performances ← cross-validation over data
19:         attributesRank ← get the attributes ranking from the models
20:     end for
21:     performance = average(performances)
22:     attributesRank = average(attributesRank)
23:     return performance, attributesRank
24: end function
25:
26: blockRatio = 0.25
27: blockSize = blockRatio × (attributes in data)
28: startingIndex = 0
29: performance, attributesRank = RUN ITERATION(data)
30: referencePerformance = performance
31:
32: while blockSize ≥ 1 do
33:     data = REDUCE DATA(data)
34:     numberOfAttributes ← current number of attributes from data
35:     performance, attributesRank = RUN ITERATION(data)
36:     if performance < referencePerformance then
37:         failures = failures + 1
38:         if (failures = 5) OR (all attributes have been tested) then
39:             if there exists a soft-fail then
40:                 referencePerformance = softFailPerformance
41:                 numberOfAttributes, selectedAttributes ← attributes of the dataset at the softFail iteration
43:             else
44:                 blockRatio = blockRatio × 0.25
45:                 blockSize = blockRatio × numberOfAttributes
46:             end if
47:             failures = 0; startingIndex = 0
48:         end if
49:     else
50:         referencePerformance = performance
51:         selectedAttributes ← current attributes from data
52:         blockSize = blockRatio × numberOfAttributes
53:         failures = 0; startingIndex = 0
54:     end if
55: end while
56: return selectedAttributes
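Lines 15 to 24 of Algorithm 1 average the attribute rankings obtained over the N repetitions. A minimal sketch of that step is shown here, assuming that what is averaged are per-attribute importance scores (the pseudo-code does not say whether scores or rank positions are averaged). The function name `average_ranking` and the dictionary layout are hypothetical.

```python
def average_ranking(importances_per_repetition):
    """Average importance scores over repetitions and rank the attributes.

    `importances_per_repetition` is a list of dicts mapping each attribute
    name to its importance in one cross-validation repetition (assumed
    layout). The returned list is ordered from least to most important,
    so the candidates for removal come first.
    """
    n = len(importances_per_repetition)
    attributes = list(importances_per_repetition[0])
    mean = {
        a: sum(rep[a] for rep in importances_per_repetition) / n
        for a in attributes
    }
    # sorted() is stable: attributes with equal mean keep their input order.
    return sorted(attributes, key=mean.get)
```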

The N parameter indicates how many times the cross-validation process is repeated with different training/test partitions, to minimise the potential bias introduced by the randomness of the data partition. The generated model (classifier) is then exploited to rank the attributes based on their importance within the classification task. Afterwards, the block of attributes at the bottom of the ranking is removed and a new model is trained over the remaining data (lines 33-35). The number of attributes to be removed in each iteration is defined by two variables: blockRatio and blockSize. The former represents the proportion of attributes to remove (which decreases under certain conditions); the latter indicates the absolute number of attributes to remove and is based on the current size of the dataset. Then, if the new performance is equal to or better than the reference (line 49), the removed attributes are permanently eliminated. Otherwise, the attributes just removed are placed back in the dataset. In this case, the value of startingIndex, a variable used to keep track of the attributes that have been tested for removal, is increased. As a consequence, RGIFE evaluates the removal of the next blockSize attributes, ranked (in the reference iteration) just after those placed back. The startingIndex is iteratively increased, in increments of blockSize, as long as removing successive blocks of attributes keeps decreasing the predictive performance of the model. With this iterative process, RGIFE evaluates the importance of different ranked subsets of attributes. Whenever either all the attributes of the current dataset have been tested (i.e. have been eliminated and the performance did not increase), or there have been five consecutive unsuccessful iterations (i.e. performance was degraded), blockRatio is reduced to a fourth of its value (line 44). The overall RGIFE process is repeated while blockSize (the number of attributes to remove) is at least 1.
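The removal window described above (lines 2 to 13 of Algorithm 1) can be sketched as follows. This is an illustration under stated assumptions rather than the RGIFE implementation: the name `next_block` is hypothetical, the ranking is assumed to be ordered from least to most important, and block sizes are truncated with `int`, a rounding rule the text does not specify.

```python
def next_block(ranking, starting_index, block_ratio):
    """Pick the next block of attributes to try removing.

    Returns (attributes_to_remove, new_starting_index, new_block_ratio).
    """
    n = len(ranking)
    block_size = int(block_ratio * n)  # assumed rounding: truncate
    # If the window would run past the end of the ranking, shrink the ratio
    # to a quarter of its value and recompute the block size.
    if starting_index + block_size > n:
        block_ratio *= 0.25
        block_size = int(block_ratio * n)
    block = ranking[starting_index:starting_index + block_size]
    return block, starting_index + block_size, block_ratio
```

With eight attributes and a ratio of 0.25, the first call returns the two lowest-ranked attributes and moves the index to 2. If that removal degrades the performance, the next call proposes the following two attributes. An empty block corresponds to a block size below 1, the condition under which the main loop stops.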

[Figure 4.2, flowchart: data; generate predictive model; attributes ranking; remove the attributes in the window (i, j) of the importance ranking; generate predictive model; attributes ranking; improved performance? If yes, keep the best performing attributes. If no, update the removal window (if all attributes were tested, reduce the block size, else increase the indexes (i, j)); after 5 fails, if a soft-fail exists, consider the soft-fail as success.]

Fig 4.2: The iterative nature of the RGIFE heuristic and its overall behaviour.

An important characteristic of RGIFE is the concept of the soft-fail. After five unsuccessful iterations, if some past trial failed but suffered only a “small” drop in performance (one misclassified sample more than the reference iteration), it is still considered successful (line 40). The rationale is that accepting a temporary small degradation in performance reduces the probability of getting trapped in a local optimum and thus increases the likelihood of obtaining better solutions. Given the importance of the soft-fail, as illustrated later in Section 4.3.4, in this new RGIFE implementation the search for a soft-fail is not only performed when five consecutive unsuccessful trials occur, as in the original version, but also before every reduction of the block size. Furthermore, the set of iterations that are checked for the presence of a soft-fail is extended. While before only the last five iterations were analysed, now the search window extends back to whichever is more recent: the reference iteration or the iteration in which the last soft-fail was found.
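A soft-fail check along these lines could look like the sketch below. It assumes that performance is tracked as a count of misclassified samples and that, when several trials qualify, the most recent one is taken; neither choice is stated in the text. The name `find_soft_fail` and the trial dictionaries are hypothetical.

```python
def find_soft_fail(failed_trials, reference_errors, tolerance=1):
    """Return the most recent failed trial that qualifies as a soft-fail.

    Each trial is a dict with the number of misclassified samples
    ("errors") and the attribute set it used ("attributes"). A trial
    qualifies if it misclassified at most `tolerance` samples more than
    the reference iteration. Returns None when no trial qualifies.
    """
    # Search from the newest trial backwards (assumed tie-breaking rule).
    for trial in reversed(failed_trials):
        if trial["errors"] - reference_errors <= tolerance:
            return trial
    return None
```

The caller would pass only the trials inside the search window, that is, those after the reference iteration or after the last soft-fail, whichever is more recent.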

4.2.1.1 Relative block size

One of the main changes introduced in this new version of the heuristic is the adoption of a relative block size. The term block size defines the number of attributes that are removed in each iteration. In [208], 25% of the attributes were initially removed; then, whenever all the attributes had been tested or five consecutive unsuccessful iterations had occurred, the block size was reduced by a fourth. However, the analysis suggested that this approach was prone to stalling early in the iterative process and prematurely reducing the block size to a very small number. This scenario either slows down the iterative reduction process, because successful trials remove only a few attributes (small block size), or prematurely stops the whole feature reduction process if the dataset being analysed becomes too small (few attributes) because large chunks of attributes have been removed (line 32 in Algorithm 1). To address this problem, the new implementation of the heuristic introduces the concept of the relative block size. By using a new variable called blockRatio, the number of attributes to be removed is now proportional to the size of the current attribute set being processed, rather than to the original attribute set. While before the values of blockSize were predefined (given the original attribute set), now they vary based on the size of the data in hand. Preliminary tests (not reported) showed that this block size policy is much more reliable.
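The difference between the two policies can be made concrete with a small sketch that lists the block sizes produced over a few consecutive successful removals. The function `block_sizes` is hypothetical, it assumes every trial succeeds, and it truncates fractional sizes, which the text does not specify.

```python
def block_sizes(n_attributes, block_ratio=0.25, relative=True, steps=4):
    """List the block sizes used over `steps` successful removals.

    relative=True  : size is a fraction of the current attribute set.
    relative=False : size is fixed from the original attribute set.
    """
    fixed = int(block_ratio * n_attributes)
    remaining = n_attributes
    sizes = []
    for _ in range(steps):
        size = int(block_ratio * remaining) if relative else fixed
        size = min(size, remaining)  # never remove more than what is left
        sizes.append(size)
        remaining -= size
    return sizes
```

For 1000 attributes, the relative policy yields blocks of 250, 187, 140 and 105 attributes and leaves 318 attributes, whereas the fixed policy removes 250 attributes four times and leaves none. The relative policy therefore shrinks the dataset more gradually.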

4.2.1.2 Parameters of the classifier

RGIFE can be used with any classifier that provides an attribute ranking after the training process. The presented version of RGIFE uses a random forest classifier, which is known for its robustness to noise and its efficiency and is therefore well suited to biomedical data. Furthermore, as suggested in [209], random forest tends not to overfit, incorporates interactions among predictor variables, can easily be used when the number of features is much larger than the number of observations, and can tackle both binary and multi-class problems. The current version of the heuristic is implemented in Python and uses the random forest classifier available in the scikit-learn library [210]. In this package the attributes are ranked, by default, based on the Gini impurity. The Gini impurity represents the expected error rate at a node M if the category label is selected randomly from the class distribution present at M. The feature importance is calculated as the sum over the splits (across every tree) that include the feature, in proportion to the number of samples each split affects. Default values are used for all the parameters of the classifier, except for the number of trees (set to 3000 because it provided the best results in preliminary tests not reported here). The attribute importance based on entropy was also tested but, since it did not produce any improvement in performance, the default criterion was chosen.
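As an illustration of the definition above, the Gini impurity of a node can be computed directly from the class labels of the samples that reach it. This is a minimal stand-alone sketch, not the scikit-learn code; the function name `gini_impurity` is hypothetical.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity of a node: 1 - sum of squared class proportions.

    This equals the expected error rate when a label is drawn at random
    from the class distribution of the samples at the node.
    """
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())
```

A pure node (all samples in one class) has impurity 0, and a balanced two-class node has impurity 0.5. In scikit-learn, the setting described in the text would correspond to `RandomForestClassifier(n_estimators=3000)`, with the importances read from the fitted model's `feature_importances_` attribute.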

4.2.1.3 RGIFE policies

The current version of the heuristic uses a random forest as the core classifier, rather than BioHEL as originally proposed [208]. The random forest is a stochastic ensemble classifier, as each decision tree is built using a random subset of features. As a consequence, RGIFE inherits this stochastic nature: each run of the algorithm results in a potentially different optimal subset of features. The presence of multiple optimal solutions is a common scenario when dealing with high-dimensional -omics data [211]. Therefore, this situation is addressed by running RGIFE multiple times and using different policies to select the final model (signature):

• RGIFE-Min: the final model is the one with the smallest number of attributes

• RGIFE-Max: the final model is the one with the largest number of attributes

• RGIFE-Union: the final model is the union of the models generated across different executions

In the presented analysis the signatures were identified from 3 RGIFE runs.
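The three policies can be sketched as a selection over the attribute sets returned by the individual runs. The function name `select_signature`, the policy strings and the attribute names in the usage example are hypothetical. When two runs tie in size, this sketch keeps the earlier run, a tie-breaking rule the text does not specify.

```python
def select_signature(runs, policy):
    """Combine the attribute sets from several RGIFE runs.

    `runs` is a list of attribute collections, one per run.
    `policy` is "min", "max" or "union".
    """
    if policy == "min":
        return set(min(runs, key=len))   # smallest signature
    if policy == "max":
        return set(max(runs, key=len))   # largest signature
    if policy == "union":
        return set().union(*runs)        # every attribute selected in any run
    raise ValueError(f"unknown policy: {policy!r}")
```

For example, with three runs returning `{"g1", "g2"}`, `{"g2", "g3", "g4"}` and `{"g1"}`, the Min policy gives `{"g1"}`, the Max policy gives `{"g2", "g3", "g4"}` and the Union policy gives all four attributes.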

In document Knowledge extraction from biomedical data using machine learning (Page 123-128)