Refining ensembles from DTs using weak features


Having obtained a range of the posterior probabilities of EEG features, we can define a threshold value to cut off the features whose probabilities fall below this threshold; we define such features as weak. A trivial way of using the posterior information on weak features is to rerun the Bayesian averaging on data from which such features were deleted. This reduces the model parameter space so that it can be explored in more detail. However, this technique requires multiple reruns to find the best threshold.

The other way is to refine the DT ensemble by discarding DTs which use weak features. For each threshold, we can find the DT models which use these weak features and discard these DT models from the ensemble. We expect that such a refining strategy will reduce the original set of features without rerunning the Bayesian averaging, keeping its performance high. We can also expect that there is an optimal threshold probability at which the largest number of weak features can be discarded. It is interesting to explore whether the discarding of weak features will improve the results of Bayesian model averaging. In a series of experiments, we could increase the threshold probability in steps and evaluate the performance of the refined ensemble on the test data.
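The threshold sweep described above can be sketched as follows. This is a minimal illustration and not the implementation used in the experiments: it assumes that each DT is represented only by the set of feature indexes it splits on, and that the posterior probability of a feature is estimated as the fraction of DTs in the ensemble that use it. The names `feature_usage` and `refine` and the toy ensemble are hypothetical.

```python
from collections import Counter

def feature_usage(ensemble):
    """Fraction of DTs in the ensemble that use each feature (assumed importance estimate)."""
    counts = Counter(f for tree in ensemble for f in tree)
    return {f: c / len(ensemble) for f, c in counts.items()}

def refine(ensemble, usage, threshold):
    """Discard every DT that uses at least one feature whose usage is below the threshold."""
    weak = {f for f, p in usage.items() if p < threshold}
    kept = [tree for tree in ensemble if not (tree & weak)]
    return kept, weak

# Toy ensemble: each DT is given only as the set of feature indexes it uses.
ensemble = [frozenset(s) for s in ({0, 1}, {0, 1}, {0, 2}, {1}, {0, 1, 3}, {0, 1})]
usage = feature_usage(ensemble)

# Increase the threshold in steps; in a real experiment the refined ensemble
# would be evaluated on the test data at each step.
for threshold in (0.1, 0.2, 0.5):
    kept, weak = refine(ensemble, usage, threshold)
    print(threshold, sorted(weak), len(kept))
```

The Bayesian averaging is run only once; each threshold merely filters the DTs already collected.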

As an alternative to such a try-and-see approach, we can search for the smallest set of important EEG features by discarding the models using weak features and monitoring the accuracy of the refined ensemble on the training data. We use a sequential forward strategy of finding DT models using a weak feature in order to eliminate these models from the ensemble. The search continues while the training accuracies of the refined and original ensembles are comparable within a given p-value of a statistical hypothesis test, such as the two-sample Kolmogorov-Smirnov test (KS-test). The accuracies are said to be comparable as long as the test cannot reject the null hypothesis. The null hypothesis assumes that the samples of the accuracies are drawn from the same distribution. The test rejects the null hypothesis if the modifications made for the kth attribute decrease the accuracy, and then the procedure stops.
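The comparison of two samples of accuracies can be illustrated with a small, self-contained two-sample KS-test. This is a sketch under stated assumptions: the p-value uses the asymptotic Kolmogorov distribution, which is only approximate for small samples, and the names `ks_two_sample`, `comparable`, the level `alpha` and the accuracy values are hypothetical. A statistics library routine would normally be preferred in practice.

```python
import math
from bisect import bisect_right

def ks_two_sample(a, b):
    """Return the two-sample KS statistic D and an asymptotic (approximate) p-value."""
    a, b = sorted(a), sorted(b)
    n, m = len(a), len(b)
    # D is the largest gap between the two empirical CDFs; it is attained at a data point.
    d = max(abs(bisect_right(a, x) / n - bisect_right(b, x) / m) for x in a + b)
    lam = d * math.sqrt(n * m / (n + m))
    if lam < 0.2:
        # The alternating series converges poorly here, and the p-value is essentially 1.
        return d, 1.0
    series = sum((-1) ** (j - 1) * math.exp(-2.0 * j * j * lam * lam)
                 for j in range(1, 101))
    return d, min(1.0, max(0.0, 2.0 * series))

def comparable(refined_acc, original_acc, alpha=0.05):
    """True while the test cannot reject the null hypothesis at level alpha."""
    _, p = ks_two_sample(refined_acc, original_acc)
    return p >= alpha

# Made-up accuracy samples, for illustration only.
original = [0.90, 0.91, 0.89, 0.92, 0.90, 0.91, 0.88, 0.90, 0.91, 0.89]
refined_ok = [0.90, 0.92, 0.89, 0.91, 0.90, 0.91, 0.88, 0.89, 0.90, 0.93]
refined_bad = [x - 0.10 for x in original]

print(comparable(refined_ok, original))   # accuracies overlap closely
print(comparable(refined_bad, original))  # accuracies clearly lower
```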

To compare the training accuracies with a hypothesis test, the distribution of the accuracies given each feature subset needs to be estimated. Estimating these distributions requires a sufficient number of independent samples representing the accuracies of each of the ensembles. Such samples could be obtained by calculating the accuracies on multiple independent data sets. When the training data are limited, the independent data sets can be simulated by resampling the available data. One technique for generating multiple data sets is to randomly subsample two-thirds of the data without replacement. When the simulated data sets are required to have the same number of samples as the original data set, bootstrapping with replacement is typically used. In our case, however, there is no such requirement.
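The subsampling step can be sketched as below. The sketch assumes that the ensemble is available as a single prediction function; the names `subsample_accuracies` and `toy_predict`, the fixed seed and the toy data are hypothetical and serve only to make the example runnable.

```python
import random

def subsample_accuracies(predict, X, y, n_subsamples, fraction=2 / 3, seed=0):
    """Accuracy of `predict` on n_subsamples random subsets drawn without replacement."""
    rng = random.Random(seed)
    size = round(fraction * len(X))
    accuracies = []
    for _ in range(n_subsamples):
        # random.sample draws distinct indexes, i.e. subsampling without replacement.
        idx = rng.sample(range(len(X)), size)
        correct = sum(predict(X[i]) == y[i] for i in idx)
        accuracies.append(correct / size)
    return accuracies

# Toy data: the label is the parity of x; the toy classifier errs on multiples of 5.
X = list(range(30))
y = [x % 2 for x in X]

def toy_predict(x):
    return x % 2 if x % 5 else 1 - x % 2

accuracies = subsample_accuracies(toy_predict, X, y, n_subsamples=10)
print(accuracies)
```

Each subsample holds 20 of the 30 examples, so the resulting accuracies vary from one subsample to another and form the sample passed to the hypothesis test.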

The proposed technique of finding a subset of the most important features can be summarized by Algorithm 2.

Algorithm 2 Refining a DT Ensemble

 1: Inputs: training data D represented by m features, ensemble of DTs, number of subsamples n, p-value, number of attempts v_max
 2: Initialise: counter of attempts v = 0, number of weak features k = 1
 3: Estimate the posterior feature importance
 4: Sort the list of features, F, in ascending order of their importance
 5: for i = 1, ..., n do
 6:     Subsample D and calculate the ensemble accuracy A_i
 7: end for
 8: while v ≤ v_max and k < m do
 9:     Find the DT models using feature F_k and delete them from the ensemble
10:     for i = 1, ..., n do
11:         Subsample D and calculate the ensemble accuracy AR_i
12:     end for
13:     Run the KS-test to compare the samples {AR_i} and {A_i}, i = 1, ..., n
14:     if null hypothesis rejected then
15:         v ← v + 1
16:     else
17:         Reset the counter of attempts: v ← 0
18:     end if
19:     k ← k + 1
20: end while
21: k ← k − v
22: return k

The algorithm returns the number of features which were found weak within a given p-value. Thus, the indexes of the weak features are in positions from 1 to k of the list F. Obviously, the greater the number of attributes found weak, the smaller the set of attributes that remains. In this way we expect to find the smallest set of attributes making the most important contribution while keeping the performance of the refined ensemble high.
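The control flow of Algorithm 2 can be sketched in a few lines. This is an illustration under several assumptions, not the code used in the experiments. Each DT is given as the set of features it uses, and the accuracy sampling and the hypothesis test are passed in as functions. Instead of the final k ← k − v correction, the sketch records the position of the last accepted removal, so that its return value is the number of features found weak, as the text describes. The stand-in functions at the bottom are toy placeholders: they are neither real accuracies nor a real KS-test.

```python
def refine_ensemble(ensemble, ranked_features, accuracy_samples, comparable, v_max):
    """
    ensemble         : list of DTs, each given as the set of features it uses
    ranked_features  : list F, sorted from the weakest feature to the strongest
    accuracy_samples : function(ensemble) -> list of accuracies on subsampled data
    comparable       : function(refined, original) -> True if the null hypothesis
                       is NOT rejected
    v_max            : number of attempts allowed after a rejection
    Returns the number of leading features in F that were found weak.
    """
    m = len(ranked_features)
    original_acc = accuracy_samples(ensemble)
    refined = list(ensemble)
    v, k, n_weak = 0, 1, 0
    while v <= v_max and k < m:
        feature = ranked_features[k - 1]                    # F_k (1-based in the text)
        refined = [t for t in refined if feature not in t]  # delete DTs using F_k
        if comparable(accuracy_samples(refined), original_acc):
            v = 0
            n_weak = k                                      # removals up to F_k accepted
        else:
            v += 1
        k += 1
    return n_weak

# Toy ensemble of 10 DTs; features ranked weakest first.
ensemble = [frozenset({0, 1})] * 6 + [
    frozenset({0, 1, 4}), frozenset({0, 1, 3}),
    frozenset({0, 1, 2}), frozenset({0, 2}),
]
ranked = [4, 3, 2, 1, 0]

# Placeholder "accuracy": the share of the original DTs that survive (NOT a real accuracy).
def toy_accuracy_samples(ens):
    return [len(ens) / len(ensemble)] * 5

# Placeholder test (NOT a KS-test): accept while the proxy stays above 75% of the original.
def toy_comparable(refined_acc, original_acc):
    return min(refined_acc) >= 0.75 * min(original_acc)

n_weak = refine_ensemble(ensemble, ranked, toy_accuracy_samples, toy_comparable, v_max=0)
print(n_weak, ranked[:n_weak])
```

In practice `accuracy_samples` would subsample the training data and `comparable` would apply the KS-test at the chosen p-value. The refined ensemble can then be rebuilt by discarding every DT that uses one of the first `n_weak` features of the list.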

A potential criticism of the refining technique is that the sequential forward strategy of eliminating the weak features does not take into account the possible interactions between the features. However, the technique assumes that the feature interactions have already been captured by the collected DT models. Our hypothesis is that the combinations of features which make valuable contributions to the classification have been used by the largest portion of the ensemble’s DT models. In contrast, the weak features, which are sometimes added to a DT even with a slight decrease in the likelihood, are used by a much smaller portion of the models. When the MCMC technique adds a weak feature to a DT by making a birth move, a new “version” of the model with the weak feature is included in the ensemble. The fact that a weak feature is rarely used by the ensemble’s DTs means that proposals to add this feature tend to decrease the model’s likelihood and are rarely accepted by the sampler. The refining technique aims to remove those DT versions which include the weak features while keeping those DTs which have employed the successful feature combinations. The efficiency of this technique is evaluated in experiments in terms of performance and accuracy of uncertainty assessment.

In document Bayesian assessment of newborn brain maturity from sleep electroencephalograms (Page 75-78)