5.2 K-complex classification
5.2.4 Comparative study with previous results
The results presented above focus mainly on maximizing accuracy. To compare them with those obtained by Bankman et al. (1992), however, the decision threshold has to be adjusted to reach a required sensitivity of 85%, 90% and 95%. Table 5.15 reports the false positive rates published by Bankman et al. (1992) for a FNN model with three hidden units at these sensitivity levels, together with those obtained with the proposed methodology on the Dataf (band-pass filter approach) and Dataf 2 (low-pass filter approach) datasets.
Table 5.15: False positive rate (%) for different sensitivity levels in the test set.
Sensitivity   85%    90%    95%
Bankman       6.1    8.1    14.1
Dataf         5.4    7.7    9.9
Dataf 2       4.5    3.2    8.6
The values obtained for Dataf correspond to the 4-10-8-1 FNN model, using as inputs the features selected by the consistency-based filter. For the Dataf 2 dataset, the features selected by CFS were used, combined with a 5-10-8-1 FNN model. The FP rate is lower than the one reported by Bankman et al. (1992) at all the sensitivity levels considered, and the best performance was obtained on Dataf 2. This comparison has to be interpreted carefully, since the datasets used in each case are different. Even so, these results suggest that feature selection techniques can bring important improvements in K-complex classification.
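The threshold adjustment described in this section can be illustrated with a minimal sketch. It assumes that the classifier produces a continuous score per sample and that a sample is labelled positive when its score reaches the threshold; the function names and the toy data are hypothetical, and the source does not state on which data partition the threshold was tuned.

```python
import math


def threshold_for_sensitivity(scores, labels, target):
    """Return the highest threshold whose sensitivity is at least `target`.

    A sample is predicted positive when score >= threshold.
    `labels` holds 1 for positives (e.g. K-complexes) and 0 for negatives.
    """
    pos = sorted((s for s, y in zip(scores, labels) if y == 1), reverse=True)
    if not pos:
        raise ValueError("at least one positive sample is required")
    # Number of positives that must be detected to reach the target.
    # The small tolerance guards against floating-point overshoot.
    k = max(1, math.ceil(target * len(pos) - 1e-9))
    return pos[k - 1]


def sensitivity_and_fp_rate(scores, labels, threshold):
    """Compute (sensitivity, false positive rate) at a given threshold."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present")
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    return tp / n_pos, fp / n_neg


# Hypothetical network outputs, used only to exercise the functions.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.5, 0.3, 0.2, 0.1]
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0]
thr = threshold_for_sensitivity(scores, labels, 0.8)
sens, fpr = sensitivity_and_fp_rate(scores, labels, thr)
```

Choosing the highest admissible threshold keeps the false positive rate as low as possible for the required sensitivity, which is the quantity compared in Table 5.15.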
5.3 Summary
Feature selection plays a crucial role in many real applications, since it reduces the number of input features and, most of the time, improves performance. This chapter presented two real problems in which feature selection has proved useful for achieving better performance.
The first real application considered was tear film lipid layer classification. The time required by existing approaches to this problem prevented their clinical use, because they could not work in real time. In this chapter, a methodology for improving this classification problem was proposed, which includes the application of feature subset selection methods: CFS, consistency-based, and INTERACT. The results obtained with this methodology surpass previous results in terms of processing time whilst maintaining accuracy and robustness to noise. In clinical terms, the manual process carried out by experts can be automated, with the benefits of being faster and unaffected by subjective factors, with a maximum accuracy over 97% and a processing time under 1 second. The clinical significance of these results should be highlighted, as the agreement between subjective observers ranges from 91% to 100%.
The second real scenario was K-complex classification, a key aspect in sleep studies. Three filter methods were applied in combination with five different machine learning algorithms, with the aim of achieving a low false positive rate whilst maintaining accuracy. When feature selection was applied, the results improved significantly for all the classifiers. Particularly remarkable is the 91.40% classification accuracy obtained with CFS, which reduced the number of features by 64%.
Chapter 6. Scalability in feature selection
Continuous advances in computer-based technologies have enabled researchers and engineers to collect data at an increasingly fast pace (Z. A. Zhao & Liu, 2011). The proliferation of high-dimensional data brings new challenges to researchers, and scalability and efficiency are two critical issues in this new scenario.
Most algorithms were developed when data set sizes were much smaller, but nowadays distinct compromises are required for small-scale and large-scale learning problems. Small-scale learning problems are subject to the usual approximation-estimation trade-off. In large-scale learning problems, the trade-off is more complex because it involves not only the accuracy but also the computational complexity of the learning algorithm, as seen in tear film lipid layer classification (Chapter 5). Moreover, the majority of algorithms were designed under the assumption that the data set would be represented as a single memory-resident table, so if the entire data set does not fit in main memory, these algorithms are useless.
For all these reasons, scaling up learning algorithms is a trending issue. The organization of the workshop “PASCAL Large Scale Learning Challenge” at the 25th International Conference on Machine Learning (ICML’08), and of the workshop “Big Learning” at the conference of the Neural Information Processing Systems Foundation (NIPS 2011), are cases in point. Scaling up is desirable because increasing the size of the training set often increases the accuracy of algorithms (Catlett, 1991). Scalability is defined as the effect that an increase in the size of the training set has on the computational performance of an algorithm, measured in terms of accuracy, training time and allocated memory. Thus the challenge is to find a trade-off among them or, in other words, to get “good enough” solutions as “fast” and as “efficiently” as possible. This issue becomes critical in situations with temporal or spatial constraints, such as real-time applications dealing with large data sets, unapproachable computational problems requiring learning, or initial prototyping requiring quickly-implemented solutions.
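The definition of scalability given above can be made concrete with a minimal sketch that records the three quantities (error, training time and allocated memory) as the training set grows. Everything in it is an assumption made for illustration: the synthetic two-class data, the nearest-centroid learner and all function names are hypothetical, and the code does not reproduce the measures of the PASCAL challenge.

```python
import random
import time
import tracemalloc


def make_data(n, d, seed=0):
    """Synthetic two-class data: class means at -1 and +1 on every feature."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        label = rng.randint(0, 1)
        X.append([rng.gauss(2 * label - 1, 1.0) for _ in range(d)])
        y.append(label)
    return X, y


def train_centroids(X, y):
    """Toy learner: one centroid per class."""
    d = len(X[0])
    sums = {0: [0.0] * d, 1: [0.0] * d}
    counts = {0: 0, 1: 0}
    for row, label in zip(X, y):
        counts[label] += 1
        for j, v in enumerate(row):
            sums[label][j] += v
    return {c: [s / max(counts[c], 1) for s in sums[c]] for c in sums}


def predict(model, row):
    """Assign the class whose centroid is closest in squared distance."""
    def dist(c):
        return sum((a - b) ** 2 for a, b in zip(row, model[c]))
    return min(model, key=dist)


def scalability_profile(sizes, d=10):
    """Record test error, training time and peak memory per training size."""
    X_test, y_test = make_data(500, d, seed=123)
    profile = []
    for n in sizes:
        X, y = make_data(n, d, seed=n)
        tracemalloc.start()
        t0 = time.perf_counter()
        model = train_centroids(X, y)
        seconds = time.perf_counter() - t0
        _, peak = tracemalloc.get_traced_memory()  # bytes allocated in training
        tracemalloc.stop()
        errors = sum(predict(model, r) != t for r, t in zip(X_test, y_test))
        profile.append({"n": n, "error": errors / len(y_test),
                        "seconds": seconds, "peak_bytes": peak})
    return profile


profile = scalability_profile([50, 200])
```

Plotting each recorded quantity against `n` gives one curve per algorithm, which makes the trade-off among the three aspects visible.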
This chapter is devoted to scalability in feature selection and is divided into two parts. The first part studies the influence of feature selection methods on the scalability of artificial neural network (ANN) training algorithms, using the measures defined during the PASCAL workshop (Sonnenburg, Franc, Yom-Tov, & Sebag, 2008). These measures evaluate the scalability of algorithms in terms of error, computational effort, allocated memory and training time. The second part studies the scalability of feature selection methods themselves, checking their performance in a controlled artificial experimental scenario. This scenario contrasts the ability of the algorithms to select the relevant features and to discard the irrelevant ones as the dimensionality increases, without permitting noise or redundancy to obstruct this process. For analyzing scalability, new evaluation measures are proposed, which need to be based not only on the accuracy of the selection but also on other aspects, such as the execution time or the stability of the features returned.
6.1 Scalability of neural networks through feature selection
The appearance of very large data sets is not sufficient to motivate scaling efforts. The most commonly cited reason for scaling up algorithms is that increasing the size of the training data set typically increases the accuracy of algorithms (Catlett, 1991). In fact, learning from small data sets frequently decreases the accuracy of algorithms as a result of over-fitting.
For most scaling problems the limiting factor has been the number of samples and the number of features describing each sample. A prominent question is how the training time of an algorithm grows as the data set size increases. But temporal complexity does not reflect scaling in its entirety, and must be used in conjunction with other metrics. For scaling up learning algorithms, the issue is not so much one of speeding up a slow algorithm as one of turning an impracticable algorithm into a practical one. The crucial point is seldom how fast you can run on a certain problem but rather how large a problem you can deal with (Provost & Kolluri, 1999). More precisely, space considerations are critical to scale up learning algorithms. The absolute size of the main memory plays a key role in this matter. Almost all existing implementations of learning algorithms operate with the training set entirely in main memory. If the spatial complexity of the algorithm exceeds the main memory then
the algorithm will not scale well, regardless of its computational complexity, because page thrashing renders algorithms useless. Page thrashing is the consequence of many disk accesses occurring in a short time, which drastically cuts the performance of a system using virtual memory. Virtual memory is a technique for making a machine behave as if it had more memory than it really has, by using disk space to simulate RAM. But accessing disk is much slower than accessing RAM. In the worst case, out-of-memory exceptions will make algorithms unfeasible in practice.
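A back-of-the-envelope calculation helps to see when the memory-resident assumption breaks down. The sketch below assumes a dense table stored with 8 bytes per value (double precision) and ignores any working memory of the learning algorithm itself; the data set dimensions, the 4 GiB budget and the function names are hypothetical.

```python
def dense_footprint_bytes(n_samples, n_features, bytes_per_value=8):
    """Memory needed to hold a dense training table in main memory."""
    return n_samples * n_features * bytes_per_value


def fits_in_memory(n_samples, n_features, ram_bytes, bytes_per_value=8):
    """True when the dense table alone fits within the given memory budget."""
    needed = dense_footprint_bytes(n_samples, n_features, bytes_per_value)
    return needed <= ram_bytes


GIB = 2 ** 30
ram = 4 * GIB  # hypothetical machine with 4 GiB of main memory

# Hypothetical data set: one million samples described by 1000 features.
full_table = dense_footprint_bytes(1_000_000, 1000)
full_fits = fits_in_memory(1_000_000, 1000, ram)

# The same samples after keeping only 100 of the features.
reduced_table = dense_footprint_bytes(1_000_000, 100)
reduced_fits = fits_in_memory(1_000_000, 100, ram)
```

Under these assumptions the full table needs 8 × 10^9 bytes and exceeds the budget, whereas the reduced table needs 8 × 10^8 bytes and fits, so reducing the number of features is one way to keep the training set memory-resident.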
It has been shown that popular algorithms for ANNs are unable to deal with very large data sets (Peteiro-Barral, Guijarro-Berdiñas, Pérez-Sánchez, & Fontenla-Romero, 2013). For this reason, preprocessing methods may be desirable for reducing the input space size and improving scalability. This section aims to demonstrate that feature selection methods are an appropriate approach to improve scalability. By reducing the number of input features and, consequently, the dimensionality of the data set, we expect to reduce the computational time while maintaining the performance, and to make it possible to apply certain algorithms that could not otherwise deal with large data sets.
Among the feature selection methods available, this research focuses on the filter approach. The reason is that, although wrappers and embedded methods tend to obtain better performance, they are very time consuming and become intractable on high-dimensional data sets without compromising the time and memory requirements of machine learning algorithms.
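The defining property of the filter approach is that features are scored from the data alone, without training the learning algorithm. The following minimal sketch illustrates this with a simple univariate ranking by absolute Pearson correlation with the class label. This ranking is a stand-in chosen for brevity; it is not CFS, the consistency-based filter or INTERACT, and all names and the toy data are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def filter_rank(X, y, k):
    """Return the indices of the k features most correlated with the class."""
    d = len(X[0])
    scores = [abs(pearson([row[j] for row in X], y)) for j in range(d)]
    ranked = sorted(range(d), key=lambda j: scores[j], reverse=True)
    return ranked[:k]


def project(X, selected):
    """Keep only the selected columns of the data set."""
    return [[row[j] for j in selected] for row in X]


# Toy data: feature 0 equals the class, feature 1 is constant,
# feature 2 is uncorrelated with the class.
X = [[0, 5, 1], [1, 5, 0], [0, 5, 0], [1, 5, 1]]
y = [0, 1, 0, 1]
selected = filter_rank(X, y, 1)
X_reduced = project(X, selected)
```

Because the scoring requires a single pass per feature and no model training, its cost grows linearly with the number of features, which is the main reason for preferring filters on high-dimensional data sets.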