Part II: Supervised Learning
5.5 Parallel SMO Implementations
5.6.2 Results on Benchmarks
Table 5.3 shows the classification performance of the models generated by MT- SVM, GPU-SVM and LIBSVM for the aforementioned datasets. Both the MT- SVM and the GPU-SVM yield competitive performance as compared to LIBSVM. The exception is the Ionosphere dataset, for which the SVM-GPU version fails to achieve a good performance. This may be explained by the usage of 32-bit floating precision arithmetic (float data type) and demonstrates that for some problems (64- bit) double precision is required. However, the GPU model could be used as a base for creating a better model, for example using the MT-SVM.
Table 5.3 MT-SVM, GPU-SVM and LIBSVM classification performance
Accuracy (%) F-Measure (%)
Dataset MT-SVM GPU-SVM LIBSVM MT-SVM GPU-SVM LIBSVM
Adult 84.65±0.39 80.97±0.80 84.72±0.38 90.13±0.28 86.76±0.67 90.38±0.24 Breast Cancer 97.48±2.26 95.42±2.02 97.76±1.49 96.59±1.90 93.32±3.12 96.96±2.02 German 73.61±1.65 73.28±2.90 73.03±1.32 83.44±1.08 81.15±2.36 83.46±0.82 Haberman 71.85±4.42 72.72±4.06 72.92±3.50 82.14±3.13 83.11±2.66 83.49±2.31 Heart 83.18±4.64 83.33±4.92 82.37±4.96 85.44±4.09 84.73±4.70 85.30±4.00 Ionosphere 89.66±3.38 67.52±2.00 89.06±3.58 91.16±3.13 79.55±1.06 90.72±3.29 Sonar 85.77±4.90 88.16±4.57 84.65±4.78 87.30±4.39 88.06±5.18 86.83±4.00 Tic-tac-toe 97.70±1.22 98.98±0.88 97.72±1.26 98.28±0.90 99.22±0.67 98.29±0.93 Two-Spiral 100.00±0.00 100.00±0.00 100.00±0.00 100.00±0.00 100.00±0.00 100.00±0.00 MP3 Steganalysis 97.05±0.87 96.07±1.29 96.92±1.00 97.06±0.88 96.19±1.21 96.94±0.99
Table 5.4 shows the number of SVs of the models generated by MT-SVM, GPU-SVM and LIBSVM. Note that, despite producing competitive models, the MT-SVM generates models with fewer SVs. On the other hand, the GPU-SVM implementation generates models with a larger number of SVs.
Table 5.4 Number of Support Vectors (SVs) of the models generated by MT-SVM, GPU- SVM and LIBSVM
Dataset MT-SVM GPU-SVM LIBSVM
Adult 9,781.76±56.38 9,801.94±56.43 9,788.40±48.90 Breast Cancer 113.28±05.18 115.46±04.85 114.44±04.89 German 713.70±05.53 718.94±05.04 718.74±05.14 Haberman 149.14±04.84 152.42±04.55 151.20±04.41 Heart 175.74±03.09 178.28±02.98 177.18±03.10 Ionosphere 215.10±03.08 218.72±03.20 217.74±03.23 Sonar 151.10±02.57 154.82±02.39 153.74±02.36 Tic-tac-toe 548.36±10.08 553.18±10.58 551.78±10.89 Two-Spiral 698.80±28.65 1,053.10±67.49 939.86±58.23 MP3 Steganalysis 346.16±07.11 348.82±07.06 348.08±07.16
104 5 Support Vector Machines (SVMs)
Tables 5.5 and 5.6 show respectively the amount of time needed for the training and classification tasks and the corresponding speedups obtained by the MT-SVM and the GPU-SVM implementations, relatively to LIBSVM. Note that, only the three heaviest datasets (Adult, Two-Spiral and MP3 steganalysis) are shown, since for the smaller datasets the GPU speedups will be negative (<1). Both the MT- SVM and the GPU-SVM implementations can boost significantly the training and the classification tasks. This takes particular relevance for the training task as it can considerably reduce the time required to perform a grid search (a fundamental process to obtain good generalization models).
Table 5.5 Training and classification times (in seconds) for the MT-SVM and GPU-SVM and LIBSVM implementations
Training Classification
Dataset MT-SVM GPU-SVM LIBSVM MT-SVM GPU-SVM LIBSVM
Adult 14.83±0.42 2.24±0.22 30.49±0.72 0.84±0.12 0.03±0.01 2.02±0.07 Two-Spiral 21.26±0.19 0.89±0.01 146.72±0.42 3.02±0.50 0.04±0.11 11.72±0.19 MP3 Steganalysis 0.34±0.02 0.68±0.07 2.37±0.05 0.02±0.01 0.06±0.01 0.57±0.02
Table 5.6 Speedups (×) obtained for the MT-SVM and GPU-SVM relatively to LIBSVM implementation
MT-SVM GPU-SVM
Dataset Training Classification Training Classification
Adult 2.06 2.27 13.61 67.33
Two-Spiral 6.90 3.88 165.15 265.85
MP3 Steganalysis 6.88 29.53 3.48 9.50
Regarding the UKF kernel, Table 5.7 presents the classification results obtained using the MT-SVM implementation. The additional number of parameters greatly increases the complexity of performing a grid search. Having said that, it is possible that the results could be improved by narrowing the search. Nevertheless, the results show the usefulness of the this kernel. Using the Wilcoxon signed ranked test we found no statistical evidence of the UKF kernel performing worse than the RBF kernel and vice-versa.
Table 5.8 presents the average number of SV of the models generated. It is interesting to note that, when the UKF presents a smaller number of SVs than the RBF kernel it generally yields better F-Measure results. This seems to indicate that the UKF classification performance improves when it is able to gather points near to each other, in a higher dimension space, as intended.
5.7 Conclusion 105 Table 5.7 RBF versus UKF kernel classification results
RBF kernel UKF kernel
Accuracy (%) F-Meausure (%) Classification Accuracy (%) F-Meausure (%)
Adult 84.65±0.39 90.13±0.28 83.36±0.37 89.47±0.26 Breast Cancer 97.48±2.26 96.59±1.90 98.11±1.00 97.39±1.46 German 73.61±1.65 83.44±1.08 71.79±3.15 83.04±2.08 Haberman 71.85±4.42 82.14±3.13 72.45±4.91 82.85±3.58 Heart 83.18±4.64 85.44±4.09 82.93±3.59 84.85±3.41 Ionosphere 89.66±3.38 91.16±3.13 94.28±3.04 95.61±2.30 Sonar 85.77±4.90 87.30±4.39 85.10±4.81 86.19±4.69 Tic-tac-toe 97.70±1.22 98.28±0.90 98.17±0.94 98.59±0.76 Two-Spiral 100.00±0.00 100.00±0.00 100.00±0.00 100.00±0.00 MP3 Steganalysis 97.05±0.87 97.06±0.88 93.02±0.78 93.12±0.83
Table 5.8 Average number of SVs of the RBF versus UKF models
RBF kernel UKF kernel
Adult 9,781.76±56.38 12,543.60±56.63 Breast Cancer 113.28±05.18 63.08±03.03 German 713.70±05.53 795.60±13.89 Haberman 149.14±04.84 231.08±04.03 Heart 175.74±03.09 181.52±05.52 Ionosphere 215.10±03.08 101.40±03.99 Sonar 151.10±02.57 129.76±03.84 Tic-tac-toe 548.36±10.08 475.64±10.95 Two-Spiral 698.80±28.65 324.00±03.66 MP3 Steganalysis 346.16±07.11 1019.44±08.51
5.7
Conclusion
As the amount of data produced grows at an unprecedented rate fast machine learning algorithms that are able to extract relevant information from large repositories have become extremely important. To partly answer to this challenge in this Chapter we presented a multi-threaded parallel MT-SVM which parallelizes the SMO algorithm [74]. This implementation uses the power available on multi- core CPUs and GPUs, and efficiently learns (and classifies) within several domains, exposing good properties in scaling data.
Additionally, the UKF kernel which has good generalization properties in the high-dimensional feature space has been included, although more parameters are needed to fine tune the results. It would be interesting to account for vectorization (SSE or AVX) as well as support for kernel caching which may drastically decrease the amount of computation.
Chapter 6
Incremental Hypersphere Classifier (IHC)
Abstract. In the previous chapters we have presented batch learning algorithms,
which are designed under the assumptions that data is static and its volume is small (and manageable). Faced with a myriad of high-throughput data usually presenting uncertainty, high dimensionality and large complexity, the batch methods are no longer useful. Using a different approach, incremental algorithms are designed to rapidly update their models to incorporate new information on a sample- by-sample basis. In this chapter we present a novel incremental instance-based learning algorithm, which presents good properties in terms of multi-class support, complexity, scalability and interpretability. The Incremental Hypersphere Classifier (IHC) is tested in well-known benchmarks yielding good classification performance results. Additionally, it can be used as an instance selection method since it preserves class boundary samples.
6.1
Introduction
Learning from data streams is a pervasive area of increasing importance. Typically, stream learning algorithms run in resource-aware environments, constructing decision models that are continuously evolving and tracking changes in the environment generating the data [66]. This contrasts with traditional ML algorithms that are commonly designed with the emphasis set on effectiveness (e.g. classification performance) rather than on efficiency (e.g. time required to produce a classifier) [263] and predominantly focus on one-shot data analysis from homogeneous and stationary data [66]. Generally, it is assumed that data is static and finite and that its volume is “small” and manageable enough for the algorithms to be successfully applied in a timely manner. However, these two premises rarely hold true for modern databases and although (as we have seen) the GPU implementations can reduce considerably the time necessary to build the models, there are many real-world scenarios for which traditional batch algorithms are inapplicable [69]. Rationally, when dealing with large amounts of data it is conceivable that the memory capacity will be insufficient to store every piece of relevant information
N. Lopes and B. Ribeiro, Machine Learning for Adaptive Many-Core Machines – 107 A Practical Approach, Studies in Big Data 7,
108 6 Incremental Hypersphere Classifier (IHC)
during the whole learning process [96]. Moreover, even if the required memory is available, the computational requirements to process such amounts of data in a timely manner can be prohibitive, even when GPU implementations are considered. Additionally, modern databases are dynamic by nature. They are incessantly being fed with new information and it is not uncommon for the original concepts to drift [131, 134]. Therefore, both real-time model adaptation and classification are crucial tasks to extract valuable and up-to-date information from the original data, playing a vital role in many industry segments (e.g. financial sectors, telecommunications) that rely on knowledge discovery in databases and data mining tasks to stay competitive [186]. Reducing both the memory and the computational requirements inherent to ML algorithms can be accomplished by using instance selection methods that select a representative subset of the data. The idea is to identify the relevant instances for the learning process while discarding the superfluous ones. These methods can be divided into two groups (wrapper and filter) according to the strategy used for choosing the instances [167, 105]. Unlike filter methods, wrapper methods use a selection criterion that is based on the accuracy obtained by a classifier (instances that do not contribute to improve the accuracy are discarded). A review of both wrapper and filter methods can be found in L´opez et al. [167]. Although instance selection methods can effectively reduce the volume of data to be processed, their application may be time consuming (in particular for wrapper methods) and in some situations we may find that we are simply transferring the complexity from the learning methods to the instance selection methods. Usually, instance selection methods present scaling problems: for very large datasets the run-times may grow to the point where they become inapplicable [167]. A more desirable approach to deal with the memory and computational limitations consists of using incremental learning algorithms. In this approach, the learner gradually adjusts its model as it receives new data. At each step the algorithm can only access a limited number of new samples (instances) from which a new hypothesis is built upon [96]. Another important consideration when extracting information from data repositories is the interpretability of the resulting models. In some application domains, the comprehensibility of the decisions is as valuable as having accurate models. Moreover, understanding the predictions of the models can improve the users’ confidence on them [172, 17].