Part I Introduction to the Main Contributions on Classification
3.1 Conclusions
Real-world problems are complex and challenging, and they come in all sizes, shapes and varieties. However, when addressing challenging classification scenarios, the theoretical and methodological efforts of the literature tend to be biased towards simple classification schemas such as uni-dimensional binary problems. Instead of simplifying the statements of real-world problems so that the literature can be directly applied, this PhD dissertation relaxes the common assumption of simple classification tasks. This enables me to push the boundaries of existing classification schemas further and to take on real-world classification problems as they come. Specifically, I focus on generalising, from both a theoretical and a methodological point of view, two prominent and challenging scenarios: the semi-supervised learning paradigm and the class-imbalance problem.
Semi-supervised learning [10] is concerned with using unlabelled examples in the learning task so that the performance of supervised learning methods that only use a limited labelled set to train the model can be improved [39]. It remains a subject of intense study because, in some applications, gathering labels for the training dataset is relatively expensive or time-consuming compared to the cost of obtaining unlabelled examples. Unfortunately, hardly any theoretical or methodological solution in the literature considers classification schemes beyond uni-dimensional binary problems. Thus, in this dissertation, I devote a long research period to studying the semi-supervised learning paradigm applied to more complex classification scenarios such as multi-class and multi-dimensional classification problems. The contributions achieved in that time span are listed as follows:
1. A semi-supervised approach, based on the EM algorithm [119], capable of dealing with multiple class variables. This methodology enabled the extension of the semi-supervised learning framework to the whole multi-dimensional classification spectrum [57].
2. It is shown that, as happens in uni-dimensional classification [72], when the generative structure is used to train a multi-dimensional classifier in a semi-supervised manner, unlabelled data always helps. When there is a mismatch between the generative and the assumed structures, performance degradation of the trained classifier occurs.
3. A competent solution for a real-world multi-dimensional classification problem in the context of Sentiment Analysis [112] by means of the previous methodological advances. Specifically, a classifier capable of characterising the comments of the customers of several Spanish companies through three different, but related, dimensions was engineered: the sentiment of the users towards the product, the subjectivity of their online post, and their will to influence other customers.
4. It is also proven that the probability of committing a classification error using a training dataset with no labelled data and any affinely extended real number (that is, a possibly infinite amount) of unlabelled data coincides with that of using the RAND classifier.
5. An optimal theoretical procedure for semi-supervised learning in the multi-class framework, which allows the study of the fundamental limits of the methodological proposals in this setting. It is a generalisation of the pioneering optimal procedure for binary problems proposed in [31].
6. The minimum number of labelled examples required to train a multi-class classifier in a semi-supervised framework was also investigated theoretically.
7. It is proven, by means of the proposed optimal procedure, that the optimal probability of error in the semi-supervised framework may decrease to the Bayes error exponentially fast, but no faster, and that class overlapping [95, 106] has a direct influence on this convergence.
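As a rough illustration of how EM can fold unlabelled examples into the training of a multi-class classifier, the sketch below alternates between computing class posteriors for the unlabelled points and re-estimating the model from all points. It is not the dissertation's algorithm, which targets multi-dimensional Bayesian network classifiers. The univariate Gaussian class-conditional model, the function names (`gaussian_pdf`, `m_step`, `semi_supervised_em`) and the synthetic data are assumptions made purely for illustration.

```python
import math
import random


def gaussian_pdf(x, mean, var):
    """Density of a univariate Gaussian with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def m_step(xs, resp, n_classes, min_var=1e-3):
    """Weighted maximum-likelihood estimates of priors, means and variances."""
    n = len(xs)
    priors, means, variances = [], [], []
    for k in range(n_classes):
        weight = sum(r[k] for r in resp)
        mean = sum(r[k] * x for r, x in zip(resp, xs)) / weight
        var = sum(r[k] * (x - mean) ** 2 for r, x in zip(resp, xs)) / weight
        priors.append(weight / n)
        means.append(mean)
        variances.append(max(var, min_var))  # floor avoids degenerate components
    return priors, means, variances


def semi_supervised_em(labelled, unlabelled, n_classes, n_iter=50):
    """labelled: list of (x, class index); unlabelled: list of x.

    Assumes every class has at least one labelled example.
    """
    lab_x = [x for x, _ in labelled]
    # Labelled points keep a fixed one-hot responsibility throughout.
    lab_resp = [[1.0 if c == k else 0.0 for k in range(n_classes)] for _, c in labelled]
    params = m_step(lab_x, lab_resp, n_classes)  # initialise from labelled data only
    xs = lab_x + list(unlabelled)
    for _ in range(n_iter):
        priors, means, variances = params
        # E-step: posterior class probabilities of each unlabelled point.
        unl_resp = []
        for x in unlabelled:
            joint = [priors[k] * gaussian_pdf(x, means[k], variances[k])
                     for k in range(n_classes)]
            total = sum(joint)
            if total == 0.0:  # numerical underflow: fall back to a uniform posterior
                unl_resp.append([1.0 / n_classes] * n_classes)
            else:
                unl_resp.append([j / total for j in joint])
        # M-step: re-estimate parameters from labelled and unlabelled points.
        params = m_step(xs, lab_resp + unl_resp, n_classes)
    return params


# Synthetic three-class example (hypothetical data, not from the dissertation).
random.seed(0)
true_means = [-5.0, 0.0, 5.0]
labelled = [(random.gauss(m, 1.0), k) for k, m in enumerate(true_means) for _ in range(5)]
unlabelled = [random.gauss(m, 1.0) for m in true_means for _ in range(300)]
priors, means, variances = semi_supervised_em(labelled, unlabelled, n_classes=3)
```

Here the assumed model family matches the one that generated the data, which is the favourable case described in contribution 2. Under a mismatched model, the same loop can degrade the classifier.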
The second scenario, the class-imbalance problem [29, 100], arises from the class probability distribution, p(c), of the generative model and is able to compromise the performance of the majority of standard learning algorithms [11]. Traditional learning algorithms assume a 0-1 loss function and, thus, they expect a balanced class distribution or equal misclassification costs. So, when they are presented with complex training datasets sampled from skewed class distributions, these algorithms fail to properly learn the characteristics of the generative model and, as a result, provide dummy classifiers that always classify incoming data as the most probable class value [107]. Nowadays, it is a hot topic in the literature [89] and it is the subject of many papers, workshops, special sessions, and dissertations. Unfortunately, the class-imbalance literature is highly biased towards binary classification problems. Thus, in this dissertation, I contribute theoretically and methodologically to that literature while considering the whole uni-dimensional classification spectrum (binary + multi-class). Precisely, the contributions of this PhD work regarding the class-imbalance problem are the following:
1. A novel controlled theoretical framework where the other interdependent factors (see Section 1.4) threatening the performance of the trained classifier can be marginalised. By doing so, the contribution of the class distribution to the detriment of the performance of the classifier can be legitimately quantified in isolation.
2. Under the assumption of knowing the generative model, the BDR was used to define, in the suggested controlled framework, a valuable measure of the influence of the class distribution on the score used to assess the performance of a classifier.
3. Evidence that numerical scores are sufficient to adequately assess problems suffering from class skewness was found during this research project. Previous work [11] argued the opposite.
4. It is claimed that the performance scores which are unweighted Hölder means [126] with p ≤ 1 (a-mean, g-mean, h-mean, etc.) among the recalls are the most appropriate to evaluate the competitiveness of classifiers in unbalanced problems.
5. I discovered that most of the learning solutions proposed in the literature under the approaches of data sampling [109] and cost-sensitive learning [110] were designed to asymptotically converge to the EDR.
6. The EDR is an optimal classifier for the unweighted Hölder mean with p = 1 (a-mean), as proven in this research work.
7. The need in the literature for a standardised set of evaluation practices for proper comparisons among classifiers facing unbalanced data [41] was fulfilled. This was achieved by providing two practical bounds for common performance scores, one certifying competent classifiers and the other certifying incompetent ones.
8. A new summary of any binary or multi-class class distribution, which is capable of properly measuring the extent of class imbalance and which highly correlates with the detriment produced by class skewness, was proposed.
Finally, there is a contribution of this dissertation that, although it was key to the designed multi-dimensional semi-supervised methodological proposal, cannot be added to either of the previous lists. I also contributed to the supervised learning scenario by completing the family of multi-dimensional Bayesian network induction algorithms [57]. Specifically, I proposed a computationally efficient supervised filter algorithm for MD J/K [58] networks which makes use of statistical testing over mutual information measures. Furthermore, this algorithm can also learn MDTAN structures in a much shorter time than the original learning algorithm proposed in [57].
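To make the recall-based scores from the class-imbalance contributions more concrete, the following minimal sketch computes unweighted Hölder (power) means among the per-class recalls, taking the usual correspondences p = 1 (a-mean), p = 0 as the limiting case (g-mean) and p = -1 (h-mean). The function names (`recalls`, `holder_mean`) and the toy predictions are assumptions for illustration and are not taken from the dissertation.

```python
import math


def recalls(y_true, y_pred):
    """Per-class recall: fraction of the examples of each true class predicted correctly."""
    out = []
    for c in sorted(set(y_true)):
        total = sum(1 for t in y_true if t == c)
        hits = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        out.append(hits / total)
    return out


def holder_mean(values, p):
    """Unweighted Hölder mean of non-negative values with exponent p.

    p = 1 is the arithmetic mean, p = 0 (as a limit) the geometric mean,
    and p = -1 the harmonic mean.
    """
    n = len(values)
    if p <= 0 and any(v == 0 for v in values):
        return 0.0  # limiting value when some recall is zero
    if p == 0:
        return math.exp(sum(math.log(v) for v in values) / n)
    return (sum(v ** p for v in values) / n) ** (1.0 / p)


# Toy skewed problem: 90 examples of class 0, 10 of class 1, and a dummy
# classifier that always predicts the majority class.
y_true = [0] * 90 + [1] * 10
y_pred = [0] * 100

accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
r = recalls(y_true, y_pred)
a_mean = holder_mean(r, 1)
g_mean = holder_mean(r, 0)
h_mean = holder_mean(r, -1)
```

On this toy data the dummy classifier reaches an accuracy of 0.9, while its recalls are 1.0 and 0.0, so the a-mean is 0.5 and both the g-mean and the h-mean are 0. This illustrates why scores built on the recalls are less easily fooled by a skewed class distribution.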