Future Work - Introduction to the Main Contributions on Classification

Part I Introduction to the Main Contributions on Classification

3.2 Future Work

The potential future extensions of this PhD work are briefly discussed in the following paragraphs.

In this dissertation, a particular real-world problem was solved by proposing a methodology that generalises the semi-supervised learning framework to the recently introduced multi-dimensional classification framework. Since this was an early exploration of that classification schema, a lot of work undoubtedly remains to be done.

Firstly, the collection of multi-dimensional classifiers is limited [23, 57, 58]. To the best of my knowledge, there are still no learning algorithms capable of dealing with ordinal features without nominalising them. A wider variety of these classifiers could significantly enrich the literature. Furthermore, I would like to point out the need for computationally efficient learning algorithms, not only to ensure scalability in the number of features and class variables, but also to reduce the computational burden of semi-supervised learning techniques such as solutions based on the EM algorithm [119].

Secondly, a more diverse set of performance scores for multi-dimensional classification problems is required. When facing a new real-world uni-dimensional problem, there is always a positive probability of encountering at least one of the threats described in Section 1.4. When dealing with a multi-dimensional problem, the probability of facing a threat is greater simply because of the number of class variables. Therefore, having only a couple of subjective performance scores [23] to assess a multi-dimensional classifier seems to be a crucial limitation in this field. Consequently, I strongly believe that it is imperative to perform an exhaustive analysis of which uni-dimensional performance scores are suitable for generalisation to the multi-dimensional setting.
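To make the limitation concrete, the following minimal sketch contrasts two scores that are often used for multi-dimensional classifiers: the mean of the per-class-variable accuracies and the global (exact-match) accuracy over the whole class vector. The function names and the toy data are hypothetical illustrations, and the sketch does not claim that these are the scores used in [23].

```python
def mean_accuracy(y_true, y_pred):
    """Average, over the class variables, of the per-variable accuracy."""
    n = len(y_true)      # number of instances
    d = len(y_true[0])   # number of class variables
    per_variable = [
        sum(t[j] == p[j] for t, p in zip(y_true, y_pred)) / n
        for j in range(d)
    ]
    return sum(per_variable) / d


def global_accuracy(y_true, y_pred):
    """Fraction of instances whose whole class vector is predicted correctly."""
    hits = sum(tuple(t) == tuple(p) for t, p in zip(y_true, y_pred))
    return hits / len(y_true)


# Toy data (assumed): 4 instances, 2 class variables each.
y_true = [("pos", "subj"), ("neg", "subj"), ("pos", "obj"), ("neg", "obj")]
y_pred = [("pos", "obj"), ("neg", "subj"), ("neg", "obj"), ("neg", "subj")]

print(mean_accuracy(y_true, y_pred))    # 0.625
print(global_accuracy(y_true, y_pred))  # 0.25
```

On this toy data the same predictions score 0.625 under one criterion and 0.25 under the other, which illustrates why relying on only a couple of scores can give an incomplete picture of a multi-dimensional classifier.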

Thirdly, in situations where labelled data are costly, the proposed methodology can be extended to deal with partially labelled records, i.e. records for which only some of the class labels are missing.

Fourthly, as anticipated in the introduction, the natural evolution of the semi-supervised field (Section 1.2.3) can be carried over to the multi-dimensional framework. The foundation stone of this path was laid by the described methodological proposal [23]. Consequently, learning algorithms that leverage the unlabelled data by introducing assumptions linking the features and the class values might be proposed [73].

In the literature review of the theoretical studies on semi-supervised learning, I was only able to find one study [70] dealing with the multi-class setting. However, it follows a different path than mine. Thus, instead of taking one giant leap by assuming the whole classification schema, I limited my theoretical scope to the uni-dimensional spectrum in order to settle this spectrum in the literature. I still believe that taking such an assumption is burdensome, so my suggested future research lines focus on populating the literature with valuable answers to theoretical semi-supervised questions in the multi-class scenario. Specifically, the following three different issues are key:

Is there a number of unlabelled data beyond which further additions do not decrease the probability of error?

In this dissertation, lk was calculated as the expected minimum number of labelled data required in semi-supervised learning for uni-dimensional domains, assuming that the complexity and dimensionality of the feature space do not affect its calculation. What happens if this assumption is relaxed?

The published proposals rely on the assumption that the model is correct in order to study the optimal probability of error. When this assumption does not hold, how does the optimal probability of error vary with the number of labelled data, l? Does it still decrease exponentially in l?

Finally, how do skewed class distributions affect the semi-supervised learning techniques? The optimal procedure for binary problems proposed in [31] can easily be modified to be optimal for the a-mean by just removing the priors in Stage 2 and by using the EDR in Stage 3. Thus, can the proposed theoretical work in the multi-class framework also be easily adapted to study the intricate class-imbalance problem in the semi-supervised framework?
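As a small illustration of why the choice of score matters under skewed class distributions, the sketch below compares plain accuracy with the a-mean. It assumes that the a-mean denotes the unweighted arithmetic mean of the class-wise recalls; that definition is not given in this section, and the confusion matrix is a made-up toy example.

```python
def accuracy(confusion):
    """Fraction of correctly classified instances.

    confusion[i][j] = number of instances of true class i predicted as class j.
    """
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    return correct / total


def a_mean(confusion):
    """Unweighted arithmetic mean of the per-class recalls (assumed definition).

    Assumes every class has at least one instance, so no row sums to zero.
    """
    recalls = [row[i] / sum(row) for i, row in enumerate(confusion)]
    return sum(recalls) / len(recalls)


# Toy binary problem (assumed): 95 majority and 5 minority instances,
# with a classifier that always predicts the majority class.
confusion = [
    [95, 0],  # true majority class
    [5, 0],   # true minority class
]

print(accuracy(confusion))  # 0.95
print(a_mean(confusion))    # 0.5
```

Under these assumptions the trivial majority classifier looks strong in terms of accuracy but only reaches 0.5 in terms of the a-mean, because each class contributes equally to the latter regardless of its prior.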

Similarly to the semi-supervised learning framework, there was a lack of supporting works in the class-imbalance literature. This absence also prevented me from directly assuming a multi-dimensional framework. Thus, concerning the uni-dimensional framework, there are a few paths to directly extend the introduced work:

Theoretically, the novel framework used to study the implications of class imbalance for the performance of the trained classifier can easily be extended in three directions: classifiers other than the BDR can be considered, other methodologies from the literature for dealing with class skewness can be analysed, and more performance scores can be tested.

Methodologically, a more exhaustive analysis of the summaries of the class distributions of multi-class problems can also be carried out. A larger number of distance/dissimilarity functions can be evaluated over a larger set of problems, or measures can even be introduced for the remaining threats that harm the performance of classifiers.

It is important to remark that, in this case, I believe that the generalisation of both works to the multi-dimensional scenario is approachable and beneficial to the literature.

It can easily be seen that the theoretical framework proposed to study the implications of the class distribution for the performance of the trained classifier is directly generalisable using the global notation of this introductory part of the PhD dissertation. However, as happened in the multi-class setting with the multi-minority and multi-majority cases, new and different class-imbalance scenarios will appear: an underrepresented nominal vector of class variables is a totally different class-imbalance case from having underrepresented class values for a given class variable. Moreover, the dependences between the class variables will have an impact on this intricate threat. Once these difficulties have been solved, the actual problem in this extension lies in the above-mentioned limited number of diverse performance scores in the multi-dimensional framework. Thus, here, I also highlight the need for an exhaustive study of the performance scores.
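The difference between the two imbalance scenarios can be sketched with a toy example. The code below, written under the assumption of two binary class variables with made-up values, computes the marginal frequencies of each class variable and the joint frequencies of the class vectors; the helper names are hypothetical.

```python
from collections import Counter


def marginal_frequencies(labels, j):
    """Relative frequency of each value of the j-th class variable."""
    counts = Counter(row[j] for row in labels)
    return {value: c / len(labels) for value, c in counts.items()}


def joint_frequencies(labels):
    """Relative frequency of each complete vector of class values."""
    counts = Counter(tuple(row) for row in labels)
    return {vector: c / len(labels) for vector, c in counts.items()}


# Toy data (assumed): 20 instances, two strongly dependent class variables.
labels = (
    [("a", "x")] * 1
    + [("a", "y")] * 9
    + [("b", "x")] * 9
    + [("b", "y")] * 1
)

print(marginal_frequencies(labels, 0))  # {'a': 0.5, 'b': 0.5}
print(marginal_frequencies(labels, 1))  # {'x': 0.5, 'y': 0.5}
print(joint_frequencies(labels))
# {('a', 'x'): 0.05, ('a', 'y'): 0.45, ('b', 'x'): 0.45, ('b', 'y'): 0.05}
```

In this example each class variable is perfectly balanced on its own, yet two of the four class vectors are heavily underrepresented. The skew comes entirely from the dependence between the class variables, which is why the two scenarios need to be treated as different class-imbalance cases.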

Secondly, measuring the class-imbalance extent of multi-dimensional real-world problems also seems a natural next step. Sadly, the introduced measure does not generalise to multiple class variables because it is first necessary to establish which class distribution represents the balanced scenario in multi-dimensional problems. Fortunately, the feasibility of this future work is supported by the literature: Charte et al. [137] propose a generalisation of the imbalance ratio to measure the class-imbalance extent of multi-class classification problems.

In document Theoretical and methodological advances in semi-supervised learning and the class-imbalance problem. (Page 66-68)