1.1 Problems with real life data sets
1.5.2 Summary of Thesis Report Layout
Chapter One (Introduction). In the introduction, we made the case for the research topic by presenting the background of the study, namely the general problems encountered when working with real-life data sets. Imbalanced classes were posited as being very prevalent alongside other real-life data set issues, and a detailed explanation of those other issues was presented. Furthermore, similar imbalanced scenarios and the processes of dealing with raw data were explained. The problem definition, covering the research motivation, aims and contribution to knowledge, was also established in this chapter.
Chapter Two (Literature Review). The chapter is an extensive presentation of previous work on dealing with imbalanced class distribution in data sets. We engage the argument for using data-centric research such as data mining and machine learning to provide solutions in real-life scenarios; hence the extent of the attempts that have been made to provide solutions is explored here from a broader perspective. The evaluation metrics for classifiers were introduced for both binary and multi-classed data sets, and we provided a detailed explanation of the 2 by 2 confusion matrix for binary classification and of One-Versus-All for the multi-classed scenario.
Chapter Three (Variance Ranking Attribute Selection (VR) Technique). In this chapter we presented the Variance Ranking Attribute Selection technique for handling imbalanced class distribution. A detailed explanation of the data sets and data preparation, the theoretical basis and derivation of the formulae used throughout the report, and the experimental results were also included in this chapter.
Chapter Four (Comparison of Variance Ranking Attribute Selection (VR) Technique with the Benchmark). In this chapter a comparison of Variance Ranking Attribute Selection (VR) with other benchmark attribute selection techniques is provided. A new similarity measurement technique, "Ranked Order Similarity" (ROS), was used to compare and quantify the similarities between Variance Ranking Attribute Selection (VR) and the two main benchmarks, Pearson Correlation and Information Gain. The Ranked Order Similarity (ROS) measurement is a novel technique introduced in this chapter.
Chapter Five (Validation). In this chapter predictive modelling experiments were carried out using three machine learning algorithms and seven data sets (four binary and three multi-classed). The accuracy, precision, recall and other metrics were noted. The capture of the minority class group in the imbalanced situation was demonstrated, hence attesting to the efficacy of the VR technique. More importantly, Variance Ranking was compared with the SMOTE and ADASYN techniques. The chapter provided and consolidated the reasons for the failure of the algorithm-based methods, which have been the conventional means, and made a case for why the VR, SMOTE and ADASYN techniques, which rely mostly on the numbers in the class groups, are the right approaches to use.
Chapter Six (Summary, Discussion and Conclusions). This chapter highlighted the major achievements of the research with a step-by-step summary of how the aims and contributions were achieved. We also highlighted the shortcomings of the existing techniques for handling the imbalanced data set problem. We provided a distinctive yet succinct presentation of all aspects of the research, making it possible for any reader to become familiar with the central knowledge claimed, and we made a case for the relevance of VR and for future work.
Literature Review
2.1 Overview of imbalance data
Class imbalance is a major problem in using real-life data for predictive modelling. A data set is said to be imbalanced when its classes contain unequal numbers of instances, meaning that one class is larger than the others. The larger groups are the majority classes, while the smaller groups are called the minority classes. In binary classed imbalanced data, the ratio of the majority class to the minority class is often referred to as the imbalance ratio (IR). In multi-classed imbalanced data, the IR is defined according to the technique used to express the imbalance. Figure 2.1 is a representation of different types of imbalance. For the binary classed case, the IR is 9:1, that is, the majority class makes up 90% of the data; this is straightforward. For the multi-classed case, the class distribution is 50:30:10:5:3:2, and expressing the IR as a percentage depends on the technique used to decompose the multi-classed problem, either "one-versus-one" or "one-versus-all"; please see section 2.3.2.
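To make the two decompositions concrete, the following is a minimal sketch of how class ratios could be computed for the distributions in Figure 2.1. The function names and the class labels c1 to c6 are hypothetical, and the ratio definitions used here (rest-versus-class for one-versus-all, larger-versus-smaller for each pair in one-versus-one) are assumptions made for illustration; the definitions adopted in this report are those of section 2.3.2.

```python
from itertools import combinations


def one_versus_all_ratios(class_counts):
    """Return {label: (rest_count, class_count)} for every class.

    Assumption: under one-versus-all, each class is compared with all
    remaining classes pooled together.
    """
    total = sum(class_counts.values())
    return {label: (total - count, count) for label, count in class_counts.items()}


def one_versus_one_ratios(class_counts):
    """Return {(a, b): (larger_count, smaller_count)} for every pair of classes.

    Assumption: under one-versus-one, each pair of classes is treated as
    its own binary problem.
    """
    ratios = {}
    for a, b in combinations(class_counts, 2):
        larger = max(class_counts[a], class_counts[b])
        smaller = min(class_counts[a], class_counts[b])
        ratios[(a, b)] = (larger, smaller)
    return ratios


# Distributions taken from the text; the labels are illustrative only.
binary = {"majority": 90, "minority": 10}
multi = {"c1": 50, "c2": 30, "c3": 10, "c4": 5, "c5": 3, "c6": 2}

ova = one_versus_all_ratios(multi)
ovo = one_versus_one_ratios(multi)

for label, (rest, own) in ova.items():
    print(f"{label} versus all: {rest}:{own}")
```

Under these assumptions the smallest class (2%) faces a 98:2 ratio in one-versus-all but only 50:2 against the largest class in one-versus-one, which is why the percentage form of the IR depends on the decomposition chosen.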
The problems caused by imbalanced classes can affect all known predictive categories, such as supervised, unsupervised and hybrid learning. In supervised learning, classification can be multi-classed or binary classed: multi-classed is when there are more than two target groups, while binary is when there are only two (Yes or No, Positive or Negative) [26] [27].
The effect of class imbalance in the binary context is that the accuracy of the prediction could be as high as 90% and yet no minority class instance is captured by the prediction [28]. For example, suppose a data set has a total of 1000 instances, of which 900 are negative and 100 are positive. A binary classifier that predicts all 1000 cases as negative will still appear to be 90% accurate, whereas none of the 100 minority class instances has been captured.
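The arithmetic of this example can be checked with a short sketch. It uses plain Python only; the function name confusion_counts and the encoding of negative as 0 and positive as 1 are assumptions made for illustration and are not part of the experiments in this report.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count the four cells of a 2 by 2 confusion matrix: TP, FP, FN, TN."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, fn, tn


# 900 negative (0) and 100 positive (1) instances, as in the example.
y_true = [0] * 900 + [1] * 100
# A classifier that labels every instance as negative.
y_pred = [0] * 1000

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
# Recall on the minority (positive) class; guarded against division by zero.
minority_recall = tp / (tp + fn) if (tp + fn) else 0.0

print(f"accuracy = {accuracy:.2f}, minority recall = {minority_recall:.2f}")
```

The accuracy is 0.90 while the recall on the minority class is 0.00, which is why accuracy alone is a misleading metric for imbalanced data.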
Figure 2.1: Imbalanced and balanced data
The same wrong predictions seen in the binary case are also very noticeable in multi-classed data, as shown in Figure 2.1. Consider a data set with classes distributed as 50%, 30%, 10%, 5%, 3% and 2%. Predicting the small percentage groups (minority classes) using conventional machine learning algorithms and processes is next to impossible because, by design and application, these algorithms assume equal classes, and during implementation the process is usually optimized for accuracy, thereby favouring the capture of the majority classes. The irony is that in most predictions using real-life data, binary or multi-classed, the minority groups are usually the groups of interest, that is, what we are looking to predict. Consider binary classification on an intrusion detection data set: the minority is the few times the network may have been breached. In a cancer research data set, the minority group may be the few patients that have cancer, while in clinical trials of drug interactions, the few adverse interactions are usually the groups of interest. The same holds in multi-classed data sets where membership of one of several groups must be predicted, such as predicting the ages of abalones based on the number of rings [29] or predicting a protein localization site in deoxyribonucleic acid (DNA) [30]. The smaller groups are next to impossible to capture using conventional machine learning algorithms and processes.
It is reasonable to expect that if a technique could be found to eliminate the problems of class imbalance, the performance of most predictive algorithms would improve drastically. At this juncture, let us provide a precise definition of the term predictive modelling. What is predictive modelling? "This is a term used to describe processes and techniques that use statistics and machine learning to predict future events, outcomes or items, while using earlier events, data or observations as inputs during the process."