
2.2 Techniques for handling imbalanced class distribution

2.2.4 Cost-Sensitive method

The Cost-Sensitive Learning (CSL) approach considers the cost of misclassification and expresses the result in terms of its real-world consequences by allotting a different cost value to each type of misclassification [71]. In a binary classification scenario, the cost of labelling a positive as negative may differ from the cost of labelling a negative as positive. This holds in real life. Considering the two examples provided by [72], misclassifying a cancer patient as not having cancer is more damaging than misclassifying a healthy patient as having cancer, just as failing to identify a terrorist would be more damaging than labelling a non-terrorist as a terrorist.

This technique can be applied to the result of any classification algorithm, binary or multi-class. The cost-sensitive approach posits that accuracy is not as important as the consequences of wrongly classifying the target of interest. The final results are computed with cost values that are lowest for wrong predictions of least consequence [73] and highest for those of high consequence. During implementation, the cost values are provided and set beforehand [72]. Most of the time, CSL is used in combination with classifiers that produce their results in a confusion matrix [74]. Table 2.1 is a representation of a Cost Matrix based on the Confusion Matrix in Table 2.4, where $c_{i,j}$ is the cost of predicting class $i$ while the actual class is $j$; therefore, for $i \neq j$, $c_{i,j}$ is the cost of a false $i$ ($F_i$).

                   Predicted Positive    Predicted Negative
Actual Positive    $c_{+,+}$             $c_{-,+}$
Actual Negative    $c_{+,-}$             $c_{-,-}$

Table 2.1: Cost Matrix Representation
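One common way of using a cost matrix such as the one in Table 2.1 is to predict the class with the lowest expected cost rather than the most probable class. The sketch below illustrates this idea only; the cost values and the names `COST`, `expected_cost` and `min_cost_prediction` are hypothetical and are not taken from [71]-[74].

```python
# Hypothetical cost values, chosen only for illustration.
# COST[(i, j)] is the cost of predicting class i when the actual class is j.
COST = {
    ("+", "+"): 0.0,   # correct positive
    ("-", "+"): 10.0,  # positive labelled as negative (false negative)
    ("+", "-"): 1.0,   # negative labelled as positive (false positive)
    ("-", "-"): 0.0,   # correct negative
}


def expected_cost(predicted, p_positive, cost=COST):
    """Expected cost of predicting `predicted`, given P(actual class is '+')."""
    return (p_positive * cost[(predicted, "+")]
            + (1.0 - p_positive) * cost[(predicted, "-")])


def min_cost_prediction(p_positive, cost=COST):
    """Return the label whose expected cost is lowest."""
    return min(("+", "-"), key=lambda label: expected_cost(label, p_positive, cost))


# With these costs, a 20% chance of being positive is already enough
# to predict '+', because missing a positive is ten times as costly.
print(min_cost_prediction(0.20))
print(min_cost_prediction(0.05))
```

Under these assumed costs the decision flips from negative to positive well below a probability of 0.5, which is how a cost matrix shifts attention towards the costly class.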

The similarities between the two tables are obvious, but their applications are where they differ. If the errors in the classification are $c_{-,+}$ and $c_{+,-}$, and correctly classified data incur no cost, that is $c_{+,+} = c_{-,-} = 0$, then the Cost Matrix in Table 2.1 reduces to the ratio $c_{-,+} / c_{+,-}$, and the total cost is deduced as

$$\text{Total cost} = c_{-,+} \times FN + c_{+,-} \times FP \tag{2.5}$$

where $FN$ and $FP$ are the numbers of false negatives and false positives.

Provided that a classifier's result can be expressed as a confusion matrix, CSL can be applied to that classifier. In combining resampling and SVM with CSL, [75] showed that the baseline for measuring the acceptable cost can be modified to suit the context. Combined algorithms such as ensembles are very popular for using CSL in handling imbalanced classes. [76] provided an exploratory study on bagging relationships and classes, and [77] proposed a method using an ensemble (AdaBoost), CSL, SVM and query-by-committee (QBC): first the classifier was run on subsets of the data sample, divided according to the imbalance proportion, then QBC was used to produce the training set before the CSL-SVM was trained on the data. [78] proposed training with cost-sensitive neural networks and increasing the cost threshold, which improves the output because data items with higher costs become harder to misclassify.
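As a minimal sketch of how the total cost could be computed from a classifier's confusion matrix, the example below counts the errors and weights each error type by its cost, assuming that correct predictions cost nothing. The labels, the predictions and the cost values (`cost_fp`, the cost of labelling an actual negative as positive, and `cost_fn`, the cost of labelling an actual positive as negative) are invented for illustration.

```python
def confusion_counts(actual, predicted, positive=1):
    """Count TP, FP, FN and TN for a binary problem."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1
        elif p == positive:      # predicted positive, actually negative
            fp += 1
        elif a == positive:      # predicted negative, actually positive
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def total_cost(fp, fn, cost_fp, cost_fn):
    """Total misclassification cost when correct predictions cost zero."""
    return cost_fp * fp + cost_fn * fn


# Hypothetical imbalanced sample: 3 positives and 7 negatives.
actual    = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
predicted = [1, 0, 0, 0, 0, 0, 0, 0, 1, 0]

tp, fp, fn, tn = confusion_counts(actual, predicted)
cost = total_cost(fp, fn, cost_fp=1.0, cost_fn=10.0)
print(tp, fp, fn, tn, cost)
```

Here the classifier is 70% accurate, yet the two missed positives account for almost all of the total cost, which is the situation CSL is meant to expose.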

K-nearest neighbour and imbalanced class data

K-nearest neighbour (KNN) is a classification algorithm that classifies a new data point within a sample space by considering its neighbouring data points [79][80][81], hence the term k-nearest neighbour. In its simplest form, consider Figure 2.7a: if a new data point (the blue dot) has to be classified as belonging to either the black or the white class, its nearest neighbours have to be checked. If k is set to 3 (k=3), as in Figure 2.7b, the 3 data points closest to the blue dot are considered; in Figure 2.7b these are two white dots and one black dot. The distance between the blue dot and each of its neighbours is measured to find the nearest ones, and a majority vote then classifies the blue dot as belonging to the white class. The assumption is that data points are similar to their neighbours if the distance between them is small.

Figure 2.7: Value of K is 3 in the sample space. (a) K nearest neighbour in sample space; (b) Relevant data.
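The majority vote described above can be sketched in a few lines. This is an illustration under stated assumptions, not a reference implementation: the coordinates, the labels and the function name `knn_predict` are hypothetical, and the Euclidean distance is assumed.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest neighbours."""
    # Pair each training point's distance to the query with its label,
    # then sort so that the closest points come first.
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical sample space: two white points and three black points.
points = [(1.0, 1.0), (1.5, 2.0), (3.0, 4.0), (5.0, 7.0), (3.5, 5.0)]
labels = ["white", "white", "black", "black", "black"]
blue_dot = (2.0, 2.0)

print(knn_predict(points, labels, blue_dot, k=3))
print(knn_predict(points, labels, blue_dot, k=5))
```

With k=3 the three closest points are two white and one black, so the vote goes to white. With k=5 every point votes and the larger black class wins, which shows how class imbalance can dominate the vote as k grows.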

There are various metrics for measuring the distance between data points in the KNN algorithm. The most popular ones are listed below: Equations 2.6, 2.7 and 2.8 are for continuous variables, while the Hamming distance in Equation 2.9 is almost the same as the Manhattan distance but is applied when the data are categorical or discrete.

$$\text{Euclidean distance} = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \tag{2.6}$$

$$\text{Manhattan distance} = \sum_{i=1}^{n} |x_i - y_i| \tag{2.7}$$

$$\text{Minkowski distance} = \left[ \sum_{i=1}^{n} |x_i - y_i|^q \right]^{1/q} \tag{2.8}$$

$$\text{Hamming distance} = \sum_{i=1}^{n} |x_i - y_i| \tag{2.9}$$

In Equation 2.9, each term $|x_i - y_i|$ is 0 when $x_i = y_i$ and 1 otherwise.
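The four distances in Equations 2.6 to 2.9 can be written directly as short functions. The sketch below assumes that the two data points are given as equal-length sequences; the function names are chosen here for illustration.

```python
import math


def euclidean(x, y):
    """Equation 2.6: square root of the summed squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def manhattan(x, y):
    """Equation 2.7: sum of the absolute differences."""
    return sum(abs(a - b) for a, b in zip(x, y))


def minkowski(x, y, q):
    """Equation 2.8: q=1 gives Manhattan, q=2 gives Euclidean."""
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)


def hamming(x, y):
    """Equation 2.9: number of positions where categorical values differ."""
    return sum(1 for a, b in zip(x, y) if a != b)


print(euclidean((0, 0), (3, 4)))
print(manhattan((0, 0), (3, 4)))
print(hamming(("red", "small", "round"), ("red", "large", "round")))
```

The Minkowski distance generalises the other two continuous metrics, so a single implementation with an adjustable q is often enough when experimenting with KNN.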

Various modifications of k-nearest neighbour have been used to solve the problems associated with imbalanced classes. For example, large weighted k-nearest neighbour (W-KNN) was used by [82]. The process utilizes a wider region around the distribution of the data items to deduce the nearest neighbours, but this accommodates some extraneous data such as outliers, which may add noise and make the whole prediction less accurate on data sets with large variances.

The algorithmic techniques for predictive modelling can never be exhausted; the concept is so fluid that new modifications, and modifications of earlier modifications, are invented on a daily basis. For example, the W-KNN of [82] discussed above has been modified to use decision tree boundaries to select its k nearest neighbours. The wider region around the data items now has a different metric that data points must satisfy to qualify to vote on the class of a new data item. This improves on the limited accuracy recorded by W-KNN, since some outliers are voted out.
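To give a feel for weighted voting over a wider neighbourhood, the sketch below uses a generic inverse-distance weighting, in which closer neighbours count for more than distant ones. This is an assumption made for illustration and is not the W-KNN procedure of [82], whose details are not reproduced here; the data, the function name `weighted_knn_predict` and the small constant `eps` are hypothetical.

```python
import math
from collections import defaultdict


def weighted_knn_predict(train_points, train_labels, query, k=5, eps=1e-9):
    """Classify `query` with votes weighted by 1 / distance (generic sketch)."""
    neighbours = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )[:k]
    votes = defaultdict(float)
    for distance, label in neighbours:
        # eps avoids division by zero when the query equals a training point.
        votes[label] += 1.0 / (distance + eps)
    return max(votes, key=votes.get)


# Hypothetical sample space: two white points and three black points.
points = [(1.0, 1.0), (1.5, 2.0), (3.0, 4.0), (5.0, 7.0), (3.5, 5.0)]
labels = ["white", "white", "black", "black", "black"]

print(weighted_knn_predict(points, labels, (2.0, 2.0), k=5))
```

With all five points voting, an unweighted majority would favour the larger black class, whereas the weighted vote favours the two nearby white points. Distant points still receive some weight, so outliers in a wide neighbourhood can still add noise.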

Recently, a new approach to handling imbalanced data sets, known as "conditional generative adversarial networks (cGAN)", was introduced by [83]. It is based on the concept of continuous competition between two networks known as the generator and the discriminator. The discriminator tries to learn the pattern of the actual data set by comparing it with the data being generated by the generator, and the feedback between the two leads to adaptation and improvement in the data quality and, finally, in the overall performance of the algorithm.