
Chapter 2 Literature Review

2.2 Machine Learning Algorithms

2.2.6 Naïve Bayes Classifier

The Naive Bayes classifier is also used, along with C4.5 decision trees, to test the performance of filters. It applies Bayes’ theorem with a strong assumption of independence between features (Mahalakshmi and Sivasankar, 2015). It is based on the concept of conditional probability:

$$p(C_k \mid x) = \frac{p(C_k)\, p(x \mid C_k)}{p(x)} \tag{2.7}$$

For two class labels i and j and a given input X, the label with the higher conditional probability is more likely to be the actual label. The two labels can be compared by calculating the ratio

$$R = \frac{P(i \mid X)}{P(j \mid X)} = \frac{P(i)\, P(X \mid i)}{P(j)\, P(X \mid j)} \tag{2.8}$$

If R>1 then predict i, else predict j.
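As an illustration of this decision rule, the following minimal Python sketch computes R for two labels under the independence assumption, so that P(X|i) is taken as the product of per-feature likelihoods. The function names `posterior_ratio` and `predict`, and the way the likelihoods are passed in as plain lists, are assumptions made for this example; they are not taken from Weka or from any particular implementation.

```python
import math


def posterior_ratio(prior_i, prior_j, likelihoods_i, likelihoods_j):
    """Return R = P(i|X) / P(j|X) under the naive independence assumption.

    likelihoods_i[k] is P(x_k | i) for the k-th observed feature value and
    likelihoods_j[k] is P(x_k | j). All probabilities are assumed to be > 0.
    """
    # Work in log space so that a product of many small factors
    # does not underflow to zero.
    log_r = math.log(prior_i) - math.log(prior_j)
    for p_i, p_j in zip(likelihoods_i, likelihoods_j):
        log_r += math.log(p_i) - math.log(p_j)
    return math.exp(log_r)


def predict(prior_i, prior_j, likelihoods_i, likelihoods_j):
    """Predict label "i" if R > 1, otherwise label "j"."""
    r = posterior_ratio(prior_i, prior_j, likelihoods_i, likelihoods_j)
    return "i" if r > 1 else "j"


if __name__ == "__main__":
    # Equal priors, two features that both favour label i.
    print(predict(0.5, 0.5, [0.8, 0.6], [0.2, 0.3]))
```

The evidence term p(x) does not appear in the code because it cancels in the ratio, which is why only priors and class-conditional likelihoods are needed.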

An implementation of Naive Bayes can be found in Weka.

2.2.7 Summary

This section introduced the structure and learning algorithms of ANNs. Some typical ANN structures were presented along with Naive Bayes and decision trees; these will be used in the filter experiments to demonstrate the filters’ abilities. This background research shows that the training data has a large impact on the training of an ANN. Thus, both instance selection and feature selection can improve the performance of an ANN by enhancing its training process.

As a classic type of ANN, the MLP is used in this research to solve a wide range of real-world problems. RBF and HONN are also used for some special kinds of problems. Naive Bayes and decision trees, two classic machine learning algorithms, are used for comparison in the filter experiments.

2.3 Instance Selection

Instance selection is the process of generating groups of input instances from the original datasets. This process can be used to generate training, validation and test instance groups and to avoid class imbalance problems (Liu, 2010). Instance selection approaches that can solve class imbalance problems are divided into two categories: resampling and embedded methods.

2.3.1 Resampling

Resampling techniques aim to correct problems with the distribution of a data set. Weiss and Provost (2003) noted that the original distribution of samples is sometimes not the optimal distribution to use for a given classifier, and that different sampling techniques can modify the distribution to one that is closer to the optimal distribution.

Resampling methods include over-sampling and under-sampling. Under-sampling of the majority (normal) class has been proposed as a good means of increasing the sensitivity of a classifier to the minority class. However, as most under-sampling methods do not consider the relationship between examples, data redundancy or information loss may easily occur (He and Garcia, 2009). Thus, over-sampling methods have become more popular in recent years.

One widely used resampling method, the Synthetic Minority Over-sampling Technique (SMOTE), is an over-sampling approach in which the minority class is over-sampled by creating “synthetic” examples rather than by over-sampling with replacement (Chawla et al., 2011). It can achieve better classifier performance (in ROC space) than only under-sampling the majority class.
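The way SMOTE creates synthetic examples can be sketched in a few lines of Python with NumPy. The sketch below follows the usual description of the technique: each synthetic example is placed at a random point on the line segment between a minority example and one of its k nearest minority-class neighbours. The function name `smote_sketch` and the parameters `n_per_example`, `k` and `seed` are assumptions made for this illustration; it is a simplified sketch, not a reference implementation.

```python
import numpy as np


def smote_sketch(minority, n_per_example=1, k=5, seed=0):
    """Create synthetic minority examples by interpolating between neighbours.

    minority: array-like of shape (n_samples, n_features), minority class only.
    Returns an array of shape (n_samples * n_per_example, n_features).
    """
    rng = np.random.default_rng(seed)
    minority = np.asarray(minority, dtype=float)
    n = len(minority)
    if n < 2:
        raise ValueError("at least two minority examples are required")
    k = min(k, n - 1)  # an example cannot have more than n - 1 neighbours

    # Pairwise Euclidean distances between all minority examples.
    diffs = minority[:, None, :] - minority[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)  # an example is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    synthetic = []
    for idx in range(n):
        for _ in range(n_per_example):
            neighbour = minority[rng.choice(neighbours[idx])]
            gap = rng.random()  # uniform in [0, 1)
            # Random point on the segment between the example and its neighbour.
            synthetic.append(minority[idx] + gap * (neighbour - minority[idx]))
    return np.array(synthetic)
```

Because every synthetic point is a convex combination of two existing minority examples, the new examples stay inside the region already covered by the minority class instead of being exact duplicates.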

However, SMOTE generates the same number of synthetic examples for each minority example, and this strategy may cause data overlapping. Several methods that overcome this limitation of SMOTE have been proposed, such as borderline-SMOTE (Han et al., 2005) and Adasyn (Haibo et al., 2008). Borderline-SMOTE only over-samples the borderline examples of the minority class. Adasyn adapts the number of synthetic examples generated for each minority example according to the distribution of the data.

Over-sampling methods try to overcome the imbalanced class distribution by adding examples to the training set. However, duplicating or generating examples may make the training set noisier and cause over-fitting (He and Garcia, 2009). Furthermore, adding training examples also increases the training time. To overcome these drawbacks, data cleaning techniques were proposed. The Tomek link is a useful concept for cleaning data. It can be used to clean up data after an over-sampling method such as Random Over Sampling, SMOTE or Adasyn (He and Garcia, 2009).
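For reference, a Tomek link is commonly defined as a pair of examples from different classes that are each other's nearest neighbour. The following Python sketch finds such pairs with a brute-force distance computation. The function name `tomek_links` is an assumption made for this example, and the choice of which member of a link to remove (the majority example, or both) is left to the caller because it varies between cleaning strategies.

```python
import numpy as np


def tomek_links(X, y):
    """Return index pairs (a, b), a < b, that form Tomek links.

    A pair is a Tomek link when the two examples have different labels
    and each one is the nearest neighbour of the other.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)

    # Pairwise Euclidean distances; self-distances are excluded.
    diffs = X[:, None, :] - X[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)
    nearest = dists.argmin(axis=1)

    links = []
    for a, b in enumerate(nearest):
        # a < b reports each mutual pair only once.
        if a < b and nearest[b] == a and y[a] != y[b]:
            links.append((a, int(b)))
    return links
```

Examples that take part in a Tomek link are either noisy or lie close to the class boundary, which is why removing them after over-sampling can reduce class overlap.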

Some studies have also reported that resampling methods do not improve NN performance and may even hinder it. Other experiments demonstrate that complex resampling methods such as SMOTE are no better than simple Random Over Sampling (Bhowan et al., 2013, Khoshgoftaar et al., 2010, Khoshgoftaar et al., 2011).

2.3.2 Embedded Methods

Another category of solutions is the embedded methods. These methods embed a resampling approach in the model training process, or modify the training algorithms and network structures. A typical embedded solution for the imbalanced dataset problem is the Dynamic Sampling Approach (DyS) (Minlong et al., 2013). Its main idea is to select examples for training in each epoch so as to avoid redundant information and to make the best use of the training data. The main steps of this algorithm can be described as follows:

(1) Randomly fetch an example x from the training set.

(2) Estimate the probability p that the example should be used for training.

(3) Generate a uniform random real number μ between 0 and 1.

(4) If μ < p, then use x to update the MLP.

(5) Repeat steps 1-4.
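The steps above can be sketched as a simple loop in Python. How the probability p is estimated is not specified here, so the sketch takes it as a caller-supplied function; `selection_probability`, `update` and `n_draws` are hypothetical names introduced only for this illustration and do not reproduce the estimator used by Minlong et al. (2013).

```python
import random


def dynamic_sampling_epoch(training_set, selection_probability, update,
                           n_draws, seed=0):
    """Run one epoch of the dynamic sampling loop and return the number
    of examples that were actually used for training.

    selection_probability(x) -> float in [0, 1]   (hypothetical estimator)
    update(x)                -> None              (hypothetical MLP update)
    """
    rng = random.Random(seed)
    used = 0
    for _ in range(n_draws):                 # step 5: repeat steps 1-4
        x = rng.choice(training_set)         # step 1: fetch a random example
        p = selection_probability(x)         # step 2: estimate p
        mu = rng.random()                    # step 3: uniform number in [0, 1)
        if mu < p:                           # step 4: train on x with prob. p
            update(x)
            used += 1
    return used
```

In this form, an example with a high p is used for training on almost every draw, while an example with a low p is mostly skipped, so the effective class distribution seen by the network depends on how p is estimated.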

DyS merges over-sampling into the training process of the MLP while addressing the over-fitting problem. However, it is essentially an adaptive form of random over-sampling that duplicates both minority and majority samples. The basic idea is the same as that of resampling methods, but the experimental results are better in terms of the generalization ability of the MLP.

Cost-sensitive methods are another kind of embedded method for class imbalance (Castro and Braga, 2013). They always follow two steps:

(1) Set a cost matrix for the class imbalance problem to formulate the problem as a cost-sensitive problem, and

(2) Employ a method to solve the cost-sensitive problem.
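As a hedged illustration of these two steps, the Python sketch below sets up a cost matrix and then applies one common way of solving the resulting cost-sensitive problem: choosing the label with the minimum expected cost given the class posterior probabilities. This decision rule is an assumption chosen for the example and is not necessarily the method of Castro and Braga (2013); the function name `min_expected_cost_label` and the cost values are likewise illustrative.

```python
import numpy as np


def min_expected_cost_label(posteriors, cost_matrix):
    """Return the label with the lowest expected misclassification cost.

    posteriors[k]     : estimated P(class k | x)
    cost_matrix[k, j] : cost of predicting class j when the true class is k
    """
    posteriors = np.asarray(posteriors, dtype=float)
    cost_matrix = np.asarray(cost_matrix, dtype=float)
    # expected_cost[j] = sum over k of posteriors[k] * cost_matrix[k, j]
    expected_cost = posteriors @ cost_matrix
    return int(expected_cost.argmin())


if __name__ == "__main__":
    # Step 1: missing the minority class (class 1) is assumed to cost
    # ten times more than a false alarm.
    costs = [[0.0, 1.0],
             [10.0, 0.0]]
    # Step 2: apply the minimum expected cost rule to one prediction.
    print(min_expected_cost_label([0.8, 0.2], costs))
```

With these illustrative costs, an example with only a 0.2 posterior for the minority class is still assigned to it, because the expected cost of predicting the majority class (2.0) exceeds that of predicting the minority class (0.8).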

Boosting is also widely used to solve imbalance problems (Galar et al., 2012). This kind of data mining model has the advantage that it resamples the data space automatically, which eliminates the extra learning cost of searching for the optimal class distribution and the representative samples.

2.3.3 Summary

All of the methods covered in this background research need to add or delete instances to keep a balance between classes. Even some embedded methods prefer some instances over others. This may cause information loss or introduce irrelevant information, which is highly likely to damage feature selection processes such as filters. To solve the class imbalance problem without adding or deleting instances, a novel instance selection algorithm will be created in this research. As a kind of resampling method, it focuses on randomly distributing instances among the training, validation and test sets.

2.4 Feature Selection

Portion removed

According to the evaluation function, wrappers can also be divided into two groups: those using a partial-derivative-based saliency measure and those using a weight-based saliency measure.

Hybrid feature selection methods that combine wrappers and filters are now popular among researchers because they yield greater improvements in the generalization ability of models. They can be divided into two main groups according to how the two components are joined. One approach uses filters before wrappers as a pre-selection step. The other merges filters into wrappers, mainly by using filters in the local search of the wrapper to improve efficiency. The first approach is more suitable for big data problems, while the latter is better suited to improving model performance.