Classifier Ensembles - Classifier Ensemble Feature Selection for Automatic Fault Diagnosis

The combination of classifiers, also known as classifier ensembles, is a well established approach (WANDEKOKEN et al., 2011). This approach consists in use simultaneously divergent accurate classifiers. The term accurate classifier used here refers to a classifier that achieves more than 50% of accuracy in tests. The term divergent means that the classifiers give their wrong prediction in a different region of the global feature space. In other words, the classifiers that compound the ensemble give the wrong prediction to different test samples. Figure 14 shows how a classifier ensemble can improve its classification performance combining divergent classifiers.

Figure 14 – The intersection of the individual wrong decision is necessarily smaller than the region of wrong decisions of any individual component classifiers (the regions R1 , R2 and R3 ) (WANDEKOKEN et al.,2011).

2.4.1 Bagging, Adaboost and Random Forest

One of the earliest ensemble approaches is bootstrap aggregating (Bagging) (BREIMAN, 1996). The diversity among the classifiers that compose the ensemble is

obtained using different datasets to train each base classifier. These datasets are generated with the statistical technique bootstrap. The bootstrap eliminates some samples and replicate others, keeping the same number of samples of the original dataset. Bagging generates ensembles with low trend of over-fitting (QUINLAN, 1996). Instead of drawing a succession of independent bootstrap samples from the original instances, the Adaboost algorithm (FREUND; SCHAPIRE, 1996) maintains a weight for each instance. The instance weight is directly proportional to its influence in the classifier learning. At each trial, the vector of weights is adjusted reflecting the performance of the classifier. The weight of misclassified instances is increased in each iteration. The final classifier also aggregates the learned classifiers by voting. However, the vote of each classifier is a function of its accuracy. Despite its popularity, Adaboost has highest trends of over-fitting than Bagging (QUINLAN, 1996). Unlike Bagging and Adaboost, Random Forests (BREIMAN,

2.4. Classifier Ensembles 43

2001) raffles samples and features simultaneously to generate divergent tree classifiers to compose the final ensemble. Random Forests have shown good results and have been widely cited in the literature.

2.4.2 Ensemble Feature Selection

The two main ways to promote diversity among the classifiers is selecting samples or selecting features. Many works use feature subset variations to promote the necessary diversity in a classifier ensemble (WANDEKOKEM et al.,2011; OPITZ,1999; BREIMAN, 2001; DIAO et al., 2014). In these works, the different subsets were obtained using feature selection algorithms, instead a pure random choice as performed in Random Forests. In WANDEKOKEM et al.(2011), the method varies the hyperparameters (kernel type, kernel parameter) of Support Vector Machine (SVM) (BURGES, 1998) to generate different feature subsets, using the Sequential Forward Selection (SFS) (GUYON; ELISSEEFF, 2003) algorithm. The SVMs trained with the generated subsets are combined to form an ensemble. Redundant SVMs are removed by Sequential Backward Selection (SBS) search. The SBS algorithm works like SFS, but removes features initially from the total set 𝑋,

instead of increasing an initial empty set. However, in BSFS case, it removes SVMs instead features. A similar idea like the last part of BSFS, when SBS is applied to reduce the number of SVMs, is used in DIAO et al.(2014), where ensemble predictors are transformed into training samples and classifiers are treated as features. This process aims to reduce the run-time overhead of the system, while improving the overall efficiency.

TheGenetic Ensemble Feature Selection (GEFS) algorithm proposed in OPITZ (1999) is inspired by the Genetic Algorithm (GA) paradigm to select feature subsets in a wrapper approach. The representation of each individual of the GEFS population is an array of integers, where each integer indexes a feature. GEFS calculates the fitness of each individual𝑖as: 𝐹 𝑖𝑡𝑛𝑒𝑠𝑠𝑖 =𝐴𝑐𝑐𝑢𝑟𝑎𝑐𝑦𝑖+𝜆𝐷𝑖𝑣𝑒𝑟𝑠𝑖𝑡𝑦𝑖, where𝜆defines the trade-off between

accuracy and diversity, and its values variate between zero and one. The algorithm prunes the population to the 𝑁 most-fit individuals, updates the 𝜆 parameter, then repeats this

process. At every generation, the current ensemble consists of averaging the predictions of the current population. The main problem of GEFS is exactly the use of GA, because GA causes all individuals to converge towards a single point in the solution space. As the ensemble quality depends on the diversity of its classifiers, GEFS forces a diversity in its objective function. Hence, the GA fitness function is not the same function to be maximized, that is the ensemble accuracy. Thus, GEFS works on a dilemma controlled by the parameter 𝜆. If the 𝜆 value goes to zero, all individuals trends to converge to a

single point and there is no need to make an ensemble, since all answers will be the same. As the𝜆 goes to one, the guarantee of the individuals quality reduces. Therefore, GEFS

More recent work (PARK; KIM,2015) presents the Sequential Random K-Nearest Neighbor (SRKNN) feature selection algorithm. The SRKNN algorithm is, in a certain way, similar to random forests, since each base classifier is modeled based on a sequential forward selection strategy from a bootstrapped sample, and it uses a majority voting scheme. The main difference between SRKNN and random forests is that the first one uses a k-NN (COVER; HART, 1967) classifier instead of decision trees to form the ensemble. In GHIMIRE; LEE (2014) ensembles of ELM were made by using a bagging algorithm for Facial Expression Recognition(FER). Their method consists in extract the features from the face image by dividing it into a number of small cells. Then, a bagging algorithm was used to construct many different bags of training data and each of them was trained by using separate ELMs. To recognize the expression of the input face image, HOG features were fed to each trained ELM and the results were combined by using a majority voting scheme. The ELM ensemble using bagging improved the generalized capability of the network significantly.

More recently, CAO; CHEN; FAN (2016) proposed a majority voting based ELM ensemble for landmark recognition application to be used in mobile devices. The ELM was chosen as base classifier because it has fast training and good classification performance. It was considered the best choice to enhance the landmark recognition performance and maintain the training and recognition time at an acceptable level. No sophisticated technique was applied to training the ensemble that was composed by𝑘 ELMs trained with

all samples and features. Only the unstable nature of the random weights in the hidden layer of the ELMs was used to generate the desirable divergence among the classifiers.

Unlike CAO; CHEN; FAN (2016), HAN; LIU (2015) promotes the diversity within the ensemble, adopting feature segmentation and then feature extraction with nonnegative matrix factorization to the original data firstly. The authors argued the ELM was chosen as base classifier to improve the classification efficiency. According to them, the experimental results showed that the ensemble not only had high classification accuracy, but also handled the adverse impact of a few of labeled training samples in the classification.

3 Industrial Processes

This work uses three industrial processes as case study. The first process is the Case Western Reserve University Bearing Data. It is a widely cited process composed by a large number of labeled vibratory data in Matlab format. Normal and faulty rolling bearing conditions can be identified with this data. The second process is the Tennessee Eastman chemical process. It uses a simulator to generate the data that is public available. The last process is related with Electrical Submersible Pumps. The data from this process was obtained by a partnership between the Ninfa (Núcleo de Inferência e Algoritmos) Laboratory and Petrobras (Rio de Janeiro - Brazil).

In document Classifier Ensemble Feature Selection for Automatic Fault Diagnosis (Page 44-47)