Sample subset optimization for classifying imbalanced biological data

Pengyi Yang^1,2,3, Zili Zhang^4,5⋆, Bing B. Zhou^1,3 and Albert Y. Zomaya^1,3

1 School of Information Technologies, University of Sydney, NSW 2006, Australia

2 NICTA, Australian Technology Park, Eveleigh, NSW 2015, Australia

3 Centre for Distributed and High Performance Computing, University of Sydney NSW 2006, Australia

4 Faculty of Computer and Information Science, Southwest University CQ 400715, China

5 School of Information Technology, Deakin University, VIC 3217, Australia

[email protected]; [email protected]

Abstract. Data in many biological problems are often compounded by imbalanced class distribution. That is, the positive examples may be largely outnumbered by the negative examples. Many classification algorithms such as the support vector machine (SVM) are sensitive to data with imbalanced class distribution, and produce suboptimal classification as a result. It is desirable to compensate for the imbalance effect in model training for more accurate classification. In this study, we propose a sample subset optimization technique for classifying biological data with moderately and extremely imbalanced class distributions. By using this optimization technique with an ensemble of SVMs, we build multiple roughly balanced SVM base classifiers, each trained on an optimized sample subset. The experimental results demonstrate that the ensemble of SVMs created by our sample subset optimization technique can achieve a higher area under the ROC curve (AUC) value than popular sampling approaches such as random over-/under-sampling and SMOTE sampling, and than the sampling used in widely used ensemble approaches such as bagging and boosting.

1 Introduction

Modern molecular biology is rapidly advanced by the increasing use of computational techniques. For tasks such as RNA gene prediction [1], promoter recognition [2], splice site identification [3], and the classification of protein localization sites [4], it is often necessary to address the problem of imbalanced class distribution because the datasets extracted from those biological systems are likely to contain a large number of negative examples (referred to as the majority class) and a small number of positive examples (referred to as the minority class). Many popular classification algorithms such as the support vector machine (SVM) have been applied to a large variety of bioinformatics problems including those mentioned above (e.g. refs. [1, 3, 4]). However, most of these algorithms are sensitive to imbalanced class distribution and may not perform well when applied directly to imbalanced data [5, 6].

⋆ Corresponding author

Sampling is a popular approach to addressing imbalanced class distribution [7]. Simple methods such as random under-sampling and random over-sampling are routinely applied in many bioinformatics studies [8]. With random under-sampling, the size of the majority class is reduced to compensate for the imbalance, whereas with random over-sampling, the size of the minority class is increased. Although they are straightforward and computationally efficient, these two methods are prone to either increased noise and duplicated samples (over-sampling) or the removal of informative samples (under-sampling) [9]. A more sophisticated approach known as SMOTE synthesizes “new” samples from the original samples in the dataset [10]. However, many bioinformatics problems present several thousand samples with a highly imbalanced class distribution. Applying SMOTE will introduce a large number of synthetic samples, which may increase the data noise substantially. Alternatively, a cost metric can be specified to force the classifier to pay more attention to the minority class [11]. This requires choosing a correct cost metric, which is often unknown a priori.

Several recent studies found that ensemble learning could improve the performance of a single classifier in imbalanced data classification [6, 12]. In this study, we explore this direction. In particular, we introduce a sample subset optimization technique for ‘intelligent under-sampling’ in imbalanced data classification. Using this technique, we designed an ensemble of SVMs specifically for learning from imbalanced biological datasets. This system has several advantages over conventional ones:

– It creates each base classifier using a roughly balanced training subset with built-in intelligent under-sampling. This is important in learning from imbalanced data because it reduces the risk of bias towards one class while neglecting the other.

– The system embraces an ensemble framework in which multiple roughly balanced training subsets are created to train an ensemble of classifiers. Thus, it reduces the risk of removing informative samples from the majority class, which may occur when a simple under-sampling technique is applied.

– As opposed to random sampling, the sample subset optimization technique is applied to identify optimal sample subsets. This may improve the quality of the base classifiers and result in a more accurate ensemble.

– The aforementioned biological problems often present several thousand training samples. The proposed technique is essentially an under-sampling approach. It can avoid the introduction of data noise, and the generated data subsets may be more efficient for classifier training.

The rest of the paper discusses the details of the proposed sample subset optimization technique and the associated ensemble learning system. Section 2 presents the ensemble learning system. Section 3 describes the main idea of sample subset optimization. The base classifier and fitness function of the ensemble system are described in Section 4. Comparisons with typical sampling and ensemble methods are given in Section 5. Section 6 concludes the paper.

2 Ensemble system

Ensemble learning is an effective approach for improving the prediction accuracy of a single classification algorithm. Such an improvement is commonly achieved by using multiple classifiers (known as the base classifiers) each trained on a subset of samples created by random sampling such as those used in bagging [13], or cost-sensitive sampling such as those used in boosting [14]. The base classifiers are typically combined using an integration function such as averaging [15] or majority voting [16].
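The two integration functions mentioned above can be illustrated with a minimal sketch. The function names `majority_vote` and `average_score` are hypothetical, and the vote and score values are made up for illustration only.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the label predicted by the largest number of base classifiers."""
    counts = Counter(predictions)
    # most_common(1) gives [(label, count)]; on a tie the first-seen label wins.
    return counts.most_common(1)[0][0]


def average_score(scores):
    """Return the mean of the base classifiers' real-valued outputs."""
    return sum(scores) / len(scores)


# Illustrative outputs of five base classifiers for one test sample.
votes = [1, 0, 1, 1, 0]
scores = [0.9, 0.4, 0.7, 0.6, 0.2]
print(majority_vote(votes))
print(average_score(scores))
```

Majority voting needs only hard labels from each base classifier, whereas averaging requires comparable real-valued outputs.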

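[Figure 1 diagram: the training set (m minority and n majority samples) passes through “Optimize samples from majority class” to yield the optimized training subsets (m minority samples with n1, n2, ..., n_L majority samples); these train the base classifiers c1, c2, ..., c_L, whose outputs are combined by majority voting to give the prediction and AUC value on the test set (m’ and n’ samples).]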

Fig. 1. A schematic representation of the proposed ensemble system.

We propose an ensemble learning system specifically designed for imbalanced biological data classification. The schematic representation of the proposed system is shown in Figure 1. It has three main components: sample subset optimization, base classifier, and fitness function. The key to this ensemble system is the application of the sample subset optimization technique (to be described in Section 3).

Suppose that a highly imbalanced dataset contains n samples from the majority class and m samples from the minority class, where n ≫ m. The system creates each sample subset by including all m minority samples and selecting a subset of the n majority samples according to an internal optimization procedure. This procedure generates multiple optimized sample subsets, each a roughly balanced subset containing m minority samples and n_i carefully selected majority samples, where n_i ≪ n (i = 1...L) and L is the total number of optimized sample subsets. Using those optimized sample subsets, we obtain a group of base classifiers c_i (i = 1...L), each trained on its corresponding sample subset {m + n_i}. The base classifiers are then combined using majority voting to form an ensemble of classifiers.

Algorithm 1 summarizes the procedure. A line starting with “//” in the algorithm is a comment for its adjacent next line.

Algorithm 1 sampleSubsetOptimization

Input: Imbalanced dataset DI

Output: Roughly balanced dataset DB

 1: cvSize = 2;
 2: cvSets = crossValidate(DI, cvSize);
 3: for i = 1 to cvSize do
 4:   // obtain the internal training samples
 5:   D^i_T = getTrain(cvSets, i);
 6:   // obtain the internal test samples
 7:   D^i_t = getTest(cvSets, i);
 8:   // obtain samples of the minority class
 9:   D^i_minor = getMinoritySample(D^i_T);
10:   // obtain samples of the majority class
11:   D^i_major = getMajoritySample(D^i_T);
12:   // select a subset of samples from the majority class
13:   D^i_major' = optimizeMajoritySamples(D^i_major, D^i_minor, D^i_t);
14:   DB = DB ∪ (D^i_minor ∪ D^i_major');
15: end for
16: return DB;
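The control flow of Algorithm 1 can be sketched in Python as follows. This is only an illustration under stated assumptions: samples are referred to by index, the function names are hypothetical, and `select_majority_placeholder` is a random stand-in for the optimization step of line 13, not the PSO procedure of Section 3.

```python
import random


def select_majority_placeholder(major, minor, internal_test, rng):
    """Hypothetical stand-in for line 13: pick as many majority samples
    as there are minority samples, uniformly at random."""
    k = min(len(major), len(minor))
    return rng.sample(major, k)


def sample_subset_optimization(labels, minority_label, cv_size=2, seed=0,
                               select_majority=select_majority_placeholder):
    """Return the sorted indices of a roughly balanced subset (DB)."""
    rng = random.Random(seed)
    indices = list(range(len(labels)))
    rng.shuffle(indices)
    # Line 2: split the data into cv_size folds.
    folds = [indices[k::cv_size] for k in range(cv_size)]
    balanced = set()
    for i in range(cv_size):
        # Lines 5 and 7: fold i is the internal test set, the rest is training.
        internal_test = folds[i]
        internal_train = [j for k, fold in enumerate(folds) if k != i for j in fold]
        # Lines 9 and 11: separate minority and majority training samples.
        minor = [j for j in internal_train if labels[j] == minority_label]
        major = [j for j in internal_train if labels[j] != minority_label]
        # Line 13: select a subset of the majority samples.
        selected = select_majority(major, minor, internal_test, rng)
        # Line 14: DB = DB ∪ (minority ∪ selected majority).
        balanced |= set(minor) | set(selected)
    return sorted(balanced)


# Toy labels: 40 majority samples (0) and 8 minority samples (1).
labels = [0] * 40 + [1] * 8
subset = sample_subset_optimization(labels, minority_label=1)
print(len(subset), sum(labels[j] for j in subset))
```

With cvSize = 2, every sample belongs to exactly one internal training set, so the union in line 14 retains all minority samples.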

3 Sample subset optimization

The key function in Algorithm 1 is the optimization procedure applied to select a subset of samples from the majority class (Algorithm 1, line 13). The principal idea of the sample subset optimization procedure is to apply a cross validation procedure to form a subset in which each sample is selected according to the internal classification accuracy. In this section, we describe its formulation using a particle swarm optimization (PSO) algorithm [17], and analyze its behavior using a synthetic dataset. The base classifier and the fitness function used for optimization are discussed in Section 4.

3.1 Formulation of sample subset optimization

We formulate the sample subset optimization using a particle swarm optimization algorithm. In particular, a dimension in the particle space is assigned to each sample from the majority class. That is, for n majority samples, the particle is coded as an indicator function set $p = \{I_{x_1}, I_{x_2}, \ldots, I_{x_n}\}$. For each dimension, the indicator function $I_{x_j}$ takes the value “1” when the corresponding jth sample $x_j$ is included to train a classifier. Similarly, a “0” denotes that the corresponding sample is excluded from training. By optimizing a population of L particles $p_i$ (i = 1...L), the velocity $v_{i,j}(t)$ of the ith particle and its position $s_{i,j}(t)$ in the jth dimension of the solution space are updated in each iteration t as follows:

$$v_{i,j}(t+1) = w \cdot v_{i,j}(t) + c_1 r_1 \cdot (pbest_{i,j} - s_{i,j}(t)) + c_2 r_2 \cdot (gbest_{i,j} - s_{i,j}(t)) \tag{1}$$

$$s_{i,j}(t+1) = \begin{cases} 0 & \text{if } random() > S(v_{i,j}(t+1)) \\ 1 & \text{if } random() < S(v_{i,j}(t+1)) \end{cases} \tag{2}$$

$$S(v_{i,j}(t+1)) = \frac{1}{1 + e^{-v_{i,j}(t+1)}} \tag{3}$$

where $pbest_{i,j}$ and $gbest_{i,j}$ are the previous best position and the best position found by informants, respectively. $c_1$, $r_1$, $c_2$, and $r_2$ are the learning rates and social coefficients. random() is a random number generator with a uniform distribution over [0, 1].
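A single update of one particle according to Eqs. (1) to (3) might look like the sketch below. The values w = 0.7 and c1 = c2 = 1.5 are illustrative assumptions, as is drawing r1 and r2 uniformly from [0, 1] (a common convention in PSO); none of these settings is given in the text, and the function names are hypothetical.

```python
import math
import random


def sigmoid(v):
    """Eq. (3): squash a velocity into a probability."""
    return 1.0 / (1.0 + math.exp(-v))


def update_particle(position, velocity, pbest, gbest, rng,
                    w=0.7, c1=1.5, c2=1.5):
    """Apply Eqs. (1) and (2) to every dimension of one binary particle."""
    new_position, new_velocity = [], []
    for s, v, pb, gb in zip(position, velocity, pbest, gbest):
        r1, r2 = rng.random(), rng.random()  # assumed uniform in [0, 1]
        # Eq. (1): inertia term plus attraction to pbest and gbest.
        v_next = w * v + c1 * r1 * (pb - s) + c2 * r2 * (gb - s)
        # Eq. (2): include the sample with probability S(v_next).
        s_next = 1 if rng.random() < sigmoid(v_next) else 0
        new_velocity.append(v_next)
        new_position.append(s_next)
    return new_position, new_velocity


rng = random.Random(1)
position, velocity = update_particle(
    position=[0, 1, 0, 1], velocity=[0.0, 0.0, 0.0, 0.0],
    pbest=[1, 1, 0, 0], gbest=[1, 1, 1, 0], rng=rng)
print(position, velocity)
```

A dimension whose current bit already agrees with both pbest and gbest receives no pull, so its velocity only decays through the w term.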

Representing this optimization procedure in pseudocode, we obtain Algorithm 2. Note that the PSO algorithm produces multiple optimized sample subsets in parallel. Therefore, by specifying the popSize parameter, we can obtain any number of optimized sample subsets with a single execution of the algorithm.

Algorithm 2 optimizeMajoritySamples

Input: Majority samples Dmajor, Minority samples Dminor, Internal test samples Dt

Output: Optimized sample subsets D^{p_i}_major' (i = 1...L)

 1: popSize = L;
 2: initiateParticles(Dmajor, popSize);
 3: for t = 1 to termination do
 4:   // go through each particle in the population
 5:   for i = 1 to popSize do
 6:     // extract the samples according to the indicator function set
 7:     D^{p_i}_major' = extractSelectedSamples(p_i, Dmajor);
 8:     D^{p_i}_train = D^{p_i}_major' ∪ Dminor;
 9:     // train a classifier using selected majority samples and all minority samples
10:     h_i = trainClassifier(D^{p_i}_train);
11:     // calculate the fitness of the trained classifier using internal test samples
12:     fitness = calculateFitness(h_i, Dt);
13:     // update velocity (Eq. (1)) and position (Eq. (2)) according to fitness value
14:     v_i,j(t) = updateVelocity(v_i,j(t), fitness);
15:     s_i,j(t) = updatePosition(s_i,j(t), fitness);
16:   end for
17: end for
18: return D^{p_i}_major' (i = 1...L)
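The loop structure of Algorithm 2 can be sketched as below, with several assumptions stated up front. Classifier training and evaluation (lines 7 to 12) are hidden behind a caller-supplied `fitness_fn(mask)`; the global best across the population stands in for the “best position found by informants”; the returned subsets are each particle's best mask so far; and `toy_fitness`, the parameter values, and all names are hypothetical.

```python
import math
import random


def optimize_majority_samples(n_major, fitness_fn, pop_size=5, iterations=20,
                              seed=0, w=0.7, c1=1.5, c2=1.5):
    """Binary PSO over inclusion masks of the n_major majority samples."""
    rng = random.Random(seed)
    # Line 2: random initial masks and zero velocities.
    positions = [[rng.randint(0, 1) for _ in range(n_major)]
                 for _ in range(pop_size)]
    velocities = [[0.0] * n_major for _ in range(pop_size)]
    pbest = [p[:] for p in positions]
    pbest_fit = [float("-inf")] * pop_size
    gbest, gbest_fit = positions[0][:], float("-inf")
    for _ in range(iterations):                      # line 3
        for i in range(pop_size):                    # line 5
            fit = fitness_fn(positions[i])           # lines 7-12, abstracted
            if fit > pbest_fit[i]:
                pbest[i], pbest_fit[i] = positions[i][:], fit
            if fit > gbest_fit:
                gbest, gbest_fit = positions[i][:], fit
            for j in range(n_major):                 # lines 14-15
                r1, r2 = rng.random(), rng.random()
                v = (w * velocities[i][j]
                     + c1 * r1 * (pbest[i][j] - positions[i][j])
                     + c2 * r2 * (gbest[j] - positions[i][j]))
                velocities[i][j] = v
                prob = 1.0 / (1.0 + math.exp(-v))
                positions[i][j] = 1 if rng.random() < prob else 0
    return pbest, pbest_fit


def toy_fitness(mask):
    """Hypothetical fitness: agreement with an alternating target mask."""
    target = [1, 0] * (len(mask) // 2)
    return sum(a == b for a, b in zip(mask, target)) / len(mask)


masks, fits = optimize_majority_samples(n_major=12, fitness_fn=toy_fitness)
print(len(masks), max(fits))
```

In the setting described here, `fitness_fn` would train a classifier on the selected majority samples plus all minority samples and score it on the internal test samples.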

3.2 Analysis of behavior

We analyze the behavior of sample subset optimization using an imbalanced synthetic dataset. Each sample has two features, both generated from the same distribution. Specifically, 20 samples of the majority class are generated from a normal distribution N(5, 1) and 10 samples of the minority class are generated from a normal distribution N(7, 1). In addition, 5 “outlier” samples are introduced to the dataset. They are labeled as majority class, but are generated from the normal distribution of the minority class. The class ratio of the data is 25:10.
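A dataset of this shape could be generated as in the following sketch. The function name `make_synthetic`, the seed, and the label strings are assumptions; N(5, 1) is read as mean 5 with unit variance, so the standard deviation is also 1.

```python
import random


def make_synthetic(seed=0):
    """Return a list of ((feature1, feature2), label) pairs."""
    rng = random.Random(seed)

    def draw(mean, count, label):
        # Both features come from the same normal distribution.
        return [((rng.gauss(mean, 1.0), rng.gauss(mean, 1.0)), label)
                for _ in range(count)]

    majority = draw(5.0, 20, "majority")
    minority = draw(7.0, 10, "minority")
    # Outliers: drawn from the minority distribution, labeled as majority.
    outliers = draw(7.0, 5, "majority")
    return majority + outliers + minority


data = make_synthetic()
n_major = sum(1 for _, label in data if label == "majority")
n_minor = sum(1 for _, label in data if label == "minority")
print(n_major, n_minor)  # class ratio 25:10
```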

Figure 2(a) shows the original dataset and the resulting classification boundary of a linear SVM, and Figure 2(b) shows a dataset after applying sample subset optimization and the resulting classification boundary of a linear SVM. Note that this is one of the optimized datasets, which is used to train one base classifier. Our ensemble is the aggregation of multiple base classifiers trained on multiple optimized datasets. It is evident that the class ratio is more balanced after optimization (from 25:10 to 15:10). In addition, 3 of the 5 outlier samples are removed, and 7 redundant majority samples that have limited effect on the decision boundary of the linear SVM classifier are removed to correct the imbalanced class distribution.

[Figure 2 plots: two scatter panels, (a) original dataset and (b) dataset after optimization, each plotting Feature 1 against Feature 2 with the legend “Majority samples”, “Minority samples”, and “Linear SVM border”.]

Fig. 2. The green lines are the classification boundaries created using a linear SVM with (a) the original dataset and (b) the dataset after optimization.

4 Base classifier and fitness function

We select SVM as the base classifier for building the ensemble system. SVM is routinely applied to many challenging bioinformatics problems. The design of the fitness function is another important facet of sample subset optimization. It determines the quality of the base classifiers, and thus the performance of the ensemble. The following subsections describe these two components in detail.

4.1 Base classifier of support vector machine

SVM is a popular classification algorithm which has been widely used in many bioinformatics problems. Among different kernel choices, linear SVM with a soft margin is robust for large scale and high-dimensional dataset classification [18].

Let us denote each sample in the dataset as a vector x_i (i = 1...M), where M is the total number of samples, and y_i is the class label of sample x_i. Each component of x_i is a feature x_ij (j = 1...N), interpreted as the jth feature of the ith sample, where N is the dimension of the feature space. In our case, features could be GC-content, dinucleotide values, or other biological markers used to characterize each sample.

A linear SVM with a soft margin is trained by solving the following optimization problem:

$$\min_{w, b, \xi} \; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{M} \xi_i$$

$$\text{subject to: } y_i(\langle w, x_i \rangle + b) \geq 1 - \xi_i, \quad \xi_i \geq 0, \quad i = 1, \ldots, M$$

where w is the weight vector, $\xi_i$ are slack variables, and b is the bias. The constant C determines the trade-off between maximizing the margin and minimizing the amount of slack.
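To make the roles of the slack variables and C concrete, the sketch below evaluates the soft-margin objective for a fixed (w, b), taking each slack at its smallest feasible value, max(0, 1 − y_i(⟨w, x_i⟩ + b)). Labels in {−1, +1} are the usual SVM convention and are assumed here; the function name and the toy data are hypothetical.

```python
def soft_margin_objective(w, b, samples, labels, C=1.0):
    """Return (objective value, slacks) for a fixed weight vector and bias."""
    slacks = []
    for x, y in zip(samples, labels):
        # Functional margin y_i * (<w, x_i> + b) of this sample.
        margin = y * (sum(wk * xk for wk, xk in zip(w, x)) + b)
        # Smallest slack satisfying the constraint and slack >= 0.
        slacks.append(max(0.0, 1.0 - margin))
    objective = 0.5 * sum(wk * wk for wk in w) + C * sum(slacks)
    return objective, slacks


# Three toy samples; the third lies on the wrong side of the boundary.
X = [(2.0, 0.0), (0.0, 0.0), (1.5, 0.0)]
y = [1, -1, -1]
value, slacks = soft_margin_objective(w=(2.0, 0.0), b=-2.0,
                                      samples=X, labels=y, C=1.0)
print(value, slacks)  # 4.0 [0.0, 0.0, 2.0]
```

Only the misclassified third sample contributes slack, so a larger C would penalize this particular (w, b) more heavily.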

In this study, we utilize the implementation proposed by Hsieh et al. [19]. This is an implementation for fast and large scale linear SVM, which is especially suited as a base classifier for ensemble learning due to its computational efficiency.

Notice that classifiers are trained both for sample subset optimization and for composing the ensemble. However, these two procedures are independent of each other, and therefore the classifiers trained for sample subset optimization are not the classifiers used for the ensemble. The purpose of the classifiers trained in the sample subset optimization procedure is to provide fitness feedback on the selected samples, whereas the classifiers used for composing the ensemble are trained on the optimized sample subsets and serve as the base classifiers of the ensemble. To maximize the specificity of the feedback, the same classification algorithm, that is, linear SVM, is used for both procedures.

4.2 Fitness function

For building a classifier, a subset of samples from the majority class is selected according to an indicator function set $p_i$ (see Section 3.1), and combined with the samples from the minority class to form a training set $D^{p_i}_{train}$. The goodness of an indicator function set can be assessed by the performance of the classifier trained with the samples specified by it. For imbalanced data, one effective way to evaluate the performance of the classifier is to use the area under the ROC curve (AUC) metric [20]. Hence, we devise $AUC(h_i(D^{p_i}_{train}, D_{test}))$ as a component of the fitness function, where $D^{p_i}_{train}$ denotes the training set generated using $p_i$ and $D_{test}$ denotes the test data. Function AUC() calculates the AUC value of a classification model $h_i(D_a, D_b)$ which is trained on $D_a$ and evaluated on $D_b$.

Moreover, the size of the subset is also important because a small training set is likely to result in a poorly trained model with poor generalization. Therefore, the fitness function can be constructed by combining the two components:

$$fitness(p_i) = w_1 \cdot AUC(h_i(D^{p_i}_{train}, D_{test})) + w_2 \cdot Size(p_i) \tag{4}$$

where Size() determines the size of a subset (specified by $p_i$). Coefficients $w_1$ and $w_2$ are empirical constants which can be adjusted to alter the relative importance of each fitness component. The default values are $w_1 = 0.8$ and $w_2 = 0.2$ as they work well on a range of datasets.
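A minimal sketch of Eq. (4) follows. Two points are assumptions rather than details given in the text: Size() is taken to be the fraction of majority samples selected, so that both terms lie in [0, 1], and AUC is computed with the pairwise-comparison (Mann-Whitney) formulation on precomputed decision scores. All names are hypothetical.

```python
def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly;
    tied scores count one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y != 1]
    if not pos or not neg:
        raise ValueError("AUC needs at least one sample of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def fitness(mask, scores, labels, w1=0.8, w2=0.2):
    """Eq. (4) with the default weights w1 = 0.8 and w2 = 0.2."""
    size = sum(mask) / len(mask)  # assumed normalization of Size()
    return w1 * auc(scores, labels) + w2 * size


# Decision scores of a classifier on five internal test samples.
scores = [0.9, 0.8, 0.3, 0.6, 0.2]
labels = [1, 0, 0, 1, 0]
mask = [1, 0, 1, 0]  # two of four majority samples selected
print(auc(scores, labels), fitness(mask, scores, labels))
```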

5 Experimental results

In this section, we first describe four imbalanced biological datasets used in our experiment. They are generated from several important and diverse biological problems and represent different degrees of imbalanced class distribution. Next we present the performance results of our ensemble algorithm compared with six other algorithms using those datasets.

5.1 Datasets

We evaluated different algorithms using datasets generated for the identification of miRNA, the classification of protein localization sites, and the prediction of promoters (drosophila and human). Specifically, the miRNA identification dataset contains 691 positive samples and 9248 negative samples, each described by 21 features [21]. The protein localization dataset is generated from the study discussed in [22]. We attempted to differentiate membrane proteins (258) from the rest (1226). The human promoter dataset contains 471 promoter sequences and 5131 coding sequences (CDS) and intron sequences. Compared to the human promoter dataset, the drosophila promoter dataset has a relatively balanced class distribution with 1936 promoter sequences and 2722 CDS and intron sequences. We calculated the 16 dinucleotide features according to [23].

The datasets are summarized and organized according to class ratio in Table 1.

Table 1. Summary of biological datasets used for evaluation.

Dataset (short name)             # Samples   # Features   Minority vs. Majority
drosophila promoter (DroProm)    6594        16           0.4156 (≈ 1:2.5)
protein localization (ProtLoc)   1484        8            0.2104 (≈ 1:5)
human promoter (HuProm)          5602        16           0.0918 (≈ 1:10)
miRNA identification (miRNA)     9939        21           0.0747 (≈ 1:13)

5.2 Performance comparison

The performance of a single SVM classifier was used as the baseline for all datasets. We compared single classifier approaches, including random under-sampling with SVM (RUS-SVM), random over-sampling with SVM (ROS-SVM), and SMOTE sampling with SVM (SMOTE-SVM), with ensemble approaches, including boosting with base classifiers of SVM (Boost-SVMs), bagging with base classifiers of SVM (Bag-SVMs), and our sample subset optimization technique with SVM (SSO-SVMs).

[Figure 3 plots: four panels, (a) drosophila promoter, (b) protein localization, (c) human promoter, and (d) miRNA identification, each plotting Area Under ROC Curve against Number of Base Classifiers (10 to 100) for SSO-SVMs, Bag-SVMs, Boost-SVMs, Single-SVM, ROS-SVM, RUS-SVM, and SMOTE-SVM.]

Fig. 3. The comparison of different algorithms for data classification. The x-axis denotes the ensemble size and the y-axis denotes the AUC value. For those algorithms that use a single classifier, the same AUC value is plotted at different ensemble sizes for the purpose of comparison.

For the ensemble methods, we tested ensemble sizes from 10 to 100 with a step of 10. A 5-fold cross-validation procedure was applied to partition datasets for training and testing, and each algorithm was tested on the same partition to reduce evaluation variance. Among the six tested algorithms, four employed a randomization procedure. They are RUS-SVM, ROS-SVM, Bag-SVMs, and SSO-SVMs (note that the Boost-SVMs algorithm uses the reweighting implementation and is deterministic). For those with a randomization procedure, we repeated the test 10 times, each time with a different random seed.
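The evaluation grid described above can be sketched as follows. The partition here is a plain (unstratified) shuffle-and-split, which is an assumption, since the text does not say whether folds were stratified; the function name, the partition seed, and the use of the ProtLoc sample count are likewise illustrative.

```python
import random


def k_fold_partition(n_samples, k=5, seed=0):
    """Return k (train_indices, test_indices) pairs from one fixed shuffle."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for i in range(k):
        test = folds[i]
        train = [j for f_idx, fold in enumerate(folds) if f_idx != i for j in fold]
        splits.append((train, test))
    return splits


ensemble_sizes = list(range(10, 101, 10))  # 10, 20, ..., 100
random_seeds = list(range(10))             # 10 repetitions for randomized methods
# One partition, built once and reused for every algorithm.
splits = k_fold_partition(n_samples=1484, k=5, seed=42)
print(len(splits), len(ensemble_sizes), len(random_seeds))
```

Building the partition once and passing the same `splits` to every algorithm is what keeps the comparison on identical train/test folds.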

Figure 3 compares the results. It can be seen that in most cases ensemble approaches give higher AUC values than the single classifier approaches.

For single classifier approaches, random under-sampling, random over-sampling, and SMOTE sampling do improve the classification results when the analyzed dataset has a highly imbalanced class distribution, as in Figure 3(b)(c)(d). However, the improvements become less significant when the imbalance is moderate (drosophila promoter dataset in Figure 3(a)). SMOTE sampling performs better than the random under-sampling and over-sampling approaches in the case of protein localization (Figure 3(b)). However, the performance gain is marginal in the other three datasets (Figure 3(a)(c)(d)). We do not observe a significant difference in performance between random under-sampling and random over-sampling, except in the case of miRNA identification (Figure 3(d)), where random over-sampling is relatively better than random under-sampling.

For ensemble approaches, Boost-SVMs performs surprisingly worse than the other two approaches in most cases, and its performance fluctuates among different ensemble sizes. This may be caused by its training process, in that the boosting algorithm assigns increasingly more classification weight to the most “difficult” samples in each iteration. However, those “difficult” samples could be outliers and have a deleterious effect when the classifiers pay too much attention to classifying them while ignoring other more representative samples. In this regard, Bag-SVMs and SSO-SVMs appear to be the better approaches. However, SSO-SVMs performs best in almost every case and generates a much smaller performance variance when different random seeds are used. It is likely that SSO-SVMs can capture the most representative samples from the training set, which gives better generalization on unseen data. We also observe that the improvement is more significant when the dataset has a highly imbalanced class distribution (Figure 3(b)(c)(d)).

Table 2. The comparison of different algorithms for data classification according to AUC value. The values for ensemble approaches are averaged across different ensemble sizes.

Algorithm     DroProm   ProtLoc   HuProm   miRNA
Single-SVM    0.6584    0.8296    0.5740   0.7542
RUS-SVM       0.6584    0.8850    0.6016   0.7644
ROS-SVM       0.6555    0.8866    0.5986   0.8114
SMOTE-SVM     0.6400    0.8976    0.5961   0.7924
Boost-SVMs    0.7756    0.8852    0.6644   0.8891
Bag-SVMs      0.8507    0.8671    0.7264   0.9198
SSO-SVMs      0.8520    0.9098    0.7718   0.9419

Table 3. P-values from a one-tailed Student's t-test comparing the performance difference.

Algorithm                  DroProm    ProtLoc    HuProm     miRNA
SSO-SVMs vs. Single-SVM    2×10⁻¹⁵    4×10⁻¹⁸    1×10⁻¹¹    1×10⁻¹⁴
SSO-SVMs vs. RUS-SVM       2×10⁻¹⁵    1×10⁻¹³    4×10⁻¹¹    2×10⁻¹⁴
SSO-SVMs vs. ROS-SVM       2×10⁻¹⁵    2×10⁻¹³    4×10⁻¹¹    3×10⁻¹³
SSO-SVMs vs. SMOTE-SVM     8×10⁻¹⁶    8×10⁻¹¹    3×10⁻¹¹    9×10⁻¹⁴
SSO-SVMs vs. Boost-SVMs    2×10⁻⁸     8×10⁻⁷     7×10⁻⁶     2×10⁻⁵
SSO-SVMs vs. Bag-SVMs      6×10⁻⁴     7×10⁻¹¹    1×10⁻⁶     2×10⁻³

Table 2 shows the AUC values of both single classifier and ensemble approaches. For the ensemble approaches, the AUC value is the average of those given by the ensemble sizes from 10 to 100. The proposed SSO-SVMs performs the best in all four tested datasets. Compared with the baseline of a single SVM, these results correspond to improvements of roughly 8 to 20 AUC percentage points (from 0.0802 on ProtLoc to 0.1978 on HuProm). To confirm that the improvements are statistically significant, we applied a one-tailed Student's t-test and compared SSO-SVMs with the other six methods. Table 3 shows the p-values of the comparison. In all four datasets, the performance of SSO-SVMs is significantly better than that of the other six methods, with a p-value smaller than 0.05. These results confirm the effectiveness of the proposed ensemble approach.
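The size of the improvement over the single-SVM baseline can be checked directly from the Table 2 values. The short sketch below computes both the absolute AUC gain and the gain relative to the baseline for each dataset; the variable names are illustrative.

```python
# AUC values copied from Table 2.
single_svm = {"DroProm": 0.6584, "ProtLoc": 0.8296, "HuProm": 0.5740, "miRNA": 0.7542}
sso_svms = {"DroProm": 0.8520, "ProtLoc": 0.9098, "HuProm": 0.7718, "miRNA": 0.9419}

gains = {}
for name, baseline in single_svm.items():
    absolute = sso_svms[name] - baseline  # difference in AUC
    relative = absolute / baseline        # gain as a fraction of the baseline
    gains[name] = (round(absolute, 4), round(relative, 3))
    print(f"{name}: absolute {absolute:.4f}, relative {relative:.1%}")
```

The absolute gains fall between about 0.08 (ProtLoc) and 0.20 (HuProm); expressed relative to the baseline AUC, the same gains are larger for the datasets with low baseline values.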

6 Conclusion

In this paper we introduced a sample subset optimization technique for selecting optimal sample subsets from training data. We integrated this technique into an ensemble learning framework and created an ensemble of SVMs specifically for imbalanced biological data classification. The proposed algorithm was applied to several bioinformatics tasks with moderately and highly imbalanced class distributions. According to our experimental results, (1) the approaches based on data sampling for a single SVM are generally less effective than the ensemble approaches; (2) the proposed sample subset optimization technique appears to be very effective, and the ensemble optimized by this technique produced the best classification results in terms of AUC value for all evaluation datasets.

References

1. Meyer, I.: A practical guide to the art of RNA gene prediction. Briefings in Bioinformatics 8(6) (2007) 396–414

2. Zeng, J., Zhu, S., Yan, H.: Towards accurate human promoter recognition: a review of currently used sequence features and classification methods. Briefings in Bioinformatics 10(5) (2009) 498–508

3. Sonnenburg, S., Schweikert, G., Philips, P., Behr, J., Rätsch, G.: Accurate splice site prediction using support vector machines. BMC Bioinformatics 8(Suppl 10) (2007) S7

4. Hua, S., Sun, Z.: Support vector machine approach for protein subcellular localization prediction. Bioinformatics 17(8) (2001) 721–728

5. Akbani, R., Kwek, S., Japkowicz, N.: Applying Support Vector Machines to Imbalanced Datasets. In: Proceedings of the 15th European Conference on Machine Learning. (2004) 39–50

6. Liu, Y., An, A., Huang, X.: Boosting prediction accuracy on imbalanced datasets with SVM ensembles. In: Proceedings of the 10th Pacific-Asia Conference on Knowledge Discovery and Data Mining. (2006) 107–118

7. Japkowicz, N., Stephen, S.: The class imbalance problem: A systematic study. Intelligent Data Analysis 6(5) (2002) 429–449

8. Batuwita, R., Palade, V.: A New Performance Measure for Class Imbalance Learning. Application to Bioinformatics Problems. In: 2009 International Conference on Machine Learning and Applications, IEEE (2009) 545–550

9. Chawla, N., Japkowicz, N., Kotcz, A.: Editorial: special issue on learning from imbalanced data sets. ACM SIGKDD Explorations Newsletter 6 (2004) 1–6

10. Chawla, N., Bowyer, K., Hall, L., Kegelmeyer, W.: SMOTE: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research 16(1) (2002) 321–357

11. Weiss, G.: Mining with rarity: a unifying framework. ACM SIGKDD Explorations Newsletter 6(1) (2004) 7–19

12. Hido, S., Kashima, H., Takahashi, Y.: Roughly balanced bagging for imbalanced data. Statistical Analysis and Data Mining 2(5-6) (2009) 412–426

13. Breiman, L.: Bagging predictors. Machine Learning 24(2) (1996) 123–140

14. Schapire, R., Freund, Y., Bartlett, P., Lee, W.: Boosting the margin: A new explanation for the effectiveness of voting methods. The Annals of Statistics 26(5) (1998) 1651–1686

15. Tax, D., Van Breukelen, M., Duin, R.: Combining multiple classifiers by averaging or by multiplying? Pattern Recognition 33(9) (2000) 1475–1485

16. Lam, L., Suen, S.: Application of majority voting to pattern recognition: an analysis of its behavior and performance. IEEE Transactions on Systems, Man, and Cybernetics, Part A: Systems and Humans 27(5) (1997) 553–568

17. Poli, R., Kennedy, J., Blackwell, T.: Particle swarm optimization. Swarm Intelligence 1(1) (2007) 33–57

18. Ben-Hur, A., Ong, C., Sonnenburg, S., Schölkopf, B., Rätsch, G.: Support vector machines and kernels for computational biology. PLoS Computational Biology 4(10) (2008)

19. Hsieh, C., Chang, K., Lin, C., Keerthi, S., Sundararajan, S.: A dual coordinate descent method for large-scale linear SVM. In: Proceedings of the 25th International Conference on Machine Learning, ACM (2008) 408–415

20. Fawcett, T.: An introduction to ROC analysis. Pattern Recognition Letters 27(8) (2006) 861–874

21. Batuwita, R., Palade, V.: microPred: effective classification of pre-miRNAs for human miRNA gene prediction. Bioinformatics 25(8) (2009) 989–995

22. Horton, P., Nakai, K.: A probabilistic classification system for predicting the cellular localization sites of proteins. In: Proceedings of the Fourth International Conference on Intelligent Systems for Molecular Biology, AAAI Press (1996) 109–115

23. Rani, T., Bhavani, S., Bapi, R.: Analysis of E. coli promoter recognition problem in dinucleotide feature space. Bioinformatics 23(5) (2007) 582–588