imbalanced biological data
Pengyi Yang1,2,3, Zili Zhang4,5⋆, Bing B. Zhou1,3 and Albert Y. Zomaya1,3
1 School of Information Technologies, University of Sydney, NSW 2006, Australia
2 NICTA, Australian Technology Park, Eveleigh, NSW 2015, Australia
3 Centre for Distributed and High Performance Computing, University of Sydney NSW 2006, Australia
4 Faculty of Computer and Information Science, Southwest University CQ 400715, China
5 School of Information Technology, Deakin University, VIC 3217, Australia [email protected]; [email protected]
Abstract. Data in many biological problems are often compounded by imbalanced class distribution. That is, the positive examples may be largely outnumbered by the negative examples. Many classification algorithms such as support vector machine (SVM) are sensitive to data with imbalanced class distribution, and result in suboptimal classification. It is desirable to compensate for the imbalance effect in model training for more accurate classification. In this study, we propose a sample subset optimization technique for classifying biological data with moderately and extremely imbalanced class distributions. By using this optimization technique with an ensemble of SVMs, we build multiple roughly balanced SVM base classifiers, each trained on an optimized sample subset. The experimental results demonstrate that the ensemble of SVMs created by our sample subset optimization technique can achieve a higher area under the ROC curve (AUC) value than popular sampling approaches such as random over-/under-sampling and SMOTE sampling, and than widely used ensemble approaches such as bagging and boosting.
1 Introduction
Modern molecular biology is rapidly advanced by the increasing use of computational techniques. For tasks such as RNA gene prediction [1], promoter recognition [2], splice site identification [3], and the classification of protein localization sites [4], it is often necessary to address the problem of imbalanced class distribution because the datasets extracted from those biological systems are likely to contain a large number of negative examples (referred to as majority class) and a small number of positive examples (referred to as minority class). Many popular classification algorithms such as support vector machine (SVM) have been applied to a large variety of bioinformatics problems including those mentioned above (e.g. refs. [1, 3, 4]). However, most of these algorithms are sensitive to the imbalanced class distribution and may not perform well if applied directly to the imbalanced data [5, 6].
⋆ Corresponding author
Sampling is a popular approach to addressing the imbalanced class distribution [7]. Simple methods such as random under-sampling and random over-sampling are routinely applied in many bioinformatics studies [8]. With random under-sampling, the size of the majority class is reduced to compensate for the imbalance, whereas with random over-sampling, the size of the minority class is increased to compensate for the imbalance. Although they are straightforward and computationally efficient, these two methods are prone to either increased noise and duplicated samples (over-sampling) or the removal of informative samples (under-sampling) [9]. A more sophisticated approach known as SMOTE is to synthesize “new” samples using original samples in the dataset [10]. However, many bioinformatics problems present several thousands of samples with a highly imbalanced class distribution. Applying SMOTE will introduce a large number of synthetic samples, which may increase the data noise substantially. Alternatively, a cost-metric can be specified to force the classifier to pay more attention to the minority class [11]. This requires choosing an appropriate cost-metric, which is often unknown a priori.
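The two simple sampling schemes described above can be sketched as follows. This is a minimal illustration, not the implementation evaluated in this paper; the function names, the seeds, and the integer stand-ins for samples are assumptions made for the example.

```python
import random

def random_under_sample(majority, minority, seed=0):
    """Shrink the majority class to the size of the minority class."""
    rng = random.Random(seed)
    kept = rng.sample(majority, len(minority))  # sampling without replacement
    return kept, list(minority)

def random_over_sample(majority, minority, seed=0):
    """Grow the minority class to the size of the majority class by duplication."""
    rng = random.Random(seed)
    n_extra = len(majority) - len(minority)
    extra = [rng.choice(minority) for _ in range(n_extra)]  # duplicated samples
    return list(majority), list(minority) + extra

majority = list(range(100))        # 100 hypothetical negative samples
minority = list(range(100, 110))   # 10 hypothetical positive samples
under_major, under_minor = random_under_sample(majority, minority)
over_major, over_minor = random_over_sample(majority, minority)
print(len(under_major), len(under_minor))  # 10 10
print(len(over_major), len(over_minor))    # 100 100
```

Under-sampling discards 90 of the 100 majority samples here, while over-sampling adds 90 duplicates of minority samples, which illustrates the two failure modes noted in the text.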
Several recent studies found that ensemble learning could improve the performance of a single classifier in imbalanced data classification [6, 12]. In this study, we explore along this direction. In particular, we introduce a sample subset optimization technique for ‘intelligent under-sampling’ in imbalanced data classification. Using this technique, we designed an ensemble of SVMs specifically for learning from imbalanced biological datasets. This system has several advantages over the conventional ones:
– It creates each base classifier using a roughly balanced training subset with a built-in intelligent under-sampling. This is important in learning from imbalanced data because it reduces the risk of bias towards one class while neglecting the other one.
– The system embraces an ensemble framework in which multiple roughly balanced training subsets are created to train an ensemble of classifiers. Thus, it reduces the risk of removing informative samples from the majority class, which may occur when a simple under-sampling technique is applied.
– As opposed to random sampling, the sample subset optimization technique is applied to identify optimal sample subsets. This may improve the quality of the base classifiers and result in a more accurate ensemble.
– The aforementioned biological problems often present several thousands of training samples. The proposed technique is essentially an under-sampling approach. It can avoid the introduction of data noise and the generated data subsets may be more efficient for classifier training.
The rest of the paper discusses the details of the proposed sample subset optimization technique and the associated ensemble learning system. Section 2 presents the ensemble learning system. Section 3 describes the main idea of sample subset optimization. The base classifier and fitness function of the ensemble system are described in Section 4. Comparisons with typical sampling and ensemble methods are given in Section 5. Section 6 concludes the paper.
2 Ensemble system
Ensemble learning is an effective approach for improving the prediction accuracy of a single classification algorithm. Such an improvement is commonly achieved by using multiple classifiers (known as the base classifiers) each trained on a subset of samples created by random sampling such as those used in bagging [13], or cost-sensitive sampling such as those used in boosting [14]. The base classifiers are typically combined using an integration function such as averaging [15] or majority voting [16].
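[Figure 1: the training set (m minority samples, n majority samples) is passed through the step “Optimize samples from majority class” to produce optimized training subsets (m minority samples with n1, n2, ..., nL majority samples), which train base classifiers c1, c2, ..., cL. The base classifiers are combined by majority voting and applied to the test set (m', n') to give the prediction AUC value.]
Fig. 1. A schematic representation of the proposed ensemble system.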
We propose an ensemble learning system specifically designed for imbalanced biological data classification. The schematic representation of the proposed system is shown in Figure 1. It has three main components: sample subset optimization, base classifier, and fitness function. The key of this ensemble system is the application of the sample subset optimization technique (to be described in Section 3).
Suppose that a highly imbalanced dataset contains $n$ samples from the majority class and $m$ samples from the minority class, where $n \gg m$. The system creates each sample subset by including all $m$ minority samples and selecting a subset of samples from the $n$ majority samples according to an internal optimization procedure. This procedure is conducted to generate multiple optimized sample subsets, each being a roughly balanced subset containing $m$ minority samples and $n_i$ carefully selected majority samples, where $n_i \ll n$ ($i = 1...L$) and $L$ is the total number of optimized sample subsets. Using those optimized sample subsets, we can obtain a group of base classifiers $c_i$ ($i = 1...L$), each being trained on its corresponding sample subset $\{m + n_i\}$. The base classifiers are then combined using majority voting to form an ensemble of classifiers.
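The combination step can be illustrated with a short sketch of majority voting over the label predictions of L base classifiers. The function name and the example predictions are hypothetical; with an odd L and binary labels no ties can occur, and for even L this sketch resolves ties in favour of the label seen first, which is an assumption rather than something specified in the paper.

```python
from collections import Counter

def majority_vote(predictions):
    """Combine per-classifier label lists into one label per sample."""
    combined = []
    for votes in zip(*predictions):  # votes for one sample across all classifiers
        combined.append(Counter(votes).most_common(1)[0][0])
    return combined

# Rows: hypothetical predictions of three base classifiers on four test samples.
preds = [
    [1, 0, 0, 1],
    [1, 1, 0, 0],
    [0, 1, 0, 1],
]
ensemble_pred = majority_vote(preds)
print(ensemble_pred)  # [1, 1, 0, 1]
```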
Algorithm 1 summarizes the procedure. A line starting with “//” in the algorithm is a comment for its adjacent next line.
Algorithm 1 sampleSubsetOptimization
Input: Imbalanced dataset D_I
Output: Roughly balanced dataset D_B
1: cvSize = 2;
2: cvSets = crossValidate(D_I, cvSize);
3: for i = 1 to cvSize do
4:   // obtain the internal training samples
5:   D^i_T = getTrain(cvSets, i);
6:   // obtain the internal test samples
7:   D^i_t = getTest(cvSets, i);
8:   // obtain samples of the minority class
9:   D^i_minor = getMinoritySample(D^i_T);
10:   // obtain samples of the majority class
11:   D^i_major = getMajoritySample(D^i_T);
12:   // select a subset of samples from the majority class
13:   D^i_major' = optimizeMajoritySample(D^i_major, D^i_minor, D^i_t);
14:   D_B = D_B ∪ (D^i_minor ∪ D^i_major');
15: end for
16: return D_B;
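The control flow of Algorithm 1 can be sketched in Python as below. This is a hedged illustration only: samples are represented by their indices, the split is a plain (non-stratified) shuffle, and `random_stand_in` is a hypothetical placeholder that keeps as many random majority samples as there are minority samples. It is not the optimization procedure of Section 3, which would be plugged in through the `optimize` argument.

```python
import random

def sample_subset_optimization(labels, optimize, cv_size=2, seed=0):
    """Sketch of Algorithm 1 on sample indices; label 1 = minority, 0 = majority."""
    rng = random.Random(seed)
    idx = list(range(len(labels)))
    rng.shuffle(idx)
    folds = [idx[k::cv_size] for k in range(cv_size)]  # assumed non-stratified split
    balanced = set()                                   # D_B as a set of indices
    for i in range(cv_size):
        test = folds[i]                                # internal test samples
        train = [j for k in range(cv_size) if k != i for j in folds[k]]
        minor = [j for j in train if labels[j] == 1]
        major = [j for j in train if labels[j] == 0]
        kept = optimize(major, minor, test)            # line 13 of Algorithm 1
        balanced |= set(minor) | set(kept)             # line 14 of Algorithm 1
    return sorted(balanced)

_stand_in_rng = random.Random(1)

def random_stand_in(major, minor, test):
    """Placeholder optimizer (NOT the PSO step): random balanced selection."""
    return _stand_in_rng.sample(major, min(len(minor), len(major)))

labels = [1] * 10 + [0] * 90                           # hypothetical 10:90 dataset
subset = sample_subset_optimization(labels, random_stand_in)
n_minor = sum(labels[j] == 1 for j in subset)
n_major = len(subset) - n_minor
print(n_minor, n_major)  # 10 10
```

With two folds every sample appears in exactly one internal training set, so all minority samples end up in the returned subset together with an equal number of selected majority samples.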
3 Sample subset optimization
The key function in Algorithm 1 is the optimization procedure applied to select a subset of samples from the majority class (Algorithm 1, line 13). The principal idea of the sample subset optimization procedure is to apply a cross validation procedure to form a subset in which each sample is selected according to the internal classification accuracy. In this section, we describe its formulation using a particle swarm optimization (PSO) algorithm [17], and analyze its behavior using a synthetic dataset. The base classifier and the fitness function used for optimization are discussed in Section 4.
3.1 Formulation of sample subset optimization
We formulate the sample subset optimization using a particle swarm optimization algorithm. In particular, for each sample from the majority class a dimension in the particle space is assigned. That is, for $n$ majority samples, the particle is coded as an indicator function set $p = \{I_{x_1}, I_{x_2}, ..., I_{x_n}\}$. For each dimension, an indicator function $I_{x_j}$ takes value “1” when the corresponding $j$th sample $x_j$ is included to train a classifier. Similarly, a “0” denotes that the corresponding sample is excluded from training. By optimizing a population of $L$ particles $p_i$ ($i = 1...L$), the velocity of the $i$th particle $v_{i,j}(t)$ and the position of this particle $s_{i,j}(t)$ in the $j$th dimension of the solution space are updated in each iteration $t$ as follows:
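$$ v_{i,j}(t+1) = w \cdot v_{i,j}(t) + c_1 r_1 \cdot (pbest_{i,j} - s_{i,j}(t)) + c_2 r_2 \cdot (gbest_{i,j} - s_{i,j}(t)) \qquad (1) $$

$$ s_{i,j}(t+1) = \begin{cases} 0 & \text{if } random() > S(v_{i,j}(t+1)) \\ 1 & \text{if } random() < S(v_{i,j}(t+1)) \end{cases} \qquad (2) $$

$$ S(v_{i,j}(t+1)) = \frac{1}{1 + e^{-v_{i,j}(t+1)}} \qquad (3) $$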
where $pbest_{i,j}$ and $gbest_{i,j}$ are the previous best position and the best position found by informants, respectively; $w$ is the inertia weight; $c_1$, $r_1$, $c_2$, and $r_2$ are the learning rates and social coefficients; and random() is a random number generator with a uniform distribution on [0, 1].
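A single application of Eqs. (1)-(3) to one particle might look like the sketch below. The parameter values (w = 0.7, c1 = c2 = 1.5) are illustrative assumptions, not settings reported in this paper, and r1 and r2 are drawn here as fresh uniform random numbers per dimension, as is common in PSO implementations.

```python
import math
import random

def sigmoid(v):
    """Eq. (3), written to avoid overflow for large |v|."""
    if v >= 0:
        return 1.0 / (1.0 + math.exp(-v))
    e = math.exp(v)
    return e / (1.0 + e)

def update_particle(position, velocity, pbest, gbest, rng, w=0.7, c1=1.5, c2=1.5):
    """Apply Eqs. (1)-(3) to every dimension of one particle."""
    new_pos, new_vel = [], []
    for s, v, pb, gb in zip(position, velocity, pbest, gbest):
        r1, r2 = rng.random(), rng.random()   # assumed uniform on [0, 1)
        v_next = w * v + c1 * r1 * (pb - s) + c2 * r2 * (gb - s)   # Eq. (1)
        s_next = 1 if rng.random() < sigmoid(v_next) else 0        # Eq. (2)
        new_vel.append(v_next)
        new_pos.append(s_next)
    return new_pos, new_vel

rng = random.Random(0)
position = [1, 0, 1, 0]       # current inclusion mask over four majority samples
velocity = [0.0, 0.0, 0.0, 0.0]
pbest = [1, 1, 0, 0]          # hypothetical personal best mask
gbest = [1, 1, 1, 0]          # hypothetical best mask among informants
new_pos, new_vel = update_particle(position, velocity, pbest, gbest, rng)
print(new_pos, new_vel)
```

A dimension whose current value already agrees with both best positions receives zero velocity, so its sample is then kept or dropped with probability 0.5; disagreement pushes the velocity, and hence the inclusion probability, towards the best-known setting.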
Representing this optimization procedure in pseudocode, we obtain Algorithm 2. Note that the PSO algorithm produces multiple optimized sample subsets in parallel. Therefore, by specifying the popSize parameter, we can obtain any number of optimized sample subsets with a single execution of the algorithm.
Algorithm 2 optimizeMajoritySamples
Input: Majority samples D_major, Minority samples D_minor, Internal test samples D_t
Output: Optimized sample subsets D^{p_i}_major' (i = 1...L)
1: popSize = L;
2: initiateParticles(D_major, popSize);
3: for t = 1 to termination do
4:   // go through each particle in the population
5:   for i = 1 to popSize do
6:     // extract the samples according to the indicator function set
7:     D^{p_i}_major' = extractSelectedSamples(p_i, D_major);
8:     D^{p_i}_train = D^{p_i}_major' ∪ D_minor;
9:     // train a classifier using selected majority samples and all minority samples
10:     h_i = trainClassifier(D^{p_i}_train);
11:     // calculate the fitness of the trained classifier using internal test samples
12:     fitness = calculateFitness(h_i, D_t);
13:     // update velocity (Eq. (1)) and position (Eq. (2)) according to fitness value
14:     v_{i,j}(t) = updateVelocity(v_{i,j}(t), fitness);
15:     s_{i,j}(t) = updatePosition(s_{i,j}(t), fitness);
16:   end for
17: end for
18: return D^{p_i}_major' (i = 1...L)
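The loop structure of Algorithm 2 can be sketched as a small binary PSO over inclusion masks. Several details here are assumptions made for the sake of a runnable example: the classifier training and AUC evaluation are replaced by a user-supplied `fitness(mask)` callable, the informants are taken to be the whole swarm (a global-best topology), each particle's best mask is returned as its optimized subset, and the parameter defaults are illustrative.

```python
import math
import random

def optimize_majority_samples(n_major, fitness, pop_size=5, iterations=20,
                              seed=0, w=0.7, c1=1.5, c2=1.5):
    """Binary PSO sketch; returns one best mask (and its fitness) per particle."""
    rng = random.Random(seed)
    pos = [[rng.randint(0, 1) for _ in range(n_major)] for _ in range(pop_size)]
    vel = [[0.0] * n_major for _ in range(pop_size)]
    pbest = [p[:] for p in pos]
    pbest_fit = [fitness(p) for p in pos]
    for _ in range(iterations):
        best = max(range(pop_size), key=lambda i: pbest_fit[i])
        gbest = pbest[best][:]                 # assumed global-best informants
        for i in range(pop_size):
            for j in range(n_major):
                v = (w * vel[i][j]
                     + c1 * rng.random() * (pbest[i][j] - pos[i][j])
                     + c2 * rng.random() * (gbest[j] - pos[i][j]))
                vel[i][j] = v                                    # Eq. (1)
                prob = 1.0 / (1.0 + math.exp(-v))                # Eq. (3)
                pos[i][j] = 1 if rng.random() < prob else 0      # Eq. (2)
            fit = fitness(pos[i])              # stands in for train + evaluate
            if fit > pbest_fit[i]:
                pbest[i], pbest_fit[i] = pos[i][:], fit
    return pbest, pbest_fit

# Toy fitness: reward larger subsets but penalise three hypothetical outliers.
outliers = {0, 1, 2}

def toy_fitness(mask):
    penalty = sum(mask[j] for j in outliers)
    return sum(mask) / len(mask) - penalty

masks, fits = optimize_majority_samples(12, toy_fitness)
print(len(masks), max(fits))
```

Because all particles are evolved together, a single run yields `pop_size` candidate subsets, which matches the observation above that the number of optimized subsets is controlled by the popSize parameter.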
3.2 Analysis of behavior
We analyze the behavior of sample subset optimization using an imbalanced synthetic dataset. Each sample has two features, both generated from the same distribution. Specifically, 20 samples of the majority class are generated from a normal distribution $N(5, 1)$ and 10 samples of the minority class are generated from a normal distribution $N(7, 1)$. In addition, 5 “outlier” samples are introduced to the dataset. They are labeled as majority class, but are generated from the normal distribution of the minority class. The class ratio of the data is therefore 25:10.
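A dataset with the same composition can be generated with a few lines of Python. The seed and the label encoding (0 for majority, 1 for minority) are arbitrary choices for this sketch, so the resulting points will not match those plotted in Figure 2.

```python
import random

def make_synthetic(seed=0):
    """Return (samples, labels): 25 majority-labelled and 10 minority samples."""
    rng = random.Random(seed)

    def draw(mean, count):
        # Both features come from the same normal distribution N(mean, 1).
        return [(rng.gauss(mean, 1.0), rng.gauss(mean, 1.0)) for _ in range(count)]

    majority = draw(5.0, 20)   # regular majority samples
    outliers = draw(7.0, 5)    # drawn like the minority class, labelled majority
    minority = draw(7.0, 10)
    samples = majority + outliers + minority
    labels = [0] * 25 + [1] * 10   # 0 = majority, 1 = minority
    return samples, labels

samples, labels = make_synthetic()
print(labels.count(0), labels.count(1))  # 25 10
```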
Figure 2(a) shows the original dataset and the resulting classification boundary of a linear SVM, and Figure 2(b) shows a dataset after applying sample subset optimization and the resulting classification boundary of a linear SVM. Note that this is one of the optimized datasets, which is used to train one base classifier. Our ensemble is the aggregation of multiple base classifiers trained on multiple optimized datasets. It is evident that the class ratio is more balanced after optimization (from 25:10 to 15:10). In addition, 3 of the 5 outlier samples are removed, and 7 redundant majority samples, which have limited effect on the decision boundary of the linear SVM classifier, are removed to correct the imbalanced class distribution.
[Figure 2: two scatter plots, (a) original dataset and (b) dataset after optimization. x-axis: Feature 1; y-axis: Feature 2. Legend: majority samples, minority samples, linear SVM border.]
Fig. 2. The green lines are the classification boundaries created using a linear SVM with (a) the original dataset and (b) the dataset after optimization.
4 Base classifier and fitness function
We select SVM as the base classifier for building the ensemble system. SVM is routinely applied to many challenging bioinformatics problems. The design of the fitness function is another important facet of sample subset optimization. It determines the quality of the base classifiers, and thus the performance of the ensemble. The following subsections describe these two components in detail.
4.1 Base classifier of support vector machine
SVM is a popular classification algorithm which has been widely used in many bioinformatics problems. Among different kernel choices, linear SVM with a soft margin is robust for large scale and high-dimensional dataset classification [18].
Let us denote each sample in the dataset as a vector $x_i$ ($i = 1...M$) where $M$ is the total number of samples, and $y_i$ is the class label of sample $x_i$. Each component in $x_i$ is a feature $x_{ij}$ ($j = 1...N$) interpreted as the $j$th feature of the $i$th sample, where $N$ is the dimension of the feature space. In our case, features could be GC-content, dinucleotide values, or other biological markers used to characterize each sample.
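A linear SVM with a soft margin is trained by solving the following optimization problem:

$$ \min_{w, b, \xi} \; \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{M} \xi_i $$

$$ \text{subject to: } y_i (\langle w, x_i \rangle + b) \ge 1 - \xi_i, \quad \xi_i \ge 0 $$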
where $w$ is the weight vector, $\xi_i$ are slack variables, and $b$ is the bias. The constant $C$ determines the trade-off between maximizing the margin and minimizing the amount of slack.
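To make the roles of the margin term and the slack term concrete, the sketch below evaluates the soft-margin objective for a given hyperplane, setting each slack variable to the smallest value that satisfies its constraint. It only evaluates the objective and does not train an SVM; the function name and the toy data are assumptions, and labels are encoded as +1 and -1.

```python
def soft_margin_objective(w, b, X, y, C):
    """Return (objective, slacks) for hyperplane (w, b) with minimal feasible slacks."""
    slacks = []
    for x_i, y_i in zip(X, y):
        margin = y_i * (sum(w_k * x_k for w_k, x_k in zip(w, x_i)) + b)
        slacks.append(max(0.0, 1.0 - margin))   # xi_i >= 1 - margin and xi_i >= 0
    objective = 0.5 * sum(w_k * w_k for w_k in w) + C * sum(slacks)
    return objective, slacks

X = [(2.0, 0.0), (0.0, 0.0), (1.0, 0.0)]   # three toy samples with two features
y = [1, -1, -1]                            # labels in {+1, -1}
obj, slacks = soft_margin_objective(w=(2.0, 0.0), b=-2.0, X=X, y=y, C=1.0)
print(obj, slacks)  # 3.0 [0.0, 0.0, 1.0]
```

The third sample lies exactly on the decision boundary, so it needs a slack of 1; increasing C would penalise that violation more heavily relative to the margin term.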
In this study, we utilize the implementation proposed by Hsieh et al. [19]. This is an implementation for fast and large scale linear SVM, which is especially suited as a base classifier for ensemble learning due to its computational efficiency.
Notice that classifiers are trained both for sample subset optimization and for composing the ensemble. However, these two procedures are independent of each other, and therefore the classifiers trained for sample subset optimization are not the classifiers used in the ensemble. The purpose of the classifiers trained in the sample subset optimization procedure is to provide fitness feedback on the selected samples, whereas the classifiers used for composing the ensemble are trained on the optimized sample subsets and serve as the base classifiers of the ensemble. To maximize the specificity of the feedback, the same classification algorithm, that is, linear SVM, is used for both procedures.
4.2 Fitness function
For building a classifier, a subset of samples from the majority class is selected according to an indicator function set $p_i$ (see Section 3.1), and combined with the samples from the minority class to form a training set $D^{p_i}_{train}$. The goodness of an indicator function set can be assessed by the performance of the classifier trained with the samples specified by it. For imbalanced data, one effective way to evaluate the performance of the classifier is to use the area under the ROC curve metric [20]. Hence, we devise $AUC(h_i(D^{p_i}_{train}, D_{test}))$ as a component of the fitness function, where $D^{p_i}_{train}$ denotes the training set generated using $p_i$ and $D_{test}$ denotes the test data. Function $AUC()$ calculates the AUC value of a classification model $h_i(D_a, D_b)$ which is trained on $D_a$ and evaluated on $D_b$.
Moreover, the size of the subset is also important because a small training set is likely to result in a poorly trained model with poor generalization. Therefore, the fitness function can be constructed by combining the two components:
$$ fitness(p_i) = w_1 \cdot AUC(h_i(D^{p_i}_{train}, D_{test})) + w_2 \cdot Size(p_i) \qquad (4) $$

where $Size()$ determines the size of a subset (specified by $p_i$). Coefficients $w_1$ and $w_2$ are empirical constants which can be adjusted to alter the relative importance of each fitness component. The default values are $w_1 = 0.8$ and $w_2 = 0.2$, as they work well on a range of datasets.
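The fitness computation can be sketched as follows. The AUC is computed from classifier scores as the fraction of correctly ranked (positive, negative) pairs, which is equivalent to the area under the ROC curve. How Size() is scaled is not specified above, so this sketch assumes it is the fraction of majority samples kept by the mask; that normalisation, the function names, and the example scores are all assumptions.

```python
def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def fitness(mask, scores, labels, w1=0.8, w2=0.2):
    """Eq. (4) with Size() assumed to be the fraction of majority samples kept."""
    size = sum(mask) / len(mask)
    return w1 * auc(scores, labels) + w2 * size

# Hypothetical decision scores of a trained classifier on five internal test samples.
scores = [0.9, 0.8, 0.4, 0.3, 0.2]
labels = [1, 0, 1, 0, 0]
mask = [1, 1, 0, 0]            # two of four majority samples selected
auc_value = auc(scores, labels)
fit_value = fitness(mask, scores, labels)
print(round(auc_value, 4), round(fit_value, 4))  # 0.8333 0.7667
```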
5 Experimental results
In this section, we first describe four imbalanced biological datasets used in our experiment. They are generated from several important and diverse biological problems and represent different degrees of imbalanced class distribution. Next we present the performance results of our ensemble algorithm compared with six other algorithms using those datasets.
5.1 Datasets
We evaluated different algorithms using datasets generated for identification of miRNA, classification of protein localization sites, and prediction of promoters (drosophila and human). Specifically, the miRNA identification dataset contains 691 positive samples and 9248 negative samples, each described by 21 features [21]. The protein localization dataset is generated from the study discussed in [22]. We attempted to differentiate membrane proteins (258) from the rest (1226). The human promoter dataset contains 471 promoter sequences and 5131 coding sequences (CDS) and intron sequences. Compared to the human promoter dataset, the drosophila promoter dataset has a relatively balanced class distribution with 1936 promoter sequences and 2722 CDS and intron sequences. We calculated the 16 dinucleotide features according to [23].
The datasets are summarized and organized according to class ratio in Table 1.
Table 1. Summary of biological datasets used for evaluation.
Dataset (short name)             # Samples   # Features   Minority vs. Majority
drosophila promoter (DroProm)    6594        16           0.4156 (≈ 1:2.5)
protein localization (ProtLoc)   1484        8            0.2104 (≈ 1:5)
human promoter (HuProm)          5602        16           0.0918 (≈ 1:10)
miRNA identification (miRNA)     9939        21           0.0747 (≈ 1:13)
5.2 Performance comparison
The performance of the single classifier of SVM was used as the baseline for all datasets. We compared the single classifier approaches including random under-sampling with SVM (RUS-SVM), random over-sampling with SVM (ROS-SVM), SMOTE sampling with SVM (SMOTE-SVM), and the ensemble approaches including boosting with base classifiers of SVM (Boost-SVMs), bagging with base classifiers of SVM (Bag-SVMs), and our sample subset optimization technique with SVM (SSO-SVMs).
[Figure 3: four panels, (a) drosophila promoter, (b) protein localization, (c) human promoter, (d) miRNA identification. x-axis: Number of Base Classifiers (10 to 100); y-axis: Area Under ROC Curve. Curves: SSO-SVMs, Bag-SVMs, Boost-SVMs, Single-SVM, ROS-SVM, RUS-SVM, SMOTE-SVM.]
Fig. 3. The comparison of different algorithms for data classification. The x-axis denotes the ensemble sizes and the y-axis denotes the AUC value. For those algorithms that use a single classifier, the same AUC value is plotted on different ensemble sizes for the purpose of comparison.
For the ensemble methods, we tested ensemble sizes from 10 to 100 in steps of 10. A 5-fold cross-validation procedure was applied to partition the datasets for training and testing, and each algorithm was tested on the same partition to reduce evaluation variance. Among the six tested algorithms, four employed a randomization procedure: RUS-SVM, ROS-SVM, Bag-SVMs, and SSO-SVMs (note that the Boost-SVMs algorithm uses the reweighting implementation and is deterministic). For those with a randomization procedure, we repeated the test 10 times, each time with a different random seed.
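The evaluation protocol can be outlined in code as a fixed partition that is shared by all algorithms, combined with repeated runs under different seeds for the randomized ones. The sketch below assumes a stratified split, which the text above does not state, and uses hypothetical labels; the seed values are arbitrary.

```python
import random

def stratified_folds(labels, k=5, seed=0):
    """Split sample indices into k folds that preserve the class proportions."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, l in enumerate(labels) if l == cls]
        rng.shuffle(idx)
        for position, i in enumerate(idx):
            folds[position % k].append(i)   # deal indices round-robin
    return folds

labels = [1] * 10 + [0] * 90                # hypothetical 10:90 dataset
folds = stratified_folds(labels, k=5, seed=0)   # one partition reused by every algorithm

# Randomized algorithms would be rerun on the same folds with different seeds.
seeds = list(range(10))
runs = [(seed, fold_id) for seed in seeds for fold_id in range(len(folds))]
print([len(f) for f in folds], len(runs))  # [20, 20, 20, 20, 20] 50
```

Keeping the partition fixed means that differences between algorithms, or between seeds of the same algorithm, are not confounded with differences in how the data were split.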
Figure 3 shows the comparison of results. It can be seen that in most cases ensemble approaches give higher AUC values than the single classifier approaches. For single classifier approaches, random under-sampling, random over-sampling, and SMOTE sampling do improve the classification results when the analyzed dataset has a highly imbalanced class distribution, as in Figure 3(b)(c)(d). However, the improvements become less significant when the imbalance is moderate (drosophila promoter dataset in Figure 3(a)). SMOTE sampling performs better than random under-sampling and over-sampling approaches in the case of protein localization (Figure 3(b)). However, the performance gain is marginal in the other three datasets (Figure 3(a)(c)(d)). We do not observe a significant difference in performance between random under-sampling and random over-sampling, except in the case of miRNA identification (Figure 3(d)), where random over-sampling is relatively better than random under-sampling.
For ensemble approaches, Boost-SVMs performs surprisingly worse than the other two approaches in most cases, and its performance fluctuates among different ensemble sizes. This may be caused by its training process, in that the boosting algorithm assigns increasingly more classification weight to the most “difficult” samples in each iteration. However, those “difficult” samples could be outliers and cause a deleterious effect when the classifiers pay too much attention to classifying them while ignoring other, more representative samples. In this regard, Bag-SVMs and SSO-SVMs appear to be the better approaches. Of the two, SSO-SVMs performs best in almost every case and generates much smaller performance variance when different random seeds are used. It is likely that SSO-SVMs can capture the most representative samples from the training set, which gives better generalization on unseen data. We also observe that the improvement is more significant when the datasets have a highly imbalanced class distribution (Figure 3(b)(c)(d)).
Table 2. The comparison of different algorithms for data classification according to AUC value. The values for ensemble approaches are averaged across different ensemble sizes.
Algorithm     DroProm   ProtLoc   HuProm   miRNA
Single-SVM    0.6584    0.8296    0.5740   0.7542
RUS-SVM       0.6584    0.8850    0.6016   0.7644
ROS-SVM       0.6555    0.8866    0.5986   0.8114
SMOTE-SVM     0.6400    0.8976    0.5961   0.7924
Boost-SVMs    0.7756    0.8852    0.6644   0.8891
Bag-SVMs      0.8507    0.8671    0.7264   0.9198
SSO-SVMs      0.8520    0.9098    0.7718   0.9419
Table 3. P-values from a one-tailed Student's t-test comparing the performance differences.
Algorithm                   DroProm      ProtLoc      HuProm       miRNA
SSO-SVMs vs. Single-SVM     2 × 10^-15   4 × 10^-18   1 × 10^-11   1 × 10^-14
SSO-SVMs vs. RUS-SVM        2 × 10^-15   1 × 10^-13   4 × 10^-11   2 × 10^-14
SSO-SVMs vs. ROS-SVM        2 × 10^-15   2 × 10^-13   4 × 10^-11   3 × 10^-13
SSO-SVMs vs. SMOTE-SVM      8 × 10^-16   8 × 10^-11   3 × 10^-11   9 × 10^-14
SSO-SVMs vs. Boost-SVMs     2 × 10^-8    8 × 10^-7    7 × 10^-6    2 × 10^-5
SSO-SVMs vs. Bag-SVMs       6 × 10^-4    7 × 10^-11   1 × 10^-6    2 × 10^-3
Table 2 shows the AUC values of both single classifier and ensemble approaches. For the ensemble approaches, the AUC value is the average of those given by the ensemble sizes from 10 to 100. The proposed SSO-SVMs performs the best in all four tested datasets. Comparing these results with the baseline of a single SVM, they account for 10%-20% improvements. To confirm that the improvements are statistically significant, we applied a one-tailed Student's t-test and compared SSO-SVMs with the other six methods. Table 3 shows the p-values of the comparison. In all four datasets, the performance of SSO-SVMs is significantly better than that of the other six methods, with p-values smaller than 0.05. These results support the effectiveness of the proposed ensemble approach.
6 Conclusion
In this paper we introduced a sample subset optimization technique for sampling optimal sample subsets from training data. We integrated this technique in an ensemble learning framework and created an ensemble of SVMs specifically for imbalanced biological data classification. The proposed algorithm was applied to several bioinformatics tasks with moderately and highly imbalanced class distributions. According to our experimental results, (1) the approaches based on data sampling for a single SVM are generally less effective than the ensemble approaches; (2) the proposed sample subset optimization technique appears to be very effective, and the ensemble optimized by this technique produced the best classification results in terms of AUC value for all evaluation datasets.
References
1. Meyer, I.: A practical guide to the art of RNA gene prediction. Briefings in Bioinformatics 8(6) (2007) 396–414
2. Zeng, J., Zhu, S., Yan, H.: Towards accurate human promoter recognition: a review of currently used sequence features and classification methods. Briefings in Bioinformatics 10(5) (2009) 498–508
3. Sonnenburg, S., Schweikert, G., Philips, P., Behr, J., Rätsch, G.: Accurate splice site prediction using support vector machines. BMC Bioinformatics 8(Suppl 10) (2007) S7
4. Hua, S., Sun, Z.: Support vector machine approach for protein subcellular localization prediction. Bioinformatics 17(8) (2001) 721–728
5. Akbani, R., Kwek, S., Japkowicz, N.: Applying Support Vector Machines to Imbalanced Datasets. In: Proceedings of the 15th European Conference on Machine Learning. (2004) 39–50
6. Liu, Y., An, A., Huang, X.: Boosting prediction accuracy on imbalanced datasets with SVM ensembles. In: Proceedings of the 10th Pacific-Asia Conference on Knowledge Discovery and Data Mining. (2006) 107–118
7. Japkowicz, N., Stephen, S.: The class imbalance problem: A systematic study. Intelligent Data Analysis 6(5) (2002) 429–449
8. Batuwita, R., Palade, V.: A New Performance Measure for Class Imbalance Learning. Application to Bioinformatics Problems. In: 2009 International Conference on Machine Learning and Applications, IEEE (2009) 545–550
9. Chawla, N., Japkowicz, N., Kotcz, A.: Editorial: special issue on learning from imbalanced data sets. ACM SIGKDD Explorations Newsletter 6 (2004) 1–6
10. Chawla, N., Bowyer, K., Hall, L., Kegelmeyer, W.: SMOTE: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research 16(1) (2002) 321–357
11. Weiss, G.: Mining with rarity: a unifying framework. ACM SIGKDD Explorations Newsletter 6(1) (2004) 7–19
12. Hido, S., Kashima, H., Takahashi, Y.: Roughly balanced bagging for imbalanced data. Statistical Analysis and Data Mining 2(5-6) (2009) 412–426
13. Breiman, L.: Bagging predictors. Machine Learning 24(2) (1996) 123–140
14. Schapire, R., Freund, Y., Bartlett, P., Lee, W.: Boosting the margin: A new explanation for the effectiveness of voting methods. The Annals of Statistics 26(5) (1998) 1651–1686
15. Tax, D., Van Breukelen, M., Duin, R.: Combining multiple classifiers by averaging or by multiplying? Pattern Recognition 33(9) (2000) 1475–1485
16. Lam, L., Suen, S.: Application of majority voting to pattern recognition: an analysis of its behavior and performance. IEEE Transactions on Systems, Man, and Cybernetics, Part A: Systems and Humans 27(5) (1997) 553–568
17. Poli, R., Kennedy, J., Blackwell, T.: Particle swarm optimization. Swarm Intelligence 1(1) (2007) 33–57
18. Ben-Hur, A., Ong, C., Sonnenburg, S., Schölkopf, B., Rätsch, G.: Support vector machines and kernels for computational biology. PLoS Computational Biology 4(10) (2008)
19. Hsieh, C., Chang, K., Lin, C., Keerthi, S., Sundararajan, S.: A dual coordinate descent method for large-scale linear SVM. In: Proceedings of the 25th International Conference on Machine Learning, ACM (2008) 408–415
20. Fawcett, T.: An introduction to ROC analysis. Pattern Recognition Letters 27(8) (2006) 861–874
21. Batuwita, R., Palade, V.: microPred: effective classification of pre-miRNAs for human miRNA gene prediction. Bioinformatics 25(8) (2009) 989–995
22. Horton, P., Nakai, K.: A probabilistic classification system for predicting the cellular localization sites of proteins. In: Proceedings of the Fourth International Conference on Intelligent Systems for Molecular Biology, AAAI Press (1996) 109–115
23. Rani, T., Bhavani, S., Bapi, R.: Analysis of E. coli promoter recognition problem in dinucleotide feature space. Bioinformatics 23(5) (2007) 582–588