To assess the generalization performance of the classifiers used in the experiments, two procedures were used to compare the classification algorithms. The first tests whether there is a statistically significant difference in the performance of a pair of classifiers, either over several benchmark datasets or on a single benchmark dataset where multiple training/test partitions are used. The second tests whether there is a statistically significant difference among $k$ classifiers over multiple benchmark datasets.
Comparison of a Pair of Classifiers: The Wilcoxon signed-ranks test [85] is a non-parametric approach used to determine whether there is a statistically significant difference between the performance of two classifiers, either over multiple benchmark datasets or over a single benchmark dataset using results obtained from independent test sets. The test is based on the ranks of the differences in performance of the two classifiers on each dataset.
Suppose we have a classification performance matrix $C$ of size $n$ by $k$, where $k = 2$ is the number of classifiers and $n$ is the number of trials from independent test sets. Let $C_{i1}$ and $C_{i2}$ be the classification performances of the two classifiers on the $i$th trial for a single dataset. The idea behind the Wilcoxon signed-ranks test is to rank the absolute values $|D_i|$, where $D_i = C_{i1} - C_{i2}$ is the signed difference between the performance of the two classifiers on the $i$th trial. The absolute differences are ranked from smallest to largest, and the ranks are then summed separately for the positive differences, $R^{+}$, and the negative differences, $R^{-}$. The smaller of the two sums, $T = \min(R^{+}, R^{-})$, is the test statistic. For a large number of trials or benchmark datasets, $T$ is approximately normally distributed. A critical value is then obtained at the significance level $\alpha = 0.05$. If the test statistic is less than or equal to the critical value, the null hypothesis $H_0$ is rejected, where $H_0$ states that the two sets of classifier results have equal median ranks. The alternative hypothesis $H_1$ states that the two sets of classifier results have different median ranks. To compare two classifiers over multiple benchmark datasets, the same test is applied with the $i$th benchmark dataset taking the place of the $i$th trial.
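The ranking procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the experiments: the function names and the accuracy values are hypothetical, zero differences are simply dropped (one common convention, assumed here), and tied absolute differences share their average rank.

```python
import math

def average_ranks(values):
    """Rank values from smallest (rank 1) to largest; ties share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the block while the next sorted value is tied with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # mean of the 1-based positions pos+1 .. end+1
        for j in range(pos, end + 1):
            ranks[order[j]] = shared
        pos = end + 1
    return ranks

def wilcoxon_signed_rank(perf_a, perf_b):
    """Return (R+, R-, T, z) for paired performance scores of two classifiers."""
    diffs = [a - b for a, b in zip(perf_a, perf_b)]
    diffs = [d for d in diffs if d != 0]  # assumption: zero differences are discarded
    n = len(diffs)
    ranks = average_ranks([abs(d) for d in diffs])
    r_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    r_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    t = min(r_plus, r_minus)
    # Normal approximation of T; only meaningful when n is reasonably large.
    mean_t = n * (n + 1) / 4
    sd_t = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (t - mean_t) / sd_t
    return r_plus, r_minus, t, z

# Hypothetical accuracies (in percent) of two classifiers on eight trials.
acc_a = [85, 78, 92, 70, 88, 81, 76, 90]
acc_b = [82, 79, 88, 70, 83, 83, 70, 84]
r_plus, r_minus, t, z = wilcoxon_signed_rank(acc_a, acc_b)
print(r_plus, r_minus, t, round(z, 3))
```

For small samples such as this one, the statistic T would be compared against a tabulated critical value rather than the normal approximation.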
The Wilcoxon signed-ranks test is an appropriate test for comparing two classifiers provided that the performance measures are independent, which the test assumes. In addition, it requires the differences in the performance of a pair of classifiers to be commensurable only qualitatively: larger differences receive larger ranks, but their absolute magnitudes are ignored. This matters because commensurability of the differences is difficult to guarantee across multiple benchmark datasets. Moreover, the test does not assume that the differences in performance are normally distributed, which is especially useful when the number of benchmark datasets is small. The test is also robust to outliers: extreme values that skew the performance measures have less effect on it than on a test based on the raw differences. The Wilcoxon signed-ranks test does assume that the distribution of the differences is symmetric; in other words, each side of the median must have a similar shape. If this assumption is violated, the power of the test can be affected.
Comparison of Multiple Classifiers: The Friedman test [29] is a non-parametric alternative to analysis of variance (ANOVA). It determines whether there is a statistically significant difference between the average ranks of $k$ classifiers, where $k > 2$. The null hypothesis $H_0$ assumes that the average ranks $\bar{R}_j$ over multiple datasets are equal, against the alternative hypothesis $H_1$ that at least one of the classifiers has a different average rank. Two matrices are given: the classification performance matrix $C$, which is $n$ by $k$, where $k > 2$ is the number of classifiers and $n$ is the number of benchmark datasets, and the rank matrix $R$, also $n$ by $k$, whose entry $r_{ij}$ is the rank of classifier $j$ on dataset $i$. The average rank of classifier $j$ is $\bar{R}_j = \frac{1}{n} \sum_{i=1}^{n} r_{ij}$. The Friedman statistic $Q$ can then be calculated:
\[ Q = \frac{12n}{k(k+1)} \left[ \sum_{j=1}^{k} \bar{R}_j^{2} - \frac{k(k+1)^{2}}{4} \right]. \]
The $Q$ statistic is approximately distributed according to a chi-squared distribution with $(k-1)$ degrees of freedom. This approximation is adequate when the number of benchmark datasets, $n$, and the number of classifiers, $k$, are large enough (as a rule of thumb, $n > 10$ and $k > 5$). However, Demšar [25] notes that this statistic is often conservative for a small number of benchmark datasets, and recommends using the following statistic:
\[ F = \frac{(n-1)Q}{n(k-1) - Q}, \]
which follows an F distribution with $(k-1)$ and $(k-1)(n-1)$ degrees of freedom. The null hypothesis $H_0$ is rejected if the value of this statistic is greater than the critical value, which means that there is a statistically significant difference between at least two classifiers. If a significant difference is found, a post-hoc test is applied to determine which pairs of classifiers differ significantly. The Nemenyi test is used to calculate the critical difference, CD,
\[ \mathrm{CD} = q_{\alpha} \sqrt{\frac{k(k+1)}{6n}}, \]
where $q_{\alpha}$ depends on both $\alpha$ and $k$. The result of the Nemenyi test can be shown visually in a critical difference diagram. Figure 2.15 is an illustrative example of the Nemenyi test result for three classifiers. Two classifiers are significantly different if their average ranks differ by more than the critical difference. Classifiers linked by a bar are not significantly different: the difference between their average ranks is less than the critical difference. For example, in Figure 2.15, Classifier 1 is significantly different from Classifier 3, whereas the differences between Classifier 1 and Classifier 2, and between Classifier 2 and Classifier 3, are not statistically significant.
[Figure 2.15: critical difference diagram for three classifiers, with average ranks 1.6786 (Classifier 1), 1.8929 (Classifier 2), and 2.4286 (Classifier 3).]
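The Friedman statistic, its F-distributed correction, and the Nemenyi critical difference can be sketched together as follows. This is an illustration under stated assumptions, not the code used in the experiments. The function names are hypothetical. The average ranks are those shown in Figure 2.15, but the number of datasets, n = 28, is an assumption: it is merely consistent with those ranks (each is a multiple of 1/28 after rounding) and is not stated in the text. The value q = 2.343 is the commonly tabulated Nemenyi critical value for α = 0.05 and k = 3.

```python
import math
from itertools import combinations

def friedman_statistic(avg_ranks, n):
    """Friedman Q computed from the average ranks of k classifiers over n datasets."""
    k = len(avg_ranks)
    sum_sq = sum(r * r for r in avg_ranks)
    return 12 * n / (k * (k + 1)) * (sum_sq - k * (k + 1) ** 2 / 4)

def f_correction(q, n, k):
    """F statistic with (k-1) and (k-1)(n-1) degrees of freedom, derived from Q."""
    return (n - 1) * q / (n * (k - 1) - q)

def nemenyi_cd(k, n, q_alpha):
    """Critical difference between average ranks for the Nemenyi post-hoc test."""
    return q_alpha * math.sqrt(k * (k + 1) / (6 * n))

# Average ranks read from Figure 2.15 (Classifier 1, 2, 3).
avg_ranks = [1.6786, 1.8929, 2.4286]
n_datasets = 28   # ASSUMPTION: not given in the text, only consistent with the ranks
k = len(avg_ranks)
q_alpha = 2.343   # tabulated critical value for alpha = 0.05, k = 3

q = friedman_statistic(avg_ranks, n_datasets)
f = f_correction(q, n_datasets, k)
cd = nemenyi_cd(k, n_datasets, q_alpha)

# Pairs (0-based indices) whose average ranks differ by more than CD.
significant_pairs = [
    (a, b) for a, b in combinations(range(k), 2)
    if abs(avg_ranks[a] - avg_ranks[b]) > cd
]
print(round(q, 3), round(f, 3), round(cd, 3), significant_pairs)
```

Under the assumed n, only the pair formed by Classifier 1 and Classifier 3 exceeds the critical difference, which agrees with the reading of Figure 2.15 given above. The values of Q and F would still need to be compared against the chi-squared and F critical values, respectively.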