
This section describes the operation of the Friedman statistical test. The test begins by ranking the classification techniques on each data set according to the recorded AUC values. The average rank of each classification technique is then obtained across the data sets. The Friedman test statistic, $\chi^2_F$, is then calculated as follows [50, 66, 76]:

$$\chi^2_F = \frac{12n}{k(k+1)} \left[ \sum_{i=1}^{k} \mu_i^2 - \frac{k(k+1)^2}{4} \right] \tag{7.1}$$

where: (i) $n$ is the number of data sets, (ii) $k$ is the number of classification techniques and (iii) $\mu_i$ is the average rank for classification technique $i$, which in turn is calculated as follows:

$$\mu_i = \frac{1}{n} \sum_{j=1}^{n} r_j \tag{7.2}$$

where $r_j$ is the (AUC) rank for classification technique $i$ on data set $j$.
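As a minimal sketch of Equations 7.1 and 7.2, the average ranks and the Friedman statistic might be computed as shown below. Python is used purely for illustration; the function names and the small rank table are hypothetical and are not taken from the reported experiments.

```python
def average_ranks(rank_table):
    """Equation 7.2: mean rank of each technique over the n data sets.

    rank_table[j][i] is the rank of technique i on data set j.
    """
    n = len(rank_table)
    k = len(rank_table[0])
    return [sum(row[i] for row in rank_table) / n for i in range(k)]


def friedman_statistic(avg_ranks, n):
    """Equation 7.1: Friedman chi-square statistic from the average ranks."""
    k = len(avg_ranks)
    sum_sq = sum(mu * mu for mu in avg_ranks)
    return 12.0 * n / (k * (k + 1)) * (sum_sq - k * (k + 1) ** 2 / 4.0)


# Hypothetical ranks for k = 3 techniques on n = 4 data sets (illustration only).
ranks = [
    [1, 2, 3],
    [1, 3, 2],
    [2, 1, 3],
    [1, 2, 3],
]
mu = average_ranks(ranks)
chi2_f = friedman_statistic(mu, n=len(ranks))
```

When every technique has the same average rank, $(k+1)/2$, the bracketed term in Equation 7.1 vanishes and the statistic is zero, which corresponds to no evidence against the null hypothesis.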

The Friedman test was applied to the proposed techniques in the context of the evaluation data sets. Two different cases were considered: (i) where the classification techniques were trained and tested on the same data set and (ii) where the classification techniques were trained on one data set and tested on another. In both cases the parameters that produced the best results in terms of the AUC measure, as obtained in the experiments reported in the foregoing chapters, were used (as listed in Table 7.1). Recall that these results were obtained using TCV. More specifically, the Friedman process is as follows:

1. Each of the proposed techniques is given a rank with respect to each data set as shown in Tables 7.2 and 7.3 (Table 7.2 shows the rankings when the classifier is trained and tested on the same data set, while Table 7.3 shows the rankings when the classifier is trained and tested on different data sets). The ranks in both tables are presented in parentheses, where the best performing algorithm is given a rank of 1, the second best a rank of 2, and so on.

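2. Note that (with reference to Tables 7.2 and 7.3) where two techniques share a ranking $r$, the so-called ties rule was used: the tied techniques are each given the average of the rank positions they jointly occupy. For example, if two techniques are tied for fourth place then both are given a ranking of 4.5, that is $(4 + (4 + 1))/2$.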

3. The average rank $\mu$ for each of the proposed techniques is given in the last column of the two tables, calculated using Equation 7.2 where $n = 8$.

4. The null hypothesis ($H_0$), that there is no significant difference between the operation of the techniques, and the alternative hypothesis ($H_1$), that there is such a difference, were established.

5. Using the Friedman test there are two situations where the null hypothesis H0

6. The rejection of the null hypothesis $H_0$ means the automatic acceptance of the alternative hypothesis $H_1$.

7. If the null hypothesis is rejected we can proceed with a post-hoc test, based on a critical distance between pairs of techniques, to identify which technique(s) are significantly different (in terms of recorded AUC values). It should be noted that we cannot proceed with a post-hoc test if the null hypothesis $H_0$ is not rejected.
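The ranking in steps 1 and 2, including the ties rule, can be sketched as follows. This is an illustrative sketch only: the function name and the AUC values are hypothetical, and exact equality is used to detect ties, which assumes the AUC values have already been rounded to the reported precision.

```python
def rank_with_ties(auc_values):
    """Rank techniques on one data set: the highest AUC gets rank 1,
    and tied techniques share the mean of the positions they occupy."""
    order = sorted(range(len(auc_values)), key=lambda i: -auc_values[i])
    ranks = [0.0] * len(auc_values)
    pos = 0
    while pos < len(order):
        # Extend 'end' over the run of techniques tied with the one at 'pos'.
        end = pos
        while (end + 1 < len(order)
               and auc_values[order[end + 1]] == auc_values[order[pos]]):
            end += 1
        # Positions are 1-based, so the run occupies pos + 1 .. end + 1.
        mean_rank = (pos + 1 + end + 1) / 2.0
        for idx in order[pos:end + 1]:
            ranks[idx] = mean_rank
        pos = end + 1
    return ranks


# Hypothetical AUC values for six techniques on a single data set.
example_ranks = rank_with_ties([0.95, 0.90, 0.85, 0.80, 0.80, 0.70])
```

In this hypothetical example the two techniques with an AUC of 0.80 occupy the fourth and fifth positions, so both receive a rank of 4.5, in line with the example given in step 2.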

| Number | Technique (variation) | Classifier generator | Best parameters |
|---|---|---|---|
| 1 | Level 1 LGM | Decision tree | $d = 10$, $\lvert L \rvert = 3$ and δz representation |
| 2 | Level 2 LGM | Decision tree | $d = 10$, $\lvert L \rvert = 3$ and δz representation |
| 3 | Composite LGM | Decision tree | $d = 10$, $\lvert L \rvert = 3$ and δz representation |
| 4 | LDM | Decision tree | $d = 2.5$ and $\lvert L \rvert = 3$ |
| 5 | LDM + Level 1 LGM | Decision tree | $d = 10$ and $\lvert L \rvert = 3$ |
| 6 | LDM + Level 2 LGM | Decision tree | $d = 10$ and $\lvert L \rvert = 3$ |
| 7 | LDM + Composite LGM | Decision tree | $d = 10$ and $\lvert L \rvert = 3$ |
| 8 | Point Series (PS) | k-NN with DTW technique | $d = 5$, 3×3 keyPS representation |

Table 7.1: The best parameter settings for the proposed techniques (variations) with respect to each 3D representation technique.

Before proceeding with the operation of the Friedman test, the concepts of level of significance ($\alpha$), p-value and degrees of freedom should be clearly defined. The level of significance, $\alpha$, is the probability of wrongly rejecting the null hypothesis $H_0$ when it is in fact true; it is sometimes known as the level of risk. The commonly used value is $\alpha = 0.05$ [27, 76], which means accepting a 5% chance of rejecting a null hypothesis that is actually true. The "critical value", normally denoted $\chi^2_\alpha$, is the point of the $\chi^2$ distribution whose upper-tail area is equal to $\alpha$. The p-value is defined as the probability of obtaining a result that is at least as extreme as the one actually obtained, assuming that the null hypothesis is true; it therefore satisfies $0 \le p \le 1$. More loosely, it indicates how likely results this extreme would be to arise by chance alone. Normally, the p-value is compared with the $\alpha$ value. Figure 7.1 shows the $\chi^2$ distribution curve, where $\alpha$ is the shaded area under the curve to the right of $\chi^2_\alpha$, while the p-value is the area under the curve to the right of $\chi^2_F$. With reference to the figure, if the p-value is greater than $\alpha$, the test is inconclusive and more evidence is required to support the alternative hypothesis ($H_1$); if the p-value is less than $\alpha$, the result is statistically significant and hence the null hypothesis $H_0$ can be rejected.

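Finally, the number of degrees of freedom is a positive integer that indicates how many values are free to vary, and it determines the shape of the $\chi^2$ distribution used for the test. In our case the number of independent classifiers that have been generated using the different proposed techniques is $k = 8$ and thus the number of degrees of freedom is $k - 1 = 7$.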
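To make the relationship between $\alpha$, the critical value and the p-value concrete, the following sketch evaluates the upper tail of the $\chi^2$ distribution for an integer number of degrees of freedom. It is an illustration under stated assumptions rather than the procedure used in this work, where the critical value was read from a look-up table: the function names are hypothetical, the statistic of 20.0 is an invented value, and in practice a statistics library would normally be used instead of this hand-written recurrence.

```python
import math


def chi2_sf(x, dof):
    """Upper-tail area P(X >= x) of a chi-square variable with integer dof."""
    if x <= 0:
        return 1.0
    z = x / 2.0
    if dof % 2 == 0:
        a, q = 1.0, math.exp(-z)               # Q(1, z)
    else:
        a, q = 0.5, math.erfc(math.sqrt(z))    # Q(1/2, z)
    # Recurrence Q(a + 1, z) = Q(a, z) + z**a * exp(-z) / Gamma(a + 1)
    while a < dof / 2.0:
        q += z ** a * math.exp(-z) / math.gamma(a + 1.0)
        a += 1.0
    return q


def chi2_critical_value(alpha, dof):
    """Find the point whose upper-tail area equals alpha, by bisection."""
    lo, hi = 0.0, 1.0
    while chi2_sf(hi, dof) > alpha:
        hi *= 2.0
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if chi2_sf(mid, dof) > alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


alpha, k = 0.05, 8
critical = chi2_critical_value(alpha, k - 1)  # roughly 14.07 for 7 degrees of freedom
chi2_f = 20.0                                 # hypothetical statistic, illustration only
p_value = chi2_sf(chi2_f, k - 1)
reject_h0 = chi2_f > critical                 # equivalent to p_value < alpha
```

Because the upper-tail area decreases as the statistic grows, comparing the statistic with the critical value and comparing the p-value with $\alpha$ always lead to the same decision.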

Figure 7.1: The $\chi^2$ distribution. The shaded area, to the right of the critical value $\chi^2_\alpha$, is equal to $\alpha$ and represents the region of rejection. The p-value is the area under the curve to the right of the calculated $\chi^2_F$.

As noted above, if the calculated Friedman test statistic $\chi^2_F$ is greater than the critical value of the Chi-square distribution, $\chi^2_\alpha$, obtained from a look-up table of the form shown in Figure 7.4, then the null hypothesis $H_0$ should be rejected and the alternative hypothesis $H_1$ should be accepted. However, this is not sufficient; to quantify the strength of the evidence against the null hypothesis, a p-value is also calculated. As already noted, the rejection of the null hypothesis $H_0$ indicates the existence of a significant difference between the techniques, but says nothing about the nature of this difference. Therefore, in the case where the null hypothesis is rejected we can proceed with a post-hoc test to determine which techniques performed differently. With respect to the work described in this thesis the Nemenyi test [155] was adopted. This operates using the concept of a "critical difference" calculated using Equation 7.3:

$$CD = q_{\alpha,\infty,k} \sqrt{\frac{k(k+1)}{12n}} \tag{7.3}$$

where the critical value $q_{\alpha,\infty,k}$ is based on the Studentised range statistic divided by $\sqrt{2}$. Here, $k = 8$, $\alpha = 0.05$ and $q_{0.05,\infty,8} = 3.03$ according to the table of critical values for $q_{\alpha,\infty,k}$ presented in [50]. The Critical Difference (CD) is thus the threshold against which the difference between the average ranks of a pair of classifiers is compared: the performance of two classifiers is considered to be significantly different if their average ranks differ by at least the CD.
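A minimal sketch of the critical difference and the pairwise comparison is given below. It follows Equation 7.3 exactly as printed, with $n = 8$ data sets and the quoted value $q_{0.05,\infty,8} = 3.03$; the function names and the average ranks passed to the comparison are hypothetical. Note that some presentations of the Nemenyi test write the denominator under the square root as $6n$ when $q$ has already been divided by $\sqrt{2}$, which would give a larger CD than the one computed here.

```python
import math


def critical_difference(q_alpha, k, n):
    """Equation 7.3 as printed: CD = q * sqrt(k (k + 1) / (12 n))."""
    return q_alpha * math.sqrt(k * (k + 1) / (12.0 * n))


def significantly_different(avg_rank_a, avg_rank_b, cd):
    """Two techniques differ if their average ranks are at least CD apart."""
    return abs(avg_rank_a - avg_rank_b) >= cd


cd = critical_difference(q_alpha=3.03, k=8, n=8)

# Hypothetical average ranks for two pairs of techniques (illustration only).
far_apart = significantly_different(1.5, 5.0, cd)
close_together = significantly_different(3.0, 4.0, cd)
```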