• No results found

Diversity Measures for Classification

2.3 Diversity in Classification Ensembles

2.3.1 Diversity Measures for Classification

For quantitative assessment of diversity, several measures have been defined in two types,

ment between all possible pairs of classifiers in the ensemble. This similarity/disagreement

is expressed by a distance metric. Non-pairwise measures consider the voting distribu-

tion among classifiers and calculate the entropy or correlation. In particular, ten ex-

isting measures are frequently mentioned and studied in the majority of relevant liter-

ature (Kuncheva and Whitaker, 2003, 2001; Tang et al., 2006; Chen, 2008): Q-statistic

(Q), correlation coefficient (ρ), disagreement measure (Dis), double-fault (DF), entropy

(Ent), Kohavi-Wolpert variance (KW), interrator agreement (κ), difficulty measure (θ),

generalized diversity (GD) and coincident failure diversity (CF D). The first four belong

to pairwise measures, and the remaining ones are non-pairwise. All of them are based

on the oracle output (correct/incorrect decision) and the combining method of majority

vote. Their definitions will be given next.

Let F = {f1, f2, . . . , fL} be a set of learnt classifiers as committee members of an

ensemble, andY ={ω1, ω2, . . . , ωc}be a finite label set. For any input xj belonging to an

instance spaceX, there is an expected label yj ∈ Y. For each classifier fi (i= 1, . . . , L),

we defineOj,i = 1 iffi recognizesxj correctly, and 0 otherwise. Similarly for the combined

model of those individuals, we define Oj,ens = 1 if the majority vote result is correct for

example xj, and 0 if xj is misclassified. This is known as the oracle type of output. The

measures based on the oracle output do not require any prior knowledge about data and

any specific base learners. They are only determined by the correct/incorrect decisions of

individuals. The oracle output provides a general model for analyzing various ensemble

learning methods (Tang et al., 2006). Taking two classifiers fi and fk, we define Nab

as the number of examples xj for which Oj,i = a and Oj,k = b within a labelled data

set Z = {(x1, y1),(x2, y2), . . . ,(xN, yN)}. Table 2.4 presents the relationship between

classifiers fi and fk. Each entry can also be in the form of probability by calculating

Nab/N.

(1) Q-statistic (Q) (Yule, 1900)

Table 2.4: A 2×2 table of the relationship between a pair of classifiers (Kuncheva and Whitaker, 2003). fk correct (1) fk wrong (0) fi correct (1) N11 N10 fi wrong (0) N01 N00 Total: N =N00+N01+N10+N11 Qi,k = N11N00−N01N10 N11N00+N01N10. (2.6)

Qi,k assesses the similarity between them. For a set of L classifiers, diversity is mea-

sured by the averaged Q-statistic

Q= 2 L(L−1) L−1 X i=1 L X k=i+1 Qi,k. (2.7)

For statistically independent classifiers, the expectation ofQi,k is 0. It varies between

−1 and 1. It performs positive if classifiers tend to recognize the same input examples cor- rectly, and negative if they commit errors on different examples (Kuncheva and Whitaker,

2003). Larger Q-statistic values indicate smaller diversity. It is the most widely discussed

diversity measure in the ensemble literature.

(2) Correlation coefficient (ρ) (Sneath and Sokal, 1973)

The correlation coefficient ρ is defined as

ρi,k =

N11N00N01N10

p

(N11+N10) (N01+N00) (N11+N01) (N10+N00). (2.8)

Q and ρ have the same sign, and ρ ≤ Q. ρ = 0 indicates that the classifiers are uncorrelated.

(3) Disagreement measure (Dis) (Skalak, 1996; Ho, 1998)

differently (i.e. one is correct, and the other is incorrect) to the total number of examples.

It refers to the probability that two classifiers will disagree.

Disi,k =

N01+N10

N11+N10+N01+N00. (2.9)

(4) Double-fault (DF) (Giacinto and Roli, 2000)

The double-fault measure is the proportion of examples that are labelled incorrectly

by both classifiers,

DFi,k =

N00

N11+N10+N01+N00. (2.10)

The above four statistics belong to pairwise measures that express the (dis)similarity be-

tween any two classifiers. The following gives six non-pairwise measures that take the

ensemble’s decision into consideration. Let l(xj) denote the number of classifiers that

labelxj correctly (i.e. l(xj) =

PL

i=1Oj,i).

(5) Entropy (Ent) (Cunningham and Carney, 2000)

The entropy measure assumes that the highest diversity happens when half of the

classifiers are correct and the remaining ones are incorrect. If the oracle outputs are all

1’s or 0’s, there is no diversity among the classifiers. It is defined as

Ent= 1 N N X j=1 1 (L− dL/2e)min{l(xj), L−l(xj)}, (2.11)

where d.e is the ceiling operator. Ent varies between 0 and 1, where 0 indicates all classifiers perform the same, and 1 indicates the highest diversity.

The Kohavi-Wolpert variance measures the variability of the predicted class label given

an input x for a specific classifier

variancex = 1 2 1− c X i=1 P (y=ωi|x)2 ! (2.12)

and averages over the whole data set. In the case of oracle outputs, the variability is

estimated among the set of classifiersf1, f2, . . . , fL by

KW = 1 N L2 N X j=1 l(xj) (L−l(xj)). (2.13)

It can be proved that KW differs from the averaged disagreement measure Dis by a

constant factor (Kuncheva and Whitaker, 2003),

KW = L−1

2L Dis. (2.14)

(7) Interrator agreement (κ) (Fleiss, 1981)

It is a measure of classification reliability, indicating the level of agreement while

corrected by chance. Let ¯p denote the average individual classification accuracy

¯ p= 1 N L N X j=1 l(xj), (2.15) then κ= 1− 1 L PN j=1l(xj) (L−l(xj)) N(L−1) ¯p(1−p¯) . (2.16)

κ is related to KW and Dis as follows

κ= 1− L

(L−1) ¯p(1−p¯)KW = 1− 1

It should be noted that there is a pairwise κ statistic used in (Margineantu and Diet-

terich, 1997; Dietterich, 2000a). It is the agreement between two classifiers corrected by

chance. However, these two have no direct link.

(8) Difficulty measure (θ) (Hansen and Salamon, 1990)

A discrete random variable X0 is defined as the proportion of classifiers in F that classify an input x correctly: X0 = L0,L1, . . . ,1 . It describes a pattern of classification difficulty for all classifiers. The diversity measure θ depicts the distribution of difficulty

over the data setZ, captured by the variance ofX0: θ =V ar(X0). The higher the value of θ, the lower the ensemble diversity.

(9) Generalized diversity (GD) (Partridge and Krzanowski, 1997)

Partridge and Krzanowski argued that maximum diversity occurs when failure of one

of the L classifiers is accompanied by correct labeling by the other classifier; minimum

diversity occurs when one’s failure is always accompanied by failure of the other. If we

definepi as the probability thatiout ofLclassifiers misclassify an inputxsimultaneously

and p(i) as the probability that irandomly chosen classifiers will misclassify x,

p(1) = L X i=1 i Lpi and p(2) = L X i=1 i(i−1) L(L−1)pi. (2.18)

In the case with maximum diversity, there is p(2) = 0; in the case with minimum

diversity, p(2) = p(1) holds. Based on the assumptions, the generalization diversity is

defined as

GD = 1− p(2)

p(1). (2.19)

GD varies between 0 and 1.

The coincident failure diversity is a modification of GD. It is defined as CF D =        0, p0 = 1.0 1 1−p0 PL i=1 L−i L−1pi, p0 <1.0 (2.20)

It varies between 0 and 1. When all classifiers always produce correct labels or they

are correct or wrong simultaneously (minimum diversity), CF D will be 0. When all the

misclassifications happen to exactly one classifier (maximum diversity), 1 will be achieved.

Table 2.5 summarizes the ten measures, including the indication of their changing

directions in relation to diversity, the category they belong to (pairwise or not) and the

abbreviations used in this thesis.

Table 2.5: Summary of the 10 diversity measures. ‘↓/↑’ indicates greater diversity if the measure gets lower/higher.

Name Abbr. Pairwise Direction

Q-statistic Q Yes ↓

Correlation coefficient ρ Yes ↓

Disagreement measure Dis Yes ↑

Double-fault DF Yes ↓ Entropy Ent No ↑ Kohavi-Wolpert variance KW No ↑ Interrator agreement κ No ↓ Difficulty measure θ No ↓ Generalized diversity GD No ↑

Coincident failure diversity CF D No ↑

Kuncheva and Whitaker studied the relationship between these diversity measures (Kuncheva

and Whitaker, 2003). Strong positive correlations were observed. They found no single

best measure. Similar results were also obtained in other papers (Shipp and Kuncheva,

2002a; Chen, 2008). Particularly, Q-statistic was recommended by the authors for its

simplicity and understandability (Kuncheva and Whitaker, 2003). Besides, Q-statistic

do not depend on the individual accuracy for two classifiers with the same individual

accuracy (Kuncheva and Whitaker, 2001), and have little empirical relationship to the

averaged individual accuracy (Kuncheva and Whitaker, 2003). It could be an advantage

for studying class imbalance problems, considering that a diversity measure less correlated

to individual accuracy is expected to be less sensitive to class imbalance distributions.

Otherwise, it may result in inaccurate analysis.