
4.3 FROVOCO: novel algorithm for multi-class imbalanced problems

4.3.2 New OVO aggregation scheme: WV-FROST

Based on the WV aggregation scheme (Section 4.2.1), we present a novel OVO aggregation scheme. Our proposal is called WV-FROST, which stands for Weighted Voting with Fuzzy ROugh Summary Terms. As a traditional OVO aggregator, WV captures the local information derived from the binary classification problems. WV-FROST complements this aspect with a global evaluation in the form of two summary terms. The inclusion of the global summary addresses the classifier competence issue (Section 4.2.2) by counteracting the information loss induced by the binary decomposition step.

Let $x$ be the instance to classify and $R(x)$ the score-matrix constructed with the OVO scheme. As recalled in Section 4.2.1, the WV method assigns $x$ to the class with the highest combined weighted vote in its favour. Its decision process relies on a vector $V_x$ with

$$V_x(C_i) = \frac{1}{m} \sum_{1 \le j \le m} r_{ij},$$

where the sum of the confidence degrees in favour of class $C_i$ is divided by $m$ for scaling purposes. WV assigns $x$ to the class corresponding to the maximal value in $V_x$. Our proposed WV-FROST method modifies the $V_x$ vector prior to the class prediction step. Two additional measures are added to each value $V_x(C_i)$ ($i = 1, \ldots, m$):

• Positive affinity term $\mathrm{mem}(x, C_i)$: the membership degree of $x$ to class $C_i$ based on the full training set. The membership degree is set to the average of the membership degrees to the fuzzy rough lower and upper approximations of $C_i$.

• Negative affinity term $\mathrm{msen}(x, C_i)$: a signature vector of expected membership degrees of instances of class $C_i$ to all classes can be constructed based on the $\mathrm{mem}(y, \cdot)$ values for all training instances $y \in C_i$. A similar vector can be constructed for $x$, namely consisting of its $\mathrm{mem}(x, \cdot)$ values for all classes. The distance of this vector to the signature of $C_i$ is penalized.

We discuss the two summary terms in detail in the following paragraphs.
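Before turning to the summary terms, the plain WV step that WV-FROST starts from can be illustrated with a minimal Python sketch. It assumes that the score-matrix is stored as a list of rows with $r_{ij}$ at position `[i][j]` and that the unused diagonal entries are set to 0; the function names `weighted_votes` and `wv_predict` are hypothetical and serve only this illustration.

```python
def weighted_votes(score_matrix):
    """Return V_x as a list, with V_x[i] = (1/m) * sum_j r_ij.

    Assumption: score_matrix[i][j] holds r_ij and the diagonal is 0,
    so summing a full row only counts the votes against other classes.
    """
    m = len(score_matrix)
    return [sum(row) / m for row in score_matrix]


def wv_predict(score_matrix):
    """Return the index of the class with the maximal weighted vote."""
    votes = weighted_votes(score_matrix)
    # Ties are broken by the lowest class index (an assumption of this sketch).
    return max(range(len(votes)), key=votes.__getitem__)


if __name__ == "__main__":
    # Toy score-matrix for m = 3 classes (made-up confidence degrees).
    R = [
        [0.0, 0.9, 0.6],
        [0.1, 0.0, 0.3],
        [0.4, 0.7, 0.0],
    ]
    print(weighted_votes(R))
    print(wv_predict(R))
```

WV-FROST leaves this computation untouched and only adjusts the resulting vector afterwards.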

Positive affinity. For class $C_i$, the summary term $\mathrm{mem}(x, C_i)$ represents the globally evaluated affinity of instance $x$ with class $C_i$. This measure is defined as the average membership degree of $x$ to the fuzzy rough lower and upper approximation of $C_i$, namely

$$\mathrm{mem}(x, C_i) = \frac{\underline{C_i}(x) + \overline{C_i}(x)}{2}. \tag{4.4}$$

The values are directly derived from the full dataset and not from the binary sub-problems, such that $\mathrm{mem}(x, C_i)$ corresponds to a global evaluation of the affinity of $x$ with $C_i$. It should be clear that definition (4.4) relates to the decision procedure used by the FRNN classifier ([242], Section 4.1.3). We use the OWA based fuzzy rough set model to compute the $\underline{C_i}(x)$ and $\overline{C_i}(x)$ values with weight vectors related to our adaptive weight setting $W_{IR}$. The sizes of the sets to aggregate in $\underline{C_i}(x)$ and $\overline{C_i}(x)$ are $|\mathrm{co}(C_i)|$ and $|C_i|$ respectively, with $\mathrm{co}(\cdot)$ the set complement function. According to $W_{IR}$, when the IR between $C_i$ and its complement does not exceed nine, exponential OWA weights are used in both the lower and upper approximation calculations (combination $W_e$). In the other case, when combination $W_\gamma$ is followed, the shortest weight vector uses the exponential definition and the longest weight vector is constructed with (4.2), replacing $|P|$ and $|N|$ by $\min(|C_i|, |\mathrm{co}(C_i)|)$ and $\max(|C_i|, |\mathrm{co}(C_i)|)$ respectively in its definition. Instead of simply adding the $\mathrm{mem}(x, C_i)$ values to the $V_x$ vector, we replace each position $V_x(C_i)$ by $\frac{V_x(C_i) + \mathrm{mem}(x, C_i)}{2}$. In this way, the local evaluation $V_x(C_i)$ is averaged with the global evaluation $\mathrm{mem}(x, C_i)$.
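To make the computation of the positive affinity term concrete, the following Python sketch evaluates definition (4.4) for the combination in which both approximations use exponential OWA weights. Several points are assumptions of this sketch and not statements from the text: the classes are crisp, so the lower approximation aggregates $1 - R(x, y)$ over instances outside the class and the upper approximation aggregates $R(x, y)$ over instances inside it; the similarity values are precomputed and passed in; and the exponential weights take the form $2^{p-i}/(2^p - 1)$. The function names are hypothetical, and the weights of (4.2) used in the other combination are not covered.

```python
def exp_weights_max(p):
    """Exponential OWA weights for a soft maximum (largest weight first).

    Assumed form: w_i = 2^(p - i) / (2^p - 1) for i = 1..p, which sums to 1.
    """
    total = 2 ** p - 1
    return [2 ** (p - i) / total for i in range(1, p + 1)]


def owa(values, weights):
    """OWA aggregation: weights are applied to the values sorted descending."""
    ordered = sorted(values, reverse=True)
    return sum(w * v for w, v in zip(weights, ordered))


def mem(sim_to_class, sim_to_rest):
    """Average of the OWA based lower and upper approximation memberships.

    sim_to_class: similarities R(x, y) for training instances y in C_i.
    sim_to_rest:  similarities R(x, y) for training instances y in co(C_i).
    """
    # Upper approximation: soft maximum of the similarities to class members.
    upper = owa(sim_to_class, exp_weights_max(len(sim_to_class)))
    # Lower approximation: soft minimum of 1 - R(x, y) over the complement,
    # obtained by reversing the weights so the smallest values weigh most.
    lower_weights = exp_weights_max(len(sim_to_rest))[::-1]
    lower = owa([1 - s for s in sim_to_rest], lower_weights)
    return (lower + upper) / 2


if __name__ == "__main__":
    # Made-up similarities of x to two class members and two non-members.
    print(mem([1.0, 0.5], [0.0, 0.5]))
```

In this toy call the upper approximation aggregates $|C_i| = 2$ values and the lower approximation aggregates $|\mathrm{co}(C_i)| = 2$ values, matching the set sizes stated in the text.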

Negative affinity. The second global summary term measures the instance-to-class affinity at a higher level and evaluates how strongly the $\mathrm{mem}(x, \cdot)$ values resemble the expected values for instances belonging to particular classes. In order to do so, a signature vector $S_{C_i}$ ($i = 1, \ldots, m$) consisting of these expected values is constructed for each class. The class signature corresponds to a decision template [271]. Vector $S_{C_i}$ has size $m$ and position $S_{C_i}(C_j)$ corresponds to the average membership value $\mathrm{mem}(y, C_j)$ of instances $y \in C_i$. In particular,

$$S_{C_i} = \langle S_{C_i}(C_1), S_{C_i}(C_2), \ldots, S_{C_i}(C_m) \rangle = \left\langle \frac{1}{|C_i|} \sum_{y \in C_i} \mathrm{mem}(y, C_1),\ \frac{1}{|C_i|} \sum_{y \in C_i} \mathrm{mem}(y, C_2),\ \ldots,\ \frac{1}{|C_i|} \sum_{y \in C_i} \mathrm{mem}(y, C_m) \right\rangle.$$

The $\mathrm{mem}(x, \cdot)$ values can be grouped in a similar vector $S_x$ and compared to the $S_{C_i}$ vectors. The distance between $S_x$ and the class signatures is measured by the mean squared error as

$$\mathrm{mse}(x, C_i) = \frac{1}{m} \sum_{j=1}^{m} \left( \mathrm{mem}(x, C_j) - S_{C_i}(C_j) \right)^2$$

and expresses to what extent $x$ is dissimilar to the training instances of class $C_i$. The dissimilarity property implies that a negative class affinity is evaluated by this measure. A high $\mathrm{mse}(x, C_i)$ value indicates a large distance between $S_x$ and the expected class membership values for instances in class $C_i$. The inclusion of the second affinity term is motivated by the following example. Consider a dataset with three classes, of which classes $C_1$ and $C_2$ have a high overlap in feature space. Since the definitions of the fuzzy rough lower and upper approximations strongly rely on instance similarity values, the membership values $\mathrm{mem}(x, C_1)$ and $\mathrm{mem}(x, C_2)$ can be expected to be close together. As a consequence, $\mathrm{mem}(x, C_1)$ and $\mathrm{mem}(x, C_2)$ may not suffice to decide between classes. The signature vectors $S_{C_1}$ and $S_{C_2}$ contain the expected membership values of instances of classes $C_1$ and $C_2$ to all classes and comparing $\mathrm{mem}(x, C_3)$ to $S_{C_1}(C_3)$ and $S_{C_2}(C_3)$ can provide a vital clue in the class decision process. Before including the $\mathrm{mse}(x, \cdot)$ values in the $V_x$ vector, we scale them by dividing them by their total sum, defining

$$\mathrm{msen}(x, C_i) = \frac{\mathrm{mse}(x, C_i)}{\sum_{j=1}^{m} \mathrm{mse}(x, C_j)}.$$

The $\mathrm{msen}(x, \cdot)$ values are used as summary terms. Since they measure a negative class affinity, we subtract them from the values in $V_x$ with weight $\frac{1}{m}$. This factor is inversely proportional to the number of classes in the dataset. We can expect the information measured by the $\mathrm{msen}(x, \cdot)$ values to be less reliable when the number of classes increases. For larger values of $m$, the size of the signature vectors $S_{C_i}$ increases and the constituent membership degrees become more similar. As a result, the class distinction power of the mean squared error is reduced.
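The construction of the negative affinity term can be sketched in a few lines of Python. The sketch assumes that the $\mathrm{mem}(y, \cdot)$ values of all training instances are already available as rows of a list, that class labels are integers $0, \ldots, m-1$, and that every class has at least one training instance. The function names are hypothetical, and the fallback for the case in which all mean squared errors are zero is a choice of this sketch, since the text does not discuss it.

```python
def class_signatures(mem_train, labels, m):
    """Return the signatures S_Ci as a list of m vectors of length m.

    mem_train[k][j] is mem(y_k, C_j); labels[k] is the class index of y_k.
    Assumption: every class occurs at least once in labels.
    """
    sums = [[0.0] * m for _ in range(m)]
    counts = [0] * m
    for row, label in zip(mem_train, labels):
        counts[label] += 1
        for j in range(m):
            sums[label][j] += row[j]
    return [[s / counts[i] for s in sums[i]] for i in range(m)]


def mse(mem_x, signature):
    """Mean squared error between S_x and one class signature."""
    m = len(mem_x)
    return sum((a - b) ** 2 for a, b in zip(mem_x, signature)) / m


def msen(mem_x, signatures):
    """Mean squared errors of x to all signatures, scaled to sum to 1."""
    errors = [mse(mem_x, s) for s in signatures]
    total = sum(errors)
    if total == 0:
        # Not covered in the text: x matches every signature exactly.
        return [0.0] * len(errors)
    return [e / total for e in errors]


if __name__ == "__main__":
    # Made-up membership degrees for three training instances, m = 2.
    mem_train = [[0.8, 0.2], [0.6, 0.4], [0.1, 0.9]]
    labels = [0, 0, 1]
    signatures = class_signatures(mem_train, labels, 2)
    print(signatures)
    print(msen([0.7, 0.3], signatures))
```

In the toy data the test instance has the same membership profile as the signature of the first class, so almost all of the normalized error is attributed to the second class.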

Summary. WV-FROST modifies the $V_x$ vector constructed by WV by including two summary terms. For each class $C_i$, value $V_x(C_i)$ is replaced by the affinity based alternative

$$AV_x(C_i) = \frac{V_x(C_i) + \mathrm{mem}(x, C_i)}{2} - \frac{1}{m}\,\mathrm{msen}(x, C_i), \tag{4.5}$$

representing the aggregated score for class $C_i$. Figure 4.3 visually presents the internal con-

Figure 4.3: Computation of the class scores $AV_x(\cdot)$ by WV-FROST.

that the inclusion of both summary terms leads to superior classification results. The final predicted class label for test instance $x$ is obtained as the class corresponding to the maximum $AV_x(\cdot)$ value, that is,

$$l(x) = \arg\max_{i = 1, \ldots, m} AV_x(C_i).$$
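The aggregation in (4.5) and the final class prediction can be illustrated with the short Python sketch below. It assumes that the three vectors $V_x$, $\mathrm{mem}(x, \cdot)$ and $\mathrm{msen}(x, \cdot)$ have already been computed and are passed as lists of length $m$; the function names and the numbers in the usage example are made up for illustration, and ties are broken by the lowest class index, which the text does not specify.

```python
def wv_frost_scores(v, mem_x, msen_x):
    """Return AV_x following (4.5): (V_x + mem) / 2 - msen / m per class."""
    m = len(v)
    return [(v[i] + mem_x[i]) / 2 - msen_x[i] / m for i in range(m)]


def wv_frost_predict(v, mem_x, msen_x):
    """Return the index of the class with the maximal AV_x value."""
    scores = wv_frost_scores(v, mem_x, msen_x)
    # Ties go to the lowest class index (an assumption of this sketch).
    return max(range(len(scores)), key=scores.__getitem__)


if __name__ == "__main__":
    # Made-up inputs for m = 3 classes.
    v = [0.50, 0.45, 0.10]       # WV votes: class 0 narrowly leads
    mem_x = [0.60, 0.62, 0.20]   # positive affinity: classes 0 and 1 are close
    msen_x = [0.50, 0.10, 0.40]  # negative affinity: x is far from signature 0
    print(wv_frost_scores(v, mem_x, msen_x))
    print(wv_frost_predict(v, mem_x, msen_x))
```

In this toy example plain WV would select the first class, whereas the penalty from the negative affinity term moves the decision to the second class, which mirrors the overlapping-classes scenario used to motivate that term.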

The inclusion of the global summary terms $\mathrm{mem}(x, \cdot)$ and $\mathrm{msen}(x, \cdot)$ results in a dynamic aggregation procedure combating the classifier competence problem. WV-FROST differs from the Dynamic OVO and DRCW-OVO methods described in Section 4.2.2, because it leaves the score-matrix unchanged. We do not modify the matrix, but instead change the aggregated values by adding more information to them.

In document Dealing with imbalanced and weakly labelled data in machine learning using fuzzy and rough set methods (Page 96-99)