
6.3 Fuzzy multi-instance classifiers

6.3.1 Proposed classifiers

We define a fuzzy multi-instance classifier as a mapping

$$f : \mathbb{N}^{\mathbb{X}} \rightarrow \mathcal{C} : X \mapsto \arg\max_{C \in \mathcal{C}} \, [C(X)], \tag{6.1}$$

with $\mathbb{X}$ the instance space, $\mathbb{N}^{\mathbb{X}}$ the bag space (defined as multi-sets of instances) and $\mathcal{C}$ the set of possible classes. The bag X is assigned to the class C for which its membership degree C(X) is largest. In case of a tie for the maximum value, one of the tied classes is randomly selected. We consider two approaches to derive the C(X) values:

• Instance-based fuzzy multi-instance classifiers (IFMIC family, Section 6.3.1.1): the value C(X) is derived from the class membership degrees C(x) of the instances x in bag X.

• Bag-based fuzzy multi-instance classifiers (BFMIC family, Section 6.3.1.2): the value C(X) is derived directly from bag information. In particular, it relies strongly on the similarity between X and the training bags.

Both classifier families require the definition of class membership degrees, either for instances or bags. These values are determined based on the training data and we introduce ways to do so in the sections below.
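As an illustration, the decision rule (6.1) with random tie-breaking can be sketched in a few lines of Python. This is a minimal sketch, not our implementation: the function name `predict_class` and the example membership degrees are hypothetical, and the membership degrees C(X) are assumed to have been computed beforehand by one of the two classifier families.

```python
import random


def predict_class(memberships, rng=None):
    """Return the class C with the largest membership degree C(X).

    `memberships` maps each class label to the membership degree of the bag.
    Ties for the maximum are broken uniformly at random, as stated for (6.1).
    """
    rng = rng or random.Random()
    best = max(memberships.values())
    tied = [label for label, degree in memberships.items() if degree == best]
    return rng.choice(tied)


# Hypothetical membership degrees of one bag X to three classes.
degrees = {"A": 0.42, "B": 0.77, "C": 0.77}
print(predict_class(degrees, random.Random(0)))  # one of the tied classes "B" or "C"
```

Passing a seeded `random.Random` instance makes the tie-breaking reproducible across runs.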

6.3.1.1 The IFMIC family

Our instance-based methods derive the membership degree C(X) in (6.1) from the membership degree of instances to class C. We therefore need to specify how to calculate the membership degree C(x) of an instance x to a class C. Prior to this, we introduce different ways to determine the membership degree (or, perhaps more appropriately, the degree of affinity) B(x) of instance x to training bag B. The calculation of C(x) relies on the B(x) values for training bags B belonging to class C. Interpreting bags B as fuzzy sets allows us to express how typical x is for this bag, that is, how strongly it relates to it. This differs from merely determining whether an instance belongs to a bag or not. We now discuss the different components of our IFMIC classifiers in their calculation of the C(X) values. These are summarized in Table 6.2.

Table 6.2: Settings for the IFMIC methods evaluated within our proposed framework. Set T_C contains all training bags belonging to class C.

Code        B(x)
Max         max_{b ∈ B} R_I(x, b)
MaxExp      OWA_{W_U^exp}({R_I(x, b) | b ∈ B})
MaxInvadd   OWA_{W_U^invadd}({R_I(x, b) | b ∈ B})
MaxAdd      OWA_{W_U^add}({R_I(x, b) | b ∈ B})
Avg         (1/|B|) Σ_{b ∈ B} R_I(x, b)

Code        C(x)
Max         max_{B ∈ T_C} B(x)
MaxExp      OWA_{W_U^exp}({B(x) | B ∈ T_C})
MaxInvadd   OWA_{W_U^invadd}({B(x) | B ∈ T_C})
MaxAdd      OWA_{W_U^add}({B(x) | B ∈ T_C})
Avg         (1/|T_C|) Σ_{B ∈ T_C} B(x)

Code        C(X)
Max         max_{x ∈ X} C(x)
MaxExp      OWA_{W_U^exp}({C(x) | x ∈ X})
MaxInvadd   OWA_{W_U^invadd}({C(x) | x ∈ X})
MaxAdd      OWA_{W_U^add}({C(x) | x ∈ X})
Avg         (1/|X|) Σ_{x ∈ X} C(x)

Membership B(x) of instances to bags. The definitions of the bag affinity values rely on an instance similarity relation R_I(·,·). This fuzzy relation measures the degree of similarity between pairs of instances. Due to the high-dimensional nature of many multi-instance datasets, we work with two different relations. When the number of features is at most 20, R_I(·,·) coincides with our standard instance similarity relation (3.13). When the number of features exceeds this threshold, we use the cosine similarity relation

$$R_I(x_1, x_2) = \frac{x_1 \cdot x_2}{\|x_1\| \, \|x_2\|} \tag{6.2}$$

and rescale this value to the unit interval. The cosine similarity has been shown to be the most appropriate similarity measure to use in high-dimensional multi-instance datasets related to text processing and image retrieval (e.g. [369]), examples of which are included in our experiments. The cut-off value of 20 features is based on the experimental study of [6].

Given relation R_I(·,·), we can define B(x) in several ways. Table 6.2 lists the five possibilities evaluated in this chapter. They represent several intuitive choices that one can make. A first option is to set B(x) to the maximum instance similarity of x to one of the instances b ∈ B (Max). Alternatively, we can compute B(x) as the average of the R_I(x, b) values (Avg). We also include three OWA aggregations that lie between taking the maximum and the average. We aggregate the R_I(x, b) values (for b ∈ B) with W_U^exp (MaxExp), W_U^invadd (MaxInvadd) or W_U^add (MaxAdd). As shown in Section 3.2, these weight vectors correspond to softened maxima.
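The five options for B(x) can be sketched as follows. This is a minimal sketch under stated assumptions rather than a definitive implementation. The rescaling of the cosine similarity via (cos + 1)/2 is an assumption, since the text only states that the value is rescaled to the unit interval. The weight vectors in `upper_weights` are assumed forms of the exponential, inverse additive and additive weights of Section 3.2.1, which is not reproduced here; the assumed additive form is consistent with the orness value 2/3 quoted later in this section. All function names are hypothetical.

```python
import numpy as np


def cosine_similarity_unit(x1, x2):
    """Cosine similarity (6.2), mapped from [-1, 1] to [0, 1] (assumed rescaling)."""
    cos = float(np.dot(x1, x2) / (np.linalg.norm(x1) * np.linalg.norm(x2)))
    return (cos + 1.0) / 2.0


def upper_weights(kind, p):
    """Assumed W_U vectors of length p: decreasing weights that sum to one."""
    i = np.arange(1, p + 1)
    if kind == "exp":
        raw = 0.5 ** (i - 1)                # proportional to 2^(p-i)
    elif kind == "invadd":
        raw = 1.0 / i                       # proportional to 1/i
    elif kind == "add":
        raw = (p - i + 1).astype(float)     # proportional to p-i+1
    else:
        raise ValueError(f"unknown weight scheme: {kind}")
    return raw / raw.sum()


def owa(values, weights):
    """OWA aggregation: the i-th weight multiplies the i-th largest value."""
    ordered = np.sort(np.asarray(values, dtype=float))[::-1]
    return float(np.dot(ordered, weights))


def bag_affinity(x, bag, code="Max", similarity=cosine_similarity_unit):
    """B(x): affinity of instance x to bag B, using one of the codes of Table 6.2."""
    sims = [similarity(x, b) for b in bag]
    if code == "Max":
        return max(sims)
    if code == "Avg":
        return sum(sims) / len(sims)
    kind = {"MaxExp": "exp", "MaxInvadd": "invadd", "MaxAdd": "add"}[code]
    return owa(sims, upper_weights(kind, len(sims)))


# Toy bag with three two-dimensional instances (hypothetical data).
bag = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([-1.0, 0.0])]
x = np.array([1.0, 0.0])
for code in ("Max", "MaxExp", "MaxInvadd", "MaxAdd", "Avg"):
    print(code, round(bag_affinity(x, bag, code), 4))
```

Note that the length of the weight vector is set to the size of the bag, so bags of different sizes use different weight vectors.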

Membership C(x) of instances to classes. Having determined the values B(x) for all training bags B, we consider the same five ways to aggregate these values to C(x) membership degrees. We define the set T_C as the set containing all training bags belonging to class C. For each bag B ∈ T_C, the value B(x) can be computed. Based on these results, C(x) is determined as their maximum (Max), average (Avg) or OWA aggregation (MaxExp, MaxInvadd or MaxAdd).

Membership C(X) of bags to classes. To finally determine the membership degree C(X) of bag X to class C in (6.1), the IFMIC methods aggregate the instance membership degrees C(x) for x ∈ X to one value using the same five alternatives as listed for the B(x) and C(x) calculations. In our experiments, we evaluate the use of the maximum (Max), softened maxima (MaxExp, MaxInvadd or MaxAdd) or the average (Avg).
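The three aggregation levels of an IFMIC method (B(x), then C(x), then C(X)) can be chained as in the sketch below. It is illustrative only: `toy_similarity` is a hypothetical stand-in for R_I (it is not relation (3.13)), the OWA weights are assumed forms of the vectors of Section 3.2.1, the data are invented, and ties in the final prediction are resolved by dictionary order rather than at random to keep the example short.

```python
import numpy as np


def aggregate(values, code):
    """Aggregate values with Max, Avg or an OWA-based softened maximum."""
    v = np.sort(np.asarray(values, dtype=float))[::-1]   # decreasing order
    i = np.arange(1, len(v) + 1)
    if code == "Max":
        return float(v[0])
    if code == "Avg":
        return float(v.mean())
    # Assumed (unnormalised) W_U weights; the vector length follows the input.
    raw = {"MaxExp": 0.5 ** (i - 1),
           "MaxInvadd": 1.0 / i,
           "MaxAdd": len(v) - i + 1.0}[code]
    return float(np.dot(v, raw / raw.sum()))


def toy_similarity(x1, x2):
    """Hypothetical R_I: one minus the mean absolute difference (features in [0, 1])."""
    return 1.0 - float(np.mean(np.abs(np.asarray(x1) - np.asarray(x2))))


def ifmic_membership(X, training_bags, code_B="Max", code_Cx="Max",
                     code_CX="Avg", sim=toy_similarity):
    """C(X) for one class, given the training bags T_C of that class."""
    c_of_x = []
    for x in X:
        # B(x) for every training bag B in T_C, then C(x) from these values.
        b_of_x = [aggregate([sim(x, b) for b in B], code_B) for B in training_bags]
        c_of_x.append(aggregate(b_of_x, code_Cx))
    return aggregate(c_of_x, code_CX)


def ifmic_predict(X, bags_by_class, **codes):
    """Apply (6.1); ties are resolved by dictionary order in this sketch."""
    scores = {label: ifmic_membership(X, bags, **codes)
              for label, bags in bags_by_class.items()}
    return max(scores, key=scores.get), scores


# Hypothetical training data: two bags per class, two features per instance.
bags_by_class = {
    "pos": [[[0.9, 0.8], [0.7, 0.9]], [[0.8, 0.8]]],
    "neg": [[[0.1, 0.2], [0.2, 0.1]], [[0.0, 0.3]]],
}
X = [[0.85, 0.9], [0.6, 0.7]]
print(ifmic_predict(X, bags_by_class, code_B="MaxAdd", code_Cx="Max", code_CX="Avg"))
```

Because the three codes are chosen independently, this structure covers every combination of the settings listed in Table 6.2.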

6.3.1.2 The BFMIC family

The bag-based definition of our fuzzy multi-instance classifiers relies on a bag similarity measure R(·,·). The class membership degrees C(X) are based on the similarity between X and training bags B of class C. All settings for the bag similarity relation and the class membership degree aggregation evaluated in this chapter are summarized in Table 6.3.

Bag-wise similarity R(X, B). The similarity between bags X and B is based on the distance δ(·,·) between pairs of their instances, which we define based on the instance similarity relation. In particular,

$$(\forall x \in X)(\forall b \in B)\,\bigl(\delta(x, b) = 1 - R_I(x, b)\bigr).$$

We consider eight bag similarity functions, which can be divided into two groups of four. The first group (consisting of H, HExp, HInvadd and HAdd) is based on the Hausdorff distance [135] between two bags. The first option is to set the similarity between two bags to the complement to one of this distance measure (H). The three other versions replace the maximum and minimum operators in this definition by OWA operators using exponential (HExp), inverse additive (HInvadd) or additive (HAdd) weights. The second group is based on the average Hausdorff distance [483]. The bag similarity based on this measure is given by AvgH. As above, we replace the maximum and minimum operators by OWA aggregations in the three modified versions AvgHExp, AvgHInvadd and AvgHAdd.
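The eight bag similarity functions can be sketched from a matrix of pairwise instance distances. The sketch below is a hedged illustration: the function names are hypothetical, the W_U weights are assumed forms of the vectors of Section 3.2.1, and W_L is assumed to be the reversed W_U vector, so that it acts as a softened minimum.

```python
import numpy as np


def weights(kind, p, upper=True):
    """Assumed OWA weights of length p: W_U (decreasing) or W_L (its reverse)."""
    i = np.arange(1, p + 1)
    raw = {"exp": 0.5 ** (i - 1), "invadd": 1.0 / i, "add": p - i + 1.0}[kind]
    w = raw / raw.sum()
    return w if upper else w[::-1]


def owa(values, w):
    """OWA aggregation: the i-th weight multiplies the i-th largest value."""
    return float(np.dot(np.sort(np.asarray(values, dtype=float))[::-1], w))


def bag_similarity(D, code="H"):
    """R(X, B) from D[i, j] = delta(x_i, b_j) = 1 - R_I(x_i, b_j).

    Supported codes: H, HExp, HInvadd, HAdd, AvgH, AvgHExp, AvgHInvadd, AvgHAdd.
    """
    D = np.asarray(D, dtype=float)
    averaged = code.startswith("Avg")
    kind = (code[4:] if averaged else code[1:]).lower()   # "" for H and AvgH
    if kind:
        def soft_min(v):
            return owa(v, weights(kind, len(v), upper=False))

        def soft_max(v):
            return owa(v, weights(kind, len(v), upper=True))
    else:
        soft_min, soft_max = np.min, np.max
    row_mins = [soft_min(row) for row in D]      # per x: (softened) min over b in B
    col_mins = [soft_min(col) for col in D.T]    # per b: (softened) min over x in X
    if averaged:
        total = sum(row_mins) + sum(col_mins)
        return float(1.0 - total / (D.shape[0] + D.shape[1]))
    return float(1.0 - max(soft_max(row_mins), soft_max(col_mins)))


# Hypothetical distances between a bag X (3 instances) and a bag B (2 instances).
D = [[0.1, 0.6],
     [0.5, 0.2],
     [0.9, 0.4]]
for code in ("H", "HExp", "HInvadd", "HAdd", "AvgH", "AvgHExp", "AvgHInvadd", "AvgHAdd"):
    print(code, round(bag_similarity(D, code), 4))
```

In the OWA-based versions, the inner aggregation uses a weight vector whose length equals the size of the bag it ranges over, while the outer aggregation of the H group uses the size of the other bag.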

Membership C(X) of bags to classes. The aggregation of the R(X,·) values to a C(X) membership degree is handled by one of the same five alternatives as used in our IFMIC methods. We consider the use of the average similarity (Avg) as well as the maximum similarity (Max) of X with a bag B in class C. We also include the softened maxima alternatives MaxExp, MaxInvadd and MaxAdd.

We note that our proposed bag-based classifiers exhibit a link with a nearest neighbour approach to multi-instance classification. In case of version Max for the C(X) calculations, the method computes the membership degrees of a test bag to the decision classes based on its most similar (nearest) training bag of each class. Since the final class prediction is made using (6.1), the class label of the overall nearest bag is automatically predicted. Consequently, the use of Max reduces our BFMIC proposal to a one-nearest neighbour multi-instance classifier, where the distance measure is taken as the complement of the bag similarity R(·,·). When one of the OWA-based alternatives for C(X) is selected instead, BFMIC takes a step away from the true nearest neighbour paradigm. For each class C, all training bags belonging to C contribute to the membership degree estimation of X to C. Their contribution depends not only on their similarity with the bag to classify, but also on the number of bags in C, since the class size determines the length of the OWA weight vector. As part of our experimental study, we show that our proposed methods outperform the most prominent nearest neighbour multi-instance classifier CitationKNN [429].
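The link with the nearest neighbour paradigm can be made concrete with a small sketch. It assumes the bag similarities R(X, B) to all training bags are already available; the similarity values, labels and function names are hypothetical, and the OWA weights are assumed forms of the vectors of Section 3.2.1.

```python
import numpy as np


def aggregate(values, code):
    """Aggregate values with Max, Avg or an OWA-based softened maximum."""
    v = np.sort(np.asarray(values, dtype=float))[::-1]   # decreasing order
    i = np.arange(1, len(v) + 1)
    if code == "Max":
        return float(v[0])
    if code == "Avg":
        return float(v.mean())
    raw = {"MaxExp": 0.5 ** (i - 1),
           "MaxInvadd": 1.0 / i,
           "MaxAdd": len(v) - i + 1.0}[code]              # assumed W_U weights
    return float(np.dot(v, raw / raw.sum()))


def bfmic_predict(similarities, labels, code="Max"):
    """similarities[k] = R(X, B_k) for training bag B_k with class labels[k]."""
    scores = {}
    for label in sorted(set(labels)):
        in_class = [s for s, l in zip(similarities, labels) if l == label]
        scores[label] = aggregate(in_class, code)   # C(X) for this class
    return max(scores, key=scores.get), scores


# Hypothetical similarities of a test bag X to five training bags.
similarities = [0.9, 0.1, 0.1, 0.8, 0.8]
labels = ["A", "A", "A", "B", "B"]

nearest_label = labels[int(np.argmax(similarities))]
print("1-NN label:", nearest_label)
print("Max:", bfmic_predict(similarities, labels, "Max"))
print("MaxAdd:", bfmic_predict(similarities, labels, "MaxAdd"))
```

In this toy example the Max setting follows the single nearest bag, whereas MaxAdd also accounts for the remaining bags of each class and can therefore prefer a class whose bags are consistently similar to X.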

6.3.1.3 Discussion

From the descriptions given in the two preceding sections and Tables 6.2-6.3, it should be clear that in various cases an aggregation step can be set to the maximum or average of a group of values. Both correspond to intuitive aggregation approaches. The strict maximum assigns weight one to a single value and weight zero to all others, while the average involves all values in its calculation, effectively assigning equal weight to all. We consider intermediate options by using the OWA-based alternatives that model the trade-off between the maximum and average.

Table 6.3: Settings for the BFMIC methods evaluated within our proposed framework. Set T_C contains all training bags belonging to class C.

Code        R(X, B)
H           1 − max( max_{x ∈ X} min_{b ∈ B} δ(x, b), max_{b ∈ B} min_{x ∈ X} δ(x, b) )
HExp        1 − max[ OWA_{W_U^exp}({OWA_{W_L^exp}({δ(x, b) | b ∈ B}) | x ∈ X}),
                     OWA_{W_U^exp}({OWA_{W_L^exp}({δ(x, b) | x ∈ X}) | b ∈ B}) ]
HInvadd     1 − max[ OWA_{W_U^invadd}({OWA_{W_L^invadd}({δ(x, b) | b ∈ B}) | x ∈ X}),
                     OWA_{W_U^invadd}({OWA_{W_L^invadd}({δ(x, b) | x ∈ X}) | b ∈ B}) ]
HAdd        1 − max[ OWA_{W_U^add}({OWA_{W_L^add}({δ(x, b) | b ∈ B}) | x ∈ X}),
                     OWA_{W_U^add}({OWA_{W_L^add}({δ(x, b) | x ∈ X}) | b ∈ B}) ]
AvgH        1 − (1/(|X| + |B|)) ( Σ_{x ∈ X} min_{b ∈ B} δ(x, b) + Σ_{b ∈ B} min_{x ∈ X} δ(x, b) )
AvgHExp     1 − (1/(|X| + |B|)) ( Σ_{x ∈ X} OWA_{W_L^exp}({δ(x, b) | b ∈ B})
                                  + Σ_{b ∈ B} OWA_{W_L^exp}({δ(x, b) | x ∈ X}) )
AvgHInvadd  1 − (1/(|X| + |B|)) ( Σ_{x ∈ X} OWA_{W_L^invadd}({δ(x, b) | b ∈ B})
                                  + Σ_{b ∈ B} OWA_{W_L^invadd}({δ(x, b) | x ∈ X}) )
AvgHAdd     1 − (1/(|X| + |B|)) ( Σ_{x ∈ X} OWA_{W_L^add}({δ(x, b) | b ∈ B})
                                  + Σ_{b ∈ B} OWA_{W_L^add}({δ(x, b) | x ∈ X}) )

Code        C(X)
Max         max_{B ∈ T_C} R(X, B)
MaxExp      OWA_{W_U^exp}({R(X, B) | B ∈ T_C})
MaxInvadd   OWA_{W_U^invadd}({R(X, B) | B ∈ T_C})
MaxAdd      OWA_{W_U^add}({R(X, B) | B ∈ T_C})
Avg         (1/|T_C|) Σ_{B ∈ T_C} R(X, B)

For the OWA weights, we use the weight vectors studied in Section 3.2.1. As discussed in Chapter 3, the selection of an OWA weight vector is not always straightforward. Several automatic procedures have been proposed for this purpose (see Section 3.2.2). Many of them are optimization methods, optimizing a certain objective function (e.g. entropy) for a user-specified orness value. Within the scope of this study, fixing the orness of the weight vector beforehand feels arbitrary and we therefore opt not to use such an optimization algorithm. We also avoid them in the interest of computational cost. For a given orness value, existing optimization methods do not provide the user with a closed formula to determine the weights, but rather with a particular weight set of the specified length. As should be clear from the descriptions given in Tables 6.2-6.3, the vector lengths are not fixed in our methods. For instance, in the MaxAdd variant of B(x) in Table 6.2, the length of the weight vector equals the size of bag B. Since all bags in a multi-instance dataset can contain a different number of instances, the length of all weight vectors can be different. The variable length implies the requirement of multiple runs of the optimization procedure, imposing a considerable cost. Furthermore, these methods become practically intractable when the vector lengths increase. The procedure proposed in [168], for instance, relies on the calculation of the roots of a polynomial equation with degree equal to the length of the weight vector. As shown in the experiments conducted later in this chapter, we can expect the vector lengths, and therefore the degree of these polynomial equations, to be high (100 and above).

We have decided to use the fixed OWA weight vectors recalled in Section 3.2.1, namely the exponential, inverse additive and additive weights. As discussed at length in Chapter 3, these weighting schemes exhibit sufficiently different characteristics to model the transition from the maximum to the average. Based on the proof of Theorem 3.2.1, we know that orness(W_U^add) = 2/3 holds independent of the aggregation length. We find, based on the proofs of Theorems 3.2.2-3.2.3, that the orness values of weight vectors W_U^exp and W_U^invadd do depend on the aggregation length p. This constitutes an important difference between the three OWA aggregations, which is illustrated in our experimental study. We do not include our Mult alternative introduced in Section 3.2.2, because it only showed clear advantages on datasets with very specific properties.
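The dependence of the orness on the aggregation length can be checked numerically. The sketch below assumes the usual orness definition, orness(W) = (1/(p−1)) Σ_{i=1}^{p} (p−i) w_i, and assumed forms of the three W_U vectors of Section 3.2.1 (weights proportional to 2^(p−i), 1/i and p−i+1, respectively); the function names are hypothetical.

```python
import numpy as np


def upper_weights(kind, p):
    """Assumed W_U vectors of length p: decreasing weights that sum to one."""
    i = np.arange(1, p + 1)
    raw = {"exp": 0.5 ** (i - 1),        # proportional to 2^(p-i)
           "invadd": 1.0 / i,            # proportional to 1/i
           "add": p - i + 1.0}[kind]     # proportional to p-i+1
    return raw / raw.sum()


def orness(w):
    """Orness of an OWA weight vector of length p >= 2."""
    p = len(w)
    positions = p - np.arange(1, p + 1)  # p-1, p-2, ..., 0
    return float(np.dot(positions, w) / (p - 1))


for p in (2, 5, 20, 100):
    values = {kind: round(orness(upper_weights(kind, p)), 4)
              for kind in ("exp", "invadd", "add")}
    print(p, values)
```

Under these assumptions, the additive weights yield the same orness for every length, while the values for the exponential and inverse additive weights change with p.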

In document Dealing with imbalanced and weakly labelled data in machine learning using fuzzy and rough set methods (Page 138-142)