Combining Density and Diversity - The Exploration Guided Active Learning Algorithm

6.2 The Exploration Guided Active Learning Algorithm

6.2.3 Combining Density and Diversity

Density and diversity sampling both choose examples greedily, optimising locally, which can make them myopic approaches to selection in active learning. They can become trapped in local optima, which can result in poor performance globally. An example of density sampling's poor performance is evident in Figure 6.1a, which shows the performance of a density-based active learner on the Reuters dataset. Performance degrades until 200 or so examples have been labelled, at which point it improves rapidly. Figure 6.1b illustrates how this can happen. With density sampling, examples from class 1 in group A will be repeatedly selected for labelling while examples from class 2 will be ignored, leading to a poorly defined classification boundary during this time. When diversity alone is used, similarly dysfunctional scenarios can arise.

To overcome these problems, we introduce an element of diversity to a density-based sampling approach. Including diversity means that high-density examples that are close to labelled examples are not selected for labelling by the oracle.

To determine whether an example should be considered as a candidate for selection, we use a threshold β. If the similarity between an unlabelled example x_i and its nearest neighbour in the labelled set is greater than β, then x_i is not a candidate for selection. We call the set of examples that can be considered for selection the Candidate Set, CS, which we define as in Equation 6.4:

$$CS = \{\, x_i \in U \mid \mathrm{diversity}(x_i) \geq \beta \,\} \qquad (6.4)$$

Our EGAL selection strategy ranks the possible candidates for selection (i.e. those in CS) based on their density, and selects those examples with the highest density for labelling first. Thus, examples close to each other in the feature space will not be selected successively for labelling.
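The selection rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it applies the criterion as worded in the prose (an example is excluded when the similarity to its nearest labelled neighbour exceeds β), and the function names `candidate_set` and `select_by_density`, as well as the toy values, are hypothetical and not taken from the original work.

```python
def candidate_set(nearest_labelled_sim, beta):
    """Indices of unlabelled examples whose nearest labelled neighbour
    is no more similar than beta, i.e. examples in a 'diverse' region."""
    return [i for i, s in enumerate(nearest_labelled_sim) if s <= beta]


def select_by_density(candidates, density, b=1):
    """Return up to b candidates, highest density first."""
    ranked = sorted(candidates, key=lambda i: density[i], reverse=True)
    return ranked[:b]


# Toy values, made up for illustration: five unlabelled examples.
nearest_labelled_sim = [0.2, 0.9, 0.4, 0.8, 0.1]
density = [1.5, 3.0, 2.5, 2.8, 0.7]

cs = candidate_set(nearest_labelled_sim, beta=0.5)
chosen = select_by_density(cs, density, b=1)
print(cs, chosen)
```

In this toy run the two densest examples (indices 1 and 3) are never considered because they sit too close to a labelled example; the densest of the remaining candidates is chosen instead.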

Parameters α and β play an important role in the selection process. α controls the radius of the neighbourhood used in the estimation of density, while β controls the radius of the neighbourhood used in the construction of CS. The values selected for these parameters, especially the value of β, can significantly affect overall performance.
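To make the role of α concrete, the sketch below computes a neighbourhood-based density from a similarity matrix. Equation 6.1 is not reproduced in this section, so the form used here (the sum of similarities to all other examples that are at least α-similar) is an assumption, and the function name `density` and the toy matrix are hypothetical.

```python
def density(sim_row, i, alpha):
    """Assumed density of example i: the sum of its similarities to every
    other example whose similarity is at least alpha."""
    return sum(s for k, s in enumerate(sim_row) if k != i and s >= alpha)


# Toy symmetric similarity matrix, made up for illustration.
sim = [
    [1.0, 0.75, 0.25],
    [0.75, 1.0, 0.5],
    [0.25, 0.5, 1.0],
]

densities = [density(sim[i], i, alpha=0.5) for i in range(len(sim))]
print(densities)
```

A larger α shrinks the neighbourhood, so fewer neighbours contribute to each example's density.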

Cebron & Berthold (2008) proposed a method that considers both density and diversity information in selection; however, they set the parameters used in calculating density and diversity to positive constants, which is a static approach. In contrast to the static parameter setting of Shen et al. (2004) and Cebron & Berthold (2008), we set the parameter β dynamically. Initially, we set β = α as shown in Figure 6.2a, where shaded polygons represent labelled examples in L and circles represent unlabelled examples in U. The regions defined by α are shown as solid circles for a small number of unlabelled examples (A, B, C, D and E). For clarity of illustration, rather than showing the regions defined by β around every unlabelled example, we show them, as broken circles, around only the labelled examples. The effect, however, is the same: if a labelled example is within the neighbourhood of an unlabelled example defined by β, then the unlabelled example will also be within the neighbourhood of the labelled example defined by β.

In the example shown in Figure 6.2a, since examples B and D have labelled examples in the neighbourhood defined by β, they will not be added to CS. A, C and E, however, will be added. As more examples are labelled, we may reach a stage when there are no examples in the candidate set because every unlabelled example has a labelled example within the neighbourhood defined by β. This scenario is shown in Figure 6.2b. When this happens we need to increase β to shrink this neighbourhood, as shown in Figure 6.2c. We update β when we have no examples left in CS, which is a unique feature of our approach as far as we are aware.

[Figure 6.2: (a) α = β and CS ≠ ∅; (b) α = β and CS = ∅; (c) α ≠ β and CS ≠ ∅]

We use a novel method to update β motivated by a desire to be able to set the size of CS. As the size of CS is defined by β, a bigger β value, which defines a smaller neighbourhood, gives us a bigger candidate set. We set β to a value which gives us a candidate set with a size proportional to the number of elements available for labelling (i.e. the size of the unlabelled pool U), as detailed below:

(i) Calculate the similarity between each unlabelled example and its nearest labelled neighbour, giving the set S, as follows:

$$S = \{\, s_i = \mathrm{diversity}(x_i) \mid x_i \in U \,\}$$

(ii) Sort the similarities s_i (i = 0, 1, . . . , n) in S in ascending order and choose the value s_w from S that splits S into two, where

$$S_1 = \{\, s_i \in S \mid s_i \leq s_w \,\}, \qquad S_2 = \{\, s_j \in S \mid s_j > s_w \,\}$$

and

$$|S_1| = \lfloor w \times |S| \rfloor + 1, \qquad 0 \leq w \leq 1$$

(iii) Let β = s_w, which is the similarity value such that a proportion w of the unlabelled examples will be in diverse neighbourhoods of the feature space.
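Steps (i) to (iii) can be sketched as follows. This is a minimal illustration, not the original implementation: the function name `update_beta` is hypothetical, the input is assumed to be the list of nearest-labelled-neighbour similarities from step (i), and the clamp for w = 1 (where the stated size of S1 would exceed the size of S) is an added assumption.

```python
import math


def update_beta(nearest_labelled_sim, w):
    """Pick beta = s_w so that roughly a proportion w of the unlabelled
    pool has nearest-labelled similarity no greater than beta."""
    s = sorted(nearest_labelled_sim)        # step (ii): ascending order
    size_s1 = math.floor(w * len(s)) + 1    # |S1| as given in step (ii)
    size_s1 = min(size_s1, len(s))          # assumption: clamp so w = 1 stays in range
    return s[size_s1 - 1]                   # step (iii): beta = s_w, the largest value in S1


# Toy similarities, made up for illustration.
sims = [0.9, 0.2, 0.6, 0.4]
print(update_beta(sims, 0.0), update_beta(sims, 0.5), update_beta(sims, 1.0))
```

With tied similarity values, more than the requested number of examples can satisfy s_i ≤ s_w, so the resulting candidate set may be slightly larger than the target size.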

The proportion parameter w, which we call the balancing parameter, allows us to balance the influence of diversity and density in our selection strategy. When w = 0, the EGAL algorithm defaults to pure diversity-based sampling, discounting any density information. As w increases, the influence of density increases and the influence of diversity decreases, with more examples being added to CS. When w = 1, the EGAL algorithm becomes a purely density-based sampling algorithm. We explore the effect of changing the value of the balancing parameter w in Section 6.3.1.

The procedure of EGAL is summarised in Algorithm 4, where the batch size b is set to one. EGAL can be implemented very efficiently. At the start, the pairwise similarity matrix for the entire dataset and the individual density measure for every example are calculated and cached. At each iteration of the selection algorithm, the only calculation necessary is the updated diversity measure for each example in the unlabelled set, U. Computationally this is very efficient, especially compared with uncertainty-sampling-based methods, which must rebuild a classifier and classify every unlabelled example at each iteration of the active learning selection.
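One way to realise the per-iteration update cheaply is to keep, for every unlabelled example, its similarity to the nearest labelled example and refresh it against only the newly labelled examples. The sketch below assumes this incremental formulation, which the text does not spell out; the function name `update_nearest_labelled_sim` and the toy data are hypothetical.

```python
def update_nearest_labelled_sim(nearest, sim, unlabelled, newly_labelled):
    """Refresh each unlabelled example's similarity to its nearest labelled
    neighbour, reading only cached entries of the similarity matrix."""
    for i in unlabelled:
        for j in newly_labelled:
            if sim[i][j] > nearest[i]:
                nearest[i] = sim[i][j]
    return nearest


# Toy cached similarity matrix, made up for illustration.
sim = [
    [1.0, 0.2, 0.7, 0.4],
    [0.2, 1.0, 0.3, 0.9],
    [0.7, 0.3, 1.0, 0.5],
    [0.4, 0.9, 0.5, 1.0],
]

# Example 0 is labelled initially; examples 1, 2 and 3 are unlabelled.
nearest = {1: 0.2, 2: 0.7, 3: 0.4}

# Example 3 is then labelled, so it leaves the pool and the rest are refreshed.
del nearest[3]
update_nearest_labelled_sim(nearest, sim, unlabelled=[1, 2], newly_labelled=[3])
print(nearest)
```

Each iteration then costs one pass over the unlabelled pool per newly labelled example, with no classifier training involved.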

Input: An initial labelled set L, an unlabelled pool U of n examples, a stopping criterion SC, a batch size b, a balancing parameter w
Output: A labelled dataset

Compute the similarity matrix M of s(i, k), where x_i, x_k ∈ L ∪ U;
Set α = β = µ − 0.5 × δ, µ and δ being the mean and standard deviation of the similarity matrix M;
Calculate density for all the unlabelled examples x_i, i ∈ I_u, using Equation 6.1;
while SC is not met do
    CS = ∅, Selected = ∅;
    Construct the candidate set CS as in Equation 6.4;
    while |CS| = 0 do
        Update β; Update CS;
    end
    Rank examples in CS by descending density order;
    foreach t, t = 1 . . . b do
        if |CS| < b then
            Selected = Selected ∪ CS;
        else
            Select the top b ranked examples from CS with highest density and add them into Selected;
        end
    end
    Label each example x_i ∈ Selected;
    L = L ∪ Selected, U = U \ Selected;
end
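The whole procedure can be sketched end to end in plain Python. This is an illustrative reading of Algorithm 4 under several assumptions that are not from the original text: the function name `egal` is hypothetical; a label budget stands in for the stopping criterion SC; the mean and standard deviation are taken over off-diagonal similarities using the population standard deviation; density is assumed to be the sum of similarities to neighbours that are at least α-similar; the candidate test follows the prose rule (nearest-labelled similarity no greater than β); and the oracle is not modelled, so only the order of selected indices is returned.

```python
import math
import statistics


def egal(sim, labelled, unlabelled, budget, w=0.25, b=1):
    """Return the indices chosen for labelling, in selection order.
    Requires a non-empty initial labelled set."""
    labelled, unlabelled = list(labelled), list(unlabelled)
    n = len(sim)

    # alpha = beta = mean - 0.5 * std of the similarity matrix
    # (assumption: off-diagonal entries, population standard deviation).
    values = [sim[i][k] for i in range(n) for k in range(n) if i != k]
    alpha = beta = statistics.mean(values) - 0.5 * statistics.pstdev(values)

    # Density is computed once and cached (assumed form, see lead-in).
    density = {
        i: sum(sim[i][k] for k in range(n) if k != i and sim[i][k] >= alpha)
        for i in unlabelled
    }

    picked = []
    while unlabelled and len(picked) < budget:
        # Similarity of each unlabelled example to its nearest labelled one.
        nearest = {i: max(sim[i][j] for j in labelled) for i in unlabelled}

        cs = [i for i in unlabelled if nearest[i] <= beta]
        if not cs:
            # Candidate set is empty: raise beta to s_w and rebuild CS.
            s = sorted(nearest.values())
            size_s1 = min(math.floor(w * len(s)) + 1, len(s))
            beta = s[size_s1 - 1]
            cs = [i for i in unlabelled if nearest[i] <= beta]

        # Rank by descending density; take at most b (all if fewer remain).
        cs.sort(key=lambda i: density[i], reverse=True)
        selected = cs[:min(b, budget - len(picked))]

        picked.extend(selected)
        labelled.extend(selected)
        unlabelled = [i for i in unlabelled if i not in selected]
    return picked


# Toy similarity matrix, made up for illustration: examples 0-2 form one
# tight group, examples 3-4 another.
sim = [
    [1.0, 0.9, 0.8, 0.2, 0.1],
    [0.9, 1.0, 0.7, 0.3, 0.2],
    [0.8, 0.7, 1.0, 0.1, 0.2],
    [0.2, 0.3, 0.1, 1.0, 0.6],
    [0.1, 0.2, 0.2, 0.6, 1.0],
]

print(egal(sim, labelled=[0], unlabelled=[1, 2, 3, 4], budget=3, w=0.25))
print(egal(sim, labelled=[0], unlabelled=[1, 2, 3, 4], budget=3, w=1.0))
```

In this toy run both settings first pick an example from the group that has no labelled member. After that, the small w keeps favouring examples far from the labelled set, while w = 1 admits the whole pool into CS and falls back to ranking by density.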

In document Active Learning for Text Classification (Page 131-137)