Boosted decision trees - Supervised learning

2.1 Supervised learning

2.1.3 Boosted decision trees

The family of ML algorithms called boosting algorithms is based on the principle of

combining several learners, in order to produce one effective classifier. The typical boosting method starts by creating weak classifiers, models whose classification effectiveness is low. Weak classifiers are usually built learning only a few (limited) aspects of the training data, consequently their classification accuracy is limited. In boosting several weak classifiers are built with the final goal of combining them in a global strong classifier, that is able to deal with all the characteristics of the data.

One popular variant of boosting is AdaBoost, its name stands for Adaptive Boost- ing. In AdaBoost weak classifiers are built iteratively, each one is tuned in order to perform better on the data that the previous weak classifier has misclassified, the algorithm adapts itself to the training data.

A variant of AdaBoost is _MP-Boost [19], that is an improved version of _Ad-

aBoost.MH[72]. These two variants are optimized for multi-label classification, the

weak classifiers they create are decision trees, trained on specific terms. Each weak classifier is built around a pivot term and sets up its classification on the presence or the absence of the term in the document (binary feature weights). The improvement of MP-Boost, compared toAdaBoost.MH, is in how pivot terms are chosen. At each iteration, more pivot terms are selected, one for each class. The algorithm then constructs a set of weak classifiers at each iteration, each of them is optimized for a specific class.

MP-Boost algorithm

In MP-Boost weak classifiers are generated iteratively, so if S is the number of iterations, we have ˆΦ1_{. . .}_ΦˆS _{weak classifiers. At the end of each iteration the final} classifier is obtained summing up the weak classifiers ˆΦ = PS_s₌₁Φˆs_{, so both the}

2.1. SUPERVISED LEARNING

Input:

a training set T r, the number of iterationsS

Output: the classifier ˆΦ(di, cj) = PS s=1Φˆ s₍_d i, cj) Body: set P1(di, cj) = _|_{T r}1_|·|C| fors= 1, . . . , S do:

1. pass distributionPsto the weak learner;

2. get the weak classifier ˆΦs_{from the weak learner;} 3. setPs+1(di, cj) = Ps(di,cj)·exp(−Φ(di,cj)·Φˆsj) Zs j whereZs j = P|T r|

i=1 Ps(di, cj)·exp(−Φ(di, cj)·Φˆsj) is a normalization factor chosen so thatP|T r|

i=1

P|C|

j=1Ps+1(di, cj) = 1

Figure 2.2: The _MP-Boostalgorithm

confidence estimates and the classification decisions of ˆΦare obtained from the sum of the values of the single weak classifiers.

The algorithm (Figure 2.2) creates a weak classifier for each class at each iteration, then we have |C| weak classifiers at each iteration. For class cj, at iteration s, the classifier ˆΦs

j is built.MP-Boostupdates a distribution of weights on the training set

pairshdi, cjiat each iterations. We definePs+1as the distribution that is used by the algorithm at iterations+ 1. The value ofPs+1(di, cj) is updated in order to capture the effectiveness of the classifiers ˆΦ1_j. . .Φˆs_j in assigning classcj to the document di. The weights are proportional to the ability of the classifiers of correctly assigning the class, the higher the weight, the more difficult the assignment. The weak classifier

Φs_j+1, in order to better classify documents with higher weights, concentrates on those features that allows to better discriminate the classcj. Every weak learner at iteration

s+ 1 takes in input the training set and the distributionPs+1.

The initial distributionP1is uniform; at iterationsweights are updated according to: Ps+1(di, cj) = Ps(di, cj)·exp(−Φ(di, cj)·Φˆsj) Zs j where Z_js= |T r| X i=1 Ps(di, cj)·exp(−Φ(di, cj)·Φˆsj)

is a normalization factor, so Ps+1(di, cj) is a distribution, the sum of all the weight values is 1.

A weight is updated proportionally to its previous value, it is increased if the

CHAPTER 2. TEXT CLASSIFICATION

−Φ(di, cj)·Φˆsj is positive if the target function and the weak classifier have different sign, in this case the exponential functionexpis≥1.

MP-Boost weak classifiers

Text is represented with binary feature weights. The weightwkiof a document vector

di is equal to 1 if the term tk is contained in the document, 0 otherwise. A weak

classifier ˆΦs

j is a decision tree, with only one root and two leaves. It returns a specific real value if a term is present in the document and a different value if it is not. The term is calledpivot term, thus we have|C|pivot terms at each iteration s. The weak classifier is defined as ˆ Φs_j(di) = aj₀ifwj_ki= 0 aj₁ifwj_ki= 1

Creating a weak classifier means choosing a termtj_k and the valuesaj₀ andaj₁. These three parameters are obtained during the learning phase, that consists in minimizing the error of the weak classifier. This error can be measured by the normalization factorZs

In order to obtain ˆΦs_j these two steps are executed (at iterations): 1. Select among all ˆΦs

j that have a specific term as pivot term, the one which mini- mizesZs

j. In this step we create |T |weak classifiers, and for each one we choose the values aj₀andaj₁.

2. Select among all the best weak classifiers previously created, the one which mini- mizeZ_js. In this step we choose the unique pivot termtj_kfor the current iteration. The first step is critical, since we can not enumerate all the possible values ofaj₀ and

aj₁._MP-Boostdeals with this step using the same approach of_AdaBoost.MH; this approach is argued in [71], where the authors have proven that:

ˆ Φbest_j (k)(di) =      1 2log W₊₁0jk W₋0jk₁ ifw j ki= 0 1 2log W₊₁1jk W₋1jk₁ ifw j ki= 1

where best(k) denotes the best weak classifier for termtj_k, selected at the first step, and where W_bxjk= |tr| X i=1 Ps(di, cj)·Jw j ki=xK·JΦ(di, cj) =bK

forb∈ {−1,+1},x∈ {0,1}, and where_Jπ_Kis the characteristic function of predicate

π, so ifπis true the function returns 1, 0 otherwise. The valuesW_bxjk are equivalent to the weight, with respect to the distributionPs, of the documents that contain (or does not contain) the termtj_k, which are classified (or are not) withcj.

Reminding that ˆΦs(di, cj)≡Φˆsj(di), the final classifier is ˆΦ(di, cj) = PS

s=1Φˆ

s₍_d i, cj),

where S is the number of iterations (the only free parameter of MP-Boost algo-

rithm). The classification values are thus a combination of the valuesaj₀andaj₁of the pivot terms of each iteration.

In document Semi-Automated Text Classification (Page 30-33)