
Number of estimators

Algorithm 4.2 Gradient boosting algorithm

1: function GradientBoosting($L = \{(x_i, y_i) \in X \times Y\}_{i=1}^{n}$; $\ell$; $H$; $M$)
2:   $f_0(x) = \rho_0 = \arg\min_{\rho \in \mathbb{R}} \sum_{i=1}^{n} \ell(y_i, \rho)$.
3:   for $m = 1$ to $M$ do
4:     Compute the loss gradient for the training set points
         $g_m^i = \left[ \frac{\partial}{\partial y'} \ell(y_i, y') \right]_{y' = f_{m-1}(x_i)} \quad \forall i \in \{1, \ldots, n\}$.
5:     Find a correlated direction to the loss gradient
         $g_m = \arg\min_{g \in H} \sum_{i=1}^{n} (-g_m^i - g(x_i))^2$.
6:     Find an optimal step length in the direction $g_m$
         $\rho_m = \arg\min_{\rho \in \mathbb{R}} \sum_{i=1}^{n} \ell(y_i, f_{m-1}(x_i) + \rho g_m(x_i))$.
7:     $f_m(x) = f_{m-1}(x) + \mu \rho_m g_m(x)$.
8:   end for
9:   return $f_M(x)$
10: end function
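The steps of the algorithm can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the reference implementation: the loss is fixed to the square loss (so that the step length of step 6 has a closed form), the hypothesis space H is taken to be regression stumps, and the names `fit_stump` and `gradient_boosting` are hypothetical.

```python
import numpy as np

def fit_stump(X, r):
    """Least-squares regression stump, used here as the weak learner g in H (step 5)."""
    best = (np.inf, 0, -np.inf, r.mean(), r.mean())
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j])[:-1]:          # thresholds leaving both sides non-empty
            left = X[:, j] <= t
            cl, cr = r[left].mean(), r[~left].mean()
            sse = ((r[left] - cl) ** 2).sum() + ((r[~left] - cr) ** 2).sum()
            if sse < best[0]:
                best = (sse, j, t, cl, cr)
    _, j, t, cl, cr = best
    return lambda Z: np.where(Z[:, j] <= t, cl, cr)

def gradient_boosting(X, y, M=50, mu=0.1):
    """Gradient boosting for the square loss l(y, y') = (y - y')^2 / 2 (an assumption)."""
    rho0 = y.mean()                                # step 2: constant minimizer of the square loss
    f = np.full(len(y), rho0)
    learners = []
    for _ in range(M):                             # step 3
        neg_grad = y - f                           # step 4: -g_m^i for the square loss
        g = fit_stump(X, neg_grad)                 # step 5: least-squares fit to -g_m^i
        gx = g(X)
        denom = gx @ gx
        rho = (neg_grad @ gx) / denom if denom > 0 else 0.0   # step 6: closed-form line search
        f = f + mu * rho * gx                      # step 7: shrunken update
        learners.append((rho, g))

    def predict(Z):
        return rho0 + mu * sum(rho * g(Z) for rho, g in learners)
    return predict
```

Calling `predict = gradient_boosting(X, y)` returns the final model f_M as a function of new inputs. For a loss without a closed-form step length, step 6 would need a one-dimensional numerical minimization instead.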

Table 4.1: Regression losses ($Y = \mathbb{R}$) and binary classification losses ($Y = \{-1, 1\}$), and their negative derivative with respect to a basis function $f(x)$.

Loss           $\ell(y, y')$                 $-\partial \ell(y, y') / \partial y'$

Regression
Square         $\frac{1}{2}(y - y')^2$       $y - y'$
Absolute       $|y - y'|$                    $\mathrm{sign}(y - y')$

Classification
Exponential    $\exp(-y y')$                 $y \exp(-y y')$
Logistic       $\log(1 + \exp(-2 y y'))$     $\frac{2y}{1 + \exp(2 y y')}$
Hinge          $\max(0, 1 - y y')$           $y \, 1(y y' < 1)$
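The sign convention of the column $-\partial \ell(y, y') / \partial y'$ is easy to get wrong, so a small numerical check may help. The sketch below writes each loss with its negative gradient and compares it against a central finite difference; the dictionary `LOSSES` and the helper `numeric_neg_grad` are hypothetical names introduced for illustration, and `p` stands for the prediction y'.

```python
import numpy as np

# Each entry maps a loss name to (loss, negative gradient with respect to p = y').
LOSSES = {
    "square": (lambda y, p: 0.5 * (y - p) ** 2,
               lambda y, p: y - p),
    "absolute": (lambda y, p: np.abs(y - p),
                 lambda y, p: np.sign(y - p)),
    "exponential": (lambda y, p: np.exp(-y * p),
                    lambda y, p: y * np.exp(-y * p)),
    "logistic": (lambda y, p: np.log1p(np.exp(-2 * y * p)),
                 lambda y, p: 2 * y / (1 + np.exp(2 * y * p))),
    "hinge": (lambda y, p: np.maximum(0, 1 - y * p),
              lambda y, p: y * (y * p < 1)),
}

def numeric_neg_grad(loss, y, p, eps=1e-6):
    """Central finite-difference estimate of -d loss / d p."""
    return -(loss(y, p + eps) - loss(y, p - eps)) / (2 * eps)
```

Away from the kinks of the absolute and hinge losses (y' = y and yy' = 1 respectively), `numeric_neg_grad(loss, y, p)` should agree with the analytical expression to within the finite-difference error.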


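Table 4.2: Constant minimizers of regression losses ($Y = \mathbb{R}$) and binary classification losses ($Y = \{-1, 1\}$) given a set of samples $L = \{(x_i, y_i) \in X \times Y\}_{i=1}^{n}$.

Regression
Square         $f_0(x) = \frac{1}{n} \sum_{i=1}^{n} y_i$
Absolute       $f_0(x) = \mathrm{median}(\{y_i\}_{i=1}^{n})$

Classification
Exponential    $f_0(x) = \frac{1}{2} \log \frac{\sum_{i=1}^{n} 1(y_i = 1)}{\sum_{i=1}^{n} 1(y_i = -1)}$
Logistic       $f_0(x) = \frac{1}{2} \log \frac{\sum_{i=1}^{n} 1(y_i = 1)}{\sum_{i=1}^{n} 1(y_i = -1)}$
Hinge          $f_0(x) = \mathrm{sign}\left(\frac{1}{n} \sum_{i=1}^{n} 1(y_i = 1) - \frac{1}{2}\right)$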
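The constant minimizers can be written down directly in code. The sketch below uses a hypothetical function name, `constant_minimizer`. For the exponential loss $\exp(-y y')$ and the logistic loss written as $\log(1 + \exp(-2 y y'))$, setting the derivative of the summed loss to zero gives $\exp(2\rho) = n_+ / n_-$, where $n_+$ and $n_-$ count the positive and negative samples, which is where the factor 1/2 in the code comes from.

```python
import numpy as np

def constant_minimizer(y, loss):
    """Constant model f_0 minimizing the summed loss over the sample outputs y."""
    y = np.asarray(y, dtype=float)
    if loss == "square":
        return y.mean()
    if loss == "absolute":
        return np.median(y)
    n_pos, n_neg = np.sum(y == 1), np.sum(y == -1)   # class counts, labels in {-1, 1}
    if loss in ("exponential", "logistic"):
        # Assumes both classes are present; otherwise the minimizer is infinite.
        return 0.5 * np.log(n_pos / n_neg)
    if loss == "hinge":
        return np.sign(n_pos / len(y) - 0.5)
    raise ValueError(f"unknown loss: {loss}")
```

A brute-force search over a grid of constants is a simple way to confirm each expression on a small sample.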

Learning in compressed space through random projections

5 Random forests with random projections of the output space for high dimensional multi-label classification

Outline

We adapt the idea of random projections applied to the output space, so as to enhance tree-based ensemble methods in the context of multi-label classification. We show how learning time complexity can be reduced without affecting computational complexity and accuracy of predictions. We also show that random output space projections may be used in order to reach different bias-variance tradeoffs, over a broad panel of benchmark problems, and that this may lead to improved accuracy while significantly reducing the computational burden of the learning stage.

This chapter is based on previous work published in

Arnaud Joly, Pierre Geurts, and Louis Wehenkel. Random forests with random projections of the output space for high dimensional multi-label classification. In Machine Learning and Knowledge Discovery in Databases, pages 607–622. Springer Berlin Heidelberg, 2014.

Within supervised learning, the goal of multi-label classification is to train models to annotate objects with a subset of labels taken from a set of candidate labels. Typical applications include the determination of topics addressed in a text document, the identification of object categories present within an image, or the prediction of biological properties of a gene. In many applications, the number of candidate labels may be very large, ranging from hundreds to hundreds of thousands (Agrawal et al., 2013) and often even exceeding the sample size (Dekel and Shamir, 2010). The very large scale nature of the output space in such problems poses both statistical and computational challenges that need to be specifically addressed.

A simple approach to multi-label classification problems, called binary relevance, is to train a binary classifier independently for each label. Several more complex schemes have however been proposed to take into account the dependencies between the labels (see Section 2.2.5). In the context of tree-based methods, one way is to train multi-output trees (see Section 3.5), i.e. trees that can predict multiple outputs at once. With respect to binary relevance, the multi-output tree approach has the advantage of building a single model for all labels. It can thus potentially take into account label dependencies and reduce memory requirements for the storage of the models. An extensive experimental comparison (Madjarov et al., 2012) shows that this approach compares favorably with other approaches, including non tree-based methods, both in terms of accuracy and computing times. In addition, multi-output trees inherit all intrinsic advantages of tree-based methods, such as robustness to irrelevant features, interpretability through feature importance scores, or fast computations of predictions, that make them very attractive to address multi-label problems. The computational complexity of learning multi-output trees is however similar to that of the binary relevance method. Both approaches are indeed O(pdn log n), where p is the number of input features, d the number of candidate output labels, and n the sample size; this is a limiting factor when dealing with large sets of candidate labels.

One generic approach to reduce computational complexity is to apply some compression technique prior to the training stage to reduce the number of outputs to a number q much smaller than the total number d of labels. A model can then be trained to make predictions in the compressed output space and a prediction in the original label space can be obtained by decoding the compressed prediction. As multi-label vectors are typically very sparse, one can expect a drastic dimensionality reduction by using appropriate compression techniques. This idea has been explored for example in (Hsu et al., 2009) using compressed sensing, and in (Cisse et al., 2013) using Bloom filters, in both cases using regularized linear models as base learners. The approach obviously reduces computing times for training the model. Random projections are also exploited in (Tsoumakas et al., 2014) for multi-target regression. In this latter work however, they are not used to improve computing times by compression but instead to improve predictive performance. Indeed, more (sparse) random projections are computed than there are outputs, and each of them is used as an output to train a single-target regressor. As in (Cisse et al., 2013; Hsu et al., 2009), the predictions of the regressors need to be decoded at prediction time to obtain a prediction in the original output space. This is achieved in (Tsoumakas et al., 2014) by solving an overdetermined linear system.

In this chapter, we explore the use of random output space projections for large-scale multi-label classification in the context of tree-based ensemble methods. We first explore the idea proposed for linear models in (Hsu et al., 2009) with random forests: a (single) random projection of the multi-label vector to a q-dimensional random subspace is computed and then a multi-output random forest is grown based on score computations using the projected outputs. We exploit however the fact that the approximation provided by a tree ensemble is a weighted average of output vectors from the training sample to avoid the decoding stage: at training time all leaf labels are directly computed in the original multi-label space. We show theoretically and empirically that when q is large enough, ensembles grown on such random output spaces are equivalent to ensembles grown on the original output space. When d is large enough compared to n, this idea hence may reduce computing times at the learning stage without affecting accuracy and computational complexity of predictions.

Next, we propose to exploit the randomization inherent to the projection of the output space as a way to obtain randomized trees in the context of ensemble methods: each tree in the ensemble is thus grown from a different randomly projected subspace of dimension q. As previously, labels at leaf nodes are directly computed in the original output space to avoid the decoding step. We show, theoretically, that this idea can lead to better accuracy than the first idea and, empirically, that the best results are obtained on many problems with very low values of q, which leads to significant computing time reductions at the learning stage. In addition, we study the interaction between input randomization (à la Random Forests) and output randomization (through random projections), showing that it is beneficial, both in terms of predictive performance and in terms of computing times, to combine these two ways of randomization optimally. All in all, the proposed approach constitutes a very attractive way to address large-scale multi-label problems with tree-based ensemble methods.

The rest of the chapter is structured as follows: Section 5.1 presents the proposed algorithms and their theoretical properties; Section 5.2 analyses the proposed algorithm from a bias-variance perspective; Section 5.3 provides the empirical validations, whereas Section 5.4 discusses our work and provides further research directions.

5.1 Methods

We first present how we propose to exploit random projections to reduce the computational burden of learning single multi-output trees in very high-dimensional output spaces. Then we present and compare two ways to exploit this idea with ensembles of trees.

5.1.1 Multi-output regression trees in randomly projected output spaces

The multi-output single tree algorithm described in Chapter 3 requires the computation of an impurity criterion summed over the outputs, such as the variance (or Gini), at each tree node and for each candidate split. When Y is very high-dimensional, this computation constitutes the main computational bottleneck of the algorithm. We thus propose to approximate variance computations by using random projections of the output space. The multi-output regression tree algorithm is modified as follows (denoting by $L$ the learning sample $L = ((x_i, y_i) \in X \times Y)_{i=1}^{n}$):

• First, a projection matrix Φ of dimension q×d is randomly generated.

• A new dataset $L_m = ((x_i, \Phi y_i))_{i=1}^{n}$ is constructed by projecting each learning sample output using the projection matrix Φ.

• A tree (structure) $T_m$ is grown using the projected learning sample $L_m$.

• Predictions ŷ at each leaf of $T_m$ are computed using the corresponding outputs in the original output space.

The resulting tree is exploited in the standard way to make predictions: an input vector x is propagated through the tree until it reaches a leaf from which a prediction ŷ in the original output space is directly retrieved.
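The modified procedure can be sketched as follows. This is an illustrative sketch under several assumptions, not the implementation used in the experiments: Φ is drawn with Gaussian entries N(0, 1/q), splits are found by exhaustive search on the variance summed over the projected outputs, and all function names (`sse`, `grow`, `label_leaves`, `predict_one`, `fit_projected_tree`) are hypothetical.

```python
import numpy as np

def sse(Z):
    """Squared deviations from the mean, summed over outputs (n times the summed variance)."""
    return ((Z - Z.mean(axis=0)) ** 2).sum()

def grow(X, Z, idx, min_leaf=5):
    """Grow the tree structure on the projected outputs Z; leaves keep sample indices."""
    best = None
    if len(idx) >= 2 * min_leaf:
        for j in range(X.shape[1]):
            for t in np.unique(X[idx, j])[:-1]:
                left, right = idx[X[idx, j] <= t], idx[X[idx, j] > t]
                if min(len(left), len(right)) < min_leaf:
                    continue
                score = sse(Z[left]) + sse(Z[right])
                if best is None or score < best[0]:
                    best = (score, j, t, left, right)
    if best is None:
        return {"leaf": idx}
    _, j, t, left, right = best
    return {"feature": j, "threshold": t,
            "left": grow(X, Z, left, min_leaf),
            "right": grow(X, Z, right, min_leaf)}

def label_leaves(node, Y):
    """Label each leaf with the mean of the ORIGINAL outputs of its samples (no decoding)."""
    if "leaf" in node:
        node["value"] = Y[node.pop("leaf")].mean(axis=0)
    else:
        label_leaves(node["left"], Y)
        label_leaves(node["right"], Y)

def predict_one(node, x):
    """Propagate x down to a leaf and return its prediction in the original space."""
    while "value" not in node:
        node = node["left"] if x[node["feature"]] <= node["threshold"] else node["right"]
    return node["value"]

def fit_projected_tree(X, Y, q, rng, min_leaf=5):
    d = Y.shape[1]
    Phi = rng.normal(0.0, 1.0 / np.sqrt(q), size=(q, d))  # q x d Gaussian projection (assumed)
    Z = Y @ Phi.T                                          # row i holds Phi y_i
    tree = grow(X, Z, np.arange(len(X)), min_leaf)         # structure from projected outputs
    label_leaves(tree, Y)                                  # leaf values from original outputs
    return tree
```

Each split evaluation touches q columns instead of d, which is where the saving at learning time comes from, while `predict_one` returns a d-dimensional vector exactly as a tree grown on the original outputs would.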

If Φ satisfies the Johnson-Lindenstrauss lemma (Equation 2.86), the following theorem shows that the variance computed in the projected subspace is an ε-approximation of the variance computed over the original space.
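The approximation can also be probed numerically. The snippet below is an empirical check on synthetic data, not a proof: it assumes a Gaussian projection with entries N(0, 1/q), a sparse binary label matrix generated at random, and a hypothetical helper `summed_variance` that computes the variance summed over the output dimensions.

```python
import numpy as np

def summed_variance(Y):
    """Variance summed over output dimensions: (1/n) * sum_i ||y_i - mean(y)||^2."""
    return ((Y - Y.mean(axis=0)) ** 2).sum() / len(Y)

rng = np.random.default_rng(0)
n, d, q = 50, 1000, 500
Y = (rng.random((n, d)) < 0.05).astype(float)          # synthetic sparse label matrix
Phi = rng.normal(0.0, 1.0 / np.sqrt(q), size=(q, d))   # Gaussian projection, entries N(0, 1/q)

var_orig = summed_variance(Y)                           # variance in the original space
var_proj = summed_variance(Y @ Phi.T)                   # variance in the projected space
rel_err = abs(var_proj - var_orig) / var_orig
print(f"original: {var_orig:.3f}  projected: {var_proj:.3f}  relative error: {rel_err:.3f}")
```

Lowering q makes the relative error larger on average, which is the accuracy versus computing-time tradeoff controlled by the projection dimension.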
