Search for the best splitting rule $s^*_t$ given a sparse input matrix $X$ and a set of samples $L_t$

function FindBestSparseSplit($X$, $L_t$)
    for $j = 1$ to $p$ do
        Extract the strictly positive values $X_{j,pos}$ and the strictly negative values $X_{j,neg}$ from $X_j$ given $L_t$.
        Search for the best splitting rule of the form $s^*_j(x) = x_j \leq \tau$ with $\tau \in X_{j,pos} \cup \{0\} \cup X_{j,neg}$ maximizing the impurity reduction $\Delta I$ over the sample set $L_t$.
        Update $s^*$ if the splitting rule $s^*_j$ leads to a higher impurity reduction.
    end for
    return $s^*$
end function
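The threshold search inside the loop can be sketched in Python for a single feature. This is only an illustration under stated assumptions, not the author's implementation: it assumes a regression setting with the variance as impurity measure, and the function name `best_sparse_threshold` and its arguments are hypothetical. The zero-valued samples are summarized by their count and the sum of their targets, so the work depends on the number of non-zero values rather than on the size of the sample set.

```python
def best_sparse_threshold(nonzero, n_zero, sum_y_zero):
    """Search the best rule x_j <= tau for one feature (illustrative sketch).

    nonzero    : list of (x_value, y) pairs for samples with a non-zero value
    n_zero     : number of samples of the node whose value is zero
    sum_y_zero : sum of the targets of those zero-valued samples
    Assumption: impurity is the variance of y (regression setting).
    """
    # Group the samples by feature value: value -> (count, sum of targets).
    groups = {}
    for x, y in nonzero:
        count, total = groups.get(x, (0, 0.0))
        groups[x] = (count + 1, total + y)
    if n_zero > 0:
        # All zero-valued samples form a single block located at tau = 0.
        groups[0.0] = (n_zero, sum_y_zero)

    n = sum(count for count, _ in groups.values())
    s = sum(total for _, total in groups.values())

    best_tau, best_gain = None, 0.0
    n_left, s_left = 0, 0.0
    for tau in sorted(groups):
        count, total = groups[tau]
        n_left += count
        s_left += total
        n_right = n - n_left
        if n_right == 0:
            break  # every sample goes left: not a valid split
        # Variance reduction written with sums only:
        # (S_L^2 / n_L + S_R^2 / n_R - S^2 / n) / n
        gain = (s_left ** 2 / n_left + (s - s_left) ** 2 / n_right - s ** 2 / n) / n
        if gain > best_gain:
            best_tau, best_gain = tau, gain
    return best_tau, best_gain


# Three non-zero samples and two zero-valued samples with target 0.
tau, gain = best_sparse_threshold([(-1.0, 0.0), (2.0, 10.0), (3.0, 10.0)], 2, 0.0)
print(tau, gain)  # 0.0 24.0
```

When the node contains no zero-valued sample, the threshold 0 induces the same partition as the largest negative value, which is why the sketch only adds it as a candidate when `n_zero > 0`.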

To extract the non-zero values of a sparse input matrix $X \in \mathbb{R}^{n \times p}$ with sparsity $s$ in the context of decision tree growth, we need to perform two matrix operations efficiently: (i) column indexing for a given input variable $j$, and (ii) extraction of the non-zero row values associated with the set of samples $L_t$ reaching node $t$ in this column $j$. The overall cost of extracting $|L_t|$ samples at a column $j$ from the input matrix $X$ should be proportional to the number of non-zero elements and not to $|L_t|$.¹

Among the different sparse matrix representations (Barrett et al., 1994; Hwu and Kirk, 2009; Pissanetzky, 1984), the sparse CSC matrix format is the most appropriate for tree growing as it allows efficient column indexing, as required during node splitting. Let us show how to perform an efficient extraction of the non-zero values given the sample set $L_t$ using this matrix format. Note that using a compressed row storage sparse format² would not be appropriate during tree growth, as we need to be able to efficiently subsample input variables at each expansion of a new testing node.

Compressed Sparse Column (CSC) matrix format

The sparse CSC matrix with $n_{nz}$ non-zero elements is a data structure composed of three arrays:

indices $\in \mathbb{Z}^{n_{nz}}$, containing the row indices of the non-zero elements.

data $\in \mathbb{R}^{n_{nz}}$, containing the values of the non-zero elements.

indptr $\in \mathbb{Z}^{p+1}$, containing the slices of the non-zero elements. For a column $j \in \{1, \ldots, p\}$, the row indices and the values of the non-zero elements of column $j$ are stored from indptr[j] to indptr[j+1] − 1 in the indices and data arrays.

The non-zero values associated with an input variable $j$ are located from indptr[j] to indptr[j+1] − 1 in the indices and data arrays (when indptr[j] = indptr[j+1], the column contains only zeros). Extracting them requires performing a set intersection between the sample set $L_t$ reaching node $t$ and the row indices indices[indptr[j] : indptr[j+1][ of the non-zero values of column $j$.

For instance, the following matrix $A$ has only 3 non-zero elements, but a dense representation would use an array of 20 elements:

$$A = \begin{pmatrix} a & 0 & 0 & b & 0 \\ 0 & 0 & 0 & c & 0 \\ 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \end{pmatrix} \in \mathbb{R}^{4 \times 5}. \tag{8.3}$$

The CSC representation of this matrix $A$ is given by

data = [a, b, c], indices = [0, 0, 1], indptr = [0, 1, 1, 1, 3, 3].
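The three arrays of this example can be reproduced with a few lines of plain Python. The sketch below is illustrative only: the helper names `dense_to_csc` and `column_nonzeros` are hypothetical, and the numeric values 1.0, 2.0 and 3.0 stand in for the symbolic entries a, b and c.

```python
def dense_to_csc(dense):
    """Convert a dense matrix (list of rows) into the CSC arrays
    (data, indices, indptr), scanning the matrix column by column."""
    n_rows = len(dense)
    n_cols = len(dense[0]) if n_rows else 0
    data, indices, indptr = [], [], [0]
    for j in range(n_cols):
        for i in range(n_rows):
            if dense[i][j] != 0:
                data.append(dense[i][j])  # value of the non-zero element
                indices.append(i)         # its row index
        indptr.append(len(data))          # end of the slice of column j
    return data, indices, indptr


def column_nonzeros(data, indices, indptr, j):
    """Return the (row index, value) pairs of column j (0-based)."""
    start, end = indptr[j], indptr[j + 1]
    return list(zip(indices[start:end], data[start:end]))


# Numeric stand-ins for the symbolic entries a, b, c of the matrix A.
a, b, c = 1.0, 2.0, 3.0
A = [
    [a, 0, 0, b, 0],
    [0, 0, 0, c, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
]
data, indices, indptr = dense_to_csc(A)
print(data)     # [1.0, 2.0, 3.0]
print(indices)  # [0, 0, 1]
print(indptr)   # [0, 1, 1, 1, 3, 3]
print(column_nonzeros(data, indices, indptr, 3))  # [(0, 2.0), (1, 3.0)]
```

Accessing a column only touches its own slice of `indices` and `data`, which is what makes column indexing cheap in this format; an empty column such as column 1 simply has `indptr[1] == indptr[2]`.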

² The compressed row storage (CSR) sparse array format is made of three arrays: indptr, indices and value. The non-zero elements of the $i$-th row of the sparse CSR matrix are stored from indptr[i] to indptr[i+1] − 1 in the indices array, giving the column indices, and in the value array, giving the stored values. It is the transposed version of the CSC sparse format.

8.1 Tree growing

Let $n_{nz,j} = \text{indptr}[j+1] - \text{indptr}[j]$, for all $j$, be the number of samples with non-zero values for input variable $j$, and let us assume that the indices of the input CSC matrix are sorted column-wise, i.e., for the $j$-th column, the elements of indices from indptr[j] to indptr[j+1] − 1 are sorted. Standard intersection algorithms have the following time complexities:

1. in $O(|L_t| \log n_{nz,j})$ by performing $|L_t|$ binary searches on the sorted $n_{nz,j}$ non-zero elements;

2. in $O(|L_t| \log |L_t| + n_{nz,j})$ by sorting the sample set $L_t$ and retrieving the intersection by iterating over both arrays;

3. in $O(n_{nz,j})$ by maintaining a data structure such as a hash table of $L_t$, allowing us to efficiently check whether the elements of indices[indptr[j] : indptr[j+1][ are contained in the sample partition $L_t$.
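The three strategies listed above can be written down on plain Python lists to make the difference concrete. This is a minimal sketch with hypothetical function names; `samples` plays the role of $L_t$ (assumed free of duplicates) and `col_rows` the sorted row indices of the non-zero elements of one column.

```python
from bisect import bisect_left


def intersect_binary_search(samples, col_rows):
    """Approach (1): one binary search per sample in the sorted row indices."""
    found = []
    for s in samples:
        k = bisect_left(col_rows, s)
        if k < len(col_rows) and col_rows[k] == s:
            found.append(s)
    # Sorting is only done here so that the three functions return
    # comparable outputs; it is not part of the approach itself.
    return sorted(found)


def intersect_sort_merge(samples, col_rows):
    """Approach (2): sort the samples, then walk both sorted arrays in step."""
    sorted_samples = sorted(samples)
    found, a, b = [], 0, 0
    while a < len(sorted_samples) and b < len(col_rows):
        if sorted_samples[a] == col_rows[b]:
            found.append(col_rows[b])
            a += 1
            b += 1
        elif sorted_samples[a] < col_rows[b]:
            a += 1
        else:
            b += 1
    return found


def intersect_hash(samples, col_rows):
    """Approach (3): membership test in a hash-based set of the samples.
    Building the set here costs O(|L_t|); the O(n_nz,j) bound assumes the
    structure is maintained rather than rebuilt at each call."""
    sample_set = set(samples)
    return [r for r in col_rows if r in sample_set]


samples = [7, 2, 9, 4]
col_rows = [1, 2, 4, 8, 9]
print(intersect_binary_search(samples, col_rows))  # [2, 4, 9]
print(intersect_sort_merge(samples, col_rows))     # [2, 4, 9]
print(intersect_hash(samples, col_rows))           # [2, 4, 9]
```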

The optimal intersection algorithm depends on the number of non-zero elements for input variable $j$ and on the number of samples $|L_t|$ reaching node $t$. During decision tree growth, we have two opposite situations: either the size of the sample partition $|L_t|$ is high with respect to the number of non-zero elements ($|L_t| \approx O(n) \gg n_{nz,j}$) or, typically at the bottom of the tree, the partition size is small with respect to the number of non-zero elements ($n_{nz,j} \gg |L_t|$). In the first case (i.e., at the top of the tree), the most efficient approach is thus approach (3), while in the second case (i.e., at the bottom of the tree), approach (1) should be faster. We first describe how to implement approach (3), then approach (1), and finally how to combine both approaches.

A straightforward implementation of approach (3) is, at each node, to allocate a hash table containing all training examples in that node (in $O(|L_t|)$) and then to compute the intersection by checking whether the non-zero elements of the CSC matrix belong to the hash table (in $O(n_{nz,j})$). We can however avoid the overhead required for the allocation, creation, and deallocation of the hash table by maintaining and exploiting a mapping between the CSC matrix and the sample set $L_t$. Since the array $L$ is constantly modified during tree growth, we propose to use an array, denoted mapping, to keep track of the position of a sample $i$ in the array $L$, as illustrated in Figure 8.1. During tree growing, we keep the following invariant:

$$\text{mapping}[L[i]] = i. \tag{8.4}$$

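In the above example, the array $L$ was [0, 1, …, 9] with mapping = [0, 1, …, 9]. After a few splits, the array $L$ has been reordered, and the associated mapping array is mapping = [9, 1, 4, 3, 6, 2, 8, 5, 7, 0].

[Figure 8.1 (diagram not reproduced): the indices array, indexed from 0 to $\sum_{j=1}^{p} n_{nz,j}$ with the slice between indptr[j] and indptr[j+1] marked; the mapping array; and the array $L$, indexed from 0 to $n$ with the positions start and end marked.]

Figure 8.1: The array mapping allows the intersection between the indices array of the CSC matrix and a sample set $L_t$ to be computed efficiently.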
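Invariant (8.4) says that mapping is the inverse permutation of $L$, so either array determines the other. The short Python sketch below checks this on the mapping array quoted in the text; the helper names are hypothetical, and the array `L` it computes is not quoted from the source but is the permutation implied by the invariant.

```python
def build_mapping(L):
    """Return the array m such that m[L[i]] == i for every position i."""
    mapping = [0] * len(L)
    for position, sample in enumerate(L):
        mapping[sample] = position
    return mapping


def invariant_holds(L, mapping):
    """Check the invariant mapping[L[i]] == i at every position."""
    return all(mapping[L[i]] == i for i in range(len(L)))


# Initial state: samples in natural order, identity mapping.
L0 = list(range(10))
print(build_mapping(L0) == L0)  # True

# Mapping array given in the text; inverting it yields the implied array L.
mapping = [9, 1, 4, 3, 6, 2, 8, 5, 7, 0]
L = build_mapping(mapping)  # the inverse of an inverse permutation
print(L)                             # [9, 1, 5, 3, 2, 7, 4, 8, 6, 0]
print(invariant_holds(L, mapping))   # True
```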

Thanks to the mapping array, we can now check in constant time $O(1)$ whether a sample $i$ belongs to the sample set $L_t$. Indeed, given that $L_t$ is represented by a slice from an index start to an index end − 1 in $L$, sample $i$ belongs to $L_t$ if the following condition holds:

$$\text{start} \leq \text{mapping}[i] < \text{end}. \tag{8.5}$$

To extract the non-zero values of the CSC matrix associated with the $j$-th input variable in the sample set $L_t$, we check the previous condition for all samples from indptr[j] to indptr[j+1] − 1 in the indices array. Thus, we perform the intersection between $L_t$ and the $n_{nz,j}$ non-zero values in $O(n_{nz,j})$. The whole method is described in Algorithm 8.2. Note that to swap samples in the array $L$, we use a modified swap function (see Algorithm 8.3) which preserves the mapping invariant.
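A simplified Python sketch of this extraction and of an invariant-preserving swap is given below. It is not a transcription of Algorithms 8.2 and 8.3: the function names, the return format (two lists of (sample, value) pairs) and the toy single-column matrix are assumptions made for illustration.

```python
def swap(L, mapping, p, q):
    """Swap positions p and q of L while keeping mapping[L[i]] == i."""
    L[p], L[q] = L[q], L[p]
    mapping[L[p]] = p
    mapping[L[q]] = q


def extract_nnz_with_mapping(data, indices, indptr, j, mapping, start, end):
    """Collect the non-zero values of column j for the samples stored in
    L[start:end], scanning the column slice once (O(n_nz,j))."""
    negatives, positives = [], []
    for k in range(indptr[j], indptr[j + 1]):
        i = indices[k]                   # sample (row) index
        if start <= mapping[i] < end:    # membership test in O(1)
            if data[k] < 0:
                negatives.append((i, data[k]))
            elif data[k] > 0:
                positives.append((i, data[k]))
    return negatives, positives


# Toy single-column CSC matrix with 10 rows and 4 non-zero values.
data = [-1.5, 2.0, 4.0, -3.0]
indices = [0, 2, 5, 7]
indptr = [0, 4]

L = list(range(10))
mapping = list(range(10))
swap(L, mapping, 0, 9)  # L is now [9, 1, 2, ..., 8, 0]

# Node whose samples are L[4:10] = [4, 5, 6, 7, 8, 0].
neg, pos = extract_nnz_with_mapping(data, indices, indptr, 0, mapping, 4, 10)
print(neg)  # [(0, -1.5), (7, -3.0)]
print(pos)  # [(5, 4.0)]
```

Sample 0 is found in the node even though it was moved to the last position of `L`, because the swap updated `mapping` at the same time.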

In practice, the number of non-zero elements $n_{nz,j}$ of feature $j$ can be much greater than the size of a sample set $L_t$. This is likely to happen near the leaf nodes: when the tree is fully developed, only a few samples reach these nodes. Approach (1), shown in Algorithm 8.4, exploits the relatively small size of the sample set and performs repeated binary searches on the $n_{nz,j}$ non-zero elements associated with feature $j$.
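The binary search variant can be sketched directly on the CSC arrays, restricting each search to the slice of column $j$. The code below is an illustrative simplification rather than a transcription of Algorithm 8.4; the function name, the return format and the toy two-column matrix are assumptions.

```python
from bisect import bisect_left


def extract_nnz_binary_search(data, indices, indptr, j, L, start, end):
    """Collect the non-zero values of column j for the samples L[start:end]
    with one binary search per sample (O(|L_t| log n_nz,j)).
    Assumes the row indices of each column are sorted."""
    lo, hi = indptr[j], indptr[j + 1]
    negatives, positives = [], []
    for i in L[start:end]:
        # Search sample i only within the slice [lo, hi) of column j.
        k = bisect_left(indices, i, lo, hi)
        if k < hi and indices[k] == i:
            if data[k] < 0:
                negatives.append((i, data[k]))
            elif data[k] > 0:
                positives.append((i, data[k]))
    return negatives, positives


# Toy CSC matrix with 10 rows and 2 columns; column 1 has 4 non-zero values.
data = [9.0, -1.5, 2.0, 4.0, -3.0]
indices = [3, 0, 2, 5, 7]
indptr = [0, 1, 5]

L = [9, 1, 2, 3, 4, 5, 6, 7, 8, 0]
# Node whose samples are L[4:10] = [4, 5, 6, 7, 8, 0].
neg, pos = extract_nnz_binary_search(data, indices, indptr, 1, L, 4, 10)
print(neg)  # [(7, -3.0), (0, -1.5)]
print(pos)  # [(5, 4.0)]
```

The results follow the order of the samples in `L` rather than the row order of the column, and no auxiliary mapping array is needed; the cost grows with the number of samples in the node instead of with the number of non-zero elements.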


Algorithm 8.2 Return the n_neg strictly negative and n_pos positive

In document Exploiting random projections and sparsity with random forests and gradient boosting methods - Application to multi-label and multi-output learning, random forest model compression and leveraging input sparsity (Page 169-173)