Generalization Error Bound - The Hierarchical Linear Support Vector Machine Algorithm (H-LSVM)

5.2 The Hierarchical Linear Support Vector Machine Algorithm (H-LSVM)

5.2.2 Generalization Error Bound

An interesting point of analysis for the H-LSVM algorithm is to determine the generalization capability of the model and to quantify the complexity of the resulting tree.

The generalization error bound derived in this section is obtained from the results of Golea et al. [137]. In that work, the bounds depend on the effective number of leaves $L_{\mathrm{eff}}$, a data-dependent quantity which reflects how uniformly the training data covers the tree's leaves. $L_{\mathrm{eff}}$ can be considerably smaller than the total number of leaves in the tree ($L$) [138], which distinguishes this bound from the Vapnik-Chervonenkis one, which depends on $L$ [139, 140].

Algorithm     Training                          Classification
Linear SVM    $N M \log(1/\epsilon)$            $M$
SMO-SVM       $N^2 \times M$                    $n_{SV} \times M \times n_K$
H-LSVM        $M k N_H / (\lambda \epsilon)$    $N_{HP(x)} \times M$

Table 5.1: Number of operations needed to train on a set $S$ of $N$ patterns in an $M$-dimensional space (Training column) and to classify a new pattern (Classification column) for linear SVM, SMO-SVM and the H-LSVM algorithm. $\lambda$: regularization parameter in the Pegasos formulation. $\epsilon$: optimization tolerance. $n_{SV}$: number of support vectors of the nonlinear SVM model. $n_K$: number of operations needed to compute the kernel between each support vector and the test pattern. $N_H$: total number of internal nodes in the H-LSVM tree. $n_i$: number of training samples which reach node $i$ in the H-LSVM tree. $k$: size of the random subset at each iteration of the Weighted-Pegasos algorithm. $N_{HP(x)}$: number of nodes encountered by pattern $x$ in the H-LSVM tree. The computational costs of linear SVM, SMO-SVM and Pegasos are taken from [136].

Consider a two-class decision tree $T$ whose internal decision nodes are labeled with boolean functions from some class $U$ and whose leaves are labeled $-1$ or $+1$. Formally, let $P = (p_1, \ldots, p_L)$ be the probability vector whose component $p_i$ is the probability that a pattern $x$ reaches leaf $i$, for $i = 1, \ldots, L$. The quadratic distance between the probability vector $P$ and the uniform probability vector $U = (1/L, \ldots, 1/L)$ is given by

$$d(P, U) = \sum_{i=1}^{L} (p_i - 1/L)^2,$$

and the effective number of leaves in the tree is defined as $L_{\mathrm{eff}} \equiv L(1 - d(P, U))$. A bound on the misclassification probability under a given distribution $\mathcal{D}$, $P_{\mathcal{D}}[T(x) \neq y]$, can be obtained from the following theorem [137]:

Theorem 5.1. For a fixed $\xi > 0$, there is a constant $c$ that satisfies the following. Let $\mathcal{D}$ be a distribution on $X \times \{-1, +1\}$. Consider the class of decision trees of depth up to $D$, with decision functions in $U$. With probability at least $1 - \xi$ over the training set $S$ (of size $N$), every decision tree $T$ that is consistent with $S$ has

$$P_{\mathcal{D}}[T(x) \neq y] \leq c \left( \frac{L_{\mathrm{eff}} \, \mathrm{VCdim}(U) \, \log^2 N \, \log D}{N} \right)^{1/2},$$

where $\mathrm{VCdim}$ is the Vapnik-Chervonenkis dimension.
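As an illustration of the effective number of leaves that enters Theorem 5.1, the following minimal Python sketch evaluates $L_{\mathrm{eff}} = L(1 - d(P, U))$ for a given leaf-probability vector. The function name `effective_leaves` and the two example vectors are hypothetical and chosen only for illustration; they are not taken from the H-LSVM implementation.

```python
def effective_leaves(leaf_probs):
    """Return L_eff = L * (1 - d(P, U)) for a leaf-probability vector P."""
    n_leaves = len(leaf_probs)
    if n_leaves == 0 or abs(sum(leaf_probs) - 1.0) > 1e-9:
        raise ValueError("leaf_probs must be a non-empty probability vector")
    # Quadratic distance between P and the uniform vector U = (1/L, ..., 1/L).
    d = sum((p - 1.0 / n_leaves) ** 2 for p in leaf_probs)
    return n_leaves * (1.0 - d)


# Hypothetical four-leaf trees: one covered uniformly, one dominated by a single leaf.
uniform = effective_leaves([0.25, 0.25, 0.25, 0.25])
skewed = effective_leaves([0.97, 0.01, 0.01, 0.01])
```

When the patterns are spread uniformly over the leaves, the distance is zero and $L_{\mathrm{eff}} = L$; when most patterns fall in one leaf, $L_{\mathrm{eff}}$ drops well below $L$.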

The H-LSVM algorithm fits this framework by identifying the class $U$ with the linear SVM. As discussed in Section 2.1.1, the Vapnik-Chervonenkis dimension of a hyperplane in an $M$-dimensional space is $M + 1$; therefore, the error bound for the H-LSVM method can be restated as follows.

Lemma 5.2. For a fixed $\xi > 0$, there is a constant $c$ that satisfies the following. Let $\mathcal{D}$ be a distribution on $X \times \{-1, +1\}$. Consider the class of decision trees of depth up to $D$, with H-LSVM decision functions. With probability at least $1 - \xi$ over the training set $S$ (of size $N$), every decision tree $T$ that is consistent with $S$ has

$$P_{\mathcal{D}}[T(x) \neq y] \leq c \left( \frac{L_{\mathrm{eff}} \, (M + 1) \, \log^2 N \, \log D}{N} \right)^{1/2}. \tag{5.6}$$
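To see how the quantities in Equation 5.6 interact, the sketch below evaluates the complexity term of the bound without the unknown constant $c$. It assumes natural logarithms (the base only changes the constant), and the function name `consistent_tree_term`, the value $L_{\mathrm{eff}} = 10$ and the depth 8 are hypothetical; the 22 features and 49,990 training patterns correspond to the IJCNN dataset used later in this chapter.

```python
import math


def consistent_tree_term(l_eff, n_features, n_train, depth):
    """Complexity term of Equation 5.6, omitting the constant c."""
    if n_train <= 1 or depth <= 1:
        raise ValueError("n_train and depth must be greater than 1")
    vc_dim = n_features + 1  # VC dimension of a hyperplane in M dimensions
    ratio = l_eff * vc_dim * math.log(n_train) ** 2 * math.log(depth) / n_train
    return math.sqrt(ratio)


# Same hypothetical tree evaluated for two training-set sizes.
small_sample = consistent_tree_term(l_eff=10.0, n_features=22, n_train=49_990, depth=8)
large_sample = consistent_tree_term(l_eff=10.0, n_features=22, n_train=499_900, depth=8)
```

For a fixed tree, the term shrinks as the number of training patterns grows, and it grows with both $L_{\mathrm{eff}}$ and the depth $D$.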

In practice, it is quite difficult to obtain a tree that is consistent with the training data $S$. In that case, a bound on the misclassification probability can be obtained as a function of the misclassification probability on $S$, $P_S[T(x) \neq y]$. The probability vector is now reformulated according to the training set as $P' = (P'_1, \ldots, P'_L)$, with

$$P'_i = p_i \, \frac{P_S[T(x) = y \mid x \text{ reaches leaf } i]}{P_S[T(x) = y]}.$$

Applying the version of Theorem 5.1 adapted to trees that are not consistent with the training set [137], the following result is obtained for the H-LSVM tree.

Lemma 5.3. For a fixed $\xi > 0$, there is a constant $c$ that satisfies the following. Let $\mathcal{D}$ be a distribution on $X \times \{-1, +1\}$. Consider the class of decision trees of depth up to $D$ with H-LSVM internal node decision functions. With probability at least $1 - \xi$ over the training set $S$ (of size $N$), every decision tree $T$ has

$$P_{\mathcal{D}}[T(x) \neq y] \leq P_S[T(x) \neq y] + c \left( \frac{L'_{\mathrm{eff}} \, (M + 1) \, \log^2 N \, \log D}{N} \right)^{1/3},$$

where $c$ is a universal constant, and $L'_{\mathrm{eff}} = L(1 - d(P', U))$ is the empirical effective number of leaves of $T$.
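The empirical quantities in Lemma 5.3 can be computed from per-leaf training counts, as in the following sketch. It assumes that $p_i$ is estimated by the fraction of training patterns reaching leaf $i$, and it leaves the universal constant as a parameter `c` whose default of 1.0 is only a placeholder, since its value is not given. The function names and the three-leaf example counts are hypothetical.

```python
import math


def empirical_effective_leaves(reached, correct):
    """Return (L'_eff, P') from per-leaf counts of reached and correctly classified patterns."""
    n_total = sum(reached)
    n_correct = sum(correct)
    if n_total == 0 or n_correct == 0:
        raise ValueError("need at least one correctly classified training pattern")
    n_leaves = len(reached)
    p_prime = []
    for n_i, c_i in zip(reached, correct):
        p_i = n_i / n_total                 # assumed estimate of p_i on S
        acc_i = c_i / n_i if n_i else 0.0   # P_S[T(x) = y | x reaches leaf i]
        p_prime.append(p_i * acc_i / (n_correct / n_total))
    d = sum((p - 1.0 / n_leaves) ** 2 for p in p_prime)
    return n_leaves * (1.0 - d), p_prime


def inconsistent_tree_bound(train_error, l_eff_emp, n_features, n_train, depth, c=1.0):
    """Right-hand side of Lemma 5.3 for a placeholder constant c."""
    term = l_eff_emp * (n_features + 1) * math.log(n_train) ** 2 * math.log(depth) / n_train
    return train_error + c * term ** (1.0 / 3.0)


# Hypothetical tree with three leaves and 100 training patterns, 90 of them correct.
l_eff_emp, p_prime = empirical_effective_leaves([50, 30, 20], [45, 30, 15])
```

Under the stated assumption, $P'_i$ reduces to the share of correctly classified training patterns that fall in leaf $i$, so $P'$ remains a probability vector.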

Therefore, the tree parameters that determine the error bound for the H-LSVM algorithm are the depth $D$ of the tree and the effective number of leaves $L_{\mathrm{eff}}$: the lower these parameters, the tighter the bound on the generalization error. The values of these parameters and the resulting estimate of the model complexity according to Equation 5.7 are provided and analyzed in the following section.

5.3 Experimental Results

The experiments developed in this section provide an analysis of the H-LSVM model divided into three subsections corresponding to the threefold aim of the experiments:

• Compare H-LSVM with linear SVMs and nonlinear SVMs in terms of classification accuracy and prediction complexity.

• Compare H-LSVM with Zapién's algorithm, described at the beginning of this chapter (Section 5.1.1) and illustrated in Figure 5.2(a), in terms of classification accuracy and prediction complexity. To the best of the author's knowledge, this is the only existing approach combining exclusively linear SVMs and decision trees.

• Analyze numerically the H-LSVM error bounds derived in Section 5.2.2.

In line with the aim of the H-LSVM algorithm, the experiments were conducted on binary classification problems in which nonlinear SVMs produce a large number of support vectors. Table 5.2 presents the number of training and test patterns, the number of features and the public repository for each dataset. The Shuttle dataset has been converted to a binary classification problem by differentiating class 1 from the rest. In the same way, the Vehicle dataset has been reformulated as a binary classification task consisting of differentiating class 3 from the rest. The M3VO and M3VOm8 datasets correspond to differentiating digit 3 from all the other digits in the MNIST and MNIST8m problems, respectively. Finally, the Covtype dataset has been transformed into a binary classification problem consisting of differentiating class 2 from the rest.

Dataset    # Train    # Test       # Feat.   Repository
IJCNN      49,990     91,701       22        LIBSVM [53]
Shuttle    43,500     14,500       9         LIBSVM [53]
M3VO       60,000     10,000       780       LIBSVM [53]
M3VOm8     810,000    7,290,000    784       LIBSVM [53]
Vehicle    78,823     19,705       100       LIBSVM [53]
Faces      8,525      4,263        576       Peter Carbonetto [142]
Covtype    522,910    58,102       54        LIBSVM [53]

Table 5.2: Binary datasets used to compare H-LSVM with linear SVMs and nonlinear SVMs.
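The one-versus-rest reformulation applied to the multiclass datasets can be sketched as follows. The helper `one_vs_rest` is hypothetical and only illustrates the label mapping, with the selected class mapped to $+1$ and every other class to $-1$, in agreement with the leaf labels used in Section 5.2.2; the label list is made up.

```python
def one_vs_rest(labels, positive_class):
    """Map multiclass labels to +1 for positive_class and -1 for all other classes."""
    return [1 if y == positive_class else -1 for y in labels]


# Made-up multiclass labels; class 3 is separated from the rest, as for Vehicle.
binary_labels = one_vs_rest([1, 2, 3, 3, 5], positive_class=3)
```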

In most of the datasets (IJCNN, Shuttle, M3VO and Vehicle), the training and test subsets are given beforehand. For the Faces dataset, the experimental setup described by Zapién et al. [124] was followed, using two thirds of the observations for training and the rest for testing. Moreover, the data were normalized using the minimum and maximum feature values. The experiments were run over 10 different randomly chosen training-test partitions of the dataset. For the M3VOm8 and Covtype datasets, as many training patterns as possible were used in order to simulate a large-scale system with a large number of support vectors. Thus, the first 810,000 patterns of the M3VOm8 dataset were used for training and the remaining samples for testing.² For the Covtype dataset, following the experiments carried out in [115, 141], 9/10 of the samples were used for training and the remaining patterns for testing, and the experiments were run over 10 different randomly chosen training-test partitions of the dataset.

In all the experiments, and in accordance with the scenario described in the complexity cost analysis (Section 5.2.1), the linear and nonlinear SVMs implemented in the LIBLINEAR [52] and LIBSVM [53] packages, respectively, were used. For nonlinear SVMs, the Gaussian kernel, $k(x_i, x_j) = \exp(-\gamma \|x_i - x_j\|^2)$, was considered. H-LSVM has been implemented in C and the code is publicly available at

https://sites.google.com/site/irenerodriguezlujan/HLSVM-1.1.zip.
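For reference, the Gaussian kernel used for the nonlinear SVMs can be written in a few lines of Python. This is a plain restatement of the formula, not the LIBSVM code; the function name `gaussian_kernel` and the example vectors are hypothetical.

```python
import math


def gaussian_kernel(x_i, x_j, gamma):
    """Gaussian kernel k(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2)."""
    if len(x_i) != len(x_j):
        raise ValueError("both patterns must have the same number of features")
    sq_dist = sum((a - b) ** 2 for a, b in zip(x_i, x_j))
    return math.exp(-gamma * sq_dist)


# Two made-up patterns at squared distance 2.
k_same = gaussian_kernel([0.0, 0.0], [0.0, 0.0], gamma=0.5)
k_far = gaussian_kernel([0.0, 0.0], [1.0, 1.0], gamma=0.5)
```

Each kernel evaluation touches every feature of both patterns, which is why the classification cost of a nonlinear SVM grows with the number of support vectors.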

² LIBSVM did not finish in reasonable time on the M3VOm8 dataset when training with all the available patterns.

           LIBLINEAR    LIBSVM                    H-LSVM
Dataset    $C$          $C$         $\gamma$      $\lambda$    $\rho$
IJCNN      $10^{1}$     $10^{1}$    $10^{0}$      $10^{-5}$    0
Shuttle    $10^{2}$     $10^{6}$    $10^{0}$      $10^{-7}$    0.2
M3VO       $10^{0}$     $10^{2}$    $10^{-2}$     $10^{-4}$    0.2
M3VOm8     $10^{0}$     $10^{2}$    $10^{-2}$     $10^{-5}$    0.2
Vehicle    $10^{-1}$    $10^{1}$    $10^{-1}$     $10^{-6}$    0.1
Faces      $10^{-1}$    $10^{1}$    $10^{-2}$     $10^{-5}$    0.1
Covtype    $10^{0}$     $10^{0}$    0.346         $10^{-7}$    0.0

Table 5.3: Parameters used in the linear SVM, nonlinear SVM and H-LSVM models for each binary dataset.

Linear SVMs, nonlinear SVMs and H-LSVM each require a few parameters to be set. For all datasets except Covtype, hyperparameter selection was done using 5-fold cross-validation over the training set. The cost parameter $C$ in linear and nonlinear SVMs was selected from the grid $10^i$, $i = -6, \ldots, 6$. The $\gamma$ parameter of the Gaussian kernel was taken from the range $10^i$, $i = -3, \ldots, 3$. Finally, for the H-LSVM model, the maximum number of Weighted-Pegasos iterations was fixed to $T = 10^7$, the allowable tolerance was set to $\epsilon_{PEG} = 10^{-4}$, and the minimum proportion of patterns needed to split a node, $\delta$, was chosen as $10^{-i}$ with $i = \lfloor \log_{10} N \rfloor$ to guarantee that the H-LSVM tree grows to a sufficient size (pruning is applied if necessary). The regularization parameter $\lambda$ was chosen from the grid $10^i / N$, $i = -6, \ldots, 6$, where $N$ is the number of training samples. This grid was obtained from the equivalence $\lambda = \frac{1}{CN}$ between the LIBLINEAR and LIBSVM cost parameter $C$ and the $\lambda$ regularizer in H-LSVM. The prune rate $\rho$ took values in $\{0.0, 0.1, 0.2\}$. Unfortunately, applying this hyperparameter selection procedure to the Covtype dataset is infeasible due to the size of the dataset and the number of support vectors in the resulting model; therefore, the nonlinear SVM hyperparameters provided in [115] were applied. The resulting parameters for each dataset and each model are given in Table 5.3.
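The search grids described above can be generated as in the following sketch, which only builds the candidate values and does not run the cross-validation. The function name `hlsvm_grids` and the dictionary keys are hypothetical; the training-set size 49,990 is that of the IJCNN dataset in Table 5.2.

```python
import math


def hlsvm_grids(n_train):
    """Candidate hyperparameter values for a training set of n_train patterns."""
    exponents = range(-6, 7)
    return {
        "C": [10.0 ** i for i in exponents],                 # LIBLINEAR / LIBSVM cost
        "gamma": [10.0 ** i for i in range(-3, 4)],          # Gaussian kernel width
        "lambda": [10.0 ** i / n_train for i in exponents],  # from lambda = 1 / (C N)
        "rho": [0.0, 0.1, 0.2],                              # prune rate
        # Minimum proportion of patterns needed to split a node.
        "delta": 10.0 ** (-math.floor(math.log10(n_train))),
    }


grids = hlsvm_grids(49_990)
```

Because $\lambda = 1/(CN)$, each cost value $C = 10^{j}$ corresponds to the grid point $\lambda = 10^{-j}/N$, so the $\lambda$ grid covers the same range of regularization strengths as the $C$ grid.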

In document A practical view of large-scale classification: feature selection and real-time classification (Page 148-154)