Cost-Complexity Pruning - Decision Trees

2.3 Learning Algorithms

2.3.3 Decision Trees

2.3.3.1 Cost-Complexity Pruning

Generally, including a pruning algorithm at the end of the training of a decision tree is mandatory, especially in those cases in which the stopping criterion does not incorporate pre-pruning rules. Most of the existing pruning methods remove the deepest nodes of the tree whose validity is in doubt as they can have been generated with few samples. Among them, the Cost-Complexity pruning algorithm proposed by Breiman [58] is described in what follows since it is a key element of the algorithm presented in Chapter 5.

As it name suggests, the Cost-Complexity objective function establishes a trade-off between the misclassification cost of the tree and its complexity, approximated by the number of leaves in the tree. Formally, the cost-complexity objective function for a tree T with a set of leaves ˜T is given by

Rα_emp(T ) = α| ˜T | + Remp(T ) , (2.39)

where α ∈ [0, ∞) is a scalar parameter overweighting the complexity and misclassification terms. As α varies in the interval [0, ∞), a set of decreasing-size subtrees of the original tree is obtained. In fact, when α = 0 the complexity term in Equation 2.39 is not considered and therefore, the best tree is that with the lowest misclassification rate. On the other hand, there exists a large enough value for α which produces a decision tree formed exclusively by the root. It can be shown that for every value of α, there exists a tree T (α) minimizing Equation 2.39 [58, Chapter 10].

The main idea of the Cost-Complexity method is to construct a set of decreasing-size subtrees by means of variations in the α parameter. Although α can take infinite values in the interval [0, ∞), the set of subtrees of the original tree is finite and thus, the problem is not so arduous at it looks at first glance. It works as follows, suppose that T (αk) is the minimizing tree of Equation 2.39 given αk then, this tree is still the minimizing tree in Equation 2.39 until a jump point αk+1 is reached. Now, T (αk+1) becomes the new minimizing tree until achieving the next jump point αk+2 and so on. The result of this process is a sequence of subtrees T = T (0) > T (α1) > . . . > T (αK) = root(T ). The search of the subtree minimizing the Cost-Complexity function among all the pos- sible subtrees in T is unfeasible, being mandatory the implementation of an efficient pruning algorithm. The starting point is not the original tree T but T0, the smallest subtree of T verifying R(T0) = R(T ). The subtree T0can be easily obtained by removing the leaves of the original tree T which do not improve the misclassification error. The heart of the Cost-Complexity pruning lies in finding the weakest-link cutting based on the following proposition [58]:

Proposition 2.2. For t any nonterminal node of T0, Tt the subtree of T with root t and ˜

Remp(t) > Remp(Tt) = X

t0_{∈ ˜}_T t

Remp(t0) . (2.40)

Then, if {t} is the subbranch of Tt consisting of the single node t, the cost-complexity functions for {t} and Ttcan be written as,

Rα_emp({t}) = Remp(t) + α Rα_emp(Tt) = Remp(Tt) + α| ˜Tt|.

At some critical value of α, Rα_emp({t}) and R_empα (Tt) met. Working out α in Remp(t)+α = Remp(Tt) + α| ˜Tt|:

α = Remp(t) − Remp(Tt) | ˜Tt| − 1

Regarding Equation 2.40, α is positive and it can be interpreted as the value which makes the subbranch {t} preferable to Tt. Writing the previous expression as a function of the set of internal nodes t in T0,

g0(t) =    Remp(t)−Remp(Tt) | ˜Tt|−1 if t /∈ ˜T0; +∞ if t ∈ ˜T0.

The weakest link ¯t0in T0is the node minimizing g0(t). Intuitively, ¯t0is the first node that as α increases, Rα_emp({¯t0}) becomes equal to Rempα (T¯t0) and therefore, ¯t0 is preferable

to T¯t0. The critical value for α associated to the weakest link is α1 = g0(¯t0). The

process is repeated starting from T1 = T0− T¯t0 instead of T0. If at any step there are

several candidates to minimize gk(t) (for example gk(¯tk) = gk(¯t0k)), the next subtree in the sequence is defined as Tk+1 = Tk− T¯tk − T¯t0k. The algorithm finishes when only

the root is left, having obtained the complete sequence of subtrees. The only point left to address is how to select one of these subtrees as the optimum-size tree. As

the training misclassification rate is a biased estimation of Remp(Tk), an independent pruning set, made up randomly or via cross-validation, is used to evaluate the goodness of each subtree and select that with the highest classification accuracy.

Other pruning methods have been developed in parallel to decision tree algorithms. For example, the C4.5 algorithm tackles the pruning task from other perspective, making pessimistic assumptions about the error rate in each node of the tree. A thorough analysis of the CART and C4.5 pruning techniques was carried out by Esposito et al. [62] concluding that, while C4.5 pruning tends to prune less than necessary, the Cost- Complexity method usually yields subtrees smaller than the optimal. Bearing in mind all these factors, the Cost-Complexity algorithm seems more suitable for large-scale machine learning systems trying to reduce the classification cost.

Table 2.4 gathers some of the advantages and limitations of the previous classifiers. In accordance with the goal of Chapter 5 of constructing a classifier capable of classifying new patterns in few milliseconds, the nonlinear SVMs do not look as the best option in spite of their effectiveness and robust mathematical background. However, the high pre- diction speed of linear SVMs (besides their good generalization properties) and decision trees together with the easy combination of decision trees with other classification algorithms, points to the hybrid architecture of decision trees and linear SVMs presented in Chapter 5 as an appealing solution to approximate nonlinear SVMs by piecewise linear functions with low classification cost.

Classifier Advantages Disadvantages

· Simplicity · Poor performance in non-linear

domains

Linear · Good scalability

· Fast training algorithms in general

· Fast classification

· Robust mathematical basis · Lack of expressiveness and com-

prehensibility

SVM · Flexible method for creating

nonlinear models

· Poor scalability of training algorithms (except for the linear ker- nel)

· Good generalization · Good accuracy levels

· Simplicity · Instability and high risk of over-

fitting

Decision trees · Fast classification · Higher training cost than linear

models

· Graphical representation · The structure of decision

boundaries limits its application · Relatively faster learning speed

than other classification methods

· Oblique decision trees are not convex and computationally expensive

· Can be combined with other decision techniques

Table 2.4: Main advantages and disadvantages of some of the most popular learning algorithms.

Chapter

3

Quadratic Programming Feature

Selection (QPFS)

Chapters 1 and 2 have highlighted how the incorporation of an effective feature selection phase in a large-scale classification system is highly recommendable in order to, not only reduce the computational cost of the learning algorithm, but also to favor the generalization and the understandability of the final model. In this respect, this chapter presents one of the contributions of this thesis: a new feature selection method for very large and high dimensionality multiclass classification problems [63]. The new method is in line with filter feature selection techniques introduced in Section 2.2.1 and, in particular, it is inspired by the mRMR algorithm which not only takes into account the relevance of each feature with the target class but also dependences among variables. The proposed method, named Quadratic Programming Feature Selection, reduces the feature selection task to a quadratic optimization problem simplifying the optimization process thanks to the Nyström method (Appendix C). The chapter presents the QPFS algorithm, including the Nyström approximation, error estimation and theoretical complexity. Finally, experiments comparing the proposed algorithm with state of the art filter techniques show that the new QPFS+Nyström method is a competitive and efficient filter-type feature selection algorithm for large-scale and high-dimensional supervised classification problems. The new method is superior in terms of computational efficiency to other successful feature selection algorithms while it maintains their classification accuracy.

3.1 Scalability of Feature Selection Methods

As discussed in Section 2.2, feature selection methods can be categorized into three groups as a function of their dependence on the classifier: filter methods perform feature selection regardless of the classifier, wrapper methods use search techniques to select candidate subsets of variables whose goodness is based on their classification accuracy, and embedded methods incorporate feature selection in the classifier objective function or algorithm. Section 2.2 has analyzed the most prominent advantages and disadvantages of each category, revealing their differences in terms of computational load as reflected in Figure 3.1. Among these groups, filter methods are often preferable to other selection methods for high dimensionality problems because of (i) their scalability, favored by their computational speed and their simplicity [64, 65]; and (ii) their usability with alternative classifiers, which means evaluate the costly feature selection process only once and then analyze different classifiers. A common disadvantage of filter methods is that most proposed techniques are univariate ignoring dependences among features. To overcome this problem, multivariate filter techniques taking into account dependences among features were proposed. Multivariate filters improve significantly the classification accuracy of univariate approaches but increase the computational ef- fort. The applicability of wrapper methods in high-dimensional problems is particularly critical in those cases where the number of features is large compared to the number of training samples (e.g., 10,000 features and 100 samples); a common case in gene selection domains [21]. In this situations, wrapper methods not only can be much slower, especially in learning the classifier, but also can prone to overfitting. Finally, embedded techniques can be located in an intermediate point between univariate filter and wrapper methods, being far less computationally demanding than wrapper methods but more expensive than univariate filter approaches. Regarding the multivariate filters, the most influential embedded techniques have a computational cost similar that of the multivariate filters [65]. However, as pointed out in Chapter 1, large-scale classifiers can be focused on different aspects like accelerating the training time, improving the classification cost or reducing the memory requirements thus, multivariate filters are versatile enough to use them whatever is the final purpose. Moreover, their use as a preprocessing step can lighten the computational cost of the classifier and overcome overfitting at the same time [64].

Independence of classiﬁer C o m pu ta tio n a l co m p le xi ty Univariate Filter Wrapper

Embedded Multivariate_Filter

Figure 3.1: Dependence between the computational cost and the independence with the classifier of the main feature selection algorithms: univariate filter, multivariate

filter, wrapper and embedded methods.

Leaving aside the computationally intensive wrapper methods, Table 3.1 shows the time charge of some of the most influential filter and embedded techniques aforementioned in Section 2.2. Note that these computational costs correspond to ranking features ac- cording to certain criterion but they do not include the cost of determining the optimal number of features to be selected. As expected, the univariate filter feature selection methods (MaxRel) has the lowest computational cost but it does not takes into account dependences among features, which limits its effectiveness. The SFS method captures linear dependences as well but it is not simply a feature selection technique as it is able to generate new features. However, its cost is cubic in the number of features which makes SFS hardly scalable. In spite of the computational efficiency of multivariate feature selection methods based on linear similarities – like CFS –, multivariate algorithms, such as mRMR or ReliefF, with reasonable computational requirements and based on nonlinear similarities are often preferred as they are not reduced to linear dependences [28, 66]. Specifically, mRMR has a quadratic dependence on the dimension of the classification problem while ReliefF scales at least quadratically with the number of patterns. However, the mRMR provides in general better classification rates than ReliefF as it will be shown later in Section 3.3. Finally, the RFE-SVM algorithm introduced in Section 2.2.3 is a good example of embedded methods because of its classification effectiveness and its leadership in terms of computational cost. RFE-SVM performs backward feature elimination and its cost is totally dependent on the criterion used to remove features. The most popular techniques are (i) deleting features by a factor of 2 (that is, reduce the number of remaining features by half at each iteration)

Model Examples Complexity Univariate MaxRel N M CFS N M2 Filter Multivariate SFS N M3 mRMR N M2 ReliefF N2M

Embedded RFE-SVM max(N, M )N2 (Features removed by a factor of 2)

N2M2 (Features removed one by one)

Table 3.1: Frequently used filter and embedded feature selection methods. N is the number of training patterns and M the number of features.

or (ii) removing features one by one.

In document A practical view of large-scale classification: feature selection and real-time classification (Page 67-76)