Tree induction methods and linear regression are both popular techniques for supervised learning tasks. The two schemes have somewhat complementary properties: the simple linear models fitted by regression exhibit high bias and low variance, while tree induction fits more complex models, which results in lower bias but higher variance. Therefore, a natural step to improve the performance of decision and regression trees is to combine these two techniques into one.
In the existing literature, the term model trees has been used to refer to a more general type of decision and regression trees that are able to explore multiple representation languages in the internal (decision) nodes and in the leaves (Gama, 2004). The first work that explores the idea of using linear combinations of attributes in internal nodes is CART by Breiman et al. (1984), followed by FACT (Loh and Vanichsetakul, 1988), Ltree (Gama, 1997), QUEST (Loh and Shih, 1997) and Cruise (Kim and Loh, 2001). In the regression domain, model trees mainly represent algorithms that use a linear combination of attributes or kernel regression models only in the leaves of the tree. One of the earliest works that explores the idea of using functional leaves is the Perceptron Tree by Utgoff (1988), where leaves can contain a general linear discriminant function. A similar approach is proposed by Quinlan (1992).
Adding a general complex function in the leaves enables the exploration of a more sophisticated representation language. Functional leaves may also implement a naive Bayes classifier, which was shown to significantly improve performance in a study by Kohavi (1996). The most complex type of model trees employs kernel regression models in the leaves (Torgo, 1997). Gama (2004) further showed that using functional leaves is a variance reduction method, while using functional inner nodes is a bias reduction process.
A linear model tree is a regression tree whose leaves implement a general linear function. Unlike ordinary regression trees, model trees construct a piecewise linear (instead of a piecewise constant) approximation to the target function. The final model tree consists of a tree with linear regression functions at the leaves, and the prediction for an instance is obtained by sorting it down to a leaf and using the prediction of the linear model associated with that leaf. The linear models at a leaf typically do not incorporate all attributes present in the data, in order to avoid building overly complex models. This means that ordinary regression trees are a special case of model trees: in that case the 'linear regression models' do not incorporate any attribute and are just the average class value of the training instances at that node.
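The prediction procedure described above can be sketched in a few lines. This is only an illustrative sketch: the class names `Leaf` and `Split`, the `<=` routing convention and the example tree are assumptions made here, not part of any of the cited algorithms.

```python
from dataclasses import dataclass, field
from typing import Dict, Union


@dataclass
class Leaf:
    """Leaf holding a linear model over a subset of the attributes."""
    intercept: float
    # attribute index -> coefficient; an empty dict gives a constant leaf,
    # i.e. the ordinary regression tree special case
    weights: Dict[int, float] = field(default_factory=dict)

    def predict(self, x):
        return self.intercept + sum(w * x[j] for j, w in self.weights.items())


@dataclass
class Split:
    """Internal node testing a single attribute against a threshold."""
    attribute: int
    threshold: float
    left: "Node"
    right: "Node"


Node = Union[Leaf, Split]


def predict(node, x):
    # Sort the instance down the tree until a leaf is reached,
    # then apply the linear model stored in that leaf.
    while isinstance(node, Split):
        node = node.left if x[node.attribute] <= node.threshold else node.right
    return node.predict(x)


# Hypothetical tree: a linear leaf on the left, a constant leaf on the right.
tree = Split(attribute=0, threshold=5.0,
             left=Leaf(intercept=1.0, weights={0: 2.0}),
             right=Leaf(intercept=20.0))
```

With this toy tree, `predict(tree, [3.0])` evaluates the left linear model, while `predict(tree, [8.0])` returns the constant stored in the right leaf.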
The existing work on learning model trees includes algorithms which build the tree as a standard regression tree, using variants of the least squares evaluation function, such as the maximization of the BSS measure previously defined. The use of the variance as the impurity measure is justified by the fact that the best constant predictor in a node is the expected value of the predicted variable for the instances that belong to the node. After the tree is fully grown, the post-pruning phase is combined with constructing linear regression models in the leaves.
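As a rough illustration of variance-based tree growing, the sketch below picks the threshold on a single numeric attribute that maximizes the reduction in the sum of squared deviations from the node mean. This is one common least squares variant and is used here as an assumption; the BSS measure referred to above is defined elsewhere and is not reproduced. The function names and the toy data are hypothetical.

```python
def sse(values):
    """Sum of squared deviations from the mean (the best constant predictor)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_constant_split(xs, ys):
    """Return (threshold, reduction) maximizing the drop in squared error."""
    pairs = sorted(zip(xs, ys))
    total = sse([y for _, y in pairs])
    best_threshold, best_reduction = None, 0.0
    for k in range(1, len(pairs)):
        if pairs[k - 1][0] == pairs[k][0]:
            continue  # no threshold can separate identical attribute values
        left = [y for _, y in pairs[:k]]
        right = [y for _, y in pairs[k:]]
        reduction = total - sse(left) - sse(right)
        if reduction > best_reduction:
            best_threshold = (pairs[k - 1][0] + pairs[k][0]) / 2
            best_reduction = reduction
    return best_threshold, best_reduction


# Toy data with two clearly separated plateaus.
xs = [1, 2, 3, 10, 11, 12]
ys = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
threshold, reduction = best_constant_split(xs, ys)
```

The sketch recomputes the squared error from scratch for every candidate threshold, which keeps it short but is not how an efficient implementation would proceed.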
This category of algorithms includes M5 by Quinlan (1992) and M5' (Wang and Witten, 1997), both based on variance reduction. The same approach is also used in HTL by Torgo (1997), which replaces the linear models with more complicated models, such as kernels and local polynomials. For such regression trees, both construction and deployment of the model are expensive. However, they are potentially superior to the linear regression trees in terms of accuracy.
As noted by Karalič (1992), the variance of the response variable can be a poor estimator of the merit of a split when linear regressors are used in the leaves, especially if the points are arranged along a line which is not perpendicular to the axis of the response. To correct
Decision Trees, Regression Trees and Variants 31
this, Karalič (1992) proposed an impurity function that uses the mean square error of the linear model fitted in the leaf. The appropriate impurity function would be:
$$\mathrm{RSS} = \sum_{i \in I_L} \left(y_i - \hat{f}_L(x_i)\right)^2 + \sum_{i \in I_R} \left(y_i - \hat{f}_R(x_i)\right)^2,$$
where $\hat{f}_L$ and $\hat{f}_R$ represent the optimal models for the left and the right partitions, respectively. For every possible split attribute and split point, a different pair of optimal models $\hat{f}_L$ and $\hat{f}_R$ will be obtained.
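To make the criterion concrete, the following sketch evaluates the RSS of a candidate split by fitting a separate least squares line on each side, and searches all candidate thresholds exhaustively. It is a simplified sketch under stated assumptions: a single numeric predictor, a closed-form simple linear regression as the "optimal model", and a hypothetical `min_leaf` parameter; it is not the implementation used by any of the cited systems.

```python
def linear_rss(xs, ys):
    """Residual sum of squares of the least squares line fitted to (xs, ys)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx > 0 else 0.0  # fall back to a constant model
    intercept = my - slope * mx
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))


def best_linear_split(xs, ys, min_leaf=2):
    """Exhaustive search for the threshold minimizing RSS_left + RSS_right."""
    pairs = sorted(zip(xs, ys))
    best_threshold, best_rss = None, float("inf")
    for k in range(min_leaf, len(pairs) - min_leaf + 1):
        if pairs[k - 1][0] == pairs[k][0]:
            continue  # identical attribute values cannot be separated
        left, right = pairs[:k], pairs[k:]
        rss = (linear_rss([x for x, _ in left], [y for _, y in left])
               + linear_rss([x for x, _ in right], [y for _, y in right]))
        if rss < best_rss:
            best_threshold = (pairs[k - 1][0] + pairs[k][0]) / 2
            best_rss = rss
    return best_threshold, best_rss


# Toy data: two different linear regimes meeting between x = 4 and x = 5.
xs = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [0, 2, 4, 6, 8, 15, 14, 13, 12, 11]
threshold, rss = best_linear_split(xs, ys)
```

Every candidate threshold requires two model fits, which is the source of the computational cost discussed next.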
By using this split criterion, the RETIS algorithm of Karalič (1992) solves the aforementioned problem and builds trees of higher quality. However, if exhaustive search is used to determine the split point, the computational cost of the algorithm becomes excessively high. Namely, for real-valued predictor attributes, a linear system has to be formed and solved for all possible values in their domain. The situation is even worse for categorical attributes, since the number of linear systems that need to be solved depends exponentially on the cardinality of the domain of each attribute.
A simple modification alleviates the problem of high computational complexity by considering only a sample of all the possible split points. However, it is unclear how this would influence the accuracy of the generated trees. SMOTI by Malerba et al. (2002) allows regression models to exist in the internal nodes of the tree as well, instead of only in the leaves, which accounts for both global and local effects. However, its split evaluation criterion is even more computationally intensive, because it takes into account all the linear regression models reported along the path from the root to the leaf, in addition to the mean square error of the locally fitted linear model.
The LLRT algorithm by Vogel et al. (2007) improves the computational complexity of RETIS-style algorithms by replacing the Singular Value Decomposition, typically used for determining the regression coefficient vector in the linear regression equation, with quicker alternatives such as Gaussian elimination. The required matrices for all the features in the model are pre-calculated through a single scan of the data, which turns out to be equivalent to running a linear regression over the whole training dataset. The computational complexity of each split evaluation operation is $O(p^3)$, where $p$ is the number of predictor variables used in the linear model, resulting in $O(p^4)$ operations for determining each potential split. The negative side of this approach is the requirement to store the complete set of matrices in memory at all times.
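The idea of pre-calculating sufficient statistics in a single scan can be illustrated as follows. This sketch is an assumption-laden simplification, not the LLRT implementation: it uses a single predictor plus an intercept, so the accumulated quantities are scalar sums rather than matrices, and the function names are hypothetical. In the general case the analogous quantities would be the matrices of the normal equations.

```python
def prefix_stats(xs, ys):
    """One pass over data sorted by x; entry k summarizes the first k points
    as (n, sum x, sum y, sum x^2, sum x*y, sum y^2)."""
    stats = [(0, 0.0, 0.0, 0.0, 0.0, 0.0)]
    for x, y in zip(xs, ys):
        n, sx, sy, sxx, sxy, syy = stats[-1]
        stats.append((n + 1, sx + x, sy + y, sxx + x * x, sxy + x * y, syy + y * y))
    return stats


def rss_from_stats(s):
    """RSS of the least squares line, computed from the summary alone."""
    n, sx, sy, sxx, sxy, syy = s
    if n == 0:
        return 0.0
    cxx = sxx - sx * sx / n   # centered sum of squares of x
    cxy = sxy - sx * sy / n   # centered cross product
    cyy = syy - sy * sy / n   # centered sum of squares of y
    if cxx <= 0:
        return max(cyy, 0.0)  # constant x: only a constant model can be fitted
    return max(cyy - cxy * cxy / cxx, 0.0)  # clamp small negative round-off


def split_rss(stats, k):
    """RSS of splitting after the k-th sorted point, without rescanning data."""
    total = stats[-1]
    right = tuple(t - l for t, l in zip(total, stats[k]))
    return rss_from_stats(stats[k]) + rss_from_stats(right)


# Toy data, already sorted by x.
xs = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [0, 2, 4, 6, 8, 15, 14, 13, 12, 11]
stats = prefix_stats(xs, ys)
best_k = min(range(2, 9), key=lambda k: split_rss(stats, k))
```

The trade-off mentioned above is visible even in this toy version: all the summaries in `stats` are kept in memory so that each split can be evaluated without revisiting the data.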
As in previous attempts to learn decision and regression trees by replacing exhaustive search with some form of statistical test for split variable selection, Chaudhuri et al. (1994) incorporate Student's t-test in the SUPPORT algorithm. The main idea is to fit a functional model or just a simple linear regressor for every node in the tree and then partition the instances into two groups: instances with positive residuals and instances with negative residuals. Unfortunately, it is not clear why the differences in the distributions of the signs of the residuals are a good criterion for evaluating selection decisions. Further improvements of the same approach were proposed later by Loh (2002), consisting mainly of a more sophisticated mechanism for handling the variable selection bias and of replacing Student's t-test with the $\chi^2$-test.
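A minimal sketch of the residual-sign idea is given below. Several details are assumptions made for illustration and are not taken from the description above: the node model is a simple least squares line on the first attribute, the two residual groups are compared attribute by attribute with a Welch-style two-sample t statistic, and the attribute with the largest absolute statistic is taken as the split variable. All function names and the toy data are hypothetical.

```python
import math
import statistics


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx > 0 else 0.0
    return my - slope * mx, slope


def welch_t(a, b):
    """Absolute two-sample t statistic (unequal variances assumed).
    Each group must contain at least two values."""
    diff = abs(statistics.mean(a) - statistics.mean(b))
    se = math.sqrt(statistics.variance(a) / len(a)
                   + statistics.variance(b) / len(b))
    if se == 0.0:
        return math.inf if diff > 0 else 0.0
    return diff / se


def residual_sign_scores(rows, residuals):
    """Score each attribute by how differently it is distributed in the
    positive-residual and the non-positive-residual group."""
    pos = [r for r, e in zip(rows, residuals) if e > 0]
    neg = [r for r, e in zip(rows, residuals) if e <= 0]
    return [welch_t([r[j] for r in pos], [r[j] for r in neg])
            for j in range(len(rows[0]))]


# Toy node: the response is linear in attribute 0 but jumps with attribute 1.
rows = [(1.0, 0.1), (2.0, 0.2), (3.0, 0.3), (4.0, 0.4),
        (1.0, 0.6), (2.0, 0.7), (3.0, 0.8), (4.0, 0.9)]
ys = [a + (5.0 if b > 0.5 else 0.0) for a, b in rows]
intercept, slope = fit_line([r[0] for r in rows], ys)
residuals = [y - (intercept + slope * r[0]) for r, y in zip(rows, ys)]
scores = residual_sign_scores(rows, residuals)
split_attribute = max(range(len(scores)), key=scores.__getitem__)
```

In this toy example the fitted line already explains the effect of attribute 0, so the sign of the residuals is driven by attribute 1, which therefore receives the highest score.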
A conceptually different approach has been proposed by Dobra and Gehrke (2002), based on transforming the regression problem into a classification problem. The main idea employed in the SECRET algorithm is to use the expectation-maximization (EM) algorithm on the instances associated with each node, in order to determine two Gaussian clusters with shapes close to flat disks. After the clusters are determined, an instance is labeled with class label 1 if the probability of belonging to the first cluster exceeds the probability of belonging to the second cluster, or with class label 2 in the opposite case. Having instances labeled in this manner, the Gini gain or a similar split evaluation measure can be used to determine the split attribute and the corresponding split point. The linear regressors in the leaves are determined using least
squares linear regression. This approach avoids forming and solving a large number of linear systems of equations, which is required in an exhaustive search procedure as used by RETIS. However, its effectiveness depends on the validity of the assumption that the instances can be separated into two groups with the shapes of flat disks in the space of the response variable.
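The split selection step that follows the clustering can be sketched as below. The sketch assumes that the class labels 1 and 2 have already been assigned by the EM step, which is not shown, and it evaluates a single numeric attribute only. The function names and the toy data are hypothetical.

```python
def gini(labels):
    """Gini impurity for two classes labeled 1 and 2."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(1 for c in labels if c == 1) / n
    return 2.0 * p * (1.0 - p)


def best_gini_split(xs, labels):
    """Return (threshold, gain) maximizing the Gini gain on one attribute."""
    pairs = sorted(zip(xs, labels))
    n = len(pairs)
    parent = gini([c for _, c in pairs])
    best_threshold, best_gain = None, 0.0
    for k in range(1, n):
        if pairs[k - 1][0] == pairs[k][0]:
            continue  # identical attribute values cannot be separated
        left = [c for _, c in pairs[:k]]
        right = [c for _, c in pairs[k:]]
        gain = parent - (len(left) / n) * gini(left) - (len(right) / n) * gini(right)
        if gain > best_gain:
            best_threshold = (pairs[k - 1][0] + pairs[k][0]) / 2
            best_gain = gain
    return best_threshold, best_gain


# Toy node: cluster labels assumed to come from a previous EM step.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
labels = [1, 1, 1, 2, 2, 2]
threshold, gain = best_gini_split(xs, labels)
```

No linear system is solved during this search; the regression models are fitted only once the partition has been chosen.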