Change detection and adaptation are integral parts of the learning process and play an important role in online learning. Adaptation is essentially a method of forgetting or degrading the induced models. Research in machine learning has been mostly concerned with the problem of how a system may acquire (useful) knowledge that it does not possess, and relatively little attention has been paid to the converse problem: how may a system dispose of knowledge that it already possesses but that is no longer useful?
While learning is a process in which an organized representation of experience is constructed, forgetting is a process in which parts of that organized representation are rearranged or disposed of. In machine learning, there are two main approaches for adapting to changes in the target concept or the target function (i.e., forgetting):
• Methods for blind adaptation: These methods are based on blind data management, consisting of instance weighting and selection. In this category, we have methods for learning over a sliding window and methods for weighting the instances or parts of the hypothesis according to their age or their utility for the learning task (Klinkenberg, 2004; Klinkenberg and Joachims, 2000; Klinkenberg and Renz, 1998). Typically, the oldest information has less importance. Weighting examples corresponds to gradual forgetting, while the use of time windows corresponds to abrupt forgetting. The disadvantage of this approach is its ignorance with respect to the existence of an actual change. Namely, the model is updated without considering whether changes have really occurred or not.
• Methods for informed adaptation: This category includes methods that use some form of change detection in order to determine the point of change or the time window in which the change has occurred. The adaptation is initiated by the event of change detection and is therefore referred to as informed (Gama et al., 2003). A very useful feature of these methods is their capacity for granular adaptation of global models (such as decision rules and decision trees), where only the parts of the model affected by the change are updated (Hulten et al., 2001).
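The difference between abrupt and gradual forgetting can be illustrated with a minimal sketch. The class names `SlidingWindowMean` and `FadingMean`, the window size, and the fading factor `alpha` are hypothetical choices made for this illustration; they are not taken from the cited works. Both estimators track the mean of a target that changes halfway through the stream, and neither checks whether a change has actually occurred.

```python
from collections import deque


class SlidingWindowMean:
    """Abrupt forgetting: only the last `size` examples are remembered."""

    def __init__(self, size):
        self.window = deque(maxlen=size)

    def update(self, y):
        self.window.append(y)  # the oldest value is dropped once the window is full
        return sum(self.window) / len(self.window)


class FadingMean:
    """Gradual forgetting: the weight of an example shrinks by `alpha` per step."""

    def __init__(self, alpha):
        self.alpha = alpha
        self.weighted_sum = 0.0
        self.weight_total = 0.0

    def update(self, y):
        self.weighted_sum = self.alpha * self.weighted_sum + y
        self.weight_total = self.alpha * self.weight_total + 1.0
        return self.weighted_sum / self.weight_total


# A target whose value jumps from 0 to 10 halfway through the stream.
stream = [0.0] * 20 + [10.0] * 20
window, fading = SlidingWindowMean(size=5), FadingMean(alpha=0.8)
for y in stream:
    w_est, f_est = window.update(y), fading.update(y)
```

At the end of the stream the window estimate has forgotten the old concept completely, while the fading estimate still carries a small, exponentially decaying trace of it.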
Naturally, there is always a trade-off between the cost of an update and the anticipated gain in performance. This motivates the development of incremental learning algorithms with fast execution and response, which require only a sub-linear amount of memory for the learning task, and are able to detect changes and perform an informed adaptation of the current hypothesis with a minimal loss of accuracy.
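As an illustration of informed adaptation, the sketch below monitors the error of a model with the Page-Hinkley test, a standard sequential change detection test that is used here only as an example detector; the text above does not prescribe a particular one. The parameter values `delta` and `threshold` and the synthetic error sequence are assumptions made for the example. The detector uses constant memory per monitored signal, and adaptation is triggered only when a change is signalled.

```python
class PageHinkley:
    """Page-Hinkley test for an increase in the mean of a signal (e.g. the error)."""

    def __init__(self, delta=0.05, threshold=5.0):
        self.delta = delta          # magnitude of change that is tolerated
        self.threshold = threshold  # alarm threshold (often written as lambda)
        self.reset()

    def reset(self):
        self.n = 0
        self.mean = 0.0
        self.cumulative = 0.0
        self.minimum = 0.0

    def update(self, x):
        """Feed one observation; return True if a change is signalled."""
        self.n += 1
        self.mean += (x - self.mean) / self.n
        self.cumulative += x - self.mean - self.delta
        self.minimum = min(self.minimum, self.cumulative)
        return self.cumulative - self.minimum > self.threshold


# Absolute errors of some model: small at first, larger after a change at t = 50.
errors = [0.1] * 50 + [2.0] * 50
detector, alarms = PageHinkley(), []
for t, e in enumerate(errors):
    if detector.update(e):
        alarms.append(t)   # informed adaptation would be triggered here
        detector.reset()   # e.g. after rebuilding the affected part of the model
```

A larger `threshold` gives fewer false alarms at the price of a longer detection delay, which is one concrete form of the trade-off between update cost and performance gain.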
3 Decision Trees, Regression Trees and Variants
The only way of discovering the limits of the possible is to venture a little way past them into the impossible.
Clarke's Second Law, by Arthur C. Clarke, in "Hazards of Prophecy: The Failure of Imagination"
At the core of this research work is the problem of any-time adaptive learning of regression trees and different variants of regression trees, such as model trees, option regression trees, and multivariate-response trees. We will elaborate on the tree induction task through a short presentation on the history of learning decision trees, and address each of the main issues in detail. Our presentation will revolve around a single important question: how to determine the amount of statistical evidence necessary to support each inductive decision in the learning process? We will review ideas from the field of statistical learning theory and unify them with existing machine learning approaches.
This chapter will explore the different dimensions of the tree induction process. We will consider both standard greedy search and the more explorative approach of using options, both a discrete-valued target function and a smooth regression surface, and both the single-target prediction case and the problem of predicting the values of multiple target variables. We will provide the background the reader needs to understand the different types of tree models and their learning from streaming data, and thus our contributions.
This chapter is organized as follows. We start by introducing the tree induction task and discussing the main properties of Top-Down Induction of Decision Trees (TDIDT) algorithms. We proceed by following the history of decision trees, which helps us introduce the two representative categories of decision tree learning algorithms and the use of statistical tests in the tree induction task. We next discuss the main issues when learning decision and regression trees, in particular selection and stopping decisions. The last three sections introduce model trees, decision and regression trees with options, and multi-target decision and regression trees.
3.1 The Tree Induction Task
Decision and regression trees are distribution-free models. The tree induction procedure, as already described, is very simple and efficient. In addition, decision and regression trees do not require parameter tuning or lengthy training. Due to the informed selection of attributes in the split tests, they are quite robust to irrelevant attributes. For these reasons, they are easy to implement and use.
Tree growing, also known as "hierarchical splitting", "recursive partitioning", "group dividing" or "segmentation", has its origin in the analysis of survey data. A classification tree or a regression tree is a set of rules for predicting the class or the numerical value of an object from the values of its predictor variables. The tree is constructed by recursively partitioning a learning sample of data in which the class label or the target variable value and the values of the predictor variables for each case are known. Each partition is represented by a node in the tree. Figure 1 gives a decision tree for the well-known toy problem of predicting whether a tennis game would be played or not, depending on the weather conditions.
Outlook
  Sunny: Humidity
    High: No
    Normal: Yes
  Overcast: Yes
  Rain: Wind
    Strong: No
    Weak: Yes
Figure 1: An illustrative decision tree for the concept of playing a tennis game. An example is classified by sorting it through the tree to the appropriate leaf node, then returning the prediction associated with this leaf.
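The sorting of an example down to a leaf can be sketched in a few lines. The nested-tuple representation and the names `PLAY_TENNIS_TREE` and `classify` are hypothetical, and the tree is assumed to have the usual layout of this toy problem (Outlook at the root, Humidity under Sunny, Wind under Rain).

```python
# Internal nodes are (attribute, {value: subtree}) pairs; leaves are class labels.
PLAY_TENNIS_TREE = (
    "Outlook",
    {
        "Sunny": ("Humidity", {"High": "No", "Normal": "Yes"}),
        "Overcast": "Yes",
        "Rain": ("Wind", {"Strong": "No", "Weak": "Yes"}),
    },
)


def classify(tree, example):
    """Sort `example` down the tree and return the label of the leaf it reaches."""
    while not isinstance(tree, str):
        attribute, branches = tree
        tree = branches[example[attribute]]  # follow the branch matching the value
    return tree


example = {"Outlook": "Sunny", "Humidity": "Normal", "Wind": "Strong"}
prediction = classify(PLAY_TENNIS_TREE, example)
```

Only the attributes tested on the path from the root to the leaf are inspected; here the value of Wind is never read.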
Decision and regression trees are known to provide efficient solutions to many complex non-linear decision problems by employing a divide-and-conquer strategy. Each division step consists of choosing a splitting test, which splits the feature space and, as a result, reduces the number of possible hypotheses. The divide step is then repeated on the resulting sub-spaces until a termination condition is met (e.g., all leaves of the decision tree are "pure") or until another user-defined stopping criterion has been fulfilled (e.g., maximal tree depth, node significance threshold, etc.).
In the regression setting, the problem is typically described with p predictor (explanatory) variables or attributes (x1, x2, ..., xp) and a continuous response variable or target attribute y. Given a sample of the distribution D underlying the data, a regression tree is grown in such a way that, for each node, the following steps are applied repeatedly until a termination condition is met:
1. Examine every allowable split.
2. Select and execute (create left and right children nodes) the best of these splits.
The root node comprises the entire sample, which corresponds to the entire instance space X, while its children nodes correspond to the resulting subspaces XL and XR obtained by executing the best split (X = XL ∪ XR). Basically, a split corresponds to a hyperplane that is perpendicular to the axis representing one of the predictor variables. The instance space is split recursively by choosing, at each step, a split that reduces the impurity of the node with respect to the examples assigned to the current region (node), which strengthens the association between the target variable and the predictor variables. In this context, two main approaches have appeared in the historical development of decision and regression trees; they will be discussed in more detail in the following section.
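The two steps above can be sketched as follows. This is a minimal batch illustration, not the algorithm developed in this work: the impurity measure (sum of squared deviations from the mean), the midpoint thresholds, and the stopping parameters `max_depth` and `min_samples` are all assumptions made for the example, as are the function names.

```python
def variance_sum(ys):
    """Sum of squared deviations from the mean (impurity of a regression node)."""
    if not ys:
        return 0.0
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys)


def best_split(X, y):
    """Examine every axis-parallel split; return the best (gain, attribute, threshold)."""
    best = (0.0, None, None)
    parent = variance_sum(y)
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        for lo, hi in zip(values, values[1:]):
            threshold = (lo + hi) / 2.0
            left = [t for row, t in zip(X, y) if row[j] <= threshold]
            right = [t for row, t in zip(X, y) if row[j] > threshold]
            gain = parent - variance_sum(left) - variance_sum(right)
            if gain > best[0]:
                best = (gain, j, threshold)
    return best


def grow(X, y, depth=0, max_depth=3, min_samples=2):
    """Greedy recursive partitioning; a leaf predicts the mean target of its region."""
    gain, j, threshold = best_split(X, y) if len(y) >= min_samples else (0.0, None, None)
    if j is None or depth >= max_depth:
        return sum(y) / len(y)
    left = [(r, t) for r, t in zip(X, y) if r[j] <= threshold]
    right = [(r, t) for r, t in zip(X, y) if r[j] > threshold]
    return (j, threshold,
            grow([r for r, _ in left], [t for _, t in left], depth + 1, max_depth, min_samples),
            grow([r for r, _ in right], [t for _, t in right], depth + 1, max_depth, min_samples))


def predict(tree, x):
    """Follow the split tests from the root until a leaf (a number) is reached."""
    while isinstance(tree, tuple):
        j, threshold, left, right = tree
        tree = left if x[j] <= threshold else right
    return tree


X = [[1.0, 5.0], [2.0, 3.0], [3.0, 8.0], [10.0, 4.0], [11.0, 7.0], [12.0, 1.0]]
y = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
tree = grow(X, y)
```

In this toy sample the target depends only on the first attribute, so the root split is placed on that attribute and the second one is ignored at the root, which illustrates why informed split selection makes trees robust to irrelevant attributes.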
Another point of divergence among existing tree learning algorithms is the stopping criterion. One line of algorithms uses a direct stopping rule, i.e., pre-pruning. The stopping rule refers to the criteria that are used for determining the "right-sized" decision or regression tree, that is, a tree with an appropriate number of splits. The other line of algorithms builds a tree until no more splitting is possible, and proceeds with a pruning phase in which the size of the tree is reduced (post-pruning). We will elaborate further on the meaning of a "right-sized" tree and optimal predictive accuracy in a separate discussion on the important issues of learning a tree-based predictor.
From the viewpoint of learning as search, tree learning algorithms perform a simple-to-complex, hill-climbing search of a complete hypothesis space. The hill-climbing search begins with an empty tree, progressively considering more elaborate hypotheses. The algorithm maintains only a single current hypothesis as it explores the space of all possible trees. Once it selects a splitting test and refines the current hypothesis, it never backtracks to reconsider this choice. As a consequence, this search method has no ability to determine how many alternative regression trees are consistent with the available training data, or to pose new queries that optimally resolve among these competing hypotheses. Therefore, although efficient, this procedure will result in the first tree that fits the training data and is susceptible to the usual problem of converging to a locally optimal solution.
Decision and regression trees can represent fairly complex concepts; therefore, the bias component of their error is typically low. However, batch tree learning algorithms are characterized by limited lookahead, instability, and high sensitivity to the choice of training data. Because splitting is performed until the stopping rule can be applied, decision and regression trees tend to overfit the data.
The result of this overfitting is a higher variance component of the error, which is related to the variability resulting from the randomness of the learning sample. This is why post-pruning is very important when building decision and regression trees. There exist different methods to reduce the variance of a machine learning method. The most common method for decision and regression trees is complexity control, i.e., pruning. A somewhat less explicit variance control method is model averaging. Model averaging consists of aggregating the predictions made by several models. Due to the aggregation, the variance is reduced and hence the accuracy of the aggregated predictions is typically higher. A deeper discussion and details of various model averaging methods are given in Chapter 4.
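The effect of aggregation can be illustrated with a small sketch that averages regression stumps trained on bootstrap replicates of the same sample (bagging, used here as one possible form of model averaging). The data-generating function, the noise level, the number of models, and the names `fit_stump`, `query` and `truth` are assumptions made for the example. For squared error, the error of the averaged prediction can never exceed the mean error of the individual models, whichever models are averaged.

```python
import random


def fit_stump(points):
    """Fit a one-split regression stump on (x, y) pairs by minimising squared error."""
    points = sorted(points)
    best = None
    for i in range(1, len(points)):
        if points[i - 1][0] == points[i][0]:
            continue  # no threshold can separate two identical x values
        left = [y for _, y in points[:i]]
        right = [y for _, y in points[i:]]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, (points[i - 1][0] + points[i][0]) / 2.0, lm, rm)
    if best is None:  # all x values identical: predict the overall mean
        mean = sum(y for _, y in points) / len(points)
        return lambda x: mean
    _, threshold, lm, rm = best
    return lambda x: lm if x <= threshold else rm


rng = random.Random(0)
data = [(x / 10.0, (x / 10.0) ** 2 + rng.gauss(0.0, 0.1)) for x in range(30)]

# Train each model on a bootstrap replicate of the data, then average the predictions.
models = [fit_stump([rng.choice(data) for _ in data]) for _ in range(25)]
query, truth = 1.5, 1.5 ** 2
predictions = [m(query) for m in models]
averaged = sum(predictions) / len(predictions)

mean_member_error = sum((p - truth) ** 2 for p in predictions) / len(predictions)
averaged_error = (averaged - truth) ** 2
```

The gap between `mean_member_error` and `averaged_error` equals the spread of the individual predictions around their average, so the more the models disagree, the more the aggregation helps.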