
5.5 Comparative Assessment

5.6.2.4 Datasets with Multiple Targets

One of the real-world multi-target prediction datasets used in our experimental evaluation is the Vegetation Condition problem (Kocev et al., 2009). This dataset is for prediction of the vegetation condition in the state of Victoria, Australia. The data provider is Matt White from the Arthur Rylah Institute for Environmental Research, Department of Sustainability and Environment (DSE) in Victoria, Australia. The prediction task is to determine the environmental characteristics of a site such as climatic, radiometric and topographic data from

4 http://tunedit.org/challenge/IEEE-ICDM-2010
5 http://stat-computing.org/dataexpo/2009


intrinsic characteristics of the species present at that site such as physiognomy, phenology and phylogeny data. The relationship between site properties and species properties is the functional dependency which we try to model by using a multi-target regression tree. The dataset consists of 16967 records, with 40 descriptive attributes and 6 target attributes.

The second real-world dataset, which we will refer to as Kras, represents a collection of satellite measurements and 3D data obtained with the Light Detection And Ranging (LiDAR) technology for measuring different forest properties. Multi-spectral satellite imagery is able to capture horizontally distributed spatial conditions, structures and changes. However, it cannot capture the 3D forest structure directly and is easily influenced by topographical covers and weather conditions. LiDAR, on the other hand, provides horizontal and vertical information (3D) at high spatial resolution and vertical accuracies. By combining satellite sensed data with LiDAR data, it is possible to improve the measurement, mapping and monitoring of forest properties and provide means of characterizing forest parameters and dynamics (Stojanova et al., 2010). The main objective is to estimate the vegetation height and canopy cover from an integration of LiDAR and satellite data in a diverse and unevenly distributed forest in the Kras (Karst) region of Slovenia. The dataset contains 60607 records, described with 160 numerical attributes and 2 target attributes.


6 Learning Model Trees from Time-Changing Data Streams

Panta rei ... everything is constantly changing, nothing remains the same!

Heraclitus

One of the main issues in traditional machine learning has been the problem of inferring high-quality models from small, possibly noisy datasets. However, in the past two decades a different type of problem has arisen, motivated by the increasing abundance of data that needs to be analyzed. This has naturally shifted the focus towards approaches that limit the amount of data necessary to induce models of the same quality as those induced over the whole training sample.

Although the task of learning from data streams poses some different challenges, it still requires a solution to the problem of learning under an abundance of training data. In the previous chapters we discussed several approaches to the problem that has been at the core of our research work: What is the minimum amount of data that can be used to make an inductive selection or a stopping decision without compromising the quality of the learned regression tree? The main conclusion is that there is no single best method that provides a perfect solution. Rather, there are various techniques, each with a different set of trade-offs. The general direction is given by the framework of sequential inductive learning, which promotes a statistically sound sampling strategy that will support each inductive decision.

In this chapter, we describe an incremental algorithm called FIMT-DD, for learning regression and model trees from possibly unbounded, high-speed and time-changing data streams. The FIMT-DD algorithm follows the approach of learning by using a probabilistically defined sampling strategy, coupled with an advanced automated approach to the more difficult problem of learning under non-stationary data distributions. The remainder of this chapter is organized as follows. We start with a discussion of the ideas behind the sequential inductive learning approach, which represents a line of algorithms for incremental induction of model trees. The next section discusses the probabilistic sampling strategy that is used to provide the statistical support for an incremental induction of decision trees. It gives the necessary background for introducing the relevant details of the algorithms we propose in the succeeding sections, for learning regression and model trees from data streams. Further, we present our change detection and adaptation algorithms, each discussed in a separate section. The last section contains the results from the experimental evaluation of the performance of the algorithms presented in the previous sections.

6.1 Online Sequential Hypothesis Testing for Learning Model Trees

The main idea behind the sequential inductive learning approach is that each inductive decision can be reformulated as the testing of a hypothesis over a given sample of the entire


training set. By defining the level of confidence (δ), the admissible error in approximating the statistic of interest (ε), and the sampling strategy, one can determine the amount of training data required to accept or reject the null hypothesis and reach a decision. This approach not only enables us to reach a stable, statistically supported decision but also enables us to determine the moment when this decision should be made.

In Chapter 3 we have discussed two batch-style algorithms for learning linear model trees by a sequential hypothesis testing approach. Both CRUISE by Kim and Loh (2001) and GUIDE by Loh (2002) rely on a χ² testing framework for split test selection. GUIDE computes the residuals from a linear model and compares the distributions of the regressor values from the two sub-samples associated with the positive and negative residuals. The statistical test is applied in order to determine the likelihood of the examples occurring under the hypothesis that they were generated from a single linear model, and therefore a confidence level can be assigned to the split.

The same approach has been used in the incremental model tree induction algorithms BatchRA and BatchRD, as well as their online versions OnlineRA and OnlineRD, proposed by Potts and Sammut (2005). The algorithms follow the standard top-down approach of building a tree, starting with a selection decision at the root node. The null hypothesis is that the underlying target function f is well approximated by a single linear model over the data in the complete node, i.e.,

$$H_0:\; f(x) = x^{T}\theta$$

where θ is a column vector of d parameters, given that we have d regressors (predictive attributes), and x is a vector of d attributes. Three linear models are further fitted to the examples observed at the node: $\hat{f}(x)$ using all $N$ examples; $\hat{f}_1(x)$ using the $N_1$ examples lying on one side of the split; and $\hat{f}_2(x)$ using the $N_2$ examples lying on the other side of the split. The residual sums of squares are also calculated for each linear model and denoted $RSS_0$, $RSS_1$ and $RSS_2$, respectively.

When the null hypothesis is not true, $RSS_1 + RSS_2$ will be significantly smaller than $RSS_0$, which can be tested using the Chow test (Chow, 1960), a standard statistical test for homogeneity amongst sub-samples. As noted by Potts and Sammut (2005), there is clearly no need to make any split if the examples in a node can all be explained by a single linear model. On the other hand, the node should be split if the examples suggest that two separate linear models would give significantly better predictions.

In econometrics, the Chow test is most commonly used in time series analysis to test for the presence of a structural break. The test statistic follows the F distribution with d and N − 2d degrees of freedom and is computed by using the following equation:

$$F = \frac{(RSS_0 - RSS_1 - RSS_2)\,(N - 2d)}{d\,(RSS_1 + RSS_2)}. \qquad (38)$$

The split least likely to occur under the null hypothesis corresponds to the F statistic with the smallest associated probability in the tail of the distribution. The best splitting test is therefore the one with the minimum tail probability, denoted p. If p is smaller than the chosen significance level α, the null hypothesis is rejected with the desired degree of confidence and the split is made. However, if p is not small enough to reject the null hypothesis, no split should be made until further evidence is accumulated. For example, if α = 0.01%, a split is only made when there is less than a 0.01% chance of observing such data in the node if the dependence between the target attribute and the predictor attributes (the regressors) were truly linear.
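As a rough illustration of Equation (38), the following minimal Python sketch fits three least-squares lines (one on all examples, one on each side of a candidate split) and computes the Chow F statistic from their residual sums of squares. It assumes a single regressor plus an intercept (so d = 2); the helper names `rss_line` and `chow_f` and the toy data are hypothetical and not part of the algorithms of Potts and Sammut (2005). The tail probability could then be obtained from an F distribution with d and N − 2d degrees of freedom, for instance with `scipy.stats.f.sf`.

```python
def rss_line(xs, ys):
    """Residual sum of squares of the least-squares line y = a + b*x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx if sxx > 0 else 0.0
    a = my - b * mx
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


def chow_f(rss0, rss1, rss2, n, d):
    """Chow F statistic as in Equation (38); d parameters per linear model."""
    return (rss0 - rss1 - rss2) * (n - 2 * d) / (d * (rss1 + rss2))


# Toy data (an assumption for illustration): a hinge at x = 0 with a small
# deterministic perturbation so that the split models do not fit perfectly.
xs = [i / 10 - 1 for i in range(21)]
ys = [abs(x) + 0.01 * ((i % 3) - 1) for i, x in enumerate(xs)]

left = [(x, y) for x, y in zip(xs, ys) if x < 0]
right = [(x, y) for x, y in zip(xs, ys) if x >= 0]

rss0 = rss_line(xs, ys)
rss1 = rss_line([x for x, _ in left], [y for _, y in left])
rss2 = rss_line([x for x, _ in right], [y for _, y in right])

f_stat = chow_f(rss0, rss1, rss2, n=len(xs), d=2)
print(f_stat)  # a large value speaks against a single linear model
```

In this toy setting the two-sided fit removes almost all of the residual error, so the statistic is large and the split at x = 0 would be favoured over keeping a single model.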

A major disadvantage of fitting linear models on both sides of every possible splitting point is the fact that this is intractable for the case of numerical (real-valued) attributes. The problem is partially solved by using a fixed number of k candidate splits per regressor.


Another disadvantage of the proposed method is its computational complexity, which tends to be high for a large number of regressors. The residual sums of squares are computed incrementally by using the recursive least squares (RLS) algorithm. However, each leaf stores k(d − 1) models, which are incrementally updated with every training example. Each RLS update takes $O(d^2)$ time, so processing a single example costs $O(kd^3)$ and the overall training complexity of BatchRD is $O(Nkd^3)$, where N is the total number of examples. The online version of this algorithm is termed OnlineRD.
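To make the per-update cost concrete, here is a minimal sketch of a textbook recursive least squares update in plain Python. It is not the implementation of Potts and Sammut (2005); the class name `RLS`, the initial value `delta` for the diagonal of the matrix P, and the toy target y = 2 + 3x are assumptions made for illustration. The nested loop over the d × d matrix P is what gives the quadratic cost per example.

```python
class RLS:
    """Textbook recursive least squares for a linear model y ~ x^T theta."""

    def __init__(self, d, delta=1e6):
        self.theta = [0.0] * d
        # P approximates the inverse of X^T X; a large diagonal means
        # "no prior knowledge" (equivalent to very weak regularisation).
        self.P = [[delta if i == j else 0.0 for j in range(d)] for i in range(d)]

    def update(self, x, y):
        d = len(x)
        # P x, computed in O(d^2)
        Px = [sum(self.P[i][j] * x[j] for j in range(d)) for i in range(d)]
        denom = 1.0 + sum(x[i] * Px[i] for i in range(d))
        gain = [v / denom for v in Px]
        # Prediction error before the parameters are updated
        err = y - sum(x[i] * self.theta[i] for i in range(d))
        self.theta = [self.theta[i] + gain[i] * err for i in range(d)]
        # P <- P - gain (x^T P); P is symmetric, so x^T P equals (P x)^T
        for i in range(d):
            for j in range(d):
                self.P[i][j] -= gain[i] * Px[j]
        return err


# Usage on noise-free toy data: the first input component is a constant 1
# that plays the role of the intercept.
model = RLS(d=2)
for i in range(50):
    x = i / 10
    model.update([1.0, x], 2.0 + 3.0 * x)
print(model.theta)  # close to [2.0, 3.0]
```

Since no matrix inversion is performed, each example is processed in time quadratic in d, which matches the $O(d^2)$ cost per update stated above.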

As an alternative, a more efficient version of the same algorithm is proposed, which is based on a different splitting rule. The same splitting rule has been used in the batch algorithms SUPPORT by Chaudhuri et al. (1994) and GUIDE by Loh (2002). The algorithm computes the residuals from a linear model fitted in the leaf and compares the distributions of the regressor values from the two sub-samples associated with the positive and negative residuals. The hypothesis is that if the function being approximated is almost linear in the region of the node, then the positive and the negative residuals should be distributed evenly. The statistics used are differences in means and in variances, which are assumed to be distributed according to the Student’s t distribution. These statistics cannot be applied for evaluating splits on categorical attributes, which are therefore ignored. The online version of this algorithm is termed OnlineRA.
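The intuition behind the residual-sign rule can be sketched as follows. The example fits a single line to a clearly non-linear target (y = x², an assumption chosen for illustration), separates the regressor values by the sign of the residual, and compares the means and variances of the two groups. It only illustrates the idea and does not reproduce the exact test statistics used in OnlineRA; the function names are hypothetical.

```python
from statistics import mean, pvariance


def fit_line(xs, ys):
    """Least-squares intercept and slope for y = a + b*x."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - b * mx, b


def split_by_residual_sign(xs, ys):
    """Regressor values grouped by the sign of the linear-model residual."""
    a, b = fit_line(xs, ys)
    pos = [x for x, y in zip(xs, ys) if y - (a + b * x) > 0]
    neg = [x for x, y in zip(xs, ys) if y - (a + b * x) <= 0]
    return pos, neg


xs = [i / 20 for i in range(21)]
ys = [x * x for x in xs]          # convex target, poorly fit by one line
pos, neg = split_by_residual_sign(xs, ys)

print(mean(pos), mean(neg))            # the group means coincide here
print(pvariance(pos), pvariance(neg))  # but the spreads differ markedly
```

For this symmetric target the positive residuals sit at both ends of the interval and the negative ones in the middle, so the difference in means reveals nothing while the difference in variances does. This is one reason to monitor both statistics.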

The stopping decisions within the sequential inductive learning framework can be evaluated in an explicit or in an implicit manner. Whenever there is not enough statistical evidence to reject the null hypothesis, an implicit temporal stopping decision is made, which (in theory) may delay the split selection decision indefinitely. However, if a linear model tree is being grown to approximate a smooth convex function, then as more examples are observed, the statistical splitting rule will repeatedly split the leaf nodes of the tree. As a result, the tree will grow continuously, resembling a smooth mesh structure that fractures the input space into a large number of linear models. Therefore, it is desirable to limit the growth of the tree by using some form of a pre-pruning heuristic.

Potts and Sammut (2005) have proposed to monitor the difference between the variance estimate using a single model and the pooled variance estimate using separate models on each side of a candidate split:

$$\delta = \frac{RSS_0}{N - d} - \frac{RSS_1 + RSS_2}{N_1 + N_2 - 2d}.$$

The value of the parameter δ decreases with the growth of the tree, since the approximation becomes more accurate due to the increased number of leaves. The stopping rule is thus a simple threshold monitoring task: When δ eventually falls below a pre-determined threshold value δ0, the node will not be considered for further splitting. This is a very convenient method to limit the growth of the tree: In case the user is not satisfied with the obtained accuracy, the value of the threshold can be further decreased, and the tree will be able to grow from its leaves to form a more refined model. As long as the distribution underlying the data is stationary, no re-building or restructuring is required.
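The threshold rule can be sketched in a few lines of Python. The function names `variance_gap` and `keep_splitting` and the numeric values are hypothetical; the sketch assumes that the N examples at the node are divided into N1 and N2 by the candidate split, so that N = N1 + N2.

```python
def variance_gap(rss0, rss1, rss2, n1, n2, d):
    """Single-model variance estimate minus the pooled two-model estimate."""
    n = n1 + n2  # assumption: every example falls on one side of the split
    return rss0 / (n - d) - (rss1 + rss2) / (n1 + n2 - 2 * d)


def keep_splitting(delta, delta0):
    """The node stays a split candidate while delta has not fallen below delta0."""
    return delta >= delta0


delta = variance_gap(rss0=10.0, rss1=2.0, rss2=3.0, n1=12, n2=12, d=2)
print(delta, keep_splitting(delta, delta0=0.05))
```

Lowering `delta0` later simply makes `keep_splitting` return true again for leaves that were previously frozen, which corresponds to the refinement behaviour described above.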

Both of the discussed algorithms use an additional pruning method, which can also be considered a strategy for adapting to possible concept drift. Pruning takes place if the prediction accuracy of an internal node is estimated to be no worse than that of its corresponding sub-tree. This requires maintaining a linear model in each internal node, which increases the processing time per example to $O(hd^2)$, where h is the height of the tree.