Drift Detection Methods in FIMT-DD - Algorithms for Learning Regression Trees and Ensembles on

The learning algorithm presented thus far models the functional dependencies by minimizing the mean squared error of the regression tree. In our online learning scenario, however, it is expected that the underlying functional dependencies will change over time. The change can affect the whole instance space or only some regions of it. As a result, the affected parts of the model tree or the whole model will become poor approximations of the true functional dependencies, and their predictions will most likely be incorrect. The error of the whole model will consequently increase, making it unsuitable for descriptive or predictive purposes.

While the predictive accuracy of the model tree is very important, we are primarily interested in informed detection of changes, which would enable us to alter the structure of the model tree in response to the change in the functional dependencies. An intuitive approach is to assume that the increase in the error is caused by a small portion of sub-trees that are affected by the change, while the larger portion of unaffected sub-trees still performs accurately. This suggests a simple method to detect changes: monitor the quality of every sub-tree by monitoring the evolution of its error. If the error starts increasing, this may be a sign of change in the target function.

6.6.1 The Page-Hinkley Test

From the various methods discussed in Chapter 2, we are further interested in online change detection, because online learning algorithms do not have access to the full history of the data stream. One of the most successful online change detection tests is the Page-Hinkley (PH) test proposed by Mouss et al. (2004). The PH test is a sequential adaptation of the detection of abrupt change in the average of a Gaussian signal. At any point in time, the test considers two variables: 1) a cumulative sum $m(T)$, and 2) its minimal value $M(T) = \min_{t=1,\dots,T} m(t)$, where $T$ is the number of observed examples. The cumulative sum $m(T)$ is defined as the cumulative difference between the monitored signal $x_t$ and its current mean value $\bar{x}(T)$, corrected with an additional parameter α:

$$m(T) = \sum_{t=1}^{T} \left( x_t - \bar{x}(T) - \alpha \right) \qquad (48)$$

where

$$\bar{x}(T) = \frac{1}{T} \sum_{t=1}^{T} x_t. \qquad (49)$$

The parameter α corresponds to the minimal absolute amplitude of change that we wish to detect. It should be adjusted according to the expected standard deviation of the signal. Larger deviations require larger values of α, but if α is set too large, detection delays increase. Our experiments have shown that setting α to 10% of the standard deviation gives the best results.

The PH test monitors the difference between the minimum M(T) and m(T), defined as: PH(T) = m(T) − M(T). When this difference is greater than a user-specified threshold λ, i.e., PH(T) > λ, the test triggers an alarm that signals a change in the mean of the signal. The threshold parameter λ corresponds to the admissible false alarm rate. Increasing λ entails fewer false alarms (higher precision) at the cost of a larger number of missed events (lower recall).
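As a minimal sketch, the test described above can be written in a few lines of Python. The class name `PageHinkley`, its attribute names, and the default parameter values are illustrative assumptions and are not taken from the FIMT-DD implementation. The sketch also assumes the usual incremental reading of Equation (48), in which each new term uses the running mean available at that time step.

```python
class PageHinkley:
    """Sequential Page-Hinkley test for an increase in the mean of a signal.

    Hypothetical helper for illustration only.
    """

    def __init__(self, alpha=0.005, lam=50.0):
        self.alpha = alpha          # minimal amplitude of change to detect
        self.lam = lam              # alarm threshold (lambda)
        self.n = 0                  # number of observed examples T
        self.mean = 0.0             # running mean of the signal
        self.m = 0.0                # cumulative sum m(T)
        self.m_min = float("inf")   # minimal value M(T) of the cumulative sum

    def update(self, x):
        """Consume one observation; return True when PH(T) > lambda."""
        self.n += 1
        self.mean += (x - self.mean) / self.n   # incremental mean update
        self.m += x - self.mean - self.alpha    # assumed incremental form of Eq. (48)
        self.m_min = min(self.m_min, self.m)    # track the minimum M(T)
        return (self.m - self.m_min) > self.lam
```

In use, one detector instance would be fed the monitored signal one value at a time, and the first `True` returned by `update` marks the alarm.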

80 Learning Model Trees from Time-Changing Data Streams

The variable $x_t$, which is monitored over time, is in our case the absolute error at a given node or a sub-region of the instance space, i.e., x = |y − o|. Here y is the true value of the target attribute and o is the prediction from the constant regressor. Having multiple nodes in the tree, we have multiple instantiations of the change detection test running in parallel. The model tree divides the instance space into multiple nested regions $R_i$. Each node corresponds to a hyper-rectangle or a region of the instance space and is associated with a corresponding variable $x_t^{R_i}$. The root node covers all of the hyper-rectangles, while subsequent nodes cover sub-regions (hyper-rectangles) nested in the region covered by their parent node. Consequently, the structure of the regression tree gives multiple views of the instance space and the associated error, at different levels of granularity.

For computing the absolute loss, we have therefore considered two possible alternatives: 1) using the prediction from the node where the error is being monitored, or 2) using the prediction from the leaf node where the example is assigned. Consequently, we propose two different methods for change detection:

• Top-Down (TD) method: The error is computed using the prediction from the current node. The computation can be performed while the example is passing the node on its path to the leaf. Therefore, the loss will be monitored in the direction from the top towards the "bottom" of the tree.

• Bottom-Up (BU) method: The error is computed using the prediction from the leaf node. The example must therefore reach the leaf first. The difference computed at the leaf is then back-propagated to the root node. While back-propagating the error (re-tracing the path the example took to reach the leaf), the PH tests located in the internal nodes are updated correspondingly.

The idea for the BU method is based on the following observation: When moving towards the "bottom" of the tree, predictions in the internal nodes become more accurate (as a consequence of splitting). In case of concept drift, using more accurate predictions will emphasize the error, shorten the delays and reduce the number of false alarms. This was investigated and confirmed by empirical evaluation, which has shown that, when using the TD method, most of the false alarms were triggered at the root node and its immediate descendants.
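The difference between the two methods can be sketched as follows. This is an illustrative reading of the description above, not the FIMT-DD code: the `Node` structure, the threshold-split convention, and the function names are all assumptions. Both functions return one loss value per node on the example's path (root first), which would then be fed to the PH test stored at that node.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class Node:
    """Hypothetical tree node holding a constant regressor."""
    prediction: float                  # constant prediction at this node
    split_attr: Optional[int] = None   # None marks a leaf
    split_value: float = 0.0
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def path_to_leaf(root: Node, x: Sequence[float]) -> List[Node]:
    """Return the nodes visited by example x, from the root to its leaf."""
    path = [root]
    node = root
    while node.split_attr is not None:
        node = node.left if x[node.split_attr] <= node.split_value else node.right
        path.append(node)
    return path


def top_down_losses(path: List[Node], y: float) -> List[float]:
    """TD: each node is monitored with the loss of its own prediction."""
    return [abs(y - node.prediction) for node in path]


def bottom_up_losses(path: List[Node], y: float) -> List[float]:
    """BU: the leaf's loss is propagated back to every node on the path."""
    leaf_loss = abs(y - path[-1].prediction)
    return [leaf_loss for _ in path]
```

Under TD the root is monitored with its own, coarser prediction, whereas under BU every node on the path sees the same, typically smaller, leaf-level loss.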

6.6.2 An Improved Page-Hinkley Test

Tuning the values of the parameters α and λ is a tedious task. Our experiments have shown that uniformly good results (over the whole collection of real-world and artificial datasets) can be achieved for the combination of values α = 0.005 and λ = 50. If we would like to tolerate larger variations, the value of λ can be increased to 100. For more sensitive change detection, at the risk of an increased rate of false alarms, the value of α can optionally be decreased to 0.001. As a general guideline, when α is decreased, the value of λ should be increased, in order to reduce the number of false alarms. On the other hand, when the value of α is increased, the value of λ should be decreased, in order to shorten the delays in change detection.
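The effect of λ on the detection delay can be illustrated on a synthetic step signal. The helper `detection_delay` below is hypothetical, the signal is artificial, and the incremental update of the cumulative sum is an assumption of the sketch rather than the thesis implementation.

```python
def detection_delay(signal, change_at, alpha, lam):
    """Return the number of examples between the change and the first alarm.

    Returns None if no alarm is raised. Hypothetical helper for illustration.
    """
    n, mean, m, m_min = 0, 0.0, 0.0, float("inf")
    for t, x in enumerate(signal):
        n += 1
        mean += (x - mean) / n      # running mean of the signal
        m += x - mean - alpha       # cumulative sum m(T)
        m_min = min(m_min, m)       # minimal value M(T)
        if m - m_min > lam:         # PH(T) > lambda
            return t - change_at
    return None


# Artificial signal: the mean jumps from 1.0 to 3.0 at index 500.
signal = [1.0] * 500 + [3.0] * 300
delay_small = detection_delay(signal, change_at=500, alpha=0.005, lam=50.0)
delay_large = detection_delay(signal, change_at=500, alpha=0.005, lam=100.0)
```

On this signal the larger threshold needs more post-change examples before the alarm is raised, which is the delay side of the trade-off described above.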

While there are many optimization techniques that can be employed to tune the values of the parameters, we have tried to automate the process. We have transformed the parameter α into a variable whose value depends on the incrementally computed standard deviation of the absolute loss $x_t$. In particular, we propose an improved Page-Hinkley test where the sum $m(T)$ is computed as follows:

$$m(T) = \sum_{t=1}^{T} \left( x_t - \bar{x}(T) - \alpha_{PH}(T) \right) \qquad (50)$$

where

$$\alpha_{PH}(T) = sd(x_T) = \sqrt{\frac{1}{T} \sum_{t=1}^{T} \left( x_t - \bar{x}(T) \right)^2}. \qquad (51)$$

The value of $\alpha_{PH}$ is initially set to 0.01 and is periodically updated after every 50 examples observed in the particular node. By adjusting the value of $\alpha_{PH}$ according to the standard deviation of the signal, we minimize the effect of the variance, which results in a shorter detection delay.
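A hedged sketch of this adaptive variant is given below. The class and attribute names are assumptions, and the use of Welford's method for the incremental standard deviation is our choice for the illustration; the text only states that the standard deviation is computed incrementally. The initial value 0.01 and the update period of 50 examples follow the description above.

```python
import math


class AdaptivePageHinkley:
    """Page-Hinkley test whose alpha tracks the signal's standard deviation.

    Hypothetical helper for illustration only.
    """

    def __init__(self, lam=50.0, alpha_init=0.01, update_every=50):
        self.lam = lam                    # alarm threshold (lambda)
        self.alpha = alpha_init           # alpha_PH, refreshed periodically
        self.update_every = update_every  # refresh period in examples
        self.n = 0                        # number of observed examples
        self.mean = 0.0                   # running mean of the signal
        self.sq_dev = 0.0                 # running sum of squared deviations
        self.m = 0.0                      # cumulative sum m(T)
        self.m_min = float("inf")         # minimal value M(T)

    def update(self, x):
        """Consume one observation; return True when PH(T) > lambda."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.sq_dev += delta * (x - self.mean)   # Welford update (assumed)
        if self.n % self.update_every == 0:
            # Population standard deviation, as in Eq. (51)
            self.alpha = math.sqrt(self.sq_dev / self.n)
        self.m += x - self.mean - self.alpha
        self.m_min = min(self.m_min, self.m)
        return (self.m - self.m_min) > self.lam
```

One such detector would be kept per node, so that each node's α adapts to the variability of the loss observed in its own region.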

In document Algorithms for Learning Regression Trees and Ensembles on Evolving Data Streams. Elena Ikonomovska (Page 93-95)