2.5 Online Approaches to Learn in the Presence of Concept Drift
2.5.2 Two Online Classifiers for Learning and Detecting Concept Drift (Todi)
Todi (Nishida; 2008) uses two online classifiers for learning and detecting drifts. One of them (H0) is rebuilt every time a concept drift is detected. The other (H1) is not rebuilt when a drift is detected, but can be replaced by the current H0 if the drift is confirmed. Similarly to STEPD, Todi detects concept drift by performing a statistical test of equal proportions that compares H0’s accuracy on the most recent W training examples with its accuracy on all the training examples presented so far, excluding the last W. H0 must have been trained with at least 2W examples for a drift to be detected.
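The detection step described above can be sketched as follows. This is only an illustrative sketch: the text does not give the exact test statistic, so a standard one-sided two-proportion z-test with continuity correction is assumed here, and the names `equal_proportions_pvalue`, `drift_detected`, `history`, `window` and `alpha` are hypothetical.

```python
import math


def equal_proportions_pvalue(correct_old, n_old, correct_recent, n_recent):
    """One-sided p-value of a two-proportion z-test with continuity correction.

    Assumed form of the "test of equal proportions"; the original paper
    may define the statistic differently.
    """
    pooled = (correct_old + correct_recent) / (n_old + n_recent)
    variance = pooled * (1.0 - pooled) * (1.0 / n_old + 1.0 / n_recent)
    if variance == 0.0:
        return 1.0  # identical, degenerate proportions: no evidence of change
    diff = abs(correct_old / n_old - correct_recent / n_recent)
    correction = 0.5 * (1.0 / n_old + 1.0 / n_recent)
    z = (diff - correction) / math.sqrt(variance)
    # Upper-tail probability of the standard normal distribution.
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def drift_detected(history, window, alpha):
    """history: 0/1 correctness flags of H0 since it was last rebuilt."""
    if len(history) < 2 * window:
        return False  # H0 needs at least 2W examples before testing
    old, recent = history[:-window], history[-window:]
    if sum(recent) / len(recent) >= sum(old) / len(old):
        return False  # only a drop in recent accuracy signals a drift
    p_value = equal_proportions_pvalue(sum(old), len(old), sum(recent), len(recent))
    return p_value < alpha
```

For instance, with a hypothetical window of 30 and significance level 0.05, a history of 60 correct predictions followed by 30 errors triggers a detection, whereas a history shorter than 2W never does.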
The authors claim that the approach uses only one significance level α. They argue that, unlike DDM, EDDM and STEPD, it does not need a warning level to store training examples in a short-term memory, because H1 is maintained after the concept drift detection. As a result, the accuracy of the ensemble deteriorates less in the case of false alarms. However, Todi actually also works with two levels. Instead of being called “warning” and “drift” levels, they are called “drift detection” and “drift confirmation”.
After a concept drift is detected, a statistical test of equal proportions with significance level β compares the number of training examples correctly classified by H0 and by H1 since H0 started training. If a statistically significant difference is found, which means that H0 handled the concept drift significantly better, the drift is confirmed. Then H0 replaces H1 and a new H0 starts learning from scratch. If the concept drift is not confirmed, Todi reinitializes H0 and keeps H1.
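The confirmation step can be sketched along the same lines. The test below is an assumed two-sided two-proportion z-test without continuity correction, and the extra check that H0 is the more accurate classifier is an assumption made here to match the interpretation given in the text. The names `two_proportion_pvalue`, `after_detection` and `new_classifier` are hypothetical.

```python
import math


def two_proportion_pvalue(correct_a, correct_b, n):
    """Two-sided p-value for equal proportions correct_a/n and correct_b/n.

    Assumed form of the test; both classifiers are scored on the same n examples.
    """
    pooled = (correct_a + correct_b) / (2.0 * n)
    variance = pooled * (1.0 - pooled) * (2.0 / n)
    if variance == 0.0:
        return 1.0
    z = abs(correct_a - correct_b) / n / math.sqrt(variance)
    return math.erfc(z / math.sqrt(2.0))


def after_detection(h0, h1, correct_h0, correct_h1, n, beta, new_classifier):
    """Apply the confirmation rule once a drift has been detected.

    correct_h0 / correct_h1: correct classifications by H0 and H1 on the
    n examples seen since H0 started training.
    Returns the updated (h0, h1, confirmed).
    """
    significant = two_proportion_pvalue(correct_h0, correct_h1, n) < beta
    # Assumption: the drift is confirmed only when H0 is the better classifier.
    confirmed = significant and correct_h0 > correct_h1
    if confirmed:
        h1 = h0  # H0 replaces H1
    h0 = new_classifier()  # H0 restarts from scratch in both cases
    return h0, h1, confirmed
```

Note that H0 is reinitialized on both branches; the only thing the confirmation decides is whether the old H0 survives as the new H1.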
Classification is done by selecting the output of the classifier that is most accurate on the W most recent training examples. Let si be the number of correct classifications made by classifier Hi on the last W examples. If s0 > s1 + 1, Todi selects H0 to classify the instance presented in the current time step. If s0 < s1 − 1, Todi selects H1.
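The selection rule can be written down directly. In this minimal sketch, `RecentCorrect` and `select_classifier` are hypothetical names; the case in which neither inequality holds is not covered by the rule as stated above, so the sketch returns `None` there rather than guessing a behaviour.

```python
from collections import deque


class RecentCorrect:
    """Tracks correct classifications over the last W training examples."""

    def __init__(self, window):
        self.flags = deque(maxlen=window)  # oldest flag drops out automatically

    def update(self, was_correct):
        self.flags.append(1 if was_correct else 0)

    @property
    def count(self):
        return sum(self.flags)


def select_classifier(s0, s1):
    """Return "H0" or "H1" according to the margin rule, else None."""
    if s0 > s1 + 1:
        return "H0"
    if s0 < s1 - 1:
        return "H1"
    return None  # not specified by the rule as stated in the text
```

The margin of one correct classification means that a single extra hit within the window is not enough to switch between classifiers.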
Successful points
Todi was able to confirm drifts without misdetections and to obtain better accuracy than DDM and STEPD, using naive Bayes as the base learner, on the SEA Concepts data set (Street and Kim; 2001) with both sudden drifts (original data set) and gradual drifts (1000 time steps to completely change from the old to the new concept). Besides, on a real-world spam data set, Todi’s error rate was lower than STEPD’s and than that of Bogofilter (Raymond et al.; 2007), a well-known spam filter.
Problems
Even though the authors claim that Todi does not need a warning level, we can consider their “drift detection” as the warning level used by DDM, EDDM and STEPD and their “drift confirmation” as the drift detection level. As explained in Section 2.5.1, DDM, EDDM and STEPD can be considered true online learning algorithms if, instead of storing the examples seen during a warning level in a memory for later training, these examples are used directly to train a new classifier. Todi works in a similar way. The main differences are that the new classifier can also be used for predictions before a drift is confirmed and that concept drift is monitored using a classifier that is built only with examples learnt from the moment in which a drift is confirmed.
Todi uses a window over the data, which may create an unstable system if it is too small and may not react quickly to drifts if it is too big. Besides, subsequent drifts can only be detected if there are at least 2W examples between them. So, drifts cannot always be monitored continuously.
Although H1 is not rebuilt when a drift is detected (only after a drift is confirmed), it is not prepared to deal with new concepts. So, even though it may increase the robustness to false alarms, it cannot help to reduce the drop in generalization right after the drift if the concept has really changed.
The approach also provides no mechanism to deal with recurrent concepts.
2.5.3 Concept Drift Committee (CDC)
In this approach (Stanley; 2003), an ensemble (committee) of decision trees is created to handle concept drift. All the ensemble members are trained with incoming training examples. The ensemble is initially empty and, every time a new training example is made available, a new decision tree is added to the ensemble if the maximum ensemble size has not yet been reached. Once the maximum size is reached, a new decision tree is added only if an existing decision tree can be removed from the ensemble. Each ensemble member has a weight proportional to its accuracy on the last n training examples. The decision tree with the lowest weight below some threshold t is eliminated. Every ensemble member also has an age (number of time steps since it was added to the ensemble) and cannot be eliminated or used for classification before an age of maturity is reached. So, ensemble members that have not been trained enough are given a chance to learn the concept without harming the ensemble’s generalization.
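The committee maintenance described above can be sketched as follows. This is a rough sketch under several assumptions that are not taken from the source: predictions are combined by a weighted vote of the mature members, a new member is first trained on the example following its creation, and a trivial `MajorityLearner` stands in for an online decision tree. All class and parameter names are hypothetical.

```python
from collections import deque


class MajorityLearner:
    """Stand-in base learner (predicts the most frequent label seen so far)."""

    def __init__(self):
        self.counts = {}

    def predict(self, x):
        return max(self.counts, key=self.counts.get) if self.counts else None

    def learn(self, x, y):
        self.counts[y] = self.counts.get(y, 0) + 1


class Member:
    def __init__(self, learner, window):
        self.learner = learner
        self.age = 0                        # time steps since joining
        self.recent = deque(maxlen=window)  # correctness on the last n examples

    def weight(self):
        return sum(self.recent) / len(self.recent) if self.recent else 0.0


class Committee:
    def __init__(self, make_learner, max_size, window, threshold, maturity):
        self.make_learner = make_learner
        self.max_size, self.window = max_size, window
        self.threshold, self.maturity = threshold, maturity
        self.members = []

    def predict(self, x):
        votes = {}
        for m in self.members:
            if m.age >= self.maturity:      # immature members do not vote
                label = m.learner.predict(x)
                votes[label] = votes.get(label, 0.0) + m.weight()
        return max(votes, key=votes.get) if votes else None

    def learn(self, x, y):
        for m in self.members:              # every member sees every example
            m.recent.append(1 if m.learner.predict(x) == y else 0)
            m.learner.learn(x, y)
            m.age += 1
        mature = [m for m in self.members if m.age >= self.maturity]
        if mature:
            worst = min(mature, key=Member.weight)
            if worst.weight() < self.threshold:
                self.members.remove(worst)  # lowest weight below threshold t
        if len(self.members) < self.max_size:
            self.members.append(Member(self.make_learner(), self.window))
```

Because removal is attempted before addition, a full committee only gains a new member in a step where an existing one has just been eliminated.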
Successful points
The approach showed generalization comparable to that of FLORA4 (Widmer and Kubat; 1996) on the first sudden drift of the STAGGER concepts data set (Schlimmer and Granger; 1986) and better generalization on the second sudden drift (although this difference may be caused by the use of different base learners). Both approaches were better than instance-based learning 3 (IB3) (Aha et al.; 1991).
The approach also showed generalization comparable to that of FLORA4 (Widmer and Kubat; 1996) on a boolean data set containing drifts with drifting time (number of time steps to complete the drift) 100, and better generalization when the drift had drifting time 200.
Problems
CDC had worse generalization than FLORA4 during the first concept of the STAGGER data set.
At the beginning of learning, a new decision tree is added at each time step until the ensemble reaches the maximum number of decision trees. As diversity is not encouraged, very similar decision trees will be created. So, the memory requirements and computational time are increased unnecessarily. The accuracy of the system during this period is likely to be similar to the accuracy of a single decision tree, wasting the power of ensembles to achieve better accuracy.
The approach does not present any specific method to handle recurrent concepts. The elimination of ensemble members with low accuracy on the most recent n training examples may remove members that could be useful in the case of a recurrent concept.
The size of the window n critically influences the behaviour of the approach. Too large an n would reduce accuracy when handling drifts, whereas too small an n may create an unstable system.
Moreover, no strategy is adopted to improve the learning of existing members on the new concept, so the ensemble only recovers from a drift when the new ensemble members reach the age of maturity and are accurate enough. So, there is a delay in the system’s reaction to drifts. This is probably one of the reasons why CDC does not achieve very good accuracy for abrupt drifts.
2.5.4 Dynamic Weighted Majority (DWM) and Additive Expert Ensembles