In this section we discuss the error metrics used to measure and estimate the generalization power of the algorithm, the metrics for measuring the complexity of the induced models, and the metrics for measuring the effectiveness of change detection methods when learning from non-stationary distributions.
5.2.1 Error Metrics
The most relevant metric for a learning algorithm is the generalization error, which is an estimate of the goodness of fit of the induced model with respect to the target function. While the model is being fitted on a training set, our estimates of the generalization error are always obtained by using a separate independent set of instances (testing set). Standard error metrics used in the machine learning literature are the mean squared error (MSE) (Armstrong and Collopy, 1992), and the mean absolute error (MAE) (Abramowitz and Stegun, 1972).
Given a training set $\{(x_1, y_1), (x_2, y_2), \ldots\}$, the regression task is to construct an estimator $\hat{f}(x)$ that approximates an unknown target function $f(x)$. The quality of the approximation is evaluated with the expected mean squared error over the whole distribution of examples:

$$e(\hat{f}) = \int (\hat{f}(x) - f(x))^2 \, p(x) \, dx. \tag{21}$$
Since we do not have access to the true distribution $p(x)$, we approximate the integral with a summation over a separate set of instances of size $N$, ideally drawn independently at random from the same distribution and held out for testing purposes. Given a test example $x$, the estimator produces a prediction $\hat{f}(x)$. The mean squared error is defined as the average squared difference between the predicted value $\hat{f}_i = \hat{f}(x_i)$ and the desired correct value $y_i = f_i = f(x_i)$:

$$\mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} (f_i - \hat{f}_i)^2. \tag{22}$$

Because the error magnitude depends on the magnitudes of possible function values, the relative mean squared error (RE) can be used instead of the MSE:
$$\mathrm{RE} = \frac{N \cdot \mathrm{MSE}}{\sum_{i=1}^{N} (f_i - \bar{f})^2}, \tag{23}$$

where the MSE is normalized with respect to the error of the baseline predictor that always predicts the average value:
$$\bar{f} = \frac{1}{N} \sum_{i=1}^{N} f_i. \tag{24}$$

The square root of RE is known as the root relative mean squared error (RRSE):
$$\mathrm{RRSE} = \sqrt{\mathrm{RE}} = \sqrt{\frac{\sum_{i=1}^{N} (f_i - \hat{f}(x_i))^2}{\sum_{i=1}^{N} (f_i - \bar{f})^2}}. \tag{25}$$

Other error metrics that we have used are the mean absolute error (MAE):

$$\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} |f_i - \hat{f}(x_i)| \tag{26}$$

and its normalized variant, the relative mean absolute error (RMAE):
$$\mathrm{RMAE} = \frac{N \cdot \mathrm{MAE}}{\sum_{i=1}^{N} |f_i - \bar{f}|}. \tag{27}$$

The RE, RRSE and RMAE errors are nonnegative. For acceptable models, they should have values between 0 and 1: $0 \le \mathrm{RE} \le 1$, $0 \le \mathrm{RRSE} \le 1$, $0 \le \mathrm{RMAE} \le 1$. If for some function the relative error, the root relative mean squared error or the relative mean absolute error is greater than 1, the model performs worse than the baseline predictor and is therefore useless. Namely, $\mathrm{RE} = 1$, $\mathrm{RRSE} = 1$ and $\mathrm{RMAE} = 1$ are trivially achieved with the baseline predictor $\hat{f} = \bar{f}$.
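As an illustration of the definitions above, the following minimal Python sketch computes the five error metrics on a held-out test set. The function name `batch_error_metrics` and the dictionary keys are hypothetical choices made for this example and are not part of the evaluated system; the sketch also assumes that the true values are not all identical, since the normalizing terms would otherwise be zero.

```python
import math


def batch_error_metrics(y_true, y_pred):
    """Compute MSE, RE, RRSE, MAE and RMAE on a test set (illustrative sketch)."""
    n = len(y_true)
    if n == 0 or n != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and of equal length")

    mean_true = sum(y_true) / n  # baseline predictor: the average target value

    sq_err = sum((f - p) ** 2 for f, p in zip(y_true, y_pred))
    abs_err = sum(abs(f - p) for f, p in zip(y_true, y_pred))
    sq_dev = sum((f - mean_true) ** 2 for f in y_true)
    abs_dev = sum(abs(f - mean_true) for f in y_true)

    if sq_dev == 0:
        # Assumption of this sketch: constant targets are rejected,
        # because the relative metrics are undefined in that case.
        raise ValueError("relative metrics are undefined for constant targets")

    re = sq_err / sq_dev  # equals N * MSE / sum((f_i - mean)^2)
    return {
        "MSE": sq_err / n,
        "RE": re,
        "RRSE": math.sqrt(re),
        "MAE": abs_err / n,
        "RMAE": abs_err / abs_dev,  # equals N * MAE / sum(|f_i - mean|)
    }
```

Calling the function with predictions that are all equal to the mean of the true values returns RE = RRSE = RMAE = 1, which matches the remark about the baseline predictor.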
A different type of measure, which quantifies the statistical correlation between the actual function values $f_i$ and the predicted function values $\hat{f}_i$ for a dependent regression variable, is Pearson's correlation coefficient (CC):

$$\mathrm{CC} = \frac{S_{f\hat{f}}}{S_f \cdot S_{\hat{f}}}, \tag{28}$$

where

$$S_{f\hat{f}} = \frac{\sum_{i=1}^{N} (f_i - \bar{f})(\hat{f}_i - \bar{\hat{f}})}{N - 1}, \quad S_f = \sqrt{\frac{\sum_{i=1}^{N} (f_i - \bar{f})^2}{N - 1}}, \quad S_{\hat{f}} = \sqrt{\frac{\sum_{i=1}^{N} (\hat{f}_i - \bar{\hat{f}})^2}{N - 1}},$$

$$\bar{f} = \frac{1}{N} \sum_{i=1}^{N} f_i \quad \text{and} \quad \bar{\hat{f}} = \frac{1}{N} \sum_{i=1}^{N} \hat{f}_i.$$

The correlation coefficient has values in the range [-1, 1], where 1 stands for perfect positive correlation, -1 stands for perfect negative correlation, and 0 stands for no correlation at all. Thus, we are interested in positive values of correlation. As opposed to the mean squared error and the mean absolute error, which need to be minimized, the learning algorithm aims to maximize the correlation coefficient. It should also be noted that Pearson's correlation coefficient measures only the linear correlation between the two variables.
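The following short Python sketch computes Pearson's correlation coefficient from sample standard deviations and the sample covariance. The function name `pearson_cc` is a hypothetical choice for this example, and the sketch assumes at least two instances and non-constant inputs.

```python
import math


def pearson_cc(y_true, y_pred):
    """Pearson's correlation between true and predicted values (illustrative sketch)."""
    n = len(y_true)
    if n < 2 or n != len(y_pred):
        raise ValueError("need two equally long sequences with at least 2 values")

    mean_true = sum(y_true) / n
    mean_pred = sum(y_pred) / n

    # Sample covariance and sample standard deviations (divisor N - 1).
    cov = sum((f - mean_true) * (p - mean_pred)
              for f, p in zip(y_true, y_pred)) / (n - 1)
    sd_true = math.sqrt(sum((f - mean_true) ** 2 for f in y_true) / (n - 1))
    sd_pred = math.sqrt(sum((p - mean_pred) ** 2 for p in y_pred) / (n - 1))

    if sd_true == 0 or sd_pred == 0:
        # Assumption of this sketch: constant inputs are rejected.
        raise ValueError("correlation is undefined for constant values")
    return cov / (sd_true * sd_pred)
```

For example, `pearson_cc([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4])` evaluates to zero even though the second sequence is an exact function (the square) of the first, which illustrates that the coefficient captures only linear dependence.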
The mean squared error and the mean absolute error can be calculated incrementally by maintaining the sums of squared and absolute differences between the true function values $f_i$ and the predicted function values $\hat{f}_i$. For their normalized versions, we need to incrementally calculate the sum of true function values $\sum_{i=1}^{N} f_i$, the sum of predicted function values $\sum_{i=1}^{N} \hat{f}_i$, the sums of squared predicted and true function values, $\sum_{i=1}^{N} \hat{f}_i^2$ and $\sum_{i=1}^{N} f_i^2$, and the number of observed training instances $N$. The latter sums are necessary for the computation of the term $\sum_{i=1}^{N} (f_i - \bar{f})^2$, which can be decomposed into:

$$\sum_{i=1}^{N} (f_i - \bar{f})^2 = \sum_{i=1}^{N} \Big( f_i - \frac{1}{N} \sum_{j=1}^{N} f_j \Big)^2 = \sum_{i=1}^{N} f_i^2 - \frac{1}{N} \Big( \sum_{i=1}^{N} f_i \Big)^2. \tag{29}$$

The sum of their products $\sum_{i=1}^{N} f_i \hat{f}_i$, on the other hand, is necessary for an online computation of the correlation coefficient. All of the above evaluation metrics can be applied in a straightforward way to evaluate multi-target predictive models: the necessary statistics (sums and counts) should be computed per target variable.
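The incremental scheme described above can be sketched in Python as follows. The class `StreamingRegressionMetrics` and its method names are hypothetical and serve only to illustrate which running sums are sufficient; the sketch assumes that at least two instances with non-constant values have been observed before the normalized metrics are queried (otherwise a division by zero is raised).

```python
import math


class StreamingRegressionMetrics:
    """Running sums for online MSE, MAE, RE, RRSE and CC (illustrative sketch)."""

    def __init__(self):
        self.n = 0              # number of observed instances
        self.sum_f = 0.0        # sum of true values
        self.sum_p = 0.0        # sum of predicted values
        self.sum_f2 = 0.0       # sum of squared true values
        self.sum_p2 = 0.0       # sum of squared predicted values
        self.sum_fp = 0.0       # sum of products true * predicted
        self.sum_sq_err = 0.0   # sum of squared differences
        self.sum_abs_err = 0.0  # sum of absolute differences

    def update(self, f, p):
        """Add one (true value, prediction) pair in constant time and memory."""
        self.n += 1
        self.sum_f += f
        self.sum_p += p
        self.sum_f2 += f * f
        self.sum_p2 += p * p
        self.sum_fp += f * p
        self.sum_sq_err += (f - p) ** 2
        self.sum_abs_err += abs(f - p)

    def _sq_dev(self, s, s2):
        # sum((v_i - mean)^2) = sum(v_i^2) - (sum(v_i))^2 / N
        return s2 - s * s / self.n

    def mse(self):
        return self.sum_sq_err / self.n

    def mae(self):
        return self.sum_abs_err / self.n

    def re(self):
        return self.sum_sq_err / self._sq_dev(self.sum_f, self.sum_f2)

    def rrse(self):
        return math.sqrt(self.re())

    def cc(self):
        # The divisors (N - 1) of covariance and variances cancel out.
        cov = self.sum_fp - self.sum_f * self.sum_p / self.n
        var_f = self._sq_dev(self.sum_f, self.sum_f2)
        var_p = self._sq_dev(self.sum_p, self.sum_p2)
        return cov / math.sqrt(var_f * var_p)
```

For multi-target models, one such object per target variable would be maintained. Note that the sketch omits the RMAE, because its denominator, the sum of absolute deviations from the mean, cannot be recovered exactly from these running sums. Differences of large sums of squares can also lose floating-point precision when the values are large relative to their spread.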
5.2.2 Metrics for Model’s Complexity
For the case of interpretable models, it is also very important to have a model of low complexity, which can be interpreted easily. Since we deal only with regression trees and their variants, a straightforward way of measuring model complexity is to count the total number of nodes. If we are interested in measuring the number of rules we can extract from the tree, then a more suitable measure is the number of leaves in the tree, since each leaf corresponds to a different predictive rule (which is a conjunction of logical or comparison tests).
In the case of option trees, we use two different measures of complexity: the number of different trees represented with a single option tree, which gives us a measure which is comparable to the size of an ensemble, as well as the total number of leaves in the option tree, i.e., the number of different rules that can be extracted from the tree.
The algorithms are further expected to work under constrained resources in terms of memory, running time, and processing time per example. Therefore, we further measure the memory allocation (in MB) at any point in time during the learning process, as well as the total elapsed running time (in seconds).
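The structural complexity measures can be sketched as simple recursive counts. The `Node` class below is a hypothetical minimal tree representation introduced only for this example. The rule used for counting the trees represented by an option tree (alternatives are summed at an option node, while the counts of the subtrees are multiplied at an ordinary split node) is an assumption of this sketch, not a definition taken from the text.

```python
from dataclasses import dataclass, field
from math import prod
from typing import List


@dataclass
class Node:
    """Hypothetical tree node: kind is 'leaf', 'split' or 'option'."""
    kind: str = "leaf"
    children: List["Node"] = field(default_factory=list)


def count_nodes(node):
    """Total number of nodes, including option nodes and leaves."""
    return 1 + sum(count_nodes(c) for c in node.children)


def count_leaves(node):
    """Number of leaves, i.e. the number of extractable rules."""
    if not node.children:
        return 1
    return sum(count_leaves(c) for c in node.children)


def count_trees(node):
    """Number of distinct trees represented (assumed counting rule, see lead-in)."""
    if not node.children:
        return 1
    counts = [count_trees(c) for c in node.children]
    # An option node chooses one alternative; a split node combines all branches.
    return sum(counts) if node.kind == "option" else prod(counts)
```

For an ordinary regression tree without option nodes, `count_trees` returns 1, so the measure reduces to the single-tree case.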
5.2.3 Metrics for Change Detection
Another important dimension along which we evaluate an online learning algorithm is its ability to adapt to concept drift. It is therefore very important to evaluate the performance of both the change detection method and the adaptation method of the algorithm. This tells us how fast and how well the algorithm recovers and repairs the inferred model after a change in the target representation.
Let us first illustrate the corresponding problem statement. Assume that one observes a piecewise constant signal disturbed by white noise. Let us assume an unknown number of jumps in the mean which occur at unknown points in time. The problem of online detection relevant for our work is the online detection of such jumps, as quickly as possible, in order to allow the detection of near successive jumps.
In this context, the adequate evaluation metrics for online change detection are: the number of false alarms, the number of true positives, and the detection delay. In a probabilistic manner of thinking, we are thus interested in the following metrics:
• Probability of false alarms: Measures the robustness and the sensitivity to noise. In other words, we are typically interested in a high mean time between false alarms.
• Probability of true positives: Quantifies the capacity of the method to detect all changes (all of the jumps in the signal).
• Mean detection delay: The mean time between the change (jump) and the alarm time, conditioned on the nonexistence of false alarms before the change (jump). Naturally, it is highly desirable that the alarm is given without any delay, with as few lost observations as possible.
There exists a close connection between the ability of quick change detection and the sensitivity of the method. The ability of quick change detection causes the detector to be sensitive and thus increases the risk of false alarms. In conclusion, a detector would perform optimally if, for a fixed mean time between false alarms, the delay for detection is minimized (Basseville and Nikiforov, 1993).
The above probabilities can be easily estimated when the evaluation of the change detection method is performed in a controlled environment, by counting the false alarms and the true positives over the total number of changes artificially introduced in the target function. However, it is much more difficult to assess the quality of change detection and adaptation methods for real-world datasets, in which the number of changes in the target function cannot be known up-front, considering that the target function is unknown itself. The best evaluation method in such a scenario is to track the performance of the learning algorithm over a testing set of most recent examples, and consider an alarm for a change when its performance starts to degrade drastically.
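For the controlled setting, in which the positions of the artificially introduced changes are known, the counting procedure can be sketched in Python as follows. The function name `evaluate_change_detector` and the matching convention are assumptions of this sketch: the first alarm raised between a change and the next change (or the end of the stream) counts as a detection of that change, and every other alarm counts as a false alarm.

```python
def evaluate_change_detector(change_points, alarms, stream_length):
    """Count false alarms, detected changes and detection delays (illustrative sketch).

    change_points: time indices of the artificially introduced changes
    alarms:        time indices at which the detector raised an alarm
    stream_length: total number of observations in the stream
    """
    changes = sorted(change_points)
    alarms = sorted(alarms)
    # Each change "owns" the interval up to the next change or the stream end.
    boundaries = changes[1:] + [stream_length]

    delays = []
    matched = set()  # indices of alarms that count as true detections
    for start, end in zip(changes, boundaries):
        for idx, alarm in enumerate(alarms):
            if start <= alarm < end:
                delays.append(alarm - start)
                matched.add(idx)
                break  # only the first alarm in the interval is a detection

    return {
        "false_alarms": len(alarms) - len(matched),
        "true_positive_rate": len(delays) / len(changes) if changes else None,
        "mean_delay": sum(delays) / len(delays) if delays else None,
    }
```

Under this convention, repeated alarms after a change has already been detected are penalized as false alarms. Other conventions are possible, for instance tolerating alarms within a fixed window after a detection.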