Change Detection Methods - Novel methods for mining and learning from data streams

In order to meet the requirements of learning from non-stationary data streams, a learning algorithm needs to be aware of any change in the data generating process that could invalidate the learned model. Such awareness can be achieved either (i) by directly inspecting the arriving data and checking them for a change or (ii) by observing how the performance of the learned model changes over time and signaling a change whenever this performance significantly deteriorates. In the following, we describe the most widely applied change detection methods, as reviewed in the surveys [76, 106, 53]:

Classifier-dependent detection methods

This type of detector uses a change detection strategy that compares the performance of the current model with the best performance achieved so far, under the assumption that the best performance corresponds to no change in the target concept.

Statistical process control (SPC) is a family of statistical methods that can be applied to monitor and control processes, such as industrial processes, in order to keep them in an optimal sustainable operation mode. Different variations of this method have been adapted and applied for detecting drifts [97, 72, 15]. Drift detection method (DDM), proposed by Gama et al. [72], is one of the early adaptations of SPC for detecting drifts. For a stream of instances $(x_i, y_i)$ and their assigned predictions $\hat{y}_i$, the zero-one loss function $l$ computes the disagreement between the true class and the prediction, i.e., $l_i = I(y_i \neq \hat{y}_i)$, where $I$ is the indicator function. The committed error on one example, with the binary values it takes, forms a Bernoulli trial. As a result, the number of errors committed on a sample of $n$ instances follows the binomial distribution, provided they are independent. For the $i$th sample, $p_i$ is the probability of being assigned the wrong class, with the standard deviation $\sigma_i = \sqrt{p_i(1-p_i)/i}$. These values are incrementally updated on the observed stream. DDM keeps track of the best observed performance by storing the variables $p_{\min}$ and $\sigma_{\min}$. These variables are updated ($p_{\min} = p_i$ and $\sigma_{\min} = \sigma_i$) whenever $p_i + \sigma_i < p_{\min} + \sigma_{\min}$ is satisfied after observing the $i$th example. The confidence interval $p_i \pm z\sigma_i$, such that $z$ depends on the desired confidence level $\alpha$, helps in defining the following three states for change detection:

– In-control state is the state at which the prediction performance does not seem to exhibit any change. The system is in this state as long as $p_i + \sigma_i < p_{\min} + 2\sigma_{\min}$.

– Out-of-control state is the state at which the error has significantly increased, which requires a suitable model adaptation in order to recover the drop in performance. The system is in this state whenever $p_i + \sigma_i \geq p_{\min} + 3\sigma_{\min}$.

– Warning state is the state at which the error has increased without reaching the critical level. This state occurs when the system’s performance lies between the two previous states.
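
The three DDM states listed above can be sketched in a few lines of Python. This is a minimal illustration, not the reference implementation of [72]: the class name `DDM`, the 30-example warm-up (`min_samples`), and the guard that waits for the first error are assumptions introduced here.

```python
import math


class DDM:
    """Minimal DDM sketch; names and warm-up length are illustrative assumptions."""

    def __init__(self, min_samples=30):
        self.min_samples = min_samples  # assumed warm-up before any test is made
        self.i = 0                      # number of examples seen
        self.errors = 0                 # number of misclassifications seen
        self.p_min = math.inf
        self.s_min = math.inf

    def update(self, error):
        """error is 1 for a misclassification and 0 otherwise; returns the state."""
        self.i += 1
        self.errors += error
        p = self.errors / self.i
        s = math.sqrt(p * (1 - p) / self.i)
        # Assumption: skip the test until some errors exist, since sigma = 0
        # would make every threshold collapse onto p_min.
        if self.i < self.min_samples or self.errors == 0:
            return "in-control"
        if p + s < self.p_min + self.s_min:
            self.p_min, self.s_min = p, s
        if p + s >= self.p_min + 3 * self.s_min:
            return "out-of-control"
        if p + s >= self.p_min + 2 * self.s_min:
            return "warning"
        return "in-control"
```

In this sketch the caller is expected to replace the detector (and adapt the model) once `update` returns `"out-of-control"`.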

Early drift detection method (EDDM) [15] builds upon the previously discussed DDM method in order to shorten the temporal gap between the drift and its detection. The problem with the previous method is that the more data are observed, the more resistant $p_i$ becomes to slow and gradual changes. EDDM therefore considers the distance between two misclassification cases instead of the error rate. This method uses $p'_i$ as the average number of correct predictions between two wrong predictions, and $\sigma'_i$ is its standard deviation. Similar to DDM, the system is in the warning level when $(p'_i + 2\sigma'_i)/(p'_{\max} + 2\sigma'_{\max}) < \alpha$ and in the drift level when $(p'_i + 2\sigma'_i)/(p'_{\max} + 2\sigma'_{\max}) < \beta$, such that $\alpha$ and $\beta$ take the values 0.95 and 0.90, respectively.
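
The EDDM rule can be illustrated with the following Python sketch. It is a simplified reading of the description above under stated assumptions: the class name `EDDM`, the requirement of 30 observed errors before testing (`min_errors`), and the use of Welford's running update for the mean and standard deviation are choices made here for illustration. The first gap is counted from the start of the stream rather than from a previous error.

```python
import math


class EDDM:
    """Minimal EDDM sketch; names and warm-up length are illustrative assumptions."""

    def __init__(self, alpha=0.95, beta=0.90, min_errors=30):
        self.alpha, self.beta, self.min_errors = alpha, beta, min_errors
        self.n_errors = 0  # misclassifications seen so far
        self.gap = 0       # correct predictions since the last error
        self.mean = 0.0    # running mean p' of the gaps
        self.m2 = 0.0      # running sum of squared deviations of the gaps
        self.best = 0.0    # largest p' + 2 * sigma' observed so far

    def update(self, error):
        """error is 1 for a misclassification and 0 otherwise; returns the level."""
        if not error:
            self.gap += 1
            return "in-control"
        # An error closes the current gap: update mean and variance (Welford).
        self.n_errors += 1
        delta = self.gap - self.mean
        self.mean += delta / self.n_errors
        self.m2 += delta * (self.gap - self.mean)
        self.gap = 0
        level = self.mean + 2 * math.sqrt(self.m2 / self.n_errors)
        self.best = max(self.best, level)
        if self.n_errors < self.min_errors or self.best == 0:
            return "in-control"
        ratio = level / self.best
        if ratio < self.beta:
            return "drift"
        if ratio < self.alpha:
            return "warning"
        return "in-control"
```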

EWMA for concept drift detection (ECDD) [135] employs an idea similar to SPC for detecting drifts. This is achieved by observing the change in the exponentially weighted moving average (EWMA) [133], which progressively downweights older observations in order to form a more recent estimate of the average, $Z_t = (1-\lambda)Z_{t-1} + \lambda X_t$, where $X_0, \ldots, X_t, \ldots$ are independent random variables with a known mean $\mu_0$ and standard deviation $\sigma_X$. Roberts [133] shows that the mean of $Z_t$ is $\mu_{Z_t} = \mu_0$ and that the standard deviation is given by

$$\sigma_{Z_t} = \sqrt{\frac{\lambda}{2-\lambda}\left(1-(1-\lambda)^{2t}\right)}\,\sigma_X .$$

EWMA detects a change in the mean, from $\mu_0$ to the unknown mean $\mu_1$, whenever the difference between $Z_t$ and $\mu_0$ exceeds a certain threshold, i.e., $Z_t > \mu_0 + L\sigma_{Z_t}$, where $L$ is the control limit, which determines how sensitive the detection should be.

ECDD changes the EWMA method in order to avoid the assumption of knowing $\mu_0$ before the change. It defines the variable $\hat{p}_t = \frac{t-1}{t}\hat{p}_{t-1} + \frac{1}{t}X_t$ for the exact average of all past observations, which weights all observations in the same way. ECDD assumes that the random variables are Bernoulli random variables representing a stream of binary prediction errors. A change is detected in this binary stream whenever $Z_t > \hat{p}_t + L\hat{\sigma}_{Z_t}$, such that

$$\hat{\sigma}_{Z_t} = \sqrt{\frac{\lambda}{2-\lambda}\left(1-(1-\lambda)^{2t}\right)\hat{p}_t(1-\hat{p}_t)} .$$

Adaptive windowing (ADWIN) [21, 20] is a drift detection method that, instead of sliding a window over only the recent samples, shrinks the window of observations whenever a change is detected. In this way, the expected values of the observations in the remaining part and in the removed part of the window are guaranteed to be different, with probability $1-\delta$. ADWIN2 [21, 20] was developed as an efficient alternative to ADWIN; it needs to check only $O(\log n)$ sub-windows for the shrinkage, where $n$ is the size of the window. ADWIN2 accomplishes this by approximating the window, storing only a variation of exponential histograms [46].
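
Returning to the ECDD rule $Z_t > \hat{p}_t + L\hat{\sigma}_{Z_t}$, a minimal Python sketch could look as follows. The class name `ECDD`, the defaults `lam=0.2` and `L=3.0`, the initial value $Z_0 = 0$, and the 30-example warm-up are assumptions chosen for illustration and are not taken from [135].

```python
import math


class ECDD:
    """Minimal ECDD sketch on a binary error stream; defaults are assumptions."""

    def __init__(self, lam=0.2, L=3.0, min_samples=30):
        self.lam, self.L, self.min_samples = lam, L, min_samples
        self.t = 0
        self.p_hat = 0.0  # equally weighted average of all past errors
        self.z = 0.0      # EWMA of the errors, assuming Z_0 = 0

    def update(self, error):
        """error is 1 for a misclassification and 0 otherwise; True signals a change."""
        self.t += 1
        # Incremental form of p_hat_t = (t-1)/t * p_hat_{t-1} + X_t / t.
        self.p_hat += (error - self.p_hat) / self.t
        self.z = (1 - self.lam) * self.z + self.lam * error
        factor = self.lam / (2 - self.lam) * (1 - (1 - self.lam) ** (2 * self.t))
        sigma_z = math.sqrt(factor * self.p_hat * (1 - self.p_hat))
        if self.t < self.min_samples:  # assumed warm-up period
            return False
        return self.z > self.p_hat + self.L * sigma_z
```

A larger `L` makes the detector less sensitive, at the cost of a longer detection delay.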

Classifier-independent detection methods

Hypothesis testing methods can be applied to detect changes in the data generating process. The choice of the most suitable hypothesis test depends on the desired change criterion, which is reflected in the design of the null hypothesis. In the following, we explain a number of important methods of that kind as reviewed in the survey paper [106], without any claim to completeness:

Welch’s t-test is a two-sample test used to check whether two normally distributed populations have the same mean. This test differs from Student’s t-test in that Welch’s t-test allows the populations’ variances to be unequal. From the two samples $X_1$, $X_2$ of different sizes $N_1$ and $N_2$, the test statistic is

$$t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{s_1^2}{N_1} + \dfrac{s_2^2}{N_2}}} ,$$

where $\bar{X}_1$, $\bar{X}_2$ are the sample means and $s_1^2$, $s_2^2$ are the sample variances of $X_1$, $X_2$, respectively. The t-distribution, with degrees of freedom based on the sizes and the variances of the two samples, is then applied to test the null hypothesis that the means of the two populations are equal.
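
As an illustration, the statistic can be computed with the Python standard library alone. The function name `welch_t` is hypothetical; the degrees of freedom use the Welch-Satterthwaite approximation, which the text above only alludes to and which is therefore stated here as an assumption.

```python
import math
import statistics


def welch_t(x1, x2):
    """Return (t, df) for Welch's two-sample t-test."""
    n1, n2 = len(x1), len(x2)
    m1, m2 = statistics.mean(x1), statistics.mean(x2)
    v1, v2 = statistics.variance(x1), statistics.variance(x2)  # sample variances
    se2 = v1 / n1 + v2 / n2
    t = (m1 - m2) / math.sqrt(se2)
    # Welch-Satterthwaite approximation of the degrees of freedom (assumption).
    df = se2 ** 2 / ((v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1))
    return t, df
```

The resulting `t` is then compared against the quantiles of the t-distribution with `df` degrees of freedom.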

The Kolmogorov-Smirnov test is a two-sample test for the null hypothesis that two samples are drawn from the same distribution. This is achieved by taking the supremum of the distance between the two empirical cumulative distribution functions. More formally, for the samples $X_1$ and $X_2$ of sizes $N_1$, $N_2$, respectively, with empirical cumulative distribution functions $F_1$ and $F_2$, the test statistic becomes

$$d = \sup_x |F_1(x) - F_2(x)| .$$

The null hypothesis is then rejected at significance level $\alpha$ when

$$d > c(\alpha)\sqrt{\frac{N_1 + N_2}{N_1 N_2}} ,$$

where $c(\alpha)$ is found in the Kolmogorov-Smirnov table.
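
A short Python sketch of the statistic and the rejection rule is given below. The function names are hypothetical, and the default `c_alpha=1.358` is the commonly tabulated value for $\alpha = 0.05$, used here as an assumption.

```python
import math
from bisect import bisect_right


def ks_statistic(x1, x2):
    """Supremum distance between the two empirical CDFs."""
    s1, s2 = sorted(x1), sorted(x2)
    n1, n2 = len(s1), len(s2)
    d = 0.0
    # Both ECDFs are right-continuous step functions, so the supremum is
    # attained at one of the sample points.
    for x in s1 + s2:
        f1 = bisect_right(s1, x) / n1
        f2 = bisect_right(s2, x) / n2
        d = max(d, abs(f1 - f2))
    return d


def ks_reject(x1, x2, c_alpha=1.358):
    """True if the null hypothesis of equal distributions is rejected."""
    n1, n2 = len(x1), len(x2)
    return ks_statistic(x1, x2) > c_alpha * math.sqrt((n1 + n2) / (n1 * n2))
```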

Sequential probability ratio test (SPRT) [174] is a statistical hypothesis testing method for sequential data. For the sequence $X_n = x_0, \ldots, x_n$ of independent samples, SPRT tests the null hypothesis that at the sample $x_w$, with $1 < w < n$, the data generating distribution does not change from $p_0$ to $p_1$. The cumulative variable $S_n$ holds the log ratio of the two likelihoods: the likelihood of $x_w, \ldots, x_n$ being generated by the distribution $p_0$ and the likelihood of $x_w, \ldots, x_n$ being generated by the distribution $p_1$. $S_n$ takes the form

$$S_n = \log\frac{P(x_w, \ldots, x_n; p_0)}{P(x_w, \ldots, x_n; p_1)} = \sum_{i=w}^{n}\log\frac{P(x_i; p_0)}{P(x_i; p_1)} = S_{n-1} + \log\frac{P(x_n; p_0)}{P(x_n; p_1)} .$$

The incremental observation of the samples is continued as long as $S_n$ remains inside a user-defined interval $(a, b)$. The stopping rule is then activated whenever $S_n \notin (a, b)$, such that $H_0$ is accepted when $S_n \geq b$ and $H_1$ is accepted when $S_n \leq a$. The choice of $a < 0 < b < \infty$ depends on the acceptable type I and type II errors.
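
The stopping rule can be sketched in Python as follows. The function names, the use of Gaussian densities for $p_0$ and $p_1$, and the numeric bounds in the usage note are assumptions for illustration; the accumulation starts at the first sample passed in, which plays the role of $x_w$.

```python
import math


def gaussian_pdf(mu, sigma):
    """Return the density function of a normal distribution (illustrative choice)."""
    def pdf(x):
        z = (x - mu) / sigma
        return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))
    return pdf


def sprt(samples, p0, p1, a, b):
    """Accumulate S_n = S_{n-1} + log(p0(x_n) / p1(x_n)) until it leaves (a, b).

    Returns (decision, index of the sample at which the test stopped).
    """
    s = 0.0
    for n, x in enumerate(samples):
        s += math.log(p0(x) / p1(x))
        if s >= b:
            return "H0", n
        if s <= a:
            return "H1", n
    return "undecided", len(samples) - 1
```

For example, with `p0 = gaussian_pdf(0.0, 1.0)`, `p1 = gaussian_pdf(1.0, 1.0)`, `a = -2.2`, and `b = 2.2`, each sample equal to 1.0 lowers the sum by 0.5, so `H1` is accepted at the fifth sample.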

The cumulative sum (CUSUM) [125] is a method that triggers a change signal when a parameter of a probability distribution changes. The cumulative variable $S_n$ is defined as

$$S_n = \max(0, S_{n-1} + x_n - w_n) ,$$

such that $S_0 = 0$ and $w_n$ is the weight for the sample $x_n$. CUSUM resembles SPRT when $w_n$ is chosen to be the likelihood of $x_n$. On the other hand, it detects the change only in one direction, namely the positive direction in the formula above.
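
The recursion translates directly into code. In the sketch below, the function name `cusum` and the alarm rule $S_n >$ `threshold` are assumptions, since the text above does not state when the signal is raised.

```python
def cusum(samples, weights, threshold):
    """Return the index of the first sample at which S_n exceeds the threshold.

    Implements S_n = max(0, S_{n-1} + x_n - w_n) with S_0 = 0; returns None
    if no change is signaled. The threshold rule is an assumption.
    """
    s = 0.0
    for n, (x, w) in enumerate(zip(samples, weights)):
        s = max(0.0, s + x - w)
        if s > threshold:
            return n
    return None
```

Because of the `max(0, ...)` clipping, a decrease of the samples below their weights never accumulates, which is the one-directional behavior noted above.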

Page-Hinkley test (PH) [125] indicates a change whenever the average of Gaussian random variables significantly changes. This is accomplished through the continuous update of the variables $m_n$ and $M_n$ at the time point $n$:

$$m_n = \sum_{i=1}^{n}(x_i - \bar{x}_i - \delta) = m_{n-1} + (x_n - \bar{x}_n - \delta), \qquad M_n = \min(m_n, M_{n-1}) ,$$

with $\bar{x}_n = \frac{1}{n}\sum_{i=1}^{n} x_i$, where $\delta$ represents the tolerance towards the allowed change.

The PH test simply monitors the quantity $PH_n = m_n - M_n$. A change of the mean, in the positive direction, is triggered whenever $PH_n > \lambda$, where $\lambda$ is the detection threshold.

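
The Page-Hinkley updates of $m_n$, $M_n$, and $PH_n = m_n - M_n$ can be sketched in Python as follows. The class name, the default values of `delta` and `lam`, and the initialization $M_0 = +\infty$ (so that $M_1 = m_1$) are assumptions made for illustration.

```python
import math


class PageHinkley:
    """Minimal Page-Hinkley sketch for an increase of the mean; defaults are assumptions."""

    def __init__(self, delta=0.1, lam=20.0):
        self.delta, self.lam = delta, lam
        self.n = 0
        self.mean = 0.0    # running mean of x_1, ..., x_n
        self.m = 0.0       # cumulative sum m_n
        self.M = math.inf  # running minimum M_n, assuming M_0 = +infinity

    def update(self, x):
        """Process one observation; True signals a change."""
        self.n += 1
        self.mean += (x - self.mean) / self.n
        self.m += x - self.mean - self.delta
        self.M = min(self.M, self.m)
        return self.m - self.M > self.lam
```

A larger `lam` lowers the number of false alarms but delays the detection, while `delta` sets how much deviation from the running mean is tolerated.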