In this chapter we introduced the problem of nonparametric change detection, and the change point model framework. We showed how this framework could be used to adapt nonparametric hypothesis tests to the sequential context. Through discretisation, we gave a method for computing the ranks in a computationally efficient manner.
Experimental analysis showed that there is no uniformly best CPM, although the Lepage method did give arguably the best performance over the widest class of changes. When detecting a change in the location parameter of a non-skewed distribution, we found that the MW and CvM methods gave roughly similar performance, being superior to the LP CPM when the change magnitude is small, and inferior when the change magnitude is large. For skewed distributions, however, the CvM method is slightly better.
In the case of scale shifts, the best performance is given by the Mood CPM, with the Lepage CPM being close. The other detectors are significantly worse than these, with the MW being particularly bad when the change involves the scale decreasing.
Finally, when we considered detecting arbitrary changes, we found that the LP and CvM CPMs seemed to give roughly similar performance on average. Our analysis here is of course not exhaustive, since there are infinitely many possible types of changes that we could consider.
To summarise, we can make the rough recommendation that, if no knowledge is available regarding the stream distribution or type of change that will be encountered, the Lepage CPM seems to be the best choice. Although it does not give the best performance in every case, there are no types of change for which it performs badly, and it generally gives performance which is among the best out of the methods we considered.
Chapter 4
Streaming Distribution Estimation
In the previous chapter we discussed one approach to distribution-free change detection on data streams, which used nonparametric hypothesis testing. This chapter tackles the same problem from a different angle; rather than using hypothesis tests, we instead try to estimate the unknown stream distribution. This estimate can then be used for change detection. Specifically, we propose to use the probability integral transform to convert new stream observations to N(0, 1), and then deploy standard parametric techniques to monitor the transformed observations. Unfortunately, most existing methods of density estimation are computationally expensive and require all stream observations to be stored in memory, and are hence unsuitable for use with streaming data. We therefore propose a novel method for estimating the stream distribution, which is based on an adaptation of kernel density estimation.
The chapter proceeds as follows: Section 4.1 gives a more detailed overview of our proposed approach. Section 4.2 situates our work within the existing body of literature which has considered using density estimation for the purpose of change detection. Sections 4.3.1 and 4.3.2 introduce our streaming distribution estimation techniques. Section 4.4 investigates how accurately these methods can estimate a distribution, and Section 4.5 considers their application to change detection. Finally, Section 4.6 consists of a simulation study to assess the performance of the resulting algorithm, and compares it to the CPM framework from the previous chapter.
4.1 Overview
As before, we assume that the data stream consists of the observations x1, x2, . . ..
Suppose that the pre-change distribution of the stream F0(x; θ0) were known, including its parameters. In that case, we could transform the observations into a stream of standard Gaussian variables by using the probability integral transform (PIT): given an arbitrary random variable X with cumulative distribution function F0, define:
Y = Φ^{-1}(F0(X)), (4.1)
where Φ(x) is the CDF of the unit Gaussian. Y then has a unit Gaussian distribution, Y ∼ N(0, 1).
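As a minimal illustration of (4.1), the following Python sketch applies the PIT to a stream whose pre-change distribution is assumed to be known exactly. The choice of an Exponential(1) distribution, the helper names pit_to_gaussian and exponential_cdf, and the small clipping constant are assumptions made purely for this example.

```python
import math
import random
from statistics import NormalDist, fmean, pstdev


def pit_to_gaussian(x, cdf):
    """Map x to Phi^{-1}(F0(x)) for a known continuous CDF F0, as in (4.1)."""
    u = cdf(x)
    # Clip away from 0 and 1 so that the inverse Gaussian CDF stays finite.
    eps = 1e-12
    u = min(max(u, eps), 1.0 - eps)
    return NormalDist().inv_cdf(u)


def exponential_cdf(x, rate=1.0):
    """CDF of the Exponential(rate) distribution (the assumed F0)."""
    return 1.0 - math.exp(-rate * x) if x > 0 else 0.0


rng = random.Random(0)
xs = [rng.expovariate(1.0) for _ in range(20000)]        # skewed raw stream
ys = [pit_to_gaussian(x, exponential_cdf) for x in xs]   # transformed stream

# The transformed values should have mean close to 0 and standard deviation close to 1.
print(round(fmean(ys), 3), round(pstdev(ys), 3))
```

Although the raw observations are heavily skewed, the transformed values behave like draws from N(0, 1), which is what allows Gaussian monitoring schemes to be applied to them.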
Our motivation for considering this transform is that the problem of detecting a change in the mean or variance of a univariate stream of Gaussian variables has been widely studied [12], and many techniques exist which allow the ARL0 to be controlled. These techniques can hence be deployed on the transformed observations, resulting in the desired false positive rate being maintained.
Since in practice we do not know F0, we propose to replace F0 with an estimate. Our change detection algorithm thus has three components: first, estimate F0 sequentially as new observations are received; second, use this estimated distribution to transform new observations to N(0, 1); and finally, monitor the transformed observations for changes using standard parametric techniques.
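The three components can be sketched in Python as follows. This is only an illustration under simplifying assumptions and not the algorithm developed later in the chapter: it estimates F0 with an ECDF that keeps every accepted observation in a sorted list, and it monitors the transformed values with a two-sided Shewhart-style limit. The class name EcdfPitMonitor, the warm-up length, the rank offset of 0.5 and the simulated mean shift are all hypothetical choices.

```python
import bisect
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()


class EcdfPitMonitor:
    """Estimate F0 by the ECDF, map each new point to N(0, 1), apply a Shewhart rule."""

    def __init__(self, arl0=200.0, warmup=50):
        self.sorted_obs = []   # assumption: all accepted points are kept, in sorted order
        self.warmup = warmup   # hypothetical burn-in before monitoring starts
        # Two-sided limit h with P(|Y| > h) = 1 / arl0 when Y is exactly N(0, 1).
        self.limit = STD_NORMAL.inv_cdf(1.0 - 1.0 / (2.0 * arl0))

    def transform(self, x):
        """Component 2: Phi^{-1} of the estimated F0(x), kept strictly inside (0, 1)."""
        n = len(self.sorted_obs)
        rank = bisect.bisect_right(self.sorted_obs, x)   # number of past points <= x
        return STD_NORMAL.inv_cdf((rank + 0.5) / (n + 1.0))

    def update(self, x):
        """Process one observation and return True if a change is flagged."""
        flagged = False
        if len(self.sorted_obs) >= self.warmup:
            flagged = abs(self.transform(x)) > self.limit   # component 3: monitor
        if not flagged:
            bisect.insort(self.sorted_obs, x)               # component 1: update estimate
        return flagged


def count_flags(seed=1, n_pre=600, n_post=100, shift=4.0):
    """Count flags before and after a mean shift in a simulated Gaussian stream."""
    rng = random.Random(seed)
    monitor = EcdfPitMonitor()
    pre = sum(monitor.update(rng.gauss(0.0, 1.0)) for _ in range(n_pre))
    post = sum(monitor.update(rng.gauss(shift, 1.0)) for _ in range(n_post))
    return pre, post


print(count_flags())   # (flags before the shift, flags after the shift)
```

Because the ECDF based on n points can never return a value above (n + 0.5)/(n + 1) in this sketch, no observation can exceed the limit until n is reasonably large, which gives a first indication of why the quality of the estimate of F0 matters when few observations are available.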
In this chapter we present two computationally efficient methods for estimating F0 sequentially, which allow a recursive formulation and do not require any observations to be stored in memory. The first directly uses the empirical CDF as an estimate of F0. The second uses a sequential version of kernel density estimation to estimate the pre-change density f0, and then performs numerical integration to yield an estimate of F0. We shall refer to these two techniques as ECDF and SKDE respectively. Although SKDE introduces an additional layer of complexity compared to using the empirical distribution, our experiments will show that it can be superior, especially when only a small number of observations from the stream are available.
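To make the distinction between the two kinds of estimate concrete, the sketch below compares an empirical CDF with a CDF obtained by numerically integrating a kernel density estimate. It should be read with care: it uses an ordinary batch Gaussian-kernel estimate that stores all observations, not the sequential SKDE method introduced later, and the bandwidth rule (the normal reference rule of thumb), the trapezoidal integration grid and all function names are assumptions made for illustration.

```python
import math
import random
from statistics import pstdev


def ecdf(x, data):
    """Empirical CDF: the proportion of observations that are <= x."""
    return sum(xi <= x for xi in data) / len(data)


def kde_density(x, data, bandwidth):
    """Batch Gaussian-kernel density estimate at x."""
    norm = 1.0 / (len(data) * bandwidth * math.sqrt(2.0 * math.pi))
    return norm * sum(math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in data)


def kde_cdf(x, data, bandwidth, n_grid=400):
    """Integrate the density estimate up to x with the trapezoidal rule."""
    lower = min(data) - 6.0 * bandwidth   # start far enough into the left tail
    if x <= lower:
        return 0.0
    step = (x - lower) / n_grid
    values = [kde_density(lower + i * step, data, bandwidth) for i in range(n_grid + 1)]
    area = step * (sum(values) - 0.5 * (values[0] + values[-1]))
    return min(area, 1.0)


rng = random.Random(7)
data = [rng.gauss(0.0, 1.0) for _ in range(30)]           # a small sample
bandwidth = 1.06 * pstdev(data) * len(data) ** (-0.2)     # normal reference rule of thumb

for x in (-1.0, 0.0, 1.0):
    print(x, round(ecdf(x, data), 3), round(kde_cdf(x, data, bandwidth), 3))
```

The ECDF moves in jumps of size 1/n, whereas the integrated kernel estimate is smooth. With only a few observations this smoothing is one plausible reason why a kernel-based estimate could behave better in the tails, although whether it does so in practice is a matter for the experiments later in the chapter.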
4.2 Related Work
The idea of using a transformation to convert a sequence of random variables to unit Gaussians was proposed in [67] as a method for detecting a change in a sequence of Gaussian random variables which have unknown mean and variance. The authors estimate the mean and variance of the sequence, and standard techniques are then deployed to first transform the variables so that they have a Student-t distribution, before a second transform is applied to make them unit Gaussian.
The development in [67] did not explicitly mention the probability integral transform. Instead, an approximate transformation was used which was claimed to be more computationally efficient. However, several years later, [125] independently considered the same problem of monitoring for a change in a sequence of Gaussian variables with unknown parameters, and formulated this in terms of the PIT. The unit Gaussian variables produced by this PIT approach are now known as Q statistics in the literature. A Shewhart control chart (see Section 2.1.2) is often used to monitor these Q statistics, and the resulting change detection algorithm is conventionally known as the Q chart. The theoretical properties of this chart are studied in [173]. Similarly, several CUSUM charts have been proposed to monitor the Q statistics, such as [174, 102].
Q charts have also been considered for other parametric change detection problems where the pre-change value of the monitored parameter is unknown. [126] considers a Q chart for the Geometric distribution where no information is available about the parameter, and [127] designs a similar chart for the Binomial distribution. These methods are based on using the uniform minimum variance unbiased estimator (UMVU) of the respective probability distribution to get an estimate of the unknown parameters, before the probability integral transform is deployed. However, the limitation of these techniques is that they assume the distributional form of the stream to be known, and are hence unsuitable for the nonparametric problem.
The idea of using density estimation to perform nonparametric change detection has been considered several times. Most approaches fall into one of two categories. The first is bootstrap control charts [10, 141, 103, 30, 121]. These make use of the standard nonparametric bootstrap [41], which is a general method for estimating the sampling distribution of some statistic T. Given a collection of observations x1, . . . , xn, let T(x1, . . . , xn) be the statistic of interest. A new set of observations x∗1, . . . , x∗n can be sampled from the ECDF. This new sample is called a bootstrap sample, and can be used to form the bootstrap estimate T∗ of T. By drawing many such samples, the sampling distribution of T can hence be approximated by the collection of replicates.
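The resampling step can be sketched in a few lines of Python. Sampling from the ECDF is the same as drawing n values from the data with replacement, which is what random.choices does. The function name bootstrap_replicates, the number of replicates and the simulated Gaussian data are assumptions made for this example, with the sample mean standing in for the statistic T.

```python
import math
import random
from statistics import fmean, pstdev


def bootstrap_replicates(data, statistic, n_boot=2000, seed=0):
    """Return n_boot bootstrap replicates T* of statistic(data)."""
    rng = random.Random(seed)
    n = len(data)
    # Each bootstrap sample is n draws with replacement, i.e. i.i.d. draws from the ECDF.
    return [statistic(rng.choices(data, k=n)) for _ in range(n_boot)]


rng = random.Random(42)
data = [rng.gauss(10.0, 2.0) for _ in range(200)]   # illustrative observations
reps = bootstrap_replicates(data, fmean)            # replicates of the sample mean

# The spread of the replicates approximates the standard error of the mean.
print(round(pstdev(reps), 3), round(pstdev(data) / math.sqrt(len(data)), 3))
```

For the sample mean, the spread of the replicates can be checked against the usual standard error formula. For statistics with no such formula, the collection of replicates is the only estimate of the sampling distribution available.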
In the bootstrap control chart, this technique is used to estimate the control limits corresponding to a desired value of the ARL0. Usually, this is then used to design a Shewhart control chart, although a bootstrap CUSUM was considered in [30].
Unfortunately, the performance of bootstrap control charts has come under criticism. [82] investigated the performance of several popular bootstrap approaches, and found that the realised value of the ARL0 often differed significantly from the desired value. The main problem is that when a relatively high ARL0 is required, bootstrap control charts require accurate estimation of the extreme percentiles of the stream distribution. For example, if a bootstrap Shewhart chart is deployed with a desired ARL0 of 500, then a change will be flagged if an observation lies in the upper 1/500 tail of the distribution. However, accurately estimating the extremes of the distribution requires a large number of data points, which will not generally be available. Several attempts to mitigate this have been proposed, such as CUMIN charts [1, 2], which group observations together so that less extreme percentiles are required, but their performance is questionable. Another limitation of the bootstrap technique is that most existing charts assume that batches containing several observations are available at each time instance, rather than being applicable in the case where observations are received individually.
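The sensitivity to extreme percentiles can be illustrated with a small simulation. For a one-sided Shewhart chart on independent observations, the run length until a false alarm is geometric, so ARL0 = 1/p where p is the probability that a single in-control observation exceeds the limit. The sketch below assumes a N(0, 1) stream and, for simplicity, sets the limit to an order statistic of a reference sample rather than using a full bootstrap procedure. The function names and sample sizes are hypothetical.

```python
import math
import random
from statistics import NormalDist, median


def realised_arl0(n_ref, rng, target_arl0=500.0):
    """ARL0 actually achieved when the limit is an empirical quantile of n_ref points."""
    ref = sorted(rng.gauss(0.0, 1.0) for _ in range(n_ref))
    q = 1.0 - 1.0 / target_arl0                     # quantile level, 0.998 for ARL0 = 500
    k = min(n_ref, math.ceil(q * n_ref))            # order statistic used as the limit
    limit = ref[k - 1]
    p_false_alarm = 1.0 - NormalDist().cdf(limit)   # true exceedance probability
    return 1.0 / p_false_alarm                      # geometric run length: ARL0 = 1 / p


def summarise(n_ref, repeats=50, seed=0):
    """Return (min, median, max) of the realised ARL0 over repeated reference samples."""
    rng = random.Random(seed)
    values = [realised_arl0(n_ref, rng) for _ in range(repeats)]
    return min(values), median(values), max(values)


for n_ref in (100, 1000, 5000):
    print(n_ref, [round(v) for v in summarise(n_ref)])
```

With only 100 reference points the estimated limit is simply the sample maximum, and the realised ARL0 varies enormously from one reference sample to the next. Only with several thousand points does it settle near the target of 500.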
The most common alternative to bootstrap control charts are approaches which assume that the pre-change distribution F0 has been accurately estimated from a reference sample of observations which are known to come from F0. Call this estimate F̂0. A separate estimate is then repeatedly made of the recent distribution, by (for example) estimating this distribution over a sliding window. Write F̃0 for the recent estimate of F0 made at time t. Change detection is then carried out by repeatedly comparing these two distributions using some specified distance metric d, and a change is flagged at time t if d(F̂0, F̃0) > ht for some threshold ht. Examples of such methods include [98, 86, 114].
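A generic version of this reference-versus-window scheme can be sketched as follows. The sketch is not taken from any of the cited methods: it assumes that both estimates are ECDFs, that d is the Kolmogorov-Smirnov distance, and that the threshold is a fixed constant chosen by hand. The window length, threshold value and function names are hypothetical.

```python
import bisect
import random
from collections import deque


def ks_distance(sample_a, sample_b):
    """Largest absolute difference between the ECDFs of two samples."""
    a, b = sorted(sample_a), sorted(sample_b)
    d = 0.0
    for x in a + b:   # the supremum is attained at one of the jump points
        fa = bisect.bisect_right(a, x) / len(a)
        fb = bisect.bisect_right(b, x) / len(b)
        d = max(d, abs(fa - fb))
    return d


def first_alarm(stream, reference, window=50, threshold=0.5):
    """Return the first time t at which d(reference, recent window) > threshold."""
    recent = deque(maxlen=window)   # sliding window of the most recent observations
    for t, x in enumerate(stream, start=1):
        recent.append(x)
        if len(recent) == window and ks_distance(reference, recent) > threshold:
            return t
    return None


rng = random.Random(5)
reference = [rng.gauss(0.0, 1.0) for _ in range(500)]   # sample known to come from F0
stream = [rng.gauss(0.0, 1.0) for _ in range(300)]      # in-control segment
stream += [rng.gauss(3.0, 1.0) for _ in range(100)]     # mean shift after t = 300

print(first_alarm(stream, reference))
```

Here the threshold of 0.5 was picked arbitrarily. Nothing in the sketch relates it to a target ARL0, and with a different choice of distance the appropriate value would generally depend on the stream distribution.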
However, there are two main problems with this type of approach. First, much of this literature was generated within the machine learning community, which has a different set of concerns from the traditional quality control and statistics literatures, and this has led to a general neglect of the need to have a controlled rate of false positives. Almost all methods of this type fail to be distribution free, in the sense that the ARL0 is strongly dependent on the unknown F0. A typical example is [16], which uses the DENSTREAM [22] density estimation algorithm to fit an ad hoc mixture model to the data stream, which can be done in a computationally efficient manner. The Kullback-Leibler (KL) distance between this estimated distribution and the reference distribution is then calculated, with a change flagged if it exceeds some threshold. However, the authors do not discuss how to choose this threshold. The problem is that the null distribution of the KL distance, assuming that no change has occurred, depends on the particular stream distribution, and there is hence no way to choose this threshold so that a desired ARL0 can be maintained independent of F0. As is often done in this literature, the performance of the algorithm is evaluated purely in terms of how quickly changes are detected, but no attention is given to the problem of false positives, or to assessing the significance of results. Another general problem with this sort of approach is that a sufficiently large reference sample may not always be available. Finally, several of these approaches rely on density estimation techniques which are not suitable for the streaming problem, such as unmodified kernel density estimation, where the amount of computation required grows linearly with the number of observations.
Our innovation compared to these methods is the use of the probability integral transform to produce an approximately Gaussian stream. This allows the false positive rate to be controlled, which is not generally done in this literature.