Density Estimation Experiments - Detecting changes in high frequency data streams, with applica

∂x∂x(x).

Because f''(x) is unknown, we cannot use this formula directly. In the previous section, we discussed the popular plug-in method, where this function is first estimated and the estimate is then used to select the bandwidth. However, as we mentioned, this does not seem suitable for use in a streaming context since it requires storing all the stream observations. Therefore we will use the simpler IQR rule-of-thumb technique we discussed, since it is very simple to compute and gives estimates that are very close to those obtained through the plug-in method. The bandwidth is hence:

h_n = 1.06 A n^{−1/5}, A = min(σ_n, IQR_n / 1.34),

where σ_n and IQR_n are the standard deviation and IQR of the first n observations.
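As a rough illustration of the rule of thumb above, the following sketch computes h_n from a batch of the first n observations. The function name `rule_of_thumb_bandwidth`, the use of the sample standard deviation (`ddof=1`), and NumPy's default quantile interpolation are assumptions made for this example; a streaming implementation would instead maintain σ_n and IQR_n incrementally, which is not shown here.

```python
import numpy as np

def rule_of_thumb_bandwidth(x):
    """h_n = 1.06 * A * n^(-1/5) with A = min(sigma_n, IQR_n / 1.34)."""
    x = np.asarray(x, dtype=float)
    n = x.size
    sigma = x.std(ddof=1)                     # sample standard deviation (assumed convention)
    q75, q25 = np.percentile(x, [75, 25])     # quartiles used for the IQR
    a = min(sigma, (q75 - q25) / 1.34)
    return 1.06 * a * n ** (-1 / 5)

rng = np.random.default_rng(0)
sample = rng.standard_normal(300)             # illustrative data only
h = rule_of_thumb_bandwidth(sample)
```

Taking the minimum of the two scale estimates keeps the bandwidth from being inflated by heavy tails or outliers, which affect the standard deviation more strongly than the IQR.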

4.4 Density Estimation Experiments

In the next section we will consider the use of the ECDF and SKDE estimators for change detection. However, we first investigate their performance when considered purely as density estimation techniques, since this is interesting in its own right, and will also provide insight into which method is likely to work better when used for change detection.

Recall that our streaming versions of both ECDF and SKDE work by recursively estimating the density at the m points p₁, ..., p_m and then using interpolation to extend this over the full range. At the m chosen points, the estimates converge to those given by the equivalent fixed-sample technique, and therefore to the true stream distribution, since this convergence is guaranteed for both fixed-sample ECDF and SKDE by the standard theorems discussed above. However, this convergence is not guaranteed at the interpolated points. Generally, as long as m is chosen large enough, we would expect the interpolation to be very close to the offline estimate and hence to the true density.
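To make the grid-plus-interpolation idea concrete, here is a minimal sketch for the ECDF case only. The class name `GridECDF`, the zero initial values, and the clamping to 0 and 1 outside the grid are assumptions for illustration; the SKDE recursion is not restated in this section and is therefore not reproduced.

```python
import numpy as np

class GridECDF:
    """Running ECDF tracked only at fixed grid points p_1 < ... < p_m."""

    def __init__(self, grid):
        self.grid = np.asarray(grid, dtype=float)
        self.n = 0
        self.values = np.zeros_like(self.grid)   # estimate at each p_i (assumed to start at 0)

    def update(self, x):
        # Running mean of the indicator 1[x <= p_i]; past observations are not stored.
        self.n += 1
        self.values += ((x <= self.grid) - self.values) / self.n

    def cdf(self, x):
        # Linear interpolation between grid points, clamped to 0 and 1 outside the grid.
        return np.interp(x, self.grid, self.values, left=0.0, right=1.0)

est = GridECDF([-1.0, 0.0, 1.0])
for obs in [0.5, -0.5, 0.2, 2.0]:
    est.update(obs)
```

At the grid points the stored values equal the ordinary ECDF of the data seen so far; everything between grid points comes from interpolation, which is where the choice of m matters.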

We will now investigate this empirically.

Figure 4.3: Linearly interpolated estimate of the true density, for m = 10 (red) and m = 50 (blue). Panels: (a) t(4), (b) Gaussian mixture.

Figure 4.3 shows the problem that can arise if m is set too low. Here, both a Gaussian mixture and a Student-t distribution are estimated using SKDE with m = 10 and m = 50. The plots show the estimated density after 1000 observations have been received. When m = 10, the estimated density at the 10 grid points is very close to the true density, but the interpolation between them is poor because there are so few interpolation points. When the estimation is performed with m = 50, the interpolation captures the true density well.

To quantify this, we again measure estimation accuracy using the mean square error between the estimated distribution F̂ and the true distribution F, as in Equation 4.4.

Several different stream distributions are investigated. For each, we consider m ∈ {20, 50, 100} points equally spaced on the interval [−10, 10]. We generate 1000 streams, each containing 300 observations, and process them sequentially with both the SKDE and ECDF estimators. In the case of SKDE, the resulting density is then numerically integrated using quadrature to yield an estimate of the distribution. We then compute the MSE between the estimated and true distributions at each time point. The results are shown in Figure 4.4 for SKDE, with the results for ECDF being more or less identical. It can be seen that for m = 20, the MSE can be relatively high, particularly when the true distribution is Student-t. However, using more than 50 points does not seem to give any significant performance improvement on any of the tested distributions, or on the others which we tested but have omitted for brevity. We hence tentatively conclude that only a small number of points is necessary to get good estimation accuracy on these sorts of distributions, and use m = 50 for the remainder of this chapter.

Figure 4.4: MSE between the estimated and true distributions using SKDE with various choices of m, as a function of the number of observations. The dotted lines show standard deviations.
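The experimental protocol can be sketched as follows for a standard normal stream. This is only an approximation of the setup described: it uses a grid-based running ECDF rather than SKDE (the SKDE update is not restated here), it takes the MSE to be the average squared difference over a fine evaluation grid (an assumption, since Equation 4.4 is not reproduced in this section), and it uses 100 streams instead of 1000 to keep the run short. The names `mse_trajectory`, `eval_pts`, and `mean_mse` are hypothetical.

```python
import numpy as np
from statistics import NormalDist

def mse_trajectory(stream, grid, eval_pts, target):
    """MSE between the interpolated running ECDF and the true CDF after each observation."""
    est = np.zeros_like(grid)                       # running ECDF at the m grid points
    out = np.empty(len(stream))
    for n, x in enumerate(stream, start=1):
        est += ((x <= grid) - est) / n              # recursive ECDF update
        interp = np.interp(eval_pts, grid, est)     # extend to the full range
        out[n - 1] = np.mean((interp - target) ** 2)
    return out

rng = np.random.default_rng(1)
eval_pts = np.linspace(-10.0, 10.0, 1001)           # fine grid for measuring the error
target = np.array([NormalDist().cdf(p) for p in eval_pts])
n_streams, n_obs = 100, 300                         # the text uses 1000 streams

mean_mse = {}
for m in (20, 50, 100):
    grid = np.linspace(-10.0, 10.0, m)              # m equally spaced points on [-10, 10]
    curves = [
        mse_trajectory(rng.standard_normal(n_obs), grid, eval_pts, target)
        for _ in range(n_streams)
    ]
    mean_mse[m] = np.mean(curves, axis=0)           # average MSE at each time point
```

Evaluating the error on a grid finer than the m estimation points is what exposes the interpolation error; measuring it only at the estimation points would hide the effect of m.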

We next compare the relative performance of the SKDE and ECDF algorithms. We again use the four previously considered distributions, and set m = 50. Figure 4.5 shows the mean square error between the estimated and true distributions over time. Note that we have also included the offline version of KDE for comparison.

Figure 4.5: MSE between the estimated and true distributions as a function of the number of observations, using KDE, SKDE, and ECDF with m = 50. The dotted lines show standard deviations. Panels include (a) N(0, 1), (b) t(4), and (d) Gaussian mixture; each plots distance (0.000 to 0.004) against the number of observations (50 to 300) for ECDF, SKDE, and KDE.

There are several aspects of these results which deserve comment. First, the performance of SKDE is very close to that of the offline version of KDE, although the offline version performs better, as would be expected. Second, the ECDF approach generally does not give an accurate estimate of the stream distribution when only a small number of observations are available, and for all the considered distributions it is inferior to SKDE until roughly 150 observations are available. After this point it performs slightly better, although the difference is comparatively small. This can perhaps be interpreted as preliminary evidence that the SKDE approach is superior, but this will be considered further in Section 4.6.

4.5 Change Detection

Having introduced our techniques for streaming distribution estimation, we can now incorporate these into a change detection algorithm. Our approach was previously outlined in Section 4.1, and Algorithm 1 gives the pseudocode. First, decide whether to use ECDF or SKDE, and choose the number of points p₁, ..., p_m at which the distribution is to be estimated. Next, select a parametric change detection algorithm which will be used to monitor the transformed Gaussian observations, and then specify the desired ARL₀.

Whenever an observation x_t is received from the stream, it is used to update the estimate f̂₀(p_i) of the stream distribution at each of the individual points p_i. Next, linear interpolation is performed on these points to give an estimate F̂₀ of the whole distribution. This estimate is then used in the probability integral transform shown in Equation 4.1, to transform x_t into a roughly N(0, 1) observation y_t. The transformed observation is then fed to the parametric change detector, and if it signals for a change then we conclude that a change has occurred.
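The transform step can be sketched in a few lines using the standard library. The function name `to_gaussian` and the clipping constant `eps` are assumptions added for this example: clipping keeps the inverse normal CDF finite when the estimated distribution returns exactly 0 or 1, and is not part of the method as described.

```python
from statistics import NormalDist

def to_gaussian(u, eps=1e-6):
    """Map u = F_hat_0(x_t) in [0, 1] to y_t = Phi^{-1}(u)."""
    u = min(max(u, eps), 1.0 - eps)   # added safeguard, not taken from the text
    return NormalDist().inv_cdf(u)

y_mid = to_gaussian(0.5)              # an observation at the estimated median
y_tail = to_gaussian(0.975)           # an observation in the upper tail
```

If the estimate of the stream distribution is accurate, the values y_t behave approximately like standard normal draws before a change, which is what allows a parametric detector to be used downstream.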

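If the SKDE approach is used, the density must be numerically integrated in order to compute F̂₀(x_t). Since f̂₀ has been linearly interpolated, this integral can be performed analytically and is hence very computationally efficient. Suppose that x_t lies between p_k and p_{k+1}. Then, integrating the piecewise-linear f̂₀ from p₁ to x_t:

F̂₀(x_t) = Σ_{i=1}^{k−1} (p_{i+1} − p_i) (f̂₀(p_i) + f̂₀(p_{i+1})) / 2 + (x_t − p_k) (f̂₀(p_k) + f̂₀(x_t)) / 2,

where f̂₀(x_t) is the linearly interpolated density at x_t.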
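Because the interpolated density is piecewise linear, its integral reduces to a sum of trapezoid areas. A minimal sketch of that computation is given below; the function name `cdf_from_grid_density`, the 0-based indexing, and the assumption that the density is zero outside [p₁, p_m] are illustrative choices rather than details taken from the text.

```python
import numpy as np

def cdf_from_grid_density(x, grid, dens):
    """Integrate the linear interpolant of dens from grid[0] to x using trapezoids."""
    grid = np.asarray(grid, dtype=float)
    dens = np.asarray(dens, dtype=float)
    x = float(np.clip(x, grid[0], grid[-1]))        # density assumed zero outside the grid
    k = int(np.searchsorted(grid, x, side="right")) - 1
    k = min(k, len(grid) - 2)                       # keep x == grid[-1] inside the last cell
    # Full cells to the left of x: one trapezoid per cell.
    full = np.sum(np.diff(grid[:k + 1]) * (dens[:k] + dens[1:k + 1]) / 2.0)
    # Partial cell [grid[k], x]: interpolate the density at x, then one more trapezoid.
    f_x = np.interp(x, grid, dens)
    return float(full + (x - grid[k]) * (dens[k] + f_x) / 2.0)

# Triangular density on [0, 2] with its peak at 1.
tri_grid, tri_dens = [0.0, 1.0, 2.0], [0.0, 1.0, 0.0]
value = cdf_from_grid_density(1.5, tri_grid, tri_dens)
```

The sketch does not renormalise, so the result only reaches 1 at the right end of the grid if the density values themselves integrate to 1.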

As discussed in Section 2.1.2, there are many possible choices of parametric change detection algorithm. We choose to use a version of the CPM framework from the previous chapter. Two instances of the CPM are run in parallel: the first is intended to detect shifts in the mean, and hence uses a two-sample Student-t test statistic; the second detects shifts in the variance, and uses a two-sample F statistic. The Student-t and F tests are optimal for testing the mean and variance, respectively, of two samples of Gaussian observations. Their use in the CPM was first proposed in [71] and [72] respectively.

Algorithm 1 General density estimation based change detection algorithm

Choose a desired value for m and the ARL₀
Initialise f̂(p₁) = 1/(p_m − p₁), ..., f̂(p_m) = 1/(p_m − p₁)
Initialise the CPMs
For each observation x_t:
    Update estimate f̂₀ of f₀ at each p_i
    Compute F̂₀ using linear interpolation
    Let y_t = Φ⁻¹(F̂₀(x_t))
    Add y_t to points in CPM, compute D_t
    If D_t > h_t, flag for change

Unlike the nonparametric test statistics used in our previous CPMs, the T and F tests both have a simple formulation in terms of the sufficient statistics of the Gaussian, which can be computed recursively. Define μ̂_{i:j} and σ̂_{i:j} as the mean and standard deviation of the observations x_i, ..., x_j. Then, the two-sample Student-t statistic can be defined as:

Similarly, the two-sample F statistic can be written as:
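For reference, the sketch below computes the standard textbook forms of these two statistics for a candidate split after the k-th observation, using only running sums and sums of squares. The pooled-variance form of the t statistic, the direction of the variance ratio, the sign convention, and the name `two_sample_stats` are assumptions; they may differ from the exact formulation used in the CPM.

```python
import numpy as np

def two_sample_stats(y, k):
    """Two-sample t and F statistics for the split y[:k] versus y[k:].

    Assumes 2 <= k <= len(y) - 2 so that both segment variances are defined.
    """
    y = np.asarray(y, dtype=float)
    t = y.size
    s1, s2 = np.cumsum(y), np.cumsum(y ** 2)   # running sums, updatable one observation at a time
    n0, n1 = k, t - k
    mean0 = s1[k - 1] / n0
    mean1 = (s1[-1] - s1[k - 1]) / n1
    # Unbiased segment variances recovered from the sums of squares.
    var0 = (s2[k - 1] - n0 * mean0 ** 2) / (n0 - 1)
    var1 = (s2[-1] - s2[k - 1] - n1 * mean1 ** 2) / (n1 - 1)
    pooled = ((n0 - 1) * var0 + (n1 - 1) * var1) / (t - 2)
    t_stat = (mean0 - mean1) / np.sqrt(pooled * (1.0 / n0 + 1.0 / n1))
    f_stat = var0 / var1
    return t_stat, f_stat

t_stat, f_stat = two_sample_stats([1.0, 2.0, 3.0, 4.0, 6.0, 8.0], k=3)
```

Because every quantity is derived from cumulative sums, adding a new observation only requires extending the two running totals, which is what makes these statistics convenient in a streaming setting.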

In document Detecting changes in high frequency data streams, with applications (Page 100-106)