The Bernoulli Change Point Model - Detecting changes in high frequency data streams, with appli

3, assuming full knowledge of the pre-change distribution is unrealistic and limits the applicability of these methods. Although θ₀ can generally be replaced by an estimate, it has been shown [19, 168] that this can have a serious impact on the performance of Bernoulli change detection algorithms, and cause the realised ARL₀ to deviate substantially from the desired value.

In this chapter, we are concerned with the task of monitoring for a change in θ when there is no prior knowledge available regarding either its pre- or post-change value. We will present two novel algorithms for this problem. The first is an extension of the Change Point Model framework from Chapter 2 which makes it suitable for detecting changes in a stream of Bernoulli random variables. We will describe this method in Section 5.2. The second is an adaptation of the classic EWMA chart previously described in Section 2.1.2, which we have modified to be usable for a binary random stream where the pre-change value of θ is unknown. This method will be described in Section 5.3.

5.2 The Bernoulli Change Point Model

The first approach we use for detecting a shift in a Bernoulli parameter is the CPM framework described in Section 3.3.1 in the context of nonparametric change detection. Recall that this framework involved using a test statistic D_k,t for comparing the distribution of two samples. In order to extend this work to the task of Bernoulli change detection, we replace D_k,t with a relevant test statistic. Specifically, we choose to use Fisher’s Exact Test [47] (FET) since it has a null distribution which can be computed exactly, rather than relying on Gaussian approximations which only hold asymptotically. This property is important since we would again like our change detector to be deployable in situations where only a small number of observations are available between change points.

The idea behind FET is as follows: suppose the observations at time t are broken up into two samples x₁, . . . , x_k and x_{k+1}, . . . , x_t. Let the null hypothesis be that there are no change points in the sequence, which implies that both samples have been generated by the same Bernoulli distribution with a fixed parameter θ₀. Under this assumption, the X_i variables are identically distributed, with P(X_i = 1) = θ₀ and P(X_i = 0) = 1 − θ₀ for all i. Let S_t be a random variable defined as the number of failures observed up until time t, i.e.

$$S_t = \sum_{i=1}^{t} X_i.$$

Then, conditional on S_t = s_t, FET uses a combinatorial argument to reason about how the observed failures are distributed between the two samples. Let S_k be the number of failures in the first sample. Under the null hypothesis, S_k follows the hypergeometric distribution:

$$P(S_k = s_k \mid S_t = s_t) = \frac{\binom{k}{s_k}\binom{t-k}{s_t - s_k}}{\binom{t}{s_t}},$$

where \binom{t}{k} is the binomial coefficient. A fundamental property of the FET is that this probability does not depend on the unknown parameter θ₀. By conditioning on the value of the sufficient statistic S_t, this dependency has been removed. Therefore the p-values of the FET under the null hypothesis are independent of θ₀, which makes this test suitable for situations where this parameter is not known. Now, as noted in the Introduction, we will generally be more interested in detecting an increase in θ_t, which corresponds to an unusually small number of failures occurring within the first k observations. The probability of there being s_k or fewer failures in the first k observations, under the null hypothesis that there is no change point and all observations are identically distributed, is:

$$p_{k,t} = P(S_k \le s_k \mid S_t = s_t) = \sum_{j=0}^{s_k} \frac{\binom{k}{j}\binom{t-k}{s_t - j}}{\binom{t}{s_t}}.$$

This is the one-sided p-value of the FET. Note that it is the cumulative distribution function of the hypergeometric distribution, evaluated at s_k.
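The one-sided p-value described above can be evaluated directly from binomial coefficients. The following is a minimal sketch rather than the implementation used in this work; the function name `fet_p_value` and its argument order are illustrative assumptions, and only the Python standard library is used.

```python
from math import comb


def fet_p_value(k, s_k, t, s_t):
    """One-sided FET p-value P(S_k <= s_k | S_t = s_t) under the no-change null.

    k   : size of the first sample (0 < k < t)
    s_k : failures observed in the first sample
    t   : total number of observations
    s_t : total number of failures among the t observations
    (Hypothetical helper; names are chosen for illustration only.)
    """
    if not (0 < k < t and 0 <= s_k <= min(k, s_t) and s_t - s_k <= t - k):
        raise ValueError("inconsistent split of the observations")
    denom = comb(t, s_t)
    # math.comb(n, r) returns 0 when r > n, so impossible splits add nothing.
    tail = sum(comb(k, j) * comb(t - k, s_t - j) for j in range(s_k + 1))
    return tail / denom
```

For example, with t = 10 observations containing s_t = 4 failures, none of which fall in the first k = 5 observations, `fet_p_value(5, 0, 10, 4)` evaluates 5/210, the probability that all four failures land in the second sample by chance.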

We now define the test statistic D_k,t in terms of this observed p-value. For consistency with Chapter 1, where we rejected the null hypothesis for large values of D_k,t, we actually define:

$$D_{k,t} = 1 - p_{k,t}, \qquad D_{k,t} \in [0, 1].$$

Finally, the null hypothesis that no change occurs at k is rejected if D_k,t > h_k,t for some threshold h_k,t, as before.

Similar to Section 3.3.1, D_t is then defined as the maximum of these test statistics:

$$D_t = \max_{k} D_{k,t}, \qquad 1 < k < t,$$

and a change is flagged at time t if D_t > h_t for some appropriately chosen threshold h_t.
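As an illustration of the maximisation step, the sketch below scans every split point 1 < k < t of a binary sequence and returns the largest value of one minus the one-sided FET p-value, together with the split at which it occurs. The function name `bernoulli_cpm_statistic` and the choice to return the maximising k are assumptions made for this example; the p-value is evaluated naively, with no attempt at efficiency.

```python
from math import comb


def bernoulli_cpm_statistic(x):
    """Return (D_t, k_hat) for a list x of 0/1 observations with len(x) >= 3.

    D_t is the maximum over 1 < k < t of 1 - p_k,t, where p_k,t is the
    one-sided FET p-value for the split x[:k] versus x[k:].
    (Hypothetical helper written for illustration.)
    """
    t = len(x)
    if t < 3:
        raise ValueError("need at least three observations")
    s_t = sum(x)
    denom = comb(t, s_t)
    best, best_k = 0.0, None
    s_k = x[0]                      # failures among the first k - 1 points
    for k in range(2, t):           # split points 1 < k < t
        s_k += x[k - 1]             # now failures among the first k points
        tail = sum(comb(k, j) * comb(t - k, s_t - j) for j in range(s_k + 1))
        d_kt = 1.0 - tail / denom
        if best_k is None or d_kt > best:
            best, best_k = d_kt, k
    return best, best_k
```

For the sequence `[0, 0, 0, 0, 1, 1, 1, 1]`, the maximum is attained at k = 4 with value 1 − 1/70, since only one of the 70 possible arrangements of four failures places none of them in the first four positions.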

As in Section 3.3.2, these thresholds should be chosen so that there is a fixed probability of a false-positive change detection occurring at each time instance, i.e.

$$P(D_t > h_t \mid D_{t-1} \le h_{t-1}, \ldots, D_1 \le h_1) = \alpha. \qquad (5.2)$$

Again, it does not seem possible to find an analytic expression for these thresholds, and we instead use Monte Carlo simulation. However, a problem now arises: in the above analysis, the FET involved conditioning on the observed number of failures, as can be seen from Equation 5.1. This implies that the thresholds used in the CPM should be conditional on the particular data sequence that has been observed, with different sequences requiring different thresholds. Since a collection of n Bernoulli random variables has 2ⁿ possible realisations, it is hence not possible to use the sort of precomputed lookup table which we described in Chapter 3 for the nonparametric models. This leads to problems when working with streaming data, since without a lookup table we would have to compute the thresholds as observations were received, which is computationally prohibitive.

This problem would not occur if θ₀ were known. In this case we could compute the thresholds via Monte Carlo simulation as before, by simply generating many sequences of independent Bernoulli(θ₀) random variables and looking at the empirical distribution of D_t. Since we do not in practice know θ₀, we compromise by designing a conservative test. In general, p_t is smallest (with D_t being largest) when θ₀ = 0.5. This is illustrated in Figure 5.1, which shows how the average value of D_t varies with θ₀ for a sample of 100 Bernoulli random variables.

Figure 5.1: Plot of how D₁₀₀ varies as θ₀ changes, attaining a maximum at θ₀ = 0.5. (Horizontal axis: θ₀, from 0.1 to 0.9; vertical axis: D₁₀₀, from 0.70 to 0.80.)

Therefore, in order to generate a threshold sequence h_t that will give an ARL₀ of at least 1/α, we simulate Bernoulli sequences under the assumption that θ₀ = 0.5.

Because other values of θ₀ result in lower values of D_t, this will result in an ARL₀ which is greater than 1/α, i.e. we will have:

$$P(D_t > h_t \mid D_{t-1} \le h_{t-1}, \ldots, D_1 \le h_1) \le \alpha. \qquad (5.3)$$

Recall from Section 2.1.2 the importance we placed on being able to bound the ARL₀ in order to better assess the significance of detected changes. From our arguments there, it follows that it is acceptable to have a conservative detector which gives fewer false positives (and hence a larger ARL₀) than expected. We therefore do not consider this to be a major problem, even if it means we will have a detector which is slower to detect changes than it would be if a non-conservative test had been used.

θ₀      ARL₀
0.50    500
0.20    530
0.10    620
0.05    870
0.01    1450

Table 5.1: Assuming that the desired ARL₀ is 500, this table shows the realised ARL₀ for several different values of θ₀.

With the above caveats in place, we determined the h_t thresholds using the Monte Carlo method described previously in Section 3.3.2. One million Bernoulli sequences of length 3000 were generated, and D_t was computed at each point in each sequence. h_t can then be chosen at each time point so that, among the sequences which have not exceeded an earlier threshold (the conditioning in Equation 5.2), the proportion of D_t values exceeding h_t is equal to α. The sequences of h_t values required to give various values of the ARL₀ are given in Table A.1 in Appendix C. It can be seen that the value of the threshold required to give the desired ARL₀ appears to have settled down after t = 2000, so it seems reasonable to use the value of h₂₀₀₀ as an approximation of h_t for t > 2000.
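The threshold simulation can be sketched on a much smaller scale as follows. This is an illustrative reading of the procedure, not the code used to produce the tabulated thresholds: the function names, the default seed, the starting point t = 3 and the rule for picking the empirical quantile are all assumptions. Sequences are simulated with θ₀ = 0.5, and at each t the threshold is chosen among the sequences that have not exceeded an earlier threshold, so that at most a fraction α of them exceed h_t.

```python
import random
from math import comb


def d_statistic(x):
    """D_t = max over 1 < k < t of 1 - p_k,t for a 0/1 list x (len >= 3)."""
    t, s_t = len(x), sum(x)
    denom = comb(t, s_t)
    best, s_k = 0.0, x[0]
    for k in range(2, t):
        s_k += x[k - 1]
        tail = sum(comb(k, j) * comb(t - k, s_t - j) for j in range(s_k + 1))
        best = max(best, 1.0 - tail / denom)
    return best


def simulate_thresholds(alpha, t_max, n_sequences, theta0=0.5, seed=0):
    """Monte Carlo thresholds h_3, ..., h_t_max (hypothetical helper).

    Returns a dict mapping t to h_t. Because D_t is discrete, the realised
    exceedance proportion at each t can be below alpha, never above it.
    """
    rng = random.Random(seed)
    seqs = [[int(rng.random() < theta0) for _ in range(t_max)]
            for _ in range(n_sequences)]
    alive = list(range(n_sequences))        # sequences with no alarm so far
    thresholds = {}
    for t in range(3, t_max + 1):
        stats = sorted((d_statistic(seqs[i][:t]), i) for i in alive)
        allowed = int(alpha * len(stats))   # exceedances permitted at time t
        h_t = stats[len(stats) - 1 - allowed][0]
        thresholds[t] = h_t
        alive = [i for d, i in stats if d <= h_t]
    return thresholds
```

A call such as `simulate_thresholds(alpha=1 / 500, t_max=50, n_sequences=2000)` runs in reasonable time. Reproducing the scale quoted above (one million sequences of length 3000) would need a far more efficient evaluation of D_t than this naive version provides.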

In Table 5.1 we investigate just how conservative our procedure is. For a target ARL₀ of 500, this table shows the realised ARL₀ for various choices of θ₀, using the thresholds which have been computed under the assumption that θ₀ = 0.5. It can be seen that in general, the detector is not too conservative unless θ₀ is quite low (< 0.05), suggesting that our procedure should give good performance for moderate values of θ₀. We will investigate this in the experiments section later in this chapter.

Note that by symmetry the ARL₀ when θ₀ = 1 − γ will be identical to that for θ₀ = γ, so we have not included values of θ₀ > 0.5 in this table.

5.2.1 Implementation Issues

Having completed the description of the Bernoulli CPM, we now discuss implementation issues. In many important real-world scenarios, computational resources are limited, so it is important to have a change detection algorithm which can be computed efficiently. For the CPM, the majority of computation time is spent calculating the D_k,t statistics. From Equation 5.1 we can see that this is equivalent to evaluating the probability mass function of the hypergeometric distribution. Most common statistical packages will provide a highly optimised routine for this task. However, we can increase efficiency by exploiting the high level of correlation between the D_k,t statistics.
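One way such a saving could be realised is to update the hypergeometric probability of the observed split as k advances, instead of recomputing binomial coefficients from scratch at every split point. The sketch below assumes the quantity being updated is P(S_k = s_k | S_t = s_t); it is derived from standard ratios of adjacent binomial coefficients and may differ from the recursion intended in this section. The function names are hypothetical, and the cumulative p-value would still require summing probabilities on top of this.

```python
from math import comb


def split_pmf_direct(k, s_k, t, s_t):
    """Hypergeometric probability of the split, from binomial coefficients."""
    return comb(k, s_k) * comb(t - k, s_t - s_k) / comb(t, s_t)


def split_pmf_path(x):
    """Return [(k, P(S_k = s_k | S_t = s_t))] for k = 1, ..., t - 1.

    Each value is obtained from the previous one with a constant number of
    multiplications (an assumed recursion, written for illustration).
    """
    t, s_t = len(x), sum(x)
    s_k = x[0]
    # k = 1: C(t-1, s_t-1)/C(t, s_t) = s_t/t, and C(t-1, s_t)/C(t, s_t) = (t-s_t)/t
    pmf = s_t / t if x[0] == 1 else (t - s_t) / t
    out = [(1, pmf)]
    for k in range(1, t - 1):       # advance the split from k to k + 1
        r = s_t - s_k               # failures in the second sample
        m = t - k                   # size of the second sample
        if x[k] == 1:               # observation k + 1 is a failure
            pmf *= (k + 1) / (s_k + 1) * r / m
            s_k += 1
        else:                       # observation k + 1 is not a failure
            pmf *= (k + 1) / (k + 1 - s_k) * (m - r) / m
        out.append((k + 1, pmf))
    return out
```

Each update costs a constant number of multiplications, whereas evaluating the binomial coefficients directly costs more as t grows; the two routes agree up to floating-point rounding.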

Consider a fixed-size sample containing t observations, of which s_t are failures.

Let s_k be the number of failures observed in the subsample x₁, . . . , x_k, and for

By exploiting some combinatorics, we can compute ξ_{k+1,t} recursively from ξ_{k,t}. We make use of the following identities for the binomial coefficient:

With these, basic algebraic manipulation shows that:

ξ_k,t+1 =

Using these recursive formulations significantly decreases the processing time required to compute each value of D_k,t. Recall that D_k,t is defined as:

In document Detecting changes in high frequency data streams, with applications (Page 116-122)