
The Bernoulli Change Point Model

3, assuming full knowledge of the pre-change distribution is unrealistic and limits the applicability of these methods. Although θ0 can generally be replaced by an estimate, it has been shown [19, 168] that this can have a serious impact on the performance of Bernoulli change detection algorithms, and cause the realised ARL0 to deviate substantially from the desired value.

In this chapter, we are concerned with the task of monitoring for a change in θ when there is no prior knowledge available regarding either its pre- or post-change value. We will present two novel algorithms for this problem. The first is an extension of the Change Point Model framework from Chapter 2 which makes it suitable for detecting changes in a stream of Bernoulli random variables. We will describe this method in Section 5.2. The second is an adaptation of the classic EWMA chart previously described in Section 2.1.2, which we have modified to be usable for a binary random stream where the pre-change value of θ is unknown. This method will be described in Section 5.3.

5.2 The Bernoulli Change Point Model

The first approach we use for detecting a shift in a Bernoulli parameter is the CPM framework described in Section 3.3.1 in the context of nonparametric change detection. Recall that this framework involved using a test statistic Dk,t for comparing the distribution of two samples. In order to extend this work to the task of Bernoulli change detection, we replace Dk,t with a relevant test statistic. Specifically, we choose to use Fisher's Exact Test [47] (FET) since it has a null distribution which can be computed exactly, rather than relying on Gaussian approximations which only hold asymptotically. This property is important since we would again like our change detector to be deployable in situations where only a small number of observations are available between change points.

The idea behind FET is as follows: suppose the observations at time t are broken up into two samples x1, . . . , xk and xk+1, . . . , xt. Let the null hypothesis be that there are no change points in the sequence, which implies that both samples have been generated by the same Bernoulli distribution with a fixed parameter θ0. Under this assumption, the Xi variables are identically distributed, with P(Xi = 1) = θ0 and P(Xi = 0) = 1 − θ0 for all i. Let St be a random variable defined as the number of failures observed up until time t, i.e.

$$S_t = \sum_{i=1}^{t} X_i.$$

Then, conditional on St = st, FET uses a combinatorial argument to reason about how the observed failures are distributed between the two samples. Let Sk be the number of failures in the first sample. Under the null hypothesis, the probability that Sk = sk follows the hypergeometric distribution:

$$P(S_k = s_k \mid S_t = s_t) = \frac{\binom{k}{s_k}\binom{t-k}{s_t-s_k}}{\binom{t}{s_t}},$$

where $\binom{t}{k}$ is the binomial coefficient. A fundamental property of the FET is that this probability does not depend on the unknown parameter θ0. By conditioning on the value of the sufficient statistic St, this dependency has been removed. Therefore the p-values of the FET under the null hypothesis are independent of θ0, which makes this test suitable for situations where this parameter is not known. Now, as noted in the Introduction, we will generally be more interested in detecting an increase in θt, which corresponds to an unusually small number of failures occurring within the first k observations. The probability of there being sk or fewer failures in the first k observations, under the null hypothesis that there is no change point and all observations are identically distributed, is:

$$p_{k,t} = P(S_k \le s_k \mid S_t = s_t) = \sum_{j=0}^{s_k} \frac{\binom{k}{j}\binom{t-k}{s_t-j}}{\binom{t}{s_t}}.$$

This is the one-sided p-value of the FET. Note that it is the cumulative distribution function of the hypergeometric distribution, evaluated at sk.
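As a rough illustration of the one-sided p-value described above, the following sketch sums hypergeometric probabilities directly using only the Python standard library. The function name `fet_p_value` and its argument order are hypothetical choices made for this example, and the summation is restricted to the feasible range of the hypergeometric distribution; it is a minimal sketch rather than an optimised routine.

```python
from math import comb

def fet_p_value(s_k: int, k: int, s_t: int, t: int) -> float:
    """One-sided FET p-value P(S_k <= s_k | S_t = s_t).

    s_k: failures among the first k observations
    s_t: failures among all t observations
    """
    total = comb(t, s_t)
    # The first sample must hold at least s_t - (t - k) failures,
    # because the second sample only has room for t - k of them.
    lo = max(0, s_t - (t - k))
    hi = min(s_k, s_t, k)
    tail = sum(comb(k, j) * comb(t - k, s_t - j) for j in range(lo, hi + 1))
    return tail / total

# Example: 3 failures in 10 observations, none of them in the first 5.
p = fet_p_value(s_k=0, k=5, s_t=3, t=10)
```

Note that θ0 never appears in the computation, which reflects the conditioning on St discussed above.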

We now define the test statistic Dk,t in terms of this observed p-value. For consistency with Chapter 1, where we rejected the null hypothesis for large values of Dk,t, we actually define:

$$D_{k,t} = 1 - p_{k,t}, \qquad D_{k,t} \in [0, 1].$$

Finally, the null hypothesis that no change occurs at k is rejected if Dk,t > hk,t for some threshold hk,t, as before.

Similar to Section 3.3.1, Dt is then defined as the maximum of these test statistics:

$$D_t = \max_{k} D_{k,t}, \qquad 1 < k < t,$$

and a change is flagged at time t if Dt > ht for some appropriately chosen threshold ht.
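The maximisation over split points can be sketched as follows. This is a deliberately naive illustration that recomputes the hypergeometric tail for every k, assuming the statistic at each split is one minus the one-sided FET p-value; the function name `cpm_statistic` is hypothetical, and ties are resolved in favour of the earliest k.

```python
from math import comb

def cpm_statistic(xs):
    """Return (D_t, k_hat): the maximum over 1 < k < t of
    1 - P(S_k <= s_k | S_t = s_t), and the split point attaining it."""
    t, s_t = len(xs), sum(xs)
    total = comb(t, s_t)
    best_d, best_k = 0.0, None
    s_k = xs[0]  # failures in x_1 (k = 1 is excluded from the maximum)
    for k in range(2, t):
        s_k += xs[k - 1]  # now s_k counts failures in x_1..x_k
        lo = max(0, s_t - (t - k))
        tail = sum(comb(k, j) * comb(t - k, s_t - j) for j in range(lo, s_k + 1))
        d = 1.0 - tail / total
        if best_k is None or d > best_d:
            best_d, best_k = d, k
    return best_d, best_k

# Ten 0s followed by ten 1s: the strongest evidence is at k = 10.
d_t, k_hat = cpm_statistic([0] * 10 + [1] * 10)
```

In this toy sequence the only split with no failures in the first sample and all ten in the second is k = 10, so the maximum is attained there.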

As in Section 3.3.2, these thresholds should be chosen so that there is a fixed probability of a false-positive change detection occurring at each time instance, i.e.

$$P(D_t > h_t \mid D_{t-1} \le h_{t-1}, \ldots, D_1 \le h_1) = \alpha. \tag{5.2}$$

Again, it does not seem possible to find an analytic expression for these thresholds, and we instead use Monte-Carlo simulation. However, a problem now arises: in the above analysis, the FET involved conditioning on the observed number of failures, as can be seen from Equation 5.1. This implies that the thresholds used in the CPM should be conditional on the particular data sequence that has been observed, with different sequences requiring different thresholds. Since a collection of n Bernoulli random variables has 2^n possible realisations, it is not possible to use the sort of precomputed lookup table which we described in Chapter 3 for the nonparametric models. This leads to problems when working with streaming data, since without a lookup table we would have to compute the thresholds as observations were received, which is computationally prohibitive.

This problem would not occur if θ0 were known. In this case we could compute the thresholds via Monte-Carlo simulation as before, by simply generating many sequences of independent Bernoulli(θ0) random variables and looking at the empirical distribution of Dt. Since we do not in practice know θ0, we compromise by designing a conservative test. In general, pt is smallest (with Dt being largest) when θ0 = 0.5. This is illustrated in Figure 5.1, which shows how the average value of Dt varies with θ0 for a sample of 100 Bernoulli random variables.

[Figure 5.1: Plot of how D100 varies as θ0 changes, attaining a maximum at θ0 = 0.5. Horizontal axis: θ0, from 0.1 to 0.9. Vertical axis: D100, from 0.70 to 0.80.]

Therefore, in order to generate a threshold sequence ht that will give an ARL0 of at least 1/α, we simulate Bernoulli sequences under the assumption that θ0 = 0.5. Because other values of θ0 result in lower values of Dt, this will result in an ARL0 which is greater than 1/α, i.e. we will have:

$$P(D_t > h_t \mid D_{t-1} \le h_{t-1}, \ldots, D_1 \le h_1) \le \alpha. \tag{5.3}$$

Recall from Section 2.1.2 the importance we placed on being able to bound the ARL0 in order to better assess the significance of detected changes. From our arguments there, it follows that it is acceptable to have a conservative detector which gives fewer false positives (and hence a larger ARL0) than expected. We therefore do not consider this to be a major problem, even if it means we will have a detector which is slower to detect changes than it would be if a non-conservative test had been used.

θ0      ARL0
0.50    500
0.20    530
0.10    620
0.05    870
0.01    1450

Table 5.1: Assuming that the desired ARL0 is 500, this table shows the realised ARL0 for several different values of θ0.

With the above caveats in place, we determined the ht thresholds using the Monte Carlo method described previously in Section 3.3.2. One million Bernoulli sequences of length 3000 were generated, and Dt was computed at each point in each sequence. ht can then be chosen at each time point so that the proportion of Dt values exceeding ht is equal to α. The sequences of ht values required to give various values of the ARL0 are given in Table A.1 in Appendix C. It can be seen that the value of the threshold required to give the desired ARL0 appears to have settled down after t = 2000, so it seems reasonable to use the value of h2000 as an approximation of ht for t > 2000.
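A scaled-down sketch of this kind of Monte Carlo threshold computation is given below. It is not the procedure of Section 3.3.2 itself but one plausible reading of it: sequences are simulated with θ0 = 0.5, the threshold at each t is taken as an empirical quantile of Dt over the sequences that have not yet been flagged, and flagged sequences are then discarded so that later thresholds are conditional on no earlier detection. The names `d_stat`, `max_stat` and `simulate_thresholds`, the starting time `start=3`, and the tiny simulation sizes are all assumptions made for illustration.

```python
import random
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def d_stat(s_k, k, s_t, t):
    """1 - P(S_k <= s_k | S_t = s_t) under the hypergeometric null."""
    lo = max(0, s_t - (t - k))
    tail = sum(comb(k, j) * comb(t - k, s_t - j) for j in range(lo, s_k + 1))
    return 1.0 - tail / comb(t, s_t)

def max_stat(sums, t):
    """D_t: maximum over 1 < k < t; sums[k] is the number of ones in x_1..x_k."""
    return max(d_stat(sums[k], k, sums[t], t) for k in range(2, t))

def simulate_thresholds(alpha, length, n_paths, start=3, seed=0):
    """Return rows (t, h_t, fraction flagged at t) for t = start..length."""
    rng = random.Random(seed)
    paths = []
    for _ in range(n_paths):
        sums = [0]
        for _ in range(length):
            sums.append(sums[-1] + int(rng.random() < 0.5))  # theta_0 = 0.5
        paths.append(sums)
    alive, rows = paths, []
    for t in range(start, length + 1):
        scored = [(max_stat(p, t), p) for p in alive]
        values = sorted(d for d, _ in scored)
        n = len(values)
        # Allow at most floor(alpha * n) surviving paths to exceed h_t.
        h_t = values[n - 1 - int(alpha * n)]
        flagged = sum(d > h_t for d in values)
        rows.append((t, h_t, flagged / n))
        # Condition later thresholds on no detection so far.
        alive = [p for d, p in scored if d <= h_t]
    return rows

rows = simulate_thresholds(alpha=0.05, length=20, n_paths=200)
```

Because Dt is discrete, many simulated values tie, particularly for small t, so the fraction of sequences flagged at each step can fall well below α in a small run like this one.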

In Table 5.1 we investigate just how conservative our procedure is. For a target ARL0 of 500, this table shows the realised ARL0 for various choices of θ0, using the thresholds which have been computed under the assumption that θ0 = 0.5. It can be seen that in general, the detector is not too conservative unless θ0 is quite low (< 0.05), suggesting that our procedure should give good performance for moderate values of θ0. We will investigate this in the experiments section later in this chapter.

Note that by symmetry the ARL0 when θ0 = 1 − γ will be identical to that for θ0 = γ, so we have not included values of θ0 > 0.5 in this table.

5.2.1 Implementation Issues

Having completed the description of the Bernoulli CPM, we now discuss implementation issues. In many important real-world scenarios, computational resources are limited, so it is important to have a change detection algorithm which can be computed efficiently. For the CPM, the majority of computation time is spent calculating the Dk,t statistics. From Equation 5.1 we can see that this is equivalent to evaluating the probability mass function of the hypergeometric distribution. Most common statistical packages will provide a highly optimised routine for this task. However, we can increase efficiency by exploiting the high level of correlation between the Dk,t statistics.
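To illustrate the kind of saving that such correlation makes possible, the sketch below updates the hypergeometric point probability P(Sk = sk | St = st) from one split point k to the next using only a ratio of small integers, instead of re-evaluating three binomial coefficients each time. The update rules are derived here from the standard ratios of adjacent binomial coefficients and are offered as an assumption-laden example, not necessarily the recursion used in this chapter; the function names `pmf_recursive` and `pmf_direct` are hypothetical. Only the point probability is covered, not the cumulative sum needed for the full p-value.

```python
from math import comb

def pmf_recursive(xs):
    """Return [P(S_k = s_k | S_t = s_t) for k = 1..t] for a fixed 0/1 sample xs."""
    t, s_t = len(xs), sum(xs)
    pmf, s_k, out = 1.0, 0, []  # for k = 0 the probability is trivially 1
    for k, x in enumerate(xs):  # k observations are already in the first sample
        r = s_t - s_k           # failures still outside the first sample
        if x == 1:
            # C(k+1, s_k+1) / C(k, s_k) = (k+1) / (s_k+1)
            # C(t-k-1, r-1) / C(t-k, r) = r / (t-k)
            pmf *= (k + 1) * r / ((s_k + 1) * (t - k))
            s_k += 1
        else:
            # C(k+1, s_k) / C(k, s_k)  = (k+1) / (k+1-s_k)
            # C(t-k-1, r) / C(t-k, r)  = (t-k-r) / (t-k)
            pmf *= (k + 1) * (t - k - r) / ((k + 1 - s_k) * (t - k))
        out.append(pmf)
    return out

def pmf_direct(xs):
    """Same quantities evaluated from the closed form, for comparison."""
    t, s_t = len(xs), sum(xs)
    out, s_k = [], 0
    for k, x in enumerate(xs, start=1):
        s_k += x
        out.append(comb(k, s_k) * comb(t - k, s_t - s_k) / comb(t, s_t))
    return out

sample = [0, 1, 1, 0, 1, 0, 0, 1]
recursive, direct = pmf_recursive(sample), pmf_direct(sample)
```

Each step of the recursive version costs a constant number of multiplications and divisions, whereas the direct version evaluates binomial coefficients whose cost grows with t.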

Consider a fixed-size sample containing t observations, of which st are failures.

Let sk be the number of failures observed in the subsample x1, . . . , xk, and for

By exploiting some combinatorics, we can compute ξk+1,t recursively from ξk,t. We make use of the following identities for the binomial coefficient:

With these, basic algebraic manipulation shows that:

ξk,t+1 =

Using these recursive formulations significantly decreases the processing time required to compute each value of Dk,t. Recall that Dk,t is defined as: