Statistical Classifiers - Detecting changes in high frequency data streams, with applications

2.2 Classification

2.2.1 Statistical Classifiers

We now introduce the classifiers which we will use later. The first are the Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA) classifiers, which are among the simplest methods suitable for use in both the streaming and non-streaming contexts. The second is the k-Nearest Neighbours (KNN) classifier, which is again a conceptually simple method, but one which induces a more complex decision boundary than LDA or QDA.

LDA and QDA: In both LDA and QDA, each of the C class conditional densities is modelled as a multivariate Gaussian distribution: f_i | c_i = j ∼ N(µ_j, Σ_j). The prior probability of a point belonging to class j can be easily estimated based on the frequency of the observed class labels:

\[ p(c_t = j) = \frac{1}{t-1} \sum_{i=1}^{t-1} I(c_i = j), \]

where I is the indicator function. Training this classifier then consists of learning the parameters µ_j, Σ_j for each of the C classes. In the non-streaming case, this can be done using standard maximum likelihood estimation techniques. However, we often encounter the problem that the number of observations per class is too small compared to the dimensionality of the feature vector to allow for accurate estimation. Therefore, LDA makes the assumption that all of the classes have equal covariance matrices, i.e. Σ_j = Σ. This allows the estimates to be pooled, and Σ can then be estimated using all available points, which may give better results in higher dimensional spaces. In contrast, QDA allows the covariance matrices to differ between classes, and estimates each one separately. Rather than choosing between the extremes of LDA and QDA, an alternative approach is regularised discriminant analysis, which allows them to be combined in a smooth manner [48].
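The difference between the pooled (LDA) and per-class (QDA) covariance estimates can be sketched in a few lines of pure Python. This is only an illustrative batch version: the function name `fit_gaussian_classes`, the list-of-lists data layout, and the maximum likelihood (divide-by-n) normalisation are assumptions made for this example rather than details taken from the text.

```python
from collections import defaultdict

def fit_gaussian_classes(features, labels):
    """Estimate priors, means, per-class (QDA) and pooled (LDA) covariances.

    features: list of equal-length lists of floats; labels: list of class ids.
    Uses maximum likelihood (divide-by-n) estimates throughout.
    """
    d = len(features[0])
    n = len(features)
    by_class = defaultdict(list)
    for f, c in zip(features, labels):
        by_class[c].append(f)

    priors, means, covs = {}, {}, {}
    pooled = [[0.0] * d for _ in range(d)]
    for c, pts in by_class.items():
        n_c = len(pts)
        priors[c] = n_c / n                      # relative class frequency
        mu = [sum(p[a] for p in pts) / n_c for a in range(d)]
        # Scatter matrix: sum of outer products of the centred points.
        scatter = [[sum((p[a] - mu[a]) * (p[b] - mu[b]) for p in pts)
                    for b in range(d)] for a in range(d)]
        means[c] = mu
        covs[c] = [[s / n_c for s in row] for row in scatter]  # QDA: one per class
        for a in range(d):
            for b in range(d):
                pooled[a][b] += scatter[a][b] / n              # LDA: shared matrix
    return priors, means, covs, pooled
```

The pooled matrix uses all n points, so it remains usable even when an individual class has too few observations for its own covariance estimate to be reliable.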

In the streaming case, estimation of the Gaussian parameters must take into account the possibility that they are changing over time. One approach to estimation is hence to weight the observations so that more recent feature vectors contribute more to the estimate. This is the approach taken by both [95] and [3], who provide recursive updating formulas which allow older observations to be progressively discounted. The recursive nature of their estimators means that old observations do not need to be stored in memory; whenever a new observation is received, the estimates of the relevant class conditional density can be quickly updated, and the observation discarded. This makes the resulting LDA and QDA classifiers ideal for high volume data streams where it is not possible to retain all observations in memory.
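The idea of recursive updating with discounting can be illustrated with a generic exponential forgetting scheme. The sketch below is not necessarily the set of formulas given in [95] or [3]; the class name `ForgettingGaussian` and the forgetting factor `lam` are assumptions introduced for this example.

```python
class ForgettingGaussian:
    """Running mean and covariance with exponential forgetting.

    lam in (0, 1]: lam = 1 weights all observations equally; smaller
    values discount older observations more quickly.
    """

    def __init__(self, dim, lam=0.95):
        self.lam = lam
        self.weight = 0.0                        # effective sample size
        self.mean = [0.0] * dim
        self.cov = [[0.0] * dim for _ in range(dim)]

    def update(self, x):
        """Fold in one observation in O(dim^2) time; x need not be stored."""
        self.weight = self.lam * self.weight + 1.0
        g = 1.0 / self.weight                    # weight given to the new point
        delta = [xi - mi for xi, mi in zip(x, self.mean)]
        self.mean = [mi + g * di for mi, di in zip(self.mean, delta)]
        # Weighted covariance update (maximum likelihood style normalisation).
        for a in range(len(x)):
            for b in range(len(x)):
                self.cov[a][b] = (1.0 - g) * (self.cov[a][b]
                                              + g * delta[a] * delta[b])
```

One such estimator would be kept per class (or, for LDA, the covariance part would be shared across classes), and only the estimator for the observed class label is updated when a new labelled feature vector arrives.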

Regardless of whether LDA or QDA is used, we end up with an estimate of the density f(f_t | c_t = j) for each of the C classes. This can then be used for classification based on Bayesian decision theory. Using Bayes' theorem, the probability of a feature vector belonging to class j is proportional to the product of the class density and the prior for class j, i.e.:

\[ p(c_t = j \mid f_t) \propto f(f_t \mid c_t = j)\, p(c_t = j). \]

Therefore, one intuitive decision rule is to assign f_t to the class which has the highest posterior probability given f_t. This minimises the expected number of misclassifications; more complex procedures which take asymmetric costs into account can also be defined, but this is tangential and so we will not pursue it further. The choice of whether to use LDA or QDA depends on several factors. QDA may give better performance, assuming that enough points are available from each class to allow accurate estimation of the covariance matrices. However, if insufficient points are available, then it may be advisable to combine the covariance matrices and use LDA.
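The maximum posterior decision rule can be sketched as follows. For brevity the example uses univariate Gaussian class densities, which is a simplification of the multivariate setting in the text; the function names and the dictionary layout are assumptions made for illustration.

```python
import math

def gaussian_log_density(x, mu, var):
    """Log density of a univariate N(mu, var) distribution at x."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mu) ** 2 / var)

def classify(x, params, priors):
    """Return the class with the largest posterior probability.

    params: {class: (mu, var)}, priors: {class: prior probability}.
    The score log prior + log density equals the log posterior up to an
    additive constant shared by all classes, so the argmax is the same.
    """
    scores = {c: math.log(priors[c]) + gaussian_log_density(x, mu, var)
              for c, (mu, var) in params.items()}
    return max(scores, key=scores.get)
```

With equal variances across classes this mimics LDA, and with class-specific variances it mimics QDA. Working on the log scale avoids numerical underflow when densities are very small.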

KNN: Both LDA and QDA assume that the class conditional densities are Gaussian. Other statistical classifiers do not make this assumption, and can hence induce more complex decision boundaries. A simple example of such a classifier is the k-nearest neighbours (KNN) method [35]. Suppose that t − 1 feature vectors have been observed, along with their class labels. The KNN decision rule classifies a new feature vector f_t by assigning it to the class that most of its neighbours belong to. More formally, let d(f_t, f_i) be a metric on the space of feature vectors which measures the distance between any two of them. Under this metric, let f_(1), . . . , f_(k) be the k observations in (f_1, . . . , f_{t−1}) which are closest to f_t. Here k is a free parameter which is chosen by the user. Then, f_t is assigned to the class which has the most representatives amongst {f_(1), . . . , f_(k)}. If there is a tie, then a class is chosen at random.
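A minimal brute-force sketch of this rule is given below. It assumes the Euclidean distance as the metric d, which is only one possible choice, and the function name `knn_classify` is introduced for this example.

```python
import math
import random
from collections import Counter

def knn_classify(query, features, labels, k, rng=random):
    """Classify `query` by majority vote among its k nearest neighbours.

    Uses Euclidean distance; ties between classes are broken at random.
    """
    # Indices of the stored observations, ordered by distance to the query.
    order = sorted(range(len(features)),
                   key=lambda i: math.dist(query, features[i]))
    votes = Counter(labels[i] for i in order[:k])
    top = max(votes.values())
    winners = [c for c, v in votes.items() if v == top]
    return rng.choice(winners)
```

Sorting all stored observations costs O(t log t) per query, which illustrates why the more efficient search structures mentioned in the literature become relevant for large t.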

This classification rule depends on the choice of k, and various methods have been proposed for choosing this, such as [60, 76]. Unlike LDA and QDA, KNN does not have an obvious recursive formulation, and finding the nearest neighbours of a given feature vector may be computationally expensive. Several methods for making the algorithm more efficient have been proposed, such as storing observations in specialised data structures [78], approximate search procedures [6], or data reduction techniques where the observations are reduced to a small number of prototypes which capture much of the structure of the original data [26]. The problem of weighting observations so that more emphasis can be given to more recent feature vectors is considered in [97]. However, we will not explore these implementation issues further since our goal is not to undertake a study of KNN itself; rather, we are using it as a simple example of a flexible classification rule.

Chapter 3 Nonparametric Change Point Models

The following two chapters present the methods we have developed for monitoring data streams for changes in a nonparametric manner where nothing is assumed about the stream distribution. Although the word ‘nonparametric’ has several slightly different meanings within the statistics literature, our use of the word is meant to imply statistical methods which can maintain some basic level of performance regardless of the (potentially unknown) distribution of the stream to which they are applied. For example, nonparametric hypothesis tests such as the Mann-Whitney test have a null distribution which is independent of the parent data distribution. Similarly, in the context of change detection, we require nonparametric methods to have an ARL₀ that does not depend on the stream distribution. We will therefore use the terms ‘nonparametric’ and ‘distribution-free’ interchangeably.
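One way to see why a rank-based test such as the Mann-Whitney is distribution-free is that its statistic depends on the data only through the ordering of the observations, so any strictly increasing transformation of the data leaves it unchanged. The short sketch below checks this numerically; the function name `mann_whitney_u` and the sample values are chosen purely for illustration.

```python
import math

def mann_whitney_u(x, y):
    """U statistic: number of pairs (x_i, y_j) with x_i > y_j (ties count 1/2)."""
    return sum(1.0 if xi > yj else 0.5 if xi == yj else 0.0
               for xi in x for yj in y)

x = [0.1, 1.5, -0.3]
y = [0.7, 2.0, -1.0, 0.4]

u_raw = mann_whitney_u(x, y)
# exp is strictly increasing, so the ordering of all observations is preserved.
u_exp = mann_whitney_u([math.exp(v) for v in x], [math.exp(v) for v in y])
print(u_raw == u_exp)
```

Because the statistic is unchanged, its distribution under the null hypothesis of identical continuous distributions is the same whatever that common distribution is.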

This chapter focuses on the Change Point Model (CPM), which is a general framework for adapting traditional statistical tests to the streaming change detection problem. This was originally introduced by [71] in a parametric context; our contribution is to extend it for use in nonparametric monitoring. Section 3.1 describes the nonparametric change detection problem in more detail, and Section 3.2 summarises the literature that it has generated. A key finding from this literature review is that, while methods for nonparametric monitoring of the location parameter/mean of a stream are common, the problem of detecting more general changes such as those involving the scale/variance is much less widely studied, and this is hence our focus.

Section 3.3 introduces the general CPM framework, and Section 3.4 presents specific nonparametric test statistics which can be integrated into this framework, with a focus on detecting changes in higher order moments. In Section 3.4.5 we discuss how the nonparametric CPM can be implemented in a computationally efficient manner which makes it suitable for use with streaming data, and Section 3.5 investigates its performance using both synthetic and real data.

3.1 Overview

Recall that under the assumption that the stream of observations x₁, x₂, . . . contains a single change point, its distribution can be written as:

\[ X_i \sim \begin{cases} f_0 & \text{if } i < \tau, \\ f_1 & \text{if } i \geq \tau. \end{cases} \]
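A stream following this single change point model can be simulated with a few lines of Python. The sketch below is illustrative only: the function name `simulate_stream`, the choice of Gaussian pre- and post-change distributions, and the values n = 1000 and τ = 501 are all assumptions made for this example.

```python
import random

def simulate_stream(n, tau, pre, post, seed=0):
    """Draw x_1, ..., x_n with x_i ~ pre for i < tau and x_i ~ post for i >= tau.

    pre and post are callables taking a random.Random instance and returning
    one draw; indices are 1-based to match the text.
    """
    rng = random.Random(seed)
    return [pre(rng) if i < tau else post(rng) for i in range(1, n + 1)]

# Hypothetical example: a scale change from N(0, 1) to N(0, 3^2) at tau = 501.
stream = simulate_stream(
    n=1000, tau=501,
    pre=lambda rng: rng.gauss(0.0, 1.0),
    post=lambda rng: rng.gauss(0.0, 3.0),
)
```

In this example the mean is the same before and after τ, so a method which monitors only the location of the stream would have little to detect.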

Most traditional approaches to sequential change detection, such as those reviewed in Section 2.1.2, assume that the distributional form of f₀ is known before and after the change, with only the parameter vector θ₀ being unknown. However, this assumption rarely holds in streaming applications; either there may be no prior knowledge of the stream distribution, or assumptions made about this distribution may be incorrect. Several authors [25, 79, 81, 80] have investigated the performance of parametric change detection algorithms when the distribution of the stream is incorrectly specified, and find that even small misspecifications can have very large effects on the false positive rate, causing the realised value of the ARL₀ to deviate significantly from what is expected. As we discussed in Section 2.1.2, this is extremely undesirable since having a bound on the rate of false positives is one of the ways in which we can assess the significance of any discovered change points.
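The sensitivity of the ARL₀ to distributional misspecification can be illustrated with a small Monte Carlo sketch. It uses a simple Shewhart-type rule (flag any observation with |x| > 3), which is chosen here only for simplicity and is not one of the methods studied in the cited papers; the heavy-tailed alternative (a Student-t distribution with 3 degrees of freedom, rescaled to unit variance) is likewise an assumption made for this example. For independent observations the ARL₀ of such a rule is the reciprocal of the per-observation false alarm probability.

```python
import math
import random

def false_alarm_rate(draw, threshold=3.0, n=200_000, seed=1):
    """Fraction of in-control observations with |x| > threshold."""
    rng = random.Random(seed)
    return sum(abs(draw(rng)) > threshold for _ in range(n)) / n

def gaussian(rng):
    return rng.gauss(0.0, 1.0)

def scaled_t3(rng):
    """Student-t draw with 3 degrees of freedom, rescaled to unit variance."""
    z = rng.gauss(0.0, 1.0)
    chi2 = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(3))
    return (z / math.sqrt(chi2 / 3.0)) / math.sqrt(3.0)   # Var(t_3) = 3

p_gauss = false_alarm_rate(gaussian)
p_heavy = false_alarm_rate(scaled_t3)
print("approx. ARL0, Gaussian stream:    ", 1.0 / p_gauss)
print("approx. ARL0, heavy-tailed stream:", 1.0 / p_heavy)
```

Both streams have zero mean and unit variance, yet the threshold calibrated under the Gaussian assumption raises false alarms several times more often on the heavy-tailed stream, so the realised ARL₀ is far shorter than intended.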

There is hence a need for nonparametric change detection methods which are able to maintain a specified level of performance, such as the false alarm rate, regardless of the true distribution of the stream. This chapter proposes a framework for performing nonparametric change detection by adapting traditional nonparametric hypothesis tests to the streaming context. Unlike most existing nonparametric approaches, we do not restrict attention to simple changes in the stream mean.

Our approach is based on a generalisation of the change point model introduced for the Gaussian distribution by [72] in order to adapt the likelihood-based testing procedures previously discussed in Section 2.1.1 to the streaming problem. This framework was recently extended by [69] for the purpose of nonparametric detection of changes in the location parameter of a sequence of random variables. However, their work does not satisfy the O(1) computational and memory complexity requirements for the processing of data streams. We extend their work in three ways. First, we extend the CPM framework to allow detection of changes in both the scale parameter and in higher order moments, a problem which has not received sufficient attention in the literature. Second, we introduce the idea of stream discretisation, which allows the test statistics used in both their method and ours to be computed in a fast manner, thereby facilitating deployment of these techniques on high frequency streams.
