Efficient Streaming Classification Methods

(1)

Efficient Streaming Classification Methods

Niall M. Adams1, Nicos G. Pavlidis2, Christoforos

Anagnostopoulos3_{, Dimitris K. Tasoulis}1

1_{Department of Mathematics} 2_{Institute for Mathematical Sciences}

Imperial College London

3_{Statistical Laboratory}

University of Cambridge

(2)

Streaming Data I

Adata stream consists of a sequence of data items arriving athigh frequency, generated by a process that is subject tounknown changes(generically calleddrift).

Many examples, often financial, include:

I credit card transaction data (6000/s for Barclaycard Europe)

I stock market tick data

I computer network traffic

The character of streaming data calls for algorithms that are

I efficient, one-pass- to handle frequency

(4)

Streaming Data II

A simple formulation of streaming data is a sequence of

p-dimensional vectors,arriving at regular intervals

. . . , xt−2,xt−1,xt

wherexi ∈Rp.

In a real application, with credit card transactions, the variables include: transaction value and time, location, card response codes.

Since we are concerned withK-class classification, need to

accommodate a class label. Thus, at timet we can conceptualise

thelabel-augmented streaming vectoryt = (Ct,xt)0, where

Ct ∈ {c1,c2, . . . ,ck}.

However, in real applicationsCt arrives at some times >t, and

the streaming classification problem is concerned with predicting

(5)

Streaming Data and Classification

Implicit assumption: single vector arrives at any time.

Assumption common in literature, which we use, is that the data stream is structured as

. . . , (Ct3,xt2),(Ct2,xt1),(Ct1,xt),

That is, the class-label arrives at the next tick.

We will treat the streaming classification problem as: predict the

class ofxt, and adaptively (and efficiently) update the model at

timext+1, when Ct arrives.

This is naive, but the problem is challenging even formulated thus. Will return to label timing later.

(6)

Streaming Data and Classification

Can use the usual formulation for classification

P(Ct|xt) =

p(xt|Ct)P(Ct)

p(xt)

(1) and construct either

I Sampling paradigm classifiers, focusing on class conditional

densities

I Diagnostic paradigm classifiers, directly seeking the posterior

probabilities of class membership

Note the we will usually restrict attention to theK = 2 class

problem.

Eq.1 also illustrates where changes can happen: the prior,P(Ct),

(7)

Notional drift types

1. Jump 0 200 400 600 800 10 15 20 25 (in mean) 2. Gradual change 0 200 400 600 800 12 14 16 18 20 22

(in mean and variance) Trend, seasonality etc.

(8)

Drift: Real Examples

(9)

(10)

Methods

A variety of approaches for streaming classification have been proposed, including

I Data selection approaches with standard classifiers. Most

commonly, use of a fixedor variable size window of most

recent data. But how to determine size in either case?

I Ensemble methods. One example is the adaptive weighting of

ensemble members changing over time. This category also

includes learning with expert feedback.

Example: Window method happens in consumer credit scoring -not for efficiency, but because the populations have changed.

(11)

Forgetting-factor methods

We are interested in modifying standard classifiers to incorporate

forgetting factors- parameters that control the contribution of old data to parameter estimation.

We adapt ideas fromadaptive filter theory, to tune the forgetting

factor automatically.

Simplest to illustrate with an example: consider computing the

mean vector and covariance matrix of a sequence ofn multivariate

vectors. Standard recursion

mt =mt−1+xt, µˆt =mt/t, m0 = 0

(12)

For vectors coming from a non-stationary system, simple averaging of this type is biased.

Knowing precise dynamics of the system gives chance to construct optimal filter. However, not possible with streaming data (though

interestinglinksbetween adaptive and optimal filtering).

Incorporating aforgetting factor,λ∈(0,1], in the previous recursion

nt =λnt−1+ 1, n0 = 0

mt =λmt−1+xt, µˆt=mt/nt

St =λSt−1+ (xt−µˆt)(xt−µˆt)T, Σˆt =St/nt

λdown-weights old information more smoothly than a window.

Denote theseforgettingestimates as ˆµλ_t, ˆΣλ_t, etc.

nt is theeffective sample size or memory. λ= 1 gives offline

solutions, andnt=t. For fixed λ <1 memory size tends to

(13)

Setting

λ

Two choices forλ, fixed value, or variable forgetting,λt. Fixed

forgetting: set by trial and error, change detection, etc (cf. window).

Variable forgetting: ideas from adaptive filter theory suggest

tuningλt according to a local stochastic gradient descent rule

λt =λt−1−α

∂ξ_t2

∂λ, ξt: residual error at timet,α small (2)

Efficient updating rules can implemented via results from numerical linear algebra (O(p2)).

Performance very sensitive toα. Very careful implementation

required, including bracket onλt and selection of learning rate α.

Framework provides an adaptive means forbalancing old and new

(14)

Tracking illustrations

Does fixed forgetting respond to an abrupt change?

(15)

(16)

Adaptive-Forgetting Classifiers

Our recent work involves incorporating theseself-tuning forgetting

factors in I Parametric I Covariance-matrix based I Logistic regression I non-parametric I Multi-layer perceptron

(sampling paradigm) (diagnostic paradigm)

(17)

Streaming Quadratic Discriminant Analysis

QDA can be motivated by reasoning about relationship of between and within group covariances, or assuming class conditional densities are Gaussian.

For static data, latter assumption yields discriminant function for

jth class gj(x) = log(P(Cj))− 1 2log(|Σj|)− 1 2(x−µj) T_Σ−1 j (x−µi) (3)

whereµj and Σj are mean vector and covariance matrix,

respectively, for classj.

Frequently, plug-in ML estimates for unknown parameters: µj, Σj,

P(Cj).

(18)

Results in CA’s thesis show that the AF framework above can be

generalised, using likelihood arguments, to thewhole exponential

family. Thus, the priors,P(Ct) can also be handled.

The approach is then:

I Forgetting factor for prior (binomial/multinomial)

I Forgetting factor for each class

The class ofxt is predicted when it arrives. Immediately thereafter,

the class-label arrives, and thetrue class parameters are updated.

This will be problematic for largeK or very imbalanced classes:

few updates complicates the interpretation of the update equation forλt (Eq. 2).

(19)

Streaming LDA

The discriminant function in Eq.3 reduces to a linear classifier under various constraints on the covariance matrices (or mean vectors).

We consider the case of a common covariance matrix:

Σ1 = Σ2 =. . .= ΣK = Σ. Again, we will substitute streaming

estimatesµλ_j, Σλ.

Have a couple of implementations options. One approach is

I Forgetting factor for prior

I Forgetting factor for each class

(20)

Streaming logistic regression

Diagnostic methods do not have to handle prior probabilities

separately, easing interpretation issues with updatingλ.

In two class problems, logistic regression models the log-odds ratio as a linear function log p(x|C1) p(x|C2) =βTx

yielding simple forms for the posterior probabilities.

With static data, under mixture sampling, the likelihood function is simple, but requires non-linear optimisation.

The idea will be as before, to incorporate a forgetting factor in the opimisation criterion, and tune it according to the gradient.

(21)

Thus, we incorporate the (exponential) forgetting factor into the log-likelihood l(βt|y1...t) = t X i=1 λt−il(βt|yt) =l(βt|yt) +λl(βt|y1...(t−1)) As before,λ∈(0,1].

We update the parameter vector sequentially with

βt+1=βt−α∇βl(βt|y1...t)

where∇_βl(βt|y1...t) is the gradient of the likelihood w.r.t. β, andα

isanotherlearning rate, as before.

An online version of backpropagation with momentum.

Calculations in Pavlidis et al (2010b) show that this computation,

and the corresponding sequential optimisation of the forgetting factor, analogous to Eq. 2, can be made efficiently online via incremental update.

(22)

Streaming Multilayer perceptrons

We have extended the AF approach further, noting that the multilayer perceptron (MLP) can be regarded as a generalisation of logistic regression. This allows the opportunity for much more

flexiblemodels.

This introduces a large number of tunable parameters, and a

model parameter -the number of hidden nodes.

NPs work shows that the same modification to the optimisation criterion can yield an efficient updating mechanism. Can avoid explicit computation and storage of Hessian.

Extensive experimental analysis suggests that recursive

computation of the derivative of the optimisation criterion w.r.t. λ

becomes unstable. We thus prefer to approximate this gradient using a finite difference approach.

(23)

The forgetting factor, introduced through the optimisation criterion as with logistic regression, can be adapted as before.

A regularisation term, with parameterγ, can also be incorporated,

yielding an optimisation criterion like

t X

i=1

λt−il(w|yi) +γ||w||q

This raises the question of the relationship between forgetting and model complexity, which can (in principle) be tuned. More on this later.

We restrict attention to an MLP with a single hidden layer. Initial

experiments make us favourq= 1 (analogous to the LASSO), over

(24)

Illustrations

(25)

Logistic regression, forgetting factor, jump + drift environment. Evidence of monitoring capacity of AF?

(26)

(27)

(28)

(29)

(30)

(31)

(32)

Performance assessment

Assessing the performance of streaming classifiers is complicated, due to the changing nature of data streams. For simulated

assessment(which is crucial), we proceed as follows

I The data stream can have jumps,drift, or both.

I Data generation from an appropriate (usually simple)

stochastic mechanism (Gaussian, mixture)

I Generate extra points, at each tick, to facilitate comparison

with ground truth

Then compute performance measures, such as error rate, at each tick. For comparison between classifiers, can summarise average across time also.

Real data moredifficult - usually rely on time-averaged pointwise

(33)

Simulation Conclusions

We have conductedextensive simulation experiments (and PD and

real data!), exploring and comparing these AF methods with comparable approaches in the literature.

I GENERAL: AF methods

I handle drift/jumps/both automatically

I are competitive with comparable methods (PA,PA-II, OLDA, PERC), but more general

I generally recover better from change

I exhibit performance degradation with speed of drift, and dimension (as do other methods).

I QDA:

I Can suffer from problem of many parameters I but more responsive to change

(34)

Some general comments

So, we have a collection of different classifiers for handling the simple streaming classification problem.

The methods are efficient (typicallyO(p2_{)) per tick. While this is}

slow for information retrieval, it is appropriate for learning. Testing algorithms is difficult. We much prefer the simulation approach above, than the use of PD data sets such as STAGGER, which are not particularly challenging or interesting.

(35)

Issues

At least three interesting things to explore

I optimisation parameters (learning rates, bracket limits)

I Complexity and forgetting in streams

(36)

Optimisation parameters

Let us revisit the updating mechanism forλ

λt=λt−1−α

∂ξ2_t ∂λ

This is normally implemented with a bracket, to constrain the

movement ofλ λt=λt−1−α ∂ξ2_t ∂λ λmax λmin

(37)

Learning rates and bracket values interact

Error rates, logistic, gradual drift - (learning rate=0.01,0.1,0.5)

Seems clear brackets should not be fixed.

We have had some success with using the RPROP algorithm to

(38)

Complexity and Forgetting

It is interesting to speculate about the relationship between forgetting (discarding old data) and complexity control (managing flexibility).

It is easy to conceive problems where a simple classifier has to change, where a more complex classifier can simply learn

everything. Typically, these would be based on data in a bounded region.

(39)

With the MLP, we have experimented with tuningγ (complexity)

the same way asλ(forgetting). Some interesting initial results.

Drift and jumps, continuous XOR variant: 4 network architectures, complexity? 4 networks? correlation between tuned parameters

Methods exhibit good performance. Correlation appreciable only at jumps. Forget everything, suppress complexity?

(40)

label timing

In real problems, the label is delayed in complicated ways. Examples

I credit scoring: under standard definitions, good risk customers cannot be so flagged until near the end of a loan term. In contrast, bad risk customers are so labelled anytime during the loan term.

I fraud detection: fraudulent transactions (that pass filters)

identified months after event. In contrast, legitimate

transactions are never explicitly flagged – status presumed since not flagged as fraud

One approach is to use semi-supervised approaches. For example,

incorporate the unlabelled vector to the predicted class. Reject

(41)

Another (of many) timing issue is that more than one vector can arrive at a time (eg. fraud)

The QDA/LDA streaming models admit efficient batch block

updating, via rankk update. Thus, multiple vectors can be

incorporated at one time.

Experiments with credit application data suggested that this approach works no better (in terms of error rate) than a random ordering (within tick) and sequential updating.

Of course, this depends very much on the underlying dynamics of the data.

(42)

Conclusion

I AF-classifier works well for streaming problems (and indeed

for large data sets).

I Varying forgetting factor does respond to change, and gives

something to monitor, though sequential monitoring difficult, because gradient-like quantities very volatile.

I Interesting relationship between forgetting and complexity.

I Many of the assumptions are restrictive compared to real

(43)

Future Work

I THEORETICAL ANALYSIS- difficult, perhaps adapt ideas from stochastic approximation

I BAYESIAN FORMULATION- based on the power-prior, can alleviate (or at least shift) some difficulties with the bracket

onλ

I HYBRID MODELS- exploit links between optimal filters

I TIMING ISSUES - motivated by real problems.

(44)

References

I Adams, N.M., Tasoulis, D.K., Anagnostopoulos, C. and Hand, D.J. ’Temporally-adaptive linear classification for handling population drift in credit scoring’, In Lechevallier, Y. And Saporta. (eds), COMPSTAT2010, Proceedings of the 19th International Conference on Computational Statistics, 2010, Springer, 167-176.

I Anagnostopoulos, C. ’A statistical framework for streaming data analysis’, PhD Thesis, Department of Mathematics, Imperial College London, 2010.

I Anagnostopoulos, C., Tasoulis, D.K., Adams, N.M. and Hand, D.J., ’Streaming Gaussian classification using recursive maximum likelihood with adaptive forgetting’, Machine Learning, (2010), under review. I Anagnostopoulos, C., Tasoulis, D.K., Adams, N.M. and Hand, D.J. ’Temporally adaptive estimation of

logistic classifiers on data streams’. Adv. Data An. Classif.,3(3) (2009),243-261. I Haykin, S. ’Adaptive filter theory’, 4th edition, Prentice Hall (1996).

I Kelly, M.G., Hand, D.J. and Adams, N.M., ’The impact of changing populations on classifier performance’ in ’KDD 99, Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining’, Chaudhuri, S. and Madigan, D. ed(AAAI), 1999, 367371.

I Pavlidis, N.G., Adams, N.M and Hand, D.J., ’Adaptive Online Logistic Regression’, Pattern Recogn., (2010) under review.

I Pavlidis, N.G., Tasoulis, D.K., Adams, N.M. and Hand, D.J. ’Adaptive consumer credit classification’, Europ. J. Finance, (2010), under review.

I Weston, D.J., Anagnostopoulos, C., Tasoulis, D.K., Adams, N.M. and Hand, D.J. ’Handling missing feature values for a streaming quadratic discriminant classifier’, Data Mining and Knowl. Disc., (2010), under review.

Efficient Streaming Classification Methods