Efficient Streaming Classification Methods
Niall M. Adams1, Nicos G. Pavlidis2, Christoforos
Anagnostopoulos3, Dimitris K. Tasoulis1
1Department of Mathematics 2Institute for Mathematical Sciences
Imperial College London
3Statistical Laboratory
University of Cambridge
Contents
I Streaming data and classification
I Our approach: incorporate self-tuning forgetting factors
I Illustrations and examples
I Issues (current work in progress)
I freeparameters
I non-linearity, regularisation, over-fitting I label timing
I Conclusion
Research support by the EPSRC/BAe funded ALADDIN project:
Streaming Data I
Adata stream consists of a sequence of data items arriving athigh frequency, generated by a process that is subject tounknown changes(generically calleddrift).
Many examples, often financial, include:
I credit card transaction data (6000/s for Barclaycard Europe)
I stock market tick data
I computer network traffic
The character of streaming data calls for algorithms that are
I efficient, one-pass- to handle frequency
Streaming Data II
A simple formulation of streaming data is a sequence of
p-dimensional vectors,arriving at regular intervals
. . . , xt−2,xt−1,xt
wherexi ∈Rp.
In a real application, with credit card transactions, the variables include: transaction value and time, location, card response codes.
Since we are concerned withK-class classification, need to
accommodate a class label. Thus, at timet we can conceptualise
thelabel-augmented streaming vectoryt = (Ct,xt)0, where
Ct ∈ {c1,c2, . . . ,ck}.
However, in real applicationsCt arrives at some times >t, and
the streaming classification problem is concerned with predicting
Streaming Data and Classification
Implicit assumption: single vector arrives at any time.
Assumption common in literature, which we use, is that the data stream is structured as
. . . , (Ct3,xt2),(Ct2,xt1),(Ct1,xt),
That is, the class-label arrives at the next tick.
We will treat the streaming classification problem as: predict the
class ofxt, and adaptively (and efficiently) update the model at
timext+1, when Ct arrives.
This is naive, but the problem is challenging even formulated thus. Will return to label timing later.
Streaming Data and Classification
Can use the usual formulation for classification
P(Ct|xt) =
p(xt|Ct)P(Ct)
p(xt)
(1) and construct either
I Sampling paradigm classifiers, focusing on class conditional
densities
I Diagnostic paradigm classifiers, directly seeking the posterior
probabilities of class membership
Note the we will usually restrict attention to theK = 2 class
problem.
Eq.1 also illustrates where changes can happen: the prior,P(Ct),
Notional drift types
1. Jump 0 200 400 600 800 10 15 20 25 (in mean) 2. Gradual change 0 200 400 600 800 12 14 16 18 20 22(in mean and variance) Trend, seasonality etc.
Drift: Real Examples
Methods
A variety of approaches for streaming classification have been proposed, including
I Data selection approaches with standard classifiers. Most
commonly, use of a fixedor variable size window of most
recent data. But how to determine size in either case?
I Ensemble methods. One example is the adaptive weighting of
ensemble members changing over time. This category also
includes learning with expert feedback.
Example: Window method happens in consumer credit scoring -not for efficiency, but because the populations have changed.
Forgetting-factor methods
We are interested in modifying standard classifiers to incorporate
forgetting factors- parameters that control the contribution of old data to parameter estimation.
We adapt ideas fromadaptive filter theory, to tune the forgetting
factor automatically.
Simplest to illustrate with an example: consider computing the
mean vector and covariance matrix of a sequence ofn multivariate
vectors. Standard recursion
mt =mt−1+xt, µˆt =mt/t, m0 = 0
For vectors coming from a non-stationary system, simple averaging of this type is biased.
Knowing precise dynamics of the system gives chance to construct optimal filter. However, not possible with streaming data (though
interestinglinksbetween adaptive and optimal filtering).
Incorporating aforgetting factor,λ∈(0,1], in the previous recursion
nt =λnt−1+ 1, n0 = 0
mt =λmt−1+xt, µˆt=mt/nt
St =λSt−1+ (xt−µˆt)(xt−µˆt)T, Σˆt =St/nt
λdown-weights old information more smoothly than a window.
Denote theseforgettingestimates as ˆµλt, ˆΣλt, etc.
nt is theeffective sample size or memory. λ= 1 gives offline
solutions, andnt=t. For fixed λ <1 memory size tends to
Setting
λ
Two choices forλ, fixed value, or variable forgetting,λt. Fixed
forgetting: set by trial and error, change detection, etc (cf. window).
Variable forgetting: ideas from adaptive filter theory suggest
tuningλt according to a local stochastic gradient descent rule
λt =λt−1−α
∂ξt2
∂λ, ξt: residual error at timet,α small (2)
Efficient updating rules can implemented via results from numerical linear algebra (O(p2)).
Performance very sensitive toα. Very careful implementation
required, including bracket onλt and selection of learning rate α.
Framework provides an adaptive means forbalancing old and new
Tracking illustrations
Does fixed forgetting respond to an abrupt change?
Adaptive-Forgetting Classifiers
Our recent work involves incorporating theseself-tuning forgetting
factors in I Parametric I Covariance-matrix based I Logistic regression I non-parametric I Multi-layer perceptron
(sampling paradigm) (diagnostic paradigm)
Streaming Quadratic Discriminant Analysis
QDA can be motivated by reasoning about relationship of between and within group covariances, or assuming class conditional densities are Gaussian.
For static data, latter assumption yields discriminant function for
jth class gj(x) = log(P(Cj))− 1 2log(|Σj|)− 1 2(x−µj) TΣ−1 j (x−µi) (3)
whereµj and Σj are mean vector and covariance matrix,
respectively, for classj.
Frequently, plug-in ML estimates for unknown parameters: µj, Σj,
P(Cj).
Results in CA’s thesis show that the AF framework above can be
generalised, using likelihood arguments, to thewhole exponential
family. Thus, the priors,P(Ct) can also be handled.
The approach is then:
I Forgetting factor for prior (binomial/multinomial)
I Forgetting factor for each class
The class ofxt is predicted when it arrives. Immediately thereafter,
the class-label arrives, and thetrue class parameters are updated.
This will be problematic for largeK or very imbalanced classes:
few updates complicates the interpretation of the update equation forλt (Eq. 2).
Streaming LDA
The discriminant function in Eq.3 reduces to a linear classifier under various constraints on the covariance matrices (or mean vectors).
We consider the case of a common covariance matrix:
Σ1 = Σ2 =. . .= ΣK = Σ. Again, we will substitute streaming
estimatesµλj, Σλ.
Have a couple of implementations options. One approach is
I Forgetting factor for prior
I Forgetting factor for each class
Streaming logistic regression
Diagnostic methods do not have to handle prior probabilities
separately, easing interpretation issues with updatingλ.
In two class problems, logistic regression models the log-odds ratio as a linear function log p(x|C1) p(x|C2) =βTx
yielding simple forms for the posterior probabilities.
With static data, under mixture sampling, the likelihood function is simple, but requires non-linear optimisation.
The idea will be as before, to incorporate a forgetting factor in the opimisation criterion, and tune it according to the gradient.
Thus, we incorporate the (exponential) forgetting factor into the log-likelihood l(βt|y1...t) = t X i=1 λt−il(βt|yt) =l(βt|yt) +λl(βt|y1...(t−1)) As before,λ∈(0,1].
We update the parameter vector sequentially with
βt+1=βt−α∇βl(βt|y1...t)
where∇βl(βt|y1...t) is the gradient of the likelihood w.r.t. β, andα
isanotherlearning rate, as before.
An online version of backpropagation with momentum.
Calculations in Pavlidis et al (2010b) show that this computation,
and the corresponding sequential optimisation of the forgetting factor, analogous to Eq. 2, can be made efficiently online via incremental update.
Streaming Multilayer perceptrons
We have extended the AF approach further, noting that the multilayer perceptron (MLP) can be regarded as a generalisation of logistic regression. This allows the opportunity for much more
flexiblemodels.
This introduces a large number of tunable parameters, and a
model parameter -the number of hidden nodes.
NPs work shows that the same modification to the optimisation criterion can yield an efficient updating mechanism. Can avoid explicit computation and storage of Hessian.
Extensive experimental analysis suggests that recursive
computation of the derivative of the optimisation criterion w.r.t. λ
becomes unstable. We thus prefer to approximate this gradient using a finite difference approach.
The forgetting factor, introduced through the optimisation criterion as with logistic regression, can be adapted as before.
A regularisation term, with parameterγ, can also be incorporated,
yielding an optimisation criterion like
t X
i=1
λt−il(w|yi) +γ||w||q
This raises the question of the relationship between forgetting and model complexity, which can (in principle) be tuned. More on this later.
We restrict attention to an MLP with a single hidden layer. Initial
experiments make us favourq= 1 (analogous to the LASSO), over
Illustrations
Logistic regression, forgetting factor, jump + drift environment. Evidence of monitoring capacity of AF?
Performance assessment
Assessing the performance of streaming classifiers is complicated, due to the changing nature of data streams. For simulated
assessment(which is crucial), we proceed as follows
I The data stream can have jumps,drift, or both.
I Data generation from an appropriate (usually simple)
stochastic mechanism (Gaussian, mixture)
I Generate extra points, at each tick, to facilitate comparison
with ground truth
Then compute performance measures, such as error rate, at each tick. For comparison between classifiers, can summarise average across time also.
Real data moredifficult - usually rely on time-averaged pointwise
Simulation Conclusions
We have conductedextensive simulation experiments (and PD and
real data!), exploring and comparing these AF methods with comparable approaches in the literature.
I GENERAL: AF methods
I handle drift/jumps/both automatically
I are competitive with comparable methods (PA,PA-II, OLDA, PERC), but more general
I generally recover better from change
I exhibit performance degradation with speed of drift, and dimension (as do other methods).
I QDA:
I Can suffer from problem of many parameters I but more responsive to change
Some general comments
So, we have a collection of different classifiers for handling the simple streaming classification problem.
The methods are efficient (typicallyO(p2)) per tick. While this is
slow for information retrieval, it is appropriate for learning. Testing algorithms is difficult. We much prefer the simulation approach above, than the use of PD data sets such as STAGGER, which are not particularly challenging or interesting.
Issues
At least three interesting things to explore
I optimisation parameters (learning rates, bracket limits)
I Complexity and forgetting in streams
Optimisation parameters
Let us revisit the updating mechanism forλ
λt=λt−1−α
∂ξ2t ∂λ
This is normally implemented with a bracket, to constrain the
movement ofλ λt=λt−1−α ∂ξ2t ∂λ λmax λmin
Learning rates and bracket values interact
Error rates, logistic, gradual drift - (learning rate=0.01,0.1,0.5)
Seems clear brackets should not be fixed.
We have had some success with using the RPROP algorithm to
Complexity and Forgetting
It is interesting to speculate about the relationship between forgetting (discarding old data) and complexity control (managing flexibility).
It is easy to conceive problems where a simple classifier has to change, where a more complex classifier can simply learn
everything. Typically, these would be based on data in a bounded region.
With the MLP, we have experimented with tuningγ (complexity)
the same way asλ(forgetting). Some interesting initial results.
Drift and jumps, continuous XOR variant: 4 network architectures, complexity? 4 networks? correlation between tuned parameters
Methods exhibit good performance. Correlation appreciable only at jumps. Forget everything, suppress complexity?
label timing
In real problems, the label is delayed in complicated ways. Examples
I credit scoring: under standard definitions, good risk customers cannot be so flagged until near the end of a loan term. In contrast, bad risk customers are so labelled anytime during the loan term.
I fraud detection: fraudulent transactions (that pass filters)
identified months after event. In contrast, legitimate
transactions are never explicitly flagged – status presumed since not flagged as fraud
One approach is to use semi-supervised approaches. For example,
incorporate the unlabelled vector to the predicted class. Reject
Another (of many) timing issue is that more than one vector can arrive at a time (eg. fraud)
The QDA/LDA streaming models admit efficient batch block
updating, via rankk update. Thus, multiple vectors can be
incorporated at one time.
Experiments with credit application data suggested that this approach works no better (in terms of error rate) than a random ordering (within tick) and sequential updating.
Of course, this depends very much on the underlying dynamics of the data.
Conclusion
I AF-classifier works well for streaming problems (and indeed
for large data sets).
I Varying forgetting factor does respond to change, and gives
something to monitor, though sequential monitoring difficult, because gradient-like quantities very volatile.
I Interesting relationship between forgetting and complexity.
I Many of the assumptions are restrictive compared to real
Future Work
I THEORETICAL ANALYSIS- difficult, perhaps adapt ideas from stochastic approximation
I BAYESIAN FORMULATION- based on the power-prior, can alleviate (or at least shift) some difficulties with the bracket
onλ
I HYBRID MODELS- exploit links between optimal filters
I TIMING ISSUES - motivated by real problems.
References
I Adams, N.M., Tasoulis, D.K., Anagnostopoulos, C. and Hand, D.J. ’Temporally-adaptive linear classification for handling population drift in credit scoring’, In Lechevallier, Y. And Saporta. (eds), COMPSTAT2010, Proceedings of the 19th International Conference on Computational Statistics, 2010, Springer, 167-176.
I Anagnostopoulos, C. ’A statistical framework for streaming data analysis’, PhD Thesis, Department of Mathematics, Imperial College London, 2010.
I Anagnostopoulos, C., Tasoulis, D.K., Adams, N.M. and Hand, D.J., ’Streaming Gaussian classification using recursive maximum likelihood with adaptive forgetting’, Machine Learning, (2010), under review. I Anagnostopoulos, C., Tasoulis, D.K., Adams, N.M. and Hand, D.J. ’Temporally adaptive estimation of
logistic classifiers on data streams’. Adv. Data An. Classif.,3(3) (2009),243-261. I Haykin, S. ’Adaptive filter theory’, 4th edition, Prentice Hall (1996).
I Kelly, M.G., Hand, D.J. and Adams, N.M., ’The impact of changing populations on classifier performance’ in ’KDD 99, Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining’, Chaudhuri, S. and Madigan, D. ed(AAAI), 1999, 367371.
I Pavlidis, N.G., Adams, N.M and Hand, D.J., ’Adaptive Online Logistic Regression’, Pattern Recogn., (2010) under review.
I Pavlidis, N.G., Tasoulis, D.K., Adams, N.M. and Hand, D.J. ’Adaptive consumer credit classification’, Europ. J. Finance, (2010), under review.
I Weston, D.J., Anagnostopoulos, C., Tasoulis, D.K., Adams, N.M. and Hand, D.J. ’Handling missing feature values for a streaming quadratic discriminant classifier’, Data Mining and Knowl. Disc., (2010), under review.