The proposed drift handling algorithm consists of three components. The first trains a static PSL-like classifier. The second takes the static classifier and a dataset (training or validation) as input and assigns competence values to the classifier's nodes; this is described in Section 6.1.1. The third adapts the classifier to streaming data by parameterizing the cascade layers based on the competence values learned in the previous phase, and is detailed in Section 6.1.2.
6.1.1 Assigning Competence Values to the Static PSL-classifier
The experiments on adding bootstrapping to PSL in Chapter 5 demonstrated that the nodes consisted of very diverse classifiers, as attested by the varied accuracies of the nodes in each layer. Though some nodes were shown to overfit the data under certain conditions, it became apparent that since each node was trained on mostly non-overlapping positive datasets, each one came to specialize in predicting a distinct portion of the distribution of the target concept.
Intuitively, the next step involved recognizing that each node, made up of a cluster of weak classifiers, could be seen as an individual ensemble classifier in itself. Theoretically, the competence level of each node could then be determined from its individual performance on any combination of the training or validation datasets, as well as on streaming data containing drifting concepts. Using this strategy, the intention was to leverage the confidence votes of each node in their contribution to a final detection as concept drift takes place and affects different class descriptors.
This strategy was pursued, and in the process of assigning confidence values to the nodes, the nested cascade within each layer was transformed into an ensemble of node classifiers. As a result, the previous vote-combination strategy of one-node-one-vote had to be revised to accommodate weighted voting based on the predictive competence of each node. The cascade architecture of a single layer is shown in Figure 6.1. The evaluation of a node is no longer conditional on the outcome of the predictions
6.1. Implementation Details 87
Algorithm 9: Concept Drift Learning
Given: $T_i$ = threshold for layer $i$; $\alpha_{ij}$ = confidence value for node $j$ on layer $i$; $TPR$ and $FPR$ = true and false positive rate matrices for all nodes; $vote$ = matrix of sums of all $\alpha_{ij}$ values for each layer, representing possible thresholds; $L_{ij}$ = loss value for node $j$ on layer $i$; $thresh_i$ = threshold value for layer $i$; $\omega$ = false positives weight adjustment; $\upsilon$ = true positives weight adjustment; $E_k$ = error metric for a given threshold value $k$.
Input: PSL-like classifier $h$; $SD_n$ = static dataset vector $\{(x_1, y_1), \ldots, (x_n, y_n)\}$, where each sample $x$ corresponds to a class label $y \in \{-1, 1\}$; $D^t_n$ = $t$th snapshot of a non-stationary dataset containing $n$ samples.
 3:  $H$ = TrainPSLClassifier($SD_n$)
 4:  $TPR, FPR$ = EvaluateClassifier($SD_n$, $h$)
 6:  foreach $i$th cascade layer do
 7:      foreach $j$th node in $H_i$ do
 9:          $L_{ij} = \frac{FPR_{ij}}{2} + \frac{1 - TPR_{ij}}{2}$
10:          $\alpha_{ij} = \frac{1}{2} \ln\left(\frac{1 - L_{ij}}{L_{ij}}\right)$ for node $H_{ij}$
12:  foreach $t$th snapshot of dataset stream $D^{1,\ldots,T}_n$ do
14:      if $H_i(D^t_n)$ > acceptable error then
16:          lower beginning cascade layer $i$
18:      if $H_i(D^t_n)$ ≤ acceptable error then goto step 12
20:      foreach sample $x$ in dataset stream $D^t_{1,\ldots,n}$ do
21:          foreach $i$th cascade layer do
22:              $vote_{i,k} = H_i(x)$
23:              update statistics $FPR_i$, $TPR_i$ based on each value $vote_i$
24:      foreach $i$th cascade layer do
25:          foreach $k$th possible threshold value in $vote_i$ do
27:              $E_k = \frac{FPR_{i,k} \times \omega}{2} + \frac{(1 - TPR_{i,k}) \times \upsilon}{2}$
29:              if $E_k < E_{min}$ then
30:                  $T_i = vote_{i,k}$
31:                  $E_{min} = E_k$
of the prior nodes and thus every node within each layer is guaranteed to be executed. The combination of the weighted confidence voting replaces the veto and the unanimous voting of the previous scheme.
Figure 6.1: Diagram of the concept-drift learning algorithm.
Algorithm 9 outlines the entire concept drift learning method in detail. Once an initial PSL-like classifier has been trained on a static dataset (Step 3), it is first validated against that dataset to gather competence values for all the nodes (Step 6). Because the class distribution is heavily skewed, which penalizes performance on positive detections, the overall error rate was not used. Instead, a performance metric was devised that is in effect the average of the false positive rate and the false negative rate with respect to the target class (Step 9).¹ This performance metric of each individual node-classifier was used to calculate the competence values in the form used by Freund and Schapire [192] for determining the voting strength of individual weak classifiers
$$\alpha_{ij} = 0.5 \ln\left(\frac{1 - E_{ij}}{E_{ij}}\right) \tag{6.1}$$
where each confidence value $\alpha_{ij}$ is assigned to the $j$th node $h_{ij}$ on the $i$th cascade layer, based on the performance metric $E_{ij}$. Usually ensemble combination rules consist of a simple weighted majority vote as in
¹ The alternative could also have been the g-mean measure, using the products of the accuracies of
$$H_i(x) = \operatorname{sign}\left(\sum_{j=1}^{n} \alpha_{ij} h_{ij}(x)\right) \tag{6.2}$$
where $H_i$ represents the prediction for layer $i$, obtained from the weighted sum of the $n$ node classifiers applied to an instance $x$, with the prediction determined by the sign of that sum. This was modified by parameterizing each layer with a minimum confidence threshold value $\Theta_i$
$$H_i(x) = \begin{cases} 1 & \text{if } \sum_{j=1}^{n} \alpha_{ij} h_{ij}(x) > \Theta_i \\ 0 & \text{otherwise} \end{cases} \tag{6.3}$$
in which a sample instance $x$ is positively predicted and passed to succeeding layers for more rigorous testing only if the sum of confidence values for the current layer surpasses the threshold, thus producing more robust collective decisions.
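The node competence values and the thresholded layer vote can be illustrated with a minimal Python sketch. The function names, the clipping constant `eps`, and the assumption that each node outputs a prediction in {-1, +1} are illustrative choices that do not come from the text; the formulas follow Step 9 of Algorithm 9 and Equations 6.1 and 6.3.

```python
import math


def node_loss(fpr, tpr):
    """Average of the false positive rate and the false negative rate (Step 9)."""
    return fpr / 2.0 + (1.0 - tpr) / 2.0


def competence(loss, eps=1e-6):
    """Confidence value in the form of Eq. 6.1.

    The clipping to [eps, 1 - eps] is an added guard (an assumption) that
    avoids division by zero and log(0) for perfect or useless nodes.
    """
    loss = min(max(loss, eps), 1.0 - eps)
    return 0.5 * math.log((1.0 - loss) / loss)


def layer_vote(alphas, predictions):
    """Weighted sum of node predictions; each prediction is assumed in {-1, +1}."""
    return sum(a * h for a, h in zip(alphas, predictions))


def layer_predict(alphas, predictions, theta):
    """Eq. 6.3: accept the sample only if the weighted sum exceeds theta."""
    return 1 if layer_vote(alphas, predictions) > theta else 0
```

Under these assumptions, a node with a loss of 0.5 receives a competence of zero and so contributes nothing to the layer vote, while nodes with lower loss receive proportionally larger weights.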
6.1.2 Concept-Drift Learning Algorithm
In contrast to the node-classifier confidence weights, the layer confidence thresholds are learned and optimized on the incoming data streams from the application domain in which drift is present. In Step 20, an optimal layer threshold $\Theta_i$ for layer $i$ is computed by first creating a distribution of all sums of node-classifier confidences. By treating each distinct sum as a candidate threshold, an error metric can then be calculated for each one. Additionally, the confidence threshold can be set to favour either higher hit rates or lower false positive rates by varying the weights $\omega$ and $\upsilon$ (Step 27). Once the algorithm has finished classifying all instances from the current data stream and all layer confidence sums with their respective errors have been computed, the sum with the lowest error for each layer is selected as the optimal threshold (Step 29).
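The threshold search for a single layer can be sketched as follows. This is an illustrative reading of Steps 24 to 31 rather than the original implementation: the function name, the use of plain Python lists, and the convention that labels equal to 1 mark the positive class are assumptions.

```python
def learn_layer_threshold(vote_sums, labels, omega=1.0, upsilon=1.0):
    """Select the vote sum that minimises
    E_k = FPR_k * omega / 2 + (1 - TPR_k) * upsilon / 2.

    vote_sums: weighted confidence sum of one layer for each sample.
    labels:    true class of each sample, 1 for the positive class.
    """
    positives = sum(1 for y in labels if y == 1)
    negatives = len(labels) - positives
    best_theta, best_err = None, float("inf")
    # Every distinct sum observed on the stream is a candidate threshold.
    for theta in sorted(set(vote_sums)):
        tp = sum(1 for s, y in zip(vote_sums, labels) if s > theta and y == 1)
        fp = sum(1 for s, y in zip(vote_sums, labels) if s > theta and y != 1)
        tpr = tp / positives if positives else 0.0
        fpr = fp / negatives if negatives else 0.0
        err = fpr * omega / 2.0 + (1.0 - tpr) * upsilon / 2.0
        if err < best_err:
            best_theta, best_err = theta, err
    return best_theta, best_err
```

Raising `omega` relative to `upsilon` penalises false positives more heavily and therefore tends to push the selected threshold upwards, while the reverse favours higher hit rates.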
Once threshold learning is finished, the classifier is ready to be redeployed and can begin to handle drift (Step 12). Unlike most ensemble-based methods, the algorithm makes use of a trigger mechanism to inform it that drift is occurring in the environment. In its current form, the algorithm uses the classification error rate as the trigger: drift handling begins if the generalization ability degrades beyond a predefined level (Step 14).
Initial experiments have shown that employing layer confidence thresholds is an aggressive strategy for eliminating false positive detections, one that can sometimes also reduce positive hit rates if not carefully applied. This observation can be combined with the flexibility of cascaded classifiers, which allows the number of layers that utilize confidence thresholds to be varied depending on the classifier's current generalization. During runtime, the drift learning algorithm can progressively increase the number of layers to which layer thresholds are applied (Step 16) until a sufficient number of false positives has been eliminated, while the calculation of the preceding layers is preserved in its original form. The ability of the framework to progressively increase the number of layers that use confidence thresholds is the algorithm's facility for handling gradual drifts.
As data streams begin to drift more acutely, the algorithm activates proportionally larger numbers of layer confidence thresholds in response to increasing false positive rates. This proceeds until the drift stabilizes or until all available layers with confidence thresholds are exhausted. If all layer thresholds have been deployed and the error rate is still above an acceptable level, then the layer thresholds have been rendered irrelevant to current conditions, and optimal layer threshold learning is re-initiated (Step 18).
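The trigger and the progressive activation of layer thresholds described above can be sketched as a small control loop. This is a hedged interpretation of Steps 12 to 18: the names `error_rate`, `handle_snapshot`, `evaluate_error`, `first_thresholded`, and `relearn` are hypothetical, and the text does not specify how the classifier is re-evaluated after each additional layer is activated.

```python
def error_rate(predictions, labels):
    """Fraction of misclassified samples in a snapshot (the drift trigger signal)."""
    wrong = sum(1 for p, y in zip(predictions, labels) if p != y)
    return wrong / len(labels)


def handle_snapshot(evaluate_error, first_thresholded, acceptable_error, relearn):
    """Process one snapshot of the stream.

    evaluate_error(k): assumed callback returning the error rate on the
        current snapshot when layers k, k+1, ... use confidence thresholds
        and layers before k keep their original form.
    first_thresholded: index of the first layer currently using a threshold.
    relearn: assumed callback that re-runs layer threshold learning.

    Returns the updated first_thresholded and whether relearning was needed.
    """
    if evaluate_error(first_thresholded) <= acceptable_error:
        return first_thresholded, False  # no drift detected, keep the setup
    # Lower the beginning layer one step at a time (Step 16).
    while first_thresholded > 0:
        first_thresholded -= 1
        if evaluate_error(first_thresholded) <= acceptable_error:
            return first_thresholded, False
    # All layers use thresholds and the error is still too high.
    relearn()
    return first_thresholded, True
```

In this sketch, a slowly rising error rate lowers `first_thresholded` by a few layers at most, which corresponds to gradual drift, whereas a persistent error after all layers are activated falls through to threshold relearning.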
The use of layer thresholds shares some affinity with the manner in which Viola and Jones employ them. Their approach adjusts the layer thresholds at training time to control the trade-off between the TPR and FPR. Similarly, the proposed layer thresholds also modify the operating point on the ROC curve for each layer. The approach outlined here differs in that the layer thresholds are learned over clusters of weak classifiers that are trained on different partitions of data with distinct boosting rounds, and thus do not explicitly represent the boosted margin theory of Freund and Schapire [192].