Future work - The multiple pheromone Ant clustering algorithm

A clear limitation of the current MPACA implementation is the final assignment of data elements to clusters. At present a centroid-driven approach is used. Although this returns results comparable to other algorithms in the literature, as shown in chapter (4), the analysis is still deemed too crude. The mechanism is also prone to the pitfalls of K-Means and other centroid-based algorithms described in chapter (2). To compensate for these limitations, two alternative methods are currently being investigated: (i) a Bayesian-driven membership calculation and (ii) a K-Nearest-driven approach. The results obtained for these methods so far are too preliminary to be included in this thesis.

5.5.1 Bayesian Cluster Membership Calculation

The Bayesian approach determines the most likely membership of a given data element from the features which constitute it, which are called the evidence set. The mechanism uses the notion that each colony consists of multiple ants, and each ant carries multiple features. Cluster membership for a data element is determined by analysing the population distribution of the ants in each class based on their feature values. The evidence set is analysed per feature, and a collective evaluation of all results is performed. For each feature, the count of that feature within the colony is taken as a ratio of the count of that feature within the entire system. The result is a mechanism that relies on the frequency of encounters to determine probabilistically which colony each data element should belong to, as outlined in algorithm (10). The probability of a data element de being in a cluster c is therefore defined as in equation (5.1):

P(c|de) ∝ P(de|c) P(c)    (5.1)

Algorithm 10 Bayesian approach to cluster membership calculation

for all data elements, de ∈ Dataset, ds do
    Let HighestSigma → 0.0
    Let BestColony → none
    Let EvidenceSet → features at Node
    Let |F| → number of distinct features in the system
    for all colonyId ∈ Colony do
        Let colonySigma → 1.0
        for all feature, f ∈ EvidenceSet do
            Let featureInColonyCount → number of features in this colony which match f
            Let numerator → featureInColonyCount + 1
            Comment: where the addition of value 1 ensures a level of smoothing
            Let featureInSystemCount → number of features in the system which match f
            Let denominator → featureInSystemCount + |F|
            Comment: where again |F| ensures smoothing
            Let colonySigma → colonySigma × (numerator / denominator)
        end for
        if (colonySigma ≤ HighestSigma) then
            Continue
        else
            Let HighestSigma → colonySigma
            Let BestColony → colonyId
        end if
    end for
    Assign de to BestColony
end for
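Read as a single expression, the inner loop of algorithm (10) accumulates a smoothed product of per-feature ratios for each colony, and the outer loop keeps the colony with the largest product. The notation below is introduced here for illustration only and is not taken from the thesis: E(de) stands for the evidence set of de, n_{c,f} for the number of features in colony c that match f, and n_f for the number of features in the whole system that match f.

```latex
\sigma_c(de) \;=\; \prod_{f \in E(de)} \frac{n_{c,f} + 1}{n_f + |F|},
\qquad
\hat{c}(de) \;=\; \arg\max_{c} \; \sigma_c(de)
```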

Smoothing is applied so that features with zero counts are handled without skewing the calculations. Unlike the centroid calculation, which uses a similar proximity method on the locus of points, this mechanism uses the actual feature match counts to create the probability distribution.
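The procedure in algorithm (10) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than the MPACA implementation: features are treated as hashable tokens that "match" by equality, each colony is given as a plain list of the features carried by its ants, and the names `bayesian_membership`, `evidence_set` and `colony_features` are hypothetical.

```python
from collections import Counter


def bayesian_membership(evidence_set, colony_features):
    """Return (best_colony, highest_sigma) for one data element.

    evidence_set    -- features observed at the data element's node
    colony_features -- dict: colony id -> list of features carried by
                       that colony's ants (assumed representation)
    """
    # Count features per colony and across the whole system.
    colony_counts = {cid: Counter(feats) for cid, feats in colony_features.items()}
    system_counts = Counter()
    for counts in colony_counts.values():
        system_counts.update(counts)
    num_distinct = len(system_counts)  # |F| in algorithm (10)

    best_colony, highest_sigma = None, 0.0
    for colony_id, counts in colony_counts.items():
        colony_sigma = 1.0
        for f in evidence_set:
            numerator = counts[f] + 1                    # +1 smoothing
            denominator = system_counts[f] + num_distinct  # +|F| smoothing
            colony_sigma *= numerator / denominator
        if colony_sigma > highest_sigma:
            best_colony, highest_sigma = colony_id, colony_sigma
    return best_colony, highest_sigma


# Toy data, invented purely for illustration.
colonies = {
    "A": ["red", "red", "round"],
    "B": ["blue", "round", "round"],
}
best, sigma = bayesian_membership(["red", "round"], colonies)
```

With this toy input, colony "A" scores (3/5) × (2/6) and colony "B" scores (1/5) × (3/6), so the element is assigned to "A". Because of the smoothing terms, a feature that never occurs in a colony lowers that colony's score without forcing it to zero.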

5.5.2 K-Nearest Neighbourhood Cluster Membership Calculation

Algorithm 11 K-Nearest Neighbourhood approach to cluster membership calculation

for all data elements, de ∈ Dataset, ds do
    Match de to Node Id
    Let NodesInProximity → all nodes which are within K-neighbouring distance
    Let AntsWithinK-Radius → all ants on all nodes within NodesInProximity which have their deposit mode set to TRUE
    Let ColonyCount → number of distinct colonies which are in AntsWithinK-Radius
    Let ColonyVotingArray[ColonyCount] → 0
    for all ant ∈ AntsWithinK-Radius do
        ColonyVotingArray[ant.ColonyId]++
    end for
    Set Node membership to the colony Id with the highest count in ColonyVotingArray
end for

The K-Nearest Neighbourhood mechanism uses knowledge of how colonies are distributed over nodes. Each node represents an original data element, and colonies of ants are distributed unevenly across the nodes: certain groups of nodes tend to be populated by ants of one colony rather than another. The mechanism first filters out ants that are not in deposit mode, since technically they should not belong there. Subsequently, each data element (node) is allocated to a colony according to the most frequent colony Id among the ants present on it and on the nodes within its K-Nearest distance, which is effectively a majority polling mechanism. All nodes which are less than K steps away are grouped together under this cap, as outlined in algorithm (11).
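The voting step of algorithm (11) can be sketched as follows. This is an illustrative sketch under several assumptions, not the MPACA implementation: the neighbourhood is taken to be every node reachable in at most `k` hops over an adjacency dictionary (the thesis may define K-distance differently), each ant is reduced to a `(colony_id, deposit_mode)` pair, and the function and parameter names are hypothetical.

```python
from collections import Counter, deque


def nodes_within_k(graph, start, k):
    """Breadth-first search: nodes at most k hops from start (start included)."""
    depth = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if depth[node] == k:
            continue  # do not expand beyond the K cap
        for neighbour in graph.get(node, ()):
            if neighbour not in depth:
                depth[neighbour] = depth[node] + 1
                queue.append(neighbour)
    return set(depth)


def knn_membership(node_id, graph, ants_on_node, k):
    """Majority vote over depositing ants on node_id and its K-neighbourhood.

    ants_on_node -- dict: node -> list of (colony_id, deposit_mode) pairs
    Returns the winning colony id, or None if no depositing ant is in range.
    Ties are broken arbitrarily in this sketch.
    """
    votes = Counter()
    for node in nodes_within_k(graph, node_id, k):
        for colony_id, deposit_mode in ants_on_node.get(node, ()):
            if deposit_mode:  # ants not in deposit mode are filtered out
                votes[colony_id] += 1
    if not votes:
        return None
    return votes.most_common(1)[0][0]


# Toy graph and ant placement, invented purely for illustration.
graph = {"n1": ["n2"], "n2": ["n1", "n3"], "n3": ["n2"]}
ants = {
    "n1": [("A", True), ("B", False)],
    "n2": [("A", True), ("B", True)],
    "n3": [("B", True), ("B", True)],
}
near = knn_membership("n1", graph, ants, k=1)
far = knn_membership("n1", graph, ants, k=2)
```

In this toy case node n1 is assigned to colony "A" when k = 1 and to colony "B" when k = 2, which shows how the choice of K changes the outcome of the poll.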

5.5.3 Ongoing Research

During this research period, the MPACA has been presented at a number of conferences and in publications, namely [Chircop and Buckingham, 2013], [Chircop and Buckingham, 2014], [Chircop and Buckingham, 2011b] and [Chircop and Buckingham, 2011a]. Given the success of applying the MPACA to real-world domains, among them the GRiST [Buckingham and Adams, 2013] and the ADVANCE [adv, 2013] datasets, further work is continuing in these areas. Both domains have their own set of challenges for ACO, and the promising initial results of the MPACA make them well worth addressing. Both are essentially prediction problems that are most obviously modelled by regression. However, the exact number produced by the prediction is not as important as the decision class to which it is assigned.

For the mental-health domain, the GRiST clinical risk judgements take one of eleven categories, zero to ten, but the psychological representation of risk has less granular categories that map onto risk management decisions. These are more like low, medium, and high risk, where the most important category is the high-risk one. The MPACA has shown some success in detecting these categories. Future work will explore how its clustering abilities can be applied to GRiST data that is associated more with risk management than with risk evaluation, to help provide a more robust linkage between risk categories and their management. It is clear from the current GRiST data that only a subset is required for evaluating the risks and that much of the supporting data is about how to manage them. There is little understanding of the role of these management data, and finding patterns within them is an ideal application of the MPACA. The rewards are high because it is clear that the clinically relevant representations of risk (e.g. none, low, medium, high, maximum) are linked to how patients in each category are managed. Management data can help define and refine the categories, which then feeds back into better thresholds for assigning patients via the evaluation process.

Regarding the ADVANCE logistics domain, the prediction was of the expected demand on vehicle space each day, so that haulier companies know how many lorries to deploy. Again, the exact number is less important than the impact on decisions. If the demand rises or falls by more than a threshold amount from the normal amount, this has severe consequences, either by wasting space (and money) or by failing to deliver goods on time. The MPACA showed good results when clustering data to find those days when the thresholds are exceeded. Future work will attempt to match the thresholds more accurately to the cognitive model of decisions [Buckingham et al., 2012]. In this case, we have a clear understanding of what situations drive different decisions but not of how to detect those situations from the data; the MPACA could be a valuable resource for achieving the latter.
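The point that the decision class matters more than the exact predicted number can be made concrete with a small sketch. Everything here is hypothetical: the function name, the class labels and the numeric values are invented for illustration and are not taken from the ADVANCE data or from the cognitive model cited above.

```python
def decision_class(demand, normal, threshold):
    """Map a predicted daily demand onto a coarse decision class.

    A day is flagged only when demand deviates from the normal amount
    by more than the threshold, in either direction.
    """
    deviation = demand - normal
    if deviation > threshold:
        return "above"   # risk of failing to deliver goods on time
    if deviation < -threshold:
        return "below"   # risk of wasted space
    return "normal"


# Illustrative values only: normal demand 100 units, threshold 20 units.
labels = [decision_class(d, normal=100, threshold=20) for d in (130, 75, 110)]
```

Under this framing, two predictions of 105 and 115 units lead to the same decision, whereas 119 and 121 do not, which is why detecting threshold crossings is the more relevant target than minimising numeric error.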

5.5.4 Application of the MPACA as a Classifier

Results presented in chapter (4) demonstrate that the MPACA can quite easily be used in a classification mode. This is still ongoing research, and further experimentation is required. It would certainly be interesting to compare the MPACA as a classifier against the other classifiers presented in the results chapter.

An important guideline to remember for future research on the MPACA is to avoid chasing performance optimisation without understanding how it is being achieved. Otherwise, the particular qualities of the MPACA could be lost or diluted, with improvements failing to come from the metaphor that has motivated the research in the first place. Future work will attempt to exploit the novel strengths of the algorithm rather than forcing it to fit unsuitable problem domains by bloating its functionality and diluting its distinctive properties.
