The Main Model Architecture and Processes

The MPACA uses graph space architecture. It converts the given raw dataset (of objects) into nodes, where ordinal features (continuous or discrete) for each data element dimension are used as Cartesian points to plot these nodes on the graph, reflecting a spatial arrangement. Nominal (non-ordinal) features are ignored for the purposes of graph space creation, since these have no implied order. However, nominal features that influence cluster formation (feature selection is outside the remit of this thesis) are still present at the node and influence cluster formation by being part of the pheromone trails. When an ant matches its own feature value, irrespective of whether it is nominal or ordinal, it lays a pheromone trail. The impact of the trails is to strengthen connections between nodes with matching feature values rather than moving the nodes closer together, which some ACO algorithms do. Edge distances are calculated and expressed as the Euclidean distance rather than, say, the city-block distance that is sometimes used.

Algorithm 4 The MPACA Outline

1. A given dataset is normalised and converted into graph space.

2. A number of ants are created for each feature of each node, a number controlled by the ant complement parameter.

3. Ants are initially imprinted a feature value for the feature of the node they are created in, called the instinct or base feature, and this feature stays within the ant all throughout execution. From this point onwards ants react to any other nodes which match this feature. 4. The ant deposits pheromone traces for the feature or the feature combinations, which it is

carrying:

(a) This continues until it keeps reaching nodes which match the feature combinations it is carrying;

(b) Once this condition no longer holds, pheromone deposition ceases;

5. Ants traverse the graph by selecting edges to visit depending on a stochastic mechanism influenced by the pheromone quantity representative for each feature value. That is, ants detect pheromone traces deposited by other ants on edges. If these pheromones represent feature values which are of interest to them, this process increases the likelihood of ants choosing these edges over others.

6. Ants further reinforce interesting paths by depositing pheromone, causing the positive feedback to occur. Pheromones evaporate and this evaporation serves to eliminate paths which are not adequately reinforced.

7. Two crucial operators are feature merging and colony merging, both having separate thresholds, but both of which gather information at the same points when applying the ant feature encounter counting mechanism.

8. The ant feature encounter counting mechanism used consists of a number of events: (a) The ant records those features being carried by other ants;

(b) The ant records the Id of the encountered ant;

(c) The ant records the ant deposit state (active/de-active) the detected ants are in; (d) The ant records the colony Ids the detected ants are in;

(e) All of the above include a time-stamp when this encounter took place.

Only ants that are within the same vicinity are recorded, defined by the visibility parameter, or a number of steps away.

9. Feature merging occurs when a repeated number of features are encountered by the same ant within the same time-window. There is an inherent inbuilt forgetting mechanism for merged features. These features are dropped when ants no longer detect the same features frequently enough within the same time-window. Only one exception applies, that is the feature that the ant is initially imprinted with is never dropped.

10. Colony merging occurs by ants determining which colony they should merge into. This depends on the colony Id which has been detected most frequently amongst the encounters in deposit mode, which encounters exceed the colony merging threshold. If more than one colony exists, the ant is set to belong into the colony which has the highest ant population amongst the detected ants.

11. Once ants reach a stable dynamic equilibrium in the colonies they form, the algorithm terminates.

12. When this occurs the initial data points are compared against the centroids of the features which are carried by the ants present in each colony (i.e. cluster), and this value is used to determine which cluster is best representative for such a data point.

Before the MPACA sets its ants loose on the graph space to find clusters, the domain has to be initialised. The following considerations governed the mechanisms used:

1. the environment must be scalable;

2. nodes must be linked by pathways along with ants can move;

3. pheromones must be laid by ants, which must also be able to detect and follow particular ones.

3.3.1 Partially-Connected Graph Space

In an ideal world with no computational constraints, the multi-dimensional space (MDS) could be represented by hypercubes for each point and ants would be able to reach every possible point in the space. In reality this is not feasible. The sheer number of possible movement positions would make convergence slow, if not impossible. To overcome this limitation, graph space is used. This restricts ant movement to only those edges connecting nodes, therefore rendering the algorithm more scalable.

Another scalability issue concerns the number of connecting edges. In a fully connected bi- directional graph, the number of edges is dependent on the number of nodes present. The MPACA adds a node in hyper-space for each existing data element and a large data sample would result in a correspondingly large graph and thus scalability problems. Excluding the node it is on, and the node it just arrived from, the ant is allowed to travel towards any other node in space. Therefore, the number of possible edges from any single node is n − 2, where n represents the number of elements present. This will be a very large number for large datasets and, again limits convergence because ants can be scattered too widely. This is apart from the time taken to process so many paths and possibilities. Hence, the MPACA does not utilise a fully-connected graph. Instead, its truncation is part of the pre-processing phase for setting up the graph and edge distances.

3.3.2 Connecting Nodes and Measuring Distances

Although attributes used for the graph initialisation process are limited to ordinal ones, their variety of distributions creates its own challenges. The values for each dimension need to be standardised, in such a way that the magnitude of different dimensions are scaled to a common representation. Additionally, when non-uniformly distributes ranges are considered, like ages ranging from 10 to 15 and from 45 to 55, these values are either re-enumerated as ordinal values, or considered as nominal features. This choice is at the discretion of the MPACA operator.

This normalisation process eliminates the unit of measurement by transforming the data to scores of standard deviations (SD) from the mean. It is applied to each and every ordinal data element dimension so that the unit of each dimension becomes comparable to any other dimension. It is the usual normalisation process of converting values to SDs from the mean, known as zvalues, as per equation (3.1), where x is the original value and µ is the mean:

z= (x − µ)

SD , (3.1)

The distance between nodes is denoted in terms of the Euclidean distance, E, between each pair of nodes (a, b) with the dimensions now measured on a comparable scale of z values. However, each dimension is further converted into a number of steps, where each step equates to the distance moved by an ant in each time cycle. A step size parameter is calculated on the basis that four SDs above and below the mean adequately encompasses all the population apart from outliers. It is generally accepted that 95% of data points are covered within four SDs [Al-Saleh and Yousif, 2009].

The initial assumption is that dividing one SD unit into 10 steps gives an adequate granularity, so each step is 0.1 of a SD. This means 40 steps will cover the majority of the population as discussed, although there is no actual limit to the number of steps for a dimension: the outliers are still included in the graph.

Having converted dimensions into steps, the Euclidean distance, E, between two nodes is the square root of the sum of squared number of steps along each dimension. However, as the number of dimensions in the hyper-dimensional space increases, the corresponding length of the edges will also increase: this effect is known as the curse of dimensionality [section (2.2.2.4)]. To compensate for it, an adjusted Euclidean step size, U , is used, which is the Euclidean distance for one step along each dimension. Two points are separated by one Euclidean step, U , if they are one step apart along each dimension. This makes U the square root of the number of dimensions, D where U =√D. The number of steps along the edge, ES, is calculated using the normal Euclidean equation and then divided by U , as per equation (3.2):

ES= E U =

E √

D (3.2)

where E is the length of the edge.

A second problem with the domain initialisation comes with too many data points. If every one is connected to every other one, then there will be too many edges. The maximum edge length parameter is used to reduce the number by specifying the maximum length of an edge:

any nodes further apart than this are not connected by an edge, as explained per section (3.3.1). Experimentation will be needed to determine the recommended settings for accurate cluster definitions.

3.3.3 Multiple-steps within Edges

In the MPACA an edge is divided into a number of steps that an ant takes to traverse it. This concept is a key variation between the MPACA and other ant colony algorithms, where edges are traversed in single atomic units. This has been identified as a weak point in other algorithms because, irrespective of how distant nodes are from each other, it takes an ant just one step to get from one node to another. This creates a topological structure for the graph rather than one that conserves distances.

In the MPACA, longer edges have more steps and traversals take more time and pheromone has longer to evaporate before the path is completed. This reinforces the inverse-relationship of length to pheromone quantity. The position of the ant along an edge is determined by the system time, where every time increment represents one step along the edge until its full length has been traversed. Thus, this process does not delay the clustering process, on the contrary it adds further emphasis on the distance between nodes.

For the algorithm to operate, a number of ants need to be initialised at each node of the graph, where the number of ants is proportional to the problem domain size.

In document The multiple pheromone Ant clustering algorithm (Page 89-93)