3.3 Extended Technique
3.3.3 Feature Subset Selection
3.3.3.1 Ant Colony Optimization (ACO)
[Figure: filter approach and wrapper approach to feature subset selection.]
Ant colony optimization is an optimization technique that was introduced in the early 1990s by Dorigo and his colleagues [179, 180]. The technique was inspired by the foraging behaviour of real ants, as shown in Figure 3.10. This behaviour, called stigmergy, was described by the French biologist Grassé in the late 1950s [185]. It involves indirect communication between ants through the chemical pheromones that they leave on trails, which permits them to find the shortest path between the nest and the food supply. Ant Colony Optimization exploits this behaviour to search for approximate solutions to discrete optimization problems [186].
ACO is one of the most successful mechanisms of swarm intelligence [187]. Swarm intelligence aims to design intelligent multi-agent systems inspired by the collective behaviour of social animals and insects such as birds, fish, ants, bees and wasps [188].
Figure 3.10: Ants’ behaviour to find paths from a source node to a destination node.
In nature, ants first explore the area surrounding their nest at random. While moving, they leave a chemical pheromone trail on the ground that other ants can smell. When choosing their way, ants tend to follow paths marked by strong pheromone levels. Whenever an ant finds a food source, it evaluates the quantity and quality of the food and adjusts the amount of pheromone it leaves on the path back to the nest accordingly. These pheromone trails then guide other ants to the food source via the shortest path. The pheromone level on the ground decreases with time; therefore, only paths with strong pheromone levels persist to guide the ants [188].
The well-known travelling salesman problem (TSP) is one of the problems that the research community has used as a simplified test case for ACO [189]. The TSP models a travelling salesman who must pass through a number of cities, visiting each city exactly once, so that the total travelling distance is minimal. ACO was subsequently applied with success to a great number of problems, such as the quadratic assignment problem (QAP), routing in telecommunication networks, graph colouring and scheduling [190].
The first ACO algorithm developed was the Ant System (AS) [183], which Dorigo created for his doctoral dissertation. Since then, several improvements of the AS have been developed, many of them by Dorigo himself, such as Elitist AS [191], Ant-Q [192] and Ant Colony System [193]. Other improvements to the original system were presented by different researchers, including MAX-MIN AS [194] and Hyper-Cube AS [195].
The ACO algorithm can be applied to any optimization problem for which the following aspects can be defined [188]:
□ Appropriate problem representation: This ensures that the problem can be expressed as a graph consisting of a set of nodes and the edges between them.
□ Heuristic desirability (η) of edges: This measures the “goodness” of paths from one node to another in the graph.
□ Construction of feasible solutions: A mechanism that ensures only feasible solutions are constructed. This requires suitable stopping criteria that end path construction once a solution has been reached.
□ Pheromone updating rule: A technique for updating the pheromone levels on edges, together with a corresponding evaporation rule. This involves updating the paths that the n best ants chose.
□ Probabilistic transition rule: The rule that controls the probability of an ant traversing from one node to another in the graph.
Solution construction begins with an empty partial solution, which is then extended step by step by adding a feasible solution component from the set of solution components [190]. The transition rule that lets any ant k decide whether to include the ith feature at time t is influenced by two aspects: the heuristic desirability and the pheromone level. Classifier performance is often used as the heuristic information for feature selection [188].
The probabilistic transition rule is calculated as follows:
$$P_i^k(t) = \begin{cases} \dfrac{[\tau_i(t)]^{\alpha} \cdot [\eta_i]^{\beta}}{\sum_{u \in h^k} [\tau_u(t)]^{\alpha} \cdot [\eta_u]^{\beta}} & \text{if } i \in h^k \\ 0 & \text{otherwise} \end{cases} \tag{3.5}$$
where $h^k$ is the set of feasible features that can be added to the partial solution of ant k, and $\tau_i$ and $\eta_i$ are the pheromone value and the heuristic desirability associated with feature i. The two parameters α and β control the relative importance of the pheromone value and the heuristic information. As mentioned earlier, the local heuristic desirability $\eta_i$ of the ith feature is assessed using the classification accuracy of the classifier used in the problem.
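As an illustration of the transition rule, the following minimal Python sketch normalises the product of pheromone and heuristic terms over the feasible features. The function name `transition_probabilities`, the list-based representation of τ and η, the default values of α and β, and the uniform fallback for an all-zero denominator are assumptions made for this example only.

```python
def transition_probabilities(tau, eta, feasible, alpha=1.0, beta=1.0):
    """Selection probability of each feasible feature for one ant.

    tau, eta : per-feature pheromone and heuristic values (lists of floats)
    feasible : indices of features that may still be added (the set h^k)
    Features outside `feasible` are absent from the result (probability 0).
    """
    if not feasible:
        return {}
    weights = {i: (tau[i] ** alpha) * (eta[i] ** beta) for i in feasible}
    total = sum(weights.values())
    if total <= 0.0:
        # Assumed fallback: choose uniformly when every weight is zero.
        return {i: 1.0 / len(feasible) for i in feasible}
    return {i: w / total for i, w in weights.items()}


# Illustrative values: feature 0 is already in the partial solution.
tau = [1.0, 1.0, 2.0, 1.0]
eta = [0.5, 0.8, 0.6, 0.9]
probs = transition_probabilities(tau, eta, feasible=[1, 2, 3])
```

With these illustrative values, feature 2 receives the largest probability because its higher pheromone level outweighs its lower heuristic value.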
The pheromone update on all nodes is carried out after all ants have completed their solutions. The goal of pheromone evaporation is to escape the state in which all ants construct the same solution [188], while larger pheromone amounts are dropped on good routes. The latter is achieved by having each ant deposit an amount of pheromone that depends on the quality of its solution, i.e. the classification accuracy. In addition, to increase the usefulness of dropping pheromones on routes, a small fraction of the pheromone is removed at the end of every iteration, which emphasizes the pheromone reduction on lower-quality routes. Equation 3.6 gives the quantity of pheromone that each ant k deposits on each node i that it has visited.
$$\Delta\tau_i^k(t) = \begin{cases} \varphi \cdot C(S^k(t)) + (1 - \varphi) \cdot \dfrac{N - |S^k(t)|}{N} & \text{if } i \in S^k(t) \\ 0 & \text{otherwise} \end{cases} \tag{3.6}$$
where $S^k(t)$ is the feature subset found by ant k at iteration t, $|S^k(t)|$ is its length, and $C(S^k(t))$ is the classifier performance for that ant at that iteration. N is the total number of features in the data set. The parameter φ controls the relative importance of the classifier performance and the feature subset length.
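The deposit rule can be sketched in Python as follows. The sketch assumes that Equation 3.6 weights the classifier performance by φ and the relative subset shortness, (N − |S|)/N, by (1 − φ); the function name `pheromone_deposit` and the numeric values are illustrative only.

```python
def pheromone_deposit(subset, accuracy, n_features, phi=0.8):
    """Pheromone deposited by one ant on every feature node.

    subset     : indices of the features selected by the ant (S^k(t))
    accuracy   : classifier performance obtained with that subset, C(S^k(t))
    n_features : total number of features in the data set (N)
    phi        : weight of accuracy relative to subset length (assumed value)
    """
    shortness = (n_features - len(subset)) / n_features
    quality = phi * accuracy + (1.0 - phi) * shortness
    # Only nodes visited by the ant receive pheromone.
    return [quality if i in subset else 0.0 for i in range(n_features)]


# An ant that selected features 0 and 2 out of 4 and reached 90% accuracy.
deposit = pheromone_deposit({0, 2}, accuracy=0.9, n_features=4, phi=0.8)
```

Here the visited nodes 0 and 2 each receive 0.8 · 0.9 + 0.2 · 0.5 = 0.82, while the unvisited nodes receive nothing.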
At the end of every iteration, the pheromone level is updated on all nodes: the existing pheromone is reduced by evaporation and the newly deposited pheromone is added. The pheromone update is computed as:
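$$\tau_i(t+1) = (1 - \rho) \cdot \tau_i(t) + \sum_{k=1}^{m} \Delta\tau_i^k(t) \tag{3.7}$$
where ρ is the pheromone trail decay coefficient, which ranges from 0 to 1, m is the number of ants, and $\Delta\tau_i^k(t)$ is the amount of pheromone deposited by ant k, computed in Equation 3.6.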
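A minimal Python sketch of this update is given below, assuming the pheromone levels and the per-ant deposits are stored as plain lists; the function name `update_pheromone` and the example numbers are hypothetical.

```python
def update_pheromone(tau, deposits, rho=0.2):
    """Apply evaporation and add the deposits of all ants.

    tau      : current pheromone level of each feature node
    deposits : one list per ant, giving that ant's deposit on each node
    rho      : pheromone trail decay coefficient, between 0 and 1
    """
    return [
        (1.0 - rho) * level + sum(ant[i] for ant in deposits)
        for i, level in enumerate(tau)
    ]


# Two ants, three features; only the first two nodes were visited.
old_tau = [1.0, 1.0, 1.0]
ant_deposits = [[0.5, 0.0, 0.0], [0.25, 0.25, 0.0]]
new_tau = update_pheromone(old_tau, ant_deposits, rho=0.2)
```

In this example the unvisited third node only evaporates (1.0 to 0.8), whereas the first node, visited by both ants, rises to 1.55.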
The stopping criterion for feature selection has been defined in several ways. One is a fixed number of features [196], in which the user defines the minimum and maximum limits for the feature subset length; the ants then stop choosing further features once the maximum number is reached. Another frequently used stopping criterion is accuracy inversions [197], where an inversion corresponds to the selection of a feature that degrades the performance. The ants stop the feature selection and return the subset once a maximum number of inversions is reached.
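Both stopping criteria can be expressed compactly in Python. The sketch below is one possible reading, in which an inversion is counted whenever the accuracy recorded after adding a feature is lower than the accuracy before it; the function names and the accuracy trace are made up for illustration.

```python
def count_inversions(accuracies):
    """Number of steps at which adding a feature lowered the accuracy."""
    return sum(1 for prev, cur in zip(accuracies, accuracies[1:]) if cur < prev)


def should_stop(accuracies, max_features, max_inversions):
    """True when either stopping criterion is met.

    accuracies holds one value per selected feature, so its length is the
    current subset size.
    """
    too_long = len(accuracies) >= max_features
    too_many_drops = count_inversions(accuracies) >= max_inversions
    return too_long or too_many_drops


# Accuracy after each of five successive feature additions (made-up values).
trace = [0.70, 0.80, 0.75, 0.78, 0.74]
inversions = count_inversions(trace)
stop = should_stop(trace, max_features=10, max_inversions=2)
```

In this trace the accuracy drops twice (0.80 to 0.75 and 0.78 to 0.74), so an ant limited to two inversions would stop here even though the size limit has not been reached.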
The process of feature selection using ACO starts by generating a number of ants, which are then placed randomly on the graph. Often, the number of ants is chosen to be equal to the number of features; this allows each ant to begin constructing its path at a different feature [188]. The ants then traverse nodes using the probabilistic transition rule until the stopping criterion is met. The subsets produced by all ants are then gathered and evaluated. The process stops if an optimal subset has been obtained or the algorithm has executed a specified number of iterations, and the best feature subset encountered is output as the solution. If neither condition has been satisfied, the pheromone is updated, a new set of ants is created and the process repeats [190]. The overall process of ACO feature selection is shown in Figure 3.11.
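The overall loop can be sketched in Python as follows. This is a simplified illustration rather than the implementation used in the cited studies: `toy_accuracy` is a hypothetical stand-in for a trained classifier, a fixed subset length serves as the stopping criterion, each ant starts at a different feature, and the deposit follows the assumed form φ·C + (1 − φ)·(N − |S|)/N.

```python
import random


def toy_accuracy(subset):
    """Hypothetical classifier score: features 0 and 2 are informative."""
    informative = {0, 2}
    hits = len(informative & set(subset))
    return hits / len(informative) - 0.05 * len(set(subset) - informative)


def aco_select(n_features, evaluate, n_iter=20, max_features=2, alpha=1.0,
               beta=1.0, rho=0.2, phi=0.8, target=None, seed=0):
    rng = random.Random(seed)
    tau = [1.0] * n_features
    # Heuristic desirability: score of each feature alone, kept positive.
    eta = [max(evaluate([i]), 1e-6) for i in range(n_features)]
    best_subset, best_score = None, float("-inf")
    for _ in range(n_iter):
        deposits = [0.0] * n_features
        for k in range(n_features):          # one ant per feature
            subset = [k]
            while len(subset) < max_features:
                feasible = [i for i in range(n_features) if i not in subset]
                if not feasible:
                    break
                weights = [tau[i] ** alpha * eta[i] ** beta for i in feasible]
                subset.append(rng.choices(feasible, weights=weights)[0])
            score = evaluate(subset)
            if score > best_score:
                best_subset, best_score = sorted(subset), score
            quality = phi * score + (1 - phi) * (n_features - len(subset)) / n_features
            for i in subset:
                deposits[i] += max(quality, 0.0)
        # Evaporation plus the deposits of all ants.
        tau = [(1 - rho) * t + d for t, d in zip(tau, deposits)]
        if target is not None and best_score >= target:
            break                            # a good enough subset was found
    return best_subset, best_score


best, score = aco_select(5, toy_accuracy, max_features=2)
```

In practice `evaluate` would train and validate the chosen classifier on the candidate subset, which dominates the running time, so results are usually cached per subset.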
ACO has been utilized in several keystroke dynamics studies. One example, which used ACO together with other feature selection techniques, is the study in [5]. In addition to ACO, Particle Swarm Optimization (PSO) and Genetic Algorithm (GA) were applied to the data before feeding it into a back propagation neural network (BPNN) classifier. Based on feature reduction rate and classification accuracy, this study reported that ACO yields better performance than PSO and GA.
Similarly, ACO, PSO and GA were all used in [182] for feature subset selection, with the Extreme Learning Machine (ELM) chosen as the learning method. Consistent with the conclusions of [5], this work found that ACO results in the best feature subset selection with ELM.
Figure 3.11: ACO feature selection process.