
4.4 Representations of Clustering Solutions for Evolutionary Algorithms

4.4.2 Label Based Integer Encoding

Label Based Integer Encoding (LBIE) [90, 115, 100] represents a clustering solution by recording the cluster membership of each object in the data set. Each LBIE representation, $\vec{l}$, is an integer vector of length n with elements $l_i \in [1, k]$. Each position, $l_i$, corresponds to an object $\vec{x}_i$ in the data set and defines the cluster to which that object belongs. For example, the vector (111222233) describes a clustering solution for a data set of nine objects in which one cluster contains three objects, one contains four objects and one contains two objects.
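The mapping from a label vector to its clusters can be illustrated with a minimal Python sketch. The function name `decode_labels` and the use of 0-based object indices are assumptions made for illustration and do not come from the original work.

```python
from collections import defaultdict


def decode_labels(labels):
    """Group 0-based object indices by the cluster label stored at each position."""
    clusters = defaultdict(list)
    for index, label in enumerate(labels):
        clusters[label].append(index)
    return dict(clusters)


# The nine-object example from the text: (111222233).
solution = [1, 1, 1, 2, 2, 2, 2, 3, 3]
clusters = decode_labels(solution)
print({label: len(members) for label, members in clusters.items()})
# prints {1: 3, 2: 4, 3: 2}
```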

This representation is naturally redundant: for example, (333111122) generates a clustering solution that is identical to the previous example, even though the two encodings differ at every position. To eliminate this redundancy, a renumbering procedure can be employed so that all relabellings of a solution are mapped to the same vector. A suitable procedure is given in Algorithm 4.4.

In Algorithm 4.4, $\vec{l}$ is the vector to be renumbered, k is the number of clusters and n is the size of the data set. $\vec{a}$ is a vector of size k that stores the cluster numbers in the order in which they are first observed in the solution; in the solution (333111122), cluster 3 is the first observed cluster, cluster 1 is the second and cluster 2 is the third, so $\vec{a} = (3, 1, 2)$. $\vec{a}$ is populated by looping through the solution with a counter, b, that indexes the next free position in $\vec{a}$ and is incremented each time a previously unobserved cluster is added. $\vec{t}$ is then populated by looping through the observed clusters; this vector maps each original cluster number to the order in which it was first observed, here $\vec{t} = (2, 3, 1)$. Finally, the algorithm loops through the solution again, replacing each cluster number $l_i$ with $t_{l_i}$, which yields the renumbered vector (111222233).
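The renumbering procedure of Algorithm 4.4 might be sketched in Python as follows. This is an illustrative reading, not the original implementation: the name `renumber` is hypothetical, a list stands in for the vector a, a dictionary stands in for the vector t, and k is not needed as an argument because the list of observed labels grows as required.

```python
def renumber(labels):
    """Relabel clusters by order of first appearance, in the spirit of Algorithm 4.4."""
    observed = []  # role of vector a: labels in order of first appearance
    for label in labels:
        if label not in observed:
            observed.append(label)

    # Role of vector t: original label -> 1-based rank of first appearance.
    mapping = {old: new for new, old in enumerate(observed, start=1)}

    # Final pass: replace every label with its renumbered value.
    return [mapping[label] for label in labels]


print(renumber([3, 3, 3, 1, 1, 1, 1, 2, 2]))
# prints [1, 1, 1, 2, 2, 2, 2, 3, 3]
```

Applying `renumber` to every solution before comparison means that two encodings of the same partition compare equal as plain lists.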

This representation is advantageous as it can be used to represent a cluster of any shape. However, a disadvantage is that it may not scale well to large data sets, as each solution must be the same length as the number of objects in the data set. Larger solutions require more storage space and more execution time.

CHAPTER 4. SOLVING PROBLEMS WITH MULTIPLE OBJECTIVES 88

Algorithm 4.4 Renumbering Procedure

    Require: l = (l_1, ..., l_n)
    a ← (a_1, ..., a_k)
    t ← (t_1, ..., t_k)
    b ← 1
    for i = 1 to n do
        if l_i ∉ a then
            a_b ← l_i
            b ← b + 1
        end if
    end for
    for i = 1 to k do
        t_{a_i} ← i
    end for
    for i = 1 to n do
        l_i ← t_{l_i}
    end for
    return l

Krishna and Murty [90] describe an alternative, matrix based binary encoding that is conceptually similar. The representation is an n by k sparse matrix of binary values in which each row represents an object from the data set and each column represents a cluster; only one column in each row may be set to 1. This technique requires a pre-defined value of k and cannot be manipulated easily by common mutation and crossover operators, so in this work we use label based integer encoding and present the matrix encoding here only for comparison.
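To make the relationship between the two encodings concrete, the following sketch converts a label vector into an n by k binary matrix and back. The function names are hypothetical, and a nested Python list is used here in place of a true sparse matrix for simplicity.

```python
def labels_to_matrix(labels, k):
    """Build an n-by-k binary matrix with a single 1 per row (labels are 1-based)."""
    return [[1 if label == g else 0 for g in range(1, k + 1)] for label in labels]


def matrix_to_labels(matrix):
    """Recover the 1-based label vector from the column holding the 1 in each row."""
    return [row.index(1) + 1 for row in matrix]


matrix = labels_to_matrix([1, 3, 2], k=3)
print(matrix)
# prints [[1, 0, 0], [0, 0, 1], [0, 1, 0]]
```

The matrix stores n times k values where the label vector stores n, which is one way to see why the label based form is the more compact of the two.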

For our implementation, each $\vec{l}$ is randomly initialised. The value of k is selected from the range of integers 2,₁₀n and each position in the new $\vec{l}$ is set to a random integer in the range [1, k].
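A random initialisation of this kind could be sketched as below. The upper bound on k is passed in as a hypothetical parameter `k_max`, because this sketch makes no assumption about how that bound is derived from n; the function name and the optional `rng` argument are likewise illustrative.

```python
import random


def random_solution(n, k_max, rng=random):
    """Pick k in [2, k_max], then assign each of the n positions a label in [1, k]."""
    k = rng.randint(2, k_max)  # randint is inclusive at both ends
    labels = [rng.randint(1, k) for _ in range(n)]
    return k, labels


k, labels = random_solution(n=20, k_max=5, rng=random.Random(0))
print(k, labels)
```

Note that a randomly drawn vector is not guaranteed to use every label in [1, k], so some clusters may start empty.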

4.4.2.1 Mutation Operators

We experiment with mutation operators that manipulate either multiple positions or a single position, in either an unguided or a guided fashion, giving rise to four mutation operators. For each invocation of an operator that mutates multiple positions, a mutation probability is first determined by a random value drawn from the range [0, 1]. For each position, a further random value is then drawn to determine whether that position will be mutated. Where only one position is to be mutated, a random position is selected from the representation. The mutation performed on each position is either guided or unguided; that is, it either has knowledge of the data set and the present clustering solution or it does not.

Unguided Mutation. To manipulate a position in an unguided fashion, the value of the position is set to a random integer drawn from [1, k].
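The single-position and multiple-position unguided operators might look like the following sketch. The function names are hypothetical, and comparing each per-position draw against the invocation's mutation probability is one plausible reading of the description above. Python's `random.random()` draws from [0, 1) rather than the closed interval [0, 1], which is treated here as an acceptable approximation.

```python
import random


def mutate_single(labels, k, rng=random):
    """Set one randomly chosen position to a random label in [1, k]."""
    child = list(labels)
    position = rng.randrange(len(child))
    child[position] = rng.randint(1, k)
    return child


def mutate_multiple(labels, k, rng=random):
    """Draw a mutation probability once, then mutate each position independently."""
    rate = rng.random()  # per-invocation mutation probability
    return [
        rng.randint(1, k) if rng.random() < rate else label
        for label in labels
    ]


parent = [1, 1, 2, 2, 3, 3]
print(mutate_single(parent, k=3))
print(mutate_multiple(parent, k=3))
```

In this sketch the new label may coincide with the old one, so a "mutated" position does not always change value.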

Guided Mutation. Krishna and Murty [90] proposed a guided mutation operator in which the cluster memberships of objects are changed at random, with a weighting towards clusters that are close to the object. We calculate the probability that the object in the ith position is assigned to the gth cluster as follows:

$$
\Pr\{l_i = g\} = \frac{\delta_{\max} - \delta(\vec{x}_i, \vec{c}_g)}{\sum_{P_j \in P} \left( \delta_{\max} - \delta(\vec{x}_i, \vec{c}_j) \right)}, \quad \text{where } \delta_{\max} = \max_{P_j \in P} \delta(\vec{x}_i, \vec{c}_j) \tag{4.15}
$$

Here P is the clustering solution derived from the encoding $\vec{l}$. From this we can then assign an object to a cluster in a biased fashion.
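The biased assignment of Equation 4.15 can be sketched as follows. Several choices here are assumptions made for illustration only: δ is taken to be Euclidean distance, each cluster is summarised by a centroid supplied by the caller, and when every weight is zero (all centroids equidistant) the sketch falls back to a uniform choice, a case this section does not discuss. The function names are hypothetical, and `math.dist` requires Python 3.8 or later.

```python
import math
import random


def assignment_probabilities(x, centroids, distance=math.dist):
    """Probability of assigning object x to each cluster, following Equation 4.15."""
    distances = [distance(x, c) for c in centroids]
    d_max = max(distances)
    weights = [d_max - d for d in distances]
    total = sum(weights)
    if total == 0:
        # Assumed fallback: all centroids are equidistant, so choose uniformly.
        return [1.0 / len(centroids)] * len(centroids)
    return [w / total for w in weights]


def guided_mutation(x, centroids, rng=random):
    """Draw a new 1-based cluster label for x, biased towards nearby centroids."""
    probabilities = assignment_probabilities(x, centroids)
    labels = range(1, len(centroids) + 1)
    return rng.choices(labels, weights=probabilities, k=1)[0]


centroids = [(1.0, 0.0), (3.0, 0.0), (5.0, 0.0)]
print(assignment_probabilities((0.0, 0.0), centroids))
# roughly [0.667, 0.333, 0.0]
```

As the example shows, the cluster whose centroid is farthest from the object receives probability zero under this formula, so the operator never moves an object to its most distant cluster.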

4.4.2.2 Crossover Operators

The encodings used for LBIE are fixed length encodings, that is, for a given data set all of the solutions are of the same length. Therefore, we can experiment using the standard one-point, two-point, three-point and uniform crossover operators previously defined in section 4.4.1.2.
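Because both parents have the same length, crossover on label vectors reduces to exchanging positions. The sketch below shows one-point and uniform crossover using their common textbook definitions, which may differ in detail from those given in section 4.4.1.2; the function names and the 0.5 swap probability in the uniform case are assumptions.

```python
import random


def one_point_crossover(parent_a, parent_b, rng=random):
    """Swap the tails of two equal-length label lists at a random cut point."""
    cut = rng.randrange(1, len(parent_a))  # cut lies strictly inside the list
    child_a = parent_a[:cut] + parent_b[cut:]
    child_b = parent_b[:cut] + parent_a[cut:]
    return child_a, child_b


def uniform_crossover(parent_a, parent_b, rng=random):
    """Swap each position between the parents independently with probability 0.5."""
    child_a, child_b = [], []
    for a, b in zip(parent_a, parent_b):
        if rng.random() < 0.5:
            a, b = b, a
        child_a.append(a)
        child_b.append(b)
    return child_a, child_b


first = [1, 1, 1, 2, 2, 2, 2, 3, 3]
second = [1, 2, 1, 2, 1, 2, 3, 3, 3]
print(one_point_crossover(first, second))
print(uniform_crossover(first, second))
```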


In document Multi-objective evolutionary algorithms for data clustering (Page 102-105)