3.1.2 K-adaptive GCF Algorithm (K-adaptive GCF-I)

One possible improvement over the previous version of the algorithm (K-fixed GCF-I, Algorithm 4) is to compute the parameter K (the number of communities into which the graph is divided) during the execution of the evolutionary process. For this purpose, the encoding and the fitness function have been modified in the new version of the algorithm.

Algorithm 4: Genetic-based Community Finding Algorithm with a fixed number of communities (K-fixed GCF-I)

Input: A graph G = (V, E), where V = {v_1, ..., v_n} is a set of vertices and E is a set of edges e_ij indicating whether there is a connection between the vertices v_i and v_j; and the positive numbers ngen, µ, λ and mutpb, which are the main GA parameters to be fixed.

Output: K-best individuals

     1  C ← InitRandomPop(λ, |V|)
     2  for j ← 1 to λ do
     3      F_j ← Fitness(C_j)
     4  i ← 1
     5  convergence ← 0
     6  while i ≤ ngen ∧ convergence = 0 do
     7      Cbest ← SelectNBest(C, F, µ)
     8      C ← Cbest
     9      for j ← µ to λ do
    10          p1, p2 ← RandomSel(Cbest)
    11          c1, c2 ← Crossover(p1, p2)
    12          c1, c2 ← Mutation(c1, c2, mutpb)
    13          C ← C ∪ {c1, c2}
    14      i ← i + 1
    15      for j ← 1 to λ do
    16          F_j ← Fitness(C_j)
    17      convergence ← CheckConvergence(Cbest, C, F)
    18  Cbest ← ∅
    19  i ← 1
    20  count ← 0
    21  SortedCbest ← SortNBest(C, F)
    22  while count ≤ K do
    23      j ← 1
    24      stop ← 0
    25      while j ≤ count ∧ stop = 0 do
    26          if SortedCbest_i ⊆ Cbest_j ∨ Cbest_j ⊆ SortedCbest_i then
    27              Cbest ← Cbest / Cbest_j
    28              Cbest ← Cbest ∪ SelectBigCom(Cbest_j, SortedCbest_i)
    29              stop ← 1
    30          j ← j + 1
    31      if j = count then
    32          Cbest ← Cbest ∪ SortedCbest_i
    33          count ← count + 1
    34      i ← i + 1
    35  return Cbest

3.1.2.1 Encoding

In this new approach, a possible solution can contain a group of communities rather than a single community. For this reason, the genotypes (chromosomes) are represented as a set of binary vectors. Each allele represents a community and consists of a vector of binary values, one for each node in the graph. These binary vectors are similar to the chromosomes of the previous encoding: the value 1 means that the node belongs to the community, and the value 0 means that it does not. The number of binary vectors (communities) contained in the chromosome (group of communities) corresponds to the value of the parameter K, as shown in Figure 3.4.

    Chromosome (Size = N x K, with N = 8 and K = 2)
        Community 1:  0 1 0 1 1 0 0 0
        Community 2:  1 0 1 0 0 0 0 0

Figure 3.4: A chromosome representing a group of communities of the graph. Each allele is a particular community whose binary vector has one entry per node of the graph, indicating whether or not that node belongs to the community. N is the number of nodes in the graph. In this example the solution contains two vectors representing two different communities, hence K is equal to 2.

In this new encoding, the length of a chromosome varies with the number of communities (K) into which the graph is partitioned. This variable length is handled when the initial population is generated.

In that process, individuals of different sizes are created using an initial setup parameter of the algorithm, maxK, so that K ranges from 2 to maxK.

In addition, the crossover operator takes this parameter into account in order to generate valid individuals during the evolutionary process. Therefore, if a chromosome represents a graph partition of K communities, and the number of nodes of the graph is N (the length of each allele), the total number of bits contained in this chromosome is N x K (see Figure 3.4).
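The encoding can be illustrated with a minimal Python sketch. The names `Chromosome`, `num_communities`, `total_bits` and `members` are hypothetical and do not come from the original work; the example data are the two vectors shown in Figure 3.4.

```python
from typing import List

# A community is a binary vector of length N (one bit per graph node);
# a chromosome is a list of K such vectors.
Community = List[int]
Chromosome = List[Community]


def num_communities(chrom: Chromosome) -> int:
    """K: the number of binary vectors in the chromosome."""
    return len(chrom)


def total_bits(chrom: Chromosome) -> int:
    """N x K: the sum of the lengths of all community vectors."""
    return sum(len(community) for community in chrom)


def members(community: Community) -> List[int]:
    """0-based indices of the nodes whose bit is set to 1."""
    return [node for node, bit in enumerate(community) if bit == 1]


# The two vectors from Figure 3.4 (N = 8, K = 2).
example: Chromosome = [
    [0, 1, 0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0, 0, 0],
]
```

With this representation, `num_communities(example)` is 2 and `total_bits(example)` is 16, matching N x K for N = 8 and K = 2.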

3.1.2.2 The Clustering Centroid Fitness Function

The previous encoding only allows the use of metrics that measure how well a member belongs to one community, because each individual represents a single community. The new encoding, which can represent a group of communities, makes it possible to include measures between the different communities encoded in an individual.

For this new approach, a new fitness function called Centroid Fitness (CF) has been designed to measure the distance between the centres of the communities belonging to a particular chromosome. This new metric is called d_out and is represented in Figure 3.5. With this new measure, large distances between centres are desirable because they represent a bigger gap between clusters or communities.

Figure 3.5: Graph sample illustrating three communities and the distances that are calculated using the fitness function of the algorithm. The distance d_in represents the average distance calculated between the nodes which belong to a community. The distance d_out represents the distance between community centres.

Because this new measure can be calculated for each individual, a new fitness function can be designed that combines the Clustering Coefficient, the distance between nodes (d_in) and the distance between centres (d_out). The idea of this new fitness is to find a set of communities that satisfies all of the previously defined conditions: each community should be strongly connected and contain similar nodes, and its nodes should be as different as possible from those of the other communities.

The function defined is a simple weighted function. Let F_i be the fitness function, CC the clustering coefficient, d_in the distance between nodes, and d_out the distance between centres; the value of the new fitness is calculated as follows:

\[
F_i(CC, d_{in}, d_{out}) = w_1 \frac{CC_i}{\max\left(\{CC_i\}_{i=1}^{K}\right)} + w_2 \left(1 - \frac{d_{in_i}}{\max\left(\{d_{in_i}\}_{i=1}^{K}\right)}\right) + w_3 \frac{d_{out_i}}{\max\left(\{d_{out_i}\}_{i=1}^{K}\right)} \tag{3.2}
\]

where w_i are the weights given to each fitness component, with w_i ∈ (0, 1). The values were experimentally fixed to w_1 = 0.05, w_2 = 0.05 and w_3 = 0.9.
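The weighted combination in Equation 3.2 can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `centroid_fitness` is hypothetical, the CC, d_in and d_out values are assumed to be computed elsewhere, the handling of a zero maximum is a choice the text does not specify, and the input numbers are made up for the example.

```python
from typing import List, Sequence, Tuple


def centroid_fitness(
    cc: Sequence[float],
    d_in: Sequence[float],
    d_out: Sequence[float],
    weights: Tuple[float, float, float] = (0.05, 0.05, 0.9),
) -> List[float]:
    """Evaluate Equation 3.2 for each index i, given precomputed measures."""
    w1, w2, w3 = weights
    max_cc, max_in, max_out = max(cc), max(d_in), max(d_out)

    def ratio(value: float, maximum: float) -> float:
        # Assumption: a zero maximum yields a ratio of 0 (not stated in the text).
        return value / maximum if maximum > 0 else 0.0

    return [
        w1 * ratio(c, max_cc)             # higher clustering coefficient is better
        + w2 * (1.0 - ratio(di, max_in))  # smaller internal distance is better
        + w3 * ratio(do, max_out)         # larger distance between centres is better
        for c, di, do in zip(cc, d_in, d_out)
    ]


# Illustrative inputs only; these values are not taken from the source.
scores = centroid_fitness(cc=[0.5, 1.0], d_in=[2.0, 4.0], d_out=[3.0, 1.5])
```

For these inputs the two scores are approximately 0.95 and 0.5. With w_3 = 0.9, the first entry wins mainly because of its larger d_out, despite its lower clustering coefficient.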

3.1.2.3 The Algorithm

The GCF-I algorithm with adaptive K evolves using a standard GA. The steps of the process are similar to those previously described for the GCF-I algorithm with fixed K. However, the final subsumption process is no longer required, because every individual already represents a group of communities. In this new approach, the chromosome with the best fitness value is selected as the final solution (see Algorithm 5).

Algorithm 5: Genetic-based Community Finding Algorithm with an adaptive number of communities (K-adaptive GCF-I).

Output: Best individual

     1  C ← InitRandomPop(λ, |V|, maxK)
     2  for j ← 1 to λ do
     3      F_j ← Fitness(C_j)
     4  i ← 1
     5  convergence ← 0
     6  while i ≤ ngen ∧ convergence = 0 do
     7      Cbest ← SelectNBest(C, F, µ)
     8      C ← Cbest
     9      for j ← µ to λ do
    10          p1 ← RandomSel(Cbest)
    11          p2 ← RandomSel(Cbest)
    12          c1, c2 ← CrossoverComm(p1, p2)
    13          c1 ← MutationComm(c1, mutpb)
    14          c2 ← MutationComm(c2, mutpb)
    15          C ← C ∪ {c1, c2}
    16      i ← i + 1
    17      for j ← 1 to λ do
    18          F_j ← Fitness(C_j)
    19      convergence ← CheckConvergence(Cbest, C, F)
    20  return Best(Cbest)

During random population generation, an input parameter (maxK) specifies the maximum number of communities per individual (line 1). For each individual generated, a random value between 2 and maxK is selected; this is the number K of communities contained in that solution. The individual therefore consists of K binary vectors (each of length |V|), each representing a different community of the graph partition. These binary vectors are also randomly generated.
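A minimal Python sketch of this initialisation step is shown below. The function name `init_random_pop` and its parameters mirror line 1 of Algorithm 5 but are otherwise hypothetical. The sketch assumes that K is drawn uniformly from 2 to maxK and that each bit is set independently with probability 0.5; the text states neither distribution.

```python
import random
from typing import List, Optional

Chromosome = List[List[int]]


def init_random_pop(
    lam: int,
    n_nodes: int,
    max_k: int,
    rng: Optional[random.Random] = None,
) -> List[Chromosome]:
    """Create lam individuals, each with a random K in [2, max_k]."""
    rng = rng or random.Random()
    population: List[Chromosome] = []
    for _ in range(lam):
        k = rng.randint(2, max_k)  # randint is inclusive at both ends
        individual = [
            [rng.randint(0, 1) for _ in range(n_nodes)]  # one random community
            for _ in range(k)
        ]
        population.append(individual)
    return population


# Seeded generator so that the example is reproducible.
population = init_random_pop(lam=10, n_nodes=8, max_k=4, rng=random.Random(0))
```

Every individual in `population` holds between 2 and 4 binary vectors of length 8, so chromosome sizes differ across the population as described above.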

Because of the new encoding, the genetic operators had to be extended to work with groups of communities, as shown in Figure 3.6. To apply the crossover operator, the algorithm chooses a random crossover point. The communities preceding this point are then copied from both parents to create a new child, and the communities following this point are copied to create a second new child (sub-figure a). Once the crossover operator has finished, the mutation is executed: the algorithm randomly chooses some values of the vectors representing the communities (with probability mutpb) and changes them from 0 to 1 or vice versa (sub-figure b).

(a) Crossover (b) Mutation

Figure 3.6: Genetic operators for the K-adaptive GCF-I algorithm. In the crossover example (a), two groups of communities with different K are selected. As the example shows, the value of K adapts during the evolutionary process, in which the best graph partitions survive. In the mutation example (b), two randomly selected nodes from two different communities have been changed.
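The two operators can be sketched in Python as follows. The text does not pin down the exact recombination rule, so the crossover here is an assumption: a standard one-point crossover applied at the community level, in which the parents exchange the communities after the cut point. The mutation is likewise assumed to flip each bit independently with probability mutpb. The function names follow Algorithm 5, but the signatures are hypothetical.

```python
import random
from typing import List, Tuple

Chromosome = List[List[int]]


def crossover_comm(
    p1: Chromosome, p2: Chromosome, rng: random.Random
) -> Tuple[Chromosome, Chromosome]:
    """One-point crossover on whole communities (assumed tail swap)."""
    # The cut lies strictly inside the shorter parent, so each child
    # receives at least one community from each parent (requires K >= 2).
    cut = rng.randint(1, min(len(p1), len(p2)) - 1)
    c1 = [list(comm) for comm in p1[:cut] + p2[cut:]]
    c2 = [list(comm) for comm in p2[:cut] + p1[cut:]]
    return c1, c2


def mutation_comm(chrom: Chromosome, mutpb: float, rng: random.Random) -> Chromosome:
    """Flip each bit independently with probability mutpb (assumption)."""
    return [
        [1 - bit if rng.random() < mutpb else bit for bit in comm]
        for comm in chrom
    ]


# Parents with different K (3 and 2), as in the crossover panel of Figure 3.6.
parent1 = [[0, 1, 0, 1], [1, 0, 1, 0], [0, 0, 1, 1]]
parent2 = [[1, 1, 0, 0], [0, 0, 0, 1]]
child1, child2 = crossover_comm(parent1, parent2, random.Random(1))
mutated = mutation_comm(child1, mutpb=0.1, rng=random.Random(1))
```

Under this reading, the children inherit the K values of their parents in swapped order (here 2 and 3), so every child keeps between 2 and maxK communities without any repair step.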

In document Evolutionary Computation for Overlapping Community Detection in Social and Graph-based Information (Page 70-75)