Advanced Graph-based K-means Concept - Advanced Graph-based K-means

3.3 Advanced Graph-based K-means

3.3.2 Advanced Graph-based K-means Concept

Advanced Graph-based Kmeans (AGKmeans) is proposed in this thesis based on a Kmeans clustering algorithm as a new clustering technique. AGKmeans can be applied to datasets that can be modeled as graphs and initial centroid selection is not performed randomly.

In AGKmeans, initial centroids are selected based on the number of unique contacts, number of contacts and distances among nodes. By using the number of contacts and the number of unique nodes contacts, AGKmeans selects nodes with more frequent and diverse contacts as centroids. Nodes with frequent contacts have a shorter distance to their neighbour nodes. Diverse contact nodes have a variety of neighbours. If the initial centroids were not appropriate centroids, their diversity can increase the chance for selecting appropriate nodes as centroids. The selection of nodes with highly diverse contacts as initial centroids can decrease the probability of falling into a weak, but a locally optimal solution. Utilizing the distances among nodes as a factor for initial centroid selection can accelerate the appropriate cluster formation and decrease the number of iterations required to converging to a solution by selecting nodes from dierent clusters.

The Advanced Graph-based Kmeans algorithm uses the Dijkstra [16] shortest path algorithm. The AGK- means algorithm has two steps: Assignment step and update step. The algorithm is provided initial centroid nodes as the number of clusters. For initial node selection, nodes would be sorted by their degree and the summation of their edge weights which demonstrate the number of contacts and the number of unique contacts. Then, the distance among the top 2k nodes would be calculated by Dijkstra metric and k nodes with the longest distance between them would be selected as initial centroids.

For the assignment phase, each node would be located in the cluster which have the shortest path length to the node.

For the update phase, the centroid would be updated in a way to decrease the summation of the shortest path length between the assigned nodes and their centroid in each cluster. The centroids would be updated based on the shortest path to their adjacent nodes. For each centroid, the update would occur if one of the adjacent nodes has a shorter path length to the nodes of that cluster. This method would assist to nd the best nodes as centroids which would have the shortest paths to all the nodes of the cluster.

When a new centroid has been found, the algorithm proceeds by switching to assignment step, and this would be repeated up until the algorithm converges to the same cluster collection and no new assignment

changes the shortest path or is stopped by a prescribed number of iterations. Algorithm 1 demonstrates the pseudocode code for the AGKmeans clustering algorithm. AGKmeans has been implemented in python by NetworkX.2 _{AGKmeans is implemented as a mudule in clustering module.}

Algorithm 1 Advanced Graph-based kmeans clustering

Sort nodes based on their degree and the summation of their edge weights calculate the distances among the rst 2k nodes

Insert the rst k nodes with longest distance to the centroidlist assignment(G,centroidlist,df): for i in centroidnodes: do for j in G.nodes: do if !(shortest_path_length(i,j)) then shortest_path_length(i,j)=inf end if if shortest_path_length(G,i,j)<df[i][j]) then df[i][j]=shortest_path_length(G,i,j) shortest_path_length(i,j)=inf end if end for end for update(G,centroidnodes,df): for i in centroidlist: do for j in G.neighbours(i) do for k in cluster(i) do sump+=shortest_path_length(G,j,k) if sump < sum(df[i][k]): then

i= j end if end for end for end for

Figure 3.5 demonstrates the AGKmeans algorithm in a very simple social graph. In this social graph, the contact strength is shown as labels in some edges and for non-label edges, the weight is considered 1. The algorithm starts with k=2. At rst, nodes are sorted based on their degree and the connected edge weights. The top 4 nodes are selected which are shown in part b. Then, the distances among these nodes are calculated

2_{A Python package for the creation, manipulation, and study of the structure of networks}

and because nodes with labels a and d have the longest distance, these nodes are selected as initial centroids. The assignment phase of AGKmeans is demonstrated in part c, in which two clusters form. In the update phase of the algorithm, node b has been detected as a better centroid instead of a, because the summation of distances of nodes within the cluster to the centroid can improve from 27 to 24. The assignment phase and new cluster formation with centroids b and d are shown in part d. AGKmeans found an appropriate solution in one iteration. However, if nodes a and b have been selected as initial centroid, the algorithm needs more iterations for nding the same solution.

Figure 3.5: Advanced Graph-based kmeans clustering process

3.3.3 Summary

In chapter 3, PYDTNSIM provides a single framework for the evaluation of PSN algorithms. PYDTNSIM supports both unclustered and cluster-based routing algorithms of PSNs. Furthermore, AGKmeans has been developed as a new clustering technique to represent the eect of clustering techniques on the performance of the routing algorithms. PYDTNSIM provides this opportunity to evaluate the implemented algorithms with dierent contact pattern datasets, network resources capacities, message generation rates, and use case scenarios. In the next chapters, dierent use case scenarios and contact pattern extraction analyses are proposed and the experimental results with dierent circumstances by PYDTNSIM simulator would be demonstrated and analyzed.

Chapter 4 Experimental Design

To compare dierent unclustered and cluster-based routing algorithms in DTNs, several experiments were conducted. Experimental parameters were determined in the context of dierent datasets, use case scenarios, and participant's device resources. In this chapter, the experimental design including the denition of experimental parameters, specication of performance metrics, the implementation of algorithms, and the process of results visualization are described. The process of generating contact patterns and descriptive statistics on the contact pattern properties for each dataset are also provided.

In document A Simulation Performance Study of PSN Algorithms in Practical Use Cases (Page 36-39)