Metrics based on internal connectivity - Publications and Contributions


2.1.1 Metrics based on internal connectivity

1. Triangle Participation Ratio (TPR) [182]:

This is a measure of graph cohesion, defined as the fraction of nodes in a graph that belong to at least one triangle. In graph theory, a triangle is a planar undirected graph with 3 vertices and 3 edges that forms a complete graph, as shown in Figure 2.1.

Figure 2.1: An example of a triangle in a graph.

Taking this into account, the TPR can be defined as follows:

$$TPR = \frac{\left|\{u : u \in V,\ \{(v, w) : v, w \in V,\ (u, v) \in E,\ (u, w) \in E,\ (v, w) \in E\} \neq \emptyset\}\right|}{|V|} \qquad (2.1)$$

This ratio takes values from 0 to 1. If every node belongs to a triangle, the TPR is equal to 1; if no node belongs to a triangle, it is 0.
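The definition in Equation 2.1 can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `triangle_participation_ratio`, the edge-list input and the example graph are assumptions made here, not part of the original text.

```python
from itertools import combinations


def triangle_participation_ratio(nodes, edges):
    """Fraction of nodes that belong to at least one triangle (Eq. 2.1)."""
    # Build an undirected adjacency structure; self-loops are ignored (assumption).
    adj = {u: set() for u in nodes}
    for u, v in edges:
        if u != v:
            adj[u].add(v)
            adj[v].add(u)

    in_triangle = 0
    for u in nodes:
        # u lies on a triangle if at least two of its neighbours are adjacent.
        if any(w in adj[v] for v, w in combinations(adj[u], 2)):
            in_triangle += 1
    return in_triangle / len(nodes) if nodes else 0.0


# Hypothetical example: a triangle 1-2-3 with a tail 3-4-5.
nodes = [1, 2, 3, 4, 5]
edges = [(1, 2), (2, 3), (1, 3), (3, 4), (4, 5)]
print(triangle_participation_ratio(nodes, edges))  # 3 of the 5 nodes lie on a triangle
```

In this example only nodes 1, 2 and 3 participate in a triangle, so the ratio lies strictly between the two extreme values.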

2. Local Clustering Coefficient (LCC) [55, 176]:

This coefficient measures the transitivity of a node in the graph. It is usually used for undirected graphs, where it represents the probability that two neighbours of a vertex are connected. Transitivity in a graph can be observed when triangles are formed.

As defined by Watts and Strogatz [176], suppose that a vertex v_i has k_i neighbours; then a maximum of k_i(k_i − 1)/2 edges can exist between them. This happens when every neighbour of v_i is connected to every other neighbour of v_i. Taking this into account, the LCC of a node v_i is defined as the ratio between the number of connected pairs among the neighbours of v_i and this maximum possible number of edges:

$$LCC_i = \frac{2 \times \sum_{j,h} a_{jh}\, a_{ij}\, a_{ih}\, a_{ji}\, a_{hi}}{k_i (k_i - 1)} \qquad (2.2)$$

The LCC takes values ranging from 0 to 1. A value of 0 means that the neighbours of the node share no connections between them, so the node shows no clustering, whereas a value of 1 means that the neighbours are completely connected, as shown in Figure 2.2.

Figure 2.2: Examples of Local Clustering Coefficient for the node V1 showing different possible graphs, and their related LCC value.
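As a minimal sketch of Equation 2.2, the following Python code counts the connected pairs among the neighbours of a node. It assumes that the sum counts each unordered pair of neighbours once (consistent with the k_i(k_i − 1)/2 maximum stated above) and that a node with fewer than two neighbours gets an LCC of 0. The helper names and the example graphs are illustrative and are not the graphs of Figure 2.2.

```python
from itertools import combinations


def build_adjacency(edges):
    """Undirected adjacency sets from an edge list (hypothetical helper)."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def local_clustering(adj, i):
    """LCC of node i: connected neighbour pairs over k_i(k_i - 1)/2 (Eq. 2.2)."""
    k = len(adj[i])
    if k < 2:
        return 0.0  # convention assumed here; the ratio is undefined for k_i < 2
    # Count each unordered pair {j, h} of neighbours that is itself an edge.
    linked_pairs = sum(1 for j, h in combinations(adj[i], 2) if h in adj[j])
    return 2 * linked_pairs / (k * (k - 1))


# Node 1 with three neighbours and a growing number of links among them.
spokes = [(1, 2), (1, 3), (1, 4)]
print(local_clustering(build_adjacency(spokes), 1))                             # no links among neighbours
print(local_clustering(build_adjacency(spokes + [(2, 3)]), 1))                  # one of three possible links
print(local_clustering(build_adjacency(spokes + [(2, 3), (2, 4), (3, 4)]), 1))  # all three links
```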

3. Local Weighted Clustering Coefficient (LWCC) [15]:

The definition of LCC can be extended to study weighted graphs. Following the same assumptions as in the LCC definition, let W be the weight matrix with coefficients w_ij and A be the adjacency matrix with coefficients a_ij, and define:

$$S_i = \sum_{j=1}^{|V|} a_{ij}\, a_{ji}\, w_{ij} \qquad (2.3)$$

Then, the Local Weighted Clustering Coefficient can be defined as:

$$LCC_i^w = \frac{2 \times \sum_{j,h} \frac{(w_{ij} + w_{ih})}{2}\, a_{jh}\, a_{ij}\, a_{ih}\, a_{ji}\, a_{hi}}{S_i (k_i - 1)} \qquad (2.4)$$

This new definition still considers the connections between the neighbours of a particular node, but adds the weight information of the edges that link them to the original node. The measure takes into account how the weights of the analysed node are distributed, and shows how good the connections of that cluster are. The LWCC has the same value as the LCC when all the weights are equal.
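Equations 2.3 and 2.4 can be sketched as follows. The sketch assumes a symmetric, strictly positive weight for every edge, counts each unordered pair of neighbours once, and returns 0 for nodes with fewer than two neighbours. The dict-of-dicts representation and the names `build_weights` and `local_weighted_clustering` are illustrative choices.

```python
from itertools import combinations


def build_weights(weighted_edges):
    """Symmetric weight map w[u][v] = w[v][u] from (u, v, weight) triples."""
    w = {}
    for u, v, weight in weighted_edges:
        w.setdefault(u, {})[v] = weight
        w.setdefault(v, {})[u] = weight
    return w


def local_weighted_clustering(w, i):
    """LWCC of node i following Eq. 2.3 (strength S_i) and Eq. 2.4."""
    neighbours = list(w[i])
    k = len(neighbours)
    if k < 2:
        return 0.0  # convention assumed here
    strength = sum(w[i][j] for j in neighbours)  # S_i in Eq. 2.3
    # For each connected pair of neighbours, add the mean weight of the
    # two edges that join the pair to node i.
    total = sum((w[i][j] + w[i][h]) / 2
                for j, h in combinations(neighbours, 2) if h in w[j])
    return 2 * total / (strength * (k - 1))


# Node 1 has neighbours 2, 3, 4; only 2 and 3 are linked to each other.
equal = build_weights([(1, 2, 1.0), (1, 3, 1.0), (1, 4, 1.0), (2, 3, 1.0)])
heavy = build_weights([(1, 2, 5.0), (1, 3, 5.0), (1, 4, 1.0), (2, 3, 1.0)])
print(local_weighted_clustering(equal, 1))  # coincides with the unweighted LCC
print(local_weighted_clustering(heavy, 1))  # larger: the heavy edges lie on the triangle
```

With equal weights the result reduces to the unweighted coefficient, which matches the remark that LWCC and LCC coincide in that case.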

4. Global Clustering Coefficient (GCC) [176]:

This coefficient measures the global transitivity of the graph, providing a general overview of its structure. GCC is defined as three times the number of triangles divided by the number of connected triplets in the graph:

$$GCC = \frac{3 \times |Triangles|}{|Triples|} \qquad (2.5)$$

It takes values from 0 to 1. If all possible connections among the neighbours of every node in the graph are present, GCC is 1. A network with GCC close to 1 contains highly connected clusters. Conversely, if no two neighbours of any node are connected to each other, the coefficient is 0.
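A minimal sketch of Equation 2.5 is given below. It assumes that a connected triplet is a path of length two centred on a node (so a node of degree k contributes k(k − 1)/2 triplets) and that a graph without triplets gets a GCC of 0. The function name and the example graph are illustrative.

```python
from itertools import combinations


def global_clustering(adj):
    """GCC = 3 * triangles / connected triplets (Eq. 2.5)."""
    # Each node of degree k is the centre of k(k - 1)/2 connected triplets.
    triples = sum(len(nb) * (len(nb) - 1) // 2 for nb in adj.values())
    # Closed triplets: pairs of neighbours of v that are themselves linked.
    closed = sum(1 for v in adj
                 for j, h in combinations(adj[v], 2) if h in adj[j])
    triangles = closed // 3  # every triangle is seen once per corner
    return 3 * triangles / triples if triples else 0.0


# Hypothetical example: a triangle 1-2-3 with a pendant node 4 attached to 3.
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
print(global_clustering(adj))  # 1 triangle, 5 connected triplets
```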

5. Density (D) [174]:

This is defined as the number of edges in the graph G divided by the total number of possible edges. It can be expressed as follows:

$$D = \frac{m}{n(n - 1)/2} \qquad (2.6)$$

where n = |V| is the number of nodes in the graph and m = |E| is the number of edges.

The density is a real value between 0 and 1. A graph without edges, in which all nodes are isolated, has a density of 0, whereas a fully connected graph, in which every node is connected to all the other nodes in the network, has a density of 1.
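Equation 2.6 translates directly into code. The short sketch below assumes a simple undirected graph (no self-loops or parallel edges) and returns 0 for graphs with fewer than two nodes, where the denominator vanishes; the function name is illustrative.

```python
def density(n, m):
    """Density of a simple undirected graph with n nodes and m edges (Eq. 2.6)."""
    if n < 2:
        return 0.0  # no edge is possible; convention assumed here
    return m / (n * (n - 1) / 2)


print(density(4, 0))  # four isolated nodes
print(density(4, 3))  # e.g. a path on four nodes: 3 of 6 possible edges
print(density(4, 6))  # complete graph on four nodes
```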

6. Clique Number (CN) [10]:

A clique of a graph is a subset of mutually adjacent vertices in V (every two vertices in the subset are connected by an edge), which means that the subgraph induced by these nodes is complete. A clique is called maximal if it is not contained in any other clique, as shown in Figure 2.3. The size of the largest clique is called the clique number of the graph and is denoted by ω(G) [10].

The Maximum Clique Problem is a classic NP-hard problem (its decision version is NP-complete), and several algorithms have been proposed in the literature to tackle it. The Bron-Kerbosch algorithm [34], based on a recursive backtracking search, is one of the best-known and most widely used methods. Tomita et al. [167] proposed a technique similar to the Bron-Kerbosch algorithm that uses a depth-first search with pruning methods. Finally, in recent years, different variations of the Bron-Kerbosch algorithm have been implemented for larger graphs, such as the method presented by Eppstein et al. [62].

Figure 2.3: Examples of different cliques contained in a graph with 5 nodes. Panels: not a clique; 3-clique (not maximal); 4-clique (maximal).
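To make the backtracking idea concrete, the following is a minimal sketch of a Bron-Kerbosch style search that returns the clique number. The pivot rule used here (pick the vertex with most neighbours among the candidates) is in the spirit of the pruning mentioned above, but the code is an illustrative assumption and not the exact procedure of [34], [167] or [62]. The example graph is hypothetical and is not taken from Figure 2.3.

```python
def clique_number(adj):
    """Size of the largest clique, found by enumerating maximal cliques."""
    best = 0

    def expand(size, candidates, excluded):
        # size: number of vertices in the current clique
        # candidates: vertices that can still extend it
        # excluded: vertices already explored for this clique
        nonlocal best
        if not candidates and not excluded:
            best = max(best, size)  # the current clique is maximal
            return
        # Pivoting: only vertices outside the pivot's neighbourhood are tried.
        pivot = max(candidates | excluded,
                    key=lambda u: len(candidates & adj[u]))
        for v in list(candidates - adj[pivot]):
            expand(size + 1, candidates & adj[v], excluded & adj[v])
            candidates.remove(v)
            excluded.add(v)

    expand(0, set(adj), set())
    return best


# Five nodes: {1, 2, 3, 4} form a 4-clique and node 5 hangs off node 4.
adj = {
    1: {2, 3, 4},
    2: {1, 3, 4},
    3: {1, 2, 4},
    4: {1, 2, 3, 5},
    5: {4},
}
print(clique_number(adj))  # the largest clique is {1, 2, 3, 4}
```

The worst-case running time remains exponential in the number of nodes, which is why the variants cited above focus on pruning and vertex ordering.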

7. Centralization (C_D) [74]:

This metric is based on the degree centrality of the nodes of a graph, defined as the number of edges incident upon each of them. Accordingly, the degree centrality (C_d(v_i)) of a vertex v_i in a given undirected graph G is the size of its neighbourhood (k_i) (see Figure 2.4). For directed graphs, two distinct measures of degree centrality can be defined, namely indegree and outdegree: indegree centrality is the number of links directed to the node, whereas outdegree centrality is the number of links that the node directs to the other nodes of the graph.

Figure 2.4: Example of degree centrality. In this graph, the degree centrality of node V1 is C_d(V1) = 6, whereas for the remaining nodes this value is 1.

The definition of degree centrality on the node level can be extended to the whole graph.

Freeman [74] provided a measure of graph centralization based on the differences between node centralities. This index of graph centralization has two main features: (1) it should index the degree to which the centrality of the most central node exceeds the centrality of all other nodes, and (2) it should be expressed as a ratio of that excess to the maximum possible value for a graph containing the same number of nodes. The maximum difference between node centralities occurs when the graph contains one central node to which all other nodes are connected (a star graph), as shown in Figure 2.4. In this case the central node has degree n − 1 and each of the other n − 1 nodes has degree 1, so the sum of differences equals (n − 1)(n − 2) = n² − 3n + 2, where n = |V|. Thus, letting v* be the node with the highest degree centrality, the degree centralization of the graph G is defined as follows:

$$C_d(G) = \frac{\sum_{i=1}^{n} [C_d(v^*) - C_d(v_i)]}{n^2 - 3n + 2} \qquad (2.7)$$

As previously mentioned, Freeman proved that this measure takes its maximum value, 1, for those graphs whose topology is a star or a wheel, whereas decentralized graphs are characterized by having a centralization close to 0.
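Equation 2.7 can be checked numerically with a short sketch. It assumes an undirected graph stored as adjacency sets and returns 0 when n < 3, where the denominator n² − 3n + 2 is zero; the function name and the two test graphs are illustrative.

```python
def degree_centralization(adj):
    """Freeman degree centralization of an undirected graph (Eq. 2.7)."""
    n = len(adj)
    if n < 3:
        return 0.0  # denominator (n - 1)(n - 2) vanishes; convention assumed here
    degrees = [len(neighbours) for neighbours in adj.values()]
    d_max = max(degrees)  # C_d(v*), the highest degree centrality
    return sum(d_max - d for d in degrees) / (n * n - 3 * n + 2)


# Star with one centre and six leaves (same shape as described for Figure 2.4).
star = {0: set(range(1, 7)), **{leaf: {0} for leaf in range(1, 7)}}
# Cycle on five nodes: every node has degree 2.
cycle = {i: {(i - 1) % 5, (i + 1) % 5} for i in range(5)}

print(degree_centralization(star))   # maximally centralized
print(degree_centralization(cycle))  # all degrees equal, no centralization
```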

8. Heterogeneity (H) [57]:

The heterogeneity of the degree distribution has been the focus of considerable research in recent years. Many measures of network heterogeneity are based on the variance of the connectivity. This measure captures the tendency of a network to contain "hub" nodes. Dong and Horvath [57] defined it as the coefficient of variation of the connectivity distribution:

$$H = \frac{\sqrt{\mathrm{variance}(k)}}{\mathrm{mean}(k)} \qquad (2.8)$$
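As a rough sketch of Equation 2.8, the coefficient of variation of the degree sequence can be computed with the Python standard library. The text does not state whether the population or the sample variance is meant; the population variance is assumed here, and a graph whose mean degree is 0 is given a heterogeneity of 0. The function name and example graphs are illustrative.

```python
import math
import statistics


def heterogeneity(adj):
    """Coefficient of variation of the node degrees (Eq. 2.8)."""
    degrees = [float(len(neighbours)) for neighbours in adj.values()]
    mean_k = statistics.mean(degrees)
    if mean_k == 0:
        return 0.0  # no edges at all; convention assumed here
    # Population variance is an assumption; the source does not specify it.
    return math.sqrt(statistics.pvariance(degrees)) / mean_k


# A star has one hub, so its degrees vary strongly; a cycle is regular.
star = {0: set(range(1, 7)), **{leaf: {0} for leaf in range(1, 7)}}
cycle = {i: {(i - 1) % 5, (i + 1) % 5} for i in range(5)}

print(heterogeneity(star))   # clearly above 0 because of the hub
print(heterogeneity(cycle))  # 0.0, all nodes have the same degree
```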

In document Evolutionary Computation for Overlapping Community Detection in Social and Graph-based Information (Page 34-38)