John M. Doe
4.1 identifiable structural patterns
Before attempting community detection over user interactions networks from microblogging, a preliminary experiment is performed to build evidence of identifiable structural patterns in these networks. The purpose of this preliminary analysis is to motivate the idea that microblogging user interactions networks do in fact have structural patterns that can potentially be identified using community detection approaches.
Therefore, to provide preliminary evidence of distinctive structural patterns in the network of user interactions, a comparative analysis of users in the ground-truth functional communities and randomly chosen connected nodes with the same shortest path distribution is proposed [YL15]. If such distinctive connectivity patterns exist compared to randomly selected sets of connected nodes, then structural community detection algorithms will likely be able to discover the functional communities based on their network connectivity.
The sets of nodes used for comparison in this analysis are defined next. For every ground-truth community Ci (of any type) in the experimental datasets, a corresponding non-community ˜Ci is formed from the user interactions network under the following conditions:
1. ˜Ci must be of the same size as Ci
2. like every Ci, ˜Ci must also be connected
3. users in ˜Ci must have the same distribution of shortest path distances as Ci
In microblogging, the first and third constraints are not easily satisfiable. Both constraints are approached by first computing the χ² distances [PW10] between the shortest path length histograms of every Ci and of all potential candidates ˜Ci. Then, for the first constraint, if it is not possible to find a non-community ˜Ci of the same size as a ground-truth community Ci, the closest candidate ˜Ci that has at least 75 % of the size of Ci is selected. Likewise, for the third constraint, if an exact match cannot be found, the candidates are ranked by how closely their distribution matches that of Ci and the closest candidate ˜Ci is selected. In case of multiple candidates, one is selected randomly.
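The histogram comparison described above can be sketched as follows. This is a minimal illustration and not the author's implementation: it assumes that shortest paths are measured inside the sub-network induced by each node set, and it uses the common symmetric form of the χ² distance, 0.5 · Σ (a − b)² / (a + b); the exact variant taken from [PW10] is not stated in the text. The function names and the toy adjacency dictionary are hypothetical.

```python
from collections import Counter, deque


def path_length_histogram(adj, nodes):
    """Normalised histogram of shortest path lengths between members of `nodes`.

    Assumption: distances are measured inside the sub-network induced by `nodes`.
    `adj` maps every node to the set of its neighbours (undirected, unweighted).
    """
    nodes = set(nodes)
    counts = Counter()
    for source in nodes:
        # Breadth-first search restricted to the induced sub-network.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v in nodes and v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        for target, d in dist.items():
            if target != source:
                counts[d] += 1  # every pair is counted twice, which cancels on normalisation
    total = sum(counts.values())
    return {d: c / total for d, c in counts.items()} if total else {}


def chi2_distance(h1, h2):
    """Symmetric chi-squared distance between two histograms (assumed variant)."""
    bins = set(h1) | set(h2)  # every bin here has a positive count in h1 or h2
    return 0.5 * sum(
        (h1.get(b, 0.0) - h2.get(b, 0.0)) ** 2 / (h1.get(b, 0.0) + h2.get(b, 0.0))
        for b in bins
    )


# Toy network: a triangle 1-2-3 with a tail 3-4-5.
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3, 5}, 5: {4}}
community = {1, 2, 3}   # plays the role of Ci
candidate = {3, 4, 5}   # plays the role of a candidate non-community
print(chi2_distance(path_length_histogram(adj, community),
                    path_length_histogram(adj, candidate)))
```

A distance of zero would correspond to the exact match required by the third constraint; larger values indicate candidates whose shortest path distribution is further from that of Ci.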
After every ground-truth community Ci is paired with a suitable non-community ˜Ci, the structural properties used to compare the structural patterns of ground-truth communities Ci and non-communities ˜Ci in the interactions network G = (V, E, W) are defined below. For this analysis, the edge weights W are not considered; recall that every community is guaranteed to have at least three members (refer to Chapter 3).
clustering coefficient (cc) [WS98]
This property is defined as the average local clustering coefficient of all the nodes in a community C in the undirected sub-network GC = (VC, EC) induced by C. The local clustering coefficient of a node is the number of edges between the nodes within its neighbourhood divided by the number of edges that could possibly exist between them.
In [WS98], this metric is used to measure how likely a set of nodes, i.e. a community C, is to form a small-world network, where the distance Li between two randomly chosen nodes follows the proportionality Li ∝ log(n), with n the number of nodes in the network.
A small-world network has a relatively high clustering coefficient but a small mean shortest path length. The clustering coefficient is in the range [0, 1], where values closer to one indicate a community that is closer to being a clique, i.e. a complete graph.
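As an illustration of the definition above, the following sketch computes the average local clustering coefficient over the sub-network induced by a community. It assumes an adjacency dictionary mapping each node to a set of neighbours, and it treats nodes with fewer than two neighbours inside the community as having a local coefficient of zero; both choices are assumptions, and the function name is hypothetical.

```python
from itertools import combinations


def average_clustering(adj, community):
    """Average local clustering coefficient in the sub-network induced by `community`."""
    members = set(community)
    total = 0.0
    for u in members:
        neighbours = adj[u] & members          # neighbourhood restricted to the community
        k = len(neighbours)
        if k < 2:
            continue                           # assumption: coefficient is 0 for such nodes
        # Edges that actually exist between pairs of neighbours of u.
        links = sum(1 for a, b in combinations(neighbours, 2) if b in adj[a])
        total += links / (k * (k - 1) / 2)     # divided by the edges that could exist
    return total / len(members)


# Toy community: a triangle 1-2-3 with a pendant node 4 attached to node 3.
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
print(average_clustering(adj, {1, 2, 3, 4}))
```

In this toy case nodes 1 and 2 have coefficient 1, node 3 has coefficient 1/3 and node 4 contributes 0, so the average is 7/12.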
average degree (avgdeg) [Bon76]
This property measures the average degree of a set of nodes, i.e. a community C. The degree of a single node is the number of edges connected to that node, and the average degree of a community C is defined as 2|EC|/|VC|, where GC = (VC, EC) is the undirected sub-network induced by the community C. It is in the range [0, ∞), where higher is better.
edge density [RCC+04]
This property measures how similar a set of nodes, i.e. a community C, is to a clique structure. The edge density of a community C is defined as 2|EC|/(|VC|(|VC| − 1)), where GC = (VC, EC) is the undirected sub-network induced by the community C. Density is in the range [0, 1], where values closer to one are better.
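The two formulas 2|EC|/|VC| and 2|EC|/(|VC|(|VC| − 1)) can be evaluated directly from the induced sub-network. The sketch below assumes an adjacency dictionary of neighbour sets and relies on the guarantee that every community has at least three members, so the denominators are never zero; the function names are hypothetical.

```python
def induced_edge_count(adj, community):
    """Number of undirected edges |EC| with both endpoints inside `community`."""
    members = set(community)
    # Each induced edge is seen once from each endpoint, hence the division by two.
    return sum(len(adj[u] & members) for u in members) // 2


def average_degree(adj, community):
    """Average degree 2|EC| / |VC| of the induced sub-network."""
    return 2 * induced_edge_count(adj, community) / len(set(community))


def edge_density(adj, community):
    """Edge density 2|EC| / (|VC| (|VC| - 1)) of the induced sub-network."""
    n = len(set(community))
    return 2 * induced_edge_count(adj, community) / (n * (n - 1))


# Toy community: a triangle 1-2-3 with a pendant node 4 attached to node 3.
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
print(average_degree(adj, {1, 2, 3, 4}), edge_density(adj, {1, 2, 3, 4}))
```

With four nodes and four edges, the toy community has an average degree of 2.0 and a density of 8/12; restricting it to the triangle gives a density of 1.0, i.e. a clique.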
cohesiveness [LLM10]
This property measures the fraction of total edges possible between a set of nodes, i.e. a community C, that are non-bridging. A non-bridge edge is one whose removal preserves the number of connected components in C. This measure captures how resilient the community C is, and it is in the range [0, 1], where values closer to one indicate a stronger community that is harder to split or fragment.
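The non-bridge test in this definition can be sketched by removing each edge in turn and recounting connected components. The sketch below is deliberately naive and only meant to mirror the wording of the definition. It assumes that the fraction is taken over the edges present in the induced sub-network; if the denominator is instead the number of possible node pairs, only the final division changes. The function names are hypothetical.

```python
def count_components(nodes, edges):
    """Number of connected components of the undirected graph (nodes, edges)."""
    adj = {u: set() for u in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen = set()
    components = 0
    for start in nodes:
        if start in seen:
            continue
        components += 1
        stack = [start]
        seen.add(start)
        while stack:                      # depth-first traversal of one component
            u = stack.pop()
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
    return components


def non_bridge_fraction(nodes, edges):
    """Fraction of existing edges whose removal keeps the component count unchanged.

    Assumption: the denominator is the number of edges in the induced sub-network.
    """
    edges = list(edges)
    if not edges:
        return 0.0
    base = count_components(nodes, edges)
    non_bridges = sum(
        1 for i in range(len(edges))
        if count_components(nodes, edges[:i] + edges[i + 1:]) == base
    )
    return non_bridges / len(edges)


# Toy community: a triangle 1-2-3 with a pendant node 4 attached to node 3.
nodes = [1, 2, 3, 4]
edges = [(1, 2), (2, 3), (1, 3), (3, 4)]
print(non_bridge_fraction(nodes, edges))
```

Only the edge (3, 4) is a bridge here, so three of the four edges are non-bridging. For larger communities a linear-time bridge-finding algorithm would be preferable to this quadratic loop.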
Table 4.1: Ratio between structural properties of ground-truth functional communities and randomly chosen nodes with similar shortest path distribution for two representative experimental datasets.

(a) RTE2015
C. Type      CC        AvgDeg    Density   Cohesiv   All > 1.0
cities       0.4034    1.0178    0.9169    0.5489
countries    0.4365    0.9958    0.9510    0.6912
hashtags     2.1117    1.2542    1.0885    2.0287    Yes
mentions     3.7619    1.7942    1.3538    3.1683    Yes
places       0.3981    0.9914    0.9329    0.4795
quotes       2.3291    1.3907    1.1491    2.2839    Yes
retweets     2.8460    1.6003    1.1834    2.6283    Yes
urls         2.6746    1.2983    1.1495    2.4909    Yes
Average      1.8702    1.2928    1.0906    1.7900    Yes

(b) Ireland2017
C. Type      CC        AvgDeg    Density   Cohesiv   All > 1.0
cities       22.2567   1.1828    1.0691    12.4548   Yes
countries    8.6811    1.0263    1.0205    6.8682    Yes
hashtags     31.8735   1.2197    1.1139    15.7759   Yes
mentions     48.7442   1.4785    1.2394    22.7188   Yes
places       11.1738   1.0794    1.0380    7.5388    Yes
quotes       38.2470   1.3039    1.1757    20.3443   Yes
Average      26.8294   1.2151    1.1094    14.2835   Yes
Each structural property p above is computed for every Ci and ˜Ci, and then the average ratio r = p(Ci)/p(˜Ci) is computed for all community types in each experimental dataset. If r > 1.0, then a measurable difference in the structural property p for Ci compared to ˜Ci can be asserted.
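A minimal sketch of this ratio computation is given below. It assumes that r is the mean of the per-pair ratios p(Ci)/p(˜Ci) rather than a ratio of means, and that pairs whose non-community value is zero are skipped; neither detail is specified in the text, and the function name and input values are purely illustrative.

```python
def average_ratio(pairs):
    """Mean of p(Ci) / p(~Ci) over (community value, non-community value) pairs.

    Assumptions: mean of per-pair ratios; pairs with a zero denominator are skipped.
    """
    ratios = [p_c / p_nc for p_c, p_nc in pairs if p_nc != 0]
    return sum(ratios) / len(ratios) if ratios else float("nan")


# Illustrative values only (not taken from the experimental datasets).
pairs = [(2.0, 1.0), (1.0, 2.0), (0.5, 0.0)]
r = average_ratio(pairs)
print(r, r > 1.0)
```

Here the first two pairs give ratios of 2.0 and 0.5, the third is skipped, and the resulting r of 1.25 would count as a measurable difference under the r > 1.0 criterion.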
The results for two representative datasets, RTE2015 and Ireland2017, are in Table 4.1¹. The WorldCup2014 dataset has similar results to RTE2015. In all datasets, the property ratios averaged over all community types are larger than one. Furthermore, Ireland2017 is the only dataset where all the community types have property ratios greater than one. In contrast, the remaining datasets do not exhibit ratios above one across all properties for the countries, cities and places community types.
This is explained by the low number of communities built for these types (refer to Table 3.3).
In the example of RTE2015, ground-truth functional communities have, on average, an 87 % higher clustering coefficient, 29 % higher average degree, 9 % higher edge density and 79 % higher cohesiveness than their respective non-communities. In the case of Ireland2017, the ground-truth communities have, on average, a ≈27 times higher clustering coefficient, 22 % higher average degree, 11 % higher edge density and ≈14 times higher cohesiveness. All the obtained results suggest that the ground-truth functional communities have more community-like structural properties compared to randomly chosen nodes in the same network with nearly the same shortest path distribution.
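The percentages quoted above follow from the Average rows of Table 4.1 as (r − 1) · 100, rounded to the nearest integer. The short check below reproduces them; the ratio values are copied from the table, while the dictionary layout and function name are only illustrative.

```python
# Average rows of Table 4.1 (ratios of community to non-community values).
averages = {
    "RTE2015": {"CC": 1.8702, "AvgDeg": 1.2928, "Density": 1.0906, "Cohesiv": 1.7900},
    "Ireland2017": {"CC": 26.8294, "AvgDeg": 1.2151, "Density": 1.1094, "Cohesiv": 14.2835},
}


def percent_higher(ratio):
    """Convert a ratio r into the rounded percentage by which it exceeds one."""
    return round((ratio - 1.0) * 100)


for dataset, row in averages.items():
    for prop, ratio in row.items():
        print(f"{dataset:12s} {prop:8s} ratio={ratio:8.4f}  +{percent_higher(ratio)} %")
```

For the two Ireland2017 ratios that are far above one (CC and Cohesiv), the text reports the rounded ratio itself (≈27 and ≈14 times) rather than a percentage.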
¹ All the structural properties can be found in Section B.1 in the appendices.