
are not suited to the ranking of nodes: indeed, we work with so-called small-world networks, which have a low diameter and an even smaller average distance. In a typical graph, the average distance between two nodes s and t is between 1 and 10, meaning that most of the n centrality values lie in this range. In order to obtain a ranking, we need the error to be close to 10/n, which might be very small. Nevertheless, an approximation algorithm was proposed in [128], where the sampling technique developed in [72] was used to actually compute the top-k nodes: the result is not guaranteed to be exact, but it is exact with high probability. The authors proved that the time complexity of their algorithm is $O(mn^{2/3} \log n)$, under the rather strong assumption that closeness centralities are uniformly distributed between 0 and D (in the worst case, the time complexity of this algorithm is O(mn)).

Other approaches have tried to develop incremental algorithms that might be more suited to real-world networks. For instance, in [109], the authors develop heuristics to determine the k most central nodes in a varying environment. Furthermore, in [142], the authors consider the problem of updating the closeness centrality of all nodes after edge insertions or deletions: in some cases, the time needed for the update could be orders of magnitude smaller than the time needed to recompute all centralities from scratch.

Finally, some works have tried to exploit properties of real-world networks in order to find more efficient algorithms. In [106], the authors develop a heuristic to compute the k most central nodes according to different measures. The basic idea is to identify central nodes according to an “easy” centrality measure (for instance, degree of nodes), and then to inspect a small set of central nodes according to this measure, hoping it contains the top-k nodes according to the “hard” measure. The last approach [129], proposed by Olsen et al., tries to exploit the properties of real-world networks in order to develop exact algorithms with worst-case complexity O(mn), but performing much better in practice. As far as we know, this is the only exact algorithm that is able to efficiently compute the k most central nodes in networks with up to 1 million nodes.

However, despite this large amount of research, the major graph libraries still use the textbook algorithm: among them, Boost Graph Library [87], Sagemath [64], igraph [152], NetworkX [147], and NetworKit [151]. This is due to the fact that efficient available exact algorithms for top-k closeness centrality, like [129], are relatively recent and make use of several other non-trivial routines. Conversely, our algorithm is very simple, and it is already implemented in some graph libraries, such as NetworKit [151], WebGraph [23], and Sagemath [64].

5.2 Overview of the Algorithm

In this section, we describe our new approach for computing the k nodes with largest closeness (equivalently, the k nodes with smallest farness). If we have more than one node with the same score, we output all nodes having a centrality bigger than or equal to the centrality of the k-th node.

The basic idea is to keep track of a lower bound on the farness of each node, and to skip the analysis of a node s if this lower bound implies that s is not in the top-k. More formally, let us assume that we know the farness of some nodes $s_1, \dots, s_\ell$, and a lower bound L(v) on the farness of any other node v. Furthermore, assume that there are k nodes among $s_1, \dots, s_\ell$ satisfying $f(s_i) < L(v)$ for all $v \in V - \{s_1, \dots, s_\ell\}$, and hence $f(v) \geq L(v) \geq \min_{w \in V - \{s_1, \dots, s_\ell\}} L(w) > f(s_i)$. Then, we can safely skip the exact computation of f(v) for all remaining nodes v, because the k nodes with smallest farness are among $s_1, \dots, s_\ell$.

This idea is implemented in Algorithm 8: we use a list Top containing all “analyzed” nodes $s_1, \dots, s_\ell$ in increasing order of farness, and a priority queue Q containing all nodes “not analyzed, yet”, in increasing order of lower bound L (this way, the head of Q always has the smallest value of L among all nodes in Q). At the beginning, using the function computeBounds(), we compute a first bound L(v) for each node v, and we fill the queue Q according to this bound. Then, at each step, we extract the first element s of Q: if L(s) is larger than the k-th smallest farness computed until now (that is, the farness of the k-th node in variable Top), we can safely stop, because for each v ∈ Q, f(v) ≥ L(v) ≥ L(s) > Farn[Top[k]], and v is not in the top-k. Otherwise, we run the function updateBounds(s), which performs a BFS from s, returns the farness of s, and improves the bounds L of all the other nodes. Finally, we insert s into Top in the right position, and we update Q if the lower bounds have changed.

Algorithm 8: Pseudocode of our algorithm for top-k closeness centralities.

Input : A graph G = (V, E)
Output: top-k nodes with highest closeness and their closeness values c(v)

1 global L, Q ← computeBounds(G);
2 global Top ← [ ];
3 global Farn;
4 for v ∈ V do Farn[v] ← +∞;
5 while Q is not empty do
6     s ← Q.extractMin();
7     if |Top| ≥ k and L[s] > Farn[Top[k]] then return Top;
8     Farn[s] ← updateBounds(s); // This function might also modify L
9     add s to Top, and sort Top according to Farn;
10     update Q according to the new bounds;
11 end
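As an illustration, the main loop of Algorithm 8 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the implementation discussed in this chapter: the graph is assumed to be connected, undirected, and given as a dictionary adj of adjacency lists with integer nodes; farness is taken to be the plain sum of distances; computeBounds is replaced by a simple degree-based bound (deg(v) nodes at distance 1, all other nodes at distance at least 2); and updateBounds is replaced by a full BFS that does not improve L. The names top_k_closeness and bfs_farness are hypothetical.

```python
import bisect
import heapq
from collections import deque


def bfs_farness(adj, s):
    """Sum of BFS distances from s (assumes a connected, unweighted graph)."""
    dist = {s: 0}
    queue = deque([s])
    total = 0
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                total += dist[w]
                queue.append(w)
    return total


def top_k_closeness(adj, k):
    """Return (node, farness) pairs for the k nodes of smallest farness, ties included."""
    n = len(adj)
    # Assumed stand-in for computeBounds: deg(v) nodes at distance 1,
    # the other n - 1 - deg(v) nodes at distance at least 2.
    lower = {v: len(adj[v]) + 2 * (n - 1 - len(adj[v])) for v in adj}
    heap = [(lower[v], v) for v in adj]
    heapq.heapify(heap)
    top = []  # (farness, node) pairs, kept sorted by farness
    while heap:
        bound, s = heapq.heappop(heap)
        # Stopping condition: the smallest remaining lower bound already
        # exceeds the k-th smallest farness found so far.
        if len(top) >= k and bound > top[k - 1][0]:
            break
        bisect.insort(top, (bfs_farness(adj, s), s))
    if len(top) <= k:
        return [(v, f) for f, v in top]
    kth = top[k - 1][0]
    return [(v, f) for f, v in top if f <= kth]
```

On a star with center 0 and four leaves, the center has bound 4 and each leaf has bound 7, so the loop stops after a single BFS. With L(v) = 0 for every v, as in the conservative strategy, this sketch would never stop early; pruning then relies on updateBounds.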

The crucial point of the algorithm is the definition of the lower bounds, that is, the definition of the functions computeBounds and updateBounds. We propose two alternative strategies for each of these two functions: in both cases, one strategy is conservative, that is, it tries to perform as few operations as possible, while the other strategy is aggressive, that is, it needs many operations, but at the same time it improves many lower bounds.

Let us analyze the possible choices of the function computeBounds. The conservative strategy computeBoundsDeg needs time O(n): it simply sets L(v) = 0 for each v, and it fills Q by inserting nodes in decreasing order of degree (the idea is that nodes with high degree have small farness, and they should be analyzed as early as possible, so that the values in Top are correct as soon as possible). Note that the nodes can be sorted in time O(n) using counting sort.
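A minimal Python sketch of this conservative strategy is given below, assuming that the graph is a dictionary adj of adjacency lists. The function name compute_bounds_deg and the returned pair (bounds, insertion order) are illustrative choices, not part of the original description.

```python
def compute_bounds_deg(adj):
    """Return (lower, order): zero bounds and nodes in decreasing order of degree."""
    max_deg = max((len(nbrs) for nbrs in adj.values()), default=0)
    # Counting sort: one bucket per possible degree value.
    buckets = [[] for _ in range(max_deg + 1)]
    for v, nbrs in adj.items():
        buckets[len(nbrs)].append(v)
    # Read the buckets from the highest degree to the lowest.
    order = [v for d in range(max_deg, -1, -1) for v in buckets[d]]
    lower = {v: 0 for v in adj}
    return lower, order
```

Since degrees are integers between 0 and n - 1 in a simple graph, filling and reading the buckets takes time O(n) overall.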

The aggressive strategy computeBoundsNB needs time O(mD), where D is the diameter of the graph: it computes the neighborhood-based lower bound $L^{NB}(s)$ for each node s (we explain shortly afterwards how it works), it sets $L(s) = L^{NB}(s)$, and it fills Q by adding nodes in increasing order of L. The idea behind the neighborhood-based lower bound is to count the number of paths of length $\ell$ starting from a given node s, which is also an upper bound $U_\ell$ on the number of nodes at distance $\ell$ from s. From $U_\ell$, it is possible to define a lower bound on $\sum_{v \in V} \mathrm{dist}(s, v)$ by “summing $U_\ell$ times the distance $\ell$”, until we have summed n distances: this bound yields the desired lower bound on the farness of s. The detailed explanation of this function is provided in Section 5.3.
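The following Python sketch illustrates the idea in a simplified form; it is not the exact bound of Section 5.3, which may be tighter. It assumes a connected, undirected graph given as a dictionary adj of adjacency lists, takes farness to be the plain sum of distances, and uses the number of walks of length $\ell$ from s as the upper bound on the number of nodes at distance $\ell$. The function name neighborhood_lower_bounds is hypothetical.

```python
def neighborhood_lower_bounds(adj):
    """Lower bound on the sum of distances from each node, via walk counts."""
    n = len(adj)
    walks = {v: 1 for v in adj}          # number of walks of length 0 from v
    remaining = {v: n - 1 for v in adj}  # distances still to be assigned
    bound = {v: 0 for v in adj}
    level = 0
    # On a connected graph the loop ends after at most D levels;
    # the guard on level only protects against isolated nodes.
    while any(remaining.values()) and level < n:
        level += 1
        # Walks of length `level` from v: sum over the neighbors w of v
        # of the walks of length `level - 1` from w.
        walks = {v: sum(walks[w] for w in adj[v]) for v in adj}
        for v in adj:
            # At most walks[v] nodes can be at distance `level` from v.
            take = min(walks[v], remaining[v])
            bound[v] += level * take
            remaining[v] -= take
    return bound
```

Each level scans every adjacency list once, so it costs O(m); on a connected graph, this is consistent with the O(mD) running time stated above.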

For the function updateBounds(s), the conservative strategy updateBoundsBFSCut(s) does not improve L, and it cuts the BFS as soon as it is sure that the farness of s is larger than the k-th smallest farness found until now, that is, Farn[Top[k]]. If the BFS is cut, the function returns +∞; otherwise, at the end of the BFS we have computed the farness of s, and we can return it. The running time of this procedure is O(m) in the worst case, but it can be smaller in practice. It remains to define how the procedure can be sure that the farness of s is larger than Farn[Top[k]]: to this purpose, during the BFS, we update a lower bound on the farness of s. The idea behind this bound is that, if we have already visited all nodes up to distance $\ell$, we can upper bound the closeness centrality of s by setting distance $\ell + 1$ to a number of nodes equal to the number of edges “leaving” level $\ell$, and distance $\ell + 2$ to all the remaining nodes. The details of this procedure are provided in Section 5.4.
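A minimal Python sketch of such a cut BFS is shown below. It is only an illustration under stated assumptions, and the actual procedure is the one in Section 5.4: the graph is assumed to be connected, undirected, and given as a dictionary adj of adjacency lists; farness is the plain sum of distances; the number of edges “leaving” a level is approximated by the sum of the degrees of its nodes; and threshold plays the role of Farn[Top[k]]. The function name bfs_cut_farness is hypothetical.

```python
import math


def bfs_cut_farness(adj, s, threshold):
    """Farness of s, or math.inf if the BFS proves that it exceeds threshold."""
    n = len(adj)
    visited = {s}
    frontier = [s]  # nodes at distance exactly `level` from s
    level = 0
    dist_sum = 0    # sum of the distances of all visited nodes
    while frontier:
        remaining = n - len(visited)
        # Upper bound on the number of nodes at distance level + 1.
        edges_out = sum(len(adj[u]) for u in frontier)
        at_next = min(edges_out, remaining)
        # Lower bound on the farness: at_next nodes at distance level + 1,
        # every other unvisited node at distance at least level + 2.
        lower = (dist_sum + (level + 1) * at_next
                 + (level + 2) * (remaining - at_next))
        if lower > threshold:
            return math.inf  # cut: s cannot be among the top-k nodes
        next_frontier = []
        for u in frontier:
            for w in adj[u]:
                if w not in visited:
                    visited.add(w)
                    next_frontier.append(w)
        level += 1
        dist_sum += level * len(next_frontier)
        frontier = next_frontier
    return dist_sum
```

For example, on a path with five nodes and threshold 6 (the farness of the middle node), a BFS from an endpoint is cut before visiting any neighbor, because the bound 1 + 2 · 3 = 7 already exceeds 6.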