Network sampling techniques can be roughly divided into two categories:
random selection and network exploration techniques. In the first category, nodes or links are included in the sample uniformly at random or with probability proportional to some particular characteristic, such as the degree of a node or its PageRank score []. In the second category, the sample is constructed by retrieving a neighborhood of a randomly selected seed node using random walks, breadth-first search or another strategy. For the purpose of this study, we consider three techniques from each category.
Zmanjševanje omrežij
[Figure caption fragment: "... lines) and the links between their endpoints (dashed lines)."]
.. Random selection
From the random selection category, we first adopt random node selection by degree [] (RND). Here, the nodes are selected randomly with probability proportional to their degrees, and all their mutual links are included in the sample (Fig..(a)). Note that RND improves on the performance of the basic random node selection [,], where the nodes are added to the sample uniformly at random. RND fits spectral network properties better [] and produces a sample with a larger weakly connected component []. Moreover, it performs well in preserving the clustering coefficient and betweenness centrality distributions of the original networks []. Nevertheless, it can still construct a disconnected sample network, even when the original network is fully connected.
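The RND procedure described above can be sketched as follows. This is a minimal illustration, not the implementation used in the study: the adjacency-dictionary representation, the function name `rnd_sample` and the choice to draw nodes one at a time without replacement are all assumptions made here.

```python
import random

def rnd_sample(adj, n_nodes, seed=None):
    """Sketch of random node selection by degree (RND).

    adj maps each node to the set of its neighbors (undirected graph).
    Nodes are drawn one at a time, without replacement, with probability
    proportional to their degree (an assumption of this sketch); the
    sample keeps every original link between the chosen nodes.
    """
    rng = random.Random(seed)
    # Isolated nodes have degree 0 and therefore zero selection probability.
    candidates = [v for v in adj if adj[v]]
    target = min(n_nodes, len(candidates))
    chosen = set()
    while len(chosen) < target:
        remaining = [v for v in candidates if v not in chosen]
        weights = [len(adj[v]) for v in remaining]
        chosen.add(rng.choices(remaining, weights=weights, k=1)[0])
    # Induced subgraph: all mutual links of the chosen nodes.
    return {v: adj[v] & chosen for v in chosen}
```

Because the links are induced only after the nodes are drawn, nothing in this sketch forces the result to be connected, which matches the remark above about disconnected samples.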
Zgoščevanje skupin vozlišč pri zmanjševanju realnih omrežij N. Blagus
Next, we adopt random link selection [] (RLS), where the sample consists of links selected uniformly at random (Fig..(b)). RLS overestimates the degree and betweenness centrality exponents, underestimates the clustering coefficient and accurately matches the assortativity of the original network []. The samples created with RLS are sparse and do not preserve the connectivity of the original network; still, RLS is likely to capture the path length of the original network [].
Last, we adopt random link selection with induction [] (RLI), which improves the performance of RLS. In RLI, the sample consists of randomly selected links as before, together with all additional links between their endpoints (Fig..(c)). RLI outperforms several other methods in capturing the degree, path length and clustering coefficient distributions. It selects nodes with higher degree than RLS does, which increases the connectivity of the sample [].
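The difference between RLS and RLI can be illustrated with a short sketch. The function names, the adjacency-dictionary representation and the assumption that node labels are comparable (for example integers) are choices made here for illustration and are not taken from the original study.

```python
import random

def edge_list(adj):
    """List each undirected link once; assumes comparable node labels."""
    return [(u, v) for u in adj for v in adj[u] if u < v]

def rls_sample(adj, n_links, seed=None):
    """Sketch of random link selection (RLS): keep only the picked links."""
    rng = random.Random(seed)
    edges = edge_list(adj)
    picked = rng.sample(edges, min(n_links, len(edges)))
    sample = {}
    for u, v in picked:
        sample.setdefault(u, set()).add(v)
        sample.setdefault(v, set()).add(u)
    return sample

def rli_sample(adj, n_links, seed=None):
    """Sketch of random link selection with induction (RLI).

    Links are picked as in RLS, then every original link between the
    endpoints of the picked links is added (the induction step).
    """
    nodes = set(rls_sample(adj, n_links, seed))
    return {v: adj[v] & nodes for v in nodes}
```

With the same seed, both functions return the same node set; the RLI sample contains every RLS link and possibly more.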
Techniques from the random selection category imitate classical statistical sampling approaches, where each individual is selected from the population independently of the others until the desired sample size is reached.
.. Network exploration
From the network exploration category, we first adopt breadth-first sampling [] (BFS). Here, a seed node is selected uniformly at random, and its broad neighborhood retrieved by the basic breadth-first search is included in the sample (Fig..(a)). The sample network is thus a connected subgraph of the original network. BFS is biased towards selecting high-degree nodes in the sample []. It captures the degree distribution of the networks well, while it performs worst at including hubs early in the sampling process []. BFS imitates the snowball sampling approach for collecting social data, used especially when the data is difficult to reach []. A selected seed participant is asked to report his or her friends, who are then invited to report their friends. The procedure is repeated until the desired number of people is sampled.
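A minimal sketch of breadth-first sampling is given below. The stopping rule (a fixed number of nodes) and the decision to keep all original links between the visited nodes are assumptions of this illustration; `bfs_sample` is a hypothetical name.

```python
import random
from collections import deque

def bfs_sample(adj, n_nodes, seed=None):
    """Sketch of breadth-first sampling (BFS); expects n_nodes >= 1.

    adj maps each node to the set of its neighbors. A seed node is
    chosen uniformly at random and the search stops once n_nodes
    nodes have been visited.
    """
    rng = random.Random(seed)
    start = rng.choice(list(adj))
    visited = {start}
    queue = deque([start])
    while queue and len(visited) < n_nodes:
        u = queue.popleft()
        for v in adj[u]:
            if v not in visited:
                visited.add(v)
                queue.append(v)
                if len(visited) == n_nodes:
                    break
    # Assumption: keep every original link between visited nodes.
    return {v: adj[v] & visited for v in visited}
```

Every visited node is reached through an already visited neighbor, so the sample is connected; it can be smaller than `n_nodes` if the component of the seed node is exhausted first.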
Next, we adopt a modification of BFS denoted forest-fire sampling [] (FFS). In FFS, the broad neighborhood of a randomly selected seed node is retrieved by a partial breadth-first search, where only some neighbors are included in the sample at each step (Fig..(b)). The number of neighbors is sampled from a geometric distribution with mean $p/(1 - p)$, where $p$ is set to 0.7 []. FFS matches spectral properties well [], while it underestimates the degree distribution and fails to match the path length and clustering coefficient of the original networks []. However, FFS corresponds to a model by which one author collects the papers to cite and includes them in the bibliography []. The author starts with one paper, explores its bibliography and selects the papers to cite. The procedure is repeated recursively on the selected papers until the desired collection of citations is reached.
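The partial breadth-first search of FFS can be sketched as follows. The geometric draw counts successes before the first failure, which gives mean p/(1 - p). Restarting from a new random seed node when the fire dies out, and keeping all original links between visited nodes, are assumptions of this sketch rather than details stated in the text.

```python
import random
from collections import deque

def geometric(rng, p):
    """Number of successes before the first failure; mean is p / (1 - p)."""
    k = 0
    while rng.random() < p:
        k += 1
    return k

def ffs_sample(adj, n_nodes, p=0.7, seed=None):
    """Sketch of forest-fire sampling (FFS) on an undirected graph."""
    rng = random.Random(seed)
    nodes = list(adj)
    target = min(n_nodes, len(nodes))
    visited = set()
    while len(visited) < target:
        # Assumption: restart from a fresh random node if the fire died out.
        start = rng.choice([v for v in nodes if v not in visited])
        visited.add(start)
        queue = deque([start])
        while queue and len(visited) < target:
            u = queue.popleft()
            fresh = [v for v in adj[u] if v not in visited]
            rng.shuffle(fresh)
            # Burn only a geometrically distributed number of neighbors.
            burned = fresh[:geometric(rng, p)]
            for v in burned[:target - len(visited)]:
                visited.add(v)
                queue.append(v)
    return {v: adj[v] & visited for v in visited}
```

With p = 0.7 the expected number of burned neighbors per step is 0.7 / 0.3, roughly 2.33, so the search usually follows only a few of the available links from each node.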
[Figure residue: panels (a) Community, (b) Module; caption ends "... mixtures."]
Last, we adopt expansion sampling [] (EXS), where the seed node is again selected uniformly at random, while the neighbors of the sampled nodes are included in the sample with probability proportional to
$$1 - \beta^{|N(\{v\}) - (N(S) \cup S)|},$$
where $v$ is the concerned node, $S$ the current sample and $N(S)$ the neighborhood of the nodes in $S$ (Fig..(c)). The expression $|N(\{v\}) - (N(S) \cup S)|$ denotes the expansion factor of node $v$ for sample $S$, that is, the number of new neighbors contributed by $v$. The parameter $\beta$ is set to 0.9 []. Note that EXS ensures that the sample consists of nodes from most communities in the original network and that nodes grouped together in the original network are also grouped together in the sample []. EXS imitates a modification of the snowball sampling approach mentioned above, where, for example, we want to gather data about individuals from different countries. Thus, at each step we include in the sample the individuals who know a larger number of others from various countries.
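The expansion rule can be sketched in a few lines. Here the set `frontier` stores the neighbors of the sample that are not yet sampled, so `frontier | sample` plays the role of N(S) ∪ S. The uniform fallback when every candidate has expansion factor zero, and stopping when the frontier is empty, are assumptions of this sketch, since the text does not specify these cases.

```python
import random

def expansion_factor(adj, v, sample, frontier):
    """Number of new neighbors contributed by v: |N({v}) - (N(S) ∪ S)|."""
    return len(adj[v] - frontier - sample)

def exs_sample(adj, n_nodes, beta=0.9, seed=None):
    """Sketch of expansion sampling (EXS) on an undirected graph."""
    rng = random.Random(seed)
    start = rng.choice(list(adj))
    sample = {start}
    frontier = set(adj[start]) - sample
    while frontier and len(sample) < n_nodes:
        cands = list(frontier)
        weights = [1 - beta ** expansion_factor(adj, v, sample, frontier)
                   for v in cands]
        if sum(weights) > 0:
            v = rng.choices(cands, weights=weights, k=1)[0]
        else:
            # Assumption: no candidate adds new neighbors, pick uniformly.
            v = rng.choice(cands)
        sample.add(v)
        frontier |= adj[v]
        frontier -= sample
    return {v: adj[v] & sample for v in sample}
```

A candidate that contributes no new neighbors gets weight 1 - 0.9^0 = 0, while one that contributes a single new neighbor gets weight 0.1, so nodes that open up unexplored parts of the network are preferred.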