4.2 General Method for Graph Matching
4.2.3 Models for Sampled Degree and Distance
The methodology described so far is general, in that we have not yet specified the distributions of specific attributes that determine the fingerprints of nodes. However, these models are needed in order to compute the likelihood function. From now on, we will use the node degree as the only absolute attribute, and hop distance as the relative attribute.
In the following, $X_1 = (X_{11}, X_{21})$ is the degree pair of $u_{1,2}$, and $X_i = (X_{1i}, X_{2i})$, $i > 1$, is the distance pair to anchor pair $i$.³ Figure 4.1 depicts the Bayesian network of the formulation as described in Section 4.2.2.
The aforementioned methodology requires a probabilistic model for the node attributes that determine the fingerprint of nodes. In particular, we need probabilistic models that represent the node attributes in $G$ and also after the sampling process, that is, in $G_1$ and $G_2$. Note that these models are represented in equations (4.5) and (4.6). As we will use only degree and distance in the fingerprint of a node, we require models only for these two attributes.
³ More precisely, with $(w_{1i}, w_{2i})$ the $i$-th anchor pair, $X_{1i}$ is the graph distance in $G_1$ to $w_{1i}$ (and analogously

As mentioned earlier, we consider the edge sampling process. Under the edge sampling process, we can now determine a model for degree and distance on the sampled graph given their values in the original graph $G$. Recall that $q_i(x, y)$ denotes the probability that attribute $i$ has value $x$ in the sampled graph, given that it has value $y$ in the original graph $G$. Let $i = 1$ denote the node degree. As the edge sampling process samples each edge independently with probability $s$, $q_1(x, y)$ follows a binomial distribution with parameters $y$ and $s$, i.e.,

$$q_1(x, y) = \mathrm{Bi}(x, y, s) = \binom{y}{x} s^{x} (1 - s)^{y - x}.$$

Finally, note that $p_1(y)$ denotes the degree distribution of $G$, i.e., the probability of sampling a degree $y$ from $G$. We comment on the choice of the underlying degree distribution in Section 4.4.
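The binomial degree model can be evaluated directly. The following is a minimal sketch, assuming only the formula for $q_1(x, y)$ given in the text; the function name `q_degree` and the example values are illustrative and do not come from the source.

```python
from math import comb


def q_degree(x: int, y: int, s: float) -> float:
    """P[sampled degree = x | original degree = y] under edge sampling.

    Each of the y incident edges is kept independently with probability s,
    so the sampled degree is binomial with parameters y and s.
    """
    if x < 0 or x > y:
        return 0.0  # a sampled degree can never exceed the original degree
    return comb(y, x) * s**x * (1 - s) ** (y - x)


# Illustrative values (assumed, not from the source): degree 5, sampling rate 0.8.
y, s = 5, 0.8
for x in range(y + 1):
    print(f"q_1({x}, {y}) = {q_degree(x, y, s):.4f}")
```

Because the model is a plain binomial, the expected sampled degree is $s \cdot y$, which gives a quick sanity check on any implementation.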
The model for distances under edge sampling requires a more careful treatment, as this has not been studied in the literature. Consider two adjacent nodes $u$ and $v$ in $G$. We will assume that the distance between $u$ and $v$ after sampling follows a geometric distribution. In particular, let $Z_{u,v}$ denote a geometric random variable with parameter $\alpha$ for the distance between nodes $u$ and $v$ after sampling. We provide an intuitive argument for this model. In a well-connected graph $G$, (i) paths of different lengths $1, 2, 3, \ldots$ tend to exist between two adjacent nodes, and (ii) sampling is likely to preserve a path of length $\ell$ between two nodes with the same probability $\alpha$ for all path lengths. Note that the number of paths of length $\ell$ between two nodes is likely to increase with $\ell$. Thus, although a longer path is more likely to be destroyed by the edge sampling process, the larger number of such paths might offset this effect.
Using the above assumption, a distance of 1 would be preserved in the sampled graph with probability $\alpha$, a distance of 2 would emerge between two adjacent nodes in $G$ with probability $\alpha(1 - \alpha)$, and so forth. Thus, the sampled distance $Z_{u,v}$ between two adjacent nodes $u$ and $v$ in $G$ has a probability distribution of the form

$$P[Z_{u,v} = \ell] = \alpha (1 - \alpha)^{\ell - 1}, \quad \ell = 1, 2, \ldots,$$

which is a geometric distribution with parameter $\alpha$. The same argument can now be extended to nodes at any distance. Let $u$ and $v$ have a shortest path of length $y$ in $G$, given by $(u = u_0, u_1, \ldots, u_i, u_{i+1}, \ldots, u_y = v)$. Thus, their distance after sampling can be written as

$$Z_{u,v} = Z_{u_0,u_1} + \cdots + Z_{u_i,u_{i+1}} + \cdots + Z_{u_{y-1},u_y},$$

where the random variable $Z_{u_i,u_{i+1}}$ denotes the distance after sampling for each adjacent pair in the path. We now assume that the $Z_{u_i,u_{i+1}}$ are independent for all $i$. Thus, we have a sum of $y$ independent geometric random variables with parameter $\alpha$, which can be expressed as a negative binomial random variable with parameters $y$ and $\alpha$, denoted by $\mathrm{NBi}(\cdot)$.
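The step from a sum of geometric variables to a negative binomial can be checked numerically. The sketch below convolves the geometric distribution with itself $y$ times and compares the result with the closed form $\binom{x-1}{y-1}\alpha^{y}(1-\alpha)^{x-y}$. The function names and the values of $y$ and $\alpha$ are assumptions chosen for illustration.

```python
from math import comb


def geometric_pmf(l: int, alpha: float) -> float:
    """P[Z = l] = alpha * (1 - alpha)**(l - 1) for l = 1, 2, ..."""
    return alpha * (1 - alpha) ** (l - 1) if l >= 1 else 0.0


def negative_binomial_pmf(x: int, y: int, alpha: float) -> float:
    """Closed-form pmf of a sum of y independent geometric variables."""
    if y < 1 or x < y:
        return 0.0
    return comb(x - 1, y - 1) * alpha**y * (1 - alpha) ** (x - y)


def convolve_geometrics(y: int, alpha: float, max_x: int) -> list:
    """Exact pmf of Z_1 + ... + Z_y on 0..max_x via repeated convolution.

    Entry x only depends on entries below x, so truncating at max_x
    does not affect the values that are returned.
    """
    geo = [geometric_pmf(l, alpha) for l in range(max_x + 1)]
    total = geo[:]  # pmf of a single geometric variable
    for _ in range(y - 1):
        nxt = [0.0] * (max_x + 1)
        for a, pa in enumerate(total):
            if pa == 0.0:
                continue
            for b in range(1, max_x - a + 1):
                nxt[a + b] += pa * geo[b]
        total = nxt
    return total


# Illustrative values (assumed): original distance y = 3, alpha = 0.7.
y, alpha, max_x = 3, 0.7, 12
conv = convolve_geometrics(y, alpha, max_x)
for x in range(y, max_x + 1):
    print(x, round(conv[x], 6), round(negative_binomial_pmf(x, y, alpha), 6))
```

The two columns agree up to floating-point error, which is what the independence assumption on the $Z_{u_i,u_{i+1}}$ implies.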
Let $X_{1i}$ for $i > 1$ denote the distance from node $u$ to an anchor node $i - 1$ in $G_1$. Then $q_i(x, y)$ is the probability of observing distance $x$ between a node and the $(i - 1)$-th anchor node in the sampled graph, given that their distance in the underlying graph is $y$. Thus, we have

$$q_i(x, y) = \mathrm{NBi}(x, y, \alpha) = \binom{x - 1}{y - 1} \alpha^{y} (1 - \alpha)^{x - y}, \quad x \ge y, \; j = 1, 2. \tag{4.14}$$

We set the parameter $\alpha = s$, where $s$ is the sampling probability of the edge sampling process. As described above, edge sampling tends to preserve paths of different lengths with the same probability; thus, using the probability of preserving a single edge ($s$) as $\alpha$ could be a reasonable assumption. Finally, $p_i(y)$ for $i > 1$ corresponds to the distance distribution of $G$, i.e., the probability that two given nodes are at distance $y$ in $G$. We comment on the choice of the underlying distance distribution in Section 4.4.
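With $\alpha = s$, equation (4.14) depends only on the observed distance, the original distance, and the sampling probability. A minimal sketch of that evaluation follows; the helper `expected_sampled_distance`, the truncation bound `max_x`, and the example values are assumptions added for illustration, not part of the source.

```python
from math import comb


def q_distance(x: int, y: int, s: float) -> float:
    """q_i(x, y) from equation (4.14), with alpha set to the sampling rate s."""
    if y < 1 or x < y:
        return 0.0  # under this model, sampling cannot shorten a distance
    alpha = s  # the choice alpha = s made in the text
    return comb(x - 1, y - 1) * alpha**y * (1 - alpha) ** (x - y)


def expected_sampled_distance(y: int, s: float, max_x: int = 500) -> float:
    """Mean sampled distance under the model, truncating the sum at max_x."""
    return sum(x * q_distance(x, y, s) for x in range(y, max_x + 1))


# Illustrative values (assumed): original distance 3, sampling rate 0.8.
y, s = 3, 0.8
for x in range(y, y + 5):
    print(f"q_i({x}, {y}) = {q_distance(x, y, s):.4f}")
print("mean sampled distance:", expected_sampled_distance(y, s))
```

Under this model the mean sampled distance is $y / s$, so a lower sampling probability stretches distances proportionally.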
Note that we investigated the assumption of a geometric distance distribution for adjacent nodes, and of a negative binomial distribution for larger distances after sampling, via experiments on real data. We ran several experiments on an e-mail network dataset, which we introduce in Section 4.4. Our findings suggest that this assumption holds for real social networks, that paths of different lengths exist between adjacent nodes, and that the choice of $s$ for the geometric parameter is indeed reasonable.
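A check of this kind can be sketched on synthetic data. The code below is not the experiment described above: it uses a hypothetical Erdős–Rényi random graph in place of the e-mail dataset, samples its edges with probability $s$, and tabulates the sampled distance between pairs that were adjacent in the original graph. All function names and parameter values are assumptions.

```python
import random
from collections import Counter, deque


def random_graph(n, p, rng):
    """Undirected Erdős–Rényi graph as a list of edges (u, v) with u < v."""
    return [(u, v) for u in range(n) for v in range(u + 1, n) if rng.random() < p]


def sample_edges(edges, s, rng):
    """Edge sampling: keep each edge independently with probability s."""
    return [e for e in edges if rng.random() < s]


def bfs_distances(adj, source):
    """Hop distances from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj.get(u, ()):
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def sampled_distance_histogram(n, p, s, seed=0):
    """Histogram of sampled distances between originally adjacent pairs.

    The key None counts pairs that end up disconnected, a case the
    geometric model does not represent.
    """
    rng = random.Random(seed)
    edges = random_graph(n, p, rng)
    kept = sample_edges(edges, s, rng)
    adj = {}
    for u, v in kept:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    hist, cache = Counter(), {}
    for u, v in edges:
        if u not in cache:
            cache[u] = bfs_distances(adj, u)
        hist[cache[u].get(v)] += 1
    return hist, len(edges)


# Assumed toy parameters: 200 nodes, edge probability 0.05, sampling rate 0.7.
s = 0.7
hist, m = sampled_distance_histogram(200, 0.05, s, seed=1)
for l in sorted(k for k in hist if k is not None):
    print(l, round(hist[l] / m, 4), round(s * (1 - s) ** (l - 1), 4))
print("disconnected pairs:", hist[None])
```

The second column is the empirical frequency and the third is the geometric prediction $s(1 - s)^{\ell - 1}$. The frequency at distance 1 matches $s$ by construction, so the informative comparison is at distances 2 and above.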