Data corrupted by random edge addition/deletion

2.7 Extreme value estimation results

2.7.2 Data corrupted by random edge addition/deletion

PA models are designed to describe human interaction in social networks, but what if the data collected from a network are corrupted or the usual behavior changes? Corruption could be due to collection error, and atypical behavior could result from users hiding their network presence or trolls acting as provocateurs. In such circumstances, the task is to unmask data corruption or atypical behavior and recover the parameters associated with the original preferential attachment rules.

In the following, we consider network data that are generated from the linear PA model but corrupted by random addition or deletion of edges. For such corrupted data, we attempt to recover the original model and compare the performances of MLE, SN, and EV methods.

Randomly adding edges.

We consider a network generating algorithm with linear PA rules but also a possibility of adding random edges. Let G(n) = (V(n), E(n)) denote the graph at time n. We assume that the edge set E(n) can be decomposed into two disjoint subsets: E(n) = E_PA(n) ∪ E_RA(n), where E_PA(n) is the set of edges resulting from PA rules, and E_RA(n) is the set of edges resulting from random attachments. This can be viewed as an interpolation of the PA network and the Erdős-Rényi random graph.

More specifically, consider the following network growth. Given G(n − 1), G(n) is formed by creating a new edge where:

(1) With probability p_a, two nodes are chosen randomly (allowing repetition) from V(n − 1) and an edge is created connecting them. The possibility of a self loop is allowed.

(2) With probability 1 − p_a, a new edge is created according to the preferential attachment scheme (α, β, γ, δ_in, δ_out) on G_PA(n − 1) := (V(n − 1), E_PA(n − 1)).
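The growth rule in steps (1) and (2) can be sketched in code. The following is a minimal illustration, not the implementation used for the simulations. It assumes the standard directed linear PA scheme: with probability α a new node points to an existing node chosen with probability proportional to in-degree + δ_in; with probability β an edge joins two existing nodes, the source chosen proportionally to out-degree + δ_out and the target proportionally to in-degree + δ_in; with probability γ an existing node chosen proportionally to out-degree + δ_out points to a new node. The initial graph (one node with a self loop) and the function name `simulate_perturbed_pa` are also assumptions made for illustration.

```python
import random

def simulate_perturbed_pa(n_edges, alpha, beta, gamma, d_in, d_out, p_a, seed=None):
    """Grow a graph to n_edges edges; return (n_nodes, pa_edges, ra_edges)."""
    rng = random.Random(seed)
    n_nodes = 1
    # Assumed initial graph: node 0 with a self loop, counted as a PA edge.
    pa_src, pa_dst = [0], [0]
    ra_edges = []

    def pick(endpoints, delta):
        # Choose node v with probability (deg(v) + delta) / (#PA edges + delta * n_nodes),
        # where deg(v) counts occurrences of v in `endpoints` (PA edges only).
        if rng.random() * (len(endpoints) + delta * n_nodes) < len(endpoints):
            return rng.choice(endpoints)
        return rng.randrange(n_nodes)

    while len(pa_src) + len(ra_edges) < n_edges:
        if rng.random() < p_a:
            # Step (1): uniform pair with repetition, self loops allowed.
            ra_edges.append((rng.randrange(n_nodes), rng.randrange(n_nodes)))
            continue
        # Step (2): preferential attachment on G_PA(n - 1).
        u = rng.random()
        if u < alpha:            # new node -> existing node
            dst = pick(pa_dst, d_in)
            src = n_nodes
            n_nodes += 1
        elif u < alpha + beta:   # existing node -> existing node
            src = pick(pa_src, d_out)
            dst = pick(pa_dst, d_in)
        else:                    # existing node -> new node
            src = pick(pa_src, d_out)
            dst = n_nodes
            n_nodes += 1
        pa_src.append(src)
        pa_dst.append(dst)
    return n_nodes, list(zip(pa_src, pa_dst)), ra_edges
```

Keeping the PA edge endpoints in flat lists lets a node be drawn proportionally to degree plus offset in constant time, and keeping E_RA separate ensures that random edges do not influence later PA choices, as step (2) requires.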

The question of interest is, if we are unaware of the perturbation effect and pretend the data from this model are coming from the linear PA model, can we recover the PA parameters? To investigate, we generate networks of n = 10^5 edges with parameter values

(α, β, γ, δ_in, δ_out) = (0.3, 0.4, 0.3, 1, 1), p_a ∈ {0.025, 0.05, 0.075, 0.1, 0.125, 0.15}.

For each network, the original PA model is fitted using the MLE, SN and EV methods, respectively. The angular MLE in (2.6.5) in the extreme value estimation is performed based on n_tail = 500 tail observations. In order to compare these estimators, we repeat the experiment 200 times for each value of p_a and obtain 200 sets of estimated parameters for each method. Figure 2.7.2 summarizes the estimated values for (δ_in, δ_out, α, γ, ι_in, ι_out) for different values of p_a. The mean estimates are marked by crosses and the 2.5% and 97.5% empirical quantiles are marked by the bars. The true parameter values are shown as horizontal lines.
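As an illustration of how each point and bar in such a figure could be obtained from the replicated estimates, here is a minimal sketch using only the Python standard library. The function name `summarize_replications` is hypothetical, and the interpolation rule used for the empirical quantiles (`method="inclusive"`) is an assumption, since the text does not state which convention was used.

```python
import statistics

def summarize_replications(estimates):
    """Return (mean, 2.5% quantile, 97.5% quantile) of replicated estimates."""
    # n=40 yields cut points at 2.5%, 5%, ..., 97.5%; keep the first and last.
    cuts = statistics.quantiles(estimates, n=40, method="inclusive")
    return statistics.mean(estimates), cuts[0], cuts[-1]
```

Applied once per method, per parameter and per value of p_a, this produces the cross (mean) and the two bar endpoints (quantiles).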

While all parameter estimates deviate from the true values as p_a increases and the network becomes more “noisy”, the EV estimates for (δ_in, δ_out) exhibit smaller bias than the MLE and SN estimates (Figure 2.7.2 (a) and (b)). All three methods underestimate the probabilities (α, γ) (Figure 2.7.2 (c) and (d)). This is because the perturbation step (1) creates more edges between existing nodes and consequently inflates the estimated value of β.

Also note that the mean EV estimates of (ι_in, ι_out) stay close to the theoretical values for all choices of p_a; see Figure 2.7.2 (e) and (f). The MLE and SN estimates of (ι_in, ι_out), which are computed from the corresponding estimates for (α, β, γ, δ_in, δ_out), show strong bias as p_a increases. In this case, the EV method is robust for estimating the PA parameters and recovering the tail indices from the original model.
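To make concrete how tail indices follow from the PA parameters, the sketch below uses the expressions commonly given for the directed linear PA model, ι_in = (1 + δ_in(α + γ)) / (α + β) and ι_out = (1 + δ_out(α + γ)) / (β + γ). These formulas are an assumption here, since the definition used in this work appears outside this section, and the function name `tail_indices` is hypothetical.

```python
def tail_indices(alpha, beta, gamma, d_in, d_out):
    """Plug-in tail indices (iota_in, iota_out) under the assumed formulas."""
    iota_in = (1 + d_in * (alpha + gamma)) / (alpha + beta)
    iota_out = (1 + d_out * (alpha + gamma)) / (beta + gamma)
    return iota_in, iota_out

# Parameter values used in the simulations of this section.
iota_in, iota_out = tail_indices(0.3, 0.4, 0.3, 1, 1)
```

Under these formulas the simulation parameters give ι_in = ι_out = 1.6 / 0.7 ≈ 2.29. A plug-in estimate of this kind inherits any bias in the estimates of (α, β, γ, δ_in, δ_out), which is consistent with the bias reported for the MLE and SN tail estimates.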

Randomly deleting edges.

We now consider the scenario where a network is generated from the linear PA model, but a random proportion p_d of the edges is deleted at the final time. We do this by generating G(n) and then deleting [n p_d] edges by sampling from E(n) without replacement. For the simulation, we generated networks with parameter values

(α, β, γ, δ_in, δ_out) = (0.3, 0.4, 0.3, 1, 1), p_d ∈ {0.025, 0.05, 0.075, 0.1, 0.125, 0.15}.
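The deletion step can be sketched as follows. This is a minimal illustration that assumes [n p_d] denotes the integer part of n p_d and that the edge set is stored as a list of (source, target) pairs; the function name `delete_random_edges` is hypothetical.

```python
import math
import random

def delete_random_edges(edges, p_d, seed=None):
    """Remove floor(n * p_d) edges chosen uniformly without replacement."""
    rng = random.Random(seed)
    n = len(edges)
    n_delete = math.floor(n * p_d)  # assumed meaning of [n p_d]
    # Sample edge positions rather than edge values so multi-edges are handled.
    removed = set(rng.sample(range(n), n_delete))
    return [e for i, e in enumerate(edges) if i not in removed]
```

Sampling positions instead of edge values matters because a PA graph can contain repeated edges, and each copy should be eligible for deletion separately.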

Again, for each p_d, the experiment is repeated 200 times and the resulting parameter plots are shown in Figure 2.7.3 using the same format as for Figure 2.7.2. For the EV method, 100 tail observations were used to compute α̂_EV.

Figure 2.7.2: Mean estimates and 2.5% and 97.5% empirical quantiles of (a) δ_in; (b) δ_out; (c) α; (d) γ; (e) ι_in; (f) ι_out, using MLE (black), SN (red) and EV (blue) methods over 200 replications, where (α, β, γ, δ_in, δ_out) = (0.3, 0.4, 0.3, 1, 1) and p_a = 0.025, 0.05, 0.075, 0.1, 0.125, 0.15. For the EV method, 500 tail observations were used to obtain α̂_EV.

Surprisingly, for all six parameters considered, the MLE estimates stay almost unchanged for different values of p_d, while the SN and EV estimates underestimate (δ_in, δ_out) and overestimate (α, γ), with biases that grow as p_d increases. For the tail estimates, the minimum distance method still gives reasonable results (though with larger variances), whereas the SN method consistently underestimates ι_in and ι_out.

The performance of MLE in this case is surprisingly competitive. Explaining why this is the case is the subject of ongoing work.

Figure 2.7.3: Mean estimates and 2.5% and 97.5% empirical quantiles of (a) δ_in; (b) δ_out; (c) α; (d) γ; (e) ι_in; (f) ι_out, using MLE (black), SN (red) and EV (blue) methods over 50 replications, where (α, β, γ, δ_in, δ_out) = (0.3, 0.4, 0.3, 1, 1) and p_d = 0.025, 0.05, 0.075, 0.1, 0.125, 0.15. For the EV method, 100 tail observations were used to compute α̂_EV.

CHAPTER 3

DEGREE GROWTH RATES AND INDEX ESTIMATION IN A DIRECTED LINEAR PA MODEL

3.1 Overview

Empirical studies on social network data often reveal that in- and out-degree distributions marginally follow power laws. Theoretically, this is also true for linear preferential attachment models, which makes preferential attachment appealing in network modeling; see [7, 39, 40] for references. Also, the empirical joint degree frequency converges to the probability mass function (pmf) of a pair of limit random variables that are jointly regularly varying (cf. [40, 52, 55, 67]). However, questions related to joint degree growth and index estimation still remain unresolved. In this chapter, we focus on three main problems:

1. For a fixed node in a linear preferential attachment graph, what is the joint behavior of in- and out-degree as the graph size grows?

2. What are the convergence properties of the tail empirical joint measure of in- and out-degrees indexed by node?

3. When estimating the marginal power-law indices of in- and out-degree, can we use the Hill estimator as a consistent estimator?

What is the justification for interest in Hill estimation of power-law indices for network data? Repositories of large network datasets such as KONECT (http://konect.cc/, [42]) provide summary statistics for all the archived network datasets, and among the summary statistics are estimates of degree indices computed with Hill estimators, despite the fact that evidence for Hill estimator consistency is scant for network data [68].

Another justification is robust parameter estimation methods in network models based on extreme value techniques. In [63], we couple the Hill estimation of marginal degree distribution tail indices with a minimum distance threshold selection method introduced in [10] and compare this method with the parametric estimation approaches used in [64]. The Hill estimation is more robust against modeling error and data corruption. Therefore, an affirmative answer to the third question helps justify all of these inference methodologies.

In the directed case, consistency of the two marginal Hill estimators results from resolving the first two questions, since in a similar vein to [68], we consider the Hill estimator as a functional of the marginal tail empirical measure. So convergence results of marginal tail empirical measures lead to the consistency of Hill estimators by a mapping argument.
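For reference, the Hill estimator discussed here can be written in a few lines. The sketch below is the standard form based on the k largest order statistics; the function name `hill_estimator` and the choice to drop zero degrees before taking logarithms are assumptions made for illustration, and the selection of k (for example by a minimum distance method) is not shown.

```python
import math

def hill_estimator(data, k):
    """Hill estimate of 1/index from the k largest positive observations."""
    xs = sorted((x for x in data if x > 0), reverse=True)
    if not 1 <= k < len(xs):
        raise ValueError("k must satisfy 1 <= k < number of positive observations")
    threshold = xs[k]  # the (k+1)-th largest observation
    # Average log-excess of the top k observations over the threshold.
    return sum(math.log(x / threshold) for x in xs[:k]) / k
```

Applied separately to the in-degrees and the out-degrees of all nodes, the reciprocal of the returned value gives an estimate of the corresponding marginal power-law index.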

To answer the first question about degree behavior of fixed nodes as graph size grows, we mimic in- and out-degree growth of a fixed node using pairs of switched birth processes with immigration (SBI processes). The SBI processes use Bernoulli switching between pairs of independent birth processes with immigration (BI processes). We embed the directed network growth model into a sequence of paired SBI processes. Whenever a new node is added to the network, a new pair of SBI processes is initiated. Using convergence results for BI processes (cf. [49, Chapter 5.11], [57, 68]), we give the joint limits of the in- and out-degrees of a fixed node as well as the joint maximal degree growth.

Proving the convergence of the tail empirical joint measure in the second question requires showing concentration results for degree counts compared with expected degree counts. With embedding techniques, we prove the limit distribution of the empirical joint degree frequencies in a way that is different from the one used in [55], and then justify the concentration results.
