Experimental setup - INTRODUCE NEW INFORMATION DIFFUSION MOD-

Chapter 5 INTRODUCE NEW INFORMATION DIFFUSION MOD-

5.5 Experiments

5.5.1 Experimental setup

Data set introduction We use four real network data sets as test beds and introduce them in order of increasing number of edges. The first network, NetHEPT, contains 15,233 nodes and 58,891 edges. NetHEPT is an undirected collaboration network that represents collaboration relationships among authors writing papers. NetHEPT is also used in [2] and [10], and it was downloaded from http://research.microsoft.com/enus/people/weic/graphdata.zip. The second network is the ego-Facebook data set [12], an undirected network with 4,039 nodes and 88,234 edges. The Facebook data was collected from survey participants using an online application that could provide users’ basic information. The third network is Wiki-vote [6], which consists of 7,115 nodes and 103,689 edges. According to [6], the network contains all the Wikipedia voting data from the inception of Wikipedia until January 2008. An edge (i, j) represents user i voting for user j. The network is directed, but we treat it as undirected in this experiment: every edge in Wiki-vote is taken to represent two nodes voting for each other, since we currently consider only the undirected case under the SC model. The fourth network is the Amazon data set [7], which was collected by crawling the Amazon website on March 02, 2003. It consists of 262,111 nodes and 1,234,877 edges and is the largest data set we use in this experiment.

[Figure 5.2: four line plots, (a) NetHEPT, (b) Ego-Facebook, (c) Wiki-vote, (d) Amazon0302. Axes: number of seeds and influence spread. Series: Random, MaximumDegree, DegreeDiscount, BestOverlappingCoverage, Greedy.]

Figure 5.2. Comparison of influence spread of different algorithms on different data sets with increasing number of seeds.
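Treating the directed Wiki-vote network as undirected, as described in the data set introduction, amounts to collapsing each pair of opposite directed edges into one unordered edge. The following is a minimal sketch of that preprocessing step; the function name `to_undirected` and the toy edge list are illustrative assumptions, not the code used in the experiments.

```python
def to_undirected(directed_edges):
    """Collapse directed edges (i, j) into a set of unordered edges.

    Each edge is stored with the smaller endpoint first, so (i, j) and
    (j, i) map to the same entry.
    """
    undirected = set()
    for i, j in directed_edges:
        undirected.add((min(i, j), max(i, j)))
    return undirected


# Toy example (hypothetical votes): user 1 and user 2 vote for each other.
votes = [(1, 2), (2, 1), (3, 1), (2, 3)]
undirected_votes = to_undirected(votes)
print(sorted(undirected_votes))
```

Note that mutual votes become a single undirected edge, so the undirected edge count can be smaller than the directed one.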

Propagation Probability Model Trivalency model. The trivalency model was first used in [3]. It randomly selects a probability for each edge from an array containing three probabilities. We use the probability array {0.01, 0.02, 0.05} in this model.
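As an illustration of the trivalency model, the sketch below assigns each edge one of the three probabilities from the array given above. It assumes the choice is uniform over the three values, which the text does not state explicitly; the function name `assign_trivalency` and the `seed` parameter are hypothetical.

```python
import random

# Probability array given in the text.
TRIVALENCY = (0.01, 0.02, 0.05)


def assign_trivalency(edges, seed=None):
    """Map each edge to a probability drawn from TRIVALENCY.

    Assumption: each of the three values is equally likely.
    """
    rng = random.Random(seed)
    return {edge: rng.choice(TRIVALENCY) for edge in edges}


# Toy edge list (hypothetical).
edges = [(0, 1), (1, 2), (2, 3)]
probs = assign_trivalency(edges, seed=42)
for edge, p in probs.items():
    print(edge, p)
```

Fixing `seed` makes the assignment reproducible across runs, which is convenient when comparing algorithms on the same weighted graph.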

Server specification The experiments run on a cluster server. The node we use is equipped with a Quad-Core AMD Opteron(tm) Processor 2376 and can access up to 264G of system memory. Since each node has eight processors, including both physical and logical ones, we run eight threads simultaneously to increase throughput. Each thread is responsible for selecting k seeds for a specific algorithm, where 1≤k≤8.

Algorithms introduction We run the following algorithms under the SC model on all four network data sets.

• Random: All the seeds are selected randomly from the node set. This algorithm takes

[Figure 5.3: two bar charts, (a) NetHEPTBig, (b) WikiBig. Horizontal axis: Algorithm (Random, MaximumDegree, DegreeDiscount, BestOverlapping). Vertical axis label as extracted: Influence Spread.]

Figure 5.3. Comparison of running time of different algorithms when the number of seeds = 100.

• MaximumDegree: This heuristic simply selects the k seeds with the largest degrees. It was first used as a baseline against other algorithms in [10].

• DegreeDiscount: This algorithm is based on MaximumDegree. Instead of selecting the k seeds with the largest degrees directly from the node set, it applies a degree discount to nodes whose neighbors are already fully or partially influenced. For a given node, the degree discount depends on how many of its neighbors have been influenced by previously picked seeds. The algorithm was first proposed in [2], where it was designed under the IC model. In this paper, we calculate the degree discount in the same way as the original version in [2], but under the SC model.

• Greedy: This is the approximation algorithm with approximation ratio 1−1/e that we introduced in Algorithm 6. The greedy algorithm was first introduced, and its approximation ratio proved, in [10]. Because it achieves the best influence spread, nearly all work on the influence maximization problem uses it as the benchmark for the best spread a network graph can achieve with k seeds.

BestOverlappingCoverage since it calculates the overlapping gains and overlapping losses between each pair of potential nodes.
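The two degree-based heuristics in the list above can be sketched as follows. This is an illustration only: the discount shown is the IC-model formula from [2], dd_v = d_v - 2 t_v - (d_v - t_v) t_v p, not the SC-model variant used in this experiment, and the single propagation probability `p`, the adjacency-dictionary representation, and the toy graph are all assumptions.

```python
def build_adjacency(edges):
    """Build an undirected adjacency dictionary {node: set(neighbors)}."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def maximum_degree(adj, k):
    """Return the k nodes with the largest degrees."""
    return sorted(adj, key=lambda v: len(adj[v]), reverse=True)[:k]


def degree_discount_ic(adj, k, p):
    """Degree discount heuristic using the IC-model formula from [2].

    t[v] counts the already selected seeds adjacent to v.
    """
    degree = {v: len(adj[v]) for v in adj}
    dd = dict(degree)          # discounted degrees, initially the raw degrees
    t = {v: 0 for v in adj}
    seeds, chosen = [], set()
    for _ in range(min(k, len(adj))):
        u = max((v for v in adj if v not in chosen), key=lambda v: dd[v])
        seeds.append(u)
        chosen.add(u)
        for v in adj[u]:
            if v not in chosen:
                t[v] += 1
                dd[v] = degree[v] - 2 * t[v] - (degree[v] - t[v]) * t[v] * p
    return seeds


# Toy graph (hypothetical): node 1 is adjacent to hub 0, node 7 is not.
toy_edges = [(0, 1), (0, 2), (0, 3), (0, 4), (1, 5), (1, 6),
             (7, 8), (7, 9), (7, 10)]
toy_adj = build_adjacency(toy_edges)
print(maximum_degree(toy_adj, 2))            # [0, 1]
print(degree_discount_ic(toy_adj, 2, 0.01))  # [0, 7]
```

In the toy graph, nodes 1 and 7 have the same degree, but node 1 is discounted once hub 0 is selected, so the discounted heuristic prefers node 7.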

Other heuristics certainly exist in prior work under other models. We do not adapt them to the SC model, since most of these algorithms are designed for a specific model and may become too complicated under the proposed SC model. For all the above algorithms, once a seed set is chosen, we simulate the influence spread on each real data set 1000 times and report the average spread value. For the Greedy algorithm, we use R = 100 to select each seed in the seed set. In the BestOverlappingCoverage algorithm, we use R = 100 to calculate the overlapping gain and overlapping loss, and we use M = 300 to pre-select the potential seed list.
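The evaluation procedure above, averaging the spread over repeated simulations and using R simulations per candidate in Greedy, can be sketched as below. Because the SC model is not defined in this section, the sketch substitutes the independent cascade (IC) model as a stand-in diffusion process; reading R as the number of simulations per candidate is also an assumption, as are all function names and the toy graph.

```python
import random


def simulate_ic(adj, prob, seeds, rng):
    """Run one independent cascade (stand-in for the SC model).

    Returns the number of activated nodes. prob maps frozenset({u, v})
    to the propagation probability of the undirected edge u-v.
    """
    active = set(seeds)
    frontier = list(seeds)
    while frontier:
        next_frontier = []
        for u in frontier:
            for v in adj[u]:
                if v not in active and rng.random() < prob[frozenset((u, v))]:
                    active.add(v)
                    next_frontier.append(v)
        frontier = next_frontier
    return len(active)


def average_spread(adj, prob, seeds, runs, rng):
    """Average the spread of a fixed seed set over `runs` simulations."""
    return sum(simulate_ic(adj, prob, seeds, rng) for _ in range(runs)) / runs


def greedy(adj, prob, k, R, rng):
    """Add, k times, the node with the largest estimated spread,
    using R simulations per candidate."""
    seeds = []
    for _ in range(k):
        best, best_spread = None, -1.0
        for v in adj:
            if v in seeds:
                continue
            spread = average_spread(adj, prob, seeds + [v], R, rng)
            if spread > best_spread:
                best, best_spread = v, spread
        seeds.append(best)
    return seeds


# Toy graph (hypothetical) with two components: {0, 1, 2} and {3, 4}.
edges = [(0, 1), (1, 2), (3, 4)]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)
prob = {frozenset(e): 0.05 for e in edges}

rng = random.Random(0)
seeds = greedy(adj, prob, k=2, R=100, rng=rng)
print(seeds, average_spread(adj, prob, seeds, runs=1000, rng=rng))
```

Greedy evaluates every remaining node for every seed, so its cost grows with both the network size and R, which is why a smaller R is used for seed selection than for the final 1000-run evaluation.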
