Influence Maximization in GOLAP


UC Irvine

UC Irvine Electronic Theses and Dissertations

Title

Influence Maximization in GOLAP

Permalink

https://escholarship.org/uc/item/3d9063z9

Author

Jin, Jennifer Kim

Publication Date

2019


UNIVERSITY OF CALIFORNIA, IRVINE

Influence Maximization in GOLAP

DISSERTATION

submitted in partial satisfaction of the requirements for the degree of

DOCTOR OF PHILOSOPHY

in Computer Engineering

by

Jennifer Kim Jin

Dissertation Committee:

Professor Phillip Sheu, Chair
Professor Jean-Luc Gaudiot
Professor Nikil Dutt

LIST OF FIGURES

Figure 1. An example of node and edge hierarchy

Figure 2. Example of SIP

Figure 3. Example of Top-m SIP

Figure 4. Example of k-colors SIP-t

Figure 5. Exactly SIP-t

Figure 6. Example k-colors SIP-t%

Figure 7. At most SIP-t%

Figure 8. Knapsack Conversion

Figure 9. Heuristic Algorithm SIP-t%

Figure 10. Assumption 1

Figure 13. IM-SIP

Figure 14. Citation network

Figure 15. IM-SIP-T Greedy

Figure 16. Greedy and Optimal solution

Figure 17. Relevance between Genes for Diseases

Figure 18. Example of ‘Gene-as-Node’ Relevance Graph

Figure 19. A GI cancer network derived from abstracts that are stored in PubMed, using co-occurrence and text mining

Figure 20. Associations of genes with GI cancers based on the literature, gene ontology, pathway, and transcription factor enrichment analysis

Figure 22. Time elapse for IM-SIP-t colored nodes constraint

LIST OF TABLES

Table 1. Number of s node combinations

ACKNOWLEDGMENTS

I would like to express my deepest appreciation to my advisor, Professor Phillip Sheu. Professor Sheu has been a tremendous mentor to me. I would like to thank him for encouraging my research and for allowing me to grow as a scholar. His strong guidance, scientific advice and knowledge, constructive criticism and many insightful discussions and suggestions have been invaluable throughout the years.

I am thankful to the members of my lab. Guigang Zhang helped me with the thesis structure, provided me with ideas, and helped me develop the use cases and examples. Shaoting Wang helped me brainstorm algorithm ideas. I would like to express my sincere gratitude to Charles Wang from Asia University for collaborating with me on the biomedical project. His knowledge and expertise in bioinformatics have been extremely helpful. I would like to extend my sincere thanks to Bryan Chou, Jimmy Shih and Kyle Li for helping me with my theorems and proofs.

I am also thankful for the support from NEC Solution Innovators, Ltd., Japan, a collaborator of our lab. Mr. Hiroyuki Shigematsu and Mr. Atsushi Kitazawa gave me many beneficial ideas on my research directions. I am also grateful to Mr. Masahiro Hayakawa for helping me polish my papers.

For this dissertation I would also like to thank my committee members: Professor Jean-Luc Gaudiot and Professor Nikil Dutt. With their expert knowledge in this area, they provided many useful comments so I could refine my dissertation and enlightened me to some future research directions.

I would also like to thank my friends for providing the support and friendship that I needed. I am deeply thankful to my family. Words cannot express how grateful I am to my mom and dad for their love, invaluable advice, undying support and sacrifices. I am extremely grateful to my mother-in-law and father-in-law for always supporting me with prayer and encouraging words. Your prayer for me was what sustained me thus far. This success would not have been possible without the support and nurturing of my beloved husband, Daniel Jin. I would also like to thank my best friend and my amazing sister, Mira Kim, for supporting me in everything. I can’t thank her enough for encouraging me throughout this experience. To my lovely daughters Natalie Grace Jin and Lauren Sophia Jin, I would like to say thank you for being such sweet girls and always cheering me up.

Finally, I thank my God for helping me through all the difficulties. I have experienced your guidance day by day. Thank you, Lord.

Jennifer Kim Jin
University of California, Irvine
2019

CURRICULUM VITAE

Jennifer Kim Jin

2009 B.A. in Computer Science, University of Texas, Dallas

2011 Master’s in Computer Science, University of California, Los Angeles

2011-18 Teaching Assistant, University of California, Irvine

2019 Ph.D. in Computer Science, University of California, Irvine

FIELD OF STUDY

Semantic Computing and Graph Databases

PUBLICATIONS

Charles C.N. Wang, Jennifer Jin, Jan-Gowth Chang, Masahiro Hayakawa, Atsushi Kitazawa, Jeffrey J.P. Tsai and Phillip C.-Y. Sheu, “Identification of Most Influential Co-Occurring Gene Suites for GI Cancers using Biomedical Literature Mining and Graph-based Influence Maximization,” Future Generation Computer Systems Special Issue on Data Exploration in the Web 3.0 Age, submitted on Feb 10, 2019.

Jennifer Jin and Masahiro Hayakawa, “Network Analysis for Graph-based OLAP (GOLAP),” International Journal of Semantic Computing, 13(1), 2018.

Jennifer Jin, “OLAP and machine learning,” Encyclopedia with Semantic Computing and Robotic Intelligence, 1(1), 2017.

Shao-Ting Wang, Jennifer Jin, Pete Rivett, and Atsushi Kitazawa, “Technical Survey Graph Databases and Applications,” International Journal of Semantic Computing, 9(4), pp. 523-545, 2015.

Jennifer Jin, Mira Kim and Pete Rivett, “Technology Outlook: Semantic Computing for Education,” International Journal of Semantic Computing, 9(3), 2015.

Jennifer Kim, George Wang and Sang Tae Bae, “A survey of Big Data Technologies and How Semantic Computing can Help,” International Journal of Semantic Computing, 8(1), 2014.

Jennifer Kim, David Ostrowski, Hiroshi Yamaguchi and Phillip C.Y. Sheu, “Semantic Computing and Business Intelligence,” International Journal of Semantic Computing, 7(1),

Jennifer Kim, “Design and Evaluation of Mobile Applications with Full and Partial Offloadings,” The 7th International Conference on Grid and Pervasive Computing (GPC 2012), Hong Kong, May 2012.

Jennifer Kim, “Architectures with Full and Partial Offloading for Mobile Applications,” IEEE International Conference on Internet (ICONI 2010), Mactan Island, Philippines, Dec. 2010.

Jennifer Kim, “Architecture patterns for Service-based Mobile Applications,” IEEE International Conference on Service-Oriented Computing and Applications (SOCA 2010), Perth, Australia, Dec. 2010.

Jennifer Jin, “Augmented Influence Analytics on Social Network Services,” for Journal Publication (to be completed in Jan. 2019)

Jennifer Jin, “Software Framework for Gene Disease Relevance Analytics,” for Journal

ABSTRACT OF THE DISSERTATION

Influence Maximization in GOLAP

By

Jennifer Kim Jin

Doctor of Philosophy in Computer Engineering

University of California, Irvine, 2019

Professor Phillip Sheu, Chair

The notion of influence among people or organizations has been a core conceptual basis for making decisions and performing social activities in our society. With the increasing availability of datasets in domains such as social networks and digital healthcare, it has become more feasible to apply complex analytics to influence networks. However, technical challenges remain in representing various types of influence networks, handling the variability of analytics types, and optimizing the time complexity of running the analytics. We present a comprehensive approach to managing influence networks using a set of extended graph models, called Graph-based OLAP (GOLAP). The design space for GOLAP is defined by the incorporation of node types (i.e., colors), weights on relationships (i.e., edges), constraints on the number of nodes of a certain node type, and constraints on the percentage of nodes of a certain node type. We begin by defining a method to find a Strongest Influence Path (SIP), which is the strongest path from the source node to the target node. Then, we extend it with k-colors, a constraint on the number of nodes, and a constraint on the percentage of nodes. Hence, we can answer complex queries on influence networks such as “find the SIP with t nodes of color c” or “find the SIP with t% nodes of color c.” Based on the SIP model, we present a set of Influence Maximization (IM) methods which find a set of s seed nodes that can influence the whole graph maximally under various constraints such as having ‘t nodes of color c’. We apply the IM methods to Gastrointestinal (GI) cancer data and show that the proposed approach works well in the context of GI cancers. We use text mining to identify objects and relationships to construct a graph and use graph-based IM to discover the most influential co-occurring genes. We also address methods for optimizing the time complexity of the analytics algorithms, applying heuristic-based and graph reduction-based methods to reduce it. In addition to proving the proposed methods, we present the results of our implementation of the methods.


Chapter 1

Introduction

The research on influence analysis in social networks is becoming a hot topic. For example, in a social network a democrat or a republican can influence another democrat or a republican. In science, an author/scientist may influence another author or scientist and a publication can influence another publication.

There are many existing models for information spreading. The traditional models include the Independent Cascade Model (ICM) [1] and the Linear Threshold Model (LTM) [2]. However, these existing influence models are discrete: if one node accepts another node, it is influenced; if not, it is not influenced at all. In the real world, whether or not one node accepts another node, it is still likely to be influenced to a certain degree. Therefore, we propose a new model based on the probability of influence, which is consequently not discrete. We define this model as the Strongest Influence Path (SIP) model.

Our definition for the SIP model is as follows. Given a graph G, assume an application is only interested in finding the strongest influence path from a node to another node. Can we design an algorithm to solve this problem? We call this problem the SIP (Strongest Influence Path) Problem. We can utilize this model to solve the Influence Maximization (IM) problem, the problem of finding the strongest seed nodes in a graph that can influence the whole graph maximally.

We also discovered that the traditional approaches to the IM problem are limited: they only consider traditional graphs. A traditional graph does not have any way of representing different classes of nodes. In many domains, queries involving various classes of nodes can produce useful information. In a social network, for example, the nodes could be classified based on gender, ethnicity, political party, area of residence, etc. The edges could be classified based on the relationship between the nodes. By using classifications, we are able to answer class-level queries. Thus, with the combination of the two sets of dimensions, we can pose queries such as “Find a republican that has the highest influence on all the other people they work with.” We propose using colors on nodes to represent different classes, and we study the IM problem in the context of “colored” graphs.

While reference [3] introduces Graph-based OLAP (GOLAP) that extends OLAP (On Line Analytical Processing) to address graph-based problems that involve object attributes, it does not address the IM problem.

Our contributions can be summarized as the following:

(1) We propose a new influence model: Strongest Influence Path (SIP) model.

(2) We propose algorithms to solve the k-colors constraint SIP problem and its variants.

(3) We propose the k-colors graph reduction methods based on the SIP model.

(4) We propose algorithms to solve the k-colors Influence Maximization problem based on the SIP model.

In Chapter 2, we summarize related work. We define the key concepts of this dissertation in Chapter 3. In Chapter 4, we propose the definition of SIP and its algorithms. In Chapter 5, we introduce Top-m SIP and its algorithms. In Chapter 6, we discuss SIP with t colored nodes. We present algorithms for SIP with t% colored nodes in Chapter 7 and a heuristic algorithm in Chapter 8. We introduce the k-colors influence maximization problem based on SIP in Chapter 9 and a greedy heuristic algorithm in Chapter 10. In Chapter 11, we present the k-colors influence maximization problem in the biomedical domain. We propose a k-colors graph reduction algorithm based on SIP in Chapter 12. Chapter 13 reports the experimental results that validate our methods. Finally, in Chapter 14 we conclude the dissertation and discuss some possible future research directions.

Chapter 2

Related work

In this section, we summarize existing research that is related to GOLAP, the strongest path and influence in graphs.

2.1. Influence Research

Influence analysis has attracted researchers’ interest since the 1950s. More and more scientists and researchers have come to realize the importance of influence in social networks [4, 5]. Early on, several theories were proposed, such as the theory of six degrees of separation, the theory of four degrees of separation, and the small world phenomenon [6, 7]. Analyzing influence on social networks [8, 9, 10] can be used to develop new applications, such as advertising, sentiment diffusion [11], link prediction [12], recommender systems and social community finding.

The most influential node in a social network, such as the most important blogger [13] or an opinion leader [14], can be identified by degree centrality: a node with high degree centrality is considered highly important. The PageRank algorithm [15] is also used to find the importance of nodes in social networks. Betweenness Centrality [16] is another key measure of a node’s importance, as is Closeness Centrality. Besides these methods, others include HITS [17], K-Shell [18], and the random walk algorithm [19]. There are several algorithms to measure a node’s importance [20], but the main methods can be summarized into two categories: measurement based on topological structure, and measurement based on contents and users’ behaviors [21].

Influence maximization (IM) is one of the most studied topics in influence research, especially in social networks. The IM problem finds a set of seed nodes that can influence the whole graph maximally. Most existing IM problems use diffusion models [22-24] such as ICM [1], LTM [2], SIS [25] and SIR [26].

IM is being applied to different types of network: labeled social networks [27,28], multiple labeled networks [29], and dynamic social networks [30-33]. Other than general graphs, special graphs such as Bipartite and Hierarchical IM [34] problems have been studied. Most current works focus on positive weighted networks and do not consider negative weighted graphs [35, 36]. IM is being applied to different social networks such as messenger-based social network [37], microblogging [38], and viral marketing social networks [39]. It can be applied to complex social networks [40], multiple social networks [41] and some unknown social networks [42]. These complex social networks may have multiple acceptance [43, 44]. Rahaman and Hosein [45] propose the multi-stage influence maximization problem. Some studies use the non-backtracking random walk method [46, 47]. Since greedy algorithms [48] have high time complexity, some obtain approximation results while sacrificing the accuracy [49]. K-Means based methods [50-55] are also used to solve IM problems. Lee and Chung [56] propose a query approach while Jiang et al. [57] propose the community detection method. Other methods include Memetic [58], SIMPATH [59], and the community discovery algorithm [60].

As most social networks are very large in size, scalability is a major concern. Some researchers propose scalable methods that do not lose efficiency as the size of the network grows. Li and Yu [60] propose a scalable community discovery algorithm to find the core members of the community. Wang et al. [82] present a Reverse Influence Sampling (RIS) framework based on the bottom-k sketch (a summary of a set of items with nonnegative weights that supports approximate query processing), which brings the order of samples into the RIS framework. By applying the sketch technique, they derive early termination conditions that significantly accelerate the seed set selection procedure. Chen et al. [62] propose a scalable IM algorithm tailored for the linear threshold model, based on fast computation in directed acyclic graphs (DAGs). Mohan et al. [63] propose a model to find a set of influential nodes in a very large graph. By dividing the graph into communities and finding influential nodes within each of those communities, a better set of nodes for fast information diffusion is obtained. Chen et al. [64] design a new heuristic algorithm that scales easily to millions of nodes and edges. Their algorithm has a tunable parameter that lets users control the balance between the running time and the influence spread of the algorithm.

As discussed, many methods have been proposed to solve classic IM problems. Others propose special types of IM problems by adding special conditions or constraints, making the problems even more complex. Some special IM problems include privacy reserved influence maximization [65], influence maximization in online social networks [66-68] and in location based social networks [69]. In some social networks, nodes have several attitudes [70]. Others include time constraint IM [71, 72], topic-based IM [73], min-cut IM based on big data, budget IM [74-76] and profit IM [77] for multiple products.

As discussed, there is a large body of existing research on influence in social networks. However, to our knowledge, most of the existing models do not consider partial influence from one node to another. The models are set up so that a node either influences or does not influence another node. Another shortcoming of the existing work is that it does not consider different classes of nodes. It performs IM on regular graphs only, where nodes are not classified.

2.2. Strongest Path

As will be discussed later, the strongest path problem can be translated to a shortest path problem. In graph theory, the shortest path problem is a problem of finding a path between two vertices (or nodes) in a graph such that the sum of the weights of its constituent edges is minimized.

Our SIP-based algorithms are based on Dijkstra’s shortest path algorithm [78]. Other classic shortest path algorithms can be used instead if desired. The classic algorithms include Dijkstra's algorithm [78], the Floyd–Warshall algorithm [79], the A* algorithm [80], and the Bellman–Ford algorithm [81]. Dijkstra's algorithm solves the single-source shortest path problem; the Bellman–Ford algorithm solves the single-source problem when edge weights may be negative; the A* search algorithm solves the single-pair shortest path problem, using heuristics to speed up the search; and the Floyd–Warshall algorithm solves the all-pairs shortest path problem.

2.3. Graph-based OLAP (GOLAP)

To our knowledge, not much work has been done on queries related to different classes of nodes. Chou et al. [3] propose the GOLAP system with two dimensions: nodes and edges. For each dimension, different colors can be used to represent different classes. For example, the nodes in a social network may represent people. In Figure 1, the nodes are colored in blue for democrats and red for republicans. The edges in a social network may represent relationships. Based on the type of relationship, the edges could be colored in different colors. In Figure 1, green edges represent family, red edges represent friends and blue edges represent co-workers.

By using different colors for different types of people, we can visualize different classes and run class-specific queries. An example of a node dimensional query could be finding the top 3 democrats that are the most influential in the entire network. Another query may be finding the most influential people among co-workers. This would be an edge dimensional query.


Figure 1. An example of node and edge hierarchy

Our model utilizes colors to classify the nodes. By applying Graph-based OLAP (GOLAP) proposed by Chou et al. [3], we are able to convey and apply this concept to solve problems using the node dimension.

2.4. Graph Reduction

With big data trending, graph reduction has been a topic of interest for quite some time. In certain applications, we only need a portion of a graph. We can reduce these graphs into small graphs and obtain the same or similar query results.

Wang et al. [82] propose graph reduction that can answer queries without losing information. Navlakha et al. [83] describe a graph summarization method with bounded errors. A graph can be reduced to a two-part representation: summary and corrections. The summary is an aggregate graph that groups the nodes into sets, with edges representing relations between sets. The corrections part specifies a list of edge corrections that should be applied to the summary to recreate the graph. Toivonen et al. [84] compress a weighted graph by merging nodes and edges to achieve the target reduction rate while minimizing the difference in edge weights between the original graph and the compressed graph.

Most social networks are very large in scale. Some algorithms, especially those for solving influence maximization, have high time complexity, making it challenging for users to obtain results in real time. We propose graph reduction to alleviate this issue. In addition, existing work does not consider reducing graphs with colored nodes. Reducing a colored graph is more complex but useful in many domains.

Chapter 3

Definitions

Below we review some terms from graph theory and define the strongest influence path concepts discussed in this thesis.

(1) Weighted Graph: A weighted graph G is a triplet (V, E, w), where w: E → eVal is a function mapping an edge to its weight, and eVal is the set of possible values. For a weighted graph, eVal would usually be real numbers.

(2) Path: A path p from node u to v in G is a sequence of nodes $(u, v_0, v_1, \ldots, v_n, v)$ such that for every $i \in \{1, \ldots, n\}$, $(v_{i-1}, v_i) \in E$.

(3) Shortest Path: A shortest path from u to v is a path $p = (u, v_0, v_1, \ldots, v_n, v)$ for which the sum $\sum_{i=1}^{n} w(v_{i-1}, v_i)$ is minimal. We use the notation $\delta_G(u, v)$ to denote the total weight of the shortest path from u to v. There may be multiple shortest paths between two nodes.

(4) Influence: The influence from a node to another node is depicted as an edge weight ranging between 0 and 1. The edge weight inf(u, v) represents the probability of node u influencing node v.

(5) Strongest Influence Path (SIP): The strongest influence path from a node to another node is the path that has the strongest influence value (calculated by multiplying the influence probabilities along the path). There may be multiple paths from one node to another, but among all these paths there exists at least one strongest influence path.

(6) Influence Maximization (IM): The influence maximization problem finds a set of s seed nodes that can influence the whole graph maximally. Given a parameter s, it asks for an s-node set of maximum influence.

Chapter 4

Strongest Influence Path (SIP)

In this section, we present our approaches to find the strongest influence path from a node to another node in a graph.

4.1. The SIP problem

Let G be a directed graph where each edge (a, b) carries a weight in the range (0, 1] that designates the probability of influence from node a to node b. Note that the weight of an edge should not be 0: edges represent influence, so if there were no influence between the two end nodes, the edge should not exist. The Strongest Influence Path (SIP) problem finds the path from a given source node s to a given destination node d that has the strongest probability. The probability of influence for a path from s to d is calculated by multiplying the influence probabilities of all the edges on the path.

4.2. Use case of SIP

Consider a corporate network that consists of different companies. Each node represents a company and each edge represents influence. A small company s wants to influence a mega company d into investing in a business idea they have. s wants to know which companies to go through to have the highest chance of getting an introduction to the mega company d. As shown in Figure 2, the path highlighted in red is the strongest influence path from node s to node d. A directed edge from one node to another is labelled with the probability that the first node can influence the second node. For simplicity, we combine the two opposite directed edges between two nodes a and b (i.e., they can influence each other) into one edge without an arrow, and label the edge with two probabilities: the first being the probability that node a influences node b and the second being the probability that node b influences node a. The influence probability of the SIP from s to d is then the product of the influences 0.6*0.4*0.7*0.5*0.9*0.8*0.5, which is 0.03024.

Figure 2. Example of SIP
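The path influence in this use case is simply a product of edge probabilities. The following minimal Python sketch reproduces the arithmetic; the function name `path_influence` and the list `sip_edges` are illustrative names introduced here, not part of the original work.

```python
import math

def path_influence(edge_probabilities):
    """Influence of a path: the product of the influence probabilities of its edges."""
    return math.prod(edge_probabilities)

# Edge probabilities along the SIP from s to d, as listed in the use case.
sip_edges = [0.6, 0.4, 0.7, 0.5, 0.9, 0.8, 0.5]
print(round(path_influence(sip_edges), 5))  # 0.03024
```

Because every factor is at most 1, adding an edge can never increase the influence of a path, which is why longer paths tend to be weaker.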

4.3. Algorithm of SIP

Our SIP algorithm, which finds the strongest influence between two nodes, is based on Dijkstra’s shortest path algorithm. To compute the strongest path between two nodes, we can adapt Dijkstra’s algorithm as Hangal et al. do [85]. They define the influence of a path to be the product of the influences of the edges on the path.

For the SIP, suppose there are 3 edges in the path, e1, e2 and e3, whose weights are between 0 and 1. The total influence is calculated by inf(e1)*inf(e2)*inf(e3), which is still between 0 and 1. Maximizing inf(e1)*inf(e2)*inf(e3) is equivalent to minimizing -log(inf(e1)*inf(e2)*inf(e3)) because -log(x) decreases monotonically as x increases: -log(1) = 0, and -log(x) tends to ∞ as x approaches 0. Since -log(inf(e1)*inf(e2)*inf(e3)) = -(log(inf(e1)) + log(inf(e2)) + log(inf(e3))) = (-log(inf(e1))) + (-log(inf(e2))) + (-log(inf(e3))), the problem becomes a shortest path problem with non-negative distances, as -log(x) ≥ 0 when 0 < x ≤ 1. Therefore, our SIP method can correctly find the strongest influence path as long as the edge weights stay in the range (0, 1].
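The equivalence between maximizing a product and minimizing a sum of negative logarithms can be checked numerically. The sketch below is only an illustration: the two candidate paths and their probabilities are hypothetical values chosen for the example, and the helper names are not from the original work.

```python
import math

def influence(path_probs):
    """Product of the edge probabilities on a path."""
    return math.prod(path_probs)

def neg_log_length(path_probs):
    """Path length after replacing each probability p by the distance -log(p)."""
    return sum(-math.log(p) for p in path_probs)

# Two hypothetical candidate paths between the same pair of nodes.
path_a = [0.9, 0.8, 0.5]
path_b = [0.6, 0.7, 0.7]

# exp(-length) recovers the original product for each path.
for probs in (path_a, path_b):
    assert math.isclose(math.exp(-neg_log_length(probs)), influence(probs))

# The path with the larger product is the one with the smaller -log length.
strongest = max((path_a, path_b), key=influence)
shortest = min((path_a, path_b), key=neg_log_length)
print(strongest is shortest)  # True
```

Every transformed distance is non-negative because each probability lies in (0, 1], which is the condition Dijkstra's algorithm needs.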


We repeat Dijkstra’s algorithm in the following:

Partial snippet of Dijkstra’s algorithm (finds the shortest path from a source node to a destination node)

Input: Graph G, source node s, destination node d

Output: The shortest path from s to d

1 for each neighbor v of u: // where v is still in Q, the set of unvisited nodes
2     alt ← dist[u] + length(u, v)
3     if alt < dist[v]: // a shorter path to v has been found
4         dist[v] ← alt
5         prev[v] ← u
6 return dist[], prev[]

In the above, dist[v] designates the distance from the source node to v, and prev[v] is the previous node on the optimal path from the source. Node u is the unvisited node with the shortest distance, which was selected for expansion. As shown in line 2, the distance is calculated as the sum of the weights.

The SIP algorithm is revised from Dijkstra's algorithm by using inf[v], the total influence value from the source node to v, instead of dist[v]. It is calculated as the product of the weights. The SIP algorithm finds the path with the highest overall weight, as shown in line 3.

Partial snippet of SIP algorithm (finds the strongest influence path from a source node to a destination node)

Input: Graph G, source node s, destination node d

Output: The Strongest Influence Path from s to d

1 for each neighbor v of u: // where v is still in Q, the set of unvisited nodes
2     if inf[u] * inf(u, v) > inf[v]: // a stronger influence path to v has been found
3         inf[v] = inf[u] * inf(u, v)
4         prev[v] = u
5 return inf[], prev[]

The time complexity of the SIP algorithm is the same as that of Dijkstra’s algorithm, which is O(e log n), where e is the number of edges and n is the number of nodes.
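As an illustration of how the partial snippet fits into a complete procedure, the following is a minimal Python sketch of a max-product variant of Dijkstra's algorithm. It is a sketch under stated assumptions rather than the implementation used in this dissertation: the adjacency-dictionary graph format, the function name `sip`, and the example graph are all hypothetical.

```python
import heapq

def sip(graph, source, target):
    """Return (influence, path) of the strongest influence path.

    graph: dict mapping node -> {neighbor: probability in (0, 1]} (assumed format).
    Returns (0.0, []) if target is unreachable.
    """
    inf = {source: 1.0}          # best influence found so far for each node
    prev = {}                    # predecessor on the best path
    done = set()                 # nodes whose influence is final
    heap = [(-1.0, source)]      # max-heap emulated with negated influence
    while heap:
        _, u = heapq.heappop(heap)
        if u in done:
            continue
        done.add(u)
        if u == target:
            break
        for v, p in graph.get(u, {}).items():
            candidate = inf[u] * p
            if v not in done and candidate > inf.get(v, 0.0):
                inf[v] = candidate       # a stronger influence path to v was found
                prev[v] = u
                heapq.heappush(heap, (-candidate, v))
    if target not in done:
        return 0.0, []
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return inf[target], path[::-1]

# Hypothetical example: the strongest first hop (s -> a) is not on the SIP.
graph = {
    "s": {"a": 0.9, "b": 0.5},
    "a": {"d": 0.5},
    "b": {"d": 0.95},
    "d": {},
}
value, path = sip(graph, "s", "d")
print(path)  # ['s', 'b', 'd']
```

With a binary heap, each edge causes at most one push, which is consistent with the O(e log n) bound stated above.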


Chapter 5

Top-m SIP (Top-m Strongest Influence Path)

In this section, we present our approaches to find the top-m strongest influence paths from a node to another node in a graph.

5.1. The Top-m SIP Problem

Let G be a directed graph where each edge (a, b) carries a weight between 0 and 1 that designates the probability of influence from node a to node b. SIP-m (s, d) denotes the m-th strongest influence path from the source node s to the destination node d. The Top-m Strongest Influence Path (Top-m SIP) problem finds not just the strongest influence path but the top-m strongest influence paths from a node to another node (SIP-1, SIP-2, …, SIP-m).

5.2. Use case of Top-m SIP

Consider a corporate network that consists of different companies. Each node represents a company and each edge represents an influence. Word on the street is that a huge software company d is looking for a hardware company to collaborate with on creating the next big tech gadget. A hardware company s would like a chance to work with company d but knows they have a lot of competition. s is interested in seeing which path of companies to go through to influence company d into picking s as their partner. To compare options, s would like to see what their top three path choices are.

Figure 3 shows an example of the top-m SIP where the value of m is 3. Shown in red is SIP-1, which is the strongest influence path from node s to node d. The second strongest SIP (SIP-2) is shown in blue. The next strongest SIP (SIP-3) is shown in purple. Company s compares their choices and decides to avoid SIP-1 since they had a bad business deal in the past with company 7. Company 3, a rival of s with the potential to steal their opportunity, is on SIP-2, so s decides to skip this path. s chooses SIP-3 to pass a good recommendation to d.

Figure 3. Example of Top-m SIP

5.3. Algorithm of Top-m SIP

The basic idea of the top-m SIP algorithm is that the k-th SIP will be a deviation from the previously discovered SIPs. It first calls the regular SIP algorithm to compute the SIP (SIP-1). In order to find SIP-k, where k ranges from 2 to m, the algorithm assumes that SIP-1 through SIP-(k-1) have previously been found. The algorithm iterates through the edges of the most recently found SIP, removing one of the edges in each iteration and calculating an alternative SIP. Among these alternative routes, the one with the strongest influence is the next SIP.

Our algorithm is similar to Yen’s algorithm [85], which computes the top-m shortest paths from a source node to a destination node. Like Yen’s algorithm, it relies on the idea that each new path will be a deviation from the previously discovered paths. It holds two lists: list A holds the permanent m SIPs, while list B holds the candidate SIPs.

For example, the path SIP-1 in Figure 3 (marked in red) is computed by the SIP algorithm and placed on list A. Subsequently, the SIP calculated without e(s,2) in graph G, the SIP calculated without e(2,4) in graph G, the SIP calculated without e(4,7), …, and the SIP calculated without e(17,d) are placed on list B. Finally, list B (containing the candidate paths) is sorted in decreasing order so that, among all the paths in list B, the one with the highest path weight becomes SIP-2 and is moved to list A.

Next, the SIP calculated without e(2,3) in graph G, the SIP calculated without e(3,6) in graph G, the SIP calculated without e(6,9), …, and the SIP calculated without e(18,d) are placed on list B. Finally, list B (containing the candidate paths) is sorted in decreasing order so that, among all the paths in list B, the one with the highest path weight becomes SIP-3 and is moved to list A. The process is repeated to find SIP-4, SIP-5, etc., until SIP-m is found and placed on list A.

Like Dijkstra’s algorithm, which sums edge weights (ranging from 1 to ∞) to find the minimum path weight, Yen’s algorithm returns the m paths with the smallest path weights. Yen’s algorithm can be converted to the Top-m SIP algorithm in the same way that Dijkstra’s algorithm is converted to the SIP algorithm, as explained in Section IV-C. As long as the edge weights are between 0 and 1, we can find the m paths whose path weights are maximal.
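The conversion mentioned above can be illustrated with a minimal Python sketch. It is not the dissertation's listing: it assumes that a path's influence is the product of its edge weights (as in the worked examples of this document) and that the graph is a dictionary of dictionaries. The function name `sip` and the sample graph `g` are hypothetical. Because every weight is at most 1, extending a path never increases its influence, so the greedy selection of Dijkstra's algorithm remains valid when "smallest sum" is replaced by "largest product".

```python
import heapq


def sip(graph, s, d):
    """Return (influence, path) of the strongest influence path from s to d.

    graph maps node -> {neighbor: weight}, with every weight in (0, 1].
    Returns (0.0, []) when d cannot be reached from s.
    """
    best = {s: 1.0}          # strongest influence found so far for each node
    prev = {}                # predecessor on the strongest path
    done = set()             # nodes whose influence is final
    heap = [(-1.0, s)]       # max-heap emulated with negated influence
    while heap:
        neg, u = heapq.heappop(heap)
        if u in done:
            continue
        done.add(u)
        if u == d:
            break
        for w, p in graph.get(u, {}).items():
            cand = -neg * p  # influence of the path to u extended by (u, w)
            if cand > best.get(w, 0.0):
                best[w] = cand
                prev[w] = u
                heapq.heappush(heap, (-cand, w))
    if d not in best:
        return 0.0, []
    path = [d]
    while path[-1] != s:
        path.append(prev[path[-1]])
    return best[d], path[::-1]


# Hypothetical example graph (not one of the figures).
g = {"s": {"a": 0.9, "b": 0.5}, "a": {"d": 0.5}, "b": {"d": 0.8}, "d": {}}
influence, path = sip(g, "s", "d")
print(influence, path)
```

An equivalent formulation replaces each weight w by −log w and runs an unmodified shortest-path routine, since maximizing a product of weights in (0, 1] is the same as minimizing the sum of their negative logarithms.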

The Top-m SIP algorithm is summarized in the following.

Top-m SIP algorithm (finds the top m strongest influence paths from a source node to a destination node)

Input Graph G, source node s, destination node d, m

Output Top-m Strongest Influence Paths

Function TopmSIP(G, s, d, m):
    // Determine the strongest influence path from the source to the destination.
    A[0] = SIP(G, s, d);
    // Initialize the list that stores the candidate kth strongest paths.
    B = [];
    for k from 1 to m − 1:
        // The spur node ranges from the first node to the next-to-last node of the previous strongest path (size counts nodes).
        for i from 0 to size(A[k − 1]) − 2:
            // The spur node is retrieved from the previous strongest path, A[k − 1].
            spurNode = A[k − 1].node(i);
            // The sequence of nodes from the source to the spur node of the previous strongest path.
            rootPath = A[k − 1].nodes(0, i);
            for each path p in A:
                if rootPath == p.nodes(0, i):
                    // Remove the links that are part of the previous strongest paths which share the same root path.
                    remove p.e(i, i+1) from G;
            for each node rootPathNode in rootPath except spurNode:
                remove rootPathNode from G;
            // Calculate the spur path from the spur node to the destination.
            spurPath = SIP(G, spurNode, d);
            // The entire path is made up of the root path and the spur path.
            totalPath = rootPath + spurPath;
            // Add the candidate kth strongest path to list B.
            B.append(totalPath);
            // Add back the edges and nodes that were removed from the graph.
            restore edges to G;
            restore nodes in rootPath to G;
        if B is empty:
            // This handles the case of there being no spur paths, or no spur paths left. This could happen if the spur paths have already been exhausted (added to A), or there are no spur paths at all, such as when both the source and destination vertices lie along a "dead end".
            break;
        // Sort the candidate paths by influence in decreasing order.
        B.reverseSort();
        // The candidate with the highest influence becomes the kth strongest path.
        A[k] = B[0];
        B.pop();
    return A;

The time complexity of the Top-m SIP algorithm is the same as that of Yen’s algorithm, which is $O(mn^3)$. The Top-m SIP algorithm calls the SIP algorithm to obtain each spur path, and the time complexity of the SIP algorithm is $O(n^2)$. The Top-m SIP algorithm makes $m \cdot l$ calls to the SIP algorithm when computing the spur paths, where $l$ is the length of the spur paths. The expected value of $l$ in a condensed graph is $O(\log n)$ [85], while in the worst case it is $n$. So the best-case time complexity is $O(m n^2 \log n)$ and the worst-case time complexity is $O(mn^3)$.
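As an illustration of the deviation idea, the following Python sketch follows the structure of Yen's algorithm with a max-product SIP routine in place of the shortest-path routine. It is a sketch under stated assumptions, not the dissertation's implementation: path influence is taken to be the product of edge weights, the graph is a dictionary of dictionaries, and the names `sip`, `path_influence`, `top_m_sip` and the sample graph are hypothetical. Instead of physically removing and restoring edges and nodes, the sketch passes the removed items to `sip` as banned sets.

```python
import heapq


def path_influence(graph, path):
    """Product of the edge weights along a node sequence."""
    inf = 1.0
    for u, v in zip(path, path[1:]):
        inf *= graph[u][v]
    return inf


def sip(graph, s, d, banned_nodes=(), banned_edges=()):
    """Strongest influence path from s to d as a node list ([] if none)."""
    best, prev, done = {s: 1.0}, {}, set()
    heap = [(-1.0, s)]  # max-heap emulated with negated influence
    while heap:
        neg, u = heapq.heappop(heap)
        if u in done:
            continue
        done.add(u)
        if u == d:
            break
        for w, p in graph.get(u, {}).items():
            if w in banned_nodes or (u, w) in banned_edges:
                continue
            cand = -neg * p
            if cand > best.get(w, 0.0):
                best[w], prev[w] = cand, u
                heapq.heappush(heap, (-cand, w))
    if d not in best:
        return []
    path = [d]
    while path[-1] != s:
        path.append(prev[path[-1]])
    return path[::-1]


def top_m_sip(graph, s, d, m):
    """Return up to m (influence, path) pairs, strongest first."""
    first = sip(graph, s, d)
    if not first:
        return []
    A, B = [first], []  # A: accepted paths, B: candidate paths
    while len(A) < m:
        last = A[-1]
        for i in range(len(last) - 1):      # spur node index
            root = last[:i + 1]
            # Ban the next edge of every accepted path sharing this root.
            banned_edges = {(p[i], p[i + 1]) for p in A
                            if p[:i + 1] == root and len(p) > i + 1}
            # Ban root nodes (except the spur node) to keep paths simple.
            spur = sip(graph, last[i], d, set(root[:-1]), banned_edges)
            total = root[:-1] + spur
            if spur and total not in A and total not in B:
                B.append(total)
        if not B:
            break
        B.sort(key=lambda p: path_influence(graph, p), reverse=True)
        A.append(B.pop(0))
    return [(path_influence(graph, p), p) for p in A]


# Hypothetical example graph (not Figure 3).
g = {"s": {"a": 0.9, "b": 0.8}, "a": {"d": 0.9, "b": 0.5},
     "b": {"d": 0.9}, "d": {}}
for influence, path in top_m_sip(g, "s", "d", 3):
    print(round(influence, 3), path)
```

Sorting B on every round keeps the sketch close to the listing above; a heap would serve equally well for larger candidate sets.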

Chapter 6

k-Colors Strongest Influence Problem with t Constraint (k-Colors/SIP-t)

In this chapter, we present our approach to finding the strongest influence path with at most/exactly/at least t nodes of color c from one node to another node in a graph.

6.1. k-Colors t-SIP (k-Colors Strongest Influence Path with t Constraint)

Consider a directed k-colors graph G in which each edge (a, b) carries a weight in (0, 1) that designates the probability of influence from node a to node b. Each node has one of k different colors, and the colors represent different types of nodes, such as gender, ethnicity, age group and political preference. The k-colors SIP-t problem finds the SIP from the source node s to the destination node d that passes through exactly/at least/at most t nodes of the same color (called the constraint color).

6.2. Use case of K-colors t-SIP

Consider a corporate map that consists of different companies based on supply chains (Figure 4). Each node represents a company, each edge represents the strength of supply, and each node is colored based on company size: blue for large businesses, green for medium-size businesses, and red for small businesses. A large company d wants to buy a small company s and, at the same time, acquire the intermediate suppliers with the strongest supplies (influence) between the two companies, at least 4 of which are small businesses. So the problem is to find the strongest path from s to d that includes at least 4 intermediate companies that are small businesses (red nodes), exemplified by the red path in Figure 4.


Figure 4. Example of k-colors SIP-t

6.3. Algorithm of k-Colors t-SIP

The basic idea of our algorithm is that it computes the paths by layers. Assume the constraint color is green. First, we compute the paths containing zero green nodes (inf[0]). Then we compute the paths containing one green node (inf[1]) based on inf[0], the paths containing two green nodes (inf[2]) based on inf[1], and so on. We keep going until no more paths can be computed. The visited nodes are reset when starting a new layer of path calculation. To avoid loops in a path, we do not visit a neighbor if it is already in the path. For example, if B is a green node and the path of C’s inf[1] is A→B→C, we do not compute B’s inf[2] from C because it would cause a loop. Note that we assume there exists only one SIP.

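The following is an example. With A as the source node and t=1, the algorithm keeps track of the paths from node A to all the other nodes with zero green nodes and with one green node. For example, inf[1].C denotes the strongest influence from A to C along a path containing exactly one green node.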

Figure 5(a) Step 1

Figure 5(a) shows the layer inf[0] with zero green nodes. We start with the source node, which is A, and visit its outgoing neighbors B, C and E. B is green so it is excluded from inf[0]. However, it is recorded in inf[1] because we will not visit A again (to avoid a loop). Therefore inf[1].B=0.9 (A→B). C is white so inf[0].C=0.1 (A→C). E is white so inf[0].E=0.2 (A→E). Among all the inf[0] values, E has the maximal influence so the next node to visit is E.

Figure 5(b) Step 2

Figure 5(b) computes the influences of node A on E’s outgoing neighbors D and F. D is white so inf[0].D=0.2*0.5=0.1 (A→E→D). F is green so inf[1].F=0.2*0.1=0.02 (A→E→F). Among the unvisited nodes, both nodes C and D have the maximal inf[0] so we randomly choose C.


Figure 5(c) Step 3

Next, as shown in Figure 5(c), we compute the influence of node A on C’s outgoing neighbor D. The influence 0.1*0.5=0.05 (A→C→D) is less than the current inf[0].D=0.1, so there is no need to update its value. Next we visit node D, because it is the only unvisited node with an inf[0] value.

Figure 5(d) Step 4

In Figure 5(d), we compute node A’s influence on D’s outgoing neighbor G. G is white so inf[0].G = 0.1*0.5 = 0.05 (A→E→D→G). Next, we choose the only unvisited node with an inf[0] value, which is G, and mark node D visited.


Figure 5(e) Step 5

In Figure 5(e), G has no outgoing neighbors. No unvisited node with an inf[0] value remains, so we can start the path layer with one green node (inf[1]). Node B has the maximum inf[1] value so we start with B, and mark node G visited.

Figure 5(f) Step 6

Figure 5(f) shows the inf[1] layer. We first reset all the nodes as unvisited, and compute the influences of node A on B’s outgoing neighbor C. C is white so inf[1].C = 0.9*0.9 =0.81 (A→B→ C). So the next node with the maximum inf[1] value is C, and we mark node B visited.


Figure 5(g) Step 7

In Figure 5(g), we compute node A’s influence on C’s outgoing neighbor D. D is white so inf[1].D = 0.81*0.5 = 0.405 (A→B→C→D). Therefore the next node with the maximal inf[1] is D, and we mark node C visited.

Figure 5(h) Step 8

In Figure 5(h), we compute the influence of node A on D’s outgoing neighbor G. G is white so inf[1].G = 0.405*0.5 = 0.203 (A→B→C→D→G). G has no outgoing neighbor. We mark nodes D and G visited.

Figure 5(i) Step 9

Figure 5. Exactly SIP-t

Finally, in Figure 5(i), we compute the influence of node A on F’s outgoing neighbor G which is 0.02*0.1=0.002. Because it is less than inf[1].G which is 0.203, we do not need to update its value and can mark node F as visited. At this point, every node that has an inf[1] value has been visited, so we can go to the next layer inf[2]. Since no path contains 2 green nodes, there are no values for inf[2], so the algorithm stops.

This algorithm computes the SIP including zero green nodes, the SIP including one green node, and so on up to the SIP including t green nodes, from a source node to all the other nodes in the graph. In our example, the source node is A. Given t=1 and destination node d = C, we look at the value of inf[1].C, which is 0.81 (A→B→C). Given t=0 and destination node d = D, we look at the value of inf[0].D, which is 0.1 (A→E→D).
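The values in the walkthrough can be checked with a short Python sketch. Several things here are assumptions rather than the dissertation's code: the edge list is reconstructed from the steps described in the text (the figure itself is not reproduced), the names `layered_sip` and `rebuild` are hypothetical, and the search is written as a single max-product Dijkstra over (layer, node) states instead of the layer-by-layer loop. The sketch does not track which nodes are already on a path, so it is exact on acyclic graphs such as this example but may return walks that revisit nodes on graphs with cycles.

```python
import heapq


def layered_sip(graph, colors, s, c, t):
    """inf[j][v] is the strongest influence from s to v over paths that
    contain exactly j nodes of color c (the source node is counted)."""
    inf = [dict() for _ in range(t + 1)]
    prev = [dict() for _ in range(t + 1)]
    j0 = 1 if colors.get(s) == c else 0
    if j0 > t:
        return inf, prev
    inf[j0][s] = 1.0
    heap = [(-1.0, j0, s)]   # max-heap emulated with negated influence
    done = set()
    while heap:
        neg, j, u = heapq.heappop(heap)
        if (j, u) in done:
            continue
        done.add((j, u))
        for w, p in graph.get(u, {}).items():
            jw = j + (1 if colors.get(w) == c else 0)  # layer reached at w
            if jw > t:
                continue
            cand = -neg * p
            if cand > inf[jw].get(w, 0.0):
                inf[jw][w] = cand
                prev[jw][w] = (j, u)
                heapq.heappush(heap, (-cand, jw, w))
    return inf, prev


def rebuild(prev, j, v):
    """Follow predecessor links back from state (j, v) to the source."""
    path = [v]
    while v in prev[j]:
        j, v = prev[j][v]
        path.append(v)
    return path[::-1]


# Edge weights as described in the Figure 5 walkthrough; B and F are green.
graph = {"A": {"B": 0.9, "C": 0.1, "E": 0.2}, "B": {"C": 0.9},
         "C": {"D": 0.5}, "D": {"G": 0.5}, "E": {"D": 0.5, "F": 0.1},
         "F": {"G": 0.1}, "G": {}}
colors = {"B": "green", "F": "green"}
inf, prev = layered_sip(graph, colors, "A", "green", 1)
print(inf[1]["C"], rebuild(prev, 1, "C"))
print(inf[0]["D"], rebuild(prev, 0, "D"))
```

Running the sketch on this edge list gives inf[1].C from the path A→B→C and inf[0].D from the path A→E→D, in line with the steps of Figure 5.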

The k-Colors SIP-t algorithm is summarized as follows.

t-SIP-Path constraint Algorithm (finds all strongest influence paths from a node to every other node with exactly t nodes of color c)

Input: Graph G, source node s, destination node d, number t of nodes of color c (constraint), color c

Output: All strongest influence paths from s to every other node containing exactly t nodes of color c

Create vertex set R  // visited nodes
int i, j;
for each vertex v in G:
    for i = 0 to t:
        inf[i].v ← −∞  // unknown influence from source to v going through i nodes of color c
        prev[i].v ← UNDEFINED  // previous node in optimal path from source
if color[s] ≠ c: inf[0].s ← 1
else if t ≥ 1: inf[1].s ← 1
for j = 0 to t:
    R ← ∅  // reset the visited nodes for each layer
    while some vertex not in R has inf[j] > −∞:
        u ← vertex not in R with max inf[j]  // node with the strongest influence is selected next
        R ← R ∪ {u}
        for each neighbor w of u that is not already on the path to u:
            if color[w] == c:
                if j + 1 ≤ t and inf[j+1].w < inf[j].u * inf(u, w):
                    inf[j+1].w ← inf[j].u * inf(u, w)
                    prev[j+1].w ← u
            else:
                if inf[j].w < inf[j].u * inf(u, w):
                    inf[j].w ← inf[j].u * inf(u, w)
                    prev[j].w ← u
return inf[t].d, prev[t].d

The time complexity of SIP-t is $O(t \cdot n \cdot e \log n)$. This is because the complexity of Dijkstra’s algorithm is $O(e \log n)$, we keep track of t different influence levels, and we reset the visited nodes for each influence level, which contributes another factor of n.

Theorem 1: The k-Colors SIP-t algorithm returns the SIP with the t nodes of color c constraint.

Proof:

Let findSIPcConst(s, t, c) be a function that returns the SIP from the source node s containing exactly t nodes of color c. We use induction on t to prove correctness. The base case is to show that findSIPcConst(s, 0, c) holds. The base case finds the SIP from the source to every other node including exactly 0 nodes of color c. That is, the SIP does not contain any node of color c.

Let inf[0].v be the label found by the algorithm and let δ[0].v be the strongest path influence from s to v. We want to show that inf[0].v = δ[0].v for every vertex v at the end of the algorithm, showing that the algorithm correctly computes the influences. We prove this by induction on |R| via the following lemma: for each x ∈ R, inf[0].x = δ[0].x. The base case of the lemma is |R| = 1. Since R only grows in size, the only time |R| = 1 is when R = {s}. If color[s]==c, then inf[0].s remains −∞, since no path from s to s with 0 nodes of color c exists. Otherwise, inf[0].s = 1 = δ[0].s, which is correct.

Let u be the next vertex added to R, and let R’ = R ∪ {u}. Our inductive hypothesis is that inf[0].x = δ[0].x for each x ∈ R. By the inductive hypothesis, every vertex in R’ other than u has the correct label, so we need only show that inf[0].u = δ[0].u to complete the proof. If color[u]==c, we would not consider u, because any path to it would contain at least 1 node of color c and would not meet the inf[0] constraint. If color[u]!=c, suppose for a contradiction that the strongest path from s to u with no node of color c is Q and has influence inf(Q) > inf[0].u.

Q starts in R and at some point leaves R (to get to u, which is not in R). Let xy be the first edge along Q that leaves R, and let Qx be the s-to-x subpath of Q. Clearly, inf(Qx) * inf(x, y) ≥ inf(Q). Since inf[0].x is the influence of the strongest s-to-x path by the I.H., inf[0].x ≥ inf(Qx), giving us inf[0].x * inf(x, y) ≥ inf(Q). Since y is adjacent to x, inf[0].y must have been updated by the algorithm, so inf[0].y ≥ inf[0].x * inf(x, y).

Finally, since u was picked by the algorithm (it picks the node with the largest label as the next node to visit), u must have the largest label among all the unvisited nodes (u, y, etc.): inf[0].u ≥ inf[0].y. Combining these inequalities in reverse order gives us the contradiction that inf[0].u > inf[0].u. Therefore, no such stronger path Q can exist, and so inf[0].u = δ[0].u. Applying the lemma with R = V shows that the algorithm is correct. Therefore, the base case holds.

Our inductive hypothesis is that findSIPcConst(s, k, c) holds, where 0 ≤ k < t. This means inf[k].u contains the strongest influence value from s to u with exactly k nodes of color c. Using the I.H., we show that findSIPcConst(s, k+1, c) holds. According to our algorithm, for any node v in graph G where edge (u, v) exists, there are two cases. In the first case, the newly visited neighbor v is not of color c; this part of the algorithm is the same as the base case where color[v]!=c, so the same proof applies. In the second case, the newly visited neighbor v is of color c, and we use proof by contradiction to prove correctness.

Suppose for a contradiction that the strongest path from s to v with k+1 nodes of color c is Q and has influence inf(Q) > inf[k+1].v. By our I.H., inf[k].u is the influence of the strongest path from s to u with exactly k nodes of color c. If inf[k].u * inf(u,v) is greater than inf[k+1].v, then inf[k+1].v is updated to inf[k].u * inf(u,v). A subpath of a strongest path is a strongest path. That means if inf[k+1].v is the strongest path from s to v, then inf[k].u is the strongest path from s to u. Since inf(Q) > inf[k+1].v, inf[k+1].v, which is inf[k].u * inf(u,v), is not the strongest path. That means inf[k].u is not the strongest path. But by our inductive hypothesis, inf[k].u is indeed the strongest path from s to u. This gives us a contradiction. In both cases, findSIPcConst(s, k+1, c) returns an SIP with k+1 nodes of color c. Consequently, findSIPcConst(s, t, c) holds for any t.

Chapter 7

k-Colors Strongest Influence Problem with t% Constraint (k-Colors/SIP-t%)

In this chapter, we present our approach to finding the strongest influence path with at most/exactly/at least t% nodes of color c from one node to another node in a graph.

7.1. k-Colors SIP-t% (k-Colors Strongest Influence Path with t% Constraint)

In Chapter 6, we discuss the k-Colors SIP algorithm that includes exactly/at most/at least t nodes of a constraint color c. However, some applications might be interested in finding SIPs with a percentage constraint on the nodes of the same color.

Consider a directed k-colors graph G in which each edge (a, b) carries a weight in (0, 1) that designates the probability of influence from node a to node b. Each node has one of k different colors, and the colors represent different types of nodes. The k-colors SIP-t% problem finds the SIP from the source node s to each destination node d that passes through exactly/at least/at most t% nodes of the same color.

7.2. Use case of k-colors t%-SIP

Consider a corporate network that consists of different companies as shown in Figure 6. Each node represents a company and each edge represents influence. Company s wants to reach out to company d about working on a project together. The colors of the nodes represent the types of companies. For example, red nodes represent IT companies, blue nodes represent material companies, white nodes represent software companies and green nodes represent hardware companies. The path highlighted in red is the strongest influence path from node s to node d representing an IT company. The percentage of red nodes on this SIP is computed as total # of red nodes/total # of nodes = 1/8 = 12.5%.
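The percentage computation above can be written as a small Python helper. This is only a sketch: the function name `color_percentage`, the node names and the color assignment are hypothetical and are not taken from Figure 6; the path is simply chosen to have 8 nodes, one of which is red.

```python
def color_percentage(path, colors, c):
    """Percentage of nodes on `path` whose color is `c`."""
    matching = sum(1 for v in path if colors.get(v) == c)
    return 100.0 * matching / len(path)


# Hypothetical 8-node path with a single red node.
path = ["s", "n1", "n2", "n3", "n4", "n5", "n6", "d"]
colors = {"n3": "red", "n1": "blue", "n5": "green"}
print(color_percentage(path, colors, "red"))  # 1/8 of the nodes are red
```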

Figure 6. Example k-colors SIP-t%

7.3. Algorithm of at most k-Colors SIP-t%

Our basic idea is as follows. We call the basic SIP function to obtain the SIP from node s to node d. Then, we check whether the SIP satisfies the t% constraint. If it does, we return the SIP as the solution. If it does not, we explore alternative SIPs. To compare the SIP values of node s’s neighbors, we compute the SIP from each of s’s neighbors to d and multiply it by the influence value from s to that neighbor. For example, if s has neighboring nodes 1, 2 and 3, we compare inf(s, 1)*SIP(1, d), inf(s, 2)*SIP(2, d) and inf(s, 3)*SIP(3, d). Out of the paths that meet the t% constraint, we pick the one with the maximum influence value. The count function counts the total number of nodes. For instance, $\mathrm{count}(SIP(s,d))$ counts the total number of nodes on the SIP from node s to node d, and $\mathrm{count}(SIP(s,d)_{green})$ counts the total number of green nodes on that SIP.

Figure 7(a) Step 1

Consider the example shown in Figure 7(a). First, we find the SIP from s to d, as shown in Figure 7(a). The constraint on SIP(s, d) is that it has to satisfy $\frac{\mathrm{count}(SIP(s,d)_{green})}{\mathrm{count}(SIP(s,d))} \le t\%$.

Figure 7(b) Step 2

Next, we check whether SIP(s, d) meets the green node constraint. If it does, SIP(s, d) is the solution. If it does not, we calculate the influence from the source node s to each of its neighbors (node 1, node 2, node 3), shown in blue in Figure 7(b). The influence value from node s to node 1 is denoted inf(s, 1). So we have inf(s, 1)=0.5, inf(s, 2)=0.8 and inf(s, 3)=0.8.


Figure 7(c) Step 3

Then, we calculate the SIPs from s’s neighbors (node 1, node 2, node 3) to d, which are SIP(1, d), SIP(2, d) and SIP(3, d), as shown in Figure 7(c). If the number of green nodes on the SIP from node 1 to node d, divided by the total number of nodes on that SIP plus 1 (for node s), is less than or equal to t%, that is, $\frac{\mathrm{count}(SIP(1,d)_{green})}{\mathrm{count}(SIP(1,d))+1} \le t\%$, we consider that path a candidate path. If the conditions $\frac{\mathrm{count}(SIP(1,d)_{green})}{\mathrm{count}(SIP(1,d))+1} \le t\%$, $\frac{\mathrm{count}(SIP(2,d)_{green})}{\mathrm{count}(SIP(2,d))+1} \le t\%$ and $\frac{\mathrm{count}(SIP(3,d)_{green})}{\mathrm{count}(SIP(3,d))+1} \le t\%$ are satisfied, then the result is the path with the highest influence value among all the candidate paths: $\max\{inf(s,1) \cdot inf(SIP(1,d)),\ inf(s,2) \cdot inf(SIP(2,d)),\ inf(s,3) \cdot inf(SIP(3,d))\}$.


Figure 7(d) Step 4

Among the three candidate paths, we want to pick a path that meets the constraint and has the highest influence value. Because two paths, (s→SIP(1, d)) (let us call this path1) and (s→SIP(2, d)) (path2), satisfy the constraint and another, (s→SIP(3, d)) (path3), does not, as shown in Figure 7(d), we compare the influence value of each path. Since inf(path1) > inf(path2), we do not consider path2 as a solution. If inf(path1) > inf(path2) and inf(path1) > inf(path3), then the program ends and the result is path1.

In our example in Figure 7(d), since inf(path1) > inf(path2) and inf(path1) < inf(path3), we store the influence value of path1. Next, we consider alternative paths derived from path3 that can meet the constraint. The alternative paths derived from path3 in our example are s→3→2→SIP(2, d) (path3.1) and s→3→6→SIP(6, d) (path3.2). Since SIP(2, d) or SIP(6, d) could contain more or fewer green nodes than path3, these alternate paths may meet the green node constraint. In this example, the constraint is to have no more than 30% green nodes. Assume path3 is s→3→6→5→7→11→15→21→25→d (4 green nodes in the path) and path3.1 is s→3→2→5→8→13→19→23→27→d (2 green nodes in the path). Path3.1 meets the constraint while path3 does not. Path3.2 in this case is the same path as path3, since path3 is s→SIP(3, d), which goes through node 6, and path3.2 is s→3→6→SIP(6, d). Therefore, they have the same influence value. We then compare the influence value of path3.1 with the influence value of path1 and choose the path with the higher influence value. We do this because path3 has the highest influence value among the three candidate paths but does not meet the constraint. The alternative paths derived from path3 have a high likelihood of having a higher influence value than path1 while satisfying the constraint.

Note that since path1 meets its green node constraint, we do not need to consider alternate paths of path1 since path1 contains the SIP from node 1 to node d. Alternate paths of path1 have weaker influence values than path1. If an alternate path of path1 had a higher influence value than path1, then that path would have been the path1. Path2’s influence value is lower than path1’s influence value. Path2’s alternate paths have weaker influence values than path2 so we do not need to consider alternate paths of path2. The only reason we are considering alternate paths of path3 is because path3 has a stronger influence value than path1 so alternate paths of path3 might still be stronger than path1 while meeting the green node constraint.


Figure 7 (e) Step 5

Now, we compute the influence values of path3’s alternate paths. We calculate the influence from node 3 to all its neighbors which are nodes 2 and 6. Inf (3, 2) =0.7 and inf (3, 6) = 0.5. Next, we calculate the influence from the source node s to node 3 to its neighbors. Inf (s→3→2) = 0.56 and inf (s→3→6) = 0.35. Then, we calculate SIP(2, d) and SIP(6, d) as shown in Figure 7(e).

Figure 7 (f) Step 6


As shown in Figure 7(f), we check whether the alternate paths of path3 meet the green node constraint conditions $\frac{\mathrm{count}(SIP(2,d)_{green})}{\mathrm{count}(SIP(2,d))+2} \le t\%$ and $\frac{\mathrm{count}(SIP(6,d)_{green})}{\mathrm{count}(SIP(6,d))+2} \le t\%$. If these conditions are satisfied, then the result is the path with the highest influence value among all the candidate paths: $\max\{inf(s,1) \cdot inf(SIP(1,d)),\ inf(s,3) \cdot inf(3,2) \cdot inf(SIP(2,d)),\ inf(s,3) \cdot inf(3,6) \cdot inf(SIP(6,d))\}$. If none of the paths from this step satisfies the t% constraint, we repeat the steps until we get the result.
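For small graphs, the outcome of the steps above can be cross-checked against an exhaustive search. The Python sketch below is a brute-force baseline, not the neighbor-expansion procedure described in this section: it enumerates every simple path from s to d and keeps the strongest one whose share of nodes of color c is at most t%. It takes exponential time in the worst case. The function name, the sample graph and its colors are hypothetical, and path influence is assumed to be the product of edge weights.

```python
def best_path_at_most(graph, colors, s, d, c, t_percent):
    """Strongest simple s-to-d path with at most t_percent nodes of color c.

    Returns (influence, path), or (0.0, []) if no path qualifies.
    """
    best = (0.0, [])

    def dfs(u, path, inf, n_c):
        nonlocal best
        # Weights are at most 1, so extending a path cannot raise its
        # influence; a partial path no better than the best can be dropped.
        if inf <= best[0]:
            return
        if u == d:
            if 100.0 * n_c / len(path) <= t_percent:
                best = (inf, list(path))
            return
        for w, p in graph.get(u, {}).items():
            if w in path:          # keep the path simple (no loops)
                continue
            path.append(w)
            dfs(w, path, inf * p, n_c + (1 if colors.get(w) == c else 0))
            path.pop()

    dfs(s, [s], 1.0, 1 if colors.get(s) == c else 0)
    return best


# Hypothetical example: the stronger path runs through a green node.
g = {"s": {"1": 0.5, "3": 0.8}, "1": {"d": 0.9}, "3": {"d": 0.9}, "d": {}}
colors = {"3": "green"}
print(best_path_at_most(g, colors, "s", "d", "green", 30))
print(best_path_at_most(g, colors, "s", "d", "green", 40))
```

With a 30% limit the path through node 3 is rejected because one of its three nodes is green, so the weaker path through node 1 is returned; with a 40% limit the path through node 3 qualifies.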

The algorithm is as follows.

SIP-Path t% node of color c constraint Algorithm (finds the strongest influence path from a source node to a destination node with exactly t% nodes of color c)

Input: Graph G, Source Node s, Destination Node d, percentage t of colored node constraint, color c

Output: The strongest influence path from s to d containing exactly t% nodes of color c

1. char findSIP_t_Percent(G, s, d, t, c){
2.     if (check_t_Percent(SIP(s, d), t, c))
3.         return SIP(s, d);
4.     else
5.         char pathShared = [s];
6.         char candidatePaths[];
7.         getAltPaths(pathShared, candidatePaths, t, c);
8.         candidatePaths.sort(key=inf, reverse=True);
9.         return candidatePaths[0];
10. }
11. bool check_t_Percent(path, t, c){
12.     int cNode = 0;
13.     bool meetConstraint = false;
14.     for each node p in path:
15.         if p.color == c
16.             cNode++;
17.     if cNode/len(path)*100 == t
18.         meetConstraint = true;
19.     return meetConstraint;
20. }
21. void getAltPaths(pathShared, candidatePaths, t, c){
22.     float cMax = 0;
23.     u = pathShared[-1];
24.     for every neighbor v of u:
25.         tempPath = pathShared.concat(SIP(v, d))
26.         if (inf(tempPath) > cMax)
27.             if (check_t_Percent(tempPath, t, c))
28.                 cMax = inf(tempPath);
29.                 candidatePaths.append(tempPath);
30.             else
31.                 pathShared.append(v);
32.                 getAltPaths(pathShared, candidatePaths, t, c);
33. }

Theorem 2: The k-Colors SIP-t% algorithm returns the SIP with t% nodes of color c constraint.

Proof:

That is, findSIP_t_Percent(s, d, t, c) returns the specified SIP. Let the original SIP(s, d) be SIPorig (lines 2, 3). If SIPorig does not meet the t% constraint, we explore alternate paths (lines 5~7). Let alt_Paths be the alternate paths of SIPorig. This involves a recursion of n iterations until no candidate alternative path has an influence value exceeding the current maximum influence (lines 21~33). This part refers to “if (inf(tempPath) > cMax)” in the algorithm.

The following then hold:

(1) All the candidate alternative paths have been explored.

(2) At the end of the last recursive step, each path in the resulting list of candidate paths starts with s and ends at d.

Now, sorting the paths in the resulting list and choosing the path with the highest influence value yields the final SIP satisfying the constraint (lines 8, 9), that is, having t% nodes of color c.

This completes the proof.

■

Theorem 3: The k-Colors SIP-t% problem is NP-complete.

Proof: We first show that the decision version of SIP-t% is NP-complete. Then we can reduce the decision version to the optimization version. The decision version of SIP-t% determines whether there exists a path p from s to d that has inf(p) ≥ m and meets the t% constraint.


First, we prove that the decision version of SIP-t% is in NP by showing that a given solution can be verified in polynomial time. Given a path, it is easy to check whether the influence is at least m and whether the ratio of nodes of color c to all nodes is t%. We multiply the edge weights along the path to check whether the path weight is at least m, and we compute (number of nodes of color c in path / number of all nodes in path)*100 to see whether it is exactly t%. Both checks can be done in polynomial time, since counting the nodes on a path takes time at most proportional to the number of edges e. So SIP-t% is in NP.
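The verification step can be sketched in Python as a single pass over the candidate path. The function name `verify_certificate`, the graph encoding (dictionary of dictionaries) and the sample data are assumptions made for illustration. The check runs in time linear in the length of the path, which is what membership in NP requires.

```python
def verify_certificate(graph, colors, path, s, d, c, t_percent, m):
    """Check that `path` is a simple s-to-d path in `graph` whose influence
    is at least m and whose share of nodes of color c is exactly t_percent."""
    if not path or path[0] != s or path[-1] != d:
        return False
    if len(set(path)) != len(path):      # a path must not repeat nodes
        return False
    inf = 1.0
    for u, v in zip(path, path[1:]):
        if v not in graph.get(u, {}):    # every step must be an edge
            return False
        inf *= graph[u][v]
    n_c = sum(1 for v in path if colors.get(v) == c)
    # Compare n_c / len(path) * 100 with t_percent without dividing.
    return inf >= m and n_c * 100 == t_percent * len(path)


# Hypothetical instance: one green node among four, influence 0.504.
g = {"s": {"a": 0.9}, "a": {"b": 0.8}, "b": {"d": 0.7}, "d": {}}
colors = {"a": "green"}
print(verify_certificate(g, colors, ["s", "a", "b", "d"], "s", "d",
                         "green", 25, 0.5))
```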

Next, we show that the knapsack problem is a special case of the SIP-t% problem in which t=100. We represent the items in the knapsack in graph form, where every item is one or more nodes of color c, as shown in Figure 8. Each node is of color c and has weight 1. The number of nodes of color c on the path should not exceed the weight constraint w, so the number of nodes of color c on the SIPs can vary from 0 to w. Since the items have different weights, we link nodes together to represent each weight. For instance, if the weight of an item is 3kg, we link 3 nodes together with no other outgoing edges: once node 1 is reached, the only outgoing edge is to node 2, and the only outgoing edge from node 2 is to node 3. Each pink bubble in Figure 8 represents an item as a group of nodes, and the number of nodes in each bubble represents the weight of the item. Then, we can reduce knapsack to a special case of SIP-t%.


We create a knapsack problem that has non-negative weights $a_1, a_2, \dots, a_n$ and profits $c_1, c_2, \dots, c_n$. The problem is to determine whether there is a subset of weights with total weight at most $B$ such that the corresponding profit is at least $K$. The SIP-t% problem has non-negative nodes $n_1, n_2, \dots, n_n$ and influence weights $e_1, e_2, \dots, e_n$.
