Low latency queries on big graph data


LOW LATENCY QUERIES ON BIG GRAPH DATA

BY

RACHIT AGARWAL

DISSERTATION

Submitted in partial fulfillment of the requirements

for the degree of Doctor of Philosophy in Electrical and Computer Engineering in the Graduate College of the

University of Illinois at Urbana-Champaign, 2013

Urbana, Illinois

Doctoral Committee:

Assistant Professor Philip Godfrey, Chair

Assistant Professor Matthew Caesar

Professor Bruce Hajek

Professor Jennifer Rexford, Princeton University

Professor Nitin Vaidya


Abstract

The availability of large datasets and on-demand system capacity to analyze these datasets has led to exciting new applications in the context of big graph data. Many big graph data applications (social search and ranking, personalized and socially-sensitive search, social network analysis, online advertising, to name a few) require computing distances and paths between vertices in the graph. Systems for these applications need to meet three performance goals: (1) low memory footprint; (2) low latency; and (3) small stretch, that is, the ratio of the cost of the path returned by the system to the cost of the actual shortest path. The theory community has established that meeting these goals is impossible for extremely dense graphs. The central theme of this dissertation is to show that these goals can, in fact, be achieved by exploiting graph sparsity, a property almost always encountered in big graph data.

This dissertation formally establishes a separation between the sparse and the dense cases for the problem of computing distances on graphs. For the realistic case of sparse graphs, our algorithms exhibit a smooth three-way trade-off between space, stretch and query time, a phenomenon that does not occur in dense graphs. Specific operating points on this trade-off space give us linear-space data structures for computing paths of stretch 2, 3 and larger, and the first data structure for computing paths of stretch less than 2 on general weighted undirected graphs.

We then apply our techniques and algorithms to build systems that enable efficient path computations for various big graph data applications. We first present ASAP, a system that almost always computes the exact shortest distance in tens of microseconds on graphs with millions of vertices and edges. We then present ShapeShifter, a system that enables efficient computation of short paths on dynamic graphs; ShapeShifter can update, upon an edge insertion and/or deletion, the underlying data structure within tens of microseconds and answers each user query in less than a millisecond.


Acknowledgments

I feel very privileged to have worked with my thesis advisors, Matthew Caesar and Brighten Godfrey. To both of them, I owe a great debt of gratitude for their patience, support and friendship. For the last four years, Matt has been a brilliant force in guiding me towards the questions. He has an uncanny ability to challenge fundamental assumptions, and to synthesize new research problems surrounding these assumptions. He has instilled in me the taste for important research, and for that I thank him. When it comes to finding answers, Brighten is amazing. Be it an academic problem or a personal one, I always found him standing right beside me, knowing exactly what to do. Over the last four years, he has worked very hard to teach me the skills of understanding deep technical ideas in their simplest form, and explaining my own ideas concisely and precisely. Thank you Matt and Brighten, thank you.

I was also very fortunate to have a great set of people to advise me during my graduate studies. Two people who stand out in shaping my career are Ralf Kötter and Nitin Vaidya. I owe much of my academic career to Ralf — he was the first person to teach me the art of doing good research, writing papers and questioning the questions. I miss him. My first research project at UIUC was with Nitin. He supported me in my first year of grad school, and continued doing so until the end. Thank you Ralf and Nitin. I would also like to thank Jennifer Rexford and Bruce Hajek for serving on my dissertation committee. A part of my thesis work is theoretical in nature; at times when I was exploring these questions, I found it very useful to have Sariel Har-Peled, Chandra Chekuri and Jeff Erickson around me. They helped me better formulate the problem, provided me with directions, and welcomed me into the theory world. Thank you.


I was the first student of Brighten and Matt; most of the other students in the group were at least a couple of years younger. One could imagine this having both positive and negative aspects. On the positive front, I could ask stupid questions during the group meetings and still be perceived as smart! On the negative front, I had nobody but Brighten and Matt to guide me when I needed direction. Surprisingly, the other students in the group — Virajith Jalaparti, Ashish Vulimiri, Chia-Chi Lin, Ankit Singla, Chi-Yao Hong, Qingxi Li, Wenxuan Zhou — made both the positive and negative fade away within a short period of time. They were extremely smart to challenge my questions and surprisingly mature to even suggest directions. Thank you all for making my experience in Siebel Center so enjoyable.

My life would not be so enjoyable without having around so many friends. Many thanks to Riccardo, Virajith, Parikshit and Ravi for those wonderful evenings, chatting and drinking. Myungjin Lee, who is now a professor, has extended his time to me whenever I needed it; he is awesome. It was only during the lonely hours at Urbana-Champaign when I realized what jewels I had in the friends I made at IIT Kanpur. Gopal, Vibhor, Nidhi, Nisheet, Gunjan, my life would have been terrible without them!

There are not enough words to express my thanks to my family for their support and love and for having faith in me. My mother, Sadhna Agarwal, has made many sacrifices and I want to let her know that her sacrifices have not gone unnoticed. My father, Ghanshyam Das Agarwal, has always kept very high expectations but has never lost faith in me; this, in turn, has forced me to work harder and maintain my focus. Thanks Mom and Dad. My sister, Rachna Agarwal, has always extended her unconditional love and support to me, whenever I needed it the most. She has never asked anything but love in return; I love you sis and I thank you for making me a better person. Last, but not the least, I would like to thank my wife Gargi for standing by me all the time. Thank you for your love, support, encouragement, friendship, and for everything you have coloured me with. This work is dedicated to my family, remembering those moments when I shamefully neglected their presence.


Chapter 1 Introduction

Big data refers to datasets that are large and complex. The sheer size of these datasets renders inefficient the existing approaches to storing, processing and networking. For instance, every day, on average, Google generates and processes more than 20 petabytes of data [1]; Facebook loads 60 to 90 terabytes (TB) of uncompressed data to its servers [2]; the New York Stock Exchange generates about a TB of trade data [3]; the Large Hadron Collider produces about 41 TB of new data and transfers it over the Internet [4]; similar examples appear in biology [5] and chemoinformatics [6].

However, scale is only one aspect of big data; the fundamental challenges in big data arise due to an unprecedented complexity. Indeed, the current and emerging problems in big data require understanding the structure of the data, dealing with data redundancy and accuracy, formulating meaningful analysis metrics and managing storage and compute networks that manage big data; with humans in the loop, addressing the human factors of comprehending complex datasets also becomes a significant challenge.

A significant component of big data is modeled as graphs, and is referred to as big graph data. This includes big data relating to the world wide web, social networks, the Internet, etc. Big data applications on these graphs encompass web graph analysis [7–9], social network analysis [10–12], designing efficient search engines [7, 13], analyzing information propagation [10, 14] and network security [15–17], to name a few. A fundamental goal in many of these applications is to ascertain the user “interest”: web search engines often need to predict the content of interest to the user conducting the search; the online advertisement industry needs to predict the products of interest to the user visiting the webpage; social networks need to predict other users and content of interest to the user, etc. A natural way to formalize the notion of interest is by using the proximity between the user and the content, where proximity is defined according to some distance measure on the underlying graph data.


This dissertation concerns building techniques, algorithms and systems for big graph data applications that require distance computations. We will discuss several concrete applications below; however, we note that since most of these applications compute distances in response to a user query, the goal is to minimize the query latency while maintaining feasible memory requirements; indeed, it is also desirable to compute the exact shortest distance or a distance estimate which is very close to the exact shortest distance. The theory community has established that meeting these goals is impossible for extremely dense graphs. The central theme of this dissertation is to show that these goals can, in fact, be achieved by exploiting graph sparsity, a property almost always encountered in big graph data.

We begin by developing techniques and algorithms that allow computing paths of small stretch, defined as the worst-case ratio of the distance returned by the algorithm to the actual shortest distance between the two vertices. Our first contribution is to formally establish a separation between the sparse and the dense cases for the problem of computing distances on graphs. For the realistic case of sparse graphs, our algorithms exhibit a smooth three-way trade-off between space, stretch and query time, a phenomenon that does not occur in dense graphs.

Specific operating points on this trade-off space give us linear-space data structures for computing paths of stretch 2, 3 and larger, and the first data structure for computing paths of stretch less than 2 on general weighted undirected graphs. Applying our techniques to the problem of routing in networks with limited memory, we get a distributed routing protocol that uses little router memory and yet routes along paths that are shorter than what was previously thought possible.

We then use our techniques and algorithms to build systems that enable efficient path computations for various big graph data applications. We first present ASAP, a system that almost always computes the exact shortest distance in tens of microseconds on graphs with millions of vertices and edges. Finally, we present ShapeShifter, a system that efficiently computes short paths on dynamic graphs; ShapeShifter can update, upon an edge insertion and/or deletion, the underlying data structure within tens of microseconds and answers each user query in less than a millisecond.


1.1 Distance Queries and Applications

We start by informally defining exact and approximate distance queries on graphs. We then discuss a number of big graph data applications that perform exact and approximate distance queries.

1.1.1 Exact and Approximate Distance Queries

Consider a social network (Facebook, LinkedIn, Google+, etc.) and denote the set of users as V with each user having a set of “friends”. To model the underlying data as a graph, one can imagine a “vertex” corresponding to each user in the network; the friendship between a user u and another user v can then be modeled as an “edge” between two vertices in the graph, leading to a set of edges E. Furthermore, each edge can be assigned a “weight” which could, for instance, be a measure of how frequently the two users constituting the edge interact. The data is then said to be modeled as a graph G = (V, E) with a weight function w : E → R. If modeled in such a manner, a path between two users u and v is simply an ordered sequence of users (u, v₁, v₂, ..., v) with edges between users u and v₁, users v₁ and v₂ and so on. The cost of the path is the sum of the weights of the edges along the path. The “distance” between two users s and t is then the cost of the least-cost path between s and t.

Given a graph G = (V, E) and two vertices s, t ∈ V, a distance query asks for the distance between s and t; a closely related question is that of a path query, which additionally requires listing one of the paths corresponding to the distance between s and t. A query is said to be a shortest path query if the desired output is in fact the shortest path between s and t. For many applications, however, an approximately shortest path suffices; the approximation is measured in terms of stretch, the worst-case ratio of the distance returned by the query to the actual shortest distance between the two vertices. A query is said to return distances of stretch c if, for any pair of vertices at distance d, the query returns a distance of at most c · d.
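As an illustration of these definitions, the following minimal Python sketch computes an exact distance with Dijkstra's algorithm and the stretch of one returned path. The toy graph, the vertex names and the helper names (`shortest_distance`, `path_cost`, `stretch`) are assumptions made for this example and do not come from the dissertation; Dijkstra's algorithm is used only as a standard baseline and assumes non-negative edge weights. Stretch as defined above is a worst case over all vertex pairs, whereas the sketch evaluates the ratio for a single pair.

```python
import heapq

def shortest_distance(graph, s, t):
    """Exact distance from s to t (Dijkstra, non-negative weights).

    graph: dict mapping each vertex to a list of (neighbor, weight) pairs.
    Returns float('inf') if t is unreachable from s.
    """
    dist = {s: 0.0}
    heap = [(0.0, s)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == t:
            return d
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return float("inf")

def path_cost(graph, path):
    """Cost of a path: the sum of the weights of its edges."""
    total = 0.0
    for u, v in zip(path, path[1:]):
        total += min(w for x, w in graph[u] if x == v)
    return total

def stretch(returned, exact):
    """Ratio of a returned distance to the exact distance for one pair."""
    return returned / exact

# Hypothetical undirected toy graph: two s-t paths, of cost 2 and 4.
edges = [("s", "a", 1.0), ("a", "t", 1.0), ("s", "b", 2.0), ("b", "t", 2.0)]
graph = {}
for u, v, w in edges:
    graph.setdefault(u, []).append((v, w))
    graph.setdefault(v, []).append((u, w))

exact = shortest_distance(graph, "s", "t")
detour = path_cost(graph, ["s", "b", "t"])
ratio = stretch(detour, exact)
```

Here a query that returned the path through `b` would have stretch 2 for this pair, since it reports cost 4 where the shortest path costs 2.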


1.1.2 Applications

In this section, we give a non-exhaustive list of contemporary applications where computing short paths is a key component.

Search and ranking of people on social networks

A common operation in social networks is to search for people; indeed, social networks are meant to connect old friends and make new ones. For instance, a study conducted by Google on their social network Orkut [18] suggests that more than 50% of the searches are for other users, with each query containing fewer than 2 words. This makes it important for social network service providers to devise efficient techniques that allow searching people and ranking the search results.

However, devising a technique for search and ranking of people on social networks is non-trivial for several reasons. First, the problem is fundamentally different from the traditional problem of web search, which uses text-based ranking techniques and has no notion of “user preference”. For instance, an experiment conducted in [18] suggests that the average number of answers per query when the retrieval algorithm is based on an exact match between the query and the user name (that is, for the query “Maria”, only the users who declared their names exactly as “Maria” are retrieved) is 48. Furthermore, if partial matches between the user name and the query are allowed (that is, the query “Maria” provides a match to the user named “Maria A.”), the average number of results per query increases to 6034. In the absence of any user preference, there is no way to rank these results. This demonstrates the ineffectiveness of traditional text-based ranking techniques. Indeed, [18] proposed a distance-based ranking technique, leaving open the question of efficiently (that is, with low latency) computing distances on large graphs.

Currently used techniques for computing such distance-based rankings require preprocessing and storing the distance from each vertex u to vertices within a certain number of hops from u. This approach is limited due to several factors. First, these techniques have a high memory footprint and require dedicated servers and resources for managing extremely large datasets. Perhaps more fundamentally, such an approach is limited to rankings based on hop-distance and cannot handle graphs with weighted edges.


Search and ranking of paths on social networks

The previous application of search and ranking of people on social networks requires computing, from a user u, shortest distances to users in a set X (for instance, all users that match the query “Maria”). In professional social networks like LinkedIn and Microsoft Academic Search, a different kind of social query is initiated. Here, the goal of a user Alice is to connect to another user Bob. The goal of the social network service provider is to compute and rank a set of paths between users Alice and Bob.

This problem is significantly more challenging than the previous one. First, the social network has to devise techniques to compute these paths on the fly if paths between Alice and Bob were not stored during the preprocessing phase. Second, even if paths can be precomputed and stored, the memory requirements increase significantly: if 10 paths are stored between each pair of users, the memory requirements are expected to be an order of magnitude larger than storing a single shortest path. Finally, as in the previous application, the problem becomes significantly more challenging if one desires to take the edge weights into account.

Socially-sensitive content search and ranking

Traditionally, web search results only reflect how important a particular piece of content is; here, the importance of a document is defined by computing a ranking function, PageRank for instance. Recently, there has been an increasingly intense discussion about personalized and socially-sensitive web search. For instance, it has been argued that the distance between the point where the query is initiated (here the initiation point may be the query context and not necessarily a user) and the relevant webpages is an important aspect in the ranking of the results [19]; similarly, [20] argues that a user may be more interested in finding content from users that are close to her in the social graph. In this context, incorporating distance-based ranking functions for search tasks has been proposed in several recent papers [18, 21–23].

Location-aware search is a more general form of socially-sensitive search that has relevance in information retrieval community [24] — it has been found that people who chat with each other are more likely to share interests; and this observation is used to retrieve information relevant to the users.


The growing interest in involving context and/or social connections in search tasks suggests that ranking functions may soon incorporate (some form of) distance computations; see an experimental exploration of this socially-sensitive search in [25–27]. Indeed, the problem is challenging: given the large number of possible search attributes and the large size of the underlying graph, it is practically impossible to precompute and store the results; on the other hand, computing such ranking functions entirely on the fly leads to high latency. It is, hence, desirable to design techniques that simultaneously achieve low memory footprint and low latency for socially-sensitive and location-aware search and ranking.

Social network analysis

Distance queries over social networks have also been used to analyze information dissemination [28, 29], to detect communities [27, 30], to estimate structural similarity between two given networks [31], to compute graph separation metrics [25, 26, 32, 33], to compute centrality measures [25, 26], etc. In addition, algorithms for detecting Sybil attacks rely on detecting communities [16] and hence, can benefit from efficiently answering distance queries.

Routing in networks with limited memory

The distance query problem in big graph data is related to the problem of scalable routing on large networks. In the latter problem, it is desirable that routers require limited memory to store forwarding tables [34–37] and yet route along short paths. Traditionally, the technique used to route using limited memory is hierarchy: the network is partitioned into multiple domains and separate routing protocols are used across and within domains; across domains, routing is performed on higher level aggregates (e.g., IP prefixes), while a separate protocol that typically implements shortest path routing is used within a single domain. While traditional approaches have been able to sustain the network growth, they can be inefficient: in the worst case, the routes used by traditional schemes can be arbitrarily longer than the shortest path, leading to high network latency.


1.2 Goals and State-of-the-art

In this section, we outline a number of performance goals for systems and algorithms for answering distance queries based on the applications discussed above. We then briefly discuss the state of the art techniques for answering distance queries and outline the limitations of these techniques.

1.2.1 Goals

We argue that systems and algorithms for answering distance queries must meet the following three goals: (1) low latency; (2) an extremely small constant stretch as close to 1 as possible; and (3) feasible memory requirements. We discuss these goals in more depth below.

Low Latency

Most of the applications discussed in the previous section compute distances and paths in response to a user query. This imposes stringent latency requirements on systems and algorithms for answering user queries. For instance, a study conducted by Google [38] suggests that increasing the query latency from 100 to 400 milliseconds led to a significant reduction in the number of user queries, leading to revenue loss. The problem is further exacerbated since many applications (for instance, the application of search and ranking of people on social networks discussed above) initiate multiple sub-queries for a single query. Consequently, one of the most important goals is to design techniques, algorithms and systems that answer each query extremely quickly, with the typical latency requirement being less than a few milliseconds.

Low Stretch

Most of the applications above require or could benefit from computing shortest paths. However, if all the desired distances cannot be precomputed and stored, this may be infeasible. In such a case, the structure of social networks makes it necessary that the returned distance estimate be of low stretch. For instance, consider a pair of users at distance 2 (that is, one user is a friend of a friend of the other user). Then, if the algorithm returns a distance estimate of stretch 3, the returned distance estimate can be as large as 6. However, for all practical purposes, such an estimate is useless since any pair of users in a real-world social network is less than 6 hops away due to the small-world property of social networks. Hence, our goal is to design systems and algorithms that return a distance estimate of stretch less than 3.

Feasible Memory

It is rather trivial to achieve the first two goals if one has access to machines with extremely large memory (by precomputing and storing all-pair shortest paths). However, memory limitations and the size of networks in question mean that simple solutions, like precomputing and storing all-pair shortest paths, are infeasible; even for a social network with 3 million users, this would require roughly 4.5 trillion entries. Social networks of interest can in fact be much larger in size: Facebook (1+ billion users [39]), Twitter (500 million users) and LinkedIn (200 million users [40]). Hence, it is desirable to minimize the memory requirements while meeting the above two goals.
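The 4.5 trillion figure can be checked with a short calculation: an undirected graph on n vertices has n(n − 1)/2 unordered vertex pairs. The sketch below is only an arithmetic illustration; the 4 bytes per stored distance is an assumption made here to give a sense of scale and is not a figure from the text.

```python
def all_pairs_entries(n):
    """Number of unordered vertex pairs in a graph with n vertices."""
    return n * (n - 1) // 2

n_users = 3_000_000
entries = all_pairs_entries(n_users)      # about 4.5 * 10**12 entries

BYTES_PER_ENTRY = 4                        # assumption: one 32-bit distance per pair
terabytes = entries * BYTES_PER_ENTRY / 1e12
```

Under that assumed entry size the table alone would occupy roughly 18 TB for 3 million users, and the count grows quadratically with the number of users.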

1.2.2 State-of-the-art

We briefly discuss the state-of-the-art for answering distance queries on big graph data. A distance oracle is a compact representation of the all-pair shortest path matrix of a graph. A stretch-c oracle for a weighted undirected graph G = (V, E) returns, for any pair of vertices s, t ∈ V at distance d(s, t), a distance estimate δ(s, t) that satisfies d(s, t) ≤ δ(s, t) ≤ c · d(s, t). Let n = |V| be the number of vertices and m = |E| be the number of edges in the graph.

For general weighted undirected graphs, Thorup and Zwick [41] showed a fundamental space-stretch trade-off: for any integer k ≥ 2, they designed an oracle of size $O(kn^{1+1/k})$ that returned distances of stretch $(2k-1)$ in $O(k)$ time; the construction time of their oracle was $\tilde{O}(kmn^{1/k})$, in expectation.

Thorup-Zwick oracle was a significant improvement over previous construc-tions that had much higher stretch and/or query time [42–44].
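To make the Thorup-Zwick trade-off concrete, the sketch below tabulates, for a few values of k, the stretch 2k − 1 and the exponent 1 + 1/k of n in the space bound quoted above. The function name `tz_tradeoff` is an assumption for this illustration, and the factor k and all constants hidden by the O-notation are ignored.

```python
from fractions import Fraction

def tz_tradeoff(k):
    """Stretch and space exponent of the bound O(k * n^(1 + 1/k)).

    Returns (stretch, space_exponent); space_exponent is the power of n,
    with the factor k and constant factors dropped.
    """
    return 2 * k - 1, 1 + Fraction(1, k)

# k = 2 -> stretch 3 with space about n^(3/2)
# k = 3 -> stretch 5 with space about n^(4/3)
table = {k: tz_tradeoff(k) for k in (2, 3, 4)}
```

Larger k lowers the space exponent toward 1 but raises the stretch, which is the trade-off the rest of this section revisits for sparse graphs.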

The space-stretch trade-off of Thorup-Zwick oracle (TZ-oracle) is essentially optimal, assuming the girth conjecture of Erdős. In particular, Thorup and Zwick [41] showed that any oracle for undirected graphs that returns distances of stretch less than $(2k+1)$ must have size $\Omega(n^{1+1/k})$. Their lower bound proof is information-theoretic, essentially showing the existence of dense-enough graphs that are incompressible: if a certain stretch is desired, then the size of the data structure is lower bounded by the number of edges in the specially-constructed graph. For example, proving that stretch less than 3 requires $\Omega(n^2)$ space uses a graph with $\Theta(n^2)$ edges. Hence, the space-stretch trade-off of their oracles is optimal only for the obscure case of extremely dense graphs.

1.3 Dissertation Outline

Can lower stretch be achieved using sub-quadratic space for the realistic case of sparse graphs? This question is both interesting and important for two reasons. First, far from being a narrow special case of the problem, sparse graphs are the most relevant case. Nearly all large real-world networks are sparse, including road networks [45], social networks [11], the router-level Internet graph [46] and the Autonomous System-level Internet graph [46], as well as networks like expander graphs that are important in many settings. For instance, letting µ = c log₂ n, empirically, c ≈ 0.6 for an AS-level map of the Internet [46], c ≈ 0.4 for a router-level map of the Internet [46], and c ≈ 1.34, 0.65, 1.21, 5.10, 29.9 for social networks Cyworld, Testimonial, Orkut, MySpace, and Facebook, respectively [11, 47]. There is no hope of the proof technique of the TZ lower bound being helpful for these graphs, that is, graphs with much fewer than $n^2$ edges, since this technique will only show that achieving any constant stretch value requires $\Omega(m)$ bits.

The second reason sparse graphs are interesting is that the mathematical structure of the question changes dramatically in the case of sparse graphs. Indeed, if $\Omega(m)$ space is allowed, one can trivially construct stretch-1 oracles (that is, oracles that return the shortest path) by storing the original graph and running a shortest path algorithm for each query; this, however, takes time $O(m)$ per query. Thus, in the context of distance oracles, the cases of dense and sparse graphs are quite different. In the dense case the key is to compress the graph while ensuring that sufficient information remains to return low-stretch distances. In the sparse case the graph need not be compressed, but the trade-off with query time becomes critical.


The first part of the dissertation builds an understanding of distance oracles beyond the Thorup-Zwick bound. In particular, we explore several questions:

Is it possible to design oracles of size $o(n^{3/2})$ that return distances of stretch 3 in time $o(m)$ for sparse graphs?

Do constant-stretch oracles with sub-linear query time and linear space exist or do we necessarily require super-linear space, that is, space $\Omega(m^{1+\delta})$ for some $\delta > 0$?

Is it possible to design oracles of size $o(n^2)$ that return distances of stretch less than 3 in time $o(m)$ for sparse graphs?

Is there in fact a smooth trade-off between space and query time for any fixed stretch?

Do the space and query time reduce smoothly as the graph gets sparser?

The second part of the dissertation builds systems using techniques developed in the first part. We show that the new techniques not only improve the worst-case stretch of Thorup-Zwick oracles (using the same space) but empirically, lead to schemes with average stretch extremely close to 1. Next, we summarize the results in the dissertation and outline our contributions.

1.3.1 Oracles with Linear Space

Chapter 3 explores these questions for stretch 3 and larger. Let S denote the size of the oracle and let T denote the query time. Our main result is the design of stretch-3 oracles for each point on the space-time curve $S \times T^2 = O(n^2)$ for sparse graphs; we get similar results for stretch larger than 3 (see Figure 1.1 for a simple visualization of this space-time trade-off).
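As a worked instance of the stretch-3 curve, consider the linear-space operating point on a sparse graph. The short derivation below is a sketch that treats the O-bound as an equality and ignores polylogarithmic factors; both simplifications are assumptions made here for illustration only.

```latex
\[
  S \times T^{2} = n^{2}, \qquad S = n
  \;\Longrightarrow\;
  T^{2} = \frac{n^{2}}{n} = n
  \;\Longrightarrow\;
  T = n^{1/2}.
\]
```

In terms of the exponents $\log_n S$ and $\log_n T$, this is the point with space exponent 1 and query time exponent 1/2.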

This answers all our questions above: for any graph, there is indeed a smooth space-time trade-off, that is, for any fixed stretch, it is possible to reduce the space requirements (of the corresponding constant-time oracle) at the cost of higher query time. Moreover, the space-time trade-off improves as the graph gets sparser. Finally, and perhaps most interestingly, there exist oracles of size linear in the input size that can compute distances of constant stretch in sub-linear query time; for instance, it is possible to design oracles of size $\tilde{O}(m)$ that return stretch-3 distances in time $O(\sqrt{m})$. For computing distances of stretch $4k-1$, for any integer $k \geq 1$, our linear-space oracles require query time $O(m^{1/(k+1)})$.

[Plot: space exponent versus query time exponent, with curves for stretch = 3, 4, 6 and 7.]

Figure 1.1: Space-time trade-off for our oracles for stretch 3 and larger for graphs with $m = \tilde{O}(n)$ edges. Let S be the size of the oracle and let T be the query time; then, the space and the query time exponents are defined as $\log_n(S)$ and $\log_n(T)$, respectively.

Our second contribution is an extremely simple construction of constant-time oracles that return distances of stretch 2k, for any k ≥ 2. These oracles have size $\tilde{O}(m^{1-\frac{2}{2k+1}} n^{\frac{4}{2k+1}})$. These are the points with zero query exponent in Figure 1.1. For unweighted graphs, the space can be reduced to $\tilde{O}(n^{1+\frac{2}{2k+1}})$ at the expense of an additive stretch of 1. The results in this chapter appeared in [48, 49].
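The size bound of the constant-time stretch-2k oracles, read here as m^(1 − 2/(2k+1)) · n^(4/(2k+1)) up to polylogarithmic factors, can be turned into a space exponent once a graph density is fixed. The sketch below assumes m = n^density; the function name and the `density` parameter are hypothetical helpers introduced for this illustration.

```python
from fractions import Fraction

def constant_time_space_exponent(k, density=Fraction(1)):
    """log_n of m^(1 - 2/(2k+1)) * n^(4/(2k+1)), assuming m = n^density.

    Polylogarithmic factors and constants are ignored.
    """
    a = Fraction(2, 2 * k + 1)
    return (1 - a) * Fraction(density) + 2 * a

# Sparse graphs (density 1, i.e. m about n):
stretch4 = constant_time_space_exponent(2)   # k = 2, stretch 4
stretch6 = constant_time_space_exponent(3)   # k = 3, stretch 6
```

For density 1 the exponent simplifies to 1 + 2/(2k+1), so the space requirement approaches linear as the allowed stretch grows.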

1.3.2 Oracles for Stretch 2

Chapter 3 explores one direction in which TZ-oracles can be improved for the case of sparse graphs: for any fixed stretch, it is possible to design oracles that require less space compared to the TZ-oracle at the expense of higher query time. Chapter 4 explores improvement in TZ-oracles in the other dimension, namely reducing the stretch at the expense of higher query time. In fact, we show that for sparse graphs, it is possible to reduce both the stretch and the space of TZ-oracle at the expense of sub-linear query time.


We give several results for stretch 2. Our first result is the construction of stretch-2 oracles for each point on the space-time trade-off of $S \times T = O(n^2)$ for sparse graphs; we get similar results for denser graphs with the space-time trade-off dependent on graph density. Hence, for $S = O(n^{3/2})$ as in TZ-oracle, it is possible to compute stretch-2 distances in sparse graphs using time $O(\sqrt{n})$. As with stretch-3 oracles, we get a smooth space-time trade-off that improves as the graph gets sparser. Furthermore, for graphs with $m = \Omega(n^{1+\epsilon})$ edges for any $\epsilon > 0$, we also get linear-space oracles that require sub-linear query time. Figure 1.2 shows the space-time trade-off of our oracles.
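The stretch-2 relation S × T = O(n²) for sparse graphs can be explored numerically in terms of exponents. The sketch below treats the bound as an equality and ignores constants and polylogarithmic factors; the function name is an assumption for this illustration, and the sketch says nothing about which portion of the curve the actual construction covers.

```python
from fractions import Fraction

def stretch2_query_exponent(space_exponent):
    """Query time exponent log_n(T) on the curve S * T = n^2.

    space_exponent is log_n(S); the relation gives log_n(T) = 2 - log_n(S).
    """
    s = Fraction(space_exponent)
    if not 1 <= s <= 2:
        raise ValueError("space exponent expected between 1 and 2")
    return 2 - s

# Space n^(3/2), as in the TZ-oracle, gives query time about n^(1/2).
tz_space_point = stretch2_query_exponent(Fraction(3, 2))
```

Evaluating the same function at space exponent 1 returns 1, which agrees with the remark above that sub-linear query time with linear space is obtained for graphs that are somewhat denser than m ≈ n.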

The stretch-2 oracle above leads to a $\tilde{O}(m^{1/2} n^{3/2})$ time algorithm for computing all-pair stretch-2 distances, matching the run time of a decade-old result due to Cohen and Zwick, albeit using significantly different and simpler techniques. A way to interpret this result is that the trade-off between the query time and the construction time of our oracle is optimal unless there exists a faster algorithm for computing all-pair stretch-2 distances; in other words, improving either the query time or the construction time of this oracle will lead to an asymptotically faster combinatorial algorithm for computing all-pair stretch-2 distances.

[Plot: space exponent versus query time exponent, for stretch = 2.]

Figure 1.2: Space-time trade-off for our oracles for stretch 2 for graphs with $m = \tilde{O}(n)$ edges. Let S be the size of the oracle and let T be the query time; then, the space and the query time exponents are defined as $\log_n(S)$ and $\log_n(T)$, respectively.


We present two more results in this section. First, we give an extremely simple construction of a constant-time stretch-2 oracle of size $\tilde{O}(n^{4/3} m^{1/3})$ (the point with zero query exponent on the S-T curve of Figure 1.2). Second, we apply our results on stretch-2 oracles (with super-constant query time) to the problem of routing in networks with limited memory; we get a distributed routing protocol that uses little router memory and yet routes along paths that are shorter than what was previously thought possible. The results in this chapter appeared in [48–50].

1.3.3 Oracles for Stretch Less Than 2

Chapter 5 presents the first oracles that compute distances of stretch less than 2 on general weighted undirected graphs. As with oracles for stretch 2 and larger, our oracles achieve a three-way trade-off between space, stretch and query time. For sparse graphs, our oracles achieve a space-stretch-time trade-off of $S \times T^{1/k} = O(n^2)$ for computing distances of stretch $1 + 1/k$; the trade-off can be further improved for certain values of stretch. For instance, Figure 1.3 shows the space-time trade-off for our stretch-1.67 oracles.

[Figure 1.3 plot not reproduced: Space Exponent (1.5 to 2) against Query Time Exponent (0 to 1), curve for stretch = 1.67.]

Figure 1.3: Space-time trade-off for our oracles for stretch 1.67 for graphs with $m = \tilde{O}(n)$ edges. Let $S$ be the size of the oracle and let $T$ be the query time; then, the space and the query time exponents are defined as $\log_n(S)$ and $\log_n(T)$, respectively.


As with our stretch-2 oracle, we argue that our oracles for stretch less than 2 may achieve an optimal trade-off between query time and construction time, unless there exists a faster algorithm for Boolean Matrix Multiplication (BMM). Specifically, the problem of computing all-pair stretch-less-than-2 distances in undirected graphs is equivalent to combinatorial BMM over the (OR, AND) semiring. Let $T$ denote the query time and let $T'$ denote the construction time of our stretch-1.667 oracles. If we can reduce the query time to $T^{1-\epsilon}$, for any $\epsilon > 0$, without increasing the construction time (or vice versa), it would be possible to multiply two Boolean matrices in time $o(mn)$. This would lead to a purely combinatorial $o(mn)$ time algorithm for BMM, a long-standing open problem. The results in this chapter appeared in [51, 52].

1.3.4 Shortest Paths in Microseconds

Motivated by the application discussed in §1.1, we apply our techniques from Chapter 3, Chapter 4 and Chapter 5 to build systems for computing short paths for big graph data applications. Chapter 6 presents ASAP, a system that quickly computes shortest paths by exploiting the structure of big graph data.

ASAP preprocesses the network to compute a partial shortest path tree (PSPT) for each vertex. PSPTs have the property that for any pair of vertices, each edge along the shortest path is very highly likely to be contained in the PSPT of either the source or the destination. Hence, a shortest path can be computed by simply exploring the PSPT of the source and the destination. ASAP demonstrates and exploits the observation that the structure of big graph data enables the PSPT of each vertex to be an extremely small fraction of the entire network; hence, PSPTs can be stored efficiently and each shortest path can be computed extremely quickly.

ASAP, even on networks with millions of vertices and edges, computes shortest paths in tens of microseconds using a single machine. Furthermore, unlike most previous works, ASAP admits an efficient distributed implementation and can be easily mapped onto distributed programming frameworks like MapReduce. Finally, unlike any previous technique, ASAP can compute multiple paths between any given pair of vertices using the same data structure as the one used for single path computation and with minimal latency increase.


Table 1.1: Summary of results for ASAP on several datasets (see Chapter 6). “Accuracy” refers to the fraction of the vertex pairs (approximated to two decimal places) for which ASAP returns the shortest path.

Dataset        ASAP, #Paths = 1            ASAP, #Paths > 1          Speed-up (compared
               Time (in µs)  Accuracy      Time (in µs)  #Paths      to state-of-the-art)
DBLP           20.3          1.00          31.6          173         916×
Flickr         26.5          1.00          52.0          523         3171×
Orkut          31.8          0.94          65.0          237         23963×
LiveJournal    48.9          1.00          99.2          453         3197×

ASAP, even on networks with millions of vertices and edges, computes the shortest path between most vertex pairs in less than 50 µs; see Table 1.1. ASAP also allows computing hundreds of paths and corresponding distances between most vertex pairs in less than 100 µs without any change in the data structure for single shortest path computation. The results in this chapter appeared in [53, 54].

1.3.5 Shortest Paths on Dynamic Graphs

A particularly challenging problem in big graph data is to handle graph dynamics: insertions and deletions of edges and vertices over time. Chapter 7 presents ShapeShifter, an extension of ASAP from Chapter 6 that enables quick computation of distances on dynamic graphs. This extension uses our techniques from Chapter 5 along with some new ideas for quickly updating vertex partial shortest path trees (PSPTs).

The main idea is to compute and store PSPTs that are smaller in size than the PSPTs stored in ASAP. Smaller PSPTs not only reduce the space requirements of ShapeShifter (compared to ASAP), but also enable quick updates since the time taken to compute a PSPT is proportional to the size of the PSPT; in addition, ShapeShifter uses new techniques to quickly identify and update the set of PSPTs that are affected by an update. The challenge, however, is that the PSPTs of a vertex pair may no longer intersect. To resolve this, ShapeShifter computes larger PSPTs on the fly, leading to almost perfect accuracy.


Table 1.2: Summary of results for ShapeShifter on several datasets (see Chapter 7). For vertex pairs whose PSPTs intersect along the shortest path, ShapeShifter returns the shortest path; otherwise, ShapeShifter returns a low stretch path.

Dataset        Fraction of intersecting PSPTs    Query time    Average update time
               Total     Shortest path           (in µs)       (in ms/update)
DBLP           1         0.98                    414.2         1.5
Flickr         1         0.93                    521.0         1.7
Orkut          1.00      0.91                    922.1         2.1
LiveJournal    1.00      0.96                    997.1         2.4

ShapeShifter, on networks with millions of vertices and edges, computes shortest paths for a large fraction of vertex pairs in less than a millisecond (see Table 1.2); in addition, ShapeShifter returns a low stretch path for all other vertex pairs. ShapeShifter can handle thousands of edge insertions and deletions in a second on a single machine; furthermore, the update time of ShapeShifter decreases almost linearly by using multiple cores and multiple machines. The results in this chapter appeared in [55].


Chapter 2 Preliminaries

This chapter builds up the basic foundation for the rest of the dissertation. We begin with some formal definitions related to graphs in §2.1. We then define balls and vicinities of graph vertices in §2.2, and inverse-balls and inverse-vicinities in §2.3; these are certain neighborhoods of vertices in the graph that are used in our results. Finally, in §2.4, we briefly review the distance oracle of Thorup and Zwick (TZ) [41] and follow-up research on improving the original TZ-oracle. We also discuss, in §2.4, the known lower bounds on distance oracles.

2.1 Graphs

We start with some basic definitions and terminologies related to graphs.

Definition 1 (Graphs). A graph $G$ is a pair $G = (V, E)$, where $V$ is the set of vertices in the graph and $E \subseteq \binom{V}{2}$ is the set of edges in the graph. The graph is said to be undirected if the edges have no orientation, that is, for any pair of vertices $u, v \in V$, the edge $(u, v)$ is identical to $(v, u)$. The graph is said to be directed otherwise; in such a case, the edges are defined as ordered pairs, $E \subseteq V \times V$.

Social networks and information networks are often modeled as graphs. For instance, each user in a social network is modeled as a vertex; two users have an edge between them if they are “friends”. Most of the early social networks (Facebook, Orkut, LinkedIn) have bidirectional edges: if Alice and Bob are friends, the edge between Alice and Bob is symmetric. These networks are modeled as undirected networks. Many recent social networks (Google+ and Twitter, for instance) have unidirectional edges and are modeled as directed networks.


Definition 2 (Neighbors of a vertex). Two vertices $u, v \in V$ of a graph $G = (V, E)$ are said to be adjacent if there is an edge between $u$ and $v$, that is, if $(u, v) \in E$. For a graph $G = (V, E)$, the set of neighbors of any vertex $v \in V$, denoted by $N(v)$, is the set of all vertices that are adjacent to $v$, that is, $N(v) := \{u : (u, v) \in E\}$. For a subset of vertices $U \subseteq V$, the set of neighbors of vertices in $U$, denoted by $N(U)$, is defined as $N(U) := \bigcup_{u \in U} N(u)$.

Definition 3 (Degree of a vertex). For a graph $G = (V, E)$, the degree of a vertex $v \in V$, denoted by $\deg(v)$, is defined as the number of its neighbors, that is, $\deg(v) := |N(v)|$.

Definition 4 (∆-maximum degree bounded graphs). A graph $G = (V, E)$ is said to be a ∆-maximum degree bounded graph (or equivalently, a ∆-degree bounded graph) if the degree of each vertex in $G$ is at most ∆, that is, for each vertex $v \in V$, $\deg(v) \le \Delta$.

Definition 5 (µ-average degree bounded graphs). A graph $G = (V, E)$ is said to be a µ-average degree bounded graph (or equivalently, to have average degree µ) if $\mu = 2m/n$.

The notions of µ-average degree bounded graphs and ∆-maximum degree bounded graphs will play a crucial role in our construction. Note that a ∆-maximum degree bounded graph has average degree at most ∆; however, the converse is not true, since a graph with average degree ∆ may contain vertices with degree higher than ∆.
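The definitions above can be made concrete with a minimal sketch. The adjacency-set representation and all function names below are illustrative assumptions, not part of the dissertation. The star graph at the end shows why a small average degree does not imply a small maximum degree.

```python
from typing import Dict, Hashable, Iterable, Set

# Hypothetical representation: each vertex maps to the set of its neighbors.
Graph = Dict[Hashable, Set[Hashable]]


def neighbors(g: Graph, v: Hashable) -> Set[Hashable]:
    """N(v): all vertices adjacent to v."""
    return g[v]


def neighbors_of_set(g: Graph, U: Iterable[Hashable]) -> Set[Hashable]:
    """N(U): union of N(u) over all u in U."""
    result: Set[Hashable] = set()
    for u in U:
        result |= g[u]
    return result


def degree(g: Graph, v: Hashable) -> int:
    """deg(v) = |N(v)|."""
    return len(g[v])


def max_degree(g: Graph) -> int:
    """Smallest Delta for which g is Delta-maximum degree bounded."""
    return max(len(nbrs) for nbrs in g.values())


def average_degree(g: Graph) -> float:
    """mu = 2m/n; each undirected edge appears in two adjacency sets."""
    n = len(g)
    m = sum(len(nbrs) for nbrs in g.values()) // 2
    return 2 * m / n


# Star graph: center "c" with four leaves (n = 5, m = 4).
star: Graph = {"c": {"a", "b", "d", "e"},
               "a": {"c"}, "b": {"c"}, "d": {"c"}, "e": {"c"}}
```

Here `average_degree(star)` is 8/5 while `max_degree(star)` is 4, so the center vertex exceeds the average degree.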

Definition 6 (Edge Weight). A graph $G = (V, E)$ is said to be weighted if it is associated with a weight function $w : E \to \mathbb{R}$ that assigns a weight to each edge in $G$. The graph is said to be unweighted if each edge is assigned the same weight.

The weight of an edge is perhaps the most natural way to differentiate between any two edges in the graph. Edge weights in social networks can signify how “strong” the relationship is or how frequently the two users constituting the link interact. Edge weights in information networks may signify the delay of sending a packet from one router to another router. We now define shortest paths and the notion of stretch.


Definition 7 (Paths). A path in $G$ from a vertex $s = u_0$ to another vertex $t = u_k$ is a sequence of edges $\{(u_0, u_1), (u_1, u_2), \dots, (u_{k-1}, u_k)\}$. Alternatively, a path is denoted as an ordered sequence of adjacent vertices $(u_0, u_1, \dots, u_k)$. The length of a path $P$ is the sum of its edge weights, that is, $\mathrm{length}(P) := \sum_{i=0}^{k-1} w(u_i, u_{i+1})$. The hop-length of a path $P$ is the number of edges in $P$.

Definition 8 (Shortest Paths). Let $\mathcal{P}$ denote the set of paths between $s$ and $t$ in $G$. The shortest distance (or equivalently, the exact distance or simply the distance) between $s$ and $t$ in $G$, denoted by $d(s, t)$, is defined as the length of the shortest path between $s$ and $t$. More formally, $d(s, t) := \min_{P \in \mathcal{P}} \mathrm{length}(P)$. If $\mathcal{P} = \emptyset$, we let $d(s, t) := \infty$.

Definition 9 (Connected graphs). An undirected graph $G = (V, E)$ is said to be connected if $d(s, t)$ is finite for all pairs of vertices $s, t \in V$.

Definition 10 (Stretch). Let $P$ be a path between a pair of vertices $s, t$ in $G$. Then, $P$ is said to be a path of stretch $k$ if $d(s, t) \le \mathrm{length}(P) \le k \cdot d(s, t)$.

In this dissertation, unless stated otherwise, we consider connected weighted undirected graphs with each edge assigned a non-negative weight. Assuming connectedness is not fundamental to our results but simplifies the exposition of our techniques and results; all our results hold for graphs with multiple disconnected components. Assuming non-negative edge weights and undirected graphs is, however, fundamental. In particular, we will require that the shortest distances on the input graph constitute a metric space [56]. That is, they have the following three properties: (1) for any pair of vertices $u, v \in V$, we have that $d(u, v) \ge 0$; (2) for any pair of vertices $u, v \in V$, we have that $d(u, v) = d(v, u)$; and (3) for any triplet of vertices $u, v, w \in V$, we have that $d(u, v) \le d(u, w) + d(w, v)$. The last of the above three properties is known as the triangle inequality.
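The notions of path length, shortest distance and stretch can be illustrated with a short sketch. It assumes a weighted undirected graph stored as a dictionary of dictionaries with non-negative weights; the helper names (`dijkstra`, `path_length`, `stretch_of`) are illustrative, and the textbook Dijkstra algorithm stands in for any exact shortest-distance computation.

```python
import heapq
from math import inf


def dijkstra(adj, source):
    """Shortest distances from source; adj maps vertex -> {neighbor: weight}."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, w in adj[u].items():
            nd = d + w
            if nd < dist.get(v, inf):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


def path_length(adj, path):
    """length(P): sum of edge weights along the vertex sequence `path`."""
    return sum(adj[a][b] for a, b in zip(path, path[1:]))


def hop_length(path):
    """Number of edges on the path."""
    return len(path) - 1


def stretch_of(adj, path):
    """Ratio length(P) / d(s, t); assumes distinct, connected endpoints."""
    s, t = path[0], path[-1]
    return path_length(adj, path) / dijkstra(adj, s)[t]


# Triangle in which the direct edge s-t is longer than the detour via a.
adj = {
    "s": {"a": 1, "t": 3},
    "a": {"s": 1, "t": 1},
    "t": {"a": 1, "s": 3},
}
```

In this example the one-hop path `["s", "t"]` has length 3 while the shortest distance is 2, so it is a path of stretch 1.5.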

The notation used in the above definitions is summarized in Table 2.1.

2.2 Balls and Vicinities

In this section, we start with formally defining the vertex balls and vertex vicinities. We then discuss efficient algorithms to construct vertex balls and vicinities.


Table 2.1: Notation used throughout the dissertation.

G          A connected weighted undirected graph
V          Set of vertices in the graph
E          Set of edges in the graph
n          Number of vertices in the graph
m          Number of edges in the graph
N(u)       Neighbors of vertex u
N(U)       Neighbors of vertices in U
deg(u)     Degree of vertex u
d(s, t)    Shortest distance between vertices s and t
∆          Maximum degree in the graph
µ          Average degree of the graph

2.2.1 Definitions and notation

Definition 11 (Landmark vertex). Let $G = (V, E)$ be a weighted undirected graph and let $L \subset V$ be a subset of vertices. The landmark vertex of any vertex $v$, denoted by $\ell(v)$, is the vertex $\ell \in L$ that minimizes $d(\ell, v)$, ties broken arbitrarily.

The set L in the above definition will be referred to as the set of “landmarks”. The notion of landmarks is used to define certain neighborhoods of vertices in the graph. Of particular interest are the notions of balls and vicinities:

Definition 12 (Ball and ball radius of a vertex). Let $G = (V, E)$ be a connected weighted undirected graph and let $L \subset V$ be a subset of vertices. The ball of a vertex $v \in V$, denoted by $B(v)$, is the set of vertices $w \in V$ for which $d(v, w) < d(v, \ell(v))$. The ball radius of $v$, denoted by $r_v$, is the distance from $v$ to its landmark vertex, that is, $r_v := d(v, \ell(v))$.

In other words, the ball of a vertex v is the set of all vertices w that are strictly closer to v than its landmark vertex ℓ(v).
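A minimal sketch of these definitions follows, assuming a dictionary-of-dictionaries graph and a non-empty landmark set supplied by the caller. The function `ball` and the toy path graph are illustrative; a full Dijkstra run is used for clarity, whereas an efficient construction would stop the search once the landmark is reached.

```python
import heapq
from math import inf


def dijkstra(adj, source):
    """Shortest distances from source; adj maps vertex -> {neighbor: weight}."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, w in adj[u].items():
            nd = d + w
            if nd < dist.get(v, inf):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


def ball(adj, landmarks, v):
    """Return (landmark of v, ball radius r_v, ball B(v) as a set)."""
    dist = dijkstra(adj, v)
    lm = min(landmarks, key=lambda l: dist.get(l, inf))  # ties broken arbitrarily
    r = dist.get(lm, inf)
    members = {w for w, d in dist.items() if d < r}  # strictly closer than the landmark
    return lm, r, members


# Path a - b - c - d with unit weights and a single landmark d.
path = {
    "a": {"b": 1},
    "b": {"a": 1, "c": 1},
    "c": {"b": 1, "d": 1},
    "d": {"c": 1},
}
```

For vertex `a` the landmark `d` is at distance 3, so the ball contains `a`, `b` and `c`; a landmark itself has radius 0 and therefore an empty ball.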

Observe the following interesting property of the ball of any vertex $v$. Let $w$ and $w'$ be two vertices such that $d(v, w) \le d(v, w')$; then, if $w' \in B(v)$, we have that $w \in B(v)$. That is, if any vertex $w'$ is contained in $B(v)$, then all vertices at distance less than or equal to $d(v, w')$ from $v$ are contained in the ball of $v$. Next, we define the vicinity of a vertex; this definition is closely related to the definition of the balls but has a dramatically different structure.


Definition 13 (Vicinity of a vertex). Let $G = (V, E)$ be a connected weighted undirected graph and let $L \subset V$ be a subset of vertices. The vicinity of a vertex $v \in V$, denoted by $B^\star(v)$, is the set of vertices in $B(v) \cup N(B(v))$.

We make several important observations. First, for unweighted graphs, the vicinity of a vertex $v$ is simply a larger ball of radius $r_v + 1$; hence, vicinities have the same properties as balls. For weighted graphs, however, this does not hold. In particular, consider two vertices $w, w'$ such that $d(v, w) \le d(v, w')$; then, the vicinity of $v$ may contain vertex $w'$ but not $w$. To see this, let $v'$ be some vertex in $B(v)$ such that the edge $(v', w')$ is contained in the edge set. Then, by definition, $w' \in B^\star(v)$. However, if no neighbor of $w$ is contained in the ball of $v$ and if $d(v, w) \ge d(v, \ell(v))$, the vertex $w$ is not contained in the vicinity of $v$. Finally, note that the vicinity of a vertex $v$ may contain an arbitrarily larger number of vertices than the ball of $v$. However, if the graph is µ-degree bounded, we can bound the size of vertex vicinities as $|B^\star(v)| = O(\mu \cdot |B(v)|)$.

It follows from the discussion above that the vicinity of a vertex u may contain vertices w without necessarily containing all vertices along the shortest path between u and w. To account for this distance “asymmetry”, we will need the following notion of distance:

Definition 14 (Candidate distance). Let $G = (V, E)$ be a weighted undirected graph. The candidate distance from a vertex $v$ to another vertex $w \in B^\star(v)$, denoted as $d'_v(w)$, is defined as the cost of the least-cost path from $v$ to $w$ such that all intermediate vertices on this path are contained in $B(v)$; that is:

$$d'_v(w) = \min_{x \in N(w) \cap B(v)} \{\, d(v, x) + \text{weight of edge } (x, w) \,\}$$

Note that the candidate distance from v to w may be arbitrarily larger than the shortest distance between v and w. However, as we will show later, there are certain vertices in the vicinity of v for which the candidate distance is equal to the shortest distance.
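The vicinity and candidate-distance definitions can be sketched as follows, again assuming a dictionary-of-dictionaries graph with positive weights and a caller-supplied landmark set. The function name and the five-vertex example are illustrative assumptions; the example is built so that the vicinity of `v` contains the far vertex `wp` (standing for w′) but not `w`, and so that the candidate distance to `wp` exceeds its shortest distance. The candidate distance of `v` to itself is set to 0 by convention.

```python
import heapq
from math import inf


def dijkstra(adj, source):
    """Shortest distances from source; adj maps vertex -> {neighbor: weight}."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, w in adj[u].items():
            nd = d + w
            if nd < dist.get(v, inf):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


def vicinity_and_candidates(adj, landmarks, v):
    """Return (B(v), vicinity of v, candidate distances from v)."""
    dist = dijkstra(adj, v)
    r = min(dist.get(l, inf) for l in landmarks)
    members = {w for w, d in dist.items() if d < r}   # the ball B(v)
    vicinity = set(members)
    for x in members:
        vicinity.update(adj[x])                       # add N(B(v))
    cand = {}
    for w in vicinity:
        if w == v:
            cand[w] = 0.0                             # convention for v itself
        else:
            # last hop enters w from a neighbor x that lies inside B(v)
            cand[w] = min(dist[x] + adj[x][w] for x in adj[w] if x in members)
    return members, vicinity, cand


example = {
    "v": {"l": 2, "a": 1},
    "a": {"v": 1, "wp": 10},
    "l": {"v": 2, "w": 1, "wp": 1},
    "w": {"l": 1},
    "wp": {"a": 10, "l": 1},
}
```

With landmark set `{"l"}`, the ball of `v` is `{v, a}`, the vicinity is `{v, a, l, wp}`, and the candidate distance to `wp` is 11 even though the shortest distance is 3.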

Definition 15 (Intersection of balls and vicinities). Let $G = (V, E)$ be a weighted undirected graph. The balls of a pair of vertices $s, t \in V$ are said to have a non-empty intersection if $B(s) \cap B(t) \ne \emptyset$, that is, there is a vertex $w \in V$ such that $w \in B(s)$ and $w \in B(t)$. The ball-vicinity and vicinity-vicinity intersections are defined identically.


2.2.2 Constructing balls and vicinities of bounded size.

We now describe efficient algorithms that, given a weighted undirected graph, construct vertex balls and vicinities of bounded worst-case sizes. We will also give algorithms for efficiently computing candidate distances to vertices in the vicinity of each vertex.

Lemma 1 ([41]). Let $G = (V, E)$ be a weighted undirected graph with $n$ vertices and $m$ edges. For any fixed $1 \le \alpha \le n$, there exists a subset of vertices $L$ of size $\tilde{O}(n/\alpha)$ such that for each vertex $v \in V$, we have that $|B(v)| = O(\alpha)$ with high probability. Moreover, such a set $L$ can be computed in time $\tilde{O}(n)$.

We outline the proof of the above lemma. Let us first describe an algorithm that constructs a set L such that both the size of L and the size of the ball of each vertex are bounded in expectation.

The algorithm starts by sampling each vertex in $V$ (for inclusion in set $L$) independently with probability $1/\alpha$. Hence, the expected size of the set of sampled vertices is $O(n/\alpha)$. To bound the size of the ball of a vertex $v$, let $u_0, u_1, \dots, u_n$ be the vertices in $G$ sorted in non-decreasing order of distance from $v$; then, if $u_j$ is the first sampled vertex in this sorted order, the size of the ball of $v$ is $j - 1$. Since each $u_i$ is sampled independently with probability $1/\alpha$, the size of the ball is a geometric random variable with parameter $1/\alpha$. Consequently, the expected size of the ball of $v$ is $\alpha$.
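The sampling argument can be checked empirically with a small simulation. This is a sketch under simplifying assumptions: vertices are represented only by their rank in the sorted order around a fixed vertex, distances are assumed distinct, and the ball size is counted as the number of vertices that precede the first sampled one (whose mean is α − 1, that is, O(α)). All names and parameters are illustrative.

```python
import random


def sample_landmarks(vertices, alpha, rng):
    """Include each vertex independently with probability 1/alpha."""
    p = 1.0 / alpha
    return {v for v in vertices if rng.random() < p}


def ball_size(order, landmarks):
    """Number of vertices strictly before the first landmark in `order`.

    `order` lists vertices by non-decreasing distance from a fixed vertex v
    (distinct distances assumed, so ties do not shrink the ball further).
    """
    for j, u in enumerate(order):
        if u in landmarks:
            return j
    return len(order)  # no landmark sampled at all


def mean_ball_size(n, alpha, trials, seed=0):
    """Average ball size over independent landmark samples."""
    rng = random.Random(seed)
    order = list(range(n))
    total = 0
    for _ in range(trials):
        total += ball_size(order, sample_landmarks(order, alpha, rng))
    return total / trials
```

For instance, `mean_ball_size(200, 10, 2000)` should come out close to 9, in line with the geometric-distribution argument.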

By sampling each vertex independently at random with probability $\tilde{O}(1/\alpha)$, it follows, using an argument as above together with Chernoff bounds, that the size of the landmark set is bounded by $\tilde{O}(n/\alpha)$ with high probability and the size of each ball is bounded by $O(\alpha)$ with high probability. It is, in fact, possible to derandomize the above algorithm such that the size of the set $L$ and the size of the ball of each vertex are bounded deterministically:

Lemma 2 ([41, 57]). Let $G = (V, E)$ be a weighted undirected graph with $n$ vertices, $m$ edges and maximum degree $\mu = 2m/n$. For any fixed $1 \le \alpha \le n$, there exists a subset of vertices $L$ of size $\tilde{O}(n/\alpha)$ such that for each vertex $v \in V$, we have that $|B(v)| = O(\alpha)$. Moreover, such a set $L$ and the distance from each vertex $v$ to each vertex $w \in B(v)$ can be computed in time $\tilde{O}(m\alpha)$.


A deterministic algorithm for constructing such a set $L$ is as follows [41, 57]. The algorithm first lets $N_v$, for each vertex $v \in V$, be the set of $O(\alpha)$ vertices of $V$ closest to $v$, ties broken arbitrarily. The algorithm then chooses a set $L$ of size $O(n \log n/\alpha)$ that hits all the sets $N_v$, that is, $L$ contains at least one element from each set $N_v$. To construct such a set $L$, the algorithm repeatedly adds vertices from $V$ to $L$ that hit as many unhit sets as possible until only $n/\alpha$ sets $N_v$ are unhit. The construction of set $L$ is then completed by adding an element from each of the unhit sets $N_v$. Thorup and Zwick [41], using a result of Alon and Spencer [57], show that such a set $L$ has size at most $O(n \log n/\alpha)$ and can be constructed in time $\tilde{O}(n + n\alpha)$, given the sets $N_v$. For a $\mu = 2m/n$-degree bounded graph, each set $N_v$ can be constructed in time $O(\alpha\mu)$ using a modified shortest path algorithm that stops once the closest $O(\alpha)$ vertices have been explored. Hence, the total construction time of the algorithm is $\tilde{O}(m\alpha)$.
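The greedy hitting-set step described above can be sketched as follows. This is a straightforward illustration, not the time-optimized procedure of [41, 57]: it recounts element frequencies in every round, takes the closest-vertex sets as given, and assumes each set is non-empty. The function name and the `threshold` parameter (playing the role of n/α) are assumptions made for this example.

```python
def greedy_hitting_set(sets, threshold):
    """Greedy hitting set.

    sets:      dict mapping each vertex v to its non-empty set N_v
    threshold: stop the greedy phase once at most this many sets are unhit
    Returns a set L that intersects every N_v.
    """
    unhit = {v: set(s) for v, s in sets.items()}
    L = set()
    # Greedy phase: repeatedly pick the element that hits the most unhit sets.
    while len(unhit) > threshold:
        counts = {}
        for s in unhit.values():
            for x in s:
                counts[x] = counts.get(x, 0) + 1
        best = max(counts, key=counts.get)
        L.add(best)
        unhit = {v: s for v, s in unhit.items() if best not in s}
    # Completion phase: add one element from each set that is still unhit.
    for s in unhit.values():
        L.add(next(iter(s)))
    return L


# Four sets arranged in a cycle; any greedy choice leaves two sets sharing one element.
cycle_sets = {1: {1, 2}, 2: {2, 3}, 3: {3, 4}, 4: {4, 1}}
```

On `cycle_sets` with `threshold=0`, the greedy phase needs exactly two picks to hit all four sets.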

Recall that for µ = 2m/n-degree bounded graphs, the size of the vicinity of any vertex is at most a factor µ larger than the size of the ball of the vertex. Using this fact along with the definition of vertex vicinities and candidate distances, we get the following lemma:

Lemma 3. Let $G = (V, E)$ be a weighted undirected graph with $n$ vertices, $m$ edges and maximum degree $\mu = 2m/n$. For any fixed $1 \le \alpha \le n$, there exists a subset of vertices $L$ of size $\tilde{O}(n/\alpha)$ such that for each vertex $v \in V$, we have that $|B(v)| = O(\alpha)$ and $|B^\star(v)| = O(\alpha\mu)$. It is possible to compute, in time $\tilde{O}(m\alpha)$, such a set $L$, the shortest distance from each vertex $v$ to each vertex $w \in B(v)$ and the candidate distance from each vertex $v$ to each vertex $w \in B^\star(v)$.

2.3 Inverse-balls and Inverse-vicinities

In this section, we extend the idea of vertex balls and vicinities to inverse-balls and inverse-vicinities. We then give efficient algorithms to construct inverse-balls and inverse-vicinities of vertices in weighted undirected graphs.

Definition 16 (Inverse-ball of a vertex). Let $G = (V, E)$ be a connected weighted undirected graph and let $L \subset V$ be a subset of vertices. The inverse-ball of a vertex $v \in V$, denoted by $\bar{B}(v)$, is the set of vertices $w \in V$ that contain $v$ in their ball, that is, the set of vertices $w \in V$ for which $d(w, v) < d(w, \ell(w))$.


Definition 17 (Inverse-vicinity of a vertex). Let $G = (V, E)$ be a connected weighted undirected graph and let $L \subset V$ be a subset of vertices. The inverse-vicinity of a vertex $v \in V$, denoted by $\bar{B}^\star(v)$, is the set of vertices $w \in V$ that contain $v$ in their vicinity, that is, the set of vertices $w \in V$ for which $v \in B^\star(w)$.

Constructing inverse-balls and inverse-vicinities of bounded size. The result of Lemma 3 bounds the size of vertex balls and vicinities; while this leads to bounds on the average size of inverse-balls and inverse-vicinities, we would like a bound on the worst-case size. We now discuss how to efficiently construct inverse-balls and inverse-vicinities of bounded worst-case size. We will need the following result:

Lemma 4 ([58]). Let $G = (V, E)$ be a weighted undirected graph with $n$ vertices, $m$ edges and maximum degree $\mu = 2m/n$. For any fixed $1 \le \alpha \le n$, there exists a subset of vertices $L$ of expected size $8n \log n/\alpha$ such that for each vertex $v \in V$, we have that $|\bar{B}(v)| \le \alpha$. Moreover, such a set $L$ and the distance from each vertex $v$ to each vertex $w \in \bar{B}(v)$ can be computed in time $\tilde{O}(m\alpha)$.

For the sake of completeness, we informally describe the algorithm for constructing such a set $L$. Fix some $1 \le \alpha \le n$. The algorithm maintains two sets of vertices: a set $L$ that constitutes the final output of the algorithm, and another set $W$ that contains all vertices whose inverse-ball has size more than $\alpha$. The set $L$ is initialized to the empty set and $W$ is initialized to the vertex set $V$. The algorithm runs in multiple iterations; in each iteration, it samples $4n/\alpha$ vertices from $W$ uniformly at random, inserts them into $L$, re-computes the inverse-ball of each vertex, and updates $W$ to the set of all vertices that still contain more than $\alpha$ vertices in their inverse-ball. The algorithm terminates when $W$ contains $4n/\alpha$ or fewer vertices; in this case, all vertices in $W$ are inserted into $L$.

The main idea behind the proof of correctness is as follows. Clearly, by construction, each vertex has an inverse-ball of size at most $\alpha$. The main challenge is to bound the size of the set $L$. It is shown in [58] that the expected number of iterations performed by the algorithm before termination is at most $2 \log n$; since $4n/\alpha$ vertices are added to $L$ in each iteration, the expected size of the set $L$ is at most $8n \log n/\alpha$.


It is easy to verify that the set of vertices in the inverse-vicinity of any vertex $v$ is given by $\bar{B}^\star(v) = \bigcup_{w \in N(v)} \bar{B}(w)$. Hence, once the inverse-ball of each vertex has been computed, the inverse-vicinity of any vertex $v$ can be computed easily by iterating through each vertex $w \in N(v)$ and adding each vertex in $\bar{B}(w)$ to the inverse-vicinity of $v$. Hence, we get:

Lemma 5. Let $G = (V, E)$ be a weighted undirected graph with $n$ vertices, $m$ edges and maximum degree $\mu = 2m/n$. For any fixed $1 \le \alpha \le n$, there exists a subset of vertices $L$ of expected size $8n \log n/\alpha$ such that for each vertex $v \in V$, we have that $|\bar{B}(v)| \le \alpha$ and $|\bar{B}^\star(v)| \le \mu \cdot \alpha$. It is possible to compute, in time $\tilde{O}(m\alpha)$, such a set $L$, the distance from each vertex $v$ to each vertex $w \in \bar{B}(v)$ and the candidate distance from each vertex $v$ to each vertex $w \in \bar{B}^\star(v)$.

Lemma 5 gives an efficient way to sample a set of vertices of size $\tilde{O}(n/\alpha)$ such that the size of the inverse-ball of each vertex is bounded by $O(\alpha)$; compare this with the sampling technique of Lemma 3, which gives an efficient way to sample a set of vertices of the same size such that the ball of each vertex is bounded by $O(\alpha)$. We emphasize that the above lemma bounds the size of set $L$ in expectation, while the size of the inverse-ball and inverse-vicinity of any vertex is bounded deterministically.
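The iterative sampling procedure for bounded inverse-balls, and the computation of inverse-vicinities from inverse-balls, can be sketched as below. This is an unoptimized illustration rather than the algorithm of [58] as implemented there: it recomputes all balls with a full Dijkstra run in every iteration, rounds 4n/α down to an integer batch size (at least 1), and assumes comparable vertex identifiers so that sampling is reproducible. All names are illustrative.

```python
import heapq
import random
from math import inf


def dijkstra(adj, source):
    """Shortest distances from source; adj maps vertex -> {neighbor: weight}."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, w in adj[u].items():
            nd = d + w
            if nd < dist.get(v, inf):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


def inverse_balls(adj, L):
    """inv[v] = set of vertices w whose ball (with respect to L) contains v."""
    inv = {v: set() for v in adj}
    for w in adj:
        dist = dijkstra(adj, w)
        r = min((dist.get(l, inf) for l in L), default=inf)  # ball radius of w
        for v, d in dist.items():
            if d < r:
                inv[v].add(w)
    return inv


def sample_bounded_inverse_balls(adj, alpha, rng):
    """Return (L, inverse balls) with every inverse ball of size at most alpha."""
    batch = max(1, (4 * len(adj)) // alpha)
    L, W = set(), set(adj)
    while len(W) > batch:
        L.update(rng.sample(sorted(W), batch))   # sample uniformly from W
        inv = inverse_balls(adj, L)
        W = {v for v in adj if len(inv[v]) > alpha}
    L.update(W)                                  # remaining heavy vertices become landmarks
    return L, inverse_balls(adj, L)


def inverse_vicinity(adj, inv, v):
    """Union of the inverse balls of the neighbors of v, as in the text."""
    result = set()
    for w in adj[v]:
        result |= inv[w]
    return result


# Unit-weight path on 40 vertices.
path40 = {i: {} for i in range(40)}
for i in range(39):
    path40[i][i + 1] = 1
    path40[i + 1][i] = 1
```

A landmark never lies in any ball, so its inverse-ball is empty; this is why inserting the remaining vertices of W into L at the end enforces the size bound for every vertex.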

It is, in fact, possible to combine the sampling techniques of Lemma 3 and Lemma 5 to construct a set $L$ of size $\tilde{O}(n/\alpha)$ such that the ball, the vicinity, the inverse-ball and the inverse-vicinity of each vertex are of bounded size. Specifically, fix some $1 \le \alpha \le n$. Then, the algorithm first samples a set of vertices $L_1$ of size $\tilde{O}(n/\alpha)$ using the algorithm of Lemma 3. The set $L_1$ is used as a seed set for the algorithm of Lemma 5, which then samples another set of vertices $L_2$ of size $\tilde{O}(n/\alpha)$. This gives us the final set of sampled vertices $L = L_1 \cup L_2$ with the following property:

Lemma 6. Let $G = (V, E)$ be a weighted undirected graph with $n$ vertices, $m$ edges and maximum degree $\mu = 2m/n$. For any fixed $1 \le \alpha \le n$, there exists a subset of vertices $L$ of expected size $\tilde{O}(n/\alpha)$ such that for each vertex $v \in V$, we have that $|B(v)| = O(\alpha)$, $|\bar{B}(v)| = O(\alpha)$, $|B^\star(v)| = O(\alpha\mu)$ and $|\bar{B}^\star(v)| = O(\alpha\mu)$. It is possible to compute, in time $\tilde{O}(m\alpha)$, such a set $L$, the distance from each vertex $v$ to each vertex $w \in B(v)$ and to each vertex $w \in \bar{B}(v)$, and the candidate distance from each vertex $v$ to each vertex $w \in B^\star(v)$ and to each vertex $w \in \bar{B}^\star(v)$.

The notation used in the last two sections is summarized in Table 2.2.


Table 2.2: Notation on balls and vicinities used throughout the dissertation.

$\ell(v)$              Landmark of vertex v
$B(v)$                 Ball of vertex v
$r_v$                  Ball radius of vertex v
$B^\star(v)$           Vicinity of vertex v
$\bar{B}(v)$           Inverse-ball of vertex v
$\bar{B}^\star(v)$     Inverse-vicinity of vertex v

2.4 Thorup-Zwick Oracle: Upper and Lower Bounds

For general weighted undirected graphs, Thorup and Zwick [41] showed a fundamental space-stretch trade-off: for any integer $k \ge 2$, they designed an oracle of size $O(k n^{1+1/k})$ that returns distances of stretch $(2k - 1)$ in $O(k)$ time; the construction time of their oracle was $\tilde{O}(k m n^{1/k})$, in expectation. In this section, we briefly describe the construction of the stretch-3 and stretch-5 distance oracles of Thorup and Zwick. We then review the follow-up research on improving the original construction of the Thorup-Zwick (TZ) oracle.
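For concreteness, instantiating the bounds quoted above for the two values of $k$ discussed in this section (a direct substitution, treating $k$ as a constant) gives:

```latex
\begin{array}{c|c|c|c}
k & \text{stretch } 2k-1 & \text{size } O\!\left(k\,n^{1+1/k}\right) & \text{construction time } \tilde{O}\!\left(k\,m\,n^{1/k}\right) \\ \hline
2 & 3 & O\!\left(n^{3/2}\right) & \tilde{O}\!\left(m\,n^{1/2}\right) \\
3 & 5 & O\!\left(n^{4/3}\right) & \tilde{O}\!\left(m\,n^{1/3}\right)
\end{array}
```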

2.4.1 Thorup-Zwick oracles: Upper Bounds

We start with the stretch-3 oracle and then describe the stretch-5 construction.

Stretch-3 oracle

The construction of the stretch-3 oracle starts by sampling a set $L$ of landmark vertices using Lemma 1 for $\alpha = \sqrt{n}$. The oracle stores, for each $v \in V$:

• a hash table storing the exact distance to each vertex in $L$;

• the nearest landmark vertex $\ell(v)$ and the ball radius $r_v$; and

• a hash table storing the exact distance to each vertex in the ball of $v$, that is, $B(v)$.

When queried for the distance between vertices $s$ and $t$, the exact distance is returned if $s \in B(t)$ or if $t \in B(s)$; else, the algorithm returns the distance $d(s, \ell(s)) + d(\ell(s), t)$.


The algorithm clearly returns a distance estimate using three hash table lookups; hence, the query time is $O(1)$. We bound the construction time, the size and the stretch. Using Lemma 2, constructing the set $L$ and computing distances from each vertex $v$ to each vertex $w \in B(v)$ takes time $\tilde{O}(m\sqrt{n})$, leading to a total construction time of $\tilde{O}(m\sqrt{n})$. To bound the size, note that the size of set $L$ is $\tilde{O}(\sqrt{n})$ for $\alpha = \sqrt{n}$; furthermore, the size of each ball is bounded by $O(\sqrt{n})$. Hence, the size of the oracle is $\tilde{O}(n\sqrt{n})$.

To bound the stretch, note that the exact distance is returned if $s \in B(t)$ or if $t \in B(s)$. Otherwise, the returned distance is $\delta(s, t) = d(s, \ell(s)) + d(t, \ell(s))$. Using the triangle inequality, we have that $d(t, \ell(s)) \le d(s, t) + d(s, \ell(s))$. Hence, the returned distance is $\delta(s, t) \le 2 d(s, \ell(s)) + d(s, t)$. Finally, since $t \notin B(s)$, we have that $d(s, t) \ge d(s, \ell(s))$, leading to the fact that $\delta(s, t) \le 3 d(s, t)$.
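The stretch-3 construction and query can be sketched as follows. This is a minimal illustration under stated assumptions: the graph is connected and stored as a dictionary of dictionaries, the landmark set is non-empty and passed in by the caller (rather than sampled via Lemma 1), preprocessing runs a full Dijkstra from every vertex instead of the faster construction, and Python dictionaries play the role of hash tables. The class name `Stretch3Oracle` is hypothetical.

```python
import heapq
from math import inf


def dijkstra(adj, source):
    """Shortest distances from source; adj maps vertex -> {neighbor: weight}."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, w in adj[u].items():
            nd = d + w
            if nd < dist.get(v, inf):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


class Stretch3Oracle:
    def __init__(self, adj, landmarks):
        self.to_landmarks = {}  # v -> {landmark: exact distance}
        self.nearest = {}       # v -> (nearest landmark, ball radius r_v)
        self.ball = {}          # v -> {w: exact distance} for w in B(v)
        for v in adj:
            dist = dijkstra(adj, v)
            lm = min(landmarks, key=lambda l: dist[l])
            r = dist[lm]
            self.to_landmarks[v] = {l: dist[l] for l in landmarks}
            self.nearest[v] = (lm, r)
            self.ball[v] = {w: d for w, d in dist.items() if d < r}

    def query(self, s, t):
        """Return an estimate between d(s, t) and 3 * d(s, t)."""
        if t in self.ball[s]:
            return self.ball[s][t]          # exact: t lies in B(s)
        if s in self.ball[t]:
            return self.ball[t][s]          # exact: s lies in B(t)
        lm, r = self.nearest[s]
        return r + self.to_landmarks[t][lm]  # d(s, l(s)) + d(l(s), t)


# Unit-weight cycle on 8 vertices with two landmarks.
cycle = {i: {(i - 1) % 8: 1, (i + 1) % 8: 1} for i in range(8)}
oracle = Stretch3Oracle(cycle, [0, 4])
```

Each query touches at most three dictionaries, mirroring the three hash table lookups in the analysis; the stretch bound holds for any non-empty landmark set, while the size bound depends on how the landmarks are chosen.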

Stretch-5 oracle

The construction of the stretch-5 oracle starts by sampling a set $L_1$ of landmark vertices using Lemma 1 for $\alpha = n^{1/3}$; in the second step, a set $L_2$ of landmark vertices is sampled from the vertex set $L_1$, again using Lemma 1 for $\alpha = n^{1/3}$. Let $\ell_1(v)$ and $\ell_2(v)$ be the vertices in $L_1$ and $L_2$, respectively, that are closest to $v$. The oracle stores, for each $v \in V$:

• a hash table storing the exact distance to each vertex in $L_2$;

• the nearest vertices $\ell_1(v)$ and $\ell_2(v)$ and the corresponding distances;

• a hash table storing the exact distance to each vertex in the ball of $v$ defined with respect to set $L_1$, that is, to each vertex $w$ such that $d(v, w) < d(v, \ell_1(v))$; and

• a hash table storing the exact distance to each vertex in the set $S_v = \{w \in L_1 : d(v, w) < d(v, \ell_2(v))\}$.

When queried for the distance between vertices $s$ and $t$, the exact distance is returned if $s \in B(t)$ or if $t \in B(s)$. Else, the algorithm checks if $\ell_1(t) \in S_s$; if such is the case, the algorithm returns the distance $d(s, \ell_1(t)) + d(t, \ell_1(t))$; this is easily proved to be a stretch-3 distance using arguments similar to those for the stretch-3 oracle. If neither of the above two conditions is satisfied, the algorithm returns the distance $d(s, \ell_2(s)) + d(\ell_2(s), t)$.


The algorithm returns a distance estimate using five hash table lookups; hence, the query time is $O(1)$. We bound the size and the stretch. To bound the size, note that the size of set $L_1$ is $\tilde{O}(n^{2/3})$ and the size of set $L_2$ is $\tilde{O}(n^{1/3})$ for $\alpha = n^{1/3}$. It is rather straightforward to prove that for each vertex $v$, $|S_v| = O(n/\alpha^2)$; hence, for the above construction, we have that $|S_v| = \tilde{O}(n^{1/3})$ for each vertex $v$. Furthermore, the size of each ball is bounded by $O(n^{1/3})$. Hence, the size of the oracle is $\tilde{O}(n^{4/3})$.
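The size bound can be spelled out term by term; the following is just the arithmetic behind the statement above for $\alpha = n^{1/3}$, with one summand per hash table stored at a vertex:

```latex
\begin{aligned}
|L_1| &= \tilde{O}(n/\alpha) = \tilde{O}\!\left(n^{2/3}\right), \qquad
|L_2| = \tilde{O}(|L_1|/\alpha) = \tilde{O}\!\left(n^{1/3}\right), \\
\text{size} &= \sum_{v \in V} \Big( \underbrace{|L_2|}_{\tilde{O}(n^{1/3})}
  + \underbrace{|B(v)|}_{O(n^{1/3})}
  + \underbrace{|S_v|}_{\tilde{O}(n^{1/3})} \Big)
  = n \cdot \tilde{O}\!\left(n^{1/3}\right) = \tilde{O}\!\left(n^{4/3}\right).
\end{aligned}
```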

We bound the stretch for the case when the distance is returned in the last step of the query algorithm. Note that we return the distance in the third step only if $\ell_1(t) \notin S_s$; hence, we have that $d(s, \ell_2(s)) \le d(s, \ell_1(t))$, which by the triangle inequality gives us $d(s, \ell_2(s)) \le d(s, t) + d(\ell_1(t), t)$. Furthermore, since $s \notin B(t)$, we have that $d(\ell_1(t), t) \le d(s, t)$, and we get that $d(s, \ell_2(s)) \le d(s, t) + d(s, t) = 2 \cdot d(s, t)$. The algorithm returns a distance estimate of $\delta(s, t) = d(s, \ell_2(s)) + d(\ell_2(s), t)$, which using the triangle inequality gives us $\delta(s, t) \le 2 \cdot d(s, \ell_2(s)) + d(s, t) \le 2 \cdot 2 \cdot d(s, t) + d(s, t) = 5 \cdot d(s, t)$, as desired.

Follow-up research

Much of the early research following the Thorup-Zwick result focused on improving the construction time. Roditty, Thorup and Zwick [59] derandomized the construction of Thorup and Zwick. Baswana and Sen [60] improved the construction time to $O(n^2)$ for unweighted graphs. Their result was extended to weighted graphs by Baswana and Kavitha [61]. Baswana, Gaur, Sen and Upadhyay [62] showed that it is possible to achieve subquadratic construction time for unweighted graphs at the expense of a constant additive stretch. Recently, Nilsen [63] achieved subquadratic construction time for weighted graphs with $m = o(n^2)$ edges.

The query time of the TZ oracle is not constant for super-constant stretch. Mendel and Naor [64] reduced the query time to $O(1)$ at the expense of increasing the stretch to $O(k)$ and the construction time to $\tilde{O}(n^{2+1/k})$. It is possible to reduce the stretch (by a constant factor) [65, 66] and/or the construction time [67] of their construction. Recently, Nilsen [65] reduced the query time of the TZ oracle to $O(\log k)$ using a new query algorithm that incorporates binary search within the TZ oracle. Interestingly, Chechik [68] showed that it is possible to reduce the query time to an absolute constant independent of the stretch while keeping the same space-stretch trade-off.
