
Estimating and Optimizing Iterative Processing

Prior work on iterative algorithms mainly focuses on providing theoretical bounds for the number of iterations an algorithm requires to converge (e.g., [46, 44, 34]) or on worst-case time complexity (e.g., [8]). These parameters, however, are not sufficient for providing wall-time estimates, for two reasons: i) Because simplifying assumptions are made about the characteristics of the input dataset, theoretical bounds on the number of iterations are typically loose [46, 8]. This problem is further exacerbated for iterative algorithms that execute sparse computation, where the processing requirements of an iteration can differ substantially from those of the preceding and subsequent iterations [26, 52]. For such algorithms, per-iteration worst-case time complexities are impractical when the goal is to estimate actual runtime. ii) Per-iteration processing runtime cannot be captured solely by a complexity formula. System-level resource requirements (i.e., CPU, networking, I/O), critical path modeling, and a cost model are additionally required for modeling runtime.

Iterative execution has also been analyzed in the context of recursive query processing. In particular, multiple research efforts [10, 3, 13] discuss execution strategies (i.e., top-down versus bottom-up) with the goal of performance optimization. Ewen et al. [26] optimize the execution of incremental iterations, which are characterized by few localized updates, in contrast with bulk iterations, which always update the complete dataset. Although performance optimization has an immediate impact on the runtime of queries, the aforementioned techniques are complementary to the runtime prediction problem we study in this thesis.

2.2.1 Approximating and Sampling Large Graphs

With the goal of reducing the processing time of ever-increasing input graphs, sampling and sketching techniques that can approximate some of the properties of the complete graph have been studied in recent years (e.g., [47, 48, 33, 43]). In these works, the main goal is to take a sample that can be used to approximate the result of the graph processing task. Examples of such tasks include evaluating whether the input graph is connected and approximating the in/out node degree distributions or the effective diameter (i.e., the 90th percentile of the pairwise shortest-path distances).
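To make the effective diameter concrete, the following is a minimal sketch that computes it exactly with one breadth-first search per vertex, assuming an unweighted graph stored as a dictionary of adjacency lists. The function names and the `quantile` parameter are illustrative and not taken from the cited works; on large graphs this exact computation is precisely what sampling aims to avoid.

```python
import math
from collections import deque


def bfs_distances(adj, source):
    """Hop distances from `source` to every vertex reachable from it."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj.get(u, ()):
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def effective_diameter(adj, quantile=0.9):
    """Smallest hop count d such that at least `quantile` of all
    connected ordered vertex pairs are within d hops of each other."""
    distances = []
    for v in adj:
        reachable = bfs_distances(adj, v)
        distances.extend(h for u, h in reachable.items() if u != v)
    if not distances:
        return 0
    distances.sort()
    # Position of the pair that sits at the requested quantile.
    idx = max(0, math.ceil(quantile * len(distances)) - 1)
    return distances[idx]


# Hypothetical example input: an undirected path 0 - 1 - 2 - 3 - 4.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
d90 = effective_diameter(path)         # 90th-percentile distance
d100 = effective_diameter(path, 1.0)   # plain diameter
```

Setting `quantile` to 1.0 recovers the ordinary diameter, which shows why the 90th percentile is the more robust statistic: it ignores the few longest paths.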

Graph sampling has been analyzed in the context of social networks. Leskovec et al. propose sampling techniques based on random walks [47] with the goal of maintaining certain properties on the sample, such as the in/out node degree distributions, the clustering coefficient, and the effective diameter.

Each step of a random walk on a graph, starting at a vertex v, corresponds to randomly picking an edge that starts at v and ends at one of v’s neighboring vertices. A sampling technique based on random walks takes multiple random walks on the input graph until a certain percentage of vertices (or edges) has been sampled. There are multiple variants of sampling algorithms based on random walks; Hu et al. [43] provide an excellent survey. In the following, we summarize the best-performing sampling techniques in the context of preserving connectivity, node in/out degree proportionality, and the effective diameter of the sampled graph.

• Random Walk [47]: Random Walk picks a starting seed vertex uniformly at random from all the input vertices. Then, at each sampling step, an outgoing edge of the current vertex is picked uniformly at random and the current vertex is updated with the destination vertex of the picked edge. With a probability p, the current walk is ended and a new random walk is started from the original seed vertex. The process continues until the fraction of picked vertices reaches the sampling ratio. With this sampling strategy there is a risk of getting stuck if the starting vertex is a sink or if it belongs to a small isolated component. If the number of picked vertices does not grow after a large number of sampling steps, Random Walk re-initializes the starting node to a new arbitrary vertex of the graph.

• Random Jump [47]: Random Jump is very similar to Random Walk. The difference is that Random Jump re-initializes the starting node to an arbitrary vertex of the graph each time a new random walk is started. Hence, this sampling scheme has no risk of getting stuck during the sampling process.

• Metropolis-Hastings Random Walk (MHRW): Random walks are known to be biased towards high-degree nodes in the input graph. That is, vertices with high out-degree are likely to be visited more often during the sampling process than vertices with low out-degree. With the goal of sampling vertices uniformly at random (i.e., with a probability of 1/|V|, where |V| is the total number of vertices in the graph), Metropolis-Hastings Random Walk adjusts the transition probability within a random walk as follows:

\[
P_{v,w} =
\begin{cases}
\frac{1}{k_v} \times \min\left(1, \frac{k_v}{k_w}\right), & \text{if } w \text{ is a neighbor of } v \\
1 - \sum_{y \neq v} P_{v,y}, & \text{if } w = v \\
0, & \text{for any other vertex of the graph}
\end{cases}
\]

where k_v and k_w are the out-degrees of vertices v and w. In summary, MHRW always accepts a move towards a vertex with a lower degree and rejects some of the moves to vertices with a higher degree. Thus, it eliminates the bias towards vertices with high degree.
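The sampling schemes summarized above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in [47] or [43]: the graph is a dictionary mapping every vertex (sinks included) to a list of its out-neighbors without duplicates or self-loops, `ratio` lies in (0, 1], and the names `p_restart`, `patience`, and `jump` as well as their default values are hypothetical. Ending a walk as soon as it reaches a sink is also an assumption made here so that the loop always makes progress.

```python
import random


def random_walk_sample(adj, ratio, p_restart=0.15, jump=False,
                       patience=1000, rng=None):
    """Sample vertices with Random Walk (jump=False) or Random Jump (jump=True)."""
    rng = rng or random.Random()
    vertices = list(adj)
    target = max(1, int(ratio * len(vertices)))  # number of vertices to pick
    seed = rng.choice(vertices)
    current, sampled, stalled = seed, {seed}, 0
    while len(sampled) < target:
        if stalled >= patience:
            # No new vertex for `patience` steps: re-initialize the seed.
            seed = current = rng.choice(vertices)
            stalled = 0
        elif not adj[current] or rng.random() < p_restart:
            # End the current walk; Random Jump also draws a fresh seed.
            if jump:
                seed = rng.choice(vertices)
            current = seed
        else:
            current = rng.choice(adj[current])
        if current in sampled:
            stalled += 1
        else:
            sampled.add(current)
            stalled = 0
    return sampled


def mhrw_transition_probability(adj, v, w):
    """P[v, w] of the Metropolis-Hastings Random Walk, using out-degrees."""
    k_v = len(adj[v])
    if w == v:
        # Self-loop probability: whatever mass the rejected moves leave behind.
        return 1.0 - sum(mhrw_transition_probability(adj, v, y) for y in adj[v])
    if w not in adj[v]:
        return 0.0
    k_w = len(adj[w])
    accept = 1.0 if k_w <= k_v else k_v / k_w  # equals min(1, k_v / k_w)
    return accept / k_v


def mhrw_step(adj, v, rng):
    """Propose a uniform neighbor of v and accept it with min(1, k_v / k_w)."""
    if not adj[v]:
        return v
    w = rng.choice(adj[v])
    k_v, k_w = len(adj[v]), len(adj[w])
    if k_w <= k_v or rng.random() < k_v / k_w:
        return w
    return v  # rejected move: the walk stays at v


# Hypothetical example inputs: an undirected ring and an undirected star.
ring = {i: [(i - 1) % 10, (i + 1) % 10] for i in range(10)}
star = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
half_of_ring = random_walk_sample(ring, ratio=0.5, rng=random.Random(0))
next_vertex = mhrw_step(star, 1, random.Random(0))
```

On the star graph, a leaf moves to the high-degree center with probability 1/3 only and otherwise stays in place, which illustrates how rejected moves counteract the degree bias of a plain random walk.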

Sampling vs. Sketching

In the context of the data streaming model, McGregor et al. propose sketching techniques with the goal of reducing the cost of processing large input graphs [33, 5]. Concretely, sketches reduce the space complexity of the processing algorithm from O(n^2) to O(n × polylog(n)). Sketching techniques use multiple linear projections of the input graph so that they can preserve a certain property of the original graph (such as connectivity, k-connectivity, or bipartiteness) in the sketch space with high probability. Once a sketch is constructed, the algorithm is executed in the sketch space to approximate results: given a graph processing task T, an input graph G, and a corresponding sketch S, the result of executing T on G is approximated with the result obtained by executing T on S. The main differences with sampling approaches based on random walks can be summarized as follows: i) Random walk-based sampling approaches aim to preserve multiple properties of the input graph, while sketching is customized to preserve only one input property with high probability. ii) Random walk samples are used to summarize some of the characteristics of the complete graph, whereas sketches aim to reduce the memory (space) requirements of processing large input graphs in the context of data streams.
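As a rough illustration of the scale of this space reduction, the following worked example assumes a graph with n = 10^6 vertices and takes polylog(n) to be log_2^2 n. Both choices are assumptions made only for this example, and the constants hidden by the O-notation are ignored.

```latex
\begin{align*}
n^2 &= (10^6)^2 = 10^{12} \\
\log_2 n &= 6 \log_2 10 \approx 19.93 \\
n \log_2^2 n &\approx 10^6 \times 19.93^2 \approx 3.97 \times 10^8 \\
\frac{n^2}{n \log_2^2 n} &\approx \frac{10^{12}}{3.97 \times 10^8} \approx 2.5 \times 10^3
\end{align*}
```

Under these assumptions, the sketch needs roughly three orders of magnitude less space than a representation that is quadratic in the number of vertices.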