Estimating the weight of metric minimum spanning trees in sublinear time

(1)

University of Warwick institutional repository

This paper is made available online in accordance with

publisher policies. Please scroll down to view the document

itself. Please refer to the repository record for this item and our

policy information available from the repository home page for

further information.

To see the final version of this paper please visit the publisher’s website.

Access to the published version may require a subscription.

Authors:

ARTUR CZUMAJ AND CHRISTIAN

SOHLER

Title:

ESTIMATING THE WEIGHT OF

METRIC MINIMUM SPANNING

TREES IN SUBLINEAR TIME

Year of

publication:

2009

Link to published

version:

http://dx.doi.org/10.1137/060672121

Publisher

statement:

(2)

ESTIMATING THE WEIGHT OF METRIC MINIMUM SPANNING TREES IN SUBLINEAR TIME∗

ARTUR CZUMAJ† AND CHRISTIAN SOHLER‡

Abstract. In this paper we present a sublinear-time (1+ε)-approximation randomized algorithm to estimate theweightof the minimum spanning tree of ann-point metric space. The running time of the algorithm isO(n/εO(1)_{). Since the full description of an}_n_{-point metric space is of size Θ(}_n2_), the complexity of our algorithm issublinearwith respect to the input size. Our algorithm is almost optimal as it is not possible to approximate ino(n) time the weight of the minimum spanning tree to within any factor. We also show that no deterministic algorithm can achieve aB-approximation ino(n2_/B3_{) time. Furthermore, it has been previously shown that no}_o₍_n2_{) algorithm exists that}

returns a spanning treewhose weight is within a constant times the optimum.

Key words. randomized algorithms, approximation algorithms, sublinear-time algorithms

AMS subject classifications.68W20, 68W25, 68W40

DOI.10.1137/060672121

1. Introduction. Despite extensive investigations over the last few decades, the complexity of the minimum spanning tree problem is still not completely understood. Although an optimal deterministic algorithm is known [22], no tight bounds on the running time of this algorithm could be obtained. The best upper bound on the running time for a deterministic algorithm was obtained by Chazelle [5], who pre-sented an algorithm that achieves a running time of O(|V|+|E|_α(|E|,|V|)), where

αis the functional inverse of Ackermann’s function. In turn, Karger, Klein, and Tar-jan [19] gave an optimal O(|V|+|E|)-time randomized algorithm. Much research has been devoted to studying the minimum spanning problem for various classes of graphs and for variants of the problem. For example, the problem of computing the minimum spanning tree of a set of points in a Euclidean space and related problems have been intensively studied (see [10] for a summary of results). Despite this eﬀort, the fastest algorithm to compute such a minimum spanning tree in the Rd _requires

O(_n2−2/(d/2+1)+ε_{) time for an arbitrarily small constant}_ε_.

In this paper we present another important step toward understanding the min-imum spanning tree problem. We consider the classical variant of the minimum spanning tree problem for metric spacesor, equivalently, in graphs withweights satis-fying the triangle inequality. The input to the problem consists of an _n-point metric space (_{P, d}), and the goal is to estimate the weight of the minimum spanning tree of _P. In this paper, we show that even though the full description of an _n-point metric space is of size Θ(_n2_{), there exists an algorithm that approximates the weight} of the minimum spanning tree of_P to within a (1 +_ε)-factor in timeO(_n/εO(1)_{) (we}

∗_{Received by the editors October 12, 2006; accepted for publication (in revised form) May 19,}

2009; published electronically August 26, 2009. This research was supported in part by NSF grants CCR-0313219 and CCR-0105701, DFG grant Me 872/8-3, EPSRC grant EP/D063191/1, and by the Centre for Discrete Mathematics and its Applications (DIMAP), University of Warwick. A preliminary version of this paper appeared inProceedings of the 36th ACM Symposium on Theory of Computing (STOC), ACM Press, New York, 2004, pp. 175–183.

http://www.siam.org/journals/sicomp/39-3/67212.html

†_{Department of Computer Science and Centre for Discrete Mathematics and its Applications,}

University of Warwick, Coventry, CV4 7AL, United Kingdom ([email protected]).

‡_{Department of Computer Science, Technical University of Dortmund, 44221 Dortmund, Germany}

([email protected]).

(3)

useO to hide polylogarithmic factors). This is the firstsublinear-time algorithm for this problem. Our algorithm is randomized, and it achieves the promised approxima-tion guarantee with the probability of at least 3_/4 (using standard techniques, the probability can be amplified if needed). Furthermore, it was previously shown in [18] that no _o(_n2) algorithm exists that returns a spanning tree whose weight is within any constant times the weight of the minimum spanning tree of _P. Therefore, our result shows that one can approximate theweight of the minimum spanning tree, but there is no hope of finding awitness for that approximation in sublinear time. Our running time is essentially asymptotically optimal, because it is easy to show that no

o(_n)-time algorithm exists that approximates the weight of the minimum spanning tree withinany factor. Also, randomization is essential for the algorithm. We show that no deterministic_B-approximation algorithm with_o(_n2

/B3_{) running time exists.} By the well-known relationship between minimum spanning trees, travelling sales-man tours, and minimum Steiner trees (see, e.g., [23]), our algorithm for estimating the weight of the minimum spanning tree immediately yields sublinear-time (2 +_ε )-approximation algorithms for two other classical problems in metric spaces (or in graphs satisfying the triangle inequality): estimating the weight of the travelling sales-man tour and the minimum Steiner tree. No _o(_n2)-time algorithms were known for these problems before. We believe that besides being interesting by themselves, these approximation results may ﬁnd applications in bounding the quality of solutions in subproblems used in branch and cut (bound) algorithms to compute the exact solution to these problems.

Our algorithms enlarge the ever growing list of problems solvable in sublinear time that are often required for the analysis of massive data sets. Due to the tremendous increase in computational power and interconnectivity during the last decade, more and more often we have to deal with massive data sets, which are sets of size in the range of several Gigabytes or more. Examples for such massive data sets are Internet traﬃc logs, clickstream patterns, sales logs, and call-detail data records in the telecommunications industry. Massive data sets typically cannot be processed by algorithms requiring more than linear time, and often even linear-time algorithms may be too slow. This leads to a natural question of which problems of interest (if any) can be solved in sublinear time. Some simple results indicating that it is sometimes possible to solve certain approximation problems in sublinear time are well known, for example, approximating the median or the average value of a set of

n numbers. More sophisticated results have been obtained in the last few years for a number of more complex problems, including clustering problems in metric spaces [1, 6, 17, 18, 20], graph problems [7, 12, 14, 15, 21], geometric problems [6, 8], matrix approximation [13], and string problems [2, 3, 11]; see also the recent survey in [9].

1.1. Related work. The problem of approximating the weight of the minimum spanning tree in sublinear time was ﬁrst addressed for arbitrary graphs in adjacency list representation in a recent paper by Chazelle, Rubinfeld, and Trevisan [7]. In [7], a sublinear algorithm is given, whose running time is independent of the size of the input graph: If the maximum (or average) degree is _D and if all edge weights are known to be in the interval [1_{, W}], then the algorithm approximates the weight of the minimum spanning tree to within a (1 +_ε)-factor in time O(_D·_W ·_ε−3) with probability at least 3₄. However, if_W = Ω(_n), then no sublinear algorithm is known. If we apply their algorithm to the metric version of the problem, then the running time isO(_n·_W·_ε−3_{) because}

(4)

in_n, which certainly does not have to be the case in general (for example, even in the case when (_{P, d}) corresponds to the set of points_P in a Euclidean plane, it is known that_W must be at least Ω(√_n) and hence the running time is Ω(_n1.5_·

ε−3_)). In [8], the authors consider the problem of estimating the weight of the Euclidean minimum spanning tree of a set_P of_npoints in theRd. They consider the model in which the input point set is stored in a sophisticated data structure that supports two types of access operations, namely, (i) emptiness queries for axis parallel squares and (ii) approximate nearest neighbor queries for a set of prespeciﬁed cones. Additionally, it is assumed that a smallest axis parallel bounding box of_Pis given. In this model the authors give an algorithm that approximates the weight of the Euclidean minimum spanning tree of_P within a relative error of_ε. The algorithm has a running time of

O(√_n/εO(1)) assuming that the dimension is a constant and assuming that all access operations to the data structure can be performed in logO(1)_n time. The algorithm usesO(√_n/εO(1)_{) queries of types (i) and (ii).}

1.2. New contribution. In contrast to both of the aforementioned algorithms, our algorithm does not make any assumption about the input other than that we can evaluate the distance between any two points in the metric space in constant time.

The high level approach of our algorithm is similar to that in [7]. We regard the metric space as a complete graph and we express the weight of its minimum spanning tree by a formula depending on the number of connected components in certain aux-iliary subgraphs. In contrast to that of [7], our approach builds on geometric rather than arithmetic progression. We will assume that the longest edge in (_{P, d}) has length

W = (1 +_ε)r for_r=log1+ε(4n/ε)(as we will see later, this assumption does not aﬀect the complexity of the problem). We denote by_G(i)_{= (}

P, E(i)_{) the graph that} contains an edge between_{p, q}∈_Pif_d(_{p, q})≤(1+_ε)i_{. We use a randomized procedure} to approximate the number_c(i)_{of connected components in each subgraph}

G(i)_. Us-ing the approximationmst≈n−W+ε·log1+εW−1

i=0 (1 +ε)i·c(i)(see Corollary 2.2), we obtain an estimator for the weight of the minimum spanning tree of_P.

The advantage of geometric progression is that we have to estimate onlyO(log_W) times the number of the connected components of a certain threshold graph in contrast to_W times in [7]. We achieve this at the cost of an increased variance of the estimator (which makes it impossible to apply this approach to arbitrary graphs considered in [7]). Instead of approximating the number of connected components within an additive error of _{ε n}, as in [7], we obtain an approximation with additive error of

ε·weight(mst); i.e., we relate the error directly to the weight of the minimum spanning

tree.

Our analysis explores the triangle inequality to ensure trade-oﬀs between the running time, the number of connected components, the vertex degrees, and the size of the minimum spanning tree.

2. Preliminaries. We consider the problem ofestimating the weight of the min-imum spanning tree in a metric space: Given access to the_n×_ndistance matrix of a metric space (_{P, d}), |P| = _n, the goal is to approximate the weight of the mini-mum spanning tree of _P. Throughout the paper, we denote by mst _{the weight of}

the minimum spanning tree of _P. Our main contribution is an algorithm that in

O(_n/ε7_{) time computes a (1 +}

ε)-approximation ofmst, i.e., outputs a valueM such

that mst≤M ≤(1 +ε)·mst with probability at least 3

(5)

multiply-ing the output by 1_/(1−_ε) and replacing _ε by _ε = min{ε/4_,1₂}. Because of this transformation, we will refer to an algorithm that computes a value_M in the interval [(1−_ε)·mst,(1 +ε)·mst] as a (1 +ε)-approximation algorithm for the weight of the

minimum spanning tree. We will use_ε as the approximation parameter throughout the paper.

Our result holds for arbitrary metrics (_{P, d}) and assumes only that a constant-time access to the distance oracle is provided. For our analysis we will also assume that for every pair_{p, q}∈_P we have_d(_{p, q})∈[1_,(1 +_ε)r_{], where}_r₌_log

1+ε(4n/ε); i.e., the distance between any pair of points is at most 4_n/εrounded up to the nearest power of (1+_ε). As we will show below, such an assumption may introduce an additive approximation error of at most ε₂·mst, which will not aﬀect the ﬁnal result. We use

a result of Indyk [18], who showed that inO(_n) time one can approximate to within a factor 1₂ the longest distance in a metric space. (The algorithm picks an arbitrary vertex and takes the longest incident edge. By the triangle inequality, this edge is at least half as long as the longest edge in the metric space.) Once we have such an approximation_W∗, we can rescale the distances such that_W∗= 2·n/ε. Since_W∗is a

1

2-approximation of the largest distance, after the scaling all distances are in [0,4n/ε]. Furthermore, by the triangle inequality the weight of a minimum spanning tree of a metric space is at least as large as the weight of the longest distance, and since the longest distance is at least_W∗, we havemst≥2n

ε . Next, we observe that rounding up every distance smaller than 1 to the distance 1 will changemstby an additive term of

at most_n−1 while preserving the triangle inequality. Since_n−1≤ε

2·mst, in the so modiﬁed metric the weight of a minimum spanning tree is a (1 +ε

2)-approximation of the weight of a minimum spanning tree in the original metric. Hence, we can assume that all edges are in [1_{, W}∗]. To simplify the exposition we increase the upper bound

W∗ to the nearest power of (1 +_ε); i.e., we use _W = (1 +_ε)r_{. In the remainder of} the paper, we will show how to obtain a (1 +_ε)-approximation algorithm under our assumption. Replacing _ε by _ε = _ε/4, this gives an algorithm with approximation guarantee (1 +_ε/2)·(1 +_ε/4) ≤(1 +_ε) without changing the asymptotic running time of our algorithm.

2.1. ApproximatingMST via counting connected components in auxil-iary graphs. Our high level approach to approximating the weight of the minimum spanning tree is similar to the one used in [7]. We express the weight of the minimum spanning tree in terms of the number of connected components in certain auxiliary graphs. While the formula from [7] uses arithmetic progression, our formula will be based on geometric progression. Then we show how to approximate the number of connected components.

For a given parameter_i∈N, we deﬁne athreshold graph_G(i)_{= (}

P, E(i)_{), where} (_{p, q}) ∈_E(i) _{if and only if}

d(_{p, q})≤(1 +_ε)i_{. We call the connected components of}

G(i)_the

i-connected components and denote their number by_c(i)_.

Let us ﬁrst consider the case that the lengths of all edges in the metric space are powers of (1 +_ε). Then the next lemma can be obtained in a similar way as in [7] by replacing arithmetic progression with geometric progression.

Lemma 2.1. Let (P, d)be ann-point metric space such that for any pair of points p, q∈P we have _d(_{p, q}) = (1 +_ε)i _{for some}_i_{∈ {}₀_{, . . . , r}}_{. Let}_W _{= (1 +}_ε₎r_{. Then}

we can write

(2.1) mst ₌ _n−_W ₊_ε·

r−1

i=0

(6)

Proof. Let us denote by_n(_j) the number of edges of weight (1 +_ε)j _{in a minimum} spanning tree. Then, we can write the weight of the minimum spanning tree as

mst=

r

j=0

(1 +_ε)j·_n(_j)

= r

j=0

1 +_ε·(1 +ε) j₋₁

ε

·n(_j)

= (_n−1) +_ε· r

j=0

(1 +_ε)j₋₁

ε ·n(j)

= (_n−1) +_ε· r

j=1

(1 +_ε)j−1

ε ·n(j)

= (_n−1) +_ε· r

j=1 j−1

i=0

(1 +_ε)i·_n(_j)

= (_n−1) +_ε· r−1

i=0 r

j=i+1

(1 +_ε)i·_n(_j)

= (_n−1) +_ε· r−1

i=0

(1 +_ε)i· r

j=i+1

n(_j)_.

Now we use the observation thatr_j₌_i₊₁_n(_j) =_c(i)₋_{1. Hence,}

mst= (n−1) +ε·

r−1

i=0

(1 +_ε)i·(_c(i)−1)

= (_n−1) +_ε· r−1

i=0

(1 +_ε)i·_c(i)−_ε· r−1

i=0

(1 +_ε)i

= (_n−1) +_ε· r−1

i=0

(1 +_ε)i·_c(i)−((1 +_ε)r−1)

=_n−_W +_ε· r−1

i=0

(1 +_ε)i·_c(i)_.

Corollary 2.2. Let (P, d) be an n-point metric space such that all pairwise

distances are in the interval [1_{, W}], where_W = (1 +_ε)r_{. Then we have}

(2.2) mst≤n−W +ε·

r−1

i=0

(1 +_ε)i·_c(i)≤(1 +_ε)·mst.

Proof. Let us consider an arbitrary pair of points_{p, q} with (1 +_ε)i _{< d}(_{p, q})≤ (1 +_ε)i+1. Then we have (_{p, q})∈_/ _E(k) for _k ∈ Nwith _k ≤_i, and (_{p, q}) ∈_E(j) for

j ∈ N with _{j > i}. However, the same property would also hold if the edge weight

d(_{p, q}) were exactly (1+_ε)i+1_{. Hence, by rounding up every edge weight to the nearest} power of (1 +_ε), we do not change the sets of edges in any of the threshold graphs. Thus_c(i)_{is also not aﬀected by this transformation. Let}

(7)

obtained by rounding up the edge weights to the nearest power of (1 +_ε) (_G_P is not necessarily a metric space, since the triangle inequality may become invalid after the rounding). By Lemma 2.1, the weight of the minimum spanning tree of_G_P is exactly

n−W +_ε·_ir₌₀−1(1 +_ε)i·_c(i). Since we increased only edge weights, the weight of the minimum spanning tree of _G_P is larger than that of (_{P, d}), and it follows that

mst ≤n−W +ε·r−1

i=0(1 +ε)i·c(i). Since we increased every edge weight by a factor of at most (1 +_ε), the weight of the minimum spanning tree is increased by at most a factor of (1 +_ε). This yields_n−_W +_ε·_ir₌₀−1(1 +_ε)i_·_c(i)_≤_{(1 +}

ε)·mst,

which proves the corollary.

Our algorithm is based on computing a randomized estimator ˆ_c(i) _{for each}

c(i)_. Using this estimator, we can now phrase our randomized algorithm.

Metric-MST-Approximation (P, ε)

for_i= 0to _r−1 do

Compute estimator ˆ_c(i)for_c(i)

Output_M =_n−_W +_ε·r_i₌₀−1(1 +_ε)i_·_c(i)

Our main contribution is a sublinear-time randomized algorithm that outputs an estimator ˆ_c(i)that with probability at least 1−₄1_r satisﬁes the following property:

(2.3) _c(i+1)−1

r· mst

(1 +_ε)i ≤ ˆc (i) _≤

c(i)+1

r· mst

(1 +_ε)i.

It follows that with probability at least 3₄all estimators ˆ_c(i)simultaneously satisfy inequality (2.3). Observe now that we have

M =_n−_W+_ε·

r−1

i=0

(1 +_ε)i·_cˆ(i)

≥n−W+_ε· r−1

i=0

(1 +_ε)i·

c(i+1)−1 r ·

mst

(1 +_ε)i

=_n−_W−_ε·mst₊_ε·

r−1

i=0

(1 +_ε)i·_c(i+1)

≥n−W−ε·mst+ 1

1 +_ε·ε· r

i=1

(1 +_ε)i·_c(i)_.

Since_c(0)_≤

n, we obtain

M ≥

1− ε

1 +_ε

·n−W −ε·mst₊ 1

1 +_ε·ε· r−1

i=0

(1 +_ε)i·_c(i)

≥

1− ε

1 +_ε

·

n−W +_ε· r−1

i=0

(1 +_ε)i·_c(i)

−ε·mst

≥(1−2_ε)·mst.

From Corollary 2.2, we obtain

M ≤n−W +_ε·

r−1

i=0

(1 +_ε)i·

c(i)+1

r· mst

(8)

≤n−W +_ε·mst+ε·

r−1

i=0

(1 +_ε)i·_c(i)

≤(1 + 2_ε)·mst.

Hence, we have with probability at least 3_/4

(1−2_ε)·mst ≤ M ≤ (1 + 2ε)·mst.

In sections 3–7 we describe details of our randomized algorithm that inO(_n/ε6₎ time computes the estimator ˆ_c(i) _{that satisﬁes inequality (2.3). Replacing}

ε by _ε/2 will conclude the proof of our main theorem.

Theorem 2.3. Let 0 < ε <1 be an approximation parameter. Given access to

the_n×ndistance matrix of a metric space (_{P, d}),|P|=_n, algorithm

Metric-MST-Approximation_{computes in} O₍_n/ε7₎_{time a value}_M _{such that, with probability} 3

4,

(1−_ε)·mst ≤ M ≤ (1 +ε)·mst.

We remark that this result is almost optimal since it is easy to see that any constant-factor algorithm requires time Ω(_n) even in a randomized setting (see Theo-rem 8.1 in section 8). Also, no deterministic algorithm can compute a constant-factor approximation for the weight of a metric minimum spanning tree in_o(_n2_{) time (see} section 8), and no algorithm can compute a spanning tree that approximates the weight of the minimum spanning tree within a constant factor in_o(_n2_{) time [18].}

3. Estimating the numberc(i) ofi-connected components: Main ideas. In this section we recall the approach taken by [7] to approximate the number of connected components in a graph (not necessarily a metric space). The algorithm repeats the following procedure until a certain threshold value is reached, to ensure that the estimation of the number of_i-connected components is with high probability close to_c(i):

• Pick a starting vertex_p∈_P uniformly at random.

• Choose a random integer number_X according to the probability distribution Pr[_X ≥_k] = 1_/k.

• Verify whether the connected component in _G(i) _{containing vertex}

phas at most_X vertices or has more than_X vertices.

With the exception of a minor modiﬁcation in the probability distribution of_X, the scheme above has been proposed by Chazelle, Rubinfeld, and Trevisan [7]. We will run this procedure multiple times, and in each repetition of this procedure we output _β_j, that is, the indicator random variable of the event that in the_jth trial, the connected component has at most _X vertices. That is, if we denote by _n(_pi) the size of the connected component in_G(i) _{containing vertex}

p, then _β_j = 1 if_n(_pi)≤_X and_β_j = 0 otherwise. Notice that

E[_β_j] =

connected componentCinG(i)

Pr[_p∈_C]·Pr[_X ≥ |C|]

=

connected componentCinG(i)

|C|

n ·

1

|C| =

c(i)

n .

Therefore, if there are_srepetitions of the procedure above, then we deﬁne

ˆ

c(i) = n

s ·

s

j=1

(9)

Since by the arguments above E[ˆ_c(i)_{] =}

c(i)_{, this motivates the use of ˆ}

c(i) _{as an} es-timator of the number of connected components; see [7]. The challenging part of completing the analysis is to show that the random variable ˆ_c(i) _{is sharply} concen-trated around its expectation and to show that it can be computed eﬃciently. In particular, our algorithm will use a nonstandard graph traversal that exploits the triangle inequality combined with some randomized algorithm to check whether the degree of the starting vertex is higher than_X (and hence the connected component contains more than_X vertices). Unfortunately, the expected value of our estimator is not exactly_c(i). However, we will show that it is suﬃciently close to_c(i). Moreover, we will develop some concentration bounds for ˆ_c(i) that satisfy the bounds in (2.3).

4. The CLIQUE-TREE-TRAVERSAL. Our method for verifying whether a given connected component in_G(i)has at most_Xvertices is to traverse the graph_G(i) starting at vertex_p. Chazelle, Rubinfeld, and Trevisan [7] used the standard breadth-first search (BFS) traversal algorithm for this purpose. However, in our setting a BFS will not suffice to ensure both near-linear running timeand sufficient concentration. We have to develop a new traversal algorithm that exploits the triangle inequality.

Our graph traversal will not necessarily explore the connected components in

G(i). If we denote by _V_p(i) the set of vertices in the connected component of _p in graph _G(i), then the graph traversal starting at _p will visit a set of vertices _V_exp such that_V_p(i) ⊆_V_exp ⊆_V_p(i+1). We will see later in the analysis that this does not signiﬁcantly aﬀect our estimator. We now describe our graph traversal algorithm in detail.

At the beginning all vertices areunexplored. Then the starting vertex_pis marked asexplored andrepresentative. In the next step, all neighbors of_pthat are in distance less than _ε·(1 +_ε)i _{are marked as} _explored_{. Then we proceed similarly to Prim’s} algorithm for the computation of minimum spanning trees: We choose the shortest of the edges in _E(i+1) _{that connect a representative vertex with an unexplored vertex.} This leads us to a new vertex that is again chosen to beexplored andrepresentative. Then, we repeat all steps above until no further point can be explored. We call this graph traversal theClique-Tree-Traversal_.

We give a pseudocode for theClique-Tree-Traversal _{below. The sets} _V_exp_, Vunexp, and_V_rep denote the sets of explored, unexplored, and representative vertices, respectively.

Clique-Tree-Traversal (P,p,i,ε)

Vrep={p}; Vexp={p}; Vunexp =P\ {p}

whilethere is an edge_e= (_{p, q})∈_E(i+1) with_p∈_V_rep and_q∈_V_unexp do let (_{p, q}) be the shortest such edge

Vexp=_V_exp∪ {q}; _V_unexp =_V_unexp\ {q} if _d(_{p, q})≥_ε(1 +_ε)i then _V_rep=_V_rep∪ {q}

Properties of the Clique-Tree-Traversal_{. First, it is easy to see that the} Clique-Tree-Traversal_{algorithm can be implemented to run in}_O₍n·|V_rep|_{) time,}

where here and in the following we use _V_rep and_V_exp to refer to their ﬁnal value in an execution of the Clique-Tree-Traversal. Also, we have that (for ε≤ 1

(10)

connected component in _G(i) _{that contains}

p(as we have already discussed, this is only approximately true, but it gives a good intuition).

Since we are considering edges incident to_V_rep in increasing order of length, we make sure that all vertices in _V_rep have pairwise distance at least _ε·(1 +_ε)i_{. This} will be used later to obtain a lower bound on the cost of the minimum spanning tree of_G.

Finally, we show that _V_p(i)⊆_V_exp ⊆_V_p(i+1). _V_exp ⊆_V_p(i+1) follows immediately from the fact that the algorithm uses only edges from_E(i+1)_{to explore the graph. To} show_V_p(i)⊆_V_exp, let us consider an arbitrary vertex_q∈_V_p(i)and let us assume that_q

is not explored by the algorithm. Since_qis in_V_p(i), it is connected to_pby a sequence of edges_p=_p0, p1, . . . , pk =q such thatd(pj, pj+1)≤(1 +ε)i for all 0≤j < k. Let

denote the smallest index such that_p is not explored by the algorithm. Thus, we know that_p₋1is explored. Ifp−1is a representative vertex, thenpis explored from

p−1, which is a contradiction to the choice of. Thus,p−1is no representative vertex. But then there is a representative vertex_wwith_d(_{w, p}₋1)< ε(1+ε)i. By the triangle inequality we have_d(_{w, p})≤_d(_{w, p}₋1)+d(p−1, p)< ε(1+ε)i+(1+ε)i= (1+ε)i+1. But this implies that_pis explored from_w, which contradicts the choice ofand hence our assumption that_qis not explored.

We summarize our discussion in the following lemma.

Lemma 4.1. _{The algorithm} Clique-Tree-Traversal _{satisfies the following}

properties. _V_rep and _V_exp refer to the final value of these sets, i.e., their value at the end of an execution of theClique-Tree-Traversal_.

(1) The algorithm can be implemented to run in time O(_n·log_n· |V_rep|).

(2) For_{ε <}1₂, the points assigned to the same representative point form a clique in_G(i).

(3) _Vp(i)⊆Vexp⊆Vp(i+1).

(4) The vertices in_V_rep have pairwise distance at least_ε·(1 +_ε)i.

In the analysis of our algorithm we will also use the notion ofgraph dispersion. To deﬁne this notion, let us ﬁrst extend the Clique-Tree-Traversal _{to a full}

graph traversal in the following natural way: We start with an arbitrary vertex_pand run theClique-Tree-Traversal_{with parameters (}_{P, p, i, ε}_{). If not all vertices are}

explored at the end of this traversal, we start the Clique-Tree-Traversal _from

one of the unexplored vertices (we never start at the same connected component more than once). We do this until every vertex has been explored. We call this process the

full Clique-Tree-Traversal.

It is easy to see that the number of representative vertices computed by the full Clique-Tree-Traversal _{may depend on the starting vertices. Let} _U_rep(i) _be

a maximum cardinality set of representative vertices computed by the full Clique-Tree-Traversalfor givenP,i, andε(the maximum is taken over all possible vertex

orderings). One parameter of particular interest for our analysis is thedispersion of the graph _G(i)_{, which is deﬁned as} _L₍

G(i)_{) =} _|U(i)

rep|; i.e., L(_G(i)_{) is the maximum} number of representative vertices computed by the full Clique-Tree-Traversal

for given_P, _i, and _ε. We will use the dispersion of _G(i) _{together with property (2)} to obtain bounds on the density of_G(i)_{. Furthermore, our main use of} _L₍

G(i)_{) is to} obtain a lower bound formstas stated in the next lemma.

Lemma 4.2. mst≥ε·(1 +ε)i· L(G(i))/4.

Proof. By our initial transformation we have mst ≥2n/ε. Also, we havei ≤ r = log₁₊_ε(4_n/ε). If L(_G(i)_{) = 1, then we have}

ε·(1 +_ε)i_/₄ _≤_ε_·_{(1 +}_ε₎r_/₄ _≤

(11)

Next we consider the case L(_G(i)₎

> 1. Representative vertices in distinct con-nected components have distance more than (1 +_ε)i+1_{. This, together with Lemma} 4.1(4), implies that all vertices in _U_rep(i) have pairwise distance at least _ε·(1 +_ε)i_. Hence, a minimum spanning tree of the subgraph of_G induced by_U_rep(i) has cost at least (_ε·(1 +_ε)i)·(|U_rep(i)| −1)≥(_ε·(1 +_ε)i)· L(_G(i))_/2 ifL(_G(i))_>1. It is known that this minimum spanning tree is a factor 2 approximation of the minimum Steiner tree of_U_rep(i) (cf. [23]). Since a minimum Steiner tree of_U_rep(i) has cost that is at most the cost of a minimum spanning tree of_G, the lemma follows.

5. Estimating the number of connected components: First attempt. We now discuss a ﬁrst version of our algorithmNumber-of-Connected-Components

(_P, _i, _ε). The algorithm combines the sampling approach from [7] with the graph traversal algorithmClique-Tree-Traversaldescribed in section 4. It is identical

to our final algorithm except for two modifications which with sufficiently high prob-ability affect only the running time of the algorithm. Also, in this version we assume that we know the dispersion of the graph in order to determine the number of required iterations to achieve a sharp concentration of the estimator. This assumption will be removed in the final algorithm. The algorithm uses the value _r = log1+ε(4n/ε) defined earlier as a bound on the logarithm of the maximum edge length. It also requires a bound_T on the number of iterations.

Number-of-Connected-Components-Version-1 (P,i,T,ε)

s= 0

while_s≤_T do

s=_s+ 1;_β_s= 0

choose a vertex_p_sindependently and uniformly at random choose a random integer_X according toPr[_X ≥_k] = 1_/k

runClique-Tree-Traversal (P, p_s, i, ε) until one of the following events

happens:

(1) more than _X vertices are explored (2) more than 4r

ε representative vertices are explored (3) the entire connected component in _G(i) _containing

psis explored if event (3) happens then_β_s= 1

outputˆ_c(i)₌n s ·

_s j=1βj

We ﬁrst show that the expected value of ˆ_c(i)_{is close to}

c(i)_.

Lemma 5.1 (expectation bound). The random variable ˆc(i) computed in

algo-rithmNumber-of-Connected-Components-Version-1 satisfies the following:

c(i) ≥ E _cˆ(i) ≥ _c(i+1)− 1 2_r·

mst

(1 +_ε)i.

Proof. The analysis from section 3 says that if we ignore events of type (2) and if Clique-Tree-Traversal (P, p, i, ε) always explores exactly V_p(i) (like a BFS),

then the expected value of ˆ_c(i)_{is exactly}

c(i)_{. Now we observe that the fact that the}

Clique-Tree-Traversal may explore not only vertices inV_p(i) but possibly other

vertices (see Lemma 4.1) can only lower the expected value of the_β_j and hence that of ˆ_c(i)_{. Also, the events of type (2) only reduce the expected value of ˆ}

c(i)_{. Hence, the} inequality_c(i)_≥_ˆ

(12)

Now, we prove the second inequality, namely, E[ˆ_c(i)_] _≥

c(i+1)₋ 1 2r·

mst (1+ε)i. We partition the (_i+ 1)-connected components in_P into two types. An (_i+ 1)-connected component_Cis of type (I) if thereexistsa vertex_p∈_Csuch that the Clique-Tree-Traversalwith starting vertexpstops with more than 4r/εrepresentative vertices.

Otherwise, an (_i+ 1)-connected component is of type (II).

Let _K be the number of connected components of type (I). Then mst ≥ K·

ε(1+ε)i

2 ·

4r

ε = 2K r(1 +ε)i, and hence K ≤ 1 2r ·

mst

(1+ε)i. Now observe that, given an arbitrary (_i+ 1)-connected component of type (II), if _X is at least the number of vertices in that component, then our algorithm will always output _β_j = 1. This implies

E[_β_j] ≥

type (II) connected componentC

Pr[_p_i∈_C]·Pr[_X≥ |C|] = c (i+1)₋

K

n .

Hence,

E[ˆ_c(i)] ≥ _c(i+1)−_K ≥ _c(i+1)− mst 2_r(1 +_ε)i.

Our next step is to show that the number of iterations of the algorithm suﬃces to achieve sharp concentration around the expectation of ˆ_c(i)_{. We parametrize the} additive error in terms ofL(_G(i)_{). Using the lower bound on}_mst_{in terms of}_L₍

G(i)₎ (cf. Lemma 4.2), this bound can be easily transformed into a relative error in terms of the cost of the minimum spanning tree.

Lemma 5.2 (_{concentration bound}). _If _T ≥ 210·n·r2

ε2·L(G(i)), then for the random

vari-ableˆ_c(i) _{computed by algorithm}_{Number-of-Connected-Components-Version-}₁

the following bound holds:

Pr_cˆ(i)−E ˆ_c(i)≤ ε 8_r· L(G

(i)₎ _≥ 15 16.

Proof. Similarly to [7], we can provide an upper bound for the variance of any single_β_i as follows:

Var[_β_i] ≤ E _β_i2 ≤ E _β_i ≤ c (i)

n ,

where the ﬁrst inequality uses the fact that 0≤_β_i≤1 and the last inequality follows from our analysis in the proof of Lemma 5.1 and from section 3. Therefore, we obtain the following inequality:

Var[ˆ_c(i)] = _n

T

2

·

1_≤i≤T

Var[_β_i] ≤ _n

T

2

·T·c

(i)

n =

n c(i)

T .

Next, by Chebyshev’s inequality we obtain

Pr_cˆ(i)−E ˆ_c(i)≥ ε 8_rL(G

(i)₎_≤ 64·n·c (i)_·

r2 T·ε2· L₍_G(i)₎2 ≤

64·_n·_r2

T·ε2· L₍_G(i)₎,

where the last inequality follows fromL(_G(i)₎_≥

c(i)_{, which trivially holds for}

ε < 1 2. With this, we obtain for_T ≥ 210·n·r2

ε2·L(G(i)) the bound

Prˆ_c(i)−E _cˆ(i)≥ ε 8_r· L(G

(13)

Pr_cˆ(i)−E ˆ_c(i)≤ ε 8_r· L(G

(i)₎ _≥ 15 16.

Let us summarize the obtained results and outline how to proceed further. Right now, we know that if the algorithm performs at least 210·n·r2

ε2·L(G(i)) iterations, then our estimator for the number of connected components is sharply concentrated. Ignoring the running time and the problem that the number of iteration depends onL(_G(i)_), one can show that plugging this approximation into our initial approach will yield a (1 +_ε)-approximation of the weight of the minimum spanning tree.

Therefore, we have to deal only with the number of iterations and the running time. Right now, the running time of a single iteration of the algorithm isO(_nr/ε). Hence, in the case thatL(_G(i)_{) = Ω(}

n), we can use our algorithm as it is and obtain linear running time. However, ifL(_G(i)₎

n, then we have to improve the (expected) running time of a single iteration of the algorithm. This can be achieved as follows. Recall that the vertices assigned to the sample representative point in _G(i) form a clique. Since the number of representative points is at mostL(_G(i)), we know that we have at leastL(_G(i)) such cliques in_G(i). This has certain impact on the degree distribution in the graph. For example, it implies that the average degree of_G(i)is at leastn/L(_G(i)). In general, the smallerL(_G(i)) is, the more vertices of high degree are in _G(i)_{. This can be used to speed up the algorithm. Recall that our algorithm uses} a random number_X as a stopping value in the exploration of connected components, i.e., the exploration of the graph stops, when the connected component has more than

X vertices. A simple condition to stop the exploration is that the starting vertex already has more than_X neighbors. But in the case whenL(_G(i)_{) is small, there are} many such vertices in the graph. So, we just need a way to detect that the starting vertex has large degree. This can be done using a simple random sampling approach, whose analysis follows from Chernoff bounds and that is given in section 6. It will turn out that using this algorithm as a filter to immediately stop exploration of connected components whose starting vertex has degree larger than_X will reduce the expected running time of a single iteration of the algorithm toO(log2_n·_r· L(_G(i))_/ε). Finally, we can replace the bound on the number of iterations by a bound on the running time that ensures that with sufficiently high probability we have enough iterations. This will lead to our ultimate algorithm, which will be discussed in detail in section 7.

6. Estimating degrees of vertices: DEGREE-ESTIMATE algorithm. In order to test whether a given vertex_phas degree larger than_X, we perform a more general task and approximate the degree deg_i(_p) of vertex_pin_G(i)_{within a constant} factor. Note that in our setting ﬁnding deg_i(_p) exactly requires Ω(_n) time. Our estimator will run in time O(_nlog_n/(1 + deg_i(_p))); i.e., it will be much faster if the degree of the vertex is large. Let_cbe a suﬃciently large constant.

Degree-Estimate (P,i,p)

= 1

repeat

= 2

pick a multiset_S of_N =_c·log_n·vertices uniformly at random with replacement

let_Y be the number of vertices in_Sthat are adjacent to _pin _G(i) until_Y ≥_clog_n or_{> n}

(14)

The algorithm increases the size of a sample set until we see at least _c·log_n neighbors of_pin our sample set. If this is the case, we can be sure that our sample set is a good approximation.

Lemma 6.1. Let (P, d) be a metric space with |P| = n, and let G(i) be the

threshold graph for a given _i. Then, with probability at least 1 − ₈_n12, algorithm

Degree-Estimate runs in time O nlogn

1+deg_i(p)

and returns value_D(_p)such that 1₂·

deg_i(_p)≤_D(_p)≤2·deg_i(_p).

Proof. For deg_i(_p) = 0 the algorithm outputs 0 in time O(_nlog_n). Hence, it remains to consider deg_i(_p)_>0.

Fix . Let _Y_j denote the indicator random variable for the event that the _jth vertex in _S is adjacent to _p. Clearly,_Y =N_j₌₁_Y_j. Further, we know that E[_Y] =

N·deg_i(_p)_/n=_c·log_n··degi(p)

n . Therefore, for= 1

2·n/degi(p) we use a Chernoﬀ bound to obtain

Pr[_Y ≥_c·log_n]≤Pr

|Y −E[_Y]| ≥ 1 2E[Y]

≤2·_e14·E[Y]/3 ≤ 1

16· logn ·_n2

using our assumption that _c is a suﬃciently large constant. By majorization, this implies a similar bound for _< 1₂ ·_n/deg_i(_p). In a similar way, we obtain that for

>2·_n/deg_i(_p) we have

Pr[_{Y < c}·log_n]≤Pr

|Y −E[_Y]| ≥ 1 2E[Y]

≤2·_e14·E[Y]/3 ≤ 1

16·(logn+ 1)·_n2.

Therefore, the probability that the algorithm terminates with 1₂ ·_n/deg_i(_p) ≤ ≤ 2·_n/deg_i(_p) is at least

1−

logn+1

=1

1

16·(logn+ 1)·_n2 = 1− 1 16·_n2.

Now we consider the case that 1₂_n/deg_i(_p)≤ ≤2_n/deg_i(_p). In this case, we have E[_Y]≥₂clog_n. Chernoﬀ bounds imply that

Pr

E[_Y]

2 ≤Y ≤2E[Y]

≤Pr

|Y −E[_Y]| ≥ 1 4E[Y]

≤2·_e14·E[Y]/3 ≤ 1

16·_n2

for_clarge enough. Hence, the overall error probability is at most₈_·1_n2. If the algorithm

makes no error, the running time is clearly O(_n·log_n/(1 + deg_i(_p))), which proves the lemma.

7. A sublinear time algorithm for estimating c(i). In this section we de-scribe and analyze ourO(_n/ε6_{)-time algorithm for estimating the number of connected} components _c(i) _in

G(i)_{. The algorithm combines the algorithm from section 5 with} the sampling algorithmDegree-Estimate from section 6 to speed up the running

time.

(15)

Number-of-Connected-Components (P,i,ε)

s= 0

whilethe running time is less than_T∗=O(_n/ε6) do

s=_s+ 1;_β_s= 0

choose a vertex_p_s independently and uniformly at random choose a random integer_X according toPr[_X≥_k] = 1_/k

D(_p_s) =Degree-Estimate(P, i, p_s)

if_D(_p_s)≤2_X then

runClique-Tree-Traversal (P, p_s, i, ε) until one of the following

events happens:

(1) more than _X vertices are explored (2) more than 4r

ε representative vertices are explored (3) the entire connected component in _G(i) _containing

psis explored if event (3) happensthen _β_s= 1

output ˆ_c(i)=n_s ·s_j₌₁_β_j

We say algorithmDegree-Estimate₍_{P, i, p}₎_{works properly} _{if it returns a value}

D(_p) with 1₂_D(_p) ≤deg_i(_p)≤ 2_D(_p) and its running time is O(_n·log_n/deg_i(_p)). Notice that by Lemma 6.1, every run of algorithmDegree-Estimate₍_{P, i, p}_{) works}

properly with probability at least 1− 1

8n2. Obviously, we can assume that the over-all running time is _o(_n2_{) because otherwise we can simply compute the minimum} spanning tree directly. Therefore, with probability at least 7

8, all runs of algorithm

Degree-Estimate(P, i, p) incorporated inNumber-of-Connected-Components

(_{P, i, ε}) work properly. Hence, from now on, we condition on the assumption that all runs ofDegree-Estimate (P, i, p) work properly.

Before we proceed with the analysis of the algorithm, we ﬁrst explain our use of algorithmDegree-Estimate that is needed to decrease the total running time

of the algorithm and has no inﬂuence on the output value. If in the _jth iteration of algorithmNumber-of-Connected-ComponentsprocedureDegree-Estimate

returns a value _D(_p)_>2_X, then we know that deg_i(_p) _{> X}. If deg_i(_p)_{> X}, then we know that _ni

p > X. Hence, our procedure stops theClique-Tree-Traversal

because of event (1) before event (3) can happen. This would cause_β_j = 0. Therefore, we do not have to invoke the Clique-Tree-Traversal in that case and we can

immediately set_β_j= 0.

For the remaining analysis (besides the running time analysis) we can therefore ignore the procedure Degree-Estimate. We can assume that for every sampled

vertex _p_j, we set _β_j = 0 if algorithm Clique-Tree-Traversal (P, p_j, i, ε) stops

because of event (1) or (2), or we set_β_j= 1 otherwise.

Our next step is to prove a bound on the number of iterations of algorithm

Number-of-Connected-Components_{. Here we will make use of the dispersion} L(_G(i)), and our bound will depend on that value.

Lemma 7.1 (expected running time of a single iteration). Let 0 < ε < 1

2. In a

single iteration of Number-of-Connected-Components, if the call to algorithm Degree-Estimate works properly, then the expected running time (over the choice

of _X) is O(_r·log2_n· L(_G(i))_/ε).

Proof. Let_T denote the random variable for the expected running time of a single iteration of the algorithm under the condition that algorithm Degree-Estimate

(16)

We start our analysis with a partition of_P intoL(_G(i)₎_clusters _{according to the} full Clique-Tree-Traversal. There is exactly one cluster for each representative

vertex. If a vertex _p is not a representative vertex itself, then it has been explored from a representative vertex _q. In this case we assign_pto the cluster containing _q. We observe that each cluster forms a clique in_G(i), because the distance between any two points in the same cluster is at most 2_ε(1 +_ε)i and_{ε <} 1₂. For any vertex_p, let

Cp denote the cluster that containsp. Notice that degi(p)≥ |Cp| −1.

If all calls to algorithmDegree-Estimatework properly, then the testD(p)<

2_X rejects every vertex _p_s in a cluster of size greater than 4_X. In this case, when

|Cps|>4|X|, the running timeT of the iteration is at mostα·nlogn/(1 +D(ps)), where_αis a constant used to upper bound the constants hidden in the big-Oh notation of the running time ofDegree-Estimate_{. Hence we get}

T ≤ α·n·log_n/(1 +_D(_p_s)) ≤ 2·_α·_n·log_n/(1 + deg_i(_p_s)) ≤ 2·_α·_n·logn/|C_p_s|. For any vertex that is in a cluster of size smaller than or equal to 4_X, we run the

Clique-Tree-Traversal_{until either we ﬁnd more than 4}_r/ε_{representative vertices}

or some other stopping criterion is matched. By Lemma 4.1, the running time of this call to Clique-Tree-Traversal _is O₍_n_log_n_{) times the number of representative}

vertices visited, and so we have_T ≤α·n·logn·r/εfor any vertex_p_swith|C_p_s| ≤4|X|. Since the number of clusters is L(_G(i)), we clearly have at most 4_X · L(_G(i)) vertices in all clusters of size at most 4_X. Now we observe that in the case_{X > n} the behavior of our algorithm is identical to the case_X =_n. Therefore, if we deﬁne a random variable _X∗ = _X for _{X < n} and _X∗ = _n for _X ≥ _n, then we get for a ﬁxed value of _X∗ and for a single iteration of algorithm Number-of-Connected-Components

E _T|X∗≤

p:_|Cp|≤4X∗

Pr[_p_s=_p]·α·n·logn·r

ε

+

p:|Cp|>4X∗

Pr[_p_s=_p]· 2α·n·logn

|Cp|

=

p:_|Cp|≤4X∗ 1

n·

α·n·log_n·_r

ε +

p:_|Cp|>4X∗ 1

n·

2·_α·_n·log_n

|Cp|

≤4·α·X∗·r· L(G(i))·logn

ε + 2·α·logn

p∈P 1

|Cp|

=4·α·X

∗_·_r_{· L}₍_G(i)₎_·_log

n

ε + 2·α·logn· L(G

(i)₎

≤6·α·X∗·r· L(G(i))·logn

ε .

Since the choice of_X∗is independent of the other choices, the inequalityE[_X∗]≤log_n implies that

E _T ≤ 6·α·r·log 2

n· L(_G(i)₎

ε = O

r·log2_n· L(_G(i)₎

ε

.

Corollary 7.2 (number of iterations). If in all calls algorithm Degree-Estimate _{works properly, then for certain} _T∗ ₌ O₍_n·_r3 ·_log2_n·_ε−3₎ _the

num-ber of iterations _s of algorithm Number-of-Connected-Components _{is at least}

210_·n·r2