Our k-means clustering algorithm uses a coreset construction based on the k-Means++ seeding procedure of Arthur and Vassilvitskii [9]. One reason for this design decision is that the k-Means++ seeding works well for high-dimensional datasets, which is often required in practice. Many other clustering methods lack this property, for instance the grid-based methods of Har-Peled and Mazumdar [58] and of Frahling and Sohler [44].

Let $P \subset \mathbb{R}^d$ be a set of points with size $|P| =: n$. For an arbitrary fixed parameter $m \in \mathbb{N}$, our coreset construction is as follows (see also Algorithm 6.2.1). First, we choose a set $S := \{q_1, q_2, \ldots, q_m\}$ of size $m$ at random according to $D^2$. Let $Q_i$ denote the set of points from $P$ that are closest to $q_i$ (breaking ties arbitrarily). Using the weight function $w : S \to \mathbb{R}_{\ge 0}$ with $w(q_i) = |Q_i|$, we obtain the weighted set $S$ as our coreset.

Note that our coreset construction is easy to implement, and its running time depends only linearly on the dimension $d$. Furthermore, the empirical evaluation in Section 6.5 suggests that our construction leads to good coresets even for relatively small choices of $m$ (say, $m = 200k$). Unfortunately, we do not have a formal proof supporting this observation. However, we are able to take a first step by proving that, at least in low-dimensional spaces, our construction indeed leads to small coresets.

6.2 Coreset Construction 105

Algorithm 6.2.1 AdaptiveCoreset(P, m)

 1: choose an initial coreset point q_1 uniformly at random from P
 2: w(q_1) ← 0
 3: S ← {q_1}
 4: for i ← 2 to m do
 5:     choose q_i at random according to D^2 from P
 6:     w(q_i) ← 0
 7:     S ← S ∪ {q_i}
 8: for each p ∈ P do
 9:     let q_i ∈ S, 1 ≤ i ≤ m, be the nearest coreset point to p
10:     w(q_i) ← w(q_i) + 1
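As an illustration, Algorithm 6.2.1 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the implementation evaluated in Section 6.5: points are tuples of numbers, "at random according to $D^2$" is read as sampling each point with probability proportional to its squared distance to the nearest coreset point chosen so far, and ties are broken towards the coreset point with the smallest index. The names `sq_dist` and `adaptive_coreset` and the uniform fallback for the degenerate case are hypothetical.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def adaptive_coreset(points, m, rng=None):
    """Return (coreset, weights) following the steps of Algorithm 6.2.1."""
    rng = rng or random.Random()
    # Line 1: first coreset point uniformly at random.
    coreset = [rng.choice(points)]
    # d2[i] = squared distance from points[i] to its nearest coreset point so far.
    d2 = [sq_dist(p, coreset[0]) for p in points]
    # Lines 4-7: D^2 sampling of the remaining m - 1 coreset points.
    for _ in range(1, m):
        if sum(d2) > 0:
            q = rng.choices(points, weights=d2, k=1)[0]
        else:
            # Assumption: if every point coincides with a coreset point,
            # fall back to uniform sampling.
            q = rng.choice(points)
        coreset.append(q)
        d2 = [min(old, sq_dist(p, q)) for old, p in zip(d2, points)]
    # Lines 8-10: every point adds 1 to the weight of its nearest coreset point.
    weights = [0] * m
    for p in points:
        nearest = min(range(m), key=lambda i: sq_dist(p, coreset[i]))
        weights[nearest] += 1
    return coreset, weights
```

Written this way, the sketch performs $O(nm)$ distance evaluations of cost $O(d)$ each, and the weights always sum to $n$.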

Our proof is based on Lemma 6.2.1. Intuitively, this lemma states that, for $m$ large enough, the cost of an optimal $m$-clustering of $P$ is merely a tiny fraction of the cost of an optimal $k$-clustering of $P$. Lemma 6.2.1 is a consequence of the fact that there exist $(k, \gamma)$-coresets of size $m \in (d/\gamma)^{O(d)} \cdot k \log(n)$, which has already been proven by Har-Peled and Mazumdar [58].

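Lemma 6.2.1. Let $\gamma$, $0 < \gamma \le 1$, and let $m \in \mathbb{N}$. If

\[ m \ge \left( \frac{16d}{\gamma} \right)^{d/2} \cdot k \cdot \lceil \log(n) + 3 \rceil , \]

then we get

\[ \mathrm{Means}_m(P) \le \gamma \cdot \mathrm{Means}_k(P) . \]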
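To get a feeling for the size condition of Lemma 6.2.1, the following sketch evaluates the bound $(16d/\gamma)^{d/2} \cdot k \cdot \lceil \log(n) + 3 \rceil$ numerically. The function name `lemma_min_size` is hypothetical, and the logarithm is assumed to be taken to base 2, which is suggested by the powers of two used in the grid construction of the proof.

```python
import math


def lemma_min_size(d, gamma, k, n):
    """Smallest integer m meeting the size condition of Lemma 6.2.1.

    Assumption: log denotes the base-2 logarithm.
    """
    if not 0 < gamma <= 1:
        raise ValueError("gamma must lie in (0, 1]")
    cells_per_region = (16 * d / gamma) ** (d / 2)
    regions = k * math.ceil(math.log2(n) + 3)
    return math.ceil(cells_per_region * regions)


print(lemma_min_size(d=2, gamma=0.5, k=3, n=1024))  # 2496
```

The exponential dependency on $d$ is clearly visible here, which is why the guarantee is only meaningful in low-dimensional spaces.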

Proof. Let $C := \{c_1, \ldots, c_k\}$ be an optimal solution to the Euclidean $k$-means clustering problem for $P$ with $|P| =: n$, i.e., $\mathrm{Means}(P, C) = \mathrm{Means}_k(P)$. We consider an exponential grid around each center in $C$. The construction of this grid is essentially the same as the one from Har-Peled and Mazumdar [58]. In detail, the construction is defined as follows.

Let the average cost per point of an optimal $k$-clustering solution for $P$ be denoted by

\[ R := \frac{\mathrm{Means}_k(P)}{n} . \]

Furthermore, for each $j \in \{0, 1, \ldots, \lceil \log(n) + 2 \rceil\}$ and each center $c_i \in C$, let $V_{ij}$ be the axis-parallel square centered at $c_i$ with side length

\[ r_j := \sqrt{2^j R} . \]

Then, we recursively define $W_{i0} := V_{i0}$ and $W_{ij} := V_{ij} \setminus V_{i,j-1}$ for $j \in \{1, 2, \ldots, \lceil \log(n) + 2 \rceil\}$.

Obviously, each point in $P$ is contained in some $W_{ij}$, since otherwise there would be a point $p \in P$ with

\[ D^2(p, C) > \left( \frac{r_{\lceil \log(n) + 2 \rceil}}{2} \right)^2 = \frac{2^{\lceil \log(n) + 2 \rceil} R}{4} \ge nR \ge \mathrm{Means}_k(P) , \]

which is a contradiction.
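The exponential grid can be made concrete with a small sketch that computes, for every point, the index $j$ of the region $W_{ij}$ around its nearest center. The sketch makes several assumptions that are not part of the proof: the given centers are simply treated as the solution defining $R$, the logarithm is taken to base 2, and the function name `ring_indices` is hypothetical. Membership in the axis-parallel square $V_{ij}$ is tested via the maximum coordinate difference to the center.

```python
import math


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def ring_indices(points, centers):
    """For each point, return the smallest j with p in V_ij (nearest center i)."""
    n = len(points)
    cost = sum(min(sq_dist(p, c) for c in centers) for p in points)
    R = cost / n  # average cost per point
    j_max = math.ceil(math.log2(n) + 2)
    rings = []
    for p in points:
        center = min(centers, key=lambda c: sq_dist(p, c))
        # p lies in the square V_ij iff its largest coordinate offset
        # from the center is at most r_j / 2.
        offset = max(abs(a - b) for a, b in zip(p, center))
        j = next((j for j in range(j_max + 1)
                  if offset <= math.sqrt(2 ** j * R) / 2), None)
        rings.append(j)
    return rings


P = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10), (10, 12)]
C = [(0, 0), (10, 10)]
print(ring_indices(P, C))  # [0, 2, 2, 0, 2, 4]
```

In this example $R = 7/6$ and the largest index is $\lceil \log(6) + 2 \rceil = 5$; as argued above, no point is left without a region.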

For each $i, j$ individually, we partition $W_{ij}$ into small grid cells with side length

\[ r'_j := \sqrt{\frac{\gamma}{9d}} \cdot r_j = \sqrt{\frac{\gamma}{9d} \cdot 2^j R} . \]

We remark that the small grid cells do not have to fit properly in $W_{ij}$. In fact, we impose a grid with side length $r'_j$ on $W_{ij}$ such that $W_{ij}$ is completely covered. Then, the partition of $W_{ij}$ consists of all the small cells that lie completely inside $W_{ij}$ as well as the parts inside $W_{ij}$ of all the small cells that only partly overlap $W_{ij}$. This partition is illustrated by Figure 6.1.

Figure 6.1: Illustration of the partition of $W_{ij}$ into small grid cells. The area $W_{ij}$ is colored in gray. The white parts of the small cells do not belong to the partition of $W_{ij}$.

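For each grid cell $C$ such that $C \cap W_{ij}$ contains points from $P$, we select a single point from $P \cap C \cap W_{ij}$ as the representative of all the points in $P \cap C \cap W_{ij}$. Let $G$ be the set of all these representatives. There are $k \cdot \lceil \log(n) + 3 \rceil$ regions $W_{ij}$, and each of them is covered by at most $(r_j / r'_j + 1)^d \le (16d/\gamma)^{d/2}$ small cells, so in total we have at most

\[ \left( \frac{16d}{\gamma} \right)^{d/2} \cdot k \cdot \lceil \log(n) + 3 \rceil \]

grid cells. Since this number is less than or equal to $m$, we obtain $|G| \le m$.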

Let $g_p$ denote the representative of $p \in P$ in $G$. Then, we have

\[ \mathrm{Means}_m(P) \le \mathrm{Means}_{|G|}(P) \le \mathrm{Means}(P, G) \le \sum_{p \in P} D^2(p, g_p) . \tag{6.1} \]

For each point $p \in P$, the distance from its representative $g_p$ is upper bounded by the diagonal of the grid cell that contains $p$. Thus, for each $p \in W_{i0}$, we have

\[ D^2(p, g_p) \le d \cdot {r'_0}^2 = \frac{\gamma R}{9} . \tag{6.2} \]

Furthermore, for each $p \in W_{ij}$ with $j \ge 1$, we know that $c_i$ is the center of $V_{i,j-1}$ and $p$ is not contained in $V_{i,j-1}$. It follows that

\[ D^2(p, C) \ge \left( \frac{r_{j-1}}{2} \right)^2 \ge 2^{j-3} R . \]

Hence, in this case, we get

\[ D^2(p, g_p) \le d \cdot {r'_j}^2 = \frac{\gamma}{9} \cdot 2^j R \le \frac{8\gamma}{9} \cdot D^2(p, C) . \tag{6.3} \]

Due to Inequalities (6.1)–(6.3) and the definition of $R$, we obtain

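\[ \mathrm{Means}_m(P) \le \sum_{p \in P} D^2(p, g_p) \le n \cdot \frac{\gamma R}{9} + \frac{8\gamma}{9} \sum_{p \in P} D^2(p, C) = \frac{\gamma}{9} \cdot \mathrm{Means}_k(P) + \frac{8\gamma}{9} \cdot \mathrm{Means}_k(P) = \gamma \cdot \mathrm{Means}_k(P) . \]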

Now, we go back to our coreset construction. Given the point set $P$ and a parameter $m \in \mathbb{N}$, let $S$ be our weighted coreset chosen at random according to $D^2$ from $P$ by Algorithm 6.2.1. Furthermore, let $C$ be an arbitrary set of $k$ centers. For each point $p \in P$, let $q_p$ denote the point from $S$ whose weight is increased by 1 due to $p$ in line 10 of Algorithm 6.2.1, i.e., $q_p$ is a point from $S$ closest to $p$. Then, the difference between the cost of clustering $P$ and the cost of clustering $S$ is at most

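\[ |\mathrm{Means}(P, C) - \mathrm{Means}(S, C)| = \Bigl| \sum_{p \in P} \bigl( D^2(p, C) - D^2(q_p, C) \bigr) \Bigr| \le \sum_{p \in P} \bigl| D^2(p, C) - D^2(q_p, C) \bigr| . \]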
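The comparison between the cost of $P$ and the cost of the weighted set $S$ can be checked on a toy instance. The sketch below is only illustrative: the data, the helper names `cost` and `nearest_index`, and the choice of integer coordinates (which keeps the arithmetic exact) are assumptions. It uses the fact that the weighted cost of $S$ charges each point $p$ the cost of its coreset point $q_p$.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def cost(points, centers, weights=None):
    """Weighted k-means cost: sum over points of w(p) * D^2(p, C)."""
    if weights is None:
        weights = [1] * len(points)
    return sum(w * min(sq_dist(p, c) for c in centers)
               for p, w in zip(points, weights))


def nearest_index(p, coreset):
    """Index of the coreset point closest to p (first one on ties)."""
    return min(range(len(coreset)), key=lambda i: sq_dist(p, coreset[i]))


P = [(0, 0), (1, 0), (5, 0), (6, 1)]   # toy point set
S = [(0, 0), (5, 0)]                    # toy coreset points
C = [(2, 0)]                            # arbitrary center set

assignment = [nearest_index(p, S) for p in P]
w = [assignment.count(i) for i in range(len(S))]

lhs = abs(cost(P, C) - cost(S, C, w))
rhs = sum(abs(cost([p], C) - cost([S[i]], C)) for p, i in zip(P, assignment))
print(w, lhs, rhs)  # [2, 2] 5 11
```

Here the total cost difference is 5, while the sum of the per-point differences is 11, so bounding each point's error separately is sufficient.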

We partition $P$ into two subsets $P_{\mathrm{near}}$ and $P_{\mathrm{dist}}$. Roughly speaking, the set $P_{\mathrm{near}}$ contains each point $p \in P$ whose distance from its coreset point $q_p$ is small compared to the distance from its nearest center in $C$. More precisely, for any constant $\varepsilon$ with $0 < \varepsilon \le 1$, we define

\[ P_{\mathrm{near}} := \{ p \in P \mid D(p, q_p) \le \varepsilon \, D(p, C) \} . \]

The set $P_{\mathrm{dist}}$ contains all the other points from $P$, i.e.,

\[ P_{\mathrm{dist}} := \{ p \in P \mid D(p, q_p) > \varepsilon \, D(p, C) \} . \]
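The partition into near and distant points translates directly into code. The following sketch assumes points, coreset points, and centers are given as tuples; the function name `split_near_dist` and the toy data are hypothetical.

```python
import math


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def split_near_dist(points, coreset, centers, eps):
    """Split points into (P_near, P_dist) for the given coreset and centers."""
    near, dist = [], []
    for p in points:
        q_p = min(coreset, key=lambda q: sq_dist(p, q))  # closest coreset point
        d_pq = math.sqrt(sq_dist(p, q_p))
        d_pc = math.sqrt(min(sq_dist(p, c) for c in centers))
        if d_pq <= eps * d_pc:
            near.append(p)
        else:
            dist.append(p)
    return near, dist


near, dist = split_near_dist(
    points=[(0, 0), (1, 0), (5, 0)],
    coreset=[(0, 0), (4, 0)],
    centers=[(2, 0)],
    eps=0.5,
)
print(near, dist)  # [(0, 0), (5, 0)] [(1, 0)]
```

The point $(1, 0)$ ends up in $P_{\mathrm{dist}}$ because its distance 1 to its coreset point exceeds $\varepsilon$ times its distance 1 to the center.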

First, in Claim 6.2.2, we estimate the error of the clustering cost that occurs for any point in Pnear. Then, in Claim 6.2.3, we give an estimation of the error for any point in Pdist.

Claim 6.2.2. If $p \in P_{\mathrm{near}}$, then

\[ \bigl| D^2(p, C) - D^2(q_p, C) \bigr| \le 3\varepsilon \, D^2(p, C) . \]

Proof. For the moment, let us assume that $D(p, C) \le D(q_p, C)$. Let $c_p$ denote the element from $C$ closest to $p$. By the triangle inequality and the definition of $P_{\mathrm{near}}$, we have

\[ D(q_p, C) \le D(q_p, c_p) \le D(p, c_p) + D(p, q_p) \le (1 + \varepsilon) \cdot D(p, C) . \]

Hence, for the squared distances, we obtain

\[ D^2(q_p, C) \le (1 + \varepsilon)^2 \cdot D^2(p, C) \le (1 + 3\varepsilon) \cdot D^2(p, C) , \]

where the last step uses $\varepsilon^2 \le \varepsilon$ for $0 < \varepsilon \le 1$. Thus, we get $D^2(q_p, C) - D^2(p, C) \le 3\varepsilon \, D^2(p, C)$, which proves the claim in the case $D(p, C) \le D(q_p, C)$.

Now, assume that $D(p, C) > D(q_p, C)$. Let $c_s$ denote the element from $C$ closest to $q_p$. Again, by the triangle inequality and the definition of $P_{\mathrm{near}}$, we have

\[ D(p, C) \le D(p, c_s) \le D(q_p, c_s) + D(p, q_p) \le D(q_p, C) + \varepsilon \, D(p, C) . \]

It follows that $(1 - \varepsilon) \cdot D(p, C) \le D(q_p, C)$. For the squared distances, we obtain

\[ D^2(q_p, C) \ge (1 - \varepsilon)^2 \cdot D^2(p, C) > (1 - 2\varepsilon) \cdot D^2(p, C) . \]

Hence, we get

\[ D^2(p, C) - D^2(q_p, C) < 2\varepsilon \, D^2(p, C) < 3\varepsilon \, D^2(p, C) , \]

which proves the claim in the case $D(p, C) > D(q_p, C)$.
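The statement of Claim 6.2.2 can also be checked numerically. The sketch below is a sanity check on random data, not a proof: the point set, the randomly drawn coreset points, the centers, and the function name `near_claim_holds` are all assumptions chosen for illustration.

```python
import math
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def near_claim_holds(p, q_p, centers, eps):
    """Check |D^2(p,C) - D^2(q_p,C)| <= 3*eps*D^2(p,C) for p in P_near.

    Returns None if p does not belong to P_near.
    """
    d2_p = min(sq_dist(p, c) for c in centers)
    d2_q = min(sq_dist(q_p, c) for c in centers)
    if math.sqrt(sq_dist(p, q_p)) > eps * math.sqrt(d2_p):
        return None
    return abs(d2_p - d2_q) <= 3 * eps * d2_p


rng = random.Random(1)
P = [(rng.uniform(-5, 5), rng.uniform(-5, 5)) for _ in range(1000)]
S = rng.sample(P, 50)               # stand-in for the coreset points
C = [(-2.0, 0.0), (2.0, 1.0)]       # arbitrary centers

results = []
for p in P:
    q_p = min(S, key=lambda q: sq_dist(p, q))
    results.append(near_claim_holds(p, q_p, C, eps=0.5))

checked = [r for r in results if r is not None]
print(len(checked), all(checked))
```

Every point that falls into $P_{\mathrm{near}}$ should satisfy the inequality, so the second printed value is expected to be `True`.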

Claim 6.2.3. If $p \in P_{\mathrm{dist}}$, then

\[ \bigl| D^2(p, C) - D^2(q_p, C) \bigr| \le \frac{3}{\varepsilon} \, D^2(p, q_p) . \]

Proof. Let cp denote the element from C closest to p, and let cs denote the element from C closest to qp. By triangle inequality, we have

D(p, C) ≤ D(p, cs)

Now, we can show our main result.

Theorem 6. Let k ∈N, let ε, 0 < ε ≤ 1, be a precision parameter, and let δ, 0 < δ < 1,

algorithm AdaptiveCoreset computes a weighted multiset S with size m that is a (k, 6ε)-coreset of P with probability at least 1 − δ.

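Proof. Due to Claims 6.2.2 and 6.2.3, we have

\[ |\mathrm{Means}(P, C) - \mathrm{Means}(S, C)| \le \sum_{p \in P_{\mathrm{near}}} 3\varepsilon \, D^2(p, C) + \sum_{p \in P_{\mathrm{dist}}} \frac{3}{\varepsilon} \, D^2(p, q_p) \le 3\varepsilon \cdot \mathrm{Means}(P, C) + \frac{3}{\varepsilon} \cdot \mathrm{Means}(P, S) . \]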

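Due to Lemma 6.1.3 and Markov’s inequality, we obtain

\[ \mathrm{Means}(P, S) \le \frac{8}{\delta} (2 + \ln(m)) \cdot \mathrm{Means}_m(P) \]

with probability at least $1 - \delta$. Hence, by using Lemma 6.2.1 with

\[ \gamma := \frac{\varepsilon^2 \delta}{8 (2 + \ln(m))} , \]

we get $\mathrm{Means}(P, S) \le \varepsilon^2 \cdot \mathrm{Means}_k(P) \le \varepsilon^2 \cdot \mathrm{Means}(P, C)$ and, thus, $|\mathrm{Means}(P, C) - \mathrm{Means}(S, C)| \le 6\varepsilon \cdot \mathrm{Means}(P, C)$ with probability at least $1 - \delta$, provided that the coreset size $m$ satisfies the condition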

\[ m \ge \left( \frac{16d}{\gamma} \right)^{d/2} \cdot k \cdot \lceil \log(n) + 3 \rceil . \tag{6.4} \]

Hence, the only thing left to do is to prove that there exists a coreset size

m ∈ d that satisfies Inequality (6.4). Since we can assume that n ≥ 16 and m ≥ 8, Inequality (6.4) is satisfied if we have

\[ \frac{m}{\log^{d/2}(m)} \ge \frac{2 \cdot 16^d d^{d/2} k \log(n)}{\delta^{d/2} \varepsilon^d} . \tag{6.6} \]
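Inequality (6.6) can also be explored numerically. The following sketch searches for a value of $m$ that satisfies it by repeated doubling, starting from $m = 8$. It is only an illustration: the function names `rhs_6_6` and `find_m` are hypothetical, the logarithm is assumed to be taken to base 2, and the returned value is some power of two that works, not necessarily the smallest feasible $m$.

```python
import math


def rhs_6_6(d, k, n, delta, eps):
    """Right-hand side of Inequality (6.6), with log taken to base 2."""
    return (2 * 16 ** d * d ** (d / 2) * k * math.log2(n)
            / (delta ** (d / 2) * eps ** d))


def find_m(d, k, n, delta, eps):
    """Return a power of two m >= 8 with m / log^{d/2}(m) >= rhs of (6.6)."""
    target = rhs_6_6(d, k, n, delta, eps)
    m = 8
    while m / math.log2(m) ** (d / 2) < target:
        m *= 2
    return m


print(find_m(d=2, k=2, n=16, delta=0.5, eps=1.0))  # 524288
```

Even for this tiny example with $d = 2$, the required size is far above the values of $m$ that work well in practice, which matches the gap between the formal bound and the empirical observations noted earlier.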

We conclude that Condition (6.5) and Inequality (6.6) are both satisfied for a choice of m = (2d)d/2·2 · 16ddd/2k log(n)