Coreset Construction - Approximation Techniques for Facility Location and Their Applications in Metric Embeddings

Our k-means clustering algorithm uses a coreset construction based on the k-Means++ seeding procedure from Arthur and Vassilvitskii [9]. One reason for this design decision is that the k-Means++ seeding works well for high-dimensional datasets, which is often required in practice. Many other clustering methods lack this property, for instance the grid-based methods from Har-Peled and Mazumdar [58] and Frahling and Sohler [44].

Let P ⊂ R^d be a set of points with size |P| =: n. For an arbitrary fixed parameter m ∈ N, our coreset construction is as follows (see also Algorithm 6.2.1). First, we choose a set S := {q₁, q₂, ..., q_m} of size m at random according to D². Let Q_i denote the set of points from P that are closest to q_i (breaking ties arbitrarily). By using the weight function w : S → R≥0 with w(q_i) = |Q_i|, we obtain the weighted set S as our coreset.

Note that our coreset construction is rather easy to implement and that its running time depends only linearly on the dimension d. Furthermore, empirical evaluation (as given in Section 6.5) suggests that our construction leads to good coresets even for relatively small choices of m (e.g., m = 200k). Unfortunately, we do not have a formal proof supporting this observation. However, we are able to take a first step by proving that, at least in low-dimensional spaces, our construction indeed leads to small coresets.

6.2 Coreset Construction 105

Algorithm 6.2.1 AdaptiveCoreset(P, m)

1: choose an initial coreset point q₁ uniformly at random from P
2: w(q₁) ← 0
3: S ← {q₁}
4: for i ← 2 to m do
5:     choose q_i at random according to D² from P
6:     w(q_i) ← 0
7:     S ← S ∪ {q_i}
8: for each p ∈ P do
9:     let q_i ∈ S, 1 ≤ i ≤ m, be the nearest coreset point to p
10:    w(q_i) ← w(q_i) + 1
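Algorithm 6.2.1 can be sketched in a few lines of Python. This is a minimal illustration and not the implementation evaluated in Section 6.5. It assumes that choosing a point "according to D²" follows the k-Means++ rule, i.e., a point is drawn with probability proportional to its squared distance to the nearest coreset point chosen so far. The names `adaptive_coreset` and `sq_dist` and the `rng` parameter are hypothetical and introduced only for this sketch.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def adaptive_coreset(points, m, rng=None):
    """Return (S, w): a list of m coreset points and their integer weights."""
    rng = rng or random.Random()
    # Line 1: the first coreset point is chosen uniformly at random.
    S = [rng.choice(points)]
    # d2[idx] is the squared distance of points[idx] to its nearest point in S,
    # nearest[idx] is the index of that point within S.
    d2 = [sq_dist(p, S[0]) for p in points]
    nearest = [0] * len(points)
    # Lines 4-7: draw the remaining points with probability proportional to d2.
    for i in range(1, m):
        if sum(d2) == 0:
            # Degenerate case (assumption of this sketch): every point already
            # coincides with a coreset point, so fall back to uniform sampling.
            q = rng.choice(points)
        else:
            q = rng.choices(points, weights=d2, k=1)[0]
        S.append(q)
        for idx, p in enumerate(points):
            dist = sq_dist(p, q)
            if dist < d2[idx]:  # ties keep the earlier coreset point
                d2[idx] = dist
                nearest[idx] = i
    # Lines 8-10: each point adds 1 to the weight of its nearest coreset point.
    w = [0] * m
    for j in nearest:
        w[j] += 1
    return S, w
```

Maintaining the nearest-point distances incrementally keeps the sketch at O(n · m · d) arithmetic operations, which matches the linear dependency on the dimension d noted above.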

Our proof is based on Lemma 6.2.1. Intuitively, this lemma states that if we consider an optimal m-clustering of P, with m large enough, then the optimal m-clustering cost is merely a tiny fraction of the optimal k-clustering cost of P. Lemma 6.2.1 is a consequence of the fact that there exist (k, γ)-coresets of size m ∈ (d/γ)^O(d) · k · log(n), which has already been proven by Har-Peled and Mazumdar [58].

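Lemma 6.2.1. Let γ, 0 < γ ≤ 1, and let m ∈ N. If

$$m \ge \left(\frac{16d}{\gamma}\right)^{d/2} \cdot k \cdot \lceil \log(n) + 3 \rceil ,$$

then we get

$$\operatorname{Means}^*_m(P) \le \gamma \cdot \operatorname{Means}^*_k(P) .$$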
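To get a feeling for the size condition m ≥ (16d/γ)^(d/2) · k · ⌈log(n) + 3⌉ of Lemma 6.2.1, the bound can be evaluated numerically. The following sketch assumes that log denotes the binary logarithm, as suggested by the powers of two used in the proof; the function name `lemma_size_bound` is hypothetical.

```python
import math


def lemma_size_bound(d, gamma, k, n):
    """Smallest integer m with m >= (16 d / gamma)^(d/2) * k * ceil(log2(n) + 3)."""
    cells_per_region = (16 * d / gamma) ** (d / 2)
    regions = k * math.ceil(math.log2(n) + 3)
    return math.ceil(cells_per_region * regions)
```

For fixed γ, k, and n, the bound grows exponentially in d, which is why the analysis only yields small coresets in low-dimensional spaces.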

Proof. Let C := {c₁, . . . , c_k} be an optimal solution to the Euclidean k-means clustering problem for P with |P | =: n, i.e., Means(P, C) = Means^∗_k(P ). We consider an exponential grid around each center in C. The construction of this grid is essentially the same as the one from Har-Peled and Mazumdar [58]. In detail, the construction is defined as follows.

Let the average cost per point of an optimal k-clustering solution for P be denoted by

$$R := \frac{\operatorname{Means}^*_k(P)}{n} .$$

Furthermore, for each j ∈ {0, 1, ..., ⌈log(n) + 2⌉} and each center c_i ∈ C, let V_ij be the axis-parallel square centered at c_i with side length

$$r_j := \sqrt{2^j R} .$$

Then, we recursively define W_i0 := V_i0 and W_ij := V_ij \ V_i,j−1 for j ∈ {1, 2, ..., ⌈log(n) + 2⌉}.

Obviously, each point in P is contained within a W_ij since otherwise there would be a point p ∈ P with

$$D^2(p, C) > \left(\frac{r_{\lceil \log(n) + 2 \rceil}}{2}\right)^2 = \frac{2^{\lceil \log(n) + 2 \rceil} R}{4} \ge nR \ge \operatorname{Means}^*_k(P) ,$$

which is a contradiction.
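The covering argument can be checked numerically. The sketch below computes, for a point p, the smallest j such that p lies in the square V_ij around its nearest center, and this index never exceeds ⌈log(n) + 2⌉. As simplifying assumptions of this illustration, the centers need not be optimal (R is then the average cost with respect to the given centers, for which the same argument applies), only the nearest center of each point is considered, and log is the binary logarithm. All function names are hypothetical.

```python
import math


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def average_cost(points, centers):
    """R: k-means cost of `centers` divided by the number of points."""
    return sum(min(sq_dist(p, c) for c in centers) for p in points) / len(points)


def ring_index(p, centers, R):
    """Smallest j such that p lies in the square of side r_j = sqrt(2^j R)
    centered at the center nearest to p."""
    c = min(centers, key=lambda c: sq_dist(p, c))
    # p lies in the axis-parallel square of side r iff its max-norm
    # distance to the center is at most r / 2.
    linf = max(abs(a - b) for a, b in zip(p, c))
    j = 0
    while linf > math.sqrt(2 ** j * R) / 2:
        j += 1
    return j


points = [((i * 0.37) % 5, (i * 0.91) % 3) for i in range(50)]
centers = [(1.0, 1.0), (4.0, 2.0)]
R = average_cost(points, centers)
j_max = math.ceil(math.log2(len(points)) + 2)
indices = [ring_index(p, centers, R) for p in points]
```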

For each i, j individually, we partition W_ij into small grid cells with side length

$$r'_j := \sqrt{\frac{\gamma}{9d}} \cdot r_j = \sqrt{\frac{\gamma}{9d} \cdot 2^j R} .$$

We remark that the small grid cells do not have to fit properly in W_ij. In fact, we impose a grid with side length r′_j on W_ij such that W_ij is completely covered. Then, the partition of W_ij consists of all the small cells that lie completely inside W_ij as well as the parts inside W_ij of all the small cells that only partly overlap W_ij. This partition is illustrated by Figure 6.1.


Figure 6.1: Illustration of the partition of W_ij into small grid cells. The area W_ij is colored in gray. The white parts of the small cells do not belong to the partition of W_ij.

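For each grid cell C such that C ∩ W_ij contains points from P, we select a single point from P ∩ C ∩ W_ij as the representative of all the points in P ∩ C ∩ W_ij. Let G be the set of all these representatives. Since we have k · ⌈log(n) + 3⌉ sets W_ij, each of which is covered by at most (16d/γ)^(d/2) small grid cells, there are at most

$$\left(\frac{16d}{\gamma}\right)^{d/2} \cdot k \cdot \lceil \log(n) + 3 \rceil$$

grid cells. Since this number is smaller than or equal to m, we obtain |G| ≤ m.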

Let g_p denote the representative of p ∈ P in G. Then, we have

$$\operatorname{Means}^*_m(P) \le \operatorname{Means}^*_{|G|}(P) \le \operatorname{Means}(P, G) \le \sum_{p \in P} D^2(p, g_p) . \tag{6.1}$$

For each point p ∈ P, the distance from its representative g_p is upper bounded by the diagonal of the grid cell that contains p. Thus, for each p ∈ W_i0, we have

$$D^2(p, g_p) \le \left(\sqrt{d} \cdot r'_0\right)^2 \le \frac{\gamma R}{9} . \tag{6.2}$$


Furthermore, for each p ∈ W_ij with j ≥ 1, we know that c_i is the center of V_i,j−1 and p is not contained in V_i,j−1. It follows that

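$$D^2(p, C) \ge \left(\frac{r_{j-1}}{2}\right)^2 \ge 2^{j-3} R .$$

Hence, in this case, we get

$$D^2(p, g_p) \le \left(\sqrt{d} \cdot r'_j\right)^2 = \frac{\gamma}{9} \cdot 2^j R \le \frac{8\gamma}{9} \cdot D^2(p, C) . \tag{6.3}$$

Due to Inequalities (6.1)–(6.3) and the definition of R, we obtain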

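$$\operatorname{Means}^*_m(P) \le \sum_{p \in P} D^2(p, g_p) \le n \cdot \frac{\gamma R}{9} + \frac{8\gamma}{9} \sum_{p \in P} D^2(p, C) = \frac{\gamma}{9} \cdot \operatorname{Means}^*_k(P) + \frac{8\gamma}{9} \cdot \operatorname{Means}^*_k(P) = \gamma \cdot \operatorname{Means}^*_k(P) .$$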

Now, we go back to our coreset construction. Given the point set P and a parameter m ∈ N, let S be our weighted coreset chosen at random according to D² from P by Algorithm 6.2.1. Furthermore, let C be an arbitrary set of k centers. For each point p ∈ P, we denote by q_p the point from S whose weight has been increased by 1 due to p in line 10 of Algorithm 6.2.1, i.e., q_p is a point from S closest to p. Then, the difference between the cost of clustering P and the cost of clustering S is at most

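$$\left| \operatorname{Means}(P, C) - \operatorname{Means}(S, C) \right| = \Bigl| \sum_{p \in P} \bigl( D^2(p, C) - D^2(q_p, C) \bigr) \Bigr| \le \sum_{p \in P} \left| D^2(p, C) - D^2(q_p, C) \right| .$$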

We partition P into two subsets P_near and P_dist. Roughly speaking, the set P_near contains each point p ∈ P whose distance from its coreset point q_p is small compared to the distance from its nearest center in C. More precisely, for any constant ε with 0 < ε ≤ 1, we define

P_near := {p ∈ P | D(p, q_p) ≤ ε D(p, C)} .

The set P_dist contains all the other points from P, i.e.,

P_dist := {p ∈ P | D(p, q_p) > ε D(p, C)} .
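The partition into P_near and P_dist translates directly into code. The following is a small sketch under the definitions above; the function name `split_near_dist` is hypothetical, and the coreset point q_p of p is recomputed as the point of the coreset nearest to p.

```python
import math


def split_near_dist(points, coreset, centers, eps):
    """Split `points` into (P_near, P_dist) for the given coreset and centers."""
    near, dist = [], []
    for p in points:
        d_q = min(math.dist(p, q) for q in coreset)  # D(p, q_p)
        d_c = min(math.dist(p, c) for c in centers)  # D(p, C)
        if d_q <= eps * d_c:
            near.append(p)
        else:
            dist.append(p)
    return near, dist
```

Note that the partition depends on the center set C, so it is a tool of the analysis only and is never computed by Algorithm 6.2.1 itself.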

First, in Claim 6.2.2, we estimate the error of the clustering cost that occurs for any point in P_near. Then, in Claim 6.2.3, we give an estimation of the error for any point in P_dist.

Claim 6.2.2. If p ∈ P_near, then

|D²(p, C) − D²(q_p, C)| ≤ 3ε D²(p, C) .

Proof. For the moment, let us assume that D(p, C) ≤ D(q_p, C). Let c_p denote the element from C closest to p. By triangle inequality and the definition of P_near, we have

D(q_p, C) ≤ D(q_p, c_p)

≤ D(p, c_p) + D(p, q_p)

≤ (1 + ε) · D(p, C) .

Hence, for the squared distances, we obtain

D²(q_p, C) ≤ (1 + ε)²· D²(p, C)

≤ (1 + 3ε) · D²(p, C) .

Thus, we get D²(q_p, C) − D²(p, C) ≤ 3ε D²(p, C), which proves the claim in the case D(p, C) ≤ D(q_p, C).

Now, assume that D(p, C) > D(q_p, C). Let c_s denote the element from C closest to q_p. Again, by triangle inequality and the definition of P_near, we have

D(p, C) ≤ D(p, c_s)

≤ D(q_p, c_s) + D(p, q_p)

≤ D(q_p, C) + ε D(p, C) .

It follows that (1 − ε) · D(p, C) ≤ D(q_p, C). For the squared distances, we obtain

D²(q_p, C) ≥ (1 − ε)² · D²(p, C)

> (1 − 2ε) · D²(p, C) .

Hence, we get

D²(p, C) − D²(q_p, C) < 2ε D²(p, C)

< 3ε D²(p, C) ,

which proves the claim in the case D(p, C) > D(q_p, C).
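Claim 6.2.2 can also be checked numerically on concrete inputs. The sketch below is an illustration only; the function name `claim_near_holds` and the small floating-point tolerance are assumptions of this example. It evaluates both sides of the inequality for a point p, a candidate coreset point q with D(p, q) ≤ ε D(p, C), and a center set C.

```python
import math
import random


def claim_near_holds(p, q, centers, eps):
    """Check |D^2(p, C) - D^2(q, C)| <= 3 * eps * D^2(p, C) for a point p
    whose coreset point q satisfies D(p, q) <= eps * D(p, C)."""
    d_p = min(math.dist(p, c) for c in centers)  # D(p, C)
    d_q = min(math.dist(q, c) for c in centers)  # D(q, C)
    if math.dist(p, q) > eps * d_p:
        raise ValueError("p is not in P_near with respect to q")
    # The tolerance only absorbs floating-point rounding.
    return abs(d_p ** 2 - d_q ** 2) <= 3 * eps * d_p ** 2 + 1e-12
```

Such a check does not replace the proof, but it is a cheap safeguard when experimenting with other constants than 3ε.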

Claim 6.2.3. If p ∈ P_dist, then

|D²(p, C) − D²(q_p, C)| ≤ (3/ε) · D²(p, q_p) .


Proof. Let c_p denote the element from C closest to p, and let c_s denote the element from C closest to q_p. By triangle inequality, we have

D(p, C) ≤ D(p, c_s)
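One way to arrive at the bound of Claim 6.2.3 is sketched below. It is an illustrative derivation under the definitions above and not necessarily the argument of the original proof. It uses the triangle inequality in both directions, D(p, C) ≤ D(q_p, C) + D(p, q_p) and D(q_p, C) ≤ D(p, C) + D(p, q_p), together with D(p, C) < D(p, q_p)/ε for p ∈ P_dist and ε ≤ 1.

```latex
\begin{align*}
\bigl| D(p, C) - D(q_p, C) \bigr| &\le D(p, q_p), \\
D(p, C) + D(q_p, C) &\le 2\, D(p, C) + D(p, q_p)
  < \Bigl( \tfrac{2}{\varepsilon} + 1 \Bigr) D(p, q_p)
  \le \tfrac{3}{\varepsilon}\, D(p, q_p), \\
\bigl| D^2(p, C) - D^2(q_p, C) \bigr|
  &= \bigl| D(p, C) - D(q_p, C) \bigr| \cdot \bigl( D(p, C) + D(q_p, C) \bigr)
  \le \tfrac{3}{\varepsilon}\, D^2(p, q_p).
\end{align*}
```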

Now, we can show our main result.

Theorem 6. Let k ∈N, let ε, 0 < ε ≤ 1, be a precision parameter, and let δ, 0 < δ < 1,

algorithm AdaptiveCoreset computes a weighted multiset S with size m that is a (k, 6ε)-coreset of P with probability at least 1 − δ.
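The coreset property can be measured empirically for any concrete center set C: a weighted set S is a (k, 6ε)-coreset if |Means(P, C) − Means(S, C)| ≤ 6ε · Means(P, C) holds for every set C of k centers. The following sketch computes the relative error for one given C. The function names `cost` and `relative_error` are hypothetical, and returning 0 when Means(P, C) = 0 is a convention of this example.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def cost(points, centers, weights=None):
    """Weighted k-means cost: sum over all points of w(p) * D^2(p, C)."""
    if weights is None:
        weights = [1] * len(points)
    return sum(w * min(sq_dist(p, c) for c in centers)
               for p, w in zip(points, weights))


def relative_error(points, S, w, centers):
    """|Means(P, C) - Means(S, C)| / Means(P, C) for one center set C."""
    full = cost(points, centers)
    if full == 0:
        return 0.0  # convention of this sketch for the degenerate case
    return abs(full - cost(S, centers, w)) / full
```

A single center set can only refute the coreset property, never confirm it, since the guarantee has to hold for all center sets simultaneously.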

Proof. Due to Claims 6.2.2 and 6.2.3, we have

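$$\left| \operatorname{Means}(P, C) - \operatorname{Means}(S, C) \right| \le \sum_{p \in P_{\mathrm{near}}} 3\varepsilon\, D^2(p, C) + \sum_{p \in P_{\mathrm{dist}}} \frac{3}{\varepsilon}\, D^2(p, q_p) \le 3\varepsilon \cdot \operatorname{Means}(P, C) + \frac{3}{\varepsilon} \cdot \operatorname{Means}(P, S) .$$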

Due to Lemma 6.1.3 and Markov's inequality, we obtain

$$\operatorname{Means}(P, S) \le \frac{8}{\delta} \left( 2 + \ln(m) \right) \cdot \operatorname{Means}^*_m(P)$$

with probability at least 1 − δ. Hence, by using Lemma 6.2.1 with

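$$\gamma := \frac{\varepsilon^2 \delta}{8 \left( 2 + \ln(m) \right)} ,$$

we get Means(P, S) ≤ ε² · Means^∗_k(P) ≤ ε² · Means(P, C)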

and, thus, |Means(P, C) − Means(S, C)| ≤ 6ε · Means(P, C) with probability at least 1 − δ, provided that the coreset size m satisfies the condition

$$m \ge \left(\frac{16d}{\gamma}\right)^{d/2} \cdot k \cdot \lceil \log(n) + 3 \rceil . \tag{6.4}$$

Hence, the only thing left to do is to prove that there exists a coreset size

m ∈ d that satisfies Inequality (6.4). Since we can assume that n ≥ 16 and m ≥ 8, Inequality (6.4) is satisfied if we have

$$\frac{m}{\log^{d/2}(m)} \ge \frac{2 \cdot 16^d\, d^{d/2}\, k \log(n)}{\delta^{d/2}\, \varepsilon^d} . \tag{6.6}$$

We conclude that Condition (6.5) and Inequality (6.6) are both satisfied for a choice of m = (2d)^d/2·2 · 16^dd^d/2k log(n)

In document Approximation Techniques for Facility Location and Their Applications in Metric Embeddings (Page 118-125)