In this section, we first present the LMP primal-dual algorithm that we use for the facility location problem. This algorithm is a simpler version of the algorithm that we presented in Chapter 3. Then, we explain how we use it to design a fast and provably good initialization algorithm for k-means.
4.3.1 Primal-Dual Algorithm
The primal-dual algorithm takes as input a set of points C, a set of potential centers F, and an opening cost λ, and outputs a set of centers S ⊆ F as its solution. The algorithm contains two main phases, the dual-growth phase and the opening phase. The first phase increases the α-values of the points in C, while keeping the dual constraint (4.2.6) satisfied. Therefore, it finds a feasible solution to DUAL(λ). The second phase finds the centers S using the dual solution that we found in the first phase. These two phases are explained in detail below:
Dual-growth phase: We set all α-values to zero and initialize a set of active points A to C. Then we increase the α-values corresponding to the points in A at the same speed. We remove points from A, whenever one of the following events occurs, and terminate once A = ∅:
• Event 1: The constraint (4.2.6) becomes tight for some center i ∈ F (i.e., $\sum_{j} [\alpha_j - d(i, j)^2]^+ = \lambda$). In this case, we call this center tight. We also remove from A all the active points j for which $\alpha_j \ge d(i, j)^2$.
• Event 2: For some active point j ∈ A, there exists a tight center i ∈ F such that $\alpha_j = d(i, j)^2$. In this case, we remove j from A.
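The dual-growth phase can be sketched as a direct simulation of the two events. The following is a minimal, unoptimized sketch and not the heap-based implementation discussed later: it rescans all centers at every event. The function names `dual_growth` and `_tight_time`, the input layout `d2[i][j]` (squared distance between potential center i and point j), and the numerical tolerance are assumptions made for illustration.

```python
import math

def _tight_time(active_d2, remaining, theta):
    """Smallest t >= theta with sum(max(t - d, 0) for d in active_d2) == remaining."""
    if remaining <= 1e-12:
        return theta  # the constraint is already tight at the current time
    ds = sorted(active_d2)
    prefix = 0.0
    for k, d in enumerate(ds, start=1):
        prefix += d
        t = (remaining + prefix) / k  # solves k * t - prefix == remaining
        if t >= d and (k == len(ds) or t <= ds[k]):
            return t
    return math.inf

def dual_growth(d2, lam):
    """Return (alpha, tight); tight lists centers in the order they became tight.

    Assumes at least one potential center and lam >= 0.
    """
    n_centers, n_points = len(d2), len(d2[0])
    alpha = [0.0] * n_points
    active = set(range(n_points))
    tight = []
    theta = 0.0  # common alpha-value of all points that are still active
    while active:
        # Event 1 candidate: earliest time at which a non-tight center becomes tight.
        t1, c1 = math.inf, None
        for i in range(n_centers):
            if i in tight:
                continue
            frozen = sum(max(alpha[j] - d2[i][j], 0.0)
                         for j in range(n_points) if j not in active)
            t = _tight_time([d2[i][j] for j in active], lam - frozen, theta)
            if t < t1:
                t1, c1 = t, i
        # Event 2 candidate: earliest time an active point reaches a tight center.
        t2, j2 = min(((d2[i][j], j) for i in tight for j in active),
                     default=(math.inf, None))
        if t2 <= t1:
            theta = t2
            alpha[j2] = theta
            active.remove(j2)
        else:
            theta = t1
            tight.append(c1)
            for j in [j for j in active if theta >= d2[c1][j]]:
                alpha[j] = theta
                active.remove(j)
    return alpha, tight
```

Because all active points grow at the same speed from zero, a single value `theta` represents the α-value of every active point, and a point's α-value is frozen at the moment it leaves A.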
Opening phase: Let $N(i) = \{j \mid \alpha_j > d(i, j)^2\}$ and $t_i = \max_{j \in N(i)} \alpha_j$ for each tight center i. We construct a graph G with a vertex for each tight potential center i and an edge between two centers $i_1, i_2$ if and only if $d(i_1, i_2)^2 \le \delta \min(t_{i_1}, t_{i_2})$, where δ is a constant that we fix later. Then we greedily find a maximal independent set of G and return it as the set of centers S.
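The opening phase can be illustrated with a short sketch. The function name `opening_phase`, the input layout (`d2_cp[i][j]` for center-to-point and `d2_cc[a][b]` for center-to-center squared distances), and the choice to scan the tight centers in the order given are assumptions for illustration; the text only requires that the independent set be maximal.

```python
def opening_phase(d2_cp, d2_cc, alpha, tight, delta):
    """Greedily pick a maximal independent set of the conflict graph G."""
    # t_i = max alpha_j over N(i) = {j : alpha_j > d(i, j)^2}.
    t = {}
    for i in tight:
        neighborhood = [alpha[j] for j in range(len(alpha))
                        if alpha[j] > d2_cp[i][j]]
        t[i] = max(neighborhood, default=0.0)
    opened = []
    for i in tight:
        # i and s are adjacent in G iff d(i, s)^2 <= delta * min(t_i, t_s);
        # open i only if it conflicts with no center opened so far.
        if all(d2_cc[i][s] > delta * min(t[i], t[s]) for s in opened):
            opened.append(i)
    return opened
```

Every center that is skipped is adjacent to an already opened center, so the returned set is independent and maximal. The loop inspects each pair of tight centers at most once, without ever intersecting the sets N(i).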
The dual-growth phase of the primal-dual algorithm is the same for both Jain et al. [39] and our result in the previous chapters. The main difference is in the opening phase. Jain and Vazirani's algorithm adds an edge in G between two tight centers $i_1, i_2$ if $N(i_1) \cap N(i_2) \ne \emptyset$. One can show that this algorithm guarantees an LMP 9-approximation. In the previous chapter, we add an edge between two tight centers if $N(i_1) \cap N(i_2) \ne \emptyset$ and $d(i_1, i_2)^2 \le \delta \min(t_{i_1}, t_{i_2})$, and we show that choosing δ = 2.314 results in an LMP 6.357-approximation. Here, we observe that the analysis in the previous chapter remains valid even if the condition $N(i_1) \cap N(i_2) \ne \emptyset$ is dropped. This enables us to overcome one of the issues in implementing a fast version of the primal-dual algorithm, as it saves a factor of n in the construction of G during the opening phase, as described below in more detail.
There are two other issues to be addressed when translating the LMP primal-dual algorithm for facility location into an algorithm for k-means. First, obtaining a solution of size exactly k requires computationally expensive methods in theory to avoid pathological cases. Here, we employ a more efficient approach that works provably well in theory for well-clusterable instances and also performs well in a variety of experiments. Second, the algorithms operate on a discrete set of potential centers. Transferring a continuous instance to a discrete one while minimizing the effect on the desired approximation guarantee might result in a large set F of potential centers.
4.3.2 Runtime Improvements
In this section, we describe our ideas for translating the LMP primal-dual algorithm into a fast and provably good primal-dual algorithm (FastPD) for the k-means problem.
Dual-growth phase and opening phase. In the dual-growth phase, we maintain a min-heap of possible events, which allows us to efficiently find the next event. As there are at most $O(n|F|)$ events, the total running time for this part of the algorithm is $O(n|F| \log(n|F|))$. While processing events, we can also easily compute the t-values for each center that becomes tight. Using these t-values computed during the dual-growth phase, we can construct the graph G in time $O(|F|^2)$, in contrast to the time $\Omega(n|F|^2)$ required by the primal-dual approaches of Jain and Vazirani. Combining the above observations, the total running time of our primal-dual algorithm for any given price λ is $O(n|F| \log(n|F|) + |F|^2)$.
Opening exactly k centers. In order to use the LMP algorithm, we must find a price λ so that the opening phase results in a set of exactly k centers. To accomplish this, we run a binary search: if for some value λ we find a set of centers of size less than k, then we decrease λ. Similarly, if we find a set of centers of size greater than k, we increase λ. The maximum value of λ that we need to consider is O(n∆), so this process takes at most O(log(n∆)) iterations, where ∆ is the maximum distance between a point and a potential center, which without loss of generality we can assume is poly(n). In general, this procedure might not find a set of exactly k centers, since the number of centers returned by the opening phase is not a smooth function of λ. However, whenever our procedure does find a set of k centers via binary search, the LMP factor 6.357 guarantee holds with respect to its output. Moreover, in the next section, we formally prove that it always finds the correct k for a family of well-clusterable instances. Additionally, we observed in our experiments that on practical instances, binary search almost always produces a solution of size k. In the rare case that it does not, we arbitrarily add or remove some centers from our solution to obtain exactly k centers. Overall, our binary search ensures that the number of times that we run the primal-dual algorithm is O(log n). Therefore, we obtain a total running time of $O(n|F| \log^2 n + |F|^2 \log n)$.
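The binary search over λ can be sketched as follows. The callback `num_opened(lam)`, which stands for one run of the primal-dual algorithm returning the number of opened centers, as well as the names `search_lambda`, `lam_max`, and the iteration cap, are hypothetical and introduced only for illustration.

```python
def search_lambda(num_opened, k, lam_max, max_iters=60):
    """Binary search for an opening cost at which exactly k centers are opened.

    Returns such a value of lambda, or None if none is found, which can happen
    because the number of opened centers is not a smooth function of lambda.
    """
    lo, hi = 0.0, lam_max
    for _ in range(max_iters):
        lam = (lo + hi) / 2.0
        m = num_opened(lam)
        if m == k:
            return lam
        if m < k:
            hi = lam  # too few centers: decrease the opening cost
        else:
            lo = lam  # too many centers: increase the opening cost
    return None
```

When the search returns `None`, the fallback described above applies: centers are added or removed arbitrarily until exactly k remain.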
Constructing a set F of potential centers. The previous guarantees all hold for the discrete k-means problem in which centers may only be chosen from a discrete set F. We now consider the problem of obtaining a small such set F so that our approximation guarantees can be translated to the original, continuous k-means problem. To balance approximation performance and running time, we simply set F = C. That is, we place one potential center at each data point. This results in a loss of at most a factor of 2 compared to the optimal solution that may select centers anywhere in the underlying space $\mathbb{R}^d$. To see this, consider any cluster Q in any solution to the given k-means instance. Let q be the centroid of this cluster, i.e., $q = \frac{1}{|Q|} \sum_{x \in Q} x$ (each coordinate of q is the average of the coordinates of the points in Q). Moreover, let $\mathrm{cost}(Q) = \sum_{x \in Q} d(x, q)^2$, which is the actual cost of the cluster Q. We know that for any point y:
$$\sum_{x \in Q} d(x, y)^2 \le \sum_{x \in Q} d(x, q)^2 + |Q| \, d(y, q)^2.$$
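Let y be the point in Q closest to q. Then $d(y, q)^2 \le d(x, q)^2$ for every x ∈ Q, and hence $|Q| \, d(y, q)^2 \le \sum_{x \in Q} d(x, q)^2$. We get
$$\sum_{x \in Q} d(x, y)^2 \le 2 \sum_{x \in Q} d(x, q)^2 = 2\,\mathrm{cost}(Q).$$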
Therefore, by choosing in each cluster the point closest to the centroid as its center, the total cost increases by at most a factor of 2. We conclude that setting F = C also increases the cost by at most a factor of 2.
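The factor-2 argument can be checked numerically on a single cluster. The sketch below uses synthetic Gaussian data and helper names (`sq_dist`, `centroid`, `cost`, `closest_data_point`) chosen for illustration; it is a sanity check of the bound, not part of the algorithm.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((u - v) ** 2 for u, v in zip(a, b))

def centroid(Q):
    """Coordinate-wise average of the points in Q."""
    return tuple(sum(x[c] for x in Q) / len(Q) for c in range(len(Q[0])))

def cost(Q, center):
    """Sum of squared distances from the points of Q to the given center."""
    return sum(sq_dist(x, center) for x in Q)

def closest_data_point(Q):
    """The point of Q closest to the centroid of Q."""
    q = centroid(Q)
    return min(Q, key=lambda x: sq_dist(x, q))

random.seed(0)
Q = [tuple(random.gauss(0.0, 1.0) for _ in range(3)) for _ in range(50)]
q = centroid(Q)
y = closest_data_point(Q)
ratio = cost(Q, y) / cost(Q, q)  # at most 2 by the argument above
```

For the centroid, the first inequality in fact holds with equality, so `cost(Q, y)` agrees with `cost(Q, q) + len(Q) * sq_dist(y, q)` up to rounding error.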
By putting the above ideas together we get our fast primal-dual algorithm for k-means that we call FastPD. In summary, FastPD takes as input an instance (C, k) of k-means, sets the potential centers F equal to the set of clients C, then runs a binary search on the opening cost λ in order to find a set S of k opened centers.
For larger instances, we proceed similarly, except that we first sparsify the instance to reduce |C|. There is a long series of works investigating such techniques, which transform an instance (C, k) of k-means into an instance (C′, k) such that |C′| ≪ |C|. In this work, we use a sampling approach based on Feldman et al. [28], which runs in time O(nkd), to obtain an instance with $|C'| = O(k^2/\varepsilon)$. We then run our algorithm FastPD on the resulting instance (C′, k). Feldman et al. [28] show that, for any constant β, any β-approximate solution for (C′, k) that opens a subset of clients as centers is also a β(1 + ε)-approximate solution for the original instance (C, k).
Combining the loss from sparsification with the loss due to selecting only (sparsified) data points as centers, the overall approximation guarantee that we get is (6.357 · 2)(1 + ε). Moreover, the overall running time of our algorithm is O(nkd + poly(k, log n)). We summarize this in the following theorem:
Theorem 4.3.1. For any 0 < ε < 1, the FastPD algorithm with sparsification runs in time O(nkd + poly(k/ε, log n)). Moreover, any solution of size k that it finds is a (12.714 + ε)-approximate solution.
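As a brief check of the constant in Theorem 4.3.1, the three losses multiply: the LMP factor 6.357, the factor 2 from restricting centers to data points, and the sparsification loss. Here ε′ denotes the parameter passed to the sparsification step; this symbol is introduced only for this calculation.

```latex
\[
  (6.357 \cdot 2)(1 + \varepsilon') = 12.714 + 12.714\,\varepsilon'
  \qquad\Longrightarrow\qquad
  \varepsilon' = \frac{\varepsilon}{12.714}
  \ \text{ gives } \
  12.714 + \varepsilon .
\]
```

Rescaling the parameter in this way changes 1/ε′ only by a constant factor, so the running time remains of the form O(nkd + poly(k/ε, log n)).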