A Note On Spectral Clustering

(1)

Pavel Kolev

1

and Kurt Mehlhorn

2

1 Max-Planck-Institut für Informatik, Saarbrücken, Germany [email protected]

2 Max-Planck-Institut für Informatik, Saarbrücken, Germany [email protected]

Abstract

Spectral clustering is a popular and successful approach for partitioning the nodes of a graph into clusters for which the ratio of outside connections compared to the volume (sum of degrees) is small. In order to partition intokclusters, one first computes an approximation of the bottom k eigenvectors of the (normalized) Laplacian of G, uses it to embed the vertices of G into k -dimensional Euclidean spaceRk, and then partitions the resulting points via ak-means clustering algorithm. It is an important task for theory to explain the success of spectral clustering.

Peng et al. (COLT, 2015) made an important step in this direction. They showed that spectral clustering provably works if the gap between the (k+ 1)-th and thek-th eigenvalue of the normalized Laplacian is sufficiently large. They proved a structural and an algorithmic result. The algorithmic result needs a considerably stronger gap assumption and does not analyze the standard spectral clustering paradigm; it replaces spectral embedding by heat kernel embedding andk-means clustering by locality sensitive hashing.

We extend their work in two directions. Structurally, we improve the quality guarantee for spectral clustering by a factor ofkand simultaneously weaken the gap assumption. Algorithmic-ally, we show that the standard paradigm for spectral clustering works. Moreover, it even works with the same gap assumption as required for the structural result.

1998 ACM Subject Classification G.2.2 [Discrete Mathematics] Graph Theory, Graph Algorithms, G.3 [Probability and Statistics] Probabilistic Algorithms (including Monte Carlo)

Keywords and phrases spectral embedding,k-means clustering, power method, gap assumption

Digital Object Identifier 10.4230/LIPIcs.ESA.2016.57

1 Introduction

A cluster in an undirected graph G = (V, E) is a set S of nodes whose volume is large compared to the number of outside connections. Formally, we define the conductance of S by φ(S) = E(S, S)

/µ(S), where µ(S) =P_v∈Sdeg(v) is thevolume of S. The k-way partitioning problem for graphs asks to partition the vertices of a graph such that the conductance of each block of the partition is small (formal definition below). This problem arises in many applications, e.g., image segmentation and exploratory data analysis. We refer to the survey [10] for additional information. A popular and very successful approach to clustering [4, 8, 10] isspectral clustering. One first computes an approximation of the bottomkeigenvectors of the (normalized) Laplacian of G, uses it to embed the vertices of G intok-dimensional Euclidean spaceRk, and then partitions the resulting points via a

∗ _{The full version of the paper is available at}_{http://arxiv.org/abs/1509.09188}_.

† _{This work has been funded by the Cluster of Excellence “Multimodal Computing and Interaction"}

within the Excellence Initiative of the German Federal Government.

(2)

k-means clustering algorithm. It is an important task for theory to explain the success of spectral clustering. Recently, Peng et al. [7] made an important step in this direction. They showed that spectral clustering provably works if the (k+ 1)-th and thek-th eigenvalue of the normalized Laplacian differ sufficiently. In order to explain their result, we need some notation.

LetLG=I−D−1/2AD−1/2 be the normalized Laplacian matrix ofG, where Dis the diagonal degree matrix andAis the adjacency matrix, and letfj ∈RV be the eigenvector corresponding to thej-th smallest eigenvalue λj of LG. The spectral embedding map F :

V →Rk is defined by

F(u) = √1

du

(f1(u), . . . , fk(u))

T

, for all verticesu∈V. (1) Peng et al. [7] construct ak-means instanceXV by insertingdu many copies of the vector

F(u) intoXV, for every vertexu∈V.

LetX be a set of vectors of the same dimension. Then

4k(X), min partition (X1,...,Xk) ofX k X i=1 X x∈Xi kx−cik2, where ci= 1 |Xi| X x∈Xi x,

is the optimal cost of clusteringX intoksets. Anα-approximate clustering algorithm returns ak-way partition (A1, . . . , Ak) and centers c1, . . . , ck such that

Cost({Ai, ci}ki=1), k X i=1 X x∈Ai kx−cik 2 6α· 4k(X). (2)

Theorderk conductance constantρ(k) is a well studied worst case guarantee fork-way partitioning that is defined by

ρ(k) = min

disjoint nonemptyZ1,...,Zk

Φ(Z1, . . . , Zk), where Φ(Z1, . . . , Zk) = max i∈[1:k]

φ(Zi). (3) Lee et al. [3] connectedρ(k) and thek-th smallest eigenvalue of the normalized Laplacian matrixLG through the relation, also known as higher order Cheeger inequality,

λk/26ρ(k)6O(k2) p

λk. (4)

In this work, we focus attention on theorder kpartition constant ρ_b(k) ofG, defined by b ρ(k)_, min partition (P1,...,Pk) ofV Φ(P1, . . . Pk), where Φ(Z1, . . . , Zk) = max i∈[1:k] φ(Zi). In a consecutive work, inspired by partitioning graphs into expanders, Oveis Gharan and Trevisan [6] proved the following relation

ρ(k)6bρ(k)6kρ(k). (5)

We are now ready to state the main structural result by Peng et al [7].

ITheorem 1.1([7, Theorem1.2]). Let k_>3 and (P1, . . . , Pk) be ak-way partition of V withΦ(P1, . . . , Pk) =ρb(k). Let Gbe a graph that satisfies the gap assumption

1 δ= 2·10 5_·_k3 Υ ∈(0,1/2], where Υ, λk+1 b ρ(k). (6) 1 _{Note that}_λ

k/26ρ_b(k), see (4,5). Thus the assumption impliesλk/26ρ_b(k) =δλk+1/(2·105·k3), i.e.,

(3)

Let(A1, . . . , Ak)be thek-way partition2ofV returned by anα-approximatek-means algorithm applied toXV. Then the following statements hold (after suitable renumbering of one of the partitions):

1. µ(Ai4Pi)6αδ·µ(Pi), and

2. φ(Ai)6(1 + 2αδ)·φ(Pi) + 2αδ,

where the symmetric differenceAi4Pi= (Ai\Pi)∪(Pi\Ai).

Under the stronger gap assumptionδ= 2·105·k5_/_Υ_∈₍₀_,₁_/_{2], they showed how to obtain}

a partition in timeO(m·poly log(n)) with essentially the guarantees stated in Theorem 1.1, wherem=|E|is the number of edges inGandn=|V| is the number of nodes.

However, their algorithmic result does not analyze the standard spectral clustering paradigm, since it replaces spectral embedding by heat kernel embedding and k-means clustering by locality sensitive hashing. Therefore, their algorithmic result does not explain the success of the standard spectral clustering paradigm.

Our Results

We strengthen the approximation guarantees in Theorem 1.1 by a factor of k and simul-taneously weaken the gap assumption. As a consequence, the variant of Lloyd’sk-means algorithm analyzed by Ostrovsky et al. [5] applied to3 XfV achieves the improved approxim-ation guarantees in timeO(m(k2₊ lnn

λk+1)) with constant probability. Table 1 summarizes

these results.

LetObe the set of allk-way partitions (P1, . . . , Pk) with Φ(P1, . . . , Pk) =ρ_b(k), i.e., the set of all partitions that achieve the orderk partition constant. Let

b ρavr(k), min (P1,...,Pk)∈O 1 k k X i=1 φ(Pi)

be theminimal average conductanceover allk-way partitions inO. The minimal average conductance can be considerably smaller than the orderkpartition constant. Consider a graph consisting of one clique of sizeSk=f(n) =o(n/k3/2),k−1 cliques of size (n−f(n))/(k−1) each, andkadditional edges that connect the cliques in the form of a ring. Thenφ(Si)≈k2/n2 for 1₆i₆k−1 andφ(Sk)≈1/f(n)2. Thus ρ_b(k) = maxiφ(Si) =φ(Sk)≈1/f(n)2 and b

ρavr(k) = (1/k)P16i6kφ(Sk)≈k2/n2+ (1/k)·(1/f(n)2)≈ρb(k)/k.

For the remainder of this paper we denote by (P1, . . . , Pk) ak-way partition ofV that achievesρ_bavr(k). In the full version of the paper, we give an analogous relation to (5) for

b

ρavr(k). We state now our main result.

ITheorem 1.2 (Main Theorem).

(a) (Existence of a Good Clustering) Let k _> 3. Let G be a graph satisfying the gap assumption δ=20 4_·_k3 Ψ ∈(0,1/2], where Ψ, λk+1 b ρavr(k) . (7)

Let(A1, . . . , Ak)be thek-way partition returned by anα-approximate clustering algorithm applied to the spectral embedding XV. Then for every i ∈ [1 : k] the following two statements hold (after suitable renumbering of one of the partitions):

2 _The_k_{-means algorithm returns a partition of}_X

V. One may assume w.l.o.g. that all copies ofF(u) are put into the same cluster ofXV. Thus the algorithm also partitionsV.

3

f

(4)

Table 1A comparison of the results in Peng et al. [7] and our results. The parameterδ∈(0,1/2] relates the approximation guarantees with the gap assumption.

Gap Assumption Partition Quality Running Time

Peng et al. [7] δ= 2·105·k3/Υ µ(Ai4Pi)6αδ·µ(Pi) φ(Ai)6(1 + 2αδ)φ(Pi) + 2αδ Existential result This paper δ= 204_·_k3_/_Ψ µ(Ai4Pi)6 αδ 103_k·µ(Pi) φ(Ai)6 1 +₁₀2αδ3_k φ(Pi) +₁₀2αδ3_k Existential result Peng et al. [7] δ= 2·105·k5_/ Υ µ(Ai4Pi)6 δlog 2_k k2 ·µ(Pi) φ(Ai)61 +2δlog_k22k φ(Pi) +2δlog_k22k O(m·poly log(n)) This paper δ= 204·k3/Ψ δ6k/109 4k(XV)>n−O(1) µ(Ai4Pi)6 ₁₀2δ3_k·µ(Pi) φ(Ai)6 1 + 4δ 103k φ(Pi) + 4δ 103k O m k2+_λlnn k+1 1. µ(Ai4Pi)6 ₁₀αδ3_k·µ(Pi), and 2. φ(Ai)6 1 + ₁₀2αδ3_k ·φ(Pi) +₁₀2αδ3_k.

(b) (An Efficient Algorithm) If in addition δ ₆k/109 _and4 ₄

k(XV)> n−O(1), then the variant of Lloyd’s algorithm analyzed by Ostrovsky et al. [5] applied toXfV returns in time O(m(k2+_λlnn

k+1))with constant probability a partition (A1, . . . , Ak)such that for everyi∈[1 :k] the following two statements hold (after suitable renumbering of one of the partitions):

3. µ(Ai4Pi)6 ₁₀2δ3_k·µ(Pi), and

4. φ(Ai)6 1 + ₁₀4δ3_k

·φ(Pi) +₁₀43δ_k.

Part (b) of Theorem 1.2 gives a theoretical support for the practical success of spec-tral clustering based on approximate specspec-tral embedding followed by k-means clustering. Moreover, ifk₆poly(log n) andλk+1 >poly(logn), our algorithm works in nearly linear

time. Previous papers [3, 7, 9] replacedk-means clustering by other techniques for their algorithmic results.

Thek-means algorithm in [5] is efficient only for inputsX for which some partition into kclusters is much better than any partition intok−1 clusters. The authors proved that the algorithm is efficient for inputsX satisfying4k(X)6ε24k−1(X) for someε∈(0, ε0],

whereε0= 6/107, stated that the result should also hold for a largerε0, and mentioned that

they did not attempt to maximizeε0. For the proof of part (b) of Theorem 1.2, we show in

Section 5 thatXfV satisfies this assumption. In this proof, we needδ6k·ε0/600 =k/109.

One of the reviewers suggested to include a numerical example. Consider a graph consisting ofkcliques of sizen/k each pluskadditional edges that connect the cliques in the form of a ring. Such a graph is about the easiest input for a clustering algorithm. Then b

ρavr(k) =ρb(k)≈(k/n)

2_{. For the gap assumption to hold we need}_λ

k+1>2·204·k3·ρbavr(k). Sinceλk+1 62, this implies n>400·k2.5. For smallk, this is a modest requirement on the

size of the graph.

4 _{The case}_4k(XV₎

6n−O(1)constitutes a trivial clustering problem. For technical reasons, we have to exclude too easy inputs.

(5)

For the algorithmic result, we need in addition δ₆k·ε0/600. For the gap condition to hold, we need 2_>λk+1 >(600/ε0k)·204·k3·(k2/n2) orn>4 √ 3·103_·_k2_/√_ε 0. For ε0= 6/107, this amounts ton>4 √

5·106_·_k2_{, a quite large lower bound on}_n_.

Our statement above that Part (b) of Theorem 1.2 gives a theoretical support for the practical success of spectral clustering based on approximate spectral embedding followed by k-means clustering therefore has to be taken with a grain of salt. It is only an asymptotic statement and does not explain the good behavior on small graphs.

2 Highlights of Our Technical Contribution

2.1 Exact Spectral Embedding – Notation

We use the notation adopted by Peng et al. [7]. Letfj∈RV be the eigenvector corresponding to thej-th smallest eigenvalue λj ofLG, and letgi=

D1/2_χ

Pi

kD1/2_χ

Pik

be the normalized indicator vector associated with thei-th optimal clusterPi⊂V.

Since the eigenvectors{fi}ni=1form an orthonormal basis ofR

n_{, each normalized indicator} vectorgi can be expressed asgi =

Pn j=1α

(i)

j fj, for all i∈[1 : k]. Itsprojection into the subspace spanned by the bottomk eigenvectors is given byfbi=

Pk j=1α

(i)

j fj. Peng et al. [7] proved that if the gap parameter Υ is large enough then span({fbi}ki=1) = span({fi}ki=1) and

hence the bottomkeigenvectors can be expressed by fi=Pk_j₌₁β

(i)

j fbj, for alli∈[1 :k]. We show that similar statements hold with substituted gap parameter Ψ.

A corner stone in the analysis of spectral clustering is to prove the existence of exactly k directions near which all spectrally embedded vectors are closely concentrated. These estimation centers are defined by

p(i)= _p 1 µ(Pi) β_i(1), . . . , β_i(k) T . (8)

Our analysis crucially relies on the isometric properties of the following square matrix. LetB∈_Rk×k _{be a matrix defined by}_B

j,i=β

(i)

j , for everyi, j∈[1 :k].

2.2 Exact Spectral Embedding – Structural Results

The proof of Theorem 1.2 (a) follows the proof-structure of [7, Theorem 1.2] in Peng et al., but improves upon it in essential ways.

Our key technical insight is that the matrices BBTandBTBare close to the identity matrix. The proof of Theorem 2.1 appears in the full version of the paper.

ITheorem 2.1 (MatrixBBT is Close to Identity Matrix). IfΨ>104·k3_/ε2 _and _ε_∈₍₀_,₁₎

then for all distincti, j∈[1 :k]it holds

1−ε₆hBi,:,Bi,:i61 +ε and |hBi,:,Bj,:i|6 √

ε.

Informally, Theorem 2.1 implies that each normalized indicator vectorgi is close to the corresponding eigenvectorfi. This gives a simple and intuitive explanation for the success of spectral clustering.

To see this, let Fk,Fbk∈Rk×k be matrices whosei-th column isfiand fbi, respectively. The projection matrixPk into thek-th principle subspace ofLG is given byPk=FkFTk and sinceFbkB=Fk, by Theorem 2.1 it follows thatFbkFbT_k ≈FkFTk. Therefore, each projected indicator vector satisfiesfbi ≈fi. This impliesα(i)≈χi and hence we havegi≈fi.

(6)

Formally, Theorem 2.1 allows us to improve the separation guarantee between any pair of estimation centers by a factor ofkover [7, Lemma 4.3], measured in terms of the Euclidean distance.

ILemma 2.2. If δ= 204_·_k3_/_Ψ_∈₍₀_,_1] _{then for every} _i_∈_{[1 :} _k_] _{it holds that} p(i) 2 ∈ h 1±√δ/4i_µ₍1_P i). Proof. By definitionp(i)₌ _√ 1 µ(Pi)

·Bi,: and Theorem 2.1 yieldskBi,:k 2

∈[1±√δ/4]. _J ILemma 2.3(Larger Distance Between Estimation Centers). Ifδ= 204·k3/Ψ∈(0,1/2] then for any distincti, j∈[1 :k] it holds thatp(i)−p(j)

2

>[2·min{µ(Pi), µ(Pj)}]−1.

Proof. Sincep(i)_{is a row of matrix}_B_{, Theorem 2.1 with} _ε₌√_δ/_{4 yields}

* p(i) p(i) , p (j) p(j) + = hBi,:,Bj,:i kBi,:k kBj,:k 6 √ ε 1−ε = 2δ1/4 3 . W.l.o.g. assume thatp(i)

2 >p(j) 2 , sayp(j) =α· p(i)

for some α∈(0,1]. Then by Lemma 2.2 we havep(i) 2 >(1−√δ/4)·[min{µ(Pi), µ(Pj)}] −1 , and hence p (i)₋_p(j) 2 = p (i) 2 + p (j) 2 −2 * p(i) p(i) , p (j) p(j) + p (i) p (j) > α2−4δ 1/4 3 ·α+ 1 p (i) 2 >[2·min{µ(Pi), µ(Pj)}] −1 . _J

The observation that Υ can be replaced by Ψ in all statements in [7] is technically easy. However, this is crucial for Theorem 1.2 (b), since it yields an improved version of [7, Lemma 4.5] showing that a weaker by a factor ofkassumption is sufficient. Due to space limitation, we defer the proof of Lemma 2.4 to the full version of the paper.

ILemma 2.4. Let(P1, . . . , Pk)and (A1, . . . , Ak) are partitions of the vector set. Suppose for every permutationπ: [1 :k]→[1 :k]there is an indexi∈[1 :k] such that

µ(Ai4Pπ(i))>

2ε

k ·µ(Pπ(i)), (9)

whereε∈(0,1) is a parameter. Ifδ= 204_·_k3_/_Ψ_∈₍₀_,₁_/_2]_and_ε

>64α·k3_/_Ψ_then

Cost({Ai, ci}ki=1)>

2k2

Ψ α.

With the above Lemmas in place, the proof of Theorem 1.2 (a) is then completed as in [7]. We give more details in Section 3.

Before we turn to Theorem 1.2 (b), we consider the variant of Lloyd’s algorithm analyzed by Ostrovsky et al. [5] applied to XV. This algorithm is efficient for inputsX satisfying: some partition intokclusters is much better than any partition intok−1 clusters.

ITheorem 2.5. [5, Theorem 4.15] Assuming that4k(X)6ε24k−1(X)forε∈(0,6/107],

there is an algorithm that returns a solution of cost at most [(1−ε2)/(1−37ε2)]4k(X)with probability at least1−O(√ε)in time O(nkd+k3_d₎_.

(7)

ITheorem 2.6(Normalized Spectral Embedding isε-separated). LetGbe a graph that satisfies the gap assumption δ = 204_·_k3_/_Ψ _∈₍₀_,₁_/_2] _and _δ

6k·ε/600, where ε = 6/107 _{is the}

Ostrovsky et al.’s constant. Then it holds

4k(XV)6ε24k−1(XV). (10)

However, Theorem 2.6 is insufficient for Theorem 1.2 (b), since we need a similar result for the setXfV formed by approximate eigenvectors. To overcome this issue we build upon the recent work by Boutsidis et al. [1] which shows that running an approximatek-means clustering algorithm on approximate eigenvectors obtained via the power method, yields an additive approximation to solving thek-means clustering problem on exact eigenvectors.

In order to state the connection, we need to introduce some of their notation.

2.3 Approximate Spectral Embedding – Notation

LetZ∈Rn×k be a matrix whose rows representnvectors that are to be partitioned intok clusters. For everyk-way partition we associate an indicator matrixX∈Rn×k that satisfies

Xij = 1/ p

|Cj|if thei-th rowZi,:belongs to thej-th clusterCj, andXij = 0 otherwise. We denote the optimal indicator matrixXopt by

Xopt= arg min

X∈Rn×k Z−XXTZ 2 F = arg_X∈min Rn×k k X j=1 X u∈Xj kZu,:−cjk2₂, (11)

wherecj = (1/|Xj|)P_u∈X_jZu,: is the center point of clusterCj.

The normalized Laplacian matrix LG ∈ Rn×n of a graph G is define byLG =I− A, whereA=D−1/2_AD−1/2 _{is the normalized adjacency matrix. Let}_U

k ∈Rn×k be a matrix composed of the bottomk orthonormal eigenvectors of LG corresponding to the smallest eigenvaluesλ1, . . . , λk. We define by Y ,Uk the canonical spectral embedding.

Our approximate spectral embedding is computed by the so called “Power method”. Let S ∈ Rn×k be a matrix whose entries are i.i.d. samples from the standard Gaussian distributionN(0,1) andpbe a positive integer. Then the approximate spectral embedding

e

Y is defined by the following process:

1) B_,I+A; 2) Let UeΣeVeTbe the SVD ofBpS; and 3) Ye ,Ue∈Rn×k. (12) We proceed by defining the normalized (approximate) spectral embedding. We construct a matrixY0 ∈Rm×k such that for every vertexu∈V we add deg(u) many copies of the normalized rowUk(u,:)/

p

deg(u) to Y0. Formally, the normalized (approximate) spectral embeddingY0 (fY0) is defined by Y0 =     1deg(1) Uk(1,:) √ deg(1) · · · 1deg(n) Uk(n,:) √ deg(n)     m×k and Yf0=      1deg(1) e U(1,:) √ deg(1) · · · 1deg(n) e U(n,:) √ deg(n)      m×k , (13)

where1deg(i) is all-one column vector with dimension deg(i).

Similarly to (11) we associate to Y0 (Yf0) an indicator matrix X0 (Xf0) that satisfies

X_ij0 = 1/pµ(Cj) if thei-th rowYi,0: belongs to thej-th clusterCj, andXij0 = 0 otherwise. We may assume w.l.o.g. that a k-means algorithm outputs an indicator matrixX0 such that all copies of rowUk(v,:)/

p

(8)

We associate to matricesY0 andfY0the sets of pointsXV andXfV respectively. We present now a key connection between the spectral embedding mapF(·), the optimalk-means cost

4k(XV) and matricesY0, Xopt0 :

Y 0₋_X0 opt X 0 opt T Y0 2 F = k X j=1 X v∈C? j deg(v)F(v)−c?_j 2 F =4k(XV), (14)

where each center satisfiesc?_j =µ(C_j?)−1_·P v∈C?

j deg(v)F(v) andF(v) =Yv,:/

p deg(v).

2.4 Approximate Spectral Embedding – Algorithmic Results

Our analysis relies on the proof techniques developed in [1, 2]. By adjusting these techniques (c.f. [1, Lemma 5] and [2, Lemma 7]) to our setting, we prove (in the full version of the paper) the following result for the symmetric positive semi-definite matrixBwhose largestk singular values (eigenvalues) correspond to the eigenvectorsu1, . . . , uk ofLG.

ILemma 2.7. Let UeΣeVeT be the SVD of BpS ∈ Rn×k, where p >1 and S is an n×k matrix of i.i.d. standard Gaussians. Letγk =2−λ₂_−λk+1

k <1 and fixδ, ∈(0,1). Then for any

p_>ln(8nk/δ)

ln(1/γk)with probability at least 1−2e−2n−3δ it holds UkU T k −UeUeT _F 6.

We establish several technical Lemmas that combined with Lemma 2.7 allow us to apply the proof techniques in [1, Theorem 6]. More precisely, we prove in Subsection 5.1 that running an approximatek-means algorithm on a normalized approximate spectral embedding f

Y0 _{computed by the power method, yields an approximate clustering of the normalized} spectral embeddingY0.

ITheorem 2.8. Compute matrixfY0 via the power method withp>ln(8nk/δ)

ln(1/γk), whereγk = (2−λk+1)/(2−λk)<1. Run on the rows of Yf0 an α-approximate k-means algorithm with failure probability δα. Let the outcome be a clustering indicator matrix

f

X0

α∈Rn×k. Then with probability at least1−2e−2n−3δp−δα it holds Y0−Xfα0 f X0 α T Y0 2 F 6(1 + 4ε)·α· Y 0₋_X0 opt Xopt0 T Y0 2 F+ 4ε 2_.

Our main technical contribution is to prove, in Subsection 5.2, that XfV satisfies the assumption of Ostrovsky et al. [5]. Our analysis builds upon Theorem 2.6 and Theorem 2.8. ITheorem 2.9(Approximate Normalized Spectral Embedding isε-separated). LetGbe a graph that satisfies the gap assumptionδ= 204_·_k3_/_Ψ_∈₍₀_,₁_/_2]_and_δ

6k·ε/600, whereε= 6/107

is the Ostrovsky et al.’s constant. If the optimum cost5

Y0−X_opt0 (X_opt0 )TY0

F >n −O(1)

and the matrix fY0 is constructed via the power method with p>Ω(_λlnn

k+1), then w.h.p it holds 4k f XV <5ε2_{· 4} k−1 f XV .

Based on the preceding results, we prove Theorem 1.2 (b) in Subsection 5.3.

5

Y0−Xopt0 (Xopt0 )TY0

F >n

(9)

3 The Proof of Part (a) of Theorem 1.2

The proof of part (a.1) builds upon the following Lemmas. Recall thatXV containsducopies ofF(u) for eachu∈V. W.l.o.g. we may restrict attention to clusterings ofXV that put all copies of F(u) into the same cluster and hence induce a clustering ofV. Let (A1, . . . , Ak) with cluster centersc1to ck be a clustering ofV. Itsk-means cost is

Cost({Ai, ci}ki=1) = k X i=1 X u∈Ai dukF(u)−cik2.

The proofs of Lemma 3.1 and Lemma 3.2 appear in the full version of the paper. ILemma 3.1((P1, . . . , Pk)is a goodk-means partition). IfΨ>4·k3/2then there are vectors

{p(i)}k

i=1 such that Cost({Pi, p(i)}ki=1)6(1 + 3k

Ψ)·

k2

Ψ.

I Lemma 3.2 (Only partitions close to (P1, . . . , Pk) are good). Under the hypothesis of Theorem 1.2, the following holds. If for every permutationσ: [1 :k]→[1 :k]there exists an index i∈[1 :k]such that

µ(Ai4Pσ(i))>

8αδ

104_k ·µ(Pσ(i)), then it holds Cost({Ai, ci}

k i=1)>

2αk2

Ψ .

We note that Lemma 3.2 follows directly by applying Lemma 2.4 with ε= 64α·k3/Ψ. Substituting these bounds into (2) yields a contradiction, since

2αk2 Ψ <Cost({Ai, ci} k i=1)6α· 4k(XV)6α·Cost({Pi, p(i)}ki=1)6 1 +3k Ψ ·αk 2 Ψ . Therefore, there exists a permutationπ(the identity after suitable renumbering of one of the partitions) such thatµ(Ai4Pi)< ₁₀8αδ4_k ·µ(Pi) for alli∈[1 :k].

Part (a.2) follows from part (a.1). Indeed, forδ0= 8δ/104 _{we have}

µ(Ai)>µ(Pi∩Ai) =µ(Pi)−µ(Pi\Ai)>µ(Pi)−µ(Ai4Pi)> 1−αδ 0 k ·µ(Pi)

Φ(Ai) = |E(Ai, Ai)| µ(Ai) 6 |E(Pi, Pi)|+αδ 0 k ·µ(Pi) (1−α·δ0 k )·µ(Pi) 6 1 +2αδ 0 k ·φ(Pi) + 2αδ0 k . This completes the proof of Theorem 1.2 (a).

4 The Normalized Spectral Embedding is

ε

-separated

In this section, we prove that the normalized spectral embeddingXV isε-separated. Proof of Theorem 2.6

We establish first a lower bound on4k−1(XV).

ILemma 4.1. Let Gbe a graph that satisfies the gap assumptionδ= 204_·_k3_/_Ψ_∈₍₀_,₁_/_2]_.

(10)

Before we prove Lemma 4.1 we show that it implies (10). By Lemma 3.1 we have4k(XV)6 2k2_/_{Ψ =} _δ0_/k_{. Then, we apply Lemma 4.1 with} _δ

6 k·ε/600, whereε = 6/107 _{is the}

Ostrovsky et al.’s constant, yielding

4k−1(XV)> 1 12− δ0 k = 1 12− 2 204 · δ k > 1010 9·25 · δ k = 1 ε2 · δ0 k > 1 ε2 · 4k(XV). Proof of Lemma 4.1

Let (P1, . . . , Pk) and (Z1, . . . , Zk−1) be partitions ofV. We define a mappingσ: [1 :k−1]7→

[1 :k] by σ(i) = arg max j∈[1:k] µ(Zi∩Pj) µ(Pj) , for everyi∈[1 :k−1].

We lower bound now the clusters overlapping in terms of the volume between anyk-way and (k−1)-way partitions ofV.

ILemma 4.2. Suppose(P1, . . . , Pk)and(Z1, . . . , Zk−1)are partitions of V. Then for any

index`∈[1 :k]\ {σ(1), . . . , σ(k−1)} (there is at least one such`) and for everyi∈[1 :k−1] it holds µ(Zi∩Pσ(i)), µ(Zi∩P`) >τi·min µ(P`), µ(Pσ(i)) , wherePk−_i₌₁1τi= 1 andτi>0.

Proof. By pigeonhole principle there is an index`∈[1 :k] such that` /∈ {σ(1), . . . , σ(k−1)}. Thus, for everyi∈[1 :k−1] we have σ(i)6=` and

µ(Zi∩Pσ(i))

µ(Pσ(i)) >

µ(Zi∩P`)

µ(P`) ,

τi,

wherePk−_i₌₁1τi= 1 andτi>0 for alli. Hence, the statement follows. J

Proof of Lemma 4.1. Let (Z1, . . . , Zk−1) be a (k−1)-way partition of V with centers

c0₁, . . . , c0_k−₁ that achieves4k−1(XV), and (P1, . . . , Pk) be ak-way partition ofV achieving b

ρavr(k). Our goal now is to lower bound the optimal (k−1)-means cost

4k−1(XV) = k−1 X i=1 k X j=1 X u∈Zi∩Pj dukF(u)−c0ik 2 . (15)

By Lemma 4.2 there is an index`∈[1 :k]\ {σ(1), . . . , σ(k−1)}. Fori∈[1 :k−1] let

pγ(i)= ( p` , if p`−c0_i > pσ(i)−c0_i ; pσ(i) _{, otherwise}_.

Then by combining Lemma 2.3 and Lemma 4.2, we have p γ(i)₋_c0 i 2 > 8·min µ(P`), µ(Pσ(i)) −1 and µ(Zi∩Pγ(i))>τi·minµ(P`), µ(Pσ(i)) , (16) wherePk−_i₌₁1τi= 1. We now lower bound the expression in (15). Since

kF(u)−c0_ik2_> 1 2 p γ(i)₋_c0 i 2 − F(u)−p γ(i) 2 ,

(11)

it follows forδ0= 2δ/204 that 4k−1(XV) = k−1 X i=1 k X j=1 X u∈Zi∩Pj dukF(u)−c0ik 2 > k−1 X i=1 X u∈Zi∩Pγ(i) dukF(u)−c0ik 2 > 1 2 k−1 X i=1 X u∈Zi∩Pγ(i) du p γ(i)₋_c0 i 2 − k−1 X i=1 X u∈Zi∩Pγ(i) du F(u)−p γ(i) 2 > 1₂ k−1 X i=1 µ(Zi∩Pγ(i)) 8·min µ(Pγ(i)), µ(Pσ(i)) − k X i=1 X u∈Pi du F(u)−pi 2 > ₁₆1 −δ 0 k,

where the last inequality holds due to (16) and Lemma 3.1. _J

5 An Efficient Spectral Clustering Algorithm

Here, we apply the proof techniques developed by Boutsidis et al. [1, 2] to our setting. More precisely, we prove that anyα-approximatek-means algorithm that runs on an approximate normalized spectral embeddingfY0 computed by the power method, yields an approximate clusteringXf_α0 of the normalized spectral embeddingY0.

Furthermore, we prove under our gap assumption that fY0 is ε-separated. This allows us to apply the variant of Lloyd’s k-means algorithm analyzed by Ostrovsky et al. [5] to efficiently computeXf_α0. Then we use Theorem 1.2 (a) to establish the desired statement.

This section is organized as follows. In Subsection 5.1, we prove Theorem 2.8. Then in Subsection 5.2, we present the proof of Theorem 2.9. Based on the results from the preceding two subsections, we prove Theorem 1.2 (b) in Subsection 5.3.

5.1 Proof of Theorem 2.8

Due to space limits, we defer the proofs of the next Lemmas to the full version of the paper. ILemma 5.1. X0X0T is a projection matrix.

ILemma 5.2. It holds that Y0TY0=Ik×k =fY0

T

f

Y0_.

ILemma 5.3. It holds that Y 0_Y0T₋ f Y0_f_Y0T _F = Y Y T₋ e YYeT _F.

ILemma 5.4. For any matrix U with orthonormal columns and every matrixA it holds U UT−AATU UT _F = U−AATU _F. (17)

Proof Sketch of Theorem 2.8. By combining Lemma 2.7 and Lemma 5.3, with probability at least 1−2e−2n−3δp we have Y 0_Y0T₋ f Y0 f Y0T _F = Y Y T₋ e YYeT _F 6ε. LetY0Y0T₌ f Y0 f Y0T₊_E _{such that} _k_E_k

F 6ε. Based on Lemma 5.2 and Lemma 5.4 we have that (17) holds for the matricesY0andfY0. Hence, by Lemma 5.1 we can apply the proof in [1, Theorem 6] to obtain Y0−Xf_α0 f X0 α T Y0 _F 6 √ α· Y 0₋_X0 opt Xopt0 T Y0 _F+ 2ε . The desired statement follows by simple algebraic manipulations. _J

(12)

5.2 Proof of Theorem 2.9

In this subsection, we show under our gap assumption that the approximate normalized spectral embeddingfY0 isε-separated, i.e. 4k(XfV)<5ε2· 4k−1(XfV). Our analysis builds upon Theorem 2.6, Theorem 2.8 and the proof techniques in [1, Theorem 6].

We use interchangeablyX_opt0 andX_opt0(k)to denote the optimal indicator matrix for the k-means problem onXV that is induced by the rows of matrixY0. Similarly, we denote by

X_opt0(k−1) the optimal indicator matrix for the (k−1)-means problem onXV.

Proof Sketch of Theorem 2.9. By Theorem 2.6 we have Y0−Xopt0(k) Xopt0(k) T Y0 _F 6 ε Y0−Xopt0(k−1) Xopt0(k−1) T Y0 _F . (18)

We set the approximation parameter in Theorem 2.8 to

ε0_, 1 4 p 4k(XV) = 1 4 Y0−Xopt0(k) Xopt0(k) T Y0 _F > n−O(1), (19)

and we note that by Theorem 2.6 it holdsε0 ₆ε₄p4k−1(XV). We construct now matrixYe via the power method withp>Ω(_λlnn

k+1). By Lemma 5.3 we have Y 0_Y0T₋ f Y0 f Y0T _F = Y Y T₋ e YYeT

_F and thus by Lemma 2.7 with high probability it holds that Y 0₍_Y0₎T₋ f Y0 f Y0T _F 6ε 0_. LetY0(Y0)T₌ f Y0 f Y0T₊_E _{such that}_k_E_k F 6ε

0_{. By combining Lemma 5.1, Lemma 5.2,} Lemma 5.3 and by applying the proof techniques in [1, Theorem 6] we obtain

r 4k f XV 6kEk_F+ Y0−X]_opt0(k) ] X_opt0(k) T Y0 _F .

Furthermore, we can show that ln 2−λk

2−λk+1 >12 1− 4δ 204k2

λk+1. Then Theorem 2.8 yields

Y0−X]_opt0(k) ] X_opt0(k) T Y0 2 F 6(1 + 4ε0)· Y0−X_opt0(k)X_opt0(k) T Y0 2 F + 4ε02.

Also, we can show Y0−X_opt0(k)X_opt0(k) T Y0 2 F 6 1

8·1013 and by the definition ofε

0 _{it follows} r 4k f XV 62p4k(XV)62ε· p 4k−1(XV). (20)

Moreover, by applying similar arguments as in the proof of [1, Theorem 6] we can prove that

p 4k−1(XV)6 1 + ε 2 r 4k−1 f XV . (21)

(13)

5.3 Proof of Part (b) of Theorem 1.2

Letp= Θ(_λlnn

k+1). We can compute the matrixB

p_S _{in time}_O₍_mkp_{) and its singular value} decompositionUeΣeVeT in timeO(nk2). Based on it, we construct in timeO(mk) matrixYf0 (c.f. (13)).

By Theorem 2.9, XfV is ε-separated forε = 6/107, i.e. 4k f XV <5ε2_{· 4} k−1 f XV . Hence, by Theorem 2.5 there is an algorithm that outputs in timeO(mk2₊_k4_{) a clustering}

with indicator matrixXf_α0 satisfying f Y0₋ f X0 α f X0 α T f Y0 2 F 6 1 + 1 1010 · f Y0₋_X_]0 opt ] X_opt0 T f Y0 2 F with constant probability (close to 1), whereα= 1 + 1/1010.

Moreover, by applying Theorem 2.8 with ε0 = ₄_·₁₀13 Y 0₋_X0 opt Xopt0 T Y0 _F we can prove that the indicator matrixXf_α0 yields a multiplicative approximation ofXV, i.e.

Y0−Xf_α0 f X0 α T Y0 2 F 6 1 + 1 106 Y 0₋_X0 opt X 0 opt T Y0 2 F . (22)

The statement follows by Theorem 1.2 (a) applied to the partition (A1, . . . , Ak) ofV that is induced by the indicator matrixXf_α0.

Acknowledgement. We would like to thank the anonymous reviewers for their constructive comments.

References

1 Christos Boutsidis, Prabhanjan Kambadur, and Alex Gittens. Spectral clustering via the power method - provably. InProceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015, pages 40–48, 2015.

2 Christos Boutsidis and Malik Magdon-Ismail. Faster svd-truncated regularized least-squares. In 2014 IEEE International Symposium on Information Theory, Honolulu, HI, USA, June 29 - July 4, 2014, pages 1321–1325, 2014.

3 James R. Lee, Shayan Oveis Gharan, and Luca Trevisan. Multi-way spectral partitioning and higher-order Cheeger inequalities. In Proceedings of the Forty-fourth Annual ACM Symposium on Theory of Computing, STOC’12, pages 1117–1130, New York, NY, USA, 2012. ACM.

4 Andrew Y. Ng, Michael I. Jordan, and Yair Weiss. On spectral clustering: Analysis and an algorithm. In Advances in Neural Information Processing Systems 14, pages 849–856. MIT Press, 2002.

5 Rafail Ostrovsky, Yuval Rabani, Leonard J. Schulman, and Chaitanya Swamy. The effect-iveness of Lloyd-type methods for the k-means problem.J. ACM, 59(6):28:1–28:22, January 2013.

6 Shayan Oveis Gharan and Luca Trevisan. Partitioning into expanders. In Proceedings of the Twenty-Fifth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2014, Portland, Oregon, USA, January 5-7, 2014, pages 1256–1266, 2014.

7 Richard Peng, He Sun, and Luca Zanetti. Partitioning well-clustered graphs: Spectral clustering works! InProceedings of The 28th Conference on Learning Theory, COLT 2015, Paris, France, July 3-6, 2015, pages 1423–1455, 2015.

8 Jianbo Shi and Jitendra Malik. Normalized cuts and image segmentation. IEEE Transac-tions on Pattern Analysis and Machine Intelligence, 22(8):888–905, 2000.

(14)

9 Ali Kemal Sinop. How to round subspaces: A new spectral clustering algorithm. In Proceedings of the Twenty-Seventh Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2016, Arlington, VA, USA, January 10-12, 2016, pages 1832–1847, 2016.

10 Ulrike von Luxburg. A tutorial on spectral clustering. Statistics and Computing, pages 395–416, 2007.