Online Correlation Clustering

Ocan Sankur¹ ²

March 3, 2010

(Joint work with Claire Mathieu² and Warren Schudy²)

¹ École Normale Supérieure, Paris, France
² Brown University, Providence, RI, USA

Correlation Clustering

Input: complete graph with edges labeled +/- (similarity)

Output: partition of vertices (clustering) that agrees as much as possible with input: maximize profit = ‘+’ edges within clusters plus ‘−’ edges between clusters.
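The profit objective can be made concrete with a short sketch. The function name `profit`, the dictionary encoding of edge labels, and the 4-vertex instance are illustrative assumptions, not part of the talk.

```python
from itertools import combinations

def profit(labels, clustering):
    """Count the edges on which a clustering agrees with the labels.

    labels: dict mapping frozenset({u, v}) to '+' or '-' for every pair.
    clustering: list of disjoint sets that together cover all vertices.
    """
    cluster_of = {v: i for i, c in enumerate(clustering) for v in c}
    total = 0
    for u, v in combinations(sorted(cluster_of), 2):
        same = cluster_of[u] == cluster_of[v]
        sign = labels[frozenset((u, v))]
        # '+' edges agree when inside a cluster, '-' edges when between clusters.
        if (sign == '+' and same) or (sign == '-' and not same):
            total += 1
    return total

# Hypothetical instance: a is similar to b and c; every other pair is dissimilar.
labels = {frozenset(p): '-' for p in combinations("abcd", 2)}
labels[frozenset("ab")] = '+'
labels[frozenset("ac")] = '+'

print(profit(labels, [{"a", "b", "c"}, {"d"}]))
print(profit(labels, [{"a"}, {"b"}, {"c"}, {"d"}]))
```

In this instance the triangle a, b, c carries two '+' edges and one '−' edge, so no clustering can agree with all six edges.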

Background

Studied by Ben-Dor, Shamir, and Yakhini [BDSY99] to cluster gene expression patterns, and by Bansal, Blum, and Chawla [BBC04] for information retrieval applications.

NP-hard [BBC04].

An algorithm that outputs a clustering with profit ≥ (1 − ε) · profit(OPT), for any ε > 0 [BBC04].

Our contribution: We study this problem online.

Correlation Clustering: Online setting

Vertices arrive one by one. The size of the input is unknown.

Online clustering algorithm

Upon arrival of a vertex v, an online algorithm can:

Create a new cluster {v}.

Add v to an existing cluster.

Merge any pre-existing clusters.

It cannot split a pre-existing cluster: merges are irrevocable.

An online algorithm is c-competitive if, on any input I, it outputs a clustering ALG(I) such that profit(ALG(I)) ≥ c · profit(OPT(I)), where OPT(I) is the offline optimum.
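On very small inputs the ratio in this definition can be checked by brute force. The sketch below enumerates every partition to find the offline optimum; the helper names (`partitions`, `offline_opt`, `ratio`) and the instance are assumptions for illustration, and the enumeration is exponential, so it only suits a handful of vertices.

```python
from itertools import combinations

def profit(labels, clustering):
    """Number of edges whose label agrees with the clustering."""
    cluster_of = {v: i for i, c in enumerate(clustering) for v in c}
    return sum(
        (labels[frozenset((u, v))] == '+') == (cluster_of[u] == cluster_of[v])
        for u, v in combinations(sorted(cluster_of), 2)
    )

def partitions(items):
    """Yield every partition of a list of items as a list of sets."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for part in partitions(rest):
        # Either `first` forms its own cluster ...
        yield [{first}] + part
        # ... or it joins one of the existing clusters.
        for i in range(len(part)):
            yield part[:i] + [part[i] | {first}] + part[i + 1:]

def offline_opt(labels):
    """Best achievable profit, by exhaustive search (tiny inputs only)."""
    vertices = sorted({v for edge in labels for v in edge})
    return max(profit(labels, p) for p in partitions(vertices))

def ratio(labels, alg_clustering):
    """profit(ALG) / profit(OPT); assumes at least two vertices, so OPT > 0."""
    return profit(labels, alg_clustering) / offline_opt(labels)

# Hypothetical instance: a is similar to b and c; every other pair is dissimilar.
labels = {frozenset(p): '-' for p in combinations("abcd", 2)}
labels[frozenset("ab")] = '+'
labels[frozenset("ac")] = '+'

print(offline_opt(labels))
print(ratio(labels, [{"a"}, {"b"}, {"c"}, {"d"}]))
```

A c-competitive algorithm must keep this ratio at or above c on every input and every arrival order, not just on one instance.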

Our results for maximizing profit

Results

A greedy algorithm that is 0.5-competitive;

No algorithm has a competitive ratio better than 0.834;

We design a (0.5 + ε₀)-competitive algorithm, where ε₀ is a small constant. How small?

Algorithm Greedy

Algorithm 1: Greedy

for each arriving vertex v do
    Put v in a new cluster {v}.
    while there exist clusters C, D such that merging C and D improves the profit do
        Merge C and D.
    end while
end for
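The pseudocode above can be sketched in Python as follows. The class name, the ±1 encoding of edge signs, and the choice of which improving pair to merge first (the first one found) are assumptions; the talk does not specify a tie-breaking rule.

```python
from itertools import combinations

class Greedy:
    """Online greedy clustering: merge clusters while a merge raises the profit."""

    def __init__(self):
        self.clusters = []  # current clustering, a list of disjoint sets
        self.sign = {}      # frozenset({u, v}) -> +1 for '+', -1 for '-'

    def _gain(self, C, D):
        # Merging C and D turns every '+' edge between them into an agreement
        # and every '-' edge into a disagreement, so the profit changes by
        # (#'+' edges) - (#'-' edges) between C and D.
        return sum(self.sign[frozenset((u, v))] for u in C for v in D)

    def arrive(self, v, similar_to):
        """Insert v; `similar_to` holds the earlier vertices joined to v by '+'."""
        for c in self.clusters:
            for u in c:
                self.sign[frozenset((u, v))] = 1 if u in similar_to else -1
        self.clusters.append({v})
        merged = True
        while merged:
            merged = False
            for C, D in combinations(self.clusters, 2):
                if self._gain(C, D) > 0:
                    self.clusters.remove(C)
                    self.clusters.remove(D)
                    self.clusters.append(C | D)
                    merged = True
                    break  # the cluster list changed; rescan from scratch
        return self.clusters

# Hypothetical arrival sequence: b is similar to a; c is similar to a only; d to nobody.
g = Greedy()
g.arrive("a", set())
g.arrive("b", {"a"})
g.arrive("c", {"a"})
g.arrive("d", set())
print(sorted(sorted(c) for c in g.clusters))
```

When c arrives, merging {c} with {a, b} would gain one agreement and lose one, so the profit does not improve and c stays alone.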

[Figure: Algorithm Greedy, example]

Better than a 0.5-approximation (1)

Result 1

Algorithm Greedy is 0.5-competitive.

If profit(OPT) ≤ (1 − α)|E|, Greedy has competitive ratio > 0.5.

Idea: design an algorithm (Dense) with competitive ratio > 0.5 when profit(OPT) > (1 − α)|E|.
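The first claim can be checked with a short calculation. It is a sketch that assumes Greedy always achieves profit at least |E|/2. That bound is consistent with the merge rule: every merge joins two clusters with more '+' than '−' edges between them, and when Greedy stops, no two clusters have more '+' than '−' edges between them.

```latex
\[
\mathrm{profit}(\mathrm{Greedy}) \ge \frac{|E|}{2}
\quad\text{and}\quad
\mathrm{profit}(\mathrm{OPT}) \le (1-\alpha)\,|E|
\;\Longrightarrow\;
\frac{\mathrm{profit}(\mathrm{Greedy})}{\mathrm{profit}(\mathrm{OPT})}
\ge \frac{1}{2(1-\alpha)} > \frac{1}{2}
\qquad (0 < \alpha < 1).
\]
```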

Better than a 0.5-approximation (2)

Algorithm 2: GreedyOrDense

With probability p, run Greedy.
With probability 1 − p, run Dense.

Result 2

Algorithm GreedyOrDense is (0.5 + ε₀)-competitive.
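One way to see why a random choice between the two algorithms can beat 0.5 is the following sketch. The symbol β, standing for a competitive ratio of Dense above 1/2 on instances with profit(OPT) > (1 − α)|E|, is a hypothetical placeholder, and the bound profit(Greedy) ≥ |E|/2 is assumed; the talk may set p and carry out the analysis differently.

```latex
\[
\mathbb{E}\bigl[\mathrm{profit}(\mathrm{GreedyOrDense})\bigr]
= p\,\mathrm{profit}(\mathrm{Greedy}) + (1-p)\,\mathrm{profit}(\mathrm{Dense})
\]
\[
\frac{\mathbb{E}[\mathrm{profit}(\mathrm{GreedyOrDense})]}{\mathrm{profit}(\mathrm{OPT})}
\ge
\begin{cases}
\dfrac{p}{2(1-\alpha)} & \text{if } \mathrm{profit}(\mathrm{OPT}) \le (1-\alpha)|E|,\\[2ex]
\dfrac{p}{2} + (1-p)\,\beta & \text{if } \mathrm{profit}(\mathrm{OPT}) > (1-\alpha)|E|.
\end{cases}
\]
```

Under these assumptions both cases exceed 1/2 whenever 1 − α < p < 1 and β > 1/2.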

Introducing Algorithm Dense

Reminder: focus on instances where profit(OPT) > (1 − α)|E|.

Idea of Algorithm Dense: fix τ = 1.10.

Put every new vertex in a singleton. At times t_i = τ^i:

◮ Compute (near) OPT(t_i) using [BBC04],

◮ Use it to run a merging procedure.

Simplification: Suppose we have the exact OPT at times t_i.
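The update schedule can be sketched as follows, reading the update times as t_i = τ^i. Rounding each τ^i up to an integer number of arrived vertices is an assumption made here for illustration, as is the function name `update_times`.

```python
import math

def update_times(n, tau=1.10):
    """Return the distinct integer times ceil(tau**i) <= n, in increasing order.

    Rounding up is an illustrative assumption; the talk only states t_i = tau^i.
    """
    times = []
    i = 1
    while True:
        t = math.ceil(tau ** i)
        if t > n:
            break
        # Several consecutive powers can round to the same integer; keep one.
        if not times or t != times[-1]:
            times.append(t)
        i += 1
    return times

print(update_times(10))
print(update_times(10, tau=2))
```

With τ close to 1 the early update times are dense, and the gap between consecutive updates grows roughly by a factor τ as more vertices arrive.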

Algorithm Dense: The merging procedure by example

At time t₂, we run the merging procedure.

First, compute OPT(t₂).

Then try to recreate OPT(t₂).

The red clustering is the clustering at time t₂.

We keep in mind that we obtained B′₁, B′₂ by adapting our clusters to B₁ and B₂ of OPT(t₂).

Call B₁, B₂ ∈ OPT(t₂) ghost clusters.

This is the clustering at time t₃.

We keep ghost clusters C₁, C₂ ∈ OPT(t₃), since we obtained C′₁, C′₂ by adapting to these.

Analysis of Dense

If τ is large enough, then the clustering of Dense at time t_i = τ^i is close to OPT(t_i): profit(Dense(t_i)) ≥ (1 − O(α)) · (t_i choose 2), where (t_i choose 2) = t_i(t_i − 1)/2 is the number of edges at time t_i.

If τ is small enough, then the profit of the clustering of Dense stays high between two updates.

Lemma (Main Lemma)

Our results on minimizing cost

Instead of maximizing profit, minimize cost = ‘-’ edges within clusters plus ‘+’ edges between clusters.
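The two objectives are complementary: every edge either agrees or disagrees with a clustering. The notation below (a clustering 𝒞 of an n-vertex complete graph with edge set E) is introduced only for this sketch.

```latex
\[
\mathrm{profit}(\mathcal{C}) + \mathrm{cost}(\mathcal{C}) = |E| = \binom{n}{2}
\]
```

The same clustering is therefore optimal for both objectives, but a multiplicative guarantee for one does not carry over to the other, which is why the two settings are treated separately.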

In the offline case, there is a 2.5-approximation by Ailon, Charikar, and Newman [ACN05].

Theorem

Algorithm Greedy is O(n)-competitive for minimizing cost, and this is

Conclusion

A greedy 0.5-competitive algorithm for maximizing profit.

But 0.5 is not the best possible: we exhibit a randomized algorithm with competitive ratio 0.5 + 10⁻¹⁴.

Future work:

- Show a better upper bound - or find a better algorithm.

References

[ACN05] Nir Ailon, Moses Charikar, and Alantha Newman, Aggregating inconsistent information: ranking and clustering, STOC '05: Proceedings of the thirty-seventh annual ACM symposium on Theory of computing (New York, NY, USA), ACM Press, 2005, pp. 684–693.

[BBC04] Nikhil Bansal, Avrim Blum, and Shuchi Chawla, Correlation clustering, Mach. Learn. 56 (2004), no. 1-3, 89–113.

[BDSY99] Amir Ben-Dor, Ron Shamir, and Zohar Yakhini, Clustering gene expression patterns, Journal of Computational Biology 6 (1999), no. 3-4, 281–297.