Online Correlation Clustering
Ocan Sankur (1, 2)
March 3, 2010
(Joint work with Claire Mathieu (2) and Warren Schudy (2))
(1) École Normale Supérieure, Paris, France; (2) Brown University, Providence, RI, USA
Correlation Clustering
Input: a complete graph whose edges are labeled + (similar) or − (dissimilar).
Output: a partition of the vertices (a clustering) that agrees as much as possible with the input: maximize profit = number of ‘+’ edges within clusters plus number of ‘−’ edges between clusters.
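The profit objective above can be made concrete with a small sketch. The representation is an assumption made for illustration only (it is not from the slides): the '+' edges are stored as a set of unordered pairs, and every other pair of vertices is taken to be labeled '−'. The function name `profit` is likewise hypothetical.

```python
from itertools import combinations

def profit(plus_edges, clustering):
    """Count '+' edges inside clusters plus '-' edges between clusters.

    plus_edges: set of frozenset({u, v}) pairs labeled '+'
                (assumption: every other pair is labeled '-').
    clustering: list of disjoint sets covering all vertices.
    """
    cluster_of = {v: i for i, cluster in enumerate(clustering) for v in cluster}
    total = 0
    for u, v in combinations(sorted(cluster_of), 2):
        similar = frozenset((u, v)) in plus_edges
        together = cluster_of[u] == cluster_of[v]
        if similar == together:  # the clustering agrees with the label
            total += 1
    return total

# Four vertices; '+' edges 1-2, 2-3, 1-3, 3-4; the pairs 1-4 and 2-4 are '-'.
plus = {frozenset(p) for p in [(1, 2), (2, 3), (1, 3), (3, 4)]}
print(profit(plus, [{1, 2, 3}, {4}]))  # 5: only the edge 3-4 disagrees
print(profit(plus, [{1, 2, 3, 4}]))    # 4: the two '-' edges disagree
```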
Background
Studied by Ben-Dor, Shamir, and Yakhini [BDSY99] and by Bansal, Blum, and Chawla [BBC04], respectively to cluster gene expression patterns and for information retrieval applications.
NP-hard [BBC04].
An algorithm that outputs a clustering with profit ≥ (1 − ε) · profit(OPT), for any ε > 0 [BBC04].
Our contribution: We study this problem online.
Correlation Clustering: Online setting
Vertices arrive one by one. The size of the input is unknown.
Online clustering algorithm
Upon arrival of a vertex v, an online algorithm can:
- Create a new cluster {v}.
- Add v to an existing cluster.
- Merge any pre-existing clusters.
It cannot split a pre-existing cluster.
An online algorithm is c-competitive if, on any input I, it outputs a clustering ALG(I) such that profit(ALG(I)) ≥ c · profit(OPT(I)), where OPT(I) is the offline optimum.
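To make the competitive-ratio definition tangible, the following sketch computes the offline optimum of a tiny instance by brute force and compares a given clustering against it. This is only an illustration under stated assumptions: the helper names (`profit`, `partitions`, `offline_opt`, `ratio`) and the edge representation are hypothetical, and enumerating all partitions is exponential, so it is usable only on very small inputs with at least two vertices.

```python
from itertools import combinations

def profit(plus_edges, clustering):
    """'+' edges within clusters plus '-' edges between clusters."""
    cluster_of = {v: i for i, c in enumerate(clustering) for v in c}
    return sum(
        (frozenset((u, v)) in plus_edges) == (cluster_of[u] == cluster_of[v])
        for u, v in combinations(sorted(cluster_of), 2)
    )

def partitions(vertices):
    """Yield every partition of a list of vertices (exponentially many)."""
    if not vertices:
        yield []
        return
    first, rest = vertices[0], vertices[1:]
    for part in partitions(rest):
        yield [{first}] + part                       # first is a singleton
        for i in range(len(part)):                   # first joins cluster i
            yield part[:i] + [part[i] | {first}] + part[i + 1:]

def offline_opt(vertices, plus_edges):
    """Profit of the best partition, found by exhaustive search."""
    return max(profit(plus_edges, p) for p in partitions(list(vertices)))

def ratio(vertices, plus_edges, alg_clustering):
    """profit(ALG) / profit(OPT) on one instance."""
    return profit(plus_edges, alg_clustering) / offline_opt(vertices, plus_edges)

plus = {frozenset(p) for p in [(1, 2), (2, 3), (1, 3), (3, 4)]}
print(offline_opt([1, 2, 3, 4], plus))            # 5
print(ratio([1, 2, 3, 4], plus, [{1, 2, 3, 4}]))  # 0.8
```

A c-competitive algorithm must keep this ratio at least c on every input, not just on one instance as checked here.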
Our results for maximizing profit
Results
- A greedy algorithm that is 0.5-competitive;
- No algorithm has a competitive ratio better than 0.834;
- We design a (0.5 + ε_0)-competitive algorithm, where ε_0 is a small constant. How small?
Algorithm Greedy
Algorithm 1 Algorithm Greedy
  for each vertex v, upon its arrival, do
    Put v in a new cluster {v}.
    while ∃ clusters C, D s.t. merging C and D improves the profit do
      Merge C and D
    end while
  end for
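The pseudocode of Greedy can be sketched in Python as follows. Merging C and D turns the edges between them into within-cluster edges, so the merge improves the profit exactly when C and D have more '+' than '−' edges between them. Two choices here are assumptions for illustration, not taken from the slides: the '+' edges are given as a set of unordered pairs (all other pairs being '−'), and when several merges would improve the profit, the first pair found is merged.

```python
from itertools import combinations

def merge_gain(c, d, plus_edges):
    """Change in profit if clusters c and d are merged."""
    plus = sum(1 for u in c for v in d if frozenset((u, v)) in plus_edges)
    minus = len(c) * len(d) - plus
    return plus - minus

def greedy_online(arrivals, plus_edges):
    """Process vertices in arrival order; only merges are performed."""
    clusters = []
    for v in arrivals:
        clusters.append({v})              # put v in a new cluster {v}
        improved = True
        while improved:                   # merge while some merge helps
            improved = False
            for c, d in combinations(clusters, 2):
                if merge_gain(c, d, plus_edges) > 0:
                    clusters.remove(c)
                    clusters.remove(d)
                    clusters.append(c | d)
                    improved = True
                    break                 # rescan the updated clustering
    return clusters

plus = {frozenset(p) for p in [(1, 2), (2, 3), (1, 3), (3, 4)]}
print(greedy_online([1, 2, 3, 4], plus))  # [{1, 2, 3}, {4}]
print(greedy_online([4, 3, 2, 1], plus))  # [{3, 4}, {1, 2}]
```

On this instance the first arrival order reaches profit 5 (the optimum), while the reversed order ends with profit 4, which shows how the arrival order affects the result.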
Algorithm Greedy: Example
Better than a 0.5-approximation (1)
Result 1: Algorithm Greedy is 0.5-competitive.
If profit(OPT) ≤ (1 − α)|E|, Greedy has competitive ratio > 0.5.
Idea: design an algorithm (Dense) with competitive ratio > 0.5 when profit(OPT) > (1 − α)|E|.
Better than a 0.5-approximation (2)
Algorithm 2 GreedyOrDense
  With probability p, run Greedy.
  With probability 1 − p, run Dense.
Result 2: Algorithm GreedyOrDense is (0.5 + ε_0)-competitive.
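The reason a random mix can beat 0.5 may be sketched as a short calculation. The symbols δ and η below are hypothetical placeholders for the (unspecified) gains of Greedy and Dense over 0.5 in their respective favorable cases; they are not values from the talk. The only facts used are that profit is nonnegative and that Greedy is 0.5-competitive on every input.

```latex
% Assumptions (\delta, \eta > 0 are illustrative, not from the slides):
%   sparse case, profit(OPT) <= (1-\alpha)|E|: Greedy >= (0.5+\delta) OPT, Dense >= 0
%   dense case,  profit(OPT) >  (1-\alpha)|E|: Greedy >= 0.5 OPT, Dense >= (0.5+\eta) OPT
\begin{align*}
\mathbb{E}\bigl[\mathrm{profit}(\textsc{GreedyOrDense})\bigr]
  &= p\,\mathrm{profit}(\textsc{Greedy}) + (1-p)\,\mathrm{profit}(\textsc{Dense}) \\
\text{sparse case:}\quad
\mathbb{E}\bigl[\mathrm{profit}(\textsc{GreedyOrDense})\bigr]
  &\ge p\,(0.5+\delta)\,\mathrm{profit}(\mathrm{OPT}) \\
\text{dense case:}\quad
\mathbb{E}\bigl[\mathrm{profit}(\textsc{GreedyOrDense})\bigr]
  &\ge \bigl(0.5\,p + (1-p)(0.5+\eta)\bigr)\,\mathrm{profit}(\mathrm{OPT}) \\
  &= \bigl(0.5 + (1-p)\,\eta\bigr)\,\mathrm{profit}(\mathrm{OPT})
\end{align*}
```

Under these assumptions both bounds exceed 0.5 whenever 1/(1 + 2δ) < p < 1. Since p must then be close to 1 when δ is small, the resulting gain over 0.5 is small as well.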
Introducing Algorithm Dense
Reminder: focus on instances where profit(OPT) > (1 − α)|E|.
Idea of Algorithm Dense: fix τ = 1.10.
Put every new vertex in a singleton. At times t_i = τ^i:
- Compute (near) OPT(t_i) using [BBC04],
- Use it to run a merging procedure.
Simplification: Suppose we have the exact OPT at times t_i.
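The geometric update schedule t_i = τ^i can be illustrated with a few lines of Python. How τ^i is turned into an integer number of arrived vertices is not stated in the slides; rounding up and dropping repeated values are assumptions made here, and the function name `update_times` is hypothetical.

```python
import math

def update_times(n, tau=1.10):
    """Distinct integer times ceil(tau**i), i = 0, 1, 2, ..., that are <= n.

    Assumption: rounding up with ceil; the slides only state t_i = tau^i.
    """
    times = []
    i = 0
    while True:
        t = math.ceil(tau ** i)
        if t > n:
            break
        if not times or t != times[-1]:   # skip repeated rounded values
            times.append(t)
        i += 1
    return times

print(update_times(20, tau=2))  # [1, 2, 4, 8, 16]
print(update_times(10))         # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
```

With τ = 1.10 the merging procedure runs at every arrival while the input is small; the gaps between consecutive updates then grow geometrically.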
Algorithm Dense: The merging procedure by example
At time t_2, we run the merging procedure.
First, compute OPT(t_2).
Then try to recreate OPT(t_2).
Algorithm Dense: The merging procedure by example
The red clustering is the clustering at time t_2.
We keep in mind that we obtained B_1′, B_2′ by adapting our clusters to B_1 and B_2 of OPT(t_2).
Call B_1, B_2 ∈ OPT(t_2) ghost clusters.
Algorithm Dense: The merging procedure by example
This is the clustering at time t_3.
We keep ghost clusters C_1, C_2 ∈ OPT(t_3), since we obtained C_1′, C_2′ by adapting to these.
Analysis of Dense
If τ is large enough, then the clustering of Dense at time t_i = τ^i is close to OPT(t_i): profit(Dense(t_i)) ≥ (1 − O(α)) · (t_i choose 2).
If τ is small enough, then the profit of the clustering of Dense stays high between two updates.
Lemma (Main Lemma)
Better than a 0.5-approximation
Algorithm 2 GreedyOrDense
  With probability p, run Greedy.
  With probability 1 − p, run Dense.
Result 2: Algorithm GreedyOrDense is (0.5 + ε_0)-competitive.
Our results on minimizing cost
Instead of maximizing profit, minimize cost = number of ‘−’ edges within clusters plus number of ‘+’ edges between clusters.
In the offline case, there is a 2.5-approximation by Ailon, Charikar, and Newman [ACN05].
Theorem
Algorithm Greedy is O(n)-competitive for minimizing cost, and this is optimal.
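Profit and cost always sum to the number of vertex pairs, n(n − 1)/2, so the same clusterings are optimal for both objectives, yet the ratios to the optimum can differ considerably. The sketch below illustrates this on a four-vertex instance; the function name `profit_and_cost` and the edge representation (a set of '+' pairs, with every other pair labeled '−') are assumptions made for illustration.

```python
from itertools import combinations

def profit_and_cost(plus_edges, clustering):
    """Return (agreements, disagreements) of a clustering with the labels."""
    cluster_of = {v: i for i, c in enumerate(clustering) for v in c}
    agree = disagree = 0
    for u, v in combinations(sorted(cluster_of), 2):
        similar = frozenset((u, v)) in plus_edges
        together = cluster_of[u] == cluster_of[v]
        if similar == together:
            agree += 1       # '+' inside a cluster or '-' between clusters
        else:
            disagree += 1    # '-' inside a cluster or '+' between clusters
    return agree, disagree

plus = {frozenset(p) for p in [(1, 2), (2, 3), (1, 3), (3, 4)]}
print(profit_and_cost(plus, [{1, 2, 3}, {4}]))  # (5, 1)
print(profit_and_cost(plus, [{3, 4}, {1, 2}]))  # (4, 2)
```

Here the second clustering reaches 4/5 of the first one's profit but pays twice its cost, which hints at why guarantees for minimizing cost are much weaker than for maximizing profit.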
Conclusion
A greedy 0.5-competitive algorithm for maximizing profit.
But 0.5 is not the best we can do: we exhibit a randomized algorithm with competitive ratio 0.5 + 10^{-14}.
Future work:
- Show a better upper bound - or find a better algorithm.
References
[ACN05] Nir Ailon, Moses Charikar, and Alantha Newman, Aggregating inconsistent information: ranking and clustering, STOC ’05: Proceedings of the thirty-seventh annual ACM symposium on Theory of computing (New York, NY, USA), ACM Press, 2005, pp. 684–693.
[BBC04] Nikhil Bansal, Avrim Blum, and Shuchi Chawla, Correlation clustering, Mach. Learn. 56 (2004), no. 1-3, 89–113.
[BDSY99] Amir Ben-Dor, Ron Shamir, and Zohar Yakhini, Clustering gene expression patterns, Journal of Computational Biology 6 (1999), no. 3-4, 281–297.