• No results found

Tangles: from weak to strong clustering

N/A
N/A
Protected

Academic year: 2021

Share "Tangles: from weak to strong clustering"

Copied!
54
0
0

Loading.... (view fulltext now)

Full text

(1)

Tangles: from weak to strong clustering

or: Our adventure in machine learning

(2)

Tangle definition

2

Given a set S of bipartitions (cuts), atangleis a setτ which

contains exactly one side of each bipartition such that

|A∩B∩C| ≥a ∀A, B, C∈τ.

(3)

Stochastic Block Model

3

k blocks of equal size nk

(4)

Stochastic Block Model

3

k blocks of equal size nk

Edges within blocks with probability p, between blocks with probability q < p

Consider all cuts up to order Ψ

(5)

4

When are the blocks (distinct) tangles?

When are there no other tangles?

(6)

SBM (Expectation Case)

5

2 blocks of equal size n2

Edges within blocks withweight p, between blocks with weightq

All cuts up to order Ψ

|A, A{|:=P

(7)

SBM (Expectation Case)

5

2 blocks of equal size n2

Edges within blocks withweight p, between blocks with weightq

All cuts up to order Ψ

|A, A{|:=P

(8)

SBM (Expectation Case)

5

2 blocks of equal size n2

Edges within blocks withweight p, between blocks with weightq

All cuts up to order Ψ

|A, A{|:=P

(9)

SBM (Expectation Case)

5

2 blocks of equal size n2

Edges within blocks withweight p, between blocks with weightq

All cuts up to order Ψ

|A, A{|:=P

(10)

SBM (Expectation Case)

5

2 blocks of equal size n2

Edges within blocks withweight p, between blocks with weightq

All cuts up to order Ψ

|A, A{|:=P a∈A,b∈A{ w(a,b) n2 4 p(α1−α 2 1+α2−α22) +q(α1+α2−2α1α2)

(11)

SBM (Expectation Case)

5

2 blocks of equal size n2

Edges within blocks withweight p, between blocks with weightq

All cutsup to order Ψ

|A, A{|:=P

(12)

SBM (Expectation Case)

5

2 blocks of equal size n2

Edges within blocks withweight p, between blocks with weightq

All cutsup to order Ψ

|A, A{|:=P a∈A,b∈A{ w(a,b) X (i ,j) (in,jn)∈AΨ n 1 i · n 2 j X (i ,j) (δni,δnj)∈AΨ n 1 i · n 2 j

(13)

6

(14)

6

How do we sample good cuts?

(15)

6

(16)

Cut finding strategies

7 0/5 1/5 2/5 1/2 Cut balance 0.0 0.1 0.2 0.3 0.4 0.5 Cu t v alu e i n te rm s o f | E|

(17)

Karger’s algorithm

8

(18)

Cut finding strategies

9 0/5 1/5 2/5 1/2 Cut balance 0.0 0.1 0.2 0.3 0.4 0.5 Cu t v alu e i n te rm s o f | E|

Comparison of sampling strategies, SBM with 5 blocks of 50 nodes, p = 0.5, q = 0.05 (Data is fuzzed for readability)

random (m = 500, took 0.1s) kmodes (m = 44, took 8.6s) karger (m = 500, took 27.6s)

(19)

Cut finding strategies

9 0/5 1/5 2/5 1/2 Cut balance 0.0 0.1 0.2 0.3 0.4 0.5 Cu t v alu e i n te rm s o f | E|

Comparison of sampling strategies, SBM with 5 blocks of 50 nodes, p = 0.5, q = 0.05 (Data is fuzzed for readability)

random (m = 500, took 0.1s) kmodes (m = 44, took 8.6s) karger (m = 500, took 27.6s) 0/5 1/5 2/5 1/2 Cut balance 0.0 0.1 0.2 0.3 0.4 0.5 Cu t v alu e i n te rm s o f | E|

Comparison of sampling strategies, SBM with 5 blocks of 50 nodes, p = 0.5, q = 0.05 (Data is fuzzed for readability)

random (m = 500, took 0.1s) kmodes (m = 44, took 8.6s) karger (m = 500, took 27.6s) kneip (m = 513, took 8.4s) kernighan_lin (m = 500, took 73.0s)

(20)

Cut finding strategies

9 0/5 1/5 2/5 1/2 Cut balance 0.0 0.1 0.2 0.3 0.4 0.5 Cu t v alu e i n te rm s o f | E|

Comparison of sampling strategies, SBM with 5 blocks of 50 nodes, p = 0.5, q = 0.05 (Data is fuzzed for readability)

random (m = 500, took 0.1s) kmodes (m = 44, took 8.6s) karger (m = 500, took 27.6s) 0/5 1/5 2/5 1/2 Cut balance 0.0 0.1 0.2 0.3 0.4 0.5 Cu t v alu e i n te rm s o f | E|

Comparison of sampling strategies, SBM with 5 blocks of 50 nodes, p = 0.5, q = 0.05 (Data is fuzzed for readability)

random (m = 500, took 0.1s) kmodes (m = 44, took 8.6s) karger (m = 500, took 27.6s) kneip (m = 513, took 8.4s) kernighan_lin (m = 500, took 73.0s) local_min (m = 500, took 241.9s)

(21)

Cut finding strategies

9 0/5 1/5 2/5 1/2 Cut balance 0.0 0.1 0.2 0.3 0.4 0.5 Cu t v alu e i n te rm s o f | E|

Comparison of sampling strategies, SBM with 5 blocks of 50 nodes, p = 0.5, q = 0.05 (Data is fuzzed for readability)

random (m = 500, took 0.1s) kmodes (m = 44, took 8.6s) karger (m = 500, took 27.6s) 0/5 1/5 2/5 1/2 Cut balance 0.0 0.1 0.2 0.3 0.4 0.5 Cu t v alu e i n te rm s o f | E|

Comparison of sampling strategies, SBM with 5 blocks of 50 nodes, p = 0.5, q = 0.05 (Data is fuzzed for readability)

random (m = 500, took 0.1s) kmodes (m = 44, took 8.6s) karger (m = 500, took 27.6s) kneip (m = 513, took 8.4s) kernighan_lin (m = 500, took 73.0s) local_min (m = 500, took 241.9s) local_min_bounded (m = 500, took 43.1s) fid_mat (m = 500, took 837.1s)

(22)

Cut finding strategies

9 0/5 1/5 2/5 1/2 Cut balance 0.0 0.1 0.2 0.3 0.4 0.5 Cu t v alu e i n te rm s o f | E|

Comparison of sampling strategies, SBM with 5 blocks of 50 nodes, p = 0.5, q = 0.05 (Data is fuzzed for readability)

random (m = 500, took 0.1s) kmodes (m = 44, took 8.6s) karger (m = 500, took 27.6s) 0/5 1/5 2/5 1/2 Cut balance 0.0 0.1 0.2 0.3 0.4 0.5 Cu t v alu e i n te rm s o f | E|

Comparison of sampling strategies, SBM with 5 blocks of 50 nodes, p = 0.2, q = 0.05 (Data is fuzzed for readability)

random (m = 500, took 0.1s) kmodes (m = 44, took 7.0s) karger (m = 500, took 48.5s) kneip (m = 502, took 14.5s) kernighan_lin (m = 500, took 214.1s) local_min (m = 500, took 473.0s) local_min_bounded (m = 500, took 49.0s) fid_mat (m = 500, took 1219.4s)

(23)

The mindset model

10

(24)

The mindset model

10

k mindsets,m questions, npeople

Step 1: Samplek template vectors µ1, . . . , µk ∈ {0,1}m (mindsets)

Step 2: For eachµi, a set of nk people answers asµi does, but

(25)

The mindset model

10

k mindsets,m questions, npeople

Step 1: Samplek template vectors µ1, . . . , µk ∈ {0,1}m (mindsets)

Step 2: For eachµi, a set of nk people answers asµi does, but

deviates on each question independently with probabilityp < 12

(26)

The mindset model

10

k mindsets,m questions, npeople

Step 1: Samplek template vectors µ1, . . . , µk ∈ {0,1}m (mindsets)

Step 2: For eachµi, a set of nk people answers asµi does, but

deviates on each question independently with probabilityp < 12

Cuts are induced by questions.

When

are

the mindsets tangles?

When are there no other tangles?

(27)

Stochastics

11

Everything’s just Bernoulli random variables. Binomial distributions are well understood.

(28)

Stochastics

11

If 1 −3p > k a/n then with probability at least

1−k mexp(−2n(k an −1 + 3p)2 1

9k) every mindset is

(29)

Stochastics

11

If 1 −3p > k a/n then with probability at least

1−k mexp(−2n(k an −1 + 3p)2 1

9k) every mindset is

a tangle.

Ifp≤a/n then with probability at least

1−mkexp(−2n

k (p−

k a

n)

2)every triple with large

(30)

Stochastics

11

If 1 −3p > k a/n then with probability at least

1−k mexp(−2n(k an −1 + 3p)2 1

9k) every mindset is

a tangle.

Ifp≤a/n then with probability at least

1−mkexp(−2n

k (p−

k a

n)

2)every triple with large

inter-section comes from a mindset.

But how do we turn this into

‘Every tangle is a mindset’ ?

(31)

The problem

12

Suppose we have these mindsets:

(1,1,1,0,0,0,0,0,0, 0,0,0)

(0,0,0,1,1,1,0,0,0, 0,0,0)

(0,0,0,0,0,0,1,1,1, 0,0,0)

(32)

The problem

12

Suppose we have these mindsets:

(1,1,1,0,0,0,0,0,0, 0,0,0)

(0,0,0,1,1,1,0,0,0, 0,0,0)

(0,0,0,0,0,0,1,1,1, 0,0,0)

(0,0,0,0,0,0,0,0,0, 1,1,1)

Then we also get a tangle for

(33)

The problem

12

Suppose we have these mindsets:

(1,1,1,0,0,0,0,0,0, 0,0,0)

(0,0,0,1,1,1,0,0,0, 0,0,0)

(0,0,0,0,0,0,1,1,1, 0,0,0)

(0,0,0,0,0,0,0,0,0, 1,1,1)

Then we also get a tangle for

(0,0,0,0,0,0,0,0,0, 0,0,0)

Assumption.If τ ∈ {0,1}m satisfies that for allx , y , z m there

exists a mindset µ such thatτ(x) =µi(x) as well as τ(y) =µi(y)

(34)

How often is this satisified?

13

Assumption.If τ ∈ {0,1}m satisfies that for allx , y , z m there

exists a mindset µ such thatτ(x) =µi(x) as well as τ(y) =µi(y)

(35)

How often is this satisified?

13

Assumption.If τ ∈ {0,1}m satisfies that for allx , y , z m there

exists a mindset µ such thatτ(x) =µi(x) as well as τ(y) =µi(y)

andτ(z) =µi(z), thenτ is a mindset, i.e.τ =µj for somej.

(36)

How often is this satisified?

13

Assumption.If τ ∈ {0,1}m satisfies that for allx , y , z m there

exists a mindset µ such thatτ(x) =µi(x) as well as τ(y) =µi(y)

andτ(z) =µi(z), thenτ is a mindset, i.e.τ =µj for somej.

Easily holds if every partition of the mindsets is induced by a question.

(37)

How often is this satisified?

13

Assumption.If τ ∈ {0,1}m satisfies that for allx , y , z m there

exists a mindset µ such thatτ(x) =µi(x) as well as τ(y) =µi(y)

andτ(z) =µi(z), thenτ is a mindset, i.e.τ =µj for somej.

Easily holds if every partition of the mindsets is induced by a question.

This is bound to happen asm→ ∞.

(38)

How often is this satisified?

13

Assumption.If τ ∈ {0,1}m satisfies that for allx , y , z m there

exists a mindset µ such thatτ(x) =µi(x) as well as τ(y) =µi(y)

andτ(z) =µi(z), thenτ is a mindset, i.e.τ =µj for somej.

Easily holds if every partition of the mindsets is induced by a question.

This is bound to happen asm→ ∞.

Caveat:This requires m to be exponential ink.

Theorem.Asympotically, m has to be exponential ink, or else the assumption fails with high probability.

(39)

How often is it

really

satisified?

14

(40)

How often is it

really

satisified?

14

(41)

Back to experiments

15

How do we evaluate the quality of our

clustering numerically?

(42)

Back to experiments

15

How do we evaluate the quality of our

clustering numerically?

Turn it into a hard clustering. Count the number of wrongly

(43)

Dimensions

16 k: number of mindsets m: number of questions n: number of people p: noise probability a: tangle agreement

additional noise questions

(44)

10 20 30 40 50 60 70 80 90 100

size of the smallest mindset|V2| 0.25 0.24 0.23 0.22 0.21 0.2 0.19 0.18 0.17 0.16 0.15 0.14 0.12 0.11 0.1 0.09 0.08 0.07 0.06 0.05 0.04 0.03 0.02 0.01 0.0 noise p 0.07 0.15 0.2 0.32 0.34 0.41 0.29 0.51 0.63 0.49 0.15 0.24 0.27 0.33 0.36 0.42 0.54 0.55 0.68 0.74 0.05 0.22 0.24 0.24 0.59 0.66 0.75 0.75 0.6 0.77 0.11 0.27 0.24 0.21 0.55 0.63 0.77 0.77 0.65 0.76 0.09 0.25 0.27 0.33 0.53 0.66 0.79 0.94 0.92 0.98 0.08 0.44 0.36 0.47 0.55 0.68 0.79 0.89 0.93 0.94 0.13 0.28 0.34 0.43 0.83 0.82 0.86 1 0.99 0.94 0.15 0.33 0.4 0.63 0.83 0.91 0.96 1 1 1 0.16 0.33 0.43 0.57 0.81 0.96 0.94 0.95 1 1 0.27 0.39 0.48 0.64 0.91 0.96 0.95 0.96 0.95 1 0.09 0.5 0.39 0.8 0.89 0.97 1 1 1 1 0.28 0.44 0.51 0.83 0.93 1 1 1 1 1 0.53 0.49 0.65 0.9 0.97 1 1 1 1 1 0.41 0.74 0.72 0.96 1 1 1 1 1 1 0.3 0.65 0.77 1 1 1 1 1 1 1 0.5 0.58 0.88 1 1 1 1 1 1 1 0.57 0.63 1 1 1 1 1 1 1 1 0.7 0.6 1 1 1 1 1 1 1 1 0.46 0.9 1 1 1 1 1 1 1 1 0.65 0.97 1 1 1 1 1 1 1 1 0.77 0.92 1 1 1 1 1 1 1 1 0.87 0.96 1 1 1 1 1 1 1 1 0.93 0.98 1 1 1 1 1 1 1 1 0.87 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0.2 0.4 0.6 0.8 1.0

(45)

The same goes for the SBM

18 10 20 30 40 50 60 70 80 90 100 Block size|Vi| 100 90 80 70 60 50 40 30 20 10 Agreemen t a 0 0 0 0 0 0 0 0 0 0.9 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0.9 1 1 1 0 0 0 0 0 0.7 1 1 1 1 0 0 0 0 0.57 0.99 1 1 1 0.95 0 0 0 0.25 0.97 0.99 1 1 0.8 0.7 0 0 0 0.95 0.97 0.99 0.85 0.65 0.8 0.7 0 0 0.86 0.87 0.53 0.59 0.8 0.65 0.76 0.65 0 0.74 0.52 0.45 0.35 0.51 0.67 0.61 0.76 0.64 0.0 0.2 0.4 0.6 0.8 1.0

(46)

Visualizing tangles

19

(47)

Visualizing tangles

19

(48)

Visualizing tangles

19

(49)

Visualizing tangles

19

(50)

Visualizing tangles

19

(51)

Dendrogram

(52)

Dendrogram

(53)
(54)

Thank you!

References

Related documents

The Solar Progress Partnership came together in recognition of the ongoing and future value of clean Distributed Energy Resources (“DER”) to New York, with the goal of sharing

This document is the Service Level Agreement between the Secure File Transfer Service Review Board and HSCIC SSD Service Management for the support and delivery of the Secure

Upon examining own volatility dependency for the three major sectors, namely Service, Industrial and Banking, in four GCC economies (Kuwait, Qatar, Saudi Arabia and UAE),

The melt processing behavior (MFR and melt viscosities) of the composite blends were determined using a 15 kg load in order to get a reasonable flow rate. Composites made

The findings can be summarized as: the relationships between connection to the profession and emotional exhaustion with the profession was partially mediated by

First, we will annotate the “Doing Good” curriculum to identify research and interventions from positive psychology that promote resilience and align with topics and program

Still unsolved, methodologies that use a common description tool for all stakeholders implied, summarizing technological, functional and non-functional information

j. Unlike in the cases of death in custody, the system does not have to provide for an investigation initiated by the state, but may include such investigation. The question in