(1)

A mixture model for random graphs

J-J Daudin, F. Picard, S. Robin

UMR INA-PG / ENGREF / INRA, Paris, Mathématique et Informatique Appliquées

Examples of networks.

Social: who knows who?

Biological: which protein interacts with which?

(2)

Random graphs

Notation and definition. Given a set of n vertices (i = 1..n), $X_{ij}$ indicates the presence/absence of a (non-oriented) edge between vertices i and j:

$X_{ij} = X_{ji} = I\{i \leftrightarrow j\}, \quad X_{ii} = 0.$

The random graph is defined by the joint distribution of all the $\{X_{ij}\}_{i,j}$.

Typical characteristics.

Degree (connectivity) of the vertices: $K_i = \sum_{j \neq i} X_{ij}$

Clustering coefficient: $c = \Pr\{X_{jk} = 1 \mid X_{ij} = X_{ik} = 1\}$
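
The two characteristics above can be computed directly from an adjacency matrix. The sketch below is only an illustration: the function names are hypothetical, and the empirical clustering coefficient is taken to be the usual transitivity ratio (closed 'V' patterns over all 'V' patterns), which is one common empirical counterpart of the conditional probability c, not necessarily the estimator used by the authors.

```python
import numpy as np

def degrees(X):
    """K_i = sum over j != i of X_ij (X symmetric, 0/1, zero diagonal)."""
    return X.sum(axis=1)

def empirical_clustering(X):
    """Share of 'V' patterns (i linked to both j and k) closed into triangles."""
    K = degrees(X)
    closed = np.trace(X @ X @ X)   # ordered triples (i, j, k) forming a triangle
    vees = np.sum(K * (K - 1))     # ordered pairs (j, k) of distinct neighbours of i
    return closed / vees if vees > 0 else float("nan")

# Toy graph (assumption): a triangle (0, 1, 2) plus a pendant vertex 3 linked to 2.
X = np.zeros((4, 4), dtype=int)
for i, j in [(0, 1), (0, 2), (1, 2), (2, 3)]:
    X[i, j] = X[j, i] = 1

K = degrees(X)
c = empirical_clustering(X)
```

On this toy graph there is one triangle and five 'V' patterns, so the ratio is 3/5.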

(3)

Erdős-Rényi (ER) model

Definition. The $\{X_{ij}\}_{i,j}$ are i.i.d.:

$X_{ij} \sim B(p).$

Characteristics.

Degree: $K_i \sim B(n-1, p) \approx P(\lambda)$, with $\lambda = (n-1)p$

Clustering coefficient: $c = p$
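
As an illustration of these characteristics, an ER graph can be simulated and its mean degree compared with (n − 1)p. This is a minimal sketch; the function name `sample_er` and the values of n and p are assumptions chosen for the example.

```python
import numpy as np

def sample_er(n, p, rng):
    """Draw a symmetric 0/1 adjacency matrix with i.i.d. B(p) edges, zero diagonal."""
    upper = np.triu(rng.random((n, n)) < p, k=1)   # one draw per pair i < j
    return (upper | upper.T).astype(int)

rng = np.random.default_rng(0)
n, p = 500, 0.02            # illustrative values, not taken from the slides
X = sample_er(n, p, rng)
K = X.sum(axis=1)           # degrees K_i
lam = (n - 1) * p           # Poisson parameter of the approximate degree law
print(K.mean(), lam)
```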

Drawback. The ER model fits many real-world networks poorly.

• Empirical degree distributions are often very different from the Poisson distribution, because a few vertices have very high degrees.

(4)

Erdős-Rényi mixture for graphs (ERMG)

An explicit random graph model

Mixture population of vertices. We suppose that the vertices belong to Q groups:

$\alpha_q = \Pr\{i \in q\}, \quad Z_{iq} = I\{i \in q\}.$

Conditional distribution of the edges. The edges $\{X_{ij}\}$ are conditionally independent given the groups of the vertices:

$X_{ij} \mid \{i \in q, j \in \ell\} \sim B(\pi_{q\ell}).$

$\pi_{q\ell} = \pi_{\ell q}$ is the connection probability between groups q and ℓ.
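
The two-stage definition (draw the groups, then draw the edges given the groups) translates into a short simulation. The sketch below is one possible way to sample from the model; the function name `sample_ermg` and the parameter values are assumptions made for the example.

```python
import numpy as np

def sample_ermg(n, alpha, pi, rng):
    """Sample group labels with proportions alpha, then edges X_ij ~ B(pi[q, l])."""
    groups = rng.choice(len(alpha), size=n, p=alpha)   # group of each vertex
    P = pi[np.ix_(groups, groups)]                     # P[i, j] = pi[group_i, group_j]
    upper = np.triu(rng.random((n, n)) < P, k=1)       # one draw per pair i < j
    X = (upper | upper.T).astype(int)                  # symmetric, zero diagonal
    return X, groups

rng = np.random.default_rng(0)
alpha = np.array([0.5, 0.5])            # illustrative proportions
pi = np.array([[0.60, 0.05],
               [0.05, 0.60]])           # illustrative symmetric connection matrix
X, groups = sample_ermg(100, alpha, pi, rng)
```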

(5)

Some properties of the ERMG model

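Conditional distribution of the degrees:

$K_i \mid \{i \in q\} \sim B(n-1, \bar\pi_q) \approx P(\lambda_q)$

where $\bar\pi_q = \sum_\ell \alpha_\ell \pi_{q\ell}$, $\lambda_q = (n-1)\bar\pi_q$.

Marginal distribution of the degrees: we get (approximately) a Poisson mixture

$K_i \sim \sum_q \alpha_q B(n-1, \bar\pi_q) \approx \sum_q \alpha_q P(\lambda_q).$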
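
The Poisson mixture for the degrees can be evaluated numerically from α and π. The following is a sketch under the assumption that all λq are strictly positive; the function names and the parameter values are hypothetical.

```python
import numpy as np
from math import lgamma

def poisson_pmf(k, lam):
    """Poisson probabilities exp(-lam) * lam**k / k!, computed on the log scale."""
    k = np.asarray(k, dtype=float)
    log_fact = np.array([lgamma(v + 1.0) for v in k.ravel()]).reshape(k.shape)
    return np.exp(k * np.log(lam) - lam - log_fact)

def degree_mixture_pmf(k, n, alpha, pi):
    """Approximate marginal law of K_i: sum over q of alpha_q * P(lambda_q)."""
    pi_bar = pi @ alpha            # pi_bar_q = sum_l alpha_l * pi_ql
    lam = (n - 1) * pi_bar         # lambda_q = (n - 1) * pi_bar_q
    return sum(a * poisson_pmf(k, l) for a, l in zip(alpha, lam))

n = 100                            # illustrative values
alpha = np.array([0.3, 0.7])
pi = np.array([[0.50, 0.05],
               [0.05, 0.10]])
k = np.arange(0, 200)
pmf = degree_mixture_pmf(k, n, alpha, pi)
```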

(6)

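Between-group connectivity. $A_{q\ell}$ denotes the connectivity between groups q and ℓ:

$A_{q\ell} = \sum_{i<j} Z_{iq} Z_{j\ell} X_{ij}.$

In the ERMG model, its expectation is

$E(A_{q\ell}) = \frac{n(n-1)}{2} \alpha_q \alpha_\ell \pi_{q\ell}.$

Clustering coefficient:

$c = \Pr\{\nabla \cap V\} / \Pr\{V\} = \Pr\{\nabla\} / \Pr\{V\}.$

In the ERMG model, we get

$c = \frac{\sum_{q,\ell,m} \alpha_q \alpha_\ell \alpha_m \pi_{q\ell} \pi_{qm} \pi_{\ell m}}{\sum_{q,\ell,m} \alpha_q \alpha_\ell \alpha_m \pi_{q\ell} \pi_{qm}}.$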
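
Both formulas of this slide are plain sums over group indices, so they can be checked numerically. A minimal sketch, with hypothetical function names, using `numpy.einsum` for the triple sums:

```python
import numpy as np

def expected_between_group(n, alpha, pi):
    """E(A_ql) = n (n - 1) / 2 * alpha_q * alpha_l * pi_ql, for all (q, l)."""
    return n * (n - 1) / 2 * np.outer(alpha, alpha) * pi

def ermg_clustering(alpha, pi):
    """Ratio of the triangle probability to the 'V' probability under the ERMG."""
    num = np.einsum("q,l,m,ql,qm,lm->", alpha, alpha, alpha, pi, pi, pi)
    den = np.einsum("q,l,m,ql,qm->", alpha, alpha, alpha, pi, pi)
    return num / den

# With a single group (Q = 1) the ratio reduces to pi_11, i.e. the ER value c = p.
c_er = ermg_clustering(np.array([1.0]), np.array([[0.3]]))
```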

(7)

Independent model

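The absence of preferential connection between groups corresponds to the case where

$\pi_{q\ell} = \eta_q \eta_\ell.$

Distribution of degrees: $\{K_i \mid i \in q\} \sim P(\lambda_q)$,

where $\lambda_q = (n-1)\eta_q \bar\eta$, $\bar\eta = \sum_\ell \alpha_\ell \eta_\ell$.

Between-group connectivity: $E(A_{q\ell}) = n(n-1)(\alpha_q \eta_q)(\alpha_\ell \eta_\ell)/2$.

Clustering coefficient: $c = \left(\sum_q \alpha_q \eta_q^2\right)^2 / \bar\eta^2$.

The ER model corresponds to

$Q = 1, \quad \alpha_1 = 1, \quad \eta = \eta_1 = \sqrt{p},$

so we get the known result: $c = \eta_1^4 / \eta_1^2 = p$.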

(8)

Examples

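Description | Q | π | Clustering coefficient

Random | 1 | $p$ | $p$

Independent model (product connectivity) | 2 | $\begin{pmatrix} a^2 & ab \\ ab & b^2 \end{pmatrix}$ | $\frac{(a^2+b^2)^2}{(a+b)^2}$

Stars | 4 | $\begin{pmatrix} 0 & 1 & 0 & 0 \\ 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 \\ 0 & 0 & 1 & 0 \end{pmatrix}$ | $0$

Clusters (affiliation networks) | 2 | $\begin{pmatrix} 1 & \varepsilon \\ \varepsilon & 1 \end{pmatrix}$ | $\frac{1+3\varepsilon^2}{(1+\varepsilon)^2}$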

(9)

Scale free network model. (Barabási & Albert, 99)

The network is built iteratively: the i-th vertex joining the network connects to one of the (i − 1) preceding ones with probability proportional to its current degree (busy gets busier):

$\forall j < i, \quad \Pr\{i \leftrightarrow j\} \propto K_j^i.$

The limit marginal distribution of the degrees is then scale free: $p(k) \propto k^{-3}$.
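
The iterative construction can be sketched as a simulation. The version below is a simplified illustration: each new vertex creates exactly one edge, and the process is started from a single edge between vertices 0 and 1, which is an assumption since the slide does not specify the initial condition. The function name is hypothetical.

```python
import numpy as np

def preferential_attachment(n, rng):
    """Grow a graph where vertex i links to one earlier vertex j with Pr ∝ degree of j."""
    degree = np.zeros(n, dtype=int)
    edges = [(0, 1)]                  # assumed seed graph: one edge between 0 and 1
    degree[0] = degree[1] = 1
    for i in range(2, n):
        probs = degree[:i] / degree[:i].sum()   # proportional to current degrees
        j = int(rng.choice(i, p=probs))         # chosen among the i preceding vertices
        edges.append((j, i))
        degree[i] += 1
        degree[j] += 1
    return degree, edges

degree, edges = preferential_attachment(1000, np.random.default_rng(0))
```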

Analogous modeling with the independent ERMG. At time q, $n_q = n\alpha_q$ vertices join the network. They preferentially connect to the oldest vertices:

$\pi_{q\ell} = \eta_q \eta_\ell, \quad \eta_1 \geq \eta_2 \geq \cdots \geq \eta_q \geq \ldots$

(10)

Maximum likelihood estimation via E-M

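We denote $X = \{X_{ij}\}_{i,j=1..n}$, $Z = \{Z_{iq}\}_{i=1..n,\,q=1..Q}$.

Likelihood

The conditional expectation of the complete-data log-likelihood is

$Q(X) = E\{L(X, Z) \mid X\} = \sum_i \sum_q \tau_{iq} \log \alpha_q + \sum_i \sum_q \sum_{j>i} \sum_\ell \theta_{ijq\ell} \log b(X_{ij}; \pi_{q\ell}),$

where $\tau_{iq}$ and $\theta_{ijq\ell}$ are posterior probabilities

$\tau_{iq} = \Pr\{Z_{iq} = 1 \mid X\}, \quad \theta_{ijq\ell} = \Pr\{Z_{iq} Z_{j\ell} = 1 \mid X\}.$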

Evaluating these probabilities is not straightforward because the {Ziq} are all

(11)

E step. We approximate the conditional joint distribution of the $\{Z_{iq}\}$:

$\Pr\{Z \mid X\} \simeq \prod_i \Pr\{Z_i \mid X, Z^i\}$ where $\Pr\{Z_{iq} = 1 \mid X, Z^i\} \propto \alpha_q \prod_m b(C_i^m; N_i^m, \pi_{qm}).$

• The elements of $Z^i$ are estimated by their conditional expectation: $\hat Z_{j\ell} = \tau_{j\ell}$.

• The posterior probabilities $\tau_{iq}$ must therefore satisfy

$\hat\tau_{iq} = \Pr\{Z_{iq} = 1 \mid X, \hat Z^i\},$

which is actually a fixed-point relation. The $\hat\tau_{iq}$ are obtained by iterating it.

M step. Maximizing Q(X) subject to $\sum_q \alpha_q = 1$ gives

$\hat\alpha_q = \sum_i \hat\tau_{iq} / n, \quad \hat\pi_{q\ell} = \sum_i \sum_j \hat\theta_{ijq\ell} X_{ij} \Big/ \sum_i \sum_j \hat\theta_{ijq\ell}.$
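
The E and M steps above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: $\hat\theta_{ijq\ell}$ is replaced by the product $\hat\tau_{iq}\hat\tau_{j\ell}$ for i ≠ j; $C_i^m$ and $N_i^m$ are read as the expected number of neighbours of i in group m and the expected number of other vertices in group m; the binomial coefficient is dropped because it does not depend on q; all vertices are updated simultaneously in the fixed-point iteration. The function name, the random initialisation and the iteration counts are hypothetical choices.

```python
import numpy as np

def ermg_em(X, Q, n_iter=50, n_fixed_point=20, seed=0, eps=1e-10):
    """Approximate EM for the ERMG model on a symmetric 0/1 matrix X."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    tau = rng.dirichlet(np.ones(Q), size=n)     # random initial posteriors
    off_diag = 1.0 - np.eye(n)                  # excludes the pairs i = j
    for _ in range(n_iter):
        # M step: proportions and connection probabilities given tau.
        alpha = tau.mean(axis=0)
        num = tau.T @ X @ tau                   # sum_ij tau_iq tau_jl X_ij
        den = tau.T @ off_diag @ tau            # sum_{i != j} tau_iq tau_jl
        pi = np.clip(num / np.maximum(den, eps), eps, 1 - eps)
        # E step: iterate the fixed-point relation on tau.
        for _ in range(n_fixed_point):
            C = X @ tau                         # C[i, m]: expected neighbours in m
            N = off_diag @ tau                  # N[i, m]: expected others in m
            log_tau = (np.log(alpha + eps)
                       + C @ np.log(pi).T
                       + (N - C) @ np.log(1 - pi).T)
            log_tau -= log_tau.max(axis=1, keepdims=True)   # numerical stability
            tau = np.exp(log_tau)
            tau /= tau.sum(axis=1, keepdims=True)
    return alpha, pi, tau

# Toy graph (assumption): two disjoint cliques of 10 vertices each.
X = np.zeros((20, 20), dtype=int)
X[:10, :10] = 1
X[10:, 10:] = 1
np.fill_diagonal(X, 0)
alpha, pi, tau = ermg_em(X, Q=2)
```

Like any EM-type procedure, this sketch may stop at a local optimum, so several random initialisations would normally be compared.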

(12)

Choice of the number of groups

We propose a heuristic penalized likelihood criterion inspired by BIC. Since Q(X) is the sum of

• $\sum_i \sum_q \tau_{iq} \log \alpha_q$, which deals with (Q − 1) independent proportions $\alpha_q$ and involves n terms,

• $\sum_i \sum_q \sum_{j>i} \sum_\ell \theta_{ijq\ell} \log b(X_{ij}; \pi_{q\ell})$, which deals with Q(Q + 1)/2 probabilities $\pi_{q\ell}$ and involves n(n − 1)/2 terms,

we propose the following heuristic criterion:

$-2Q(X) + (Q-1) \log n + \frac{Q(Q+1)}{2} \log \frac{n(n-1)}{2}.$
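
The criterion is a direct function of Q(X), the number of groups and n. A minimal sketch, with hypothetical argument names (`q_value` stands for the value of Q(X) and `n_groups` for Q):

```python
import numpy as np

def pseudo_bic(q_value, n_groups, n):
    """-2 Q(X) + (Q - 1) log n + Q (Q + 1) / 2 * log(n (n - 1) / 2)."""
    penalty = ((n_groups - 1) * np.log(n)
               + n_groups * (n_groups + 1) / 2 * np.log(n * (n - 1) / 2))
    return -2.0 * q_value + penalty
```

In practice one would evaluate it for a range of values of Q, each with its own fitted Q(X), and retain the Q giving the smallest value.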

(13)

Application to Karate Club Data

• n = 34 members (vertices) of a Karate club

• Two members are connected if they have social interactions (apart from their sporting activity)

• 156 edges.

This dataset (Zachary, 77) has been intensively studied in the literature, generally with Q = 4 groups.

Parameter estimates.

$\hat\alpha$ (%): 5.9 8.9 36.8 48.4

$\hat\pi$ (%):
100 16.5 6.8 73.8
16.5 100 52.9 16.0
6.8 52.8 12.3 0.0
73.8 16.0 0.0 7.8

$\hat\lambda$: 15.0 12.2 3.2 3.2
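
The reported mean degrees can be recovered from the other estimates through $\lambda_q = (n-1)\sum_\ell \alpha_\ell \pi_{q\ell}$. The sketch below simply plugs in the values of the slide (variable names are hypothetical):

```python
import numpy as np

n = 34
alpha_hat = np.array([5.9, 8.9, 36.8, 48.4]) / 100
pi_hat = np.array([[100.0, 16.5,  6.8, 73.8],
                   [ 16.5, 100.0, 52.9, 16.0],
                   [  6.8, 52.8, 12.3,  0.0],
                   [ 73.8, 16.0,  0.0,  7.8]]) / 100

# lambda_q = (n - 1) * sum_l alpha_l * pi_ql
lambda_hat = (n - 1) * (pi_hat @ alpha_hat)
print(np.round(lambda_hat, 1))
```

Rounded to one decimal, this gives 15.0, 12.2, 3.2 and 3.2, in agreement with the reported $\hat\lambda$.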

(14)

Dot-plot representation of the graph. A dot is present when $X_{ij} = 1$.

The vertices are re-ordered according to their 'mean group number':

$\hat q_i = \sum_q q\, \hat\tau_{iq}$

[Figure: dot-plot of the re-ordered graph, vertices 0 to 35 on both axes.]

Posterior probabilities $\hat\tau_{iq}$.

[Figure: posterior probabilities (0 to 1) against the re-ordered vertices; group sizes 2, 3, 13, 16.]
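
The re-ordering by 'mean group number' amounts to a weighted average of the group labels followed by a sort. A minimal sketch on a made-up matrix of posterior probabilities (the function name and the values are assumptions):

```python
import numpy as np

def mean_group_number(tau):
    """q_hat_i = sum over q of q * tau_iq, with groups numbered 1..Q."""
    Q = tau.shape[1]
    return tau @ np.arange(1, Q + 1)

tau = np.array([[0.1, 0.9],
                [1.0, 0.0],
                [0.5, 0.5]])                      # illustrative posteriors
q_hat = mean_group_number(tau)
order = np.argsort(q_hat, kind="stable")          # vertex order used for the dot-plot
```

The re-ordered adjacency matrix is then `X[np.ix_(order, order)]` for an adjacency matrix `X` with one row per vertex.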

(15)

Interpretation of the groups

• 2 persons, including the administrator, strongly connected with group 4, but not with groups 2 and 3;

• 3 persons including the instructor, strongly connected with group 3, but not with groups 1 and 4;

• 13 ’ordinary’ members, connected with the instructor;

• 16 ’ordinary’ members, connected with the administrator.

End of the story. The instructor (group 2) finally left the club and started another one with about half of the members (corresponding to group 3?).

(16)

Selection of the number of groups. The pseudo-BIC actually selects Q = 6 groups.

Comparison with the 4-group model. Former groups 1 and 4 are conserved. Former groups 2 and 3 are each divided into two new groups.

We do not know whether the new club lasted very long...

[Figure: dot-plot of the re-ordered graph and posterior probabilities $\hat\tau_{iq}$ for Q = 6; group sizes 1, 2, 3, 5, 7, 16.]

(17)

Application to E. coli reaction network

• n = 605 vertices (reactions) and 1 782 edges.

• Two reactions i and j are connected if the product of i is the substrate of j (or conversely).

• Data provided by V. Lacroix and M.-F. Sagot (INRIA Hélix).

Number of groups. Pseudo-BIC selects Q = 21.

[Figure: group proportions $\hat\alpha_q$ (%) by group.]

(18)

Dot-plot representation of the graph.

Biological interpretation: each of groups 1 to 20 gathers reactions that all involve the same compound, either as a substrate or as a product.

A compound (pyruvate, ATP, etc.) can be associated with each group.

[Figure: dot-plot of the re-ordered graph, vertices 0 to 600 on both axes.]

Posterior probabilities $\hat\tau_{iq}$.

[Figure: posterior probabilities (0 to 1) against the re-ordered vertices; group sizes 4, 6, 7, 8, 8, 9, 9, 10, 11, 11, 12, 13, 14, 15, 16, 17, 18, 18, 19, 35, 345.]

(19)

Zoom (bottom left). Submatrix of π for groups q, ℓ in {1, 9, 10, 16}, one row per group:

1: 1.0
9: .11 .65
10: .43 .67
16: 1.0 .01 1.0

[Figure: vertex degrees $K_i$.]

Mean degree in the last group: $\bar K_{21} = 2.6$

(20)

Distribution of the degree. According to the ERMG, the degrees have a Poisson mixture distribution.

[Figure: histogram of the degrees with the fitted mixture distribution, and P-P plot.]

Clustering coefficient. Empirical, ERMG (Q = 6), ERMG (Q = 21), ER (Q = 1).

(21)

Reaction graph. Group number (group size): 1 (4), 2 (6), 3 (7), 4 (8), 5 (8), 6 (9), 7 (9), 8 (10), 9 (11), 10 (11), 11 (12), 12 (13), 13 (14), 14 (15), 15 (16), 16 (17), 17 (18), 18 (18), 19 (19), 20 (35), 21 (345).

(22)

Conclusions

Past.

• The ERMG model is a flexible generalization of the ER model and a promising alternative to the scale-free 'model'.

• It seems to fit several real-world networks well.

• It is properly defined, so its properties can be properly studied.

Future.

• Study the probabilistic properties of the ERMG model (diameter, probability for a subgraph to be connected, etc).

• Derive a relevant criterion to select the number of groups.

• Extension to valued graphs: Xij not only 0/1, but some measure of the
