• No results found

Testing Correlation of Unlabeled Random Graphs

N/A
N/A
Protected

Academic year: 2021

Share "Testing Correlation of Unlabeled Random Graphs"

Copied!
36
0
0

Loading.... (view fulltext now)

Full text

(1)

Testing Correlation of Unlabeled Random Graphs

Jiaming Xu

The Fuqua School of Business Duke University Joint work with

Yihong Wu (Yale) and Sophie H. Yu (Duke)

Computational and Methodological Statistics December 21, 2020

(2)

Graph isomorphism

Given two graphs A and B, decide whether A ∼= B, i.e., there exists a bijection π : V (A)→ V (B) such that

(u, v)∈ E(A) ⇔ (π(u), π(v)) ∈ E(B)

a b c d g h i j 8 5 6 7 1 4 2 3 π(a) = 1 π(b) = 6 π(c) = 8 π(d) = 3 π(g) = 5 π(h) = 2 π(i) = 4 π(j) = 7

• Not known to be solvable in polynomial time in worst case

• In practice, two graphs are often not exactly isomorphic, but still want to tell whethertheir topologies are similar

(3)

Graph isomorphism

Given two graphs A and B, decide whether A ∼= B, i.e., there exists a bijection π : V (A)→ V (B) such that

(u, v)∈ E(A) ⇔ (π(u), π(v)) ∈ E(B)

a b c d g h i j 8 5 6 7 1 4 2 3 π(a) = 1 π(b) = 6 π(c) = 8 π(d) = 3 π(g) = 5 π(h) = 2 π(i) = 4 π(j) = 7

• Not known to be solvable in polynomial time in worst case

• In practice, two graphs are often not exactly isomorphic, but still want to tell whethertheir topologies are similar

(4)

Graph isomorphism

Given two graphs A and B, decide whether A ∼= B, i.e., there exists a bijection π : V (A)→ V (B) such that

(u, v)∈ E(A) ⇔ (π(u), π(v)) ∈ E(B)

a b c d g h i j 8 5 6 7 1 4 2 3 π(a) = 1 π(b) = 6 π(c) = 8 π(d) = 3 π(g) = 5 π(h) = 2 π(i) = 4 π(j) = 7

(5)

Two key challenges

• Statistical: two graphs may be correlated but not exactly isomorphic

• Computational: # of possible node mappings is n! (100!≈ 10158)

• • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • •• • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • •

(6)

Motivation: Protein-protein interaction network

Kazemi et al. BMC Bioinformatics (2016) 17:527 Page 6 of 16

proposed in [59]. ANBS for two graphs G

1

(V

1

, E

1

) and

G

2

(V

2

, E

2

) under the alignment π is defined as follows.

ANBS(π) =

|V

1

|

−1



i∈V1(π)

BlastBit

(i, π(i))

BlastBit

(i, i)BlastBit(π(i), π(i))

.

Pathway comparison measures

In order to evaluate the performance of algorithms in

aligning biological pathways, we introduce a new

mea-sure in this section. This meamea-sure captures the quality of

alignments based on a higher level of functional and

struc-tural similarities (beyond the introduced measures such as

the similarity of GO terms and the number of conserved

interactions).

It is known that there are many biological pathways

with similar functions in different species [12]. The KEGG

PATHWAY database [60] provides a set of experimentally

found biological pathways. In this database, a pathway is

called by the name of a species (e.g., hsa for Homo

sapi-ens), followed by a number. The pathways with the same

number have the same function in different species. For

example, hsa03040, mmu03040, dme03040 and sce03040

are in Homo sapiens (human), Mus musculus (mouse),

Drosophila melanogaster (fruit fly) and Saccharomyces

cerevisiae (budding yeast), respectively. These pathways

have the same functions.

2

Assume PWi

,1

denotes the set of

proteins from a pathway with number i in the PPI network

of the first species (i.e., G

1

). Similarly, we define PW

i,2

. For

pathway i,

π,i

denotes the number of conserved

inter-actions between the proteins in this pathway under the

alignment

π, i.e., π,i

= EG1

[PWi,1]

∩ π

−1

(EG2

[PWi,2]

). Note

that we are looking for pathways that are present in both

aligned species.

We say a protein u from a pathway is aligned correctly, if

it is mapped to a protein v from a pathway with the same

function. For pathway i, we define the number of correctly

mapped proteins as

|PWi

,1

∩ π

−1

(PWi

,2

)|. This measure

corresponds to the number of proteins that, from

path-way i in the first species, are mapped to a protein from

the same pathway in the second species. For pathway i, we

define the accuracy as

acc

π,i

=

2

|PWi

,1

∩ π

−1

(PWi

,2

)|

|PWi

,1

| + |PWi

,2

|

.

(2)

This measure corresponds to the fraction of correctly

mapped proteins in pathway i.

We conjecture that a good alignment algorithm should

align proteins from pathways with the same functions

across species, and many interactions among these

pro-an alignment

π successfully aligns a pathway i, if there are

at least

δ conserved interactions under the alignment π

for proteins in that pathway, i.e., if



π,i

≥ δ. This

thresh-olding guarantees that the structural similarity of aligned

pathways are more than a minimum value (here,

δ

con-served interactions). To evaluate the performance of an

algorithm based on this thresholding criterion, we define

a set of measures as follows.

1. We consider pathways with at least

δ (say δ ≥ 2)

interactions in each of the species. Let “#PW

δ

denote the number of such pathways.

2. Alignment

π successfully aligns pathway i, if



π,i

≥ δ. The variable “#FPW

δ

” refers to the number

of successfully aligned pathways. We define the recall

as

recall

π,δ

=

#FPW

δ

#PW

δ

.

(3)

3. Again, for a correctly aligned pathway i, we define

acc

π,δ,i

similar to (2).

The averages over all i of all the acc

π,i

and acc

π,δ,i

values

are represented by acc

π

and acc

π,δ

, respectively. Figure 2

provides a toy example of how to calculate the pathway

alignment measures.

Results

In this section, we compare PROPER with the main

state-of-the-art network alignment algorithms, specifically (i)

with L-GRAAL as the most recent member of GRAAL

family that takes into account both sequence and

struc-tural similarities [23]; (ii) with MAGNA++ that tries to

maximize one of the EC, ICS or S

3

measures [33, 34]

(In our experiments we run MAGNA++ in two different

Fig. 2 In this figure, two example PPI networks are given. Green nodes

are proteins which are in the same pathway (i.e., a pathway with the same number in both species). Dotted lines represent the alignmentπ between these two networks. Under this alignment, there are five conserved interactions between proteins in this pathway (shown by

thick black edges in each network). Also, the number of correctly [Kazemi-Hassani-Grossglauser-Modarres ’16]

Ontology: Assess the correlation of two biological networks in two different species based on network topology

(7)

Motivation: Deformable shape matching

A fundamental problem in computer vision: Detect similar objects that undergo different deformations

Shape REtrieval Contest (SHREC) dataset

3-D shapes → geometric graphs (features → nodes, distances → edges) Determine whether two graphs are similar in topologies

(8)

Motivation: Deformable shape matching

A fundamental problem in computer vision: Detect similar objects that undergo different deformations

(9)

Beyond worst-case: Noisy graph isomorphism

Definition (Hypothesis testing)

Suppose we observe two random graphs A and B: H0 : A and B areindependent

H1 : A and Bπ = (Bπ(i)π(j)) are edge-correlated

conditional on a uniform permutation π

Goal: Test H0 versus H1

• UnderH1, the inherent edge correlation is obscured by the latent

node correspondence π

• Hypothesis testing aspect of graph matching(recover π under H1) • The test needs to rely on graph invariants(invariant under graph

isomorphism), such as

I Subgraph counts (e.g. # of edges or triangles)

(10)

Gaussian setting

A and B are twoGaussian-weighted complete graphs

Definition (Gaussian Wigner model)

H0 : (Aij, Bij)i.i.d.∼ N  (00) , (1 00 1)  H1 : Aij, Bπ(i)π(j) i.i.d. ∼ N(0 0) ,  1 ρ ρ 1   conditional on uniform π

• Under bothH0 andH1, A and B are Gaussian Wigner matrices

marginally

(11)

Erd˝

os-R´

enyi setting

Definition (Erd˝os-R´enyi graphs model [Barak-Chou-Lei-Schramm-Sheng’19])

H0 : A and B are independent G(n, ps)

H1 : A and Bπ = (Bπ(i)π(j)) are independently edge-sampled from

a common parent graph G(n, p) with sampling probability s

• Under bothH0 andH1, A and B areG(n, ps) marginally

• UnderH1, Aij, Bπ(i)π(j) are correlated Bern(ps) with correlation

(12)

Sharp threshold for detection: Gaussian

Q and P: probability measure under H0 andH1

Theorem (Wu-X.-Yu ’20) TV (P, Q) = ( 1− o(1) if ρ2≥ 4log nn−1 o(1) if ρ2≤ (4 − )log nn 1 lim n→∞TV (P, Q) 0 lim n→∞ nρ2 log n 4 weak detection impossible test error=1− o(1)

strong detection possible test error=o(1)

(13)

Sharp threshold for detection: Gaussian

Q and P: probability measure under H0 andH1

Theorem (Wu-X.-Yu ’20) TV (P, Q) = ( 1− o(1) if ρ2≥ 4log nn−1 o(1) if ρ2≤ (4 − )log nn 1 lim n→∞TV (P, Q) 0 lim n→∞ nρ2 log n 4 weak detection impossible test error=1− o(1)

strong detection possible test error=o(1)

(14)

Sharp threshold for detection: dense Erd˝

os-R´

enyi graphs

Theorem (Wu-X.-Yu ’20)

If s2 ≥ 2 log n

(n− 1)plog1p − 1 + p , then TV (P, Q) = 1 − o (1)

• (Dense regime) p = n−o(1): If s2 (2− ) log n

nplog1p − 1 + p ,

then TV (P, Q) = o(1)

Remark:

• Sharp phase transition at ns

2plog1 p−1+p



log n = 2

• p7→ p(log1

p − 1 + p) is not monotone and uniquely maximized at

(15)

Sharp threshold for detection: dense Erd˝

os-R´

enyi graphs

Theorem (Wu-X.-Yu ’20)

If s2 ≥ 2 log n

(n− 1)plog1p − 1 + p , then TV (P, Q) = 1 − o (1)

• (Dense regime) p = n−o(1): If s2 (2− ) log n

nplog1p − 1 + p ,

then TV (P, Q) = o(1)

Remark:

• Sharp phase transition at ns

2plog1 p−1+p



log n = 2

• p7→ p(log1

p − 1 + p) is not monotone and uniquely maximized at

(16)

Threshold for detection: sparse Erd˝

os-R´

enyi graphs

Theorem (Wu-X.-Yu ’20) If s2 ≥ 2 log n (n− 1)p  log1p − 1 + p  , then TV (P, Q) = 1 − o (1) • (Sparse regime) p = n−Ω(1): If s2< 1 np∧ 0.01, then TV (P, Q) = 1 − Ω(1) If s2< 1

np∧ o(1), then TV (P, Q) = o(1)

p = n−α for a constant α:

(17)

Threshold for detection: sparse Erd˝

os-R´

enyi graphs

Theorem (Wu-X.-Yu ’20) If s2 ≥ 2 log n (n− 1)p  log1p − 1 + p  , then TV (P, Q) = 1 − o (1) • (Sparse regime) p = n−Ω(1): If s2< 1 np∧ 0.01, then TV (P, Q) = 1 − Ω(1) If s2< 1

np∧ o(1), then TV (P, Q) = o(1)

p = n−α for a constant α:

(18)

What is known algorithmically

• Counting edges: achieve weak detection in linear time, if s = Ω(1)

• Counting “balanced” subgraphs[Barak-Chou-Lei-Schramm-Sheng’19]:

correctly tellH0 vs. H1 w.p. at least 0.9 in poly-time, if

s = Ω (1) and npshn, n1/153ihn2/3, n1−i

• Counting (weighted) trees – low-degree poly. approx. of P(A,B)Q(A,B)

[Mao-Wu-X.-Yu ’forthcoming]: achieve strong detection in poly-time, if

s2 > 1/2.956 and np≥ n−o(1)

Order-optimalwhen np = Θ(1)

(19)

What is known algorithmically

• Counting edges: achieve weak detection in linear time, if s = Ω(1)

• Counting “balanced” subgraphs[Barak-Chou-Lei-Schramm-Sheng’19]:

correctly tellH0 vs. H1 w.p. at least 0.9 in poly-time, if

s = Ω (1) and npshn, n1/153ihn2/3, n1−i

• Counting (weighted) trees – low-degree poly. approx. of P(A,B)Q(A,B)

[Mao-Wu-X.-Yu ’forthcoming]: achieve strong detection in poly-time, if

s2 > 1/2.956 and np≥ n−o(1)

Order-optimalwhen np = Θ(1)

(20)

What is known algorithmically

• Counting edges: achieve weak detection in linear time, if s = Ω(1)

• Counting “balanced” subgraphs[Barak-Chou-Lei-Schramm-Sheng’19]:

correctly tellH0 vs. H1 w.p. at least 0.9 in poly-time, if

s = Ω (1) and npshn, n1/153ihn2/3, n1−i

• Counting (weighted) trees – low-degree poly. approx. of P(A,B)Q(A,B)

[Mao-Wu-X.-Yu ’forthcoming]: achieve strong detection in poly-time, if

s2 > 1/2.956 and np≥ n−o(1)

Order-optimalwhen np = Θ(1)

(21)

What is known algorithmically

• Counting edges: achieve weak detection in linear time, if s = Ω(1)

• Counting “balanced” subgraphs[Barak-Chou-Lei-Schramm-Sheng’19]:

correctly tellH0 vs. H1 w.p. at least 0.9 in poly-time, if

s = Ω (1) and npshn, n1/153ihn2/3, n1−i

• Counting (weighted) trees – low-degree poly. approx. of P(A,B)Q(A,B)

[Mao-Wu-X.-Yu ’forthcoming]: achieve strong detection in poly-time, if

s2 > 1/2.956 and np≥ n−o(1)

Order-optimalwhen np = Θ(1)

(22)

Proof of positive results: QAP statistic

T (A, B) , max

π∈Sn

X

i<j

AijBπ(i)π(j) (edge correlation)

• This is known as Quadratic Assignment Problem

• Invariant to the node relabeling of both A and B

(23)

Proof of negative results: second-moment method

EQ "  P(A, B) Q(A, B) 2# = O(1) =⇒ TV(P, Q) ≤ 1 − Ω(1)

Strong detection is impossible

EQ " P(A, B) Q(A, B) 2# = 1 + o(1) =⇒ TV(P, Q) = o(1)

(24)

Second-moment calculation: cycle (orbit) decomposition

• Node permutation σ on [n] • Edge permutation σE on [n]2 : σE((i, j)) , (σ(i), σ(j)) E.g. n = 6 and σ = (1)(23)(456): σ: 1 2 3 4 5 6 σE: (2, 3) (1, 2) (1, 3) (1, 4) (1, 5) (1, 6) (4, 5) (5, 6) (4, 6) (2, 4) (3, 5) (2, 6) (3, 4) (2, 5) (3, 6)

(25)

Second-moment calculation: cycle (orbit) decomposition

• Node permutation σ on [n] • Edge permutation σE on [n]2 : σE((i, j)) , (σ(i), σ(j)) E.g. n = 6 and σ = (1)(23)(456): σ: 1 2 3 4 5 6 σE: (2, 3) (1, 2) (1, 3) (1, 4) (1, 5) (1, 6) (4, 5) (5, 6) (4, 6) (2, 4) (3, 5) (2, 6) (3, 4) (2, 5) (3, 6)

(26)

Second-moment calculation: cycle decomposition

 P(A, B) Q(A, B) 2 =  Eπ  P(A, B|π) Q(A, B) 2 = Eπ⊥⊥πe Y i<j Xij Xij , P(Bπ(i)π(j)|Aij )P(Bπ(i)e π(j)e |Aij) Q(Bπ(i)π(j))Q(Beπ(i)eπ(j)) = Eπ⊥⊥πe Y O∈O XO XO , Y (i,j)∈O Xij

O: disjoint orbits of edge permutation σE with σ , π−1

e π EQ " P(A, B) Q(A, B) 2# = Eeπ⊥⊥πEQ Y O∈O XO = Eeπ⊥⊥π Y O∈O EQ[XO] EQ[XO] = ( 1 1−ρ2|O| Gaussian

(27)

Second-moment calculation: cycle decomposition

 P(A, B) Q(A, B) 2 =  Eπ  P(A, B|π) Q(A, B) 2 = Eπ⊥⊥πe Y i<j Xij Xij , P(Bπ(i)π(j)|Aij )P(Bπ(i)e π(j)e |Aij) Q(Bπ(i)π(j))Q(Beπ(i)eπ(j)) = Eπ⊥⊥πe Y O∈O XO XO , Y (i,j)∈O Xij

O: disjoint orbits of edge permutation σE with σ , π−1

e π EQ " P(A, B) Q(A, B) 2# = Eeπ⊥⊥πEQ Y O∈O XO = Eeπ⊥⊥π Y O∈O EQ[XO] EQ[XO] = ( 1 1−ρ2|O| Gaussian

(28)

Second-moment calculation: cycle decomposition

 P(A, B) Q(A, B) 2 =  Eπ  P(A, B|π) Q(A, B) 2 = Eπ⊥⊥πe Y i<j Xij Xij , P(Bπ(i)π(j)|Aij )P(Bπ(i)e π(j)e |Aij) Q(Bπ(i)π(j))Q(Beπ(i)eπ(j)) = Eπ⊥⊥πe Y O∈O XO XO , Y (i,j)∈O Xij

O: disjoint orbits of edge permutation σE with σ , π−1

e π EQ " P(A, B) Q(A, B) 2# = Eeπ⊥⊥πEQ Y O∈O XO = Eeπ⊥⊥π Y O∈O EQ[XO]

(29)

Failure of second-moment

We show EQ "  P(A, B) Q(A, B) 2# = ( 1 + o(1) if ρ2 (2−) log nn + if ρ2 (2+) log nn

• Gaussian: suboptimal by a factor of 2

• ER graphs: suboptimal by an unbounded factor when p = o(1)

Obstruction from short orbits

E(A,B)∼Q "  P(A, B) Q(A, B) 2# = Eπ⊥⊥eπ " Y O∈O EQ[XO] # e π=π ≥ 1 n! 1 + ρ 2(n2)

Atypically large magnitude ofQ

O∈O:|O|=kXO for short orbits of length

(30)

Failure of second-moment

We show EQ "  P(A, B) Q(A, B) 2# = ( 1 + o(1) if ρ2 (2−) log nn + if ρ2 (2+) log nn

• Gaussian: suboptimal by a factor of 2

• ER graphs: suboptimal by an unbounded factor when p = o(1)

Obstruction from short orbits

E(A,B)∼Q "  P(A, B) Q(A, B) 2# = Eπ⊥⊥eπ " Y O∈O EQ[XO] # e π=π ≥ 1 n! 1 + ρ 2(n2)

Atypically large magnitude ofQ

O∈O:|O|=kXO for short orbits of length

(31)

Failure of second-moment

We show EQ "  P(A, B) Q(A, B) 2# = ( 1 + o(1) if ρ2 (2−) log nn + if ρ2 (2+) log nn

• Gaussian: suboptimal by a factor of 2

• ER graphs: suboptimal by an unbounded factor when p = o(1)

Obstruction from short orbits

E(A,B)∼Q "  P(A, B) Q(A, B) 2# = Eπ⊥⊥eπ " Y O∈O EQ[XO] # e π=π ≥ 1 n! 1 + ρ 2(n2)

Atypically large magnitude ofQ

O∈O:|O|=kXO for short orbits of length

(32)

Truncated second-moment: dense regime

It suffices to consider k = 1: Y O∈O:|O|=1 XO≈  1 p 2eA∧Bπ(F )

F : set of fixed points of σ , π−1◦eπ A∧ Bπ: Intersection graph

eA∧Bπ(F ): # of edges of subgraph of A∧ Bπ induced by F

• UnderP: eA∧Bπ(S) concentratesuniformly over all S when|S| is

large

• On this typical event E under P, when |F | is large,

 

"

(33)

Truncated second-moment: sparse regime

Need to consider k = Θ(log n). It can be shown

• Long orbits: EQ   Y |O|>k XO  ≤  1 + ρk n2 k = 1 + o(1)

• Shortincompleteorbits:

EQ[XO| O 6⊂ E (A ∧ Bπ)]≤ 1 • Shortcomplete orbits:

XO=

 1 p

2|O|

, ∀O ⊂ E (A ∧ Bπ)

In the subcritical regime nps2 < 1, A∧ Bπ ∼ G(n, ps2) is a pseduoforest

(34)

Truncated second-moment: orbit pseduoforest

OnE , {(A, B, π) : A ∧ Bπ is a pseudoforest} under P:

EQ " Y O∈O XO1{E} # ≤ (1 + o(1)) EQ "  1 p 2e(Hk) 1{Hkis a pseudoforest} # = (1 + o(1)) X H∈Hk

s2e(H) (generating function)

Hk: pseudoforests assembled from edge orbits of length at most k

(35)

Truncated second-moment: orbit pseduoforest

OnE , {(A, B, π) : A ∧ Bπ is a pseudoforest} under P:

EQ " Y O∈O XO1{E} # ≤ (1 + o(1)) EQ "  1 p 2e(Hk) 1{Hkis a pseudoforest} # = (1 + o(1)) X H∈Hk

s2e(H) (generating function)

Hk: pseudoforests assembled from edge orbits of length at most k

(36)

Concluding remarks

• Formulate the problem of testing network correlation and characterize the statistical detection limit

• The impossibility proof applies truncated second-moment method

• The sparse setting leverages the pseudoforest structure of subcritical Erd˝os-R´enyi graphs

• A large computational gap may exist between the statistical and computational limits

Open problem

• Sharp detection threshold in the sparse regime

• Prove the existence of or close the computational gaps

• Sharp partial recovery threshold Reference

References

Related documents

Intervention effects with respect to ‘ PTA ’ , ‘ INFO ’ , ‘ SAFED ’ , clinical patient outcomes, and health-related quality of life were estimated using general- ised

The translating throat nozzle model was tested at three expansion ratios (design points) to simulate configurations for low, intermediate, and high Mach number operating conditions

socialization perspective is one mainly of the success or failure of parents as economic socializers: if parents have encouraged good savings habits and an approach to economic

Application information and application forms for the Teagasc Distance Education Green Cert for Non Agricultural Award Holders will be available at www.teagasc.ie/ecollege from

The jurisdiction and discretion granted to the coastal state regarding it^s fishing resources should therefore be implemented into national legislation to the benefit of such

After successfully supporting the development of the wind power technology, an approach is needed to include the owners of wind turbines in the task of realizing other ways, other

When the results of the analysis were evaluated according to the Regulation of water pollution of Surface water quality management, the Kızılırmak river was

The over-all in- cidence of reported birth injury in upstate New York was 7.5 per 1,000 live births, more than three times the incidence in New York City1. The incidence of