Testing Correlation of Unlabeled Random Graphs

(1)

Testing Correlation of Unlabeled Random Graphs

Jiaming Xu

The Fuqua School of Business Duke University Joint work with

Yihong Wu (Yale) and Sophie H. Yu (Duke)

Computational and Methodological Statistics December 21, 2020

(2)

Graph isomorphism

Given two graphs A and B, decide whether A ∼= B, i.e., there exists a bijection π : V (A)_{→ V (B) such that}

(u, v)_{∈ E(A) ⇔ (π(u), π(v)) ∈ E(B)}

a b c d g h i j 8 5 6 7 1 4 2 3 π(a) = 1 π(b) = 6 π(c) = 8 π(d) = 3 π(g) = 5 π(h) = 2 π(i) = 4 π(j) = 7

• Not known to be solvable in polynomial time in worst case

• In practice, two graphs are often not exactly isomorphic, but still want to tell whethertheir topologies are similar

(3)

Graph isomorphism

(u, v)_{∈ E(A) ⇔ (π(u), π(v)) ∈ E(B)}

• Not known to be solvable in polynomial time in worst case

• In practice, two graphs are often not exactly isomorphic, but still want to tell whethertheir topologies are similar

(4)

Graph isomorphism

(u, v)_{∈ E(A) ⇔ (π(u), π(v)) ∈ E(B)}

(5)

Two key challenges

• Statistical: two graphs may be correlated but not exactly isomorphic

• Computational: # of possible node mappings is n! (100!≈ 10158₎

• • • • • • • • • • • _• • • • • • • • • • • • • • • • • • • • • • • • _• • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • •• • • • • • • • • • • • • • • • • • • • • • • • _• • • • • • • • _• • • • • • • • • • • • • _• • • • • • • • • • • • • • • • • • • • • • • • • • • • •

(6)

Motivation: Protein-protein interaction network

Kazemi et al. BMC Bioinformatics (2016) 17:527 Page 6 of 16

proposed in [59]. ANBS for two graphs G

1

(V

1

, E

1

) and

G

2

(V

2

, E

2

) under the alignment π is defined as follows.

ANBS(π) =

|V

1

|

−1

i∈V1(π)

BlastBit

(i, π(i))

√

BlastBit

(i, i)BlastBit(π(i), π(i))

.

Pathway comparison measures

In order to evaluate the performance of algorithms in

aligning biological pathways, we introduce a new

mea-sure in this section. This meamea-sure captures the quality of

alignments based on a higher level of functional and

struc-tural similarities (beyond the introduced measures such as

the similarity of GO terms and the number of conserved

interactions).

It is known that there are many biological pathways

with similar functions in different species [12]. The KEGG

PATHWAY database [60] provides a set of experimentally

found biological pathways. In this database, a pathway is

called by the name of a species (e.g., hsa for Homo

sapi-ens), followed by a number. The pathways with the same

number have the same function in different species. For

example, hsa03040, mmu03040, dme03040 and sce03040

are in Homo sapiens (human), Mus musculus (mouse),

Drosophila melanogaster (fruit fly) and Saccharomyces

cerevisiae (budding yeast), respectively. These pathways

have the same functions.

2

Assume PWi

,1

denotes the set of

proteins from a pathway with number i in the PPI network

of the first species (i.e., G

1

). Similarly, we define PW

i,2

. For

pathway i,

π,i

denotes the number of conserved

inter-actions between the proteins in this pathway under the

alignment

π, i.e., π,i

= EG1

[PWi,1]

∩ π

−1

(EG2

[PWi,2]

). Note

that we are looking for pathways that are present in both

aligned species.

We say a protein u from a pathway is aligned correctly, if

it is mapped to a protein v from a pathway with the same

function. For pathway i, we define the number of correctly

mapped proteins as

|PWi

,1

∩ π

−1

(PWi

,2

)|. This measure

corresponds to the number of proteins that, from

path-way i in the first species, are mapped to a protein from

the same pathway in the second species. For pathway i, we

define the accuracy as

acc

_π,i

=

2 |PWi

,1

∩ π

−1

_(PWi

,2

)|

|PWi

,1

| + |PWi

,2

|

.

(2)

This measure corresponds to the fraction of correctly

mapped proteins in pathway i.

We conjecture that a good alignment algorithm should

align proteins from pathways with the same functions

across species, and many interactions among these

pro-an alignment

π successfully aligns a pathway i, if there are

at least

δ conserved interactions under the alignment π

for proteins in that pathway, i.e., if

_π,i

≥ δ. This

thresh-olding guarantees that the structural similarity of aligned

pathways are more than a minimum value (here,

δ

con-served interactions). To evaluate the performance of an

algorithm based on this thresholding criterion, we define

a set of measures as follows.

1. We consider pathways with at least

δ (say δ ≥ 2)

interactions in each of the species. Let “#PW

_δ

”

denote the number of such pathways.

2. Alignment

π successfully aligns pathway i, if

π,i

≥ δ. The variable “#FPW

δ

” refers to the number

of successfully aligned pathways. We define the recall

as

recall

_π,δ

=

#FPW

δ

#PW

_δ

.

(3)

3. Again, for a correctly aligned pathway i, we define

acc

_π,δ,i

similar to (2).

The averages over all i of all the acc

_π,i

and acc

_π,δ,i

values

are represented by acc

_π

and acc

_π,δ

, respectively. Figure 2

provides a toy example of how to calculate the pathway

alignment measures.

Results

In this section, we compare PROPER with the main

state-of-the-art network alignment algorithms, specifically (i)

with L-GRAAL as the most recent member of GRAAL

family that takes into account both sequence and

struc-tural similarities [23]; (ii) with MAGNA++ that tries to

maximize one of the EC, ICS or S

3

measures [33, 34]

(In our experiments we run MAGNA++ in two different

Fig. 2 In this figure, two example PPI networks are given. Green nodes

are proteins which are in the same pathway (i.e., a pathway with the same number in both species). Dotted lines represent the alignmentπ between these two networks. Under this alignment, there are five conserved interactions between proteins in this pathway (shown by

thick black edges in each network). Also, the number of correctly [Kazemi-Hassani-Grossglauser-Modarres ’16]

Ontology: Assess the correlation of two biological networks in two different species based on network topology

(7)

Motivation: Deformable shape matching

A fundamental problem in computer vision: Detect similar objects that undergo different deformations

Shape REtrieval Contest (SHREC) dataset

3-D shapes _{→ geometric graphs (features → nodes, distances → edges)} Determine whether two graphs are similar in topologies

(8)

Motivation: Deformable shape matching

A fundamental problem in computer vision: Detect similar objects that undergo different deformations

(9)

Beyond worst-case: Noisy graph isomorphism

Definition (Hypothesis testing)

Suppose we observe two random graphs A and B: H0 : A and B areindependent

H1 : A and Bπ = (Bπ(i)π(j)) are edge-correlated

conditional on a uniform permutation π

Goal: Test _H0 versus H1

• UnderH1, the inherent edge correlation is obscured by the latent

node correspondence π

• Hypothesis testing aspect of graph matching(recover π under _H1) • The test needs to rely on graph invariants(invariant under graph

isomorphism), such as

I Subgraph counts (e.g. # of edges or triangles)

(10)

Gaussian setting

A and B are twoGaussian-weighted complete graphs

Definition (Gaussian Wigner model)

H0 : (Aij, Bij)i.i.d.∼ N (0₀) , (1 0_{0 1}) H1 : Aij, Bπ(i)π(j) i.i.d. ∼ N(0 0) , 1 ρ ρ 1 conditional on uniform π

• Under both_H0 andH1, A and B are Gaussian Wigner matrices

marginally

(11)

Erd˝

os-R´

enyi setting

Definition (Erd˝os-R´enyi graphs model [Barak-Chou-Lei-Schramm-Sheng’19])

H0 : A and B are independent G(n, ps)

H1 : A and Bπ = (Bπ(i)π(j)) are independently edge-sampled from

a common parent graph _{G(n, p) with sampling probability s}

• Under bothH0 andH1, A and B areG(n, ps) marginally

• Under_H1, Aij, Bπ(i)π(j) are correlated Bern(ps) with correlation

(12)

Sharp threshold for detection: Gaussian

Q and P: probability measure under H0 andH1

Theorem (Wu-X.-Yu ’20) TV (P, Q) = ( 1_{− o(1)} if ρ2_{≥ 4}log n_n₋₁ o(1) if ρ2≤ (4 − )log n_n 1 lim n→∞TV (P, Q) 0 _lim n→∞ nρ2 log n 4 weak detection impossible test error=1− o(1)

strong detection possible test error=o(1)

(13)

Sharp threshold for detection: Gaussian

Q and P: probability measure under H0 andH1

Theorem (Wu-X.-Yu ’20) TV (P, Q) = ( 1_{− o(1)} if ρ2_{≥ 4}log n_n₋₁ o(1) if ρ2≤ (4 − )log n_n 1 lim n→∞TV (P, Q) 0 _lim n→∞ nρ2 log n 4 weak detection impossible test error=1− o(1)

strong detection possible test error=o(1)

(14)

Sharp threshold for detection: dense Erd˝

os-R´

enyi graphs

Theorem (Wu-X.-Yu ’20)

If s2 ≥ 2 log n

(n− 1)plog1_p − 1 + p , then TV (P, Q) = 1 − o (1)

• (Dense regime) p = n−o(1): If s2 _≤ (2− ) log n

nplog1_p _{− 1 + p} ,

then TV (_{P, Q) = o(1)}

Remark:

• Sharp phase transition at ns

2_p_log1 p−1+p

log n = 2

• p7→ p(log1

p − 1 + p) is not monotone and uniquely maximized at

(15)

Sharp threshold for detection: dense Erd˝

os-R´

enyi graphs

Theorem (Wu-X.-Yu ’20)

If s2 ≥ 2 log n

(n− 1)plog1_p − 1 + p , then TV (P, Q) = 1 − o (1)

• (Dense regime) p = n−o(1): If s2 _≤ (2− ) log n

nplog1_p _{− 1 + p} ,

then TV (_{P, Q) = o(1)}

Remark:

• Sharp phase transition at ns

2_p_log1 p−1+p

log n = 2

• p7→ p(log1

p − 1 + p) is not monotone and uniquely maximized at

(16)

Threshold for detection: sparse Erd˝

os-R´

enyi graphs

Theorem (Wu-X.-Yu ’20) If s2 ≥ 2 log n (n_{− 1)p} log1_p _{− 1 + p} , then TV (P, Q) = 1 − o (1) • (Sparse regime) p = n−Ω(1): If s2< 1 np∧ 0.01, then TV (P, Q) = 1 − Ω(1) If s2< 1

np∧ o(1), then TV (P, Q) = o(1)

p = n−α for a constant α:

(17)

Threshold for detection: sparse Erd˝

os-R´

enyi graphs

Theorem (Wu-X.-Yu ’20) If s2 ≥ 2 log n (n_{− 1)p} log1_p _{− 1 + p} , then TV (P, Q) = 1 − o (1) • (Sparse regime) p = n−Ω(1): If s2< 1 np∧ 0.01, then TV (P, Q) = 1 − Ω(1) If s2< 1

np∧ o(1), then TV (P, Q) = o(1)

p = n−α for a constant α:

(18)

What is known algorithmically

• Counting edges: achieve weak detection in linear time, if s = Ω(1)

• Counting “balanced” subgraphs[Barak-Chou-Lei-Schramm-Sheng’19]:

correctly tell_H0 vs. H1 w.p. at least 0.9 in poly-time, if

s = Ω (1) and nps_∈hn, n1/153i_∪hn2/3, n1−i

• Counting (weighted) trees – low-degree poly. approx. of P(A,B)_Q(A,B)

[Mao-Wu-X.-Yu ’forthcoming]: achieve strong detection in poly-time, if

s2 > 1/2.956 and np≥ n−o(1)

Order-optimalwhen np = Θ(1)

(19)

What is known algorithmically

s2 > 1/2.956 and np≥ n−o(1)

(20)

What is known algorithmically

s2 > 1/2.956 and np_{≥ n}−o(1)

(21)

What is known algorithmically

s2 > 1/2.956 and np_{≥ n}−o(1)

(22)

Proof of positive results: QAP statistic

T (A, B) , max

π∈Sn

X

i<j

AijBπ(i)π(j) (edge correlation)

• This is known as Quadratic Assignment Problem

• Invariant to the node relabeling of both A and B

(23)

Proof of negative results: second-moment method

EQ " P(A, B) Q(A, B) 2# = O(1) =_{⇒ TV(P, Q) ≤ 1 − Ω(1)}

Strong detection is impossible

EQ " P(A, B) Q(A, B) 2# = 1 + o(1) =⇒ TV(P, Q) = o(1)

(24)

Second-moment calculation: cycle (orbit) decomposition

• Node permutation σ on [n] • Edge permutation σE on [n]₂ : σE((i, j)) , (σ(i), σ(j)) E.g. n = 6 and σ = (1)(23)(456): σ: 1 2 3 4 5 6 σE_: (2, 3) (1, 2) (1, 3) (1, 4) (1, 5) (1, 6) (4, 5) (5, 6) (4, 6) (2, 4) (3, 5) (2, 6) (3, 4) (2, 5) (3, 6)

(25)

Second-moment calculation: cycle (orbit) decomposition

• Node permutation σ on [n] • Edge permutation σE on [n]₂ : σE((i, j)) , (σ(i), σ(j)) E.g. n = 6 and σ = (1)(23)(456): σ: 1 2 3 4 5 6 σE: (2, 3) (1, 2) (1, 3) (1, 4) (1, 5) (1, 6) (4, 5) (5, 6) (4, 6) (2, 4) (3, 5) (2, 6) (3, 4) (2, 5) (3, 6)

(26)

Second-moment calculation: cycle decomposition

P(A, B) Q(A, B) 2 = Eπ P(A, B|π) Q(A, B) 2 = Eπ⊥⊥πe Y i<j Xij Xij , P(Bπ(i)π(j)|Aij )P(Bπ(i)e π(j)e |Aij) Q(Bπ(i)π(j))Q(Beπ(i)eπ(j)) = Eπ⊥⊥πe Y O∈O XO XO , Y (i,j)∈O Xij

O: disjoint orbits of edge permutation σE _{with σ , π}−1_◦

e π EQ " P(A, B) Q(A, B) 2# = Eeπ⊥⊥πEQ Y O∈O XO = Eeπ⊥⊥π Y O∈O EQ[XO] EQ[XO] = ( ₁ 1−ρ2|O| Gaussian

(27)

Second-moment calculation: cycle decomposition

e π EQ " P(A, B) Q(A, B) 2# = Eeπ⊥⊥πEQ Y O∈O XO = Eeπ⊥⊥π Y O∈O EQ[XO] EQ[XO] = ( ₁ 1−ρ2|O| Gaussian

(28)

Second-moment calculation: cycle decomposition

e π EQ " P(A, B) Q(A, B) 2# = Eeπ⊥⊥πEQ Y O∈O XO = Eeπ⊥⊥π Y O∈O EQ[XO]

(29)

Failure of second-moment

We show EQ " P(A, B) Q(A, B) 2# = ( 1 + o(1) if ρ2_≤ (2−) log n_n +_∞ if ρ2_≥ (2+) log n_n

• Gaussian: suboptimal by a factor of 2

• ER graphs: suboptimal by an unbounded factor when p = o(1)

Obstruction from short orbits

E(A,B)∼Q " P(A, B) Q(A, B) 2# = Eπ⊥⊥eπ " Y O∈O EQ[XO] # e π=π ≥ 1 n! 1 + ρ 2(n₂)

Atypically large magnitude ofQ

O∈O:|O|=kXO for short orbits of length

(30)

Failure of second-moment

(31)

Failure of second-moment

(32)

Truncated second-moment: dense regime

It suffices to consider k = 1: Y O∈O:|O|=1 XO≈ 1 p 2eA∧Bπ(F )

F : set of fixed points of σ , π−1◦eπ A_{∧ B}π: Intersection graph

eA∧Bπ(F ): # of edges of subgraph of A∧ Bπ induced by F

• Under_{P: e}A∧Bπ(S) concentratesuniformly over all S when|S| is

large

• On this typical event _{E under P, when |F | is large,}

 

"

(33)

Truncated second-moment: sparse regime

Need to consider k = Θ(log n). It can be shown

• Long orbits: EQ   Y |O|>k XO  ≤ 1 + ρk n2 k = 1 + o(1)

• Shortincompleteorbits:

EQ[XO| O 6⊂ E (A ∧ Bπ)]≤ 1 • Shortcomplete orbits:

XO=

1 p

2|O|

, _{∀O ⊂ E (A ∧ B}π)

In the subcritical regime nps2 < 1, A∧ Bπ _{∼ G(n, ps}2_{) is a pseduoforest}

(34)

Truncated second-moment: orbit pseduoforest

On_{E , {(A, B, π) : A ∧ B}π is a pseudoforest_{} under P:}

EQ " Y O∈O XO1{E} # ≤ (1 + o(1)) EQ " 1 p 2e(Hk) 1_{H_kis a pseudoforest} # = (1 + o(1)) X H∈Hk

s2e(H) (generating function)

Hk: pseudoforests assembled from edge orbits of length at most k

(35)

Truncated second-moment: orbit pseduoforest

On_{E , {(A, B, π) : A ∧ B}π is a pseudoforest_{} under P:}

EQ " Y O∈O XO1{E} # ≤ (1 + o(1)) EQ " 1 p 2e(Hk) 1_{H_kis a pseudoforest} # = (1 + o(1)) X H∈Hk

s2e(H) (generating function)

Hk: pseudoforests assembled from edge orbits of length at most k

(36)

Concluding remarks

• Formulate the problem of testing network correlation and characterize the statistical detection limit

• The impossibility proof applies truncated second-moment method

• The sparse setting leverages the pseudoforest structure of subcritical Erd˝os-R´enyi graphs

• A large computational gap may exist between the statistical and computational limits

Open problem

• Sharp detection threshold in the sparse regime

• Prove the existence of or close the computational gaps

• Sharp partial recovery threshold Reference