Testing Correlation of Unlabeled Random Graphs
Jiaming Xu
The Fuqua School of Business Duke University Joint work with
Yihong Wu (Yale) and Sophie H. Yu (Duke)
Computational and Methodological Statistics December 21, 2020
Graph isomorphism
Given two graphs A and B, decide whether A ∼= B, i.e., there exists a bijection π : V (A)→ V (B) such that
(u, v)∈ E(A) ⇔ (π(u), π(v)) ∈ E(B)
a b c d g h i j 8 5 6 7 1 4 2 3 π(a) = 1 π(b) = 6 π(c) = 8 π(d) = 3 π(g) = 5 π(h) = 2 π(i) = 4 π(j) = 7
• Not known to be solvable in polynomial time in worst case
• In practice, two graphs are often not exactly isomorphic, but still want to tell whethertheir topologies are similar
Graph isomorphism
Given two graphs A and B, decide whether A ∼= B, i.e., there exists a bijection π : V (A)→ V (B) such that
(u, v)∈ E(A) ⇔ (π(u), π(v)) ∈ E(B)
a b c d g h i j 8 5 6 7 1 4 2 3 π(a) = 1 π(b) = 6 π(c) = 8 π(d) = 3 π(g) = 5 π(h) = 2 π(i) = 4 π(j) = 7
• Not known to be solvable in polynomial time in worst case
• In practice, two graphs are often not exactly isomorphic, but still want to tell whethertheir topologies are similar
Graph isomorphism
Given two graphs A and B, decide whether A ∼= B, i.e., there exists a bijection π : V (A)→ V (B) such that
(u, v)∈ E(A) ⇔ (π(u), π(v)) ∈ E(B)
a b c d g h i j 8 5 6 7 1 4 2 3 π(a) = 1 π(b) = 6 π(c) = 8 π(d) = 3 π(g) = 5 π(h) = 2 π(i) = 4 π(j) = 7
Two key challenges
• Statistical: two graphs may be correlated but not exactly isomorphic
• Computational: # of possible node mappings is n! (100!≈ 10158)
• • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • •• • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • •
Motivation: Protein-protein interaction network
Kazemi et al. BMC Bioinformatics (2016) 17:527 Page 6 of 16
proposed in [59]. ANBS for two graphs G
1(V
1, E
1) and
G
2(V
2, E
2) under the alignment π is defined as follows.
ANBS(π) =
|V
1|
−1 i∈V1(π)BlastBit
(i, π(i))
√
BlastBit
(i, i)BlastBit(π(i), π(i))
.
Pathway comparison measures
In order to evaluate the performance of algorithms in
aligning biological pathways, we introduce a new
mea-sure in this section. This meamea-sure captures the quality of
alignments based on a higher level of functional and
struc-tural similarities (beyond the introduced measures such as
the similarity of GO terms and the number of conserved
interactions).
It is known that there are many biological pathways
with similar functions in different species [12]. The KEGG
PATHWAY database [60] provides a set of experimentally
found biological pathways. In this database, a pathway is
called by the name of a species (e.g., hsa for Homo
sapi-ens), followed by a number. The pathways with the same
number have the same function in different species. For
example, hsa03040, mmu03040, dme03040 and sce03040
are in Homo sapiens (human), Mus musculus (mouse),
Drosophila melanogaster (fruit fly) and Saccharomyces
cerevisiae (budding yeast), respectively. These pathways
have the same functions.
2Assume PWi
,1denotes the set of
proteins from a pathway with number i in the PPI network
of the first species (i.e., G
1). Similarly, we define PW
i,2. For
pathway i,
π,i
denotes the number of conserved
inter-actions between the proteins in this pathway under the
alignment
π, i.e., π,i
= EG1
[PWi,1]∩ π
−1(EG2
[PWi,2]). Note
that we are looking for pathways that are present in both
aligned species.
We say a protein u from a pathway is aligned correctly, if
it is mapped to a protein v from a pathway with the same
function. For pathway i, we define the number of correctly
mapped proteins as
|PWi
,1∩ π
−1(PWi
,2)|. This measure
corresponds to the number of proteins that, from
path-way i in the first species, are mapped to a protein from
the same pathway in the second species. For pathway i, we
define the accuracy as
acc
π,i=
2
|PWi
,1∩ π
−1
(PWi
,2)|
|PWi
,1| + |PWi
,2|
.
(2)
This measure corresponds to the fraction of correctly
mapped proteins in pathway i.
We conjecture that a good alignment algorithm should
align proteins from pathways with the same functions
across species, and many interactions among these
pro-an alignment
π successfully aligns a pathway i, if there are
at least
δ conserved interactions under the alignment π
for proteins in that pathway, i.e., if
π,i
≥ δ. This
thresh-olding guarantees that the structural similarity of aligned
pathways are more than a minimum value (here,
δ
con-served interactions). To evaluate the performance of an
algorithm based on this thresholding criterion, we define
a set of measures as follows.
1. We consider pathways with at least
δ (say δ ≥ 2)
interactions in each of the species. Let “#PW
δ”
denote the number of such pathways.
2. Alignment
π successfully aligns pathway i, if
π,i
≥ δ. The variable “#FPW
δ” refers to the number
of successfully aligned pathways. We define the recall
as
recall
π,δ=
#FPW
δ#PW
δ.
(3)
3. Again, for a correctly aligned pathway i, we define
acc
π,δ,isimilar to (2).
The averages over all i of all the acc
π,iand acc
π,δ,ivalues
are represented by acc
πand acc
π,δ, respectively. Figure 2
provides a toy example of how to calculate the pathway
alignment measures.
Results
In this section, we compare PROPER with the main
state-of-the-art network alignment algorithms, specifically (i)
with L-GRAAL as the most recent member of GRAAL
family that takes into account both sequence and
struc-tural similarities [23]; (ii) with MAGNA++ that tries to
maximize one of the EC, ICS or S
3measures [33, 34]
(In our experiments we run MAGNA++ in two different
Fig. 2 In this figure, two example PPI networks are given. Green nodes
are proteins which are in the same pathway (i.e., a pathway with the same number in both species). Dotted lines represent the alignmentπ between these two networks. Under this alignment, there are five conserved interactions between proteins in this pathway (shown by
thick black edges in each network). Also, the number of correctly [Kazemi-Hassani-Grossglauser-Modarres ’16]
Ontology: Assess the correlation of two biological networks in two different species based on network topology
Motivation: Deformable shape matching
A fundamental problem in computer vision: Detect similar objects that undergo different deformations
Shape REtrieval Contest (SHREC) dataset
3-D shapes → geometric graphs (features → nodes, distances → edges) Determine whether two graphs are similar in topologies
Motivation: Deformable shape matching
A fundamental problem in computer vision: Detect similar objects that undergo different deformations
Beyond worst-case: Noisy graph isomorphism
Definition (Hypothesis testing)
Suppose we observe two random graphs A and B: H0 : A and B areindependent
H1 : A and Bπ = (Bπ(i)π(j)) are edge-correlated
conditional on a uniform permutation π
Goal: Test H0 versus H1
• UnderH1, the inherent edge correlation is obscured by the latent
node correspondence π
• Hypothesis testing aspect of graph matching(recover π under H1) • The test needs to rely on graph invariants(invariant under graph
isomorphism), such as
I Subgraph counts (e.g. # of edges or triangles)
Gaussian setting
A and B are twoGaussian-weighted complete graphs
Definition (Gaussian Wigner model)
H0 : (Aij, Bij)i.i.d.∼ N (00) , (1 00 1) H1 : Aij, Bπ(i)π(j) i.i.d. ∼ N(0 0) , 1 ρ ρ 1 conditional on uniform π
• Under bothH0 andH1, A and B are Gaussian Wigner matrices
marginally
Erd˝
os-R´
enyi setting
Definition (Erd˝os-R´enyi graphs model [Barak-Chou-Lei-Schramm-Sheng’19])
H0 : A and B are independent G(n, ps)
H1 : A and Bπ = (Bπ(i)π(j)) are independently edge-sampled from
a common parent graph G(n, p) with sampling probability s
• Under bothH0 andH1, A and B areG(n, ps) marginally
• UnderH1, Aij, Bπ(i)π(j) are correlated Bern(ps) with correlation
Sharp threshold for detection: Gaussian
Q and P: probability measure under H0 andH1
Theorem (Wu-X.-Yu ’20) TV (P, Q) = ( 1− o(1) if ρ2≥ 4log nn−1 o(1) if ρ2≤ (4 − )log nn 1 lim n→∞TV (P, Q) 0 lim n→∞ nρ2 log n 4 weak detection impossible test error=1− o(1)
strong detection possible test error=o(1)
Sharp threshold for detection: Gaussian
Q and P: probability measure under H0 andH1
Theorem (Wu-X.-Yu ’20) TV (P, Q) = ( 1− o(1) if ρ2≥ 4log nn−1 o(1) if ρ2≤ (4 − )log nn 1 lim n→∞TV (P, Q) 0 lim n→∞ nρ2 log n 4 weak detection impossible test error=1− o(1)
strong detection possible test error=o(1)
Sharp threshold for detection: dense Erd˝
os-R´
enyi graphs
Theorem (Wu-X.-Yu ’20)
If s2 ≥ 2 log n
(n− 1)plog1p − 1 + p , then TV (P, Q) = 1 − o (1)
• (Dense regime) p = n−o(1): If s2 ≤ (2− ) log n
nplog1p − 1 + p ,
then TV (P, Q) = o(1)
Remark:
• Sharp phase transition at ns
2plog1 p−1+p
log n = 2
• p7→ p(log1
p − 1 + p) is not monotone and uniquely maximized at
Sharp threshold for detection: dense Erd˝
os-R´
enyi graphs
Theorem (Wu-X.-Yu ’20)
If s2 ≥ 2 log n
(n− 1)plog1p − 1 + p , then TV (P, Q) = 1 − o (1)
• (Dense regime) p = n−o(1): If s2 ≤ (2− ) log n
nplog1p − 1 + p ,
then TV (P, Q) = o(1)
Remark:
• Sharp phase transition at ns
2plog1 p−1+p
log n = 2
• p7→ p(log1
p − 1 + p) is not monotone and uniquely maximized at
Threshold for detection: sparse Erd˝
os-R´
enyi graphs
Theorem (Wu-X.-Yu ’20) If s2 ≥ 2 log n (n− 1)p log1p − 1 + p , then TV (P, Q) = 1 − o (1) • (Sparse regime) p = n−Ω(1): If s2< 1 np∧ 0.01, then TV (P, Q) = 1 − Ω(1) If s2< 1np∧ o(1), then TV (P, Q) = o(1)
p = n−α for a constant α:
Threshold for detection: sparse Erd˝
os-R´
enyi graphs
Theorem (Wu-X.-Yu ’20) If s2 ≥ 2 log n (n− 1)p log1p − 1 + p , then TV (P, Q) = 1 − o (1) • (Sparse regime) p = n−Ω(1): If s2< 1 np∧ 0.01, then TV (P, Q) = 1 − Ω(1) If s2< 1np∧ o(1), then TV (P, Q) = o(1)
p = n−α for a constant α:
What is known algorithmically
• Counting edges: achieve weak detection in linear time, if s = Ω(1)
• Counting “balanced” subgraphs[Barak-Chou-Lei-Schramm-Sheng’19]:
correctly tellH0 vs. H1 w.p. at least 0.9 in poly-time, if
s = Ω (1) and nps∈hn, n1/153i∪hn2/3, n1−i
• Counting (weighted) trees – low-degree poly. approx. of P(A,B)Q(A,B)
[Mao-Wu-X.-Yu ’forthcoming]: achieve strong detection in poly-time, if
s2 > 1/2.956 and np≥ n−o(1)
Order-optimalwhen np = Θ(1)
What is known algorithmically
• Counting edges: achieve weak detection in linear time, if s = Ω(1)
• Counting “balanced” subgraphs[Barak-Chou-Lei-Schramm-Sheng’19]:
correctly tellH0 vs. H1 w.p. at least 0.9 in poly-time, if
s = Ω (1) and nps∈hn, n1/153i∪hn2/3, n1−i
• Counting (weighted) trees – low-degree poly. approx. of P(A,B)Q(A,B)
[Mao-Wu-X.-Yu ’forthcoming]: achieve strong detection in poly-time, if
s2 > 1/2.956 and np≥ n−o(1)
Order-optimalwhen np = Θ(1)
What is known algorithmically
• Counting edges: achieve weak detection in linear time, if s = Ω(1)
• Counting “balanced” subgraphs[Barak-Chou-Lei-Schramm-Sheng’19]:
correctly tellH0 vs. H1 w.p. at least 0.9 in poly-time, if
s = Ω (1) and nps∈hn, n1/153i∪hn2/3, n1−i
• Counting (weighted) trees – low-degree poly. approx. of P(A,B)Q(A,B)
[Mao-Wu-X.-Yu ’forthcoming]: achieve strong detection in poly-time, if
s2 > 1/2.956 and np≥ n−o(1)
Order-optimalwhen np = Θ(1)
What is known algorithmically
• Counting edges: achieve weak detection in linear time, if s = Ω(1)
• Counting “balanced” subgraphs[Barak-Chou-Lei-Schramm-Sheng’19]:
correctly tellH0 vs. H1 w.p. at least 0.9 in poly-time, if
s = Ω (1) and nps∈hn, n1/153i∪hn2/3, n1−i
• Counting (weighted) trees – low-degree poly. approx. of P(A,B)Q(A,B)
[Mao-Wu-X.-Yu ’forthcoming]: achieve strong detection in poly-time, if
s2 > 1/2.956 and np≥ n−o(1)
Order-optimalwhen np = Θ(1)
Proof of positive results: QAP statistic
T (A, B) , max
π∈Sn
X
i<j
AijBπ(i)π(j) (edge correlation)
• This is known as Quadratic Assignment Problem
• Invariant to the node relabeling of both A and B
Proof of negative results: second-moment method
EQ " P(A, B) Q(A, B) 2# = O(1) =⇒ TV(P, Q) ≤ 1 − Ω(1)Strong detection is impossible
EQ " P(A, B) Q(A, B) 2# = 1 + o(1) =⇒ TV(P, Q) = o(1)
Second-moment calculation: cycle (orbit) decomposition
• Node permutation σ on [n] • Edge permutation σE on [n]2 : σE((i, j)) , (σ(i), σ(j)) E.g. n = 6 and σ = (1)(23)(456): σ: 1 2 3 4 5 6 σE: (2, 3) (1, 2) (1, 3) (1, 4) (1, 5) (1, 6) (4, 5) (5, 6) (4, 6) (2, 4) (3, 5) (2, 6) (3, 4) (2, 5) (3, 6)Second-moment calculation: cycle (orbit) decomposition
• Node permutation σ on [n] • Edge permutation σE on [n]2 : σE((i, j)) , (σ(i), σ(j)) E.g. n = 6 and σ = (1)(23)(456): σ: 1 2 3 4 5 6 σE: (2, 3) (1, 2) (1, 3) (1, 4) (1, 5) (1, 6) (4, 5) (5, 6) (4, 6) (2, 4) (3, 5) (2, 6) (3, 4) (2, 5) (3, 6)Second-moment calculation: cycle decomposition
P(A, B) Q(A, B) 2 = Eπ P(A, B|π) Q(A, B) 2 = Eπ⊥⊥πe Y i<j Xij Xij , P(Bπ(i)π(j)|Aij )P(Bπ(i)e π(j)e |Aij) Q(Bπ(i)π(j))Q(Beπ(i)eπ(j)) = Eπ⊥⊥πe Y O∈O XO XO , Y (i,j)∈O XijO: disjoint orbits of edge permutation σE with σ , π−1◦
e π EQ " P(A, B) Q(A, B) 2# = Eeπ⊥⊥πEQ Y O∈O XO = Eeπ⊥⊥π Y O∈O EQ[XO] EQ[XO] = ( 1 1−ρ2|O| Gaussian
Second-moment calculation: cycle decomposition
P(A, B) Q(A, B) 2 = Eπ P(A, B|π) Q(A, B) 2 = Eπ⊥⊥πe Y i<j Xij Xij , P(Bπ(i)π(j)|Aij )P(Bπ(i)e π(j)e |Aij) Q(Bπ(i)π(j))Q(Beπ(i)eπ(j)) = Eπ⊥⊥πe Y O∈O XO XO , Y (i,j)∈O XijO: disjoint orbits of edge permutation σE with σ , π−1◦
e π EQ " P(A, B) Q(A, B) 2# = Eeπ⊥⊥πEQ Y O∈O XO = Eeπ⊥⊥π Y O∈O EQ[XO] EQ[XO] = ( 1 1−ρ2|O| Gaussian
Second-moment calculation: cycle decomposition
P(A, B) Q(A, B) 2 = Eπ P(A, B|π) Q(A, B) 2 = Eπ⊥⊥πe Y i<j Xij Xij , P(Bπ(i)π(j)|Aij )P(Bπ(i)e π(j)e |Aij) Q(Bπ(i)π(j))Q(Beπ(i)eπ(j)) = Eπ⊥⊥πe Y O∈O XO XO , Y (i,j)∈O XijO: disjoint orbits of edge permutation σE with σ , π−1◦
e π EQ " P(A, B) Q(A, B) 2# = Eeπ⊥⊥πEQ Y O∈O XO = Eeπ⊥⊥π Y O∈O EQ[XO]
Failure of second-moment
We show EQ " P(A, B) Q(A, B) 2# = ( 1 + o(1) if ρ2≤ (2−) log nn +∞ if ρ2≥ (2+) log nn• Gaussian: suboptimal by a factor of 2
• ER graphs: suboptimal by an unbounded factor when p = o(1)
Obstruction from short orbits
E(A,B)∼Q " P(A, B) Q(A, B) 2# = Eπ⊥⊥eπ " Y O∈O EQ[XO] # e π=π ≥ 1 n! 1 + ρ 2(n2)
Atypically large magnitude ofQ
O∈O:|O|=kXO for short orbits of length
Failure of second-moment
We show EQ " P(A, B) Q(A, B) 2# = ( 1 + o(1) if ρ2≤ (2−) log nn +∞ if ρ2≥ (2+) log nn• Gaussian: suboptimal by a factor of 2
• ER graphs: suboptimal by an unbounded factor when p = o(1)
Obstruction from short orbits
E(A,B)∼Q " P(A, B) Q(A, B) 2# = Eπ⊥⊥eπ " Y O∈O EQ[XO] # e π=π ≥ 1 n! 1 + ρ 2(n2)
Atypically large magnitude ofQ
O∈O:|O|=kXO for short orbits of length
Failure of second-moment
We show EQ " P(A, B) Q(A, B) 2# = ( 1 + o(1) if ρ2≤ (2−) log nn +∞ if ρ2≥ (2+) log nn• Gaussian: suboptimal by a factor of 2
• ER graphs: suboptimal by an unbounded factor when p = o(1)
Obstruction from short orbits
E(A,B)∼Q " P(A, B) Q(A, B) 2# = Eπ⊥⊥eπ " Y O∈O EQ[XO] # e π=π ≥ 1 n! 1 + ρ 2(n2)
Atypically large magnitude ofQ
O∈O:|O|=kXO for short orbits of length
Truncated second-moment: dense regime
It suffices to consider k = 1: Y O∈O:|O|=1 XO≈ 1 p 2eA∧Bπ(F )F : set of fixed points of σ , π−1◦eπ A∧ Bπ: Intersection graph
eA∧Bπ(F ): # of edges of subgraph of A∧ Bπ induced by F
• UnderP: eA∧Bπ(S) concentratesuniformly over all S when|S| is
large
• On this typical event E under P, when |F | is large,
"
Truncated second-moment: sparse regime
Need to consider k = Θ(log n). It can be shown
• Long orbits: EQ Y |O|>k XO ≤ 1 + ρk n2 k = 1 + o(1)
• Shortincompleteorbits:
EQ[XO| O 6⊂ E (A ∧ Bπ)]≤ 1 • Shortcomplete orbits:
XO=
1 p
2|O|
, ∀O ⊂ E (A ∧ Bπ)
In the subcritical regime nps2 < 1, A∧ Bπ ∼ G(n, ps2) is a pseduoforest
Truncated second-moment: orbit pseduoforest
OnE , {(A, B, π) : A ∧ Bπ is a pseudoforest} under P:
EQ " Y O∈O XO1{E} # ≤ (1 + o(1)) EQ " 1 p 2e(Hk) 1{Hkis a pseudoforest} # = (1 + o(1)) X H∈Hk
s2e(H) (generating function)
Hk: pseudoforests assembled from edge orbits of length at most k
Truncated second-moment: orbit pseduoforest
OnE , {(A, B, π) : A ∧ Bπ is a pseudoforest} under P:
EQ " Y O∈O XO1{E} # ≤ (1 + o(1)) EQ " 1 p 2e(Hk) 1{Hkis a pseudoforest} # = (1 + o(1)) X H∈Hk
s2e(H) (generating function)
Hk: pseudoforests assembled from edge orbits of length at most k
Concluding remarks
• Formulate the problem of testing network correlation and characterize the statistical detection limit
• The impossibility proof applies truncated second-moment method
• The sparse setting leverages the pseudoforest structure of subcritical Erd˝os-R´enyi graphs
• A large computational gap may exist between the statistical and computational limits
Open problem
• Sharp detection threshold in the sparse regime
• Prove the existence of or close the computational gaps
• Sharp partial recovery threshold Reference