Probabilistic methods for post-genomic
data integration
Dirk Husmeier
Biomathematics & Statistics Scotland (BioSS)
JCMB, The King’s Buildings, Edinburgh EH9 3JZ
United Kingdom
Integrated analysis of regulatory networks
• Expression data alone are not sufficient.
• Combining multiple sources of information yields complementary constraints.
Combining promoter sequences and gene expression data
Conventional approach:
• Find clusters of co-expressed genes.
• Identify regulatory elements by searching for common over-represented motifs in the promoter regions of these genes.
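The second step of the conventional approach can be illustrated with a deliberately crude sketch: count how many promoters in a cluster contain each k-mer and compare that with a background set. The function names, the pseudocount, and the toy sequences are assumptions made for illustration only; real motif finders use far more careful statistics, and the clustering step is taken as already done.

```python
from collections import Counter

def kmer_presence(seqs, k):
    """Count, for each k-mer, the number of sequences that contain it."""
    counts = Counter()
    for s in seqs:
        # A set ensures each k-mer is counted at most once per sequence.
        counts.update({s[i:i + k] for i in range(len(s) - k + 1)})
    return counts

def enrichment(cluster, background, k, pseudo=1.0):
    """Ratio of (smoothed) k-mer frequencies: cluster versus background."""
    c = kmer_presence(cluster, k)
    b = kmer_presence(background, k)
    scores = {}
    for kmer in c:
        f_cluster = (c[kmer] + pseudo) / (len(cluster) + pseudo)
        f_background = (b[kmer] + pseudo) / (len(background) + pseudo)
        scores[kmer] = f_cluster / f_background
    return scores

# Toy (made-up) promoters of one co-expressed cluster and a background set.
cluster = ["CCTATAGC", "GTATAGG", "TATACC"]
background = ["CCGGCCGG", "GGCCAACC", "ACGTACGT", "TTGGCCAA"]
scores = enrichment(cluster, background, k=4)
best = max(scores, key=scores.get)
print(best, scores[best])
```

In this toy example the k-mer shared by all three cluster promoters and absent from the background receives the highest ratio.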
[Figure, repeated across slide builds: schematic with the labels "Microarray data", "Model" and "Promoter sequences".]
Segal, Yelensky, Koller (2003), Bioinformatics 19
Revision:
[Figure, animated over several slides: a short motif aligned against successive positions of a nucleotide sequence.]
Position Specific Scoring Matrix (PSSM)
Search for a motif of length $W$ in binding sequences.
$W \times 4$ matrix $\psi_k(l)$: probability that the nucleotide in the $k$th position, $k \in [1, \ldots, W]$, is an $l \in \{A, C, G, T\}$.
Background model for non-binding sequences
4-dim vector $\theta_0(l)$: probability that a nucleotide is an $l \in \{A, C, G, T\}$, independent of its position.
Sequence $S_1, S_2, \ldots, S_N$

Non-binding sequence: $R = 0$
$$P(S_1, S_2, \ldots, S_N \mid R = 0) = \prod_{t=1}^{N} \theta_0(S_t)$$

Binding sequence: $R = 1$, motif starting at position $m+1$
$$P(S_1, S_2, \ldots, S_N \mid R = 1, \mathrm{start} = m+1) = \prod_{t=1}^{m} \theta_0(S_t) \prod_{k=1}^{W} \psi_k(S_{m+k}) \prod_{t=m+W+1}^{N} \theta_0(S_t) = \prod_{t=1}^{N} \theta_0(S_t) \prod_{k=1}^{W} \frac{\psi_k(S_{m+k})}{\theta_0(S_{m+k})}$$

Binding sequence: $R = 1$, motif starting anywhere
$$P(S_1, S_2, \ldots, S_N \mid R = 1) = \sum_{m=0}^{N-W} P(\mathrm{start} = m+1)\, P(S_1, S_2, \ldots, S_N \mid R = 1, \mathrm{start} = m+1) = \prod_{t=1}^{N} \theta_0(S_t)\, \frac{1}{N-W+1} \sum_{m=0}^{N-W} \prod_{k=1}^{W} \frac{\psi_k(S_{m+k})}{\theta_0(S_{m+k})}$$
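The two likelihoods above can be evaluated directly. The following is a minimal sketch, assuming a uniform prior over the $N-W+1$ start positions (as the factor $1/(N-W+1)$ implies); the function names and the toy PSSM values are made up for illustration and are not taken from the slides.

```python
import math

ALPHABET = "ACGT"

def log_lik_background(seq, theta0):
    """log P(S | R=0): each nucleotide is drawn independently from theta0."""
    return sum(math.log(theta0[s]) for s in seq)

def log_lik_binding(seq, psi, theta0):
    """log P(S | R=1), averaging over all N-W+1 possible motif start positions."""
    N, W = len(seq), len(psi)
    if N < W:
        raise ValueError("sequence is shorter than the motif")
    # Product of psi_k / theta0 over the motif window starting at offset m.
    ratios = [
        math.prod(psi[k][seq[m + k]] / theta0[seq[m + k]] for k in range(W))
        for m in range(N - W + 1)
    ]
    mean_ratio = sum(ratios) / (N - W + 1)
    return log_lik_background(seq, theta0) + math.log(mean_ratio)

# Toy (made-up) parameters: uniform background, W=3 motif favouring "TAT".
theta0 = {l: 0.25 for l in ALPHABET}
psi = [
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
]
seq = "CCTATGC"
print(log_lik_background(seq, theta0), log_lik_binding(seq, psi, theta0))
```

For long sequences or long motifs the products of ratios can underflow or overflow, so a practical version would work with sums of logarithms throughout.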
Objective: Prediction of binding activity from sequence.
Apply Bayes rule:
$$
\begin{aligned}
P(R = 1 \mid S_1, \ldots, S_N)
&= \frac{P(S_1, \ldots, S_N \mid R = 1)\, P(R = 1)}{P(S_1, \ldots, S_N \mid R = 0)\, P(R = 0) + P(S_1, \ldots, S_N \mid R = 1)\, P(R = 1)} \\
&= \left( 1 + \frac{P(R = 0)\, P(S_1, \ldots, S_N \mid R = 0)}{P(R = 1)\, P(S_1, \ldots, S_N \mid R = 1)} \right)^{-1} \\
&= \left( 1 + \left[ \frac{P(R = 1)}{P(R = 0)}\, \frac{1}{N-W+1} \sum_{m=0}^{N-W} \prod_{k=1}^{W} \frac{\psi_k(S_{m+k})}{\theta_0(S_{m+k})} \right]^{-1} \right)^{-1}
\end{aligned}
$$
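Define:
$$w_k(l) = \log \frac{\psi_k(l)}{\theta_0(l)}, \qquad w_0 = \log \frac{P(R=1)}{P(R=0)}, \qquad \mathrm{logit}(z) = \frac{1}{1 + \exp(-z)}$$
Then
$$P(R = 1 \mid S_1, S_2, \ldots, S_N) = \mathrm{logit}\left( \log \left[ \frac{e^{w_0}}{N-W+1} \sum_{m=0}^{N-W} \exp\left( \sum_{k=1}^{W} w_k(S_{m+k}) \right) \right] \right)$$
$4 \times W + 1$ parameters: $w_k(l)$, $w_0$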
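The logistic form of the posterior is convenient to evaluate in log space. The sketch below adds $w_0$ to the log of the averaged exponentiated window scores, which is algebraically the same as the Bayes-rule expression; it uses a log-sum-exp step for numerical stability. The function names, the toy PSSM, and the prior $P(R=1) = 0.1$ are assumptions made for illustration.

```python
import math

def logistic(z):
    """Numerically stable 1 / (1 + exp(-z)); the slides call this 'logit'."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def log_odds_weights(psi, theta0):
    """w_k(l) = log(psi_k(l) / theta0(l)) for every motif position k."""
    return [{l: math.log(col[l] / theta0[l]) for l in theta0} for col in psi]

def posterior_binding(seq, w, w0):
    """P(R=1 | S) from the log-odds weights w and the prior log-odds w0."""
    N, W = len(seq), len(w)
    if N < W:
        raise ValueError("sequence is shorter than the motif")
    # One score per motif start offset m: sum_k w_k(S_{m+k}).
    scores = [sum(w[k][seq[m + k]] for k in range(W)) for m in range(N - W + 1)]
    # log of the mean of exp(score), computed with the log-sum-exp trick.
    top = max(scores)
    log_mean = top + math.log(sum(math.exp(s - top) for s in scores)) - math.log(N - W + 1)
    return logistic(w0 + log_mean)

# Toy (made-up) parameters: uniform background, W=3 motif favouring "TAT".
theta0 = {l: 0.25 for l in "ACGT"}
psi = [
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
]
w = log_odds_weights(psi, theta0)
w0 = math.log(0.1 / 0.9)  # assumed prior P(R=1) = 0.1
posterior = posterior_binding("CCTATGC", w, w0)
print(round(posterior, 3))
```

Working with the $4 \times W + 1$ weights rather than with $\psi$ and $\theta_0$ directly removes the normalisation constraints, which is what makes the model look like a small neural network with a nonlinear transfer function.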
[Figure, animated over several slides: the motif is aligned at each position of the sequence, giving Score 1, Score 2, ..., Score t, ..., Score N; the scores are summed and passed through a nonlinear transfer function to give P(R=1|sequence).]
Wolfgang Lehrach
Biomathematics & Statistics Scotland
SH3 yeast two-hybrid interaction network
Tong et al. (2002), Science 295, 321-324
285 interactions between 28 SH3 proteins and 143 binding peptides
[Figure: ROC curves, "Final Test Set Performance". x-axis: false positive rate (1−specificity); y-axis: true positive rate (sensitivity); both axes run from 0 to 1.]
Legend:
0.61 Reiss
0.62 None
0.64 Naive
0.69 Gaussian
0.71 Laplacian with pruning
0.73 Laplacian
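A ROC curve of this kind is obtained by sweeping a decision threshold over the predicted scores and recording the true and false positive rates at each step; the area under the curve is a common one-number summary. The sketch below shows the computation on made-up labels and scores, not on the SH3 data; the function names are hypothetical.

```python
def roc_points(labels, scores):
    """(false positive rate, true positive rate) pairs for a decreasing threshold."""
    pos = sum(labels)
    neg = len(labels) - pos
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    tp = fp = 0
    points = [(0.0, 0.0)]
    i = 0
    while i < len(order):
        # Items with tied scores cross the threshold together.
        j = i
        while j < len(order) and scores[order[j]] == scores[order[i]]:
            if labels[order[j]]:
                tp += 1
            else:
                fp += 1
            j += 1
        points.append((fp / neg, tp / pos))
        i = j
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Made-up example: 1 = interaction, 0 = no interaction.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
curve = roc_points(labels, scores)
print(curve, auc(curve))
```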
The model of Segal, Yelensky and Koller
[Figure, built up over several slides: graphical model for a gene g. The promoter sequence g.S_1, g.S_2, ..., g.S_N determines the binding variables g.R_1, g.R_2 through P(g.R|g.S); these determine the module assignment g.M, which in turn determines the expression levels g.E_1, g.E_2, g.E_3 through P(g.E|g.M).]
$$P(g.R_i = 1 \mid g.S_1, g.S_2, \ldots, g.S_N) = \mathrm{logit}\left( \log \left[ \frac{e^{w_0}}{N-W+1} \sum_{m=0}^{N-W} \exp\left( \sum_{k=1}^{W} w_k(g.S_{m+k}) \right) \right] \right)$$
Softmax function
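$$P(g.M = m \mid g.R_1 = r_1, g.R_2 = r_2, \ldots, g.R_L = r_L) = \frac{\exp\left( \sum_{i=1}^{L} u_{mi}\, r_i \right)}{\sum_{\tilde{m}} \exp\left( \sum_{i=1}^{L} u_{\tilde{m} i}\, r_i \right)}$$
Parameter matrix: $u = (u_{mi})$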
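The softmax assignment of a gene to a module can be sketched as follows. The weight matrix `u`, the binary vector `r`, and the function name are illustrative assumptions; subtracting the largest activation before exponentiating is a standard stability trick that leaves the probabilities unchanged.

```python
import math

def module_posterior(u, r):
    """P(g.M = m | g.R = r) for every module m; u[m][i] weights motif i for module m."""
    activations = [sum(u_mi * r_i for u_mi, r_i in zip(row, r)) for row in u]
    top = max(activations)
    exps = [math.exp(a - top) for a in activations]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up weights: three modules, two motif-binding variables.
u = [
    [2.0, 0.0],  # module 0 favours motif 0
    [0.0, 2.0],  # module 1 favours motif 1
    [0.0, 0.0],  # module 2 is indifferent
]
probs = module_posterior(u, [1, 0])  # motif 0 binds, motif 1 does not
print(probs)
```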
. 2 g.E g.E3 1 g.E g.R1 g.R2 g.M . . . 1 |g.M) 3 P(g.E 2 1 g.S g.S2 g.SN 0 g.M 3Independent Gaussian distributions
P
(
g.E
1, g.E
2, . . . , g.E
L|
g.M
=
m
) =
Y
j
P
(
g.E
j|
g.M
=
m
)
P
(
g.E
j|
g.M
=
m
) =
N
(
µ
j,m, σ
j,m)
For each module
m
and each condition
j
:
Mean
:
µ
j,mN g.S g.R1 g.R2 g.M 1 g.E g.E 2 C C G A T . . . .
...
1 g.S g.S2 A 3 2 1 g.M 1 g.R 2 g.R 3 g.E P(g.E T 2 |g.S) P(g.R 0 g.M 3 2 1 |g.M) 31 g.S g.S2 g.SN 1 g.E g.E 2 C C G A T 2 g.R g.R1 g.M . . . .
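The expression likelihood factorises over conditions, so its logarithm is a sum of univariate Gaussian log-densities. The sketch below assumes that $\sigma_{j,m}$ denotes a standard deviation (the slides do not say whether it is a standard deviation or a variance); the function names and numbers are made up for illustration.

```python
import math

def log_gaussian(x, mu, sigma):
    """Log density of N(mu, sigma^2) at x; sigma is taken to be a standard deviation."""
    return -0.5 * math.log(2.0 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2.0 * sigma ** 2)

def log_lik_expression(e, mu_m, sigma_m):
    """log P(g.E_1..g.E_L | g.M = m) as a sum over independent conditions j."""
    return sum(log_gaussian(x, mu, s) for x, mu, s in zip(e, mu_m, sigma_m))

# Made-up profiles for two modules over three conditions.
mu = {0: [1.0, 1.0, -1.0], 1: [-1.0, 0.0, 1.0]}
sigma = {0: [0.5, 0.5, 0.5], 1: [0.5, 0.5, 0.5]}
expression = [0.9, 1.2, -0.8]  # one gene's expression levels
log_liks = {m: log_lik_expression(expression, mu[m], sigma[m]) for m in mu}
print(log_liks)
```

Here the gene's profile lies close to the mean profile of module 0, so that module receives the larger log-likelihood.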