Probabilistic methods for post-genomic data integration

Dirk Husmeier

Biomathematics & Statistics Scotland (BioSS)

JCMB, The King’s Buildings, Edinburgh EH9 3JZ

United Kingdom

Integrated analysis of regulatory networks

Expression data alone are not sufficient.

Combining multiple sources of information yields complementary constraints.
Combining promoter sequences and gene expression data

Conventional approach:

Find clusters of co-expressed genes.

Identify regulatory elements by searching for common over-represented motifs in the promoter regions of these genes.
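The second step of the conventional approach can be sketched roughly as follows. This is a minimal illustration, not the procedure used by any particular motif finder: it assumes the clustering has already been done, uses a crude pseudocount-smoothed frequency ratio instead of a proper statistical test, and all sequences, gene names and the function names `kmer_presence` and `overrepresented_kmers` are hypothetical.

```python
from collections import Counter

def kmer_presence(seqs, k):
    """Count, for each k-mer, how many sequences contain it at least once."""
    counts = Counter()
    for s in seqs:
        counts.update({s[i:i + k] for i in range(len(s) - k + 1)})
    return counts

def overrepresented_kmers(cluster_seqs, other_seqs, k, pseudo=1.0):
    """Rank k-mers by (frequency inside the cluster) / (frequency outside it).

    Pseudocounts keep the ratio finite when a k-mer never occurs outside.
    """
    inside = kmer_presence(cluster_seqs, k)
    outside = kmer_presence(other_seqs, k)
    scores = {}
    for kmer, c in inside.items():
        f_in = (c + pseudo) / (len(cluster_seqs) + 2 * pseudo)
        f_out = (outside[kmer] + pseudo) / (len(other_seqs) + 2 * pseudo)
        scores[kmer] = f_in / f_out
    # Highest ratio first; ties broken alphabetically for reproducibility.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))

# Hypothetical toy promoters; g1-g3 stand in for one co-expression cluster.
promoters = {
    "g1": "ACGTGATTACAGG",
    "g2": "TCGATTACACCGA",
    "g3": "CCAGATTACATTG",
    "g4": "ACGGTCCATGCAA",
    "g5": "TTGCAGGCTAACG",
    "g6": "GGCATCGTACGTT",
}
cluster = ["g1", "g2", "g3"]
cluster_seqs = [promoters[g] for g in cluster]
other_seqs = [s for g, s in promoters.items() if g not in cluster]
ranking = overrepresented_kmers(cluster_seqs, other_seqs, k=7)
top_kmer, top_score = ranking[0]
```

Because the clusters are fixed before any sequence is inspected, a motif can only be found if the expression clustering happened to group its target genes together.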

[Figure: diagram relating microarray data and promoter sequences to the model]

Segal, Yelensky, Koller (2003), Bioinformatics 19

Revision:

[Figure: a promoter sequence shown with a short motif, built up over several slides]

Position Specific Scoring Matrix (PSSM)

Search for a motif of length $W$ in binding sequences.

$W \times 4$ matrix $\psi_k(l)$: probability that the nucleotide in the $k$th position, $k \in [1, \ldots, W]$, is an $l \in \{A, C, G, T\}$.

Background model for non-binding sequences.

4-dim vector $\theta_0(l)$: probability that a background nucleotide is an $l \in \{A, C, G, T\}$.
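As a concrete picture of these two objects, the sketch below builds a $W \times 4$ matrix and a 4-dim background vector from raw counts. The slides do not say how $\psi$ and $\theta_0$ are estimated, so the count-plus-pseudocount estimator, the example sequences, and the function names `estimate_pssm` and `estimate_background` are all assumptions made for illustration.

```python
ALPHABET = "ACGT"

def estimate_pssm(sites, pseudo=1.0):
    """Return psi as a W x 4 list: psi[k][l] is the probability of
    nucleotide ALPHABET[l] at motif position k (0-based)."""
    width = len(sites[0])
    psi = []
    for k in range(width):
        column = [s[k] for s in sites]
        counts = [column.count(l) + pseudo for l in ALPHABET]
        total = sum(counts)
        psi.append([c / total for c in counts])
    return psi

def estimate_background(seqs, pseudo=1.0):
    """Return theta0 as a length-4 list of nucleotide probabilities."""
    counts = [sum(s.count(l) for s in seqs) + pseudo for l in ALPHABET]
    total = sum(counts)
    return [c / total for c in counts]

# Hypothetical aligned binding sites of length W = 6.
sites = ["TGAATT", "TGAATT", "TGACTT", "TGAATA"]
psi = estimate_pssm(sites)

# Hypothetical non-binding sequences for the background model.
theta0 = estimate_background(["ACGTACGTTGCA", "GGCATTACGATC"])
```

Each row of `psi` is a probability distribution over the four nucleotides, so a strongly conserved position shows up as a row with most of its mass on one letter.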
Sequence $S_1, S_2, \ldots, S_N$

Non-binding sequence: R=0

$$P(S_1, S_2, \ldots, S_N \mid R = 0) = \prod_{t=1}^{N} \theta_0(S_t)$$

Binding sequence: R=1, motif starting at position m+1

$$P(S_1, S_2, \ldots, S_N \mid R = 1, \text{start} = m + 1) = \prod_{t=1}^{m} \theta_0(S_t) \prod_{k=1}^{W} \psi_k(S_{m+k}) \prod_{t=m+W+1}^{N} \theta_0(S_t) = \prod_{t=1}^{N} \theta_0(S_t) \prod_{k=1}^{W} \frac{\psi_k(S_{m+k})}{\theta_0(S_{m+k})}$$

Binding sequence: R=1, motif starting anywhere

$$P(S_1, S_2, \ldots, S_N \mid R = 1) = \sum_{m=0}^{N-W} P(\text{start} = m + 1)\, P(S_1, S_2, \ldots, S_N \mid R = 1, \text{start} = m + 1) = \prod_{t=1}^{N} \theta_0(S_t)\, \frac{1}{N - W + 1} \sum_{m=0}^{N-W} \prod_{k=1}^{W} \frac{\psi_k(S_{m+k})}{\theta_0(S_{m+k})}$$

Objective: prediction of binding activity from sequence, $P(R = 1 \mid S_1, S_2, \ldots, S_N)$.
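The two likelihoods above can be evaluated directly. The following is a minimal sketch assuming a uniform prior over the $N - W + 1$ start positions, as in the formula; the toy matrix `psi`, the background `theta0`, the sequence and the function names are hypothetical.

```python
import math

ALPHABET = "ACGT"

def log_lik_nonbinding(seq, theta0):
    """log P(S | R=0): every position is drawn from the background."""
    return sum(math.log(theta0[ALPHABET.index(s)]) for s in seq)

def ratio_at(seq, m, psi, theta0):
    """prod_k psi_k(S_{m+k}) / theta0(S_{m+k}) for a motif starting at
    position m+1 (m is also the 0-based index of the first motif letter)."""
    r = 1.0
    for k in range(len(psi)):
        l = ALPHABET.index(seq[m + k])
        r *= psi[k][l] / theta0[l]
    return r

def log_lik_binding(seq, psi, theta0):
    """log P(S | R=1): background likelihood times the average ratio
    over all N - W + 1 possible start positions."""
    starts = len(seq) - len(psi) + 1
    avg = sum(ratio_at(seq, m, psi, theta0) for m in range(starts)) / starts
    return log_lik_nonbinding(seq, theta0) + math.log(avg)

# Hypothetical toy model: motif "AT" (W = 2) and a uniform background.
psi = [[0.7, 0.1, 0.1, 0.1],   # position 1 prefers A
       [0.1, 0.1, 0.1, 0.7]]   # position 2 prefers T
theta0 = [0.25, 0.25, 0.25, 0.25]
seq = "CGATC"

log_p0 = log_lik_nonbinding(seq, theta0)
log_p1 = log_lik_binding(seq, psi, theta0)
likelihood_ratio = math.exp(log_p1 - log_p0)
```

Only the ratio $\psi_k / \theta_0$ inside each window matters for comparing the two hypotheses, which is why the background product factors out.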

Apply Bayes rule:

$$P(R = 1 \mid S_1, \ldots, S_N) = \frac{P(S_1, \ldots, S_N \mid R = 1)\, P(R = 1)}{P(S_1, \ldots, S_N \mid R = 0)\, P(R = 0) + P(S_1, \ldots, S_N \mid R = 1)\, P(R = 1)}$$

$$= \left(1 + \frac{P(R = 0)\, P(S_1, \ldots, S_N \mid R = 0)}{P(R = 1)\, P(S_1, \ldots, S_N \mid R = 1)}\right)^{-1} = \left(1 + \left[\frac{P(R = 1)}{P(R = 0)}\, \frac{1}{N - W + 1} \sum_{m=0}^{N-W} \prod_{k=1}^{W} \frac{\psi_k(S_{m+k})}{\theta_0(S_{m+k})}\right]^{-1}\right)^{-1}$$

Define:

$$w_k(l) = \log \frac{\psi_k(l)}{\theta_0(l)}, \qquad w_0 = \log \frac{P(R = 1)}{P(R = 0)}, \qquad \operatorname{logit}(z) = \frac{1}{1 + \exp(-z)}$$

With these definitions the posterior becomes

$$P(R = 1 \mid S_1, \ldots, S_N) = \operatorname{logit}\left(w_0 + \log\left[\frac{1}{N - W + 1} \sum_{m=0}^{N-W} \exp\left(\sum_{k=1}^{W} w_k(S_{m+k})\right)\right]\right)$$

$4 \times W + 1$ parameters: $w_k(l)$, $w_0$
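A small numerical check of the log-odds form may help. The sketch below follows the Bayes-rule derivation above, adding the prior log-odds $w_0$ to the log of the averaged window scores and applying $1/(1+\exp(-z))$; the toy motif, the prior $P(R=1)=0.2$, and the names `logistic` and `posterior_binding` are assumptions for illustration.

```python
import math

ALPHABET = "ACGT"

def logistic(z):
    """The transfer function the slides call logit: 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def posterior_binding(seq, w, w0):
    """P(R=1 | seq) from position weights w[k][l] and prior log-odds w0."""
    width = len(w)
    starts = len(seq) - width + 1
    window_scores = [
        sum(w[k][ALPHABET.index(seq[m + k])] for k in range(width))
        for m in range(starts)
    ]
    # log of the mean of exp(score), computed stably via log-sum-exp.
    top = max(window_scores)
    log_avg = top + math.log(
        sum(math.exp(s - top) for s in window_scores) / starts)
    return logistic(w0 + log_avg)

# Hypothetical toy model: motif "AT", uniform background, P(R=1) = 0.2.
psi = [[0.7, 0.1, 0.1, 0.1],
       [0.1, 0.1, 0.1, 0.7]]
theta0 = [0.25, 0.25, 0.25, 0.25]
w = [[math.log(p / q) for p, q in zip(row, theta0)] for row in psi]
w0 = math.log(0.2 / 0.8)

p_with_motif = posterior_binding("CGATC", w, w0)  # contains "AT"
p_without = posterior_binding("CGCGC", w, w0)     # does not
```

For the first sequence the average window ratio is 2.08, so the posterior odds are 0.25 × 2.08 = 0.52; for the second they drop to 0.25 × 0.16 = 0.04.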

[Figure: the motif is scored at each position of the sequence (Score 1, Score 2, ..., Score t, ..., Score N); the scores are summed and passed through a nonlinear transfer function to give P(R=1|sequence)]


Wolfgang Lehrach

Biomathematics & Statistics Scotland

SH3 yeast two-hybrid interaction network

Tong et al. (2002), Science 295, 321-324

285 interactions between 28 SH3 proteins and 143 binding peptides

Final Test Set Performance

[Figure: true positive rate (sensitivity) plotted against false positive rate (1−specificity). Legend: 0.61 Reiss; 0.62 None; 0.64 Naive; 0.69 Gaussian; 0.71 Laplacian with pruning; 0.73 Laplacian]

The model of Segal, Yelensky and Koller

[Figure: Transcriptional Regulation]

[Figure: graphical model for a gene g, built up over several slides: sequence variables g.S1, g.S2, ..., g.SN; binding variables g.R1, g.R2; module variable g.M; expression variables g.E1, g.E2, g.E3; conditional distributions P(g.R2 | g.S) and P(g.E3 | g.M)]

$$P(g.R_i = 1 \mid g.S_1, g.S_2, \ldots, g.S_N) = \operatorname{logit}\left(w_0 + \log\left[\frac{1}{N - W + 1} \sum_{m=0}^{N-W} \exp\left(\sum_{k=1}^{W} w_k(g.S_{m+k})\right)\right]\right)$$

Softmax function

$$P(g.M = m \mid g.R_1 = r_1, g.R_2 = r_2, \ldots, g.R_L = r_L) = \frac{\exp\left(\sum_{i=1}^{L} u_{mi}\, r_i\right)}{\sum_{\tilde{m}} \exp\left(\sum_{i=1}^{L} u_{\tilde{m} i}\, r_i\right)}$$

Parameter matrix: $u_{mi}$
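The softmax assignment can be sketched in a few lines. The parameter values, the number of modules and regulators, and the function name `module_probabilities` are hypothetical and chosen only to make the behaviour visible.

```python
import math

def module_probabilities(u, r):
    """Softmax over modules: P(M = m | r) is proportional to
    exp(sum_i u[m][i] * r[i])."""
    activations = [sum(u_mi * r_i for u_mi, r_i in zip(row, r)) for row in u]
    top = max(activations)  # subtract the maximum for numerical stability
    exps = [math.exp(a - top) for a in activations]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical parameter matrix: 3 modules (rows) x 2 regulators (columns).
u = [[2.0, 0.0],   # module 0 is favoured when regulator 1 binds
     [0.0, 2.0],   # module 1 is favoured when regulator 2 binds
     [0.0, 0.0]]   # module 2 ignores both regulators

# Regulator 1 binds (r_1 = 1), regulator 2 does not (r_2 = 0).
probs = module_probabilities(u, r=[1, 0])
```

With these values module 0 receives most of the probability mass, while modules 1 and 2 share the remainder equally because their activations are both zero.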

[Figure: graphical model with the conditional distribution P(g.E3 | g.M)]

Independent Gaussian distributions

$$P(g.E_1, g.E_2, \ldots, g.E_L \mid g.M = m) = \prod_{j} P(g.E_j \mid g.M = m)$$

$$P(g.E_j \mid g.M = m) = \mathcal{N}(\mu_{j,m}, \sigma_{j,m})$$

For each module $m$ and each condition $j$:

Mean: $\mu_{j,m}$
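Under this factorisation the log-likelihood of an expression profile is a sum of per-condition Gaussian terms. The sketch below treats $\sigma_{j,m}$ as a standard deviation, which the slides do not state explicitly; the module parameters, the example profile and the function names are hypothetical.

```python
import math

def gaussian_log_pdf(x, mean, sd):
    """Log density of a univariate Gaussian with the given mean and
    standard deviation."""
    return (-0.5 * math.log(2.0 * math.pi * sd * sd)
            - (x - mean) ** 2 / (2.0 * sd * sd))

def expression_log_lik(expr, means, sds):
    """log P(E_1..E_L | M = m) = sum_j log N(E_j; mu_{j,m}, sigma_{j,m})."""
    return sum(gaussian_log_pdf(e, mu, sd)
               for e, mu, sd in zip(expr, means, sds))

# Hypothetical parameters: two modules over three conditions.
mu = {0: [2.0, 2.0, -1.0], 1: [-1.0, 0.0, 1.5]}
sigma = {0: [0.5, 0.5, 0.5], 1: [0.5, 0.5, 0.5]}

gene_expr = [1.8, 2.1, -0.7]  # hypothetical expression profile of one gene
scores = {m: expression_log_lik(gene_expr, mu[m], sigma[m]) for m in mu}
best_module = max(scores, key=scores.get)
```

The profile lies close to the means of module 0 in every condition, so that module has the higher log-likelihood.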

[Figure: full graphical model with the conditional distributions P(g.R2 | g.S) and P(g.E3 | g.M)]

Bayesian approach

$$P(\text{parameters} \mid \text{data}) = \sum_{\text{latent variables}} P(\text{parameters}, \text{latent variables} \mid \text{data})$$

Intractable!

Gibbs sampling: alternately draw

parameters from $P(\text{parameters} \mid \text{latent variables}, \text{data})$

latent variables from $P(\text{latent variables} \mid \text{parameters}, \text{data})$

[Figure: Gibbs sampling illustrated on a two-dimensional distribution P(x,y), with axes x and y and the conditional P(y|x)]
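The alternating scheme can be tried on a two-variable toy problem. The sketch below is not the sampler for the regulatory model; it assumes a standard bivariate Gaussian with correlation `rho`, for which both conditionals are known in closed form, and the function name `gibbs_bivariate_normal` is hypothetical.

```python
import math
import random

def gibbs_bivariate_normal(rho, n_samples, burn_in=500, seed=0):
    """Gibbs sampler for a zero-mean, unit-variance bivariate Gaussian.

    The conditionals are x | y ~ N(rho * y, 1 - rho^2) and
    y | x ~ N(rho * x, 1 - rho^2).
    """
    rng = random.Random(seed)
    sd = math.sqrt(1.0 - rho * rho)
    x, y = 0.0, 0.0
    samples = []
    for step in range(burn_in + n_samples):
        x = rng.gauss(rho * y, sd)  # draw x from P(x | y)
        y = rng.gauss(rho * x, sd)  # draw y from P(y | x)
        if step >= burn_in:
            samples.append((x, y))
    return samples

samples = gibbs_bivariate_normal(rho=0.8, n_samples=50000)
n = len(samples)
mean_x = sum(x for x, _ in samples) / n
mean_y = sum(y for _, y in samples) / n
cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in samples) / n
```

Each draw only needs a one-dimensional conditional, yet the collected pairs approximate the joint distribution: the empirical covariance should come out close to `rho`.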

Still too expensive

Find one “good” set of parameters rather than a whole sample from the posterior distribution.

Hard-assignment EM algorithm.

Various heuristic simplifications.
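To show what hard assignment means, here is a deliberately reduced sketch that uses only the expression part of the model, with Gaussian modules of fixed, shared variance. Under that assumption the E-step reduces to picking the nearest module mean. The data, the initial means and the function name `hard_em` are hypothetical, and the sequence and motif parts of the full model are left out.

```python
def hard_em(expr, init_means, n_iter=20):
    """Hard-assignment EM for module means with a fixed, shared variance.

    E-step: assign each gene to its single most likely module.
    M-step: re-estimate each module mean from the genes assigned to it.
    """
    means = [list(m) for m in init_means]
    assign = None
    for _ in range(n_iter):
        # E-step: with equal variances, the most likely module is the one
        # whose mean is nearest in squared Euclidean distance.
        new_assign = [
            min(range(len(means)),
                key=lambda m: sum((e - mu) ** 2 for e, mu in zip(g, means[m])))
            for g in expr
        ]
        if new_assign == assign:
            break  # assignments stopped changing
        assign = new_assign
        # M-step: per-condition average over the genes in each module.
        for m in range(len(means)):
            members = [g for g, a in zip(expr, assign) if a == m]
            if members:
                means[m] = [sum(col) / len(members) for col in zip(*members)]
    return assign, means

# Hypothetical expression data: 6 genes x 2 conditions.
expr = [[2.1, 1.9], [1.8, 2.2], [2.0, 2.0],
        [-1.0, 0.1], [-1.2, -0.1], [-0.8, 0.0]]
assign, means = hard_em(expr, init_means=[[1.0, 1.0], [0.0, 0.0]])
```

Unlike standard EM, each gene contributes to exactly one module in the M-step, which is cheaper but commits to a single configuration of the latent variables.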

[Figure: graphical model during the E-step and the M-step of hard-assignment EM]

Segal, Yelensky, Koller (2003), Bioinformatics 19

Experiment 1

173 microarrays, measuring responses to various stress conditions (Gasch et al. 2000)

Conventional algorithms: 20% of the predicted motifs are known.

Unified probabilistic model: 45% of the predicted motifs are known.

Experiment 2

77 microarrays, expression during the cell cycle (Spellman et al. 1998)

Conventional algorithms: 30% of the predicted motifs are known.

Unified probabilistic model: 56% of the predicted motifs are known.
