An Introduction to the Use of Bayesian Network to Analyze Gene Expression Data

Academic year: 2021

(1)

An Introduction to the Use of Bayesian Network to Analyze Gene Expression Data

Cristina Manfredotti

Dipartimento di Informatica, Sistemistica e Comunicazione (D.I.S.Co.)

Università degli Studi Milano-Bicocca

(2)

Introduction

• A central goal of molecular biology is to understand the regulation of protein synthesis.

• DNA microarray experiments can measure thousands of gene expression levels simultaneously.

• An important challenge is to develop methodologies that are both statistically sound and computationally tractable.

(3)

Biological Background

• DNA

– DNA is a double-stranded molecule

– Hereditary information is encoded in DNA

• Gene

– A gene is a segment of DNA

– Contains the information required to make a protein

(4)

Motivations

• Each gene encodes a protein, and proteins are the functional units of life

• Every gene is present in every cell, but only a fraction of the genes are expressed at any time

• Understanding the mechanisms that determine which genes are expressed, and when they are expressed, is the key to the development of new treatments for diseases

• Many diseases result from the interaction between genes

(5)

Bayesian Networks

• Prior work: clustering of expression data

– Groups together genes with similar expression patterns

– Disadvantage: does not reveal structural relations between genes

• Big challenge

– Extract meaningful information from the expression data

– Discover interactions between genes based on the measurements

(6)

Bayesian Networks

• A Bayesian Network (BN) is a graphical representation of a probability distribution

– Compact and intuitive representation

– Useful for describing processes composed of locally interacting components

– Has a good statistical foundation

– Efficient model-learning algorithms

– Can capture causal relationships

(7)

Representing Distributions

• A Bayesian network is a representation of a joint probability distribution.

• A Bayesian network has two components:

– G: a directed acyclic graph structure

– Θ: a set of parameters for the conditional distribution of each variable

• The joint probability distribution of {X_1, …, X_n} is represented by a Bayesian network as follows:

$$P(X_1, \ldots, X_n) = \prod_{i=1}^{n} P\big(X_i \mid \mathrm{Pa}^{G}(X_i)\big)$$

– where Pa^G(X_i) is the set of parents of X_i in the graph G.
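A minimal Python sketch of this product form, assuming binary variables and a hypothetical two-node network X1 → X2. The CPT values and the names `parents`, `cpts` and `joint_probability` are illustrative assumptions, not taken from the slides.

```python
from itertools import product

# Hypothetical structure G: X1 has no parents, X2 has parent X1.
parents = {"X1": (), "X2": ("X1",)}

# Hypothetical parameters Theta: P(X = 1 | parent values), keyed by parent tuple.
cpts = {
    "X1": {(): 0.3},
    "X2": {(0,): 0.2, (1,): 0.9},
}

def joint_probability(assignment, parents, cpts):
    """Return P(x_1, ..., x_n) as the product of P(x_i | Pa(x_i))."""
    prob = 1.0
    for var, pa in parents.items():
        pa_values = tuple(assignment[p] for p in pa)
        p_one = cpts[var][pa_values]
        prob *= p_one if assignment[var] == 1 else 1.0 - p_one
    return prob

# Summing the product form over every assignment should give 1.
total = sum(
    joint_probability(dict(zip(parents, values)), parents, cpts)
    for values in product((0, 1), repeat=len(parents))
)
print(total)
```

Storing only P(X = 1 | parents) is enough for binary variables; a variable with more states would need one entry per state.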

(8)

An Example of a Simple BN

[Figure: network over Gene A, Gene B, Gene C, Gene D and Gene E, with edges A → B, E → B, B → C and A → D]

- Gene B and Gene D are independent given Gene A.

- Gene B asserts a dependency between Gene A and Gene E.

- Gene A and Gene C are independent given Gene B.

$$\begin{aligned}
P(A,B,C,D,E) &= P(A)\,P(B \mid A)\,P(C \mid A,B)\,P(D \mid A,B,C)\,P(E \mid A,B,C,D) \\
&= P(A)\,P(B \mid A,E)\,P(C \mid B)\,P(D \mid A)\,P(E)
\end{aligned}$$
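The independence statements on this slide can be checked numerically. The sketch below builds the five-gene network from the factorization P(A) P(B|A,E) P(C|B) P(D|A) P(E), fills it with made-up CPT values (an assumption, the slides give none), and compares probabilities by brute-force enumeration. The helper names `joint` and `prob` are hypothetical.

```python
from itertools import product

# Structure read off the factorization: A -> B <- E, B -> C, A -> D.
parents = {"A": (), "E": (), "B": ("A", "E"), "C": ("B",), "D": ("A",)}

# Hypothetical P(X = 1 | parents); the numbers are made up for illustration.
cpts = {
    "A": {(): 0.4},
    "E": {(): 0.7},
    "B": {(0, 0): 0.1, (0, 1): 0.6, (1, 0): 0.5, (1, 1): 0.9},
    "C": {(0,): 0.2, (1,): 0.8},
    "D": {(0,): 0.3, (1,): 0.75},
}
names = list(parents)

def joint(x):
    """Joint probability of a full assignment via the product form."""
    p = 1.0
    for var, pa in parents.items():
        p_one = cpts[var][tuple(x[q] for q in pa)]
        p *= p_one if x[var] == 1 else 1.0 - p_one
    return p

def prob(**fixed):
    """Probability of a partial assignment, by summing over the rest."""
    total = 0.0
    for values in product((0, 1), repeat=len(names)):
        x = dict(zip(names, values))
        if all(x[k] == v for k, v in fixed.items()):
            total += joint(x)
    return total

# B and D are independent given A: P(B, D | A) = P(B | A) P(D | A).
lhs = prob(A=1, B=1, D=1) / prob(A=1)
rhs = prob(A=1, B=1) / prob(A=1) * prob(A=1, D=1) / prob(A=1)

# A and E are independent on their own but dependent once B is known.
marginal_gap = abs(prob(A=1, E=1) - prob(A=1) * prob(E=1))
given_b_gap = abs(
    prob(A=1, E=1, B=1) / prob(B=1)
    - prob(A=1, B=1) / prob(B=1) * prob(E=1, B=1) / prob(B=1)
)
print(lhs, rhs, marginal_gap, given_b_gap)
```

Enumeration is only practical for a handful of variables; it is used here because it makes each check a direct reading of the definition.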

(9)

Learning Bayesian Networks

• Given a training set D = {x_1, …, x_N} of independent instances of X, find a network B = <G, Θ> that best matches D.

• The score function for a network is defined as

$$S(G : D) = P(G \mid D) = \frac{P(D \mid G)\,P(G)}{P(D)}$$

where

$$P(D \mid G) = \int P(D \mid G, \Theta)\,P(\Theta \mid G)\,d\Theta$$

is the marginal likelihood, which averages the probability of the data over all possible parameter assignments to G.
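To make the averaging over parameters concrete, here is a small sketch for the simplest possible case: a single binary variable with no parents and a uniform prior on its parameter θ. The uniform prior and the function names are assumptions of this sketch; the slides do not specify a prior. For k ones among n instances, the integral of θ^k (1−θ)^(n−k) over [0, 1] has the closed form k!(n−k)!/(n+1)!, which the code compares with a numerical average over a grid of θ values.

```python
from math import factorial

def marginal_likelihood_closed_form(k, n):
    """Integral of theta^k (1 - theta)^(n - k) over [0, 1] (uniform prior)."""
    return factorial(k) * factorial(n - k) / factorial(n + 1)

def marginal_likelihood_numeric(k, n, steps=100_000):
    """Midpoint-rule average of the likelihood over theta in [0, 1]."""
    total = 0.0
    for s in range(steps):
        theta = (s + 0.5) / steps
        total += theta**k * (1.0 - theta) ** (n - k)
    return total / steps

# Example counts (illustrative): 7 ones among 13 instances.
exact = marginal_likelihood_closed_form(7, 13)
approx = marginal_likelihood_numeric(7, 13)
print(exact, approx)
```

For a full network with suitable priors the marginal likelihood factorizes into one such term per variable and parent configuration, which is what makes structure scoring tractable.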

(10)

Learning Bayesian Networks

(11)

Learning Bayesian Networks

(12)

A simple example

• We want to construct a BN of a system composed of 3 genes (A, B and C) that can be ON or OFF

• Given the training set D

• Fix a number of iterations M

• Choose (randomly) M structures G_j (binary square matrices)

• Learn the Conditional Probability Tables of each structure
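The random choice of structures can be sketched as drawing binary square matrices with an empty diagonal. The edge probability of 0.5, the fixed seed and the function name `random_structure` are assumptions made for illustration; the slides do not say how the structures are drawn.

```python
import random

GENES = ["A", "B", "C"]

def random_structure(rng, n=3, edge_prob=0.5):
    """Binary n x n matrix; entry [i][j] = 1 means an edge from gene i to gene j."""
    return [
        [1 if i != j and rng.random() < edge_prob else 0 for j in range(n)]
        for i in range(n)
    ]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
M = 6
structures = [random_structure(rng, n=len(GENES)) for _ in range(M)]
for g in structures:
    print(g)
```

A matrix drawn this way may contain a directed cycle, in which case it is not a valid Bayesian network structure and would have to be rejected or redrawn.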

(13)

A simple example

D:

A  B  C
0  0  0
0  1  1
0  1  1
0  0  1
0  0  1
0  0  1
1  1  0
1  1  0
1  1  0
1  1  0
1  1  1
1  1  1
1  1  1

|D| = 13

M = 6

(14)

Structures:

G_j   AA  AB  AC  BA  BB  BC  CA  CB  CC
G_1   0   1   1   0   0   1   0   0   0
G_2   0   0   1   1   0   1   0   0   0
G_3   0   1   1   0   0   0   0   1   0
G_4   0   0   1   1   0   0   0   1   0
G_5   0   0   0   1   0   1   1   0   0
G_6   0   1   1   0   0   1   0   0   0
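Each row of the structure table can be read as a 3×3 adjacency matrix in the order AA, AB, …, CC. The sketch below assumes that an entry XY = 1 means an edge X → Y, a reading that agrees with the tables shown for G_1 and G_5. It then checks which of the six matrices are acyclic and how often G_1 and G_5 occur among them. Under this reading G_4 (A → C, C → B, B → A) is a cycle, and the frequencies 2/6 and 1/6 match the structure weights used in the later products.

```python
from collections import Counter

# Rows/columns ordered A, B, C; entry [i][j] = 1 is read as an edge i -> j.
structures = {
    "G1": ((0, 1, 1), (0, 0, 1), (0, 0, 0)),
    "G2": ((0, 0, 1), (1, 0, 1), (0, 0, 0)),
    "G3": ((0, 1, 1), (0, 0, 0), (0, 1, 0)),
    "G4": ((0, 0, 1), (1, 0, 0), (0, 1, 0)),
    "G5": ((0, 0, 0), (1, 0, 1), (1, 0, 0)),
    "G6": ((0, 1, 1), (0, 0, 1), (0, 0, 0)),
}

def is_acyclic(adj):
    """Repeatedly remove nodes without incoming edges; a DAG empties out."""
    remaining = set(range(len(adj)))
    while remaining:
        sources = [j for j in remaining
                   if not any(adj[i][j] for i in remaining)]
        if not sources:
            return False  # every remaining node has a parent: a cycle exists
        remaining -= set(sources)
    return True

acyclic = {name: is_acyclic(adj) for name, adj in structures.items()}

# Relative frequency of a structure among the M = 6 draws.
counts = Counter(structures.values())
prior_g1 = counts[structures["G1"]] / len(structures)
prior_g5 = counts[structures["G5"]] / len(structures)
print(acyclic, prior_g1, prior_g5)
```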

(15)

G_1: [Figure: graph over nodes A, B, C]

P(A=0) = 6/13

P(A=1) = 7/13

B\A    0      1
0      4/13   0
1      2/13   7/13

C\A,B  00     01     10     11
0      1/13   0      0      4/13
1      0      5/13   0      3/13
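The tables for G_1 can be recomputed by counting in D. The sketch below is a minimal illustration; the helper names `joint_frequency` and `conditional` are assumptions of this sketch. It takes the 13 instances to be 000 (once), 011 (twice), 001 (three times), 110 (four times) and 111 (three times), which is how the D table reads. Two observations follow from it. First, the slide's entries are counts divided by |D| = 13, that is, joint relative frequencies, whereas a conditional probability table in the usual sense is normalized within each parent configuration (so P(B=0 | A=0) would be 4/6 rather than 4/13). Second, counting directly from D gives 3/13 and 2/13 for C=1 under (A,B) = 00 and 01, where the C\A,B table lists 0 and 5/13.

```python
from collections import Counter
from fractions import Fraction

# The 13 training instances as (A, B, C) triples.
D = ([(0, 0, 0)] + [(0, 1, 1)] * 2 + [(0, 0, 1)] * 3
     + [(1, 1, 0)] * 4 + [(1, 1, 1)] * 3)
INDEX = {"A": 0, "B": 1, "C": 2}

def joint_frequency(variables):
    """Relative frequency of each value combination of `variables` in D."""
    counts = Counter(tuple(row[INDEX[v]] for v in variables) for row in D)
    return {values: Fraction(c, len(D)) for values, c in counts.items()}

def conditional(child, parent_vars):
    """P(child | parents): joint counts normalised per parent configuration."""
    joint = Counter(
        tuple(row[INDEX[v]] for v in (child, *parent_vars)) for row in D
    )
    parent = Counter(tuple(row[INDEX[v]] for v in parent_vars) for row in D)
    # key = (child value, parent values...); key[1:] is the parent configuration.
    return {key: Fraction(c, parent[key[1:]]) for key, c in joint.items()}

p_a = joint_frequency(["A"])
b_and_a = joint_frequency(["B", "A"])        # keys are (B, A)
c_and_ab = joint_frequency(["C", "A", "B"])  # keys are (C, A, B)
b_given_a = conditional("B", ["A"])          # keys are (B, A)
print(p_a, b_and_a, c_and_ab, b_given_a)
```

Combinations that never occur in D are simply absent from the returned dictionaries, which corresponds to the 0 entries in the tables.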

(16)

G_5: [Figure: graph over nodes A, B, C]

P(B=0) = 4/13

P(B=1) = 9/13

C\B    0      1
0      1/13   4/13
1      3/13   5/13

A\B,C  00     01     10     11
0      1/13   3/13   0      2/13
1      0      0      4/13   3/13

(17)

A simple example

D:

A  B  C
0  0  0
0  1  1
0  1  1
0  0  1
0  0  1
0  0  1
1  1  0
1  1  0

(18)

G_1: [Figure: graph over nodes A, B, C]

P([0 0 0] | G_1) P(G_1) = 6/13 * 4/13 * 1/13 * 2/6

P([0 1 1] | G_1) P(G_1) = 6/13 * 2/13 * 5/13 * 2/6

P([0 0 1] | G_1) P(G_1) = 6/13 * 4/13 * 0 * 2/6

Score = 1/n P(D_i | G_1)

(19)

G_5: [Figure: graph over nodes A, B, C]

P([0 0 0] | G_5) P(G_5) = 1/13 * 4/13 * 1/13 * 1/6

P([0 1 1] | G_5) P(G_5) = 2/13 * 9/13 * 5/13 * 1/6

P([0 0 1] | G_5) P(G_5) = 3/13 * 4/13 * 3/13 * 1/6

Score = 1/n P(D_i | G_5)
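The products written out for G_1 and G_5 can be evaluated exactly with rational arithmetic. The sketch below only multiplies the factors as they appear on the slides; it makes no claim about how the per-instance terms are combined into the final score. The dictionary layout and the name `values` are assumptions of this sketch.

```python
from fractions import Fraction as F
from math import prod

# Factors copied from the slides: three table entries, then the structure weight.
factors = {
    ("G1", (0, 0, 0)): [F(6, 13), F(4, 13), F(1, 13), F(2, 6)],
    ("G1", (0, 1, 1)): [F(6, 13), F(2, 13), F(5, 13), F(2, 6)],
    ("G1", (0, 0, 1)): [F(6, 13), F(4, 13), F(0), F(2, 6)],
    ("G5", (0, 0, 0)): [F(1, 13), F(4, 13), F(1, 13), F(1, 6)],
    ("G5", (0, 1, 1)): [F(2, 13), F(9, 13), F(5, 13), F(1, 6)],
    ("G5", (0, 0, 1)): [F(3, 13), F(4, 13), F(3, 13), F(1, 6)],
}

# Exact value of each product, reduced to lowest terms.
values = {key: prod(fs) for key, fs in factors.items()}
for (graph, instance), value in values.items():
    print(graph, instance, value, float(value))
```

With the factors as listed, the instance [0 0 1] gets weight 0 under G_1 but a positive weight under G_5.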

(20)

• Practical problem: small data sets

– variables: hundreds or thousands of genes

– samples: just tens of microarray experiments

• On the positive side, genetic regulation networks are sparse.

• Characterize and learn features that are common to most of these networks

(21)

Analyzing Expression Data:

• The first feature: Markov relations

– Symmetric relation: Y is in X's Markov blanket iff there is either an edge between them, or both are parents of another variable (Pearl 1988).

– Biological interpretation: a Markov relation indicates that the two genes are related in some joint biological interaction or process
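The definition translates directly into a test on a graph. The sketch below uses the five-gene example network (A → B ← E, B → C, A → D) stored as a child-to-parents map; the representation and the function name `markov_related` are assumptions of this sketch.

```python
# Five-gene example network as a child -> parents map.
parents = {
    "A": set(),
    "E": set(),
    "B": {"A", "E"},
    "C": {"B"},
    "D": {"A"},
}

def markov_related(x, y, parents):
    """True if x and y share an edge or are both parents of some variable."""
    has_edge = x in parents[y] or y in parents[x]
    share_child = any(x in pa and y in pa for pa in parents.values())
    return has_edge or share_child

print(markov_related("A", "E", parents))  # co-parents of B
print(markov_related("A", "B", parents))  # direct edge
print(markov_related("C", "D", parents))  # neither
```

The test is symmetric in its two arguments, matching the symmetry of the relation.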

(22)

Analyzing Expression Data:

• The second feature: order relations

– Global property: A is an ancestor of B in all the equivalent Bayesian networks learned

– Biological interpretation: an order relation suggests that the transcription of one gene is a cause, direct or indirect, of the transcription of another gene
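Ancestry in a single graph is a reachability question. The sketch below computes the ancestors of a node in the five-gene example network (A → B ← E, B → C, A → D); the function name `ancestors` is an assumption. Note that it checks one DAG only: the feature on this slide additionally requires the relation to hold in every network of the equivalence class, which this sketch does not attempt.

```python
# Five-gene example network as a child -> parents map.
parents = {
    "A": set(),
    "E": set(),
    "B": {"A", "E"},
    "C": {"B"},
    "D": {"A"},
}

def ancestors(node, parents):
    """All variables with a directed path into `node`."""
    found = set()
    stack = list(parents[node])
    while stack:
        p = stack.pop()
        if p not in found:
            found.add(p)
            stack.extend(parents[p])  # keep walking upwards
    return found

print(sorted(ancestors("C", parents)))
print(sorted(ancestors("D", parents)))
```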

(23)

Estimating Statistical Confidence in

Features

• To what extent does the data support a given feature?

• An effective and relatively simple approach for estimating confidence: the bootstrap method.

• For i = 1, …, m

– Re-sample with replacement N instances from D. Denote by D_i the resulting dataset.

– Apply the learning procedure on D_i to induce a network structure G_i.

• For each feature f of interest calculate

$$\mathrm{conf}(f) = \frac{1}{m} \sum_{i=1}^{m} f(G_i)$$

– where f(G) is 1 if f is a feature in G, and 0 otherwise.
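The bootstrap loop can be sketched as follows. The learner here is a deliberately crude stand-in (it links two genes whose values agree in at least 75% of the instances) and is not the learning procedure of the slides; it is only there so that the loop runs end to end. The data are the 13 three-gene instances from the simple example, and the names `learn_structure`, `bootstrap_confidence` and `has_ab_edge` are assumptions.

```python
import random

# Training set from the three-gene example, as (A, B, C) triples.
D = ([(0, 0, 0)] + [(0, 1, 1)] * 2 + [(0, 0, 1)] * 3
     + [(1, 1, 0)] * 4 + [(1, 1, 1)] * 3)

def learn_structure(data):
    """Stand-in learner (NOT the slides' procedure): link two genes when
    their values agree in at least 75% of the instances."""
    edges = set()
    for i in range(3):
        for j in range(i + 1, 3):
            agree = sum(row[i] == row[j] for row in data) / len(data)
            if agree >= 0.75:
                edges.add((i, j))
    return edges

def bootstrap_confidence(data, feature, m=200, seed=0):
    """conf(f) = (1/m) * sum_i f(G_i), with G_i learned from resample D_i."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(m):
        # D_i: N instances drawn from D with replacement.
        resample = [rng.choice(data) for _ in range(len(data))]
        hits += feature(learn_structure(resample))
    return hits / m

def has_ab_edge(graph):
    """Feature f: 1 if genes A and B are linked in the learned graph."""
    return 1 if (0, 1) in graph else 0

conf_ab = bootstrap_confidence(D, has_ab_edge)
print(conf_ab)
```

Any structure learner and any 0/1 feature (a Markov relation, an order relation) can be plugged into `bootstrap_confidence` in the same way.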

(24)

How to collect data:

• Gene knock down

• Gene knock out

• Compound

• Tissue microarray

• Time course

(25)

References
