An Introduction to the Use of Bayesian Network to Analyze Gene Expression Data

(1)

Cristina Manfredotti

Dipartimento di Informatica, Sistemistica e Comunicazione (D.I.S.Co.)

Università degli Studi Milano-Bicocca

(2)

Introduction

• A central goal of molecular biology is to understand the regulation of protein synthesis.

• DNA microarray experiments can measure thousands of gene expression levels simultaneously.

• An important challenge is to develop methodologies that are both statistically sound and computationally tractable.

(3)

Biological Background

• DNA: a double-stranded molecule in which hereditary information is encoded.

• Gene: a segment of DNA that contains the information required to make a protein.

(4)

Motivations

• Each gene encodes a protein, and proteins are the functional units of life.

• Every gene is present in every cell, but only a fraction of the genes are expressed at any time.

• Understanding the mechanisms that determine which genes are expressed, and when they are expressed, is the key to the development of new treatments for diseases.

• Many diseases result from the interaction between genes.

(5)

Bayesian Networks

• Prior work: clustering of expression data

– Groups together genes with similar expression patterns

– Disadvantage: does not reveal structural relations between genes

• Big challenge

– Extract meaningful information from the expression data

– Discover interactions between genes based on the measurements

(6)

Bayesian Networks

• A Bayesian Network (BN) is a graphical representation of a probability distribution

– Compact and intuitive representation

– Useful for describing processes composed of locally interacting components

– Has a good statistical foundation

– Efficient model learning algorithms

– Can capture causal relationships

(7)

Representing Distributions

• A Bayesian network is a representation of a joint probability distribution.

• A Bayesian network has two components:

– G: a directed acyclic graph structure

– Θ: a set of parameters for the conditional distribution of each variable

• The joint probability distribution of {X_1, …, X_n} is represented by a Bayesian network as follows:

$$P(X_1, \ldots, X_n) = \prod_{i=1}^{n} P(X_i \mid Pa^G(X_i))$$

– where Pa^G(X_i) is the set of parents of X_i in the graph G.

(8)

An Example of a Simple BN

[Figure: Bayesian network over Gene A, Gene B, Gene C, Gene D and Gene E]

- Gene B and Gene D are independent given Gene A.

- Gene B asserts dependency between Gene A and Gene E.

- Gene A and Gene C are independent given Gene B.

$$\begin{aligned}
P(A, B, C, D, E) &= P(A)\,P(B \mid A)\,P(C \mid A, B)\,P(D \mid A, B, C)\,P(E \mid A, B, C, D) \\
&= P(A)\,P(B \mid A, E)\,P(C \mid B)\,P(D \mid A)\,P(E)
\end{aligned}$$
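The factorization on this slide can be checked numerically. The sketch below builds the joint distribution from the five local terms P(A), P(E), P(B | A, E), P(C | B) and P(D | A). All probability values are hypothetical numbers chosen for illustration, and the helper names (`bern`, `joint`, `prob`) are assumptions of this sketch, not part of the original slides.

```python
from itertools import product

# Hypothetical CPT values (illustrative only, not taken from the slides).
p_a = {0: 0.7, 1: 0.3}                                         # P(A)
p_e = {0: 0.4, 1: 0.6}                                         # P(E)
p_b1 = {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.4, (1, 1): 0.9}    # P(B=1 | A, E)
p_c1 = {0: 0.2, 1: 0.8}                                        # P(C=1 | B)
p_d1 = {0: 0.3, 1: 0.7}                                        # P(D=1 | A)


def bern(p_one, value):
    """Probability of a binary value when P(value = 1) = p_one."""
    return p_one if value == 1 else 1.0 - p_one


def joint(a, b, c, d, e):
    """P(A,B,C,D,E) = P(A) P(B|A,E) P(C|B) P(D|A) P(E)."""
    return (p_a[a] * bern(p_b1[(a, e)], b) * bern(p_c1[b], c)
            * bern(p_d1[a], d) * p_e[e])


states = list(product([0, 1], repeat=5))
total = sum(joint(*s) for s in states)  # should be 1 up to rounding


def prob(**fixed):
    """Marginal probability of a partial assignment such as prob(a=1, b=0)."""
    names = "abcde"
    return sum(joint(*s) for s in states
               if all(s[names.index(k)] == v for k, v in fixed.items()))


# B and D are independent given A: P(b, d | a) equals P(b | a) P(d | a).
lhs = prob(a=1, b=1, d=1) / prob(a=1)
rhs = (prob(a=1, b=1) / prob(a=1)) * (prob(a=1, d=1) / prob(a=1))
print(total, lhs, rhs)
```

With these numbers the two sides of the conditional-independence check agree, while conditioning on Gene B makes Gene A and Gene E dependent, which matches the three statements on the slide.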

(9)

Learning Bayesian Networks

• Given a training set D = {x_1, …, x_N} of independent instances of X, find a network B = <G, Θ> that best matches D.

• The score function for a network is defined as

$$S(G : D) = P(G \mid D) = \frac{P(D \mid G)\,P(G)}{P(D)},$$

where

$$P(D \mid G) = \int P(D \mid G, \Theta)\,P(\Theta \mid G)\,d\Theta$$

is the marginal likelihood, which averages the probability of the data over all possible parameter assignments to G.
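To make the marginal likelihood integral concrete, here is a minimal sketch for the simplest possible case: a single binary variable with no parents and a Beta prior on its parameter. The choice of a Beta(1, 1) prior and the function names are assumptions of this sketch; the slides do not specify a prior. In this case the integral has the closed form B(α + n1, β + n0) / B(α, β), which the code compares against a direct numerical approximation of the integral.

```python
import math


def log_beta(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def log_marginal_likelihood(n1, n0, alpha=1.0, beta=1.0):
    """log P(D | G) for one parentless binary variable.

    n1 and n0 are the numbers of 1s and 0s in the data; the parameter
    theta = P(X = 1) gets an (assumed) Beta(alpha, beta) prior.
    """
    return log_beta(alpha + n1, beta + n0) - log_beta(alpha, beta)


def numeric_marginal_likelihood(n1, n0, steps=100_000):
    """Midpoint-rule approximation of the same integral for a uniform prior."""
    h = 1.0 / steps
    return h * sum(((k + 0.5) * h) ** n1 * (1.0 - (k + 0.5) * h) ** n0
                   for k in range(steps))


# Counts chosen to match a variable that is ON in 7 of 13 samples.
closed_form = math.exp(log_marginal_likelihood(7, 6))
approx = numeric_marginal_likelihood(7, 6)
print(closed_form, approx)
```

For a full network the same computation is applied per variable and per parent configuration, which is why scores of this kind decompose over the structure.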

(10)

(11)

(12)

A simple example

• We want to construct a BN of a system composed of 3 genes (A, B and C), each of which can be ON or OFF

• Given the training set D

• Fix a number of iterations M

• Choose (randomly) M structures G_j (binary square matrices)

• Learn the conditional probability tables of each structure
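The steps above can be sketched as follows. This is an illustration under stated assumptions: `adj[i][j] == 1` is taken to mean an edge from gene i to gene j, random matrices are rejected until they are acyclic, and the tables are estimated as conditional relative frequencies. The toy data set is hypothetical, and all function names are invented for this sketch.

```python
import random
from collections import Counter


def is_acyclic(adj):
    """True if the directed graph with adjacency matrix adj has no cycle."""
    remaining = set(range(len(adj)))
    while remaining:
        # Nodes with no incoming edge from the remaining nodes can be removed.
        sources = [j for j in remaining
                   if not any(adj[i][j] for i in remaining)]
        if not sources:
            return False
        remaining -= set(sources)
    return True


def random_structure(n, rng):
    """Rejection-sample a binary n x n matrix until it describes a DAG."""
    while True:
        adj = [[rng.randint(0, 1) if i != j else 0 for j in range(n)]
               for i in range(n)]
        if is_acyclic(adj):
            return adj


def learn_cpts(adj, data):
    """Estimate P(X_j | parents of X_j) by relative frequencies.

    Returns one dict per variable, keyed by (parent_values, value).
    """
    n = len(adj)
    cpts = []
    for j in range(n):
        parents = [i for i in range(n) if adj[i][j]]
        pair_counts = Counter((tuple(row[i] for i in parents), row[j])
                              for row in data)
        parent_counts = Counter(tuple(row[i] for i in parents) for row in data)
        cpts.append({key: count / parent_counts[key[0]]
                     for key, count in pair_counts.items()})
    return cpts


# Hypothetical toy data over (A, B, C); not the training set from the slides.
data = [(0, 0, 0), (0, 1, 1), (1, 1, 0), (1, 1, 1)]
rng = random.Random(0)
structures = [random_structure(3, rng) for _ in range(6)]   # M = 6
chain = [[0, 1, 0], [0, 0, 1], [0, 0, 0]]                    # A -> B -> C
chain_cpts = learn_cpts(chain, data)
print(chain_cpts[1])
```

Rejection sampling is only practical for very small networks such as this three-gene example; for larger gene sets a search procedure over structures would be needed instead.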

(13)

A simple example

[Table: training set D, one row per sample with a 0/1 value for each of A, B and C]

|D| = 13

M = 6

(14)

Structures: each G_j is a binary matrix with entries AA, AB, AC, BA, BB, BC, CA, CB, CC.

[Figure: example structure G_1]

(15)

G_1:

[Figure: network over A, B and C]

P(A=0) = 6/13

P(A=1) = 7/13

B\A     0      1
0       4/13   0
1       2/13   7/13

C\A,B   00     01     10     11
0       1/13   0      0      4/13
1       0      5/13   0      3/13

(16)

G_5:

[Figure: network over A, B and C]

P(B=0) = 4/13

P(B=1) = 9/13

C\B     0      1
0       1/13   4/13
1       3/13   5/13

A\B,C   00     01     10     11
0       1/13   3/13   0      2/13
1       0      0      4/13   3/13

(17)

A simple example

[Table: training set D with columns A, B and C]

(18)

G_1:

[Figure: network over A, B and C]

P([0 0 0] | G_1) P(G_1) = 6/13 * 4/13 * 1/13 * 2/6

P([0 1 1] | G_1) P(G_1) = 6/13 * 2/13 * 5/13 * 2/6

P([0 0 1] | G_1) P(G_1) = 6/13 * 4/13 * 0 * 2/6

…

$$\text{Score} = \frac{1}{n} \sum_{i} P(D_i \mid G_1)$$

(19)

G_5:

[Figure: network over A, B and C]

P([0 0 0] | G_5) P(G_5) = 1/13 * 4/13 * 1/13 * 1/6

P([0 1 1] | G_5) P(G_5) = 2/13 * 9/13 * 5/13 * 1/6

P([0 0 1] | G_5) P(G_5) = 3/13 * 4/13 * 3/13 * 1/6

…

$$\text{Score} = \frac{1}{n} \sum_{i} P(D_i \mid G_5)$$
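The products listed for G_1 and G_5 can be evaluated exactly with rational arithmetic. The sketch below only multiplies out the factors shown on the two slides; the last factor in each product is the slide's P(G_j) term. The dictionary names are assumptions of this sketch, and the elided rows ("…") are not filled in.

```python
from fractions import Fraction as F

# Terms P(x | G_1) P(G_1) for the three samples listed on the G_1 slide.
g1_terms = {
    "[0 0 0]": F(6, 13) * F(4, 13) * F(1, 13) * F(2, 6),
    "[0 1 1]": F(6, 13) * F(2, 13) * F(5, 13) * F(2, 6),
    "[0 0 1]": F(6, 13) * F(4, 13) * 0 * F(2, 6),
}

# Terms P(x | G_5) P(G_5) for the same samples on the G_5 slide.
g5_terms = {
    "[0 0 0]": F(1, 13) * F(4, 13) * F(1, 13) * F(1, 6),
    "[0 1 1]": F(2, 13) * F(9, 13) * F(5, 13) * F(1, 6),
    "[0 0 1]": F(3, 13) * F(4, 13) * F(3, 13) * F(1, 6),
}

for sample in g1_terms:
    print(sample, g1_terms[sample], g5_terms[sample])
```

Note that under G_1 the sample [0 0 1] receives probability zero because one table entry is zero, whereas G_5 assigns it a positive value.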

(20)

• Practical problem: small data sets

– Variables: hundreds or thousands of genes

– Samples: just tens of microarray experiments

• On the positive side, genetic regulation networks are sparse.

• Characterize and learn features that are common to most of these networks

(21)

Analyzing Expression Data:

• The first feature: Markov relations

– Symmetric relation: Y is in X's Markov blanket iff there is either an edge between them, or both are parents of another variable (Pearl 98).

– Biological interpretation: a Markov relation indicates that the two genes are related in some joint biological interaction or process
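The Markov relation defined above can be tested directly on an adjacency matrix. In this sketch `adj[i][j] == 1` is assumed to mean an edge from variable i to variable j, and the example edges are the ones implied by the factorization P(A) P(B | A, E) P(C | B) P(D | A) P(E) from the earlier five-gene example. The function name `markov_related` is hypothetical.

```python
def markov_related(adj, x, y):
    """True if x and y share an edge or are both parents of some variable."""
    has_edge = bool(adj[x][y] or adj[y][x])
    common_child = any(adj[x][z] and adj[y][z] for z in range(len(adj)))
    return has_edge or common_child


# Indices for the five-gene example: edges A->B, E->B, B->C, A->D.
A, B, C, D, E = range(5)
example = [[0] * 5 for _ in range(5)]
for parent, child in [(A, B), (E, B), (B, C), (A, D)]:
    example[parent][child] = 1

print(markov_related(example, A, E))  # A and E are both parents of B
print(markov_related(example, A, C))  # no edge and no common child
```

The check is symmetric in its two arguments, in line with the "symmetric relation" noted on the slide.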

(22)

Analyzing Expression Data:

• The second feature: order relations

– Global property: A is an ancestor of B in all the equivalent Bayesian networks learned

– Biological interpretation: an order relation indicates that the transcription of one gene is a direct cause of the transcription of another gene
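An order relation can be checked by computing reachability in each network and requiring it to hold in all of them. The sketch below takes a caller-supplied list of adjacency matrices as a stand-in for the set of equivalent networks; enumerating that set is outside its scope. The function names and the two small example networks are assumptions made for illustration.

```python
def reachability(adj):
    """Transitive closure: reach[i][j] is True if i is an ancestor of j."""
    n = len(adj)
    reach = [[bool(adj[i][j]) for j in range(n)] for i in range(n)]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if reach[i][k] and reach[k][j]:
                    reach[i][j] = True
    return reach


def order_relation(networks, a, b):
    """True if a is an ancestor of b in every network of the given list."""
    return all(reachability(adj)[a][b] for adj in networks)


# Hypothetical networks over (A, B, C), with adj[i][j] == 1 meaning i -> j.
chain = [[0, 1, 0], [0, 0, 1], [0, 0, 0]]   # A -> B -> C
fork = [[0, 1, 1], [0, 0, 0], [0, 0, 0]]    # A -> B and A -> C
networks = [chain, fork]

print(order_relation(networks, 0, 2))  # A precedes C in both networks
print(order_relation(networks, 1, 2))  # B precedes C only in the chain
```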

(23)

Estimating Statistical Confidence in Features

• To what extent does the data support a given feature?

• An effective and relatively simple approach for estimating confidence is the bootstrap method.

• For i = 1, …, m

– Re-sample with replacement N instances from D. Denote by D_i the resulting dataset.

– Apply the learning procedure on D_i to induce a network structure G_i.

• For each feature f of interest calculate

$$\text{conf}(f) = \frac{1}{m} \sum_{i=1}^{m} f(G_i)$$

– where f(G) is 1 if f is a feature in G, and 0 otherwise.
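The bootstrap procedure above can be sketched with the learning procedure passed in as a function. Everything named here is hypothetical: `bootstrap_confidence`, the `toy_learner` (which simply adds an edge A -> B when the two columns agree in more than 75% of rows) and the toy data are stand-ins for illustration, not the structure-learning method used in the slides.

```python
import random


def bootstrap_confidence(data, learn_structure, features, m=200, seed=0):
    """Estimate conf(f) = (1/m) * sum_i f(G_i) over m bootstrap resamples.

    learn_structure maps a dataset to a network; features maps a name to a
    function f(G) that returns True or False.
    """
    rng = random.Random(seed)
    n = len(data)
    counts = {name: 0 for name in features}
    for _ in range(m):
        # Re-sample N instances from D with replacement to form D_i.
        resample = [data[rng.randrange(n)] for _ in range(n)]
        graph = learn_structure(resample)          # induce G_i from D_i
        for name, f in features.items():
            counts[name] += 1 if f(graph) else 0
    return {name: count / m for name, count in counts.items()}


def toy_learner(rows):
    """Hypothetical learner: edge A -> B if the columns agree in > 75% of rows."""
    agree = sum(1 for a, b in rows if a == b) / len(rows)
    return {("A", "B")} if agree > 0.75 else set()


# Hypothetical two-gene data set in which A and B agree in 16 of 20 samples.
data = [(0, 0)] * 8 + [(1, 1)] * 8 + [(0, 1)] * 4
features = {"A->B": lambda g: ("A", "B") in g}
conf = bootstrap_confidence(data, toy_learner, features)
print(conf)
```

Because the learner is a parameter, the same loop works with any structure-learning routine that returns an object the feature functions can inspect.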

(24)

How to collect data:

• Gene knock down

• Gene knock out

• Compound

• Tissue microarray

• Time course

(25)