An Introduction to the Use of Bayesian Networks to Analyze Gene Expression Data
Cristina Manfredotti
Dipartimento di Informatica, Sistemistica e Comunicazione (D.I.S.Co.)
Università degli Studi Milano-Bicocca
Introduction
• A central goal of molecular biology is to
understand the regulation of protein
synthesis.
• DNA microarray experiments can
measure thousands of gene expression
levels simultaneously.
• An important challenge is to develop
methodologies that are both statistically
sound and computationally tractable.
Biological Background
• DNA
– DNA is a double-stranded molecule
– Hereditary information is encoded in it
• Gene
– A gene is a segment of DNA
– It contains the information required to make a protein
Motivations
• Each gene encodes a protein and proteins are
the functional units of life
• Every gene is present in every cell, but only a
fraction of the genes are expressed at any time
• Understanding the mechanisms that determine
which genes are expressed, and when they are
expressed, is the key to the development of
new treatments of diseases
• Many diseases result from the interaction
between genes
Bayesian Networks
• Prior work: clustering of expression data
– Groups together genes with similar expression patterns
– Disadvantage: does not reveal structural relations between genes
• Big challenge
– Extract meaningful information from the expression data
– Discover interactions between genes based on the measurements
Bayesian Networks
• A Bayesian Network (BN) is a graphical representation of a probability distribution
– Compact and intuitive representation
– Useful for describing processes composed of locally interacting components
– Has a good statistical foundation
– Efficient model learning algorithms
– Can capture causal relationships
Representing Distributions
• A Bayesian network is a representation of a joint probability distribution.
• A Bayesian network has two components:
– $G$: a directed acyclic graph structure
– $\Theta$: a set of parameters for the conditional distribution of each variable
• The joint probability distribution of $\{X_1, \dots, X_n\}$ is represented by a Bayesian network as follows:
$$P(X_1, \dots, X_n) = \prod_{i=1}^{n} P\big(X_i \mid \mathrm{Pa}^{G}(X_i)\big)$$
– where $\mathrm{Pa}^{G}(X_i)$ is the set of parents of $X_i$ in the graph $G$
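The product formula can be evaluated directly once each variable has a parent list and a conditional table. The following is a minimal sketch on a hypothetical two-variable network X1 → X2; the variable names and all probability values are made up for illustration and do not come from the slides.

```python
# Hypothetical network X1 -> X2; all numbers below are illustrative assumptions.
parents = {"X1": (), "X2": ("X1",)}

# cpt[var][parent_values][value] = P(var = value | parents = parent_values)
cpt = {
    "X1": {(): {0: 0.7, 1: 0.3}},
    "X2": {(0,): {0: 0.9, 1: 0.1}, (1,): {0: 0.2, 1: 0.8}},
}

def joint_probability(assignment, parents, cpt):
    """Multiply P(X_i | Pa(X_i)) over all variables for one full assignment."""
    p = 1.0
    for var, pa in parents.items():
        pa_values = tuple(assignment[q] for q in pa)
        p *= cpt[var][pa_values][assignment[var]]
    return p

# Summing the product over every assignment should give 1 for a valid network.
total = sum(
    joint_probability({"X1": a, "X2": b}, parents, cpt)
    for a in (0, 1)
    for b in (0, 1)
)
```

Because every factor is a normalized conditional distribution, the products over all assignments sum to one, which is a quick sanity check on any hand-built table.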
An Example of a Simple BN
[Figure: Bayesian network over Gene A, Gene B, Gene C, Gene D, and Gene E]
- Gene B and Gene D are independent given Gene A.
- Gene B asserts dependency between Gene A and Gene E.
- Gene A and Gene C are independent given Gene B.
$$P(A, B, C, D, E) = P(A)\,P(B \mid A)\,P(C \mid A, B)\,P(D \mid A, B, C)\,P(E \mid A, B, C, D)$$
$$= P(A)\,P(B \mid A, E)\,P(C \mid B)\,P(D \mid A)\,P(E)$$
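The independence statements for the five-gene example can be checked numerically. The sketch below assumes the edges implied by the factorization P(A) P(B | A, E) P(C | B) P(D | A) P(E), that is A → B, E → B, B → C, and A → D, and fills the conditional tables with arbitrary random values, since the slides give no numbers for this network.

```python
import itertools
import random

random.seed(0)

# Parent sets read off the factorization P(A) P(B|A,E) P(C|B) P(D|A) P(E).
parents = {"A": (), "E": (), "B": ("A", "E"), "C": ("B",), "D": ("A",)}
order = ["A", "B", "C", "D", "E"]

# Arbitrary P(var = 1 | parent values); illustrative only, not from the slides.
p_one = {
    v: {pa: random.uniform(0.1, 0.9)
        for pa in itertools.product((0, 1), repeat=len(ps))}
    for v, ps in parents.items()
}

def joint(x):
    """Joint probability of a full assignment via the network factorization."""
    p = 1.0
    for v, ps in parents.items():
        q = p_one[v][tuple(x[u] for u in ps)]
        p *= q if x[v] == 1 else 1.0 - q
    return p

def prob(fixed):
    """Marginal probability of a partial assignment by brute-force summation."""
    free = [v for v in order if v not in fixed]
    return sum(
        joint({**fixed, **dict(zip(free, vals))})
        for vals in itertools.product((0, 1), repeat=len(free))
    )

def independent(x, y, given):
    """True if P(x, y | given) = P(x | given) P(y | given) for all values."""
    for vals in itertools.product((0, 1), repeat=2 + len(given)):
        a = dict(zip([x, y] + given, vals))
        g = {k: a[k] for k in given}
        pg = prob(g)
        lhs = prob(a) / pg
        rhs = (prob({x: a[x], **g}) / pg) * (prob({y: a[y], **g}) / pg)
        if abs(lhs - rhs) > 1e-9:
            return False
    return True
```

With these assumptions, `independent("B", "D", ["A"])` and `independent("A", "C", ["B"])` hold, while Gene A and Gene E are independent on their own but become dependent once Gene B is given.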
Learning Bayesian Networks
• Given a training set $D = \{x_1, \dots, x_N\}$ of independent instances of $X$, find a network $B = \langle G, \Theta \rangle$ that best matches $D$.
• The score function for a network is defined as
$$S(G : D) = P(G \mid D) = \frac{P(D \mid G)\,P(G)}{P(D)}$$
where
$$P(D \mid G) = \int P(D \mid G, \Theta)\,P(\Theta \mid G)\,d\Theta$$
is the marginal likelihood, which averages the probability of the data over all possible parameter assignments to $G$.
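For discrete variables the integral in the marginal likelihood has a closed form when the parameters are given Dirichlet priors. The sketch below assumes binary variables and independent Beta(alpha, alpha) priors on every conditional distribution; this prior choice is an assumption made here for illustration, since the slides do not state which prior is used. The function names are hypothetical.

```python
from collections import Counter
from math import lgamma, log

def log_marginal_likelihood(data, parents, alpha=1.0):
    """log P(D | G) for binary variables under Beta(alpha, alpha) priors.

    data    : list of dicts mapping variable name -> 0 or 1
    parents : dict mapping variable name -> tuple of parent names
    """
    total = 0.0
    for var, pa in parents.items():
        counts = Counter((tuple(row[p] for p in pa), row[var]) for row in data)
        # Parent configurations that never occur contribute a factor of 1.
        for cfg in {cfg for cfg, _ in counts}:
            n0, n1 = counts[(cfg, 0)], counts[(cfg, 1)]
            total += lgamma(2 * alpha) - lgamma(2 * alpha + n0 + n1)
            total += lgamma(alpha + n0) + lgamma(alpha + n1) - 2 * lgamma(alpha)
    return total

def log_score(data, parents, log_prior=0.0):
    """log P(D | G) + log P(G), i.e. log S(G : D) up to the constant -log P(D)."""
    return log_marginal_likelihood(data, parents) + log_prior

# Toy data in which B always copies A (made up for illustration).
toy = [{"A": 0, "B": 0}, {"A": 1, "B": 1}] * 5
with_edge = log_score(toy, {"A": (), "B": ("A",)})
without_edge = log_score(toy, {"A": (), "B": ()})
```

Since P(D) is the same for every candidate structure, comparing `log_score` values is enough to rank structures; on the toy data the structure with the edge A → B scores higher than the empty one.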
Learning Bayesian Networks
A simple example
• We want to construct a BN of a system
composed of 3 genes (A, B and C) that can
be ON or OFF
• Given the training set D
• Fix a number of iterations M
• Choose (randomly) M structures $G_j$ (binary square matrices)
• Learn the conditional probability tables
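The structure-sampling step can be sketched as follows. The sketch assumes that entry (i, j) = 1 encodes an edge from variable i to variable j, and it rejects matrices that contain a directed cycle so that every kept structure is a valid DAG. The rejection step and the function names are assumptions added here, not something stated on the slide.

```python
import random

def random_structure(n, rng, p_edge=0.5):
    """Draw an n x n binary matrix with a zero diagonal (no self-loops)."""
    return [[int(i != j and rng.random() < p_edge) for j in range(n)]
            for i in range(n)]

def is_acyclic(adj):
    """Repeatedly remove nodes with no incoming edge; a cycle leaves none to remove."""
    remaining = set(range(len(adj)))
    while remaining:
        roots = [j for j in remaining
                 if not any(adj[i][j] for i in remaining)]
        if not roots:
            return False
        remaining -= set(roots)
    return True

def sample_dags(n, m, seed=0):
    """Collect m random acyclic structures over n variables."""
    rng = random.Random(seed)
    dags = []
    while len(dags) < m:
        g = random_structure(n, rng)
        if is_acyclic(g):
            dags.append(g)
    return dags

structures = sample_dags(n=3, m=6)  # three genes, M = 6 as in the example
```

Drawing matrices uniformly and filtering is only practical for a handful of variables; it is used here because the example has three genes.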
A simple example
D:
A  B  C
0  0  0
0  1  1
0  1  1
0  0  1
0  0  1
0  0  1
1  1  0
1  1  0
1  1  0
1  1  0
1  1  1
1  1  1
1  1  1
|D| = 13
M = 6
Structures $G_j$, one binary matrix per row (entry XY = 1 denotes an edge X → Y):
      AA  AB  AC  BA  BB  BC  CA  CB  CC
G1     0   1   1   0   0   1   0   0   0
G2     0   0   1   1   0   1   0   0   0
G3     0   1   1   0   0   0   0   1   0
G4     0   0   1   1   0   0   0   1   0
G5     0   0   0   1   0   1   1   0   0
G6     0   1   1   0   0   1   0   0   0
G1: A → B, A → C, B → C
P(A=0) = 6/13
P(A=1) = 7/13

B\A     0      1
0       4/13   0
1       2/13   7/13

C\A,B   00     01     10     11
0       1/13   0      0      4/13
1       0      5/13   0      3/13

G5: B → A, B → C, C → A
P(B=0) = 4/13
P(B=1) = 9/13

C\B     0      1
0       1/13   4/13
1       3/13   5/13

A\B,C   00     01     10     11
0       1/13   3/13   0      2/13
1       0      0      4/13   3/13
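Table entries of this kind can be obtained by counting rows of D. The sketch below is an illustration under stated assumptions: it treats each entry as the number of matching rows divided by |D| = 13, which reproduces the P(A), P(B), B\A, C\B, and A\B,C values shown here. The `conditional` helper is an added variation that normalizes the same counts within each parent configuration instead; both function names are hypothetical.

```python
from collections import Counter
from fractions import Fraction

# Training set D from the example, one (A, B, C) tuple per instance.
D = [
    (0, 0, 0), (0, 1, 1), (0, 1, 1), (0, 0, 1), (0, 0, 1), (0, 0, 1),
    (1, 1, 0), (1, 1, 0), (1, 1, 0), (1, 1, 0),
    (1, 1, 1), (1, 1, 1), (1, 1, 1),
]
COL = {"A": 0, "B": 1, "C": 2}

def counts(data, child, parents=()):
    """Count rows for each (child value, parent values) combination."""
    return Counter(
        (row[COL[child]], tuple(row[COL[p]] for p in parents)) for row in data
    )

def relative_frequency(data, child, parents=()):
    """Counts divided by the number of instances |D|."""
    return {k: Fraction(n, len(data)) for k, n in counts(data, child, parents).items()}

def conditional(data, child, parents=()):
    """Counts normalized within each parent configuration (assumed variation)."""
    c = counts(data, child, parents)
    pa_totals = Counter()
    for (_, pa), n in c.items():
        pa_totals[pa] += n
    return {(v, pa): Fraction(n, pa_totals[pa]) for (v, pa), n in c.items()}

b_given_a = relative_frequency(D, "B", ("A",))
c_given_b = relative_frequency(D, "C", ("B",))
```

Keys are `(child_value, parent_values)` pairs, so `b_given_a[(1, (1,))]` is the entry in row B = 1, column A = 1 of the B\A table. Combinations that never occur in D are simply absent from the dictionary and correspond to the zero entries.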
A simple example
D: the 13-instance training set over A, B, C shown on the previous slide.

G1 (nodes A, B, C):
P([0 0 0] | $G_1$) P($G_1$) = 6/13 * 4/13 * 1/13 * 2/6
P([0 1 1] | $G_1$) P($G_1$) = 6/13 * 2/13 * 5/13 * 2/6
P([0 0 1] | $G_1$) P($G_1$) = 6/13 * 4/13 * 0 * 2/6
…
$$\mathrm{Score} = \frac{1}{n} \sum_{i} P(D_i \mid G_1)$$

G5 (nodes A, B, C):
P([0 0 0] | $G_5$) P($G_5$) = 1/13 * 4/13 * 1/13 * 1/6
P([0 1 1] | $G_5$) P($G_5$) = 2/13 * 9/13 * 5/13 * 1/6
P([0 0 1] | $G_5$) P($G_5$) = 3/13 * 4/13 * 3/13 * 1/6
…
$$\mathrm{Score} = \frac{1}{n} \sum_{i} P(D_i \mid G_5)$$
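The products listed for $G_1$ and $G_5$ can be evaluated exactly with rational arithmetic. The sketch below only multiplies the factors as they appear in the example; the helper name `weighted_likelihood` is hypothetical.

```python
from fractions import Fraction as F

def weighted_likelihood(factors, prior):
    """Multiply the per-variable factors for one instance by the structure prior."""
    result = prior
    for f in factors:
        result *= f
    return result

# Factors copied from the example; the prior is 2/6 for G1 and 1/6 for G5.
g1 = {
    (0, 0, 0): weighted_likelihood([F(6, 13), F(4, 13), F(1, 13)], F(2, 6)),
    (0, 1, 1): weighted_likelihood([F(6, 13), F(2, 13), F(5, 13)], F(2, 6)),
    (0, 0, 1): weighted_likelihood([F(6, 13), F(4, 13), F(0, 13)], F(2, 6)),
}
g5 = {
    (0, 0, 0): weighted_likelihood([F(1, 13), F(4, 13), F(1, 13)], F(1, 6)),
    (0, 1, 1): weighted_likelihood([F(2, 13), F(9, 13), F(5, 13)], F(1, 6)),
    (0, 0, 1): weighted_likelihood([F(3, 13), F(4, 13), F(3, 13)], F(1, 6)),
}

for instance in g1:
    print(instance, "G1:", g1[instance], "G5:", g5[instance])
```

Using `Fraction` avoids rounding, so the per-instance terms can be compared between structures before they are averaged into the score.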
• Practical problem: small data sets
– Variables: hundreds or thousands of genes
– Samples: just tens of microarray experiments
• On the positive side, genetic regulation networks are sparse
• Characterize and learn features that are common to most of these networks
Analyzing Expression Data:
• The first feature: Markov relations
– Symmetric relation: Y is in X's Markov blanket iff there is either an edge between them, or both are parents of another variable (Pearl 98).
– Biological interpretation: a Markov relation indicates that the two genes are related in some joint biological interaction or process
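The Markov relation can be tested directly on a network's parent sets. The sketch below uses the five-gene example from earlier, with edges A → B, E → B, B → C, and A → D as implied by its factorization; the function name is hypothetical.

```python
def markov_related(parents, x, y):
    """True if x and y share an edge or are both parents of some variable."""
    if x in parents[y] or y in parents[x]:
        return True
    return any(x in pa and y in pa for pa in parents.values())

# Parent sets of the five-gene example network.
five_genes = {
    "A": set(),
    "E": set(),
    "B": {"A", "E"},
    "C": {"B"},
    "D": {"A"},
}

pairs = [("A", "B"), ("A", "E"), ("A", "C"), ("B", "D")]
relations = {pair: markov_related(five_genes, *pair) for pair in pairs}
```

Gene A and Gene E have no edge between them but are both parents of Gene B, so they are Markov related; Gene A and Gene C are not, because Gene B lies between them.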
Analyzing Expression Data:
• The second feature: order relations
– Global property: A is an ancestor of B in all the equivalent Bayesian networks learned
– Biological interpretation: an order relation indicates that the transcription of one gene is a cause, possibly an indirect one, of the transcription of another gene
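An order relation can be checked by testing ancestry in every network of a given collection. The sketch below assumes the equivalent networks have already been enumerated and are supplied as a list of parent dictionaries; how they are enumerated is outside this sketch, and the three small networks are made up for illustration.

```python
def is_ancestor(parents, a, b):
    """True if there is a directed path from a to b, found by walking up from b."""
    stack, seen = list(parents[b]), set()
    while stack:
        v = stack.pop()
        if v == a:
            return True
        if v not in seen:
            seen.add(v)
            stack.extend(parents[v])
    return False

def order_relation(networks, a, b):
    """True if a is an ancestor of b in every supplied network."""
    return all(is_ancestor(g, a, b) for g in networks)

# Hypothetical networks over three genes, given as parent lists.
g1 = {"A": [], "B": ["A"], "C": ["B"]}        # A -> B -> C
g2 = {"A": [], "B": ["A"], "C": ["A", "B"]}   # A -> B, A -> C, B -> C
g3 = {"A": ["B"], "B": [], "C": ["B"]}        # B -> A, B -> C
```

Here `order_relation([g1, g2], "A", "C")` holds because A precedes C in both networks, while adding `g3` breaks the relation.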
Estimating Statistical Confidence in Features
• To what extent does the data support a given feature?
• An effective and relatively simple approach for estimating confidence: the bootstrap method.
• For $i = 1, \dots, m$
– Re-sample with replacement $N$ instances from $D$. Denote by $D_i$ the resulting dataset.
– Apply the learning procedure on $D_i$ to induce a network structure $G_i$.
• For each feature $f$ of interest calculate
$$\mathrm{conf}(f) = \frac{1}{m} \sum_{i=1}^{m} f(G_i)$$
– where $f(G)$ is 1 if $f$ is a feature in $G$, and 0 otherwise.
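The bootstrap loop can be sketched with the learning procedure passed in as a function. Everything named below is hypothetical: `toy_learn` is a stand-in learner that links A and B when they agree in at least 80% of the rows, used only so the sketch runs end to end. It is not the structure-learning procedure from the slides.

```python
import random

def bootstrap_confidence(data, learn, feature, m=200, seed=0):
    """Estimate conf(f) = (1/m) * sum_i f(G_i) over m bootstrap resamples."""
    rng = random.Random(seed)
    n = len(data)
    hits = 0
    for _ in range(m):
        # Re-sample N instances from D with replacement.
        resample = [rng.choice(data) for _ in range(n)]
        g_i = learn(resample)
        hits += 1 if feature(g_i) else 0
    return hits / m

def toy_learn(rows):
    """Stand-in learner (assumption): return edge set {(A, B)} if A, B mostly agree."""
    agree = sum(r["A"] == r["B"] for r in rows) / len(rows)
    return {("A", "B")} if agree >= 0.8 else set()

def has_ab_edge(g):
    """Feature indicator f(G): 1 if the A-B edge is present, else 0."""
    return ("A", "B") in g

always_agree = [{"A": 0, "B": 0}, {"A": 1, "B": 1}] * 5
never_agree = [{"A": 0, "B": 1}] * 10
conf_high = bootstrap_confidence(always_agree, toy_learn, has_ab_edge)
conf_low = bootstrap_confidence(never_agree, toy_learn, has_ab_edge)
```

Passing `learn` and `feature` as arguments keeps the resampling logic independent of the learner, so the same loop can score Markov relations or order relations by swapping the indicator function.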
How to collect data:
• Gene knock down
• Gene knock out
• Compound
• Tissue microarray