Learning genetic network topology using structural EM

6.2 Applications

6.2.1 Learning genetic network topology using structural EM

Here we describe some initial experiments using DBNs to learn small artificial examples typical of the causal processes involved in genetic regulation. We generate data from models of known structure, learn DBN models from the data in a variety of settings, and compare these with the original models. The main purpose of these experiments is to understand how well DBNs can represent such processes, how many observations are required, and what sorts of observations are most useful. We refrain from describing any particular biological process, since we do not yet have sufficient real data on the processes we are studying to learn a scientifically useful model.

Simple genetic systems are commonly described by a pathway model: a graph in which vertices represent genes (or larger chromosomal regions) and arcs represent causal pathways (Figure 6.1(a)). A vertex can either be “off/normal” (state 0) or “on/abnormal” (state 1). The system starts in a state which is all 0s, and vertices can “spontaneously” turn on (due to unmodelled external causes) with some probability per unit time. Once a vertex is turned on, it stays on, but may trigger other neighboring vertices to turn on as well, again with a certain probability per unit time. The arcs on the graph are usually annotated with the “half-life” parameter of the triggering process. Note that pathway models, unlike BNs, can contain directed cycles. For many important biological processes, the structure and parameters of this graph are completely unknown; their discovery would constitute a major advance in scientific understanding.

Pathway models have a very natural representation as DBNs: each vertex becomes a state variable, and the triggering arcs are represented as links in the transition network of the DBN. The tendency of a vertex to stay “on” once triggered is represented by persistence links in the DBN. Figure 6.1(b) shows a DBN representation of the five-vertex pathway model in Figure 6.1(a). The nature of the problem suggests that noisy-ORs (or noisy-ANDs) should provide a parsimonious representation of the CPD at each node (see Section A.3.2). To specify a noisy-OR for a node with k parents, we use parameters q_1, ..., q_k, where q_i is the probability the child node will be in state 0 if the i-th parent is in state 1. In the five-vertex DBN model that we used in the experiments reported below, all the q parameters (except for the persistence arcs) have value 0.2. For a strict persistence model (vertices stay on once triggered), the q parameters for persistence arcs are fixed at 0. To learn such noisy-OR distributions, we used the EM techniques discussed in Section C.2.3. We also tried using gradient descent, following [BKRK97], but encountered difficulties with convergence in cases where the optimal parameter values were close to the boundaries (0 or 1). To prevent structural

[Figure 6.1: diagrams not reproduced. Each panel shows the vertices A, B, C, D and E; the labels 0.9, 0.8, 0.7, 0.6 and 0.5 appear in the figure.]

Figure 6.1: (a) A simple pathway model with five vertices. Each vertex represents a site in the genome, and each arc is a possible triggering pathway. (b) A DBN model that is equivalent to the pathway model shown in (a). (c) The DBN model extended by adding a switch node S and an observation node O, whose value either indicates that this slice is hidden, or encodes the state of all of the nodes in the slice.

[Figure 6.2: plots not reproduced. Two panels; x-axis: number of training sequences (0 to 100); y-axis: Hamming distance (−1 to 6); one curve each for 0%, 20%, 40% and 60% hidden slices.]

Figure 6.2: Results of structural EM for the pathway model in Figure 6.1. We plot the number of incorrect edges against number of training slices, for different levels of partial observability. 20% means that 20% of the slices, chosen at random, were fully hidden. Top: tabular CPDs; bottom: noisy-OR CPDs.

[Figure 6.3: plots not reproduced. Two panels; x-axis: number of training sequences (0 to 100); y-axis: relative logloss (0 to 2); one curve each for 0%, 20%, 40% and 60% hidden slices.]

Figure 6.3: As in Figure 6.2, except we plot relative log-loss compared with the generating model on an independent sample of 100 sequences. Top: tabular CPDs; bottom: noisy-OR CPDs.

overfitting, we used a BIC penalty, where the number of parameters per node was set equal to the number of parents.
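As an illustration of the noisy-OR parameterization and the penalty described above, the following is a minimal sketch, not the implementation used in the experiments. The function names are hypothetical. It assumes a noisy-OR without a leak term (so a node with no active parents stays off), and it assumes the standard BIC form (d/2) log N with d set to the number of parents.

```python
from math import log, prod

def noisy_or_p_on(parent_states, q):
    """P(child = 1 | parents) for a noisy-OR with no leak term (assumption).

    q[i] is the probability that the child is in state 0 when parent i
    is in state 1. Parents in state 0 do not affect the child.
    """
    p_off = prod(qi for s, qi in zip(parent_states, q) if s == 1)
    return 1.0 - p_off

def bic_penalty(num_parents, num_cases):
    """Standard BIC penalty (d/2) log N, with d = number of parents."""
    return 0.5 * num_parents * log(num_cases)

# Two active triggering parents, each with q = 0.2:
# the child stays off only if both triggers fail.
p = noisy_or_p_on([1, 1], [0.2, 0.2])
```

With q = 0 on a persistence arc, a node that is already on has probability 1 of remaining on, which matches the strict persistence model.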

In all our experiments, we enforced the presence of the persistence arcs in the network structure. We used two alternative initial topologies: one that has only persistence arcs (so the system must learn to add arcs) and one that is fully interconnected (so the system must learn to delete arcs). Performance in the two cases was very similar. We assumed there were no arcs within a slice.
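The two starting points for the structure search can be written down as inter-slice adjacency matrices. This is a small sketch under the assumption that `adj[i][j] == 1` denotes an arc from node i in slice t-1 to node j in slice t; the function names are hypothetical.

```python
def initial_topologies(n):
    """Return the two initial inter-slice structures for n nodes."""
    # Only persistence arcs: each node depends on its own previous value.
    persistence_only = [[1 if i == j else 0 for j in range(n)] for i in range(n)]
    # Fully interconnected: every node in slice t-1 is a parent of every node in slice t.
    fully_connected = [[1] * n for _ in range(n)]
    return persistence_only, fully_connected

def has_persistence_arcs(adj):
    """Check the constraint that every persistence (diagonal) arc is present."""
    return all(adj[i][i] == 1 for i in range(len(adj)))

sparse, dense = initial_topologies(5)
```

Because intra-slice arcs are excluded, a single n-by-n matrix of this kind is enough to describe any candidate structure.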

We experimented with three observation regimes that correspond to realistic settings:

• The complete state of the system is observed at every time step.

• Entire time slices are hidden uniformly at random with probability h, corresponding to intermittent observation.

• Only two observations are made, one before the process begins and another at some unknown time t_obs after the process is initiated by some external or spontaneous event. This might be the case with some disease processes, where the DNA of a diseased cell can be observed but the elapsed time since the disease process began is not known. (The “initial” observation is of the DNA of some other, healthy cell from the same individual.)
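The observation regimes above can be sketched in code as follows. This is an illustrative simulation, not the one used in the experiments: the chain structure A→B→C→D→E, the `leak` probability of spontaneous activation, and all function names are assumptions. Hidden slices are represented by `None`.

```python
import random

def simulate(parents, q, leak, num_slices, rng):
    """Sample one sequence from a noisy-OR DBN with strict persistence.

    parents[j] lists the triggering parents of node j (excluding itself).
    leak is a hypothetical per-step probability of spontaneous activation.
    """
    state = [0] * len(parents)
    seq = [list(state)]
    for _ in range(num_slices - 1):
        nxt = []
        for j in range(len(parents)):
            if state[j] == 1:
                nxt.append(1)  # strict persistence: once on, stays on
                continue
            p_off = 1.0 - leak
            for i in parents[j]:
                if state[i] == 1:
                    p_off *= q  # each active parent fails to trigger with prob. q
            nxt.append(1 if rng.random() >= p_off else 0)
        state = nxt
        seq.append(list(state))
    return seq

def hide_slices(seq, h, rng):
    """Regime 2: hide each whole slice independently with probability h."""
    return [None if rng.random() < h else s for s in seq]

def two_observations(seq):
    """Regime 3 (simplified): keep only the first and last slices."""
    return [seq[0]] + [None] * (len(seq) - 2) + [seq[-1]]

rng = random.Random(0)
chain = [[], [0], [1], [2], [3]]  # hypothetical structure, not Figure 6.1
full = simulate(chain, q=0.2, leak=0.1, num_slices=20, rng=rng)  # regime 1
partial = hide_slices(full, h=0.4, rng=rng)
endpoints = two_observations(full)
```

The simplified third regime fixes the position of the second observation; handling an unknown elapsed time requires the switch variable discussed next.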

This last case, which obtains in many realistic situations, raises a new challenge for machine learning. We resolve it as follows: we supply the network with the “diseased” observation at time slice T, where T is with high probability larger than the actual elapsed time t_obs since the process began.2 We also augment the DBN model with a hidden “switch” variable S that is initially off, but can come on spontaneously. When the switch is off, the system evolves according to its normal transition model P(X_t | X_{t-1}, S = 0), which is to be determined from the data. Once the switch turns on, however, the state of the system is frozen; that is, the conditional probability distribution P(X_t | X_{t-1}, S) is fixed so that X_t = X_{t-1} with probability 1. The persistence parameter for S determines a probability distribution over t_obs; by fixing this parameter such that (a priori) t_obs < T with high probability, we effectively fix a scale for time, which would otherwise be arbitrary. The learned network will, nonetheless, imply a more constrained distribution for t_obs given a pair of observations.

2 With the q parameters set to 0.2 in the true network, the actual system state is all-1s with high probability after about T = 20, so
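To make the role of the switch concrete, the sketch below works out the prior it induces on the freezing time. It assumes the switch turns on independently at each step with a fixed probability `p_on` (so the persistence of the off state is 1 - p_on), which gives a geometric distribution; the names and the exact slice-indexing convention are assumptions.

```python
def prob_switch_on_by(T, p_on):
    """Prior probability that the switch has turned on within T steps."""
    return 1.0 - (1.0 - p_on) ** T

def p_on_for(T, target):
    """Per-step switch-on probability giving P(on within T steps) = target."""
    return 1.0 - (1.0 - target) ** (1.0 / T)

def next_state(state, switch_on, transition):
    """Frozen dynamics: copy the state if the switch is on, else evolve it."""
    return list(state) if switch_on else transition(state)

# Choose p_on so that the process is frozen before slice T = 20
# with prior probability 0.99.
p_on = p_on_for(20, 0.99)
```

Fixing `p_on` in this way is what sets the time scale: a smaller value would stretch the same pair of observations over more slices.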

We consider two measures of performance. One is the number of different edges in the learned network compared to the generating network, i.e., the Hamming distance between their adjacency matrices. The second is the difference in the logloss of the learned model compared to the generating model, measured on an independent test set. Our results for the five-vertex model of Figure 6.1(a) are shown in Figures 6.2 and 6.3. We see that noisy-ORs perform much better than tabular CPTs when the amount of missing data is high. Even with 40% missing slices, the exact structure is learned from only 30 examples by the noisy-OR network.3 However, when all-but-two slices are hidden, the system needs 10 times as much data to learn. The case in which we do not even know the time at which the second observation is made (which we modeled with the switching variable) is even harder to learn (results not shown).
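Both performance measures are simple to compute once the learned and generating models are available. The following is a minimal sketch with hypothetical function names; averaging the logloss difference per test sequence is an assumption, since the normalization is not stated here.

```python
def hamming_distance(adj_a, adj_b):
    """Number of entries in which two adjacency matrices differ."""
    return sum(
        a != b
        for row_a, row_b in zip(adj_a, adj_b)
        for a, b in zip(row_a, row_b)
    )

def relative_logloss(loglik_learned, loglik_true):
    """Mean extra logloss of the learned model over the generating model.

    Inputs are per-sequence log-likelihoods on the same test set;
    logloss is the negative log-likelihood.
    """
    diffs = [lt - ll for ll, lt in zip(loglik_learned, loglik_true)]
    return sum(diffs) / len(diffs)

# One spurious arc in the learned structure gives a distance of 1.
d = hamming_distance([[1, 0], [0, 1]], [[1, 1], [0, 1]])
```

A Hamming distance of 0 means the structure was recovered exactly, while a relative logloss near 0 means the learned model predicts held-out sequences about as well as the generating model.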

Of course, it will not be possible to learn real genetic networks using techniques as simple as this. For one thing, the data sets (e.g., micro-arrays) are noisy, sparse and sampled at irregular intervals.4 Second, the space of possible models is so huge that it will be necessary to use strong prior domain knowledge to make the task tractable. Successful techniques will probably be more similar to “computer assisted pathway refinement” [ITR+01] than to “de novo” learning.

Source: Kevin Patrick Murphy, Dynamic Bayesian Networks: Representation, Inference and Learning, pp. 114-117.