Learning Bayesian networks - Learning and predicting with chain event graphs

In many scenarios, the modeller is not certain which conditional independence relationships hold between the variables of the system under consideration or, equivalently, which Bayesian network best represents the model. In this case, the Bayesian approach is to treat the structure itself as a random variable whose probability distribution is set a priori and then updated using Bayes' theorem in the light of new data. This procedure has been described as learning the Bayesian network by the artificial intelligence community, e.g. in [Heckerman, 1999], and can be considered another form of model selection. In practice, however, the procedure is rarely so simple. The major obstacle is that the number of possible Bayesian networks grows super-exponentially with the number of random variables [Cooper and Herskovits, 1992]. This means that setting a proper subjective prior distribution over the set of possible Bayesian networks for any practical situation is generally intractable, as is setting the parameter priors and likelihoods for each possible BN.
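To give a feel for this growth, the following minimal sketch counts the labelled directed acyclic graphs on n nodes. It uses Robinson's recurrence, which is not part of the discussion above and is brought in here purely for illustration; the function name `count_dags` is hypothetical.

```python
from functools import lru_cache
from math import comb


@lru_cache(maxsize=None)
def count_dags(n: int) -> int:
    """Number of labelled DAGs on n nodes, via Robinson's recurrence."""
    if n == 0:
        return 1
    # Inclusion-exclusion over the k nodes that have no incoming edges.
    return sum(
        (-1) ** (k + 1) * comb(n, k) * 2 ** (k * (n - k)) * count_dags(n - k)
        for k in range(1, n + 1)
    )


# Print the number of candidate structures for small variable sets.
for n in range(1, 8):
    print(n, count_dags(n))
```

Even for five variables there are already tens of thousands of candidate structures, which is why eliciting a separate prior for each one quickly becomes impractical.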

There are some approaches advocated in the literature, however, that seek to reduce this difficulty by making some reasonable simplifying assumptions. I discuss those assumptions that relate to discrete variables, which are my focus in this thesis.

The initial set of assumptions deals with the probability model for the data implied by each Bayesian network. Let $B$ be the random variable representing the Bayesian network which holds. Then

$$P(X \mid \theta_B, B) = \prod_{i=1}^{n} P(X_i \mid Q_i, \theta_{Bi}, B)$$

where $\theta_B = \{\theta_{B1}, \ldots, \theta_{Bn}\}$ is the set of parameter vectors $\theta_{Bi}$ for each distribution $P(X_i \mid Q_i, \theta_{Bi}, B)$. The prior probability distribution of $\theta_B \mid B$ is then set by assuming parameter independence [Spiegelhalter and Lauritzen, 1990], so that

$$P(\theta_B \mid B) = \prod_{i=1}^{n} \prod_{j=1}^{q_i} P(\theta_{Bij} \mid B) \tag{2.5}$$

where $\theta_{Bij}$ is the parameter vector of the probabilities $P(X_i \mid Q_i = q_j, B)$ and $q_i$ is the number of possible values of $Q_i$. Note that I am assuming, in line with my relevance assumptions, that the value of $\theta_{Bij}$ does not depend on the parts of $B$ not related to $X_i$ and its parents, a property called likelihood modularity. If $\theta_{Bij}$ is distributed as $\mathrm{Dir}(\alpha_{Bij})$, then the updating of $P(\theta_{Bij} \mid B, X)$ is conjugate:

$$\theta_{Bij} \mid B, X \sim \mathrm{Dir}(\alpha_{Bij} + N_{ij}) \tag{2.6}$$

where $N_{ij}$ represents the vector of counts $N_{ijk}$ of the observations with $Q_i = q_j$ and $X_i = x_{ik}$, where $k$ indexes the possible values of $X_i$.
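The conjugate update in (2.6) amounts to adding the observed counts to the prior hyperparameters. The sketch below illustrates this for a single variable and a single parent configuration; the function names and the numerical values of the prior and the counts are hypothetical.

```python
def dirichlet_update(alpha, counts):
    """Posterior hyperparameters alpha_ij + N_ij for one (i, j) pair, as in (2.6)."""
    if len(alpha) != len(counts):
        raise ValueError("alpha and counts must have one entry per value of X_i")
    return [a + n for a, n in zip(alpha, counts)]


def dirichlet_mean(alpha):
    """Mean of a Dirichlet distribution: each hyperparameter over their sum."""
    total = sum(alpha)
    return [a / total for a in alpha]


# Hypothetical example: X_i has three values, observed under one parent setting q_j.
alpha_ij = [1.0, 1.0, 1.0]  # assumed uniform prior hyperparameters
n_ij = [5, 2, 0]            # assumed counts N_ijk for k = 1, 2, 3

posterior = dirichlet_update(alpha_ij, n_ij)
print(posterior)                  # [6.0, 3.0, 1.0]
print(dirichlet_mean(posterior))  # [0.6, 0.3, 0.1]
```

Because of parameter independence, the same update is applied separately to every variable and every parent configuration.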

While parameter independence simplifies the setting and updating of $P(\theta_B \mid B)$ for each possible BN $B$, it still requires the setting of each $P(\theta_{Bij} \mid B)$ for each $B$, and still does not address the setting of $P(B)$.

In order to simplify the setting of $P(\theta_{Bi} \mid B)$ (the priors for the parameters of variable $X_i$ in a BN $B$) for all variables $X_i$ and each possible BN $B$, one can make the assumption of prior modularity. This states that if two Bayesian networks $B_1$ and $B_2$ have identical parent variables $Q_i$ for some variable $X_i$, then $P(\theta_{Bi} \mid B_1) = P(\theta_{Bi} \mid B_2)$, i.e. the prior on the parameters that determine the distribution of $X_i$ is the same for both BNs. The subscript $B$ will therefore be dropped henceforth, as only the parent set of a variable is now needed to determine the prior distribution of its parameters.

Under the assumptions of prior and likelihood modularity, it is the case (as shown in [Heckerman and Geiger, 1995]) that in order to set parameter priors for each possible BN it is sufficient to set parameter priors only for the complete Bayesian networks. Parameter priors for incomplete networks are then derived from equivalent local structures in the corresponding complete network.

This can still be intractable, and so one further level of simplification is possible. Assume that under any $B$ the parameter vectors $\theta_{ij}$ are mutually independent for any $X_i$ and any values $Q_i = q_j$ of its parents, as above, and that for any two Markov equivalent BNs $B_1$, $B_2$ (i.e. those which encode the same sets of conditional independence relations on $X$, as can be determined using the methods of [Verma and Pearl, 1990] or [Chickering, 1995]) it is assumed that $P(X \mid B_1) = P(X \mid B_2)$ (called hypothesis equivalence by [Heckerman et al., 1995]). Geiger and Heckerman [1997] showed that in this case all $\theta_{ij}$ must have a Dirichlet distribution. Therefore, to specify the parameter priors for any network $B$ one needs only to specify the hyperparameters of the Dirichlet distribution of the joint distribution of $X$ on a complete network.

The setting of $P(B)$ is comparatively simple. Apart from the obvious choices of a uniform prior over all possible $B$, or over a subset of them, another possible qualitative characterisation is to consider the probability of including each edge in a BN with a fixed order of variables [Buntine, 1991]. Further still, if the edges are considered exchangeable, i.e. every edge has the same probability $p$ of existing, then only one probability assessment, that of $p$, is needed.
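As an illustration of the single-probability prior, the sketch below computes $\log P(B)$ for a network on a fixed variable ordering. It makes an assumption that goes beyond the text: each of the $n(n-1)/2$ edges permitted by the ordering is taken to be present independently with probability $p$, so that $P(B) = p^{e}(1-p)^{M-e}$ for a network with $e$ of the $M$ possible edges. The function name `log_edge_prior` and the example values are hypothetical.

```python
from math import exp, log


def log_edge_prior(num_nodes, edges, p):
    """log P(B) when each permitted edge is present independently with probability p.

    Assumes a fixed variable ordering, so at most n(n-1)/2 edges are possible.
    """
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    max_edges = num_nodes * (num_nodes - 1) // 2
    num_edges = len(set(edges))
    if num_edges > max_edges:
        raise ValueError("more edges than the ordering permits")
    return num_edges * log(p) + (max_edges - num_edges) * log(1.0 - p)


# Hypothetical example: three ordered variables, edges 0 -> 1 and 1 -> 2.
chain = [(0, 1), (1, 2)]
print(exp(log_edge_prior(3, chain, 0.5)))  # 0.125: with p = 0.5 every structure is equally likely
```

Values of $p$ below 0.5 favour sparser networks under this assumption, while $p = 0.5$ recovers the uniform prior over the structures compatible with the ordering.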

With the parameters set as above and assuming Dirichlet priors, $P(X \mid B)$ has the closed form given by Heckerman et al. [1995]:

$$P(X \mid B) = \prod_{i=1}^{n} \prod_{j=1}^{|q_i|} \left[ \frac{\Gamma(\alpha_{ij\cdot})}{\Gamma(\alpha_{ij\cdot} + x_{ij\cdot})} \prod_{k=1}^{|x_i|} \frac{\Gamma(\alpha_{ijk} + x_{ijk})}{\Gamma(\alpha_{ijk})} \right] \tag{2.7}$$

where $|q_i|$ and $|x_i|$ are the numbers of possible values of $Q_i$ and $X_i$ respectively, $x_{ij\cdot} = \sum_k x_{ijk}$, $x_{ijk}$ is the number of times $X_i = x_{ik}$ when $Q_i = q_j$, and $\alpha_{ij\cdot} = \sum_k \alpha_{ijk}$. $P(B \mid X)$ can then be easily calculated from Bayes' theorem for each $B$ if $P(B)$ is a fixed quantity a priori. However, when there are a large number of possible BNs $B$, this might not be practical. To predict new data $X^*$ from the system after having observed $X$, it is necessary to calculate

$$P(X^* \mid X) = \sum_{B \in \mathcal{B}} P(X^* \mid B)\, P(B \mid X). \tag{2.8}$$

This is called model averaging [Hoeting et al., 1999]. For a large set of possible BNs $B$, it would be impractical to calculate $P(X^* \mid B)$ and $P(B \mid X)$ for each $B$. There are a number of approximations to the full solution which can still give good predictions while reducing the computational effort required [Hoeting et al., 1999].

If the aim is to provide a good "explanatory" network for the system, then the most probable BN (the MAP, or maximum a posteriori, BN) can be sought more efficiently, if not necessarily optimally, by searching the model space than by calculating $P(B \mid X)$ for every possible $B$. Many strategies have been suggested for this search, including greedy search, greedy search with restarts, best-first search, and Monte Carlo methods, all discussed by Heckerman [1999], and more recently weighted MAX-SAT solving [Cussens, 2008].

One relevant consequence of the model set-up described above, which leads to equation (2.7), is that the goodness of a BN, defined here as its posterior probability, can be calculated as the product of purely local properties of the network, where local here relates to individual nodes and their parents. This means that if two BNs differ only in one parent set $Q_i$ of some variable $X_i$, the difference in scores will result only from that local difference. This allows for efficient local search algorithms for searching the model space. A simple local greedy search starts with one possible BN, then calculates the score for a BN which differs only in having an edge reversed, an edge added or an edge deleted (subject to the resulting network being acyclic), re-calculating only the relevant local score, and chooses whichever of the two BNs has the higher posterior probability. Because only the local differences between the graphs have to be taken into account, the search proceeds more quickly.
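The locality of the score can be seen by computing the logarithm of (2.7) one node at a time. The sketch below is illustrative only: it assumes every hyperparameter $\alpha_{ijk}$ equals a single constant `alpha` (a simplification that is not the prior derived from a complete network as described earlier), it codes each variable's values as integers starting from 0, and the function names and the toy data set are hypothetical.

```python
from collections import Counter
from itertools import product
from math import exp, lgamma


def local_log_score(data, child, parents, arity, alpha=1.0):
    """Log of the factor of (2.7) belonging to node `child` with parent set `parents`."""
    # Count x_ijk: occurrences of each (parent configuration, child value) pair.
    counts = Counter((tuple(row[p] for p in parents), row[child]) for row in data)
    alpha_j = alpha * arity[child]  # alpha_ij. under the constant-alpha assumption
    score = 0.0
    for config in product(*(range(arity[p]) for p in parents)):
        x_jk = [counts[(config, k)] for k in range(arity[child])]
        score += lgamma(alpha_j) - lgamma(alpha_j + sum(x_jk))
        score += sum(lgamma(alpha + x) - lgamma(alpha) for x in x_jk)
    return score


def log_marginal_likelihood(data, parent_sets, arity, alpha=1.0):
    """log P(X | B): the sum of the local scores over all nodes."""
    return sum(
        local_log_score(data, child, parents, arity, alpha)
        for child, parents in parent_sets.items()
    )


# Hypothetical data on two binary variables that always agree.
data = [(0, 0), (0, 0), (1, 1), (1, 1), (0, 0), (1, 1)]
arity = [2, 2]
empty = {0: (), 1: ()}        # no edges
with_edge = {0: (), 1: (0,)}  # edge 0 -> 1 added

delta_full = log_marginal_likelihood(data, with_edge, arity) - log_marginal_likelihood(data, empty, arity)
delta_local = local_log_score(data, 1, (0,), arity) - local_log_score(data, 1, (), arity)
print(exp(delta_full), exp(delta_local))  # the two ratios agree: only node 1's term changed
```

A greedy search step therefore needs only `delta_local` for each candidate edge change. To compare posterior probabilities rather than marginal likelihoods, the corresponding difference in $\log P(B)$ would be added to it.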

The search algorithms used to find the MAP BN can also be used to find more than one high-scoring network, so that $P(X^* \mid X)$ can be approximated as

$$P(X^* \mid X) \approx \sum_{B \in \tilde{\mathcal{B}}} P(X^* \mid B)\, P(B \mid X) \tag{2.9}$$

where $\tilde{\mathcal{B}}$ is the set of highest-scoring networks found during the model search; the size of this set can be chosen to be as large as desired.
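A minimal sketch of the approximation in (2.9) follows. It assumes that each retained network comes with an unnormalised log score $\log P(X \mid B) + \log P(B)$ and a predictive probability $P(X^* \mid B)$, and that the posterior weights are renormalised over the retained set only, which is a design choice rather than something stated above. The function names and numerical inputs are hypothetical.

```python
from math import exp, log


def posterior_weights(log_scores):
    """Normalise log P(X | B) + log P(B) over the retained networks into weights."""
    top = max(log_scores)
    # Subtract the maximum before exponentiating to avoid numerical underflow.
    unnormalised = [exp(s - top) for s in log_scores]
    total = sum(unnormalised)
    return [u / total for u in unnormalised]


def model_average(log_scores, predictive_probs):
    """Approximate P(X* | X) as the weighted sum of P(X* | B) over the retained set."""
    if len(log_scores) != len(predictive_probs):
        raise ValueError("one predictive probability is needed per network")
    weights = posterior_weights(log_scores)
    return sum(w * q for w, q in zip(weights, predictive_probs))


# Hypothetical inputs: two retained networks with equal prior probability 0.5.
log_scores = [log(1 / 16) + log(0.5), log(1 / 140) + log(0.5)]
predictive_probs = [0.8, 0.5]  # assumed values of P(X* | B) for each network

print(posterior_weights(log_scores))
print(model_average(log_scores, predictive_probs))
```

Working with log scores matters in practice because the products of Gamma ratios in (2.7) become far too small to represent directly for data sets of realistic size.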
