Dynamic Bayesian network with BDe score (Banjo)

$$\int\!\!\int \left( \prod_{m=1}^{M} p(z_{n,m} \mid \mu, \Sigma) \right) p(\mu \mid \mu_0, (\nu W)^{-1})\, p(W \mid \alpha, T_0)\, d\mu\, dW \qquad (2.47)$$

can then be computed in closed form. If it is further assumed that the target variable $y_n$, conditional on the set of regulators $\pi_n$, becomes statistically independent of all the other potential regulators, symbolically $p(y_n \mid X^{?n}, \pi_n) = p(y_n \mid X^{?n}_{[\pi_n]})$, then the conditional distributions

$$p(y_n \mid X^{?n}, \pi_n) = p(y_n \mid X^{?n}_{[\pi_n]}) = \frac{p(y_n, X^{?n}_{[\pi_n]})}{p(X^{?n}_{[\pi_n]})} \qquad (2.48)$$

can also be computed in closed form for each regulator set $\pi_n$; see [49] for details.

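Imposing uniform priors on the regulator sets $\pi_n$, subject to a maximal cardinality restriction $F$, the posterior distribution of the regulator set $\pi_n$ with $|\pi_n| \le F$ is given by

$$P(\pi_n \mid y_n, X^{?n}) = \frac{p(y_n \mid X^{?n}, \pi_n)}{\sum_{\tilde{\pi}_n : |\tilde{\pi}_n| \le F} p(y_n \mid X^{?n}, \tilde{\pi}_n)} \qquad (2.49)$$

where the sum in the denominator runs over all regulator sets $\tilde{\pi}_n$ whose cardinality is lower than or equal to the fan-in $F$. The posterior probability of an interaction between $x_i^n$ and $y_n$ can then be computed by marginalization: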

$$P(x_i^n \rightarrow y_n \mid y_n, X^{?n}) = \sum_{\pi_n} I(x_i^n \in \pi_n)\, P(\pi_n \mid y_n, X^{?n}) \qquad (2.50)$$

where $I(x_i^n \in \pi_n)$ is the indicator function, which is 1 if $x_i^n$ is in the set of regulators $\pi_n$, and zero otherwise. I use the posterior probabilities in Equation (2.50) to score the regulatory interactions with respect to their strengths.
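The marginalization in Equation (2.50) can be illustrated with a minimal Python sketch. It assumes that the log marginal likelihood of every candidate regulator set has already been computed; the function names and the toy scores below are hypothetical and serve only as an illustration, not as the implementation used in this work.

```python
import math

def regulator_set_posterior(log_marginal_likelihood, fan_in):
    """Posterior over regulator sets under a uniform prior and a fan-in limit.

    log_marginal_likelihood maps each candidate regulator set (a frozenset)
    to its log marginal likelihood; sets larger than fan_in are excluded.
    """
    allowed = {s: v for s, v in log_marginal_likelihood.items() if len(s) <= fan_in}
    shift = max(allowed.values())  # subtract the maximum for numerical stability
    weights = {s: math.exp(v - shift) for s, v in allowed.items()}
    total = sum(weights.values())
    return {s: w / total for s, w in weights.items()}

def interaction_posteriors(posterior, regulators):
    """Marginal posterior probability of each interaction, as in Equation (2.50)."""
    return {x: sum(p for s, p in posterior.items() if x in s) for x in regulators}

# Hypothetical log marginal likelihoods for one target and three candidate regulators.
toy_scores = {
    frozenset(): math.log(1.0),
    frozenset({"x1"}): math.log(4.0),
    frozenset({"x2"}): math.log(1.0),
    frozenset({"x1", "x2"}): math.log(2.0),
    frozenset({"x1", "x2", "x3"}): math.log(100.0),  # excluded when fan_in = 2
}
posterior = regulator_set_posterior(toy_scores, fan_in=2)
edges = interaction_posteriors(posterior, ["x1", "x2", "x3"])
print(edges)
```

With these toy numbers the four admissible sets have relative weights 1, 4, 1 and 2, so the interaction with x1 receives posterior probability 0.75, the interaction with x2 receives 0.375, and x3 receives zero because its only regulator set exceeds the fan-in.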

2.14 Dynamic Bayesian network with BDe score (Banjo)

Banjo (Bayesian Network Inference with Java Objects) is an implementation of a dynamic Bayesian network (DBN) inference algorithm using the Bayesian Dirichlet equivalence (BDe) scoring metric [67]. The DBN is a first-order Markov model that describes time-lagged dependences and conditional independences of discrete variables, meaning that the variables at one time point are affected by the variables of the immediately preceding time point.

The dependences that form the network are proposed in a greedy search procedure, and the BDe metric scores how well the network represents the observed data. The strength and sign of the network dependences are determined through an additional influence score. The DBN is defined by the pair $\langle G, \Theta \rangle$. The graph $G$ describes the dependence structure, and the parameter set $\Theta$ holds the probability distribution parameter vectors $\theta_{n|\pi_n} = (\theta_{n,m|\pi_n})\ \forall n, m$, where $m = (1, \ldots, M)$ refers in this context to the time points in the dynamic network. The parameter $\theta_{n,m|\pi_n} = p(x_n \mid \pi_n)$ for each node $n$ and time point $m$ depends on its corresponding parent set, previously defined as $\pi_n$. The joint probability distribution over all nodes is

$$p(x) = P(x_1, \ldots, x_N) = \prod_{m=1}^{M} \prod_{n=1}^{N} P(x_{n,m} \mid \pi_n),$$

namely the probability for the variable $x_n$ to take on a certain value, given the dependence on the incoming parent nodes. The score for the graph $G$ given a data set $D$ for all variables $x_{n,m}$ is the Bayesian score function:

$$\log P(G \mid D) = \log P(D \mid G) + \log P(G) - \log P(D) \qquad (2.51)$$

The evidence $p(D)$ can be neglected since its marginal probability is the same for all settings of $G$. The prior over the graph $p(G)$ can likewise be dropped, since I assume no preference for any graph, which yields a uniform distribution and hence a constant term. Solving the remaining log of the marginal likelihood $p(D \mid G)$ requires the integration over all possible settings of the parameter set $\Theta$, leading to the Bayesian Dirichlet score $s_D(G)$:

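$$\text{BD:} \quad s_D(G) = \log p(D \mid G) = \log \int p(D \mid G, \Theta)\, p(\Theta \mid G)\, d\Theta \qquad (2.52)$$

The task is to find a graph $G^*$ that satisfies $G^* = \operatorname{argmax}_G s_D(G)$. Assuming that $p(\Theta \mid G)$ is a Dirichlet prior, the integral can be solved with

$$s_D(G) = \log \prod_{n=1}^{N} \prod_{j=1}^{q_n} \frac{\Gamma(\alpha_{nj})}{\Gamma(\alpha_{nj} + N_{nj})} \prod_{k=1}^{r_n} \frac{\Gamma(\alpha_{njk} + N_{njk})}{\Gamma(\alpha_{njk})} \qquad (2.53)$$

where $q_n$ is the number of unique instantiations of $\pi_n$, $r_n$ is the number of discrete values in the data $D$, $\Gamma(\cdot)$ is the gamma function, $\alpha_{njk}$ are the Dirichlet concentration hyper-parameters with $\alpha_{nj} = \sum_k \alpha_{njk}$, $N_{njk}$ is the number of times that $x_n$ takes on the value $k$ while the parents of $x_n$ take on instantiation $j$, and $N_{nj} = \sum_k N_{njk}$.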
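The closed-form score in Equation (2.53) decomposes into one term per node, which is convenient to evaluate in log space with the log-gamma function. The following Python sketch is an illustration under stated assumptions rather than Banjo's own code: the hyper-parameters are set uniformly to $\alpha_{njk} = \text{ess} / (q_n r_n)$ for a hypothetical equivalent sample size `ess`, and each data row is assumed to hold the node value at the current time point together with the parent values already taken from the previous time point.

```python
from collections import Counter
from math import lgamma

def bd_local_score(rows, node, parents, arity, ess=1.0):
    """Log BD score of one node given its parent set (one factor of Eq. 2.53).

    rows    : list of dicts mapping variable name -> discrete level
    arity   : dict mapping variable name -> number of discrete levels
    ess     : assumed equivalent sample size defining uniform hyper-parameters
    """
    r = arity[node]
    q = 1
    for p in parents:
        q *= arity[p]
    alpha_jk = ess / (q * r)  # assumption: uniform Dirichlet hyper-parameters
    alpha_j = ess / q         # alpha_nj = sum over k of alpha_njk

    n_jk = Counter()  # counts N_njk
    n_j = Counter()   # counts N_nj
    for row in rows:
        j = tuple(row[p] for p in parents)
        n_jk[(j, row[node])] += 1
        n_j[j] += 1

    # Unobserved configurations contribute log(1) = 0, so only observed ones are summed.
    score = sum(lgamma(alpha_j) - lgamma(alpha_j + c) for c in n_j.values())
    score += sum(lgamma(alpha_jk + c) - lgamma(alpha_jk) for c in n_jk.values())
    return score

def bd_score(rows, parent_sets, arity, ess=1.0):
    """Log BD score of a whole graph: the sum of the local scores."""
    return sum(bd_local_score(rows, n, ps, arity, ess) for n, ps in parent_sets.items())

# Toy data in which b always copies a; the edge a -> b should raise the score of b.
toy_rows = [{"a": v, "b": v} for v in (0, 1)] * 10
toy_arity = {"a": 2, "b": 2}
print(bd_local_score(toy_rows, "b", ["a"], toy_arity))
print(bd_local_score(toy_rows, "b", [], toy_arity))
```

Working with `lgamma` instead of the gamma function itself avoids overflow, since the counts $N_{nj}$ grow with the number of time points.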

A disadvantage of Banjo is that it is limited to discrete values, which requires a discretisation of my continuous data and thereby causes information loss. In Chapter 5 I use the quantile discretisation procedure described by Hartemink [66]. For a detailed account of the method refer to the supplementary material S1 in [132] or the website (footnote 8).
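To make the discretisation step concrete, the sketch below shows a generic equal-frequency (quantile) binning in Python. It is an assumption-laden illustration of the general idea only: the function name, the default of three levels, and the rank-based handling of ties are my own choices and are not taken from [66] or from Banjo.

```python
def quantile_discretise(values, n_bins=3):
    """Map continuous values to levels 0 .. n_bins-1 with roughly equal counts.

    Levels are assigned from the rank of each value, so tied values may be
    split across neighbouring levels (a simplification of this sketch).
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    levels = [0] * len(values)
    for rank, idx in enumerate(order):
        levels[idx] = rank * n_bins // len(values)
    return levels

# Hypothetical expression measurements of one variable over six time points.
series = [0.1, 5.0, 2.2, 3.3, 0.7, 9.1]
print(quantile_discretise(series, n_bins=3))
```

Each level receives the same number of observations (up to rounding), which keeps the counts $N_{njk}$ balanced but discards the distances between the original values.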

To measure the confidence of the proposed interactions, I use the fact that Banjo produces a summary of the 100 highest scoring networks. Extracting the regulatory interactions between a predictor $x_n$ and a target $y_n$ from the 100 networks corresponds to marginalization over these high scoring networks. An estimator of the marginal posterior probability of an interaction $x_n \rightarrow y_n$ is given by the fraction of networks that contain the interaction.
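The fraction-of-networks estimator can be sketched in a few lines of Python. The sketch assumes that the high scoring networks have already been parsed into sets of (regulator, target) pairs; this data layout and the function name are hypothetical, and reading Banjo's report file is not covered here.

```python
from collections import Counter

def interaction_frequencies(networks):
    """Fraction of networks that contain each directed interaction.

    networks : list of collections of (regulator, target) pairs,
               one collection per high scoring network.
    """
    counts = Counter()
    for edges in networks:
        counts.update(set(edges))  # count each interaction once per network
    return {edge: c / len(networks) for edge, c in counts.items()}

# Four toy networks standing in for the 100 highest scoring ones.
toy_networks = [
    {("x1", "y")},
    {("x1", "y"), ("x2", "y")},
    {("x1", "y")},
    {("x2", "y")},
]
frequencies = interaction_frequencies(toy_networks)
print(frequencies)
```

In this toy example the interaction from x1 to y appears in three of four networks and therefore receives an estimated marginal posterior probability of 0.75, while the interaction from x2 to y receives 0.5.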

Chapter 3

Inference and Evaluation

Applying probabilistic predictions to find proper parameters for a regression problem is a central paradigm in machine learning. The maximum likelihood estimate (MLE) and the maximum a posteriori (MAP) estimate are the two most important and basic approaches to infer model parameters given only a likelihood (MLE) or posterior (MAP) probability estimate. With them it is easier to handle ambiguous cases by assigning a probability, i.e. a confidence, to the parameters in the model that map a predictive set of features to a dependent variable or response, as defined in the widely used linear regression models. Given a data set $D = (x_1, \ldots, x_M)$ and the parameter vector $\theta$, which may also consist of a single parameter, the MLE is defined as

$$\theta_{MLE} = \operatorname*{argmax}_{\theta \in \Theta} \left\{ p(D \mid \theta) \right\} \qquad (3.1)$$

Hence $\theta_{MLE}$ becomes a maximum likelihood estimate for the true parameter $\theta$. The advantage of the MLE is that it is easy to compute and invariant under re-parametrization, i.e. if $\theta_{MLE}$ is the MLE of $\theta$, then $g(\theta_{MLE})$ is the MLE of $g(\theta)$ for a transformation $g$; for instance, $\theta_{MLE}^2$ is the MLE of the squared parameter $\theta^2$. Furthermore, the MLE has several useful asymptotic properties: under standard regularity conditions its sampling distribution approaches a normal distribution, and with a large data size $M$ it converges to the true parameter $\theta$. One of the major disadvantages of the MLE is that it tends to over-fit the model to the data. This means that the model might perfectly predict the data samples it was fitted to, but fail completely on a similar data set, because it does not capture the uncertainty of the data but rather picks up the noise of the data samples. However, a penalized likelihood can prevent overfitting; maximising it is equivalent to MAP estimation, where the penalization is controlled by a prior density. The MAP estimate can be thought of as the mode of the posterior density $p(\theta \mid D)$, or equivalently the maximiser of the joint density $p(\theta, D)$, that is, the parameter $\theta$ from the complete parameter set $\Theta$ that best explains the data $D$ once the prior is taken into account.

$$\theta_{MAP} = \operatorname*{argmax}_{\theta \in \Theta} \left\{ p(\theta \mid D) \right\} = \operatorname*{argmax}_{\theta \in \Theta} \left\{ p(D \mid \theta)\, p(\theta) \right\} \qquad (3.2)$$

For a data set that has a large number of samples, $M \rightarrow \infty$, the likelihood $p(D \mid \theta)$ becomes dominant compared to the parameter prior $p(\theta)$. In this scenario the MAP estimate tends to approach the solution of the MLE and also shares the same asymptotic properties. A disadvantage of the MAP estimate is that, in contrast to the MLE, it is not invariant under re-parametrization of the model. The MAP estimate therefore depends on the chosen parametrization, since each parametrization yields a different posterior density with a different mode.
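The relation between Equations (3.1) and (3.2) can be illustrated with a simple model that is not used elsewhere in this work: a Bernoulli likelihood with a Beta(a, b) prior, chosen here only because both estimates have closed forms. The function names and the prior parameters are assumptions of this Python sketch.

```python
def bernoulli_mle(successes, trials):
    """MLE of the success probability: the observed relative frequency."""
    return successes / trials

def bernoulli_map(successes, trials, a=2.0, b=2.0):
    """MAP estimate under an assumed Beta(a, b) prior with a, b > 1.

    The posterior is Beta(a + successes, b + trials - successes); its mode is
    returned. The prior acts like a penalty that pulls the estimate towards
    the prior mode.
    """
    return (successes + a - 1.0) / (trials + a + b - 2.0)

# The prior dominates for few samples and is overwhelmed as the sample size grows.
for trials in (10, 100, 10_000):
    successes = (7 * trials) // 10
    print(trials, bernoulli_mle(successes, trials), bernoulli_map(successes, trials))
```

For 7 successes in 10 trials the MLE is 0.7, whereas the MAP estimate under the Beta(2, 2) prior is 8/12, shrunk towards 0.5. With 7000 successes in 10000 trials the two estimates differ by less than 0.001, in line with the MAP estimate approaching the MLE for large M.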

In Section 3.1 I will discuss feature selection techniques that control the number and choice of parameters in the set $\theta$, with a focus on least squares regularization. Section 3.2 describes the popular Markov chain Monte Carlo (MCMC) method, which infers marginal posterior probabilities by sampling from a Markov chain that converges towards the true posterior density. Feature selection in a discrete sense can be achieved with the reversible jump MCMC (RJMCMC), which is described in the same section. Since I use clustering to infer species similarities in terms of neighbourhood distributions (see Chapter 5), I will explain the k-means algorithm and the gap statistic in Section 3.3. Finally, Section 3.4 describes how I evaluate the learned network structures that are retrieved from the methods previously defined in Chapter 2.

In document Machine learning in systems biology at different scales : from molecular biology to ecology (Page 50-55)