$$\int\!\!\int \left( \prod_{m=1}^{M} p(z_{n,m} \mid \mu, \Sigma) \right) p(\mu \mid \mu_0, (\nu W)^{-1})\, p(W \mid \alpha, T_0)\, d\mu\, dW \tag{2.47}$$
can then be computed in closed form. If it is further assumed that the target variable $y_n$, conditional on the set of regulators $\pi_n$, becomes statistically independent of all the other potential regulators, symbolically $p(y_n \mid X^{\star n}, \pi_n) = p(y_n \mid X^{\star n}[\pi_n])$, then the conditional distributions
$$p(y_n \mid X^{\star n}, \pi_n) = p(y_n \mid X^{\star n}[\pi_n]) = \frac{p(y_n, X^{\star n}[\pi_n])}{p(X^{\star n}[\pi_n])} \tag{2.48}$$
can also be computed in closed form for each regulator set $\pi_n$; see [49] for details.
Imposing uniform priors on the regulator sets $\pi_n$, subject to a maximal cardinality restriction $F$, the posterior distribution of the regulator set $\pi_n$ with $|\pi_n| \le F$ is given by:
$$P(\pi_n \mid y_n, X^{\star n}) = \frac{p(y_n \mid X^{\star n}[\pi_n])}{\sum_{\tilde{\pi}_n : |\tilde{\pi}_n| \le F} p(y_n \mid X^{\star n}[\tilde{\pi}_n])} \tag{2.49}$$
where the sum in the denominator is over all valid regulator sets $\tilde{\pi}_n$ whose cardinality
is lower than or equal to the fan-in $F$. The posterior probability of an interaction between $x_i^n$ and $y_n$ can then be computed by marginalization:
$$P(x_i^n \to y_n \mid y_n, X^{\star n}) = \sum_{\pi_n} I(x_i^n \in \pi_n)\, P(\pi_n \mid y_n, X^{\star n}) \tag{2.50}$$
where $I(x_i^n \in \pi_n)$ is the indicator function, which is 1 if $x_i^n$ is in the set of regulators $\pi_n$, and zero otherwise. I use the posterior probabilities in Equation (2.50) to score the regulatory interactions with respect to their strengths.
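The normalization in Equation (2.49) and the marginalization in Equation (2.50) can be illustrated with a minimal Python sketch. The function names `regulator_set_posterior` and `edge_posteriors` and the toy log scores are hypothetical and chosen only for illustration; in practice the log marginal likelihoods would come from the closed-form expressions above, which are not implemented here.

```python
import math
from itertools import combinations


def regulator_set_posterior(candidates, log_marginal, fan_in):
    """Posterior over regulator sets under a uniform prior (cf. Eq. 2.49).

    log_marginal is assumed to return log p(y_n | X[pi_n]) for a frozenset pi_n.
    """
    # Enumerate every regulator set with at most fan_in members.
    sets = [frozenset(c)
            for size in range(fan_in + 1)
            for c in combinations(candidates, size)]
    log_scores = [log_marginal(s) for s in sets]
    # Normalise with the log-sum-exp trick for numerical stability.
    top = max(log_scores)
    log_norm = top + math.log(sum(math.exp(v - top) for v in log_scores))
    return {s: math.exp(v - log_norm) for s, v in zip(sets, log_scores)}


def edge_posteriors(candidates, posterior):
    """Marginal posterior probability of each interaction (cf. Eq. 2.50)."""
    return {x: sum(p for s, p in posterior.items() if x in s)
            for x in candidates}


# Hypothetical toy values of log p(y_n | X[pi_n]), for illustration only.
toy_log_scores = {
    frozenset(): -12.0,
    frozenset({"x1"}): -9.0,
    frozenset({"x2"}): -11.5,
    frozenset({"x1", "x2"}): -9.5,
}
posterior = regulator_set_posterior(["x1", "x2"], toy_log_scores.__getitem__, fan_in=2)
edges = edge_posteriors(["x1", "x2"], posterior)
```

Exhaustive enumeration is only feasible because the fan-in restriction keeps the number of candidate sets small; with many potential regulators and a larger $F$ the sum in the denominator grows combinatorially.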
2.14 Dynamic Bayesian network with BDe score (Banjo)
Banjo (Bayesian Network Inference with Java Objects) is an implementation of a dynamic Bayesian network (DBN) inference algorithm that uses the Bayesian Dirichlet (BDe) scoring metric [67]. The DBN is a first-order Markov model that has time-varying dependences and conditional independences of discrete variables, meaning that the variables at one time point are affected by the variables of the immediately preceding time point. The dependences that form the network are proposed in a greedy search procedure, and the BDe metric scores how well the network represents the observed data. The strength and sign of the network dependences are determined through an additional influence score.
The DBN is defined by the pair $\langle G, \Theta \rangle$. The graph $G$ describes the dependence
structure and the parameter set $\Theta$ holds the probability distribution parameter vectors $\theta_{n|\pi_n} = (\theta_{n,m|\pi_n})$ for all $n$ and $m$, where $m = 1, \ldots, M$ refers in this context to the time points in the dynamic network. The parameter $\theta_{n,m|\pi_n} = p(x_n \mid \pi_n)$ for each node $n$ and time point $m$ depends on its corresponding parent set $\pi_n$, as defined previously. The joint probability distribution over all nodes is
$$p(x) = P(x_1, \ldots, x_N) = \prod_{m=1}^{M} \prod_{n=1}^{N} P(x_{n,m} \mid \pi_n),$$
namely the probability for the variable $x_n$ to take on a certain value, given the dependence on the incoming parent nodes. The score for the graph $G$ given a data set $D$ for all variables $x_{n,m}$ is the Bayesian score function:
$$\log P(G \mid D) = \log P(D \mid G) + \log P(G) - \log P(D) \tag{2.51}$$
The evidence $p(D)$ can be neglected since it is the same for all settings of $G$. The prior over the graph, $p(G)$, can also be dropped: I assume no preference for any graph, so the prior is uniform and contributes only a constant. Solving the remaining log marginal likelihood $\log p(D \mid G)$ requires integration over all possible settings of the parameter set $\Theta$, leading to the Bayesian Dirichlet score $s_D(G)$:
$$\text{BD:}\quad s_D(G) = \log p(D \mid G) = \log \int p(D \mid G, \Theta)\, p(\Theta \mid G)\, d\Theta \tag{2.52}$$
The task is to find a graph $G^*$ that satisfies $G^* = \arg\max_G s_D(G)$. Assuming that $p(\Theta \mid G)$ is a Dirichlet prior, the integral can be solved with
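$$s_D(G) = \log \prod_{n=1}^{N} \prod_{j=1}^{q_n} \frac{\Gamma(\alpha_{nj})}{\Gamma(\alpha_{nj} + N_{nj})} \prod_{k=1}^{r_n} \frac{\Gamma(\alpha_{njk} + N_{njk})}{\Gamma(\alpha_{njk})} \tag{2.53}$$
where $q_n$ is the number of unique instantiations of $\pi_n$, $r_n$ is the number of discrete values that $x_n$ takes in the data $D$, $\Gamma(\cdot)$ is the gamma function, $\alpha_{njk}$ are the Dirichlet concentration hyper-parameters with $\alpha_{nj} = \sum_k \alpha_{njk}$, $N_{njk}$ is the number of times that $x_n$ takes on the value $k$ while the parents of $x_n$ take on instantiation $j$, and $N_{nj} = \sum_k N_{njk}$.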
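To make Equation (2.53) concrete, the following Python sketch evaluates the contribution of a single node from its count table using log-gamma functions. The function names, the toy counts and the choice $\alpha_{njk} = 1$ are assumptions made for illustration only; they are not the hyper-parameters or the implementation used by Banjo.

```python
import math


def bd_node_score(counts, alpha):
    """Log BD contribution of one node n (the inner products of Eq. 2.53).

    counts[j][k] holds N_njk and alpha[j][k] holds alpha_njk, where j indexes
    the parent instantiations and k the discrete values of the node.
    """
    score = 0.0
    for n_j, a_j in zip(counts, alpha):
        a_sum, n_sum = sum(a_j), sum(n_j)  # alpha_nj and N_nj
        score += math.lgamma(a_sum) - math.lgamma(a_sum + n_sum)
        for n_jk, a_jk in zip(n_j, a_j):
            score += math.lgamma(a_jk + n_jk) - math.lgamma(a_jk)
    return score


def bd_score(all_counts, all_alpha):
    """Log BD score of a graph: the sum of the node contributions."""
    return sum(bd_node_score(c, a) for c, a in zip(all_counts, all_alpha))


# Toy example (assumed counts): a binary node with one binary parent ...
with_parent = bd_node_score([[8, 2], [1, 9]], [[1.0, 1.0], [1.0, 1.0]])
# ... versus the same 20 observations scored without any parent.
without_parent = bd_node_score([[9, 11]], [[1.0, 1.0]])
```

In this toy example the node's value depends strongly on the parent's state, so the score with the parent is higher than the score without it. Working with `math.lgamma` rather than the gamma function itself avoids overflow for large counts.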
A disadvantage of Banjo is that it is limited to discrete values, which requires a discretisation of my continuous data and thereby causes information loss. In Chapter 5 I use the quantile discretisation procedure described by Hartemink [66]. For a detailed account of the method, refer to the supplementary material S1 in [132] or the website⁸.
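The idea behind a quantile discretisation can be sketched as plain equal-frequency binning. This is a simplified illustration and not necessarily identical to the procedure described in [66]; the function name `quantile_discretise` and the default of three levels are assumptions.

```python
def quantile_discretise(values, n_levels=3):
    """Map each value to a level 0..n_levels-1 with roughly equal counts per level.

    Values are ranked, and the rank decides the level. Tied values are ordered
    by their position in the input, so ties may fall into different levels.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    levels = [0] * len(values)
    for rank, idx in enumerate(order):
        levels[idx] = rank * n_levels // len(values)
    return levels


# Toy expression profile of one gene over six time points (assumed values).
profile = [0.1, 2.5, 0.7, 1.9, 3.2, 1.1]
discrete_profile = quantile_discretise(profile, n_levels=3)
```

Each level then contains about the same number of observations, which keeps the count tables $N_{njk}$ balanced but discards the distances between the original values.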
To measure the confidence of the proposed interactions, I use the fact that Banjo produces a summary of the 100 highest-scoring networks. Extracting the regulatory interactions between a predictor $x_n$ and a target $y_n$ from the 100 networks corresponds to marginalization over these high-scoring networks. An estimator of the marginal posterior probability of an interaction $x_n \to y_n$ is given by the fraction of networks that contain the interaction.
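The fraction-based estimator described above amounts to simple counting. The sketch below assumes that each high-scoring network has already been parsed into a set of directed edges `(regulator, target)`; this representation and the function name `edge_confidence` are hypothetical, and reading Banjo's output files is not shown.

```python
from collections import Counter


def edge_confidence(networks):
    """Fraction of networks that contain each directed edge (regulator, target)."""
    counts = Counter(edge for net in networks for edge in set(net))
    return {edge: c / len(networks) for edge, c in counts.items()}


# Four toy networks standing in for the 100 highest-scoring ones.
top_networks = [
    {("x1", "y"), ("x2", "y")},
    {("x1", "y")},
    {("x1", "y"), ("x3", "y")},
    {("x2", "y")},
]
confidence = edge_confidence(top_networks)
```

Edges that never occur in any of the networks are absent from the result and implicitly receive a confidence of zero.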
Chapter 3
Inference and Evaluation
Applying probabilistic predictions to find proper parameters for a regression problem is a central paradigm in machine learning. The maximum likelihood estimate (MLE) and the maximum a posteriori (MAP) estimate are the two most important and basic approaches to infer model parameters, based on the likelihood (MLE) or the posterior probability (MAP). With them it is easier to handle ambiguous cases by assigning a probability, i.e. a confidence, to the parameters of the model that maps a predictive set of features to a dependent variable or response, as defined in the widely used linear regression models. Given a data set $D = (x_1, \ldots, x_M)$ and the parameter vector $\theta$, which may also contain only a single parameter, the MLE is defined as
$$\theta_{MLE} = \arg\max_{\theta \in \Theta} \{ p(D \mid \theta) \} \tag{3.1}$$
Hence $\theta_{MLE}$ is a maximum likelihood estimate for the true parameter $\theta$. The advantage of the MLE is that it is easy to compute and invariant under re-parametrization, i.e. if $\theta_{MLE}$ is the MLE of $\theta$, then $g(\theta_{MLE})$ is the MLE of $g(\theta)$ for a function $g$; for instance, $\theta_{MLE}^2$ is the MLE of $\theta^2$. Furthermore, the MLE has several asymptotic properties: its distribution converges toward a normal distribution, and with a large data size $M$ it converges to the true parameter $\theta$. One of the major disadvantages of the MLE is that it tends to over-fit the model to the data. This means that the model might perfectly predict the data samples it was fitted to but fail completely on a similar data set, because it does not capture the uncertainty of the data but rather picks up the noise of the data samples. However, a penalized likelihood can prevent over-fitting and is equivalent to the MAP, where the penalization is controlled with a prior density. The MAP estimate can be thought of as the value of $\theta$ from the complete parameter set $\Theta$ that maximizes the joint density $p(\theta, D) = p(D \mid \theta)\, p(\theta)$, which is proportional to the posterior density $p(\theta \mid D)$:
$$\theta_{MAP} = \arg\max_{\theta \in \Theta} \{ p(\theta \mid D) \} = \arg\max_{\theta \in \Theta} \{ p(D \mid \theta)\, p(\theta) \} \tag{3.2}$$
For a data set with a large number of samples, $M \to \infty$, the likelihood $p(D \mid \theta)$ becomes dominant compared to the parameter prior $p(\theta)$. In this scenario the MAP tends to approach the solution of the MLE and also shares the same asymptotic properties. A disadvantage of the MAP is that, in contrast to the MLE, it is not invariant under re-parametrization of the model. Thus the MAP is suboptimal in the sense that its value depends on the chosen parametrization, each of which yields a different posterior density.
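The relation between the two estimators can be illustrated with a Bernoulli likelihood and a Beta prior, for which both maximizers are available in closed form. This model, the prior Beta(2, 2) and the toy data are assumptions chosen for illustration; they are not taken from the applications in this thesis.

```python
def bernoulli_mle(successes, trials):
    """Maximiser of the Bernoulli likelihood p(D | theta)."""
    return successes / trials


def bernoulli_map(successes, trials, a=2.0, b=2.0):
    """Maximiser of p(D | theta) p(theta) for a Beta(a, b) prior with a, b > 1."""
    return (successes + a - 1.0) / (trials + a + b - 2.0)


gaps = []
for trials in (10, 100, 10000):
    successes = 7 * trials // 10  # toy data with 70 % successes
    mle = bernoulli_mle(successes, trials)
    map_estimate = bernoulli_map(successes, trials)
    gaps.append(abs(mle - map_estimate))
    print(trials, round(mle, 4), round(map_estimate, 4))
```

With 10 samples the prior pulls the MAP estimate noticeably towards 0.5, whereas with 10000 samples the two estimates almost coincide, in line with the likelihood dominating the prior as $M$ grows.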
In Section 3.1 I will discuss feature selection techniques that control the number and choice of parameters in the set $\theta$, with a focus on least squares regularization. Section 3.2 describes the popular Markov chain Monte Carlo (MCMC) method, which infers marginal posterior probabilities by converging towards the true posterior density. Feature selection in a discrete sense can be achieved with the reversible jump MCMC (RJMCMC), which is described in the same section. Since I use clustering to infer species similarities in terms of neighbourhood distributions (see Chapter 5), I will explain the k-means algorithm and the gap statistic in Section 3.3. Finally, Section 3.4 describes how I evaluate the learned network structures that are retrieved with the methods previously defined in Chapter 2.