Chapter 2 Methodologies
2.2 Regression methods
2.2.2 Bayesian regularisation neural networks
Artificial neural networks (ANNs) are widely employed for a vast number of learning problems, due to their flexibility and accuracy (‘universal approximators’). Recent developments in the form of deep learning networks [68, 69] have also revitalised the area. ANNs use a fixed number of basis functions that can be adapted to different datasets to approximate an unknown mapping from inputs to outputs [20]. There are many types of ANN model, e.g., feed-forward neural networks, recurrent networks, polynomial networks and modular networks.
In this thesis, a feed-forward neural network, also known as a multilayer perceptron (MLP), is used as a regression model (a data-driven emulator) for multivariate emulation. The method is briefly introduced following Bishop et al. [20]. The basic idea of ANNs is to create a network of connected “neurons” that take inputs from a specified subset of the neurons and return outputs that are in turn used as inputs for another subset of neurons. They are, essentially, a complex expansion of a function in terms of a basis that depends on the functions associated with the neurons and the number of neurons, as well as the way in which the neurons interact.
In MLPs, the output (activation) is defined as follows:
$$a_j^{(k+1)} = h\left(\sum_{i=1}^{N_k} w_{ji}^{(k)} a_i^{(k)} + w_{j0}^{(k)}\right), \tag{2.10}$$
where $h(\cdot)$ is an ‘activation function’, $a_i^{(k)}$ indicates the $i$-th output in the $k$-th layer, $N_k$ is the total number of activation functions in the $k$-th layer, and $w_{ji}^{(k)}$ is the weight (or parameter) associated with $a_i^{(k)}$, connecting it to neuron $j$ in layer $k+1$. The most commonly chosen activation function is the sigmoid function $h(x) = 1/(1 + e^{-x})$. Other popular choices include the hyperbolic tangent $h(x) = (e^{x} - e^{-x})/(e^{x} + e^{-x})$ and the rectified linear function $h(x) = \max(0, x)$.
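As an illustration of Eq. (2.10), the following minimal Python sketch evaluates the three activation functions and the outputs of a single layer. The function names (`sigmoid`, `relu`, `layer_forward`) and the numerical values are hypothetical and chosen only for this example; the bias $w_{j0}^{(k)}$ is stored separately from the other weights.

```python
import math

def sigmoid(x):
    """Sigmoid activation h(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    """Hyperbolic tangent activation."""
    return math.tanh(x)

def relu(x):
    """Rectified linear activation h(x) = max(0, x)."""
    return max(0.0, x)

def layer_forward(a_prev, weights, biases, h):
    """Eq. (2.10): a_j^(k+1) = h(sum_i w_ji^(k) a_i^(k) + w_j0^(k)).

    weights[j][i] plays the role of w_ji^(k), biases[j] that of w_j0^(k).
    """
    return [
        h(sum(w_ji * a_i for w_ji, a_i in zip(w_j, a_prev)) + b_j)
        for w_j, b_j in zip(weights, biases)
    ]

# Illustrative values (assumed): two outputs of layer k feed three neurons.
a_k = [1.0, -2.0]
W_k = [[0.5, 0.25], [1.0, -1.0], [0.0, 0.0]]
b_k = [0.0, 0.5, 0.0]

a_relu = layer_forward(a_k, W_k, b_k, relu)
a_sigmoid = layer_forward(a_k, W_k, b_k, sigmoid)
```

Here the first neuron receives a pre-activation of zero, so its sigmoid output is 0.5 and its rectified linear output is 0.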
Eq. (2.10) corresponds to a feedforward network, in which the inputs to layer $k$ are outputs of neurons from layers that strictly precede layer $k$. The number of neurons in a layer, $N_k$, and the number of layers, $n$, decide the complexity of an MLP.
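Applying Eq. (2.10) at the final layer gives the univariate output of an $(n+1)$-layer MLP as:
$$o(\boldsymbol{\xi}) = a_1^{(n+1)} = h\left(\sum_{i=1}^{N_n} w_{1i}^{(n)} a_i^{(n)} + w_{10}^{(n)}\right), \tag{2.11}$$
which is used to approximate $\eta(\boldsymbol{\xi})$. The MLP is naturally extended to multiple output problems, where $\mathbf{y} = (y_1, \ldots, y_d)^T = (\eta_1(\boldsymbol{\xi}), \ldots, \eta_d(\boldsymbol{\xi}))^T \in \mathbb{R}^d$, by setting $o_i(\boldsymbol{\xi}) = a_i^{(n+1)} \approx \eta_i(\boldsymbol{\xi})$ for $i = 1, \ldots, d$.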
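The multi-output forward pass can be sketched by applying Eq. (2.10) layer by layer. This is a minimal illustration rather than the implementation used in the thesis; the function name `mlp_forward`, the layer sizes and the weight values are assumptions made for the example, and the same activation $h$ is applied at every layer, as in Eq. (2.11).

```python
import math

def mlp_forward(xi, layers, h=math.tanh):
    """Propagate the input vector xi through all layers using Eq. (2.10).

    `layers` is a list of (weights, biases) pairs, one pair per layer;
    weights[j][i] plays the role of w_ji^(k), biases[j] that of w_j0^(k).
    The returned list is o(xi) = (o_1(xi), ..., o_d(xi)).
    """
    a = list(xi)
    for weights, biases in layers:
        a = [
            h(sum(w_ji * a_i for w_ji, a_i in zip(w_j, a)) + b_j)
            for w_j, b_j in zip(weights, biases)
        ]
    return a

# Illustrative network (assumed sizes): 2 inputs -> 3 hidden neurons -> d = 2 outputs.
hidden = ([[0.1, -0.2], [0.4, 0.3], [-0.5, 0.2]], [0.0, 0.1, -0.1])
output = ([[0.3, -0.1, 0.2], [0.0, 0.5, -0.4]], [0.05, -0.05])

o = mlp_forward([1.0, 2.0], [hidden, output])
```

The length of `o` equals the number of neurons in the last layer, so each component $o_i$ can be paired with one target $y_i$.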
To train the model with a given dataset, a cost function is defined as:
$$E_D = \sum_{i=1}^{m} \frac{1}{2}\left(\mathbf{y}^{(i)} - \mathbf{o}(\boldsymbol{\xi}^{(i)})\right)^T\left(\mathbf{y}^{(i)} - \mathbf{o}(\boldsymbol{\xi}^{(i)})\right), \tag{2.12}$$
where $\mathbf{o}(\boldsymbol{\xi}^{(i)}) = (o_1(\boldsymbol{\xi}^{(i)}), \ldots, o_d(\boldsymbol{\xi}^{(i)}))^T$. One can define a vector of weights $\mathbf{w} = (w_{10}^{(0)}, \ldots, w_{dN_n}^{(n)})$. Finding the $\mathbf{w}$ that minimises the sum-of-squares error defined in Eq. (2.12) would give an optimal approximation to our training set $\{\mathbf{y}^{(i)}\}_{i=1}^{m}$. This, however, will not necessarily generate an accurate approximation to $\boldsymbol{\eta}(\boldsymbol{\xi})$ at test inputs (generalization of the model) as a consequence of over-fitting, which is a major issue in basic ANN formulations. The problem can be alleviated by adding a regularization term (a general method for optimization problems) to the cost function as follows:
$$E = \beta E_D + \alpha E_w = \beta E_D + \alpha \frac{1}{2}\mathbf{w}^T\mathbf{w}, \tag{2.13}$$
where $E_w$ is the sum of squares of the network weights and $\beta$ and $\alpha$ are parameters determining the weighting of each cost term. A large $\beta$ could lead to good approximations to the training data but may result in overfitting, while a large $\alpha$ would improve generalization but under-weight the error in fitting the model to the training data.
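The regularised cost can be evaluated directly from Eqs. (2.12) and (2.13). The sketch below is illustrative only: the function names and the numerical values of the targets, network outputs, weights, $\alpha$ and $\beta$ are assumptions, and the network outputs are supplied as plain lists rather than computed by a network.

```python
def data_error(targets, outputs):
    """E_D of Eq. (2.12): half the squared residual norm, summed over samples."""
    return sum(
        0.5 * sum((y - o) ** 2 for y, o in zip(y_i, o_i))
        for y_i, o_i in zip(targets, outputs)
    )

def weight_penalty(w):
    """E_w as written in Eq. (2.13): (1/2) w^T w."""
    return 0.5 * sum(w_k ** 2 for w_k in w)

def regularised_cost(targets, outputs, w, alpha, beta):
    """E = beta * E_D + alpha * E_w, Eq. (2.13)."""
    return beta * data_error(targets, outputs) + alpha * weight_penalty(w)

# Illustrative values (assumed): m = 2 samples with d = 2 outputs each.
targets = [[1.0, 0.0], [0.0, 1.0]]
outputs = [[0.5, 0.0], [0.0, 0.0]]
w = [1.0, -2.0, 2.0]

E_D = data_error(targets, outputs)   # 0.5 * 0.25 + 0.5 * 1.0 = 0.625
E_w = weight_penalty(w)              # 0.5 * (1 + 4 + 4) = 4.5
E = regularised_cost(targets, outputs, w, alpha=0.1, beta=2.0)
```

Increasing `alpha` while holding `beta` fixed raises the relative cost of large weights, which is the mechanism by which the regulariser favours smoother fits.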
Minimization of (2.13) can be achieved with a gradient-based optimisation algorithm, e.g., gradient descent. These approaches, however, involve computing the partial derivatives of $E$ w.r.t. each weight, which is computationally intensive. Back-propagation is typically used to calculate the partial derivatives efficiently [20].
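A minimal sketch of back-propagation combined with plain gradient descent is given below. It is not the implementation used in the thesis: it assumes a network with one input, a single hidden layer of tanh neurons and, for simplicity, a linear output neuron; the data, initial weights, learning rate and the values of $\alpha$ and $\beta$ are all illustrative. The back-propagated gradient is compared with central finite differences as a sanity check.

```python
import math

def cost_and_grad(params, data, alpha, beta, n_hidden):
    """Return E = beta*E_D + alpha*E_w and its gradient via back-propagation.

    params is a flat list [w1_1..w1_H, b1_1..b1_H, w2_1..w2_H, b2] for an
    assumed network: one input, H tanh hidden neurons, one linear output.
    """
    H = n_hidden
    w1, b1 = params[:H], params[H:2 * H]
    w2, b2 = params[2 * H:3 * H], params[3 * H]
    E = 0.5 * alpha * sum(p * p for p in params)   # alpha * E_w
    grad = [alpha * p for p in params]             # gradient of alpha * E_w
    for x, y in data:
        # Forward pass.
        z = [math.tanh(w1[j] * x + b1[j]) for j in range(H)]
        o = sum(w2[j] * z[j] for j in range(H)) + b2
        E += 0.5 * beta * (y - o) ** 2
        # Backward pass: propagate the output error back to each weight.
        delta_o = beta * (o - y)
        for j in range(H):
            delta_j = delta_o * w2[j] * (1.0 - z[j] ** 2)
            grad[j] += delta_j * x
            grad[H + j] += delta_j
            grad[2 * H + j] += delta_o * z[j]
        grad[3 * H] += delta_o
    return E, grad

def gradient_descent(params, data, alpha, beta, n_hidden, lr=0.005, steps=500):
    """Fixed-step gradient descent on E; returns final weights and cost history."""
    params = list(params)
    history = []
    for _ in range(steps):
        E, grad = cost_and_grad(params, data, alpha, beta, n_hidden)
        history.append(E)
        params = [p - lr * g for p, g in zip(params, grad)]
    return params, history

# Illustrative training set (assumed): samples of sin(x) on [-2, 2].
data = [(k / 4.0, math.sin(k / 4.0)) for k in range(-8, 9)]
init = [0.3, -0.2, 0.1, 0.0, 0.1, -0.1, 0.2, -0.3, 0.1, 0.0]  # H = 3

# Sanity check: back-propagated gradient versus central finite differences.
eps = 1e-6
_, g = cost_and_grad(init, data, 0.01, 1.0, 3)
max_grad_error = 0.0
for k in range(len(init)):
    plus = list(init)
    minus = list(init)
    plus[k] += eps
    minus[k] -= eps
    numeric = (cost_and_grad(plus, data, 0.01, 1.0, 3)[0]
               - cost_and_grad(minus, data, 0.01, 1.0, 3)[0]) / (2 * eps)
    max_grad_error = max(max_grad_error, abs(numeric - g[k]))

trained, history = gradient_descent(init, data, alpha=0.01, beta=1.0, n_hidden=3)
```

A single backward pass yields all partial derivatives at a cost comparable to one forward pass, whereas the finite-difference check requires two cost evaluations per weight.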
Other methods, e.g., early-stopping and cross-validation, could be implemented to improve generalization and avoid over-fitting. Another choice is Bayesian networks, in which a prior is placed on the weights, leading to improved generalization. An efficient approach, which avoids a full Bayesian estimation of all network weights (highly computationally intensive and therefore rarely adopted), is Bayesian regularization [70, 71], which is implemented in the thesis.
The weights are assumed to have zero-mean Gaussian prior distributions. Set $\beta$ to the inverse variance of the (assumed) zero-mean Gaussian noise and $\alpha$ to the inverse variance of the weights. By Bayes' law, the posterior density of the weights is:
$$P(\mathbf{w}|D, \alpha, \beta, M) = \frac{P(D|\mathbf{w}, \beta, M)\,P(\mathbf{w}|\alpha, M)}{P(D|\alpha, \beta, M)}, \tag{2.14}$$
where $D = \{\mathbf{y}^{(i)}, \boldsymbol{\xi}^{(i)}\}_{i=1}^{m}$ is the data set, $M$ indicates the MLP model, $P(\mathbf{w}|\alpha, M)$ is the prior density and $P(D|\mathbf{w}, \beta, M)$ is the likelihood function. The optimal weights should maximize the posterior probability $P(\mathbf{w}|D, \alpha, \beta, M)$. It is assumed that the noise in the training data and the weight prior are both Gaussian:
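$$P(D|\mathbf{w}, \beta, M) = \frac{1}{Z_D(\beta)}\exp(-\beta E_D), \qquad P(\mathbf{w}|\alpha, M) = \frac{1}{Z_w(\alpha)}\exp(-\alpha E_w), \tag{2.15}$$
where $Z_D(\beta) = (\pi/\beta)^{m/2}$ and $Z_w(\alpha) = (\pi/\alpha)^{N/2}$, with $N$ being the total number of weights. Substituting these assumptions into Eq. (2.14) yields: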
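$$P(\mathbf{w}|D, \alpha, \beta, M) = \frac{\left(Z_D(\beta) Z_w(\alpha)\right)^{-1}\exp\left(-(\beta E_D + \alpha E_w)\right)}{P(D|\alpha, \beta, M)} = \left(Z_F(\alpha, \beta)\right)^{-1}\exp(-F(\mathbf{w})), \tag{2.16}$$
where $F(\mathbf{w}) = \beta E_D + \alpha E_w$ is the regularised objective function of Eq. (2.13) and $Z_F(\alpha, \beta)$ is the corresponding normalisation constant.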
In a Bayesian framework, the optimal weights should maximise the posterior, which is equivalent to minimising the regularised objective function $E = \beta E_D + \alpha E_w$. The posterior of $\alpha$ and $\beta$ has the form (again using Bayes' rule):
$$P(\alpha, \beta|D, M) = \frac{P(D|\alpha, \beta, M)\,P(\alpha, \beta|M)}{P(D|M)}. \tag{2.17}$$
To derive the maximum of the posterior $P(\alpha, \beta|D, M)$, a uniform prior density $P(\alpha, \beta|M)$ is assumed, so that the maximisation can be achieved by maximising the likelihood function $P(D|\alpha, \beta, M)$. Notice that this likelihood function is also the normalisation term shown in Eqs. (2.14) and (2.16). One can thus solve Eq. (2.14) for the normalisation factor:
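$$\begin{aligned}
P(D|\alpha, \beta, M) &= \frac{P(D|\mathbf{w}, \beta, M)\,P(\mathbf{w}|\alpha, M)}{P(\mathbf{w}|D, \alpha, \beta, M)} \\
&= \frac{\left((Z_D(\beta))^{-1}\exp(-\beta E_D)\right)\left((Z_w(\alpha))^{-1}\exp(-\alpha E_w)\right)}{(Z_F(\alpha, \beta))^{-1}\exp(-F(\mathbf{w}))} \\
&= \frac{Z_F(\alpha, \beta)}{Z_D(\beta) Z_w(\alpha)}\,\frac{\exp(-\beta E_D - \alpha E_w)}{\exp(-F(\mathbf{w}))} = \frac{Z_F(\alpha, \beta)}{Z_D(\beta) Z_w(\alpha)}.
\end{aligned} \tag{2.18}$$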
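To solve for $Z_F$, which is the only remaining unknown, a quadratic Taylor expansion of $F(\mathbf{w})$ around the minimum point $\mathbf{w}_{\mathrm{MP}}$ (i.e. a Laplace approximation) is used. This yields:
$$Z_F \approx (2\pi)^{N/2}\left(\det\left(\mathbf{H}_{\mathrm{MP}}^{-1}\right)\right)^{1/2}\exp(-F(\mathbf{w}_{\mathrm{MP}})), \tag{2.19}$$
where $\mathbf{H} = \beta\nabla^2_{\mathbf{w}} E_D + \alpha\nabla^2_{\mathbf{w}} E_w$ is the Hessian matrix of the objective function and $\mathbf{H}_{\mathrm{MP}}$ is the Hessian matrix evaluated at $\mathbf{w} = \mathbf{w}_{\mathrm{MP}}$. Substituting Eq. (2.19) into Eq. (2.18) and solving for the optimal values of $\alpha$ and $\beta$ by setting the corresponding derivatives to zero yields:
$$\alpha_{\mathrm{MP}} = \frac{\gamma}{2E_w(\mathbf{w}_{\mathrm{MP}})}, \qquad \beta_{\mathrm{MP}} = \frac{m - \gamma}{2E_D(\mathbf{w}_{\mathrm{MP}})},$$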