6.2 Proposed Approach
6.2.2 Bayesian Model Selection for Matrix Consisting of Identical ANNs
Next, the optimal ANNs in $V$ are trained repeatedly to account for the random initialization of their weights. This produces a matrix $A_{m,d}$ of size $M \times D$ consisting of ANNs $N_{m,d}$. The matrix $A_{m,d}$ is given as:
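$$
A_{m,d} =
\begin{bmatrix}
N_{1,1} & \cdots & N_{1,d} & \cdots & N_{1,D} \\
\vdots  &        & \vdots  &        & \vdots  \\
N_{m,1} & \cdots & N_{m,d} & \cdots & N_{m,D} \\
\vdots  &        & \vdots  &        & \vdots  \\
N_{M,1} & \cdots & N_{M,d} & \cdots & N_{M,D}
\end{bmatrix}
$$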
such that each column of $A_{m,d}$ contains a vector of ANNs with an identical architecture, and the architecture differs from column to column. The idea behind forming the matrix $A_{m,d}$ is to take into consideration the different sources of uncertainty arising from (1) the ANN architecture and (2) the random initialization of the weight parameters. Thus, given the matrix $A_{m,d}$, Bayesian statistics can be used to infer the posterior probability of the $m$th ANN in the $d$th column as:

$$
P(N_{m,d} \mid D_{\mathrm{train}}) = \frac{P(D_{\mathrm{train}}(x, y) \mid N_{m,d})\, P(N_{m,d})}{\sum_{m=1}^{M} P(D_{\mathrm{train}}(x, y) \mid N_{m,d})\, P(N_{m,d})}, \tag{6.10}
$$
where $P(D_{\mathrm{train}}(x, y) \mid N_{m,d})$ is the likelihood of the training data $D_{\mathrm{train}}(x, y)$ under the ANN $N_{m,d}$, and $P(N_{m,d})$ is the prior probability of $N_{m,d}$, that is, the ANN probability evaluated before observing the training data $D_{\mathrm{train}}(x, y)$. The prior ANN probability $P(N_{m,d})$ can be specified from existing prior knowledge about the credibility of ANN $N_{m,d}$, or, if no additional information is provided, it can be given as a uniform probability, $P(N_{m,d}) = 1/M$. The advantage of assigning a uniform prior probability to $P(N_{m,d})$ is that the difficulty of estimating the prior probability numerically is avoided. The likelihood $P(D_{\mathrm{train}}(x, y) \mid N_{m,d})$ may be thought of as the probability of observing the training data $D_{\mathrm{train}}(x, y)$ under the ANN $N_{m,d}$. It supplies a relative measure of how well the ANN $N_{m,d}$ is supported by the training data $D_{\mathrm{train}}(x, y)$.
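The normalization in Eq. (6.10) can be illustrated with a minimal sketch for a single column of the matrix. The function name `posterior_probabilities` and the likelihood values are hypothetical and chosen only for illustration; a uniform prior of 1/M is assumed when no prior is supplied.

```python
def posterior_probabilities(likelihoods, priors=None):
    """Posterior P(N_{m,d} | D_train) for the M networks of one column, Eq. (6.10)."""
    m_count = len(likelihoods)
    if priors is None:
        # Uniform prior P(N_{m,d}) = 1/M when no prior knowledge is available.
        priors = [1.0 / m_count] * m_count
    # Numerator of Eq. (6.10): likelihood times prior for each network.
    weighted = [lik * pri for lik, pri in zip(likelihoods, priors)]
    # Denominator of Eq. (6.10): the sum over the M networks in the column.
    evidence = sum(weighted)
    return [w / evidence for w in weighted]


# Hypothetical likelihood values for M = 3 networks sharing one architecture.
column_likelihoods = [0.2, 0.5, 0.3]
posteriors = posterior_probabilities(column_likelihoods)
```

With a uniform prior the prior cancels, so the posterior of each network is simply its likelihood divided by the sum of the likelihoods in that column.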
Since the denominator in Eq. (6.10) is common to all the ANNs, the posterior ANN probability is proportional to the product of the prior probability and the likelihood. The likelihood of each ANN is evaluated by measuring the degree of agreement between the training data $D_{\mathrm{train}}(y)$ and the response $\hat{y}^{z}_{m,d}$ of each ANN. Hence, a probabilistic relationship between the training data $D_{\mathrm{train}}(x, y)$ and the ANN predictions $\hat{y}^{z}_{m,d}$ that involves uncertainty can be described. Typically, a bias function and noise are included as parts of the probabilistic relationship to match ANN predictions with training data. The bias function captures the discrepancies between the expensive model responses and the predictions made by the ANN. The noise is usually assumed to be an independent and identically distributed normal random variable with a mean of zero [64]. Furthermore, various authors [65–67] have used the Bayesian statistical method to quantify the uncertainty in the bias function modelled as a Gaussian process. In their works, a mathematical formulation that combines the bias function associated with the ANN and the noise from the training data is used to describe the probabilistic relationship between the training data $D_{\mathrm{train}}(x, y)$ and the ANN predictions $\hat{y}^{z}_{m,d}$. The mathematical formulation of this probabilistic relationship is given by the following:
$$
D_{\mathrm{train}}(y) = \hat{y}^{z}_{m,d} + \varepsilon^{z}_{m,d}, \tag{6.11}
$$
where $\varepsilon^{z}_{m,d}$ is a random variable that covers both the bias associated with the ANN prediction $\hat{y}^{z}_{m,d}$ and the noise in the response training data $D_{\mathrm{train}}(y)$. $\varepsilon^{z}_{m,d}$ is assumed to be an independent, identically distributed random variable with a mean $\mu_{m,d}$ of zero. The use of $\varepsilon^{z}_{m,d}$ with zero mean does not shift the ANN prediction $\hat{y}^{z}_{m,d}$. This reflects the fact that $\hat{y}^{z}_{m,d}$ is the most probable prediction value for the ANN. The bias function is not included as a separate term in the probabilistic relationship, because introducing a separate bias function would shift the prediction $\hat{y}^{z}_{m,d}$ of the ANN from the initially predicted value.

Next, the likelihood $P(D_{\mathrm{train}}(x, y) \mid N_{m,d})$ of the training data $D_{\mathrm{train}}(x, y)$ for ANN $N_{m,d}$ is evaluated by observing where the training data points $D_{\mathrm{train}}(y)$ are located in the distribution of $\hat{y}^{z}_{m,d}$ estimated by $N_{m,d}$. The procedure for estimating the distribution $P(\hat{y} \mid N_{m,d})$ of $N_{m,d}$ and the likelihood $P(D_{\mathrm{train}}(x, y) \mid N_{m,d})$ is as follows. First, the uncertainty in the errors of the predictions $\hat{y}^{z}_{m,d}$ made by $N_{m,d}$ is quantified by introducing the assumption that the prediction errors are normally distributed with a mean $\mu_{m,d}$ of zero. The error of the prediction of network $N_{m,d}$ is represented by the following:

$$
\varepsilon^{z}_{m,d} = D_{\mathrm{train}}(y^{z}) - \hat{y}^{z}_{m,d}, \qquad \varepsilon^{z}_{m,d} \sim \mathcal{N}(0, \sigma^{2}_{m,d}), \qquad z = 1, 2, \ldots, N, \tag{6.12}
$$
where $D_{\mathrm{train}}(y^{z})$ is the $z$th training response output, $\hat{y}^{z}_{m,d}$ is the prediction of that training point made by $N_{m,d}$, $\sigma^{2}_{m,d}$ is the variance of the prediction error $\varepsilon^{z}_{m,d}$, and $N$ is the number of samples in the training data. The measured prediction error $\varepsilon^{z}_{m,d}$ is considered to be a random sample from a normal distribution with a mean $\mu_{m,d}$ of zero and variance $\sigma^{2}_{m,d}$. Using the principle of maximum likelihood estimation (MLE) (see [68]), the variance $\sigma^{2}_{m,d}$ for $N_{m,d}$ can be estimated as:

$$
\sigma^{2}_{m,d} = \frac{1}{N} \sum_{z=1}^{N} \left(\varepsilon^{z}_{m,d}\right)^{2}. \tag{6.13}
$$
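As an illustration of Eqs. (6.12) and (6.13), the following minimal sketch computes the residuals of one network on the training responses and the MLE of their variance. The function name `mle_error_variance` and the data values are hypothetical; the divisor is N rather than N - 1 because the mean of the error is fixed at zero instead of being estimated.

```python
def mle_error_variance(y_train, y_pred):
    """MLE of the prediction-error variance for one network, Eqs. (6.12)-(6.13)."""
    if len(y_train) != len(y_pred) or not y_train:
        raise ValueError("y_train and y_pred must be non-empty and of equal length")
    # Eq. (6.12): error = training response minus network prediction.
    errors = [y - y_hat for y, y_hat in zip(y_train, y_pred)]
    # Eq. (6.13): mean of squared errors (zero mean assumed, so divide by N).
    return sum(e * e for e in errors) / len(errors)


# Hypothetical training responses and predictions of a single network N_{m,d}.
y_train = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.1, 1.9, 3.2, 3.8]
sigma2 = mle_error_variance(y_train, y_pred)  # approximately 0.025
```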
Furthermore, the predictive distribution $P(\hat{y} \mid N_{m,d})$ of the response $\hat{y}_{m,d}$ under ANN $N_{m,d}$ is created by incorporating the prediction error obtained in the previous step into the prediction $\hat{y}_{m,d}$ made by $N_{m,d}$. This predictive distribution is defined by the following equation:

$$
P(\hat{y}_{m,d} \mid N_{m,d}) = D_{\mathrm{train}}(y) + \varepsilon^{z}_{m,d}. \tag{6.14}
$$
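Lastly, assuming that the residuals between the training data $D_{\mathrm{train}}(x, y)$ and the output $\hat{y}_{m,d}$ of ANN $N_{m,d}$ are normally and independently distributed with a mean of zero and constant variance $\sigma^{2}_{m,d}$, the likelihood function $P(D_{\mathrm{train}}(x, y) \mid N_{m,d})$ can be expressed as:

$$
P(D_{\mathrm{train}}(x, y) \mid N_{m,d}) \approx \frac{1}{\sqrt{2\pi\sigma^{2}_{m,d}}} \, \frac{1}{N} \sum_{z=1}^{N} \exp\left\{ -\frac{\left[y^{z} - \hat{y}^{z}_{m,d}\right]^{2}}{2\sigma^{2}_{m,d}} \right\}. \tag{6.15}
$$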
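A minimal sketch of the likelihood approximation in Eq. (6.15) is given below. It recomputes the variance of Eq. (6.13) internally so that the block is self-contained. The function name `network_likelihood` and the data are hypothetical, and raising an error for a zero variance is an added safeguard rather than part of the formulation.

```python
import math


def network_likelihood(y_train, y_pred):
    """Approximate likelihood P(D_train | N_{m,d}) of one network, Eq. (6.15)."""
    n = len(y_train)
    if n == 0 or n != len(y_pred):
        raise ValueError("y_train and y_pred must be non-empty and of equal length")
    errors = [y - y_hat for y, y_hat in zip(y_train, y_pred)]
    # Variance of the prediction error, Eq. (6.13).
    sigma2 = sum(e * e for e in errors) / n
    if sigma2 == 0.0:
        raise ValueError("zero error variance: Eq. (6.15) is undefined")
    # Gaussian normalizing constant 1 / sqrt(2 * pi * sigma^2).
    norm = 1.0 / math.sqrt(2.0 * math.pi * sigma2)
    # Average of the Gaussian kernel over the N training points.
    kernel_mean = sum(math.exp(-(e * e) / (2.0 * sigma2)) for e in errors) / n
    return norm * kernel_mean


# Hypothetical data: every prediction is off by 0.1 in absolute value.
y_train = [1.0, 2.0, 3.0, 4.0]
y_close = [1.1, 1.9, 3.1, 3.9]
y_far = [1.5, 1.5, 3.5, 3.5]
lik_close = network_likelihood(y_train, y_close)
lik_far = network_likelihood(y_train, y_far)
```

In this example the network with the smaller residuals receives the larger likelihood, which is the behaviour the posterior in Eq. (6.10) relies on.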
6.2.2.1 Robust Prediction from Artificial Neural Networks
Thereafter, to obtain a robust prediction $y^{d}_{\mathrm{robust}}$, $d = 1, 2, \ldots, D$, from the robust network (i.e. the set of identical networks in the $d$th column of $A_{m,d}$), the predictions made by all the identical networks in the $d$th column of $A_{m,d}$ are combined using the adjustment factor approach introduced in Chapter 5. Hence, the robust prediction of the ANNs in $A_{m,d}$ is expressed as:
$$
y^{d}_{\mathrm{robust}} = \hat{y}^{d*} + A^{d}_{f}, \qquad d = 1, 2, \ldots, D, \quad m = 1, 2, \ldots, M, \tag{6.16}
$$

where $\hat{y}^{d*}$ represents the point estimate of the best ANN in the $d$th column charac[...] from the identical networks in the $d$th column of $A_{m,d}$. Since the adjustment factor $A^{d}_{f}$ is assumed to be normally distributed, the expected value $E[\cdot]$ and variance $\mathrm{Var}[\cdot]$ of the adjustment factor $A^{d}_{f}$ are given by the following:
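$$
E[A^{d}_{f}] = \sum_{m=1}^{M} P(N_{m,d} \mid D_{\mathrm{train}}) \left(\hat{y}_{m,d} - \hat{y}^{d*}\right), \tag{6.17}
$$

for $d = 1, 2, \ldots, D$ and $m = 1, 2, \ldots, M$, and

$$
\mathrm{Var}[A^{d}_{f}] = \sum_{m=1}^{M} P(N_{m,d} \mid D_{\mathrm{train}}) \left(\hat{y}_{m,d} - E[y^{d}_{\mathrm{robust}}]\right)^{2}. \tag{6.18}
$$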
Similarly, the expected value and variance of the robust prediction $y^{d}_{\mathrm{robust}}$ can be estimated from the following:

$$
E[y^{d}_{\mathrm{robust}}] = \hat{y}^{d*} + E[A^{d}_{f}], \tag{6.19}
$$

$$
\mathrm{Var}[y^{d}_{\mathrm{robust}}] = \mathrm{Var}[A^{d}_{f}], \tag{6.20}
$$

where $E[A^{d}_{f}]$ and $\mathrm{Var}[A^{d}_{f}]$ represent the expected value and variance of the adjustment factor, and $E[y^{d}_{\mathrm{robust}}]$ and $\mathrm{Var}[y^{d}_{\mathrm{robust}}]$ represent the expected value and variance of the robust estimate.
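Eqs. (6.17) to (6.20) can be sketched for a single input point and a single column as follows. The function name `robust_prediction` and all numerical values are hypothetical. The sketch also assumes that the best network in the column is the one with the highest posterior probability, which is an assumption of this example and not a statement taken from the text.

```python
def robust_prediction(predictions, posteriors):
    """Expected value and variance of the robust prediction, Eqs. (6.17)-(6.20).

    predictions: outputs of the M networks of one column at one input point.
    posteriors:  P(N_{m,d} | D_train) for the same M networks (summing to 1).
    """
    # Assumption of this sketch: the best network has the highest posterior.
    best = max(range(len(posteriors)), key=lambda m: posteriors[m])
    y_star = predictions[best]
    # Eq. (6.17): expected value of the adjustment factor.
    e_adj = sum(p * (y - y_star) for p, y in zip(posteriors, predictions))
    # Eq. (6.19): expected value of the robust prediction.
    e_robust = y_star + e_adj
    # Eq. (6.18): variance of the adjustment factor about E[y_robust].
    var_adj = sum(p * (y - e_robust) ** 2 for p, y in zip(posteriors, predictions))
    # Eq. (6.20): the robust prediction inherits the variance of the adjustment factor.
    return e_robust, var_adj


# Hypothetical predictions and posteriors for M = 3 networks in one column.
e_robust, var_robust = robust_prediction([1.0, 2.0, 3.0], [0.2, 0.5, 0.3])
```

Because the posterior probabilities sum to one, the expected robust prediction in this sketch equals the posterior-weighted mean of the individual network predictions, and the variance measures the spread of those predictions around that mean.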
6.2.2.2 Confidence Interval for Robust Neural Network Prediction
Next, to quantify the uncertainty in the robust prediction $y^{d}_{\mathrm{robust}}$ due to model uncertainty, confidence intervals are established. In particular, the 5th and 95th percentiles of the robust prediction are used to quantify the model uncertainty. In theory, this interval is likely to contain the true estimated value. As the model uncertainty is assumed to be normally distributed, the confidence intervals of each respective ANN are given as:
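$$
\overline{y}^{d}_{\mathrm{robust}} = E[y^{d}_{\mathrm{robust}}] + a^{*} \sqrt{\mathrm{Var}[y^{d}_{\mathrm{robust}}]}, \tag{6.21}
$$

and

$$
\underline{y}^{d}_{\mathrm{robust}} = E[y^{d}_{\mathrm{robust}}] - a^{*} \sqrt{\mathrm{Var}[y^{d}_{\mathrm{robust}}]}, \tag{6.22}
$$

where $\overline{y}^{d}_{\mathrm{robust}}$ and $\underline{y}^{d}_{\mathrm{robust}}$ represent the upper and lower confidence bounds of the robust estimate from $N^{d}_{\mathrm{robust}}$ and $a^{*}$ represents the upper critical value of the Gaussian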