Bayesian Model Selection for Matrix Consisting of Identical ANNs

6.2 Proposed Approach

6.2.2 Bayesian Model Selection for Matrix Consisting of Identical ANNs

Next, the optimal ANNs in $V$ are trained repeatedly to account for the random initialization of their weights. This produces a matrix $A_{m,d}$ of size $M \times D$ consisting of ANNs $N_{m,d}$. Matrix $A_{m,d}$ is given as:

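$$
A_{m,d} =
\begin{bmatrix}
N_{1,1} & \cdots & N_{1,d} & \cdots & N_{1,D} \\
\vdots  &        & \vdots  &        & \vdots  \\
N_{m,1} & \cdots & N_{m,d} & \cdots & N_{m,D} \\
\vdots  &        & \vdots  &        & \vdots  \\
N_{M,1} & \cdots & N_{M,d} & \cdots & N_{M,D}
\end{bmatrix}
$$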
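The layout of the matrix can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `train_ann` is a hypothetical placeholder that records an architecture label and a seed instead of fitting a real network, and the architecture labels are invented for the example.

```python
import random


def train_ann(architecture, seed):
    """Hypothetical stand-in for training one ANN (assumption, not the source's code).

    A real implementation would initialize the weights from the seeded
    generator and then fit them to the training data.
    """
    rng = random.Random(seed)
    return {
        "architecture": architecture,
        "seed": seed,
        "initial_weight": rng.gauss(0.0, 1.0),  # placeholder for random initialization
    }


def build_matrix(architectures, M):
    """Return an M x D nested list; column d holds M replicates of architecture d."""
    D = len(architectures)
    return [
        [train_ann(architectures[d], seed=m * D + d) for d in range(D)]
        for m in range(M)
    ]


# Three illustrative architectures (D = 3), each trained M = 5 times.
A = build_matrix(["4-8-1", "4-16-1", "4-8-8-1"], M=5)
```

Each row index `m` corresponds to one random initialization and each column index `d` to one architecture, which mirrors the two sources of uncertainty the matrix is meant to capture.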

Each column of $A_{m,d}$ contains ANNs that share one architecture, distinct from the architectures of the other columns. The idea behind the formation of matrix $A_{m,d}$ is to take into consideration the different sources of uncertainty arising from (1) the ANN architecture and (2) the random initialization of the weight parameters. Thus, given matrix $A_{m,d}$, Bayesian statistics can be used to infer the posterior probability of the $m$th ANN in the $d$th column as:

$$
P(N_{m,d} \mid D_{\text{train}}) = \frac{P(D_{\text{train}}(x, y) \mid N_{m,d})\, P(N_{m,d})}{\sum_{m=1}^{M} P(D_{\text{train}}(x, y) \mid N_{m,d})\, P(N_{m,d})}, \tag{6.10}
$$
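As a minimal sketch of Eq. (6.10), the posterior probabilities of the $M$ ANNs in one column can be computed by normalizing the products of likelihood and prior. The function name and the input values are illustrative assumptions; the uniform default of $1/M$ follows the text.

```python
def posterior_probabilities(likelihoods, priors=None):
    """Posterior P(N_m | D_train) for the M ANNs of one column, following Eq. (6.10).

    likelihoods: list of P(D_train | N_m), one entry per ANN.
    priors: list of P(N_m); defaults to the uniform prior 1/M.
    """
    M = len(likelihoods)
    if priors is None:
        priors = [1.0 / M] * M
    weighted = [lik * pri for lik, pri in zip(likelihoods, priors)]
    evidence = sum(weighted)  # denominator of Eq. (6.10), common to all ANNs
    if evidence == 0.0:
        raise ValueError("all likelihood-prior products are zero")
    return [w / evidence for w in weighted]


# With a uniform prior the posterior is simply the normalized likelihood.
post = posterior_probabilities([0.2, 0.6, 0.2])
```

Because the denominator is shared by every ANN in the column, only the relative sizes of the likelihood-prior products matter.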

where $P(D_{\text{train}}(x, y) \mid N_{m,d})$ is the likelihood of the training data $D_{\text{train}}(x, y)$ for ANN $N_{m,d}$, and $P(N_{m,d})$ is the prior probability of $N_{m,d}$, that is, the ANN probability evaluated before observing the training data $D_{\text{train}}(x, y)$. The prior $P(N_{m,d})$ can be specified from existing knowledge about the credibility of ANN $N_{m,d}$, or it can be set to the uniform probability $P(N_{m,d}) = 1/M$ if no additional information is provided. The advantage of assigning a uniform prior is that the difficulty of estimating the prior probability numerically is avoided. The likelihood $P(D_{\text{train}}(x, y) \mid N_{m,d})$ may be thought of as the probability of observing the training data $D_{\text{train}}(x, y)$ under ANN $N_{m,d}$. It supplies a relative measure of how well $N_{m,d}$ is supported by the training data.

Since the denominator in Eq. (6.10) is common to all the ANNs, the posterior ANN probability is proportional to the product of the prior probability and the likelihood. The likelihood of each ANN is evaluated by measuring the degree of agreement between the training data $D_{\text{train}}(y)$ and the response $\hat{y}^{z}_{m,d}$ of each ANN. Hence, a probabilistic relationship between the training data $D_{\text{train}}(x, y)$ and the ANN predictions $\hat{y}^{z}_{m,d}$ that involves uncertainty can be described. Typically, a bias function and noise are included in the probabilistic relationship to match ANN predictions with training data. The bias function captures the discrepancies between the expensive model responses and the predictions made by the ANN. The noise is usually assumed to be an independent and identically distributed normal random variable with a mean of zero [64]. Furthermore, various authors [65–67] have used Bayesian statistical methods to quantify the uncertainty in a bias function modelled as a Gaussian process. In their works, a mathematical formulation that combines the bias function associated with the ANN and the noise from the training data is used to describe the probabilistic relationship between the training data $D_{\text{train}}(x, y)$ and the ANN predictions $\hat{y}^{z}_{m,d}$. This relationship is given by:

$$
D_{\text{train}}(y^{z}) = \hat{y}^{z}_{m,d} - \varepsilon^{z}_{m,d}, \tag{6.11}
$$

where $\varepsilon^{z}_{m,d}$ is a random variable that covers both the bias associated with the ANN prediction $\hat{y}^{z}_{m,d}$ and the noise in the response training data $D_{\text{train}}(y)$. $\varepsilon^{z}_{m,d}$ is assumed to be an independent and identically distributed random variable with a mean $\mu_{m,d}$ of zero. Using a zero-mean $\varepsilon^{z}_{m,d}$ does not shift the ANN prediction $\hat{y}^{z}_{m,d}$, which reflects the fact that $\hat{y}^{z}_{m,d}$ is the most probable prediction value for the ANN. The bias function is not included as a separate term in the probabilistic relationship, because a separate bias function would shift the prediction $\hat{y}^{z}_{m,d}$ of the ANN away from the initially predicted value.

Next, the likelihood $P(D_{\text{train}}(x, y) \mid N_{m,d})$ of the training data $D_{\text{train}}(x, y)$ for ANN $N_{m,d}$ is evaluated by observing where the training data points $D_{\text{train}}(y)$ are located in the distribution of $\hat{y}^{z}_{m,d}$ estimated by $N_{m,d}$. The procedure for estimating the distribution $P(\hat{y} \mid N_{m,d})$ of $N_{m,d}$ and the likelihood $P(D_{\text{train}}(x, y) \mid N_{m,d})$ is as follows. First, the uncertainty in the errors of the predictions $\hat{y}^{z}_{m,d}$ made by $N_{m,d}$ is quantified by introducing the assumption that the prediction errors are normally distributed with a mean $\mu_{m,d}$ of zero. The error of the prediction of network $N_{m,d}$ is represented by:

$$
\varepsilon^{z}_{m,d} = \hat{y}^{z}_{m,d} - D_{\text{train}}(y^{z}), \qquad \varepsilon^{z}_{m,d} \sim \mathcal{N}(0, \sigma^{2}_{m,d}), \qquad z = 1, 2, \ldots, N, \tag{6.12}
$$

where $D_{\text{train}}(y^{z})$ is the $z$th training response output, $\hat{y}^{z}_{m,d}$ is the prediction of the $z$th training point made by $N_{m,d}$, $\sigma^{2}_{m,d}$ is the variance of the prediction error $\varepsilon^{z}_{m,d}$, and $N$ is the number of samples in the training data. The measured prediction error $\varepsilon^{z}_{m,d}$ is considered to be a random sample from a normal distribution with a mean $\mu_{m,d}$ of zero and variance $\sigma^{2}_{m,d}$. Using the principle of maximum likelihood estimation (MLE) (see [68]), the variance $\sigma^{2}_{m,d}$ for $N_{m,d}$ can be estimated as:

$$
\sigma^{2}_{m,d} = \frac{1}{N} \sum_{z=1}^{N} \left(\varepsilon^{z}_{m,d}\right)^{2}. \tag{6.13}
$$
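The estimator in Eq. (6.13) is the mean of the squared prediction errors, with divisor $N$ rather than $N - 1$. A minimal sketch follows; the function name and the sample values are illustrative assumptions.

```python
def mle_error_variance(y_train, y_pred):
    """Maximum likelihood estimate of the error variance, following Eq. (6.13).

    y_train: training responses, one per sample z = 1..N.
    y_pred:  predictions of one ANN at the same training points.
    The zero mean of the errors is assumed, so no sample mean is subtracted.
    """
    if len(y_train) != len(y_pred) or not y_train:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(y_train)
    # The sign of each error is irrelevant here because the errors are squared.
    errors = [yp - yt for yt, yp in zip(y_train, y_pred)]
    return sum(e * e for e in errors) / n


sigma2 = mle_error_variance([1.0, 2.0, 3.0], [1.5, 2.0, 2.5])
```

For the sample values above the squared errors are 0.25, 0 and 0.25, so the estimate is 0.5 / 3.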

Furthermore, the predictive distribution $P(\hat{y} \mid N_{m,d})$ of the response $\hat{y}_{m,d}$ under ANN $N_{m,d}$ is created by including the prediction error obtained in the previous step in the prediction $\hat{y}_{m,d}$ made by $N_{m,d}$. This predictive distribution is defined by the following equation:

$$
P(\hat{y}_{m,d} \mid N_{m,d}) = D_{\text{train}}(y) + \varepsilon^{z}_{m,d}. \tag{6.14}
$$

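Lastly, assuming that the residuals between the training data $D_{\text{train}}(x, y)$ and the ANN $N_{m,d}$ output $\hat{y}_{m,d}$ are normally and independently distributed with a mean of zero and constant variance $\sigma^{2}_{m,d}$, the likelihood function $P(D_{\text{train}}(x, y) \mid N_{m,d})$ can be expressed as:

$$
P(D_{\text{train}}(x, y) \mid N_{m,d}) \approx \frac{1}{\sqrt{2\pi\sigma^{2}_{m,d}}}\, \frac{1}{N} \sum_{z=1}^{N} \exp\left\{ -\frac{\left[y^{z} - \hat{y}^{z}_{m,d}\right]^{2}}{2\sigma^{2}_{m,d}} \right\}. \tag{6.15}
$$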
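A minimal sketch of Eq. (6.15) is shown here, using the variance estimate of Eq. (6.13). The function name and the two sets of predictions are illustrative assumptions.

```python
import math


def ann_likelihood(y_train, y_pred):
    """Approximate likelihood P(D_train | N_m,d) of one ANN, following Eq. (6.15).

    The error variance is the maximum likelihood estimate of Eq. (6.13).
    """
    n = len(y_train)
    squared = [(yt - yp) ** 2 for yt, yp in zip(y_train, y_pred)]
    variance = sum(squared) / n
    if variance == 0.0:
        raise ValueError("zero error variance: the Gaussian density is undefined")
    normalizer = 1.0 / math.sqrt(2.0 * math.pi * variance)
    # Average of the Gaussian kernel values over the N training points.
    return normalizer * sum(math.exp(-s / (2.0 * variance)) for s in squared) / n


y = [1.0, 2.0, 3.0, 4.0]
close_fit = ann_likelihood(y, [1.1, 1.9, 3.1, 3.9])  # errors of magnitude 0.1
loose_fit = ann_likelihood(y, [1.5, 1.5, 3.5, 3.5])  # errors of magnitude 0.5
```

In this example the ANN with the smaller errors receives the larger likelihood, so it also receives the larger posterior probability under a uniform prior.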

6.2.2.1 Robust Prediction from Artificial Neural Networks

Thereafter, to obtain a robust prediction $y^{d}_{\text{robust}}$, $d = 1, 2, \ldots, D$, from the robust network (i.e., the set of identical networks in the $d$th column of $A_{m,d}$), the predictions made by all the identical networks in the $d$th column of $A_{m,d}$ are combined using the adjustment factor approach introduced in Chapter 5. Hence, the robust prediction of the ANNs in $A_{m,d}$ is expressed as:

$$
y^{d}_{\text{robust}} = \hat{y}^{d*} + A^{d}_{f}, \qquad d = 1, 2, \ldots, D, \quad m = 1, 2, \ldots, M, \tag{6.16}
$$

where $\hat{y}^{d*}$ represents the point estimate of the best ANN in the $d$th column, characterized [...] from the identical networks in the $d$th column of $A_{m,d}$. Since the adjustment factor $A^{d}_{f}$ is assumed to be normally distributed, its expected value $E[\cdot]$ and variance $\mathrm{Var}[\cdot]$ are given by:

$$
E[A^{d}_{f}] = \sum_{m=1}^{M} P(N_{m,d} \mid D_{\text{train}})\,(\hat{y}_{m,d} - \hat{y}^{d*}), \tag{6.17}
$$

for $d = 1, 2, \ldots, D$ and $m = 1, 2, \ldots, M$, and

$$
\mathrm{Var}[A^{d}_{f}] = \sum_{m=1}^{M} P(N_{m,d} \mid D_{\text{train}})\,\left(\hat{y}_{m,d} - E[y^{d}_{\text{robust}}]\right)^{2}. \tag{6.18}
$$

Similarly, the expected value and variance of the robust prediction $y^{d}_{\text{robust}}$ can be estimated from:

$$
E[y^{d}_{\text{robust}}] = \hat{y}^{d*} + E[A^{d}_{f}], \tag{6.19}
$$

$$
\mathrm{Var}[y^{d}_{\text{robust}}] = \mathrm{Var}[A^{d}_{f}], \tag{6.20}
$$

where $E[A^{d}_{f}]$ and $\mathrm{Var}[A^{d}_{f}]$ represent the expected value and variance of the adjustment factor, and $E[y^{d}_{\text{robust}}]$ and $\mathrm{Var}[y^{d}_{\text{robust}}]$ represent the expected value and variance of the robust estimate.
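The adjustment factor calculation for one column at a single input point can be sketched as follows. Two things here are assumptions rather than statements from the text: the "best" ANN is taken to be the one with the highest posterior probability, and the function name and numerical values are purely illustrative.

```python
def robust_prediction(predictions, posteriors):
    """Mean and variance of the robust prediction for one column of the matrix.

    predictions: point predictions of the M identical-architecture ANNs.
    posteriors:  posterior probabilities P(N_m,d | D_train), summing to one.
    Assumption: the best ANN is the one with the highest posterior probability.
    """
    best = max(range(len(predictions)), key=lambda m: posteriors[m])
    y_star = predictions[best]
    # Expected adjustment factor: posterior-weighted deviation from the best ANN.
    mean_af = sum(p * (y - y_star) for p, y in zip(posteriors, predictions))
    mean_robust = y_star + mean_af
    # Variance of the adjustment factor about the robust mean.
    var_af = sum(p * (y - mean_robust) ** 2 for p, y in zip(posteriors, predictions))
    # The robust prediction inherits the variance of the adjustment factor.
    return mean_robust, var_af


mean, variance = robust_prediction([1.0, 2.0, 3.0], [0.2, 0.5, 0.3])
```

For these values the best prediction is 2.0, the expected adjustment factor is 0.1, and the robust prediction has mean 2.1 and variance 0.49. When the posteriors sum to one, the robust mean coincides with the posterior-weighted average of the individual predictions.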

6.2.2.2 Confidence Interval for Robust Neural Network Prediction

Next, to quantify the uncertainty in the robust prediction $y^{d}_{\text{robust}}$ due to model uncertainty, confidence intervals are established. In particular, the 5th and 95th percentiles of the robust prediction are used to quantify the model uncertainty. In theory, this interval is likely to contain the true value. As the model uncertainty is assumed to be normally distributed, the confidence limits for each respective ANN are given as:

$$
\overline{y}^{d}_{\text{robust}} = E[y^{d}_{\text{robust}}] + a^{*} \sqrt{\mathrm{Var}[y^{d}_{\text{robust}}]}, \tag{6.21}
$$

and

$$
\underline{y}^{d}_{\text{robust}} = E[y^{d}_{\text{robust}}] - a^{*} \sqrt{\mathrm{Var}[y^{d}_{\text{robust}}]}, \tag{6.22}
$$

where $\overline{y}^{d}_{\text{robust}}$ and $\underline{y}^{d}_{\text{robust}}$ represent the upper and lower confidence limits of the robust estimate from $N^{d}_{\text{robust}}$, and $a^{*}$ represents the upper critical value of the Gaussian [...]
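The two limits can be computed with the Python standard library, as in the sketch below. It assumes that $a^{*}$ is the 95th percentile of the standard normal distribution, consistent with the 5th and 95th percentiles named in the text; the function name and the input values are illustrative.

```python
import math
from statistics import NormalDist


def confidence_limits(mean, variance, upper_percentile=0.95):
    """Lower and upper limits of the robust prediction, following Eqs. (6.21)-(6.22).

    Assumption: a* is the standard normal quantile at `upper_percentile`.
    """
    a_star = NormalDist().inv_cdf(upper_percentile)
    half_width = a_star * math.sqrt(variance)
    return mean - half_width, mean + half_width


lower, upper = confidence_limits(2.1, 0.49)
```

The interval is symmetric about the expected robust prediction, and its width grows with the spread of the predictions across the identical networks.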

6.2.3 Model Averaging for the Ensemble of Robust Neural

In document Robust Surrogate Models for Uncertainty Quantification and Nuclear Engineering Applications (Page 82-86)