The specific methodology used to train and combine individual RNNs into an ensemble is adapted from the model proposed by Jin and Sendhoff [172], in which two key questions are asked: has the training converged, and has the required prediction accuracy for the performance measure been met? Answering these questions should ensure that the model has been trained correctly and can confidently predict the performance measure using minimal partially converged CFD data.
With respect to the first question, when training a neural network it is important that the training data is used to create a model that learns the underlying function and can predict future data points. When a model only learns the characteristics of the training data and not those of the underlying function, it is said to be over-fitting or over-training [45]. It is therefore important to train the model for a suitable number of epochs and to stop training at a point when the model predictions are accurate. Both early-stopping and regularisation techniques have been adopted for neural network training to achieve this.
Early-stopping is a technique used to avoid over-fitting [45, 65]. To do this, the prediction or generalisation performance of the model is periodically assessed during training, using a validation set of data that has not previously been seen by the network. This is known as cross-validation. Both the training and validation errors are evaluated by pausing the training process and using the model that has been created up to that epoch to evaluate the error between a set of known target values and the predicted values.
Figure 3.1 illustrates the theory behind the early-stopping methodology and shows that initially both the training and validation errors will decrease with increasing numbers of training epochs. The training error is likely to continue to decrease with increasing numbers of epochs, but training should be stopped when the validation error is at a minimum [45].
[Figure: training error and validation error plotted against epochs, with the early-stopping point and the strip length marked.]
Figure 3.1: Example of Early-stopping Methodology
It is at the minimum of the validation error that the model starts to over-fit, and any training after this point only learns specific information in the training data that is not relevant to the underlying function. Therefore, it is important to stop the training at this point to ensure that the model created is able to satisfactorily predict future data points.
Regularisation is an alternative to early stopping when trying to avoid over-fitting. Instead of only monitoring the network's performance on different sets of data, a trade-off is implemented between the network's performance and another term that imposes prior knowledge on the model, known as the complexity penalty [45]. A regularisation parameter is used to adjust the relative importance of the two terms.
The sum of the squared or absolute weights has been used as the penalty term in an attempt to drive some of the weight values close to zero [45]. Minimising the number of connections in the network has also been investigated [88] by incorporating a multi-objective optimisation approach.
Due to the training method used and the relative simplicity of its implementation, an early-stopping method was used in this work. The specific method monitors the number of increases in the validation error during a certain number of training epochs. This number of epochs is known as the strip length and an example is shown in Fig. 3.1. Training continues as long as the validation error is decreasing and only stops when there has been a certain number of increases within the strip. This means that training continues if the validation error increases and then falls again, but stops if it continually increases. This should avoid stopping at local minima in the error surface.
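The strip-based stopping rule described above can be sketched as follows. This is a minimal illustration rather than the implementation used in this work: the class name, the parameter names `strip_length` and `max_increases`, and their default values are assumptions, and the strip is taken here to be the most recent epoch-to-epoch changes in the validation error.

```python
from collections import deque


class StripEarlyStopping:
    """Stop when the validation error has risen too often within one strip.

    strip_length and max_increases are hypothetical parameter names; the
    values used in the source are not stated, so the defaults are placeholders.
    """

    def __init__(self, strip_length=5, max_increases=3):
        self.max_increases = max_increases
        # One flag per epoch-to-epoch change; only the latest strip is kept.
        self.increases = deque(maxlen=strip_length)
        self.previous_error = None

    def update(self, validation_error):
        """Record one epoch's validation error; return True to stop training."""
        if self.previous_error is not None:
            self.increases.append(validation_error > self.previous_error)
        self.previous_error = validation_error
        return sum(self.increases) >= self.max_increases
```

With this rule a single rise followed by a fall does not end training, whereas a sustained rise fills the strip with increases and triggers the stop.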
Using Fig. 2.6a, introduced in Section 2.2.2, as a specific example of a typical CFD convergence history, Fig. 3.2 illustrates the parts of the CFD convergence data that are used for training and validation of the networks, as well as the final performance value that the networks are predicting.
[Figure: convergence history of CL against iteration, with the training data, validation data, target value and variance range locations marked.]
Figure 3.2: Breakdown of CFD Convergence Data - Heterogeneous Ensemble
As previously discussed, the training and validation data sets are used to construct the networks and the CFD data highlighted in Fig. 3.2 is used for this purpose. During training and validation, each network is used as a single-step predictor with all highlighted data used. This means that the errors calculated during training and validation are errors across the complete batch of data. A mean squared error (MSE) has been used in this work.
Once training has been stopped by the early-stopping method, the models are used recursively to predict up to the performance measure at the final flow iteration, referred to as the “Target Value” in Fig. 3.2. This is achieved by feeding back the predicted value at each flow iteration and using it to form part of the network's input. The prediction at this final flow iteration is the output of each network, and once the prediction horizon has been reached the second question can be addressed. This is achieved by establishing a “level of diversity” between all of the predictions made by the different networks, which determines the ensemble's fidelity [50].
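The recursive use of a single-step predictor can be sketched as below. The function name and the `one_step_model` callable are hypothetical stand-ins for a trained network, and the sketch ignores the internal state that a recurrent network carries between steps.

```python
def predict_recursively(one_step_model, history, n_inputs, n_steps):
    """Feed each prediction back as input until n_steps values are produced.

    one_step_model is a hypothetical callable that maps the last n_inputs
    values [y(t - n_inputs + 1), ..., y(t)] to an estimate of y(t + 1).
    """
    window = list(history[-n_inputs:])
    predictions = []
    for _ in range(n_steps):
        next_value = one_step_model(window)
        predictions.append(next_value)
        # Slide the window: drop the oldest value, append the prediction.
        window = window[1:] + [next_value]
    return predictions
```

The last element of the returned list corresponds to the prediction at the end of the horizon, which in this chapter is the final flow iteration.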
By considering all of the different networks that have been created, the level of diversity in the ensemble is determined by calculating the variance of the predicted CFD performance measure at specific numbers of flow iterations. The variance was selected as it gives an indication of the difference between the various networks' predictions and how far they are distributed about the mean.
Initially a single value of variance was investigated. However, stipulating only a small variance may yield accurate ensemble members, but it also indicates that there may be little diversity between the ensemble members' predictions of the converged performance measure. Conversely, a variance that is too high indicates that there is diversity among the ensemble members, but this may also correspond to inaccuracy in the predicted values. Therefore a range of variance was established so that the ensemble members could provide predictions that are both diverse and accurate.
Because CFD data converges to a certain point and should be stable for a number of iterations before the final value, the predicted values should be within the set variance range at three flow iteration points. These points are the final converged iteration, as well as five and ten iterations before the final flow iteration. These three locations are highlighted in Fig. 3.2 and the elliptical shapes are a graphical representation of the spread of the predicted values that need to be within the variance range.
Once the prediction to the final flow iteration has been made, and if the ensemble's predictions are within the variance range at the three flow iteration locations, it is assumed that there is consistency in the predicted value and that convergence has occurred. The output of the ensemble is then determined by taking the simple average of the individual networks' outputs at the final flow iteration. If any of the variances is out of range, more data is added to the training data set. A sliding split of data is used, which means that ten additional training samples are added to the training data set and the validation data slide along the convergence history by ten points. This is shown in Fig. 3.3, where Fig. 3.3a illustrates the split of data initially used for training and validation and Fig. 3.3b illustrates the split of data after the additional ten data points have been added to the training data set.
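The variance check and the sliding split can be sketched together as follows. The function names and the bounds `var_min` and `var_max` are hypothetical, the population variance is an assumption (the text does not say whether the population or sample variance was used), and the validation set is assumed to keep its size as it slides.

```python
from statistics import mean, pvariance


def ensemble_output(member_predictions, var_min, var_max, offsets=(0, 5, 10)):
    """Return the ensemble mean at the final iteration, or None if the check fails.

    member_predictions holds one predicted convergence history per network,
    each ending at the final flow iteration. var_min and var_max are
    hypothetical names for the bounds of the variance range.
    """
    for offset in offsets:
        # Predictions of all members at `offset` iterations before the end.
        values = [series[-1 - offset] for series in member_predictions]
        if not var_min <= pvariance(values) <= var_max:
            return None
    return mean(series[-1] for series in member_predictions)


def slide_split(n_train, n_validation, step=10):
    """Return (training, validation) index ranges after adding `step` samples."""
    new_train = n_train + step
    return range(0, new_train), range(new_train, new_train + n_validation)
```

A return value of `None` signals that more training data is required, at which point `slide_split` gives the index ranges for the next attempt.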
[Figure: two panels showing the convergence history of CL against iteration, with the training data, validation data and target value marked. (a) Original Number of Training Data. (b) Increased Number of Training Data.]
Figure 3.3: Partitioning of Data
Algorithm 1 summarises the proposed framework for training and prediction of the heterogeneous ensemble. Ten additional data points are added to the training data set each time the variance range is not met, until there is no more data available. Individual networks are initially created and these are referred to as the individual ensemble members. Each network is trained independently and the networks are then combined to create the final ensemble prediction. Initially the final ensemble output is the simple average of each ensemble member's prediction at the final flow iteration.
Algorithm 1: Proposed Framework for Heterogeneous Ensemble
while Data points ≠ max do
    Create individual ensemble members
    Create training and validation data sets
    for Max number of training epochs do
        for all Ensemble members do
            if No estop flag then
                Train network
                Calculate training error (MSE)
                Calculate validation error (MSE)
                Update estop
                if estop met then
                    Assign estop flag
        if Each member assigned estop flag then
            Stop training
    for all Ensemble members do
        Predict convergence at final flow iteration
    Calculate variance of predicted values at three flow iteration locations
    if Variance range met at each location then
        Output mean of predicted values at final flow iteration
    else
        Add 10 data points to training data set
The C++ Shark Machine Learning Library [173] has been used to implement and train the individual RNNs. The topology of each RNN can be thought of as a FFNN with an additional memory layer [174, 175]. The states of all the neurons from the previous time step are stored in a memory layer and the FFNN receives additional activation from this layer through weight connections. A single memory layer refers to a network that is unfolded one state back in time during training and prediction. The Shark library allows different network structures to be defined, including the number of hidden neurons and memory layers, as well as the connections between all neurons.
To create the heterogeneous ensemble members, the CFD data is partitioned into three different input structures and hence network structures. Each network has one output neuron that predicts the same value, y(t + 1), but each network uses different input values, i.e. y(t − 2), y(t − 1) and y(t) for networks with three inputs, y(t − 1) and y(t) for networks with two inputs and y(t) for networks with one input. This means the reconstructed data is represented by a normalised embedding delay, τ = 1, and an embedding dimension, D = 3, 2 or 1.
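The three input structures correspond to a delay embedding of the convergence history, which can be sketched as below. The function name `embed` is hypothetical and the sketch operates on a plain list of values rather than on the data structures of the Shark library.

```python
def embed(series, dimension, delay=1):
    """Build (inputs, target) pairs for one-step-ahead prediction.

    Each input is [y(t - (D - 1) * delay), ..., y(t - delay), y(t)] and the
    target is y(t + 1), where D is the embedding dimension.
    """
    first = (dimension - 1) * delay
    samples = []
    for t in range(first, len(series) - 1):
        inputs = [series[t - k * delay] for k in range(dimension - 1, -1, -1)]
        samples.append((inputs, series[t + 1]))
    return samples
```

Calling `embed(history, 3)`, `embed(history, 2)` and `embed(history, 1)` on the same history yields the three input structures, with a larger dimension giving fewer samples because more leading values are needed for the first input.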
As previously mentioned, all data points are presented to the RNN during training and validation. This is referred to as batch learning and this technique requires a warm-up-length of data to be taken into consideration. The warm-up-length is used to initialise the internal states of the neurons, which means that the network can converge to a “normal” dynamic state, allowing for new data to be predicted [111, 173]. It is not considered when evaluating the training and validation errors, or when using the model to predict future values.
The Improved Resilient backPROPagation Plus (IRPropPlus) algorithm has been selected as the learning algorithm, due to its superior learning rate compared to other RProp variants and other gradient based learning algorithms [176, 177]. The IRPropPlus learning algorithm is an extension of the RPropPlus algorithm, which in turn was modified from the RProp learning algorithm.
All three algorithms use the sign of the partial derivatives instead of the absolute values to update the connection weights. The RProp algorithm updates the step size for each connection, which in turn is used to adjust the weight values. If there is no change in the sign a larger step size is used, whereas for a change in the sign the step is much smaller. This allows for finer refinement when promising areas of the error surface have been found.
Instead of adapting the weights using an updated step size when there is a change in sign, the RPropPlus algorithm incorporates weight back-tracking (retraction). This reverts the previous weight change, returning the weight to its value at the previous iteration. The partial derivative is also set to zero, to avoid a further update of the step size in the next epoch.
The IRPropPlus algorithm also incorporates the retraction of the weight value, but instead of being based only on whether the sign has changed, there also has to be an increase in the error value. By recalling the previous step's error, global information about the learning process is incorporated into the algorithm, which allows the minimum of the error surface to be found more easily. A change in the sign of the partial derivative only indicates that a minimum was skipped, but by including the previous error it is known whether the search is moving along the error surface in the correct direction. The IRPropPlus algorithm has been shown to be successful on both recurrent and feedforward neural networks [177].
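The update rule described above can be sketched for a single weight as follows. This is an illustration of the published algorithm and not the Shark library's code: the function name, the layout of `state` and the default values of `eta_plus`, `eta_minus`, `delta_min` and `delta_max` are assumptions based on commonly quoted settings.

```python
def sign(x):
    """Return -1, 0 or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def irprop_plus_step(w, grad, state, error, prev_error,
                     eta_plus=1.2, eta_minus=0.5,
                     delta_min=1e-6, delta_max=50.0):
    """One iRprop+ update of a single weight; returns (new_w, new_state).

    state = (prev_grad, delta, prev_dw): the previous partial derivative,
    the current step size and the previous weight change.
    """
    prev_grad, delta, prev_dw = state
    product = prev_grad * grad
    if product > 0:
        # Same sign: enlarge the step size and move against the gradient.
        delta = min(delta * eta_plus, delta_max)
        dw = -sign(grad) * delta
        w += dw
    elif product < 0:
        # Sign change: a minimum was skipped, so shrink the step size.
        delta = max(delta * eta_minus, delta_min)
        if error > prev_error:
            # Retract the previous weight change only if the error grew.
            w -= prev_dw
        dw = 0.0
        grad = 0.0  # prevents a second step-size change in the next epoch
    else:
        dw = -sign(grad) * delta
        w += dw
    return w, (grad, delta, dw)
```

Removing the `error > prev_error` test, so that the retraction happens on every sign change, gives the RPropPlus behaviour described earlier.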
There are three networks of each input structure, making a total of nine ensemble members. The networks used in this chapter use the library's inbuilt network settings and all neurons use the non-linear sigmoid function tanh(), which allows a non-linear system to be modelled. It does mean that all data needs to be normalised, as the function outputs values between -1 and 1. Different initial weights distinguish ensemble members of the same input structure and the weights of each connection are randomly initialised between -0.1 and 0.1.
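The normalisation and weight initialisation can be sketched as below. The text does not state which normalisation was applied, so the linear min-max scaling here is an assumption, as are the function names and the use of a seed to distinguish ensemble members.

```python
import random


def normalise(values, low=-1.0, high=1.0):
    """Scale values linearly into [low, high] (assumed min-max scaling)."""
    v_min, v_max = min(values), max(values)
    if v_max == v_min:
        raise ValueError("cannot scale a constant series")
    scale = (high - low) / (v_max - v_min)
    return [low + (v - v_min) * scale for v in values]


def initial_weights(n_weights, seed, limit=0.1):
    """Draw n_weights values uniformly from [-limit, limit]."""
    rng = random.Random(seed)  # a different seed gives a different member
    return [rng.uniform(-limit, limit) for _ in range(n_weights)]
```

Members that share an input structure would then differ only in the seed passed to `initial_weights`.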
Although the default settings have been used, there are still a number of parameters that need to be determined, including the number of hidden neurons, memory layers and connections. The number of training epochs and parameters for the early-stopping criteria also need to be determined, as well as a suitable variance range.
The following section presents the data used to evaluate the developed methodology, as well as descriptions of how the various parameters were determined.