Training phase - An artificial neural network approach for cost estimation of engineering servi

To improve the performance of the model, an optimization strategy was developed. This strategy consists of three iterative processes that were carried out sequentially. The first iterative process determined the best performing training algorithm and the best model based on the complete dataset. The second iterative process determined the best performing input variables, and the dataset was altered accordingly. The last iterative process consisted of finding the range of proposal values within which the model performed best. To develop and train an artificial neural network, a MATLAB script needed to be established. The initial script was developed using the Neural Network Toolbox (Beale, Hagan, & Demuth, 2018). Subsequently, the script was extended and altered to facilitate the optimization strategy. With this script the network is trained, analysed and optimized. First, the optimization strategy and its three iterative processes are explained.

3.2.1 Optimization strategy

First iterative process

The first iterative process determines the best performing training algorithm and the best model based on the complete dataset (see Figure 3-2 below). In this process, three alternative training algorithms are tested: Levenberg-Marquardt, Bayesian Regularization and Resilient backpropagation. These training algorithms are described in the literature review in chapter 2.9. The process starts with importing the total dataset. Subsequently, a training algorithm is selected. The training then enters a network architecture optimization module, which is illustrated on the right-hand side of Figure 3-2. Here, the growing method is used: training starts with a single hidden neuron, and one neuron is added to the hidden layer every iteration. Training is stopped as soon as significant overfitting emerges. For every architecture, the performance of the network is retained. Once the network architecture is optimized, the next training algorithm is selected, until all training algorithms have been tested. After the first iterative process is finished, the best training algorithm and the best network architecture for the total dataset have been found.
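The growing method and the loop over training algorithms can be sketched as follows. This is a minimal illustration in Python rather than the MATLAB script used in this study. The callback `evaluate`, the threshold `overfit_ratio` and the rule used to detect "significant overfitting" are assumptions made for the example and are not taken from the original script.

```python
from typing import Callable, Dict, List, Tuple

History = List[Tuple[int, float, float]]


def grow_hidden_layer(
    evaluate: Callable[[int], Tuple[float, float]],
    max_neurons: int = 20,
    overfit_ratio: float = 1.5,
) -> Tuple[int, History]:
    """Grow the hidden layer one neuron at a time.

    `evaluate(n)` is a hypothetical callback that trains a network with
    `n` hidden neurons and returns (training_error, validation_error).
    Growth stops once the validation error exceeds the best validation
    error seen so far by the factor `overfit_ratio`, an assumed stand-in
    for "significant overfitting".
    """
    history: History = []
    best_n, best_val = 1, float("inf")
    for n in range(1, max_neurons + 1):
        train_err, val_err = evaluate(n)
        history.append((n, train_err, val_err))  # retain every architecture
        if val_err < best_val:
            best_n, best_val = n, val_err
        elif val_err > overfit_ratio * best_val:
            break  # validation error has degraded markedly: stop growing
    return best_n, history


def select_algorithm(
    evaluators: Dict[str, Callable[[int], Tuple[float, float]]],
    **kwargs,
) -> Tuple[str, Dict[str, Tuple[int, float]]]:
    """Run the growing method per training algorithm and keep the best one."""
    results: Dict[str, Tuple[int, float]] = {}
    for name, evaluate in evaluators.items():
        best_n, history = grow_hidden_layer(evaluate, **kwargs)
        best_val = min(val for _, _, val in history)
        results[name] = (best_n, best_val)
    best_name = min(results, key=lambda k: results[k][1])
    return best_name, results
```

In practice each entry of `evaluators` would wrap one training algorithm (for example Levenberg-Marquardt), so that the outer loop mirrors the left-hand side of Figure 3-2 and `grow_hidden_layer` mirrors the architecture optimization module on the right-hand side.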


Second iterative process

After the best training algorithm is found, the network that obtained the highest performance is analysed to determine the relative importance of the input variables. This is the start of the second iterative process (see Figure 3-3 below). To find the simplest model that explains the data, it can be helpful to eliminate redundant or irrelevant input variables. By calculating the relative importance of the input variables of the network with the highest performance, redundant input variables can be removed and generalization can improve. The Connection Weights Algorithm (Olden & Jackson, 2002) can be used to calculate the relative importance of a given input variable of a neural network; it is defined in Equation 12 below. This approach is based on the final weights of the network obtained by training it (Ibrahim, 2013; Janssen, 2018).

\[ RI_x = \sum_{y=1}^{m} W_{xy} W_{yz} \]

(Equation 12)
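Equation 12 can be illustrated with a short Python sketch. It assumes that x indexes the input neurons, y the m hidden neurons and z a single output neuron, which is the usual reading of the Connection Weights Algorithm. The weight values are hypothetical, and ranking the inputs by the magnitude of RI is an assumption made for the example.

```python
from typing import List


def connection_weights(
    w_input_hidden: List[List[float]],
    w_hidden_output: List[float],
) -> List[float]:
    """Relative importance RI_x = sum over y of W_xy * W_yz for each input x.

    w_input_hidden[x][y] is the weight from input x to hidden neuron y;
    w_hidden_output[y] is the weight from hidden neuron y to the single
    output neuron z.
    """
    return [
        sum(w_xy * w_yz for w_xy, w_yz in zip(row, w_hidden_output))
        for row in w_input_hidden
    ]


# Hypothetical weights: 3 inputs, 2 hidden neurons (illustrative values only).
W_xy = [[0.5, -1.0], [2.0, 1.0], [0.0, 0.5]]
W_yz = [1.0, -2.0]

ri = connection_weights(W_xy, W_yz)

# Assumed ranking rule: larger |RI| means a more important input.
ranking = sorted(range(len(ri)), key=lambda x: abs(ri[x]), reverse=True)
```

Note that the second input has large individual weights but a net importance of zero, because its contributions through the two hidden neurons cancel out. The sign of RI indicates the direction of the effect on the output.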

The next step is to eliminate the variables that have a low impact and to retrain the network with the training algorithm that achieved the highest performance in the first iterative process. The elimination is done by excluding one variable at a time, either until one variable is left or until excluding a variable causes a significant drop in performance, at which point the process is stopped. Because the number of input neurons decreases, the number of neurons in the hidden layer may also need to change. Therefore, the growing method is used again. Eventually, it becomes clear which model is the simplest one that explains the data. This model has the best training algorithm, the most relevant input variables and the best fitting architecture.
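The elimination loop described above can be sketched as follows. The callbacks `importance` and `performance` are hypothetical placeholders for, respectively, the relative importance analysis of the trained network and a retraining run that includes the growing method. The threshold `max_drop` is an assumed way of quantifying a "significant drop in performance".

```python
from typing import Callable, Dict, List, Sequence, Tuple


def eliminate_inputs(
    variables: Sequence[str],
    importance: Callable[[List[str]], Dict[str, float]],
    performance: Callable[[List[str]], float],
    max_drop: float = 0.05,
) -> Tuple[List[str], List[Tuple[Tuple[str, ...], float]]]:
    """Remove the least important input one at a time.

    `performance(active)` returns a score where higher is better.
    The loop stops when one variable is left or when removing the next
    variable lowers the score by more than `max_drop`.
    """
    active = list(variables)
    best_score = performance(active)
    history = [(tuple(active), best_score)]
    while len(active) > 1:
        ri = importance(active)
        weakest = min(active, key=lambda v: abs(ri[v]))
        candidate = [v for v in active if v != weakest]
        score = performance(candidate)
        if best_score - score > max_drop:
            break  # significant drop: keep the previous variable set
        active = candidate
        best_score = max(best_score, score)
        history.append((tuple(active), score))
    return active, history
```

The returned `history` keeps the score of every accepted variable set, so the trade-off between model simplicity and performance can be inspected afterwards.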

Figure 3-3. Optimization strategy: second iterative process

In addition to the connection weights algorithm, two other methods are used to determine the relative importance of the input variables, namely multiple linear regression (MLR) analysis and expert opinion. MLR analysis is a suitable method to identify which variables have a significant influence on the proposal price. It can help determine whether there is a linear association or causation between the independent variables and the proposal price. First, the relative importance of each independent variable is determined by the drop in R² when that variable is deleted from the sample. R² is the coefficient of determination and shows the percentage of variation in the dependent variable that is explained by all the independent variables together. The larger the drop in R² when a variable is removed from the sample, the more important that variable is assumed to be. In addition, the data is checked for multicollinearity. This occurs when two or more independent variables are highly correlated with each other. When collinearity is present, it is hard to find out whether one variable or the other causes an effect (van der Steen, 2018). Therefore, when there is multicollinearity in the data, some variables may be redundant and can be removed.

Finally, the last method to determine the relative importance of the input variables is expert opinion. As described in the pre-training phase, the variables were ranked by experts. This ranking is also used to determine the relative importance of the input variables. Neural networks are also developed based on the relative importance of the input variables as determined by the MLR analysis and by expert opinion. The performance of these neural networks can be compared with that of the neural network based on the results of the connection weights algorithm. This comparison shows which method is best suited to determining the most important variables for a neural network.
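The R² criterion used in the MLR analysis can be written out as follows. The notation is introduced here for illustration: y_i are the observed proposal prices, \hat{y}_i the fitted values, \bar{y} the sample mean, and R²_{(-j)} the coefficient of determination of the regression refitted without variable j.

```latex
R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2},
\qquad
\Delta R^2_j = R^2_{\text{full}} - R^2_{(-j)}
```

A variable with a large \Delta R^2_j is assumed to be important, whereas a value close to zero suggests that the variable adds little beyond what the remaining variables already explain, which is also what one would expect for a variable that is strongly collinear with the others.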

Third iterative process

Lastly, neural networks can interpolate accurately throughout the range of the data on which they were trained; however, extrapolation outside the range of the training set is of lower quality. There is no way to prevent extrapolation errors unless the data used to train the network covers all regions of the input space in which the network is used. In addition, a relatively small number of data points in a specific region can also lead to poor interpolation, simply because there is not enough data to train the network properly for that region. To prevent poor results from extrapolation, the network should not be used for project values outside the range of the dataset on which it was trained.
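A simple guard against extrapolation could look like the sketch below. The function names and the `predict` callback are hypothetical. The check covers only the project value, whereas a full check would cover every input variable.

```python
from typing import Callable, Sequence


def within_training_range(value: float, training_values: Sequence[float]) -> bool:
    """Return True if `value` lies inside the range seen during training."""
    return min(training_values) <= value <= max(training_values)


def guarded_predict(
    predict: Callable[[float], float],
    value: float,
    training_values: Sequence[float],
) -> float:
    """Call `predict` only for values inside the training range."""
    if not within_training_range(value, training_values):
        raise ValueError(
            f"value {value} lies outside the training range "
            f"[{min(training_values)}, {max(training_values)}]"
        )
    return predict(value)
```

This guard does not address sparsely populated regions inside the range, where interpolation can also be poor.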

In addition, certain project value ranges with relatively few examples can be excluded. This leads to a smaller range of projects for which the neural network can be used, but it could lead to a higher performance of the model when interpolating. This is done in the third and final iterative process (see Figure 3-4 below), in which a selection of the project value range is made. In this process, three selections of data regions were made based on the results of the second iterative process. First of all, it was decided to proceed with training the 5 network architectures that came out best in the second iterative process. However, when a data selection is made, the complexity of the underlying function of the data could differ from that of the full database. Therefore, the growing method was used again and the number of hidden neurons was varied for every network in each training set. The full dataset is divided into 11 project value range categories. The total value range varies from €2.000 to €10.000.000. In Table 3-1 below, the selected project value ranges are illustrated.

Figure 3-4. Optimization strategy: third iterative process

The first selection excludes projects that have a value of more than €1.000.000. Therefore, the sample in the first data selection contains projects with a value ranging from €2.000 to €1.000.000. The number of projects that have a value of more than €1.000.000 is relatively small. In total, the final dataset contains 14 projects that have a value of more than €1.000.000. Bigger projects can lead to bigger errors in the performance function MSE. In addition, when the neural network is trained on larger projects, this could potentially lead to lower performance when testing the neural network on smaller projects. Based on these considerations, the first data selection is made. In total, 118 data points remain in the sample.
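The effect of project size on the MSE can be illustrated with a small numerical example. The project values and the 10% over-estimate are hypothetical and serve only to show that an identical relative error yields a far larger squared error for large projects.

```python
from typing import Sequence


def mse(targets: Sequence[float], predictions: Sequence[float]) -> float:
    """Mean squared error between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(targets, predictions)) / len(targets)


# Hypothetical project values in euro.
small_projects = [20_000.0, 30_000.0]
large_projects = [2_000_000.0, 3_000_000.0]

# Every prediction over-estimates its target by the same 10%.
small_mse = mse(small_projects, [v * 1.1 for v in small_projects])
large_mse = mse(large_projects, [v * 1.1 for v in large_projects])

# The large projects are 100 times bigger, so their MSE is about
# 100 ** 2 = 10 000 times larger for the same relative error.
ratio = large_mse / small_mse
```

Because the squared error scales with the square of the project value, a few large projects can dominate the performance function during training.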

Subsequently, the first data selection is further reduced to 60 data points in the second data selection. In this case, the 58 projects that have a value of less than €50.000 are also excluded from the sample. The purpose of the neural network is to predict proposal values of €50.000 and above. Therefore, the projects of lower value were left out of the sample. In addition, when lower values are included in the sample, the testing errors for these lower values could be relatively large, because the training was also carried out for larger projects. When the network is trained on larger projects, the error could potentially become larger for smaller projects.

The last selection was made to increase the number of data points to 70. Neural networks are at the mercy of the data they are exposed to, and more quality data usually leads to better results. However, to still train the neural network on projects similar to those in the second data selection, two categories were switched. The third and final data selection contains projects with a value ranging from €20.000 to €500.000. In this selection, the differences in value between the larger projects and the smaller projects become smaller. This could potentially lead to smaller relative errors in the projects.

Table 3-1. Data selection: project value range

Project value range (€)        | 1st selection | 2nd selection | 3rd selection | Occurrence in sample
>= 2.000, < 5.000              | x             | -             | -             | 2
>= 5.000, < 10.000             | x             | -             | -             | 12
>= 10.000, < 20.000            | x             | -             | -             | 23
>= 20.000, < 50.000            | x             | -             | x             | 21
>= 50.000, < 100.000           | x             | x             | x             | 19
>= 100.000, < 200.000          | x             | x             | x             | 15
>= 200.000, < 500.000          | x             | x             | x             | 15
>= 500.000, < 1.000.000        | x             | x             | -             | 11
>= 1.000.000, < 2.000.000      | -             | -             | -             | 8
>= 2.000.000, < 5.000.000      | -             | -             | -             | 5
>= 5.000.000, < 10.000.000     | -             | -             | -             | 1

x = range included in the selection; - = range excluded from the selection.
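The occurrences in Table 3-1 can be checked against the sample sizes quoted in the text. The sketch below encodes the 11 categories with their counts and sums the categories that fall inside each selected value range. The helper name `count_in_range` is introduced for illustration only.

```python
from typing import List, Tuple

# (lower bound, upper bound, occurrence in sample), values in euro,
# taken from Table 3-1.
BINS: List[Tuple[int, int, int]] = [
    (2_000, 5_000, 2),
    (5_000, 10_000, 12),
    (10_000, 20_000, 23),
    (20_000, 50_000, 21),
    (50_000, 100_000, 19),
    (100_000, 200_000, 15),
    (200_000, 500_000, 15),
    (500_000, 1_000_000, 11),
    (1_000_000, 2_000_000, 8),
    (2_000_000, 5_000_000, 5),
    (5_000_000, 10_000_000, 1),
]


def count_in_range(bins: List[Tuple[int, int, int]], low: int, high: int) -> int:
    """Sum the occurrences of all categories lying fully inside [low, high]."""
    return sum(n for lo, hi, n in bins if lo >= low and hi <= high)


total = count_in_range(BINS, 2_000, 10_000_000)             # full dataset
first_selection = count_in_range(BINS, 2_000, 1_000_000)    # up to 1.000.000
second_selection = count_in_range(BINS, 50_000, 1_000_000)  # from 50.000
third_selection = count_in_range(BINS, 20_000, 500_000)     # 20.000 to 500.000
```

The sums give 132 projects in total and 118, 60 and 70 projects for the first, second and third selection respectively, which matches the numbers stated in the text.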


In document An artificial neural network approach for cost estimation of engineering services : enhancing cost estimation efficiency (Page 44-48)