Pre-training phase

The pre-training phase concerns the selection and pre-processing of data. Neural networks represent a technology that is at the mercy of the data. The training data must span the full range of the input space for which the network will be used (Hagan et al., 2014). Neural networks can interpolate accurately throughout the range of the data provided; however, extrapolation outside the range of the training set is of lower quality. To ensure that the right data is used in training the model, a selection of data is made. First, the input variables influencing the proposal price that were identified in the literature review are further developed through interviews. Based on the availability of the input variables, a survey is set up and distributed to gather supplementary data. Lastly, the data is pre-processed for an efficient training process.
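The point about interpolation versus extrapolation can be made concrete with a small check. The following is a minimal sketch, not part of the original study: it flags the inputs of a new project that fall outside the range seen in the training data. The function name and the variable names (`floor_area`, `disciplines`) are hypothetical and only serve as illustration.

```python
def out_of_range_inputs(training_rows, new_row):
    """Return the names of inputs in new_row that lie outside the training range.

    training_rows: list of dicts mapping input name -> numeric value
    new_row:       dict with the same keys for the project to be estimated
    """
    flagged = []
    for name, value in new_row.items():
        observed = [row[name] for row in training_rows]
        # Outside [min, max] of the training data means the network extrapolates
        if value < min(observed) or value > max(observed):
            flagged.append(name)
    return flagged


# Hypothetical training data and new project (illustrative values only)
training = [
    {"floor_area": 100, "disciplines": 2},
    {"floor_area": 500, "disciplines": 6},
]
print(out_of_range_inputs(training, {"floor_area": 900, "disciplines": 4}))
```

An estimate for a project flagged in this way would rely on extrapolation and should be treated with more caution.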

3.1.1 Determine input variables

To start, the factors influencing the proposal price that were identified in the literature review need to be further investigated and assessed. The literature review, carried out as desk research covering journals, papers, and essays, led to the identification of the possible variables that influence the cost of engineering services (see chapter 2.10). To verify these factors and to identify additional ones, interviews were held with 13 employees of Bilfinger Tebodin, the research client, who have experience in preparing bids for engineering services. The interviewees consisted of three project managers, five lead engineers (from different departments), two heads of department, and three tender managers. They were asked to answer six open questions intended to reveal the most important factors that influence the costs of engineering services. The following six questions were asked:

1. Can you describe the approach you would use in order to estimate the required man-hours of a project based on a Request For Quotation (RFQ)?

2. Can you describe the approach you would use in order to estimate the required man-hours of a project based on an RFQ if you only had one hour?

3. Can you describe how you get a broad view of the size of a project while reviewing an RFQ?

4. What information or elements are ideally available in an RFQ to make an estimate?

5. If you had to make a proposal without an RFQ and you could ask 5 questions to the client, what questions would you ask?

6. Can you explain what the variables are that influence the costs of engineering services in a project?

While answering these questions, none of the interviewees had seen the 14 relevant variables that were distinguished from the literature (chapter 2.10). This was done to identify missing variables and to assess the relevance of the variables from the literature. Subsequently, the interviewees were asked to rank the 14 variables from 1 to 14, where 1 was the most important variable and 14 the least important. The scores were averaged per variable to obtain the relative importance of each of the 14 variables according to expert opinion. Thereafter, a final set of input variables was determined based on the literature review and the interviews. This final set also contained qualitative variables. Because the ANN model can only handle numerical values, the qualitative variables were transformed into quantitative variables. How this is done is explained in chapter 4.
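The averaging of the expert rankings can be sketched as follows. This is a minimal illustration under stated assumptions: the three variable names and the ranks are hypothetical placeholders, not the 14 variables or the scores from the interviews.

```python
from statistics import mean


def mean_ranks(rankings):
    """Average the rank each variable received across all interviewees.

    rankings: list of dicts, one per interviewee, mapping
              variable name -> rank (1 = most important).
    Returns (variable, mean rank) pairs, most important first.
    """
    variables = rankings[0].keys()
    averages = {v: mean(r[v] for r in rankings) for v in variables}
    # A lower mean rank indicates a higher relative importance
    return sorted(averages.items(), key=lambda item: item[1])


# Hypothetical rankings by three interviewees (illustrative only)
example = [
    {"project size": 1, "project duration": 2, "client type": 3},
    {"project size": 2, "project duration": 1, "client type": 3},
    {"project size": 1, "project duration": 3, "client type": 2},
]
for variable, score in mean_ranks(example):
    print(f"{variable}: {score:.2f}")
```

Because 1 denotes the most important variable, the variable with the lowest mean rank is the one the experts consider most influential.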

3.1.2 Collecting and pre-processing data

A database was obtained from a printout of Shared Tools and was further developed. Some of the input variables were present in this database, but not all. The data that was not available in Shared Tools was gathered using an online survey. In this survey, the responsible tender manager or project manager, depending on who prepared the tender, was asked to provide the missing data. In some cases, the responsible tender manager no longer worked at Tebodin; someone who had been involved in the project and still worked at Tebodin was then asked to provide the data. The survey was set up in the SharePoint environment on the Tebodin intranet. Each project that was valid for use in the model was given a project ID number, which could be selected when filling in the survey. This allowed the survey results to be matched with the existing database. The setup of the survey can be found in Appendix A.
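Matching the survey results to the existing database on the project ID can be sketched as below. This is an illustrative sketch only: the field names (`project_id`, `hours`, `complexity`) are hypothetical, and the actual export and survey formats are not described in the text.

```python
def merge_on_project_id(database_rows, survey_rows, key="project_id"):
    """Combine database records with survey answers that share a project ID.

    Returns the merged records and the IDs for which no survey answer exists.
    """
    survey_by_id = {row[key]: row for row in survey_rows}
    merged, unmatched = [], []
    for row in database_rows:
        answers = survey_by_id.get(row[key])
        if answers is None:
            # No survey response for this project: keep track of it
            unmatched.append(row[key])
            continue
        # Survey fields supplement the fields already in the database
        merged.append({**row, **answers})
    return merged, unmatched


# Hypothetical records (illustrative values only)
database = [
    {"project_id": 101, "hours": 1200},
    {"project_id": 102, "hours": 450},
]
survey = [{"project_id": 101, "complexity": 3}]
merged, unmatched = merge_on_project_id(database, survey)
print(merged, unmatched)
```

Projects without a survey response are reported separately, so that incomplete records do not silently enter the final database.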

Lastly, a final database was set up by connecting the database with the output of the survey. After collection, the data was divided into three sets: training, validation, and testing. The training set comprises about 70% of the total data set, and the validation and testing sets each comprise about 15% (Hagan et al., 2014). It is important that each set is representative of the full data set. The simplest and most common method for data division is to select the sets at random. In addition, it is common to normalize the data before applying them to the network, in order to facilitate and enhance network training. In multilayer networks, sigmoid transfer functions are often used in the hidden layers. These functions become saturated when the net input is greater than three, which leads to very small gradients. The standard method is to normalize the data so that they fall into a standard range from -1 to 1 (Janssen, 2018). Therefore, this is done for both the input and the output data.
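The random 70/15/15 division and the normalization to the range from -1 to 1 can be sketched as follows. This is a minimal sketch under assumptions: the function names, the fixed random seed, and the choice to compute the minimum and maximum over whatever values are passed in are illustrative and not taken from the study.

```python
import random


def split_dataset(samples, train_frac=0.70, val_frac=0.15, seed=0):
    """Randomly divide samples into training, validation, and testing sets."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)  # fixed seed (assumption) for repeatability
    n_train = round(train_frac * len(shuffled))
    n_val = round(val_frac * len(shuffled))
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # the remainder, about 15%
    return train, val, test


def fit_minmax(values):
    """Return the minimum and maximum used for scaling one variable."""
    return min(values), max(values)


def scale(value, lo, hi):
    """Map the interval [lo, hi] linearly onto [-1, 1]."""
    if hi == lo:
        return 0.0  # constant variable: no spread to scale
    return 2.0 * (value - lo) / (hi - lo) - 1.0


def unscale(scaled, lo, hi):
    """Invert scale(), e.g. to convert a network output back to its original unit."""
    return (scaled + 1.0) * (hi - lo) / 2.0 + lo


train, val, test = split_dataset(list(range(100)))
print(len(train), len(val), len(test))

lo, hi = fit_minmax([10, 20, 30])
print([scale(v, lo, hi) for v in [10, 20, 30]])
```

Since the output data is normalized as well, the inverse mapping is needed to express a predicted value in its original unit again.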

In document An artificial neural network approach for cost estimation of engineering services : enhancing cost estimation efficiency (Page 42-44)