
CHAPTER 4. DATA ANALYSIS AND RESULTS

4.20. Challenges

Many challenges were encountered while preprocessing the revenue vehicle inventory data for exploratory data analysis and for the machine learning algorithms. The quality of the revenue vehicle inventory data was poor, and feature engineering ran into several roadblocks, most notably missing data.

Data are the most important part of developing any predictive model. Without data of sufficient quality and quantity, a good predictive model is unlikely. In this model, the revenue vehicle inventory data from 2008 to 2016 from the NTD database were used. The available data from 1999 to 2007 were not used because of their poor quality. According to the FTA, a vehicle’s default useful life depends on the vehicle type (NTD, 2017). Therefore, each vehicle type needs enough training data to train the model. The exploratory data analysis of the training data showed that some vehicle types had only a few training data points. For example, the Inclined Plane Vehicle and Double Decker Bus vehicle types had only 1 data point, which was not enough to train the model for these vehicle types.
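The check for under-represented vehicle types described above can be sketched as follows. This is a minimal illustration rather than the code used in this work: the `vehicle_type` field name, the sample records, and the `MIN_POINTS` threshold are all assumptions made for the example.

```python
from collections import Counter

# Hypothetical sample of training records; only the vehicle type matters here.
records = [
    {"vehicle_type": "Bus"},
    {"vehicle_type": "Bus"},
    {"vehicle_type": "Bus"},
    {"vehicle_type": "Van"},
    {"vehicle_type": "Van"},
    {"vehicle_type": "Van"},
    {"vehicle_type": "Inclined Plane Vehicle"},
    {"vehicle_type": "Double Decker Bus"},
]

# Assumed minimum number of data points needed per vehicle type.
MIN_POINTS = 3


def sparse_vehicle_types(rows, min_points):
    """Return {vehicle_type: count} for types with fewer than min_points rows."""
    counts = Counter(row["vehicle_type"] for row in rows)
    return {vtype: n for vtype, n in counts.items() if n < min_points}


sparse = sparse_vehicle_types(records, MIN_POINTS)
print(sparse)
```

Vehicle types returned by such a check would be candidates for exclusion or for grouping with similar types, since a single observation cannot support a type-specific estimate.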

Preparing the data for the machine learning algorithms was very challenging. The tasks involved cleaning records with missing information, creating new features, transforming them into useful features, and reorganizing the data into a form suitable for the machine learning algorithms. Data preparation also involved looking for anomalies and correcting them so that the data were consistent.

Since the revenue vehicle inventory data sets were complex and contained no direct information on when a vehicle was retired, it was very challenging to split the data into the training set with retired vehicles and the deployment set with non-retired vehicles. In the revenue vehicle inventory data, the Retired column was an important attribute, as it flagged whether a vehicle was retired with ‘Y’ or ‘N’. This column existed in the data from 2014 through 2016, but not in the data from 2013 and earlier. In addition, the Retired column was null for many data points in the data from 2014 through 2016. Therefore, during the data cleaning process, the Retired column was added to the data from 2008 to 2013 with the value ‘Y’, and an extra column, Retired Year, was added to all data sets.

The Manufacture Year was another important column, used to calculate the service life of vehicles. There were 3189 data points with no value for Manufacture Year, which represented about 7.5% of the total data. These data points were not considered for the predictive model and were removed from the data set. Fuel Type was also an important categorical feature that affected the accuracy of the predictive model. The exploratory analysis showed that 14100 data points were missing the Fuel Type category, which represented 33% of the total data points. In this case, however, such a large share of the data was not dropped from the data set. Instead, the missing category was replaced by a dummy Unknown fuel type. This substitution might affect the performance of the model, but it avoided discarding a third of the data.
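The two treatments of missing values described above (dropping rows without a Manufacture Year, and filling a missing Fuel Type with a dummy category) can be illustrated with a short sketch. The field names `manufacture_year` and `fuel_type`, the `"Unknown"` label, and the sample rows are assumptions for the example, not the actual NTD column names.

```python
# Hypothetical inventory rows; None or "" marks a missing value.
rows = [
    {"id": 1, "manufacture_year": 2005, "fuel_type": "Diesel"},
    {"id": 2, "manufacture_year": None, "fuel_type": "Diesel"},
    {"id": 3, "manufacture_year": 2010, "fuel_type": None},
    {"id": 4, "manufacture_year": 2012, "fuel_type": ""},
]


def clean_records(records, unknown_label="Unknown"):
    """Drop rows lacking a manufacture year; fill missing fuel types."""
    cleaned = []
    for row in records:
        # Service life cannot be computed without a manufacture year.
        if row.get("manufacture_year") is None:
            continue
        fixed = dict(row)  # copy so the input rows are left unchanged
        if not fixed.get("fuel_type"):
            fixed["fuel_type"] = unknown_label
        cleaned.append(fixed)
    return cleaned


cleaned = clean_records(rows)
for row in cleaned:
    print(row)
```

The asymmetry reflects the reasoning in the text: a missing Manufacture Year makes the target uncomputable, whereas a missing Fuel Type only removes one categorical input, so the row can still contribute to training.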

During data processing, creating useful features by combining multiple features was another challenge. Since no strong correlation was found between the individual features and the target feature, combinations of different features were tried in the model to obtain the best performance. Feature selection therefore proceeded by trial and error, using the feature importance function to see whether newly created features had any impact on the model. Through this process, some of the features were selected for the model and the rest were rejected.

After the initial exploratory data analysis was completed, selecting the best predictive model for this problem was another challenge. The analysis showed that the target variable was continuous and that regression analysis could solve the problem. Since many regression algorithms are available for machine learning problems and there is no concrete methodology for choosing the best model, this work started with several popular methods to build the predictive model. The entire data set was split into three sets: the training set, the test set, and the deployment set. Once this was done, three popular machine learning techniques were chosen for the model: random forest regression, gradient boosting regression, and decision tree regression. A separate predictive model was built with each of these three techniques, and its performance was evaluated. Comparing the models took a significant amount of time, and choosing the most suitable algorithm for the problem was a challenge.
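The three-way split mentioned above can be sketched as follows. The text states that the training set holds retired vehicles and the deployment set holds non-retired vehicles; drawing the test set from the retired vehicles, the 20% test fraction, the fixed seed, and the `retired` field name are assumptions made for this illustration.

```python
import random


def split_inventory(rows, test_fraction=0.2, seed=0):
    """Split rows into (train, test, deployment) lists.

    Retired vehicles have a known service life, so they are divided into
    training and test sets. Non-retired vehicles form the deployment set.
    """
    retired = [r for r in rows if r["retired"] == "Y"]
    deployment = [r for r in rows if r["retired"] != "Y"]

    rng = random.Random(seed)  # seeded so the split is reproducible
    rng.shuffle(retired)

    n_test = int(len(retired) * test_fraction)
    test, train = retired[:n_test], retired[n_test:]
    return train, test, deployment


# Hypothetical data: 10 retired and 5 active vehicles.
rows = [{"id": i, "retired": "Y"} for i in range(10)]
rows += [{"id": i, "retired": "N"} for i in range(10, 15)]

train, test, deployment = split_inventory(rows)
print(len(train), len(test), len(deployment))
```

Each of the three regressors would then be fitted on the training set and scored on the test set, with the chosen model applied to the deployment set to predict service life for vehicles still in use.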

Another challenge was handling the outliers in the data set. After the actual service life of each vehicle was calculated by subtracting the manufacture year from the retired year, some vehicles were observed to have a very low service life. This may be due to human error in entering the manufacture year or the retired year during the data collection process. These data points were removed from the training data set.
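The service life calculation and the removal of implausibly low values can be sketched as below. The text does not state the cutoff that was used, so `MIN_SERVICE_LIFE` is a hypothetical threshold, and the field names and sample rows are likewise assumptions for the example.

```python
# Hypothetical cutoff in years; the cutoff actually used is not stated.
MIN_SERVICE_LIFE = 2

vehicles = [
    {"id": "A", "manufacture_year": 2001, "retired_year": 2014},
    {"id": "B", "manufacture_year": 2013, "retired_year": 2014},
    {"id": "C", "manufacture_year": 2015, "retired_year": 2014},  # likely entry error
    {"id": "D", "manufacture_year": 2000, "retired_year": 2012},
]


def with_service_life(rows):
    """Add service_life = retired_year - manufacture_year to each row."""
    return [
        dict(r, service_life=r["retired_year"] - r["manufacture_year"])
        for r in rows
    ]


def drop_low_service_life(rows, min_years):
    """Keep only rows whose service life is at least min_years."""
    return [r for r in rows if r["service_life"] >= min_years]


scored = with_service_life(vehicles)
kept = drop_low_service_life(scored, MIN_SERVICE_LIFE)
print([(r["id"], r["service_life"]) for r in kept])
```

A negative or near-zero service life, as for vehicles B and C here, is the kind of value that points to a data entry error rather than a real retirement.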

In document Building a Predictive Model on State of Good Repair by Machine Learning Algorithm on Public Transportation Rolling Stock (Page 158-161)