In this chapter, the performance of the ML models developed in Chapter 4 will be reviewed to identify the best performing model for each wheel wear type. The chapter starts off with a description of how the various ML model performance metrics were combined to create a single score that was used to select the best suited model for each wheel wear type. The chapter continues with a report of how each model performed, based on the aforementioned score.
5.1 Model score combination
In Chapter 4, various performance metrics for each of the developed ML models are reported for each of the five types of railway wheel wear. However, there still remains the issue of selecting the winning model for each of the wheel wear types. To this end, a method of combining the performance metrics was devised to provide a single metric that was used to rank the models for each wheel wear type.
The difficulty that led to the development of this combined performance metric stemmed from the high degree of class imbalance in the target variables. The target variable count, grouped by class, is provided in Table 40. As one can see, for most of the wear measurements, a vast majority of the observations fell in the ‘0’ target variable class. This imbalance makes the model metrics difficult to interpret; the reasons for this are discussed in Section 3.2.5.2. To reiterate, a large class imbalance in the output variable of a binary classification problem reduces the merit of many model performance metrics. This is especially true for metrics that only report on the rate of accurate predictions, because a model that simply favours the majority class will score well on such metrics. The metrics derived from confusion matrices are particularly susceptible to this phenomenon, because they mainly report on various accuracy rates of a model. A combined model performance metric was therefore formulated to provide a single measure for ranking the ML models that takes the effect of binary target variable class imbalance into account.
The following equation was used to combine the model performance metrics:
In Eq. 15, CS is a normalised weighted score of four model performance metrics: sensitivity, specificity, F1 and AUC. The reasoning behind the weights of the model metrics was as follows. Specificity had the lowest weight because of the bias toward the negative class, as shown in Table 40. This bias makes it less of an achievement if a model performs well in predicting the negative class. Sensitivity and the F1 score each received a weight of 2. For sensitivity, this was done to reflect the fact that it is quite an achievement if a model does well in predicting the positive class of the target variable, again due to the target class imbalance shown in Table 40. The F1 score was also allocated a weight of 2 because it balances precision and sensitivity, and is not severely affected by target variable class imbalance. AUC was assigned the highest weight of 3, because it was the metric least affected by the class imbalance of the binary output variable.
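The weighting described above can be sketched in a few lines of Python. This is a minimal illustration rather than the exact form of Eq. 15: the weights of 2, 2 and 3 come from the text, whereas the specificity weight of 1 and the normalisation by the sum of the weights are assumptions, and the function name `combined_score` and the example metric values are hypothetical.

```python
from typing import Mapping

# Sensitivity, F1 and AUC weights are taken from the text.
# ASSUMPTION: the specificity weight of 1 is not stated explicitly;
# the text only says it is the lowest of the four weights.
ASSUMED_WEIGHTS = {
    "sensitivity": 2.0,
    "specificity": 1.0,
    "f1": 2.0,
    "auc": 3.0,
}


def combined_score(
    metrics: Mapping[str, float],
    weights: Mapping[str, float] = ASSUMED_WEIGHTS,
) -> float:
    """Return a weighted mean of the metrics.

    ASSUMPTION: "normalised" is read as dividing by the sum of the
    weights, so that CS stays in [0, 1] when every metric is in [0, 1].
    """
    missing = set(weights) - set(metrics)
    if missing:
        raise KeyError(f"missing metrics: {sorted(missing)}")
    total_weight = sum(weights.values())
    weighted_sum = sum(weights[name] * metrics[name] for name in weights)
    return weighted_sum / total_weight


# Hypothetical metric values, for illustration only (not thesis results).
example_metrics = {
    "sensitivity": 0.5,
    "specificity": 1.0,
    "f1": 0.5,
    "auc": 0.75,
}
example_cs = combined_score(example_metrics)
print(f"CS = {example_cs:.3f}")
```

Under these assumptions, a model with high specificity but weak sensitivity and F1 is pulled down noticeably, which is the intended effect of giving the negative class the smallest weight.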
Table 40: Wheel wear measurement type target variable class counts
Wear Measurement Count (Class = 0) Count (Class = 1)
FH 72’864 10’038
TD 41’504 41’398
HW 75’802 7’100
FS 27’230 55’672
FT 75’496 7’406
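To make the effect of the imbalance in Table 40 concrete, the short sketch below computes the accuracy that a trivial classifier would obtain by always predicting the more frequent class. The counts are copied from Table 40; the helper name `majority_baseline_accuracy` is hypothetical, and the calculation is an illustration, not part of the modelling pipeline of Chapter 4.

```python
# Class counts copied from Table 40: (count of class 0, count of class 1).
CLASS_COUNTS = {
    "FH": (72_864, 10_038),
    "TD": (41_504, 41_398),
    "HW": (75_802, 7_100),
    "FS": (27_230, 55_672),
    "FT": (75_496, 7_406),
}


def majority_baseline_accuracy(n_class0: int, n_class1: int) -> float:
    """Accuracy of a classifier that always predicts the larger class."""
    return max(n_class0, n_class1) / (n_class0 + n_class1)


baseline = {
    wear: majority_baseline_accuracy(n0, n1)
    for wear, (n0, n1) in CLASS_COUNTS.items()
}

for wear, accuracy in baseline.items():
    print(f"{wear}: majority-class baseline accuracy = {accuracy:.3f}")
```

For FH, HW and FT this baseline is already roughly 0.88 to 0.91, so a high accuracy on its own says little about how well a model detects the positive class. For TD, where the classes are almost balanced, the baseline is close to 0.5.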
5.2 Final model scores and selection
The combined scores of the models for each of the wear measurement types are provided in Table 41 to Table 45. The winning model type for each wheel wear measurement is listed in Table 46. The results show that logistic regression was not the best performing model type for any of the wheel wear measurements, and that it achieved the lowest score of the three ML model types for every measurement except TD. ANN performed very well when it came to TD measurement prognostics, and was the best performing model type for the TD measurement type. Random forest was the best performing model type for all the wheel wear measurements, save for TD, where it scored markedly lower (0.593) than both logistic regression (0.959) and ANN (0.964). The strong overall performance of random forest could possibly be attributed to the fact that, of the three model types, random forest is least affected by outliers and missing data.
Table 41: Combined model scores for FH
Model Type CS
Logistic Regression 0.755
ANN 0.816
Random Forest 0.822
Table 42: Combined model scores for TD
Model Type CS
Logistic Regression 0.959
ANN 0.964
Random Forest 0.593
Table 43: Combined model scores for HW
Model Type CS
Logistic Regression 0.67
ANN 0.711
Random Forest 0.731
Table 44: Combined model scores for FS
Model Type CS
Logistic Regression 0.608
ANN 0.933
Random Forest 0.953
Table 45: Combined model scores for FT
Model Type CS
Logistic Regression 0.678
ANN 0.71
Table 46: Highest scoring model type for each wheel wear measurement
Wear Measurement Highest Scoring Model Type
FH Random Forest
TD ANN
HW Random Forest
FS Random Forest
FT Random Forest
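The selection step behind Table 46 amounts to taking, for each wear measurement, the model type with the highest CS. A minimal sketch of that step is given below, using the scores listed in Table 41 to Table 44; the dictionary layout and the helper name `best_model` are hypothetical.

```python
# CS values copied from Table 41 to Table 44.
# FT is omitted here: all three scores are needed for the comparison.
SCORES = {
    "FH": {"Logistic Regression": 0.755, "ANN": 0.816, "Random Forest": 0.822},
    "TD": {"Logistic Regression": 0.959, "ANN": 0.964, "Random Forest": 0.593},
    "HW": {"Logistic Regression": 0.67, "ANN": 0.711, "Random Forest": 0.731},
    "FS": {"Logistic Regression": 0.608, "ANN": 0.933, "Random Forest": 0.953},
}


def best_model(scores_by_model: dict[str, float]) -> str:
    """Return the model type with the highest combined score."""
    return max(scores_by_model, key=scores_by_model.get)


winners = {wear: best_model(scores) for wear, scores in SCORES.items()}

for wear, model in winners.items():
    print(f"{wear}: {model}")
```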
5.3 Chapter summary
In this chapter, a method for combining the model scores reported in Chapter 4 into a single model score was described. The chapter continued with a report of this score for each model type and for each wheel wear measurement type. Logistic regression was not the best performing model type for any of the wheel wear measurements. ANN was the best performing model for TD measurements, and random forest was the best performing model for FH, HW, FS and FT. The strong performance of random forest was tentatively attributed to its robustness to outliers and missing data.