Chapter 4 Results
4.3 NBM Features Developed from SWI Challenges
4.3.2 Learning Database
The Learning Database was developed in Excel by mapping the three data sources and transforming the 134 Army records according to the responses to the questions that represent each of the predictor nodes. The database consists of eleven columns and 134 rows, where each row represents one of the error reports. Columns 1 through 10 contain the predictor node data, that is, the responses to the questions, and Column 11 contains the one of the five discretized IDI categories that corresponds to the actual data in the error report. The columns represent the features developed from SWI challenges. A key step in knowledge integration is the transformation of the original error reports into the features that populate the Learning Database.
A sample of the records in the Learning Database is shown in Table 9. The numbers in the cells represent True=1 or False=2 responses to the questions in Table 8.
The entire Learning Database is provided in Appendix B.
Table 9 - Sample of Learning Database Entries
Severity  ErrorCat  ACAT1  SysDepend  OrgDepend  SameEvent  PriorEvent  SysGroup  SysType  Core  IDI (Days)
2         1         2      2          2          1          1           1         1        1     1
The distribution of the training and prediction set data, compared with the entire database for each target node category, is provided in Table 10. The similarity of the frequencies shows that both sets are representative samples of the entire Learning Database. The training and prediction sets were randomly selected and were used to determine the accuracy. Further discussion of these datasets along with their distribution for each predictor node (features) is provided in 4.4.1.1 and 4.4.1.2.
Table 10 - Dataset Summary
Target  Category  Entire Database     Training Set        Prediction Set
                  Count  %            Count  %            Count  %
IDI     IDI-1     73     54.48%       63     52.94%       8      53.33%
        IDI-2     42     31.34%       40     33.61%       4      26.67%
        IDI-3     11     8.21%        10     8.40%        2      13.33%
        IDI-4     2      1.49%        1      0.84%        0      0.00%
        IDI-5     6      4.48%        5      4.20%        1      6.67%
        Total     134    100.00%      119    100.00%      15     100.00%
The training data set of 119 records was randomly selected as approximately 90% of the 134 records in the Learning Database. The same training set was used for each Global Accuracy measure. The frequency count and the prior probability distribution of each of the 10 features are provided in Table 11. The prediction data set consists of 15 records and is shown in Table 12. This dataset is used to calculate the Prediction Set Accuracy for each of the models as discussed in paragraph 4.5 below.
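The random selection of the training and prediction sets described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure used in this research: the text says only that the sets were randomly selected, so the function name `split_records`, the use of Python's `random` module and the seed value are all hypothetical.

```python
import random


def split_records(n_records, n_prediction, seed=0):
    """Randomly partition record indices into training and prediction sets.

    The seed is a hypothetical choice that makes the split repeatable.
    """
    rng = random.Random(seed)
    indices = list(range(n_records))
    rng.shuffle(indices)
    prediction = sorted(indices[:n_prediction])
    training = sorted(indices[n_prediction:])
    return training, prediction


# 134 records in the Learning Database, 15 held out for prediction.
training, prediction = split_records(134, 15)
print(len(training), len(prediction))
```

Holding the training indices fixed, as this sketch does, mirrors the statement that the same training set was used for each model.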
Table 11 - Training Data Set Distribution
Table 12 - Prediction Data Set Distribution
4.4 NBM Analysis Results
NBM versions were assessed to determine the impact the features have on the accuracy of the model. Prediction Set Accuracy and Global Accuracy were calculated for each of the resulting NBM versions. Prediction Set Accuracy, the ratio of correct predictions to total predictions on a randomly selected set of data, is the traditional measure of model performance, while Global Accuracy has particular significance for classification models such as the NBM developed in this dissertation research.
The Global Accuracy was developed from the Confusion Matrix that resulted from 10-fold cross-validation. The Confusion Matrix shows the performance of the NBM relative to each of the intervals. Each of the resulting models and its associated features, along with the Global Accuracy and the Prediction Set Accuracy, is shown in 4.4.1 through 4.4.8. The resulting NBMs with their features, accuracy measures and confusion matrices are shown in Tables 13 through 33.
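The two accuracy measures can be illustrated with a short sketch. The Prediction Set Accuracy follows the ratio defined above. For the Global Accuracy, the confusion matrices in this chapter are annotated with (TP+TN)/(TP+TN+FP+FN); how the counts are pooled across the five intervals is not stated, so the one-vs-rest pooling below is an assumption, and the 3-class matrix is hypothetical data, not a result from this research.

```python
def prediction_set_accuracy(actual, predicted):
    """Ratio of correct predictions to total predictions."""
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / len(actual)


def global_accuracy(matrix):
    """(TP+TN)/(TP+TN+FP+FN) with counts pooled over all classes.

    matrix[i][j] = number of records with actual class i predicted as j.
    Pooling one-vs-rest counts per class is an assumed reading.
    """
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    tp = tn = fp = fn = 0
    for k in range(n):
        tp_k = matrix[k][k]
        fn_k = sum(matrix[k]) - tp_k                        # actual k, missed
        fp_k = sum(matrix[i][k] for i in range(n)) - tp_k   # predicted k, wrong
        tn_k = total - tp_k - fn_k - fp_k
        tp, tn, fp, fn = tp + tp_k, tn + tn_k, fp + fp_k, fn + fn_k
    return (tp + tn) / (tp + tn + fp + fn)


# Hypothetical 3-class confusion matrix (rows = actual, columns = predicted).
example = [[5, 1, 0],
           [2, 3, 1],
           [0, 1, 2]]
print(round(global_accuracy(example), 3))
```

Under this reading, true negatives are counted once for every interval, so the value can exceed the plain ratio of correct predictions computed from the same matrix.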
4.4.1 NBM 1: Army Data Features (Severity and ErrorCat)
Model 1 as shown in Table 13 was developed using the two data elements recorded in the original error reports: Severity and ErrorCat. The resulting NBM used the training data set with only Severity and ErrorCat data at the percentages represented in Table 11.
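As a reference point for how a Naive Bayes model over categorical features such as Severity and ErrorCat produces a prediction, a minimal sketch follows. It is not the tool used in this research: the function names `train_nb` and `predict_nb`, the Laplace smoothing parameter `alpha` and the six toy records are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_nb(rows, labels, alpha=1.0):
    """Count class and per-feature value frequencies from categorical rows."""
    n_features = len(rows[0])
    class_counts = Counter(labels)
    counts = [defaultdict(Counter) for _ in range(n_features)]
    values = [set() for _ in range(n_features)]
    for row, label in zip(rows, labels):
        for j, value in enumerate(row):
            counts[j][label][value] += 1
            values[j].add(value)
    return {"classes": class_counts, "counts": counts, "values": values,
            "alpha": alpha, "n": len(labels)}


def predict_nb(model, row):
    """Return the class with the highest log posterior for one record."""
    alpha = model["alpha"]
    best_label, best_score = None, -math.inf
    for label, n_label in model["classes"].items():
        score = math.log(n_label / model["n"])  # log prior
        for j, value in enumerate(row):
            k = len(model["values"][j])
            count = model["counts"][j][label][value]
            # Smoothed conditional probability (alpha is an assumption).
            score += math.log((count + alpha) / (n_label + alpha * k))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy records: (Severity, ErrorCat) coded 1 or 2.
rows = [(1, 1), (1, 2), (2, 1), (2, 2), (1, 1), (2, 2)]
labels = ["IDI-1", "IDI-1", "IDI-2", "IDI-2", "IDI-1", "IDI-2"]
model = train_nb(rows, labels)
print(predict_nb(model, (1, 2)))
```

The product of per-feature conditional probabilities in `predict_nb` is where the independence assumption enters, which is why dependent features are examined later in this chapter.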
Table 13 – NBM 1 Severity & ErrorCat
4.4.2 NBM 2: Severity, ErrorCat and ACAT
Model 2 was developed using the Model 1 data elements with the addition of the ACAT. The resulting model analysis with features Severity, ErrorCat and ACAT along with Global and Prediction Accuracy is shown in Table 14.
Table 14 - NBM 2 - Accuracy Results With External Data (ACAT)
4.4.3 NBM 3: All Features (including SWI Challenges)
Model 3 was developed using the Model 2 data elements with the addition of the seven features based on the SWI challenges. The resulting model with 10 features and the resulting accuracy analysis are shown in Table 15. Based on Global Accuracy, Model 3 is the most accurate; however, based on Prediction Set Accuracy, Model 1 is the most accurate. This pattern, in which Prediction Set Accuracy and Global Accuracy do not lead to the same recommendation for the most accurate model, continues throughout the NBM development. To resolve this matter, the final recommendation is made based on additional accuracy measures that assess the accuracy of individual intervals, as discussed in 4.5.9.
Table 15 - NBM 3 - All Features (including SWI Challenges)
4.4.4 Independence Test Results
The Chi-Square Test for Independence results (Table 16) show that the following features have a dependency with at least one other feature: Core, SysType, ACAT and SameEvent. The test was conducted at alpha = .05 with H0 = Features are Independent and HA = Features are Dependent. These features were individually and jointly trimmed to determine the impact on the model accuracy measures.
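The independence test for one pair of features can be sketched as below. This is an illustration under stated assumptions: the contingency counts are hypothetical, no continuity correction is applied, and the closed-form p-value used here holds only for one degree of freedom (a 2 x 2 table of two binary features).

```python
import math


def chi_square_independence(table):
    """Pearson chi-square statistic and degrees of freedom for a
    contingency table given as a list of rows of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof


def p_value_dof1(stat):
    """Upper-tail p-value of a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2))


# Hypothetical counts for two binary features (rows: feature A, columns: feature B).
observed = [[30, 10],
            [10, 30]]
stat, dof = chi_square_independence(observed)
p = p_value_dof1(stat)
decision = "R" if p < 0.05 else "DNR"  # R = reject H0, DNR = do not reject
print(stat, dof, decision)
```

Repeating this for every pair of the 10 features yields a lower-triangular matrix of R and DNR decisions of the kind reported in Table 16.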
Table 16 - Chi Squared Test for Independence Results
FEATURES    Severity  ErrorCat  ACAT (Oversight)  SysDepend  OrgDepend  PriorEvent  SameEvent  SysGroup  SysType  Core
Severity
ErrorCat    DNR
ACAT        DNR       R
SysDepend   DNR       DNR       DNR
OrgDepend   DNR       DNR       DNR               DNR
PriorEvent  DNR       DNR       DNR               DNR        DNR
SameEvent   DNR       DNR       DNR               DNR        DNR        R
SysGroup    DNR       DNR       DNR               DNR        DNR        DNR         DNR
SysType     DNR       DNR       R                 R          DNR        DNR         DNR        R
Core        R         DNR       DNR               DNR        DNR        R           R          R         R
R = Reject the Null (Dependency Exists); DNR = Do Not Reject the Null (Features Independent)
H0: Features are Independent; HA: Features Are Dependent
4.4.5 NBMs With Dependent Variables Trimmed
Trimming each dependent feature individually resulted in Models 4, 5, 6 and 7, as shown in Tables 17 through 20. Again, Global Accuracy and Prediction Set Accuracy differ, with Global Accuracy consistently higher, but trimming the Core feature results in the most accurate model to this point.
Table 17 - NBM 4 - Trim Core Feature
Table 18 - NBM 5 – Trim SysType Feature
Table 19 - NBM 6 - Trim SameEvent Feature
Table 20 - NBM 7 - Trim ACAT Feature
4.4.6 NBMs With Sets of Two Dependent Variables Trimmed
Models 8 through 13 were developed after trimming sets of two features to determine the impact on accuracy. The sets of features were selected based on the Chi Squared Test for Independence, which indicated which features had a dependency with other features. The resulting models show that NBM 10, which results from trimming the features ACAT and SameEvent, has the highest Global Accuracy measure and the lowest Prediction Set Accuracy. This difference in accuracy measures is similar to that in the previous sections, and as previously stated, an assessment of the model behavior at each interval shows which model provides the best accuracy at the IDI level. Tables 21 through 26 show the resulting models.
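The candidate feature sets produced by trimming the four dependent features singly, in pairs, in threes and all together can be enumerated as in the sketch below. The helper name `trimmed_feature_sets` is hypothetical, and the enumeration order does not necessarily match the NBM numbering used in the tables.

```python
from itertools import combinations

# The 10 features in the Learning Database and the 4 flagged as dependent.
ALL_FEATURES = ["Severity", "ErrorCat", "ACAT", "SysDepend", "OrgDepend",
                "SameEvent", "PriorEvent", "SysGroup", "SysType", "Core"]
DEPENDENT = ["Core", "SysType", "SameEvent", "ACAT"]


def trimmed_feature_sets(all_features, dependent, size):
    """Return (removed, remaining) pairs for every way of trimming
    `size` of the dependent features."""
    result = []
    for removed in combinations(dependent, size):
        remaining = [f for f in all_features if f not in removed]
        result.append((removed, remaining))
    return result


for size in (1, 2, 3, 4):
    print(size, len(trimmed_feature_sets(ALL_FEATURES, DEPENDENT, size)))

pairs = trimmed_feature_sets(ALL_FEATURES, DEPENDENT, 2)
```

The counts of 4, 6, 4 and 1 feature sets correspond to the four single-feature models, six two-feature models, four three-feature models and one four-feature model listed in the table captions of 4.4.5 through 4.4.8.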
Table 21 - NBM 8 - Trim Core and SysType Features
Table 22 - NBM 9 - Trim Core & SameEvent Features
Table 23 - NBM 10 - Trim ACAT & SameEvent Features
Table 24 - NBM 11 - Trim SameEvent & SysType Features
Note: each confusion matrix is annotated with Global Accuracy = (TP+TN)/(TP+TN+FP+FN).
Table 25 - NBM 12 - Trim ACAT & SysType Features
Table 26 - NBM 13 - Trim Core & ACAT Features
4.4.7 NBMs with Sets of Three Features Trimmed
Additional models were developed based on trimming features in groups of three. The features that were trimmed are those that show dependency with other features based on the results of the Chi-Squared test. The resulting model accuracy results are shown in Tables 27 through 31.
Table 27 - NBM 14 Trim Core, ACAT, & SysType Features
Table 28 - NBM 15 Trim ACAT, SameEvent, & SysType Features
Table 29 - NBM 16 Trim Core, SameEvent, & SysType Features
Table 30 - NBM 17 Trim Core, SameEvent & ACAT Features
4.4.8 NBM with Four Dependent Features Trimmed
Trimming all four features resulted in the final model, which gave mixed accuracy results similar to those discussed for the previous models. The results are shown in Table 31.
Table 31 - NBM 18 Trim Core, SameEvent, SysType & ACAT Features
4.4.9 NBM Accuracy Measures Comparisons
Table 32 shows the measures of accuracy that are relevant to classification models, as explained below:
1. A brief description of the features that define the NBM is in columns 1 and 2.
2. The Prediction and Global Accuracies are shown in columns 3 and 4 to indicate the overall performance of the NBM.
a. The Prediction Accuracy is the typical measure of the correct predictions divided by the total number of predictions, which as indicated earlier is based on a set of 15 randomly selected use cases.
b. The Global Accuracy is the more accurate measure for classification models, because it averages the accuracy of 10 random samples of prediction sets and provides the Confusion Matrix to show the performance of the model within each interval class.
3. The remaining columns provide the precision and recall for the two intervals of interest, IDI-1 and IDI-2. Rather than considering the performance of all of the intervals, the analysis focuses on IDI-1 and IDI-2, which together account for 85.9% of the data (relative frequency shown in Table 7). The interval precision and recall calculations are important because they show which intervals are more accurate and allow the calculation of the F1 Score.
4. Intervals IDI-1 and IDI-2 have the highest relative frequency, so their F1 Scores were used to determine that NBM 10 has the best set of features to predict the schedule delay.
Table 32 - Model Accuracy Summary
NBM  Features                              Prediction  Global    IDI-1      IDI-1   IDI-1     IDI-2      IDI-2   IDI-2
                                           Accuracy    Accuracy  Precision  Recall  F1 Score  Precision  Recall  F1 Score
1    Severity, ErrorCat                    0.600       0.878     0.757      0.841   0.797     0.612      0.750   0.674
2    Severity, ErrorCat, ACAT              0.533       0.883     0.839      0.825   0.832     0.612      0.750   0.674
3    All Features                          0.533       0.905     0.846      0.873   0.859     0.674      0.775   0.721
4    Trim Core                             0.729       0.893     0.825      0.825   0.825     0.667      0.750   0.706
5    Trim SysType                          0.706       0.893     0.812      0.889   0.848     0.682      0.750   0.714
6    Trim SameEvent                        0.533       0.888     0.831      0.857   0.844     0.674      0.775   0.721
7    Trim ACAT                             0.533       0.881     0.848      0.889   0.868     0.674      0.775   0.721
8    Trim Core & SysType                   0.667       0.898     0.828      0.803   0.835     0.638      0.769   0.698
9    Trim Core & SameEvent                 0.600       0.902     0.852      0.825   0.839     0.674      0.775   0.721
10   Trim ACAT & SameEvent                 0.533       0.902     0.836      0.889   0.862     0.738      0.775   0.756
11   Trim SameEvent & SysType              0.533       0.908     0.825      0.825   0.825     0.660      0.775   0.713
12   Trim ACAT & SysType                   0.533       0.902     0.809      0.873   0.840     0.732      0.750   0.741
13   Trim Core & ACAT                      0.600       0.885     0.828      0.841   0.835     0.674      0.756   0.731
14   Trim Core, ACAT & SysType             0.667       0.885     0.820      0.794   0.806     0.625      0.750   0.682
15   Trim ACAT, SameEvent & SysType        0.600       0.895     0.831      0.857   0.844     0.667      0.750   0.706
16   Trim Core, SameEvent & SysType        0.667       0.885     0.810      0.794   0.806     0.659      0.725   0.690
17   Trim Core, SameEvent & ACAT           0.600       0.892     0.852      0.825   0.839     0.667      0.750   0.706
18   Trim Core, SameEvent, SysType & ACAT  0.600       0.888     0.800      0.825   0.813     0.659      0.725   0.690
** N/A is due to division by zero; however, since this interval has the least number of instances, it does not affect the maximum most accurate interval F1 Score
*Note: Model NBM4 & higher result from feature removal based on Independence testing showing feature
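The F1 Score referred to in the list above is the harmonic mean of precision and recall. The sketch below recomputes it from the NBM 10 precision and recall values reported for IDI-1 and IDI-2 in the accuracy summary; the function names are hypothetical, and returning `None` for an undefined ratio is an assumed way of representing the N/A case caused by division by zero.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall for one interval from confusion-matrix counts.

    Returns None for a ratio whose denominator is zero.
    """
    precision = tp / (tp + fp) if (tp + fp) else None
    recall = tp / (tp + fn) if (tp + fn) else None
    return precision, recall


def f1_score(precision, recall):
    """Harmonic mean of precision and recall, or None when undefined."""
    if precision is None or recall is None or precision + recall == 0:
        return None
    return 2 * precision * recall / (precision + recall)


# NBM 10 precision and recall as reported for IDI-1 and IDI-2.
f1_idi1 = f1_score(0.836, 0.889)
f1_idi2 = f1_score(0.738, 0.775)
print(round(f1_idi1, 3), round(f1_idi2, 3))  # prints 0.862 0.756
```

These values agree with the 86.2% and 75.6% F1 Scores cited for Model 10.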
The following analysis provides additional assessment of the accuracy measures.
No model has the most accurate prediction for every interval. The prior sections provide model assessments that show the discrepancy between Global and Prediction Set Accuracy. However, as shown in Figure 17, the Global Accuracy measure is relatively stable, while the Prediction Set Accuracy is more sensitive to the set of features. Using the Prediction Set Accuracy measure, Model 4 would be the choice, but the Global Accuracy measure would select Model 10. These results illustrate why the F1 Score is the recommended accuracy measure.
Figure 17 - NBM Prediction & Global Accuracy Comparison
[Figure 17: x-axis NBM 1 through 18; y-axis accuracy from 0.500 to 0.950; series Prediction Accuracy and Global Accuracy]
The Confusion Matrix for each NBM was used to calculate the Precision, Recall and F1 Score for each interval to make the final determination of the recommended NBM. As discussed earlier, the IDI-1 and IDI-2 intervals were used to select the most accurate NBM, since the majority of the data is in these two intervals.
Figure 18 provides a graphical comparison of the F1 Score for each of the NBMs. Using the F1 Score as the criterion, Model 10 is the recommended NBM for schedule delay prediction, with F1 Scores for intervals IDI-1 and IDI-2 of 86.2% and 75.6%, respectively.
Figure 18 - F1 Score Comparison (IDI-1 & IDI-2)
4.5 Contribution Analysis for Final Set of Features
The most accurate model based on F1 Score is NBM 10 with the following 8 features: ErrorCat, SysGroup, OrgDepend, SysDepend, SysType, Core, Severity and PriorEvent. The Contribution Analysis for these features is provided in the following Pareto Chart (Figure 19).
Figure 19 – Pareto Chart of Contribution Analysis
[Figure 19: x-axis features in the order ErrorCat, SysGroup, OrgDepend, SysDepend, SysType, Core, Severity, PriorEvent; series Feature Contribution to Delay and Cumulative Contribution]
Contribution Analysis shows that, as established in the Data Preprocessing phase, both technical and non-technical factors are responsible for SWI delay. As Figure 19 indicates, ErrorCat at 25% has the highest impact on the time to resolve an error, followed by SysGroup (13%), OrgDepend (12.5%), SysDepend (12.2%), SysType (11%), Core (10%), Severity (9.3%) and PriorEvent (8.2%). Also notable is that 80% of the time to resolve an error is caused by five features: ErrorCat, SysGroup, OrgDepend, SysDepend and SysType. Only one of these features was captured in the initial error reports; the remaining four were mined from the data as known SWI challenges. These features link to the following SWI Challenges and Questions (Table 33).
Table 33 – Contribution of SWI Challenges
SWI Challenges               Questions Used to Populate                           Learning Database Column
(Literature Survey Results)  Learning Database                                    and Node in NBM (Features)
Technical Risk               4. What is the type of error?                        CEC
SoS Complexity               10. Is the system a SOS?                             SysGroup
Independent Management       3. Does the error impact more than 1 organization?   OrgDepend
System Interdependencies     1. Does the error impact more than 1 system?         SysDepend
System Interdependencies     2. Is the system in the command post?                SysType
SoS Complexity               9. Is the system a core system?                      Core
Technical Risk               5. Does the error impair a critical task?            Severity
Non-Technical Risk           6. Does the system have errors in prior events?      PriorEvent
4.6 Summary of Findings
The results of the NBM development presented in this dissertation indicate that historical data can be used to provide accurate schedule prediction when feature selection is used to determine the most important features. The prediction and contribution analysis can be used to support decisions that enable systems engineers and managers to realign resources or shift schedules to meet their priorities, as well as to support risk mitigation decisions. Essentially, the features developed for this research are accurate for predictions of IDI-1, IDI-2, IDI-3 and IDI-5, which include errors that are resolved within 92 days or that take more than 176 days to resolve. However, the NBM is less accurate in predicting IDI-4, which includes errors that take between 92 and 176 days to resolve. Fortunately, this range does not occur frequently (1.49% of the time; see Table 10). Additional study is required to fully investigate this outcome.
4.6.1 Feature Selection Had Mixed Results on Accuracy
Model accuracy was assessed by developing different models from different sets of features. Initially, features based on the historical US Army error reports were used to determine the baseline model accuracy. These same features were subsequently used with the addition of the external feature (ACAT), which showed a slight increase in Global Accuracy and a slight decrease in Prediction Set Accuracy. Finally, a third model with the first two sets of features and the addition of the seven features from SWI Challenges determined from the literature survey resulted in increased Global Accuracy but no change in Prediction Set Accuracy. This difference between Global and Prediction Set Accuracy continues through each of the model variations assessed. Removal of features based on Independence testing provided further variation in these accuracy measures. However, to fully assess the impact of removing features, the Precision, Sensitivity and F1 Score measures were the final determinant of the most accurate model.
NBM-10 with eight features was the most accurate, with 90.2% Global Accuracy and 53.3% Prediction Set Accuracy. This accuracy is comparable to other similar NBM models (Bielza, 2014) and exceeds other methods for SW development resource estimation, which include schedule estimation and average 39% (Boehm & Valerdi, 2011).
4.6.2 Implications to Technical Impacts on Delay
Four of the eight features primarily quantify the impact that interdependencies have on schedule delay and were not included in the original error reports. SysGroup, SysDepend, OrgDepend and SysType all define different aspects of the interdependencies that characterize complex systems and are represented as SWI Challenges. Based on this dissertation research, these features contribute 48% of the schedule delay. Because these errors often are not revealed through system-level testing, they create challenges that generally require stakeholder and system owner communications and resources to troubleshoot and resolve. When the impact of these interdependencies is not considered in the original or updated schedule estimation, almost half of the time required to resolve an error is omitted, which is likely responsible for the underestimation of the time required to resolve an error.
4.6.3 Implications to Organizational Impacts on Delay
The NBM contribution analysis also provides objective analysis of the impact organizational dependencies (OrgDepend) have on schedule delay prediction caused by integration errors. OrgDepend was mined from the data based on those errors that had at least two systems and two organizations that were responsible for resolving the error. According to the contribution analysis, OrgDepend ranks third in its impact on the integration schedule delay. While the majority of the features are primarily technical, the ranking of the organizational impact shows its importance in understanding the causal factors that create the schedule delay.
Chapter 5 – Conclusions
This dissertation set out to develop an NBM to predict the schedule delay created by SWI errors. Error reports from three Army SWI events were used as the data source. The results are promising and show that SWI challenges can drive features that accurately predict the schedule delay. Previous chapters of this dissertation presented the key findings, which include the determination of key features for the NBM based on accuracy and the contribution each feature makes to the model prediction.
First, the features for an NBM that best predict the delay created by errors during the SWI phase of development were determined. The approach to developing the features relied on a Literature Survey of SWI challenges documented in journal articles and conference papers published in the past 10 years. The features that were most prevalent in 30 articles were summarized and used to drive data mining activities that extracted the evidence of these challenges from the original error reports. The resulting SWI challenges include System Interdependencies, Independent Management, SW Integration Risks (Technical and Non-Technical) and Complexity. These challenges, as reflected in the Army error reports, were represented as 10 features that were further trimmed through feature engineering.
Feature engineering was an important aspect of model development that
supported the removal of features based on Independence analysis results. Research has