
Chapter 2 – Literature Review

2.5 SW Integration Challenges Based on Literature Survey

2.5.3 Software Integration (SWI) Risks

2.5.3.2 Non-Technical Risk

This category covers research traditionally classified as project management, which focuses primarily on business and resource management and, in particular, on the activities likely to impact schedule. While these factors are external to the specific ongoing integration event, they may constrain the PRC resolution options available to systems managers. The need for cross-functional and cross-organizational collaboration (Thamhain, 2013b) affects the time needed to conduct root cause analysis, including the allocation of required personnel resources. The lack of SI processes that support management in this phase indicates a continued emphasis solely on the physical challenges (Jain, 2010).

Alignment of stakeholder objectives and project management, including funding, requires agreed-upon governance (Dahmann, 2015) that is often lacking in the SoS environment. This alignment must also support the programmatic milestones approved for the acquisition category (ACAT), and the process differs depending on the ACAT as defined in DoD Directive 5000 (2015). This category creates varying levels of oversight that can have an overall impact on the PRC. According to Thamhain (2013b):

“The involvement of many people, processes, and technologies spanning different organizations, support groups, subcontractors, vendors, government agencies... compounds the level of uncertainty and distributes risk over a wide area…. often creating surprises with potentially devastating consequences.”

In spite of these challenges and risks, project management as traditionally executed does not provide tools and processes intended for the SoS environment and its uncertainty (Thamhain, 2013b). Through surveys of prior U.S. Department of Defense aerospace projects, Davendralingam and Kenley (2013) measured the complex relationships among these stakeholders to determine the impact of specified risk categories on cost and schedule. They concluded that “more complex stakeholder relationships demonstrate significantly higher schedule delay”.

2.5.4 Complexity

Since the 1990s, researchers have recognized and tried to quantify SW complexity (Basili, Briand, & Melo, 1996) by focusing on individual and internal influences specifically related to the technical aspects of a system. However, there is no single way to define or measure complexity (Maurer, Schneller, & Omer, 2014). Current research shows that the uncertainty associated with complex environments creates a nonlinear challenge and increases program risk (Madni & Sievers, 2014), which is considered responsible for multiple failed projects in the public and private sectors. Multiple technical and non-technical factors create the complexity experienced in the SWI environment. Complexity can also include programmatic or organizational influences as non-technical risk factors. SI should consider “lifecycle, architecture, process, interfaces, enterprise, product and data” (Madni & Sievers, 2014) to be among the critical factors that bring complexity to SI. Research on the rationale for the continued underestimation of SI effort, particularly with regard to schedule, indicates the significant role of complexity in environments such as SW integration. Similar research by Thamhain (2013b) identifies organizational factors as the key to success within a complex environment and further argues that integrating cross-organizational resources is a source of uncertainty. Sauser (2008) concludes that this complexity is driven by a convergence of individual organizations, unique and sometimes overlapping capabilities, separately controlled organizations, and a schedule that is never executed as originally planned. The combined effects of these challenges provide the characteristics of complexity even if there is no consensus definition.

2.6 Bayesian Probabilities

A key element of NBM is the use of Bayesian probabilities. Pearl (1977) presented a Bayesian method for constructing a probabilistic network from a database of records as a means to gain insight into the dependencies that exist among the variables.

In recent years, probabilistic modeling through NBM has proven to be a reliable method to build knowledge-based analysis capability that uses historical information to predict future events.

NBM is widely used to capture existing knowledge to support activities such as spam filtering and data classification, as well as predicting embedded and component SW errors. Most defect prediction models are aimed at software component reliability during the SW development cycle and do not consider errors created during the SW integration and test phase (Mende et al., 2011). NBM has also been used to determine the impact of changes to architecture (Tang, Nicholson, Jin, & Han, 2007) as well as for clustering and classification of data (Domingos & Pazzani, 1997). These NBMs rely heavily on expert judgment to estimate the conditional probabilities, which can be difficult for legacy systems. There is evidence that NBMs can be constructed from data and need not rely solely on expert judgment (Loutchkina et al., 2014).

The previous discussion of the challenges in the SW integration phase, the increasing complexity, and the associated risks provides a documented need for SW estimation models that more accurately reflect the complexities in the SWI environment.

2.7 NBM Theoretical Framework

In recent years, probabilistic modeling through Naïve Bayes Models (NBM) has proven to be a reliable method to build knowledge-based analysis capability that uses historical information to predict future events. Domingos and Pazzani (1997) used datasets to show that NBM for inference and probabilistic learning achieves accuracy comparable to similar models such as Bayesian Belief Networks (BBN). However, NBM is much simpler than BBN because there is one target with multiple features used for prediction. For this reason, NBM provides a simple analysis tool to predict the likelihood of specified outcomes (Friedman et al., 1997). NBMs are based on Bayes' theorem, which uses conditional probabilities to predict outcomes. Bayes' theorem states the following:

Given a hypothesis C and data f:

$$P(C \mid f) = \frac{P(C)\,P(f \mid C)}{P(f)} \qquad \text{(Equation 1)}$$

• P(C): Prior probability of C based on the frequency of the target class data

• P(f): Prior Probability of f based on the frequency of feature (predictor) data

• P(f|C): Likelihood of feature (f) given class (C)

• P(C|f): Posterior probability (Target of the model)

where C = classification (delay interval or target prediction) and f = data from multiple features (predictors). All of the values required to calculate P(f|C) are obtained from the learning database developed from the raw data, using the frequency counts for each feature. A key assumption of NBM for classification (also referred to as the NB classifier) is that the predictive features are independent of each other given the class, which is the “naïve” assumption. Each prediction is the most likely outcome based on the observed frequencies calculated from the historical data; this is also referred to as the maximum posterior probability, as shown in Equation 2.

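$$c_{NB} = \underset{c_j}{\arg\max}\; P(c_j) \prod_{i} P(f_i \mid c_j) \qquad \text{(Equation 2)}$$

where $c_{NB}$ is the class prediction given a set of features $f_i$, and $c_j$ ranges over the candidate classes.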
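The frequency-count calculation behind Equations 1 and 2 can be illustrated with a minimal Python sketch. It is not the tooling used in this research; the feature names (severity, interface type), the delay-interval labels, and the six records are hypothetical and exist only to make the arithmetic visible. No smoothing is applied, so a feature value never observed with a class drives that class's score to zero.

```python
from collections import Counter, defaultdict


def train_nb(rows, labels):
    """Count class and per-feature value frequencies from a learning database."""
    class_counts = Counter(labels)
    # feature_counts[(feature_index, class)][value] = observed frequency
    feature_counts = defaultdict(Counter)
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            feature_counts[(i, label)][value] += 1
    return class_counts, feature_counts


def predict_nb(row, class_counts, feature_counts):
    """Return the class maximizing P(c) * prod_i P(f_i | c), plus all scores."""
    total = sum(class_counts.values())
    scores = {}
    for c, n_c in class_counts.items():
        score = n_c / total  # prior P(c) from class frequency
        for i, value in enumerate(row):
            # likelihood P(f_i | c) from frequency counts (no smoothing)
            score *= feature_counts[(i, c)][value] / n_c
        scores[c] = score
    return max(scores, key=scores.get), scores


# Hypothetical learning database: (severity, interface type) -> delay interval
rows = [("high", "external"), ("high", "internal"), ("low", "internal"),
        ("low", "internal"), ("high", "external"), ("low", "external")]
labels = ["long", "long", "short", "short", "long", "short"]

class_counts, feature_counts = train_nb(rows, labels)
prediction, scores = predict_nb(("high", "external"), class_counts, feature_counts)
print(prediction, scores)
```

The scores are the unnormalized numerators of Equation 1. Dividing by P(f) is unnecessary for classification because it is the same for every class and does not change the argmax.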

2.8 NBM for Classification

Several advantages make NBM suitable for modeling the SWI environment. Most importantly, NBM has been shown to be accurate for classification (Zhang, 2008). Mori et al. (2013) proposed a NB classifier to estimate the probability of project failure that was more accurate than other statistical methods; the researchers attributed this to the ability of the NB classifier to be updated as new information becomes known, in contrast to the static estimates of other methods. Another strength of Naïve Bayes is the ease of developing and explaining the model through graphical depictions, including showing the relationship between the target prediction and historical data (Bielza, 2014). The study by Bielza (2014) also showed that irrelevant variables could be deleted through “filtering” for feature selection, using methods that remove attributes based on some criteria. To improve accuracy, this dissertation research intends to use as few variables as possible and will use independence testing to eliminate variables.
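One common way to test independence between two categorical features is the Pearson chi-square statistic. The sketch below assumes that test purely for illustration, since the text above does not name the specific test used. The feature columns are hypothetical, and 3.841 is the standard 0.05 critical value for one degree of freedom.

```python
from collections import Counter


def chi_square_statistic(xs, ys):
    """Pearson chi-square statistic and degrees of freedom for two categorical variables."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    x_totals = Counter(xs)
    y_totals = Counter(ys)
    stat = 0.0
    for x, n_x in x_totals.items():
        for y, n_y in y_totals.items():
            expected = n_x * n_y / n      # count expected if x and y were independent
            observed = joint[(x, y)]      # Counter returns 0 for unseen pairs
            stat += (observed - expected) ** 2 / expected
    dof = (len(x_totals) - 1) * (len(y_totals) - 1)
    return stat, dof


# Hypothetical feature columns from a learning database
severity = ["high", "high", "low", "low", "high", "low", "high", "low"]
priority = ["p1", "p1", "p2", "p2", "p1", "p2", "p1", "p2"]    # mirrors severity
interface = ["ext", "int", "ext", "int", "ext", "int", "int", "ext"]

stat_dep, dof_dep = chi_square_statistic(severity, priority)
stat_ind, dof_ind = chi_square_statistic(severity, interface)

CRITICAL_0_05_DOF_1 = 3.841  # chi-square critical value, alpha = 0.05, 1 degree of freedom
print(stat_dep > CRITICAL_0_05_DOF_1, stat_ind > CRITICAL_0_05_DOF_1)
```

In this toy data the priority column duplicates the information in severity, so one of the two would be a candidate for removal, while interface shows no dependence on severity. With small expected counts, as here, the chi-square approximation is rough and the result should be treated as indicative only.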

The key disadvantage of NB classifiers is the inability of the model to accurately calculate the associated probabilities (Zhang, 2008); however, in spite of this deficiency, the classifier is still accurate. Most NBM users ignore the probability estimates as long as the classification is correct. However, when the probability estimates are relatively close, it is important for the NBM predicted outcome interval to be dependable so that the manager knows the relative likelihood of a different outcome (classification of interval) and can plan accordingly. Domingos and Pazzani (1997) attribute this paradox (probability being incorrect but class correct) to the “zero-one loss”, which essentially “does not penalize inaccurate probability estimation as long as the maximum probability is assigned to the correct class.” Studies have shown that this discrepancy does not negate the value of NBM for prediction. In fact, Kupervasser (2014) proves that even when the NBM is not exact in its probability estimation, the model can reliably provide the optimal solution for classification.
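The zero-one loss behavior described above can be seen in a small sketch. The two sets of posterior estimates are hypothetical, and log loss is used here only as one example of a measure that is sensitive to the probability values; it is not a measure taken from the cited studies.

```python
import math


def zero_one_loss(probs, true_class):
    """Return 0 if the highest-probability class is the true class, else 1."""
    predicted = max(probs, key=probs.get)
    return 0 if predicted == true_class else 1


def log_loss(probs, true_class):
    """Negative log of the probability assigned to the true class."""
    return -math.log(probs[true_class])


# Two hypothetical posterior estimates over delay intervals for the same error report
estimate_a = {"short": 0.55, "medium": 0.35, "long": 0.10}
estimate_b = {"short": 0.98, "medium": 0.01, "long": 0.01}
true_class = "short"

for name, est in (("a", estimate_a), ("b", estimate_b)):
    print(name, zero_one_loss(est, true_class), round(log_loss(est, true_class), 3))
```

Both estimates receive the same zero-one loss because both rank the true class first, even though they express very different levels of confidence. This is why a manager comparing closely ranked intervals needs more than the classification alone.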

For use as a classifier, the advantages of NBM make it a compelling tool for the SWI environment. The ability to provide the manager with a range of potential outcomes is more realistic and frames the input from expert judgment so that the combined input enhances the decision process. The disadvantages have not proven prohibitive to using the NBM as an accurate tool for prediction. As shown through research, NBM accuracy exceeds that of more traditional approaches. The CRISP process, shown below, offers a method to develop a NBM while implementing mitigation strategies to overcome the disadvantages.

2.9 Model Development Process

There are a limited number of comprehensive generic approaches to data mining and model development. The CRISP model as discussed by Larose (2014) provides an acceptable nonproprietary guideline that will be used to frame the discussion of the NBM development. The CRISP process, shown in Figure 5, emerged in the 1990s and continues to be the preferred starting place, although many model developers modify the basic process for their individual needs (Larose, 2014). The process provides a reference for problem solving that uses data mining to support model development. The CRISP process involves six phases that revolve around understanding the data as it is transformed to build the model. Each of the phases is discussed below, along with its contribution to and use in this research, except the Deployment phase, which is beyond the scope of this dissertation.

Figure 5 - CRISP Model Development Process

2.9.1 Business Understanding

This initial phase of model development includes the problem definition and the determination of the model objectives, along with the process to achieve the objectives (Larose, 2014). The domain expertise for this research was developed through a literature survey to represent subject matter expertise. The literature survey resulted in documented SWI challenges that represent the complexities of the SWI environment and provide the foundation for model development.

2.9.2 Data Understanding

The next phase of the CRISP model includes data collection and understanding, which provides an assessment of the quality and condition of the data. The initial data analysis is conducted during this phase to understand the relationships among the data elements. This phase includes data mining. Witten (2014) describes this step as “turning data into information and information into knowledge” that leads to machine learning through a learning database. Witten (2014) believes human supervision is required to develop the learning database. For this dissertation, the majority of the effort was spent data mining and preprocessing the data to build a learning database. The NBM was developed using supervised learning based on data from multiple sources.

2.9.3 Data Preparation

Data preparation can be the most extensive phase of NBM development. As Witten (2014) suggests, human intervention is essential to completing this phase. This research supports this position, since the database used for the model was developed from the historical database of Army SWI errors combined with SWI challenges and external information. This amount of intervention is not unusual when the data is not initially stored with the intent of being used for analytical purposes. This phase also includes feature selection and determining the type of data (Boolean, nominal, ordinal, etc.) and the variable type (discrete or continuous). Both Larose (2014) and Witten (2014) agree that it is key to extract or prune the data to prepare for a modeling tool, as is the case for this study.

Data transformation is an aspect of data preparation that applied to the model built for this research. This process includes changing categorical data to numerical data and binning continuous data through discretization if needed. The original error reports from the Army historical data were transformed to develop the learning database, with the features as columns and the individual error reports as rows.

2.9.4 Modeling (Naïve Bayes Model Development)

For the modeling phase, BayesiaLab and XLSTAT commercial software were selected to complete this research. Each of these products had different advantages. XLSTAT works within Excel, which allowed easy manipulation of the learning database that was also developed in Excel. The CRISP model as presented by Larose considers this the normal pattern for these phases. Several researchers recommend that the dataset be used to develop a learning database (Larose, 2014), with a recommendation that 15-30% be set aside for validation. Because of the limited amount of data, and as suggested by Witten et al. (2014), 10% of the data was set aside for this dissertation so that the maximum amount of data could be used to train the model.

2.9.4.1 Discretization to Improve Accuracy

Discretization is an important preprocessing activity that is necessary for preparing continuous data prior to constructing the model. Discretization creates intervals from the data as a more concise representation than the continuous data (Liu, Hussain, Chew Lim Tan, & Dash, 2002). Studies (Gupta, Mehrotra, & Mohan, 2010; Wu, 2006) have shown that discretization improves accuracy and reduces the time for the learning phase of NBM development. Computing time is not a concern for this dissertation; however, most software used to model Naïve Bayes, including the commercial software used to build the model and conduct the analysis for this research, requires discrete data.
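As an illustration of how continuous data become intervals, the following sketch applies equal-width binning. This is only one of several discretization methods and is assumed here for simplicity; it is not necessarily the method applied by the commercial software. The `delay_days` values are hypothetical.

```python
def equal_width_bins(values, n_bins):
    """Map each continuous value to an interval index in 0 .. n_bins - 1."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        # All values are identical, so a single interval is sufficient.
        return [0] * len(values), [lo, hi]
    edges = [lo + i * width for i in range(n_bins + 1)]
    # The maximum value falls on the last edge, so clamp it into the top bin.
    bins = [min(int((v - lo) / width), n_bins - 1) for v in values]
    return bins, edges


delay_days = [1.0, 2.5, 4.0, 7.5, 9.0, 12.0, 15.0, 21.0]  # hypothetical continuous feature
bins, edges = equal_width_bins(delay_days, n_bins=3)
print(bins)   # interval index per value
print(edges)  # interval boundaries
```

Equal-width intervals are easy to explain but can leave some intervals sparsely populated when the data are skewed, which is one reason alternative discretization methods are worth comparing.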

2.9.4.2 NBM Analysis

Assessment of the NBM was based on the accuracy of the model, measured by comparing the number of correct classifications to the total number of test cases. In k-fold cross-validation, the sample database is randomly partitioned into k equal-sized subsamples. Of the k subsamples, a single subsample is retained as the validation data for testing the model, and the remaining k − 1 subsamples are used as training data; the process is repeated k times so that each subsample is used exactly once for validation. For this research, test cases were randomly selected from the learning database using 10-fold cross-validation, with 10% of the data used for validation and 90% of the data used for training the model.
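The partitioning step of k-fold cross-validation and the accuracy measure can be sketched as follows. The function names, the seed, and the sample size of 100 are assumptions made for illustration and do not reflect the size of the actual learning database.

```python
import random


def k_fold_splits(n_samples, k, seed=0):
    """Yield (training, validation) index lists for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)           # random partition of the records
    folds = [indices[i::k] for i in range(k)]      # k subsamples of near-equal size
    for i, validation in enumerate(folds):
        # Train on every fold except the one held out for validation.
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation


def accuracy(predicted, actual):
    """Fraction of cases classified correctly."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


splits = list(k_fold_splits(n_samples=100, k=10))
fold_sizes = [(len(train), len(val)) for train, val in splits]
print(fold_sizes[0])  # (90, 10): 90% training, 10% validation
```

Each record appears in exactly one validation fold, so averaging the per-fold accuracy uses every record for testing once while never testing on data that was used to train that fold's model.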

Larose (2014) recommends that the early analysis look at correlation between variables, including contribution analysis between the target node and each specific variable. The results of this analysis further shape the feature selection. For this research, this analysis proved essential to pruning and developing the right set of features for the model. However, as discussed later in the model development section, alternative methods for inclusion or exclusion of variables and different discretization methods were assessed to determine their impact on the accuracy of the model. To avoid duplicative or insignificant derived variables, the correlation and usefulness of each derived variable can be assessed during the model assessment (Larose, 2014).

2.10 NBM Cautions And Limitations for Prediction

There are cautions and limitations when using NBM for prediction. As shown through this literature review, NBM offers an analytical approach to better predict schedule delays based on the advantages discussed in Section. However, many researchers indicate the limitations regarding this type of predictive model. The following list of constraints and recommended mitigation strategies is critical to model development.

1. When there is limited or no data within a class, Laplace smoothing should be used to assign a value between one and five to the node cell (Witten, 2014). This allows the mathematical equations to continue with a nonzero number in the denominator.

2. Using NBM for inference is likely limited to the types of systems that are used to develop the prior probabilities and the CPTs (Domingos & Pazzani, 1997). For this research, the development of a NBM using data from previous events offers a credible methodology to support prediction of problem resolution during integration testing. In the past ten years, research has validated NBMs for inference, especially when the model is developed from data and the inferences remain within the boundaries of the data. For the data set used to validate the prediction model in this dissertation, two factors create consistency. First, there are core systems that participate in each event as the NW backbone (primarily legacy systems). Second, the new systems are within the network portfolio of systems in the C4ISR domain.

3. Conditional independence of the features is recommended. However, violating this assumption can still result in good prediction (Domingos & Pazzani, 1997). For this dissertation research, an assessment of the independence of the features will be conducted to determine whether trimming the features in question improves the accuracy of the model.

4. Prior Probabilities are required to train the model. For this example, data from previous integration events is used to determine the observed frequencies and prior probabilities for the model.
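The Laplace smoothing mentioned in the first item of the list above can be sketched with the standard additive-smoothing formula. The parameter name `alpha` and the counts used are assumptions for illustration; the source describes only the assignment of a small value to an empty cell.

```python
def smoothed_likelihood(count, class_total, n_values, alpha=1.0):
    """Estimate P(f = v | c) with additive (Laplace) smoothing.

    count       -- number of records in class c where feature f takes value v
    class_total -- number of records in class c
    n_values    -- number of distinct values feature f can take
    alpha       -- pseudo-count added to every cell (assumed parameter)
    """
    return (count + alpha) / (class_total + alpha * n_values)


# Hypothetical counts for a two-valued feature in a class with three records
unseen_value = smoothed_likelihood(count=0, class_total=3, n_values=2)   # (0 + 1) / (3 + 2)
seen_value = smoothed_likelihood(count=3, class_total=3, n_values=2)     # (3 + 1) / (3 + 2)

# A class with no records at all still yields a defined estimate
empty_class = smoothed_likelihood(count=0, class_total=0, n_values=2)    # (0 + 1) / (0 + 2)
print(unseen_value, seen_value, empty_class)
```

The smoothed estimates for a feature still sum to one within a class, and the denominator is nonzero even when the class has no records, so the probability calculation can proceed.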

None of the aforementioned limitations is prohibitive to getting accurate results from the NB classifiers, as demonstrated in the Results Section, Chapter 4 of this dissertation. However, each resulted in either a justification of the method or appropriate adjustments, as discussed with each of the individual limitations.

2.11 Literature Review Summary

This literature survey provided background on research that was used to develop a NBM from data to determine the impact an error has on the schedule. The SWI environment that establishes the foundation for this research was defined, and the existing tools were shown to be inadequate for this environment because of the complexity of the non-linear relationships that are referred to as SWI challenges in this dissertation. A focused literature survey of recent research provided a set of SWI challenges that were used to mine the data and define the features that represent the independent variables for the model. Finally, the literature review provided the background and justification for using the NBM for prediction where there is uncertainty, such as in the SWI environment.

Chapter 3 – Method

This research used quantitative analysis to build a NBM based on a process that determines (1) the features for a NBM to predict the SWI delay created by errors during the SWI phase of development, (2) the most accurate model based on a set of features selected through feature engineering, and (3) the contribution each feature brings to the prediction. SWI is a critical phase that is often conducted at the end of the SW development phase and frequently results in schedule slippages that are difficult to quantify. The lack of tools available to support schedule control during this phase leaves many managers with inaccurate estimates of the delay created by SWI errors. Through feature selection and model analysis, this research developed a NBM that uses probabilistic estimation to provide a range of potential outcomes so that managers can leverage the prediction to make knowledge-based decisions. The research method (Figure 6) included data collection from three sources, knowledge integration to develop a learning database, and finally development of the NBM.

3.1 Data Collection

Three sources of data were used to support this research: error reports from U.S. Army SWI events; SWI challenges resulting from a survey of literature, used to develop features based on interdependencies; and an external source related to the project management or programmatic leadership oversight for the system responsible for correcting the error. The features are the independent variables that potentially impact the schedule prediction. Each data collection effort is discussed further to show how it can represent potential schedule impacts.


Figure 6 - Framework for NBM Development Process

3.1.1 Error Reports

The error reports generated from 2010 to 2015 during the U.S. Army SW Integration events provided the raw data that was mined for the development of the independent variables for the NBM. The U.S. Army SWI is a SoS event that results in a
