Safe software, contrary to various opinions and understandings, cannot be achieved through software reliability practices [14]. Software reliability is defined as “the ability of a system or component to perform its required functions under stated conditions for a specified period of time.” [7]
Reliability focuses on performance according to requirements, measured in terms of failures to meet those requirements. Although reliability can be used as an indirect indicator of safety, the lack of hazard analysis and of the resulting safety requirements can render this a moot point, because reliability does not mandate safety requirements. As discussed by Leveson [14], most safety-critical failures can be traced back to incorrect requirements, i.e., a lack of understanding as to what the software should do under hazardous conditions. In essence, this falls under the validation domain, which is based on stakeholder expectation: stakeholders expect the system to be safe without necessarily providing specifications for safety. As further evidence for the validation case, Leveson [14] states that “although coding errors often get the most attention, they have more of an effect on reliability and other qualities than on safety.” This statement points to a prevailing reliance on verification against requirements, rather than validation against expectation, as the key quality improvement technique. Validation, again, is not afforded the attention it requires.
Therefore, safety-critical software-intensive systems require a systematic metric framework to aid in validation. Although much effort has been devoted to developing hazard analysis and hazard reduction techniques (such as Failure Modes, Effects, and Criticality Analysis, Event Tree Analysis, and Fault Tree Analysis), there is little to no evidence that these processes and products are being measured to help answer the question “are we building the right safety product?” (i.e., validation of system safety).
Safety requirements can be divided into two categories: generic requirements and system-specific (or derived) requirements. Generic requirements are those recommended in standards, contained in workplace procedures, or identified in “lessons learned,” etc. They are essentially good practice, based on previously identified common causes leading to known hazards, to aid in developing a safer system. Derived requirements are those that are realized through the undertaking of hazard analysis and association of software functions that may contribute to identified hazards. These derived safety requirements may be more specific to the validation of safety-critical software-intensive systems than any other artifact or product, as high-level user documentation often does not provide such detail.
There are many products and procedures involved in the engineering of a safety-critical software-intensive system. Hazard identification, hazard analysis, safety-critical software function identification, and verification of safety requirements are some of the areas that will need to be considered in the development of a validation metrics framework. These products and processes will be some of the major foci of the Validation Metrics Framework for safety-critical software-intensive systems.
1. Software Hazard Risk Assessment
Unlike risk assessment for hardware, software risk assessment has unique qualities that inhibit the traditional assignment of consequence/severity and likelihood/probability. Whether software behaves probabilistically is a hotly debated topic in the software engineering discipline; however, for the purpose of this thesis the assumption is made that software failures are systematic. That is to say, they are caused by incorrect requirements (design errors) or development errors, and therefore cannot be assigned probabilistic failure rates. Although this is contrary to much of the field of software reliability, it does allow for the use of many established software safety tools.
Determining the safety risk associated with software requires a different approach from that used for hardware. A typical approach to determining software risk is to determine the software’s level of control over the associated hazard or hazard causal factor, rather than determining the probability (or likelihood) of a hazard or hazard causal factor occurring. Figure 4 shows a Software Hazard Criticality Matrix (SHCM) for assessing the risk of software contributing to system hazards.
Figure 4. Software Hazard Criticality Matrix [From [2]]
Figure 4 utilizes the Control Category scheme given by MIL-STD-882C. With regard to Control Category schemes, the Joint Software System Safety Handbook [2] states that “The SSS [Software System Safety] team must review these lists and tailor them to meet the objectives of the SSP [System Safety Program] and software development program.” For the purpose of this thesis, the Control Category scheme of MIL-STD-882C, as presented in Figure 4, will be assumed.
The Joint Software System Safety Handbook [2] emphasizes that the SHCM is not intended to be used directly as a Hazard Risk Index (HRI) matrix. Because it is not possible to assign a probability of occurrence, the risk assessment provided by the SHCM is not entirely compatible with the risk assessment of an HRI. Instead, the SHCM reflects risk in the unique terms of software, indicating a level of rigor required to address the risk level. In some cases, the risk level may warrant an alternative solution that does not utilize software control. Therefore, when determining hazard risk that includes software causal factors, engineering judgment must be applied to determine a level of probability, taking into consideration the SHCM rating, the level of rigor applied, and the resultant safety measures developed.
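As an illustration of how an SHCM is used, the following minimal Python sketch looks up a criticality index from a control category and a hazard severity. The cell values and the index-to-risk-level bands below are assumptions modeled on a commonly reproduced layout of the matrix; they are placeholders that should be replaced with the entries of Figure 4 and tailored as the handbook directs. The names `Severity`, `SHCM`, `RISK_LEVEL`, `shcm_index`, and `risk_level` are hypothetical.

```python
from enum import Enum


class Severity(Enum):
    """Hazard severity categories, used as column positions in the matrix."""
    CATASTROPHIC = 0
    CRITICAL = 1
    MARGINAL = 2
    NEGLIGIBLE = 3


# ASSUMPTION: placeholder cell values, not taken from Figure 4.
# Rows are control categories; columns follow the Severity order above.
SHCM = {
    "I": (1, 1, 3, 5),
    "II": (1, 2, 4, 5),
    "III": (2, 3, 5, 5),
    "IV": (3, 4, 5, 5),
}

# ASSUMPTION: illustrative mapping from index to risk level, to be tailored.
RISK_LEVEL = {1: "high", 2: "medium", 3: "moderate", 4: "moderate", 5: "low"}


def shcm_index(control_category: str, severity: Severity) -> int:
    """Return the criticality index for a control category and severity."""
    # Sub-categories such as "IIa" and "IIb" share the row of their parent.
    row = control_category.rstrip("ab")
    if row not in SHCM:
        raise ValueError(f"unknown control category: {control_category!r}")
    return SHCM[row][severity.value]


def risk_level(control_category: str, severity: Severity) -> str:
    """Return the risk level (and hence level of rigor) for a software hazard."""
    return RISK_LEVEL[shcm_index(control_category, severity)]
```

Note that the lookup takes no probability argument: the control category stands in for likelihood, which is why the result indicates a level of rigor rather than an HRI value.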
2. Hazard Causal Factors
Software hazards are in and of themselves causal factors to system hazards. As discussed earlier, software cannot create a mishap by itself. However, software is often responsible for system functionality that can create mishaps. The linkage of system hazards to software hazards (causal factors) requires an in-depth understanding of system functionality. Software functionality contributing to a system hazard is identified as a first-order causal factor. First-order causal factors are hazards in themselves, but further analysis can reveal second- and even third-order causal factors. Analysis of software causal factors beyond first order is typically required only for medium- and high-level risks as identified in the SHCM. Although this is a generally accepted rule of thumb, and is assumed in this thesis to be the norm, certain industries or applications of software may have adopted a different level of analysis. Therefore, any metric framework measuring the depth of software causal factor analysis must be tailorable to the different standards for measuring sufficiency.
For the purpose of this thesis, software hazard causal factors will be referred to as software hazards, since they are in fact hazards themselves. Any reference to software hazards must be taken in context, but generically they will always be causal factors. It is also assumed that medium- and high-level software hazards require analysis to the level of third-order causal factors, unless justified otherwise.
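The sufficiency assumption above lends itself to a simple, tailorable measurement. The following Python sketch is one possible formulation, not a metric defined in this thesis: it reports the fraction of software hazards whose causal factor analysis reaches the required order or carries a recorded justification. The names `SoftwareHazard`, `DEFAULT_REQUIRED_DEPTH`, `depth_is_sufficient`, and `depth_coverage` are hypothetical, and the first-order requirement for the lower risk levels is an assumption.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

# Third-order analysis for medium and high risk follows the assumption in the
# text. ASSUMPTION: the lower risk levels require first-order analysis only.
DEFAULT_REQUIRED_DEPTH = {"high": 3, "medium": 3, "moderate": 1, "low": 1}


@dataclass
class SoftwareHazard:
    identifier: str
    risk_level: str          # risk level assigned via the SHCM
    analyzed_depth: int      # deepest causal-factor order analyzed so far
    justification: str = ""  # recorded rationale for stopping short, if any


def depth_is_sufficient(
    hazard: SoftwareHazard,
    required_depth: Optional[Dict[str, int]] = None,
) -> bool:
    """True if the analysis reaches the required order or is justified."""
    table = DEFAULT_REQUIRED_DEPTH if required_depth is None else required_depth
    required = table[hazard.risk_level]
    return hazard.analyzed_depth >= required or bool(hazard.justification.strip())


def depth_coverage(
    hazards: List[SoftwareHazard],
    required_depth: Optional[Dict[str, int]] = None,
) -> float:
    """Fraction of hazards with sufficient causal-factor analysis depth."""
    if not hazards:
        return 1.0  # nothing to analyze, so nothing is insufficient
    sufficient = sum(depth_is_sufficient(h, required_depth) for h in hazards)
    return sufficient / len(hazards)
```

Passing a different `required_depth` table is how the measure would be tailored to an industry or application that has adopted another standard of sufficiency.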