Chapter 1 Introduction
2.2 Reliability Engineering
2.2.1 Failure Model
The identification and evaluation of failure is central to risk and dependability studies. Therefore, defining what failure means is an intrinsic component in studying various operational attributes of systems such as reliability, availability, maintainability, and risk [74], [75], [89], [90], [101], [102], [103], [104], [105].
The relationship among faults, errors, and failures is usually represented by a chain, as depicted in Figure 2-1. A fault in the system is the assumed cause of an error and is considered active when it causes an error. An error is the state that follows one or more faults and represents the part of the system state that could lead to a subsequent failure. A failure is defined as the event that causes the delivered service to deviate from correct service [8], [33], [87], [100].
Figure 2-1: The fundamental chain of dependability and security threats, modified from [8]
The complete failure of an entity or a system is not necessary for it to fall short of performing its intended function successfully, since a partial failure might still allow operation to continue [47]. The time interval does not always have to be expressed in hours or days; it depends on the system under study and could be, for example, clock time, operation time, or number of cycles. Operating conditions should include information about load and environmental setting. Furthermore, while a failure definition should consistently reflect the deviation from correct service [87], failure modes and their specifics are not necessarily identical among different components [88]. In traditional reliability, failure modes among different hardware components are not necessarily identical, and when reliability studies were extended to software engineering, software failure modes were defined differently, reflecting the failure attributes of software components. For instance, failure of an electronic component might be defined as the inability of a thyristor to withstand the nth voltage spike; for a mechanical component it might be the event that the strength of the component is smaller than the stress on it [47]; and for software components it might be defined as an unacceptable departure of program operation from requirements, modeled by the mean time to failure (MTTF) based on execution time rather than calendar time, as in the Musa model [89], [106], or modeled using a Bayesian interpretation of probability, as in the Littlewood model [107]. What matters is the establishment of the qualitative aspects of failure, particularly the full range of component failure modes and the error processes that lead to failure [88].
[Figure 2-1 labels: fault, error, and failure, linked by activation, propagation, and causation; faults classified as malicious or nonmalicious, internal or external, natural or human-made, plus other classifications.]
The combined consideration and utilization of dependability and security attributes has facilitated a better understanding of the fault-error-failure chain model and the analysis of various threats that might affect a system [8]. Thus, the traditional definition of failure is increasingly applied to security breaches [9], [98], [108], but in an inclusive analogy covering both security breaches and traditional system failures. The subjectivity of security in general and the requirement to fulfill both dependability and security attributes have led to considering these attributes from a probabilistic perspective, as faults and subsequent failures can never be totally eliminated from any system [8]. The reliability characteristics of security systems can consequently be defined and analyzed using the reliability characteristics of their components, i.e., security controls, given an appropriate system abstraction and failure logic.
To analyse failure data, two approaches are generally considered: parametric and nonparametric [103], [109]. Parametric analysis involves first choosing a probability distribution and then estimating its parameters to fit the available data. The choice of a particular distribution is usually based on similar previous tests or on the nature of the underlying phenomenon. Probability plotting is used to estimate distribution parameters and represent them graphically. Nonparametric analysis, on the other hand, involves techniques such as constructing histograms and calculating sample statistics (e.g., sample mean and variance). In this case, no particular assumptions are made about the underlying distribution, although these tools often provide enough information to allow a suitable distribution to be selected afterwards.
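As a minimal sketch of the two approaches, the Python fragment below treats a small set of hypothetical times to failure (illustrative values, not taken from any study) first nonparametrically, through sample statistics and a coarse histogram, and then parametrically, by assuming an exponential distribution and estimating its single parameter by maximum likelihood. The function names are assumptions made for this example, and the estimator shown applies only to complete (uncensored) data.

```python
import statistics
from collections import Counter

# Hypothetical times to failure in hours (illustrative values only).
failure_times = [12.0, 30.0, 45.0, 70.0, 110.0, 150.0, 210.0, 333.0]

def nonparametric_summary(times, bin_width):
    """Describe the data without assuming any underlying distribution."""
    # Each time is assigned to the lower edge of the bin it falls in.
    histogram = Counter(int(t // bin_width) * bin_width for t in times)
    return {
        "mean": statistics.mean(times),
        "variance": statistics.variance(times),  # sample variance (n - 1)
        "histogram": dict(sorted(histogram.items())),
    }

def fit_exponential_rate(times):
    """Parametric step: assume an exponential model and estimate its rate.

    For complete (uncensored) data, the maximum-likelihood estimate of the
    failure rate is the number of failures divided by the total time on test.
    """
    return len(times) / sum(times)

summary = nonparametric_summary(failure_times, bin_width=100.0)
rate = fit_exponential_rate(failure_times)
print(summary)
print(f"estimated failure rate: {rate:.5f} per hour")
```

The histogram here plays the role the text describes: if the bin counts fall off roughly geometrically, an exponential model is a plausible candidate for the parametric step.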
Reliability can be increased by decreasing the hazard rate, which represents the proportion of components in service that fail per unit interval. If the hazard rate increases with time, the cumulative distribution of the time to failure is called an Increasing Failure Rate (IFR) distribution. If the hazard rate decreases with time, the cumulative distribution of the time to failure is called a Decreasing Failure Rate (DFR) distribution. If the hazard rate function is constant over time, the distribution is called a Constant Failure Rate (CFR) distribution, which leads to the study of the exponential distribution (also known as the negative exponential distribution) [49].
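For reference, the standard relations behind this classification can be written as follows. The symbols are notation assumed here, not taken from the text: f(t) is the failure density, F(t) the cumulative distribution of the time to failure, R(t) the reliability function, and h(t) the hazard rate.

```latex
\begin{align}
R(t) &= 1 - F(t), \\
h(t) &= \frac{f(t)}{R(t)} = -\frac{d}{dt}\ln R(t), \\
R(t) &= \exp\!\left(-\int_0^t h(u)\,du\right).
\end{align}
```

The last relation shows why a lower hazard rate gives a higher reliability at every time t. In the CFR case, writing the constant hazard rate as \lambda, it reduces to R(t) = e^{-\lambda t}, the exponential model.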
Many researchers consider the exponential distribution to be the most commonly used distribution in reliability theory. The main underlying assumption is that the rate at which the system, or component, fails is independent of time or use. It is therefore suitable for systems that operate, or are at least intended to operate, continuously with no significant wearout mechanisms or early defect failures. Although this assumption is not realistic for all time, the constant failure rate approximation is widely accepted even though a system, or a component, may experience some early failures or aging effects, as long as it provides a good approximation during the useful life. The effect of early failures is usually treated by quality control measures, while the magnitude of the aging effect is usually treated by continuous preventive maintenance and timely replacement policies. For explicit cases where a system's failure rate varies over time, a constant failure rate that envelops the whole failure curve might be used to ensure that it covers the whole failure variation. The use of the constant failure rate can thus be extended to many cases where it would not be the correct theoretical model [110]. The exponential distribution can therefore be applied to a wide range of systems such as aircraft and spacecraft electronics, satellite stations, communications equipment, and computer networks [111].
Nevertheless, the mathematical properties of the exponential model are unique to its definition and important to its wide application. The failure distribution is completely defined by a single parameter, the Mean Time To Failure (MTTF), often denoted by θ. The MTTF in turn determines the only distribution parameter, the failure rate, which is often denoted by λ. These two quantities define each other and are used interchangeably, since λ = 1/θ. Another useful property of the exponential model is the simple mathematical manipulation of reliability functions, as many calculations involve the integration and multiplication of exponential functions, which are easy operations [104], [111], [112]. This model is demonstrated graphically later in Section 2.2.4 in Figure 2-7 and Figure 2-8.
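These properties can be illustrated with a short sketch. It assumes the standard exponential reliability function, R(t) = exp(-rate * t), and, for the series example, statistically independent components. The function names and numeric values are hypothetical and chosen only for illustration.

```python
import math

def failure_rate_from_mttf(mttf):
    """The exponential model has one parameter: rate = 1 / MTTF."""
    return 1.0 / mttf

def reliability(rate, t):
    """Probability of surviving to time t under a constant failure rate."""
    return math.exp(-rate * t)

def series_failure_rate(rates):
    """Independent exponential components in series: the product of the
    component reliabilities is again exponential, with the rates summed."""
    return sum(rates)

# Hypothetical component MTTFs in hours.
mttfs = [1000.0, 2000.0, 4000.0]
rates = [failure_rate_from_mttf(m) for m in mttfs]
system_rate = series_failure_rate(rates)
system_mttf = 1.0 / system_rate

t = 100.0
product_of_parts = math.prod(reliability(r, t) for r in rates)
print(f"system failure rate: {system_rate:.5f} per hour")
print(f"system MTTF: {system_mttf:.1f} hours")
print(f"R_system({t:.0f} h) = {reliability(system_rate, t):.4f}")
print(f"product of component reliabilities = {product_of_parts:.4f}")
```

The last two printed values agree, which is the "easy multiplication" property in practice: multiplying exponential reliability functions amounts to adding their failure rates.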
However, time-dependent failure rate models facilitate the study of the nature of failure across time, whether the failures are infant mortality failures, random failures, or aging effect failures. These models are suitable for situations where different failure stages need to be treated and analysed explicitly. In contrast to the exponential model used for random failures, such models need at least two parameters to reflect failure behaviour. Normal and lognormal distributions are frequently used in some of these situations, but the Weibull distribution is considered the most universal and widely accepted one [109], [113].
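To make the contrast with the constant-rate model concrete, the sketch below evaluates the hazard rate of a two-parameter Weibull distribution in its common shape/scale form, h(t) = (beta / eta) * (t / eta)^(beta - 1). The parameter names beta and eta and the sample values are assumptions for this example. A shape below 1 gives a decreasing hazard (infant mortality), a shape of 1 reduces to the exponential model, and a shape above 1 gives an increasing hazard (aging).

```python
def weibull_hazard(t, beta, eta):
    """Hazard rate of a two-parameter Weibull distribution at time t > 0."""
    return (beta / eta) * (t / eta) ** (beta - 1.0)

def classify_shape(beta):
    """Map the shape parameter to the failure-rate class it implies."""
    if beta < 1.0:
        return "DFR"
    if beta > 1.0:
        return "IFR"
    return "CFR"

eta = 1000.0  # hypothetical scale parameter, in hours
for beta in (0.5, 1.0, 2.0):
    early = weibull_hazard(100.0, beta, eta)
    late = weibull_hazard(900.0, beta, eta)
    print(f"beta={beta}: {classify_shape(beta)}, "
          f"h(100)={early:.6f}, h(900)={late:.6f}")
```

A single extra parameter is thus enough to cover all three failure stages, which is one reason the Weibull form is so widely used.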
Regardless of the failure model in use, whether parametric or nonparametric, measured or estimated, an appropriate definition of failure (and hence of success) is a building block that must be in place before any operational analysis method can be put to good use.