Roozbeh Izadi-Zamanabadi
Roozbeh Izadi-Zamanabadi
Lecture Notes - Practical
Lecture Notes - Practical
Approach to Reliability, Safety,
Approach to Reliability, Safety,
and Active Fault-tolerance
and Active Fault-tolerance
December 11, 2000
December 11, 2000
Aalborg University
Aalborg University
Department of Control Engineering
Department of Control Engineering
Fredrik Bajers Vej 7C
Fredrik Bajers Vej 7C
DK-9220 Aalborg
DK-9220 Aalborg
Denmark
Table of Contents
Table of Contents
1.
1. INTINTRORODUDUCTCTIONION 44
1.1
1.1 TTermerminoinologlogyy .. . . 44
2.
2. RELRELIABIIABILITY LITY AND AND SAFSAFETYETY 66
2.1
2.1 RelReliabiabilitilityy . . . 66 2.1
2.1.1 .1 BasBasic deic definifinitiontions . . . s . . . 66 2.1.2
2.1.2 ConstaConstant nt failurfailure e rate rate model model and and the the expoexponentianential l distribdistributions utions 88 2.1
2.1.3 .3 MeaMean tn time ime betbetweeween fn failuailure wre with ith conconstastant nt faifailure lure raterate . . . 99 2.2
2.2 SafSafety . . . ety . . . 1212 2.3
2.3 HazHazard Aard Analnalysiysis s . . . 1313 2.3
2.3.1 .1 SteSteps in tps in the Hahe Hazarzard and analyalysis psis procrocess ess . . . 1414 2.3
2.3.2 .2 SysSystem motem model . . . . del . . . 1515 2.3
2.3.3 .3 HazHazard lard levevel el . . . 1515 2.4
2.4 HazarHazard anald analysis mysis models odels and teand techniqchniques ues . . . 1717 2.4
2.4.1 .1 FaiFailure lure ModMode ane and Efd Effecfect Anat Analyslysis (Fis (FMEAMEA)) . . . 1717 2.4.2
2.4.2 FailuFailure Mre Mode, ode, EffecEffect at and nd CriticaCritically Ally Analysnalysis (Fis (FMECA) . . MECA) . . . 1818 2.4.3
2.4.3 Fault TFault Tree Anaree Analysis (lysis (FTFTA) . A) . . . 1919 2.4
2.4.4 .4 RisRisk k . . . 2323 2.5
2.5 ReliabReliability anility and safd safety asety assesssessment - ament - an ovn overvieerview w . . . 2626 2.5
2.5.1 .1 RelReliabiabilitility Assy Assessessmenmentt . . . 2626 2.5
2.5.2 .2 SteSteps in rps in reliaeliabilibility asty assessessmesmentnt . . . 2626 2.5
2.5.3 .3 SafSafety Aety Assessessmssment ent . . . 2626 2.5
2.5.4 .4 SteSteps in sps in safeafety asty assessessmesments . . . . nts . . . 2626
3.
3. ACACTIVE TIVE FFAULAULTT-TOLE-TOLERANT RANT SYSTEM SYSTEM DESIGNDESIGN 2828 3.1
3.1 IntrIntroduoductioction n . . . 2828 3.2
3.2 TType oype of fauf faults . . . lts . . . 2828 3.3
3.3 HarHardwadware fre faulault-tot-tolerlerancance e . . . 2929 3.3.1
3.3.1 MechaMechanical nical and Eleand Electricactrical systel systems . ms . . . 3131 3.4
3.4 ReliabReliability coility considensiderationrations for s for standstandby conby configuratfigurations . . . ions . . . 3131 3.5
3.5 SofSoftwatware fare fault-ult-toletoleranrance . . . ce . . . 3535 3.6
3.6 ActivActive faue fault-tolelt-tolerant rant controcontrol sysl system destem design ign . . . 3636 3.6.1
3.6.1 JustifiJustificatiocation for apn for applying plying fault-fault-toleratolerance . . . nce . . . 3636
T
Tabablle e oof f CCoonntetenntts s 33
3.6.2
3.6.2 A proceduA procedure for desigre for designing actning active fauive fault-tolerlt-tolerant (conant (control)trol) sys
1. INTRODUCTION
1. INTRODUCTION
The fundamental objective of the combined safety and Reliability The fundamental objective of the combined safety and Reliability assess-ment is to identify critical items in the design and the choice of equipassess-ment ment is to identify critical items in the design and the choice of equipment that
that maymay jeopajeopardizerdize safetsafetyy or or avavailabiailabilitylity,, andand theretherebyby to to proviprovidede arguargumentsments for the selection between different options for the system.
for the selection between different options for the system.
Achieving safety and reliability has been one the prime objectives for system Achieving safety and reliability has been one the prime objectives for system designers while designing safety critical system for dec
designers while designing safety critical system for decades. With growingades. With growing environ- environ-mental awareness, concerns, and demands, the scope of the design of reliable (and mental awareness, concerns, and demands, the scope of the design of reliable (and safe) sys
safe) systems has betems has beenen enhanenhancedced to eveto evenn small comsmall componenponentsts as sensoas sensors andrs and actuaactuators.tors. In the past, the normal procedure to address the higher demand for reliability was In the past, the normal procedure to address the higher demand for reliability was to add hardware redundancy that in turn increases the production and maintenance to add hardware redundancy that in turn increases the production and maintenance costs. Active fault-tolerant design is an attempt to achieve higher redundancy while costs. Active fault-tolerant design is an attempt to achieve higher redundancy while minimizing the costs.
minimizing the costs.
In chapter 2 reliability and safety related issues are considered and described. In chapter 2 reliability and safety related issues are considered and described. The idea of introducing this chapter is to provide an overview of the concepts and The idea of introducing this chapter is to provide an overview of the concepts and methods used for reliability and safety assessment.
methods used for reliability and safety assessment.
The focus in chapter 3 is on fault-tolerance concept. Type of possible faults in The focus in chapter 3 is on fault-tolerance concept. Type of possible faults in components and customary methods for applying redundancy is described. Finally, components and customary methods for applying redundancy is described. Finally, the chapter is wrapped up by considering and describing the main subject, which is the chapter is wrapped up by considering and describing the main subject, which is a formal and consistent procedure to design active fault-tolerant systems.
a formal and consistent procedure to design active fault-tolerant systems.
1.1 Terminology
1.1 Terminology
Definition of the used notations in this report is provided in this section. Definition of the used notations in this report is provided in this section.
Availability:
Availability: The probability that a The probability that a systesystem m is is perforperforming satisfming satisfactorilactorilyy at time
at time ..
Fault-tolerance:
Fault-tolerance: Ability to tolerate and accommodate for a fault with orAbility to tolerate and accommodate for a fault with or without performance degradation.
without performance degradation.
Fail-operational:
Fail-operational: The component/unit stays operational after one failure.The component/unit stays operational after one failure.
1
1..1 1 TTeerrmmiinnoollooggy y 55
Fail-safe:
Fail-safe: TheThe compocomponent/unent/unitnit proceprocesses dsses directlyirectly to a predto a predefinedefined safesafe state (actively or passively) after one (or several) fault has state (actively or passively) after one (or several) fault has occurred.
occurred.
Hazard:
Hazard: the presence of any process abnormality which representsthe presence of any process abnormality which represents a dangerous or potentially dangerous state.
a dangerous or potentially dangerous state.
MTBF:
MTBF: Mean time between failureMean time between failure
MTTR:
MTTR: Mean time to repairMean time to repair
Reliability:
Reliability: is the characteristic of an item expressed by the probabil-is the characteristic of an item expressed by the probabil-ity that it will pe
ity that it will performrform its requirits requireded functfunctionion in the specin the specifiedified manne
manner r oveover r a a givegiven time n time period and under specifieperiod and under specified d andand assumed conditions.
assumed conditions.
Risk:
Risk: (normally) refers to the(normally) refers to thestatistical annual statistical annual frequenfrequencycy of theof the particular hazard state. Ex. once per year, or once per particular hazard state. Ex. once per year, or once per 10000 years.
10000 years.
Risk
Risk reductireduction:on: Reduce the probability of hazardous eventsReduce the probability of hazardous events
Safety:
Safety: is freedom from accident or losses.is freedom from accident or losses.
Safety
Safety integriintegrity:ty: is is the probability of the probability of a a safetsafety-relaty-related system ed system satisfsatisfactorilactorilyy performing the required safety functions under all stated performing the required safety functions under all stated conditions within a stated period of time.
conditions within a stated period of time.
Safety Integrity Level:
Safety Integrity Level: AAveveragragee proprobabbabilitilityy ofof faifailurluree toto perperformform itsits desdesignigneded funfunc- c-tion on demand (
tion on demand (SILSIL).).
Potential Hazard states, are most often identified by carrying out hazard and Potential Hazard states, are most often identified by carrying out hazard and operational studies known as
operational studies known as HAZOPHAZOPs.s.
£ £
2. RELIABILITY AND SAFETY
2. RELIABILITY AND SAFETY
2.1 Reliability
2.1 Reliability
Reliability engineering is concerned primarily with failures and failure rate Reliability engineering is concerned primarily with failures and failure rate reduc-tion. The reliability engineering approach to safety thus concentrate on failures as tion. The reliability engineering approach to safety thus concentrate on failures as cause of accidents. Reliability engineers use a variety of techniques to minimize cause of accidents. Reliability engineers use a variety of techniques to minimize component failure (Leveson 1995).
component failure (Leveson 1995).
Definition of reliability embraces the clear-cut criterion for failure, from which Definition of reliability embraces the clear-cut criterion for failure, from which we may judge at what point the system is no longer functioning properly.
we may judge at what point the system is no longer functioning properly.
2.1.1
2.1.1 Basic Basic definitionsdefinitions
Reliability is defined as the probability that a system survives for some specified Reliability is defined as the probability that a system survives for some specified period of time. It may be expressed in term of the random
period of time. It may be expressed in term of the random , the, thetime-to-system- time-to-system- failure
failure. The probability density function (PDF),. The probability density function (PDF), , has the physical meaning:, has the physical meaning: Probability that failure takes place Probability that failure takes place
at a time between
at a time between andand (2.1)(2.1) for vanishing small
for vanishing small . The cumulative distribution function (CDF),. The cumulative distribution function (CDF), has nowhas now the following meaning:
the following meaning:
(2.2) (2.2) The
Thereliabilityreliabilityis defined as:is defined as:
Probability that a system operates Probability that a system operates without failure for a length of time
without failure for a length of time (2.3)(2.3) Since a system that does not fail for
Since a system that does not fail for must fail at somemust fail at some , one get:, one get:
(2.4) (2.4) or equivalently either or equivalently either Ø Ø ´ ´ Ø Ø µ µ ´ ´ Ø Ø µ µ ¡¡ Ø Ø È È ØØ Ø Ø Ø Ø · · ¡¡ Ø Ø Ø Ø ¡¡ Ø Ø ¡¡ Ø Ø ´ ´ Ø Ø µ µ ´ ´ Ø Ø µµ È È Ø Ø Ø Ø
Probability that failure takes place Probability that failure takes place
at a time less
at a time less than or equal tothan or equal toØ Ø
Ê Ê ´ ´ Ø Ø µµ È È Ø Ø Ø Ø Ø Ø Ø Ø Ø Ø Ø Ø Ø Ø Ê Ê ´ ´ Ø Ø µµ ½ ½ ´ ´ Ø Ø µ µ
2 2..1 1 RReelliiaabbiilliitty y 77 (2.5) (2.5) or or (2.6) (2.6) Following properties are clear:
Following properties are clear:
and
and (2.7)(2.7)
One can see from equation 2.4 that reliability is the complementary cumulative One can see from equation 2.4 that reliability is the complementary cumulative dis-tribu
tribution function (CCDF), tion function (CCDF), that is,that is, . Similarly, since. Similarly, since is theis the probability that the system will fail before
probability that the system will fail before , it is often referred to as the unreli-, it is often referred to as the unreli-ability or failure probunreli-ability, i.e.,
ability or failure probability, i.e., ..
Equation 2.5 may be inverted by differentiation to give the PDF of failure times Equation 2.5 may be inverted by differentiation to give the PDF of failure times in terms of the reliability:
in terms of the reliability:
(2.8) (2.8) One can gain insight into failure mechanisms by examining the behavior of the One can gain insight into failure mechanisms by examining the behavior of the failure rate. The
failure rate. The failure ratefailure rate,, , may be defined in terms of the reliability or the, may be defined in terms of the reliability or the PDF of the time-to-failure as follows. Let
PDF of the time-to-failure as follows. Let be the probability that the systembe the probability that the system will fail at some time
will fail at some time given that it has not yet failed at timegiven that it has not yet failed at time . Thus it is. Thus it is the conditional probability
the conditional probability
(2.9) (2.9) Following the definition of conditional probability we have:
Following the definition of conditional probability we have:
(2.10) (2.10) The numerator of equation 2.10 is just an alternative way of writing the PDF, i.e.: The numerator of equation 2.10 is just an alternative way of writing the PDF, i.e.:
(2.11) (2.11) The De-numerator of equation 2.10 is just R(t) (see equation 2.3). Therefore, by The De-numerator of equation 2.10 is just R(t) (see equation 2.3). Therefore, by combining equations, one obtains:
combining equations, one obtains:
(2.12) (2.12) This quality, the failure rate, is also referred to as the
This quality, the failure rate, is also referred to as thehazard hazard orormortality ratemortality rate.. The
The mosmostt useusedd way way toto exexprepressss the the relreliabiabilitilityy andand thethe faifailurelure PDF PDF is is inin tertermsms ofof faifailurluree rate. To do this, we first eliminate
rate. To do this, we first eliminate from Eq. 2.12 by inserting Eq. 2.8 to obtainfrom Eq. 2.12 by inserting Eq. 2.8 to obtain the failure rate in terms of the reliability,
the failure rate in terms of the reliability,
Ê Ê ´ ´ Ø Ø µµ ½ ½ Ø Ø ¼ ¼ ´ ´ Ø Ø ¼ ¼ µ µ Ø Ø ¼ ¼ Ê Ê ´ ´ Ø Ø µµ ½ ½ Ø Ø ´ ´ Ø Ø ¼ ¼ µ µ Ø Ø ¼ ¼ Ê Ê ´´ ¼¼ µµ ½½ Ê Ê ´ ´ ½ ½ µµ ¼ ¼ Ê Ê ´ ´ Ø Ø µµ ½ ½ ´ ´ Ø Ø µ µ ´ ´ Ø Ø µ µ Ø Ø Ø Ø ´ ´ Ø Ø µµ ½ ½ Ê Ê ´ ´ Ø Ø µ µ ´ ´ Ø Ø µµ Ø Ø Ê Ê ´ ´ Ø Ø µ µ ´ ´ Ø Ø µ µ ´ ´ Ø Ø µ µ ¡¡ Ø Ø Ø Ø Ø Ø · · ¡¡ Ø Ø Ø Ø ´ ´ Ø Ø µ µ ¡¡ Ø Ø È È Ø Ø Ø Ø · · ¡¡ Ø Ø Ø Ø Ø Ø È È Ø Ø Ø Ø · · ¡¡ Ø Ø Ø Ø Ø Ø È È ´ ´ Ø Ø Ø Ø µ µ ´ ´ Ø Ø Ø Ø · · ¡¡ Ø Ø µ µ È È Ø Ø Ø Ø È È ´ ´ Ø Ø Ø Ø µ µ ´ ´ Ø Ø Ø Ø · · ¡¡ Ø Ø µ µ È È ØØ Ø Ø Ø Ø · · ¡¡ Ø Ø ´ ´ Ø Ø µ µ ¡¡ ØØ ´ ´ Ø Ø µµ ´ ´ Ø Ø µ µ Ê Ê ´ ´ Ø Ø µ µ ´ ´ Ø Ø µ µ
(2.13) (2.13) Then by multiplying both side of the equation by
Then by multiplying both side of the equation by and integrating between zeroand integrating between zero and
and , we obtain, we obtain
(2.14) (2.14) since
since
exponentiating the results exponentiating the results
(2.15) (2.15) The probability density function
The probability density function is obtained by inserting Eq. 2.15 into Eq.is obtained by inserting Eq. 2.15 into Eq. 2.12 and solve for
2.12 and solve for
(2.16) (2.16) The most used parameter to characterize reliability is the
The most used parameter to characterize reliability is the mean time to failuremean time to failureoror MTTF. It is just the expected or mean value of
MTTF. It is just the expected or mean value of of the failure at timeof the failure at time . Hence. Hence MTTF
MTTF (2.17)(2.17)
The MTTF may be written directly in terms of the reliability by substituting Eq. 2.8 The MTTF may be written directly in terms of the reliability by substituting Eq. 2.8 and integration by parts:
and integration by parts: MTTF
MTTF (2.18)(2.18)
The term
The term vanishes atvanishes at . Similarly, from Eq. 2.15, we see that. Similarly, from Eq. 2.15, we see that willwill decay exponentially, since the failure rate
decay exponentially, since the failure rate greater than zero. Thusgreater than zero. Thus as
as . Therefore, we have. Therefore, we have MTTF MTTF
2.1.2
2.1.2 Constant failure ratConstant failure rate model and the exponential distributie model and the exponential distributionsons
Random failures that give rise to the constant failure rate model are the most widely Random failures that give rise to the constant failure rate model are the most widely used basis for desc
used basis for describingribing reliabreliability phenoility phenomena.mena. TheyThey are definedare defined by the assumptby the assumptionion that the rate at which
that the rate at which the system fails is indepethe system fails is independentndent of its age. For of its age. For continuouslycontinuously op- op-erating systems this implies a constant failure rate.
erating systems this implies a constant failure rate.
The constant failure rate model for continuously operating systems leads to an The constant failure rate model for continuously operating systems leads to an ex-ponential distribution. Replacing the time dependent failure rate
ponential distribution. Replacing the time dependent failure rate by a constantby a constant in Eq. 2.16 yields, for the FDP,
in Eq. 2.16 yields, for the FDP,
´ ´ Ø Ø µµ ½ ½ Ê Ê ´ ´ Ø Ø µ µ Ø Ø Ê Ê ´ ´ Ø Ø µ µ Ø Ø Ø Ø Ø Ø ¼ ¼ ´ ´ Ø Ø ¼ ¼ µ µ Ø Ø ¼ ¼ ÐÐ ÒÒ Ê Ê ´ ´ Ø Ø µµ Ê Ê
´´ ¼¼ µµ ½½ . The desired expression for reliability can hence be derived by. The desired expression for reliability can hence be derived by
Ê Ê ´ ´ Ø Ø µµ ÜÜ ÔÔ Ø Ø ¼ ¼ ´ ´ Ø Ø ¼ ¼ µ µ Ø Ø ¼ ¼ ´ ´ Ø Ø µ µ ´ ´ Ø Ø µ µ :: ´ ´ Ø Ø µµ ´ ´ Ø Ø µµ ÜÜ ÔÔ Ø Ø ¼ ¼ ´ ´ Ø Ø ¼ ¼ µ µ Ø Ø ¼ ¼ Ø Ø Ø Ø ½ ½ ¼ ¼ ØØ ´ ´ Ø Ø µ µ ØØ ½ ½ ¼ ¼ Ø Ø Ê Ê Ø Ø Ø Ø ØØ Ê Ê ´ ´ Ø Ø µ µ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ½ ½ ¼ ¼ · · ½ ½ ¼ ¼ Ê Ê ´ ´ Ø Ø µ µ Ø Ø ØØ Ê Ê ´ ´ Ø Ø µ µ Ø Ø ¼ ¼ Ê Ê ´ ´ Ø Ø µ µ ´ ´ Ø Ø µ µ ØØ Ê Ê ´ ´ Ø Ø µ µ ¼ ¼ Ø Ø ½ ½ ½ ½ ¼ ¼ Ê Ê ´ ´ Ø Ø µ µ Ø Ø (2.19)(2.19) ´ ´ Ø Ø µ µ
2
2..1 1 RReelliiaabbiilliitty y 99
(2.20) (2.20) Similarly, the CDF becomes
Similarly, the CDF becomes
(2.21) (2.21) and reliability may be written as (see Eq. 2.4)
and reliability may be written as (see Eq. 2.4)
(2.22) (2.22) The MTTF and the variance of the failure times are also given in terms of The MTTF and the variance of the failure times are also given in terms of .. From Eq. 2.19 we obtain
From Eq. 2.19 we obtain
MTTF MTTF and the variance is found to be:
and the variance is found to be:
(2.24) (2.24)
2.1.3
2.1.3 Mean time between failMean time between failure with constant fure with constant failure rateailure rate
In many situations failure does not constitute the end of life. Rather, the system In many situations failure does not constitute the end of life. Rather, the system is immediately replaced or repaired and operation continues. In such situations a is immediately replaced or repaired and operation continues. In such situations a number of new pieces of information become important. We may want to know the number of new pieces of information become important. We may want to know the expected number of failures over some specified period of time in order to estimate expected number of failures over some specified period of time in order to estimate the cost of replacement parts. More important, it may be necessary to estimate the the cost of replacement parts. More important, it may be necessary to estimate the probability that more than a specific number of failures
probability that more than a specific number of failures will occur over a periodwill occur over a period of time. Such information allows us to maintain an adequate inventory of repair of time. Such information allows us to maintain an adequate inventory of repair parts.
parts.
In modeling these situations, one restrict his/her attention to the constant failure In modeling these situations, one restrict his/her attention to the constant failure rate approximation. In this the failure rate is often given in terms of the
rate approximation. In this the failure rate is often given in terms of the mean timemean time between failure
between failure (MTBF), as opposed to the mean time to failure or MTTF. In fact(MTBF), as opposed to the mean time to failure or MTTF. In fact they are both the same number if, when a system fails it is assumed to be repaired they are both the same number if, when a system fails it is assumed to be repaired immediately to an as-good-as-new condition.
immediately to an as-good-as-new condition. W
We first conside first considerer the times at whthe times at whichich the failuthe failuresres take platake place,ce, and therand thereforeefore the numbthe numberer that occur within any given span of the time. Suppose that we let
that occur within any given span of the time. Suppose that we let be a discretebe a discrete random variable representing the number of failures that take place between random variable representing the number of failures that take place between and a time
and a time . Let. Let
(2.25) (2.25) be the probability that exactly
be the probability that exactly
we start counting failures at time zero, we must have we start counting failures at time zero, we must have
(2.26) (2.26) (2.27) (2.27) ´ ´ Ø Ø µµ Ø Ø ´ ´ Ø Ø µµ ½ ½ Ø Ø Ê Ê ´ ´ Ø Ø µµ Ø Ø ½ ½ (2.23)(2.23) ¾ ¾ ½ ½ ¾ ¾ Æ Æ Ò Ò Ø Ø ¼ ¼ Ø Ø Ô Ô Ò Ò ´ ´ Ø Ø µµ È È Ò Ò Ò Ò Ø Ø
Ò Ò failures have taken place before timefailures have taken place before time . Clearly, if Ø Ø . Clearly, if
Ô Ô ¼ ¼ ´´ ¼¼ µµ ½½ Ô Ô Ò Ò ´´ ¼¼ µµ ¼¼ Ò Ò ¼ ¼ ½ ½ ¾ ¾ ½ ½
In addition, at any time In addition, at any time
(2.28) (2.28) For small
For small , let failure, let failure be the be the probaprobability that thebility that the th failure will th failure will taketake place during the time increment between
place during the time increment between andand , given that exactly, given that exactly failuresfailures have
have taken place taken place beforebefore timetime . Then the probab. Then the probabilityility that no failure wilthat no failure will occurl occur duringduring before
before my be written asmy be written as
(2.29) (2.29) Then noting that
Then noting that
(2.30) (2.30) we obtain the simple differential equation
we obtain the simple differential equation
(2.31) (2.31) Using the initial condition, Eq. 2.26, we find
Using the initial condition, Eq. 2.26, we find
(2.32) (2.32) With
With determined, we may now solve successively fordetermined, we may now solve successively for in the following manner.We first observe that if
in the following manner.We first observe that if failures have taken place beforefailures have taken place before time
time , the probab, the probabilityility that thethat the th failure will take place betweenth failure will take place between andand is
is . Therefore, since this transition probability is independent of the number of . Therefore, since this transition probability is independent of the number of previous failures, we may write
previous failures, we may write
The last term accounts for the probability that no failure takes place during
The last term accounts for the probability that no failure takes place during . For. For sufficiently small
sufficiently small we can ignore the possibility of two or more failures takingwe can ignore the possibility of two or more failures taking place.
place.
Using the definition of the derivative once again, we may reduce Eq.
Using the definition of the derivative once again, we may reduce Eq. ????to the dif-to the dif-ferential equation
ferential equation
(2.34) (2.34) with the following solution
with the following solution
½ ½ Ò Ò ¼ ¼ Ô Ô Ò Ò ´ ´ Ø Ø µµ ½ ½ ¡¡ Ø Ø ¡¡ Ø Ø ´ ´ Ò Ò ·· ½½ µ µ Ø Ø Ø Ø · · ¡¡ Ø Ø Ò Ò Ø Ø
¡¡ Ø Ø isis . From this we see that the probability that no failures have occurred. From this we see that the probability that no failures have occurred ½ ½ ¡¡ Ø Ø Ø Ø · · ¡¡ Ø Ø Ô Ô ¼ ¼ ´ ´ Ø Ø · · ¡¡ Ø Ø µµ ´´ ½½ ¡¡ Ø Ø µ µ Ô Ô ¼ ¼ ´ ´ Ø Ø µ µ Ø Ø Ô Ô Ò Ò ´ ´ Ø Ø µµ ÐÐ ÑÑ ¡¡ Ø Ø ¼ ¼ Ô Ô Ò Ò ´ ´ Ø Ø · · ¡¡ Ø Ø µ µ Ô Ô ¼ ¼ ´ ´ Ø Ø µ µ ¡¡ Ø Ø Ø Ø Ô Ô ¼ ¼ ´ ´ Ø Ø µµ Ô Ô ¼ ¼ ´ ´ Ø Ø µ µ Ô Ô ¼ ¼ ´ ´ Ø Ø µµ Ø Ø Ô Ô ¼ ¼ ´ ´ Ø Ø µ µ Ô Ô Ò Ò ´ ´ Ø Ø µ µ Ò Ò ½ ½ ¾ ¾ ¿ ¿ Ò Ò Ø Ø ´ ´ Ò Ò ·· ½½ µ µ Ø Ø Ø Ø · · ¡¡ Ø Ø ¡¡ Ø Ø Ô Ô Ò Ò ´ ´ Ø Ø · · ¡¡ Ø Ø µµ ¡¡ Ø Ø µ µ Ô Ô Ò Ò ½ ½ ´ ´ Ø Ø µµ ·· ´´ ½½ ¡¡ Ø Ø µ µ Ô Ô Ò Ò ´ ´ Ø Ø µ µ (2.33)(2.33) ¡¡ Ø Ø ¡¡ Ø Ø Ø Ø Ô Ô Ò Ò ´ ´ Ø Ø µµ Ô Ô Ò Ò ´ ´ Ø Ø µµ · · Ô Ô Ò Ò ½ ½ ´ ´ Ø Ø µ µ Ø Ø Ô Ô Ò Ò ´ ´ Ø Ø µ µ Ô Ô Ò Ò ´´ ¼¼ µµ Ø Ø Ô Ô Ò Ò ½ ½ ´ ´ Ø Ø ¼ ¼ µ µ Ø Ø ¼ ¼ Ø Ø ¼ ¼ (2.35)(2.35)
2
2.1 .1 RReeliliaabbililitity y 1111
But since from Eq. 2.27 But since from Eq. 2.27
(2.36) (2.36) This recursive relationship allows us to calculate the
This recursive relationship allows us to calculate the , insert, insert Eq. 2.32 on the right-hand side and carry out the integral to obtain
Eq. 2.32 on the right-hand side and carry out the integral to obtain
(2.37) (2.37) and for all
and for all the Eq. 2.36 is satisfied bythe Eq. 2.36 is satisfied by
(2.38) (2.38) and these quantities in turn satisfy the initial conditions given by Eqs. 2.26 and and these quantities in turn satisfy the initial conditions given by Eqs. 2.26 and 2.27.
2.27.
The probabilities
The probabilities are the same as the Poisson distribution withare the same as the Poisson distribution with . Thus. Thus it is possible to determine the mean and the variance of the number
it is possible to determine the mean and the variance of the number of eventsof events occurring over a time span
occurring over a time span . Thus the expected number of failures during time. Thus the expected number of failures during time isis
and the variance of and the variance of isis
(2.40) (2.40) Since
Since are the probability mass functions of a discrete variableare the probability mass functions of a discrete variable , we must, we must have,
have,
(2.41) (2.41) The number of failures can be related to the mean time between failures by
The number of failures can be related to the mean time between failures by
MTBF
MTBF (2.42)(2.42)
This is the expression that relates
This is the expression that relates and the MTBF assuming a constant failureand the MTBF assuming a constant failure rate.
rate.
In general, the MTBF may be determined from In general, the MTBF may be determined from
MTBF
MTBF (2.43)(2.43)
where
where , the number of failures, is large., the number of failures, is large. The probability that more than
The probability that more than failures have occurred isfailures have occurred is
Ô Ô Ò Ò ´´ ¼¼ µµ ¼¼ , we have, we have Ô Ô Ò Ò ´ ´ Ø Ø µµ Ø Ø Ø Ø ¼ ¼ Ô Ô Ò Ò ½ ½ ´ ´ Ø Ø ¼ ¼ µ µ Ø Ø ¼ ¼ Ø Ø ¼ ¼ Ô Ô
Ò Ò successively. Forsuccessively. For
Ô Ô ½ ½ Ô Ô ½ ½ ´ ´ Ø Ø µµ ØØ Ø Ø Ò Ò ¼ ¼ Ô Ô ½ ½ ´ ´ Ø Ø µµ ´ ´ Ø Ø µ µ Ò Ò Ò Ò Ø Ø Ô Ô Ò Ò ´ ´ Ø Ø µ µ Ø Ø Ò Ò Ø Ø Ø Ø Ò Ò Ò Ò ØØ (2.39)(2.39) Ò Ò ¾ ¾ Ò Ò ØØ Ô Ô Ò Ò ´ ´ Ø Ø µ µ Ò Ò ½ ½ Ò Ò ¼ ¼ Ô Ô Ò Ò ´ ´ Ø Ø µµ ½ ½ Ò Ò Ø Ø Ò Ò Ø Ø Ò Ò Ò Ò Æ Æ
(2.44) (2.44) Instead of writing this infinite series, however, we may use Eq. 2.41 to write
Instead of writing this infinite series, however, we may use Eq. 2.41 to write
(2.45) (2.45)
2.2 Safety
2.2 Safety
While using various techniques are often effective in increasing reliability, they do While using various techniques are often effective in increasing reliability, they do not necessarily increase safety. Under some conditions
not necessarily increase safety. Under some conditions they may even reduce safetythey may even reduce safety.. For example, increasing
For example, increasing the burst-pressure to working-presthe burst-pressure to working-pressuresure ratio of a tank intro-ratio of a tank intro-duces new dangers of an explosion or chemical reaction in the event of a rapture. duces new dangers of an explosion or chemical reaction in the event of a rapture. Assuming that reliability and safety are synonymous is hence true in special cases. Assuming that reliability and safety are synonymous is hence true in special cases. In general, safety has a broader scope than failures, and failures may not In general, safety has a broader scope than failures, and failures may not compro-mise safety. There is obviously an overlap between reliability and safety (Leveson mise safety. There is obviously an overlap between reliability and safety (Leveson 1995) (figure 2.1); manly acciden
1995) (figure 2.1); manly accidents may occur without componentts may occur without component failure - the indi-failure - the
indi-Safety
Safety
Reliability
Reliability
Fig. 2.1.
Fig. 2.1.Safety and reliability are overlapping, but are not identical.Safety and reliability are overlapping, but are not identical.
vidua
vidual componenl componentsts where operwhere operatingating exacexactly as specified or intendedtly as specified or intended,, that is, withoutthat is, without fai
failurlure.e. TheThe oppopposiositete is also trueis also true - - ComComponponententss maymay faifail l withwithoutout aa resresultultinging accaccideident.nt. Accidents may be caused by equipment operation outside the parameters and time Accidents may be caused by equipment operation outside the parameters and time limits upon which the reliability analysis are based. Therefore, a system may have limits upon which the reliability analysis are based. Therefore, a system may have high reliability and still have accidents. Safety is an emergent property that arises at high reliability and still have accidents. Safety is an emergent property that arises at the system level when components are operating together.
the system level when components are operating together.
For a safety-related system all aspects of reliability, availability, maintainability, For a safety-related system all aspects of reliability, availability, maintainability, and safety (RAMS) have to be considered as they are relevant to the responsibility and safety (RAMS) have to be considered as they are relevant to the responsibility of
of manufmanufactureacturers and rs and the acceptabthe acceptability of ility of the customersthe customers. . Safety and reliability areSafety and reliability are generally achieved by a combination of
generally achieved by a combination of fault prevention fault prevention È È Ò Ò Æ Æ ½ ½ Ò Ò Æ Æ ·· ½ ½ ´ ´ Ø Ø µ µ Ò Ò Ò Ò Ø Ø È È Ò Ò Æ Æ ½ ½ Æ Æ Ò Ò ¼ ¼ ´ ´ Ø Ø µ µ Ò Ò Ò Ò Ø Ø Á Á
2.
2.3 3 HHazazarard d AnAnalalysysis is 1313
fault tolerance fault tolerance
fault detection and diagnosis fault detection and diagnosis
autonomous supervision and protection autonomous supervision and protection There are two stages for fault prevention:
There are two stages for fault prevention: fault avoidancefault avoidance andand fault removalfault removal.. Fault avoidance and removal has to be accomplished during the design and test Fault avoidance and removal has to be accomplished during the design and test phase.
phase.
2.3 Hazard Analysis
2.3 Hazard Analysis
Hazard analysis is the essential part of any effective safety program, providing Hazard analysis is the essential part of any effective safety program, providing vis-ibility and coordination (Leveson 1995) (see figure 2.2). Information flows both ibility and coordination (Leveson 1995) (see figure 2.2). Information flows both
Hazard
Hazard
analysis
analysis
Design
Design
Management
Management
Maintenance
Maintenance
Operations
Operations
Training
Training
QA
QA
Test
Test
Fig. 2.2.
Fig. 2.2.Hazard analysis provides visibility and coordination.Hazard analysis provides visibility and coordination.
outward from and back into the hazard analysis. The outward information can help outward from and back into the hazard analysis. The outward information can help designers perform
designers perform trade studies atrade studies and eliminate or nd eliminate or mitigatemitigate hazards and hazards and can help can help qual- qual-ity assurance identify qualqual-ity categories, acceptance tests, required inspections, and ity assurance identify quality categories, acceptance tests, required inspections, and components that need special care. Hazard analysis is a necessary step before components that need special care. Hazard analysis is a necessary step before haz-ards can be eliminated or controlled through design or operational procedures. ards can be eliminated or controlled through design or operational procedures. Per-forming this analysis, however, does not ensure safety.
forming this analysis, however, does not ensure safety. Hazar
Hazard d analyanalysis is sis is perforperformed continuomed continuously througusly throughout the hout the life of life of the system, the system, withwith increasing depth and extent as more information is obtained about the system increasing depth and extent as more information is obtained about the system de-sign. As the project progress, the use of hazard analysis will vary - such as sign. As the project progress, the use of hazard analysis will vary - such as identi-fying hazards, testing basic assumptions and various scenarios about the system fying hazards, testing basic assumptions and various scenarios about the system op-eration, specifying operational and maintenance tasks, planning training programs, eration, specifying operational and maintenance tasks, planning training programs, evaluation potential changes, and evaluating the assumptions and foundation of the evaluation potential changes, and evaluating the assumptions and foundation of the models as the system is used and feedback is obtained.
models as the system is used and feedback is obtained.
Á Á
Á Á
The
The variovariousus hazarhazardd analyanalysissis techntechniquesprovideiquesprovide formaliformalismssms forfor systesystematizimatizingng knoknowl- wl-edge, draw attention to gaps in knowlwl-edge, help prevent important consideration edge, draw attention to gaps in knowledge, help prevent important consideration being missed, and aid in reasoning about systems and determining where being missed, and aid in reasoning about systems and determining where improve-ments are most likely to be effective.
ments are most likely to be effective.
The way of choosing the appropriate hazard analysis process depends on the The way of choosing the appropriate hazard analysis process depends on the goals or purposes of the hazard analysis. The goals of safety analysis are related to goals or purposes of the hazard analysis. The goals of safety analysis are related to three general tasks:
three general tasks: 1.
1. DevelopmentDevelopment: the examination of a new system to identify and assess potential: the examination of a new system to identify and assess potential hazards and eliminate or control them.
hazards and eliminate or control them. 2.
2. Operational managementOperational management: the examination of an existing system to identify: the examination of an existing system to identify and assess hazards in order to improve the level of safety, to formulate a safety and assess hazards in order to improve the level of safety, to formulate a safety manag
managementement policpolicy,y, to to traintrain persopersonnel,nnel, andand to to increaincreasese motivmotivationation forfor efficefficienciencyy and safety of operation.
and safety of operation. 3.
3. CertificationCertification: the examination of a planned or existing system to demonstrate: the examination of a planned or existing system to demonstrate its level of safety in order to be accepted by the authorities or the public. its level of safety in order to be accepted by the authorities or the public. The first two tasks have the common goal of making the system safer (by using The first two tasks have the common goal of making the system safer (by using appropriate techniques to engineer safer systems), while the third task has the goal appropriate techniques to engineer safer systems), while the third task has the goal of con
of convincingvincing managementmanagement or goveor governmentrnment licenser that licenser that an existing an existing design or design or systemsystem is safe (which is also called
is safe (which is also called risk assessmentrisk assessment).).
2.3.1
2.3.1 Steps in the Hazard Steps in the Hazard analysis procesanalysis processs
A hazard analysis consists of the following steps: A hazard analysis consists of the following steps:
1.
1. Definition oDefinition of objectivf objectives.es. 2.
2. DefinitDefinition of sion of scope.cope. 3.
3. DefinitDefinition and ion and descrdescription of iption of the system, system boundarithe system, system boundaries, and informationes, and information to be used in the analysis.
to be used in the analysis. 4.
4. IdentiIdentificatiofication of hazardsn of hazards.. 5.
5. Collection of data (such as hCollection of data (such as historical data, related standards istorical data, related standards and codes of prac-and codes of prac-tice, scientific tests and experimental results).
tice, scientific tests and experimental results). 6.
6. QualitaQualitative rankintive ranking of g of hazahazards based on rds based on their potentiatheir potential l effeeffects (immediate orcts (immediate or protracted) and perhaps their likelihood (qualitative or quantitative).
protracted) and perhaps their likelihood (qualitative or quantitative). 7.
7. Identification oIdentification of caused f caused factors.factors. 8.
8. Identification of preventivIdentification of preventive or corrective e or corrective measures and general design criteriameasures and general design criteria and controls.
and controls. 9.
9. EvalEvaluation of uation of prevpreventientive or ve or correccorrective meastive measures, includinures, including estimates of g estimates of cost.cost. Relative cost ranking may be adequate.
Relative cost ranking may be adequate. 10.
10. VVerification that controls have been implemented correctly and are effective.erification that controls have been implemented correctly and are effective. 11.
11. Quantification of selected, unresolved Quantification of selected, unresolved hazards, including probability ohazards, including probability of occur-f occur-rence, economic impact, potential losses, and costs of preventive or corrective rence, economic impact, potential losses, and costs of preventive or corrective measures.
measures. 12.
2.
2.3 3 HHazazarard d AnAnalalysysis is 1515
13.
13. Feedback and eFeedback and evaluation of operational exvaluation of operational experience.perience.
Not all of the steps are needed to be performed for every system and for every Not all of the steps are needed to be performed for every system and for every hazard. For new systems designs, usually the first 10 steps are necessary.
hazard. For new systems designs, usually the first 10 steps are necessary.
2.3.2
2.3.2 System System modelmodel
Every system analysis requires some type of model of the system, which may Every system analysis requires some type of model of the system, which may change from a fuzzy idea in the analysts mind to a complex and carefully specified change from a fuzzy idea in the analysts mind to a complex and carefully specified mathematical model. A model is a representation of a system that can be mathematical model. A model is a representation of a system that can be manipu-lated in order to obtain information about the system itself. Modeling any system lated in order to obtain information about the system itself. Modeling any system requires a description of the following:
requires a description of the following:
Structure
Structure: : The interrelaThe interrelationshtionship of ip of the parts the parts along some dimensioalong some dimension(s), such asn(s), such as space, time, relative importance, and logic and decision making properties: space, time, relative importance, and logic and decision making properties:
Distinguishing
Distinguishing qualitiesqualities:: a a qualitaqualitativetive descdescriptionription of of the the particuparticularlar varivariablesables andand parameters the characterize the system and distinguish it from similar structure. parameters the characterize the system and distinguish it from similar structure.
Magnitude
Magnitude, , probaprobabilitybility, , and and timetime: a description of the variables and parameters: a description of the variables and parameters with the distinguishing qualities in terms of their magnitude or frequency over with the distinguishing qualities in terms of their magnitude or frequency over time and their degree of certainty for the situations or conditions of interest. time and their degree of certainty for the situations or conditions of interest.
2.3.3
2.3.3 Hazard Hazard levellevel
The hazard category or level is characterized by (1)
The hazard category or level is characterized by (1) severityseverity(sometimes also called(sometimes also called
damage
damage) and (2)) and (2)likelihood likelihood (or(orfrequencyfrequency) of occurrence.) of occurrence. Hazard level is often sHazard level is often spec- pec-ifie
ified in formd in form of aof ahazard criticality index matrixhazard criticality index matrixto aid in prioritizationto aid in prioritization as it is as it is shoshownwn in figure 2.3. Hazard criticality index matrix combines hazard severity and hazard in figure 2.3. Hazard criticality index matrix combines hazard severity and hazard probability/frequency.
probability/frequency.
Hazard consequence classes/categories.
Hazard consequence classes/categories. Categorizing the hazard severity reflectsCategorizing the hazard severity reflects worst possible consequences. Different categories are used depending on the worst possible consequences. Different categories are used depending on the indus-try or even the used system. Some examples are:
try or even the used system. Some examples are:
Catastrophic – Critical – Marginal – Negligible Catastrophic – Critical – Marginal – Negligible or
or
Major – Medium – Minor Major – Medium – Minor or simply
or simply
Category 1 – Category 2 – Category 3. Category 1 – Category 2 – Category 3. Consequence classes can vary from
Consequence classes can vary from AcceptableAcceptabletoto UnacceptableUnacceptableas it is shown inas it is shown in figu
figurere 2.32.3.. SeSeveverityrity ofof hazhazardard (fo(forr aa gigivevenn proproducductionplanttionplant)) is is conconsidsidereeredd withwith regregardard to:
to:
Environmental damages Environmental damages Production line damages Production line damages Lost production Lost production £ £ £ £ £ £ £ £ £ £ £ £
Frequen Frequency cy of of
occurrence occurrence Consequence Consequence class class Major Major Medium Medium Minor Minor Low
Low MediumMedium HighHigh
Unacceptable Unacceptable high risk high risk Acceptable Acceptable low risk low risk Fig. 2.3.
Fig. 2.3.The hazard level shown in form of hazard criticality index matrix: hazard conse-The hazard level shown in form of hazard criticality index matrix: hazard conse-quence classes versus frequency of occurrence
quence classes versus frequency of occurrence
Lost workforce Lost workforce Lost company image Lost company image Insurance (premium) Insurance (premium) The basis for defining
The basis for defining hazard consequenhazard consequencece classes is the companclasses is the companyy policy in general.policy in general. Hence, definition of hazard consequence classes are not application specific, i.e. Hence, definition of hazard consequence classes are not application specific, i.e. different companies can define different consequence classes for the same case. different companies can define different consequence classes for the same case.
Hazard likelihood/frequency of occurrence.
Hazard likelihood/frequency of occurrence. The hazard likelihood of occurrenceThe hazard likelihood of occurrence can be specified either qualitatively or quantitatively. During the design phase of can be specified either qualitatively or quantitatively. During the design phase of a system, due to the lack of detailed information, accurate evaluation of hazard a system, due to the lack of detailed information, accurate evaluation of hazard likelihood is not possible. Qualitative evaluation of likelihood is usually the best likelihood is not possible. Qualitative evaluation of likelihood is usually the best that can be done.
that can be done.
Accepted frequencies of occurrence.
Accepted frequencies of occurrence. The basis for definingThe basis for defining accepted frequenciesaccepted frequencies of
of occurreoccurrencencedepends on the company policy on the issue. So, the definition of ac-depends on the company policy on the issue. So, the definition of ac-cepte
cepted frequd frequencieenciess of occof occurrencurrencee is notis not appliapplicationcation specispecific (cafic (can van varyry fromfrom compacompanyny to company). Examples of ranking of frequencies of occurrence are:
to company). Examples of ranking of frequencies of occurrence are:
Frequent – Probable – Occasional – Remote – Improbable - Impossible Frequent – Probable – Occasional – Remote – Improbable - Impossible or
or
High – Medium – Low High – Medium – Low By
By spespecifcifyinyingg hazhazardard conconseqsequenuencece catcategegorieoriess andand freqfrequenuencieciess ofof occoccurreurrencence oneone cancan draw the hazard criticality index matrix as shown in figure 2.3.
draw the hazard criticality index matrix as shown in figure 2.3.
Specifying the acceptable Hazard (risk) level is management’s responsibility Specifying the acceptable Hazard (risk) level is management’s responsibility .. This corresponds to drawing the thick line in the figure. Following issues have This corresponds to drawing the thick line in the figure. Following issues have im-pact on how to specify the acceptable risk levels:
pact on how to specify the acceptable risk levels:
£ £
£ £
2.
2.4 4 HaHazarzard d ananalyalysisis s modmodels els and and tectechnhniquiques es 1717
National regulations National regulations
Application of methods specified in standards, Application of methods specified in standards,
Following the best practice applied by major industrial companies, Following the best practice applied by major industrial companies,
By drawing comparison between socially accepted risks (e.g. risk of driving by By drawing comparison between socially accepted risks (e.g. risk of driving by car, airplane,...),
car, airplane,...),
Policy to propagate the company image. Policy to propagate the company image.
2.4
2.4 Hazard analys
Hazard analysis models and technique
is models and techniquess
There exists various methods for performing hazard analysis, each having different There exists various methods for performing hazard analysis, each having different coverage and validity. Hence, several may be required during the life of the project. coverage and validity. Hence, several may be required during the life of the project. It is important to notice that none of these methods alone is superior to all others for It is important to notice that none of these methods alone is superior to all others for every objective and even applicable to all type of systems. Furthermore, very little every objective and even applicable to all type of systems. Furthermore, very little validation of these techniques has been done. The most used techniques, which are validation of these techniques has been done. The most used techniques, which are listed below, are described extensively in (Leveson 1995):
listed below, are described extensively in (Leveson 1995): Checklists,
Checklists, Hazard Indices, Hazard Indices,
Fault tree analysis (FTA), Fault tree analysis (FTA),
Management Oversight and Risk Tree analysis (MORT), Management Oversight and Risk Tree analysis (MORT), Even Tree Analysis
Even Tree Analysis (ET(ETA),A),
Cause-Consequence Analysis (CCa), Cause-Consequence Analysis (CCa),
Hazard and Operability Analysis (HAZOP), Hazard and Operability Analysis (HAZOP), Interface Analysis,
Interface Analysis,
Failure Mode and Effect Analysis (FMEA), Failure Mode and Effect Analysis (FMEA),
Failure Mode Effect, and Critically Analysis (FMEAC), Failure Mode Effect, and Critically Analysis (FMEAC), Fault Hazard Analysis (FHA),
Fault Hazard Analysis (FHA), State Machine Hazard Analysis, State Machine Hazard Analysis,
In this report, FMEA and FMECA are described more in details, as they have been In this report, FMEA and FMECA are described more in details, as they have been applied ((Bøgh, Izadi-Zamanabadi, and Blanke 1995)).
applied ((Bøgh, Izadi-Zamanabadi, and Blanke 1995)).
2.4.1
2.4.1 FailurFailure Mode and Effect Analyse Mode and Effect Analysis (FMEA)is (FMEA)
This technique was developed by reliability engineers to permit them to predict This technique was developed by reliability engineers to permit them to predict equipment reliability. This is a hierarchical, bottom-up, inductive analysis equipment reliability. This is a hierarchical, bottom-up, inductive analysis tech-nique, where the initiating events are failures of individual components.
nique, where the initiating events are failures of individual components.
The objectives of FMEA are as follows (Russomanno, Bonnell, and Bowles 1994) The objectives of FMEA are as follows (Russomanno, Bonnell, and Bowles 1994)
–
– to provide a systematic examination of potential system failures;to provide a systematic examination of potential system failures;
–
– to analyze the effect of each failure mode on system operation;to analyze the effect of each failure mode on system operation;
–
– to access the safety of various systems or components, andto access the safety of various systems or components, and
–
– to identify corrective actions or design modifications.to identify corrective actions or design modifications.
£ £ £ £ £ £ £ £ £ £ µ µ µ µ µ µ µ µ µ µ µ µ µ µ µ µ µ µ µ µ µ µ µ µ
Identify
Identify IndenturIndenturee Level Level
Define Each Item Define Each Item
Define Ground Define Ground Rules Rules Identify Failure Identify Failure Modes Modes
Determine effect of Determine effect of Each Item Failure Each Item Failure
Classify Failures Classify Failures
Identify
Identify DetectioDetectionn Methods Methods Identify Identify Compensation Compensation Fig. 2.4.
Fig. 2.4.Major logical steps in the Major logical steps in the FMEA analysiFMEA analysiss
The
The anaanalyslysisis conconsidsidersers thethe efeffecfectsts ofof aa sinsinglegle itemitem faifailurluree onon ovoveraerallll syssystemtem opeoperatirationon and spans a myriad of applications, including electrical, mechanical, hydraulic, and and spans a myriad of applications, including electrical, mechanical, hydraulic, and software domains. Before the FMEA analysis begins, a thorough understanding of software domains. Before the FMEA analysis begins, a thorough understanding of the system components, their relationships, and failure modes must be determined the system components, their relationships, and failure modes must be determined and represented in a formalized manner. The indenture level of the analysis (i.e. the and represented in a formalized manner. The indenture level of the analysis (i.e. the item levels that
item levels that describe the relatidescribe the relative complexityve complexity of assembly or of assembly or function)function) must coin-must coin-cide with the stage of design. The indenture level progresses from the most general cide with the stage of design. The indenture level progresses from the most general description; then, properties are added and the indenture level becomes more description; then, properties are added and the indenture level becomes more spe-cialized. The FMEA must be conducted at multiple indenture level; hence a failure cialized. The FMEA must be conducted at multiple indenture level; hence a failure mod
modee can becan be ideidentifintifieded andand tractraceded to the most speto the most speciacializlizeded assassembemblyly forfor easease of e of reprepairair or design modification. Figure 2.4 outlines the major logical steps in the analysis. or design modification. Figure 2.4 outlines the major logical steps in the analysis. The possibl
The possible problems with e problems with the applicathe application of tion of FMEA (also FMECA) are FMEA (also FMECA) are includincludeded in the following:
in the following: 1.
1. Engineers havEngineers have insufficient time for propee insufficient time for proper analysisr analysis 2.
2. AnalysAnalysts have a ts have a poor understpoor understandinanding of g of FMEA and designers are inadequaFMEA and designers are inadequatelytely trained in FMEA.
trained in FMEA. 3.
3. FMEA is perceived as a FMEA is perceived as a difficult, laborious, and mundifficult, laborious, and mundane task.dane task. 4.
4. FMEA knowledge FMEA knowledge tends to be restricted to small groups of spetends to be restricted to small groups of specialists.cialists. 5.
5. ComComputputerierized aids zed aids are needeare needed d to to decdecrearease se the efforthe effort t reqrequireuired d to to proproducduce e aa FMEA.
FMEA.
2.4.2
2.4.2 FailurFailure Mode, Effect and Critically Analye Mode, Effect and Critically Analysis (FMECA)sis (FMECA)
Failure Mode, Effect and Critically Analysis (FMECA) is basically a FMEA with Failure Mode, Effect and Critically Analysis (FMECA) is basically a FMEA with more detailed analysis of the criticality of the failure. In this analysis criticalities more detailed analysis of the criticality of the failure. In this analysis criticalities or priorities are assigned to failure mode effects. Figure 2.5 shows an example of or priorities are assigned to failure mode effects. Figure 2.5 shows an example of tables used in this analysis.
2.
2.4 4 HaHazarzard d ananalyalysisis s modmodels els and and tectechnhniquiques es 1919
Failure Modes an
Failure Modes and Effects Critd Effects CriticallyicallyAnalysisAnalysis S
Suubbssyysstteem m PPrreeppaarreed d bbyy DateDate
Item
Item FailureFailure
Modes
Modes Cause of FailureCause of Failure
Possible
Possible
Effects
Effects PrProbob. . LeLevevell
Possible action to Reduce
Possible action to Reduce
Failure Rate or Effects
Failure Rate or Effects
Fig. 2.5.
Fig. 2.5.A simple FMECA schemeA simple FMECA scheme
2.4.3
2.4.3 Fault TFault Tree Analysis (FTree Analysis (FTA)A)
FTA is primarily a means for analyzing causes of hazards, not identifying hazards. FTA is primarily a means for analyzing causes of hazards, not identifying hazards. The top event in the tree must have been foreseen and thus identified by other The top event in the tree must have been foreseen and thus identified by other tech-niq
niquesues.. FTFTA is,A is, henhence,ce, a top-da top-dowownn seasearchrch metmethodhod.. OncOncee thethe tretreee is consis constructructedted,, it canit can be written
be written as Boolean expressias Boolean expression and on and simplisimplified to fied to show the specific combinatioshow the specific combinationn of basic events that are necessary and sufficient to cause the hazardous event. of basic events that are necessary and sufficient to cause the hazardous event.
Fault-tree evaluation by cut sets.
Fault-tree evaluation by cut sets. The direct evaluation procedures just discussedThe direct evaluation procedures just discussed allo
allow w us us to assess fault to assess fault trees with trees with relatirelatively few branches and basic eventsvely few branches and basic events. . WhenWhen larger trees are considered, both evaluation and interpretation of the results become larger trees are considered, both evaluation and interpretation of the results become more difficult and digital computer codes are invariably employed. Such codes are more difficult and digital computer codes are invariably employed. Such codes are usually formulated in terms of the minimum cut-set methodology discussed in this usually formulated in terms of the minimum cut-set methodology discussed in this section. There are at least two reasons for this. First, the techniques lend themselves section. There are at least two reasons for this. First, the techniques lend themselves well to the computer
well to the computer algorithms, and second, from them a algorithms, and second, from them a good deal of intermediategood deal of intermediate information can be obtained concerning the combination of component failures that information can be obtained concerning the combination of component failures that are pertinent to improvements in system design and operations.
are pertinent to improvements in system design and operations.
The discussion that follows is conveniently divided into qualitative and quantitative The discussion that follows is conveniently divided into qualitative and quantitative analysis. In qualitative analysis information about the logical structure of the tree analysis. In qualitative analysis information about the logical structure of the tree is used to locate weak points and evaluate and improve system design. In is used to locate weak points and evaluate and improve system design. In quantita-tive analysis the same objecquantita-tives are taken further by studying the probabilities of tive analysis the same objectives are taken further by studying the probabilities of component failures in relation to system design.
component failures in relation to system design.
Qualitativ
Qualitative e analysis.analysis. In the following the idea of minimum cut-sets is introducedIn the following the idea of minimum cut-sets is introduced and then it is related to th qualitative evaluation of of fault trees. We then and then it is related to th qualitative evaluation of of fault trees. We then dis-cuss briefly how the minimum cut sets are determined for large fault trees. Finally, cuss briefly how the minimum cut sets are determined for large fault trees. Finally, we discuss their use in locating system weak points, particularly possibilities for we discuss their use in locating system weak points, particularly possibilities for common-mode failures.
common-mode failures.
minimum cut-set
minimum cut-set formulatiformulation.on. A minimum cut set is defined as the smallest com-A minimum cut set is defined as the smallest com-bin
binatiationon ofof primprimaryary faifailurelures whichs which,, if theyif they occoccurur,, will cawill cause the topuse the top eveventent to occuto occurr.. It,It, therefore, is a combination (i.e. intersection) of primary failures sufficient to cause therefore, is a combination (i.e. intersection) of primary failures sufficient to cause the top event. It is the smallest combination in that all the failures must take place the top event. It is the smallest combination in that all the failures must take place for the top event to occur. If even one of the failures in the minimum cut set does for the top event to occur. If even one of the failures in the minimum cut set does not happen, the top event will not take place.
The terms minimum cut set and failure mode are sometimes used The terms minimum cut set and failure mode are sometimes used interchange-ably
ably.. HowHoweveever,r, therethere is is a a subtlesubtle diffedifferencerence that that we we shallshall obserobserveve hereahereafterfter.. InIn reliabreliabil- il-ity calculations a failure mode is a combination of component or other failures that ity calculations a failure mode is a combination of component or other failures that cause the system to fail, regardless of the consequences of the failure. A minimum cause the system to fail, regardless of the consequences of the failure. A minimum cut set is usally more restrictive, for it is the minimum combination of failures that cut set is usally more restrictive, for it is the minimum combination of failures that causes the top event as defined for a particular fault tree. If the top event is defined causes the top event as defined for a particular fault tree. If the top event is defined broadly as system failure, the two are indeed
broadly as system failure, the two are indeed interchangeable.interchangeable. Usually, Usually, howevhowever,er, thethe top event encompasses only the particular subset of system failures that bring about top event encompasses only the particular subset of system failures that bring about a particular safety hazard.
a particular safety hazard.
B
B
C
C
A
A
B
B
C
C
E4
E4
A
A
E3
E3
E
E
1
1
E
E
2
2
T
T
Fig. 2.6.Fig. 2.6.Example of a fault treeExample of a fault tree
The origin for using the term cut set may be illustrated graphically using the The origin for using the term cut set may be illustrated graphically using the re-duced fault tree in Fig. 2.7. The reliability block diagram corresponding to the tree duced fault tree in Fig. 2.7. The reliability block diagram corresponding to the tree is shown in Fig. 2.6. The idea of a cut set comes originally from the use of such is shown in Fig. 2.6. The idea of a cut set comes originally from the use of such
2.
2.4 4 HaHazarzard d ananalyalysisis s modmodels els and and tectechnhniquiques es 2121
A A B B C C Fig. 2.7.
Fig. 2.7.Minimum cut-sets on a Minimum cut-sets on a reliabilreliability block diagram.ity block diagram.
diagrams for electric apparatus, where the signal enters at the left and leaves at the diagrams for electric apparatus, where the signal enters at the left and leaves at the right. Thus the minimum cut set is the minimum number of components that must right. Thus the minimum cut set is the minimum number of components that must be cut to prevent the signal flow. There are two minimum cut sets,
be cut to prevent the signal flow. There are two minimum cut sets, consisting of consisting of components
components andand , and, and , consisting of component, consisting of component ..
a1 a1 a2 a2 b1 b1 a3 a3 a4 a4 b2 b2 c c Fig. 2.8.
Fig. 2.8.Minimum cut-setMinimum cut-sets on s on a reliability block diagram of a a reliability block diagram of a sevseven component system.en component system.
Practice
Practice: Find the minimum cut sets in Fig. 2.8.: Find the minimum cut sets in Fig. 2.8.
For larger systems, particularly those in which the primary failures appear more For larger systems, particularly those in which the primary failures appear more than once in the fault tree, the simple geometrical interpretation becomes than once in the fault tree, the simple geometrical interpretation becomes problem-atical. However, the primary characteristics of the concept remain valid. It permits atical. However, the primary characteristics of the concept remain valid. It permits the logical structure of the fault tree to be presented in a systematic way that is the logical structure of the fault tree to be presented in a systematic way that is amenable to interpretation in terms of the behavior of the minimum cut-sets. amenable to interpretation in terms of the behavior of the minimum cut-sets.
Suppose that the minimum cut sets of a system can be found. The top event, Suppose that the minimum cut sets of a system can be found. The top event, system failure, may then be expressed as the union of these sets, Thus if there are system failure, may then be expressed as the union of these sets, Thus if there are
minimum cut sets, minimum cut sets,
(2.46) (2.46) Each minimum cut set then
Each minimum cut set then consiconsists of sts of the intersecthe intersection of the tion of the minimum numbminimum number of er of primary failures required to cause the top event. For example, the minimum cut sets primary failures required to cause the top event. For example, the minimum cut sets for the system shown in Fig. 2.8 are
for the system shown in Fig. 2.8 are
Å Å ½ ½ Å Å ¾ ¾ Æ Æ Ì Ì Å Å ½ ½ Å Å ¾ ¾ ¡¡ ¡¡ ¡¡ Å Å Æ Æ