Failure Rate - Basic concepts for a better comprehension of safety

Chapter 3 Basic concepts for a better comprehension of safety

3.4 Failure Rate

Failure rate, often called “hazard rate” by reliability engineers, is a commonly used measure of reliability. It indicates the number of failures per unit time, for a quantity of components exposed to failure.

$$\text{Failure Rate} = \lambda = \frac{\text{Failures per unit time}}{\text{Number of components exposed to functional failure}}$$

It is common practice to express failure rates in failures per billion hours (1x10^-9 per hour), a unit known as FIT (Failure In Time).

A failure rate of 20 FIT means both that:

20 failures are expected in a billion working hours;

the probability of a functional failure is 20 billionths per working hour.

Example 1:

An Integrated Circuit (IC), operated under specified working conditions at 40 °C, has shown 7 functional failures over one billion operating hours.

This IC has a failure rate of 7 FIT (7x10^–9 per hr).

Example 2:

300 industrial I/O modules have been operating in a plant for 7 years. 5 failures have occurred. The average failure rate for this group of modules is:

$$\lambda = \frac{5}{300 \times 7 \times 8760} = 0.000000271798 \text{ per hour} = 272 \times 10^{-9} \text{ per hour} = 272 \text{ FIT}$$

To simplify and approximate the calculation it is possible to assume 10000 hrs per year instead of 8760:

$$\lambda = \frac{5}{300 \times 7 \times 10000} = 0.00000023809 \text{ per hour} = 238 \times 10^{-9} \text{ per hour} = 238 \text{ FIT}$$

Others prefer to use years instead of hours as the unit of time, so in the above example the result is:

$$\lambda = \frac{5}{300 \times 7} = 0.00238 \text{ per year}$$

FIT is usually the best unit for very low failure rates, while “failures per year” is preferred when dealing with high failure rates.
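The arithmetic of Example 2 can be reproduced with a few lines of Python. This is only a minimal sketch: the function names `failure_rate_per_hour` and `to_fit` are illustrative and do not come from the manual.

```python
HOURS_PER_YEAR = 8760  # calendar hours in a year, as used in Example 2


def failure_rate_per_hour(failures, units, years, hours_per_year=HOURS_PER_YEAR):
    """Average failure rate (failures per hour) of a group of identical units."""
    return failures / (units * years * hours_per_year)


def to_fit(rate_per_hour):
    """Convert a failure rate from failures per hour to FIT (1e-9 per hour)."""
    return rate_per_hour * 1e9


exact = failure_rate_per_hour(failures=5, units=300, years=7)
rough = failure_rate_per_hour(failures=5, units=300, years=7, hours_per_year=10000)
per_year = 5 / (300 * 7)  # same data, expressed in failures per year

print(round(to_fit(exact)))   # 272
print(round(to_fit(rough)))   # 238
print(round(per_year, 5))     # 0.00238
```

Rounding the year to 10000 hours lowers the result by roughly 12%, which is usually negligible compared with the uncertainty of field failure data.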

Example 3:

In the previous example the failure rate of the I/O modules is 272 FIT.

What is the MTTF of the modules?

$$\text{MTTF} = \frac{1}{272 \times 10^{-9}} = 3676470 \text{ hrs} \approx 420 \text{ yrs}$$
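As a quick check of Example 3, the conversion from FIT to MTTF can be sketched as follows. The helper name `mttf_hours_from_fit` is hypothetical, and a constant failure rate is assumed.

```python
HOURS_PER_YEAR = 8760


def mttf_hours_from_fit(fit):
    """MTTF in hours for a constant failure rate given in FIT (1e-9 per hour)."""
    return 1.0 / (fit * 1e-9)


hours = mttf_hours_from_fit(272)
years = hours / HOURS_PER_YEAR

print(int(hours))     # 3676470
print(round(years))   # 420
```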

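The reliability of an electrical device decreases exponentially in time, as previously discussed, so its failure probability grows as 1 − e^(−λ×t). When λ × t is much smaller than 1, the failure probability is approximately:

$$P \approx \lambda \times t$$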

Example 4:

A device, with exponential probability of failure, has a failure rate of 500 FIT.

What is the probability of failure in one year?

$$P \approx \lambda \times t = 0.0000005 \times 8760 \approx 0.00438 \text{ per year}$$
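To see how good the linear approximation is for Example 4, it can be compared with the exponential expression 1 − e^(−λ×t). The sketch below assumes a constant failure rate; the function names are illustrative only.

```python
import math


def failure_probability(rate_per_hour, hours):
    """Failure probability under the exponential model: 1 - exp(-rate * hours)."""
    # -expm1(-x) equals 1 - exp(-x) and stays accurate for very small x.
    return -math.expm1(-rate_per_hour * hours)


def failure_probability_approx(rate_per_hour, hours):
    """Linear approximation, valid when rate * hours is much smaller than 1."""
    return rate_per_hour * hours


rate = 500e-9  # 500 FIT expressed in failures per hour
approx = failure_probability_approx(rate, 8760)
exact = failure_probability(rate, 8760)

print(round(approx, 5))  # 0.00438
print(round(exact, 5))   # 0.00437
```

The two results differ by about 0.2%, so for failure rates of this order the approximation is more than adequate over a one-year interval.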

3.4.1 Components with constant failure rate

Figure 19 presents the famous “bathtub curve”, generally accepted to represent the reliability of electronic devices. Mechanical devices tend to have slightly different curves.

The left portion of the curve shows the impact of “infant mortality”; the right portion shows the “wear out” failures.

A constant failure rate is represented by the middle flat portion of the curve.

This assumption tends to simplify the math involved, but until the industry comes up with more accurate models and data, the simplification can be accepted.

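[Figure 19: plot of failure rate (vertical axis) against operating time (horizontal axis), with three regions from left to right: infant mortality, life (the flat middle portion), and wear out.]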

Figure 19, Example of failure rate function of time (life) (bathtub curve)

The failure rate is the reciprocal of MTTF:

$$\lambda = \frac{1}{\text{MTTF}} \qquad \text{MTTF} = \frac{1}{\lambda}$$

For repair times much smaller than success time:

$$\lambda = \frac{1}{\text{MTBF}} \qquad \text{MTBF} = \frac{1}{\lambda}$$

Example:

Supposing λ = 0.000000238 per hour (238 FIT), calculate the approximate value of MTBF, assuming 10000 hrs per year:

$$\text{MTTF (MTBF)} = \frac{1}{0.000000238 \times 10000} \approx 420 \text{ yrs}$$

All reliability analyses for a device or system are based on the device, or system, failure rate data.

In any engineering discipline, the ability to recognize the required degree of accuracy is essential. Simplifications and approximations are useful when they reduce complexity and make a model understandable.

Therefore, in many situations the judgment, and the consequent technical decisions, should rely on the experience and sound reasoning of expert engineers. More detailed calculations could be a waste of time.

One simple example: if the risk analysis for a specific SIF of a SIS has indicated a required risk reduction factor (RRF) of 45, further studies that refine the value to 55 are meaningless, because both values correspond to SIL 1 (RRF from 10 to 100).
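The point of the RRF example can be illustrated with a small lookup. Only the SIL 1 band (RRF from 10 to 100) is quoted in this section; the bands for SIL 2 to SIL 4 used below are taken from the low demand mode table of IEC 61508 and are an assumption as far as this section is concerned, as is the treatment of values that fall exactly on a boundary. The name `sil_for_rrf` is illustrative.

```python
# (lower RRF bound, upper RRF bound, SIL).
# Only the SIL 1 row is stated in the text; the others are assumed from IEC 61508.
SIL_BANDS = [
    (10, 100, 1),
    (100, 1000, 2),
    (1000, 10000, 3),
    (10000, 100000, 4),
]


def sil_for_rrf(rrf):
    """Return the SIL whose band contains rrf, or None if it lies outside all bands."""
    for low, high, sil in SIL_BANDS:
        if low < rrf <= high:  # boundary convention is an assumption
            return sil
    return None


print(sil_for_rrf(45))   # 1
print(sil_for_rrf(55))   # 1
print(sil_for_rrf(450))  # 2
```

Refining 45 to 55 does not change the outcome, whereas a change of an order of magnitude does.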

3.4.2 Failure rate Categories

It is assumed that component failure rates are constant and, in non-redundant PEC equipment, statistically independent. While these assumptions are not always realistic, they are reasonable and conservative for the “useful life” period of the electronic components used in PEC equipment.

Failures are first grouped into the two significant categories: safe and dangerous.

$$\lambda_{TOT} = \lambda_S + \lambda_D$$

Dangerous failures are those which cause the loss of the system’s functional safety (or safe state). In a normally-energized system (like ESD) safe failures are defined as those that erroneously de-energize the output.

Dangerous failures instead prevent the output from being de-energized.

For example, consider a DI (digital input) circuit with relay output, whose safe state in case of a functional failure of the circuit is defined as the relay being de-energized (ND, normally de-energized, relay). Dangerous failures in this case are the ones that prevent the relay from being de-energized.

Each failure category is further partitioned into failures that are detected by the on-line diagnostics versus the ones that are not.

$$\lambda_D = \lambda_{DD} + \lambda_{DU}$$

$$\lambda_S = \lambda_{SD} + \lambda_{SU}$$

$$\lambda_{TOT} = \lambda_{DD} + \lambda_{DU} + \lambda_{SD} + \lambda_{SU}$$

Where:

λ_DD: dangerous detected failure rates;

λ_DU: dangerous undetected failure rates;

λ_SD: safe detected failure rates;

λ_SU: safe undetected failure rates;

Failure rate categories are used to calculate the value of SFF (Safe Failure Fraction, see 6.4.3 at page 158), which is important for calculating Safety Integrity Levels (SIL).

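$$SFF = \frac{\sum\lambda_{DD} + \sum\lambda_{SD} + \sum\lambda_{SU}}{\sum\lambda_{DD} + \sum\lambda_{DU} + \sum\lambda_{SD} + \sum\lambda_{SU}} = 1 - \frac{\sum\lambda_{DU}}{\sum\lambda_{DD} + \sum\lambda_{DU} + \sum\lambda_{SD} + \sum\lambda_{SU}}$$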

From this simple expression it is evident that to increase the percentage value of the SFF, and consequently the SIL level, it is necessary to decrease the value of λ_DU (dangerous undetected failures).

Example:

Suppose the following values:

λ_DD = 0.14 / year; λ_DU = 0.04 / year; λ_SD = 0.22 / year; λ_SU = 0.5 / year

$$SFF = 1 - \frac{0.04}{0.9} = 0.956 \approx 96\%$$

In case of λ_DU = 0.4 / year:

$$SFF = 1 - \frac{0.4}{1.26} = 0.683 \approx 68\%$$
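The two SFF values above can be checked with a short script. This is a sketch only; the function name `safe_failure_fraction` is illustrative, and all four rates are assumed to be expressed in the same unit.

```python
def safe_failure_fraction(dd, du, sd, su):
    """SFF = 1 - lambda_DU / lambda_TOT, with all rates in the same unit."""
    total = dd + du + sd + su
    return 1.0 - du / total


first = safe_failure_fraction(dd=0.14, du=0.04, sd=0.22, su=0.5)
second = safe_failure_fraction(dd=0.14, du=0.4, sd=0.22, su=0.5)

print(f"{first:.0%}")    # 96%
print(f"{second:.0%}")   # 68%
```

Raising λ_DU by a factor of ten while leaving the other rates unchanged is enough to drop the SFF from 96% to 68%.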

Defining the term C, “diagnostic coverage”, as the built-in self-testing capability of a system, that is, the probability that a failure will be detected given that it occurs, the four failure rates can be expressed through the diagnostic coverage factors CD and CS in the following equations:

$$\lambda_{DD} = C_D \times \lambda_D$$

$$\lambda_{DU} = (1 - C_D) \times \lambda_D$$

$$\lambda_{SD} = C_S \times \lambda_S$$

$$\lambda_{SU} = (1 - C_S) \times \lambda_S$$

Where:

CS : diagnostic coverage of safe failures

CD : diagnostic coverage of dangerous failures

A coverage factor must be obtained for each component in the system in order to separate detected from undetected failures.
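The separation of detected from undetected failures through the coverage factors can be sketched as follows. The numerical values are hypothetical and chosen only for illustration, and the helper name `split_by_coverage` does not come from the manual.

```python
def split_by_coverage(rate, coverage):
    """Split a failure rate into (detected, undetected) parts."""
    if not 0.0 <= coverage <= 1.0:
        raise ValueError("coverage must be between 0 and 1")
    return coverage * rate, (1.0 - coverage) * rate


# Hypothetical values, in failures per year.
lambda_d, c_d = 0.2, 0.9   # dangerous failure rate and its coverage CD
lambda_s, c_s = 0.8, 0.5   # safe failure rate and its coverage CS

lambda_dd, lambda_du = split_by_coverage(lambda_d, c_d)
lambda_sd, lambda_su = split_by_coverage(lambda_s, c_s)

print(round(lambda_dd, 3), round(lambda_du, 3))  # 0.18 0.02
print(round(lambda_sd, 3), round(lambda_su, 3))  # 0.4 0.4
```

By construction the detected and undetected parts always add up to the original rate, so the total failure rate is unaffected by the coverage values.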

3.4.3 Dependent, or common cause, failures

Part 4 of the IEC 61508 standard defines a common cause failure as a “failure, which is the result of one or more events, causing coincident failures of two or more separate channels in a multiple channel system, leading to system failure”.

These failures have a significant effect on reliability and safety of a SIS, and therefore must be considered in the reliability and safety model.

The four failure rate categories can be further subdivided into:

SDN - (Safe, detected, normal cause).

SDC - (Safe, detected, common cause).

SUN - (Safe, undetected, normal).

SUC - (Safe, undetected, common cause).

DDN - (Dangerous, detected, normal).

DDC - (Dangerous, detected, common cause).

DUN - (Dangerous, undetected, normal).

DUC - (Dangerous, undetected, common cause).

3.4.4 Common cause failures and Beta factor

The Beta model divides component failure rates into:

normal mode failure rate λN (fault of one component only);

common mode failure rate λC (fault of two or more components);

Figure 20, Failure rates subdivision in common and normal mode (Beta factor)

The rectangle’s total area represents failure rate (λ).

On the left, the stress is strong enough to produce a failure of two or more components as consequence of the same cause.

To put the two groups in relation, the following equations are used:

$$\lambda_C = \beta \times \lambda$$

$$\lambda_N = (1 - \beta) \times \lambda$$

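The four failure rate categories SU, SD, DU and DD are divided according to the Beta model as follows:

$$\lambda_{SDN} = (1 - \beta) \times \lambda_{SD} \qquad \lambda_{SDC} = \beta \times \lambda_{SD}$$

$$\lambda_{SUN} = (1 - \beta) \times \lambda_{SU} \qquad \lambda_{SUC} = \beta \times \lambda_{SU}$$

$$\lambda_{DDN} = (1 - \beta) \times \lambda_{DD} \qquad \lambda_{DDC} = \beta \times \lambda_{DD}$$

$$\lambda_{DUN} = (1 - \beta) \times \lambda_{DU} \qquad \lambda_{DUC} = \beta \times \lambda_{DU}$$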

The values of beta factor can be different for each group and their calculation is not simple, therefore usually only one value is used.
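With a single beta value, the subdivision of the four categories into normal and common cause parts can be sketched as below. The rates are those of the SFF example in 3.4.2; the value β = 0.1 is hypothetical, and the helper name `beta_split` is illustrative.

```python
def beta_split(rate, beta):
    """Split a failure rate into (normal, common cause) parts."""
    return (1.0 - beta) * rate, beta * rate


# Rates in failures per year from the SFF example; beta is a hypothetical value.
rates = {"SD": 0.22, "SU": 0.5, "DD": 0.14, "DU": 0.04}
beta = 0.1

split = {}
for name, rate in rates.items():
    normal, common = beta_split(rate, beta)
    split[name + "N"] = normal   # e.g. SDN: safe, detected, normal
    split[name + "C"] = common   # e.g. SDC: safe, detected, common cause

for name, value in split.items():
    print(name, round(value, 4))
```

The eight resulting rates still sum to the original total, since the Beta model only redistributes each category between its normal and common cause parts.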

In document Sil Manual Gmi (Page 55-62)