• No results found

Fault Tolerant Computing CS 530 Reliability Analysis

N/A
N/A
Protected

Academic year: 2022

Share "Fault Tolerant Computing CS 530 Reliability Analysis"

Copied!
23
0
0

Loading.... (view fulltext now)

Full text

(1)

Fault Tolerant Computing

CS 530

Reliability Analysis

Yashwant K. Malaiya

Colorado State University

(2)

Reliability Analysis: Outline

Reliability measures:

Reliability, availability, Transaction Reliability,

MTTF and R(t), MTBF Basic Cases

Single unit with permanent failure, failure rate

Single unit with temporary failures

Combinatorial Reliability: Block Diagrams

Serial, parallel. K-out-of-n systems

Imperfect coverage Redundancy

TMR, spares

Generalized

(3)

Reliability Analysis

• Permanent faults

 The unit will eventually fail. Thus reliability

“decays”.

• Temporary faults

 Faults come and go. Often Steady state characterization is possible.

 Permanent faults subject to repair are modeled as temporary faults.

• Design faults

 Reliability growth occurs during testing &

debugging. We will study this under “Software Reliability” later.

(4)

Why Mathematical Analysis?

• You can determine reliability by constructing a large number of copies of the target system, and collecting failure data. However, that would be infeasible except for special cases.

• Thus we need to be able to determine the

reliability before a system is built, by using the

information we have about the components and

the proposed architecture.

(5)

Basic Reliability Measures

Reliability: durational (default)

R(t)=P{correct operation in duration (0,t)}

 This is the default definition of reliability.

Availability: instantaneous

A(t)= P{correct operation at instant t)}

 Applied in presence of temporary failures

 A steady-state value is the expected value over a range of time.

Transaction Reliability: single transaction Rt=P{a transaction is performed correctly}

• The term “Reliability” is sometimes used with a non-standard meaning.

(6)

Mean time to …

• Mean Time to Failure (MTTF): expected time the unit will work without a failure.

• Mean time between failures (MTBF): expected time between two successive failures.

 Applicable when faults are temporary.

 The time between two successive failures includes repair time and then the time to next failure.

 Approximately equal to

• Mean time to repair (MTTR): expected time during

which the unit is non-operational.

(7)

Mean Time to Failure (MTTF)

• There is a very useful general relation between MTTF and R(t).

Here T is time to failure, which is a random variable.

dt t R MTTF

Thus

dt t R t

R t

dt dt t tdR

dt t f t T

E MTTF

=

+

=

=

=

=

0

0 0 0

0

) (

) ( )]

( [

) (

) ( )

(

dt t t dR

f or

dt t dR dt

t dF

t F

t T P

t in failure P

t R

Note

) ) (

(

) ( )

(

) ( 1

} 0

{ 1

)}

, 0 ( {

1 ) (

:

=

=

=

=

=

. 0

) ( Thus

form the

of generally is

R(t) and

0 0

:

t as t

tR

e x

as xe

Note

at x

Worth

Remembering!

(8)

Failures with Repair

Time between failures: time to repair + time to next failure

good bad

“failure”

“repair”

operational operational

Under repair Under repair

TTF TBF

• MTBF = MTTF + MTTR

• MTBF, MTTF are same same when MTTR ≈0

• Steady state availability = MTTF / (MTTF+MTTR)

(9)

Mission Time (High-Reliability Systems)

Reliability throughout the mission must remain above a threshold reliability Rth.

Mission time TM: defined as the duration in which R(t)≥≥≥≥Rth.

Rth may be chosen to be perhaps 0.95.

Mission time is a strict measure, used only for very high reliability missions.

0 0.25 0.5 0.75 1

0 20 40 60

time

R(t)

Rth

TM

(10)

Two Basic cases

We next consider two very important basic cases that serve as the basis for time-dependent

analysis.

1. Single unit subject to permanent failure

We will assume a constant failure rate to evaluate reliability and MTTF.

2. Single unit with temporary failures

System has two states Good and Bad, and transitions among them are defined by transition rates.

Both of these are example of Markov processes.

(11)

Constant Failure Rate Assumption

We will always assume a constant failure rate.

 It keeps analysis simple.

 During operating life, the failure rate is approximately constant.

The Bath-Tub curve:

 In the beginning the failure rate is high because the weaker devices fail due to “infant mortality”. Near the end the failure rate is again high due to “aging” or wear-out of devices.

Failure- rate

time Burn-in Operating life wearout

Famous “bath-tub”

(12)

Basic Cases: Single Unit with Permanent Failure

• Failure rate is the probability of failure/unit time

• Assumption: constant failure-rate λλλλ

Good 0

Bad 1 Z(t)=λλλλ

0 state in

being of

y probabilit on

depends 0

state leaving

of rate the

since )

) ( (

0 0

=

= p t dt

t

dp

λ

The state transition diagram &

the differential equation represent What we call Markov Modeling.

(13)

Single Unit with Permanent Failure (2)

368 .

0 )

( 1 ,

"

law y

reliabilit l

Exponentia The

"

) (

) ( )

(

) ( :

1 )

0 (

) ) (

(

1 0

0 0

0 0

=

=

=

=

=

=

=

=

e t

R t

At

e t

R

t p t

R Since

e t

p Solution

p

t dt p

t dp

t

t

λ λ

λ

λ

0 0 .2 5 0 .5 0 .7 5 1

0 5 0 1 0 0 1 5 0

time

R(t) e-λ t

1/λ

0 . 3 7

(14)

Single Unit: Permanent Failure (3)

Ex 1: a unit has MTTF

=30,000 hrs. Find failure rate.

λ=1/30,000=3.3x10-5/hr

Ex 2: Compute mission time TM if Rth =0.95.

e-λTM =0.95 TM= - ln(0.95)/ λ

≈0.051/ λ

Ex 3: Assume λ=3.33x10-5, and Rth =0.95 find TM.

Ans: TM = 1538.8 hrs

e

t

t

R ( ) =

λ

λ

λ

λ

λ

1

] [

) (

. )

( )

(

0 0 0

=

=

=

=

t

t

e

dt e

dt t R MTTF

case this

in t R as same is

t

A

(15)

Single Unit: Temporary Failures(1)

Temporary: intermittent, transient, permanent with repair

good bad

Good 0

Bad 1 λλλλ

µµµµ

(t).

p - 1 (t) p since needed

not

is it however (t),

p for expression an

get can we

Similarly

) 1

( )

0 ( )

(

etc.

transform laplace

by solved be

can

) ( )

) ( (

) ( )

) ( (

0 1

1 ) ( )

( 0

0

1 0

1

1 0

0

= + − +

=

− +

=

+

=

+

+

t t

e e

p t

p

t p t

dt p t dp

t p t

dt p t dp

µ λ µ

λ

µ λ

µ µ

λ

µ λ

Y. K. Malaiya, S. Y. H. Su: Reliability Measure of Hardware Redundancy Fault-Tolerant Digital Systems with Intermittent Faults. IEEE Trans.

Computers 30(8): 600-604 (1981)

Note state diagram &

Differential equations for Markov modeling

(16)

Single Unit: Temporary Failures(2)

µ λ

µ

µ λ

λ µ

λ µ

µ λ

µ µ λ

µ

µ λ µ

λ

µ λ µ

λ

• +

= +

= +

+ − +

=

=

+ − +

=

+

+

+

+

is ty availabili state

- Steady

) ( )

( ,

t

: exist ies

probabilit state

steady that

Note

) 1

( )

0 ( Thus

) ( A(t)

ty Availabili

) 1

( )

0 ( )

(

1 0

) (

) (

0

0

) (

) (

0 0

t p t

p

e e

p A(t)

t p

e e

p t

p

t t

t t

(17)

Single Unit: Temporary Failures(3)

same also

: time ion

Miss

MTTF 1 Thus

failure permanent

as same

0

t)}

(0, in

failures no

{ )

(

l) (durationa bility

Relia

=

=

=

=

λ

t

e

at t}

P{in Good P

t R

Good 0

Bad 1 λλλλ

µµµµ

Good 2

λλλλ

First failure

Note that when we say no failures in (0,t), even a brief failure is a failure. Thus R(t) may be too strict a measure when brief failures may be acceptable.

(18)

Combinatorial Reliability

This is a part of classic reliability theory.

Objective is: Given a

 systems structure in terms of its units

 reliability attributes of the units

 some simplifying assumptions

We need to evaluate the overall reliability measure.

There are two extreme cases we will examine first:

 Series configuration

 Parallel configuration

 Other cases involve combinations and other configurations.

• Note that conceptual modeling is applicable to R(t), A(t), Rt(t). A system is either good or bad.

(19)

Series configuration

Series configuration: all units are essential. System fails if one of them fails .

Assumption: statistically independent failures in units

.

U1 U2 U3

=

=

=

=

=

n

i

i S

S

R R

R R R

g U P g U P g U P

good U

good U

good U

P R

1 3

2 1

3 2

1

3 2

1

general In

} {

} {

} {

}

{ I I

(20)

Series configuration

n t t t

i n i

e e

t e t

λ λ

λ λ

λ λ λ λ

λ

+ + +

=

= Π

=

=

+ + +

L

L

2 1

S

] [

s i

: rates failure

individual of

sum the is rate failure system

i.e.

) ( R then

) ( R If

2 1

U1 U2 U3

This gives us a nice way to estimate the overall failure rate, when all the individual units are essential. This is the basis of the approach used in the popular “Military Handbook” MIL-HDBK-217 approach for estimating the failure rates for different systems.

The failure rates of individual units are estimated using empirical formulas. For example the failure rate of a VLSI chip is related to its

The reliability block

diagrams like this are only conceptual, not physical.

(21)

“A chain is as strong as it's weakest link”

Let us see for a 4-unit series system

Assume R1=R2 =R3=0.95, R4=0.75

 RS=0.643

Thus a chain is slightly weaker than its weakest link!

The plot gives reliability of a 10-unit system vs a single system. Each of the 10 units are identical.

More units, less reliability. 0

0.25 0.5 0.75 1

0 20 40 60 80 100

Time

Reliability Single unit

M 10 units

Combinatorial: Series Do you agree?

(22)

Combinatorial: Parallel

Parallel configuration: System is good when least one of the several replicated units is good. A parallel configuration

represents an ideal redundant system, ignoring any overhead.

U1

U2

U3

=

=

=

=

=

=

=

n i s

i n

i s

s

R R

e i

R R

R R

R

b U P b U P b U P

bad U

bad U

bad U

P

bad units

all P R

1

3 2

1

3 2

1

3 2

1

. .

) 1

( 1

general In

) 1

)(

1 )(

1 ( 1

.}

{ .}

{ .}

{ 1

} {

1

} {

1

I I

Where R represents

(23)

Fault Tolerant Computing

Parallel Configuration: Example

) 1

ln(

ln ) 1

(

) 1

( 1

: Solution

? R R

, R R

R if

needed are

units parallel

many How

1 R

y reliabilit system

Need

: Problem

s m

m 2

1

s

m x m

x m s

x R

R R R

= ∈

=

=

<

=

=

=

= L

. 4 x gives

9 . 0 R

0.0001), (

9999 .

0 R

Assume

m s

=

=

∈=

=

Remember, we’re consider an ideal system

Sometimes it is more convenient to talk in terms of “unreliability”

References

Related documents

The sparse-representation based method including Yang [2] proposes a new SR method under the condition of over-complete dictionary, Wang [3] presents a SR algorithm based on

Volunteers provide invaluable support and also require a specific skill set enabling them to make clients feel welcome, conduct assessments and implement the Doorways philosophy

The paper examines the economic and political factors behind it, including the role of economic ideas, experts, politicians, institutional arrangements in the Maastricht

K = 10 rules represents a (heuristic) upper limit on the size of an interpretable rule list, and M = 1000 represents the approximate number of rules with sufficiently high

•  The calculation of availability is based on a function of the Mean Time Between Failures (MTBF) of the components in the network and the Mean Time to Repair (MTTR) —or

This section includes products and procedures for the installation of the ARDEX PC-T™ Polished Concrete Topping component of the ARDEX Polished Concrete System

In accordance with the Israeli perspective, the West Bank and Gaza strip are not applied upon them both by the resolution (242), because they are two liberated zones, and they

*Also presented at the Milestones 10 th Annual Autism/Asperger’s Conference, Cleveland, OH (2012, June) and at the Autism Association for Behavior Analysis International 6 th