APPLIED
RELIABILITY
Techniques for Reliability
Analysis
with
Applied Reliability Tools (ART) (an EXCEL Add-In)
and
JMP® Software
AM216 Class 5 Notes
Santa Clara University
Copyright
David C. Trindade, Ph.D.
STAT-TECH ®
AM216 Class 5 Notes
• Accelerated Testing
(continued from Class 4 Notes)
– Accelerated Test Example (Analysis in JMP)
– Degradation Modeling
– Sample Sizes for Accelerated Testing
• System Models
– Series System
– Parallel System
– Analysis of Complex Systems
– Standby Redundancy
• Defective Subpopulations
– Graphical Analysis
– Mortals and Immortals
– Models
– Case Study
– Class Project Example
• Modeling the Field Reliability
System Models
Series System
Consider a system made up of n components in series. If the i-th component has reliability $R_i(t)$, the system reliability is the product of the individual reliabilities, that is,
$$R_s(t) = R_1(t)\,R_2(t)\cdots R_n(t),$$
which we denote with the capital “pi” symbol for multiplication:
$$R_s(t) = \prod_{i=1}^{n} R_i(t).$$
The system CDF, in terms of the individual CDFs, is
$$F_s(t) = 1 - \prod_{i=1}^{n}\left[1 - F_i(t)\right].$$
The system failure rate is the sum of the individual
component failure rates. The system failure rate
is higher than the highest individual failure rate.
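The series rules can be checked numerically. The following is a minimal Python sketch, assuming independent components; the function names and the component values are illustrative assumptions and are not part of ART or JMP.

```python
import math

def series_reliability(reliabilities):
    """Series system: product of the component reliabilities R_i(t)."""
    return math.prod(reliabilities)

def series_failure_rate(failure_rates):
    """Series system: sum of the component failure rates h_i(t)."""
    return sum(failure_rates)

# Hypothetical component values at one fixed time t (not from the notes).
component_r = [0.99, 0.97, 0.95]
component_h = [1e-5, 3e-5, 5e-5]   # failures per hour

r_s = series_reliability(component_r)
f_s = 1.0 - r_s                     # system CDF, same as 1 - prod(1 - F_i)
h_s = series_failure_rate(component_h)
print(f"R_s = {r_s:.4f}, F_s = {f_s:.4f}, h_s = {h_s:.1e} per hour")
```

Because every term in the product is at most 1, adding components in series can only lower the system reliability.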
System Models
Parallel System
Consider a system made up of n components in parallel. The system CDF is the product of the individual CDFs, that is,
$$F_s(t) = \prod_{i=1}^{n} F_i(t).$$
The system reliability is
$$R_s(t) = 1 - \prod_{i=1}^{n}\left[1 - R_i(t)\right].$$
System failure rates are no longer additive (in
fact, the system failure rate is smaller than the
smallest individual failure rate), but must be
calculated using basic definitions.
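A small Python sketch of the parallel rules follows, again assuming independent components in active parallel. The function names and the 0.90 component reliability are hypothetical choices for illustration.

```python
import math

def parallel_cdf(cdfs):
    """Parallel system: product of the component CDFs F_i(t)."""
    return math.prod(cdfs)

def parallel_reliability(reliabilities):
    """Parallel system: 1 minus the product of the component unreliabilities."""
    return 1.0 - math.prod(1.0 - r for r in reliabilities)

# Hypothetical example: three components, each with R(t) = 0.90.
component_r = [0.90, 0.90, 0.90]
r_s = parallel_reliability(component_r)
f_s = parallel_cdf([1.0 - r for r in component_r])
print(f"R_s = {r_s:.4f}, F_s = {f_s:.4f}")
```

The system fails only when every component has failed, which is why the CDFs multiply here rather than the reliabilities.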
System Failure Rate
Two Parallel Components
A component has CDF F(t) and a failure rate h(t). Two components are used in parallel in a system. Determine the failure rate of the system.
SOLUTION
The CDF for the two components in parallel is $F_s(t) = F^2(t)$, and the PDF, by differentiation, is $f_s(t) = 2F(t)f(t)$. The failure rate of the system is
$$h_s(t) = \frac{f_s(t)}{1 - F_s(t)} = \frac{2F(t)f(t)}{1 - F^2(t)} = \frac{2F(t)f(t)}{[1 - F(t)][1 + F(t)]} = \frac{2F(t)}{1 + F(t)}\,h(t).$$
The result shows that the system failure rate is a factor $2F/(1+F)$ times the component failure rate. The smaller the component CDF, the bigger the reduction in the system failure rate.
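To see how the factor 2F/(1+F) behaves over time, the sketch below evaluates it for two identical components with a constant failure rate. The exponential component model and the value of `lam` are assumptions chosen only for illustration.

```python
import math

def pair_failure_rate(F, h):
    """System failure rate for two identical components in active parallel."""
    return 2.0 * F / (1.0 + F) * h

lam = 1e-4  # hypothetical constant component failure rate, per hour
for t in (100, 1_000, 10_000, 100_000):
    F = 1.0 - math.exp(-lam * t)        # component CDF at time t
    h_s = pair_failure_rate(F, lam)
    print(f"t = {t:>7} h  F = {F:.4f}  h_s/h = {h_s / lam:.4f}")
```

The ratio starts near zero when F is small and approaches 1 as F approaches 1, so the benefit of redundancy is greatest early in life.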
Class Project
System Models
A) A component has reliability R(t) = 0.99. Twenty-five components in series form a system. Calculate the system reliability.
B) A component has reliability R(t) = 0.95. Three components in parallel form a system. Calculate the system reliability.
Reliability Block Diagrams
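For components in series:
For components in parallel:
[Figure: block diagrams of components A and B connected in series and in parallel]
Example of Series-Parallel
System: Big Rig
[Figure: big rig, showing Cab and Trailer with components labeled A through J]
Class Project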
Complex Systems
A system consists of seven units: A, B, C, D, E, G, H. For the system to function unit A and either unit B or C
and either D and E together or G and H together must
be working. Draw the reliability block diagram for this setup.
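Once the block diagram is drawn, the system reliability follows by combining the series and parallel rules block by block. The sketch below is one way to do that in Python, assuming independent units; the helper names and the unit reliability of 0.90 are hypothetical.

```python
import math

def series(*r):
    """All blocks must work: multiply reliabilities."""
    return math.prod(r)

def parallel(*r):
    """At least one block must work: 1 - product of unreliabilities."""
    return 1.0 - math.prod(1.0 - x for x in r)

def system_reliability(r):
    """A in series with (B or C) in series with ((D and E) or (G and H))."""
    return series(
        r["A"],
        parallel(r["B"], r["C"]),
        parallel(series(r["D"], r["E"]), series(r["G"], r["H"])),
    )

# Hypothetical unit reliabilities, all 0.90 at the time of interest.
units = {name: 0.90 for name in "ABCDEGH"}
r_sys = system_reliability(units)
print(f"System reliability = {r_sys:.4f}, system CDF = {1.0 - r_sys:.4f}")
```

Unit A is a single point of failure in this layout, so the system reliability can never exceed the reliability of A.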
Standby Versus Active
Redundancy
In contrast to active parallel redundancy, there is
standby redundancy, in which the second
component is idle until needed. Assuming perfect
switching and no degradation of the idle
component, standby redundancy results in higher reliability and lower maintenance costs than active parallel redundancy. An illustration, assuming exponentially distributed failure times, is shown below.
System Failure Rates (2 Components)
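The comparison can be reproduced with a short Python sketch. It assumes two identical components, each with constant failure rate `lam`, perfect switching, and no degradation while idle. Under those assumptions the standby system lifetime is the sum of two exponential lifetimes, which gives R(t) = exp(-λt)(1 + λt). The function names and the value of `lam` are hypothetical.

```python
import math

def active_parallel(lam, t):
    """Reliability and failure rate for two components in active parallel."""
    fc = 1.0 - math.exp(-lam * t)               # component CDF
    r = 1.0 - fc ** 2                           # system reliability
    pdf = 2.0 * fc * lam * math.exp(-lam * t)   # derivative of fc**2
    return r, pdf / r

def standby(lam, t):
    """Reliability and failure rate for one active unit plus one cold spare."""
    r = math.exp(-lam * t) * (1.0 + lam * t)
    h = lam ** 2 * t / (1.0 + lam * t)
    return r, h

lam = 1e-3  # hypothetical component failure rate, per hour
for t in (100, 500, 1_000, 5_000):
    ra, ha = active_parallel(lam, t)
    rs, hs = standby(lam, t)
    print(f"t = {t:>5} h  active: R = {ra:.4f}, h = {ha:.2e}  "
          f"standby: R = {rs:.4f}, h = {hs:.2e}")
```

Both system failure rates start at zero and rise toward the single-component rate, with the standby curve staying below the active parallel curve.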
Series, Parallel Reliability in
ART
Reliability Experiment
Consider . . .
We test 100 units for 1,000 hours. There are 30 failures by 500 hours, but no more by the end of test.
Question: Are we dealing with two
populations or just censored data?
Question: If we continue the test, will we see
Defect Models
Mortals versus Immortals
The usual assumption in reliability analysis is that
all units can fail for a specific mechanism. If a
defective subpopulation exists, only a fraction of
the units containing the defect may be susceptible to failure. These are called mortals.
Units without the fatal flaw do not fail. These are called immortals.
The model for the total population of mortals and immortals becomes :
CDF = (fraction mortals) x CDF(mortals)
Example of a Defective
Subpopulation
A Processing Problem
Suppose we have 25 wafers in a lot, but only two wafers are contaminated with mobile ions due to a processing error.
If components are assembled from the 25 wafers, assuming equal yield per wafer, only 2/25= 8% of the components can have the fatal “defect” that makes failure possible.
The components from the non-contaminated wafers will not fail for this mechanism since they are defect free; that is, we have a defective subpopulation.
Spotting a Defective
Subpopulation
Graphical Analysis
Assume that a specified failure mode follows a lognormal distribution.
Plot the data on lognormal graph paper. If instead of following a straight line, the points seem to curve
away from the cumulative percent axis, it’s a signal
that a defective subpopulation may be present. If test is run long enough, expect plot to bend over
Defective Subpopulations
Graphical Analysis
Plot based on total sample (mortals and immortals).
Defect Model
Mortals and Immortals
The observed CDF $F_{obs}(t)$ is
$$F_{obs}(t) = p\,F_m(t)$$
where $F_m(t)$ is the CDF of the mortals and p is the fraction of mortals (units with the fatal defect) in the
total sample.
For example, if there are 25% mortals in the
population, and the mortal CDF at time t is 40%, then we would expect to observe about
0.25 × 0.40 = 0.10
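A small simulation can make the mortal and immortal picture concrete. The sketch below draws a population in which a fraction of units are mortals with lognormal lifetimes and the rest never fail, then compares the observed fraction failed with p times the mortal CDF. The lognormal choice, the parameter values, and the function name are assumptions for illustration only.

```python
import math
import random

def simulate_observed_cdf(n_units, p_mortal, t50, sigma, t, seed=1):
    """Fraction of ALL units failed by time t when only mortals can fail."""
    rng = random.Random(seed)
    failed = 0
    for _ in range(n_units):
        if rng.random() < p_mortal:          # this unit carries the defect
            lifetime = rng.lognormvariate(math.log(t50), sigma)
            if lifetime <= t:
                failed += 1
        # immortals never fail for this mechanism
    return failed / n_units

# Hypothetical case: 25% mortals, mortal T50 = 100 h, sigma = 1.
# At t = T50 the mortal CDF is 50%, so the model predicts 0.25 * 0.50.
est = simulate_observed_cdf(200_000, 0.25, 100.0, 1.0, t=100.0)
print(f"simulated F_obs(T50) = {est:.4f}, model p * F_m = {0.25 * 0.5:.4f}")
```

However long the simulated test runs, the observed fraction failed levels off near p rather than approaching 100%.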
Major Computer
Manufacturer Reliability Data
Gate Oxide Fails
Time (hours)    24       48       168      500     1000
Rejects         201      23       1        1       1
Sample Size     58,000   57,392   10,000   2,000   1,999
Censored        407      47,369   7,999    0       1,998
What Do These
Numbers Mean?
The plus and minus 3 sigma range of the time to failure distribution extends from 33 seconds to 1.66E62
years!
It takes seconds to get to 0.1% cumulative failures, but over 412,000 hours (that is, 47 years) to get to
1.00%!
Assuming everything can fail is misleading and
unnecessary.
Modeling with
Defective Subpopulations
The same data, assuming 99% of the failures have occurred by 48 hours, can be modeled by a defective
subpopulation fraction of 227/58,000 = 0.39% and
a lognormal distribution of failure times for the
mortals with T50 = 10.6 hours and sigma = 0.68.
Practically 100% of failures occur by 168 hours. Any failures thereafter are probably not related to the
defective subpopulation. For example, handling
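The fitted model on this slide can be evaluated at the readout times with a few lines of Python. The parameters are the ones quoted above (227/58,000 mortals, T50 = 10.6 hours, sigma = 0.68); the function names are hypothetical and the lognormal CDF is computed from the standard normal CDF via `math.erf`.

```python
import math

P_MORTAL = 227 / 58_000   # fraction defective quoted in the notes
T50 = 10.6                # mortal median life, hours
SIGMA = 0.68              # lognormal shape parameter

def mortal_cdf(t):
    """Lognormal CDF of the mortal subpopulation at time t (hours)."""
    z = math.log(t / T50) / SIGMA
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def observed_cdf(t):
    """CDF for the whole population: p times the mortal CDF."""
    return P_MORTAL * mortal_cdf(t)

for t in (24, 48, 168, 500, 1000):
    print(f"t = {t:>4} h  mortal CDF = {mortal_cdf(t):.5f}  "
          f"observed CDF = {100 * observed_cdf(t):.4f}%")
```

With these parameters the mortal CDF is close to 99% at 48 hours and essentially 100% at 168 hours, in line with the statements on the slide.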
Defective Subpopulation
Models
If we don’t consider mortals vs. immortals, we will incorrectly assume that all units can fail.
Projections of field reliability will be biased
Statistical Reliability
Analysis and Modeling:
A Case Study
Analysis of Reliability Data
with Failures from a
Reliability Study
Background
One lot of a device type with initial burn-in results at 168 hours, 125°C:
Over 50% fallout due to bake recoverable failures
Since other lots, with similar manufacturing, might have escaped to a few customers, we needed to
assess the field impact.
Reliability Study
Design
Two static stresses:
179 Units: 125°C ambient
90 Units: 150°C ambient
30 Units: Control
Frequent readouts at 2, 4, 8, 16, 32, 48, 68, 92, and 116 hours
Purpose of Study
Reliability Modeling
• Determine if fraction defective (mortals) model applies
• Determine failure distribution (lognormal, parameters)
• Determine if true acceleration is present
• Determine activation energy for acceleration factors
• Determine recovery kinetics with and without bake
- Is 24 hours at 150°C necessary?
Modeling Procedure
Statistical Analysis Plan
• Analyze cumulative percent failures plot versus time, both linear and probability plots.
• Estimate fraction mortals for stress cells. Test for significant difference.
• Plot fallout of mortals (reduced sample size) on lognormal probability graph. Check for linearity and equality of slopes.
• Run maximum likelihood analysis. Test for equality of shape factors (sigmas). Estimate
single sigma. Estimate median life T50 for both cells.
Reliability Study
Bake Recoverable Failures
[Figure: Probability Plot (Adjusted for Mortals). Axes: Ln (Time to Failure) and Standard Normal Variate Z. Series: 150°C and 125°C.]
CHOOSE CONF. LIMIT FOR BOUND IN PERCENT: 90
ENTER ANY EXACT TIMES OF FAILURE FOR CELL 1
ENTER START AND ENDPOINT OF ALL READOUT INTERVALS (INCLUDE ZERO’S) SPREAD 2 4 8 16 32 48 68 92 116
ENTER CORRESPONDING NUMBERS OF FAILS PER INTERVAL (INCLUDE ZERO’S) 34 6 21 2 0 0 0 1 0
ENTER TIMES ALL FAILED UNITS WERE REMOVED FROM TEST (INCLUDING END OF TEST) 116
ENTER CORRESPONDING NUMBERS REMOVED 0
ENTER ANY EXACT TIMES OF FAILURE FOR CELL 2
ENTER START AND ENDPOINT OF ALL READOUT INTERVALS (INCLUDE ZERO’S) SPREAD 2 4 8 16 32 48 68 92 116
ENTER CORRESPONDING NUMBERS OF FAILS PER INTERVAL (INCLUDE ZERO’S) 5 0 36 8 42 7 3 4 3
ENTER TIMES ALL FAILED UNITS WERE REMOVED FROM TEST (INCLUDING END OF TEST) 16 116
ENTER CORRESPONDING NUMBERS REMOVED 2 3
MAXIMUM LIKELIHOOD ESTIMATES
VARIANCE VARIANCE COVARIANCE
CELL T50 SIGMA MU SIGMA MU MU SIGMA
1 1.90 1.208 .444 .0322 .0 373e-1 .643e-2
2 15.08 1.060 2.714 .0059 .0 104e-3 .266e-5
ESTIMATE BOUNDS (90 PERCENT CONFIDENCE)
NUM. NUM.
CELL ON TEST FAIL T50 LOW T50 UP SIGMA LOW SIGMA UP
1 64 64 1.38 2.63 .909 1.508
2 113 108 12.74 17.86 .933 1.187
WANT EQUAL T50’S OR SIGMAS OR BOTH IN SOME CELLS (Y/N)? Y
CELLS: 1 2
TYPE 1 FOR EQUAL SIGMA’S, 2 FOR EQUAL MU’S, 3 FOR BOTH THE SAME: 1
THE ASSUMPTION OF EQUAL SIGMA’S CAN NOT BE REJECTED AT THE 95 PERCENT LEVEL. UNDER THIS ASSUMPTION, RESULTS LIKE OBSERVED OCCUR ABOUT 41.9 PERCENT OF THE TIME. (THE SMALLER THIS PERCENT, THE LESS LIKELY THE ASSUMPTION.)
MAXIMUM LIKELIHOOD ESTIMATES
VARIANCE VARIANCE COVARIANCE
CELL T50 SIGMA MU SIGMA MU MU SIGMA
1 2.02 1.090 .704 .0051 .0 247e-2 .538e-3
2 15.08 1.090 1.713 .0051 .0 110e-2 .250e-5
ESTIMATE BOUNDS (90 PERCENT CONFIDENCE)
NUM. NUM.
CELL ON TEST FAIL T50 LOW T50 UP SIGMA LOW SIGMA UP
1 64 64 1.56 2.63 .972 1.207
2 113 108 12.68 17.54 .972 1.207
Reliability Study
Bake Recoverable Failures
Projection to Field Conditions
Acceleration Statistics
• Estimate acceleration factor between the two stress cells: AF = 15.08 / 2.02 = 7.465
• Estimate activation energy, based on Tj’s 35°C above ambient: EA = 1.375 eV
• Estimate field T50 based on Tj at 55°C ambient: field T50 = 18,288 hours
• Using field T50, sigma = 1.090, lognormal distribution:
- project fallout and failure rates for various mortal fractions
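The acceleration figures above can be reproduced, to rounding, with the Arrhenius relationship. The sketch below assumes that the 15.08-hour T50 belongs to the 125°C cell and the 2.02-hour T50 to the 150°C cell, that junction temperature is ambient plus 35°C in every case, and a Boltzmann constant of 8.617E-5 eV/K. The function names are hypothetical.

```python
import math

K_BOLTZMANN = 8.617e-5   # eV/K
SELF_HEATING = 35.0      # junction rise above ambient, deg C (from the notes)

def kelvin(celsius):
    return celsius + 273.15

def activation_energy(af, t_low_c, t_high_c):
    """Solve the Arrhenius equation for EA given an observed AF."""
    return K_BOLTZMANN * math.log(af) / (1.0 / kelvin(t_low_c) - 1.0 / kelvin(t_high_c))

def acceleration_factor(ea, t_use_c, t_stress_c):
    """Arrhenius acceleration factor from stress to use temperature."""
    return math.exp(ea / K_BOLTZMANN * (1.0 / kelvin(t_use_c) - 1.0 / kelvin(t_stress_c)))

t50_125, t50_150 = 15.08, 2.02                      # hours, equal-sigma fit
af_cells = t50_125 / t50_150
ea = activation_energy(af_cells, 125 + SELF_HEATING, 150 + SELF_HEATING)
af_field = acceleration_factor(ea, 55 + SELF_HEATING, 125 + SELF_HEATING)
t50_field = t50_125 * af_field

print(f"AF between cells = {af_cells:.3f}")
print(f"EA = {ea:.3f} eV")
print(f"Field T50 = {t50_field:,.0f} hours")
```

Small differences from the quoted field T50 are expected because the result depends on the rounding of EA and of the Boltzmann constant.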
Projection to Field Use
Bake Recoverable Fails
A Note of Caution
Analysis When Mortals Are Present
Since the analysis took into account the presence of a defective subpopulation, parameter
estimates were accurate. The two customers,
notified of the affected lots, used analysis for
A Side Benefit
Screening a Wearout Mechanism
Note that it may be possible to screen a wearout
failure mechanism if only a subpopulation of the
units are mortal for that mechanism and sufficient acceleration is obtainable.
See Trindade paper “Can Burn-in Screen Wearout Mechanism? Reliability Models of Defective
Subpopulations - A Case Study” in 29th Annual
Class Project
Defect Models
50 components are put on stress. Readouts are at 10, 25, 50, 100, 200, 500, and 1,000 hours. The failure counts at the respective readouts are 2, 2, 4, 5, 4, 3, and 0.
1. Estimate the CDF for all units using the table below with n = 50.
2. Plot the data on Weibull probability paper on the next page.
Class Project
Defect Model Estimates
Weibull Parameter Estimates for Mortal Population: Characteristic Life (c) __________
Shape Parameter (m) __________
$$F(t) = 1 - e^{-(t/c)^m}$$
How could we confirm that the Weibull model for the mortal population fits the data? We estimate the CDF at three times and compare to
Defective Subpopulations in
ART
Enter failure information (readout times, cumulative failures) into columns. Under ART, select Defective
System Models
A General Model for the
Field Reliability of
Integrated Circuits
Failure Rate Calculations
Primitive Method
Assumptions
• Constant failure rate
• Single overall activation energy
• Ambient temperatures
Primitive Method
Problems with Calculations
Example
100 units are stressed for 1,000 hours at 125°C.
Assume no self heating. One unit fails at 10 hours for a mechanism with EA of 1.0 eV. A second unit fails at 500 hours for a failure mechanism with EA of 0.5 eV.
Primitive Method Calculation
Overall average activation energy: 0.75 eV
Acceleration Factor (125°C to 55°C): AF = 106
IFR (constant) at 55°C:
Primitive Method
Comparative Calculation
Individual Analysis by Failure Mechanism
Mechanism 1: EA = 1.0 eV, AF = 501, IFR (constant) at 55°C:
[1E9/(10+500+98x1000)]/AF = 20 FITS
Mechanism 2: EA = 0.5 eV, AF = 22, IFR (constant) at 55°C:
[1E9/(10+500+98x1000)]/AF = 461 FITS
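The arithmetic behind these comparisons can be sketched in Python. The acceleration factors use the Arrhenius relationship with ambient temperatures (no self heating) and an assumed Boltzmann constant of 8.617E-5 eV/K. The pooled "primitive" figure is computed here on the assumption that both failures are divided by the total device-hours and the single averaged AF; that number is not taken from the notes. Function names are hypothetical.

```python
import math

K_BOLTZMANN = 8.617e-5   # eV/K

def arrhenius_af(ea, t_use_c, t_stress_c):
    """Arrhenius acceleration factor between stress and use temperatures."""
    t_use, t_stress = t_use_c + 273.15, t_stress_c + 273.15
    return math.exp(ea / K_BOLTZMANN * (1.0 / t_use - 1.0 / t_stress))

def fits_at_use(n_fail, device_hours, af):
    """Constant failure rate at use conditions, in FITs (fails per 1E9 hours)."""
    return n_fail * 1e9 / device_hours / af

device_hours = 10 + 500 + 98 * 1000      # two failures plus 98 survivors

af_avg = arrhenius_af(0.75, 55, 125)     # primitive method, averaged EA
af_1 = arrhenius_af(1.0, 55, 125)        # mechanism 1
af_2 = arrhenius_af(0.5, 55, 125)        # mechanism 2

primitive = fits_at_use(2, device_hours, af_avg)   # assumption: pooled failures
mech_1 = fits_at_use(1, device_hours, af_1)
mech_2 = fits_at_use(1, device_hours, af_2)

print(f"AF: avg = {af_avg:.0f}, mech 1 = {af_1:.0f}, mech 2 = {af_2:.1f}")
print(f"Primitive method: {primitive:.0f} FITs")
print(f"By mechanism: {mech_1:.0f} + {mech_2:.0f} = {mech_1 + mech_2:.0f} FITs")
```

The unrounded AF for mechanism 2 is slightly above 22, so the computed rate comes out a little below the 461 FITs obtained with the rounded factor. Either way, the low activation energy mechanism dominates the use-condition failure rate, which the averaged activation energy hides.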
Failure Rate Calculations
Later Improved Method
• Early failures (infant mortality) reported separately
• Long-term life modeled with activation energy
specific to failure mechanisms
• Constant failure rate for long term life
• Temperature acceleration calculated with junction
Later Method
Problems
• Defective subpopulations not adequately
modeled
• Competing failure modes not adequately
modeled with constant failure rate
• Zero rejects and unidentified mechanisms
often not treated
An Alternative Model
Three categories of possible failures:
Test Escapes
Defective Subpopulations
Competing Failure Mechanisms
The three D’s:
Non-Functional Test Escapes
Dead on arrival (DOA)
Quality issue
Inadequate testing at manufacturer
or damaged after testing prior to customer receipt
Rejects “discovered” at customer;
called mistakenly reliability failures
Defective Subpopulations
There are proportions of the total population at risk of failure. Defective units are called mortals. The ones without the defect are called immortals.
Defective subpopulations are generally associated with processing problems.
There are physical reasons why defective subpopulations should exist.
Competing Risks
There are failure mechanisms that can affect all units.
We call these mechanisms competing risks
because several different types may exist and any
one can cause the unit to fail.
These mechanisms are typically associated with
design, processing, or material problems.
General Reliability Model
• Activation energies are specific to failure mechanisms.
• Zero rejects and unidentified mechanisms are included.
• Generates complete bathtub curve!
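[Equation: total CDF $F_T$ expressed in terms of $F_e$, $F_d$, and $F_N$; the form is not recoverable from the extracted text]
where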
General Reliability Model In
Use at AMD
Class Project
System Models
A) A component has reliability R(t) = 0.99. Twenty-five components in series form a system. Calculate the system reliability.
$R_s(t) = (0.99)^{25} = 0.778$ or 77.8%
B) A component has reliability R(t) = 0.95
Three components in parallel form a system. Calculate the system reliability.
Class Project
Complex Systems
A system consists of seven units: A, B, C, D, E, G, H. For the system to function unit A and either unit B or C
and either D and E together or G and H together must
be working. Draw the reliability block diagram for this setup.
Write the equation for the CDF of the system in
Defect Models
1. Estimate the proportion defective p and the
number of mortals in the sample. Fill in the mortal CDF column in the table below.
2. Plot the data for the mortal subpopulation on
the same sheet of paper. Does the fit look reasonable?
4. Estimate the characteristic life c = T63, the 63rd percentile.
5. Estimate the shape parameter m by drawing a
line perpendicular to the “best fit by eye line”
through the estimation point on the Weibull paper and reading the beta estimation scale.
Class Project
Defect Model Example
Time    Cum # Fails   CDF Est All Units (%)   CDF Est Mortals (%)
10      2             2/50 = 4%               2/20 = 10%
25      4             4/50 = 8%               4/20 = 20%
50      8             8/50 = 16%              8/20 = 40%
100     13            13/50 = 26%             13/20 = 65%
200     17            17/50 = 34%             17/20 = 85%
500     20            20/50 = 40%             20/20 = 100%
1000    20            20/50 = 40%             20/20 = 100%
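The table can be rebuilt from the raw readout data, and the Weibull parameters for the mortals can be roughly estimated in code. The sketch below assumes all mortals have failed by the end of the test (no failures between 500 and 1,000 hours), and it swaps the graphical "best fit by eye" for an ordinary least-squares line on the linearized Weibull CDF, ln(-ln(1 - F)) = m ln t - m ln c. Points with a mortal CDF of 100% are left out because they cannot be plotted on that scale. Variable names are hypothetical.

```python
import math
from itertools import accumulate

times = [10, 25, 50, 100, 200, 500, 1000]    # readout times, hours
fails = [2, 2, 4, 5, 4, 3, 0]                # new failures at each readout
n = 50                                       # units on stress

cum = list(accumulate(fails))                # cumulative failures
n_mortal = cum[-1]                           # assumption: every mortal has failed
p = n_mortal / n                             # estimated fraction of mortals
cdf_all = [c / n for c in cum]
cdf_mortal = [c / n_mortal for c in cum]

# Least-squares line through the plottable points (0 < F < 1).
xs = [math.log(t) for t, f in zip(times, cdf_mortal) if 0.0 < f < 1.0]
ys = [math.log(-math.log(1.0 - f)) for f in cdf_mortal if 0.0 < f < 1.0]
x_bar, y_bar = sum(xs) / len(xs), sum(ys) / len(ys)
sxx = sum((x - x_bar) ** 2 for x in xs)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
m = sxy / sxx                                # shape parameter estimate
c = math.exp(-(y_bar - m * x_bar) / m)       # characteristic life estimate

print(f"p = {p:.2f}, mortals = {n_mortal}")
print(f"Weibull shape m = {m:.2f}, characteristic life c = {c:.0f} hours")
```

A shape estimate near 1 would suggest the mortal lifetimes are close to exponential; a probability plot is still the better check of fit.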