APPLIED
RELIABILITY
Techniques for Reliability
Analysis
with
Applied Reliability Tools (ART) (an EXCEL Add-In)
and
JMP® Software
AM216 Class 1 Notes
Santa Clara University
Copyright
David C. Trindade, Ph.D.
STAT-TECH ®
Required Text
This material is based on the text:
APPLIED RELIABILITY
by Dr. P. A. Tobias & Dr. D. C. Trindade
2nd Edition, Published in 1995
CHAPMAN & HALL/CRC
Software Requirements
ART Excel Add-In
Available at
http://www.trindade.com/am216.htm
Access to Microsoft EXCEL (2003 or 2007)
Alternative (Open Software): OpenOffice at
http://www.openoffice.org/
JMP Recommended
Free 30 day trial at
www.jmp.com
Download at
http://www.onthehub.com/jmp/
Descriptive Statistics
– Variation
– Sample and Population
– Random Sampling
– Types of Data
– Readout and Exact Data
– Histograms
Reliability Terminology and Concepts
– Life Distributions
– PDF
– CDF
– Reliability Function
– Hazard Rates (AFR and IFR)
– Estimation
– Bathtub Curve
– Failure Categories
Key Concept : Variation
• The objective in running experiments is to
extract useful information from data.
• Variation exists in all processes.
Variation Examples
Coin Toss
Signature
Descriptive Statistics
Terminology
Population :
The entire set or collection of measurements
of interest
Sample :
A subset of data from the population
Population
Sample
Probability
Probability Versus Inference
Probability
(Deduction from available information)
Example: I roll two dice.
What’s the probability that the sum of the two dice will be “7” ?
Statistical Inference
(Induction from observations)
Example: I randomly sample a hundred lines of code out of a program containing ten thousand lines and find six errors.
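The two examples can be checked numerically. The sketch below is a minimal illustration, not part of the original notes: it enumerates the 36 equally likely outcomes of two fair dice (deduction), and it forms a simple point estimate of the error rate from the sampled lines of code (induction). Scaling the estimate to the whole program is an assumption made here for illustration only, and the variable names are hypothetical.

```python
from fractions import Fraction
from itertools import product

# Probability (deduction): enumerate every outcome of two fair six-sided dice.
outcomes = list(product(range(1, 7), repeat=2))
favorable = [pair for pair in outcomes if sum(pair) == 7]
p_seven = Fraction(len(favorable), len(outcomes))

# Statistical inference (induction): point estimate from the sample in the example.
errors_found = 6
lines_sampled = 100
program_lines = 10_000
error_rate_estimate = Fraction(errors_found, lines_sampled)
# Assumption for illustration: the sampled rate applies to the whole program.
estimated_total_errors = error_rate_estimate * program_lines

print("P(sum = 7) =", p_seven)
print("Estimated error rate per line =", float(error_rate_estimate))
print("Estimated errors in program =", int(estimated_total_errors))
```

A point estimate like this changes from sample to sample, so it says nothing by itself about how close it is to the true population value.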
Random Sampling
What does “randomly sample” mean ?
Each measurement or data point in the population has an equal chance of being selected for the
sample.
Class Exercise
Class Project
Random Sampling Via EXCEL or
OpenOffice Spreadsheet
1. Assume we have n = 100 objects and we want to randomly choose 10.
2. Set up in Column A cells A1:A100 with numbers 1 through 100.
3. In Cell B1, type: =rand()
4. Use spreadsheet autofill to extend the rand() function from cell B1 to B100. Recalculate several times using the F9 key or by hitting the delete key in an empty cell.
5. Highlight B1:B100. Copy (Ctrl+C), then use Edit (Alt+E), Paste Special, numbers only, over cells B1:B100.
6. Highlight cells A1:B100. Do Data, Sort using column B, ascending.
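The same sort-by-random-key idea can be sketched outside the spreadsheet. This is a minimal Python illustration, assuming that the sample is taken as the first 10 rows after sorting (the slide does not show that final step). The function name `random_sample_by_sort` and the seed value are hypothetical.

```python
import random

def random_sample_by_sort(population, k, seed=None):
    """Attach a random key to each item, sort by the key, keep the first k."""
    rng = random.Random(seed)
    keyed = [(rng.random(), item) for item in population]  # column B: rand()
    keyed.sort(key=lambda pair: pair[0])                    # sort on column B, ascending
    return [item for _, item in keyed[:k]]                  # assumed: take the top k rows

objects = list(range(1, 101))                # column A: numbers 1 through 100
sample = random_sample_by_sort(objects, 10, seed=216)
print(sample)
```

Python's standard library also offers `random.sample(objects, 10)`, which draws a sample without replacement directly.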
Population, Sample,
Confidence
Population
• Large
• Contains unknown, fixed parameters (such as the average time to failure)
• Determining the exact values of the
interesting parameters may not be practical
Sample
• Typically limited, randomly sampled, finite members of the population
• Sample measurements are easier to obtain
• Sample parameters estimate the respective population parameters and change with different samples drawn
Class Exercise
Population / Sample
Write down an example of a population in your work :
What information would you like to know about this population ?
How would a sample be typically taken from this population? Is it random?
Types of Data: Categorical
Observations which are categorized by the
presence or absence of certain characteristics or qualities. Also called qualitative data.
For example,
pass - fail, go - no go,
in spec - out of spec, mode of chip failure.
Ordinal categorical data has a meaningful ranking
or logical order, for example, ratings in a
questionnaire or classification by small-medium-large.
Nominal categorical data has no meaningful order,
Attribute Data
Quantitative Categorical Data
Counting the occurrences in specific categories
creates discrete quantitative categorical data, since only specific numbers are possible.
For example, the number of defects on a surface or the fraction of good dies on a wafer.
Types of Data: Continuous
For a continuous quantity, such as time, voltage, pressure, humidity, temperature, and so on, any value in an interval is possible.
In reliability work, we commonly refer to
What Type of Data Is?
Time to failure of a component?
variables attributes ordinal nominal
Number of failures in an interval of time?
variables attributes ordinal nominal
Brand of sputtering equipment?
variables attributes ordinal nominal
Serial number on capital equipment?
variables attributes ordinal nominal
Size of an order of McDonald’s French fries?
variables attributes ordinal nominal
Proportion of defective die on a wafer?
variables attributes ordinal nominal
Vendor source?
variables attributes ordinal nominal
Threshold voltage shift?
variables attributes ordinal nominal
Job classifications?
variables attributes ordinal nominal
Exact Times to Failure vs.
Readout or Interval Data
Two ways to obtain failure data:
1) Record the exact times of failure.
• Continuous monitoring system on
stressed components.
Readout or Interval Data
2) Record the number of failures or the
changes in variables data between periodic
readouts.
• Readout or interval data is very
common for components on stress.
• Remove the components from stress
for testing. Return unfailed units to
stress.
• There is additional handling of units for
testing and potential disturbance of
failure mechanisms (e.g., self-recovery)
by removal of stress.
• Good idea to use controls (unstressed
units) that are tested along with
Reliability Stress Test Example
We will use a large number (the population) of microprocessors for a new product.
We obtain a sample (random?) of 100
microprocessors that we stress dynamically
(operational voltages) at an elevated temperature (HTOL).
The stress is run until all components fail. The failure mode observed is an open circuit. The failure mechanism is electromigration resulting from high current flow in a line in the circuit
metallurgy.
Reporting the Sample Results
How do we analyze and report the results from the HTOL experiment on processors?
Averages and Ranges
Don’t Tell the Whole Story
All these distributions have the same mean and range !!!
Numerical Measures Need a
Distribution
Table 1.1
Measurement Data on a
Sample of 100 Fuses
Fuse Opening Value of Current in Amps
4.64 4.95 5.25 5.21 4.90 4.67 4.97 4.92 4.87 5.11
4.98 4.93 4.72 5.07 4.80 4.98 4.66 4.43 4.78 4.53
4.73 5.37 4.81 5.19 4.77 4.79 5.08 5.07 4.65 5.39
5.21 5.11 5.15 5.28 5.20 4.73 5.32 4.79 5.10 4.94
5.06 4.69 5.14 4.83 4.78 4.72 5.21 5.02 4.89 5.19
5.04 5.04 4.78 4.96 4.94 5.24 5.22 5.00 4.60 4.88
5.03 5.05 4.94 5.02 4.43 4.91 4.84 4.75 4.88 4.79
5.46 5.12 5.12 4.85 5.05 5.26 5.01 4.64 4.86 4.73
5.01 4.94 5.02 5.16 4.88 5.10 4.80 5.10 5.20 5.11
4.77 4.58 5.18 5.03 5.10 4.67 5.21 4.73 4.88 4.80
Descriptive Statistics
EXCEL Data Analysis Tools (DAT)
Data is entered as a single column.
In Data ribbon, click Data Analysis Add-In. Select Descriptive Statistics.
Enter selections.
Results are displayed.
Visualizing Data
Histograms
A histogram is a bar chart of a frequency table or frequency distribution of a sample.
Table 1.2
Frequency Table of Fuse Data
Histogram Using Data Analysis
Tools in EXCEL
Enter data into a column. Set up convenient bins to span data. Select DAT. Click Histogram.
Enter information and click boxes as shown.
Adjusting Bars on Histogram
Adjust bar spacing by clicking on chart bars, right-clicking, selecting Format Data Series, Series
Distribution Analysis in JMP
Enter data into a column of a data table. Then select Analyze, Distribution. Cast Fuse Data
column into “Y, Columns” role.
Histograms and Models
• This is a histogram of variables data (the
current in amperes at which the fuse opens), which are continuous measurements.
• With enough data points, the histogram begins to look like a smooth curve.
• The sample frequency distribution shown by the histogram is estimating a theoretical
model or equation for the distribution of
Probability Density Function
Population Model
The population model estimated by the sample frequency distribution is called the probability
density function or PDF and is denoted by f(t).
The PDF equation is the model which describes the
continuous distribution of the times to failure. The
area under the curve is normalized to 1.
The histogram estimates the population PDF curve.
JMP Fits PDF to Histogram
Click red triangle at Fuse Data.
Cumulative Data
An Alternative Way to Visualize
The cumulative frequency table accumulates the number of observations less than or equal to a given value.
Cumulative Frequency Table
Upper Cell Boundary (UCB)    Number of Observations Less Than or Equal To UCB
4.495                        2
4.595                        4
...                          ...
5.495                        100
The graphical rendering is called a cumulative frequency plot.
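A cumulative frequency table like this can be built directly from the Table 1.1 fuse data. The sketch below is illustrative only. It assumes cells of width 0.1 with upper boundaries 4.495, 4.595, ..., 5.495; the table shows only the first two and the last boundary, so the intermediate ones are an assumption. The names `FUSE` and `cumulative_count` are hypothetical.

```python
from bisect import bisect_right

# Table 1.1: fuse opening current in amps (100 measurements).
FUSE = [
    4.64, 4.95, 5.25, 5.21, 4.90, 4.67, 4.97, 4.92, 4.87, 5.11,
    4.98, 4.93, 4.72, 5.07, 4.80, 4.98, 4.66, 4.43, 4.78, 4.53,
    4.73, 5.37, 4.81, 5.19, 4.77, 4.79, 5.08, 5.07, 4.65, 5.39,
    5.21, 5.11, 5.15, 5.28, 5.20, 4.73, 5.32, 4.79, 5.10, 4.94,
    5.06, 4.69, 5.14, 4.83, 4.78, 4.72, 5.21, 5.02, 4.89, 5.19,
    5.04, 5.04, 4.78, 4.96, 4.94, 5.24, 5.22, 5.00, 4.60, 4.88,
    5.03, 5.05, 4.94, 5.02, 4.43, 4.91, 4.84, 4.75, 4.88, 4.79,
    5.46, 5.12, 5.12, 4.85, 5.05, 5.26, 5.01, 4.64, 4.86, 4.73,
    5.01, 4.94, 5.02, 5.16, 4.88, 5.10, 4.80, 5.10, 5.20, 5.11,
    4.77, 4.58, 5.18, 5.03, 5.10, 4.67, 5.21, 4.73, 4.88, 4.80,
]

def cumulative_count(sorted_values, upper_cell_boundary):
    """Number of observations less than or equal to the boundary."""
    return bisect_right(sorted_values, upper_cell_boundary)

sorted_fuse = sorted(FUSE)
# Assumed upper cell boundaries: 4.495, 4.595, ..., 5.495 (width 0.1).
boundaries = [round(4.495 + 0.1 * i, 3) for i in range(11)]
cumulative = [cumulative_count(sorted_fuse, b) for b in boundaries]
# Cell frequencies are the differences of successive cumulative counts.
frequency = [c - prev for c, prev in zip(cumulative, [0] + cumulative[:-1])]

for b, f, c in zip(boundaries, frequency, cumulative):
    print(f"{b:.3f}  {f:3d}  {c:3d}")
```

Dividing each cumulative count by the sample size of 100 gives the cumulative frequency as a proportion.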
Cumulative Distribution Function
Population Model
The population model corresponding to the sample cumulative frequency distribution is called the
cumulative distribution function or CDF and is
denoted by F(t).
The CDF is related to the PDF via the equation:
\[ F(t) = \int_{-\infty}^{t} f(y)\,dy \]
Cumulative Frequency
Function Estimates CDF
CDF From PDF
• In percent, CDF goes from 0 to 100%
• In proportion, CDF goes from 0 to 1
• For variables data restricted to only
positive times, a CDF model is a possible
life distribution
Cumulative Distribution
Function
A Life Distribution
Interpretation 1
F(t) is the probability a unit randomly drawn
from the population fails by time t
For example, if F(500) = 10%, then the probability of a single (randomly drawn) unit failing by 500 hours is 10%.
Interpretation 2
F(t) is the fraction of all units in the population
which fail by time t
Interpretation of the CDF
F(t) = Probability of failure by time t
= Proportion of population that fails
by time t
[Figure: PDF f(t) plotted against Time (t), with a time t marked on the axis]
Class Project
CDF Interpretation
At 1500 hours the population CDF equals 0.16 or 16%.
1. How many failures do I expect at 1500 hours in a random sample of 100 units from this population?
2. What’s the probability that a single unit randomly sampled from the population will fail by 1500 hours?
3. If the population consists of one million units, how many units in the population fail by 1500 hours?
4. What fraction of the population fails by 1500 hours?
5. What’s the probability that no unit fails by 1500 hours if I randomly sample 10 units from the population?
The Reliability Function
• R(t) is called the reliability or survival
function. (Note: Some authors use S(t).)
• R(t) is the probability of surviving to time t.
• R(t) is also the fraction of survivors in the
population to time t
Since the probability of either surviving or failing
must equal one (a certainty), then,
R(t) + F(t) = 1
or
R(t) = 1 - F(t)
Reliability or Survival Plot
Empirical Distribution
Function (EDF)
If we have k measured values in a random sample of n units, instead of grouping data into intervals, we can construct an EDF by ordering the values from
Class Exercise
Constructing EDF in EXCEL
or OpenOffice Spreadsheet
Fuse Data (n = 100)
1. Enter label “Fuse Data” in cell A1, “Sorted Fuse Data” in cell B1, and “EDF” in cell C1.
2. Enter fuse data in Column A.
3. Highlight fuse data in A2:A101 and copy and paste to B2:B101. NOTE: Copy and Paste may be done with arrow cursor on highlighted boundary and Ctrl key.
4. With B2:B101 highlighted, select Data in menu, and choose Sort to sort data in ascending order.
5. In C2:C4, enter values 0.01, 0.02, 0.03. Highlight these three numbers. Place cursor arrow at right lower corner of highlighted region to change to a cross and autofill to C101.
6. Highlight B1:C101.
EXCEL: Select chart wizard and form a scatter plot with line.
SO: Select Insert Object, drag rectangle in sheet, Auto Format Chart, and form a scatter plot with line.
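The spreadsheet steps amount to sorting the data and pairing the i-th ordered value with i/n. A minimal Python sketch of that construction follows, using only the first ten values of Table 1.1 to keep it short (so n = 10 here rather than 100). The function name `empirical_cdf` is hypothetical.

```python
def empirical_cdf(values):
    """Return (ordered value, i/n) pairs, matching the sorted column and the EDF column."""
    ordered = sorted(values)
    n = len(ordered)
    return [(x, (i + 1) / n) for i, x in enumerate(ordered)]

# First ten fuse measurements from Table 1.1 (illustration only).
first_ten = [4.64, 4.95, 5.25, 5.21, 4.90, 4.67, 4.97, 4.92, 4.87, 5.11]
edf = empirical_cdf(first_ten)

for value, fraction in edf:
    print(f"{value:.2f}  {fraction:.2f}")
```

Plotting the pairs as a scatter plot with a connecting line gives the same picture as the chart built in step 6.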
CDF in JMP
Click red triangle next to Fuse Data. Select CDF plot.
The Hazard Rate Concept
of a Life Distribution
The following American experience mortality table gives the proportion living as a function of age, starting from age 10 in increments of 10 years:
AGE     10    20   30   40   50   60   70   80   90   100
LIVING  1.000 .926 .854 .781 .698 .579 .386 .145 .008 .000
[Figure: Percent Alive versus Time in Years]
Creating a Histogram
American experience mortality table
AGE 10 20 30 40 50 60 70 80 90 100
LIVING 1.000 .926 .854 .781 .698 .579 .386 .145 .008 .000
To find the proportion of individuals who die during any ten year period, subtract applicable proportions. For example, during the interval 50 to 60 years,
0.698 - 0.579 = 0.119
or approximately 12% of those alive at age 10 die.
Life Distribution
Here is a histogram of the percent of
individuals alive at age 10 years who die in each subsequent ten year interval
[Figure: histogram of Percent Dying in each Ten Year Interval, from 10 to 20 through 90 to 100]
The Average Hazard Rate
During an Interval
The percent dying is dropping during later intervals because there are very few people from age 10 alive at the beginning of those intervals.
To take into account the decreasing sample size, we use the concept of a hazard rate:
The ratio of the percent of people who die
during an interval
to the percent of people alive at the
beginning of the interval
divided by the
length of the interval
is the
average hazard (or failure) rate
Illustration of Hazard Rate
Calculation
American experience mortality table
AGE     10   20   30   40   50   60   70   80   90   100
LIVING  1.00 .926 .854 .781 .698 .579 .386 .145 .008 .000
Consider the interval 50 to 60 years
Roughly 70% survive to age 50 and 12% of those who started at age 10 die during the interval 50 to 60 years
So 12%/70% = 17% of those alive at age 50, the
beginning of the interval, die during the interval
Divide 17% by 10 years to get the average hazard
rate of 1.7% / yr during the ten year interval
The Hazard Rate Plot
[Figure: average hazard rate in Percent per Year versus Interval Midpoint in Years (15 through 95)]
Plot the average failure rate during an interval (y) at the center of the interval (x) to obtain the
(average) hazard rate plot.
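The interval calculation can be repeated for every row of the mortality table. The sketch below is a minimal illustration using the table values given earlier; the function name `interval_hazard_rates` is hypothetical.

```python
ages = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
living = [1.000, 0.926, 0.854, 0.781, 0.698, 0.579, 0.386, 0.145, 0.008, 0.000]

def interval_hazard_rates(times, surviving):
    """For each interval return (midpoint, proportion dying, average hazard rate)."""
    rows = []
    for i in range(len(times) - 1):
        width = times[i + 1] - times[i]
        dying = surviving[i] - surviving[i + 1]     # proportion of starters dying in interval
        hazard = dying / surviving[i] / width       # per year, relative to those alive at start
        midpoint = (times[i] + times[i + 1]) / 2    # x position for the hazard rate plot
        rows.append((midpoint, dying, hazard))
    return rows

rows = interval_hazard_rates(ages, living)
for midpoint, dying, hazard in rows:
    print(f"{midpoint:5.1f}  {100 * dying:5.1f}%  {100 * hazard:5.2f} %/yr")
```

For the interval 50 to 60 years this reproduces the roughly 12% dying and the average hazard rate of about 1.7% per year worked out above.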
From Average to
Instantaneous Hazard Rate
The average failure rate measures the rate of failure
over a time interval for those units alive at the beginning of the interval.
By going to smaller and smaller time intervals, we approach the hazard rate at a point, that is, the
conditional rate of failure in the next instant of
time following t, given survival to t.
The Hazard Function
The Instantaneous Failure Rate (IFR)
We can show the IFR or hazard rate is:
\[ h(t) = \frac{f(t)}{1 - F(t)} = \frac{f(t)}{R(t)} \]
F(t), f(t) or h(t) are informationally equivalent,
The Average Failure Rate
The average failure rate (AFR) between time t1 and
time t2 is given by
\[ \mathrm{AFR}(t_1, t_2) = \frac{\ln R(t_1) - \ln R(t_2)}{t_2 - t_1} \]
The average failure rate (AFR) over the interval 0 to t is
\[ \mathrm{AFR}(t) = \frac{-\ln R(t)}{t} \]
For F(t) < 10% approximately, we can simplify the expression for the AFR in terms of the CDF
\[ \mathrm{AFR}(t) \approx \frac{F(t)}{t} \]
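As a numerical check on these AFR expressions, the sketch below computes the AFR from reliability values and compares the exact form, -ln R(t) / t, with the small-F(t) approximation F(t) / t. The function names and the value F(1000) = 0.05 are hypothetical, chosen only for illustration.

```python
import math

def afr_interval(r1, r2, t1, t2):
    """AFR between t1 and t2 from the reliabilities R(t1) and R(t2)."""
    return (math.log(r1) - math.log(r2)) / (t2 - t1)

def afr_from_zero(r_t, t):
    """AFR over the interval 0 to t from R(t)."""
    return -math.log(r_t) / t

def afr_small_f(f_t, t):
    """Approximation for small F(t): AFR is roughly F(t) / t."""
    return f_t / t

# Hypothetical example: 5% cumulative failures by 1000 hours.
f_1000 = 0.05
exact = afr_from_zero(1 - f_1000, 1000)
approx = afr_small_f(f_1000, 1000)
print(f"exact AFR  = {exact:.3e} failures/hr")
print(f"approx AFR = {approx:.3e} failures/hr")
```

Because R(0) = 1, the interval form with t1 = 0 reduces to the 0-to-t form.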
The Average Failure Rate
One can also specify an AFR over a time
period, for example, between two times t1 and
time t2.
Example Supplier AFR
Requirements
[Figure: AFR (FITS) versus Time (Khrs)]
Time Interval    AFR
0 - 4,000 hrs    350 FITS
The Average Failure Rate
and CDF
To estimate the cumulative percent failures by time
t using the average failure rate, the formula is
For F(t) < 10% approximately, we can simplify the expression for the AFR in terms of the CDF
For small F(t) between time t1 and time t2
For small F(t) in the interval 0 to t
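The relations this slide refers to are not reproduced in the text. As a hedged sketch, they can be derived from the AFR definitions, AFR(t) = -ln R(t) / t and AFR(t1, t2) = [ln R(t1) - ln R(t2)] / (t2 - t1), together with R(t) = 1 - F(t) and the approximation ln(1 - F) ≈ -F for small F. The notation below is an assumption and may differ from the original slide.

```latex
% Exact relation: solve AFR(t) = -ln R(t) / t for F(t) = 1 - R(t)
F(t) = 1 - e^{-t \,\mathrm{AFR}(t)}

% Small F(t) in the interval 0 to t
\mathrm{AFR}(t) \approx \frac{F(t)}{t}
\quad\Longleftrightarrow\quad
F(t) \approx t \,\mathrm{AFR}(t)

% Small F(t) between time t_1 and time t_2
\mathrm{AFR}(t_1, t_2) \approx \frac{F(t_2) - F(t_1)}{t_2 - t_1}
```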
Percent Fallout from AFR
1. The average hazard rate (AFR) is specified as
0.1%/Khrs over the first 4,000 hours. What is the
expected % fallout after 4,000 hours?
Approximate Calculation
Estimated fallout =
Exact Calculation (ART)
Estimated fallout =
2. The average hazard rate (AFR) is specified as
10%/Khrs over the first 4,000 hours. What is the
expected % fallout after 4,000 hours?
Approximate Calculation
Estimated fallout =
Exact Calculation (ART)
Error in CDF Estimate from
AFR Approximation Formula
Error in CDF Estimate Using Approximate Formula
[Figure: Error (Overestimation) versus Exact CDF]
AFR Calculations in EXCEL
Set up spreadsheet using formula as shown below.
Simple Estimates for CDF
and Reliability
A simple estimate of F(t) at the end of an interval is the total number of failures r by time t divided by the number of starting units.
A simple estimate of R(t) at the end of an interval is the total number of survivors n - r by time t divided by the number of starting units.
Ten units start test. Readouts occur at 24, 48, 168, and 500 hours. Number of failures at readouts are:
Failures 1 2 1 3
Readouts -- 24 -- 48 --- 168 --- 500
Estimate the CDF F(t) and the Reliability Function
Simple Estimates for PDF
and Hazard Rate
An estimate of the average f(t) during an interval is the number of failures during an interval divided by the number of units that started at time t = 0 divided by the time length of the interval
An estimate of the average h(t) during an interval is the number of failures during an interval divided by the number of surviving units starting the interval divided by the time length of the interval
(Continued)
Ten units start test. Readouts occur at 24, 48, 168, and 500 hours. Number of failures at readouts are:
Failures 1 2 1 3
Readouts -- 24 -- 48 --- 168 --- 500
Estimate the PDF f(t) and the average failure rate AFR h(t) during each interval
Time f(t) h(t)
0 to 24
24 to 48
48 to 168
IFR for Integrated Circuits
“Bathtub Curve”
[Figure: bathtub curve with regions labeled Early Fails, Inherent Life, Wearout]
Failure Definition
An event or inoperable state in which
any equipment, or part of the equipment,
does not, or would not, perform as intended.
“Does not perform as intended” has subjectivity.
For example, if the performance is marginal, is
it a failure?
Is a device that is just outside of specification a
failure?
What if the device operates “as intended”
following a recoverable event?
Failure Categories
Catastrophic : Fails suddenly, unexpectedly,
and non-reversibly; e.g., breaking, short, open,
etc.
Degradation : Output degrades below the
expected level, non-reversibly; e.g., fatigue,
corrosion, wear-out
Intermittent : Flip-flopping performance
below and within the expected level randomly
at an unknown time and for an unknown
Failure Rate Units
Failure rates for components are often so
small that units of failures per hour are not
practical. For example, 1 failure in 100
units on test for 1,000 hours is roughly an
AFR of 0.00001 f/h.
Instead, by using suitable multiplication
factors, we can scale the failure rates.
- 10^5 for percent per thousand hours
(%/Khrs)
- 10^9 for FITS (nano-failures per hour or
ppm per thousand hours)
Hazard Rates in FITS
There are two common views of the term
FITS.
1. For a constant hazard rate, for the
equivalent of a billion (10^9) hours, e.g.,
2. For nonconstant hazard rates, we can
use FITs as a convenient measure for the
instantaneous rate of failure at time t or the
average rate of failure over an interval of
time (AFR).
For example, consider a speedometer
reading of speed at time t or the average
reading of speed over a ten minute interval.
Note: Distinguish between point or interval rate estimates which can produce very different FITS values for nonconstant rates.
Table of Equivalent Failure
Rates
In Different Units
Failures Per Hour    % / K     FITS
.00001               1.0       10,000
.000001              .1        1,000
.0000001             .01       100
.00000001            .001      10
.000000001           .0001     1
Failures per hour x 10^5 = % / K
Failures per hour x 10^9 = FITS
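The two multiplication factors, 10^5 for % / K and 10^9 for FITS, are easy to wrap in small helper functions. This is a minimal sketch; the function names are hypothetical and are not the ART unit conversion tool.

```python
def per_hour_to_percent_per_khr(rate_per_hour):
    """Failures per hour -> percent per thousand hours (multiply by 10^5)."""
    return rate_per_hour * 1e5

def per_hour_to_fits(rate_per_hour):
    """Failures per hour -> FITS (multiply by 10^9)."""
    return rate_per_hour * 1e9

def percent_per_khr_to_per_hour(percent_per_khr):
    """Percent per thousand hours -> failures per hour."""
    return percent_per_khr / 1e5

def fits_to_per_hour(fits):
    """FITS -> failures per hour."""
    return fits / 1e9

# First row of the table: 0.00001 failures per hour.
rate = 0.00001
print(per_hour_to_percent_per_khr(rate), "%/K")
print(per_hour_to_fits(rate), "FITS")
```

Converting between % / K and FITS goes through failures per hour, so 1 % / K corresponds to 10,000 FITS, as in the first row of the table.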
Class Project
Equivalent Failure Rates
Fill out the table below by converting two empty cells in each row into failure rate units equivalent to the units specified in that row :
UNITS
Failures / hr % / Khr FITS
200
0.00005
Converting Units in ART
Under Add-Ins, click ART. Select Unit Conversion.
Parameters of Distributions
Numerical Measures
Distributions may be characterized by descriptive numerical constants called
parameters.
Central Tendency (Location)
Parameters of Distributions
Numerical Descriptive Measures
The PDF and CDF equations
• describe the population distribution
• contain one or more parameters in a form that is not unique
These parameters typically have a convenient
interpretation as descriptive measures of the population. For example the PDF for the normal distribution has the equation :
\[ f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(x-\mu)^2 / (2\sigma^2)} \]
The parameters μ and σ can be shown to be
equal to the population mean and standard
Parameters
Statistics
In contrast to a population parameter which is fixed, a statistic is an expression whose value:
• depends on the sample measurements
• changes with each sample drawn
• has its own sampling distribution
\[ \bar{X} = \frac{X_1 + X_2 + \cdots + X_n}{n} = \frac{1}{n}\sum_{i=1}^{n} X_i \]
[Figure: repeated samples drawn from a population]
Sampling Distribution of Means
The most important theorem in statistics.
For any population, the distribution of sample
averages will be approximately normal for large enough n.
The variance of the averages is equal to the
population variance of individual readings divided by the sample size for averages, that is,
\[ \sigma_{\bar{X}}^2 = \frac{\sigma^2}{n} \]
Sampling Distribution of Means
The Central Limit Theorem
Sampling Distribution
Example
Class Exercise
Generate 500 random numbers in a spreadsheet.
Choose a fixed set of 500 points. Make a histogram of the data. What distribution best describes the results?
Using this data, calculate 100 averages based on a sample of size n = 5. Make a histogram of the
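The exercise can also be tried in code. The sketch below is one possible version under stated assumptions: the 500 random numbers are uniform on (0, 1), as a spreadsheet rand() function would give, and the 100 averages come from consecutive, non-overlapping groups of n = 5. The seed value is arbitrary.

```python
import random
import statistics

rng = random.Random(216)                      # fixed seed gives a fixed set of 500 points
data = [rng.random() for _ in range(500)]     # assumed uniform(0, 1) random numbers

n = 5
# 100 averages, each from a consecutive group of n = 5 values.
means = [statistics.mean(data[i:i + n]) for i in range(0, len(data), n)]

var_data = statistics.pvariance(data)
var_means = statistics.pvariance(means)

print("number of averages     :", len(means))
print("variance of individuals:", round(var_data, 4))
print("variance of averages   :", round(var_means, 4))
print("ratio (expect near 1/n):", round(var_means / var_data, 3))
```

A histogram of `data` should look roughly flat, while a histogram of `means` should look more bell shaped and noticeably narrower.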
Censored Reliability Data
If we end the test at a time or failure count before
all units have failed, then there is no information
on the times to failure of censored units
Time Censored (Type I) Failure Censored (Type II)
We call such censoring, single censoring. In fact, reliability data may be multicensored.
Reliability data is usually ordered data.
Because of right censoring, reliability data comes from the early tail of the distribution.
Comparing Censored
Reliability Data
to Randomly Sampled Data
• Threshold data from ten randomly sampled units: 5.5, 8.2, 9.5, 1.4, 3.6, 4.7, 7.3, 6.2, 2.9, 4.1 mV
» Mean: 5.34 mV
» Range: (9.5 - 1.4) = 8.1 mV
• Failure data from ten randomly sampled units (total test time of 10 hrs):
1.9, 2.8, 3.3, 4.6, 5.7, 8.2 hrs
Four units still surviving (no failures) by 10 hrs.
– What’s the mean time to failure of the ten units?
– What’s the range of failure times of the ten units?
– What’s the population model (PDF) for the data?
• To get the answers, we need to assume or specify
the distribution.
Class Project
CDF Interpretation
At 1500 hours the population CDF equals 0.16 or 16%.
1. How many failures do I expect at 1500 hours in a random sample of 100 units from this population?
100x0.16 = 16
2. What’s the probability that a single unit randomly
sampled from the population will fail by 1500 hours?
0.16 or 16 %
3. If the population consists of one million units, how many units in the population fail by 1500 hours ?
1,000,000x0.16 = 160,000
4. What fraction of the population fails by 1500 hours ?
0.16 or 16 %
5. What’s the probability that no unit fails by 1500 hours
if I randomly sample 10 units from the population?
Probability one unit survives is (1-0.16) = 0.84
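The five answers can be reproduced with a few lines of arithmetic. The sketch below is illustrative; for question 5 it assumes the 10 sampled units fail independently, so the probability that all of them survive is the single-unit survival probability raised to the 10th power. The variable names are hypothetical.

```python
F_1500 = 0.16                                  # population CDF at 1500 hours

expected_failures_in_100 = 100 * F_1500        # question 1
p_single_unit_fails = F_1500                   # question 2
population_failures = 1_000_000 * F_1500       # question 3
fraction_failing = F_1500                      # question 4

# Question 5 (assuming independent units): all 10 sampled units survive.
p_one_survives = 1 - F_1500
p_no_failures_in_10 = p_one_survives ** 10

print(expected_failures_in_100, p_single_unit_fails, population_failures, fraction_failing)
print(round(p_no_failures_in_10, 3))
```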
Class Project
Percent Fallout from AFR
1. The average hazard rate (AFR) is specified as
0.1%/Khrs over the first 4,000 hours. What is the
expected % fallout after 4,000 hours?
Approximate Calculation
Estimated fallout = 4x0.001 = 0.004 = 0.4%
Exact Calculation (ART)
Estimated fallout = 1-exp(-4x0.001) = 1-0.996 = 0.004 = 0.4%
2. The average hazard rate (AFR) is specified as
10%/Khrs over the first 4,000 hours. What is the
expected % fallout after 4,000 hours?
Approximate Calculation
Estimated fallout = (10/10^5)x4000 = 0.40 or 40%
Exact Calculation (ART)
Estimated fallout = 1 - exp{-(10/10^5)x4000} = 1 - exp(-0.4) = 0.33 or 33%
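Both fallout calculations can be checked with a short script. This is a minimal sketch, not the ART add-in: it converts an AFR in %/Khrs to failures per hour by dividing by 10^5, then applies the approximate formula (AFR x t) and the exact formula (1 - exp(-AFR x t)). The function names are hypothetical.

```python
import math

def fallout_approx(afr_percent_per_khr, hours):
    """Approximate cumulative fraction failing: AFR x t (reasonable when F(t) is small)."""
    rate_per_hour = afr_percent_per_khr / 1e5
    return rate_per_hour * hours

def fallout_exact(afr_percent_per_khr, hours):
    """Exact cumulative fraction failing: 1 - exp(-AFR x t)."""
    rate_per_hour = afr_percent_per_khr / 1e5
    return 1 - math.exp(-rate_per_hour * hours)

for afr in (0.1, 10):
    approx = fallout_approx(afr, 4000)
    exact = fallout_exact(afr, 4000)
    print(f"AFR = {afr} %/Khrs: approx = {100 * approx:.2f}%, exact = {100 * exact:.2f}%")
```

At 0.1 %/Khrs the two results agree to within rounding, while at 10 %/Khrs the approximation overestimates the fallout (40% versus about 33%).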
(Solution)
Ten units start test. Readouts occur at 24, 48, 168, and 500 hours. Number of failures at readouts are:
Failures 1 2 1 3
Readouts -- 24 -- 48 --- 168 --- 500
Estimate the CDF F(t) and the Reliability Function
Class Exercise
(Solution)
Ten units start test. Readouts occur at 24, 48, 168, and 500 hours. Number of failures at readouts are:
Failures 1 2 1 3
Readouts -- 24 -- 48 --- 168 --- 500
Estimate the PDF f(t) and the average failure rate AFR h(t) during each interval
Time          f(t)                    h(t)
0 to 24       (1/10)/24 = 0.0042      (1/10)/24 = 0.0042
24 to 48      (2/10)/24 = 0.0083      (2/9)/24 = 0.0093
48 to 168     (1/10)/120 = 0.00083    (1/7)/120 = 0.0012
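The simple estimates for all four readout intervals can be generated with one loop. The sketch below applies the definitions from the earlier slides: F(t) is cumulative failures over starting units, R(t) = 1 - F(t), f(t) is interval failures over starting units over interval length, and h(t) is interval failures over units surviving at the start of the interval over interval length. The function name `readout_estimates` is hypothetical.

```python
readouts = [24, 48, 168, 500]     # readout times in hours
failures = [1, 2, 1, 3]           # new failures found at each readout
n_start = 10                      # units starting the test

def readout_estimates(readouts, failures, n_start):
    """Return one dict per interval with the simple F, R, f and h estimates."""
    rows = []
    start, cumulative = 0, 0
    for end, r in zip(readouts, failures):
        at_risk = n_start - cumulative        # survivors entering the interval
        width = end - start
        cumulative += r
        rows.append({
            "interval": (start, end),
            "F": cumulative / n_start,        # CDF estimate at end of interval
            "R": (n_start - cumulative) / n_start,
            "f": r / n_start / width,         # average PDF estimate over the interval
            "h": r / at_risk / width,         # average hazard (failure) rate estimate
        })
        start = end
    return rows

rows = readout_estimates(readouts, failures, n_start)
for row in rows:
    print(row["interval"], row["F"], row["R"], f'{row["f"]:.5f}', f'{row["h"]:.5f}')
```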
Class Exercise
Equivalent Failure Rates
Fill out the table below by converting two empty cells in each row into failure rate units equivalent to the units specified in that row :