DOE-Based Multicell - Reliability Modeling: The RIAC Guide to Reliability Prediction, Assessmen

The methodology of a DOE (Design of Experiment)-based multicell involves subjecting a sample of products to a combination of factors, or accelerants. These factors can be stresses or categorical variables. The intent of these tests is to generate the data that is required to develop a life model that is capable of predicting reliability under a variety of use conditions. Life modeling is usually performed for specific failure causes. A goal of a reliability program is to identify those causes that warrant the work required to develop a life model. Characteristics of these “critical failure causes” often include:

• Failures experienced in EVT tests
• New, unproven technology
• New, unproven manufacturing processes
• Items exposed to stringent/severe environmental conditions
• Items exposed to stringent/severe operating stresses
• Items designed or manufactured with non-robust practices
• Items with known life limitations
• Items from suppliers with a history of delivery, cost, performance or reliability problems
• Old technology with availability problems (obsolescence and/or diminishing manufacturing sources)

After the identification of critical failure causes of a product or system that require life modeling, action must be taken to ensure that those items are sufficiently robust to meet product/system reliability and durability requirements. Life modeling is used for this purpose, and involves the characterization and quantification of specific failure causes, making it a critical element of a reliability program.

Figure 2.5-10: Life Modeling Methodology

Each of the elements in Figure 2.5-10 is further examined below. Additionally, the topics of Design of Experiments (DOE) and life modeling are treated in more detail in Chapters 4 and 5, due to their relatively complex nature and their importance to life modeling. A detailed example of a developed life model is also provided in Chapter 7.

Identify Factors

Factors are the independent variables that can influence the product reliability, and the response variable is the dependent variable. DOE is a common technique used to study the relationships amongst many types of factors. In the context of this book, the response variables specifically refer to the reliability metric of interest.

Critical failure causes and the factors that potentially affect their probability of occurrence need to be identified. This can be done through testing, through analysis, or both. EVT testing that is performed as part of the overall product/system reliability program can be used for the identification of these factors, as previously described.

[Figure 2.5-10 labels. Actions: Identify Factors; Reliability Tests; Develop Life Model; Characterize Operating Stresses; Predict Reliability under Use Conditions; Model of System Reliability. Tools: FMEA; FTA; DOE; Life Modeling; Measurement (environment, stresses, duty cycle, extreme event statistics).]

FMEA is also a popular analytical technique for this and will be used in the upcoming example.

Factors fall into one of several categories:
• Stresses
  o Environmental
  o Operational
• Product/System Attributes
  o Design factors
  o Manufacturing processes

Each of these factors can be a continuous or a categorical variable:

• A continuous variable is one that can assume any value within a given range
• Categorical variables are those that assume a discrete number of possibilities

Some factors can be modeled as either. For example, environmental stress can be modeled with continuous variables of the specific environmental stresses (e.g., temperature, vibration, humidity), or it can be modeled as a categorical variable. The latter case is the approach that has historically been used in MIL-HDBK-217, which uses environmental categories like “Ground, Benign” and “Airborne, Inhabited”. The 217Plus methodology treats them as continuous variables, but default values are provided for the categorical values of environment.

There are several ways in which these factors can be identified. One method that has proven to be efficient is to utilize the FMEA. This involves modifying the FMEA to include several additional columns that correspond to the factors listed above. At the analyst's discretion, from one to four additional columns can be included, depending on the type of product or system under analysis and the level of rigor desired. In this approach, the FMEA team (or at least someone knowledgeable about the item design and process attributes) identifies the specific stresses or attributes that will affect the probability of occurrence of each failure cause identified in the FMEA. Since each failure cause will generally have an associated risk priority number (RPN), the cumulative RPN can be calculated for all failure causes affected by the specific stress or product/system attribute.

For example, consider the case in which an FMEA was accomplished in this manner, and the results in Figure 2.5-11 were obtained. Here, only the environmental stresses are shown, but the same methodology would apply to whichever additional factors are included in the FMEA.

A more detailed discussion of the FMEA methodology is provided in Chapter 8.

Figure 2.5-11: Identification of Test Stresses Based on the FMEA

In this case, the sum of the RPN values for all failure causes accelerated by mechanical shock is about 500. This cumulative RPN value is a relative number only, but can provide valuable insight into the most important stresses to be addressed in the reliability test plan.

In this example, the test stresses shown pertain to all of the failure causes addressed in the FMEA. In performing life tests on specific failure causes, the information identified in the FMEA should be used to identify the test stresses to be considered in the DOE plan.
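The cumulative RPN calculation described above can be sketched in a few lines. This is a minimal illustration only: the failure causes, RPN values and stress names in `FMEA_ROWS` are hypothetical and are not taken from Figure 2.5-11, and the function name `cumulative_rpn` is an assumption made for this example.

```python
from collections import defaultdict

# Hypothetical FMEA rows: (failure cause, RPN, stresses that accelerate it).
# These values are illustrative only and are not taken from Figure 2.5-11.
FMEA_ROWS = [
    ("Solder joint fatigue", 240, ["temperature cycling", "vibration"]),
    ("Connector fretting", 120, ["vibration", "mechanical shock"]),
    ("Capacitor dielectric breakdown", 180, ["temperature", "humidity"]),
    ("Housing crack", 90, ["mechanical shock"]),
]


def cumulative_rpn(rows):
    """Sum the RPN of every failure cause affected by each stress."""
    totals = defaultdict(int)
    for _cause, rpn, stresses in rows:
        for stress in stresses:
            totals[stress] += rpn
    # Rank stresses from highest to lowest cumulative RPN.
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for stress, total in cumulative_rpn(FMEA_ROWS):
        print(f"{stress:22s} {total}")
```

As in the text, the totals are relative numbers only; they rank candidate test stresses and carry no absolute meaning.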

Reliability Tests

If critical item failure mechanisms are time dependent, then time-based life tests are required. Life tests are conducted by subjecting test samples to a defined stress level and recording the times at which failures occur. The process is repeated for various combinations of factor levels. Considerations for the reliability tests are described below.

Test Plan

If there are multiple accelerating stresses, then life tests must be conducted at various combinations of stress magnitudes. A plan should be developed using an effective tool such as Design of Experiments. The plan should consider all aspects of testing so that the test program generates data in a cost-effective way. It is easy to lapse into the mentality of testing “one factor at a time”, in which tests are conducted to assess specific factors, but this approach is generally not time- or cost-effective.

Factors to consider in establishing an appropriate DOE include (1) the sample size per test cell, (2) stress levels, (3) the number of stress levels for each stress, (4) stress interactions, (5) stress durations, (6) failure criteria, and (7) measurement methodology (i.e., in-situ or periodic). The principles of DOE are treated in more detail in Chapter 4.
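To make the idea of test cells concrete, the sketch below enumerates a full-factorial test matrix and the number of units it would consume. The factor names, levels and per-cell sample size are hypothetical, and a full factorial is only one possible DOE layout; fractional designs (Chapter 4) may be preferable when the number of cells becomes too large.

```python
from itertools import product

# Hypothetical factors and levels, chosen only for illustration.
FACTORS = {
    "temperature_C": [85, 105, 125],
    "humidity_pct": [60, 85],
    "supplier": ["A", "B"],  # categorical factor
}


def full_factorial(factors):
    """Return one dict per test cell, covering every combination of levels."""
    names = list(factors)
    return [dict(zip(names, levels)) for levels in product(*factors.values())]


def units_required(factors, sample_size_per_cell):
    """Total test units needed if every cell gets the same sample size."""
    return len(full_factorial(factors)) * sample_size_per_cell


if __name__ == "__main__":
    for cell in full_factorial(FACTORS):
        print(cell)
    print("units needed:", units_required(FACTORS, sample_size_per_cell=10))
```

Because every combination of levels is tested, a design of this kind allows stress interactions to be estimated, which one-factor-at-a-time testing cannot do.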

Maximum Test Stress

A prerequisite for developing a complete test plan to assess the lifetime of a product or system attribute is knowledge of the maximum stress magnitude that can be tolerated by the item prior to catastrophic failure. This knowledge supports establishment of an upper bound on subsequent test stresses that may be a part of step-stress testing. These tests are generally performed as part of the EVT tests.

In many cases, it is desirable to establish the upper bound of the test stress for each specific stressor. An efficient way to determine this stress level, often called the “destruct limit”, is to perform a step stress test. Here, a sample of units is exposed to a stress level well below the suspected destruct limit. Then, the stress is increased until the product is overstressed. This step-stress test can include a linearly ramped stress, or a stepped-stress in which the samples are exposed to a constant stress for a given dwell time, after which the stress is increased, dwelled, and so on until failure. An example of the identification of these maximum stresses was mentioned previously in the HALT discussion.
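A stepped-stress test of the kind just described can be laid out as a simple schedule. The sketch below is an assumption-laden illustration: the starting stress, step size and dwell time are hypothetical, and taking the stress applied at the first observed failure as the destruct limit is a simplification of how such limits are established in practice.

```python
def step_stress_schedule(start, step, dwell_hours, n_steps):
    """Return (start_time, end_time, stress) tuples for a stepped-stress test."""
    schedule = []
    for i in range(n_steps):
        t0 = i * dwell_hours
        schedule.append((t0, t0 + dwell_hours, start + i * step))
    return schedule


def stress_at_failure(schedule, first_failure_time):
    """Return the stress level being applied when the first failure occurred."""
    for t0, t1, stress in schedule:
        if t0 <= first_failure_time < t1:
            return stress
    raise ValueError("failure time falls outside the test schedule")


if __name__ == "__main__":
    # Hypothetical test: start at 40 units of stress, add 10 every 2 hours.
    plan = step_stress_schedule(start=40, step=10, dwell_hours=2, n_steps=5)
    print(plan)
    print("stress at first failure:", stress_at_failure(plan, 7.5))
```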

The destruct limit can be used as the upper limit of all subsequent life tests. Usually, the actual life tests will be performed at a maximum stress that is a certain percentage below the destruct limit. This percentage is dictated primarily by the sensitivity of the times to failure (TTFs) to the stress. For example, consider the two cases illustrated in Figure 2.5-12. Case 1 is a situation in which the lifetime, and consequently the reliability, is moderately sensitive to the stress level. Case 2 is a situation in which the lifetime has an extreme sensitivity to the stress level.

Figure 2.5-12: Using the Destruct Limit to Define the Life Test Max Stress

For example, if a power law acceleration model is used, the life-stress relationship is:

$$\text{Life} = \frac{A}{S^{n}}$$

where “A” is a life constant, “S” is the stress, and “n” is the exponent that sets the sensitivity of life to stress.

A typical value of “n” for Case 1 would be 1 to 3, whereas a typical value of “n” for Case 2 would be greater than 20.

In Case 1, the maximum stress for the life tests may be 10-20% below the destruct limit. For Case 2, however, the maximum stress should be only a few percentage points below the destruct limit. Otherwise, there is a risk that the product or system will not fail within a reasonable time period, and failures are required for reliability model development.
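The difference between the two cases follows directly from the power law. Because the constant A cancels when two lives are divided, the life at a reduced stress relative to the life at the destruct limit depends only on the stress fraction and on n. The sketch below works this out; the exponents 2 and 25 are illustrative picks from the ranges quoted above, and the function name is hypothetical.

```python
def life_multiplier(stress_fraction, n):
    """Life at a fraction of the destruct-limit stress, relative to life at the limit.

    Uses Life = A / S**n, so the constant A cancels in the ratio.
    """
    if not 0 < stress_fraction <= 1:
        raise ValueError("stress_fraction must be in (0, 1]")
    return (1.0 / stress_fraction) ** n


if __name__ == "__main__":
    for n in (2, 25):
        for fraction in (0.97, 0.90, 0.80):
            print(f"n={n:2d}  stress={fraction:.0%} of limit  "
                  f"life x{life_multiplier(fraction, n):.1f}")
```

With n = 2, testing 10% below the destruct limit lengthens life by only about 1.2 times, whereas with n = 25 the same reduction lengthens it by roughly 14 times, which is why Case 2 tests must stay within a few percent of the limit.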

Stress Profile

The two main types of stress profiles are steady-state and time-varying. Steady-state tests are those in which a sample set is exposed to constant stress levels, and the response (performance parameter(s)) is measured. Several examples are shown in Figure 2.5-13.

Figure 2.5-13: Possible Stress Profiles

Any of the profiles in Figure 2.5-13 can be used to develop life models. If the time-varying stress profiles are used, a cumulative damage model is usually appropriate. In this case, the stress function is integrated to obtain the cumulative damage. This will be explained in more detail later.
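As a rough illustration of integrating a stress function, the sketch below accumulates damage over a piecewise-constant stress profile. It assumes linear damage accumulation (each interval consumes the fraction duration / life of the total life, with failure predicted when the sum reaches 1) combined with a power law life model. Both are assumptions made for this example, as are the parameter values; the cumulative damage treatment later in the book may differ.

```python
def power_law_life(stress, a, n):
    """Life at constant stress under the power law model Life = A / S**n."""
    return a / stress ** n


def cumulative_damage(profile, a, n):
    """Total damage fraction for a list of (duration, stress) segments."""
    return sum(duration / power_law_life(stress, a, n)
               for duration, stress in profile)


def time_to_failure(profile, a, n):
    """Time at which accumulated damage reaches 1, or None if it never does."""
    elapsed = 0.0
    damage = 0.0
    for duration, stress in profile:
        life = power_law_life(stress, a, n)
        if damage + duration / life >= 1.0:
            # Failure occurs partway through this segment.
            return elapsed + (1.0 - damage) * life
        damage += duration / life
        elapsed += duration
    return None


if __name__ == "__main__":
    # Hypothetical stepped profile: 10 hours at stress 5, then 10 hours at stress 10.
    steps = [(10.0, 5.0), (10.0, 10.0)]
    print(cumulative_damage(steps, a=1000.0, n=2))
    print(time_to_failure(steps, a=1000.0, n=2))
```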

Some of the advantages and disadvantages of the two generic approaches are listed in Table 2.5-9.

Table 2.5-9: Stress Profile Option Advantages and Disadvantages

Approach: Steady State Stress
Advantages:
• Results can be easily interpreted
• Facilitates the de-convolution of time and stress effects
Disadvantages:
• Longer test times required
• Requires knowledge of destruct limits

Approach: Stepped (or Linear Ramped) Stress
Advantages:
• Short test times possible
• A good approach when the time to failure characteristics as a function of stress are unknown
• Does not require knowledge of destruct limits
Disadvantages:
• Can be difficult to model parameters
• Software required for modeling

Optimum Measurement Intervals

When testing is performed on products or systems whose performance cannot be monitored in-situ, performance must be measured at periodic intervals. These intervals need to be frequent enough to bracket the TTFs tightly, so that life model parameters can be estimated accurately. The objective is to obtain as much resolution as possible in the regions of time that exhibit high failure rates. The measurement intervals should be an order of magnitude shorter than the failure times.

There are several approaches to determining the appropriate measurement intervals:
1. Use constant intervals. While this approach may not be optimal, it can be appropriate in cases where the failure characteristics are completely unknown.
2. If the rate of occurrence of failure (ROCOF) is expected to decrease over time, the measurement intervals can start out very frequent, and decrease in frequency as the failure rate decreases. This is shown in Figure 2.5-14.

Figure 2.5-14: Measurement Points for an Infant Mortality Failure Cause

3. If the ROCOF is expected to increase over time, the measurement intervals can start out very infrequent, and increase in frequency as the failure rate increases. This is shown in Figure 2.5-15.

Figure 2.5-15: Measurement Points for a Wearout Failure Cause

This case is generally much more difficult to implement because the failure characteristics need to be known before the tests. Therefore, one of the first two approaches is usually desirable.
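The three approaches to spacing measurements can be sketched as follows. The geometric growth of intervals is an assumption chosen for simplicity (the text does not prescribe a spacing rule), and the function names and numeric values are hypothetical.

```python
def constant_schedule(total_time, n_points):
    """Approach 1: equally spaced measurement times ending at total_time."""
    step = total_time / n_points
    return [step * (i + 1) for i in range(n_points)]


def geometric_schedule(first_interval, ratio, total_time):
    """Approach 2: intervals grow by `ratio` each time (dense early, sparse late)."""
    if first_interval <= 0 or ratio < 1:
        raise ValueError("first_interval must be positive and ratio must be >= 1")
    times = []
    t, interval = 0.0, float(first_interval)
    while t + interval <= total_time:
        t += interval
        times.append(t)
        interval *= ratio
    return times


def mirrored_schedule(first_interval, ratio, total_time):
    """Approach 3: the same spacing reversed in time (sparse early, dense late).

    The final end-of-test measurement at total_time is not included.
    """
    early = geometric_schedule(first_interval, ratio, total_time)
    return sorted(total_time - t for t in early)


if __name__ == "__main__":
    print(constant_schedule(100.0, 4))
    print(geometric_schedule(1.0, 2.0, 100.0))
    print(mirrored_schedule(1.0, 2.0, 100.0))
```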


Sample Size Requirements

The determination of adequate sample sizes will depend on several factors, the most important being whether the failure cause is special cause or common cause. If it is special cause, the sample size needed will depend entirely on the percentage of the population affected by the failure cause. For example, if the failure cause manifests itself in 0.1% of the population, then at least 1000 items would be required in order to expect a single failure. Since multiple failures are required for true quantification, an order of magnitude more items, or about 10,000, would be required. The specific number can be calculated by using the principles of reliability demonstration, as explained elsewhere in this book.
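The special-cause example can be checked with a short binomial calculation. The sketch assumes each unit is independently affected with probability p; the function names and the 90% target are hypothetical choices for illustration, and this is not the reliability demonstration procedure referred to above.

```python
import math


def prob_at_least(k, n, p):
    """P(at least k affected units among n) when each is affected with probability p."""
    below = sum(math.comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k))
    return 1.0 - below


def units_for_probability(k, p, target):
    """Smallest n giving at least `target` probability of seeing k or more failures."""
    n = k
    while prob_at_least(k, n, p) < target:
        n += 1
    return n


if __name__ == "__main__":
    p = 0.001  # failure cause affects 0.1% of the population
    print("expected failures in 1000 units:", 1000 * p)
    print("P(at least 1 failure in 1000 units):", round(prob_at_least(1, 1000, p), 3))
    print("units for a 90% chance of at least 1 failure:",
          units_for_probability(1, p, 0.90))
```

Note that 1000 units give an expected count of one failure but only about a 63% chance of observing any failure at all, which supports the text's point that considerably larger samples are needed.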

If the failure cause is a common cause mechanism, meaning that the entire population is at risk, then many fewer items would be required. In this case, test data on enough samples is required such that differences in reliability as a function of the factors (i.e., stresses, indicator variables) can be determined in a statistically significant manner. This will be a function of how much inherent variability there is in the population, and how sensitive the reliability is as a function of the factors under analysis. Essentially, if these variabilities are known, then statistical techniques, like the Fisher F-test, could be used. However, in practice, these variabilities are rarely known a priori. Therefore, sample sizes as large as possible are preferred. In practice, the sample sizes are usually dictated by programmatic constraints, in which case it is the reliability practitioner’s responsibility to lobby program managers for the required samples.

Test Time

The question as to how long tests should be run before stopping them inevitably needs to be addressed. This is especially true in cases where the stress levels are low and the resulting lifetimes are long. While it is usually difficult to determine an appropriate test duration before the test is run, a general rule of thumb is that tests should be run for durations sufficient to cause at least 50% of the items to fail. This facilitates quantification of the median life. Keep in mind that tests are used to characterize the statistical distribution at a specific stress level, and therefore enough failures need to be experienced to quantify the distribution.

Consider the illustration in Figure 2.5-16. In this case, tests were performed at two stress levels, and the resulting TTF distributions were obtainable for each level. The acceleration in this case can be quantified, along with confidence bounds around the acceleration model parameters.

Figure 2.5-16: Acceleration When the Distributions for at Least Two Stresses are Available
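When TTF data are available at two stress levels, as illustrated in Figure 2.5-16, a point estimate of the acceleration can be obtained directly. The sketch below assumes the power law model Life = A / S^n and uses the median TTF at each level as the representative life. The data are hypothetical, and no confidence bounds are computed here; those require a full statistical fit.

```python
import math
from statistics import median


def power_law_exponent(stress_low, ttfs_low, stress_high, ttfs_high):
    """Estimate n in Life = A / S**n from median lives at two stress levels."""
    life_low = median(ttfs_low)
    life_high = median(ttfs_high)
    return math.log(life_low / life_high) / math.log(stress_high / stress_low)


def acceleration_factor(stress_use, stress_test, n):
    """Ratio of life at the use stress to life at the test stress."""
    return (stress_test / stress_use) ** n


if __name__ == "__main__":
    # Hypothetical TTFs (hours) at two stress levels.
    low = [600.0, 800.0, 1000.0]
    high = [75.0, 100.0, 125.0]
    n = power_law_exponent(10.0, low, 20.0, high)
    print("estimated exponent n:", n)
    print("acceleration factor, stress 20 vs 10:", acceleration_factor(10.0, 20.0, n))
```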

Now, consider the case in which the lower stress samples are not tested long enough for a sufficient number of failures to occur. This is shown in Figure 2.5-17. In this case, the distribution cannot be quantified. All that is possible is the estimation of the lower bound of life, via techniques like Weibayes analysis (shown as the star).

Figure 2.5-17: Acceleration When the Distributions for Low Stresses are Not Available

This 50% objective can sometimes be relaxed if enough data is available from at least two other, more stressful conditions to compensate for the lack of data in the low stress condition.
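The Weibayes lower bound mentioned above can be sketched as follows. Weibayes analysis assumes the Weibull shape parameter (beta) is known from prior experience and solves only for the characteristic life (eta). For a test with no failures, the lower confidence bound uses -ln(1 - confidence) in place of the failure count. The beta value, run times and confidence level below are hypothetical.

```python
import math


def weibayes_eta(run_times, beta, failures):
    """Characteristic life estimate when one or more failures were observed.

    run_times must include every unit on test, failed or not.
    """
    if failures < 1:
        raise ValueError("use weibayes_lower_bound when there are no failures")
    total = sum(t ** beta for t in run_times)
    return (total / failures) ** (1.0 / beta)


def weibayes_lower_bound(run_times, beta, confidence):
    """Lower confidence bound on characteristic life for a zero-failure test."""
    if not 0 < confidence < 1:
        raise ValueError("confidence must be between 0 and 1")
    total = sum(t ** beta for t in run_times)
    return (total / -math.log(1.0 - confidence)) ** (1.0 / beta)


if __name__ == "__main__":
    # Hypothetical: 10 units each ran 1000 hours without failure, assumed beta = 2.
    times = [1000.0] * 10
    print(weibayes_lower_bound(times, beta=2.0, confidence=0.90))
```

A confidence of about 63.2% makes the divisor equal to one, which corresponds to the common convention of assuming a single failure.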

Develop Life Model

After the life data is generated from implementing the DOE plan, a reliability model can be constructed. Factors that must be quantified include:

• Time-to-failure (TTF) distribution

• Acceleration factors for the primary stress variables

• Characterization of the impact of specific design attributes on reliability

A generic sequence of events for model development is shown in Figure 2.5-18.

Figure 2.5-18: Life Model Sequence

[Figure 2.5-18 steps: collect data (TTFs; acceleration variables, i.e., stresses and indicator variables); select TTF distribution; select acceleration model(s); estimate model parameters; analyze goodness of fit and parameter significance.]

The TTF distribution can typically be modeled using the Weibull, exponential or lognormal distributions. For sample "subpopulations" that exhibit different reliability behavior than the main population, TTF distributions may manifest themselves as bimodal. It is important that bimodal distributions be characterized. If one of the two "modes" in the distribution appears to be the result of early failures from workmanship, materials or process defects, then this information should be used to develop an appropriate reliability screen. This topic is discussed in detail later in this book.
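As one illustration of estimating TTF distribution parameters, the sketch below fits a two-parameter Weibull distribution to complete (uncensored) failure times by median rank regression, using Benard's approximation for the median ranks. This is a swapped-in, simple technique chosen for brevity; the text does not specify an estimation method, and censored data or maximum likelihood fitting would need a different approach. The failure times shown are hypothetical.

```python
import math


def weibull_mrr(ttfs):
    """Fit Weibull shape (beta) and scale (eta) by median rank regression.

    Assumes every unit failed (no censoring). Returns (beta, eta).
    """
    times = sorted(ttfs)
    n = len(times)
    if n < 2:
        raise ValueError("at least two failure times are required")
    xs, ys = [], []
    for i, t in enumerate(times, start=1):
        f = (i - 0.3) / (n + 0.4)  # Benard's approximation to the median rank
        xs.append(math.log(t))
        ys.append(math.log(-math.log(1.0 - f)))
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    beta = sxy / sxx                    # slope of the Weibull plot
    intercept = y_mean - beta * x_mean  # equals -beta * ln(eta)
    eta = math.exp(-intercept / beta)
    return beta, eta


if __name__ == "__main__":
    hours = [105.0, 190.0, 260.0, 330.0, 410.0, 520.0]  # hypothetical TTFs
    beta, eta = weibull_mrr(hours)
    print(f"beta = {beta:.2f}, eta = {eta:.0f} hours")
```

A pronounced bend in the plotted points, rather than a straight line, is one visual cue for the bimodal behavior discussed above.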

Characterize Operating Stresses

In order to estimate the field reliability of the product, in addition to the life model (which will predict the life characteristics as a function of the chosen factors), information regarding the stresses to which the product or system will be exposed in the field is also necessary.

There are a variety of sources that can be used to estimate the stresses to which an item will be exposed. First, customers will usually specify nominal and worst case environmental requirements in the product or system specification. However, the data in specifications are often very generic and lack sufficient detail for reliability analysis. Another source of information is from direct measurement, either by directly measuring stresses in the item use environment, or by equipping the item with sensors and data logging features.

Field maintenance personnel can also often provide qualitative information pertaining to stresses, especially when those stresses have resulted in failures.

There is a wealth of information available in both commercial and military handbooks and standards. Many industries also have their own source material from the products or systems used in their industry.

A summary of sources includes:
• Customer specifications
• Customer usage information
• Measurement of conditions:
  o Stresses
  o Duty cycle
  o Using a sample of fielded products fitted with sensors and data-recording electronics
• Discussions with field maintenance personnel
• Handbooks and standards
  o MIL-STD-210, “Climatic Information to Determine Design and Test Requirements for Military Systems and Equipment”

Predict Reliability Under Use Conditions

Once life models have been developed for all pertinent failure causes, the specific combinations of design attributes and stresses that result in reliability requirements being met can be identified. These attributes/stresses define the item "safe operating region," which should then be added to the system/product design rules so that reliability requirements for future designs can be met without having to repeat the reliability modeling process for that item.

Model of System Reliability

Once life models have been developed for all pertinent failure causes, they need to be combined such that a reliability estimate of the entire product can be made. Section 2.7 describes this process and the appropriate tools in more detail.

Degradation Modeling

In many cases, the reliability response variable will not be a TTF, but rather it will be the behavior of a critical parameter as a function of time. In these cases, there are several choices:

1. Develop a model that predicts the parameter as a function of all factors that need to be quantified.

2. Derive a simple model (linear, logarithmic, exponential or power law) that describes the parameter as a function of time, and then use this model to estimate a time to failure (i.e., the time at which the parameter is predicted to degrade to some predefined failure threshold).
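Option 2 can be illustrated with the simplest of the listed model forms, a straight line. The sketch below fits the parameter against time by ordinary least squares and extrapolates to a failure threshold. The measurements and the threshold are hypothetical, and a linear trend is an assumption; the logarithmic, exponential and power law forms would follow the same pattern after transforming the data.

```python
def fit_line(times, values):
    """Ordinary least-squares fit of value = intercept + slope * time."""
    n = len(times)
    t_mean = sum(times) / n
    v_mean = sum(values) / n
    stt = sum((t - t_mean) ** 2 for t in times)
    stv = sum((t - t_mean) * (v - v_mean) for t, v in zip(times, values))
    slope = stv / stt
    return slope, v_mean - slope * t_mean


def time_to_threshold(times, values, threshold):
    """Extrapolated time at which the fitted line reaches the failure threshold."""
    slope, intercept = fit_line(times, values)
    if slope == 0:
        raise ValueError("no degradation trend; threshold is never reached")
    t_fail = (threshold - intercept) / slope
    if t_fail < 0:
        raise ValueError("trend moves away from the threshold")
    return t_fail


if __name__ == "__main__":
    # Hypothetical measurements of a critical parameter at four inspection times.
    hours = [0.0, 100.0, 200.0, 300.0]
    output = [10.0, 9.5, 9.0, 8.5]
    print(time_to_threshold(hours, output, threshold=7.0))
```

Applying this to each test unit yields one pseudo failure time per unit, which can then be treated like ordinary TTF data in the life model.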

In many cases, Option 2 is a good choice. Option 1 is a good choice in the following cases:

1. When the failure mechanism can reach an asymptotic value of degradation. This condition is difficult to model using the conventional life modeling techniques
2. If the goal of the analysis is to feed other analytical techniques, like worst case
