CNET/RDF
2.5.1.2.5. Collecting Field Data
Since field data is critical to the reliability assessment process, it is explored in this section. The nuances of collecting and interpreting it are discussed. Some of the issues encountered in collecting field data are discussed in the NPRD discussion included in Chapter 7. The intent of this section is to present guidelines on how to approach field data collection.
Good data collection is the key to an effective process for utilizing data obtained from a reliability tracking system. This information includes:
• Failure statistics (e.g., TTF, MTBF)
• Application information (e.g., stress, environment)
• Failure modes
• Failure causes
The intent of this section is to outline a reliability data collection and analysis system that can provide the data required. Although the reliability tracking system outlined herein has similarities to a FRACAS program, there are distinct differences. A FRACAS program is intended to identify the causes of failures so that corrective action can take place. The program outlined herein is intended to be more comprehensive: in addition to supporting the implementation of corrective actions, it provides the data required to quantify reliability in accordance with the methodologies outlined in this book. This concept is illustrated in Figure 2.5-31.
Figure 2.5-31: Uses of Program Data Elements
[Figure labels: Reliability TTF Analysis; Failure Verification; MTBF Analysis; Root Cause Identification; Vendor Selection; Warranty Claims; RCM Implementation; Implement Design Improvements]
A data system consists of several basic elements: a database, software analysis tools, and an interface to the data system users. The database is the core of the system that captures the raw maintenance data that is necessary to perform the required data analysis. A typical structure of a database is provided in Figure 2.5-32.
Figure 2.5-32: Program Database Structure
[Figure blocks: System Information; Parts Breakdown; Maintenance Data; Root Failure Cause/Analysis Data]
The blocks in the above figure correspond to records in a relational database structure. The data elements associated with each record are defined below.
The System Information record consists of population statistics and needs to be updated whenever the product or system status changes. Such a change occurs when new or modified items are fielded.
The parts breakdown data element consists of a hierarchical description of the system. This description is necessary to avoid confusion as to which FRUs (Field Replaceable Units) belong to which assemblies and the number of FRUs in the assembly, as well as in the entire system.
The maintenance data element consists of a record of the maintenance action taken to maintain or repair the system. It also consists of a description of the anomaly, the failure mode, and the failure mechanism of the failed unit as determined by the maintenance technician. One record corresponds to a single maintenance action, and there can be any number of them for each FRU in the system (i.e., a FRU in the system can be replaced any number of times over the life cycle of the system).
The root failure cause/analysis data element consists of information on the results of the detailed failure analysis that may be performed on the failed unit. It is a separate record because not all maintenance actions will result in the failure analysis of a removed unit.
There are two primary interfaces required of the system. The first is the maintenance technician interface. This interface is the means by which maintenance data is entered into the database. Ideally, this interface would consist of computers located within the maintenance facility for direct data entry. The second interface is the one utilized by individuals who need the results of the data analysis. The flow of the interface to the system from the perspective of the system user is given in Figure 2.5-33.
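The record types described above can be sketched as simple data structures. This is a minimal illustration only, not a prescribed schema: the class names, field names, and the helper `records_for_part` are assumptions chosen for the example, and a fielded system would normally hold these records in relational database tables.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SystemInfo:
    """System Information record (population statistics)."""
    system_number: str               # unique identifier for each system
    date_fielded: str                # ISO date string (hypothetical format)
    location: Optional[str] = None   # location of operation (optional)


@dataclass
class PartEntry:
    """One node in the hierarchical parts breakdown."""
    part_id_code: str                # reference designator
    part_number: str
    parent_id_code: Optional[str]    # None for the top-level assembly
    quantity: int = 1                # number of this part in the assembly


@dataclass
class FailureAnalysis:
    """Root failure cause/analysis record; exists only if analysis was done."""
    summary: str
    results: str


@dataclass
class MaintenanceRecord:
    """One record per maintenance action; many may exist per FRU."""
    job_number: str
    system_number: str
    part_id_code: str
    action: str                      # e.g. "remove/replace" or "maintain"
    failure_analysis: Optional[FailureAnalysis] = None


def records_for_part(records: List[MaintenanceRecord],
                     system_number: str,
                     part_id_code: str) -> List[MaintenanceRecord]:
    """Return every maintenance action logged against one FRU position."""
    return [r for r in records
            if r.system_number == system_number
            and r.part_id_code == part_id_code]
```

Keeping the failure analysis as an optional, separate record mirrors the point made above: not every maintenance action leads to a detailed analysis of the removed unit.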
Figure 2.5-33: Database Information Flow
[Figure labels: System maintenance is required and maintenance commences; Maintenance technician identifies the part requiring maintenance and enters the part data into the database; Technician performs required maintenance; Maintenance technician enters maintenance data; Central Database; System user enters part breakdown and maintains system usage status; User runs appropriate analysis and obtains necessary reliability metric(s)]
Important elements of the data system that should be considered for inclusion are summarized below:
• System information
• Number of systems fielded
• Dates of fielding for each system
• Location of operation (optional)
• System numbers (unique identifier for each system)
Critical elements of a data collection system are discussed below.
Parts Breakdown
A description of every level of assembly must be available, down to the lowest level of repair. For the purposes of this example, this assembly will be called a FRU (Field Replaceable Unit). This product or system description is critical to the unique identification of parts so that the data that is reported at various levels is not confounded. It is also critical if maintenance actions are not consistently performed at the same level. At the lowest level of indenture, the following FRU information is required.
• Part number
• Serial number
• Part identification code (unique descriptor of part in hierarchical breakdown of system; sometimes referred to as a Reference Designator)
• Number of parts in the product or system
• Applicable life unit (e.g., hours, miles, cycles, operations)
• Indication of whether there is an individual elapsed time meter (or a meter of miles, cycles, or operations) on the specific part, or whether system life units must be used
• Manufacturer name
Maintenance Information
A critical element of an effective reliability data collection and analysis system is the accurate quantification of the failure cause. Not all perceived failures are real failures; it is therefore important to identify whether part removals are indeed true failures. Figure 2.5-34 illustrates the hierarchy of maintenance actions.
Figure 2.5-34: Hierarchy of Maintenance Actions
[Figure labels: Maintenance Action; Scheduled; Unscheduled; Perform Routine Maintenance; Remove/Replace; Correct Diagnosis; Incorrect Diagnosis; Cannot Duplicate; Unnecessary Repair; Faulty Unit Gets Put Back into Field; Necessary Repair; Failure Analysis Not Performed; Failure Analysis Performed to Identify Root Cause]
The following is a list of required data elements in the capture of maintenance information:
• Job number (unique identifier)
• Calendar date and time that system is taken out of operation
• Calendar date of maintenance action
• System serial or configuration control number
• Number of total life units (i.e. hours, miles, cycles, operations) on the FRU at the start of the maintenance action (if life unit meter is on FRU)
• Number of total life units (i.e. hours, miles, cycles, operations) on the product or system at the start of the maintenance action (if life unit meter is not on part)
• Number of total life units (FRU or product/system, depending on which of the above two items are applicable) on the part at the start of the maintenance action. This is a calculated field generated by the database software.
• Initial description of the anomaly
• Initiating event (only one is chosen):
o Failure of system to perform (unscheduled maintenance)
o Condition monitoring-based event
o Scheduled maintenance
• When discovered
• Action taken (only one is chosen):
o Remove/replace
o Maintain
o Remove, re-test OK, and replace
• FRU on which action is taken (description and serial/configuration control number)
• Maintenance technician (name)
• Man-hours required for maintenance action
• Calendar date and time that the system is put back into service
• Cause of failure identified by the maintenance technician
• Failure mode description
• Failure mechanism description. There could be a standardized listing of the possible failure mechanisms from which the technician could scan and identify the appropriate mechanism.
Failure Analysis Information
The failure analysis record is used when there is a detailed failure analysis performed on a removed FRU. The data contained in this record generically consists of the following:
• Summary of the analysis performed
• Results of the analysis
Analysis
From the data collected and captured in the database, several fundamental reliability parameters, including those listed below, can be calculated.
• Operating hours (or life unit) of each FRU
• Cumulative operating hours of the population
• Cumulative system calendar hours of the population
• Cumulative FRU calendar hours of the population
• Individual calendar times for each product or system
• For scheduled removals:
o Number of scheduled removals
o Total number of man-hours associated with scheduled removals
o Individual operating times for scheduled removals
o Individual calendar times for scheduled removals
o Number of man-hours for each scheduled removal
• For unscheduled removals:
o Number of unscheduled removals
o Total number of man-hours associated with unscheduled removals
o Individual operating times for unscheduled removals
o Individual calendar times for unscheduled removals
o Number of man-hours for each unscheduled removal
• Number of total removals
• Total number of man-hours
• Individual number of man-hours
• Individual operating times of all removals
• Individual calendar times of all removals
• Number of removals for each failure cause
• Individual operating times of removals for each failure cause
• Individual calendar times of removals for each failure cause
• Total time that each individual product or system is unavailable
For many of these parameters, it is necessary to calculate the number of life units to which each part has been exposed. This is done by calculating the number of life units on the part since the last time that the part was replaced. This calculation procedure is illustrated in Figure 2.5-35.
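The life unit calculation described above can be sketched as a small function. This is an illustrative reading of the procedure, not a definitive implementation: the function name and argument names are hypothetical, and it assumes that a part-level meter, when present, already reads the life accumulated by that part.

```python
from typing import Optional


def part_life_units(part_meter: Optional[float],
                    system_meter_now: float,
                    system_meter_at_last_replacement: Optional[float]) -> float:
    """Life units accumulated by the part currently installed.

    part_meter: reading of the part's own life unit meter, or None if the
        part has no meter (assumption: the meter tracks this part only).
    system_meter_now: current system life unit reading.
    system_meter_at_last_replacement: system reading stored in the last
        maintenance record for this part position, or None if the part
        has never been removed.
    """
    if part_meter is not None:
        # A meter on the part itself is used directly.
        return part_meter
    if system_meter_at_last_replacement is None:
        # No prior maintenance record: the part is as old as the system.
        return system_meter_now
    life = system_meter_now - system_meter_at_last_replacement
    if life < 0:
        raise ValueError("current reading is below the last recorded reading")
    return life


# Example: system meter reads 5000 h; the part was last replaced at 3200 h.
print(part_life_units(None, 5000.0, 3200.0))
```

The check for a negative result is an added safeguard against data entry errors; it is not part of the procedure described in the text.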
Figure 2.5-35: Calculation of Part Life Unit
[Flowchart steps: Is there a life unit meter on the part? Yes: use part life unit meter. No: use system life unit meter. Record life unit meter reading (i.e., hours/miles/cycles). Has the part been previously removed (i.e., is there a maintenance record for that part in the database)? Yes: subtract the system life unit of the last maintenance record from the current life unit. If not: record the system life unit.]
Outputs
Typical output parameters are listed below:
• Mean Operating Hours Between Scheduled Removals
• Mean Calendar Hours Between Scheduled Removals
• Mean Operating Hours Between Unscheduled Removals
• Mean Calendar Hours Between Unscheduled Removals
• Mean Man Hours per Maintenance Action (MMH/MA)
• Distribution of maintenance man hours per maintenance action
• Weibull parameters of individual operating times for unscheduled maintenance actions
• Weibull parameters for failures of a specific cause
• Pareto ranking of part failure rates (or of any of the above listed parameters)
• Failure cause distribution
• Pareto ranking of failure causes
• Mean system availability for each system
• Distribution of system availability
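A few of the outputs listed above can be computed directly from maintenance records, as sketched below. The record layout (dictionaries with `scheduled`, `man_hours`, and `cause` keys), the function names, and the sample values are assumptions made for illustration. The sketch also assumes that "mean operating hours between unscheduled removals" is the cumulative operating hours of the population divided by the number of unscheduled removals.

```python
from collections import Counter


def mean_hours_between_unscheduled_removals(cumulative_operating_hours, removals):
    """Cumulative population operating hours per unscheduled removal."""
    unscheduled = [r for r in removals if not r["scheduled"]]
    if not unscheduled:
        return float("inf")  # no unscheduled removals observed yet
    return cumulative_operating_hours / len(unscheduled)


def mean_man_hours_per_action(removals):
    """MMH/MA: total man-hours divided by the number of maintenance actions."""
    return sum(r["man_hours"] for r in removals) / len(removals)


def pareto_failure_causes(removals):
    """Failure causes ranked from most to least frequent."""
    causes = Counter(r["cause"] for r in removals if r["cause"] is not None)
    return causes.most_common()


# Hypothetical sample data, for illustration only.
sample = [
    {"scheduled": False, "man_hours": 3.0, "cause": "bearing wear"},
    {"scheduled": False, "man_hours": 5.0, "cause": "connector corrosion"},
    {"scheduled": True,  "man_hours": 2.0, "cause": None},
    {"scheduled": False, "man_hours": 2.0, "cause": "bearing wear"},
]

print(mean_hours_between_unscheduled_removals(30000.0, sample))
print(mean_man_hours_per_action(sample))
print(pareto_failure_causes(sample))
```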
Drenick’s Theorem
An important aspect of interpreting field reliability data is distinguishing between calendar time and operating time. Consider a situation in which five items are fielded at the same time, as illustrated in Figure 2.5-36. They will each have a failure time (or other appropriate life unit) that is described by the TTF distribution as a function of operating time.
Figure 2.5-36: Failure Times Based on Operating Time
[Figure: failure times of items 1 through 5 plotted against operating time]
Now, consider the same five items that were placed in the field at different calendar times, as illustrated in Figure 2.5-37. They will have the same failure times relative to their operating time, but the apparent failure times relative to calendar time will be quite different.
Figure 2.5-37: Failure Times Based on Calendar Time
Furthermore, if the product or system is repairable (in which case the failed items are replaced upon failure with a new item), an interesting effect occurs in which the apparent failure rate will reach an asymptotic value that appears to represent a constant failure rate. This occurs as the “time zero” values become randomized as items fail and are replaced with new items.
To illustrate the relationship between the beta value (Weibull shape) and the instantaneous failure rate as a function of calendar time when parts are replaced upon failure, a simulation was performed. In this simulation example, the failure rate of 1100 items as a function of calendar time was calculated.
Figures 2.5-38 through 2.5-42 illustrate the results. These figures correspond to Weibull-distributed TTFs with shape parameters of 20, 5, 2, 1 and 0.5, respectively. The time axis is calendar time, normalized to a time unit of one characteristic life.
[Figure 2.5-37: failure times of items 1 through 5 plotted against calendar time]
Figure 2.5-38: Failure Rate Simulation with Weibull Beta = 20
Figure 2.5-40: Failure Rate Simulation with Weibull Beta = 2.0
Figure 2.5-42: Failure Rate Simulation with Weibull Beta = 0.5
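A simulation of this kind can be sketched in a few lines. The code below is not the code used to produce the figures; it is a minimal illustration under stated assumptions: every item starts at calendar time zero, each failed item is replaced immediately with a new one, and the function name, bin width, horizon, and seed are arbitrary choices. For shape parameters of 1 or more, renewal theory gives a long-run failure rate per item equal to the reciprocal of the mean life, η·Γ(1 + 1/β), which is close to the characteristic life η when beta is large.

```python
import math
import random


def simulate_failure_rate(beta, n_items=1100, eta=1.0,
                          horizon=5.0, bin_width=0.1, seed=0):
    """Failures per item per unit calendar time, in successive time bins.

    Each item is replaced with a new one at every failure, so the calendar
    failure times of one position form a renewal process with
    Weibull(eta, beta) times between failures.
    """
    rng = random.Random(seed)
    n_bins = int(round(horizon / bin_width))
    counts = [0] * n_bins
    for _ in range(n_items):
        t = 0.0
        while True:
            # random.weibullvariate(alpha, beta): alpha is scale, beta is shape
            t += rng.weibullvariate(eta, beta)
            if t >= horizon:
                break
            counts[min(int(t / bin_width), n_bins - 1)] += 1
    return [c / (n_items * bin_width) for c in counts]


# Compare the late-time average rate with 1 / (mean life) for beta = 2.
rates = simulate_failure_rate(2.0, horizon=10.0)
late_average = sum(rates[50:]) / len(rates[50:])
expected = 1.0 / (1.0 * math.gamma(1.0 + 1.0 / 2.0))
print(late_average, expected)
```

Plotting the returned list against calendar time for several beta values gives curves that can be compared qualitatively with the figures.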
Consider the case where the Weibull beta = 20 (Figure 2.5-38). When the population starts operating at the same time, t = 0, failures occur at a rate described by the Weibull distribution with a beta value of 20. The peak of the failure rate occurs at approximately the characteristic life. As units fail and are replaced, the “time zeros” start to become randomized. After enough time passes, the “time zeros” eventually become completely randomized. At this point, the asymptotic value of the failure rate is reached, which is approximately the reciprocal of the characteristic life (in this case, 100).
Figure 2.5-39, depicting the simulation results for a beta value of 5.0, indicates a similar effect. The asymptotic failure rate, however, is reached sooner. This happens because the variance in failure time is greater for a beta of 5.0 than for a beta of 20, which, in turn, means that the population “time zeros” become randomized sooner. The plot illustrating a beta of 2.0 (Figure 2.5-40) is similar, with the corresponding asymptotic value reached sooner still. The plot corresponding to a beta of 1.0 (Figure 2.5-41) indicates that the constant failure rate is present from t = 0, which intuitively makes sense, since a beta of 1.0 has, by definition, a randomly occurring (constant) failure rate.
However, when the beta is less than 1.0 (Figure 2.5-42), the asymptotic failure rate value is zero. This occurs because, when enough time has passed, the failed items have been replaced with items that have a higher probability of living longer. The lower the beta value, the shorter the time period required to achieve a zero failure rate.
Because this is an important factor in interpreting field reliability data, a methodology was derived for the NPRD data to estimate the characteristic life based on field data with varying “time zero” values. This methodology is discussed in Chapter 7, Section 4.
2.5.2. Physics
The generic physics-based approaches covered here are stress/strength interference models and models from first principles. Each is described below.
2.5.2.1. Stress/Strength Modeling
Stress/strength interference theory is a technique used to quantify the probability that the strength of an item is less than the stress to which it is subjected. For example, if the distribution of the strength of an item can be quantified, and the distribution of the stress it is under can be quantified, the area of intersection of the two distributions represents the probability that the strength is less than the stress.
This technique is general in nature and applies equally to any situation in which the two distributions can be quantified, as long as the X-axis represents the same variable for both distributions. The variable can be electrical, such as voltage or current, or it can be mechanical strength, for example, in units of KPSI.
The goal of any design-for-robustness effort is to minimize the variance of both distributions and maximize the separation of the distribution means. In this manner, the probability of distribution intersection, or failure, is minimized.
Figure 2.5-43: Stress Strength Methodology
[Figure labels: Material Properties; CTE; Modulus; Strength; Design Dimension; Dimensions; Extrinsic Stresses; FEA; Stress; Strength Data; Fatigue Data; Probability of Failure vs Time]
In this example, a mechanical item has certain physical properties, for example its modulus and its coefficient of thermal expansion (CTE). These material properties are used in addition to the design variables (e.g., dimensions, extrinsic stresses) to estimate the stresses to which the item is exposed. This stress can be modeled in several ways. One is the use of handbooks that contain closed-form equations that estimate the stress to which a material is exposed as a function of dimensions, force, deflections, etc. This is usually only viable for simple structures. For more complex mechanical structures, finite element models and analysis (FEA) may be required to simulate stresses.
For the strength portion of the model, two factors need to be considered:
• The inherent strength distribution of the material
• The strength properties as a function of time
An example of strength as a function of time is the fatigue properties of the material. The fatigue properties pertain to the strength degradation over time.
At time = 0, the probability of failure is the intersection of the stress and the strength distributions, as illustrated in Figure 2.5-44.
Figure 2.5-44: Stress/Strength Interference
The calculation for Normally distributed stress and strength distributions is:
$$Z = \frac{\mu_y - \mu_x}{\sqrt{\sigma_x^2 + \sigma_y^2}}$$
where:
Z = Standard Normal variate (i.e., the number of standard deviations from the mean of the standardized Normal distribution). The cumulative probability corresponding to Z, which is the probability of failure, can be obtained from:
1. Tables of the Standard Normal distribution
2. The MS Excel formula =NORMSDIST(Z)
μx = the mean of the strength
μy = the mean of the stress
σx = the standard deviation of the strength
σy = the standard deviation of the stress
In many real situations, distributions other than the Normal are used, requiring alternate methods of calculating the interference probability. Readily available software tools can be used for this purpose (Reference 3).
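Both cases can be sketched briefly. The first function evaluates the Normal closed form, taking x as strength and y as stress so that the cumulative Standard Normal probability at Z = (μy − μx)/√(σx² + σy²) is the probability of failure. The second estimates the interference by Monte Carlo sampling, which works for any pair of distributions. The function names and all numerical values are hypothetical, and the Weibull strength and lognormal stress at the end are arbitrary examples rather than the tools cited above.

```python
import math
import random


def normal_cdf(z):
    """Cumulative Standard Normal probability."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def interference_normal(mu_strength, sd_strength, mu_stress, sd_stress):
    """P(strength < stress) for independent Normal strength and stress."""
    z = (mu_stress - mu_strength) / math.sqrt(sd_strength**2 + sd_stress**2)
    return normal_cdf(z)


def interference_monte_carlo(draw_strength, draw_stress, n=200_000, seed=0):
    """Estimate P(strength < stress) by sampling both distributions."""
    rng = random.Random(seed)
    failures = sum(1 for _ in range(n) if draw_strength(rng) < draw_stress(rng))
    return failures / n


# Normal case: strength ~ N(100, 10), stress ~ N(60, 10) (hypothetical values).
p_closed = interference_normal(100.0, 10.0, 60.0, 10.0)
p_mc = interference_monte_carlo(lambda r: r.gauss(100.0, 10.0),
                                lambda r: r.gauss(60.0, 10.0))
print(p_closed, p_mc)

# Non-Normal case: Weibull strength (scale 100, shape 10), lognormal stress.
p_other = interference_monte_carlo(lambda r: r.weibullvariate(100.0, 10.0),
                                   lambda r: r.lognormvariate(4.0, 0.2))
print(p_other)
```

The Monte Carlo estimate becomes noisy when the failure probability is very small, so the sample size has to be increased accordingly.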
As stated previously, in addition to the probability of failure at t=0, it is also critically important to understand how this interference between stress and strength behaves as a function of time. Items will sometimes age (due to mechanisms such as fatigue), which essentially means that the strength distribution changes such that its mean is lowered. Assuming that the stress to which the item is exposed remains constant, the result is that there is more interference, and the failure probability increases with time. To properly account for this aging phenomenon, the characteristics of this strength distribution and the interference must be quantified as a function of time. This concept is illustrated in Figure 2.5-45.
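The effect of aging on the interference can be illustrated with a simple sketch. It assumes Normal stress and strength, a constant stress distribution, and a mean strength that falls linearly with time. The linear degradation law, the function names, and the numbers are assumptions made only for illustration; real degradation behavior has to come from fatigue or other aging data.

```python
import math


def normal_cdf(z):
    """Cumulative Standard Normal probability."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def failure_probability_at(t, mu_strength0, sd_strength,
                           mu_stress, sd_stress, loss_per_unit_time):
    """P(strength < stress) at time t with a linearly decreasing mean strength."""
    mu_strength_t = mu_strength0 - loss_per_unit_time * t
    z = (mu_stress - mu_strength_t) / math.sqrt(sd_strength**2 + sd_stress**2)
    return normal_cdf(z)


# Mean strength falls from 100 to 80 over 10 time units; stress stays at 60.
curve = [failure_probability_at(t, 100.0, 10.0, 60.0, 10.0, 2.0)
         for t in range(0, 11)]
for t, p in enumerate(curve):
    print(t, round(p, 5))
```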
An example of a model that has been successfully used for brittle materials is the following:
$$P = 1 - \exp\left[-\left(\frac{V}{V_0}\right)\left(\frac{\sigma}{S_0}\right)^{m}\left(\frac{t}{t_0}\right)^{m/n}\right]$$
where:
P = probability of failure
m = Weibull slope of the initial strength
S0 = characteristic strength
n = fatigue constant
V and V0 are volume parameters to account for the effects of size (i.e., they account for the effect that the more volume or surface area that there is, the more likely it is to have a strength limiting flaw)
σ = stress
Now, if a screen is applied to the material to eliminate defects having strength values below the applied screen stress threshold (Sth), the probability of failure becomes:
[Equation for the screened case: not recoverable from the source text]
This is only one example of a stress strength model. Many others can be found in the literature.
Models such as these can be invaluable in understanding the sensitivity of reliability to the factors accounted for in the model. However, as is the case with any physics-based model, it is important to validate the model based on empirical evidence. This is critical because there is ample opportunity to introduce large errors in the analysis, based on extreme sensitivity to assumptions, sample variability, etc. Additionally, while the approach may be grounded in physics, the model parameters usually need empirical data for their quantification.