Reliability engineering isengineeringthat emphasizes
dependabilityin thelifecycle managementof aproduct. Dependability, or reliability, describes the ability of a sys-tem or component to function under stated conditions for a specified period of time.[1]Reliability engineering rep-resents a sub-discipline withinsystems engineering. Re-liability is theoretically defined as theprobabilityof fail-ure, the frequency of failures, or in terms ofavailability, a probability derived from reliability and maintainability.
Maintainabilityandmaintenancemay be defined as a part of reliability engineering. Reliability plays a key role in cost-effectiveness of systems.
Although reliability is defined and affected by stochas-tic parameters, according to some acknowledged special-ists, quality, reliability and safety are not achieved by mathematics and statistics. Nearly all teaching and lit-erature on the subject emphasizes these aspects, and ig-nores the reality that the ranges of uncertainty involved largely invalidate quantitative methods for prediction and measurement.[2]
Reliability engineering relates closely tosafety engineer-ingand tosystem safety, in that they use common meth-ods for their analysis and may require input from each other. Reliability engineering focuses on costs of fail-ure caused by system downtime, cost of spares, repair equipment, personnel and cost of warranty claims. Safety engineering normally emphasizes not cost, but preserv-ing life and nature, and therefore deals only with par-ticular dangerous system failure modes. High reliability (safety factor) levels are also here the result of good engi-neering, attention to detail and almost never the result of only re-active failure management (reliability accounting / statistics).[3]
Former US Secretary of DefenseJames R. Schlesinger
once stated: “Reliability is, after all, engineering in its most practical form.”[2]
1
Overview
Reliability engineering for complex systems requires a different, more elaborate systems approach than for non-complex systems. Reliability engineering may involve:
• Use (load) studies and requirements specification
(system level analysis)
• Inherent (system) Design Reliability:
Hardware-and Software design, System Diagnostics design
• Functional (Failure) analysis • Human Factors / Errors • Manufacturing induced failures • Maintenance induced failures • Transport induced failures • Storage induced failures • Failure / reliability testing
• Spare-parts stocking (Availability control) • Technical documentation
• Data and information acquisition/organisation
Effective reliability engineering requires understanding of the basics of failure mechanisms for which experience, broad engineering skills and good knowledge from many different special fields of engineering,[4]like:
• Tribology
• Stress (mechanics)
• Fracture mechanics/Fatigue (material) • Thermal engineering
• Fluid mechanics / shock loading engineering • Electrical engineering
• Chemical engineering (e.g. Corrosion)
Reliability may be defined in the following ways:
• The idea that an item is fit for a purpose with respect
to time
• The capacity of a designed, produced or maintained
item to perform as required over time
• The capacity of a population of designed, produced
or maintained items to perform as required over specified time
• The resistance to failure of an item over time • The probability of an item to perform a required
function under stated conditions for a specified pe-riod oftime
2 2 RELIABILITY AND AVAILABILITY PROGRAM PLAN
• The durability of an object.
Many engineering techniques are used in reliability engi-neering, such as reliabilityhazard analysis,failure mode and effects analysis (FMEA), failure modes, mecha-nisms, and effects analysis (FMMEA),[5] fault tree anal-ysis(FTA), material stress and wear calculations, fatigue and creep analysis,finite element method, reliability pre-diction, thermal (stress) analysis, corrosion analysis, hu-man error analysis, reliability testing, statistical uncer-tainty estimations, Monte Carlo simulations, design of ex-periments, reliability centered maintenance (RCM), fail-ure reporting and corrective actions management. Be-cause of the large number of reliability techniques, their expense, and the varying degrees of reliability required for different situations, most projects develop a reliabil-ity program plan to specify the reliabilreliabil-ity tasks that will be performed for that specific system.
Consistent with the creation of safety cases, for example
ARP4761, the goal is to provide a robust set of qualita-tive and quantitaqualita-tive evidence that use of a component or system will not be associated with unacceptable risk. The basic steps to take[6]are to:
• First thoroughly identify relevant unreliability
“haz-ards”, e.g. potential conditions, events, human er-rors, failure modes, interactions, failure mechanisms and root causes, by specific analysis or tests
• Assess the associated system risk, by specific
analy-sis or testing
• Propose mitigation, e.g. requirements, design changes, detection, maintenance, training, by which the risks may be lowered and controlled for at an acceptable level.
• Determine the best mitigation and get agreement on
final, acceptable risk levels, possibly based on cost-benefit analysis
Risk is the combination of probability and severity of the failure incident (scenario) occurring.
In a deminimus definition, severity of failures include the cost of spare parts, man hours, logistics, damage (sec-ondary failures) and downtime of machines which may cause production loss. A more complete definition of failure also can mean injury, dismemberment and death of people within the system (witness mine accidents, in-dustrial accidents, space shuttle failures) and the same to innocent bystanders (witness the citizenry of cities like Bhopal, Love Canal, Chernobyl or Sendai and other vic-tims of the 2011 Tōhoku earthquake and tsunami) - in this case, Reliability Engineering becomes System Safety. What is acceptable is determined by the managing au-thority or customers or the effected communities. Resid-ual risk is the risk that is left over after all reliability activ-ities have finished and includes the un-identified risk and is therefore not completely quantifiable.
2 Reliability and availability
pro-gram plan
A reliability program plan is used to document exactly what “best practices” (tasks, methods, tools, analysis and tests) are required for a particular (sub)system, as well as clarify customer requirements for reliability assessment. For large scale, complex systems, the reliability program plan should be a separatedocument. Resource determi-nation for manpower and budgets for testing and other tasks is critical for a successful program. In general, the amount of work required for an effective program for complex systems is large.
A reliability program plan is essential for achieving high levels of reliability, testability,maintainabilityand the re-sulting systemAvailabilityand is developed early during system development and refined over the systems life-cycle. It specifies not only what the reliability engineer does, but also the tasks performed by otherstakeholders. A reliability program plan is approved by top program management, which is responsible for allocation of suffi-cient resources for its implementation.
A reliability program plan may also be used to evalu-ate and improve availability of a system by the strevalu-ategy on focusing on increasing testability & maintainability and not on reliability. Improving maintainability is gen-erally easier than reliability. Maintainability estimates (Repair rates) are also generally more accurate. How-ever, because the uncertainties in the reliability estimates are in most cases very large, it is likely to dominate the availability (prediction uncertainty) problem; even in the case maintainability levels are very high. When relia-bility is not under control more complicated issues may arise, like manpower (maintainers / customer service ca-pability) shortage, spare part availability, logistic delays, lack of repair facilities, extensive retro-fit and complex configuration management costs and others. The prob-lem of unreliability may be increased also due to the “domino effect” of maintenance induced failures after re-pairs. Only focusing on maintainability is therefore not enough. If failures are prevented, none of the others are of any importance and therefore reliability is generally regarded as the most important part of availability. Re-liability needs to be evaluated and improved related to both availability and the cost of ownership (due to cost of spare parts, maintenance man-hours, transport costs, storage cost, part obsolete risks, etc.). But, as GM and Toyota have belatedly discovered, TCO also includes the down-stream liability costs when reliability calculations do not sufficiently or accurately address customers’ per-sonal bodily risks. Often a trade-off is needed between the two. There might be a maximum ratio between avail-ability and cost of ownership. Testavail-ability of a system should also be addressed in the plan as this is the link be-tween reliability and maintainability. The maintenance strategy can influence the reliability of a system (e.g.
by preventive and/or predictive maintenance), although it can never bring it above the inherent reliability. The reliability plan should clearly provide a strategy for availability control. Whether only availability or also cost of ownership is more important depends on the use of the system. For example, a system that is a critical link in a production system – e.g. a big oil platform – is normally allowed to have a very high cost of ownership if this trans-lates to even a minor increase in availability, as the un-availability of the platform results in a massive loss of rev-enue which can easily exceed the high cost of ownership. A proper reliability plan should always address RAMT analysis in its total context. RAMT stands in this case for reliability, availability, maintainability/maintenance and testability in context to the customer needs.
3
Reliability requirements
For any system, one of the first tasks of reliability en-gineering is to adequately specify the reliability and maintainability requirements derived from the overall
availabilityneeds and more importantly, from proper fail-ure analysis or preliminary test results. Requirements should constrain the designers from designing particular unreliable systems. Setting only availability (reliability, testability and maintainability) targets (e.g. max. Failure rates)is not appropriate. This is a broad misunderstanding about Reliability Requirements Engineering. Reliabil-ity requirements address the system itself, including test and assessment requirements, and associated tasks and documentation. Reliability requirements are included in the appropriate system or subsystem requirements speci-fications, test plans and contract statements. Creation of proper lower level requirements is critical.
Provision of only quantitative minimum targets (e.g. MTBF values/ Failure rates) is not sufficient for differ-ent reasons. One reason is that a full validation (related to correctness and verifiability in time) of an quantitative reliability allocation (requirement spec) on lower levels for complex systems can (often) not be made as a conse-quence of 1) The fact that the requirements are probabal-istic 2) The extremely high level of uncertainties involved for showing compliance with all these probabalistic re-quirements 3) Reliability is a function of time and accu-rate estimates of a (probabalistic) reliability number per item are available only very late in the project, sometimes even only many years after in-service use. Compare this problem with the continues (re-)balancing of for example lower level system mass requirements in the development of an aircraft, which is already often a big undertaking. Notice that in this case masses do only differ in terms of only some %, are not a function of time the data is non-probabalistic and available already in CAD models. In case of reliability, the levels of unreliability (failure rates) may change with factors of decades (1000’s of %)as result of very minor deviations in design, process or anything
else. The information is often not available without huge uncertainties within the development phase. This makes this allocation problem almost impossible to do in a use-ful, practical, valid manner, wich does not result in mas-sive over- or under specification. A pragmatic approach is therefore needed. For example; the use of general lev-els / classes of quantitative requirements only depending on severity of failure effects. Also the validation of re-sults is a far more subjective task than for any other type of requirement. (Quantitative) Reliability parameters -in terms of MTBF - are by far the most uncerta-in design parameters in any design.
Furthermore, reliability design requirements should drive a (system or part) design to incorporate features that pre-vent failures from occurring or limit consequences from failure in the first place! Not only to make some predic-tions, this could potentially distract the engineering ef-fort to a kind of accounting work. A design requirement should be so precise enough so that a designer can “de-sign to” it and can also prove -through analysis or testing-that the requirement has been achieved, and if possible within some a stated confidence. Any type of reliabil-ity requirement should be detailed and could be derived from failure analysis (Finite Element Stress and Fatigue analysis, Reliability Hazard Analysis, FTA, FMEA, Hu-man Factor analysis, Functional Hazard Analysis, etc.) or other any type of reliability testing. Also, requirements are needed for verification tests e.g. required overload loads (or stresses) and test time needed. To derive these requirements in an effective manner, asystems engineer-ingbased risk assessment and mitigation logic should be used. These practical design requirements shall drive the design and not only be used for verification purposes. These requirements (often design constraints) are in this way derived from failure analysis or preliminary tests. Understanding of this difference with only pure quantita-tive requirement specification (e.g. Failure Rate / MTBF setting) is paramount in the development of successful (complex) systems.[7]
The maintainability requirements address the costs of re-pairs as well as repair time. Testability (not to be con-fused with test requirements) requirements provide the link between reliability and maintainability and should address detectability of failure modes (on a particular sys-tem level), isolation levels and the creation of diagnostics (procedures).
As indicated above, reliability engineers should also ad-dress requirements for various reliability tasks and doc-umentation during system development, test, production, and operation. These requirements are generally speci-fied in the contract statement of work and depend on how much leeway the customer wishes to provide to the con-tractor. Reliability tasks include various analyses, plan-ning, and failure reporting. Task selection depends on the criticality of the system as well as cost. A safety crit-ical system may require a formal failure reporting and review process throughout development, whereas a
non-4 5 DESIGN FOR RELIABILITY
critical system may rely on final test reports. The most common reliability program tasks are documented in re-liability program standards, such as MIL-STD-785 and IEEE 1332. Failure reporting analysis and corrective ac-tion systems are a common approach for product/process reliability monitoring.
4 Reliability culture
Practically, most failures can in the end be traced back to a root causes of the type of human errors of any kind. For example, human errors in:
• Use studies
• Requirement analysis / setting • Configuration control • Assumptions
• Calculations / simulations / FEM analysis • Design
• Design drawings
• Testing (incorrect load settings or failure
measure-ment) • Statistical analysis • Manufacturing • Quality control • Maintenance • Maintenance manuals
• Incorrect feedback of information • etc.
However, humans are also very good in detection of (the same) failures, correction of failures and improvising when abnormal situations occur. The policy that human actions should be completely ruled out of any design and production process to improve reliability may not be ef-fective therefore. Some tasks are better performed by humans and some are better performed by machines.[8] Furthermore, human errors in management and the orga-nization of data and information or the misuse or abuse of items may also contribute to unreliability. This is the core reason why high levels of reliability for complex sys-tems can only be achieved by following a robustsystems engineeringprocess with proper planning and execution of the validation and verification tasks. This also includes careful organization of data and information sharing and creating a “reliability culture” in the same sense as hav-ing a “safety culture” is paramount in the development of safety critical systems.
5 Design for reliability
Reliability design begins with the development of a (sys-tem)model. Reliability and availability models useblock diagramsandfault treesto provide a graphical means of evaluating the relationships between different parts of the system. These models may incorporate predictions based on failure rates taken from historical data. While the (in-put data) predictions are often not accurate in an absolute sense, they are valuable to assess relative differences in design alternatives. Maintainability parameters, for ex-ampleMTTR, are other inputs for these models. The most important fundamental initiating causes and failure mechanisms are to be identified and analyzed with engineering tools. A diverse set of practical guidance and practical performance and reliability requirements should be provided to designers so they can generate low-stressed designs and products that protect or are protected against damage and excessive wear. Proper Validation of input loads (requirements) may be needed and veri-fication for reliability “performance” by testing may be needed.
A Fault Tree Diagram
One of the most important design techniques is redundancy. This means that if one part of the system fails, there is an alternate success path, such as a backup system. The reason why this is the ultimate design choice is related to the fact that high confidence reliability evi-dence for new parts / items is often not available or ex-tremely expensive to obtain. By creating redundancy, together with a high level of failure monitoring and the avoidance of common cause failures, even a system with relative bad single channel (part) reliability, can be made highly reliable (mission reliability) on system level. No testing of reliability has to be required for this. Further-more, by using redundancy and the use of dissimilar de-sign and manufacturing processes (different suppliers) for the single independent channels, less sensitivity for qual-ity issues (early childhood failures) is created and very
high levels of reliability can be achieved at all moments of the development cycles (early life times and long term). Redundancy can also be applied in systems engineering by double checking requirements, data, designs, calcula-tions, software and tests to overcome systematic failures. Another design technique to prevent failures is called physics of failure. This technique relies on understand-ing the physical static and dynamic failure mechanisms. It accounts for variation in load, strength and stress lead-ing to failure at high level of detail, possible with use of modern finite element method(FEM) software pro-grams that may handle complex geometries and mecha-nisms like creep, stress relaxation, fatigue and probabilis-tic design (Monte Carlo simulations / DOE). The material or component can be re-designed to reduce the probabil-ity of failure and to make it more robust against varia-tion. Another common design technique is component derating: Selecting components whose tolerance signif-icantly exceeds the expected stress, as using a heavier gauge wire that exceeds the normal specification for the expectedelectrical current.
Another effective way to deal with unreliability issues is to perform analysis to be able to predict degradation and being able to prevent unscheduled down events / fail-ures from occurring. RCM(Reliability Centered Main-tenance) programs can be used for this.
Many tasks, techniques and analyses are specific to par-ticular industries and applications. Commonly these in-clude:
• Built-in test (BIT) (testability analysis) • Failure mode and effects analysis
(FMEA)
• Reliabilityhazard analysis • Reliability block-diagram analysis • Dynamic Reliability block-diagram
analysis[9] • Fault tree analysis • Root cause analysis • Sneak circuit analysis • Accelerated testing
• Reliability growth analysis (active
re-liability)
• Weibullanalysis (re-active reliability)
• Thermal analysisby finite element analy-sis (FEA) and / or measurement
• Thermal induced, shock andvibration fa-tigueanalysis by FEA and / or measure-ment
• Electromagnetic analysis • Statistical interference
• Avoidance ofsingle point of failure
• Functional analysis and functional failure
analysis (e.g., function FMEA, FHA or FFA)
• Predictive and preventive maintenance:
reliability centered maintenance (RCM) analysis
• Testability analysis
• Failure diagnostics analysis (normally
also incorporated in FMEA)
• Human error analysis • Operational hazard analysis • Manual screening
• Integrated logistics support
Results are presented during the system design reviews and logistics reviews. Reliability is just one requirement among many system requirements. Engineering trade studies are used to determine theoptimumbalance be-tween reliability and other requirements and constraints.
6 Reliability prediction and
im-provement
Reliability prediction is the combination of the creation of a proper reliability model together with estimating (and justifying) the input parameters for this model (like failure rates for a particular failure mode or event and the mean time to repair the system for a particular failure) and finally to provide a system (or part) level estimate for the output reliability parameters (system availability or a particular functional failure frequency).
Some recognized reliability engineering specialists – e.g. Patrick O'Connor, R. Barnard – have argued that too much emphasis is often given to the prediction of reliabil-ity parameters and more effort should be devoted to the prevention of failure (reliability improvement).[2] Fail-ures can and should be prevented in the first place for most cases. The emphasis on quantification and target setting in terms of (e.g.) MTBF might provide the idea that there is a limit to the amount of reliability that can be achieved. In theory there is no inherent limit and higher reliability does not need to be more costly in development. Another of their arguments is that prediction of reliability based on historic data can be very misleading, as a compari-son is only valid for exactly the same designs, products, manufacturing processes and maintenance under exactly the same loads and environmental context. Even a mi-nor change in detail in any of these could have major ef-fects on reliability. Furthermore, normally the most un-reliable and important items (most interesting candidates for a reliability investigation) are most often subjected to many modifications and changes. Engineering designs are in most industries updated frequently. This is the rea-son why the standard (re-active or pro-active) statistical
6 8 QUANTITATIVE SYSTEM RELIABILITY PARAMETERS – THEORY
methods and processes as used in the medical industry or insurance branch are not as effective for engineering. An-other surprising but logical argument is that to be able to accurately predict reliability by testing, the exact mech-anisms of failure must have been known in most cases and therefore – in most cases – can be prevented! Fol-lowing the incorrect route by trying to quantify and solv-ing a complex reliability engineersolv-ing problem in terms of MTBF or Probability and using the re-active approach is referred to by Barnard as “Playing the Numbers Game” and is regarded as bad practise.[3]
For existing systems, it is arguable that responsible pro-grams would directly analyse and try to correct the root cause of discovered failures and thereby may render the initial MTBF estimate fully invalid as new assump-tions (subject to high error levels) of the effect of the patch/redesign must be made. Another practical issue concerns a general lack of availability of detailed fail-ure data and not consistent filtering of failfail-ure (feedback) data or ignoring statistical errors, which are very high for rare events (like reliability related failures). Very clear guidelines must be present to be able to count and compare failures, related to different type of root-causes (e.g. manufacturing-, maintenance-, transport-, system-induced or inherent design failures, ). Comparing differ-ent type of causes may lead to incorrect estimations and incorrect business decisions about the focus of improve-ment.
To perform a proper quantitative reliability prediction for systems may be difficult and may be very expensive if done by testing. On part level, results can be obtained often with higher confidence as many samples might be used for the available testing financial budget, however unfortunately these tests might lack validity on system level due to the assumptions that had to be made for part level testing. These authors argue that it can not be em-phasized enough that testing for reliability should be done to create failures in the first place, learn from them and to improve the system / part. The general conclusion is drawn that an accurate and an absolute prediction – by field data comparison or testing – of reliability is in most cases not possible. An exception might be failures due to wear-out problems like fatigue failures. In the introduc-tion of MIL-STD-785 it is written that reliability predic-tion should be used with great caupredic-tion if not only used for
comparison in trade-off studies.
See also: Risk Assessment#Quantitative risk assessment
– Critics paragraph
7
Reliability theory
Main articles: Reliability theory, Failure rate and
Survival analysis
Reliability is defined as theprobabilitythat a device will
perform its intended function during a specified period of time under stated conditions. Mathematically, this may be expressed as,
R(t) = P r{T > t} =
∫ ∞ t
f (x) dx
where f (x) is the failure probability density function and t is the length of the period of time (which is assumed to start from time zero).
There are a few key elements of this definition:
1. Reliability is predicated on “intended function:" Generally, this is taken to mean operation without failure. However, even if no individual part of the system fails, but the system as a whole does not do what was intended, then it is still charged against the system reliability. The system requirements spec-ification is the criterion against which reliability is measured.
2. Reliability applies to a specified period of time. In practical terms, this means that a system has a speci-fied chance that it will operate without failure before time t. Reliability engineering ensures that compo-nents and materials will meet the requirements dur-ing the specified time. Units other than time may sometimes be used.
3. Reliability is restricted to operation under stated (or explicitly defined) conditions. This constraint is necessary because it is impossible to design a sys-tem for unlimited conditions. AMars Rover will have different specified conditions than a family car. The operating environment must be addressed dur-ing design and testdur-ing. That same rover may be re-quired to operate in varying conditions requiring ad-ditional scrutiny.
8 Quantitative system reliability
parameters – theory
Quantitative Requirements are specified using reliability
parameters. The most common reliability parameter is themean time to failure(MTTF), which can also be spec-ified as thefailure rate(this is expressed as a frequency or conditional probability density function (PDF)) or the number of failures during a given period. These param-eters may be useful for higher system levels and systems that are operated frequently, such as most vehicles, ma-chinery, and electronic equipment. Reliability increases as the MTTF increases. The MTTF is usually specified in hours, but can also be used with other units of mea-surement, such as miles or cycles. Using MTTF values
on lower system levels can be very misleading, specially if the Failures Modes and Mechanisms it concerns (The F in MTTF) are not specified with it.[10]
In other cases, reliability is specified as the probability of mission success. For example, reliability of a scheduled aircraft flight can be specified as a dimensionless proba-bility or a percentage, as insystem safetyengineering. A special case of mission success is the single-shot de-vice or system. These are dede-vices or systems that remain relatively dormant and only operate once. Examples in-clude automobileairbags, thermalbatteriesandmissiles. Single-shot reliability is specified as a probability of one-time success, or is subsumed into a related parameter. Single-shot missile reliability may be specified as a re-quirement for the probability of a hit. For such systems,
the probability of failure on demand (PFD)is the reliabil-ity measure – which actually is an unavailabilreliabil-ity number. This PFD is derived from failure rate (a frequency of oc-currence) and mission time for non-repairable systems. For repairable systems, it is obtained from failure rate and mean-time-to-repair (MTTR) and test interval. This measure may not be unique for a given system as this mea-sure depends on the kind of demand. In addition to sys-tem level requirements, reliability requirements may be specified for critical subsystems. In most cases, reliabil-ity parameters are specified with appropriate statistical
confidence intervals.
9
Reliability modelling
Reliability modelling is the process of predicting or un-derstanding the reliability of a component or system prior to its implementation. Two types of analysis that are often used to model a complete system availability (in-cluding effects from logistics issues like spare part provi-sioning, transport and manpower) behavior are fault tree analysis and reliability block diagrams. On component level the same type of analysis can be used together with others. The input for the models can come from many sources: Testing, Earlier operational experience field data or data handbooks from the same or mixed industries can be used. In all cases, the data must be used with great cau-tion as prediccau-tions are only valid in case the same product in the same context is used. Often predictions are only made to compare alternatives.
A reliability block diagram showing a 1oo3 (1 out of 3) redun-dant designed subsystem
For part level predictions, two separate fields of investi-gation are common:
• Thephysics of failureapproach uses an understand-ing of physical failure mechanisms involved, such as mechanicalcrack propagationor chemicalcorrosion
degradation or failure;
• Theparts stress modellingapproach is an empirical method for prediction based on counting the number and type of components of the system, and the stress they undergo during operation.
Software reliabilityis a more challenging area that must be considered when it is a considerable component to sys-tem functionality.
10 Reliability test requirements
Reliability test requirements can follow from any analy-sis for which the first estimate of failure probability, fail-ure mode or effect needs to be justified. Evidence can be generated with some level of confidence by testing. With software-based systems, the probability is a mix of software and hardware-based failures. Testing reliability requirements is problematic for several reasons. A sin-gle test is in most cases insufficient to generate enough statistical data. Multiple tests or long-duration tests are usually very expensive. Some tests are simply impracti-cal, and environmental conditions can be hard to predict over a systems life-cycle.
Reliability engineering is used to design a realistic and affordable test program that provides empirical evidence that the system meets its reliability requirements. Statis-ticalconfidence levelsare used to address some of these concerns. A certain parameter is expressed along with a corresponding confidence level: for example, anMTBF
of 1000 hours at 90% confidence level. From this speci-fication, the reliability engineer can, for example, design a test with explicit criteria for the number of hours and number of failures until the requirement is met or failed. Different sorts of tests are possible.
The combination of required reliability level and required confidence level greatly affects the development cost and the risk to both the customer and producer. Care is needed to select the best combination of requirements – e.g. cost-effectiveness. Reliability testing may be per-formed at various levels, such as component,subsystem
andsystem. Also, many factors must be addressed during testing and operation, such as extreme temperature and humidity, shock, vibration, or other environmental fac-tors (like loss of signal, cooling or power; or other catas-trophes such as fire, floods, excessive heat, physical or security violations or other myriad forms of damage or degradation). For systems that must last many years, ac-celerated life tests may be needed.
8 11 RELIABILITY TESTING
11
Reliability testing
A reliability sequential test plan
The purpose of reliability testing is to discover potential problems with the design as early as possible and, ulti-mately, provide confidence that the system meets its reli-ability requirements.
Reliability testing may be performed at several levels and there are different types of testing. Complex systems may be tested at component, circuit board, unit, assem-bly, subsystem and system levels [11] . (The test level nomenclature varies among applications.) For example, performing environmental stress screening tests at lower levels, such as piece parts or small assemblies, catches problems before they cause failures at higher levels. Test-ing proceeds durTest-ing each level of integration through full-up system testing, developmental testing, and operational testing, thereby reducing program risk. However, testing does not mitigate unreliability risk.
With each test both a statistical type 1 and type 2 error could be made and depends on sample size, test time, as-sumptions and the needed discrimination ratio. There is risk of incorrectly accepting a bad design (type 1 error) and the risk of incorrectly rejecting a good design (type 2 error).
It is not always feasible to test all system requirements. Some systems are prohibitively expensive to test; some
failure modesmay take years to observe; some complex interactions result in a huge number of possible test cases; and some tests require the use of limited test ranges or other resources. In such cases, different approaches to testing can be used, such as (highly) accelerated life test-ing,design of experiments, andsimulations.
The desired level of statistical confidence also plays an role in reliability testing. Statistical confidence is
in-creased by increasing either the test time or the number of items tested. Reliability test plans are designed to achieve the specified reliability at the specifiedconfidence level
with the minimum number of test units and test time. Different test plans result in different levels of risk to the producer and consumer. The desired reliability, statisti-cal confidence, and risk levels for each side influence the ultimate test plan. The customer and developer should agree in advance on how reliability requirements will be tested.
A key aspect of reliability testing is to define “failure”. Although this may seem obvious, there are many situa-tions where it is not clear whether a failure is really the fault of the system. Variations in test conditions, opera-tor differences, weather and unexpected situations create differences between the customer and the system devel-oper. One strategy to address this issue is to use a scor-ing conference process. A scorscor-ing conference includes representatives from the customer, the developer, the test organization, the reliability organization, and sometimes independent observers. The scoring conference process is defined in the statement of work. Each test case is con-sidered by the group and “scored” as a success or failure. This scoring is the official result used by the reliability engineer.
As part of the requirements phase, the reliability engineer develops a test strategy with the customer. The test strat-egy makes trade-offs between the needs of the reliability organization, which wants as much data as possible, and constraints such as cost, schedule and available resources. Test plans and procedures are developed for each reliabil-ity test, and results are documented.
11.1 Accelerated testing
The purpose ofaccelerated life testing (ALT test)is to in-duce field failure in the laboratory at a much faster rate by providing a harsher, but nonetheless representative, envi-ronment. In such a test, the product is expected to fail in the lab just as it would have failed in the field—but in much less time. The main objective of an accelerated test is either of the following:
• To discover failure modes
• To predict the normal field life from the
highstresslab life
An Accelerated testing program can be broken down into the following steps:
• Define objective and scope of the test • Collect required information about the
product
• Identify the stress(es) • Determine level of stress(es)
• Conduct the accelerated test and analyze
the collected data.
Common way to determine a life stress relationship are
• Arrhenius model • Eyring model
• Inverse power law model • Temperature–humidity model • Temperature non-thermal model
12
Software reliability
Further information:Software reliability
Software reliability is a special aspect of reliability en-gineering. System reliability, by definition, includes all parts of the system, including hardware, software, sup-porting infrastructure (including critical external inter-faces), operators and procedures. Traditionally, reliabil-ity engineering focuses on critical hardware parts of the system. Since the widespread use of digitalintegrated cir-cuittechnology, software has become an increasingly crit-ical part of most electronics and, hence, nearly all present day systems.
There are significant differences, however, in how soft-ware and hardsoft-ware behave. Most hardsoft-ware unreliabil-ity is the result of a component or material failure that results in the system not performing its intended func-tion. Repairing or replacing the hardware component restores the system to its original operating state. How-ever, software does not fail in the same sense that hard-ware fails. Instead, softhard-ware unreliability is the result of unanticipated results of software operations. Even rela-tively small software programs can have astronomically largecombinationsof inputs and states that are infeasi-ble to exhaustively test. Restoring software to its original state only works until the same combination of inputs and states results in the same unintended result. Software re-liability engineering must take this into account. Despite this difference in the source of failure between software and hardware, severalsoftware reliability mod-els based on statistics have been proposed to quantify what we experience with software: the longer software is run, the higher the probability that it will eventually be used in an untested manner and exhibit a latent defect that results in a failure (Shooman1987), (Musa 2005), (Denney 2005).
As with hardware, software reliability depends on good requirements, design and implementation. Software reli-ability engineering relies heavily on a disciplinedsoftware engineering process to anticipate and design against
unintended consequences. There is more overlap be-tween softwarequality engineeringand software reliabil-ity engineering than between hardware qualreliabil-ity and reli-ability. A good software development plan is a key as-pect of the software reliability program. The software development plan describes the design and coding stan-dards, peer reviews, unit tests, configuration manage-ment,software metricsand software models to be used during software development.
A common reliability metric is the number of software faults, usually expressed as faults per thousand lines of code. This metric, along with software execution time, is key to most software reliability models and estimates. The theory is that the software reliability increases as the number of faults (or fault density) decreases or goes down. Establishing a direct connection between fault density and mean-time-between-failure is difficult, how-ever, because of the way software faults are distributed in the code, their severity, and the probability of the com-bination of inputs necessary to encounter the fault. Nev-ertheless, fault density serves as a useful indicator for the reliability engineer. Other software metrics, such as com-plexity, are also used. This metric remains controver-sial, since changes in software development and verifica-tion practices can have dramatic impact on overall defect rates.
Testing is even more important for software than hard-ware. Even the best software development process results in some software faults that are nearly undetectable until tested. As with hardware, software is tested at several levels, starting with individual units, through integration and full-up system testing. Unlike hardware, it is inadvis-able to skip levels of software testing. During all phases of testing, software faults are discovered, corrected, and re-tested. Reliability estimates are updated based on the fault density and other metrics. At a system level, mean-time-between-failure data can be collected and used to estimate reliability. Unlike hardware, performing exactly the same test on exactly the same software configuration does not provide increased statistical confidence. Instead, software reliability uses different metrics, such ascode coverage.
Eventually, the software is integrated with the hardware in the top-level system, and software reliability is sub-sumed by system reliability. The Software Engineering Institute’scapability maturity modelis a common means of assessing the overall software development process for reliability and quality purposes.
13 Reliability engineering vs safety
engineering
Reliability engineering differs from safety engineering
with respect to the kind of hazards that are considered. Reliability engineering is in the end only concerned with
10 14 RELIABILITY VERSUS QUALITY (SIX SIGMA)
cost. It relates to all Reliability hazards that could trans-form into incidents with a particular level of loss of rev-enue for the company or the customer. These can be cost due to loss of production due to system unavailability, un-expected high or low demands for spares, repair costs, man hours, (multiple) re-designs, interruptions on normal production (e.g. due to high repair times or due to unex-pected demands for non-stocked spares) and many other indirect costs.
Safety engineering, on the other hand, is more specific and regulated. It relates to only very specific and system safety hazards that could potentially lead to severe acci-dents and is primarily concerned with loss of life, loss of equipment, or environmental damage. The related sys-tem functional reliability requirements are sometimes ex-tremely high. It deals with unwanted dangerous events (for life, property, and environment) in the same sense as reliability engineering, but does normally not directly look at cost and is not concerned with repair actions after failure / accidents (on system level). Another difference is the level of impact of failures on society and the control of governments. Safety engineering is often strictly con-trolled by governments (e.g. nuclear, aerospace, defense, rail and oil industries).
Furthermore, safety engineering and reliability engineer-ing may even have contradictengineer-ing requirements. This re-lates to system level architecture choices . For example, in train signal control systems it is common practice to use a fail-safe system design concept. In this concept the
Wrong-side failureneed to be fully controlled to an ex-treme low failure rate. These failures are related to pos-sible severe effects, like frontal collisions (2* GREEN lights). Systems are designed in a way that the far ma-jority of failures will simply result in a temporary or to-tal loss of signals or open contacts of relays and generate RED lights for all trains. This is the safe state. All trains are stopped immediately. This fail-safe logic might un-fortunately lower the reliability of the system. The rea-son for this is the higher risk of false tripping as any full or temporary, intermittent failure is quickly latched in a shut-down (safe)state. Different solutions are available for this issue. See chapter Fault Tolerance below.
13.1 Fault tolerance
Reliability can be increased here by using a 2oo2 (2 out of 2) redundancy on part or system level, but this does in turn lower the safety levels (more possibilities for Wrong Side and undetected dangerous Failures). Fault tolerant voting systems (e.g. 2oo3 voting logic) can increase both reliability and safety on a system level. In this case the so-called “operational” or “mission” reliability as well as the safety of a system can be increased. This is also com-mon practice in Aerospace systems that need continued availability and do not have a fail safe mode (e.g. flight computers and related electrical and / or mechanical and / or hydraulic steering functions need always to be
work-ing. There are no safe fixed positions for rudder or other steering parts when the aircraft is flying).
13.2 Basic reliability and mission
(opera-tional) reliability
The above example of a 2oo3 fault tolerant system in-creases both mission reliability as well as safety. How-ever, the “basic” reliability of the system will in this case still be lower than a non redundant (1oo1) or 2oo2 sys-tem! Basic reliability refers to all failures, including those that might not result in system failure, but do result in maintenance repair actions, logistic cost, use of spares, etc. For example, the replacement or repair of 1 chan-nel in a 2oo3 voting system that is still operating with one failed channel (which in this state actually has become a 1oo2 system) is contributing to basic unreliability but not mission unreliability. Also, for example, the failure of the taillight of an aircraft is not considered as a mission loss failure, but does contribute to the basic unreliability.
13.3 Detectability and common cause
fail-ures
When using fault tolerant (redundant architectures) sys-tems or syssys-tems that are equipped with protection func-tions, detectability of failures and avoidance of common cause failures become paramount for safe functioning and/or mission reliability.
14 Reliability versus Quality (
Six
Sigma
)
The everyday usage term “quality of a product” is loosely taken to mean its inherent degree of excellence. In in-dustry, this is made more precise by defining quality to be “conformance to requirements at the start of use”.
As-suming the product specifications adequately capture
cus-tomer needs, the quality level can now be precisely mea-sured by the fraction of units shipped that meet the de-tailed product specifications.[12]
But are requirements and related product specifications validated? Will it later result in worn, by fatigue of corro-sion mechanisms or due to maintenance induced failures changed items and how many of these systems still meet function and fullfill customer needs after a week of oper-ation? What performance loss do we see or is it fully out of main function? What happens after a month, or at the end of a one year warranty period? That is where “relia-bility” comes in. Quality is a snapshot at the start of life and mainly related to control of product specifications and reliability is more of a system level motion picture of the day-by-day operation. Time zero defects are manufactur-ing mistakes that escaped final test (Quality Control). The
additional defects that appear over time are “reliability defects” or reliability fallout. These reliability issues may just as well occur due to Inherent design issues, which may have nothing to do with non-conformance product specifications. Items that are produced perfectly - ac-cording all product specifications - may fail over time due to any failure mechanism (e.g. mechanical-, electrical-, chemical- or human error related). Theoreticallyelectrical-, all
items will functionally fail over infinite time.[13]The Qual-ity level might be described by a single fraction defective. To describe reliability fallout a probability model that de-scribes the fraction fallout over time is needed. This is known as the life distribution model.[12]
Quality is therefore related to Manufacturing and Reli-ability is more related to the validation of sub-system or lower item requirements and design solutions. Items that do not conform to (any) product specification in gen-eral will do worse in terms of reliability (having a lower MTTF), but this does not always have to be the case. The full Quantification (in statistical models) of this combined relation is in general very difficult. In case manufactur-ing variances can be effectively reduced, six sigma tools may be used to find optimal process solutions and may thereby also increase reliability. Other Reliability So-lutions are generally found by either simplifying a sys-tem, understanding all mechanisms of failure involved, increase robustness (against variation from the manufac-turing variances and failure mechanisms) and possibly to use redundancy and fault tolerant systems in case of high availability needs (see chapter Reliability engineering vs Safety engineering above).
15
Reliability operational
assess-ment
After a system is produced, reliability engineering mon-itors, assesses and corrects deficiencies. Monitoring in-cludes electronic and visual surveillance of critical pa-rameters identified during the fault tree analysis design stage. Data collection is highly dependent on the nature of the system. Most large organizations havequality control
groups that collect failure data on vehicles, equipment and machinery. Consumer product failures are often tracked by the number of returns. For systems in dormant storage or on standby, it is necessary to establish a formal surveil-lance program to inspect and test random samples. Any changes to the system, such as field upgrades or recall re-pairs, require additional reliability testing to ensure the reliability of the modification. Since it is not possible to anticipate all the failure modes of a given system, espe-cially ones with a human element, failures will occur. The reliability program also includes a systematicroot cause analysisthat identifies the causal relationships involved in the failure such that effective corrective actions may be implemented. When possible, system failures and cor-rective actions are reported to the reliability engineering
organization.
One of the most common methods to apply to a reliability operational assessment arefailure reporting, analysis, and corrective action systems (FRACAS). This systematic approach develops a reliability, safety and logistics as-sessment based on Failure / Incident reporting, manage-ment, analysis and corrective/preventive actions. Organi-zations today are adopting this method and utilize com-mercial systems such as a Web based FRACAS applica-tion enabling an organizaapplica-tion to create a failure/incident data repository from which statistics can be derived to view accurate and genuine reliability, safety and quality performances.
It is extremely important to have one common source FRACAS system for all end items. Also, test results should be able to be captured here in a practical way. Fail-ure to adopt one easy to handle (easy data entry for field engineers and repair shop engineers)and maintain inte-grated system is likely to result in a FRACAS program failure.
Some of the common outputs from a FRACAS system in-cludes: Field MTBF, MTTR, Spares Consumption, Re-liability Growth, Failure/Incidents distribution by type, location, part no., serial no, symptom etc.
The use of past data to predict the reliability of new com-parable systems/items can be misleading as reliability is a function of the context of use and can be affected by small changes in the designs/manufacturing.
16 Reliability organizations
Systems of any significant complexity are developed by organizations of people, such as a commercialcompany
or a government agency. The reliability engineering organization must be consistent with the company’s
organizational structure. For small, non-critical systems, reliability engineering may be informal. As complexity grows, the need arises for a formal reliability function. Because reliability is important to the customer, the cus-tomer may even specify certain aspects of the reliability organization.
There are several common types of reliability organiza-tions. The project manager or chief engineer may employ one or more reliability engineers directly. In larger orga-nizations, there is usually a product assurance orspecialty engineeringorganization, which may include reliability,
maintainability,quality, safety,human factors,logistics, etc. In such case, the reliability engineer reports to the product assurance manager or specialty engineering man-ager.
In some cases, a company may wish to establish an inde-pendent reliability organization. This is desirable to en-sure that the system reliability, which is often expensive and time consuming, is not unduly slighted due to
bud-12 19 SEE ALSO
get and schedule pressures. In such cases, the reliability engineer works for the project day-to-day, but is actually employed and paid by a separate organization within the company.
Because reliability engineering is critical to early system design, it has become common for reliability engineers, however the organization is structured, to work as part of anintegrated product team.
17
Certification
TheAmerican Society for Qualityhas a program to be-come a Certified Reliability Engineer, CRE. Certifica-tion is based on educaCertifica-tion, experience, and a certificaCertifica-tion test: periodic re-certification is required. The body of knowledge for the test includes: reliability management, design evaluation, product safety, statistical tools, design and development, modeling, reliability testing, collecting and using data, etc.
Another highly respected certification program is the
CRP(Certified Reliability Professional). To achieve cer-tification, candidates must complete a series of courses focused on important Reliability Engineering topics, suc-cessfully apply the learned body of knowledge in the workplace and publicly present this expertise in an indus-try conference or journal.
18
Reliability engineering
educa-tion
Some universities offer graduate degrees in reliability en-gineering. Other reliability engineers typically have an engineering degree, which can be in any field of engineer-ing, from an accredited university or college program. Many engineering programs offer reliability courses, and some universities have entire reliability engineering grams. A reliability engineer may be registered as a pro-fessional engineer by the state, but this is not required by most employers. There are many professional confer-ences and industry training programs available for relia-bility engineers. Several professional organizations exist for reliability engineers, including the IEEE Reliability Society, the American Society for Quality (ASQ), and theSociety of Reliability Engineers (SRE).
19
See also
• Brittle systems • Burn-in
• Cauchy stress tensor • Factor of safety
• Failing badly
• FMEA
• Fault-tolerant system • Fault tree analysis • Fracture mechanics • Solid mechanics
• Highly accelerated life test • Highly accelerated stress test • Human reliability
• Industrial engineering • Integrated logistics support • Logistic engineering • Performance engineering • Product qualification • Professional engineer • Quality assurance • RAMS • Redundancy (engineering)
• Redundancy (total quality management) • Reliability (disambiguation)
• Reliability, availability and serviceability (computer hardware)
• Reliability theory
• Reliability theory of aging and longevity • Reliable system design
• Risk assessment • Safety engineering • Safety integrity level • Security engineering
• Single point of failure(SPOF)
• Software engineering • Software reliability testing • Spurious trip level
• Structural fracture mechanics • Strength of materials • Systems engineering • Temperature cycling
20
References
[1] Institute of Electrical and Electronics Engineers (1990) IEEE Standard Computer Dictionary: A Compilation of IEEE Standard Computer Glossaries. New York, NY
ISBN 1-55937-079-3
[2] O'Connor, Patrick D. T. (2002), Practical Reliability
En-gineering (Fourth Ed.), John Wiley & Sons, New York.
ISBN 978-0-4708-4462-5.
[3] Barnard, R.W.A. (2008).“What is wrong with Reliability Engineering?". Lambda Consulting. Retrieved 30 Octo-ber 2014.
[4] “Articles - Where Do Reliability Engineers Come From? - ReliabilityWeb.com: A Culture of Reliability”. [5] Using Failure Modes, Mechanisms, and Effects
Analy-sis in Medical Device Adverse Event Investigations, S. Cheng, D. Das, and M. Pecht, ICBO: International Con-ference on Biomedical Ontology, Buffalo, NY, July 26– 30, 2011, pp. 340–345
[6] Federal Aviation Administration (19 March 2013).
System Safety Handbook (PDF). U.S. Department of Transportation. Retrieved 2 June 2013.
[7] System Reliability Theory, second edition, Rausand and Hoyland - 2004
[8] The Blame Machine, Why Human Error Causes Acci-dents - Whittingham, 2007
[9] Salvatore Distefano, Antonio Puliafito: Dependability Evaluation with Dynamic Reliability Block Diagrams and Dynamic Fault Trees. IEEE Trans. Dependable Sec. Comput. 6(1): 4-17 (2009)
[10] Practical Reliability Engineering, O'Conner, 2001 [11] Ben-Gal I., Herer Y. and Raz T. (2003).“Self-correcting
inspection procedure under inspection errors”. IIE Trans-actions on Quality and Reliability, 34(6), pp. 529–540. [12] “8.1.1.1. Quality versus reliability”.
[13] “The Second Law of Thermodynamics, Evolution, and Probability”.
21
Further reading
• Blanchard, Benjamin S. (1992), Logistics Engineer-ing and Management (Fourth Ed.), Prentice-Hall,
Inc., Englewood Cliffs, New Jersey.
• Breitler, Alan L. and Sloan, C. (2005), Proceedings
of the American Institute of Aeronautics and Astro-nautics (AIAA) Air Force T&E Days Conference, Nashville, TN, December, 2005: System Reliabil-ity Prediction: towards a General Approach Using a Neural Network.
• Ebeling, Charles E., (1997), An Introduction to Re-liability and Maintainability Engineering,
McGraw-Hill Companies, Inc., Boston.
• Denney, Richard (2005) Succeeding with Use
Cases: Working Smart to Deliver Quality. Addison-Wesley Professional Publishing. ISBN . Discusses the use of software reliability engineering in use casedriven software development.
• Gano, Dean L. (2007), “Apollo Root Cause
Analy-sis” (Third Edition), Apollonian Publications, LLC., Richland, Washington
• Holmes, Oliver Wendell, Sr.The Deacon’s Master-piece
• Kapur, K.C., and Lamberson, L.R., (1977), Reli-ability in Engineering Design, John Wiley & Sons,
New York.
• Kececioglu, Dimitri, (1991) “Reliability
Engineer-ing Handbook”, Prentice-Hall, Englewood Cliffs, New Jersey
• Trevor Kletz(1998) Process Plants: A Handbook for
Inherently Safer Design CRCISBN 1-56032-619-0 • Leemis, Lawrence, (1995) Reliability: Probabilistic
Models and Statistical Methods, 1995, Prentice-Hall. ISBN 0-13-720517-1
• Frank Lees(2005). Loss Prevention in the Process
Industries (3rdEdition ed.). Elsevier. ISBN 978-0-7506-7555-0.
• MacDiarmid, Preston; Morris, Seymour; et al.,
(1995), Reliability Toolkit: Commercial Practices
Edition, Reliability Analysis Center and Rome
Lab-oratory, Rome, New York.
• Modarres, Mohammad; Kaminskiy, Mark;
Krivtsov, Vasiliy (1999), “Reliability Engineering and Risk Analysis: A Practical Guide, CRC Press,
ISBN 0-8247-2000-8.
• Musa, John (2005) Software Reliability
Engineer-ing: More Reliable Software Faster and Cheaper, 2nd. Edition, AuthorHouse. ISBN
• Neubeck, Ken (2004) “Practical Reliability
Analy-sis”, Prentice Hall, New Jersey
• Neufelder, Ann Marie, (1993), Ensuring Software Reliability, Marcel Dekker, Inc., New York. • O'Connor, Patrick D. T. (2002), Practical Reliability
Engineering (Fourth Ed.), John Wiley & Sons, New
York.ISBN 978-0-4708-4462-5.
• Shooman, Martin, (1987), Software Engineering: Design, Reliability, and Management, McGraw-Hill,
14 21 FURTHER READING
• Tobias, Trindade, (1995), Applied Reliability,
Chap-man & Hall/CRC,ISBN 0-442-00469-9 • Springer Series in Reliability Engineering
• Nelson, Wayne B., (2004), Accelerated Testing – Statistical Models, Test Plans, and Data Analysis,
John Wiley & Sons, New York,ISBN 0-471-69736-2
• Bagdonavicius, V., Nikulin, M., (2002),
“Acceler-ated Life Models. Modeling and Statistical analy-sis”, CHAPMAN&HALL/CRC, Boca Raton,ISBN 1-58488-186-0
21.1 US standards, specifications, and
handbooks
• Aerospace Report Number:
TOR-2007(8583)−6889 Reliability Program Re-quirements for Space Systems, The Aerospace Corporation(10 Jul 2007)
• DoD 3235.1-H (3rd Ed)Test and Evaluation of Sys-tem Reliability, Availability, and Maintainability (A Primer), U.S. Department of Defense (March 1982)
.
• NASA GSFC 431-REF-000370 Flight Assurance Procedure: Performing a Failure Mode and Effects Analysis,National Aeronautics and Space Adminis-tration Goddard Space Flight Center(10 Aug 1996).
• IEEE 1332–1998 IEEE Standard Reliability Pro-gram for the Development and Production of Elec-tronic Systems and Equipment,Institute of Electrical and Electronics Engineers(1998).
• JPL D-5703 Reliability Analysis Handbook, National Aeronautics and Space Administration Jet Propulsion Laboratory(July 1990).
• MIL-STD-785B Reliability Program for Systems and Equipment Development and Production, U.S.
Department of Defense (15 Sep 1980). (*Obsolete, superseded by ANSI/GEIA-STD-0009-2008 titled
Reliability Program Standard for Systems Design, Development, and Manufacturing, 13 Nov 2008) • MIL-HDBK-217F Reliability Prediction of
Elec-tronic Equipment, U.S. Department of Defense (2
Dec 1991).
• MIL-HDBK-217F (Notice 1)Reliability Prediction of Electronic Equipment, U.S. Department of
De-fense (10 Jul 1992).
• MIL-HDBK-217F (Notice 2)Reliability Prediction of Electronic Equipment, U.S. Department of
De-fense (28 Feb 1995).
• MIL-STD-690DFailure Rate Sampling Plans and Procedures, U.S. Department of Defense (10 Jun
2005).
• MIL-HDBK-338B Electronic Reliability Design Handbook, U.S. Department of Defense (1 Oct
1998).
• MIL-HDBK-2173 Reliability-Centered Mainte-nance (RCM) Requirements for Naval Aircraft, Weapon Systems, and Support Equipment, U.S.
De-partment of Defense (30 JAN 1998); (superseded byNAVAIR 00-25-403).
• MIL-STD-1543BReliability Program Requirements for Space and Launch Vehicles, U.S. Department of
Defense (25 Oct 1988).
• MIL-STD-1629AProcedures for Performing a Fail-ure Mode Effects and Criticality Analysis, U.S.
De-partment of Defense (24 Nov 1980).
• MIL-HDBK-781AReliability Test Methods, Plans, and Environments for Engineering Development, Qualification, and Production, U.S. Department of
Defense (1 Apr 1996).
• NSWC-06 (Part A & B) Handbook of Reliabil-ity Prediction Procedures for Mechanical Equipment, Naval Surface Warfare Center(10 Jan 2006).
• SR-332 Reliability Prediction Procedure for Elec-tronic Equipment,Telcordia Technologies (January 2011).
• FD-ARPP-01Automated Reliability Prediction Pro-cedure,Telcordia Technologies(January 2011).
21.2 UK standards
In the UK, there are more up to date standards maintained under the sponsorship of UK MOD as Defence Standards. The relevant Standards include:
DEF STAN 00-40 Reliability and Maintainability (R&M)
• PART 1: Issue 5: Management Responsibilities and
Requirements for Programmes and Plans
• PART 4: (ARMP-4)Issue 2: Guidance for Writing
NATO R&M Requirements Documents
• PART 6: Issue 1: IN-SERVICE R & M
• PART 7 (ARMP-7) Issue 1: NATO R&M
Termi-nology Applicable to ARMP’s
DEF STAN 00-42 RELIABILITY AND MAINTAIN-ABILITY ASSURANCE GUIDES
• PART 1: Issue 1: ONE-SHOT DE-VICES/SYSTEMS
• PART 2: Issue 1: SOFTWARE • PART 3: Issue 2: R&M CASE • PART 4: Issue 1: Testability
• PART 5: Issue 1: IN-SERVICE RELIABILITY
DEMONSTRATIONS
DEF STAN 00-43 RELIABILITY AND MAINTAIN-ABILITY ASSURANCE ACTIVITY
• PART 2: Issue 1: IN-SERVICE
MAINTAIN-ABILITY DEMONSTRATIONS
DEF STAN 00-44 RELIABILITY AND MAINTAIN-ABILITY DATA COLLECTION AND CLASSIFICA-TION
• PART 1: Issue 2: MAINTENANCE DATA &
DEFECT REPORTING IN THE ROYAL NAVY, THE ARMY AND THE ROYAL AIR FORCE
• PART 2: Issue 1: DATA CLASSIFICATION AND
INCIDENT SENTENCING – GENERAL
• PART 3: Issue 1: INCIDENT SENTENCING –
SEA
• PART 4: Issue 1: INCIDENT SENTENCING –
LAND
DEF STAN 00-45 Issue 1: RELIABILITY CENTERED MAINTENANCE
DEF STAN 00-49 Issue 1: RELIABILITY AND MAIN-TAINABILITY MOD GUIDE TO TERMINOLOGY DEFINITIONS
These can be obtained fromDSTAN. There are also many commercial standards, produced by many organisations including the SAE, MSG, ARP, and IEE.
21.3 French standards
• FIDES . The FIDES methodology (UTE-C 80-811)
is based on the physics of failures and supported by the analysis of test data, field returns and existing modelling.
• UTE-C 80–810 or RDF2000 . The RDF2000
methodology is based on the French telecom expe-rience.
21.4
International standards
• TC 56 Standards: Dependability22 External links
• Prognostics Journal, an open-access journal, pro-vides an international forum for the electronic publi-cation of original research and industrial experience articles in all areas of systems reliability and prog-nostics.
• Models and methods regarding reliability analysis • Structural Safety