
Fault Avoidance

In document dtj v01 01 aug1985 pdf (Page 73-75)

Our first goal in designing a reliable system was to reduce the number of failures that occur in the machine. This involved getting components, interconnects, and power systems with the lowest failure rates. Reducing the failure rates also involved constantly monitoring the failures that were experienced and determining their causes.

Reliability into the VAX 8600 System

A major influence on IC reliability was exercised by specifying how the chips were to be stressed and tested. The DIPs and the macrocell arrays (MCAs) were required to be burned in before testing; thereafter, all chips were to be functionally tested. However, in debugging the early machines we discovered bad DIPs. We had expected to find only a handful of bad chips since they were all burned in. To identify the cause of these failures, all defective chips were analyzed. The problem was identified as static that was "zapping" our modules. Subsequently, the design was changed so that all machines come with static grounding straps.

We also examined the designs of previous systems to determine which problem areas were typical. The backplane is an example. Wire-wrapped backplanes are difficult to build and test. They have several failure modes, such as cold flow of the insulation, a nicked wire, and scraps of wire. They can also be damaged during servicing of the machine. All these problems often result in intermittent faults that slowly but surely become more solid. Improving the quality control on the wire-wrapping process to obtain the desired reliability was a very difficult task, since the process is composed of a large number of repetitive but not identical operations. Moreover, a very small error rate still produces quite a large overall failure rate. Therefore, early in the project, we decided to replace the wire-wrapped backplane with a multilayer printed circuit card, which has a much lower failure rate.
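The point that a very small error rate still produces a large overall failure rate can be checked with a rough calculation. The sketch below assumes that defects in individual wraps are independent, and the wrap count and per-wrap defect probability are hypothetical numbers chosen for illustration, not figures from the 8600 project.

```python
def backplane_defect_probability(per_wrap_defect_prob: float, wrap_count: int) -> float:
    """Probability that at least one of wrap_count wraps is defective,
    assuming each wrap fails independently with the same probability."""
    return 1.0 - (1.0 - per_wrap_defect_prob) ** wrap_count


# Hypothetical numbers, for illustration only.
p_any_defect = backplane_defect_probability(per_wrap_defect_prob=1e-4, wrap_count=10_000)
print(f"Chance of at least one bad wrap: {p_any_defect:.1%}")
```

Under these assumed numbers, a one-in-ten-thousand defect rate per wrap leaves a backplane with ten thousand wraps more likely than not to contain at least one bad connection.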

In the power subsystem, fault avoidance was pursued by improving the alternating current (ac) input-power tolerance, the design testing, the manufacturing processes, and the environmental monitoring. In particular, manufacturing was a key area where the reliability of the power supplies was improved. A new power-supply tester was developed to improve our testing capabilities. It contains logic that can fully test the characteristics of a power supply and store the test data. The data includes line and load regulation and noise measurements.

A modular power supply (MPS) was designed to run from a single clock so that all regulators would be in synchronization. This synchronization allowed us to predict and control the output noise of the switching regulators. A new high-current connector that allows the regulators to be pluggable was also developed.

The power subsystem also contains the environmental monitoring module (EMM). The EMM was designed to monitor the status of the power supply and the environment inside the system. The EMM can measure the voltage output of every regulator, the inlet and outlet air temperatures, the air-flow velocity, and the ground-wire current in the primary power cord. The system protects itself by having the EMM monitor these conditions, log any deviations, and shut down the system if adverse conditions warrant it.
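The monitor, log, and shut-down behavior described for the EMM can be sketched as a simple limit check. This is only an illustration of the idea, not the EMM's actual logic: the `Limit` type, the sensor names, and every threshold value below are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Limit:
    low: float
    high: float
    shutdown: bool  # True if a violation should shut the system down


def check_readings(readings, limits):
    """Compare each reading with its limit.

    Returns (log_entries, shutdown_requested)."""
    log_entries = []
    shutdown_requested = False
    for name, value in readings.items():
        limit = limits[name]
        if not (limit.low <= value <= limit.high):
            log_entries.append(f"{name}={value} outside [{limit.low}, {limit.high}]")
            shutdown_requested = shutdown_requested or limit.shutdown
    return log_entries, shutdown_requested


# Hypothetical sensors and thresholds, for illustration only.
limits = {
    "regulator_1_volts": Limit(4.75, 5.25, shutdown=True),
    "inlet_air_c": Limit(10.0, 32.0, shutdown=False),
}
readings = {"regulator_1_volts": 5.6, "inlet_air_c": 25.0}
log_entries, shutdown_requested = check_readings(readings, limits)
```

Here the out-of-range regulator voltage is logged and requests a shutdown, while the in-range inlet temperature produces no entry.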

According to E.J. McCluskey, "Improper design of the hardware or software can result in a system which does not function at all. Such mistakes are, of course, quickly discovered and corrected. Other, less obvious design defects usually remain in any system even after it has been in service for a long time."⁵ The results of design problems are logic circuits that either fail prematurely or sense signals falsely. The number of these types of errors is indirectly a measure of the quality of the tools used in the system's design.

At the beginning of a design project, rules are established to make sure that the goals for signal integrity and component failure rates can be achieved. It is usually impossible to develop rules that are both easy to check and at the same time don't overly constrain the design engineer. Often this results in complex rules. If they are inadvertently broken, the usual outcome is a decrease in the machine's reliability. The broken rules result in components that operate with excessive temperatures or signals that do not have adequate noise margins. A chip that runs too hot will fail sooner than anticipated; a signal that doesn't have adequate noise margin will sometimes be sensed incorrectly. Worse still is the fact that the component is blamed rather than the true cause, a violated rule.

As an example, consider the operating temperature of an IC. There is a tradeoff between the maximum and minimum operating temperatures and the amount of noise margin available. If the temperature of an IC exceeds its maximum specified temperature, the amount of noise normally present from known sources, such as adjacent-run crosstalk, may be

Digital Technical Journal No. 1 August 1985

that, we developed a tool for use on the 8600 to check for chips that were getting too hot. If a chip was detected as being too hot, its layout was modified to correct the problem without changing the total power of the module.
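The text does not say how the hot-chip check worked. One common way to estimate such a condition is the steady-state thermal resistance model, in which junction temperature equals ambient temperature plus dissipated power times thermal resistance. The sketch below uses that model purely as an assumption; the chip names, power figures, thermal resistances, and temperature limit are all hypothetical.

```python
def find_hot_chips(chips, ambient_c, max_junction_c):
    """Return {chip name: estimated junction temperature} for chips over the limit.

    chips maps a name to (power in watts, thermal resistance in deg C per watt).
    Assumed model: junction temperature = ambient + power * thermal resistance."""
    hot = {}
    for name, (power_w, theta_c_per_w) in chips.items():
        junction_c = ambient_c + power_w * theta_c_per_w
        if junction_c > max_junction_c:
            hot[name] = junction_c
    return hot


# Hypothetical values, for illustration only.
chips = {"U1": (2.0, 30.0), "U2": (0.5, 30.0)}
hot = find_hot_chips(chips, ambient_c=40.0, max_junction_c=85.0)
```

In this example only U1 is flagged. A layout change that lowers the local ambient temperature or the thermal resistance would clear the flag without altering the chip's power, which is consistent with fixing the problem without changing the total power of the module.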

A new timing analysis tool was also developed for the project. This tool enabled the designers to do a much more thorough job of timing analysis on this machine than had been done on previous projects. Using it involved running many separate programs that built a timing model of the machine from the schematics and the layouts of the modules, backplane, and MCAs. The results of the model were then used by a program that performed timing analysis of the design based upon a set of interbox timing specifications.
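As a rough illustration of checking a timing model against specifications, the sketch below computes worst-case arrival times through an acyclic graph of delays and reports nodes that exceed their limits. It is not the 8600 tool: the graph representation, the node names, and the delay and specification values are hypothetical, and the model ignores clock skew, minimum delays, and setup and hold constraints.

```python
from functools import lru_cache


def worst_case_arrivals(delays, fanin):
    """Latest arrival time (ns) at each node of an acyclic timing graph.

    delays: node -> propagation delay through that node (ns)
    fanin:  node -> list of nodes that drive it (absent for primary inputs)"""

    @lru_cache(maxsize=None)
    def arrival(node):
        drivers = fanin.get(node, [])
        latest_input = max((arrival(d) for d in drivers), default=0.0)
        return latest_input + delays[node]

    return {node: arrival(node) for node in delays}


def timing_violations(arrivals, specs):
    """Return {node: amount over the limit in ns} for nodes that miss their spec."""
    return {n: arrivals[n] - limit for n, limit in specs.items() if arrivals[n] > limit}


# Hypothetical model and specification, for illustration only.
delays = {"a": 2.0, "b": 3.0, "c": 1.5, "out": 1.0}
fanin = {"c": ["a", "b"], "out": ["c"]}
arrivals = worst_case_arrivals(delays, fanin)
violations = timing_violations(arrivals, {"out": 5.0})
```

The slowest path here runs through b, c, and out, so the output arrives 0.5 ns later than the assumed 5.0 ns limit.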

After the layouts of the modules were completed, every single run was analyzed to ensure that signal integrity had been achieved. The program computed the amount of noise generated from adjacent runs, reflections, and the like. Based on these results, we made a number of reroutings to increase the integrity of certain signals.
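A per-run noise check of this kind can be pictured as a noise budget: add up the contributions from each source and compare the total with the available noise margin. The sketch below assumes a simple worst-case linear sum, which may not be how the actual program combined noise sources; the run names, noise values, and margin are hypothetical.

```python
def runs_over_noise_margin(runs, noise_margin_mv):
    """Return {run name: total noise in mV} for runs whose summed noise
    exceeds the margin. Contributions are added linearly (worst case)."""
    failing = {}
    for name, sources in runs.items():
        total_mv = sum(sources.values())
        if total_mv > noise_margin_mv:
            failing[name] = total_mv
    return failing


# Hypothetical noise contributions in millivolts, for illustration only.
runs = {
    "clk": {"crosstalk": 120, "reflection": 90},
    "data0": {"crosstalk": 40, "reflection": 30},
}
candidates_for_reroute = runs_over_noise_margin(runs, noise_margin_mv=150)
```

With these numbers only the `clk` run exceeds the assumed margin, so it would be the candidate for rerouting.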

