Perspectives on Safety Critical Computing Systems


Available at http://www.ijcsonline.com/


Kadupukotla Satish Kumar (a) and Panchumarthy Seetha Ramaiah (b)

(a) Dept of Computer Science, JNTU Kakinada, India, [email protected]

(b) Dept of Computer Science and Systems Engineering, AU Visakhapatnam, India, [email protected]

Abstract

Computer systems have become an integral part of our lives. They are used in systems ranging from basic utility services to complex scientific research and defense. Any system presents some risk to its owners, users and environment. Some present more risk than others, and those that present the most risk are what we call safety-critical systems. Safety-critical systems are those whose failure could result in loss of life, loss of revenue, significant property damage or damage to the environment. This paper reviews 9 system failures, from various domains, that were caused by software bugs. The paper discusses some regulatory standards and guidelines for proper testing of software. We also recommend guidelines to follow during testing to reduce the instances of software failure in safety-critical systems.

Keywords: Safety Critical Systems, SIL, Safety Standards

I. INTRODUCTION

Most software project failures can be attributed to the project failing to fully meet its requirements, whether those concern cost, schedule, quality, or project objectives. Studies of software failures report alarming results: around 50% to 80% of projects end in failure. The causes vary, but the most commonly cited are lack of client participation, inadequately trained developers, continuously changing client requirements, unrealistic project objectives, inaccurate resource estimates, poorly defined system requirements, poor reporting, and inappropriate development practices. All software projects should be tested properly and should remain accurate at all times. For every successful software project there are also projects that fail. [1][2][3]

This paper is organized as follows. The next section highlights some cases of software failures. In Section III we talk about how to overcome the software failures. In Section IV we give a description of the Safety Critical Systems. In Section V we talk about Software Engineering for Safety Critical Systems. In Section VI we present a discussion about the various standards and Safety Integrity Levels for Safety Critical Systems. In Section VII we give our conclusion.

II. SOFTWARE FAILURES

As systems grow ever more complex, errors may be ignored or remain undetected until a catastrophe occurs, resulting in a huge loss of wealth or sometimes in human casualties. We discuss some failures of complex systems by citing well-known software errors that led, or could have led, to huge losses in the space, transportation, communication, government, and health care industries, including:

• Disintegration of Mars Orbiter (1998)
• Patriot Missile Defense System (1991)
• Almost a WW-III (1983)
• Iran Air Flight 655 (1988)
• AT&T Breakdown (1990)
• Therac-25 (1986)
• Challenger Space Shuttle Disaster (1986)
• Public inconvenience

A. Space

1. Disintegration of Mars Orbiter [19]

NASA launched a mission to study the Martian environment. An orbiter was launched in 1998 amid much fanfare, but the mission ended in disaster. Investigations attributed the failure to a software error in a calculation. A report issued by NASA states that the root cause was the failure to use metric units in the coding of a software file, "Small Forces", used in trajectory models. The investigation revealed that the navigation team was working in metric units while the ground calculations were in Imperial units. The computer systems did not reconcile the difference, which resulted in a navigation error.
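The unit mismatch described above can be illustrated with a minimal sketch. It assumes, for illustration only, that the quantity exchanged is a thruster impulse; the function name, unit labels, and sample value are hypothetical and are not taken from the mission software. The conversion factor is the standard one between pound-force and newton.

```python
# Newton-seconds per pound-force second (standard conversion factor).
LBF_S_TO_N_S = 4.4482216152605


def to_newton_seconds(value, unit):
    """Convert an impulse to newton-seconds, refusing unknown units."""
    if unit == "N*s":
        return value
    if unit == "lbf*s":
        return value * LBF_S_TO_N_S
    raise ValueError(f"unknown impulse unit: {unit!r}")


# Hypothetical impulse written by ground software in Imperial units.
reported_value, reported_unit = 10.0, "lbf*s"

# Reading the bare number as if it were already metric silently
# understates the impulse by a factor of about 4.45.
misread_as_metric = reported_value
correct_metric = to_newton_seconds(reported_value, reported_unit)

print(misread_as_metric, correct_metric)
```

Carrying the unit alongside the number, and converting at the interface, turns a silent scaling error into either a correct value or an explicit failure.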

2. The Mariner 1 spacecraft [12]


The destruction of the Mariner 1 spacecraft in 1962, before it could complete its mission of flying by Venus, resulted from one of the most expensive software bugs in history (cost in 1962 dollars: 18.5 million; cost in today's dollars: 135 million). The Mariner 1 spacecraft was launched on July 22, 1962 from Cape Canaveral, Florida. Soon after launch, an onboard guidance antenna failed, which caused a fallback to a backup radar system that should have been able to guide the spacecraft. However, there was a fatal flaw in the software of that guidance system. When the equations used to process and translate tracking data into flight instructions were encoded onto punch cards, one critical symbol was left out: an overbar or overline, often confused in ensuing years with a hyphen. The lack of that overbar, essentially, caused the guidance computer to compensate incorrectly for some otherwise normal movement of the spacecraft.

3. Ariane 5, Flight 501 Failure [18]

The European Space Agency spent around 10 years and $7 billion to produce Ariane 5, a giant rocket capable of placing a pair of three-ton satellites into orbit with each launch. The rocket was also intended to give Europe supremacy in the commercial space business. Less than a minute into its maiden voyage, the rocket exploded because a small computer program tried to stuff a 64-bit number into a 16-bit space.
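A 64-bit floating-point value can be far larger than anything a signed 16-bit integer can hold. The following is a minimal sketch of that mismatch, not a reproduction of the flight software; the function names and the sample value are hypothetical. It contrasts an unchecked conversion, which silently keeps only the low 16 bits, with a checked conversion that rejects the value.

```python
INT16_MIN, INT16_MAX = -32768, 32767


def wrap_to_int16(x):
    """Unchecked conversion: keep only the low 16 bits (two's complement)."""
    raw = int(x) & 0xFFFF
    return raw - 0x10000 if raw >= 0x8000 else raw


def checked_to_int16(x):
    """Checked conversion: refuse values that do not fit in 16 bits."""
    n = int(x)
    if not INT16_MIN <= n <= INT16_MAX:
        raise OverflowError(f"{x} does not fit in a signed 16-bit integer")
    return n


# Hypothetical 64-bit floating-point reading that exceeds the 16-bit range.
reading = 40000.0

wrapped = wrap_to_int16(reading)  # silently becomes a negative number
print(wrapped)

try:
    checked_to_int16(reading)
except OverflowError as exc:
    print("rejected:", exc)
```

A checked conversion only helps if the caller handles the rejection; an unhandled error in a safety-critical path can be as damaging as a wrong value.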

B. Defense

1. Almost WW-III

At the height of the Cold War, while peace activists around the world were leaving no stone unturned to prevent a war between the USA and the USSR that could have escalated into a nuclear world war, a software error nearly undid their efforts. On 26 September 1983, the early warning system of the USSR raised a false alarm that the USA had launched a missile attack. It raised the alarm twice: the first alarm reported that the USA had launched 1 missile, and the second reported an attack by 5 missiles. The officer on duty, relying on his own judgment, declared it a false alarm.

2. Patriot Missile Defense System [15]

During Operation Desert Storm, a software error in the US Patriot Missile Defense System resulted in the death of 28 US soldiers. The system failed to intercept an incoming Scud missile, which struck military barracks. The failure was attributed to a calculation error. The system's internal clock counted time in tenths of a second, and the actual time was obtained by multiplying the clock value by 1/10 held in a 24-bit fixed-point register. Because 1/10 has no exact binary representation, the stored value was slightly too small, and the error grew the longer the system ran. As a result, the two systems that were supposed to share a universal time instead had effectively independent clocks, and the resulting out-of-sync condition caused the failure.
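The accumulated clock error can be estimated with exact rational arithmetic. This is a rough sketch under two assumptions that are not stated in this paper but are commonly quoted in accounts of the incident: that the stored value of 1/10 kept 23 bits after the binary point, and that the battery had been running for about 100 hours. The names are hypothetical.

```python
from fractions import Fraction

# Assumption: bits kept after the binary point for the stored 1/10.
FRACTION_BITS = 23

exact_tenth = Fraction(1, 10)
# Chop (truncate) 1/10 to the available number of fractional bits.
stored_tenth = Fraction(int(exact_tenth * 2**FRACTION_BITS), 2**FRACTION_BITS)
error_per_tick = exact_tenth - stored_tenth  # seconds lost per 0.1 s tick


def clock_drift_seconds(hours_running):
    """Accumulated timing error after the given number of hours of uptime."""
    ticks = int(hours_running * 3600 * 10)  # the clock ticks every 0.1 s
    return float(ticks * error_per_tick)


# Assumption: roughly 100 hours of continuous operation.
drift_100h = clock_drift_seconds(100)
print(f"error per tick: {float(error_per_tick):.3e} s")
print(f"drift after 100 h: {drift_100h:.4f} s")
```

A drift of about a third of a second looks negligible, but a target moving at well over a kilometre per second travels several hundred metres in that time.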

3. Aegis Combat System, Iran Air Flight 655 [16]

On July 3, 1988, the Aegis combat system used by the U.S. Navy failed to carry out a proper calculation, because of which the USS Vincennes mistakenly shot down a passenger aircraft, Iran Air Flight 655, resulting in 290 civilian casualties. Using the missile guidance system, the Vincennes's Commanding Officer believed the Iran Air Airbus A300B2 was a much smaller Iranian Air Force F-14A Tomcat jet fighter descending on an attack vector, when in fact the Airbus was transporting civilians on its normal civilian flight path. The radar system temporarily lost Flight 655 and reassigned its track number to an F-14A Tomcat fighter that it had previously seen. During this critical period the decision to fire was made, and U.S. military personnel shot down the civilian plane.

C. Telecommunications

1. AT&T Breakdown [17]

In January 1990, an unknown combination of calls caused malfunctions in 114 switching centers of the AT&T network across the United States. As a result, around 65 million calls could not be connected nationwide. The cause was a sequence of events that triggered a software error arising from a fault in the code.

D. Health Care and Medicine

1. Therac-25 [14]

Therac-25 was a radiation therapy machine developed by Atomic Energy of Canada for cancer treatment. Between 1985 and 1987, Therac-25 machines in four medical centers gave massive overdoses of radiation to six patients. An extensive investigation revealed that in some instances operators repeated overdoses because the machine display indicated that no dose had been administered. Some patients received between 13,000 and 25,000 rads when 100 to 200 rads were needed. The excessive radiation exposure caused severe injuries, and three patients lost their lives. Failure to adhere to good safety design was the cause of the errors. The investigation also found calculation errors. For example, the set-up test used a flag variable only one byte in size, whose value was incremented on each run. When the routine was called for the 256th time, the flag overflowed and a powerful electron beam was erroneously turned on. The investigation showed that although some latent errors could be traced back several years, an inadequate reporting system made it hard to pinpoint the root cause of the failure. The final investigation report indicates that during real-time operation the software recorded only certain parts of the operator input and editing. A careful reconstruction by a physicist at one of the cancer centers eventually revealed exactly what went wrong.
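The one-byte flag overflow can be sketched as follows. The sketch assumes, for illustration, that a nonzero flag means "a safety check is still required" and that the flag is incremented rather than set to a fixed value; the names are hypothetical and the code is not taken from the machine's software.

```python
def increment_one_byte(flag):
    """Increment a value stored in a single byte; 255 + 1 wraps to 0."""
    return (flag + 1) & 0xFF


flag = 0
values_after_each_run = []
for run in range(1, 257):
    flag = increment_one_byte(flag)
    values_after_each_run.append(flag)

# Runs on which the flag reads zero, i.e. "no check required".
runs_with_zero_flag = [
    run for run, value in enumerate(values_after_each_run, start=1) if value == 0
]
print(runs_with_zero_flag)
```

Setting such a flag to a fixed nonzero value instead of incrementing it removes the overflow entirely.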

E. Public Utilities

1. Power Blackout in USA & Canada [13]

A software bug in the alarm system of an energy company's control room caused an electrical power blackout in the Northeastern and Midwestern USA and in Ontario, Canada. The outage affected over 50 million people in both nations.


those requirements. A software project is as much art as technique. The US government alone spends 60 billion dollars on testing. Most projects fail because developers do not capture requirements properly, end users cannot state their requirements properly, parameters are neglected, software professionals lack a proper understanding of the technology, security measures are not followed, and so on. For better understanding, conduct a postmortem and apply its lessons to the next project. Safety-critical system practices play a key role in overcoming such failures.

A. Abbreviations and Acronyms

FMEA: Failure Mode and Effects Analysis
LOPA: Layers Of Protection Analysis
PFD: Probability of Failure on Demand
SIF: Safety Instrumented Function
SIL: Safety Integrity Level

IV. ABOUT SAFETY CRITICAL SYSTEM

A life-critical system or safety-critical system [1] is a system whose failure or malfunction may result in one (or more) of the following outcomes:

• Death or serious injury to people
• Loss or damage to equipment/property
• Environmental harm

Risks of this sort are usually managed with the methods and tools of safety engineering. A life-critical system is designed to lose less than one life per billion (10^9) hours of operation. Typical design methods include probabilistic risk assessment, a method that combines failure mode and effects analysis (FMEA) with fault tree analysis. Safety-critical systems are increasingly computer-based. Any system represents some risk to its owners, users, and environment. Some present more risk than others, and those that present the most risk are what we call safety-critical systems.

A risk is a threat to something valuable. Every system either contains something of value that may be jeopardized, or its usage may jeopardize some value outside it. A system should be built to protect these values both from the results of ordinary use and from the results of malicious attacks of various kinds. A typical categorization looks at values concerning

• Safety
• Economy
• Security
• Environment

V. SOFTWARE ENGINEERING FOR SAFETY CRITICAL SYSTEM

Software engineering for safety-critical systems is very difficult [8]. Three aspects can help in the software engineering process for such systems. The first is process engineering and management. The second is selecting the appropriate resources and environment for the system. The third is addressing any legal or regulatory requirements for the system; for example, the Federal Aviation Administration has issued guidelines for systems used in aviation. Setting a standard to which a system must adhere forces the developers to take the necessary precautions. The aviation industry has been successful in laying down standards for producing safety-critical avionics software. Similar standards are in place for the automotive (ISO 26262), medical (IEC 62304) and nuclear (IEC 61513) industries. The standard approach is to carefully code, inspect, document, test, verify and analyze the system. Another approach is to certify a production system, such as a compiler, and then generate the system's code from specifications. A third approach uses formal methods to generate proofs that the code meets its requirements. All of these approaches improve software quality in safety-critical systems by testing or eliminating manual steps in the development process, because people make mistakes, and these mistakes are the most common cause of accidents. Many regulatory standards address how to determine the safety criticality of systems and provide guidelines for the corresponding testing. Some of them (but probably not all) are:

• CEI/IEC 61508 – Functional safety of electrical/electronic/programmable electronic safety-related systems
• DO-178B – Software considerations in airborne systems and equipment certification
• prEN 50128 – Software for railway control and protection systems
• Def Stan 00-55 – Requirements for safety-related software in defence equipment
• IEC 880 – Software for computers in the safety systems of nuclear power stations
• MISRA (Motor Industry Software Reliability Association) – Development guidelines for vehicle-based software
• FDA (U.S. Food and Drug Administration) – Pharmaceutical standards

VI. DISCUSSION ON SIL

A SIL [20] is a measure of safety system performance, expressed in terms of the probability of failure on demand (PFD) of a safety critical system. There are four discrete integrity levels. The higher the SIL, the lower the probability of failure on demand and the higher the system reliability and performance. A higher SIL also generally means greater system complexity and development cost. A SIL applies to an entire system; individual products or components do not have SIL ratings. SILs are used when implementing a safety critical system that must reduce an existing intolerable process risk to a tolerable range.


standards. In the European functional safety standards based on the IEC 61508 standard four SILs are defined, with SIL 4 the most dependable and SIL 1 the least. A SIL is determined based on a number of quantitative factors in combination with qualitative factors such as development process and safety life cycle management.

Assignment of a SIL is an exercise in risk analysis: the risk associated with a specific hazard that a SIF is intended to protect against is calculated without the beneficial risk reduction effect of the SIF. That "unmitigated" risk is then compared against a tolerable risk target. If the "unmitigated" risk is higher than tolerable, the difference between the two must be addressed through the risk reduction of the SIF. This amount of required risk reduction is correlated with the SIL target. In essence, each additional order of magnitude of required risk reduction corresponds to an increase of one in the required SIL.
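The comparison of unmitigated and tolerable risk can be sketched as a small calculation. The band limits below are the low-demand-mode risk reduction ranges usually quoted for IEC 61508 (SIL 1 for a factor above 10 and up to 100, rising by one level per order of magnitude); they are not given in this paper, and the function names and sample frequencies are hypothetical.

```python
# (exclusive lower limit, inclusive upper limit, SIL) for the required
# risk reduction factor; assumed low-demand-mode bands.
SIL_BANDS = [
    (10, 100, 1),
    (100, 1_000, 2),
    (1_000, 10_000, 3),
    (10_000, 100_000, 4),
]


def required_risk_reduction(unmitigated_per_year, tolerable_per_year):
    """Factor by which the SIF must reduce the hazard frequency."""
    return unmitigated_per_year / tolerable_per_year


def sil_for_risk_reduction(factor):
    """Map a required risk reduction factor to a SIL (0 means below SIL 1)."""
    if factor <= 10:
        return 0
    for low, high, sil in SIL_BANDS:
        if low < factor <= high:
            return sil
    raise ValueError("required risk reduction exceeds what SIL 4 covers")


# Hypothetical hazard: 0.05 events/year unmitigated, 0.0001 events/year tolerable.
factor = required_risk_reduction(0.05, 0.0001)
target_sil = sil_for_risk_reduction(factor)
print(round(factor), target_sil)
```

When the required reduction exceeds the top band, the usual response is to redesign the process or add independent protection layers rather than to rely on a single SIF.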

There are several methods used to assign a SIL. These are normally used in combination, and may include:

• Risk matrices
• Risk graphs
• Layers Of Protection Analysis (LOPA)

Of the methods listed above, LOPA is one of the most commonly used in large industries. The assignment may be tested using both pragmatic and controllability approaches, applying guidance on SIL assignment published by the UK HSE. SIL assignment processes that use the HSE guidance to ratify assignments developed from risk matrices have been certified to meet IEC EN 61508.

The standards are application-specific, which can make it difficult to determine what to do when dealing with multidisciplinary products. Nonetheless, standards do provide useful guidance. The most generic of the standards listed above is IEC 61508; it may always be used if a system does not fit into any of the other types. All the standards operate with so-called safety integrity levels (SILs).
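The arithmetic behind a LOPA can be sketched briefly. In its usual form, the frequency of the hazardous outcome is the initiating event frequency multiplied by the PFD of each independent protection layer, and whatever gap remains to the tolerable frequency is assigned to the SIF. The function names and numbers below are hypothetical, and layer independence is assumed.

```python
from math import prod


def mitigated_frequency(initiating_per_year, layer_pfds):
    """Hazard frequency after the existing independent protection layers."""
    return initiating_per_year * prod(layer_pfds)


def required_sif_pfd(initiating_per_year, layer_pfds, tolerable_per_year):
    """Largest PFD the SIF may have for the hazard to meet the target."""
    return tolerable_per_year / mitigated_frequency(initiating_per_year, layer_pfds)


# Hypothetical scenario: initiating event 0.1/year, two existing layers
# (for example an alarm with operator response and a relief device),
# each with a PFD of 0.1, and a tolerable frequency of 1e-5/year.
existing = mitigated_frequency(0.1, [0.1, 0.1])
sif_pfd = required_sif_pfd(0.1, [0.1, 0.1], 1e-5)
print(existing, sif_pfd)
```

A required PFD of 1 or more would mean the existing layers already meet the target and no SIF is needed for that scenario.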

Table 1. Classification of SIL

Value | 100000000 | 100000 | 100 | 1
Safety | Many people killed | Human lives in danger | Damage to physical objects, risk of personal injury | Insignificant damage to things; no risk to people
Economy | Financial catastrophes | Great financial loss | Significant financial loss | Insignificant financial loss
Security | Destruction/disclosure of strategic data and services | Destruction/disclosure of critical data and services | Faults in data | No risk for data
Environment | Extensive and irreparable damage to the environment | Reparable, but comprehensive damage to the environment | Local damage to the environment | No environmental risk

The concept of SILs allows a standard to define a hierarchy of levels of testing (and development). A SIL is normally applied to a subsystem; that is, we can operate with various degrees of SILs within a single system or within a system of systems. The determination of the SIL for a system under testing is based on a risk analysis. The standards concerning safety critical systems deal with both development processes and supporting processes, that is, project management, configuration management, and product quality assurance.

As an example, CEI/IEC 61508 recommends test case design techniques depending on the SIL of a system. This standard defines four integrity levels: SIL4, SIL3, SIL2 and SIL1, where SIL4 is the most critical. For a system classified as SIL4, the standard says that the use of equivalence partitioning is highly recommended as part of functional testing [Fig 1]. The use of boundary value analysis is also highly recommended, while the use of cause-effect graphs and error guessing is only recommended. For white-box testing, measuring coverage is highly recommended, though the standard does not say which level of which coverage. The recommendations become less and less strict as we move down the SILs in the standard. For highly safety-critical systems, the testers may be required to deliver a compliance statement or matrix explaining how the pertaining regulations have been followed and fulfilled.
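Equivalence partitioning and boundary value analysis can be sketched for a single numeric input. The example assumes a hypothetical function under test that accepts integer settings from 1 to 100; the names and the range are illustrative only and do not come from the standard.

```python
LOW, HIGH = 1, 100  # hypothetical valid range of the input


def is_valid_setting(value):
    """Hypothetical function under test: accept values in [LOW, HIGH]."""
    return LOW <= value <= HIGH


def equivalence_representatives(low, high):
    """One representative per partition: below, inside and above the range."""
    return {
        "below range": low - 10,
        "in range": (low + high) // 2,
        "above range": high + 10,
    }


def boundary_values(low, high):
    """Values at and immediately around each edge of the valid range."""
    return [low - 1, low, low + 1, high - 1, high, high + 1]


partition_results = {
    name: is_valid_setting(value)
    for name, value in equivalence_representatives(LOW, HIGH).items()
}
boundary_results = {value: is_valid_setting(value) for value in boundary_values(LOW, HIGH)}

print(partition_results)
print(boundary_results)
```

Equivalence partitioning keeps the number of test cases small, while boundary value analysis targets the edges where off-by-one mistakes tend to appear.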

Fig. 1. Graph of Various SILs

VII. CONCLUSION


REFERENCES

[1] J. C. Knight, “Safety critical systems: challenges and directions”, IEEE Proceedings of the 24th International Conference on Software Engineering. ICSE 2002, pp. 547 – 550

[2] W. R. Dunn, “Designing safety-critical computer systems”, IEEE Computer, Vol. 36, No. 11, pp. 40 – 46, November 2003.

[3] M. Ben. Swarup and P. S. Ramaiah “An Approach to Modeling Software Safety in Safety-Critical Systems”, Journal of Computer Science, Vol 5, No. 4, pp 311-320, 2009, ISSN: 1549-3636.

[4] G Raj Kumar, Dr. K Alagarswamy, "The most common factors for the failure of Software Development Project", TIJCSA, Vol 1, No. 11, pp: 74-77, January 2013.

[5] T. R. S. P. Babu, D. S. Rao, P. Ratna, "Negative Testing is Trivial for Better Software Products", IJRAET, Vol3, No. 1, pp: 36-41, 2015.

[6] T. O A Lehtinen, M. V Mantyla, J. Vanhanen, J. Itkonen and C. Lassenius, "Perceived causes of software project failures An analysis of their relationships", Information and Software Technology, Vol 56, No. 6, pp: 623-643, June 2014

[7] Edward E. Ogheneovo, "Software Dysfunction: Why Do Software Fail?", Journal of Computer and Communications, Vol 2, No 6, pp: 25-35, April 2014

[8] R. Kaur, Dr. J. Sengupta, "Software Process Models and Analysis on Failure of Software Development Projects", International Journal of Scientific & Engineering Research Vol 2, No.2, pp: 1-4, February-2011

[9] Lorin J. May, "Major Causes of Software Project Failures". [Available at] http://www.cic.unb.br/~genaina/ES/ManMonth/SoftwareProjectFailures.pdf

[10] Dr. Paul Dorsey, "Top 10 Reasons Why Systems Projects Fail". [Available at] http://www.ksg.harvard.edu/m-rcbg/ethiopia/Publications/Top%2010%20Reasons%20Why%20Systems%20Projects%20Fail.pdf

[11] Andrew Short, "Reasons for software failures", [Available at] https://indico.cern.ch/event/276139/contribution/49/attachments/500995/691988/Reasons_for_software_failures.pdf

[12] http://www.itworld.com/article/2717299/it-management/mariner-1-s– 135-million-software-bug.html?page=2

[13] https://reports.energy.gov/BlackoutFinal-Web.pdf

[14] http://sunnyday.mit.edu/papers/therac.pdf

[15] http://www.gao.gov/products/IMTEC-92-26

[16] http://ocw.mit.edu/courses/aeronautics-and-astronautics/16-422-human-supervisory-control-of-automated-systems-spring-2004/projects/vincennes.pdf

[17] http://www.mit.edu/hacker/part1.html