Part 2: The Use of Software in Safety Critical Systems


Software Design of Safety-Critical Systems

There are three different concerns:

♦ Reliability. (Continuous operation. Absence of errors.)

♦ Safety. (Avoiding errors, accidents and losses in the general software-user environment).

♦ Security. (Defense against deliberate intelligent non-random attacks.)


Reliability

♦ Reliability is the probability that a piece of equipment or component will perform its intended function satisfactorily for a prescribed time and under stipulated environmental conditions.

♦ This notion is heavily influenced by thinking in terms of hardware (wear, strain, material failure).

♦ Can be increased (for hardware!) by multiple redundancy.
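The effect of redundancy on hardware reliability can be illustrated with a small calculation. This is a minimal sketch that assumes n identical units failing independently of each other; the function name and the sample value 0.9 are illustrative only. The independence assumption is exactly what does not hold for n copies of the same software, since every copy contains the same design errors.

```python
def parallel_reliability(unit_reliability: float, n_units: int) -> float:
    """Probability that at least one of n independent, identical units works.

    Assumption (not from the slides): failures are statistically independent.
    """
    if not 0.0 <= unit_reliability <= 1.0:
        raise ValueError("unit_reliability must lie in [0, 1]")
    if n_units < 1:
        raise ValueError("n_units must be at least 1")
    # The redundant system fails only if every single unit fails.
    return 1.0 - (1.0 - unit_reliability) ** n_units


if __name__ == "__main__":
    for n in (1, 2, 3):
        print(n, round(parallel_reliability(0.9, n), 6))
```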


Safety

♦ Safety is freedom from accidents or losses.

♦ This depends not only on the software itself, but on the wider context where and how the software is used.

♦ Software does not operate in a vacuum. Designers of safety-critical software systems must be aware of the conditions under which the software will be used.


Security

♦ Security is the defense against deliberate non-random malicious action.

♦ Unlike for safety, multiple redundancy is ineffective.

♦ Probability estimates for security are very hard to achieve.


Computers and Risk

And they looked upon the software, and saw that it was good. But they just had to add this one other feature …

-G.F. McCormick

When Reach Exceeds Grasp

♦ Software is ubiquitous; it is used to control all kinds of machines and devices.

♦ Software + General Purpose Computer → Special Purpose Machine.

♦ Software has many advantages over traditional electromechanical control devices.

♦ BUT: The blessings of software (speed, flexibility) are also its curse.


How is software used in safety-critical systems?

There are several different ways in which software and operators can interact in the control loop (see also [Ephrath & Young]).

1. Providing information or advice to human controller upon request.

2. Interpreting raw data and displaying results to controller, who makes decisions.

3. Issuing commands directly, but under human monitoring and with human intervention from time to time.

4. Eliminating the human from the control loop completely.


Safety-critical areas outside the control loop

1. Software-generated data is used to make safety-critical decisions. (E.g., air traffic control, medical analyzers)

2. Software used in design (CAD/CAM)

3. Safety-critical data stored in computer databases (e.g., medical records, blood bank data)

Software may be used in environments and conditions that have not been foreseen by its designers.

Many errors are ultimately caused by communication problems between system designers, software


Software Myths and Reality

♦ Myth 1. Computers cost less than analog or electromechanical devices.

♦ Reality. Microcomputer hardware is cheap, but building and maintaining highly reliable and safe software is not. Even worse, software can be built cheaply, but it then causes enormous costs due to accidents, downtime, bug fixing, rewriting, impossibility to extend, etc.


♦ Myth 2. Software is easy to change.

♦ Reality. It is easy to make changes, but hard to keep the system consistent while doing so. Re-verifying and re-certifying can cause enormous costs. Software becomes ‘brittle’ as changes are made, thus the danger of introducing new errors increases over the lifetime of the software.


♦ Myth 3. Computers provide greater reliability than the devices they replace.

♦ Reality. Software as a purely mathematical construct does not fail in the engineering sense (corrosion, wear-out, random failures). Software as pure design fails due to design errors. These are abundant, even in thoroughly tested software that has been in use for a long time.


♦ Myth 4. Increasing software reliability will increase safety.

♦ Reality.

Software reliability can be increased by fixing errors that do not affect safety.

Most safety-critical software errors are ultimately due to requirements-specification errors.

Software can cause catastrophic failures while operating exactly as specified.

Safety is not a software property, but a system property. (Or, more generally, a property of a system + its operating environment.)


♦ Myth 5. Testing software or proving correctness (by formal verification) can remove all errors.

♦ Reality.

Exhaustive testing is practically impossible for large systems.

Formal verification can only prove that the system satisfies the specified requirements. Many critical software errors are specification errors.


♦ Myth 6. Reusing software increases safety.

♦ Reality. Reusing software components may increase reliability in some situations, but not necessarily safety. Reuse causes new safety risks.

Complacency.

Changes in the operating environment. The software was never meant to be used under (or tested for) these conditions. Examples:

• Therac-20 and Therac-25.

• US air traffic control software used in the UK. Problems with 0 degrees longitude.

• Aviation software designed for the northern hemisphere failed in the southern hemisphere.

• F-16 aircraft used over the Dead Sea in Israel at an altitude below sea level.
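One possible defence against such reuse problems is to write down the conditions a component was designed and tested for, and to check them explicitly at runtime. The following is a hypothetical sketch: the class, the field names and all numeric ranges are invented for illustration and do not describe any of the systems mentioned above.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class OperatingEnvelope:
    """Ranges the component was designed and tested for (hypothetical values)."""
    latitude_deg: Tuple[float, float]
    longitude_deg: Tuple[float, float]
    altitude_m: Tuple[float, float]


def envelope_violations(env: OperatingEnvelope, latitude_deg: float,
                        longitude_deg: float, altitude_m: float) -> List[str]:
    """Return a readable message for every input outside the tested envelope."""
    readings = {
        "latitude_deg": latitude_deg,
        "longitude_deg": longitude_deg,
        "altitude_m": altitude_m,
    }
    violations = []
    for name, value in readings.items():
        low, high = getattr(env, name)
        if not (low <= value <= high):   # also catches NaN readings
            violations.append(
                f"{name}={value} is outside the tested range [{low}, {high}]")
    return violations


if __name__ == "__main__":
    # Envelope of the original deployment (assumed): northern hemisphere,
    # western longitudes, altitudes at or above sea level.
    original = OperatingEnvelope((0.0, 90.0), (-130.0, -60.0), (0.0, 15000.0))
    for message in envelope_violations(original, -33.9, 18.4, -400.0):
        print(message)
```

An empty result does not prove that reuse is safe; it only says that the inputs stay inside the conditions somebody bothered to record.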


♦ Myth 7. Computers reduce risk over mechanical systems.

♦ Reality. Computers have the potential to reduce risk, but not all uses of computers achieve this potential.


Increased Safety by Computers: Pro and Con

♦ Pro. Computers allow finer control: check parameters often, compute in real time, take action quickly.

♦ Con. Processes can (and will) be operated closer to the optimum. Safety margins will be cut.


♦ Pro. Automated systems allow operators to work farther away from hazardous areas.

♦ Con. Lack of familiarity with hazards causes extra danger when operators do have to enter the hazardous areas. Example: Robotic factory without special human-only walkways. Some robot gets stuck twice a day (much more frequently than anticipated). Operators have to go there and fix it. One cannot shut down the whole factory every time. The inevitable happens eventually.


♦ Pro. By eliminating operators, human errors are eliminated.

♦ Con. Operator errors are replaced by design and maintenance errors. Humans are not removed from the system, but shifted to different jobs further away. Thus, they can lose information that is critical for correct decision making.

Do not always trust the explanation ‘human error’. It is often used wrongly when the real cause of the accident is either

♦ Unknown.

♦ Complex and hard to understand.

♦ Caused by many factors working together.

♦ Inconvenient or embarrassing for manufacturers, governments, management, etc.


♦ Pro. Computers have the potential to provide better information to operators and thus to improve decision making.

♦ Con. Theoretically true, but hard to achieve. Often too much information is provided in a badly structured way. Result: Sensory overload and confusion in a crisis situation.

♦ Some design hints:

Rank information according to relevance for safety.

Use color and effects in moderation.

Use colors, fonts, layout in a logically consistent way.
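The first design hint can be sketched in a few lines: sort pending messages by safety relevance before anything else, and show only as many as the operator can take in. The severity levels, the message fields and the display limit below are assumptions made for this illustration.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import List


class Severity(IntEnum):
    """Lower value = more relevant for safety (hypothetical scale)."""
    SAFETY_CRITICAL = 0
    WARNING = 1
    INFO = 2


@dataclass(frozen=True)
class Message:
    severity: Severity
    timestamp: float   # seconds since start of operation
    text: str


def rank_for_display(messages: List[Message], limit: int = 5) -> List[Message]:
    """Most safety-relevant first; within one severity, newest first."""
    ordered = sorted(messages, key=lambda m: (m.severity, -m.timestamp))
    return ordered[:limit]


if __name__ == "__main__":
    pending = [
        Message(Severity.INFO, 10.0, "Shift report available"),
        Message(Severity.SAFETY_CRITICAL, 12.0, "Coolant pressure low"),
        Message(Severity.WARNING, 15.0, "Filter due for replacement"),
    ]
    for message in rank_for_display(pending, limit=2):
        print(message.severity.name, message.text)
```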


♦ Pro. Software does not fail.

♦ Con. Only true for an extremely narrow definition of ‘failure’. Software does not fail due to wear-out, strain or corrosion, but due to design errors. Most mechanical systems have a relatively small number of (known) failure modes. Software fails in complex and unforeseen ways.


The ‘Curse of Flexibility’

A project’s specification rapidly becomes a wish list.

Additions to the list encounter little or no resistance. We can always justify just one more feature, one more mode, one more gee-whiz capability. And don’t worry, it’ll be easy – after all, it is just software. We can do anything.

In one stroke we are free of nature’s constraints. This freedom is software’s main attraction, but unbounded freedom lies at the heart of all software difficulty.

-G.F. McCormick


Root Causes of Accidents

♦ Overconfidence and Complacency

♦ Discounting Risk

♦ Overrelying on Redundancy

♦ Unrealistic Risk Assessment

♦ Ignoring High-Consequence, Low-Probability Events

♦ Assuming Risk Decreases over Time

♦ Underestimating Software-related Risks


How to Increase Reliability

♦ Testing.

♦ Formal verification (automatic or semiautomatic).

→ Model checking is part of this course.

♦ Well-structured software design. (→ Chapter 1.)

♦ Software should be designed with testing and verification in mind.

♦ Extra consistency checks during runtime.

♦ Backup servers used to recover from failures and to provide high availability. (Only possible for some types of systems, e.g., telephony switches.)
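As an illustration of extra consistency checks during runtime, the sketch below refuses to use two redundant sensor readings unless both are physically plausible and agree with each other. The scenario (a tank level measured in metres), the function name and all thresholds are assumptions chosen for the example.

```python
class ConsistencyError(RuntimeError):
    """Raised when runtime data contradicts what the design assumes."""


def checked_level(sensor_a: float, sensor_b: float, *,
                  min_level: float = 0.0, max_level: float = 10.0,
                  max_disagreement: float = 0.5) -> float:
    """Return the mean of two readings after plausibility checks."""
    for name, value in (("A", sensor_a), ("B", sensor_b)):
        # Written as 'not (low <= x <= high)' so that NaN is rejected too.
        if not (min_level <= value <= max_level):
            raise ConsistencyError(
                f"sensor {name} reads {value}, outside [{min_level}, {max_level}]")
    if abs(sensor_a - sensor_b) > max_disagreement:
        raise ConsistencyError(
            f"sensors disagree: A={sensor_a}, B={sensor_b}, "
            f"allowed difference {max_disagreement}")
    return (sensor_a + sensor_b) / 2.0


if __name__ == "__main__":
    print(checked_level(4.0, 4.2))
    try:
        checked_level(4.0, 7.5)
    except ConsistencyError as error:
        print("check failed:", error)
```

What the caller does after a failed check depends on the system; the check itself only makes the inconsistency visible instead of silently computing with bad data.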


How to Increase Safety

♦ Design for the worst case. Ask what is the worst that could happen if the software went completely amok. A simple mechanical interlock might prevent a serious accident.

♦ User-friendly interface.

♦ Present enough information to the operator.

♦ Meaningful error messages.

♦ Well-structured and complete documentation.

♦ Keeping system logs makes it easier to reproduce errors.


How should software handle critical errors?

This depends very much on the environment where the software is used.

♦ Critical operations that cannot be aborted: Continue on a best-effort basis (cf. the Ariane 5 failure).

♦ Operations that can be safely aborted: Stop and call for human help. Provide meaningful error messages and system logs (e.g., some robots, chemical plants).
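The two strategies above can be made an explicit, per-operation design decision rather than an accident of where an exception happens to be caught. This is a minimal sketch under that assumption; the enum, the exception class and the function name are hypothetical, and a real system would need a carefully designed fallback for the best-effort case.

```python
import logging
from enum import Enum
from typing import Callable, TypeVar

T = TypeVar("T")


class OnCriticalError(Enum):
    CONTINUE_BEST_EFFORT = "continue"      # operation cannot be aborted
    STOP_AND_CALL_OPERATOR = "stop"        # operation can be aborted safely


class OperatorInterventionRequired(RuntimeError):
    """The process has been halted and a human must decide how to proceed."""


def run_critical_step(step: Callable[[], T], fallback: Callable[[], T],
                      policy: OnCriticalError, log: logging.Logger) -> T:
    """Run one step and apply the chosen policy if it fails."""
    try:
        return step()
    except Exception as exc:
        step_name = getattr(step, "__name__", repr(step))
        # Always leave a trace in the system log, whatever the policy.
        log.error("step %s failed: %r", step_name, exc)
        if policy is OnCriticalError.STOP_AND_CALL_OPERATOR:
            raise OperatorInterventionRequired(
                f"{step_name} failed ({exc}); halted, operator action needed"
            ) from exc
        # Best effort: keep going with a degraded but defined value.
        return fallback()


if __name__ == "__main__":
    logging.basicConfig(level=logging.ERROR)
    logger = logging.getLogger("control")

    def read_velocity() -> float:
        raise OverflowError("value does not fit target type")

    print(run_critical_step(read_velocity, lambda: 0.0,
                            OnCriticalError.CONTINUE_BEST_EFFORT, logger))
```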


References

♦ N.G. Leveson. Safeware: System Safety and Computers. Addison-Wesley, 1995.

♦ T.S. Ferry. Safety Program Administration for Engineers and Managers. Charles C. Thomas Publisher, Springfield, Ill., 1984.

♦ A.R. Ephrath and L.R. Young. Monitoring vs. man-in-the-loop detection of aircraft control failures. In Jens Rasmussen and William B. Rouse, editors, Human Detection and Diagnostics of System Failures, pages 143-154. Plenum Press, New York, 1981.