
Chapter 2. Related Literature Survey

2.5. Certification and analysis of Operating Systems

2.5.1. FAA report on Safe Use of COTS RTOSs

The Federal Aviation Administration (FAA) commissioned a report into the safe use of Commercial Off The Shelf (COTS) RTOSs in aviation applications in 2002 [62]. The report looks at the issues raised by safety assessment and certification of RTOSs in isolation, but concentrates on the deployment of the RTOS in IMA systems.

A number of RTOS characteristics are listed in the report including real-time scheduling mechanisms (the way in which tasks are ordered for execution) and timing characteristics, but the report particularly focuses on partitioning features. Partitioning is the mechanism used to isolate the multiple applications running on an IMA system from one another. The idea is that an application will have its own dedicated access to computer resources within a partition. This is intended to provide isolation of high integrity software from low integrity software, which may be faulty. Partitioning is seen as a key enabling technology for IMA and is discussed further in section 4.2.

Spatial partitioning is used to prevent a function in one partition from overwriting data in another partition. The report notes two areas of concern: one is ensuring that a partition does not deliberately access the memory area of another, and the other is ensuring that the RTOS does not accidentally expose data from one partition to another following a context switch (where one partition is stopped from executing and another starts). Rushby [63] also noted that spatial partitioning must be considered at the network level, since even if memory partitioning is enforced correctly, data could be accidentally revealed or corrupted via network communications.
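The memory aspect of spatial partitioning can be illustrated with a minimal sketch. It assumes each partition owns a single contiguous memory region; the names `Region`, `regions_disjoint` and `access_allowed` are hypothetical and are not taken from the report or from any particular RTOS.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """A contiguous memory region owned by one partition (illustrative only)."""
    base: int
    size: int

    def contains(self, addr: int, length: int = 1) -> bool:
        # The whole access [addr, addr + length) must lie inside the region.
        return self.base <= addr and addr + length <= self.base + self.size


def regions_disjoint(regions: dict) -> bool:
    """Configuration check: no two partitions may share any address."""
    ordered = sorted(regions.values(), key=lambda r: r.base)
    return all(a.base + a.size <= b.base for a, b in zip(ordered, ordered[1:]))


def access_allowed(regions: dict, partition: str, addr: int, length: int = 1) -> bool:
    """Run-time check: a partition may only touch its own region."""
    return length > 0 and regions[partition].contains(addr, length)


# Hypothetical two-partition layout.
REGIONS = {
    "P1": Region(base=0x1000, size=0x1000),
    "P2": Region(base=0x2000, size=0x1000),
}
```

In a real system this check would typically be enforced by memory protection hardware configured by the RTOS rather than in software. The sketch only shows the property that has to hold, and it does not address the context-switch or network-level concerns noted above.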

Temporal partitioning is used to ensure that a partition has sufficient processing time to complete its execution. The report notes that the method of scheduling will affect the analysis. If a static, pre-determined schedule is used, then analysis can be undertaken at design time. However, if dynamic scheduling is used (with task ordering determined during execution), the authors note that the analysis will be much more complex.
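To illustrate why a static schedule lends itself to design-time analysis, the following minimal sketch checks a fixed table of partition windows against a repeating frame. The frame-based model, the per-partition time budgets and the names `Window` and `check_static_schedule` are assumptions made for this example and do not come from the report.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Window:
    """One fixed execution window in a repeating frame (illustrative only)."""
    partition: str
    start_ms: int
    duration_ms: int


def check_static_schedule(windows, frame_ms, budgets_ms):
    """Return a list of problems; an empty list means the table passes."""
    problems = []
    prev_end = 0
    for w in sorted(windows, key=lambda w: w.start_ms):
        end = w.start_ms + w.duration_ms
        if w.start_ms < prev_end:
            problems.append(f"{w.partition}: window overlaps an earlier window")
        if end > frame_ms:
            problems.append(f"{w.partition}: window runs past the end of the frame")
        prev_end = max(prev_end, end)

    # Sum the time each partition receives per frame and compare with its budget.
    allocated = {}
    for w in windows:
        allocated[w.partition] = allocated.get(w.partition, 0) + w.duration_ms
    for name, budget in budgets_ms.items():
        got = allocated.get(name, 0)
        if got < budget:
            problems.append(f"{name}: allocated {got} ms, needs {budget} ms")
    return problems
```

Because every input is known before the system runs, the check is a simple table walk. With dynamic scheduling there is no such table, and the equivalent argument has to consider the orderings that could arise during execution.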

The report discusses RTOS failures which the authors feel could have a safety impact, suggesting the use of Software Vulnerability Analysis (SVA) to identify potential anomalies. They propose that the results of this analysis can feed into other safety analyses, such as Functional Hazard Assessment (FHA). FHA is one of the key activities in ARP 4761 [26] and is defined as “a systematic, comprehensive examination of functions to identify and classify failure conditions of those functions according to their severity. An FHA is usually performed at two levels. These two analyses are known as an aircraft level FHA and a system level FHA”.

The authors do not describe a mechanism for performing SVA, stating: “How an SVA is conducted is up to the RTOS developer or applicant”, but add that: “RTOS developers or users should document any failure or safety concerns, the severity, and their approach for addressing the problem”.

Unfortunately, it is difficult for the RTOS developer to identify the severity of a failure in isolation, at least in terms of its impact on system safety. In addition, as observed by Lutz [58] (see section

A number of concerns are suggested in the report to assist an SVA, but the authors note that this list is not intended to be exhaustive. The list has been adapted from a paper by Kleidermacher and Griglock of Green Hills Software Inc. [64], which was written from a mainly technical standpoint and hence describes the features offered by RTOSs and their potential problems, assuming these will have a safety effect. The list categorises failures within the following groups: data consistency, inclusion of deactivated or dead code, tasking, scheduling, memory and IO devices, queuing, and interrupts and exceptions. Specific concerns are listed for each group and include items such as deadlock (where two tasks attempting to access a resource block one another), exceeding the task stack size, and corruption in task priority assignment.
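As an illustration of one of the listed concerns, deadlock can be viewed as a cycle in the relation "task X is blocked waiting for task Y". The minimal sketch below assumes each task waits on at most one other task; the function name `find_deadlock` and the task names are hypothetical.

```python
def find_deadlock(waits_for):
    """Return the tasks forming a wait cycle, or None if there is no cycle.

    `waits_for` maps each task to the task it is blocked on (None if running).
    """
    for start in waits_for:
        seen = []
        task = start
        # Follow the chain of blocked tasks until it ends or repeats.
        while task is not None and task not in seen:
            seen.append(task)
            task = waits_for.get(task)
        if task is not None:
            # The chain returned to a task already visited: that is the cycle.
            return seen[seen.index(task):]
    return None
```

For example, `find_deadlock({"A": "B", "B": "A", "C": None})` reports the cycle formed by tasks A and B, which matches the two-task case described above.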

A large section of the report discusses various testing mechanisms which have been developed to examine different OSs. Whilst the authors provide a good survey of the various techniques, they reach no real conclusion as to the usefulness of their results, noting only their deficiencies when not performed in context. Various methods for testing RTOSs are examined in section 2.5.2.

The report suggests three techniques to improve system safety:

• Prevent the presence of defects in the RTOS (i.e., fault avoidance), which can be accomplished by proper design assurance.

• Analyze and test the COTS RTOS and remove any defects if present.

• Protect against remaining defects in the COTS using wrappers or other similar techniques.

Wrappers provide an extra API layer between the OS and an application. Instead of making an OS API call directly, an application calls a wrapper, which then calls the OS on its behalf. The wrapper may convert call results, or catch errors which the OS doesn't deal with.
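The wrapper idea can be sketched as follows. Here `os_send` is a hypothetical stand-in for an OS call that reports failure only through a bare status code; it is not the API of any real RTOS, and the size limit and exception name are likewise assumptions made for the example.

```python
class QueueFullError(Exception):
    """Raised by the wrapper when the underlying call reports failure."""


def os_send(queue, message, capacity=4):
    """Hypothetical OS call: returns 0 on success, -1 if the queue is full."""
    if len(queue) >= capacity:
        return -1
    queue.append(message)
    return 0


def wrapped_send(queue, message, max_len=16):
    """Wrapper: validate the arguments, call the OS, convert the result."""
    # Catch a misuse that the stand-in OS call does not check for.
    if not isinstance(message, bytes) or len(message) > max_len:
        raise ValueError("message rejected by wrapper")
    status = os_send(queue, message)
    # Convert the bare status code into an explicit error.
    if status != 0:
        raise QueueFullError("OS reported that the queue is full")
    return True
```

The application calls `wrapped_send` and never calls `os_send` directly, so both the argument check and the result conversion are applied to every call.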

To summarise, the report provides a comprehensive overview of many RTOS issues but some topics remain unresolved. The problem of whether the RTOS behaviour is correct for a given scenario (even with a fault - for example deadlock could be used to prevent a misbehaving application from running) is not covered. External faults, such as an application causing the RTOS to fail, e.g. in the way the application aboard the USS Yorktown caused a buffer overrun, are also not addressed other than by some of the random fault injection testing methods listed. The authors note that these can leave sections of code untested and have not been used in any certification of civil aviation systems. Finally, a method for performing SVA is not specified.

2.5.1.1. Current certification practice for RTOSs

There are a number of commercial RTOSs available which have been written for use in the safety critical domain. A brief overview of some of the approaches taken to certification of systems using them is now given.

Green Hills Software Inc. have a product known as INTEGRITY-178B [65]. This RTOS has been written to meet the ARINC 653 [20] IMA API standard and is intended for use in the avionics domain. The product uses the partitioning model from this standard. Green Hills provide a certification package which is intended to fulfil the DO-178B level A certification requirements [66]. This RTOS has in fact been used and certified in a level A system on a helicopter [67]. It is important to note that the RTOS has not itself been certified. What the package actually offers is the results of applying the various DO-178B verification processes to the OS within the customer's target system. This means that the application developer must pay for this context-specific testing.

Wind River also provide an ARINC 653 OS [68]. Again, this product is provided with a set of certification evidence which complies with the processes given in DO-178B, but the OS is not of itself certified.

The Digital Engine Operating System (DEOS) [69, 32] was developed by Honeywell, again with IMA in mind. This OS supports partitioning and again has been certified as part of an aircraft system [69] but not as a stand-alone package.

The approach taken by these three companies has been to apply some of the procedures of existing standards to the software (such as MC/DC testing to demonstrate that each condition independently affects the outcome of the decision in which it appears), and then, after integration, perform more testing and analysis to show that the OS works as required in a given context using a series of dedicated tools. This is compatible with standards such as DO-178B [17] and RSC circular [41]. By using this method it is possible to demonstrate that a particular configuration of the application and RTOS complies with standards, so in some ways it can be judged to be successful. But there are a number of problems with this approach if the benefits of using an OS are to be realised.

• Firstly, very early on during system development a developer needs to commit to a particular OS, as the testing and analysis need to be carried out in a specific environment. Given that there are a number of RTOSs available, it would be preferable for the application developers to have a strong grounding on which to pick a potential candidate. It is conceivable

integration testing, at which point a large amount of money would have been spent and development time undertaken. In addition, the bulk of the OS evidence cannot be reused.

• Secondly, by tying the evidence to a particular configuration of the system, a set of monolithic evidence is produced. This evidence may be brittle, not only because re-evaluation is needed whenever the configuration changes to show that the identified requirements are still met, but also because there may be hidden requirements which were not properly identified.

For example, suppose a particular system configuration runs a set of processes in a specified order, but another configuration runs them in a different order. A priority inversion (such as the Mars Pathfinder error) could be introduced which was not anticipated.

• Thirdly, there is a problem with the nature of the evidence itself. Compliance with procedures does not necessarily indicate that a product is fit for purpose, as discussed earlier. Using the rigorous level A testing and verification processes does in practice lead to a product which meets its specification, but knowing whether that behaviour is correct for a given situation is essential for SCSs.
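The MC/DC testing mentioned above is a structural criterion, which helps to show why it provides evidence against a specification rather than evidence of fitness for purpose. The minimal sketch below checks a test set against the unique-cause form of MC/DC as it is commonly described: for each condition there must be two tests that differ only in that condition and produce different decision outcomes. The decision `a and (b or c)` and the names `mcdc_covered`, `decision` and `TESTS` are assumptions made for this example.

```python
from itertools import combinations


def mcdc_covered(decision, n_conditions, tests):
    """Return the indices of conditions that have an independence pair."""
    covered = set()
    for t1, t2 in combinations(tests, 2):
        # Conditions whose values differ between the two tests.
        diff = [i for i in range(n_conditions) if t1[i] != t2[i]]
        # A pair counts only if exactly one condition changed and the
        # decision outcome changed with it.
        if len(diff) == 1 and decision(*t1) != decision(*t2):
            covered.add(diff[0])
    return covered


def decision(a, b, c):
    """Hypothetical guard taken from some piece of code under test."""
    return a and (b or c)


# Four tests for three conditions.
TESTS = [
    (True, True, False),
    (False, True, False),
    (True, False, False),
    (True, False, True),
]
```

With these four tests, `mcdc_covered(decision, 3, TESTS)` contains all three condition indices. The check says nothing about whether `a and (b or c)` was the right guard for the situation in the first place, which is the concern raised in the third point.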
