First Case Study – ARINC 653

Chapter 4. Framework for the Failure Analysis of an OS/CRMS

4.2. IMA Case studies

4.2.1. First Case Study – ARINC 653

4.2.1.1. Introduction and Motivation

The initial case study undertaken by the author was a failure analysis of the original ARINC 653 APEX OS specification [20]⁵. This specification contains a set of core design principles for an OS to support civil Integrated Modular Avionics (IMA) systems. In addition, it defines a set of API calls for an OS (or APplication EXecutive (APEX)), detailing their call signatures, and giving partial pseudocode for their implementation. The case study was undertaken for Airbus UK, who were intending to use an APEX-based OS on an upcoming aircraft.

5 ARINC 653 was in the process of being updated whilst this thesis was being finalised. The remit of the …

Although ARINC 653 was developed for safety critical avionics systems, it was developed largely from a technical standpoint and this is reflected in the content and style of writing. No safety analysis or method for performing analysis is contained within the document. Airbus needed to identify safety issues with APEX to a) help the purchasing process for an APEX-based OS and b) support a final IMA system safety case. A separate RTCA/EUROCAE document, DO255 “Requirements Specification for an Avionics Computer Resource” [94], had been recently produced at the time of the research. This document discusses performance standards for IMA modules and was intended to help the certification process. It contains numerous tables to be filled in with performance data and interface call signatures for an IMA module, but offers no insight into the safety of IMA systems. It was therefore felt that research into the safe deployment of ARINC 653 was required. Subsequent discussion with members of the WG60 committee (see section 2.3.2.1) revealed that the DO255 standard has gained little acceptance in industry. The work of the WG60 group to produce a general standard had not commenced at the time of the research.

ARINC 653 is intended to support the core IMA concepts put forward in ARINC 651-1 [95], which describes a radical departure from traditional federated avionics systems, to a distributed computing network. Federated systems consisted of dedicated hardware for each avionics system (commonly termed “black boxes”) which were loosely coupled together with some common network and power supplies. The IMA concept is based on a single computing network with multiple processors.

Whilst different processor types may be used in the network, each should contain a common OS and API layer. On top of this is a layer of application software divided into partitions (see Figure 18). The standard describes the OS as being divided into two layers. At the bottom is a hardware interface layer known as the COre-EXecutive (CO-EX), which contains hardware-specific code for accessing hardware but little software for managing policies, such as directing and checking the validity of communications requests. The OS software in the middle manages the latter. However, the standard only provides details for the OS layer, and it is possible to implement the OS without a CO-EX layer at all. Software for each avionics application can be located on any of the processors, thanks to the common supporting platform.

5 (continued) … support. Some of the issues raised by the author’s analysis have been partially addressed by the update and where appropriate this is commented upon.

A key part of the IMA philosophy is the concept of partitioning [63] (see also 2.3.2.1). Partitioning is a mechanism used to ensure that one application cannot access another independent application’s resources and was introduced to try to give the same assurance of segregation as that which naturally occurs for federated systems. There are two types: spatial partitioning (segregation of data storage resources and communications channels), and temporal partitioning (division of access to the processing resource). Theoretically partitioning should protect applications from any misbehaving software, and also allow an application to be re-configured and altered without impacting on any other applications. In practice, the configuration will be constrained by the need to meet resource requirements (as there are not limitless resources) and safety requirements (such as arranging different application lanes on multiple processors to reduce the impact of common mode failures).
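Temporal partitioning can be pictured as a fixed, repeating schedule of processor time windows. The following is a minimal Python sketch under assumed names: `Window`, `validate_schedule`, `owner_at`, the partition names and the 100 ms frame are all illustrative and are not taken from ARINC 653. It checks that no two windows overlap and reports which partition, if any, owns the processor at a given time.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Window:
    partition: str   # hypothetical partition name
    start_ms: int    # offset from the start of the repeating frame
    length_ms: int


def validate_schedule(windows: List[Window], frame_ms: int) -> None:
    """Raise ValueError if a window is empty, overlaps another, or overruns the frame."""
    end_of_previous = 0
    for w in sorted(windows, key=lambda w: w.start_ms):
        if w.length_ms <= 0 or w.start_ms < end_of_previous:
            raise ValueError(f"window for {w.partition} is empty or overlaps")
        end_of_previous = w.start_ms + w.length_ms
    if end_of_previous > frame_ms:
        raise ValueError("schedule overruns the frame")


def owner_at(windows: List[Window], frame_ms: int, t_ms: int) -> Optional[str]:
    """Return the partition that owns the processor at time t_ms, or None if idle."""
    offset = t_ms % frame_ms  # the schedule repeats every frame
    for w in windows:
        if w.start_ms <= offset < w.start_ms + w.length_ms:
            return w.partition
    return None


# Illustrative schedule: three partitions and a 10 ms idle gap in a 100 ms frame.
SCHEDULE = [Window("P1", 0, 40), Window("P2", 40, 30), Window("P3", 80, 20)]
validate_schedule(SCHEDULE, frame_ms=100)
```

In this sketch a misbehaving partition cannot extend its own window, because ownership is decided purely by the table; resource constraints appear as a schedule that fails validation.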

Figure 18 ARINC 653 IMA Module (diagram labels: Application partition 1, Application partition 2, Application partition N, API, Operating System, CO-EX, Hardware, Data Flow)

IMA was also introduced to reduce the problem of hardware obsolescence. The lifecycle of an aircraft is often well over 10 years, with several years of initial development. Therefore computer hardware purchased for onboard systems can be obsolete even before an aircraft is certified. In addition, application software within federated systems was often tightly coupled to the hardware to meet performance requirements. This has meant moving software in a federated system to a new computer processor can be extremely difficult, both technically and from a certification point of view (since the evidence is largely monolithic like the software). Applications using an APEX OS

When installing a new computer processor, provided the OS API remains the same, the application software should only need to be recompiled for the new hardware and not altered, thus reducing the effort in re-certification. In practice there may be some outstanding issues when moving software to new hardware, such as whether the same timing attributes can be maintained or whether the hardware executes the code as expected.

ARINC 653 identifies a number of different stakeholders in the development of an IMA system: an application developer, an OS/IMA platform developer and a system integrator (the person or individuals who arrange the different application partitions and perform final testing). In addition to this, the WG-60 document [29] also identifies the aircraft manufacturer as a significant stakeholder. The need to allocate safety requirements between these different groups became apparent during analysis.

Given these IMA principles, it was anticipated that the safety analysis of APEX would help uncover issues relating to system configuration and hardware dependencies, as well as potential failures in resource provision. It was highly desirable that the analysis results be reusable, since APEX was intended to support multiple different avionics applications and re-analysis for each application would have been prohibitively expensive. In addition, it was desirable that the analysis results should be easy to maintain following a change such as the addition of new hardware.

Following some discussion with Airbus and with academic colleagues, it was decided that the optimum approach was to analyse APEX independently, even though this meant that the analysis could only look for potential failures, and not specific safety related failures. A preliminary safety case strategy was developed which used the independent analysis and tailored it for each application, separating evidence about internal and external failures (as described in section 3.1.3). This strategy was published in [96]. The success or otherwise of this approach hinged on the ability to identify relevant failure conditions. The failure analysis approach taken can be found in section 4.2.1.3 and the results are discussed in section 4.2.1.4.

4.2.1.2. Definitions

This section details a number of definitions which are used within the thesis, when discussing IMA.

• IMA system – the set of IMA modules on the aircraft, complete with configured applications.

• IMA platform – the set of IMA modules on the aircraft, with basic OS software but without any loaded applications.

• IMA function – a high level capability of the IMA platform determined in terms of requirements from other aircraft systems, e.g. to allow an application to have controlled access to processing facilities.

• IMA services – the basic operations of the ARINC 653 platform (e.g. API calls) which meet one or more of the high level IMA functions.

• Partition – a section of an application on the IMA which is logically separated from other application partitions and the operating system, both for scheduling purposes and to protect data/code memory space.

• Module – a single processing unit within the IMA platform, networked to other modules.

4.2.1.3. Method used for analysis

The APEX failure analysis performed by this author examined only the core OS specification, and did not include any specific analysis of networking, computer hardware or applications. However, usage of the OS by these items was examined using very general assumptions about their type and function. The results and method used for the analysis research have been published in [97], and are now summarized here to help explain how the more generalized OS analysis methodology presented in this thesis was developed. The author devised the method and performed the analysis, the results of which were examined and accepted by Airbus and were also subject to peer review on publication.

The APEX specification lists 54 separate API calls, grouped into categories including partition management, process management, time management and health management. (A couple of extra calls were added to the specification by Airbus to support data storage, but inclusion of these in the analysis did not invalidate any of the assumptions and findings associated with the core APEX specification.) In addition, the specification identifies other key services provided by the OS including memory partitioning and configuration management. These items along with the category groups were termed IMA services for the purposes of the analysis.

The failure analysis was undertaken using an adapted version of FMEA. As discussed in section 3.2, an inductive technique allows a bottom up identification of hazards and is the most suitable for an independent analysis. FMEA is one commonly used example of this that has been successfully adapted for software analysis in a number of cases (see 2.4.2). The author chose to use the guidewords from the SHARD analysis approach (as shown in Table 3) to assist in the identification … prompts for identifying computer failures in a number of different case studies [23] so would assist in ensuring that all relevant OS failures had been found. FMEA is highly context specific though, with the effects on system or component level safety driving the identification of derived safety requirements. Therefore, a number of alterations were required and are now described.

Initial attempts to analyse the APEX specification focused on applying the SHARD guidewords to individual services and API calls. However, it rapidly became clear that the results from this were unrevealing, since they applied at too low a level for the cause or effect of a failure to be identified.

For example, the API call WRITE_SAMPLING_MESSAGE is intended to write a message to a single pool of data, overwriting the current contents. Applying the Omission guideword to this call would imply that it was not called, but the circumstances surrounding that failure were not obvious. Was it an application failure or not? What data was not sent? Would this mean an application missed vital data? Obviously it makes more sense to look in context at when and how it was called, but this was not possible since there were no details on specific applications. Nevertheless, there are some basic assumptions that can be made about the type of applications which will be running on IMA and their basic computer resource requirements. A set of higher-level IMA functions which encompassed these requirements and assumptions was created by the author. Then the IMA services were mapped to one or more of these functions to help provide the context for analysis. The functions include the standard requirements for processing and data management, along with a couple of extra functions relating to specific issues with IMA. The functions are as follows:

1. Provision of secure and timely data flow to and from applications and Input/output (IO) devices – Data flow in APEX is undertaken via communications channels, and these must be secure so that only applications or IO devices with specific permission can use them. The channels will have performance requirements for timely arrival and assurance of data integrity (since safety critical applications with real-time requirements will use them).

2. Controlled access to processing facilities - An application also requires access to a processing resource. The access must be managed so that timing deadlines can be met for all applications. This will be achieved partly by an application’s use of basic APEX scheduling services but also by the use of configuration management and time partitioning to guarantee an application full access to the processor within a time slot.

3. Provision of secure data storage and memory management - The third required resource is memory. A partition’s memory storage must be protected from corruption both from hardware failure and from external partitions. The latter is particularly important if memory space is to be shared between partitions with code of different DALs.

4. Provision of consistent execution state – this function concerns data consistency throughout the IMA network and was added since it was assumed there would be multiple applications on the network, and many of these applications would be replicated to form the typical multi-lane and voter architectures used in safety critical systems. Application data will be loaded into IMA modules from a data store, either for normal initialisation or following an upgrade or alteration. This data must be consistent across modules both in terms of version e.g. after an incremental change, and also in terms of correct numbers of partitions e.g. including backup lanes. There may also be additional flight data loaded at initialisation which should also be consistent and correct.

5. Provision of health monitoring and failure management - The IMA platform provides a number of health management (HM) services. These are used both to detect problems in the system and also to report them. Some of these are used only by APEX, for example to detect a hardware failure (there should be no hardware dependencies in the application software). Other HM services are used by an application, for example to report a failure to the IMA OS.

6. General provision of computing capability - This function was included so that coarse-grained failures within the IMA platform could be considered, such as a power loss and thus loss of one or more modules.

The individual IMA services (including API calls and services such as scheduling) are merged into groups in the ARINC 653 specification. Following the creation of the functions, each of the service groups was considered in turn and judged as to whether it supported each function or not. The results of this mapping of service groupings to functions are shown in Table 5, and a diagram showing the groups and API call mappings is shown in Figure 19.

Table 5 Mapping of IMA Services to IMA Functions

| No. | Service Group | Description | IMA Fns. |
|---|---|---|---|
| 1 | Data loading | This service allows data to be uploaded and downloaded on the aircraft, and to be loaded into a partition for normal execution and for restart after a failure. | 4,5,6 |
| 2 | Timing watchdog | The timing watchdog allows processes to have real time deadlines and can detect failure to meet these deadlines. | 1,2,5 |
| 3 | Intra-partition | This group of services allows data to be … same partition. | 1 |
| 4 | Inter-partition communications | These services allow data to be exchanged between two or more partitions and external IO devices. | 1,5 |
| 5 | Scheduling | Scheduling services allow priority-based access for processes within partitions to demand processor time. Partitions themselves are scheduled in a cyclic manner. | 1,2,5 |
| 6 | Processing | This is the provision of data/instruction processing from the central processing unit. | All |
| 7 | Module BIT/HM | This service provides Health Management (HM), Built In Test (BIT) and data logging when a fault is detected. This service includes both error detection and recovery. | All |
| 8 | Initialisation | For complete system initialisation data is loaded into partitions using the layout in the current system configuration. This service is also used for error recovery. | All |
| 9 | Close-down | This service can close down both single application partitions and individual modules (e.g. following a failure). | All |
| 10 | Partitioning | This service describes the mechanisms used to allocate resources to partitions, and police resource usage. This is both for spatial partitioning (e.g. access to communications) and for data partitioning (to prevent illegal access to memory space). | 1,3,4,5 |
| 11 | Configuration management | The configuration of the IMA system describes the arrangement of applications and IMA support throughout the IMA modules and network that ensures each application will meet its certification requirements. | All |
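The mapping in Table 5 can also be held as a simple lookup structure, which makes it easy to list the service groups that must be considered when analysing a given function. The Python sketch below transcribes the table as extracted here; the names `SERVICE_TO_FUNCTIONS` and `services_for` are illustrative, and expanding "All" to functions 1 to 6 is an assumption based on the six functions listed earlier.

```python
ALL_FUNCTIONS = frozenset(range(1, 7))  # assumption: "All" means IMA functions 1..6

# Service group -> IMA functions, transcribed from Table 5.
# Group names are copied as they appear in the table.
SERVICE_TO_FUNCTIONS = {
    "Data loading": frozenset({4, 5, 6}),
    "Timing watchdog": frozenset({1, 2, 5}),
    "Intra-partition": frozenset({1}),
    "Inter-partition communications": frozenset({1, 5}),
    "Scheduling": frozenset({1, 2, 5}),
    "Processing": ALL_FUNCTIONS,
    "Module BIT/HM": ALL_FUNCTIONS,
    "Initialisation": ALL_FUNCTIONS,
    "Close-down": ALL_FUNCTIONS,
    "Partitioning": frozenset({1, 3, 4, 5}),
    "Configuration management": ALL_FUNCTIONS,
}


def services_for(function_no: int) -> list:
    """Return the service groups that support the given IMA function, in table order."""
    return [name for name, fns in SERVICE_TO_FUNCTIONS.items() if function_no in fns]


# Example: the service groups to examine when analysing function 3
# (secure data storage and memory management).
print(services_for(3))
```

Inverting the table in this way mirrors how the analysis proceeds: the functions are the unit of analysis, and each service group mapped to a function is a candidate source of internal OS failure causes.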

Figure 19 IMA service group mappings to IMA functions

Once the mappings had been finalized, failure analysis was attempted again. This time each of the functions was analysed individually, again using the SHARD guidewords. A table was constructed for each function, with the columns shown in Table 8. The following method was used for the analysis of each function:

1. For each of the guidewords list possible deviations from the function intent

2. For each of the deviations identify possible causes. Note that it should be assumed that on each module there will be a mixture of high integrity and low integrity application software, the former being vulnerable to failures in the latter, which should be assumed to misbehave. This ensures the analysis fits within the IMA concepts of a distributed system. Causes need to be considered from the following sources:

a. Internal OS failures i.e. failures in the IMA services which meet that particular function. Each service was considered.

b. External OS failures e.g. hardware or misbehaving application software

c. Internal application failures e.g. application not calling a function or misusing the

3. For each of the causes identify or create a Derived Requirement (DR) which describes a detection, protection or mitigation strategy. If multiple strategy options exist then note them all. Note that no assumptions should be made about the suitability of the strategy or about the credibility of the failure.

4. Finally, consolidate repeated DRs into a single set, using unique identifiers (IDs) for each.
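The four steps above can be sketched as a small data model. This is an illustration only: the `Row` fields, the `DRn` identifier format and the example rows are hypothetical and are not the contents of Table 8. The guideword list is the set usually associated with SHARD and is assumed here, since Table 3 is not reproduced in this section.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Guidewords usually associated with SHARD (assumed; see Table 3 for the set used).
GUIDEWORDS = ("Omission", "Commission", "Early", "Late", "Value")


@dataclass(frozen=True)
class Row:
    function_no: int   # IMA function under analysis
    guideword: str     # step 1: prompt for the deviation
    deviation: str     # step 1: deviation from the function intent
    cause: str         # step 2: possible cause
    source: str        # step 2: "internal OS", "external OS" or "internal application"
    requirement: str   # step 3: derived requirement text


def consolidate(rows: List[Row]) -> Tuple[Dict[str, str], List[Tuple[Row, str]]]:
    """Step 4: give each distinct requirement text one ID and tag every row with it."""
    ids: Dict[str, str] = {}
    tagged = []
    for row in rows:
        if row.guideword not in GUIDEWORDS:
            raise ValueError(f"unknown guideword: {row.guideword}")
        key = row.requirement.strip().lower()  # crude match on normalised text
        if key not in ids:
            ids[key] = f"DR{len(ids) + 1}"
        tagged.append((row, ids[key]))
    return ids, tagged


# Hypothetical rows for function 1 (secure and timely data flow).
ROWS = [
    Row(1, "Omission", "Message not delivered", "Sender not scheduled",
        "internal OS", "Time stamp all messages"),
    Row(1, "Late", "Message delivered late", "Clock drift between modules",
        "external OS", "Synchronise time across modules"),
    Row(1, "Commission", "Message repeated", "Babbling application",
        "internal application", "Time stamp all messages"),
]
ids, tagged = consolidate(ROWS)
print([dr for _, dr in tagged])
```

Matching on normalised text is the simplest possible consolidation rule; in practice, deciding that two derived requirements are the same is a judgement made by the analyst.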

4.2.1.4. Results and Discussion

Some of the results for the analysis of function 1 (Provision of secure and timely data flow to and from applications and Input/output devices) are shown in Table 8 (on Page 127). In total, 29 separate derived requirements for the IMA services were found. For each DR it was noted whether the current specification supported the derived requirement, or whether partial or limited support was available. Many of the requirements were supported by the APEX specification, such as the need for a timing watchdog and for spatial partitioning.

One of the interesting points about those derived requirements which were already supported was that the mitigation of related failures depended on the application using the OS in a certain way. For example, the use of time stamps on data communications helped mitigate a number of failures, such as stale data from an omission, and babbling applications (where the time stamp would be repeated). It was up to the application developers to use the time stamps as they saw fit. Another example was the suggestion that applications send data multiple times in order to check for corruption. Again, it was up to the application developers to use APEX in a certain way to detect a failure, if it was of significance to them. In other words, there was a contract-based, or rely/guarantee, relationship between APEX and the applications in dealing with failures.
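The application side of such a rely/guarantee arrangement might look like the following Python sketch. It assumes the platform only guarantees that a sender's time stamp travels with each message, while the receiving application decides what counts as too old. `Message`, `classify` and the `max_age_ms` threshold are hypothetical names and values, not part of APEX.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Message:
    timestamp_ms: int  # set by the sender when the data was produced
    payload: bytes


def classify(msg: Message, previous: Optional[Message],
             now_ms: int, max_age_ms: int) -> str:
    """Classify a newly received message using only its time stamp."""
    if now_ms - msg.timestamp_ms > max_age_ms:
        return "stale"      # no sufficiently recent data, e.g. after an omission
    if previous is not None and msg.timestamp_ms == previous.timestamp_ms:
        return "repeated"   # same stamp received again, e.g. a babbling sender
    return "fresh"


first = Message(timestamp_ms=1000, payload=b"a")
print(classify(first, None, now_ms=1010, max_age_ms=50))                      # fresh
print(classify(Message(1000, b"a"), first, now_ms=1020, max_age_ms=50))       # repeated
print(classify(Message(1000, b"a"), None, now_ms=1100, max_age_ms=50))        # stale
```

The check lives entirely in the application: if the developers choose not to inspect the time stamp, neither failure is detected, which is the dependency the derived requirements expose.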

Two significant omissions in the original specification were noted. Firstly, there was no requirement for a global time synchronisation in APEX. This would be essential to avoid many of the timing related failures, for example late arrival of data to a partition, which could be caused by clock drift between the modules. The minutes of the update committee from April 2005 [98] discuss the addition of a service group to APEX which could be used to provide global synchronisation,

In document Safety Analysis of Computer Resource Management Software (Page 97-109)