
Chapter 4. Framework for the Failure Analysis of an OS/CRMS

4.5. Conclusions

This chapter described a number of case studies in which OS/CRMS were analysed. The case studies were undertaken to determine whether potential OS/CRMS failures could be identified even though each OS/CRMS was analysed as an independent component. A technique, devised by this author, for mapping OS/CRMS services to six different function categories was presented. This helped identify the different scenarios in which each service may be used. It was also found that the framework made it possible to identify potential omissions in the functionality required by an application. The framework was used in all but one of the case studies (namely the IMA MSL analysis in section 4.2.2). That case study differed from the others in that specific architectural and contextual information about the application was used, so information was already available on the range of scenarios the MSL services would be supporting.

For three of the case studies (namely the analyses of APEX, the HUD MSL and the Globus toolkit) it was possible to predict potential failures in the provision of key computer resources to applications running on that OS or CRMS. These failures belong within the categories defined in working definition 1 (section 3.1.3) and working definition 3 (section 3.1.6). It was established in chapter 3 that a failure to provide an application with the resources it requires could lead to hazardous states at the system level. Thus, it can be argued that the analyses were successful in identifying potentially hazardous OS/CRMS failures.

The failures were identified using an adapted version of FMEA in conjunction with guidewords, which prompted the analyser(s) to consider a range of failures and so helped ensure broad coverage. Derived requirements (DRs) were then produced describing how the potential failures could be prevented, detected or mitigated.
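The guideword-driven worksheet described above can be illustrated with a minimal sketch. It assumes a simple record layout (the `FailureEntry` class and the `uncovered` helper are hypothetical names, not taken from the analyses) and uses only the two guidewords and two function categories that appear in the example tables of this chapter, not the full sets used in the case studies.

```python
from dataclasses import dataclass, field
from itertools import product
from typing import List

# Guidewords and function categories that appear in this chapter's example
# tables. The complete sets used in the case studies are not reproduced here.
GUIDEWORDS = ("Omission", "Commission")
FUNCTIONS = ("Communications", "Scheduling")


@dataclass
class FailureEntry:
    """One worksheet row (hypothetical layout for illustration only)."""
    function: str
    guideword: str
    deviation: str
    causes: List[str] = field(default_factory=list)
    mitigations: List[str] = field(default_factory=list)
    derived_requirements: List[int] = field(default_factory=list)


def prompts(functions=FUNCTIONS, guidewords=GUIDEWORDS):
    """Every (function, guideword) pair the analyser should consider."""
    return list(product(functions, guidewords))


def uncovered(entries, functions=FUNCTIONS, guidewords=GUIDEWORDS):
    """Pairs that have no worksheet entry yet, i.e. gaps in coverage."""
    seen = {(e.function, e.guideword) for e in entries}
    return [pair for pair in prompts(functions, guidewords) if pair not in seen]


worksheet = [
    FailureEntry(
        function="Scheduling",
        guideword="Omission",
        deviation="Service instance not run when expected",
        causes=["Factory has failed to create instance"],
        mitigations=["Notification of correct creation"],
        derived_requirements=[3, 9, 10],
    )
]

# Lists the function/guideword combinations still to be analysed.
print(uncovered(worksheet))
```

Enumerating every function and guideword pair up front is one way to make the coverage argument explicit: any pair without an entry is a visible gap rather than a silent omission.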

Having undertaken the different case studies, it was possible to derive a set of principles and caveats relating to the processes used and the results gained.

The case studies demonstrated that it was possible to meet the first two requirements (independence and coverage) described in the previous chapter. However, it was determined that further research was needed into precisely specifying how the DRs could be met in order to support the integration requirement. In addition, further research examining how to assess the relevance of the results was recommended. The next chapter examines these issues.

The method used to perform an OS/CRMS failure analysis is described in section 7.2 as part of the unified process.

Table 8 Example results from the APEX analysis for the Communications function

Columns: Guide word; Deviation; Cause; Detection/Protection/Mitigation

Guide word: Omission

Deviation: Data is not sent to destination from source

    Cause: IMA routes data to incorrect destination
    Detection/Protection/Mitigation: Use source/destination tags on data and check their validity against the configuration table. If this deviation occurs it will also lead to a commission failure (see commission analysis). Receiving application can detect data is missing by comparing communication time stamps. Application HM system may report error.

    Cause: IMA fails to send data accepted from application to destination
    Detection/Protection/Mitigation: Receiving application can detect data is missing by comparing communication time stamps. Application HM system may report error.

    Cause: Network infrastructure error
    Detection/Protection/Mitigation: Network specific fault. Receiving application can detect data is missing by comparing communication time stamps. Application HM system may report error.

Deviation: Data is not sent to source

    Cause: Source partition was not initialised within IMA system
    Detection/Protection/Mitigation: Detected on system initialisation by checking configuration Look Up Table (see analysis of function 4).

    Cause: Correct application source partition is closed down due to module/cabinet close down
    Detection/Protection/Mitigation: Backup system on different module/cabinet provides functionality of source partition. If whole application is missing (due to systematic error) backup system external to IMA may need to be used.

    Cause: Correct application source partition is closed down, due to internal partition error
    Detection/Protection/Mitigation: Backup system (e.g. a different partition) provides functionality of source partition. If whole application is missing (due to systematic error) backup system external to IMA may need to be used.

    Cause: IMA module does not schedule source partition
    Detection/Protection/Mitigation: Timing watchdog detects error (see analysis of function 2). Backup system (e.g. on different module) provides functionality of source partition. If whole application is missing (due to systematic error) backup system external to IMA may need to be used.

    Cause: Data is not received from source input device
    Detection/Protection/Mitigation: Receiving application can detect data is missing by comparing communication time stamps. Application HM system may report error and request action.

    Cause: Internal source application error
    Detection/Protection/Mitigation: Receiving application can detect data is missing by comparing communication time stamps. Application HM system may report error and request action.
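Two of the detection measures in Table 8, checking source/destination tags against the configuration table and comparing communication time stamps, can be sketched as follows. This is only an illustration under stated assumptions: the `Message` type, the `CONFIG_TABLE` contents and the assumption that messages arrive periodically with a known period and tolerance are all hypothetical and are not part of the APEX analysis.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    source: str
    destination: str
    timestamp: float  # send time in seconds (assumed unit)


# Hypothetical configuration table: the permitted (source, destination) routes.
CONFIG_TABLE = {("nav", "display"), ("sensor", "nav")}


def route_is_valid(msg, config=CONFIG_TABLE):
    """Check the message's source/destination tags against the configuration."""
    return (msg.source, msg.destination) in config


def missing_intervals(timestamps, period, tolerance):
    """Return (previous, current) time stamp pairs whose gap exceeds
    period + tolerance, suggesting that data was omitted in between."""
    gaps = []
    for prev, cur in zip(timestamps, timestamps[1:]):
        if cur - prev > period + tolerance:
            gaps.append((prev, cur))
    return gaps


received = [0.0, 1.0, 2.0, 4.0, 5.0]
print(missing_intervals(received, period=1.0, tolerance=0.25))
print(route_is_valid(Message("nav", "sensor", 0.0)))
```

In this sketch the receiving side needs no knowledge of why data went missing; it only detects the omission, which matches the role given to the receiving application in the table.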

Table 9 Example analysis from Globus toolkit case study for Scheduling function

Columns: Guideword; Failure; Cause; Detection/protection/mitigation; DR Number

Guideword: Omission
Failure: Service instance not run when expected
Cause:
    Factory has failed to create instance.
    Hosting environment doesn’t create instance.
    Incorrect reference provided to run instance (causes commission).
    Instance has been removed from a service group.
    Instance has terminated.
Detection/protection/mitigation:
    Grid Service Handle (GSH) must correctly provide Grid Service Reference (GSR).
    Notification of correct creation or client makes check after setup using own function.
    Deletion and termination notifications should be sent to interested clients.
DR Number: 3, 9, 10

Guideword: Commission
Failure: Service instance run more than expected
Cause:
    Factory has created too many copies.
    GSR failure means incorrect instance run.
    Hosting environment has too many instances.
    Instance not terminated when expected.
    Multiple copies in a group.
Detection/protection/mitigation:
    GSH must correctly provide GSR.
    Creation notifications should be sent to clients if required.
    Each GSR must be unique. Client can compare GSR after multiple fetches.
    Client can query instance about state.
DR Number: 3, 11, 12, 20
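The uniqueness check on GSRs listed in Table 9 can be illustrated with a short sketch. It treats references as opaque strings that a client has collected over several fetches. The function name and the sample identifiers are hypothetical, and no Globus toolkit API is used or implied.

```python
from collections import defaultdict


def duplicate_references(references):
    """Given a mapping of instance identifier to the reference fetched for it,
    return each reference held by more than one instance, with its holders."""
    owners = defaultdict(list)
    for instance_id, ref in references.items():
        owners[ref].append(instance_id)
    return {ref: ids for ref, ids in owners.items() if len(ids) > 1}


# Hypothetical references collected by a client over several fetches.
fetched = {"instance-a": "gsr-1", "instance-b": "gsr-2", "instance-c": "gsr-1"}
print(duplicate_references(fetched))
```

A non-empty result would indicate that two instances share a reference, which is one way the commission failure in the table could become visible to a client.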


In document Safety Analysis of Computer Resource Management Software (Page 126-131)