Chapter 4. Framework for the Failure Analysis of an OS/CRMS
4.5. Conclusions
This chapter described a number of case studies in which an OS/CRMS was analysed. The case studies were undertaken to try to identify potential OS/CRMS failures even though each OS/CRMS was analysed as an independent component. A technique, devised by this author, for mapping OS/CRMS services to six different function categories was presented. This helped identify the different scenarios in which each service may be used. It was also found that the framework made it possible to identify potential omissions in the functionality required by an application. The framework was used in all but one of the case studies (namely the IMA MSL analysis in section 4.2.2). That case study differed from the others in that specific application architectural and contextual information was used, so information was already available on the range of scenarios the MSL services would be supporting.
For three of the case studies (namely the analyses of APEX, the HUD MSL and the Globus toolkit) it was possible to predict potential failures in the provision of key computer resources to applications running on that OS or CRMS. These failures belong within the categories defined in working definition 1 (section 3.1.3) and working definition 3 (section 3.1.6). It was established in chapter 3 that a failure to provide an application with the resources it requires could lead to hazardous states at the system level. Thus, it can be argued that the analyses were successful in identifying potentially hazardous OS/CRMS failures.
The failures were identified using an adapted version of FMEA in conjunction with guidewords, which prompted the analyser(s) to identify a range of failures and so helped ensure broad coverage. Derived requirements (DRs) were produced which described how the potential failures could be prevented, detected or mitigated.
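As a minimal sketch of how one row of such a guideword-driven analysis might be recorded, the following assumes a simple record type. The names `Guideword`, `FailureEntry` and `unmitigated` are illustrative only and are not part of the method described in this thesis; only the two guidewords that appear in Tables 8 and 9 are listed, and the full guideword set is not assumed.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class Guideword(Enum):
    # Only the guidewords that appear in Tables 8 and 9 (assumption: others may exist).
    OMISSION = "Omission"
    COMMISSION = "Commission"


@dataclass
class FailureEntry:
    """One row of an adapted FMEA table (hypothetical structure)."""
    function: str                      # e.g. "Communications" or "Scheduling"
    guideword: Guideword
    deviation: str
    cause: str
    measures: List[str] = field(default_factory=list)            # detection/protection/mitigation
    derived_requirements: List[int] = field(default_factory=list)  # DR numbers, if any


def unmitigated(entries):
    """Return the entries for which no detection/protection/mitigation is recorded."""
    return [e for e in entries if not e.measures]


entries = [
    FailureEntry(
        function="Communications",
        guideword=Guideword.OMISSION,
        deviation="Data is not sent to destination from source",
        cause="IMA routes data to incorrect destination",
        measures=["Check source/destination tags against the configuration table"],
    ),
    # Measures deliberately left empty to illustrate the gap check;
    # Table 9 does record measures for this failure.
    FailureEntry(
        function="Scheduling",
        guideword=Guideword.OMISSION,
        deviation="Service instance not run when expected",
        cause="Instance has terminated",
    ),
]

gaps = unmitigated(entries)
```

A record of this kind makes it straightforward to list the failures that still lack a derived requirement or a recorded measure, which is one way the coverage of an analysis could be reviewed.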
Having undertaken the different case studies, it was possible to derive a set of principles and caveats relating to the processes used and the results gained.
The case studies demonstrated that it was possible to meet the first two requirements (independence and coverage) described in the previous chapter. However, it was determined that further research was needed into precisely specifying how the DRs could be met in order to support the integration requirement. In addition, further research examining how to assess the relevance of the results was recommended. The next chapter examines these issues.
The method used to perform an OS/CRMS failure analysis is described in section 7.2 as part of the unified process.
Table 8 Example results from the APEX analysis for the Communications function
| Guide word | Deviation | Cause | Detection/Protection/Mitigation |
| --- | --- | --- | --- |
| Omission | Data is not sent to destination from source | IMA routes data to incorrect destination | Use source/destination tags on data and check their validity against the configuration table. If this deviation occurs it will also lead to a commission failure (see commission analysis). Receiving application can detect data is missing by comparing communication time stamps. Application HM system may report error. |
| | | IMA fails to send data accepted from application to destination | Receiving application can detect data is missing by comparing communication time stamps. Application HM system may report error. |
| | | Network infrastructure error | Network specific fault. Receiving application can detect data is missing by comparing communication time stamps. Application HM system may report error. |
| | Data is not sent to source | Source partition was not initialised within IMA system | Detected on system initialisation by checking configuration Look Up Table – see analysis of function 4. |
| | | Correct application source partition is closed down due to module/cabinet close down | Backup system on different module/cabinet provides functionality of source partition. If whole application is missing (due to systematic error) backup system external to IMA may need to be used. |
| | | Correct application source partition is closed down, due to internal partition error | Backup system (e.g. a different partition) provides functionality of source partition. If whole application is missing (due to systematic error) backup system external to IMA may need to be used. |
| | | IMA module does not schedule source partition | Timing watchdog detects error (see analysis of function 2). Backup system (e.g. on different module) provides functionality of source partition. If whole application is missing (due to systematic error) backup system external to IMA may need to be used. |
| | | Data is not received from source input device | Receiving application can detect data is missing by comparing communication time stamps. Application HM system may report error and request action. |
| | | Internal source application error | Receiving application can detect data is missing by comparing communication time stamps. Application HM system may report error and request action. |
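Several of the rows above rely on the receiving application comparing communication time stamps. A minimal sketch of such a check follows, assuming that messages are expected at a fixed period and that time stamps are held in arrival order. The function name `find_missing`, the period and the tolerance are hypothetical and are not taken from APEX.

```python
def find_missing(timestamps, period, tolerance):
    """Return (previous, current) pairs whose gap suggests omitted messages.

    Assumes messages are expected every `period` time units and that
    `timestamps` is in arrival order; both are assumptions of this sketch.
    """
    gaps = []
    for prev, cur in zip(timestamps, timestamps[1:]):
        # A gap larger than one period (plus jitter tolerance) means
        # at least one expected message did not arrive in between.
        if cur - prev > period + tolerance:
            gaps.append((prev, cur))
    return gaps


# Messages expected every 100 ms; the one due at t=300 never arrives.
received = [0, 100, 200, 400, 500]
suspect = find_missing(received, period=100, tolerance=10)
```

A comparison of this kind only reveals an omission once a later message has arrived. Detecting a complete loss of communication would additionally need a timeout on the receiving side.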
Table 9 Example analysis from Globus toolkit case study for Scheduling function
| Guideword | Failure | Cause | Detection/protection/mitigation | DR Number |
| --- | --- | --- | --- | --- |
| Omission | Service instance not run when expected | Factory has failed to create instance. Hosting environment doesn't create instance. Incorrect reference provided to run instance (causes commission). Instance has been removed from a service group. Instance has terminated. | Grid Service Handler (GSH) must correctly provide Grid Service Record (GSR). Notification of correct creation or client makes check after setup using own function. Deletion and termination notifications should be sent to interested clients. | 3, 9, 10 |
| Commission | Service instance run more than expected | Factory has created too many copies. GSR failure means incorrect instance run. Hosting environment has too many instances. Instance not terminated when expected. Multiple copies in a group. | GSH must correctly provide GSR. Creation notifications should be sent to clients if required. Each GSR must be unique. Client can compare GSR after multiple fetches. Client can query instance about state. | 3, 11, 12, 20 |