Entity Error Modes - Accurate Failure Detection: Rule-Based Logging

4.2 Accurate Failure Detection: Rule-Based Logging

4.2.2 Entity Error Modes

Each entity provides a set of services invoked by external software items, such as other entities or Off-The-Shelf components. Once invoked, the service initiates a variety of actions (e.g., local elaborations or concurrent tasks within the entity) and, in many cases, it causes the control flow to be transferred outside the entity when an external component in invoked, i.e., interaction. For the sake of clarity, Figure 4.7 provides a general service view (e.g., function call or method invocation). The service accepts zero or more input parameters and encapsulates a sequence of instructions to process the input. The computation can exit

1 timestamp message entity 2

3 0 7 / 1 4 / 1 1 1 4 : 4 1 : 0 5 SUP [ E8 ] //startup

4 0 7 / 1 4 / 1 1 1 4 : 4 1 : 1 5 SUP [ E7 ]

5 . . .

6 0 7 / 1 4 / 1 1 1 6 : 3 2 : 4 0 SER : s e r A [ E2 ] //service error

7 0 7 / 1 4 / 1 1 1 6 : 3 2 : 5 1 IER : i n t X [ E7 ] //interaction error

8 0 7 / 1 4 / 1 1 1 6 : 3 2 : 5 1 IER : i n t Y [ E8 ]

9 . . .

10 0 7 / 1 4 / 1 1 1 8 : 1 2 : 1 0 CMP : s e r B [ E4 ] //service complaint

11 . . .

12 0 7 / 1 4 / 1 1 1 9 : 2 4 : 3 1 CER [ E2 ] //crash error

13 0 7 / 1 4 / 1 1 1 9 : 2 8 : 4 6 IER : i n t Z [ E1 ]

14 . . .

15 0 7 / 1 4 / 1 1 2 2 : 1 5 : 3 5 SDW [ E3 ] //shutdown

Figure 4.8: Instance of rule-based event log.

either in a clean or in a dirty way. In the former case, the output is returned to the caller; in the latter, the service might return an error flag. It is worth noting that when a service exit dirtily it can be reasonably stated that something went wrong during the service execution; however, a service might return a bad output via a clean exit point.

The rule-based log, differently from current logging approaches, reports error notifications according to a precise error model. Figure 4.8 shows an instance of such log, where each entry in the log contains a timestamp, a message, and the entity originating the error. In the context of this work, error modes are defined at the entity-interface level, because the aim is to log only the errors that reach the border of the entity and, thus, able to propagate to the system interface until they cause a failure (e.g., reported failures). The error modes, described in the following, have been established based on a widely accepted and reference taxonomy in the dependability area, proposed in [2]:

• Service Error (SER). Service errors are the ones preventing an invoked service to reach any exit point, either clean or dirty. For example, timing errors belong to this

category: experimental results described in Chapter 3 demonstrated that information about their occurrence is often missed by traditional logs. A SER entry, in the form “<timestamp> SER:serA [Ei]", is written in the event log when a service error is

detected for the service named serA provided by Ei. Figure 4.8 (line 6) reports an

instance of this entry.

• Service Complaint (CMP). A complaint error notifies that a service has terminated via a dirty exit point. A CMP entry, in the form “<timestamp> CMP:serB [Ei]", is

written in the log when a complaint error is observed for the service serB provided by E_i . Figure 4.8 (line 10) reports an instance of this entry. CMPs enable the backward compatibility with the traditional logging mechanism, because log events are commonly collected when the execution reaches a dirty exit point. Furthermore, CMPs help at detecting if a bad value is the delivered to the caller of the service: bad output values are usually returned via dirty exit points.

• Interaction Error (IER). The error notifies that an interaction started by an entity does not terminate, e.g., the invoked software item (entity or OTS component, such as a library or OS support) does not return the control to the entity. A IER entry, in the form “<timestamp> IER:intX [Ei]", is written in the log when a failed interaction

started by E_i, i.e., intX, is detected. Figure 4.8 (lines 7, 8, 13) reports instances of this entry type. Interaction errors allow figuring out whether a problem is local or external to the entity, thus provide more contextual information about the originating location of the error. Furthermore, this error type allows monitoring the status of

interactions with system components that do not belong to the representation model, thus increasing the probability to detect failures.

• Crash Error (CER). Services are not the only mechanism to trigger the computation within an entity. As discussed, an entity might execute concurrent tasks, such as internal threads or the main() loop of the program, independently from the invocation of any service. A crash error (Figure 4.8, line 12) denotes the unexpected stop of the entity (e.g., the OS process encapsulating the entity crashes) and, given an entity Ei,

it is notified via a “<timestamp> CER [Ei]" entry in the event log.

Again, it must be noted that errors are defined at the entity-interface level to make the event log suitable to conduct a failure analysis; however, whenever other objectives are pursued, e.g., debugging, further error modes can be added to the model. In order to allow the precise estimation of some dependability figures, such as availability, start-up, and shutdown entries are introduced in the event log. Let Ei be an entity of the system.

The start-up entry, in the form “<timestamp> SUP [Ei]", e.g., Figure 4.8 (lines 1,2), is

produced when Ei starts to run. Similarly, the shutdown entry, in the form “<timestamp>

SDW [Ei]", e.g., Figure 4.8 (lines 15), is logged when the execution of Ei ends. SUP and

SDW allow computing up- and down-time intervals at entity-grain level, and discriminating clean from dirty reboots, as proposed by studies on operating systems [7, 96] (e.g., two consecutive SUP can be assumed as evidence of a dirty reboot). This approach allow estimate dependability attributes both at component and system level.

The described entries allow establishing the occurrence of anomalous events and discriminating services from interaction errors: this makes it possible to achieve strong insights into the error propagation traces in case of failures. Propagation traces augment the semantic of the logging mechanism with determination capabilities (i.e., the ability to detect prob- lems and isolating their root causes [97, 98]) and allows monitoring the interactions towards the software that do not belong to representation. Overall this information significantly enhances log-based failure analysis.

In document On the use of event logs for the analysis of system failures (Page 106-110)