Case Study 2: TAO Open DDS - Correlating Failure Data in the Event Log

5.1.2 Case Study 2: TAO Open DDS

Experiments conducted on TAO Open DDS confirm the findings observed for Apache Web Server. TAO Open DDS has been instrumented according to the logging rules. Instrumentation is based on the representation shown in Figure 4.5, which focuses on the architectural components of the DDS implementing the publisher, subscriber, and transport layer. Table 5.3 (column 1) shows the entities composing the representation model and, for each entity, the number of services, interactions, and complaints. Unlike Apache Web Server, entities in this case study consist of sets of classes: entities are logical units that represent pieces of the system irrespective of implementation details, such as programming language or paradigm. The rightmost column of Table 5.3 reports the breakdown of the 4,123 fault injection experiments by entity and for the classes that do not belong to the representation. This is an alternate view of the same set of experiments reported in Table 3.4. Again, rule-based logging involves only a fraction of the source code (around 20 out of 80 source files); nevertheless, faults have been injected in all the code.

Figure 5.4: TAO Open DDS: testbed and Logbus infrastructure.

The testbed shown in Figure 5.4 has been deployed to perform the campaign. Again, it includes the test DDS application composed of a publisher (PUB) and a subscriber (SUB) process, deployed on node 1 and node 2, respectively. Furthermore, the testbed integrates the components of the Logbus infrastructure. Both processes of the DDS-based application produce rule-based events that are centralized at node 1 of the testbed by means of the Logbus. Events are processed on-the-fly by the on-agent tool, which implements the timeout-based error detection. The error entries produced by on-agent are stored in a single rule-based event log for the entire DDS application. For each experiment the Test Manager, which runs on node 2, (1) initializes the test application, (2) starts the publisher process, i.e., the workload generator, and (3) once the workload terminates, stops the testbed components and collects the experiment data: the rule-based event log and the outcome.
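The timeout-based detection mentioned above can be illustrated with a minimal sketch. It assumes that each rule-based event is a (timestamp, kind, entity) tuple whose kind is either "start" or "end", and that a single fixed timeout applies to every entity. The event format, the function name detect_timeouts, and the timeout value are hypothetical; they are not taken from the on-agent or Logbus implementation.

```python
from typing import Iterable, List, Tuple

# Hypothetical event format: (timestamp in seconds, "start" | "end", entity name)
Event = Tuple[float, str, str]


def detect_timeouts(events: Iterable[Event], timeout: float, now: float) -> List[str]:
    """Return one error entry for each start whose end did not arrive in time."""
    pending = {}   # entity -> timestamp of the start still waiting for its end
    errors = []
    for ts, kind, entity in sorted(events):
        if kind == "start":
            pending[entity] = ts
        elif kind == "end" and entity in pending:
            started = pending.pop(entity)
            # Assumption: an end that arrives too late is also reported.
            if ts - started > timeout:
                errors.append(f"{entity}: end arrived {ts - started:.1f}s after start")
    # Starts never followed by an end (e.g., the process halted) time out as well.
    for entity, started in pending.items():
        if now - started > timeout:
            errors.append(f"{entity}: no end within {timeout:.1f}s of start")
    return errors


events = [
    (0.0, "start", "publisher.write"),
    (0.2, "end", "publisher.write"),
    (1.0, "start", "transport.send"),   # no matching end: simulated halt
]
error_entries = detect_timeouts(events, timeout=2.0, now=10.0)
print(error_entries)
```

In this sketch only the unmatched start produces an error entry, which mirrors the idea that a failure is notified by a small number of entries rather than by a large volume of log lines.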

Coverage and Verbosity

During the campaign, 1,023 out of the 4,123 fault injection experiments caused a failure outcome (i.e., 356 halt, 597 silent, and 70 content failures, as discussed in Section 3.4.1). The rule-based approach logs 911 out of 1,023 failures: the coverage of the logging mechanism is thus around 89%. The coverage of the traditional logging mechanism observed for TAO Open DDS was significantly smaller. As a matter of fact, the experiments discussed in Section 3.4.1 revealed that the traditional logging approach logs around 33.8% and 29% of total failures at the publisher and subscriber side, respectively. As a result, the rule-based approach increases the coverage at the PUB and SUB sides of the DDS application by 55.2 and 60 percentage points, respectively. Furthermore, the coverage of the rule-based mechanism is higher than the coverage of the DDS as a whole (i.e., the failures logged by either the publisher or the subscriber), which was around 48.1% (shown in Figure 3.8c): even with log centralization support, the traditional approach would not be able to improve over the rule-based one.
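The coverage figures above follow from simple arithmetic, which the sketch below reproduces from the counts and percentages quoted in the text. The variable names are illustrative only, and the gains are expressed as differences between percentages (percentage points), computed against the rounded 89% figure as the text does.

```python
total_failures = 1023       # failures observed in the campaign
logged_rule_based = 911     # failures reported in the rule-based log

# Coverage of the rule-based mechanism, as a percentage.
rule_based = 100 * logged_rule_based / total_failures

# Coverage of the traditional mechanism (%), as reported for Section 3.4.1.
traditional = {"PUB": 33.8, "SUB": 29.0}

# Gain over the traditional mechanism, using the rounded rule-based coverage.
gains = {side: round(round(rule_based) - cov, 1) for side, cov in traditional.items()}

print(f"rule-based coverage: {rule_based:.1f}%")
print(gains)
```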

(a) Rule-based.

(b) Comparison (PUB). (c) Comparison (SUB).

Figure 5.5: Open DDS: coverage of the logging mechanisms (Tp(Ts) = Traditional at PUB(SUB); R=Rule-Based).

Figure 5.5a reports the breakdown of the coverage by failure mode. Most halt and silent failures, i.e., 93.8% and 86.9%, respectively, are logged with the proposed rule-based mechanism: again, the introduction of start/end pairs in the source code of the program increases the chance of detecting timing errors. The coverage of content failures is 82.9%: in some cases it is not possible to detect the failures that corrupt the messages delivered to the subscriber. Again, this type of failure might be detected only by introducing very application-specific checks, which cannot be generalized in terms of platform-independent rules. These results confirm the trend observed for the Apache Web Server.
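The per-mode percentages can be cross-checked against the number of failures observed for each mode and the number of logged failures listed in Table 5.4. The following sketch only restates those counts; the dictionary names are illustrative.

```python
# Failures observed per failure mode during the campaign.
observed = {"halt": 356, "silent": 597, "content": 70}
# Failures logged by the rule-based mechanism (counts from Table 5.4).
logged = {"halt": 334, "silent": 519, "content": 58}

# Coverage per failure mode, as a percentage rounded to one decimal.
coverage = {mode: round(100 * logged[mode] / observed[mode], 1) for mode in observed}

print(coverage)
print(sum(logged.values()), "of", sum(observed.values()), "failures logged")
```

The per-mode counts add up to the 911 logged failures and the 1,023 observed failures used for the overall coverage.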

Table 5.4: TAO Open DDS: verbosity of the rule-based logging mechanism.

failure type    logged failures    number of entries in the log: average ± std-dev; (min-MAX)
halt            334                3 ± 2; (1-8)
silent          519                1 ± 1; (1-12)
content         58                 1 ± 0.4; (1-2)
average                            1.6

The comparison between traditional (T) and rule-based (RB) logs is performed by dividing all the failures observed for TAO Open DDS into four classes, i.e., the failures (i) logged by both T and RB, (ii) not logged by T but logged by RB, (iii) logged by T but not logged by RB, and (iv) not logged by either mechanism. Figures 5.5b and 5.5c report the number of failures belonging to each of these classes, observed at the publisher and subscriber side of the DDS, respectively. For example, 243 halt, 308 silent, and 36 content failures can be logged at the publisher side only by means of the rule-based mechanism (Figure 5.5b, !Tp ∧ R). Only 22 experiments in total (Figure 5.5b, Tp ∧ !R) are logged exclusively by the traditional logging mechanism. As for the subscriber side, (i) 195 halt, 394 silent, and 46 content failures (Figure 5.5c, !Ts ∧ R) have been logged only by means of the rule-based approach, and (ii) only 21 experiments in total (Figure 5.5c, Ts ∧ !R) are logged exclusively by the traditional approach.
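The four-class partition can be sketched as follows. The experiment records are hypothetical and do not reproduce the campaign data; the ASCII labels "T&R", "!T&R", "T&!R", and "!T&!R" stand in for the notation used in Figure 5.5.

```python
from collections import Counter


def classify(logged_by_t: bool, logged_by_r: bool) -> str:
    """Map the two per-failure flags to one of the four comparison classes."""
    if logged_by_t and logged_by_r:
        return "T&R"
    if logged_by_r:
        return "!T&R"
    if logged_by_t:
        return "T&!R"
    return "!T&!R"


# Hypothetical failures: (failure mode, logged by traditional, logged by rule-based)
experiments = [
    ("halt", True, True),
    ("halt", False, True),
    ("silent", False, True),
    ("silent", True, False),
    ("content", False, False),
]

# Number of failures per (failure mode, class), as plotted in a comparison chart.
counts = Counter((mode, classify(t, r)) for mode, t, r in experiments)
for key, n in sorted(counts.items()):
    print(key, n)
```

Counting per (failure mode, class) pair yields the same kind of breakdown as the comparison figures, one bar group per class and one series per failure mode.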

Table 5.4 reports the average number of entries in the rule-based log notifying the occurrence of a failure. On average (last row of Table 5.4), the rule-based log notifies a failure with 1.6 entries. The verbosity of the traditional logging mechanism was significantly higher in TAO Open DDS (Section 3.4.1): a logged failure caused, on average, the generation of 496 entries. The rule-based log is thus around 310 times smaller (496 / 1.6 = 310).

(a) Publisher. (b) Subscriber

Figure 5.6: TAO Open DDS: recall/precision of the rule-based logging mechanisms (T=Traditional; R=Rule-Based).

Recall and Precision

As observed for Apache Web Server, recall conflicts with precision in the case of the traditional log. For example, the series lines (T), bytes (T), and words (T) (discussed in Section 3.4.2, and representing recall and precision for traditional logs) indicate that, in order to achieve high precision, it should be concluded that a failure has actually occurred only when strong evidence is observed in the log (Figure 5.6); this, in turn, penalizes recall.

The rule-based mechanism overcomes this limitation, as shown by the bytes/lines (R) series in Figure 5.6. Recall and precision are both around 0.9: these values have been observed when the classification threshold K is at its minimum, thus confirming the finding that even minor evidence in the event log is enough to conclude that a failure has occurred; nevertheless, this approach does not cause many false positives. Again, the result highlights that the rule-based logging mechanism is close to the perfect detector for failures due to software faults (Figure 5.6, (1,1) point).
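The interplay between the threshold K, recall, and precision can be illustrated with a minimal sketch. It assumes that an experiment is classified as failed when its log contains at least K entries; this decision rule, the function name recall_precision, and the sample runs are assumptions made for illustration and are not taken from the campaign.

```python
from typing import List, Tuple


def recall_precision(runs: List[Tuple[int, bool]], k: int) -> Tuple[float, float]:
    """runs holds (number of log entries, whether the experiment actually failed)."""
    tp = sum(1 for n, failed in runs if n >= k and failed)       # failures detected
    fp = sum(1 for n, failed in runs if n >= k and not failed)   # false alarms
    fn = sum(1 for n, failed in runs if n < k and failed)        # failures missed
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return recall, precision


# Hypothetical runs, not taken from the campaign.
runs = [(3, True), (1, True), (1, True), (0, True),
        (0, False), (0, False), (1, False), (0, False)]

for k in (1, 2, 3):
    r, p = recall_precision(runs, k)
    print(f"K={k}: recall={r:.2f} precision={p:.2f}")
```

In this toy data set, raising K trades recall for precision, which is the conflict described for the traditional log; a log whose entries appear almost exclusively when a failure occurs keeps both values high at the minimum K.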

Figure 5.7: Apache Web Server: performance impact.
